Overview
InfoAudio Cloud Guardian (ICG) is a cloud-hosted continuity layer for radio stations running InfoAudio on premises. It mirrors the current and upcoming days of programming — traffic log metadata, audio assets, and music licensing (PRO) logs — into a multi-tenant cloud environment. When the studio becomes unreachable for a configurable window, ICG automatically takes the station to air through a lightweight Web Player Lite, with audited operator intervention and controlled handback once the studio returns.
This document describes the end-to-end failover flow: detection of a studio outage, cloud takeover, operator workflows during contingency, return to normal operation, and reconciliation of what played off-premises.
Tenant State Machine
Each station (tenant) maintains a single operational state in the cloud. State transitions are driven by heartbeat signals from the on-premises Sync Agent and by human confirmation during recovery.
Default Timing Parameters
| Parameter | Default | Purpose |
|---|---|---|
| T_warning | 10 s | One heartbeat cycle missed |
| T_grace | 2 min | Network-instability tolerance before alerting master control |
| T_disaster | 5 min | Decision threshold: cloud takes air from this point on |
| T_recovery_confirm | 30 min | Operator window to confirm handback to studio |
| T_recovery_escalate | 1 h | Escalation to super-admin if no operator response |
| T_safe_startup | 10 s | Buffer delay before Web Player Lite starts on-air playback |
| flap_threshold_24h | 5 | Flap count per 24 h that triggers stability investigation |
All timings are configurable per tenant, allowing stations with less reliable connectivity to adjust tolerance without changing platform code.
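As an illustration of per-tenant configuration, the defaults above could be held in an immutable settings object with tenant-specific overrides. This is a minimal sketch only: the class name `TimingConfig`, the field names, the ordering check, and the tenant `rural_fm` are assumptions for the example, not part of the platform.

```python
from dataclasses import dataclass, replace
from datetime import timedelta


@dataclass(frozen=True)
class TimingConfig:
    """Platform defaults, copied from the Default Timing Parameters table."""
    t_warning: timedelta = timedelta(seconds=10)
    t_grace: timedelta = timedelta(minutes=2)
    t_disaster: timedelta = timedelta(minutes=5)
    t_recovery_confirm: timedelta = timedelta(minutes=30)
    t_recovery_escalate: timedelta = timedelta(hours=1)
    t_safe_startup: timedelta = timedelta(seconds=10)
    flap_threshold_24h: int = 5

    def __post_init__(self) -> None:
        # Assumed sanity rule: the detection thresholds must be increasing.
        if not (self.t_warning < self.t_grace < self.t_disaster):
            raise ValueError("expected t_warning < t_grace < t_disaster")


DEFAULTS = TimingConfig()

# Hypothetical tenant on an unreliable link: widen the tolerance only.
rural_fm = replace(
    DEFAULTS,
    t_grace=timedelta(minutes=5),
    t_disaster=timedelta(minutes=10),
)
```

Because `replace` re-runs the validation, an override that would place `t_grace` below `t_warning` is rejected when the tenant configuration is built rather than during an outage.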
Phase-by-Phase Flow
Normal Operation (HEALTHY)
- The on-premises Sync Agent sends a heartbeat every 10 seconds (timestamp, agent version, active schedule hash, pending upload queue).
- The cloud stores the most recent heartbeat per tenant and keeps a mirror of traffic logs and audio assets covering days D through D+N (default N = 3 days, configurable).
- The studio holds the on-air role; the cloud is a passive mirror.
- Schedule diffs are uploaded every 30 minutes (configurable).
Outage Detection
- A Heartbeat Monitor job scans every tenant every 5 seconds.
- If `now − last_heartbeat > 10 s` → state transitions to WARNING (silent, logged only).
- If silence exceeds `T_grace` → state transitions to DEGRADED:
  - Tenant dashboard shows an amber alert.
  - Tenant’s master control team is notified through the configured channels.
  - Web Player Lite does not take air yet; it pre-buffers upcoming elements.
- If silence exceeds `T_disaster` → state transitions to FAILOVER_ACTIVE.
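The detection thresholds can be read as a pure function of how long the tenant has been silent. The sketch below assumes the default timings and covers only the path up to FAILOVER_ACTIVE; the function name `classify_silence` and the enum are illustrative, and the RECOVERING path (which needs human confirmation) is deliberately left out.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class TenantState(Enum):
    HEALTHY = "HEALTHY"
    WARNING = "WARNING"
    DEGRADED = "DEGRADED"
    FAILOVER_ACTIVE = "FAILOVER_ACTIVE"


# Defaults from the timing table; per-tenant values would replace these.
T_WARNING = timedelta(seconds=10)
T_GRACE = timedelta(minutes=2)
T_DISASTER = timedelta(minutes=5)


def classify_silence(last_heartbeat: datetime, now: datetime) -> TenantState:
    """Map heartbeat silence to the pre-failover state it implies."""
    silence = now - last_heartbeat
    # Check the largest threshold first so each branch is unambiguous.
    if silence > T_DISASTER:
        return TenantState.FAILOVER_ACTIVE
    if silence > T_GRACE:
        return TenantState.DEGRADED
    if silence > T_WARNING:
        return TenantState.WARNING
    return TenantState.HEALTHY
```

A Heartbeat Monitor job that runs every 5 seconds would call a function like this per tenant and act only when the result differs from the stored state.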
Flap Handling
- Heartbeat resumes between WARNING and DEGRADED → silent return to HEALTHY.
- Heartbeat resumes between DEGRADED and FAILOVER_ACTIVE → return to HEALTHY with “near-miss” notification.
- If the flap counter exceeds `flap_threshold_24h`, a stability investigation alert is raised.
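One way to count flaps is a sliding 24-hour window over the timestamps at which a tenant returned to HEALTHY from WARNING or DEGRADED. The sketch below is an assumption about how the counter might work; the source only states the threshold, and `FlapCounter` is a hypothetical name.

```python
from collections import deque
from datetime import datetime, timedelta, timezone


class FlapCounter:
    """Sliding-window flap counter for a single tenant (illustrative)."""

    def __init__(self, threshold: int = 5, window: timedelta = timedelta(hours=24)):
        self.threshold = threshold
        self.window = window
        self._flaps = deque()  # flap timestamps, oldest first

    def record_flap(self, at: datetime) -> bool:
        """Record one flap; return True if a stability alert should be raised."""
        self._flaps.append(at)
        cutoff = at - self.window
        # Drop flaps that have aged out of the window.
        while self._flaps and self._flaps[0] <= cutoff:
            self._flaps.popleft()
        # "Exceeds" is read strictly: the alert fires on flap threshold + 1.
        return len(self._flaps) > self.threshold
```

With the default threshold of 5, the sixth flap inside any 24-hour span raises the alert, and the count decays on its own as older flaps leave the window.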
Cloud Takeover (FAILOVER_ACTIVE)
- Cloud records `failover_active_since = now`.
- Immediate notification is sent to:
  - All operators in the tenant’s master control team.
  - Informa super-admin (support).
  - Any remote announcer with an active Web Player Lite session.
- Web Player Lite takes air for every logged-in session of that tenant, honoring the cloud-stored programming log for the current clock position.
- To avoid overlap or echo with the studio, the player follows the conservative startup strategy: it waits up to `T_safe_startup` and begins playback at the next complete element boundary rather than mid-song.
- A complete audit log entry is written with context: who was logged in, exact clock time, last heartbeat received, and elements queued.
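The conservative startup strategy can be sketched as a small planning function. This is one possible reading of the rule, not a confirmed design: the player never joins an element already in progress, it picks the first element scheduled at or after the takeover time, and it starts that element at its boundary or after `T_safe_startup`, whichever comes first. `LogElement` and `plan_startup` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class LogElement:
    element_id: str
    scheduled_start: datetime


def plan_startup(log, takeover_at, t_safe_startup=timedelta(seconds=10)):
    """Return (element, start_at) for the first element the cloud plays.

    `log` must be sorted by scheduled_start. The element in progress at
    takeover time is skipped, so playback never begins mid-element.
    """
    for element in log:
        if element.scheduled_start >= takeover_at:
            # Wait for the boundary, but never longer than T_safe_startup.
            start_at = min(element.scheduled_start, takeover_at + t_safe_startup)
            return element, start_at
    raise LookupError("no upcoming element in the mirrored log")
```

Under this reading the listener hears at most `T_safe_startup` of silence, at the cost of the next element possibly starting earlier than its logged time.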
Contingency Operation
Role-based actions during FAILOVER_ACTIVE:
| Role | Allowed | Not Allowed |
|---|---|---|
| Remote announcer | Monitor playout; view upcoming breaks; pause only between elements; insert an approved imager/jingle from the catalog | Modify the log; download audio assets; go offline without handing off control |
| Master control operator (tenant) | Everything the announcer can do, plus: insert ad-hoc elements (song, imager, voice track) from the cloud catalog; reorder upcoming items in the current break; skip the current element | Create or edit new audio assets; modify logs for future days |
| Informa super-admin | Read-only access plus emergency intervention (stop, force handback) | Editorial content decisions (contractually reserved to the station) |
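The role matrix lends itself to a deny-by-default lookup. In the sketch below only `insert_element` and `skip_element` are action codes named in this document; every other action code, the `Role` enum, and `is_allowed` are placeholders chosen for illustration.

```python
from enum import Enum


class Role(Enum):
    ANNOUNCER = "remote_announcer"
    OPERATOR = "master_control_operator"
    SUPER_ADMIN = "super_admin"


ANNOUNCER_ACTIONS = frozenset({
    "monitor_playout",
    "view_upcoming_breaks",
    "pause_between_elements",
    "insert_approved_imager",
})

# Operators can do everything the announcer can, plus log-level edits
# limited to the current break.
OPERATOR_ACTIONS = ANNOUNCER_ACTIONS | {
    "insert_element",
    "reorder_current_break",
    "skip_element",
}

# Super-admin: read-only access plus emergency intervention only.
SUPER_ADMIN_ACTIONS = frozenset({
    "monitor_playout",
    "view_upcoming_breaks",
    "emergency_stop",
    "force_handback",
})

ALLOWED = {
    Role.ANNOUNCER: ANNOUNCER_ACTIONS,
    Role.OPERATOR: OPERATOR_ACTIONS,
    Role.SUPER_ADMIN: SUPER_ADMIN_ACTIONS,
}


def is_allowed(role: Role, action: str) -> bool:
    """Deny by default: an action is permitted only if explicitly listed."""
    return action in ALLOWED[role]
```

Listing permissions explicitly, rather than listing prohibitions, means that a newly introduced action is refused for every role until someone adds it to the matrix.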
Automatic Safeguards
- Every contingency action is audited with `context: FAILOVER_ACTIVE`.
- All actions require a valid MFA session (maximum 15 minutes, with step-up re-authentication for critical operations).
- The cloud refuses commands that would create dead air longer than 5 seconds; the player automatically falls back to the next catalog element.
Return to Studio (RECOVERING)
- The on-premises Sync Agent reconnects and sends a valid heartbeat.
- Cloud detects the return but does not automatically hand air back. It awaits human confirmation.
- Operators are notified: “Studio is back. Confirm handback?”
- Three options are presented:
  - Immediate handback: operator clicks “Return to studio”; Web Player Lite finishes the current element and releases the air at the next safe stop (end of element, post-spot, or next cue point).
  - Scheduled handback: pinned to the end of the break, top of the hour, or another safe marker.
  - Stay in failover: if the operator does not confirm within `T_recovery_confirm`, the cloud continues to broadcast, sends periodic reminders, and eventually escalates to super-admin (`T_recovery_escalate`).
- State transitions to RECOVERING with a scheduled `handoff_scheduled_for` timestamp.
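The waiting behaviour after the studio returns can be summarised as a decision over the time elapsed since the first valid heartbeat. The sketch assumes the default timings; the function name and the returned labels are hypothetical, and whether handback should ever become automatic remains an open question in this document.

```python
from datetime import timedelta

T_RECOVERY_CONFIRM = timedelta(minutes=30)
T_RECOVERY_ESCALATE = timedelta(hours=1)


def recovery_action(elapsed_since_studio_back: timedelta, confirmed: bool) -> str:
    """Decide what the cloud does while the studio is back but air is not.

    The cloud never hands air back by itself: without a human
    confirmation it keeps broadcasting in every branch.
    """
    if confirmed:
        return "schedule_handback"
    if elapsed_since_studio_back >= T_RECOVERY_ESCALATE:
        return "escalate_to_super_admin"
    if elapsed_since_studio_back >= T_RECOVERY_CONFIRM:
        return "stay_in_failover_and_remind"
    return "await_confirmation"
```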
Reconciliation
During RECOVERING, the cloud sends the Sync Agent a full account of what happened off-premises:
- As-played log for the failover window (elements, timestamps, operator interventions).
- Ad-hoc elements inserted by operators that were not part of the original log.
- Any changes made to upcoming logs during failover.
The Sync Agent then:
- Updates the local InfoAudio database with “played” flags for PRO log and compliance purposes.
- Surfaces any mismatches to the local operator for reconciliation.
- Confirms completion → cloud transitions back to HEALTHY.
If the local operator disputes the reconciliation, a dispute flag is raised; the super-admin mediates; the audit trail remains immutable.
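On the agent side, reconciliation amounts to comparing the originally scheduled log with the as-played log received from the cloud. A minimal sketch, assuming elements are identified by unique string IDs within the failover window (an assumption; the source does not define the log schema), with `reconcile` as a hypothetical helper:

```python
def reconcile(scheduled_ids, as_played_ids):
    """Split the failover window into three buckets for the local operator.

    mark_played: scheduled elements the cloud actually aired
    ad_hoc:      aired elements that were not in the original log
    not_played:  scheduled elements that never aired (mismatches to review)
    """
    scheduled = set(scheduled_ids)
    played = set(as_played_ids)
    return {
        "mark_played": [e for e in scheduled_ids if e in played],
        "ad_hoc": [e for e in as_played_ids if e not in scheduled],
        "not_played": [e for e in scheduled_ids if e not in played],
    }
```

The `mark_played` and `ad_hoc` buckets feed the PRO log and compliance flags, while `not_played` is what would be surfaced to the local operator.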
Notification Channels
Severity-driven delivery, configurable per tenant:
| Transition | Severity | Channels |
|---|---|---|
| HEALTHY → WARNING | Info | Log only |
| WARNING → DEGRADED | Warning | E-mail + webhook + web push |
| DEGRADED → FAILOVER_ACTIVE | Critical | All channels + SMS |
| FAILOVER_ACTIVE → RECOVERING | Warning | E-mail + webhook + web push |
| RECOVERING → HEALTHY | Info | E-mail + webhook |
| Flap detected (near-miss) | Warning | E-mail + webhook |
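The state-transition rows of the table map directly onto a routing dictionary with per-tenant overrides layered on top. The channel identifiers and the `route` helper below are illustrative; treating an empty channel tuple as "log only" and defaulting unlisted transitions to log-only are assumptions of this sketch. The flap row is an event rather than a transition and is left out.

```python
ALL_CHANNELS = ("email", "webhook", "web_push", "sms")

# (from_state, to_state) -> (severity, channels); () means log only.
ROUTES = {
    ("HEALTHY", "WARNING"): ("info", ()),
    ("WARNING", "DEGRADED"): ("warning", ("email", "webhook", "web_push")),
    ("DEGRADED", "FAILOVER_ACTIVE"): ("critical", ALL_CHANNELS),
    ("FAILOVER_ACTIVE", "RECOVERING"): ("warning", ("email", "webhook", "web_push")),
    ("RECOVERING", "HEALTHY"): ("info", ("email", "webhook")),
}


def route(from_state, to_state, overrides=None):
    """Return (severity, channels) for a transition.

    `overrides` is an optional per-tenant dict with the same shape as
    ROUTES; its entries take precedence over the platform defaults.
    """
    table = {**ROUTES, **(overrides or {})}
    return table.get((from_state, to_state), ("info", ()))
```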
Failure Modes and Mitigations
| Scenario | Behavior |
|---|---|
| Intermittent connectivity (short flap) | WARNING → HEALTHY without action. Logged against a station-stability metric. |
| Cloud unreachable from studio (agent running, API unreachable) | Agent enters offline upload mode. Highest-risk scenario: the studio is still on air, but the cloud reads the missing heartbeats as an outage and takes air too, producing a double broadcast. Candidate mitigations (choice still pending, see Open Design Questions): require human confirmation at DEGRADED before entering FAILOVER_ACTIVE, or add a side-channel ping (e.g., studio-side SMS). |
| Web Player Lite has no connectivity when failover fires | Cloud playback does not start. Operators are alerted that contingency has failed and manual escalation begins. |
| No operator responds to RECOVERING | Cloud continues on air; reminders every 10 minutes; super-admin escalation after one hour. |
| Reconciliation dispute | Dispute flag preserved; super-admin mediates; immutable audit log preserved; future logs resync from the next safe break. |
| Zombie agent (process alive, state frozen) | Heartbeat includes a state hash. If the hash does not change for N cycles and no uploads occur, the cloud treats the agent as zombie and escalates. |
| Cloud itself goes down during failover | Out of MVP scope — requires a pre-recorded local emergency routine as an absolute fallback. Tracked for Phase 4+. |
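The zombie-agent row describes a counter over consecutive heartbeats whose state hash is unchanged and which report no uploads. The sketch below is one way to implement that rule; `ZombieDetector`, its parameters, and the assumption that the state hash excludes the heartbeat timestamp are all hypothetical.

```python
class ZombieDetector:
    """Flags an agent whose heartbeats arrive but whose state never moves."""

    def __init__(self, max_unchanged_cycles: int):
        self.max_unchanged_cycles = max_unchanged_cycles
        self._last_hash = None
        self._unchanged = 0

    def observe(self, state_hash: str, uploads_since_last: int) -> bool:
        """Process one heartbeat; return True if the agent looks like a zombie."""
        if state_hash != self._last_hash or uploads_since_last > 0:
            # Any sign of progress resets the counter.
            self._last_hash = state_hash
            self._unchanged = 0
        else:
            self._unchanged += 1
        return self._unchanged >= self.max_unchanged_cycles
```

A station that is legitimately idle could produce the same pattern, so the cycle limit would need to be generous enough to avoid false escalations.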
Audit Trail
Every state transition writes an audit record with the following fields:
- `action`: `state_transition`
- `resource_type`: `tenant_state`
- `metadata`: `{ from, to, reason, last_heartbeat, triggered_by: "system" | user_id }`
Operator interventions use specific action codes (insert_element, skip_element, confirm_handback, etc.). Audit records are append-only, retained for at least one year, and candidates for cryptographic hash-chaining to guarantee tamper-evidence.
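Hash-chaining is listed here as a candidate, not a decision. To show what it would involve, the sketch below links each audit record to the hash of the previous one using SHA-256 over a canonical JSON encoding. The entry layout, the all-zero genesis value, and the function names are assumptions for illustration.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed starting value for an empty chain


def _digest(prev_hash: str, record: dict) -> str:
    # Canonical JSON (sorted keys, no extra whitespace) keeps the hash stable.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_record(chain: list, record: dict) -> None:
    """Append-only write: each entry commits to the one before it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({
        "record": record,
        "prev_hash": prev_hash,
        "hash": _digest(prev_hash, record),
    })


def verify_chain(chain: list) -> bool:
    """Recompute every link; any edited or removed entry breaks the chain."""
    prev_hash = GENESIS
    for entry in chain:
        if entry["prev_hash"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["record"]):
            return False
        prev_hash = entry["hash"]
    return True
```

A chain like this makes tampering detectable but does not prevent it; anchoring the latest hash somewhere outside the audit store would be a separate design decision.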
Open Design Questions
Decisions pending before Phase 1 (MVP)
- Is `T_grace` = 2 min the right default, or should it be more (5 min) or less (1 min) tolerant?
- Do we keep two intermediate states (WARNING and DEGRADED), or consolidate into a single pre-failover state?
- Confirm the conservative startup strategy (cloud starts at the next complete element, accepting a short gap of 5–10 s).
- When the cloud cannot confirm via Agent that the studio is truly off-air (only heartbeat missing), do we activate automatically or require human confirmation at DEGRADED to prevent double broadcast?
- Is the announcer / operator / super-admin role matrix complete, or should additional roles (editor, auditor) be added?
- Should ad-hoc insertions during failover require dual authorization (operator plus announcer or super-admin) for sensitive actions such as extra commercial spots?
- Is `T_recovery_confirm` = 30 min reasonable? Should handback fall back to automatic after 1 h of silence, or always require a human confirmation?
- Can handback be scheduled (next break, top of the hour), or only immediate / manual?
- Which notification channels are required for the MVP? Suggested minimum: e-mail + web push; webhook and SMS in Phase 2.
- Which role provisions notification channels and recipients per tenant — tenant-admin or super-admin only?
- For the “cloud unreachable but agent and studio alive” scenario, do we accept mandatory human confirmation at DEGRADED as the MVP mitigation, or invest in a side-channel (studio SMS, on-air player ping)?
- Pre-recorded local emergency routine remains out of MVP scope. Keep tracked for Phase 4+?
- Audit log retention: is one year the legal minimum in Brazil? What about international markets?
- Should the audit log be strictly immutable (append-only, no delete) with hash-chaining for integrity proof?