
InfoAudio Cloud Guardian — Failover Flow


Cloud-based operational continuity for radio stations — design specification for the studio-to-cloud failover and recovery workflow.

Design document · Draft · Revision 2026-04-14

Product Preview. InfoAudio Cloud Guardian is under active planning. This specification is shared publicly as an early design reference. Timings, roles, and parameters may evolve before the first pilot release (internal Informa pilot targeted for May 2026). Commercial availability is planned for Q3 2026.

Overview

InfoAudio Cloud Guardian (ICG) is a cloud-hosted continuity layer for radio stations running InfoAudio on premises. It mirrors the current and upcoming days of programming — traffic log metadata, audio assets, and music licensing (PRO) logs — into a multi-tenant cloud environment. When the studio becomes unreachable for a configurable window, ICG automatically takes the station to air through a lightweight Web Player Lite, with audited operator intervention and controlled handback once the studio returns.

This document describes the end-to-end failover flow: detection of a studio outage, cloud takeover, operator workflows during contingency, return to normal operation, and reconciliation of what played off-premises.

Tenant State Machine

Each station (tenant) maintains a single operational state in the cloud. State transitions are driven by heartbeat signals from the on-premises Sync Agent and by human confirmation during recovery.

┌──────────┐
│ HEALTHY  │◄─────────────────────────────────┐
└────┬─────┘                                  │
     │ heartbeat late > 10 s                  │
     ▼                                        │
┌──────────┐   heartbeat resumes              │
│ WARNING  │──────────────────────────────────┤
└────┬─────┘                                  │
     │ silence > T_grace (default 2 min)      │
     ▼                                        │
┌──────────┐   heartbeat resumes              │
│ DEGRADED │──────────────────────────────────┤
└────┬─────┘                                  │
     │ silence > T_disaster (default 5 min)   │
     ▼                                        │
┌─────────────────┐                           │
│ FAILOVER_ACTIVE │                           │
└────────┬────────┘                           │
         │ agent reconnects                   │
         │ + operator confirms handback       │
         ▼                                    │
  ┌────────────┐   reconciliation complete    │
  │ RECOVERING │──────────────────────────────┘
  └────────────┘

Default Timing Parameters

| Parameter | Default | Purpose |
| --- | --- | --- |
| T_warning | 10 s | One heartbeat cycle missed |
| T_grace | 2 min | Network-instability tolerance before alerting master control |
| T_disaster | 5 min | Decision threshold: cloud takes air from this point on |
| T_recovery_confirm | 30 min | Operator window to confirm handback to studio |
| T_recovery_escalate | 1 h | Escalation to super-admin if no operator response |
| T_safe_startup | 10 s | Buffer delay before Web Player Lite starts on-air reproduction |
| flap_threshold_24h | 5 | Flap count per 24 h that triggers stability investigation |

All timings are configurable per tenant, allowing stations with less reliable connectivity to adjust tolerance without changing platform code.
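
These per-tenant defaults can be captured as a simple configuration object. A minimal Python sketch (class and field names are illustrative, not the production schema):

```python
from dataclasses import dataclass

@dataclass
class TenantTimings:
    """Per-tenant failover timing parameters, in seconds.

    Defaults mirror the table above; every field is overridable per tenant.
    """
    t_warning_s: int = 10            # one heartbeat cycle missed
    t_grace_s: int = 120             # tolerance before alerting master control
    t_disaster_s: int = 300          # cloud takes air from this point on
    t_recovery_confirm_s: int = 1800 # operator window to confirm handback
    t_recovery_escalate_s: int = 3600  # super-admin escalation
    t_safe_startup_s: int = 10       # buffer before Web Player Lite starts
    flap_threshold_24h: int = 5      # flaps per 24 h triggering investigation

# A station on an unreliable link widens its tolerance without code changes:
rural_station = TenantTimings(t_grace_s=300, t_disaster_s=600)
```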

Phase-by-Phase Flow

Phase A: Normal Operation (HEALTHY)

  • The on-premises Sync Agent sends a heartbeat every 10 seconds (timestamp, agent version, active schedule hash, pending upload queue).
  • The cloud stores the most recent heartbeat per tenant and keeps a D..D+N mirror of traffic logs and audio assets (default N = 3 days, configurable).
  • The studio holds the on-air role; the cloud is a passive mirror.
  • Schedule diffs are uploaded every 30 minutes (configurable).
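
The heartbeat payload described above can be sketched as follows. Field names and the SHA-256 choice are assumptions; the spec only says the heartbeat carries a timestamp, agent version, active schedule hash, and pending upload queue:

```python
import hashlib
import json
import time

def build_heartbeat(agent_version: str, schedule: dict, pending_uploads: int) -> dict:
    """Illustrative heartbeat payload sent every 10 s by the Sync Agent.

    The schedule hash is deterministic, so the cloud can later detect a
    frozen ("zombie") agent whose process is alive but whose state never
    changes (see Failure Modes).
    """
    schedule_hash = hashlib.sha256(
        json.dumps(schedule, sort_keys=True).encode()
    ).hexdigest()
    return {
        "timestamp": time.time(),
        "agent_version": agent_version,
        "schedule_hash": schedule_hash,
        "pending_upload_queue": pending_uploads,
    }
```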

Phase B: Outage Detection

  1. A Heartbeat Monitor job scans every tenant every 5 seconds.
  2. If now − last_heartbeat > 10 s → state transitions to WARNING (silent, logged only).
  3. If silence exceeds T_grace → state transitions to DEGRADED:
    • Tenant dashboard shows an amber alert.
    • Tenant’s master control team is notified through the configured channels.
    • Web Player Lite does not take air yet — it pre-buffers upcoming elements.
  4. If silence exceeds T_disaster → state transitions to FAILOVER_ACTIVE.

Flap Handling

  • Heartbeat resumes between WARNING and DEGRADED → silent return to HEALTHY.
  • Heartbeat resumes between DEGRADED and FAILOVER_ACTIVE → return to HEALTHY with “near-miss” notification.
  • If the flap counter exceeds flap_threshold_24h, a stability investigation alert is raised.
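
The detection thresholds in steps 2–4 reduce to a classification of silence duration. A sketch using the default timings from the parameter table:

```python
def classify_silence(silence_s: float,
                     t_grace_s: int = 120,
                     t_disaster_s: int = 300) -> str:
    """Map heartbeat silence duration to a tenant state.

    Sketch of the Heartbeat Monitor's decision; thresholds are the
    per-tenant configurable defaults.
    """
    if silence_s <= 10:          # within one heartbeat cycle
        return "HEALTHY"
    if silence_s <= t_grace_s:   # silent, logged only
        return "WARNING"
    if silence_s <= t_disaster_s:  # amber alert, pre-buffering
        return "DEGRADED"
    return "FAILOVER_ACTIVE"     # cloud takes air
```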

Phase C: Cloud Takeover (FAILOVER_ACTIVE)

  1. Cloud records failover_active_since = now.
  2. Immediate notification is sent to:
    • All operators in the tenant’s master control team.
    • Informa super-admin (support).
    • Any remote announcer with an active Web Player Lite session.
  3. Web Player Lite takes air for every logged-in session of that tenant, honoring the cloud-stored programming log for the current clock position.
  4. To avoid overlap or echo with the studio, the player follows the conservative startup strategy: it waits up to T_safe_startup and begins playback at the next complete element boundary rather than mid-song.
  5. A complete audit log entry is written with context: who was logged in, exact clock time, last heartbeat received, and elements queued.
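
The conservative startup strategy in step 4 can be sketched as a boundary search over the cloud-stored log. Treating element starts and ends as the safe boundaries is an assumption; the product may use finer cue points:

```python
from typing import List, Optional, Tuple

def next_safe_start(now_s: float, log: List[Tuple[float, float]]) -> Optional[float]:
    """Return the first safe playback boundary at or after `now_s`.

    `log` holds (start_s, duration_s) pairs for scheduled elements.
    The player never begins mid-song: it waits for the next complete
    element boundary (conservative startup strategy).
    """
    boundaries = sorted({s for s, _ in log} | {s + d for s, d in log})
    for b in boundaries:
        if b >= now_s:
            return b
    return None  # nothing left in the window; fall back to the catalog
```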

Phase D: Contingency Operation

Role-based actions during FAILOVER_ACTIVE:

| Role | Allowed | Not allowed |
| --- | --- | --- |
| Remote announcer | Monitor playout; view upcoming breaks; pause only between elements; insert an approved imager/jingle from the catalog | Modify the log; download audio assets; go offline without handing off control |
| Master control operator (tenant) | Everything the announcer can do, plus: insert ad-hoc elements (song, imager, voice track) from the cloud catalog; reorder upcoming items in the current break; skip the current element | Create or edit new audio assets; modify logs for future days |
| Informa super-admin | Read-only access plus emergency intervention (stop, force handback) | Editorial content decisions (contractually reserved to the station) |
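
The role matrix above amounts to a deny-by-default permission check during FAILOVER_ACTIVE. A sketch (action identifiers are illustrative, not the product's real codes):

```python
# Role -> permitted actions during contingency operation.
ROLE_ACTIONS = {
    "remote_announcer": {
        "monitor", "view_breaks", "pause_between_elements",
        "insert_approved_imager",
    },
    "master_control_operator": {
        "monitor", "view_breaks", "pause_between_elements",
        "insert_approved_imager", "insert_adhoc_element",
        "reorder_current_break", "skip_current_element",
    },
    "super_admin": {
        "monitor", "view_breaks", "emergency_stop", "force_handback",
    },
}

def is_allowed(role: str, action: str) -> bool:
    """Deny-by-default check: unknown roles and actions are refused."""
    return action in ROLE_ACTIONS.get(role, set())
```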

Automatic Safeguards

  • Every contingency action is written to the audit log and tagged with the FAILOVER_ACTIVE context.
  • All actions require a valid MFA session (maximum 15 minutes, with step-up re-authentication for critical operations).
  • The cloud refuses commands that would create dead air longer than 5 seconds; the player automatically falls back to the next catalog element.

Phase E: Return to Studio (RECOVERING)

  1. The on-premises Sync Agent reconnects and sends a valid heartbeat.
  2. Cloud detects the return but does not automatically hand air back. It awaits human confirmation.
  3. Operators are notified: “Studio is back. Confirm handback?”
  4. Three options are presented:
    • Immediate handback — operator clicks “Return to studio”; Web Player Lite finishes the current element and releases the air at the next safe stop (end of element, post-spot, or next cue point).
    • Scheduled handback — pinned to the end of the break, top of the hour, or another safe marker.
    • Stay in failover — if the operator does not confirm within T_recovery_confirm, the cloud continues to broadcast, sends periodic reminders, and eventually escalates to super-admin (T_recovery_escalate).
  5. State transitions to RECOVERING with a scheduled handoff_scheduled_for timestamp.
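
The confirmation window and escalation behavior in option three reduce to a decision over elapsed time since the studio returned. A sketch (action names are illustrative):

```python
def recovery_action(elapsed_s: float, operator_confirmed: bool,
                    t_confirm_s: int = 1800, t_escalate_s: int = 3600) -> str:
    """Cloud behavior while awaiting handback confirmation.

    Defaults follow T_recovery_confirm (30 min) and
    T_recovery_escalate (1 h) from the parameter table.
    """
    if operator_confirmed:
        return "handback_at_next_safe_stop"
    if elapsed_s >= t_escalate_s:
        return "escalate_to_super_admin"
    if elapsed_s >= t_confirm_s:
        return "remind_operators_and_stay_on_air"
    return "wait_for_confirmation"
```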

Phase F: Reconciliation

During RECOVERING, the cloud sends the Sync Agent a full account of what happened off-premises:

  • As-played log for the failover window (elements, timestamps, operator interventions).
  • Ad-hoc elements inserted by operators that were not part of the original log.
  • Any changes made to upcoming logs during failover.

The Sync Agent then:

  • Updates the local InfoAudio database with “played” flags for PRO log and compliance purposes.
  • Surfaces any mismatches to the local operator for reconciliation.
  • Confirms completion → cloud transitions back to HEALTHY.

If the local operator disputes the reconciliation, a dispute flag is raised; the super-admin mediates; the audit trail remains immutable.
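
The as-played comparison at the heart of reconciliation can be sketched as a set difference. Real logs are timestamped sequences, so this deliberately ignores ordering and repeats; it only illustrates the three mismatch categories:

```python
from typing import Dict, List

def reconcile(planned: List[str], as_played: List[str]) -> Dict[str, List[str]]:
    """Classify elements after a failover window (illustrative).

    - played_as_planned: in both the original log and the as-played log
    - adhoc_insertions:  played by operators but not in the original log
    - skipped:           planned but never aired off-premises
    """
    planned_set, played_set = set(planned), set(as_played)
    return {
        "played_as_planned": sorted(planned_set & played_set),
        "adhoc_insertions": sorted(played_set - planned_set),
        "skipped": sorted(planned_set - played_set),
    }
```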

Notification Channels

Severity-driven delivery, configurable per tenant:

| Transition | Severity | Channels |
| --- | --- | --- |
| HEALTHY → WARNING | Info | Log only |
| WARNING → DEGRADED | Warning | E-mail + webhook + web push |
| DEGRADED → FAILOVER_ACTIVE | Critical | All channels + SMS |
| FAILOVER_ACTIVE → RECOVERING | Warning | E-mail + webhook + web push |
| RECOVERING → HEALTHY | Info | E-mail + webhook |
| Flap detected (near-miss) | Warning | E-mail + webhook |
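
The table above can be expressed as a per-transition routing map. A sketch; channel identifiers are illustrative, and "All channels + SMS" is assumed to mean e-mail, webhook, web push, and SMS:

```python
from typing import List, Tuple

# Delivery channels per state transition, mirroring the table above.
CHANNELS_BY_TRANSITION = {
    ("HEALTHY", "WARNING"): ["log"],
    ("WARNING", "DEGRADED"): ["email", "webhook", "web_push"],
    ("DEGRADED", "FAILOVER_ACTIVE"): ["email", "webhook", "web_push", "sms"],
    ("FAILOVER_ACTIVE", "RECOVERING"): ["email", "webhook", "web_push"],
    ("RECOVERING", "HEALTHY"): ["email", "webhook"],
}

def channels_for(old_state: str, new_state: str) -> List[str]:
    """Look up delivery channels for a transition; unknown pairs log only."""
    return CHANNELS_BY_TRANSITION.get((old_state, new_state), ["log"])
```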

Failure Modes and Mitigations

| Scenario | Behavior |
| --- | --- |
| Intermittent connectivity (short flap) | WARNING → HEALTHY without action. Logged against a station-stability metric. |
| Cloud unreachable from studio (agent runs, API not) | Agent enters offline upload mode. Highest-risk scenario: the studio is on air but the cloud misreads silence as an outage, risking a double broadcast. Mitigation: require human confirmation at DEGRADED before FAILOVER_ACTIVE, or add a side-channel ping (e.g., studio-side SMS). |
| Web Player Lite has no connectivity when failover fires | Cloud playback does not start. Operators are alerted that contingency has failed and manual escalation begins. |
| No operator responds to RECOVERING | Cloud continues on air; reminders every 10 minutes; super-admin escalation after one hour. |
| Reconciliation dispute | Dispute flag preserved; super-admin mediates; immutable audit log preserved; future logs resync from the next safe break. |
| Zombie agent (process alive, state frozen) | Heartbeat includes a state hash. If the hash does not change for N cycles and no uploads occur, the cloud treats the agent as a zombie and escalates. |
| Cloud itself goes down during failover | Out of MVP scope; requires a pre-recorded local emergency routine as an absolute fallback. Tracked for Phase 4+. |
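
The zombie-agent heuristic from the table can be sketched as follows. N = 5 cycles is an illustrative choice; the spec leaves N unspecified:

```python
from typing import List

def is_zombie(recent_hashes: List[str], uploads_in_window: int,
              n_cycles: int = 5) -> bool:
    """Zombie-agent heuristic: heartbeats keep arriving, but the state
    hash is frozen and nothing has been uploaded for N consecutive cycles.
    """
    return (len(recent_hashes) >= n_cycles
            and len(set(recent_hashes[-n_cycles:])) == 1
            and uploads_in_window == 0)
```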

Audit Trail

Every state transition writes an audit record capturing the full transition context.

Operator interventions use specific action codes (insert_element, skip_element, confirm_handback, etc.). Audit records are append-only, retained for at least one year, and candidates for cryptographic hash-chaining to guarantee tamper-evidence.
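
The hash-chaining candidate works by storing each record's predecessor hash, so altering any earlier record invalidates every later hash. A minimal sketch:

```python
import hashlib
import json
from typing import Dict, List

def append_audit(chain: List[Dict], record: Dict) -> List[Dict]:
    """Append-only audit log with hash-chaining for tamper evidence.

    Each entry embeds the hash of its predecessor (a genesis constant for
    the first record), so any retroactive edit breaks the chain.
    """
    prev_hash = chain[-1]["hash"] if chain else "0" * 64
    payload = json.dumps({"prev": prev_hash, **record}, sort_keys=True)
    entry = {**record, "prev": prev_hash,
             "hash": hashlib.sha256(payload.encode()).hexdigest()}
    chain.append(entry)
    return chain
```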

Open Design Questions

Decisions pending before Phase 1 (MVP)

  1. Is T_grace = 2 min the right default, or should it be more (5 min) or less (1 min) tolerant?
  2. Do we keep two intermediate states (WARNING and DEGRADED), or consolidate into a single pre-failover state?
  3. Confirm the conservative startup strategy (cloud starts at the next complete element, accepting a short gap of 5–10 s).
  4. When the cloud cannot confirm via Agent that the studio is truly off-air (only heartbeat missing), do we activate automatically or require human confirmation at DEGRADED to prevent double broadcast?
  5. Is the announcer / operator / super-admin role matrix complete, or should additional roles (editor, auditor) be added?
  6. Should ad-hoc insertions during failover require dual authorization (operator plus announcer or super-admin) for sensitive actions such as extra commercial spots?
  7. Is T_recovery_confirm = 30 min reasonable? Should handback fall back to automatic after 1 h of silence, or always require a human confirmation?
  8. Can handback be scheduled (next break, top of the hour), or only immediate / manual?
  9. Which notification channels are required for the MVP? Suggested minimum: e-mail + web push; webhook and SMS in Phase 2.
  10. Which role provisions notification channels and recipients per tenant — tenant-admin or super-admin only?
  11. For the “cloud unreachable but agent and studio alive” scenario, do we accept mandatory human confirmation at DEGRADED as the MVP mitigation, or invest in a side-channel (studio SMS, on-air player ping)?
  12. Pre-recorded local emergency routine remains out of MVP scope. Keep tracked for Phase 4+?
  13. Audit log retention: is one year the legal minimum in Brazil? What about international markets?
  14. Should the audit log be strictly immutable (append-only, no delete) with hash-chaining for integrity proof?