Two-Station Redundancy in Live FM Broadcasting: Architecture and Failure Modes

#broadcasting #redundancy #reliability #radio

By the KAVANA engineering team

Live broadcasting has a reliability requirement that most software systems do not: every second of dead air is a broadcast failure with regulatory and commercial consequences.

What Failure Means in Broadcast

Hard failure: The playout process crashes. Easy to detect.

Silent failure: The process is running but producing silence. Heartbeat checks pass. The transmitter stays on air and broadcasts dead air.

Content failure: The process is running and producing audio, but the wrong audio, such as an internal test loop leaking to air. This is harder to detect automatically.

Partial failure: One subsystem has failed while the rest keep running. For example, TTS generation breaks, so AI content is missing and the system silently fills the gap with music.

A simple process-alive check is insufficient for broadcast-grade reliability.

The Audio Watchdog

Our primary failure detection mechanism monitors the actual audio signal, not just software processes:

Samples output audio continuously at low resolution
Computes running RMS level
Detects prolonged silence (3+ seconds below threshold)
Detects stuck-loop patterns via autocorrelation
Compares against the expected schedule hash when available

Silence detection is the most reliable check. Stuck-loop detection requires careful threshold tuning per station.
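
The two signal checks above can be sketched in a few lines. This is a minimal illustration, not our production watchdog: the class and function names, the 0.01 RMS threshold, and the block-based interface are all assumptions chosen for the example. Only the 3-second hold time comes from the list above.

```python
import math

def rms(samples):
    """Root-mean-square level of a block of float samples in [-1.0, 1.0]."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))

class SilenceDetector:
    """Flags silence once the RMS level stays below a threshold for hold_seconds."""

    def __init__(self, threshold=0.01, hold_seconds=3.0):
        self.threshold = threshold        # assumed linear RMS threshold
        self.hold_seconds = hold_seconds  # 3 seconds, as in the list above
        self.quiet_for = 0.0              # seconds of continuous quiet so far

    def feed(self, samples, block_seconds):
        """Process one block; return True while prolonged silence is detected."""
        if rms(samples) < self.threshold:
            self.quiet_for += block_seconds
        else:
            self.quiet_for = 0.0
        return self.quiet_for >= self.hold_seconds

def normalized_autocorrelation(samples, lag):
    """Correlation of the signal with itself shifted by lag samples.

    Values near 1.0 at a nonzero lag suggest the audio repeats with that period.
    """
    n = len(samples) - lag
    if n <= 0:
        return 0.0
    a, b = samples[:n], samples[lag:]
    energy = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    if energy == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / energy
```

A stuck-loop check would compare the autocorrelation at candidate lags against a per-station threshold. Legitimate programming, such as a jingle or a repetitive music bed, also scores high, which is why that threshold needs tuning.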

State Synchronization

The primary writes a heartbeat record every 5 seconds: current schedule position, currently-playing content item ID, audio buffer state. The backup maintains a shadow schedule. On failover:

resume_position = last_heartbeat_position + (failover_trigger_time - last_heartbeat_time)

In most cases this estimate is accurate to within 10 seconds. For music rotation, 10 seconds of discontinuity is acceptable. For a news reader mid-sentence, it is not, so AI content gets more frequent position checkpoints.
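
As a worked example of the resume calculation, here is a minimal sketch. The `Heartbeat` field names, the use of seconds for schedule position, and the representation of buffer state as seconds buffered are assumptions made for illustration; they are not the production record format.

```python
from dataclasses import dataclass

@dataclass
class Heartbeat:
    """One heartbeat record; field names are illustrative, not the real schema."""
    timestamp: float          # when the primary wrote the record, in seconds
    schedule_position: float  # position in the schedule, assumed to be in seconds
    content_item_id: str      # ID of the currently playing content item
    buffer_seconds: float     # audio buffer state, assumed here to be seconds buffered

def resume_position(hb, failover_trigger_time):
    """Estimate where the backup should resume: last position plus elapsed time."""
    elapsed = failover_trigger_time - hb.timestamp
    if elapsed < 0:
        raise ValueError("failover trigger precedes the last heartbeat")
    return hb.schedule_position + elapsed

# Last heartbeat at t=1000.0 s reported position 125.0 s; failover triggers at t=1007.5 s.
hb = Heartbeat(timestamp=1000.0, schedule_position=125.0,
               content_item_id="item-42", buffer_seconds=2.0)
print(resume_position(hb, 1007.5))  # 132.5
```

The estimate assumes playout advanced in real time between the last heartbeat and the failover trigger. If the primary had already stalled before the trigger, the backup resumes slightly ahead of where the audio actually stopped.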

The Split-Brain Problem

The most dangerous failure mode is both stations believing they are the active transmitter. We handle this with a physical arbitration mechanism: before the backup transitions to active mode, it must acquire a hardware lock token implemented as a relay switch. Because the relay can be in only one position at a time, only one station can hold the token. This eliminates split-brain, at the cost of adding 100-200 ms to failover time.

Software-only split-brain prevention (via distributed consensus) depends on network round trips at the exact moment the network may be degraded. The physical relay avoids that dependency entirely.
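
The arbitration rule can be sketched in software for illustration. Here `SimulatedRelay` is a hypothetical stand-in for the hardware, with a non-blocking lock playing the role of the relay; the names and interface are assumptions, and the real mechanism is the physical relay rather than this code.

```python
import threading

class SimulatedRelay:
    """Stand-in for the hardware relay: at most one holder at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.holder = None

    def try_acquire(self, station_id):
        """Return True if station_id now holds the relay, False if it is already held."""
        if self._lock.acquire(blocking=False):
            self.holder = station_id
            return True
        return False

    def release(self, station_id):
        """Release the relay, but only if station_id is the current holder."""
        if self.holder == station_id:
            self.holder = None
            self._lock.release()

def attempt_failover(relay, station_id):
    """Go active only after the relay is held; otherwise stay in standby."""
    return "active" if relay.try_acquire(station_id) else "standby"

relay = SimulatedRelay()
print(attempt_failover(relay, "primary"))  # active
print(attempt_failover(relay, "backup"))   # standby
```

The point of the sketch is the ordering: a station checks the token first and changes mode second, so a station that wrongly believes its peer is dead still cannot go to air while the peer holds the relay.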

AI Content in the Redundant System

The primary streams generated audio to the backup as secondary output, while the backup maintains its own generation capability as fallback. In normal operation the backup uses the primary stream. In failover it falls back to its own generation, accepting a brief audio discontinuity at the failover point.

This architecture means the backup must have access to all the same content sources as the primary: the same LLM, the same TTS voice models, the same news feeds. The backup is not a cold spare; it is a warm standby with full capability.

What This Means Operationally

The practical outcome of this architecture: stations running our system report 99.97% uptime over their broadcast hours. The remaining 0.03% is mostly scheduled maintenance windows, not failures.
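
To put that percentage in concrete terms, the arithmetic below converts it to hours. The 24-hour, 365-day broadcast schedule is an assumption for illustration only; stations with shorter broadcast days would see proportionally fewer hours.

```python
def downtime_hours(uptime_percent, broadcast_hours):
    """Hours off air implied by an uptime percentage over a given number of broadcast hours."""
    return broadcast_hours * (100.0 - uptime_percent) / 100.0

hours_per_year = 24 * 365  # assumed round-the-clock schedule
print(f"{downtime_hours(99.97, hours_per_year):.2f} hours per year")  # 2.63 hours per year
```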

KAVANA designs broadcast playout systems for FM stations where reliability is non-negotiable. Learn more at kavanafm.com.