Stephane Bhiri

Posted on Feb 12 • Originally published at vajracast.com

Video Stream Failover: Best Practices for Zero-Downtime Broadcasting

#streaming #broadcast #reliability #video

Why Failover Matters

In live broadcasting, a dropped stream isn't just a technical issue. It's lost audience, lost revenue, and damaged reputation. From a sports event with 50,000 viewers to a corporate town hall with 500 employees, the expectation is the same: it must not go down.

Video stream failover is the safety net that catches your broadcast when the primary feed fails.

What is Video Failover?

Failover is the automatic switching from a primary video input to a backup when the system detects a failure. A good failover system:

Detects failure fast: milliseconds, not seconds
Switches cleanly: minimal visual disruption for viewers
Recovers automatically: returns to the primary when it's healthy again
Requires no manual intervention: the whole point is automation

Architecture: Redundant Inputs

The foundation of any failover setup is redundant inputs. You need at least two independent paths:

Active/Standby

The simplest model. One input is active, the other is hot standby:

Primary SRT -> [Gateway] -> Output
Backup RTMP -> [Gateway] (on failure)
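The selection rule behind active/standby can be sketched in a few lines of Python. This is only an illustration: the Input class, its healthy flag, and the input names are assumptions made for the example, not part of any particular gateway's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Input:
    """Hypothetical description of one gateway input."""
    name: str
    healthy: bool


def select_active_standby(primary: Input, backup: Input) -> Optional[Input]:
    """Prefer the primary; use the backup only when the primary is unhealthy.

    Returns None when neither input is usable.
    """
    if primary.healthy:
        return primary
    if backup.healthy:
        return backup
    return None


if __name__ == "__main__":
    chosen = select_active_standby(Input("primary-srt", False), Input("backup-rtmp", True))
    print(chosen.name if chosen else "no usable input")
```

The primary always wins when it is healthy, which is what distinguishes this model from active/active selection.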

Active/Active

Both inputs carry the stream simultaneously. The gateway selects the best one:

Input A (SRT) -> [Gateway: compare] -> Best signal -> Output
Input B (SRT) -> [Gateway: compare]

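This costs more bandwidth, because both feeds run continuously, but it gives higher reliability: the backup is already flowing when the switch happens.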
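One way to picture the "compare" step is a function that ranks the connected inputs by their current statistics. The sketch below is an assumption about how such a comparison could work (lowest packet loss first, then highest bitrate); the InputStats fields are hypothetical and real gateways may weigh other signals.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InputStats:
    """Hypothetical per-input statistics snapshot."""
    name: str
    connected: bool
    packet_loss_pct: float
    bitrate_kbps: float


def pick_best(inputs: List[InputStats]) -> Optional[InputStats]:
    """Return the connected input with the lowest loss; ties go to the higher bitrate."""
    candidates = [s for s in inputs if s.connected]
    if not candidates:
        return None
    return min(candidates, key=lambda s: (s.packet_loss_pct, -s.bitrate_kbps))


if __name__ == "__main__":
    a = InputStats("Input A", True, 0.5, 6000.0)
    b = InputStats("Input B", True, 0.1, 6000.0)
    print(pick_best([a, b]).name)
```

In practice a gateway would also avoid flapping between two inputs of similar quality, for example by requiring a clear margin before switching.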

Detection: How Fast Can You React?

The speed of failover depends on how quickly you detect the problem:

Stream Health Monitoring

Monitor the incoming stream for:

Packet loss: SRT reports this in real time through its connection statistics
Bitrate drops: sudden decrease often precedes a full failure
Black/frozen frames: content-aware detection (advanced)
Audio silence: loss of audio signal
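As an illustration of the bitrate check, the following sketch flags a sample that falls well below the recent average. The window size and the 50% drop ratio are arbitrary assumptions chosen for the example, and the class name is hypothetical.

```python
from collections import deque
from statistics import mean


class BitrateDropDetector:
    """Flags a bitrate sample that is far below the recent rolling average."""

    def __init__(self, window: int = 10, drop_ratio: float = 0.5):
        self.samples = deque(maxlen=window)  # recent healthy samples only
        self.drop_ratio = drop_ratio

    def update(self, bitrate_kbps: float) -> bool:
        """Return True if this sample is below drop_ratio times the baseline."""
        dropped = bool(self.samples) and bitrate_kbps < self.drop_ratio * mean(self.samples)
        if not dropped:
            # Only healthy samples feed the baseline, so a sustained drop stays flagged.
            self.samples.append(bitrate_kbps)
        return dropped


if __name__ == "__main__":
    detector = BitrateDropDetector()
    for sample in (5000, 5000, 5000, 2000, 4800):
        print(sample, detector.update(sample))
```

Keeping dropped samples out of the baseline is a design choice: otherwise a slow decline would drag the average down and mask the failure.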

Timeouts

Detection Method	Typical Detection Time	Notes
SRT packet loss	<50ms	Limited by how often SRT statistics are polled
TCP disconnect	1-5s	Depends on TCP timeout and keepalive settings
Bitrate threshold	200-500ms	Depends on the configured measurement window
Content analysis	500ms-2s	Compute-intensive

The 50ms Target

Professional broadcast equipment targets sub-50ms failover:

Failure detected within 20ms
Switch command issued within 10ms
Output buffer absorbs the transition within 20ms

At 50ms, the switch spans only 1-2 video frames at 25-30 fps (about 3 frames at 60 fps), which is short enough to be effectively invisible to viewers.
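The arithmetic behind that budget is easy to check. The sketch below sums the three stages listed above and converts the total into frames at a few common frame rates; the frame rates are illustrative choices, not values taken from the article.

```python
def frames_spanned(duration_ms: float, fps: float) -> float:
    """Number of video frames that elapse during duration_ms at the given frame rate."""
    return duration_ms * fps / 1000.0


# Stage budget from the list above, in milliseconds.
BUDGET_MS = {"detect": 20, "switch": 10, "buffer": 20}
TOTAL_MS = sum(BUDGET_MS.values())

if __name__ == "__main__":
    print("total budget:", TOTAL_MS, "ms")  # 50 ms
    for fps in (25, 30, 60):
        print(fps, "fps ->", frames_spanned(TOTAL_MS, fps), "frames")
    # 25 fps -> 1.25 frames, 30 fps -> 1.5 frames, 60 fps -> 3.0 frames
```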

Implementation Patterns

Pattern 1: Gateway-Level Failover

The gateway itself handles failover logic. This is the simplest and most reliable approach:

Configure primary and backup inputs
Set detection thresholds (packet loss %, bitrate floor, timeout)
The gateway switches automatically and logs every event
When primary recovers, it switches back
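The four steps above can be sketched as a small controller. This is a simplified model under stated assumptions, not any gateway's actual logic: the Thresholds fields and their defaults are hypothetical, timestamps are passed in by the caller, and the backup's own health is ignored for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Thresholds:
    """Hypothetical detection thresholds; the defaults are illustrative only."""
    max_loss_pct: float = 5.0
    min_bitrate_kbps: float = 1000.0
    fail_after_s: float = 0.05     # primary must look bad this long before switching
    recover_after_s: float = 10.0  # primary must look good this long before switching back


class FailoverController:
    def __init__(self, thresholds: Thresholds):
        self.t = thresholds
        self.active = "primary"
        self.events: List[Tuple[float, str, str]] = []  # (time, from, to)
        self._bad_since: Optional[float] = None
        self._good_since: Optional[float] = None

    def update(self, now: float, loss_pct: float, bitrate_kbps: float) -> str:
        """Feed one primary-input measurement; return the input that should be live."""
        ok = loss_pct <= self.t.max_loss_pct and bitrate_kbps >= self.t.min_bitrate_kbps
        if ok:
            self._bad_since = None
            if self._good_since is None:
                self._good_since = now
            if self.active == "backup" and now - self._good_since >= self.t.recover_after_s:
                self._switch(now, "primary")
        else:
            self._good_since = None
            if self._bad_since is None:
                self._bad_since = now
            if self.active == "primary" and now - self._bad_since >= self.t.fail_after_s:
                self._switch(now, "backup")
        return self.active

    def _switch(self, now: float, target: str) -> None:
        self.events.append((now, self.active, target))  # log every switch
        self.active = target
```

The two separate timers are deliberate. A short fail_after_s keeps failover fast, while a much longer recover_after_s stops the output from bouncing back to a primary that is only briefly healthy.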

Pattern 2: Encoder-Level Redundancy

Run two encoders independently:

Camera -> Encoder A -> SRT -> Gateway
Camera -> Encoder B -> SRT -> Gateway (backup)

This protects against encoder failure, not just network failure.

Pattern 3: Geographic Redundancy

For mission-critical broadcasts:

Venue Encoder -> SRT -> Gateway (Region A)
Venue Encoder -> SRT -> Gateway (Region B) [failover]

Both gateways output to CDN. CDN-level origin failover provides the final layer.

Monitoring and Alerts

Failover without monitoring is flying blind:

Real-time dashboards: visualize all input health metrics simultaneously
Automated alerts: get notified when failover activates (Slack, email, webhook)
Event logging: timestamp every switch event for post-mortem analysis
Recovery notifications: know when the primary is back
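A switch event only helps a post-mortem if it is recorded with a precise timestamp and a reason. The sketch below builds such a record and serializes it as JSON that could be sent to a webhook. The field names are assumptions made for this example and do not follow the schema of Slack or any other service.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_switch_event(from_input: str, to_input: str, reason: str,
                       when: Optional[datetime] = None) -> dict:
    """Describe one switch; a return to the primary is labeled a recovery."""
    when = when or datetime.now(timezone.utc)
    return {
        "event": "recovery" if to_input == "primary" else "failover",
        "from": from_input,
        "to": to_input,
        "reason": reason,
        "timestamp": when.isoformat(),  # UTC, ISO 8601
    }


def to_webhook_body(event: dict) -> bytes:
    """Serialize the event as a UTF-8 JSON body for an HTTP POST."""
    return json.dumps(event, sort_keys=True).encode("utf-8")


if __name__ == "__main__":
    event = build_switch_event("primary", "backup", "packet loss above threshold")
    print(to_webhook_body(event).decode("utf-8"))
```

Sending a recovery event through the same path as the failover event means whoever was alerted also learns when the primary is back.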

Testing Your Failover

Never trust a failover system you haven't tested:

Scheduled drills: pull the primary cable during a test stream
Network simulation: inject packet loss with tc (netem) to find where SRT's own retransmission still recovers the stream and where the failover threshold trips
Encoder failure: kill the encoder process and measure switch time
Recovery testing: verify the system returns to primary
Load testing: confirm failover works under peak output conditions
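For the network simulation drill, a small helper can build the tc netem commands that add and remove packet loss, plus a function to compute the measured switch time. This sketch only constructs the command lines and does not run them. The interface name eth0 is an assumption, and the commands need root privileges on a Linux test host.

```python
from typing import List, Tuple


def netem_loss_commands(interface: str, loss_pct: int) -> Tuple[List[str], List[str]]:
    """Return (add, remove) tc commands that inject and clear random packet loss."""
    add = ["tc", "qdisc", "add", "dev", interface, "root", "netem", "loss", f"{loss_pct}%"]
    remove = ["tc", "qdisc", "del", "dev", interface, "root", "netem"]
    return add, remove


def switch_time_ms(failure_injected_at: float, switch_logged_at: float) -> float:
    """Elapsed time between injecting the failure and the logged switch, in ms."""
    return (switch_logged_at - failure_injected_at) * 1000.0


if __name__ == "__main__":
    add, remove = netem_loss_commands("eth0", 5)  # "eth0" is an assumed interface name
    print(" ".join(add))
    print(" ".join(remove))
```

Stepping the loss percentage upward across runs shows the point where the stream stops recovering on its own and the failover threshold takes over.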

Common Mistakes

Single point of failure in the switch itself: if your failover device fails, everything fails
Backup feed not monitored: your backup might be dead when you need it
Too-aggressive timeouts: switching on momentary packet loss creates unnecessary disruption
No automatic recovery: manual "switch back" means someone has to be awake at 3 AM
Not testing: the first time your failover fires shouldn't be during a live event

For more on SRT protocol fundamentals, see "SRT vs RTMP: Which Should You Use?" For setup details, read the "SRT Streaming Setup Guide".
