Why Failover Matters
In live broadcasting, a dropped stream isn't just a technical issue. It's lost audience, lost revenue, and damaged reputation. From a sports event with 50,000 viewers to a corporate town hall with 500 employees, the expectation is the same: it must not go down.
Video stream failover is the safety net that catches your broadcast when the primary feed fails.
What is Video Failover?
Failover is the automatic switching from a primary video input to a backup when the system detects a failure. A good failover system:
- Detects failure fast: milliseconds, not seconds
- Switches cleanly: minimal visual disruption for viewers
- Recovers automatically: returns to the primary when it's healthy again
- Requires no manual intervention: the whole point is automation
Architecture: Redundant Inputs
The foundation of any failover setup is redundant inputs. You need at least two independent paths:
Active/Standby
The simplest model. One input is active and the other is a hot standby:
Primary SRT -> [Gateway] -> Output
Backup RTMP -> [Gateway] (on failure)
Active/Active
Both inputs carry the stream simultaneously. The gateway selects the best one:
Input A (SRT) -> [Gateway: compare] -> Best signal -> Output
Input B (SRT) -> [Gateway: compare]
More bandwidth cost, but higher reliability.
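The "compare" step can be sketched as a small selection function. This is a minimal illustration, not any particular gateway's logic: the `InputStats` fields, the bitrate floor, and the tie-breaking rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class InputStats:
    name: str
    packet_loss_pct: float  # percentage of packets lost over the last window
    bitrate_kbps: float     # measured incoming bitrate
    connected: bool


def pick_best(inputs, bitrate_floor_kbps=1000.0):
    """Return the name of the healthiest usable input, or None if none qualifies."""
    usable = [
        i for i in inputs
        if i.connected and i.bitrate_kbps >= bitrate_floor_kbps
    ]
    if not usable:
        return None
    # Lowest packet loss wins; ties go to the higher bitrate.
    best = min(usable, key=lambda i: (i.packet_loss_pct, -i.bitrate_kbps))
    return best.name
```

A real gateway would also need to avoid flapping between two inputs of similar quality, for example by only switching when the difference exceeds a margin.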
Detection: How Fast Can You React?
The speed of failover depends on how quickly you detect the problem:
Stream Health Monitoring
Monitor the incoming stream for:
- Packet loss: SRT reports this in real time
- Bitrate drops: sudden decrease often precedes a full failure
- Black/frozen frames: content-aware detection (advanced)
- Audio silence: loss of audio signal
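The four signals above could be combined into a single health check along these lines. This is a sketch under stated assumptions: the `HealthSample` fields and every threshold value are hypothetical and would need tuning for a real stream.

```python
from dataclasses import dataclass


@dataclass
class HealthSample:
    packet_loss_pct: float  # loss over the last measurement interval
    bitrate_kbps: float     # measured incoming bitrate
    frozen_ms: int          # time the picture has been black or unchanged
    silent_ms: int          # time the audio has been below the silence floor


# Illustrative limits only; tune them for your own streams.
MAX_PACKET_LOSS_PCT = 2.0
MIN_BITRATE_KBPS = 1000.0
MAX_FROZEN_MS = 1000
MAX_SILENT_MS = 2000


def failed_checks(sample):
    """Return the names of the health checks this sample fails."""
    failures = []
    if sample.packet_loss_pct > MAX_PACKET_LOSS_PCT:
        failures.append("packet_loss")
    if sample.bitrate_kbps < MIN_BITRATE_KBPS:
        failures.append("bitrate")
    if sample.frozen_ms > MAX_FROZEN_MS:
        failures.append("frozen_video")
    if sample.silent_ms > MAX_SILENT_MS:
        failures.append("audio_silence")
    return failures
```

Returning the list of failed checks, rather than a single boolean, makes it easier to log why a switch happened.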
Timeouts
| Detection Method | Typical Timeout | Notes |
|---|---|---|
| SRT packet loss | <50ms | SRT exposes loss statistics continuously; limited by how often you poll them |
| TCP disconnect | 1-5s | TCP timeout dependent |
| Bitrate threshold | 200-500ms | Configurable window |
| Content analysis | 500ms-2s | Compute intensive |
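The "configurable window" for bitrate detection can be illustrated with a sliding window over received payload sizes. This is a minimal sketch: the class name, the 500ms default, and the millisecond timestamps supplied by the caller are assumptions for the example.

```python
from collections import deque


class BitrateWindow:
    """Tracks incoming bitrate over a sliding time window."""

    def __init__(self, window_ms=500):
        self.window_ms = window_ms
        self._samples = deque()  # (timestamp_ms, payload_bytes)
        self._total_bytes = 0

    def add(self, timestamp_ms, payload_bytes):
        self._samples.append((timestamp_ms, payload_bytes))
        self._total_bytes += payload_bytes
        self._evict(timestamp_ms)

    def _evict(self, now_ms):
        # Drop samples that have fallen out of the window.
        while self._samples and self._samples[0][0] <= now_ms - self.window_ms:
            _, old_bytes = self._samples.popleft()
            self._total_bytes -= old_bytes

    def bitrate_kbps(self, now_ms):
        self._evict(now_ms)
        # Bits per millisecond is numerically equal to kbit/s.
        return self._total_bytes * 8 / self.window_ms

    def below_floor(self, now_ms, floor_kbps):
        return self.bitrate_kbps(now_ms) < floor_kbps
```

A shorter window reacts faster but is more likely to trip on normal bitrate variation, which is the trade-off behind the 200-500ms range in the table.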
The 50ms Target
Professional broadcast equipment targets sub-50ms failover:
- Failure detected within 20ms
- Switch command issued within 10ms
- Output buffer absorbs the transition within 20ms
At 50ms, the switch spans only 1-2 video frames at 25-30 fps, short enough that most viewers will not notice it.
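The budget arithmetic can be checked directly. The three stage durations come from the list above; the frame rates are illustrative choices, not part of the original target.

```python
# Stage durations in milliseconds, taken from the budget above.
BUDGET_MS = {"detect": 20, "switch": 10, "buffer": 20}


def frames_spanned(duration_ms, fps):
    """Number of video frames that elapse during duration_ms at a given frame rate."""
    return duration_ms * fps / 1000


total_ms = sum(BUDGET_MS.values())

for fps in (25, 30, 60):
    print(f"{total_ms} ms at {fps} fps = {frames_spanned(total_ms, fps)} frames")
```

The same 50ms covers more frames as the frame rate rises, so a high-frame-rate production has less headroom within the same budget.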
Implementation Patterns
Pattern 1: Gateway-Level Failover
The gateway itself handles failover logic. This is the simplest and most reliable approach:
- Configure primary and backup inputs
- Set detection thresholds (packet loss %, bitrate floor, timeout)
- The gateway switches automatically and logs every event
- When the primary recovers, the gateway switches back
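The switching behaviour described above can be sketched as a small controller with hysteresis. This is an illustration rather than how any specific gateway implements it: the class, the `fail_after` and `recover_after` parameters, and the idea of counting consecutive health samples are assumptions.

```python
class FailoverController:
    """Switches to backup after repeated failures and back after sustained health."""

    def __init__(self, fail_after=3, recover_after=50):
        self.active = "primary"
        self.fail_after = fail_after        # consecutive bad samples before failover
        self.recover_after = recover_after  # consecutive good samples before switch-back
        self._bad = 0
        self._good = 0
        self.events = []                    # (timestamp_ms, from_input, to_input)

    def update(self, primary_healthy, now_ms):
        """Feed one health sample for the primary; return the active input."""
        if primary_healthy:
            self._good += 1
            self._bad = 0
        else:
            self._bad += 1
            self._good = 0

        if self.active == "primary" and self._bad >= self.fail_after:
            self._switch("backup", now_ms)
        elif self.active == "backup" and self._good >= self.recover_after:
            self._switch("primary", now_ms)
        return self.active

    def _switch(self, target, now_ms):
        self.events.append((now_ms, self.active, target))
        self.active = target
```

Requiring several consecutive bad samples keeps a single lost packet from triggering a switch, and requiring many more good samples before returning keeps the output from bouncing while the primary is still unstable.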
Pattern 2: Encoder-Level Redundancy
Run two encoders independently:
Camera -> Encoder A -> SRT -> Gateway
Camera -> Encoder B -> SRT -> Gateway (backup)
This protects against encoder failure, not just network failure.
Pattern 3: Geographic Redundancy
For mission-critical broadcasts:
Venue Encoder -> SRT -> Gateway (Region A)
Venue Encoder -> SRT -> Gateway (Region B) [failover]
Both gateways output to CDN. CDN-level origin failover provides the final layer.
Monitoring and Alerts
Failover without monitoring is flying blind:
- Real-time dashboards: visualize all input health metrics simultaneously
- Automated alerts: get notified when failover activates (Slack, email, webhook)
- Event logging: timestamp every switch event for post-mortem analysis
- Recovery notifications: know when the primary is back
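A switch event for logging or a webhook might be serialized like this. The field names and the failover/recovery distinction are hypothetical; a real alerting integration (Slack, email, or your own endpoint) defines its own payload format.

```python
import json
from datetime import datetime, timezone


def build_switch_event(from_input, to_input, reason, at=None):
    """Build a JSON payload describing one switch, suitable for a log line or webhook body."""
    at = at or datetime.now(timezone.utc)
    event = {
        # Returning to the primary is reported as a recovery, anything else as a failover.
        "type": "recovery" if to_input == "primary" else "failover",
        "from": from_input,
        "to": to_input,
        "reason": reason,
        "timestamp": at.isoformat(),
    }
    return json.dumps(event, sort_keys=True)
```

Using the same structure for failover and recovery events means a post-mortem can pair each switch with its return and measure how long the backup was on air.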
Testing Your Failover
Never trust a failover system you haven't tested:
- Scheduled drills: pull the primary cable during a test stream
- Network simulation: inject packet loss with `tc` to test SRT recovery vs. failover threshold
- Encoder failure: kill the encoder process and measure switch time
- Recovery testing: verify the system returns to primary
- Load testing: confirm failover works under peak output conditions
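For the network simulation drill, the Linux `tc` commands with the `netem` queueing discipline can be generated by small helpers like the ones below. This is a sketch: the function names and the `eth0` interface are assumptions, and the helpers only build the argument lists, they do not run anything.

```python
def netem_loss_cmd(interface, loss_pct):
    """Build the tc command that adds random packet loss on an interface."""
    if not 0 <= loss_pct <= 100:
        raise ValueError("loss_pct must be between 0 and 100")
    return [
        "tc", "qdisc", "add", "dev", interface,
        "root", "netem", "loss", f"{loss_pct:g}%",
    ]


def netem_clear_cmd(interface):
    """Build the tc command that removes the root qdisc, restoring normal behaviour."""
    return ["tc", "qdisc", "del", "dev", interface, "root"]


# Example: 5% loss on eth0 (interface name is an assumption).
print(" ".join(netem_loss_cmd("eth0", 5)))
print(" ".join(netem_clear_cmd("eth0")))
```

Running the resulting commands requires root on a Linux host, for example through `subprocess.run(cmd, check=True)`. Stepping the loss percentage up gradually shows where SRT retransmission stops coping and the failover threshold takes over.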
Common Mistakes
- Single point of failure in the switch itself: if your failover device fails, everything fails
- Backup feed not monitored: your backup might be dead when you need it
- Too-aggressive timeouts: switching on momentary packet loss creates unnecessary disruption
- No automatic recovery: manual "switch back" means someone has to be awake at 3 AM
- Not testing: the first time your failover fires shouldn't be during a live event
For more on SRT protocol fundamentals, see "SRT vs RTMP: Which Should You Use?" For setup details, read the "SRT Streaming Setup Guide."