- NFL Christmas Day 2024 — the highest-stakes live streaming test Netflix had ever run
- Seconds — target decision time in the Live Operations Center; on-call investigation takes minutes
- Predefined authority — every LOC role has specific decisions they can make without escalation
- <30 seconds — feature flag propagation time for live event flags
- Game days — pre-event rehearsals that practice the detection → decision → action cycle under pressure
- The hard part wasn't encoding or CDN — it was building the human infrastructure to respond in real time
When Netflix began streaming live events — boxing, NFL games, comedy specials — the engineering challenge wasn't encoding or delivery. It was building the human infrastructure: the operations team, the escalation paths, the real-time decision systems, and the runbooks that let engineers respond to live event failures in seconds, not minutes.
The Story
Netflix's engineering reputation is built on on-demand streaming — a fundamentally forgiving medium where buffering, retries, and eventual consistency all work in the viewer's favour. Live streaming is the opposite: a viewer watching a boxing match or an NFL game experiences a latency spike or buffering pause in real time, during the action, with no opportunity to replay or retry the moment they missed. Netflix's blog post on Live at Scale wasn't primarily about CDN architecture or encoding pipelines — it was about the human infrastructure: the operations team that makes live events work.
On-demand streaming fails gracefully: buffering algorithms absorb network variability, ABR (Adaptive Bitrate) logic adjusts quality to available bandwidth, and CDN edge nodes cache content to reduce origin load. Live streaming has much less room for graceful degradation: buffering during a touchdown is visible failure, and the viewer doesn't get a second chance at that moment. The engineering systems that make live streaming reliable are well-understood — origin redundancy, edge caching, latency optimisation. But the operational systems — who decides what when something goes wrong, how fast can a decision be made, who has authority to take down a feature if it's affecting stability — were built from scratch for Netflix's live events.
Problem
On-Demand Operations Models Don't Scale to Live
Netflix's existing operations model was designed for a world where failures could be investigated over minutes or hours. For live events, the decision to act on a failure had to happen in 10–20 seconds, before viewers noticed the degradation.
Cause
Live Events Have Zero Recovery Time
Unlike on-demand content, where a retry or rebuffer is acceptable, live streaming requires decisions to be made before the viewer impact is visible. The detection → decision → action loop must complete in seconds. Existing on-call processes, which allowed investigation before action, were too slow for live event operations.
Solution
Live Operations Center: Command Structure for Speed
Netflix built a dedicated Live Operations Center (LOC) structure for major events — a command centre with real-time data displays, predefined decision authorities, escalation paths, and runbooks that reduced decision time from minutes to seconds.
Result
NFL Christmas Day Streamed Without Major Incidents
Netflix successfully streamed NFL Christmas Day 2024 to tens of millions of simultaneous viewers. The operational infrastructure built for it became the template for subsequent live events including boxing, comedy specials, and additional NFL games.
The Fix
The Live Operations Center: Roles, Runbooks, and Real-Time Decisions
The Live Operations Center structure was Netflix's key operational innovation for live events. Rather than relying on an on-call rotation to respond reactively, the LOC is a proactive operations command staffed during live events with engineers who have predefined authority, real-time data access, and practised runbooks. The LOC staff are not passive monitors — they're empowered to make impactful decisions (taking regions offline, disabling features, changing CDN routing) within defined boundaries and without requiring approval escalations.
- Seconds — target decision time in the LOC versus minutes for standard on-call incident response
- <30s — feature flag propagation time for live event flags; a flag taking 10 minutes is useless during a live failure
- Predefined — authority assignments before each event; every LOC role has specific decisions they can make without escalation
- Game days — pre-event rehearsals practising detection → decision → action under simulated failure pressure
```yaml
# Conceptual LOC runbook for a CDN region health issue
# Real runbooks include specific tool commands and numeric thresholds
loc_runbook:
  title: "CDN Region Health Degradation"
  trigger: "Playback success rate < 95% in a geographic region"
  severity: P1  # live event is actively degraded
  immediate_actions:  # within 30 seconds
    - owner: cdn_operator  # predefined authority: no approval needed
      action: "Check CDN region health dashboard"
      decision: "If error rate > 5%: disable region from CDN rotation"
      tool: "cdn-control --region {region} --disable"
      propagation_time: "< 15 seconds globally"
  parallel_actions:  # simultaneously
    - owner: traffic_operator
      action: "Verify origin capacity for increased load from CDN failover"
    - owner: encoding_operator
      action: "Confirm encoding pipeline health"
  verification:  # within 60 seconds of action
    - check: "Playback success rate recovering in affected region"
    - check: "Adjacent CDN regions absorbing traffic without degradation"
  escalation:  # if not resolved in 2 minutes
    to: LOC_director
    with: "CDN region {region} removed from rotation, impact ongoing"

# Key difference from standard on-call runbook:
# Action is taken FIRST. Investigation is parallel, not a prerequisite.
# Predefined authority makes this possible without chaos.
```
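As a rough illustration of the action-first pattern, the decision step of the runbook above could be expressed as a small function. This is a minimal Python sketch, not Netflix's tooling. The `RegionHealth` type, the `AUTHORITY` table, the `decide` function, and the `monitor` fallback for a triggered region whose error rate is at or below 5% are all assumptions made for illustration; only the 95% and 5% thresholds and the role names come from the runbook.

```python
from dataclasses import dataclass

# Thresholds copied from the conceptual runbook.
PLAYBACK_SUCCESS_TRIGGER = 0.95  # runbook triggers below this success rate
ERROR_RATE_DISABLE = 0.05        # region is disabled above this error rate

# Hypothetical authority table: actions each role may take without approval.
AUTHORITY = {
    "cdn_operator": {"disable_region"},
    "traffic_operator": {"verify_origin_capacity"},
    "encoding_operator": {"confirm_encoding_health"},
}


@dataclass
class RegionHealth:
    region: str
    playback_success_rate: float  # fraction between 0.0 and 1.0
    error_rate: float             # fraction between 0.0 and 1.0


def decide(role: str, health: RegionHealth) -> str:
    """Return the action this role should take for the region, right now."""
    if health.playback_success_rate >= PLAYBACK_SUCCESS_TRIGGER:
        return "no_action"   # runbook trigger has not fired
    if health.error_rate <= ERROR_RATE_DISABLE:
        return "monitor"     # assumed fallback; the runbook does not specify one
    if "disable_region" in AUTHORITY.get(role, set()):
        return "disable_region"  # act first, investigate in parallel
    return "escalate"        # role lacks predefined authority for this action
```

An operator without the relevant authority gets `escalate` rather than a silent no-op, which mirrors the idea that authority boundaries are fixed before the event and not negotiated during it.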
Pre-event game days were a critical part of Netflix's live event preparation. A game day is a structured rehearsal of incident scenarios conducted before a major event, in which the operations team practises executing runbooks, making decisions, and coordinating across teams under simulated pressure. The team ran exercises that simulated specific failure scenarios (a CDN region becoming unreachable, an encoding node failing, a sudden traffic spike above capacity) and practised the complete response cycle: detection, decision, action, verification. Game days revealed multiple runbooks that were unclear under time pressure: steps that assumed familiarity with a tool's interface, thresholds that weren't specified precisely, and escalation conditions that were ambiguous. A round of rewrites before the NFL game produced runbooks that anyone in the LOC could execute correctly without expert system knowledge.
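The kinds of gaps that game days surfaced (imprecise thresholds, missing tool commands, ambiguous escalation conditions) can also be caught mechanically before a rehearsal. The following is a minimal Python sketch of such a check, assuming a runbook loaded into a dictionary shaped like the conceptual runbook earlier in this section. The `lint_runbook` function and the `after_seconds` key are hypothetical and not part of any Netflix tool.

```python
import re


def lint_runbook(runbook: dict) -> list[str]:
    """Return a list of human-readable problems found in a runbook dict."""
    problems = []
    for i, step in enumerate(runbook.get("immediate_actions", [])):
        where = f"immediate_actions[{i}]"
        if not step.get("owner"):
            problems.append(f"{where}: no owner")
        decision = step.get("decision", "")
        # A decision with no digits has no precise threshold to act on.
        if decision and not re.search(r"\d", decision):
            problems.append(f"{where}: decision has no numeric threshold")
        if not step.get("tool"):
            problems.append(f"{where}: no tool command")
    escalation = runbook.get("escalation", {})
    if not escalation.get("to"):
        problems.append("escalation: no target role")
    if "after_seconds" not in escalation:  # hypothetical key for the time limit
        problems.append("escalation: no time limit")
    return problems


vague = {
    "immediate_actions": [
        {"owner": "cdn_operator", "decision": "If error rate is high: disable region"}
    ],
    "escalation": {"to": "LOC_director"},
}
for problem in lint_runbook(vague):
    print(problem)
```

A check like this only catches structural gaps. Whether a step is executable under pressure by someone unfamiliar with the tool still has to be tested in the rehearsal itself.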
Post-event reviews: the learning system
Netflix conducts structured post-event reviews after every major live event — whether or not there were incidents. The review covers: what went well operationally, what decisions were made and whether they were the right ones, what runbooks were unclear or incorrect under pressure, and what monitoring gaps were discovered. Post-event reviews treat each live event as a learning opportunity regardless of outcome. Over time, this produces increasingly reliable operational runbooks, better-calibrated decision thresholds, and a team that gets better at live event operations with every event.
Capacity pre-provisioning: you can't autoscale fast enough
Unlike on-demand streaming, where traffic grows gradually and can be served from cached content, live events create near-instantaneous global concurrency spikes. Tens of millions of viewers join within the first few minutes of kickoff. Netflix's capacity provisioning required pre-positioning encoding capacity, CDN edge capacity, and origin capacity well in advance of event start. A ramp from 0 to 10 million concurrent viewers in 5 minutes outpaces reactive autoscaling, so capacity pre-provisioning is mandatory.
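The arithmetic behind that claim is worth making explicit. The sketch below assumes a linear ramp, which is a simplification, and uses a hypothetical 120-second lag between a scaling trigger and new capacity serving traffic. Only the 10 million viewers and the 5-minute ramp come from the text.

```python
PEAK_VIEWERS = 10_000_000  # from the text
RAMP_SECONDS = 5 * 60      # from the text


def join_rate(peak: int = PEAK_VIEWERS, ramp_seconds: float = RAMP_SECONDS) -> float:
    """Average viewers joining per second, assuming a linear ramp."""
    return peak / ramp_seconds


def arrivals_during_lag(
    lag_seconds: float,
    peak: int = PEAK_VIEWERS,
    ramp_seconds: float = RAMP_SECONDS,
) -> float:
    """Viewers who arrive while reactive scaling is still bringing capacity up."""
    return peak * min(lag_seconds, ramp_seconds) / ramp_seconds


print(f"{join_rate():,.0f} viewers/second")
print(f"{arrivals_during_lag(120):,.0f} viewers arrive during a 120 s scaling lag")
```

Under these assumptions roughly 33,000 viewers join every second, and 4 million arrive before a 120-second scale-up finishes. Capacity that is not already in place when the ramp starts is capacity those viewers never see.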
The buffer factor: why live needs different response times
On-demand streaming buffers 10–30 seconds of video ahead of playback. A 2-second CDN hiccup is invisible because the buffer absorbs it. Live streaming buffers only 3–8 seconds ahead of playback and cannot buffer further because the future content does not exist yet. A 4-second CDN hiccup can drain a live buffer and visibly stall playback; the same hiccup in on-demand streaming is completely invisible. This difference in buffering physics explains why live streaming requires different operational response times.
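A deliberately simple model makes the comparison concrete. It assumes the buffer is full when the hiccup starts and that no data arrives until it ends, and it ignores ABR downshifts and partial delivery. The `stall_seconds` function is a hypothetical name used for illustration; the buffer sizes are the ranges quoted above.

```python
def stall_seconds(buffer_seconds: float, outage_seconds: float) -> float:
    """Visible stall: the part of the outage that the buffer cannot cover."""
    return max(0.0, float(outage_seconds) - float(buffer_seconds))


scenarios = [
    ("on-demand, 10 s buffer, 2 s hiccup", 10, 2),
    ("on-demand, 10 s buffer, 4 s hiccup", 10, 4),
    ("live, 3 s buffer, 4 s hiccup", 3, 4),
    ("live, 8 s buffer, 4 s hiccup", 8, 4),
]
for label, buffer_s, outage_s in scenarios:
    print(f"{label}: {stall_seconds(buffer_s, outage_s)} s stall")
```

In this model the 4-second hiccup stalls viewers at the low end of the live buffer range, while a viewer with a full 8-second live buffer survives it with half the buffer gone. Neither on-demand case stalls at all, which is why the live response window is measured in seconds.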
Architecture
Netflix's live streaming architecture builds on its existing on-demand CDN infrastructure but adds live-specific components. The most significant architectural changes are at the operational layer — the systems that support human operators making fast decisions during live events.
Live Operations Center Command Structure
View interactive diagram on TechLogStack →
Live Streaming Architecture: Technical Stack
View interactive diagram on TechLogStack →
Lessons
Live events require operational infrastructure as much as technical infrastructure. The CDN, encoder, and ABR algorithm are prerequisites. But without predefined decision authority, real-time dashboards, fast feature flags, and practised runbooks, technical infrastructure excellence doesn't translate to operational excellence during live failures.
Predefined authority eliminates the approval overhead that kills live event response times. It means assigning specific engineers, before an event, the specific decisions they can make without approval, which removes escalation latency during incidents. Define who can make which decisions, and under what conditions, before the event rather than during it.
Game days build operational muscle memory. Reading a runbook and executing one under pressure are different skills. Practise the complete detection → decision → action → verification cycle in simulated scenarios before live events. Operators who have done it once in practice make better decisions when they have to do it for real.
Feature flag propagation speed is an operational tool. A feature flag that takes 10 minutes to propagate is useless during a live event failure. Build live event feature flags with sub-30-second global propagation and verify that propagation speed in game days, not just in unit tests.
Post-event reviews are the learning system. Review every event, whether there were incidents or not. Smooth events reveal runbook clarity and decision threshold calibration. Incident events reveal operational gaps. Both are necessary inputs for getting better at live event operations over time.
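The feature flag lesson above suggests measuring propagation during game days instead of assuming it. One way to do that is to record when a flag was flipped and when each region first observed the new value, then compare the slowest region against the 30-second budget. The sketch below assumes those timestamps are already collected. The `propagation_report` function, the region names, and the input shape are hypothetical, and the function expects at least one region.

```python
def propagation_report(
    flipped_at: float,
    observed_at: dict[str, float],
    budget_seconds: float = 30.0,
) -> dict:
    """Summarise flag propagation delay per region against a time budget.

    flipped_at and observed_at values are timestamps in seconds on one clock.
    """
    delays = {region: seen - flipped_at for region, seen in observed_at.items()}
    slowest = max(delays, key=delays.get)
    return {
        "slowest_region": slowest,
        "max_delay": delays[slowest],
        # The target in the text is sub-30-second, so the comparison is strict.
        "within_budget": delays[slowest] < budget_seconds,
        "late_regions": sorted(r for r, d in delays.items() if d >= budget_seconds),
    }


report = propagation_report(
    flipped_at=100.0,
    observed_at={"us-east": 104.0, "eu-west": 112.5, "ap-south": 141.0},
)
print(report)
```

Reporting the slowest region, not the average, matches how the flag is used during an incident: a kill switch has only worked once the last region has seen it.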
Engineering Glossary
ABR (Adaptive Bitrate) — a streaming technique that continuously selects the highest video quality the viewer's connection can support, switching between quality levels in real time based on available bandwidth. On-demand streaming has up to 30 seconds of buffer to absorb quality switches; live streaming has 3–8 seconds.
Game day — a structured pre-event rehearsal where the operations team practises executing runbooks, making decisions, and coordinating across teams under simulated incident pressure. Reveals unclear runbook steps and unvalidated decision thresholds before they become problems during the real event.
Live Operations Center (LOC) — Netflix's dedicated command structure for major live events. Staffed with engineers who have predefined decision authority, real-time dashboard access, and practised runbooks. Designed to make impactful decisions in seconds without approval escalations.
Predefined authority — the practice of assigning, before a live event begins, the specific decisions each engineer can make without approval. Eliminates the escalation latency that would otherwise make live-event response times incompatible with viewer experience requirements.
Post-event review — a structured retrospective conducted after every major live event, covering what went well, what decisions were made and whether they were right, what runbooks were unclear under pressure, and what monitoring gaps were discovered. The operational learning system that improves reliability across all subsequent events.
This case is a plain-English retelling of publicly available engineering material.