David Marcelo Petrocelli
How Netflix Turns 2 Trillion Daily Events Into Architectural Decisions (And How You Can Too)

Difficulty Level: 300 - Advanced

  • Netflix processes 2+ trillion events/day through Kafka and 20,000+ Flink jobs, but the real differentiator is not scale -- it is using that data to drive every architectural decision, from Java version migrations to database selection.
  • Their Data Mesh platform with Streaming SQL democratized real-time processing: 1,200 SQL processors created in one year by non-infrastructure teams, processing 100 million events/second across 5,000+ pipelines.
  • Every product change goes through A/B testing (150K-450K RPS, <1ms cache-warm latency), and in 2025 ML-optimized experimentation reduces experiment duration by up to 40%.
  • The biggest lesson is what NOT to copy: Netflix explicitly warns against "streaming all the things," and their architecture reflects 15+ years of incremental evolution with 10,000+ engineers -- blindly replicating it is a documented anti-pattern.

The 2 Trillion Events Question

Netflix processes over 2 trillion events every single day. Three petabytes of data ingested. Seven petabytes output.

Those numbers are staggering, but scale is not what makes Netflix's architecture remarkable. What makes it remarkable is that every one of those events feeds back into decisions about what to build next.

Netflix runs 1,000+ microservices on AWS across 100,000+ EC2 instances, serving 300M+ subscribers and generating $39B in revenue (2024). Their estimated annual AWS spend exceeds $1.3B. But the companies that try to replicate Netflix's infrastructure miss the point entirely. The architecture is not the product of a grand design -- it is the product of 15+ years of data-driven decisions, each one measured, validated, and rolled out incrementally.

After years of building distributed systems for enterprise clients and teaching these patterns at university, I have found that the most common mistake teams make is copying Netflix's tools rather than Netflix's discipline. In this article, I will break down how their real-time data pipeline feeds architectural decisions across experimentation, observability, chaos engineering, and platform engineering -- and identify the patterns you can actually adopt.

Prerequisites

  • Familiarity with microservices architecture patterns (circuit breakers, service discovery, API gateways)
  • Basic understanding of stream processing concepts (Kafka, Flink, or similar)
  • Experience with distributed systems at any scale
  • Understanding of A/B testing fundamentals

What You'll Learn

  • How Netflix's Kafka + Flink pipeline evolved from 45 billion events/day (2011) to 2+ trillion
  • Why Netflix rejected graph databases for their 8-billion-node distributed graph and chose Cassandra instead
  • How their A/B testing platform handles 450K RPS with sub-millisecond latency
  • What Netflix's observability stack looks like at 17 billion metrics/day and 700 billion traces/day
  • Which Netflix patterns you should adopt -- and which ones you should absolutely avoid

The Data Pipeline: Kafka, Flink, and Data Mesh at Trillions Scale

Netflix's real-time data infrastructure evolved through four distinct innovation phases over 13 years. Understanding this evolution matters because it reveals that no one designed a "trillions-scale pipeline" from scratch. Every layer was added to solve a concrete problem.


The Keystone Pipeline

At the core is Keystone, a petabyte-scale real-time event streaming and processing system. It scaled from 1 trillion events/day in 2017 to 2+ trillion today -- doubling in roughly four years.

Kafka serves as the universal backbone. Thousands of topics carry the load, with the busiest reaching on the order of a million messages per second, all Avro-encoded with schemas persisted in a centralized internal registry. Every record is dual-written to both streaming consumers (Flink) and the analytical warehouse (Apache Iceberg), enabling real-time processing and historical backfills simultaneously.
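The dual-write idea is worth making concrete. The sketch below is illustrative only: the in-memory sinks stand in for the Kafka-to-Flink path and the Iceberg warehouse, and every name is an assumption, not Netflix's API.

```python
# Minimal sketch of the dual-write pattern: every event fans out to both
# a streaming sink and a warehouse sink. Sinks here are in-memory stand-ins.
from typing import Protocol


class Sink(Protocol):
    def write(self, record: dict) -> None: ...


class InMemorySink:
    def __init__(self) -> None:
        self.records: list[dict] = []

    def write(self, record: dict) -> None:
        self.records.append(record)


class DualWriter:
    """Fan out every published record to all configured sinks."""

    def __init__(self, *sinks: Sink) -> None:
        self.sinks = sinks

    def publish(self, record: dict) -> None:
        for sink in self.sinks:
            sink.write(record)


stream_sink = InMemorySink()     # stands in for the Kafka -> Flink path
warehouse_sink = InMemorySink()  # stands in for the Iceberg warehouse
writer = DualWriter(stream_sink, warehouse_sink)
writer.publish({"member_id": 42, "event": "play"})
```

The payoff of this shape is that real-time consumers and historical backfills read the same records without coordinating with each other.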

Apache Flink is the processing engine. Netflix runs 20,000+ Flink jobs concurrently, handling everything from graph materialization to observability analytics to ad event processing. The Data Mesh platform writes 5 million records per second across these pipelines.

| Year  | Daily Events | Key Innovation           |
|-------|--------------|--------------------------|
| 2011  | 45 billion   | Chukwa-based ingestion   |
| 2015  | 500 billion  | Keystone pipeline        |
| 2017  | 1 trillion   | Managed Kafka platform   |
| 2021+ | 2+ trillion  | Data Mesh + Streaming SQL |

Streaming SQL: Democratizing Real-Time Processing

The most impactful recent evolution was not a scale increase -- it was an accessibility one. Netflix introduced Streaming SQL in Data Mesh, wrapping Flink's complex DataStream API behind standard SQL.

The results were immediate: 1,200 SQL processors created within one year of launch, built by non-infrastructure teams. The platform now processes 100 million events per second across 5,000+ pipelines. Netflix won the Confluent Data Streaming Award for this work.

-- Netflix Data Mesh Streaming SQL Processor
-- Domain experts write standard SQL against streaming sources
SELECT
    member_id,
    content_id,
    TUMBLE_START(event_time, INTERVAL '5' MINUTE) as window_start,
    COUNT(*) as interaction_count
FROM member_events
GROUP BY
    member_id,
    content_id,
    TUMBLE(event_time, INTERVAL '5' MINUTE)

This is the democratization pattern in action: build complex infrastructure (Flink), then wrap it in an accessible interface (SQL). Domain experts build data products without being stream processing specialists.

Netflix explicitly warns against the opposite approach: "Don't stream all the things." When they migrated critical pipelines from 24-hour batch latency to real-time, they documented the "pioneer tax" -- increased on-call burden, JAR hell, and complex failure recovery. Batch processing remains the right choice when real-time does not add measurable business value.


The Real-Time Distributed Graph: Architecture Under the Hood

In October 2025, Netflix published the architecture behind their Real-Time Distributed Graph (RDG) -- a system modeling member interactions at internet scale. The numbers: 8 billion+ nodes, 150 billion+ edges, sustaining 2 million reads/second and 6 million writes/second.

What makes this architecturally instructive is not the scale but the storage decision.


Why Netflix Rejected Graph Databases

Netflix evaluated and rejected Neo4j for the RDG storage layer. Neo4j performed well for millions of records but became inefficient beyond hundreds of millions due to high memory requirements and limited horizontal scaling.

Instead, they chose KVDAL (Key-Value Data Abstraction Layer), built on Apache Cassandra. The storage layer spans approximately 27 namespaces across 12 Cassandra clusters backed by 2,400 EC2 instances. EVCache (Memcached-based) sits in front of Cassandra, providing sub-millisecond read latency on hot data.

| Criteria             | Neo4j               | Cassandra + KVDAL                |
|----------------------|---------------------|----------------------------------|
| Scale                | Millions of records | Billions+ (8B nodes, 150B edges) |
| Horizontal scaling   | Limited             | Linear                           |
| Write performance    | Moderate            | 6M writes/sec                    |
| Read latency (cached)| N/A                 | Sub-millisecond (EVCache)        |
| Netflix verdict      | Rejected            | Selected                         |

The Data Abstraction Layer Pattern

The critical design decision here is not "use Cassandra" -- it is the abstraction layer. Applications interact with KVDAL via gRPC, so storage backends can be swapped without code changes. The namespace model supports flexible backends: different namespaces can use different Cassandra clusters or entirely different storage technologies.

In my experience building distributed storage systems, this pattern pays for itself the first time you need to migrate backends. Netflix's approach -- evaluate with data, abstract the interface, isolate by namespace -- is directly adoptable regardless of your scale.
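The abstraction-layer idea is simple enough to sketch. This is a minimal illustration of the pattern, not KVDAL's actual interface: class names, the namespace registry, and the in-memory backend are all assumptions for the example.

```python
# Sketch of a namespace-scoped key-value abstraction: applications talk to
# one interface, and each namespace maps to a swappable storage backend.
from abc import ABC, abstractmethod


class Backend(ABC):
    @abstractmethod
    def get(self, key): ...

    @abstractmethod
    def put(self, key, value): ...


class DictBackend(Backend):
    """In-memory stand-in for a real store (Cassandra, EVCache, ...)."""

    def __init__(self):
        self._data = {}

    def get(self, key):
        return self._data.get(key)

    def put(self, key, value):
        self._data[key] = value


class KeyValueStore:
    """Namespace -> backend routing; callers never see the backend type."""

    def __init__(self):
        self._namespaces = {}

    def register_backend(self, namespace, backend: Backend):
        self._namespaces[namespace] = backend

    def get(self, namespace, key):
        return self._namespaces[namespace].get(key)

    def put(self, namespace, key, value):
        self._namespaces[namespace].put(key, value)


store = KeyValueStore()
store.register_backend("graph_edges", DictBackend())
store.put("graph_edges", "member:42", ["show:1", "show:2"])
```

Because callers only ever see `KeyValueStore`, migrating a namespace to a different backend is a registration change, not an application change.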


A/B Testing as an Architectural Principle

At Netflix, every product change goes through A/B testing before becoming the default. This is not a feature -- it is an architectural principle. As Netflix puts it, the goal is "product decisions driven by data, not by the most opinionated and vocal employees."


The Experimentation Platform

Netflix's experimentation platform handles 150K to 450K requests per second with cache-warm latency under 1ms and real-time evaluation averaging approximately 50ms. Allocation is deterministic: a hash of member_id + experiment_id assigns each user to an experiment cell consistently across sessions and devices.

# Netflix experiment allocation pattern (illustrative)
# Each member is assigned to an experiment cell deterministically
# from a hash of member_id + experiment_id
import hashlib

def allocate_member(member_id: int, experiment_id: str, num_cells: int) -> int:
    """Deterministic allocation ensures a consistent user experience
    across sessions, devices, and processes. (Python's built-in hash()
    is salted per process, so a stable hash like SHA-256 is required.)"""
    digest = hashlib.sha256(f"{member_id}:{experiment_id}".encode()).hexdigest()
    return int(digest, 16) % num_cells

# Sequential testing: allows early stopping
# Netflix monitors experiments continuously rather than
# waiting for fixed sample sizes

Sequential testing is critical for infrastructure experiments where bad changes could degrade streaming quality for millions. Unlike fixed-horizon tests, sequential tests let Netflix stop experiments early when results are conclusive, reducing both time and user exposure to suboptimal experiences.
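One classical way to get early stopping is Wald's sequential probability ratio test (SPRT). Netflix's actual machinery (always-valid p-values and related methods) is more sophisticated, so treat this as a sketch of why sequential monitoring can end experiments early, with illustrative hypothesis rates.

```python
# SPRT for a Bernoulli metric: accumulate a log-likelihood ratio per
# observation and stop as soon as it crosses either decision boundary.
import math


def sprt(observations, p0=0.10, p1=0.12, alpha=0.05, beta=0.20):
    """Test H0: rate = p0 vs H1: rate = p1 over a stream of 0/1 outcomes.

    Returns (decision, n_observations_used)."""
    upper = math.log((1 - beta) / alpha)   # cross -> conclude H1
    lower = math.log(beta / (1 - alpha))   # cross -> conclude H0
    llr = 0.0
    for i, success in enumerate(observations, start=1):
        if success:
            llr += math.log(p1 / p0)
        else:
            llr += math.log((1 - p1) / (1 - p0))
        if llr >= upper:
            return "accept_h1", i
        if llr <= lower:
            return "accept_h0", i
    return "continue", len(observations)
```

With the defaults above, an unbroken run of successes is conclusive after only 16 observations, instead of waiting for a fixed-horizon sample size.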

The 2025 Evolution: ML-Optimized Experimentation

In 2025, Netflix began using machine learning to optimize A/B testing. Adaptive causal-inference models reduce experiment duration by up to 40%. Combined with server-driven UI -- which enables experimentation without app store releases -- Netflix continuously iterates on the experience of 300M+ subscribers.

Their causal inference extends well beyond simple A/B testing: contextual bandits for content matching, counterfactual logging for offline experiments, and surrogate metrics for inferring long-term effects from short-term data. Data scientists analyze billions of rows on single machines using Python and R -- a deliberate architectural choice prioritizing analyst productivity over distributed computing complexity.
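A contextual bandit can be sketched in its simplest epsilon-greedy form. This is a generic textbook stand-in for the technique the article names, not Netflix's implementation; the context and arm model are invented for illustration.

```python
# Epsilon-greedy contextual bandit: per (context, arm) running mean reward,
# exploring with probability epsilon and exploiting the best arm otherwise.
import random
from collections import defaultdict


class EpsilonGreedyBandit:
    def __init__(self, arms, epsilon=0.1, seed=0):
        self.arms = arms
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # (context, arm) -> pulls
        self.values = defaultdict(float)  # (context, arm) -> mean reward

    def select(self, context):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.arms)  # explore
        # exploit: arm with the highest estimated reward in this context
        return max(self.arms, key=lambda a: self.values[(context, a)])

    def update(self, context, arm, reward):
        key = (context, arm)
        self.counts[key] += 1
        # incremental running mean
        self.values[key] += (reward - self.values[key]) / self.counts[key]
```

Logging which arm was shown and with what probability is what later enables the counterfactual ("what if we had shown something else?") analysis the article mentions.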


Observability as an Architectural Decision Engine

Netflix's observability stack is not just for debugging. It is the feedback loop that drives architectural evolution.

The numbers: Atlas processes 17 billion metrics per day. The platform handles 700 billion distributed traces per day and 1.5 petabytes of log data. All of this costs less than 5% of Netflix's total infrastructure spend -- a deliberate and measured investment.


The Trace Explosion Problem

Consider encoding a single episode of Squid Game Season 2. According to Netflix engineers at QCon London 2025, this generates 1 million trace spans, 140 video encodes, 552 audio encodes, and consumes 122,000 CPU hours.

At that density, traditional tracing tools collapse. Requests with 300K+ spans overwhelm conventional visualization. Netflix solved this with a request-first tree visualization and stream processing via Flink, transforming raw spans into actionable business intelligence.

The high-cardinality metrics client uses metadata tagging and a taxonomy service exposed via GraphQL API, ensuring consistent metadata across hundreds of services.
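The shape of taxonomy-validated tagging can be sketched as a registry of allowed tag keys and values that every metric emission is checked against. This is a guess at the idea, not Netflix's taxonomy service or its API; all names here are hypothetical.

```python
# Sketch of taxonomy-enforced metric tags: a central registry of allowed
# keys/values keeps naming consistent and cardinality bounded.
class Taxonomy:
    def __init__(self, allowed: dict[str, set[str]]):
        self.allowed = allowed

    def validate(self, tags: dict[str, str]) -> None:
        for key, value in tags.items():
            if key not in self.allowed:
                raise ValueError(f"unknown tag key: {key!r}")
            if value not in self.allowed[key]:
                raise ValueError(f"value {value!r} not allowed for tag {key!r}")


taxonomy = Taxonomy({
    "region": {"us-east-1", "eu-west-1"},
    "service": {"playback", "encoding"},
})
taxonomy.validate({"region": "us-east-1", "service": "playback"})  # passes
```

Rejecting free-form tag values at write time is what keeps "hundreds of services" from each inventing their own spelling of the same dimension.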

Observability Driving Business Outcomes

The business outcomes from this investment are concrete: ROI-based resource allocation, workflow caching without user intervention, and measurable cost efficiency improvements. Netflix also built Inca, a message-level tracing system for data pipelines where each message gets a UUID, enabling detection of loss and duplicates across trillions of daily events.
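The Inca idea, as described, reduces to tagging every message with a UUID and reconciling the sent and received sets. The sketch below shows that accounting in miniature; the function names are illustrative, not Netflix's.

```python
# Sketch of message-level pipeline auditing: UUID-tag each message, then
# compare sent vs received IDs to detect loss and duplication.
import uuid
from collections import Counter


def tag(payload):
    """Wrap a payload with a unique message ID."""
    return {"id": str(uuid.uuid4()), "payload": payload}


def audit(sent, received):
    """Return (lost_ids, duplicated_ids) for a batch of messages."""
    sent_ids = {m["id"] for m in sent}
    recv_counts = Counter(m["id"] for m in received)
    lost = sent_ids - recv_counts.keys()
    duplicated = {mid for mid, n in recv_counts.items() if n > 1}
    return lost, duplicated
```

At trillions of events per day the real system has to do this incrementally and with bounded state, but the invariant being checked is exactly this one.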

The key insight: observability at Netflix is not a cost center. It is the mechanism by which data shapes architecture. When encoding costs spike for a particular content type, observability data drives the decision to cache workflows. When trace analysis reveals inefficient service-to-service calls, it informs decomposition decisions.

Related talk: "From Confusion to Clarity: Advanced Observability Strategies for Media Workflows at Netflix" (InfoQ) -- Naveen Mareddy and Sujana Sooreddy discuss the evolution of Netflix's media processing observability, moving from monolithic tracing to a high-cardinality analytics platform. They explain how to handle "trace explosion" using stream processing and a "request-first" tree visualization, and share how to transform raw spans into actionable business intelligence.

Chaos Engineering and Resilience: Breaking Things With Data

Netflix invented chaos engineering in 2011 with Chaos Monkey, which randomly terminates production VM instances. It evolved into the Simian Army -- Latency Monkey, Conformity Monkey, Doctor Monkey, Security Monkey -- each injecting different failure modes. The discipline was formalized in a 2017 whitepaper establishing five core principles.

The industry data validates the approach: organizations adopting chaos engineering report a 35% average reduction in outages and 41% improvement in MTTR. In 2024, TravelTech implemented Chaos Monkey and discovered a single point of failure in payment processing, preventing a potential outage affecting 30,000+ customers.

Resilience Evolution: From Libraries to Infrastructure

Netflix's resilience approach has fundamentally shifted. The original library-based patterns (Hystrix for circuit breaking, Ribbon for client load balancing) have been deprecated in favor of infrastructure-based resilience via Envoy service mesh -- zero-configuration resilience that does not require application code changes.
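The pattern that moved from library to mesh is the classic circuit breaker. Here is a minimal sketch of its state machine; the thresholds and names are the textbook pattern, not Hystrix's, Resilience4j's, or Envoy's defaults.

```python
# Minimal circuit breaker: open after N consecutive failures, reject calls
# while open, and allow a trial call after a reset timeout (half-open).
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result
```

The shift Netflix made is about where this logic lives: in a mesh sidecar it is configured once per service, instead of being re-implemented (and re-tuned) inside every application.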

| Era  | Pattern                | Implementation            | Status (2026)    |
|------|------------------------|---------------------------|------------------|
| 2011 | Chaos testing          | Chaos Monkey / Simian Army | Active (evolved) |
| 2012 | Circuit breaker        | Hystrix                   | Deprecated       |
| 2014 | Client load balancing  | Ribbon                    | Deprecated       |
| 2015 | API gateway            | Zuul                      | Zuul 2 (Netty)   |
| 2020 | Circuit breaker v2     | Resilience4j              | Active           |
| 2026 | Service mesh           | Envoy proxies             | Active (new)     |

Stateful Systems and Automated Mitigation

Joseph Lynch's work on Netflix's stateful systems demonstrates data-driven reliability engineering. Near-caches handle billions of requests per second at sub-100-microsecond latency. When a KeyValueService experienced unexpected traffic doubling, automated mitigation recovered the system within 5 minutes -- no human intervention required.

The five principles of chaos engineering remain foundational: (1) build a hypothesis around steady state, (2) vary real-world events, (3) run experiments in production, (4) automate continuous experiments, and (5) minimize blast radius. But the real lesson is that chaos engineering without robust observability is just breaking things. You need the feedback loop.
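Principle 1 -- a hypothesis around steady state -- can be expressed as a tiny experiment harness. Everything below (the fake service, the latency numbers, the function names) is illustrative, not a real chaos tool.

```python
# Sketch of a chaos experiment: verify a steady-state hypothesis holds
# both before and during an injected fault.
import contextlib


def run_experiment(measure, inject_fault, steady_state, samples=5):
    """Return True if the steady-state hypothesis survives the fault."""
    baseline = [measure() for _ in range(samples)]
    assert all(steady_state(m) for m in baseline), "not steady before fault"
    with inject_fault():
        observed = [measure() for _ in range(samples)]
    return all(steady_state(m) for m in observed)


# A fake service whose latency (ms) degrades while a fault is injected
latency = {"extra": 0.0}


def measure():
    return 50.0 + latency["extra"]


@contextlib.contextmanager
def inject_fault():
    latency["extra"] = 20.0  # e.g. injected network delay
    try:
        yield
    finally:
        latency["extra"] = 0.0  # minimize blast radius: always clean up
```

The `measure` function is the observability dependency made explicit: without a trustworthy signal to assert against, the fault injection teaches you nothing.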

The Live Streaming Stress Test

Even Netflix's battle-tested architecture has limits. The Tyson-Paul fight in November 2024 drew 65 million concurrent streams and 108 million total viewers, generating 100K+ Downdetector reports. CDN limitations were exposed.

But microservices isolation proved its worth: on-demand streaming was NOT affected. The failure was contained to the live event. This is resilience architecture working as designed -- not preventing all failures, but preventing failures from cascading.


The GraphQL Federation and Platform Engineering Story

Netflix's API evolution tells a story about data-driven platform decisions. The progression: REST ("OpenAPI") to "API.next" to "DNA" (GraphQL-like) to Federated GraphQL with the DGS Framework.

Today, 250+ Domain Graph Services maintained by 200+ teams compose a unified API graph. The gateway processes thousands of queries per second with sub-100ms response times and query planning overhead under 10ms.

// Netflix DGS Framework - Domain Graph Service
@DgsComponent
public class ShowsDataFetcher {
    @DgsQuery
    public List<Show> shows(@InputArgument String titleFilter) {
        // Each team owns their domain's data fetchers
        // Composed into unified supergraph via federation
        return showsService.getShows(titleFilter);
    }

    @DgsData(parentType = "Show", field = "reviews")
    public List<Review> reviews(DgsDataFetchingEnvironment dfe) {
        Show show = dfe.getSource();
        return reviewsService.getReviewsForShow(show.getId());
    }
}

Java at Netflix Scale

Netflix runs 2,800 Java applications with approximately 1,500 internal libraries. Their migration from Java 8 to Java 17 delivered 20% better CPU usage with zero code changes -- a data-driven validation that justified the migration effort across all 2,800 applications.

Java 21 virtual threads are described as "the most exciting Java feature since lambdas" by Netflix engineers, with optimal results for Tomcat thread pools and GraphQL query execution. However, gRPC worker pools showed a performance decrease. This is data-driven decision making in action -- adopt where the numbers support it, hold where they do not.

The Platform Engineering Flywheel

Netflix's workflow orchestrator Maestro handles hundreds of thousands of workflows and 2 million jobs per day, achieving a 100x performance improvement via an actor model combined with Java 21 virtual threads. Their incremental processing with Apache Iceberg reduced costs to 10% of the original pipeline for some workflows while improving data freshness from daily to hourly.
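The core of incremental processing is a watermark: each run consumes only partitions newer than the last one processed, instead of recomputing the whole table. This sketch is in the spirit of the Maestro + Iceberg approach; the partition and watermark model is illustrative, not their implementation.

```python
# Sketch of watermark-based incremental processing: process only partitions
# newer than the stored watermark, then advance the watermark.
def incremental_run(partitions, watermark, process):
    """partitions: dict of partition_date (str) -> rows.

    Processes partitions with date > watermark in order and returns the
    new watermark (unchanged if nothing new arrived)."""
    new_dates = sorted(d for d in partitions if d > watermark)
    for d in new_dates:
        process(partitions[d])
    return new_dates[-1] if new_dates else watermark
```

Runs become proportional to new data rather than total data, which is where cost reductions like "10% of the original pipeline" come from while freshness improves.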

The container platform Titus launches 1M+ containers per week. Spinnaker supports 4,000+ deploys per day. Netflix spends $150 million annually on compute and storage for data pipelines alone.

| Platform      | Purpose                | Scale                      |
|---------------|------------------------|----------------------------|
| DGS Framework | GraphQL Federation     | 250+ services              |
| Maestro       | Workflow orchestration | 2M jobs/day                |
| Metaflow      | ML infrastructure      | 3,000+ projects at Netflix |
| Titus         | Container management   | 1M+ containers/week        |
| Spinnaker     | Continuous delivery    | 4,000+ deploys/day         |
| Atlas         | Telemetry              | 17B metrics/day            |

The open-source flywheel is deliberate. DGS, Maestro, Metaflow (used by hundreds of companies for ML), and Spinnaker create external contributions that flow back into Netflix's platform investment.

Related talk: "Efficient Incremental Processing with Netflix Maestro and Apache Iceberg" (InfoQ) -- Jun He discusses how to use an IPS to build more reliable, efficient, and scalable data pipelines, unlocking new data processing patterns.

What NOT to Copy: The Anti-Patterns

This is the most important section of this article.

"Netflix's architecture is for Netflix's org chart, not your startup." A 10-person team with 50 microservices creates operational overhead that destroys velocity. Netflix's 1,000+ microservices reflect a 10,000+ person engineering organization. Conway's Law is not a suggestion -- it is a constraint.

The "Don't Stream All the Things" Warning

Netflix explicitly warns against universal stream processing. Their migration from batch to streaming documented the pioneer tax: increased on-call burden, JAR hell between Flink and Netflix OSS libraries, and complex failure recovery. Streaming failures must be addressed immediately -- unlike batch, where you re-run the job.

The Cargo-Culting Trap

Three patterns I see teams consistently get wrong:

  1. Chaos engineering without observability. You break things but cannot learn from failures. Invest in monitoring first.
  2. Microservices without a platform team. Every team reinvents deployment, monitoring, and configuration. The overhead kills you.
  3. Building on deprecated Netflix OSS. Adopting Hystrix, Ribbon, or Zuul 1.x in 2025+ creates immediate technical debt. Use Resilience4j, Spring Cloud Load Balancer, and Spring Cloud Gateway instead.

Even Netflix stumbled at their own game. The Tyson-Paul fight generated 100K+ Downdetector reports, proving that on-demand architecture does not automatically translate to live event capability.

The Right Approach: Adopt Patterns, Not Tools

Start with a monolith. Extract services when pain points emerge organically. Prioritize in this order: (1) observability first, (2) experimentation platform, (3) event-driven communication, (4) microservices only when needed.

| Pattern                  | Adopt When                                 | Skip When                        |
|--------------------------|--------------------------------------------|----------------------------------|
| Event-driven (Kafka)     | Multiple teams need async communication    | Single team, synchronous is fine |
| Stream processing (Flink)| Real-time adds measurable business value   | Batch latency is acceptable      |
| A/B testing platform     | 10+ experiments/quarter                    | Fewer than 5 experiments/year    |
| Chaos engineering        | Running 50+ microservices in production    | Fewer than 10 services           |
| GraphQL Federation       | 5+ teams need API ownership                | Single API team                  |
| Data Mesh                | Multiple data domains with different owners | Centralized data team           |

Conclusion: Building Your Data-Driven Architecture

Netflix's power is not scale. It is the feedback loop between data and decisions.

They never did a "big rewrite." Every architectural evolution -- from monolith to microservices, from batch to streaming, from REST to GraphQL Federation, from Hystrix to Envoy -- was measured, validated against production data, and rolled out incrementally. The Java 17 migration happened because they measured a 20% CPU improvement. Streaming SQL replaced the Flink DataStream API because they measured 1,200 new processors in a year from non-infrastructure teams.

Any organization can start building this feedback loop with three pillars:

  1. Instrument everything. You cannot make data-driven decisions without data. Netflix invests less than 5% of infrastructure costs in observability -- and considers it their highest-leverage architectural investment.
  2. Experiment on everything. Build an A/B testing capability. It does not need to handle 450K RPS. It needs to exist so that decisions are driven by evidence, not opinions.
  3. Let data drive architecture. When Netflix evaluated Neo4j vs. Cassandra for their distributed graph, they measured at scale and chose the tool that survived the data. Do the same with your technology decisions.

Pick ONE pattern from this article. Implement it in your current architecture. Measure the result. That is the Netflix way -- not copying their tools, but copying their discipline.

The best architecture is the one that can prove why it made the choices it did.


Resources

Books:

  • Designing Data-Intensive Applications by Martin Kleppmann (O'Reilly) - Foundational theory for Netflix's data pipeline patterns
  • Chaos Engineering: System Resiliency in Practice by Casey Rosenthal & Nora Jones (O'Reilly) - Written by Netflix chaos engineering pioneers
  • Data Mesh by Zhamak Dehghani (O'Reilly) - The architectural philosophy Netflix adopted
  • Microservices Patterns by Chris Richardson (Manning) - Pattern catalog applicable to Netflix's architecture
  • Observability Engineering by Charity Majors, Liz Fong-Jones & George Miranda (O'Reilly) - Principles behind Netflix's observability stack


Did you find this article helpful? Follow me for more content on AWS, GenAI, and Cloud Architecture!
