<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: David Marcelo Petrocelli</title>
    <description>The latest articles on DEV Community by David Marcelo Petrocelli (@david_marcelopetrocelli_).</description>
    <link>https://dev.to/david_marcelopetrocelli_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2543178%2F3a31cd9c-8f87-4afe-8d2b-6f888d30d8ca.png</url>
      <title>DEV Community: David Marcelo Petrocelli</title>
      <link>https://dev.to/david_marcelopetrocelli_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/david_marcelopetrocelli_"/>
    <language>en</language>
    <item>
      <title>How Spotify Uses Data to Build the Product 713 Million Users Actually Want</title>
      <dc:creator>David Marcelo Petrocelli</dc:creator>
      <pubDate>Tue, 03 Mar 2026 14:14:26 +0000</pubDate>
      <link>https://dev.to/david_marcelopetrocelli_/how-spotify-uses-data-to-build-the-product-713-million-users-actually-want-j42</link>
      <guid>https://dev.to/david_marcelopetrocelli_/how-spotify-uses-data-to-build-the-product-713-million-users-actually-want-j42</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Difficulty Level: 300 - Advanced&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spotify processes 1 trillion+ events per day through 38,000+ active data pipelines — every play, skip, and save is a signal that feeds back into every product decision&lt;/li&gt;
&lt;li&gt;Discover Weekly generated 100 billion+ streams in its first 10 years using three ML layers: collaborative filtering, NLP, and audio CNNs — now augmented with LLMs via custom Semantic IDs&lt;/li&gt;
&lt;li&gt;Their A/B testing culture runs tens of thousands of experiments/year across 300+ teams — including 520 experiments in a single year on one screen — and they measure learning rate (64%), not just win rate (12%)&lt;/li&gt;
&lt;li&gt;Backstage, born as Spotify's internal developer portal, catalogs 2,000+ services and 4,000 data pipelines — and is now used by 3,000+ companies as the CNCF standard&lt;/li&gt;
&lt;li&gt;The real lesson isn't any single tool: it's the tight coupling between organizational design (squads own their services) and technical design (services are independently deployable)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;The 1 Trillion Events Question&lt;/h2&gt;

&lt;p&gt;Spotify hit 713 million monthly active users in Q3 2025. That number looks impressive in a press release and terrifying in a system design meeting.&lt;/p&gt;

&lt;p&gt;Scale alone doesn't explain Spotify's success. What matters is that every one of those events — every play, every skip, every playlist add at 2am — feeds directly into product decisions. Not after a quarterly review. In near real-time.&lt;/p&gt;

&lt;p&gt;Most companies collect data and build dashboards. Spotify built a closed loop: user behavior shapes the product, the product generates more behavior, and the cycle compounds over 20 years of iteration. In 2024, Spotify posted its first annual profit: €1.1B on €15.6B in revenue. The closed loop is working.&lt;/p&gt;

&lt;p&gt;After years of building data systems for enterprise clients and teaching these patterns at university, I've found that the most common mistake teams make is copying Spotify's tools rather than their discipline. In this article I'll break down the actual mechanisms behind their data pipeline, recommendation engine, experimentation culture, and developer platform — and tell you which patterns you can realistically steal.&lt;/p&gt;

&lt;h3&gt;Prerequisites&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Familiarity with stream processing concepts (Kafka, Pub/Sub, or similar)&lt;/li&gt;
&lt;li&gt;Basic understanding of microservices architecture (service decomposition, database-per-service)&lt;/li&gt;
&lt;li&gt;Experience with A/B testing fundamentals&lt;/li&gt;
&lt;li&gt;Some exposure to ML recommendation systems (collaborative filtering concepts)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;What You'll Learn&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;How Spotify's event pipeline evolved from self-managed Kafka to GCP Pub/Sub at 3 million events/second&lt;/li&gt;
&lt;li&gt;Why Discover Weekly uses three separate ML layers and what each one contributes&lt;/li&gt;
&lt;li&gt;How their A/B testing culture measures 64% learning rate instead of just win rate&lt;/li&gt;
&lt;li&gt;What Backstage is and why 3,000+ companies adopted it after Spotify open-sourced it&lt;/li&gt;
&lt;li&gt;Which Spotify patterns scale down to your team — and which ones don't&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;The Numbers Behind 713 Million Users&lt;/h2&gt;

&lt;p&gt;The scale numbers aren't just impressive — they explain every architectural decision.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Monthly Active Users&lt;/td&gt;
&lt;td&gt;713M (Q3 2025)&lt;/td&gt;
&lt;td&gt;Up from 600M in mid-2024&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Premium Subscribers&lt;/td&gt;
&lt;td&gt;281M&lt;/td&gt;
&lt;td&gt;~39% conversion rate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Annual Revenue&lt;/td&gt;
&lt;td&gt;€15.6B (2024)&lt;/td&gt;
&lt;td&gt;First profitable year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Music Catalog&lt;/td&gt;
&lt;td&gt;100M+ tracks&lt;/td&gt;
&lt;td&gt;Grows ~60K tracks/day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Podcasts&lt;/td&gt;
&lt;td&gt;~7M titles&lt;/td&gt;
&lt;td&gt;Second only to Apple&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Events per day&lt;/td&gt;
&lt;td&gt;1 trillion+&lt;/td&gt;
&lt;td&gt;1,800+ event types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Active data pipelines&lt;/td&gt;
&lt;td&gt;38,000+&lt;/td&gt;
&lt;td&gt;Hourly + daily scheduled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Production components&lt;/td&gt;
&lt;td&gt;Thousands&lt;/td&gt;
&lt;td&gt;80%+ fleet-managed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A/B experiments/year&lt;/td&gt;
&lt;td&gt;Tens of thousands&lt;/td&gt;
&lt;td&gt;300+ teams running tests&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Discover Weekly streams (10yr)&lt;/td&gt;
&lt;td&gt;100 billion+&lt;/td&gt;
&lt;td&gt;56M new discoveries/week&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This scale didn't emerge from a grand architectural vision. It's the result of 20+ years of small, data-driven decisions — each one measured, validated, and shipped incrementally.&lt;/p&gt;




&lt;h2&gt;From Monolith to Thousands of Microservices&lt;/h2&gt;

&lt;p&gt;Spotify started as a monolithic Python application in 2006. By 2010, the codebase had grown to the point where no single team could understand all of it, and deployments required coordinating across multiple squads.&lt;/p&gt;

&lt;p&gt;The migration to microservices wasn't a big-bang rewrite. It was driven by a single organizational principle: &lt;strong&gt;each squad should be able to deploy independently, without coordinating with other teams.&lt;/strong&gt; If a team needed another team's sign-off to ship, something was wrong — either in the service design or the org structure.&lt;/p&gt;

&lt;h3&gt;The Database-Per-Service Pattern&lt;/h3&gt;

&lt;p&gt;Each Spotify microservice owns its own data store, chosen for the access patterns of that service:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cassandra + BigTable&lt;/strong&gt;: High-speed key-value lookups (user state, session data, real-time features)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;PostgreSQL&lt;/strong&gt;: Transactional data (payments, account management)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Cloud Storage&lt;/strong&gt;: Large objects (audio files, model artifacts)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;BigQuery&lt;/strong&gt;: Analytical queries and data pipelines&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By 2023, the number of distinct production components had grown to "thousands" — enough that Spotify needed a new abstraction to manage them: Fleet Management.&lt;/p&gt;

&lt;h3&gt;Fleet Management: Treating Services as a Fleet&lt;/h3&gt;

&lt;p&gt;The key insight behind Fleet Management is that individual service owners are blind to fleet-wide patterns. If 300 teams each manage their own dependencies, you get 300 different versions of Log4j in production. You can't patch a critical vulnerability in 9 hours by asking each team to update manually.&lt;/p&gt;

&lt;p&gt;Fleet Management flips the model: infrastructure defaults to secure and up-to-date, and teams opt out for exceptions (with documented justification).&lt;/p&gt;
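&lt;p&gt;The opt-out model can be sketched in a few lines. This is an illustrative sketch, not Spotify's actual tooling — the &lt;code&gt;FLEET_DEFAULTS&lt;/code&gt; map, the field names, and the justification rule are all assumptions:&lt;/p&gt;

```python
# Illustrative sketch of "default-on, documented opt-out" fleet policy.
# FLEET_DEFAULTS and all field names are invented for this example.

FLEET_DEFAULTS = {"log4j": "2.17.1", "base_image": "jre-17"}

def resolve_config(service):
    # Start from the fleet-wide, centrally-maintained defaults
    config = dict(FLEET_DEFAULTS)
    # Apply per-service exceptions, but only if each one is justified
    for dep, pin in service.get("opt_outs", {}).items():
        if not pin.get("justification"):
            raise ValueError(f"{service['name']}: opt-out for {dep} needs a documented justification")
        config[dep] = pin["version"]
    return config

svc = {
    "name": "playlist-api",
    "opt_outs": {"log4j": {"version": "2.17.0", "justification": "pinned for plugin compatibility"}},
}
assert resolve_config(svc)["log4j"] == "2.17.0"   # documented exception honored
```

The point of the inversion is in the failure mode: an unjustified exception fails loudly at config-resolution time, instead of silently drifting out of date.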

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBTcXVhZHNbIjMwMCsgU3F1YWRzIl0KICAgICAgICBTMVsiU3F1YWQgQTxici8-U2VydmljZSArIERCIl0KICAgICAgICBTMlsiU3F1YWQgQjxici8-U2VydmljZSArIERCIl0KICAgICAgICBTM1siU3F1YWQgTjxici8-U2VydmljZSArIERCIl0KICAgIGVuZAoKICAgIHN1YmdyYXBoIEZsZWV0WyJGbGVldCBNYW5hZ2VtZW50Il0KICAgICAgICBCUVsiQmlnUXVlcnk8YnIvPkNvZGViYXNlIEluZGV4Il0KICAgICAgICBGTVsiRmxlZXQgQXV0b21hdGlvbjxici8-MzAwSyBjaGFuZ2VzL3llYXIiXQogICAgICAgIEJTWyJCYWNrc3RhZ2U8YnIvPkNhdGFsb2ciXQogICAgZW5kCgogICAgc3ViZ3JhcGggR0NQWyJHb29nbGUgQ2xvdWQgUGxhdGZvcm0iXQogICAgICAgIEs4U1siS3ViZXJuZXRlczxici8-T3BlcmF0b3JzIl0KICAgICAgICBDUkRbIkN1c3RvbSBSZXNvdXJjZTxici8-RGVmaW5pdGlvbnMiXQogICAgZW5kCgogICAgUzEgJiBTMiAmIFMzIC0tPiBCUwogICAgQlMgLS0-IEJRCiAgICBCUSAtLT4gRk0KICAgIEZNIC0tPiBLOFMKICAgIEs4UyAtLT4gQ1JECiAgICBDUkQgLS4tPnxyZWNvbmNpbGV8IFMxICYgUzIgJiBTMwoKICAgIHN0eWxlIFNxdWFkcyBmaWxsOiNlOGY1ZTkKICAgIHN0eWxlIEZsZWV0IGZpbGw6I2UzZjJmZAogICAgc3R5bGUgR0NQIGZpbGw6I2ZmZjNlMA%3D%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBTcXVhZHNbIjMwMCsgU3F1YWRzIl0KICAgICAgICBTMVsiU3F1YWQgQTxici8-U2VydmljZSArIERCIl0KICAgICAgICBTMlsiU3F1YWQgQjxici8-U2VydmljZSArIERCIl0KICAgICAgICBTM1siU3F1YWQgTjxici8-U2VydmljZSArIERCIl0KICAgIGVuZAoKICAgIHN1YmdyYXBoIEZsZWV0WyJGbGVldCBNYW5hZ2VtZW50Il0KICAgICAgICBCUVsiQmlnUXVlcnk8YnIvPkNvZGViYXNlIEluZGV4Il0KICAgICAgICBGTVsiRmxlZXQgQXV0b21hdGlvbjxici8-MzAwSyBjaGFuZ2VzL3llYXIiXQogICAgICAgIEJTWyJCYWNrc3RhZ2U8YnIvPkNhdGFsb2ciXQogICAgZW5kCgogICAgc3ViZ3JhcGggR0NQWyJHb29nbGUgQ2xvdWQgUGxhdGZvcm0iXQogICAgICAgIEs4U1siS3ViZXJuZXRlczxici8-T3BlcmF0b3JzIl0KICAgICAgICBDUkRbIkN1c3RvbSBSZXNvdXJjZTxici8-RGVmaW5pdGlvbnMiXQogICAgZW5kCgogICAgUzEgJiBTMiAmIFMzIC0tPiBCUwogICAgQlMgLS0-IEJRCiAgICBCUSAtLT4gRk0KICAgIEZNIC0tPiBLOFMKICAgIEs4UyAtLT4gQ1JECiAgICBDUkQgLS4tPnxyZWNvbmNpbGV8IFMxICYgUzIgJiBTMwoKICAgIHN0eWxlIFNxdWFkcyBmaWxsOiNlOGY1ZTkKICAgIHN0eWxlIEZsZWV0IGZpbGw6I2UzZjJmZAogICAgc3R5bGUgR0NQIGZpbGw6I2ZmZjNlMA%3D%3D" alt="diagram" width="1741" height="527"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results are concrete:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;300,000+ automated changes&lt;/strong&gt; merged across the fleet in 3 years&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;7,500 automated changes/week&lt;/strong&gt; with 75% auto-merged without human review&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Log4j vulnerability&lt;/strong&gt;: patched to 80% of backend services in 9 hours&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framework updates&lt;/strong&gt;: reach 70% of fleet in under 7 days (previously ~200 days)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;95%&lt;/strong&gt; of Spotify developers report Fleet Management improved software quality&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;The Data Pipeline: How Every Play Becomes a Signal&lt;/h2&gt;

&lt;p&gt;Every user interaction at Spotify — a play, a skip, a search, a playlist add — generates an event. Those events are the raw material for every recommendation, every A/B test result, every product decision.&lt;/p&gt;

&lt;p&gt;Here's how that data flows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBDbGllbnRbIlVzZXIgQ2xpZW50cyJdCiAgICAgICAgaU9TWyJpT1MgQXBwIl0KICAgICAgICBBbmRbIkFuZHJvaWQgQXBwIl0KICAgICAgICBXZWJbIldlYiBQbGF5ZXIiXQogICAgZW5kCgogICAgc3ViZ3JhcGggSW5nZXN0WyJFdmVudCBJbmdlc3Rpb24iXQogICAgICAgIFNES1siQ2xpZW50IFNESzxici8-RXZlbnQgVmFsaWRhdGlvbiJdCiAgICAgICAgUFNbIkdvb2dsZSBDbG91ZDxici8-UHViL1N1Yjxici8-M00gZXZlbnRzL3NlYyBwZWFrIl0KICAgIGVuZAoKICAgIHN1YmdyYXBoIFByb2Nlc3NbIkRhdGEgUHJvY2Vzc2luZyJdCiAgICAgICAgU2Npb1siU2NpbyAvIEFwYWNoZSBCZWFtPGJyLz5CYXRjaCArIFN0cmVhbWluZyJdCiAgICAgICAgRmxpbmtbIkFwYWNoZSBGbGluazxici8-TG93LWxhdGVuY3kgc3RyZWFtaW5nIl0KICAgICAgICBERlsiR29vZ2xlIERhdGFmbG93PGJyLz5NYW5hZ2VkIEJlYW0iXQogICAgZW5kCgogICAgc3ViZ3JhcGggU3RvcmVbIkRhdGEgU3RvcmVzIl0KICAgICAgICBCUVsiQmlnUXVlcnk8YnIvPjEwTSsgcXVlcmllcy9tb250aCJdCiAgICAgICAgQ1NbIkNhc3NhbmRyYS9CaWdUYWJsZTxici8-RmVhdHVyZSBTdG9yZSJdCiAgICAgICAgR0NTWyJDbG91ZCBTdG9yYWdlPGJyLz5Nb2RlbCBBcnRpZmFjdHMiXQogICAgZW5kCgogICAgaU9TICYgQW5kICYgV2ViIC0tPiBTREsKICAgIFNESyAtLT4gUFMKICAgIFBTIC0tPiBTY2lvCiAgICBQUyAtLT4gRmxpbmsKICAgIFNjaW8gLS0-IERGCiAgICBERiAtLT4gQlEKICAgIEZsaW5rIC0tPiBDUwogICAgQlEgLS0-IEdDUwoKICAgIHN0eWxlIENsaWVudCBmaWxsOiNlM2YyZmQKICAgIHN0eWxlIEluZ2VzdCBmaWxsOiNmM2U1ZjUKICAgIHN0eWxlIFByb2Nlc3MgZmlsbDojZThmNWU5CiAgICBzdHlsZSBTdG9yZSBmaWxsOiNmZmYzZTA%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBDbGllbnRbIlVzZXIgQ2xpZW50cyJdCiAgICAgICAgaU9TWyJpT1MgQXBwIl0KICAgICAgICBBbmRbIkFuZHJvaWQgQXBwIl0KICAgICAgICBXZWJbIldlYiBQbGF5ZXIiXQogICAgZW5kCgogICAgc3ViZ3JhcGggSW5nZXN0WyJFdmVudCBJbmdlc3Rpb24iXQogICAgICAgIFNES1siQ2xpZW50IFNESzxici8-RXZlbnQgVmFsaWRhdGlvbiJdCiAgICAgICAgUFNbIkdvb2dsZSBDbG91ZDxici8-UHViL1N1Yjxici8-M00gZXZlbnRzL3NlYyBwZWFrIl0KICAgIGVuZAoKICAgIHN1YmdyYXBoIFByb2Nlc3NbIkRhdGEgUHJvY2Vzc2luZyJdCiAgICAgICAgU2Npb1siU2NpbyAvIEFwYWNoZSBCZWFtPGJyLz5CYXRjaCArIFN0cmVhbWluZyJdCiAgICAgICAgRmxpbmtbIkFwYWNoZSBGbGluazxici8-TG93LWxhdGVuY3kgc3RyZWFtaW5nIl0KICAgICAgICBERlsiR29vZ2xlIERhdGFmbG93PGJyLz5NYW5hZ2VkIEJlYW0iXQogICAgZW5kCgogICAgc3ViZ3JhcGggU3RvcmVbIkRhdGEgU3RvcmVzIl0KICAgICAgICBCUVsiQmlnUXVlcnk8YnIvPjEwTSsgcXVlcmllcy9tb250aCJdCiAgICAgICAgQ1NbIkNhc3NhbmRyYS9CaWdUYWJsZTxici8-RmVhdHVyZSBTdG9yZSJdCiAgICAgICAgR0NTWyJDbG91ZCBTdG9yYWdlPGJyLz5Nb2RlbCBBcnRpZmFjdHMiXQogICAgZW5kCgogICAgaU9TICYgQW5kICYgV2ViIC0tPiBTREsKICAgIFNESyAtLT4gUFMKICAgIFBTIC0tPiBTY2lvCiAgICBQUyAtLT4gRmxpbmsKICAgIFNjaW8gLS0-IERGCiAgICBERiAtLT4gQlEKICAgIEZsaW5rIC0tPiBDUwogICAgQlEgLS0-IEdDUwoKICAgIHN0eWxlIENsaWVudCBmaWxsOiNlM2YyZmQKICAgIHN0eWxlIEluZ2VzdCBmaWxsOiNmM2U1ZjUKICAgIHN0eWxlIFByb2Nlc3MgZmlsbDojZThmNWU5CiAgICBzdHlsZSBTdG9yZSBmaWxsOiNmZmYzZTA%3D" alt="diagram" width="1835" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;The Migration from Kafka to GCP Pub/Sub&lt;/h3&gt;

&lt;p&gt;In 2016-2017, Spotify migrated their event delivery system from self-managed Kafka clusters to Google Cloud Pub/Sub. This wasn't a trivial decision — Kafka was working. But managing Kafka at Spotify's scale required significant operational overhead that distracted from product engineering.&lt;/p&gt;

&lt;p&gt;The results after migration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Peak throughput scaled from 800,000 to &lt;strong&gt;3,000,000 events/second&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Half a trillion daily ingested events&lt;/strong&gt; (70 TB compressed)&lt;/li&gt;
&lt;li&gt;Pub/Sub handles &lt;strong&gt;1 trillion requests/day&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;BigQuery runs &lt;strong&gt;10 million+ queries and scheduled jobs/month&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
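&lt;p&gt;Before any of that throughput, events pass client-side validation (the "Client SDK" box in the diagram above). A minimal sketch of that step — the schema, field names, and event types here are invented for illustration; the real SDK validates against versioned schemas:&lt;/p&gt;

```python
# Toy sketch of client-side event validation before publishing.
# REQUIRED_FIELDS and KNOWN_EVENT_TYPES are invented placeholders.
import time
import uuid

REQUIRED_FIELDS = {"event_type", "user_id", "timestamp_ms"}
KNOWN_EVENT_TYPES = {"stream_complete", "skip_immediately", "add_to_playlist"}

def validate(event):
    missing = REQUIRED_FIELDS - set(event)
    if missing:
        raise ValueError(f"missing fields: {missing}")
    if event["event_type"] not in KNOWN_EVENT_TYPES:
        raise ValueError(f"unknown event type: {event['event_type']}")

def make_event(event_type, user_id, **payload):
    event = {
        "event_id": str(uuid.uuid4()),   # idempotency key for at-least-once delivery
        "event_type": event_type,
        "user_id": user_id,
        "timestamp_ms": int(time.time() * 1000),
        **payload,
    }
    validate(event)   # reject malformed events at the edge, not in the warehouse
    return event
```

Validating at the edge matters at this scale: a malformed event type that slips past 3 million events/second becomes a very expensive backfill.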

&lt;h3&gt;Scio: Spotify's Open-Source Apache Beam API&lt;/h3&gt;

&lt;p&gt;Spotify developed &lt;a href="https://github.com/spotify/scio" rel="noopener noreferrer"&gt;Scio&lt;/a&gt;, a Scala API for Apache Beam, to process billions of events. It handles both batch and streaming workloads, running on either Dataflow (managed) or Flink (lower-latency) depending on requirements.&lt;/p&gt;

&lt;p&gt;Every data endpoint in the platform has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Retention policies&lt;/strong&gt;: data deleted after defined period&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Access controls&lt;/strong&gt;: squad-level permissions&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lineage tracking&lt;/strong&gt;: full trace from source event to derived dataset&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Quality checks&lt;/strong&gt;: automated alerts for lateness, failures, anomalies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The 38,000+ active pipelines are orchestrated, monitored, and surfaced through Backstage — so any squad can inspect the health of their data at any time.&lt;/p&gt;
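&lt;p&gt;Those four guarantees can be modeled as per-dataset metadata with automated checks. A toy sketch — the class name, fields, and SLA check are hypothetical, not Spotify's platform API:&lt;/p&gt;

```python
# Toy model of per-dataset governance metadata: retention, ownership,
# lineage, and a freshness check. All names here are illustrative.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class DatasetEndpoint:
    name: str
    retention_days: int       # data deleted after this period
    owners: set               # squad-level access control
    upstream: list            # lineage: source datasets this one derives from
    last_success: datetime    # last successful pipeline run

    def is_late(self, sla_hours=24):
        # Quality check: alert if the dataset hasn't landed within its SLA
        age = datetime.now(timezone.utc) - self.last_success
        return age > timedelta(hours=sla_hours)

ds = DatasetEndpoint(
    name="user_stream_counts_daily",
    retention_days=365,
    owners={"squad-personalization"},
    upstream=["raw_play_events"],
    last_success=datetime.now(timezone.utc) - timedelta(hours=2),
)
assert not ds.is_late()   # fresh dataset: no alert
```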




&lt;h2&gt;Recommendations at Scale: Discover Weekly Deconstructed&lt;/h2&gt;

&lt;p&gt;Discover Weekly launched in July 2015 with a simple premise: every Monday morning, 30 personalized songs you've never heard before. In 10 years, it generated 100 billion streams and 56 million new artist discoveries every week.&lt;/p&gt;

&lt;p&gt;That impact comes from a three-layer ML architecture, each layer catching different signals:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBzdWJncmFwaCBTaWduYWxzWyJVc2VyIFNpZ25hbHMiXQogICAgICAgIElTWyJJbXBsaWNpdCBGZWVkYmFjazxici8-cGxheXMsIHNraXBzLCBzYXZlczxici8-cGxheWxpc3QgYWRkcyJdCiAgICAgICAgVFhbIlRleHQgU2lnbmFsczxici8-d2ViIGNyYXdscywgYmxvZ3M8YnIvPnJldmlldyBsYW5ndWFnZSJdCiAgICAgICAgQVVbIkF1ZGlvIFNpZ25hbHM8YnIvPnNwZWN0cm9ncmFtczxici8-d2F2ZWZvcm1zIl0KICAgIGVuZAoKICAgIHN1YmdyYXBoIExheWVyc1siUmVjb21tZW5kYXRpb24gTGF5ZXJzIl0KICAgICAgICBDRlsiTGF5ZXIgMTogQ29sbGFib3JhdGl2ZSBGaWx0ZXJpbmc8YnIvPk1hdHJpeCBGYWN0b3JpemF0aW9uPGJyLz5XaG8gZWxzZSBsaXN0ZW5zIHRvIHRoaXM_Il0KICAgICAgICBOTFBbIkxheWVyIDI6IE5MUCBBbmFseXNpczxici8-V2hhdCBkbyBwZW9wbGUgU0FZPGJyLz5hYm91dCB0aGlzIG11c2ljPyJdCiAgICAgICAgQ05OWyJMYXllciAzOiBBdWRpbyBDTk48YnIvPldoYXQgZG9lcyB0aGlzIG11c2ljPGJyLz5TT1VORCBsaWtlPyJdCiAgICBlbmQKCiAgICBzdWJncmFwaCBSYW5rWyJSYW5raW5nICYgUGVyc29uYWxpemF0aW9uIl0KICAgICAgICBNUlsiTXVsdGktT2JqZWN0aXZlIFJhbmtlcjxici8-UmVsZXZhbmNlICsgTm92ZWx0eSArIERpdmVyc2l0eSJdCiAgICAgICAgTExNWyJMTE0gTGF5ZXIgKDIwMjQrKTxici8-U2VtYW50aWMgSURzPGJyLz5Db250ZXh0ICsgRXhwbGFuYXRpb24iXQogICAgICAgIERXWyJEaXNjb3ZlciBXZWVrbHk8YnIvPjMwIHNvbmdzLCBNb25kYXkgQU0iXQogICAgZW5kCgogICAgSVMgLS0-IENGCiAgICBUWCAtLT4gTkxQCiAgICBBVSAtLT4gQ05OCiAgICBDRiAmIE5MUCAmIENOTiAtLT4gTVIKICAgIE1SIC0tPiBMTE0KICAgIExMTSAtLT4gRFcKCiAgICBzdHlsZSBTaWduYWxzIGZpbGw6I2UzZjJmZAogICAgc3R5bGUgTGF5ZXJzIGZpbGw6I2YzZTVmNQogICAgc3R5bGUgUmFuayBmaWxsOiNlOGY1ZTk%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBzdWJncmFwaCBTaWduYWxzWyJVc2VyIFNpZ25hbHMiXQogICAgICAgIElTWyJJbXBsaWNpdCBGZWVkYmFjazxici8-cGxheXMsIHNraXBzLCBzYXZlczxici8-cGxheWxpc3QgYWRkcyJdCiAgICAgICAgVFhbIlRleHQgU2lnbmFsczxici8-d2ViIGNyYXdscywgYmxvZ3M8YnIvPnJldmlldyBsYW5ndWFnZSJdCiAgICAgICAgQVVbIkF1ZGlvIFNpZ25hbHM8YnIvPnNwZWN0cm9ncmFtczxici8-d2F2ZWZvcm1zIl0KICAgIGVuZAoKICAgIHN1YmdyYXBoIExheWVyc1siUmVjb21tZW5kYXRpb24gTGF5ZXJzIl0KICAgICAgICBDRlsiTGF5ZXIgMTogQ29sbGFib3JhdGl2ZSBGaWx0ZXJpbmc8YnIvPk1hdHJpeCBGYWN0b3JpemF0aW9uPGJyLz5XaG8gZWxzZSBsaXN0ZW5zIHRvIHRoaXM_Il0KICAgICAgICBOTFBbIkxheWVyIDI6IE5MUCBBbmFseXNpczxici8-V2hhdCBkbyBwZW9wbGUgU0FZPGJyLz5hYm91dCB0aGlzIG11c2ljPyJdCiAgICAgICAgQ05OWyJMYXllciAzOiBBdWRpbyBDTk48YnIvPldoYXQgZG9lcyB0aGlzIG11c2ljPGJyLz5TT1VORCBsaWtlPyJdCiAgICBlbmQKCiAgICBzdWJncmFwaCBSYW5rWyJSYW5raW5nICYgUGVyc29uYWxpemF0aW9uIl0KICAgICAgICBNUlsiTXVsdGktT2JqZWN0aXZlIFJhbmtlcjxici8-UmVsZXZhbmNlICsgTm92ZWx0eSArIERpdmVyc2l0eSJdCiAgICAgICAgTExNWyJMTE0gTGF5ZXIgKDIwMjQrKTxici8-U2VtYW50aWMgSURzPGJyLz5Db250ZXh0ICsgRXhwbGFuYXRpb24iXQogICAgICAgIERXWyJEaXNjb3ZlciBXZWVrbHk8YnIvPjMwIHNvbmdzLCBNb25kYXkgQU0iXQogICAgZW5kCgogICAgSVMgLS0-IENGCiAgICBUWCAtLT4gTkxQCiAgICBBVSAtLT4gQ05OCiAgICBDRiAmIE5MUCAmIENOTiAtLT4gTVIKICAgIE1SIC0tPiBMTE0KICAgIExMTSAtLT4gRFcKCiAgICBzdHlsZSBTaWduYWxzIGZpbGw6I2UzZjJmZAogICAgc3R5bGUgTGF5ZXJzIGZpbGw6I2YzZTVmNQogICAgc3R5bGUgUmFuayBmaWxsOiNlOGY1ZTk%3D" alt="diagram" width="874" height="876"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;Layer 1: Collaborative Filtering&lt;/h3&gt;

&lt;p&gt;Collaborative filtering answers the question: &lt;em&gt;who else listens to what you listen to, and what else do they listen to?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Spotify's approach uses &lt;strong&gt;Logistic Matrix Factorization (LMF)&lt;/strong&gt; on implicit feedback — not explicit star ratings, but behavioral signals:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Simplified: how Spotify weights implicit feedback signals
# Real implementation uses distributed matrix factorization at scale
&lt;/span&gt;
&lt;span class="n"&gt;SIGNAL_WEIGHTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream_complete&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Listened to 80%+ of song
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;save_to_library&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="mf"&gt;2.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Strong positive signal
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;add_to_playlist&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="mf"&gt;2.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Strong positive signal
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream_partial&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;     &lt;span class="mf"&gt;0.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Weak positive signal
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skip_after_30s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Negative signal
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skip_immediately&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mf"&gt;1.5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;   &lt;span class="c1"&gt;# Strong negative signal
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;compute_interaction_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Compute a weighted interaction score for a user-track pair.
    Used as input to the matrix factorization model.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;signal_type&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;SIGNAL_WEIGHTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signal_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Clamp to non-negative for LMF
&lt;/span&gt;
&lt;span class="c1"&gt;# The factorization produces: user_vector @ item_vector = predicted_preference
# Trained via ALS (Alternating Least Squares) on GCP with billions of interactions
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
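&lt;p&gt;Once trained, prediction is just a dot product between the learned user and item vectors. A toy sketch with made-up factors (these are not real embeddings — real ones have hundreds of dimensions):&lt;/p&gt;

```python
# Toy sketch: after matrix factorization, predicted preference is the
# dot product of the user vector and the item vector. Vectors below
# are hand-picked illustrations, not learned embeddings.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

user_vec       = [0.9, 0.1, 0.4]   # hypothetical learned user factors
liked_track    = [0.8, 0.2, 0.5]   # track aligned with the user's taste
disliked_track = [0.1, 0.9, 0.0]   # track orthogonal to it

# Higher dot product means higher predicted preference
assert dot(user_vec, liked_track) > dot(user_vec, disliked_track)
```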



&lt;p&gt;The training runs on Hendrix, Spotify's ML platform (named after Jimi Hendrix). Hendrix uses Ray for distributed training on GCP, serves 600+ ML practitioners, and handles the full lifecycle from prototype to production.&lt;/p&gt;

&lt;h3&gt;Layer 2: NLP Analysis&lt;/h3&gt;

&lt;p&gt;NLP fills in gaps where behavioral data is sparse — for new artists, for niche genres, for tracks uploaded last week.&lt;/p&gt;

&lt;p&gt;Spotify runs web crawlers across music blogs, review sites, and social platforms to extract how people describe songs and artists. The output: vector embeddings where songs described with similar language cluster together.&lt;/p&gt;

&lt;p&gt;A song described as "dreamy, lo-fi, bedroom pop" clusters with other songs sharing those descriptors — even if no user has yet listened to both.&lt;/p&gt;
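&lt;p&gt;The clustering idea can be sketched with a bag-of-descriptors "embedding" and cosine similarity. The descriptor lists below are invented; the real system learns dense vectors from crawled text rather than counting raw terms:&lt;/p&gt;

```python
# Toy sketch of the NLP layer's core idea: songs described with
# overlapping language score as similar, even with zero shared listeners.
import math
from collections import Counter

def embed(descriptors):
    # Bag-of-words "embedding": descriptor term, count (a stand-in for
    # the dense embeddings a real NLP pipeline would learn)
    return Counter(descriptors)

def cosine(a, b):
    num = sum(a[k] * b[k] for k in a)   # Counter returns 0 for missing keys
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

song_a = embed(["dreamy", "lo-fi", "bedroom pop"])
song_b = embed(["lo-fi", "dreamy", "chillwave"])
song_c = embed(["death metal", "aggressive"])

assert cosine(song_a, song_b) > cosine(song_a, song_c)   # shared language wins
```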

&lt;h3&gt;Layer 3: Audio CNNs&lt;/h3&gt;

&lt;p&gt;For truly new content — songs uploaded with no listening history and no web presence — audio analysis is the only signal available.&lt;/p&gt;

&lt;p&gt;Convolutional neural networks analyze spectrograms (visual representations of audio). The model learns to detect: tempo, energy, instrumentation, tonality, rhythm patterns. Songs with similar audio characteristics cluster together regardless of metadata.&lt;/p&gt;

&lt;h3&gt;The LLM Layer (2024-2025)&lt;/h3&gt;

&lt;p&gt;In 2024, Spotify added a fourth layer: LLMs for contextual recommendations and the AI DJ feature.&lt;/p&gt;

&lt;p&gt;The challenge: LLMs don't know Spotify's catalog of 100M tracks. The solution was &lt;strong&gt;Semantic IDs&lt;/strong&gt; — compact token identifiers derived from collaborative-filtering embeddings, generated via RQ-KMeans. The LLM learns to treat these IDs as vocabulary tokens, effectively learning to "speak Spotify."&lt;/p&gt;
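&lt;p&gt;Residual quantization — the core of RQ-KMeans — can be sketched in a few lines: each stage picks the nearest codebook entry, and the next stage quantizes what's left over. The codebooks below are tiny and hand-picked purely for illustration; the real system learns them from collaborative-filtering embeddings:&lt;/p&gt;

```python
# Toy sketch of residual quantization: encode a continuous embedding
# as a short sequence of codebook indices (a compact "Semantic ID").
# Codebooks are hand-picked here; RQ-KMeans learns them from data.

def nearest(codebook, vec):
    # Index of the codebook entry closest to vec (squared distance)
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rq_encode(vec, codebooks):
    ids, residual = [], list(vec)
    for cb in codebooks:
        i = nearest(cb, residual)
        ids.append(i)
        # Subtract the chosen entry; the next stage quantizes the residual
        residual = [r - c for r, c in zip(residual, cb[i])]
    return ids

codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],    # coarse stage
    [[0.0, 0.0], [0.1, 0.1], [-0.1, 0.1]],   # fine stage, over residuals
]
semantic_id = rq_encode([0.1, 1.1], codebooks)   # a short token sequence
```

The LLM then treats each index sequence as vocabulary, so a 100M-track catalog compresses into a token space a language model can actually learn.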

&lt;p&gt;Outcomes from live experiments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;4% increase in listening time&lt;/strong&gt; from preference-tuned recommendations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;14% improvement&lt;/strong&gt; from Llama fine-tuned on Spotify's domain vs. vanilla Llama&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;70% reduction in tool errors&lt;/strong&gt; for the AI DJ orchestration system&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;A/B Testing Culture: How Spotify Ships Without Breaking Things&lt;/h2&gt;

&lt;p&gt;Most companies say they have an "experimentation culture." Spotify has metrics to back it up.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;300+ teams&lt;/strong&gt; run experiments. The mobile home screen alone hosted &lt;strong&gt;520 experiments in one year&lt;/strong&gt; across 58 simultaneous teams. Total experiments run: &lt;strong&gt;tens of thousands per year&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The architecture behind this starts with their coordination engine, which manages mutual exclusion between experiments. When 58 teams are simultaneously testing changes to the same screen, you need a system that prevents two experiments from conflicting — and that randomly reshuffles user assignments between experiment runs (the "salt machine").&lt;/p&gt;
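&lt;p&gt;The "salt machine" idea — deterministic bucketing that reshuffles when the salt changes — is straightforward to sketch with a hash. Names and bucket counts here are illustrative, not Spotify's implementation:&lt;/p&gt;

```python
# Sketch of salted experiment bucketing: hash(experiment, salt, user)
# gives a stable bucket within a run, and a new salt reshuffles the
# population between runs so the same users aren't always in treatment.
import hashlib

def assign_bucket(user_id, experiment, salt, n_buckets=100):
    key = f"{experiment}:{salt}:{user_id}".encode()
    digest = int(hashlib.sha256(key).hexdigest(), 16)
    return digest % n_buckets

# Deterministic within a run: same inputs, same bucket
b = assign_bucket("user-42", "new-home-layout", salt="run-1")
assert b == assign_bucket("user-42", "new-home-layout", salt="run-1")

# Changing the salt reshuffles assignments for the next run
run1 = {u: assign_bucket(u, "exp", "run-1") for u in map(str, range(1000))}
run2 = {u: assign_bucket(u, "exp", "run-2") for u in map(str, range(1000))}
assert run1 != run2
```

Mutual exclusion falls out of the same machinery: reserve disjoint bucket ranges for experiments that touch the same surface, and two conflicting treatments can never reach the same user.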

&lt;h3&gt;ABBA to Confidence: Three Generations of Experimentation&lt;/h3&gt;

&lt;p&gt;Spotify's experimentation platform evolved through three generations:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Generation&lt;/th&gt;
&lt;th&gt;Era&lt;/th&gt;
&lt;th&gt;Capability&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;ABBA&lt;/td&gt;
&lt;td&gt;Early 2010s&lt;/td&gt;
&lt;td&gt;Feature flags + basic metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Experimentation Platform (EP)&lt;/td&gt;
&lt;td&gt;2015-2023&lt;/td&gt;
&lt;td&gt;Full orchestration, metrics catalog, coordination&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Confidence&lt;/td&gt;
&lt;td&gt;2023+&lt;/td&gt;
&lt;td&gt;Commercial product, Backstage plugin, APIs&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;The Metric That Changed Everything&lt;/h3&gt;

&lt;p&gt;The most important shift in Spotify's experimentation culture wasn't a new platform — it was a new metric: &lt;strong&gt;learning rate&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Win rate (the conventional metric) measures what percentage of experiments "succeed." At Spotify, that's ~12%.&lt;/p&gt;

&lt;p&gt;Learning rate measures what percentage of experiments produce &lt;strong&gt;decision-ready insights&lt;/strong&gt; — whether the answer is yes, no, or "we need to test something different." That's 64%.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Win rate:      12%  (the experiment confirmed our hypothesis)
Learning rate: 64%  (the experiment gave us actionable information)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This reframe matters enormously for culture. A team that runs 100 experiments and "wins" 12 shouldn't feel like they failed 88% of the time. Every "failed" experiment that disproves a hypothesis saved months of building the wrong thing.&lt;/p&gt;
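&lt;p&gt;The distinction is easy to operationalize: tally hypothesis confirmations for win rate, and any decision-ready result — including clear negatives — for learning rate. A toy tally with invented outcome labels, chosen to reproduce the 12%/64% split:&lt;/p&gt;

```python
# Toy tally illustrating win rate vs. learning rate. The outcome labels
# and their counts are invented to match the percentages in the text.

outcomes = (
    ["confirmed"] * 12                       # hypothesis held up
    + ["rejected"] * 40                      # hypothesis clearly disproved
    + ["directional"] * 12                   # clear enough to decide next step
    + ["no_signal"] * 36                     # genuinely inconclusive
)

# Win rate counts only confirmations; learning rate counts every
# outcome a team can act on, including disproofs.
DECISION_READY = {"confirmed", "rejected", "directional"}

win_rate = outcomes.count("confirmed") / len(outcomes)
learning_rate = sum(1 for o in outcomes if o in DECISION_READY) / len(outcomes)

assert win_rate == 0.12
assert learning_rate == 0.64
```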

&lt;h3&gt;Using Confidence for Feature Flags&lt;/h3&gt;

&lt;p&gt;Spotify open-sourced and commercialized &lt;a href="https://confidence.spotify.com/" rel="noopener noreferrer"&gt;Confidence&lt;/a&gt; in August 2023. It's available as a managed service, a Backstage plugin, or via API. Here's what a basic feature flag + A/B test looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;spotify_confidence&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Confidence&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize with your project credentials
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Confidence&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client_secret&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-client-secret&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Resolve a feature flag for a specific user
&lt;/span&gt;&lt;span class="n"&gt;flag_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve_boolean_flag&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;flag&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;new-home-layout&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;default_value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;evaluation_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;targeting_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;country&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_country&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;platform&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ios&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;flag_value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;render_new_home_layout&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;render_legacy_home_layout&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Track events for analysis
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;track&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;home-layout-engaged&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session_duration_s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;session_seconds&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Confidence platform handles user assignment, experiment coordination, statistical analysis, and validity checks automatically. Squads see results in real time without writing SQL.&lt;/p&gt;
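&lt;p&gt;The key property of that user assignment is determinism: the same targeting key always lands in the same bucket, so no assignment table needs to be stored or synchronized. A minimal sketch of the idea, using a salted hash (not Confidence's actual algorithm):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import hashlib

def assign_bucket(targeting_key, salt, num_buckets=100):
    """Deterministically map a targeting key to a bucket in [0, num_buckets)."""
    digest = hashlib.sha256(f"{salt}:{targeting_key}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

def in_rollout(targeting_key, flag_name, rollout_pct):
    """True when the key falls in the first rollout_pct buckets for this flag."""
    bucket = assign_bucket(targeting_key, salt=flag_name)
    return bucket in range(rollout_pct)  # i.e., bucket is below rollout_pct

# Stable across calls: ramping from 10% to 20% keeps the original 10% enrolled
assert in_rollout("user-123", "new-home-layout", 100)
assert not in_rollout("user-123", "new-home-layout", 0)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Salting with the flag name keeps buckets independent across flags, so landing in one experiment's treatment group doesn't correlate with landing in another's.&lt;/p&gt;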




&lt;h2&gt;
  
  
  Backstage: The Developer Portal That Escaped Spotify
&lt;/h2&gt;

&lt;p&gt;By 2019, Spotify had a problem that no amount of engineering talent could solve manually: 280+ teams managing thousands of services, datasets, APIs, and pipelines — with no shared understanding of what existed or who owned it.&lt;/p&gt;

&lt;p&gt;The answer was an internal project called "System Z." In March 2020, Spotify open-sourced it as &lt;strong&gt;Backstage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBCYWNrc3RhZ2VbIkJhY2tzdGFnZSBEZXZlbG9wZXIgUG9ydGFsIl0KICAgICAgICBzdWJncmFwaCBDb3JlWyJDb3JlIENhcGFiaWxpdGllcyJdCiAgICAgICAgICAgIENBVFsiU29mdHdhcmUgQ2F0YWxvZzxici8-RXZlcnkgc2VydmljZSwgQVBJLDxici8-ZGF0YXNldCwgZG9jIl0KICAgICAgICAgICAgU0NGWyJTY2FmZm9sZGVyPGJyLz5Hb2xkZW4gcGF0aCB0ZW1wbGF0ZXM8YnIvPlNlY3VyaXR5ICsgQ0kvQ0QgYnkgZGVmYXVsdCJdCiAgICAgICAgICAgIFREWyJUZWNoRG9jczxici8-RG9jcy1hcy1jb2RlPGJyLz5BdXRvLWdlbmVyYXRlZCJdCiAgICAgICAgZW5kCiAgICAgICAgc3ViZ3JhcGggUGx1Z2luc1siUGx1Z2luIEVjb3N5c3RlbSJdCiAgICAgICAgICAgIEs4U1siS3ViZXJuZXRlczxici8-UGx1Z2luIl0KICAgICAgICAgICAgR0hBWyJHaXRIdWIgQWN0aW9uczxici8-UGx1Z2luIl0KICAgICAgICAgICAgRERbIkRhdGFkb2c8YnIvPlBsdWdpbiJdCiAgICAgICAgICAgIFBEWyJQYWdlckR1dHk8YnIvPlBsdWdpbiJdCiAgICAgICAgICAgIEFXU1siQVdTPGJyLz5QbHVnaW4iXQogICAgICAgIGVuZAogICAgZW5kCgogICAgREVWWyJEZXZlbG9wZXJzIl0gLS0-IEJhY2tzdGFnZQogICAgQmFja3N0YWdlIC0tPiBJTkZSQVsiQWxsIEluZnJhc3RydWN0dXJlPGJyLz5PbmUgUGFuZSBvZiBHbGFzcyJdCgogICAgc3R5bGUgQ29yZSBmaWxsOiNlM2YyZmQKICAgIHN0eWxlIFBsdWdpbnMgZmlsbDojZjNlNWY1CiAgICBzdHlsZSBCYWNrc3RhZ2UgZmlsbDojZmZmOWM0" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBCYWNrc3RhZ2VbIkJhY2tzdGFnZSBEZXZlbG9wZXIgUG9ydGFsIl0KICAgICAgICBzdWJncmFwaCBDb3JlWyJDb3JlIENhcGFiaWxpdGllcyJdCiAgICAgICAgICAgIENBVFsiU29mdHdhcmUgQ2F0YWxvZzxici8-RXZlcnkgc2VydmljZSwgQVBJLDxici8-ZGF0YXNldCwgZG9jIl0KICAgICAgICAgICAgU0NGWyJTY2FmZm9sZGVyPGJyLz5Hb2xkZW4gcGF0aCB0ZW1wbGF0ZXM8YnIvPlNlY3VyaXR5ICsgQ0kvQ0QgYnkgZGVmYXVsdCJdCiAgICAgICAgICAgIFREWyJUZWNoRG9jczxici8-RG9jcy1hcy1jb2RlPGJyLz5BdXRvLWdlbmVyYXRlZCJdCiAgICAgICAgZW5kCiAgICAgICAgc3ViZ3JhcGggUGx1Z2luc1siUGx1Z2luIEVjb3N5c3RlbSJdCiAgICAgICAgICAgIEs4U1siS3ViZXJuZXRlczxici8-UGx1Z2luIl0KICAgICAgICAgICAgR0hBWyJHaXRIdWIgQWN0aW9uczxici8-UGx1Z2luIl0KICAgICAgICAgICAgRERbIkRhdGFkb2c8YnIvPlBsdWdpbiJdCiAgICAgICAgICAgIFBEWyJQYWdlckR1dHk8YnIvPlBsdWdpbiJdCiAgICAgICAgICAgIEFXU1siQVdTPGJyLz5QbHVnaW4iXQogICAgICAgIGVuZAogICAgZW5kCgogICAgREVWWyJEZXZlbG9wZXJzIl0gLS0-IEJhY2tzdGFnZQogICAgQmFja3N0YWdlIC0tPiBJTkZSQVsiQWxsIEluZnJhc3RydWN0dXJlPGJyLz5PbmUgUGFuZSBvZiBHbGFzcyJdCgogICAgc3R5bGUgQ29yZSBmaWxsOiNlM2YyZmQKICAgIHN0eWxlIFBsdWdpbnMgZmlsbDojZjNlNWY1CiAgICBzdHlsZSBCYWNrc3RhZ2UgZmlsbDojZmZmOWM0" alt="diagram" width="1191" height="751"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  What Backstage Manages at Spotify Today
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Resource Type&lt;/th&gt;
&lt;th&gt;Count&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Backend Services&lt;/td&gt;
&lt;td&gt;2,000+&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Websites&lt;/td&gt;
&lt;td&gt;300&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Pipelines&lt;/td&gt;
&lt;td&gt;4,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mobile Features&lt;/td&gt;
&lt;td&gt;200&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &lt;strong&gt;Software Catalog&lt;/strong&gt; is the source of truth. Every component has a &lt;code&gt;catalog-info.yaml&lt;/code&gt; file in its repo:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# catalog-info.yaml&lt;/span&gt;
&lt;span class="c1"&gt;# Every Spotify service has one of these in its repo root&lt;/span&gt;
&lt;span class="na"&gt;apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;backstage.io/v1alpha1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Component&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;discover-weekly-generator&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Weekly&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;batch&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;job&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;generating&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;personalized&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Discover&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;Weekly&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;playlists"&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;github.com/project-slug&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;spotify/discover-weekly&lt;/span&gt;
    &lt;span class="na"&gt;backstage.io/techdocs-ref&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;dir:.&lt;/span&gt;
    &lt;span class="na"&gt;pagerduty.com/service-id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;P2XYZAB&lt;/span&gt;
    &lt;span class="na"&gt;datadog.com/service-name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;discover-weekly-generator&lt;/span&gt;
  &lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;ml&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;recommendations&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;batch&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;service&lt;/span&gt;
  &lt;span class="na"&gt;lifecycle&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production&lt;/span&gt;
  &lt;span class="na"&gt;owner&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;recommendations-squad&lt;/span&gt;
  &lt;span class="na"&gt;system&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;recommendation-platform&lt;/span&gt;
  &lt;span class="na"&gt;dependsOn&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;resource:default/user-feature-store&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;resource:default/track-embedding-store&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="s"&gt;component:default/hendrix-ml-platform&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;strong&gt;Scaffolder&lt;/strong&gt; generates new services from golden path templates — templates that include security scanning, observability hooks, CI/CD pipelines, and Backstage registration by default. The "right way" is the easy way.&lt;/p&gt;
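&lt;p&gt;A golden path template is itself a catalog entity. Here's a minimal sketch in Backstage's public Scaffolder format (the names and skeleton contents are illustrative, not Spotify's internal templates):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;# template.yaml -- a minimal golden path (illustrative)
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: python-service-golden-path
  description: "New Python service with CI/CD, scanning, and catalog registration by default"
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service details
      required:
        - name
      properties:
        name:
          type: string
          description: Unique service name
        repoUrl:
          type: string
          description: Destination repository
  steps:
    - id: fetch
      name: Fetch skeleton
      action: fetch:template
      input:
        url: ./skeleton   # includes CI/CD config, security scanning, catalog-info.yaml
        values:
          name: ${{ parameters.name }}
    - id: publish
      name: Publish repository
      action: publish:github
      input:
        repoUrl: ${{ parameters.repoUrl }}
    - id: register
      name: Register in catalog
      action: catalog:register
      input:
        repoContentsUrl: ${{ steps['publish'].output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because registration is a template step rather than a follow-up chore, nothing ships without appearing in the catalog.&lt;/p&gt;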

&lt;h3&gt;
  
  
  Outside Spotify
&lt;/h3&gt;

&lt;p&gt;Five years after open-sourcing, Backstage has:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;3,400+ adopting companies&lt;/strong&gt; (Expedia, American Airlines, Zalando, Netflix, Twilio, Wayfair)&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;1,600+ open-source contributors&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Donated to the &lt;strong&gt;CNCF&lt;/strong&gt; — now the standard for internal developer portals&lt;/li&gt;
&lt;li&gt;Evolved into &lt;strong&gt;Spotify Portal&lt;/strong&gt; (enterprise SaaS, GA October 2025)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Squad Model: What Actually Works
&lt;/h2&gt;

&lt;p&gt;The "Spotify Model" — Squads, Tribes, Chapters, Guilds — is the most imitated and most misunderstood organizational pattern in tech.&lt;/p&gt;

&lt;p&gt;Here's what the &lt;a href="https://blog.crisp.se/2012/11/14/henrikkniberg/scaling-agile-at-spotify" rel="noopener noreferrer"&gt;original 2012 whitepaper&lt;/a&gt; actually said:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Unit&lt;/th&gt;
&lt;th&gt;Size&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Squad&lt;/td&gt;
&lt;td&gt;6-12 people&lt;/td&gt;
&lt;td&gt;Full ownership: design, build, test, release, operate&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tribe&lt;/td&gt;
&lt;td&gt;40-150 people&lt;/td&gt;
&lt;td&gt;Coordination across squads in same product area&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chapter&lt;/td&gt;
&lt;td&gt;6-15 specialists&lt;/td&gt;
&lt;td&gt;Craft community within a tribe (e.g., all iOS engineers)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Guild&lt;/td&gt;
&lt;td&gt;Any size&lt;/td&gt;
&lt;td&gt;Voluntary community of interest across the company&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key principle: &lt;strong&gt;"Loosely coupled but tightly aligned."&lt;/strong&gt; Squads move fast independently, but all move in the same strategic direction.&lt;/p&gt;

&lt;p&gt;But here's what Henrik Kniberg himself says now: &lt;em&gt;"Don't copy the Spotify model. That's the opposite of what we intended."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Spotify no longer follows the original model exactly — it evolved constantly. The org chart was always secondary to the &lt;strong&gt;autonomy principle&lt;/strong&gt;: if a squad can't deploy independently, something is wrong in the service design or the org design. Fix whichever is broken.&lt;/p&gt;

&lt;p&gt;The technical manifestation of squad autonomy is the inverse Conway maneuver: design your organization first, and your service architecture will follow. Spotify's thousands of independently deployable microservices exist because hundreds of autonomous squads own them end to end.&lt;/p&gt;




&lt;h2&gt;
  
  
  What to Steal (and What to Leave Behind)
&lt;/h2&gt;

&lt;p&gt;Here's what's actually worth taking from Spotify's playbook — and what requires Spotify-level scale to justify:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Steal It?&lt;/th&gt;
&lt;th&gt;Minimum Scale&lt;/th&gt;
&lt;th&gt;Effort&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Software Catalog (Backstage)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10+ teams&lt;/td&gt;
&lt;td&gt;Low — free, CNCF standard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Golden path templates (Scaffolder)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;5+ teams&lt;/td&gt;
&lt;td&gt;Medium — template once, scale forever&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;64% learning rate metric&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any scale&lt;/td&gt;
&lt;td&gt;Low — just change what you measure&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Feature flags + gradual rollouts&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Any scale&lt;/td&gt;
&lt;td&gt;Low — Confidence or LaunchDarkly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Fleet automation for dependencies&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;50+ services&lt;/td&gt;
&lt;td&gt;Medium — Dependabot + custom automation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Squad autonomy principle&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Yes (carefully)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;3+ teams&lt;/td&gt;
&lt;td&gt;High — org change, not tech change&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3-layer recommendation engine&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Adapted&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;10K+ users&lt;/td&gt;
&lt;td&gt;High — need data volume to work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GCP Pub/Sub at 3M events/sec&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No (yet)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100M+ events/day&lt;/td&gt;
&lt;td&gt;Infrastructure complexity not worth it early&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Hendrix ML platform&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;No&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;100+ ML practitioners&lt;/td&gt;
&lt;td&gt;Overkill; use SageMaker/Vertex AI instead&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The three questions worth asking your team right now:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Can each team deploy independently, without coordinating with other teams?&lt;/strong&gt; If no, fix the service design or the team structure — but fix it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Are you measuring learning rate or just win rate?&lt;/strong&gt; Every experiment that disproves a bad idea is a win. Build a culture that treats it that way.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Does your internal developer portal make the right thing the easy thing?&lt;/strong&gt; If developers skip security scanning because setting it up is hard, the problem isn't the developers.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Spotify's data-driven architecture didn't emerge from a whiteboard session or a consulting engagement. It emerged from 20 years of building autonomy into every layer of the organization and letting that autonomy produce the architecture.&lt;/p&gt;

&lt;p&gt;The event pipeline processes 1 trillion events a day not because Spotify chose GCP Pub/Sub, but because 300+ squads each own their data and ship their pipelines without waiting for a central team.&lt;/p&gt;

&lt;p&gt;Discover Weekly recommends music that feels personal not because of any single ML breakthrough, but because a recommendations squad owned that problem for 10 years and had the freedom to experiment every Monday.&lt;/p&gt;

&lt;p&gt;Backstage manages 4,000 data pipelines and 2,000 services not because it's technically clever, but because the alternative (no catalog) gets exponentially more painful as you grow.&lt;/p&gt;

&lt;p&gt;The tools are available to any company. Most of them are open source or commercially available today. The discipline is what differentiates Spotify — and that part you have to build yourself.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Official Documentation
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://engineering.atspotify.com/" rel="noopener noreferrer"&gt;Spotify Engineering Blog&lt;/a&gt; — primary source for all technical patterns described here&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://research.atspotify.com/" rel="noopener noreferrer"&gt;Spotify Research&lt;/a&gt; — 200+ ML and recommendation papers&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://backstage.io/" rel="noopener noreferrer"&gt;Backstage.io&lt;/a&gt; — open source, free, CNCF graduated&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://confidence.spotify.com/" rel="noopener noreferrer"&gt;Confidence&lt;/a&gt; — Spotify's A/B testing platform, now commercial&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Books
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Building Microservices&lt;/em&gt; by Sam Newman (O'Reilly, 2nd ed. 2021) — covers squad/service alignment&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Designing Data-Intensive Applications&lt;/em&gt; by Martin Kleppmann (O'Reilly, 2017) — event streaming fundamentals&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Key Engineering Blog Posts
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://engineering.atspotify.com/2023/04/spotifys-shift-to-a-fleet-first-mindset-part-1" rel="noopener noreferrer"&gt;Fleet Management at Spotify Part 1&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.atspotify.com/2024/5/data-platform-explained-part-ii" rel="noopener noreferrer"&gt;Data Platform Explained Part II&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.atspotify.com/2023/8/coming-soon-confidence-an-experimentation-platform-from-spotify" rel="noopener noreferrer"&gt;Coming Soon: Confidence&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.atspotify.com/2023/02/unleashing-ml-innovation-at-spotify-with-ray" rel="noopener noreferrer"&gt;Unleashing ML Innovation with Ray&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.atspotify.com/2025/4/celebrating-five-years-of-backstage" rel="noopener noreferrer"&gt;Celebrating Five Years of Backstage&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Research Papers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://research.atspotify.com/2025/9/semantic-ids-for-generative-search-and-recommendation" rel="noopener noreferrer"&gt;Semantic IDs for Generative Search and Recommendation&lt;/a&gt; (NeurIPS 2025)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://research.atspotify.com/2023/02/users-interests-are-multi-faceted-recommendation-models-should-be-too" rel="noopener noreferrer"&gt;Users' Interests are Multi-faceted: Recommendation Models Should Be Too&lt;/a&gt; (WSDM 2023)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://research.atspotify.com/2023/07/optimizing-for-the-long-term-without-delay" rel="noopener noreferrer"&gt;Optimizing for the Long-Term Without Delay&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Original Squad Model Reference
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://blog.crisp.se/2012/11/14/henrikkniberg/scaling-agile-at-spotify" rel="noopener noreferrer"&gt;Scaling Agile @ Spotify&lt;/a&gt; by Henrik Kniberg &amp;amp; Anders Ivarsson (2012)&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Did you find this article helpful?&lt;/strong&gt; Follow me for more content on system design, data engineering, and cloud architecture!&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>microservices</category>
      <category>dataengineering</category>
      <category>spotify</category>
    </item>
    <item>
      <title>How Netflix Turns 2 Trillion Daily Events Into Architectural Decisions (And How You Can Too)</title>
      <dc:creator>David Marcelo Petrocelli</dc:creator>
      <pubDate>Tue, 03 Mar 2026 02:05:37 +0000</pubDate>
      <link>https://dev.to/david_marcelopetrocelli_/how-netflix-turns-2-trillion-daily-events-into-architectural-decisions-and-how-you-can-too-58k1</link>
      <guid>https://dev.to/david_marcelopetrocelli_/how-netflix-turns-2-trillion-daily-events-into-architectural-decisions-and-how-you-can-too-58k1</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Difficulty Level: 300 - Advanced&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Netflix processes 2+ trillion events/day through Kafka and 20,000+ Flink jobs, but the real differentiator is not scale -- it is using that data to drive every architectural decision, from Java version migrations to database selection.&lt;/li&gt;
&lt;li&gt;Their Data Mesh platform with Streaming SQL democratized real-time processing: 1,200 SQL processors created in one year by non-infrastructure teams, processing 100 million events/second across 5,000+ pipelines.&lt;/li&gt;
&lt;li&gt;Every product change goes through A/B testing (150K-450K RPS, &amp;lt;1ms cache-warm latency), and in 2025 ML-optimized experimentation reduces experiment duration by up to 40%.&lt;/li&gt;
&lt;li&gt;The biggest lesson is what NOT to copy: Netflix explicitly warns against "streaming all the things," and their architecture reflects 15+ years of incremental evolution with 10,000+ engineers -- blindly replicating it is a documented anti-pattern.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The 2 Trillion Events Question
&lt;/h2&gt;

&lt;p&gt;Netflix processes over 2 trillion events every single day. Three petabytes of data ingested. Seven petabytes output.&lt;/p&gt;

&lt;p&gt;Those numbers are staggering, but scale is not what makes Netflix's architecture remarkable. What makes it remarkable is that every one of those events feeds back into decisions about what to build next.&lt;/p&gt;

&lt;p&gt;Netflix runs 1,000+ microservices on AWS across 100,000+ EC2 instances, serving 300M+ subscribers and generating $39B in revenue (2024). Their estimated annual AWS spend exceeds $1.3B. But the companies that try to replicate Netflix's infrastructure miss the point entirely. The architecture is not the product of a grand design -- it is the product of 15+ years of data-driven decisions, each one measured, validated, and rolled out incrementally.&lt;/p&gt;

&lt;p&gt;After years of building distributed systems for enterprise clients and teaching these patterns at university, I have found that the most common mistake teams make is copying Netflix's tools rather than Netflix's discipline. In this article, I will break down how their real-time data pipeline feeds architectural decisions across experimentation, observability, chaos engineering, and platform engineering -- and identify the patterns you can actually adopt.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Familiarity with microservices architecture patterns (circuit breakers, service discovery, API gateways)&lt;/li&gt;
&lt;li&gt;Basic understanding of stream processing concepts (Kafka, Flink, or similar)&lt;/li&gt;
&lt;li&gt;Experience with distributed systems at any scale&lt;/li&gt;
&lt;li&gt;Understanding of A/B testing fundamentals&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What You'll Learn
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;How Netflix's Kafka + Flink pipeline evolved from 45 billion events/day (2011) to 2+ trillion&lt;/li&gt;
&lt;li&gt;Why Netflix rejected graph databases for their 8-billion-node distributed graph and chose Cassandra instead&lt;/li&gt;
&lt;li&gt;How their A/B testing platform handles 450K RPS with sub-millisecond latency&lt;/li&gt;
&lt;li&gt;What Netflix's observability stack looks like at 17 billion metrics/day and 700 billion traces/day&lt;/li&gt;
&lt;li&gt;Which Netflix patterns you should adopt -- and which ones you should absolutely avoid&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Data Pipeline: Kafka, Flink, and Data Mesh at Trillions Scale
&lt;/h2&gt;

&lt;p&gt;Netflix's real-time data infrastructure evolved through &lt;a href="https://zhenzhongxu.com/the-four-innovation-phases-of-netflixs-trillions-scale-real-time-data-infrastructure-2370938d7f01" rel="noopener noreferrer"&gt;four distinct innovation phases&lt;/a&gt; over 13 years. Understanding this evolution matters because it reveals that no one designed a "trillions-scale pipeline" from scratch. Every layer was added to solve a concrete problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBJbmdlc3Rpb24KICAgICAgICBVQVtVc2VyIEFjdGlvbnNdIC0tPiBBR1tBUEkgR2F0ZXdheV0KICAgICAgICBBRyAtLT4gS0ZbIkthZmthPGJyLz5BdnJvICsgU2NoZW1hIFJlZ2lzdHJ5PGJyLz4xTSBtc2cvc2VjL3RvcGljIl0KICAgIGVuZAoKICAgIHN1YmdyYXBoIFByb2Nlc3NpbmcKICAgICAgICBLRiAtLT4gRkxbIkFwYWNoZSBGbGluazxici8-MjBLKyBqb2JzIl0KICAgICAgICBTUUxbIlN0cmVhbWluZyBTUUw8YnIvPjEsMjAwKyBwcm9jZXNzb3JzIl0gLS4tPnx3cmFwc3wgRkwKICAgICAgICBGTCAtLT4gRE1bIkRhdGEgTWVzaDxici8-NU0gcmVjb3Jkcy9zZWMiXQogICAgZW5kCgogICAgc3ViZ3JhcGggU2lua3MKICAgICAgICBETSAtLT4gSUNbIkFwYWNoZSBJY2ViZXJnPGJyLz5XYXJlaG91c2UiXQogICAgICAgIERNIC0tPiBDU1siQ2Fzc2FuZHJhPGJyLz5LVkRBTCJdCiAgICAgICAgRE0gLS0-IERSWyJBcGFjaGUgRHJ1aWQ8YnIvPkFuYWx5dGljcyJdCiAgICAgICAgRE0gLS0-IE1TWyJEb3duc3RyZWFtPGJyLz5NaWNyb3NlcnZpY2VzIl0KICAgIGVuZAoKICAgIHN0eWxlIEluZ2VzdGlvbiBmaWxsOiNlM2YyZmQKICAgIHN0eWxlIFByb2Nlc3NpbmcgZmlsbDojZmZmM2UwCiAgICBzdHlsZSBTaW5rcyBmaWxsOiNlOGY1ZTk%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBzdWJncmFwaCBJbmdlc3Rpb24KICAgICAgICBVQVtVc2VyIEFjdGlvbnNdIC0tPiBBR1tBUEkgR2F0ZXdheV0KICAgICAgICBBRyAtLT4gS0ZbIkthZmthPGJyLz5BdnJvICsgU2NoZW1hIFJlZ2lzdHJ5PGJyLz4xTSBtc2cvc2VjL3RvcGljIl0KICAgIGVuZAoKICAgIHN1YmdyYXBoIFByb2Nlc3NpbmcKICAgICAgICBLRiAtLT4gRkxbIkFwYWNoZSBGbGluazxici8-MjBLKyBqb2JzIl0KICAgICAgICBTUUxbIlN0cmVhbWluZyBTUUw8YnIvPjEsMjAwKyBwcm9jZXNzb3JzIl0gLS4tPnx3cmFwc3wgRkwKICAgICAgICBGTCAtLT4gRE1bIkRhdGEgTWVzaDxici8-NU0gcmVjb3Jkcy9zZWMiXQogICAgZW5kCgogICAgc3ViZ3JhcGggU2lua3MKICAgICAgICBETSAtLT4gSUNbIkFwYWNoZSBJY2ViZXJnPGJyLz5XYXJlaG91c2UiXQogICAgICAgIERNIC0tPiBDU1siQ2Fzc2FuZHJhPGJyLz5LVkRBTCJdCiAgICAgICAgRE0gLS0-IERSWyJBcGFjaGUgRHJ1aWQ8YnIvPkFuYWx5dGljcyJdCiAgICAgICAgRE0gLS0-IE1TWyJEb3duc3RyZWFtPGJyLz5NaWNyb3NlcnZpY2VzIl0KICAgIGVuZAoKICAgIHN0eWxlIEluZ2VzdGlvbiBmaWxsOiNlM2YyZmQKICAgIHN0eWxlIFByb2Nlc3NpbmcgZmlsbDojZmZmM2UwCiAgICBzdHlsZSBTaW5rcyBmaWxsOiNlOGY1ZTk%3D" alt="diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Keystone Pipeline
&lt;/h3&gt;

&lt;p&gt;At the core is Keystone, a petabyte-scale real-time event streaming and processing system. It scaled from 1 trillion events/day in 2017 to 2+ trillion today -- roughly doubling in four years.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Kafka serves as the universal backbone.&lt;/strong&gt; Thousands of topics carry roughly 1 million messages per second per topic, all Avro-encoded with schemas persisted in a centralized internal registry. Every record is dual-written to both streaming consumers (Flink) and the analytical warehouse (Apache Iceberg), enabling real-time processing and historical backfills simultaneously.&lt;/p&gt;
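&lt;p&gt;The dual-write pattern is worth internalizing: one logical event, two sinks with different jobs. A toy in-memory sketch of the shape (plain lists standing in for Kafka consumers and Iceberg tables; nothing here is Netflix code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class DualWriter:
    """Fan each published event out to a real-time stream and an archive."""

    def __init__(self):
        self.stream = []   # stand-in for streaming consumers (Flink jobs)
        self.archive = []  # stand-in for the warehouse (Iceberg tables)

    def publish(self, event):
        # One logical write, two destinations: the stream powers real-time
        # processing; the archive powers historical backfills.
        self.stream.append(event)
        self.archive.append(event)

    def backfill(self, predicate):
        """Replay archived events matching a predicate, e.g. after a bug fix."""
        return [e for e in self.archive if predicate(e)]

writer = DualWriter()
writer.publish({"member_id": 1, "action": "play"})
writer.publish({"member_id": 2, "action": "pause"})
replayed = writer.backfill(lambda e: e["action"] == "play")  # one event
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The point of the shape: real-time consumers can stay simple and lossy-tolerant, because the archive guarantees that any derived state can always be rebuilt.&lt;/p&gt;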

&lt;p&gt;&lt;strong&gt;Apache Flink is the processing engine.&lt;/strong&gt; Netflix runs 20,000+ Flink jobs concurrently, handling everything from graph materialization to observability analytics to ad event processing. The Data Mesh platform writes 5 million records per second across these pipelines.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Year&lt;/th&gt;
&lt;th&gt;Daily Events&lt;/th&gt;
&lt;th&gt;Key Innovation&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2011&lt;/td&gt;
&lt;td&gt;45 billion&lt;/td&gt;
&lt;td&gt;Chukwa-based ingestion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2015&lt;/td&gt;
&lt;td&gt;500 billion&lt;/td&gt;
&lt;td&gt;Keystone pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2017&lt;/td&gt;
&lt;td&gt;1 trillion&lt;/td&gt;
&lt;td&gt;Managed Kafka platform&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2021+&lt;/td&gt;
&lt;td&gt;2+ trillion&lt;/td&gt;
&lt;td&gt;Data Mesh + Streaming SQL&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Streaming SQL: Democratizing Real-Time Processing
&lt;/h3&gt;

&lt;p&gt;The most impactful recent evolution was not a scale increase -- it was an accessibility one. Netflix introduced &lt;a href="https://netflixtechblog.com/streaming-sql-in-data-mesh-0d83f5a00d08" rel="noopener noreferrer"&gt;Streaming SQL in Data Mesh&lt;/a&gt;, wrapping Flink's complex DataStream API behind standard SQL.&lt;/p&gt;

&lt;p&gt;The results were immediate: &lt;strong&gt;1,200 SQL processors created within one year of launch&lt;/strong&gt;, built by non-infrastructure teams. The platform now processes 100 million events per second across 5,000+ pipelines. Netflix won the Confluent Data Streaming Award for this work.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="c1"&gt;-- Netflix Data Mesh Streaming SQL Processor&lt;/span&gt;
&lt;span class="c1"&gt;-- Domain experts write standard SQL against streaming sources&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="n"&gt;member_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TUMBLE_START&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'5'&lt;/span&gt; &lt;span class="k"&gt;MINUTE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;window_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;interaction_count&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;member_events&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
    &lt;span class="n"&gt;member_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;content_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;TUMBLE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event_time&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;INTERVAL&lt;/span&gt; &lt;span class="s1"&gt;'5'&lt;/span&gt; &lt;span class="k"&gt;MINUTE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This is the democratization pattern in action: build complex infrastructure (Flink), then wrap it in an accessible interface (SQL). Domain experts build data products without being stream processing specialists.&lt;/p&gt;

&lt;p&gt;Netflix explicitly warns against the opposite approach: &lt;a href="https://www.infoq.com/articles/netflix-migrating-stream-processing/" rel="noopener noreferrer"&gt;"Don't stream all the things."&lt;/a&gt; When they migrated critical pipelines from 24-hour batch latency to real-time, they documented the "pioneer tax" -- increased on-call burden, JAR hell, and complex failure recovery. Batch processing remains the right choice when real-time does not add measurable business value.&lt;/p&gt;

&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
      &lt;div class="c-embed__body flex items-center justify-between"&gt;
        &lt;a href="https://current.confluent.io/post-conference-videos-2025/democratising-stream-processing-how-netflix-empowers-teams-with-data-mesh-and-streaming-sql-lnd25" rel="noopener noreferrer" class="c-link fw-bold flex items-center"&gt;
          &lt;span class="mr-2"&gt;current.confluent.io&lt;/span&gt;
          

        &lt;/a&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;








&lt;h2&gt;
  
  
  The Real-Time Distributed Graph: Architecture Under the Hood
&lt;/h2&gt;

&lt;p&gt;In October 2025, Netflix published the architecture behind their &lt;a href="https://netflixtechblog.com/how-and-why-netflix-built-a-real-time-distributed-graph-part-1-ingesting-and-processing-data-80113e124acc" rel="noopener noreferrer"&gt;Real-Time Distributed Graph (RDG)&lt;/a&gt; -- a system modeling member interactions at internet scale. The numbers: 8 billion+ nodes, 150 billion+ edges, sustaining 2 million reads/second and 6 million writes/second.&lt;/p&gt;

&lt;p&gt;What makes this architecturally instructive is not the scale but the storage decision.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBLVFtLYWZrYSBUb3BpY3NdIC0tPiBGTFsiRmxpbms8YnIvPkdyYXBoIE1hdGVyaWFsaXphdGlvbiJdCiAgICBGTCAtLT4gS1ZbIktWREFMPGJyLz5nUlBDIEludGVyZmFjZSJdCgogICAgc3ViZ3JhcGggU3RvcmFnZVsiQ2Fzc2FuZHJhIENsdXN0ZXJzPGJyLz4xMiBjbHVzdGVycyDCtyAyLDQwMCBFQzIgaW5zdGFuY2VzIl0KICAgICAgICBOUzFbTmFtZXNwYWNlIEFdCiAgICAgICAgTlMyW05hbWVzcGFjZSBCXQogICAgICAgIE5TM1tOYW1lc3BhY2UgTl0KICAgIGVuZAoKICAgIEtWIC0tPiBOUzEKICAgIEtWIC0tPiBOUzIKICAgIEtWIC0tPiBOUzMKCiAgICBFVlsiRVZDYWNoZTxici8-U3ViLW1zIExhdGVuY3kiXSAtLi0-fGNhY2hlIGxheWVyfCBLVgoKICAgIEtWIC0tPiBTVkNbIk1pY3Jvc2VydmljZXM8YnIvPjJNIHJlYWRzL3MgwrcgNk0gd3JpdGVzL3MiXQoKICAgIHN0eWxlIFN0b3JhZ2UgZmlsbDojZmNlNGVj" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBLVFtLYWZrYSBUb3BpY3NdIC0tPiBGTFsiRmxpbms8YnIvPkdyYXBoIE1hdGVyaWFsaXphdGlvbiJdCiAgICBGTCAtLT4gS1ZbIktWREFMPGJyLz5nUlBDIEludGVyZmFjZSJdCgogICAgc3ViZ3JhcGggU3RvcmFnZVsiQ2Fzc2FuZHJhIENsdXN0ZXJzPGJyLz4xMiBjbHVzdGVycyDCtyAyLDQwMCBFQzIgaW5zdGFuY2VzIl0KICAgICAgICBOUzFbTmFtZXNwYWNlIEFdCiAgICAgICAgTlMyW05hbWVzcGFjZSBCXQogICAgICAgIE5TM1tOYW1lc3BhY2UgTl0KICAgIGVuZAoKICAgIEtWIC0tPiBOUzEKICAgIEtWIC0tPiBOUzIKICAgIEtWIC0tPiBOUzMKCiAgICBFVlsiRVZDYWNoZTxici8-U3ViLW1zIExhdGVuY3kiXSAtLi0-fGNhY2hlIGxheWVyfCBLVgoKICAgIEtWIC0tPiBTVkNbIk1pY3Jvc2VydmljZXM8YnIvPjJNIHJlYWRzL3MgwrcgNk0gd3JpdGVzL3MiXQoKICAgIHN0eWxlIFN0b3JhZ2UgZmlsbDojZmNlNGVj" alt="diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Netflix Rejected Graph Databases
&lt;/h3&gt;

&lt;p&gt;Netflix &lt;a href="https://netflixtechblog.medium.com/how-and-why-netflix-built-a-real-time-distributed-graph-part-2-building-a-scalable-storage-layer-ff4a8dbd3d1f" rel="noopener noreferrer"&gt;evaluated and rejected Neo4j&lt;/a&gt; for the RDG storage layer. Neo4j performed well for millions of records but became inefficient beyond hundreds of millions due to high memory requirements and limited horizontal scaling.&lt;/p&gt;

&lt;p&gt;Instead, they chose KVDAL (Key-Value Data Abstraction Layer), built on Apache Cassandra. The storage layer spans approximately 27 namespaces across 12 Cassandra clusters backed by 2,400 EC2 instances. EVCache (Memcached-based) sits in front of Cassandra, providing sub-millisecond read latency on hot data.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;Neo4j&lt;/th&gt;
&lt;th&gt;Cassandra + KVDAL&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Scale&lt;/td&gt;
&lt;td&gt;Millions of records&lt;/td&gt;
&lt;td&gt;Billions+ (8B nodes, 150B edges)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Horizontal scaling&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Linear&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Write performance&lt;/td&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;6M writes/sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Read latency (cached)&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Sub-millisecond (EVCache)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Netflix verdict&lt;/td&gt;
&lt;td&gt;Rejected&lt;/td&gt;
&lt;td&gt;Selected&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  The Data Abstraction Layer Pattern
&lt;/h3&gt;

&lt;p&gt;The critical design decision here is not "use Cassandra" -- it is the abstraction layer. Applications interact with KVDAL via gRPC, so storage backends can be swapped without code changes. The namespace model supports flexible backends: different namespaces can use different Cassandra clusters or entirely different storage technologies.&lt;/p&gt;

&lt;p&gt;In my experience building distributed storage systems, this pattern pays for itself the first time you need to migrate backends. Netflix's approach -- evaluate with data, abstract the interface, isolate by namespace -- is directly adoptable regardless of your scale.&lt;/p&gt;
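&lt;p&gt;To make the pattern concrete, here is a minimal Python sketch of a namespace-routed abstraction layer. This is illustrative only -- KVDAL's actual gRPC API is not public, and every class and method name below is hypothetical:&lt;/p&gt;

```python
# Illustrative sketch of a key-value data abstraction layer (names hypothetical).
# Applications depend only on KeyValueStore; backends are swapped per namespace.

class KeyValueStore:
    """Minimal storage interface the application codes against."""
    def put(self, key, value):
        raise NotImplementedError
    def get(self, key):
        raise NotImplementedError

class InMemoryStore(KeyValueStore):
    """Stand-in for a real backend (e.g. a Cassandra-backed store)."""
    def __init__(self):
        self._data = {}
    def put(self, key, value):
        self._data[key] = value
    def get(self, key):
        return self._data.get(key)

class DataAbstractionLayer:
    """Routes each namespace to its own backend, so one namespace can
    migrate storage technologies without code changes elsewhere."""
    def __init__(self):
        self._namespaces = {}
    def register(self, namespace, store):
        self._namespaces[namespace] = store
    def put(self, namespace, key, value):
        self._namespaces[namespace].put(key, value)
    def get(self, namespace, key):
        return self._namespaces[namespace].get(key)

dal = DataAbstractionLayer()
dal.register("member_graph", InMemoryStore())
dal.put("member_graph", "member:42", {"edges": ["content:7"]})
```

&lt;p&gt;Because callers only ever see the &lt;code&gt;put&lt;/code&gt;/&lt;code&gt;get&lt;/code&gt; interface, moving a namespace to a different backend is a one-line registration change.&lt;/p&gt;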




&lt;h2&gt;
  
  
  A/B Testing as an Architectural Principle
&lt;/h2&gt;

&lt;p&gt;At Netflix, &lt;strong&gt;every&lt;/strong&gt; product change goes through A/B testing before becoming the default. This is not a feature -- it is an architectural principle. As Netflix puts it, the goal is &lt;a href="https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15" rel="noopener noreferrer"&gt;"product decisions driven by data, not by the most opinionated and vocal employees."&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBBW1Byb2R1Y3QgQ2hhbmdlIFByb3Bvc2FsXSAtLT4gQltFeHBlcmltZW50IERlc2lnbl0KICAgIEIgLS0-IEN7SGFzaC1CYXNlZCBBbGxvY2F0aW9ufQogICAgQyAtLT4gRFtUcmVhdG1lbnQgR3JvdXBdCiAgICBDIC0tPiBFW0NvbnRyb2wgR3JvdXBdCiAgICBEIC0tPiBGW1JlYWwtVGltZSBNZXRyaWNzIHZpYSBLYWZrYSBhbmQgRmxpbmtdCiAgICBFIC0tPiBGCiAgICBGIC0tPiBHW1NlcXVlbnRpYWwgU3RhdGlzdGljYWwgQW5hbHlzaXNdCiAgICBHIC0tPiBIe0RlY2lzaW9ufQogICAgSCAtLT58UG9zaXRpdmV8IElbU2hpcCB0byBBbGwgVXNlcnNdCiAgICBIIC0tPnxOZWdhdGl2ZXwgSltLaWxsIHRoZSBDaGFuZ2VdCiAgICBIIC0tPnxJbmNvbmNsdXNpdmV8IEtbSXRlcmF0ZSBhbmQgUmV0ZXN0XQogICAgRyAtLi0-fE1MIE9wdGltaXphdGlvbiAyMDI1fCBMW1JlZHVjZSBEdXJhdGlvbiB1cCB0byA0MCVdCiAgICBMIC0uLT4gRw%3D%3D" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IFRECiAgICBBW1Byb2R1Y3QgQ2hhbmdlIFByb3Bvc2FsXSAtLT4gQltFeHBlcmltZW50IERlc2lnbl0KICAgIEIgLS0-IEN7SGFzaC1CYXNlZCBBbGxvY2F0aW9ufQogICAgQyAtLT4gRFtUcmVhdG1lbnQgR3JvdXBdCiAgICBDIC0tPiBFW0NvbnRyb2wgR3JvdXBdCiAgICBEIC0tPiBGW1JlYWwtVGltZSBNZXRyaWNzIHZpYSBLYWZrYSBhbmQgRmxpbmtdCiAgICBFIC0tPiBGCiAgICBGIC0tPiBHW1NlcXVlbnRpYWwgU3RhdGlzdGljYWwgQW5hbHlzaXNdCiAgICBHIC0tPiBIe0RlY2lzaW9ufQogICAgSCAtLT58UG9zaXRpdmV8IElbU2hpcCB0byBBbGwgVXNlcnNdCiAgICBIIC0tPnxOZWdhdGl2ZXwgSltLaWxsIHRoZSBDaGFuZ2VdCiAgICBIIC0tPnxJbmNvbmNsdXNpdmV8IEtbSXRlcmF0ZSBhbmQgUmV0ZXN0XQogICAgRyAtLi0-fE1MIE9wdGltaXphdGlvbiAyMDI1fCBMW1JlZHVjZSBEdXJhdGlvbiB1cCB0byA0MCVdCiAgICBMIC0uLT4gRw%3D%3D" alt="diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Experimentation Platform
&lt;/h3&gt;

&lt;p&gt;Netflix's experimentation platform handles &lt;a href="https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15" rel="noopener noreferrer"&gt;150K to 450K requests per second&lt;/a&gt; with cache-warm latency under 1ms and real-time evaluation averaging approximately 50ms. Allocation is deterministic: a hash of &lt;code&gt;member_id + experiment_id&lt;/code&gt; assigns each user to an experiment cell consistently across sessions and devices.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Netflix experiment allocation pattern
# Each member is assigned to experiment cells deterministically
# using member_id + experiment_id hash
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;allocate_member&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;member_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;experiment_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;num_cells&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Deterministic allocation ensures consistent user experience
    across sessions and devices.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;hash_value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;hash&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;member_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;experiment_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;hash_value&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt; &lt;span class="n"&gt;num_cells&lt;/span&gt;

&lt;span class="c1"&gt;# Sequential testing: allows early stopping
# Netflix monitors experiments continuously rather than
# waiting for fixed sample sizes
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;a href="https://netflixtechblog.com/sequential-a-b-testing-keeps-the-world-streaming-netflix-part-1-continuous-data-cba6c7ed49df" rel="noopener noreferrer"&gt;Sequential testing&lt;/a&gt; is critical for infrastructure experiments where bad changes could degrade streaming quality for millions. Unlike fixed-horizon tests, sequential tests let Netflix stop experiments early when results are conclusive, reducing both time and user exposure to suboptimal experiences.&lt;/p&gt;
&lt;h3&gt;
  
  
  The 2025 Evolution: ML-Optimized Experimentation
&lt;/h3&gt;

&lt;p&gt;Beginning in 2025, Netflix started using &lt;a href="https://netflixtechblog.com/a-b-testing-and-beyond-improving-the-netflix-streaming-experience-with-experimentation-and-data-5b0ae9295bdf" rel="noopener noreferrer"&gt;machine learning to optimize A/B testing&lt;/a&gt;. Adaptive causal-inference models reduce experiment duration by up to 40%. Combined with server-driven UI -- which enables experimentation without app store releases -- Netflix continuously iterates on the experience of 300M+ subscribers.&lt;/p&gt;

&lt;p&gt;Their causal inference extends well beyond simple A/B testing: contextual bandits for content matching, counterfactual logging for offline experiments, and surrogate metrics for inferring long-term effects from short-term data. Data scientists analyze billions of rows on single machines using Python and R -- a deliberate architectural choice prioritizing analyst productivity over distributed computing complexity.&lt;/p&gt;


&lt;h2&gt;
  
  
  Observability as an Architectural Decision Engine
&lt;/h2&gt;

&lt;p&gt;Netflix's observability stack is not just for debugging. It is the feedback loop that drives architectural evolution.&lt;/p&gt;

&lt;p&gt;The numbers: &lt;a href="https://netflix.github.io/atlas-docs/overview/" rel="noopener noreferrer"&gt;Atlas&lt;/a&gt; processes 17 billion metrics per day. The platform handles 700 billion distributed traces per day and 1.5 petabytes of log data. All of this costs less than 5% of Netflix's total infrastructure spend -- a deliberate and measured investment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBTVkNbIk1pY3Jvc2VydmljZXM8YnIvPjEsMDAwKyJdIC0tPiBBVFsiQXRsYXM8YnIvPjE3QiBtZXRyaWNzL2RheSJdCiAgICBTVkMgLS0-IFRSWyJEaXN0cmlidXRlZCBUcmFjaW5nPGJyLz43MDBCIHRyYWNlcy9kYXkiXQogICAgU1ZDIC0tPiBMR1siTG9nczxici8-MS41IFBCIl0KCiAgICBzdWJncmFwaCBBbmFseXNpc1siU3RyZWFtaW5nIEFuYWx5c2lzIl0KICAgICAgICBGTFtBcGFjaGUgRmxpbmtdCiAgICBlbmQKCiAgICBBVCAtLT4gRkwKICAgIFRSIC0tPiBGTAogICAgTEcgLS0-IEZMCgogICAgRkwgLS0-IEFMW0FsZXJ0aW5nXQogICAgRkwgLS0-IERCW0Rhc2hib2FyZHNdCiAgICBGTCAtLT4gQURbIkFyY2hpdGVjdHVyYWw8YnIvPkRlY2lzaW9ucyJdCgogICAgc3ViZ3JhcGggUGlwZWxpbmVbIkRhdGEgUGlwZWxpbmUgVHJhY2luZyJdCiAgICAgICAgRFBbRGF0YSBQaXBlbGluZXNdIC0tPiBJTlsiSW5jYTxici8-VVVJRCBwZXIgbWVzc2FnZSJdCiAgICAgICAgSU4gLS0-IExEWyJMb3NzICYgRHVwbGljYXRlPGJyLz5EZXRlY3Rpb24iXQogICAgZW5kCgogICAgTEQgLS0-IEFUCgogICAgc3R5bGUgQW5hbHlzaXMgZmlsbDojZmZmM2UwCiAgICBzdHlsZSBQaXBlbGluZSBmaWxsOiNmM2U1ZjU%3D" class="article-body-image-wrapper"&gt;&lt;img 
src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2FZmxvd2NoYXJ0IExSCiAgICBTVkNbIk1pY3Jvc2VydmljZXM8YnIvPjEsMDAwKyJdIC0tPiBBVFsiQXRsYXM8YnIvPjE3QiBtZXRyaWNzL2RheSJdCiAgICBTVkMgLS0-IFRSWyJEaXN0cmlidXRlZCBUcmFjaW5nPGJyLz43MDBCIHRyYWNlcy9kYXkiXQogICAgU1ZDIC0tPiBMR1siTG9nczxici8-MS41IFBCIl0KCiAgICBzdWJncmFwaCBBbmFseXNpc1siU3RyZWFtaW5nIEFuYWx5c2lzIl0KICAgICAgICBGTFtBcGFjaGUgRmxpbmtdCiAgICBlbmQKCiAgICBBVCAtLT4gRkwKICAgIFRSIC0tPiBGTAogICAgTEcgLS0-IEZMCgogICAgRkwgLS0-IEFMW0FsZXJ0aW5nXQogICAgRkwgLS0-IERCW0Rhc2hib2FyZHNdCiAgICBGTCAtLT4gQURbIkFyY2hpdGVjdHVyYWw8YnIvPkRlY2lzaW9ucyJdCgogICAgc3ViZ3JhcGggUGlwZWxpbmVbIkRhdGEgUGlwZWxpbmUgVHJhY2luZyJdCiAgICAgICAgRFBbRGF0YSBQaXBlbGluZXNdIC0tPiBJTlsiSW5jYTxici8-VVVJRCBwZXIgbWVzc2FnZSJdCiAgICAgICAgSU4gLS0-IExEWyJMb3NzICYgRHVwbGljYXRlPGJyLz5EZXRlY3Rpb24iXQogICAgZW5kCgogICAgTEQgLS0-IEFUCgogICAgc3R5bGUgQW5hbHlzaXMgZmlsbDojZmZmM2UwCiAgICBzdHlsZSBQaXBlbGluZSBmaWxsOiNmM2U1ZjU%3D" alt="diagram"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  The Trace Explosion Problem
&lt;/h3&gt;

&lt;p&gt;Consider encoding a single episode of Squid Game Season 2. According to Netflix engineers at &lt;a href="https://www.infoq.com/presentations/stream-pipeline-observability/" rel="noopener noreferrer"&gt;QCon London 2025&lt;/a&gt;, this generates &lt;strong&gt;1 million trace spans&lt;/strong&gt;, 140 video encodes, 552 audio encodes, and consumes 122,000 CPU hours.&lt;/p&gt;

&lt;p&gt;At that density, traditional tracing tools collapse. 300K+ spans per request overwhelm conventional visualization. Netflix solved this with a request-first tree visualization and stream processing via Flink, transforming raw spans into actionable business intelligence.&lt;/p&gt;

&lt;p&gt;The high-cardinality metrics client uses metadata tagging and a taxonomy service exposed via GraphQL API, ensuring consistent metadata across hundreds of services.&lt;/p&gt;
&lt;h3&gt;
  
  
  Observability Driving Business Outcomes
&lt;/h3&gt;

&lt;p&gt;The business outcomes from this investment are concrete: ROI-based resource allocation, workflow caching without user intervention, and measurable cost efficiency improvements. Netflix also built Inca, a message-level tracing system for data pipelines where each message gets a UUID, enabling detection of loss and duplicates across trillions of daily events.&lt;/p&gt;

&lt;p&gt;The key insight: observability at Netflix is not a cost center. It is the mechanism by which data shapes architecture. When encoding costs spike for a particular content type, observability data drives the decision to cache workflows. When trace analysis reveals inefficient service-to-service calls, it informs decomposition decisions.&lt;/p&gt;

&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://www.infoq.com/presentations/stream-pipeline-observability/" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.infoq.com%2Fpresentations%2Fstream-pipeline-observability%2Fen%2Fcard_header_image%2Ftwitter-card-sujana-sooreddy-naveen-mareddy-1766067087244.jpg" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://www.infoq.com/presentations/stream-pipeline-observability/" rel="noopener noreferrer" class="c-link"&gt;
            From Confusion to Clarity: Advanced Observability Strategies for Media Workflows at Netflix - InfoQ
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            Naveen Mareddy and Sujana Sooreddy discuss the evolution of Netflix’s media processing observability, moving from monolithic tracing to a high-cardinality analytics platform. They explain how to handle "trace explosion" using stream processing and a "request-first" tree visualization, and share how to transform raw spans into actionable business intelligence.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.infoq.com%2Fstatics_s1_20260319113023%2Ffavicon.ico"&gt;
          infoq.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;








&lt;h2&gt;
  
  
  Chaos Engineering and Resilience: Breaking Things With Data
&lt;/h2&gt;

&lt;p&gt;Netflix invented chaos engineering in 2011 with Chaos Monkey, which randomly terminates production VM instances. It evolved into the &lt;a href="https://www.sei.cmu.edu/blog/devops-case-study-netflix-and-the-chaos-monkey/" rel="noopener noreferrer"&gt;Simian Army&lt;/a&gt; -- Latency Monkey, Conformity Monkey, Doctor Monkey, Security Monkey -- each injecting different failure modes. The discipline was formalized in a &lt;a href="https://arxiv.org/pdf/1702.05843" rel="noopener noreferrer"&gt;2017 whitepaper&lt;/a&gt; establishing five core principles.&lt;/p&gt;

&lt;p&gt;The industry data validates the approach: organizations adopting chaos engineering report a &lt;a href="https://dev.to/jagkush/breaking-things-on-purpose-what-i-learned-from-netflixs-chaos-monkey-2f8p"&gt;35% average reduction in outages and 41% improvement in MTTR&lt;/a&gt;. In 2024, TravelTech implemented Chaos Monkey and discovered a single point of failure in payment processing, preventing a potential outage affecting 30,000+ customers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Resilience Evolution: From Libraries to Infrastructure
&lt;/h3&gt;

&lt;p&gt;Netflix's resilience approach has fundamentally shifted. The original library-based patterns (Hystrix for circuit breaking, Ribbon for client load balancing) have been deprecated in favor of infrastructure-based resilience via Envoy service mesh -- zero-configuration resilience that does not require application code changes.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Era&lt;/th&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Implementation&lt;/th&gt;
&lt;th&gt;Status (2026)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;2011&lt;/td&gt;
&lt;td&gt;Chaos testing&lt;/td&gt;
&lt;td&gt;Chaos Monkey / Simian Army&lt;/td&gt;
&lt;td&gt;Active (evolved)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2012&lt;/td&gt;
&lt;td&gt;Circuit breaker&lt;/td&gt;
&lt;td&gt;Hystrix&lt;/td&gt;
&lt;td&gt;Deprecated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2014&lt;/td&gt;
&lt;td&gt;Client load balancing&lt;/td&gt;
&lt;td&gt;Ribbon&lt;/td&gt;
&lt;td&gt;Deprecated&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2015&lt;/td&gt;
&lt;td&gt;API gateway&lt;/td&gt;
&lt;td&gt;Zuul&lt;/td&gt;
&lt;td&gt;Zuul 2 (Netty)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2020&lt;/td&gt;
&lt;td&gt;Circuit breaker v2&lt;/td&gt;
&lt;td&gt;Resilience4j&lt;/td&gt;
&lt;td&gt;Active&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2026&lt;/td&gt;
&lt;td&gt;Service mesh&lt;/td&gt;
&lt;td&gt;Envoy proxies&lt;/td&gt;
&lt;td&gt;Active (new)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
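&lt;p&gt;The circuit-breaker pattern itself is unchanged across these eras -- only where it lives has moved, from library code to the mesh. A minimal illustrative Python sketch (not the Hystrix or Resilience4j implementation):&lt;/p&gt;

```python
# Minimal circuit breaker sketch (illustrative, not Hystrix/Resilience4j).
# After `threshold` consecutive failures the circuit opens and calls fail
# fast until a cooldown elapses, protecting the struggling dependency.
import time

class CircuitOpenError(Exception):
    pass

class CircuitBreaker:
    def __init__(self, threshold=3, cooldown_seconds=30.0):
        self.threshold = threshold
        self.cooldown = cooldown_seconds
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at >= self.cooldown:
                self.opened_at = None   # half-open: allow one trial call
            else:
                raise CircuitOpenError("failing fast; dependency unhealthy")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0   # any success resets the failure count
        return result
```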

&lt;h3&gt;
  
  
  Stateful Systems and Automated Mitigation
&lt;/h3&gt;

&lt;p&gt;Joseph Lynch's work on &lt;a href="https://www.infoq.com/presentations/netflix-stateful-cache/" rel="noopener noreferrer"&gt;Netflix's stateful systems&lt;/a&gt; demonstrates data-driven reliability engineering. Near-caches handle billions of requests per second at sub-100-microsecond latency. When a KeyValueService experienced unexpected traffic doubling, automated mitigation recovered the system within 5 minutes -- no human intervention required.&lt;/p&gt;

&lt;p&gt;The five principles of chaos engineering remain foundational: (1) build a hypothesis around steady state, (2) vary real-world events, (3) run experiments in production, (4) automate continuous experiments, and (5) minimize blast radius. But the real lesson is that chaos engineering without robust observability is just breaking things. You need the feedback loop.&lt;/p&gt;
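&lt;p&gt;Those five principles translate into a small experiment loop: hypothesize a steady state, inject a failure into a limited blast radius, and verify the steady-state metric holds. An illustrative sketch, with all names hypothetical:&lt;/p&gt;

```python
# Illustrative chaos-experiment loop (names hypothetical): measure a
# steady-state metric, inject a failure into a small blast radius, and
# check that the metric stays within tolerance of the baseline.
def run_chaos_experiment(measure_metric, inject_failure, restore,
                         tolerance=0.05, blast_radius_pct=1):
    baseline = measure_metric()          # 1. steady-state hypothesis
    inject_failure(blast_radius_pct)     # 2/3/5. real failure, small radius
    try:
        during = measure_metric()
        deviation = abs(during - baseline) / baseline
        return {"baseline": baseline, "during": during,
                "passed": tolerance >= deviation}
    finally:
        restore()                        # always roll back the injection

# Toy usage: a "service" whose success rate dips slightly under failure
state = {"injected": False}
def measure():
    return 0.97 if state["injected"] else 0.99
def inject(pct):
    state["injected"] = True
def restore():
    state["injected"] = False

result = run_chaos_experiment(measure, inject, restore)
```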

&lt;h3&gt;
  
  
  The Live Streaming Stress Test
&lt;/h3&gt;

&lt;p&gt;Even Netflix's battle-tested architecture has limits. The Tyson-Paul fight in November 2024 drew &lt;a href="https://www.infoq.com/news/2025/12/netflix-live-streaming-pipeline/" rel="noopener noreferrer"&gt;65 million concurrent streams and 108 million total viewers&lt;/a&gt;, generating 100K+ Downdetector reports. CDN limitations were exposed.&lt;/p&gt;

&lt;p&gt;But microservices isolation proved its worth: on-demand streaming was NOT affected. The failure was contained to the live event. This is resilience architecture working as designed -- not preventing all failures, but preventing failures from cascading.&lt;/p&gt;




&lt;h2&gt;
  
  
  The GraphQL Federation and Platform Engineering Story
&lt;/h2&gt;

&lt;p&gt;Netflix's API evolution tells a story about data-driven platform decisions. The progression: REST ("OpenAPI") to "API.next" to "DNA" (GraphQL-like) to &lt;a href="https://www.apollographql.com/blog/redefining-api-strategy-why-netflix-platform-engineering-chose-federated-graphql" rel="noopener noreferrer"&gt;Federated GraphQL with the DGS Framework&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Today, 250+ Domain Graph Services maintained by 200+ teams compose a unified API graph. The gateway processes thousands of queries per second with sub-100ms response times and query planning overhead under 10ms.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight java"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Netflix DGS Framework - Domain Graph Service&lt;/span&gt;
&lt;span class="nd"&gt;@DgsComponent&lt;/span&gt;
&lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="kd"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ShowsDataFetcher&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
    &lt;span class="nd"&gt;@DgsQuery&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Show&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;shows&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nd"&gt;@InputArgument&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="n"&gt;titleFilter&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Each team owns their domain's data fetchers&lt;/span&gt;
        &lt;span class="c1"&gt;// Composed into unified supergraph via federation&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;showsService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getShows&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;titleFilter&lt;/span&gt;&lt;span class="o"&gt;);&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;

    &lt;span class="nd"&gt;@DgsData&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;parentType&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"Show"&lt;/span&gt;&lt;span class="o"&gt;,&lt;/span&gt; &lt;span class="n"&gt;field&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"reviews"&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;public&lt;/span&gt; &lt;span class="nc"&gt;List&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nc"&gt;Review&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;reviews&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;DgsDataFetchingEnvironment&lt;/span&gt; &lt;span class="n"&gt;dfe&lt;/span&gt;&lt;span class="o"&gt;)&lt;/span&gt; &lt;span class="o"&gt;{&lt;/span&gt;
        &lt;span class="nc"&gt;Show&lt;/span&gt; &lt;span class="n"&gt;show&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dfe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getSource&lt;/span&gt;&lt;span class="o"&gt;();&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;reviewsService&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getReviewsForShow&lt;/span&gt;&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="n"&gt;show&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="na"&gt;getId&lt;/span&gt;&lt;span class="o"&gt;());&lt;/span&gt;
    &lt;span class="o"&gt;}&lt;/span&gt;
&lt;span class="o"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h3&gt;
  
  
  Java at Netflix Scale
&lt;/h3&gt;

&lt;p&gt;Netflix runs &lt;a href="https://www.infoq.com/presentations/netflix-java/" rel="noopener noreferrer"&gt;2,800 Java applications&lt;/a&gt; with approximately 1,500 internal libraries. Their migration from Java 8 to Java 17 delivered &lt;strong&gt;20% better CPU usage with zero code changes&lt;/strong&gt; -- a data-driven validation that justified the migration effort across all 2,800 applications.&lt;/p&gt;

&lt;p&gt;Netflix engineers call Java 21 virtual threads "the most exciting Java feature since lambdas," with the strongest results in Tomcat thread pools and GraphQL query execution. However, gRPC worker pools showed a performance decrease. This is data-driven decision making in action -- adopt where the numbers support it, hold where they do not.&lt;/p&gt;
&lt;h3&gt;
  
  
  The Platform Engineering Flywheel
&lt;/h3&gt;

&lt;p&gt;Netflix's workflow orchestrator &lt;a href="https://netflixtechblog.com/100x-faster-how-we-supercharged-netflix-maestros-workflow-engine-028e9637f041" rel="noopener noreferrer"&gt;Maestro&lt;/a&gt; handles hundreds of thousands of workflows and 2 million jobs per day, achieving a 100x performance improvement via an actor model combined with Java 21 virtual threads. Their incremental processing with Apache Iceberg reduced costs to 10% of the original pipeline for some workflows while improving data freshness from daily to hourly.&lt;/p&gt;
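&lt;p&gt;The essence of incremental processing is watermark tracking: each run touches only the partitions that arrived since the last run, instead of recomputing the full table. An illustrative sketch -- this is not the Maestro or Iceberg API, and the function names are hypothetical:&lt;/p&gt;

```python
# Illustrative incremental-processing sketch (not the Maestro/Iceberg API).
# Only partitions newer than the stored watermark are processed each run,
# which is where the savings over full recomputation come from.
def incremental_run(partitions, watermark, process):
    """partitions: dict mapping partition timestamp to rows.
    Returns the new watermark and how many partitions were processed."""
    new_parts = {ts: rows for ts, rows in partitions.items() if ts > watermark}
    for ts in sorted(new_parts):
        process(new_parts[ts])
    new_watermark = max(new_parts, default=watermark)
    return new_watermark, len(new_parts)

processed = []
table = {1: ["a"], 2: ["b"], 3: ["c"]}
wm, count = incremental_run(table, watermark=1, process=processed.extend)
```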

&lt;p&gt;The container platform &lt;a href="https://netflix.github.io/" rel="noopener noreferrer"&gt;Titus&lt;/a&gt; launches 1M+ containers per week. Spinnaker supports 4,000+ deploys per day. Netflix spends &lt;a href="https://www.infoq.com/presentations/ips-maestro-iceberg/" rel="noopener noreferrer"&gt;$150 million annually&lt;/a&gt; on compute and storage for data pipelines alone.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Platform&lt;/th&gt;
&lt;th&gt;Purpose&lt;/th&gt;
&lt;th&gt;Scale&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;DGS Framework&lt;/td&gt;
&lt;td&gt;GraphQL Federation&lt;/td&gt;
&lt;td&gt;250+ services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maestro&lt;/td&gt;
&lt;td&gt;Workflow orchestration&lt;/td&gt;
&lt;td&gt;2M jobs/day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Metaflow&lt;/td&gt;
&lt;td&gt;ML infrastructure&lt;/td&gt;
&lt;td&gt;3,000+ projects at Netflix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Titus&lt;/td&gt;
&lt;td&gt;Container management&lt;/td&gt;
&lt;td&gt;1M+ containers/week&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Spinnaker&lt;/td&gt;
&lt;td&gt;Continuous delivery&lt;/td&gt;
&lt;td&gt;4,000+ deploys/day&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Atlas&lt;/td&gt;
&lt;td&gt;Telemetry&lt;/td&gt;
&lt;td&gt;17B metrics/day&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The open-source flywheel is deliberate. DGS, Maestro, Metaflow (used by hundreds of companies for ML), and Spinnaker create external contributions that flow back into Netflix's platform investment.&lt;/p&gt;

&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://www.infoq.com/presentations/ips-maestro-iceberg/" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fres.infoq.com%2Fpresentations%2Fips-maestro-iceberg%2Fen%2Fcard_header_image%2Fjun-hee-twitter-card-1738146151045.jpg" height="auto" class="m-0"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://www.infoq.com/presentations/ips-maestro-iceberg/" rel="noopener noreferrer" class="c-link"&gt;
            Efficient Incremental Processing with Netflix Maestro and Apache Iceberg - InfoQ
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            Jun He discusses how to use an IPS to build more reliable, efficient, and scalable data pipelines, unlocking new data processing patterns.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
            &lt;img alt="favicon" class="c-embed__favicon m-0 mr-2 radius-0" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn.infoq.com%2Fstatics_s2_20260319113822%2Ffavicon.ico"&gt;
          infoq.com
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;








&lt;h2&gt;
  
  
  What NOT to Copy: The Anti-Patterns
&lt;/h2&gt;

&lt;p&gt;This is the most important section of this article.&lt;/p&gt;

&lt;p&gt;"Netflix's architecture is for Netflix's org chart, not your startup." A 10-person team with 50 microservices creates operational overhead that destroys velocity. Netflix's 1,000+ microservices reflect a 10,000+ person engineering organization. Conway's Law is not a suggestion -- it is a constraint.&lt;/p&gt;

&lt;h3&gt;
  
  
  The "Don't Stream All the Things" Warning
&lt;/h3&gt;

&lt;p&gt;Netflix explicitly warns against universal stream processing. Their migration from batch to streaming documented the &lt;a href="https://www.infoq.com/articles/netflix-migrating-stream-processing/" rel="noopener noreferrer"&gt;pioneer tax&lt;/a&gt;: increased on-call burden, JAR hell between Flink and Netflix OSS libraries, and complex failure recovery. Streaming failures must be addressed immediately -- unlike batch, where you can simply re-run the job.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Cargo-Culting Trap
&lt;/h3&gt;

&lt;p&gt;Three patterns I see teams consistently get wrong:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Chaos engineering without observability.&lt;/strong&gt; You break things but cannot learn from failures. Invest in monitoring first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microservices without a platform team.&lt;/strong&gt; Every team reinvents deployment, monitoring, and configuration. The overhead kills you.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Building on deprecated Netflix OSS.&lt;/strong&gt; Adopting Hystrix, Ribbon, or Zuul 1.x in 2025+ creates immediate technical debt. Use Resilience4j, Spring Cloud Load Balancer, and Spring Cloud Gateway instead.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Even Netflix stumbled at their own game. The Tyson-Paul fight generated 100K+ Downdetector reports, proving that on-demand architecture does not automatically translate to live event capability.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Right Approach: Adopt Patterns, Not Tools
&lt;/h3&gt;

&lt;p&gt;Start with a monolith. Extract services when pain points emerge organically. Prioritize in this order: (1) observability first, (2) experimentation platform, (3) event-driven communication, (4) microservices only when needed.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Pattern&lt;/th&gt;
&lt;th&gt;Adopt When&lt;/th&gt;
&lt;th&gt;Skip When&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Event-driven (Kafka)&lt;/td&gt;
&lt;td&gt;Multiple teams need async communication&lt;/td&gt;
&lt;td&gt;Single team, synchronous is fine&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Stream processing (Flink)&lt;/td&gt;
&lt;td&gt;Real-time adds measurable business value&lt;/td&gt;
&lt;td&gt;Batch latency is acceptable&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;A/B testing platform&lt;/td&gt;
&lt;td&gt;10+ experiments/quarter&lt;/td&gt;
&lt;td&gt;Fewer than 5 experiments/year&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Chaos engineering&lt;/td&gt;
&lt;td&gt;Running 50+ microservices in production&lt;/td&gt;
&lt;td&gt;Fewer than 10 services&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GraphQL Federation&lt;/td&gt;
&lt;td&gt;5+ teams need API ownership&lt;/td&gt;
&lt;td&gt;Single API team&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Data Mesh&lt;/td&gt;
&lt;td&gt;Multiple data domains with different owners&lt;/td&gt;
&lt;td&gt;Centralized data team&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
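&lt;p&gt;The adopt/skip thresholds above can be sketched as a rough decision helper. This is illustrative only: the pattern names and threshold numbers come from the table, but the function itself is a hypothetical convenience, not a Netflix tool.&lt;/p&gt;

```python
# Illustrative sketch: encode the "Adopt When" thresholds from the table above.
# The numbers (10+ experiments/quarter, 50+ microservices, 5+ teams) come from
# the table; the helper itself is hypothetical.

THRESHOLDS = {
    "event_driven":       lambda ctx: ctx.get("teams_needing_async", 0) > 1,
    "stream_processing":  lambda ctx: ctx.get("realtime_has_business_value", False),
    "ab_platform":        lambda ctx: ctx.get("experiments_per_quarter", 0) >= 10,
    "chaos_engineering":  lambda ctx: ctx.get("microservices_in_prod", 0) >= 50,
    "graphql_federation": lambda ctx: ctx.get("teams_owning_apis", 0) >= 5,
    "data_mesh":          lambda ctx: ctx.get("data_domains_with_owners", 0) > 1,
}

def should_adopt(pattern: str, context: dict) -> bool:
    """Return True if the context clears the table's 'Adopt When' bar."""
    return THRESHOLDS[pattern](context)

# A 10-person startup running a monolith clears none of the bars:
startup = {"teams_needing_async": 1, "experiments_per_quarter": 2,
           "microservices_in_prod": 1, "teams_owning_apis": 1}
print([p for p in THRESHOLDS if should_adopt(p, startup)])  # → []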




&lt;h2&gt;
  
  
  Conclusion: Building Your Data-Driven Architecture
&lt;/h2&gt;

&lt;p&gt;Netflix's power is not scale. It is the feedback loop between data and decisions.&lt;/p&gt;

&lt;p&gt;They never did a "big rewrite." Every architectural evolution -- from monolith to microservices, from batch to streaming, from REST to GraphQL Federation, from Hystrix to Envoy -- was measured, validated against production data, and rolled out incrementally. The Java 17 migration happened because they measured a 20% CPU improvement. Streaming SQL replaced the Flink DataStream API because they measured 1,200 new processors built in a single year by non-infrastructure teams.&lt;/p&gt;

&lt;p&gt;Any organization can start building this feedback loop with three pillars:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Instrument everything.&lt;/strong&gt; You cannot make data-driven decisions without data. Netflix invests less than 5% of infrastructure costs in observability -- and considers it their highest-leverage architectural investment.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Experiment on everything.&lt;/strong&gt; Build an A/B testing capability. It does not need to handle 450K RPS. It needs to exist so that decisions are driven by evidence, not opinions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Let data drive architecture.&lt;/strong&gt; When Netflix evaluated Neo4j vs. Cassandra for their distributed graph, they measured at scale and chose the tool that survived the data. Do the same with your technology decisions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Pick ONE pattern from this article. Implement it in your current architecture. Measure the result. That is the Netflix way -- not copying their tools, but copying their discipline.&lt;/p&gt;

&lt;p&gt;The best architecture is the one that can prove why it made the choices it did.&lt;/p&gt;




&lt;h2&gt;
  
  
  Resources
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Official Sources:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://netflixtechblog.com/" rel="noopener noreferrer"&gt;Netflix Tech Blog&lt;/a&gt; - Primary source for architecture decisions&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://netflix.github.io/atlas-docs/overview/" rel="noopener noreferrer"&gt;Netflix Atlas Documentation&lt;/a&gt; - Observability platform&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://netflix.github.io/dgs/" rel="noopener noreferrer"&gt;Netflix DGS Framework&lt;/a&gt; - GraphQL Federation&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/Netflix/maestro" rel="noopener noreferrer"&gt;Netflix Maestro on GitHub&lt;/a&gt; - Workflow orchestration (open-source)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Netflix Tech Blog Posts:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/streaming-sql-in-data-mesh-0d83f5a00d08" rel="noopener noreferrer"&gt;Streaming SQL in Data Mesh&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://netflixtechblog.com/how-and-why-netflix-built-a-real-time-distributed-graph-part-1-ingesting-and-processing-data-80113e124acc" rel="noopener noreferrer"&gt;Real-Time Distributed Graph Part 1&lt;/a&gt; and &lt;a href="https://netflixtechblog.medium.com/how-and-why-netflix-built-a-real-time-distributed-graph-part-2-building-a-scalable-storage-layer-ff4a8dbd3d1f" rel="noopener noreferrer"&gt;Part 2&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/its-all-a-bout-testing-the-netflix-experimentation-platform-4e1ca458c15" rel="noopener noreferrer"&gt;The Netflix Experimentation Platform&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/sequential-a-b-testing-keeps-the-world-streaming-netflix-part-1-continuous-data-cba6c7ed49df" rel="noopener noreferrer"&gt;Sequential A/B Testing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://netflixtechblog.com/100x-faster-how-we-supercharged-netflix-maestros-workflow-engine-028e9637f041" rel="noopener noreferrer"&gt;100X Faster Maestro Workflow Engine&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Conference Talks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://current.confluent.io/post-conference-videos-2025/democratising-stream-processing-how-netflix-empowers-teams-with-data-mesh-and-streaming-sql-lnd25" rel="noopener noreferrer"&gt;Democratising Stream Processing - Confluent Current 2025&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.infoq.com/presentations/stream-pipeline-observability/" rel="noopener noreferrer"&gt;Advanced Observability Strategies - QCon London 2025&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.infoq.com/presentations/ips-maestro-iceberg/" rel="noopener noreferrer"&gt;Efficient Incremental Processing with Maestro and Iceberg - QCon SF 2024&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.infoq.com/presentations/netflix-java/" rel="noopener noreferrer"&gt;How Netflix Really Uses Java - QCon SF 2023&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.infoq.com/presentations/microservices-netflix-industry/" rel="noopener noreferrer"&gt;Microservices Retrospective - Adrian Cockcroft, QCon London 2023&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.infoq.com/presentations/netflix-stateful-cache/" rel="noopener noreferrer"&gt;Netflix Stateful Systems - Joseph Lynch, QCon SF 2023&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Books:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Designing Data-Intensive Applications&lt;/em&gt; by Martin Kleppmann (O'Reilly) - Foundational theory for Netflix's data pipeline patterns&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Chaos Engineering: System Resiliency in Practice&lt;/em&gt; by Casey Rosenthal &amp;amp; Nora Jones (O'Reilly) - Written by Netflix chaos engineering pioneers&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Data Mesh&lt;/em&gt; by Zhamak Dehghani (O'Reilly) - The architectural philosophy Netflix adopted&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Microservices Patterns&lt;/em&gt; by Chris Richardson (Manning) - Pattern catalog applicable to Netflix's architecture&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Observability Engineering&lt;/em&gt; by Charity Majors, Liz Fong-Jones &amp;amp; George Miranda (O'Reilly) - Principles behind Netflix's observability stack&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Academic References:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Basiri et al., &lt;a href="https://arxiv.org/pdf/1702.05843" rel="noopener noreferrer"&gt;"Chaos Engineering"&lt;/a&gt;, arXiv 2017 - The original chaos engineering whitepaper from Netflix&lt;/li&gt;
&lt;li&gt;Netflix Research, &lt;a href="https://netflixtechblog.com/a-survey-of-causal-inference-applications-at-netflix-b62d25175e6f" rel="noopener noreferrer"&gt;"A Survey of Causal Inference Applications at Netflix"&lt;/a&gt; - Beyond A/B testing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Did you find this article helpful?&lt;/strong&gt; Follow me for more content on AWS, GenAI, and Cloud Architecture!&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>microservices</category>
      <category>dataengineering</category>
      <category>netflix</category>
    </item>
    <item>
      <title>Amazon Bedrock: From Zero to Production in 30 Minutes</title>
      <dc:creator>David Marcelo Petrocelli</dc:creator>
      <pubDate>Wed, 07 Jan 2026 16:00:03 +0000</pubDate>
      <link>https://dev.to/david_marcelopetrocelli_/amazon-bedrock-from-zero-to-production-in-30-minutes-34a7</link>
      <guid>https://dev.to/david_marcelopetrocelli_/amazon-bedrock-from-zero-to-production-in-30-minutes-34a7</guid>
      <description>&lt;h1&gt;
  
  
  Amazon Bedrock: From Zero to Production in 30 Minutes
&lt;/h1&gt;

&lt;p&gt;If you've been curious about Generative AI but haven't dived in yet, Amazon Bedrock is the easiest way to start. No model training, no GPU management, no ML expertise required—just API calls to state-of-the-art foundation models.&lt;/p&gt;

&lt;p&gt;In this guide, I'll take you from zero to a working application that you can actually deploy to production.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Amazon Bedrock?
&lt;/h2&gt;

&lt;p&gt;Amazon Bedrock is a fully managed service that provides access to foundation models (FMs) from leading AI companies through a unified API. Think of it as "LLMs as a Service."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Available models include:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude 4 &amp;amp; Claude 3.5 (Anthropic)&lt;/strong&gt; - Best for complex reasoning and long documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Titan (Amazon)&lt;/strong&gt; - Cost-effective for general tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Llama 3 (Meta)&lt;/strong&gt; - Open-source performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mistral Large&lt;/strong&gt; - Fast inference, great for code and chat&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Stable Diffusion 3 (Stability AI)&lt;/strong&gt; - Image generation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setting Up Your Environment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Enable Bedrock Models
&lt;/h3&gt;

&lt;p&gt;First, request access to the models you want to use:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to Amazon Bedrock in the AWS Console&lt;/li&gt;
&lt;li&gt;Navigate to "Model access"&lt;/li&gt;
&lt;li&gt;Click "Manage model access"&lt;/li&gt;
&lt;li&gt;Select the models you need (I recommend starting with Claude 3.5 Sonnet or Claude 4 Sonnet)&lt;/li&gt;
&lt;li&gt;Submit the request&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Most models are approved instantly. Some (like Claude 4) may take a few minutes.&lt;/p&gt;
&lt;/blockquote&gt;
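&lt;p&gt;You can verify model access programmatically instead of refreshing the console. A minimal sketch: boto3's control-plane client ("bedrock", distinct from the "bedrock-runtime" client used for inference) exposes &lt;code&gt;list_foundation_models()&lt;/code&gt;; the filtering helper below is my own convenience for this article, not part of the AWS API.&lt;/p&gt;

```python
# Sketch: check which Anthropic models your account can see in a region.
# list_foundation_models() is the real boto3 call on the "bedrock"
# control-plane client; anthropic_model_ids() is a hypothetical helper.

def anthropic_model_ids(models: list) -> list:
    """Pick Anthropic model IDs out of a list_foundation_models response."""
    return [m["modelId"] for m in models
            if m.get("providerName") == "Anthropic"]

def check_access(region: str = "us-east-1") -> list:
    # boto3 is imported here so the filter above stays dependency-free.
    import boto3
    client = boto3.client("bedrock", region_name=region)
    summaries = client.list_foundation_models()["modelSummaries"]
    return anthropic_model_ids(summaries)
```

&lt;p&gt;If a model you requested is missing from the returned list, the access request has not been approved yet.&lt;/p&gt;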

&lt;h3&gt;
  
  
  2. Configure IAM Permissions
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"bedrock:InvokeModel"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"bedrock:InvokeModelWithResponseStream"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:bedrock:*::foundation-model/anthropic.claude-*"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="s2"&gt;"arn:aws:bedrock:*::foundation-model/amazon.titan*"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Install Dependencies
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;boto3 langchain-aws
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Your First Bedrock Application
&lt;/h2&gt;

&lt;p&gt;Let's build a simple text generator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the client
&lt;/span&gt;&lt;span class="n"&gt;bedrock&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;service_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bedrock-runtime&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;region_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;us-east-1&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Generate text using Claude 3.5 Sonnet.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-2023-05-31&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-3-5-sonnet-20241022-v2:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;contentType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;accept&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;response_body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response_body&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="c1"&gt;# Test it
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;generate_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain Kubernetes in 3 sentences for a beginner.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Kubernetes is a system that helps you run and manage applications in containers
across multiple computers automatically. It handles tasks like starting your
applications, restarting them if they crash, and distributing traffic between
them. Think of it as an automated IT team that keeps your applications running
24/7 without manual intervention.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Streaming Responses
&lt;/h2&gt;

&lt;p&gt;For better user experience, stream the response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_text_streaming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Stream text generation for real-time output.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic_version&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bedrock-2023-05-31&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model_with_response_stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-3-5-sonnet-20241022-v2:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;contentType&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;body&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;loads&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;chunk&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bytes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content_block_delta&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;delta&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="c1"&gt;# Use it
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;text_chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;generate_text_streaming&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a haiku about cloud computing&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text_chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
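&lt;p&gt;One more thing before calling this production-ready: Bedrock throttles bursty callers. A minimal retry sketch, assuming AWS-style error codes such as &lt;code&gt;ThrottlingException&lt;/code&gt; (the helper names and backoff values are my own choices, not AWS guidance) -- it retries any callable whose exception carries a retryable code:&lt;/p&gt;

```python
# Sketch: exponential backoff with jitter around a Bedrock call.
# RETRYABLE and with_retries() are hypothetical helpers; the error codes
# are the AWS-style codes a botocore ClientError typically carries.
import random
import time

RETRYABLE = {"ThrottlingException", "ServiceUnavailableException"}

def aws_error_code(exc: Exception) -> str:
    """Extract the error code from a botocore-style ClientError, if any."""
    return getattr(exc, "response", {}).get("Error", {}).get("Code", "")

def with_retries(call, max_attempts: int = 5, base_delay: float = 1.0):
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            if aws_error_code(exc) not in RETRYABLE or attempt == max_attempts - 1:
                raise
            # Exponential backoff scaled by base_delay, with up to 2x jitter:
            time.sleep(base_delay * (2 ** attempt) * (1 + random.random()))
```

&lt;p&gt;Usage would look like &lt;code&gt;with_retries(lambda: generate_text("Explain VPCs"))&lt;/code&gt;. Non-retryable errors (bad model ID, missing access) are re-raised immediately.&lt;/p&gt;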



&lt;h2&gt;
  
  
  Using LangChain for Production Apps
&lt;/h2&gt;

&lt;p&gt;For more complex applications, LangChain provides a cleaner interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_aws&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatBedrock&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SystemMessage&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the model
&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatBedrock&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-3-5-sonnet-20241022-v2:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Simple chat
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a helpful AWS architect.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the best way to set up a VPC?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Building a RAG Application
&lt;/h3&gt;

&lt;p&gt;Retrieval-Augmented Generation (RAG) lets you query your own documents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_aws&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BedrockEmbeddings&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.vectorstores&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.text_splitter&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.prompts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.runnables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RunnablePassthrough&lt;/span&gt;

&lt;span class="c1"&gt;# 1. Initialize embeddings
&lt;/span&gt;&lt;span class="n"&gt;embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BedrockEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;amazon.titan-embed-text-v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 2. Load and split your documents
&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[...]&lt;/span&gt;  &lt;span class="c1"&gt;# Your documents here
&lt;/span&gt;&lt;span class="n"&gt;text_splitter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RecursiveCharacterTextSplitter&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;chunk_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chunk_overlap&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;splits&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text_splitter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 3. Create vector store
&lt;/span&gt;&lt;span class="n"&gt;vectorstore&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;splits&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vectorstore&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;search_kwargs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;

&lt;span class="c1"&gt;# 4. Create RAG chain
&lt;/span&gt;&lt;span class="n"&gt;template&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Answer based on the following context:

Context: {context}

Question: {question}

Answer:&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ChatPromptTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;template&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;rag_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;RunnablePassthrough&lt;/span&gt;&lt;span class="p"&gt;()}&lt;/span&gt;
    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;
    &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 5. Query your documents
&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rag_chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is our refund policy?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
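&lt;p&gt;One refinement to the chain above: the retriever feeds raw &lt;code&gt;Document&lt;/code&gt; objects into the &lt;code&gt;{context}&lt;/code&gt; slot, which stringifies their metadata along with the text. A small formatting helper keeps the prompt clean (the helper and the stand-in class below are mine, purely for illustration):&lt;/p&gt;

```python
class Doc:
    # Minimal stand-in for a LangChain Document, used here so the
    # helper can be demonstrated without any retrieval dependencies
    def __init__(self, page_content, metadata=None):
        self.page_content = page_content
        self.metadata = metadata or {}

def format_docs(docs):
    # Keep only each retrieved chunk's text, separated by blank lines
    return "\n\n".join(d.page_content for d in docs)

print(format_docs([Doc("Refunds within 30 days."), Doc("Contact support first.")]))

# In the real chain, wire it in as:
#   {"context": retriever | format_docs, "question": RunnablePassthrough()}
```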



&lt;h2&gt;
  
  
  Cost Optimization Tips
&lt;/h2&gt;

&lt;p&gt;On-demand Bedrock pricing is metered per input and output token, so cost scales directly with prompt length and response length. Here's how to optimize:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Choose the Right Model
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Use Case&lt;/th&gt;
&lt;th&gt;Recommended Model&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Simple Q&amp;amp;A&lt;/td&gt;
&lt;td&gt;Titan Lite&lt;/td&gt;
&lt;td&gt;$&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;General chat&lt;/td&gt;
&lt;td&gt;Claude 3.5 Haiku&lt;/td&gt;
&lt;td&gt;$$&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex reasoning&lt;/td&gt;
&lt;td&gt;Claude 3.5 Sonnet&lt;/td&gt;
&lt;td&gt;$$$&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Advanced code &amp;amp; reasoning&lt;/td&gt;
&lt;td&gt;Claude 4 Sonnet/Opus&lt;/td&gt;
&lt;td&gt;$$$$&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
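&lt;p&gt;The table above translates into a small helper that estimates the spend for a given call. The per-1K-token prices below are illustrative figures from memory, not authoritative; always confirm them against the current Bedrock pricing page before budgeting:&lt;/p&gt;

```python
# Illustrative on-demand prices in USD per 1K tokens; verify against the
# current Amazon Bedrock pricing page, since these change over time.
PRICING = {
    "anthropic.claude-3-5-haiku-20241022-v1:0":  {"in": 0.0008, "out": 0.004},
    "anthropic.claude-3-5-sonnet-20241022-v2:0": {"in": 0.003,  "out": 0.015},
}

def estimate_cost(model_id, input_tokens, output_tokens):
    # Cost scales linearly with token counts in each direction
    p = PRICING[model_id]
    return input_tokens / 1000 * p["in"] + output_tokens / 1000 * p["out"]

sonnet = "anthropic.claude-3-5-sonnet-20241022-v2:0"
print(round(estimate_cost(sonnet, 2000, 500), 4))  # 0.0135
```

Running the numbers like this before picking a tier makes the trade-off concrete: at these rates, a Haiku call costs roughly a quarter of the equivalent Sonnet call.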

&lt;h3&gt;
  
  
  2. Use Provisioned Throughput for High Volume
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# For production workloads with consistent traffic
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;arn:aws:bedrock:us-east-1:123456789:provisioned-model/my-model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Cache Frequent Responses
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;functools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;lru_cache&lt;/span&gt;

&lt;span class="nd"&gt;@lru_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;maxsize&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;cached_generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_hash&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;generate_text&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_with_cache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;prompt_hash&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hashlib&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;md5&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;hexdigest&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;cached_generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt_hash&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
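&lt;p&gt;A quick way to sanity-check the caching layer is to swap the real Bedrock call for a counting stub and confirm that a repeated prompt only triggers one generation (the stub below is mine, purely for illustration):&lt;/p&gt;

```python
from functools import lru_cache

calls = {"n": 0}

def fake_generate(prompt):
    # Stand-in for the real Bedrock call; counts how often it runs
    calls["n"] += 1
    return "answer to " + prompt

@lru_cache(maxsize=1000)
def cached(prompt):
    return fake_generate(prompt)

cached("What is S3?")
cached("What is S3?")  # served from cache; fake_generate is not called again
print(calls["n"])  # 1
```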



&lt;h2&gt;
  
  
  Security Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Use VPC Endpoints
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight hcl"&gt;&lt;code&gt;&lt;span class="nx"&gt;resource&lt;/span&gt; &lt;span class="s2"&gt;"aws_vpc_endpoint"&lt;/span&gt; &lt;span class="s2"&gt;"bedrock"&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_id&lt;/span&gt;              &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_vpc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;main&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;service_name&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"com.amazonaws.us-east-1.bedrock-runtime"&lt;/span&gt;
  &lt;span class="nx"&gt;vpc_endpoint_type&lt;/span&gt;   &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Interface"&lt;/span&gt;
  &lt;span class="nx"&gt;subnet_ids&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;aws_subnet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;private&lt;/span&gt;&lt;span class="p"&gt;[*].&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;
  &lt;span class="nx"&gt;security_group_ids&lt;/span&gt;  &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;aws_security_group&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;bedrock_endpoint&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
  &lt;span class="nx"&gt;private_dns_enabled&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Enable Model Invocation Logging
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# CloudWatch logging for compliance
&lt;/span&gt;&lt;span class="n"&gt;bedrock_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;bedrock&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;bedrock_client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put_model_invocation_logging_configuration&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;loggingConfig&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cloudWatchConfig&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;logGroupName&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/aws/bedrock/invocations&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;roleArn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;arn:aws:iam::123456789:role/BedrockLogging&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;textDataDeliveryEnabled&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;imageDataDeliveryEnabled&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Use Guardrails
&lt;/h3&gt;

&lt;p&gt;Amazon Bedrock Guardrails help filter harmful content:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bedrock&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;modelId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic.claude-3-5-sonnet-20241022-v2:0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;guardrailIdentifier&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-guardrail-id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;guardrailVersion&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DRAFT&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
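&lt;p&gt;It's worth checking whether the guardrail actually intervened before returning output to the user. To the best of my knowledge, the &lt;code&gt;InvokeModel&lt;/code&gt; response body carries an &lt;code&gt;amazon-bedrock-guardrailAction&lt;/code&gt; field when a guardrail is attached; treat that field name as an assumption and verify it against the current API reference:&lt;/p&gt;

```python
import json

def guardrail_intervened(raw_body):
    # raw_body: the bytes read from response["body"]. The field name
    # "amazon-bedrock-guardrailAction" is assumed from Bedrock docs;
    # it reports whether the guardrail blocked or masked content.
    body = json.loads(raw_body)
    return body.get("amazon-bedrock-guardrailAction") == "INTERVENED"

blocked = json.dumps({"amazon-bedrock-guardrailAction": "INTERVENED"}).encode()
print(guardrail_intervened(blocked))  # True
```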



&lt;h2&gt;
  
  
  Real-World Architecture
&lt;/h2&gt;

&lt;p&gt;Here's a production-ready architecture I use for enterprise clients:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                    ┌─────────────────┐
                    │   CloudFront    │
                    └────────┬────────┘
                             │
                    ┌────────▼────────┐
                    │   API Gateway   │
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
     ┌────────▼───────┐ ┌────▼────┐ ┌───────▼──────┐
     │ Lambda (Chat)  │ │ Lambda  │ │   Lambda     │
     │                │ │ (RAG)   │ │ (Streaming)  │
     └────────┬───────┘ └────┬────┘ └───────┬──────┘
              │              │              │
              └──────────────┼──────────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
     ┌────────▼───────┐ ┌────▼────┐ ┌───────▼──────┐
     │    Bedrock     │ │ OpenSrch│ │  DynamoDB    │
     │ (Foundation M) │ │ (Vector)│ │  (Sessions)  │
     └────────────────┘ └─────────┘ └──────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
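&lt;p&gt;As a concrete sketch of the &lt;code&gt;Lambda (Chat)&lt;/code&gt; box above: an API Gateway proxy event comes in, the handler builds the same Messages-API body used earlier in this post, and forwards it to Bedrock. The event shape and function names are my assumptions about your setup, not a fixed convention:&lt;/p&gt;

```python
import json

MODEL_ID = "anthropic.claude-3-5-sonnet-20241022-v2:0"

def build_body(user_message, max_tokens=1000):
    # Same Anthropic Messages-API request shape used earlier in the post
    return {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": user_message}],
    }

def handler(event, context):
    # API Gateway proxy integration: the prompt arrives in the JSON body.
    # In production, create the boto3 client at module scope for reuse.
    import boto3
    prompt = json.loads(event["body"])["prompt"]
    bedrock = boto3.client("bedrock-runtime")
    response = bedrock.invoke_model(
        modelId=MODEL_ID,
        body=json.dumps(build_body(prompt)),
    )
    answer = json.loads(response["body"].read())["content"][0]["text"]
    return {"statusCode": 200, "body": json.dumps({"answer": answer})}
```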



&lt;h2&gt;
  
  
  What's Next?
&lt;/h2&gt;

&lt;p&gt;Now that you have the basics, here are some directions to explore:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Agents for Bedrock&lt;/strong&gt; - Create autonomous agents that can use tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Knowledge Bases&lt;/strong&gt; - Managed RAG with automatic chunking and embeddings&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fine-tuning&lt;/strong&gt; - Customize models with your own data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-modal&lt;/strong&gt; - Work with images and PDFs using Claude's built-in vision support&lt;/li&gt;
&lt;/ol&gt;
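&lt;p&gt;Of these, Knowledge Bases is the shortest path from the hand-rolled RAG chain above to a managed one: a single &lt;code&gt;retrieve_and_generate&lt;/code&gt; call on the &lt;code&gt;bedrock-agent-runtime&lt;/code&gt; client replaces the splitter, vector store, and prompt chain. A request-shape sketch (the IDs are placeholders, and you should confirm the exact schema against the boto3 docs for your SDK version):&lt;/p&gt;

```python
def build_rag_request(kb_id, model_arn, question):
    # Request shape for bedrock-agent-runtime's retrieve_and_generate:
    # Bedrock handles retrieval, chunking, and prompt assembly for you
    return {
        "input": {"text": question},
        "retrieveAndGenerateConfiguration": {
            "type": "KNOWLEDGE_BASE",
            "knowledgeBaseConfiguration": {
                "knowledgeBaseId": kb_id,
                "modelArn": model_arn,
            },
        },
    }

# Usage (requires AWS credentials and a provisioned knowledge base):
#   client = boto3.client("bedrock-agent-runtime")
#   resp = client.retrieve_and_generate(**build_rag_request(kb_id, model_arn, "What is our refund policy?"))
#   print(resp["output"]["text"])
```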




&lt;p&gt;&lt;em&gt;Have questions about implementing Bedrock in your architecture? Drop a comment below!&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the author:&lt;/strong&gt; David Petrocelli is a Senior Cloud Architect at Caylent, PhD in Computer Science, and University Professor specializing in cloud architecture and generative AI applications.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>amazonbedrock</category>
      <category>generativeai</category>
      <category>python</category>
    </item>
  </channel>
</rss>
