Difficulty Level: 300 - Advanced
TL;DR
- Spotify processes 1 trillion+ events per day through 38,000+ active data pipelines — every play, skip, and save is a signal that feeds back into every product decision
- Discover Weekly generated 100 billion+ streams in its first 10 years using three ML layers: collaborative filtering, NLP, and audio CNNs — now augmented with LLMs via custom Semantic IDs
- Their A/B testing culture runs tens of thousands of experiments/year across 300+ teams, including 520 simultaneous experiments on a single screen — and they measure learning rate (64%), not just win rate (12%)
- Backstage, born as Spotify's internal developer portal, catalogs 2,000+ services and 4,000 data pipelines — and is now used by 3,000+ companies as the CNCF standard
- The real lesson isn't any single tool: it's the tight coupling between organizational design (squads own their services) and technical design (services are independently deployable)
The 1 Trillion Events Question
Spotify hit 713 million monthly active users in Q3 2025. That number looks impressive in a press release and terrifying in a system design meeting.
Scale alone doesn't explain Spotify's success. What matters is that every one of those events — every play, every skip, every playlist add at 2am — feeds directly into product decisions. Not after a quarterly review. In near real-time.
Most companies collect data and build dashboards. Spotify built a closed loop: user behavior shapes the product, the product generates more behavior, and the cycle compounds over 20 years of iteration. In 2024, Spotify posted its first annual profit: €1.1B on €15.6B in revenue. The closed loop is working.
After years of building data systems for enterprise clients and teaching these patterns at university, I've found that the most common mistake teams make is copying Spotify's tools rather than their discipline. In this article I'll break down the actual mechanisms behind their data pipeline, recommendation engine, experimentation culture, and developer platform — and tell you which patterns you can realistically steal.
Prerequisites
- Familiarity with stream processing concepts (Kafka, Pub/Sub, or similar)
- Basic understanding of microservices architecture (service decomposition, database-per-service)
- Experience with A/B testing fundamentals
- Some exposure to ML recommendation systems (collaborative filtering concepts)
What You'll Learn
- How Spotify's event pipeline evolved from self-managed Kafka to GCP Pub/Sub at 3 million events/second
- Why Discover Weekly uses three separate ML layers and what each one contributes
- How their A/B testing culture measures 64% learning rate instead of just win rate
- What Backstage is and why 3,000+ companies adopted it after Spotify open-sourced it
- Which Spotify patterns scale down to your team — and which ones don't
The Numbers Behind 713 Million Users
The scale numbers aren't just impressive — they explain every architectural decision.
| Metric | Value | Context |
|---|---|---|
| Monthly Active Users | 713M (Q3 2025) | Up from 600M in mid-2024 |
| Premium Subscribers | 281M | ~39% of MAU |
| Annual Revenue | €15.6B (2024) | First profitable year |
| Music Catalog | 100M+ tracks | Grows ~60K tracks/day |
| Podcasts | ~7M titles | Second only to Apple |
| Events per day | 1 trillion+ | 1,800+ event types |
| Active data pipelines | 38,000+ | Hourly + daily scheduled |
| Production components | Thousands | 80%+ fleet-managed |
| A/B experiments/year | Tens of thousands | 300+ teams running tests |
| Discover Weekly streams (10yr) | 100 billion+ | 56M new discoveries/week |
This scale didn't emerge from a grand architectural vision. It's the result of 20+ years of small, data-driven decisions — each one measured, validated, and shipped incrementally.
From Monolith to Thousands of Microservices
Spotify started as a monolithic Python application in 2006. By 2010, the codebase had grown to the point where no single team could understand all of it, and deployments required coordinating across multiple squads.
The migration to microservices wasn't a big-bang rewrite. It was driven by a single organizational principle: each squad should be able to deploy independently, without coordinating with other teams. If a team needed another team's sign-off to ship, something was wrong — either in the service design or the org structure.
The Database-Per-Service Pattern
Each Spotify microservice owns its own data store, chosen for the access patterns of that service:
- Cassandra + BigTable: High-speed key-value lookups (user state, session data, real-time features)
- PostgreSQL: Transactional data (payments, account management)
- Google Cloud Storage: Large objects (audio files, model artifacts)
- BigQuery: Analytical queries and data pipelines
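To make the boundary concrete, here is a minimal sketch of the database-per-service pattern. SQLite stands in for Cassandra or PostgreSQL, and the service names and schemas are illustrative, not Spotify's actual services:

```python
import sqlite3

# Each "service" owns a private store; no service queries another's tables
# directly. SQLite stands in for Cassandra/PostgreSQL -- the boundary is the point.
class PlaybackService:
    def __init__(self):
        self.db = sqlite3.connect(":memory:")  # private to this service
        self.db.execute("CREATE TABLE sessions (user_id TEXT, track_id TEXT)")

    def record_play(self, user_id: str, track_id: str) -> None:
        self.db.execute("INSERT INTO sessions VALUES (?, ?)", (user_id, track_id))

    def plays_for(self, user_id: str) -> list[str]:
        rows = self.db.execute(
            "SELECT track_id FROM sessions WHERE user_id = ?", (user_id,)
        ).fetchall()
        return [r[0] for r in rows]

class BillingService:
    def __init__(self):
        self.db = sqlite3.connect(":memory:")  # separate store, separate schema
        self.db.execute("CREATE TABLE invoices (user_id TEXT, amount_eur REAL)")

    def charge(self, user_id: str, amount: float) -> None:
        self.db.execute("INSERT INTO invoices VALUES (?, ?)", (user_id, amount))
```

If BillingService ever needs playback data, it calls PlaybackService's API rather than reading its tables — that separation is what keeps each service independently deployable.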
By 2023, the number of distinct production components had grown to "thousands" — enough that Spotify needed a new abstraction to manage them: Fleet Management.
Fleet Management: Treating Services as a Fleet
The key insight behind Fleet Management is that individual service owners are blind to fleet-wide patterns. If 300 teams each manage their own dependencies, you get 300 different versions of Log4j in production. You can't patch a critical vulnerability in 9 hours by asking each team to update manually.
Fleet Management flips the model: infrastructure defaults to secure and up-to-date, and teams opt out for exceptions (with documented justification).
The results are concrete:
- 300,000+ automated changes merged across the fleet in 3 years
- 7,500 automated changes/week with 75% auto-merged without human review
- Log4j vulnerability: patched to 80% of backend services in 9 hours
- Framework updates: reach 70% of fleet in under 7 days (previously ~200 days)
- 95% of Spotify developers report Fleet Management improved software quality
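The fan-out logic behind those automated changes can be sketched in a few lines. This is a hypothetical simplification — real fleet tooling also builds, tests, opens PRs, and auto-merges — but it shows the core decision: compare every service's pinned versions against a fleet-wide baseline:

```python
# Hypothetical sketch: given a fleet manifest, decide which services need an
# automated dependency-bump change. Names and versions are illustrative.
TARGET_VERSIONS = {"log4j-core": "2.17.1"}  # fleet-wide secure baseline

def services_needing_bump(fleet: dict) -> list[str]:
    """fleet maps service name -> {dependency: pinned version}."""
    stale = []
    for service, deps in fleet.items():
        for dep, target in TARGET_VERSIONS.items():
            if dep in deps and deps[dep] != target:
                stale.append(service)
                break
    return sorted(stale)

fleet = {
    "playlist-api": {"log4j-core": "2.14.0"},
    "search-api": {"log4j-core": "2.17.1"},
    "payments": {"log4j-core": "2.11.2"},
}
print(services_needing_bump(fleet))  # ['payments', 'playlist-api']
```

The point of centralizing this logic is that one baseline change fans out to every stale service at once, instead of 300 teams each discovering the CVE on their own schedule.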
The Data Pipeline: How Every Play Becomes a Signal
Every user interaction at Spotify — a play, a skip, a search, a playlist add — generates an event. Those events are the raw material for every recommendation, every A/B test result, every product decision.
The flow runs from client SDKs emitting events, through Pub/Sub ingestion, into Scio pipelines that land the data in BigQuery.
The Migration from Kafka to GCP Pub/Sub
In 2016-2017, Spotify migrated their event delivery system from self-managed Kafka clusters to Google Cloud Pub/Sub. This wasn't a trivial decision — Kafka was working. But managing Kafka at Spotify's scale required significant operational overhead that distracted from product engineering.
The results after migration:
- Peak throughput scaled from 800,000 to 3,000,000 events/second
- Half a trillion daily ingested events (70 TB compressed)
- Pub/Sub handles 1 trillion requests/day
- BigQuery runs 10 million+ queries and scheduled jobs/month
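At the front of that pipeline, every interaction is wrapped in an event envelope before publishing. The field names below are illustrative, not Spotify's actual schema; with a real Pub/Sub client, the returned bytes would be the `data` argument to a publish call:

```python
import json
import time
import uuid

def make_event(event_type: str, user_id: str, payload: dict) -> bytes:
    """Serialize an interaction event for publishing.

    Hypothetical envelope: field names are illustrative, not Spotify's schema.
    """
    envelope = {
        "event_id": str(uuid.uuid4()),        # dedupe key downstream
        "event_type": event_type,             # one of 1,800+ types, e.g. "track_skip"
        "user_id": user_id,
        "emitted_at_ms": int(time.time() * 1000),
        "payload": payload,
    }
    return json.dumps(envelope).encode("utf-8")

msg = make_event("track_skip", "user-42", {"track_id": "t-99", "position_ms": 31000})
decoded = json.loads(msg)
```

A stable envelope like this is what lets 1,800+ event types share one delivery system: the transport only cares about the outer fields, while each consumer interprets its own payload.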
Scio: Spotify's Open-Source Apache Beam API
Spotify developed Scio, a Scala API for Apache Beam, to process billions of events. It handles both batch and streaming workloads, running on either Dataflow (managed) or Flink (lower-latency) depending on requirements.
Every data endpoint in the platform has:
- Retention policies: data deleted after defined period
- Access controls: squad-level permissions
- Lineage tracking: full trace from source event to derived dataset
- Quality checks: automated alerts for lateness, failures, anomalies
The 38,000+ active pipelines are orchestrated, monitored, and surfaced through Backstage — so any squad can inspect the health of their data at any time.
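A lateness check of the kind listed above can be sketched in a few lines. This is a hypothetical simplification — the real quality checks also cover failures and anomalies — but it shows the basic SLO logic for an hourly pipeline:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical sketch of a pipeline lateness check: a scheduled pipeline is
# flagged if its latest successful run is older than schedule + grace allows.
def is_late(last_success: datetime, schedule_hours: int, grace_hours: int = 1) -> bool:
    deadline = last_success + timedelta(hours=schedule_hours + grace_hours)
    return datetime.now(timezone.utc) > deadline

fresh = datetime.now(timezone.utc) - timedelta(minutes=30)
stale = datetime.now(timezone.utc) - timedelta(hours=5)
print(is_late(fresh, schedule_hours=1))  # False
print(is_late(stale, schedule_hours=1))  # True
```

Run against all 38,000+ pipelines, checks like this are what turn "is my data healthy?" from a Slack question into a dashboard lookup.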
Recommendations at Scale: Discover Weekly Deconstructed
Discover Weekly launched in July 2015 with a simple premise: every Monday morning, 30 personalized songs you've never heard before. In 10 years, it generated 100 billion streams and 56 million new artist discoveries every week.
That impact comes from a three-layer ML architecture, each layer catching different signals:
Layer 1: Collaborative Filtering
Collaborative filtering answers the question: who else listens to what you listen to, and what else do they listen to?
Spotify's approach uses Logistic Matrix Factorization (LMF) on implicit feedback — not explicit star ratings, but behavioral signals:
```python
# Simplified: how Spotify weights implicit feedback signals
# Real implementation uses distributed matrix factorization at scale
SIGNAL_WEIGHTS = {
    "stream_complete": 1.0,    # Listened to 80%+ of song
    "save_to_library": 2.5,    # Strong positive signal
    "add_to_playlist": 2.0,    # Strong positive signal
    "stream_partial": 0.5,     # Weak positive signal
    "skip_after_30s": -0.8,    # Negative signal
    "skip_immediately": -1.5,  # Strong negative signal
}

def compute_interaction_score(events: list[dict]) -> float:
    """
    Compute a weighted interaction score for a user-track pair.
    Used as input to the matrix factorization model.
    """
    score = 0.0
    for event in events:
        signal_type = event["type"]
        weight = SIGNAL_WEIGHTS.get(signal_type, 0.0)
        score += weight
    return max(0.0, score)  # Clamp to non-negative for LMF

# The factorization produces: user_vector @ item_vector = predicted_preference
# Trained via ALS (Alternating Least Squares) on GCP with billions of interactions
```
The training runs on Hendrix, Spotify's ML platform (named after Jimi Hendrix). Hendrix uses Ray for distributed training on GCP, serves 600+ ML practitioners, and handles the full lifecycle from prototype to production.
Layer 2: NLP Analysis
NLP fills in gaps where behavioral data is sparse — for new artists, for niche genres, for tracks uploaded last week.
Spotify runs web crawlers across music blogs, review sites, and social platforms to extract how people describe songs and artists. The output: vector embeddings where songs described with similar language cluster together.
A song described as "dreamy, lo-fi, bedroom pop" clusters with other songs sharing those descriptors — even if no user has yet listened to both.
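The clustering intuition can be sketched with bag-of-descriptor vectors and cosine similarity. Production systems use learned embeddings rather than raw word counts, and the tracks below are invented for illustration:

```python
import math
from collections import Counter

# Toy sketch: represent tracks by bag-of-descriptor vectors and compare with
# cosine similarity. Real NLP pipelines use learned embeddings, not raw counts.
def descriptor_vector(descriptions: list[str]) -> Counter:
    words = []
    for d in descriptions:
        words.extend(d.lower().replace(",", " ").split())
    return Counter(words)

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

track_a = descriptor_vector(["dreamy lo-fi bedroom pop", "dreamy hazy vocals"])
track_b = descriptor_vector(["lo-fi bedroom pop, hazy"])
track_c = descriptor_vector(["aggressive industrial techno"])
print(cosine(track_a, track_b) > cosine(track_a, track_c))  # True
```

The key property: track_a and track_b score as similar without a single shared listener, which is exactly the gap this layer fills for new and niche music.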
Layer 3: Audio CNNs
For truly new content — songs uploaded with no listening history and no web presence — audio analysis is the only signal available.
Convolutional neural networks analyze spectrograms (visual representations of audio). The model learns to detect: tempo, energy, instrumentation, tonality, rhythm patterns. Songs with similar audio characteristics cluster together regardless of metadata.
The LLM Layer (2024-2025)
In 2024, Spotify added a fourth layer: LLMs for contextual recommendations and the AI DJ feature.
The challenge: LLMs don't know Spotify's catalog of 100M tracks. The solution was Semantic IDs — compact token identifiers derived from collaborative-filtering embeddings, generated via RQ-KMeans. The LLM learns to treat these IDs as vocabulary tokens, effectively learning to "speak Spotify."
Outcomes from live experiments:
- 4% increase in listening time from preference-tuned recommendations
- 14% improvement from Llama fine-tuned on Spotify's domain vs. vanilla Llama
- 70% reduction in tool errors for the AI DJ orchestration system
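The idea behind Semantic IDs — residual quantization — can be sketched compactly. A real RQ-KMeans system learns each level's centroids with k-means over collaborative-filtering embeddings; the two-level codebook below is hand-picked purely for illustration:

```python
# Toy residual quantization: map an embedding to a short sequence of token IDs
# by repeatedly picking the nearest centroid and quantizing the residual.
def nearest(vec, centroids) -> int:
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))

def semantic_id(embedding, codebooks) -> tuple:
    tokens, residual = [], list(embedding)
    for centroids in codebooks:  # one codebook per quantization level
        idx = nearest(residual, centroids)
        tokens.append(idx)
        residual = [r - c for r, c in zip(residual, centroids[idx])]
    return tuple(tokens)  # coarse-to-fine token sequence

codebooks = [
    [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],   # coarse level
    [(0.0, 0.0), (0.1, 0.1), (-0.1, 0.1)],  # fine level, applied to the residual
]
print(semantic_id((0.05, 0.95), codebooks))  # (2, 0)
```

Because every track reduces to a few discrete tokens from small vocabularies, the LLM can treat them like ordinary words — which is what "learning to speak Spotify" means in practice.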
A/B Testing Culture: How Spotify Ships Without Breaking Things
Most companies say they have an "experimentation culture." Spotify has metrics to back it up.
300+ teams run experiments. The mobile home screen alone hosted 520 experiments in one year across 58 simultaneous teams. Total experiments run: tens of thousands per year.
The architecture behind this starts with their coordination engine, which manages mutual exclusion between experiments. When 58 teams are simultaneously testing changes to the same screen, you need a system that prevents two experiments from conflicting — and that randomly reshuffles user assignments between experiment runs (the "salt machine").
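The reshuffling mechanism is typically a salted hash. This is a generic sketch of the technique (function names are illustrative, not Spotify's API): a user lands deterministically in a bucket for a given salt, and rotating the salt between experiments produces an independent reshuffle so carryover effects from a previous test can't bias the next one:

```python
import hashlib

# Sketch of salted experiment assignment. Same salt -> stable assignment;
# a new salt -> an independent reshuffle of every user's bucket.
def assign_bucket(user_id: str, salt: str, num_buckets: int = 100) -> int:
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % num_buckets

b1 = assign_bucket("user-42", salt="home-screen-exp-7")
b2 = assign_bucket("user-42", salt="home-screen-exp-7")
b3 = assign_bucket("user-42", salt="home-screen-exp-8")  # likely a different bucket
assert b1 == b2  # deterministic within one experiment
```

Mutual exclusion then becomes a bucket-allocation problem: the coordination engine hands disjoint bucket ranges to experiments that must not overlap on the same surface.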
ABBA to Confidence: Three Generations of Experimentation
Spotify's experimentation platform evolved through three generations:
| Generation | Era | Capability |
|---|---|---|
| ABBA | Early 2010s | Feature flags + basic metrics |
| Experimentation Platform (EP) | 2015-2023 | Full orchestration, metrics catalog, coordination |
| Confidence | 2023+ | Commercial product, Backstage plugin, APIs |
The Metric That Changed Everything
The most important shift in Spotify's experimentation culture wasn't a new platform — it was a new metric: learning rate.
Win rate (the conventional metric) measures what percentage of experiments "succeed." At Spotify, that's ~12%.
Learning rate measures what percentage of experiments produce decision-ready insights — whether the answer is yes, no, or "we need to test something different." That's 64%.
Win rate: 12% (the experiment confirmed our hypothesis)
Learning rate: 64% (the experiment gave us actionable information)
This reframe matters enormously for culture. A team that runs 100 experiments and "wins" 12 shouldn't feel like they failed 88% of the time. Every "failed" experiment that disproves a hypothesis saved months of building the wrong thing.
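The arithmetic behind the two metrics is simple enough to show directly. The outcome labels below are illustrative, not Spotify's internal taxonomy; the counts are chosen to reproduce the 12% / 64% split:

```python
# Classify 100 hypothetical experiment outcomes, then compute both metrics.
outcomes = (
    ["shipped_win"] * 12        # hypothesis confirmed, change shipped
    + ["clear_negative"] * 30   # hypothesis disproved -- still a learning
    + ["redirected"] * 22       # inconclusive, but pointed to a better test
    + ["no_signal"] * 36        # noisy, no decision possible
)

DECISION_READY = {"shipped_win", "clear_negative", "redirected"}

win_rate = outcomes.count("shipped_win") / len(outcomes)
learning_rate = sum(o in DECISION_READY for o in outcomes) / len(outcomes)
print(f"win rate: {win_rate:.0%}, learning rate: {learning_rate:.0%}")
# win rate: 12%, learning rate: 64%
```

Note what the learning-rate denominator punishes: not losing experiments, but underpowered or badly instrumented ones that produce no decision at all.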
Using Confidence for Feature Flags
Spotify open-sourced and commercialized Confidence in August 2023. It's available as a managed service, a Backstage plugin, or via API. Here's what a basic feature flag + A/B test looks like:
```python
from spotify_confidence import Confidence

# Initialize with your project credentials
client = Confidence(client_secret="your-client-secret")

# Resolve a feature flag for a specific user
flag_value = client.resolve_boolean_flag(
    flag="new-home-layout",
    default_value=False,
    evaluation_context={
        "targeting_key": user_id,
        "country": user_country,
        "platform": "ios",
    },
)

if flag_value:
    render_new_home_layout()
else:
    render_legacy_home_layout()

# Track events for analysis
client.track(
    "home-layout-engaged",
    {"user_id": user_id, "session_duration_s": session_seconds},
)
```
The Confidence platform handles user assignment, experiment coordination, statistical analysis, and validity checks automatically. Squads see results in real time without writing SQL.
Backstage: The Developer Portal That Escaped Spotify
By 2019, Spotify had a problem that no amount of engineering talent could solve manually: 280+ teams managing thousands of services, datasets, APIs, and pipelines — with no shared understanding of what existed or who owned it.
The answer was an internal project called "System Z." In March 2020, Spotify open-sourced it as Backstage.
What Backstage Manages at Spotify Today
| Resource Type | Count |
|---|---|
| Backend Services | 2,000+ |
| Websites | 300 |
| Data Pipelines | 4,000 |
| Mobile Features | 200 |
The Software Catalog is the source of truth. Every component has a catalog-info.yaml file in its repo:
```yaml
# catalog-info.yaml
# Every Spotify service has one of these in its repo root
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: discover-weekly-generator
  description: "Weekly batch job generating personalized Discover Weekly playlists"
  annotations:
    github.com/project-slug: spotify/discover-weekly
    backstage.io/techdocs-ref: dir:.
    pagerduty.com/service-id: P2XYZAB
    datadog.com/service-name: discover-weekly-generator
  tags:
    - ml
    - recommendations
    - batch
spec:
  type: service
  lifecycle: production
  owner: recommendations-squad
  system: recommendation-platform
  dependsOn:
    - resource:default/user-feature-store
    - resource:default/track-embedding-store
    - component:default/hendrix-ml-platform
```
The Scaffolder generates new services from golden path templates — templates that include security scanning, observability hooks, CI/CD pipelines, and Backstage registration by default. The "right way" is the easy way.
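For a sense of what a golden path template looks like, here is a trimmed sketch in Backstage's Scaffolder format. The actions `fetch:template`, `publish:github`, and `catalog:register` are standard Scaffolder actions, but every value below (names, skeleton path, org) is hypothetical:

```yaml
# Illustrative golden-path template (trimmed); all values are hypothetical
apiVersion: scaffolder.backstage.io/v1beta3
kind: Template
metadata:
  name: python-service-golden-path
  title: Python Service (Golden Path)
spec:
  owner: platform-team
  type: service
  parameters:
    - title: Service details
      required: [name, owner]
      properties:
        name:
          type: string
        owner:
          type: string
  steps:
    - id: fetch
      action: fetch:template      # skeleton ships with CI/CD, scanning, observability
      input:
        url: ./skeleton
        values:
          name: ${{ parameters.name }}
    - id: publish
      action: publish:github
      input:
        repoUrl: github.com?owner=my-org&repo=${{ parameters.name }}
    - id: register
      action: catalog:register    # the new service appears in the catalog on day one
      input:
        repoContentsUrl: ${{ steps.publish.output.repoContentsUrl }}
        catalogInfoPath: /catalog-info.yaml
```

Because the skeleton bundles the security and observability defaults, a squad gets a compliant, catalog-registered service from a form submission — the "right way" with zero extra steps.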
Outside Spotify
Five years after open-sourcing, Backstage has:
- 3,400+ adopting companies (Expedia, American Airlines, Zalando, Netflix, Twilio, Wayfair)
- 1,600+ open-source contributors
- Donated to the CNCF — now the standard for internal developer portals
- Evolved into Spotify Portal (enterprise SaaS, GA October 2025)
The Squad Model: What Actually Works
The "Spotify Model" — Squads, Tribes, Chapters, Guilds — is the most imitated and most misunderstood organizational pattern in tech.
Here's what the original 2012 whitepaper actually said:
| Unit | Size | Purpose |
|---|---|---|
| Squad | 6-12 people | Full ownership: design, build, test, release, operate |
| Tribe | 40-150 people | Coordination across squads in same product area |
| Chapter | 6-15 specialists | Craft community within a tribe (e.g., all iOS engineers) |
| Guild | Any size | Voluntary community of interest across the company |
The key principle: "Loosely coupled but tightly aligned." Squads move fast independently, but all move in the same strategic direction.
But here's what Henrik Kniberg himself says now: "Don't copy the Spotify model. That's the opposite of what we intended."
Spotify no longer follows the original model exactly — it evolved constantly. The org chart was always secondary to the autonomy principle: if a squad can't deploy independently, something is wrong in the service design or the org design. Fix whichever is broken.
The technical manifestation of squad autonomy is Conway's Law in reverse: design your organization first, and your service architecture will follow. Spotify's thousands of independently deployable microservices exist because thousands of squads have full ownership of them.
What to Steal (and What to Leave Behind)
Here's what's actually worth taking from Spotify's playbook — and what requires Spotify-level scale to justify:
| Pattern | Steal It? | Minimum Scale | Effort |
|---|---|---|---|
| Software Catalog (Backstage) | Yes | 10+ teams | Low — free, CNCF standard |
| Golden path templates (Scaffolder) | Yes | 5+ teams | Medium — template once, scale forever |
| 64% learning rate metric | Yes | Any scale | Low — just change what you measure |
| Feature flags + gradual rollouts | Yes | Any scale | Low — Confidence or LaunchDarkly |
| Fleet automation for dependencies | Yes | 50+ services | Medium — Dependabot + custom automation |
| Squad autonomy principle | Yes (carefully) | 3+ teams | High — org change, not tech change |
| 3-layer recommendation engine | Adapted | 10K+ users | High — need data volume to work |
| GCP Pub/Sub at 3M events/sec | No (yet) | 100M+ events/day | Infrastructure complexity not worth it early |
| Hendrix ML platform | No | 100+ ML practitioners | Overkill; use SageMaker/Vertex AI instead |
The three questions worth asking your team right now:
Can each team deploy independently, without coordinating with other teams? If no, fix the service design or the team structure — but fix it.
Are you measuring learning rate or just win rate? Every experiment that disproves a bad idea is a win. Build a culture that treats it that way.
Does your internal developer portal make the right thing the easy thing? If developers skip security scanning because setting it up is hard, the problem isn't the developers.
Conclusion
Spotify's data-driven architecture didn't emerge from a whiteboard session or a consulting engagement. It emerged from 20 years of building autonomy into every layer of the organization and letting that autonomy produce the architecture.
The event pipeline processes 1 trillion events a day not because Spotify chose GCP Pub/Sub, but because 300+ squads each own their data and ship their pipelines without waiting for a central team.
Discover Weekly recommends music that feels personal not because of any single ML breakthrough, but because a recommendations squad owned that problem for 10 years and had the freedom to experiment every Monday.
Backstage manages 4,000 data pipelines and 2,000 services not because it's technically clever, but because the alternative (no catalog) gets exponentially more painful as you grow.
The tools are available to any company. Most of them are open source or commercially available today. The discipline is what differentiates Spotify — and that part you have to build yourself.
Resources
Official Documentation
- Spotify Engineering Blog — primary source for all technical patterns described here
- Spotify Research — 200+ ML and recommendation papers
- Backstage.io — open source, free, CNCF graduated
- Confidence — Spotify's A/B testing platform, now commercial
Books
- Building Microservices by Sam Newman (O'Reilly, 2nd ed. 2021) — covers squad/service alignment
- Designing Data-Intensive Applications by Martin Kleppmann (O'Reilly, 2017) — event streaming fundamentals
Key Engineering Blog Posts
- Fleet Management at Spotify Part 1
- Data Platform Explained Part II
- Coming Soon: Confidence
- Unleashing ML Innovation with Ray
- Celebrating Five Years of Backstage
Research Papers
- Semantic IDs for Generative Search and Recommendation (NeurIPS 2025)
- Users' Interests are Multi-faceted: Recommendation Models Should Be Too (WSDM 2023)
- Optimizing for the Long-Term Without Delay
Original Squad Model Reference
- Scaling Agile @ Spotify by Henrik Kniberg & Anders Ivarsson (2012)
Did you find this article helpful? Follow me for more content on system design, data engineering, and cloud architecture!