Netflix · Distributed Systems · 17 May 2026
Netflix's Meson orchestrator was handling hundreds of thousands of daily data and ML jobs — and running out of machine. Vertically scaling on AWS had a hard ceiling, and the workflows were doubling in size every year. The only way out was a complete architectural rethink.
- 2M+ jobs/day at peak
- Hundreds of thousands of workflows
- Meson → Maestro 2020
- Open-sourced July 2024
- Horizontal scaling (vs vertical)
- 100K+ jobs in single workflow
The Story
Netflix is not just a streaming platform — it is a data factory that runs thousands of ML pipelines (automated sequences of data processing, model training, and validation jobs that produce the recommendation algorithms and personalization signals driving Netflix's content strategy) and data engineering workflows every single day. Recommendation models, A/B test analyses, content quality pipelines, ad-targeting models — every one of these is a graph of interdependent jobs that must be orchestrated, retried on failure, scheduled on time, and monitored continuously. By 2020, Netflix was running all of this through Meson, an in-house workflow orchestrator built around a single-leader architecture that the team described as having achieved high availability — but at the cost of requiring continuous vertical scaling as usage grew.
📈
The number of workflows in Meson was doubling year over year, and the sizes of individual workflows were growing too, with some containing tens of thousands of interdependent jobs. Vertical scaling on AWS had a hard physical ceiling: you can only get so many CPUs and so much RAM in a single EC2 instance type.
The core problem with Meson was structural. A single-leader architecture (a system design where one node (the leader) is responsible for all decisions and coordination — providing simplicity and consistency but creating a vertical scaling bottleneck as load increases) means all orchestration decisions flow through one machine. For low-throughput systems, this is fine. For a platform running hundreds of thousands of daily ML workflows with projected 100% year-over-year growth, it meant the infrastructure team was perpetually fighting against the ceiling of what the largest available AWS instance type could handle. During peak usage — typically when multiple large training runs coincided with end-of-month reporting pipelines — Meson experienced slowdowns that required on-call engineers to closely monitor the system, especially during off-hours. The system was not broken, but it was clearly approaching the limits of how it was designed.
THE VERTICAL SCALING WALL
AWS instance types top out. In 2020, Netflix's Meson orchestrator was approaching the compute limits of the largest available EC2 instances. The team could keep picking bigger and bigger machines, but the ceiling was visible and the timeline to hit it was predictable. Horizontal scaling — distributing load across many commodity machines — was the only architectural solution that could match indefinitely growing workflow volumes.
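To make the "predictable timeline" concrete, the arithmetic can be sketched as follows. The numbers are hypothetical placeholders, not Netflix figures: the sketch assumes load doubles every year and that the largest available instance offers a fixed multiple of today's required capacity.

```python
import math


def years_until_ceiling(current_load: float, max_capacity: float,
                        annual_growth: float = 2.0) -> float:
    """Years until load growing by `annual_growth`x per year reaches max_capacity."""
    if current_load <= 0 or max_capacity <= 0:
        raise ValueError("load and capacity must be positive")
    if annual_growth <= 1.0:
        raise ValueError("annual_growth must be greater than 1")
    if current_load >= max_capacity:
        return 0.0  # already at or past the ceiling
    return math.log(max_capacity / current_load) / math.log(annual_growth)


# Hypothetical: the biggest instance offers 8x the capacity needed today.
headroom_years = years_until_ceiling(current_load=1.0, max_capacity=8.0)
print(f"{headroom_years:.1f} years of headroom")
```

With yearly doubling, even 8x headroom is used up in about three years, which is why a bigger machine only postpones the problem.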
Problem
Meson's Single Leader Under Load
During peak usage, Meson's single-leader architecture struggled under the combined load of hundreds of thousands of daily jobs. On-call engineers were monitoring the orchestrator during off-hours to prevent cascading delays. The system worked — but it required human attention that scale demands cannot sustain.
Cause
Vertical Scaling Has a Ceiling
Meson's single-leader model meant all orchestration state — job status, dependency tracking, retry logic, scheduling — lived on one machine. As Netflix's DAG (Directed Acyclic Graph — a workflow representation where jobs are nodes and dependencies are directed edges, ensuring no circular dependencies) workloads grew 100% year-over-year, vertical scaling was no longer a strategy; it was a countdown.
Solution
Maestro: Horizontal Orchestration at Any Scale
Netflix designed Maestro from first principles for horizontal scalability. The architecture decomposed orchestration into independent stateless workers, event-driven step execution, and purpose-built state management. Hundreds of thousands of workflows migrated from Meson to Maestro with minimal user disruption, and by 2024 Maestro was open-sourced under Apache 2.0.
Result
2 Million Jobs Per Day, No Ceiling in Sight
Maestro handles 2 million jobs on busy days across hundreds of thousands of workflows, with support for individual workflows containing hundreds of thousands of jobs. Scaling is now horizontal — add more workers, handle more load — without the system design itself becoming a constraint.
Unlike traditional workflow orchestrators that only support Directed Acyclic Graphs (DAGs), Maestro supports both acyclic and cyclic workflows and also includes multiple reusable patterns, including foreach loops, subworkflows, and conditional branches.
Jun He, Natallia Dzenisenka et al., via Netflix Technology Blog
What Makes Maestro Different
Traditional workflow orchestrators like Apache Airflow assume that workflows are DAGs (Directed Acyclic Graphs: graphs whose edges have direction and contain no cycles, so no job can eventually depend on itself). Netflix's data engineering reality is messier: some workflows are genuinely cyclic, some need foreach loops that spawn thousands of child workflow instances, and some need conditional branching that determines entire execution paths at runtime. Maestro was built to support all of these natively within the engine, rather than requiring users to simulate them with workarounds. The foreach pattern is particularly powerful: each iteration is internally treated as a separate workflow instance, scaling identically to any other Maestro workflow. A single foreach over a thousand content items spawns a thousand parallel workflow instances, each tracked, retried, and reported independently.
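The foreach idea described above can be illustrated with a small sketch. This is not Maestro's code: `WorkflowInstance`, `expand_foreach`, and `run_instance` are hypothetical names, and the instances run sequentially here for simplicity, whereas the article describes them running in parallel.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class WorkflowInstance:
    """One foreach iteration, tracked as its own workflow instance (hypothetical model)."""
    instance_id: str
    item: str
    status: str = "PENDING"
    attempts: int = 0


def expand_foreach(parent_id: str, items: list[str]) -> list[WorkflowInstance]:
    """Create one independent instance per item in the foreach."""
    return [WorkflowInstance(f"{parent_id}-{i}", item) for i, item in enumerate(items)]


def run_instance(instance: WorkflowInstance, step: Callable[[str], None],
                 max_attempts: int = 3) -> None:
    """Run one instance with its own retry budget; other instances are unaffected."""
    while instance.attempts < max_attempts:
        instance.attempts += 1
        try:
            step(instance.item)
            instance.status = "SUCCEEDED"
            return
        except Exception:
            instance.status = "FAILED"


failed_once: set[str] = set()


def flaky_step(item: str) -> None:
    # Simulate a transient failure: "title-b" fails on its first attempt only.
    if item == "title-b" and item not in failed_once:
        failed_once.add(item)
        raise RuntimeError("transient failure")


instances = expand_foreach("encode_titles", ["title-a", "title-b", "title-c"])
for inst in instances:
    run_instance(inst, flaky_step)

for inst in instances:
    print(inst.instance_id, inst.status, inst.attempts)
```

Only the instance for `title-b` needs a second attempt; the other two finish on their first try, which is the point of tracking and retrying each iteration separately.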
ℹ️
Workflow-as-a-Service: Abstracting Infrastructure from Users
Maestro is designed as a fully managed workflow-as-a-service for Netflix's data practitioners — data scientists, ML engineers, content producers, and business analysts. Users define their business logic in Docker images, notebooks, bash scripts, SQL, or Python, and Maestro handles scheduling, queuing, dependency resolution, retries, and monitoring. Users never configure infrastructure. The engineering investment lives entirely in the platform layer, freeing thousands of practitioners to focus on what their workflows do rather than how they run.
🎯
Strict SLOs During Traffic Spikes
Maestro is designed to maintain its service level objectives even during spikes in workflow submission traffic. This matters because Netflix's data platform is non-uniform: large content drops, end-of-quarter analyses, and major live events all generate burst workflow submissions that can be orders of magnitude above the baseline. A workflow orchestrator that degrades its own SLOs during the exact moments when reliability matters most is worse than useless.
Netflix's Meson-to-Maestro migration was deliberate about user disruption. The team migrated hundreds of thousands of workflows on behalf of users — users didn't rewrite anything. The platform team built migration tooling, validated that migrated workflows produced equivalent outputs, and performed the migration atomically for each user's pipeline. This is a crucial organizational point: the people running the workflows were data scientists and analysts, not infrastructure engineers. Asking them to relearn an orchestration system and rewrite their pipelines would have consumed months of productive time across the company. By owning the migration completely, the Maestro team enabled a seamless transition that felt, from the user perspective, like a platform upgrade that just worked.
⚠️
The On-Call Tax of Vertical Scaling
Meson's vertical scaling strategy created an operational pattern where engineers had to manually monitor the orchestrator during off-hours around peak usage — particularly when end-of-quarter reporting pipelines coincided with large ML training runs. This on-call tax is a leading indicator of approaching architectural limits. When your infrastructure requires human attention to avoid failure during predictable peak events, the architecture is telling you it cannot scale further without change.
The Fix
Architecture: From Single Leader to Distributed Workers
Maestro's architecture replaces Meson's single leader with three independently scalable services. The Workflow Engine manages the full lifecycle from definition through execution; it is the core state machine that tracks which steps have completed and which are ready to run. The Step Runtime Workers are stateless executors that pick up individual step executions from a queue, run them on the appropriate compute engine (Spark, Trino, Kubernetes, etc.), and report results back. The Signal Service enables event-driven orchestration: workflows can wait for external signals (data availability, upstream pipeline completion) rather than polling or using fixed schedules. Crucially, each of these three layers scales independently. Need more throughput on step execution? Add more workers. Need more scheduling capacity? Scale the engine tier. No single component is the bottleneck.
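The "stateless workers pulling from a queue" pattern can be sketched in a few lines. This is an illustration only, assuming an in-process `queue.Queue` and threads in place of a real message bus and separate worker processes; `worker` and `run_steps` are hypothetical names, and doubling the payload stands in for running a step on a compute engine.

```python
import queue
import threading


def worker(step_queue: queue.Queue, results: dict, lock: threading.Lock) -> None:
    """Stateless executor: keeps no workflow state between steps."""
    while True:
        step = step_queue.get()
        if step is None:              # sentinel: no more work for this worker
            step_queue.task_done()
            return
        step_id, payload = step
        outcome = payload * 2         # stand-in for running the step on a compute engine
        with lock:
            results[step_id] = outcome  # stand-in for reporting the result back
        step_queue.task_done()


def run_steps(steps: dict, num_workers: int) -> dict:
    """Fan the given steps out to `num_workers` workers and collect the results."""
    step_queue: queue.Queue = queue.Queue()
    results: dict = {}
    lock = threading.Lock()
    threads = [threading.Thread(target=worker, args=(step_queue, results, lock))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for item in steps.items():
        step_queue.put(item)
    for _ in threads:
        step_queue.put(None)          # one sentinel per worker, queued after all steps
    step_queue.join()
    for t in threads:
        t.join()
    return results


steps = {f"step-{i}": i for i in range(10)}
# The result does not depend on how many workers are running.
assert run_steps(steps, num_workers=2) == run_steps(steps, num_workers=8)
```

Because the workers hold no state of their own, changing `num_workers` changes throughput but not the outcome, which is the property that makes this tier horizontally scalable.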
- 2M+ — Jobs executed on peak days — a figure that would have been impossible on Meson's single-leader architecture without continuous vertical scaling emergencies
- 100K+ — Jobs supported within a single workflow — enabled by foreach's spawning of independent sub-workflow instances rather than a flat DAG with 100K nodes
- ∞ scale — Horizontal scaling: add more worker nodes to handle more load — the ceiling that constrained Meson is architecturally eliminated in Maestro
- ~0 disruption — User-visible disruption during migration — the platform team migrated hundreds of thousands of workflows on behalf of users with minimal interruption
// Simplified Maestro workflow definition example
// Users define their logic; Maestro handles all execution mechanics
{
  "workflow_id": "ml_model_training_pipeline",
  "steps": [
    {
      "id": "data_prep",
      "type": "spark",                            // run on Netflix's Spark cluster
      "image": "netflix/etl:v2.3",
      "dependencies": []                          // no upstream dependencies, so it runs first
    },
    {
      "id": "feature_engineering",
      "type": "foreach",                          // Maestro native pattern: spawn parallel sub-workflows
      "items": "${data_prep.output.segments}",    // iterate over upstream output
      "steps": [
        // each item spawns its own workflow instance; scales to thousands
        { "id": "compute_features", "type": "spark" }
      ],
      "dependencies": ["data_prep"]
    },
    {
      "id": "model_train",
      "type": "kubernetes",                       // different compute engine; Maestro routes it
      "image": "netflix/trainer:v1.1",
      "dependencies": ["feature_engineering"]
    }
  ],
  // Signal-based trigger: run when upstream data is ready, not on a fixed clock
  "trigger": { "type": "signal", "signal_name": "daily_data_ready" }
}
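To show how the `dependencies` fields in a definition like the one above determine execution order, here is a minimal sketch using Python's standard `graphlib` module. It covers only the acyclic case; Maestro's support for cyclic workflows is outside what this sketch models. `execution_waves` is a hypothetical helper name.

```python
from graphlib import TopologicalSorter

# Step id -> list of upstream step ids, copied from the example definition.
dependencies = {
    "data_prep": [],
    "feature_engineering": ["data_prep"],
    "model_train": ["feature_engineering"],
}


def execution_waves(deps: dict[str, list[str]]) -> list[list[str]]:
    """Group steps into waves; every step in a wave has all its dependencies done."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()                  # raises graphlib.CycleError on a cyclic graph
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # steps whose dependencies are all complete
        waves.append(ready)
        sorter.done(*ready)                  # mark them finished to unlock the next wave
    return waves


print(execution_waves(dependencies))
# [['data_prep'], ['feature_engineering'], ['model_train']]
```

Steps that land in the same wave have no dependency on each other and could be handed to workers in parallel.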
ℹ️
Why Not Airflow, Temporal, or Conductor?
The Netflix team evaluated off-the-shelf alternatives before building Maestro. Apache Airflow lacked native support for cyclic workflows and struggled with Maestro's target scale. Netflix Conductor was an option but offered more state-engine features than required, and its overhead was disproportionate to the need. Temporal was optimized for inter-process orchestration via external service calls — at Maestro's million-tasks-per-day scale with many long-running workflows, coupling the DAG engine to an external service call introduced unnecessary reliability weak spots. The team concluded that for their specific requirements — lightweight state transitions at massive scale — a purpose-built engine was the right call.
THE OPEN SOURCE DECISION
In July 2024, Netflix open-sourced Maestro under the Apache 2.0 license. The decision reflected confidence in the system's production maturity after 4+ years of internal operation and Netflix's long tradition of contributing infrastructure tooling to the engineering community. Maestro joins a list of Netflix open-source projects — Chaos Monkey, Eureka, Hystrix — that have shaped how the industry thinks about distributed systems. The GitHub repository includes the full Java 21 + Gradle codebase and curl-based quickstart examples.
The signal service is one of Maestro's most operationally valuable features. In traditional time-based scheduling, a pipeline runs at midnight regardless of whether the data it depends on is ready. If upstream processing is slow, the pipeline waits, fails, or wastes compute resources scanning for data that isn't there yet. Signal-based triggers invert this: Maestro listens for an event published when upstream data is confirmed ready, and only then starts the dependent workflow. For Netflix's data platform, where hundreds of pipelines have complex dependency chains on upstream tables and streams, this eliminates entire categories of pipeline failures caused by timing assumptions that don't hold under variable upstream load.
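The inversion from clock to signal can be sketched with a tiny in-memory registry. This is an assumption-laden illustration, not Maestro's API: `SignalService`, `subscribe`, and `publish` are hypothetical names, and the signal and workflow names are borrowed from the example definition earlier in this article.

```python
from collections import defaultdict


class SignalService:
    """In-memory stand-in: maps signal names to the workflows waiting on them."""

    def __init__(self) -> None:
        self._subscribers: dict[str, list[str]] = defaultdict(list)
        self.started: list[str] = []   # workflows launched so far, in order

    def subscribe(self, signal_name: str, workflow_id: str) -> None:
        """Register a workflow that should start when the signal arrives."""
        self._subscribers[signal_name].append(workflow_id)

    def publish(self, signal_name: str) -> list[str]:
        """Announce that a signal fired; start and return every waiting workflow."""
        launched = list(self._subscribers.get(signal_name, []))
        self.started.extend(launched)
        return launched


signals = SignalService()
signals.subscribe("daily_data_ready", "ml_model_training_pipeline")

# Nothing starts at midnight just because the clock says so.
assert signals.publish("some_other_signal") == []

# The workflow starts only once the upstream data is announced as ready.
assert signals.publish("daily_data_ready") == ["ml_model_training_pipeline"]
```

The dependent workflow never runs against missing data, because nothing starts it until the producer of that data says so.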
✅
Zero-Downtime Migration Strategy
Maestro's team built tooling to migrate Meson workflows in place, validating equivalence of outputs before switching. Each workflow's migration was atomic: Meson and Maestro ran side-by-side during transition, with a configurable percentage router determining which system executed any given workflow instance. Users saw no interruption to their pipelines during the migration period.
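One common way to build a percentage router like the one described is deterministic hash bucketing, sketched below. The article does not say how Netflix implemented the router, so the hashing scheme, the `route` function, and keying on the workflow id are all assumptions made for illustration.

```python
import hashlib


def route(workflow_id: str, maestro_percent: int) -> str:
    """Send a stable `maestro_percent` share of workflows to Maestro, the rest to Meson."""
    if not 0 <= maestro_percent <= 100:
        raise ValueError("maestro_percent must be between 0 and 100")
    digest = hashlib.sha256(workflow_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100   # stable bucket in [0, 100)
    return "maestro" if bucket < maestro_percent else "meson"


workflow_ids = [f"workflow-{i}" for i in range(1000)]
on_maestro = sum(route(w, 50) == "maestro" for w in workflow_ids)
print(f"{on_maestro} of {len(workflow_ids)} workflows routed to Maestro at 50%")
```

Because the bucket depends only on the workflow id, a given workflow always lands on the same system at a given percentage, and raising the percentage only ever moves workflows from Meson to Maestro, never back.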
🔌
Pluggable Compute Backends
Maestro's step runtime workers are designed to be compute-engine agnostic. A step definition specifies the engine (Spark, Trino, Kubernetes, Python) and the business logic image; the worker handles routing, execution, and result reporting without the workflow engine knowing which compute platform ran the job. This loose coupling means Netflix can add new compute backends — or retire old ones — without touching workflow definitions.
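The loose coupling described here resembles a registry-and-dispatch pattern, which can be sketched as follows. The names `register_backend` and `execute_step` are hypothetical, and the runner bodies are placeholders that return a string instead of launching real jobs.

```python
from typing import Callable, Dict

Runner = Callable[[dict], str]
_BACKENDS: Dict[str, Runner] = {}   # engine name -> function that runs a step on it


def register_backend(engine: str) -> Callable[[Runner], Runner]:
    """Decorator that plugs a compute backend into the registry."""
    def decorator(fn: Runner) -> Runner:
        _BACKENDS[engine] = fn
        return fn
    return decorator


@register_backend("spark")
def run_spark(step: dict) -> str:
    return f"spark ran {step['id']}"        # placeholder for submitting a Spark job


@register_backend("kubernetes")
def run_kubernetes(step: dict) -> str:
    return f"kubernetes ran {step['id']}"   # placeholder for launching a container


def execute_step(step: dict) -> str:
    """Route a step to its engine; the caller never names a concrete backend."""
    try:
        runner = _BACKENDS[step["type"]]
    except KeyError:
        raise ValueError(f"no backend registered for engine {step['type']!r}") from None
    return runner(step)


print(execute_step({"id": "data_prep", "type": "spark"}))
print(execute_step({"id": "model_train", "type": "kubernetes"}))
```

Adding a new engine means registering one more function; `execute_step` and existing step definitions stay unchanged.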
Architecture
Meson's architecture concentrated all orchestration intelligence in a single process: state management, scheduling, dependency tracking, retry logic, and compute routing all ran in one JVM. This was comprehensible and debuggable but fundamentally unscalable. Maestro separates these concerns across three independently deployable services, with Apache Kafka (a distributed event streaming platform used by Netflix for high-throughput, fault-tolerant message passing between Maestro's internal services and to external downstream consumers) as the message bus connecting them and PostgreSQL (the relational database Maestro uses for durable workflow state storage, chosen for its ACID guarantees which are essential for exactly-once execution semantics) as the durable state store.
[Diagram: Meson (Before): Single-Leader Orchestration Architecture. Interactive version available on TechLogStack.]
[Diagram: Maestro (After): Horizontally Scalable Distributed Orchestration. Interactive version available on TechLogStack.]
FOREACH: SCALING TO 100K JOBS IN ONE WORKFLOW
Maestro's foreach step is the capability that makes truly massive workflows possible. When a foreach iterates over 10,000 content segments, Maestro internally creates 10,000 independent sub-workflow instances, each scheduled, executed, monitored, and retried independently. From the user's perspective, it's a single foreach loop. From Maestro's perspective, it's 10,000 parallel workflow instances scaling identically to any other workflow in the system. This is hard to achieve in flat DAG systems, where a single graph of 10,000 nodes can overwhelm the scheduler.
✅
Exactly-Once Execution: The Deduplication Guarantee
Maestro's scheduler includes explicit deduplication logic — even if the scheduler fires a trigger multiple times for the same workflow (due to retries or race conditions), Maestro guarantees the workflow is executed exactly once. This is a critical property for ML training pipelines and financial reporting workflows where duplicate execution could produce incorrect results or waste significant compute resources. The guarantee is implemented via PostgreSQL's ACID transactions on workflow state transitions.
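A transactional deduplication check of this kind can be sketched with a uniqueness constraint. The sketch uses Python's built-in `sqlite3` as a stand-in for PostgreSQL so that it runs anywhere; the table layout, the `trigger_id` column, and the function names are assumptions for illustration rather than Maestro's schema.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with one row allowed per (workflow, trigger) pair."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE workflow_runs ("
        " workflow_id TEXT NOT NULL,"
        " trigger_id TEXT NOT NULL,"
        " PRIMARY KEY (workflow_id, trigger_id))"
    )
    return conn


def try_start(conn: sqlite3.Connection, workflow_id: str, trigger_id: str) -> bool:
    """Return True only for the first caller to claim this (workflow, trigger) pair."""
    with conn:  # one transaction: commits on success, rolls back on error
        cursor = conn.execute(
            "INSERT OR IGNORE INTO workflow_runs (workflow_id, trigger_id) VALUES (?, ?)",
            (workflow_id, trigger_id),
        )
        return cursor.rowcount == 1   # 0 rows means the run was already claimed


store = open_store()
first = try_start(store, "ml_model_training_pipeline", "2026-05-17")
duplicate = try_start(store, "ml_model_training_pipeline", "2026-05-17")
next_day = try_start(store, "ml_model_training_pipeline", "2026-05-18")
print(first, duplicate, next_day)
```

The second call for the same trigger is rejected by the primary key, so a retried or racing trigger cannot start a second run. In PostgreSQL the equivalent statement is `INSERT ... ON CONFLICT DO NOTHING`.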
📡
External Event Publishing: Beyond Netflix
Maestro publishes execution events to external messaging systems — Kafka, SNS, SQS — enabling downstream consumers to react to workflow state changes. A data warehouse system can start preparing tables when a pipeline completes. A monitoring system triggers alerts on critical workflow failures. An analytics dashboard updates in real time. This event bus design transforms Maestro from an orchestrator into a first-class component of Netflix's data platform event fabric.
Lessons
The story of Maestro is ultimately a story about recognizing architectural limits before they become production disasters, and building the replacement with the same care that you'd apply to user-facing products. Platform infrastructure is a product too.
- 01. Single-leader architectures have a ceiling. They are simple to build and reason about, which makes them excellent starting points. But when your workload grows faster than vertical scaling can accommodate, the architecture itself becomes a constraint. Identify the ceiling early and plan the horizontal migration before you're forced to do it under pressure.
- 02. Platform migrations succeed when the platform team owns the migration completely, not when they ask users to rewrite their workloads. Netflix migrated hundreds of thousands of Meson workflows to Maestro on behalf of their users. This is the engineering investment that makes a platform adoption feel like a capability upgrade rather than a tax.
- 03. Signal-based triggers (workflow start conditions driven by events like 'upstream data is available' rather than fixed clock times) eliminate entire categories of timing-based failures in data pipelines. If your pipelines fail regularly because upstream data isn't ready when the cron fires, replace the clock with a signal. Pipelines then start as soon as their inputs are ready rather than at the next scheduled tick, and they stop failing on data that has not arrived yet.
- 04. Build platform capabilities as first-class engine features, not user-land workarounds. Maestro's foreach, subworkflow, and conditional branch are native engine constructs — not clever hacks layered on top of a DAG. When complex patterns are first-class, the engine can optimize them (parallel sub-workflows, independent retry, progress tracking) in ways that workarounds cannot.
- 05. Evaluate off-the-shelf solutions honestly and choose to build only when the custom requirements genuinely justify it. Netflix evaluated Airflow, Temporal, and Conductor before choosing to build Maestro. Their scale, cyclic workflow requirements, and strict SLO needs were genuinely outside what existing tools could provide. Building custom infrastructure for problems that existing tools solve well is waste; building it for problems they genuinely cannot solve is engineering.
WHEN TO BUILD VS BUY
Netflix's evaluation of off-the-shelf orchestrators before building Maestro is a model for how platform teams should approach the build-vs-buy question. They documented specific technical requirements — cyclic workflow support, sub-hourly scheduling at million-job scale, strict SLO maintenance during spikes, horizontal scalability — and evaluated each option against those requirements. Only after finding genuine gaps in available tools did they commit to building. The evaluation process itself clarified the requirements enough to make the eventual design stronger.
ℹ️
The Open Source Moment
Maestro's open-sourcing in July 2024 came after 4+ years of production operation at Netflix scale. This timeline matters: Netflix didn't release an aspirational design or an early prototype. They released a battle-tested system with years of edge cases resolved, failure modes understood, and operational patterns documented. Open-sourcing after production hardening rather than before is a kindness to the community.
Netflix needed a workflow orchestrator that scaled with the universe — so they built one, hit the AWS ceiling, threw it away, and then open-sourced the second one.
TechLogStack — built at scale, broken in public, rebuilt by engineers
This case is a plain-English retelling of publicly available engineering material.