TechLogStack

Posted on May 20 • Originally published at techlogstack.com on May 17

Netflix Hit the AWS Instance Ceiling and Built a Workflow Engine That Scales Forever

#distributedsystems #backend #programming #webdev

2M+ jobs/day at peak — impossible on Meson's single-leader architecture without continuous vertical scaling emergencies
100K+ jobs supported in a single workflow — enabled by foreach's independent sub-workflow instances
Horizontal scaling — add more worker nodes to handle more load; the vertical ceiling is architecturally eliminated
~0 disruption — user-visible impact during migration; the platform team migrated hundreds of thousands of workflows on behalf of users
Open-sourced July 2024 under Apache 2.0 — after 4+ years of production operation at Netflix scale
Native support for cyclic workflows, foreach loops, subworkflows, and conditional branches — not DAG-only

Netflix's Meson orchestrator was handling hundreds of thousands of daily data and ML jobs, and it was running out of machine. Vertical scaling on AWS had a hard ceiling, and the number of workflows was doubling every year. The only way out was a complete architectural rethink.

The Story

Unlike traditional workflow orchestrators that only support Directed Acyclic Graphs (DAGs), Maestro supports both acyclic and cyclic workflows and also includes multiple reusable patterns, including foreach loops, subworkflows, and conditional branches.

— Jun He, Natallia Dzenisenka et al., via Netflix Technology Blog

Netflix is not just a streaming platform — it is a data factory that runs thousands of ML pipelines (automated sequences of data processing, model training, and validation jobs that produce the recommendation algorithms and personalisation signals driving Netflix's content strategy) and data engineering workflows every single day. By 2020, Netflix was running all of this through Meson, an in-house workflow orchestrator built around a single-leader architecture (a system design where one node is responsible for all decisions and coordination — providing simplicity and consistency but creating a vertical scaling bottleneck as load increases). The team described it as having achieved high availability — but at the cost of requiring continuous vertical scaling as usage grew.

The number of workflows in Meson was doubling year over year, and the sizes of individual workflows were growing too — some containing tens of thousands of interdependent jobs. During peak usage — when large ML training runs coincided with end-of-month reporting pipelines — Meson experienced slowdowns that required on-call engineers to monitor the system during off-hours. The system was not broken, but it was clearly approaching the limits of how it was designed. AWS instance types top out. The ceiling was visible and the timeline to hit it was predictable.

Why Maestro Is Different From Traditional Orchestrators

Traditional workflow orchestrators like Apache Airflow assume that workflows are DAGs (Directed Acyclic Graphs — graphs where edges have direction and no cycles exist, meaning no job can eventually depend on itself). Netflix's data engineering reality is messier: some workflows are genuinely cyclic, some need foreach loops that spawn thousands of child workflow instances, some need conditional branching that determines entire execution paths at runtime. Maestro was built to support all of these natively within the engine, rather than requiring users to simulate them with workarounds.

Problem

Meson's Single Leader Under Load

During peak usage, Meson's single-leader architecture struggled under the combined load of hundreds of thousands of daily jobs. On-call engineers monitored the orchestrator during off-hours to prevent cascading delays. The system worked, but only with a level of human attention that could not be sustained as load kept growing.

Cause

Vertical Scaling Has a Ceiling

Meson's single-leader model meant all orchestration state — job status, dependency tracking, retry logic, scheduling — lived on one machine. As Netflix's DAG workloads grew 100% year-over-year, vertical scaling was no longer a strategy; it was a countdown.
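The "countdown" framing can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes load grows by a constant factor each year (doubling, per the article), and the capacity figures in the usage note are hypothetical placeholders, not Netflix's numbers.

```python
import math


def years_until_ceiling(current_load: float, max_capacity: float,
                        growth_factor: float = 2.0) -> float:
    """Years until a load that multiplies by growth_factor each year
    reaches max_capacity (the largest machine you can buy)."""
    if current_load <= 0 or max_capacity <= 0:
        raise ValueError("current_load and max_capacity must be positive")
    if growth_factor <= 1:
        raise ValueError("growth_factor must be greater than 1")
    if current_load >= max_capacity:
        return 0.0  # already at or past the ceiling
    # Solve current_load * growth_factor ** years == max_capacity for years.
    return math.log2(max_capacity / current_load) / math.log2(growth_factor)
```

For example, a leader using one eighth of the largest available instance has about three years of headroom under yearly doubling (`years_until_ceiling(1.0, 8.0)`). Each further year of growth costs a machine twice as large, which is why a bigger instance only postpones the problem.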

Solution

Maestro: Horizontal Orchestration at Any Scale

Netflix designed Maestro from first principles for horizontal scalability. The architecture decomposed orchestration into independent stateless workers, event-driven step execution, and purpose-built state management. Hundreds of thousands of workflows migrated from Meson to Maestro with minimal user disruption, and by 2024 Maestro was open-sourced under Apache 2.0.

Result

2 Million Jobs Per Day, No Ceiling in Sight

Maestro handles 2 million jobs on busy days across hundreds of thousands of workflows, with support for individual workflows containing hundreds of thousands of jobs. Scaling is now horizontal — add more workers, handle more load — without the system design itself becoming a constraint.

The Fix

Architecture: From Single Leader to Distributed Workers

Maestro's architecture replaces Meson's single leader with three independently scalable services. The Workflow Engine manages the full workflow lifecycle, from definition through execution. The Step Runtime Workers are stateless executors that pick up individual step executions from a queue, run them on the appropriate compute engine (Spark, Trino, Kubernetes, and others), and report results back. The Signal Service enables event-driven orchestration: workflows wait for external signals (data availability, upstream pipeline completion) rather than polling or running on fixed schedules.
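To see why stateless workers scale horizontally, here is a minimal sketch of the pull-from-a-queue pattern, using Python threads and an in-process queue as stand-ins for worker nodes and a real message bus. The function names (`run_step`, `worker`, `execute`) and the result shape are assumptions made for illustration, not Maestro's API.

```python
import queue
import threading


def run_step(step: dict) -> dict:
    """Pretend to execute one step; a real worker would submit it to a compute engine."""
    return {"step_id": step["id"], "status": "SUCCEEDED"}


def worker(steps: queue.Queue, results: queue.Queue) -> None:
    """Stateless loop: take a step, run it, report the result, repeat."""
    while True:
        step = steps.get()
        if step is None:  # sentinel value: shut this worker down
            return
        results.put(run_step(step))


def execute(step_list: list, worker_count: int) -> list:
    """Run every step on a pool of identical workers and collect the results."""
    steps: queue.Queue = queue.Queue()
    results: queue.Queue = queue.Queue()
    threads = [threading.Thread(target=worker, args=(steps, results))
               for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for step in step_list:
        steps.put(step)
    for _ in threads:
        steps.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return [results.get() for _ in range(len(step_list))]
```

Because no worker keeps state between steps, raising `worker_count` is the only change needed to handle more load. That is the property the single-leader design lacked.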

// Simplified Maestro workflow definition example (JSON with comments, for illustration only)
// Users define their logic; Maestro handles all execution mechanics
{
  "workflow_id": "ml_model_training_pipeline",
  "steps": [
    {
      "id": "data_prep",
      "type": "spark",
      "image": "netflix/etl:v2.3",
      "dependencies": []
    },
    {
      "id": "feature_engineering",
      "type": "foreach",
      "items": "${data_prep.output.segments}",
      "steps": [
        // Each item spawns its own independent workflow instance
        // Scales to thousands — each retried and tracked independently
        { "id": "compute_features", "type": "spark" }
      ],
      "dependencies": ["data_prep"]
    },
    {
      "id": "model_train",
      "type": "kubernetes",
      "image": "netflix/trainer:v1.1",
      "dependencies": ["feature_engineering"]
    }
  ],
  // Signal-based trigger: run when upstream data is ready, not on a fixed clock
  // Eliminates the entire category of failures caused by timing assumptions
  "trigger": { "type": "signal", "signal_name": "daily_data_ready" }
}

The foreach That Scales to 100K Jobs

Maestro's foreach step is the capability that makes truly massive workflows possible. When a foreach iterates over 10,000 content segments, Maestro internally creates 10,000 independent sub-workflow instances, each scheduled, executed, monitored, and retried independently. From the user's perspective, it's a single foreach loop. From Maestro's perspective, it's 10,000 parallel workflow instances, each scaling like any other workflow in the system. A flat DAG system would instead have to hold all 10,000 nodes in a single graph, which risks overwhelming the scheduler.
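The expansion described above can be sketched in a few lines. This is a toy model under stated assumptions, not Maestro's implementation: `SubWorkflowInstance`, `expand_foreach`, and `run_with_retry` are hypothetical names, and retries happen inline rather than through a scheduler.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SubWorkflowInstance:
    """One foreach iteration, with its own status and attempt counter."""
    instance_id: str
    item: str
    attempts: int = 0
    status: str = "PENDING"


def expand_foreach(foreach_id: str, items: List[str]) -> List[SubWorkflowInstance]:
    """Turn one foreach step into one independent instance per item."""
    return [SubWorkflowInstance(f"{foreach_id}-{index}", item)
            for index, item in enumerate(items)]


def run_with_retry(instance: SubWorkflowInstance,
                   body: Callable[[str], None],
                   max_attempts: int = 3) -> None:
    """Run the loop body for one instance, retrying only that instance on failure."""
    while instance.attempts < max_attempts:
        instance.attempts += 1
        try:
            body(instance.item)
        except Exception:
            instance.status = "FAILED"  # may be retried on the next pass
        else:
            instance.status = "SUCCEEDED"
            return
```

The point to notice is that a failure in one iteration touches only that instance's `attempts` and `status`. The other iterations never re-run, which is what keeps retries cheap when a foreach has thousands of items.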

Netflix's Meson-to-Maestro migration was designed to keep user disruption to a minimum. The team migrated hundreds of thousands of workflows on behalf of users, so users didn't rewrite anything. The platform team built migration tooling, validated that migrated workflows produced equivalent outputs, and performed the migration atomically for each user's pipeline. This is a crucial organisational point: the people running the workflows were data scientists and analysts, not infrastructure engineers. By owning the migration completely, the Maestro team delivered a transition that felt, from the user's perspective, like a platform upgrade that just worked.

Why not Airflow, Temporal, or Conductor?

The Netflix team evaluated off-the-shelf alternatives before building Maestro. Apache Airflow lacked native support for cyclic workflows and struggled with Maestro's target scale. Netflix Conductor offered more state-engine features than required, and its overhead was disproportionate to the need. Temporal was optimised for inter-process orchestration via external service calls. At Maestro's million-tasks-per-day scale, with many long-running workflows, making the workflow engine depend on external service calls would have introduced unnecessary reliability weak spots. The team concluded that for their specific requirements (lightweight state transitions at massive scale), a purpose-built engine was the right call.

The open-source decision

In July 2024, Netflix open-sourced Maestro under the Apache 2.0 license after 4+ years of internal production operation. The decision reflected confidence in the system's production maturity. Maestro joins a list of Netflix open-source projects — Chaos Monkey, Eureka, Hystrix — that have shaped how the industry thinks about distributed systems. Open-sourcing after production hardening rather than before is a kindness to the community: not an aspirational design or early prototype, but a battle-tested system with years of edge cases resolved.

Zero-downtime migration strategy

Maestro's team built tooling to migrate Meson workflows in place, validating equivalence of outputs before switching. Each workflow's migration was atomic: Meson and Maestro ran side-by-side during transition, with a configurable percentage router determining which system executed any given workflow instance. Users saw no interruption to their pipelines during the migration period. The platform team owned the migration completely — users did not rewrite their workflows.
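A percentage router of this kind is often built on a stable hash, so that the same workflow always lands on the same side while the percentage is unchanged. The sketch below shows one possible shape. Hashing on the workflow ID, the 0-99 bucketing, and the names `bucket` and `route` are all assumptions, since the source does not describe how Netflix's router was implemented.

```python
import hashlib


def bucket(workflow_id: str) -> int:
    """Map a workflow ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(workflow_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def route(workflow_id: str, maestro_percent: int) -> str:
    """Pick the orchestrator for a workflow, given the share routed to Maestro."""
    if not 0 <= maestro_percent <= 100:
        raise ValueError("maestro_percent must be between 0 and 100")
    return "maestro" if bucket(workflow_id) < maestro_percent else "meson"
```

With this design, raising `maestro_percent` from 10 to 50 only ever moves workflows from Meson to Maestro, never back. A rollout can therefore be widened step by step, and rolled back by lowering the number again.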

Architecture

Meson's architecture concentrated all orchestration intelligence in a single process: state management, scheduling, dependency tracking, retry logic, and compute routing all ran in one JVM. Maestro separates these concerns across three independently deployable services. Apache Kafka (a distributed event streaming platform used by Netflix for high-throughput, fault-tolerant message passing) is the message bus connecting them, and PostgreSQL is the durable state store, chosen for its ACID guarantees, which are essential for exactly-once execution semantics.
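One way a transactional store helps with exactly-once behaviour is by letting workers claim a step with a single conditional update, so that two workers can never both own it. The sketch below uses Python's built-in `sqlite3` as a stand-in for PostgreSQL. The table name, columns, and status values are hypothetical, and this shows only the claim step, not a full exactly-once protocol.

```python
import sqlite3


def open_store() -> sqlite3.Connection:
    """Create an in-memory store with one row per step (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE step_state ("
        "step_id TEXT PRIMARY KEY, status TEXT NOT NULL, owner TEXT)"
    )
    return conn


def add_step(conn: sqlite3.Connection, step_id: str) -> None:
    """Register a step as waiting to be picked up."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO step_state (step_id, status) VALUES (?, 'QUEUED')",
            (step_id,),
        )


def try_claim(conn: sqlite3.Connection, step_id: str, worker_id: str) -> bool:
    """Claim a step only if it is still QUEUED; True means this worker owns it."""
    with conn:
        cursor = conn.execute(
            "UPDATE step_state SET status = 'RUNNING', owner = ? "
            "WHERE step_id = ? AND status = 'QUEUED'",
            (worker_id, step_id),
        )
        return cursor.rowcount == 1  # 0 rows changed: someone else got there first
```

The `WHERE ... AND status = 'QUEUED'` condition is what makes the claim safe. The check and the write happen as one atomic statement, so a second worker delivering the same message simply sees zero rows updated and moves on.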

Diagram: Meson (Before): Single-Leader Orchestration Architecture (interactive version on TechLogStack)

Diagram: Maestro (After): Horizontally Scalable Distributed Orchestration (interactive version on TechLogStack)

Signal-Based Triggers: Eliminating Timing-Based Failures

In traditional time-based scheduling, a pipeline runs at midnight regardless of whether the data it depends on is ready. If upstream processing is slow, the pipeline waits, fails, or wastes compute resources scanning for data that isn't there yet. Signal-based triggers invert this: Maestro listens for an event published when upstream data is confirmed ready, and only then starts the dependent workflow. For Netflix's data platform, where hundreds of pipelines have complex dependency chains, this eliminates entire categories of pipeline failures caused by timing assumptions that don't hold under variable upstream load.
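The inversion from clock to event can be shown with a tiny publish/subscribe sketch. `SignalBus` and its methods are hypothetical names chosen for illustration and are not Maestro's Signal Service API. The signal and workflow names are borrowed from the example definition earlier in the article.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class SignalBus:
    """Starts subscribed workflows when a named signal is published."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[str]] = defaultdict(list)
        self.started: List[Tuple[str, str]] = []  # (workflow_id, signal_name)

    def subscribe(self, signal_name: str, workflow_id: str) -> None:
        """Declare that workflow_id should start when signal_name arrives."""
        self._subscribers[signal_name].append(workflow_id)

    def publish(self, signal_name: str) -> List[str]:
        """Announce that a condition holds; return the workflows it launched."""
        launched = list(self._subscribers.get(signal_name, []))
        for workflow_id in launched:
            self.started.append((workflow_id, signal_name))
        return launched


bus = SignalBus()
bus.subscribe("daily_data_ready", "ml_model_training_pipeline")
# Nothing runs at midnight; the pipeline starts only when upstream publishes.
launched = bus.publish("daily_data_ready")
```

The dependent workflow has no notion of a start time at all. If upstream data arrives two hours late, the workflow starts two hours late instead of failing on an empty input.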

Lessons

Single-leader architectures have a ceiling. They are simple to build and reason about — excellent starting points. But when your workload grows faster than vertical scaling can accommodate, the architecture itself becomes a constraint. Identify the ceiling early and plan the horizontal migration before you're forced to do it under pressure.
Platform migrations succeed when the platform team owns the migration completely, not when they ask users to rewrite their workloads. Netflix migrated hundreds of thousands of Meson workflows to Maestro on behalf of their users. This is the engineering investment that makes a platform adoption feel like a capability upgrade rather than a tax.
Signal-based triggers (workflow start conditions driven by events like "upstream data is available" rather than fixed clock times) eliminate entire categories of timing-based failures in data pipelines. If your pipelines fail regularly because upstream data isn't ready when the cron fires, replace the clock with a signal. Pipelines then start as soon as their inputs are ready, which shortens end-to-end latency and removes the failures caused by missing data.
Build platform capabilities as first-class engine features, not user-land workarounds. Maestro's foreach, subworkflow, and conditional branch are native engine constructs — not clever hacks layered on top of a DAG. When complex patterns are first-class, the engine can optimise them (parallel sub-workflows, independent retry, progress tracking) in ways that workarounds cannot.
Evaluate off-the-shelf solutions honestly and choose to build only when the custom requirements genuinely justify it. Netflix evaluated Airflow, Temporal, and Conductor before choosing to build Maestro. Their scale, cyclic workflow requirements, and strict SLO needs were genuinely outside what existing tools could provide. Building custom infrastructure for problems that existing tools solve well is waste; building it for problems they genuinely cannot solve is engineering.

Engineering Glossary

Apache Kafka — a distributed event streaming platform used by Netflix as the message bus between Maestro's internal services. Provides high-throughput, fault-tolerant message passing for workflow state events, step completion signals, and external event publishing.

DAG (Directed Acyclic Graph) — a workflow representation where jobs are nodes and dependencies are directed edges, with no cycles (no job can eventually depend on itself). Traditional orchestrators like Airflow support only DAGs. Maestro supports both DAGs and cyclic workflows.

Foreach — Maestro's native looping pattern that iterates over a collection and spawns an independent sub-workflow instance for each element. Enables workflows with hundreds of thousands of parallel jobs while keeping the scheduler from being overwhelmed by a flat graph.

Meson — Netflix's previous workflow orchestrator, built around a single-leader architecture. Successfully handled hundreds of thousands of daily jobs but required continuous vertical scaling as workflow volume doubled year-over-year, eventually approaching AWS instance type limits.

Signal-based trigger — a workflow start condition driven by an event (such as "upstream data is available") rather than a fixed clock time. Eliminates the entire category of pipeline failures caused by timing assumptions that don't hold under variable upstream processing load.

Single-leader architecture — a system design where one node is responsible for all decisions and coordination. Simple and consistent, but creates a vertical scaling bottleneck — as load increases, the only option is a bigger machine, which has a hard ceiling.

Step Runtime Worker — a stateless Maestro executor that picks up individual step executions from a queue, runs them on the appropriate compute engine (Spark, Trino, Kubernetes, Python), and reports results back. Independently scalable — add more workers to handle more throughput without touching the workflow engine.

This case is a plain-English retelling of publicly available engineering material.

TechLogStack — built at scale, broken in public, rebuilt by engineers.