DEV Community

TechLogStack

Originally published at techlogstack.com

Netflix Made Their Workflow Orchestrator 100x Faster by Rewriting the Engine Nobody Thought Was Slow

Netflix · Performance · 17 May 2026

Maestro had been running Netflix's data and ML workflows successfully for two and a half years. Then Live, Ads, and Games drove sub-hourly scheduling requirements that revealed the orchestrator's overhead: not crashes or alerts, but slow step launches that nobody had measured. The fix was a complete rewrite of the flow engine that delivered a 100x throughput improvement.

  • 100x throughput improvement
  • 2.5 years before overhead visible
  • Sub-hourly scheduling trigger
  • State machine rewrite
  • Kept DAG engine, rewrote flow engine
  • 1M+ tasks/day still supported

The Story

Two and a half years after Netflix's Maestro workflow orchestrator replaced Meson, it had achieved its design goals: horizontal scalability, support for hundreds of thousands of workflows, reliable execution of millions of jobs per day. By 2024, however, Netflix's business had changed in ways that revealed new performance requirements. Live programming, Ads, and Games drove use cases with sub-hourly scheduling needs — ad targeting pipelines that needed to run every 15 minutes, live event data processing that needed to execute within seconds of an event, low-latency ad hoc queries. These workloads exposed overhead in Maestro's step execution path that had been invisible during daily and hourly ETL workflows. The orchestrator wasn't broken — but it was noticeably slower than it needed to be for a new class of latency-sensitive use cases.

⏱️

The overhead that sub-hourly workloads exposed wasn't measured in seconds of latency — it was measured in fractions of seconds of step launch time that added up across thousands of daily executions. For hourly ETL pipelines, a 200ms step launch overhead is irrelevant. For 15-minute ad targeting workflows with hundreds of steps, that overhead becomes a material fraction of the entire scheduling budget.
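To make that arithmetic concrete, here is a rough sketch. Only the 200ms overhead and the 15-minute cycle come from the text above; the step count of 300 and the assumption that steps launch one after another are illustrative and are not figures published by Netflix.

```java
// Illustrative arithmetic only: the step count and the sequential-launch
// assumption are hypothetical, not numbers published by Netflix.
class StepLaunchBudget {

    /** Fraction of the scheduling interval consumed by launch overhead alone. */
    static double overheadFraction(int steps, long overheadMillisPerStep, long intervalSeconds) {
        double overheadSeconds = steps * overheadMillisPerStep / 1000.0;
        return overheadSeconds / intervalSeconds;
    }

    public static void main(String[] args) {
        // 300 steps launched one after another, 200 ms of overhead each (60 s total).
        double hourly = overheadFraction(300, 200, 3600);
        double quarterHourly = overheadFraction(300, 200, 900);
        System.out.printf("hourly: %.1f%%, 15-minute: %.1f%%%n",
                hourly * 100, quarterHourly * 100);
    }
}
```

Under these assumptions the same 60 seconds of overhead is under 2% of an hourly window but close to 7% of a 15-minute one, before any step has done real work.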

The Maestro engineering team investigated the overhead and traced it to the flow engine — the component responsible for managing state transitions between workflow steps. The original flow engine had been built on top of Netflix Conductor, an open-source workflow orchestration system that provided a full feature set of state management capabilities. Maestro used only a subset of Conductor's features — lightweight state transitions — but paid the overhead of Conductor's full implementation. This overhead was acceptable at 1-million-task-per-day scale with daily scheduling. It was unacceptable for the sub-hourly, low-latency workloads that Netflix's evolving product portfolio demanded.

THE INVISIBLE OVERHEAD

The flow engine overhead didn't cause errors or trigger alerts. Workflows completed. SLOs were met. But the step launch time was higher than it needed to be, and for sub-hourly workloads, 'higher than needed' became 'unacceptably slow.' This is a class of performance issue that only becomes visible when new use cases push the system closer to its boundaries. The boundary had always been there, but daily ETL workloads never reached it.

Problem

Sub-Hourly Workloads Expose Step Launch Overhead

Netflix's expansion into Live, Ads, and Games drove scheduling requirements as short as 15 minutes. Sub-hourly workflows executing hundreds of steps were sensitive to per-step launch overhead that was invisible on daily ETL pipelines. The Maestro flow engine's overhead, acceptable at hourly+ scheduling, became a bottleneck for the new use case class.


Cause

Flow Engine Built on Conductor's Full Feature Set

Maestro's flow engine used Netflix Conductor for state management, but only needed lightweight state transitions — not Conductor's full feature set. The team also considered Temporal (optimized for inter-process orchestration via external service calls) but concluded that coupling the DAG engine to an external service introduced unnecessary reliability risk at 1M+ daily tasks.


Solution

Purpose-Built State Machine: Keep DAG, Rewrite Flow Engine

The team kept the DAG engine (workflow definition and dependency management) and rewrote only the flow engine (state transitions). The new flow engine was purpose-built for Maestro's specific requirements: lightweight state transitions at very high frequency, without the overhead of a general-purpose state management framework.


Result

100x Throughput Improvement

The rewritten flow engine delivered a 100x throughput improvement, enabling the sub-hourly and low-latency use cases that Netflix's Live, Ads, and Games products required. The improvement opened new possibilities for workflow orchestration at Netflix that hadn't been feasible on the original engine.


We felt it was an unnecessary source of risk to couple the DAG engine execution with an external service call. If our requirements went beyond lightweight state transition management we might reconsider because Temporal is a very robust control plane orchestration system, but for our needs it introduced complexity and potential reliability weak spots when there was no direct need for the advanced feature set that it offered.

Netflix Engineering, via '100X Faster: How We Supercharged Netflix Maestro's Workflow Engine'

ℹ️

Why Not Temporal?

Temporal is a popular workflow orchestration framework that handles complex, long-running workflows with strong durability guarantees. The Netflix team evaluated it seriously but concluded it was optimized for a different use case: inter-process orchestration via external service calls. Maestro operates at 1M+ daily tasks; coupling the DAG execution engine to an external Temporal service call for each state transition would add network latency and a reliability dependency to the most critical path in the system. For Maestro's needs — lightweight, in-process state transitions at very high frequency — Temporal was over-engineered and over-coupled.

The architectural decision to keep the DAG engine while rewriting only the flow engine reflects a key engineering principle: surgical rewrites are better than complete rewrites when you can precisely identify the component causing the problem. The DAG engine — the code that parses workflow definitions, evaluates dependencies, and determines which steps are ready to execute — was not the source of the overhead. Replacing it alongside the flow engine would have added scope, risk, and development time without addressing the actual bottleneck. The team's ability to identify precisely where the overhead lived was the prerequisite for a scoped, successful rewrite.

🚀

New Use Cases Unlocked

The 100x throughput improvement wasn't just a quantitative improvement in existing workflows — it unlocked qualitatively new use cases. Ad targeting pipelines that previously ran hourly can now run on 15-minute cycles, providing fresher signals. Live event data processing can now run within seconds of event completion rather than waiting for the next hourly window. The performance improvement changed what Netflix could build, not just how fast they could run existing things.

⚠️

The 2.5-Year Latency to Visibility

Maestro had operated successfully for two and a half years before the sub-hourly workloads revealed the flow engine overhead. This timeline is instructive: performance bottlenecks are often invisible until a new use case pushes the system closer to its limits. Daily ETL pipelines completing in hours have no reason to notice a 200ms step launch overhead. 15-minute ad targeting pipelines immediately feel it. Building systems with performance observability from the start allows bottlenecks to be found proactively rather than reactively.

LIVE, ADS, GAMES: THE PRODUCT DRIVERS

Netflix's expansion into live events (sports, comedy specials, live programming), advertising (a new revenue stream launched in 2022), and games (mobile and cloud gaming) created data pipeline requirements that hadn't existed in Netflix's purely subscription VOD model. Advertising requires near-real-time data to be effective: ad targeting signals from viewer behavior need to be processed and applied within minutes, not hours. Live events generate immediate engagement data that needs to flow through analytics pipelines before the event ends. These new product lines were the forcing function for Maestro's performance improvement.

We built the new flow engine from first principles specifically for Maestro's requirements — lightweight state transitions at very high frequency, without coupling the DAG execution engine to an external service call on every state change.

Netflix Engineering, via '100X Faster: How We Supercharged Netflix Maestro's Workflow Engine'


The Fix

The Flow Engine Rewrite

The new flow engine was designed from first principles for Maestro's specific requirements. Rather than building on Conductor's general-purpose state management or Temporal's inter-process orchestration, the team implemented a purpose-built state machine that handled exactly the transitions Maestro needed: step-ready → running → completed/failed, with retry and timeout logic, at extremely high frequency without external service dependencies. The design was minimal by intention: every abstraction layer that wasn't serving Maestro's use case was eliminated.
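A minimal sketch of what such a state machine could look like follows. The state names mirror the transitions listed above; the class, the retry rule, and the choice to model a timeout as a failure are assumptions made for illustration and are not taken from Maestro's source.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: state names follow the article; the class, the retry
// rule, and the method names are illustrative, not Maestro's actual code.
class StepStateMachine {

    enum State { READY, RUNNING, COMPLETED, FAILED }

    // Allowed transitions, held in memory so a lookup needs no external call.
    private static final Map<State, Set<State>> ALLOWED = new EnumMap<>(State.class);
    static {
        ALLOWED.put(State.READY, EnumSet.of(State.RUNNING));
        ALLOWED.put(State.RUNNING, EnumSet.of(State.COMPLETED, State.FAILED));
        ALLOWED.put(State.FAILED, EnumSet.of(State.READY));        // retry path
        ALLOWED.put(State.COMPLETED, EnumSet.noneOf(State.class)); // terminal
    }

    private final int maxRetries;
    private int retriesUsed = 0;
    private State state = State.READY;

    StepStateMachine(int maxRetries) { this.maxRetries = maxRetries; }

    State state() { return state; }

    /** Applies a transition, or throws if it is not permitted. */
    void transition(State next) {
        if (!ALLOWED.get(state).contains(next)) {
            throw new IllegalStateException(state + " -> " + next + " is not allowed");
        }
        if (state == State.FAILED && next == State.READY) {
            if (retriesUsed >= maxRetries) {
                throw new IllegalStateException("retry budget exhausted");
            }
            retriesUsed++;
        }
        state = next;
    }

    /** A timeout is modelled here as a failure of a running step. */
    void timeout() { transition(State.FAILED); }
}
```

Every check in this sketch is an in-memory map lookup, which is the property the rewrite was after: a transition costs a few instructions rather than a call to another service.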

  • 100x — Throughput improvement from the flow engine rewrite — enabling sub-hourly scheduling and low-latency ad hoc queries that were infeasible on the original engine
  • 2.5 years — Time Maestro operated successfully before the sub-hourly use case revealed the flow engine overhead — a reminder that performance requirements change as products evolve
  • 0 — External service dependencies in the new flow engine — state transitions happen in-process, eliminating the network latency and reliability coupling of external orchestration services
  • Kept DAG — Components preserved from the original architecture — the DAG engine was not the bottleneck and was not rewritten, limiting scope and risk
// Conceptual: The old flow engine approach vs new flow engine
// Old: Conductor-based state management (full feature set, higher overhead)
// New: Purpose-built lightweight state machine

// OLD APPROACH: Conductor state transitions
// Each step state change requires a round-trip to Conductor's state store
// Conductor evaluates full state management logic for each transition
class OldStepExecutor {
    void onStepComplete(Step step, StepResult result) {
        // Conductor handles state transition — full feature set overhead
        conductor.updateTaskStatus(
            step.taskId,
            result.toTaskResult() // serialization + network call
        );
        // Conductor evaluates downstream dependencies
        conductor.decide(step.workflowId); // another network call
    }
}

// NEW APPROACH: Purpose-built in-process state machine
// State transitions are in-memory, no external service calls
// Only the transitions Maestro needs, optimized for high frequency
class NewStepExecutor {
    void onStepComplete(Step step, StepResult result) {
        // In-process state update — no network round-trip
        WorkflowState state = stateStore.get(step.workflowId);
        state.markStepComplete(step.id, result);

        // Evaluate ready steps locally — no external service dependency
        List<Step> readySteps = state.getReadySteps();

        // Dispatch ready steps to execution queue
        readySteps.forEach(this::dispatch);

        // Persist state change atomically
        stateStore.save(state);
    }
}

SURGICAL REWRITE: SCOPE IS A VIRTUE

The decision to rewrite only the flow engine — not the DAG engine, not the API layer, not the scheduling system — is what made the 100x improvement possible within a reasonable development timeline. A complete rewrite of Maestro would have taken years and carried enormous risk. A targeted rewrite of the bottleneck component took months and carried bounded risk. The prerequisite was precise understanding of where the overhead lived. Profiling and measurement before architectural decisions is not overhead — it's the work that makes targeted improvements possible.

Open-Source Beneficiaries

The 100x performance improvement was contributed to the open-source Maestro repository. Organizations that adopted Maestro after the original open-sourcing in July 2024 now benefit from an orchestration engine capable of sub-hourly scheduling at million-task-per-day scale. This is the compound value of open-sourcing battle-tested systems: community users get production-grade improvements as they're developed.

ℹ️

The Netflix Product Evolution That Drove the Fix

Maestro's 100x improvement is a case study in how product evolution creates engineering requirements that didn't exist at system design time. When Maestro was designed in 2020, Netflix's primary workflow use cases were daily ETL pipelines and hourly ML training runs. By 2024–2025, Live, Ads, and Games had created sub-hourly and real-time data requirements. Workflow orchestrators that were designed for daily batch jobs don't automatically handle real-time event-driven workloads — the latency requirements are an order of magnitude different.

ℹ️

Keeping the DAG Engine: The Right Scope Decision

The DAG engine — the component that parses workflow definitions, evaluates dependencies, and determines which steps are ready to run — was not contributing to the flow engine overhead. Rewriting it alongside the flow engine would have added months of development time, introduced new bugs in a working component, and required re-validating all of Maestro's workflow semantics. Scope discipline — rewriting only what needs to be rewritten — is the engineering decision that made 100x improvement achievable in a reasonable timeline.
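The dependency evaluation described here can be pictured as counting unfinished upstream steps. The sketch below is one common way to do that and is offered only as an illustration; the class and method names are hypothetical, and nothing here is drawn from Maestro's DAG engine.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of dependency tracking; not Maestro's implementation.
class DagReadiness {
    private final Map<String, List<String>> downstream = new HashMap<>();
    private final Map<String, Integer> pendingDeps = new HashMap<>();

    /** Registers a step together with the steps it depends on. */
    void addStep(String step, List<String> dependsOn) {
        pendingDeps.put(step, dependsOn.size());
        downstream.putIfAbsent(step, new ArrayList<>());
        for (String dep : dependsOn) {
            downstream.computeIfAbsent(dep, k -> new ArrayList<>()).add(step);
        }
    }

    /** Steps with no dependencies, runnable as soon as the workflow starts. */
    List<String> initialReadySteps() {
        List<String> ready = new ArrayList<>();
        for (Map.Entry<String, Integer> e : pendingDeps.entrySet()) {
            if (e.getValue() == 0) ready.add(e.getKey());
        }
        return ready;
    }

    /** Marks a step complete and returns the steps that became ready as a result. */
    List<String> complete(String step) {
        List<String> ready = new ArrayList<>();
        for (String next : downstream.getOrDefault(step, List.of())) {
            int remaining = pendingDeps.merge(next, -1, Integer::sum);
            if (remaining == 0) ready.add(next);
        }
        return ready;
    }
}
```

A step becomes ready exactly when its last upstream dependency completes, so each completion touches only that step's direct successors.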

THE OPEN SOURCE TIMELINE

The 100x improvement was contributed to the open-source Maestro repository following its development. Since Maestro was open-sourced in July 2024, external users who adopted it benefit from a continuously improving orchestration platform — not a snapshot. The value of open-sourcing production systems compounds over time as improvements driven by internal Netflix requirements become available to the broader engineering community.


Architecture

Maestro's architecture after the flow engine rewrite maintains the same three-layer structure: Workflow Engine (DAG state, dependency tracking), Step Runtime Workers (stateless executors), and Signal Service (event-driven triggers). The change is internal to the Workflow Engine layer: the flow engine that manages step state transitions was replaced with a purpose-built implementation. From the outside — from users defining workflows, from the Signal Service publishing events, from the Step Runtime Workers reporting completions — nothing changed. The optimization was architecturally invisible.

Maestro Before: Conductor-Based Flow Engine (Higher Overhead) [interactive diagram available on TechLogStack]

Maestro After: Purpose-Built Flow Engine (100x Faster) [interactive diagram available on TechLogStack]

PROFILING BEFORE REWRITING

The 100x improvement was possible because the team could precisely identify the flow engine as the overhead source. This required detailed profiling of Maestro's step execution path — measuring where time was spent at each stage of a step state transition. Without this profiling work, a rewrite might have targeted the wrong component and produced minimal improvement. Measurement before optimization is not a platitude — it's the prerequisite for targeted, effective engineering.
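As a rough illustration of that kind of measurement, the sketch below accumulates elapsed time per stage of a transition and reports the most expensive one. The stage names in the usage note and the class itself are hypothetical; the article does not describe the tooling the Maestro team actually used.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch: this class is illustrative, not the team's tooling.
class TransitionProfiler {
    private final Map<String, Long> nanosByStage = new LinkedHashMap<>();

    /** Runs one stage of a transition and adds its elapsed time to the stage total. */
    <T> T time(String stage, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            nanosByStage.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    /** Name of the stage with the largest accumulated time, or null if none recorded. */
    String slowestStage() {
        String slowest = null;
        long max = -1;
        for (Map.Entry<String, Long> e : nanosByStage.entrySet()) {
            if (e.getValue() > max) {
                max = e.getValue();
                slowest = e.getKey();
            }
        }
        return slowest;
    }

    /** Accumulated nanoseconds per stage, in first-seen order. */
    Map<String, Long> totals() { return nanosByStage; }
}
```

Wrapping each stage, for example `profiler.time("persist", () -> save(state))` with a hypothetical `save` method, would show which stage dominates before anyone decides what to rewrite.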

⚠️

The 1M+ Task/Day Scale Constraint

The new flow engine had to maintain support for Maestro's existing workload — 1M+ tasks per day, workflows with hundreds of thousands of steps, long-running daily ETL pipelines. The 100x improvement was not achieved by sacrificing existing workload support — it was achieved by removing overhead that wasn't serving existing workloads either. The new engine is faster at all scales, not just at sub-hourly scales. The improvement was architectural, not a tradeoff.

Performance Impact at Maestro's Scale

The 100x throughput improvement at Maestro's operating scale — 1M+ tasks per day — translates to significant concrete capacity. The same infrastructure can now support 100x more concurrent step executions, enabling Netflix to run sub-hourly workflows alongside existing daily ETL pipelines without requiring additional worker capacity. For a system already handling hundreds of thousands of workflows, the improvement effectively eliminates step-launch as a scaling bottleneck for the foreseeable future.


Lessons

The Maestro 100x story is about the intersection of product evolution, performance measurement, and surgical engineering. The lessons apply to any long-running production system that needs to serve new use cases it wasn't designed for.

  1. Measure before you rewrite. The Maestro team knew exactly which component to rewrite because they had profiled the execution path and located the overhead precisely. A rewrite without measurement is a guess. A rewrite with measurement is a targeted intervention. The profiling work is not overhead; it's the work that makes targeted improvements possible.
  2. Surgical rewrites (replacing only the specific component causing a performance problem, while preserving all surrounding components) have lower risk and faster delivery than complete rewrites. The flow engine was replaced; the DAG engine was kept. This scoping decision is why the improvement was achievable in months rather than years.
  3. Performance requirements change as products evolve. Maestro was correctly designed for daily ETL workloads in 2020. Netflix's expansion into Live, Ads, and Games in 2024–2025 created sub-hourly requirements that didn't exist at design time. Build systems that are measurable and targetable for performance improvement as requirements evolve.
  4. General-purpose frameworks have overhead that purpose-built implementations don't. Use general-purpose frameworks when their full feature set is needed; build purpose-built when it isn't. Conductor was the right choice when Maestro was designed: it provided reliable state management quickly. The rewrite was right when the overhead became the bottleneck, and the team had the data to make that call.
  5. Architectural improvements that remove external dependencies improve both performance and reliability simultaneously. The new flow engine is faster because it has no external service round-trips. It's also more reliable because it has fewer failure modes: no external service to go down, no network partition to handle in the hot path.

PERFORMANCE OBSERVABILITY AS DESIGN REQUIREMENT

The Maestro overhead existed for 2.5 years before it became visible. If per-step launch latency had been a tracked metric from day one, the overhead would have been visible from the beginning, even if it hadn't mattered yet. Building systems with detailed performance instrumentation from the start means bottlenecks are discovered via monitoring rather than via new use cases hitting walls. Performance observability is a first-class design requirement, not an afterthought.
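A tracked launch-latency metric can be very small. The sketch below records the gap between a step becoming ready and actually starting, then reports a nearest-rank percentile. It is a hypothetical, single-threaded illustration; a production system would more likely rely on an existing metrics library, and the article does not say what Maestro uses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: not thread-safe, and not Maestro's instrumentation.
class LaunchLatencyTracker {
    private final List<Long> samplesMillis = new ArrayList<>();

    /** Records the delay between a step becoming ready and actually starting. */
    void record(long readyAtMillis, long startedAtMillis) {
        samplesMillis.add(startedAtMillis - readyAtMillis);
    }

    /** Nearest-rank percentile (0 < p <= 100) over all recorded samples. */
    long percentile(double p) {
        if (samplesMillis.isEmpty()) {
            throw new IllegalStateException("no samples recorded");
        }
        List<Long> sorted = new ArrayList<>(samplesMillis);
        Collections.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.size());
        return sorted.get(Math.max(rank, 1) - 1);
    }
}
```

Watching a high percentile such as p99 over time would surface a creeping launch overhead long before a latency-sensitive workload arrives to expose it.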

⚠️

The Temporal Consideration

The Netflix team explicitly evaluated Temporal before deciding to build a custom flow engine. Their conclusion: Temporal's value proposition is in managing long-running, durably-persisted workflows with complex retry and compensation logic — a use case that requires coupling the execution engine to an external orchestration service. Maestro's lightweight state transition needs don't justify that coupling. Choosing not to adopt a popular framework when its overhead exceeds its benefit is an engineering decision, not a gap.

Netflix's workflow orchestrator ran 2.5 years without anyone noticing a 100x performance improvement was available — which is either a compliment to how well Maestro worked or a reminder that daily ETL jobs don't complain about latency.

TechLogStack — built at scale, broken in public, rebuilt by engineers


This case is a plain-English retelling of publicly available engineering material.
