TechLogStack

Originally published at techlogstack.com

Netflix Made Their Workflow Orchestrator 100x Faster by Rewriting the Engine Nobody Thought Was Slow

  • 100x throughput improvement from rewriting one component — not the whole system
  • 2.5 years of successful operation before sub-hourly workloads revealed the overhead
  • Trigger: Live, Ads, and Games drove 15-minute scheduling cycles; daily ETL never surfaced the bottleneck
  • Flow engine rewritten — DAG engine kept; surgical scope enabled months-not-years delivery
  • 0 external service dependencies in the new engine — state transitions are in-process
  • Improvement contributed back to the open-source Maestro repository

Maestro had been running Netflix's data and ML workflows successfully for two and a half years. Then Live, Ads, and Games drove sub-hourly scheduling requirements that revealed the orchestrator's overhead, not in crashes or alerts, but in slow step launches that nobody had measured. The fix was a rewrite of a single component, the flow engine, which delivered a 100x throughput improvement.


The Story

Two and a half years after Netflix's Maestro workflow orchestrator replaced Meson, it had achieved its design goals: horizontal scalability, support for hundreds of thousands of workflows, reliable execution of millions of jobs per day. By 2024, however, Netflix's business had changed in ways that revealed new performance requirements. Live programming, Ads, and Games drove use cases with sub-hourly scheduling needs — ad targeting pipelines running every 15 minutes, live event data processing executing within seconds of an event, low-latency ad hoc queries. These workloads exposed overhead in Maestro's step execution path that had been invisible during daily and hourly ETL workflows.

The Maestro engineering team traced the overhead to the flow engine — the component responsible for managing state transitions between workflow steps. The original flow engine had been built on top of Netflix Conductor, an open-source workflow orchestration system. Maestro used only a subset of Conductor's features — lightweight state transitions — but paid the overhead of Conductor's full implementation. For daily ETL pipelines, a 200ms step launch overhead is irrelevant. For 15-minute ad targeting workflows with hundreds of steps, that overhead becomes a material fraction of the entire scheduling budget.
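To make the budget argument concrete, here is a small arithmetic sketch. The 200 ms figure comes from the paragraph above; the step count of 300 and the assumption that steps launch one after another are illustrative choices, not Netflix measurements, and `LaunchOverheadBudget` is a hypothetical name.

```java
// Illustrative arithmetic only: the step count and the sequential-launch
// assumption are hypothetical, not Netflix measurements.
final class LaunchOverheadBudget {

    /** Fraction of a scheduling window consumed by per-step launch overhead. */
    static double overheadFraction(int steps, long overheadMillisPerStep, long windowSeconds) {
        double overheadSeconds = steps * overheadMillisPerStep / 1000.0;
        return overheadSeconds / windowSeconds;
    }

    public static void main(String[] args) {
        int steps = 300;              // assumed: "hundreds of steps", launched sequentially
        long overheadMs = 200;        // per-step launch overhead quoted in the text
        long dailyWindow = 24 * 60 * 60;   // 86,400 seconds
        long quarterHourWindow = 15 * 60;  // 900 seconds

        System.out.printf("daily window:     %.2f%%%n",
                100 * overheadFraction(steps, overheadMs, dailyWindow));
        System.out.printf("15-minute window: %.2f%%%n",
                100 * overheadFraction(steps, overheadMs, quarterHourWindow));
    }
}
```

Under these assumptions the same 60 seconds of accumulated launch overhead is well under a tenth of a percent of a daily window but close to 7 percent of a 15-minute one. Steps that run in parallel would reduce the total, so treat this as an upper-bound illustration.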


The Invisible Overhead

The flow engine overhead didn't cause errors or trigger alerts. Workflows completed. SLOs were met. But the step launch time was higher than it needed to be, and for sub-hourly workloads, "higher than needed" became "unacceptably slow." This is a class of performance issue that only becomes visible when new use cases push the system closer to its boundaries — the boundary had always been there, but daily ETL workloads never reached it.

Problem

Sub-Hourly Workloads Expose Step Launch Overhead

Netflix's expansion into Live, Ads, and Games drove scheduling requirements as short as 15 minutes. Sub-hourly workflows executing hundreds of steps were sensitive to per-step launch overhead that was invisible on daily ETL pipelines.


Cause

Flow Engine Built on Conductor's Full Feature Set

Maestro's flow engine used Netflix Conductor for state management but needed only lightweight state transitions, not Conductor's full feature set. The overhead was acceptable for schedules of an hour or longer but not for sub-hourly use cases.


Solution

Purpose-Built State Machine: Keep DAG, Rewrite Flow Engine

The team kept the DAG engine (workflow definition and dependency management) and rewrote only the flow engine (state transitions). The new flow engine was purpose-built for Maestro's specific requirements: lightweight state transitions at very high frequency, in-process, without external service dependencies.


Result

100x Throughput Improvement

We felt it was an unnecessary source of risk to couple the DAG engine execution with an external service call. If our requirements went beyond lightweight state transition management we might reconsider because Temporal is a very robust control plane orchestration system, but for our needs it introduced complexity and potential reliability weak spots when there was no direct need for the advanced feature set that it offered.

— Netflix Engineering, via '100X Faster: How We Supercharged Netflix Maestro's Workflow Engine'

The rewritten flow engine delivered a 100x throughput improvement, enabling the sub-hourly and low-latency use cases that Netflix's Live, Ads, and Games products required.


The Fix

The Flow Engine Rewrite

The new flow engine was designed from first principles for Maestro's specific requirements. Rather than building on Conductor's general-purpose state management, the team implemented a purpose-built state machine that handled exactly the transitions Maestro needed — step-ready → running → completed/failed, with retry and timeout logic — at extremely high frequency and without external service dependencies. The design was minimal by intention: every abstraction layer that wasn't serving Maestro's use case was eliminated.
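The transitions named above (step-ready, running, completed/failed, plus retry and timeout) can be sketched as a small in-memory state machine. This is a minimal sketch under assumptions: the class names, the retry budget, and the choice to model a timeout as a failure are all hypothetical and are not taken from Maestro's source.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: state names follow the text; everything else is assumed.
final class StepStateMachine {
    private final int maxRetries;
    private StepState state = StepState.READY;
    private int retries = 0;

    StepStateMachine(int maxRetries) { this.maxRetries = maxRetries; }

    StepState state() { return state; }

    /** Applies a transition in memory; rejects anything the table does not allow. */
    void transition(StepState target) {
        if (!state.next().contains(target)) {
            throw new IllegalStateException(state + " -> " + target + " is not allowed");
        }
        state = target;
    }

    /** Re-queues a failed step while the retry budget lasts; false when exhausted. */
    boolean retry() {
        if (state != StepState.FAILED || retries >= maxRetries) return false;
        retries++;
        transition(StepState.READY);
        return true;
    }

    /** A timeout is modelled here as a failure of a running step. */
    void timeout() {
        if (state == StepState.RUNNING) transition(StepState.FAILED);
    }

    public static void main(String[] args) {
        StepStateMachine step = new StepStateMachine(1);
        step.transition(StepState.RUNNING);
        step.timeout();                         // RUNNING -> FAILED
        System.out.println(step.retry());       // true: back to READY
        step.transition(StepState.RUNNING);
        step.transition(StepState.COMPLETED);
        System.out.println(step.state());       // COMPLETED
    }
}

enum StepState {
    READY, RUNNING, COMPLETED, FAILED;

    /** States reachable from this one in a single transition. */
    Set<StepState> next() {
        switch (this) {
            case READY:   return EnumSet.of(RUNNING);
            case RUNNING: return EnumSet.of(COMPLETED, FAILED);
            case FAILED:  return EnumSet.of(READY);   // retry re-queues the step
            default:      return EnumSet.noneOf(StepState.class); // COMPLETED is terminal
        }
    }
}
```

Every transition here is a field write guarded by a table lookup, which is the kind of work that stays cheap at high frequency because nothing leaves the process.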

  • 100x — throughput improvement from the flow engine rewrite; sub-hourly scheduling now feasible
  • 2.5 years — time Maestro operated before the sub-hourly use case revealed the overhead
  • 0 — external service dependencies in the new flow engine; state transitions are in-process
  • Kept DAG — the DAG engine was not the bottleneck and was not rewritten; scope limited to the identified problem
```java
// Conceptual: old flow engine approach vs new flow engine.
// Old: Conductor-based state management (full feature set, higher overhead)
// New: purpose-built lightweight in-process state machine
// Step, StepResult, WorkflowState, conductor, and stateStore are illustrative
// placeholders; this sketch is not meant to compile on its own.
import java.util.List;

// OLD APPROACH: Conductor state transitions
// Each step state change requires round-trips to Conductor's state store
class OldStepExecutor {
    void onStepComplete(Step step, StepResult result) {
        // Round-trip to Conductor: serialisation plus a network call
        conductor.updateTaskStatus(
            step.taskId,
            result.toTaskResult()
        );
        // Conductor evaluates downstream dependencies: another network call
        conductor.decide(step.workflowId);
    }
}

// NEW APPROACH: purpose-built in-process state machine
// State transitions happen in memory, with no network round-trips on the hot path
// Only the transitions Maestro needs, optimised for high frequency
class NewStepExecutor {
    void onStepComplete(Step step, StepResult result) {
        // In-process state update, no network round-trip
        WorkflowState state = stateStore.get(step.workflowId);
        state.markStepComplete(step.id, result);

        // Evaluate ready steps locally, with no external service dependency
        List<Step> readySteps = state.getReadySteps();

        // Dispatch ready steps to the execution queue
        readySteps.forEach(this::dispatch);

        // Persist the state change atomically
        stateStore.save(state);
    }
}
// The improvement comes from removing the external service round-trips on
// every state transition. At 1M+ tasks per day the per-transition saving
// compounds into the reported 100x throughput gain.
```

Surgical Rewrite: Scope Is a Virtue

The decision to rewrite only the flow engine (not the DAG engine, the API layer, or the scheduling system) is what made the 100x improvement achievable within a reasonable development timeline. A complete Maestro rewrite would have taken years. A targeted rewrite of the bottleneck component took months and carried bounded risk. The prerequisite was a precise understanding of where the overhead lived. Profiling and measurement before architectural decisions are not overhead; they are the work that makes targeted improvements possible.

The architectural decision to keep the DAG engine reflects a key engineering principle: surgical rewrites are better than complete rewrites when you can precisely identify the problem component. The DAG engine — which parses workflow definitions, evaluates dependencies, and determines which steps are ready to execute — was not the source of the overhead. Replacing it alongside the flow engine would have added scope, risk, and development time without addressing the actual bottleneck.
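The DAG engine's job of deciding which steps are ready to execute can be illustrated with a short sketch. The data model here (a map from step name to the names it depends on) and the `ReadySteps` class are assumptions made for illustration; Maestro's real representation is not described in this article.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of ready-step evaluation; not Maestro's actual data model.
final class ReadySteps {

    /**
     * Returns steps that have not completed and whose dependencies have all completed.
     * A real engine would also exclude steps that are already running.
     */
    static List<String> ready(Map<String, Set<String>> dependsOn, Set<String> completed) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : dependsOn.entrySet()) {
            boolean notDone = !completed.contains(entry.getKey());
            boolean depsDone = completed.containsAll(entry.getValue());
            if (notDone && depsDone) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> dag = new LinkedHashMap<>();
        dag.put("extract", Set.of());
        dag.put("transform", Set.of("extract"));
        dag.put("validate", Set.of("extract"));
        dag.put("load", Set.of("transform", "validate"));

        System.out.println(ready(dag, Set.of()));                       // [extract]
        System.out.println(ready(dag, Set.of("extract")));              // [transform, validate]
        System.out.println(ready(dag, Set.of("extract", "transform"))); // [validate]
    }
}
```

This logic is independent of how state transitions are stored or transported, which is consistent with the article's point that the DAG engine could be kept while the flow engine underneath it was replaced.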

Why not Temporal?
The Netflix team evaluated Temporal seriously before deciding to build a custom flow engine. Temporal is optimised for inter-process orchestration via external service calls — managing long-running, durably-persisted workflows with complex retry and compensation logic. Maestro operates at 1M+ daily tasks; coupling the DAG execution engine to an external Temporal service call for each state transition would add network latency and a reliability dependency to the most critical path in the system. For Maestro's needs — lightweight, in-process state transitions at very high frequency — Temporal was over-engineered and over-coupled. Choosing not to adopt a popular framework when its overhead exceeds its benefit is an engineering decision, not a gap.

New use cases unlocked
The 100x throughput improvement wasn't just a quantitative improvement in existing workflows — it unlocked qualitatively new use cases. Ad targeting pipelines that previously ran hourly can now run on 15-minute cycles, providing fresher signals. Live event data processing can run within seconds of event completion rather than waiting for the next hourly window. The performance improvement changed what Netflix could build, not just how fast they could run existing things.


Architecture

Maestro's architecture after the flow engine rewrite maintains the same three-layer structure: Workflow Engine (DAG state, dependency tracking), Step Runtime Workers (stateless executors), and Signal Service (event-driven triggers). The change is internal to the Workflow Engine layer. From the outside — from users defining workflows, from the Signal Service publishing events, from the Step Runtime Workers reporting completions — nothing changed. The optimisation was architecturally invisible.
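One common way to keep such a change invisible to callers is to hide the engine behind an interface and swap the implementation. The sketch below shows that pattern only; `FlowEngine`, `InProcessFlowEngine`, and `StepCompletionHandler` are hypothetical names, and the article does not say that Maestro is structured this way.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical names throughout: this illustrates swapping an implementation
// behind a stable interface, not Maestro's real class structure.

/** Caller code depends only on the interface, so it is unaffected by the swap. */
final class StepCompletionHandler {
    private final FlowEngine engine;

    StepCompletionHandler(FlowEngine engine) { this.engine = engine; }

    /** Called when a worker reports that a step finished. */
    void onStepReported(String workflowId, String stepId) {
        engine.markComplete(workflowId, stepId);
    }

    public static void main(String[] args) {
        FlowEngine engine = new InProcessFlowEngine(); // only line naming an implementation
        StepCompletionHandler handler = new StepCompletionHandler(engine);
        handler.onStepReported("wf-1", "extract");
        System.out.println(engine.isComplete("wf-1", "extract"));
    }
}

/** The contract the rest of the system sees. */
interface FlowEngine {
    void markComplete(String workflowId, String stepId);
    boolean isComplete(String workflowId, String stepId);
}

/** Stand-in for a purpose-built engine: state is held in this process's memory. */
final class InProcessFlowEngine implements FlowEngine {
    private final Map<String, Set<String>> completedSteps = new HashMap<>();

    @Override
    public void markComplete(String workflowId, String stepId) {
        completedSteps.computeIfAbsent(workflowId, id -> new HashSet<>()).add(stepId);
    }

    @Override
    public boolean isComplete(String workflowId, String stepId) {
        return completedSteps.getOrDefault(workflowId, Set.of()).contains(stepId);
    }
}
```

A Conductor-backed implementation of the same interface could sit alongside this one, and `StepCompletionHandler` would not need to change.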

Diagram (available on TechLogStack): Maestro Before: Conductor-Based Flow Engine (Higher Overhead)

Diagram (available on TechLogStack): Maestro After: Purpose-Built Flow Engine (100x Faster)


Profiling Before Rewriting

The 100x improvement was possible because the team could precisely identify the flow engine as the overhead source. This required detailed profiling of Maestro's step execution path — measuring where time was spent at each stage of a step state transition. Without this profiling work, a rewrite might have targeted the wrong component and produced minimal improvement. Measurement before optimisation is not a platitude — it's the prerequisite for targeted, effective engineering.
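Measuring where time goes at each stage of a state transition can start with something as simple as the timer below. This is a rough sketch, not the team's tooling: `StageTimer` and the stage names are hypothetical, and the sleeps stand in for real work.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical profiling helper; stage names are illustrative, not Maestro's.
final class StageTimer {
    private final Map<String, Long> totalNanos = new LinkedHashMap<>();

    /** Runs one stage and adds its elapsed time to that stage's running total. */
    void time(String stage, Runnable work) {
        long start = System.nanoTime();
        try {
            work.run();
        } finally {
            totalNanos.merge(stage, System.nanoTime() - start, Long::sum);
        }
    }

    /** Name of the stage with the largest accumulated time, or null if none recorded. */
    String slowestStage() {
        String slowest = null;
        long max = -1;
        for (Map.Entry<String, Long> entry : totalNanos.entrySet()) {
            if (entry.getValue() > max) {
                max = entry.getValue();
                slowest = entry.getKey();
            }
        }
        return slowest;
    }

    /** Stand-in for real work in the demo below. */
    static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        StageTimer timer = new StageTimer();
        timer.time("serialise", () -> pause(2));
        timer.time("stateStoreRoundTrip", () -> pause(40));
        timer.time("evaluateDependencies", () -> pause(1));
        System.out.println("slowest stage: " + timer.slowestStage());
    }
}
```

In practice a production profiler or tracing system would replace this, but the principle is the same: attribute time to named stages first, then decide which component to rewrite.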


Lessons

  1. Measure before you rewrite. The Maestro team knew exactly which component to rewrite because they had profiled the execution path and located the overhead precisely. A rewrite without measurement is a guess. A rewrite with measurement is a targeted intervention. The profiling work is not overhead — it's the work that makes targeted improvements possible.

  2. Surgical rewrites (replacing only the specific component causing a performance problem, preserving all surrounding components) have lower risk and faster delivery than complete rewrites. The flow engine was replaced; the DAG engine was kept. This scoping decision is why the improvement was achievable in months rather than years.

  3. Performance requirements change as products evolve. Maestro was correctly designed for daily ETL workloads in 2020. Netflix's expansion into Live, Ads, and Games in 2024–2025 created sub-hourly requirements that didn't exist at design time. Build systems that are measurable and targetable for performance improvement as requirements evolve.

  4. General-purpose frameworks have overhead that purpose-built implementations don't. Use general-purpose frameworks when their full feature set is needed; build purpose-built when it isn't. Conductor was the right choice when Maestro was designed — it provided reliable state management quickly. The rewrite was right when the overhead became the bottleneck — the team had the data to make that call.

  5. Architectural improvements that remove external dependencies improve both performance and reliability simultaneously. The new flow engine is faster because it has no external service round-trips. It's also more reliable because it has fewer failure modes — no external service to go down, no network partition to handle in the hot path.


Engineering Glossary

Conductor — an open-source workflow orchestration system from Netflix. Used as the original flow engine in Maestro — provided reliable state management quickly during Maestro's initial development, but carried more overhead than Maestro needed for lightweight state transitions at high frequency.

DAG engine — the Maestro component that parses workflow definitions, evaluates dependencies, and determines which steps are ready to execute. Was not contributing to the flow engine overhead and was not rewritten. Preserving it was the scope decision that made the 100x improvement achievable in months.

Flow engine — the Maestro component responsible for managing state transitions between workflow steps. The source of the per-step launch overhead revealed by sub-hourly workloads. Rewritten from scratch with a purpose-built in-process state machine.

Sub-hourly scheduling — workflow execution cycles shorter than one hour (e.g. every 15 minutes). Driven by Netflix's Ads product, which requires near-real-time data processing for ad targeting signals, and by Live, which generates engagement data that needs to flow through pipelines within the event window.

Surgical rewrite — the practice of replacing only the specific component causing a performance or reliability problem, while preserving all surrounding architecture. Requires precise identification of the bottleneck through profiling. Lower risk and faster delivery than complete rewrites.


This case is a plain-English retelling of publicly available engineering material.



TechLogStack — built at scale, broken in public, rebuilt by engineers.
