Gabriel Anhaia

Posted on May 23

Choreography vs Orchestration: The 3 Trade-Offs Textbooks Skip

#architecture #microservices #eventdriven #devops

Book: Event-Driven Architecture Pocket Guide: Saga, CQRS, Outbox, and the Traps Nobody Warns You About

You picked choreography because Netflix did. The slides were beautiful. Loosely coupled services. Each one minding its own business. Two years later, you're six services deep, no one owns the saga, and "where's order 4471?" takes forty minutes of Kibana grep across services nobody on the current team wrote.

The textbook answer is "it depends on your culture." That answer has cost teams real money. Let's replace it with three measurable trade-offs and a decision matrix you can actually use on Monday.

The textbook answer and why it's useless

Open any blog post comparing the two and you'll get the same two lists. Choreography: distributed, loosely coupled, resilient. Orchestration: centralized, easier to reason about, single point of failure. Then a paragraph that ends with "the right choice depends on your team."

This frames the decision as taste. It isn't. Choreography and orchestration trade three concrete costs against each other, and each cost grows on a different curve. The honest comparison is which curve your team can afford to climb.

Quick definitions so we're talking about the same things:

Choreography: services publish events, other services subscribe. No central coordinator. The "workflow" exists only as the sum of subscriptions across N services. Think Kafka, RabbitMQ, EventBridge, NATS.
Orchestration: a workflow engine owns the steps. Services expose activities; the engine calls them and tracks state. Think Temporal, Camunda, AWS Step Functions, Conductor.
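To make the two definitions concrete, here is a toy sketch in plain Go. The Bus, Event, Choreographed, and Orchestrated names are hypothetical, and the in-memory bus stands in for a real broker. Both versions run the same three steps. The only difference is where the order of those steps is written down.

```go
package main

import "fmt"

// Event is a hypothetical in-memory stand-in for a message on a broker topic.
type Event struct {
	Type    string
	OrderID string
}

// Bus is a toy publish/subscribe broker, not a real Kafka or NATS client.
type Bus struct {
	handlers map[string][]func(Event)
}

func NewBus() *Bus { return &Bus{handlers: map[string][]func(Event){}} }

// Subscribe registers a handler for one event type.
func (b *Bus) Subscribe(eventType string, h func(Event)) {
	b.handlers[eventType] = append(b.handlers[eventType], h)
}

// Publish synchronously calls every handler registered for the event type.
func (b *Bus) Publish(e Event) {
	for _, h := range b.handlers[e.Type] {
		h(e)
	}
}

// Choreographed wires the flow as subscriptions; the step order is implicit
// in who listens to which event.
func Choreographed(orderID string) []string {
	var trail []string
	bus := NewBus()
	bus.Subscribe("OrderCreated", func(e Event) {
		trail = append(trail, "payment")
		bus.Publish(Event{Type: "PaymentAuthorized", OrderID: e.OrderID})
	})
	bus.Subscribe("PaymentAuthorized", func(e Event) {
		trail = append(trail, "inventory")
		bus.Publish(Event{Type: "InventoryReserved", OrderID: e.OrderID})
	})
	bus.Subscribe("InventoryReserved", func(e Event) {
		trail = append(trail, "notification")
	})
	bus.Publish(Event{Type: "OrderCreated", OrderID: orderID})
	return trail
}

// Orchestrated runs the same steps, but one function owns the order.
func Orchestrated(orderID string) []string {
	var trail []string
	for _, step := range []string{"payment", "inventory", "notification"} {
		trail = append(trail, step) // a real engine would call the service here
	}
	return trail
}

func main() {
	fmt.Println("choreographed:", Choreographed("4471"))
	fmt.Println("orchestrated: ", Orchestrated("4471"))
}
```

Delete one Subscribe call in the choreographed version and the flow silently stops halfway. Delete one entry in the orchestrated version and the gap is visible in the one place the flow is defined.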

Both can do sagas. Both handle failure. The fight isn't about capability. It's about where the cost shows up.

Trade-off 1: Debuggability (seconds-to-trace)

A customer DMs support: "I placed order 4471 forty minutes ago and nothing happened." Same incident, two architectures, two very different forty minutes.

Choreography path. Your order flow touches order-service, payment-service, inventory-service, notification-service, shipping-service, and analytics-service. You start in Kibana.

1. order-service: "OrderCreated id=4471" at 14:02:11, good, it exists
2. payment-service: search "4471", found "PaymentAuthorized" at 14:02:14
3. inventory-service: search "4471", found "InventoryReserved" at 14:02:15
4. notification-service: search "4471", nothing... wait, search by
   correlation_id... still nothing
5. shipping-service: nothing
6. Slack the ex-team-lead who wrote notification-service: "did you change
   the consumer group last sprint?"
7. Discover the consumer was redeployed with a typo'd topic name

Forty minutes. Six services. A Slack message to someone who doesn't work on this anymore. The order exists in three places and is missing from the rest, and the missing ones are the ones the customer cares about.
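The manual search above is really one small algorithm: walk the expected chain of services and find the first one with no record of the order. A minimal sketch, assuming the logs are already parsed into a hypothetical LogEntry type and that every service logs the same correlation ID:

```go
package main

import "fmt"

// LogEntry is a hypothetical, already-parsed log line from one service.
type LogEntry struct {
	Service       string
	CorrelationID string
	Message       string
}

// FirstGap walks the expected service chain in order and returns the first
// service that has no log entry for the given correlation ID.
// It returns "" when every service in the chain has seen the order.
func FirstGap(chain []string, logs []LogEntry, correlationID string) string {
	seen := map[string]bool{}
	for _, e := range logs {
		if e.CorrelationID == correlationID {
			seen[e.Service] = true
		}
	}
	for _, svc := range chain {
		if !seen[svc] {
			return svc
		}
	}
	return ""
}

func main() {
	chain := []string{
		"order-service", "payment-service", "inventory-service",
		"notification-service", "shipping-service",
	}
	logs := []LogEntry{
		{"order-service", "4471", "OrderCreated"},
		{"payment-service", "4471", "PaymentAuthorized"},
		{"inventory-service", "4471", "InventoryReserved"},
	}
	fmt.Println("first gap:", FirstGap(chain, logs, "4471"))
}
```

The function is trivial. The hard part is the chain argument: in a choreographed system that list is not written down anywhere, so someone has to reconstruct it from memory during the incident.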

Orchestration path. You open the Temporal UI (or Camunda Cockpit, or Step Functions console). You search for workflow_id=order-4471. The workflow tree shows: AuthorizePayment succeeded, ReserveInventory succeeded, SendConfirmation is in Retrying state with the actual exception attached. Click. There's the stack trace.

Ninety seconds. One UI.

The math is brutal. Choreography seconds-to-trace scales with service count and team turnover. Orchestration seconds-to-trace scales with workflow depth, which is bounded by the workflow definition you can read on one screen.

Here's a Temporal workflow showing what "open the workflow" actually means:

import (
    "time"

    "go.temporal.io/sdk/temporal"
    "go.temporal.io/sdk/workflow"
)

func OrderWorkflow(ctx workflow.Context, in OrderInput) error {
    ao := workflow.ActivityOptions{
        StartToCloseTimeout: 30 * time.Second,
        RetryPolicy: &temporal.RetryPolicy{
            MaximumAttempts: 5,
        },
    }
    ctx = workflow.WithActivityOptions(ctx, ao)

    var auth PaymentAuth
    if err := workflow.ExecuteActivity(ctx,
        AuthorizePayment, in).Get(ctx, &auth); err != nil {
        return err
    }

    if err := workflow.ExecuteActivity(ctx,
        ReserveInventory, in).Get(ctx, nil); err != nil {
        // compensate: void the authorization
        _ = workflow.ExecuteActivity(ctx,
            VoidPayment, auth.ID).Get(ctx, nil)
        return err
    }

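    // A notification failure must not fail the saga, but the workflow
    // still waits for the result: an activity that is never awaited
    // may not run once the workflow completes.
    if err := workflow.ExecuteActivity(ctx,
        SendConfirmation, in).Get(ctx, nil); err != nil {
        workflow.GetLogger(ctx).Warn("confirmation failed", "error", err)
    }
    return nil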
}

Reading this file tells you the order of operations, the failure handling, and the compensation. The choreography equivalent is a graph that exists nowhere except in the union of six @KafkaListener annotations spread across six repos.
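The compensation shape in that workflow does not depend on Temporal. Here is a minimal sketch of the same idea in plain Go, with hypothetical Step, RunSaga, and failWith names: run the steps in order, and on the first failure undo the completed steps in reverse. It leaves out what a workflow engine adds on top, namely durable state, retries, and a UI.

```go
package main

import (
	"errors"
	"fmt"
)

// Step pairs an action with the compensation that undoes it.
// Compensate may be nil for steps that need no undo.
type Step struct {
	Name       string
	Do         func() error
	Compensate func() error
}

// failWith returns an action that always fails with the given message.
func failWith(msg string) func() error {
	return func() error { return errors.New(msg) }
}

// RunSaga executes steps in order. On the first failure it runs the
// compensations of the already-completed steps in reverse order and
// returns the original error. The trail records what happened.
func RunSaga(steps []Step) (trail []string, err error) {
	var done []Step
	for _, s := range steps {
		if err = s.Do(); err != nil {
			trail = append(trail, "failed:"+s.Name)
			for i := len(done) - 1; i >= 0; i-- {
				c := done[i].Compensate
				if c == nil {
					continue
				}
				// A production saga would retry or escalate here.
				if cerr := c(); cerr != nil {
					trail = append(trail, "compensation-failed:"+done[i].Name)
					continue
				}
				trail = append(trail, "compensated:"+done[i].Name)
			}
			return trail, err
		}
		trail = append(trail, "done:"+s.Name)
		done = append(done, s)
	}
	return trail, nil
}

func main() {
	ok := func() error { return nil }
	trail, err := RunSaga([]Step{
		{Name: "AuthorizePayment", Do: ok, Compensate: ok},
		{Name: "ReserveInventory", Do: failWith("out of stock")},
	})
	fmt.Println(trail, err)
}
```

In a choreographed saga the same logic still exists. It is just split into "on failure event X, undo Y" handlers living in different services.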

Trade-off 2: Schema-change blast radius

Product wants to add customer_segment to the order event. B2B customers get priority shipping, B2C get the standard SLA. Same field. Same JSON key. Two architectures, two very different weeks.

Choreography blast radius. The field belongs on the OrderCreated event. You search the codebase for every consumer of that topic:

order-service          (producer)
payment-service        (consumer: needs to know? maybe for fraud rules)
inventory-service      (consumer: no)
shipping-service       (consumer: yes, drives priority)
notification-service   (consumer: yes, different email templates)
analytics-service      (consumer: yes, segment is the whole point)
warehouse-service      (added last quarter, consumer: unclear)

Now coordinate. Each consumer is owned by a different team. Each one has its own deploy schedule. The field has to roll out additive-first (no consumer rejects unknown fields, right?), then producers can populate it, then consumers can read it. If any consumer was using strict schema validation in some forgotten YAML, they break. The Slack thread runs three weeks. Two teams push back because "we don't need it." One team ships a config change that ignores the field but logs a warning at INFO level, which someone discovers six months later when they're paginating through 14 GB of warnings.
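The "no consumer rejects unknown fields, right?" assumption is easy to check in code. This sketch uses Go's standard encoding/json to show both behaviours against the same payload. The OrderCreatedV1 struct and its JSON keys are hypothetical, standing in for a consumer written before customer_segment existed.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// OrderCreatedV1 is a hypothetical consumer-side view of the event,
// written before customer_segment existed.
type OrderCreatedV1 struct {
	OrderID string `json:"order_id"`
	Total   int64  `json:"total_cents"`
}

// DecodeTolerant ignores fields the consumer does not know about,
// which is the default behaviour of json.Unmarshal.
func DecodeTolerant(payload []byte) (OrderCreatedV1, error) {
	var e OrderCreatedV1
	err := json.Unmarshal(payload, &e)
	return e, err
}

// DecodeStrict rejects any field the consumer does not know about.
func DecodeStrict(payload []byte) (OrderCreatedV1, error) {
	var e OrderCreatedV1
	dec := json.NewDecoder(bytes.NewReader(payload))
	dec.DisallowUnknownFields()
	err := dec.Decode(&e)
	return e, err
}

func main() {
	payload := []byte(`{"order_id":"4471","total_cents":1999,"customer_segment":"b2b"}`)

	_, err := DecodeTolerant(payload)
	fmt.Println("tolerant error:", err)

	_, err = DecodeStrict(payload)
	fmt.Println("strict error:  ", err)
}
```

The tolerant consumer keeps working when the producer starts sending the new field. The strict one starts failing on every message, which is why the additive-first rollout only works if you know which of your consumers decode which way.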

Orchestration blast radius. The field is part of the workflow input. The workflow owner adds it to the struct. The activities that need it read it from their argument. Activities that don't need it ignore it.

type OrderInput struct {
    OrderID         string
    CustomerID      string
    Items           []LineItem
    Total           Money
    CustomerSegment string // new, added by order-service team
}

// inventory-service activity doesn't care
func ReserveInventory(ctx context.Context,
    in OrderInput) error {
    return inventory.Reserve(ctx, in.OrderID, in.Items)
}

// shipping-service activity cares
func CreateShipment(ctx context.Context,
    in OrderInput) error {
    priority := shipping.StandardPriority
    if in.CustomerSegment == "b2b" {
        priority = shipping.ExpeditedPriority
    }
    return shipping.Create(ctx, in.OrderID, priority)
}

Activities are RPC calls with a typed payload. Unknown fields don't break anyone because nobody is parsing the whole envelope. They're consuming the argument they declared. Schema owner is the workflow author. Blast radius is bounded by who needs the field.

This isn't magic. The orchestrator hasn't deleted the coordination problem. It's moved the boundary. In choreography, the schema is a shared contract across N consumers. In orchestration, the schema is a workflow input owned by one team, distributed as a typed argument.

Choreography blast radius scales with consumer count. Orchestration blast radius scales with activities that need the field. The two curves tend to cross around four consumers.

Trade-off 3: On-call cognitive load

You wake up at 3 AM. PagerDuty fired on order processing latency.

Choreography mental model. To diagnose, you need to hold in your head: which services consume which topics, what the consumer-group lag thresholds are, what the at-least-once-delivery implications are for each consumer, which services have circuit breakers and what trips them, which services are using batch processing vs streaming, and which DLQs to check. That's N services × M event types × the deployment status of each.

A team I talked to last quarter has a Notion page they call "the runbook" that tries to encode this. It's eighteen pages. New on-call rotates in every six weeks. You can guess how that goes.

Orchestration mental model. Open the workflow UI. The stuck workflows are a queryable list. Each one has its full history. The mental model is: one workflow definition, one engine, one UI. Activities are RPC; if an activity is timing out, you know which service to look at and you have the stack.

The cognitive-load difference doesn't show up in your first month. It shows up at month eighteen when your senior engineer who designed the choreography leaves and the new on-call inherits a Slack channel of pinned messages titled "things that look broken but are actually fine."

This is the trade-off teams underestimate most. The decision feels like an engineering call. It's actually a hiring and retention call.

A decision matrix that's actually useful

Throw away "team culture" as a deciding factor. Use this:

Factor	Choreography wins	Orchestrator wins
Service count in the saga	≤ 3	≥ 5
Team ownership boundaries	Single team owns all services	Cross-team
Saga steps that need compensation	0 – 1	2+
Saga involves money or legal compliance	–	Yes
Saga involves notifications, analytics, search-indexing	Yes	–
New engineers join the on-call rotation every	12+ months	3 – 6 months
You can answer "where is order X?" today in	< 2 minutes	> 5 minutes (you need this)
Workflow duration	Seconds	Minutes to days

Read it as a vote. If five of seven applicable rows lean orchestrator, ship orchestrator. The "ship it because Netflix did" approach has cost more engineering-years than almost any architectural mistake of the last decade.
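If you want the vote to be explicit, it fits in a few lines of Go. This is only a sketch: the Lean type and Recommend function are hypothetical names, the answers in main are made up for one imaginary saga, and simple majority is my assumption rather than a rule from the matrix.

```go
package main

import "fmt"

// Lean records which side a matrix row favors for one saga.
type Lean int

const (
	NotApplicable Lean = iota
	Choreography
	Orchestration
)

// Recommend tallies the applicable rows and returns the side with more votes.
// Simple majority is an assumption of this sketch; pick your own threshold.
func Recommend(rows []Lean) (verdict string, chor, orch int) {
	for _, r := range rows {
		switch r {
		case Choreography:
			chor++
		case Orchestration:
			orch++
		}
	}
	switch {
	case orch > chor:
		verdict = "orchestration"
	case chor > orch:
		verdict = "choreography"
	default:
		verdict = "tie"
	}
	return verdict, chor, orch
}

func main() {
	// Hypothetical answers for one saga, in the same row order as the matrix.
	rows := []Lean{
		Orchestration, // service count: 6
		Orchestration, // ownership: cross-team
		Orchestration, // compensations: 2+
		Orchestration, // involves money
		NotApplicable, // not a notifications/analytics saga
		Choreography,  // on-call rotation is stable
		Orchestration, // "where is order X?" takes > 5 minutes
		Choreography,  // workflow finishes in seconds
	}
	verdict, chor, orch := Recommend(rows)
	fmt.Printf("%s (choreography=%d, orchestration=%d)\n", verdict, chor, orch)
}
```

Run it per saga, not per system. Different sagas in the same codebase can and should come out differently.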

Two patterns the matrix won't show you that still matter:

Long-running workflows are an orchestrator's home court. A saga that waits 7 days for a customer email confirmation, then 30 days for return eligibility, then a year for the warranty window. Choreography can do this with scheduled events and state stored everywhere, but workflow engines were literally built for it. The workflow.Sleep(ctx, 30*24*time.Hour) line in a Temporal workflow does what twelve services with cron jobs and a Redis state machine struggle to do.

Short, hot-path, single-team flows are choreography's home court. If your "saga" is order-created → publish to two topics → done and one team owns all of it, choreography is fine. You don't need an engine.

The hybrid most production systems converge to

Mature event-driven systems aren't pure. They're hybrid, and the split is predictable.

Orchestrate the money and legal path. Payment auth, payment capture, refund, fraud check, KYC, tax calculation, compliance hold, regulatory reporting. These need traceability. They need compensations. They need a human-readable workflow when the auditor shows up. Temporal or Camunda owns this lane.

Choreograph the notifications and analytics path. Order confirmation emails, push notifications, search-index updates, analytics event fan-out, audit log streaming, cache invalidation. These are fire-and-forget. Failure is non-blocking. Schema is mostly additive. Kafka or EventBridge owns this lane.

The boundary looks like this:

// inside the orchestrated workflow
func OrderWorkflow(ctx workflow.Context, in OrderInput) error {
    // money lane: orchestrated
    var auth PaymentAuth
    if err := workflow.ExecuteActivity(ctx,
        AuthorizePayment, in).Get(ctx, &auth); err != nil {
        return err
    }
    if err := workflow.ExecuteActivity(ctx,
        ReserveInventory, in).Get(ctx, nil); err != nil {
        _ = workflow.ExecuteActivity(ctx,
            VoidPayment, auth.ID).Get(ctx, nil)
        return err
    }
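    if err := workflow.ExecuteActivity(ctx,
        CapturePayment, auth.ID).Get(ctx, nil); err != nil {
        // a failed capture also needs compensation (release the
        // reservation, void the authorization); omitted for brevity
        return err
    }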

    // hand-off to choreography lane: publish and move on; a publish
    // failure is ignored rather than unwinding a captured payment
    _ = workflow.ExecuteActivity(ctx,
        PublishOrderCompleted, in).Get(ctx, nil)
    return nil
}

// activity just publishes to Kafka and returns
func PublishOrderCompleted(ctx context.Context,
    in OrderInput) error {
    return producer.Publish("orders.completed", in)
}

The workflow ends once the money is captured and the orders.completed event is published. The notification-service, analytics-service, search-indexer, and warehouse-service all consume orders.completed independently. Each one can fail without unwinding the order. The hot path stays observable; the cold path stays cheap.
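One consequence of the hand-off: a publishing activity can be retried, and most brokers deliver at least once, so consumers on the choreography lane should tolerate seeing the same order twice. Here is a minimal sketch of an idempotent consumer. The IdempotentConsumer name is hypothetical and the in-memory set is a stand-in; a real consumer would persist it.

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotentConsumer remembers which order IDs it has handled so that a
// redelivered event is acknowledged without repeating the side effect.
// The in-memory set is an assumption of this sketch; a real consumer would
// persist it, ideally in the same transaction as the side effect.
type IdempotentConsumer struct {
	mu     sync.Mutex
	seen   map[string]bool
	handle func(orderID string) error
}

func NewIdempotentConsumer(handle func(orderID string) error) *IdempotentConsumer {
	return &IdempotentConsumer{seen: map[string]bool{}, handle: handle}
}

// Consume returns true when the handler ran and false for a duplicate.
// A failed handler is not recorded, so a redelivery will try again.
// The lock is held while handling, which serializes events; fine for a sketch.
func (c *IdempotentConsumer) Consume(orderID string) (bool, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.seen[orderID] {
		return false, nil
	}
	if err := c.handle(orderID); err != nil {
		return false, err
	}
	c.seen[orderID] = true
	return true, nil
}

func main() {
	emails := 0
	notifier := NewIdempotentConsumer(func(orderID string) error {
		emails++ // stand-in for sending the confirmation email
		return nil
	})

	// The same orders.completed event arrives twice.
	for i := 0; i < 2; i++ {
		ran, err := notifier.Consume("4471")
		fmt.Println("handled:", ran, "err:", err)
	}
	fmt.Println("emails sent:", emails)
}
```

With this in place the workflow side can stay simple: publish, retry on failure, and let each consumer decide what "already done" means for its own side effect.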

The mistake teams make is treating the choice as binary at the system level. It isn't. Pick the architecture per saga based on the matrix above. The result will be hybrid whether you planned it or not. Plan for it.

The textbook framing of "choreography for autonomy, orchestration for control" isn't wrong. It's just too vague to act on. Debuggability, schema blast radius, on-call cognitive load. Three curves. Pick the one you can afford.

What's the worst cross-service trace you've ever had to chase down? Drop the service count and the time-to-find in the comments. I want to see how bad the choreography tax actually gets in the wild.

If this was useful

The trade-off matrix in this post sits in the middle of the saga chapter of the Event-Driven Architecture Pocket Guide: Saga, CQRS, Outbox, and the Traps Nobody Warns You About. The book also covers the compensation patterns that bite hybrid systems (idempotent compensations, partial-success states, the "compensation needs compensating" recursion), schema evolution across both lanes, and the operational traps. That includes what happens when your orchestrator and Kafka cluster have a network partition at the same time. If you're sizing a saga right now, the year-by-year forecast for each pattern is the part worth reading first.