Saumya Karnwal

How to Build Workflows That Never Lose Progress

The Half-Deployed Model

Imagine you're running an ML platform. A weekly cron job fires at 3 AM to retrain a customer's model. The pipeline has five steps:

  1. Generate training data from BigQuery
  2. Train the model on a Kubernetes cluster
  3. Push the model artifact to a registry
  4. Create a scoring configuration in the scoring service database
  5. Authorize the model for the customer's traffic

Steps 1 through 3 take about two hours and cost real money — compute time, BigQuery slots, container images. At 5:02 AM, step 3 completes. The model is trained and pushed.

Step 4 calls the scoring service to create the config. The scoring service is in the middle of a routine database migration. Connection refused.

Now you have a problem. The model is sitting in the artifact registry, trained and ready. But it can't serve traffic because there's no scoring config. The pipeline marks the whole run as "FAILED."

What happens next depends on how you built the system.

If you start over: The 6 AM retry re-runs from step 1. Two more hours of BigQuery and Kubernetes compute, re-training a model that's identical to the one you already have. You just burned money and time rebuilding something that already exists.

If you do nothing: The model sits orphaned in the registry. The customer's production model is stale. A data scientist notices three days later and manually creates the config.

If you built a saga: The system knows step 3 completed. It retries step 4. The scoring service comes back from its migration at 5:15 AM. Retry succeeds. Step 5 runs. By 5:20 AM the customer has a fresh model. Nobody was woken up. No work was wasted.

What's a Saga?

Sagas were introduced in a 1987 paper by Hector Garcia-Molina and Kenneth Salem for long-lived database transactions. The modern distributed-systems version tackles transactions that span multiple systems — you can't use a single BEGIN/COMMIT because the data lives in different databases.

The solution: break the big transaction into a sequence of smaller steps, each with:

  1. A known state (pending, in-progress, complete, failed)
  2. The ability to retry safely (idempotency)
  3. An optional compensating action (undo what was done if we need to abort)

The state machine IS the recovery mechanism. You don't need a separate "recovery system" — you just need each step to be resumable.
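A minimal sketch of that structure in Python — the names `SagaStep`, `StepState`, and `run_saga` are illustrative, not from any particular framework. The key behavior: a resumed run skips steps already marked complete.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class StepState(Enum):
    PENDING = "pending"
    IN_PROGRESS = "in_progress"
    COMPLETE = "complete"
    FAILED = "failed"


@dataclass
class SagaStep:
    """One resumable unit of work in the saga."""
    name: str
    action: Callable[[], None]                       # must be idempotent
    compensate: Optional[Callable[[], None]] = None  # optional undo
    state: StepState = StepState.PENDING


def run_saga(steps: list[SagaStep]) -> bool:
    """Run steps in order, skipping any that already completed."""
    for step in steps:
        if step.state is StepState.COMPLETE:
            continue  # resumed run: don't redo finished work
        step.state = StepState.IN_PROGRESS
        try:
            step.action()
            step.state = StepState.COMPLETE
        except Exception:
            step.state = StepState.FAILED
            return False  # caller decides: retry later or compensate
    return True
```

Because completed steps are skipped, calling `run_saga` again after a failure resumes exactly where the last run stopped.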

Why Idempotency Is the Hard Part

The saga pattern sounds simple: track state, retry failed steps. But there's a catch. What if the step did succeed, but you didn't get the confirmation?

Picture this: your pipeline calls the scoring service to create a config. The service creates it, writes it to the database, and starts sending back a 200 response. At that exact moment, the network blips. Your pipeline gets a timeout. It thinks step 4 failed.

On retry, the pipeline calls the scoring service again: "Create this config." If the service isn't idempotent, it creates a second config. Now you have duplicate scoring entries, and the model might score users twice.

Idempotency means: running the same operation twice produces the same result as running it once. The service checks "does this config already exist for this model version?" and if so, returns the existing one instead of creating a duplicate.

This is the non-negotiable foundation. If a step can't be safely retried, the entire saga pattern breaks.

What This Looks Like in Practice

The State Machine

Every deployment in the system has a status that tracks exactly where it is:

PENDING
   │
   ▼
DATA_GEN_IN_PROGRESS ──(fail)──▶ DATA_GEN_FAILED
   │                                     │
   ▼                                  (retry)
DATA_GEN_COMPLETE                        │
   │                              ◀──────┘
   ▼
TRAINING_IN_PROGRESS ──(fail)──▶ TRAINING_FAILED
   │                                     │
   ▼                                  (retry)
TRAINING_COMPLETE                        │
   │                              ◀──────┘
   ▼
PUSHING_IN_PROGRESS ──(fail)──▶ PUSH_FAILED
   │                                     │
   ▼                                  (retry)
PUSHING_COMPLETE                         │
   │                              ◀──────┘
   ▼
CONFIGURING ──(fail)──▶ CONFIG_PENDING
   │                          │
   ▼                    (reconciliation
READY                    loop retries)
Enter fullscreen mode Exit fullscreen mode

This looks like a lot of states. But each state is dirt-simple: "I know exactly what succeeded, and I know exactly what to do next."
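One lightweight way to enforce a machine like this is a table of allowed transitions. The state names below come from the diagram; the `transition` helper is illustrative:

```python
# Allowed transitions, mirroring the diagram above.
TRANSITIONS = {
    "PENDING": {"DATA_GEN_IN_PROGRESS"},
    "DATA_GEN_IN_PROGRESS": {"DATA_GEN_COMPLETE", "DATA_GEN_FAILED"},
    "DATA_GEN_FAILED": {"DATA_GEN_IN_PROGRESS"},      # retry
    "DATA_GEN_COMPLETE": {"TRAINING_IN_PROGRESS"},
    "TRAINING_IN_PROGRESS": {"TRAINING_COMPLETE", "TRAINING_FAILED"},
    "TRAINING_FAILED": {"TRAINING_IN_PROGRESS"},      # retry
    "TRAINING_COMPLETE": {"PUSHING_IN_PROGRESS"},
    "PUSHING_IN_PROGRESS": {"PUSHING_COMPLETE", "PUSH_FAILED"},
    "PUSH_FAILED": {"PUSHING_IN_PROGRESS"},           # retry
    "PUSHING_COMPLETE": {"CONFIGURING"},
    "CONFIGURING": {"READY", "CONFIG_PENDING"},
    "CONFIG_PENDING": {"CONFIGURING"},                # reconciliation retry
}


def transition(current: str, target: str) -> str:
    """Move to `target` only if the state machine allows it."""
    if target not in TRANSITIONS.get(current, set()):
        raise ValueError(f"illegal transition {current} -> {target}")
    return target
```

Rejecting illegal transitions catches bugs like a retry path that tries to jump straight from PENDING to READY.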

Retry With Backoff

When a step fails, the system doesn't immediately retry in a tight loop. That would hammer a service that might already be struggling.
Exponential backoff gives the downstream service time to recover. If it's a 30-second blip, attempt 2 or 3 catches it. If it's a longer outage, the system backs off gracefully.
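A sketch of the backoff loop, assuming doubling delays capped at a maximum; the injectable `sleep` parameter is just so tests don't actually wait:

```python
import time


def retry_with_backoff(fn, attempts=5, base_delay=1.0, max_delay=60.0,
                       sleep=time.sleep):
    """Retry `fn` with exponential backoff: 1s, 2s, 4s, 8s, ...
    capped at `max_delay`. Re-raises after the final attempt."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of retries: hand off to the reconciliation loop
            sleep(min(base_delay * 2 ** attempt, max_delay))
```

In production you'd usually add jitter (a small random factor on each delay) so a fleet of retrying clients doesn't hammer the service in lockstep.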

The Reconciliation Loop

What if every retry fails? The deployment state says CONFIG_PENDING. The pipeline stops actively retrying. But it's not abandoned. A background process — the reconciliation loop — periodically scans for stuck deployments.
When the downstream service recovers (maybe after a database migration, maybe after an outage), the reconciliation loop picks up the stuck deployments and completes them. No human intervention. No lost work.

The user sees: "Deployment in progress — model trained, awaiting configuration." Not an error. Not a failure. Just... waiting, and it'll fix itself.
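One pass of such a loop might look like this sketch, where the `try_configure` callback and dict-shaped deployments are stand-ins for real service calls and database rows:

```python
def reconcile(deployments, try_configure):
    """One pass of the reconciliation loop: find deployments stuck in
    CONFIG_PENDING and try to finish them. A scheduler runs this
    periodically; a failed attempt just stays pending for the next pass."""
    for dep in deployments:
        if dep["status"] != "CONFIG_PENDING":
            continue
        try:
            try_configure(dep)
            dep["status"] = "READY"
        except Exception:
            pass  # downstream still unavailable; the next pass retries
```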

Making Each Step Idempotent

In practice, idempotency looks different for each type of operation:

  • Database writes: Use INSERT ... ON CONFLICT DO NOTHING or check-before-write. If the row exists with the same key, it's a no-op.
  • API calls: Include a unique request ID (sometimes called an idempotency key). The server caches results by this key — if it's seen the key before, it returns the cached result.
  • State changes: Read current state before deciding what to do. If the current state is already what you want, do nothing. This is how Kubernetes controllers work — they compare desired state to actual state on every loop.

The pattern is always the same: check if the work is already done before doing it again.

The Anatomy of a Good Saga

1. State Must Be Durable

The state machine lives in a database, not in memory. If the orchestrator crashes:

  • It restarts
  • Reads the state from the database
  • Picks up where it left off

If the state was in memory, a crash means starting over. If it's in a database, a crash means a brief pause.
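A minimal sketch of durable state using SQLite; the schema and helper names are illustrative, and any database with an upsert works the same way:

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) the deployment-state store."""
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS deployments "
               "(id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    return db


def save_status(db, dep_id, status):
    """Upsert the current status, so writes are idempotent too."""
    db.execute("INSERT INTO deployments (id, status) VALUES (?, ?) "
               "ON CONFLICT(id) DO UPDATE SET status = excluded.status",
               (dep_id, status))
    db.commit()


def load_status(db, dep_id):
    """What a restarted orchestrator reads to pick up where it left off."""
    row = db.execute("SELECT status FROM deployments WHERE id = ?",
                     (dep_id,)).fetchone()
    return row[0] if row else None
```

In production the store would be file-backed or a shared database; `:memory:` here is only to keep the sketch self-contained.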

2. Compensating Actions for the Unhappy Path

Sometimes you need to abort, not retry. If a model is deployed but turns out to be bad, you don't just retry — you roll back.
The compensating actions are the "undo" for each step. Not every step needs one (training data in BigQuery doesn't hurt anyone just sitting there), but state changes in production databases definitely do.
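A compensating-action runner can be sketched as a list of (action, undo) pairs, undone in reverse order when a later step fails; the helper name is made up:

```python
def run_with_compensation(steps):
    """Run (action, compensate) pairs in order; on failure, undo the
    completed steps in reverse. `compensate` may be None for steps
    that are harmless to leave behind."""
    done = []
    for action, compensate in steps:
        try:
            action()
        except Exception:
            for undo in reversed(done):
                if undo is not None:
                    undo()
            return False
        done.append(compensate)
    return True
```

Reverse order matters: you release the seat hold only after refunding the charge that depended on it, mirroring how the steps were taken.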

3. Visibility Into the State

A saga that works perfectly but is opaque to users is almost as bad as one that fails. The user should be able to see:

  • Which step the deployment is on
  • What failed and why
  • Whether the system is retrying or waiting for manual intervention
  • A "Retry" button for failed steps

When You Need a Saga

Two conditions:

  1. The operation spans multiple services or systems. If it's a single database transaction, use a regular transaction. If it crosses service boundaries, you need a saga.

  2. Partial completion is worse than complete failure. If step 3 of 5 fails and you're left in a half-done state, that's a problem. The saga ensures you either complete or cleanly recover.

A quick gut check: if you find yourself writing code like "first do X, then do Y, and if Y fails... um..." — you need a saga.

Where You've Seen This Pattern

  • Stripe's idempotency keys — Every Stripe API call accepts an Idempotency-Key header. If your server crashes after Stripe processes a charge but before you record the response, you retry with the same key. Stripe returns the original result. No double-charge. This is idempotency as a first-class API concept.
  • Kubernetes controllers — The entire K8s control plane is a saga engine. Controllers compare desired state to current state on a reconciliation loop. If a controller crashes mid-action, it restarts, re-evaluates, and acts on the delta. It doesn't need to remember what it did — it looks at what exists.
  • Airline booking systems — When you book a flight, the system reserves a seat, charges your card, issues a ticket, and sends confirmation. If the charge fails, a compensating action releases the seat hold. If ticketing fails, it retries without re-charging. Each step knows what happened before it.
