<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Diego Guerrero</title>
    <description>The latest articles on DEV Community by Diego Guerrero (@diegogue88).</description>
    <link>https://dev.to/diegogue88</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3999156%2F219119b9-33c6-48ae-aaae-b4b295f51603.png</url>
      <title>DEV Community: Diego Guerrero</title>
      <link>https://dev.to/diegogue88</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/diegogue88"/>
    <language>en</language>
    <item>
      <title>Saga orchestrator in Python</title>
      <dc:creator>Diego Guerrero</dc:creator>
      <pubDate>Wed, 24 Jun 2026 21:31:28 +0000</pubDate>
      <link>https://dev.to/diegogue88/saga-orchestrator-in-python-4bd8</link>
      <guid>https://dev.to/diegogue88/saga-orchestrator-in-python-4bd8</guid>
      <description>&lt;h1&gt;
  
  
  Building a Saga orchestrator in Python: why existing tools weren't enough and what I learned designing one from scratch
&lt;/h1&gt;

&lt;p&gt;Distributed workflows break in a specific, painful way. You charge a payment,&lt;br&gt;
reserve inventory, then try to create a shipping label — and the shipping API&lt;br&gt;
times out. The payment went through. The inventory is reserved. But the order&lt;br&gt;
never completed.&lt;/p&gt;

&lt;p&gt;Now what?&lt;/p&gt;

&lt;p&gt;The naive fix is nested try/except blocks with manual cleanup calls. Every&lt;br&gt;
developer has written this code. It works until the cleanup call fails, or the&lt;br&gt;
process crashes between steps, or two workers process the same message&lt;br&gt;
simultaneously. Then you have inconsistent state, angry customers, and no&lt;br&gt;
clear path to recovery.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# The naive approach everyone writes first
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;payment&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;charge_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;reservation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;reserve_inventory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;shipping&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;ship_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;release_inventory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;reservation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# what if THIS fails?
&lt;/span&gt;                &lt;span class="k"&gt;raise&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;refund_payment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payment&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# what if THIS fails?
&lt;/span&gt;            &lt;span class="k"&gt;raise&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# now what?
&lt;/span&gt;        &lt;span class="k"&gt;raise&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;sagakit solves this with the Saga pattern: each step in a workflow declares an&lt;br&gt;
explicit compensation handler. If a later step fails, sagakit automatically&lt;br&gt;
runs the compensations in reverse order — releasing inventory, refunding the&lt;br&gt;
payment — with retries, idempotency guarantees, and structured logging built&lt;br&gt;
in. One &lt;code&gt;docker run redis&lt;/code&gt; is all the infrastructure you need.&lt;/p&gt;


&lt;h2&gt;
  
  
  Why not existing tools?
&lt;/h2&gt;

&lt;p&gt;Before building sagakit, I looked at the existing options.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Temporal&lt;/strong&gt; is the gold standard for durable workflow orchestration. It&lt;br&gt;
handles long-running workflows, versioning, and failure recovery with&lt;br&gt;
guarantees that sagakit doesn't offer. But it requires a dedicated cluster, a&lt;br&gt;
separate worker process, and significant operational investment. For a Python&lt;br&gt;
backend engineer who needs to coordinate 3-5 steps reliably, it's a freight&lt;br&gt;
train for a commute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Celery&lt;/strong&gt; is designed for task queues — background jobs, periodic tasks,&lt;br&gt;
fan-out processing. It doesn't have a native model for compensating&lt;br&gt;
transactions. You can approximate it, but you end up building the compensation&lt;br&gt;
logic yourself anyway, without the primitives to do it safely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manual try/except&lt;/strong&gt; is what everyone writes first (see the code above). It&lt;br&gt;
leaves state inconsistent when the cleanup call itself fails, offers no protection against&lt;br&gt;
duplicate processing when a message is redelivered, and becomes unmaintainable&lt;br&gt;
past three steps.&lt;/p&gt;

&lt;p&gt;I wanted something that required only Redis — which most Python backends&lt;br&gt;
already run — made compensation logic explicit and co-located with the step it&lt;br&gt;
undoes, and felt like idiomatic async Python. That's sagakit.&lt;/p&gt;


&lt;h2&gt;
  
  
  Three design decisions worth explaining
&lt;/h2&gt;
&lt;h3&gt;
  
  
  1. Orchestration over choreography
&lt;/h3&gt;

&lt;p&gt;Sagas can be implemented two ways. In choreography, each service reacts&lt;br&gt;
autonomously to events — &lt;code&gt;payment-service&lt;/code&gt; publishes &lt;code&gt;payment.charged&lt;/code&gt;,&lt;br&gt;
&lt;code&gt;inventory-service&lt;/code&gt; listens and reserves, and so on. No central coordinator&lt;br&gt;
exists.&lt;/p&gt;

&lt;p&gt;sagakit uses orchestration: a central executor drives each step explicitly, in&lt;br&gt;
order. The entire workflow is defined in one place and readable top to bottom.&lt;/p&gt;

&lt;p&gt;The tradeoff is real — orchestration introduces a coordinator that choreography&lt;br&gt;
avoids. But for a library targeting clarity and testability, having the&lt;br&gt;
workflow in one file is worth it. You can read a saga definition and&lt;br&gt;
immediately understand what it does, what it compensates, and in what order.&lt;br&gt;
With choreography, that understanding is reconstructed from events scattered&lt;br&gt;
across services.&lt;/p&gt;
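
&lt;p&gt;To make the orchestration idea concrete, here is a minimal sketch of a central&lt;br&gt;
executor. It is not sagakit's actual API: &lt;code&gt;Step&lt;/code&gt;, &lt;code&gt;run_saga&lt;/code&gt;, and the&lt;br&gt;
&lt;code&gt;record&lt;/code&gt; helper are hypothetical names, and retries and persistence are left out.&lt;br&gt;
It only shows the workflow defined in one place and compensations run in reverse.&lt;/p&gt;

```python
import asyncio
from dataclasses import dataclass
from typing import Awaitable, Callable

Handler = Callable[[dict], Awaitable[None]]


@dataclass
class Step:
    """One saga step: a forward action plus the handler that compensates it."""
    name: str
    action: Handler
    compensate: Handler


async def run_saga(steps: list[Step], state: dict) -> str:
    """Run steps in order; on failure, compensate finished steps in reverse."""
    done: list[Step] = []
    for step in steps:
        try:
            await step.action(state)
        except Exception:
            for finished in reversed(done):
                await finished.compensate(state)
            return "compensated"
        done.append(step)
    return "completed"


def record(label: str, fail: bool = False) -> Handler:
    """Hypothetical helper: build a handler that logs its label or fails."""
    async def handler(state: dict) -> None:
        if fail:
            raise RuntimeError(f"{label} failed")
        state["log"].append(label)
    return handler


steps = [
    Step("charge_payment", record("payment.charged"), record("payment.refunded")),
    Step("reserve_inventory", record("inventory.reserved"), record("inventory.released")),
    Step("ship_order", record("order.shipped", fail=True), record("order.cancelled")),
]
state = {"log": []}
status = asyncio.run(run_saga(steps, state))
print(status, state["log"])
```

&lt;p&gt;The whole workflow is the &lt;code&gt;steps&lt;/code&gt; list, so the order of actions and the order of&lt;br&gt;
compensations can both be read from a single definition.&lt;/p&gt;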
&lt;h3&gt;
  
  
  2. Idempotency keys don't include the attempt number
&lt;/h3&gt;

&lt;p&gt;The original design used &lt;code&gt;saga_id:step_name:attempt_number&lt;/code&gt; as the idempotency&lt;br&gt;
key. It seemed logical — each attempt is a distinct event, so each gets a&lt;br&gt;
distinct key.&lt;/p&gt;

&lt;p&gt;The problem: suppose &lt;code&gt;charge_payment&lt;/code&gt; times out on attempt 1 after Stripe has&lt;br&gt;
already processed the charge, and the step retries as attempt 2. The key changes.&lt;br&gt;
When the step passes &lt;code&gt;ctx.idempotency_key&lt;/code&gt; to Stripe, Stripe&lt;br&gt;
sees two different keys and treats them as two separate charges. The customer&lt;br&gt;
gets billed twice.&lt;/p&gt;

&lt;p&gt;The fix was simple but required changing the design mid-project: drop&lt;br&gt;
&lt;code&gt;attempt_number&lt;/code&gt; from the key. All retries of the same step share&lt;br&gt;
&lt;code&gt;saga_id:step_name&lt;/code&gt;. Stripe — and any other external system that accepts an&lt;br&gt;
idempotency key — sees the same identifier across all attempts and returns the&lt;br&gt;
same result without re-executing the side effect.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;attempt_number&lt;/code&gt; still exists in &lt;code&gt;SagaContext&lt;/code&gt; for logging and observability.&lt;br&gt;
It just doesn't affect the key.&lt;/p&gt;
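
&lt;p&gt;A minimal sketch of that key construction follows. The &lt;code&gt;SagaContext&lt;/code&gt; class here is&lt;br&gt;
a hypothetical stand-in with only the three fields named above, not sagakit's real&lt;br&gt;
class, and the saga ID is made up for illustration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SagaContext:
    """Hypothetical stand-in for the context object passed to each step."""
    saga_id: str
    step_name: str
    attempt_number: int = 1  # kept for logging only

    @property
    def idempotency_key(self) -> str:
        # attempt_number is deliberately left out, so every retry of the
        # same step presents the same key to the external system.
        return f"{self.saga_id}:{self.step_name}"


first = SagaContext("saga-42", "charge_payment", attempt_number=1)
retry = SagaContext("saga-42", "charge_payment", attempt_number=2)
print(first.idempotency_key)
print(first.idempotency_key == retry.idempotency_key)
```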
&lt;h3&gt;
  
  
  3. reject() uses XACK + XADD instead of XCLAIM
&lt;/h3&gt;

&lt;p&gt;When a step fails and needs to be retried, the message must be returned to the&lt;br&gt;
stream for reprocessing. The obvious Redis primitive is XCLAIM — reassign the&lt;br&gt;
message to a consumer so it can retry.&lt;/p&gt;

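&lt;p&gt;The problem: XCLAIM only transfers ownership of a pending message to one&lt;br&gt;
specific consumer. The message stays in the Pending Entries List and is not&lt;br&gt;
redelivered to the rest of the group by a normal XREADGROUP read. If the owning&lt;br&gt;
consumer is down, the message sits there until another worker explicitly claims&lt;br&gt;
it (for example with XAUTOCLAIM), which requires a separate recovery loop.&lt;br&gt;
Without one, the saga is silently stuck.&lt;/p&gt;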

&lt;p&gt;sagakit uses XACK + XADD instead. The original message is acknowledged&lt;br&gt;
(removed from the PEL), and a new message with the same payload is published&lt;br&gt;
to the stream. Any available consumer in the group can pick it up — the retry&lt;br&gt;
is not tied to the worker that originally failed.&lt;/p&gt;

&lt;p&gt;The re-published message carries a &lt;code&gt;requeue_count&lt;/code&gt; attribute incremented on&lt;br&gt;
each rejection, so the executor can detect pathological retry loops and route&lt;br&gt;
to the DLQ after a threshold.&lt;/p&gt;
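
&lt;p&gt;The requeue logic can be sketched roughly as below. This is an illustration under&lt;br&gt;
assumptions, not sagakit's implementation: &lt;code&gt;MAX_REQUEUES&lt;/code&gt;, the stream and group&lt;br&gt;
names, and the &lt;code&gt;InMemoryClient&lt;/code&gt; test double are hypothetical. The double mirrors the&lt;br&gt;
call shape of redis-py's &lt;code&gt;xadd(name, fields)&lt;/code&gt; and &lt;code&gt;xack(name, groupname, *ids)&lt;/code&gt;, and the&lt;br&gt;
sketch assumes string field names (as with &lt;code&gt;decode_responses=True&lt;/code&gt;). Publishing before&lt;br&gt;
acknowledging is this sketch's own choice.&lt;/p&gt;

```python
MAX_REQUEUES = 3  # hypothetical threshold, not a sagakit default


def reject(client, stream, group, dlq_stream, message_id, fields):
    """Requeue a failed message, or dead-letter it after too many requeues."""
    count = int(fields.get("requeue_count", 0)) + 1
    payload = {**fields, "requeue_count": str(count)}
    target = dlq_stream if count > MAX_REQUEUES else stream
    # Publish the copy first, then acknowledge the original. A crash between
    # the two calls then duplicates the message rather than losing it.
    client.xadd(target, payload)
    client.xack(stream, group, message_id)
    return target


class InMemoryClient:
    """Test double with the same call shape as redis-py's xadd and xack."""

    def __init__(self):
        self.streams = {}
        self.acked = []

    def xadd(self, name, fields):
        self.streams.setdefault(name, []).append(dict(fields))
        return f"{len(self.streams[name])}-0"

    def xack(self, name, groupname, *ids):
        self.acked.extend(ids)
        return len(ids)


client = InMemoryClient()
message = {"saga_id": "saga-42", "step": "ship_order"}
print(reject(client, "sagas", "workers", "sagas:dlq", "1-0", message))
```

&lt;p&gt;Because the new entry is an ordinary stream message, any consumer reading the group&lt;br&gt;
with XREADGROUP can receive it. A duplicate caused by a crash between the two calls is&lt;br&gt;
tolerable only if steps are idempotent, which is what the shared idempotency key is for.&lt;/p&gt;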


&lt;h2&gt;
  
  
  Seeing it in action
&lt;/h2&gt;

&lt;p&gt;The happy path — all three steps complete successfully:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[info] payment.charged     amount=99.99 payment_id=pay_718836cb
[info] inventory.reserved  reservation_id=res_718836cb
[info] order.shipped       tracking_id=trk_718836cb

Status  : completed
Results :
  charge_payment    → pay_718836cb
  reserve_inventory → res_718836cb
  ship_order        → trk_718836cb
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The failure path — &lt;code&gt;ship_order&lt;/code&gt; fails, sagakit retries with exponential&lt;br&gt;
backoff, exhausts attempts, then compensates in reverse order:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[info]    payment.charged     amount=99.99 payment_id=pay_6856bd2b
[info]    inventory.reserved  reservation_id=res_6856bd2b
[warning] step.retrying       attempt=1 delay=0.052s error='Shipping API down'
[warning] step.retrying       attempt=2 delay=0.170s error='Shipping API down'
[error]   step.exhausted_retries attempts=3 error='Shipping API down'
[info]    inventory.released  ← compensation
[info]    payment.refunded    ← compensation

Status    : compensated
Failed at : ship_order
Rolled back: reserve_inventory, charge_payment
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two things worth noticing in the failure output. First, the backoff is visible&lt;br&gt;
in the logs — 52ms, then 170ms, with ±50% jitter applied. Second,&lt;br&gt;
compensations run in strict reverse order: inventory before payment, because&lt;br&gt;
that's the reverse of how they were acquired. sagakit guarantees this order&lt;br&gt;
regardless of which step fails.&lt;/p&gt;
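
&lt;p&gt;For reference, exponential backoff with ±50% jitter can be computed as in the sketch&lt;br&gt;
below. The base delay of 0.1s and the factor of 2 are assumptions, chosen only because&lt;br&gt;
they give ranges (0.05 to 0.15s, then 0.1 to 0.3s) that are consistent with the two&lt;br&gt;
delays in the log. sagakit's real defaults may differ, and &lt;code&gt;backoff_delay&lt;/code&gt; is a&lt;br&gt;
hypothetical name.&lt;/p&gt;

```python
import random


def backoff_delay(attempt: int, base: float = 0.1, factor: float = 2.0,
                  jitter: float = 0.5) -> float:
    """Delay in seconds before retrying after the given 1-based attempt."""
    nominal = base * factor ** (attempt - 1)
    # Scale by a random multiplier in [1 - jitter, 1 + jitter] so workers
    # that failed at the same moment do not all retry at the same moment.
    return nominal * random.uniform(1 - jitter, 1 + jitter)


for attempt in (1, 2):
    print(f"attempt={attempt} delay={backoff_delay(attempt):.3f}s")
```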

&lt;p&gt;To reproduce: &lt;code&gt;FAIL_AT_STEP=ship_order python run.py&lt;/code&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Three things I learned building this
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Compensation is not rollback
&lt;/h3&gt;

&lt;p&gt;Coming from a relational database background, my first mental model of&lt;br&gt;
compensation was "undo" — run the saga backwards and erase what happened.&lt;br&gt;
That's not what compensation is.&lt;/p&gt;

&lt;p&gt;A database rollback destroys state retroactively. It's as if the transaction&lt;br&gt;
never happened — no trace, no intermediate state, no customer-visible effect.&lt;/p&gt;

&lt;p&gt;Compensation is new forward-moving business logic that repairs the damage.&lt;br&gt;
Refunding a payment is not the same as never charging it. The customer sees&lt;br&gt;
two transactions on their bank statement. The inventory system received a&lt;br&gt;
reservation and then a cancellation — it may have allocated physical space in&lt;br&gt;
the interim. Other systems may have reacted to the original action before the&lt;br&gt;
compensation ran.&lt;/p&gt;

&lt;p&gt;This distinction has real consequences for how you write compensation handlers.&lt;br&gt;
They cannot assume the system is in the same state it was when the forward step&lt;br&gt;
ran. They must be written defensively, assuming that time has passed and other&lt;br&gt;
things have happened.&lt;/p&gt;
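
&lt;p&gt;As a rough illustration of "written defensively", the sketch below checks the current&lt;br&gt;
state before acting instead of assuming the reservation is still there. The in-memory&lt;br&gt;
&lt;code&gt;reservations&lt;/code&gt; dict and the return values are hypothetical. A real handler would query&lt;br&gt;
the inventory service.&lt;/p&gt;

```python
# Hypothetical in-memory stand-in for the inventory service's state.
reservations = {"res_1": "reserved"}


def release_inventory(reservation_id: str) -> str:
    """Compensate a reservation without assuming what state it is in now."""
    status = reservations.get(reservation_id)
    if status is None:
        # Expired or cleaned up by another process: nothing left to undo.
        return "not_found"
    if status == "released":
        # The compensation was already applied, e.g. on a redelivered message.
        return "already_released"
    reservations[reservation_id] = "released"
    return "released"


print(release_inventory("res_1"))  # first call performs the release
print(release_inventory("res_1"))  # a repeated call is a harmless no-op
```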

&lt;h3&gt;
  
  
  Design decisions reveal their flaws during implementation
&lt;/h3&gt;

&lt;p&gt;The idempotency key started as &lt;code&gt;saga_id:step_name:attempt_number&lt;/code&gt;. It seemed&lt;br&gt;
correct on paper — each attempt is a distinct event, so each gets a distinct&lt;br&gt;
identifier.&lt;/p&gt;

&lt;p&gt;The flaw only became visible when I thought through the implementation&lt;br&gt;
concretely: if the key changes between retries, external systems like Stripe&lt;br&gt;
see different keys and treat each attempt as a new transaction. A payment gets&lt;br&gt;
charged twice.&lt;/p&gt;

&lt;p&gt;The fix was one line of code. But it required updating the ADR, the&lt;br&gt;
implementation, and the tests — and more importantly, it required catching the&lt;br&gt;
assumption before it shipped.&lt;/p&gt;

&lt;p&gt;This is why I wrote Architecture Decision Records before writing code. The ADR&lt;br&gt;
for idempotency forced me to think through the key construction explicitly,&lt;br&gt;
which is when the flaw surfaced. Without that document, the bug would have&lt;br&gt;
lived in the code until a real payment was doubled.&lt;/p&gt;

&lt;h3&gt;
  
  
  Writing the decision before writing the code
&lt;/h3&gt;

&lt;p&gt;I had never written an Architecture Decision Record before this project. My&lt;br&gt;
previous approach was the common one: make a decision, write the code, maybe&lt;br&gt;
add a comment explaining why.&lt;/p&gt;

&lt;p&gt;The difference with ADRs is that you document not just what you decided, but&lt;br&gt;
what you considered and rejected — and why. That forces a level of rigor that&lt;br&gt;
commenting doesn't. You can't write "rejected because operationally heavy"&lt;br&gt;
without first asking yourself: heavy compared to what? Heavy for whom?&lt;/p&gt;

&lt;p&gt;Four ADRs later, the project has a paper trail of every major architectural&lt;br&gt;
choice: why Sagas over 2PC, why Redis Streams over Kafka, how idempotency&lt;br&gt;
works, and which compensation guarantees are provided and which aren't. A new&lt;br&gt;
contributor — or a future version of me — can read those documents and&lt;br&gt;
understand not just what the system does, but why it exists in this shape.&lt;/p&gt;

&lt;p&gt;I won't build a non-trivial system without them again.&lt;/p&gt;




&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;sagakit is pre-alpha — the API may change, and it hasn't been benchmarked in&lt;br&gt;
production. But it works, it's tested, and the order-processing example runs&lt;br&gt;
in under five minutes with a single &lt;code&gt;docker compose up -d&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;If you're building event-driven workflows in Python and the naive try/except&lt;br&gt;
approach is starting to hurt, give it a try and let me know what breaks.&lt;/p&gt;

&lt;p&gt;→ &lt;a href="https://github.com/diegogue88/sagakit" rel="noopener noreferrer"&gt;github.com/diegogue88/sagakit&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What patterns do you use for distributed transactions in Python? I'd love to&lt;br&gt;
hear what's working — and what isn't — in the comments.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>distributedsystems</category>
      <category>microservices</category>
      <category>python</category>
    </item>
  </channel>
</rss>
