<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kingsley Onoh</title>
    <description>The latest articles on DEV Community by Kingsley Onoh (@kingsleyonoh).</description>
    <link>https://dev.to/kingsleyonoh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F568563%2F01c09769-b072-47a3-9c7b-16fefc2c573e.png</url>
      <title>DEV Community: Kingsley Onoh</title>
      <link>https://dev.to/kingsleyonoh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kingsleyonoh"/>
    <language>en</language>
    <item>
      <title>Confidence Is Not Ownership</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Wed, 06 May 2026 16:00:28 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/confidence-is-not-ownership-2d80</link>
      <guid>https://dev.to/kingsleyonoh/confidence-is-not-ownership-2d80</guid>
      <description>&lt;p&gt;What should a finance queue do when two credible records point at the same case?&lt;/p&gt;

&lt;p&gt;An invoice discrepancy and a contract breach can describe the same dispute. They can also describe two different disputes with the same counterparty, the same currency, and nearly the same amount. That is the trap in a finance operations queue. The data looks related before anyone has proved ownership.&lt;/p&gt;

&lt;p&gt;The Workbench ingests exceptions from invoice reconciliation, transaction reconciliation, contract lifecycle events, webhook dead letters, manual operator entry, and signed Hub fanout. Every source carries its own identifiers. Some are reliable. Some are only reliable inside the upstream tool that produced them.&lt;/p&gt;

&lt;p&gt;I had to decide what the system should do when a new exception looks like it belongs to an existing dispute.&lt;/p&gt;

&lt;p&gt;The tempting version is simple: compute a score, pick the highest-scoring dispute, attach the exception. That makes demos feel clean. Exceptions flow in, disputes become richer, and the queue stays small.&lt;/p&gt;

&lt;p&gt;It is also how a finance system quietly corrupts its own audit trail.&lt;/p&gt;

&lt;p&gt;Once an exception is attached to a dispute, every later action inherits that fact. SLA timers, resolution playbooks, Notification Hub events, audit PDF exports, and operator comments all treat the relationship as true. If the relationship was only probable, the system has converted probability into evidence.&lt;/p&gt;

&lt;p&gt;That conversion is the real design problem.&lt;/p&gt;

&lt;h2&gt;The Scoring Shape&lt;/h2&gt;

&lt;p&gt;The correlator in &lt;code&gt;src/drw/domain/correlator.clj&lt;/code&gt; uses seven signals. Source reference and entity id each carry 0.15. Counterparty carries 0.25. Currency carries 0.10. Amount carries 0.15. Date carries 0.10. Category carries 0.10.&lt;/p&gt;
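
&lt;p&gt;Those weights sum to 1.0, so a candidate that matches every signal scores exactly 1.0 and each threshold reads as a fraction of a perfect match. A minimal sketch of that shape, in TypeScript for illustration; the signal names here are mine, and the real implementation is the Clojure in &lt;code&gt;src/drw/domain/correlator.clj&lt;/code&gt;:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;const WEIGHTS: { [signal: string]: number } = {
  "source-reference": 0.15,
  "entity-id": 0.15,
  "counterparty": 0.25,
  "currency": 0.10,
  "amount": 0.15,
  "date": 0.10,
  "category": 0.10,
};

// A matched signal contributes its full weight; a perfect candidate scores 1.0.
function totalScore(matchedSignals: string[]): number {
  let sum = 0;
  for (const name of matchedSignals) {
    sum += WEIGHTS[name] ?? 0;
  }
  return sum;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;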

&lt;p&gt;Those weights are not magic in the sense of being secret. They are visible because this is a spec project. But the structure matters more than the numbers.&lt;/p&gt;

&lt;p&gt;Counterparty is the gate. A candidate dispute is not eligible unless it belongs to the same tenant, is not terminal, has the same counterparty, and falls within the correlation window. Only then does scoring begin.&lt;/p&gt;

&lt;p&gt;That means the correlator is not a general similarity search. It is a tenant-scoped dispute ownership test.&lt;/p&gt;
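
&lt;p&gt;The gate is easy to state as code. A sketch with invented field names, and with the window length borrowed from the 72-hour date rule as an assumption; the real predicate is &lt;code&gt;candidate-eligible?&lt;/code&gt; in the Clojure source:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;interface Dispute {
  tenantId: string;
  counterparty: string;
  terminal: boolean;
  createdAt: Date;
}

interface ExceptionRecord {
  counterparty: string;
  observedAt: Date;
}

const CORRELATION_WINDOW_HOURS = 72; // assumption: reusing the 72-hour date rule

function candidateEligible(tenantId: string, ex: ExceptionRecord, d: Dispute): boolean {
  if (d.tenantId !== tenantId) return false;            // tenant boundary, checked in the domain
  if (d.terminal) return false;                          // terminal disputes never re-open by similarity
  if (d.counterparty !== ex.counterparty) return false;  // the gate
  const hours = Math.abs(ex.observedAt.getTime() - d.createdAt.getTime()) / 3_600_000;
  return hours &amp;lt;= CORRELATION_WINDOW_HOURS;          // inside the correlation window
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;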

&lt;p&gt;The amount signal has a 10 percent tolerance and only scores when the currency also matches. The date signal checks whether the exception was observed within 72 hours of the dispute creation time. The source reference and entity id signals compare against exceptions already attached to the candidate dispute, not just fields on the dispute itself.&lt;/p&gt;

&lt;p&gt;That last part matters. A dispute becomes easier to recognize as it accumulates evidence. The first invoice mismatch may create the dispute. A later webhook dead letter with the same upstream reference can now match the attached evidence, even if the dispute record itself does not carry that reference.&lt;/p&gt;
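
&lt;p&gt;The amount and date checks are small enough to sketch directly. TypeScript for illustration, with hypothetical signatures; treating the 72-hour rule as an absolute time difference is my assumption:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Amount scores only when currency already matched; tolerance is 10 percent.
function amountMatches(exceptionAmount: number, disputeAmount: number, currencyMatched: boolean): boolean {
  if (!currencyMatched) return false;
  return Math.abs(exceptionAmount - disputeAmount) &amp;lt;= 0.10 * disputeAmount;
}

// Date scores when the exception was observed within 72 hours of dispute creation.
function dateMatches(observedAt: Date, disputeCreatedAt: Date): boolean {
  const hours = Math.abs(observedAt.getTime() - disputeCreatedAt.getTime()) / 3_600_000;
  return hours &amp;lt;= 72;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;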

&lt;p&gt;The core function is plain Clojure:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight clojure"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;defn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;score-candidates&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;tenant-id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;exception&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;disputes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;attached&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;score-candidates&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tenant-id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;exception&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;disputes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;attached&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{}))&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="n"&gt;tenant-id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;exception&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;disputes&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;attached&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;let&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;merge&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;default-config&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;opts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
         &lt;/span&gt;&lt;span class="n"&gt;review&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;get-in&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="no"&gt;:thresholds&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;:review&lt;/span&gt;&lt;span class="p"&gt;])]&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;-&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;disputes&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;map-indexed&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;vector&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;fn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dispute&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;&lt;span class="w"&gt;
                    &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;candidate-eligible?&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tenant-id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;exception&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dispute&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;fn&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dispute&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt;&lt;span class="w"&gt;
                 &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;assoc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;score-candidate&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;tenant-id&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;exception&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;dispute&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;attached&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt;
                        &lt;/span&gt;&lt;span class="no"&gt;:sort-index&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;idx&lt;/span&gt;&lt;span class="p"&gt;)))&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;filter&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;&amp;gt;=&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="no"&gt;:score&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;%&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;review&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;sort-by&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;juxt&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;comp&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nb"&gt;-&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;:score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;:sort-index&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="w"&gt;
          &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;mapv&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="o"&gt;#&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;dissoc&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="n"&gt;%&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="no"&gt;:sort-index&lt;/span&gt;&lt;span class="p"&gt;))))))&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There are two details in that function I care about.&lt;/p&gt;

&lt;p&gt;First, eligibility happens before scoring. A cross-tenant dispute receives no score. A terminal dispute receives no score. A different counterparty receives no score. The function does not let a high amount or date match compensate for a broken boundary.&lt;/p&gt;

&lt;p&gt;Second, ties preserve input order through &lt;code&gt;:sort-index&lt;/code&gt;. That is not glamorous. It prevents unstable review queues where two equal candidates swap positions between renders and make operators think the system changed its mind.&lt;/p&gt;

&lt;h2&gt;What I Was Wrong About&lt;/h2&gt;

&lt;p&gt;I initially treated the correlation score as the hard part.&lt;/p&gt;

&lt;p&gt;It was not. The harder question was what the system does with the score.&lt;/p&gt;

&lt;p&gt;There are three possible outcomes in &lt;code&gt;src/drw/domain/exceptions.clj&lt;/code&gt;. If no candidate passes review, the exception creates a new dispute and attaches immediately. If the best candidate hits the auto-merge band and auto-merge is explicitly enabled, the exception attaches and records an auto-merged correlation. Otherwise, the system creates pending correlation records and emits a &lt;code&gt;dispute.correlation_pending&lt;/code&gt; event.&lt;/p&gt;

&lt;p&gt;That middle branch is intentionally hard to reach. The &lt;code&gt;.env.example&lt;/code&gt; values set &lt;code&gt;AUTO_MERGE_THRESHOLD=0.92&lt;/code&gt; and &lt;code&gt;REVIEW_THRESHOLD=0.70&lt;/code&gt;, while the source correlator defaults are lower for unit-level behavior. The runtime config is stricter because this is finance operations. False attachment costs more than a larger review queue.&lt;/p&gt;
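
&lt;p&gt;The branch order matters more than the exact numbers. A sketch of the routing, using the &lt;code&gt;.env.example&lt;/code&gt; thresholds; the function name and return values are illustrative:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;const AUTO_MERGE_THRESHOLD = 0.92;
const REVIEW_THRESHOLD = 0.70;

type Outcome = "create-new-dispute" | "auto-merge" | "pending-review";

function routeException(bestScore: number | undefined, autoMergeEnabled: boolean): Outcome {
  // No candidate cleared review: the exception founds its own dispute.
  if (bestScore === undefined) return "create-new-dispute";
  if (bestScore &amp;lt; REVIEW_THRESHOLD) return "create-new-dispute";
  // Auto-merge requires both the higher band and an explicit policy opt-in.
  if (autoMergeEnabled) {
    if (bestScore &amp;gt;= AUTO_MERGE_THRESHOLD) return "auto-merge";
  }
  // Default: a human decides; emit dispute.correlation_pending.
  return "pending-review";
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;The shape guarantees that a missing policy flag degrades to review, never to silent attachment.&lt;/p&gt;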

&lt;p&gt;What surprised me is that pending correlation became a domain object, not a UI convenience.&lt;/p&gt;

&lt;p&gt;The queue needed an id, a tenant id, an exception id, a target dispute id, a score, a rationale, a status, a decided-by user, and decision timestamps. That is a lot of structure for something that could have been a modal row.&lt;/p&gt;

&lt;p&gt;But the moment an operator accepts or rejects a candidate, that decision becomes part of the case history. A rejected match is useful evidence. It says someone looked at the overlap and decided the exception did not belong there. If the same upstream source sends a related item later, the prior rejection explains why the system did not combine the cases earlier.&lt;/p&gt;

&lt;p&gt;That is why correlation records live next to exceptions and disputes instead of inside a transient UI response.&lt;/p&gt;

&lt;h2&gt;The Failure Mode Hidden In Good Matches&lt;/h2&gt;

&lt;p&gt;The most dangerous false match is not a ridiculous one.&lt;/p&gt;

&lt;p&gt;It is the match that looks reasonable.&lt;/p&gt;

&lt;p&gt;Same counterparty. Same currency. Amount within 10 percent. Observed inside three days. Category is billing. If those signals point to the wrong open dispute, the system does not look broken. It looks efficient.&lt;/p&gt;

&lt;p&gt;The damage appears later. A Workflow Engine playbook starts against the wrong case. A Notification Hub event tells an operator that the dispute is ready for resolution. The audit PDF now contains an exception that belongs somewhere else. Nobody sees the root mistake because every downstream artifact is internally consistent.&lt;/p&gt;

&lt;p&gt;That is the kind of bug that worries me more than a 500 response.&lt;/p&gt;

&lt;p&gt;A 500 stops the flow. A wrong attachment keeps moving.&lt;/p&gt;

&lt;p&gt;The design answer was to make confidence create a decision, not mutate the dispute. A review-band candidate becomes work for an operator. The UI exposes accept and reject actions. The API carries the same boundary. The audit log records correlation creation and later decisions.&lt;/p&gt;

&lt;p&gt;The Workbench still supports auto-merge, but it is a policy choice. It has to be enabled. The score has to clear the higher band. The code does not pretend the existence of a scoring function means the business has accepted the risk.&lt;/p&gt;

&lt;h2&gt;Why Tenant Scope Belongs Inside The Algorithm&lt;/h2&gt;

&lt;p&gt;Tenant isolation is usually discussed at the HTTP layer. API key comes in, tenant id gets attached to the request, handlers filter queries.&lt;/p&gt;

&lt;p&gt;That is necessary, but it is not enough here.&lt;/p&gt;

&lt;p&gt;Correlation is a cross-entity operation by nature. It compares a new exception against many existing disputes and attached exceptions. If the algorithm accepts a list that accidentally contains another tenant's disputes, the HTTP layer is already too far away to save it.&lt;/p&gt;

&lt;p&gt;So &lt;code&gt;score-candidate&lt;/code&gt; checks tenant equality itself. &lt;code&gt;score-candidates&lt;/code&gt; filters candidates through &lt;code&gt;candidate-eligible?&lt;/code&gt;, which repeats tenant, status, counterparty, and time-window checks before scoring.&lt;/p&gt;

&lt;p&gt;This is defensive duplication with a purpose. The route should pass tenant-scoped collections. The domain should still reject anything outside the tenant boundary. In a single-tenant fixture, this looks redundant. In a two-tenant test, it is the difference between "the route behaved" and "the invariant held."&lt;/p&gt;

&lt;p&gt;The same philosophy appears in reports. The audit PDF renderer captures a tenant snapshot, renders with strict token lookup, and the setup check renders two tenants to make sure one tenant's identity literals never appear in the other tenant's output.&lt;/p&gt;

&lt;p&gt;The theme is the same: do not trust a boundary because a previous layer probably handled it.&lt;/p&gt;

&lt;h2&gt;The Result&lt;/h2&gt;

&lt;p&gt;The finished local build passed 160 tests with 869 assertions. The full-flow E2E drives invoice adapter polling into exception creation, assignment, investigation, Workflow Engine resolution polling, Notification Hub event capture, and audit PDF generation.&lt;/p&gt;

&lt;p&gt;The number I care about most is smaller: the dashboard guard. It caught a practical operations failure. The first dashboard shape rendered a link for every dispute in the tenant fixture, with no cap. The fix capped the overview at 50 open disputes and kept totals intact.&lt;/p&gt;

&lt;p&gt;That is the Workbench in miniature. Preserve the facts. Limit the surface. Make the operator decide when the machine only has a probability.&lt;/p&gt;

&lt;p&gt;Confidence is useful. Ownership is a human or policy decision.&lt;/p&gt;

</description>
      <category>clojure</category>
      <category>correlation</category>
      <category>financeoperations</category>
      <category>tenantisolation</category>
    </item>
    <item>
      <title>Why I Made WebSocket Delivery the Disposable Part of the Tracking System</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Wed, 06 May 2026 14:53:35 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/why-i-made-websocket-delivery-the-disposable-part-of-the-tracking-system-1fej</link>
      <guid>https://dev.to/kingsleyonoh/why-i-made-websocket-delivery-the-disposable-part-of-the-tracking-system-1fej</guid>
      <description>&lt;p&gt;The uncomfortable failure is not a carrier outage. That one is loud. The polling loop logs it, retries it, and moves on.&lt;/p&gt;

&lt;p&gt;The failure I had to design around is quieter: the gateway receives a real carrier update, writes it to the database, then Redis is down at the exact moment the WebSocket fanout should happen. The client misses the live push. The shipment state is still correct. The event timeline is still correct. But the thing the user was watching in real time never moves.&lt;/p&gt;

&lt;p&gt;That sounds like a broken real-time system until you decide which part is allowed to be temporary.&lt;/p&gt;

&lt;p&gt;I made the database the truth and the WebSocket stream the delivery layer. That one decision shaped the rest of the gateway.&lt;/p&gt;

&lt;h2&gt;The real boundary&lt;/h2&gt;

&lt;p&gt;DHL, DPD, and GLS do not send one clean stream of facts. The adapters have to handle different request shapes, different status codes, and different timestamp rules. DHL uses &lt;code&gt;DHL-API-Key&lt;/code&gt; and returns shipment events under &lt;code&gt;shipments[0].events&lt;/code&gt;. DPD can return HTML for an error response, so &lt;code&gt;DpdAdapter&lt;/code&gt; checks the content type before it ever tries to parse JSON. GLS sends a date like &lt;code&gt;06.05.2026&lt;/code&gt; and a local time string, so &lt;code&gt;GlsAdapter&lt;/code&gt; has to parse CET into UTC.&lt;/p&gt;

&lt;p&gt;That is all adapter work. Once an event crosses into the core, it has one shape: carrier, tracking number, carrier status, normalized status, carrier timestamp, and a deterministic &lt;code&gt;dedupKey&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The key is generated from four values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;generateDedupKey&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;DedupKeyInput&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;carrier&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;normalizeCarrier&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;carrier&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;trackingNumber&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;assertRequiredString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;trackingNumber&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;trackingNumber&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;carrierStatus&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;assertRequiredString&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;carrierStatus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;carrierStatus&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;carrierTimestampIso&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;normalizeTimestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;carrierTimestamp&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;carrier&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;trackingNumber&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;carrierStatus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;carrierTimestampIso&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That function is plain, but it carries the system. A carrier can send the same scan twice. A poll can retry after a timeout. A process can restart and fetch the same history again. If the fact is the same, the key is the same.&lt;/p&gt;

&lt;p&gt;What surprised me was how much of the gateway became simpler once duplicate handling moved into the event identity instead of the polling loop. The poller does not need to remember what it saw last time. The adapters do not need per-carrier duplicate caches. The WebSocket layer does not decide whether something is new. The processor decides once, against PostgreSQL.&lt;/p&gt;

&lt;h2&gt;The processor is the load-bearing part&lt;/h2&gt;

&lt;p&gt;I was wrong at first about where the "real-time" complexity would live. I expected it to be WebSockets: connection cleanup, heartbeats, subscription maps, broadcast performance. Those were real problems, but they were not the hard correctness problem.&lt;/p&gt;

&lt;p&gt;The hard part was making sure Redis never became the source of truth by accident.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;createEventProcessor&lt;/code&gt; starts with a dedup lookup, then inserts into &lt;code&gt;tracking_events&lt;/code&gt; with &lt;code&gt;onConflictDoNothing()&lt;/code&gt;. That second guard matters because two workers can pass the lookup at the same time. The database still gets the final vote.&lt;/p&gt;

&lt;p&gt;After the insert, it updates the shipment projection only when the new carrier timestamp is newer than &lt;code&gt;last_event_at&lt;/code&gt;. Older events are still persisted. They just do not move the visible shipment state backwards.&lt;/p&gt;

&lt;p&gt;The important shape is this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;updatedRows&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;options&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;shipments&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;currentStatus&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;normalizedStatus&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;lastEventAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;carrierTimestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;updatedAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nf"&gt;and&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nf"&gt;eq&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;shipments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;shipmentId&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
      &lt;span class="nf"&gt;or&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;isNull&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;shipments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;lastEventAt&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nf"&gt;lt&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;shipments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;lastEventAt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;carrierTimestamp&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;returning&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;shipments&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nx"&gt;projectionUpdated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;updatedRows&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That is the line between history and projection. An out-of-order scan belongs in history. It does not belong on the dashboard as the current state.&lt;/p&gt;

&lt;p&gt;Then Redis publish happens after the database work. If the database write fails, the processor returns &lt;code&gt;failed&lt;/code&gt; and does not publish. If Redis publish fails, the processor still returns &lt;code&gt;processed&lt;/code&gt; with &lt;code&gt;published: false&lt;/code&gt;. That asymmetry is deliberate. A WebSocket update can be missed. A carrier fact cannot be invented.&lt;/p&gt;
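
&lt;p&gt;That ordering compresses into a few lines. A sketch; &lt;code&gt;Deps&lt;/code&gt; and the return shape here are stand-ins, not the actual &lt;code&gt;createEventProcessor&lt;/code&gt; surface:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;interface Deps {
  insertEvent(event: object): void; // PostgreSQL write, the source of truth
  publish(event: object): void;     // Redis stream entry, the delivery layer
}

function processEvent(deps: Deps, event: object): { status: string; published?: boolean } {
  try {
    deps.insertEvent(event);
  } catch {
    // No publish on a failed write: Redis never learns a fact PostgreSQL rejected.
    return { status: "failed" };
  }
  try {
    deps.publish(event);
    return { status: "processed", published: true };
  } catch {
    // Delivery is disposable; the fact survives and can be re-read from the database.
    return { status: "processed", published: false };
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;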

&lt;p&gt;The integration tests capture both sides. One test inserts an invalid shipment id and asserts that Redis stays empty. Another makes the publisher throw &lt;code&gt;redis unavailable&lt;/code&gt; and asserts the PostgreSQL event still exists. Those tests are more useful than a happy-path WebSocket demo because they pin down which failure the business can recover from.&lt;/p&gt;

&lt;h2&gt;Why not broadcast directly?&lt;/h2&gt;

&lt;p&gt;The simple version is tempting. A client subscribes to a tracking number. The processor receives an event. It loops over sockets and calls &lt;code&gt;send()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That works until the process restarts, the send throws, or the event arrives in a worker that does not own the socket. Even in a single-process version, direct send ties event processing to live delivery. That is the dependency I wanted to avoid.&lt;/p&gt;

&lt;p&gt;Redis Streams gave the gateway a narrow delivery contract. The processor writes one stream entry to &lt;code&gt;tracking:events&lt;/code&gt;. The broadcaster consumes as group &lt;code&gt;ws-broadcaster&lt;/code&gt;, parses the payload, looks up connection ids by tracking number, and sends JSON to sockets registered in the current process.&lt;/p&gt;

&lt;p&gt;The broadcaster code has a small but telling rule: malformed stream entries are acknowledged.&lt;/p&gt;

&lt;p&gt;At first that felt wrong. Acknowledging a bad message means admitting it will never be delivered. But leaving invalid JSON pending forever is worse. It blocks operational visibility and makes the group look unhealthy for a message that cannot succeed on retry. The test for that case writes &lt;code&gt;{&lt;/code&gt; as the payload and asserts it gets logged and acked.&lt;/p&gt;
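
&lt;p&gt;The rule fits in one function. A sketch, with an invented &lt;code&gt;StreamHandler&lt;/code&gt; standing in for the real Redis client and socket registry:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;interface StreamHandler {
  deliver(message: object): void;
  ack(): void;
  log(line: string): void;
}

function handleStreamEntry(rawPayload: string, h: StreamHandler): void {
  let parsed: object;
  try {
    parsed = JSON.parse(rawPayload);
  } catch {
    // A payload that cannot parse today cannot parse on retry either.
    // Log it, ack it, keep the pending list meaningful.
    h.log("malformed stream entry: " + rawPayload);
    h.ack();
    return;
  }
  h.deliver(parsed);
  h.ack();
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;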

&lt;p&gt;The other Redis detail is &lt;code&gt;XAUTOCLAIM&lt;/code&gt;. If a consumer reads a stream entry and dies before acknowledging it, the message sits pending. The broadcaster reclaims stuck messages after 60 seconds. That is enough for the current single-service deployment, and it creates a clear upgrade path for multi-process delivery later.&lt;/p&gt;

&lt;h2&gt;The polling loop stays boring on purpose&lt;/h2&gt;

&lt;p&gt;The polling engine is less clever than the rest of the system. That is good.&lt;/p&gt;

&lt;p&gt;It loads enabled carrier configs from the database, starts one loop per carrier, and queries active shipments, excluding those in &lt;code&gt;delivered&lt;/code&gt; or &lt;code&gt;returned&lt;/code&gt; status and those soft-deleted via &lt;code&gt;deleted_at&lt;/code&gt;. It then batches them by &lt;code&gt;POLL_BATCH_SIZE&lt;/code&gt;, which defaults to 10.&lt;/p&gt;

&lt;p&gt;Each shipment call runs through &lt;code&gt;withExponentialBackoff&lt;/code&gt;. &lt;code&gt;RateLimitError&lt;/code&gt; and &lt;code&gt;CarrierError&lt;/code&gt; are retryable by default. The delay is &lt;code&gt;baseDelayMs * 2 ** attempt&lt;/code&gt;, with the base coming from &lt;code&gt;carrier_configs.backoff_base_ms&lt;/code&gt;.&lt;/p&gt;
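&lt;p&gt;The delay schedule is simple enough to write down. This is a hedged reconstruction of the arithmetic, not the gateway's real &lt;code&gt;withExponentialBackoff&lt;/code&gt;; the formula comes from the description above, the function name and shape are assumptions:&lt;/p&gt;

```typescript
// Sketch of the retry delay schedule: baseDelayMs * 2 ** attempt for each
// attempt, with baseDelayMs taken from carrier_configs.backoff_base_ms.
function backoffDelays(baseDelayMs: number, maxAttempts: number): number[] {
  const delays: number[] = [];
  for (let attempt = 0; attempt !== maxAttempts; attempt++) {
    delays.push(baseDelayMs * 2 ** attempt);
  }
  return delays;
}
```

&lt;p&gt;With a 500ms base, three attempts wait 500ms, 1000ms, and 2000ms.&lt;/p&gt;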

&lt;p&gt;The scheduler has one job: skip overlapping cycles. If a carrier cycle is still running when the next interval fires, it returns a structured &lt;code&gt;skipped&lt;/code&gt; result with reason &lt;code&gt;cycle_already_running&lt;/code&gt;. It does not start a second poller and hope the processor dedup saves the day.&lt;/p&gt;
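&lt;p&gt;The guard can be sketched with a per-carrier in-flight flag. This is an illustration under that assumption, not the scheduler's real code; only the &lt;code&gt;cycle_already_running&lt;/code&gt; reason string comes from the system itself:&lt;/p&gt;

```typescript
// Sketch of the non-overlap guard: one in-flight flag per carrier, and a
// structured skip result instead of a second concurrent poller.
type CycleResult = { skipped: boolean; reason?: string };

const running: { [carrier: string]: boolean } = {};

function startCycle(carrier: string, runPoll: () => void): CycleResult {
  if (running[carrier]) {
    // Do not start a second poller and hope downstream dedup saves the day.
    return { skipped: true, reason: "cycle_already_running" };
  }
  running[carrier] = true;
  try {
    runPoll();
  } finally {
    running[carrier] = false;
  }
  return { skipped: false };
}
```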

&lt;p&gt;That skip behavior matters because the PRD target is 100 active shipments across 3 carriers, with a carrier polling cycle under 30 seconds and WebSocket delivery under 200ms once the event enters the pipeline. If the poller stacks, the gateway creates its own load spike and every downstream metric lies.&lt;/p&gt;

&lt;h2&gt;
  
  
  The timestamp bug that exposed the database shape
&lt;/h2&gt;

&lt;p&gt;The most useful gotcha came from pagination, not carrier integration. Cursor pagination repeated a shipment on page 2 even though the SQL condition compared the sort timestamp against the cursor timestamp.&lt;/p&gt;

&lt;p&gt;The schema uses PostgreSQL &lt;code&gt;timestamp without time zone&lt;/code&gt;. Passing a JavaScript &lt;code&gt;Date&lt;/code&gt; through Drizzle and node-postgres introduced enough interpretation drift that the comparison did not behave like the stored value. The fix was to format the cursor as a PostgreSQL timestamp literal and cast it with &lt;code&gt;::timestamp&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That is why &lt;code&gt;shipments.routes.ts&lt;/code&gt; has this helper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;toPgTimestamp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;T&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt; &lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Z&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It looks like formatting trivia. It is actually a pagination invariant. If the cursor repeats rows, clients can process the same shipment twice, or skip one entirely when they try to recover.&lt;/p&gt;

&lt;h2&gt;
  
  
  The part I left deliberately unfinished
&lt;/h2&gt;

&lt;p&gt;The stats endpoint exposes a limitation instead of pretending the system has a metric it does not own yet. It returns active shipments and today's event counts by carrier, but &lt;code&gt;error_rate&lt;/code&gt; and &lt;code&gt;errors&lt;/code&gt; are &lt;code&gt;null&lt;/code&gt;. The response includes a &lt;code&gt;metrics_limitations&lt;/code&gt; entry that says carrier poll failure counters are not persisted yet.&lt;/p&gt;

&lt;p&gt;That is not a polished answer, but it is the honest one. The polling engine logs retry attempts and per-cycle errors, and the build journal records the summary fields: carrier, shipments polled, events found, deduplicated events, and errors. Those numbers exist at runtime. They do not yet survive as queryable history.&lt;/p&gt;

&lt;p&gt;I prefer that over a fake rate derived from current memory. A dashboard number that resets on restart is worse than no number because it invites operational decisions from incomplete evidence. If this gateway were being deployed for a client, the next increment would be a small poll-cycle table or metrics sink before anyone relied on &lt;code&gt;/api/stats&lt;/code&gt; for carrier health.&lt;/p&gt;

&lt;p&gt;The same honesty shows up in the success criteria. The system has the local loop, the production-start check, the Docker Compose smoke proof, and the mock-carrier journey. It does not have the 24-hour simulated-load soak. That missing box is not paperwork. It is the difference between "the architecture handles the failure model" and "this process survived a day of real runtime pressure."&lt;/p&gt;

&lt;h2&gt;
  
  
  The result
&lt;/h2&gt;

&lt;p&gt;The finished local build passed 134 Vitest tests. The unit and integration coverage run reported 83.32% statement and line coverage. The WebSocket E2E test publishes a Redis stream event and asserts delivery under 200ms after the broadcaster group exists. The mock-carrier demo registers DHL, DPD, and GLS shipments, subscribes over WebSocket, processes events through the real event processor, verifies duplicate suppression, and confirms the REST API shows in-transit shipments.&lt;/p&gt;

&lt;p&gt;The 24-hour soak is still deferred. That matters. A gateway like this is not production-proven just because the happy path works for a few minutes.&lt;/p&gt;

&lt;p&gt;But the failure model is in the right order. Carrier facts land in PostgreSQL first. Shipment state is a projection. Redis carries delivery. WebSockets are allowed to miss a push. The timeline is not allowed to lie.&lt;/p&gt;

</description>
      <category>eventsourcing</category>
      <category>redisstreams</category>
      <category>websocket</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Why I Made the Ledger Refuse Single Rows</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Tue, 05 May 2026 19:35:17 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/why-i-made-the-ledger-refuse-single-rows-37jo</link>
      <guid>https://dev.to/kingsleyonoh/why-i-made-the-ledger-refuse-single-rows-37jo</guid>
      <description>&lt;p&gt;The bug I kept designing around was not "wrong penalty amount."&lt;/p&gt;

&lt;p&gt;Wrong amounts are visible. A supplier sees a credit note for EUR 8400, checks the contract, and pushes back. The operator can investigate. The trail exists.&lt;/p&gt;

&lt;p&gt;The quieter failure is a penalty row with no mirror. One side of the system says the buyer is owed money. The other side never records the supplier-facing debit. The total looks correct in the dashboard because the credit row exists. The settlement builder can even pick it up. But the accounting story is incomplete, and it only becomes obvious when someone asks why the counterparty view does not match the buyer view.&lt;/p&gt;

&lt;p&gt;That is the sort of error append-only systems can preserve forever.&lt;/p&gt;

&lt;p&gt;So I made the ledger refuse single rows.&lt;/p&gt;

&lt;p&gt;The system calculates supplier SLA penalties. A missed response target, a delivery window breach, a compounding daily penalty, a tier crossing, or a per-ticket miss becomes money owed. That money then needs to move through three phases: accrual, possible reversal, and settlement. Each phase has a tempting shortcut.&lt;/p&gt;

&lt;p&gt;For accrual, the shortcut is one row with &lt;code&gt;amount_cents&lt;/code&gt; and &lt;code&gt;direction&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;For reversal, the shortcut is updating the original row.&lt;/p&gt;

&lt;p&gt;For settlement, the shortcut is marking the original row as settled.&lt;/p&gt;

&lt;p&gt;All three shortcuts make the data easier to write and harder to defend.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Shape That Forced the Design
&lt;/h2&gt;

&lt;p&gt;The PRD had a hard constraint: &lt;code&gt;penalty_ledger&lt;/code&gt; is append-only. Disputes, withdrawals, and corrections use compensating entries. Settlement membership lives outside the ledger. That sounds clean until the first application flow has to implement it.&lt;/p&gt;

&lt;p&gt;The accrual worker receives a breach, loads the contract and clause, calls &lt;code&gt;RulesEngine.calculatePenalty&lt;/code&gt;, and gets one answer back: no penalty, a domain error, an accrued amount, or a capped amount. The amount is not the ledger. It is only a financial fact waiting to become a record.&lt;/p&gt;

&lt;p&gt;The ledger record has more obligations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a credit side and mirror side&lt;/li&gt;
&lt;li&gt;same amount and currency&lt;/li&gt;
&lt;li&gt;same tenant, contract, counterparty, clause, and breach&lt;/li&gt;
&lt;li&gt;same accrual period&lt;/li&gt;
&lt;li&gt;same entry kind&lt;/li&gt;
&lt;li&gt;same compensation reference when reversing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any of those differ, the pair is not a pair. It is two rows that happen to be written near each other.&lt;/p&gt;

&lt;p&gt;I originally thought the database trigger was the center of the solution. Block update and delete. Done. That part exists:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;trigger&lt;/span&gt; &lt;span class="n"&gt;penalty_ledger_block_update&lt;/span&gt;
&lt;span class="k"&gt;before&lt;/span&gt; &lt;span class="k"&gt;update&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;penalty_ledger&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;each&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="k"&gt;execute&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;block_penalty_ledger_mutation&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;create&lt;/span&gt; &lt;span class="k"&gt;trigger&lt;/span&gt; &lt;span class="n"&gt;penalty_ledger_block_delete&lt;/span&gt;
&lt;span class="k"&gt;before&lt;/span&gt; &lt;span class="k"&gt;delete&lt;/span&gt; &lt;span class="k"&gt;on&lt;/span&gt; &lt;span class="n"&gt;penalty_ledger&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;each&lt;/span&gt; &lt;span class="k"&gt;row&lt;/span&gt; &lt;span class="k"&gt;execute&lt;/span&gt; &lt;span class="k"&gt;function&lt;/span&gt; &lt;span class="n"&gt;block_penalty_ledger_mutation&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
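&lt;p&gt;The trigger function itself is not shown above. A minimal &lt;code&gt;plpgsql&lt;/code&gt; sketch of what &lt;code&gt;block_penalty_ledger_mutation&lt;/code&gt; could look like, with the exception message as my own assumption:&lt;/p&gt;

```sql
-- Hypothetical body for the trigger function referenced above; the real
-- implementation is not shown in the post.
create or replace function block_penalty_ledger_mutation()
returns trigger
language plpgsql
as $$
begin
  raise exception 'penalty_ledger is append-only; write a compensating entry instead';
end;
$$;
```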



&lt;p&gt;But that only stops mutation after insertion. It does not stop a bad insert. A database that blocks updates can still store a malformed truth forever.&lt;/p&gt;

&lt;p&gt;That is where the F# domain layer earns its place.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Load-Bearing Function
&lt;/h2&gt;

&lt;p&gt;The most important code in the ledger path is not the insert statement. It is &lt;code&gt;LedgerPair.create&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight fsharp"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="n"&gt;validate&lt;/span&gt; &lt;span class="n"&gt;credit&lt;/span&gt; &lt;span class="n"&gt;mirror&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nn"&gt;Money&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cents&lt;/span&gt; &lt;span class="n"&gt;credit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Amount&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="nc"&gt;L&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nn"&gt;Money&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;cents&lt;/span&gt; &lt;span class="n"&gt;mirror&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Amount&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="nc"&gt;L&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
        &lt;span class="nc"&gt;Error&lt;/span&gt; &lt;span class="nc"&gt;LedgerAmountMustBePositive&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt;
        &lt;span class="n"&gt;credit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Direction&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;LedgerDirection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CreditOwedToUs&lt;/span&gt;
        &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;mirror&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Direction&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="nn"&gt;LedgerDirection&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Mirror&lt;/span&gt;
    &lt;span class="k"&gt;then&lt;/span&gt;
        &lt;span class="nc"&gt;Error&lt;/span&gt; &lt;span class="nc"&gt;LedgerPairDirectionInvalid&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;credit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;EntryKind&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;mirror&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;EntryKind&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
        &lt;span class="nc"&gt;Error&lt;/span&gt; &lt;span class="nc"&gt;LedgerPairKindInvalid&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;credit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Amount&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;mirror&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Amount&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
        &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LedgerPairMismatch&lt;/span&gt; &lt;span class="s2"&gt;"amount must match"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt;
        &lt;span class="n"&gt;credit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AccrualPeriodStart&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;mirror&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AccrualPeriodStart&lt;/span&gt;
        &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="n"&gt;credit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AccrualPeriodEnd&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;mirror&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AccrualPeriodEnd&lt;/span&gt;
    &lt;span class="k"&gt;then&lt;/span&gt;
        &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LedgerPairMismatch&lt;/span&gt; &lt;span class="s2"&gt;"period must match"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="k"&gt;not&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sameContext&lt;/span&gt; &lt;span class="n"&gt;credit&lt;/span&gt; &lt;span class="n"&gt;mirror&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
        &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LedgerPairMismatch&lt;/span&gt; &lt;span class="s2"&gt;"tenant contract counterparty clause breach context must match"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;credit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CompensatesLedgerId&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;mirror&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;CompensatesLedgerId&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
        &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;LedgerPairMismatch&lt;/span&gt; &lt;span class="s2"&gt;"compensating ledger reference must match"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;credit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AccrualPeriodEnd&lt;/span&gt; &lt;span class="p"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;credit&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AccrualPeriodStart&lt;/span&gt; &lt;span class="k"&gt;then&lt;/span&gt;
        &lt;span class="nc"&gt;Error&lt;/span&gt; &lt;span class="nc"&gt;PeriodInvalid&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;
        &lt;span class="nc"&gt;Ok&lt;/span&gt;&lt;span class="bp"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That function is deliberately narrow. It does not know about HTTP, Hangfire, PDF rendering, Invoice Recon, or even settlement grouping. It only knows what makes two candidate rows a valid accounting pair.&lt;/p&gt;

&lt;p&gt;The private &lt;code&gt;LedgerPair&lt;/code&gt; type matters. Callers cannot construct one directly. They can propose two &lt;code&gt;LedgerEntryCandidate&lt;/code&gt; records, but only the domain module can expose the pair. That means the application layer does not get to "remember" to validate. It has no valid object unless validation has already happened.&lt;/p&gt;

&lt;p&gt;This is the part I had wrong at first: I thought append-only was mainly a persistence rule. It is not. It is a construction rule. Once bad ledger rows are inserted, append-only protects the bad rows just as strongly as the good ones.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why The Rules Engine Stays Pure
&lt;/h2&gt;

&lt;p&gt;The penalty math also stays outside the database. &lt;code&gt;RulesEngine.calculatePenalty&lt;/code&gt; takes a &lt;code&gt;PenaltyCalculationInput&lt;/code&gt; and returns a result. No connection string. No repository. No clock except the explicit &lt;code&gt;AsOf&lt;/code&gt; value passed in.&lt;/p&gt;

&lt;p&gt;That was not aesthetic. The engine needs deterministic recompute. Given the same contract, clause, breach, prior accruals, and timestamp, it should return the same penalty. The snapshot test covers twelve cases: flat penalties, capped penalties, monthly fee proration, tier crossing, overflow tier behavior, currency mismatch, daily compounding cap, missing units, inactive clauses, and pre-contract breaches.&lt;/p&gt;

&lt;p&gt;The awkward case is previous accruals. Caps depend on what was already credited. Tiered penalties may need to accrue only the difference between the previous tier and the new one. Compounding daily penalties need to avoid adding the same days twice. So the rules engine receives &lt;code&gt;PreviousAccruals&lt;/code&gt;, filters to credit-side rows for the same clause, and calculates incremental money from that prior state.&lt;/p&gt;

&lt;p&gt;That looks like a database concern until it breaks. If the rules engine quietly queried the database itself, replay tests would become setup-heavy and timing-sensitive. By making prior state an input, the application layer owns retrieval and the domain layer owns the calculation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Reversal Without Mutation
&lt;/h2&gt;

&lt;p&gt;The reversal path is where the design either holds or collapses.&lt;/p&gt;

&lt;p&gt;When a supplier disputes a breach and wins, the system cannot update the original accrual to zero. It cannot delete the row. It cannot mark the row as false and pretend the old financial position never existed.&lt;/p&gt;

&lt;p&gt;It writes a reversal pair.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ReversalEngine.uncompensatedCreditAccruals&lt;/code&gt; first looks for accrual rows that do not already have a reversal pointing at them. Then &lt;code&gt;reversalCandidate&lt;/code&gt; copies the amount, period, tenant, contract, counterparty, clause, and breach from the original accrual, changes the entry kind to &lt;code&gt;Reversal&lt;/code&gt;, and sets &lt;code&gt;CompensatesLedgerId&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That little filter is important. Without it, the same breach could be reversed twice. Append-only would preserve both reversals, and the system would show the supplier owed less than zero. The code does not rely on a user avoiding the button. It filters uncompensated credit rows and writes every reversal inside one transaction with the status change.&lt;/p&gt;

&lt;p&gt;The design tradeoff is verbosity. A simple breach flow creates two rows on accrual. Accrual plus reversal creates four rows. A ledger explorer has to explain direction, entry kind, and compensation references. The UI has more work because the database refuses to simplify history for display.&lt;/p&gt;

&lt;p&gt;I accept that. Accounting systems should make the audit easy and the write path strict.&lt;/p&gt;

&lt;h2&gt;
  
  
  Settlement Is Not a Ledger Mutation
&lt;/h2&gt;

&lt;p&gt;Settlement introduced a second temptation: add &lt;code&gt;settlement_id&lt;/code&gt; to &lt;code&gt;penalty_ledger&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That would make queries simple. Find uncommitted rows where &lt;code&gt;settlement_id is null&lt;/code&gt;. Mark them when the PDF is built. Done.&lt;/p&gt;

&lt;p&gt;It would also puncture the ledger invariant. If building a settlement mutates a ledger row, the ledger is no longer an append-only record of penalty facts. It becomes a workflow table.&lt;/p&gt;

&lt;p&gt;The repo uses &lt;code&gt;settlement_ledger_entries&lt;/code&gt; instead. &lt;code&gt;SettlementsRepository.listUncommittedAccruals&lt;/code&gt; selects credit-side accruals for the period, excludes rows that already have active settlement membership, and excludes rows with reversals. Then &lt;code&gt;SettlementsRepository.insert&lt;/code&gt; writes the settlement row and membership rows in the same transaction.&lt;/p&gt;
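&lt;p&gt;The shape of that query is worth sketching. This is a hedged reconstruction of &lt;code&gt;listUncommittedAccruals&lt;/code&gt;, not the repo's SQL; every table and column name beyond &lt;code&gt;penalty_ledger&lt;/code&gt;, &lt;code&gt;settlement_ledger_entries&lt;/code&gt;, and the compensation reference is an assumption:&lt;/p&gt;

```sql
-- Sketch: credit-side accruals with no active settlement membership and no
-- reversal pointing at them. Column names are illustrative assumptions.
select pl.*
from penalty_ledger pl
where pl.direction = 'credit'
  and pl.entry_kind = 'accrual'
  -- plus the accrual period filter for the settlement window
  and not exists (
    select 1
    from settlement_ledger_entries sle
    join settlements s on s.id = sle.settlement_id
    where sle.penalty_ledger_id = pl.id
      and s.status = 'active'
  )
  and not exists (
    select 1
    from penalty_ledger rev
    where rev.compensates_ledger_id = pl.id
  );
```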

&lt;p&gt;That means settlement membership is append-like metadata around the ledger, not a change to the ledger event itself.&lt;/p&gt;

&lt;p&gt;The cost is an extra table and a more careful query. The gain is that a settlement can be cancelled or released without rewriting what the penalty ledger said at the time of accrual.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Part That Still Needs Pressure
&lt;/h2&gt;

&lt;p&gt;The full suite passes: 12 domain tests, 6 data tests, 30 application tests, 10 API tests, and 3 UI tests. The dashboard load audit has tooling for &lt;code&gt;10000&lt;/code&gt; breaches and &lt;code&gt;5000&lt;/code&gt; settlements. The fake staging flow proves a Contract Lifecycle event can become an accrual, settlement, Invoice Recon outbox message, and &lt;code&gt;settlement.posted&lt;/code&gt; hub event in the local harness.&lt;/p&gt;

&lt;p&gt;But the PRD still has open success criteria around staged NATS-to-ledger time, Invoice Recon posting p95, and million-row ledger query behavior. That is the honest limit of the current proof. The invariants are strong. The operational envelope still needs live pressure.&lt;/p&gt;

&lt;p&gt;What surprised me is how much of the system exists to protect against its own future convenience. Every obvious shortcut would make one screen or one query easier. Single-row ledger entries. Updating rows for reversals. Marking ledger rows as settled. Reading tenant identity live during PDF posting.&lt;/p&gt;

&lt;p&gt;Each shortcut is fine until the first external audit, supplier dispute, or tenant rename.&lt;/p&gt;

&lt;p&gt;The lesson I took from this build is narrow: append-only is not a database trigger. It is a system-wide refusal to let later workflow needs rewrite earlier financial facts.&lt;/p&gt;

</description>
      <category>fsharp</category>
      <category>ledgerdesign</category>
      <category>domainmodeling</category>
      <category>postgres</category>
    </item>
    <item>
      <title>Why I Refused to Re-Query the Tenant Row at Alert Dispatch Time</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Sun, 26 Apr 2026 17:37:42 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/why-i-refused-to-re-query-the-tenant-row-at-alert-dispatch-time-48m2</link>
      <guid>https://dev.to/kingsleyonoh/why-i-refused-to-re-query-the-tenant-row-at-alert-dispatch-time-48m2</guid>
      <description>&lt;p&gt;What does an audit-grade risk alert look like six months after the band crossing it describes, when the tenant has renamed itself twice in the interim, when the vendor has been merged into another, when the scoring rule has been retuned three times since? In a system that re-queries the source rows at dispatch time, the answer is whatever is true now. In this system, the answer is whatever was true at 14:03 on the Tuesday the alert was written.&lt;/p&gt;

&lt;p&gt;The concrete shape of the problem: a risk alert is created at 14:03 Tuesday for a vendor crossing from medium to high. The Notification Hub is down for emergency Postgres maintenance. The dispatcher falls back to the failed-alert retry queue and tries again every 30 minutes. By Friday afternoon, when the Hub is back, the email body must still read &lt;code&gt;Acme GmbH&lt;/code&gt;, the legal name that was current Tuesday, not the rebrand to &lt;code&gt;Acme Industrial GmbH&lt;/code&gt; that the operator entered through the settings UI on Wednesday morning. A live re-query of the &lt;code&gt;tenants&lt;/code&gt; row at send time would emit the new name and attribute the event to a company name that did not exist when the event happened.&lt;/p&gt;

&lt;p&gt;That's the constraint this whole architecture is designed around. Not the failure mode I worried about most. The one I knew would happen and didn't have a clean answer to.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Problem
&lt;/h2&gt;

&lt;p&gt;Most engineers I've talked to assume the issue is performance. They look at a table with a JSONB column called &lt;code&gt;delivery_payload&lt;/code&gt; holding a fully-rendered tenant snapshot, and they ask why I'd denormalize. The &lt;code&gt;tenants&lt;/code&gt; row has 11 identity columns. Why pickle them into JSON when I could just SELECT them again at dispatch time?&lt;/p&gt;

&lt;p&gt;Performance has nothing to do with it. The dispatcher is fast either way. What matters is that an alert is a fact about what was true at a moment in time, and a re-query produces a fact about what is true now. Those two facts are not interchangeable. An auditor reading the alert ledger six months later wants to know what the operator saw on Tuesday afternoon, not what the system shows today.&lt;/p&gt;

&lt;p&gt;Same problem on the report side. A vendor scorecard PDF generated in March needs to reprint byte-for-byte (or close enough to it) when an auditor downloads it again in September. If the underlying tenant or vendor row has been mutated in between, a live re-query produces a different document. The PDF that was approved by procurement leadership in March no longer exists on disk. It exists in the operator's email and nowhere else.&lt;/p&gt;

&lt;p&gt;The mistake is one of category. The system was treating mutable rows as the source of truth for an immutable record. SQL was doing exactly what SQL does: returning the current value of the column. The error was in the architecture asking the question that way at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Constraints
&lt;/h2&gt;

&lt;p&gt;Three things made this hard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tenant identity is mutable by design.&lt;/strong&gt; A CPO can update their legal name, registration number, address, and brand colors at any time through the settings UI. We don't lock fields after creation; that would be operationally hostile to the business reality of mergers and rebrands. So the row is the live identity, full stop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vendors get merged.&lt;/strong&gt; When the operator confirms that two &lt;code&gt;vendor_aliases&lt;/code&gt; rows point at the same legal entity, the system collapses them and the surviving vendor inherits the signals from the merged one. If an alert fires referring to "Vendor A" and that vendor gets merged into "Vendor B" three weeks later, a re-query at dispatch time would emit "Vendor B" in the email body. That isn't just confusing; it's wrong. The alert was about A.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retries can run for days.&lt;/strong&gt; The notification hub has a circuit breaker (5-failure rolling window, 60-second cooldown). If the Hub is genuinely unhealthy for a sustained period, the failed-alert retry job picks alerts back up every 30 minutes and tries again. There is no upper bound. I have seen alerts retry across 72-hour outages. Anything the dispatcher reads at send time can have moved arbitrarily far from where it was at create time.&lt;/p&gt;
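&lt;p&gt;A breaker of that shape is easy to sketch. This is an illustration with an injected clock for testability, not the Hub client's real code; the 5-failure window and 60-second cooldown come from the description above, everything else is assumption:&lt;/p&gt;

```ruby
# Hypothetical sketch of the 5-failure / 60-second circuit breaker described
# above. Names and structure are assumptions, not the Hub client's real code.
class CircuitBreaker
  def initialize(threshold: 5, cooldown_seconds: 60, clock: -> { Time.now.to_f })
    @threshold = threshold
    @cooldown = cooldown_seconds
    @clock = clock
    @failures = []   # timestamps of recent failures (rolling window)
    @opened_at = nil
  end

  def record_failure
    now = @clock.call
    @failures.push(now)
    # keep only failures inside the rolling window
    @failures.reject! { |ts| now - ts > @cooldown }
    @opened_at = now if @failures.size >= @threshold
  end

  def record_success
    @failures.clear
    @opened_at = nil
  end

  # Dispatch is allowed when the breaker is closed or the cooldown has elapsed.
  def allow_request?
    return true if @opened_at.nil?
    @clock.call - @opened_at >= @cooldown
  end
end
```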

&lt;p&gt;The obvious moves were all bad. Locking the tenant row after first alert? Operationally hostile. Versioning every field on the tenant table with effective-date ranges? A second source of truth and a permanent maintenance burden. Re-creating the alert if the tenant changed? You can't, because the alert is a record of an event that already happened.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Design
&lt;/h2&gt;

&lt;p&gt;The architecture has two halves: storage immutability and rendering immutability. Both are necessary. Either alone is insufficient.&lt;/p&gt;

&lt;p&gt;On the storage side, every alert carries a &lt;code&gt;delivery_payload&lt;/code&gt; JSONB column populated at insertion. The capture path lives in &lt;code&gt;lib/alerts/capture_payload.rb&lt;/code&gt; and is called from the alert dispatcher exactly once, at the moment the &lt;code&gt;risk_alerts&lt;/code&gt; row is INSERTed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# lib/alerts/capture_payload.rb&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vendor_score&lt;/span&gt;&lt;span class="p"&gt;:)&lt;/span&gt;
  &lt;span class="n"&gt;tenant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vendor_score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tenant&lt;/span&gt;
  &lt;span class="n"&gt;vendor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vendor_score&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;vendor&lt;/span&gt;
  &lt;span class="n"&gt;tenant_snapshot&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="no"&gt;Tenants&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="no"&gt;CaptureSnapshot&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tenant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="ss"&gt;event_type: &lt;/span&gt;&lt;span class="n"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="ss"&gt;tenant: &lt;/span&gt;&lt;span class="n"&gt;tenant_snapshot&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="ss"&gt;vendor: &lt;/span&gt;&lt;span class="n"&gt;vendor_block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vendor&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tenant_snapshot&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="ss"&gt;score: &lt;/span&gt;&lt;span class="n"&gt;score_block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vendor_score&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;previous_composite&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;previous_band&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;direction&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="ss"&gt;top_contributors: &lt;/span&gt;&lt;span class="n"&gt;top_contributors_block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vendor_score&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="ss"&gt;deep_links: &lt;/span&gt;&lt;span class="n"&gt;deep_links_block&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;vendor&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="ss"&gt;created_at: &lt;/span&gt;&lt;span class="no"&gt;Time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iso8601&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;

  &lt;span class="n"&gt;deep_freeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;Tenants::CaptureSnapshot&lt;/code&gt; is the single canonical builder for the tenant identity block. Eleven columns from the &lt;code&gt;tenants&lt;/code&gt; row plus a &lt;code&gt;snapshot_at&lt;/code&gt; timestamp, returned as a frozen Hash. The shape is locked. Adding a column to &lt;code&gt;tenants&lt;/code&gt; does not automatically add it to the snapshot. Every addition is a deliberate change to the snapshot, the templates that bind to it, and the test fixtures.&lt;/p&gt;

&lt;p&gt;That's the first half. The dispatcher (&lt;code&gt;Alerts::HubDispatchJob&lt;/code&gt;) then has exactly one rule: read &lt;code&gt;alert.delivery_payload&lt;/code&gt; and never query &lt;code&gt;tenants&lt;/code&gt;, &lt;code&gt;vendors&lt;/code&gt;, or &lt;code&gt;vendor_scores&lt;/code&gt; again. The job's class doc is a paragraph explaining this. The integration test fires an alert, mutates the underlying tenant row, then runs the dispatcher and asserts the emitted Hub event still contains the original literal values. Without that test, a casual refactor could re-introduce a &lt;code&gt;Tenant.find(alert.tenant_id)&lt;/code&gt; call in the dispatcher, and the regression would only show up in production after the first sustained Hub outage.&lt;/p&gt;

&lt;p&gt;The second half is in-memory immutability. A frozen top-level Hash doesn't prevent a careless caller from mutating a nested Hash:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This still works on a top-level-frozen Hash:&lt;/span&gt;
&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="ss"&gt;:tenant&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="ss"&gt;:legal_name&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;"Sneaky GmbH"&lt;/span&gt;
&lt;span class="c1"&gt;# raises FrozenError only if every nested level is also frozen&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So the capture path walks the structure recursively:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight ruby"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;deep_freeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="n"&gt;obj&lt;/span&gt;
  &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="no"&gt;Hash&lt;/span&gt;
    &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each_value&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;deep_freeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;freeze&lt;/span&gt;
  &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="no"&gt;Array&lt;/span&gt;
    &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;each&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;deep_freeze&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;freeze&lt;/span&gt;
  &lt;span class="k"&gt;when&lt;/span&gt; &lt;span class="no"&gt;String&lt;/span&gt;
    &lt;span class="n"&gt;obj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;freeze&lt;/span&gt;
  &lt;span class="k"&gt;else&lt;/span&gt;
    &lt;span class="n"&gt;obj&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A Sidekiq worker that loads the JSONB column gets a fresh Ruby Hash from &lt;code&gt;JSON.parse&lt;/code&gt;, which is mutable, but the in-memory structure built by &lt;code&gt;CapturePayload.call&lt;/code&gt; is physically immutable at every level the moment it leaves the method. That removes a class of bugs where a logging middleware or a serializer transformation accidentally mutates the snapshot in flight before it's emitted to the Hub.&lt;/p&gt;

&lt;p&gt;The rendering side is the same idea applied to templates. Every Hub Liquid template registers with &lt;code&gt;strict_variables: true&lt;/code&gt;. Every report ERB template uses a small helper called &lt;code&gt;Reports::StrictFetch&lt;/code&gt; that walks dotted paths against the captured render context and raises &lt;code&gt;StrictFetchError&lt;/code&gt; on any unresolved segment. A template that references &lt;code&gt;{{ tenant.legal_nme }}&lt;/code&gt; (typo) doesn't silently emit an empty string. It raises during the test, before anything ships.&lt;/p&gt;

&lt;p&gt;The CI gate that locks all of this is in &lt;code&gt;test/integration/report_template_lint_test.rb&lt;/code&gt;. It renders every report template against captured render contexts for two distinct tenants, then parses each template for &lt;code&gt;f("…")&lt;/code&gt; calls without a &lt;code&gt;default:&lt;/code&gt; argument to enumerate the mandatory token set. The test will fail loudly if anyone adds a new template token without extending the snapshot, or adds a field to the snapshot but forgets to update a downstream template.&lt;/p&gt;

&lt;p&gt;This turned into the cleverest part of the suite by accident. After the first version of the test passed, I noticed it would also pass if I weakened &lt;code&gt;StrictFetch&lt;/code&gt; to silently return &lt;code&gt;nil&lt;/code&gt; for missing paths. So I added a deliberate-failure regression test: a synthetic broken template that references &lt;code&gt;{{ tenant.this_field_does_not_exist_anywhere }}&lt;/code&gt;. If anyone weakens the strict-fetch contract, that meta-test fails. Without a test of the test, the gate itself could rot.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Surprised Me
&lt;/h2&gt;

&lt;p&gt;The byte-identical re-render test for the four report types did not work the first time, and the failure mode taught me something I would not have figured out from documentation.&lt;/p&gt;

&lt;p&gt;I picked the legal-footer line as my canary literal: &lt;code&gt;Reg: HRB-123456 · Tax: DE987654321 · Berlin, DE&lt;/code&gt;. The PDF visually rendered correctly. But &lt;code&gt;pdf-reader&lt;/code&gt; text extraction returned &lt;code&gt;nil&lt;/code&gt; for that line. After two hours of debugging, I figured out that wkhtmltopdf encodes the &lt;code&gt;&amp;amp;middot;&lt;/code&gt; HTML entity as a non-standard glyph escape that pdf-reader's content-stream decoder doesn't recognize, and the entire PDF text object containing that escape gets dropped from the extraction.&lt;/p&gt;

&lt;p&gt;The fix was to pick canary literals from lines that don't contain &lt;code&gt;·&lt;/code&gt; separators (header &lt;code&gt;legal_name&lt;/code&gt;, address &lt;code&gt;line1&lt;/code&gt;, contact email are all safe). But the lesson was the deeper one: byte-identical PDF re-rendering is impossible because wkhtmltopdf embeds creation timestamps and random object IDs in every render. CSV outputs are bytewise equal across renders; PDFs are content-equal via text extraction. I had to pick the right level of the immutability claim.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result
&lt;/h2&gt;

&lt;p&gt;A 30-day audit-reprint test runs on every CI build. It captures a render context, generates the PDF and CSV, runs &lt;code&gt;Timecop.travel(30.days)&lt;/code&gt;, mutates the underlying &lt;code&gt;tenants&lt;/code&gt; row, and re-renders. CSV output is exactly byte-equal. PDF text-extracted output contains the original tenant literals (&lt;code&gt;Acme GmbH&lt;/code&gt;, &lt;code&gt;Hauptstraße 10&lt;/code&gt;, &lt;code&gt;procurement@acme-gmbh.example&lt;/code&gt;) and not the new ones. The same gate runs separately for the alert side: the dispatcher integration test fires an alert, mutates the tenant, runs the dispatcher, and asserts the emitted Hub event contains the pre-mutation legal name.&lt;/p&gt;

&lt;p&gt;The architecture is denormalized. Every alert duplicates the tenant identity. Every report context duplicates the entire scoring snapshot. Storage cost is real but small (typical alert payload is ~3KB; a year of band-crossing alerts at 50 vendors per tenant is well under 100MB per tenant). The cost is paid once at write time and never paid again at read time.&lt;/p&gt;

&lt;p&gt;The takeaway, if there is one: a snapshot has a different job from a normalized table. The normalized rows are operational state, useful for the live dashboard and the next score recompute, and entirely correct to mutate when the business changes. The frozen JSONB is the legal record. It's what you point at six months later when an auditor asks you to prove what the operator saw on a Tuesday afternoon in April.&lt;/p&gt;

</description>
      <category>ruby</category>
      <category>rails</category>
      <category>audittrails</category>
      <category>jsonb</category>
    </item>
    <item>
      <title>The Per-Tenant Rate Limit That Wasn't Per-Tenant</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Sun, 26 Apr 2026 17:33:43 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/the-per-tenant-rate-limit-that-wasnt-per-tenant-175b</link>
      <guid>https://dev.to/kingsleyonoh/the-per-tenant-rate-limit-that-wasnt-per-tenant-175b</guid>
      <description>&lt;p&gt;Tenant A blocked correctly at request 11. Tenant B also blocked at request 11.&lt;/p&gt;

&lt;p&gt;Phase 7 added a rate limit on &lt;code&gt;POST /api/events&lt;/code&gt; that reads from &lt;code&gt;tenants.config.rate_limits.events_per_minute&lt;/code&gt;. The default cap is 200 requests per minute. A tenant who needs more can be raised to 1000 via an admin &lt;code&gt;PATCH&lt;/code&gt;. The integration test seeded two tenants (tenant A capped at 10, tenant B at 100) and asserted that tenant A got blocked at request 11 while tenant B kept going. Tenant B was supposed to have plenty of headroom left.&lt;/p&gt;

&lt;p&gt;The resolver function was fine. The unit tests on &lt;code&gt;resolveTenantEventsRateLimit()&lt;/code&gt; all passed. Pass a tenant config with no override, get 200. Pass a tenant with &lt;code&gt;events_per_minute: 100&lt;/code&gt;, get 100. Pass &lt;code&gt;null&lt;/code&gt;, get 200. Pass anything over 1000, get clamped to 1000. Five tests, all green. The function did exactly what it was supposed to do.&lt;/p&gt;
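
&lt;p&gt;A minimal sketch of a resolver with that contract, reconstructed from the behavior described above (the real implementation lives in the codebase and may differ in detail):&lt;/p&gt;

```typescript
// Hypothetical reconstruction of the resolver contract described above.
interface TenantConfig {
  rate_limits?: { events_per_minute?: number };
}

const DEFAULT_CAP = 200;
const HARD_CAP = 1000;

function resolveTenantEventsRateLimit(config: TenantConfig | null): number {
  const override = config?.rate_limits?.events_per_minute;
  if (typeof override !== "number" || !Number.isFinite(override)) {
    return DEFAULT_CAP; // null config or missing override falls back to the default
  }
  // Defensive clamp: a corrupted JSONB value cannot raise the cap past 1000.
  return Math.min(override, HARD_CAP);
}
```

&lt;p&gt;Each of the unit-test cases above maps onto one branch of a function shaped like this.&lt;/p&gt;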

&lt;p&gt;The function was being called with &lt;code&gt;null&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the Resolver Was Called With Null
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;@fastify/rate-limit&lt;/code&gt; accepts a &lt;code&gt;max&lt;/code&gt; option that can be a function. The function receives the request and returns the cap to apply. Putting the resolver behind that callback was the design: each request asks "what is this tenant's limit?" at the moment the rate check runs, so changing a tenant's config takes effect immediately without a process restart.&lt;/p&gt;

&lt;p&gt;For the resolver to do its job, it needs &lt;code&gt;request.tenant&lt;/code&gt; already populated. That happens in &lt;code&gt;authPlugin&lt;/code&gt;, which reads the &lt;code&gt;X-API-Key&lt;/code&gt; header, looks up the tenant row, and attaches the tenant config to the request. The auth plugin runs in the &lt;code&gt;onRequest&lt;/code&gt; lifecycle hook.&lt;/p&gt;

&lt;p&gt;By default, &lt;code&gt;@fastify/rate-limit&lt;/code&gt; also runs in &lt;code&gt;onRequest&lt;/code&gt;. Fastify fires hooks in registration order within the same lifecycle stage. The rate-limit plugin was registered before the route was registered, and the route's auth runs as part of the route's plugin chain. The two &lt;code&gt;onRequest&lt;/code&gt; hooks fired in an order that meant rate-limit's &lt;code&gt;max&lt;/code&gt; callback ran first, with &lt;code&gt;request.tenant&lt;/code&gt; still undefined.&lt;/p&gt;
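
&lt;p&gt;The ordering problem can be reproduced without Fastify at all. Here is a toy model of a single lifecycle stage whose hooks fire in registration order (illustrative only, not the framework's real internals):&lt;/p&gt;

```typescript
// Toy model: hooks in the same lifecycle stage fire in registration order.
interface Req {
  tenant?: { cap: number };
}
interface Hook {
  (req: Req): void;
}

const onRequest: Hook[] = [];
const observedCaps: number[] = [];

// Registered first: the rate limiter's max callback.
onRequest.push(function (req) {
  observedCaps.push(req.tenant?.cap ?? 200); // tenant not populated yet
});

// Registered second: auth, which attaches the tenant config.
onRequest.push(function (req) {
  req.tenant = { cap: 100 };
});

const req: Req = {};
for (const hook of onRequest) {
  hook(req);
}
// The limiter resolved the global default 200, not the tenant's 100.
```

&lt;p&gt;Both hooks do exactly what they were written to do. The bug lives entirely in the ordering.&lt;/p&gt;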

&lt;p&gt;The resolver, faithful to its contract, returned the default 200 for every request because every request looked like a tenant with no config override. Tenant A with the default and tenant B with 100 both got the same global cap. The endpoint behaved exactly as if the per-tenant feature didn't exist, but with no error and no warning, because both halves of the system were doing what they were told.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Moving Auth Wasn't an Option
&lt;/h2&gt;

&lt;p&gt;The fix could not move auth. &lt;code&gt;authPlugin&lt;/code&gt; runs in &lt;code&gt;onRequest&lt;/code&gt; for every protected route in the entire API, and most of those routes do not need rate limiting. Pushing auth to &lt;code&gt;preHandler&lt;/code&gt; to accommodate one route changes the lifecycle for every route, bending a documented contract around one accidental dependency. That is the sort of fix that creates three new bugs to make one go away.&lt;/p&gt;

&lt;p&gt;The fix also could not run rate-limit globally with a different hook. The rate-limit plugin can be registered globally with a custom hook stage, but that overrides the stage for all rate-limited routes. Every other rate-limited route in the codebase, including the admin endpoints with their own static caps, would be affected by a change made for &lt;code&gt;events.routes.ts&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;What was needed was a per-route override that moved only this rate check to a later hook stage, leaving the rest of the system on defaults.&lt;/p&gt;

&lt;h2&gt;
  
  
  hook: 'preHandler' Plus a Real keyGenerator
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;@fastify/rate-limit&lt;/code&gt; exposes a per-route option called &lt;code&gt;hook&lt;/code&gt; inside the route's &lt;code&gt;config.rateLimit&lt;/code&gt; object. Setting it to &lt;code&gt;'preHandler'&lt;/code&gt; means this specific route's rate check fires in the &lt;code&gt;preHandler&lt;/code&gt; stage instead of &lt;code&gt;onRequest&lt;/code&gt;. Auth has already run by then. &lt;code&gt;request.tenant&lt;/code&gt; is populated. The resolver gets a real tenant config and returns the right cap.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;rateLimit&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;hook&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;preHandler&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;max&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;resolveTenantEventsRateLimit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tenant&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;keyGenerator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tenantId&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ip&lt;/span&gt; &lt;span class="o"&gt;??&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;anonymous&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;timeWindow&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;1 minute&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;},&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things in that block matter, and the &lt;code&gt;hook&lt;/code&gt; line is only one of them.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;keyGenerator&lt;/code&gt; is the second multi-tenant fix. The default key generator is the request IP. In a single-tenant deployment that is fine. In a multi-tenant deployment behind a shared NAT (two tenants whose servers happen to live in the same cloud region, two CI pipelines running tests behind the same egress address) the bucket gets shared. Tenant A burns through tenant B's allowance. The cap stops being per-tenant in a different way: the resolver returns the right number, but the bucket it counts against is wrong.&lt;/p&gt;

&lt;p&gt;Setting &lt;code&gt;keyGenerator&lt;/code&gt; to &lt;code&gt;request.tenantId&lt;/code&gt; partitions the bucket per authenticated tenant. The fallback to &lt;code&gt;request.ip&lt;/code&gt; covers the unauthenticated case (which the route does not allow, but defensive code is cheap and explicit defaults beat surprising ones).&lt;/p&gt;
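
&lt;p&gt;The bucket-sharing failure is easy to sketch with an in-memory counter (a hypothetical model, not &lt;code&gt;@fastify/rate-limit&lt;/code&gt; internals). Two tenants behind one NAT share an egress IP; keying by IP merges their allowances, keying by tenant id separates them:&lt;/p&gt;

```typescript
interface Hit {
  tenantId: string;
  ip: string;
}
interface KeyFn {
  (hit: Hit): string;
}

// Two tenants whose egress traffic shares one NAT address.
const hits: Hit[] = [
  { tenantId: "tenant-a", ip: "203.0.113.7" },
  { tenantId: "tenant-b", ip: "203.0.113.7" },
  { tenantId: "tenant-a", ip: "203.0.113.7" },
];

function countBuckets(keyOf: KeyFn): { [key: string]: number } {
  const buckets: { [key: string]: number } = {};
  for (const hit of hits) {
    const key = keyOf(hit);
    buckets[key] = (buckets[key] ?? 0) + 1;
  }
  return buckets;
}

const byIp = countBuckets(function (hit) { return hit.ip; });
const byTenant = countBuckets(function (hit) { return hit.tenantId; });
// byIp has one bucket holding all three hits; byTenant has two buckets.
```

&lt;p&gt;With the IP key, tenant B's two requests land in tenant A's bucket; with the tenant key, each tenant spends only its own allowance.&lt;/p&gt;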

&lt;p&gt;The third thing is the resolver clamp at 1000. The admin route validates input at 1-1000, but a tenant whose config gets corrupted, or an old tenant from before the validation existed, could in principle have a number out of range stored in their JSONB config. The resolver caps the return value defensively. A misconfigured tenant cannot DoS the platform by setting &lt;code&gt;events_per_minute: 999999&lt;/code&gt; directly in the database.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why a Single-Tenant Test Would Have Shipped This
&lt;/h2&gt;

&lt;p&gt;The hook-ordering issue was not in the documentation for &lt;code&gt;@fastify/rate-limit&lt;/code&gt; because it is not a bug in the plugin. The plugin defaults to &lt;code&gt;onRequest&lt;/code&gt; because that is the right stage for rate limiting in 99% of cases. You want to reject over-limit requests as early as possible, before any work happens. The Hub's case is the 1% where the rate check depends on data that is only available after another hook has run. The plugin gives you the escape hatch, but you only know to use it once you have hit the problem.&lt;/p&gt;

&lt;p&gt;The integration test that caught this was not a sophisticated test. It seeded two tenants, fired eleven requests for each, and asserted on the response codes. An earlier draft only seeded tenant A with a cap of 10 and asserted it got rate-limited at request 11. Tenant A was the only tenant. The test passed. I added tenant B with cap 100 specifically to verify that the per-tenant bucket worked, and the test failed with both tenants getting blocked.&lt;/p&gt;

&lt;p&gt;If I had only tested one tenant, the rate limit would have shipped looking like it worked. Every tenant on the platform would silently be on the same global cap, and the only way to discover it would be a tenant raising a support ticket about being blocked at unexpected request counts. That is the kind of bug that runs in production for months because nothing visibly breaks.&lt;/p&gt;

&lt;p&gt;The test fix was as simple as the production fix. Changing &lt;code&gt;hook: 'preHandler'&lt;/code&gt; and &lt;code&gt;keyGenerator: req.tenantId&lt;/code&gt; made the assertion pass. Total LOC change: two lines. Total time spent debugging from "why does tenant B also block" to "Fastify hook ordering": about an hour, most of it spent printing &lt;code&gt;request.tenant&lt;/code&gt; from inside the &lt;code&gt;max&lt;/code&gt; callback and confirming it was &lt;code&gt;undefined&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Else the Patch Quietly Got Right
&lt;/h2&gt;

&lt;p&gt;H7 ships with per-tenant rate limiting that actually scopes per tenant. Tenant default is 200 requests per minute, raised to a configurable value via &lt;code&gt;PATCH /api/admin/tenants/:id/rate-limit&lt;/code&gt;, capped defensively at 1000. The admin route uses spread-merge so updating &lt;code&gt;rate_limits.events_per_minute&lt;/code&gt; preserves the rest of &lt;code&gt;tenants.config&lt;/code&gt; (the channel credentials, the dedup window, the sandbox flag) without overwriting them. The PATCH was simple to write. Forgetting to use spread-merge would have been a different production incident: an admin update silently wiping every tenant's Resend API key.&lt;/p&gt;
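
&lt;p&gt;The spread-merge contract is worth spelling out, because the failure mode is silent. A sketch with illustrative key names (the real config schema differs):&lt;/p&gt;

```typescript
// Illustrative tenant config; key names are assumptions, not the real schema.
const existing = {
  resend_api_key: "re_live_abc123",
  dedup_window_seconds: 300,
  sandbox: false,
  rate_limits: { events_per_minute: 200 },
};

// Wholesale replacement: everything except rate_limits is wiped.
const replaced = { rate_limits: { events_per_minute: 500 } };

// Spread-merge: sibling keys survive, including siblings inside rate_limits.
const merged = {
  ...existing,
  rate_limits: { ...existing.rate_limits, events_per_minute: 500 },
};
```

&lt;p&gt;Both shapes are one line in a &lt;code&gt;PATCH&lt;/code&gt; handler; only one of them preserves the rest of the tenant's config.&lt;/p&gt;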

&lt;p&gt;Test coverage went beyond the spec. The resolver got five tests covering null config, missing override, explicit override, cap-at-1000, and the default. The integration suite got five more covering the per-tenant bucket, admin happy path, range validation at both ends, 404 on missing tenant, and config preservation across the update. The admin tests assert that channel credentials and dedup_window survive the PATCH untouched. The spread-merge contract is the thing that has to keep working, not just on the day it was written, but on every future change to the admin route.&lt;/p&gt;

&lt;p&gt;The hook-ordering trick is now in the project's pattern 004 (per-route rate limit) as a note for the next route that needs a dynamic per-tenant cap. It will not be the last one. The same pattern will apply to per-tenant request limits on the suppressions endpoints, the templates endpoints, and any future endpoint where the cap depends on tenant config.&lt;/p&gt;

&lt;p&gt;The fix is two lines. The understanding it required is the lifecycle of every plugin in the request chain and the order in which Fastify will call them. That is the kind of detail that does not show up in a unit test for a resolver function.&lt;/p&gt;

</description>
      <category>fastify</category>
      <category>ratelimiting</category>
      <category>multitenant</category>
      <category>plugins</category>
    </item>
    <item>
      <title>The Cursor Pagination That Worked Until Five Rows Shared a Timestamp</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Sun, 26 Apr 2026 17:27:55 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/the-cursor-pagination-that-worked-until-five-rows-shared-a-timestamp-47p3</link>
      <guid>https://dev.to/kingsleyonoh/the-cursor-pagination-that-worked-until-five-rows-shared-a-timestamp-47p3</guid>
      <description>&lt;p&gt;It returned three.&lt;/p&gt;

&lt;p&gt;The integration test had seeded five suppression rows in &lt;code&gt;beforeAll&lt;/code&gt;, called &lt;code&gt;GET /api/suppressions?limit=2&lt;/code&gt;, and walked the cursors forward expecting all five back. The first page returned two. The second page returned one. The third page came back empty.&lt;/p&gt;

&lt;p&gt;The suppression list endpoint is a paginated &lt;code&gt;GET&lt;/code&gt;. Tenants call it to see who they have manually blocked, who Resend marked as a hard bounce, and who complained about an email. The list grows over time and needs ordering by recency. Standard cursor pagination: order by &lt;code&gt;(created_at DESC, id DESC)&lt;/code&gt;, encode the last-seen tuple into a base64 cursor, and the next page asks for everything strictly less than that tuple. I had this pattern working on the notifications endpoint and wrote the suppressions version against the same template. Drizzle, Postgres, tuple comparison in SQL.&lt;/p&gt;
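
&lt;p&gt;For concreteness, here is a cursor codec in the shape just described, reconstructed from that description (the real one differs in detail). Note the &lt;code&gt;toISOString()&lt;/code&gt; call, which matters later:&lt;/p&gt;

```typescript
// Naive cursor codec: encodes the last-seen (created_at, id) tuple as base64.
// toISOString() carries exactly three fractional digits, i.e. milliseconds.
interface CursorRow {
  createdAt: Date;
  id: string;
}

function encodeCursor(row: CursorRow): string {
  const tuple = { createdAt: row.createdAt.toISOString(), id: row.id };
  return Buffer.from(JSON.stringify(tuple), "utf8").toString("base64url");
}

function decodeCursor(cursor: string): { createdAt: string; id: string } {
  return JSON.parse(Buffer.from(cursor, "base64url").toString("utf8"));
}
```

&lt;p&gt;The round trip is lossless for everything except the timestamp's sub-millisecond digits, which are already gone before the cursor is even built.&lt;/p&gt;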

&lt;h2&gt;
  
  
  Where the Microseconds Disappeared
&lt;/h2&gt;

&lt;p&gt;The seed used a single &lt;code&gt;db.insert(...).values([row1, row2, row3, row4, row5])&lt;/code&gt; call. Postgres executes that as one statement in one transaction, and the &lt;code&gt;created_at&lt;/code&gt; default is evaluated once for all five rows (&lt;code&gt;now()&lt;/code&gt; is pinned to the transaction). From the database's point of view, the five rows are not just in the same second, not just in the same millisecond. They share a timestamp at the full six decimal places of precision Postgres stores natively.&lt;/p&gt;

&lt;p&gt;The cursor encoding called &lt;code&gt;row.createdAt.toISOString()&lt;/code&gt;. That returns ISO 8601 with millisecond precision: &lt;code&gt;2026-04-26T14:32:18.473Z&lt;/code&gt;. Three fractional digits, and the microsecond information is gone. When the next-page query compared the column value to the cursor value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s1"&gt;'2026-04-26T14:32:18.473'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nb"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'last-id-here'&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Postgres compared the column's microsecond-precise value (&lt;code&gt;14:32:18.473251&lt;/code&gt;) against the cursor's millisecond-precise value (&lt;code&gt;14:32:18.473000&lt;/code&gt;). The column value was strictly greater. The tuple comparison returned false. Rows that should have appeared on the next page were filtered out as already-seen.&lt;/p&gt;

&lt;p&gt;The first page returned the two newest rows correctly. The second page used the second row's millisecond-truncated timestamp as the cursor. The query asked for rows strictly less than that timestamp. The remaining three rows had timestamps strictly greater at the microsecond level, so they did not match. The endpoint silently returned an empty result. The test reported missing rows. Nothing in the logs hinted at why.&lt;/p&gt;
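
&lt;p&gt;The whole failure fits in a few lines, using integer microsecond values to sidestep JavaScript's millisecond-precision &lt;code&gt;Date&lt;/code&gt; (an illustrative model of the filter, not the real query path):&lt;/p&gt;

```typescript
interface PageRow {
  micros: number; // created_at as microseconds since epoch
  id: number;
}

// Five rows from one batch insert: identical timestamps, distinct ids,
// already in (created_at DESC, id DESC) order.
const all: PageRow[] = [5, 4, 3, 2, 1].map(function (id) {
  return { micros: 1745673138473251, id };
});

// Tuple comparison (row.micros, row.id) strictly below (ts, id), written in
// sign form to keep the sketch purely numeric.
function tupleLess(row: PageRow, ts: number, id: number): boolean {
  if (row.micros !== ts) {
    return Math.sign(row.micros - ts) === -1;
  }
  return Math.sign(row.id - id) === -1;
}

function walkPages(truncateCursor: boolean): number {
  let cursor: { ts: number; id: number } | null = null;
  let fetched = 0;
  for (;;) {
    const c = cursor;
    const page = all
      .filter(function (r) { return c === null ? true : tupleLess(r, c.ts, c.id); })
      .slice(0, 2);
    if (page.length === 0) {
      return fetched;
    }
    fetched += page.length;
    const last = page[page.length - 1];
    // A millisecond cursor floors away the last three digits, like toISOString().
    const ts = truncateCursor ? Math.floor(last.micros / 1000) * 1000 : last.micros;
    cursor = { ts, id: last.id };
  }
}
// walkPages(true) stops after the first page; walkPages(false) walks all five rows.
```

&lt;p&gt;With the truncated cursor, every remaining row's timestamp is strictly greater than the cursor's at the microsecond level, so the next-page filter excludes all of them.&lt;/p&gt;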

&lt;h2&gt;
  
  
  Why Notifications Never Hit This
&lt;/h2&gt;

&lt;p&gt;The notifications endpoint had been running this pattern in production without a problem. The reason it worked is that production notifications are inserted one at a time, each one in its own transaction, each one with a distinct timestamp. Microsecond collisions are theoretically possible but vanishingly rare in practice.&lt;/p&gt;

&lt;p&gt;Suppressions are different. The realistic write pattern is bulk: a tenant onboards and uploads their existing suppression list as a CSV, an admin runs a backfill from a previous notification provider, an auto-import grabs all hard bounces from the last 90 days. These all hit the database as batch inserts. Five rows sharing a microsecond stops being a contrived test fixture and becomes the normal case.&lt;/p&gt;

&lt;p&gt;I considered three fixes. Round the column down to milliseconds at write time, so the cursor and the column always match. Switch the cursor format to a millisecond-precise serialization. Find a way to round-trip the microseconds losslessly through the cursor.&lt;/p&gt;

&lt;p&gt;The first option destroys information. Postgres stores microseconds for a reason; throwing them away because the cursor is too coarse is a fix in the wrong layer. The second option amounts to the same thing. Whatever precision the cursor carries becomes the precision of the comparison, so reducing the cursor to milliseconds is identical to reducing the column.&lt;/p&gt;

&lt;p&gt;The third option meant figuring out how to get microsecond precision out of Postgres in a form that survives a round trip through JSON, base64, and back into a SQL parameter.&lt;/p&gt;

&lt;h2&gt;
  
  
  to_char Saves the Round-Trip
&lt;/h2&gt;

&lt;p&gt;Postgres has a function called &lt;code&gt;to_char&lt;/code&gt; that formats timestamps using a pattern string. The pattern &lt;code&gt;YYYY-MM-DD"T"HH24:MI:SS.US&lt;/code&gt; produces a string with microsecond precision in ISO-like format. Casting that string back to &lt;code&gt;timestamp&lt;/code&gt; parses it without loss. The driver and the cursor never have to handle microseconds in JavaScript at all. The value moves as a string from query to cursor to next query, and Postgres does the precision work on both ends.&lt;/p&gt;
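
&lt;p&gt;On the application side, the cursor codec then becomes trivial, because the timestamp is just an opaque string (a sketch with assumed names):&lt;/p&gt;

```typescript
// The microsecond-precise string from to_char travels through the cursor
// verbatim; JavaScript never parses it into a Date, so nothing is lost.
interface CursorPayload {
  createdAtText: string; // e.g. "2026-04-26T14:32:18.473251"
  id: string;
}

function encodeCursor(payload: CursorPayload): string {
  return Buffer.from(JSON.stringify(payload), "utf8").toString("base64url");
}

function decodeCursor(cursor: string): CursorPayload {
  return JSON.parse(Buffer.from(cursor, "base64url").toString("utf8"));
}
```

&lt;p&gt;The decoded &lt;code&gt;createdAtText&lt;/code&gt; is handed straight back to Postgres with a &lt;code&gt;::timestamp&lt;/code&gt; cast, so both precision-sensitive conversions happen inside the database.&lt;/p&gt;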

&lt;p&gt;The list query selects an additional helper column with the formatted timestamp:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;select&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tenantSuppressions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;recipient&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tenantSuppressions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;recipient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tenantSuppressions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;expiresAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tenantSuppressions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;expiresAt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tenantSuppressions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;createdAtText&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;sqlOp&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="s2"&gt;`to_char(&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;tenantSuppressions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; AT TIME ZONE 'UTC', 'YYYY-MM-DD"T"HH24:MI:SS.US')`&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;created_at_text&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;createdAtText&lt;/code&gt; column is computed in the database for every returned row. The cursor encoder uses that string verbatim, paired with the row's &lt;code&gt;id&lt;/code&gt;. The decoder pulls the string back out and casts it to &lt;code&gt;timestamp&lt;/code&gt; in the next query's tuple comparison:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;sqlOp&lt;/span&gt;&lt;span class="s2"&gt;`(&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;tenantSuppressions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;createdAt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;, &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;tenantSuppressions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;) &amp;lt; (&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;decoded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;createdAtText&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;::timestamp, &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;decoded&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;::uuid)`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The helper column is stripped from the response payload before the JSON goes back to the tenant, so the API surface stays clean. The byte-exact round trip happens entirely inside the cursor.&lt;/p&gt;
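
&lt;p&gt;The cursor round trip itself can be sketched in a few lines. The function names here are hypothetical; the point is that the formatted timestamp travels as an opaque string through JSON and base64, so no &lt;code&gt;Date&lt;/code&gt; object ever gets a chance to truncate it:&lt;/p&gt;

```typescript
// Hypothetical encoder/decoder pair for the opaque cursor.
function encodeCursor(createdAtText: string, id: string): string {
  return Buffer.from(JSON.stringify({ createdAtText, id })).toString('base64url');
}

function decodeCursor(cursor: string): { createdAtText: string; id: string } {
  return JSON.parse(Buffer.from(cursor, 'base64url').toString('utf8'));
}

// The microsecond digits survive the round trip verbatim.
const cursor = encodeCursor('2026-05-06T16:00:28.123456', '0b8f42a1');
console.log(decodeCursor(cursor).createdAtText); // '2026-05-06T16:00:28.123456'
```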

&lt;p&gt;The fix is one extra column in the select, one cast in the where clause, and one filter in the response mapper. Around fifteen lines of code. The behavior change is that bulk-inserted rows now page correctly, even when five of them share a microsecond.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Clever Fix That Failed Identically
&lt;/h2&gt;

&lt;p&gt;An earlier attempt looked clever. Instead of relying on tuple comparison, I split it into two clauses: rows strictly older OR (rows at the same timestamp AND id strictly less). The Drizzle expression was something like &lt;code&gt;or(lt(createdAt, cursorTs), and(eq(createdAt, cursorTs), lt(id, cursorId)))&lt;/code&gt;. This is the textbook decomposition of a tuple comparison.&lt;/p&gt;

&lt;p&gt;It also did not work, for the same reason. The &lt;code&gt;eq(createdAt, cursorTs)&lt;/code&gt; branch compared the microsecond-precise column against the millisecond-truncated cursor, and the equality returned false. The second clause never matched. Same bug, more code.&lt;/p&gt;

&lt;p&gt;I would not have noticed the OR-form failure if I had not been writing fresh tests against the seeded fixture. In production, with one row per transaction, both the tuple form and the OR form work. The seed was the only thing that exposed the precision mismatch. If I had written the suppressions endpoint with looser test fixtures (a &lt;code&gt;setTimeout(10)&lt;/code&gt; between inserts, for instance), the bug would have shipped, and the first tenant to bulk-import a CSV would have hit it. Five hundred rows in, three hundred would page correctly and the rest would silently disappear from the list view.&lt;/p&gt;

&lt;p&gt;The lesson I keep relearning is that test fixtures that simulate the realistic write pattern catch a class of bugs that loose fixtures hide. The seeded &lt;code&gt;db.insert(...).values([...])&lt;/code&gt; is exactly the shape of a bulk import. Writing the test that way was an accident, but it surfaced the bug at development time instead of in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where the Recipe Lands
&lt;/h2&gt;

&lt;p&gt;The fix shipped in batch 018, alongside the rest of the suppressions CRUD surface. The endpoint pages through bulk-inserted rows correctly regardless of how many share a timestamp. The cursor format is base64-encoded JSON containing the &lt;code&gt;createdAtText&lt;/code&gt; string and the row &lt;code&gt;id&lt;/code&gt;. The query plan uses the existing &lt;code&gt;(tenant_id, created_at DESC, id DESC)&lt;/code&gt; index because tuple comparison maps cleanly to an index range scan in Postgres.&lt;/p&gt;

&lt;p&gt;Pattern 006 in the project's pattern catalog already documented cursor pagination for the notifications endpoint. Rather than write a new pattern file, I extended 006 in place with the microsecond-precision recipe as a sub-section. The recipe is small enough that promoting it to a standalone pattern would be over-cataloging, but specific enough that the next endpoint that needs cursor pagination over bulk-insertable data can copy the exact &lt;code&gt;to_char&lt;/code&gt; format string instead of rediscovering it.&lt;/p&gt;

&lt;p&gt;Notifications are still on the millisecond-precision cursor. They have not hit the bug because nothing in the system bulk-inserts notifications. The day a backfill job lands, that endpoint gets the same fifteen-line treatment.&lt;/p&gt;

&lt;p&gt;The whole thing is a two-character precision difference between two systems that both believe they are using ISO 8601, and the gap is invisible until five rows show up in the same transaction.&lt;/p&gt;

</description>
      <category>postgres</category>
      <category>pagination</category>
      <category>drizzle</category>
      <category>precision</category>
    </item>
    <item>
      <title>The Signing Bug You Cannot See Until a Tenant Tries to Verify You</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Sun, 26 Apr 2026 17:27:51 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/the-signing-bug-you-cannot-see-until-a-tenant-tries-to-verify-you-338h</link>
      <guid>https://dev.to/kingsleyonoh/the-signing-bug-you-cannot-see-until-a-tenant-tries-to-verify-you-338h</guid>
      <description>&lt;p&gt;A tenant registers a delivery callback URL. The Hub sends a &lt;code&gt;POST&lt;/code&gt; whenever Resend reports an email bounced or was opened. The body is JSON. The header &lt;code&gt;X-Hub-Signature&lt;/code&gt; carries an HMAC-SHA256 digest computed over that body using a secret that only the Hub and the tenant know. The tenant recomputes the digest on receipt and compares. If they match, the request is authentic. If they do not, the request is dropped.&lt;/p&gt;

&lt;p&gt;This is the standard webhook-signing recipe. GitHub uses it. Slack uses it. Resend uses it on the way in. Now the Hub uses it on the way out.&lt;/p&gt;

&lt;p&gt;The obvious implementation looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;digest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createHmac&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sha256&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It worked in tests. It worked in the integration suite. It worked against a mock callback server. Then I tried to write a verification snippet for the tenant docs in Python, and the digests did not match.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why JSON.stringify Sinks Webhook Signing
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;JSON.stringify&lt;/code&gt; is deterministic for a given object, but it is not canonical, and neither is any other language's default serializer. The order of keys in the output depends on the order they were inserted into the object, so two semantically identical objects can serialize to different bytes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;   &lt;span class="c1"&gt;// '{"a":1,"b":2}'&lt;/span&gt;
&lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt;   &lt;span class="c1"&gt;// '{"b":2,"a":1}'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Both encode the same data. The two strings hash to different digests under HMAC. A tenant who receives the callback, parses the JSON, and then re-serializes it to compute their own signature will produce different bytes than the Hub did, even though nothing about the payload changed.&lt;/p&gt;

&lt;p&gt;This is the silent failure mode of webhook signing. Both sides are doing exactly what they were told. Both digests are correct for the bytes they were computed over. The bytes are different. The signatures mismatch. The tenant rejects the callback as a forgery, and there is no error message that explains why.&lt;/p&gt;

&lt;p&gt;The Hub side controls the bytes that go on the wire. Whatever I sign, I send, and as long as the tenant hashes the raw bytes of what arrived, the digests agree. But "hash the raw bytes" is fragile. Express's body parser, FastAPI's request handler, or any middleware in the chain that parses the body hands the tenant a parsed object instead of the original bytes, and re-serializing that object produces a different byte stream. The tenant integration breaks not because of a bug in their code, but because of a default in the framework they used.&lt;/p&gt;
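
&lt;p&gt;The failure does not even need key reordering. Any formatting difference between the wire bytes and the re-serialized bytes changes the digest (a sketch with a made-up secret):&lt;/p&gt;

```typescript
import crypto from 'node:crypto';

const secret = 'demo-secret'; // illustrative only

const hmac = (bytes: string) =>
  crypto.createHmac('sha256', secret).update(bytes).digest('hex');

// Bytes as they arrived on the wire, with incidental whitespace.
const wire = '{ "a": 1 }';

// Parse-then-restringify strips the whitespace: same data, new bytes.
const reSerialized = JSON.stringify(JSON.parse(wire)); // '{"a":1}'

console.log(hmac(wire) === hmac(reSerialized)); // false
```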

&lt;h2&gt;
  
  
  Three Things The Scheme Has To Do
&lt;/h2&gt;

&lt;p&gt;I needed three things. The signing scheme had to be deterministic, so two encodings of the same data produced the same digest. It had to be specifiable, so a tenant could implement verification in any language without reading the Hub's source. And it had to be reusable, because Phase 7 added the delivery callback first but planned to extend the same signing pattern to suppression callbacks, generic webhook fan-out, and alert callbacks downstream.&lt;/p&gt;

&lt;p&gt;The scheme I picked is canonical JSON: sort object keys recursively, build the JSON string by hand, hash that. Any tenant in any language that performs the same recursive sort produces the same bytes and the same digest. The contract becomes the canonicalization rule, not the byte stream.&lt;/p&gt;

&lt;h2&gt;
  
  
  Canonical JSON in Seventy Lines
&lt;/h2&gt;

&lt;p&gt;The signing module is small. Three exported functions, one private helper, around 70 lines total in &lt;code&gt;src/lib/outbound-signing.ts&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;canonicalJson&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="kc"&gt;null&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;object&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isArray&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;canonicalJson&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;item&lt;/span&gt;&lt;span class="p"&gt;)).&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;]&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;keys&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Object&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;sort&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;parts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;v&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;Record&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="nx"&gt;k&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;k&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;:&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;canonicalJson&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;v&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;{&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;}&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three things to notice. Primitives delegate to &lt;code&gt;JSON.stringify&lt;/code&gt;, which handles string escaping and number formatting correctly per the spec. Object keys get sorted lexicographically before serialization. The function is recursive and uses no whitespace, so the output is byte-identical to what a JCS-style canonicalizer in Python or Go would produce given the same input.&lt;/p&gt;

&lt;p&gt;What the function deliberately does not handle: &lt;code&gt;Date&lt;/code&gt;, &lt;code&gt;Map&lt;/code&gt;, &lt;code&gt;Set&lt;/code&gt;, &lt;code&gt;BigInt&lt;/code&gt;. None of those round-trip through JSON cleanly. Outbound callback payloads are intentionally pure JSON values, type-checked at the call site. Refusing to support exotic types keeps the canonicalization rule auditable in one paragraph rather than a footnote.&lt;/p&gt;

&lt;p&gt;The signing function builds on top:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;buildSignedOutboundRequest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;unknown&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;body&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="nl"&gt;signatureHeader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;canonicalJson&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;digest&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;crypto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;createHmac&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;sha256&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;hex&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;signatureHeader&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`sha256=&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;digest&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The body and the signature are returned together. This is the design decision that took the longest to reach. The first version returned them separately and let the caller choose what to send over the wire. That created room for a class of bugs where the caller signed one set of bytes and sent a different set: log the body, then send &lt;code&gt;JSON.stringify(event)&lt;/code&gt;, sign with &lt;code&gt;canonicalJson(event)&lt;/code&gt;, and ship the wrong thing. Returning them as a pair means the bytes that get hashed are the bytes that get transmitted. The caller cannot accidentally split them.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;sha256=&lt;/code&gt; prefix on the header value follows the convention GitHub and Slack use. It exists so that the scheme can later add SHA-512 or another algorithm without breaking parsers, and so tenants writing verification code can split on &lt;code&gt;=&lt;/code&gt; and validate the algorithm name first.&lt;/p&gt;
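
&lt;p&gt;A tenant-side check along those lines might look like this. This is a hedged sketch, not the snippet from the tenant docs: it hashes the raw body bytes, validates the algorithm prefix, and compares digests in constant time:&lt;/p&gt;

```typescript
import crypto from 'node:crypto';

// Hypothetical verifier: rawBody must be the exact bytes received,
// not a re-serialization of the parsed payload.
function verifySignature(rawBody: Buffer, header: string, secret: string): boolean {
  const [algorithm, received] = header.split('=');
  if (algorithm !== 'sha256' || !received) return false;
  const expected = crypto.createHmac('sha256', secret).update(rawBody).digest('hex');
  if (received.length !== expected.length) return false;
  return crypto.timingSafeEqual(Buffer.from(received), Buffer.from(expected));
}
```

&lt;p&gt;The length check before &lt;code&gt;timingSafeEqual&lt;/code&gt; matters: Node throws when the two buffers have different lengths.&lt;/p&gt;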

&lt;h2&gt;
  
  
  The Test Story That Inverted
&lt;/h2&gt;

&lt;p&gt;I assumed the testing story would be the hard part. Cryptographic functions resist mocking. You cannot mock HMAC into returning a fixed digest, because the whole point is determinism over real input. So I expected a lot of test infrastructure: helper builders, fixture payloads, golden digests checked into the repo.&lt;/p&gt;

&lt;p&gt;The opposite happened. Because the function is pure and deterministic, the tests are five-line affairs. Pass an input. Assert on the output digest. No mocks. No setup. The HMAC determinism tests run in microseconds:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;test&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;signs canonically: key order does not affect digest&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;test-secret&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;signOutboundPayload&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="nx"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;signOutboundPayload&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;b&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="nx"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="nf"&gt;expect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;toBe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That single test catches the entire class of bug that motivated the canonicalization in the first place. If &lt;code&gt;JSON.stringify&lt;/code&gt; ever sneaks back into the signing path, this test fails immediately, and it will catch the same regression in any future refactor.&lt;/p&gt;

&lt;p&gt;The harder testing problem turned out to be on the dispatch side, not the signing side. The route that triggers a callback uses &lt;code&gt;void dispatchDeliveryCallback(...).catch(...)&lt;/code&gt;. That is fire-and-forget: the webhook handler can return 200 to Resend without waiting for the tenant's callback URL to respond. Production behavior is correct: the Hub never blocks Resend on a slow or down tenant. But naive integration tests race against the dispatch. The test calls &lt;code&gt;app.inject()&lt;/code&gt;, gets a 200 response, then asserts on the database row that should have been updated by the callback handler. The callback might still be in flight.&lt;/p&gt;

&lt;p&gt;Three options for testing this. Option one: add a test-only hook that returns the in-flight promise so the test can &lt;code&gt;await&lt;/code&gt; it. Option two: use &lt;code&gt;setImmediate()&lt;/code&gt; and poll. Option three: keep the production fire-and-forget contract untouched and use a polling helper in the test that waits up to two seconds for the assertion to become true.&lt;/p&gt;

&lt;p&gt;I picked option three. The production behavior is the contract. Adding a test-only side channel means production code carries a hook that exists only for tests, which inverts the relationship: the test is supposed to verify the production code, not constrain it. The polling helper is uglier but bounded: typical cases resolve in 10 to 50 milliseconds because both the mock callback server and the database are local, and the upper bound of two seconds catches a real hang without hanging the test suite forever.&lt;/p&gt;
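
&lt;p&gt;The polling helper itself is a dozen lines. This is a sketch of the shape, not the project's actual code: it retries the assertion until it passes or the deadline expires, then rethrows the last failure:&lt;/p&gt;

```typescript
// Hypothetical polling helper for asserting on fire-and-forget effects.
async function eventually(check: () => unknown, timeoutMs = 2000, intervalMs = 10) {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    try {
      await check(); // the assertion may be sync or async
      return;
    } catch (err) {
      if (Date.now() >= deadline) throw err; // real hang: surface the failure
      await new Promise((resolve) => setTimeout(resolve, intervalMs));
    }
  }
}
```

&lt;p&gt;A test wraps its database assertion in &lt;code&gt;await eventually(...)&lt;/code&gt; and the production dispatch path stays untouched.&lt;/p&gt;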

&lt;h2&gt;
  
  
  What Ships in Phase 7
&lt;/h2&gt;

&lt;p&gt;The signing module ships in batch 011 of Phase 7. The webhook integration in batch 012 wires the dispatcher to call it on every Resend delivery event. The end-to-end test in batch 013 fires a mock Resend webhook into the Hub, watches the database row get inserted, then watches a mock callback server receive a signed POST with the right digest. All three batches add 14 new tests on top of the existing 309. None of them mock HMAC.&lt;/p&gt;

&lt;p&gt;The tenant verification snippet in &lt;code&gt;USER_SETUP.md&lt;/code&gt; Section 10 is intentionally explicit about reading raw request bytes. In Node, &lt;code&gt;express.raw({ type: 'application/json' })&lt;/code&gt; instead of &lt;code&gt;express.json()&lt;/code&gt;. In Python, &lt;code&gt;request.get_data()&lt;/code&gt; instead of &lt;code&gt;request.json()&lt;/code&gt;. The snippet works because the Hub signs canonical JSON and sends those exact bytes. A tenant who hashes the raw body bytes verifies correctly. A tenant who reparses and re-serializes does not, and the snippet says so up front.&lt;/p&gt;

&lt;p&gt;Phase 7b extracted &lt;code&gt;canonicalJson&lt;/code&gt; and &lt;code&gt;signOutboundPayload&lt;/code&gt; from the delivery-callback module into &lt;code&gt;src/lib/outbound-signing.ts&lt;/code&gt; so the next callback family (suppression notifications, generic webhook fan-out, alert callbacks) uses the same scheme without copy-paste. The canonicalization rule is now load-bearing for every outbound HTTP call the Hub will ever make.&lt;/p&gt;

&lt;p&gt;The signing module is 73 lines. The bug it prevents took an afternoon to find and would have taken a tenant a week to debug on their side.&lt;/p&gt;

</description>
      <category>hmac</category>
      <category>cryptography</category>
      <category>webhooks</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Stripe Said Past Due. The State Machine Said No. Both Were Right.</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Sat, 18 Apr 2026 11:25:43 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/stripe-said-past-due-the-state-machine-said-no-both-were-right-4ea</link>
      <guid>https://dev.to/kingsleyonoh/stripe-said-past-due-the-state-machine-said-no-both-were-right-4ea</guid>
      <description>&lt;p&gt;A paused subscription receives a &lt;code&gt;customer.subscription.updated&lt;/code&gt; webhook from Stripe. The payload says the new status is &lt;code&gt;past_due&lt;/code&gt;. The local state machine has eight states and fifteen valid transitions, defined as a single Elixir map literal. &lt;code&gt;paused&lt;/code&gt; can transition to &lt;code&gt;active&lt;/code&gt;. That is the only outbound edge. There is no &lt;code&gt;paused&lt;/code&gt; to &lt;code&gt;past_due&lt;/code&gt; entry. The state machine says this transition doesn't exist.&lt;/p&gt;

&lt;p&gt;What do you do with the event?&lt;/p&gt;

&lt;p&gt;Rejecting the event is the safe choice. Invalid transition, log a warning, drop the payload. Clean, principled, and wrong. Stripe doesn't send a status change in isolation. The same webhook payload contains updated period dates, metadata changes, and potentially a new plan assignment. Rejecting the event to protect the status field means losing everything else in the payload.&lt;/p&gt;

&lt;p&gt;Forcing the transition is the pragmatic choice. Override the state machine, accept whatever Stripe says, move on. Also wrong. The state machine exists because downstream business logic depends on it. The dunning engine creates retry sequences when a subscription enters &lt;code&gt;past_due&lt;/code&gt;. If a paused subscription suddenly appears as &lt;code&gt;past_due&lt;/code&gt; through a forced transition, the dunning engine starts chasing a payment that was intentionally paused. The customer gets escalating "your payment failed" notifications for a subscription they paused on purpose.&lt;/p&gt;

&lt;p&gt;But the invalid transition is a symptom, not the disease. Stripe and the local data model can disagree about the current state of a subscription, and neither is wrong. Stripe is the source of truth for payment processing. The local state machine is the source of truth for business logic execution. When they conflict, you need an architecture that handles the disagreement without losing data or breaking invariants.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Sources of Truth, Zero Arbitration
&lt;/h2&gt;

&lt;p&gt;This isn't a sync problem. Sync implies one system is behind and needs to catch up. This is a disagreement: Stripe says the state is X, the local model says the transition to X is illegal, and both positions are defensible.&lt;/p&gt;

&lt;p&gt;Stripe can put a subscription into states the local model doesn't allow because Stripe's state machine is broader and serves a different purpose. Stripe tracks billing states across all possible payment scenarios, including edge cases around payment method updates, invoice finalization timing, and subscription schedule modifications. The local model tracks lifecycle states for business logic: when to start dunning, when to compute churn, when to notify the customer. These are overlapping but non-identical concerns.&lt;/p&gt;

&lt;p&gt;The moment I accepted that these two state machines would diverge, the design became clear. The system needed to handle three cases: agreement (both say the same thing, proceed normally), valid disagreement (the transition is in the map, accept it), and invalid disagreement (the transition is not in the map, accept the data but not the transition).&lt;/p&gt;

&lt;h2&gt;
  
  
  The Strip, Don't Reject Pattern
&lt;/h2&gt;

&lt;p&gt;The core of this design lives in a single function: &lt;code&gt;maybe_strip_status/3&lt;/code&gt; in the &lt;code&gt;SubscriptionProcessor&lt;/code&gt; module.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="k"&gt;defp&lt;/span&gt; &lt;span class="n"&gt;maybe_strip_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stripe_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_new_status&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;stripe_data&lt;/span&gt;
&lt;span class="k"&gt;defp&lt;/span&gt; &lt;span class="n"&gt;maybe_strip_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stripe_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;when&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;stripe_data&lt;/span&gt;

&lt;span class="k"&gt;defp&lt;/span&gt; &lt;span class="n"&gt;maybe_strip_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stripe_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;previous_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="no"&gt;StateMachine&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transition!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;previous_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="ss"&gt;:ok&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;
      &lt;span class="n"&gt;stripe_data&lt;/span&gt;

    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:invalid_transition&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;
      &lt;span class="no"&gt;Logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;"SubscriptionProcessor: invalid transition &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;previous_status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; -&amp;gt; &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;new_status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;, "&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt;
          &lt;span class="s2"&gt;"keeping current status"&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt;

      &lt;span class="no"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stripe_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;previous_status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the state machine rejects a transition, the function does not reject the event. It replaces the incoming status with the previous status and returns the rest of the payload untouched. Period dates, trial dates, metadata, plan references: all of it passes through. Only the illegal status change gets stripped.&lt;/p&gt;

&lt;p&gt;The first two function clauses handle the cold-start cases. When &lt;code&gt;previous_status&lt;/code&gt; is nil (brand new subscription, no prior state to compare), accept whatever Stripe sends. When the previous and new statuses are identical (Stripe sent an update that didn't change the status), pass through without checking the transition map.&lt;/p&gt;

&lt;p&gt;This is a deliberate tradeoff. The local status can drift from Stripe's status. A subscription might be &lt;code&gt;past_due&lt;/code&gt; in Stripe's records and &lt;code&gt;paused&lt;/code&gt; in the local database. I accepted this inconsistency because the local model controls what actually happens: which dunning sequences fire, which notifications send, which metrics count the subscription as churned. If Stripe's status is wrong from the local model's perspective, the local model wins for execution and Stripe wins for billing. Nobody is the universal authority.&lt;/p&gt;

&lt;h2&gt;
  
  
  Idempotency Under Disagreement
&lt;/h2&gt;

&lt;p&gt;The state disagreement compounds with another problem: event ordering. Stripe doesn't guarantee webhook delivery order. A &lt;code&gt;customer.subscription.updated&lt;/code&gt; event from 3:00 PM can arrive after one from 3:05 PM. If the 3:05 PM event was already processed and moved the subscription to &lt;code&gt;active&lt;/code&gt;, the 3:00 PM event now carries a stale status that the state machine might reject.&lt;/p&gt;

&lt;p&gt;The idempotency model handles this with a three-state check, not the typical two-state (seen/unseen) approach. Every event is stored with a composite key: &lt;code&gt;tenant_id:stripe_event_id&lt;/code&gt;. Before processing, the system checks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;New:&lt;/strong&gt; No record exists. Process normally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duplicate:&lt;/strong&gt; A record exists with &lt;code&gt;processed_at&lt;/code&gt; set. Skip entirely, return success.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing:&lt;/strong&gt; A record exists without &lt;code&gt;processed_at&lt;/code&gt;. Another Oban worker is handling this event right now. Skip to avoid double-processing.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The third state is the one most implementations miss. Without it, two workers pulling the same event from the Oban queue would both attempt to process it, and both would try to create downstream records (dunning attempts, notifications). The database-level unique constraint on the idempotency key catches this at the persistence layer, but the three-state check catches it earlier and avoids the exception entirely.&lt;/p&gt;
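
&lt;p&gt;A minimal sketch of the three-state check (the schema and module names here are illustrative; the unique constraint on the key still backstops the race at insert time):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;# Sketch only: ProcessedEvent and its fields are illustrative names.
def claim_event(tenant_id, stripe_event_id) do
  key = "#{tenant_id}:#{stripe_event_id}"

  case Repo.get_by(ProcessedEvent, idempotency_key: key) do
    nil -&amp;gt; :process                              # new: no record yet
    %ProcessedEvent{processed_at: nil} -&amp;gt; :skip  # processing: another worker owns it
    %ProcessedEvent{} -&amp;gt; :skip                   # duplicate: already completed
  end
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;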

&lt;h2&gt;
  
  
  The Dunning Guard Nobody Asks About
&lt;/h2&gt;

&lt;p&gt;The paused-to-past-due edge case has a second layer of protection deeper in the processing pipeline. Even if the status stripping somehow failed and a paused subscription appeared as &lt;code&gt;past_due&lt;/code&gt;, the dunning trigger has its own guard:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;prev_status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"paused"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="no"&gt;Logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;"SubscriptionProcessor: paused subscription transitioned to past_due, "&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt;
      &lt;span class="s2"&gt;"skipping dunning for sub &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;subscription&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is belt-and-suspenders engineering for a specific Stripe behavior I discovered during development. Stripe can emit a &lt;code&gt;past_due&lt;/code&gt; status for a paused subscription when a pending invoice from before the pause fails to collect. The subscription was paused intentionally, but Stripe's billing engine still tries to finalize the outstanding invoice. If it fails, Stripe marks the subscription &lt;code&gt;past_due&lt;/code&gt; even though the customer explicitly paused it.&lt;/p&gt;

&lt;p&gt;Without this guard, the dunning engine would start a 7-day escalation sequence (email at day 1, email at day 3, Telegram at day 5, both channels at day 7, then automatic cancellation) for a subscription the customer paused. The customer experience would be: "I paused my subscription. Three days later I got a Telegram message saying my payment failed and I need to act immediately." That is not a recoverable customer relationship.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Would Redesign
&lt;/h2&gt;

&lt;p&gt;The status stripping approach works, but it has a gap: the system does not reconcile. If the local status diverges from Stripe's status, it stays diverged until the next valid transition arrives. A reconciliation job that periodically fetches subscription status from the Stripe API and compares it against local state would close this gap. I didn't build it because the current system is event-driven by design (no polling), and adding a reconciliation poller would create a second source of state mutations that the webhook pipeline doesn't expect.&lt;/p&gt;

&lt;p&gt;The churn calculator has a related bootstrap problem. It computes churn rate as &lt;code&gt;churned_count / active_count_at_period_start&lt;/code&gt;, where the denominator comes from the previous day's metrics snapshot. On the first day of operation, there is no previous snapshot. The first churn rate is always &lt;code&gt;0.0000&lt;/code&gt;, regardless of what actually happened. I accepted this because building temporal query infrastructure to count active subscriptions at an arbitrary past timestamp added complexity that the 99.9% case (day 2 and beyond) doesn't need. But the first day's metrics are wrong, and there is no warning in the dashboard about it.&lt;/p&gt;
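
&lt;p&gt;The bootstrap case reduces to a single extra function head. This is a sketch with illustrative names, assuming the &lt;code&gt;Decimal&lt;/code&gt; library for the rate:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;# Day one: no prior snapshot means the denominator is unknown,
# so the rate is reported as zero rather than raising.
defp churn_rate(_churned_count, nil), do: Decimal.new("0.0000")

defp churn_rate(churned_count, active_at_start) when active_at_start &amp;gt; 0 do
  Decimal.div(Decimal.new(churned_count), Decimal.new(active_at_start))
end
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;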

&lt;h2&gt;
  
  
  The Lesson Is About Modeling, Not Stripe
&lt;/h2&gt;

&lt;p&gt;Stripe is the specific case, but the pattern applies to any system that processes events from an external authority. Payment providers, shipping APIs, identity providers, government reporting systems: they all emit state changes according to their own transition rules. Your internal model is a subset of their model, tuned for your business logic, and the two will disagree.&lt;/p&gt;

&lt;p&gt;The instinct is to treat the external system as the single source of truth. Let Stripe win every argument. The problem is that "letting Stripe win" means your business logic executes against a state model you do not control and cannot predict. The alternative isn't to fight the external system. It is to separate the concerns: accept their data, enforce your transitions, log the disagreements, and build guards at every point where the disagreement could trigger unintended side effects.&lt;/p&gt;

&lt;p&gt;Build for disagreement, not consensus. The external world does not know your rules, and it does not care.&lt;/p&gt;

</description>
      <category>stripe</category>
      <category>elixir</category>
      <category>webhook</category>
    </item>
    <item>
      <title>Why Anomaly Detection Can't Block the Ingestion Pipeline</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Sat, 18 Apr 2026 11:23:23 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/why-anomaly-detection-cant-block-the-ingestion-pipeline-3n42</link>
      <guid>https://dev.to/kingsleyonoh/why-anomaly-detection-cant-block-the-ingestion-pipeline-3n42</guid>
      <description>&lt;p&gt;The first version of the anomaly evaluator ran inline. A batch of 100 readings flushed to TimescaleDB, then the same function called &lt;code&gt;evaluate_batch&lt;/code&gt;, which fetched alert rules, computed rolling statistics from the continuous aggregates, checked cooldown windows, persisted alerts, published NATS events, and fired HTTP calls to the Notification Hub and Workflow Engine. All of it synchronous. All of it in the same call stack as the NATS consumer loop.&lt;/p&gt;

&lt;p&gt;It worked for about thirty seconds. Then the Workflow Engine returned a 502, the reqwest client waited for its default timeout, the batch flush stalled, NATS messages piled up, and the consumer loop fell behind by 3,000 messages before I killed the process.&lt;/p&gt;

&lt;p&gt;The lesson was obvious in retrospect: a sensor data pipeline that sustains 5,000 readings per second cannot wait for a downstream HTTP response. The interesting question was what "don't wait" actually means at the architecture level. Making the HTTP call async solves the timeout. It doesn't solve the failure model. I needed every component downstream of the batch insert to be able to fail, silently, without the consumer loop noticing or caring.&lt;/p&gt;

&lt;h2&gt;
  
  
  The pipeline, end to end
&lt;/h2&gt;

&lt;p&gt;A sensor message arrives on NATS as JSON. The consumer deserializes it, validates the value (must be finite, metric must be non-empty), resolves the tenant via the TenantResolver's DashMap cache, auto-registers the device if unknown, and pushes the reading into the BatchBuffer. When the buffer hits 100 readings or 500 milliseconds elapse, whichever comes first, it flushes.&lt;/p&gt;

&lt;p&gt;The flush itself is the only synchronous bottleneck I allow. The &lt;code&gt;execute_insert&lt;/code&gt; method builds UNNEST arrays for all seven columns and sends a single INSERT to TimescaleDB. If that fails, it retries once after 100 milliseconds. If the retry fails, it logs at &lt;code&gt;error&lt;/code&gt; level and drops the batch. There's no infinite retry loop. No dead-letter queue. The batch is gone.&lt;/p&gt;

&lt;p&gt;I made that choice deliberately. At 5,000 readings per second, a retry queue that grows faster than it drains is worse than data loss. Thirty seconds of buffered retries at full throughput means 150,000 queued readings. The server runs in 512MB. The math doesn't work. Drop the batch, log the failure, let the operator investigate. The continuous aggregates will smooth over a missing 100-reading gap, and the anomaly detector uses rolling statistics over minutes, not individual readings.&lt;/p&gt;
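
&lt;p&gt;The retry policy is small enough to sketch in isolation. This is a synchronous stand-in (the real path is async and &lt;code&gt;try_insert&lt;/code&gt; stands in for the UNNEST-based INSERT), but the shape is the same: one retry after 100 milliseconds, then drop:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::{thread, time::Duration};

/// Retry-once-then-drop (synchronous sketch; the real path is async).
/// Returns how many readings were persisted: the whole batch or zero.
fn flush_with_retry&amp;lt;T, E: std::fmt::Display&amp;gt;(
    batch: Vec&amp;lt;T&amp;gt;,
    mut try_insert: impl FnMut(&amp;amp;[T]) -&amp;gt; Result&amp;lt;(), E&amp;gt;,
) -&amp;gt; usize {
    let n = batch.len();
    if try_insert(&amp;amp;batch).is_ok() {
        return n;
    }
    thread::sleep(Duration::from_millis(100)); // single backoff, then give up
    match try_insert(&amp;amp;batch) {
        Ok(()) =&amp;gt; n,
        Err(e) =&amp;gt; {
            eprintln!("error: dropping batch of {n} readings: {e}");
            0 // no dead-letter queue: the batch is gone
        }
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;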

&lt;h2&gt;
  
  
  The buffer swap pattern
&lt;/h2&gt;

&lt;p&gt;The BatchBuffer holds a &lt;code&gt;Vec&amp;lt;ValidatedReading&amp;gt;&lt;/code&gt; behind a &lt;code&gt;tokio::sync::Mutex&lt;/code&gt;. The critical design detail is in the &lt;code&gt;push&lt;/code&gt; method: when the buffer reaches capacity, the code calls &lt;code&gt;std::mem::replace&lt;/code&gt; to swap the full Vec with a fresh one, then drops the Mutex guard before executing the INSERT.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;Vec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;with_capacity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.batch_size&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// release lock before DB I/O&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;flushed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.flush_to_db_with_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matters because the INSERT takes milliseconds. If the lock were held during the INSERT, every concurrent &lt;code&gt;push&lt;/code&gt; call would block. At 5,000 messages per second, even a 5ms INSERT means 25 messages queued behind the lock. The swap pattern reduces the critical section to a memory operation, not a network round-trip.&lt;/p&gt;

&lt;p&gt;I considered using a lock-free structure like a crossbeam channel or a ring buffer. The Mutex won the trade-off because the batch logic needs to check the current buffer length before deciding to flush, and that check-then-act pattern doesn't map cleanly to a channel. The swap keeps the lock hold time under a microsecond, which is well below the threshold where contention would show up at 5,000 messages per second.&lt;/p&gt;

&lt;h2&gt;
  
  
  Detection runs post-flush, not inline
&lt;/h2&gt;

&lt;p&gt;After the BatchBuffer returns the flushed readings, the consumer calls &lt;code&gt;run_anomaly_evaluation&lt;/code&gt;. This function is the firewall between ingestion and detection. Here's what it looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;run_anomaly_evaluation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;PgPool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;nats_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nn"&gt;async_nats&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ValidatedReading&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;notification_emitter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;NotificationEmitter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;workflow_trigger&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;WorkflowTrigger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="nf"&gt;evaluate_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nats_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;notification_emitter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;workflow_trigger&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alert_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;alert_ids&lt;/span&gt;&lt;span class="nf"&gt;.is_empty&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nd"&gt;info!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alert_ids&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"Anomaly evaluation created alerts"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nd"&gt;warn!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Anomaly evaluation failed, continuing"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The return type is &lt;code&gt;()&lt;/code&gt;, not &lt;code&gt;Result&lt;/code&gt;. Errors are logged at &lt;code&gt;warn&lt;/code&gt; level and swallowed. The consumer loop continues processing the next NATS message regardless of whether detection succeeded. A database timeout in the rule query, a serialization failure in the NATS publish, a network error calling the Notification Hub: none of these stop ingestion.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deduplication before evaluation
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;evaluate_batch&lt;/code&gt; function doesn't evaluate every reading in the batch. It extracts unique &lt;code&gt;(tenant_id, device_id, metric)&lt;/code&gt; tuples, keeping only the latest value for each. A batch of 100 readings from 5 devices reporting 3 metrics each produces 15 unique tuples, not 100. The deduplication uses a &lt;code&gt;HashSet&amp;lt;ReadingKey&amp;gt;&lt;/code&gt; and iterates the batch in reverse so the most recent reading for each tuple wins.&lt;/p&gt;

&lt;p&gt;This was a performance decision. Fetching alert rules from PostgreSQL and computing rolling deviation statistics from the &lt;code&gt;readings_1m&lt;/code&gt; aggregate table costs a query per unique tuple. At 100 queries per batch, the evaluator would be the bottleneck. At 15 queries per batch, it completes in single-digit milliseconds.&lt;/p&gt;
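
&lt;p&gt;The reverse-iteration trick is easy to get backwards, so here is a self-contained sketch of the deduplication (field names assumed; readings reduced to a key-value pair for clarity):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;use std::collections::HashSet;

#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct ReadingKey {
    tenant_id: String,
    device_id: String,
    metric: String,
}

/// Keep only the newest reading per (tenant, device, metric) tuple.
/// Iterating in reverse means the first occurrence seen is the latest.
fn latest_per_key(batch: &amp;amp;[(ReadingKey, f64)]) -&amp;gt; Vec&amp;lt;(ReadingKey, f64)&amp;gt; {
    let mut seen: HashSet&amp;lt;ReadingKey&amp;gt; = HashSet::new();
    let mut latest = Vec::new();
    for (key, value) in batch.iter().rev() {
        if seen.insert(key.clone()) {
            latest.push((key.clone(), *value)); // order is newest-first; irrelevant to evaluation
        }
    }
    latest
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;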

&lt;h2&gt;
  
  
  Three condition types, one evaluation path
&lt;/h2&gt;

&lt;p&gt;Each alert rule specifies a condition type: &lt;code&gt;above_threshold&lt;/code&gt;, &lt;code&gt;below_threshold&lt;/code&gt;, or &lt;code&gt;deviation_from_mean&lt;/code&gt;. The first two are trivial comparisons. The third runs a SQL query against the &lt;code&gt;readings_1m&lt;/code&gt; continuous aggregate, computing &lt;code&gt;avg(avg_value)&lt;/code&gt; and &lt;code&gt;stddev(avg_value)&lt;/code&gt; over the configured &lt;code&gt;window_minutes&lt;/code&gt;. If fewer than 5 buckets exist in the window, the check skips entirely.&lt;/p&gt;

&lt;p&gt;That minimum sample count of 5 was the result of a frustrating cold-start problem. When a new device starts reporting, the first few readings have no aggregate history. A threshold of 2 standard deviations from a mean computed from 2 data points fires on everything. The system was generating hundreds of false alerts during device onboarding. Setting the floor at 5 samples eliminated the noise without significantly delaying real anomaly detection. Five minutes of readings at one-per-second gives 5 one-minute aggregate buckets. That's the minimum window before deviation detection activates.&lt;/p&gt;
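
&lt;p&gt;The deviation check itself, with the 5-bucket floor, can be sketched like this (the SQL does the mean and stddev in practice; this in-Rust version assumes the per-bucket means have already been fetched):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;/// `deviation_from_mean` sketch over one-minute aggregate buckets.
/// Returns None when fewer than 5 buckets exist, so the check skips.
fn deviation_alert(bucket_means: &amp;amp;[f64], value: f64, sigmas: f64) -&amp;gt; Option&amp;lt;bool&amp;gt; {
    const MIN_BUCKETS: usize = 5;
    if bucket_means.len() &amp;lt; MIN_BUCKETS {
        return None; // cold start: a stddev computed from 2 points fires on everything
    }
    let n = bucket_means.len() as f64;
    let mean = bucket_means.iter().sum::&amp;lt;f64&amp;gt;() / n;
    // Sample standard deviation, matching SQL's stddev(avg_value)
    let variance = bucket_means.iter().map(|v| (v - mean).powi(2)).sum::&amp;lt;f64&amp;gt;() / (n - 1.0);
    Some((value - mean).abs() &amp;gt; sigmas * variance.sqrt())
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;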

&lt;h2&gt;
  
  
  Cooldown prevents alert storms
&lt;/h2&gt;

&lt;p&gt;Without cooldown, a sensor stuck at a high value generates an alert on every batch flush. At 500ms intervals, that's 120 alerts per minute for a single sensor. The &lt;code&gt;check_cooldown&lt;/code&gt; function queries the &lt;code&gt;alerts&lt;/code&gt; table for any alert with the same &lt;code&gt;rule_id&lt;/code&gt; and &lt;code&gt;device_id&lt;/code&gt; created within the last N minutes (default 15). If one exists, the alert is suppressed and logged at &lt;code&gt;info&lt;/code&gt; level.&lt;/p&gt;

&lt;p&gt;The cooldown query hits an index on &lt;code&gt;(rule_id, device_id, created_at)&lt;/code&gt;. I briefly considered an in-memory cooldown cache (DashMap with TTL, similar to the TenantResolver), but the database-backed approach won because cooldown state needs to survive process restarts. If the service crashes and restarts, a memory-based cooldown would fire duplicate alerts for every rule that was in cooldown before the crash.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fire-and-forget for ecosystem calls
&lt;/h2&gt;

&lt;p&gt;The Notification Hub and Workflow Engine integrations use the same pattern: serialize the payload, spawn a Tokio task, return immediately. The spawned task makes the HTTP call and logs the result. The caller never sees the response.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;move&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="nf"&gt;.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"X-API-Key"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.send&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* log success or failure status */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nd"&gt;warn!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Failed to send event"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Workflow Engine call adds HMAC-SHA256 signing. The &lt;code&gt;compute_signature&lt;/code&gt; method takes the secret and raw body bytes, produces a &lt;code&gt;sha256=&amp;lt;hex&amp;gt;&lt;/code&gt; signature, and attaches it as &lt;code&gt;X-Hub-Signature-256&lt;/code&gt;. If no secret is configured, the header is omitted entirely.&lt;/p&gt;

&lt;p&gt;This is at-most-once delivery by design. If the Notification Hub is down, the alert event is lost. I chose this over at-least-once (which would require a persistent outbox or retry queue) because the alerts are persisted in the database regardless. The Notification Hub is a convenience layer for pushing alerts to email or Telegram. An operator who checks the API's &lt;code&gt;/api/alerts&lt;/code&gt; endpoint will always see the full alert history, even if the Hub missed a notification.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd redesign
&lt;/h2&gt;

&lt;p&gt;The anomaly evaluator currently runs on the same Tokio runtime as the NATS consumer and the Axum HTTP server. At 5,000 readings per second this is fine. At 50,000, the rule evaluation queries would compete with API queries for the 20-connection database pool. I'd split the evaluator into a separate process that consumes a NATS subject of "batches flushed" events, running its own connection pool. The ingestion binary would publish a summary event after each successful flush instead of calling the evaluator directly.&lt;/p&gt;

&lt;p&gt;The other weak point is the fire-and-forget pattern. At scale, lost notifications become a support problem. The fix is an outbox table: persist the notification payload alongside the alert, then have a background worker poll the outbox and deliver with retries. The outbox adds latency and complexity, but it makes notification delivery observable. Right now, the only way to know a notification was lost is to notice a gap in the Notification Hub's event log. That's not good enough for a production system where the person on call needs to trust that critical alerts reach them.&lt;/p&gt;
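
&lt;p&gt;The worker I have in mind is a small poll loop. A sketch, with callables standing in for the real database queries and HTTP client (all names here are hypothetical):&lt;/p&gt;

```python
# Hypothetical outbox worker: poll rows that have not been delivered yet,
# attempt delivery, and record the outcome. Failed rows stay in the outbox
# so the next poll retries them, which is what makes delivery observable.
def drain_outbox(fetch_pending, send, mark_sent, record_failure):
    for row in fetch_pending():
        try:
            send(row["payload"])
            mark_sent(row["id"])
        except Exception:
            record_failure(row["id"])

sent, failed = [], []
rows = [{"id": 1, "payload": "alert-a"}, {"id": 2, "payload": "alert-b"}]

def send(payload):
    if payload == "alert-b":
        raise RuntimeError("hub unreachable")

drain_outbox(lambda: rows, send, sent.append, failed.append)
```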

</description>
      <category>rust</category>
      <category>async</category>
      <category>tokio</category>
    </item>
    <item>
      <title>I Spent a Week Securing Webhook Ingestion. The Real Attack Surface Was Delivery.</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Sat, 18 Apr 2026 11:20:14 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/i-spent-a-week-securing-webhook-ingestion-the-real-attack-surface-was-delivery-5a8p</link>
      <guid>https://dev.to/kingsleyonoh/i-spent-a-week-securing-webhook-ingestion-the-real-attack-surface-was-delivery-5a8p</guid>
      <description>&lt;p&gt;I ran the security review two weeks after the first deployment. The ingestion side looked solid: HMAC signature verification using &lt;code&gt;crypto.timingSafeEqual&lt;/code&gt;, rate limiting at 1,000 requests per minute, payload size capped at 1MB, idempotency deduplication on every incoming event. I was satisfied with the input boundary. Then I traced what happens after an event is accepted, through the delivery worker and out to destination URLs, and realized I'd spent a week protecting the wrong end of the system.&lt;/p&gt;

&lt;p&gt;The ingestion endpoint validates who is sending. But the delivery worker, the component that forwards payloads to downstream URLs registered by tenants, makes outbound HTTP requests from inside the server's network. That side had no protection at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two attack surfaces, nothing in common
&lt;/h2&gt;

&lt;p&gt;A webhook gateway has two distinct threat models, and they require completely different defenses.&lt;/p&gt;

&lt;p&gt;On the ingestion side, the threat is an external attacker sending malicious payloads: forged signatures, oversized bodies designed to exhaust memory, replay attacks using captured requests, and deeply nested JSON meant to overflow the parser stack. The defense strategy is conventional. Validate everything at the boundary before it touches the database.&lt;/p&gt;

&lt;p&gt;On the delivery side, the threat comes from inside. A tenant registers a destination URL. The URL passed validation when it was created: &lt;code&gt;api.customer.com&lt;/code&gt; resolved to &lt;code&gt;34.120.18.42&lt;/code&gt;, a legitimate GCP load balancer. But URLs are just DNS pointers, and DNS records change. Three weeks later, the same hostname resolves to &lt;code&gt;169.254.169.254&lt;/code&gt;, the cloud metadata endpoint on every major provider. The webhook gateway, running inside the VPS, dutifully POSTs the event payload to the internal metadata service.&lt;/p&gt;

&lt;p&gt;This is Server-Side Request Forgery through DNS rebinding. The gateway becomes an open proxy for any tenant who controls their DNS records. And the creation-time URL validation caught none of it, because the DNS resolution happened at delivery time, weeks after the destination was registered.&lt;/p&gt;

&lt;p&gt;The exposure isn't limited to metadata endpoints. Private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), loopback addresses, the link-local range (169.254.0.0/16), IPv6 unique local addresses (fc00::/7), and protocol handlers like &lt;code&gt;file://&lt;/code&gt; or &lt;code&gt;gopher://&lt;/code&gt; are all reachable from the delivery worker's network context. The attack surface is larger than it first appears.&lt;/p&gt;

&lt;h2&gt;
  
  
  What made it hard
&lt;/h2&gt;

&lt;p&gt;Three constraints shaped the fix.&lt;/p&gt;

&lt;p&gt;First, the system delivers webhooks continuously. The SSRF check runs on every delivery, not just at destination creation. Any latency added to the DNS resolution and IP validation eats into the 10,000ms delivery timeout budget. The check needed to complete in single-digit milliseconds under normal conditions, which ruled out external SSRF-detection services and pre-resolution caching (stale DNS entries defeat the purpose).&lt;/p&gt;

&lt;p&gt;Second, the existing delivery worker was the hottest code path in the system: load event, load destination, make HTTP call, record result, calculate backoff, schedule retry or promote to dead letter. Ten concurrent workers execute this pipeline on every job. Inserting validation in the middle of this sequence meant touching every execution path without breaking the retry logic, the exponential backoff calculations in &lt;code&gt;calculateBackoffMs()&lt;/code&gt;, or the dead-letter promotion threshold.&lt;/p&gt;

&lt;p&gt;Third, some destinations legitimately resolve to IPs that look suspicious to a naive blocklist. A customer running their webhook handler on a small hosting provider might have an IP range that borders private space. The validation needed to be strict about RFC 1918 ranges while returning clear, actionable error messages when it rejected a delivery, so tenants could debug the issue without opening a support ticket.&lt;/p&gt;

&lt;h2&gt;
  
  
  Six layers, added iteratively
&lt;/h2&gt;

&lt;p&gt;The initial deployment shipped with HMAC signature verification and rate limiting on the ingestion endpoint. The remaining layers arrived in a separate wave of commits, all prefixed &lt;code&gt;fix(&lt;/code&gt; rather than &lt;code&gt;feat(&lt;/code&gt;, because each one addressed a gap I discovered after the pipeline was already handling traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingestion input protection:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rate limiting caps ingestion at 1,000 requests per minute per source. Payload size is limited to 1MB. JSON parsing enforces a nesting depth limit of 20 levels (&lt;code&gt;MAX_JSON_DEPTH&lt;/code&gt; in &lt;code&gt;src/server.ts&lt;/code&gt;) to prevent stack overflow from recursive structures.&lt;/p&gt;

&lt;p&gt;HMAC signature verification uses &lt;code&gt;crypto.timingSafeEqual&lt;/code&gt; to prevent timing side-channels. The &lt;code&gt;verifySignature()&lt;/code&gt; function in &lt;code&gt;src/ingestion/signature.ts&lt;/code&gt; supports both &lt;code&gt;hmac-sha256&lt;/code&gt; and &lt;code&gt;hmac-sha1&lt;/code&gt;, strips algorithm prefixes (GitHub sends &lt;code&gt;sha256=&amp;lt;hex&amp;gt;&lt;/code&gt;, Stripe uses &lt;code&gt;t=&amp;lt;unix&amp;gt;,v1=&amp;lt;hex&amp;gt;&lt;/code&gt;), and applies a configurable timestamp tolerance. The &lt;code&gt;extractTimestampFromHeader()&lt;/code&gt; function parses Stripe's format specifically: &lt;code&gt;t=&amp;lt;unix_seconds&amp;gt;,&lt;/code&gt; at the beginning of the header value, converted to milliseconds. Any signature older than 5 minutes (the &lt;code&gt;SIGNATURE_TOLERANCE_MS&lt;/code&gt; default of 300,000ms) is rejected before the HMAC is even computed.&lt;/p&gt;
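
&lt;p&gt;The flow is easier to see stripped of the surrounding plumbing. A Python sketch of the same order of operations, stale check before any HMAC work, then a timing-safe comparison (illustrative, not the project's &lt;code&gt;verifySignature()&lt;/code&gt;):&lt;/p&gt;

```python
import hashlib
import hmac

TOLERANCE_MS = 300_000  # mirrors the 5-minute SIGNATURE_TOLERANCE_MS default

# Illustrative sketch: reject stale timestamps before computing anything,
# strip the algorithm prefix, then compare digests timing-safely.
def verify(secret, body, header_value, now_ms, timestamp_ms=None):
    if timestamp_ms is not None:
        # max() keeps this a pure staleness check: positive means too old
        if max(0, now_ms - timestamp_ms - TOLERANCE_MS):
            return False
    received = header_value.removeprefix("sha256=")
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(received, expected)
```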

&lt;p&gt;&lt;strong&gt;Delivery output protection:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Layer one is &lt;code&gt;validateDestinationUrl()&lt;/code&gt; in &lt;code&gt;src/lib/url-validator.ts&lt;/code&gt;, called at destination creation. It rejects non-HTTP protocols immediately. It blocks &lt;code&gt;localhost&lt;/code&gt; and checks raw IP addresses against private ranges via &lt;code&gt;isPrivateIp()&lt;/code&gt;. The IPv4 check walks the octets: &lt;code&gt;127.x&lt;/code&gt; is loopback, &lt;code&gt;10.x&lt;/code&gt; is private, &lt;code&gt;172.16-31.x&lt;/code&gt; is private, &lt;code&gt;192.168.x&lt;/code&gt; is private, &lt;code&gt;169.254.x&lt;/code&gt; is the link-local range that covers cloud metadata endpoints on AWS and GCP. IPv6 gets separate handling: &lt;code&gt;::1&lt;/code&gt; (loopback) and &lt;code&gt;fc00::/7&lt;/code&gt; (unique local, covering both &lt;code&gt;fc&lt;/code&gt; and &lt;code&gt;fd&lt;/code&gt; prefixes).&lt;/p&gt;
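
&lt;p&gt;The range checks translate directly. A Python rendering of the same logic (the project is TypeScript; this sketch just mirrors the ranges listed above):&lt;/p&gt;

```python
# Mirrors the isPrivateIp() checks described above, walking the octets.
def is_private_ipv4(addr):
    octets = [int(part) for part in addr.split(".")]
    first, second = octets[0], octets[1]
    if first == 127:                              # loopback
        return True
    if first == 10:                               # RFC 1918
        return True
    if first == 172 and second in range(16, 32):  # 172.16.0.0/12
        return True
    if first == 192 and second == 168:            # RFC 1918
        return True
    if first == 169 and second == 254:            # link-local / metadata
        return True
    return False

def is_private_ipv6(addr):
    lowered = addr.lower()
    # ::1 is loopback; fc00::/7 covers both fc and fd unique-local prefixes
    return lowered == "::1" or lowered.startswith(("fc", "fd"))
```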

&lt;p&gt;Layer two is &lt;code&gt;resolveAndValidateUrl()&lt;/code&gt;, called before every delivery. This is the DNS rebinding defense:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/lib/url-validator.ts&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;resolveAndValidateUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;validateDestinationUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hostname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;\[&lt;/span&gt;&lt;span class="sr"&gt;|&lt;/span&gt;&lt;span class="se"&gt;\]&lt;/span&gt;&lt;span class="sr"&gt;$/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;net&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isIP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;addresses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;dns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve4&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[]);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;addresses6&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;dns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve6&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[]);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;allAddresses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;addresses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;addresses6&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;allAddresses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;addr&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;allAddresses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;isPrivateIp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;UnsafeUrlError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;`DNS rebinding detected: '&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;' resolves to private IP '&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;'`&lt;/span&gt;
      &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every A and AAAA record is checked. If any resolved IP falls in a private range, the delivery is rejected with a specific error message naming the hostname and the offending address. A failed lookup is deliberately mapped to an empty list rather than an exception: if DNS resolution fails entirely, the request is allowed through, because the subsequent &lt;code&gt;fetch&lt;/code&gt; will fail with its own network error. The failure still surfaces; it just surfaces one step later, in the delivery result.&lt;/p&gt;

&lt;p&gt;I got this wrong the first time. The initial implementation only had layer one: creation-time validation. I assumed that if the URL was safe when the tenant registered it, it would stay safe. It took reading through SSRF post-mortems to realize that DNS-based attacks bypass creation-time checks entirely. The fix shipped as its own commit: &lt;code&gt;fix(lib): add SSRF protection, helmet headers, raw body HMAC, patch drizzle CVE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The delivery worker also caps redirect chains at 3 hops. The &lt;code&gt;deliverWebhook()&lt;/code&gt; function in &lt;code&gt;src/delivery/http-client.ts&lt;/code&gt; follows redirects manually instead of relying on &lt;code&gt;fetch&lt;/code&gt;'s default behavior, and truncates response bodies to 4,096 bytes to prevent a destination from returning a multi-megabyte response that fills the &lt;code&gt;deliveries&lt;/code&gt; table.&lt;/p&gt;
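
&lt;p&gt;The redirect handling can be sketched independently of any HTTP library. Here &lt;code&gt;fetch&lt;/code&gt; is a hypothetical stand-in that performs one request and returns a status, headers, and body without following redirects on its own; the caps mirror the ones described above:&lt;/p&gt;

```python
MAX_REDIRECTS = 3
MAX_BODY_BYTES = 4096

# Manual redirect following: at most 3 hops, then a hard error, and the
# stored response body is truncated to 4,096 bytes.
def deliver(fetch, url):
    for _ in range(MAX_REDIRECTS + 1):
        status, headers, body = fetch(url)
        if status in (301, 302, 307, 308) and "location" in headers:
            # the real worker re-validates the new URL before following it
            url = headers["location"]
            continue
        return status, body[:MAX_BODY_BYTES]  # truncate what gets stored
    raise RuntimeError("redirect chain exceeded 3 hops")

responses = {
    "https://a.example/hook": (302, {"location": "https://b.example/hook"}, b""),
    "https://b.example/hook": (200, {}, b"x" * 10_000),
}
status, body = deliver(lambda u: responses[u], "https://a.example/hook")
```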

&lt;p&gt;&lt;strong&gt;Data protection:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Header sanitization strips &lt;code&gt;Authorization&lt;/code&gt; and &lt;code&gt;Cookie&lt;/code&gt; headers from stored payloads at ingestion time. These headers might be present in the original webhook request (some providers include auth tokens), and forwarding them to a third-party destination would leak credentials. Signing secrets for HMAC verification are encrypted at rest with AES-256-GCM. The &lt;code&gt;encrypt()&lt;/code&gt; function in &lt;code&gt;src/lib/crypto.ts&lt;/code&gt; generates a random 12-byte IV per encryption and packs IV, auth tag, and ciphertext into a base64 string. The decryption key lives in an environment variable (&lt;code&gt;SIGNING_SECRET_KEY&lt;/code&gt;), separate from the database. A breach that exposes the &lt;code&gt;sources&lt;/code&gt; table gives the attacker ciphertext, not usable keys.&lt;/p&gt;
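
&lt;p&gt;The storage format is the part worth pinning down. A Python sketch of the pack/unpack layout only, with placeholder bytes standing in for what a real AES-256-GCM primitive (for example the &lt;code&gt;cryptography&lt;/code&gt; package) would produce:&lt;/p&gt;

```python
import base64
import os

IV_LEN = 12   # random 96-bit nonce, generated fresh per encryption
TAG_LEN = 16  # GCM authentication tag

# Sketch of the storage layout only: IV, auth tag, and ciphertext packed
# into a single base64 string, then sliced back apart on decryption.
def pack(iv, tag, ciphertext):
    return base64.b64encode(iv + tag + ciphertext).decode("ascii")

def unpack(blob):
    raw = base64.b64decode(blob)
    return raw[:IV_LEN], raw[IV_LEN:IV_LEN + TAG_LEN], raw[IV_LEN + TAG_LEN:]

iv = os.urandom(IV_LEN)
blob = pack(iv, b"T" * TAG_LEN, b"ciphertext-bytes")
```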

&lt;h2&gt;
  
  
  What changed
&lt;/h2&gt;

&lt;p&gt;The security hardening added roughly 400 lines of validation code to a system that was already functionally complete. The &lt;code&gt;url-validator.ts&lt;/code&gt; module alone is 195 lines of checks that don't change what the system does. They change what it refuses to do.&lt;/p&gt;

&lt;p&gt;Before: the delivery worker would POST to any URL a tenant registered, follow unlimited redirects, and forward all original headers. After: every delivery passes through protocol filtering, hostname blocking, static IP range checks, dynamic DNS resolution with IP validation, redirect limiting at 3 hops, header sanitization, and response body truncation at 4,096 bytes.&lt;/p&gt;

&lt;p&gt;The ingestion side now has signature verification with timing-safe comparison and 5-minute stale tolerance, payload caps at 1MB, and JSON depth limiting at 20 levels. Six overlapping layers, three on each side of the persist-before-process boundary.&lt;/p&gt;

&lt;p&gt;None of these layers trust the previous one. The delivery-time DNS check doesn't assume the creation-time URL check caught everything. The JSON depth limit doesn't assume the payload size limit prevented pathological inputs. Every layer operates under the assumption that the one before it was bypassed or insufficient. When you build infrastructure that accepts input from the internet and acts on it inside your network, the input validation is the obvious problem. The delivery side, where your system becomes a network actor on behalf of untrusted tenants, is where the real exposure hides.&lt;/p&gt;

</description>
      <category>security</category>
      <category>ssrf</category>
      <category>webhook</category>
    </item>
    <item>
      <title>Building a DAG Orchestrator That Rewrites Its Own Execution Plan</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Sat, 18 Apr 2026 11:16:18 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/building-a-dag-orchestrator-that-rewrites-its-own-execution-plan-14bh</link>
      <guid>https://dev.to/kingsleyonoh/building-a-dag-orchestrator-that-rewrites-its-own-execution-plan-14bh</guid>
      <description>&lt;p&gt;Step 7 is a condition: if the HTTP response came back 200, continue to step 8 and step 9. If not, skip them and jump to step 10. Five steps downstream of a single boolean, two possible paths, and a directed acyclic graph that only understands one thing: which step depends on which.&lt;/p&gt;

&lt;p&gt;DAGs don't branch. That was the problem I had to solve.&lt;/p&gt;

&lt;h2&gt;
  
  
  The linear model
&lt;/h2&gt;

&lt;p&gt;The linear step types worked without issues. HTTP calls, data transforms, delays, sub-workflows. The orchestrator in &lt;code&gt;orchestrator.py&lt;/code&gt; called &lt;code&gt;topological_sort()&lt;/code&gt; from &lt;code&gt;graph.py&lt;/code&gt;, got back a flat list of steps in dependency order, and executed them one by one. After each step completed, its output merged into a shared JSONB context accessible to every downstream step via Jinja2 expressions like &lt;code&gt;{{ steps.fetch_data.output.body }}&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The execution model was simple: walk the sorted list, execute each step, persist the state to PostgreSQL, move to the next. No branching. No decisions. A conveyor belt that worked perfectly for linear workflows.&lt;/p&gt;

&lt;p&gt;Then I needed conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the graph stays flat
&lt;/h2&gt;

&lt;p&gt;The workflow definition needed to stay simple: a flat list of steps with plain &lt;code&gt;depends_on&lt;/code&gt; references. Conditions should be a step type like any other. The branching logic should live at runtime, not in the graph structure.&lt;/p&gt;

&lt;p&gt;Most workflow engines encode branching directly into the graph. BPMN uses gateway nodes that split the flow into separate paths. Airflow uses &lt;code&gt;BranchPythonOperator&lt;/code&gt; to select downstream tasks. You define the branches as part of the DAG, each path continues independently, and eventually they converge at a join node.&lt;/p&gt;

&lt;p&gt;I sketched this out and hit two problems. A 12-step workflow with three conditions and two paths each produces a graph where users need to reason about up to eight distinct execution paths when designing the workflow. The &lt;code&gt;depends_on&lt;/code&gt; array in each step definition becomes a tangled web of conditional references. Step 10 depends on step 7, but only if step 7's condition was true. The dependency model needs to become richer than "this step waits for that step."&lt;/p&gt;

&lt;p&gt;The second problem is structural. The DAG parser in &lt;code&gt;parser.py&lt;/code&gt; already handles cycle detection and depth validation on the flat dependency graph. Introducing branch paths would require a second validation pass for branch connectivity, dead-end detection, and convergence verification. Every new validation rule is a new failure mode for users to debug.&lt;/p&gt;

&lt;h2&gt;
  
  
  The skip set
&lt;/h2&gt;

&lt;p&gt;The solution was to keep the topological sort and add a mutable &lt;code&gt;skipped_steps&lt;/code&gt; set that grows during execution. When the orchestrator reaches a condition step, it evaluates the Jinja2 expression and calls &lt;code&gt;_resolve_condition_branches()&lt;/code&gt;. That function reads the condition's &lt;code&gt;true_branch&lt;/code&gt; and &lt;code&gt;false_branch&lt;/code&gt; config (each a list of step IDs) and adds the non-taken branch's steps to the skip set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_resolve_condition_branches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;step_def&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;StepDefinition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;skipped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;skipped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step_def&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false_branch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;skipped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step_def&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true_branch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;skipped&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before executing each step in the sorted order, the orchestrator checks: is this step ID in &lt;code&gt;skipped_steps&lt;/code&gt;? If yes, transition its state to &lt;code&gt;skipped&lt;/code&gt; and move on. The &lt;code&gt;skipped&lt;/code&gt; state is terminal in the state machine (empty transition set, no further changes allowed). The topological order never changes. The graph structure never changes. Steps just get removed from the plan as it runs.&lt;/p&gt;

&lt;p&gt;A 12-step workflow with three conditions still has 12 entries in the topological sort. The orchestrator always walks the same list. It just skips some entries based on runtime evaluation. No path explosion. No conditional dependency syntax. No gateway nodes.&lt;/p&gt;
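
&lt;p&gt;The whole mechanism fits in a short loop. A Python sketch of the walk (simplified: no persistence, no retries, and the step shapes are hypothetical):&lt;/p&gt;

```python
# Sketch of the orchestrator walk: the sorted order never changes, and
# steps found in the skip set are marked terminal and passed over.
def run(sorted_steps, execute, resolve_branches):
    skipped_steps = set()
    states = {}
    for step in sorted_steps:
        if step["id"] in skipped_steps:
            states[step["id"]] = "skipped"  # terminal: never re-entered
            continue
        output = execute(step)
        states[step["id"]] = "completed"
        if step["type"] == "condition":
            skipped_steps |= resolve_branches(step, output)
    return states

def resolve_branches(step, output):
    # skip the branch that was NOT taken
    branch = "false_branch" if output["result"] else "true_branch"
    return set(step["config"].get(branch, []))

steps = [
    {"id": "check", "type": "condition",
     "config": {"true_branch": ["notify"], "false_branch": ["escalate"]}},
    {"id": "notify", "type": "http"},
    {"id": "escalate", "type": "http"},
]
states = run(steps, lambda step: {"result": True}, resolve_branches)
```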

&lt;h2&gt;
  
  
  Where the truthiness got messy
&lt;/h2&gt;

&lt;p&gt;The condition step evaluates a Jinja2 expression and gets back a string. Jinja2 doesn't return Python booleans from template rendering. Everything comes back as text. So the string &lt;code&gt;"False"&lt;/code&gt; needs to be treated as falsy, and &lt;code&gt;"1"&lt;/code&gt; as truthy.&lt;/p&gt;

&lt;p&gt;I defined a &lt;code&gt;_FALSY_VALUES&lt;/code&gt; frozenset in &lt;code&gt;condition.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_FALSY_VALUES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;False&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;None&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This handles the common cases: empty strings, Python-style false values, string representations of zero. But the orchestrator also needs to resolve which branch was taken, and it has its own truthiness check for the condition output. I ended up with the same logic in two places: the condition executor computes the boolean and writes it to the output, and the orchestrator reads that output to populate the skip set. Both places need to agree on what "false" means.&lt;/p&gt;

&lt;p&gt;The duplication is a code smell I haven't fixed. The orchestrator should trust the &lt;code&gt;result&lt;/code&gt; boolean from the condition's output (which the executor already computed) instead of re-evaluating truthiness on the raw expression result. It works today, but there's a subtle bug waiting if someone changes the falsy list in one place and not the other.&lt;/p&gt;
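
&lt;p&gt;The eventual fix is small: one shared helper that both call sites import, so the falsy list lives in exactly one place. A sketch of what I mean (the helper name is hypothetical):&lt;/p&gt;

```python
_FALSY_VALUES = frozenset({"", "false", "0", "none", "False", "None"})

# Hypothetical shared helper: the single place both the condition
# executor and the orchestrator would get their truthiness answer from.
def is_truthy(rendered):
    return rendered not in _FALSY_VALUES
```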

&lt;h2&gt;
  
  
  The state machine underneath
&lt;/h2&gt;

&lt;p&gt;Making skip sets work required a state machine with six distinct step states: &lt;code&gt;pending&lt;/code&gt;, &lt;code&gt;queued&lt;/code&gt;, &lt;code&gt;running&lt;/code&gt;, &lt;code&gt;completed&lt;/code&gt;, &lt;code&gt;failed&lt;/code&gt;, and &lt;code&gt;skipped&lt;/code&gt;. The transitions are defined as an adjacency dictionary in &lt;code&gt;state.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ALLOWED_STEP_TRANSITIONS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queued&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skipped&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queued&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;running&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;running&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queued&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skipped&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two details matter here. First, &lt;code&gt;skipped&lt;/code&gt; has an empty transition set. Once a step is skipped, it stays skipped. You can't un-skip a step during execution. The condition evaluated, the branch was chosen, the decision is final.&lt;/p&gt;

&lt;p&gt;Second, &lt;code&gt;failed&lt;/code&gt; can transition back to &lt;code&gt;queued&lt;/code&gt;. That's the retry mechanism. When a step fails and has retry attempts remaining (default 3, configurable up to 10 via &lt;code&gt;RetryConfig&lt;/code&gt;), the orchestrator transitions it back to &lt;code&gt;queued&lt;/code&gt; and re-executes. Each attempt updates &lt;code&gt;step_exec.attempt&lt;/code&gt; in PostgreSQL before the retry runs. If the process crashes between retries, the attempt count survives in the database.&lt;/p&gt;

&lt;p&gt;Every state transition writes to PostgreSQL before anything else happens. The orchestrator doesn't execute step N+1 until step N's final state is persisted. This is deliberately slow. An in-memory state machine would be faster, but if the worker dies between steps, every execution in flight would lose its state. Persist-before-execute means a crash leaves every execution in a known, recoverable position.&lt;/p&gt;
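&lt;p&gt;A minimal sketch of how the guard and the persist-first rule fit together. This is illustrative Python, not the orchestrator's real code; &lt;code&gt;persist_step_state&lt;/code&gt; is a stand-in for the PostgreSQL write.&lt;/p&gt;

```python
# Illustrative sketch, not the orchestrator's real code.
# persist_step_state stands in for the PostgreSQL write.

ALLOWED_STEP_TRANSITIONS = {
    "pending": {"queued", "skipped"},
    "queued": {"running"},
    "running": {"completed", "failed"},
    "failed": {"queued"},
    "completed": set(),
    "skipped": set(),
}

def transition(step_exec, new_state, persist_step_state):
    old_state = step_exec["state"]
    if new_state not in ALLOWED_STEP_TRANSITIONS[old_state]:
        raise ValueError(f"illegal transition: {old_state} to {new_state}")
    step_exec["state"] = new_state
    if old_state == "failed" and new_state == "queued":
        # the retry path: bump the attempt counter so it survives a crash
        step_exec["attempt"] += 1
    persist_step_state(step_exec)  # persisted before the next step runs
```

&lt;p&gt;The point is the ordering: the transition is validated and written down before the orchestrator touches the next step.&lt;/p&gt;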

&lt;h2&gt;
  
  
  Cycle detection on the reverse graph
&lt;/h2&gt;

&lt;p&gt;The skip set approach only works if the graph is actually a DAG. A cycle means no valid topological order exists, so execution could never be scheduled. The &lt;code&gt;detect_cycles()&lt;/code&gt; function in &lt;code&gt;graph.py&lt;/code&gt; uses DFS with three-color marking: white (unvisited), gray (currently being explored), black (fully explored). A back edge from any node to a gray node means a cycle.&lt;/p&gt;

&lt;p&gt;The non-obvious detail: the function builds a reverse graph where each entry maps a dependency to its dependents, not the other way around. If step C depends on step B, the reverse graph has &lt;code&gt;B: [C]&lt;/code&gt;. DFS then explores "who depends on me?" paths. Cycles in that direction are what would break execution order. When a cycle is found, &lt;code&gt;_reconstruct_cycle()&lt;/code&gt; walks backward through parent pointers to produce the full path: &lt;code&gt;["step_a", "step_b", "step_c", "step_a"]&lt;/code&gt;. The error message shows exactly which steps form the loop.&lt;/p&gt;
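&lt;p&gt;The shape of the algorithm, as a simplified sketch (the real &lt;code&gt;graph.py&lt;/code&gt; differs in detail): build the reverse graph, run three-color DFS over it, and walk parent pointers back when a gray node is hit.&lt;/p&gt;

```python
# Simplified sketch of three-color cycle detection on the reverse graph.
# steps maps a step name to its depends_on list; this is the shape of
# the algorithm, not the real graph.py implementation.

def detect_cycles(steps):
    # Reverse graph: dependency to dependents ("who depends on me?").
    reverse = {name: [] for name in steps}
    for name, deps in steps.items():
        for dep in deps:
            reverse[dep].append(name)

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {name: WHITE for name in steps}
    parent = {}

    def reconstruct(start, end):
        # Walk parent pointers back from end to start, then close the loop.
        path = [end]
        node = end
        while node != start:
            node = parent[node]
            path.append(node)
        path.reverse()
        return path + [path[0]]

    def dfs(node):
        color[node] = GRAY
        for dependent in reverse[node]:
            if color[dependent] == GRAY:  # back edge: cycle found
                return reconstruct(dependent, node)
            if color[dependent] == WHITE:
                parent[dependent] = node
                cycle = dfs(dependent)
                if cycle:
                    return cycle
        color[node] = BLACK
        return None

    for name in steps:
        if color[name] == WHITE:
            cycle = dfs(name)
            if cycle:
                return cycle
    return None
```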

&lt;p&gt;Maximum DAG depth is capped at 20 via a hardcoded &lt;code&gt;MAX_DEPTH&lt;/code&gt; constant. The depth calculation uses Kahn's algorithm with a twist: instead of just sorting, it tracks the longest path to each node. If step D depends on both B and C, and B is at depth 2 while C is at depth 4, then D gets depth 5. The maximum across all nodes determines the workflow's depth.&lt;/p&gt;
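&lt;p&gt;The longest-path twist looks roughly like this, assuming steps are a name-to-dependencies mapping (illustrative, not the real implementation):&lt;/p&gt;

```python
# Sketch of depth checking with Kahn's algorithm, tracking the longest
# path to each node instead of just the visit order. MAX_DEPTH mirrors
# the hardcoded cap described above; the function name is illustrative.

from collections import deque

MAX_DEPTH = 20

def workflow_depth(steps):
    # steps: step name mapped to its list of dependency names
    if not steps:
        return 0
    indegree = {name: len(deps) for name, deps in steps.items()}
    dependents = {name: [] for name in steps}
    for name, deps in steps.items():
        for dep in deps:
            dependents[dep].append(name)

    depth = {name: 1 for name in steps}
    queue = deque(name for name, d in indegree.items() if d == 0)
    while queue:
        node = queue.popleft()
        for nxt in dependents[node]:
            # longest path wins: a node sits 1 below its deepest dependency
            depth[nxt] = max(depth[nxt], depth[node] + 1)
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    return max(depth.values())
```

&lt;p&gt;With B at depth 2 and a chain putting C at depth 4, a step depending on both lands at depth 5, and the maximum across all nodes is what gets compared against the cap.&lt;/p&gt;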

&lt;h2&gt;
  
  
  What surprised me
&lt;/h2&gt;

&lt;p&gt;I expected the hard part to be the topological sort or the cycle detection. Both were textbook implementations of well-known algorithms. The hard part was getting the skip set to interact correctly with &lt;code&gt;depends_on&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Consider: step 9 depends on step 8, and step 8 is in the false branch of step 7's condition. If step 7 evaluates to true, step 8 gets skipped. But step 9 is still in the topological order. Its dependency was skipped, not completed. Should step 9 execute?&lt;/p&gt;

&lt;p&gt;The answer I landed on: if all dependencies of a step are either &lt;code&gt;completed&lt;/code&gt; or &lt;code&gt;skipped&lt;/code&gt;, the step can execute. A skipped dependency counts as resolved for scheduling purposes. This lets you build workflows where step 9 runs regardless of which branch step 7 took, as long as its other dependencies are met. The orchestrator checks this when deciding which steps to queue next.&lt;/p&gt;
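&lt;p&gt;The check itself is small. A sketch with hypothetical names, not the orchestrator's real code:&lt;/p&gt;

```python
# Sketch of the scheduling check: a step is ready when every dependency
# has reached a resolved state. "Resolved" includes skipped, so a branch
# that was skipped does not wall off its downstream steps.

RESOLVED_STATES = {"completed", "skipped"}

def ready_steps(steps, states, skip_set):
    # steps: name mapped to depends_on list; states: name mapped to state
    ready = []
    for name, deps in steps.items():
        if name in skip_set or states[name] != "pending":
            continue
        if all(states[dep] in RESOLVED_STATES for dep in deps):
            ready.append(name)
    return ready
```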

&lt;p&gt;I could have made skipped dependencies block execution. But that would mean a single condition deep in the DAG could freeze all downstream steps, even ones that don't care about the branch result. The current behavior is more useful: skip sets remove steps from the plan, but they don't create invisible walls in the dependency graph.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;The engine handles workflows up to 50 steps with a maximum DAG depth of 20 in the longest dependency chain. Condition branching adds zero overhead to the execution plan because the topological sort runs once, before any step executes. The skip set is a Python &lt;code&gt;set()&lt;/code&gt; with O(1) lookups. State persistence adds one database write per step transition, which at 15 connections in the pool (5 steady plus 10 overflow) handles the concurrency comfortably.&lt;/p&gt;

&lt;p&gt;544 tests cover the engine, including edge cases for nested conditions (a condition inside another condition's true branch), skipped steps with downstream dependents, retry of a failed step after a prior execution was replayed, and sub-workflows that inherit the parent's depth counter. The condition branching logic alone accounts for 12 dedicated test cases covering truthiness evaluation, branch resolution, and the interaction between skip sets and the dependency resolver.&lt;/p&gt;

</description>
      <category>dag</category>
      <category>python</category>
    </item>
    <item>
      <title>Why the PO Resolver Has Five Strategies, Not One Algorithm</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Sat, 18 Apr 2026 11:12:22 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/why-the-po-resolver-has-five-strategies-not-one-algorithm-24a1</link>
      <guid>https://dev.to/kingsleyonoh/why-the-po-resolver-has-five-strategies-not-one-algorithm-24a1</guid>
      <description>&lt;p&gt;&lt;code&gt;PO-2026-001&lt;/code&gt;. &lt;code&gt;PO 2026 001&lt;/code&gt;. &lt;code&gt;po2026001&lt;/code&gt;. &lt;code&gt;PO#2026-001&lt;/code&gt;. &lt;code&gt;2026-001&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Every one of those strings refers to the same purchase order. The first came from our internal PO system. The rest came from vendors. One vendor's accounting package strips dashes. Another inserts hash prefixes. A third just writes the sequence number and assumes the buyer will figure it out. A database equality query on &lt;code&gt;po_number&lt;/code&gt; finds zero matches on four of those five.&lt;/p&gt;

&lt;p&gt;That's the starting condition for the matching engine. The &lt;code&gt;po_reference&lt;/code&gt; field on the invoice exists, is populated, and does not equal any PO number in the database. The naive response is to call the invoice unmatched and let a human deal with it. The whole point of the system is to not do that.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Matcher's Real Job
&lt;/h2&gt;

&lt;p&gt;The pipeline I inherited from my own PRD had a single phase called "PO resolution." Given an invoice, find its PO. The implementation I started writing was also single-step: normalize both sides of the string, compare, return a match or null. Clean code, one function, one SQL query.&lt;/p&gt;

&lt;p&gt;It handled about 60% of real invoices.&lt;/p&gt;

&lt;p&gt;The remaining 40% broke in three different ways. Some vendors omitted the PO reference entirely, writing it into the line item description instead. Some vendors used their own internal sales order number, which had no textual overlap with our PO number at all. Some vendors sent the right reference but with enough format noise that neither exact nor normalized comparison found it.&lt;/p&gt;

&lt;p&gt;Three different failure modes. One matching algorithm. The ratio was not going to get better.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cascade, Not a Fallback
&lt;/h2&gt;

&lt;p&gt;I rewrote &lt;code&gt;PoResolver&lt;/code&gt; as an ordered chain of strategies, each with a lower confidence ceiling than the one before it. The signature looks the same from the outside: give me a &lt;code&gt;tenantId&lt;/code&gt;, a &lt;code&gt;poReference&lt;/code&gt;, a &lt;code&gt;vendorId&lt;/code&gt;, an &lt;code&gt;invoiceAmount&lt;/code&gt;, and an &lt;code&gt;invoiceDate&lt;/code&gt;, and I'll hand back a &lt;code&gt;PoResolution&lt;/code&gt; with a PO ID, a confidence score, and a method label. What changed is the inside.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Strategy 1: Exact match&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;poReference&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;exact&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tenantPos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;firstOrNull&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt;
        &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;PurchaseOrderTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;poNumber&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="n"&gt;poReference&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exact&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="nd"&gt;@newSuspendedTransaction&lt;/span&gt; &lt;span class="nc"&gt;PoResolution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;poId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exact&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;PurchaseOrderTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"exact"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Strategy 2: Normalized match&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;poReference&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;normalizedRef&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;poReference&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;normalized&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tenantPos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;firstOrNull&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt;
        &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;PurchaseOrderTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;poNumber&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="n"&gt;normalizedRef&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalized&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="nd"&gt;@newSuspendedTransaction&lt;/span&gt; &lt;span class="nc"&gt;PoResolution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;poId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;PurchaseOrderTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"normalized"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full resolver in &lt;code&gt;PoResolver.kt&lt;/code&gt; runs five of these in order. Exact at confidence 1.0. Normalized (strip whitespace, remove the &lt;code&gt;PO&lt;/code&gt; prefix, lowercase) at 0.95. Jaro-Winkler fuzzy match against every PO in the tenant above 0.70, with the Jaro-Winkler score itself becoming the confidence. Vendor plus amount within 5% tolerance at 0.65. Vendor plus date within 90 days at 0.50. Then no match at 0.0.&lt;/p&gt;
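&lt;p&gt;The control flow is easier to see stripped of the Exposed plumbing. A Python sketch of the cascade shape (the real resolver is Kotlin; only the first-hit-wins ordering matters here, and only the first two strategies are shown):&lt;/p&gt;

```python
import re

# Language-neutral sketch of the cascade; the real PoResolver is Kotlin.
# Each strategy returns (po_id, confidence, method) or None, and the
# first hit wins, so fuzzy never sees an invoice that exact solved.

def normalize(value):
    value = re.sub(r"^(po[-#\s]*)", "", value.strip(), flags=re.IGNORECASE)
    return re.sub(r"\s+", "", value).lower()

def make_strategies(tenant_pos):
    def exact(ref):
        for po in tenant_pos:
            if po["number"] == ref:
                return (po["id"], 1.0, "exact")
        return None

    def normalized(ref):
        for po in tenant_pos:
            if normalize(po["number"]) == normalize(ref):
                return (po["id"], 0.95, "normalized")
        return None

    return [exact, normalized]  # the real chain has five, then unmatched

def resolve_po(po_reference, strategies):
    for strategy in strategies:
        resolution = strategy(po_reference)
        if resolution is not None:
            return resolution
    return (None, 0.0, "unmatched")
```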

&lt;p&gt;Each strategy catches a failure the previous one could not.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the Confidence Decays on Purpose
&lt;/h2&gt;

&lt;p&gt;The obvious objection to a cascade is that you could just run the weakest algorithm first and skip the rest. Fuzzy match solves both the exact case and the noisy case, right?&lt;/p&gt;

&lt;p&gt;It doesn't. Not for the reason people expect.&lt;/p&gt;

&lt;p&gt;Jaro-Winkler's 0.70 threshold is tuned for normal PO numbers. Two different POs issued a week apart to the same vendor will often score 0.78 against each other. &lt;code&gt;PO-2026-001&lt;/code&gt; and &lt;code&gt;PO-2026-008&lt;/code&gt; share the prefix, share the date segment, and differ only in the sequence number. Run fuzzy match first and you will occasionally match an invoice to a PO that is not the correct one. The system will have no way to know it's wrong, because the score says "probable match."&lt;/p&gt;
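&lt;p&gt;You can check this with a textbook Jaro-Winkler implementation (a sketch, not the library the resolver actually calls). The long shared prefix dominates the score:&lt;/p&gt;

```python
# Textbook Jaro-Winkler, written out to show why near-identical PO
# numbers clear a 0.70 threshold. Not the resolver's actual library.

def jaro(s1, s2):
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        start = max(0, i - window)
        end = min(len2, i + window + 1)
        for j in range(start, end):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    transpositions = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                transpositions += 1
            k += 1
    transpositions //= 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3

def jaro_winkler(s1, s2, scaling=0.1):
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:  # prefix bonus caps at 4 chars
            break
        prefix += 1
    return j + prefix * scaling * (1 - j)
```

&lt;p&gt;On this implementation, &lt;code&gt;PO-2026-001&lt;/code&gt; against &lt;code&gt;PO-2026-008&lt;/code&gt; scores well above 0.9. Exact numbers vary with the scaling and prefix parameters, but any reasonable tuning clears the threshold, which is the problem.&lt;/p&gt;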

&lt;p&gt;Running exact match first means any invoice that quotes the correct PO number unambiguously gets it unambiguously, at confidence 1.0. The fuzzy matcher never runs on the 60% of invoices where exact works. It only runs on the 20% where no exact or normalized match exists, and among those, only the cases with a close-enough string similarity get scored. The cascade constrains the search space before the weaker algorithms touch it.&lt;/p&gt;

&lt;p&gt;Confidence 0.95 for normalized, 0.70-ish for fuzzy, 0.65 for vendor+amount, 0.50 for vendor+date. Those numbers are not arbitrary rankings. They are honest confidence statements about the evidence the match is built on. A 0.50 vendor+date match means: this invoice arrived from a vendor we have a PO with, issued within 90 days of the invoice date. That is weak evidence. It might be the right PO. It's enough to avoid the unmatched state, enough to surface for human review, and honest about how certain the system is.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Normalization Function That Does Three Things
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;normalize()&lt;/code&gt; is short enough to fit on a screen. It strips whitespace, regex-removes the &lt;code&gt;PO&lt;/code&gt; prefix variants, and lowercases. That's it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;PO_PREFIX_REGEX&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Regex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"""^(PO[-#\s]*)"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;RegexOption&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IGNORE_CASE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;internal&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;PO_PREFIX_REGEX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"\\s+"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toRegex&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lowercase&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three lines of transformation. It cost me two afternoons.&lt;/p&gt;

&lt;p&gt;What surprised me in the data was how many ways the same four characters can vary. &lt;code&gt;PO-&lt;/code&gt;, &lt;code&gt;PO&lt;/code&gt;, &lt;code&gt;PO#&lt;/code&gt;, &lt;code&gt;PO.&lt;/code&gt;, &lt;code&gt;po-&lt;/code&gt;, &lt;code&gt;P.O.-&lt;/code&gt;, &lt;code&gt;P.O.&lt;/code&gt;. The regex handles the dash, bare, hash, and lowercase forms. The dotted variants survive with a dot intact, so they fail normalized comparison too; we either accept them as match failures or catch them with fuzzy match at a lower confidence. I deliberately did not try to normalize every typographic variation. The regex would grow and the edge cases never end.&lt;/p&gt;

&lt;p&gt;The better question was: what fraction of real PO references in the database fail normalization? I wrote a one-off script against a synthetic test set of 2,000 generated invoice references and measured how many failed each strategy in turn. Exact caught 58%. Normalized added another 27%. Fuzzy caught another 8%. Vendor+amount caught 3%. Vendor+date caught 2%. Unmatched: 2%.&lt;/p&gt;

&lt;p&gt;The 58% exact match result was not a surprise. Most vendors copy-paste the PO number into their system once and then their system emits the same string forever. The 27% for normalized was the argument for writing the regex at all. The 3% vendor+amount rescue was the argument for keeping strategies 4 and 5 even though they look weak.&lt;/p&gt;

&lt;h2&gt;
  
  
  The One That Has No PO Reference at All
&lt;/h2&gt;

&lt;p&gt;Strategies 4 and 5 run with no &lt;code&gt;poReference&lt;/code&gt; string. The invoice arrived without a PO number written on it. Strategy 4 asks: are there any open POs for this vendor whose total matches the invoice total within 5%? Strategy 5 asks: are there any open POs for this vendor issued within 90 days of the invoice date?&lt;/p&gt;
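&lt;p&gt;In sketch form (field names and dict shapes are illustrative, not the real Kotlin schema):&lt;/p&gt;

```python
# Sketch of the reference-free fallbacks (strategies 4 and 5). The
# dict shapes and field names are invented for illustration.

from datetime import date

def vendor_amount_match(invoice, open_pos, tolerance=0.05):
    # Strategy 4: same vendor, PO total within 5% of the invoice total.
    for po in open_pos:
        if po["vendor_id"] != invoice["vendor_id"]:
            continue
        if abs(po["total"] - invoice["total"]) / po["total"] > tolerance:
            continue
        return (po["id"], 0.65, "vendor_amount")
    return None

def vendor_date_match(invoice, open_pos, window_days=90):
    # Strategy 5: same vendor, PO issued within 90 days of the invoice.
    for po in open_pos:
        if po["vendor_id"] != invoice["vendor_id"]:
            continue
        if abs((invoice["date"] - po["issued"]).days) > window_days:
            continue
        return (po["id"], 0.50, "vendor_date")
    return None
```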

&lt;p&gt;The confidence on these is deliberately low. 0.65 and 0.50. Both are below the auto-approve threshold (0.95) and below the standard-approval threshold (0.70). A match at this confidence routes to escalated review, every time, with no exception. Which is the right behavior.&lt;/p&gt;

&lt;p&gt;What these strategies buy you is the difference between "unmatched, no PO" and "possibly matched, here's a weak candidate." The human reviewer gets a starting point. Instead of opening the invoice in an empty state and searching manually, they open it with a suggested PO pre-attached and a 0.50 confidence label that tells them: we think this is the one, don't trust us.&lt;/p&gt;

&lt;p&gt;I was wrong about this initially. My first instinct was that low-confidence matches are worse than no match at all. A human opening an invoice with a suggested PO will often accept the suggestion without checking. That's a real risk. I solved it at the approval routing layer: any match below 0.70 confidence is force-routed to the escalated queue, and the approval UI shows the confidence score prominently with the breakdown. The reviewer sees &lt;code&gt;{po_match: 0.50, method: "vendor_date"}&lt;/code&gt; and knows to verify the PO number manually before clicking approve.&lt;/p&gt;

&lt;p&gt;The system does not make the low-confidence decision. It surfaces the low-confidence candidate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The One Tradeoff I Accepted
&lt;/h2&gt;

&lt;p&gt;The cascade loads every PO for the tenant into memory on every invoice. For the normalized and fuzzy strategies, there is no index I can use: normalization happens after the row is in memory, and Jaro-Winkler needs the full string of both candidates. The query is &lt;code&gt;SELECT * FROM purchase_orders WHERE tenant_id = ?&lt;/code&gt; and then the filtering is in Kotlin.&lt;/p&gt;

&lt;p&gt;At 10,000 active POs per tenant, this is fine. The loop is a few milliseconds. At 100,000 active POs, it would start being noticeable. At 1 million, the cascade would need a different shape: pre-compute a normalized PO number column, index it, and do the normalized lookup in SQL. The fuzzy match would need a trigram index (&lt;code&gt;pg_trgm&lt;/code&gt; is already enabled in migration V010) backing a &lt;code&gt;SIMILARITY&lt;/code&gt; query.&lt;/p&gt;
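&lt;p&gt;The shape that version would take, as a hypothetical sketch (nothing here is shipped code; column and index names are invented): a stored normalized column with a plain index for strategy 2, and a &lt;code&gt;pg_trgm&lt;/code&gt; trigram index backing the fuzzy lookup.&lt;/p&gt;

```python
# Hypothetical sketch of the indexed shape. Column, index, and
# parameter names are invented for illustration.

NORMALIZED_COLUMN_MIGRATION = """
    ALTER TABLE purchase_orders ADD COLUMN po_number_normalized TEXT;
    CREATE INDEX idx_po_normalized
        ON purchase_orders (tenant_id, po_number_normalized);
    CREATE INDEX idx_po_trgm
        ON purchase_orders USING gin (po_number_normalized gin_trgm_ops);
"""

FUZZY_LOOKUP_SQL = """
    SELECT id, similarity(po_number_normalized, :ref) AS score
    FROM purchase_orders
    WHERE tenant_id = :tenant_id
      AND po_number_normalized % :ref
    ORDER BY score DESC
    LIMIT 5
"""
```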

&lt;p&gt;I have not built that yet. The largest tenant on the system has about 3,000 open POs at any time, and the current approach takes under 10 milliseconds per invoice on that workload. Building the indexed version now would be optimization ahead of the constraint.&lt;/p&gt;

&lt;p&gt;The scale plan is in the code comments. When a tenant crosses the threshold, the resolver gets a second implementation path. The interface stays the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Match Record Actually Stores
&lt;/h2&gt;

&lt;p&gt;Every match result writes a JSONB breakdown. &lt;code&gt;{po_match: 0.95, method: "normalized"}&lt;/code&gt;. The &lt;code&gt;method&lt;/code&gt; string is the strategy that matched. That field is what makes the whole design defensible to operators.&lt;/p&gt;

&lt;p&gt;When a finance lead looks at a low-confidence invoice in the approval queue, they don't just see "0.73 confidence." They see the breakdown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"po_match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"line_match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.72&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"receipt_match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"price_match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.78&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"overall"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.73&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plus the PO resolution method label: &lt;code&gt;"normalized"&lt;/code&gt;. Now the reviewer knows the PO was found with reasonable confidence (0.95, normalized match), but the receipt verification dropped to 0.50 (no receipt found on that PO line yet), which pulled the overall score down. The remediation is obvious: wait for the goods receipt and rerun matching. Not reject the invoice.&lt;/p&gt;

&lt;p&gt;That information existed in the old binary matcher too. It just wasn't exposed. The system knew why a match was weak and threw the reason away. Writing it to JSONB was the cheap part. Deciding to write it at all was the design decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Reconsider
&lt;/h2&gt;

&lt;p&gt;The 0.50 confidence floor on vendor+date matching is doing work, but it's doing work that's not fully visible in the match record. A &lt;code&gt;"vendor_date"&lt;/code&gt; method label with 0.50 confidence tells an operator that the invoice had no PO reference and we fell back to the weakest strategy. It does not tell them whether the vendor has three other open POs in the same date range that were almost as good a match.&lt;/p&gt;

&lt;p&gt;If I redid this, I'd return the top three candidates from each strategy, not just the best one. The &lt;code&gt;alternatives&lt;/code&gt; field on &lt;code&gt;ConfidenceScore&lt;/code&gt; exists for exactly this and is currently always empty. The code comment in &lt;code&gt;ConfidenceScorer.kt&lt;/code&gt; references it as planned work. An invoice that fell to strategy 4 with three equally plausible POs should surface all three to the reviewer with their individual confidence scores. Right now the reviewer sees one candidate and either accepts or searches manually.&lt;/p&gt;
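&lt;p&gt;Populating &lt;code&gt;alternatives&lt;/code&gt; is mostly a ranking change. A sketch of the idea, with a stand-in scoring callable:&lt;/p&gt;

```python
# Sketch of the alternatives idea: score every candidate, return the
# best plus a ranked tail. The scoring callable is a stand-in; the
# result shape loosely mirrors ConfidenceScore with alternatives.

def resolve_with_alternatives(candidates, score, top_n=3):
    # candidates: iterable of PO ids; score: po_id mapped to confidence
    ranked = sorted(((score(po), po) for po in candidates), reverse=True)
    if not ranked:
        return None
    best_score, best_po = ranked[0]
    alternatives = [{"po_id": po, "confidence": s} for s, po in ranked[1:top_n]]
    return {"po_id": best_po, "confidence": best_score,
            "alternatives": alternatives}
```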

&lt;p&gt;That's the kind of detail that looks trivial and isn't. A single weak candidate is worse than a ranked list of weak candidates, because the single candidate suggests certainty the system doesn't have. Building the alternatives list is two afternoons of plumbing. Shipping without it is the tradeoff I made to hit the first release. It's the first thing I'd add in a second pass.&lt;/p&gt;

</description>
      <category>kotlin</category>
      <category>matchingalgorithms</category>
      <category>exposed</category>
    </item>
  </channel>
</rss>
