<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kingsley Onoh</title>
    <description>The latest articles on DEV Community by Kingsley Onoh (@kingsleyonoh).</description>
    <link>https://dev.to/kingsleyonoh</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F568563%2F01c09769-b072-47a3-9c7b-16fefc2c573e.png</url>
      <title>DEV Community: Kingsley Onoh</title>
      <link>https://dev.to/kingsleyonoh</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kingsleyonoh"/>
    <language>en</language>
    <item>
      <title>Stripe Said Past Due. The State Machine Said No. Both Were Right.</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Sat, 18 Apr 2026 11:25:43 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/stripe-said-past-due-the-state-machine-said-no-both-were-right-4ea</link>
      <guid>https://dev.to/kingsleyonoh/stripe-said-past-due-the-state-machine-said-no-both-were-right-4ea</guid>
      <description>&lt;p&gt;A paused subscription receives a &lt;code&gt;customer.subscription.updated&lt;/code&gt; webhook from Stripe. The payload says the new status is &lt;code&gt;past_due&lt;/code&gt;. The local state machine has eight states and fifteen valid transitions, defined as a single Elixir map literal. &lt;code&gt;paused&lt;/code&gt; can transition to &lt;code&gt;active&lt;/code&gt;. That is the only outbound edge. There is no &lt;code&gt;paused&lt;/code&gt; to &lt;code&gt;past_due&lt;/code&gt; entry. The state machine says this transition doesn't exist.&lt;/p&gt;

&lt;p&gt;What do you do with the event?&lt;/p&gt;

&lt;p&gt;Rejecting the event is the safe choice. Invalid transition, log a warning, drop the payload. Clean, principled, and wrong. Stripe doesn't send a status change in isolation. The same webhook payload contains updated period dates, metadata changes, and potentially a new plan assignment. Rejecting the event to protect the status field means losing everything else in the payload.&lt;/p&gt;

&lt;p&gt;Forcing the transition is the pragmatic choice. Override the state machine, accept whatever Stripe says, move on. Also wrong. The state machine exists because downstream business logic depends on it. The dunning engine creates retry sequences when a subscription enters &lt;code&gt;past_due&lt;/code&gt;. If a paused subscription suddenly appears as &lt;code&gt;past_due&lt;/code&gt; through a forced transition, the dunning engine starts chasing a payment that was intentionally paused. The customer gets escalating "your payment failed" notifications for a subscription they paused on purpose.&lt;/p&gt;

&lt;p&gt;But the invalid transition is a symptom, not the disease. Stripe and the local data model can disagree about the current state of a subscription, and neither is wrong. Stripe is the source of truth for payment processing. The local state machine is the source of truth for business logic execution. When they conflict, you need an architecture that handles the disagreement without losing data or breaking invariants.&lt;/p&gt;

&lt;h2&gt;Two Sources of Truth, Zero Arbitration&lt;/h2&gt;

&lt;p&gt;This isn't a sync problem. Sync implies one system is behind and needs to catch up. This is a disagreement: Stripe says the state is X, the local model says the transition to X is illegal, and both positions are defensible.&lt;/p&gt;

&lt;p&gt;Stripe can put a subscription into states the local model doesn't allow because Stripe's state machine is broader and serves a different purpose. Stripe tracks billing states across all possible payment scenarios, including edge cases around payment method updates, invoice finalization timing, and subscription schedule modifications. The local model tracks lifecycle states for business logic: when to start dunning, when to compute churn, when to notify the customer. These are overlapping but non-identical concerns.&lt;/p&gt;

&lt;p&gt;The moment I accepted that these two state machines would diverge, the design became clear. The system needed to handle three cases: agreement (both say the same thing, proceed normally), valid disagreement (the transition is in the map, accept it), and invalid disagreement (the transition is not in the map, accept the data but not the transition).&lt;/p&gt;
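&lt;p&gt;As a sketch (in Python for brevity; the real map is an Elixir map literal, and every entry here except the &lt;code&gt;paused&lt;/code&gt; to &lt;code&gt;active&lt;/code&gt; edge is an assumption), the three-case split looks like this:&lt;/p&gt;

```python
# Illustrative transition map: "paused" has a single outbound edge to
# "active" (from the article); the other entries are assumed for the sketch.
VALID_TRANSITIONS = {
    "paused": {"active"},
    "active": {"paused", "past_due", "canceled"},
    "past_due": {"active", "canceled"},
}

def classify(previous_status, new_status):
    """Return which of the three cases an incoming webhook status falls into."""
    if previous_status == new_status:
        return "agreement"            # proceed normally
    if new_status in VALID_TRANSITIONS.get(previous_status, set()):
        return "valid_disagreement"   # transition is in the map, accept it
    return "invalid_disagreement"     # accept the data, not the transition
```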

&lt;h2&gt;The Strip, Don't Reject Pattern&lt;/h2&gt;

&lt;p&gt;The core of this design lives in a single function: &lt;code&gt;maybe_strip_status/3&lt;/code&gt; in the &lt;code&gt;SubscriptionProcessor&lt;/code&gt; module.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="k"&gt;defp&lt;/span&gt; &lt;span class="n"&gt;maybe_strip_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stripe_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="no"&gt;nil&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_new_status&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;stripe_data&lt;/span&gt;
&lt;span class="k"&gt;defp&lt;/span&gt; &lt;span class="n"&gt;maybe_strip_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stripe_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;when&lt;/span&gt; &lt;span class="n"&gt;prev&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;new&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;stripe_data&lt;/span&gt;

&lt;span class="k"&gt;defp&lt;/span&gt; &lt;span class="n"&gt;maybe_strip_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stripe_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;previous_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="no"&gt;StateMachine&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;transition!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;previous_status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;new_status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
    &lt;span class="ss"&gt;:ok&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;
      &lt;span class="n"&gt;stripe_data&lt;/span&gt;

    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="ss"&gt;:error&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ss"&gt;:invalid_transition&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;
      &lt;span class="no"&gt;Logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;"SubscriptionProcessor: invalid transition &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;previous_status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; -&amp;gt; &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;new_status&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;, "&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt;
          &lt;span class="s2"&gt;"keeping current status"&lt;/span&gt;
      &lt;span class="p"&gt;)&lt;/span&gt;

      &lt;span class="no"&gt;Map&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stripe_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s2"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;previous_status&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;end&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When the state machine rejects a transition, the function does not reject the event. It replaces the incoming status with the previous status and returns the rest of the payload untouched. Period dates, trial dates, metadata, plan references: all of it passes through. Only the illegal status change gets stripped.&lt;/p&gt;

&lt;p&gt;The first two function heads handle the edge cases. When &lt;code&gt;previous_status&lt;/code&gt; is nil (brand new subscription, no prior state to compare), accept whatever Stripe sends. When the previous and new statuses are identical (Stripe sent an update that didn't change the status), pass through without consulting the transition map.&lt;/p&gt;

&lt;p&gt;This is a deliberate tradeoff. The local status can drift from Stripe's status. A subscription might be &lt;code&gt;past_due&lt;/code&gt; in Stripe's records and &lt;code&gt;paused&lt;/code&gt; in the local database. I accepted this inconsistency because the local model controls what actually happens: which dunning sequences fire, which notifications send, which metrics count the subscription as churned. If Stripe's status is wrong from the local model's perspective, the local model wins for execution and Stripe wins for billing. Nobody is the universal authority.&lt;/p&gt;

&lt;h2&gt;Idempotency Under Disagreement&lt;/h2&gt;

&lt;p&gt;The state disagreement compounds with another problem: event ordering. Stripe doesn't guarantee webhook delivery order. A &lt;code&gt;customer.subscription.updated&lt;/code&gt; event from 3:00 PM can arrive after one from 3:05 PM. If the 3:05 PM event was already processed and moved the subscription to &lt;code&gt;active&lt;/code&gt;, the 3:00 PM event now carries a stale status that the state machine might reject.&lt;/p&gt;

&lt;p&gt;The idempotency model handles this with a three-state check, not the typical two-state (seen/unseen) approach. Every event is stored with a composite key: &lt;code&gt;tenant_id:stripe_event_id&lt;/code&gt;. Before processing, the system checks:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;New:&lt;/strong&gt; No record exists. Process normally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duplicate:&lt;/strong&gt; A record exists with &lt;code&gt;processed_at&lt;/code&gt; set. Skip entirely, return success.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Processing:&lt;/strong&gt; A record exists without &lt;code&gt;processed_at&lt;/code&gt;. Another Oban worker is handling this event right now. Skip to avoid double-processing.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The third state is the one most implementations miss. Without it, two workers pulling the same event from the Oban queue would both attempt to process it, and both would try to create downstream records (dunning attempts, notifications). The database-level unique constraint on the idempotency key catches this at the persistence layer, but the three-state check catches it earlier and avoids the exception entirely.&lt;/p&gt;
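&lt;p&gt;A minimal Python sketch of the three-state check (illustrative only; the real store is a database table with a unique constraint on the composite key, and the function names here are assumed):&lt;/p&gt;

```python
# Three-state idempotency check over a composite key. A plain dict stands
# in for the database table, mapping key -> processed_at (None while a
# worker is still handling the event).
def check_event(store, tenant_id, stripe_event_id):
    key = f"{tenant_id}:{stripe_event_id}"
    if key not in store:
        store[key] = None      # claim the event: record exists, unprocessed
        return "new"
    if store[key] is None:
        return "processing"    # another worker holds it right now, skip
    return "duplicate"         # processed_at is set, skip and return success

def mark_processed(store, tenant_id, stripe_event_id, processed_at):
    store[f"{tenant_id}:{stripe_event_id}"] = processed_at
```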

&lt;h2&gt;The Dunning Guard Nobody Asks About&lt;/h2&gt;

&lt;p&gt;The paused-to-past-due edge case has a second layer of protection deeper in the processing pipeline. Even if the status stripping somehow failed and a paused subscription appeared as &lt;code&gt;past_due&lt;/code&gt;, the dunning trigger has its own guard:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight elixir"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;prev_status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="s2"&gt;"paused"&lt;/span&gt; &lt;span class="k"&gt;do&lt;/span&gt;
  &lt;span class="no"&gt;Logger&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s2"&gt;"SubscriptionProcessor: paused subscription transitioned to past_due, "&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&amp;gt;&lt;/span&gt;
      &lt;span class="s2"&gt;"skipping dunning for sub &lt;/span&gt;&lt;span class="si"&gt;#{&lt;/span&gt;&lt;span class="n"&gt;subscription&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;end&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is belt-and-suspenders engineering for a specific Stripe behavior I discovered during development. Stripe can emit a &lt;code&gt;past_due&lt;/code&gt; status for a paused subscription when a pending invoice from before the pause fails to collect. The subscription was paused intentionally, but Stripe's billing engine still tries to finalize the outstanding invoice. If it fails, Stripe marks the subscription &lt;code&gt;past_due&lt;/code&gt; even though the customer explicitly paused it.&lt;/p&gt;

&lt;p&gt;Without this guard, the dunning engine would start a 7-day escalation sequence (email at day 1, email at day 3, Telegram at day 5, both channels at day 7, then automatic cancellation) for a subscription the customer paused. The customer experience would be: "I paused my subscription. Three days later I got a Telegram message saying my payment failed and I need to act immediately." That is not a recoverable customer relationship.&lt;/p&gt;

&lt;h2&gt;What I Would Redesign&lt;/h2&gt;

&lt;p&gt;The status stripping approach works, but it has a gap: the system does not reconcile. If the local status diverges from Stripe's status, it stays diverged until the next valid transition arrives. A reconciliation job that periodically fetches subscription status from the Stripe API and compares it against local state would close this gap. I didn't build it because the current system is event-driven by design (no polling), and adding a reconciliation poller would create a second source of state mutations that the webhook pipeline doesn't expect.&lt;/p&gt;

&lt;p&gt;The churn calculator has a related bootstrap problem. It computes churn rate as &lt;code&gt;churned_count / active_count_at_period_start&lt;/code&gt;, where the denominator comes from the previous day's metrics snapshot. On the first day of operation, there is no previous snapshot. The first churn rate is always &lt;code&gt;0.0000&lt;/code&gt;, regardless of what actually happened. I accepted this because building temporal query infrastructure to count active subscriptions at an arbitrary past timestamp added complexity that the 99.9% case (day 2 and beyond) doesn't need. But the first day's metrics are wrong, and there is no warning in the dashboard about it.&lt;/p&gt;
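&lt;p&gt;The bootstrap gap reduces to something like this (Python sketch, names assumed):&lt;/p&gt;

```python
# Churn rate with the day-one bootstrap gap: when no previous snapshot
# exists, the denominator is missing and the rate silently reports 0.0.
def churn_rate(churned_count, active_count_at_period_start):
    if active_count_at_period_start in (None, 0):
        return 0.0  # first day of operation: wrong by design, and unflagged
    return round(churned_count / active_count_at_period_start, 4)
```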

&lt;h2&gt;The Lesson Is About Modeling, Not Stripe&lt;/h2&gt;

&lt;p&gt;Stripe is the specific case, but the pattern applies to any system that processes events from an external authority. Payment providers, shipping APIs, identity providers, government reporting systems: they all emit state changes according to their own transition rules. Your internal model is a subset of their model, tuned for your business logic, and the two will disagree.&lt;/p&gt;

&lt;p&gt;The instinct is to treat the external system as the single source of truth. Let Stripe win every argument. The problem is that "letting Stripe win" means your business logic executes against a state model you do not control and cannot predict. The alternative isn't to fight the external system. It is to separate the concerns: accept their data, enforce your transitions, log the disagreements, and build guards at every point where the disagreement could trigger unintended side effects.&lt;/p&gt;

&lt;p&gt;Build for disagreement, not consensus. The external world does not know your rules, and it does not care.&lt;/p&gt;

</description>
      <category>stripe</category>
      <category>elixir</category>
      <category>webhook</category>
    </item>
    <item>
      <title>Why Anomaly Detection Can't Block the Ingestion Pipeline</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Sat, 18 Apr 2026 11:23:23 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/why-anomaly-detection-cant-block-the-ingestion-pipeline-3n42</link>
      <guid>https://dev.to/kingsleyonoh/why-anomaly-detection-cant-block-the-ingestion-pipeline-3n42</guid>
      <description>&lt;p&gt;The first version of the anomaly evaluator ran inline. A batch of 100 readings flushed to TimescaleDB, then the same function called &lt;code&gt;evaluate_batch&lt;/code&gt;, which fetched alert rules, computed rolling statistics from the continuous aggregates, checked cooldown windows, persisted alerts, published NATS events, and fired HTTP calls to the Notification Hub and Workflow Engine. All of it synchronous. All of it in the same call stack as the NATS consumer loop.&lt;/p&gt;

&lt;p&gt;It worked for about thirty seconds. Then the Workflow Engine returned a 502, the reqwest client waited for its default timeout, the batch flush stalled, NATS messages piled up, and the consumer loop fell behind by 3,000 messages before I killed the process.&lt;/p&gt;

&lt;p&gt;The lesson was obvious in retrospect: a sensor data pipeline that sustains 5,000 readings per second cannot wait for a downstream HTTP response. The interesting question was what "don't wait" actually means at the architecture level. Making the HTTP call async solves the timeout. It doesn't solve the failure model. I needed every component downstream of the batch insert to be able to fail, silently, without the consumer loop noticing or caring.&lt;/p&gt;

&lt;h2&gt;The pipeline, end to end&lt;/h2&gt;

&lt;p&gt;A sensor message arrives on NATS as JSON. The consumer deserializes it, validates the value (must be finite, metric must be non-empty), resolves the tenant via the TenantResolver's DashMap cache, auto-registers the device if unknown, and pushes the reading into the BatchBuffer. When the buffer hits 100 readings or 500 milliseconds elapse, whichever comes first, it flushes.&lt;/p&gt;

&lt;p&gt;The flush itself is the only synchronous bottleneck I allow. The &lt;code&gt;execute_insert&lt;/code&gt; method builds UNNEST arrays for all seven columns and sends a single INSERT to TimescaleDB. If that fails, it retries once after 100 milliseconds. If the retry fails, it logs at &lt;code&gt;error&lt;/code&gt; level and drops the batch. There's no infinite retry loop. No dead-letter queue. The batch is gone.&lt;/p&gt;

&lt;p&gt;I made that choice deliberately. At 5,000 readings per second, a retry queue that grows faster than it drains is worse than data loss. Thirty seconds of buffered retries at full throughput means 150,000 queued readings. The server runs in 512MB. The math doesn't work. Drop the batch, log the failure, let the operator investigate. The continuous aggregates will smooth over a missing 100-reading gap, and the anomaly detector uses rolling statistics over minutes, not individual readings.&lt;/p&gt;
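&lt;p&gt;The retry policy is small enough to sketch in full (Python for illustration; the real code is Rust and the names here are assumed):&lt;/p&gt;

```python
import time

# Flush policy: one retry after 100 ms, then drop the batch and log.
# execute_insert is any callable returning True on success.
def flush_with_retry(execute_insert, batch, log_error, retry_delay_s=0.1):
    if execute_insert(batch):
        return True
    time.sleep(retry_delay_s)   # single fixed backoff, no retry loop
    if execute_insert(batch):
        return True
    log_error(f"dropping batch of {len(batch)} readings")
    return False                # no dead-letter queue: the batch is gone
```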

&lt;h2&gt;The buffer swap pattern&lt;/h2&gt;

&lt;p&gt;The BatchBuffer holds a &lt;code&gt;Vec&amp;lt;ValidatedReading&amp;gt;&lt;/code&gt; behind a &lt;code&gt;tokio::sync::Mutex&lt;/code&gt;. The critical design detail is in the &lt;code&gt;push&lt;/code&gt; method: when the buffer reaches capacity, the code calls &lt;code&gt;std::mem::replace&lt;/code&gt; to swap the full Vec with a fresh one, then drops the Mutex guard before executing the INSERT.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nn"&gt;std&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nn"&gt;mem&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="k"&gt;mut&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nn"&gt;Vec&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;with_capacity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="py"&gt;.batch_size&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="nf"&gt;drop&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buf&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="c1"&gt;// release lock before DB I/O&lt;/span&gt;
&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;flushed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt;&lt;span class="nf"&gt;.flush_to_db_with_retry&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="o"&gt;?&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This matters because the INSERT takes milliseconds. If the lock were held during the INSERT, every concurrent &lt;code&gt;push&lt;/code&gt; call would block. At 5,000 messages per second, even a 5ms INSERT means 25 messages queued behind the lock. The swap pattern reduces the critical section to a memory operation, not a network round-trip.&lt;/p&gt;

&lt;p&gt;I considered using a lock-free structure like a crossbeam channel or a ring buffer. The Mutex won the trade-off because the batch logic needs to check the current buffer length before deciding to flush, and that check-then-act pattern doesn't map cleanly to a channel. The swap keeps the lock hold time under a microsecond, which is well below the threshold where contention would show up at 5,000 messages per second.&lt;/p&gt;

&lt;h2&gt;Detection runs post-flush, not inline&lt;/h2&gt;

&lt;p&gt;After the BatchBuffer returns the flushed readings, the consumer calls &lt;code&gt;run_anomaly_evaluation&lt;/code&gt;. This function is the firewall between ingestion and detection. Here's what it looks like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;fn&lt;/span&gt; &lt;span class="nf"&gt;run_anomaly_evaluation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;PgPool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;nats_client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="nn"&gt;async_nats&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="n"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;ValidatedReading&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;notification_emitter&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;NotificationEmitter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;workflow_trigger&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;WorkflowTrigger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="nf"&gt;evaluate_batch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;nats_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;readings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;notification_emitter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;workflow_trigger&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;alert_ids&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;alert_ids&lt;/span&gt;&lt;span class="nf"&gt;.is_empty&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="nd"&gt;info!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;alert_ids&lt;/span&gt;&lt;span class="nf"&gt;.len&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;"Anomaly evaluation created alerts"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nd"&gt;warn!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Anomaly evaluation failed, continuing"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The return type is &lt;code&gt;()&lt;/code&gt;, not &lt;code&gt;Result&lt;/code&gt;. Errors are logged at &lt;code&gt;warn&lt;/code&gt; level and swallowed. The consumer loop continues processing the next NATS message regardless of whether detection succeeded. A database timeout in the rule query, a serialization failure in the NATS publish, a network error calling the Notification Hub: none of these stop ingestion.&lt;/p&gt;

&lt;h2&gt;Deduplication before evaluation&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;evaluate_batch&lt;/code&gt; function doesn't evaluate every reading in the batch. It extracts unique &lt;code&gt;(tenant_id, device_id, metric)&lt;/code&gt; tuples, keeping only the latest value for each. A batch of 100 readings from 5 devices reporting 3 metrics each produces 15 unique tuples, not 100. The deduplication uses a &lt;code&gt;HashSet&amp;lt;ReadingKey&amp;gt;&lt;/code&gt; and iterates the batch in reverse so the most recent reading for each tuple wins.&lt;/p&gt;
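&lt;p&gt;The dedup step can be sketched like this (Python for illustration; the real code is Rust with a &lt;code&gt;HashSet&lt;/code&gt;, and the field names here are assumed):&lt;/p&gt;

```python
# Keep only the latest reading per (tenant_id, device_id, metric) tuple by
# walking the batch in reverse: the first occurrence of a key is the newest.
def latest_per_key(batch):
    seen = set()
    latest = []
    for reading in reversed(batch):
        key = (reading["tenant_id"], reading["device_id"], reading["metric"])
        if key not in seen:
            seen.add(key)
            latest.append(reading)
    return latest
```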

&lt;p&gt;This was a performance decision. Fetching alert rules from PostgreSQL and computing rolling deviation statistics from the &lt;code&gt;readings_1m&lt;/code&gt; aggregate table costs a query per unique tuple. At 100 queries per batch, the evaluator would be the bottleneck. At 15 queries per batch, it completes in single-digit milliseconds.&lt;/p&gt;

&lt;h2&gt;Three condition types, one evaluation path&lt;/h2&gt;

&lt;p&gt;Each alert rule specifies a condition type: &lt;code&gt;above_threshold&lt;/code&gt;, &lt;code&gt;below_threshold&lt;/code&gt;, or &lt;code&gt;deviation_from_mean&lt;/code&gt;. The first two are trivial comparisons. The third runs a SQL query against the &lt;code&gt;readings_1m&lt;/code&gt; continuous aggregate, computing &lt;code&gt;avg(avg_value)&lt;/code&gt; and &lt;code&gt;stddev(avg_value)&lt;/code&gt; over the configured &lt;code&gt;window_minutes&lt;/code&gt;. If fewer than 5 buckets exist in the window, the check skips entirely.&lt;/p&gt;

&lt;p&gt;That minimum sample count of 5 was the result of a frustrating cold-start problem. When a new device starts reporting, the first few readings have no aggregate history. A threshold of 2 standard deviations from a mean computed from 2 data points fires on everything. The system was generating hundreds of false alerts during device onboarding. Setting the floor at 5 samples eliminated the noise without significantly delaying real anomaly detection. Five minutes of readings at one-per-second gives 5 one-minute aggregate buckets. That's the minimum window before deviation detection activates.&lt;/p&gt;
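&lt;p&gt;The deviation check with its sample floor reduces to the following (Python sketch; the real statistics come from a SQL query over &lt;code&gt;readings_1m&lt;/code&gt;, and the population-vs-sample stddev choice here is an assumption):&lt;/p&gt;

```python
import statistics

# deviation_from_mean with the cold-start floor: fewer than 5 aggregate
# buckets in the window means no verdict at all (None), not "no anomaly".
def deviation_check(bucket_means, value, num_stddevs=2.0):
    if 5 > len(bucket_means):
        return None  # cold start: skip the check entirely
    mean = statistics.mean(bucket_means)
    stddev = statistics.pstdev(bucket_means)  # population stddev (assumed)
    return abs(value - mean) > num_stddevs * stddev
```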

&lt;h2&gt;Cooldown prevents alert storms&lt;/h2&gt;

&lt;p&gt;Without cooldown, a sensor stuck at a high value generates an alert on every batch flush. At 500ms intervals, that's 120 alerts per minute for a single sensor. The &lt;code&gt;check_cooldown&lt;/code&gt; function queries the &lt;code&gt;alerts&lt;/code&gt; table for any alert with the same &lt;code&gt;rule_id&lt;/code&gt; and &lt;code&gt;device_id&lt;/code&gt; created within the last N minutes (default 15). If one exists, the alert is suppressed and logged at &lt;code&gt;info&lt;/code&gt; level.&lt;/p&gt;

&lt;p&gt;The cooldown query hits an index on &lt;code&gt;(rule_id, device_id, created_at)&lt;/code&gt;. I briefly considered an in-memory cooldown cache (DashMap with TTL, similar to the TenantResolver), but the database-backed approach won because cooldown state needs to survive process restarts. If the service crashes and restarts, a memory-based cooldown would fire duplicate alerts for every rule that was in cooldown before the crash.&lt;/p&gt;
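&lt;p&gt;The cooldown logic, sketched in Python (the real check is a single indexed SQL query against the &lt;code&gt;alerts&lt;/code&gt; table; epoch-second timestamps are used here for simplicity):&lt;/p&gt;

```python
# Suppress an alert if one with the same (rule_id, device_id) was created
# within the cooldown window (default 15 minutes).
def in_cooldown(alerts, rule_id, device_id, now_s, cooldown_minutes=15):
    cutoff = now_s - cooldown_minutes * 60
    return any(
        a["rule_id"] == rule_id
        and a["device_id"] == device_id
        and a["created_at"] >= cutoff
        for a in alerts
    )
```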

&lt;h2&gt;Fire-and-forget for ecosystem calls&lt;/h2&gt;

&lt;p&gt;The Notification Hub and Workflow Engine integrations use the same pattern: serialize the payload, spawn a Tokio task, return immediately. The spawned task makes the HTTP call and logs the result. The caller never sees the response.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight rust"&gt;&lt;code&gt;&lt;span class="nn"&gt;tokio&lt;/span&gt;&lt;span class="p"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;spawn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;move&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="nf"&gt;.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.header&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"X-API-Key"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.json&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="nf"&gt;.send&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;&lt;span class="k"&gt;.await&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;match&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;Ok&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;resp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="cm"&gt;/* log success or failure status */&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="nf"&gt;Err&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nd"&gt;warn!&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;error&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="o"&gt;%&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Failed to send event"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Workflow Engine call adds HMAC-SHA256 signing. The &lt;code&gt;compute_signature&lt;/code&gt; method takes the secret and raw body bytes, produces a &lt;code&gt;sha256=&amp;lt;hex&amp;gt;&lt;/code&gt; signature, and attaches it as &lt;code&gt;X-Hub-Signature-256&lt;/code&gt;. If no secret is configured, the header is omitted entirely.&lt;/p&gt;
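&lt;p&gt;The signing scheme itself is standard webhook HMAC. A Python sketch of the same idea (the service computes this in Rust; the helper names here are mine):&lt;/p&gt;

```python
import hashlib
import hmac

def compute_signature(secret: bytes, body: bytes) -> str:
    """HMAC-SHA256 over the raw body bytes, formatted as sha256=hex."""
    digest = hmac.new(secret, body, hashlib.sha256).hexdigest()
    return "sha256=" + digest

def verify_signature(secret: bytes, body: bytes, header: str) -> bool:
    # compare_digest is constant-time, avoiding a timing side-channel
    return hmac.compare_digest(compute_signature(secret, body), header)
```

&lt;p&gt;Signing the raw bytes rather than a re-serialized JSON object matters: the receiver must verify exactly the bytes that were sent, or key ordering and whitespace differences break the signature.&lt;/p&gt;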

&lt;p&gt;This is at-most-once delivery by design. If the Notification Hub is down, the alert event is lost. I chose this over at-least-once (which would require a persistent outbox or retry queue) because the alerts are persisted in the database regardless. The Notification Hub is a convenience layer for pushing alerts to email or Telegram. An operator who checks the API's &lt;code&gt;/api/alerts&lt;/code&gt; endpoint will always see the full alert history, even if the Hub missed a notification.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd redesign
&lt;/h2&gt;

&lt;p&gt;The anomaly evaluator currently runs on the same Tokio runtime as the NATS consumer and the Axum HTTP server. At 5,000 readings per second this is fine. At 50,000, the rule evaluation queries would compete with API queries for the 20-connection database pool. I'd split the evaluator into a separate process that consumes a NATS subject of "batches flushed" events, running its own connection pool. The ingestion binary would publish a summary event after each successful flush instead of calling the evaluator directly.&lt;/p&gt;

&lt;p&gt;The other weak point is the fire-and-forget pattern. At scale, lost notifications become a support problem. The fix is an outbox table: persist the notification payload alongside the alert, then have a background worker poll the outbox and deliver with retries. The outbox adds latency and complexity, but it makes notification delivery observable. Right now, the only way to know a notification was lost is to notice a gap in the Notification Hub's event log. That's not good enough for a production system where the person on call needs to trust that critical alerts reach them.&lt;/p&gt;
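&lt;p&gt;The outbox itself is a small amount of machinery. A minimal sketch, again with SQLite standing in and every name assumed:&lt;/p&gt;

```python
import sqlite3

# Illustrative outbox: the notification payload is persisted alongside
# the alert, then a worker drains pending rows with retries.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE outbox (id INTEGER PRIMARY KEY, payload TEXT, "
    "attempts INTEGER DEFAULT 0, delivered INTEGER DEFAULT 0)"
)

def enqueue(payload: str) -> None:
    # In the real design this runs in the same transaction as the alert insert
    conn.execute("INSERT INTO outbox (payload) VALUES (?)", (payload,))

def drain(send, max_attempts: int = 5) -> None:
    """Attempt delivery for every pending row; failures stay for retry."""
    rows = conn.execute(
        "SELECT id, payload, attempts FROM outbox WHERE delivered = 0"
    ).fetchall()
    for row_id, payload, attempts in rows:
        if attempts >= max_attempts:
            continue  # candidate for a dead-letter sweep
        try:
            send(payload)
            conn.execute("UPDATE outbox SET delivered = 1 WHERE id = ?", (row_id,))
        except Exception:
            conn.execute(
                "UPDATE outbox SET attempts = attempts + 1 WHERE id = ?", (row_id,)
            )
```

&lt;p&gt;The observability win is that delivery state becomes a query: anything with &lt;code&gt;delivered = 0&lt;/code&gt; and high &lt;code&gt;attempts&lt;/code&gt; is a notification that needs attention, instead of a silent gap in someone else's event log.&lt;/p&gt;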

</description>
      <category>rust</category>
      <category>async</category>
      <category>tokio</category>
    </item>
    <item>
      <title>I Spent a Week Securing Webhook Ingestion. The Real Attack Surface Was Delivery.</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Sat, 18 Apr 2026 11:20:14 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/i-spent-a-week-securing-webhook-ingestion-the-real-attack-surface-was-delivery-5a8p</link>
      <guid>https://dev.to/kingsleyonoh/i-spent-a-week-securing-webhook-ingestion-the-real-attack-surface-was-delivery-5a8p</guid>
      <description>&lt;p&gt;I ran the security review two weeks after the first deployment. The ingestion side looked solid: HMAC signature verification using &lt;code&gt;crypto.timingSafeEqual&lt;/code&gt;, rate limiting at 1,000 requests per minute, payload size capped at 1MB, idempotency deduplication on every incoming event. I was satisfied with the input boundary. Then I traced what happens after an event is accepted, through the delivery worker and out to destination URLs, and realized I'd spent a week protecting the wrong end of the system.&lt;/p&gt;

&lt;p&gt;The ingestion endpoint validates who is sending. But the delivery worker, the component that forwards payloads to downstream URLs registered by tenants, makes outbound HTTP requests from inside the server's network. That side had no protection at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two attack surfaces, nothing in common
&lt;/h2&gt;

&lt;p&gt;A webhook gateway has two distinct threat models, and they require completely different defenses.&lt;/p&gt;

&lt;p&gt;On the ingestion side, the threat is an external attacker sending malicious payloads: forged signatures, oversized bodies designed to exhaust memory, replay attacks using captured requests, and deeply nested JSON meant to overflow the parser stack. The defense strategy is conventional. Validate everything at the boundary before it touches the database.&lt;/p&gt;

&lt;p&gt;On the delivery side, the threat comes from inside. A tenant registers a destination URL. The URL passed validation when it was created: &lt;code&gt;api.customer.com&lt;/code&gt; resolved to &lt;code&gt;34.120.18.42&lt;/code&gt;, a legitimate GCP load balancer. But URLs are just DNS pointers, and DNS records change. Three weeks later, the same hostname resolves to &lt;code&gt;169.254.169.254&lt;/code&gt;, the cloud metadata endpoint on every major provider. The webhook gateway, running inside the VPS, dutifully POSTs the event payload to the internal metadata service.&lt;/p&gt;

&lt;p&gt;This is Server-Side Request Forgery through DNS rebinding. The gateway becomes an open proxy for any tenant who controls their DNS records. And the creation-time URL validation caught none of it, because the DNS resolution happened at delivery time, weeks after the destination was registered.&lt;/p&gt;

&lt;p&gt;The exposure isn't limited to metadata endpoints. Private IP ranges (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16), loopback addresses, the link-local range (169.254.0.0/16), IPv6 unique local addresses (fc00::/7), and protocol handlers like &lt;code&gt;file://&lt;/code&gt; or &lt;code&gt;gopher://&lt;/code&gt; are all reachable from the delivery worker's network context. The attack surface is larger than it first appears.&lt;/p&gt;
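&lt;p&gt;Most of those ranges are already encoded in standard libraries. A Python sketch of the same policy (the gateway is TypeScript; this only shows the check's shape):&lt;/p&gt;

```python
import ipaddress

def is_blocked_ip(raw: str) -> bool:
    """Reject private, loopback, and link-local addresses; fail closed."""
    try:
        ip = ipaddress.ip_address(raw)
    except ValueError:
        return True  # unparseable input is treated as unsafe
    # is_private covers RFC 1918, 169.254.0.0/16 (the metadata range),
    # loopback, and IPv6 fc00::/7; the extra flags make intent explicit.
    return ip.is_private or ip.is_loopback or ip.is_link_local
```

&lt;p&gt;Failing closed on unparseable input is deliberate: an address the validator cannot classify should never be treated as safe.&lt;/p&gt;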

&lt;h2&gt;
  
  
  What made it hard
&lt;/h2&gt;

&lt;p&gt;Three constraints shaped the fix.&lt;/p&gt;

&lt;p&gt;First, the system delivers webhooks continuously. The SSRF check runs on every delivery, not just at destination creation. Any latency added to the DNS resolution and IP validation eats into the 10,000ms delivery timeout budget. The check needed to complete in single-digit milliseconds under normal conditions, which ruled out external SSRF-detection services and pre-resolution caching (stale DNS entries defeat the purpose).&lt;/p&gt;

&lt;p&gt;Second, the existing delivery worker was the hottest code path in the system: load event, load destination, make HTTP call, record result, calculate backoff, schedule retry or promote to dead letter. Ten concurrent workers execute this pipeline on every job. Inserting validation in the middle of this sequence meant touching every execution path without breaking the retry logic, the exponential backoff calculations in &lt;code&gt;calculateBackoffMs()&lt;/code&gt;, or the dead-letter promotion threshold.&lt;/p&gt;
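&lt;p&gt;For context, the retry delay is the usual exponential-backoff-with-jitter shape. A hypothetical Python mirror of &lt;code&gt;calculateBackoffMs()&lt;/code&gt; (the base, cap, and jitter strategy here are assumptions, not the gateway's actual constants):&lt;/p&gt;

```python
import random

def backoff_ms(attempt: int, base_ms: int = 1000, cap_ms: int = 60_000) -> int:
    """Exponential backoff with full jitter: a random delay up to
    min(cap, base * 2^attempt). Jitter spreads retry storms apart."""
    exp = min(cap_ms, base_ms * (2 ** attempt))
    return random.randint(0, exp)
```

&lt;p&gt;The cap keeps the tenth retry from waiting seventeen minutes; the jitter keeps ten workers from retrying the same dead destination in lockstep.&lt;/p&gt;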

&lt;p&gt;Third, some destinations legitimately resolve to IPs that look suspicious to a naive blocklist. A customer running their webhook handler on a small hosting provider might have an IP range that borders private space. The validation needed to be strict about RFC 1918 ranges while returning clear, actionable error messages when it rejected a delivery, so tenants could debug the issue without opening a support ticket.&lt;/p&gt;

&lt;h2&gt;
  
  
  Six layers, added iteratively
&lt;/h2&gt;

&lt;p&gt;The initial deployment shipped with HMAC signature verification and rate limiting on the ingestion endpoint. The remaining layers arrived in a separate wave of commits, all prefixed &lt;code&gt;fix(&lt;/code&gt; rather than &lt;code&gt;feat(&lt;/code&gt;, because each one addressed a gap I discovered after the pipeline was already handling traffic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingestion input protection:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Rate limiting caps ingestion at 1,000 requests per minute per source. Payload size is limited to 1MB. JSON parsing enforces a nesting depth limit of 20 levels (&lt;code&gt;MAX_JSON_DEPTH&lt;/code&gt; in &lt;code&gt;src/server.ts&lt;/code&gt;) to prevent stack overflow from recursive structures. HMAC signature verification uses &lt;code&gt;crypto.timingSafeEqual&lt;/code&gt; to prevent timing side-channels. The &lt;code&gt;verifySignature()&lt;/code&gt; function in &lt;code&gt;src/ingestion/signature.ts&lt;/code&gt; supports both &lt;code&gt;hmac-sha256&lt;/code&gt; and &lt;code&gt;hmac-sha1&lt;/code&gt;, strips algorithm prefixes (GitHub sends &lt;code&gt;sha256=&amp;lt;hex&amp;gt;&lt;/code&gt;, Stripe uses &lt;code&gt;t=&amp;lt;unix&amp;gt;,v1=&amp;lt;hex&amp;gt;&lt;/code&gt;), and applies a configurable timestamp tolerance. The &lt;code&gt;extractTimestampFromHeader()&lt;/code&gt; function parses Stripe's format specifically: &lt;code&gt;t=&amp;lt;unix_seconds&amp;gt;,&lt;/code&gt; at the beginning of the header value, converted to milliseconds. Any signature older than 5 minutes (the &lt;code&gt;SIGNATURE_TOLERANCE_MS&lt;/code&gt; default of 300,000ms) is rejected before the HMAC is even computed.&lt;/p&gt;
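&lt;p&gt;The order of operations in that paragraph matters: timestamp check first, HMAC second, so stale requests never cost a hash computation. A Python sketch of the Stripe-style path (illustrative only; the real implementation lives in &lt;code&gt;src/ingestion/signature.ts&lt;/code&gt;):&lt;/p&gt;

```python
import hashlib
import hmac
import time

TOLERANCE_MS = 300_000  # mirrors the SIGNATURE_TOLERANCE_MS default

def verify_stripe_style(secret: bytes, body: bytes, header: str,
                        now_ms=None) -> bool:
    # Header shape: "t=unix_seconds,v1=hex_signature"
    parts = dict(p.split("=", 1) for p in header.split(","))
    ts_ms = int(parts["t"]) * 1000
    now_ms = now_ms if now_ms is not None else int(time.time() * 1000)
    if now_ms - ts_ms > TOLERANCE_MS:
        return False  # stale: rejected before any HMAC is computed
    signed = parts["t"].encode() + b"." + body  # Stripe signs "t.payload"
    expected = hmac.new(secret, signed, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, parts["v1"])
```

&lt;p&gt;Including the timestamp inside the signed payload is what makes the tolerance check meaningful: an attacker replaying a captured request can't just rewrite &lt;code&gt;t=&lt;/code&gt; to a fresh value.&lt;/p&gt;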

&lt;p&gt;&lt;strong&gt;Delivery output protection:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Layer one is &lt;code&gt;validateDestinationUrl()&lt;/code&gt; in &lt;code&gt;src/lib/url-validator.ts&lt;/code&gt;, called at destination creation. It rejects non-HTTP protocols immediately. It blocks &lt;code&gt;localhost&lt;/code&gt; and checks raw IP addresses against private ranges via &lt;code&gt;isPrivateIp()&lt;/code&gt;. The IPv4 check walks the octets: &lt;code&gt;127.x&lt;/code&gt; is loopback, &lt;code&gt;10.x&lt;/code&gt; is private, &lt;code&gt;172.16-31.x&lt;/code&gt; is private, &lt;code&gt;192.168.x&lt;/code&gt; is private, &lt;code&gt;169.254.x&lt;/code&gt; is the link-local range that covers cloud metadata endpoints on AWS and GCP. IPv6 gets separate handling: &lt;code&gt;::1&lt;/code&gt; (loopback) and &lt;code&gt;fc00::/7&lt;/code&gt; (unique local, covering both &lt;code&gt;fc&lt;/code&gt; and &lt;code&gt;fd&lt;/code&gt; prefixes).&lt;/p&gt;

&lt;p&gt;Layer two is &lt;code&gt;resolveAndValidateUrl()&lt;/code&gt;, called before every delivery. This is the DNS rebinding defense:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// src/lib/url-validator.ts&lt;/span&gt;
&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;resolveAndValidateUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nf"&gt;validateDestinationUrl&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;parsed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;URL&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;hostname&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;parsed&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/^&lt;/span&gt;&lt;span class="se"&gt;\[&lt;/span&gt;&lt;span class="sr"&gt;|&lt;/span&gt;&lt;span class="se"&gt;\]&lt;/span&gt;&lt;span class="sr"&gt;$/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;net&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;isIP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;addresses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;dns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve4&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[]);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;addresses6&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;dns&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;resolve6&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="k"&gt;catch&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[]);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;allAddresses&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[...&lt;/span&gt;&lt;span class="nx"&gt;addresses&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;addresses6&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;allAddresses&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;addr&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;allAddresses&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;isPrivateIp&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;UnsafeUrlError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="s2"&gt;`DNS rebinding detected: '&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;hostname&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;' resolves to private IP '&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;addr&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;'`&lt;/span&gt;
      &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;url&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every A and AAAA record is checked. If any resolved IP falls in a private range, the delivery is rejected with a specific error message naming the hostname and the offending address. If DNS resolution fails entirely, the lookup is treated as an empty address list and the request is allowed through; the subsequent &lt;code&gt;fetch&lt;/code&gt; fails with its own network error, so the failure still surfaces in the delivery record rather than being swallowed silently.&lt;/p&gt;

&lt;p&gt;I got this wrong the first time. The initial implementation only had layer one: creation-time validation. I assumed that if the URL was safe when the tenant registered it, it would stay safe. It took reading through SSRF post-mortems to realize that DNS-based attacks bypass creation-time checks entirely. The fix shipped as its own commit: &lt;code&gt;fix(lib): add SSRF protection, helmet headers, raw body HMAC, patch drizzle CVE&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The delivery worker also caps redirect chains at 3 hops. The &lt;code&gt;deliverWebhook()&lt;/code&gt; function in &lt;code&gt;src/delivery/http-client.ts&lt;/code&gt; follows redirects manually instead of relying on &lt;code&gt;fetch&lt;/code&gt;'s default behavior, and truncates response bodies to 4,096 bytes to prevent a destination from returning a multi-megabyte response that fills the &lt;code&gt;deliveries&lt;/code&gt; table.&lt;/p&gt;
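&lt;p&gt;Manual redirect handling is what makes per-hop validation possible: a safe first URL can redirect into private address space, so each hop has to be re-checked. A simplified Python sketch with an injectable fetch function (the names and signature are mine, not the gateway's):&lt;/p&gt;

```python
MAX_REDIRECTS = 3
MAX_BODY_BYTES = 4096

def deliver(url, fetch, validate=lambda u: None):
    """fetch(url) returns (status, location, body); validate raises on an
    unsafe URL. Every hop is validated, not just the first."""
    for _ in range(MAX_REDIRECTS + 1):
        validate(url)
        status, location, body = fetch(url)
        if status not in (301, 302, 303, 307, 308) or location is None:
            return status, body[:MAX_BODY_BYTES]  # truncate stored body
        url = location  # follow the redirect manually
    raise RuntimeError("redirect chain exceeded 3 hops")
```

&lt;p&gt;The default redirect behavior of most HTTP clients skips the re-validation step entirely, which is exactly why it can't be used here.&lt;/p&gt;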

&lt;p&gt;&lt;strong&gt;Data protection:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Header sanitization strips &lt;code&gt;Authorization&lt;/code&gt; and &lt;code&gt;Cookie&lt;/code&gt; headers from stored payloads at ingestion time. These headers might be present in the original webhook request (some providers include auth tokens), and forwarding them to a third-party destination would leak credentials. Signing secrets for HMAC verification are encrypted at rest with AES-256-GCM. The &lt;code&gt;encrypt()&lt;/code&gt; function in &lt;code&gt;src/lib/crypto.ts&lt;/code&gt; generates a random 12-byte IV per encryption and packs IV, auth tag, and ciphertext into a base64 string. The decryption key lives in an environment variable (&lt;code&gt;SIGNING_SECRET_KEY&lt;/code&gt;), separate from the database. A breach that exposes the &lt;code&gt;sources&lt;/code&gt; table gives the attacker ciphertext, not usable keys.&lt;/p&gt;
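&lt;p&gt;The pack-IV-with-ciphertext scheme is easy to get subtly wrong, so here is its shape in Python using the &lt;code&gt;cryptography&lt;/code&gt; package (the gateway uses Node's &lt;code&gt;crypto&lt;/code&gt; module; note this API appends the GCM auth tag to the ciphertext, so the tag is not packed as a separate field):&lt;/p&gt;

```python
import base64
import os

from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_secret(key: bytes, plaintext: bytes) -> str:
    iv = os.urandom(12)  # fresh random 12-byte IV per encryption
    ct = AESGCM(key).encrypt(iv, plaintext, None)  # ciphertext + appended tag
    return base64.b64encode(iv + ct).decode()

def decrypt_secret(key: bytes, packed: str) -> bytes:
    raw = base64.b64decode(packed)
    return AESGCM(key).decrypt(raw[:12], raw[12:], None)  # raises on tamper
```

&lt;p&gt;The property that matters is the same one described above: with the key in an environment variable, a dump of the database yields only base64 ciphertext.&lt;/p&gt;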

&lt;h2&gt;
  
  
  What changed
&lt;/h2&gt;

&lt;p&gt;The security hardening added roughly 400 lines of validation code to a system that was already functionally complete. The &lt;code&gt;url-validator.ts&lt;/code&gt; module alone is 195 lines of checks that don't change what the system does. They change what it refuses to do.&lt;/p&gt;

&lt;p&gt;Before: the delivery worker would POST to any URL a tenant registered, follow unlimited redirects, and forward all original headers. After: every delivery passes through protocol filtering, hostname blocking, static IP range checks, dynamic DNS resolution with IP validation, redirect limiting at 3 hops, header sanitization, and response body truncation at 4,096 bytes.&lt;/p&gt;

&lt;p&gt;The ingestion side now has signature verification with timing-safe comparison and 5-minute stale tolerance, payload caps at 1MB, and JSON depth limiting at 20 levels. Six overlapping layers, three on each side of the persist-before-process boundary.&lt;/p&gt;

&lt;p&gt;None of these layers trust the previous one. The delivery-time DNS check doesn't assume the creation-time URL check caught everything. The JSON depth limit doesn't assume the payload size limit prevented pathological inputs. Every layer operates under the assumption that the one before it was bypassed or insufficient. When you build infrastructure that accepts input from the internet and acts on it inside your network, the input validation is the obvious problem. The delivery side, where your system becomes a network actor on behalf of untrusted tenants, is where the real exposure hides.&lt;/p&gt;

</description>
      <category>security</category>
      <category>ssrf</category>
      <category>webhook</category>
    </item>
    <item>
      <title>Building a DAG Orchestrator That Rewrites Its Own Execution Plan</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Sat, 18 Apr 2026 11:16:18 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/building-a-dag-orchestrator-that-rewrites-its-own-execution-plan-14bh</link>
      <guid>https://dev.to/kingsleyonoh/building-a-dag-orchestrator-that-rewrites-its-own-execution-plan-14bh</guid>
      <description>&lt;p&gt;Step 7 is a condition: if the HTTP response came back 200, continue to step 8 and step 9. If not, skip them and jump to step 10. Five steps downstream of a single boolean, two possible paths, and a directed acyclic graph that only understands one thing: which step depends on which.&lt;/p&gt;

&lt;p&gt;DAGs don't branch. That was the problem I had to solve.&lt;/p&gt;

&lt;h2&gt;
  
  
  The linear model
&lt;/h2&gt;

&lt;p&gt;The linear step types worked without issues. HTTP calls, data transforms, delays, sub-workflows. The orchestrator in &lt;code&gt;orchestrator.py&lt;/code&gt; called &lt;code&gt;topological_sort()&lt;/code&gt; from &lt;code&gt;graph.py&lt;/code&gt;, got back a flat list of steps in dependency order, and executed them one by one. After each step completed, its output merged into a shared JSONB context accessible to every downstream step via Jinja2 expressions like &lt;code&gt;{{ steps.fetch_data.output.body }}&lt;/code&gt;.&lt;/p&gt;
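&lt;p&gt;For readers who haven't implemented one: the sort itself is small. A minimal Kahn's-algorithm sketch over flat &lt;code&gt;depends_on&lt;/code&gt; lists (helper names are mine, not the project's):&lt;/p&gt;

```python
from collections import deque

def topological_sort(steps: dict) -> list:
    """steps maps step_id to its list of depends_on step_ids."""
    indegree = {s: len(deps) for s, deps in steps.items()}
    dependents = {s: [] for s in steps}
    for s, deps in steps.items():
        for d in deps:
            dependents[d].append(s)
    queue = deque(s for s, n in indegree.items() if n == 0)
    order = []
    while queue:
        s = queue.popleft()
        order.append(s)
        for t in dependents[s]:
            indegree[t] -= 1
            if indegree[t] == 0:
                queue.append(t)
    if len(order) != len(steps):
        raise ValueError("cycle detected in depends_on graph")
    return order
```

&lt;p&gt;The cycle check falls out for free: if the output is shorter than the input, some steps never reached indegree zero.&lt;/p&gt;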

&lt;p&gt;The execution model was simple: walk the sorted list, execute each step, persist the state to PostgreSQL, move to the next. No branching. No decisions. A conveyor belt that worked perfectly for linear workflows.&lt;/p&gt;

&lt;p&gt;Then I needed conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why the graph stays flat
&lt;/h2&gt;

&lt;p&gt;The workflow definition needed to stay simple: a flat list of steps with plain &lt;code&gt;depends_on&lt;/code&gt; references. Conditions should be a step type like any other. The branching logic should live at runtime, not in the graph structure.&lt;/p&gt;

&lt;p&gt;Most workflow engines encode branching directly into the graph. BPMN uses gateway nodes that split the flow into separate paths. Airflow uses &lt;code&gt;BranchPythonOperator&lt;/code&gt; to select downstream tasks. You define the branches as part of the DAG, each path continues independently, and eventually they converge at a join node.&lt;/p&gt;

&lt;p&gt;I sketched this out and hit two problems. The first is path explosion: a 12-step workflow with three two-way conditions asks users to reason about up to eight distinct execution paths while designing it. The &lt;code&gt;depends_on&lt;/code&gt; array in each step definition becomes a tangle of conditional references. Step 10 depends on step 7, but only if step 7's condition was true. The dependency model has to grow richer than "this step waits for that step."&lt;/p&gt;

&lt;p&gt;The second problem is structural. The DAG parser in &lt;code&gt;parser.py&lt;/code&gt; already handles cycle detection and depth validation on the flat dependency graph. Introducing branch paths would require a second validation pass for branch connectivity, dead-end detection, and convergence verification. Every new validation rule is a new failure mode for users to debug.&lt;/p&gt;

&lt;h2&gt;
  
  
  The skip set
&lt;/h2&gt;

&lt;p&gt;The solution was to keep the topological sort and add a mutable &lt;code&gt;skipped_steps&lt;/code&gt; set that grows during execution. When the orchestrator reaches a condition step, it evaluates the Jinja2 expression and calls &lt;code&gt;_resolve_condition_branches()&lt;/code&gt;. That function reads the condition's &lt;code&gt;true_branch&lt;/code&gt; and &lt;code&gt;false_branch&lt;/code&gt; config (each a list of step IDs) and adds the non-taken branch's steps to the skip set.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_resolve_condition_branches&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;step_def&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;StepDefinition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
    &lt;span class="n"&gt;skipped&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;result&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;skipped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step_def&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false_branch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;skipped&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;step_def&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;true_branch&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;skipped&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Before executing each step in the sorted order, the orchestrator checks: is this step ID in &lt;code&gt;skipped_steps&lt;/code&gt;? If yes, transition its state to &lt;code&gt;skipped&lt;/code&gt; and move on. The &lt;code&gt;skipped&lt;/code&gt; state is terminal in the state machine (empty transition set, no further changes allowed). The topological order never changes. The graph structure never changes. Steps just get removed from the plan as it runs.&lt;/p&gt;

&lt;p&gt;A 12-step workflow with three conditions still has 12 entries in the topological sort. The orchestrator always walks the same list. It just skips some entries based on runtime evaluation. No path explosion. No conditional dependency syntax. No gateway nodes.&lt;/p&gt;
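&lt;p&gt;Condensed to its essentials, the walk looks like this (step shapes here are simplified stand-ins for the real &lt;code&gt;StepDefinition&lt;/code&gt; objects):&lt;/p&gt;

```python
def run(sorted_steps: list, execute) -> dict:
    """Walk a fixed topological order; grow the skip set at condition steps.

    execute(step) returns the step's output dict. Condition steps carry
    true_branch / false_branch lists of step IDs in their definition.
    """
    skipped, states = set(), {}
    for step in sorted_steps:
        if step["id"] in skipped:
            states[step["id"]] = "skipped"  # terminal: never executed
            continue
        output = execute(step)
        states[step["id"]] = "completed"
        if step.get("type") == "condition":
            # prune the branch that was NOT taken
            lost = "false_branch" if output.get("result") else "true_branch"
            skipped.update(step.get(lost, []))
    return states
```

&lt;p&gt;Every step ends in some state, including the ones that never ran, which keeps the execution record complete.&lt;/p&gt;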

&lt;h2&gt;
  
  
  Where the truthiness got messy
&lt;/h2&gt;

&lt;p&gt;The condition step evaluates a Jinja2 expression and gets back a string. Jinja2 doesn't return Python booleans from template rendering. Everything comes back as text. So the string &lt;code&gt;"False"&lt;/code&gt; needs to be treated as falsy, and &lt;code&gt;"1"&lt;/code&gt; as truthy.&lt;/p&gt;

&lt;p&gt;I defined a &lt;code&gt;_FALSY_VALUES&lt;/code&gt; frozenset in &lt;code&gt;condition.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;_FALSY_VALUES&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;frozenset&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;false&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;False&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;None&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This handles the common cases: empty strings, Python-style false values, string representations of zero. But the orchestrator also needs to resolve which branch was taken, and it has its own truthiness check for the condition output. I ended up with the same logic in two places: the condition executor computes the boolean and writes it to the output, and the orchestrator reads that output to populate the skip set. Both places need to agree on what "false" means.&lt;/p&gt;

&lt;p&gt;The duplication is a code smell I haven't fixed. The orchestrator should trust the &lt;code&gt;result&lt;/code&gt; boolean from the condition's output (which the executor already computed) instead of re-evaluating truthiness on the raw expression result. It works today, but there's a subtle bug waiting if someone changes the falsy list in one place and not the other.&lt;/p&gt;
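&lt;p&gt;The fix is mechanical when it happens: one shared helper that both call sites import, so the falsy list can't drift. A sketch of that unimplemented refactor:&lt;/p&gt;

```python
# Single source of truth for string truthiness, shared by the condition
# executor and the orchestrator (a sketch; this module doesn't exist yet).
_FALSY_VALUES: frozenset = frozenset({
    "", "false", "0", "none", "False", "None",
})

def is_truthy(rendered: str) -> bool:
    """Interpret a Jinja2-rendered string as a boolean."""
    return rendered not in _FALSY_VALUES
```

&lt;p&gt;Better still, the orchestrator could skip string interpretation entirely and read the &lt;code&gt;result&lt;/code&gt; boolean the executor already wrote to the output.&lt;/p&gt;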

&lt;h2&gt;
  
  
  The state machine underneath
&lt;/h2&gt;

&lt;p&gt;Making skip sets work required a state machine with six distinct step states: &lt;code&gt;pending&lt;/code&gt;, &lt;code&gt;queued&lt;/code&gt;, &lt;code&gt;running&lt;/code&gt;, &lt;code&gt;completed&lt;/code&gt;, &lt;code&gt;failed&lt;/code&gt;, and &lt;code&gt;skipped&lt;/code&gt;. The transitions are defined as an adjacency dictionary in &lt;code&gt;state.py&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;ALLOWED_STEP_TRANSITIONS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;set&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pending&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queued&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skipped&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queued&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;running&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;running&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queued&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skipped&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two details matter here. First, &lt;code&gt;skipped&lt;/code&gt; has an empty transition set. Once a step is skipped, it stays skipped. You can't un-skip a step during execution. The condition evaluated, the branch was chosen, the decision is final.&lt;/p&gt;

&lt;p&gt;Second, &lt;code&gt;failed&lt;/code&gt; can transition back to &lt;code&gt;queued&lt;/code&gt;. That's the retry mechanism. When a step fails and has retry attempts remaining (default 3, configurable up to 10 via &lt;code&gt;RetryConfig&lt;/code&gt;), the orchestrator transitions it back to &lt;code&gt;queued&lt;/code&gt; and re-executes. Each attempt updates &lt;code&gt;step_exec.attempt&lt;/code&gt; in PostgreSQL before the retry runs. If the process crashes between retries, the attempt count survives in the database.&lt;/p&gt;
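&lt;p&gt;The guard around that adjacency dict is a few lines. A sketch of the idea; &lt;code&gt;advance&lt;/code&gt; is a hypothetical helper name, not the engine's API:&lt;/p&gt;

```python
# The adjacency dict from state.py, with a transition guard around it.
ALLOWED_STEP_TRANSITIONS: dict[str, set[str]] = {
    "pending": {"queued", "skipped"},
    "queued": {"running"},
    "running": {"completed", "failed"},
    "failed": {"queued"},   # the retry path
    "completed": set(),     # terminal
    "skipped": set(),       # terminal: a skip is final
}

def advance(current: str, target: str) -> str:
    # In the real engine, persisting the new state happens here,
    # before execution continues.
    if target not in ALLOWED_STEP_TRANSITIONS[current]:
        raise ValueError(f"invalid transition: {current} -> {target}")
    return target
```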

&lt;p&gt;Every state transition writes to PostgreSQL before anything else happens. The orchestrator doesn't execute step N+1 until step N's final state is persisted. This is deliberately slow. An in-memory state machine would be faster, but if the worker dies between steps, every execution in flight would lose its state. Persist-before-execute means a crash leaves every execution in a known, recoverable position.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cycle detection on the reverse graph
&lt;/h2&gt;

&lt;p&gt;The skip set approach only works if the graph is actually a DAG. With a cycle, the topological sort can never finish: the steps on the cycle never reach in-degree zero, so they would sit unscheduled forever. The &lt;code&gt;detect_cycles()&lt;/code&gt; function in &lt;code&gt;graph.py&lt;/code&gt; uses DFS with three-color marking: white (unvisited), gray (currently being explored), black (fully explored). A back edge from any node to a gray node means a cycle.&lt;/p&gt;

&lt;p&gt;The non-obvious detail: the function builds a reverse graph where each entry maps a dependency to its dependents, not the other way around. If step C depends on step B, the reverse graph has &lt;code&gt;B: [C]&lt;/code&gt;. DFS then explores "who depends on me?" paths. Cycles in that direction are what would break execution order. When a cycle is found, &lt;code&gt;_reconstruct_cycle()&lt;/code&gt; walks backward through parent pointers to produce the full path: &lt;code&gt;["step_a", "step_b", "step_c", "step_a"]&lt;/code&gt;. The error message shows exactly which steps form the loop.&lt;/p&gt;
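&lt;p&gt;For readers who want the shape of it, here is a minimal sketch of the three-color DFS over the reverse graph, including the parent-pointer walk that reconstructs the cycle path. Names are illustrative; &lt;code&gt;depends_on&lt;/code&gt; maps each step to its list of dependencies:&lt;/p&gt;

```python
def detect_cycle(depends_on):
    # Build the reverse graph: dependency -> dependents ("who depends on me?").
    reverse = {step: [] for step in depends_on}
    for step, deps in depends_on.items():
        for dep in deps:
            reverse.setdefault(dep, []).append(step)

    WHITE, GRAY, BLACK = 0, 1, 2
    color = {node: WHITE for node in reverse}
    parent = {}

    def dfs(node):
        color[node] = GRAY
        for dependent in reverse[node]:
            if color[dependent] == GRAY:        # back edge to a gray node: cycle
                parent[dependent] = node
                return dependent
            if color[dependent] == WHITE:
                parent[dependent] = node
                found = dfs(dependent)
                if found is not None:
                    return found
        color[node] = BLACK
        return None

    for node in list(reverse):
        if color[node] == WHITE:
            start = dfs(node)
            if start is not None:
                # Walk parent pointers backward to produce the full path.
                path, cur = [start], parent[start]
                while cur != start:
                    path.append(cur)
                    cur = parent[cur]
                path.append(start)
                path.reverse()
                return path
    return None
```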

&lt;p&gt;Maximum DAG depth is capped at 20 via a hardcoded &lt;code&gt;MAX_DEPTH&lt;/code&gt; constant. The depth calculation uses Kahn's algorithm with a twist: instead of just sorting, it tracks the longest path to each node. If step D depends on both B and C, and B is at depth 2 while C is at depth 4, then D gets depth 5. The maximum across all nodes determines the workflow's depth.&lt;/p&gt;
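&lt;p&gt;The depth calculation can be sketched like this (illustrative names; the &lt;code&gt;MAX_DEPTH&lt;/code&gt; constant mirrors the hardcoded cap):&lt;/p&gt;

```python
# Kahn's algorithm tracking the longest path to each node instead of just
# emitting an order. depends_on maps each step to its dependencies.
from collections import deque

MAX_DEPTH = 20

def dag_depth(depends_on):
    in_degree = {step: len(deps) for step, deps in depends_on.items()}
    dependents = {step: [] for step in depends_on}
    for step, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(step)

    depth = {step: 1 for step in depends_on}  # roots sit at depth 1
    queue = deque(step for step, d in in_degree.items() if d == 0)
    while queue:
        step = queue.popleft()
        for nxt in dependents[step]:
            # Longest path wins: depth is 1 + the max over all parents.
            depth[nxt] = max(depth[nxt], depth[step] + 1)
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                queue.append(nxt)

    deepest = max(depth.values())
    if deepest > MAX_DEPTH:
        raise ValueError(f"workflow depth {deepest} exceeds cap {MAX_DEPTH}")
    return deepest
```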

&lt;h2&gt;
  
  
  What surprised me
&lt;/h2&gt;

&lt;p&gt;I expected the hard part to be the topological sort or the cycle detection. Both were textbook implementations of well-known algorithms. The hard part was getting the skip set to interact correctly with &lt;code&gt;depends_on&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Consider: step 9 depends on step 8, and step 8 is in the false branch of step 7's condition. If step 7 evaluates to true, step 8 gets skipped. But step 9 is still in the topological order. Its dependency was skipped, not completed. Should step 9 execute?&lt;/p&gt;

&lt;p&gt;The answer I landed on: if all dependencies of a step are either &lt;code&gt;completed&lt;/code&gt; or &lt;code&gt;skipped&lt;/code&gt;, the step can execute. A skipped dependency counts as resolved for scheduling purposes. This lets you build workflows where step 9 runs regardless of which branch step 7 took, as long as its other dependencies are met. The orchestrator checks this when deciding which steps to queue next.&lt;/p&gt;
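&lt;p&gt;That scheduling rule reduces to a small predicate. A sketch with illustrative names:&lt;/p&gt;

```python
# A pending step is ready when every dependency is either completed or
# skipped: a skipped dependency counts as resolved for scheduling.
RESOLVED = {"completed", "skipped"}

def ready_steps(depends_on, states, skip_set):
    ready = []
    for step, deps in depends_on.items():
        if step in skip_set or states.get(step) != "pending":
            continue
        if all(states.get(dep) in RESOLVED for dep in deps):
            ready.append(step)
    return ready
```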

&lt;p&gt;I could have made skipped dependencies block execution. But that would mean a single condition deep in the DAG could freeze all downstream steps, even ones that don't care about the branch result. The current behavior is more useful: skip sets remove steps from the plan, but they don't create invisible walls in the dependency graph.&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;The engine handles workflows up to 50 steps with a maximum DAG depth of 20 in the longest dependency chain. Condition branching adds zero overhead to the execution plan because the topological sort runs once, before any step executes. The skip set is a Python &lt;code&gt;set()&lt;/code&gt; with O(1) lookups. State persistence adds one database write per step transition; a connection pool of 15 (5 steady plus 10 overflow) absorbs that write load comfortably.&lt;/p&gt;

&lt;p&gt;544 tests cover the engine, including edge cases for nested conditions (a condition inside another condition's true branch), skipped steps with downstream dependents, retry of a failed step after a prior execution was replayed, and sub-workflows that inherit the parent's depth counter. The condition branching logic alone accounts for 12 dedicated test cases covering truthiness evaluation, branch resolution, and the interaction between skip sets and the dependency resolver.&lt;/p&gt;

</description>
      <category>dag</category>
      <category>python</category>
    </item>
    <item>
      <title>Why the PO Resolver Has Five Strategies, Not One Algorithm</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Sat, 18 Apr 2026 11:12:22 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/why-the-po-resolver-has-five-strategies-not-one-algorithm-24a1</link>
      <guid>https://dev.to/kingsleyonoh/why-the-po-resolver-has-five-strategies-not-one-algorithm-24a1</guid>
      <description>&lt;p&gt;&lt;code&gt;PO-2026-001&lt;/code&gt;. &lt;code&gt;PO 2026 001&lt;/code&gt;. &lt;code&gt;po2026001&lt;/code&gt;. &lt;code&gt;PO#2026-001&lt;/code&gt;. &lt;code&gt;2026-001&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Every one of those strings refers to the same purchase order. The first came from our internal PO system. The rest came from vendors. One vendor's accounting package strips dashes. Another inserts hash prefixes. A third just writes the sequence number and assumes the buyer will figure it out. A database equality query on &lt;code&gt;po_number&lt;/code&gt; finds zero matches on four of those five.&lt;/p&gt;

&lt;p&gt;That's the starting condition for the matching engine. The &lt;code&gt;po_reference&lt;/code&gt; field on the invoice exists, is populated, and does not equal any PO number in the database. The naive response is to call the invoice unmatched and let a human deal with it. The whole point of the system is to not do that.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Matcher's Real Job
&lt;/h2&gt;

&lt;p&gt;The pipeline I inherited from my own PRD had a single phase called "PO resolution." Given an invoice, find its PO. The implementation I started writing was also single-step: normalize both sides of the string, compare, return a match or null. Clean code, one function, one SQL query.&lt;/p&gt;

&lt;p&gt;It handled about 60% of real invoices.&lt;/p&gt;

&lt;p&gt;The remaining 40% broke in three different ways. Some vendors omitted the PO reference entirely, writing it into the line item description instead. Some vendors used their own internal sales order number, which had no textual overlap with our PO number at all. Some vendors sent the right reference but with enough format noise that neither exact nor normalized comparison found it.&lt;/p&gt;

&lt;p&gt;Three different failure modes. One matching algorithm. The ratio was not going to get better.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cascade, Not a Fallback
&lt;/h2&gt;

&lt;p&gt;I rewrote &lt;code&gt;PoResolver&lt;/code&gt; as an ordered chain of strategies, each one with a decreasing confidence ceiling. The signature looks the same from the outside: give me a &lt;code&gt;tenantId&lt;/code&gt;, a &lt;code&gt;poReference&lt;/code&gt;, a &lt;code&gt;vendorId&lt;/code&gt;, an &lt;code&gt;invoiceAmount&lt;/code&gt;, and an &lt;code&gt;invoiceDate&lt;/code&gt;, and I'll hand back a &lt;code&gt;PoResolution&lt;/code&gt; with a PO ID, a confidence score, and a method label. What changed is the inside.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Strategy 1: Exact match&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;poReference&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;exact&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tenantPos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;firstOrNull&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt;
        &lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;PurchaseOrderTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;poNumber&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="n"&gt;poReference&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;exact&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="nd"&gt;@newSuspendedTransaction&lt;/span&gt; &lt;span class="nc"&gt;PoResolution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;poId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;exact&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;PurchaseOrderTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"exact"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;// Strategy 2: Normalized match&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;poReference&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;normalizedRef&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;poReference&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;normalized&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tenantPos&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;firstOrNull&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;row&lt;/span&gt; &lt;span class="p"&gt;-&amp;gt;&lt;/span&gt;
        &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;row&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;PurchaseOrderTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;poNumber&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="p"&gt;==&lt;/span&gt; &lt;span class="n"&gt;normalizedRef&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;normalized&lt;/span&gt; &lt;span class="p"&gt;!=&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt;&lt;span class="nd"&gt;@newSuspendedTransaction&lt;/span&gt; &lt;span class="nc"&gt;PoResolution&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;poId&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="n"&gt;normalized&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;PurchaseOrderTable&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;method&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"normalized"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full resolver in &lt;code&gt;PoResolver.kt&lt;/code&gt; runs five of these in order. Exact at confidence 1.0. Normalized (strip whitespace, remove the &lt;code&gt;PO&lt;/code&gt; prefix, lowercase) at 0.95. Jaro-Winkler fuzzy match against every PO in the tenant above 0.70, with the Jaro-Winkler score itself becoming the confidence. Vendor plus amount within 5% tolerance at 0.65. Vendor plus date within 90 days at 0.50. Then no match at 0.0.&lt;/p&gt;

&lt;p&gt;Each strategy catches a failure the previous one could not.&lt;/p&gt;
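&lt;p&gt;Reduced to its shape, the cascade is an ordered list of (method, confidence ceiling, matcher) entries tried in sequence, first hit wins. A language-neutral sketch in Python, purely illustrative; the real resolver is &lt;code&gt;PoResolver.kt&lt;/code&gt;:&lt;/p&gt;

```python
def resolve(po_reference, strategies):
    # Each strategy either returns (po_id, raw_score) or None. The ceiling
    # caps the confidence the strategy is allowed to claim.
    for method, ceiling, matcher in strategies:
        hit = matcher(po_reference)
        if hit is not None:
            po_id, score = hit
            return {"po_id": po_id, "confidence": min(score, ceiling), "method": method}
    return {"po_id": None, "confidence": 0.0, "method": "unmatched"}
```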

&lt;h2&gt;
  
  
  Why the Confidence Decays on Purpose
&lt;/h2&gt;

&lt;p&gt;The obvious objection to a cascade is that you could just run the weakest algorithm first and skip the rest. Fuzzy match solves both the exact case and the noisy case, right?&lt;/p&gt;

&lt;p&gt;It doesn't. Not for the reason people expect.&lt;/p&gt;

&lt;p&gt;Jaro-Winkler's 0.70 threshold is tuned for normal PO numbers. Two different POs issued a week apart to the same vendor will often score 0.78 against each other. &lt;code&gt;PO-2026-001&lt;/code&gt; and &lt;code&gt;PO-2026-008&lt;/code&gt; share the prefix, share the date segment, and differ only in the sequence number. Run fuzzy match first and you will occasionally match an invoice to a PO that is not the correct one. The system will have no way to know it's wrong, because the score says "probable match."&lt;/p&gt;
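&lt;p&gt;The near-collision is easy to demonstrate with any string-similarity measure. Here I use &lt;code&gt;difflib&lt;/code&gt; as a stand-in (the resolver uses Jaro-Winkler, which rewards shared prefixes even more strongly, so its score on sibling POs is higher still):&lt;/p&gt;

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Ratio of matched characters to total length; 1.0 means identical.
    return SequenceMatcher(None, a, b).ratio()

# Two different POs from the same vendor, a week apart, vs. an unrelated
# reference from another system.
neighbor = similarity("PO-2026-001", "PO-2026-008")
unrelated = similarity("PO-2026-001", "SO-99-314")
```

The sibling-PO pair scores well above a 0.70 threshold, which is exactly why fuzzy matching runs last, on the residue the stronger strategies could not claim.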

&lt;p&gt;Running exact match first means any invoice that quotes the correct PO number unambiguously gets it unambiguously, at confidence 1.0. The fuzzy matcher never runs on the 60% of invoices where exact works. It only runs on the 20% where no exact or normalized match exists, and among those, only the cases with a close-enough string similarity get scored. The cascade constrains the search space before the weaker algorithms touch it.&lt;/p&gt;

&lt;p&gt;Confidence 0.95 for normalized, 0.70-ish for fuzzy, 0.65 for vendor+amount, 0.50 for vendor+date. Those numbers are not arbitrary rankings. They are honest confidence statements about the evidence the match is built on. A 0.50 vendor+date match means: this invoice arrived from a vendor we have a PO with, issued within 90 days of the invoice date. That is weak evidence. It might be the right PO. It's enough to avoid the unmatched state, enough to surface for human review, and honest about how certain the system is.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Normalization Function That Does Three Things
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;normalize()&lt;/code&gt; is short enough to fit on a screen. It strips whitespace, regex-removes the &lt;code&gt;PO&lt;/code&gt; prefix variants, and lowercases. That's it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight kotlin"&gt;&lt;code&gt;&lt;span class="k"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;val&lt;/span&gt; &lt;span class="py"&gt;PO_PREFIX_REGEX&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Regex&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="s"&gt;"""^(PO[-#\s]*)"""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;RegexOption&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;IGNORE_CASE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;internal&lt;/span&gt; &lt;span class="k"&gt;fun&lt;/span&gt; &lt;span class="nf"&gt;normalize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt; &lt;span class="nc"&gt;String&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trim&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;PO_PREFIX_REGEX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"\\s+"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;toRegex&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="s"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lowercase&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three lines of transformation. It cost me two afternoons.&lt;/p&gt;

&lt;p&gt;What surprised me in the data was how many ways the same four characters can vary. &lt;code&gt;PO-&lt;/code&gt;, &lt;code&gt;PO&lt;/code&gt;, &lt;code&gt;PO#&lt;/code&gt;, &lt;code&gt;PO.&lt;/code&gt;, &lt;code&gt;po-&lt;/code&gt;, &lt;code&gt;P.O.-&lt;/code&gt;, &lt;code&gt;P.O.&lt;/code&gt;. The regex handles the dash, hash, space, and bare forms, upper- or lowercase. The dotted variants it does not: those we either accept as exact-match failures (they won't pass normalization either) or catch with fuzzy match at a lower confidence. I deliberately did not try to normalize every typographic variation. The regex would grow and the edge cases never end.&lt;/p&gt;
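&lt;p&gt;For illustration, here is the same normalization restated in Python, mirroring the Kotlin regex &lt;code&gt;^(PO[-#\s]*)&lt;/code&gt; with &lt;code&gt;IGNORE_CASE&lt;/code&gt;, which makes it easy to check which variants survive:&lt;/p&gt;

```python
import re

# Python restatement of normalize(): trim, strip the PO prefix once,
# remove remaining whitespace, lowercase.
PO_PREFIX = re.compile(r"^(PO[-#\s]*)", re.IGNORECASE)

def normalize(value: str) -> str:
    value = value.strip()
    value = PO_PREFIX.sub("", value, count=1)
    value = re.sub(r"\s+", "", value)
    return value.lower()
```

&lt;code&gt;PO#2026-001&lt;/code&gt; normalizes to &lt;code&gt;2026-001&lt;/code&gt;; a dotted prefix like &lt;code&gt;P.O.-&lt;/code&gt; passes through untouched and falls to the fuzzy strategy.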

&lt;p&gt;The better question was: what fraction of real PO references in the database fail normalization? I wrote a one-off script against a synthetic test set of 2,000 generated invoice references and measured how many failed each strategy in turn. Exact caught 58%. Normalized added another 27%. Fuzzy caught another 8%. Vendor+amount caught 3%. Vendor+date caught 2%. Unmatched: 2%.&lt;/p&gt;

&lt;p&gt;The 58% exact match result was not a surprise. Most vendors copy-paste the PO number into their system once and then their system emits the same string forever. The 27% for normalized was the argument for writing the regex at all. The 3% vendor+amount rescue was the argument for keeping strategies 4 and 5 even though they look weak.&lt;/p&gt;

&lt;h2&gt;
  
  
  The One That Has No PO Reference at All
&lt;/h2&gt;

&lt;p&gt;Strategies 4 and 5 run with no &lt;code&gt;poReference&lt;/code&gt; string. The invoice arrived without a PO number written on it. Strategy 4 asks: are there any open POs for this vendor whose total matches the invoice total within 5%? Strategy 5 asks: are there any open POs for this vendor issued within 90 days of the invoice date?&lt;/p&gt;
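&lt;p&gt;Strategy 4's test is small enough to sketch. Field names here are illustrative, not the actual schema:&lt;/p&gt;

```python
# Open POs for the vendor whose total is within 5% of the invoice amount.
def amount_candidates(open_pos, vendor_id, invoice_amount, tolerance=0.05):
    matches = []
    for po in open_pos:
        if po["vendor_id"] != vendor_id:
            continue
        if abs(po["total"] - invoice_amount) <= tolerance * po["total"]:
            matches.append(po["id"])
    return matches
```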

&lt;p&gt;The confidence on these is deliberately low. 0.65 and 0.50. Both are below the auto-approve threshold (0.95) and below the standard-approval threshold (0.70). A match at this confidence routes to escalated review, every time, with no exception. Which is the right behavior.&lt;/p&gt;

&lt;p&gt;What these strategies buy you is the difference between "unmatched, no PO" and "possibly matched, here's a weak candidate." The human reviewer gets a starting point. Instead of opening the invoice in an empty state and searching manually, they open it with a suggested PO pre-attached and a 0.50 confidence label that tells them: we think this is the one, don't trust us.&lt;/p&gt;

&lt;p&gt;I was wrong about this initially. My first instinct was that low-confidence matches are worse than no match at all. A human opening an invoice with a suggested PO will often accept the suggestion without checking. That's a real risk. I solved it at the approval routing layer: any match below 0.70 confidence is force-routed to the escalated queue, and the approval UI shows the confidence score prominently with the breakdown. The reviewer sees &lt;code&gt;{po_match: 0.50, method: "vendor_date"}&lt;/code&gt; and knows to verify the PO number manually before clicking approve.&lt;/p&gt;

&lt;p&gt;The system does not make the low-confidence decision. It surfaces the low-confidence candidate.&lt;/p&gt;

&lt;h2&gt;
  
  
  The One Tradeoff I Accepted
&lt;/h2&gt;

&lt;p&gt;The cascade loads every PO for the tenant into memory on every invoice. For the normalized and fuzzy strategies, there is no index I can use: normalization happens after the row is in memory, and Jaro-Winkler needs the full string of both candidates. The query is &lt;code&gt;SELECT * FROM purchase_orders WHERE tenant_id = ?&lt;/code&gt; and then the filtering is in Kotlin.&lt;/p&gt;

&lt;p&gt;At 10,000 active POs per tenant, this is fine. The loop is a few milliseconds. At 100,000 active POs, it would start being noticeable. At 1 million, the cascade would need a different shape: pre-compute a normalized PO number column, index it, and do the normalized lookup in SQL. The fuzzy match would need a trigram index (&lt;code&gt;pg_trgm&lt;/code&gt; is already enabled in migration V010) backing a &lt;code&gt;SIMILARITY&lt;/code&gt; query.&lt;/p&gt;

&lt;p&gt;I have not built that yet. The largest tenant on the system has about 3,000 open POs at any time, and the current approach takes under 10 milliseconds per invoice on that workload. Building the indexed version now would be optimization ahead of the constraint.&lt;/p&gt;

&lt;p&gt;The scale plan is in the code comments. When a tenant crosses the threshold, the resolver gets a second implementation path. The interface stays the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Match Record Actually Stores
&lt;/h2&gt;

&lt;p&gt;Every match result writes a JSONB breakdown. &lt;code&gt;{po_match: 0.95, method: "normalized"}&lt;/code&gt;. The &lt;code&gt;method&lt;/code&gt; string is the strategy that matched. That field is what makes the whole design defensible to operators.&lt;/p&gt;

&lt;p&gt;When a finance lead looks at a low-confidence invoice in the approval queue, they don't just see "0.73 confidence." They see the breakdown:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"po_match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"line_match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.72&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"receipt_match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"price_match"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.78&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"overall"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;0.73&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plus the PO resolution method label: &lt;code&gt;"normalized"&lt;/code&gt;. Now the reviewer knows the PO was found with reasonable confidence (0.95, normalized match), but the receipt verification dropped to 0.50 (no receipt found on that PO line yet), which pulled the overall score down. The remediation is obvious: wait for the goods receipt and rerun matching. Not reject the invoice.&lt;/p&gt;

&lt;p&gt;That information existed in the old binary matcher too. It just wasn't exposed. The system knew why a match was weak and threw the reason away. Writing it to JSONB was the cheap part. Deciding to write it at all was the design decision.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'd Reconsider
&lt;/h2&gt;

&lt;p&gt;The 0.50 confidence floor on vendor+date matching is doing work, but it's doing work that's not fully visible in the match record. A &lt;code&gt;"vendor_date"&lt;/code&gt; method label with 0.50 confidence tells an operator that the invoice had no PO reference and we fell back to the weakest strategy. It does not tell them whether the vendor has three other open POs in the same date range that were almost as good a match.&lt;/p&gt;

&lt;p&gt;If I redid this, I'd return the top three candidates from each strategy, not just the best one. The &lt;code&gt;alternatives&lt;/code&gt; field on &lt;code&gt;ConfidenceScore&lt;/code&gt; exists for exactly this and is currently always empty. The code comment in &lt;code&gt;ConfidenceScorer.kt&lt;/code&gt; references it as planned work. An invoice that fell to strategy 4 with three equally plausible POs should surface all three to the reviewer with their individual confidence scores. Right now the reviewer sees one candidate and either accepts or searches manually.&lt;/p&gt;

&lt;p&gt;That's the kind of detail that looks trivial and isn't. A single weak candidate is worse than a ranked list of weak candidates, because the single candidate suggests certainty the system doesn't have. Building the alternatives list is two afternoons of plumbing. Shipping without it is the tradeoff I made to hit the first release. It's the first thing I'd add in a second pass.&lt;/p&gt;

</description>
      <category>kotlin</category>
      <category>matchingalgorithms</category>
      <category>exposed</category>
    </item>
    <item>
      <title>The Compliance Problem That Disappears When You Stop Updating Rows</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Sat, 18 Apr 2026 11:06:11 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/the-compliance-problem-that-disappears-when-you-stop-updating-rows-4o1i</link>
      <guid>https://dev.to/kingsleyonoh/the-compliance-problem-that-disappears-when-you-stop-updating-rows-4o1i</guid>
      <description>&lt;p&gt;"Show me everything that happened to discrepancy #4721 between Tuesday and Thursday." In a system with mutable records, the answer takes an hour of forensics across email threads, shared folders, and database logs. In a system with immutable events, it's a single query: &lt;code&gt;SELECT * FROM ledger_events WHERE discrepancy_id = $1 ORDER BY sequence_num&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;The difference comes down to provability. The mutable system can produce a timeline, but it can't prove the timeline wasn't edited between Thursday and today. The immutable system can, because the records physically cannot change once written.&lt;/p&gt;

&lt;p&gt;I built the Financial Compliance Ledger around this principle: the audit trail is not a feature attached to the data. The audit trail IS the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Provenance Problem
&lt;/h2&gt;

&lt;p&gt;Most compliance systems are CRUD applications with audit logging bolted on. The discrepancy record lives in a &lt;code&gt;discrepancies&lt;/code&gt; table. When someone changes the status from &lt;code&gt;open&lt;/code&gt; to &lt;code&gt;acknowledged&lt;/code&gt;, the system updates the row and maybe writes a log entry somewhere else.&lt;/p&gt;

&lt;p&gt;This design has a structural flaw. The log and the data it's supposed to protect occupy the same trust level. Both are regular database rows. Both can be updated. A DBA with production access, a bug in the application layer, or a migration script that touches the wrong table can alter the audit trail without leaving a trace. The audit log is a second-class citizen shadowing the real data.&lt;/p&gt;

&lt;p&gt;The compliance question is specific: show me every state change, in order, with who did it and why, and prove it wasn't altered. You can't answer that question with &lt;code&gt;updated_at&lt;/code&gt; and &lt;code&gt;updated_by&lt;/code&gt; columns. Those columns are regular writable fields. They can be overwritten like anything else.&lt;/p&gt;

&lt;p&gt;The fix requires a different data model, not better logging.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Immutability Actually Requires
&lt;/h2&gt;

&lt;p&gt;Financial audit trails have three hard constraints that CRUD systems can't satisfy structurally.&lt;/p&gt;

&lt;p&gt;First, every action must produce a permanent record. Not a log entry that's written if the application remembers to call the logger. A database row that's created in the same transaction as the state change. If the state change commits, the record commits. If either fails, both roll back. There's no window where the projection says "acknowledged" but the event doesn't exist.&lt;/p&gt;

&lt;p&gt;Second, ordering must be deterministic. Timestamps aren't enough. Two escalation rules evaluating against the same discrepancy in the same millisecond need an unambiguous order. I used PostgreSQL's &lt;code&gt;BIGSERIAL&lt;/code&gt; type for the &lt;code&gt;sequence_num&lt;/code&gt; column on &lt;code&gt;ledger_events&lt;/code&gt;. Values come from a sequence, so they are monotonically increasing and unique: no two events share a sequence number, and the relative order of any two events is never ambiguous. (Sequences can leave gaps when a transaction rolls back, but the audit trail needs ordering, not density.)&lt;/p&gt;

&lt;p&gt;Third, immutability can't hinge on developers remembering a rule. Conventions get bypassed by bugs and refactors. So the Go codebase has no code path that issues an UPDATE or DELETE against the &lt;code&gt;ledger_events&lt;/code&gt; table. There's no &lt;code&gt;UpdateEvent&lt;/code&gt; method. There's no &lt;code&gt;DeleteEvent&lt;/code&gt; method. They were never written. Migration scripts and admin queries live outside the application and need database-level guards of their own, but within the application the guarantee is absolute.&lt;/p&gt;

&lt;p&gt;The protection is structural: you can't accidentally call a function that doesn't exist.&lt;/p&gt;
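&lt;p&gt;Structural protection in the application doesn't stop the migration scripts and admin queries mentioned above, so a database-layer guard is the natural complement. The article doesn't describe one; the following is a common PostgreSQL pattern, not the project's actual setup:&lt;/p&gt;

```sql
-- Sketch: enforce append-only at the database layer (assumed, not
-- taken from the project). Either revoke the privileges outright...
REVOKE UPDATE, DELETE ON ledger_events FROM app_role;

-- ...or reject mutation attempts with a trigger, which also covers
-- superuser-adjacent roles that retain table privileges.
CREATE OR REPLACE FUNCTION reject_mutation() RETURNS trigger AS $$
BEGIN
  RAISE EXCEPTION 'ledger_events is append-only';
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER ledger_events_append_only
  BEFORE UPDATE OR DELETE ON ledger_events
  FOR EACH ROW EXECUTE FUNCTION reject_mutation();
```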

&lt;h2&gt;
  
  
  The Dual-Table Architecture
&lt;/h2&gt;

&lt;p&gt;The design splits the data into two tables with different mutation rules.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;discrepancies&lt;/code&gt; is the mutable projection. It stores the current state: status, severity, amounts, timestamps. This table gets updated on every workflow action. It exists for queryability: filtering by status, sorting by creation date, paginating with cursor-based navigation. The indexes live here because this is the table that handles read-heavy operations.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ledger_events&lt;/code&gt; is the immutable source of truth. Every state change, every note, every escalation, every resolution becomes a row with a unique &lt;code&gt;sequence_num&lt;/code&gt;, the actor who performed it, the actor_type (system, user, or escalation engine), and a JSONB payload carrying the details.&lt;/p&gt;
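&lt;p&gt;A sketch of the two tables with only the columns the article names; everything else here (types, defaults, the exact column set) is an assumption, not the real migration:&lt;/p&gt;

```sql
-- Dual-table layout, reconstructed from the description above.
CREATE TABLE discrepancies (
    id         BIGINT PRIMARY KEY,
    status     TEXT NOT NULL,              -- mutable projection: current state
    severity   TEXT NOT NULL,
    amount     NUMERIC,
    created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
    updated_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE ledger_events (
    sequence_num   BIGSERIAL PRIMARY KEY,  -- deterministic total order
    discrepancy_id BIGINT NOT NULL REFERENCES discrepancies (id),
    event_type     TEXT NOT NULL,          -- e.g. discrepancy.acknowledged
    actor          TEXT NOT NULL,
    actor_type     TEXT NOT NULL,          -- system, user, escalation engine
    payload        JSONB NOT NULL DEFAULT '{}'::jsonb,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT now()
);
```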

&lt;p&gt;The discrepancy row is a convenience. If you deleted every row in the &lt;code&gt;discrepancies&lt;/code&gt; table, you could reconstruct the complete state of every discrepancy from &lt;code&gt;ledger_events&lt;/code&gt; alone. The events are the data. The projection is a cache.&lt;/p&gt;
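&lt;p&gt;That claim is testable. A minimal sketch of the replay, using the event type strings listed later in the article (the fold itself is illustrative, not the project's code):&lt;/p&gt;

```go
package main

// Event mirrors the two ledger_events columns the fold needs.
type Event struct {
	SequenceNum int64
	EventType   string
}

// Replay folds an ordered event stream back into a current status.
// This is the sense in which the projection is a cache: the same fold,
// run per discrepancy, could rebuild the discrepancies table from
// ledger_events alone.
func Replay(events []Event) string {
	status := ""
	for _, e := range events {
		switch e.EventType {
		case "discrepancy.received":
			status = "open"
		case "discrepancy.acknowledged":
			status = "acknowledged"
		case "discrepancy.investigation_started":
			status = "investigating"
		case "discrepancy.escalated":
			status = "escalated"
		case "discrepancy.resolved":
			status = "resolved"
		case "discrepancy.auto_closed":
			status = "auto_closed"
		}
	}
	return status
}
```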

&lt;p&gt;The state machine enforces which transitions are legal. Here's the actual code from &lt;code&gt;domain/discrepancy.go&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;validTransitions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="k"&gt;map&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="kt"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;StatusOpen&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;StatusAcknowledged&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;StatusAutoClosed&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;StatusAcknowledged&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;StatusInvestigating&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;StatusInvestigating&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;StatusResolved&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;StatusEscalated&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;StatusEscalated&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;StatusResolved&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="no"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="c"&gt;// resolved and auto_closed are terminal states. No transitions out.&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="n"&gt;ValidTransition&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;from&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="kt"&gt;string&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="n"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;validTransitions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;from&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;ok&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="no"&gt;false&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;allowed&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;to&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Six states. Two terminal (&lt;code&gt;resolved&lt;/code&gt;, &lt;code&gt;auto_closed&lt;/code&gt;). Every transition is explicit. There's no default case that silently allows unknown transitions. If a status pair isn't in the map, the transition is rejected.&lt;/p&gt;

&lt;p&gt;Each workflow action (acknowledge, investigate, resolve, add note) runs inside a single database transaction: read the discrepancy with a row lock, validate the transition via &lt;code&gt;ValidTransition&lt;/code&gt;, update the mutable projection, append the immutable event, commit. The update and the append are atomic. If the event INSERT fails, the status update rolls back. If the status update fails, no event is created.&lt;/p&gt;
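&lt;p&gt;A sketch of that transaction shape. The &lt;code&gt;Store&lt;/code&gt; type and its field names are illustrative, and the transition map is abridged from the one above:&lt;/p&gt;

```go
package main

import "fmt"

// Store abstracts the three statements a workflow action issues inside
// one transaction. Field names are illustrative, not the real codebase.
type Store struct {
	LockStatus   func(id string) (string, error)        // SELECT ... FOR UPDATE
	UpdateStatus func(id, to string) error               // mutable projection
	AppendEvent  func(id, eventType, actor string) error // immutable ledger row
}

// validTransitions is abridged from the article's map.
var validTransitions = map[string]map[string]bool{
	"open":         {"acknowledged": true, "auto_closed": true},
	"acknowledged": {"investigating": true},
}

// Acknowledge sketches the sequence: lock, validate, update, append.
// If any step errors, the caller rolls the transaction back, so the
// projection and the event commit or fail together.
func Acknowledge(tx Store, id, actor string) error {
	from, err := tx.LockStatus(id)
	if err != nil {
		return err
	}
	if !validTransitions[from]["acknowledged"] {
		return fmt.Errorf("invalid transition %s to acknowledged", from)
	}
	if err := tx.UpdateStatus(id, "acknowledged"); err != nil {
		return err
	}
	return tx.AppendEvent(id, "discrepancy.acknowledged", actor)
}
```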

&lt;h2&gt;
  
  
  The Concurrency Bug That Proved the Design
&lt;/h2&gt;

&lt;p&gt;The first version relied on timestamp ordering for events. It worked in tests because tests run sequentially. In production-like conditions with the escalation engine running, two rules evaluated concurrently against the same discrepancy.&lt;/p&gt;

&lt;p&gt;Both checked the status: open. Both saw a valid transition to &lt;code&gt;auto_closed&lt;/code&gt;. Both wrote events.&lt;/p&gt;

&lt;p&gt;The result: two &lt;code&gt;discrepancy.auto_closed&lt;/code&gt; events for the same discrepancy. The event history showed two closures. An auditor looking at that trail would see a discrepancy closed twice and have no way to determine which closure was authoritative.&lt;/p&gt;

&lt;p&gt;Moving the transition check inside the database transaction with a row-level lock on the discrepancy fixed it. The first transaction acquires the lock, validates the transition (open to auto_closed: valid), updates the projection, appends the event, and commits. The second transaction waits for the lock, reads the updated status (already &lt;code&gt;auto_closed&lt;/code&gt;), runs &lt;code&gt;ValidTransition("auto_closed", "auto_closed")&lt;/code&gt;, gets &lt;code&gt;false&lt;/code&gt;, and rolls back.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;sequence_num&lt;/code&gt; made the fix verifiable. After the change, I wrote a test that fires two concurrent goroutines at the same discrepancy and asserts that exactly one &lt;code&gt;auto_closed&lt;/code&gt; event exists. The monotonic sequence guarantees a single ordering: no two events share a sequence number, so which write won is never ambiguous.&lt;/p&gt;

&lt;h2&gt;
  
  
  Optimistic Locking as an Opt-In
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;expected_sequence&lt;/code&gt; parameter on workflow actions is optional. When a client provides it, the system checks that the discrepancy's latest event sequence number matches what the client expects. If another action happened between the client's read and write, the numbers diverge, and the system returns 409 Conflict.&lt;/p&gt;
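&lt;p&gt;The check itself is small. A sketch, with function and parameter names of my choosing rather than the codebase's:&lt;/p&gt;

```go
package main

import "errors"

// ErrConflict is what the handler layer maps to HTTP 409.
var ErrConflict = errors.New("expected_sequence does not match latest event")

// CheckExpectedSequence implements the opt-in: a nil expectation skips
// the check entirely; a stale one conflicts. Either way, ValidTransition
// still guards the transition itself.
func CheckExpectedSequence(latest int64, expected *int64) error {
	if expected == nil {
		return nil
	}
	if *expected != latest {
		return ErrConflict
	}
	return nil
}
```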

&lt;p&gt;I made it optional deliberately. A simple integration that acknowledges discrepancies one at a time doesn't need optimistic locking. The &lt;code&gt;ValidTransition&lt;/code&gt; check catches illegal transitions regardless. But a dashboard where multiple analysts work on the same discrepancy simultaneously needs the 409 to prevent one analyst's context from silently overwriting another's.&lt;/p&gt;

&lt;p&gt;Making optimistic locking mandatory would have been "safer" in the abstract. It also would have forced every client to track sequence numbers, even for simple, low-contention workflows. The state machine already prevents truly invalid transitions: you can't resolve an open discrepancy, you can't acknowledge one that's already resolved. The optimistic lock adds a second layer for the subset of clients that operate under concurrent access.&lt;/p&gt;

&lt;p&gt;I was wrong about where the concurrency pressure would come from. I expected the escalation engine to be the primary source of conflicts because it processes discrepancies in bulk. It turned out that concurrent human analysts were the more common case. The engine runs on a 15-minute cycle and processes each tenant's rules sequentially. Two people clicking "investigate" on the same discrepancy within seconds of each other is harder to serialize than a scheduled batch job.&lt;/p&gt;

&lt;h2&gt;
  
  
  What the Trail Actually Proves
&lt;/h2&gt;

&lt;p&gt;The system uses 7 event types spanning the full lifecycle: &lt;code&gt;discrepancy.received&lt;/code&gt;, &lt;code&gt;discrepancy.acknowledged&lt;/code&gt;, &lt;code&gt;discrepancy.investigation_started&lt;/code&gt;, &lt;code&gt;discrepancy.note_added&lt;/code&gt;, &lt;code&gt;discrepancy.escalated&lt;/code&gt;, &lt;code&gt;discrepancy.resolved&lt;/code&gt;, and &lt;code&gt;discrepancy.auto_closed&lt;/code&gt;. Each event carries the actor (who performed it), the actor_type (system, user, or escalation engine), and a JSONB payload with specifics like resolution type (&lt;code&gt;match_found&lt;/code&gt;, &lt;code&gt;false_positive&lt;/code&gt;, &lt;code&gt;manual_adjustment&lt;/code&gt;, &lt;code&gt;write_off&lt;/code&gt;) or escalation reason.&lt;/p&gt;

&lt;p&gt;When someone asks "what happened to discrepancy #4721?", the answer is a query returning an ordered list of immutable events. The &lt;code&gt;sequence_num&lt;/code&gt; provides ordering. The actor field provides attribution. The payload provides context. And because the application has no code path that modifies or deletes an event row, the result set is the history itself, not a reconstruction of it.&lt;/p&gt;

&lt;p&gt;The 346 tests include 36 dedicated state machine test cases covering every valid transition and every invalid attempt. The domain layer is the most heavily tested module because it encodes the compliance rules that everything else depends on.&lt;/p&gt;

&lt;p&gt;If your compliance audit trail can be edited by the same system that produces it, it's a log, not a ledger. The distinction matters the first time someone asks for proof.&lt;/p&gt;

</description>
      <category>eventsourcing</category>
      <category>postgres</category>
      <category>immutability</category>
    </item>
    <item>
      <title>Why Extracted Obligations Never Activate Themselves</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Sat, 18 Apr 2026 11:03:37 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/why-extracted-obligations-never-activate-themselves-4ol7</link>
      <guid>https://dev.to/kingsleyonoh/why-extracted-obligations-never-activate-themselves-4ol7</guid>
      <description>&lt;p&gt;The RAG Platform returns 12 obligations from a signed MSA. Nine of them are real payment schedules with deadline clauses and clause references. Two are duplicates of the same renewal window, worded differently in different sections. One is a hallucinated reporting duty the AI invented from a sentence that talked about reports but did not commit to producing any.&lt;/p&gt;

&lt;p&gt;The contract engine inserts all 12 rows. Every single one is set to &lt;code&gt;Status = ObligationStatus.Pending&lt;/code&gt;. None of them will fire an alert. None will appear on the deadline dashboard. None will publish a &lt;code&gt;contract.obligation.breached&lt;/code&gt; event if their date passes. The hourly scanner walks right past them.&lt;/p&gt;

&lt;p&gt;This is by design, and it took a specific design decision to make that true. The line is in &lt;code&gt;ExtractionResultParser.BuildObligation&lt;/code&gt;, and the docstring calls it a load-bearing invariant:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Invariants (load-bearing - do NOT weaken without a test):&lt;/span&gt;
&lt;span class="c1"&gt;// - Every obligation returned carries ObligationStatus.Pending - no&lt;/span&gt;
&lt;span class="c1"&gt;//   auto-activation, extract-then-confirm per PRD 5.2.&lt;/span&gt;
&lt;span class="c1"&gt;// - Source is always ObligationSource.RagExtraction.&lt;/span&gt;
&lt;span class="c1"&gt;// - Malformed JSON returns an empty list.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A human has to click confirm on every row before the system treats it as real. Nine clicks to accept the true positives. Two clicks to dismiss the duplicates. One click to dismiss the hallucination. Twelve interactions for a single contract extraction. That is a lot of friction for what could have been a fully automatic workflow.&lt;/p&gt;

&lt;p&gt;I put the friction there on purpose.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Failure Mode I Was Optimising For
&lt;/h2&gt;

&lt;p&gt;AI extraction has two failure modes: false negatives (the model misses a real obligation) and false positives (the model invents one). In most domains, false positives are the cheaper problem. An invented entry is noise a reviewer can discard, and even a miss degrades gracefully: a search system that fails to surface a document just returns fewer results; the user notices and reformulates.&lt;/p&gt;

&lt;p&gt;Legal obligations invert that calculus. A missed obligation is a real legal exposure, but so is a phantom one. If the AI invents a &lt;code&gt;90-day renewal notice&lt;/code&gt; clause and the system auto-activates the obligation, the scanner will fire alerts on a fabricated deadline. The operator sees the alert, files the notice, and now there is a formally documented paper trail attesting to a clause that does not exist in the underlying contract.&lt;/p&gt;

&lt;p&gt;That is worse than missing the real clause. A missed obligation creates a gap. A phantom obligation creates a fiction in the compliance record.&lt;/p&gt;

&lt;p&gt;The scanner cannot tell the difference between a real deadline and a hallucinated one, because by the time a status is &lt;code&gt;Active&lt;/code&gt;, the system treats it as legally binding. The only place the distinction can be enforced is at the boundary where AI output becomes domain state. That boundary is the Pending status.&lt;/p&gt;

&lt;h2&gt;
  
  
  Four Terminal States, Not One
&lt;/h2&gt;

&lt;p&gt;The obligation state machine has eleven states. Seven are non-terminal: Pending, Active, Upcoming, Due, Overdue, Escalated, Disputed. Four are terminal: Dismissed, Fulfilled, Waived, Expired. The easy design is one terminal state called Closed with a reason column.&lt;/p&gt;

&lt;p&gt;I rejected that. The four terminal values each answer a different audit question:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dismissed.&lt;/strong&gt; The obligation was never real. An AI false positive, or a human operator seeing a duplicate entry. The event row captures the reason verbatim.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fulfilled.&lt;/strong&gt; The obligation was honored. Counterparty was paid, notice was delivered, deliverable shipped. For recurring obligations this spawns the next instance.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Waived.&lt;/strong&gt; The obligation was real but forgiven. The counterparty agreed not to enforce it, or we renegotiated. A waiver without a rationale is a compliance liability, so &lt;code&gt;WaiveAsync&lt;/code&gt; throws if &lt;code&gt;reason&lt;/code&gt; is empty.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expired.&lt;/strong&gt; The contract was archived before the obligation resolved. This is a cascade, not a decision. When a contract archives, every non-terminal obligation under it expires with &lt;code&gt;actor = "system:archive_cascade"&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An auditor asking "how many obligations did we waive in Q2?" should get a number, not a regex over free-text reason fields. "How many were AI-extracted and then dismissed?" should be a two-column filter (&lt;code&gt;source = rag_extraction&lt;/code&gt; AND &lt;code&gt;status = dismissed&lt;/code&gt;). Collapsing the four into Closed pushes that information into prose, where it stops being queryable.&lt;/p&gt;

&lt;p&gt;The transition map enforces the distinction at the type-system level. &lt;code&gt;Pending&lt;/code&gt; can only transition to &lt;code&gt;Active&lt;/code&gt; (via confirm), &lt;code&gt;Dismissed&lt;/code&gt; (via dismiss), or &lt;code&gt;Expired&lt;/code&gt; (via archive cascade). There is no path from &lt;code&gt;Pending&lt;/code&gt; to &lt;code&gt;Fulfilled&lt;/code&gt; or &lt;code&gt;Waived&lt;/code&gt;. You cannot fulfill something that was never confirmed as real. The &lt;code&gt;ObligationStateMachine.GetValidNextStates&lt;/code&gt; method is a &lt;code&gt;switch&lt;/code&gt; expression that encodes this per PRD section 4.6, and &lt;code&gt;EnsureTransitionAllowed&lt;/code&gt; throws &lt;code&gt;ObligationTransitionException&lt;/code&gt; on any deviation. The middleware maps that exception to 422 INVALID_TRANSITION with the list of valid next states in &lt;code&gt;details[]&lt;/code&gt;, so a caller who gets the transition wrong sees exactly what was allowed.&lt;/p&gt;
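&lt;p&gt;A sketch of what that switch expression could look like. The &lt;code&gt;Pending&lt;/code&gt; arm matches the three targets named above; the remaining non-terminal arms and the exception's constructor signature are assumptions, not the real code:&lt;/p&gt;

```csharp
// Sketch of ObligationStateMachine per the description above.
public static ObligationStatus[] GetValidNextStates(ObligationStatus current) =&gt;
    current switch
    {
        // Extract-then-confirm: Pending can be confirmed, dismissed, or
        // expired by the archive cascade. Never fulfilled or waived.
        ObligationStatus.Pending =&gt; new[]
        {
            ObligationStatus.Active,     // confirm
            ObligationStatus.Dismissed,  // dismiss
            ObligationStatus.Expired,    // archive cascade
        },
        // ...remaining non-terminal arms elided; see PRD 4.6...
        _ =&gt; new ObligationStatus[0],    // terminal states: no way out
    };

public static void EnsureTransitionAllowed(ObligationStatus from, ObligationStatus to)
{
    // The middleware turns this into 422 INVALID_TRANSITION, carrying
    // the valid next states in details[].
    if (System.Array.IndexOf(GetValidNextStates(from), to) == -1)
        throw new ObligationTransitionException(from, to, GetValidNextStates(from));
}
```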

&lt;h2&gt;
  
  
  Creation Emits No Event
&lt;/h2&gt;

&lt;p&gt;This is the second invariant that falls out of extract-then-confirm. The &lt;code&gt;CreateAsync&lt;/code&gt; method on &lt;code&gt;ObligationService&lt;/code&gt; explicitly does not write an event row:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight csharp"&gt;&lt;code&gt;&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;_obligationRepository&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;AddAsync&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;obligation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cancellationToken&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="c1"&gt;// Deliberately: NO event row here. Events represent transitions, not creation.&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;obligation&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An event represents a decision about an existing thing. Creating a row is not a decision. If the AI extracts 12 obligations and a human dismisses 3, the audit log should not read like 12 creations followed by 3 dismissals. It should read like 3 dismissals, because the other 9 have not yet been acted upon. They are proposals, not facts.&lt;/p&gt;

&lt;p&gt;The moment a human confirms an obligation, the event log records the transition &lt;code&gt;pending -&amp;gt; active&lt;/code&gt; with the actor, the timestamp, and any reason they provided. Now it is a fact. If the contract is renegotiated two months later and the obligation is waived, the log reads &lt;code&gt;active -&amp;gt; waived&lt;/code&gt; with the waiver reason required. The full lineage is two events.&lt;/p&gt;

&lt;p&gt;There is exactly one exception: recurring obligations. When a monthly payment obligation is fulfilled, the system spawns a new Active instance with &lt;code&gt;next_due_date&lt;/code&gt; advanced by the recurrence interval. The spawn writes one event on the new row with &lt;code&gt;FromStatus = ""&lt;/code&gt; and &lt;code&gt;Reason = "auto-created from fulfilled parent (recurring): {parent.Id}"&lt;/code&gt;. Empty &lt;code&gt;from_status&lt;/code&gt; is the signal for a creation event rather than a transition, and the metadata column carries the parent ID. The provenance matters here because otherwise the spawn looks like a ghost, an Active obligation with no origin story.&lt;/p&gt;
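&lt;p&gt;The spawn-plus-creation-event step might look like this. Everything beyond &lt;code&gt;_obligationRepository&lt;/code&gt; and the reason string quoted above (the event repository, the clone helper, the property names) is an assumption for illustration:&lt;/p&gt;

```csharp
// Sketch of the recurring spawn described above (names illustrative).
public async Task SpawnNextAsync(Obligation parent, CancellationToken ct)
{
    var next = parent.CloneForRecurrence();        // copies terms from parent
    next.Status = ObligationStatus.Active;          // spawned instances skip Pending
    next.NextDueDate = parent.NextDueDate.Add(parent.RecurrenceInterval);
    await _obligationRepository.AddAsync(next, ct);

    // The one allowed creation event: empty FromStatus marks it as a
    // creation rather than a transition, and Reason records provenance.
    await _eventRepository.AddAsync(new ObligationEvent
    {
        ObligationId = next.Id,
        FromStatus = "",
        ToStatus = "active",
        Reason = $"auto-created from fulfilled parent (recurring): {parent.Id}",
    }, ct);
}
```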

&lt;p&gt;That exception was debated. An earlier design kept the "no creation events ever" rule strict and relied on querying &lt;code&gt;obligations.metadata.parent_id&lt;/code&gt; to reconstruct recurrence chains. I changed it because reconstruction in the event log is cheap (one SELECT, ordered by timestamp), whereas reconstruction across the &lt;code&gt;obligations&lt;/code&gt; table means understanding a JSONB structure that no one else will be looking at three years from now. The event log is the canonical place an auditor goes. Putting provenance anywhere else is hoping they find it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Confirm Actually Costs
&lt;/h2&gt;

&lt;p&gt;The friction of clicking confirm on every row is real. An AI-extracted MSA produces 8 to 20 obligations depending on contract type. A legal-ops team processing 30 new contracts a month is confirming several hundred obligations a month. That is manual review work the AI was supposed to eliminate.&lt;/p&gt;

&lt;p&gt;The response to that is to be honest about what the AI eliminated. It did not eliminate the review. It eliminated the extraction. Before the engine, a paralegal read each contract and typed deadline dates into a spreadsheet. The typing was the bottleneck, not the reading. With extraction, the typing is gone. The review still happens. The operator reads each extracted row, checks the clause reference against the source document, and either confirms or dismisses.&lt;/p&gt;

&lt;p&gt;The confidence score on each extracted row helps prioritize. Extractions above 0.90 confidence tend to have exact clause references and pass review in seconds. Extractions under 0.70 tend to be the cases worth scrutinizing. The engine surfaces both but treats them identically at the storage layer. Confidence gates prioritization, not state. Activating high-confidence rows automatically would re-introduce the false-positive risk the whole design was avoiding.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Result in the Audit Log
&lt;/h2&gt;

&lt;p&gt;On a single contract with twelve extracted obligations, where a human confirms nine and dismisses three, the &lt;code&gt;obligation_events&lt;/code&gt; table holds twelve rows. Nine rows are &lt;code&gt;pending -&amp;gt; active&lt;/code&gt; with &lt;code&gt;actor = user:{tenantId}&lt;/code&gt; and no reason. Three rows are &lt;code&gt;pending -&amp;gt; dismissed&lt;/code&gt; with actor stamps and the reason the operator typed. Zero rows represent creation, because creation was not a decision.&lt;/p&gt;

&lt;p&gt;When the contract is archived eighteen months later, the archive cascade writes another nine rows, one per non-terminal obligation, as &lt;code&gt;{prior_status} -&amp;gt; expired&lt;/code&gt; with &lt;code&gt;actor = system:archive_cascade&lt;/code&gt;. The dismissed three already terminated so the cascade skips them.&lt;/p&gt;

&lt;p&gt;Six months after archive, a regulator asks the company to explain a specific obligation that was dismissed. Zero forensic work. One query:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;obligation_events&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;obligation_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="err"&gt;$&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One row comes back: the dismissal, with the timestamp, the user actor, and the reason text. (The archive cascade skipped this obligation because &lt;code&gt;dismissed&lt;/code&gt; is already terminal.) The regulator has their answer. No reconstruction across tables, no JSON parsing, no correlating timestamps across systems. The audit is the data.&lt;/p&gt;

&lt;p&gt;That is what the Pending status buys. A system where every legally-binding commitment in the database was put there by a human with an identity and a timestamp, and the row that records the commitment is the same row the auditor queries six months later.&lt;/p&gt;

</description>
      <category>statemachine</category>
      <category>dotnet</category>
      <category>domainmodeling</category>
    </item>
    <item>
      <title>Why I Compute Every Appointment Slot from Scratch on Every Request</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Sat, 18 Apr 2026 11:01:05 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/why-i-compute-every-appointment-slot-from-scratch-on-every-request-3al6</link>
      <guid>https://dev.to/kingsleyonoh/why-i-compute-every-appointment-slot-from-scratch-on-every-request-3al6</guid>
      <description>&lt;p&gt;The obvious approach to appointment scheduling is a lookup table. Run a job at midnight. Generate every possible slot for the next two weeks. Store them in a &lt;code&gt;slots&lt;/code&gt; table. When a patient asks what's available, query the table. When someone books, mark the row as taken.&lt;/p&gt;

&lt;p&gt;I built the opposite. The Clinical Scheduling Engine has no slot table. Every time a patient asks "what's available Thursday?", the system loads all constraints from the database, computes valid slots in real time, scores each one, and returns a sorted list. Nothing is pre-generated. Nothing is cached except availability windows (those get a 60-second TTL because they change weekly, not hourly).&lt;/p&gt;

&lt;p&gt;The decision came down to one question: what happens when the underlying data changes?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Sync Problem
&lt;/h2&gt;

&lt;p&gt;A slot table looks simple until you try to keep it accurate. A booking at 9:15 AM doesn't just remove one slot. It shifts buffer times, changes gap scores for adjacent slots, and potentially opens or closes overbooking options depending on how many bookings the provider already has that day. A cancellation reverses all of that. An availability change (Dr. Chen is out Thursday afternoon) invalidates dozens of slots at once.&lt;/p&gt;

&lt;p&gt;Each of these events needs to trigger a slot regeneration. Miss one, and the calendar shows a slot that doesn't exist or hides one that does. The first failure mode confuses patients. The second wastes clinic capacity. Neither produces an error message.&lt;/p&gt;

&lt;p&gt;I counted the events that would require synchronization: booking created, booking cancelled, booking marked as no-show, provider availability added or modified, overbooking rule created or modified, appointment type duration changed. Six triggers at minimum, each with a different scope (single slot, single provider, or global). The regeneration logic itself would be roughly the same complexity as real-time computation, but now it runs reactively instead of on demand, and every missed trigger is a silent data integrity bug.&lt;/p&gt;

&lt;p&gt;Real-time computation sidesteps all of this. There's nothing to sync because there's nothing stored.&lt;/p&gt;

&lt;h2&gt;
  
  
  Computing Slots from Scratch
&lt;/h2&gt;

&lt;p&gt;The core of the system is &lt;code&gt;find_available_slots&lt;/code&gt; in &lt;code&gt;src/optimizer/engine.py&lt;/code&gt;. It takes a target date and appointment type, then runs a multi-stage pipeline for each active provider.&lt;/p&gt;

&lt;p&gt;First, it loads availability windows. Provider schedules are stored as recurring weekly patterns (Monday 08:00 to 17:00, Tuesday 08:00 to 17:00) with &lt;code&gt;valid_from&lt;/code&gt; and &lt;code&gt;valid_until&lt;/code&gt; date bounds. The &lt;code&gt;get_availability_windows&lt;/code&gt; function in &lt;code&gt;src/optimizer/constraints.py&lt;/code&gt; filters these by day of week and date range. Results are cached in memory for 60 seconds because these patterns change perhaps once a week.&lt;/p&gt;
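&lt;p&gt;The 60-second cache amounts to a per-provider memoizer. A sketch (the helper name and shape are mine; the real caching lives inside &lt;code&gt;get_availability_windows&lt;/code&gt;):&lt;/p&gt;

```python
import time

# In-memory cache: provider_id mapped to (cached_at, windows).
_window_cache = {}

def cached_windows(provider_id, load, ttl_seconds=60, clock=time.monotonic):
    """Memoize load(provider_id) for ttl_seconds per provider.

    Availability patterns change perhaps weekly, so a short TTL keeps
    reads cheap without risking a stale schedule for long.
    """
    entry = _window_cache.get(provider_id)
    if entry is not None:
        cached_at, windows = entry
        age = clock() - cached_at
        if max(0.0, age - ttl_seconds) == 0.0:  # still fresh
            return windows
    windows = load(provider_id)
    _window_cache[provider_id] = (clock(), windows)
    return windows
```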

&lt;p&gt;Second, it subtracts existing bookings. Every confirmed booking for this provider on the target date gets carved out of the availability windows. The &lt;code&gt;subtract_bookings&lt;/code&gt; function uses interval arithmetic: for each booking, it splits the containing window into the parts before and after the booked period. A 9-hour window with three bookings might become four or five smaller open intervals.&lt;/p&gt;
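&lt;p&gt;The interval arithmetic is easier to see in code. A minimal sketch, using minutes-from-midnight integers instead of the real datetimes; &lt;code&gt;min&lt;/code&gt;/&lt;code&gt;max&lt;/code&gt; clamp the overlap so non-overlapping bookings fall out naturally:&lt;/p&gt;

```python
def subtract_bookings(windows, bookings):
    """Carve each booked (start, end) out of the open (start, end) windows.

    For every window, keep the piece before the booking and the piece
    after it; zero-length pieces are dropped. A booking that does not
    overlap a window leaves it intact as a single piece.
    """
    open_intervals = list(windows)
    for b_start, b_end in bookings:
        pieces = []
        for w_start, w_end in open_intervals:
            left_len = max(0, min(w_end, b_start) - w_start)
            if left_len:
                pieces.append((w_start, w_start + left_len))
            right_start = max(w_start, b_end)
            right_len = max(0, w_end - right_start)
            if right_len:
                pieces.append((right_start, w_end))
        open_intervals = pieces
    return open_intervals
```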

&lt;p&gt;Third, it applies buffer time. Each provider has a configurable buffer (default: 10 minutes) between appointments. The &lt;code&gt;apply_buffer&lt;/code&gt; function in &lt;code&gt;constraints.py&lt;/code&gt; trims the end of each open interval by the buffer duration. Intervals that shrink to zero get dropped.&lt;/p&gt;
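&lt;p&gt;The buffer trim is the same kind of arithmetic, sketched in the same minutes-based style:&lt;/p&gt;

```python
def apply_buffer(intervals, buffer_minutes):
    """Trim the buffer off the end of each open interval; drop empties."""
    trimmed = []
    for start, end in intervals:
        length = max(0, end - buffer_minutes - start)
        if length:
            trimmed.append((start, start + length))
    return trimmed
```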

&lt;p&gt;Fourth, it filters compatible rooms. Not every room can host every appointment type. A physical exam needs blood pressure equipment. An ECG test needs ECG equipment. The &lt;code&gt;filter_compatible_rooms&lt;/code&gt; function checks room type and equipment arrays against the appointment type's requirements. If no compatible room exists, the optimizer returns an empty list rather than suggesting an incompatible room.&lt;/p&gt;

&lt;p&gt;Fifth, for each compatible room, it subtracts that room's existing bookings. This prevents room double-booking independently of provider double-booking.&lt;/p&gt;

&lt;p&gt;Finally, it generates candidate slots at the configured increment (default: 15 minutes) within each remaining open interval, and scores every one.&lt;/p&gt;
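&lt;p&gt;Candidate generation then reduces to counting how many increment-aligned starts still fit in each interval (again a sketch in minutes, not the project's code):&lt;/p&gt;

```python
def generate_candidates(intervals, duration_minutes, increment=15):
    """Enumerate candidate (start, end) slots inside each open interval."""
    slots = []
    for start, end in intervals:
        # Number of increment-aligned starts whose slot still fits; floor
        # division makes an interval shorter than the duration yield zero.
        count = max(0, (end - start - duration_minutes) // increment + 1)
        for i in range(count):
            slot_start = start + i * increment
            slots.append((slot_start, slot_start + duration_minutes))
    return slots
```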

&lt;p&gt;The scorer in &lt;code&gt;src/optimizer/scorer.py&lt;/code&gt; combines four weighted components into a quality score between 0.0 and 1.0:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Preference match (40% weight).&lt;/strong&gt; How close is the slot to the patient's preferred time window? Inside the window scores 1.0; distance decays linearly, reaching 0.0 at 480 minutes away.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gap minimization (35%).&lt;/strong&gt; How close is this slot to existing bookings? A slot immediately after the previous appointment scores 1.0 because it reduces dead time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Room switching penalty (15%).&lt;/strong&gt; Does this slot require the provider to move rooms? Staying in the same room scores 1.0; switching scores 0.3.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Overbooking penalty (10%).&lt;/strong&gt; Overbooked slots get 0.0, pushing them to the bottom without removing them.&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;score_slot&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;slot_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;slot_end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;provider_bookings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;preferred_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;preferred_end&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;provider_rooms_used&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;room_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;is_overbooked&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;pref&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;score_preference_match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slot_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;preferred_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;preferred_end&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;gap&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;score_gap_minimization&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;slot_start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;provider_bookings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;room&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;score_room_switching&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;room_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;provider_rooms_used&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;overbook&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;is_overbooked&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="mf"&gt;1.0&lt;/span&gt;

    &lt;span class="n"&gt;raw&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;_W_PREFERENCE&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;pref&lt;/span&gt;
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;_W_GAP&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;gap&lt;/span&gt;
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;_W_ROOM&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;room&lt;/span&gt;
        &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;_W_OVERBOOK&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;overbook&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;1.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;raw&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
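&lt;p&gt;The preference component's linear decay is easy to see in isolation. An illustrative sketch, assuming minute-based times; this is not the exact &lt;code&gt;score_preference_match&lt;/code&gt; implementation:&lt;/p&gt;

```python
_DECAY_LIMIT_MIN = 480  # minutes away at which the score reaches 0.0

def score_preference_match(slot_start, preferred_start, preferred_end):
    """1.0 inside the preferred window; linear decay to 0.0 at 480 minutes out."""
    if preferred_end >= slot_start >= preferred_start:
        return 1.0
    # distance from the nearest edge of the preferred window
    if preferred_start > slot_start:
        distance = preferred_start - slot_start
    else:
        distance = slot_start - preferred_end
    return max(0.0, 1.0 - distance / _DECAY_LIMIT_MIN)
```

&lt;p&gt;A slot four hours (240 minutes) past the preferred window scores 0.5; anything eight hours out or more scores 0.0.&lt;/p&gt;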



&lt;p&gt;The 40/35/15/10 split came from the PRD, not from tuning. It encodes a judgment: patient convenience matters most, but not at the expense of schedule compactness. I'd experiment with pushing the gap weight higher (to 40 or 45) if utilization data showed persistent scheduling holes.&lt;/p&gt;

&lt;p&gt;The overbooking system was more complex than I expected. A single &lt;code&gt;max_overbook&lt;/code&gt; integer isn't enough because different appointment types have different no-show rates. The &lt;code&gt;OverbookingRule&lt;/code&gt; model supports three levels of specificity: global (no provider, no type), provider-only, and provider+type (the most specific). The &lt;code&gt;_get_max_overbook&lt;/code&gt; function in &lt;code&gt;engine.py&lt;/code&gt; resolves these with a priority hierarchy: if Dr. Chen has a global allowance of 1 extra slot but a specific rule for ECG appointments allowing 2, the ECG rule wins. When a provider is maxed on normal bookings but overbooking is allowed, the system generates a second pass of overbooked slots with lower quality scores.&lt;/p&gt;

&lt;p&gt;The resolution function shows the priority hierarchy:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_get_max_overbook&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rules&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;provider_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;appointment_type_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;rules&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;provider_id&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;provider_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;appointment_type_id&lt;/span&gt;
                &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;appointment_type_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;appointment_type_id&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_overbook&lt;/span&gt;  &lt;span class="c1"&gt;# most specific wins immediately
&lt;/span&gt;        &lt;span class="nf"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;provider_id&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;provider_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;provider_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;appointment_type_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_overbook&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;provider_id&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;appointment_type_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_overbook&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Ten Patients, One Slot
&lt;/h2&gt;

&lt;p&gt;Slot computation worked. But what happens when ten patients are looking at the same 9:15 AM slot with Dr. Chen and all ten click "book now" within the same second?&lt;/p&gt;

&lt;p&gt;In most scheduling systems, this is where things break. Application-level availability checks see the slot as open for all ten requests because they all query the database before any INSERT completes. Multiple bookings get created for the same slot. Someone at the front desk has to call a patient and apologize.&lt;/p&gt;

&lt;p&gt;I didn't build a fix for this. I let PostgreSQL handle it.&lt;/p&gt;

&lt;p&gt;The bookings table has &lt;code&gt;UniqueConstraint("provider_id", "date", "start_time", name="uq_provider_slot")&lt;/code&gt;. The booking service in &lt;code&gt;src/booking/service.py&lt;/code&gt; does an INSERT, and if it hits an &lt;code&gt;IntegrityError&lt;/code&gt;, it catches the exception and raises an &lt;code&gt;AppError&lt;/code&gt; with code &lt;code&gt;SLOT_UNAVAILABLE&lt;/code&gt; and HTTP status 409.&lt;/p&gt;
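&lt;p&gt;The pattern is easy to demonstrate with an in-memory SQLite table carrying the same constraint. This is a sketch only: the production service uses PostgreSQL with SQLAlchemy, and the &lt;code&gt;book&lt;/code&gt; helper here is hypothetical:&lt;/p&gt;

```python
import sqlite3

def book(conn, provider_id, date, start_time):
    """Hypothetical helper: INSERT and translate a constraint hit into a 409."""
    try:
        conn.execute(
            "INSERT INTO bookings (provider_id, date, start_time) VALUES (?, ?, ?)",
            (provider_id, date, start_time),
        )
        return 201  # created
    except sqlite3.IntegrityError:
        return 409  # SLOT_UNAVAILABLE: another request won the race

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bookings ("
    "provider_id TEXT, date TEXT, start_time TEXT, "
    "CONSTRAINT uq_provider_slot UNIQUE (provider_id, date, start_time))"
)
results = [book(conn, "dr-chen", "2026-04-20", "09:15") for _ in range(10)]
```

&lt;p&gt;SQLite serializes these inserts, but the outcome is the same one PostgreSQL guarantees under real concurrency: one 201, nine 409s.&lt;/p&gt;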

&lt;p&gt;The stress test in &lt;code&gt;tests/unit/test_performance.py&lt;/code&gt; fires 10 concurrent requests for the same slot using &lt;code&gt;asyncio.gather&lt;/code&gt;. The assertion: exactly 1 gets 201 Created, exactly 9 get 409 Conflict. No application-level locking. No Redis. No retry logic. The database is the arbiter.&lt;/p&gt;

&lt;p&gt;What I was wrong about: I initially thought I'd need &lt;code&gt;SELECT ... FOR UPDATE&lt;/code&gt; before the INSERT to prevent the race. There's no need. The UNIQUE constraint alone is sufficient because the INSERT either succeeds or it doesn't. There's no intermediate state where two rows exist temporarily.&lt;/p&gt;

&lt;p&gt;The backfill system added another dimension. When a patient cancels, the &lt;code&gt;find_backfill_candidates&lt;/code&gt; function in &lt;code&gt;src/booking/backfill.py&lt;/code&gt; queries for other cancelled bookings with the same appointment type within a 7-day window (the &lt;code&gt;_SEARCH_WINDOW_DAYS&lt;/code&gt; constant). These are patients who lost their own appointment and might want the freed slot. The proximity score decays linearly: 1.0 for same-day, 0.0 at 7 days. Candidates appear in the cancellation response, so the front desk sees recovery options immediately.&lt;/p&gt;
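&lt;p&gt;The proximity decay is a one-liner. An illustrative sketch; the real &lt;code&gt;find_backfill_candidates&lt;/code&gt; works on date objects and query results, and this helper name is hypothetical:&lt;/p&gt;

```python
_SEARCH_WINDOW_DAYS = 7

def proximity_score(days_apart):
    """1.0 for a same-day candidate, decaying linearly to 0.0 at the window edge."""
    if days_apart >= _SEARCH_WINDOW_DAYS:
        return 0.0
    return round(1.0 - days_apart / _SEARCH_WINDOW_DAYS, 4)
```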

&lt;h2&gt;
  
  
  What I'd Change
&lt;/h2&gt;

&lt;p&gt;The availability cache uses a fixed 60-second TTL. If a clinic admin changes Thursday hours, the cache serves stale windows for up to a minute. Explicit invalidation on availability updates would eliminate that gap, but I chose the simpler approach because availability changes happen rarely and the 60-second window is acceptable.&lt;/p&gt;

&lt;p&gt;The scorer weights are hardcoded at 40/35/15/10. A specialist clinic where room equipment is the real bottleneck might want 25/35/30/10 instead. Making the weights configurable per clinic is the next change worth making. Everything else has held up.&lt;/p&gt;

</description>
      <category>scheduling</category>
      <category>python</category>
      <category>healthydebate</category>
    </item>
    <item>
      <title>Why I Designed the Whole System Around Kafka and Then Deployed Without It</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Thu, 09 Apr 2026 16:14:36 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/why-i-designed-the-whole-system-around-kafka-and-then-deployed-without-it-33gi</link>
      <guid>https://dev.to/kingsleyonoh/why-i-designed-the-whole-system-around-kafka-and-then-deployed-without-it-33gi</guid>
      <description>&lt;p&gt;The Redpanda container was configured for 256MB of RAM. On a local dev machine with 32GB, that's nothing. On the production VPS with 1GB total, it was a quarter of the entire budget before the notification service had processed a single event.&lt;/p&gt;

&lt;p&gt;I had designed the system from day one around Kafka event consumption. The PRD specified it. The consumer module was built and tested. KafkaJS handled message parsing, schema validation, and automatic reconnection. The architecture was clean: events arrive on a Kafka topic, the consumer matches rules, the pipeline processes notifications. Decoupled. Async. Textbook.&lt;/p&gt;

&lt;p&gt;Then I tried to deploy it.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Memory Problem
&lt;/h2&gt;

&lt;p&gt;The VPS runs multiple services behind Traefik: PostgreSQL, the notification hub, and other portfolio projects sharing the same machine. Traefik takes memory. PostgreSQL takes memory. The notification hub's container was limited to 128MB in the production docker-compose file, which was comfortable for the Fastify process alone.&lt;/p&gt;

&lt;p&gt;Redpanda, even in its minimal single-core configuration (&lt;code&gt;--smp 1 --memory 256M --overprovisioned&lt;/code&gt;), needs 150-200MB resident just to maintain the broker state. That's not under load. That's idle.&lt;/p&gt;

&lt;p&gt;The arithmetic didn't work. Adding Redpanda would push total memory past the VPS limit. I could have upgraded the VPS, but spending more on infrastructure to run a message broker that processes events arriving via HTTP felt like the wrong tradeoff. The events come from other services hitting &lt;code&gt;POST /api/events&lt;/code&gt;. They're already HTTP. Publishing them to Kafka just to consume them from Kafka on the same machine adds latency, memory overhead, and a failure point, all for a round-trip that happens within a single process.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Question
&lt;/h2&gt;

&lt;p&gt;Was Kafka the right tool for this deployment, or had I designed for an architecture I couldn't afford to run?&lt;/p&gt;

&lt;p&gt;The Kafka consumer provides real value in two scenarios: when events arrive from external systems that produce to Kafka natively, and when the notification hub scales horizontally across multiple instances that need a shared event stream. Neither scenario existed. Events arrived via HTTP. The hub ran on one container. Kafka was solving a problem I didn't have yet.&lt;/p&gt;

&lt;p&gt;But I didn't want to rip Kafka out entirely. The consumer was tested. The topic subscription pattern worked. If the deployment ever moved to a larger machine, or if an external system needed to produce events directly to a Kafka topic, the path was already built. Deleting the consumer code would mean rebuilding it later.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Flag
&lt;/h2&gt;

&lt;p&gt;The solution was a single environment variable: &lt;code&gt;USE_KAFKA&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;In &lt;code&gt;src/config.ts&lt;/code&gt;, the flag is a string enum that transforms to a boolean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;USE_KAFKA&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;true&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;false&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]).&lt;/span&gt;&lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;true&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;transform&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;val&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;val&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;true&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When &lt;code&gt;USE_KAFKA&lt;/code&gt; is true, the server starts a KafkaJS consumer that subscribes to the configured topic pattern, validates incoming messages against a Zod schema, looks up the tenant, matches rules, and feeds events into the notification pipeline. This is the path the system was designed for.&lt;/p&gt;

&lt;p&gt;When &lt;code&gt;USE_KAFKA&lt;/code&gt; is false, the &lt;code&gt;POST /api/events&lt;/code&gt; route handler processes events inline. Instead of publishing to Kafka and waiting for the consumer to pick them up, it calls &lt;code&gt;matchRules()&lt;/code&gt; and &lt;code&gt;processNotification()&lt;/code&gt; directly within the same HTTP request. The event goes in, the pipeline runs, and the response comes back with a count of rules matched.&lt;/p&gt;

&lt;p&gt;The critical design choice: both paths converge on the same pipeline function. &lt;code&gt;processNotification()&lt;/code&gt; in &lt;code&gt;src/processor/pipeline.ts&lt;/code&gt; doesn't know or care whether the event arrived via Kafka or via direct HTTP. It receives a validated event, a matched rule, a resolved recipient, and a config object. It runs the same eight-step cascade either way: resolve delivery address, check opt-out, check deduplication, check quiet hours, check digest mode, render template, insert notification record, dispatch to channel.&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Codepaths in One Route
&lt;/h2&gt;

&lt;p&gt;The dual-mode logic lives in &lt;code&gt;src/api/events.routes.ts&lt;/code&gt;. The route handler checks the flag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;useKafka&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;kafkaBrokers&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;kafkaTopics&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;publishEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;kafkaBrokers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;kafkaTopics&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;tenant_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;tenantId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;event_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;toISOString&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;matchRules&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tenantId&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;event_type&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="k"&gt;for &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;rule&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;rules&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;recipient&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;resolveRecipient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
      &lt;span class="nx"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;recipientType&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;recipientValue&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;payload&lt;/span&gt;
    &lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;recipient&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;processNotification&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;db&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;rule&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;recipient&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;config&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In Kafka mode, the route publishes and returns immediately. The consumer handles processing asynchronously. In direct mode, the route does everything synchronously. The HTTP response waits for the full pipeline, including template rendering and channel dispatch, to complete before returning.&lt;/p&gt;

&lt;p&gt;This tradeoff matters. Direct mode is simpler to debug (the response tells you exactly how many rules fired and whether processing succeeded) but blocks the HTTP response until all notifications are dispatched. If Resend takes 500ms per email and an event triggers 5 rules, the caller waits 2.5 seconds. At higher volume, that latency would stack up.&lt;/p&gt;

&lt;p&gt;For now, event volume is low enough that direct processing completes in milliseconds. The day it doesn't, the fix is one environment variable change and a Redpanda container added to the compose file.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Got Wrong
&lt;/h2&gt;

&lt;p&gt;I designed the Kafka consumer first and the direct processing path second. If I were building this again, I'd invert that order.&lt;/p&gt;

&lt;p&gt;The direct path is simpler, easier to test, and sufficient for the scale the system actually operates at. Building the Kafka consumer first meant I spent time on KafkaJS connection management, topic subscription patterns, schema validation for messages, and reconnection logic before the system had processed a single real notification. All of that code works and is tested (6 tests on the Kafka event schema alone), but none of it runs in production.&lt;/p&gt;

&lt;p&gt;Building the simple path first would have been smarter: deploy the direct HTTP pipeline, prove the notification routing works, and then add the Kafka consumer when the system needs async processing. Instead, I built the complex path first and the simple path as a fallback.&lt;/p&gt;

&lt;p&gt;The code quality didn't suffer. Both paths share the same pipeline, and the pipeline is where the real complexity lives (eight processing steps, five possible skip/hold states, two digest routing paths). But I spent roughly a day on Kafka infrastructure that currently sits behind a &lt;code&gt;false&lt;/code&gt; flag.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Survives the Flag
&lt;/h2&gt;

&lt;p&gt;The pipeline itself is flag-agnostic. The eight-step cascade in &lt;code&gt;processNotification()&lt;/code&gt; runs identically regardless of how the event arrived:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Resolve delivery address. If the recipient is a raw email (contains &lt;code&gt;@&lt;/code&gt;) or phone number (starts with &lt;code&gt;+&lt;/code&gt;), use it directly. Otherwise, look up the user's preferences in &lt;code&gt;user_preferences&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Check opt-out. Users can opt out per channel, per event type, or from everything via &lt;code&gt;optOut&lt;/code&gt; JSONB.&lt;/li&gt;
&lt;li&gt;Check deduplication. Same &lt;code&gt;event_id&lt;/code&gt; + recipient + channel within the configured window (60 minutes default) gets skipped.&lt;/li&gt;
&lt;li&gt;Check quiet hours. If the user has quiet hours set and the current time falls within their window (timezone-aware), hold the notification or queue it for digest.&lt;/li&gt;
&lt;li&gt;Check digest mode. If the user has opted into digest batching, queue the notification with a &lt;code&gt;scheduledFor&lt;/code&gt; timestamp computed from their schedule preference.&lt;/li&gt;
&lt;li&gt;Render the Handlebars template using the event payload as context.&lt;/li&gt;
&lt;li&gt;Insert the notification record into PostgreSQL with status &lt;code&gt;pending&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Dispatch to the appropriate channel and update status to &lt;code&gt;sent&lt;/code&gt; or &lt;code&gt;failed&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each step can bail out with a specific skip reason: &lt;code&gt;no_delivery_address&lt;/code&gt;, &lt;code&gt;opt_out&lt;/code&gt;, &lt;code&gt;deduplicated&lt;/code&gt;, &lt;code&gt;held&lt;/code&gt; (quiet hours), or &lt;code&gt;queued_digest&lt;/code&gt;. Every bail-out is logged and recorded in the notifications table. No event disappears silently.&lt;/p&gt;

&lt;p&gt;Kafka is invisible to the pipeline. The Kafka consumer doesn't know about quiet hours or digest batching. The channel dispatcher doesn't know about deduplication. Each layer has one job and knows nothing about the layers around it. That separation is what makes the &lt;code&gt;USE_KAFKA&lt;/code&gt; flag possible. The ingestion layer can be swapped without touching anything downstream.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Principle
&lt;/h2&gt;

&lt;p&gt;Design for the architecture you want. Deploy for the constraints you have.&lt;/p&gt;

&lt;p&gt;The Kafka consumer is not dead code. It's a capability that's ready when the deployment supports it. The direct processing path is not a hack. It's the correct choice for a system running on 128MB. Both codepaths exist because the constraint was known before deployment, not discovered after a production incident.&lt;/p&gt;

&lt;p&gt;The system runs on a 1GB VPS, serves real tenants, and processes events in single-digit milliseconds. Redpanda is one environment variable and one container away. Until the event volume justifies the memory cost, it stays off.&lt;/p&gt;

</description>
      <category>typescript</category>
      <category>fastify</category>
      <category>kafka</category>
      <category>websocket</category>
    </item>
    <item>
      <title>I Built a Knowledge Graph Into the Retrieval Pipeline and Then Dropped It in Production</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Mon, 30 Mar 2026 23:00:00 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/i-built-a-knowledge-graph-into-the-retrieval-pipeline-and-then-dropped-it-in-production-14fd</link>
      <guid>https://dev.to/kingsleyonoh/i-built-a-knowledge-graph-into-the-retrieval-pipeline-and-then-dropped-it-in-production-14fd</guid>
      <description>&lt;p&gt;The vector search returned seven chunks about "database indexing strategies" for a query about "machine learning model training." All seven had cosine similarity scores above 0.72. All seven were confidently, precisely wrong.&lt;/p&gt;

&lt;p&gt;This is the failure mode that nobody warns you about when you build a RAG system on pure vector search. Embeddings capture semantic proximity, not semantic correctness. "Database indexing" and "model training" both live in the same neighborhood of the embedding space because they co-occur in the same documents, the same blog posts, the same technical discussions. The vectors are close. The meanings are not.&lt;/p&gt;

&lt;p&gt;I had three options. Fine-tune the embedding model (expensive, slow, and the problem would resurface with every new document domain). Raise the similarity threshold from 0.7 to 0.85 (which would kill recall on legitimate queries). Or add a second retrieval signal that doesn't rely on vector proximity at all.&lt;/p&gt;

&lt;p&gt;I chose the third option, and then I added a third signal on top of that. The final retrieval pipeline combines three independent scores: vector similarity at 70%, keyword overlap at 20%, and knowledge graph relevance at 10%. Each signal catches failures the other two miss.&lt;/p&gt;
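&lt;p&gt;Concretely, the blend is a weighted sum of the three normalized signals. A sketch, with hypothetical weight and function names; only the 70/20/10 split comes from the system itself:&lt;/p&gt;

```python
# Hypothetical weight names; the 70/20/10 split is the system's actual split.
_W_VECTOR, _W_KEYWORD, _W_GRAPH = 0.7, 0.2, 0.1

def combined_score(vector_sim, keyword_overlap, graph_relevance):
    """Blend the three independent retrieval signals into one ranking score."""
    return (_W_VECTOR * vector_sim
            + _W_KEYWORD * keyword_overlap
            + _W_GRAPH * graph_relevance)
```

&lt;p&gt;Because each input lives in [0, 1] and the weights sum to 1, the combined score stays in [0, 1] without any clamping.&lt;/p&gt;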

&lt;h2&gt;
  
  
  Why Three Signals
&lt;/h2&gt;

&lt;p&gt;Vector search is good at finding semantically similar content. It fails when two topics share vocabulary without sharing meaning. "Python performance optimization" and "Python snake habitat" both contain "Python." A keyword system surfaces both while a vector system might correctly separate them, or might not, depending on the embedding model's training data.&lt;/p&gt;

&lt;p&gt;Keyword overlap is the brute-force check. If the user typed "machine learning" and the chunk contains those exact words, that is a direct signal that no embedding model can argue with. It catches the cases where vector similarity drifts into adjacent topics. The scorer calculates term overlap as a ratio:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;keyword_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_terms&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="n"&gt;candidate_terms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query_terms&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
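&lt;p&gt;As a runnable sketch, the whole signal is a set intersection. The tokenization details here (lowercasing, whitespace splitting) are my assumptions, not the platform's exact code:&lt;/p&gt;

```python
def keyword_score(query: str, candidate: str) -> float:
    """Fraction of query terms that also appear in the candidate text."""
    query_terms = set(query.lower().split())
    candidate_terms = set(candidate.lower().split())
    if not query_terms:
        return 0.0
    return len(query_terms & candidate_terms) / len(query_terms)
```

&lt;p&gt;An exact phrase hit scores 1.0; vocabulary drift ("async" vs "asynchronous") scores it down, which is exactly the correction the vector signal needs.&lt;/p&gt;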



&lt;p&gt;I chose this over BM25 because the keyword signal is a correction factor, not the primary ranker. At 20% weight, the precision difference between bag-of-words and BM25 vanishes into the final score. Adding BM25 would have meant maintaining a separate text search index alongside pgvector. One more index, one more failure mode, for a signal that carries a fifth of the weight.&lt;/p&gt;

&lt;p&gt;The knowledge graph was supposed to be the sophistication layer. During document ingestion, the pipeline extracts named entities (people, organizations, concepts) from each chunk and stores them as nodes in Neo4j, linked by &lt;code&gt;EXTRACTED_FROM&lt;/code&gt; relationships back to their source documents. When a query arrives, the system identifies entity-like terms (capitalized words longer than one character) and queries Neo4j for documents connected to those entities. If the user asks about "PostgreSQL indexing," the graph can surface chunks that mention PostgreSQL even if the embedding vectors didn't rank them high enough.&lt;/p&gt;

&lt;h2&gt;The Weight Problem&lt;/h2&gt;

&lt;p&gt;Picking the weights was the part I got wrong twice.&lt;/p&gt;

&lt;p&gt;The first version used equal weights: 0.33 for each signal. Retrieval quality dropped. Vector similarity is genuinely the best single signal for semantic search, and diluting it to 33% meant that two mediocre keyword matches could outrank a strong semantic hit. A query about "async database sessions in Python" returned chunks that happened to contain all four words scattered across an unrelated paragraph, beating a chunk that discussed the exact concept using "asynchronous" instead of "async."&lt;/p&gt;

&lt;p&gt;The second version over-corrected: 0.9 vector, 0.05 keyword, 0.05 graph. Barely different from pure vector search. The correction signals were so quiet they couldn't override a bad vector match.&lt;/p&gt;

&lt;p&gt;The final weights came from testing against the evaluation harness. The platform runs three LLM-as-judge scorers on every response: relevance (do the chunks match the query?), faithfulness (is the response grounded in the chunks?), and correctness (did it actually answer the question?). I ran the same 50 queries through all three weight configurations and compared the average evaluation scores.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;final_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;vector_score&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;keyword_score&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;graph_score&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
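&lt;p&gt;Wired into a reranker, the weighted sum is a one-liner per candidate. A minimal sketch (the candidate dict shape and function names are mine; the weights are the ones above):&lt;/p&gt;

```python
WEIGHTS = {"vector": 0.7, "keyword": 0.2, "graph": 0.1}

def rerank(candidates):
    """Sort candidate dicts (per-signal scores in [0, 1]) by the weighted sum."""
    def combined(c):
        return sum(w * c.get(sig, 0.0) for sig, w in WEIGHTS.items())
    return sorted(candidates, key=combined, reverse=True)

docs = [
    {"id": "exact-topic", "vector": 0.62, "keyword": 1.0, "graph": 0.0},  # 0.634
    {"id": "adjacent",    "vector": 0.80, "keyword": 0.0, "graph": 0.0},  # 0.560
]
```

&lt;p&gt;Here a strong keyword hit lifts a mid-range vector match past a semantically adjacent chunk: the vocabulary-drift correction in action.&lt;/p&gt;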



&lt;p&gt;At 0.7/0.2/0.1, the average relevance score improved from 0.71 (pure vector) to 0.83 (hybrid). The keyword signal caught the vocabulary drift cases. The graph signal helped occasionally but never moved the needle by more than a few points on its own.&lt;/p&gt;

&lt;p&gt;Which is exactly why I dropped it in production.&lt;/p&gt;

&lt;h2&gt;The Production Constraint&lt;/h2&gt;

&lt;p&gt;The platform runs on a DigitalOcean VPS with 1GB of RAM. The FastAPI application, PostgreSQL with pgvector, and Redis all need to fit in that envelope. Neo4j is a JVM application. The JVM alone wants 512MB of heap space before you store a single node. Running four services on 1GB means none of them have room to breathe.&lt;/p&gt;

&lt;p&gt;I had a choice: upgrade to a 2GB VPS for the graph, or architect the system so the graph layer is optional.&lt;/p&gt;

&lt;p&gt;I chose optional.&lt;/p&gt;

&lt;p&gt;The graph search module catches all connection errors and returns an empty result list instead of crashing. The reranker doesn't care whether the graph score is zero because a query didn't match any entities, or zero because Neo4j isn't running. It computes the weighted sum either way:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# When Neo4j is available:
&lt;/span&gt;&lt;span class="n"&gt;final_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;vector_score&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;keyword_score&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;graph_score&lt;/span&gt;

&lt;span class="c1"&gt;# When Neo4j is down:
&lt;/span&gt;&lt;span class="n"&gt;final_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;vector_score&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;keyword_score&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mf"&gt;0.1&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
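&lt;p&gt;The fail-open wrapper itself is only a few lines. A sketch of the pattern, where &lt;code&gt;run_graph_query&lt;/code&gt; stands in for the real Neo4j client call (the actual module's API differs):&lt;/p&gt;

```python
def graph_search(entities, run_graph_query):
    """Return graph-matched documents, or an empty list if the graph is unreachable."""
    try:
        return run_graph_query(entities)
    except ConnectionError:
        # Neo4j down, unreachable, or simply not deployed: zero signal, no crash
        return []

def neo4j_is_down(entities):
    raise ConnectionError("neo4j unreachable")
```

&lt;p&gt;Downstream, an empty list means every graph score is 0.0, and the reranker computes the same weighted sum either way.&lt;/p&gt;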



&lt;p&gt;The weights don't re-normalize. They don't need to. Within any single run, every candidate is scored on the same scale, so the relative order stays meaningful; when the graph returns zeros for everyone, the ranking simply collapses to the vector-plus-keyword ordering. Absolute scores drop, but since I only sort by rank and don't threshold on the combined score, it doesn't matter.&lt;/p&gt;

&lt;p&gt;What surprised me was that the evaluation scores barely moved. With Neo4j running locally, the average relevance across 50 test queries was 0.83. Without Neo4j, it dropped to 0.81. Two hundredths of a point. For a document corpus under 1,000 items, the knowledge graph is architecture for the future, not value for today. The entity relationships just aren't dense enough to meaningfully reshape the rankings.&lt;/p&gt;

&lt;p&gt;This would change at scale. At 50,000 documents, the embedding space gets crowded. Vector similarity starts returning more near-misses because the neighborhood density increases. That's when the graph becomes the tiebreaker: not "does this chunk mention the topic?" but "does this chunk discuss the topic in the context of entities the user has been asking about?" The architecture supports it. The current deployment doesn't need it.&lt;/p&gt;

&lt;h2&gt;The Design Principle&lt;/h2&gt;

&lt;p&gt;I've started treating optional components as a first-class architectural pattern. Not every service in the topology needs to be a hard dependency. The question isn't "does this component add value?" The question is "does the system still function without it?"&lt;/p&gt;

&lt;p&gt;For the knowledge graph, the answer was yes. For the semantic cache, also yes: if Redis goes down, the cache misses and every query hits the LLM directly. More expensive, but correct. The rate limiter also fails open: if Redis is unavailable, requests pass through without rate checks. Worse for cost, but the system stays up.&lt;/p&gt;
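&lt;p&gt;Both fail-open cases reduce to the same few lines. A sketch for the cache path (object and function names are illustrative, not the platform's):&lt;/p&gt;

```python
def cached_answer(query, cache, call_llm):
    """Treat any cache error as a miss: slower and more expensive, still correct."""
    try:
        hit = cache.get(query)
        if hit is not None:
            return hit
    except ConnectionError:
        pass  # Redis down: fall through to the LLM
    return call_llm(query)

class DownCache:
    """A cache whose backend is unreachable."""
    def get(self, query):
        raise ConnectionError("redis unreachable")
```

&lt;p&gt;The rate limiter follows the same shape, with the fall-through branch letting the request pass instead of calling the LLM.&lt;/p&gt;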

&lt;p&gt;The guardrails pipeline is the one place I didn't apply this pattern. If the input guardrails can't run (injection detection, PII scan, topic policy check), the request fails. Silently processing a potentially injected prompt because the guardrail service was down is worse than returning an error. Safety is a hard dependency. Performance optimization is not.&lt;/p&gt;

&lt;p&gt;That distinction ended up shaping the entire system's resilience model. And it started with a knowledge graph that was worth 10% of a retrieval score and zero percent of the production memory budget.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>hybridretrieval</category>
      <category>knowledgegraph</category>
      <category>neo4j</category>
    </item>
    <item>
      <title>The Matching Problem No One Talks About</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Sun, 22 Mar 2026 23:00:00 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/the-matching-problem-no-one-talks-about-3cee</link>
      <guid>https://dev.to/kingsleyonoh/the-matching-problem-no-one-talks-about-3cee</guid>
      <description>&lt;p&gt;Stripe records a charge on Monday at 23:47 UTC. The bank settles it on Tuesday. PayPal timestamps its version to the payer's local timezone. The three records describe one event, but no two of them agree on the date.&lt;/p&gt;

&lt;p&gt;This is the gap that every reconciliation system has to cross. The dates are correct within each source's frame of reference. They are also irreconcilably different when compared side by side. Any engine that uses strict equality matching will generate a false discrepancy for every cross-timezone settlement delay.&lt;/p&gt;

&lt;h2&gt;Why Exact Matching Breaks Under Load&lt;/h2&gt;

&lt;p&gt;The first version of the matcher was a lookup table. Hash on &lt;code&gt;(amount, currency, date)&lt;/code&gt;. If the hash matched, call it reconciled. It worked perfectly on synthetic test data where everything happened on the same day.&lt;/p&gt;

&lt;p&gt;In production data from a single merchant processing through Stripe, 23% of bank settlement dates differed from charge dates by exactly one calendar day. Another 4% differed by two days — weekend processing delays. The lookup table called all of them unmatched.&lt;/p&gt;

&lt;p&gt;The finance team's response: manually reconcile anything the engine missed. Which defeated the purpose of building the engine.&lt;/p&gt;

&lt;h2&gt;Scoring Instead of Searching&lt;/h2&gt;

&lt;p&gt;The replacement design treats matching as a ranking problem. Every gateway transaction is evaluated against every unmatched ledger transaction. Each evaluation runs through four independent rules that each return a confidence score:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exact Match (1.00)&lt;/strong&gt; — amount, currency, date, and counterparty all match. The counterparty comparison normalizes to lowercase and trims whitespace. If both counterparty fields are empty, the rule skips that check rather than rejecting the pair. This is deliberate: bank statements frequently omit counterparty names for internal transfers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amount + Date (0.90)&lt;/strong&gt; — amount and currency match exactly, but the date is within a configurable tolerance window. The default is 3 days. This catches settlement delays without opening the door to false positives from unrelated transactions with identical amounts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference Match (0.80)&lt;/strong&gt; — one transaction's &lt;code&gt;external_id&lt;/code&gt; appears as a substring of the other's &lt;code&gt;description&lt;/code&gt; field. This catches the pattern where a bank statement's narrative reads "STRIPE CHARGE ch_3N7kR..." and the Stripe record has &lt;code&gt;external_id: ch_3N7kR...&lt;/code&gt;. Both values are lowercased before comparison.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fuzzy Amount (0.75)&lt;/strong&gt; — amounts are within a percentage tolerance of each other. This handles fee variance — a payment gateway may deduct processing fees before reporting the net amount, while the bank records the gross. The tolerance is configurable (default: 2%).&lt;/p&gt;

&lt;p&gt;The scorer evaluates all four rules against each pair and takes the highest confidence. Below a minimum threshold (default: 0.70), the pair is discarded.&lt;/p&gt;

&lt;p&gt;The core of the scorer in Go:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="c"&gt;// ScoreAll evaluates a source transaction against all candidates&lt;/span&gt;
&lt;span class="c"&gt;// and returns matches sorted by confidence (highest first).&lt;/span&gt;
&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Scorer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;ScoreAll&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;source&lt;/span&gt; &lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Transaction&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;domain&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Transaction&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;ScoredMatch&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;&lt;span class="n"&gt;ScoredMatch&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;cand&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;candidates&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="n"&gt;mc&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;MatchCandidate&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;Source&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Target&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;cand&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;ScoredMatch&lt;/span&gt;&lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="k"&gt;range&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;rules&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;conf&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;mc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;conf&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Confidence&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="n"&gt;best&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ScoredMatch&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="n"&gt;GatewayTxID&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt; &lt;span class="n"&gt;source&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;LedgerTxID&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;cand&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;Confidence&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;  &lt;span class="n"&gt;conf&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="n"&gt;MatchRule&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;   &lt;span class="n"&gt;rule&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Name&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;minConfidence&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;best&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;sort&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Slice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;j&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="kt"&gt;bool&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;j&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Confidence&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;MatchCandidate&lt;/code&gt; struct is a value type. It sits on the stack for every iteration of the inner loop. The garbage collector never sees it.&lt;/p&gt;

&lt;h2&gt;What Made This Hard&lt;/h2&gt;

&lt;p&gt;The O(n &amp;times; m) evaluation creates a combinatorial problem. For a merchant with 5,000 gateway transactions and 5,000 ledger transactions in a window, that is 25 million pair evaluations. Each evaluation runs four rule checks.&lt;/p&gt;

&lt;p&gt;Three decisions keep this tractable:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Value types in the inner loop.&lt;/strong&gt; The &lt;code&gt;MatchCandidate&lt;/code&gt; struct is allocated on the stack, not the heap. No pointer indirection. The garbage collector never sees the scoring artifacts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Early exit per rule.&lt;/strong&gt; Each rule checks currency first. If currencies differ, it returns immediately without evaluating amount or date. Currency mismatches are the most common short-circuit. A merchant processing in both EUR and USD filters out half the candidate pool in the first comparison.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Map-based exclusion.&lt;/strong&gt; Once a transaction is matched, it is added to a &lt;code&gt;map[string]bool&lt;/code&gt; and excluded from future candidate lists. The effective search space shrinks linearly as matches are confirmed. In practice, high-confidence matches exit the pool early, leaving only genuinely ambiguous pairs for the lower-confidence rules.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;The Dedup Invariant&lt;/h2&gt;

&lt;p&gt;The ingester guarantees at-most-once persistence through a two-tier dedup check. The dedup key is &lt;code&gt;sha256(source_id + external_id)&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Redis &lt;code&gt;GET dedup:{key}&lt;/code&gt; checks first. If the key exists, the transaction is a known duplicate and never reaches PostgreSQL. If Redis is unavailable (connection error, timeout, cold restart), the check falls through to a &lt;code&gt;SELECT ... WHERE dedup_key = ?&lt;/code&gt; against the transactions table. If that also returns empty, the insert uses &lt;code&gt;ON CONFLICT DO NOTHING&lt;/code&gt; on the dedup_key column — catching the race condition where two goroutines pass the dedup check simultaneously.&lt;/p&gt;

&lt;p&gt;This design means Redis is a performance optimization, not a correctness requirement. The system is correct without it. It is just slower.&lt;/p&gt;
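&lt;p&gt;The control flow, sketched in Python for brevity (the engine itself is Go, and the &lt;code&gt;redis_get&lt;/code&gt;/&lt;code&gt;db_exists&lt;/code&gt; helpers here are stand-ins for the real clients):&lt;/p&gt;

```python
import hashlib

def dedup_key(source_id: str, external_id: str) -> str:
    """sha256 over the concatenated IDs, as described above."""
    return hashlib.sha256((source_id + external_id).encode()).hexdigest()

def is_duplicate(key: str, redis_get, db_exists) -> bool:
    """Tier one: Redis. Tier two: the database. The INSERT itself still
    carries ON CONFLICT DO NOTHING as the final guard against races."""
    try:
        if redis_get(f"dedup:{key}") is not None:
            return True  # fast path: known duplicate
    except ConnectionError:
        pass  # Redis unavailable: fall through to the database check
    return db_exists(key)
```

&lt;p&gt;Note that a Redis failure never answers the question; it only demotes the check to the slower tier.&lt;/p&gt;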

&lt;h2&gt;What Changes When the Matcher Meets Bank File Formats&lt;/h2&gt;

&lt;p&gt;CAMT.053 is the ISO 20022 standard for bank statement reporting across European banks. It uses XML with deeply nested structures. The amount sits in an &lt;code&gt;Amt&lt;/code&gt; element with a &lt;code&gt;Ccy&lt;/code&gt; attribute. The direction is encoded as &lt;code&gt;CRDT&lt;/code&gt; or &lt;code&gt;DBIT&lt;/code&gt; in the &lt;code&gt;CdtDbtInd&lt;/code&gt; field. The counterparty is nested inside &lt;code&gt;RltdPties&lt;/code&gt;, where the adapter picks &lt;code&gt;Dbtr.Nm&lt;/code&gt; for credits and &lt;code&gt;Cdtr.Nm&lt;/code&gt; for debits.&lt;/p&gt;

&lt;p&gt;The adapter parses this into the same canonical &lt;code&gt;IngestRequest&lt;/code&gt; that Stripe and PayPal produce. From the engine's perspective, a CAMT.053 entry and a Stripe balance transaction are identical — same struct, same fields, same confidence-scored matching. The format complexity is absorbed entirely at the adapter boundary.&lt;/p&gt;

</description>
      <category>go</category>
      <category>reconciliation</category>
      <category>scoringalgorithms</category>
      <category>datamatching</category>
    </item>
    <item>
      <title>Why I Split Minutes Prediction Into Two Models Instead of One</title>
      <dc:creator>Kingsley Onoh</dc:creator>
      <pubDate>Wed, 04 Mar 2026 10:00:05 +0000</pubDate>
      <link>https://dev.to/kingsleyonoh/why-i-split-minutes-prediction-into-two-models-instead-of-one-1epj</link>
      <guid>https://dev.to/kingsleyonoh/why-i-split-minutes-prediction-into-two-models-instead-of-one-1epj</guid>
      <description>&lt;p&gt;A single regression model trained on NBA game logs predicts that Joel Embiid will play 11 minutes in a game where he's listed as OUT. The model has never seen a confident zero. Every row in the training data has some minutes played, because the standard NBA API endpoint only returns logs for games where a player was active. The model knows what 28 minutes looks like and what 34 minutes looks like. It has no idea what zero looks like.&lt;/p&gt;

&lt;p&gt;This is the root problem behind the two-stage minutes engine in CourtVision.&lt;/p&gt;

&lt;h2&gt;The Training Data Gap&lt;/h2&gt;

&lt;p&gt;The NBA API's &lt;code&gt;PlayerGameLog&lt;/code&gt; endpoint returns one row per game for every game a player appeared in. If Embiid sits, there's no row. If Tyrese Maxey plays 38 minutes, there's a row. The dataset is survivor-biased: it only contains games where players actually played.&lt;/p&gt;

&lt;p&gt;Train a regressor on this dataset, feed it features for a player who's clearly going to sit, and the model interpolates. It finds the nearest neighborhood in the feature space and returns a plausible minutes number, never zero. For a scenario engine that needs to know whether a player will play before predicting how many minutes they'll get, this is a fundamental flaw.&lt;/p&gt;

&lt;p&gt;The fix was ingest-side, not model-side. The ingestion pipeline in &lt;code&gt;ingest.py&lt;/code&gt; creates synthetic zero-minute rows for every player on a team's roster who doesn't appear in the game log. If the Sacramento Kings played on January 15th and De'Aaron Fox's PlayerID doesn't appear in the API response, the pipeline inserts a row: &lt;code&gt;PlayerID=Fox, GameID=..., Minutes=0, PTS=0, REB=0, AST=0, Status=INACTIVE&lt;/code&gt;. This gives the classifier real negative examples to learn from.&lt;/p&gt;
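&lt;p&gt;The backfill itself is a small set difference. A simplified sketch (the column names follow the article; the roster and log structures are reduced to essentials and not the real &lt;code&gt;ingest.py&lt;/code&gt;):&lt;/p&gt;

```python
import pandas as pd

def backfill_dnp(game_log: pd.DataFrame, roster: list, game_id: str) -> pd.DataFrame:
    """Add an explicit zero-minute row for every rostered player missing from the log."""
    played = set(game_log["PlayerID"])
    dnp_rows = [{"PlayerID": p, "GameID": game_id, "Minutes": 0,
                 "PTS": 0, "REB": 0, "AST": 0, "Status": "INACTIVE"}
                for p in roster if p not in played]
    return pd.concat([game_log, pd.DataFrame(dnp_rows)], ignore_index=True)

log = pd.DataFrame([{"PlayerID": "Sabonis", "GameID": "g1", "Minutes": 36,
                     "PTS": 22, "REB": 12, "AST": 7, "Status": "ACTIVE"}])
full = backfill_dnp(log, roster=["Sabonis", "Fox"], game_id="g1")
```

&lt;p&gt;After the backfill, the classifier sees a genuine negative example for every sit, not just an absence.&lt;/p&gt;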

&lt;h2&gt;The Two-Stage Architecture&lt;/h2&gt;

&lt;p&gt;Stage A is a &lt;code&gt;HistGradientBoostingClassifier&lt;/code&gt;. It takes the feature vector and outputs &lt;code&gt;play_probability&lt;/code&gt;, the probability that the player will appear in the game at all. The features it receives are the same ones the regressor sees: &lt;code&gt;Minutes_Avg&lt;/code&gt;, &lt;code&gt;Rest_Days&lt;/code&gt;, &lt;code&gt;Is_Home&lt;/code&gt;, &lt;code&gt;Opponent_Pace&lt;/code&gt;, &lt;code&gt;USG_Avg&lt;/code&gt;, &lt;code&gt;Games_Played&lt;/code&gt;, plus one-hot encoded player role (Star, Starter, Rotation, Bench).&lt;/p&gt;

&lt;p&gt;Stage B is a &lt;code&gt;HistGradientBoostingRegressor&lt;/code&gt;, trained only on rows where &lt;code&gt;Minutes &amp;gt; 0&lt;/code&gt;. If you train it on all rows including the synthetic zeros, the zeros dominate: there are more DNP rows than active rows for bench players, and the regressor learns to underpredict minutes for everyone.&lt;/p&gt;

&lt;p&gt;At inference, the pipeline calls both models. If the classifier returns &lt;code&gt;play_probability &amp;gt;= 0.5&lt;/code&gt;, the regressor's output becomes the minutes prediction. If it's below 0.5, minutes are set to zero and the stats predictor never runs. The code for this sits in &lt;code&gt;predict_minutes()&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;play_proba&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;classifier&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict_proba&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)[:,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;minutes_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;regressor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;X&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;final_pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;where&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;play_proba&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;minutes_pred&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There's a second filter downstream that I didn't anticipate needing. The &lt;code&gt;predict_player_performance()&lt;/code&gt; method in &lt;code&gt;ScenarioEngine&lt;/code&gt; has a ghost player check: if the regressor predicts fewer than 20 minutes, the player gets dropped from the output entirely. Without this, the prediction sheets were cluttered with bench players projected for 12-14 minutes and 4.2 points. Valid predictions technically, but not useful for prop betting. The 20-minute cutoff was chosen empirically. Below 20 minutes, the MAE on points prediction climbs sharply because low-minute players have high variance: they might score 2 or they might score 14 depending on whether the game is a blowout.&lt;/p&gt;

&lt;h2&gt;The Leakage Problem&lt;/h2&gt;

&lt;p&gt;Every rolling statistic in the feature set uses &lt;code&gt;.expanding().mean().shift(1)&lt;/code&gt;. This is the single most important design decision in the feature engineering module, and I almost got it wrong.&lt;/p&gt;

&lt;p&gt;The expanding window calculates the cumulative average of all games up to the current row. The &lt;code&gt;.shift(1)&lt;/code&gt; moves the result down one row, so game N's features contain the average of games 1 through N-1, never game N itself. Without the shift, the model trains on features that include the target game's outcome. It looks like a strong model during training. It fails completely in production because you never have the current game's stats when making a prediction before tipoff.&lt;/p&gt;

&lt;p&gt;I added a validation function (&lt;code&gt;validate_no_leakage&lt;/code&gt;) that takes a specific player, picks game 5, manually calculates the average of games 0-4, and compares it against the &lt;code&gt;PTS_Avg&lt;/code&gt; feature value for game 5. If they don't match within 0.01, the function prints a leakage warning. It caught a bug in the first version where I'd applied the shift before the expanding window instead of after.&lt;/p&gt;
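&lt;p&gt;The core of that check fits in a few lines. A minimal reproduction (the series values are made up):&lt;/p&gt;

```python
import pandas as pd

def expanding_avg_no_leak(s: pd.Series) -> pd.Series:
    """Game N's feature is the mean of games 0..N-1; game 0 gets NaN, never its own value."""
    return s.expanding().mean().shift(1)

pts = pd.Series([10, 20, 30, 40, 50, 60])
feat = expanding_avg_no_leak(pts)
```

&lt;p&gt;The feature for game 5 equals the mean of games 0 through 4, and the first game has no feature at all, which is exactly what a pre-tipoff prediction can know.&lt;/p&gt;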

&lt;p&gt;The usage rate calculation has the same shift:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USG_Avg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;groupby&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PlayerID&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;USG&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                 &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;expanding&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                 &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;mean&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                 &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;reset_index&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;drop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                 &lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;shift&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;USG&lt;/code&gt; is an approximation: &lt;code&gt;(FGA + 0.44 * FTA) / Minutes * 48&lt;/code&gt;, capped at 50% to prevent division artifacts from very short appearances. The 0.44 coefficient is a standard basketball analytics adjustment that accounts for and-one free throws and technical free throws not being possessions.&lt;/p&gt;

&lt;h2&gt;Where the Usage Vacuum Fits&lt;/h2&gt;

&lt;p&gt;The scenario engine's core premise is that when a high-usage player sits, their possessions don't disappear. They redistribute to teammates based on position. If Embiid (Big, ~32% usage) is OUT, his teammates in the Big position group absorb 60% of that missing usage. Guards and Wings split the remaining 40%.&lt;/p&gt;

&lt;p&gt;This redistribution feeds back into the feature vector. The &lt;code&gt;usg_avg&lt;/code&gt; field for each remaining player is boosted by the appropriate share of the missing player's usage. The stats predictor then takes the boosted usage and the projected minutes (also affected, because starters typically play more minutes when a star sits) and generates new PTS, REB, AST predictions.&lt;/p&gt;
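&lt;p&gt;A minimal sketch of that redistribution step, assuming players are represented as dicts with a &lt;code&gt;pos&lt;/code&gt; and a &lt;code&gt;usg_avg&lt;/code&gt; field. The 60/40 weights match the description above, but the function shape is illustrative rather than the actual &lt;code&gt;generate_injury_scenario()&lt;/code&gt; signature:&lt;/p&gt;

```python
# Redistribute a missing player's usage: 60% to teammates in the same
# position group, 40% split across the other groups. The weights are
# the hardcoded league-wide approximation described above.
def redistribute_usage(missing_usg, missing_pos, players):
    same = [p for p in players if p["pos"] == missing_pos]
    others = [p for p in players if p["pos"] != missing_pos]
    boosted = []
    for p in players:
        if p["pos"] == missing_pos:
            boost = missing_usg * 0.60 / len(same) if same else 0.0
        else:
            boost = missing_usg * 0.40 / len(others) if others else 0.0
        boosted.append({**p, "usg_avg": p["usg_avg"] + boost})
    return boosted
```

&lt;p&gt;The boosted &lt;code&gt;usg_avg&lt;/code&gt; values then flow into the feature vectors that the stats predictor consumes.&lt;/p&gt;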

&lt;p&gt;The 60/40 split is hardcoded in &lt;code&gt;generate_injury_scenario()&lt;/code&gt;. I chose this over a data-driven approach for a specific reason: the dataset is too small to reliably learn redistribution weights per team per position. With 2,847 game-player records after feature engineering, there aren't enough "star sits" examples per team to fit a regression on redistribution patterns. The 60/40 approximation comes from publicly available, league-wide basketball analytics research on usage redistribution. It's wrong in specific cases (the Warriors' motion offense distributes more evenly than a Sixers team centered on Embiid), but it's directionally correct across the league.&lt;/p&gt;

&lt;h2&gt;What Surprised Me&lt;/h2&gt;

&lt;p&gt;The classifier trains on the full dataset including synthetic zeros. The regressor trains only on games with &lt;code&gt;Minutes &amp;gt; 0&lt;/code&gt;. I expected the regressor to be the harder model to train. It wasn't. The classifier was worse because the class distribution is heavily imbalanced: most players in the dataset played in most games. The synthetic zeros for DNP players create some negative examples, but for stars and starters, there are almost no negative examples. The classifier achieves a high AUC-ROC but has poor recall on the rare "will not play" class. In practice, this doesn't matter much because the ghost player filter catches the edge cases, but it means the &lt;code&gt;play_probability&lt;/code&gt; output is overconfident for healthy starters. It says 98% when the real probability might be 92%.&lt;/p&gt;
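&lt;p&gt;The data split described here reduces to a single mask. The column names below match the ones used elsewhere in this post, but the toy values are illustrative:&lt;/p&gt;

```python
import pandas as pd

# Classifier sees every row, with "played" derived from Minutes;
# the regressor sees only rows where the player actually played.
df = pd.DataFrame({
    "Minutes": [34, 0, 28, 0, 12],
    "USG_Avg": [31.0, 31.0, 18.5, 18.5, 12.0],
    "PTS":     [29, 0, 14, 0, 6],
})
clf_X = df[["USG_Avg"]]
clf_y = (df["Minutes"] > 0).astype(int)   # full dataset, synthetic zeros included
reg_rows = df[df["Minutes"] > 0]          # regressor: played games only
```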

&lt;p&gt;The other surprise was the MAE on the stats predictor. I expected PTS to be harder to predict than REB or AST because scoring has higher variance. The opposite happened in relative terms: PTS MAE was 5-6 points, while REB and AST MAE hovered in a similar absolute range despite those stats operating on much smaller scales. The stats model is a &lt;code&gt;MultiOutputRegressor&lt;/code&gt; wrapping three separate &lt;code&gt;HistGradientBoostingRegressor&lt;/code&gt; instances. Minutes is the dominant feature for all three targets. Once you get minutes roughly right, the stats predictions follow within a band. The Opponent_Pace feature barely moved the needle, partly because I'm using a placeholder value of 100.0 for every game instead of the actual opponent pace. Fixing that is on the list.&lt;/p&gt;

&lt;p&gt;If I were starting this system over, I'd build the classifier and regressor as a single pipeline with scikit-learn's &lt;code&gt;Pipeline&lt;/code&gt; class instead of running them as separate scripts with separate model files saved to disk. The current approach (train each model in isolation, serialize to pickle, load both at inference) works but adds unnecessary file I/O and version coupling. A combined pipeline would guarantee that both models are always trained on the same feature set from the same data split.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>data</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
