<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mohit Kumar</title>
    <description>The latest articles on DEV Community by Mohit Kumar (@pheonix_mk_e0ecc0233ababe).</description>
    <link>https://dev.to/pheonix_mk_e0ecc0233ababe</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2340546%2F6ff64ea6-529a-4cd5-a9bd-ab67a556e291.png</url>
      <title>DEV Community: Mohit Kumar</title>
      <link>https://dev.to/pheonix_mk_e0ecc0233ababe</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pheonix_mk_e0ecc0233ababe"/>
    <language>en</language>
    <item>
      <title>Building a Reliable Webhook Delivery System: What Actually Broke and How I Fixed It</title>
      <dc:creator>Mohit Kumar</dc:creator>
      <pubDate>Tue, 23 Jun 2026 05:24:48 +0000</pubDate>
      <link>https://dev.to/pheonix_mk_e0ecc0233ababe/building-a-reliable-webhook-delivery-system-what-actually-broke-and-how-i-fixed-it-l74</link>
      <guid>https://dev.to/pheonix_mk_e0ecc0233ababe/building-a-reliable-webhook-delivery-system-what-actually-broke-and-how-i-fixed-it-l74</guid>
      <description>&lt;p&gt;Webhooks seem simple until a worker crashes mid-delivery, a subscriber goes down for an hour, or a payload gets tampered with in transit.&lt;/p&gt;

&lt;p&gt;Here's what I built to handle those failures, using FastAPI, PostgreSQL, and Redis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The core problems I solved:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Synchronous delivery blocks everything&lt;/strong&gt;&lt;br&gt;
The naive approach calls the subscriber URL inline, so one slow endpoint stalls the whole ingest path. Fix: persist the event, return &lt;code&gt;202 Accepted&lt;/code&gt; immediately, and deliver asynchronously.&lt;/p&gt;
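&lt;p&gt;A minimal sketch of that accept-then-deliver split, not the real service. The dict and &lt;code&gt;queue.Queue&lt;/code&gt; are stand-ins for PostgreSQL and Redis, and the names &lt;code&gt;ingest&lt;/code&gt;, &lt;code&gt;deliver_pending&lt;/code&gt;, and &lt;code&gt;send&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import queue
import uuid

# In-memory stand-ins (assumptions): the post uses PostgreSQL for storage
# and Redis for queuing; a dict and a queue.Queue keep this sketch runnable.
EVENTS = {}
PENDING = queue.Queue()


def ingest(payload):
    """Persist the event, enqueue it, and answer without calling the subscriber."""
    event_id = str(uuid.uuid4())
    EVENTS[event_id] = {"payload": payload, "status": "PENDING"}
    PENDING.put(event_id)
    # 202 Accepted: the event is stored, delivery happens later.
    return 202, {"event_id": event_id}


def deliver_pending(send):
    """Worker side: drain the queue and call send(payload) for each event."""
    delivered = 0
    while not PENDING.empty():
        event_id = PENDING.get()
        event = EVENTS[event_id]
        send(event["payload"])
        event["status"] = "DELIVERED"
        delivered += 1
    return delivered
```

&lt;p&gt;The ingest path never touches the subscriber, so a slow endpoint only slows the worker that calls &lt;code&gt;send&lt;/code&gt;.&lt;/p&gt;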

&lt;p&gt;&lt;strong&gt;2. Workers crash and jobs disappear&lt;/strong&gt;&lt;br&gt;
If a worker dies mid-delivery, its job stays stuck in &lt;code&gt;IN_FLIGHT&lt;/code&gt; forever. Fix: a watchdog that sweeps every 30s and requeues any stale job.&lt;/p&gt;
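&lt;p&gt;One sweep of such a watchdog could look roughly like this. The post does not say when a job counts as stale, so &lt;code&gt;STALE_AFTER_S&lt;/code&gt; is an assumption, as are the &lt;code&gt;claimed_at&lt;/code&gt; field, the &lt;code&gt;PENDING&lt;/code&gt; status, and the in-memory job dict.&lt;/p&gt;

```python
STALE_AFTER_S = 120  # assumption: the post does not state a staleness threshold


def requeue_stale(jobs, now, stale_after=STALE_AFTER_S):
    """Flip IN_FLIGHT jobs claimed more than stale_after seconds ago to PENDING."""
    requeued = []
    for job_id, job in jobs.items():
        if job["status"] == "IN_FLIGHT" and now - job["claimed_at"] > stale_after:
            job["status"] = "PENDING"   # hypothetical "ready to retry" status
            job["claimed_at"] = None
            requeued.append(job_id)
    return requeued
```

&lt;p&gt;A scheduler would call &lt;code&gt;requeue_stale&lt;/code&gt; every 30s. The threshold should exceed the longest legitimate delivery time, otherwise healthy in-progress jobs get delivered twice.&lt;/p&gt;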

&lt;p&gt;&lt;strong&gt;3. Retries without backoff make things worse&lt;/strong&gt;&lt;br&gt;
Hammering a struggling subscriber after a failure makes recovery harder. Fix: exponential backoff (2s → 32s, max 5 attempts) using a Redis sorted set as a delay queue, with each score set to the next attempt's timestamp.&lt;/p&gt;
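&lt;p&gt;The schedule and the delay queue can be sketched as follows. This assumes the delays double as 2, 4, 8, 16, 32 seconds, and it uses a &lt;code&gt;heapq&lt;/code&gt; heap as a stand-in for the Redis sorted set; &lt;code&gt;backoff_delay&lt;/code&gt; and &lt;code&gt;DelayQueue&lt;/code&gt; are hypothetical names.&lt;/p&gt;

```python
import heapq

BASE_DELAY_S = 2   # first delay from the post
MAX_ATTEMPTS = 5   # attempt cap from the post


def backoff_delay(attempt):
    """Delay before attempt number `attempt` (1-based); None once the cap is passed."""
    if attempt > MAX_ATTEMPTS:
        return None
    return BASE_DELAY_S * 2 ** (attempt - 1)


class DelayQueue:
    """Heap-based stand-in for a sorted set scored by next-attempt timestamp."""

    def __init__(self):
        self._heap = []

    def schedule(self, job_id, run_at):
        # The score is the timestamp at which the job becomes due.
        heapq.heappush(self._heap, (run_at, job_id))

    def pop_due(self, now):
        """Remove and return ids of jobs whose score has been reached, earliest first."""
        due = []
        while self._heap and now >= self._heap[0][0]:
            due.append(heapq.heappop(self._heap)[1])
        return due
```

&lt;p&gt;With Redis, &lt;code&gt;schedule&lt;/code&gt; would likely map to &lt;code&gt;ZADD&lt;/code&gt; and &lt;code&gt;pop_due&lt;/code&gt; to a &lt;code&gt;ZRANGEBYSCORE&lt;/code&gt; up to the current time followed by &lt;code&gt;ZREM&lt;/code&gt;, though the post does not show the exact commands.&lt;/p&gt;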

&lt;p&gt;&lt;strong&gt;4. One dead subscriber degrades the whole system&lt;/strong&gt;&lt;br&gt;
Fix: a circuit breaker per subscription. Five consecutive failures trip it OPEN. After a 60s cooldown, a single probe tests recovery before normal delivery resumes.&lt;/p&gt;
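&lt;p&gt;A rough sketch of that breaker as a small state machine. The threshold and cooldown come from the post; the &lt;code&gt;CLOSED&lt;/code&gt; and &lt;code&gt;HALF_OPEN&lt;/code&gt; state names and the method names are assumptions borrowed from the usual circuit-breaker pattern.&lt;/p&gt;

```python
class CircuitBreaker:
    """Per-subscription breaker: CLOSED, then OPEN, then HALF_OPEN for one probe."""

    def __init__(self, failure_threshold=5, cooldown_s=60):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.state = "CLOSED"
        self.failures = 0        # consecutive failures since the last success
        self.opened_at = None

    def allow(self, now):
        """Return True if a delivery may be attempted at time `now`."""
        if self.state == "CLOSED":
            return True
        if self.state == "OPEN" and now - self.opened_at >= self.cooldown_s:
            self.state = "HALF_OPEN"  # let exactly one probe through
            return True
        return False  # still cooling down, or a probe is already in flight

    def record_success(self):
        # Any success closes the breaker and resets the failure streak.
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def record_failure(self, now):
        self.failures += 1
        # A failed probe reopens at once; otherwise trip at the threshold.
        if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = now
```

&lt;p&gt;A worker would call &lt;code&gt;allow&lt;/code&gt; before each delivery and skip or reschedule the job when it returns &lt;code&gt;False&lt;/code&gt;, so a dead subscriber stops consuming worker time.&lt;/p&gt;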

&lt;p&gt;&lt;strong&gt;5. No payload integrity&lt;/strong&gt;&lt;br&gt;
Fix: sign every payload with a per-subscription HMAC-SHA256 signature, and verify it with &lt;code&gt;hmac.compare_digest&lt;/code&gt;, a constant-time comparison that guards against timing attacks.&lt;/p&gt;
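&lt;p&gt;Signing and verification can be sketched with the standard library alone. The helper names &lt;code&gt;sign&lt;/code&gt; and &lt;code&gt;verify&lt;/code&gt; and the hex encoding are assumptions; the post does not say how the signature is encoded or which header carries it.&lt;/p&gt;

```python
import hashlib
import hmac


def sign(secret, body):
    """Return the hex HMAC-SHA256 of the raw request body (both as bytes)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(secret, body, signature):
    """Recompute the signature and compare it in constant time."""
    expected = sign(secret, body)
    return hmac.compare_digest(expected, signature)
```

&lt;p&gt;The signature should be computed over the exact bytes sent on the wire, since re-serializing the JSON on the receiving side can change whitespace or key order and break verification.&lt;/p&gt;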

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; 99.9% delivery reliability across 10,000+ daily webhooks, with full visibility via Prometheus + Grafana.&lt;/p&gt;

&lt;p&gt;Full deep-dive coming soon.&lt;/p&gt;

</description>
      <category>api</category>
      <category>backend</category>
      <category>python</category>
      <category>systemdesign</category>
    </item>
  </channel>
</rss>
