Webhooks Are Broken by Design — So I Built a Fix

If you've ever integrated a third-party service, you've dealt with webhooks. Payment processed? Webhook. New subscriber? Webhook. File uploaded? Webhook.

They feel simple. A POST request hits your server and you handle it. Done.

Except it's not that simple. Not even close.

The Problem Nobody Talks About

Webhooks are a "fire and forget" mechanism. The sender makes one HTTP request to your endpoint. If that request fails — your server is down, restarting, overloaded, or just returned a 500 — most senders either give up or retry a handful of times with no real strategy.

And you never know it happened.

No error in your dashboard. No alert. The event is just gone.

This is a fundamental design flaw that affects every system using webhooks:

- E-commerce platforms: an order.paid webhook drops, and your fulfillment system never triggers
- CI/CD pipelines: a push event is missed, and your deployment never kicks off
- SaaS integrations: a subscription.cancelled webhook fails, and you keep charging a customer who already left
- IoT and data pipelines: sensor events silently disappear under load

The downstream consequences can be severe: lost revenue, broken workflows, angry customers, and hours of debugging with no clear trail.

Why This Is Hard to Solve on Your Side

The natural reaction is "I'll just make my endpoint more reliable." But that only solves half the problem.

Even with 99.9% uptime, you'll have:

- Planned deployments (your server restarts)
- Database connection spikes
- Cold starts on serverless functions
- Network blips between the sender and your server

And here's the thing: you don't control the sender. Stripe, GitHub, Shopify: they decide how many retries they do and when. Some retry 3 times over an hour. Some retry once. Some don't retry at all.

You're building your system around delivery guarantees you don't actually have.

What a Real Solution Looks Like

The right fix is a reliability layer that sits between the sender and your application:

1. Accept the webhook immediately: store the raw payload, then return 200 (answer with a 5xx only if the payload could not be stored, so the sender tries again)
2. Queue async delivery to your actual endpoint
3. Retry with increasing, roughly exponential backoff: 30s, 5min, 30min, 2h, 24h
4. Track every attempt: status codes, errors, timestamps
5. Let you inspect and manually retry anything that failed
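
To make steps 3 and 4 concrete, here is a minimal in-memory sketch using only the Python standard library. The names `Event`, `Attempt`, `next_retry_delay`, and `record_attempt` are made up for illustration, and the `send` callable stands in for whatever HTTP client you use. It assumes a 2xx response means success and that the five delays are applied after the first, second, third, fourth, and fifth failed attempts. Treat it as an illustration of the idea, not production code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone
from typing import Callable, Optional

# Delay before each retry, taken from the schedule in step 3.
RETRY_DELAYS = [
    timedelta(seconds=30),
    timedelta(minutes=5),
    timedelta(minutes=30),
    timedelta(hours=2),
    timedelta(hours=24),
]


@dataclass
class Attempt:
    number: int
    at: datetime
    status_code: Optional[int]  # None when no HTTP response came back
    error: Optional[str]


@dataclass
class Event:
    payload: bytes
    attempts: list = field(default_factory=list)
    delivered: bool = False


def next_retry_delay(failed_attempts: int) -> Optional[timedelta]:
    """Delay before the next retry, or None once the schedule is exhausted."""
    if failed_attempts < 1:
        raise ValueError("failed_attempts must be at least 1")
    index = failed_attempts - 1
    return RETRY_DELAYS[index] if index < len(RETRY_DELAYS) else None


def record_attempt(
    event: Event,
    send: Callable[[bytes], int],
    now: datetime,
) -> Optional[timedelta]:
    """Try one delivery, log the outcome, and return the delay before the
    next retry. Returns None when the event was delivered or when the
    schedule is exhausted (check event.delivered to tell the two apart)."""
    try:
        status, error = send(event.payload), None
    except Exception as exc:  # network errors, timeouts, and so on
        status, error = None, repr(exc)

    event.attempts.append(Attempt(len(event.attempts) + 1, now, status, error))

    if status is not None and 200 <= status < 300:
        event.delivered = True
        return None
    return next_retry_delay(len(event.attempts))
```

Once `next_retry_delay` returns `None`, automatic retries stop and the event is left with its full attempt history, which is what step 5 (inspect and retry manually) relies on.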

This decouples receipt from processing. The sender's job is done the moment your relay accepts the request. Everything after that is your problem to solve — reliably.
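
Here is a rough sketch of that split, again standard library only. SQLite and an in-process `queue.Queue` are stand-ins I picked so the example runs anywhere; a real relay would use a proper database and task queue. The function names (`open_store`, `receive`, `process_pending`) and the table layout are assumptions for illustration.

```python
import queue
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the event store and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        " id INTEGER PRIMARY KEY,"
        " body BLOB NOT NULL,"
        " processed INTEGER NOT NULL DEFAULT 0)"
    )
    conn.commit()
    return conn


def receive(conn: sqlite3.Connection, pending: queue.Queue, body: bytes) -> int:
    """Receipt path: persist the raw payload, enqueue it, acknowledge.
    Returns the HTTP status code to send back to the webhook sender."""
    try:
        with conn:  # commits on success, rolls back on error
            cur = conn.execute("INSERT INTO events (body) VALUES (?)", (body,))
    except sqlite3.Error:
        return 500  # nothing was stored, so let the sender retry
    pending.put(cur.lastrowid)
    return 200


def process_pending(conn: sqlite3.Connection, pending: queue.Queue, handler) -> int:
    """Processing path: runs independently of receipt. If handler raises,
    the exception propagates and the event stays marked unprocessed."""
    done = 0
    while True:
        try:
            event_id = pending.get_nowait()
        except queue.Empty:
            return done
        (body,) = conn.execute(
            "SELECT body FROM events WHERE id = ?", (event_id,)
        ).fetchone()
        handler(body)
        with conn:
            conn.execute("UPDATE events SET processed = 1 WHERE id = ?", (event_id,))
        done += 1
```

The point of the design is that `receive` does no business logic at all: it only has to write one row, so it can answer quickly even while the real handler is slow or down.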

Why I Built My Own

I ran into this problem on a project integrating multiple payment and subscription providers. Events were being missed. I couldn't tell if it was my server, the network, or the sender. There was no audit trail.

I looked at existing solutions. Some were too expensive for a side project. Some were too complex to self-host. Most were black boxes I couldn't modify or extend.

So I built Webhook Relay Layer — an open, self-hostable reliability platform.

The stack is straightforward: FastAPI for async webhook ingestion, Celery + Redis for the task queue and retry logic, PostgreSQL for durable event storage, and a simple dashboard to monitor everything in real time.

The core principle: no webhook ever disappears silently again.

Every event is stored on receipt. Every delivery attempt is logged. Every failure is retryable — automatically or manually. You always know what happened.
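
To give a feel for what "you always know what happened" can mean in practice, here is a hypothetical two-table layout and one audit query. This is not the project's actual schema: the table and column names are invented for the example, and SQLite stands in for PostgreSQL so the snippet runs without a server.

```python
import sqlite3

# Hypothetical schema: one row per received event, one row per delivery attempt.
SCHEMA = """
CREATE TABLE events (
    id          INTEGER PRIMARY KEY,
    received_at TEXT NOT NULL,
    body        BLOB NOT NULL
);
CREATE TABLE delivery_attempts (
    id           INTEGER PRIMARY KEY,
    event_id     INTEGER NOT NULL REFERENCES events(id),
    attempted_at TEXT NOT NULL,
    status_code  INTEGER,  -- NULL when no HTTP response was received
    error        TEXT
);
"""


def undelivered_events(conn: sqlite3.Connection) -> list:
    """Return (event_id, attempt_count, last_status_code) for every event
    that has no 2xx delivery attempt yet, including events never attempted."""
    return conn.execute(
        """
        SELECT e.id,
               COUNT(a.id),
               (SELECT status_code
                  FROM delivery_attempts
                 WHERE event_id = e.id
                 ORDER BY id DESC
                 LIMIT 1)
          FROM events e
          LEFT JOIN delivery_attempts a ON a.event_id = e.id
         GROUP BY e.id
        HAVING SUM(CASE WHEN a.status_code BETWEEN 200 AND 299
                        THEN 1 ELSE 0 END) = 0
         ORDER BY e.id
        """
    ).fetchall()
```

A query like this is what a dashboard's "failed" view or a manual retry button would sit on top of: because attempts are rows rather than log lines, the history of any event is one lookup away.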

What's Next

This is the first post in a series where I'll go deeper on: