DEV Community

kirandeepjassal-crypto

Posted on • Originally published at prepstack.co.in

How We Deliver 15 Million Webhooks a Day Without Losing a Single Event

A webhook looks like the easiest feature you'll ever build: something happens, you POST it to the customer's URL. Then you ship it, and reality arrives — the endpoint is down, or slow, or returns 500, or times out, or your own process restarts mid-send. Multiply that by 15 million events a day across thousands of endpoints you don't control, and "just POST it" becomes one of the hardest reliability problems in your system.

This is the design we run on Mattrx, our multi-tenant marketing-analytics SaaS, to deliver ~15 million webhook events per day to customer-configured endpoints. The first version was a synchronous POST inside the request handler. It lost events on every deploy and turned one slow customer into an outage for everyone. This post is everything we changed, and why.

TL;DR

| Aspect     | Naive sync POST (before)           | Outbox + queue + workers (after)                 |
|------------|------------------------------------|--------------------------------------------------|
| Durability | events lost on crash/deploy        | persisted before delivery, never lost            |
| API latency| blocked on the customer's endpoint | decoupled; API p95 unaffected                    |
| Retries    | none                               | exponential backoff + jitter, 8 attempts / ~24h  |
| Isolation  | one slow customer stalls everyone  | per-tenant partitioning + concurrency caps       |
| Giving up  | fails the API call                 | dead-letter queue + circuit breaker              |
| Security   | ad hoc                             | HMAC-SHA256 signed, HTTPS, timestamped           |
| Duplicates | unhandled                          | stable event id; customers de-dupe               |
  • ~15M events/day ≈ 174/sec average; peaks reach roughly 9–10× that (~1,500–1,730/sec).
  • Outbox pattern → zero events dropped after a committed change (we used to lose thousands per deploy).
  • Decoupling kept API p95 at 120 ms.
  • First-attempt delivery ~96%; ~99.98% eventual after retries.
  • ~0.02% permanently fail → dead-letter queue → per-customer status + alert.

The one mental shift: you don't control the endpoints, so you cannot prevent failure — you can only make failure survivable. Persist before you deliver, retry with discipline, isolate the slow from the fast, and make giving up a first-class, observable outcome.

The naive approach — and why it collapses

The first version delivered the webhook inside the request that caused the event:

// BEFORE: fire the webhook synchronously, in the request path.
[HttpPost("campaigns/{id}/complete")]
public async Task<IActionResult> Complete(string id, CancellationToken ct)
{
    await campaigns.CompleteAsync(id, ct);

    var endpoint = await webhooks.GetEndpointAsync(TenantId, "campaign.completed", ct);
    using var http = new HttpClient { Timeout = TimeSpan.FromSeconds(5) };
    await http.PostAsJsonAsync(endpoint.Url, new { type = "campaign.completed", id }, ct); // blocks

    return Ok();
}

If the process restarts between CompleteAsync and the POST, the event is gone forever. The API thread waits on the customer's endpoint, so slow customers exhaust the thread pool. A customer's 500 becomes your 500. And a transient blip is a permanently missed event.

Fix 1: the Outbox Pattern — persist before you deliver

Write the event into an outbox table in the same transaction as the domain change. If it commits, the event will be delivered — later, by a separate relay. If it rolls back, the event never existed.

// AFTER: the outbox row commits atomically with the state change.
public async Task CompleteCampaignAsync(string tenantId, string campaignId, CancellationToken ct)
{
    await using var tx = await db.BeginTransactionAsync(ct);

    await campaigns.MarkCompletedAsync(campaignId, ct);

    await db.Outbox.InsertAsync(new OutboxEvent
    {
        Id = Guid.NewGuid(),          // stable event id == idempotency key
        TenantId = tenantId,
        Type = "campaign.completed",
        Payload = JsonSerializer.Serialize(new { campaignId }),
        Status = OutboxStatus.Pending,
        CreatedAt = clock.UtcNow,
    }, ct);

    await tx.CommitAsync(ct);          // state change + event, all or nothing
}

A background relay claims rows with FOR UPDATE SKIP LOCKED so multiple relay instances never grab the same row, and publishes to the queue. Result: events dropped per deploy went from thousands to zero.
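The claim step could look roughly like the sketch below. It assumes PostgreSQL (which supports FOR UPDATE SKIP LOCKED) and C# 11 raw string literals; the table name, column names, and the @batchSize / @ids parameters are hypothetical, since the post does not show the schema.

```csharp
public static class OutboxRelaySql
{
    // Hypothetical schema: outbox(id, tenant_id, type, payload, status, created_at).
    // Run inside a transaction: rows locked by one relay instance are skipped by
    // the others until that transaction commits or rolls back.
    public const string ClaimBatch = """
        SELECT id, tenant_id, type, payload
        FROM outbox
        WHERE status = 'pending'
        ORDER BY created_at
        LIMIT @batchSize
        FOR UPDATE SKIP LOCKED
        """;

    // After publishing the claimed rows to the queue, mark them in the same
    // transaction and commit, which also releases the row locks.
    public const string MarkPublished = """
        UPDATE outbox
        SET status = 'published'
        WHERE id = ANY(@ids)
        """;
}
```

If a relay crashes after publishing but before the commit, the rows stay pending and are published again on the next pass. The relay is therefore at-least-once, which is why the stable event id matters downstream.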

Fix 2: queue + dispatcher + workers — parallelism without noisy neighbours

Publish to a queue partitioned by tenantId, and run a pool of competing-consumer workers with per-tenant concurrency caps. Partitioning keeps each tenant's events in order within its partition; the caps give isolation: a broken tenant can waste at most its own slots.

var slot = await limits.AcquireAsync(msg.TenantId, maxPerTenant: 20, ct); // noisy-neighbour guard
_ = DeliverAndRelease(msg, slot, ct);                                      // don't block the receive loop

Peak concurrency sits at roughly 1,400 in-flight deliveries, in line with what Little's law predicts: 1,730 deliveries/s × ~0.8 s average delivery time ≈ 1,384 in flight.
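The per-tenant cap used in the receive loop could be built along these lines. This is a minimal in-process sketch, not the post's actual limiter: the TenantLimits class and its Slot type are hypothetical, and a multi-node deployment would need a shared or per-partition equivalent.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical limiter: one SemaphoreSlim per tenant caps that tenant's in-flight deliveries.
public sealed class TenantLimits
{
    private readonly ConcurrentDictionary<string, SemaphoreSlim> _gates = new();

    // Waits until the tenant has a free slot. Dispose the result to release the slot.
    public async Task<IDisposable> AcquireAsync(string tenantId, int maxPerTenant, CancellationToken ct)
    {
        // The cap is fixed by the first call made for each tenant.
        var gate = _gates.GetOrAdd(tenantId, _ => new SemaphoreSlim(maxPerTenant, maxPerTenant));
        await gate.WaitAsync(ct);
        return new Slot(gate);
    }

    private sealed class Slot : IDisposable
    {
        private SemaphoreSlim? _gate;

        public Slot(SemaphoreSlim gate) => _gate = gate;

        // Interlocked.Exchange ensures a double Dispose releases the slot only once.
        public void Dispose() => Interlocked.Exchange(ref _gate, null)?.Release();
    }
}
```

One caveat with this shape: awaiting AcquireAsync inside a single shared receive loop would stall every tenant while one tenant sits at its cap. The sketch assumes one receive loop per partition, which the tenant-partitioned queue makes possible.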

Fix 3: retries, exponential backoff + jitter, dead-letter queue

On a retryable failure, re-queue with a delay that grows exponentially and is jittered to avoid synchronized retry storms. After a fixed number of attempts, dead-letter it.

// Exponential backoff with EQUAL jitter (half fixed, half random), capped. 8 attempts span ~24h.
// BaseDelaySeconds and MaxDelaySeconds (6h) are constants defined elsewhere.
private static TimeSpan NextDelay(int attempt)
{
    var baseSeconds = Math.Min(BaseDelaySeconds * Math.Pow(2, attempt), MaxDelaySeconds); // cap at 6h
    var jittered = baseSeconds * (0.5 + Random.Shared.NextDouble() * 0.5);                // 50–100% of the capped delay
    return TimeSpan.FromSeconds(jittered);
}

Not every failure is retryable: a 410 Gone or 400 Bad Request means stop — dead-letter immediately. A 503, 429, timeout, or connection reset is transient — retry. Retries lift delivery from ~96% first-attempt to ~99.98% eventual; the remaining ~0.02% land in a visible, queryable dead-letter queue.
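The retry-or-stop decision can be captured in one small function. The sketch below is one possible encoding, not the post's actual code: the DeliveryOutcome enum and DeliveryPolicy class are hypothetical names, and treating 408 and every 5xx (not only 503) as transient is an assumption. Redirects and other non-2xx codes fall through to dead-letter here.

```csharp
using System.Net;

public enum DeliveryOutcome { Delivered, Retry, DeadLetter }

public static class DeliveryPolicy
{
    // status is null when no HTTP response arrived (timeout, connection reset).
    // attempt is the 1-based number of the attempt that just finished.
    public static DeliveryOutcome Classify(HttpStatusCode? status, int attempt, int maxAttempts = 8)
    {
        if (status is not null && (int)status.Value >= 200 && (int)status.Value < 300)
            return DeliveryOutcome.Delivered;

        var transient = status is null          // timeout or connection failure
            || (int)status.Value == 408         // assumption: request timeout is transient
            || (int)status.Value == 429         // rate limited, try again later
            || (int)status.Value >= 500;        // assumption: all 5xx, not only 503

        // Permanent failures (400, 410, other 4xx) and exhausted retries both dead-letter.
        return transient && attempt < maxAttempts
            ? DeliveryOutcome.Retry
            : DeliveryOutcome.DeadLetter;
    }
}
```

Keeping this as a pure function makes the policy easy to unit-test and to tune per status code without touching the delivery loop.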

Fixes 4–7: idempotency, HMAC signing, circuit breaker, observability

  • Idempotency: every event carries a stable id (X-Mattrx-Event-Id) that never changes across retries. You can't make delivery exactly-once, but you can make processing effectively-once by shipping a stable id and telling customers to key on it.
  • Security: HMAC-SHA256 over timestamp + id + body with a per-tenant secret, HTTPS only. Sign and send the raw serialized body so a proxy reformatting JSON can't break verification.
  • Circuit breaker: track consecutive failures per endpoint; after 20, auto-disable and email the owner. This reclaims retry capacity otherwise burned forever by endpoints that will never succeed (~40/day for us).
  • Observability: every attempt writes a delivery record (event id, endpoint, attempt, status, latency, outcome) powering a per-customer dashboard. At 15M/day, aggregate "99.98% delivered" hides the one tenant at 40%.
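The signing scheme from the security bullet could be sketched as follows. The exact signed-string layout ("timestamp.id.body"), the hex encoding, and the 300-second tolerance are assumptions for illustration; the post only states that the HMAC covers timestamp, id, and raw body. The sketch needs .NET 5 or later for Convert.ToHexString.

```csharp
using System;
using System.Globalization;
using System.Security.Cryptography;
using System.Text;

public static class WebhookSignature
{
    // Assumed layout of the signed string: "{timestamp}.{eventId}.{rawBody}".
    public static string Sign(byte[] secret, long unixTimestamp, string eventId, string rawBody)
    {
        var ts = unixTimestamp.ToString(CultureInfo.InvariantCulture);
        var payload = Encoding.UTF8.GetBytes(ts + "." + eventId + "." + rawBody);
        using var hmac = new HMACSHA256(secret);
        return Convert.ToHexString(hmac.ComputeHash(payload)).ToLowerInvariant();
    }

    // Receiver side: rawBody must be the exact bytes received, not re-serialized JSON.
    public static bool Verify(byte[] secret, long unixTimestamp, string eventId, string rawBody,
                              string signatureHex, long nowUnix, long toleranceSeconds = 300)
    {
        // Reject stale or far-future timestamps to limit replay of captured requests.
        if (Math.Abs(nowUnix - unixTimestamp) > toleranceSeconds) return false;

        var expected = Encoding.ASCII.GetBytes(Sign(secret, unixTimestamp, eventId, rawBody));
        var actual = Encoding.ASCII.GetBytes(signatureHex.ToLowerInvariant());

        // Constant-time comparison avoids leaking how many leading bytes matched.
        return CryptographicOperations.FixedTimeEquals(expected, actual);
    }
}
```

Because the timestamp is part of the signed string, an attacker cannot refresh an old request by editing only the timestamp: the signature would no longer match.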

The model to carry forward

At-least-once plus idempotency — never exactly-once. You cannot stop the endpoints from failing, so the whole design is about surviving their failure: persist before you deliver, isolate the slow from the fast, and make giving up a loud, observable, recoverable outcome instead of a silent drop.
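On the receiving side, "key on the event id" can be as small as the sketch below. The IdempotentReceiver class is hypothetical and keeps seen ids in memory for brevity; a real consumer would more likely use a durable store with a unique constraint on the event id.

```csharp
using System;
using System.Collections.Concurrent;

// Hypothetical receiver-side guard: runs the handler at most once per event id.
public sealed class IdempotentReceiver
{
    // In-memory for the sketch only; ids are lost on restart.
    private readonly ConcurrentDictionary<string, byte> _seen = new();

    // Returns true if the handler ran, false if the event id was a duplicate.
    public bool Handle(string eventId, Action process)
    {
        // TryAdd is atomic, so two concurrent deliveries of one id cannot both pass.
        if (!_seen.TryAdd(eventId, 0)) return false;

        try
        {
            process();
            return true;
        }
        catch
        {
            // Forget the id so a later retry can process the event again.
            _seen.TryRemove(eventId, out _);
            throw;
        }
    }
}
```

A duplicate should still be acknowledged with a 2xx response. Otherwise the sender keeps retrying an event the receiver has already handled.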

A webhook really is just a POST. Delivering fifteen million of them a day, to endpoints you don't control, without losing one — that's a distributed system, and it deserves to be designed like one.


Originally published on PrepStack. If you're designing an event/webhook delivery system and want a second pair of eyes on the failure paths, reach me at randhir.jassal@gmail.com.
