George Belsky

Posted on Mar 29

I Stopped Building Webhook Retry Logic. Here's What I Use Instead.

#webhooks #python #backend #architecture

Every backend team eventually builds the same thing: reliable message delivery between services. And every team builds it wrong at least once.

The Webhook Retry Stack

Here's what "just use webhooks" actually means in production:

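# Receiver: build an HTTP endpoint (FastAPI shown; verify_hmac, db,
# process_order, and WEBHOOK_SECRET are your own code)
import json

from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse

app = FastAPI()

@app.post("/webhooks/orders")
async def receive_order(req: Request):
    body = await req.body()  # raw bytes; the signature covers exactly these

    # Verify HMAC signature (or get spoofed)
    signature = req.headers.get("x-webhook-signature")
    if not signature or not verify_hmac(signature, body, WEBHOOK_SECRET):
        return JSONResponse({"error": "invalid signature"}, status_code=401)

    # Idempotency check (webhooks arrive twice, sometimes three times)
    idempotency_key = req.headers.get("x-idempotency-key")
    if not idempotency_key:
        return JSONResponse({"error": "missing idempotency key"}, status_code=400)
    if db.exists("processed_webhooks", idempotency_key):
        return {"status": "already processed"}

    # Note: check-then-insert is not atomic; concurrent duplicates can both pass
    process_order(json.loads(body))
    db.insert("processed_webhooks", idempotency_key)
    return {"status": "ok"}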

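# Sender: retry with backoff (sign, dlq, and alert are your own code)
import asyncio
import random

import requests
from requests.exceptions import ConnectionError, Timeout

async def send_with_retry(url, payload, max_retries=5):
    for attempt in range(max_retries):
        try:
            # requests is blocking, so run it in a worker thread
            resp = await asyncio.to_thread(
                requests.post, url, json=payload, headers=sign(payload), timeout=10
            )
            if resp.ok:
                return resp
            if resp.status_code < 500:
                break  # 4xx: resending the same request won't help
        except (ConnectionError, Timeout):
            pass  # network failure: fall through and retry
        if attempt < max_retries - 1:  # no point sleeping after the last attempt
            delay = min(2 ** attempt + random.uniform(0, 1), 300)
            await asyncio.sleep(delay)
    dlq.send(payload)  # dead letter queue
    alert(f"Webhook delivery failed (max {max_retries} attempts)")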
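The snippet above leans on two helpers it never defines, sign and verify_hmac. Here is a minimal sketch of what they might look like using only the standard library. The header name comes from the receiver code; the extra secret parameter, the canonical_body helper, and the choice of HMAC-SHA256 with a hex digest are my assumptions, not a standard.

```python
import hashlib
import hmac
import json


def canonical_body(payload: dict) -> bytes:
    # Hypothetical helper: serialize deterministically so that sender and
    # receiver compute the HMAC over exactly the same bytes.
    return json.dumps(payload, separators=(",", ":"), sort_keys=True).encode("utf-8")


def sign(payload: dict, secret: bytes) -> dict:
    # Returns the header carrying a hex-encoded HMAC-SHA256 of the body.
    digest = hmac.new(secret, canonical_body(payload), hashlib.sha256).hexdigest()
    return {"x-webhook-signature": digest}


def verify_hmac(signature: str, body: bytes, secret: bytes) -> bool:
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of a forged signature were correct.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

One catch: the signature only verifies if the bytes on the wire are the bytes that were signed. With this sketch the sender would post canonical_body(payload) as the raw request body rather than letting the HTTP client re-serialize the dict.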

And this is the simplified version. Production adds:

DLQ consumer that retries or alerts
Monitoring for delivery success rate
Alerting on DLQ depth
Cleanup cron for the idempotency table
Secret rotation for HMAC keys
Circuit breaker when receiver is down
Thundering herd protection when receiver comes back up

That's 200+ lines of infrastructure code. For every pair of services that need to talk.
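To give a feel for just one item on that list, here is a rough sketch of the idempotency table plus its cleanup job, using SQLite from the standard library. The table name matches the receiver code above; the function names, the seen_at column, and the retention window are assumptions made for illustration. The insert doubles as the duplicate check, so two concurrent deliveries of the same key cannot both pass.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_webhooks ("
        "key TEXT PRIMARY KEY, seen_at REAL NOT NULL)"
    )
    return conn


def claim(conn, key, now=None):
    """Return True only for the first caller to claim this key."""
    now = time.time() if now is None else now
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO processed_webhooks (key, seen_at) VALUES (?, ?)",
            (key, now),
        )
    return cur.rowcount == 1  # 0 rows changed means the key already existed


def purge_older_than(conn, max_age_seconds, now=None):
    """Cleanup job: delete keys older than the retention window."""
    now = time.time() if now is None else now
    with conn:
        cur = conn.execute(
            "DELETE FROM processed_webhooks WHERE seen_at < ?",
            (now - max_age_seconds,),
        )
    return cur.rowcount
```

Claiming the key before processing has its own trade-off: if the handler crashes after the claim, the retry is treated as a duplicate. A real implementation would need to release the key on failure or record a status alongside it.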

The Alternative: Let the Platform Deliver

import os

from axme import AxmeClient, AxmeClientConfig

client = AxmeClient(AxmeClientConfig(api_key=os.environ["AXME_API_KEY"]))

intent_id = client.send_intent({
    "intent_type": "intent.order.process.v1",
    "to_agent": "agent://myorg/production/order-processor",
    "payload": {
        "order_id": "ORD-2026-00142",
        "customer": "acme-corp",
        "total": 4999.50,
    },
})
result = client.wait_for(intent_id)

No webhook endpoint on the receiver (unless it opts into `http` mode). No HMAC. No idempotency table. No retry logic. No DLQ. No monitoring for delivery failures.

The platform handles at-least-once delivery on all channels.

Five Ways to Receive (Not Just Webhooks)

The receiver picks the delivery mode that fits their architecture:

Mode	Transport	Best For
`stream`	SSE (server-sent events)	Real-time agents, always-on services
`poll`	GET request	Serverless functions, cron jobs
`http`	Webhook POST	Traditional services (but platform handles retry)
`inbox`	Human queue	Approvals, reviews, manual tasks
`internal`	Platform-handled	Reminders, escalations, notifications

The sender doesn't care which mode the receiver uses. send_intent() is the same regardless.

This is the key difference from webhooks: the receiver chooses how to get messages, not the sender. The sender doesn't need to know if the receiver is a Lambda function, a Kubernetes pod, or a human with an email inbox.
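If the decoupling sounds abstract, a toy in-memory model makes it concrete. To be clear, this is not the AXME API: ToyBroker and all of its methods are hypothetical names I made up for the sketch, and it only models two of the modes (push-style stream and pull-style poll). The point is that send() has one signature no matter how the receiver registered.

```python
from collections import deque
from typing import Callable, Deque, Dict, List


class ToyBroker:
    """In-memory stand-in for a delivery platform (hypothetical, not AXME)."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[dict]] = {}
        self._subscribers: Dict[str, Callable[[dict], None]] = {}

    def register_poll(self, receiver: str) -> None:
        # Receiver chooses pull: messages wait until it asks for them.
        self._queues[receiver] = deque()

    def register_stream(self, receiver: str, on_message: Callable[[dict], None]) -> None:
        # Receiver chooses push: messages are handed over as they arrive.
        self._subscribers[receiver] = on_message

    def send(self, receiver: str, message: dict) -> None:
        # The sender calls this the same way for every receiver.
        if receiver in self._subscribers:
            self._subscribers[receiver](message)
        elif receiver in self._queues:
            self._queues[receiver].append(message)
        else:
            raise KeyError(f"unknown receiver: {receiver}")

    def poll(self, receiver: str) -> List[dict]:
        # Drain and return everything queued for a poll-mode receiver.
        queue = self._queues[receiver]
        drained = list(queue)
        queue.clear()
        return drained
```

Switching a receiver from poll to stream in this model changes one registration call on the receiver side and nothing on the sender side.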

What You Stop Building

Component	With Webhooks	With AXME
HTTP endpoint on receiver	You build it	Not needed (for stream/poll/inbox)
HMAC verification	You build it	Platform handles
Idempotency table	You build it	Built into intent lifecycle
Retry with backoff	You build it	Platform handles (configurable)
Dead letter queue	You build it	Platform handles
Delivery monitoring	You build it	Built-in lifecycle events
Secret rotation	You manage	Platform manages
Thundering herd protection	You build it	Platform handles

When Webhooks Are Still Fine

Webhooks work well when:

The receiver is always up (99.9%+ uptime)
Occasional message loss is acceptable
You only have 2-3 service pairs communicating
You already have the retry infrastructure built

Webhooks break down when:

You have 10+ services that need reliable delivery
Receivers go down for minutes/hours (deploys, incidents)
You need delivery guarantees (financial transactions, compliance)
You need human approval gates in the delivery chain
You're tired of debugging "why didn't the webhook arrive"
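The "receivers go down for minutes" point is easy to put a number on. The sketch below takes the backoff formula from the sender code earlier, min(2 ** attempt, 300), ignores the jitter (which adds at most a second per wait), and sums the waits between attempts. The function names are mine.

```python
def retry_window(max_retries: int, cap: int = 300) -> int:
    """Seconds between the first and last attempt, ignoring jitter."""
    # There are max_retries - 1 waits between max_retries attempts.
    return sum(min(2 ** attempt, cap) for attempt in range(max_retries - 1))


def attempts_to_cover(outage_seconds: int, cap: int = 300) -> int:
    """Smallest number of attempts whose retry window spans the outage."""
    attempts = 1
    while retry_window(attempts, cap) < outage_seconds:
        attempts += 1
    return attempts


print(retry_window(5))          # 15
print(attempts_to_cover(3600))  # 21
```

With the default of five attempts, everything is over about 15 seconds after the first failure, so a two-minute deploy sends every in-flight message to the DLQ. Riding out a one-hour incident with the same schedule takes 21 attempts.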

Try It

Working example - sender submits an order, receiver processes it via SSE stream, no webhook endpoint needed:

github.com/AxmeAI/reliable-delivery-without-webhooks

Python, TypeScript, and Go implementations included.

Built with AXME - 5 delivery bindings with at-least-once guarantees. Alpha - feedback welcome.
