Unpopular Opinion: Most Webhook Implementations Are Dangerously Half-Baked

#api #architecture #discuss #rails

There. I said it.

After reviewing dozens of Rails codebases over the years, I've come to an uncomfortable conclusion: webhook implementations are almost universally treated as an afterthought. We copy-paste a Stripe tutorial, slap a before_action :verify_signature on a controller, and call it production-ready. It's not.

The "Just Receive It" Mentality Is Costing Us

Ask most developers what their webhook setup looks like and you'll hear some version of: "Oh, we have an endpoint that Stripe hits and we process the event." That's it. No mention of retry handling. No idempotency. No audit trail. Definitely no outbound webhooks.

This matters because real production systems are almost never one-directional. You receive events from Stripe, yes, but you're also probably sending events to your partners, your analytics pipeline, your fulfillment vendor. That outbound flow often gets built as a hastily assembled HTTParty.post buried in an ActiveRecord callback somewhere. And then it silently fails on a Tuesday at 2am.

Inbound ≠ Outbound, and Treating Them the Same Is the Root Problem

Here's the thing that took me an embarrassingly long time to internalize: receiving and sending webhooks are architecturally opposite problems, even though they look similar on the surface.

When you receive a webhook, you don't control the retry policy. You don't control when events arrive. You don't control what happens if you return a 500. The sender might retry three times, or thirty times, or never again — and you won't know until orders stop processing. Your job is to be fast, forgiving, and idempotent. Acknowledge first, process later.

# This is the correct pattern for inbound webhooks.
# Notice what it does NOT do: heavy processing inline.
def create
  return head :unauthorized unless valid_signature?

  # Acknowledge immediately — do NOT make the sender wait
  WebhookProcessorJob.perform_later(request.raw_post, provider: 'stripe')
  head :ok
end
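The controller above only acknowledges and enqueues; the idempotent part has to live in the job. Here is a rough sketch of what that deduplication could look like, assuming the provider puts a unique event id in the JSON body (as Stripe does with its `id` field). `ProcessedEventStore` and `process_webhook` are hypothetical names, and the in-memory set stands in for what would normally be a database table with a unique index.

```ruby
require "json"
require "set"

# Hypothetical in-memory store. In a real app this would be a table with a
# unique index on (provider, event_id), not a Ruby Set.
class ProcessedEventStore
  def initialize
    @seen = Set.new
    @lock = Mutex.new
  end

  # Returns true only the first time a given (provider, event_id) is claimed.
  def claim(provider, event_id)
    @lock.synchronize { !@seen.add?([provider, event_id]).nil? }
  end
end

# Hypothetical stand-in for the body of WebhookProcessorJob#perform.
def process_webhook(raw_body, provider:, store:)
  event = JSON.parse(raw_body)
  event_id = event.fetch("id") # assumption: the provider sends a unique id here

  # A redelivered event is acknowledged but not acted on a second time.
  return :duplicate unless store.claim(provider, event_id)

  # ... real side effects (mark the order paid, send the email) go here ...
  :processed
end

store = ProcessedEventStore.new
body  = '{"id":"evt_123","type":"invoice.paid"}'

first_result  = process_webhook(body, provider: "stripe", store: store)
second_result = process_webhook(body, provider: "stripe", store: store)
```

With this sketch, the first delivery is processed and the redelivery is reported as a duplicate. One caveat: claiming the id before doing the work means a crash in between would lose the event, so with a real database the claim and the side effects would probably belong in the same transaction.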

When you send a webhook, the calculus flips entirely. Now you're responsible for the retry strategy. A network timeout is a transient failure — retry it with backoff. A 400 Bad Request means your payload is wrong — retrying it a hundred times won't fix that. These are fundamentally different failure modes that need different responses, and conflating them leads to both missed deliveries and infinite retry loops.

class OutboundWebhookJob < ApplicationJob
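  # Transient errors (timeouts, 5xx, and retryable 4xx such as 408 or 429):
  # retry with backoff. :polynomially_longer needs Rails 7.1+; older
  # versions call the same strategy :exponentially_longer.
  retry_on WebhookService::NetworkError, wait: :polynomially_longer, attempts: 10

  # Permanent errors (most other 4xx, e.g. 400 or 422): stop immediately
  discard_on WebhookService::BadPayloadError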

  def perform(endpoint_id, event_type, payload)
    WebhookService.deliver!(endpoint_id, event_type, payload)
  end
end

"But We Use Sidekiq Retries" Doesn't Cut It

I hear this one a lot. Yes, Sidekiq retries are great. No, they are not a substitute for intentional error handling in your webhook layer. Sidekiq doesn't know the difference between a 408 timeout and a 422 validation error. It retries any job that raises, whatever the cause. That means you're hammering a partner endpoint with the same malformed payload up to 25 times under the default retry settings, generating noise in their logs and potentially getting your IP rate-limited or blocked.

Smart retry logic needs to live in your application code, not be delegated entirely to your job queue's default behavior.
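As a minimal sketch of what "living in your application code" might mean, the module below maps an HTTP status to a decision and raises the matching error class. The error class names mirror the ones used in OutboundWebhookJob, but `classify`, `raise_for!`, and the list of retryable 4xx codes are assumptions made for illustration, not an established API.

```ruby
# Hypothetical service module; only the two error class names come from the
# job shown earlier, everything else is an illustrative assumption.
module WebhookService
  class NetworkError < StandardError; end
  class BadPayloadError < StandardError; end

  # 4xx codes that usually signal a temporary condition (assumption):
  # 408 Request Timeout, 425 Too Early, 429 Too Many Requests.
  RETRYABLE_4XX = [408, 425, 429].freeze

  # Maps an HTTP status code to :delivered, :retry, or :give_up.
  def self.classify(status)
    case status
    when 200..299       then :delivered
    when *RETRYABLE_4XX then :retry
    when 400..499       then :give_up
    when 500..599       then :retry
    else :give_up # 1xx/3xx are unexpected for a webhook POST; treated as permanent here
    end
  end

  # Raises the error class that the job's retry_on / discard_on rules key on.
  # Returns true when the delivery succeeded.
  def self.raise_for!(status)
    case classify(status)
    when :retry   then raise NetworkError, "transient failure (HTTP #{status})"
    when :give_up then raise BadPayloadError, "permanent failure (HTTP #{status})"
    end
    true
  end
end
```

The point of the split is that the job queue only ever sees two error classes, and the decision about which status belongs to which class sits in one place you can test and adjust per partner.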

The Security Asymmetry Nobody Talks About

Here's another asymmetry that rarely gets called out explicitly: inbound and outbound webhooks use the same cryptographic primitive (HMAC-SHA256 is pretty much the industry standard) but with completely opposite trust models.

For inbound webhooks, you receive someone else's signature and verify it against a shared secret they gave you. For outbound webhooks, you sign your own payloads with a secret you gave to your receiver. Same algorithm, opposite direction of trust. If you build a generic "webhook signature" utility without accounting for this, you'll end up with subtle security bugs or, worse, a system where you accidentally expose your own signing key to the wrong party.
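To make the two directions concrete, here is a small sketch using Ruby's standard OpenSSL bindings. The `WebhookSignature` module, the hex encoding, and the placeholder secrets are assumptions; real providers differ in header names, encodings, and whether a timestamp is included in the signed string.

```ruby
require "openssl"

# Hypothetical helper. Note that the secret is always passed in explicitly,
# so the inbound and outbound secrets can never be mixed up implicitly.
module WebhookSignature
  DIGEST = "SHA256"

  # Outbound: sign the raw payload with the secret YOU issued to the receiver.
  def self.sign(payload, secret)
    OpenSSL::HMAC.hexdigest(DIGEST, secret, payload)
  end

  # Inbound: recompute the HMAC with the secret the SENDER issued to you,
  # then compare without short-circuiting on the first differing byte.
  def self.valid?(payload, signature, secret)
    return false unless signature.is_a?(String)
    secure_compare(sign(payload, secret), signature)
  end

  # Constant-time comparison for equal-length strings. In a Rails app,
  # ActiveSupport::SecurityUtils.secure_compare is the usual choice.
  def self.secure_compare(a, b)
    return false unless a.bytesize == b.bytesize
    a.bytes.zip(b.bytes).reduce(0) { |acc, (x, y)| acc | (x ^ y) }.zero?
  end
end

inbound_secret  = "secret_issued_by_provider"   # placeholder value
outbound_secret = "secret_we_issued_to_partner" # placeholder value
body = '{"id":"evt_123","type":"order.shipped"}'

outbound_signature = WebhookSignature.sign(body, outbound_secret)
```

A signature produced with the outbound secret verifies only against that same secret and the exact same bytes. Checking it against the inbound secret, or against a modified body, fails. Keeping the two secrets as separate, explicitly passed values is one way to avoid the key mix-up described above.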

What "Production-Ready" Actually Looks Like

I'd argue that a truly production-ready webhook system needs at minimum:

Idempotency keys on inbound processing — deduplicate events before acting on them
A polymorphic audit log for both directions — you need to answer "what did we receive, and what did we send, and when?"
Differentiated retry logic on outbound — transient vs. permanent failures are not the same
Independent secret rotation per integration — one compromised partner shouldn't require you to rotate everything
Timeouts on outbound requests — seriously, don't let a slow partner endpoint block your job queue workers

Most apps I've seen have maybe two of these five. Some have none.
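The timeout item is the cheapest of the five to fix. As a sketch using only Ruby's standard library, the helper below builds a client with explicit timeouts and turns timeout exceptions into a single transient error. The names `build_webhook_client`, `post_webhook`, and `TransientDeliveryError`, along with the default values of 5 and 10 seconds, are assumptions chosen for illustration.

```ruby
require "net/http"
require "uri"

# Hypothetical error class for "worth retrying later".
class TransientDeliveryError < StandardError; end

# Builds a Net::HTTP client with explicit timeouts (in seconds). Nothing is
# sent and no connection is opened until a request is made.
def build_webhook_client(url, open_timeout: 5, read_timeout: 10)
  uri = URI.parse(url)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == "https")
  http.open_timeout  = open_timeout  # time allowed to establish the connection
  http.read_timeout  = read_timeout  # time allowed waiting for each read
  http.write_timeout = read_timeout  # time allowed for each write (Ruby 2.6+)
  http
end

# Posts a JSON body and converts timeouts into one retryable error class.
def post_webhook(url, body, **timeouts)
  uri = URI.parse(url)
  request = Net::HTTP::Post.new(uri.request_uri, "Content-Type" => "application/json")
  request.body = body
  build_webhook_client(url, **timeouts).request(request)
rescue Net::OpenTimeout, Net::ReadTimeout, Net::WriteTimeout => e
  raise TransientDeliveryError, "timed out: #{e.class}"
end
```

With something like this in place, a slow partner can hold a worker for roughly the sum of the configured timeouts per attempt rather than for however long the default settings allow, and the resulting error can feed straight into the retry-with-backoff path.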

Why Do We Keep Getting This Wrong?

Honestly? I think it's because webhooks feel simple. It's just HTTP, right? POST to an endpoint, return 200, done. The failure modes are invisible — silent data loss doesn't throw an exception you can see in your error tracker. An outbound webhook that fails with no retry never shows up as an error unless you're explicitly monitoring for it.

We've also been conditioned by the Stripe integration tutorial, which is genuinely excellent but only covers one direction and one provider. It's a starting point, not a template for your entire webhook architecture.

The Takeaway

Stop treating webhooks as a solved problem. Start treating them as an integration boundary that deserves the same architectural rigor you'd give to any other critical data pipeline. Model the two directions separately. Make failure explicit. Build the audit trail. Your on-call rotation will thank you.

Webhooks are not hard to get right — but they require intentionality that most tutorials never bother to teach.

This article builds on a deeper exploration at devgab.com, "Building Resilient Webhook Systems: A Tale of Two Directions", which includes full implementation details for signature verification, polymorphic audit trails, and retry strategies in Rails.

What does your webhook setup look like in production? Are you handling both directions explicitly, or is outbound still a Net::HTTP.post in a model somewhere? Drop your setup (or your horror stories) in the comments — genuinely curious where the community is at on this.