DEV Community

Elvis Sautet

Stop Doing Business Logic in Webhook Endpoints. I Don't Care What Your Lead Engineer Says.

Yesterday at 1pm I'm on a call with a payment provider's tech team. We're integrating their IPN (Instant Payment Notification) system. The call should've been 15 minutes. It turned into a 2-hour argument about how callbacks should work.

Their lead engineer is telling me we need to validate everything in the callback endpoint. Check for duplicates. Verify the payment hasn't been processed. Update the database. Send confirmations. Return specific error codes for different scenarios.

I'm sitting there thinking "no, that's all wrong."

Finally I said it. "Your job is to hit our endpoint. Our job is to acknowledge we received it. Everything else is our problem, not yours."

Silence on the call. Then he says "that's not how callbacks work."

But here's the thing. That IS how callbacks should work. And most developers, even senior ones, get this wrong.

Let me explain the argument, why I was right, and how to actually handle webhooks properly.

The Setup: What We Were Building

Integrating a payment gateway. Pretty standard stuff. When someone pays, the provider sends a webhook to our endpoint with payment details. We need to:

  • Update the payment status in our database
  • Update the order status
  • Send confirmation email to customer
  • Send SMS notification
  • Update inventory
  • Trigger fulfillment
  • Track analytics

Their tech lead wanted all of this to happen IN the callback endpoint. Return 200 if everything succeeded, return 400 with error details if anything failed.

I wanted to just acknowledge the webhook and process everything asynchronously.

We went back and forth for 2 hours.

Their Argument (Why They Thought I Was Wrong)

Their lead made these points:

"You need to validate the payment hasn't been processed already"

They were worried about duplicate webhooks. If we don't check for duplicates in the endpoint and return an error, they might send the same webhook multiple times and we'd process it multiple times.

My response: "That's an idempotency problem. We handle that in our processing logic, not in the endpoint response."

"What if your database is down?"

If our database is down when the webhook hits, and we return 200, we've acknowledged a payment we can't process.

My response: "If our database is down, we have bigger problems. And your retry logic will handle it anyway."

"You need to return specific error codes"

They had this whole spec about returning different status codes:

  • 409 for duplicates
  • 422 for validation errors
  • 500 for processing errors
  • 200 only for complete success

My response: "That's coupling your system to our internal logic. You don't need to know why something failed on our end."

"How will we know if processing succeeded?"

They wanted confirmation that everything worked. Email sent, inventory updated, order fulfilled.

My response: "You don't need to know that. You need to know we received the webhook. That's it."

This went on for way too long.

My Argument (Why I Was Right)

Here's what I kept trying to explain:

Separation of concerns

Their job: Send webhooks reliably
Our job: Process them reliably

These are separate responsibilities. Mixing them creates tight coupling and makes both systems fragile.

Network timeouts are real

If we're doing all that processing in the endpoint, and it takes 5+ seconds, their webhook request times out. They retry. We get duplicate webhooks. Everything breaks.

Even if it takes 3 seconds, that's slow. Webhooks should be fast. Sub-second fast.

Our processing might fail for reasons they can't fix

Say our email service is down. We can't send confirmation emails. Should we return an error to the payment provider? What are they supposed to do about it? The payment still succeeded. The email is our problem.

Or our inventory service is slow. Takes 10 seconds to update. Should we make them wait? No. That's our internal issue.

Retry logic belongs in queues, not in HTTP responses

If email sending fails, we should retry. But that retry shouldn't involve the payment provider. We should handle it internally with a message queue.

Their webhook delivered successfully. Everything after that is our responsibility.

The endpoint's only job is to receive and acknowledge

That's it. Verify the webhook signature, add it to a queue, return 200. Done.

All the processing happens asynchronously in worker processes. If something fails, our workers retry. The payment provider doesn't need to know or care.

What The Endpoint Should Actually Look Like

This is what I kept trying to explain. The callback endpoint should be stupid simple:

app.post('/webhooks/ipn', async (req, res) => {
  const payload = req.body

  // Step 1: Verify signature (this is THEIR security requirement)
  const signature = req.headers['x-payment-signature']
  if (!verifySignature(payload, signature)) {
    return res.status(401).json({ error: 'Invalid signature' })
  }

  // Step 2: Add to queue
  await paymentQueue.add('process-payment', payload)

  // Step 3: Acknowledge immediately
  res.status(200).json({ received: true })
})

That's the entire endpoint. Three things:

  1. Verify it's actually from them (security)
  2. Queue it for processing
  3. Respond

Everything else happens in a worker:

paymentQueue.process('process-payment', async (job) => {
  const payload = job.data

  // Check for duplicates HERE, not in the endpoint
  const existing = await db.payments.findUnique({
    where: { transactionId: payload.transactionId }
  })

  if (existing) {
    console.log('Duplicate payment, skipping')
    return // Job completes without doing anything
  }

  // Update payment (record the transaction ID so the duplicate check above works)
  await db.payments.update({
    where: { id: payload.paymentId },
    data: { status: 'completed', transactionId: payload.transactionId }
  })

  // Get order
  const order = await db.orders.findUnique({
    where: { paymentId: payload.paymentId }
  })

  // All the slow stuff
  await sendEmail(order.customerEmail, 'Payment confirmed')
  await sendSMS(order.customerPhone, 'Payment received')
  await updateInventory(order.items)
  await fulfillmentService.createShipment(order)
  await analytics.track('payment_completed', order)
})

If any of this fails, the job retries. Automatically. The payment provider never knows or cares.

Why Their Approach Breaks In Production

I tried explaining what happens with their approach in real production scenarios:

Scenario 1: Email service is down

Webhook hits endpoint. We try to send email. Email service times out after 5 seconds. We return 500 to payment provider. They retry the webhook. We process the payment again. Customer gets charged twice.

Scenario 2: Database query is slow

Webhook hits. Database is under load. Query to check for duplicates takes 8 seconds. Webhook times out. Provider retries. Now we have race conditions. Same payment processed multiple times because both webhooks are checking for duplicates at the same time.
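That race exists because "check for duplicates, then process" is two separate steps. The fix is to make the duplicate check an atomic claim: in a database, that's a unique constraint on the transaction ID, so a concurrent insert either succeeds exactly once or fails for every other attempt. A minimal in-memory sketch of the claim semantics (illustrative only; a single `Set` obviously doesn't survive multiple processes):

```javascript
// Sketch of "claim" semantics for deduplication. In production the same
// guarantee comes from a DB unique constraint on transactionId: the insert
// succeeds once, and every concurrent duplicate attempt fails.
const claimed = new Set()

function claimTransaction(transactionId) {
  if (claimed.has(transactionId)) return false // someone already owns it
  claimed.add(transactionId)
  return true // we own this transaction; safe to process
}

console.log(claimTransaction('txn_123')) // true: first webhook wins
console.log(claimTransaction('txn_123')) // false: duplicate is skipped
```

Because the claim is atomic, two webhooks racing on the same transaction can't both pass the check, no matter how slow the rest of processing is.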

Scenario 3: SMS provider is rate limiting us

We hit our SMS limit. Can't send notifications. Return error to payment provider? They retry. We still can't send SMS. They retry again. Now we have 50 failed webhooks piling up because of our SMS provider.

Scenario 4: Analytics service is down

Analytics is down for maintenance. We can't track events. Should we fail the entire payment processing because analytics is down? No. But if we return 500, the payment provider thinks the payment failed.

All of these are real scenarios that happen in production. And all of them break if you're doing too much work in the callback endpoint.

The Counter Arguments (What They Said Next)

"But how do we know you processed it successfully?"

You don't. And you don't need to. You need to know we received it. Processing is our problem.

If processing fails on our end, we handle it. We retry. We log. We alert. We fix it. You're not involved.

"What if your queue is full or Redis is down?"

Then adding to the queue fails and we return 500. You retry later. That's the only legitimate failure case - we couldn't receive the webhook at all.

But if we added it to the queue successfully, we return 200. Because we received it. What happens after is not your concern.

"How do you handle duplicates then?"

In the worker process. Every payment has a unique transaction ID from you. We check if we've already processed that transaction ID. If yes, we skip it. If no, we process it.

This is called idempotency. The worker logic is idempotent. Running it multiple times with the same transaction ID doesn't cause duplicate processing.

"What about ordering? Webhooks might arrive out of order"

That's also handled in the worker. We don't rely on webhooks arriving in order. Each webhook is self-contained with all the data we need.

If you send webhook A then webhook B, but B arrives first, that's fine. Each one processes independently.
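One way to sketch that order-independence (field names like `timestamp` are my assumption here, not this provider's actual payload): if each event carries the provider's timestamp, applying a stale event becomes a no-op, so arrival order stops mattering.

```javascript
// Sketch: apply a webhook event only if it's newer than what we've recorded.
// `timestamp` is an assumed field; real provider payloads vary.
function applyEvent(payment, event) {
  if (payment.updatedAt >= event.timestamp) {
    return payment // stale event (arrived out of order): ignore it
  }
  return { ...payment, status: event.status, updatedAt: event.timestamp }
}

let payment = { status: 'pending', updatedAt: 0 }
payment = applyEvent(payment, { status: 'completed', timestamp: 2 })  // B arrives first
payment = applyEvent(payment, { status: 'processing', timestamp: 1 }) // A arrives late
console.log(payment.status) // 'completed' — the late event didn't regress the state
```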

"This seems more complex on your end"

Yes. Because it's OUR problem to solve, not yours. Your job is simple: send webhooks reliably. Our job is complex: process them reliably.

We don't want to push our complexity into your retry logic.

The Real-World Example That Convinced Them

Around 2:30pm I was exhausted, so I decided to show them real data from a previous integration where we did it their way.

Pulled up our logs from a payment provider we integrated 6 months ago using their approach:

Week 1:

  • 1,200 payments
  • 47 duplicate webhook receptions
  • 12 duplicate payments processed
  • 8 customers charged twice
  • 3 hours spent issuing refunds

Week 2:

  • Email service hiccup (was down for 10 minutes)
  • 200+ webhook timeouts during that window
  • Provider retried all of them
  • 800+ duplicate webhook receipts
  • Database got hammered
  • Site went down for 20 minutes
  • 6 hours of cleanup

Week 4:

  • Analytics service maintenance
  • Can't track events
  • Start returning errors to provider
  • They stop sending webhooks
  • Miss 50 payments
  • Customer support explosion
  • CEO not happy

Week 8:

  • SMS rate limit hit
  • Same pattern as week 4
  • Another 30 payments missed

Showed them this data. "This is what happens when we do too much work in the callback endpoint. We've been fighting this for months."

Then showed them data from a different provider where we used the queue approach:

4 months of operation:

  • 50,000+ payments
  • Zero duplicate payments processed
  • Zero missed payments
  • Average callback response time: 45ms
  • Zero webhook-related outages

That's when their lead engineer went quiet for a minute. Then said "okay, I see your point."

The Compromise We Reached

Around 3pm we agreed on this:

Their requirements:

  • Return 401 if signature is invalid
  • Return 400 if the webhook payload is malformed
  • Return 200 if webhook is received successfully
  • Return 503 if our queue is unavailable

Our implementation:

  • Endpoint does minimal work (verify, queue, respond)
  • All processing happens asynchronously
  • We handle duplicates in worker logic
  • We handle failures with retries in our queue
  • We alert ourselves if jobs fail repeatedly

What we DON'T do:

  • Check for duplicates in the endpoint
  • Validate payment details in the endpoint
  • Return errors for downstream service failures
  • Return errors for processing failures

The endpoint's job is to receive and acknowledge. That's it.

How We Actually Implemented It

Here's the production code we ended up with:

const express = require('express')
const { Queue } = require('bullmq')
const crypto = require('crypto')

const app = express()
app.use(express.json())

const ipnQueue = new Queue('ipn-processing', {
  connection: { host: 'localhost', port: 6379 }
})

// IPN endpoint
app.post('/webhooks/ipn', async (req, res) => {
  const payload = req.body
  const signature = req.headers['x-payment-signature']

  // Validate signature
  if (!isValidSignature(payload, signature)) {
    return res.status(401).json({ 
      error: 'Invalid signature',
      received: false 
    })
  }

  // Validate payload structure
  if (!payload.transactionId || !payload.amount) {
    return res.status(400).json({ 
      error: 'Invalid payload',
      received: false 
    })
  }

  try {
    // Add to queue with job ID = transaction ID (automatic deduplication)
    await ipnQueue.add('process-ipn', payload, {
      jobId: payload.transactionId, // if a job with this ID already exists, BullMQ won't add a duplicate
      attempts: 5,
      backoff: {
        type: 'exponential',
        delay: 3000
      }
    })

    // Acknowledge immediately
    res.status(200).json({ 
      received: true,
      transactionId: payload.transactionId 
    })

  } catch (error) {
    console.error('Failed to queue IPN:', error)

    // Only return 503 if we couldn't queue it
    res.status(503).json({ 
      error: 'Service temporarily unavailable',
      received: false 
    })
  }
})

function isValidSignature(payload, signature) {
  const secret = process.env.PAYMENT_SECRET
  const expected = crypto
    .createHmac('sha256', secret)
    .update(JSON.stringify(payload))
    .digest('hex')

  // Constant-time comparison so the signature check doesn't leak timing info
  if (!signature || signature.length !== expected.length) return false
  return crypto.timingSafeEqual(Buffer.from(expected), Buffer.from(signature))
}

app.listen(3000)

Worker process:

const { Worker } = require('bullmq')

const worker = new Worker('ipn-processing', async (job) => {
  const payload = job.data

  console.log(`Processing IPN: ${payload.transactionId}`)

  try {
    // Check if already processed (defensive check, job ID should prevent duplicates)
    const existing = await db.payments.findUnique({
      where: { transactionId: payload.transactionId }
    })

    if (existing && existing.status === 'completed') {
      console.log(`Transaction ${payload.transactionId} already processed`)
      return { status: 'duplicate', transactionId: payload.transactionId }
    }

    // Update payment
    await db.payments.update({
      where: { id: payload.paymentId },
      data: { 
        status: 'completed',
        processedAt: new Date(),
        transactionId: payload.transactionId
      }
    })

    // Get order details
    const order = await db.orders.findUnique({
      where: { paymentId: payload.paymentId },
      include: { items: true, customer: true }
    })

    // All the slow/failable operations. If any of these throws, the whole
    // job retries, so each side effect should itself be idempotent.
    await Promise.all([
      sendEmail({
        to: order.customer.email,
        template: 'payment-confirmation',
        data: order
      }),
      sendSMS({
        to: order.customer.phone,
        message: `Payment received for order ${order.id}`
      }),
      updateInventory(order.items),
      analytics.track('payment_completed', {
        orderId: order.id,
        amount: order.total,
        transactionId: payload.transactionId
      })
    ])

    // Trigger fulfillment
    await fulfillmentService.createShipment(order)

    console.log(`Successfully processed ${payload.transactionId}`)

    return { 
      status: 'success', 
      transactionId: payload.transactionId,
      orderId: order.id
    }

  } catch (error) {
    console.error(`Failed to process ${payload.transactionId}:`, error)

    // Throw error so job will retry
    throw error
  }
}, {
  connection: { host: 'localhost', port: 6379 },
  concurrency: 10
})

// Monitor failures
worker.on('failed', async (job, error) => {
  console.error(`Job ${job.id} failed after ${job.attemptsMade} attempts:`, error.message)

  // Alert if job has failed all retry attempts
  if (job.attemptsMade >= job.opts.attempts) {
    await alerting.sendAlert({
      type: 'payment_processing_failed',
      transactionId: job.data.transactionId,
      error: error.message
    })
  }
})

worker.on('completed', (job) => {
  console.log(`Job ${job.id} completed:`, job.returnvalue)
})

This is what's running in production now. Been solid for months.

The Results (Why I Was Right)

Been running this implementation for 4 months now. Here's the data:

Response times:

  • Average: 42ms
  • P95: 78ms
  • P99: 120ms

All webhooks respond in under 200ms. Provider is happy.

Reliability:

  • 50,000+ payments processed
  • Zero duplicate payments
  • Zero missed payments
  • 99.97% success rate

The 0.03% failures were legitimate issues (invalid payment data, customer account problems). Everything else works.

Failure handling:

  • Email failures: 47 (all retried successfully within 5 minutes)
  • SMS failures: 12 (all retried successfully)
  • Analytics failures: 8 (retried successfully, zero impact on payments)
  • Fulfillment delays: 23 (retried, orders still shipped on time)

All of these would've caused webhook failures and retries with the old approach. With queues, they're just internal retries that succeed automatically.

Infrastructure:

  • Webhook endpoint: 1 instance (barely using resources)
  • Worker processes: 3 instances (handles all processing)
  • Redis: 1 instance (queue storage)
  • Total cost: $45/month

The old approach needed 5 API instances just to handle the webhook load during peak times. New approach is more reliable AND cheaper.

What I Learned From This Argument

A few things I took away from that 2-hour argument:

Most developers conflate receiving and processing

They think the webhook endpoint needs to do everything. But receiving a webhook and processing it are separate concerns.

Endpoint: fast, simple, stateless
Worker: slow, complex, stateful

Keep them separate.

HTTP is terrible for async work

HTTP is request-response. User sends request, waits for response. That's fine for APIs where users are waiting.

But webhooks are fire-and-forget. The sender doesn't care about processing results. They just want confirmation you received it.

Stop trying to force synchronous patterns on asynchronous workflows.

Idempotency belongs in business logic, not in HTTP responses

You handle duplicate requests by making your processing idempotent, not by detecting duplicates in the endpoint and returning errors.

Use job IDs. Check if work was already done. Skip if yes, process if no.

Your internal problems shouldn't leak into external APIs

Email service down? That's your problem. Analytics failing? Your problem. Database slow? Your problem.

The webhook sender shouldn't know or care about any of this. Return 200 if you received the webhook, handle failures internally.

Simple interfaces are better than smart interfaces

Their approach required the endpoint to be smart. Check duplicates, validate everything, return specific errors.

Our approach makes the endpoint dumb. Just receive and queue. All the smart logic is in workers where it belongs.

Dumb interfaces are more reliable.

When You Actually Should Do More In The Endpoint

Real talk: this pattern isn't always right. Sometimes you DO need to do work synchronously.

When the sender needs a response

If they're asking "is this payment valid?" and waiting for an answer, you need to check and respond immediately.

But webhooks aren't questions. They're notifications. Big difference.

When processing is super fast

If checking for duplicates takes 10ms and that's all you're doing, fine, do it in the endpoint. But the moment you're calling external services or doing complex logic, move it to a queue.

When you don't have queue infrastructure

If you're running a small app with 10 users and 5 webhooks per day, don't overcomplicate it. Just handle it in the endpoint. You don't need Redis and worker processes.

When you're prototyping

Get it working first. Add queues later when you have scale problems. Don't overengineer early.

I spent way too long in my career adding queues to everything when a simple endpoint would've been fine.

Use them when you need them, not because they're "best practice."

The Code Structure (How We Organize This)

One thing that helped convince the payment provider was showing them how clean our code structure is:

/webhooks
  /ipn
    endpoint.js      (just receives and queues)
    worker.js        (all the processing logic)
    handlers/
      payment.js     (payment update logic)
      email.js       (email sending logic)
      sms.js         (SMS logic)
      inventory.js   (inventory updates)
      fulfillment.js (order fulfillment)

Each piece has a single responsibility. Easy to test. Easy to modify. Easy to debug.

The endpoint is like 30 lines of code. The worker orchestrates different handlers. Each handler can fail and retry independently.

Compare that to a 500-line endpoint trying to do everything. Which would you rather maintain?
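To make "single responsibility" concrete, here's a sketch of what one handler module might look like (the names and fields are invented for illustration; the dependency is injected so the handler is trivial to test with a stub):

```javascript
// handlers/email.js — hypothetical sketch of a single-responsibility handler.
// The worker orchestrates handlers like this; if one throws, only that
// step fails and gets retried.
async function sendPaymentConfirmation(mailer, order) {
  // mailer is injected, so tests can pass a stub instead of a real service
  return mailer.send({
    to: order.customerEmail,
    template: 'payment-confirmation',
    data: { orderId: order.id, total: order.total }
  })
}

module.exports = { sendPaymentConfirmation }
```

The worker imports each handler and calls them in sequence or in parallel; none of them know anything about HTTP, queues, or each other.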

The Monitoring We Added

After the argument, we added monitoring to prove this approach works:

Queue metrics:

  • Jobs per minute
  • Average processing time
  • Failed jobs
  • Queue depth

Alerts:

  • Alert if queue depth > 1000
  • Alert if failed jobs > 10 in last hour
  • Alert if processing time > 30 seconds
  • Alert if Redis is down

Dashboard:
Shows all the metrics in real-time. Payment provider can see we're processing webhooks successfully even though we return 200 immediately.

This visibility convinced them our approach works.
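The alert rules above reduce to a pure function over queue counts. This is a sketch with assumed field names; with BullMQ you'd feed it numbers from something like `queue.getJobCounts()` plus your own failure and latency tracking:

```javascript
// Sketch: turn raw queue metrics into a list of alerts to fire.
// Thresholds match the rules above; the field names are assumptions.
function alertsFor({ queueDepth, failedLastHour, p95ProcessingMs, redisUp }) {
  const alerts = []
  if (queueDepth > 1000) alerts.push('queue_depth')
  if (failedLastHour > 10) alerts.push('failed_jobs')
  if (p95ProcessingMs > 30000) alerts.push('slow_processing')
  if (!redisUp) alerts.push('redis_down')
  return alerts
}

console.log(alertsFor({ queueDepth: 50, failedLastHour: 0, p95ProcessingMs: 800, redisUp: true }))
// [] — healthy system, nothing fires
```

Keeping the rules in one pure function makes them easy to unit test and easy to show to a provider when they ask how you'd know if processing broke.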

Resources That Helped Me Argue This

During the call I referenced these:

Webhook.site blog on webhook design:
https://docs.webhook.site/

Explains why webhooks should be fast and idempotent.

Stripe's webhook documentation:
https://stripe.com/docs/webhooks

They do it the right way. Return 200 immediately, process async.

PayPal's IPN documentation:
https://developer.paypal.com/docs/api-basics/notifications/ipn/

Same pattern. Quick acknowledgment, async processing.

The Twelve-Factor App on backing services:
https://12factor.net/backing-services

External services (like webhooks) should be loosely coupled.

All of these support the pattern I was arguing for.

The Final Agreement (What We Documented)

Around 3pm we finally agreed and documented it:

Webhook endpoint responsibilities:

  1. Verify signature (security requirement)
  2. Validate payload structure (prevent malformed data)
  3. Queue for processing
  4. Return acknowledgment

Status codes we return:

  • 200: Received and queued successfully
  • 401: Invalid signature
  • 400: Malformed payload
  • 503: Cannot queue (Redis down, queue full)

What we DON'T return:

  • Errors for duplicate transactions (handled internally)
  • Errors for processing failures (handled internally)
  • Errors for downstream service issues (handled internally)

Provider responsibilities:

  • Send webhooks with valid signature
  • Retry on 503 responses
  • Don't retry on 200 responses
  • Provide transaction IDs for deduplication

Our responsibilities:

  • Process webhooks idempotently
  • Handle failures with retries
  • Alert ourselves if processing fails repeatedly
  • Maintain SLA of 99.9% processing success

This is now the pattern we use for all webhook integrations.

What I'd Do Different Next Time

The argument went on way too long because I didn't lead with data. Next time I'd do this:

1. Show production logs from previous implementations

Start with "here's what happened when we did it your way" with actual data. Numbers convince people.

2. Draw the architecture on a diagram

Visual helps. Show where failures happen with each approach.

3. Reference industry examples earlier

Should've led with "this is how Stripe does it" instead of making it about our specific implementation.

4. Offer to show our monitoring

"I can show you our queue metrics in real-time" would've shortened the argument by an hour.

The Bigger Lesson

This isn't just about webhooks. It's about separation of concerns in API design.

Your public API (the endpoint) should be simple, fast, and reliable.

Your internal processing should be complex, slow, and resilient.

Don't mix them.

This applies to:

  • Webhooks and callbacks
  • Async job processing
  • Event-driven architectures
  • Microservices communication

Keep interfaces simple. Handle complexity internally.

That's how you build systems that scale and don't wake you up at 2am.


If you've ever argued with a provider about how webhooks should work, share this. Most developers get this wrong because they think HTTP endpoints need to do all the work.

They don't. They just need to receive and acknowledge.

Elvis Sautet

Catch me on X(Twitter) @elvisautet

My Portfolio (not the best, haha)

Full Stack Developer | Arguing About Architecture Since 2017

Questions about webhooks, queues, or any collaborations/projects, Hit me up on Twitter.

Top comments (5)

david duymelinck

I don't understand why they want their webhooks to receive all that information.
What if people don't send an SMS.
If they want the information, wouldn't it be better to receive it from another webhook.

Instead of starting with a queue I would use the event observer pattern. When the task of the event fails the event should be stored to be retried at a later date.
To prevent the endpoint getting flooded I would add a throttle. This can be raised in case there is a campaign or a sale. This avoids a full database, and you can create a mechanism to add more resources when the throttle limit is raised. And fade out the extra resources when the throttle limit is lowered again.

Elvis Sautet

Hey David, appreciate the perspective - you raise really good points.
That's actually a cleaner mental model for this - the webhook is just an event trigger, not a request-response contract. I think we're describing similar architectures with different terminology.
In my case, the "queue" I mentioned is essentially serving as the event store you're describing. BullMQ (what we're using) handles:
  • Event persistence (jobs are stored in Redis)
  • Retry logic with exponential backoff
  • Dead letter queue for permanently failed events

So functionally, it's doing what you described - storing failed events for retry, processing asynchronously, etc.

Just Curious - have you implemented the dynamic throttle/scaling mechanism you described?

david duymelinck • Edited

A basic implementation was moving the events to a sqlite database and using a circuit breaker pattern to empty the overflow from the sqlite database.

Antal Áron

Nice.

Why not using HTTP 202 status code? The docs says:

The HTTP 202 Accepted successful response status code indicates that a request has been accepted for processing, but processing has not been completed or may not have started. The actual processing of the request is not guaranteed; a task or action may fail or be disallowed when a server tries to process it.

Hashbyt

Spot-on breakdown. This is exactly how we architect resilient integrations at Hashbyt. Keeping webhook endpoints lean and processing async to ensure reliability without coupling.

What was the most challenging pushback you faced when advocating for this pattern with other engineering teams?