I’ve been working in e-commerce for the last three years, dealing with millions of fairly mission-critical webhook events. My key takeaway is that it sucks to deal with incoming webhooks from multiple APIs such as Shopify, Stripe & Intercom, and I hate not having an alternative.
It’s April 2020, we are stuck inside self-quarantining, productivity has been high, and it just strikes me as the perfect time to publish my first post.
Mistakes are human, and so are bugs. Every once in a while, you will introduce errors in your webhook handling methods. Even with the best integration tests, you are probably not prepared for unexpected payload changes (perhaps a stealth API version upgrade) or a platform downtime. You probably have server logs or even better, something like Sentry to catch unexpected exceptions in your code. If these are mission-critical webhooks, are you confident you can find and take action on every single webhook you missed? Maybe, but most likely, you’ll spend the day on it.
I know what you’ll tell me. You aren’t supposed to take action directly on a webhook! That’s right, the “proper way” of dealing with webhooks is to postpone any meaningful processing through the use of queues and to handle errors & retries on your own. Clubhouse.io engineering team wrote a great article on this topic. They are right; you should use queues.
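To make the idea concrete, here is a minimal sketch of the "queue first, process later" pattern. It is not the Clubhouse design: an in-memory `queue.Queue` stands in for a durable queue such as SQS, and the names `receive_webhook`, `drain`, `dead_letters` and `MAX_ATTEMPTS` are made up for illustration.

```python
import json
import queue

# In-memory stand-in for a durable queue such as SQS (assumption for illustration).
events = queue.Queue()
# Events that still fail after MAX_ATTEMPTS end up here for manual inspection.
dead_letters = []

MAX_ATTEMPTS = 3  # hypothetical retry limit


def receive_webhook(raw_body: str) -> int:
    """Publisher: store the raw payload and acknowledge right away."""
    events.put({"body": raw_body, "attempts": 0})
    return 200  # the provider sees success as soon as the event is queued


def drain(handler) -> None:
    """Consumer: process queued events, re-queueing failures up to MAX_ATTEMPTS."""
    while not events.empty():
        event = events.get()
        event["attempts"] += 1
        try:
            handler(json.loads(event["body"]))
        except Exception as exc:
            if event["attempts"] < MAX_ATTEMPTS:
                events.put(event)  # try again later
            else:
                event["error"] = repr(exc)
                dead_letters.append(event)  # give up, but keep the payload
```

The point of the split is that the HTTP endpoint never runs business logic, so a bug in the handler cannot cause the provider to see a failed delivery. A real setup would also need persistence, delays between retries and a way to replay the dead letters.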
However, to handle a single webhook safely, you would need (as per Clubhouse) four new components: SQS, S3, one or more publishers and one or more consumers. It depends on your current architecture, but in short, that’s a lot of new tools, a lot of new code, and especially a lot of time. Time that I could better spend building products for our customers.
If it feels a lot like reinventing the wheel, well, it’s because it is. Solving this particular problem isn’t specific to your business or your application.
Let’s be honest, why does almost every platform out there offer an ‘automatic retry’ feature for their webhooks? Because developers generally DO NOT postpone taking action on their webhooks and really, nobody can blame them. The time required to set up queues and related retry logic makes it so that developers, such as myself, are instead relying on the platform to do the work for them.
Each platform has its own rules and logic. Shopify will retry 19 times over 48 hours and completely delete your webhook subscription once all attempts are exhausted, whereas Stripe will attempt to deliver your webhooks for up to three days with exponential backoff. This becomes increasingly complex to deal with if you are using webhooks from 3, 5, or 10 APIs, each with its own rules on when it will retry, when it will alert you and when it will disable your webhooks.
Credit where credit is due, Stripe does a lot of things right. They provide a useful UI (but no API) to view your events and retry them. That’s the kind of visibility we need, as developers. I’ve saved many hours of troubleshooting because of that feature alone.
Stripe’s UI to view failed webhooks.
The thing is that you are out of luck if you are using any other platform. Stripe is the only one I’ve used that provides this level of tooling for developers. Have you encountered other platforms with a similar level of tooling?
For my webhook monitoring, I was looking for a solution that can:
- Provide full visibility over all my webhooks and their status;
- Allow me to take actions on the failures that occurred;
- Alert me on failures that need looking into;
- Work the same way, regardless of the platforms in my stack.
The bad news is that I couldn’t find any.
The good news is that I built it.
Hookdeck solves these problems by sitting between the API provider and your systems to monitor, distribute and retry all your webhooks. With it, you can trace every webhook event and view the full request and response payloads. You can also standardize the retry and alerting rules across all your webhooks regardless of the provider. This will guarantee that your webhooks behave the same way and are configured for your own use cases. It allows you to take action on any issues or errors that occur with confidence.
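To illustrate what "one set of retry rules for every provider" could mean in practice, here is a small sketch of a provider-agnostic retry policy with exponential backoff. This is not Hookdeck’s API or implementation; `RetryPolicy`, `delay_before`, `schedule` and the numbers used are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RetryPolicy:
    """A single retry rule applied to every provider (hypothetical)."""
    max_attempts: int
    base_delay_s: float
    max_delay_s: float

    def delay_before(self, attempt: int) -> Optional[float]:
        """Seconds to wait before retry number `attempt` (1-based), or None to give up."""
        if attempt > self.max_attempts:
            return None
        # Exponential backoff: the delay doubles on each attempt, capped at max_delay_s.
        return min(self.base_delay_s * 2 ** (attempt - 1), self.max_delay_s)


def schedule(policy: RetryPolicy) -> List[float]:
    """List the delay before each retry until the policy gives up."""
    delays = []
    attempt = 1
    while (delay := policy.delay_before(attempt)) is not None:
        delays.append(delay)
        attempt += 1
    return delays


# The same policy would apply whether the event came from Shopify, Stripe or Intercom.
POLICY = RetryPolicy(max_attempts=5, base_delay_s=60, max_delay_s=3600)
```

With these made-up values, `schedule(POLICY)` gives delays of 60, 120, 240, 480 and 960 seconds. Because the policy lives on your side of the integration rather than the provider’s, changing it does not depend on how each platform happens to retry.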
Hookdeck — The wheel you won’t have to reinvent
If you are wondering how Hookdeck works, here’s a look at the inner workings of our architecture.
A platform-agnostic solution is possible. Why should having those tools be the responsibility of the developers or the platforms? I like to think it’s the missing piece of the puzzle.
If you would like to spare yourself the hassle I went through, Hookdeck is now available.
What are your thoughts on handling a large volume of mission-critical webhooks?