At Sequin, the backbone of our syncing infrastructure is polling. This is because polling provides stronger consistency guarantees than webhooks.
As we've written about, when you use webhooks, you give up some control: webhooks are ephemeral. If your service is down or you mishandle a webhook you receive, you're out of luck. You're also at the whim of the webhook provider. They might drop a webhook altogether, meaning you'll never have a chance to process it.
Polling is not without its challenges, however. Besides the complexity of maintaining polling infrastructure, the hardest part about polling is cursoring or paging through a stream of events. When cursoring through API items, you need to traverse the list in such a way that you don't miss any items. (And, ideally, you don't repeat items often either, as that's inefficient.)
Cursoring is surprisingly hard, as most APIs don't make it easy to see what's changed in them.
Cursoring becomes extra hard if the API you're querying is eventually consistent. In an API, eventual consistency means that your result set is not stable – the results you get from a request can change the next time you make the same request. This adds a lot of complexity, as you have to write defensive code.
Stripe events
Stripe is one of the rare API providers that has thoughtful solutions for change detection. They have a dedicated /events endpoint where they publish most of the changes that happen in their system. Examples include an event for every time a customer is created, a subscription is updated, or a new payment goes through.
We've been happy consumers of Stripe's /events endpoint and want to see more endpoints like it across other APIs.
However, due to the demanding nature of our real-time sync, we poll the /events endpoint frequently – multiple times per second. This means we're susceptible to even the slightest eventual consistency issues. And, indeed, we found one such issue with the /events endpoint.
I'll give some background on the /events endpoint, discuss the issue we encountered, then tell you how we're mitigating the issue.
Paginating events
Most Stripe objects have a created property. This property is a Unix timestamp in seconds.
Because the timestamp only has one-second resolution, many Stripe events in a given Stripe account will share the same created timestamp. For example, certain Stripe operations cause many Stripe records to be created at the same time. When a customer signs up for your service and starts a new subscription, Stripe creates a bunch of objects like a customer and subscription for that customer.
Normally, if we were cursoring Stripe's API with a created timestamp, this could be a problem. For example, consider this simplified HTTP query:
GET api.stripe.com/v1/events?createdAfter=${cursor}&limit=100
Using created > cursor would be a problem because we could easily skip any other events created in the same second as the cursor. Likewise, this could be a problem as well:
GET api.stripe.com/v1/events?createdAtOrAfter=${cursor}&limit=100
Here, using created >= cursor, we'd have the potential of getting stuck on a page where every event had the same created timestamp – there would be no way for us to move forward.
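Both failure modes are easy to reproduce in a toy model. The sketch below is illustrative only: EVENTS, page_after, and page_at_or_after are hypothetical names, the log lives in memory, and (unlike Stripe's real responses) pages come back oldest first to keep the arithmetic simple.

```python
from typing import Dict, List

# Hypothetical in-memory event log, oldest first.
# Three events share the same one-second timestamp (100).
EVENTS: List[Dict] = [
    {"id": "evt_a", "created": 100},
    {"id": "evt_b", "created": 100},
    {"id": "evt_c", "created": 100},
    {"id": "evt_d", "created": 101},
]


def page_after(cursor: int, limit: int) -> List[Dict]:
    """Simulates a `created > cursor` query."""
    return [e for e in EVENTS if e["created"] > cursor][:limit]


def page_at_or_after(cursor: int, limit: int) -> List[Dict]:
    """Simulates a `created >= cursor` query."""
    return [e for e in EVENTS if e["created"] >= cursor][:limit]


def ids(page: List[Dict]) -> List[str]:
    return [e["id"] for e in page]


# Strict comparison: the first page ends in the middle of second 100,
# so the next request (created > 100) jumps straight past evt_c.
first = page_after(99, limit=2)
second = page_after(first[-1]["created"], limit=2)
seen_strict = ids(first) + ids(second)

# Inclusive comparison: the page is filled entirely by events from
# second 100, so the cursor derived from it does not move.
stuck = page_at_or_after(100, limit=2)
next_cursor = stuck[-1]["created"]

print(seen_strict)   # evt_c never appears
print(next_cursor)   # still 100
```

With a page size of 2, the strict query never returns evt_c, and the inclusive query keeps producing a cursor of 100.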
Fortunately, Stripe lets us cursor by the event's ID. We can make a request to get some stream of Stripe events, like this (for brevity, I'll include just the id and created properties of each event):
GET api.stripe.com/v1/events?ending_before=evt_1MoCivKddDnm8ttlZ19ZW52C
{
  "data": [
    {
      "created": 1679422378,
      "id": "evt_1Mo9g6KddDnm8ttlWtWxdDBt"
      # ...
    },
    {
      "created": 1679422373,
      "id": "evt_1Mo9g1KddDnm8ttlitN7Jl38"
      # ...
    },
    {
      "created": 1679422371,
      "id": "evt_1Mo9fzKddDnm8ttlPDyDsUez"
      # ...
    },
    {
      "created": 1679422292,
      "id": "evt_1Mo9eiKddDnm8ttlip9sgaB5"
      # ...
    },
    {
      "created": 1679422292,
      "id": "evt_3MgXZgKddDnm8ttl09DxjuvM"
      # ...
    }
  ]
}
The list of events is returned sorted by created descending. So, the most recent event in the list is on top. Assuming we're paginating through the stream from oldest → newest, to continue pagination, we'd pluck the event ID at the top of the response (evt_1Mo9g6KddDnm8ttlWtWxdDBt) and send that along as our ending_before to continue forward.
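As a rough sketch, the forward-paging loop looks something like the following. The make_fake_api helper is a hypothetical in-memory stand-in for GET /v1/events?ending_before=...&limit=..., written on the assumption that the endpoint returns the events immediately newer than the cursor, newest first; poll_once is likewise an illustrative name, not our production code.

```python
from typing import Callable, Dict, List, Tuple

Event = Dict[str, object]
Fetch = Callable[[str, int], List[Event]]


def make_fake_api(events_newest_first: List[Event]) -> Fetch:
    """Hypothetical stand-in for the /events list endpoint."""
    def fetch(ending_before: str, limit: int) -> List[Event]:
        all_ids = [e["id"] for e in events_newest_first]
        idx = all_ids.index(ending_before)
        # Assumption: return the `limit` events immediately newer
        # than the cursor, still sorted newest first.
        return events_newest_first[max(0, idx - limit):idx]
    return fetch


def poll_once(fetch: Fetch, cursor: str, limit: int = 100) -> Tuple[List[Event], str]:
    """Fetch one page and return (events oldest first, next cursor)."""
    page = fetch(cursor, limit)
    if not page:
        return [], cursor  # nothing new: keep the same cursor
    next_cursor = str(page[0]["id"])  # top of the page is the newest event
    return list(reversed(page)), next_cursor


fetch = make_fake_api([
    {"id": "evt_d", "created": 4},
    {"id": "evt_c", "created": 3},
    {"id": "evt_b", "created": 2},
    {"id": "evt_a", "created": 1},
])

page1, cursor1 = poll_once(fetch, "evt_a", limit=2)
page2, cursor2 = poll_once(fetch, cursor1, limit=2)
page3, cursor3 = poll_once(fetch, cursor2, limit=2)
```

Starting from evt_a with a page size of 2, the first poll yields evt_b and evt_c, the second yields evt_d, and the third comes back empty and leaves the cursor where it was.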
One odd thing to note is that the event IDs themselves are not strictly ordered. Note that the last event in the list begins with evt_3, which is "greater than" the event above it, evt_1. We'll discuss this more in a bit.
Missing events
We had some customers report missing items in their sync. This kicked off an investigation. We logged every request and response to and from Stripe. We then ran audits comparing the state of our synced database over time to the state of Stripe's API.
When our audits caught a missing item in our database – say, a missing Stripe subscription – we had the full trail of evidence to determine how we got there.
Our investigation revealed: the /events endpoint is eventually consistent!
Eventually consistent /events
Here's the behavior we observed: We make a request to Stripe with some event ID, say evt_0. We get back a list of 3 events. For brevity, I'll just include the id and created properties of each event. To make the created timestamps easier to read, I've formatted them into human-readable strings:
[
  {
    "created": "12:07:00",
    "id": "evt_3"
    # ...
  },
  {
    "created": "12:05:00",
    "id": "evt_2"
    # ...
  },
  {
    "created": "12:00:00",
    "id": "evt_1"
    # ...
  }
]
Given this response, our next cursor becomes evt_3. So, we make that request and get back the following events:
[
  {
    "created": "12:07:01",
    "id": "evt_7"
    # ...
  },
  {
    "created": "12:07:01",
    "id": "evt_6"
    # ...
  }
]
Problem is, at the 12:07:00 timestamp, evt_3 wasn't the only event that occurred. There are two other events, evt_4 and evt_5, which were not present in the first response. For some reason, when we used evt_3 to get our second response, the stream started at evt_6 – which occurred at 12:07:01, one second after the batch of events took place.
We can see this play out in our historic request/response logs. Yet, when we replay the request later with evt_3, we do get back evt_4 and evt_5 in the response!
This suggests there's something eventually consistent about Stripe's /events API. If we paginate through the endpoint using event IDs, we're subject to skipping events. And because we query Stripe's /events endpoint multiple times per second, we're especially vulnerable to this issue.
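To make the failure concrete, here is a toy model of the sequence above. It only models what we observed, not how Stripe works internally: the visible_at field is invented for illustration (we assume evt_4 and evt_5 become listable a few seconds after they were created), timestamps are seconds since 12:00:00, and list_after is a hypothetical helper that returns IDs oldest first.

```python
from typing import Dict, List

# Hypothetical log in its final order, oldest first.
# `visible_at` is an assumed field: when the event first shows up in a listing.
LOG: List[Dict] = [
    {"id": "evt_0", "created": 0,   "visible_at": 0},
    {"id": "evt_1", "created": 0,   "visible_at": 0},
    {"id": "evt_2", "created": 300, "visible_at": 300},
    {"id": "evt_3", "created": 420, "visible_at": 420},
    {"id": "evt_4", "created": 420, "visible_at": 423},  # shows up late
    {"id": "evt_5", "created": 420, "visible_at": 423},  # shows up late
    {"id": "evt_6", "created": 421, "visible_at": 421},
    {"id": "evt_7", "created": 421, "visible_at": 421},
]


def list_after(cursor_id: str, now: int) -> List[str]:
    """IDs of events after the cursor that are visible at time `now`."""
    all_ids = [e["id"] for e in LOG]
    start = all_ids.index(cursor_id) + 1
    return [e["id"] for e in LOG[start:] if e["visible_at"] <= now]


# A poller that always advances to the newest event it has seen.
first = list_after("evt_0", now=420)      # cursor becomes evt_3
second = list_after(first[-1], now=421)   # evt_4 and evt_5 not visible yet
third = list_after(second[-1], now=430)   # cursor is already past them

# Replaying the evt_3 request later does include the late events.
replay = list_after("evt_3", now=430)

seen = set(first + second + third)
missed = [i for i in replay if i not in seen]
print(missed)
```

In this model the poller moves its cursor to evt_7 before evt_4 and evt_5 are visible, so it never sees them, even though a later replay from evt_3 returns both.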
How is this happening?
We're not sure why this is happening. We've confirmed it can happen when events are created in the same second, but haven't ruled out it happening in other situations.
One theory we have: some Stripe event IDs are prefixed with evt_3xxx and others with evt_1xxx. These IDs seem to correspond to what object the event is enveloping. For example, events for payment_intent and charge always have an event ID with evt_3xxx. It's possible that these objects are generated in a separate system that has its own ID generator. This could explain the events potentially reaching the /events endpoint out-of-order.
Solution
To mitigate this issue, we're changing our cursoring logic. After receiving a response, to determine our cursor for the subsequent request, we follow a simple algorithm:
- If the created value on the latest event is more than 5 seconds in the past, advance the cursor to that event's ID.
- Otherwise, do not update the cursor. Instead, reuse the cursor we just used in our next request.
This means we'll "see" the same events over several requests. And for very busy Stripe /events endpoints, it could mean we add a few seconds of latency, as we might always be running just a tad behind the present. But the improved consistency guarantee is worth it.
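A minimal sketch of that cursor rule, assuming pages arrive newest first as in the examples above. The names next_cursor and SETTLE_SECONDS are illustrative, and the now parameter exists only so the function can be exercised without a real clock.

```python
import time
from typing import Dict, List, Optional

SETTLE_SECONDS = 5  # the threshold from the rule above


def next_cursor(
    current_cursor: str,
    page_newest_first: List[Dict],
    now: Optional[float] = None,
) -> str:
    """Pick the cursor for the next request.

    Advance to the latest event only once its `created` timestamp is more
    than SETTLE_SECONDS in the past; otherwise reuse the current cursor.
    """
    if not page_newest_first:
        return current_cursor  # nothing new, nothing to advance to
    now = time.time() if now is None else now
    latest = page_newest_first[0]
    if now - latest["created"] > SETTLE_SECONDS:
        return latest["id"]
    return current_cursor


page = [
    {"id": "evt_3", "created": 1000},
    {"id": "evt_2", "created": 990},
]

held = next_cursor("evt_0", page, now=1003)      # latest is 3s old: hold
advanced = next_cursor("evt_0", page, now=1006)  # latest is 6s old: advance
```

Because the cursor is held back, the same events arrive in several consecutive responses, so whatever consumes them needs to be idempotent, for example by upserting on the event's id.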
Without knowing the root cause, we can't be sure how much mitigation we'll need to resolve this issue. We'll update this post after we've run this algorithm in production for a bit and had a chance to measure drift.
In general, finding out what's changed in an API is an extremely common requirement for engineering teams. Eventual consistency makes this task very difficult. When designing your API, consider how you can use strategies that will make your API consistent and predictable.