137Foundry

Posted on Jun 14

Why Bounded Queues Beat Unbounded Queues for Integration Error Handling

#api #productivity #programming

The default instinct when designing an integration error queue is "make it unbounded." We don't want to lose data. We can always come back and triage. Just keep adding to the queue.

This is the wrong instinct. An unbounded error queue, in practice, fails worse than a bounded one. Here is why, and how to choose the right bound for your system.


The unbounded queue's failure mode

The story of an unbounded error queue is almost always the same. The queue starts empty. The integration runs cleanly for weeks. One day a producer starts emitting bad payloads at a high rate (a schema change upstream, a new feature with a bug, a partner integration regressing). The queue grows from zero to 5,000 items in a day. The team notices.

The team plans to triage on Monday. Monday comes; the queue is now 12,000 items because the bad payloads kept flowing. The team plans to triage in the next sprint. By the next sprint, it's 35,000 items. Triage now requires a sprint of dedicated effort, which the team doesn't have because they're working on something else.

By month three, the queue has 80,000 items, the team has psychologically written it off, and the queue is functionally identical to "discarded." Except that nobody made that decision explicitly; it happened by erosion.

The unbounded queue gave the team the illusion of safety. "We have all the data, we can always come back to it." But the team can't actually come back to it, because the cost of triaging 80,000 items is higher than the value of any individual item.

What a bounded queue forces

A bounded queue refuses to accept new items once it reaches its cap. The producer either has to handle the rejection (by retrying later, dropping the item, or routing to a different destination) or pause itself.

This sounds worse, but it creates back-pressure. When the queue fills, something has to give. The system can't silently accumulate to graveyard size; it tells someone, immediately, that the queue is full.
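As a minimal sketch of that refusal, Python's standard library `queue.Queue` accepts a `maxsize` and raises `queue.Full` from `put_nowait` once the cap is reached. The cap of 3 and the `record_error` helper are illustrative assumptions, not part of any particular integration stack.

```python
import queue

# A bounded error queue; the cap of 3 is an arbitrary value for the demo.
error_queue = queue.Queue(maxsize=3)


def record_error(item: dict) -> bool:
    """Try to enqueue a failed payload; return False if the queue is full."""
    try:
        error_queue.put_nowait(item)
        return True
    except queue.Full:
        # Back-pressure: the caller must now retry later, drop, reroute, or pause.
        return False


results = [record_error({"id": i}) for i in range(5)]
print(results)  # [True, True, True, False, False]
```

The `False` results are the signal: the producer learns about the problem at item 4, not at item 80,000.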

The pressure forces a decision earlier:

"We can drain it; let's drain it."
"We can't drain it; this is the upstream cause, let's fix it."
"We can't drain it AND can't fix the upstream cause; we need more triage capacity, let's add it."
"None of the above; let's explicitly accept the data loss and document why."

Any of those decisions is better than the silent unbounded drift. The bound forces a real conversation when the queue is at 500 items, instead of an impossible cleanup when it's at 80,000.

What "bounded" actually means in practice

The bound is usually implemented in one of three ways:

Hard cap with rejection. The queue accepts items up to N. Item N+1 is rejected at write time, with an error that propagates to the producer. The producer handles the rejection (often by triggering a critical alert and pausing itself).

Hard cap with drop-oldest. The queue accepts items up to N. When item N+1 arrives, item 1 (the oldest) is removed. This preserves the most recent N items in the queue, on the theory that fresh errors are more actionable than stale ones. The dropped item is logged but not retained.

Soft cap with alerting. The queue accepts items beyond N but fires a high-severity alert at N and an emergency alert at 2N. This is the lightest-touch version and works when the team is reliably responsive to alerts.

The first pattern (hard cap with rejection) is the strongest because it forces the producer to deal with the back-pressure. The second is a good fit for high-volume systems where occasional data loss is acceptable but graveyard accumulation isn't. The third is the easiest to roll out but the easiest to ignore.
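The three variants can be sketched side by side as one small class. This is an illustration under assumptions, not a production design: `ErrorQueue`, `Policy`, and `QueueFull` are hypothetical names, and the `alerts` and `dropped_log` lists stand in for a real alerting hook and a real logger.

```python
from collections import deque
from enum import Enum


class Policy(Enum):
    REJECT = "reject"            # hard cap, refuse item N+1
    DROP_OLDEST = "drop_oldest"  # hard cap, evict the oldest item
    SOFT = "soft"                # accept everything, alert at N and 2N


class QueueFull(Exception):
    """Raised to the producer when a REJECT queue is at its cap."""


class ErrorQueue:
    def __init__(self, cap: int, policy: Policy):
        self.cap = cap
        self.policy = policy
        self.items = deque()
        self.alerts = []       # stand-in for a real alerting hook
        self.dropped_log = []  # stand-in for log lines about evicted items

    def put(self, item) -> None:
        full = len(self.items) >= self.cap
        if full and self.policy is Policy.REJECT:
            raise QueueFull(f"error queue at cap ({self.cap})")
        if full and self.policy is Policy.DROP_OLDEST:
            evicted = self.items.popleft()
            self.dropped_log.append(f"dropped oldest item: {evicted!r}")
        self.items.append(item)
        if self.policy is Policy.SOFT:
            if len(self.items) == self.cap:
                self.alerts.append("high")
            elif len(self.items) == 2 * self.cap:
                self.alerts.append("emergency")


drop_q = ErrorQueue(cap=3, policy=Policy.DROP_OLDEST)
for i in range(5):
    drop_q.put(i)
print(list(drop_q.items))  # [2, 3, 4]

soft_q = ErrorQueue(cap=2, policy=Policy.SOFT)
for i in range(5):
    soft_q.put(i)
print(soft_q.alerts)  # ['high', 'emergency']
```

Only the `REJECT` policy raises to the caller, which is why it is the one that forces the producer to react.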

The Amazon Builders' Library has detailed writeups of these patterns in its articles on queue backlogs and load shedding.

How to choose the cap

The right cap is "what the team can drain in 48 hours under sustained load."

If your team can triage 100 items per day comfortably, set the cap at 200. A full queue is then two days of triage work, which leaves headroom for an incident at twice the normal daily volume and still allows recovery within a week.

If your team can triage 500 items per day, set the cap at 1,000. Same logic.
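The arithmetic behind both examples is the same: daily drain rate multiplied by the 48-hour window. A small sketch, where the function name `choose_cap` and its parameter names are illustrative:

```python
def choose_cap(drain_per_day: int, drain_window_hours: int = 48) -> int:
    """Cap = what the team can drain within the window at its normal rate."""
    return drain_per_day * drain_window_hours // 24


print(choose_cap(100))  # 200
print(choose_cap(500))  # 1000
```

Keeping the window as a parameter makes the trade-off explicit: a team that wants a tighter recovery target passes a smaller window and gets a smaller cap.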

The cap is not "how many items we can theoretically store." That's the storage limit, which is much higher. The cap is "how many items we can recover from."

Setting the cap too low means more rejections during normal incidents, which adds noise. Setting it too high means the queue can reach graveyard size before the alert fires, defeating the purpose of bounding.

For most integration teams in mid-sized organizations, a cap between 200 and 1,000 items is appropriate. Larger if you have a dedicated integration ops team; smaller if you don't.

Handling the rejection cleanly

When the queue rejects a write, the producer needs to handle it. Three reasonable strategies:

Pause the producer entirely. The producer stops emitting. New events are buffered upstream (in your CDC stream, your webhook receiver, your scheduled poller's checkpoint). Once the queue drains below a low watermark (a resume threshold set below the cap, so the producer doesn't flap between paused and running), the producer resumes. This is the simplest pattern and the one I default to.

Spill to a secondary queue. Excess items go to a "spillover" queue with looser retention. This gives you some recovery option without unbounded primary queue growth. The risk is that the spillover queue becomes a second graveyard.

Drop with full logging. Excess items are dropped, logged at high severity, and counted in a metric. This is appropriate for high-volume integrations where some loss is acceptable; it's wrong for anything financial or compliance-relevant.

The right strategy depends on what kind of data is flowing. Pick deliberately, document the choice, and revisit annually.
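The pause-and-resume strategy is essentially a two-threshold switch. The sketch below assumes the producer can observe queue depth; `PausingProducer` and the resume threshold of half the cap are hypothetical choices for illustration.

```python
class PausingProducer:
    """Pauses when the queue hits its cap; resumes below a low watermark."""

    def __init__(self, cap: int, resume_below: int):
        if resume_below >= cap:
            raise ValueError("resume_below must be lower than cap")
        self.cap = cap
        self.resume_below = resume_below
        self.paused = False

    def update(self, depth: int) -> bool:
        """Feed the current queue depth; return True if the producer may emit."""
        if not self.paused and depth >= self.cap:
            self.paused = True   # queue is full: stop emitting
        elif self.paused and depth < self.resume_below:
            self.paused = False  # queue has drained enough: resume
        return not self.paused


producer = PausingProducer(cap=200, resume_below=100)
depths = [50, 199, 200, 150, 100, 99, 120]
may_emit = [producer.update(d) for d in depths]
print(may_emit)  # [True, True, False, False, False, True, True]
```

The gap between the two thresholds is what keeps the producer from toggling on every single drained item.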

The metrics that matter

A bounded queue has a different set of useful metrics from an unbounded one. The interesting ones:

Time spent at >80% capacity per day (the "danger zone" the team should never live in)
Number of rejection events per day (if non-zero, something needs attention)
Ratio of drained-per-day to added-per-day (over 1.0 means you're winning)

The unbounded queue's depth-over-time metric is replaced by these three. Together they tell you whether the system is healthy, getting better, or getting worse, in a way raw depth never could.
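A rough sketch of how the three numbers could be computed from one day of data. It assumes depth is sampled at a fixed interval, so the share of samples above 80% approximates time spent there; `queue_health` and its field names are illustrative.

```python
def queue_health(depth_samples, cap, rejections, drained, added):
    """Summarize one day of queue activity.

    depth_samples: queue depths taken at a fixed interval over the day.
    """
    in_danger = sum(1 for d in depth_samples if d > 0.8 * cap)
    return {
        "fraction_of_day_above_80pct": in_danger / len(depth_samples),
        "rejections": rejections,
        # Above 1.0 means the queue shrank; None if nothing arrived.
        "drain_ratio": drained / added if added else None,
    }


metrics = queue_health(
    depth_samples=[10, 50, 170, 190, 120, 40],
    cap=200,
    rejections=0,
    drained=90,
    added=60,
)
print(metrics["drain_ratio"])  # 1.5
```

In this sample day, two of six samples sit above 160 items, so roughly a third of the day was spent in the danger zone even though the drain ratio looks healthy.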

These fit cleanly as service level indicator (SLI) candidates in the framework the Google SRE workbook describes.

Common objections, briefly

"What if we lose data?" You're already losing data in an unbounded queue, just silently. A bounded queue surfaces the loss explicitly so you can make a real decision about it.

"What if the cap is wrong?" Adjust it. The cap is a configuration value, not a foundational architectural decision. Re-evaluate quarterly based on actual drain rates.

"Won't this break during incidents?" It will refuse new writes during incidents, which is what you want. The incident is the moment to fix the upstream cause, not to silently buffer everything.

"Our compliance requires us to retain everything." Then you need durable archival, not an unbounded queue. Write rejected items to S3 with a lifecycle policy. The queue is for triage; the archive is for compliance. Different problems, different tools.

The Twelve-Factor App methodology has good arguments for separating ephemeral processing from durable storage; the same logic applies here.
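The queue-versus-archive split can be sketched as a write path that sends rejected items to durable storage instead of growing the queue. Everything here is an assumption for illustration: `write_error` is a hypothetical helper, and `archive` is any callable taking a key and a bytes body. An in-memory dict stands in for the object store; in a real system that callable might wrap an S3 upload.

```python
import json


def write_error(item: dict, queue_items: list, cap: int, archive) -> bool:
    """Enqueue for triage if there is room; otherwise archive the rejected item.

    Returns True if the item was queued, False if it was archived instead.
    """
    if len(queue_items) < cap:
        queue_items.append(item)
        return True
    key = f"rejected/{item['id']}.json"
    archive(key, json.dumps(item).encode("utf-8"))
    return False


store = {}   # in-memory stand-in for a durable object store
triage = []  # the bounded triage queue
for i in range(4):
    write_error({"id": i}, triage, cap=2, archive=store.__setitem__)

print(len(triage), sorted(store))
# 2 ['rejected/2.json', 'rejected/3.json']
```

The triage queue stays at its cap while the archive absorbs the overflow, so retention no longer depends on anyone draining the queue.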

How this changes the rest of the system

A bounded queue forces the rest of the integration to be honest. The producer can't pretend it's emitting events indefinitely; sometimes it has to pause. The team can't pretend they can triage everything; sometimes they have to accept loss. The upstream sources can't get away with bad payloads forever; sometimes they're forced to fix them.

The honesty is uncomfortable in the short term and load-bearing in the long term. Teams that adopt bounded queues stop having queue-related graveyards. They have incidents instead, which are easier to fix because they happen at specific moments rather than slowly accumulating.

The fuller walkthrough of the broader queue redesign (covering ownership, alerting, triage rules, and what good metrics look like) is in our long-form article: How to Design Integration Error Queues Your Team Will Actually Drain.

If your team is operating an unbounded queue today and wants help migrating to a bounded design without losing data along the way, our data integration service handles that kind of work. For the broader scope of what 137Foundry covers, see 137Foundry.

The right cap, with the right rejection handling, with the right rate-of-arrival alerts, is the foundation of a queue that stays drainable forever. Without the bound, every queue eventually becomes a graveyard.