DEV Community

Gerus Lab


Rate Limiting, Retry Logic, and Why Your DIY Claude Proxy Is Silently Dropping Requests


You built a Claude proxy. Seems simple enough — forward requests, handle authentication, return responses. It works fine in development. You test it with a handful of calls and everything looks great.

Then you add real users.

Suddenly, requests start failing. Not all of them, not predictably, just occasionally. Some users report getting no response. Your logs show errors but not consistently. Usage is lower than expected for the actual traffic you're seeing. Something is dropping requests and you're not sure what.

The answer is almost always rate limiting — and the fact that your retry logic is either broken, absent, or making things worse.

How Anthropic Rate Limits Actually Work

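Anthropic's API uses two overlapping rate limit dimensions: requests per minute (RPM) and tokens per minute (TPM). They're separate counters, and you can hit either one independently. For current models the token limit is further split into separate input and output budgets.

The tiers look something like this. Treat the figures as illustrative: Anthropic revises its spend thresholds and limits, and the limits differ by model, so check Anthropic's current rate limit documentation for the numbers that apply to your key: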

  • Free tier: 5 RPM, 25,000 TPM
  • Tier 1 (~$100 lifetime spend): 50 RPM, 50,000 TPM
  • Tier 2 (~$500 lifetime spend): 1,000 RPM, 100,000 TPM
  • Tier 3 (~$5,000 lifetime spend): 2,000 RPM, 200,000 TPM
  • Tier 4 (~$25,000 lifetime spend): 4,000 RPM, 400,000 TPM

These look like comfortable headroom until you're running a proxy. When your proxy has multiple users, those aren't separate rate limit pools — they're all drawing from the same bucket tied to your API key.

Five users each making modest requests can blow past your RPM limit in seconds. One user with a long document summary can eat your TPM budget before anyone else gets a chance. And Anthropic returns a 429 Too Many Requests response with no priority queue — it's just a rejection.
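To make the "two separate counters, one shared bucket" point concrete, here is a minimal sketch of a proxy-side tracker that admits a request only when both the request budget and the token budget have room. It is not how Anthropic enforces limits server-side; the class name `DualWindowBudget`, its parameters, and the idea that the caller supplies a token estimate are all assumptions made for illustration.

```python
import time
from collections import deque


class DualWindowBudget:
    """Tracks requests and tokens over a sliding window (hypothetical helper)."""

    def __init__(self, rpm_limit, tpm_limit, window=60.0, clock=time.monotonic):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.window = window
        self.clock = clock          # injectable so the class can be tested
        self.events = deque()       # (timestamp, tokens) per admitted request

    def _expire(self, now):
        # Drop requests that have aged out of the window.
        while self.events and now - self.events[0][0] >= self.window:
            self.events.popleft()

    def try_admit(self, estimated_tokens):
        """Return True and record the request if BOTH budgets have room."""
        now = self.clock()
        self._expire(now)
        tokens_used = sum(tokens for _, tokens in self.events)
        if len(self.events) + 1 > self.rpm_limit:
            return False  # request budget exhausted
        if tokens_used + estimated_tokens > self.tpm_limit:
            return False  # token budget exhausted
        self.events.append((now, estimated_tokens))
        return True
```

Because every user of the proxy shares one instance, a single large request can cause `try_admit` to return False for everyone else until the window rolls over, which is exactly the shared-bucket behavior described above.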

The 429 Problem: What Most Proxies Get Wrong

When you hit a rate limit and get a 429, the right move is to wait and retry. The naive implementation does exactly that — but implements it badly.

The flat retry pattern is the most common mistake:

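import time

def call_with_flat_retry(request):
    # call_claude is a placeholder for whatever function sends the request
    for attempt in range(3):
        response = call_claude(request)
        if response.status_code == 429:
            time.sleep(1)  # Fixed 1-second wait
            continue
        return response
    return response  # Still 429 after three attempts; hand it back to the caller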

This looks reasonable. It's actually a disaster under load.

Here's why: if ten requests all hit the rate limit simultaneously and all retry after exactly one second, they all retry simultaneously. You've just moved the traffic collision from T+0 to T+1. If the retry also fails (which it will — you haven't cleared your rate limit window in one second), they all retry again at T+2. You've created a synchronized stampede.

The exponential backoff pattern fixes the timing problem:

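import time

def call_with_backoff(request):
    for attempt in range(5):
        response = call_claude(request)
        if response.status_code == 429:
            wait = 2 ** attempt  # 1, 2, 4, 8, 16 seconds
            time.sleep(wait)
            continue
        return response
    return response  # Still 429 after five attempts; hand it back to the caller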

Better. But still missing something critical.

The exponential backoff with jitter pattern is what actually works:

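import random
import time

def call_with_jittered_backoff(request):
    for attempt in range(5):
        response = call_claude(request)
        if response.status_code == 429:
            base_wait = 2 ** attempt
            jitter = random.uniform(0, base_wait)  # random extra delay in [0, base_wait]
            time.sleep(base_wait + jitter)
            continue
        return response
    # Retries exhausted: return the last 429 instead of an implicit None
    return response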

The jitter spreads retries across a window instead of synchronizing them. Now ten failed requests retry at ten different times within a range instead of all hitting simultaneously.

Without jitter, exponential backoff just shifts your thundering herd problem to larger time intervals.
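The loops above can be hardened a little further. The sketch below caps the backoff, treats the 529 Overloaded status (covered later in this article) as retryable, skips the pointless sleep after the final attempt, and waits at least as long as a `retry-after` header asks when one is present. The names `send`, `backoff_delay`, and `call_with_retry` are hypothetical, and the code assumes the response object exposes `.status_code` and a dict-like `.headers` with `retry-after` given in seconds.

```python
import random
import time

RETRYABLE = {429, 529}  # rate limited, overloaded


def parse_retry_after(headers):
    """Return the retry-after value in seconds, or None if absent or not numeric."""
    value = headers.get("retry-after")
    try:
        return float(value) if value is not None else None
    except ValueError:
        return None


def backoff_delay(attempt, retry_after=None, cap=30.0):
    """Seconds to wait before the next try; `attempt` is 0-based."""
    base_wait = min(cap, 2 ** attempt)
    jitter = random.uniform(0, base_wait)
    # Never retry sooner than the server asked, but still add jitter on top
    # so that callers given the same retry-after value do not resynchronize.
    floor = max(base_wait, retry_after or 0.0)
    return floor + jitter


def call_with_retry(send, request, max_attempts=5, sleep=time.sleep):
    response = None
    for attempt in range(max_attempts):
        response = send(request)
        if response.status_code not in RETRYABLE:
            return response
        if attempt == max_attempts - 1:
            break  # out of attempts; do not sleep before giving up
        sleep(backoff_delay(attempt, parse_retry_after(response.headers)))
    return response  # still 429/529: let the caller surface a real error
```

Returning the final failed response, rather than falling off the end of the loop, matters for the theme of this article: an implicit `None` is one of the easiest ways to turn a rate limit into a silent drop.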

Silent Drops: The Failure Mode You're Not Logging

Here's the failure mode that's hardest to catch: requests that appear to succeed but return empty or partial results.

This happens in a few ways:

Incomplete response handling. Claude's API supports streaming responses. If your proxy drops the connection before the stream completes — due to a client timeout, a network interruption, or a poorly configured proxy timeout — the client receives a partial response. The request "succeeded" from the API's perspective. Your logs show a 200. But the user got truncated output.

Silent 529 handling. Anthropic occasionally returns 529 Overloaded — not a rate limit, but a server capacity issue. If your retry logic only handles 429 and not 529, overloaded responses get surfaced to the user as errors instead of being retried.

Token budget exhaustion mid-stream. If you're enforcing a token budget in your proxy and a response exceeds it mid-stream, you need to either buffer and truncate or handle the cutoff gracefully. Many DIY implementations just... stop sending. The user sees a response that ends in the middle of a sentence.

Connection pool exhaustion under load. If you're running a proxy without connection pooling configured properly, concurrent requests can starve each other. Requests queue up waiting for a connection slot, hit their client-side timeout, and drop. The API never saw them; they died in your proxy's connection queue.

Most of these look the same in your logs: a request that started, perhaps an outgoing call to the API, and then neither an error nor a completion record.
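One way to make truncated streams visible is to watch for the terminal event while relaying. Anthropic's streaming Messages API ends a successful stream with a `message_stop` event, and a preceding `message_delta` event carries the `stop_reason`. The sketch below assumes the server-sent events have already been parsed into dicts with a `type` key; `relay_stream` and `forward` are hypothetical names, not part of any SDK.

```python
def relay_stream(events, forward):
    """Forward parsed stream events and report whether the stream finished cleanly.

    `events` is any iterable of dicts with a "type" key; `forward` is a callable
    that sends one event on to the client. Both are assumptions of this sketch.
    """
    stop_reason = None
    saw_stop = False
    forwarded = 0
    error = None
    try:
        for event in events:
            kind = event.get("type")
            if kind == "message_delta":
                stop_reason = event.get("delta", {}).get("stop_reason", stop_reason)
            elif kind == "message_stop":
                saw_stop = True
            forward(event)
            forwarded += 1
    except (ConnectionError, TimeoutError) as exc:
        # The upstream or client connection died mid-stream.
        error = repr(exc)
    return {
        "complete": saw_stop and error is None,
        "events": forwarded,
        "stop_reason": stop_reason,
        "error": error,
    }
```

Logging the returned dict for every streamed request turns "200 in the logs, half an answer on screen" into a record with `complete: false`. A `stop_reason` of `max_tokens` is worth flagging as well, since that stream is complete from the API's point of view but still ends mid-thought for the user.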

Multi-User Queue Starvation

This is the failure mode that gets teams when they move from single-user to multi-user setups.

Imagine your proxy handles requests on a first-come-first-served basis, and you're running near your RPM limit. A single user makes a rapid series of requests — autocomplete, a long document analysis, a few clarifying questions. This user consumes most of your available request budget for the current window.

Other users' requests are now in a queue. They're waiting. Some of them time out before the rate limit window resets. The users affected don't get an error — they just get no response. From their perspective, the proxy is intermittently broken.

This is queue starvation, and it's almost impossible to diagnose without per-user request attribution in your logs. If you're logging timestamp + status + latency but not user_id + queue_time, you'll see the overall rate as fine while individual users are suffering.
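As a rough illustration of what per-user attribution can look like, the sketch below defines a log record that carries the user ID and derives queue time separately from upstream latency. The `RequestLog` class, its field names, and the `worst_queue_time_by_user` helper are hypothetical; adapt them to whatever logging pipeline you already have.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class RequestLog:
    user_id: str
    enqueued_at: float   # when the proxy accepted the request
    started_at: float    # when the proxy actually called the API
    finished_at: float   # when the response (or failure) came back
    status: int
    input_tokens: int
    output_tokens: int

    @property
    def queue_time(self):
        return self.started_at - self.enqueued_at

    @property
    def latency(self):
        return self.finished_at - self.started_at

    def to_json(self):
        record = asdict(self)
        record["queue_time"] = round(self.queue_time, 3)
        record["latency"] = round(self.latency, 3)
        return json.dumps(record, sort_keys=True)


def worst_queue_time_by_user(logs):
    """Map each user_id to the longest time any of its requests spent queued."""
    worst = {}
    for log in logs:
        worst[log.user_id] = max(worst.get(log.user_id, 0.0), log.queue_time)
    return worst
```

Aggregating by user is the point: an average latency across all requests can look healthy while one user's worst queue time sits well past their client timeout.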

The fix requires implementing fair queuing with per-user rate limiting at the proxy layer, before requests hit Anthropic's API. This is non-trivial to implement correctly. The naive version (round-robin across users) doesn't account for different request sizes or priorities. A proper implementation needs token-aware scheduling.
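One well-known shape for token-aware fair queuing is deficit round-robin: each user earns a fixed token credit per round and may only dispatch a request once their credit covers its estimated cost. The sketch below is a simplified, single-threaded version under that assumption; the `FairQueue` class, the `quantum` value, and the caller-supplied token estimates are hypothetical, and a production scheduler would also need locking and priorities.

```python
from collections import OrderedDict, deque


class FairQueue:
    """Deficit round-robin over per-user queues, weighted by estimated tokens."""

    def __init__(self, quantum=1000):
        self.quantum = quantum        # token credit each user gains per round
        self.queues = OrderedDict()   # user_id -> deque of (request, tokens)
        self.deficit = {}             # user_id -> unspent token credit

    def put(self, user_id, request, estimated_tokens):
        self.queues.setdefault(user_id, deque()).append((request, estimated_tokens))
        self.deficit.setdefault(user_id, 0)

    def get(self):
        """Return (user_id, request) for the next request to send, or None if empty."""
        while self.queues:
            user_id, queue = next(iter(self.queues.items()))
            request, tokens = queue[0]
            if tokens <= self.deficit[user_id]:
                queue.popleft()
                self.deficit[user_id] -= tokens
                if not queue:
                    # Idle users keep no credit, so they cannot bank a burst.
                    del self.queues[user_id]
                    del self.deficit[user_id]
                return user_id, request
            # Not enough credit yet: top up and move this user to the back.
            self.deficit[user_id] += self.quantum
            self.queues.move_to_end(user_id)
        return None
```

With a quantum of 1,000 tokens, a user who queues five 500-token requests ahead of another user's single 500-token request gets two of them dispatched before the second user is served, instead of all five. Large requests are not starved either: their owner keeps accumulating credit each round until the request fits.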

What Correct Proxy Rate Limiting Actually Requires

If you want to build a DIY proxy that handles rate limiting correctly, here's the minimum you need:

1. Per-user request tracking. Log every request with user ID, timestamp, input tokens, output tokens, and queue time. This is the foundation for everything else.

2. Token-aware scheduling. Don't just count requests — account for the token weight of each request when scheduling. A 2,000-token request consumes the same RPM slot as a 100-token request but 20x the TPM budget.

3. Proper retry with jitter. As described above — exponential backoff with randomized jitter, handling both 429 and 529 responses.

4. Circuit breakers. If Anthropic's API is consistently returning errors, stop hammering it and fail fast until the situation clears. A circuit breaker pattern prevents a temporarily degraded upstream from causing cascading failures in your proxy.

5. Fair queuing with starvation prevention. Some form of per-user rate limiting at the proxy layer so that no single user can exhaust the shared capacity budget.

6. Stream integrity monitoring. Track streaming responses to detect incomplete delivery. If a stream terminates before a natural completion signal, flag it, log it, and optionally retry.

7. Timeout configuration that matches reality. Claude's longer requests can take 30-60 seconds. If your proxy has a 15-second timeout, you're silently dropping a meaningful fraction of legitimate requests.
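Of the items above, the circuit breaker is the one most often skipped, so here is a minimal sketch of the idea. It opens after a run of consecutive upstream failures and lets traffic through again once a cooldown has passed. The `CircuitBreaker` class, its thresholds, and the injectable `clock` are assumptions for illustration, not a recommendation of specific values.

```python
import time


class CircuitBreaker:
    """Opens after consecutive upstream failures; reopens to traffic after a cooldown."""

    def __init__(self, failure_threshold=5, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # Half-open: once the cooldown has elapsed, let requests probe upstream.
        return self.clock() - self.opened_at >= self.cooldown

    def record_success(self):
        # Any success closes the circuit and clears the failure streak.
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Open (or re-open) the circuit and restart the cooldown.
            self.opened_at = self.clock()
```

A proxy would call `allow_request()` before each upstream call and fail fast with a clear error when it returns False. Which responses count as failures is a design choice: counting 529s, 5xx responses, and timeouts is a reasonable starting point, while 429s are usually better handled by the rate limiter and retry logic than by tripping the breaker.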

This is a non-trivial amount of infrastructure to build and maintain correctly. And it's the kind of thing that looks fine in testing (where you control traffic patterns) and breaks in production (where you don't).

Why This Gets Worse Over Time

DIY proxy issues tend to compound rather than stabilize.

You add a new user. Now queue starvation occasionally hits user #1 for the first time. You notice, add a band-aid. You add another user. The band-aid doesn't hold. You patch again.

You add a new feature that generates longer outputs. Now you're hitting TPM limits you weren't before. Your existing retry logic handles RPM 429s but not TPM 429s (Anthropic returns different error messages for each). You fix that.

A user finds a use case that generates a lot of short requests in rapid succession. Now your RPM limit is the constraint. Your token-aware scheduler doesn't account for this pattern. You adjust.

Each fix is reasonable. The cumulative result is a fragile system with multiple layered patches, each dependent on the others, none of them tested against the combination of load patterns that actually appears in production.

The math question you eventually ask yourself: how many engineering hours have you spent on this? What's the opportunity cost of not shipping the actual features your users want?

The Managed Alternative

ShadoClaw exists because this infrastructure problem is solved and shouldn't require rebuilding by every team that wants to run Claude reliably.

The proxy layer handles Anthropic rate limits, retry logic, fair queuing, stream integrity, and connection management at the infrastructure level. You get predictable behavior without building and maintaining the mechanisms that produce it.

Pricing is flat-rate by tier:

  • Solo ($29/month): One account. Predictable monthly cost.
  • Pro ($79/month): Five accounts. Right for small teams and agencies.
  • Team ($179/month): Twenty accounts. Production workloads at scale.

No per-token billing anxiety. No "engineering sprint to fix the retry logic" because you added two new users. The rate limiting complexity is handled; you focus on what you're building.

All plans include a free 3-day trial. If you've been chasing intermittent failures in a DIY proxy setup, the quickest diagnostic is to route traffic through ShadoClaw and see if they disappear.

ShadoClaw is built and maintained by Gerus-lab, an IT engineering studio specializing in AI, Web3, and SaaS infrastructure.

The Practical Checklist Before Your Next Deployment

If you're staying DIY for now, run through this before you add more users:

  1. Does your retry logic use jitter? If not, add it before you do anything else.
  2. Are you handling both 429 and 529? Check your error handling explicitly for both status codes.
  3. Do your timeouts match Claude's actual response times? Test with your longest expected prompts and set timeouts 20% above that.
  4. Are you logging user ID and queue time on every request? Without this, starvation is invisible.
  5. Do you have per-user rate limiting at the proxy layer? Or can one user consume all capacity?
  6. Are streaming responses validated for completion? Or can partial responses reach users silently?

Fix these in order. They're not equal in impact: in multi-user setups, missing retry jitter tends to account for a disproportionate share of correlated failures, which is why it comes first.

The problems are solvable. The question is whether solving them yourself is the right use of your time.


ShadoClaw — Managed Claude API proxy. Flat-rate pricing. Free 3-day trial.
