Lars Winstand

Posted on May 18 • Originally published at standardcompute.com

I stopped fighting the Anthropic API rate limit when I realized one model shouldn’t do every job

#ai #anthropic #api #devops

I kept seeing the same advice every time someone hit an Anthropic wall:

ask support for higher limits
buy more credits
trim the prompt
disable thinking
retry slower

Sometimes that helps.

A lot of the time, it doesn’t.

What changed my mind was a thread on r/openclaw where someone wrote: “Every time something interesting emerges in the Claude ecosystem, Anthropic finds a way to throttle it.”

That line is dramatic, but the underlying point is solid: if your OpenClaw stack, n8n workflow, Zapier automation, or custom agent runner assumes one provider should handle every request, every burst, and every recovery path, you didn’t build a resilient system.

You built a single point of failure.

And rate limits are usually just the first symptom.

Anthropic rate limits are not one number

A lot of teams talk about an “Anthropic API rate limit” like it’s one clean threshold.

It isn’t.

Anthropic’s limits are multi-axis:

RPM: requests per minute
ITPM: input tokens per minute
OTPM: output tokens per minute
acceleration limits: 429s triggered when usage ramps up too sharply

That last one is where agent workloads get wrecked.

Your dashboard can look fine on average. Your spreadsheet can look fine on average. But agents do not behave like averages.

They fan out.
They retry.
They call tools.
They summarize.
They wake up in bursts.

That means you can absolutely get 429s even when your “normal” usage looks fine.

If you’re running multi-step automations, coding agents, or background workflows, this matters more than prompt tuning.
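To make the "averages lie" point concrete, here is a minimal sketch that compares average requests per minute with the busiest single minute. The traffic pattern and the `rpm_limit` of 50 are made-up numbers, and fixed 60-second windows are a simplification of however a provider actually enforces its limits.

```python
from collections import Counter


def per_minute_counts(timestamps_s):
    """Count requests in each fixed 60-second window."""
    return Counter(int(t // 60) for t in timestamps_s)


def summarize(timestamps_s, rpm_limit, total_minutes):
    """Compare the hourly average with the worst single minute."""
    counts = per_minute_counts(timestamps_s)
    return {
        "average_rpm": len(timestamps_s) / total_minutes,
        "peak_rpm": max(counts.values(), default=0),
        "minutes_over_limit": sum(1 for c in counts.values() if c > rpm_limit),
    }


# Hypothetical hour: the agent idles, then fans out 120 calls within
# one minute at the top of the hour and again at minute 30.
burst = [i * 0.4 for i in range(120)]
timestamps = burst + [1800 + t for t in burst]

report = summarize(timestamps, rpm_limit=50, total_minutes=60)
print(report)
```

The average comes out at 4 requests per minute, which looks harmless, while two individual minutes carry 120 requests each. That gap is what a dashboard average hides.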

The 23-second first token problem

Another r/openclaw thread described a worse failure mode than 429s:

“The problem is that my agents are taking 23 seconds to respond to me, even in a new chat session with 0 context.”

That’s not a small latency regression.

That’s a broken product experience.

What made that thread interesting is that they had already tried the usual fixes:

different models
thinking disabled
memory disabled
MCP servers disabled
faster provider path

At that point, changing a prompt is just superstition.

If the request is still slow after obvious cleanup, the problem is probably somewhere in the full path:

gateway overhead
retries
provider selection
orchestration
fallback behavior
sending interactive work to a model path that should be reserved for harder jobs

That’s where routing stops being an optimization and becomes basic engineering.
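Before changing anything, it helps to know which hop is eating the time. The sketch below is one way to time each stage of a request. The stage names and the `time.sleep` calls are stand-ins for your own gateway, selection, and provider calls, not a real integration.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage of one request."""

    def __init__(self):
        self.durations = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            self.durations[name] = self.durations.get(name, 0.0) + elapsed

    def slowest(self):
        """Name of the stage that took the most total time."""
        return max(self.durations, key=self.durations.get)


timer = StageTimer()
with timer.stage("gateway"):             # hypothetical stage
    time.sleep(0.01)
with timer.stage("provider_selection"):  # hypothetical stage
    time.sleep(0.01)
with timer.stage("first_token"):         # hypothetical stage
    time.sleep(0.05)

print(timer.slowest(), timer.durations)
```

If the slowest stage turns out to be something other than the model call, no prompt change was ever going to fix it.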

One provider for everything sounds clean. It breaks in boring ways.

I understand why teams do it.

One provider is easier for:

evals
compliance review
prompt tuning
output consistency
debugging

That part is real.

But the trade is brutal: you simplify governance by pushing complexity into operations.

Then operations punches you in the face.

If your architecture assumes one upstream will always be available, always be fast, and always tolerate bursts, you are building on hope.

A commenter in that throttling thread had the best summary:

“You get to use the engine, but you’re not allowed to redline it.”

Exactly.

If your agents run 24/7, an engine you can’t redline is not a strategy. It’s a warning label.

The fix is routing by job type

The real fix for an Anthropic rate limit problem usually isn’t more credits.

It’s routing.

Not every request deserves Claude Opus 4.6.
Not every request should hit Anthropic synchronously.
Not every request should go through the same path.

A sane routing policy looks more like this:

Interactive, low-stakes turns go to the fastest acceptable path.
Hard coding, planning, and recovery turns go to Claude Sonnet 4.6 or Claude Opus 4.6.
Cheap bulk work goes to lower-cost models where quality is good enough.
Large async jobs go to batch APIs or queues.

That’s not anti-Claude.

That’s architecture.

A practical routing policy you can ship this week

Here’s a simple mental model:

Job type	Best path
User waiting in chat	Lowest-latency acceptable model/provider
Hard code generation	Claude Sonnet 4.6 or Claude Opus 4.6
Recovery after tool failure	Premium reasoning model
Classification / summarization / tagging	Smaller or cheaper model
Nightly backfills / async analysis	Batch API or queued worker

If you’re sending all five of those through one synchronous provider path, you are creating your own outages.

Anthropic Message Batches is useful, but only for the right jobs

One thing more teams should use: Anthropic Message Batches.

It exists for a reason.

Async work should be async.

If the job can finish later, don’t force it through the same path as a user-facing interaction.

Good candidates:

nightly summaries
backfills
non-urgent document processing
large-scale classification
background enrichment

Bad candidates:

chat replies
live copilots
“user is staring at the spinner” workflows

That distinction sounds obvious, but a lot of systems ignore it.
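As a rough sketch of what moving a job to the batch path involves, the helper below turns a set of documents into Message Batches request entries, each with a `custom_id` and a `params` object. The function names, the prompt text, the chunk size, and the `your-model-id` placeholder are all assumptions for illustration. The submission call is left as a comment because it needs an API key.

```python
def build_batch_requests(documents, model, max_tokens=512):
    """Turn {doc_id: text} into Message Batches request entries."""
    entries = []
    for doc_id, text in documents.items():
        entries.append({
            "custom_id": f"summary-{doc_id}",  # used to match results later
            "params": {
                "model": model,
                "max_tokens": max_tokens,
                "messages": [{
                    "role": "user",
                    "content": f"Summarize in three bullet points:\n\n{text}",
                }],
            },
        })
    return entries


def chunked(items, size):
    """Yield successive slices so one submission never grows unbounded."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


docs = {"a1": "alpha report", "b2": "beta report", "c3": "gamma report"}
entries = build_batch_requests(docs, model="your-model-id")  # placeholder id
chunks = list(chunked(entries, 2))
print(len(entries), [len(c) for c in chunks])

# Submitting with the official Python SDK would look roughly like:
#   import anthropic
#   client = anthropic.Anthropic()
#   for chunk in chunks:
#       batch = client.messages.batches.create(requests=chunk)
```

Results arrive later and are matched back by `custom_id`, which is exactly why this path is wrong for anything with a spinner in front of it.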

The tooling for routing already exists

This is not a future pattern. You can do it now.

Option 1: OpenRouter-style provider controls

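If you want one API surface with provider routing and fallbacks, this kind of request shape is the idea. Treat the values as placeholders: the "order" list only ranks providers that actually serve the requested model, so list the ones that host the model you pick.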

{
  "model": "openai/gpt-4.1",
  "messages": [
    {"role": "user", "content": "ping"}
  ],
  "provider": {
    "order": ["anthropic", "openai"],
    "allow_fallbacks": true,
    "sort": "latency",
    "preferred_max_latency": 5
  }
}

That is much more useful than arguing online about which model is the One True Model.

Option 2: LiteLLM Router

If you want explicit fallbacks and routing logic in Python, LiteLLM is a practical option.

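import os

from litellm import Router

# Model ids are written as in this article; confirm the exact identifiers
# in each provider's documentation before deploying.
router = Router(
    model_list=[
        {
            # Cheap, fast deployment for ordinary turns.
            "model_name": "fast-path",
            "litellm_params": {
                "model": "openai/gpt-4.1-mini",
                "api_key": os.environ["OPENAI_API_KEY"],
                "rpm": 60
            }
        },
        {
            # Premium deployment, deliberately given a smaller request budget.
            "model_name": "premium-path",
            "litellm_params": {
                "model": "anthropic/claude-sonnet-4.6",
                "api_key": os.environ["ANTHROPIC_API_KEY"],
                "rpm": 10
            }
        }
    ],
    # If a premium-path call fails, retry the request on fast-path.
    fallbacks=[
        {"premium-path": ["fast-path"]}
    ]
)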

If you want a shared gateway instead of in-process routing, run the proxy and declare the same model list in its YAML config:

litellm --config /path/to/config.yaml

Option 3: Build the policy into your own app

You don’t need a giant control plane to get the benefits.

Even a simple router helps.

type JobType = "interactive" | "coding" | "recovery" | "batch";

// Returns a route label: a model name for synchronous work,
// or a queue name for work that can finish later.
function pickModel(jobType: JobType): string {
  switch (jobType) {
    case "interactive":
      return "fast-model";
    case "coding":
      return "claude-sonnet-4.6";
    case "recovery":
      return "claude-opus-4.6";
    case "batch":
      return "batch-queue";
  }
}

That tiny amount of explicitness is better than “send everything to Claude and hope.”

Traffic shaping matters as much as model choice

If your workload is bursty, routing alone won’t save you.

You also need traffic shaping.

At minimum:

queue non-urgent work
cap concurrency
jitter retries
separate interactive traffic from background traffic
stop one workflow from stampeding the same endpoint

A basic worker queue is often enough.

# example shape, not a production command
export INTERACTIVE_CONCURRENCY=20
export BATCH_CONCURRENCY=4
export RETRY_JITTER_MS=250
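Here is one way a worker could consume those knobs, sketched with asyncio. The `RateLimited` exception, the `run_capped` helper, and the backoff constants are hypothetical names and values. The idea is just a semaphore per traffic class plus jittered exponential backoff.

```python
import asyncio
import os
import random

BATCH_CONCURRENCY = int(os.environ.get("BATCH_CONCURRENCY", "4"))
RETRY_JITTER_MS = int(os.environ.get("RETRY_JITTER_MS", "250"))


class RateLimited(Exception):
    """Hypothetical error your client raises when it sees a 429."""


def backoff_delay_s(attempt, base_s=0.5, cap_s=30.0, jitter_ms=RETRY_JITTER_MS):
    """Exponential backoff plus random jitter so retries do not line up."""
    exponential = min(cap_s, base_s * (2 ** attempt))
    return exponential + random.uniform(0, jitter_ms / 1000)


async def run_capped(semaphore, call, max_attempts=4):
    """Run `call` under a concurrency cap, retrying rate-limit errors."""
    # The slot stays held during backoff, which keeps pressure off the
    # endpoint while this job waits.
    async with semaphore:
        for attempt in range(max_attempts):
            try:
                return await call()
            except RateLimited:
                if attempt == max_attempts - 1:
                    raise
                await asyncio.sleep(backoff_delay_s(attempt))


async def demo():
    """Launch 12 fake background jobs and record peak concurrency."""
    batch_slots = asyncio.Semaphore(BATCH_CONCURRENCY)
    in_flight = 0
    peak = 0

    async def fake_call():
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for a model request
        in_flight -= 1
        return "ok"

    results = await asyncio.gather(
        *(run_capped(batch_slots, fake_call) for _ in range(12))
    )
    return peak, results


peak, results = asyncio.run(demo())
print(peak, len(results))
```

Interactive traffic would get its own semaphore sized from `INTERACTIVE_CONCURRENCY`, so a background stampede cannot take slots from a user who is waiting.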

And if you’re calling models from n8n, Make, Zapier, or custom workers, this separation is even more important because automation platforms love bursts.

The counterintuitive part: routing is mainly about reliability

Yes, routing can cut cost.

Yes, it can stop you from wasting premium models on junk work.

But the bigger win is reliability.

Routing gives you:

fewer 429 cascades
better latency for user-facing turns
cleaner failover behavior
less provider lock-in
less operational drama during traffic spikes

That’s the adult version of running agent systems.

What I’d actually do this week

Not a six-month platform rewrite.

Just these three things.

1. Split interactive and async traffic

If a human is waiting, optimize for latency.

If nobody is waiting, queue it or batch it.
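The split can start as small as two queues and one rule about which drains first. This is a minimal sketch with hypothetical names (`submit`, `next_job`), not a production scheduler.

```python
import queue

interactive_q = queue.Queue()
background_q = queue.Queue()


def submit(job, human_waiting):
    """Put the job on the queue that matches who is waiting for it."""
    target = interactive_q if human_waiting else background_q
    target.put(job)


def next_job():
    """Interactive jobs drain first; background runs only when none wait."""
    for q in (interactive_q, background_q):
        try:
            return q.get_nowait()
        except queue.Empty:
            continue
    return None


submit("nightly-summary", human_waiting=False)
submit("chat-reply", human_waiting=True)

order = [next_job(), next_job(), next_job()]
print(order)
```

Strict priority like this can starve background work under constant chat load, so a real worker pool would likely reserve a few slots for the background queue.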

2. Define a premium-model trigger

Don’t send every turn to Claude Opus 4.6.

Create explicit rules.

Example:

use premium model for code generation
use premium model for multi-step planning
use premium model for failure recovery
use fast path for ordinary chat and tool glue
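Written as code, those rules might look like the sketch below. The `Turn` fields are assumptions about what your orchestrator already knows about a request. The path names reuse the "premium-path" and "fast-path" labels from the LiteLLM example.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """Hypothetical description of one agent turn."""
    wants_code: bool = False
    planned_steps: int = 1
    previous_tool_failed: bool = False


def needs_premium(turn):
    """True for code generation, multi-step planning, or failure recovery."""
    return (
        turn.wants_code
        or turn.planned_steps > 1
        or turn.previous_tool_failed
    )


def path_for(turn):
    return "premium-path" if needs_premium(turn) else "fast-path"


print(path_for(Turn()))                           # ordinary chat turn
print(path_for(Turn(wants_code=True)))            # code generation
print(path_for(Turn(previous_tool_failed=True)))  # recovery
```

The point is less the specific fields and more that the trigger is written down somewhere you can review and test.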

3. Add fallback rules now, not later

Encode the behavior:

if Anthropic is slow, fail over
if latency crosses threshold, switch path
if the job is non-urgent, queue it
if traffic spikes, smooth it
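The first two rules can be sketched as a single helper: try each path in order and move on when one errors or blows the latency budget. The path names and fake coroutines here are hypothetical stand-ins for real provider calls.

```python
import asyncio


async def call_with_fallback(paths, prompt, timeout_s):
    """Try each (name, async_fn) in order until one answers in time."""
    errors = []
    for name, fn in paths:
        try:
            answer = await asyncio.wait_for(fn(prompt), timeout=timeout_s)
            return name, answer
        except asyncio.TimeoutError:
            errors.append(f"{name}: slower than {timeout_s}s")
        except Exception as exc:
            errors.append(f"{name}: {exc!r}")
    raise RuntimeError("all paths failed: " + "; ".join(errors))


async def slow_primary(prompt):
    await asyncio.sleep(0.2)  # simulates a slow upstream
    return "primary answer"


async def quick_secondary(prompt):
    return "secondary answer"


winner = asyncio.run(
    call_with_fallback(
        [("primary", slow_primary), ("secondary", quick_secondary)],
        "ping",
        timeout_s=0.05,
    )
)
print(winner)
```

Returning the name of the path that answered makes it easy to log how often the fallback fires, which is usually the first sign that a threshold needs tuning.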

That’s routing in practice.

Not a whitepaper. Not a buzzword. Just fewer broken nights.

Where Standard Compute fits

If you like the idea of routing but hate stitching together five vendors, proxy layers, and billing dashboards, that’s basically the problem Standard Compute is trying to solve.

Standard Compute gives you an OpenAI-compatible API endpoint, but behind that endpoint it can route across models like GPT-5.4, Claude Opus 4.6, and Grok 4.20.

The part I think matters most for agent builders is not just the routing. It’s the pricing model.

Flat monthly pricing changes how you design systems.

You stop babysitting token spend.
You stop avoiding useful retries because they might get expensive.
You stop treating every long-running automation like a billing risk.

If you’re running agents in n8n, Make, Zapier, OpenClaw, or your own worker stack, predictable cost plus routing is a much better combo than “one premium model for everything, billed per token, with burst limits.”

That’s the architectural shift.

Final take

The biggest mistake I see right now is not choosing the wrong model.

It’s asking one model-provider path to be fast, cheap, reliable, burst-tolerant, and premium at the same time.

Nothing works that way.

Not Claude.
Not GPT-5.4.
Not Grok 4.20.
Not Qwen.
Not Llama.

Once you accept that, the design gets simpler.

Stop begging for more credits.

Route the job to the path that deserves it.
