I built toklock — the only Anthropic rate-limit proxy that queues requests instead of crashing your agents

#claude #agents #opensource #anthropic

## The Problem

I was building Visibrand — an AI SaaS company managed entirely by
11 autonomous Claude agents running in parallel on Railway.

When they all fired at once, every agent crashed with this:

```text
Error: 429 Too Many Requests
This request would exceed your organization's rate limit
of 30,000 input tokens per minute
```

I checked every tool I could find.

| Tool | What it does on 429 |
|---|---|
| Anthropic SDK | Retries 2x, then throws |
| Helicone | Bounded retry, still fails |
| LiteLLM OSS | Returns 429 immediately |
| LiteLLM Enterprise | Queues (but costs $$$) |
| Portkey | Load balances, no queuing |

None of the free options simply holds the request and waits.

## The Solution

I built toklock. It sits between your agents and api.anthropic.com.

When the token budget is exhausted, it reads Anthropic's own
rate-limit response headers:

```text
anthropic-ratelimit-tokens-remaining
anthropic-ratelimit-tokens-reset
```

It then waits until the reset time those headers report before
releasing the queued request. Callers never see a 429. They just wait.
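
To make the header logic concrete, here is a minimal sketch of how a proxy could turn those two headers into a wait time. It is not toklock's actual source: the function name `waitMsFromHeaders` and the `estimatedTokens` parameter are hypothetical, and it assumes the reset header carries an RFC 3339 timestamp, as Anthropic's rate-limit documentation describes.

```typescript
// Sketch only: decide how long to hold a request based on the rate-limit headers.
// waitMsFromHeaders and estimatedTokens are illustrative names, not toklock's API.
function waitMsFromHeaders(
  headers: Record<string, string | undefined>,
  estimatedTokens: number,
  nowMs: number,
): number {
  const remaining = Number(headers["anthropic-ratelimit-tokens-remaining"]);
  // Assumption: the reset header is an RFC 3339 timestamp that Date.parse can read.
  const resetAtMs = Date.parse(headers["anthropic-ratelimit-tokens-reset"] ?? "");

  // Missing or unparsable headers give no basis for waiting, so send immediately.
  if (Number.isNaN(remaining) || Number.isNaN(resetAtMs)) return 0;

  // Enough budget left for this request: no wait.
  if (remaining >= estimatedTokens) return 0;

  // Otherwise hold until the reset timestamp (never a negative delay).
  return Math.max(0, resetAtMs - nowMs);
}
```

Passing the current time in as `nowMs` keeps the function pure, which makes the waiting logic easy to test without real timers.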

```text
Agent A → toklock → Anthropic ✓
Agent B → toklock [queued 47s] → ✓
Agent C → toklock [queued 47s] → ✓
```

## Setup — 3 lines


```bash
# Terminal 1
npx toklock

# Terminal 2
export ANTHROPIC_BASE_URL=http://127.0.0.1:4000
claude  # or any Anthropic SDK call
```

No config file. No API key changes. Just set ANTHROPIC_BASE_URL.
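
If you call the API with your own HTTP client rather than an SDK, the same idea applies: read the base URL from the environment and fall back to Anthropic's endpoint. The sketch below shows one way to do that. The helper name `messagesUrl` is hypothetical; `/v1/messages` is Anthropic's Messages endpoint.

```typescript
// Sketch only: resolve the endpoint the way a hand-rolled client might.
// messagesUrl and DEFAULT_BASE_URL are illustrative names.
const DEFAULT_BASE_URL = "https://api.anthropic.com";

function messagesUrl(env: Record<string, string | undefined>): string {
  // Use the proxy when ANTHROPIC_BASE_URL is set, else talk to Anthropic directly.
  const base = env["ANTHROPIC_BASE_URL"] || DEFAULT_BASE_URL;
  // Strip trailing slashes so the path joins cleanly.
  return base.replace(/\/+$/, "") + "/v1/messages";
}
```

In a Node process you would call it as `messagesUrl(process.env)`, so switching between the proxy and the real API is a one-variable change.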

## How it works

1. All requests enter a serial queue
2. Token cost is estimated from the request body before sending
3. If remaining budget < estimated cost → queue pauses
4. Waits until anthropic-ratelimit-tokens-reset (exact time from headers)
5. Request is forwarded to api.anthropic.com
6. Real token counts from response headers update the budget
7. Next queued request is evaluated
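
The steps above can be sketched as a small promise-chained queue. This is an illustration under stated assumptions, not toklock's implementation: `SerialBudgetQueue`, `Usage`, `Job`, and `estimateTokens` are hypothetical names, and the "about 4 characters per token" estimate is a rough heuristic chosen for the example.

```typescript
// Sketch only: a serial, budget-aware queue. All names are illustrative.
type Usage = { remaining: number; resetAtMs: number };
type Job<T> = { body: string; send: () => Promise<{ result: T; usage: Usage }> };

// Assumption: roughly 4 characters per token. A real proxy may estimate differently.
function estimateTokens(body: string): number {
  return Math.ceil(body.length / 4);
}

class SerialBudgetQueue {
  private tail: Promise<unknown> = Promise.resolve();
  private usage: Usage;
  private now: () => number;
  private sleep: (ms: number) => Promise<void>;

  constructor(
    initial: Usage,
    now: () => number = Date.now,
    sleep: (ms: number) => Promise<void> = (ms) =>
      new Promise<void>((resolve) => setTimeout(resolve, ms)),
  ) {
    this.usage = initial;
    this.now = now;
    this.sleep = sleep;
  }

  // Each job is chained onto the previous one, so jobs run strictly one at a time.
  enqueue<T>(job: Job<T>): Promise<T> {
    const run = async (): Promise<T> => {
      if (this.usage.remaining < estimateTokens(job.body)) {
        // Budget too low: pause until the reset time seen on the last response.
        await this.sleep(Math.max(0, this.usage.resetAtMs - this.now()));
      }
      const { result, usage } = await job.send();
      this.usage = usage; // real numbers from the response replace the estimate
      return result;
    };
    const next = this.tail.then(run);
    // Swallow failures on the chain itself so one failed job does not block the rest.
    this.tail = next.catch(() => undefined);
    return next;
  }
}
```

The clock and sleep functions are injected so the pause logic can be exercised with fakes. A serial chain is the simplest design that guarantees the budget check and the send cannot interleave between two requests.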

On a 429, the request is re-queued; the proxy waits for the Retry-After interval, then retries.
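
A minimal sketch of that fallback path follows. It is not taken from toklock: `forwardWithRetry`, `retryAfterMs`, and the `maxAttempts` cap are hypothetical, and it assumes `Retry-After` holds a number of seconds, with a 1-second fallback when the header is missing.

```typescript
// Sketch only: retry a forwarded request when the upstream answers 429.
type ProxyResponse = { status: number; headers: Record<string, string | undefined> };

// Assumption: Retry-After is a number of seconds. Fall back to 1 second otherwise.
function retryAfterMs(headers: Record<string, string | undefined>): number {
  const seconds = Number(headers["retry-after"]);
  return Number.isFinite(seconds) && seconds >= 0 ? seconds * 1000 : 1000;
}

async function forwardWithRetry(
  send: () => Promise<ProxyResponse>,
  sleep: (ms: number) => Promise<void>,
  maxAttempts = 5, // hypothetical cap so a caller is not held forever
): Promise<ProxyResponse> {
  let response = await send();
  for (let attempt = 1; response.status === 429 && attempt < maxAttempts; attempt++) {
    // Wait as long as the upstream asked, then try the same request again.
    await sleep(retryAfterMs(response.headers));
    response = await send();
  }
  return response;
}
```

With a cap in place the caller can still receive a 429 after the last attempt. Whether to cap at all is a design choice: an uncapped loop matches "callers never see a 429" more literally, at the cost of holding a request indefinitely.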

## Why this doesn't exist yet

The standard industry solution is load balancing across multiple API
keys. That prevents 429s by spreading load but requires multiple
Anthropic accounts and costs more.

toklock takes the opposite approach: work within one budget,
queue intelligently, waste nothing.

## Docker

```bash
docker run -p 4000:4000 ghcr.io/tamilselvan89/toklock
```

## Links

- GitHub: https://github.com/tamilselvan89/toklock
- npm: https://npmjs.com/package/toklock

Open source. Apache 2.0.

Built while running 11 AI agents in parallel at Visibrand.

Top comments (3)

Harjot Singh • May 29

the queue-on-429 vs hard-fail behavior is the right call - retrying a half-finished agent state is way uglier than the wait. tangent: i hit a similar class of problem at the saas-gen layer (orchestrator burns rate-limit mid-build). solved it w/ a transient/deterministic/permanent failure classifier + per-phase + global retry caps. moonshift is built on that pipeline, $3 per shipped saas. happy to nerd out on the classifier code or drop u a free first run if u want to inspect the reliability layer from the outside.

Harjot Singh • Jun 1

rate-limit queueing instead of crashing is such an underrated piece of agent infra. it's the same instinct behind Moonshift's harness: assume the model layer will fail or throttle, and gate around it so an overnight run doesn't die mid-build. agents build + deploy + market a SaaS end to end. nice work on toklock. first run's free if you ever want to compare approaches.

Jeevan Singh • Jul 29

Does it support any OpenAI compatible API or just Anthropic? Would love to have that.