Mukunda Rao Katta

Posted on May 25

Self-Impose Rate Limits on Your Agent Before Your Provider Does

#hermeschallenge #ai #python #agents

The 2pm Problem

Your agent fleet hit the provider rate limit at 2pm on a Tuesday. Fifty workers. All hammering the same API endpoint. The provider throttled every single one of them. Jobs backed up for 45 minutes.

The provider didn't do anything wrong. You just did not plan for concurrent load.

Self-imposed rate limits solve this. You tell your agent "you are allowed X tokens per minute and Y tokens per session" before it starts. The agent enforces its own budget. The provider never has to step in.

This post shows how to layer two different constraints: a session cap using token-budget-py and a rolling time window using llm-budget-window.

Two Layers of Rate Limiting

The problem has two separate dimensions.

Session cap answers: "how much total work is this agent allowed to do?" It is a hard ceiling. If the agent is a background summarization job, maybe it should consume at most 50k tokens per run regardless of how long it runs.

Time window answers: "how fast is this agent allowed to go?" Even if the session cap allows 50k tokens total, you might not want all 50 workers burning through it in the first 30 seconds. A per-minute limit smooths the load curve.

You need both. A session cap alone does not prevent burst spikes. A time window alone does not prevent a long-running agent from quietly accumulating a huge bill over hours.

Main Code Example

import asyncio
from token_budget_py import TokenBudget, BudgetExceeded
from llm_budget_window import SlidingWindowBudget, WindowExceeded

# Session cap: 50k tokens max for this agent run
session_budget = TokenBudget(max_tokens=50_000)

# Time window: no more than 8k tokens per 60-second window
# Covers both input and output tokens combined
window_budget = SlidingWindowBudget(
    max_tokens=8_000,
    window_seconds=60,
)

async def call_llm(prompt: str, expected_output_tokens: int = 500) -> str:
    """Wrapper that checks both limits before calling the model."""
    input_tokens = len(prompt.split()) * 1.3  # rough estimate
    total_estimated = int(input_tokens) + expected_output_tokens

    # Check session cap first (fast, no async)
    try:
        session_budget.reserve(total_estimated)
    except BudgetExceeded as e:
        raise RuntimeError(
            f"Session budget exhausted. Used {e.used}, cap {e.cap}."
        ) from e

    # Check rolling window (may block until window clears)
    try:
        await window_budget.acquire(total_estimated, timeout=30.0)
    except WindowExceeded as e:
        # Release the session reservation we just took
        session_budget.release(total_estimated)
        raise RuntimeError(
            f"Rate window exceeded. Retry after {e.retry_after:.1f}s."
        ) from e

    # Both budgets cleared. Make the actual call.
    # your_llm_client is a placeholder for your provider's async client call.
    try:
        result = await your_llm_client(prompt)
        actual_tokens = result.usage.total_tokens

        # Reconcile: we reserved an estimate, commit the actual amount
        session_budget.commit(total_estimated, actual_tokens)
        window_budget.commit(total_estimated, actual_tokens)

        return result.content
    except BaseException:
        # On failure or task cancellation, release both reservations.
        # asyncio.CancelledError is not an Exception subclass on Python 3.8+.
        session_budget.release(total_estimated)
        window_budget.release(total_estimated)
        raise


async def run_agent(tasks: list[str]) -> list[str | None]:
    results = []
    for task in tasks:
        try:
            out = await call_llm(task)
            results.append(out)
        except RuntimeError as e:
            print(f"Skipped task: {e}")
            results.append(None)
    return results


async def main():
    tasks = [f"Summarize document #{i}" for i in range(100)]
    results = await run_agent(tasks)
    completed = sum(1 for r in results if r is not None)
    print(f"Completed {completed}/{len(tasks)} tasks within budget.")


if __name__ == "__main__":
    asyncio.run(main())

The key pattern is "reserve, execute, commit/release". You grab a slot in both budgets before the API call. If the call succeeds, you commit the actual token count in place of the estimate. If the call fails, you release the reservation. If you cannot get a slot, you skip or queue.

What This Does NOT Do

This pattern does not retry on your behalf. If WindowExceeded fires, you get an exception. What you do with it (wait, skip, queue) is your decision.

It does not estimate token counts accurately. The len(prompt.split()) * 1.3 estimate in the example is just that: rough. For tighter control, count tokens with prompt-token-counter or your model's own tokenizer before reserving.
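When no tokenizer is at hand, one option is to calibrate the rough estimate against the usage numbers your provider reports. The sketch below is an assumption, not part of any library mentioned here: CalibratedEstimator is a hypothetical helper that keeps a running ratio of actual to estimated input tokens and scales later estimates by it.

```python
import math


class CalibratedEstimator:
    """Hypothetical helper: scales a word-count heuristic by observed usage."""

    def __init__(self, tokens_per_word: float = 1.3) -> None:
        self.tokens_per_word = tokens_per_word
        self._estimated_total = 0  # sum of raw estimates seen so far
        self._actual_total = 0     # sum of input tokens the provider reported

    def raw_estimate(self, prompt: str) -> int:
        # Same rough heuristic as the main example.
        return int(len(prompt.split()) * self.tokens_per_word)

    def estimate(self, prompt: str) -> int:
        raw = self.raw_estimate(prompt)
        if self._estimated_total == 0:
            return raw  # nothing observed yet, fall back to the heuristic
        ratio = self._actual_total / self._estimated_total
        # Round up so the reservation errs on the safe side.
        return math.ceil(raw * ratio)

    def observe(self, prompt: str, actual_input_tokens: int) -> None:
        # Call this after each successful request.
        self._estimated_total += self.raw_estimate(prompt)
        self._actual_total += actual_input_tokens
```

Feed estimate() into the reserve step and call observe() once the response comes back. The ratio only reflects the prompts you have already seen, so treat it as a correction factor, not a tokenizer.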

It does not coordinate across processes. Both TokenBudget and SlidingWindowBudget are in-process. If you run 10 separate Python processes, each has its own counter. For cross-process coordination you need a shared store (Redis, a database, or a shared filesystem lock). These libraries give you the enforcement logic; you provide the shared state.
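Until you have shared state, a workable stopgap is to divide the provider limit across workers up front, so the per-process windows add up to less than the provider's cap. The helper below is a hypothetical sketch, and the 500k tokens-per-minute figure is an illustrative number, not any provider's real limit.

```python
def per_worker_window(provider_tpm: int, workers: int, headroom_pct: int = 80) -> int:
    """Tokens per minute each worker may use, leaving headroom under the provider cap."""
    if workers <= 0:
        raise ValueError("workers must be positive")
    if not 0 < headroom_pct <= 100:
        raise ValueError("headroom_pct must be between 1 and 100")
    # Integer math avoids float rounding surprises.
    usable = provider_tpm * headroom_pct // 100
    return usable // workers


# Hypothetical numbers: a 500k tokens-per-minute provider limit, 50 workers.
limit = per_worker_window(provider_tpm=500_000, workers=50)
print(limit)  # 8000
```

With 80% headroom, 50 workers each get 8,000 tokens per minute. A static split wastes capacity when some workers sit idle, which is the cost of skipping the shared store.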

It does not replace provider-side limits. Your provider still enforces its own limits. This layer exists to prevent you from reaching those limits, not to replace them.

Design Reasoning

Separate the two dimensions. Some engineers try to build one unified rate limiter that handles both session lifetime and per-minute throughput. That unified class ends up with six constructor params and a complicated internal state machine.

Two small objects with single responsibilities are easier to reason about. You can test them independently. You can swap one out without touching the other.

The reserve/commit/release cycle is intentional. If you just call acquire(tokens) and decrement on success, you have no way to handle the gap between your estimate and the actual token count. The three-step cycle lets you lock tokens optimistically and reconcile when reality differs.
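To make the three steps concrete, here is a minimal in-memory sketch of the bookkeeping. SketchBudget is a hypothetical stand-in written for this explanation. It is not the token-budget-py implementation, and its method names only mirror the calls used in the main example.

```python
class SketchBudget:
    """Hypothetical in-memory budget illustrating reserve/commit/release."""

    def __init__(self, max_tokens: int) -> None:
        self.max_tokens = max_tokens
        self.committed = 0  # tokens actually spent
        self.reserved = 0   # tokens held for in-flight calls

    @property
    def remaining(self) -> int:
        return self.max_tokens - self.committed - self.reserved

    def reserve(self, estimate: int) -> None:
        # Hold the estimate so concurrent callers cannot overspend.
        if estimate > self.remaining:
            raise RuntimeError("budget exceeded")
        self.reserved += estimate

    def commit(self, estimate: int, actual: int) -> None:
        # Swap the held estimate for what the call really cost.
        self.reserved -= estimate
        self.committed += actual

    def release(self, estimate: int) -> None:
        # The call never happened or failed: give the hold back.
        self.reserved -= estimate


budget = SketchBudget(max_tokens=1_000)
budget.reserve(600)      # hold 600 before the call
budget.commit(600, 450)  # the call used 450, so 150 returns to the pool
print(budget.remaining)  # 550
```

The held amount counts against the budget while the call is in flight, which is what a single acquire-and-decrement step cannot express.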

Async matters here because SlidingWindowBudget.acquire may block. If the window is full, waiting is often the right behavior for background agents. You set a timeout to avoid waiting forever.
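The blocking behavior can be sketched with a deque of timestamped entries and asyncio.sleep. SketchWindow below is a hypothetical illustration of the idea, not the llm-budget-window implementation. It assumes a single event loop and raises the built-in TimeoutError where the library raises WindowExceeded.

```python
import asyncio
import time
from collections import deque


class SketchWindow:
    """Hypothetical sliding window with a blocking, time-limited acquire."""

    def __init__(self, max_tokens: int, window_seconds: float) -> None:
        self.max_tokens = max_tokens
        self.window_seconds = window_seconds
        self._events = deque()  # (timestamp, tokens) pairs, oldest first

    def _used(self, now: float) -> int:
        # Drop entries that have aged out of the window, then sum the rest.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            self._events.popleft()
        return sum(tokens for _, tokens in self._events)

    async def acquire(self, tokens: int, timeout: float) -> None:
        if tokens > self.max_tokens:
            raise ValueError("request is larger than the whole window")
        deadline = time.monotonic() + timeout
        while True:
            now = time.monotonic()
            if self._used(now) + tokens <= self.max_tokens:
                self._events.append((now, tokens))
                return
            # The window is full, so at least one entry exists.
            wake_at = self._events[0][0] + self.window_seconds
            if wake_at > deadline:
                # Fail fast: no capacity can free up before the timeout.
                raise TimeoutError("window did not clear before timeout")
            # Yield to other tasks until the oldest entry expires.
            await asyncio.sleep(max(0.0, wake_at - now))
```

While one task sleeps inside acquire, the event loop keeps running other workers. A synchronous sleep in the same place would stall the whole process.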

When This Applies

This pattern fits background agents that run in parallel without human supervision. Summarization pipelines, batch classification jobs, nightly report generators. Anything where "too many concurrent calls" is a real risk.

It also fits multi-tenant systems where different users share the same API key. Each user's agent gets its own TokenBudget instance. All of them share one SlidingWindowBudget instance. The window limit protects the key; the session limit protects each user's allocation.
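One way to wire that up is a small registry that creates a session budget per user on first use and hands back the same window every time. TenantBudgets is a hypothetical helper, not part of either library. It is generic over the budget types, so it runs with any stand-in objects.

```python
from dataclasses import dataclass, field
from typing import Callable, Generic, TypeVar

S = TypeVar("S")  # session budget type
W = TypeVar("W")  # shared window type


@dataclass
class TenantBudgets(Generic[S, W]):
    """Hypothetical registry: one session budget per user, one shared window."""

    make_session: Callable[[], S]
    shared_window: W
    _sessions: dict[str, S] = field(default_factory=dict)

    def for_user(self, user_id: str) -> tuple[S, W]:
        # Create the user's session budget lazily, reuse it afterwards.
        if user_id not in self._sessions:
            self._sessions[user_id] = self.make_session()
        return self._sessions[user_id], self.shared_window


# With the constructors from the quick-start snippet, wiring might look like:
#   budgets = TenantBudgets(
#       make_session=lambda: TokenBudget(max_tokens=20_000),
#       shared_window=SlidingWindowBudget(max_tokens=3_000, window_seconds=60),
#   )
#   session, window = budgets.for_user("user-42")
```

Each call to for_user returns that user's own session budget alongside the single window object that protects the API key.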

This pattern does NOT fit real-time chat agents. If a user is waiting for a response, you cannot tell them "the window is full, try again in 40 seconds." For interactive agents, circuit breaking and graceful degradation are better tools than hard budget limits.

Quick-Start Snippet

pip install token-budget-py llm-budget-window

from token_budget_py import TokenBudget
from llm_budget_window import SlidingWindowBudget

# 20k token session cap
session = TokenBudget(max_tokens=20_000)

# 3k tokens per 60 seconds
window = SlidingWindowBudget(max_tokens=3_000, window_seconds=60)

# Use the reserve/commit pattern shown above

That is the minimal setup. Add actual token counting from a tokenizer for production use.

One useful refinement: track how often the window blocks your agent. If WindowExceeded fires more than 5% of the time, your per-minute limit is too tight for the actual workload. Loosen the window or reduce the number of concurrent workers. The budget should be a guard, not a constant bottleneck.
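Tracking that ratio takes only a counter. BlockRateTracker below is a hypothetical helper, and the 5% default simply mirrors the rule of thumb above.

```python
class BlockRateTracker:
    """Hypothetical counter for how often the window rejects a call."""

    def __init__(self, threshold: float = 0.05) -> None:
        self.threshold = threshold
        self.attempts = 0
        self.blocked = 0

    def record(self, blocked: bool) -> None:
        self.attempts += 1
        if blocked:
            self.blocked += 1

    @property
    def rate(self) -> float:
        # Fraction of attempts that hit the window limit.
        return self.blocked / self.attempts if self.attempts else 0.0

    def too_tight(self) -> bool:
        return self.rate > self.threshold
```

Call record(True) in the branch that handles WindowExceeded and record(False) after a successful acquire, then check too_tight() at the end of each run.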

For local testing, set the window to a very small value on purpose. Confirm that your reserve/commit/release cycle behaves correctly under repeated WindowExceeded conditions. Most bugs in rate-limiting code only show up when the limit is actually hit.

Siblings

Library	What it does	When to reach for it
`token-budget-py`	Session-level token cap with reserve/commit	Hard ceiling per agent run
`llm-budget-window`	Sliding window token rate limit	Smoothing burst load across time
`llm-cost-cap`	USD cost ceiling before the call	Stop expensive calls before they happen
`llm-circuit-breaker-py`	Open/half-open/closed state machine	Stop all calls when provider is failing
`agent-deadline`	Cooperative per-task wall-clock deadline	Kill long-running agents by time, not tokens
`prompt-token-counter`	Fast heuristic token count	Feed accurate estimates into budgets

What's Next

The llm-budget-window Rust library is backed by the Tokio async runtime. If you run a high-throughput Python service with a Rust core, the window enforcement can live in Rust and be called from Python via PyO3 bindings. That moves the window bookkeeping out of Python, which can reduce GIL contention on the hot path.

For cross-process budgets, the logical next step is a shared counter behind a Redis sorted set. The reserve/commit interface stays the same. Only the backing store changes. That is the next piece to build in this series.

For interactive agents that cannot block, replace the window's acquire with a non-blocking try_acquire that returns immediately. On rejection, return a reduced response ("I'll keep this answer short") rather than an error. That keeps the user experience intact while still respecting the window.

The full combination, session cap plus time window plus cost ceiling plus circuit breaker, covers the major failure modes for a production agent fleet. Each library handles one thing. Compose them at the call site.
