Karan Padhiyar

Posted on Jun 10

Why We Added Rate Limits Between AI Agents

#ai #llm #infrastructure #brainpackai

Most developers think about rate limits at API boundaries.

Protect the database.

Protect external services.

Protect model providers.

Protect public endpoints.

That is standard infrastructure design.

What surprised us was where we eventually needed rate limits the most.

Between AI agents.

Not between users and agents.

Between agents themselves.

Everything Looked Fine Initially

Our workflows started simply.

One agent handled a task.

If it needed additional information, it called another specialized agent.

That second agent might call a retrieval service.

Or a third agent.

Or an external integration.

The architecture looked clean.

Responsibilities were separated.

Each agent had a focused purpose.

The system worked well during testing.

Then we put it into production.

Agents Create More Work Than Humans

Humans are naturally slow.

Agents are not.

An agent can make decisions and trigger follow-up actions almost instantly.

That sounds great until multiple agents start interacting continuously.

A single user request could trigger:

document retrieval
classification
validation
summarization
workflow planning
action execution

Each step might involve additional agent interactions.

Under load, those interactions multiplied quickly.

The result was unexpected infrastructure pressure.

Not because user traffic increased.

Because agent-to-agent traffic increased.

Agent-to-Agent Amplification Is Real

One of the first things we noticed was amplification.

A single request entering the system could generate dozens of internal requests.

For example:

Agent A requests context.
Agent B requests additional context.
Agent C validates information.
Agent D performs verification.
Agent B retries because confidence is low.

Nothing is technically wrong.

Every action appears reasonable.

But collectively, the workflow expands dramatically.

One request becomes ten.

Ten become fifty.

Fifty become hundreds.

The infrastructure experiences pressure that is completely disconnected from user traffic.
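
To make the amplification concrete, here is a small back-of-the-envelope sketch. It assumes a simplified model in which every call triggers the same number of follow-up calls (`fan_out`) for a fixed number of levels (`depth`). Both parameters and the function name are hypothetical, and the numbers are illustrative, not measurements from our system.

```python
def total_internal_requests(fan_out: int, depth: int) -> int:
    """Count internal calls when each call triggers `fan_out` more calls,
    `depth` levels deep (simplified, uniform fan-out model)."""
    # Level 1 contributes fan_out calls, level 2 fan_out**2, and so on.
    return sum(fan_out ** level for level in range(1, depth + 1))


# One user request, with each agent calling three others (illustrative numbers).
for depth in (1, 2, 3):
    print(depth, total_internal_requests(fan_out=3, depth=depth))
# prints:
# 1 3
# 2 12
# 3 39
```

Real workflows are not uniform trees, and retries add more on top, but the shape is the same: internal traffic grows with workflow depth, not with the number of users.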

Feedback Loops Are Hard to Spot

The most dangerous issue was not high volume.

It was feedback loops.

Agents occasionally developed interaction patterns where they repeatedly requested the same information from each other.

Not infinitely.

But enough to create significant waste.

Examples included:

repeated validation cycles
duplicate retrieval requests
recursive planning behavior
confidence verification loops
unnecessary retries

Outputs still looked correct.

Users rarely noticed.

But infrastructure costs increased.

Latency increased.

Resource utilization increased.

Without detailed monitoring, these patterns were difficult to detect.
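
One way to surface these patterns is to fingerprint each agent-to-agent call within a workflow and count repeats. The sketch below assumes a trace is available as a list of `(caller, callee, payload)` tuples; that trace format and the helper names are hypothetical, not a description of any particular framework.

```python
import hashlib
from collections import Counter


def call_fingerprint(caller: str, callee: str, payload: str) -> tuple[str, str, str]:
    """Reduce a call to (caller, callee, short payload hash) so identical calls collide."""
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return (caller, callee, digest)


def find_repeated_calls(calls, threshold: int = 2) -> dict:
    """Return fingerprints seen at least `threshold` times in one workflow trace."""
    counts = Counter(call_fingerprint(*call) for call in calls)
    return {fp: n for fp, n in counts.items() if n >= threshold}


# Hypothetical trace of a single workflow.
trace = [
    ("planner", "retriever", "find contract terms"),
    ("validator", "retriever", "find contract terms"),
    ("validator", "retriever", "find contract terms"),
    ("validator", "retriever", "find contract terms"),
    ("planner", "summarizer", "summarize section 2"),
]

for (caller, callee, _digest), count in find_repeated_calls(trace).items():
    print(f"{caller} -> {callee} repeated {count} times")
# prints: validator -> retriever repeated 3 times
```

Exact-match hashing only catches verbatim duplicates. Loops where each request is slightly reworded would need a looser notion of similarity.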

More Intelligence Created More Infrastructure Load

A common assumption is that smarter agents reduce workload.

Sometimes the opposite happens.

Additional reasoning often creates additional actions.

More planning can create:

more retrieval calls
more validation requests
more coordination messages
more execution paths

The system becomes operationally heavier even when response quality improves.

That forced us to think about agents the same way we think about distributed systems.

Every interaction has a cost.

Rate Limits Created Boundaries

Eventually we introduced internal rate limits between agent workflows.

Not because agents were failing.

Because they were succeeding too enthusiastically.

We started controlling:

requests per workflow
agent interaction frequency
retry volume
validation cycles
retrieval expansion rates
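
For the frequency-style controls in this list, a token bucket per agent pair is one common approach. The following is a minimal sketch under that assumption, not our production implementation. The class, the `allow_call` helper, and the default rates are all hypothetical.

```python
import time


class TokenBucket:
    """Classic token bucket: refills at `rate_per_sec`, holds at most `capacity` tokens."""

    def __init__(self, rate_per_sec: float, capacity: float, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity      # start full so short bursts are allowed
        self.clock = clock          # injectable for testing
        self.last = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill based on elapsed time, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per (caller, callee) pair; names and rates are illustrative.
_buckets: dict[tuple[str, str], TokenBucket] = {}


def allow_call(caller: str, callee: str, rate_per_sec: float = 5.0, capacity: float = 10.0) -> bool:
    key = (caller, callee)
    if key not in _buckets:
        _buckets[key] = TokenBucket(rate_per_sec, capacity)
    return _buckets[key].allow()


if not allow_call("planner", "retriever"):
    print("planner -> retriever throttled; defer or skip this call")
```

Keying the bucket on the agent pair rather than on the callee alone means one chatty caller cannot starve the others.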

The goal was not restriction.

The goal was preventing runaway behavior.

Boundaries forced workflows to remain efficient.
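
The count-based controls (requests per workflow, retry volume, validation cycles) can be sketched as a per-workflow budget that every agent call draws from. This is an illustrative sketch only. `WorkflowBudget`, `BudgetExceeded`, and the limit names and values are assumptions made for the example.

```python
from dataclasses import dataclass, field


class BudgetExceeded(Exception):
    """Raised when a workflow tries to exceed one of its caps."""


@dataclass
class WorkflowBudget:
    limits: dict[str, int]                      # e.g. {"retries": 3}
    used: dict[str, int] = field(default_factory=dict)

    def spend(self, kind: str, amount: int = 1) -> None:
        """Record `amount` units of `kind`; raise if that would pass the cap."""
        new_total = self.used.get(kind, 0) + amount
        limit = self.limits.get(kind)           # kinds without a limit are only counted
        if limit is not None and new_total > limit:
            raise BudgetExceeded(f"{kind}: {new_total} would exceed limit {limit}")
        self.used[kind] = new_total


# Hypothetical caps for one workflow.
budget = WorkflowBudget(limits={"agent_calls": 20, "retries": 3, "validation_cycles": 2})

for attempt in range(5):
    try:
        budget.spend("retries")
    except BudgetExceeded as exc:
        print(f"stopping retries: {exc}")
        break
# prints: stopping retries: retries: 4 would exceed limit 3
```

Raising an exception keeps the decision with the caller: the workflow can return its best result so far, escalate, or fail, instead of silently continuing to spend.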

The Unexpected Benefit

The biggest benefit was not lower infrastructure costs.

It was better system behavior.

Once interaction limits existed, inefficient workflows became obvious.

Architectural problems that previously hid behind unlimited execution suddenly surfaced.

We discovered:

redundant agent responsibilities
unnecessary validation stages
duplicated retrieval patterns
excessive planning loops

Rate limits acted like a diagnostic tool.

They exposed inefficiencies that would otherwise remain invisible.
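
To use limits diagnostically, it helps to record which limit was hit and between which agents, then look at the most frequent offenders. A minimal sketch, assuming limit hits are reported to a small in-memory log (the `LimitHitLog` class and the agent and limit names are hypothetical):

```python
from collections import Counter


class LimitHitLog:
    """Tally limit hits by (caller, callee, limit kind)."""

    def __init__(self) -> None:
        self._hits: Counter = Counter()

    def record(self, caller: str, callee: str, kind: str) -> None:
        self._hits[(caller, callee, kind)] += 1

    def top(self, n: int = 5) -> list:
        """Return the n most frequently hit limits, highest count first."""
        return self._hits.most_common(n)


log = LimitHitLog()
for _ in range(3):
    log.record("planner", "validator", "validation_cycles")
log.record("planner", "retriever", "retries")

print(log.top(1))
# prints: [(('planner', 'validator', 'validation_cycles'), 3)]
```

A pair that keeps hitting the same limit is a candidate for redesign, not for a higher limit.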

AI Systems Need Resource Governance

Traditional distributed systems already understand this principle.

Every service operates within limits.

Every resource has constraints.

Every workflow has boundaries.

AI systems need the same discipline.

As agent architectures become more sophisticated, resource governance becomes increasingly important.

Without limits, complexity grows faster than expected.

And complexity eventually becomes operational risk.

The Bigger Lesson

The challenge with multi-agent systems is not getting agents to communicate.

Modern frameworks make that relatively easy.

The challenge is controlling how much they communicate.

Because once agents can create work for other agents, infrastructure load stops being directly tied to user demand.

It becomes tied to system behavior.

And system behavior can scale much faster than anyone expects.

That is why we added rate limits between AI agents.

Not to slow them down.

To keep them predictable.
