Paul Twist

Posted on Jun 18

Stop Measuring Agent Infrastructure by Gateway Latency Alone

#ai #agents #infrastructure #performance

I've been watching the LLM gateway benchmarks get faster. Bifrost at 11 microseconds, Helicone at 8 milliseconds, LiteLLM at 8 milliseconds. On single requests, the math is brutal: at those numbers, Bifrost adds roughly 730x less overhead than LiteLLM.

But this week I watched three teams benchmark gateways, pick based on latency, deploy to production, and then realize they'd solved the wrong problem.

The issue isn't the benchmarks. The issue is what they're benchmarking. And what they're not.

The Latency Benchmark Measures a Chat Interface

Here's what a typical gateway latency benchmark does:

Send a single LLM request (chat completion, embedding call, etc.)
Measure the overhead the gateway adds
Report the percentile latency and throughput

This makes sense if your application is a chat interface. One user sends one message. Low latency feels good. You route that call through a gateway and add 11 microseconds? Invisible. Phenomenal.

But agent systems don't work like chat interfaces.

Agent Calls Are Sequential, Not Parallel

A production agent making a decision typically isn't making one LLM call. It's making many:

Planning: "What tools do I need?" (1 call)
Execution: "Use tool X with these args" (1 call per tool)
Validation: "Does the output make sense?" (1 call)
Refinement: "Did that work?" (1 call)
Reporting: "Summarize what happened" (1 call)

That's 5–15 calls per decision. Some agents do 50+.

If each call goes through your gateway, the latency compounds:

Bifrost at 11µs per call, 10 calls = 110µs
Helicone at 8ms per call, 10 calls = 80ms
LiteLLM at 8ms per call, 10 calls = 80ms

The ratio between gateways doesn't change, and even compounded, the absolute gap is about 80ms per decision. But in production, and this is the part the benchmarks hide, agents don't run in isolation.

Multiple agents run concurrently. Tool calls can block. Fallbacks trigger retries. Cost attribution happens per request. Session state needs to persist across crashes.

The gateway overhead is one component of agent latency. It's not the only component, and it's usually not the largest.
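
The compounding is plain multiplication, so it is easy to extend to the other call counts mentioned above. A minimal sketch, using the per-call overheads quoted in this post and assuming every call is sequential and passes through the gateway exactly once (the function and variable names are illustrative):

```python
# Per-call gateway overhead in milliseconds, as quoted in this post.
OVERHEAD_MS = {"Bifrost": 0.011, "Helicone": 8.0, "LiteLLM": 8.0}


def compounded_overhead_ms(per_call_ms: float, calls: int) -> float:
    """Total gateway overhead when `calls` requests run one after another."""
    return per_call_ms * calls


if __name__ == "__main__":
    # 5, 15, and 50 are the per-decision call counts mentioned above.
    for calls in (5, 15, 50):
        for name, per_call_ms in OVERHEAD_MS.items():
            total = compounded_overhead_ms(per_call_ms, calls)
            print(f"{calls:>2} calls via {name}: {total:.3f} ms")
```

At 50 sequential calls that works out to 0.55ms of overhead for an 11µs gateway and 400ms for an 8ms one.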

What Production Teams Actually Evaluate

I talked to five teams this month running coding agents in production. Here's what they actually cared about, ranked by impact:

1. Session Persistence
"If our agent crashes mid-task, we lose everything. The benchmark didn't mention session state at all." They needed agents to survive pod restarts, maintain tool call history, and resume from where they left off.

2. Cost Attribution
"Our CFO asked what each agent decision costs. The gateway's latency benchmark didn't tell us that." They needed to tag requests by agent, workflow, team, and user—then roll up costs per agent and per decision. Latency benchmarks measure throughput, not cost per decision.

3. Model Routing
"We use Claude for complex tasks, GPT-4 for speed, and open-source for cheap calls. The fastest gateway doesn't route on task complexity." They needed conditional routing: "If this agent is handling a finance decision, use Claude. If it's a simple lookup, use a cheaper model." Bifrost at 11µs overhead doesn't matter if it can't route based on decision type.

4. Fallback & Retry Policy
"Our tool sometimes fails. We need to know how many retries happened and why, not just how many total requests went through." They needed to instrument retry loops and prevent cost spirals. A gateway that handles 10,000 RPS but logs every retry identically isn't helping.

5. Sandbox Isolation
"Each agent gets its own session. Tools run in isolated sandboxes. The gateway latency benchmark doesn't mention sandboxes at all." They needed agents to run in per-team, per-workflow sandboxes with resource limits and audit trails.

6. Observability & Debugging
"When an agent makes a bad decision, we need to replay it. We need to see every tool call, every model invocation, every decision point." They needed structured tracing, not just latency metrics.

The gateway latency benchmark measures exactly zero of these.
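
Cost attribution shows what measuring one of these would involve. A minimal sketch of the roll-up, assuming each request record already carries an agent tag, a decision ID, and a cost. The `RequestRecord` fields and the sample figures are hypothetical, not any particular gateway's log schema:

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RequestRecord:
    agent: str        # which agent issued the call (hypothetical tag)
    decision_id: str  # groups every call made for one agent decision
    cost_usd: float   # provider cost of this single request


def cost_per_agent(records):
    """Sum request costs by agent."""
    totals = defaultdict(float)
    for r in records:
        totals[r.agent] += r.cost_usd
    return dict(totals)


def cost_per_decision(records):
    """Sum request costs by (agent, decision) pair."""
    totals = defaultdict(float)
    for r in records:
        totals[(r.agent, r.decision_id)] += r.cost_usd
    return dict(totals)


# Made-up sample data for illustration only.
SAMPLE = [
    RequestRecord("support-bot", "ticket-1", 0.004),
    RequestRecord("support-bot", "ticket-1", 0.012),
    RequestRecord("support-bot", "ticket-2", 0.003),
    RequestRecord("billing-bot", "invoice-7", 0.020),
]

if __name__ == "__main__":
    print(cost_per_agent(SAMPLE))
    print(cost_per_decision(SAMPLE))
```

The roll-up itself is trivial. The hard part is getting every request tagged consistently at the point where it is issued.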

Where the Latency Argument Breaks Down

Let's do the math on a real workflow:

An agent processes a customer support ticket. It:

Calls the LLM to classify the issue (1 call)
Looks up the customer record (tool call, ~200ms on database)
Calls the LLM to draft a response (1 call)
Checks sentiment (embedded call or tool, ~50ms)
Routes to a human if confidence is low (1 call)

Total latency: ~400–600ms

Gateway overhead in this flow, counting the sentiment check as a fourth call through the gateway:

Bifrost: 11µs × 4 LLM calls = 44µs
LiteLLM: 8ms × 4 LLM calls = 32ms

So Bifrost saves about 32ms on a 500ms workflow. That's roughly 6%. Important? Sure. But not more important than cost governance, session persistence, and model routing.

And LiteLLM at 8ms overhead is already dwarfed by the actual workflow latency.
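
The same calculation as a short script, using only the figures from the workflow above (4 gateway calls, roughly 500ms end to end). The helper name is illustrative:

```python
def gateway_share(per_call_ms: float, calls: int, workflow_ms: float):
    """Return (total overhead in ms, overhead as a fraction of the workflow)."""
    total_ms = per_call_ms * calls
    return total_ms, total_ms / workflow_ms


# Figures from the workflow above: 4 gateway calls, ~500 ms end to end.
CALLS, WORKFLOW_MS = 4, 500.0
bifrost_ms, bifrost_share = gateway_share(0.011, CALLS, WORKFLOW_MS)
litellm_ms, litellm_share = gateway_share(8.0, CALLS, WORKFLOW_MS)

if __name__ == "__main__":
    print(f"Bifrost: {bifrost_ms:.3f} ms ({bifrost_share:.4%} of workflow)")
    print(f"LiteLLM: {litellm_ms:.1f} ms ({litellm_share:.1%} of workflow)")
    print(f"Difference: {litellm_ms - bifrost_ms:.2f} ms")
```

Swapping in your own call count and measured workflow time is a quick way to see whether gateway overhead is worth optimizing for your agents.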

The Real Question: Control Plane vs. Data Plane

The benchmark conversation assumes you need one gateway for everything. But production agent infrastructure needs two layers:

Data plane (where gateway latency matters):

Fast request routing to model providers
Fallback and retry logic
Cost tracking at request level
Minimal overhead

This is where Bifrost shines. Go's concurrency model is genuinely suited for high-throughput, low-latency routing. 11µs overhead is real and measurable.
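
Of the data plane duties listed above, retry handling is the one most likely to produce a cost spiral if it is not instrumented. A minimal sketch of a retry loop that records why each retry happened and stops at a cost cap. This is a generic illustration, not the API of Bifrost or any other gateway, and the flat per-attempt cost is an assumption made to keep the example small:

```python
from collections import Counter


class RetryBudgetExceeded(Exception):
    """Raised when another attempt would exceed the cost cap."""


def call_with_retries(fn, max_attempts=3, cost_per_attempt=10_000, budget=50_000):
    """Call fn until it succeeds, recording why each retry happened.

    Costs are integers in micro-dollars so the cap comparison is exact.
    Returns (result, stats).
    """
    reasons = Counter()
    spent = 0
    for attempt in range(1, max_attempts + 1):
        if spent + cost_per_attempt > budget:
            raise RetryBudgetExceeded(f"spent {spent}, reasons {dict(reasons)}")
        spent += cost_per_attempt
        try:
            result = fn()
        except Exception as exc:  # a real system would catch narrower types
            reasons[type(exc).__name__] += 1
            continue
        stats = {"attempts": attempt, "retry_reasons": dict(reasons), "spent": spent}
        return result, stats
    raise RuntimeError(f"gave up after {max_attempts} attempts: {dict(reasons)}")
```

The returned `stats` dictionary is what distinguishes a first-try success from a success that took three attempts and two timeouts, which is the distinction that identical request logs lose.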

Control plane (where gateway latency doesn't matter):

Session persistence across restarts
Per-agent and per-team sandboxes
Workflow scheduling and memory management
Cost attribution to agents and workflows
Access control and audit trails
Tool binding and validation

This is where data plane latency is irrelevant. A control plane call that takes 200ms is acceptable if it's handling session state, sandbox provisioning, or workflow routing. You're not making 5,000 of them per second. You're making a few per agent lifecycle.
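
Session persistence is the clearest example. A minimal sketch of checkpoint-and-resume, assuming a local JSON file stands in for whatever durable store a real control plane would use. The function names and the state layout are hypothetical:

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(path: Path, state: dict) -> None:
    """Write state to a temp file, then rename it over the checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, path)  # readers see the old or the new file, never a partial one


def load_checkpoint(path: Path) -> dict:
    """Return the saved state, or a fresh one if no checkpoint exists."""
    if path.exists():
        return json.loads(path.read_text())
    return {"next_step": 0, "tool_calls": []}


def run_agent(path: Path, steps: list, crash_at=None) -> dict:
    """Run steps in order, checkpointing after each; resume from the last checkpoint."""
    state = load_checkpoint(path)
    for i in range(state["next_step"], len(steps)):
        if crash_at == i:
            raise RuntimeError("simulated pod restart")
        state["tool_calls"].append(steps[i])  # stand-in for executing the step
        state["next_step"] = i + 1
        save_checkpoint(path, state)
    return state
```

Each checkpoint costs a disk write and an fsync, which is far slower than 11µs, and that is fine because it happens once per step rather than once per routed request.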

This is also where LiteLLM Agent Platform operates. It's not trying to be a low-latency gateway. It's trying to be a reliable control plane that actually makes agents runnable in production.

How to Evaluate Agent Infrastructure

Here's a framework teams should actually use:

1. Can it run agents across pod restarts without losing state? (Session persistence)
2. Can it isolate agents per team and per workflow? (Multi-tenancy)
3. Can you measure cost per agent and per decision? (Cost attribution)
4. Can you route based on task type or agent type, not just latency? (Intelligent routing)
5. Can you see why an agent made a decision? (Observability)
6. Can you set per-agent cost budgets and enforce them? (Cost governance)
7. Can you schedule recurring agent work? (Orchestration)
8. Can you connect tools, MCPs, skills, and custom rules? (Extensibility)
9. How fast is the gateway? (Data plane performance)

Most teams ask #9 first. They should ask it last.
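
Each of these questions can be turned into a small concrete check. Taking the routing question as an example, here is a minimal sketch of rule-based routing on task type. The `Task` fields, the rule order, and the model labels are hypothetical placeholders, not real model identifiers or any gateway's configuration format:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    domain: str      # e.g. "finance" or "support" (hypothetical labels)
    complexity: str  # e.g. "simple" or "complex"


# Ordered rules: the first predicate that matches decides the model.
ROUTES = [
    (lambda t: t.domain == "finance", "claude"),
    (lambda t: t.complexity == "complex", "claude"),
    (lambda t: t.complexity == "simple", "open-source-small"),
]
DEFAULT_MODEL = "gpt-4"


def pick_model(task: Task) -> str:
    """Return the model label for the first matching rule, else the default."""
    for matches, model in ROUTES:
        if matches(task):
            return model
    return DEFAULT_MODEL
```

A gateway passes this check if you can express rules like these in its configuration and see in its logs which rule fired for a given request.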

The Architecture That Actually Works

A production agent system needs both:

Fast data plane: Bifrost, LiteLLM-Rust, or Helicone for routing, retries, and cost tracking at request level
Reliable control plane: LiteLLM Agent Platform or similar for sessions, isolation, scheduling, and governance

The latency benchmark tells you about the data plane. It tells you nothing about the control plane.

Teams that pick based solely on data plane latency end up with a fast gateway that can't handle agent sessions, costs, or multi-tenancy. They solve the wrong problem and build the wrong system.

What This Means for Your Architecture

If you're evaluating agent infrastructure:

Separate the layers. Don't try to measure everything in one benchmark. Measure data plane latency separately from control plane reliability and feature depth.
Measure the right things. Ask vendors: How do you handle session persistence? Cost attribution? Multi-tenancy? Sandbox isolation? Observability? These matter more than the 11µs vs 8ms difference.
Test the full workflow. Don't benchmark a single LLM call. Benchmark a complete agent decision with tool calls, retries, and cost tracking. That's closer to production reality.
Separate costs. Data plane should be fast and cheap (commodity hardware). Control plane should be reliable and governable (probably more expensive per request, but fewer requests).
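
For the full-workflow test, a harness can be very small. A minimal sketch that times each step of one agent decision, assuming each step is a plain callable. The sleeps are stand-ins: the 200ms lookup and 50ms sentiment check come from the support-ticket example earlier, while the LLM call durations are made-up placeholders:

```python
import time


def benchmark_decision(steps, clock=time.perf_counter):
    """Time each named step of one agent decision.

    `steps` is a list of (name, callable) pairs. Returns a dict of
    per-step durations plus a "total" entry, in the clock's units.
    """
    timings = {}
    start = clock()
    for name, fn in steps:
        t0 = clock()
        fn()
        timings[name] = clock() - t0
    timings["total"] = clock() - start
    return timings


if __name__ == "__main__":
    steps = [
        ("classify (LLM)", lambda: time.sleep(0.10)),  # placeholder duration
        ("customer lookup", lambda: time.sleep(0.20)),
        ("draft (LLM)", lambda: time.sleep(0.15)),     # placeholder duration
        ("sentiment check", lambda: time.sleep(0.05)),
    ]
    for name, seconds in benchmark_decision(steps).items():
        print(f"{name}: {seconds * 1000:.0f} ms")
```

Replacing the sleeps with real calls routed through each candidate gateway gives a per-step breakdown, which shows directly how much of a decision's latency the gateway accounts for.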

The teams building agents at scale are not chasing 11-microsecond gateway overhead. They're building systems where sessions survive crashes, costs are predictable, and agents can actually be governed in production.

That's a different set of problems. And latency benchmarks don't measure them.

Paul Twist is an AI engineer based in Berlin. He works on production infrastructure for agents and writes about the gap between what works in demos and what works at scale.