
Kirti Rathore

Originally published at fixbugs.ai

High-performance AI agents are distributed systems

"Codex took 6 hours to implement this seemingly simple refactor".

"I think Research mode on Perplexity is stuck."

We all know LLM APIs are slow, and we have grown content to stare at a spinner while the model slowly emits tokens.

But what happens when you're building AI agents that need to be low latency?

We hit this while building FixBugs, an AI debugging agent that reads bug reports, logs, code, screenshots, traces, and issue comments, then reproduces the bug and finally generates a validated fix. The product has a simple promise: every code change is verified to do only the necessary work to fix the issue.

The implementation is not simple.

Bug reports and their associated logs/metrics/traces can contain too much context for one model call. A repository can have hundreds of files. Logs can be larger than the model's useful context window. The final answer may need thousands of output tokens. And if the agent takes ten minutes to say anything useful, the user assumes it is broken.

Summarization, also referred to as compaction, is the usual way to work with huge context. However, summarization is slow and often loses essential context.

Modern coding agents like Claude Code and Cursor rely heavily on blindly grepping through log files and reading from specific offsets. The effective context window the coding agent is allowed to process at once is smaller than the total context window. GPT 5.5, for example, has a context window of 400K tokens, but its 'input context' is closer to 258K tokens.

Once you step beyond conversational agent loops, other interesting patterns become usable.

You realize the underlying performance engineering problems are similar to those encountered when optimizing large distributed systems.

Scatter-gather. Pipelining. Queues. Backpressure. Streaming. Serializability. These are the problems we spent the most time thinking about.

start with token math

Most agent performance discussions start in the wrong place.

They ask:

Which model is fastest in terms of tokens/sec?

That is a useful question later. The first question is:

How many input tokens and output tokens does this task need?

LLM latency has two different pieces that matter to the user.

Time to first token is how long the user waits before the model starts responding. Token throughput is how quickly output tokens arrive after that, which determines how long the full answer takes.

They are not the same problem.

Diagram showing LLM prefill phase, decode phase, time to first token, and time per output token.

Prefill affects time to first token. Decode affects the stream of output tokens after that.

In the prefill phase, the model processes the input context and prepares the key/value cache used to generate the first new token. In the decode phase, the model generates output tokens one at a time autoregressively.

For a practical agent, a crude mental model is enough:

latency model
T ≈ TTFT + output tokens × time/token

Input tokens are not free. They hit prefill and therefore time to first token.

prefill cost example
20,000 input tokens × 0.05ms/token = 1,000ms ≈ 1s TTFT
The constant is model- and provider-specific; the shape of the cost is the useful part.

But long answers are expensive in a different way. Every output token has to be generated. If your agent asks the model to explain every file in a repository, your user is paying for that decision in wall-clock time.

This matters because debugging agents are usually output-heavy. They do not just answer "yes" or "no." They produce hypotheses, evidence, file rankings, reproduction plans, code diffs, and validation notes.

Output tokens dominate faster than people expect.
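The crude model above is easy to turn into a back-of-the-envelope estimator. This is a minimal sketch: the function name `estimate_latency_s` is hypothetical, the 0.05 ms/token prefill constant comes from the example above, and the 50 tokens/s decode speed is an assumed figure you should replace with your own measurements.

```python
def estimate_latency_s(
    input_tokens: int,
    output_tokens: int,
    prefill_ms_per_token: float = 0.05,  # assumed; model- and provider-specific
    decode_tokens_per_s: float = 50.0,   # assumed; measure this for your endpoint
) -> tuple[float, float]:
    """Return (ttft_seconds, total_seconds) under T ≈ TTFT + output tokens × time/token."""
    ttft = input_tokens * prefill_ms_per_token / 1000.0
    total = ttft + output_tokens / decode_tokens_per_s
    return ttft, total


ttft, total = estimate_latency_s(input_tokens=20_000, output_tokens=30_000)
print(f"TTFT ~ {ttft:.1f}s, total ~ {total:.0f}s")
```

With these assumed constants, a 20,000-token prompt costs about a second before the first token, while a 30,000-token answer costs about ten minutes. The output side is where the time goes.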

the 10-minute file search

The biggest bottleneck in early FixBugs was not repository parsing.

It was asking the LLM which files were relevant to a bug.

The naive version looked reasonable:

  1. Gather the bug context.
  2. Gather the repository files.
  3. Put all relevant context into one prompt.
  4. Ask the model to rank files and explain why.

For a small repo, this works.

For 50 files, it turns into a bad batch job disguised as a chat request.

If the model emits 30,000 output tokens and the endpoint gives you 50 output tokens per second, you are waiting about 600 seconds. Ten minutes. That is before retries, rate limits, or any downstream fix generation.

To get faster performance, we realized we had to use as much parallelism as possible.

file relevance stage
one giant call: 30,000 tokens ÷ 50 tokens/s = 600s (about 10 minutes, before retries or downstream work)
16 independent calls: 16 × 50 tokens/s ≈ 800 tokens/s (roughly 40 seconds for the demo workload)

Each file was decomposed into chunks. Each chunk got its own relevance call. Those calls ran concurrently.

If one call gives you 50 output tokens per second, 16 independent calls on a 16-vCPU VM can expose roughly 16 times the useful throughput to the workflow, provided the provider's rate limits leave room for it.

The demo version dropped the file-relevance stage from roughly 10 minutes to roughly 40 seconds.

That number is not a universal benchmark. It depends on the provider, model, prompt, chunk sizes, rate limits, and output format.

The important part is the pattern.

This was a scatter-gather workload.

Scatter the independent file checks. Gather the evidence. Merge the local judgments into one ranked view.
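The scatter, gather, and merge steps can be sketched with `asyncio`. This is an illustration under assumptions, not the FixBugs implementation: `score_chunk` is a hypothetical stand-in that fakes a relevance call with a short sleep and a word-overlap score, and `rank_chunks` and `max_concurrency` are names invented for this example.

```python
import asyncio


async def score_chunk(bug_report: str, chunk: str) -> dict:
    """Hypothetical stand-in for one LLM relevance call."""
    await asyncio.sleep(0.01)  # simulates network and decode time
    words = set(bug_report.lower().split())
    hits = sum(1 for w in chunk.lower().split() if w in words)
    return {"chunk": chunk, "score": hits}


async def rank_chunks(
    bug_report: str, chunks: list[str], max_concurrency: int = 16
) -> list[dict]:
    sem = asyncio.Semaphore(max_concurrency)  # cap the number of in-flight calls

    async def bounded(chunk: str) -> dict:
        async with sem:
            return await score_chunk(bug_report, chunk)

    # Scatter: one task per chunk. Gather: results come back in input order.
    results = await asyncio.gather(*(bounded(c) for c in chunks))
    # Merge: combine the local judgments into one ranked view.
    return sorted(results, key=lambda r: r["score"], reverse=True)
```

Running it is a single call, for example `asyncio.run(rank_chunks(report, chunks))`. The semaphore matters as much as the `gather`: it is the knob that keeps the fan-out inside whatever the provider will tolerate.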

We already know how to do this.

"Do the same analysis over many independent records, then combine the results."

This is what Hadoop's MapReduce does, and why it is so useful for data analysis.

LLM agents have the same class of problems.

chunking is not free

Chunking is easy to abuse.

I'll give one example on the map side and one on the reduce side to illustrate.

Mapping a chunk

If you're given:

  • A: A bug report.
  • B: A repository file tree.

Now you've got to figure out which code files are relevant to the bug report.

How many files do you add to a chunk?

It would not be a good idea to use byte-level granularity and stuff as much context as a model call can handle.

Instead, you want file-level granularity: add complete files wherever you can.
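File-level packing might look like the sketch below. Everything here is an assumption made for illustration: `pack_files` and `estimate_tokens` are hypothetical names, and the "about 4 characters per token" heuristic should be replaced by a real tokenizer count for your model.

```python
def estimate_tokens(text: str) -> int:
    """Rough heuristic (assumption): about 4 characters per token."""
    return max(1, len(text) // 4)


def pack_files(files: dict[str, str], budget_tokens: int) -> list[list[str]]:
    """Group whole files into chunks that stay under a token budget.

    Files are never split. A file larger than the budget gets a chunk of
    its own, and handling that oversized chunk is left to the caller.
    """
    chunks: list[list[str]] = []
    current: list[str] = []
    used = 0
    for path, text in files.items():
        cost = estimate_tokens(text)
        if current and used + cost > budget_tokens:
            chunks.append(current)  # close the chunk before it overflows
            current, used = [], 0
        current.append(path)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

The design choice is that the budget bends before a file breaks: the model always sees a complete file, even if that means one chunk runs over.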

Reducing chunks

Suppose you're given:

  • A: A bug report.
  • B: Log files and Traces.

Now you've got to extract log snippets relevant to the bug from the input.

Mapping a log file is simple. You can chunk it greedily.

Reducing is a bit more complex.

The parallel LLM calls give you a pile of log snippets. Until you put them in time order, group them by span, and separate them by service name, those snippets are not much use.
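A sketch of that reduce step, assuming each map call returns snippets that carry a timestamp, a service name, and a span id. The `Snippet` type and `reduce_snippets` function are hypothetical, and the code assumes timestamps are ISO 8601 strings in a single timezone so that string order matches time order.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Snippet:
    timestamp: str  # assumed ISO 8601, one timezone, so it sorts as text
    service: str
    span_id: str
    text: str


def reduce_snippets(snippets: list[Snippet]) -> dict[str, dict[str, list[Snippet]]]:
    """Drop duplicates, then group by service and span in time order."""
    grouped: dict[str, dict[str, list[Snippet]]] = defaultdict(lambda: defaultdict(list))
    # set() removes exact duplicates that overlapping chunks tend to produce.
    for s in sorted(set(snippets), key=lambda s: s.timestamp):
        grouped[s.service][s.span_id].append(s)
    return {service: dict(spans) for service, spans in grouped.items()}
```

None of this needs a model. The ordering and grouping are deterministic bookkeeping, which leaves the final LLM call free to spend its tokens on judgment.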

In my experience, the "Reduce" phase is often messier than people would like it to be.

The rule I use now:

Chunk for evidence.
Merge for judgment.

The local calls should find facts, signals, and candidate explanations. The final call should resolve conflicts, rank evidence, and decide what to do next.

streaming changes the wait

There is another latency problem that chunking does not solve.

Sometimes the user needs to see that the agent is alive.

For interactive debugging, time to first token matters more than total completion time. The engineer does not always need the whole final report immediately. They need the first useful hypothesis, the first file name, the first sign that the investigation is moving.

Streaming helps.

streaming tradeoff
demo workload
TTFT without streaming: 13s
TTFT with streaming: 2.4s
throughput without streaming: 486 tok/s
throughput with streaming: 244 tok/s

In the demo, streaming reduced time to first token from 13 seconds to 2.4 seconds.

That is a huge UX difference.

But throughput got worse: 486 tokens/sec without streaming versus 244 tokens/sec with streaming.

This is the kind of tradeoff that disappears if you only measure "request completed in N seconds." Streaming is not a throughput optimization. It is a user-experience optimization.
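To see both sides of the tradeoff, measure time to first token and throughput separately rather than one completion time. The helper below is a minimal sketch: `measure_stream` is a hypothetical name, it works on any iterable of tokens or chunks, and the injectable `clock` parameter exists only to make it testable.

```python
import time
from typing import Callable, Iterable


def measure_stream(
    tokens: Iterable[str], clock: Callable[[], float] = time.perf_counter
) -> dict:
    """Consume a token stream and report TTFT, total time, and tokens/sec."""
    start = clock()
    first = None
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now  # arrival time of the first token
        count += 1
    total = clock() - start
    return {
        "ttft_s": None if first is None else first - start,
        "total_s": total,
        "tokens_per_s": count / total if total > 0 else 0.0,
        "tokens": count,
    }
```

For a non-streaming call, the same helper can be fed a one-element list holding the whole response, in which case TTFT and total time collapse into the same number. That collapse is exactly what a single "request completed in N seconds" metric hides.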

For chat-like workflows, it is usually worth it.

For batch stages inside an agent pipeline, it may not be.

FixBugs uses both modes. User-facing stages stream progress. Internal worker stages optimize for total job completion, retry behavior, and queue throughput.

That distinction keeps the system honest.

concurrency has a ceiling

The first time you parallelize LLM calls and see a 5x or 10x improvement, it is tempting to conclude the next fix is "more workers."

That works until it does not.

Graph showing LLM throughput gains flattening as concurrency increases.

Throughput improves with concurrency, then starts flattening. More workers eventually stop buying much.

At low concurrency, the system has idle capacity. Adding workers improves utilization. Throughput rises quickly.

At higher concurrency, the slope changes. You are now fighting shared bottlenecks: provider rate limits, GPU scheduling, KV cache memory, network overhead, queueing, retries, and your own post-processing.

The unpleasant part is that throughput may plateau while user latency gets worse.

Graph showing time to first token getting worse as concurrency rises.

Concurrency can keep aggregate throughput healthy while making individual users wait much longer.

That is why "tokens per second" is not enough.

You need at least four metrics:

  • time to first token
  • output tokens per second
  • total wall-clock time
  • failure/retry rate under load

And you need to record them by stage.

For an AI debugging agent, "the whole thing took 90 seconds" is not a useful measurement. Which stage took 90 seconds? File relevance? Log compression? Root cause analysis? Reproduction? Fix generation? Validation?

If you do not know, you cannot optimize it.
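Recording metrics by stage does not need much machinery. The sketch below uses a context manager; the `StageMetrics` class and its field names are assumptions made for this example, and a real system would more likely send these records to its existing metrics backend.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageMetrics:
    """Collects one record per stage run: wall time, tokens, retries, failure."""

    def __init__(self):
        self.records = defaultdict(list)

    @contextmanager
    def stage(self, name: str):
        rec = {"wall_s": 0.0, "ttft_s": None, "output_tokens": 0,
               "retries": 0, "failed": False}
        start = time.perf_counter()
        try:
            yield rec  # the stage fills in ttft_s, output_tokens, retries
        except Exception:
            rec["failed"] = True
            raise
        finally:
            rec["wall_s"] = time.perf_counter() - start
            self.records[name].append(rec)

    def slowest(self) -> str:
        """Name of the stage with the largest total wall-clock time."""
        return max(self.records, key=lambda n: sum(r["wall_s"] for r in self.records[n]))
```

Usage is `with metrics.stage("file_relevance") as rec: ...`, after which `metrics.slowest()` answers the "which stage took 90 seconds?" question directly.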

queues beat request chains

Once the workflow has more than one stage, a single request chain becomes fragile.

Analyze the bug. Then reproduce it. Then identify root cause. Then generate a fix. Then validate the fix.

If this runs as one synchronous chain, every stage inherits every other stage's latency and failure mode. A slow reproduction attempt blocks root cause work. A provider retry blocks the entire request. One expensive bug can starve smaller bugs behind it.

The better shape is a pipeline.

Pipeline diagram showing analyze, reproduce bug, root cause, and fix stages as independent microservices connected by message queues.

Independent stages let incoming bugs move through the system without one long blocking request chain.

In FixBugs, the natural stages are:

  • analysis: parse and compress the bug report and artifacts
  • reproduction: reproduce the bug and write a failing test
  • root cause: identify the most likely cause
  • fix: generate a patch
  • validation: prove the patch fixes the reproduced failure

Those stages do not all have the same workload.

Some are network-heavy. Some are model-heavy. Some are repo-heavy. Some need a sandbox. Some can run with cheaper models. Some need stronger models. Some should retry aggressively. Some should fail fast and ask for human input.

That is why the pipeline should not pretend they are one operation.

Use message queues when stages should operate independently. Use idempotent workers. Put explicit retry limits around LLM calls. Track partial state. Make each stage observable. Do not make the user wait on work that can finish after the first useful answer.
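An in-process sketch of that shape, using `asyncio.Queue` as a stand-in for a real message broker. The names `run_stage` and `run_pipeline`, the `(stage, job id)` idempotency key, and the dead-letter list are all assumptions for illustration; a production system would persist this state rather than hold it in memory.

```python
import asyncio


async def run_stage(name, handler, inbox, outbox, done, dead_letters, max_retries=2):
    """Pull jobs from inbox, run handler with bounded retries, push results to outbox."""
    while True:
        job = await inbox.get()
        try:
            key = (name, job["id"])
            if key in done:  # idempotent: a redelivered job is skipped
                continue
            for attempt in range(max_retries + 1):
                try:
                    result = await handler(job)
                except Exception as exc:
                    if attempt == max_retries:  # explicit retry limit reached
                        dead_letters.append({**job, "failed_at": name, "error": str(exc)})
                else:
                    done.add(key)
                    await outbox.put(result)
                    break
        finally:
            inbox.task_done()


async def run_pipeline(jobs, stages):
    """stages is an ordered list of (name, async handler) pairs."""
    queues = [asyncio.Queue() for _ in range(len(stages) + 1)]
    done, dead_letters = set(), []
    tasks = [
        asyncio.create_task(
            run_stage(name, handler, queues[i], queues[i + 1], done, dead_letters)
        )
        for i, (name, handler) in enumerate(stages)
    ]
    for job in jobs:
        await queues[0].put(job)
    for q in queues[:-1]:  # wait for each stage's inbox to drain, in order
        await q.join()
    for t in tasks:
        t.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    finished = []
    while not queues[-1].empty():
        finished.append(queues[-1].get_nowait())
    return finished, dead_letters
```

A job that exhausts its retries lands in the dead-letter list with the stage that failed, so one broken bug report does not block the jobs behind it.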

This is not fancy.
It is normal backend engineering.

memory is a compression layer

Long-context models are useful.

They are also an attractive nuisance, because they hide two different problems.

The first is a performance problem. Larger prompts mean more prefill work, more tokens to move through the system, higher latency, and lower throughput.

The second is a reasoning problem. Irrelevant context is not neutral. Old hypotheses, stale summaries, repeated log snippets, and unrelated file notes compete with the evidence that matters for the current step.

A memory layer helps only if it handles both problems: send fewer tokens and preserve the facts the next stage needs.

For one demo, I tested a Mem0-style memory layer to isolate the performance side: extract facts from prior context, store them, and retrieve only the facts relevant to the current step.

context management demo
60.51 tok/s → 216.06 tok/s
full context versus retrieved memory facts, for the demo case

In that demo, token throughput improved from 60.51 tokens/sec to 216.06 tokens/sec compared with sending the full context.

Again, do not overfit to the number. The useful principle is simpler: every token you do not send is latency you do not pay for.
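The extract, store, and retrieve idea can be shown with a toy store. This is not Mem0's API: `FactStore` is a hypothetical class, and plain term overlap stands in for whatever embedding or ranking a real memory layer would use.

```python
import re


def _terms(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


class FactStore:
    """Toy evidence ledger: short facts with their source, retrieved by term overlap."""

    def __init__(self):
        self.facts: list[tuple[str, str]] = []  # (fact, source)

    def add(self, fact: str, source: str) -> None:
        if (fact, source) not in self.facts:  # skip repeated snippets
            self.facts.append((fact, source))

    def retrieve(self, query: str, k: int = 3) -> list[tuple[str, str]]:
        q = _terms(query)
        scored = [(len(q & _terms(fact)), fact, source) for fact, source in self.facts]
        scored = [s for s in scored if s[0] > 0]  # drop facts with no overlap
        scored.sort(key=lambda s: s[0], reverse=True)  # stable: ties keep insertion order
        return [(fact, source) for _, fact, source in scored[:k]]
```

Only the top `k` facts go into the next prompt, each with its source attached, so the prompt shrinks without losing track of where a fact came from.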

But the benchmark only measures the performance side.

Memory is not just for personalization. In agent systems, memory is a compression layer and an evidence ledger. It decides which facts survive across steps.

model choice is infrastructure

Token throughput for various models on OpenRouter.


Choosing a model for an agent is not only a quality decision.

It is an infrastructure decision.

Two models with similar benchmark accuracy can behave very differently under your workload. One may stream quickly but produce verbose answers. One may have great throughput but poor tool-use reliability.

FixBugs chooses its model stage by stage.

Cheap model for broad relevance scans. Stronger model for root cause synthesis. Different prompt shape for reproduction. Different retry policy for fix generation. Different timeout for validation.

The mistake is using the same model, the same timeout, and the same output format everywhere.

The better question is: what does this stage need to be correct about?

A file relevance stage does not need perfect prose. It needs high recall and structured evidence.

A root cause stage needs to reconcile conflicting signals.

A fix stage needs to produce a small patch.

A validation stage needs to be conservative, because false confidence is worse than no answer.

Once you write those constraints down, model selection becomes less mystical.
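Writing the constraints down can be as literal as a config table. The sketch below is illustrative only: the `StageConfig` fields, the placeholder model tiers, and every timeout and retry number are assumptions, not the values FixBugs uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    model: str          # placeholder tier name, not a real model id
    timeout_s: float
    max_retries: int
    output_format: str


# All values below are made up for illustration.
STAGES = {
    "file_relevance": StageConfig("cheap-model", 30.0, 3, "json"),
    "root_cause": StageConfig("strong-model", 120.0, 1, "markdown"),
    "fix": StageConfig("strong-model", 180.0, 2, "unified_diff"),
    "validation": StageConfig("strong-model", 300.0, 0, "json"),
}


def config_for(stage: str) -> StageConfig:
    """Fail loudly on an unknown stage instead of falling back to a default model."""
    try:
        return STAGES[stage]
    except KeyError:
        raise ValueError(f"no model config for stage {stage!r}") from None
```

Refusing to fall back to a default is deliberate: a silent default is how the same model, timeout, and output format end up everywhere.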

the checklist

The final checklist from the talk still holds.

  • Know your workload. Before building the feature, estimate input tokens, output tokens, expected concurrency, and whether the user needs an instant response or can tolerate asynchronous processing.

  • Reduce tokens. Do not send full context because it is convenient. Compress, retrieve, summarize, and preserve provenance.

  • Embrace parallelism. If the work is independent, split it. File scans, log-window analysis, artifact classification, and candidate hypothesis scoring often parallelize well.

  • Microservices and queues add complexity, but they also let different stages scale, retry, and fail independently. Don't overoptimize.

  • Expect failures. LLM APIs fail. Providers rate-limit. Responses violate schema. Tool calls hang. Sandboxes break. Repos have bad tests. Treat every model call like a network call to a flaky dependency or data source, because that is what it is.
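Treating a model call like a call to a flaky dependency could look like the wrapper below: a timeout, a bounded number of retries with jittered exponential backoff, and a schema check on the response. It is a sketch under assumptions. `call_with_retries` is a hypothetical helper, `call` is any zero-argument async function that returns a JSON string, and the `required_keys` default is an invented example schema.

```python
import asyncio
import json
import random


async def call_with_retries(
    call, *, timeout_s=30.0, max_retries=3, base_delay_s=0.5, required_keys=("files",)
):
    """Run an async model call with a timeout, bounded retries, and a schema check."""
    last_error = None
    for attempt in range(max_retries + 1):
        try:
            raw = await asyncio.wait_for(call(), timeout=timeout_s)
            data = json.loads(raw)  # invalid JSON raises a ValueError subclass
            if not isinstance(data, dict):
                raise ValueError("response is not a JSON object")
            missing = [k for k in required_keys if k not in data]
            if missing:
                raise ValueError(f"response missing keys: {missing}")
            return data
        except (asyncio.TimeoutError, ValueError, ConnectionError) as exc:
            last_error = exc
            if attempt < max_retries:
                # Exponential backoff with jitter so retries do not arrive in lockstep.
                delay = base_delay_s * (2 ** attempt) * random.uniform(0.5, 1.0)
                await asyncio.sleep(delay)
    raise RuntimeError(f"model call failed after {max_retries + 1} attempts") from last_error
```

Schema violations are retried the same way as timeouts here. Whether that is right depends on the stage: a validation stage may prefer to fail fast and ask for human input.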

original talk

This post is based on my SingleStore x PyDelhi talk on building high-performance AI agents in Python.

Code and artifacts from the talk are available in the pydelhi-talk repository.

The original recap and slide deck remain archived in the PyDelhi talk post.



Top comments (1)

Jin Li

Nice blog!

Why do you think the throughput increased with streaming enabled?