DEV Community: Kirti Rathore

Oncall isn't supposed to be this hard

Kirti Rathore — Tue, 02 Jun 2026 20:26:17 +0000

Bad Prometheus alerts tell an oncall engineer something is wrong, while good alerts connect the symptom to traces, logs, deploys, and the suspect commit.

That distinction sounds small until you're on-call and an alert storm appears.

You open one of the alerts and see:

[CRITICAL] CheckoutHighErrorRate - 7.3% 5xx in prod-eu-west-1

The alert is not wrong. Checkout is erroring out. But it hasn't told you why, or even which host, container, or VM to start investigating.

let the wild hunt begin

The SRE or developer now has to do all the work.

If you know what you're doing, you first check the alert definition.

A basic Prometheus setup usually looks like this:

- alert: CheckoutHighErrorRate
  expr: |
    sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))
    /
    sum(rate(http_requests_total{service="checkout"}[5m])) > 0.05
  for: 10m
  labels:
    severity: critical
    team: payments
    service: checkout
  annotations:
    summary: "Checkout 5xx rate is above 5%"
    runbook_url: "https://runbooks.corp/payments/checkout-5xx"
    dashboard_url: "https://grafana.corp/d/checkout"

This gives you some important pieces of information:

The alert computes the share of checkout requests that returned a 5xx status (which includes gateway timeouts such as 504) over a 5-minute window, and fires when that share stays above 5% for 10 minutes.
The alert is owned by the Payments team.
There is a playbook you can start from.
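
To make the rule's arithmetic concrete, here is a minimal Python sketch of what the expression and the `for:` clause compute. It is not how Prometheus is implemented. The function names, and the assumption of one rule evaluation per minute, are made up for illustration.

```python
from typing import Sequence

THRESHOLD = 0.05       # matches "> 0.05" in the rule
FOR_EVALUATIONS = 10   # assumption: one evaluation per minute, so "for: 10m" is 10 in a row


def error_ratio(rate_5xx: float, rate_total: float) -> float:
    """Ratio of 5xx requests/sec to all requests/sec, like the PromQL division."""
    if rate_total == 0:
        # Simplification: with no traffic the real rule would not fire either.
        return 0.0
    return rate_5xx / rate_total


def should_fire(ratios: Sequence[float]) -> bool:
    """True only if the last FOR_EVALUATIONS ratios were all above the threshold."""
    recent = ratios[-FOR_EVALUATIONS:]
    return len(recent) == FOR_EVALUATIONS and all(r > THRESHOLD for r in recent)
```

A single spike above 5% does not page anyone; the ratio has to stay above the threshold for the whole `for:` window.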

PromQL graph shows why the alert fired but doesn't give much more context.

But there is no reason to celebrate just yet.

The real work begins now.

Adjust the time window to within 5 minutes of the alert time.

Open Grafana and check if the dashboards have any extra information.
Open Loki and write a query like `{service_name="checkout"} |~ "(?i)error"`.
Open Tempo and filter traces by time. Guess which trace represents the incident.
Open your CD pipeline and search for any deploys just before the alert.

At some point, several possible hypotheses appear.

Big newly introduced feature in the checkout-api@v2.4.1 looks fishy.
High CPU usage on 3 out of 5 hosts that reported 5xx errors.
Suspicious I/O errors on all the investigated hosts.
Slow DB transactions.

Eventually the developer manages to reconstruct context across four tools, in about an hour if they know exactly what they're doing.

Meanwhile, there may be other fresh alerts to investigate.

good alerts tell you where to start looking

The same stack can behave very differently.

Not a different vendor. Not a more expensive alerting product.

The same stack, wired correctly to bubble up context.

Here is what it would look like for the Prometheus/Grafana/Tempo/Loki stack:

-> Prometheus exporter using OpenTelemetry SDK.
-> histograms correlated with trace spans.
-> Grafana exemplars enabled.
-> Tempo setup with trace-to-logs enabled.
-> deploy marker / service.version / commit SHA added as metadata with each alert.

The alert still starts with a metric. It should. Metrics are how you detect the symptom.

But the metric now carries a breadcrumb to a specific request.

Exemplars are the bridge from an aggregate bucket to a specific request.

Prometheus alerts do not naturally carry a trace_id. A histogram bucket is an aggregate, not a single request.

Exemplars change that. A sampled measurement can attach the active trace_id to the bucket. Grafana can render that as a clickable diamond on its graph. Click it and Tempo opens the representative trace.
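
As a rough illustration of the idea, here is a toy Python histogram that remembers one exemplar per bucket. It is a sketch of the concept only, not the Prometheus or OpenTelemetry client API. The class names, bucket bounds, and non-cumulative counting are simplifying assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Exemplar:
    trace_id: str
    value: float  # observed latency in seconds


@dataclass
class Histogram:
    """Toy latency histogram (non-cumulative buckets) with one exemplar per bucket."""
    bounds: Tuple[float, ...] = (0.1, 0.5, 1.0, float("inf"))
    counts: Dict[float, int] = field(default_factory=dict)
    exemplars: Dict[float, Exemplar] = field(default_factory=dict)

    def observe(self, value: float, trace_id: Optional[str] = None) -> None:
        # Find the first bucket whose upper bound covers the value.
        bound = next(b for b in self.bounds if value <= b)
        self.counts[bound] = self.counts.get(bound, 0) + 1
        if trace_id is not None:
            # Keep the latest sampled request as this bucket's breadcrumb.
            self.exemplars[bound] = Exemplar(trace_id, value)
```

The count stays an aggregate, but the slow bucket now points at one concrete request that a UI can link to.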

The trace shows the slow span and the context attached to it: database statement, feature flag, user, and service version.

In the good version, the selected span says:

service: db-primary
operation: SELECT orders WHERE user_id=$1
duration: 1210ms
db.rows_affected: 1110482
feature_flag.new_checkout: true
service.version: 2.4.1

We see the slow database queries in the distributed trace.

Then Tempo's trace-to-logs link opens Loki for the exact trace.

Trace-to-logs only works if logs carry the same trace identifier.

The log line is not buried in a time-window query anymore:

slow query: seq scan on orders (1.1M rows), index not used trace_id=4bf92f3577b34da6a3ce929d0e0e4736 span_id=00f067aa0ba902b7 service.version=2.4.1 commit=7a3f9c2

Now the hypothesis is no longer vague.

checkout-api@v2.4.1 added the new order-history query path.
The user_id column on orders needs an index.
The bad path is gated by feature_flag.new_checkout=true.
Disable the flag or roll back 7a3f9c2.
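
Trace-to-logs depends on every log line carrying the trace identifier. Below is one way that could look in Python using only the standard library. The context variables and class names are hypothetical; in a real service a tracing SDK would normally own the active trace context.

```python
import contextvars
import json
import logging

# Hypothetical holders for the active trace context (normally set by a tracing SDK).
current_trace_id = contextvars.ContextVar("trace_id", default="")
current_span_id = contextvars.ContextVar("span_id", default="")


class TraceContextFilter(logging.Filter):
    """Copy the active trace and span IDs onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        record.span_id = current_span_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so the log store can index trace_id."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", ""),
            "span_id": getattr(record, "span_id", ""),
        })


def build_logger(name: str = "checkout") -> logging.Logger:
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    handler.addFilter(TraceContextFilter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

With this in place, a query by trace_id returns the exact lines for one request instead of everything in a time window.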

the configuration is what makes the oncall experience fun

None of this comes automatically, whether you are using Prometheus + Grafana, Datadog, or New Relic.

The good path needs deliberate plumbing:

Page on symptoms: error rate, latency, traffic, saturation, or SLO burn.
Put team, service, severity, runbook_url, and a scoped dashboard_url on the alert.
Propagate W3C trace context through every service.
Inject trace_id and span_id into structured logs.
Enable exemplars on the histogram used by the alert.
Configure Grafana so exemplars open Tempo.
Configure Tempo trace-to-logs so spans open Loki.
Emit service.version, deploy annotations, and commit SHA from CI/CD.
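
For the trace-context item in the list above, the W3C `traceparent` header is what actually travels between services. The following Python sketch parses one and builds the header for a downstream call. It handles only the version `00` layout, and the helper names are illustrative rather than taken from any SDK.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version-traceid-parentid-flags, all lowercase hex (version 00 layout).
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the trace context, or None if the header is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Header for an outgoing call: same trace_id, fresh span ID."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{span_id}-{flags}"
```

If any service in the chain drops this header, the trace splits in two and the exemplar link leads to only half the story.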

You can play around with such a well-configured setup here.

AI SRE

The useful AI SRE workflow starts after the observability stack has preserved the evidence trail. The agent can help with root cause analysis, propose the fix, and validate the patch. But if the alert drops the trace, the log correlation, and the deploy context, the agent has the same problem the human does: it is guessing.

For where we are taking this in the product, see FixBugs and AI SRE tools.

Originally published at fixbugs.ai.

High-performance AI agents are distributed systems

Kirti Rathore — Tue, 02 Jun 2026 20:10:18 +0000

"Codex took 6 hours to implement this seemingly simple refactor".

"I think Research mode on Perplexity is stuck."

We all know LLM APIs are slow, and are content with staring at a spinner while the model slowly emits tokens.

But what happens when you're building AI agents that need to be low latency?

We hit this while building FixBugs, an AI debugging agent that reads bug reports, logs, code, screenshots, traces, and issue comments, then reproduces the bug and finally generates a validated fix. The product has a simple promise: every code change is verified to do only the necessary work to fix the issue.

The implementation is not simple.

Bug reports and their associated logs/metrics/traces can contain too much context for one model call. A repository can have hundreds of files. Logs can be larger than the model's useful context window. The final answer may need thousands of output tokens. And if the agent takes ten minutes to say anything useful, the user assumes it is broken.

Summarization, also referred to as compaction, is the usual way to work with huge context. However, summarization is slow and often loses essential context.

Modern coding agents like Claude Code and Cursor rely heavily on blindly grepping through log files and reading from specific offsets. The effective context window the coding agent is allowed to process at once is smaller than the total context window. GPT 5.5, for example, has a context window of 400K tokens, but its 'input context' is closer to 258K tokens.

Once you step beyond conversational agent loops, other interesting patterns become usable.

You realize the underlying performance engineering problems are similar to those encountered when optimizing large distributed systems.

Scatter-gather. Pipelining. Queues. Backpressure. Streaming. Serializability. These are the problems we spent the most time thinking about.

start with token math

Most agent performance discussions start in the wrong place.

They ask:

Which model is fastest in terms of tokens/sec?

That is a useful question later. The first question is:

How many input tokens and output tokens does this task need?

LLM latency has two different pieces that matter to the user.

Time to first token is how long the user waits before the model starts responding. Token throughput is how many output tokens arrive per second after that, which determines how long the full answer takes.

They are not the same problem.

Prefill affects time to first token. Decode affects the stream of output tokens after that.

In the prefill phase, the model processes the input context and prepares the key/value cache used to generate the first new token. In the decode phase, the model generates output tokens one at a time autoregressively.

For a practical agent, a crude mental model is enough:

latency model
T ≈ TTFT + output tokens × time/token

Input tokens are not free. They hit prefill and therefore time to first token.

prefill cost example
20,000 input tokens × 0.05ms/token = 1,000ms ≈ 1s TTFT
The constant is model- and provider-specific; the shape of the cost is the useful part.

But long answers are expensive in a different way. Every output token has to be generated. If your agent asks the model to explain every file in a repository, your user is paying for that decision in wall-clock time.

This matters because debugging agents are usually output-heavy. They do not just answer "yes" or "no." They produce hypotheses, evidence, file rankings, reproduction plans, code diffs, and validation notes.

Output tokens dominate faster than people expect.
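
The crude latency model fits in a few lines of Python. This is a back-of-the-envelope sketch: the function name is made up, and the per-token constants are placeholders you would measure for your own model and provider.

```python
from typing import Tuple


def estimate_latency_s(
    input_tokens: int,
    output_tokens: int,
    prefill_ms_per_token: float,
    decode_tokens_per_s: float,
) -> Tuple[float, float]:
    """Return (time to first token, total time) in seconds.

    T ≈ TTFT + output tokens × time/token, with TTFT driven by prefill.
    """
    ttft_s = input_tokens * prefill_ms_per_token / 1000.0
    decode_s = output_tokens / decode_tokens_per_s
    return ttft_s, ttft_s + decode_s


# 20,000 input tokens at 0.05 ms/token, then an assumed 1,000 output tokens at 50 tokens/s.
ttft, total = estimate_latency_s(20_000, 1_000, 0.05, 50.0)
print(f"TTFT ≈ {ttft:.1f}s, total ≈ {total:.1f}s")
```

Even with a 20,000-token prompt, the 1,000 output tokens account for 20 of the roughly 21 seconds in this example.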

the 10-minute file search

The biggest bottleneck in early FixBugs was not repository parsing.

It was asking the LLM which files were relevant to a bug.

The naive version looked reasonable:

Gather the bug context.
Gather the repository files.
Put all relevant context into one prompt.
Ask the model to rank files and explain why.

For a small repo, this works.

For 50 files, it turns into a bad batch job disguised as a chat request.

If the model emits 30,000 output tokens and the endpoint gives you 50 output tokens per second, you are waiting about 600 seconds. Ten minutes. That is before retries, rate limits, or any downstream fix generation.

To get faster performance, we realized we had to use as much parallelism as possible.

file relevance stage
one giant call
30,000 tokens ÷ 50 tokens/s = 600s
about 10 minutes before retries or downstream work

16 independent calls
16 × 50 tokens/s ≈ 800 tokens/s
roughly 40 seconds for the demo workload

Each file was decomposed into chunks. Each chunk got its own relevance call. Those calls ran concurrently.

If one call gives you 50 output tokens per second, 16 independent calls on a 16-vCPU VM can expose roughly 16 times the useful throughput to the workflow.

The demo version dropped the file-relevance stage from roughly 10 minutes to roughly 40 seconds.

That number is not a universal benchmark. It depends on the provider, model, prompt, chunk sizes, rate limits, and output format.

The important part is the pattern.

This was a scatter-gather workload.

Scatter the independent file checks. Gather the evidence. Merge the local judgments into one ranked view.
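
A minimal asyncio sketch of that scatter, gather, and merge shape is below. `score_chunk` is a stand-in for one LLM relevance call (it just counts shared words and sleeps briefly), and the concurrency limit of 16 mirrors the example above. None of it is the actual FixBugs code.

```python
import asyncio
from typing import List, Tuple


async def score_chunk(bug_report: str, chunk: str) -> Tuple[str, float]:
    """Stand-in for one LLM relevance call (hypothetical; no real API is used)."""
    await asyncio.sleep(0.01)  # pretend network and decode time
    wanted = set(bug_report.lower().split())
    hits = sum(1 for word in chunk.lower().split() if word in wanted)
    return chunk, float(hits)


async def rank_chunks(
    bug_report: str, chunks: List[str], max_concurrency: int = 16
) -> List[Tuple[str, float]]:
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(chunk: str) -> Tuple[str, float]:
        async with sem:  # never more than max_concurrency calls in flight
            return await score_chunk(bug_report, chunk)

    # Scatter: one task per chunk. Gather: wait for all of them.
    results = await asyncio.gather(*(bounded(c) for c in chunks))
    # Merge: combine the local judgments into one ranked view.
    return sorted(results, key=lambda r: r[1], reverse=True)
```

The semaphore is the knob that matters: it is where provider rate limits get respected without rewriting the rest of the stage.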

We already know how to do this.

"Do the same analysis over many independent records, then combine the results."

This is what Hadoop does and why it is so useful for data analysis.

LLM agents have the same class of problems.

chunking is not free

Chunking is easy to abuse.

I'll give one example each from the map side and the reduce side to illustrate.

Mapping a chunk

If you're given:

A: A bug report.
B: A repository file tree.

Now you've got to figure out which code files are relevant to the bug report.

How many files do you add to a chunk?

It would not be a good idea to use byte-level granularity and stuff as much context as a model call can handle.

Instead you want to have file level granularity, and add complete files where you can.
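
One way to get file-level granularity is to pack whole files into chunks under a token budget. The sketch below assumes roughly four characters per token, which is a crude placeholder for a real tokenizer, and the function names are illustrative.

```python
from typing import Dict, List


def estimate_tokens(text: str) -> int:
    """Very rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def pack_files(files: Dict[str, str], budget_tokens: int) -> List[List[str]]:
    """Group whole files into chunks without splitting any file.

    A file larger than the budget still gets a chunk of its own.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    used = 0
    for path, text in files.items():
        cost = estimate_tokens(text)
        if current and used + cost > budget_tokens:
            chunks.append(current)  # close the chunk before it overflows
            current, used = [], 0
        current.append(path)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Keeping files intact means each relevance call sees complete functions and imports rather than an arbitrary byte range.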

Reducing chunks

Suppose you're given:

A: A bug report.
B: Log files and Traces.

Now you've got to extract log snippets relevant to the bug from the input.

Mapping a log file is simple. You can chunk it greedily.

Reducing is a bit more complex.

The parallel LLM calls give you a series of log snippets. But unless you put the snippets in sequence by time, group them by span, and separate them by service name, they are not useful at all.
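
Here is a small Python sketch of that reduce step. The `Snippet` fields are assumptions about what each map call returns; real log snippets would need their timestamp, service, and span parsed out first.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Snippet:
    timestamp: float
    service: str
    span_id: str
    text: str


def reduce_snippets(snippets: Iterable[Snippet]) -> Dict[str, Dict[str, List[Snippet]]]:
    """Order snippets by time, then group them by service and by span."""
    grouped: Dict[str, Dict[str, List[Snippet]]] = defaultdict(lambda: defaultdict(list))
    # set() drops exact duplicates that separate calls may have returned.
    for snippet in sorted(set(snippets), key=lambda s: s.timestamp):
        grouped[snippet.service][snippet.span_id].append(snippet)
    return {service: dict(spans) for service, spans in grouped.items()}
```

Nothing here needs a model. The deterministic cleanup happens in plain code so the final LLM call only has to judge.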

In my experience, the "Reduce" phase is often messier than people would like it to be.

The rule I use now:

Chunk for evidence.
Merge for judgment.

The local calls should find facts, signals, and candidate explanations. The final call should resolve conflicts, rank evidence, and decide what to do next.

streaming changes the wait

There is another latency problem that chunking does not solve.

Sometimes the user needs to see that the agent is alive.

For interactive debugging, time to first token matters more than total completion time. The engineer does not always need the whole final report immediately. They need the first useful hypothesis, the first file name, the first sign that the investigation is moving.

Streaming helps.

streaming tradeoff
demo workload
             without streaming   with streaming
TTFT         13s                 2.4s
throughput   486 tok/s           244 tok/s

In the demo, streaming reduced time to first token from 13 seconds to 2.4 seconds.

That is a huge UX difference.

But throughput got worse: 486 tokens/sec without streaming versus 244 tokens/sec with streaming.

This is the kind of tradeoff that disappears if you only measure "request completed in N seconds." Streaming is not a throughput optimization. It is a user-experience optimization.

For chat-like workflows, it is usually worth it.

For batch stages inside an agent pipeline, it may not be.

FixBugs uses both modes. User-facing stages stream progress. Internal worker stages optimize for total job completion, retry behavior, and queue throughput.

That distinction keeps the system honest.
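
Measuring both numbers on the same request takes only a small wrapper around the token stream. This is a sketch with assumed names; the injectable `clock` parameter exists only to make the function easy to test.

```python
import time
from typing import Callable, Iterable, NamedTuple


class StreamStats(NamedTuple):
    ttft_s: float        # seconds until the first token arrived
    tokens: int          # tokens received
    tokens_per_s: float  # tokens divided by total elapsed time


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamStats:
    start = clock()
    first = None
    count = 0
    for _ in tokens:
        if first is None:
            first = clock()  # the moment the user sees something
        count += 1
    end = clock()
    ttft = (first - start) if first is not None else float("nan")
    elapsed = end - start
    return StreamStats(ttft, count, count / elapsed if elapsed > 0 else 0.0)
```

A non-streaming response looks like a stream that delivers every token at the very end, so its TTFT equals its total time.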

concurrency has a ceiling

The first time you parallelize LLM calls and see a 5x or 10x improvement, it is tempting to conclude the next fix is "more workers."

That works until it does not.

Throughput improves with concurrency, then starts flattening. More workers eventually stop buying much.

At low concurrency, the system has idle capacity. Adding workers improves utilization. Throughput rises quickly.

At higher concurrency, the slope changes. You are now fighting shared bottlenecks: provider rate limits, GPU scheduling, KV cache memory, network overhead, queueing, retries, and your own post-processing.

The unpleasant part is that throughput may plateau while user latency gets worse.

Concurrency can keep aggregate throughput healthy while making individual users wait much longer.

That is why "tokens per second" is not enough.

You need at least four metrics:

time to first token
output tokens per second
total wall-clock time
failure/retry rate under load

And you need to record them by stage.

For an AI debugging agent, "the whole thing took 90 seconds" is not a useful measurement. Which stage took 90 seconds? File relevance? Log compression? Root cause analysis? Reproduction? Fix generation? Validation?

If you do not know, you cannot optimize it.
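
Recording wall-clock time and failures per stage can start as a context manager. The class below is a minimal sketch with hypothetical names. It covers two of the four metrics; time to first token and tokens per second have to be measured on the model stream itself.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Dict, List, Optional


class StageMetrics:
    """Collect wall-clock durations and failure counts by stage name."""

    def __init__(self) -> None:
        self.durations: Dict[str, List[float]] = defaultdict(list)
        self.failures: Dict[str, int] = defaultdict(int)

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        except Exception:
            self.failures[name] += 1
            raise
        finally:
            # Failed attempts are timed too: they still cost the user time.
            self.durations[name].append(time.perf_counter() - start)

    def slowest(self) -> Optional[str]:
        totals = {name: sum(values) for name, values in self.durations.items()}
        return max(totals, key=totals.get) if totals else None
```

Usage is `with metrics.stage("file_relevance"): ...` around each stage, after which `metrics.slowest()` answers the "which stage took 90 seconds?" question.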

queues beat request chains

Once the workflow has more than one stage, a single request chain becomes fragile.

Analyze the bug. Then reproduce it. Then identify root cause. Then generate a fix. Then validate the fix.

If this runs as one synchronous chain, every stage inherits every other stage's latency and failure mode. A slow reproduction attempt blocks root cause work. A provider retry blocks the entire request. One expensive bug can starve smaller bugs behind it.

The better shape is a pipeline.

Independent stages let incoming bugs move through the system without one long blocking request chain.

In FixBugs, the natural stages are:

analysis: parse and compress the bug report and artifacts
reproduction: reproduce the bug and write a failing test
root cause: identify the most likely cause
fix: generate a patch
validation: prove the patch fixes the reproduced failure

Those stages do not all have the same workload.

Some are network-heavy. Some are model-heavy. Some are repo-heavy. Some need a sandbox. Some can run with cheaper models. Some need stronger models. Some should retry aggressively. Some should fail fast and ask for human input.

That is why the pipeline should not pretend they are one operation.

Use message queues when stages should operate independently. Use idempotent workers. Put explicit retry limits around LLM calls. Track partial state. Make each stage observable. Do not make the user wait on work that can finish after the first useful answer.

This is not fancy.
It is normal backend engineering.
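
As a sketch of that shape, here is a two-stage pipeline on `asyncio.Queue` with idempotent workers and an explicit retry limit. In-process queues stand in for a real message broker, the stage handlers are trivial placeholders, and all names are illustrative.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List, Optional, Set, Tuple

Job = Dict[str, str]
Handler = Callable[[Job], Awaitable[Job]]


async def run_with_retries(handler: Handler, job: Job, max_retries: int) -> Optional[Job]:
    """Call the handler at most max_retries + 1 times, then give up."""
    for attempt in range(max_retries + 1):
        try:
            return await handler(job)
        except Exception:
            if attempt == max_retries:
                return None  # a real system would dead-letter the job here
    return None


async def worker(stage: str, handler: Handler, inbox: asyncio.Queue,
                 outbox: Optional[asyncio.Queue], done: Set[Tuple[str, str]],
                 max_retries: int = 2) -> None:
    while True:
        job = await inbox.get()
        try:
            key = (stage, job["id"])
            if key not in done:  # idempotency: skip redelivered jobs
                result = await run_with_retries(handler, job, max_retries)
                done.add(key)
                if result is not None and outbox is not None:
                    await outbox.put(result)
        finally:
            inbox.task_done()


async def run_pipeline(jobs: List[Job]) -> List[Job]:
    analysis_q: asyncio.Queue = asyncio.Queue()
    fix_q: asyncio.Queue = asyncio.Queue()
    out_q: asyncio.Queue = asyncio.Queue()
    done: Set[Tuple[str, str]] = set()

    async def analyze(job: Job) -> Job:  # placeholder stage
        return {**job, "analysis": "ok"}

    async def fix(job: Job) -> Job:      # placeholder stage
        return {**job, "patch": "diff"}

    workers = [
        asyncio.create_task(worker("analysis", analyze, analysis_q, fix_q, done)),
        asyncio.create_task(worker("fix", fix, fix_q, out_q, done)),
    ]
    for job in jobs:
        await analysis_q.put(job)
    await analysis_q.join()  # every analysis job has been handled
    await fix_q.join()       # every fix job has been handled
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return [out_q.get_nowait() for _ in range(out_q.qsize())]
```

Each stage can now get its own worker count and retry limit without touching the others.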

memory is a compression layer

Long-context models are useful.

They are also an attractive nuisance, because they hide two different problems.

The first is a performance problem. Larger prompts mean more prefill work, more tokens to move through the system, higher latency, and lower throughput.

The second is a reasoning problem. Irrelevant context is not neutral. Old hypotheses, stale summaries, repeated log snippets, and unrelated file notes compete with the evidence that matters for the current step.

A memory layer helps only if it handles both problems: send fewer tokens and preserve the facts the next stage needs.

For one demo, I tested a Mem0-style memory layer to isolate the performance side: extract facts from prior context, store them, and retrieve only the facts relevant to the current step.

context management demo
60.51 tok/s → 216.06 tok/s
full context versus retrieved memory facts for the demo case

In that demo, token throughput improved from 60.51 tokens/sec to 216.06 tokens/sec compared with sending the full context.

Again, do not overfit to the number. The useful principle is simpler: every token you do not send is latency you do not pay for.

But the benchmark only measures the performance side.

Memory is not just for personalization. In agent systems, memory is a compression layer and an evidence ledger. It decides which facts survive across steps.
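
To show the store-and-retrieve half of such a layer, here is a deliberately naive Python sketch. It is not Mem0's API. Retrieval is plain keyword overlap standing in for embedding search, and the fact-extraction step, which would itself be a model call, is left out.

```python
import re
from dataclasses import dataclass, field
from typing import List, Set


def _terms(text: str) -> Set[str]:
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


@dataclass
class FactStore:
    """Toy evidence ledger: store short facts, retrieve the relevant ones."""
    facts: List[str] = field(default_factory=list)

    def add(self, fact: str) -> None:
        if fact not in self.facts:  # no duplicate ledger entries
            self.facts.append(fact)

    def retrieve(self, query: str, k: int = 3) -> List[str]:
        wanted = _terms(query)
        scored = [(len(wanted & _terms(fact)), fact) for fact in self.facts]
        scored = [pair for pair in scored if pair[0] > 0]
        # Stable sort: facts with equal overlap keep insertion order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [fact for _, fact in scored[:k]]
```

The next stage's prompt is built from `retrieve(...)` rather than the full history, which is where the token savings come from.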

model choice is infrastructure

Token throughput for various models on OpenRouter.

Choosing a model for an agent is not only a quality decision.

It is an infrastructure decision.

Two models with similar benchmark accuracy can behave very differently under your workload. One may stream quickly but produce verbose answers. One may have great throughput but poor tool-use reliability.

FixBugs treats model choice by stage.

Cheap model for broad relevance scans. Stronger model for root cause synthesis. Different prompt shape for reproduction. Different retry policy for fix generation. Different timeout for validation.

The mistake is using the same model, the same timeout, and the same output format everywhere.

The better question is: what does this stage need to be correct about?

A file relevance stage does not need perfect prose. It needs high recall and structured evidence.

A root cause stage needs to reconcile conflicting signals.

A fix stage needs to produce a small patch.

A validation stage needs to be conservative, because false confidence is worse than no answer.

Once you write those constraints down, model selection becomes less mystical.
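
Writing the constraints down can be as literal as a per-stage policy table. Everything in this sketch is a placeholder: the model names, timeouts, retry counts, and output formats are made-up values to show the shape, not the FixBugs configuration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class StagePolicy:
    model: str          # placeholder model name
    timeout_s: float
    max_retries: int
    output_format: str


# Hypothetical values: the point is that every stage gets its own row.
POLICIES: Dict[str, StagePolicy] = {
    "file_relevance": StagePolicy("cheap-model", 30.0, 3, "json"),
    "root_cause": StagePolicy("strong-model", 120.0, 1, "markdown"),
    "fix": StagePolicy("strong-model", 180.0, 2, "unified_diff"),
    "validation": StagePolicy("strong-model", 300.0, 0, "json"),
}


def policy_for(stage: str) -> StagePolicy:
    """Look up a stage's policy and fail loudly on an unconfigured stage."""
    try:
        return POLICIES[stage]
    except KeyError:
        raise ValueError(f"no policy configured for stage {stage!r}") from None
```

A missing row raises instead of silently falling back to a default, so a new stage cannot inherit the wrong model by accident.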

the checklist

The final checklist from the talk still holds.

Know your workload.
Before building the feature, estimate input tokens, output tokens, expected concurrency, and whether the user needs an instant response or can tolerate asynchronous processing.
Reduce tokens.
Do not send full context because it is convenient. Compress, retrieve, summarize, and preserve provenance.
Embrace parallelism.
If the work is independent, split it. File scans, log-window analysis, artifact classification, and candidate hypothesis scoring often parallelize well.
Microservices and queues add complexity, but they also let different stages scale, retry, and fail independently. Don't overoptimize.
Expect failures.

LLM APIs fail. Providers rate-limit. Responses violate schema. Tool calls hang. Sandboxes break. Repos have bad tests. Treat every model call like a network call to a flaky dependency / data source, because that is what it is.
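
Treating a model call like a flaky network dependency might look like the wrapper below: a timeout, a bounded number of retries with exponential backoff, and a schema check on the response. It is a sketch under assumptions. `call` stands for any coroutine that returns the model's raw text, and the required-keys check is a stand-in for real schema validation.

```python
import asyncio
import json
from typing import Any, Awaitable, Callable, Dict, Optional, Sequence


async def call_with_guardrails(
    call: Callable[[], Awaitable[str]],
    required_keys: Sequence[str],
    timeout_s: float = 30.0,
    max_retries: int = 2,
    base_delay_s: float = 0.5,
) -> Dict[str, Any]:
    last_error: Optional[Exception] = None
    for attempt in range(max_retries + 1):
        try:
            # Hung calls are cut off instead of blocking the stage forever.
            raw = await asyncio.wait_for(call(), timeout=timeout_s)
            data = json.loads(raw)  # JSONDecodeError is a ValueError
            if not isinstance(data, dict):
                raise ValueError("response is not a JSON object")
            missing = [key for key in required_keys if key not in data]
            if missing:
                raise ValueError(f"response missing keys: {missing}")
            return data
        except (asyncio.TimeoutError, ValueError, ConnectionError) as exc:
            last_error = exc
            if attempt < max_retries:
                # Exponential backoff: base, 2x base, 4x base, ...
                await asyncio.sleep(base_delay_s * 2 ** attempt)
    raise RuntimeError(f"gave up after {max_retries + 1} attempts") from last_error
```

Schema violations are retried the same way as timeouts here; whether that is right depends on the stage, which is another reason to keep retry policy per stage.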

original talk

This post is based on my SingleStore x PyDelhi talk on building high-performance AI agents in Python.

Code and artifacts from the talk are available in the pydelhi-talk repository.

The original recap and slide deck remain archived in the PyDelhi talk post.

Originally published at fixbugs.ai.

Engineering approach: Startup Mode v/s Big Tech Mode

Kirti Rathore — Wed, 29 Apr 2026 19:00:31 +0000

Last week, I delivered a talk at PyDelhi discussing strategies that leverage how large language models work to improve the performance of your LLM applications. Here is the talk as a PDF.

Some techniques I discussed were:

Strategies for faster token throughput.
Strategies for quick time to first token.
Effective context window management.
Model routing strategies.

But here's the uncomfortable truth for founders: if you're just starting your LLM startup, you should completely ignore this advice.

Let me explain why — and when application performance actually matters.

How It Works in the Ideal World: Big Tech's Playbook

Imagine the typical trajectory at a company like Google or Stripe. You see a problem in the market. It's well-defined. Your user base is established. You build a team to solve it.

Your first step isn't writing code—it's understanding your performance requirements.

You study incumbent competitors. You conduct user research. You measure what your users actually tolerate. For e-commerce, that's Amazon's 5-second response time threshold. For payments, that's Stripe's sub-100ms latency requirement. For real-time LLM interfaces, that might be streaming tokens within 200ms.

These user expectations become Service Level Objectives (SLOs)—formal performance, reliability, and usability targets your application must meet to remain competitive.

Once you have SLOs, someone (usually a principal engineer or architect) translates them into a system architecture. This involves:

Weighing architectural tradeoffs (monolith vs. microservices, synchronous vs. asynchronous)
Selecting technology stacks for different components
Deciding on execution environments (web app vs. IDE plugin vs. CLI tool)
Planning for scale from day one

This approach works beautifully for mature companies with stable product-market fit. You have reliable data about what your users need, so you can build the right system the first time.

The Cost of Performance Engineering

Performance optimization requires trade-offs—all of them expensive.

At Google and VMware, my teams answered questions like:

How much does adopting AVX-512 improve RAID-6 computational throughput?
How much latency can we save by building local caches with remote diffs?
Can we prefetch data and pipeline operations based on dependency graphs?

These questions have answers, and the answers are valuable. But solving them has a cost: optimized code is complex, harder to understand, and harder to debug.

Consider a simple workflow with a few network calls and database queries. Now transform it for performance: add Redis for slow queries, use async with continuations, add TCP connection pooling with keepalives, consider UDP over TCP for specific data patterns, distribute read load across multiple backend instances, reduce logging overhead, say NO to heap allocations. You get the point.

Each optimization adds complexity. Each line becomes harder for the next engineer to reason about.

Performance engineering also locks you into early technical decisions. Refactoring code can mean rolling back optimizations.

How It Actually Works: The Startup Reality

Here's where the Big Tech playbook breaks down.

At a startup, almost nothing is stable. Your SLOs don't exist yet because you don't know who your customers are. Your product architecture will change—not once, but repeatedly.

The sources of uncertainty are constant:

Product pivot: Your initial idea evolves. Instagram started as Burbn, a cluttered check-in app with photos as a side feature. When founders realized users were ignoring the check-in functionality and only engaging with photo sharing, they stripped everything away and rebuilt the architecture around that single use case.
Customer pivot: You discover your ideal customer profile is different from what you assumed. That financial services firm won't buy your product, but the open-source community will—and they have completely different scalability requirements.
The landscape is evolving: New models, new APIs, better caching strategies emerge monthly. Locking into early architectural decisions is especially costly.
Your use cases will change: You might start with synchronous inference, then need streaming. You might start with single-turn interactions, then add multi-turn conversations. Each shift requires rearchitecting.

As Gergely Orosz noted after years at Uber: the biggest constraint at startups isn't computing resources—it's the coordination overhead. At big tech companies, you wait days for approvals on simple plumbing changes. At startups, you need to move fast and change direction constantly.

The Counterargument: When Performance Matters Early

I need to be clear: there are exceptions.

If your business model directly depends on latency—say, you're selling real-time trading alerts and charge per-millisecond-saved—then performance optimization matters from day one.

If your unit economics fundamentally depend on throughput (you make money per inference, and your margins vanish if you're inefficient), then measure and optimize.

But ask yourself honestly: is performance actually your constraint, or is it a distraction?

Most startups discover their real constraints are customer acquisition, product-market fit, and unit economics—not milliseconds.

What to Do Instead

Here's your startup engineering philosophy:

Use third-party solutions liberally. Use managed databases instead of self-hosting Postgres. Use cloud APIs instead of building infrastructure. Use open-source libraries even if they're slower or have some overhead. The velocity gain from not building custom infrastructure outweighs the performance cost—until you reach scale.

As Paul Graham noted in his essay on startup strategies: founders often resist early customer work because they'd "rather sit at home writing code." The same applies here. You'd rather optimize your codebase than talk to customers. Both are mistakes.

Optimize for changeability, not performance. Write simple code that's easy to refactor. Clear, straightforward solutions beat clever optimizations.

This means:

Choose simple data structures over complex ones
Write tests that give you confidence to refactor
Measure, but don't optimize based on those measurements yet

Think of it this way: if you were solving this problem in a language like Python or JavaScript (where performance is never the limit), what would you do? Do that. Build it carefully, but don't overthink it.

Build the metrics foundation, but not the optimizations yet. Set up basic monitoring from day one. Understand where time is spent. Just don't act on it yet—collect data for when it matters.

The Inflection Point: When Everything Changes

Here's the transition: when your product stabilizes and you have real users, everything changes.

Once you've validated that customers actually want what you built, and you understand your unit economics, then you switch modes. At this point:

Define your actual SLOs based on user behavior and business requirements
Profile your application to find real bottlenecks
Invest in performance engineering

Notice what happens at this stage: you have a product that works, customers who are paying, and clear visibility into what's slow. You're no longer gambling on architecture decisions.

The Real Lesson

The difference between Big Tech and startups isn't that Big Tech engineers are smarter. It's that Big Tech has certainty about its problem space, while startups operate under radical uncertainty about everything.

The engineering approach must match reality.

The best startup engineers I've known—including those who came from Big Tech—learned to shift modes. They brought discipline and architectural thinking from their Big Tech experience, but they abandoned the assumption that everything needs to be perfect from day one.

Your job as a startup founder isn't to build the most performant system. It's to build something that works, that users want, and that you can change when you learn something new.

Performance optimization will still be there when you need it. For now, focus on moving fast and learning what actually matters.

Key Takeaways

Big Tech Engineering	Startup Engineering
Problem space is known; optimize for scale	Problem space is uncertain; optimize for learning
SLOs defined upfront based on market research	SLOs emerge from customer feedback
Complex architecture justified by requirements	Simple architecture enables rapid pivots
Performance optimization adds value	Performance optimization is often wasted work
Code should be optimized and reliable	Code should be clear and changeable

Your job early on is to prove the hypothesis, not to implement it perfectly.

Link to the full blog: https://fixbugs.ai/blog/startup-vs-bigtech-blog
Link to the full talk: https://fixbugs.ai/content/misc/Writing-High-Performance-AI-Agents-in-Python-Insights-from-building-Modulo-2.pdf