<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kirti Rathore</title>
    <description>The latest articles on DEV Community by Kirti Rathore (@kirtivr).</description>
    <link>https://dev.to/kirtivr</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3904893%2Fdb03ff9a-a1ba-463e-a949-2e60d026ee2a.png</url>
      <title>DEV Community: Kirti Rathore</title>
      <link>https://dev.to/kirtivr</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kirtivr"/>
    <language>en</language>
    <item>
      <title>Oncall isn't supposed to be this hard</title>
      <dc:creator>Kirti Rathore</dc:creator>
      <pubDate>Tue, 02 Jun 2026 20:26:17 +0000</pubDate>
      <link>https://dev.to/kirtivr/oncall-isnt-supposed-to-be-this-hard-1n5b</link>
      <guid>https://dev.to/kirtivr/oncall-isnt-supposed-to-be-this-hard-1n5b</guid>
      <description>&lt;p&gt;Bad Prometheus alerts tell an oncall engineer something is wrong, while good alerts connect the symptom to traces, logs, deploys, and the suspect commit.&lt;/p&gt;

&lt;p&gt;That distinction sounds small until you're on-call and an alert storm appears.&lt;/p&gt;

&lt;p&gt;You open one of the alerts and see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[CRITICAL] CheckoutHighErrorRate - 7.3% 5xx in prod-eu-west-1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The alert is not wrong. Checkout is erroring out. But it hasn't told you why, or even which host, container, or VM to start investigating.&lt;/p&gt;

&lt;h2&gt;
  
  
  let the wild hunt begin
&lt;/h2&gt;

&lt;p&gt;The SRE / Developer now has all the work to do.&lt;/p&gt;

&lt;p&gt;If you know what you're doing, you first check the alert definition.&lt;/p&gt;

&lt;p&gt;A basic Prometheus setup usually looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;alert&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;CheckoutHighErrorRate&lt;/span&gt;
  &lt;span class="na"&gt;expr&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;|&lt;/span&gt;
    &lt;span class="s"&gt;sum(rate(http_requests_total{service="checkout",status=~"5.."}[5m]))&lt;/span&gt;
    &lt;span class="s"&gt;/&lt;/span&gt;
    &lt;span class="s"&gt;sum(rate(http_requests_total{service="checkout"}[5m])) &amp;gt; 0.05&lt;/span&gt;
  &lt;span class="na"&gt;for&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;10m&lt;/span&gt;
  &lt;span class="na"&gt;labels&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;severity&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;critical&lt;/span&gt;
    &lt;span class="na"&gt;team&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;payments&lt;/span&gt;
    &lt;span class="na"&gt;service&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;checkout&lt;/span&gt;
  &lt;span class="na"&gt;annotations&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;summary&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Checkout&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5xx&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;rate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;above&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;5%"&lt;/span&gt;
    &lt;span class="na"&gt;runbook_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://runbooks.corp/payments/checkout-5xx"&lt;/span&gt;
    &lt;span class="na"&gt;dashboard_url&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://grafana.corp/d/checkout"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This gives you some important pieces of information:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The alert computes the share of 5xx responses among all checkout requests over a &lt;em&gt;5-minute&lt;/em&gt; window and fires once that share has stayed above 5% for 10 minutes.&lt;/li&gt;
&lt;li&gt;The alert is owned by the Payments team.&lt;/li&gt;
&lt;li&gt;There is a runbook and a dashboard you can start from.&lt;/li&gt;
&lt;/ul&gt;
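&lt;p&gt;To make the timing concrete, here is a minimal sketch of how a threshold plus a &lt;code&gt;for&lt;/code&gt; clause behaves. It is not Prometheus code; the &lt;code&gt;Sample&lt;/code&gt; type and &lt;code&gt;firing_minute&lt;/code&gt; function are hypothetical names, and the one-sample-per-minute evaluation interval is an assumption.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class Sample:
    minute: int        # evaluation time, in minutes (assumed 1-minute interval)
    errors_5xx: float  # 5xx request rate over the trailing 5m window
    total: float       # total request rate over the same window


def firing_minute(samples, threshold=0.05, hold_minutes=10):
    """Return the minute the alert would fire, or None if it never does."""
    pending_since = None
    for s in samples:
        # PromQL yields NaN on division by zero, which never exceeds the
        # threshold; this sketch treats that case as a ratio of 0.
        ratio = s.errors_5xx / s.total if s.total else 0.0
        if ratio > threshold:
            if pending_since is None:
                pending_since = s.minute  # alert becomes pending
            if s.minute - pending_since >= hold_minutes:
                return s.minute           # held long enough: firing
        else:
            pending_since = None          # condition cleared, timer resets
    return None


# Errors jump to 7.3% at minute 3 and stay there.
samples = [Sample(m, 7.3 if m >= 3 else 1.0, 100.0) for m in range(20)]
print(firing_minute(samples))
```

&lt;p&gt;In this run the ratio crosses the threshold at minute 3 and the alert fires at minute 13. The practical takeaway: by the time the page arrives, the problem is already at least ten minutes old, so look back at least that far.&lt;/p&gt;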

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffixbugs.ai%2Fcontent%2Fimages%2Falerting-good-bad-alerts%2Fprometheus-raw-graph.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffixbugs.ai%2Fcontent%2Fimages%2Falerting-good-bad-alerts%2Fprometheus-raw-graph.png" alt="Prometheus expression browser showing the raw query behind the checkout high error rate alert." width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;PromQL graph shows why the alert fired but doesn't give much more context.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;But there is no reason to celebrate just yet.&lt;/p&gt;

&lt;p&gt;The real work begins now.&lt;/p&gt;

&lt;p&gt;Adjust the time window to within &lt;em&gt;5 minutes&lt;/em&gt; of the &lt;em&gt;alert time&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Open Grafana and check if the dashboards have any extra information.&lt;/li&gt;
&lt;li&gt;Open Loki and write a query like &lt;code&gt;{service_name="checkout"} |~ "(?i)error"&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Open Tempo and filter traces by time. Guess which trace represents the incident.&lt;/li&gt;
&lt;li&gt;Open your CD pipeline and search for any deploys just before the alert.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At some point, several possible hypotheses appear.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Big newly introduced feature in the checkout-api@v2.4.1 looks fishy.
High CPU usage on 3 out of 5 hosts that reported 5xx errors.
Suspicious I/O errors on all the investigated hosts.
Slow DB transactions.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Eventually the developer manages to reconstruct context across four tools, in about an hour if they know exactly what they're doing.&lt;/p&gt;

&lt;p&gt;Meanwhile, there may be other fresh alerts to investigate.&lt;/p&gt;

&lt;h2&gt;
  
  
  good alerts tell you where to start looking
&lt;/h2&gt;

&lt;p&gt;The same stack can behave very differently.&lt;/p&gt;

&lt;p&gt;Not a different vendor. Not a more expensive alerting product.&lt;/p&gt;

&lt;p&gt;The same stack, wired correctly to bubble up context.&lt;/p&gt;

&lt;p&gt;Here is what it would look like for the Prometheus/Grafana/Tempo/Loki stack:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;-&amp;gt; Prometheus exporter using OpenTelemetry SDK.
-&amp;gt; histograms correlated with trace spans.
-&amp;gt; Grafana exemplars enabled.
-&amp;gt; Tempo setup with trace-to-logs enabled.
-&amp;gt; deploy marker / service.version / commit SHA added as metadata with each alert.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The alert still starts with a metric. It should. Metrics are how you detect the symptom.&lt;/p&gt;

&lt;p&gt;But the metric now carries a breadcrumb to a specific request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffixbugs.ai%2Fcontent%2Fimages%2Falerting-good-bad-alerts%2Fgrafana-exemplars.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffixbugs.ai%2Fcontent%2Fimages%2Falerting-good-bad-alerts%2Fgrafana-exemplars.png" alt="Grafana latency histogram with exemplar diamonds linking the high p95 bucket to a Tempo trace." width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Exemplars are the bridge from an aggregate bucket to a specific request.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Prometheus alerts do not naturally carry a &lt;code&gt;trace_id&lt;/code&gt;. A histogram bucket is an aggregate, not a single request.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://grafana.com/docs/grafana/latest/fundamentals/exemplars/" rel="noopener noreferrer"&gt;Exemplars&lt;/a&gt; change that. A sampled measurement can attach the active &lt;code&gt;trace_id&lt;/code&gt; to the &lt;a href="https://grafana.com/docs/grafana/latest/datasources/prometheus/configure/#provision-the-prometheus-data-source" rel="noopener noreferrer"&gt;bucket&lt;/a&gt;. Grafana can render that as a clickable diamond on its graph. Click it and Tempo opens the representative trace.&lt;/p&gt;
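&lt;p&gt;On the wire, an exemplar is just extra data trailing a bucket sample in the OpenMetrics text exposition. The sketch below renders one such line by hand to show the shape. The metric name and the &lt;code&gt;exemplar_line&lt;/code&gt; helper are assumptions for illustration; a real service would let its metrics SDK emit this.&lt;/p&gt;

```python
def exemplar_line(metric, le, count, trace_id, value):
    """Render one histogram bucket sample followed by an exemplar.

    OpenMetrics shape: name_bucket{le="..."} count # {labels} observed_value
    """
    return (
        f'{metric}_bucket{{le="{le}"}} {count} '
        f'# {{trace_id="{trace_id}"}} {value}'
    )


# Hypothetical metric name; the trace_id matches the log example in this post.
print(exemplar_line("http_request_duration_seconds", "2.5", 42,
                    "4bf92f3577b34da6a3ce929d0e0e4736", 1.21))
```

&lt;p&gt;The bucket count stays an aggregate. The part after the &lt;code&gt;#&lt;/code&gt; is the one sampled request that Grafana can turn into a link.&lt;/p&gt;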

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffixbugs.ai%2Fcontent%2Fimages%2Falerting-good-bad-alerts%2Ftempo-trace-waterfall.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffixbugs.ai%2Fcontent%2Fimages%2Falerting-good-bad-alerts%2Ftempo-trace-waterfall.png" alt="Grafana Tempo trace waterfall showing the slow database span, span attributes, logs tab, and service version." width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The trace shows the slow span and the context attached to it: database statement, feature flag, user, and service version.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the good version, the selected span says:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;service: db-primary
operation: SELECT orders WHERE user_id=$1
duration: 1210ms
db.rows_affected: 1110482
feature_flag.new_checkout: true
service.version: 2.4.1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;We see the slow database query in the distributed trace.&lt;/p&gt;

&lt;p&gt;Then Tempo's &lt;a href="https://grafana.com/docs/grafana/latest/datasources/tempo/configure-tempo-data-source/configure-trace-to-logs/" rel="noopener noreferrer"&gt;trace-to-logs&lt;/a&gt; link opens Loki for the exact trace.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffixbugs.ai%2Fcontent%2Fimages%2Falerting-good-bad-alerts%2Floki-trace-logs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Ffixbugs.ai%2Fcontent%2Fimages%2Falerting-good-bad-alerts%2Floki-trace-logs.png" alt="Grafana Explore with Loki logs filtered by trace_id for the checkout trace." width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Trace-to-logs only works if logs carry the same trace identifier.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The log line is not buried in a time-window query anymore:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;slow query: seq scan on orders (1.1M rows), index not used
trace_id=4bf92f3577b34da6a3ce929d0e0e4736
span_id=00f067aa0ba902b7
service.version=2.4.1
commit=7a3f9c2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now the hypothesis is no longer vague.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;checkout-api@v2.4.1 added the new order-history query path.
The user_id column needs an index.
The bad path is gated by feature_flag.new_checkout=true.
Disable the flag or roll back 7a3f9c2.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  the configuration is what makes the oncall experience fun
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffierlpwlg69eyy65csw9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffierlpwlg69eyy65csw9.png" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;None of this comes automatically, whether you are using Prometheus + Grafana, Datadog, or New Relic.&lt;/p&gt;

&lt;p&gt;The good path needs deliberate plumbing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Page on symptoms: error rate, latency, traffic, saturation, or SLO burn.&lt;/li&gt;
&lt;li&gt;Put &lt;code&gt;team&lt;/code&gt;, &lt;code&gt;service&lt;/code&gt;, &lt;code&gt;severity&lt;/code&gt;, &lt;code&gt;runbook_url&lt;/code&gt;, and a scoped &lt;code&gt;dashboard_url&lt;/code&gt; on the alert.&lt;/li&gt;
&lt;li&gt;Propagate W3C trace context through every service.&lt;/li&gt;
&lt;li&gt;Inject &lt;code&gt;trace_id&lt;/code&gt; and &lt;code&gt;span_id&lt;/code&gt; into structured logs.&lt;/li&gt;
&lt;li&gt;Enable exemplars on the histogram used by the alert.&lt;/li&gt;
&lt;li&gt;Configure Grafana so exemplars open Tempo.&lt;/li&gt;
&lt;li&gt;Configure Tempo trace-to-logs so spans open Loki.&lt;/li&gt;
&lt;li&gt;Emit &lt;code&gt;service.version&lt;/code&gt;, deploy annotations, and commit SHA from CI/CD.&lt;/li&gt;
&lt;/ul&gt;
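&lt;p&gt;Two of those steps, propagating W3C trace context and injecting the IDs into logs, fit in a few lines. This is a minimal sketch using only the standard library, assuming the incoming request carries a &lt;code&gt;traceparent&lt;/code&gt; header. The &lt;code&gt;parse_traceparent&lt;/code&gt; and &lt;code&gt;log_line&lt;/code&gt; helpers are hypothetical; in practice an OpenTelemetry SDK would do this for you.&lt;/p&gt;

```python
import json
import re

# W3C traceparent: version-traceid-parentid-flags, all lowercase hex.
TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def parse_traceparent(header):
    """Return trace_id and span_id from a traceparent header, or None."""
    match = TRACEPARENT.match(header.strip())
    if match is None:
        return None
    _version, trace_id, span_id, _flags = match.groups()
    return {"trace_id": trace_id, "span_id": span_id}


def log_line(message, header, **fields):
    """Build one structured log record that carries the trace context."""
    record = {"message": message, **fields}
    record.update(parse_traceparent(header) or {})
    return json.dumps(record, sort_keys=True)


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(log_line("slow query: seq scan on orders", header,
               service_version="2.4.1", commit="7a3f9c2"))
```

&lt;p&gt;Once every log record carries &lt;code&gt;trace_id&lt;/code&gt;, the trace-to-logs jump becomes a label or field match rather than a time-window guess.&lt;/p&gt;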

&lt;p&gt;You can play around with such a well-configured setup &lt;a href="https://github.com/open-telemetry/opentelemetry-demo" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI SRE
&lt;/h2&gt;

&lt;p&gt;The useful AI SRE workflow starts after the observability stack has preserved the evidence trail. The agent can help with root cause analysis, propose the fix, and validate the patch. But if the alert drops the trace, the log correlation, and the deploy context, the agent has the same problem the human does: it is guessing.&lt;/p&gt;

&lt;p&gt;For where we are taking this in the product, see &lt;a href="https://fixbugs.ai/fixbugs-vs-ai-sre" rel="noopener noreferrer"&gt;FixBugs and AI SRE tools&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  references worth reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://sre.google/sre-book/monitoring-distributed-systems/" rel="noopener noreferrer"&gt;Google SRE: Monitoring Distributed Systems&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://sre.google/workbook/alerting-on-slos/" rel="noopener noreferrer"&gt;Google SRE Workbook: Alerting on SLOs&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/" rel="noopener noreferrer"&gt;Prometheus alerting rules&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://grafana.com/blog/intro-to-exemplars-which-enable-grafana-tempos-distributed-tracing-at-massive-scale/" rel="noopener noreferrer"&gt;Grafana: intro to exemplars&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://grafana.com/docs/grafana/latest/datasources/tempo/configure-tempo-data-source/configure-trace-to-logs/" rel="noopener noreferrer"&gt;Grafana: configure trace to logs correlation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://grafana.com/docs/grafana/latest/visualizations/dashboards/build-dashboards/annotate-visualizations/" rel="noopener noreferrer"&gt;Grafana: annotate visualizations&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://opentelemetry.io/docs/specs/semconv/resource/service/" rel="noopener noreferrer"&gt;OpenTelemetry service semantic conventions&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fixbugs.ai/blog/good-alerts-bad-alerts" rel="noopener noreferrer"&gt;fixbugs.ai&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>monitoring</category>
      <category>devops</category>
      <category>infrastructure</category>
      <category>sre</category>
    </item>
    <item>
      <title>High-performance AI agents are distributed systems</title>
      <dc:creator>Kirti Rathore</dc:creator>
      <pubDate>Tue, 02 Jun 2026 20:10:18 +0000</pubDate>
      <link>https://dev.to/kirtivr/high-performance-ai-agents-are-distributed-systems-4c4g</link>
      <guid>https://dev.to/kirtivr/high-performance-ai-agents-are-distributed-systems-4c4g</guid>
      <description>&lt;p&gt;"Codex took 6 hours to implement this seemingly simple refactor."&lt;/p&gt;

&lt;p&gt;"I think Research mode on Perplexity is stuck."&lt;/p&gt;

&lt;p&gt;We all know LLM APIs are slow, and are content with staring at a spinner while the model slowly emits tokens.&lt;/p&gt;

&lt;p&gt;But what happens when you're building AI agents that need to be low latency?&lt;/p&gt;

&lt;p&gt;We hit this while building &lt;a href="https://fixbugs.ai" rel="noopener noreferrer"&gt;FixBugs&lt;/a&gt;, an AI debugging agent that reads bug reports, logs, code, screenshots, traces, and issue comments, then reproduces the bug and finally generates a validated fix. The product has a simple promise: every code change is verified to do only the necessary work to fix the issue.&lt;/p&gt;

&lt;p&gt;The implementation is not simple.&lt;/p&gt;

&lt;p&gt;Bug reports and their associated logs/metrics/traces can contain too much context for one model call. A repository can have hundreds of files. Logs can be larger than the model's useful context window. The final answer may need thousands of output tokens. And if the agent takes ten minutes to say anything useful, the user assumes it is broken.&lt;/p&gt;

&lt;p&gt;Summarization, also referred to as compaction, is the usual way to work with huge context. However, summarization is slow and often loses essential context.&lt;/p&gt;

&lt;p&gt;Modern coding agents like Claude Code and Cursor rely heavily on blindly grepping through log files and reading from specific offsets. The effective context window the coding agent is allowed to process at once is smaller than the total context window. GPT 5.5, for example, has a context window of 400K tokens, but its 'input context' is closer to 258K tokens.&lt;/p&gt;

&lt;p&gt;Once you step beyond conversational agent loops, other interesting patterns become usable.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;You realize the underlying performance engineering problems are similar to those encountered when optimizing large distributed systems.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Scatter-gather. Pipelining. Queues. Backpressure. Streaming. Serializability. These are the problems we spent the most time thinking about.&lt;/p&gt;

&lt;h2&gt;
  
  
  start with token math
&lt;/h2&gt;

&lt;p&gt;Most agent performance discussions start in the wrong place.&lt;/p&gt;

&lt;p&gt;They ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Which model is fastest in terms of tokens/sec?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That is a useful question later. The first question is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How many input tokens and output tokens does this task need?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;LLM latency has two different pieces that matter to the user.&lt;/p&gt;

&lt;p&gt;Time to first token is how long the user waits before the model starts responding. Token throughput is how fast output tokens arrive after that, which determines how long the full answer takes.&lt;/p&gt;

&lt;p&gt;They are not the same problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrsi4ceagztujxuh2xw7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdrsi4ceagztujxuh2xw7.png" alt="Diagram showing LLM prefill phase, decode phase, time to first token, and time per output token." width="800" height="275"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Prefill affects time to first token. Decode affects the stream of output tokens after that.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the prefill phase, the model processes the input context and prepares the key/value cache used to generate the first new token. In the decode phase, the model generates output tokens one at a time autoregressively.&lt;/p&gt;

&lt;p&gt;For a practical agent, a crude mental model is enough:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;latency model&lt;/strong&gt;&lt;br&gt;
T ≈ TTFT + output tokens × time/token&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Input tokens are not free. They hit prefill and therefore time to first token.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;prefill cost example&lt;/strong&gt;&lt;br&gt;
20,000 input tokens × 0.05ms/token = 1,000ms ≈ 1s TTFT&lt;br&gt;
The constant is model- and provider-specific; the shape of the cost is the useful part.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But long answers are expensive in a different way. Every output token has to be generated. If your agent asks the model to explain every file in a repository, your user is paying for that decision in wall-clock time.&lt;/p&gt;

&lt;p&gt;This matters because debugging agents are usually output-heavy. They do not just answer "yes" or "no." They produce hypotheses, evidence, file rankings, reproduction plans, code diffs, and validation notes.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Output tokens dominate faster than people expect.&lt;/p&gt;
&lt;/blockquote&gt;
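&lt;p&gt;The crude model above fits in one function. This sketch reuses the 0.05ms/token prefill constant from the example; the 50 tokens/s decode rate and the 2,000-token answer are assumptions picked for illustration, not measurements.&lt;/p&gt;

```python
def estimate_latency_s(input_tokens, output_tokens,
                       prefill_ms_per_token=0.05, output_tokens_per_s=50.0):
    """Split T ~ TTFT + output tokens x time/token into its two parts."""
    ttft_s = input_tokens * prefill_ms_per_token / 1000.0  # prefill cost
    decode_s = output_tokens / output_tokens_per_s         # decode cost
    return {"ttft_s": ttft_s, "decode_s": decode_s, "total_s": ttft_s + decode_s}


# 20,000 input tokens and an assumed 2,000-token answer.
print(estimate_latency_s(20_000, 2_000))
```

&lt;p&gt;With these constants, ten times more input than output still leaves decode at roughly 40 of the 41 seconds. That asymmetry is why trimming the requested output is usually the first lever to pull.&lt;/p&gt;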

&lt;h2&gt;
  
  
  the 10-minute file search
&lt;/h2&gt;

&lt;p&gt;The biggest bottleneck in early FixBugs was not repository parsing.&lt;/p&gt;

&lt;p&gt;It was asking the LLM which files were relevant to a bug.&lt;/p&gt;

&lt;p&gt;The naive version looked reasonable:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Gather the bug context.&lt;/li&gt;
&lt;li&gt;Gather the repository files.&lt;/li&gt;
&lt;li&gt;Put all relevant context into one prompt.&lt;/li&gt;
&lt;li&gt;Ask the model to rank files and explain why.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For a small repo, this works.&lt;/p&gt;

&lt;p&gt;For 50 files, it turns into a bad batch job disguised as a chat request.&lt;/p&gt;

&lt;p&gt;If the model emits 30,000 output tokens and the endpoint gives you 50 output tokens per second, you are waiting about 600 seconds. Ten minutes. That is before retries, rate limits, or any downstream fix generation.&lt;/p&gt;

&lt;p&gt;To get faster performance, we realized we had to use as much parallelism as possible.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;file relevance stage&lt;/strong&gt;&lt;br&gt;
one giant call: 30,000 tokens ÷ 50 tokens/s = 600s, about 10 minutes before retries or downstream work&lt;br&gt;
16 independent calls: 16 × 50 tokens/s ≈ 800 tokens/s, roughly 40 seconds for the demo workload&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Each file was decomposed into chunks. Each chunk got its own relevance call. Those calls ran concurrently.&lt;/p&gt;

&lt;p&gt;If one call gives you 50 output tokens per second, 16 independent calls on a 16-vCPU VM can expose roughly 16 times the useful throughput to the workflow.&lt;/p&gt;

&lt;p&gt;The demo version dropped the file-relevance stage from roughly 10 minutes to roughly 40 seconds.&lt;/p&gt;

&lt;p&gt;That number is not a universal benchmark. It depends on the provider, model, prompt, chunk sizes, rate limits, and output format.&lt;/p&gt;

&lt;p&gt;The important part is the pattern.&lt;/p&gt;

&lt;p&gt;This was a scatter-gather workload.&lt;/p&gt;

&lt;p&gt;Scatter the independent file checks. Gather the evidence. Merge the local judgments into one ranked view.&lt;/p&gt;
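&lt;p&gt;Here is a minimal sketch of that shape with &lt;code&gt;asyncio&lt;/code&gt;. The &lt;code&gt;score_chunk&lt;/code&gt; function is a stand-in for a real LLM relevance call (it just counts shared words), and the chunk dictionaries are a made-up format; only the scatter, gather, and merge structure is the point.&lt;/p&gt;

```python
import asyncio


async def score_chunk(chunk, bug_report):
    """Stand-in for one LLM relevance call; returns (path, score)."""
    await asyncio.sleep(0.01)  # simulated network and decode time
    words = set(bug_report.lower().split())
    hits = sum(1 for w in chunk["text"].lower().split() if w in words)
    return chunk["path"], hits


async def rank_files(chunks, bug_report, max_concurrency=16):
    """Score every chunk concurrently, then merge into one ranked list."""
    limit = asyncio.Semaphore(max_concurrency)

    async def bounded(chunk):
        async with limit:  # cap the number of in-flight calls
            return await score_chunk(chunk, bug_report)

    # Scatter: one independent call per chunk.
    results = await asyncio.gather(*(bounded(c) for c in chunks))

    # Gather and merge: keep the best score seen for each file.
    best = {}
    for path, score in results:
        best[path] = max(score, best.get(path, 0))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


chunks = [
    {"path": "cart.py", "text": "checkout total rounding"},
    {"path": "auth.py", "text": "login token"},
    {"path": "cart.py", "text": "checkout checkout error"},
]
print(asyncio.run(rank_files(chunks, "checkout error on total")))
```

&lt;p&gt;The semaphore matters as much as the &lt;code&gt;gather&lt;/code&gt;: it is the knob that keeps the fan-out inside whatever rate limit the provider enforces.&lt;/p&gt;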

&lt;p&gt;We already know how to do this.&lt;/p&gt;

&lt;p&gt;"Do the same analysis over many independent records, then combine the results."&lt;/p&gt;

&lt;p&gt;This is what Hadoop does and why it is so useful for data analysis.&lt;/p&gt;

&lt;p&gt;LLM agents have the same class of problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  chunking is not free
&lt;/h2&gt;

&lt;p&gt;Chunking is easy to abuse.&lt;/p&gt;

&lt;p&gt;I'll give one example each from the map side and the reduce side to illustrate.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mapping a chunk
&lt;/h3&gt;

&lt;p&gt;Suppose you're given:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A: A bug report.&lt;/li&gt;
&lt;li&gt;B: A repository file tree.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now you've got to figure out which code files are relevant to the bug report.&lt;/p&gt;

&lt;p&gt;How many files do you add to a chunk?&lt;/p&gt;

&lt;p&gt;It would not be a good idea to use byte-level granularity and stuff as much context as a model call can handle.&lt;/p&gt;

&lt;p&gt;Instead, you want file-level granularity: add complete files where you can.&lt;/p&gt;
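&lt;p&gt;One way to get there is greedy packing of whole files under a token budget. This is a sketch, not the FixBugs implementation; &lt;code&gt;pack_files&lt;/code&gt; and the token counts are hypothetical, and it assumes you already have a per-file token estimate.&lt;/p&gt;

```python
def pack_files(files, budget_tokens):
    """Greedily group (path, token_count) pairs into chunks of whole files.

    A file is never split. A file larger than the budget ends up alone in
    its own chunk, so the caller can decide how to handle it.
    """
    chunks, current, used = [], [], 0
    for path, tokens in files:
        if current and used + tokens > budget_tokens:
            chunks.append(current)  # close the chunk before it overflows
            current, used = [], 0
        current.append(path)
        used += tokens
    if current:
        chunks.append(current)
    return chunks


files = [("a.py", 400), ("b.py", 500), ("c.py", 300), ("d.py", 1200), ("e.py", 100)]
print(pack_files(files, budget_tokens=1000))
```

&lt;p&gt;Here &lt;code&gt;d.py&lt;/code&gt; exceeds the budget on its own, so it gets a chunk to itself instead of being cut at an arbitrary byte offset.&lt;/p&gt;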

&lt;h3&gt;
  
  
  Reducing chunks
&lt;/h3&gt;

&lt;p&gt;Suppose you're given:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A: A bug report.&lt;/li&gt;
&lt;li&gt;B: Log files and Traces.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now you've got to extract log snippets relevant to the bug from the input.&lt;/p&gt;

&lt;p&gt;Mapping a log file is simple. You can chunk it greedily.&lt;/p&gt;

&lt;p&gt;Reducing is a bit more complex.&lt;/p&gt;

&lt;p&gt;The parallel LLM calls gave you a series of log snippets. But until you put the snippets in sequence by time, group them by span, and separate them by service name, they are not useful at all.&lt;/p&gt;
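&lt;p&gt;A minimal sketch of that reduce step, assuming each snippet comes back as a dictionary with &lt;code&gt;ts&lt;/code&gt;, &lt;code&gt;service&lt;/code&gt;, &lt;code&gt;span_id&lt;/code&gt;, and &lt;code&gt;line&lt;/code&gt; keys. Those field names are hypothetical; use whatever your map calls actually return.&lt;/p&gt;

```python
from collections import defaultdict


def reduce_snippets(snippets):
    """Group log snippets by (service, span), each group in time order."""
    grouped = defaultdict(list)
    for snip in sorted(snippets, key=lambda s: s["ts"]):
        grouped[(snip["service"], snip["span_id"])].append(snip["line"])
    return dict(grouped)


# Snippets arrive in whatever order the parallel calls finished.
snippets = [
    {"ts": 30, "service": "checkout", "span_id": "b7", "line": "500 returned"},
    {"ts": 10, "service": "checkout", "span_id": "b7", "line": "order lookup start"},
    {"ts": 20, "service": "db", "span_id": "c1", "line": "seq scan on orders"},
]
for key, lines in reduce_snippets(snippets).items():
    print(key, lines)
```

&lt;p&gt;The sort is deterministic code, not a model call. Keeping ordering and grouping out of the LLM is cheap, and it gives the final merge call a timeline instead of a pile.&lt;/p&gt;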

&lt;p&gt;In my experience, the "Reduce" phase is often messier than people would like it to be.&lt;/p&gt;

&lt;p&gt;The rule I use now:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chunk for evidence.
Merge for judgment.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The local calls should find facts, signals, and candidate explanations. The final call should resolve conflicts, rank evidence, and decide what to do next.&lt;/p&gt;

&lt;h2&gt;
  
  
  streaming changes the wait
&lt;/h2&gt;

&lt;p&gt;There is another latency problem that chunking does not solve.&lt;/p&gt;

&lt;p&gt;Sometimes the user needs to see that the agent is alive.&lt;/p&gt;

&lt;p&gt;For interactive debugging, time to first token matters more than total completion time. The engineer does not always need the whole final report immediately. They need the first useful hypothesis, the first file name, the first sign that the investigation is moving.&lt;/p&gt;

&lt;p&gt;Streaming helps.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;streaming tradeoff&lt;/strong&gt;&lt;br&gt;
demo workload&lt;br&gt;
TTFT without streaming: 13s&lt;br&gt;
TTFT with streaming: 2.4s&lt;br&gt;
throughput without streaming: 486 tok/s&lt;br&gt;
throughput with streaming: 244 tok/s&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the &lt;a href="https://www.youtube.com/watch?v=w-P89nan9dM" rel="noopener noreferrer"&gt;demo&lt;/a&gt;, streaming reduced time to first token from 13 seconds to 2.4 seconds.&lt;/p&gt;

&lt;p&gt;That is a huge UX difference.&lt;/p&gt;

&lt;p&gt;But throughput got worse: 486 tokens/sec without streaming versus 244 tokens/sec with streaming.&lt;/p&gt;

&lt;p&gt;This is the kind of tradeoff that disappears if you only measure "request completed in N seconds." Streaming is not a throughput optimization. It is a user-experience optimization.&lt;/p&gt;
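&lt;p&gt;To see both numbers instead of one, measure the stream itself. This sketch consumes any token iterator and reports time to first token and throughput separately. &lt;code&gt;measure_stream&lt;/code&gt; is a hypothetical helper, and the injectable &lt;code&gt;clock&lt;/code&gt; is only there so the function is easy to test.&lt;/p&gt;

```python
import time


def measure_stream(tokens, clock=time.monotonic):
    """Consume a token iterator; report TTFT, token count, and throughput."""
    start = clock()
    first = None
    count = 0
    for _ in tokens:
        if first is None:
            first = clock()  # the moment the user sees something
        count += 1
    total = clock() - start
    return {
        "ttft_s": None if first is None else first - start,
        "tokens": count,
        # Throughput over the whole request, including the wait for
        # the first token.
        "tokens_per_s": count / total if total else 0.0,
        "total_s": total,
    }


print(measure_stream(iter(["The", " bug", " is", " in", " cart.py"])))
```

&lt;p&gt;Wrap the provider's streaming iterator with this in user-facing stages, and the TTFT versus throughput tradeoff shows up in your own dashboards.&lt;/p&gt;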

&lt;p&gt;For chat-like workflows, it is usually worth it.&lt;/p&gt;

&lt;p&gt;For batch stages inside an agent pipeline, it may not be.&lt;/p&gt;

&lt;p&gt;FixBugs uses both modes. User-facing stages stream progress. Internal worker stages optimize for total job completion, retry behavior, and queue throughput.&lt;/p&gt;

&lt;p&gt;That distinction keeps the system honest.&lt;/p&gt;

&lt;h2&gt;
  
  
  concurrency has a ceiling
&lt;/h2&gt;

&lt;p&gt;The first time you parallelize LLM calls and see a 5x or 10x improvement, it is tempting to conclude the next fix is "more workers."&lt;/p&gt;

&lt;p&gt;That works until it does not.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19j337zjc3qwsl2iv7ts.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F19j337zjc3qwsl2iv7ts.png" alt="Graph showing LLM throughput gains flattening as concurrency increases." width="800" height="516"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Throughput improves with concurrency, then starts flattening. More workers eventually stop buying much.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;At low concurrency, the system has idle capacity. Adding workers improves utilization. Throughput rises quickly.&lt;/p&gt;

&lt;p&gt;At higher concurrency, the slope changes. You are now fighting shared bottlenecks: provider rate limits, GPU scheduling, KV cache memory, network overhead, queueing, retries, and your own post-processing.&lt;/p&gt;

&lt;p&gt;The unpleasant part is that throughput may plateau while user latency gets worse.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6ax2kpggn7rm5xx5llx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm6ax2kpggn7rm5xx5llx.png" alt="Graph showing time to first token getting worse as concurrency rises." width="800" height="541"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Concurrency can keep aggregate throughput healthy while making individual users wait much longer.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That is why "tokens per second" is not enough.&lt;/p&gt;

&lt;p&gt;You need at least four metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;time to first token&lt;/li&gt;
&lt;li&gt;output tokens per second&lt;/li&gt;
&lt;li&gt;total wall-clock time&lt;/li&gt;
&lt;li&gt;failure/retry rate under load&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And you need to record them by stage.&lt;/p&gt;

&lt;p&gt;For an AI debugging agent, "the whole thing took 90 seconds" is not a useful measurement. Which stage took 90 seconds? File relevance? Log compression? Root cause analysis? Reproduction? Fix generation? Validation?&lt;/p&gt;

&lt;p&gt;If you do not know, you cannot optimize it.&lt;/p&gt;
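&lt;p&gt;Recording by stage does not need a framework. Here is a minimal sketch of a per-stage timer, with hypothetical names throughout (&lt;code&gt;StageMetrics&lt;/code&gt;, the stage labels). A real system would export these to its metrics backend rather than keep them in a list.&lt;/p&gt;

```python
import time
from contextlib import contextmanager


class StageMetrics:
    """Collect (stage, seconds, status) rows for one agent run."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.rows = []

    @contextmanager
    def stage(self, name):
        start = self.clock()
        status = "ok"
        try:
            yield
        except Exception:
            status = "failed"  # failures are recorded, then re-raised
            raise
        finally:
            self.rows.append((name, self.clock() - start, status))

    def slowest(self):
        """Return the row with the largest duration."""
        return max(self.rows, key=lambda row: row[1])


metrics = StageMetrics()
with metrics.stage("file_relevance"):
    time.sleep(0.01)  # stand-in for real work
with metrics.stage("fix_generation"):
    time.sleep(0.03)
print(metrics.slowest()[0])
```

&lt;p&gt;With rows like these, "the whole thing took 90 seconds" turns into a ranked list of where the 90 seconds went.&lt;/p&gt;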

&lt;h2&gt;
  
  
  queues beat request chains
&lt;/h2&gt;

&lt;p&gt;Once the workflow has more than one stage, a single request chain becomes fragile.&lt;/p&gt;

&lt;p&gt;Analyze the bug. Then reproduce it. Then identify root cause. Then generate a fix. Then validate the fix.&lt;/p&gt;

&lt;p&gt;If this runs as one synchronous chain, every stage inherits every other stage's latency and failure mode. A slow reproduction attempt blocks root cause work. A provider retry blocks the entire request. One expensive bug can starve smaller bugs behind it.&lt;/p&gt;

&lt;p&gt;The better shape is a pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7523h4tp2xj4trgq6aa4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7523h4tp2xj4trgq6aa4.png" alt="Pipeline diagram showing analyze, reproduce bug, root cause, and fix stages as independent microservices connected by message queues." width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Independent stages let incoming bugs move through the system without one long blocking request chain.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In FixBugs, the natural stages are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;analysis: parse and compress the bug report and artifacts&lt;/li&gt;
&lt;li&gt;reproduction: reproduce the bug and write a failing test&lt;/li&gt;
&lt;li&gt;root cause: identify the most likely cause&lt;/li&gt;
&lt;li&gt;fix: generate a patch&lt;/li&gt;
&lt;li&gt;validation: prove the patch fixes the reproduced failure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those stages do not all have the same workload.&lt;/p&gt;

&lt;p&gt;Some are network-heavy. Some are model-heavy. Some are repo-heavy. Some need a sandbox. Some can run with cheaper models. Some need stronger models. Some should retry aggressively. Some should fail fast and ask for human input.&lt;/p&gt;

&lt;p&gt;That is why the pipeline should not pretend they are one operation.&lt;/p&gt;

&lt;p&gt;Use message queues when stages should operate independently. Use idempotent workers. Put explicit retry limits around LLM calls. Track partial state. Make each stage observable. Do not make the user wait on work that can finish after the first useful answer.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This is not fancy.&lt;br&gt;
It is normal backend engineering.&lt;/p&gt;
&lt;/blockquote&gt;
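
&lt;p&gt;The advice above can be sketched with in-process queues. This is an illustration only: it uses &lt;code&gt;asyncio.Queue&lt;/code&gt; in place of a real message broker, an in-memory set as the idempotency record, and an assumed limit of three attempts per stage. The function names are hypothetical.&lt;/p&gt;

```python
import asyncio

MAX_ATTEMPTS = 3  # assumed retry limit per stage


async def run_stage(name, handler, inbox, outbox, seen, failed):
    """Worker loop for one stage: read from inbox, write to outbox."""
    while True:
        job = await inbox.get()
        try:
            key = (name, job["bug_id"])
            if key in seen:
                continue  # idempotent: skip a job this stage already handled
            seen.add(key)
            for attempt in range(1, MAX_ATTEMPTS + 1):
                try:
                    result = await handler(job)
                except Exception as exc:
                    if attempt == MAX_ATTEMPTS:
                        failed.append({**job, "failed_stage": name, "error": str(exc)})
                else:
                    # Keep partial state on the job so later stages can read it.
                    await outbox.put({**job, name: result})
                    break
        finally:
            inbox.task_done()


async def run_pipeline(jobs, stages):
    """stages is an ordered list of (name, async handler) pairs."""
    queues = [asyncio.Queue() for _ in range(len(stages) + 1)]
    seen, failed = set(), []
    workers = [
        asyncio.create_task(run_stage(name, handler, queues[i], queues[i + 1], seen, failed))
        for i, (name, handler) in enumerate(stages)
    ]
    for job in jobs:
        await queues[0].put(job)
    for q in queues[:-1]:
        await q.join()  # wait for each stage to drain, in order
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    done = []
    while not queues[-1].empty():
        done.append(queues[-1].get_nowait())
    return done, failed
```

&lt;p&gt;In a production system the queues and the idempotency record would live in durable storage, so a crashed worker can resume without repeating finished work.&lt;/p&gt;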

&lt;h2&gt;
  
  
  memory is a compression layer
&lt;/h2&gt;

&lt;p&gt;Long-context models are useful.&lt;/p&gt;

&lt;p&gt;They are also an attractive nuisance, because they hide two different problems.&lt;/p&gt;

&lt;p&gt;The first is a performance problem. Larger prompts mean more prefill work, more tokens to move through the system, higher latency, and lower throughput.&lt;/p&gt;

&lt;p&gt;The second is a reasoning problem. Irrelevant context is not neutral. Old hypotheses, stale summaries, repeated log snippets, and unrelated file notes compete with the evidence that matters for the current step.&lt;/p&gt;

&lt;p&gt;A memory layer helps only if it handles both problems: send fewer tokens and preserve the facts the next stage needs.&lt;/p&gt;

&lt;p&gt;For one demo, I tested a Mem0-style memory layer to isolate the performance side: extract facts from prior context, store them, and retrieve only the facts relevant to the current step.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;context management demo&lt;/strong&gt;&lt;br&gt;
60.51 tok/s with full context → 216.06 tok/s with retrieved memory facts, for the demo case&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In that demo, sending only the retrieved facts raised token throughput from 60.51 tokens/sec to 216.06 tokens/sec, roughly 3.6× the full-context baseline.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Again, do not overfit to the number. The useful principle is simpler: every token you do not send is latency you do not pay for.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But the benchmark only measures the performance side. It does not show whether the retrieved facts were the ones the next stage needed.&lt;/p&gt;

&lt;p&gt;Memory is not just for personalization. In agent systems, memory is a compression layer and an evidence ledger. It decides which facts survive across steps.&lt;/p&gt;
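
&lt;p&gt;A toy sketch of the "store facts, retrieve only the relevant ones" idea. It is not the Mem0 API: retrieval here is plain keyword overlap, and &lt;code&gt;FactStore&lt;/code&gt; is a hypothetical name. A real memory layer would more likely use embeddings and an LLM-based fact extractor.&lt;/p&gt;

```python
import re
from dataclasses import dataclass, field


def _terms(text):
    """Lowercased word set used for a crude relevance score."""
    return set(re.findall(r"[a-z0-9_]+", text.lower()))


@dataclass
class FactStore:
    facts: list = field(default_factory=list)

    def add(self, fact, source):
        # Keep provenance so a later stage can trace a fact to its evidence.
        self.facts.append({"fact": fact, "source": source})

    def retrieve(self, query, k=2):
        """Return up to k facts sharing the most terms with the query."""
        q = _terms(query)
        scored = [
            (len(q.intersection(_terms(f["fact"]))), i, f)
            for i, f in enumerate(self.facts)
        ]
        scored = [s for s in scored if s[0] > 0]  # drop facts with no overlap
        scored.sort(key=lambda s: (-s[0], s[1]))  # best score first, then oldest
        return [f for _, _, f in scored[:k]]
```

&lt;p&gt;Only the retrieved facts go into the next prompt; everything else stays in the store with its source attached.&lt;/p&gt;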

&lt;h2&gt;
  
  
  model choice is infrastructure
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvt9edj4309w1fpzm7uaa.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvt9edj4309w1fpzm7uaa.jpeg" alt="Token throughput for various models on OpenRouter." width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Token throughput for various models on OpenRouter.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Choosing a model for an agent is not only a quality decision.&lt;/p&gt;

&lt;p&gt;It is an infrastructure decision.&lt;/p&gt;

&lt;p&gt;Two models with similar benchmark accuracy can behave very differently under your workload. One may stream quickly but produce verbose answers. Another may have great throughput but poor tool-use reliability.&lt;/p&gt;

&lt;p&gt;FixBugs chooses models per stage.&lt;/p&gt;

&lt;p&gt;Cheap model for broad relevance scans. Stronger model for root cause synthesis. Different prompt shape for reproduction. Different retry policy for fix generation. Different timeout for validation.&lt;/p&gt;

&lt;p&gt;The mistake is using the same model, the same timeout, and the same output format everywhere.&lt;/p&gt;

&lt;p&gt;The better question is: what does this stage need to be correct about?&lt;/p&gt;

&lt;p&gt;A file relevance stage does not need perfect prose. It needs high recall and structured evidence.&lt;/p&gt;

&lt;p&gt;A root cause stage needs to reconcile conflicting signals.&lt;/p&gt;

&lt;p&gt;A fix stage needs to produce a small patch.&lt;/p&gt;

&lt;p&gt;A validation stage needs to be conservative, because false confidence is worse than no answer.&lt;/p&gt;

&lt;p&gt;Once you write those constraints down, model selection becomes less mystical.&lt;/p&gt;
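
&lt;p&gt;One way to write those constraints down is a per-stage policy table. The sketch below is hypothetical: the model names, timeouts, retry counts, and output formats are placeholder assumptions, not FixBugs' actual configuration.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StagePolicy:
    model: str          # hypothetical model tier name
    timeout_s: float
    max_retries: int
    output_format: str


# All values below are illustrative assumptions, not real settings.
STAGE_POLICIES = {
    "file_relevance": StagePolicy("cheap-model", 15.0, 3, "json_evidence_list"),
    "root_cause": StagePolicy("strong-model", 120.0, 1, "ranked_hypotheses"),
    "fix": StagePolicy("strong-model", 90.0, 2, "unified_diff"),
    "validation": StagePolicy("strong-model", 300.0, 0, "pass_fail_with_evidence"),
}


def policy_for(stage: str) -> StagePolicy:
    """Fail loudly instead of silently falling back to one default model."""
    try:
        return STAGE_POLICIES[stage]
    except KeyError:
        raise ValueError(f"no policy defined for stage {stage!r}") from None
```

&lt;p&gt;Keeping the table in one place makes it obvious when every stage has quietly ended up with the same model and timeout.&lt;/p&gt;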

&lt;h2&gt;
  
  
  the checklist
&lt;/h2&gt;

&lt;p&gt;The final checklist from the talk still holds.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Know your workload.&lt;/strong&gt; Before building the feature, estimate input tokens, output tokens, expected concurrency, and whether the user needs an instant response or can tolerate asynchronous processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Reduce tokens.&lt;/strong&gt; Do not send full context because it is convenient. Compress, retrieve, summarize, and preserve provenance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Embrace parallelism.&lt;/strong&gt; If the work is independent, split it. File scans, log-window analysis, artifact classification, and candidate hypothesis scoring often parallelize well.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Microservices and queues add complexity, but they also let different stages scale, retry, and fail independently. Don't overoptimize.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Expect failures.&lt;/strong&gt; LLM APIs fail. Providers rate-limit. Responses violate schema. Tool calls hang. Sandboxes break. Repos have bad tests. Treat every model call like a network call to a flaky dependency or data source, because that is what it is.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
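
&lt;p&gt;Treating a model call as a flaky dependency can be sketched as a wrapper that validates the reply and retries with exponential backoff. Everything here is an assumption for illustration: &lt;code&gt;call&lt;/code&gt; stands in for whatever client function returns the raw model text, and the reply is assumed to be a JSON object.&lt;/p&gt;

```python
import json
import time


class ModelCallError(Exception):
    """Raised when the model call keeps failing after all attempts."""


def call_model_safely(call, required_keys, max_attempts=3, base_delay_s=0.5, sleep=time.sleep):
    """Call a model, validate its JSON reply, and retry with backoff."""
    last_error = None
    for attempt in range(max_attempts):
        try:
            reply = json.loads(call())  # malformed JSON raises a ValueError subclass
            if not isinstance(reply, dict):
                raise ValueError("reply is not a JSON object")
            missing = [k for k in required_keys if k not in reply]
            if missing:
                raise ValueError(f"reply is missing keys: {missing}")
            return reply
        except (TimeoutError, ConnectionError, ValueError) as exc:
            last_error = exc
            if attempt + 1 != max_attempts:
                sleep(base_delay_s * 2 ** attempt)  # exponential backoff
    raise ModelCallError(f"gave up after {max_attempts} attempts") from last_error
```

&lt;p&gt;The explicit attempt limit matters as much as the retry itself: a stage that retries forever turns one bad response into a stuck pipeline.&lt;/p&gt;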

&lt;h2&gt;
  
  
  original talk
&lt;/h2&gt;

&lt;p&gt;This post is based on my SingleStore x PyDelhi talk on building high-performance AI agents in Python.&lt;/p&gt;

&lt;p&gt;Code and artifacts from the talk are available in the &lt;a href="https://github.com/kirtivr/pydelhi-talk" rel="noopener noreferrer"&gt;pydelhi-talk repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The original recap and slide deck remain archived in the &lt;a href="https://fixbugs.ai/blog/pydelhi_high_performance_ai_agents" rel="noopener noreferrer"&gt;PyDelhi talk post&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  references
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/" rel="noopener noreferrer"&gt;NVIDIA: Mastering LLM Techniques: Inference Optimization&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://developer.nvidia.com/blog/llm-inference-benchmarking-performance-tuning-with-tensorrt-llm" rel="noopener noreferrer"&gt;NVIDIA: LLM Inference Benchmarking: Performance Tuning with TensorRT-LLM&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://arxiv.org/abs/2504.19413" rel="noopener noreferrer"&gt;Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;Originally published at &lt;a href="https://fixbugs.ai/blog/high-performance-ai-agents-distributed-systems" rel="noopener noreferrer"&gt;fixbugs.ai&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>distributedsystems</category>
      <category>performance</category>
    </item>
    <item>
      <title>Engineering approach: Startup Mode vs. Big Tech Mode</title>
      <dc:creator>Kirti Rathore</dc:creator>
      <pubDate>Wed, 29 Apr 2026 19:00:31 +0000</pubDate>
      <link>https://dev.to/kirtivr/engineering-approach-startup-mode-vs-big-tech-mode-8ci</link>
      <guid>https://dev.to/kirtivr/engineering-approach-startup-mode-vs-big-tech-mode-8ci</guid>
      <description>&lt;h1&gt;
  
  
  Engineering approach: Startup Mode vs. Big Tech Mode
&lt;/h1&gt;

&lt;p&gt;Last week, I delivered a talk at PyDelhi discussing strategies that leverage how large language models work to improve the performance of your LLM applications. Here is the &lt;a href="https://fixbugs.ai/content/misc/Writing-High-Performance-AI-Agents-in-Python-Insights-from-building-Modulo-2.pdf" rel="noopener noreferrer"&gt;talk&lt;/a&gt; as a PDF.&lt;/p&gt;

&lt;p&gt;Some techniques I discussed were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strategies for faster token throughput&lt;/li&gt;
&lt;li&gt;Strategies for a quicker time to first token&lt;/li&gt;
&lt;li&gt;Effective context window management&lt;/li&gt;
&lt;li&gt;Model routing strategies&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But here's the uncomfortable truth for founders: if you're just starting your LLM startup, you should completely ignore this advice.&lt;/p&gt;

&lt;p&gt;Let me explain why — and when application performance actually matters.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works in the Ideal World: Big Tech's Playbook
&lt;/h2&gt;

&lt;p&gt;Imagine the typical trajectory at a company like Google or Stripe. You see a problem in the market. It's well-defined. Your user base is established. You build a team to solve it.&lt;/p&gt;

&lt;p&gt;Your first step isn't writing code—it's understanding your performance requirements.&lt;/p&gt;

&lt;p&gt;You study incumbent competitors. You conduct user research. You measure what your users actually tolerate. For e-commerce, that's Amazon's 5-second response time threshold. For payments, that's Stripe's sub-100ms latency requirement. For real-time LLM interfaces, that might be streaming tokens within 200ms.&lt;/p&gt;

&lt;p&gt;These user expectations become &lt;strong&gt;Service Level Objectives (SLOs)&lt;/strong&gt;—formal performance, reliability, and usability targets your application must meet to remain competitive.&lt;/p&gt;

&lt;p&gt;Once you have SLOs, someone (usually a principal engineer or architect) translates them into a system architecture. This involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Weighing architectural tradeoffs (monolith vs. microservices, synchronous vs. asynchronous)&lt;/li&gt;
&lt;li&gt;Selecting technology stacks for different components&lt;/li&gt;
&lt;li&gt;Deciding on execution environments (web app vs. IDE plugin vs. CLI tool)&lt;/li&gt;
&lt;li&gt;Planning for scale from day one&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This approach works beautifully for mature companies with stable product-market fit. You have reliable data about what your users need, so you can build the right system the first time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Cost of Performance Engineering
&lt;/h2&gt;

&lt;p&gt;Performance optimization requires trade-offs—all of them expensive.&lt;/p&gt;

&lt;p&gt;At Google and VMware, my teams answered questions like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How much does adopting AVX-512 improve RAID-6 computational throughput?&lt;/li&gt;
&lt;li&gt;How much latency can we save by building local caches with remote diffs?&lt;/li&gt;
&lt;li&gt;Can we prefetch data and pipeline operations based on dependency graphs?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These questions have answers, and the answers are valuable. But solving them has a cost: &lt;strong&gt;optimized code is complex, harder to understand, and harder to debug.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Consider a simple workflow with a few network calls and database queries. Now transform it for performance: add Redis for slow queries, use async with continuations, add TCP connection pooling with keepalives (or consider UDP for specific data patterns), distribute read load across multiple backend instances, reduce logging overhead, and avoid heap allocations. You get the point.&lt;/p&gt;

&lt;p&gt;Each optimization adds complexity. Each line becomes harder for the next engineer to reason about.&lt;/p&gt;

&lt;p&gt;Performance engineering also locks you into early technical decisions. Refactoring code can mean rolling back optimizations.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Actually Works: The Startup Reality
&lt;/h2&gt;

&lt;p&gt;Here's where the Big Tech playbook breaks down.&lt;/p&gt;

&lt;p&gt;At a startup, almost nothing is stable. Your SLOs don't exist yet because you don't know who your customers are. Your product architecture will change—not once, but repeatedly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The sources of uncertainty are constant:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Product pivot&lt;/strong&gt;: Your initial idea evolves. Instagram started as Burbn, a cluttered check-in app with photos as a side feature. When the founders realized users were ignoring the check-in functionality and only engaging with photo sharing, they stripped everything away and rebuilt the architecture around that single use case.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Customer pivot&lt;/strong&gt;: You discover your ideal customer profile is different from what you assumed. That financial services firm won't buy your product, but the open-source community will—and they have completely different scalability requirements.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The landscape is evolving&lt;/strong&gt;: New models, new APIs, and better caching strategies emerge monthly. Locking into early architectural decisions is especially costly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Your use cases will change&lt;/strong&gt;: You might start with synchronous inference, then need streaming. You might start with single-turn interactions, then add multi-turn conversations. Each shift requires rearchitecting.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As Gergely Orosz noted after years at Uber: the biggest constraint at startups isn't computing resources—it's the coordination overhead. At big tech companies, you wait days for approvals on simple plumbing changes. At startups, you need to move fast and change direction constantly.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Counterargument: When Performance Matters Early
&lt;/h2&gt;

&lt;p&gt;I need to be clear: there are exceptions.&lt;/p&gt;

&lt;p&gt;If your business model directly depends on latency (say, you're selling real-time trading alerts and charging per millisecond saved), then performance optimization matters from day one.&lt;/p&gt;

&lt;p&gt;If your unit economics fundamentally depend on throughput (you make money per inference, and your margins vanish if you're inefficient), then measure and optimize.&lt;/p&gt;

&lt;p&gt;But ask yourself honestly: is performance actually your constraint, or is it a distraction?&lt;/p&gt;

&lt;p&gt;Most startups discover their real constraints are customer acquisition, product-market fit, and unit economics—not milliseconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to Do Instead
&lt;/h2&gt;

&lt;p&gt;Here's your startup engineering philosophy:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use third-party solutions liberally.&lt;/strong&gt; Use managed databases instead of self-hosting Postgres. Use cloud APIs instead of building infrastructure. Use open-source libraries even if they're slower or have some overhead. The velocity gain from not building custom infrastructure outweighs the performance cost—until you reach scale.&lt;/p&gt;

&lt;p&gt;As Paul Graham noted in his essay on startup strategies: founders often resist early customer work because they'd "rather sit at home writing code." The same applies here. You'd rather optimize your codebase than talk to customers. Both are mistakes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Optimize for changeability, not performance.&lt;/strong&gt; Write simple code that's easy to refactor. Clear, straightforward solutions beat clever optimizations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This means:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose simple data structures over complex ones&lt;/li&gt;
&lt;li&gt;Write tests that give you confidence to refactor&lt;/li&gt;
&lt;li&gt;Measure, but don't optimize based on the measurements yet&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Think of it this way: if you were solving this problem in a language like Python or JavaScript (where raw performance is rarely the point), what would you do? Do that. Build it carefully, but don't overthink it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Build the metrics foundation, but not the optimizations yet.&lt;/strong&gt; Set up basic monitoring from day one. Understand where time is spent. Just don't act on it yet—collect data for when it matters.&lt;/p&gt;
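
&lt;p&gt;A minimal sketch of such a metrics foundation, assuming a single-process application where an in-memory dictionary is good enough for now. The &lt;code&gt;timed&lt;/code&gt; decorator and &lt;code&gt;TIMINGS&lt;/code&gt; store are hypothetical names; a hosted monitoring service would do the same job.&lt;/p&gt;

```python
import time
from collections import defaultdict
from functools import wraps

TIMINGS = defaultdict(list)  # operation name -> list of durations in seconds


def timed(name):
    """Record how long each call takes; record only, no optimization."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs on success and on exceptions, so failures are timed too.
                TIMINGS[name].append(time.perf_counter() - start)
        return wrapper
    return decorator


def summary():
    """Return (call count, total seconds) per operation."""
    return {name: (len(d), sum(d)) for name, d in TIMINGS.items()}
```

&lt;p&gt;Decorating a handful of request handlers and database helpers is cheap, and the collected numbers are ready when the product stabilizes.&lt;/p&gt;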

&lt;h2&gt;
  
  
  The Inflection Point: When Everything Changes
&lt;/h2&gt;

&lt;p&gt;Here's the transition: when your product stabilizes and you have real users, everything changes.&lt;/p&gt;

&lt;p&gt;Once you've validated that customers actually want what you built, and you understand your unit economics, &lt;em&gt;then&lt;/em&gt; you switch modes. At this point:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define your actual SLOs based on user behavior and business requirements&lt;/li&gt;
&lt;li&gt;Profile your application to find real bottlenecks&lt;/li&gt;
&lt;li&gt;Invest in performance engineering&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Notice what happens at this stage: you have a product that works, customers who are paying, and clear visibility into what's slow. You're no longer gambling on architecture decisions.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Lesson
&lt;/h2&gt;

&lt;p&gt;The difference between Big Tech and startups isn't that Big Tech engineers are smarter. It's that Big Tech has &lt;strong&gt;certainty about its problem space&lt;/strong&gt;, while startups operate under &lt;strong&gt;radical uncertainty about everything.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The engineering approach must match reality.&lt;/p&gt;

&lt;p&gt;The best startup engineers I've known—including those who came from Big Tech—learned to shift modes. They brought discipline and architectural thinking from their Big Tech experience, but they abandoned the assumption that everything needs to be perfect from day one.&lt;/p&gt;

&lt;p&gt;Your job as a startup founder isn't to build the most performant system. It's to build something that works, that users want, and that you can change when you learn something new.&lt;/p&gt;

&lt;p&gt;Performance optimization will still be there when you need it. For now, focus on moving fast and learning what actually matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Big Tech Engineering&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Startup Engineering&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Problem space is known; optimize for scale&lt;/td&gt;
&lt;td&gt;Problem space is uncertain; optimize for learning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SLOs defined upfront based on market research&lt;/td&gt;
&lt;td&gt;SLOs emerge from customer feedback&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Complex architecture justified by requirements&lt;/td&gt;
&lt;td&gt;Simple architecture enables rapid pivots&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Performance optimization adds value&lt;/td&gt;
&lt;td&gt;Performance optimization is often wasted work&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code should be optimized and reliable&lt;/td&gt;
&lt;td&gt;Code should be clear and changeable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Your job early on is to prove the hypothesis, not to implement it perfectly.&lt;/p&gt;

&lt;p&gt;Link to the full blog: &lt;a href="https://fixbugs.ai/blog/startup-vs-bigtech-blog" rel="noopener noreferrer"&gt;https://fixbugs.ai/blog/startup-vs-bigtech-blog&lt;/a&gt;&lt;br&gt;
Link to the full talk: &lt;a href="https://fixbugs.ai/content/misc/Writing-High-Performance-AI-Agents-in-Python-Insights-from-building-Modulo-2.pdf" rel="noopener noreferrer"&gt;https://fixbugs.ai/content/misc/Writing-High-Performance-AI-Agents-in-Python-Insights-from-building-Modulo-2.pdf&lt;/a&gt;&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>llm</category>
      <category>performance</category>
      <category>startup</category>
    </item>
  </channel>
</rss>
