<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Albert Alov</title>
    <description>The latest articles on DEV Community by Albert Alov (@vola-trebla).</description>
    <link>https://dev.to/vola-trebla</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3836100%2Fef7f69de-6efb-4fa6-9594-b4766a4ecead.jpg</url>
      <title>DEV Community: Albert Alov</title>
      <link>https://dev.to/vola-trebla</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vola-trebla"/>
    <language>en</language>
    <item>
      <title>We traced an MCP server calling an LLM — both sides, one trace</title>
      <dc:creator>Albert Alov</dc:creator>
      <pubDate>Sat, 04 Apr 2026 03:18:45 +0000</pubDate>
      <link>https://dev.to/vola-trebla/we-traced-an-mcp-server-calling-an-llm-both-sides-one-trace-39ae</link>
      <guid>https://dev.to/vola-trebla/we-traced-an-mcp-server-calling-an-llm-both-sides-one-trace-39ae</guid>
      <description>&lt;p&gt;&lt;a href="https://dev.to/vola-trebla/mcp-servers-are-the-fastest-growing-part-of-the-ai-stack-they-have-zero-observability-5299"&gt;Last article&lt;/a&gt; we opened the MCP black box. One line of middleware, and every tool call gets a span, metrics, and privacy controls. Problem solved.&lt;/p&gt;

&lt;p&gt;Except it wasn't. We had traces. We had metrics. But we couldn't &lt;em&gt;see&lt;/em&gt; them — no dashboard, no demo, and the most interesting MCP feature was completely invisible.&lt;/p&gt;

&lt;p&gt;MCP servers don't just receive tool calls. They can also &lt;em&gt;call LLMs themselves&lt;/em&gt; — through a feature called sampling. Your server asks the client's LLM to generate a response. The request goes out. The response comes back. And the trace? Silent.&lt;/p&gt;

&lt;p&gt;This article is about the four follow-ups that turned "we have observability" into "here's what it actually looks like — try it yourself in 5 minutes."&lt;/p&gt;




&lt;h2&gt;
  
  
  The missing piece: sampling
&lt;/h2&gt;

&lt;p&gt;Most MCP tutorials show tools as pure functions. Input goes in, output comes out. But the MCP spec has a feature called &lt;code&gt;sampling/createMessage&lt;/code&gt; — the server can ask the client to run an LLM call on its behalf.&lt;/p&gt;

&lt;p&gt;Why? Because MCP servers don't have API keys. They don't talk to OpenAI directly. But sometimes a tool needs an LLM — to summarize a document, to classify an input, to decide the next step. Sampling lets the server delegate back to the client:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Agent / Client]                        [MCP Server]
invoke_agent orchestrator
├── chat gpt-4o                  →
├── tools/call summarize         →      receives tool call
│                                       needs LLM to summarize
│                                  ←    sampling/createMessage
│   chat gpt-4o (sampling)       →
│                                  ←    gets LLM response
│                                       returns tool result
│                                  ←
└── chat gpt-4o                  →
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool call triggers an LLM call which is &lt;em&gt;invisible&lt;/em&gt; in the trace. The middleware from article #7 traces &lt;code&gt;tools/call summarize&lt;/code&gt; — but the sampling call inside it? Ghost. No span, no duration, no model name. A 2-second tool call where 1.8 seconds was the LLM and 200ms was the actual tool logic — and you can't tell.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;code&gt;traceSampling()&lt;/code&gt; — manual wrap for an unwrappable call
&lt;/h2&gt;

&lt;p&gt;Sampling can't be auto-intercepted. The &lt;code&gt;.tool()&lt;/code&gt; wrapper catches handler registration — but &lt;code&gt;ctx.mcpReq.requestSampling()&lt;/code&gt; is a method call inside the handler body. There's no registration to intercept.&lt;/p&gt;

&lt;p&gt;So we made it explicit:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;toadEyeMiddleware&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;traceSampling&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;toad-eye/mcp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;McpServer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;my-server&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.0.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nf"&gt;toadEyeMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;summarize&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// traceSampling wraps the sampling call with an OTel span&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;traceSampling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;ctx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mcpReq&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;requestSampling&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
      &lt;span class="na"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;maxTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;text&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;traceSampling()&lt;/code&gt; wrapper creates a &lt;code&gt;sampling/createMessage {model}&lt;/code&gt; span — &lt;code&gt;SpanKind.CLIENT&lt;/code&gt;, because the server is &lt;em&gt;requesting&lt;/em&gt; an LLM call from the client. The span captures model, duration, and status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;startSamplingSpan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="err"&gt;→&lt;/span&gt; &lt;span class="nx"&gt;span&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;chat gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="nx"&gt;gen_ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;chat&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="nx"&gt;gen_ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="nx"&gt;gen_ai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sampling&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;duration_ms&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1834&lt;/span&gt;
  &lt;span class="nx"&gt;mcp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;my-server&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="nx"&gt;SpanKind&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;CLIENT&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now the trace tells the full story:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;tools/call summarize         2.1s
└── chat gpt-4o (sampling)   1.8s  ← this was invisible before
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The tool took 2.1 seconds. 1.8 of those were the LLM call. 300ms was the actual summarization logic. Without this span, you'd optimize the wrong thing.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Grafana dashboard — from metrics to answers
&lt;/h2&gt;

&lt;p&gt;Having metrics in Prometheus is step one. Knowing what to ask is step two. We built an MCP Server dashboard that answers the questions you actually have:&lt;/p&gt;

&lt;h3&gt;
  
  
  Top row — the four numbers
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────┬──────────────┬──────────────┬──────────────┐
│  Tool Call   │  Avg Tool    │  Error Rate  │  Resource    │
│  Rate        │  Duration    │              │  Reads       │
│  12.4 req/s  │  45.2 ms     │  2.3%        │  3.1 req/s   │
└──────────────┴──────────────┴──────────────┴──────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Four stats. Glanceable. Green/yellow/red thresholds. If the error rate is red — you know immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Middle row — the trends
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Tool Call Rate by Tool&lt;/strong&gt; — timeseries, broken down by &lt;code&gt;gen_ai_tool_name&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by (gen_ai_tool_name) (rate(gen_ai_mcp_tool_calls_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Is your agent hammering one tool? Is traffic shifting from search to calculate over time? The line chart shows the pattern.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tool Duration p50 / p95&lt;/strong&gt; — two lines per tool:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;histogram_quantile(0.50, sum by (gen_ai_tool_name, le) (
  rate(gen_ai_mcp_tool_duration_bucket[5m])
))
histogram_quantile(0.95, sum by (gen_ai_tool_name, le) (
  rate(gen_ai_mcp_tool_duration_bucket[5m])
))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When your search tool's p95 jumps from 200ms to 2 seconds, you see it before users complain.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bottom row — errors and resources
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Errors by Tool&lt;/strong&gt; — stacked bars by tool + error type:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by (gen_ai_tool_name, error_type) (rate(gen_ai_mcp_tool_errors_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Not just "errors are up" — &lt;em&gt;which tool&lt;/em&gt; and &lt;em&gt;what kind&lt;/em&gt;. RateLimitError on search? ValidationError on calculate? The stacked bars tell you instantly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Resource Reads by URI&lt;/strong&gt; — which resources are popular:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sum by (gen_ai_data_source_id) (rate(gen_ai_mcp_resource_reads_total[5m]))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The table — one-screen overview
&lt;/h3&gt;

&lt;p&gt;The bottom is a table that merges all metrics per tool:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Call Rate (req/s)&lt;/th&gt;
&lt;th&gt;Avg Duration (ms)&lt;/th&gt;
&lt;th&gt;p95 Duration (ms)&lt;/th&gt;
&lt;th&gt;Error Rate&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;calculate&lt;/td&gt;
&lt;td&gt;8.2&lt;/td&gt;
&lt;td&gt;12.3&lt;/td&gt;
&lt;td&gt;24.1&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;get-weather&lt;/td&gt;
&lt;td&gt;3.1&lt;/td&gt;
&lt;td&gt;145.2&lt;/td&gt;
&lt;td&gt;312.8&lt;/td&gt;
&lt;td&gt;3.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;search&lt;/td&gt;
&lt;td&gt;1.1&lt;/td&gt;
&lt;td&gt;890.4&lt;/td&gt;
&lt;td&gt;2,134&lt;/td&gt;
&lt;td&gt;8.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Error rate cells are color-coded: green below 5%, yellow 5-10%, red above 10%. You see the problem tool in one glance.&lt;/p&gt;

&lt;p&gt;The dashboard is auto-provisioned — &lt;code&gt;npx toad-eye init&lt;/code&gt; scaffolds it into your &lt;code&gt;infra/toad-eye/grafana/dashboards/&lt;/code&gt; directory. No manual Grafana setup.&lt;/p&gt;

&lt;h2&gt;
  
  
  The demo server — try it yourself
&lt;/h2&gt;

&lt;p&gt;Theory is nice. Running code is better.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# 1. Start the observability stack&lt;/span&gt;
npx toad-eye up

&lt;span class="c"&gt;# 2. Run the demo MCP server via MCP Inspector&lt;/span&gt;
npx @modelcontextprotocol/inspector npx tsx demo/src/mcp-server/index.ts
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;MCP Inspector opens in your browser. You see three tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;calculate&lt;/strong&gt; — safe math evaluation (&lt;code&gt;2 + 2 * 3&lt;/code&gt; → &lt;code&gt;8&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;get-weather&lt;/strong&gt; — mock weather API with simulated latency (50-250ms)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;timestamp&lt;/strong&gt; — current time in multiple formats&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Plus a resource (&lt;code&gt;server-info&lt;/code&gt;) and a prompt (&lt;code&gt;weather-report&lt;/code&gt;).&lt;/p&gt;

&lt;p&gt;Call a few tools. Then open the dashboards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Jaeger&lt;/strong&gt; &lt;a href="http://localhost:16686" rel="noopener noreferrer"&gt;http://localhost:16686&lt;/a&gt; — find service &lt;code&gt;toad-eye-mcp-demo&lt;/code&gt;, see individual spans&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Grafana&lt;/strong&gt; &lt;a href="http://localhost:3100" rel="noopener noreferrer"&gt;http://localhost:3100&lt;/a&gt; — MCP Server dashboard, see the metrics in aggregate&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prometheus&lt;/strong&gt; &lt;a href="http://localhost:9090" rel="noopener noreferrer"&gt;http://localhost:9090&lt;/a&gt; — raw queries, autocomplete &lt;code&gt;gen_ai_mcp&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The demo server is intentionally simple — three tools, mock data, no external dependencies. The point isn't the tools. The point is seeing what the observability looks like in practice.&lt;/p&gt;

&lt;p&gt;Here's the full server — 50 lines of actual logic:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;initObservability&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;toad-eye&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;toadEyeMiddleware&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;toad-eye/mcp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;McpServer&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@modelcontextprotocol/sdk/server/mcp.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;StdioServerTransport&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;@modelcontextprotocol/sdk/server/stdio.js&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;zod&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;initObservability&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;serviceName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;toad-eye-mcp-demo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;endpoint&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;http://localhost:4318&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;McpServer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;toad-eye-mcp-demo&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.0.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nf"&gt;toadEyeMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;recordInputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;recordOutputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;calculate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Evaluate a math expression&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;expression&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sanitized&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;replace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sr"&gt;/&lt;/span&gt;&lt;span class="se"&gt;[^&lt;/span&gt;&lt;span class="sr"&gt;0-9+&lt;/span&gt;&lt;span class="se"&gt;\-&lt;/span&gt;&lt;span class="sr"&gt;*&lt;/span&gt;&lt;span class="se"&gt;/&lt;/span&gt;&lt;span class="sr"&gt;().% &lt;/span&gt;&lt;span class="se"&gt;]&lt;/span&gt;&lt;span class="sr"&gt;/g&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;""&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sanitized&lt;/span&gt; &lt;span class="o"&gt;!==&lt;/span&gt; &lt;span class="nx"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`Invalid characters in expression: &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`return (&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;sanitized&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;)`&lt;/span&gt;&lt;span class="p"&gt;)()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;expression&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; = &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt; &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;get-weather&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Get current weather for a city (mock data)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;city&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nf"&gt;setTimeout&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;r&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;conditions&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sunny&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;cloudy&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rainy&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;snowy&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;windy&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;condition&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;conditions&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;floor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;conditions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;length&lt;/span&gt;&lt;span class="p"&gt;)]&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;tempC&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;round&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nb"&gt;Math&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;JSON&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stringify&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;city&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;condition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;tempC&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="p"&gt;}],&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;transport&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;StdioServerTransport&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;connect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;transport&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;MCP demo server running — open Grafana at http://localhost:3100&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice &lt;code&gt;console.error&lt;/code&gt; on the last line — not &lt;code&gt;console.log&lt;/code&gt;. Because stdout is the JSON-RPC wire. We learned this the hard way (&lt;a href="https://dev.to/vola-trebla/mcp-servers-are-the-fastest-growing-part-of-the-ai-stack-they-have-zero-observability-5299"&gt;article #7&lt;/a&gt;).&lt;/p&gt;
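
&lt;p&gt;The demo registers two more things the listing above omits: the &lt;code&gt;timestamp&lt;/code&gt; tool and the &lt;code&gt;server-info&lt;/code&gt; resource. A sketch of what they might look like; the names come from the tool list, but the exact output shapes here are assumptions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch only: the registration pattern mirrors the tools above,
// but these output shapes are assumed, not the demo's actual code.
server.tool(
  "timestamp",
  "Current time in multiple formats",
  {},
  async () =&amp;gt; {
    const now = new Date();
    return {
      content: [{
        type: "text",
        text: JSON.stringify({
          iso: now.toISOString(),
          unix: Math.floor(now.getTime() / 1000),
          utc: now.toUTCString(),
        }),
      }],
    };
  },
);

// Resource reads show up as resources/read spans in the trace tree below.
server.resource("server-info", "toad-eye://info", async (uri) =&amp;gt; ({
  contents: [{ uri: uri.href, text: "toad-eye-mcp-demo v1.0.0" }],
}));
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;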

&lt;h2&gt;
  
  
  Metrics in the public API
&lt;/h2&gt;

&lt;p&gt;One detail that bit us: MCP metrics existed in code but were invisible to library users. The &lt;code&gt;GEN_AI_METRICS&lt;/code&gt; constant — the public interface for all toad-eye metric names — didn't include MCP metrics. Users writing custom dashboards or alerts had no way to discover them.&lt;/p&gt;

&lt;p&gt;Fixed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;GEN_AI_METRICS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// ... existing LLM metrics ...&lt;/span&gt;

  &lt;span class="c1"&gt;// MCP Server&lt;/span&gt;
  &lt;span class="na"&gt;MCP_TOOL_DURATION&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gen_ai.mcp.tool.duration&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;MCP_TOOL_CALLS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gen_ai.mcp.tool.calls&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;MCP_TOOL_ERRORS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gen_ai.mcp.tool.errors&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;MCP_RESOURCE_READS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gen_ai.mcp.resource.reads&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;MCP_SESSION_ACTIVE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gen_ai.mcp.session.active&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="kd"&gt;const&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you can reference &lt;code&gt;GEN_AI_METRICS.MCP_TOOL_DURATION&lt;/code&gt; in your code instead of hardcoding the string &lt;code&gt;"gen_ai.mcp.tool.duration"&lt;/code&gt;. Small thing, but it's the difference between a library and a collection of code.&lt;/p&gt;

&lt;p&gt;Session tracking was also added — &lt;code&gt;MCP_SESSION_ACTIVE&lt;/code&gt; is an UpDownCounter that increments when middleware initializes. In a multi-server deployment, you can see how many MCP sessions are active across your fleet.&lt;/p&gt;
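
&lt;p&gt;Under the hood this is standard OpenTelemetry. A minimal sketch of the UpDownCounter pattern, with the meter name and attribute wiring assumed (not toad-eye's exact internals):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { metrics } from "@opentelemetry/api";
import { GEN_AI_METRICS } from "toad-eye";

// Assumed wiring: reference the exported constant instead of a raw string.
const meter = metrics.getMeter("toad-eye");
const activeSessions = meter.createUpDownCounter(GEN_AI_METRICS.MCP_SESSION_ACTIVE, {
  description: "Active MCP sessions",
});

// Middleware initializes: one more live session.
activeSessions.add(1, { "mcp.server.name": "my-server" });

// Transport closes: one fewer.
activeSessions.add(-1, { "mcp.server.name": "my-server" });
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;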

&lt;h2&gt;
  
  
  The full trace tree
&lt;/h2&gt;

&lt;p&gt;With all four follow-ups shipped, here's what a complete MCP trace looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;invoke_agent orchestrator                          [client process]
├── chat gpt-4o                          1.2s      client LLM call
├── tools/call calculate                 12ms  ✅  [MCP server]
├── tools/call get-weather               187ms ✅  [MCP server]
├── tools/call summarize                 2.1s  ✅  [MCP server]
│   └── chat gpt-4o (sampling)           1.8s      server → client LLM
├── resources/read toad-eye://info       3ms   ✅  [MCP server]
└── chat gpt-4o                          800ms     client LLM call
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Client-side agent spans. Server-side tool spans. Server-initiated LLM spans. One trace, complete story. From "the agent decided to call a tool" to "the tool asked the LLM for help" to "the result came back" — every step is visible.&lt;/p&gt;

&lt;p&gt;This is what MCP observability looks like when it's done.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick start — 5 minutes to your first MCP trace
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install toad-eye (if not already)&lt;/span&gt;
npm &lt;span class="nb"&gt;install &lt;/span&gt;toad-eye

&lt;span class="c"&gt;# Start the stack&lt;/span&gt;
npx toad-eye init
npx toad-eye up

&lt;span class="c"&gt;# Run the demo MCP server with Inspector&lt;/span&gt;
npx @modelcontextprotocol/inspector npx tsx demo/src/mcp-server/index.ts

&lt;span class="c"&gt;# Call some tools in Inspector, then check:&lt;/span&gt;
&lt;span class="c"&gt;# Jaeger:     http://localhost:16686 (service: toad-eye-mcp-demo)&lt;/span&gt;
&lt;span class="c"&gt;# Grafana:    http://localhost:3100  (dashboard: MCP Server)&lt;/span&gt;
&lt;span class="c"&gt;# Prometheus: http://localhost:9090  (query: gen_ai_mcp_tool_calls_total)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Five minutes. Real traces. Real metrics. Real dashboard. No mock data.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Implementation:&lt;/strong&gt; &lt;a href="https://github.com/vola-trebla/toad-eye/pull/216" rel="noopener noreferrer"&gt;Follow-up 1: Demo&lt;/a&gt; · &lt;a href="https://github.com/vola-trebla/toad-eye/pull/217" rel="noopener noreferrer"&gt;Follow-up 2: Dashboard&lt;/a&gt; · &lt;a href="https://github.com/vola-trebla/toad-eye/pull/218" rel="noopener noreferrer"&gt;Follow-up 3: Sampling&lt;/a&gt; · &lt;a href="https://github.com/vola-trebla/toad-eye/pull/219" rel="noopener noreferrer"&gt;Follow-up 4: Metrics API&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Previous articles:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/vola-trebla/your-ai-agent-re-sends-80-of-your-budget-every-loop-27an"&gt;#5: Your AI agent is re-sending 80% of your budget every loop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/vola-trebla/your-llm-traces-are-write-only-20ci"&gt;#6: Your LLM traces are write-only&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/vola-trebla/mcp-servers-are-the-fastest-growing-part-of-the-ai-stack-they-have-zero-observability-5299"&gt;#7: MCP servers are a black box&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;toad-eye&lt;/strong&gt; — open-source LLM observability, OTel-native: &lt;a href="https://github.com/vola-trebla/toad-eye" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://www.npmjs.com/package/toad-eye" rel="noopener noreferrer"&gt;npm&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🐸👁️&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>programming</category>
      <category>javascript</category>
    </item>
    <item>
      <title>Every PostgreSQL MCP server eats your context window. Here's how I collapsed 4 into 1.</title>
      <dc:creator>Albert Alov</dc:creator>
      <pubDate>Sat, 04 Apr 2026 02:56:11 +0000</pubDate>
      <link>https://dev.to/vola-trebla/every-postgresql-mcp-server-eats-your-context-window-heres-how-i-collapsed-4-into-1-3o14</link>
      <guid>https://dev.to/vola-trebla/every-postgresql-mcp-server-eats-your-context-window-heres-how-i-collapsed-4-into-1-3o14</guid>
      <description>&lt;p&gt;I have four PostgreSQL environments. Dev, stage, prod, dev2. Each behind different credentials. Prod behind an SSH bastion. The standard MCP approach says: spin up a server per database. Four servers, twelve tools, 60,000+ tokens of metadata before the model even reads my question.&lt;/p&gt;

&lt;p&gt;At four databases, it's annoying. At 24 — which is what enterprise teams actually deal with — it's catastrophic. The model spends more attention resolving tool schemas than answering your question.&lt;/p&gt;

&lt;p&gt;I built a router that collapses it to one server, four tools, ~500 tokens. It also manages SSH tunnels automatically, blocks destructive queries, and requires human approval before touching prod.&lt;/p&gt;

&lt;p&gt;It's called &lt;a href="https://github.com/vola-trebla/toad-tunnel-mcp" rel="noopener noreferrer"&gt;toad-tunnel-mcp&lt;/a&gt;. Here's what it actually looks like.&lt;/p&gt;




&lt;h2&gt;
  
  
  The math that made me stop and build
&lt;/h2&gt;

&lt;p&gt;Take a typical PostgreSQL MCP setup. Each database exposes 3 tools: &lt;code&gt;query&lt;/code&gt;, &lt;code&gt;list_tables&lt;/code&gt;, &lt;code&gt;describe_table&lt;/code&gt;. Together, those three definitions add up to ~2,500 tokens of JSON Schema per database.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Setup&lt;/th&gt;
&lt;th&gt;Databases&lt;/th&gt;
&lt;th&gt;Tools&lt;/th&gt;
&lt;th&gt;Tokens (metadata only)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;One DB&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;~2,500&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Four envs&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;~10,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Enterprise (24 DBs)&lt;/td&gt;
&lt;td&gt;24&lt;/td&gt;
&lt;td&gt;72&lt;/td&gt;
&lt;td&gt;~60,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;toad-tunnel-mcp&lt;/td&gt;
&lt;td&gt;Any&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 24 databases, your model burns 60k tokens on tool definitions alone. That's 40-50% of the context window — gone before you type a single word. Research shows tool selection accuracy drops below 49% past 15 tools. Your model isn't stupid. You're drowning it.&lt;/p&gt;

&lt;p&gt;The existing solutions don't fix this. DBHub collapses tools but has no environment-first routing. pgEdge supports multiple instances but relies on the model to "remember" which database it's talking to. MCP aggregators just prefix everything (&lt;code&gt;db1_query&lt;/code&gt;, &lt;code&gt;db2_query&lt;/code&gt;) — all 72 tools still enter the context.&lt;/p&gt;

&lt;p&gt;None of them handle SSH tunnels. In the real world, prod isn't on &lt;code&gt;localhost:5432&lt;/code&gt;. It's behind a bastion host, and you're doing &lt;code&gt;ssh -L 15432:prod-db.internal:5432 deploy@bastion.company.com&lt;/code&gt; before every session.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture: one server, env as a parameter
&lt;/h2&gt;

&lt;p&gt;The core idea is embarrassingly simple. Instead of separate tools per database, you have one &lt;code&gt;execute_query&lt;/code&gt; tool with &lt;code&gt;env&lt;/code&gt; as a required enum parameter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;registerTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;toad_tunnel__execute_query&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;inputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
      &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enum&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;dev&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;stage&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;prod&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;dev2&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]),&lt;/span&gt;
      &lt;span class="na"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sql&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// env is validated by Zod before this runs.&lt;/span&gt;
    &lt;span class="c1"&gt;// The model can't invent "production-main" or "prod2".&lt;/span&gt;
    &lt;span class="c1"&gt;// It picks from the enum or gets rejected.&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;connectionManager&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getPool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;env&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;pool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;formatResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;rows&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Zod enum is the forcing function. If the model passes anything not in the list, the MCP SDK rejects it at the protocol level — the handler never executes. Safety moves from "the model's intent" to "the protocol's contract."&lt;/p&gt;

&lt;p&gt;This is what the research calls the "Action-Selector" pattern. One tool, deterministic routing. The model doesn't choose between &lt;code&gt;dev_query&lt;/code&gt; and &lt;code&gt;prod_query&lt;/code&gt; and hope it picks right. It fills in a parameter.&lt;/p&gt;
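
&lt;p&gt;The contract is easy to verify in isolation. A minimal sketch of the same enum check outside the MCP layer:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { z } from "zod";

const input = z.object({
  env: z.enum(["dev", "stage", "prod", "dev2"]),
  sql: z.string(),
});

// A hallucinated environment fails validation; no handler ever runs.
const bad = input.safeParse({ env: "production-main", sql: "SELECT 1" });
console.log(bad.success); // false

const good = input.safeParse({ env: "prod", sql: "SELECT 1" });
console.log(good.success); // true
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;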

&lt;h2&gt;
  
  
  Progressive disclosure: don't load what you don't need
&lt;/h2&gt;

&lt;p&gt;The second problem is eager-loaded schemas: most queries only touch 2-3 tables, so why dump the entire schema of 50 tables into context?&lt;/p&gt;

&lt;p&gt;We decomposed introspection into three tools that mirror how a human developer actually explores a database:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Step 1: "What environments exist?"
→ toad_tunnel__list_nodes
← dev: sandbox_dev (read-write, auto) | stage: sandbox_stage (read-only, auto) | prod: sandbox_prod (read-only, hitl)

Step 2: "What tables are in stage?"
→ toad_tunnel__get_overview { env: "stage" }
← products  ~1000 rows | categories  ~50 rows | data_checks  ~1500 rows

Step 3: "What columns does data_checks have?"
→ toad_tunnel__describe_columns { env: "stage", table: "data_checks" }
← id:serial:PK:NOT NULL | code:varchar(50):NOT NULL | severity:varchar(20):NOT NULL | ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Step 1 costs ~200 tokens. Step 2 costs maybe 300. Step 3 costs whatever the specific table needs. The model only pays for what it uses.&lt;/p&gt;

&lt;p&gt;Compare this to eager-loading: dump all tables, all columns, all environments into context on startup. That's the current standard. It's like loading every Wikipedia article before answering a question about frogs.&lt;/p&gt;

&lt;p&gt;The output format matters too. We use a minified format (&lt;code&gt;id:serial:PK:NOT NULL&lt;/code&gt;) instead of verbose JSON. TSV for query results instead of JSON. That's 30-40% fewer tokens per response — which adds up across a multi-turn debugging session.&lt;/p&gt;
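
&lt;p&gt;A sketch of the kind of formatter this implies (a hypothetical helper, not the library's exact code):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical TSV formatter: header once, tab-separated values,
// no braces or repeated keys per row.
function formatResult(rows: Record&amp;lt;string, unknown&amp;gt;[]): string {
  if (rows.length === 0) return "(0 rows)";
  const columns = Object.keys(rows[0]);
  const header = columns.join("\t");
  const body = rows
    .map((row) =&amp;gt; columns.map((col) =&amp;gt; String(row[col] ?? "")).join("\t"))
    .join("\n");
  return header + "\n" + body;
}

// JSON: [{"id":1,"code":"FK_CHECK","severity":"high"}, ...]
// TSV:  id\tcode\tseverity
//       1\tFK_CHECK\thigh
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;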

&lt;h2&gt;
  
  
  SSH tunnels: the feature nobody built
&lt;/h2&gt;

&lt;p&gt;This is the part that surprised me. Every multi-database MCP tool assumes direct connections. In reality:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# What your infra actually looks like&lt;/span&gt;
&lt;span class="na"&gt;prod&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;host&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod-db.internal&lt;/span&gt;        &lt;span class="c1"&gt;# not reachable from your laptop&lt;/span&gt;
  &lt;span class="na"&gt;port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;5432&lt;/span&gt;
  &lt;span class="na"&gt;user&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;prod_reader&lt;/span&gt;
  &lt;span class="na"&gt;password&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;${PROD_PG_PASSWORD}&lt;/span&gt;
  &lt;span class="na"&gt;permissions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;read-only&lt;/span&gt;
  &lt;span class="na"&gt;approval&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;hitl&lt;/span&gt;
  &lt;span class="na"&gt;tunnel&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;bastion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;bastion.company.com&lt;/span&gt;
    &lt;span class="na"&gt;username&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;deploy&lt;/span&gt;
    &lt;span class="na"&gt;key_path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;~/.ssh/prod_key&lt;/span&gt;
    &lt;span class="na"&gt;local_port&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;15432&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;toad-tunnel-mcp manages SSH tunnels automatically via &lt;code&gt;ssh2&lt;/code&gt;. The lifecycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Lazy connect.&lt;/strong&gt; No tunnel opens at startup. First query to &lt;code&gt;prod&lt;/code&gt; triggers it.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Keep-alive.&lt;/strong&gt; Configurable heartbeat (default 30s) prevents SSH timeout.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Idle disconnect.&lt;/strong&gt; No queries for 5 minutes? Tunnel closes. Resources freed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-reconnect.&lt;/strong&gt; Connection drops? Exponential backoff, up to 3 retries. Pool gets invalidated and recreated on reconnect.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graceful shutdown.&lt;/strong&gt; &lt;code&gt;SIGINT&lt;/code&gt; → close all pools → close all tunnels → exit. No zombie SSH processes.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// You never think about tunnels. Just query.&lt;/span&gt;
&lt;span class="c1"&gt;// The router checks if env has a tunnel config,&lt;/span&gt;
&lt;span class="c1"&gt;// opens it if needed, routes through it.&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;handler&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;prod&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SELECT count(*) FROM products&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="c1"&gt;// Behind the scenes: SSH tunnel opened → pg pool connected&lt;/span&gt;
&lt;span class="c1"&gt;// through 127.0.0.1:15432 → query executed → result returned&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I looked at &lt;code&gt;ssh2-promise&lt;/code&gt;, native &lt;code&gt;child_process&lt;/code&gt; with &lt;code&gt;ssh -L&lt;/code&gt;, and raw &lt;code&gt;ssh2&lt;/code&gt;. Went with &lt;code&gt;ssh2&lt;/code&gt; behind a &lt;code&gt;TunnelProvider&lt;/code&gt; interface — if the library dies, we swap the implementation without touching the router.&lt;/p&gt;
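
&lt;p&gt;The interface itself is small. A sketch of roughly what it has to promise (method names assumed, not the actual source):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Assumed shape. The router only ever asks one question:
// "give me a local port that reaches this env's database".
interface TunnelProvider {
  // Opens the tunnel lazily if needed; resolves with the local forwarded port.
  ensureOpen(env: string): Promise&amp;lt;number&amp;gt;;
  // Closes the tunnel and frees the local port.
  close(env: string): Promise&amp;lt;void&amp;gt;;
  // Whether a live tunnel currently exists for this env.
  isOpen(env: string): boolean;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;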

&lt;h2&gt;
  
  
  Safety: four layers, not one
&lt;/h2&gt;

&lt;p&gt;Here's where it gets serious. An MCP router that makes prod as easy to query as dev is a liability, not a feature. We need defense in depth.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: PostgreSQL read-only roles.&lt;/strong&gt; The actual defense. &lt;code&gt;ALTER ROLE prod_reader SET default_transaction_read_only = ON&lt;/code&gt;. Even if every software layer above fails, the database itself rejects mutations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: Keyword blocklist.&lt;/strong&gt; Fast-fail before the query reaches the database. &lt;code&gt;DROP&lt;/code&gt;, &lt;code&gt;DELETE&lt;/code&gt;, &lt;code&gt;ALTER&lt;/code&gt;, &lt;code&gt;TRUNCATE&lt;/code&gt; — caught at the router, clear error message returned to the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Blocklist runs BEFORE HITL — no point asking the user&lt;/span&gt;
&lt;span class="c1"&gt;// to approve a query that will be rejected anyway.&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;check&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;validator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;envConfig&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;permissions&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;check&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;ok&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// "Blocked keyword 'DELETE' detected.&lt;/span&gt;
  &lt;span class="c1"&gt;//  This environment is read-only."&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;toolError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;check&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Is the blocklist perfect? No. &lt;code&gt;WITH x AS (DELETE FROM ... RETURNING *) SELECT * FROM x&lt;/code&gt; gets caught (we check inside CTEs). But &lt;code&gt;DO $$ BEGIN EXECUTE 'DEL' || 'ETE ...'; END $$&lt;/code&gt; doesn't. That's why layer 1 exists. The blocklist is a fast-fail with a clear error message. The PG role is the actual wall.&lt;/p&gt;
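
&lt;p&gt;For flavor, roughly what such a check can look like; a sketch, not the project's actual validator:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Illustrative sketch of the keyword blocklist, not the project's actual code.
const BLOCKED = ["DROP", "DELETE", "ALTER", "TRUNCATE", "UPDATE", "INSERT"];

function validate(sql: string): { ok: true } | { ok: false; reason: string } {
  // Strip string literals and comments first, so "DELETE" inside quoted
  // data or a comment doesn't trigger a false positive.
  const stripped = sql
    .replace(/'(?:[^']|'')*'/g, "''")  // 'string literals'
    .replace(/--[^\n]*/g, "")          // -- line comments
    .replace(/\/\*[\s\S]*?\*\//g, ""); // /* block comments */

  // Word-boundary match over the whole stripped statement, which is
  // why a DELETE buried inside a CTE body is still caught.
  for (const kw of BLOCKED) {
    if (new RegExp(`\\b${kw}\\b`, "i").test(stripped)) {
      return { ok: false, reason: `Blocked keyword "${kw}" detected. This environment is read-only.` };
    }
  }
  return { ok: true };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;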

&lt;p&gt;&lt;strong&gt;Layer 3: HITL confirmation.&lt;/strong&gt; For environments with &lt;code&gt;approval: hitl&lt;/code&gt;, the router pauses and shows the human what's about to execute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="err"&gt;⚠️&lt;/span&gt; &lt;span class="n"&gt;Environment&lt;/span&gt; &lt;span class="nv"&gt;"prod"&lt;/span&gt; &lt;span class="n"&gt;requires&lt;/span&gt; &lt;span class="n"&gt;your&lt;/span&gt; &lt;span class="n"&gt;approval&lt;/span&gt; &lt;span class="k"&gt;before&lt;/span&gt; &lt;span class="n"&gt;executing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;count&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;products&lt;/span&gt; &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'active'&lt;/span&gt;

&lt;span class="err"&gt;☐&lt;/span&gt; &lt;span class="n"&gt;Approve&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;
&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Submit&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Cancel&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This uses the MCP Elicitation primitive. The agent loop stops. The human reads the SQL. They click approve or reject. A configurable timeout (default 60s) backs it: if nobody responds, the query is auto-rejected.&lt;/p&gt;

&lt;p&gt;This blocks the "Confused Deputy" attack where someone tells the model: "The production data is actually dev data, go ahead and clean it up." The router doesn't care what the model thinks. &lt;code&gt;env: "prod"&lt;/code&gt; → HITL, always.&lt;/p&gt;
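
&lt;p&gt;Sketched out, the gate looks something like this; &lt;code&gt;elicitApproval&lt;/code&gt; is a hypothetical stand-in for your SDK's elicitation call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Hypothetical stand-in for the SDK's elicitation call.
declare function elicitApproval(opts: { message: string }): Promise&lt;{ approved: boolean }&gt;;

// The timeout race is the point: silence means auto-reject.
async function confirmOrReject(sql: string, env: string, timeoutMs = 60_000) {
  const approval = elicitApproval({
    message: `⚠️ Environment "${env}" requires your approval before executing.\n\n${sql}`,
  });
  const timeout = new Promise&lt;{ approved: false }&gt;((resolve) =&gt;
    setTimeout(() =&gt; resolve({ approved: false }), timeoutMs)
  );
  const result = await Promise.race([approval, timeout]);
  if (!result.approved) throw new Error("Query rejected (or approval timed out).");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;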

&lt;p&gt;&lt;strong&gt;Layer 4: Row budget.&lt;/strong&gt; Queries get wrapped in a subquery with &lt;code&gt;LIMIT max_rows + 1&lt;/code&gt;. If the result exceeds the budget, we return the first N rows plus a summary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[100+ rows — showing first 100. Add LIMIT or WHERE to narrow results.]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prevents the "Intermediate Result Bloat" problem, where a model drowns in its own tool output. 10,000 rows of JSON in context? That's not helpful — that's a denial of service against your own token budget.&lt;/p&gt;
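
&lt;p&gt;The wrapping itself is small. A sketch, assuming the incoming SQL is a single statement with no trailing semicolon:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { Pool } from "pg";

// Sketch: fetch one row past the budget so overflow is detectable.
async function runWithBudget(pool: Pool, sql: string, maxRows = 100) {
  const { rows } = await pool.query(`SELECT * FROM (${sql}) AS sub LIMIT ${maxRows + 1}`);
  if (rows.length &lt;= maxRows) return { rows, truncated: false };
  return {
    rows: rows.slice(0, maxRows),
    truncated: true,
    note: `[${maxRows}+ rows — showing first ${maxRows}. Add LIMIT or WHERE to narrow results.]`,
  };
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;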

&lt;h2&gt;
  
  
  The audit trail
&lt;/h2&gt;

&lt;p&gt;Every query gets logged. Success, blocked, rejected — all of it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-31T14:22:01.123Z"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"env"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"prod"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"database"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"sandbox_prod"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sql"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DELETE FROM products WHERE id = 1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"blocked"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Blocked keyword &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;DELETE&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt; detected."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"duration_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Structured JSON goes to stderr by default (stdout is reserved for the MCP stdio transport). It's configurable to a file, and ready for OpenTelemetry integration when you need it.&lt;/p&gt;
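
&lt;p&gt;The mechanism is tiny; a sketch whose entry shape mirrors the example above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of the logger: structured entries as JSON lines on stderr.
// stdout must stay clean, since it's the JSON-RPC wire in stdio mode.
type AuditEntry = {
  timestamp: string;
  env: string;
  database: string;
  sql: string;
  status: "success" | "blocked" | "rejected";
  reason?: string;
  duration_ms: number;
};

function audit(entry: AuditEntry) {
  process.stderr.write(JSON.stringify(entry) + "\n");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;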

&lt;h2&gt;
  
  
  What it looks like in practice
&lt;/h2&gt;

&lt;p&gt;Add to Claude Desktop or Claude Code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"mcpServers"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"toad-tunnel"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"npx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"toad-tunnel-mcp"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"--config"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"toad-tunnel.yaml"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then just talk:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"What environments are available?"&lt;br&gt;
→ &lt;code&gt;list_nodes&lt;/code&gt; → dev, stage, prod, dev2&lt;/p&gt;

&lt;p&gt;"How many unresolved data checks in stage?"&lt;br&gt;
→ &lt;code&gt;execute_query&lt;/code&gt; env=stage → &lt;code&gt;SELECT count(*) FROM data_checks WHERE resolved_at IS NULL&lt;/code&gt; → 847&lt;/p&gt;

&lt;p&gt;"Same query in prod"&lt;br&gt;
→ &lt;code&gt;execute_query&lt;/code&gt; env=prod → ⚠️ HITL prompt → approve → 2,341&lt;/p&gt;

&lt;p&gt;"Delete the resolved ones in prod"&lt;br&gt;
→ &lt;code&gt;execute_query&lt;/code&gt; env=prod → ❌ Blocked keyword "DELETE". This environment is read-only.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;No context switching. No SSH commands. No "wait, which database am I connected to?"&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Standard MCP (4 DBs)&lt;/th&gt;
&lt;th&gt;toad-tunnel-mcp&lt;/th&gt;
&lt;th&gt;Change&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tools in context&lt;/td&gt;
&lt;td&gt;12+&lt;/td&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;-67%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Startup tokens&lt;/td&gt;
&lt;td&gt;~10,000&lt;/td&gt;
&lt;td&gt;~500&lt;/td&gt;
&lt;td&gt;-95%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool selection accuracy&lt;/td&gt;
&lt;td&gt;Degrades with count&lt;/td&gt;
&lt;td&gt;Stable (enum)&lt;/td&gt;
&lt;td&gt;Deterministic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SSH tunnel management&lt;/td&gt;
&lt;td&gt;Manual&lt;/td&gt;
&lt;td&gt;Automatic&lt;/td&gt;
&lt;td&gt;Zero friction&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prod safety&lt;/td&gt;
&lt;td&gt;Trust the model&lt;/td&gt;
&lt;td&gt;Protocol-enforced&lt;/td&gt;
&lt;td&gt;Defense-in-depth&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query audit&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Every query logged&lt;/td&gt;
&lt;td&gt;Full trail&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 24 databases (enterprise scale), the same math gives roughly 72 tool schemas at ~800 tokens each versus a constant ~500: a token reduction of 99%+.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; toad-tunnel-mcp

&lt;span class="c"&gt;# Create config from example&lt;/span&gt;
curl &lt;span class="nt"&gt;-o&lt;/span&gt; toad-tunnel.yaml https://raw.githubusercontent.com/vola-trebla/toad-tunnel-mcp/main/config/toad-tunnel.example.yaml

&lt;span class="c"&gt;# Validate&lt;/span&gt;
toad-tunnel-mcp validate &lt;span class="nt"&gt;--config&lt;/span&gt; toad-tunnel.yaml

&lt;span class="c"&gt;# Run&lt;/span&gt;
toad-tunnel-mcp &lt;span class="nt"&gt;--config&lt;/span&gt; toad-tunnel.yaml
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Full config reference, security model docs, and example setups for AWS RDS + bastion: &lt;a href="https://github.com/vola-trebla/toad-tunnel-mcp" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://www.npmjs.com/package/toad-tunnel-mcp" rel="noopener noreferrer"&gt;npm&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;toad-tunnel-mcp&lt;/strong&gt; — multi-env PostgreSQL MCP router with SSH tunnels: &lt;a href="https://github.com/vola-trebla/toad-tunnel-mcp" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://www.npmjs.com/package/toad-tunnel-mcp" rel="noopener noreferrer"&gt;npm&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🐸⚡&lt;/p&gt;

</description>
      <category>ai</category>
      <category>postgres</category>
      <category>mcp</category>
      <category>typescript</category>
    </item>
    <item>
      <title>MCP servers are the fastest-growing part of the AI stack. They have zero observability.</title>
      <dc:creator>Albert Alov</dc:creator>
      <pubDate>Mon, 30 Mar 2026 23:13:45 +0000</pubDate>
      <link>https://dev.to/vola-trebla/mcp-servers-are-the-fastest-growing-part-of-the-ai-stack-they-have-zero-observability-5299</link>
      <guid>https://dev.to/vola-trebla/mcp-servers-are-the-fastest-growing-part-of-the-ai-stack-they-have-zero-observability-5299</guid>
      <description>&lt;p&gt;Your LLM agent calls a tool via MCP. The tool fails. Your trace shows &lt;code&gt;tools/call search — error&lt;/code&gt;. That's it.&lt;/p&gt;

&lt;p&gt;Not why it failed. Not how long it took. Not what arguments were passed. Not whether it was a timeout, a validation error, or a rate limit from a downstream API. Because nobody instruments the server side.&lt;/p&gt;

&lt;p&gt;Every MCP observability tool watches the client. We built the first middleware that watches from inside the server. One import, one function call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;toadEyeMiddleware&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;toad-eye/mcp&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;McpServer&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;my-server&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;version&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;1.0.0&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="nf"&gt;toadEyeMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Every tool, resource, and prompt handler is now instrumented.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's why this was harder than it sounds, and what we learned building it.&lt;/p&gt;




&lt;h2&gt;
  
  
  The black box
&lt;/h2&gt;

&lt;p&gt;Here's what an MCP tool call trace looks like today — from the client's perspective:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;invoke_agent orchestrator          200ms
├── chat gpt-4o                    1.2s
├── tools/call web-search           ???
│   └── (nothing — the server is a black box)
└── chat gpt-4o                    800ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The LLM decided to call &lt;code&gt;web-search&lt;/code&gt;. The client sent the JSON-RPC request. Something happened on the server. The client got a response — or an error.&lt;/p&gt;

&lt;p&gt;The gap between "sent the request" and "got the response" is completely invisible.&lt;/p&gt;

&lt;p&gt;MCP adoption is exploding. Claude Desktop, Cursor, Windsurf, Zed — every AI IDE supports it. Thousands of servers on npm. And every single one is a black box. When your tool breaks in production, you're debugging with nothing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why you can't just "add logging"
&lt;/h2&gt;

&lt;p&gt;Your first instinct: &lt;code&gt;console.log&lt;/code&gt; in the tool handler. Three reasons that doesn't work for MCP:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP uses JSON-RPC 2.0, not HTTP.&lt;/strong&gt; No Express middleware. No Hono middleware. No request/response cycle to hook into. The SDK has a &lt;code&gt;McpServer&lt;/code&gt; class where you register handlers — &lt;code&gt;.tool()&lt;/code&gt;, &lt;code&gt;.resource()&lt;/code&gt;, &lt;code&gt;.prompt()&lt;/code&gt; — and it routes JSON-RPC messages internally.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The SDK internals are private.&lt;/strong&gt; You can't monkey-patch &lt;code&gt;McpServer._requestHandlers&lt;/code&gt; — it's a private &lt;code&gt;Map&lt;/code&gt;, and TypeScript strict mode won't let you touch it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And the worst part — stdio transport.&lt;/strong&gt; In stdio mode, &lt;code&gt;stdout&lt;/code&gt; IS the JSON-RPC wire. Every byte on stdout is parsed by the client as a JSON-RPC message:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight console"&gt;&lt;code&gt;&lt;span class="gp"&gt;$&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;node my-mcp-server.js
&lt;span class="go"&gt;debug                          ← your console.log
{"jsonrpc":"2.0","id":1,...}   ← actual MCP response
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The client tries to parse &lt;code&gt;debug&lt;/code&gt; as JSON. Can't. Connection dead. Your debugging broke the thing you were trying to debug.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Wrapper Pattern
&lt;/h2&gt;

&lt;p&gt;Can't patch internals. Can't hook HTTP. Can't use stdout. But we can intercept the public API.&lt;/p&gt;

&lt;p&gt;When you call &lt;code&gt;server.tool()&lt;/code&gt;, the SDK stores the handler internally. If we replace &lt;code&gt;.tool()&lt;/code&gt; before any handlers are registered, we wrap every handler transparently:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;originalTool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;wrappedTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;rest&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;handlerIndex&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;rest&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findIndex&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="nx"&gt;arg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;typeof&lt;/span&gt; &lt;span class="nx"&gt;arg&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;function&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;originalHandler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;rest&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;handlerIndex&lt;/span&gt;&lt;span class="p"&gt;];&lt;/span&gt;

  &lt;span class="nx"&gt;rest&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;handlerIndex&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;wrappedHandler&lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;span&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;startToolSpan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;performance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;originalHandler&lt;/span&gt;&lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nf"&gt;endSpanSuccess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nf"&gt;recordMcpToolCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;performance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;success&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nx"&gt;result&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nf"&gt;endSpanError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="nf"&gt;recordMcpToolCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;performance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;error&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;error&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;

  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;originalTool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;...&lt;/span&gt;&lt;span class="nx"&gt;rest&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same approach for &lt;code&gt;.resource()&lt;/code&gt; and &lt;code&gt;.prompt()&lt;/code&gt;. The handler is wrapped before it enters the SDK's private map. The SDK never knows. Your code never changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you see after
&lt;/h2&gt;

&lt;p&gt;Before: &lt;code&gt;tools/call search — error&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;After:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;invoke_agent orchestrator                              [client-side]
├── chat gpt-4o                         1.2s
├── tools/call calculate                45ms   ✅      [server-side]
├── tools/call web-search               2.3s   ❌ RateLimitError
├── resources/read file:///config.json  3ms    ✅
└── chat gpt-4o                         800ms
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each operation gets a span with standard OTel attributes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;gen_ai.operation.name&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"tools/call"&lt;/span&gt;
&lt;span class="py"&gt;gen_ai.tool.name&lt;/span&gt;         &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"calculate"&lt;/span&gt;
&lt;span class="py"&gt;mcp.server.name&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"my-server"&lt;/span&gt;
&lt;span class="py"&gt;mcp.session.id&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"a1b2c3d4"&lt;/span&gt;
&lt;span class="py"&gt;network.transport&lt;/span&gt;        &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"pipe"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Any OTel-compatible backend — Jaeger, Grafana Tempo, Datadog, Arize Phoenix — recognizes them without configuration.&lt;/p&gt;

&lt;h2&gt;
  
  
  Metrics — patterns, not just incidents
&lt;/h2&gt;

&lt;p&gt;Spans tell you about individual requests. Metrics tell you about patterns:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;What it tells you&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.mcp.tool.duration&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Histogram&lt;/td&gt;
&lt;td&gt;Which tools are slow? Latency trending up?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.mcp.tool.calls&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Counter&lt;/td&gt;
&lt;td&gt;Which tools are popular? Agent over-using one?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.mcp.tool.errors&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Counter&lt;/td&gt;
&lt;td&gt;Which tools are unreliable? What error types?&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.mcp.resource.reads&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Counter&lt;/td&gt;
&lt;td&gt;Access patterns for resources&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.mcp.session.active&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;UpDownCounter&lt;/td&gt;
&lt;td&gt;How many MCP sessions right now?&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;"The search tool has an 8% error rate and P95 latency of 2.3 seconds." That's actionable. "A tool failed" is not.&lt;/p&gt;

&lt;h2&gt;
  
  
  The STDIO trap
&lt;/h2&gt;

&lt;p&gt;This is the part we learned the hard way.&lt;/p&gt;

&lt;p&gt;OpenTelemetry's SDK writes diagnostic messages to &lt;code&gt;stdout&lt;/code&gt; by default. In an HTTP server, you'd never notice. In a stdio MCP server, those diagnostics are catastrophic.&lt;/p&gt;

&lt;p&gt;OTel SDK initializes → writes &lt;code&gt;"DiagAPI initialized"&lt;/code&gt; to stdout → MCP client parses it as JSON → parse fails → connection dead. Before a single tool call.&lt;/p&gt;

&lt;p&gt;We spent 3 hours on "why does the connection die when I import toad-eye" before finding this. The fix is ten lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;ensureStdioSafe&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;stderrLogger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;DiagConsoleLogger&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;safeLogger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;verbose&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;stderrLogger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verbose&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
    &lt;span class="na"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;stderrLogger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
    &lt;span class="na"&gt;info&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;stderrLogger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
    &lt;span class="na"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;stderrLogger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warn&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
    &lt;span class="na"&gt;error&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;(...&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;stderrLogger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;String&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;])),&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="nx"&gt;diag&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;safeLogger&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;DiagLogLevel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;WARN&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Redirects all OTel diagnostics to &lt;code&gt;stderr&lt;/code&gt;. The MCP connection survives.&lt;/p&gt;

&lt;p&gt;If you're building any MCP server with any OTel instrumentation: &lt;strong&gt;redirect diagnostics to stderr first.&lt;/strong&gt; Before spans, before metrics, before anything.&lt;/p&gt;

&lt;h2&gt;
  
  
  Privacy by default
&lt;/h2&gt;

&lt;p&gt;Tool arguments can contain anything. API keys. User data. File contents. Database credentials. The default must be safe:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;toadEyeMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// arguments NOT recorded&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Opt in explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;toadEyeMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;server&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="na"&gt;recordInputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;recordOutputs&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;redactKeys&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;apiKey&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;token&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;password&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;maxPayloadSize&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4096&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Sensitive fields become &lt;code&gt;[REDACTED]&lt;/code&gt; in spans. Large payloads get truncated. Compare with tools that record everything by default and leave privacy as "your problem."&lt;/p&gt;
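
&lt;p&gt;Redaction plus truncation is a small amount of code; a sketch, not the library's implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of redact-then-truncate, not the library's actual implementation.
function sanitize(payload: unknown, redactKeys: string[], maxPayloadSize: number): string {
  // JSON.stringify's replacer visits every key recursively.
  const redacted = JSON.stringify(payload, (key, value) =&gt;
    redactKeys.includes(key) ? "[REDACTED]" : value
  ) ?? "";
  return redacted.length &gt; maxPayloadSize
    ? redacted.slice(0, maxPayloadSize) + "…[truncated]"
    : redacted;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;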

&lt;h2&gt;
  
  
  Context propagation
&lt;/h2&gt;

&lt;p&gt;The most powerful thing: linking client and server traces. One trace tree, complete picture.&lt;/p&gt;

&lt;p&gt;HTTP does this with &lt;code&gt;traceparent&lt;/code&gt; headers. MCP stdio has no headers. But it has &lt;code&gt;_meta&lt;/code&gt; in JSON-RPC params:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"method"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tools/call"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"params"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"calculate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"expression"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2+2"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"_meta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"traceparent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"00-0af7651916cd43dd-b7ad6b71692033-01"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our middleware extracts it. When the host injects &lt;code&gt;_meta.traceparent&lt;/code&gt;, server spans become children of client spans. When it doesn't — graceful fallback, span starts as root. No crash, no error. Works with whatever context is available.&lt;/p&gt;
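
&lt;p&gt;The extraction side is compact with the standard OTel propagation API. A sketch, assuming the default W3C propagator is registered:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { context, propagation, trace } from "@opentelemetry/api";

// Sketch: treat params._meta as the carrier. With the default W3C propagator
// registered, extract() parses traceparent; if _meta is absent the extracted
// context has no parent and the span simply starts as a root.
function startLinkedSpan(params: { _meta?: Record&lt;string, string&gt; }, name: string) {
  const parentCtx = propagation.extract(context.active(), params._meta ?? {});
  return trace.getTracer("toad-eye-mcp").startSpan(name, {}, parentCtx);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;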

&lt;h2&gt;
  
  
  The landscape
&lt;/h2&gt;

&lt;p&gt;We looked for MCP server-side observability before building this. We couldn't find any — not "nothing good," but nothing at all.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Client-side tracing&lt;/th&gt;
&lt;th&gt;Server-side middleware&lt;/th&gt;
&lt;th&gt;Privacy controls&lt;/th&gt;
&lt;th&gt;OTel-native&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Datadog&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Langfuse&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AgentOps&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;toad-eye&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;✅&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Every tool watches the client. Nobody watches the server. We built this because our own bot's MCP tools kept failing and we had no way to diagnose why.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick checklist
&lt;/h2&gt;

&lt;p&gt;If you're building or maintaining MCP servers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Never &lt;code&gt;console.log&lt;/code&gt; in stdio servers — use stderr or a logger&lt;/li&gt;
&lt;li&gt;Redirect OTel diagnostics to stderr before initializing anything&lt;/li&gt;
&lt;li&gt;Don't record tool arguments by default — they may contain secrets&lt;/li&gt;
&lt;li&gt;Use standard span names: &lt;code&gt;tools/call {name}&lt;/code&gt;, &lt;code&gt;resources/read {uri}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Test with both stdio and SSE transports — they break differently&lt;/li&gt;
&lt;li&gt;Check if your MCP host injects &lt;code&gt;_meta.traceparent&lt;/code&gt; for trace linking&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Implementation:&lt;/strong&gt; &lt;a href="https://github.com/vola-trebla/toad-eye/pull/212" rel="noopener noreferrer"&gt;Phase 1: Core middleware&lt;/a&gt; · &lt;a href="https://github.com/vola-trebla/toad-eye/pull/213" rel="noopener noreferrer"&gt;Phase 2: Metrics + privacy&lt;/a&gt; · &lt;a href="https://github.com/vola-trebla/toad-eye/pull/214" rel="noopener noreferrer"&gt;Phase 3: STDIO isolation&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Previous articles:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/vola-trebla/your-llm-streaming-traces-are-lying-to-you-53f0"&gt;#4: Your LLM streaming traces are lying to you&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/vola-trebla/your-ai-agent-re-sends-80-of-your-budget-every-loop-27an"&gt;#5: Your AI agent re-sends 80% of your budget every loop&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/vola-trebla/your-llm-traces-are-write-only-20ci"&gt;#6: Your LLM traces are write-only&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;toad-eye&lt;/strong&gt; — open-source LLM observability, OTel-native: &lt;a href="https://github.com/vola-trebla/toad-eye" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://www.npmjs.com/package/toad-eye" rel="noopener noreferrer"&gt;npm&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🐸👁️&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>opentelemetry</category>
      <category>typescript</category>
    </item>
    <item>
      <title>Your LLM traces are write-only</title>
      <dc:creator>Albert Alov</dc:creator>
      <pubDate>Sun, 29 Mar 2026 10:00:28 +0000</pubDate>
      <link>https://dev.to/vola-trebla/your-llm-traces-are-write-only-20ci</link>
      <guid>https://dev.to/vola-trebla/your-llm-traces-are-write-only-20ci</guid>
      <description>&lt;p&gt;You spent weeks building observability for your LLM app. Traces in Jaeger. Metrics in Grafana. Alerts in Slack. You can see exactly what your model says, how long it takes, and how much it costs.&lt;/p&gt;

&lt;p&gt;Then you change the prompt.&lt;/p&gt;

&lt;p&gt;Did the model get better? Worse? For which inputs? You have no idea — because your traces are write-only. You observe but never evaluate. Your production data sits in Jaeger and never becomes a test.&lt;/p&gt;

&lt;p&gt;We built the bridge from traces to tests. Then we ran it on our own traces and discovered half our spans had no content — because &lt;code&gt;recordContent&lt;/code&gt; was off by default. The tool designed to extract test data couldn't extract anything.&lt;/p&gt;

&lt;p&gt;Fixed that. Here's the workflow.&lt;/p&gt;




&lt;h2&gt;
  
  
  The loop nobody closes
&lt;/h2&gt;

&lt;p&gt;Every LLM team has some version of this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Deploy prompt v2
2. Watch dashboards for a few hours
3. "Looks fine, latency is similar, no errors"
4. Move on
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;"Looks fine" is not evaluation. You're checking system health — latency, errors, cost — but not output quality. Your model could be returning subtly worse answers and you'd never know, because you don't have regression tests built from real production data.&lt;/p&gt;

&lt;p&gt;The teams that do this well have a different loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. Collect production inputs and outputs (traces)
2. Extract test cases from real traffic
3. Run the new prompt against those inputs
4. Score: is v2 better than v1?
5. Deploy with confidence
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Steps 2-4 are what eval frameworks do. The problem is getting from step 1 to step 2. Your traces live in Jaeger. Your eval framework expects YAML datasets. Nobody builds the bridge.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bridge: &lt;code&gt;export-trace&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;One CLI command converts a Jaeger trace into a test dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx toad-eye export-trace abc123def456
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="s"&gt;✅ Exported trace abc123def456 → ./trace-abc123de.eval.yaml&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The generated YAML:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;exported-trace-abc123de&lt;/span&gt;
&lt;span class="na"&gt;source&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;toad-eye-export&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;trace_id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;abc123def456&lt;/span&gt;
  &lt;span class="na"&gt;exported_at&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2026-03-15T14:22:00.000Z"&lt;/span&gt;
  &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gpt-4o&lt;/span&gt;
  &lt;span class="na"&gt;provider&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;openai&lt;/span&gt;
&lt;span class="na"&gt;cases&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production-case-1&lt;/span&gt;
    &lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;are&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;the&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;side&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;effects&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;of&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;ibuprofen?"&lt;/span&gt;
    &lt;span class="na"&gt;assertions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;max_length&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1500&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;not_contains&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;i&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cannot"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;production-case-2&lt;/span&gt;
    &lt;span class="na"&gt;variables&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;input&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;{"action":&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"summarize",&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"text":&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;"..."}'&lt;/span&gt;
    &lt;span class="na"&gt;assertions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;max_length&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;800&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;is_json&lt;/span&gt;
        &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;One trace, multiple LLM calls, each becomes a test case. The assertions are auto-generated from what the production model actually returned.&lt;/p&gt;

&lt;h2&gt;
  
  
  How assertions are generated
&lt;/h2&gt;

&lt;p&gt;The export doesn't just copy inputs and outputs. It analyzes the production response and creates baseline assertions:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What it checks&lt;/th&gt;
&lt;th&gt;How&lt;/th&gt;
&lt;th&gt;Why&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;max_length&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;completion.length × 1.5&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;New prompt shouldn't produce wildly longer output&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;not_contains&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Checks for refusal phrases&lt;/td&gt;
&lt;td&gt;If production didn't refuse, the new prompt shouldn't either&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;is_json&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;JSON.parse()&lt;/code&gt; succeeds&lt;/td&gt;
&lt;td&gt;If production returned valid JSON, new prompt must too&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These are conservative baselines — they catch regressions, not improvements. If your current prompt returns a 500-character JSON answer and the new prompt returns a 3,000-character refusal, something is broken. These assertions catch that.&lt;/p&gt;
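
&lt;p&gt;That table compresses into a small generator. A sketch; the refusal-phrase list is illustrative:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type Assertion = { type: string; value: string | number | boolean };

// Sketch of baseline-assertion generation from a production completion.
function baselineAssertions(completion: string): Assertion[] {
  const assertions: Assertion[] = [
    // New prompt shouldn't produce wildly longer output.
    { type: "max_length", value: Math.ceil(completion.length * 1.5) },
  ];
  for (const phrase of ["i cannot", "i'm sorry", "as an ai"]) {
    // Production didn't refuse, so the new prompt shouldn't either.
    if (!completion.toLowerCase().includes(phrase)) {
      assertions.push({ type: "not_contains", value: phrase });
    }
  }
  try {
    JSON.parse(completion);
    // Production returned valid JSON, so the new prompt must too.
    assertions.push({ type: "is_json", value: true });
  } catch { /* not JSON, skip */ }
  return assertions;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;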

&lt;p&gt;You add domain-specific assertions on top:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;assertions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;max_length&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1500&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;not_contains&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;i&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;cannot"&lt;/span&gt;
  &lt;span class="c1"&gt;# Your domain expertise:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;contains&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;nausea"&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;llm_judge&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;is&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;medically&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;accurate&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;and&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;lists&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;at&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;least&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;3&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;side&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;effects"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Auto-generated assertions bootstrap the dataset. Your domain knowledge makes it useful.&lt;/p&gt;

&lt;h2&gt;
  
  
  The prerequisite nobody remembers
&lt;/h2&gt;

&lt;p&gt;By default, toad-eye doesn't record prompts and completions in traces. &lt;a href="https://dev.to/vola-trebla/opentelemetry-just-standardized-llm-tracing-heres-what-it-actually-looks-like-in-code-2e5f"&gt;Article #3&lt;/a&gt; explained why — the OTel spec says don't, and your security team agrees.&lt;/p&gt;

&lt;p&gt;But for trace-to-eval export, you need the content.&lt;/p&gt;

&lt;p&gt;We learned this the embarrassing way. Built the entire &lt;code&gt;export-trace&lt;/code&gt; pipeline, ran it on our own Jaeger instance, and got:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;✗ No exportable spans in trace abc123. Was recordContent enabled?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Half our spans had inputs and outputs as empty strings. The tool worked perfectly — on empty data. Classic.&lt;/p&gt;

&lt;p&gt;Enable it where it matters:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;initObservability&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;serviceName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;my-app&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;recordContent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// enable in staging or for a traffic sample&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The recommendation: enable &lt;code&gt;recordContent&lt;/code&gt; in staging, or use content sampling in production to record a percentage of traffic. Export from those traces. Don't record everything — record enough.&lt;/p&gt;

&lt;h2&gt;
  
  
  From export to CI
&lt;/h2&gt;

&lt;p&gt;The concrete workflow, compressed:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Find interesting traces&lt;/strong&gt; in Jaeger. Look for high-token traces (complex reasoning), traces with tool calls (agent behavior), traces near budget limits (cost-sensitive paths). These are your golden test cases.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Export them:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx toad-eye export-trace abc123def456 &lt;span class="nt"&gt;--output&lt;/span&gt; ./eval-datasets
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Add your assertions&lt;/strong&gt; to the generated YAML. The scaffolding is there — add the domain-specific checks that matter for your use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Run evals on every prompt change:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx toad-eval run &lt;span class="nt"&gt;--dataset&lt;/span&gt; ./eval-datasets/trace-abc123de.eval.yaml &lt;span class="nt"&gt;--model&lt;/span&gt; gpt-4o
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now you know: does prompt v2 pass the same cases that prompt v1 handled in production? Not "it didn't break in the first 2 hours" confidence — "it passes the same inputs our users actually send" confidence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automate it.&lt;/strong&gt; The programmable API (&lt;code&gt;exportTrace&lt;/code&gt;, &lt;code&gt;fetchTrace&lt;/code&gt;, &lt;code&gt;traceToEvalYaml&lt;/code&gt;) lets you build a cron job that exports traces nightly from staging, feeds them into CI, and blocks deploys when regressions are detected. The pieces compose.&lt;/p&gt;
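
&lt;p&gt;A sketch of that job (the three functions are named above, but the signatures assumed here are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Sketch of the nightly job. fetchTrace and traceToEvalYaml are named in the
// text; the signatures assumed here are illustrative.
import { writeFile } from "node:fs/promises";
import { fetchTrace, traceToEvalYaml } from "toad-eye";

async function nightlyExport(traceIds: string[]) {
  for (const id of traceIds) {
    const trace = await fetchTrace(id);   // pull the trace from Jaeger's API
    const yaml = traceToEvalYaml(trace);  // convert spans to an eval dataset
    await writeFile(`./eval-datasets/trace-${id.slice(0, 8)}.eval.yaml`, yaml);
  }
}
// In CI: run `npx toad-eval run` over the written datasets and fail on regressions.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;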

&lt;h2&gt;
  
  
  Why OTel-native matters here
&lt;/h2&gt;

&lt;p&gt;This workflow only works because toad-eye uses OpenTelemetry. The trace format is standard. Jaeger stores it. The export reads it via Jaeger's API. No vendor lock-in, no proprietary format, no "export your data" button that gives you a CSV.&lt;/p&gt;

&lt;p&gt;If you're using Langfuse or Arize, you can build the same pipeline — through their API, in their format, with their rate limits. With OTel, your traces are yours. They live in your Jaeger. You query them whenever you want.&lt;/p&gt;

&lt;h2&gt;
  
  
  What comes next
&lt;/h2&gt;

&lt;p&gt;The manual export covers "build a dataset, run evals in CI." But there's a second mode we're working toward: inline eval callbacks where every completed span triggers a scoring function automatically. No Jaeger query, no manual export — production traffic scores itself in real time.&lt;/p&gt;

&lt;p&gt;That's a separate deep dive. For now, the manual pipeline is the foundation — and it's already more than most teams have.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick checklist
&lt;/h2&gt;

&lt;p&gt;If you want to start building eval datasets from production traces:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Enable &lt;code&gt;recordContent: true&lt;/code&gt; in staging or for a traffic sample&lt;/li&gt;
&lt;li&gt;Find 10-20 traces that represent your core use cases&lt;/li&gt;
&lt;li&gt;Export with &lt;code&gt;npx toad-eye export-trace &amp;lt;trace_id&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Add domain-specific assertions to the generated YAML&lt;/li&gt;
&lt;li&gt;Run evals against your current prompt — establish the baseline&lt;/li&gt;
&lt;li&gt;Run evals against every prompt change before deploying&lt;/li&gt;
&lt;li&gt;Automate: nightly exports, CI runs evals on PR&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your traces already contain the best test data you'll ever get — real inputs from real users. Stop letting them rot in Jaeger.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Previous articles:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/vola-trebla/opentelemetry-just-standardized-llm-tracing-heres-what-it-actually-looks-like-in-code-2e5f"&gt;#3: OpenTelemetry just standardized LLM tracing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/vola-trebla/your-llm-streaming-traces-are-lying-to-you-53f0"&gt;#4: Your LLM streaming traces are lying to you&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/vola-trebla/your-ai-agent-re-sends-80-of-your-budget-every-loop-27an"&gt;#5: Your AI agent re-sends 80% of your budget every loop&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;toad-eye&lt;/strong&gt; — open-source LLM observability, OTel-native: &lt;a href="https://github.com/vola-trebla/toad-eye" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://www.npmjs.com/package/toad-eye" rel="noopener noreferrer"&gt;npm&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🐸👁️&lt;/p&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>typescript</category>
      <category>opentelemetry</category>
    </item>
    <item>
      <title>Your AI agent re-sends 80% of your budget every loop</title>
      <dc:creator>Albert Alov</dc:creator>
      <pubDate>Fri, 27 Mar 2026 17:35:50 +0000</pubDate>
      <link>https://dev.to/vola-trebla/your-ai-agent-re-sends-80-of-your-budget-every-loop-27an</link>
      <guid>https://dev.to/vola-trebla/your-ai-agent-re-sends-80-of-your-budget-every-loop-27an</guid>
      <description>&lt;p&gt;Your ReAct agent runs 15 turns. By turn 10, &lt;code&gt;input_tokens&lt;/code&gt; is 87K. You're re-sending the entire conversation history every single iteration.&lt;/p&gt;

&lt;p&gt;That's not generation cost. That's &lt;em&gt;re-reading&lt;/em&gt; cost. And no observability tool shows you the trajectory.&lt;/p&gt;

&lt;p&gt;We built a metric for it. Then we built a guard that stops the bleed before it kills your budget. Here's the problem, the math, and the fix.&lt;/p&gt;




&lt;h2&gt;
  
  
  The invisible cost of agent loops
&lt;/h2&gt;

&lt;p&gt;Here's how a typical ReAct agent works:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Turn  1: system prompt + user query                       →    1,200 input tokens
Turn  2: + assistant response + tool result                →    3,800 input tokens
Turn  5: + three more rounds of think/act/observe          →   15,000 input tokens
Turn 10: the entire conversation so far                    →   87,000 input tokens
Turn 15: approaching the context limit                     →  152,000 input tokens
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every turn re-sends everything. The system prompt. The user's question. Every assistant response. Every tool result. The LLM has no memory between calls — you're paying to "remind" it what happened.&lt;/p&gt;
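
&lt;p&gt;The loop that produces this looks innocuous; the cost hides in the fact that &lt;code&gt;messages&lt;/code&gt; only ever grows. A generic sketch (the names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;const messages = [systemPrompt, userQuery];
let done = false;
while (!done) {
  // every iteration ships the ENTIRE array back to the API
  const res = await llm.chat(messages);
  messages.push(res.assistantMessage);
  if (res.toolCall) {
    messages.push(await runTool(res.toolCall)); // tool results pile up too
  } else {
    done = true;
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;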

&lt;p&gt;On GPT-4o ($2.50/M input tokens):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Turn&lt;/th&gt;
&lt;th&gt;Input tokens&lt;/th&gt;
&lt;th&gt;Cumulative input cost&lt;/th&gt;
&lt;th&gt;New generation cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1,200&lt;/td&gt;
&lt;td&gt;$0.003&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;15,000&lt;/td&gt;
&lt;td&gt;$0.07&lt;/td&gt;
&lt;td&gt;$0.002&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;87,000&lt;/td&gt;
&lt;td&gt;$0.29&lt;/td&gt;
&lt;td&gt;$0.003&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;15&lt;/td&gt;
&lt;td&gt;152,000&lt;/td&gt;
&lt;td&gt;$0.67&lt;/td&gt;
&lt;td&gt;$0.003&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The generation column barely moves. You're paying $0.67 to re-read context, and $0.03 for the model to actually think. That's 96% overhead.&lt;/p&gt;

&lt;p&gt;Switch to Claude Opus ($15/M input) and those numbers are 6x worse. A 15-turn agent run costs $4 in re-reads alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why your dashboard doesn't show this
&lt;/h2&gt;

&lt;p&gt;Open your observability tool. You'll see total tokens per request, cost per request, latency per request. All per-request. All in isolation.&lt;/p&gt;

&lt;p&gt;None of these tell you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What percentage of the context window is used at each turn&lt;/li&gt;
&lt;li&gt;How fast utilization is growing&lt;/li&gt;
&lt;li&gt;When you'll hit the model's limit&lt;/li&gt;
&lt;li&gt;That 80% of your input tokens are the same conversation sent again&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your dashboard shows snapshots. It doesn't show the trajectory — the runaway growth curve eating your budget across a multi-turn session.&lt;/p&gt;

&lt;p&gt;This is the metric that was missing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Context utilization: one ratio that changes everything
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;utilization&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;input_tokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;max_context_tokens&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Input tokens divided by the model's maximum context window. A number between 0 and 1.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Turn  1: utilization = 0.01   — plenty of room
Turn  5: utilization = 0.12   — still fine
Turn 10: utilization = 0.68   — growing fast
Turn 13: utilization = 0.85   — danger zone
Turn 15: utilization = 0.95   — one tool result away from hitting the wall
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Plot this on a chart and you see the growth curve &lt;em&gt;before&lt;/em&gt; it becomes a cost problem. At a glance you know: how close you are to the limit, how fast you're approaching it, and which agent is at risk.&lt;/p&gt;

&lt;h2&gt;
  
  
  How we record it
&lt;/h2&gt;

&lt;p&gt;In toad-eye, context utilization is calculated automatically after every LLM call. If the model is in the pricing table, the metric is emitted — you don't do anything:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;pricing&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;getModelPricing&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;pricing&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="nx"&gt;maxContextTokens&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;inputTokens&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;utilization&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;inputTokens&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nx"&gt;pricing&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;maxContextTokens&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

  &lt;span class="c1"&gt;// On the span — queryable in Jaeger&lt;/span&gt;
  &lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gen_ai.toad_eye.context_utilization&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;utilization&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

  &lt;span class="c1"&gt;// As a histogram — P95/P99 in Grafana&lt;/span&gt;
  &lt;span class="nf"&gt;recordContextUtilization&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;utilization&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pricing table knows every major model's context window:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;           &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;maxContextTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;128&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4.1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;          &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;maxContextTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nx"&gt;_047_576&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-opus-4&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;maxContextTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;claude-sonnet-4&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;maxContextTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="nx"&gt;_000&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gemini-2.5-pro&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;   &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;maxContextTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="nx"&gt;_048_576&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Custom model? Override it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;setCustomPricing&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;my-finetuned-gpt4&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;inputPer1M&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;3.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;outputPer1M&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;12.0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;maxContextTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="nx"&gt;_768&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Context guard: warn before it's too late
&lt;/h2&gt;

&lt;p&gt;A metric tells you what happened. A guard stops it from happening.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nf"&gt;initObservability&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;serviceName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;my-agent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;contextGuard&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;warnAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;// console.warn at 80%&lt;/span&gt;
    &lt;span class="na"&gt;blockAt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;// throw before the LLM call at 95%&lt;/span&gt;
  &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At 80%, you get a warning:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;toad-eye: context window 82% full for gpt-4o (104,960 / 128,000 tokens)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;At 95%, toad-eye throws a &lt;code&gt;ToadContextExceededError&lt;/code&gt; — before the call, not after. Your agent catches it and can act:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;traceLLMCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;));&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nx"&gt;ToadContextExceededError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// err.utilization: 0.96&lt;/span&gt;
    &lt;span class="c1"&gt;// err.inputTokens: 122,880&lt;/span&gt;
    &lt;span class="c1"&gt;// err.maxContextTokens: 128,000&lt;/span&gt;
    &lt;span class="nx"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;compressHistory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="c1"&gt;// retry with compressed context&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The error carries everything you need: current utilization, the threshold, the model, token counts. No guessing.&lt;/p&gt;

&lt;p&gt;When the block fires, toad-eye records it in three places: a span event in Jaeger, a counter metric in Grafana, and a warning in your application logs. One event, full visibility.&lt;/p&gt;
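
&lt;p&gt;Under the hood that's three one-liners against standard OTel APIs; roughly like this (the event and counter names are illustrative, not the exact toad-eye source):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// 1. Span event: visible on the trace in Jaeger
span.addEvent("gen_ai.toad_eye.context_guard.blocked", { utilization });

// 2. Counter metric: chartable and alertable in Grafana
blockedCounter.add(1, { model, provider }); // a Counter from meter.createCounter()

// 3. Log line: greppable wherever your app logs go
console.warn(`toad-eye: blocked LLM call at ${Math.round(utilization * 100)}% context`);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;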

&lt;h2&gt;
  
  
  What to do when utilization is high
&lt;/h2&gt;

&lt;p&gt;The metric and guard tell you there's a problem. Three practical fixes:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Summarize old turns.&lt;/strong&gt; After N turns, replace the conversation history with an LLM-generated summary. Trade 50K tokens of history for a 2K summary. The agent loses some detail but stays under budget.&lt;/p&gt;
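
&lt;p&gt;The &lt;code&gt;compressHistory&lt;/code&gt; from the catch block earlier can start as simple as this sketch: keep the system prompt and the last few turns verbatim, summarize everything in between (the &lt;code&gt;Message&lt;/code&gt; shape and &lt;code&gt;llm.chat&lt;/code&gt; response are assumptions):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;type Message = { role: string; content: string }; // assumed message shape

async function compressHistory(messages: Message[]): Promise&amp;lt;Message[]&amp;gt; {
  const [system, ...rest] = messages;
  const recent = rest.slice(-4);   // keep the last 4 messages verbatim
  const old = rest.slice(0, -4);   // everything else gets summarized
  const summary = await llm.chat([
    { role: "user", content: `Summarize this conversation:\n${JSON.stringify(old)}` },
  ]);
  return [system, { role: "system", content: summary.text }, ...recent];
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;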

&lt;p&gt;&lt;strong&gt;Compress tool results.&lt;/strong&gt; Tool results are the biggest token hogs — a web search returning 10K tokens of HTML. Summarize tool results before adding them to context. Or store full results externally and put just a reference in context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Route to a bigger model.&lt;/strong&gt; When utilization crosses a threshold, switch models. Running on &lt;code&gt;gpt-4o&lt;/code&gt; (128K)? Route to &lt;code&gt;gpt-4.1&lt;/code&gt; (1M) for the final turns:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt; &lt;span class="k"&gt;instanceof&lt;/span&gt; &lt;span class="nx"&gt;ToadContextExceededError&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;model&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;callWithModel&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4.1&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And sometimes the right answer is to stop. If your agent hasn't converged in 10 turns, adding 5 more turns of context won't help — it'll just cost more.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where this came from
&lt;/h2&gt;

&lt;p&gt;After publishing &lt;a href="https://dev.to/vola-trebla/opentelemetry-just-standardized-llm-tracing-heres-what-it-actually-looks-like-in-code"&gt;article #3&lt;/a&gt; on OTel semantic conventions, a reader named @jidong left a comment:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Context window usage per turn matters more than total tokens in agent loops."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;They were right. Total tokens is a number. Context utilization is a trajectory. The first tells you what happened. The second tells you what's about to happen.&lt;/p&gt;

&lt;p&gt;We built &lt;code&gt;context_utilization&lt;/code&gt; the next week. &lt;a href="https://github.com/vola-trebla/toad-eye/issues/188" rel="noopener noreferrer"&gt;Here's the tracking issue&lt;/a&gt; — shaped directly by that comment.&lt;/p&gt;




&lt;h2&gt;
  
  
  Quick checklist
&lt;/h2&gt;

&lt;p&gt;If you're running agents in production:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Monitor &lt;code&gt;input_tokens&lt;/code&gt; per turn, not just per session&lt;/li&gt;
&lt;li&gt;Calculate context utilization: &lt;code&gt;input_tokens / max_context_tokens&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Alert when P95 utilization crosses 0.7 (query sketch after this list)&lt;/li&gt;
&lt;li&gt;Guard at 80% (warn) and 95% (block)&lt;/li&gt;
&lt;li&gt;Have a compression strategy ready before you hit the limit&lt;/li&gt;
&lt;li&gt;Test with 10+ turn runs — the problem only shows up at scale&lt;/li&gt;
&lt;/ul&gt;
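
&lt;p&gt;For that alerting item, the histogram metric makes it a one-line PromQL rule. The metric name below is how such a histogram typically lands in Prometheus; check what your exporter actually emits:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# P95 context utilization over 5m, per model; alert when it crosses 0.7
histogram_quantile(0.95,
  sum by (le, model) (rate(gen_ai_toad_eye_context_utilization_bucket[5m]))
) &amp;gt; 0.7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;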

&lt;p&gt;The metric is simple. The insight it gives you is not.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Previous articles:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/vola-trebla/my-ai-bot-burned-through-my-api-budget-overnight-so-i-built-an-open-source-tool-to-make-sure-it-2372"&gt;#1: My AI bot burned through my API budget overnight&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/vola-trebla/i-audited-my-tool-fixed-44-bugs-and-it-still-didnt-work-4omk"&gt;#2: I audited my tool, fixed 44 bugs — and it still didn't work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/vola-trebla/opentelemetry-just-standardized-llm-tracing-heres-what-it-actually-looks-like-in-code"&gt;#3: OpenTelemetry just standardized LLM tracing&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.toARTICLE_4_URL"&gt;#4: Your LLM streaming traces are lying to you&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;toad-eye&lt;/strong&gt; — open-source LLM observability, OTel-native: &lt;a href="https://github.com/vola-trebla/toad-eye" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://www.npmjs.com/package/toad-eye" rel="noopener noreferrer"&gt;npm&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🐸👁️&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opentelemetry</category>
      <category>typescript</category>
      <category>observability</category>
    </item>
    <item>
      <title>Your LLM streaming traces are lying to you</title>
      <dc:creator>Albert Alov</dc:creator>
      <pubDate>Tue, 24 Mar 2026 10:41:53 +0000</pubDate>
      <link>https://dev.to/vola-trebla/your-llm-streaming-traces-are-lying-to-you-53f0</link>
      <guid>https://dev.to/vola-trebla/your-llm-streaming-traces-are-lying-to-you-53f0</guid>
      <description>&lt;p&gt;Your traces say the streaming call used 0 tokens and cost $0. Your agent made 3 tool calls but the trace shows none. Latency reads 2.5 seconds — but you have no idea if that was 200ms thinking and 2.3s generating, or 2s stuck in prefill and 500ms actually writing.&lt;/p&gt;

&lt;p&gt;Every LLM SDK returns &lt;code&gt;stream: true&lt;/code&gt; differently. Most observability tools treat streaming as an afterthought. The result: your traces are confidently wrong.&lt;/p&gt;

&lt;p&gt;We shipped streaming support in toad-eye v2.2. It passed 252 tests. Then we ran it against real providers and discovered it reported 0 tokens for every single streaming call. This article is about the 5 ways streaming traces lie — and the fixes we shipped across 5 PRs to make them stop.&lt;/p&gt;




&lt;h2&gt;
  
  
  Lie #1: "0 tokens used, $0 cost"
&lt;/h2&gt;

&lt;p&gt;This one is silent and expensive.&lt;/p&gt;

&lt;p&gt;OpenAI does not send usage data in streaming chunks by default. Every chunk arrives with &lt;code&gt;choices[0].delta.content&lt;/code&gt; — the text — but no &lt;code&gt;usage&lt;/code&gt; field. The token counts simply aren't there unless you ask for them.&lt;/p&gt;

&lt;p&gt;You have to explicitly inject this into the request body:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[...],&lt;/span&gt;
  &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="nx"&gt;stream_options&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nl"&gt;include_usage&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;// without this: 0 tokens forever&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this flag, OpenAI sends one final chunk with an empty &lt;code&gt;choices&lt;/code&gt; array and a populated &lt;code&gt;usage&lt;/code&gt; object. Without it, your accumulator dutifully records &lt;code&gt;inputTokens: 0&lt;/code&gt;, &lt;code&gt;outputTokens: 0&lt;/code&gt;, and your cost dashboards show $0 while your bill grows.&lt;/p&gt;

&lt;p&gt;The fix in toad-eye: we auto-inject &lt;code&gt;stream_options&lt;/code&gt; before the call reaches the SDK. Users don't need to know about it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk6vsaajt0yhmsxow82oz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk6vsaajt0yhmsxow82oz.png" alt="Screenshot: stream_options injection diff"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;PR #179: one mutation that turns invisible streaming costs into real numbers.&lt;/em&gt;&lt;/p&gt;
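
&lt;p&gt;Conceptually, the mutation is tiny (simplified, not the literal diff):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Before the request reaches the OpenAI SDK:
if (params.stream &amp;amp;&amp;amp; !params.stream_options) {
  params.stream_options = { include_usage: true }; // otherwise: 0 tokens forever
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;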

&lt;p&gt;Here's the fun part: our budget guards use token counts to enforce spend limits. With 0 tokens, every streaming call looked "free" — so budget guards never triggered. The feature designed to prevent the exact problem from &lt;a href="https://dev.to/vola-trebla/my-ai-bot-burned-through-my-api-budget-overnight-so-i-built-an-open-source-tool-to-make-sure-it-2372"&gt;article #1&lt;/a&gt; was quietly disabled for all streaming traffic.&lt;/p&gt;
&lt;h2&gt;
  
  
  Lie #2: "No tool calls happened"
&lt;/h2&gt;

&lt;p&gt;When an LLM calls a tool during streaming, the chunks don't arrive as a neat JSON object. They arrive in pieces:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Chunk&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"delta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool_calls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"function"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Chunk&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"delta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool_calls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"function"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;q&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;:"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Chunk&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"delta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"tool_calls"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"function"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"arguments"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;" &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;weather&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The function name comes in one chunk. The arguments arrive character by character across dozens of chunks. If your accumulator only captures &lt;code&gt;delta.content&lt;/code&gt; (text), tool calls are invisible.&lt;/p&gt;

&lt;p&gt;Anthropic does it differently — tool use arrives as a &lt;code&gt;content_block_start&lt;/code&gt; with &lt;code&gt;type: "tool_use"&lt;/code&gt;, then &lt;code&gt;input_json_delta&lt;/code&gt; events build the arguments incrementally. Same problem, different wire format.&lt;/p&gt;

&lt;p&gt;Our &lt;code&gt;StreamAccumulator&lt;/code&gt; now tracks tool calls alongside text:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kr"&gt;interface&lt;/span&gt; &lt;span class="nx"&gt;StreamAccumulator&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;inputTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;outputTokens&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;number&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="nl"&gt;toolCalls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;Array&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;         &lt;span class="c1"&gt;// NEW&lt;/span&gt;
    &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;id&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrnnk1xlf5du34ehjzrx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbrnnk1xlf5du34ehjzrx.png" alt="Screenshot: tool calls accumulator diff"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;PR #180: tool calls captured across all three providers.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For agent observability, this matters a lot. Without tool call data on streaming spans, your Jaeger trace shows the agent "thought" but not what it did. The most useful part of the trace was missing.&lt;/p&gt;
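
&lt;p&gt;The fiddly part on the OpenAI side is the merge: the &lt;code&gt;index&lt;/code&gt; field says which in-flight tool call a fragment belongs to, the name arrives once, and the arguments drip in as string fragments. A simplified sketch of that accumulation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;for (const tc of chunk.choices[0]?.delta?.tool_calls ?? []) {
  // create the slot on first sight of this index
  const slot = (acc.toolCalls[tc.index] ??= { name: "", arguments: "" });
  if (tc.id) slot.id = tc.id;
  if (tc.function?.name) slot.name = tc.function.name;
  if (tc.function?.arguments) slot.arguments += tc.function.arguments;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;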
&lt;h2&gt;
  
  
  Lie #3: "Latency = 2.5s"
&lt;/h2&gt;

&lt;p&gt;A single duration number for a streaming call is almost meaningless. Two calls can both take 2.5 seconds with completely different stories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Call A:&lt;/strong&gt; 200ms to first token, 2.3s generating 500 tokens. Model responded fast, lots of output.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call B:&lt;/strong&gt; 2.4s to first token, 100ms generating 20 tokens. Model was stuck in prefill — probably a huge prompt.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The diagnoses are opposite. Call A is healthy. Call B has a context-size problem. Same "latency."&lt;/p&gt;

&lt;p&gt;The OTel spec recommends three TTFT signals. We now emit all three:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// In onFirstChunk callback:&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;ttft&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;performance&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="nx"&gt;start&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// 1. Histogram metric (P95/P99 across requests)&lt;/span&gt;
&lt;span class="nf"&gt;recordTimeToFirstToken&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;ttft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;provider&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;model&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// 2. Span event (per-trace debugging in Jaeger)&lt;/span&gt;
&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;addEvent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gen_ai.content.first_token&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gen_ai.response.time_to_first_token_ms&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;ttft&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// 3. Span attribute (easy ad-hoc queries)&lt;/span&gt;
&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gen_ai.response.time_to_first_token_ms&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;ttft&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Plus: decode latency = total - TTFT&lt;/span&gt;
&lt;span class="c1"&gt;// gen_ai.toad_eye.latency.decode_ms&lt;/span&gt;
&lt;span class="c1"&gt;// gen_ai.toad_eye.throughput.tokens_per_second&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now when a call is slow, the first question is: prefill or decode? The answer changes everything about what you fix.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lie #4: "No thinking happened"
&lt;/h2&gt;

&lt;p&gt;Anthropic's extended thinking feature sends &lt;code&gt;thinking&lt;/code&gt; content blocks — the model's reasoning before it responds. These arrive as &lt;code&gt;thinking_delta&lt;/code&gt; chunks, separate from the regular &lt;code&gt;content_block_delta&lt;/code&gt; text chunks.&lt;/p&gt;

&lt;p&gt;Most tracers don't handle them. The thinking tokens disappear. But they cost money — billed at a different rate — and they represent real compute time that shows up in your latency but not in your traces.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Anthropic chunk types during extended thinking:&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;content_block_start&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;content_block&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;thinking&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;content_block_delta&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;delta&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;thinking_delta&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;thinking&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Let me analyze...&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="c1"&gt;// ...many thinking chunks...&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;content_block_start&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;content_block&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;content_block_delta&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;delta&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text_delta&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;text&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Here's my answer:&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Our accumulator now tracks thinking separately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;?.&lt;/span&gt;&lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;thinking_delta&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nx"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;thinkingContent&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;thinking&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="c1"&gt;// tracked separately — not appended to completion&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This means you can see in your trace: "the model spent 3 seconds thinking, generated 2,000 thinking tokens, then responded in 500ms with 200 output tokens." Without this, the 3 seconds of thinking looks like slow latency and the thinking tokens are unaccounted cost.&lt;/p&gt;

&lt;h2&gt;
  
  
  Lie #5: "The call succeeded"
&lt;/h2&gt;

&lt;p&gt;User opens your AI chat. Streaming starts. After 3 seconds and 150 tokens, user closes the tab. Browser kills the connection. Your server's async iterator throws or the &lt;code&gt;for await&lt;/code&gt; loop ends early.&lt;/p&gt;

&lt;p&gt;What does your trace say? If the span is only finalized in &lt;code&gt;onComplete&lt;/code&gt;, and &lt;code&gt;onComplete&lt;/code&gt; only fires when the stream is fully exhausted — the span is either missing entirely or stuck open forever.&lt;/p&gt;

&lt;p&gt;Our fix: a &lt;code&gt;finally&lt;/code&gt; block that fires regardless:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="nx"&gt;wrapAsyncIterable&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;T&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;accumulate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;onFirstChunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;onComplete&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;onError&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;completed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;errored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="k"&gt;await &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt; &lt;span class="k"&gt;of&lt;/span&gt; &lt;span class="nx"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="c1"&gt;// accumulate...&lt;/span&gt;
      &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="nx"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nx"&gt;completed&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;onComplete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;catch &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;errored&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nf"&gt;onError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;throw&lt;/span&gt; &lt;span class="nx"&gt;err&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;finally&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Consumer broke out early — still record partial data&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;completed&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nx"&gt;errored&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nf"&gt;onComplete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;acc&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// records whatever we accumulated so far&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;finally&lt;/code&gt; block records partial data: tokens consumed so far, text generated so far, duration up to the point of abandonment. The span closes with real data instead of silence. You billed for those 150 tokens — your trace should show them.&lt;/p&gt;




&lt;h2&gt;
  
  
  The provider chaos table
&lt;/h2&gt;

&lt;p&gt;Building all of this required handling three completely different SSE implementations. Here's the reality:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Text&lt;/th&gt;
&lt;th&gt;Tokens&lt;/th&gt;
&lt;th&gt;Tool calls&lt;/th&gt;
&lt;th&gt;Thinking&lt;/th&gt;
&lt;th&gt;Gotchas&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;OpenAI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;delta.content&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Final chunk only, opt-in via &lt;code&gt;stream_options&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;delta.tool_calls[]&lt;/code&gt; with index&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;Empty &lt;code&gt;choices&lt;/code&gt; on final chunk — don't discard it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Anthropic&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;content_block_delta&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Split: &lt;code&gt;message_start&lt;/code&gt; (input) + &lt;code&gt;message_delta&lt;/code&gt; (output)&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;content_block_start&lt;/code&gt; type &lt;code&gt;tool_use&lt;/code&gt; + &lt;code&gt;input_json_delta&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;&lt;code&gt;thinking_delta&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Requires state machine for event types&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Gemini&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;chunk.text()&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;usageMetadata&lt;/code&gt; overwrites each chunk&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;functionCall&lt;/code&gt; in parts&lt;/td&gt;
&lt;td&gt;N/A&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;text()&lt;/code&gt; throws on safety-blocked content&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Three providers. Three formats. One &lt;code&gt;StreamAccumulator&lt;/code&gt; interface. Each provider gets its own &lt;code&gt;accumulateChunk()&lt;/code&gt; extractor that normalizes everything into the same shape.&lt;/p&gt;
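
&lt;p&gt;The shape of that normalization layer, roughly (the per-provider helpers are illustrative names):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;function accumulateChunk(provider: string, acc: StreamAccumulator, chunk: unknown) {
  switch (provider) {
    case "openai":    return accumulateOpenAI(acc, chunk);    // delta.* shapes
    case "anthropic": return accumulateAnthropic(acc, chunk); // event-typed SSE
    case "gemini":    return accumulateGemini(acc, chunk);    // usageMetadata overwrites
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;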

&lt;h2&gt;
  
  
  What your streaming traces should show
&lt;/h2&gt;

&lt;p&gt;After these fixes, here's what each streaming span contains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight properties"&gt;&lt;code&gt;&lt;span class="py"&gt;gen_ai.operation.name&lt;/span&gt;          &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"chat"&lt;/span&gt;
&lt;span class="py"&gt;gen_ai.provider.name&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"openai"&lt;/span&gt;
&lt;span class="py"&gt;gen_ai.request.model&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;"gpt-4o"&lt;/span&gt;
&lt;span class="py"&gt;gen_ai.usage.input_tokens&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;1,847          ← was 0&lt;/span&gt;
&lt;span class="py"&gt;gen_ai.usage.output_tokens&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;423            ← was 0&lt;/span&gt;
&lt;span class="py"&gt;gen_ai.toad_eye.cost&lt;/span&gt;           &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;0.00886        ← was $0&lt;/span&gt;
&lt;span class="py"&gt;gen_ai.toad_eye.tool.calls&lt;/span&gt;     &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;2              ← was invisible&lt;/span&gt;
&lt;span class="py"&gt;gen_ai.response.time_to_first_token_ms&lt;/span&gt; &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;340    ← was mixed into total&lt;/span&gt;
&lt;span class="py"&gt;gen_ai.toad_eye.latency.decode_ms&lt;/span&gt;      &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;1,960  ← didn't exist&lt;/span&gt;
&lt;span class="py"&gt;gen_ai.toad_eye.context_utilization&lt;/span&gt;    &lt;span class="p"&gt;=&lt;/span&gt; &lt;span class="s"&gt;0.014   ← didn't exist&lt;/span&gt;

&lt;span class="err"&gt;Span&lt;/span&gt; &lt;span class="py"&gt;event&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gen_ai.content.first_token at +340ms&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Every number was either wrong or missing before. Now it's real.&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick checklist
&lt;/h2&gt;

&lt;p&gt;If you're tracing LLM streaming — in toad-eye or your own code — check these:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are you injecting &lt;code&gt;stream_options: { include_usage: true }&lt;/code&gt; for OpenAI?&lt;/li&gt;
&lt;li&gt;Does your accumulator capture tool call chunks, not just text?&lt;/li&gt;
&lt;li&gt;Do you split TTFT from total duration?&lt;/li&gt;
&lt;li&gt;Do you handle Anthropic &lt;code&gt;thinking_delta&lt;/code&gt; if using extended thinking?&lt;/li&gt;
&lt;li&gt;Does your span close correctly when the stream is abandoned?&lt;/li&gt;
&lt;li&gt;Is your &lt;code&gt;finally&lt;/code&gt; block recording partial data?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If any answer is "no" or "I'm not sure" — your streaming traces are lying to you.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Previous articles:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/vola-trebla/my-ai-bot-burned-through-my-api-budget-overnight-so-i-built-an-open-source-tool-to-make-sure-it-2372"&gt;#1: My AI bot burned through my API budget overnight&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/vola-trebla/i-audited-my-tool-fixed-44-bugs-and-it-still-didnt-work-4omk"&gt;#2: I audited my tool, fixed 44 bugs — and it still didn't work&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/vola-trebla/opentelemetry-just-standardized-llm-tracing-heres-what-it-actually-looks-like-in-code"&gt;#3: OpenTelemetry just standardized LLM tracing&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;toad-eye&lt;/strong&gt; — open-source LLM observability, OTel-native: &lt;a href="https://github.com/vola-trebla/toad-eye" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://www.npmjs.com/package/toad-eye" rel="noopener noreferrer"&gt;npm&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🐸👁️&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opentelemetry</category>
      <category>typescript</category>
      <category>observability</category>
    </item>
    <item>
      <title>OpenTelemetry just standardized LLM tracing. Here's what it actually looks like in code.</title>
      <dc:creator>Albert Alov</dc:creator>
      <pubDate>Sat, 21 Mar 2026 21:47:33 +0000</pubDate>
      <link>https://dev.to/vola-trebla/opentelemetry-just-standardized-llm-tracing-heres-what-it-actually-looks-like-in-code-2e5f</link>
      <guid>https://dev.to/vola-trebla/opentelemetry-just-standardized-llm-tracing-heres-what-it-actually-looks-like-in-code-2e5f</guid>
      <description>&lt;p&gt;Every LLM tool invents its own tracing format. Langfuse has one. Helicone has one. Arize has one. If you built your own — congratulations, you have one too.&lt;/p&gt;

&lt;p&gt;OpenTelemetry just published a standard for all of them.&lt;/p&gt;

&lt;p&gt;It defines how to name spans, what attributes a tool call should have, how to log prompts without leaking PII, and which span kind to use for an agent. It's called GenAI Semantic Conventions. It's experimental. And almost nobody has written about what it actually looks like when you implement it.&lt;/p&gt;

&lt;p&gt;I know because I searched. "OTel GenAI semantic conventions" gives you spec pages. Zero practical articles. "How to trace LLM agent with OpenTelemetry" gives you StackOverflow questions with no answers.&lt;/p&gt;

&lt;p&gt;We implemented it. Four PRs, a gap analysis, real before/after code. We also discovered, mid-implementation, that our traces never exported at all — but that's a &lt;a href="https://dev.to/vola-trebla/i-audited-my-tool-fixed-44-bugs-and-it-still-didnt-work-4omk"&gt;different story&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Here's what the spec actually says, where we got it wrong, and what you should do today.&lt;/p&gt;




&lt;h2&gt;
  
  
  The wild west of LLM tracing
&lt;/h2&gt;

&lt;p&gt;Right now, if you trace LLM calls, you're probably doing something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;llm.provider&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;llm.model&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;llm.tokens.input&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;llm.cost&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.003&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's what we did in toad-eye v1. Made sense to us. Worked fine in our dashboards.&lt;/p&gt;

&lt;p&gt;Problem: nobody else's dashboards understand these attributes. Switch from Jaeger to Arize Phoenix — reconfigure everything. Export traces to Datadog — they see raw spans with no LLM context. Your tracing is a walled garden. You built vendor lock-in into your own code.&lt;/p&gt;

&lt;p&gt;This is exactly what OpenTelemetry was created to solve. And now it has a spec for GenAI.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three types of GenAI spans
&lt;/h2&gt;

&lt;p&gt;The spec defines three operations. Every LLM-related span gets one:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;chat gpt-4o                    ← model call
invoke_agent orchestrator      ← agent invocation  
execute_tool web_search        ← tool execution
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The span name format is &lt;code&gt;{operation} {name}&lt;/code&gt;. Not your custom format. Not &lt;code&gt;gen_ai.openai.gpt-4o&lt;/code&gt; (that's what we had — no backend recognizes it).&lt;/p&gt;

&lt;p&gt;Here's what we changed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ezxomm21gsx5yjv1m5f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8ezxomm21gsx5yjv1m5f.png" alt="Screenshot: Span"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Span naming migration: the old format was invisible to every GenAI-aware backend.&lt;/em&gt;&lt;/p&gt;
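&lt;p&gt;The rename itself is small. A sketch with the plain OTel API (the tracer name is arbitrary):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("my-app");

// "{operation} {name}" as the span name, plus the matching attribute.
const span = tracer.startSpan("chat gpt-4o");
span.setAttribute("gen_ai.operation.name", "chat");
span.setAttribute("gen_ai.request.model", "gpt-4o");
// ... run the model call ...
span.end();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;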
&lt;h2&gt;
  
  
  Agent attributes — we had the paths wrong
&lt;/h2&gt;

&lt;p&gt;If you're building agents (ReAct, tool-use, multi-step), the spec defines identity and tool attributes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// What OTel says:&lt;/span&gt;
&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gen_ai.agent.name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;weather-bot&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gen_ai.agent.id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;agent-001&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gen_ai.tool.name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;search&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gen_ai.tool.type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;function&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// What we had:&lt;/span&gt;
&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gen_ai.agent.tool.name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;search&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;  &lt;span class="c1"&gt;// wrong path&lt;/span&gt;
&lt;span class="c1"&gt;// gen_ai.agent.name — didn't exist at all&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;gen_ai.agent.tool.name&lt;/code&gt; path looks reasonable. It even reads well. But the spec puts tool attributes at &lt;code&gt;gen_ai.tool.*&lt;/code&gt; — flat, not nested under agent. Our format, again, invisible to any backend that follows the standard.&lt;/p&gt;

&lt;h2&gt;
  
  
  Content recording — the spec agrees with us (feels good)
&lt;/h2&gt;

&lt;p&gt;This was the one thing we got right from day one, and it's worth calling out because most teams get it wrong.&lt;/p&gt;

&lt;p&gt;The spec says: &lt;strong&gt;don't record prompts and completions by default.&lt;/strong&gt; Instrumentations SHOULD NOT capture content unless explicitly enabled.&lt;/p&gt;

&lt;p&gt;Three official patterns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Default: don't record.&lt;/strong&gt; No prompt, no completion in spans. Privacy first.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opt-in via span attributes.&lt;/strong&gt; &lt;code&gt;gen_ai.input.messages&lt;/code&gt; and &lt;code&gt;gen_ai.output.messages&lt;/code&gt; as JSON strings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;External storage.&lt;/strong&gt; Store content elsewhere, put a reference on the span.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We had &lt;code&gt;recordContent: false&lt;/code&gt; as default since v1. When the spec confirmed this approach, it was one of those rare moments where your gut feeling gets validated by a committee of very smart people.&lt;/p&gt;
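&lt;p&gt;Pattern 2 is tiny in code. A sketch, assuming a &lt;code&gt;recordContent&lt;/code&gt; flag like ours; the message arrays are whatever your instrumentation already has in hand:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import type { Span } from "@opentelemetry/api";

// Opt-in only: with recordContent=false (the default), spans carry no content.
function maybeRecordContent(
  span: Span,
  recordContent: boolean,
  input: unknown[],
  output: unknown[]
): void {
  if (!recordContent) return;
  span.setAttribute("gen_ai.input.messages", JSON.stringify(input));
  span.setAttribute("gen_ai.output.messages", JSON.stringify(output));
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;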

&lt;p&gt;If you're logging prompts in spans by default — you might want to reconsider before your security team does it for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  The honest gap analysis
&lt;/h2&gt;

&lt;p&gt;Here's the full picture. No spin, no cherry-picking.&lt;/p&gt;

&lt;h3&gt;
  
  
  What we got right from day 1
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Our attribute&lt;/th&gt;
&lt;th&gt;OTel spec&lt;/th&gt;
&lt;th&gt;Verdict&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.provider.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gen_ai.provider.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅ Exact match&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.request.model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gen_ai.request.model&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅ Exact match&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;gen_ai.usage.input_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gen_ai.usage.input_tokens&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅ Exact match&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;error.type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;error.type&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;✅ Exact match&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What we got wrong
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What&lt;/th&gt;
&lt;th&gt;Our version&lt;/th&gt;
&lt;th&gt;OTel spec&lt;/th&gt;
&lt;th&gt;Status&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Span name&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gen_ai.openai.gpt-4o&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;chat gpt-4o&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fixed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tool name attribute&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gen_ai.agent.tool.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gen_ai.tool.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Fixed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Custom attributes&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gen_ai.agent.step.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Reserved namespace&lt;/td&gt;
&lt;td&gt;Moved to &lt;code&gt;gen_ai.toad_eye.*&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Agent identity&lt;/td&gt;
&lt;td&gt;Didn't exist&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gen_ai.agent.name&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Added&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  What we built beyond the spec
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Namespace&lt;/th&gt;
&lt;th&gt;Why it's not in OTel&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cost per request&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gen_ai.toad_eye.cost&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Pricing is vendor-specific&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Budget guards&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gen_ai.toad_eye.budget.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Runtime enforcement ≠ observability&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Shadow guardrails&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gen_ai.toad_eye.guard.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Validation is app-level&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Semantic drift&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gen_ai.toad_eye.semantic_drift&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Quality metric, not trace standard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ReAct step tracking&lt;/td&gt;
&lt;td&gt;&lt;code&gt;gen_ai.toad_eye.agent.step.*&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;ReAct is one pattern; spec is pattern-agnostic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The key insight: &lt;strong&gt;OTel spec covers WHAT happened. We cover WHY and HOW MUCH.&lt;/strong&gt; Not competing — complementary. Your custom metrics go under your namespace. The spec's attributes go where backends expect them.&lt;/p&gt;
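&lt;p&gt;In code the split looks like this (values illustrative, &lt;code&gt;span&lt;/code&gt; as in the earlier snippets):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Spec attributes where backends expect them; extras under our own namespace.
span.setAttribute("gen_ai.usage.input_tokens", 1847); // spec: WHAT happened
span.setAttribute("gen_ai.toad_eye.cost", 0.00886);   // ours: HOW MUCH it cost
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;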

&lt;h2&gt;
  
  
  The migration: dual-emit, don't break users
&lt;/h2&gt;

&lt;p&gt;We didn't do a clean break. v2.4 emits both old and new attribute names:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// New (OTel spec-compliant)&lt;/span&gt;
&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gen_ai.tool.name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;toolName&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// Old (deprecated, still emitted for backward compat)&lt;/span&gt;
&lt;span class="nx"&gt;span&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gen_ai.agent.tool.name&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;toolName&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4qj269knbpmegcwwtlt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl4qj269knbpmegcwwtlt.png" alt="Screenshot: Attribute prefix migration diff with @deprecated tags and dual-emit"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Dual-emit approach: old attributes get &lt;code&gt;@deprecated&lt;/code&gt;, new ones follow the spec. Both emitted until v3.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;An environment variable controls when to stop emitting deprecated attributes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;OTEL_SEMCONV_STABILITY_OPT_IN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;gen_ai_latest_experimental
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
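&lt;p&gt;Wiring that up is one guard per attribute setter. A sketch (the helper is ours, not an OTel API):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import type { Span } from "@opentelemetry/api";

// Once a user opts in to the latest conventions, stop emitting the aliases.
const emitLegacy =
  process.env.OTEL_SEMCONV_STABILITY_OPT_IN !== "gen_ai_latest_experimental";

export function setToolName(span: Span, toolName: string): void {
  span.setAttribute("gen_ai.tool.name", toolName); // spec path
  if (emitLegacy) {
    span.setAttribute("gen_ai.agent.tool.name", toolName); // deprecated alias
  }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;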



&lt;p&gt;This was four PRs (&lt;a href="https://github.com/vola-trebla/toad-eye/pull/170" rel="noopener noreferrer"&gt;#170&lt;/a&gt;, &lt;a href="https://github.com/vola-trebla/toad-eye/pull/171" rel="noopener noreferrer"&gt;#171&lt;/a&gt;, &lt;a href="https://github.com/vola-trebla/toad-eye/pull/172" rel="noopener noreferrer"&gt;#172&lt;/a&gt;, &lt;a href="https://github.com/vola-trebla/toad-eye/pull/173" rel="noopener noreferrer"&gt;#173&lt;/a&gt;). v3 will remove the deprecated aliases entirely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The irony
&lt;/h2&gt;

&lt;p&gt;While implementing all of this, we did a round of manual testing.&lt;/p&gt;

&lt;p&gt;Turns out our traces never exported. At all. Ever. The OTel &lt;code&gt;NodeSDK&lt;/code&gt; silently disables trace export when you pass &lt;code&gt;spanProcessors: []&lt;/code&gt;. We had 252 passing tests. All of them mocked the SDK.&lt;/p&gt;

&lt;p&gt;So we standardized our attributes perfectly — for traces that nobody could see.&lt;/p&gt;

&lt;p&gt;We fixed both. Published six patch versions in one day. The &lt;a href="https://dev.to/vola-trebla/i-audited-my-tool-fixed-44-bugs-and-it-still-didnt-work-4omk"&gt;full story is in article #2&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Which backends actually support this
&lt;/h2&gt;

&lt;p&gt;This is the reason to care. Emit the right attributes today → six backends visualize your traces tomorrow:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Recognizes GenAI spans&lt;/th&gt;
&lt;th&gt;Agent visualization&lt;/th&gt;
&lt;th&gt;Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Jaeger&lt;/td&gt;
&lt;td&gt;Basic (nested spans)&lt;/td&gt;
&lt;td&gt;Hierarchy view&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Arize Phoenix&lt;/td&gt;
&lt;td&gt;Full GenAI UI&lt;/td&gt;
&lt;td&gt;Agent workflow&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SigNoz&lt;/td&gt;
&lt;td&gt;GenAI dashboards&lt;/td&gt;
&lt;td&gt;Nested spans&lt;/td&gt;
&lt;td&gt;Free / Cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Datadog&lt;/td&gt;
&lt;td&gt;LLM Observability&lt;/td&gt;
&lt;td&gt;Agent tracing&lt;/td&gt;
&lt;td&gt;Paid&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Langfuse&lt;/td&gt;
&lt;td&gt;Full GenAI UI&lt;/td&gt;
&lt;td&gt;Session view&lt;/td&gt;
&lt;td&gt;Free tier&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Grafana + Tempo&lt;/td&gt;
&lt;td&gt;Query by attributes&lt;/td&gt;
&lt;td&gt;Custom dashboards&lt;/td&gt;
&lt;td&gt;Free&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;No vendor lock-in. One set of attributes. Six places to visualize them.&lt;/p&gt;

&lt;h2&gt;
  
  
  What you should do today
&lt;/h2&gt;

&lt;p&gt;If you're tracing LLM calls — even with custom code — aligning with the spec now saves you pain later. The conventions are experimental, but the direction is locked in.&lt;/p&gt;

&lt;p&gt;Quick checklist:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Set &lt;code&gt;gen_ai.operation.name&lt;/code&gt; on every LLM span: &lt;code&gt;chat&lt;/code&gt;, &lt;code&gt;invoke_agent&lt;/code&gt;, or &lt;code&gt;execute_tool&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Format span names as &lt;code&gt;{operation} {model_or_agent_name}&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Use official attributes: &lt;code&gt;gen_ai.agent.name&lt;/code&gt;, &lt;code&gt;gen_ai.tool.name&lt;/code&gt;, &lt;code&gt;gen_ai.tool.type&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Put YOUR custom attributes under YOUR namespace — not &lt;code&gt;gen_ai.*&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Don't record prompt/completion by default — make it opt-in&lt;/li&gt;
&lt;li&gt;Test your traces in at least 2 backends (Jaeger + one GenAI-specific like Phoenix)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Full spec: &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" rel="noopener noreferrer"&gt;OpenTelemetry GenAI Semantic Conventions&lt;/a&gt;&lt;br&gt;
Agent spans: &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/" rel="noopener noreferrer"&gt;GenAI Agent Spans&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Previous articles:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/vola-trebla/my-ai-bot-burned-through-my-api-budget-overnight-so-i-built-an-open-source-tool-to-make-sure-it-2372"&gt;#1: My AI bot burned through my API budget overnight&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/vola-trebla/i-audited-my-tool-fixed-44-bugs-and-it-still-didnt-work-4omk"&gt;#2: I audited my tool, fixed 44 bugs — and it still didn't work&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;toad-eye&lt;/strong&gt; — open-source LLM observability, OTel-native: &lt;a href="https://github.com/vola-trebla/toad-eye" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; · &lt;a href="https://www.npmjs.com/package/toad-eye" rel="noopener noreferrer"&gt;npm&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;🐸👁️&lt;/p&gt;

</description>
      <category>ai</category>
      <category>observability</category>
      <category>typescript</category>
      <category>opentelemetry</category>
    </item>
    <item>
      <title>I audited my tool, fixed 44 bugs - and it still didn’t work</title>
      <dc:creator>Albert Alov</dc:creator>
      <pubDate>Sat, 21 Mar 2026 21:25:05 +0000</pubDate>
      <link>https://dev.to/vola-trebla/i-audited-my-tool-fixed-44-bugs-and-it-still-didnt-work-4omk</link>
      <guid>https://dev.to/vola-trebla/i-audited-my-tool-fixed-44-bugs-and-it-still-didnt-work-4omk</guid>
      <description>&lt;p&gt;252 green tests, zero traces in Jaeger, and the one-line OpenTelemetry mistake that made my observability tool blind.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TL;DR&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I shipped an observability tool with &lt;strong&gt;252 green tests&lt;/strong&gt; — but &lt;strong&gt;zero traces&lt;/strong&gt; ever reached Jaeger. The root cause was an OpenTelemetry config detail that looked harmless (&lt;code&gt;spanProcessors: []&lt;/code&gt;) but silently disabled trace export. Manual testing found it in minutes.&lt;/p&gt;

&lt;p&gt;Act 1 · Act 2 · Root cause · Fix · Checklist · Links&lt;/p&gt;

&lt;p&gt;I shipped v2.2.0 of my observability tool with 143 tests and a green CI run.&lt;/p&gt;

&lt;p&gt;Then I did what I thought was the responsible thing: a deep code + DX audit. I found 44 issues, fixed them in a sprint, bumped the version a bunch of times, and ended at v2.4.4 with 252 tests.&lt;/p&gt;

&lt;p&gt;I felt great — until I ran the tool like a real user would.&lt;/p&gt;

&lt;p&gt;

&lt;strong&gt;Zero traces were reaching the backend.&lt;/strong&gt; Not “sometimes.” Not “misconfigured.” Just: never.

&lt;/p&gt;

&lt;p&gt;

&lt;strong&gt;252 unit tests. All green.&lt;/strong&gt; Traces were broken since day one.

&lt;/p&gt;

&lt;p&gt;This is how I found out, what the root cause was, and why tests (and code audits) didn’t see it.&lt;/p&gt;




&lt;h2 id="act-1"&gt;Act 1: The audit (I found what I expected)&lt;/h2&gt;

&lt;p&gt;The audit was useful. It caught real problems — especially the kind that looks “reasonable” in code review and passes unit tests.&lt;/p&gt;

&lt;p&gt;Three examples that matter for the story:&lt;/p&gt;

&lt;h3&gt;
  
  
  1) A privacy feature that leaked PII
&lt;/h3&gt;

&lt;p&gt;I had an &lt;code&gt;auditMasking&lt;/code&gt; mode meant to help debug redaction. Great intention, terrible output: it logged the original unmasked text to stdout.&lt;/p&gt;

&lt;p&gt;If your logs go to CloudWatch/Datadog (they do), stdout isn’t “local debug.” It’s a data pipeline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28kr90xqqvakbxe92n73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28kr90xqqvakbxe92n73.png" alt="leaked" width="800" height="680"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fix: audit mode no longer prints raw input (PII).&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  2) &lt;code&gt;diag.warn()&lt;/code&gt; was invisible by default
&lt;/h3&gt;

&lt;p&gt;I used OpenTelemetry’s &lt;code&gt;diag.warn()&lt;/code&gt; for user-facing warnings.&lt;/p&gt;

&lt;p&gt;Problem: &lt;code&gt;diag.*&lt;/code&gt; emits nothing unless diagnostics are explicitly configured. So warnings existed… but users never saw them. Typo? Missing SDK? Collector down? Silent.&lt;/p&gt;
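&lt;p&gt;If you rely on &lt;code&gt;diag.*&lt;/code&gt; yourself, remember it's opt-in. One line turns it on:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { diag, DiagConsoleLogger, DiagLogLevel } from "@opentelemetry/api";

// Without this, diag.warn() and diag.error() are dropped silently.
diag.setLogger(new DiagConsoleLogger(), DiagLogLevel.WARN);
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;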

&lt;p&gt;&lt;strong&gt;Keep this in mind:&lt;/strong&gt; “silent failure” becomes the recurring theme of this story.&lt;/p&gt;

&lt;h3&gt;
  
  
  3) &lt;code&gt;npx&lt;/code&gt; CLI was completely dead
&lt;/h3&gt;

&lt;p&gt;The CLI entry guard compared a symlink path to a real path, so &lt;code&gt;npx toad-eye ...&lt;/code&gt; produced zero output. Entire CLI dead via npx.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hq3h7h049cufm4rqieh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8hq3h7h049cufm4rqieh.png" alt="CLI" width="800" height="411"&gt;&lt;/a&gt;  &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fix: &lt;code&gt;npx&lt;/code&gt; runs via a symlink — compare real paths or the CLI never executes.&lt;/em&gt;&lt;/p&gt;
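&lt;p&gt;The shape of the fix, sketched (ESM assumed; &lt;code&gt;runCli&lt;/code&gt; stands in for the real entry point):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { realpathSync } from "node:fs";
import { fileURLToPath } from "node:url";

function runCli(): void {
  console.log("toad-eye CLI"); // stand-in for the actual CLI entry point
}

// npx launches the bin through a symlink, so process.argv[1] and this
// module's path only match after both are resolved to real paths.
const invokedAs = realpathSync(process.argv[1]);
const thisModule = realpathSync(fileURLToPath(import.meta.url));

if (invokedAs === thisModule) {
  runCli();
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;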

&lt;p&gt;At this point, the audit felt like a win: 44 issues found, 44 fixed, tests grew from 143 → 252. Ship it.&lt;/p&gt;




&lt;h2 id="act-2"&gt;Act 2: Manual testing (I found what I didn’t expect)&lt;/h2&gt;

&lt;p&gt;After the audit, I wrote a quick testing guide and ran the tool end-to-end:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;npx&lt;/code&gt; init&lt;/li&gt;
&lt;li&gt;import into a tiny app&lt;/li&gt;
&lt;li&gt;run against a real Collector&lt;/li&gt;
&lt;li&gt;confirm traces show up in Jaeger&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is where everything fell apart fast:&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1 — &lt;code&gt;npx toad-eye init&lt;/code&gt;: silence
&lt;/h3&gt;

&lt;p&gt;That was the broken &lt;code&gt;npx&lt;/code&gt; guard (fixed as above). The tool looked “dead” for the most common installation path.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2 — importing with &lt;code&gt;tsx&lt;/code&gt;: &lt;code&gt;ERR_PACKAGE_PATH_NOT_EXPORTED&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;Package exports were missing the &lt;code&gt;"default"&lt;/code&gt; condition. Another “works locally” trap.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3 — Jaeger: nothing
&lt;/h3&gt;

&lt;p&gt;No service. No spans. No errors. No warnings (because &lt;code&gt;diag.warn&lt;/code&gt; was invisible).&lt;/p&gt;

&lt;p&gt;So I did what everyone does: I blamed Docker and infrastructure. I spent an hour tweaking Collector configs, flipping between gRPC and HTTP ports, restarting containers — all while assuming the problem was upstream in Jaeger or the Collector.&lt;/p&gt;

&lt;p&gt;But the pipeline wasn’t broken in the middle.&lt;/p&gt;

&lt;p&gt;It was broken at the source.&lt;/p&gt;




&lt;h2 id="root-cause"&gt;The root cause: I accidentally disabled trace export completely&lt;/h2&gt;

&lt;p&gt;Here’s the bug:&lt;/p&gt;

&lt;p&gt;I passed &lt;code&gt;spanProcessors: []&lt;/code&gt; to OpenTelemetry’s &lt;code&gt;NodeSDK&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;That looks harmless. It’s not.&lt;/p&gt;

&lt;p&gt;An empty &lt;code&gt;spanProcessors&lt;/code&gt; array doesn’t mean “use defaults.”  &lt;/p&gt;

&lt;p&gt;It means “override defaults with nothing.”&lt;/p&gt;

&lt;p&gt;No span processor → nothing exports.&lt;/p&gt;

&lt;p&gt;Metrics still worked (separate pipeline), which made the bug even harder to spot. The tool looked “alive” while traces were dead.&lt;/p&gt;

&lt;p&gt;Even worse: when &lt;code&gt;instrument: ['ai']&lt;/code&gt; was enabled, &lt;code&gt;spanProcessors&lt;/code&gt; became non-empty… but the processor I provided only recorded metrics. I still didn’t include the default BatchSpanProcessor for exporting spans.&lt;/p&gt;

&lt;p&gt;Different code path, same result: zero traces.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; this wasn’t a flaky config issue. Traces never worked for any user. Ever.&lt;/p&gt;




&lt;h2 id="one-line-fix"&gt;The one-line fix&lt;/h2&gt;

&lt;p&gt;The fix is almost insulting:&lt;/p&gt;

&lt;p&gt;Don’t pass an empty array.&lt;/p&gt;

&lt;p&gt;Let the SDK create its default BatchSpanProcessor unless you actually have span processors to set.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3wvsxbqnf7xgtrj1eki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3wvsxbqnf7xgtrj1eki.png" alt="tracer" width="800" height="291"&gt;&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Fix: don’t override the default BatchSpanProcessor with &lt;code&gt;spanProcessors: []&lt;/code&gt;.&lt;/em&gt;&lt;/p&gt;
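&lt;p&gt;In code, the fix is a conditional spread instead of an unconditional override. A sketch, assuming a recent &lt;code&gt;@opentelemetry/sdk-node&lt;/code&gt; that accepts &lt;code&gt;spanProcessors&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;import { NodeSDK } from "@opentelemetry/sdk-node";
import type { SpanProcessor } from "@opentelemetry/sdk-trace-base";

const extraProcessors: SpanProcessor[] = []; // whatever your config produces

const sdk = new NodeSDK({
  serviceName: "my-app",
  // Only override when there's something to override with: an empty array
  // replaces the SDK's default BatchSpanProcessor with nothing at all.
  ...(extraProcessors.length &amp;gt; 0 ? { spanProcessors: extraProcessors } : {}),
});
sdk.start();
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;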




&lt;h2 id="act-3"&gt;Act 3: The takeaway (what changed in how I test)&lt;/h2&gt;

&lt;p&gt;After this, “252 tests” stopped feeling comforting.&lt;/p&gt;

&lt;p&gt;Because the real problem wasn’t “insufficient assertions.”  &lt;/p&gt;

&lt;p&gt;It was: my tests weren’t testing reality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why unit tests didn’t catch it
&lt;/h3&gt;

&lt;p&gt;My unit tests mocked the OpenTelemetry SDK.&lt;/p&gt;

&lt;p&gt;So they verified:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“I call NodeSDK with these options”&lt;/li&gt;
&lt;li&gt;“I register this instrumentation”&lt;/li&gt;
&lt;li&gt;“I construct this processor”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But they didn’t verify the one thing an observability tool must do: do traces actually show up in the backend?&lt;/p&gt;




&lt;h2 id="checklist"&gt;The checklist I’m keeping now&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Practical, not preachy.&lt;/strong&gt; If you ship devtools (especially observability), keep one test path that’s real.&lt;/p&gt;

&lt;h3&gt;
  
  
  Testing (reality checks)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Integration smoke test: run a real Collector + Jaeger and assert at least one span shows up (see the sketch after this list)&lt;/li&gt;
&lt;li&gt;Don’t mock away the pipeline: at least one test should export for real&lt;/li&gt;
&lt;li&gt;When debugging, start at the source (your app), not the destination (Jaeger)&lt;/li&gt;
&lt;/ul&gt;
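&lt;p&gt;A minimal version of that smoke test, assuming Jaeger's HTTP query API on its default port:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Run your instrumented app first and let it flush, then ask Jaeger.
const res = await fetch(
  "http://localhost:16686/api/traces?service=my-app&amp;amp;limit=1&amp;amp;lookback=1h"
);
const body = await res.json();
if (!Array.isArray(body.data) || body.data.length === 0) {
  throw new Error("no traces reached Jaeger: the pipeline is broken at the source");
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;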

&lt;h3&gt;
  
  
  Design (don’t disable defaults accidentally)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Avoid “empty override” configs (&lt;code&gt;[]&lt;/code&gt;) unless you truly mean “disable defaults”&lt;/li&gt;
&lt;li&gt;Treat streaming / special modes as separate first-class paths (parity tests)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  UX (make failure loud)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Make failures visible by default (don’t rely on invisible diagnostics)&lt;/li&gt;
&lt;li&gt;Run the “11pm developer” test: typos, missing Docker, empty dashboards — does the tool explain itself?&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Results (numbers, no hype)
&lt;/h2&gt;

&lt;p&gt;Before:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;v2.2.0&lt;/li&gt;
&lt;li&gt;143 tests&lt;/li&gt;
&lt;li&gt;traces: broken since day 1&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;npx&lt;/code&gt; CLI: silent/dead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;v2.4.4&lt;/li&gt;
&lt;li&gt;252 tests&lt;/li&gt;
&lt;li&gt;code audit: 44 issues found, 44 fixed&lt;/li&gt;
&lt;li&gt;manual testing: 5 critical bugs found, all fixed&lt;/li&gt;
&lt;li&gt;traces: working&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;npx&lt;/code&gt; CLI: working&lt;/li&gt;
&lt;li&gt;npm publishes: 6 (in one day)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2 id="links"&gt;Links&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Repo: &lt;a href="https://github.com/vola-trebla/toad-eye" rel="noopener noreferrer"&gt;https://github.com/vola-trebla/toad-eye&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Article #1: &lt;a href="https://dev.to/vola-trebla/my-ai-bot-burned-through-my-api-budget-overnight-so-i-built-an-open-source-tool-to-make-sure-it-2372"&gt;https://dev.to/vola-trebla/my-ai-bot-burned-through-my-api-budget-overnight-so-i-built-an-open-source-tool-to-make-sure-it-2372&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>opentelemetry</category>
      <category>typescript</category>
    </item>
    <item>
      <title>My AI bot burned through my API budget overnight. So I built an open-source tool to make sure it never happens again.</title>
      <dc:creator>Albert Alov</dc:creator>
      <pubDate>Sat, 21 Mar 2026 00:30:15 +0000</pubDate>
      <link>https://dev.to/vola-trebla/my-ai-bot-burned-through-my-api-budget-overnight-so-i-built-an-open-source-tool-to-make-sure-it-2372</link>
      <guid>https://dev.to/vola-trebla/my-ai-bot-burned-through-my-api-budget-overnight-so-i-built-an-open-source-tool-to-make-sure-it-2372</guid>
      <description>&lt;p&gt;I run an autonomous AI news engine called El Sapo Cripto. It monitors 25+ RSS feeds, scores articles, generates Spanish-language summaries with Gemini, creates images, and publishes to Telegram and X. All day, every day, zero human intervention.&lt;/p&gt;

&lt;p&gt;One morning I woke up to a ~$4 bill from Google. Not a lot, right? But my usual daily spend was under $0.25. Something was very wrong.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwitzpsx2y1uurolhl68b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwitzpsx2y1uurolhl68b.png" alt="AI studio Spend section" width="800" height="315"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What happened
&lt;/h2&gt;

&lt;p&gt;My app runs on Railway. Railway occasionally restarts containers. My budget tracker lived in memory. Restart = budget reset = the bot thought it had a fresh $0 balance and went wild. Gemini Flash calls piled up - summarizing, re-summarizing, processing articles it had already processed.&lt;/p&gt;

&lt;p&gt;I caught it by accident, scrolling through the billing page. There was no alert. No dashboard. No way to see what happened without manually reading through logs.&lt;/p&gt;

&lt;p&gt;And here's the thing that really bothered me: the app was returning 200 OK on every request. Prometheus would've shown zero errors. Traditional monitoring would've said "everything's fine" while the bot was eating money.&lt;/p&gt;

&lt;h2&gt;
  
  
  The real problem
&lt;/h2&gt;

&lt;p&gt;I started looking around for tools to monitor LLM calls. Found a few options: Langfuse, Helicone, OpenLLMetry. All solid projects. But they all shared the same limitation — they show you what your LLM is doing, but they don't tell you if it's doing it well.&lt;/p&gt;

&lt;p&gt;I don't just need to see that my bot made 200 Gemini calls. I need to know: did the summaries get worse after I changed the prompt last Tuesday? Are error rates creeping up because the provider is having issues? Is the cost per article going up because responses are getting longer? Is the model quietly refusing to summarize certain topics?&lt;/p&gt;

&lt;p&gt;I'm a Senior SDET by background. 8+ years building test frameworks and quality infrastructure. In the testing world, we don't just log requests — we assert on behavior, detect regressions, set quality gates. None of the existing LLM tools did that.&lt;/p&gt;

&lt;p&gt;So I built one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6z8prmrnnc6faq5eldr9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6z8prmrnnc6faq5eldr9.png" alt="Launch in terminal" width="800" height="777"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  toad-eye
&lt;/h2&gt;

&lt;p&gt;toad-eye is an open-source observability toolkit for LLM systems, built on OpenTelemetry. You install it, run three commands, and get full visibility into every LLM call your app makes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qnjcgocewysrawbtwr6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4qnjcgocewysrawbtwr6.png" alt="toad-eye Overview dashboard showing request rate, latency, cost across 3 LLM providers" width="800" height="405"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm install toad-eye
npx toad-eye init
npx toad-eye up
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;That gives you Grafana with 6 pre-built dashboards, Jaeger for trace inspection, and Prometheus for metrics. All pre-configured, all running locally.&lt;/p&gt;

&lt;p&gt;The SDK auto-instruments your LLM calls. No wrappers, no code changes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;initObservability&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;toad-eye&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nf"&gt;initObservability&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;serviceName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;my-app&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;instrument&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;anthropic&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// every OpenAI and Anthropic call is now traced automatically&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It tracks latency, token usage, cost, error rates — broken down by provider and model. If you're using GPT-4o for some things and Claude for others, you see them side by side.&lt;/p&gt;

&lt;h2&gt;
  
  
  What makes it different
&lt;/h2&gt;

&lt;p&gt;The features I built come directly from problems I hit in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Budget guards.&lt;/strong&gt; The thing that would've saved me from the El Sapo incident. Set a daily budget, per-user budget, or per-model budget. toad-eye checks before every LLM call. If you're over budget, it can warn, block, or automatically downgrade to a cheaper model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nf"&gt;initObservability&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;serviceName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;my-app&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;budgets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;daily&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;perModel&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;30&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
  &lt;span class="na"&gt;onBudgetExceeded&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;block&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Semantic drift monitoring.&lt;/strong&gt; This is the one I'm most proud of. LLMs can silently degrade — the model returns 200 OK, but the answers are getting worse. Maybe the provider updated the model weights, maybe your prompt doesn't work well with the new version. Traditional monitoring can't catch this.&lt;/p&gt;

&lt;p&gt;toad-eye saves embeddings of your "good" responses as a baseline. Then it periodically compares new responses against that baseline. If the average distance grows beyond a threshold, you get an alert: "semantic drift detected." Your model is still responding, but it's not responding the same way.&lt;/p&gt;
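&lt;p&gt;Under the hood the idea is simple. A sketch (not toad-eye's actual implementation; the threshold is illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;// Cosine distance between two embedding vectors.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i &amp;lt; a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Drift: recent responses moving away from the "good" baseline on average.
function driftDetected(
  baseline: number[][],
  recent: number[][],
  threshold = 0.25 // illustrative; tune on your own data
): boolean {
  const avg =
    recent
      .map((r) =&amp;gt; Math.min(...baseline.map((b) =&amp;gt; cosineDistance(r, b))))
      .reduce((sum, d) =&amp;gt; sum + d, 0) / recent.length;
  return avg &amp;gt; threshold;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;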

&lt;p&gt;&lt;strong&gt;Shadow guardrails.&lt;/strong&gt; You want to add validation rules (no PII in responses, must be valid JSON, etc.) but you're scared they'll block legitimate traffic. Shadow mode runs the validation on every response but doesn't block anything. It just records what would have been blocked. You see a "potential block rate" in Grafana and can tune your thresholds on real production data before flipping the switch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent tracing.&lt;/strong&gt; AI agents (the think-act-observe-repeat kind) are notoriously hard to debug. toad-eye records each step as a nested OpenTelemetry span. You can open Jaeger and see exactly how your agent decided to call which tools and why.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzv0upofe3qdjle2op5ep.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzv0upofe3qdjle2op5ep.png" alt="Jaeger trace showing nested agent steps: think, act, observe, answer" width="800" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Trace-to-test export.&lt;/strong&gt; Found a bad trace in production? One CLI command exports it as a test case for your eval suite. Production failure becomes a regression test.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx toad-eye export-trace &amp;lt;trace_id&amp;gt; &lt;span class="nt"&gt;--output&lt;/span&gt; ./evals/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;FinOps attribution.&lt;/strong&gt; Break down costs by team, user, feature — not just by model. "The checkout team spent $28 yesterday, mostly on GPT-4o for classification. Switching to Flash would save 60%." That's the kind of insight that makes engineering managers pay attention.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcemxan39nax99nucrweq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcemxan39nax99nucrweq.png" alt="Prometheus query showing cost per hour by provider and model" width="800" height="364"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The numbers
&lt;/h2&gt;

&lt;p&gt;Current state of the project:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;154 tests passing&lt;/li&gt;
&lt;li&gt;6 Grafana dashboards&lt;/li&gt;
&lt;li&gt;13 tracked metrics&lt;/li&gt;
&lt;li&gt;3 auto-instrumented SDKs (OpenAI, Anthropic, Gemini) with full streaming support&lt;/li&gt;
&lt;li&gt;Published on npm&lt;/li&gt;
&lt;li&gt;Self-hosted or cloud mode&lt;/li&gt;
&lt;li&gt;OTel GenAI semantic conventions compliant&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Part of something bigger
&lt;/h2&gt;

&lt;p&gt;toad-eye is the observability module of TOAD (Testing &amp;amp; Observability for AI Development) — an ecosystem of tools for AI quality:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;toad-eye&lt;/strong&gt; — observability (this article)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;toad-guard&lt;/strong&gt; — LLM output validation with Zod&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;toad-eval&lt;/strong&gt; — test suites for prompts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;toad-ci&lt;/strong&gt; — CI/CD quality gates for prompt changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;toad-mcp&lt;/strong&gt; — Claude Desktop integration via MCP&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea is simple: AI systems deserve the same quality engineering rigor we apply to regular software. Observability is where it starts, but testing, validation, and CI gates are where quality actually happens.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install &lt;/span&gt;toad-eye
npx toad-eye init
npx toad-eye up
npx toad-eye demo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Open localhost:3100. You'll see your dashboards with data in under 2 minutes.&lt;/p&gt;

&lt;p&gt;GitHub: &lt;a href="https://github.com/vola-trebla/toad-eye" rel="noopener noreferrer"&gt;https://github.com/vola-trebla/toad-eye&lt;/a&gt;&lt;br&gt;
npm: &lt;a href="https://www.npmjs.com/package/toad-eye" rel="noopener noreferrer"&gt;https://www.npmjs.com/package/toad-eye&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you're running LLMs in production without observability, you're flying blind. And trust me, you don't want to find out about your budget problem from a billing email.&lt;/p&gt;

&lt;p&gt;This is an early-stage project and I'm actively developing it. If you try it out, I'd love to hear what works, what doesn't, and what features you'd want next. Open an issue on GitHub or just drop a comment here.&lt;/p&gt;

&lt;p&gt;The toad is watching. 🐸👁️&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>opentelemetry</category>
      <category>typescript</category>
    </item>
  </channel>
</rss>
