<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Jay</title>
    <description>The latest articles on DEV Community by Jay (@jay_singh_e5b5ee6be59c0e0).</description>
    <link>https://dev.to/jay_singh_e5b5ee6be59c0e0</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3841620%2F820b66f7-5cbe-46ab-a1be-9f3b07bdd6ba.jpg</url>
      <title>DEV Community: Jay</title>
      <link>https://dev.to/jay_singh_e5b5ee6be59c0e0</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/jay_singh_e5b5ee6be59c0e0"/>
    <language>en</language>
    <item>
      <title>From 66% to 96%: How I Fixed a Drive-Thru Voice Agent Before It Took a Single Real Call</title>
      <dc:creator>Jay</dc:creator>
      <pubDate>Tue, 14 Apr 2026 17:51:13 +0000</pubDate>
      <link>https://dev.to/jay_singh_e5b5ee6be59c0e0/from-66-to-96-how-i-fixed-a-drive-thru-voice-agent-before-it-took-a-single-real-call-1dm5</link>
      <guid>https://dev.to/jay_singh_e5b5ee6be59c0e0/from-66-to-96-how-i-fixed-a-drive-thru-voice-agent-before-it-took-a-single-real-call-1dm5</guid>
      <description>&lt;p&gt;I've been building voice agents for a while. The hardest part isn't the STT or TTS layer.&lt;/p&gt;

&lt;p&gt;It's this: how do you test edge cases before you have real users?&lt;/p&gt;

&lt;p&gt;The default answer is the vibe-check loop. You call your own agent, order a burger, say "yeah that felt okay," and move on. I did this for longer than I should have.&lt;/p&gt;

&lt;h2&gt;The Scenario&lt;/h2&gt;

&lt;p&gt;I built a drive-thru voice agent called "Future Burger." Requirements were simple: take orders fast, stay concise, skip the small talk.&lt;/p&gt;

&lt;p&gt;The architecture was brain-first. STT and TTS are just the ears and mouth, interchangeable peripherals. The LLM handles reasoning, context switching, and tool calling.&lt;/p&gt;

&lt;p&gt;If the agent can't figure out that "Actually, make that a Sprite" means &lt;em&gt;replacing&lt;/em&gt; the previous drink, no amount of voice synthesis polish saves the interaction. So I focused entirely on the intelligence layer.&lt;/p&gt;
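That replace-vs-append distinction is worth pinning down before prompting. A minimal sketch of the cart behavior the agent's tool calls must produce (a hypothetical `Cart` helper for illustration, not the actual agent code — the real decision is made by the LLM):

```python
# Hypothetical sketch of the cart semantics the agent must respect.
# "Actually, make that a Sprite" should trigger replace_last, not add.

class Cart:
    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)

    def replace_last(self, category, new_item):
        # Overwrite the most recent item of the same category (e.g. "drink")
        # instead of appending a second one.
        for i in range(len(self.items) - 1, -1, -1):
            if self.items[i]["category"] == category:
                self.items[i] = new_item
                return
        self.items.append(new_item)  # nothing to replace, fall back to add

cart = Cart()
cart.add({"category": "drink", "name": "Coke"})
cart.replace_last("drink", {"category": "drink", "name": "Sprite"})
print([i["name"] for i in cart.items])  # ['Sprite']
```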

&lt;h2&gt;Step 1: Synthetic Data (Skipping the Cold Start)&lt;/h2&gt;

&lt;p&gt;Instead of waiting weeks for real call logs, I used FutureAGI's &lt;a href="https://docs.futureagi.com/docs/dataset/features/create?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=blog_post&amp;amp;utm_content=dataset_feature" rel="noopener noreferrer"&gt;Dataset&lt;/a&gt; to build a ground truth dataset. You define a schema and it produces structured input/output pairs.&lt;/p&gt;

&lt;p&gt;I asked for two fields: &lt;code&gt;user_transcript&lt;/code&gt; (what the user says) and &lt;code&gt;expected_order&lt;/code&gt; (what the agent should actually book).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt used:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Generate 500 diverse drive-thru interactions. Include complex orders like 'Cheeseburger no pickles', combo meals, and modifications."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28ccimgf94iusmpnoxx4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28ccimgf94iusmpnoxx4.png" alt=" " width="800" height="456"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In seconds I had 500 labeled pairs ready for evaluation. What surprised me here was how fast this exposed gaps I hadn't even thought to test. Mid-sentence order changes, multilingual switches, impatient customers. Edge cases I always meant to write but never did.&lt;/p&gt;
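For reference, one pair from such a dataset might look like this (field names from the schema above; the nested order structure is illustrative, not the platform's exact export format):

```python
# Illustrative labeled pair using the two schema fields described above.
example_pair = {
    "user_transcript": "Cheeseburger, no pickles... actually add a medium Sprite too",
    "expected_order": {
        "items": [
            {"name": "cheeseburger", "modifications": ["no pickles"]},
            {"name": "sprite", "size": "medium"},
        ]
    },
}

# Evaluation compares the agent's booked order against expected_order.
print(len(example_pair["expected_order"]["items"]))  # 2
```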

&lt;h2&gt;Step 2: Baseline Prompt (Workbench + Experiments)&lt;/h2&gt;

&lt;p&gt;Before touching latency or audio quality, I needed to confirm the logic holds. I drafted the initial system prompt (v0.1) in the &lt;a href="https://docs.futureagi.com/docs/prompt/features/create-from-scratch?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=blog_post&amp;amp;utm_content=prompt_workbench" rel="noopener noreferrer"&gt;Prompt Workbench&lt;/a&gt;, saved it as a versioned template, and ran an experiment across those 500 scenarios using three models: &lt;code&gt;gpt-5-nano&lt;/code&gt;, &lt;code&gt;Gemini-3-Flash&lt;/code&gt;, and &lt;code&gt;gpt-5-mini&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxrj6pjj3ivpyqyb45mr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyxrj6pjj3ivpyqyb45mr.png" alt=" " width="800" height="385"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; 80% accuracy. Decent. But the responses were wall-of-text paragraphs. Every reply opened with something like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3ktvr64o3xt8ohh62y4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3ktvr64o3xt8ohh62y4.png" alt=" " width="800" height="483"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"Certainly! I have updated your order to include a cheeseburger without pickles and a medium Sprite. Is there anything else I can help you with today?"&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Fine for a chatbot. For a voice agent where every word adds latency, it's a failure mode.&lt;/p&gt;

&lt;h2&gt;Step 3: Simulation (The Stress Test)&lt;/h2&gt;

&lt;p&gt;I connected the agent and ran a simulation with layered scenario types: hesitant users, stuttering, mid-order changes, rushed and angry customers.&lt;/p&gt;

&lt;p&gt;The results were immediate:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Latency issues.&lt;/strong&gt; The agent was too wordy. It started every response with "Certainly!" and ran three sentences too long.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Logic breaks.&lt;/strong&gt; When a user changed their mind, the agent added &lt;em&gt;both&lt;/em&gt; items to the cart instead of replacing the first.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Success rate: 66%.&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;
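Once ground-truth pairs exist, that failure is catchable with a deterministic scorer rather than a vibe check (a hypothetical helper, not part of any SDK):

```python
# Hypothetical scorer: exact-match an agent's booked order against ground truth.
def success_rate(runs):
    """runs: list of (agent_order, expected_order) tuples."""
    passed = sum(1 for agent, expected in runs if agent == expected)
    return passed / len(runs)

runs = [
    ([{"name": "burger"}, {"name": "sprite"}], [{"name": "burger"}, {"name": "sprite"}]),
    # The "changed my mind" bug: both drinks kept instead of replacing.
    ([{"name": "coke"}, {"name": "sprite"}], [{"name": "sprite"}]),
    ([{"name": "fries"}], [{"name": "fries"}]),
]
print(round(success_rate(runs), 2))  # 0.67
```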

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6u7u72ot2tiqafvcxos0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6u7u72ot2tiqafvcxos0.png" alt=" " width="800" height="547"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One in three conversations ending in failure is not a quirk to patch later. That's a production blocker.&lt;/p&gt;

&lt;h2&gt;Step 4: Automated Optimization&lt;/h2&gt;

&lt;p&gt;This is the part I found most useful. Instead of manually editing the system prompt and guessing which instruction caused which failure, I let the optimization engine analyze the conversation logs directly.&lt;/p&gt;

&lt;p&gt;I defined 10 evaluation criteria specific to this agent, including:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context_Retention
Objection_Handling
Language_Switching
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;



&lt;p&gt;Because the platform evaluates native audio rather than transcripts alone, it recognized failure patterns across hundreds of simulated conversations and surfaced two actionable fixes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Fix 1 (High Latency):&lt;/strong&gt; "Reduce decision tree depth for menu inquiries and remove redundant validation steps."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Fix 2 (Hallucination):&lt;/strong&gt; "Restrict generative capabilities to the defined &lt;code&gt;menu_items&lt;/code&gt; vector store to prevent inventing dishes."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9fji4o3wjtk6qvbjx9x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft9fji4o3wjtk6qvbjx9x.png" alt=" " width="800" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I selected the failed simulation runs and ran ProTeGi optimization with two objectives:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Task_Completion
Customer_Interruption_Handling
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;



&lt;p&gt;The system iterated on the system prompt automatically, testing variants like &lt;em&gt;"Be extremely brief"&lt;/em&gt; or &lt;em&gt;"If user changes mind, overwrite previous item."&lt;/em&gt; It ran each variant against the simulator in a loop until the metrics climbed.&lt;/p&gt;

&lt;p&gt;I've done this manually on other projects. It takes hours. Watching it run in a loop was a genuinely different experience.&lt;/p&gt;

&lt;h2&gt;Step 5: Results&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Before:&lt;/strong&gt; Polite, slow, failed to track mid-order changes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;After:&lt;/strong&gt; Crisp. "Burger, no pickles. Got it." 96% accuracy on the "Indecisive" scenario&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4xesz33ag1tllkvbiasv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4xesz33ag1tllkvbiasv.png" alt=" " width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Going from 66% to 96% without manually writing a single new instruction validated the loop: &lt;strong&gt;&lt;code&gt;Dataset&lt;/code&gt; &amp;gt; Simulate &amp;gt; Evaluate &amp;gt; Optimize.&lt;/strong&gt;&lt;/p&gt;
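The loop itself is simple enough to sketch. Here `simulate`, `evaluate`, and `optimize_prompt` are hypothetical stand-ins for the platform calls, not real API names:

```python
# Sketch of the Dataset > Simulate > Evaluate > Optimize loop.
# simulate(), evaluate(), and optimize_prompt() are hypothetical stand-ins.

def run_loop(prompt, dataset, simulate, evaluate, optimize_prompt,
             target=0.95, max_iters=10):
    best_prompt, best_score = prompt, 0.0
    for _ in range(max_iters):
        transcripts = simulate(best_prompt, dataset)      # run agent on scenarios
        score, failures = evaluate(transcripts, dataset)  # score vs. ground truth
        if score > best_score:
            best_score = score
        if score >= target:
            break
        # Rewrite the prompt based on the observed failures (e.g. a ProTeGi step).
        best_prompt = optimize_prompt(best_prompt, failures)
    return best_prompt, best_score
```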

&lt;h2&gt;What I Took From This&lt;/h2&gt;

&lt;p&gt;The cold start problem for voice agents is real. You can't get quality data without users, and you can't get users without quality behavior. Synthetic simulation breaks that dependency.&lt;/p&gt;

&lt;p&gt;The bigger shift for me was realizing that most prompt debugging is just pattern matching on logs. You run the agent, it fails, you guess why, you edit, you repeat. That process is automatable. The hard part is setting up the right evaluation criteria upfront.&lt;/p&gt;

&lt;p&gt;If you're still in the vibe-check phase and want to see what the full evaluation infrastructure looks like, &lt;a href="https://futureagi.com/?utm_source=devto&amp;amp;utm_medium=blog&amp;amp;utm_campaign=llm_observability_guide&amp;amp;utm_content=hero_cta" rel="noopener noreferrer"&gt;the architecture walkthrough is here&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Curious what evaluation criteria others track for voice agents in production. Context retention and objection handling were obvious for this use case, but I'd like to know what else people actually measure.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>ai</category>
      <category>llm</category>
      <category>agents</category>
      <category>voiceai</category>
    </item>
    <item>
      <title>The MCP Evaluation Framework Nobody Talks About (But Should)</title>
      <dc:creator>Jay</dc:creator>
      <pubDate>Wed, 08 Apr 2026 13:49:32 +0000</pubDate>
      <link>https://dev.to/jay_singh_e5b5ee6be59c0e0/the-mcp-evaluation-framework-nobody-talks-about-but-should-1po2</link>
      <guid>https://dev.to/jay_singh_e5b5ee6be59c0e0/the-mcp-evaluation-framework-nobody-talks-about-but-should-1po2</guid>
      <description>&lt;p&gt;Your agent worked fine in staging. It called the right MCP tools, returned clean outputs, passed the test suite. Then it hit production, a user sent a slightly different query, and it picked the wrong tool, passed malformed arguments, and chained three unnecessary calls before returning garbage.&lt;/p&gt;

&lt;p&gt;I've watched this happen more times than I'd like. The model isn't the problem. The missing piece is an evaluation system that matches how MCP actually behaves at runtime.&lt;/p&gt;

&lt;h2&gt;Why MCP Changes the Eval Problem&lt;/h2&gt;

&lt;p&gt;Before MCP, most agents had hardcoded tools. You could write deterministic tests: "Given this input, the agent should call &lt;code&gt;search_docs&lt;/code&gt; with these parameters." That worked.&lt;/p&gt;

&lt;p&gt;MCP flips that model. An MCP-connected agent discovers tools at runtime from one or more MCP servers. The available tools can change between requests. The agent decides what to call, in what order, with what arguments, based on the user's prompt and context injected through MCP resources.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.anthropic.com/news/donating-the-model-context-protocol-and-establishing-of-the-agentic-ai-foundation" rel="noopener noreferrer"&gt;Anthropic open-sourced MCP in late 2024&lt;/a&gt;. Within a year it had 97 million monthly SDK downloads and 10,000+ published servers. In December 2025, Anthropic donated MCP to the &lt;a href="https://www.linuxfoundation.org/press/linux-foundation-announces-the-formation-of-the-agentic-ai-foundation" rel="noopener noreferrer"&gt;Linux Foundation's Agentic AI Foundation (AAIF)&lt;/a&gt;, with OpenAI, Google, Microsoft, and AWS backing the move.&lt;/p&gt;

&lt;p&gt;This creates three evaluation problems that didn't exist before:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic tool selection is non-deterministic.&lt;/strong&gt; The same query can produce different tool call sequences depending on which MCP servers are connected and what they expose at that moment. You can't assert "the agent must call this specific tool." You evaluate whether the choice was &lt;em&gt;reasonable&lt;/em&gt; given the available options.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context injection needs validation.&lt;/strong&gt; MCP servers inject resources that shape the agent's reasoning. If a resource returns stale data or an unexpected format, the agent reasons incorrectly. Your eval needs to cover whether that injected context was used correctly, not just whether the final output looked reasonable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chains need end-to-end tracing.&lt;/strong&gt; A single request can trigger 5 to 10 MCP tool calls across different servers, each with its own latency, failure mode, and output quality. Following only the final response misses every intermediate failure.&lt;/p&gt;

&lt;h2&gt;Five Dimensions to Evaluate&lt;/h2&gt;

&lt;h3&gt;1. Tool Selection Accuracy&lt;/h3&gt;

&lt;p&gt;Did the agent pick the right tool? Measure this against labeled examples where humans identified the optimal tools for a given query. Two sub-metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Precision:&lt;/strong&gt; Out of all tools called, how many were necessary?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Recall:&lt;/strong&gt; Out of all tools that should have been called, how many were?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;High precision with low recall means the agent is too conservative and missing useful tools. Low precision with high recall means it's calling unnecessary tools, burning tokens, and increasing latency.&lt;/p&gt;
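Both sub-metrics reduce to set arithmetic over tool names. A minimal sketch (real scoring should also weigh call order and arguments):

```python
# Minimal precision/recall over tool-call sets for a single request.
def tool_selection_metrics(called, expected):
    called, expected = set(called), set(expected)
    overlap = len(called.intersection(expected))
    precision = overlap / len(called) if called else 1.0
    recall = overlap / len(expected) if expected else 1.0
    return precision, recall

# Agent called one unnecessary tool: precision drops, recall stays perfect.
p, r = tool_selection_metrics(
    called=["search_docs", "get_ticket", "send_email"],
    expected=["search_docs", "get_ticket"],
)
print(round(p, 2), r)  # 0.67 1.0
```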

&lt;h3&gt;2. Argument Correctness&lt;/h3&gt;

&lt;p&gt;Even when the agent picks the right tool, it can pass wrong arguments. Validate that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Arguments match the MCP tool's JSON schema&lt;/li&gt;
&lt;li&gt;Types are correct (no string where an integer belongs)&lt;/li&gt;
&lt;li&gt;Required fields are present and populated&lt;/li&gt;
&lt;li&gt;Semantic accuracy holds: did it pass the correct document ID for this specific task, not a random one?&lt;/li&gt;
&lt;/ul&gt;
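The structural checks are cheap to run deterministically. A minimal hand-rolled validator for illustration (in practice you'd validate against the tool's actual JSON Schema with a dedicated library):

```python
# Minimal deterministic check of arguments against a tool's declared schema.
def check_args(args, schema):
    errors = []
    py_types = {"string": str, "integer": int, "number": (int, float), "boolean": bool}
    for field in schema.get("required", []):
        if field not in args:
            errors.append(f"missing required field: {field}")
    for field, spec in schema.get("properties", {}).items():
        if field in args and not isinstance(args[field], py_types[spec["type"]]):
            errors.append(f"{field}: expected {spec['type']}")
    return errors

schema = {
    "required": ["doc_id", "max_results"],
    "properties": {"doc_id": {"type": "string"}, "max_results": {"type": "integer"}},
}
# A classic failure: the model passes the number as a string.
print(check_args({"doc_id": "abc-123", "max_results": "5"}, schema))
# ['max_results: expected integer']
```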

&lt;h3&gt;3. Task Completion Rate&lt;/h3&gt;

&lt;p&gt;This is the bottom-line metric. Did the agent actually accomplish what the user asked? I use LLM-as-a-judge evaluators here because they catch cases where every individual tool call succeeded but the agent failed to synthesize the results correctly.&lt;/p&gt;

&lt;h3&gt;4. Chain Efficiency&lt;/h3&gt;

&lt;p&gt;MCP agents can make far more tool calls than necessary. Track:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Total tool calls per request&lt;/li&gt;
&lt;li&gt;Redundant calls (same tool, same arguments called twice)&lt;/li&gt;
&lt;li&gt;Calls whose outputs never appeared in the final response&lt;/li&gt;
&lt;li&gt;Total chain latency&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An agent that calls 8 tools when 2 would do isn't just slow. It's expensive and significantly harder to debug.&lt;/p&gt;
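Redundancy and the efficiency ratio fall straight out of the trace. A sketch over a simplified list of call records:

```python
# Chain-efficiency stats from a simplified list of tool-call records.
from collections import Counter

def chain_stats(calls, min_needed):
    # calls: list of (tool_name, frozen_args) tuples from a single trace
    counts = Counter(calls)
    redundant = sum(n - 1 for n in counts.values() if n > 1)
    efficiency = min_needed / len(calls) if calls else 1.0
    return {"total": len(calls), "redundant": redundant, "efficiency": efficiency}

trace = [
    ("search_docs", ("query=refund",)),
    ("search_docs", ("query=refund",)),  # duplicate call, same arguments
    ("get_policy", ("id=42",)),
    ("get_policy", ("id=7",)),
]
print(chain_stats(trace, min_needed=2))
# {'total': 4, 'redundant': 1, 'efficiency': 0.5}
```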

&lt;h3&gt;5. Context Utilization&lt;/h3&gt;

&lt;p&gt;MCP servers expose resources that influence the agent's reasoning. Evaluate whether the agent used that context accurately or hallucinated information that contradicted it. The key metrics are groundedness and context relevance.&lt;/p&gt;

&lt;p&gt;Here are the thresholds I use as a starting baseline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool Selection Precision: &amp;gt;85%&lt;/li&gt;
&lt;li&gt;Tool Selection Recall: &amp;gt;90%&lt;/li&gt;
&lt;li&gt;Argument Schema Compliance: &amp;gt;98%&lt;/li&gt;
&lt;li&gt;Task Completion: &amp;gt;80%&lt;/li&gt;
&lt;li&gt;Chain Efficiency Ratio (min needed calls / actual calls): &amp;gt;0.7&lt;/li&gt;
&lt;li&gt;Groundedness: &amp;gt;85%&lt;/li&gt;
&lt;li&gt;P95 Latency: &amp;lt;5s&lt;/li&gt;
&lt;/ul&gt;
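Wiring those baselines into a pass/fail gate takes a few lines (thresholds copied from the list above; the metric names are illustrative):

```python
# Gate a metrics snapshot against the baseline thresholds above.
THRESHOLDS = {
    "tool_precision": (0.85, "min"),
    "tool_recall": (0.90, "min"),
    "schema_compliance": (0.98, "min"),
    "task_completion": (0.80, "min"),
    "chain_efficiency": (0.70, "min"),
    "groundedness": (0.85, "min"),
    "p95_latency_s": (5.0, "max"),
}

def violations(metrics):
    out = []
    for name, (limit, kind) in THRESHOLDS.items():
        value = metrics[name]
        # "min" metrics must stay at or above the limit, "max" at or below.
        if (kind == "min" and limit > value) or (kind == "max" and value > limit):
            out.append(name)
    return out

snapshot = {"tool_precision": 0.91, "tool_recall": 0.93, "schema_compliance": 0.99,
            "task_completion": 0.74, "chain_efficiency": 0.81, "groundedness": 0.88,
            "p95_latency_s": 6.2}
print(violations(snapshot))  # ['task_completion', 'p95_latency_s']
```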

&lt;h2&gt;You Can't Eval What You Can't See&lt;/h2&gt;

&lt;p&gt;Tracing is the foundation. The standard approach is OpenTelemetry-based instrumentation, where each MCP tool call becomes a span recording: tool name, server name, arguments, response, latency, and status code. These spans nest under a parent trace representing the full user request.&lt;/p&gt;

&lt;p&gt;A well-instrumented MCP trace captures:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Root span:&lt;/strong&gt; User query received, final response returned&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM decision span:&lt;/strong&gt; Model reasoning, tool selection decision&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP tool call spans:&lt;/strong&gt; One per invocation, with full arguments and response&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context retrieval spans:&lt;/strong&gt; MCP resource fetches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Synthesis span:&lt;/strong&gt; Final response generation from tool outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://github.com/future-agi/ai-evaluation?utm_source=evaluatemcp&amp;amp;utm_medium=Blog&amp;amp;utm_campaign=blog_page" rel="noopener noreferrer"&gt;TraceAI&lt;/a&gt; is an open-source library that extends OpenTelemetry with AI-specific semantic conventions. It supports 20+ frameworks including OpenAI, Anthropic, LangChain, and CrewAI. Setup is under 10 lines:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fi_instrumentation&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;register&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fi_instrumentation.fi_types&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ProjectType&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;traceai_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAIInstrumentor&lt;/span&gt;

&lt;span class="n"&gt;trace_provider&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;register&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;project_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ProjectType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;OBSERVE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;project_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mcp_agent_prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nc"&gt;OpenAIInstrumentor&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;instrument&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tracer_provider&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;trace_provider&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once traces are flowing, you can visualize every LLM call, tool invocation, and retrieval step as nested timelines on the &lt;a href="https://futureagi.com/" rel="noopener noreferrer"&gt;Future AGI&lt;/a&gt; Observe dashboard, with latency, cost, and evaluation scores side-by-side.&lt;/p&gt;

&lt;h2&gt;Building the Pipeline&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Instrument your agent.&lt;/strong&gt;&lt;br&gt;
Set up auto-instrumentation with TraceAI or a compatible library. Capture the MCP-specific attributes too: which server the tool came from, schema version, and whether the call was a retry. That context is critical when debugging failures at 2am.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Define your evaluation criteria.&lt;/strong&gt;&lt;br&gt;
Pick metrics from the five pillars based on your use case. A support agent should prioritize task completion and groundedness. A code generation agent should prioritize argument correctness and chain efficiency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Set up automated evaluators.&lt;/strong&gt;&lt;br&gt;
For subjective measurements like task completion and response quality, use LLM-as-a-judge. For objective checks like schema compliance and latency thresholds, use deterministic validators.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/future-agi/ai-evaluation?utm_source=evaluatemcp&amp;amp;utm_medium=Blog&amp;amp;utm_campaign=blog_page" rel="noopener noreferrer"&gt;evaluation SDK&lt;/a&gt; ships with 60+ pre-built templates covering factual accuracy, groundedness, tone, conciseness, and more:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fi.evals&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Evaluator&lt;/span&gt;

&lt;span class="n"&gt;evaluator&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Evaluator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;fi_api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_api_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;fi_secret_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_secret_key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evaluator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;evaluate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;eval_templates&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;groundedness&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;retrieved_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;agent_response&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;turing_flash&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 4: Sample and score production traffic.&lt;/strong&gt;&lt;br&gt;
Don't eval every request. A 10-20% sampling rate works for most teams. For finance or healthcare, push toward 100%. &lt;a href="https://app.futureagi.com/dashboard/evaluations?utm_source=evaluatemcp&amp;amp;utm_medium=Blog&amp;amp;utm_campaign=blog_page" rel="noopener noreferrer"&gt;Future AGI's Eval Tasks&lt;/a&gt; let you schedule scoring on live or historical traffic with configurable sampling rates.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: Alert on regression.&lt;/strong&gt;&lt;br&gt;
Threshold-based alerts are what turn passive monitoring into an actual feedback loop:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Task completion drops below 80%? Alert.&lt;/li&gt;
&lt;li&gt;Average tool calls per request spikes above 6? Alert.&lt;/li&gt;
&lt;li&gt;Argument schema compliance dips below 95%? Alert.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Route these to Slack, PagerDuty, or your CI/CD pipeline.&lt;/p&gt;

&lt;h2&gt;Failure Patterns Worth Flagging&lt;/h2&gt;

&lt;p&gt;A few I keep running into:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing only the happy path.&lt;/strong&gt; Dev and staging MCP servers have limited tool sets. Mirror production MCP server configs in your test environment, or you're not actually testing the surface area that breaks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluating calls in isolation.&lt;/strong&gt; Evaluating each tool call without considering the chain misses ordering failures. Evaluate full sequences and flag when order affects correctness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM-as-a-judge without deterministic checks.&lt;/strong&gt; LLM evaluators are inconsistent on their own. Pair them with schema validation, not instead of it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No established baseline.&lt;/strong&gt; If you don't record baseline metrics in the first week, you can't detect degradation. Track deltas. Absolute scores lie.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No cost tracking.&lt;/strong&gt; Tool calls compound fast in MCP chains. Include token and call costs in every trace. Set spike alerts before the bill does it for you.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluating post-ship only.&lt;/strong&gt; Running evals only after deployment means you're always reacting. Enable tracing in experiment mode during development and surface failure patterns before they reach production.&lt;/p&gt;

&lt;h2&gt;Closing the Loop&lt;/h2&gt;

&lt;p&gt;Evaluation without action is just monitoring. The actual cycle:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Trace&lt;/strong&gt; every MCP tool call with OpenTelemetry-compatible instrumentation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Evaluate&lt;/strong&gt; sampled traces across the five metrics automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Identify&lt;/strong&gt; failure patterns through clustering: which tools fail most, which queries produce the worst task completion scores&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate&lt;/strong&gt; on prompts, tool descriptions, and MCP server configurations based on evaluation feedback&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Verify&lt;/strong&gt; improvements by comparing eval scores across deployment versions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The teams shipping reliable MCP-connected agents aren't the ones with the best models. They're the ones with the best evaluation pipelines. Start there.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>mcp</category>
      <category>agents</category>
    </item>
    <item>
      <title>How I Audited My Infra After the LiteLLM Supply Chain Attack (And What I'm Doing Differently Now)</title>
      <dc:creator>Jay</dc:creator>
      <pubDate>Mon, 06 Apr 2026 18:30:00 +0000</pubDate>
      <link>https://dev.to/jay_singh_e5b5ee6be59c0e0/how-i-audited-my-infra-after-the-litellm-supply-chain-attack-and-what-im-doing-differently-now-39ma</link>
      <guid>https://dev.to/jay_singh_e5b5ee6be59c0e0/how-i-audited-my-infra-after-the-litellm-supply-chain-attack-and-what-im-doing-differently-now-39ma</guid>
      <description>&lt;p&gt;I woke up to a Slack thread on March 24, 2026, that made my stomach drop. LiteLLM, the Python proxy I'd been running to route LLM calls across providers, had been &lt;a href="https://docs.litellm.ai/blog/security-update-march-2026" rel="noopener noreferrer"&gt;backdoored with credential-stealing malware&lt;/a&gt;. Versions 1.82.7 and 1.82.8, published by a threat actor called TeamPCP, contained a three-stage payload that harvested SSH keys, cloud credentials, Kubernetes secrets, and cryptocurrency wallets. PyPI quarantined the entire package.&lt;/p&gt;

&lt;p&gt;What surprised me was the targeting. LiteLLM is literally an API key management gateway. It holds credentials for every LLM provider your org uses. If you wanted to compromise one package to get access to everything, this was the perfect pick.&lt;/p&gt;

&lt;p&gt;This wasn't a one-off either. It was the third hit in a five-day campaign. Aqua Security's Trivy scanner got compromised on March 19 (&lt;a href="https://github.com/aquasecurity/trivy/security/advisories/GHSA-69fq-xp46-6x23" rel="noopener noreferrer"&gt;GHSA-69fq-xp46-6x23&lt;/a&gt;). Checkmarx's KICS GitHub Actions followed on March 23 (&lt;a href="https://github.com/Checkmarx/kics-github-action/issues/152" rel="noopener noreferrer"&gt;kics-github-action#152&lt;/a&gt;, &lt;a href="https://checkmarx.com/blog/checkmarx-security-update/" rel="noopener noreferrer"&gt;Checkmarx Update&lt;/a&gt;). LiteLLM was the final target on March 24 (&lt;a href="https://github.com/BerriAI/litellm/issues/24512" rel="noopener noreferrer"&gt;litellm#24512&lt;/a&gt;, &lt;a href="https://docs.litellm.ai/blog/security-update-march-2026" rel="noopener noreferrer"&gt;LiteLLM Update&lt;/a&gt;). The attack chain was &lt;a href="https://www.wiz.io/blog/threes-a-crowd-teampcp-trojanizes-litellm-in-continuation-of-campaign" rel="noopener noreferrer"&gt;present in 36% of cloud environments&lt;/a&gt;, often pulled in as a transitive dependency through agent frameworks nobody audited.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Attack Chain Worked
&lt;/h2&gt;

&lt;p&gt;As confirmed in &lt;a href="https://docs.litellm.ai/blog/security-update-march-2026" rel="noopener noreferrer"&gt;LiteLLM's official security update&lt;/a&gt;, the project's CI/CD pipeline ran Trivy without pinning to a specific version. When the compromised Trivy action executed inside LiteLLM's GitHub Actions runner, it exfiltrated the &lt;code&gt;PYPI_PUBLISH&lt;/code&gt; token. TeamPCP used that stolen token to push malicious releases directly to PyPI.&lt;/p&gt;

&lt;p&gt;Version 1.82.7 embedded the payload in &lt;code&gt;proxy/proxy_server.py&lt;/code&gt;, firing on import. Version 1.82.8 was worse: it included a &lt;code&gt;.pth&lt;/code&gt; file called &lt;code&gt;litellm_init.pth&lt;/code&gt; that &lt;a href="https://docs.python.org/3/library/site.html" rel="noopener noreferrer"&gt;executed on every Python process startup&lt;/a&gt;, regardless of whether you ever imported LiteLLM. Python's site module processes all &lt;code&gt;.pth&lt;/code&gt; files in site-packages during interpreter initialization, as &lt;a href="https://github.com/BerriAI/litellm/issues/24512" rel="noopener noreferrer"&gt;documented in the GitHub issue&lt;/a&gt;.&lt;/p&gt;
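&lt;p&gt;The mechanism is easy to reproduce safely. Here's a minimal sketch with a benign marker file (nothing from the actual payload) showing how any &lt;code&gt;.pth&lt;/code&gt; line that begins with &lt;code&gt;import&lt;/code&gt; gets executed by the site machinery:&lt;/p&gt;

```python
# Benign demo of .pth code execution -- not the payload.
# site.addsitedir() processes .pth files the same way interpreter
# startup processes site-packages: any line starting with "import"
# is exec()'d as Python code.
import site
import tempfile
from pathlib import Path

tmp = Path(tempfile.mkdtemp())
marker = tmp / "proof.txt"

# A one-liner disguised as a path-configuration entry.
(tmp / "demo_init.pth").write_text(
    f"import pathlib; pathlib.Path({str(marker)!r}).touch()\n"
)

site.addsitedir(str(tmp))   # the import line runs here, silently
print(marker.exists())      # True
```

&lt;p&gt;Drop the same file into site-packages and the marker would appear on every interpreter start instead, which is exactly what 1.82.8 did with its credential harvester.&lt;/p&gt;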

&lt;p&gt;The payload used double base64 encoding:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Popen&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
    &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;executable&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;-c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;import base64; exec(base64.b64decode(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;))&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once running, it executed three stages. Stage 1 harvested credentials: SSH keys, AWS/GCP/Azure tokens, environment variables, &lt;code&gt;.env&lt;/code&gt; files, Kubernetes configs, Docker configs, database credentials, shell history, browser cookies, and cryptocurrency wallets. Stage 2 deployed privileged Alpine pods into the &lt;code&gt;kube-system&lt;/code&gt; namespace on every reachable Kubernetes node, grabbing cluster secrets and service account tokens. Stage 3 installed &lt;code&gt;sysmon.py&lt;/code&gt; as a systemd service that polled &lt;code&gt;checkmarx[.]zone/raw&lt;/code&gt; for additional payloads, giving the attacker persistent access even after discovery.&lt;/p&gt;

&lt;p&gt;All stolen data was encrypted and POSTed to &lt;code&gt;models.litellm[.]cloud&lt;/code&gt;, a lookalike domain controlled by TeamPCP.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Blast Radius Was Bigger Than I Expected
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;.pth&lt;/code&gt; execution model is what made this particularly nasty. On any machine where LiteLLM 1.82.8 was installed, the malware fired every time Python started. Not when you imported the package. Not when you used the proxy. Every single Python process.&lt;/p&gt;

&lt;p&gt;That means a data scientist running Jupyter, a DevOps engineer running Ansible, a backend dev spinning up a Flask server: all compromised if the package sat anywhere in their Python environment. The malware just ran silently alongside whatever they were actually doing.&lt;/p&gt;

&lt;p&gt;Here's the part that really got me: you didn't need to install it yourself. If any package in your dependency tree pulled LiteLLM in, the payload still executed. As reported in &lt;a href="https://github.com/BerriAI/litellm/issues/24512" rel="noopener noreferrer"&gt;GitHub issue #24512&lt;/a&gt;, the researcher who found this discovered it because their Cursor IDE pulled LiteLLM through an MCP plugin without any manual installation.&lt;/p&gt;

&lt;p&gt;I checked my own environment and found LiteLLM listed in the &lt;code&gt;Required-by&lt;/code&gt; field for a framework I'd installed months ago. I had no idea it was there.&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Checked If I Was Affected
&lt;/h2&gt;

&lt;p&gt;Here's what I ran across my local machine, CI/CD runners, Docker images, and staging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip show litellm | &lt;span class="nb"&gt;grep &lt;/span&gt;Version
pip cache list litellm
find / &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s2"&gt;"litellm_init.pth"&lt;/span&gt; 2&amp;gt;/dev/null
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then I scanned egress logs. Any traffic to &lt;code&gt;models.litellm[.]cloud&lt;/code&gt; or &lt;code&gt;checkmarx[.]zone&lt;/code&gt; means confirmed exfiltration:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# CloudWatch&lt;/span&gt;
fields @timestamp, @message
| filter @message like /models&lt;span class="se"&gt;\.&lt;/span&gt;litellm&lt;span class="se"&gt;\.&lt;/span&gt;cloud|checkmarx&lt;span class="se"&gt;\.&lt;/span&gt;zone/

&lt;span class="c"&gt;# Nginx&lt;/span&gt;
&lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-E&lt;/span&gt; &lt;span class="s2"&gt;"models&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;litellm&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;cloud|checkmarx&lt;/span&gt;&lt;span class="se"&gt;\.&lt;/span&gt;&lt;span class="s2"&gt;zone"&lt;/span&gt; /var/log/nginx/access.log
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And checked for transitive installation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip show litellm  &lt;span class="c"&gt;# Check "Required-by" field&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If other packages list LiteLLM there, it entered your environment as a transitive dependency without your knowledge.&lt;/p&gt;
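&lt;p&gt;To enumerate those paths programmatically instead of eyeballing &lt;code&gt;pip show&lt;/code&gt; output, here's a short standard-library sketch (nothing LiteLLM-specific; swap the target name as needed):&lt;/p&gt;

```python
# List every installed distribution that declares litellm as a
# dependency, i.e. the transitive routes into your environment.
import re
from importlib import metadata

TARGET = "litellm"
dependents = []
for dist in metadata.distributions():
    for req in dist.requires or []:
        # Requirement strings look like "litellm>=1.0; extra == 'proxy'"
        name = re.split(r"[\s;<>=!~\[(]", req, maxsplit=1)[0].lower()
        if name == TARGET:
            dependents.append((dist.metadata["Name"], req))

for pkg, req in dependents:
    print(f"{pkg} -> {req}")
```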

&lt;h2&gt;
  
  
  The Incident Response Steps I Followed
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Kill everything immediately.&lt;/strong&gt; Stop all LiteLLM containers and scale Kubernetes deployments to zero:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker ps | &lt;span class="nb"&gt;grep &lt;/span&gt;litellm | &lt;span class="nb"&gt;awk&lt;/span&gt; &lt;span class="s1"&gt;'{print $1}'&lt;/span&gt; | xargs docker &lt;span class="nb"&gt;kill
&lt;/span&gt;kubectl scale deployment litellm-proxy &lt;span class="nt"&gt;--replicas&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;0 &lt;span class="nt"&gt;-n&lt;/span&gt; your-namespace
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2: Rotate every credential on affected machines.&lt;/strong&gt; The malware harvested everything it could reach. I treated the following as fully compromised: cloud provider tokens (AWS access keys, GCP service account keys, Azure AD tokens), all SSH keys in &lt;code&gt;~/.ssh/&lt;/code&gt;, database passwords and connection strings from &lt;code&gt;.env&lt;/code&gt; files, every LLM provider API key (OpenAI, Anthropic, Gemini), Kubernetes service accounts and CI/CD tokens, and any crypto wallet files present on the machine. If you have crypto wallets on an affected host, move funds immediately.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Hunt for persistence artifacts.&lt;/strong&gt; The malware planted privileged pods in Kubernetes and installed a systemd backdoor. Check for both:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Check for lateral movement&lt;/span&gt;
kubectl get pods &lt;span class="nt"&gt;-n&lt;/span&gt; kube-system | &lt;span class="nb"&gt;grep&lt;/span&gt; &lt;span class="nt"&gt;-i&lt;/span&gt; &lt;span class="s2"&gt;"node-setup"&lt;/span&gt;
find / &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s2"&gt;"sysmon.py"&lt;/span&gt; 2&amp;gt;/dev/null

&lt;span class="c"&gt;# Full removal&lt;/span&gt;
pip uninstall litellm &lt;span class="nt"&gt;-y&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; pip cache purge
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; ~/.cache/uv
find &lt;span class="si"&gt;$(&lt;/span&gt;python &lt;span class="nt"&gt;-c&lt;/span&gt; &lt;span class="s2"&gt;"import site; print(site.getsitepackages()[0])"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;-name&lt;/span&gt; &lt;span class="s2"&gt;"litellm_init.pth"&lt;/span&gt; &lt;span class="nt"&gt;-delete&lt;/span&gt;
&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; ~/.config/sysmon/ ~/.config/systemd/user/sysmon.service
docker build &lt;span class="nt"&gt;--no-cache&lt;/span&gt; &lt;span class="nt"&gt;-t&lt;/span&gt; your-image:clean &lt;span class="nb"&gt;.&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Do not downgrade to a previous version. Remove entirely and replace.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Deeper Problem with Self-Hosted Python Proxies
&lt;/h2&gt;

&lt;p&gt;I've been thinking about this since the cleanup, and honestly, the structural issue here goes beyond one compromised package.&lt;/p&gt;

&lt;p&gt;LiteLLM's Python proxy pulls in hundreds of transitive dependencies: ML frameworks, data processing libraries, provider SDKs. Every one of those is a trust decision most teams make automatically with &lt;code&gt;pip install --upgrade&lt;/code&gt;. When you add LiteLLM, you're not just trusting LiteLLM. You're trusting every package it depends on, every package those depend on, and every maintainer account tied to each one.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;.pth&lt;/code&gt; attack vector is especially concerning because most supply chain scanning tools focus on &lt;code&gt;setup.py&lt;/code&gt;, &lt;code&gt;__init__.py&lt;/code&gt;, and defined entry points. The &lt;code&gt;.pth&lt;/code&gt; mechanism is a legitimate Python feature for path configuration that has been completely overlooked as an injection vector. I expect this technique to show up in future attacks. Traditional scanning would not have caught it.&lt;/p&gt;
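&lt;p&gt;Auditing for it yourself is cheap, though. A rough sketch that flags every code-executing &lt;code&gt;.pth&lt;/code&gt; file in site-packages (expect some legitimate hits, e.g. from editable installs, so review rather than delete):&lt;/p&gt;

```python
# Flag .pth files containing lines that start with "import": those
# lines execute at every interpreter startup and deserve review.
import site
from pathlib import Path

suspicious = []
for sp in site.getsitepackages():
    for pth in Path(sp).glob("*.pth"):
        for line in pth.read_text(errors="ignore").splitlines():
            if line.strip().startswith("import"):
                suspicious.append((str(pth), line.strip()))

for path, line in suspicious:
    print(f"{path}: {line}")
```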

&lt;p&gt;There's also a response-time problem. The LiteLLM maintainers didn't rotate their CI/CD credentials for five days after the &lt;a href="https://github.com/aquasecurity/trivy/security/advisories/GHSA-69fq-xp46-6x23" rel="noopener noreferrer"&gt;Trivy disclosure on March 19&lt;/a&gt;. If the maintainers couldn't react fast enough, downstream teams had no realistic chance. When you self-host, you inherit the blast radius.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Moved To (And Why)
&lt;/h2&gt;

&lt;p&gt;After cleaning up, I needed to replace the routing layer. The options I evaluated fell into two buckets: self-hosted alternatives (which carry the same dependency tree risk) and managed gateways (which eliminate it).&lt;/p&gt;

&lt;p&gt;I ended up switching to a managed gateway approach. &lt;a href="https://docs.futureagi.com/docs/prism?utm_source=litellm_incident_blog&amp;amp;utm_medium=blog&amp;amp;utm_campaign=product_marketing&amp;amp;utm_content=prism_docs" rel="noopener noreferrer"&gt;Prism&lt;/a&gt; (by Future AGI) is one example of this pattern. Instead of installing a Python package to route requests, you point your OpenAI SDK at a managed endpoint. Your attack surface goes from an entire Python environment with hundreds of dependencies to an API key and a URL.&lt;/p&gt;

&lt;p&gt;The migration was a two-line change:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before (LiteLLM):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;litellm&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;completion&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After (managed gateway):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://gateway.futureagi.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-prism-your-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Same OpenAI SDK format, same model names, same response schema. TypeScript works identically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="nx"&gt;OpenAI&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;openai&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;baseURL&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;https://gateway.futureagi.com&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;apiKey&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sk-prism-your-key&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;gpt-4o&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt; &lt;span class="na"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;user&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Hello&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Provider keys sit in the gateway dashboard instead of &lt;code&gt;.env&lt;/code&gt; files scattered across developer machines. You can &lt;a href="https://docs.futureagi.com/docs/prism?utm_source=litellm_incident_blog&amp;amp;utm_medium=blog&amp;amp;utm_campaign=product_marketing&amp;amp;utm_content=prism_docs" rel="noopener noreferrer"&gt;read the full docs&lt;/a&gt; for setup details.&lt;/p&gt;

&lt;p&gt;For Kubernetes deployments, the swap is just environment variables:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;env&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLM_BASE_URL&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://gateway.futureagi.com"&lt;/span&gt;  &lt;span class="c1"&gt;# was http://litellm-proxy:4000&lt;/span&gt;
  &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;LLM_API_KEY&lt;/span&gt;
    &lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-prism-your-key"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then delete the LiteLLM pod, its service, Postgres, and Redis. That's infrastructure you no longer maintain or patch.&lt;/p&gt;

&lt;p&gt;One feature I didn't expect to use heavily is &lt;a href="https://docs.futureagi.com/docs/prism/features/caching?utm_source=litellm_incident_blog&amp;amp;utm_medium=blog&amp;amp;utm_campaign=product_marketing&amp;amp;utm_content=prism_caching" rel="noopener noreferrer"&gt;semantic caching&lt;/a&gt;. It matches queries that mean the same thing but use different wording, so "What is your return policy?" and "How do I return an item?" hit the same cache entry. Cached responses come back with &lt;code&gt;X-Prism-Cost: 0&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;prism&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Prism&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;GatewayConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;CacheConfig&lt;/span&gt;
&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Prism&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sk-prism-your-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://gateway.futureagi.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;GatewayConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;CacheConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;semantic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ttl&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;5m&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The gateway also applies guardrails (PII detection, prompt injection prevention) at the routing layer before requests reach the provider. That's 18+ checks I previously didn't have at all.&lt;/p&gt;

&lt;h2&gt;
  
  
  What This Means Going Forward
&lt;/h2&gt;

&lt;p&gt;The EU Cyber Resilience Act now holds organizations legally responsible for the security of open-source components in their products. SOC 2 Type II audits scrutinize dependency management. "We pull the latest from PyPI" won't pass a controls review anymore. If your product ran LiteLLM and customer credentials were exfiltrated, the liability is yours, not the maintainer's. For background on &lt;a href="https://futureagi.com/blogs/ai-compliance-guardrails-enterprise-llms-2025" rel="noopener noreferrer"&gt;AI compliance and LLM security&lt;/a&gt;, Future AGI has an enterprise breakdown worth reading.&lt;/p&gt;

&lt;p&gt;Dependency pinning alone doesn't fix this. Pinning prevents pulling a new malicious version but not a compromised maintainer overwriting an existing tag. Hash verification (&lt;code&gt;pip install --hash=sha256:&amp;lt;exact_hash&amp;gt;&lt;/code&gt;) is the real control, though adoption is low because the tooling is painful.&lt;/p&gt;
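&lt;p&gt;If you want to adopt it anyway, the workflow is roughly this (assuming &lt;code&gt;pip-tools&lt;/code&gt; for lockfile generation):&lt;/p&gt;

```shell
# 1. Compile a lockfile where every artifact carries its sha256
#    (requires pip-tools):
#      pip-compile --generate-hashes -o requirements.txt requirements.in
#
# 2. Install in enforcing mode; pip aborts if any downloaded
#    artifact's hash differs from the lockfile:
#      pip install --require-hashes -r requirements.txt

# Confirm your pip supports enforcing mode:
pip install --help | grep -- --require-hashes
```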

&lt;p&gt;Every team running LLM applications now faces a clear architectural choice: self-host and inherit the full supply chain risk, or use a managed gateway and shrink the trust boundary to an API endpoint. After March 24, the risk math changed.&lt;/p&gt;

&lt;p&gt;I spent two days rotating credentials and auditing Kubernetes pods because of a package I didn't even know was in my dependency tree. I'd rather spend that time shipping features.&lt;/p&gt;

</description>
      <category>security</category>
      <category>python</category>
      <category>ai</category>
      <category>devops</category>
    </item>
    <item>
      <title>What Is Tool Chaining in LLMs? Why It Breaks and How to Think About Orchestration</title>
      <dc:creator>Jay</dc:creator>
      <pubDate>Wed, 25 Mar 2026 13:01:13 +0000</pubDate>
      <link>https://dev.to/jay_singh_e5b5ee6be59c0e0/what-is-tool-chaining-in-llms-why-it-breaks-and-how-to-think-about-orchestration-2j3m</link>
      <guid>https://dev.to/jay_singh_e5b5ee6be59c0e0/what-is-tool-chaining-in-llms-why-it-breaks-and-how-to-think-about-orchestration-2j3m</guid>
      <description>&lt;p&gt;Your agent chains three tool calls together. The first returns slightly malformed output. The second accepts it but misinterprets a field. By the third call, the entire chain has gone off the rails. No error was thrown. Your logs look clean. The user got confidently wrong answers.&lt;br&gt;
If you've built anything with LLM agents beyond a demo, you've hit this. It's called the cascading failure problem, and research from Zhu et al. (2025) confirms it: error propagation from early mistakes cascading into later failures is the single biggest barrier to building dependable LLM agents.&lt;br&gt;
I've spent a lot of time debugging these kinds of failures, and I want to break down why tool chaining is so fragile, what the actual failure modes look like, and what patterns hold up in production.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tool Chaining, Quickly Defined
&lt;/h2&gt;

&lt;p&gt;Tool chaining is when an LLM agent executes multiple tool calls in sequence, where each tool's output becomes input for the next. The agent gets a user query, calls an API, processes the result with a second tool, and builds a final response from the combined output.&lt;br&gt;
A single tool call is simple. Chaining is where dependencies show up. The agent has to figure out execution order, track intermediate state, and handle partial failures while staying on task.&lt;br&gt;
In multi-agent systems, this gets worse. One agent calls a tool, hands the result to a second agent, which runs its own tool chain before returning. The orchestration overhead stacks fast, and so do the failure points.&lt;br&gt;
Here's a concrete example: a user asks an agent to pull earnings data, compare it against competitors, and generate a summary. The first call returns revenue in the wrong currency. The comparison runs fine but produces misleading figures. The summary confidently presents wrong data. Nothing errored out. That's the core danger when you chain tools without validation and observability.&lt;/p&gt;
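&lt;p&gt;That failure mode fits in a few lines. A toy version (hypothetical tools, hardcoded numbers) where the currency mismatch never raises:&lt;/p&gt;

```python
# Two chained tools. Tool 1 returns EUR but its output schema carries
# no currency, so tool 2 compares it against a USD figure. Nothing
# raises; the answer is simply wrong.
def fetch_revenue(company):
    return 92.5                 # millions of EUR -- implicit!

def compare(revenue, competitor_usd):
    return "ahead" if revenue > competitor_usd else "behind"

verdict = compare(fetch_revenue("acme"), 95.0)
print(verdict)   # "behind" -- at ~1.08 USD/EUR the true answer is "ahead"
```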

&lt;h2&gt;
  
  
  Why Tool Chains Break in Production
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Context Gets Lost Across Calls
&lt;/h3&gt;

&lt;p&gt;LLMs work within a finite context window. Every tool call adds tokens: function parameters, response payloads, reasoning traces. In long chains, critical context from early steps gets pushed out of the window or buried under intermediate results.&lt;br&gt;
This isn't theoretical. Research shows LLMs lose performance on information buried in the middle of long contexts, even with large windows. When your agent forgets a user constraint from step 1 by the time it hits step 5, the output might be structurally valid but factually wrong. The user asked for revenue in USD, but that detail got lost three calls ago.&lt;/p&gt;

&lt;p&gt;What actually helps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pass structured state objects between calls, not raw text. Keeps payloads compact and parseable.&lt;/li&gt;
&lt;li&gt;Summarize intermediate results before forwarding. Strip metadata the next tool doesn't need.&lt;/li&gt;
&lt;li&gt;Use frameworks with explicit state management. LangGraph, for example, provides durable state across graph nodes so context stays inspectable and doesn't just float in the prompt.&lt;/li&gt;
&lt;/ul&gt;
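&lt;p&gt;A minimal sketch of the first point, using nothing beyond the standard library: constraints travel in a typed state object instead of prose, so a late step can still see what an early step was told:&lt;/p&gt;

```python
# Structured state carried through the chain: the user's currency
# constraint is stated once and stays machine-readable at every step.
from dataclasses import dataclass, field

@dataclass
class ChainState:
    query: str
    currency: str = "USD"          # user constraint from step 1
    results: dict = field(default_factory=dict)

def step_fetch(state: ChainState) -> ChainState:
    # A real tool would call an API; here we tag the value with the constraint.
    state.results["revenue"] = {"value": 92.5, "currency": state.currency}
    return state

def step_summarize(state: ChainState) -> str:
    r = state.results["revenue"]
    return f"Revenue: {r['value']}M {r['currency']}"

state = step_fetch(ChainState(query="acme revenue"))
print(step_summarize(state))   # Revenue: 92.5M USD
```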

&lt;h3&gt;
  
  
  Cascading Failures Compound Silently
&lt;/h3&gt;

&lt;p&gt;This is the biggest production risk. When one tool returns bad or partial data, the error flows downstream and compounds at every step. Unlike traditional software where bad data throws exceptions, LLM tool chains fail silently because the agent treats garbage output as valid input and keeps going.&lt;br&gt;
A 2025 study on OpenReview that analyzed failed agent trajectories found error propagation was the most common failure pattern. Memory and reflection errors were the most frequent sources of cascades. Once they start, they're extremely hard to reverse mid-chain.&lt;br&gt;
In multi-agent setups, it's amplified further. The Gradient Institute found that transitive trust chains between agents mean a single wrong output propagates through the entire system without verification. OWASP ASI08 specifically flags cascading failures as a top security risk in agentic AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context Window Saturation
&lt;/h3&gt;

&lt;p&gt;Every tool call eats tokens. A chain of five calls can burn through 40-60% of your available context before the agent even starts generating its final response. Even with models offering massive token limits, the "lost in the middle" problem means the agent's attention degrades on information that isn't near the beginning or end of the context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Picking a Framework for Multi-Tool Orchestration
&lt;/h2&gt;

&lt;p&gt;The framework you choose shapes how much of this you have to handle yourself. Here's how the main options compare for production use in 2026:&lt;br&gt;
LangGraph is my go-to for anything stateful or branching. It models tool chains as explicit state machines: every node is a tool call or decision point, edges define transitions. You can plug in retry logic, fallback paths, and human-in-the-loop checkpoints at specific stages. Its durable execution feature means if a chain breaks at step 4 of 7, you resume from step 4 instead of restarting. It also offers deep tracing through LangSmith, with state transition capture.&lt;br&gt;
LangChain is still the fastest way to get started. Its LCEL pipe syntax makes linear tool chains quick to compose. But for production workloads with branching or parallel calls, most teams I've seen migrate to LangGraph for finer control.&lt;br&gt;
AutoGen works well for multi-agent conversation patterns. It uses message-passing with built-in function call semantics. Observability is moderate and usually needs external tooling for production-grade traces.&lt;br&gt;
CrewAI takes a role-based approach to multi-agent task execution. Tool assignment happens per role, which is intuitive but can mean longer deliberation before tool calls. It ships with basic logging out of the box.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tracing and Observability Are Not Optional
&lt;/h2&gt;

&lt;p&gt;You can't fix what you can't see. Tool chain failures are often silent: a chain that returns wrong answers without errors looks perfectly healthy in your logs unless you have distributed tracing on every step.&lt;/p&gt;

&lt;p&gt;What to capture in every tool chain execution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input and output of each tool call. Exact parameters and full responses so you can replay failures.&lt;/li&gt;
&lt;li&gt;Latency per step. A slow tool can cascade into downstream timeouts.&lt;/li&gt;
&lt;li&gt;Token consumption. Track context window usage to spot saturation before it degrades output quality.&lt;/li&gt;
&lt;li&gt;Agent reasoning between calls. Chain-of-thought capture helps you find logic errors that data alone won't reveal.&lt;/li&gt;
&lt;/ul&gt;
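&lt;p&gt;A minimal shape for one such record, covering the four fields above. The token estimate is a crude placeholder; in practice these fields come from your tracing SDK (LangSmith, Langfuse, OpenTelemetry, and the like):&lt;/p&gt;

```python
# Minimal trace record for a single tool-call step.

import time
from dataclasses import dataclass, field

@dataclass
class ToolSpan:
    tool: str
    inputs: dict
    output: str = ""
    latency_ms: float = 0.0
    tokens: int = 0
    reasoning: str = ""  # chain-of-thought captured between calls

def traced_call(tool_name: str, fn, **inputs) -> ToolSpan:
    """Wrap a tool call and record inputs, output, latency, and token cost."""
    span = ToolSpan(tool=tool_name, inputs=inputs)
    start = time.perf_counter()
    span.output = fn(**inputs)
    span.latency_ms = (time.perf_counter() - start) * 1000
    span.tokens = len(span.output) // 4  # crude ~4 chars/token estimate
    return span

# Hypothetical menu-lookup tool for illustration.
span = traced_call(
    "get_menu",
    lambda item: f'{{"item": "{item}", "price": 5.0}}',
    item="burger",
)
print(span.tool, round(span.latency_ms, 2), span.tokens)
```

&lt;p&gt;With exact inputs and outputs on every span, replaying a failed chain is a lookup rather than a reconstruction.&lt;/p&gt;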

&lt;p&gt;Tools like LangSmith, Langfuse, and Future AGI provide native tracing for LangGraph and LangChain workflows. Future AGI's traceAI SDK integrates with OpenTelemetry and includes built-in evaluation metrics for completeness, groundedness, and function calling accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluating Tool Chains Beyond "Did It Work?"
&lt;/h2&gt;

&lt;p&gt;Tracing tells you what happened. Evaluation tells you whether it was correct. For tool chains, you need to cover multiple dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tool selection accuracy: Did the agent pick the right tool at each step?&lt;/li&gt;
&lt;li&gt;Parameter correctness: Were the arguments valid and complete?&lt;/li&gt;
&lt;li&gt;Chain completion rate: What percentage of multi-step chains finish without errors, fallbacks, or manual correction?&lt;/li&gt;
&lt;li&gt;Output faithfulness: Does the final response reflect the tool data accurately without hallucinations?&lt;/li&gt;
&lt;li&gt;Error recovery rate: When a tool returns an error, how often does the agent actually recover?&lt;/li&gt;
&lt;/ul&gt;
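&lt;p&gt;Two of these metrics are easy to compute yourself once traces exist. The trace format and the expected-tool labels below are hypothetical; platforms like LangSmith or Future AGI attach similar scores automatically:&lt;/p&gt;

```python
# Scoring chain completion rate and tool selection accuracy over traces.
# Each step is an (actual_tool, expected_tool) pair; labels come from
# a golden dataset or an LLM judge in practice.

traces = [
    {"steps": [("get_menu", "get_menu"), ("add_item", "add_item")], "completed": True},
    {"steps": [("get_menu", "get_menu"), ("checkout", "add_item")], "completed": True},
    {"steps": [("get_menu", "get_menu")], "completed": False},
]

def chain_completion_rate(traces) -> float:
    return sum(t["completed"] for t in traces) / len(traces)

def tool_selection_accuracy(traces) -> float:
    steps = [s for t in traces for s in t["steps"]]
    return sum(actual == expected for actual, expected in steps) / len(steps)

print(f"completion: {chain_completion_rate(traces):.0%}")   # 67%
print(f"selection:  {tool_selection_accuracy(traces):.0%}")  # 80%
```

&lt;p&gt;Note that the second trace "completed" while picking the wrong tool, which is exactly why completion rate alone is a misleading health signal.&lt;/p&gt;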

&lt;p&gt;Running these at scale means automation. Platforms like Future AGI attach evaluation metrics directly to traces, scoring every execution and creating a continuous feedback loop. The point is to make evaluation a part of the pipeline, not something you run manually after incidents.&lt;/p&gt;

&lt;h2&gt;
  
  
  Patterns That Hold Up in Production
&lt;/h2&gt;

&lt;p&gt;These are the patterns I've seen consistently improve reliability across real deployments:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Validate at every boundary. Put input and output validation between every tool call using Pydantic or JSON Schema. Don't rely on the LLM to notice malformed data. Explicit validation catches errors at the source before they propagate.&lt;/li&gt;
&lt;li&gt;Plan first, execute second. Research from Scale AI shows that having the LLM formulate a structured plan (as JSON or code) before executing it through a deterministic executor reduces tool chaining errors significantly. Separating reasoning from execution is a big win.&lt;/li&gt;
&lt;li&gt;Implement circuit breakers. If a tool fails or returns unexpected results more than N times, break the circuit and return a graceful failure. Don't let one broken tool take down the entire workflow.&lt;/li&gt;
&lt;li&gt;Keep chains short. Longer chains mean more failure surface and more context consumption. If you need more than 5-6 sequential calls, restructure into sub-chains or parallel branches.&lt;/li&gt;
&lt;li&gt;Test with adversarial inputs. Your happy-path tests will pass. Production traffic won't be happy-path. Test with empty tool responses, oversized payloads, unexpected types, and ambiguous queries.&lt;/li&gt;
&lt;li&gt;Trace everything from day one. Instrument your tool chains with distributed tracing on the first deployment. When something breaks in production, traces are the difference between hours of debugging and a 10-minute fix.&lt;/li&gt;
&lt;/ol&gt;
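&lt;p&gt;Pattern 3 is small enough to sketch in full. The threshold and the fallback payload are illustrative; production breakers usually also add a cooldown before re-enabling a tripped tool:&lt;/p&gt;

```python
# Minimal circuit breaker around a flaky tool.

class CircuitBreaker:
    def __init__(self, max_failures: int = 3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, tool, *args):
        if self.failures >= self.max_failures:
            # Circuit is open: fail fast without touching the tool.
            return {"ok": False, "error": "circuit open: tool disabled"}
        try:
            result = tool(*args)
            self.failures = 0  # a healthy call resets the count
            return {"ok": True, "result": result}
        except Exception as exc:
            self.failures += 1
            return {"ok": False, "error": str(exc)}

def broken_tool(query):
    # Hypothetical tool that always times out.
    raise TimeoutError("inventory service timed out")

breaker = CircuitBreaker(max_failures=2)
for _ in range(3):
    print(breaker.call(broken_tool, "fries"))
# The third call short-circuits without invoking the broken tool at all.
```

&lt;p&gt;The structured `{"ok": False, ...}` failure gives the agent something it can reason about, instead of an exception that kills the whole chain.&lt;/p&gt;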

&lt;h2&gt;
  
  
  FAQ
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Why don't LLM tool chain errors throw exceptions like normal software?
&lt;/h3&gt;

&lt;p&gt;Because the LLM treats tool outputs as text, not typed data. If a tool returns malformed JSON or wrong values, the model doesn't crash. It interprets whatever it got and keeps going. That's why schema validation between every step matters so much. The LLM won't catch bad data for you.&lt;/p&gt;
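&lt;p&gt;Here's that failure mode made concrete. The validator below is hand-rolled for illustration; in practice you'd reach for Pydantic or JSON Schema, but the principle is the same: parse and type-check the output before the next step sees it:&lt;/p&gt;

```python
# Turning silent tool-output corruption into a loud, loggable failure.

import json

def validate_order(raw: str) -> dict:
    """Parse a tool's JSON output and enforce the fields the next step needs."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"tool returned non-JSON output: {exc}") from exc
    if not isinstance(data.get("price"), (int, float)):
        raise ValueError("tool output missing numeric 'price'")
    return data

good = validate_order('{"item": "burger", "price": 5.0}')

try:
    # An LLM would happily "interpret" this; the validator refuses it.
    validate_order('{"item": "burger", "price": "five"}')
except ValueError as err:
    print("caught:", err)
```
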

&lt;h3&gt;
  
  
  Is a longer context window the fix for context loss in tool chains?
&lt;/h3&gt;

&lt;p&gt;Not really. Even with million-token windows, research shows LLMs lose attention on information in the middle of the context. A bigger window gives you more room, but it doesn't solve the core problem. Structured state management and summarization between steps are more reliable than just hoping the model remembers everything.&lt;/p&gt;

&lt;h3&gt;
  
  
  When should I use LangGraph over LangChain for tool chaining?
&lt;/h3&gt;

&lt;p&gt;If your chain is linear and simple, LangChain's LCEL syntax is faster to set up. Once you need conditional branching, retries at specific steps, or durable execution (resume from failure point), LangGraph gives you that control. Most teams I've talked to start with LangChain and move to LangGraph when their chains get complex enough to need explicit state machines.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do I know if my tool chain is consuming too much of the context window?
&lt;/h3&gt;

&lt;p&gt;Trace your token usage per step. If your chain of tool calls is eating 40-60% of available tokens before the agent generates its final response, you're in the danger zone. Summarize intermediate outputs aggressively and strip metadata the downstream tools don't need.&lt;/p&gt;

&lt;h3&gt;
  
  
  What's the simplest thing I can do today to make my tool chains more reliable?
&lt;/h3&gt;

&lt;p&gt;Add Pydantic or JSON Schema validation on the output of every single tool call. It takes maybe 30 minutes to set up and catches the majority of silent data corruption issues before they cascade. It's the highest-leverage change you can make.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>architecture</category>
    </item>
  </channel>
</rss>
