DEV Community: Jay

How I Built a Self-Improving AI Agent Pipeline That Fixes Its Own Failures

Jay — Mon, 04 May 2026 18:38:59 +0000

Every AI agent I have shipped has had the same lifecycle: it works well in testing, degrades quietly in production, and by the time someone notices, there are three weeks of bad outputs to explain.

The fix most teams reach for is more evals and more monitoring. That helps, but it's still reactive. You're still waiting for something to break before you act.

I wanted to build something different: a pipeline that doesn't just detect failures but closes the loop automatically.

What "Self-Improving" Actually Means Here

I'm not talking about reinforcement learning or any kind of model fine-tuning.

The loop I built works at the prompt and pipeline level. It detects when outputs fail evaluation thresholds, traces which component caused the failure, generates a fix, tests that fix against the same inputs, and deploys it if the metrics improve.

The full cycle runs without manual intervention. A human reviews the summary, not the individual failures.

That structure means the evaluator is not just a reporting layer. It is the trigger for everything downstream.

The Pipeline Structure

The pipeline has five components that run in sequence.

Input feeds into the Agent. The agent's outputs go to an Evaluator. Failures from the evaluator go into a Root Cause Analyzer. The analyzer feeds a Prompt Optimizer. The optimized prompt feeds back into the agent.

That loop is the core of the system. Everything else is instrumentation around it.

The evaluator is where the loop either fires or stays dormant, so getting the evaluation criteria right is the most important upfront decision in the entire build.

Step 1: Evaluation as the Trigger

The evaluator runs after every agent response.

It checks outputs against a set of criteria defined upfront: task completion, factual accuracy, response format, and any domain-specific constraints for the use case. Each criterion gets a pass/fail score.

When the aggregate score drops below a threshold, the evaluator doesn't just log the failure. It packages the failing input, the bad output, the criterion that failed, and the expected output, then sends that bundle downstream.

That bundle is what makes automated root cause analysis possible. Without structured failure data, you can't automate anything beyond alerting.

Step 2: Root Cause Analysis

Most failures in LLM pipelines come from one of three places: the prompt is ambiguous, the context window is missing relevant information, or the model is being asked to do something outside its reliable capability.

The root cause analyzer looks at the failure bundle and tries to classify which of those three caused the failure.

It does this by running the same input through a separate diagnostic prompt that asks the model to explain why the original output failed. That explanation gets tagged with a failure category: prompt issue, context issue, or model capability issue.

Prompt issues and context issues are fixable automatically. Model capability issues get flagged for human review.

This separation matters. Trying to auto-fix a model capability issue at the prompt level wastes cycles and produces false confidence.

Step 3: Automated Prompt Optimization

For prompt-category failures, the optimizer generates candidate fixes.

It takes the original prompt, the failure explanation, and the expected output, then generates three to five alternative prompt variants designed to address the specific failure mode.

def generate_prompt_variants(
    original_prompt: str,
    failure_explanation: str,
    expected_output: str,
    n_variants: int = 5
) -> list[str]:
    optimizer_prompt = f"""
    Original prompt: {original_prompt}

    Why it failed: {failure_explanation}

    Expected output: {expected_output}

    Generate {n_variants} improved prompt variants that fix the failure.
    Each variant should be on a new line, prefixed with VARIANT:
    """

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": optimizer_prompt}]
    )

    variants = []
    for line in response.choices.message.content.split('\n'):
        if line.startswith('VARIANT:'):
            variants.append(line.replace('VARIANT:', '').strip())

    return variants

Each variant runs against the same failing input. The variant that scores highest on the evaluator replaces the original prompt in the pipeline.

Before any fix goes live, it runs through a validation gate, which is the step that prevents a targeted improvement from quietly breaking something adjacent.

Step 4: Context Repair for Context Failures

Context failures need a different fix than prompt failures.

When the analyzer flags a context issue, it means the agent produced a bad output because it lacked relevant information, not because the prompt was wrong. The fix is to identify what information was missing and add a retrieval step that fetches it.

def repair_context(
    agent_input: str,
    failure_explanation: str,
    available_context_sources: list[str]
) -> str:
    context_repair_prompt = f"""
    Agent input: {agent_input}

    Failure explanation: {failure_explanation}

    Available context sources: {available_context_sources}

    Identify what information was missing and write a retrieval query
    that would fetch it from the most relevant source.
    """

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": context_repair_prompt}]
    )

    return response.choices.message.content

The retrieval query runs against the specified sources. The fetched context gets injected into the agent's input on the next run.

Both prompt fixes and context fixes feed into the same validation step, so the deployment decision is always based on measured improvement, not just absence of obvious regression.

Step 5: Validation Before Deployment

No fix goes into production without validation.

The candidate fix (whether a new prompt or augmented context) runs against a held-out set of cases that includes the original failure and similar inputs from the evaluation history. If the fix improves the target metric without regressing others, it gets promoted.

def validate_fix(
    original_prompt: str,
    candidate_prompt: str,
    test_cases: list[dict],
    evaluator: callable
) -> dict:
    original_scores = []
    candidate_scores = []

    for case in test_cases:
        original_output = run_agent(original_prompt, case['input'])
        candidate_output = run_agent(candidate_prompt, case['input'])

        original_scores.append(evaluator(original_output, case['expected']))
        candidate_scores.append(evaluator(candidate_output, case['expected']))

    return {
        "original_mean": sum(original_scores) / len(original_scores),
        "candidate_mean": sum(candidate_scores) / len(candidate_scores),
        "improvement": (
            sum(candidate_scores) - sum(original_scores)
        ) / len(test_cases),
        "should_deploy": (
            sum(candidate_scores) > sum(original_scores)
        )
    }

If should_deploy is False, the failure gets escalated to the human review queue with the full context: original prompt, candidate prompt, test results, and the failure bundle that triggered the cycle.

With the five components defined, the full loop wires together in a single class that runs on every inference.

Putting It Together

The full loop in code looks like this:

class SelfImprovingPipeline:
    def __init__(self, agent, evaluator, optimizer, validator):
        self.agent = agent
        self.evaluator = evaluator
        self.optimizer = optimizer
        self.validator = validator
        self.current_prompt = agent.system_prompt

    def run(self, user_input: str) -> str:
        output = self.agent.run(user_input, self.current_prompt)

        eval_result = self.evaluator.evaluate(output, user_input)

        if eval_result['score'] < eval_result['threshold']:
            failure_bundle = {
                'input': user_input,
                'output': output,
                'eval_result': eval_result,
                'current_prompt': self.current_prompt
            }

            root_cause = analyze_root_cause(failure_bundle)

            if root_cause['type'] == 'prompt':
                variants = self.optimizer.generate_variants(
                    self.current_prompt,
                    root_cause['explanation']
                )

                validation = self.validator.validate(
                    self.current_prompt,
                    variants,  # Best variant
                    get_test_cases()
                )

                if validation['should_deploy']:
                    self.current_prompt = variants
                    log_improvement(validation)
                else:
                    escalate_to_human(failure_bundle, validation)

        return output

The pipeline runs every inference through the evaluator. Most requests pass and nothing changes. When something fails, the loop fires, tries a fix, validates it, and either deploys or escalates. The full cycle runs in the background while the pipeline continues serving requests.

Running this in production surfaces two things quickly: most failures are interaction effects, not isolated prompt bugs, and the root cause classifier is the component that needs the most tuning time.

What I Learned Running This in Production

The first thing that surprised me was how rarely failures are caused by a single bad prompt instruction. Most failures are interaction effects: an instruction that works well in isolation produces unexpected behavior when combined with a specific type of user input.

The root cause analyzer is the part that needs the most tuning. Getting the failure categorization right (prompt vs. context vs. model capability) determines whether the downstream fix is useful or wasted compute. I spent more time calibrating the diagnostic prompt than any other part of the system.

The validation gate is also more important than it sounds. Early versions of this pipeline would occasionally deploy a fix that improved the target metric while regressing something adjacent. The held-out test set needs to cover failure modes you've already seen and general capability cases, not just the current failure.

The full observability setup that makes this pipeline debuggable is documented here. Without trace-level visibility into each cycle, the loop is a black box and you won't trust it.

Your APM Tool Won't Catch Voice AI Failures. Here's What Actually Needs Monitoring

Jay — Wed, 15 Apr 2026 16:54:23 +0000

Most production voice agents get monitored the same way web services do: uptime, error rates, response times. I've seen this pattern more times than I can count, and it consistently fails.

Your infrastructure metrics look green while users are abandoning calls because the agent dropped conversation state or misclassified an intent. Standard APM was never designed to catch that.

Why Voice AI Breaks Differently

Voice AI runs across multiple probabilistic layers. STT converts audio under varying noise conditions. An LLM infers intent from that imperfect transcription. Tool calls fire based on that inference. TTS converts the response back to audio. A failure anywhere in that chain produces a bad conversation, and none of it shows up in an uptime dashboard.

Performance drift is the harder problem. It creeps in as your agent encounters new accents, background noise patterns, or edge cases outside your training data. By the time customer complaints surface, the damage is already done.

The Metrics That Actually Matter

Latency

Three latency measurements matter, and they're not interchangeable:

Time-to-First-Byte (TTFB): The delay between user silence and the first audio packet returned. This determines whether the conversation feels natural or stilted.
End-to-End Turn Latency: Total time from user input to agent response completion, covering transcription, LLM inference, and TTS generation.
TTS Processing Lag: The delta between text generation and audio rendering. This one gets ignored until you hit a synthesis pipeline bottleneck.

Conversation Quality

Word Error Rate (WER): Compare ASR output against ground-truth logs to catch domain-specific vocabulary failures. A 3% WER on general speech can be 15% on industry terminology.
Intent Classification Confidence: Track the model's confidence scores for intent recognition. Sudden drops indicate new query patterns or training data gaps.
Task Success Rate: The percentage of conversations where the user's primary goal was completed without human intervention. This is the metric that maps directly to business outcomes.

Business Metrics

Average Handle Time (AHT): High AHT usually means the agent is looping rather than resolving.
First Contact Resolution (FCR): Low FCR is often a sign the agent is giving partial answers.
Escalation Rate: Differentiate between planned handoffs and failure-driven ones. They look identical in raw numbers but mean completely different things.

Audio Quality

Mean Opinion Score (MOS): Automated algorithms estimate audio clarity on a 1-5 scale. Flag calls scoring below 3.5.
Jitter and Packet Loss: Network stability metrics that produce choppy or robotic artifacts during real-time streaming.
Barge-in Failure Rate: Instances where the agent failed to stop speaking when the user interrupted. This drives user frustration faster than almost any other failure mode.

Alerting That Doesn't Burn Out Your On-Call Team

Track P95, Not Average

Average latency hides tail-user frustration. If 5% of calls have 3-second delays, that's hundreds of angry users daily even when your average looks fine.

Track P95 and P99. Also track spike duration. A 2-second burst looks completely different from a sustained 6-hour degradation, and your alerts should distinguish between them.

Anomaly Detection Over Static Thresholds

Static thresholds work for hard limits like SLA violations or server down states. They fail for metrics that naturally fluctuate with traffic patterns.

An 800ms threshold at 2 AM looks fine. That same number during peak hours means something entirely different. Adaptive baselines that learn your traffic patterns catch drift and seasonal spikes that static rules miss. What I've found in practice is that the first two weeks of anomaly detection produce alerts you'll want to tune. After that, the signal-to-noise ratio improves considerably and it stops feeling noisy.

Group Alerts to Prevent Fatigue

Don't alert on every raw metric spike. Group related signals into incidents. Page engineers only for sustained P95 degradation or widespread error spikes. Log transient jitter warnings for batch review during sprint planning. The goal is that on-call engineers get paged for things that actually need immediate action, nothing else.

End-to-End Conversation Tracing

Infrastructure metrics tell you something broke. Traces tell you what broke and why.

Session-Level Visibility

Assign a unique session ID to each conversation and link every user turn, agent response, tool call, and audio event under that session. When a user reports a bad interaction, you pull the session ID and replay the full trace rather than reconstructing the sequence from scattered logs.

Component-Level Breakdown

Each trace should isolate latency by component: STT processing time, LLM inference duration, TTS generation lag. Without this breakdown, "high turn latency" could mean a slow LLM or a synthesis bottleneck. They present identically at the surface.

Causality Chains

When the agent makes a tool call to check inventory or book an appointment, the trace should connect that action back to the specific audio input that triggered it. This is how you catch full failure chains: network jitter causes packet loss, STT mishears the transcription, the LLM classifies the wrong intent, the wrong tool fires. Each failure looks independent without the causality chain linking them.

Confidence Score Tracking

Configure evaluations to capture confidence scores at transcription, intent classification, and TTS quality estimation steps. Filter for conversations where low confidence scores correlate with failure. These are your highest-signal debugging targets, and they're easy to miss if you're only looking at binary pass/fail outcomes.

Catching Drift Before Customers Do

What Causes It

Drift accumulates from model version updates, shifts in user language patterns (new slang, regional accents, industry jargon), and infrastructure changes like switching STT providers or rebalancing load. It's gradual, which is exactly what makes it dangerous.

A Real Example Worth Looking At

A voice agent handling insurance claims showed a 4% drop in intent classification accuracy over two weeks. The cause: customers started using "virtual inspection" instead of "photo claim" after a marketing campaign. Manual monitoring wouldn't catch a 4% drift over two weeks. Automated anomaly detection flagged the confidence score drop early enough to retrain before the accuracy dip had any measurable business impact.

The takeaway: when anomaly detection fires, correlate across multiple dimensions (latency, accuracy, confidence scores, user segments) before assuming you know the root cause. Single-metric spikes almost always have upstream contributors.

The Improvement Loop

Observability without a feedback loop is just expensive logging. The loop that actually works:

Observe: Monitor production traffic in real time. Set aside time weekly to review traces for recurring failure patterns rather than waiting for escalations to surface them.

Evaluate: Run targeted experiments comparing prompts, model versions, or pipeline configurations against production-like scenarios. Use datasets derived from real user interactions, not just synthetic coverage tests.

Optimize: Deploy incrementally. Track metrics before and after. Validate that the fix improved target KPIs without regressing other dimensions. Feed evaluation results back into your training data and observability baselines.

One thing I've changed my thinking on: pre-production testing should use production traces, not just synthetically generated scenarios. Real failures are the highest-quality test cases you have. Feeding them back into your test suite catches the accent variations, background noise patterns, and unexpected phrasings that synthetic generation doesn't produce.

The full architecture for this kind of observability setup is documented here.

Team Workflows: Who Sees What

Different roles need different views of the same data:

Engineers get real-time alerts for P95 latency spikes and error rate increases
Product managers see dashboards tracking task success rates and escalation frequency
ML teams review weekly reports on confidence score distributions and drift patterns

Route alerts based on severity and context. On-call engineers get paged only for sustained degradation. Lower-priority signals get queued for sprint planning review. The architecture of your alerting system matters as much as the metrics themselves.

Curious where others draw the line between acceptable drift and "this needs a retrain." Context-specific thresholds seem obvious in theory but I've seen teams struggle with it in practice, especially right after a model update.

From 66% to 96%: How I Fixed a Drive-Thru Voice Agent Before It Took a Single Real Call

Jay — Tue, 14 Apr 2026 17:51:13 +0000

I've been building voice agents for a while. The hardest part isn't the STT or TTS layer.

It's this: how do you test edge cases before you have real users?

The default answer is the vibe-check loop. You call your own agent, order a burger, say "yeah that felt okay," and move on. I did this for longer than I should have.

The Scenario

I built a drive-thru voice agent called "Future Burger." Requirements were simple: take orders fast, stay concise, skip the small talk.

The architecture was brain-first. STT and TTS are just the ears and mouth, interchangeable peripherals. The LLM handles reasoning, context switching, and tool calling.

If the agent can't figure out that "Actually, make that a Sprite" means replacing the previous drink, no amount of voice synthesis polish saves the interaction. So I focused entirely on the intelligence layer.

Step 1: Synthetic Data (Skipping the Cold Start)

Instead of waiting weeks for real call logs, I used FutureAGI's Dataset to build a ground truth dataset. You define a schema and it produces structured input/output pairs.

I asked for two fields: user_transcript (what the user says) and expected_order (what the agent should actually book).

Prompt used:

"Generate 500 diverse drive-thru interactions. Include complex orders like 'Cheeseburger no pickles', combo meals, and modifications."

In seconds I had 500 labeled pairs ready for evaluation. What surprised me here was how fast this exposed gaps I hadn't even thought to test. Mid-sentence order changes, multilingual switches, impatient customers. Edge cases I always meant to write but never did.

Step 2: Baseline Prompt (Workbench + Experiments)

Before touching latency or audio quality, I needed to confirm the logic holds. I drafted the initial system prompt (v0.1) in the Prompt Workbench, saved it as a versioned template, and ran an experiment across those 500 scenarios using three models: gpt-5-nano, Gemini-3-Flash, and gpt-5-mini.

Result: 80% accuracy. Decent. But the responses were wall-of-text paragraphs. Every reply opened with something like:

"Certainly! I have updated your order to include a cheeseburger without pickles and a medium Sprite. Is there anything else I can help you with today?"

Fine for a chatbot. For a voice agent where every word adds latency, it's a failure mode.

Step 3: Simulation (The Stress Test)

I connected the agent and ran a simulation with layered scenario types: hesitant users, stuttering, mid-order changes, rushed and angry customers.

The results were immediate:

Latency issues. The agent was too wordy. It started every response with "Certainly!" and ran three sentences too long.
Logic breaks. When a user changed their mind, the agent added both items to the cart instead of replacing the first.
Success rate: 66%.

One in three conversations ending in failure is not a quirk to patch later. That's a production blocker.

Step 4: Automated Optimization

This is the part I found most useful. Instead of manually editing the system prompt and guessing which instruction caused which failure, I let the optimization engine analyze the conversation logs directly.

I defined 10 evaluation criteria specific to this agent, including:

Context_Retention

Objection_Handling

Language_Switching

Because the platform evaluates native audio rather than transcripts alone, it recognized failure patterns across hundreds of simulated conversations and surfaced two actionable fixes:

Fix 1 (High Latency): "Reduce decision tree depth for menu inquiries and remove redundant validation steps."
Fix 2 (Hallucination): "Restrict generative capabilities to the defined menu_items vector store to prevent inventing dishes."

I selected the failed simulation runs and ran ProTeGi optimization with two objectives:

Task_Completion

Customer_Interruption_Handling

The system iterated on the system prompt automatically, testing variants like "Be extremely brief" or "If user changes mind, overwrite previous item." It ran each variant against the simulator in a loop until the metrics climbed.

I've done this manually on other projects. It takes hours. Watching it run in a loop was a genuinely different experience.

Step 5: Results

Before: Polite, slow, failed to track mid-order changes
After: Crisp. "Burger, no pickles. Got it." 96% accuracy on the "Indecisive" scenario

Going from 66% to 96% without writing a single new instruction manually validated the loop: Dataset > Simulate > Evaluate > Optimize.

What I Took From This

The cold start problem for voice agents is real. You can't get quality data without users, and you can't get users without quality behavior. Synthetic simulation breaks that dependency.

The bigger shift for me was realizing that most prompt debugging is just pattern matching on logs. You run the agent, it fails, you guess why, you edit, you repeat. That process is automatable. The hard part is setting up the right evaluation criteria upfront.

If you're still in the vibe-check phase and want to see what the full evaluation infrastructure looks like, the architecture walkthrough is here.

Curious what evaluation criteria others track for voice agents in production. Context retention and objection handling were obvious for this use case, but I'd like to know what else people actually measure.

The MCP Evaluation Framework Nobody Talks About (But Should)

Jay — Wed, 08 Apr 2026 13:49:32 +0000

Your agent worked fine in staging. It called the right MCP tools, returned clean outputs, passed the test suite. Then it hit production, a user sent a slightly different query, and it picked the wrong tool, passed malformed arguments, and chained three unnecessary calls before returning garbage.

I've watched this happen more times than I'd like. The model isn't the problem. The missing piece is an evaluation system that matches how MCP actually behaves at runtime.

Why MCP Changes the Eval Problem

Before MCP, most agents had hardcoded tools. You could write deterministic tests: "Given this input, the agent should call search_docs with these parameters." That worked.

MCP flips that model. An MCP-connected agent discovers tools at runtime from one or more MCP servers. The available tools can change between requests. The agent decides what to call, in what order, with what arguments, based on the user's prompt and context injected through MCP resources.

Anthropic open-sourced MCP in late 2024. Within a year it had 97 million monthly SDK downloads and 10,000+ published servers. In December 2025, Anthropic donated MCP to the Linux Foundation's Agentic AI Foundation (AAIF), with OpenAI, Google, Microsoft, and AWS backing the move.

This creates three evaluation problems that didn't exist before:

Dynamic tool selection is non-deterministic. The same query can produce different tool call sequences depending on which MCP servers are connected and what they expose at that moment. You can't assert "the agent must call this specific tool." You evaluate whether the choice was reasonable given the available options.

Context injection needs validation. MCP servers inject resources that shape the agent's reasoning. If a resource returns stale data or an unexpected format, the agent reasons incorrectly. Your eval needs to cover whether that injected context was used correctly, not just whether the final output looked reasonable.

Chains need end-to-end tracing. A single request can trigger 5 to 10 MCP tool calls across different servers, each with its own latency, failure mode, and output quality. Following only the final response misses every intermediate failure.

Five Dimensions to Evaluate

1. Tool Selection Accuracy

Did the agent pick the right tool? Measure this against labeled examples where humans identified the optimal tools for a given query. Two sub-metrics:

Precision: Out of all tools called, how many were necessary?
Recall: Out of all tools that should have been called, how many were?

High precision with low recall means the agent is too conservative and missing useful tools. Low precision with high recall means it's calling unnecessary tools, burning tokens, and increasing latency.

2. Argument Correctness

Even when the agent picks the right tool, it can pass wrong arguments. Validate that:

Arguments match the MCP tool's JSON schema
Types are correct (no string where an integer belongs)
Required fields are present and populated
Semantic accuracy holds: did it pass the correct document ID for this specific task, not a random one?

3. Task Completion Rate

This is the bottom-line metric. Did the agent actually accomplish what the user asked? I use LLM-as-a-judge evaluators here because they catch cases where every individual tool call succeeded but the agent failed to synthesize the results correctly.

4. Chain Efficiency

MCP agents can make far more tool calls than necessary. Track:

Total tool calls per request
Redundant calls (same tool, same arguments called twice)
Calls whose outputs never appeared in the final response
Total chain latency

An agent that calls 8 tools when 2 would do isn't just slow. It's expensive and significantly harder to debug.

5. Context Utilization

MCP servers expose resources that influence the agent's reasoning. Evaluate whether the agent used that context accurately or hallucinated information that contradicted it. The key metrics are groundedness and context relevance.

Here are the thresholds I use as a starting baseline:

Tool Selection Precision: >85%
Tool Selection Recall: >90%
Argument Schema Compliance: >98%
Task Completion: >80%
Chain Efficiency Ratio (min needed calls / actual calls): >0.7
Groundedness: >85%
P95 Latency: <5s

You Can't Eval What You Can't See

Tracing is the foundation. The standard approach is OpenTelemetry-based instrumentation, where each MCP tool call becomes a span recording: tool name, server name, arguments, response, latency, and status code. These spans nest under a parent trace representing the full user request.

A well-instrumented MCP trace captures:

Root span: User query received, final response returned
LLM decision span: Model reasoning, tool selection decision
MCP tool call spans: One per invocation, with full arguments and response
Context retrieval spans: MCP resource fetches
Synthesis span: Final response generation from tool outputs

TraceAI is an open-source library that extends OpenTelemetry with AI-specific semantic conventions. It supports 20+ frameworks including OpenAI, Anthropic, LangChain, and CrewAI. Setup is under 10 lines:

from fi_instrumentation import register
from fi_instrumentation.fi_types import ProjectType
from traceai_openai import OpenAIInstrumentor

trace_provider = register(
    project_type=ProjectType.OBSERVE,
    project_name="mcp_agent_prod"
)
OpenAIInstrumentor().instrument(tracer_provider=trace_provider)

Once traces are flowing, you can visualize every LLM call, tool invocation, and retrieval step as nested timelines on the Future AGI Observe dashboard, with latency, cost, and evaluation scores side-by-side.

Building the Pipeline

Step 1: Instrument your agent.
Set up auto-instrumentation with TraceAI or a compatible library. Capture the MCP-specific attributes too: which server the tool came from, schema version, and whether the call was a retry. That context is critical when debugging failures at 2am.

Step 2: Define your evaluation criteria.
Pick metrics from the five pillars based on your use case. A support agent should prioritize task completion and groundedness. A code generation agent should prioritize argument correctness and chain efficiency.

Step 3: Set up automated evaluators.
For subjective measurements like task completion and response quality, use LLM-as-a-judge. For objective checks like schema compliance and latency thresholds, use deterministic validators.

The evaluation SDK ships with 60+ pre-built templates covering factual accuracy, groundedness, tone, conciseness, and more:

from fi.evals import Evaluator

evaluator = Evaluator(
    fi_api_key="your_api_key",
    fi_secret_key="your_secret_key"
)

result = evaluator.evaluate(
    eval_templates="groundedness",
    inputs={
        "context": retrieved_context,
        "output": agent_response
    },
    config={"model": "turing_flash"}
)

Step 4: Sample and score production traffic.
Don't eval every request. A 10-20% sampling rate works for most teams. For finance or healthcare, push toward 100%. Future AGI's Eval Tasks let you schedule scoring on live or historical traffic with configurable sampling rates.

Step 5: Alert on regression.
Threshold-based alerts are what turn passive monitoring into an actual feedback loop:

Task completion drops below 80%? Alert.
Average tool calls per request spikes above 6? Alert.
Argument schema compliance dips below 95%? Alert.

Route these to Slack, PagerDuty, or your CI/CD pipeline.

Failure Patterns Worth Flagging

A few I keep running into:

Testing only the happy path. Dev and staging MCP servers have limited tool sets. Mirror production MCP server configs in your test environment, or you're not actually testing the surface area that breaks.

Evaluating calls in isolation. Evaluating each tool call without considering the chain misses ordering failures. Evaluate full sequences and flag when order affects correctness.

LLM-as-a-judge without deterministic checks. LLM evaluators are inconsistent on their own. Pair them with schema validation, not instead of it.

No established baseline. If you don't record baseline metrics in the first week, you can't detect degradation. Track deltas. Absolute scores lie.

No cost tracking. Tool calls compound fast in MCP chains. Include token and call costs in every trace. Set spike alerts before the bill does it for you.

Evaluating post-ship only. Running evals only after deployment means you're always reacting. Enable tracing in experiment mode during development and surface failure patterns before they reach production.

Closing the Loop

Evaluation without action is just monitoring. The actual cycle:

Trace every MCP tool call with OpenTelemetry-compatible instrumentation
Evaluate sampled traces across the five metrics automatically
Identify failure patterns through clustering: which tools fail most, which queries produce the worst task completion scores
Iterate on prompts, tool descriptions, and MCP server configurations based on evaluation feedback
Verify improvements by comparing eval scores across deployment versions

The teams shipping reliable MCP-connected agents aren't the ones with the best models. They're the ones with the best evaluation pipelines. Start there.

How I Audited My Infra After the LiteLLM Supply Chain Attack (And What I'm Doing Differently Now)

Jay — Mon, 06 Apr 2026 18:30:00 +0000

I woke up to a Slack thread on March 24, 2026, that made my stomach drop. LiteLLM, the Python proxy I'd been running to route LLM calls across providers, had been backdoored with credential-stealing malware. Versions 1.82.7 and 1.82.8, published by a threat actor called TeamPCP, contained a three-stage payload that harvested SSH keys, cloud credentials, Kubernetes secrets, and cryptocurrency wallets. PyPI quarantined the entire package.

What surprised me was the targeting. LiteLLM is literally an API key management gateway. It holds credentials for every LLM provider your org uses. If you wanted to compromise one package to get access to everything, this was the perfect pick.

This wasn't a one-off either. It was the third hit in a five-day campaign. Aqua Security's Trivy scanner got compromised on March 19 (GHSA-69fq-xp46-6x23). Checkmarx's KICS GitHub Actions followed on March 23 (kics-github-action#152, Checkmarx Update). LiteLLM was the final target on March 24 (litellm#24512, LiteLLM Update). The attack chain was present in 36% of cloud environments, often pulled in as a transitive dependency through agent frameworks nobody audited.

How the Attack Chain Worked

As confirmed in LiteLLM's official security update, the project's CI/CD pipeline ran Trivy without pinning to a specific version. When the compromised Trivy action executed inside LiteLLM's GitHub Actions runner, it exfiltrated the PYPI_PUBLISH token. TeamPCP used that stolen token to push malicious releases directly to PyPI.

Version 1.82.7 embedded the payload in proxy/proxy_server.py, firing on import. Version 1.82.8 was worse: it included a .pth file called litellm_init.pth that executed on every Python process startup, regardless of whether you ever imported LiteLLM. Python's site module processes all .pth files in site-packages during interpreter initialization, as documented in the GitHub issue.

The payload used double base64 encoding:

import os, subprocess, sys
subprocess.Popen([
    sys.executable, "-c",
    "import base64; exec(base64.b64decode('...'))"
])

Once running, it executed three stages. Stage 1 harvested credentials: SSH keys, AWS/GCP/Azure tokens, environment variables, .env files, Kubernetes configs, Docker configs, database credentials, shell history, browser cookies, and cryptocurrency wallets. Stage 2 deployed privileged Alpine pods into the kube-system namespace on every reachable Kubernetes node, grabbing cluster secrets and service account tokens. Stage 3 installed sysmon.py as a systemd service that polled checkmarx[.]zone/raw for additional payloads, giving the attacker persistent access even after discovery.

All stolen data was encrypted and POSTed to models.litellm[.]cloud, a lookalike domain controlled by TeamPCP.

The Blast Radius Was Bigger Than I Expected

The .pth execution model is what made this particularly nasty. On any machine where LiteLLM 1.82.8 was installed, the malware fired every time Python started. Not when you imported the package. Not when you used the proxy. Every single Python process.

That means a data scientist running Jupyter, a DevOps engineer running Ansible, a backend dev spinning up a Flask server: all compromised if the package sat anywhere in their Python environment. The malware just ran silently alongside whatever they were actually doing.

Here's the part that really got me: you didn't need to install it yourself. If any package in your dependency tree pulled LiteLLM in, the payload still executed. As reported in GitHub issue #24512, the researcher who found this discovered it because their Cursor IDE pulled LiteLLM through an MCP plugin without any manual installation.

I checked my own environment and found LiteLLM listed in the Required-by field for a framework I'd installed months ago. I had no idea it was there.

How I Checked If I Was Affected

Here's what I ran across my local machine, CI/CD runners, Docker images, and staging:

pip show litellm | grep Version
pip cache list litellm
find / -name "litellm_init.pth" 2>/dev/null

Then I scanned egress logs. Any traffic to models.litellm[.]cloud or checkmarx[.]zone means confirmed exfiltration:

# CloudWatch
fields @timestamp, @message
| filter @message like /models\.litellm\.cloud|checkmarx\.zone/

# Nginx
grep -E "models\.litellm\.cloud|checkmarx\.zone" /var/log/nginx/access.log

And checked for transitive installation:

pip show litellm  # Check "Required-by" field

If other packages list LiteLLM there, it entered your environment as a transitive dependency without your knowledge.

The Incident Response Steps I Followed

Step 1: Kill everything immediately. Stop all LiteLLM containers and scale Kubernetes deployments to zero:

docker ps | grep litellm | awk '{print $1}' | xargs docker kill
kubectl scale deployment litellm-proxy --replicas=0 -n your-namespace

Step 2: Rotate every credential on affected machines. The malware harvested everything it could reach. I treated the following as fully compromised: cloud provider tokens (AWS access keys, GCP service account keys, Azure AD tokens), all SSH keys in ~/.ssh/, database passwords and connection strings from .env files, every LLM provider API key (OpenAI, Anthropic, Gemini), Kubernetes service accounts and CI/CD tokens, and any crypto wallet files present on the machine. If you have crypto wallets on an affected host, move funds immediately.

Step 3: Hunt for persistence artifacts. The malware planted privileged pods in Kubernetes and installed a systemd backdoor. Check for both:

# Check for lateral movement
kubectl get pods -n kube-system | grep -i "node-setup"
find / -name "sysmon.py" 2>/dev/null

# Full removal
pip uninstall litellm -y && pip cache purge
rm -rf ~/.cache/uv
find $(python -c "import site; print(site.getsitepackages()[0])") \
    -name "litellm_init.pth" -delete
rm -rf ~/.config/sysmon/ ~/.config/systemd/user/sysmon.service
docker build --no-cache -t your-image:clean .

Do not downgrade to a previous version. Remove entirely and replace.

The Deeper Problem with Self-Hosted Python Proxies

I've been thinking about this since the cleanup, and honestly, the structural issue here goes beyond one compromised package.

LiteLLM's Python proxy pulls in hundreds of transitive dependencies: ML frameworks, data processing libraries, provider SDKs. Every one of those is a trust decision most teams make automatically with pip install --upgrade. When you add LiteLLM, you're not just trusting LiteLLM. You're trusting every package it depends on, every package those depend on, and every maintainer account tied to each one.

The .pth attack vector is especially concerning because most supply chain scanning tools focus on setup.py, __init__.py, and defined entry points. The .pth mechanism is a legitimate Python feature for path configuration that has been completely overlooked as an injection vector. I expect this technique to show up in future attacks. Traditional scanning would not have caught it.

There's also a response-time problem. The LiteLLM maintainers didn't rotate their CI/CD credentials for five days after the Trivy disclosure on March 19. If the maintainers couldn't react fast enough, downstream teams had no realistic chance. When you self-host, you inherit the blast radius.

What I Moved To (And Why)

After cleaning up, I needed to replace the routing layer. The options I evaluated fell into two buckets: self-hosted alternatives (which carry the same dependency tree risk) and managed gateways (which eliminate it).

I ended up switching to a managed gateway approach. Prism (by Future AGI) is one example of this pattern. Instead of installing a Python package to route requests, you point your OpenAI SDK at a managed endpoint. Your attack surface goes from an entire Python environment with hundreds of dependencies to an API key and a URL.

The migration was a two-line change:

Before (LiteLLM):

from litellm import completion
response = completion(model="gpt-5", messages=[{"role": "user", "content": "Hello"}])

After (managed gateway):

from openai import OpenAI
client = OpenAI(base_url="https://gateway.futureagi.com", api_key="sk-prism-your-key")
response = client.chat.completions.create(
    model="gpt-5", messages=[{"role": "user", "content": "Hello"}]
)

Same OpenAI SDK format, same model names, same response schema. TypeScript works identically:

import OpenAI from "openai";
const client = new OpenAI({
    baseURL: "https://gateway.futureagi.com",
    apiKey: "sk-prism-your-key"
});
const response = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: "Hello" }]
});

Provider keys sit in the gateway dashboard instead of .env files scattered across developer machines. You can read the full docs for setup details.

For Kubernetes deployments, the swap is just environment variables:

env:
  - name: LLM_BASE_URL
    value: "https://gateway.futureagi.com"  # was http://litellm-proxy:4000
  - name: LLM_API_KEY
    value: "sk-prism-your-key"

Then delete the LiteLLM pod, its service, Postgres, and Redis. That's infrastructure you no longer maintain or patch.

One feature I didn't expect to use heavily is semantic caching. It matches queries that mean the same thing but use different wording, so "What is your return policy?" and "How do I return an item?" hit the same cache entry. Cached responses come back with X-Prism-Cost: 0.

from prism import Prism, GatewayConfig, CacheConfig
client = Prism(
    api_key="sk-prism-your-key",
    base_url="https://gateway.futureagi.com",
    config=GatewayConfig(
        cache=CacheConfig(enabled=True, mode="semantic", ttl="5m", namespace="prod"),
    ),
)

The gateway also applies guardrails (PII detection, prompt injection prevention) at the routing layer before requests reach the provider. That's 18+ checks I previously didn't have at all.

What This Means Going Forward

The EU Cyber Resilience Act now holds organizations legally responsible for the security of open-source components in their products. SOC 2 Type II audits scrutinize dependency management. "We pull the latest from PyPI" won't pass a controls review anymore. If your product ran LiteLLM and customer credentials were exfiltrated, the liability is yours, not the maintainer's. For background on AI compliance and LLM security, Future AGI has an enterprise breakdown worth reading.

Dependency pinning alone doesn't fix this. Pinning prevents pulling a new malicious version but not a compromised maintainer overwriting an existing tag. Hash verification (pip install --hash=sha256:<exact_hash>) is the real control, though adoption is low because the tooling is painful.

Every team running LLM applications now faces a clear architectural choice: self-host and inherit the full supply chain risk, or use a managed gateway and shrink the trust boundary to an API endpoint. After March 24, the risk math changed.

I spent two days rotating credentials and auditing Kubernetes pods because of a package I didn't even know was in my dependency tree. I'd rather spend that time shipping features.

What Is Tool Chaining in LLMs? Why It Breaks and How to Think About Orchestration

Jay — Wed, 25 Mar 2026 13:01:13 +0000

Your agent chains three tool calls together. The first returns slightly malformed output. The second accepts it but misinterprets a field. By the third call, the entire chain has gone off the rails. No error was thrown. Your logs look clean. The user got confidently wrong answers.
If you've built anything with LLM agents beyond a demo, you've hit this. It's called the cascading failure problem, and research from Zhu et al. (2025) confirms it: error propagation from early mistakes cascading into later failures is the single biggest barrier to building dependable LLM agents.
I've spent a lot of time debugging these kinds of failures, and I want to break down why tool chaining is so fragile, what the actual failure modes look like, and what patterns hold up in production.

Tool Chaining, Quickly Defined

Tool chaining is when an LLM agent executes multiple tool calls in sequence, where each tool's output becomes input for the next. The agent gets a user query, calls an API, processes the result with a second tool, and builds a final response from the combined output.
A single tool call is simple. Chaining is where dependencies show up. The agent has to figure out execution order, track intermediate state, and handle partial failures while staying on task.
In multi-agent systems, this gets worse. One agent calls a tool, hands the result to a second agent, which runs its own tool chain before returning. The orchestration overhead stacks fast, and so do the failure points.
Here's a concrete example: a user asks an agent to pull earnings data, compare it against competitors, and generate a summary. The first call returns revenue in the wrong currency. The comparison runs fine but produces misleading figures. The summary confidently presents wrong data. Nothing errored out. That's the core danger when you chain tools without validation and observability.

Why Tool Chains Break in Production

Context Gets Lost Across Calls

LLMs work within a finite context window. Every tool call adds tokens: function parameters, response payloads, reasoning traces. In long chains, critical context from early steps gets pushed out of the window or buried under intermediate results.
This isn't theoretical. Research shows LLMs lose performance on information buried in the middle of long contexts, even with large windows. When your agent forgets a user constraint from step 1 by the time it hits step 5, the output might be structurally valid but factually wrong. The user asked for revenue in USD, but that detail got lost three calls ago.

What actually helps:

Pass structured state objects between calls, not raw text. Keeps payloads compact and parseable.
Summarize intermediate results before forwarding. Strip metadata the next tool doesn't need.
Use frameworks with explicit state management. LangGraph, for example, provides durable state across graph nodes so context stays inspectable and doesn't just float in the prompt.

Cascading Failures Compound Silently

This is the biggest production risk. When one tool returns bad or partial data, the error flows downstream and compounds at every step. Unlike traditional software where bad data throws exceptions, LLM tool chains fail silently because the agent treats garbage output as valid input and keeps going.
A 2025 study on OpenReview that analyzed failed agent trajectories found error propagation was the most common failure pattern. Memory and reflection errors were the most frequent sources of cascades. Once they start, they're extremely hard to reverse mid-chain.
In multi-agent setups, it's amplified further. The Gradient Institute found that transitive trust chains between agents mean a single wrong output propagates through the entire system without verification. OWASP ASI08 specifically flags cascading failures as a top security risk in agentic AI.

Context Window Saturation

Every tool call eats tokens. A chain of five calls can burn through 40-60% of your available context before the agent even starts generating its final response. Even with models offering massive token limits, the "lost in the middle" problem means the agent's attention degrades on information that isn't near the beginning or end of the context.

Picking a Framework for Multi-Tool Orchestration

The framework you choose shapes how much of this you have to handle yourself. Here's how the main options compare for production use in 2026:
LangGraph is my go-to for anything stateful or branching. It models tool chains as explicit state machines: every node is a tool call or decision point, edges define transitions. You can plug in retry logic, fallback paths, and human-in-the-loop checkpoints at specific stages. Its durable execution feature means if a chain breaks at step 4 of 7, you resume from step 4 instead of restarting. Deep tracing through LangSmith with state transition capture.
LangChain is still the fastest way to get started. Its LCEL pipe syntax makes linear tool chains quick to compose. But for production workloads with branching or parallel calls, most teams I've seen migrate to LangGraph for finer control.
AutoGen works well for multi-agent conversation patterns. It uses message-passing with built-in function call semantics. Observability is moderate and usually needs external tooling for production-grade traces.
CrewAI takes a role-based approach to multi-agent task execution. Tool assignment happens per role, which is intuitive but can mean longer deliberation before tool calls. Basic logging out of the box.

Tracing and Observability Are Not Optional

You can't fix what you can't see. Tool chain failures are often silent, so a chain that returns wrong answers without errors looks perfectly healthy in your logs unless you have distributed tracing on every step.
What to capture in every tool chain execution:

Input and output of each tool call. Exact parameters and full responses so you can replay failures.
Latency per step. A slow tool can cascade into downstream timeouts.
Token consumption. Track context window usage to spot saturation before it degrades output quality.
Agent reasoning between calls. Chain-of-thought capture helps you find logic errors that data alone won't reveal.

Tools like LangSmith, Langfuse, and Future AGI provide native tracing for LangGraph and LangChain workflows. Future AGI's traceAI SDK integrates with OpenTelemetry and includes built-in evaluation metrics for completeness, groundedness, and function calling accuracy.

Evaluating Tool Chains Beyond "Did It Work?"

Tracing tells you what happened. Evaluation tells you whether it was correct. For tool chains, you need to cover multiple dimensions:

Tool selection accuracy: Did the agent pick the right tool at each step?
Parameter correctness: Were the arguments valid and complete?
Chain completion rate: What percentage of multi-step chains finish without errors, fallbacks, or manual correction?
Output faithfulness: Does the final response reflect the tool data accurately without hallucinations?
Error recovery rate: When a tool returns an error, how often does the agent actually recover?

Running these at scale means automation. Platforms like Future AGI attach evaluation metrics directly to traces, scoring every execution and creating a continuous feedback loop. The point is to make evaluation a part of the pipeline, not something you run manually after incidents.

Patterns That Hold Up in Production

These are the patterns I've seen consistently improve reliability across real deployments:

Validate at every boundary. Put input and output validation between every tool call using Pydantic or JSON Schema. Don't rely on the LLM to notice malformed data. Explicit validation catches errors at the source before they propagate.
Plan first, execute second. Research from Scale AI shows that having the LLM formulate a structured plan (as JSON or code) before executing it through a deterministic executor reduces tool chaining errors significantly. Separating reasoning from execution is a big win.
Implement circuit breakers. If a tool fails or returns unexpected results more than N times, break the circuit and return a graceful failure. Don't let one broken tool take down the entire workflow.
Keep chains short. Longer chains mean more failure surface and more context consumption. If you need more than 5-6 sequential calls, restructure into sub-chains or parallel branches.
Test with adversarial inputs. Your happy-path tests will pass. Production traffic won't be happy-path. Test with empty tool responses, oversized payloads, unexpected types, and ambiguous queries.
Trace everything from day one. Instrument your tool chains with distributed tracing on the first deployment. When something breaks in production, traces are the difference between hours of debugging and a 10-minute fix.

FAQ

Why don't LLM tool chain errors throw exceptions like normal software?

Because the LLM treats tool outputs as text, not typed data. If a tool returns malformed JSON or wrong values, the model doesn't crash. It interprets whatever it got and keeps going. That's why schema validation between every step matters so much. The LLM won't catch bad data for you.

Is a longer context window the fix for context loss in tool chains?

Not really. Even with million-token windows, research shows LLMs lose attention on information in the middle of the context. A bigger window gives you more room, but it doesn't solve the core problem. Structured state management and summarization between steps are more reliable than just hoping the model remembers everything.

When should I use LangGraph over LangChain for tool chaining?

If your chain is linear and simple, LangChain's LCEL syntax is faster to set up. Once you need conditional branching, retries at specific steps, or durable execution (resume from failure point), LangGraph gives you that control. Most teams I've talked to start with LangChain and move to LangGraph when their chains get complex enough to need explicit state machines.

How do I know if my tool chain is consuming too much of the context window?

Trace your token usage per step. If your chain of tool calls is eating 40-60% of available tokens before the agent generates its final response, you're in the danger zone. Summarize intermediate outputs aggressively and strip metadata the downstream tools don't need.

What's the simplest thing I can do today to make my tool chains more reliable?

Add Pydantic or JSON Schema validation on the output of every single tool call. It takes maybe 30 minutes to set up and catches the majority of silent data corruption issues before they cascade. It's the highest-leverage change you can make.