Every AI agent I have shipped has had the same lifecycle: it works well in testing, degrades quietly in production, and by the time someone notices, there are three weeks of bad outputs to explain.
The fix most teams reach for is more evals and more monitoring. That helps, but it's still reactive. You're still waiting for something to break before you act.
I wanted to build something different: a pipeline that doesn't just detect failures but closes the loop automatically.
What "Self-Improving" Actually Means Here
I'm not talking about reinforcement learning or any kind of model fine-tuning.
The loop I built works at the prompt and pipeline level. It detects when outputs fail evaluation thresholds, traces which component caused the failure, generates a fix, tests that fix against the same inputs, and deploys it if the metrics improve.
The full cycle runs without manual intervention. A human reviews the summary, not the individual failures.
That structure means the evaluator is not just a reporting layer. It is the trigger for everything downstream.
The Pipeline Structure
The pipeline has five components that run in sequence.
Input feeds into the Agent. The agent's outputs go to an Evaluator. Failures from the evaluator go into a Root Cause Analyzer. The analyzer feeds a Prompt Optimizer. The optimized prompt feeds back into the agent.
That loop is the core of the system. Everything else is instrumentation around it.
The evaluator is where the loop either fires or stays dormant, so getting the evaluation criteria right is the most important upfront decision in the entire build.
Step 1: Evaluation as the Trigger
The evaluator runs after every agent response.
It checks outputs against a set of criteria defined upfront: task completion, factual accuracy, response format, and any domain-specific constraints for the use case. Each criterion gets a pass/fail score.
When the aggregate score drops below a threshold, the evaluator doesn't just log the failure. It packages the failing input, the bad output, the criterion that failed, and the expected output, then sends that bundle downstream.
That bundle is what makes automated root cause analysis possible. Without structured failure data, you can't automate anything beyond alerting.
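A minimal sketch of that evaluator (the threshold value and run_criterion_check are placeholders here; in practice each criterion is either a rule-based check or an LLM judge):

# The criteria described above; each is scored pass/fail (1 or 0) per response
CRITERIA = ["task_completion", "factual_accuracy", "response_format"]
THRESHOLD = 0.75  # placeholder aggregate threshold

def evaluate(output: str, user_input: str) -> dict:
    # run_criterion_check stands in for a rule-based check or an LLM judge per criterion
    results = {name: run_criterion_check(name, output, user_input) for name in CRITERIA}
    score = sum(results.values()) / len(results)
    return {
        "score": score,
        "threshold": THRESHOLD,
        "failed_criteria": [name for name, passed in results.items() if not passed],
    }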
Step 2: Root Cause Analysis
Most failures in LLM pipelines come from one of three places: the prompt is ambiguous, the context window is missing relevant information, or the model is being asked to do something outside its reliable capability.
The root cause analyzer looks at the failure bundle and tries to classify which of those three caused the failure.
It does this by running the same input through a separate diagnostic prompt that asks the model to explain why the original output failed. That explanation gets tagged with a failure category: prompt issue, context issue, or model capability issue.
Prompt issues and context issues are fixable automatically. Model capability issues get flagged for human review.
This separation matters. Trying to auto-fix a model capability issue at the prompt level wastes cycles and produces false confidence.
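Here is a sketch of that diagnostic step. The exact wording of the diagnostic prompt is the part I spent the most time calibrating (more on that below), so treat this version as a placeholder:

from openai import OpenAI

client = OpenAI()

FAILURE_CATEGORIES = ("prompt", "context", "model_capability")

def analyze_root_cause(failure_bundle: dict) -> dict:
    # Ask the model to explain the failure, then force a single category label
    diagnostic_prompt = f"""
    Input: {failure_bundle['input']}
    Failing output: {failure_bundle['output']}
    Evaluation result: {failure_bundle['eval_result']}
    Current prompt: {failure_bundle['current_prompt']}

    Explain why the output failed these checks, then classify the root cause
    as exactly one of: prompt, context, model_capability.
    Reply in the format:
    CATEGORY: <category>
    EXPLANATION: <explanation>
    """
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": diagnostic_prompt}]
    )
    text = response.choices[0].message.content
    # Default to model_capability (human review) if the reply doesn't parse cleanly
    category = "model_capability"
    explanation = text
    for line in text.splitlines():
        if line.strip().upper().startswith("CATEGORY:"):
            candidate = line.split(":", 1)[1].strip().lower()
            if candidate in FAILURE_CATEGORIES:
                category = candidate
        elif line.strip().upper().startswith("EXPLANATION:"):
            explanation = line.split(":", 1)[1].strip()
    return {"type": category, "explanation": explanation}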
Step 3: Automated Prompt Optimization
For prompt-category failures, the optimizer generates candidate fixes.
It takes the original prompt, the failure explanation, and the expected output, then generates three to five alternative prompt variants designed to address the specific failure mode.
def generate_prompt_variants(
    original_prompt: str,
    failure_explanation: str,
    expected_output: str,
    n_variants: int = 5
) -> list[str]:
    # Give the optimizer the failing prompt, the diagnosis, and the target output,
    # and ask for candidate rewrites in a parseable format
    optimizer_prompt = f"""
    Original prompt: {original_prompt}
    Why it failed: {failure_explanation}
    Expected output: {expected_output}
    Generate {n_variants} improved prompt variants that fix the failure.
    Each variant should be on a new line, prefixed with VARIANT:
    """
    # client is an OpenAI client configured once at module setup
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": optimizer_prompt}]
    )
    variants = []
    for line in response.choices[0].message.content.split('\n'):
        if line.strip().startswith('VARIANT:'):
            variants.append(line.strip().replace('VARIANT:', '', 1).strip())
    return variants
Each variant runs against the same failing input. The variant that scores highest on the evaluator replaces the original prompt in the pipeline.
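Selection over the candidates is simple. As a sketch (run_agent and the evaluator callable follow the same signatures as in the validation snippet further down):

from typing import Callable

def select_best_variant(
    variants: list[str],
    failing_input: str,
    expected_output: str,
    evaluator: Callable[[str, str], float]
) -> str:
    # Run every candidate prompt against the same failing input and keep the top scorer
    scored = [
        (evaluator(run_agent(variant, failing_input), expected_output), variant)
        for variant in variants
    ]
    return max(scored, key=lambda pair: pair[0])[1]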
Before any fix goes live, it runs through a validation gate, which is the step that prevents a targeted improvement from quietly breaking something adjacent.
Step 4: Context Repair for Context Failures
Context failures need a different fix than prompt failures.
When the analyzer flags a context issue, it means the agent produced a bad output because it lacked relevant information, not because the prompt was wrong. The fix is to identify what information was missing and add a retrieval step that fetches it.
def repair_context(
    agent_input: str,
    failure_explanation: str,
    available_context_sources: list[str]
) -> str:
    # Ask the model what information was missing and for a retrieval query that would fetch it
    context_repair_prompt = f"""
    Agent input: {agent_input}
    Failure explanation: {failure_explanation}
    Available context sources: {available_context_sources}
    Identify what information was missing and write a retrieval query
    that would fetch it from the most relevant source.
    """
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": context_repair_prompt}]
    )
    return response.choices[0].message.content
The retrieval query runs against the specified sources. The fetched context gets injected into the agent's input on the next run.
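A sketch of that injection step, assuming a retrieve() helper over whatever context sources you have wired up:

def run_with_repaired_context(agent, current_prompt: str, user_input: str,
                              retrieval_query: str, sources: list[str]) -> str:
    # retrieve() is an assumed helper that executes the query against the given sources
    fetched = retrieve(retrieval_query, sources)
    # Inject the fetched context ahead of the original input on the next run
    augmented_input = f"Relevant context:\n{fetched}\n\nUser request:\n{user_input}"
    return agent.run(augmented_input, current_prompt)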
Both prompt fixes and context fixes feed into the same validation step, so the deployment decision is always based on measured improvement, not just absence of obvious regression.
Step 5: Validation Before Deployment
No fix goes into production without validation.
The candidate fix (whether a new prompt or augmented context) runs against a held-out set of cases that includes the original failure and similar inputs from the evaluation history. If the fix improves the target metric without regressing others, it gets promoted.
from typing import Callable

def validate_fix(
    original_prompt: str,
    candidate_prompt: str,
    test_cases: list[dict],
    evaluator: Callable[[str, str], float]
) -> dict:
    # Run both prompts over the held-out cases and compare mean evaluator scores
    original_scores = []
    candidate_scores = []
    for case in test_cases:
        original_output = run_agent(original_prompt, case['input'])
        candidate_output = run_agent(candidate_prompt, case['input'])
        original_scores.append(evaluator(original_output, case['expected']))
        candidate_scores.append(evaluator(candidate_output, case['expected']))
    return {
        "original_mean": sum(original_scores) / len(original_scores),
        "candidate_mean": sum(candidate_scores) / len(candidate_scores),
        "improvement": (
            sum(candidate_scores) - sum(original_scores)
        ) / len(test_cases),
        "should_deploy": (
            sum(candidate_scores) > sum(original_scores)
        )
    }
If should_deploy is False, the failure gets escalated to the human review queue with the full context: original prompt, candidate prompt, test results, and the failure bundle that triggered the cycle.
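The escalation path itself is nothing clever. A sketch, assuming some review_queue interface your team already reads (a DB table, a ticketing API, a Slack channel):

def escalate_to_human(failure_bundle: dict, validation: dict | None = None) -> None:
    # One payload with everything a reviewer needs to reproduce the failure and judge the fix
    review_queue.put({
        "failure_bundle": failure_bundle,   # input, output, eval result, current prompt
        "validation": validation,           # original vs. candidate scores, if a fix was attempted
        "status": "needs_review",
    })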
With the five components defined, the full loop wires together in a single class that runs on every inference.
Putting It Together
The full loop in code looks like this:
class SelfImprovingPipeline:
    def __init__(self, agent, evaluator, optimizer, validator):
        self.agent = agent
        self.evaluator = evaluator
        self.optimizer = optimizer
        self.validator = validator
        self.current_prompt = agent.system_prompt

    def run(self, user_input: str) -> str:
        output = self.agent.run(user_input, self.current_prompt)
        eval_result = self.evaluator.evaluate(output, user_input)
        if eval_result['score'] < eval_result['threshold']:
            failure_bundle = {
                'input': user_input,
                'output': output,
                'eval_result': eval_result,
                'current_prompt': self.current_prompt
            }
            self.improve(failure_bundle)
        return output

    def improve(self, failure_bundle: dict) -> None:
        root_cause = analyze_root_cause(failure_bundle)
        if root_cause['type'] == 'prompt':
            variants = self.optimizer.generate_variants(
                self.current_prompt,
                root_cause['explanation']
            )
            # Score each variant on the failing input and keep the best one (Step 3)
            scored = [
                (self.evaluator.evaluate(
                    self.agent.run(failure_bundle['input'], variant),
                    failure_bundle['input']
                )['score'], variant)
                for variant in variants
            ]
            best_variant = max(scored, key=lambda pair: pair[0])[1]
            validation = self.validator.validate(
                self.current_prompt,
                best_variant,
                get_test_cases()
            )
            if validation['should_deploy']:
                self.current_prompt = best_variant
                log_improvement(validation)
            else:
                escalate_to_human(failure_bundle, validation)
        # Context failures go through the repair step (Step 4); model capability
        # failures go straight to the human review queue
The pipeline runs every inference through the evaluator. Most requests pass and nothing changes. When something fails, the loop fires, tries a fix, validates it, and either deploys or escalates. The class above runs that cycle inline for clarity; in production, the full cycle runs in the background while the pipeline continues serving requests.
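Getting the background behavior is mostly a matter of moving improve() off the request thread. A minimal sketch with a worker thread (a real deployment also wants a queue and locking around current_prompt updates):

import threading

class AsyncSelfImprovingPipeline(SelfImprovingPipeline):
    def run(self, user_input: str) -> str:
        output = self.agent.run(user_input, self.current_prompt)
        eval_result = self.evaluator.evaluate(output, user_input)
        if eval_result['score'] < eval_result['threshold']:
            failure_bundle = {
                'input': user_input,
                'output': output,
                'eval_result': eval_result,
                'current_prompt': self.current_prompt
            }
            # Root cause -> optimize -> validate -> deploy, off the request path
            threading.Thread(target=self.improve, args=(failure_bundle,), daemon=True).start()
        return output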
Running this in production surfaces two things quickly: most failures are interaction effects, not isolated prompt bugs, and the root cause classifier is the component that needs the most tuning time.
What I Learned Running This in Production
The first thing that surprised me was how rarely failures are caused by a single bad prompt instruction. Most failures are interaction effects: an instruction that works well in isolation produces unexpected behavior when combined with a specific type of user input.
The root cause analyzer is the part that needs the most tuning. Getting the failure categorization right (prompt vs. context vs. model capability) determines whether the downstream fix is useful or wasted compute. I spent more time calibrating the diagnostic prompt than any other part of the system.
The validation gate is also more important than it sounds. Early versions of this pipeline would occasionally deploy a fix that improved the target metric while regressing something adjacent. The held-out test set needs to cover failure modes you've already seen and general capability cases, not just the current failure.
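In practice that means get_test_cases(), the helper called in the pipeline class above, pulls from both places. A sketch, with load_failure_history and load_static_capability_suite standing in for however you store those:

def get_test_cases(history_limit: int = 20) -> list[dict]:
    # Mix previously seen failures with a fixed capability suite so a targeted fix
    # can't quietly regress something adjacent. Both helpers are assumed to return
    # {'input': ..., 'expected': ...} dicts, matching what validate_fix expects.
    past_failures = load_failure_history(limit=history_limit)
    capability_cases = load_static_capability_suite()
    return past_failures + capability_cases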
The full observability setup that makes this pipeline debuggable is documented here. Without trace-level visibility into each cycle, the loop is a black box and you won't trust it.