DEV Community

Yaohua Chen

Self-Evolving Agents: A Developer's Guide

Static agents hit performance ceilings. This guide shows you how to build agents
that improve themselves — through prompt optimization, dynamic skill libraries,
code and harness evolution, RAG, and LLM fine-tuning — and how a unified LLM
judge decides which track to take. Along the way, we'll survey the frameworks
and methodologies — from DSPy to autoresearch to TextGrad — that have turned
these ideas into working code.


1. Introduction

Most production agents are frozen at deployment. Their system prompt is fixed, their tools are hardcoded, and when they fail, a human manually intervenes. This works until it doesn't — and it usually stops working the moment the task distribution shifts or edge cases accumulate.

Self-evolving agents close this loop automatically:

  • They evaluate their own outputs
  • They diagnose failure modes
  • They improve the right layer — prompt, skill, code, knowledge, or model weights

This is not a theoretical concept — in 2026, the field often refers to these patterns as recursive optimization or self-distillation. Several open-source frameworks have already shipped working implementations: OpenAI's Self-Evolving Agents Cookbook automates prompt improvement through graders and metaprompt agents. Karpathy's autoresearch lets an agent rewrite its own training code overnight. DSPy compiles optimal prompts via Bayesian search and can distill them into smaller model weights. TextGrad treats the entire agent as a differentiable program, using textual gradients to patch failure modes. And frameworks like AgentScope close the loop all the way to automated fine-tuning from production data.

This guide covers five escalation levels in order of cost and commitment:

Level 1 — Prompt tuning              (minutes, free)
     │  still failing after 3 rounds?
     ▼
Level 2 — Add/improve skills         (hours, cheap)
     │  still failing on reasoning/architecture?
     ▼
Level 3 — Code & Harness evolution   (hours, cheap — runs overnight)
     │  still failing on knowledge?
     ▼
Level 4 — RAG                        (hours, medium cost)
     │  still failing on reasoning style/pattern?
     ▼
Level 5 — LLM Fine-tuning            (days, expensive)

Each section builds toward a master LLM judge pipeline in Section 9 that automatically decides which track to trigger — and calls the right code to execute it.


2. The Landscape: Frameworks for Self-Evolution

Before building from scratch, it is worth understanding the frameworks that have already solved pieces of this problem. They share the same core loop — run, evaluate, improve, repeat — but differ in what they evolve (prompts, code, skills, or model weights), how they score, and what safety model they use.

2a. OpenAI Self-Evolving Agents Cookbook

The most production-oriented of the four. It addresses the scenario every developer has experienced: an LLM-powered agent that works reasonably well but keeps failing on certain inputs, leaving you stuck in a never-ending cycle of prompt tweaking.

What evolves: The system prompt (the instructions given to the LLM). A VersionedPrompt class tracks every revision with timestamps and eval scores, so rollback is always one line away.

How it scores: Multiple graders run in parallel — Python functions for deterministic checks (keyword presence, length deviation), cosine similarity for semantic fidelity, and an LLM-as-judge for nuanced quality. A metaprompt agent reads grader feedback and rewrites the system prompt automatically. The loop continues until scores pass or a retry limit is hit.

Going further: The cookbook also supports comparing model versions (e.g., GPT-5 vs GPT-5-mini) to find the best model-prompt combination, and demonstrates GEPA (Genetic-Pareto) optimization as an advanced alternative to simple metaprompt rewriting.

2b. Karpathy's autoresearch

Instead of improving prompts, the agent improves actual source code — specifically, code that trains a small language model.

What evolves: A single Python file (train.py) containing the full GPT model, optimizer, and training loop. Everything is on the table: architecture, hyperparameters, optimizer, batch size, attention pattern.

How it scores: A single, hard metric: validation bits per byte (val_bpb). Lower is better. Each training run is limited to exactly 5 minutes of wall-clock time, making experiments directly comparable regardless of what the agent changes.

The key insight: You are not writing training code — you are writing program.md, a Markdown file that instructs the agent. The agent reads your instructions, modifies train.py, runs training, checks if the score improved, and keeps or discards the change. You can expect roughly 12 experiments per hour, or 100 overnight.
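The keep-or-discard loop is simple enough to sketch. Below, `run_training` and `propose_edit` are toy stubs standing in for the real 5-minute training run and the meta-agent's edit, not autoresearch's actual implementation; the hill-climbing shape is the point:

```python
import random

def run_training(source: str) -> float:
    """Stub for a 5-minute capped training run; returns val_bpb (lower is better)."""
    random.seed(len(source))  # deterministic toy score tied to the code
    return random.uniform(0.8, 1.2)

def propose_edit(source: str) -> str:
    """Stub for the meta-agent rewriting train.py as instructed by program.md."""
    return source + "\n# tweak a hyperparameter"

def hill_climb(source: str, rounds: int = 5) -> tuple[str, float]:
    best_source, best_score = source, run_training(source)
    for _ in range(rounds):
        candidate = propose_edit(best_source)
        score = run_training(candidate)
        if score < best_score:  # improvement: keep the edit
            best_source, best_score = candidate, score
        # otherwise: discard the change (the equivalent of a git revert)
    return best_source, best_score
```

Because every experiment gets the same fixed time budget, the single scalar score stays comparable no matter what the agent changed.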

2c. autoagent (kevinrgu)

"Like autoresearch but for agent engineering." Instead of optimizing model training code, it optimizes the agent itself — system prompt, tool definitions, agent registry, and routing/orchestration logic.

What evolves: A single-file agent harness (agent.py) containing config, tool definitions, agent registry, and orchestration. An adapter boundary is explicitly marked as fixed; everything else is the edit surface for the meta-agent.

How it scores: Total score produced by benchmark task test suites in Harbor format. Tasks run in Docker containers for isolation. The meta-agent hill-climbs on this score.

Same meta-programming model: Like autoresearch, the human steers the loop through program.md while the meta-agent edits agent.py. The agent runs benchmarks, diagnoses failures, modifies the harness, and iterates.

2d. EvoMap Evolver

If the OpenAI cookbook is about improving prompts and autoresearch is about improving code, Evolver is about improving agent behavior through a formal, protocol-driven process — version control for agent evolution.

What evolves: Structured behavior assets. Genes are reusable improvement patterns (like "add input validation before edits"). Capsules bundle related Genes together for larger changes. Events log every evolution, creating a complete audit trail.

How it scores: Signal-based — scans agent logs for error patterns and uses those signals to select which Gene to apply.

Governance model: Evolver supports multiple operational modes: review mode (human-in-the-loop), continuous loop (autonomous), and strategy presets that steer priorities — innovate (maximize new features), harden (focus on stability), or repair-only (emergency fix mode).

2e. The Broader Ecosystem

The four frameworks above are the ones this guide draws its architecture patterns from, but the self-evolving agent space is broader. Several other systems take fundamentally different optimization approaches worth knowing about.

DSPy (Declarative Self-improving Python). The industry standard for self-improving prompts. Instead of writing prompt strings, you define a Signature (input/output spec) and a Metric (your judgment function). DSPy's MIPRO optimizer uses an LLM to triage failures, propose 10-20 prompt variants, and "compile" the best one via Bayesian search. DSPy can also fine-tune smaller models (e.g., Llama 3) to mimic the reasoning of a larger model by distilling best-performing prompt traces into weights — a technique called self-distillation.

TextGrad (Textual Backpropagation). Published in Nature (2025), TextGrad treats an LLM agent like a neural network but replaces numerical gradients with textual gradients. You define a TextLoss — for example: "The response should be technically accurate and concise; provide feedback if it is too wordy." TextGrad passes this loss back through the agent's execution trace and mutates the system prompt or solution code to patch the specific failure mode the judge discovered. This is particularly effective for hard optimization problems (math, code generation) where failures are diagnosable from the trace.

Memento-Skills. A framework focused on evolving an agent's skill library rather than a single prompt. When an agent encounters a task and fails, an orchestrator evaluates why, then literally rewrites the Markdown and code files for the failing skill. Over time, the agent accumulates a library of refined skills — like learning new moves in a game by trial and error, refining each move's code/instructions after every loss.

AgentScope + Trinity-RFT. Designed for enterprise-scale self-evolution. AgentScope captures production logs via "Inference Tables," and Trinity-RFT uses an LLM judge to label production data as "good" or "bad." The system then automatically kicks off a fine-tuning job using reinforcement learning from feedback (RLHF/PPO/SFT) to update the underlying model weights — closing the loop from production failures to weight updates without manual data curation.

Side-by-Side Comparison

Frameworks covered in this guide:

| Dimension | OpenAI Cookbook | autoresearch | autoagent | Evolver |
|---|---|---|---|---|
| What evolves | System prompt | Source code (train.py) | Agent harness (agent.py) | Behavior assets (Genes/Capsules) |
| Evaluation | Multi-grader (Python + similarity + LLM judge) | Single metric (val_bpb) | Benchmark task suites (Harbor) | Log signal scanning |
| Human role | Define graders and thresholds | Write/iterate on program.md | Write/iterate on program.md | Choose mode and strategy preset |
| Safety model | Versioned prompts with rollback | Git keep-or-revert; fixed time budget | Docker isolation; Harbor sandboxing | Command whitelist; scoped execution; audit trail |
| Best for | Production prompt improvement | Single-file, single-metric optimization | Agent harness optimization | Regulated environments needing audit trails |

Additional frameworks worth evaluating:

| System | What it evolves | Optimization method | Best for |
|---|---|---|---|
| DSPy | Prompts and weights | Bayesian search / compilation (MIPRO) | RAG pipelines and complex multi-step workflows |
| TextGrad | Prompts and code | Textual backpropagation | Hard optimization problems (math, code generation) |
| Memento-Skills | Skill artifacts (Markdown + code) | Reflection and mutation | Long-horizon autonomous agents |
| AgentScope | Model weights | Online fine-tuning (PPO/SFT via Trinity-RFT) | Production enterprise loops with RLHF |

3. Foundations — The Evolution Loop

Every self-evolving agent shares the same feedback cycle:

Agent runs task
      │
      ▼
Evaluator scores output
      │
      ▼
Failure classifier diagnoses root cause
      │
      ▼
Improvement dispatcher triggers the right track
      │
      ▼
Updated agent reruns

Three components make this possible:

  • Memory — a versioned log of runs, prompts, and scores
  • Evaluation signal — a judge that tells you how well the agent did
  • Improvement dispatcher — the logic that routes failures to prompt, skill, code, RAG, or fine-tune
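Memory can start as an append-only run log that the dispatcher later mines for failures. A minimal sketch (the class and field names are my own, not from any particular framework):

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class RunRecord:
    task: str
    prompt_version: int
    output: str
    score: float
    diagnosis: str = ""
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

class RunLog:
    """Append-only memory of every agent run, dumpable as JSONL."""
    def __init__(self):
        self.records: list[RunRecord] = []

    def append(self, record: RunRecord) -> None:
        self.records.append(record)

    def failures(self, threshold: float = 0.7) -> list[RunRecord]:
        """Everything the improvement dispatcher needs to triage."""
        return [r for r in self.records if r.score < threshold]

    def dump(self) -> str:
        return "\n".join(json.dumps(asdict(r)) for r in self.records)
```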

The rest of this guide builds each component in code. All code snippets use Anthropic's Claude (via the Python SDK), but the patterns are model-agnostic — swap in any LLM provider and the architecture stays the same.

Cost optimization tip: The code uses claude-opus-4-6-20260205 throughout for simplicity, but in production you should use different model tiers for different roles. Sonnet 4.6 delivers ~98.5% of Opus performance on routine agent runs (79.6% vs 80.8% on SWE-bench) at 1/5 the cost and 2x the speed. Opus 4.6 pulls ahead decisively on deep reasoning (91.3% vs 74.1% on GPQA Diamond). The practical split: use Sonnet for the agent runner, evaluator, and prompt rewriter (Sections 4a–4c), and reserve Opus for the judge and track recommender (Section 9, Judges 3–4) where multi-step reasoning about failure signals matters most.


4. Track 1 — Prompt & Skill Evolution

This is the fastest, cheapest, and most reversible improvement path. Always start here.

4a. System Prompt Optimization

The core loop: run → evaluate → rewrite prompt if score is low.

import json
from anthropic import Anthropic

client = Anthropic()

# --- Versioned prompt store ---
prompt_versions = []

def save_prompt(prompt: str, score: float):
    prompt_versions.append({"prompt": prompt, "score": score})
    prompt_versions.sort(key=lambda x: x["score"], reverse=True)

def best_prompt() -> str:
    return prompt_versions[0]["prompt"] if prompt_versions else INITIAL_PROMPT

# --- Agent runner ---
INITIAL_PROMPT = "You are a helpful assistant that answers math word problems."

def run_agent(system_prompt: str, user_task: str) -> str:
    response = client.messages.create(
        model="claude-opus-4-6-20260205",
        max_tokens=1024,
        system=system_prompt,
        messages=[{"role": "user", "content": user_task}]
    )
    return response.content[0].text

# --- LLM-as-judge evaluator ---
def evaluate_response(task: str, response: str, expected: str) -> float:
    judge_prompt = f"""
    Task: {task}
    Expected answer: {expected}
    Agent response: {response}

    Score the response from 0.0 to 1.0 based on correctness and clarity.
    Reply with JSON only: {{"score": 0.0, "reason": "..."}}
    """
    result = client.messages.create(
        model="claude-opus-4-6-20260205",
        max_tokens=256,
        messages=[{"role": "user", "content": judge_prompt}]
    )
    return json.loads(result.content[0].text)["score"]

# --- Prompt rewriter ---
def rewrite_prompt(current_prompt: str, task: str, failed_response: str, reason: str) -> str:
    rewrite_request = f"""
    The current system prompt failed on this task.

    System prompt: {current_prompt}
    Task: {task}
    Bad response: {failed_response}
    Failure reason: {reason}

    Rewrite the system prompt to handle this better.
    Reply with the new prompt text only.
    """
    result = client.messages.create(
        model="claude-opus-4-6-20260205",
        max_tokens=512,
        messages=[{"role": "user", "content": rewrite_request}]
    )
    return result.content[0].text

# --- Evolution loop ---
def evolution_loop(tasks: list[dict], threshold=0.7, max_rounds=3):
    current_prompt = INITIAL_PROMPT
    save_prompt(current_prompt, score=0.0)

    for round in range(max_rounds):
        print(f"\n=== Round {round + 1} | Prompt: {current_prompt[:60]}... ===")
        round_scores = []

        for t in tasks:
            response = run_agent(current_prompt, t["task"])
            score = evaluate_response(t["task"], response, t["expected"])
            round_scores.append(score)
            print(f"  Task: {t['task'][:50]} | Score: {score:.2f}")

            if score < threshold:
                current_prompt = rewrite_prompt(
                    current_prompt, t["task"], response, "Low score"
                )

        avg_score = sum(round_scores) / len(round_scores)
        save_prompt(current_prompt, avg_score)
        print(f"  Avg score: {avg_score:.2f}")

        if avg_score >= threshold:
            print("✅ Prompt converged.")
            break

    return best_prompt()

# Example usage
tasks = [
    {"task": "If a train travels 60mph for 2.5 hours, how far does it go?", "expected": "150 miles"},
    {"task": "A store has 240 apples. 1/3 are sold. How many remain?",       "expected": "160 apples"},
]

final_prompt = evolution_loop(tasks)
print(f"\nFinal best prompt:\n{final_prompt}")

The metaprompt rewriting approach above is straightforward but has a limitation: it uses a single static meta-prompt that can overfit to immediate grader feedback.

Alternatives to consider:

  • GEPA (Section 4d) — population-based search with train/validation splits for more robust prompt generalization.
  • DSPy (Section 2e) — instead of writing prompt strings at all, define a Signature (input/output spec) and a Metric, and let DSPy's MIPRO optimizer compile the best prompt via Bayesian search. This is the most structured approach to prompt optimization and works particularly well for multi-step pipelines (e.g., RAG chains) where multiple prompts need to be co-optimized.
  • TextGrad (Section 2e) — treats the agent as a differentiable program and uses textual gradients (natural-language feedback on the execution trace) to mutate the prompt or code. Best for hard optimization problems where failures are diagnosable from the trace (math reasoning, code generation).

4b. Dynamic Skill Library

Agents that write, register, and retrieve tools on demand — and prune the ones that stop working. The Memento-Skills framework (Section 2e) takes this pattern further: when an agent fails a task, an orchestrator evaluates why and literally rewrites the Markdown and code files for the failing skill, accumulating a refined skill library over time. The implementation below captures the same core idea.

import json
from anthropic import Anthropic

client = Anthropic()

# --- Skill registry ---
class SkillRegistry:
    def __init__(self):
        self.skills: dict[str, dict] = {}  # name -> {code, description, stats}

    def register(self, name: str, description: str, code: str):
        self.skills[name] = {
            "description": description,
            "code": code,
            "usage_count": 0,
            "success_rate": 1.0
        }
        print(f"✅ Skill registered: {name}")

    def retrieve(self, task: str, top_k=2) -> list[dict]:
        """Keyword overlap retrieval — swap for vector search in prod."""
        scored = []
        for name, skill in self.skills.items():
            overlap = len(
                set(task.lower().split()) & set(skill["description"].lower().split())
            )
            scored.append((overlap, name, skill))
        scored.sort(reverse=True)
        return [{"name": n, **s} for _, n, s in scored[:top_k]]

    def update_stats(self, name: str, success: bool):
        if name in self.skills:
            skill = self.skills[name]
            skill["usage_count"] += 1
            skill["success_rate"] = (
                skill["success_rate"] * (skill["usage_count"] - 1) + int(success)
            ) / skill["usage_count"]

    def prune(self, min_success_rate=0.4, min_uses=3):
        """Remove underperforming skills."""
        to_remove = [
            name for name, s in self.skills.items()
            if s["usage_count"] >= min_uses and s["success_rate"] < min_success_rate
        ]
        for name in to_remove:
            del self.skills[name]
            print(f"🗑️  Pruned skill: {name}")

registry = SkillRegistry()

# --- Seed with initial skills ---
registry.register(
    name="calculate_percentage",
    description="calculate percentage proportion ratio",
    code="def calculate_percentage(part, whole): return round((part / whole) * 100, 2)"
)
registry.register(
    name="days_between_dates",
    description="date difference calendar days between two dates",
    code="""
from datetime import datetime
def days_between_dates(d1: str, d2: str) -> int:
    fmt = "%Y-%m-%d"
    return abs((datetime.strptime(d2, fmt) - datetime.strptime(d1, fmt)).days)
"""
)

# --- Skill generator: agent writes new skills on demand ---
def generate_skill(task_description: str) -> dict:
    prompt = f"""
    A user needs help with: "{task_description}"
    No existing skill covers this. Write a new Python skill.

    Reply with JSON only:
    {{
        "name": "snake_case_name",
        "description": "keywords describing when to use this skill",
        "code": "def skill_name(...):\\n    ..."
    }}
    """
    result = client.messages.create(
        model="claude-opus-4-6-20260205",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    )
    raw = result.content[0].text.strip().strip("```json").strip("```")
    return json.loads(raw)

# --- Agent that uses the skill registry ---
def skill_aware_agent(user_task: str):
    relevant_skills = registry.retrieve(user_task)
    skill_context = "\n\n".join(
        [f"Skill `{s['name']}`:\n```python\n{s['code']}\n```" for s in relevant_skills]
    )

    system = f"""You are a Python agent. Use available skills when helpful.
Available skills:
{skill_context}

If no skill fits, say NEED_NEW_SKILL: <description of what's needed>."""

    response = client.messages.create(
        model="claude-opus-4-6-20260205",
        max_tokens=1024,
        system=system,
        messages=[{"role": "user", "content": user_task}]
    )
    answer = response.content[0].text

    # Auto-generate missing skill if flagged
    if "NEED_NEW_SKILL:" in answer:
        needed = answer.split("NEED_NEW_SKILL:")[1].strip()
        print(f"🔧 Generating new skill for: {needed}")
        new_skill = generate_skill(needed)
        registry.register(**new_skill)
        return skill_aware_agent(user_task)  # Retry with new skill

    success = "error" not in answer.lower() and "sorry" not in answer.lower()
    for s in relevant_skills:
        registry.update_stats(s["name"], success)

    return answer

# Example usage
print(skill_aware_agent("What percentage is 45 out of 180?"))
print(skill_aware_agent("How many days between 2024-01-15 and 2024-07-04?"))
print(skill_aware_agent("Convert 100 USD to EUR at a rate of 0.92"))  # triggers new skill

4c. Evaluation & Version Gating

Only promote a new prompt or skill if it measurably beats the current baseline.

Layered graders. A single LLM-as-judge is fragile. Production systems should layer multiple evaluation signals, as the OpenAI Cookbook demonstrates:

| Grader type | What it checks | Why it matters |
|---|---|---|
| Deterministic (Python) | Keyword presence, length within bounds | Fast, cheap, catches hard failures early |
| Semantic (cosine similarity) | Summary stays anchored to source content | Guards against superficial rephrasing that drifts from the original |
| LLM-as-judge (score model) | Rubric-driven quality assessment | Captures nuanced signals that rule-based metrics miss |

The deterministic graders stabilize optimization before semantic tuning kicks in. The LLM judge provides a holistic failsafe for edge cases that slip past the other checks.
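The first two grader layers need no LLM at all. A toy sketch using only the standard library; the bag-of-words cosine is a stand-in for a real embedding-based similarity check:

```python
import math
from collections import Counter

def deterministic_grade(text: str, required_keywords: list[str],
                        max_len: int = 500) -> bool:
    """Hard checks: every required keyword present, length within bounds."""
    lower = text.lower()
    return all(k.lower() in lower for k in required_keywords) and len(text) <= max_len

def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine; swap for real embedding vectors in production."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0

def layered_grade(output: str, source: str, keywords: list[str]) -> dict:
    """Run the cheap graders first; escalate to the LLM judge only if both pass."""
    hard_ok = deterministic_grade(output, keywords)
    sim = cosine_similarity(output, source)
    return {
        "deterministic": hard_ok,
        "semantic": sim,
        "escalate_to_llm_judge": hard_ok and sim >= 0.3,
    }
```

Running the cheap checks first means most bad candidates never cost you a judge call.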

import json
from dataclasses import dataclass, field
from anthropic import Anthropic

client = Anthropic()

@dataclass
class EvalResult:
    score: float
    passed: bool
    feedback: str

@dataclass
class EvalSuite:
    name: str
    cases: list[dict] = field(default_factory=list)
    pass_threshold: float = 0.75

    def add_case(self, input: str, expected: str, tags: list[str] | None = None):
        self.cases.append({"input": input, "expected": expected, "tags": tags or []})

def llm_judge(task: str, expected: str, actual: str) -> EvalResult:
    prompt = f"""Evaluate this agent response.
Task: {task}
Expected: {expected}
Actual: {actual}

Reply with JSON only:
{{"score": 0.0-1.0, "passed": true/false, "feedback": "brief reason"}}"""

    result = client.messages.create(
        model="claude-opus-4-6-20260205",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}]
    )
    data = json.loads(result.content[0].text)
    return EvalResult(**data)

def run_eval_suite(suite: EvalSuite, system_prompt: str) -> dict:
    results = []
    tag_scores: dict[str, list] = {}

    for case in suite.cases:
        response = client.messages.create(
            model="claude-opus-4-6-20260205",
            max_tokens=512,
            system=system_prompt,
            messages=[{"role": "user", "content": case["input"]}]
        )
        actual = response.content[0].text
        result = llm_judge(case["input"], case["expected"], actual)
        results.append(result)

        for tag in case.get("tags", []):
            tag_scores.setdefault(tag, []).append(result.score)

        status = "✅" if result.passed else "❌"
        print(f"  {status} [{case['input'][:45]}] score={result.score:.2f} | {result.feedback}")

    avg_score = sum(r.score for r in results) / len(results)
    tag_summary = {tag: round(sum(s)/len(s), 2) for tag, s in tag_scores.items()}

    return {
        "avg_score": round(avg_score, 3),
        "passed": avg_score >= suite.pass_threshold,
        "tag_breakdown": tag_summary,
        "total_cases": len(results),
        "passed_cases": sum(1 for r in results if r.passed)
    }

def promote_prompt(candidate: str, current: str, suite: EvalSuite) -> tuple[str, dict]:
    """Only promote candidate if it beats the current prompt."""
    print("\n📊 Evaluating CURRENT prompt...")
    current_report = run_eval_suite(suite, current)

    print("\n📊 Evaluating CANDIDATE prompt...")
    candidate_report = run_eval_suite(suite, candidate)

    if candidate_report["avg_score"] > current_report["avg_score"]:
        print(f"\n🚀 Promoting ({candidate_report['avg_score']:.2f} > {current_report['avg_score']:.2f})")
        return candidate, candidate_report
    else:
        print(f"\n⏪ Keeping current ({current_report['avg_score']:.2f} >= {candidate_report['avg_score']:.2f})")
        return current, current_report

# Example usage
suite = EvalSuite(name="math_agent_v1", pass_threshold=0.75)
suite.add_case("What is 15% of 200?",                  "30",           tags=["percentage"])
suite.add_case("A rectangle is 8x5. What's its area?", "40 sq units",  tags=["geometry"])
suite.add_case("Train goes 90mph for 3 hours. Distance?", "270 miles", tags=["word_problem"])
suite.add_case("Factor 12 into primes.",                "2 × 2 × 3",   tags=["number_theory"])

current_prompt   = "You are a helpful assistant that solves math problems."
candidate_prompt = (
    "You are a precise math tutor. Always show step-by-step reasoning, "
    "state the formula used, then give a clean final answer."
)

best_prompt, report = promote_prompt(candidate_prompt, current_prompt, suite)
print(f"\nTag breakdown: {report['tag_breakdown']}")
print(f"Final: {report['passed_cases']}/{report['total_cases']} cases passed")

Version tracking in production. The OpenAI Cookbook introduces a VersionedPrompt class that stores each prompt revision with a timestamp, eval ID, run ID, and metadata. This gives you instant rollback and a full audit trail of what changed and why. The pattern is simple to implement yourself:

from datetime import datetime, timezone
from dataclasses import dataclass, field

@dataclass
class PromptVersion:
    version: int
    prompt: str
    model: str
    score: float
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    metadata: dict = field(default_factory=dict)

class VersionedPrompt:
    def __init__(self, initial_prompt: str, model: str = "claude-opus-4-6-20260205"):
        self._versions = [PromptVersion(version=0, prompt=initial_prompt, model=model, score=0.0)]

    def update(self, new_prompt: str, score: float, model: str | None = None, **metadata) -> PromptVersion:
        v = PromptVersion(
            version=self._versions[-1].version + 1,
            prompt=new_prompt,
            model=model or self._versions[-1].model,
            score=score,
            metadata=metadata,
        )
        self._versions.append(v)
        return v

    def current(self) -> PromptVersion:
        return self._versions[-1]

    def best(self) -> PromptVersion:
        return max(self._versions, key=lambda v: v.score)

    def rollback(self, version: int) -> PromptVersion:
        self._versions = [v for v in self._versions if v.version <= version]
        return self._versions[-1]

Model comparison. When optimizing, you can also test the same prompt across different model variants (e.g., a full model vs a smaller/cheaper model) and select the best model-prompt combination. The OpenAI Cookbook demonstrates this by running candidate prompts against both gpt-5 and gpt-5-mini in parallel and keeping whichever scores higher — balancing quality against cost and latency.
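The selection logic itself is provider-agnostic. A sketch that picks the best model-prompt pair, preferring a cheaper model when it scores within a small quality margin; the `evaluate` stub and its scores and costs are illustrative, not benchmark data:

```python
from itertools import product

def evaluate(model: str, prompt: str) -> float:
    """Stub for running a full eval suite; returns an average score."""
    scores = {
        ("claude-opus-4-6-20260205", "v2"): 0.91,
        ("claude-sonnet-4-6", "v2"): 0.89,
        ("claude-opus-4-6-20260205", "v1"): 0.80,
        ("claude-sonnet-4-6", "v1"): 0.78,
    }
    return scores.get((model, prompt), 0.0)

def best_combination(models: list[str], prompts: list[str],
                     cost_per_call: dict[str, float],
                     quality_margin: float = 0.03):
    """Pick the top scorer, but prefer a cheaper model within the margin."""
    scored = sorted(
        ((evaluate(m, p), m, p) for m, p in product(models, prompts)),
        reverse=True,
    )
    top_score, top_model, top_prompt = scored[0]
    for score, model, prompt in scored:
        if (top_score - score <= quality_margin
                and cost_per_call[model] < cost_per_call[top_model]):
            return model, prompt, score
    return top_model, top_prompt, top_score
```

This is how the Sonnet-for-routine, Opus-for-reasoning split from Section 3 can be derived from data rather than intuition.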


4d. Advanced: GEPA Optimization

The simple metaprompt rewriting loop in Section 4a works but has a limitation: a static meta-prompt explores a narrow space and can overfit to immediate grader feedback on individual examples.

GEPA (Genetic-Pareto) is a more rigorous alternative demonstrated in the OpenAI Cookbook. It samples agent trajectories, reflects on them in natural language, proposes prompt revisions, and evolves the system through iterative feedback loops with train/validation splits.

How it differs from simple rewriting:

| Dimension | Simple metaprompt | GEPA |
|---|---|---|
| Search strategy | Greedy rewrite per failure | Population-based, Pareto front selection |
| Overfitting protection | None | Train/validation split |
| Feedback used | Grader scores only | Scores + natural language reflection on trajectories |
| Multi-objective | Single average score | Pareto-optimal across multiple grader dimensions |

The GEPA loop:

  1. Start with a seed prompt (candidate)
  2. Evaluate on a training subsample using your graders
  3. Reflect on trajectories — the GEPA reflection LM reads inputs, outputs, and feedback to propose an improved prompt
  4. Evaluate the new candidate on a validation set
  5. Maintain a Pareto front of non-dominated candidates
  6. Repeat until convergence or budget exhaustion
import gepa
from gepa import EvaluationBatch

seed_candidate = {
    "system_prompt": "You are a summarization assistant. Given a section of text, produce a summary."
}

result = gepa.optimize(
    seed_candidate=seed_candidate,
    trainset=train_data,
    valset=val_data,
    adapter=your_eval_adapter,   # bridges your graders to GEPA's interface
    reflection_lm="gpt-5",
    max_metric_calls=20,
    track_best_outputs=True,
)

best_prompt = result.best_candidate["system_prompt"]

When to use GEPA vs simple rewriting: If you have fewer than 10 eval cases and need a quick improvement, simple metaprompt rewriting is sufficient. If you have a real dataset with dozens of examples and need the prompt to generalize across them, GEPA's population-based search with train/validation splits will produce more robust results.
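The Pareto-front bookkeeping in step 5 can be sketched in a few lines: a candidate survives only if no other candidate beats it on every grader dimension.

```python
def dominates(a: dict, b: dict) -> bool:
    """True if candidate scores `a` dominate `b`: at least as good on every
    grader dimension and strictly better on at least one."""
    keys = a.keys()
    return all(a[k] >= b[k] for k in keys) and any(a[k] > b[k] for k in keys)

def pareto_front(candidates: list[dict]) -> list[dict]:
    """Keep only non-dominated candidates (each dict maps grader -> score)."""
    return [
        c for c in candidates
        if not any(dominates(other, c) for other in candidates if other is not c)
    ]
```

Keeping the whole front, rather than a single best average, is what lets GEPA trade off between grader dimensions (say, correctness vs. brevity) instead of collapsing them into one number.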


5. When to Improve Prompt vs. Create a Skill

| Signal | Improve Prompt | Create/Improve Skill |
|---|---|---|
| Wrong tone, style, or reasoning format | ✓ | |
| Misunderstands task intent | ✓ | |
| Missing a computation or lookup | | ✓ |
| Fails consistently on one task type | | ✓ |
| Needs external data or API | | ✓ |
| Hallucinating facts it should retrieve | | ✓ |

The 3-question test:

  1. Knowledge/reasoning gap or behavior gap? → behavior = prompt, knowledge = skill
  2. Reproducible with the same input type? → yes = skill (deterministic logic in code)
  3. Would a human use a tool or think differently? → tool = skill, think = prompt

Automated Failure Classifier

import json
from enum import Enum
from anthropic import Anthropic

client = Anthropic()

class ImprovementTrack(Enum):
    PROMPT = "prompt"
    SKILL  = "skill"
    BOTH   = "both"

def classify_failure(
    task: str,
    agent_response: str,
    expected: str,
    current_system_prompt: str
) -> dict:
    classifier_prompt = f"""
You are an AI agent debugging expert. Analyze this agent failure.

System prompt: {current_system_prompt}
Task: {task}
Expected: {expected}
Actual response: {agent_response}

Diagnose the root cause and classify it. Consider:
- PROMPT: the agent has the capability but wrong behavior/tone/reasoning style
- SKILL: the agent is missing a tool, lookup, or computation it cannot reliably do in its head
- BOTH: the prompt misdirects AND a skill is missing

Reply with JSON only:
{{
    "track": "prompt" | "skill" | "both",
    "root_cause": "one sentence explanation",
    "evidence": "specific part of the response that reveals the problem",
    "suggested_action": "concrete next step"
}}
"""
    result = client.messages.create(
        model="claude-opus-4-6-20260205",
        max_tokens=512,
        messages=[{"role": "user", "content": classifier_prompt}]
    )
    diagnosis = json.loads(result.content[0].text)
    diagnosis["track"] = ImprovementTrack(diagnosis["track"])
    return diagnosis

# Example usage
failures = [
    {
        "task": "What is the compound interest on $5000 at 4.5% for 3 years?",
        "expected": "$706.06",
        "actual": "The compound interest would be approximately $700.",
        "prompt": "You are a helpful financial assistant."
    },
    {
        "task": "Explain the steps to solve a quadratic equation.",
        "expected": "Step-by-step: factoring, completing the square, quadratic formula",
        "actual": "Just use the quadratic formula: x = (-b ± √(b²-4ac)) / 2a",
        "prompt": "You are a helpful math assistant."
    },
]

for f in failures:
    print(f"\nTask: {f['task'][:60]}...")
    diagnosis = classify_failure(f["task"], f["actual"], f["expected"], f["prompt"])
    print(f"  Track     : {diagnosis['track'].value.upper()}")
    print(f"  Root cause: {diagnosis['root_cause']}")
    print(f"  Action    : {diagnosis['suggested_action']}")

Thumb rules:

  • Prompt = change how the agent thinks
  • Skill = change what the agent can do
  • If a fix requires math, datetime, or any API call → always a Skill
  • Aim for a thin prompt, rich skill library

6. Track 2 — Code & Harness Evolution

Prompt and skill tuning change the instructions and tools given to a model. Code and harness evolution go further: the agent modifies its own implementation.

Code evolution has two variants: model-side (autoresearch modifies training code to produce a better model) and harness-side (autoagent modifies the agent itself — prompt, tools, orchestration). Both use the same program.md pattern.

The program.md Pattern

The key insight from both frameworks: you are not touching the Python files like you normally would as an engineer. Instead, you are programming program.md — the Markdown file that provides context to the meta-agent and defines the evolution loop.

┌─────────────────────────────────────────────┐
│  Human writes program.md                    │
│  (instructions, constraints, goals)         │
│                                             │
│         ┌──────────────┐                    │
│         │  Meta-agent   │                   │
│         │  reads        │                   │
│         │  program.md   │                   │
│         └──────┬───────┘                    │
│                │                            │
│         ┌──────▼───────┐                    │
│         │  Modifies     │                   │
│         │  train.py or  │                   │
│         │  agent.py     │                   │
│         └──────┬───────┘                    │
│                │                            │
│         ┌──────▼───────┐                    │
│         │  Runs eval    │                   │
│         │  (metric)     │                   │
│         └──────┬───────┘                    │
│                │                            │
│         ┌──────▼───────┐                    │
│         │  Score better?│                   │
│         │  Keep : Revert│                   │
│         └──────────────┘                    │
└─────────────────────────────────────────────┘
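In practice, a program.md is just goals, an allowed edit surface, and the loop contract. A hypothetical minimal sketch (real autoresearch/autoagent files are richer; the file names and flags here are illustrative):

```markdown
# Goal
Improve validation bits-per-byte on the held-out split.

# You may edit
- train.py (architecture, optimizer, hyperparameters)

# You may NOT edit
- prepare.py (data prep is fixed)
- the eval metric (it is the ground truth)

# Loop
1. Read the experiment log below.
2. Make ONE focused change to train.py.
3. Run the training script with a 5-minute budget.
4. If val bits-per-byte improved, commit; otherwise revert train.py.
5. Append what you tried and the result to the log.

# Experiment log
<!-- the meta-agent appends entries here -->
```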

autoresearch: Evolving Model Training Code

Setup: Three files. prepare.py handles data prep (fixed). train.py contains the full model and training loop (agent edits this). program.md is the agent's instruction manual (human edits this).

Loop: Point a coding agent (Claude, Codex, etc.) at the repo. The agent reads program.md, modifies train.py, kicks off a 5-minute training run, checks if validation bits per byte improved. If yes, the change sticks. If no, the agent reverts and tries something else.

Results: ~12 experiments/hour, ~100 overnight. You wake up to a log of everything the agent tried and (hopefully) a better model.

Why this is code evolution, not fine-tuning: Although autoresearch produces a better-trained model as its output, the evolution mechanism is code editing, not weight updating — the agent modifies Python source (architecture, optimizer, hyperparameters), not gradients. The coding agent's own weights are never touched.
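The keep-or-revert mechanic is simple enough to sketch. Below, `run_experiment` and the git commands stand in for whatever your harness actually does (hypothetical names); the decision rule itself is just a comparison:

```python
import subprocess

def keep_or_revert(new_score: float, best_score: float,
                   lower_is_better: bool = True) -> bool:
    """Decide whether an edited train.py should be kept or reverted."""
    return new_score < best_score if lower_is_better else new_score > best_score

def evolution_step(best_score: float, run_experiment) -> float:
    """One iteration: the meta-agent has already edited train.py."""
    new_score = run_experiment()  # e.g. a capped 5-minute training run
    if keep_or_revert(new_score, best_score):
        # Change improved the metric → commit it
        subprocess.run(["git", "commit", "-am", f"val={new_score:.4f}"], check=True)
        return new_score
    # Change regressed → revert and keep the old best
    subprocess.run(["git", "checkout", "--", "train.py"], check=True)
    return best_score
```

Because every accepted change is a commit and every rejected one is a checkout, the full experiment history stays reversible for free.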

autoagent: Evolving the Agent Harness

autoagent applies the same pattern to the agent itself rather than model training code:

  • agent.py — the entire harness in a single file: config, tool definitions, agent registry, routing/orchestration, and a Harbor adapter boundary (explicitly marked as fixed)
  • program.md — meta-agent instructions plus the directive (what kind of agent to build)
  • tasks/ — evaluation tasks in Harbor format, running in Docker containers

The meta-agent modifies the system prompt, tools, agent configuration, and orchestration, runs the benchmark, checks the score, keeps or discards the change, and repeats.

When to Use Code Evolution

This track generalizes to any scenario where you have:

  1. A single file (or small surface) to optimize — a config file, a set of hyperparameters, a build configuration, an agent harness
  2. A clear, measurable metric — validation loss, benchmark score, test pass rate
  3. A bounded experiment time — each iteration completes in minutes, not hours

If your problem fits this shape, the autoresearch/autoagent pattern can be more effective than manual iteration — and it works overnight while you sleep.

Important distinction from fine-tuning: Code evolution modifies the code and configuration around the model, not the model weights. It is cheaper, faster, and fully reversible (just revert the file). Consider it before jumping to fine-tuning.


7. Track 3 — RAG

RAG fixes knowledge gaps. It slots between code evolution and fine-tuning in the escalation ladder.

| Problem | RAG | Fine-Tune |
| --- | --- | --- |
| Missing domain facts or docs | ✅ | |
| Stale knowledge / live updates needed | ✅ | |
| Specific reasoning style/pattern | | ✅ |
| < 500 training examples available | ✅ | |
| Hallucinating facts it should look up | ✅ | ⚠️ partial |

Minimal RAG Skill

import json
from anthropic import Anthropic

client = Anthropic()

# --- Toy in-memory store (swap for Chroma/Pinecone in prod) ---
knowledge_base = [
    {"id": 1, "text": "Q1 2025 audit found 3 critical gaps in access control policies."},
    {"id": 2, "text": "Revenue for Q1 2025 was $4.2M, up 18% YoY."},
    {"id": 3, "text": "The compound XR-47 showed hepatotoxicity in Phase 2 trials."},
]

def simple_retrieve(query: str, top_k=2) -> list[str]:
    """Keyword overlap retrieval — replace with embedding search in prod."""
    query_words = set(query.lower().split())
    scored = []
    for doc in knowledge_base:
        doc_words = set(doc["text"].lower().split())
        overlap = len(query_words & doc_words)
        scored.append((overlap, doc["text"]))
    scored.sort(reverse=True)
    return [text for score, text in scored[:top_k] if score > 0]

def rag_agent(user_query: str) -> str:
    context_chunks = simple_retrieve(user_query)

    if context_chunks:
        context_block = "\n".join(f"- {c}" for c in context_chunks)
        system = f"""You are a helpful enterprise assistant.
Use ONLY the retrieved context below to answer.
If the context doesn't cover the question, say so.

Retrieved context:
{context_block}"""
    else:
        system = "You are a helpful enterprise assistant."

    response = client.messages.create(
        model="claude-opus-4-6-20260205",
        max_tokens=512,
        system=system,
        messages=[{"role": "user", "content": user_query}]
    )
    return response.content[0].text

# Example usage
queries = [
    "What did the Q1 2025 audit find?",
    "What were Q1 revenues?",
    "Tell me about XR-47 safety.",
    "What is our HR vacation policy?",  # not in KB → honest fallback
]

for q in queries:
    print(f"\nQ: {q}")
    print(f"A: {rag_agent(q)}")

Key principle: RAG + skills often eliminate the need for fine-tuning entirely for enterprise agents where knowledge is the primary gap.


8. Track 4 — LLM Fine-Tuning

Fine-tuning internalizes behavior and reasoning patterns that prompt iteration cannot reliably produce. It is the most expensive and least reversible track — and it carries a real risk of losing generalization capability. A model fine-tuned on a narrow domain dataset may improve on that domain while degrading on everything else. This is not a theoretical concern: it is the primary failure mode of production fine-tuning.

Escalate to fine-tuning only when:

  • Prompt iteration has plateaued (3+ rounds, no score improvement)
  • Failures persist even when the correct skill is invoked
  • Failures are concentrated in one domain (finance, legal, medical)
  • You have 500+ clean, high-quality training trajectories

Consider code evolution first. If the issue is about how the agent operates rather than how the model reasons, the autoresearch/autoagent pattern from Section 6 may be more effective. Code evolution modifies the code and configuration around the model (architecture, hyperparameters, tools, orchestration) without touching model weights — cheaper, faster, and fully reversible.

The iterative fine-tuning loop:

Deploy → collect trajectories → filter (score ≥ 0.8) → fine-tune → redeploy → repeat
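The "filter and export" step of this loop is mostly mechanical. A minimal sketch, assuming chat-format JSONL (the schema most fine-tuning providers accept; adjust field names to yours):

```python
import json

def export_trajectories(runs: list[dict], path: str, min_score: float = 0.8) -> int:
    """Keep only high-scoring runs and write them as chat-format JSONL."""
    kept = [r for r in runs if r.get("score", 0.0) >= min_score]
    with open(path, "w") as f:
        for r in kept:
            f.write(json.dumps({
                "messages": [
                    {"role": "user", "content": r["task"]},
                    {"role": "assistant", "content": r["actual"]},
                ]
            }) + "\n")
    return len(kept)  # how many trajectories survived the filter
```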

Avoiding catastrophic forgetting:

  • Always fine-tune from the base model, not iteratively from prior fine-tunes
  • Evaluate on a held-out general benchmark alongside the domain benchmark
  • Set a regression threshold: if general score drops > 5%, abort
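That abort condition is worth encoding as an explicit gate before any fine-tuned model is promoted. A sketch, using the 5% relative-drop threshold suggested above:

```python
def passes_regression_gate(domain_before: float, domain_after: float,
                           general_before: float, general_after: float,
                           max_general_drop: float = 0.05) -> bool:
    """Accept a fine-tune only if the domain score improves AND the general
    benchmark does not drop by more than max_general_drop (relative)."""
    if domain_after <= domain_before:
        return False  # no domain gain → fine-tune not worth the risk
    general_drop = (general_before - general_after) / general_before
    return general_drop <= max_general_drop
```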

Frameworks That Automate the Fine-Tuning Loop

Two frameworks are worth highlighting for teams that want to close the loop from production failures to weight updates without manual data curation:

DSPy self-distillation. DSPy can fine-tune a smaller, cheaper model (e.g., Llama 3) to mimic the reasoning of a larger model (e.g., GPT-5) by distilling the best-performing prompt traces into training data. The workflow: run your DSPy program with the large model, collect the traces that score highest on your metric, and use them to fine-tune the small model. This gives you the reasoning quality of the big model at the inference cost of the small one.

AgentScope + Trinity-RFT. Designed for enterprise-scale autonomous fine-tuning. AgentScope captures production logs via "Inference Tables." Trinity-RFT uses an LLM judge to label production data as "good" or "bad," then automatically kicks off a fine-tuning job using reinforcement learning from feedback (PPO or SFT). This is the most hands-off approach to weight updates: the system monitors production, identifies failures, curates training data, and fine-tunes — all without human intervention. The trade-off is complexity: you need the infrastructure to run fine-tuning jobs on schedule and the monitoring to catch regressions.


9. The Master Decision Pipeline — LLM as Judge

This is the centerpiece of the guide. Four judges, one pipeline — everything from Sections 4–8 plugs into the dispatcher at the end.

Agent runs → Failures logged
     ↓
Judge 1: Per-run evaluator (scores 0–1)
     ↓
Judge 2: Signal extractor (persistence, skill gap, knowledge gap, data volume)
     ↓
Judge 3: Track recommender (LLM synthesizes signals → verdict)
     ↓
Judge 4: Action dispatcher → calls evolution_loop() / rag_agent() / fine-tune export
import json
from dataclasses import dataclass
from enum import Enum
from anthropic import Anthropic

client = Anthropic()


# ── Data models ──────────────────────────────────────────────

class Track(Enum):
    PROMPT_SKILL   = "prompt_skill"
    CODE_EVOLUTION = "code_evolution"
    RAG            = "rag"
    FINE_TUNE      = "fine_tune"
    RAG_FINE_TUNE  = "rag+fine_tune"

@dataclass
class AgentRun:
    task: str
    expected: str
    actual: str
    task_type: str
    prompt_version: str
    prompt_round: int
    correct_skill_invoked: bool = False
    score: float = 0.0

@dataclass
class JudgeVerdict:
    track: Track
    confidence: float
    signals: dict
    rationale: str
    next_steps: list[str]
    estimated_effort: str
    risk: str


# ── Judge 1: Per-run evaluator ────────────────────────────────

def evaluate_run(run: AgentRun) -> AgentRun:
    """Scores a single agent run 0.0–1.0."""
    prompt = f"""
Evaluate this agent response.

Task     : {run.task}
Expected : {run.expected}
Actual   : {run.actual}

Reply with JSON only:
{{"score": 0.0-1.0, "passed": true/false, "reason": "one sentence"}}
"""
    result = client.messages.create(
        model="claude-opus-4-6-20260205",
        max_tokens=256,
        messages=[{"role": "user", "content": prompt}]
    )
    raw = result.content[0].text.strip()
    if raw.startswith("```"):
        # Strip a possible ```json ... ``` fence around the reply
        raw = raw.strip("`").strip()
        if raw.startswith("json"):
            raw = raw[len("json"):].strip()
    data = json.loads(raw)
    run.score = data["score"]
    return run


# ── Judge 2: Signal extractor ─────────────────────────────────

def extract_signals(runs: list[AgentRun], corpus_exists: bool, example_count: int) -> dict:
    """Derives quantitative signals from a batch of runs."""
    total = len(runs)
    failed = [r for r in runs if r.score < 0.7]

    if not failed:
        return {"all_passing": True}

    f = len(failed)

    # Signal 1: Prompt plateau — failures persisting after 3+ prompt rounds
    persistence_rate = len([r for r in failed if r.prompt_round >= 3]) / f

    # Signal 2: Skill bottleneck — skill fired but still failed
    skill_failure_rate = len([r for r in failed if r.correct_skill_invoked]) / f

    # Signal 3: Domain concentration — one task type dominating failures
    type_counts = {}
    for r in failed:
        type_counts[r.task_type] = type_counts.get(r.task_type, 0) + 1
    dominant_rate = max(type_counts.values()) / f if type_counts else 0
    dominant_type = max(type_counts, key=type_counts.get) if type_counts else "unknown"

    # Signal 4: Knowledge gap — failed despite no skill gap → likely needs retrieval
    knowledge_gap_rate = len([
        r for r in failed if not r.correct_skill_invoked and r.prompt_round >= 2
    ]) / f

    return {
        "total_runs"         : total,
        "failure_rate"       : round(f / total, 2),
        "persistence_rate"   : round(persistence_rate, 2),   # > 0.4 → fine-tune
        "skill_failure_rate" : round(skill_failure_rate, 2), # > 0.3 → fine-tune
        "knowledge_gap_rate" : round(knowledge_gap_rate, 2), # > 0.4 → RAG
        "dominant_type"      : dominant_type,
        "dominant_type_rate" : round(dominant_rate, 2),      # > 0.5 → systematic gap
        "corpus_exists"      : corpus_exists,
        "example_count"      : example_count,
        "data_sufficient"    : example_count >= 500
    }


# ── Judge 3: Track recommender ────────────────────────────────

def recommend_track(
    signals: dict,
    current_prompt: str,
    sample_failures: list[AgentRun]
) -> JudgeVerdict:
    """LLM judge: reads signals + failure samples → recommends track."""

    sample_text = json.dumps([
        {
            "task": r.task, "expected": r.expected,
            "actual": r.actual, "score": r.score,
            "prompt_round": r.prompt_round,
            "correct_skill_invoked": r.correct_skill_invoked
        }
        for r in sample_failures[:5]
    ], indent=2)

    judge_prompt = f"""
You are a senior AI systems architect. Decide the best improvement track
for an underperforming agent based on signals and failure samples.

## Quantitative Signals
{json.dumps(signals, indent=2)}

## Signal Thresholds
- persistence_rate > 0.4     → prompt iteration plateauing → consider fine_tune
- skill_failure_rate > 0.3   → model reasoning is bottleneck → consider fine_tune
- knowledge_gap_rate > 0.4   → facts/docs missing → consider rag
- dominant_type_rate > 0.5   → systematic domain gap
- data_sufficient = false    → BLOCK fine_tune, default to rag or prompt_skill

## Available Tracks
- prompt_skill   : Rewrite system prompt and/or add/fix tools. Fast, cheap, reversible.
- code_evolution : Let a meta-agent modify code/config against a clear metric.
                   Use when the problem has a single file to optimize and a measurable goal.
- rag            : Index a knowledge corpus and retrieve at query time.
                   Prefer over fine-tuning when knowledge changes or data < 500.
- fine_tune      : Train on trajectories. Use when reasoning style is systematically
                   wrong AND 500+ examples exist AND prompt iteration has plateaued.
- rag+fine_tune  : Both. Use when knowledge AND reasoning style are both gaps.

## Current System Prompt
{current_prompt}

## Sample Failures
{sample_text}

Be conservative — recommend fine_tune only when signals clearly justify it.

Reply with JSON only:
{{
    "track": "prompt_skill" | "code_evolution" | "rag" | "fine_tune" | "rag+fine_tune",
    "confidence": 0.0-1.0,
    "signals_fired": {{
        "prompt_plateau"   : true/false,
        "skill_bottleneck" : true/false,
        "knowledge_gap"    : true/false,
        "systematic_domain": true/false,
        "data_sufficient"  : true/false
    }},
    "rationale": "2-3 sentence explanation referencing specific signals",
    "next_steps": ["step 1", "step 2", "step 3"],
    "estimated_effort": "e.g. 2hrs prompt iteration vs 4 days fine-tuning",
    "risk": "main risk of this recommendation"
}}
"""
    result = client.messages.create(
        model="claude-opus-4-6-20260205",
        max_tokens=768,
        messages=[{"role": "user", "content": judge_prompt}]
    )
    raw = result.content[0].text.strip()
    if raw.startswith("```"):
        # Strip a possible ```json ... ``` fence around the reply
        raw = raw.strip("`").strip()
        if raw.startswith("json"):
            raw = raw[len("json"):].strip()
    data = json.loads(raw)

    return JudgeVerdict(
        track=Track(data["track"]),
        confidence=data["confidence"],
        signals=data["signals_fired"],
        rationale=data["rationale"],
        next_steps=data["next_steps"],
        estimated_effort=data["estimated_effort"],
        risk=data["risk"]
    )


# ── Judge 4: Action dispatcher ────────────────────────────────

def dispatch(verdict: JudgeVerdict):
    print(f"\n{'='*60}")
    print(f"  TRACK       : {verdict.track.value.upper()}")
    print(f"  CONFIDENCE  : {verdict.confidence:.0%}")
    print(f"  RATIONALE   : {verdict.rationale}")
    print(f"  EFFORT      : {verdict.estimated_effort}")
    print(f"  RISK        : {verdict.risk}")
    print(f"  SIGNALS     : {verdict.signals}")
    print(f"\n  NEXT STEPS:")
    for i, step in enumerate(verdict.next_steps, 1):
        print(f"    {i}. {step}")
    print(f"{'='*60}")

    actions = {
        Track.PROMPT_SKILL: lambda: (
            print("\n→ Calling evolution_loop() to rewrite system prompt"),
            print("→ Calling classify_failure() to split prompt vs skill fixes")
        ),
        Track.CODE_EVOLUTION: lambda: (
            print("\n→ Set up program.md with constraints and goals"),
            print("→ Point meta-agent at the repo (autoresearch or autoagent pattern)"),
            print("→ Let it hill-climb overnight; review results in the morning")
        ),
        Track.RAG: lambda: (
            print("\n→ Chunk and embed your knowledge corpus"),
            print("→ Register retrieval as a new skill in SkillRegistry"),
            print("→ Re-run eval suite to confirm improvement")
        ),
        Track.FINE_TUNE: lambda: (
            print("\n→ Export high-scoring runs as training trajectories"),
            print("→ Filter: keep only runs with score >= 0.8"),
            print("→ Submit fine-tune job (OpenAI / HuggingFace / Anthropic)")
        ),
        Track.RAG_FINE_TUNE: lambda: (
            print("\n→ Step 1: Build RAG pipeline first (faster win)"),
            print("→ Step 2: Validate RAG improves knowledge gaps"),
            print("→ Step 3: Fine-tune on reasoning style gaps in parallel")
        )
    }
    actions[verdict.track]()


# ── Master pipeline ───────────────────────────────────────────

def run_judge_pipeline(
    runs: list[AgentRun],
    current_prompt: str,
    corpus_exists: bool = False,
    example_count: int = 0
):
    print("⏳ Step 1: Evaluating all runs...")
    evaluated = [evaluate_run(r) for r in runs]

    avg_score = sum(r.score for r in evaluated) / len(evaluated)
    failed_count = sum(1 for r in evaluated if r.score < 0.7)
    print(f"   Avg score: {avg_score:.2f} | Failed: {failed_count}/{len(evaluated)}")

    if avg_score >= 0.85:
        print("✅ Agent is performing well. No improvement needed.")
        return

    print("\n⏳ Step 2: Extracting signals...")
    signals = extract_signals(evaluated, corpus_exists, example_count)
    print(f"   Signals: {signals}")

    failed_runs = [r for r in evaluated if r.score < 0.7]

    print("\n⏳ Step 3: LLM judge recommending track...")
    verdict = recommend_track(signals, current_prompt, failed_runs)

    print("\n⏳ Step 4: Dispatching recommendation...")
    dispatch(verdict)

    return verdict


# ── Example usage ─────────────────────────────────────────────

runs = [
    AgentRun(
        task="Summarize the Q1 2025 earnings report",
        expected="Revenue $4.2M, up 18% YoY, 3 audit gaps found",
        actual="I don't have access to Q1 2025 earnings data.",
        task_type="finance", prompt_version="v3",
        prompt_round=4, correct_skill_invoked=False
    ),
    AgentRun(
        task="What were the audit findings for access control?",
        expected="3 critical gaps found in access control policies",
        actual="I cannot find specific audit findings in my knowledge.",
        task_type="finance", prompt_version="v3",
        prompt_round=4, correct_skill_invoked=False
    ),
    AgentRun(
        task="Calculate compound interest $5000 at 4.5% for 3 years",
        expected="$706.06",
        actual="Approximately $700 using compound interest formula.",
        task_type="finance", prompt_version="v3",
        prompt_round=3, correct_skill_invoked=True
    ),
    AgentRun(
        task="Analyze revenue trend from last 4 quarters",
        expected="Structured YoY trend with % changes",
        actual="Revenue seems to be going up based on general trends.",
        task_type="finance", prompt_version="v3",
        prompt_round=4, correct_skill_invoked=False
    ),
] * 10  # scale to 40 runs

current_prompt = "You are a financial analysis assistant. Be thorough and precise."

verdict = run_judge_pipeline(
    runs=runs,
    current_prompt=current_prompt,
    corpus_exists=True,   # financial docs available to index
    example_count=350     # below the 500 fine-tuning threshold
)

Sample output:

⏳ Step 1: Evaluating all runs...
   Avg score: 0.31 | Failed: 37/40

⏳ Step 2: Extracting signals...
   Signals: {failure_rate: 0.93, persistence_rate: 0.89,
             knowledge_gap_rate: 0.76, dominant_type: finance,
             corpus_exists: True, data_sufficient: False}

⏳ Step 3: LLM judge recommending track...

⏳ Step 4: Dispatching recommendation...
============================================================
  TRACK       : RAG
  CONFIDENCE  : 91%
  RATIONALE   : High knowledge_gap_rate (0.76) with corpus_exists=True
                and data_sufficient=False clearly points to RAG. Agent
                is failing on factual retrieval, not reasoning style.
  EFFORT      : 4–6 hours to chunk, embed, and integrate corpus
  RISK        : Retrieval quality depends on chunking strategy
  SIGNALS     : {prompt_plateau: True, skill_bottleneck: False,
                 knowledge_gap: True, systematic_domain: True,
                 data_sufficient: False}

  NEXT STEPS:
    1. Chunk Q1 earnings report and audit docs into 512-token segments
    2. Embed with text-embedding-3-small and store in Chroma/Pinecone
    3. Register retrieval as a skill and re-run eval suite

→ Chunk and embed your knowledge corpus
→ Register retrieval as a new skill in SkillRegistry
→ Re-run eval suite to confirm improvement
============================================================

10. The Complete Escalation Ladder

Level 1 — Prompt tuning          (minutes, free)
     │  still failing after 3 rounds?
     ▼
Level 2 — Add/improve skills     (hours, cheap)
     │  still failing on reasoning/architecture?
     ▼
Level 3 — Code/harness evolution (hours, cheap — runs overnight)
     │  still failing on knowledge?
     ▼
Level 4 — RAG                    (hours, medium cost)
     │  still failing on reasoning style/pattern?
     ▼
Level 5 — Fine-tuning            (days, expensive)

The master pipeline in Section 9 enforces this ladder automatically — it blocks fine-tuning when data is insufficient, and prefers RAG when a corpus exists. Code evolution (Section 6) is a manual decision point: if your problem has a single file and a clear metric, try the autoresearch/autoagent pattern before moving to RAG or fine-tuning.


11. Continuous Monitoring

The evolution loop does not end after the initial optimization converges. Production agents face shifting data distributions, new edge cases, and model updates that can degrade performance over time.

Periodic re-evaluation. Schedule the eval suite to run on incoming data at regular intervals. When scores drop below a threshold, the evolution loop restarts automatically.

import time

def continuous_monitor(
    agent,
    eval_suite,
    versioned_prompt,
    check_interval_hours=24,
    regression_threshold=0.70,
):
    """Re-evaluate the agent periodically and trigger evolution if scores regress."""
    while True:
        new_tasks = collect_recent_tasks()  # returns list[{"task": ..., "expected": ...}]
        if not new_tasks:
            time.sleep(check_interval_hours * 3600)
            continue

        report = run_eval_suite(eval_suite, versioned_prompt.current().prompt)

        if report["avg_score"] < regression_threshold:
            print(f"Score regressed to {report['avg_score']:.2f} — triggering evolution loop")
            new_prompt = evolution_loop(new_tasks, threshold=regression_threshold)
            versioned_prompt.update(new_prompt, score=report["avg_score"], trigger="auto_regression")
        else:
            print(f"Score healthy: {report['avg_score']:.2f}")

        time.sleep(check_interval_hours * 3600)

Model version comparison on new data. When a new model version becomes available, run the eval suite with the current prompt on both the old and new models. If the new model scores higher, update the VersionedPrompt with the new model. If it scores lower, keep the current model — do not assume newer is better.

Drift detection with auto-rollback. Log prompt version, skill version, model version, and average score over time. If score regresses after any change, auto-rollback to the last known good version. The VersionedPrompt.rollback() method makes this a single call.
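If you don't already have the VersionedPrompt class from Section 4, the pattern is just a scored history. A minimal sketch (the method names follow this guide's usage; the field names are assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class PromptVersion:
    prompt: str
    score: float
    trigger: str = "manual"

@dataclass
class VersionedPrompt:
    history: list[PromptVersion] = field(default_factory=list)

    def update(self, prompt: str, score: float, trigger: str = "manual") -> None:
        self.history.append(PromptVersion(prompt, score, trigger))

    def current(self) -> PromptVersion:
        return self.history[-1]

    def rollback(self) -> PromptVersion:
        """Drop the latest version and return the previous one."""
        if len(self.history) > 1:
            self.history.pop()
        return self.history[-1]

    def best_prompt(self) -> str:
        """Highest-scoring prompt ever recorded — the one-line revert target."""
        return max(self.history, key=lambda v: v.score).prompt
```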


12. Pitfalls & Safety

Self-evolving loops introduce new failure modes that static agents do not have. The more autonomy you give the improvement loop, the more these risks matter.

Reward hacking — if your eval signal is imperfect, the agent will optimize for the signal rather than the goal. Use multiple eval dimensions (correctness, format, safety) and audit a random sample manually every N rounds.
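One cheap defense is to grade on several dimensions and gate on the worst of them rather than the mean, so the optimizer cannot trade one axis (say, safety) for another (say, style). A sketch — the dimension names and the 0.6 gate are illustrative:

```python
def aggregate_grades(grades: dict[str, float], gate: float = 0.6) -> dict:
    """Combine per-dimension scores; the worst dimension gates the result
    so an optimizer cannot hide one bad axis behind two good ones."""
    worst = min(grades.values())
    return {
        "mean": sum(grades.values()) / len(grades),
        "worst": worst,
        "passed": worst >= gate,
    }
```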

Drift detection — log prompt version, skill version, and avg score over time. If score regresses after a change, auto-rollback to the last known good version.

Version everything — never deploy an unevaluated prompt or skill. The promote_prompt() gate in Section 4c enforces this.

Human checkpoints — before any fine-tuning job, require a human review of the filtered training trajectories. Garbage in, garbage out — and fine-tuning mistakes are expensive to undo.

Rollback strategy — store every prompt version with its eval score. A one-line revert (current_prompt = best_prompt()) should always be available.

Safety Models Across Frameworks

Different frameworks take different approaches to containing the risk of autonomous evolution:

| Framework | Safety approach | Trade-off |
| --- | --- | --- |
| OpenAI Cookbook | Versioned prompts with rollback; promote-only-if-better gate | Simple and effective, but no isolation — bad prompts can affect production before rollback |
| autoresearch | Git-based keep-or-revert; fixed 5-minute time budget per experiment | Time budget prevents runaway experiments; git makes every change reversible |
| autoagent | Docker isolation; Harbor sandboxing; tasks run in containers | Strong isolation, but Docker overhead adds latency to the feedback loop |
| Evolver | Command whitelist; scoped execution; timeout limits; full audit trail of every Event | Most comprehensive safety model, but also the most complex to set up |

Strategy Presets

EvoMap's Evolver introduces a useful concept that applies even outside the framework: strategy presets that match the evolution behavior to the current development phase.

  • innovate — maximize new features and exploration. Use early in development when the agent is far from production-ready.
  • harden — focus on stability, regression testing, and edge case coverage. Use when approaching production readiness.
  • repair-only — constrain the agent to fixes only, no new behavior. Use when something is broken in production and you need a targeted fix.

This maps neatly onto how most teams already think about release stages. Even without Evolver, you can implement strategy presets by adjusting the threshold and max_rounds parameters in your evolution loop: high exploration tolerance for innovate mode, strict thresholds and minimal rounds for repair-only.
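As parameters for an evolution loop, the three presets might reduce to something like the following (the specific values are illustrative, not Evolver's):

```python
STRATEGY_PRESETS = {
    # aggressive exploration: accept modest scores, iterate a lot
    "innovate":    {"threshold": 0.70, "max_rounds": 10},
    # approaching production: demand high scores, moderate iteration
    "harden":      {"threshold": 0.90, "max_rounds": 5},
    # production hotfix: only accept near-perfect fixes, stop fast
    "repair_only": {"threshold": 0.95, "max_rounds": 2},
}

def loop_params(phase: str) -> dict:
    """Look up evolution-loop parameters for the current release phase."""
    return STRATEGY_PRESETS[phase]
```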


13. Conclusion

Self-evolving agents are not magic — they are disciplined feedback loops with clear escalation rules. Several open-source frameworks have already proven these patterns work in practice, from automated prompt optimization to overnight code evolution to governed harness engineering.

The four tracks in one sentence each:

  • Prompt/Skill — change how the agent thinks and what it can do. Always try this first.
  • Code/Harness evolution — let the agent modify its own implementation against a clear metric. Try this before RAG or fine-tuning when the problem has a single file and a measurable goal.
  • RAG — give the agent access to knowledge it doesn't have. Prefer this over fine-tuning when knowledge changes or data is scarce.
  • Fine-tuning — internalize reasoning patterns that prompt iteration cannot reliably produce. Use this last, and only with 500+ clean examples.

Thumb rules to remember:

  • Thin prompt, rich skill library
  • RAG before fine-tune
  • Code evolution before fine-tune (it is cheaper and reversible)
  • Persistence is the clearest fine-tune signal
  • Never deploy an unevaluated change
  • The LLM judge pipeline does the routing — let it
  • Version everything; rollback should be one line

Practical advice from the frameworks:

  • Version your prompts like you version your code (VersionedPrompt pattern)
  • Try the autoresearch pattern for any "single file, single metric" problem
  • Borrow Evolver's audit trail thinking for production agents — log every change as a structured event with before/after scores
  • Use strategy presets to match evolution aggressiveness to the development phase
  • Layer your graders: deterministic checks first, then semantic, then LLM judge

The long-term vision is agents that compound in capability over time, with humans setting goals and guardrails while the agent handles the improvement loop. The pipeline in Section 9 is a practical starting point for exactly that.


References

Frameworks covered in this guide:

  • OpenAI Self-Evolving Agents Cookbook — automated prompt improvement via graders and metaprompt agents
  • autoresearch — a coding agent that rewrites its own model training code overnight (Karpathy)
  • autoagent — the same program.md pattern applied to the agent harness itself
  • EvoMap Evolver — governed harness evolution with audit trails and strategy presets

Additional frameworks and methodologies:

  • DSPy — Declarative Self-improving Python; Bayesian prompt compilation and self-distillation (Stanford NLP)
  • TextGrad — Automatic differentiation via text; textual backpropagation for LLM optimization (Nature, 2025)
  • Memento-Skills — Skill-evolution framework for long-horizon autonomous agents
  • AgentScope — Multi-agent platform with Trinity-RFT for online fine-tuning from production logs

