I started with this instruction for a Google ADK agent:
"Greet the user appropriately."
Four words. Seemed fine. The agent produced decent greetings. I could've shipped it.
Instead, I ran it through an evolutionary optimizer. Three iterations later, the instruction was three paragraphs long — covering formality tiers, period-appropriate language for different honorifics, tonal variation based on social context, and specific vocabulary constraints I never would have thought to include.
The agent's quality score went from 0.35 to 0.81. Same model, same training examples, completely different output quality. The only thing that changed was the instruction text — and I didn't write a single word of the new one.
The Problem
Prompt engineering is guess-and-check. You write something, test it on a couple examples, tweak a word, test again. It works, kind of — like seasoning food without tasting it. You'll get something edible, but you'll never find the version that's genuinely great.
The core issue: the space of possible instructions is infinite, and your intuition can only explore a tiny corner of it. You get stuck in local optima. You test against too few examples. You optimize for what feels wrong instead of what measurably underperforms.
The Fix: Let an LLM Critique and Rewrite the Prompts
gepa-adk automates this loop using evolutionary optimization (based on the GEPA paper):
- Run the agent on training examples
- Score outputs with a critic agent
- Reflect — an LLM analyzes what went wrong
- Mutate — the LLM proposes a better instruction based on that analysis
- Keep or discard based on whether scores improve
The mutation isn't random. The reflection model sees every output, every score, and every piece of critic feedback. It makes targeted changes. Think of it less like genetic mutation and more like a head chef tasting every plate and adjusting the recipe.
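The loop above can be sketched in plain Python. Everything here is a toy stand-in: `run_agent`, `score_output`, and `propose_mutation` are hypothetical placeholders for the real agent call, the critic, and the reflection model (none of them are gepa-adk code), but the keep-or-discard control flow has the same shape.

```python
# Toy sketch of the score -> reflect -> mutate -> keep loop.
# The three helpers below stand in for real LLM calls; here they are
# deterministic stubs so the control flow is runnable on its own.

def run_agent(instruction: str, example: str) -> str:
    # Stand-in for running the agent; a real version calls the LLM.
    return f"[{instruction}] {example}"

def score_output(output: str) -> float:
    # Stand-in for the critic: here, longer/more specific scores higher.
    return min(len(output) / 100.0, 1.0)

def propose_mutation(instruction: str) -> str:
    # Stand-in for the reflection model: a real version reads every
    # output, score, and piece of critic feedback before rewriting.
    return instruction + " Match the speaker's rank and use period-appropriate honorifics."

def evolve(instruction: str, trainset: list[str], iterations: int = 3) -> tuple[str, float]:
    def avg_score(instr: str) -> float:
        return sum(score_output(run_agent(instr, ex)) for ex in trainset) / len(trainset)

    best, best_score = instruction, avg_score(instruction)
    for _ in range(iterations):
        candidate = propose_mutation(best)
        candidate_score = avg_score(candidate)
        if candidate_score > best_score:  # keep only if scores improve
            best, best_score = candidate, candidate_score
    return best, best_score

evolved, score = evolve("Greet the user appropriately.", ["I am the King."])
```

With the stubs above, the mutated instruction outscores the original, so it is kept; a candidate that didn't improve the average score would be discarded and the previous best retried.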
Here's the entire thing:
from google.adk.agents import LlmAgent
from gepa_adk import evolve_sync, SimpleCriticOutput

agent = LlmAgent(
    name="greeter",
    model="gemini-2.5-flash",
    instruction="Greet the user appropriately.",
)

critic = LlmAgent(
    name="critic",
    model="gemini-2.5-flash",
    instruction="Score for formal, Dickens-style greetings. 0.0-1.0.",
    output_schema=SimpleCriticOutput,
)

trainset = [
    {"input": "I am His Majesty, the King."},
    {"input": "I am your mother."},
    {"input": "I am a close friend."},
]

result = evolve_sync(agent, trainset, critic=critic)
print(f"Score: {result.original_score:.2f} -> {result.final_score:.2f}")
That's it. evolve_sync handles the loop. You get back the evolved instruction and the score trajectory.
What Else Can Evolve
Instructions are the default target, but gepa-adk can also optimize output schemas (Pydantic models), generation config (temperature, top-p), and even multi-agent systems — evolving how multiple agents coordinate together.
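When the target is an output schema, the thing being evolved is a Pydantic model rather than a string: the optimizer can rewrite field descriptions, tighten constraints, or restructure fields. The model below is a hypothetical example of my own (assuming Pydantic v2), not anything from gepa-adk:

```python
from pydantic import BaseModel, Field

# Hypothetical agent output schema; the class and field names are
# illustrative only, not part of gepa-adk's API.
class Greeting(BaseModel):
    salutation: str = Field(description="Honorific-appropriate opening")
    body: str = Field(description="The greeting itself")
    formality: float = Field(ge=0.0, le=1.0, description="Estimated formality level")

greeting = Greeting(
    salutation="Your Majesty,",
    body="I am honoured beyond measure by your presence.",
    formality=0.9,
)
```

Evolving a schema like this might mean sharpening the `description` strings (which steer the model's structured output) just as instruction evolution sharpens the prompt.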
The multi-agent case is where it gets wild. In a pipeline, one agent's instruction shapes another agent's input. Evolving them together finds coordination patterns you'd never discover by tuning each agent in isolation.
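A toy way to see why joint evolution beats per-agent tuning: score a two-agent pipeline whose quality depends on how the instructions interact. The scorer and candidate pool below are stubs of my own invention, not gepa-adk code; the point is only that coordinate-wise search can get stuck where a joint search over pairs does not.

```python
from itertools import product

def pipeline_score(planner_instr: str, writer_instr: str) -> float:
    # Stub for a critic scoring a two-agent pipeline: the instructions
    # only score well when they agree on register.
    formal_plan = "formal" in planner_instr
    formal_prose = "formal" in writer_instr
    if formal_plan and formal_prose:
        return 0.9   # coordinated pair
    if formal_plan or formal_prose:
        return 0.4   # mismatch: one agent undoes the other's work
    return 0.5       # consistent but bland

candidates = ["be brief", "be formal"]

# Tuning one agent in isolation (the other fixed at "be brief") keeps
# "be brief": changing either instruction alone lowers the score.
best_solo = max(candidates, key=lambda c: pipeline_score(c, "be brief"))

# Joint search over pairs escapes that local optimum.
best_pair = max(product(candidates, repeat=2), key=lambda p: pipeline_score(*p))
```

Here `best_solo` stays at the bland `"be brief"` while the joint search lands on the coordinated `("be formal", "be formal")` pair, which is the coordination effect the paragraph above describes.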
When This Makes Sense
It shines when you have measurable quality criteria and diverse inputs, and when you're building for production, where the difference between 0.65 and 0.82 matters at scale.
It's overkill for one-off prompts or tasks where "good enough" is actually good enough.
Try It
pip install gepa-adk
GitHub | PyPI | Docs | v1.0.0 Announcement
For the full deep dive on the evolution loop, critic agents, and architecture: Stop Writing AI Agent Prompts by Hand
Based on the GEPA paper — built on Google ADK.
What's the worst prompt you've manually tuned into submission? I'm curious if evolution would've found something better.