I started with this instruction for a Google ADK agent:
"Greet the user appropriately."
Four words. Seemed fine. The agent produced decent greetings. I could've shipped it.
Instead, I ran it through an evolutionary optimizer. Three iterations later, the instruction was three paragraphs long — covering formality tiers, period-appropriate language for different honorifics, tonal variation based on social context, and specific vocabulary constraints I never would have thought to include.
The agent's quality score went from 0.35 to 0.81. Same model, same training examples, completely different output quality. The only thing that changed was the instruction text — and I didn't write a single word of the new one.
The Problem
Prompt engineering is guess-and-check. You write something, test it on a couple examples, tweak a word, test again. It works, kind of — like seasoning food without tasting it. You'll get something edible, but you'll never find the version that's genuinely great.
The core issue: the space of possible instructions is infinite, and your intuition can only explore a tiny corner of it. You get stuck in local optima. You test against too few examples. You optimize for what feels wrong instead of what measurably underperforms.
The Fix: Let an LLM Critique and Rewrite the Prompts
gepa-adk automates this loop using evolutionary optimization (based on the GEPA paper):
- Run the agent on training examples
- Score outputs with a critic agent
- Reflect — an LLM analyzes what went wrong
- Mutate — the LLM proposes a better instruction based on that analysis
- Keep or discard based on whether scores improve
The mutation isn't random. The reflection model sees every output, every score, and every piece of critic feedback. It makes targeted changes. Think of it less like genetic mutation and more like a head chef tasting every plate and adjusting the recipe.
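The loop above can be sketched in plain Python. Everything here is a toy stand-in: `run_agent`, `score_output`, and `propose_mutation` are hypothetical placeholders for the real agent call, the critic, and the reflection model (none of them are gepa-adk code), but the keep-or-discard control flow has the same shape.

```python
# Toy sketch of the score -> reflect -> mutate -> keep loop.
# The three helpers below stand in for real LLM calls; here they are
# deterministic stubs so the control flow is runnable on its own.

def run_agent(instruction: str, example: str) -> str:
    # Stand-in for running the agent; a real version calls the LLM.
    return f"[{instruction}] {example}"

def score_output(output: str) -> float:
    # Stand-in for the critic: here, longer/more specific scores higher.
    return min(len(output) / 100.0, 1.0)

def propose_mutation(instruction: str) -> str:
    # Stand-in for the reflection model: a real version reads every
    # output, score, and piece of critic feedback before rewriting.
    return instruction + " Match the speaker's rank and use period-appropriate honorifics."

def evolve(instruction: str, trainset: list[str], iterations: int = 3) -> tuple[str, float]:
    def avg_score(instr: str) -> float:
        return sum(score_output(run_agent(instr, ex)) for ex in trainset) / len(trainset)

    best, best_score = instruction, avg_score(instruction)
    for _ in range(iterations):
        candidate = propose_mutation(best)
        candidate_score = avg_score(candidate)
        if candidate_score > best_score:  # keep only if scores improve
            best, best_score = candidate, candidate_score
    return best, best_score

evolved, score = evolve("Greet the user appropriately.", ["I am the King."])
```

With the stubs above, the mutated instruction outscores the original, so it is kept; a candidate that didn't improve the average score would be discarded and the previous best retried.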
Here's the entire thing:
from google.adk.agents import LlmAgent
from gepa_adk import evolve_sync, SimpleCriticOutput

agent = LlmAgent(
    name="greeter",
    model="gemini-2.5-flash",
    instruction="Greet the user appropriately.",
)

critic = LlmAgent(
    name="critic",
    model="gemini-2.5-flash",
    instruction="Score for formal, Dickens-style greetings. 0.0-1.0.",
    output_schema=SimpleCriticOutput,
)

trainset = [
    {"input": "I am His Majesty, the King."},
    {"input": "I am your mother."},
    {"input": "I am a close friend."},
]

result = evolve_sync(agent, trainset, critic=critic)
print(f"Score: {result.original_score:.2f} -> {result.final_score:.2f}")
That's it. evolve_sync handles the loop. You get back the evolved instruction and the score trajectory.
What Else Can Evolve
Instructions are the default target, but gepa-adk can also optimize output schemas (Pydantic models), generation config (temperature, top-p), and even multi-agent systems — evolving how multiple agents coordinate together.
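When the target is an output schema, the thing being evolved is a Pydantic model rather than a string: the optimizer can rewrite field descriptions, tighten constraints, or restructure fields. The model below is a hypothetical example of my own (assuming Pydantic v2), not anything from gepa-adk:

```python
from pydantic import BaseModel, Field

# Hypothetical agent output schema; the class and field names are
# illustrative only, not part of gepa-adk's API.
class Greeting(BaseModel):
    salutation: str = Field(description="Honorific-appropriate opening")
    body: str = Field(description="The greeting itself")
    formality: float = Field(ge=0.0, le=1.0, description="Estimated formality level")

greeting = Greeting(
    salutation="Your Majesty,",
    body="I am honoured beyond measure by your presence.",
    formality=0.9,
)
```

Evolving a schema like this might mean sharpening the `description` strings (which steer the model's structured output) just as instruction evolution sharpens the prompt.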
The multi-agent case is where it gets wild. In a pipeline, one agent's instruction shapes another agent's input. Evolving them together finds coordination patterns you'd never discover by tuning each agent in isolation.
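A toy way to see why joint evolution beats per-agent tuning: score a two-agent pipeline whose quality depends on how the instructions interact. The scorer and candidate pool below are stubs of my own invention, not gepa-adk code; the point is only that coordinate-wise search can get stuck where a joint search over pairs does not.

```python
from itertools import product

def pipeline_score(planner_instr: str, writer_instr: str) -> float:
    # Stub for a critic scoring a two-agent pipeline: the instructions
    # only score well when they agree on register.
    formal_plan = "formal" in planner_instr
    formal_prose = "formal" in writer_instr
    if formal_plan and formal_prose:
        return 0.9   # coordinated pair
    if formal_plan or formal_prose:
        return 0.4   # mismatch: one agent undoes the other's work
    return 0.5       # consistent but bland

candidates = ["be brief", "be formal"]

# Tuning one agent in isolation (the other fixed at "be brief") keeps
# "be brief": changing either instruction alone lowers the score.
best_solo = max(candidates, key=lambda c: pipeline_score(c, "be brief"))

# Joint search over pairs escapes that local optimum.
best_pair = max(product(candidates, repeat=2), key=lambda p: pipeline_score(*p))
```

Here `best_solo` stays at the bland `"be brief"` while the joint search lands on the coordinated `("be formal", "be formal")` pair, which is the coordination effect the paragraph above describes.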
When This Makes Sense
It shines when you have measurable quality criteria and diverse inputs, and when you're building for production, where the difference between 0.65 and 0.82 matters at scale.
It's overkill for one-off prompts or tasks where "good enough" is actually good enough.
Try It
pip install gepa-adk
GitHub | PyPI | Docs | v1.0.0 Announcement
For the full deep dive on the evolution loop, critic agents, and architecture: Stop Writing AI Agent Prompts by Hand
Based on the GEPA paper — built on Google ADK.
What's the worst prompt you've manually tuned into submission? I'm curious if evolution would've found something better.