⚡ Rethinking Prompt Engineering: How Agent Lightning’s APO Teaches Agents to Write Better Prompts

For years, we’ve obsessed over improving model weights and architectures. But what if the real breakthrough in AI performance comes not from training the model, but from training the prompt?

That’s the premise behind Agent Lightning, a new framework from Microsoft that allows AI agents to improve themselves.
It supports two key optimization algorithms:

  • VERL — for reinforcement learning at the policy level
  • APO (Automatic Prompt Optimization) — for learning textual gradients that refine prompts based on performance feedback

In this article, I’ll show how APO works, why it’s a game-changer, and how I used it to enhance a Text-to-SQL agent built with LangGraph — improving accuracy from 84% to 88% in just two rounds of optimization.


🌩️ The Idea: Prompts That Learn

Prompt engineering has always been a manual, intuition-driven process. You tweak a few words, rerun your agent, and hope it performs better.
APO replaces that guesswork with data-grounded self-improvement.

It doesn’t retrain the underlying model — instead, it trains the text of the prompt itself.
Think of it as “gradient descent in natural language.”

At the heart of APO are two cooperating LLMs:

  • A Critic that examines what went wrong in failed tasks
  • An Editor that rewrites the prompt to address those weaknesses

Each iteration produces multiple improved prompts, scores them on validation data, and preserves the best through beam search — a form of controlled exploration.

The result? A system that writes its own better prompt with every round.


🧮 The Science of Textual Gradients

APO builds on ideas from two research papers — ProTeGi (EMNLP 2023) and TextGrad (Nature 2024) — which formalize how text itself can encode gradient-like feedback.

Here’s what happens inside one APO cycle:

  1. Run the current prompt on a small batch of tasks
  2. Score the results with an objective metric (for example, SQL correctness)
  3. A Critic model reviews the (input, output, reward) triples and summarizes failures in natural language
  4. An Editor model applies that feedback to produce refined prompt candidates
  5. Beam search evaluates the rewritten prompts and keeps the top performers

Example critique:

“The prompt doesn’t specify how to handle type mismatches in JOIN columns.
When Singer_ID is INTEGER in one table but TEXT in another, use CAST(col_text AS INTEGER) and filter invalid values.”

This text acts as a direction of improvement — like a gradient — but expressed entirely in language.
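
To make the loop concrete, here is a minimal, self-contained sketch of one such round. This is not Agent Lightning’s internal code: the model name, the Critic/Editor instructions, and the run_agent and score callables are placeholder assumptions, but the shape of the loop follows the five steps above.

from openai import OpenAI

client = OpenAI()

def ask(system: str, user: str) -> str:
    # One LLM call; the model name is just an example
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def apo_round(prompts, train_batch, val_set, run_agent, score,
              branch_factor=2, beam_width=2):
    # One APO round: critique failures, rewrite the prompt, keep the best candidates
    candidates = list(prompts)
    for prompt in prompts:
        # Steps 1-3: run the prompt, collect failures, ask the Critic for a "textual gradient"
        traces = [(task, run_agent(prompt, task)) for task in train_batch]
        failures = "\n\n".join(
            f"Q: {task['question']}\nOutput: {output}"
            for task, output in traces if score(task, output) == 0
        )
        critique = ask(
            "You are a critic. Explain what in this prompt caused the failures below.",
            f"PROMPT:\n{prompt}\n\nFAILURES:\n{failures}",
        )
        # Step 4: the Editor applies the critique to produce refined candidates
        for _ in range(branch_factor):
            candidates.append(ask(
                "You are an editor. Rewrite the prompt to address the critique. Return only the new prompt.",
                f"PROMPT:\n{prompt}\n\nCRITIQUE:\n{critique}",
            ))
    # Step 5: beam search keeps the top-scoring prompts on validation data
    def val_accuracy(p):
        return sum(score(task, run_agent(p, task)) for task in val_set) / len(val_set)
    return sorted(candidates, key=val_accuracy, reverse=True)[:beam_width]

Repeating this round on the surviving prompts is the whole optimization: the beam carries the strongest candidates forward, and the validation score decides which ones live on.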


🧩 A Practical Experiment: Teaching a SQL Agent to Self-Optimize

To test APO, I applied it to a Text-to-SQL agent that converts natural language questions into SQL queries.
I used the Spider dataset — a well-known text-to-SQL benchmark — and ran 50 examples for training, 50 for validation.


🏗️ The Setup

The agent was built in LangGraph, following a self-correcting workflow.
Agent Lightning handled the optimization loop; I only needed to define the @rollout function that executes the task and returns a reward.
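
The SQLAgent class itself lives in the linked repo, so I won’t reproduce it here, but a self-correcting Text-to-SQL workflow in LangGraph looks roughly like this sketch. The node names, the model choice, and the three-attempt retry limit are my own illustrative assumptions, not the repo’s exact code.

import sqlite3
from typing import TypedDict

from langgraph.graph import StateGraph, END
from openai import OpenAI

client = OpenAI()

class SQLState(TypedDict):
    question: str
    query: str
    error: str
    attempts: int

def build_sql_graph(db_path: str, write_prompt: str):
    def write_query(state: SQLState) -> dict:
        # Draft (or redraft) a query, feeding the last execution error back for self-correction
        user = state["question"]
        if state["error"]:
            user += f"\n\nYour previous query failed with: {state['error']}\nFix it."
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "system", "content": write_prompt},
                      {"role": "user", "content": user}],
        )
        return {"query": resp.choices[0].message.content.strip(),
                "attempts": state["attempts"] + 1}

    def execute_query(state: SQLState) -> dict:
        # Try the query against the real database and keep the error message if it fails
        try:
            with sqlite3.connect(db_path) as conn:
                conn.execute(state["query"]).fetchall()
            return {"error": ""}
        except sqlite3.Error as exc:
            return {"error": str(exc)}

    def should_retry(state: SQLState) -> str:
        return "retry" if state["error"] and state["attempts"] < 3 else "done"

    graph = StateGraph(SQLState)
    graph.add_node("write", write_query)
    graph.add_node("execute", execute_query)
    graph.set_entry_point("write")
    graph.add_edge("write", "execute")
    graph.add_conditional_edges("execute", should_retry, {"retry": "write", "done": END})
    return graph.compile()

Invoking the compiled graph with an initial state like {"question": question, "query": "", "error": "", "attempts": 0} runs the write-execute loop until the query executes cleanly or the retry budget runs out.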

Here’s a minimal trainer setup:

from agentlightning import Trainer
from agentlightning.algorithm.apo import APO
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()

algo = APO(
    openai_client,
    val_batch_size=10,      # validation tasks used to score each candidate prompt
    gradient_batch_size=4,  # rollouts the Critic reviews when writing a textual gradient
    beam_width=2,           # candidate prompts kept alive after each round
    branch_factor=2,        # rewrites generated per surviving prompt
    beam_rounds=2,          # optimization rounds
)

trainer = Trainer(
    algorithm=algo,
    n_runners=8,  # parallel workers executing rollouts
    initial_resources={"prompt_template": prompt_template_baseline()}
)

train_data = load_spider_dataset("data/dev.json")[:50]
val_data = load_spider_dataset("data/dev.json")[50:100]

trainer.fit(
    agent=sql_agent_rollout,
    train_dataset=train_data,
    val_dataset=val_data
)
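Two helpers in that snippet come from the repo rather than the library: load_spider_dataset() and prompt_template_baseline(). Under the assumption that Spider’s dev.json is a JSON list of records with question, query, and db_id fields, they might look roughly like this (the baseline wording below is illustrative, apart from the hint quoted later in this article):

import json

def load_spider_dataset(path):
    # Spider's dev.json is a JSON list of records with "question", "query", and "db_id"
    with open(path) as f:
        return json.load(f)

def prompt_template_baseline():
    # In the repo this presumably returns an agentlightning PromptTemplate resource;
    # shown here as a plain format string for brevity
    return (
        "You are a {dialect} expert. Using only the tables described below, "
        "write a query that answers the user's question.\n\n"
        "{table_info}\n\n"
        "Be careful not to query for columns that do not exist."
    )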

🧠 The Rollout Function

This is where APO gets its feedback signal — the reward for how well a generated SQL query matches the ground truth.

from agentlightning import rollout
from agentlightning.types import PromptTemplate

@rollout
def sql_agent_rollout(task, prompt_template: PromptTemplate) -> float:
    # Build the agent with whichever prompt APO is currently proposing
    agent = SQLAgent(
        db_path=f"databases/{task['db_id']}/{task['db_id']}.sqlite",
        write_prompt=prompt_template.format(
            dialect="SQLite",
            table_info=get_schema(task['db_id'])
        )
    )

    # Run the task and score the generated SQL against the gold query
    result = agent.run(task["question"])
    return evaluate_query(result["query"], task["query"], task["db_id"])

Each rollout returns a numeric reward (1 for correct, 0 for incorrect), giving APO objective feedback for learning.
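
That evaluate_query() call is where the signal comes from. The repo’s exact scorer isn’t shown here, but a common and plausible implementation is execution accuracy: run both the predicted and the gold query against the task’s SQLite database and compare their result sets.

import sqlite3

def evaluate_query(predicted_sql, gold_sql, db_id, db_root="databases"):
    # Execution accuracy: reward 1.0 only if both queries return matching results
    db_path = f"{db_root}/{db_id}/{db_id}.sqlite"
    with sqlite3.connect(db_path) as conn:
        try:
            predicted_rows = conn.execute(predicted_sql).fetchall()
        except sqlite3.Error:
            return 0.0  # invalid SQL earns no reward
        gold_rows = conn.execute(gold_sql).fetchall()
    # Compare order-insensitively unless the gold query enforces an ordering
    if "order by" in gold_sql.lower():
        return float(predicted_rows == gold_rows)
    return float(sorted(map(repr, predicted_rows)) == sorted(map(repr, gold_rows)))

Any scorer works here, as long as it is objective and returns a number APO can optimize against.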


⚙️ From Draft to Expert Prompt

Baseline (v0)

The initial prompt was something you’d write on your first try — short and vague:

“Be careful not to query for columns that do not exist.”

Accuracy: 84% (42/50)


After Optimization (v5)

After two rounds of APO, the prompt evolved into a structured specification over 350 words long, defining explicit rules for schema validation, safe joins, deterministic ordering, and fallback responses.

Accuracy: 88% (44/50)

| Round | Version | Accuracy | Notes |
|-------|---------|----------|-------|
| 0 | v0 | 84% | Baseline |
| 1 | v3 | 86% | Added type-casting logic |
| 2 | v5 | 88% | Added rule hierarchy and validation checks |

Example Improvements

Before:

“Use the tables listed below.”

After:

“Use only the tables and columns in {table_info}.
If a required column is missing, respond with an empty result or 'UNABLE TO ANSWER' rather than guessing.”

The optimized prompt became longer, yes — but also far more robust, preventing many subtle SQL errors.


🔍 Why APO Feels Different

1. It Learns from Real Mistakes

Critiques come directly from actual task failures, not from hand-written advice.

2. It Explores Multiple Futures

Beam search means the optimizer doesn’t get trapped in one idea of “better.” It keeps multiple hypotheses alive.

3. It’s Transparent

Every edit is interpretable. You can read the critic’s feedback and understand why the prompt changed.

4. It’s Objective

Rewards are computed from measurable outcomes — in this case, SQL correctness — not subjective LLM scoring.


🧭 What We Learned

After two APO rounds, the system showed clear, measurable gains:

  • 📈 Accuracy: 84% → 88%
  • 📜 Prompt length: 90 → 360 words
  • ⚖️ Rules: 3 vague hints → 19 explicit constraints
  • Validation: added schema checks and safe SQL handling

In essence, APO taught the agent how to write its own better instructions.


🧰 Try It Yourself

You can reproduce this entire setup:

GitHub Repo: bigdata5911/agent-lightning-automatic-prompt-optimization
Agent Lightning Docs: microsoft.github.io/agent-lightning
Spider Dataset: yale-lily.github.io/spider

Requirements

  • Python 3.8+
  • uv package manager
  • OpenAI API key (GPT-5 access)
  • Sufficient disk space for the Spider dataset

Quick Start

uv sync
./setup_data.sh
export OPENAI_API_KEY="your-api-key"
uv run python train.py

🌟 Final Thoughts

Agent Lightning’s Automatic Prompt Optimization (APO) is more than an automation trick — it’s a paradigm shift.

Instead of endlessly hand-crafting prompts, you can let your agent learn from its own mistakes, guided by measurable outcomes and transparent reasoning.

In my experiments, APO transformed a generic baseline into a specialized, rule-driven prompt that performed better, explained itself better, and could continue improving indefinitely.

Prompt engineering just got an upgrade — now, the prompts engineer themselves.


Follow me for more explorations into autonomous agents, self-optimizing prompts, and data-driven LLM workflows.
