Note from the author: You're reading a Dev.to adaptation. The original on NexusTrade includes interactive trace viewers, animated diagrams, equity curve visualizations, and embedded course exercises. Read it there for the full experience.
You built the agent. You gave it tools. You hooked up the memory. You ran it overnight on five years of historical data and it produced a 140% annual return.
You deploy it with $25,000 on Monday. By Friday you've lost 30%.
This is not a bug in your code. This is overfitting. Your agent didn't find a durable market edge. It memorized the historical data and learned to exploit noise that doesn't exist in live markets. The AI succeeded at the goal you gave it. The goal was wrong.
Evaluation is the part of agent development that nobody talks about, because most people building AI demos have never run a system long enough to watch it fail. This article covers the engineering that keeps that failure from happening: traces, LLM judges, and a feedback loop that makes agents actually improve over time.
The Problem
The optimization trap: when the agent succeeds at the wrong thing.
Every machine learning practitioner knows about overfitting. You train a model on historical data. It learns the data perfectly, including all the noise and anomalies specific to that dataset. When you expose it to new data, it falls apart because what it learned wasn't a real pattern.
AI agents have the same problem, and it's harder to catch. When you tell an agent to "build a trading strategy with the highest possible backtest return," it will do exactly that. It will explore every combination of indicators, time windows, and position sizes until it finds something that maximizes the metric you asked for.
The result looks impressive. 126% return in 2024. You deploy it with $25,000 on Monday. By Friday you've lost 30%. The 2022 bear market destroyed it completely: either the agent never tested 2022, or the evaluation criteria didn't penalize drawdown, so the agent ignored it.
The optimization trap is an evaluation failure, not a model failure. The model did what you asked. You asked for the wrong thing.
The fix is not a better model. It's a better evaluator: one that grades the agent on what actually matters, penalizes single-year outlier returns, requires multi-regime evidence, and gets stricter every round.
| Wrong objective | Right objective |
|---|---|
| Highest backtest return | Consistent returns across regimes |
| Single-year Sharpe ratio | Positive 2022 bear market performance |
| Win rate on in-sample data | Max drawdown below 30% |
| Score goes up each round | Multi-year evidence, not one outlier |
I know this because I ran the experiment. Five rounds of hill climbing on a live trading agent. $676 spent. The first round scored 71 and produced an Iron Condor with a 54% average annual return. By Round 5, the score had dropped to 27 and the agent was recommending long directional options with a -6.3% average and a 92% drawdown in 2022. The evaluator caused every step of the decline.
Observability
The flight recorder: what a trace actually is.
When a traditional app crashes, you read the stack trace. When an agent fails on iteration 7 of a 12-step ReAct loop, you're guessing, unless you have a trace.
A trace is a structured log of every step in the agent's execution. Every input the model saw, every decision it made, every tool it called, every result it got back, every token it spent, every millisecond it waited. If something goes wrong at 3 AM, you can reconstruct exactly what happened.
Interactive trace viewer: view on NexusTrade.
Without traces, a failed agent run is a black box. You see the final answer (or the error), and you guess at what went wrong. With traces, you can pinpoint the exact iteration where the model made a bad assumption, called the wrong tool, or misread a result.
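What does a trace look like in practice? Here's a minimal sketch of a trace recorder, one structured entry per ReAct iteration. The class and field names are illustrative assumptions, not NexusTrade's actual schema:

```javascript
// Minimal trace recorder: one structured entry per agent iteration.
// Class and field names are illustrative, not a real NexusTrade schema.
class TraceRecorder {
  constructor(runId) {
    this.runId = runId;
    this.steps = [];
  }

  record({ iteration, thought, tool, toolInput, toolResult, tokens, latencyMs }) {
    this.steps.push({
      runId: this.runId,
      iteration,
      thought,
      tool,
      toolInput,
      toolResult,
      tokens,
      latencyMs,
      at: new Date().toISOString(),
    });
  }

  // The aggregates an evaluator (or a human at 3 AM) reads first.
  summary() {
    return {
      runId: this.runId,
      iterations: this.steps.length,
      totalTokens: this.steps.reduce((sum, s) => sum + s.tokens, 0),
      totalLatencyMs: this.steps.reduce((sum, s) => sum + s.latencyMs, 0),
    };
  }
}

const trace = new TraceRecorder("run-001");
trace.record({ iteration: 1, thought: "Screen for candidates", tool: "screen_stocks",
  toolInput: { rsiBelow: 40 }, toolResult: "12 matches", tokens: 850, latencyMs: 1200 });
trace.record({ iteration: 2, thought: "Backtest top match", tool: "run_backtest",
  toolInput: { symbol: "SPY" }, toolResult: "ok", tokens: 1400, latencyMs: 3100 });

console.log(trace.summary());
```

The full `steps` array is what the LLM judge reads later; the `summary()` aggregates are the algorithmic metrics you get for free.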
Evaluation
How to grade an agent: algorithmic metrics and LLM judges.
Not all evaluation is the same. Some things you can measure with code. Some things you need a second AI to grade.
| Type | What it measures | Examples |
|---|---|---|
| Algorithmic | Objective, countable things. No model needed. | Total cost, iteration count, latency, Sharpe ratio, max drawdown, whether a strategy was deployed |
| LLM Judge | Subjective quality that requires reasoning. | Did it explain the strategy logic clearly? Did it test enough different structures? Is the recommendation realistic? |
For trading agents specifically, overfitting is the failure mode that pure algorithmic metrics miss. A high backtest return is an objective number. But whether that return is trustworthy, whether it comes from a durable edge or from memorized noise, requires judgment. That's where the LLM judge comes in.
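The algorithmic half needs no model at all. Max drawdown, one of the objective metrics from the table above, is a single pass over the equity curve. A minimal sketch:

```javascript
// Algorithmic metric, no model call needed: max drawdown from an equity curve.
function maxDrawdown(equityCurve) {
  let peak = -Infinity;
  let worst = 0;
  for (const value of equityCurve) {
    peak = Math.max(peak, value);
    worst = Math.min(worst, (value - peak) / peak); // negative fraction of peak
  }
  return worst; // e.g. -0.3 means a 30% peak-to-trough drawdown
}

// An equity curve that rises to 120, collapses to 60, partially recovers.
const curve = [100, 110, 120, 90, 60, 75, 90];
console.log(maxDrawdown(curve)); // -0.5 → a 50% drawdown
```

Anything you can count this way (cost, iterations, latency, drawdown) should stay in code: it's faster, cheaper, and deterministic. Save the judge for the questions code can't answer.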
Here's the central question embedded in the NexusTrade Agent Run Evaluator's system prompt (Gemini 3 Pro, temp 0):
"The central question you must answer is: If the user deployed the recommended strategy on Monday with their $25,000 live account, how confident are we that it will achieve 100% annual return? Everything else is secondary."
The hard caps are the anti-overfitting mechanism:
| Evidence | Max deployedStrategyFitness |
|---|---|
| Below 30%/yr average return | 3 |
| 30-59%/yr | 6 |
| 60-89%/yr | 8 |
| 90%+/yr, survived the bear market, drawdown below 50% | Full range available |
| Nothing deployed | 2, regardless of exploration quality |
An agent that found 126% returns in 2024 but only tested one year cannot score above 4, because single-year outlier performance is exactly what overfitting looks like. The evaluator enforces multi-year evidence as a precondition for a high score.
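The real evaluator enforces these caps inside the judge's prompt, but the same logic can be sketched as a plain clamp applied to the judge's raw score. The function and field names here are assumptions for illustration:

```javascript
// Sketch of the anti-overfitting hard caps as a plain clamp.
// Field names are illustrative; the real evaluator applies these
// rules inside the judge's system prompt.
function capFitness(rawScore, { deployed, yearsTested, avgAnnualReturnPct,
                                survivedBear, maxDrawdownPct }) {
  if (!deployed) return Math.min(rawScore, 2);          // nothing to grade
  if (yearsTested < 2) return Math.min(rawScore, 4);    // single-year outlier
  if (avgAnnualReturnPct < 30) return Math.min(rawScore, 3);
  if (avgAnnualReturnPct < 60) return Math.min(rawScore, 6);
  if (avgAnnualReturnPct < 90) return Math.min(rawScore, 8);
  // 90%+ only earns the full range with bear-market survival and <50% drawdown.
  if (survivedBear && maxDrawdownPct < 50) return rawScore;
  return Math.min(rawScore, 8);
}

// The 126%-in-2024 agent: impressive raw score, one year of evidence.
console.log(capFitness(9, {
  deployed: true, yearsTested: 1, avgAnnualReturnPct: 126,
  survivedBear: false, maxDrawdownPct: 20,
})); // 4
```

The point of the clamp is that no amount of judge enthusiasm can override missing evidence: the cap binds before the qualitative score does.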
Here's the structured JSON the evaluator returned for Round 1 of the hill climbing experiment, run against a real Aurora agent run:
```json
{
  "summary": "Deployed a robust Iron Condor across 4 regimes. Consistent positive years including 2022 bear. Return profile (54% avg) won't reach 100% goal.",
  "deployedStrategy": "Always-On Iron Condors (SPY/QQQ)",
  "deployedStrategyAvgReturn": "+54.34% avg (2022: +31.2%, 2024: +72.1%)",
  "deploymentVerdict": "iterate_first",
  "scores": {
    "deployedStrategyFitness": 7,
    "evidenceStrength": 7,
    "explorationCoverage": 6,
    "riskRealism": 5
  },
  "overallScore": 71,
  "verdict": "good",
  "failures": [
    "59% max drawdown exceeds safe threshold",
    "54% avg won't reach 100% annual goal"
  ],
  "nextIteration": "Push for higher return while maintaining the 2022 floor."
}
```
The nextIteration field is what makes the loop work. It becomes the seed for the next agent run. The evaluator writes the coach's notes.
You call it from anywhere via MCP:
```javascript
run_agent_run_evaluator({
  agent_id: "69d49c51d06eee7b51cf5f68",
  model: "google/gemini-3-pro-preview"
})
// Returns: scores, verdict, nextIteration, deploymentVerdict
// Inject nextIteration into the next run's context.
```
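The run-grade-seed loop itself is only a few lines once you have an agent runner and an evaluator to call. In this sketch, `runAgent` and `runEvaluator` are stand-ins for your own infrastructure, not real NexusTrade APIs:

```javascript
// Hill-climbing loop sketch. runAgent and runEvaluator are assumed
// stand-ins for your agent runner and the MCP evaluator call.
async function hillClimb({ runAgent, runEvaluator, rounds }) {
  const history = [];
  let seed = null; // the nextIteration note from the previous round

  for (let round = 1; round <= rounds; round++) {
    const run = await runAgent({ seedNote: seed });
    const verdict = await runEvaluator(run.agentId);
    history.push({ round, score: verdict.overallScore, verdict });
    seed = verdict.nextIteration; // the coach's notes for the next round
  }

  // Return the best round, not the last one: later rounds can regress.
  return history.reduce((best, r) => (r.score > best.score ? r : best));
}
```

Note the final `reduce`: keep the best-scoring round rather than the last one. As the experiment below shows, a hill climb can spend five rounds walking away from its own best result.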
Going Deeper
Beyond hill climbing: four approaches to agent optimization.
Once you have traces and an evaluator, the natural next step is to close the loop: run the agent, score the output, use that score to improve the next run. The simplest version of this is hill climbing: run, grade, seed the next run with the feedback.
But hill climbing is a local search. It follows the gradient of whatever metric you give it. Point it at the wrong objective and it will confidently optimize you into a cliff. Here's the full landscape:
01. Hill Climbing: run → grade → seed the next run with the feedback. Simple, cheap, and effective for small prompt spaces. Gets stuck in local maxima when the feedback signal points in the wrong direction. Use it as a baseline before investing in more sophisticated approaches, but watch your rubric carefully.
02. NSGA-II Multi-Objective Optimization: when you have competing objectives (high return AND low drawdown), hill climbing optimizes one at the expense of the other. NSGA-II produces a Pareto front instead: every efficient tradeoff simultaneously, so you pick from real data rather than assumptions. The GA4GC paper achieved 37.7% runtime reduction while improving code quality, with 135x hypervolume improvement over defaults.
03. Meta-Harness (Stanford): most prompt optimization focuses on what you say to the model. Meta-Harness optimizes what information the model sees: which context to store, what to retrieve, how to structure inputs. An agentic proposer reads up to 10 million tokens of diagnostic context per iteration and rewrites the harness itself. The Stanford paper shows a 7.7 point improvement over state-of-the-art using 4x fewer tokens.
04. AutoResearch-RL: an RL agent that proposes code modifications to the training script, executes them under fixed time budgets, observes validation metrics, and updates its policy via PPO. No human in the loop. A related study analyzing 10,469 experiments found architectural choices explain 94% of performance variance. The autoresearch framework demonstrates a 2.4x boost in experiment throughput by aborting poor-performing runs early.
For most teams: start with hill climbing. Run it until it stalls. If you hit a genuine multi-objective tradeoff, consider NSGA-II โ but only with automated infrastructure and a clear budget. The rubric is still load-bearing in all four cases.
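The core idea behind NSGA-II, Pareto dominance, can be sketched without any of the genetic machinery (no crowding distance, no crossover or mutation). The strategy numbers below are illustrative, loosely echoing this article's experiment:

```javascript
// Core of multi-objective selection: keep only non-dominated candidates.
// This is the Pareto-front idea behind NSGA-II, minus the genetic operators.
function dominates(a, b) {
  // Higher return is better; lower drawdown is better.
  const noWorseReturn = a.avgReturn >= b.avgReturn;
  const noWorseDrawdown = a.maxDrawdown <= b.maxDrawdown;
  const strictlyBetter = a.avgReturn > b.avgReturn || a.maxDrawdown < b.maxDrawdown;
  return noWorseReturn && noWorseDrawdown && strictlyBetter;
}

function paretoFront(candidates) {
  return candidates.filter((c) => !candidates.some((other) => dominates(other, c)));
}

// Illustrative candidates (percentages), not real backtest output.
const strategies = [
  { name: "Iron Condor", avgReturn: 54, maxDrawdown: 59 },
  { name: "Long calls", avgReturn: -6, maxDrawdown: 92 }, // dominated
  { name: "Covered calls", avgReturn: 20, maxDrawdown: 15 },
  { name: "Leveraged trend", avgReturn: 80, maxDrawdown: 70 },
];

console.log(paretoFront(strategies).map((s) => s.name));
// "Long calls" drops out: another strategy beats it on both objectives.
// The other three are incomparable tradeoffs, so all three survive.
```

The payoff is that the optimizer never has to collapse return and drawdown into one number; you choose a point on the front after seeing the real tradeoffs.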
Real-World Evidence
I ran this loop five times. The first round was still the best.
War story, $676 spent: five complete agent runs, with full evaluator traces for each round. Round 1 scored 71: a robust Iron Condor on SPY/QQQ averaging 54% across all regimes including 2022. By Round 5, the score was 27 and the agent was recommending long directional options with a -6.3% average and a 92% drawdown in 2022. The evaluator caused every step of the decline. The nextIteration note from Round 1 said "push for higher return while maintaining the 2022 floor." The agent pushed for higher return. It forgot the floor. Each round the evaluator rewarded the attempt at higher returns, and the agent drifted further from the Iron Condor structure that actually worked. By Round 5, it had completely abandoned a 54%-average strategy in search of 100%, and found a disaster instead. The rubric didn't penalize abandoning what worked. So the agent abandoned it.
Read the full hill climbing experiment
Production Architecture
The feedback loop: how evaluation connects to memory.
Evaluation doesn't mean much if the agent can't learn from it. The feedback loop is what turns a one-time grade into compounding improvement.
The full pipeline in five steps:
Step 1 โ Agent Run: The agent receives its task, reads injected memory from past runs, and works through the ReAct loop. Every decision, every tool call, every result is captured.
Step 2 โ Trace Captured: Every iteration is logged: the model's thought, the tool it called, the result, token cost, and latency. The trace is stored and ready to be read by the judge.
Step 3 โ LLM Judge Scores: The evaluator reads the full trace and returns a structured verdict: scores on 4 dimensions, an overall score, a deployment verdict, and a nextIteration note.
Step 4 โ Written to Memory: The score, verdict, and lessons are written into an AgentSummary document in MongoDB. The next run retrieves this via the memory system from Module 4.
Step 5 โ Next Run Improves: Before the next run starts, the memory system retrieves matching AgentSummary records and injects them into the planner. Here's what that injection looks like at the start of Round 2:
```markdown
## Previous Run Context
Score: 71/100 · Verdict: good
Lessons learned:
- Iron Condors on SPY/QQQ survived the 2022 bear market
- Multi-year average 54.3%, consistent across all regimes
- Max drawdown 59% exceeded the 50% threshold
Next iteration focus:
Enhance return profile while holding the 2022 floor.
Do not abandon the Iron Condor structure.

---

## Your Task
Build an options strategy for a $25,000 account...
```
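Generating that injection from a stored summary is mechanical string assembly. A minimal sketch, where the input field names are illustrative rather than the real AgentSummary schema:

```javascript
// Turn a stored run summary into the markdown block injected into the
// next run's planner prompt. Field names are illustrative, not the
// real AgentSummary schema.
function buildInjection(summary) {
  const lessons = summary.lessons.map((l) => `- ${l}`).join("\n");
  return [
    "## Previous Run Context",
    `Score: ${summary.overallScore}/100 · Verdict: ${summary.verdict}`,
    "Lessons learned:",
    lessons,
    "Next iteration focus:",
    summary.nextIteration,
  ].join("\n");
}

const injection = buildInjection({
  overallScore: 71,
  verdict: "good",
  lessons: ["Iron Condors on SPY/QQQ survived the 2022 bear market"],
  nextIteration: "Enhance return profile while holding the 2022 floor.",
});
console.log(injection);
```

The injection is prepended to the task prompt, so the planner reads last round's grade before it reads this round's goal.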
Run → Trace → Judge → Memory → Improve → repeat. Each loop produces a labeled training example: what the agent did, how it scored, and what to do differently next time.
A demo runs once. A system runs, gets graded, and improves โ or degrades, depending on what you told it to optimize for.
Connect
Expose your agent to the world: the NexusTrade MCP server.
Everything in this article (traces, LLM judges, feedback loops) is built on Aurora's infrastructure and accessible via MCP from Claude Desktop or Cursor.
```json
{
  "mcpServers": {
    "nexustrade": {
      "url": "https://nexustrade.io/api/mcp",
      "headers": { "Authorization": "Bearer <your-api-key>" }
    }
  }
}
```
From Claude Desktop or Cursor, you can ask: "Use NexusTrade to screen for stocks with RSI below 40 trading above their 200-day moving average." Claude calls screen_stocks on the NexusTrade MCP server. The server returns live results from the same screener Aurora uses internally. You can also run the evaluator directly โ run_agent_run_evaluator(agent_id) grades any completed agent run from any client.
Check Your Understanding
Pop quiz.
Q: What is a trace in the context of AI agents?
A: A trace is a structured log of every step the agent took: inputs, outputs, tool calls, results, costs, and timing. Think of it like a flight recorder: you can reconstruct exactly what happened, step by step, to debug issues or evaluate performance. Without it, a failed run is a black box.
Q: An agent produces a 126% backtest return in 2024, but you only have one year of data. The evaluator gives it a high score. What's wrong?
A: Single-year outlier returns are the textbook signature of overfitting. The agent may have memorized 2024-specific anomalies that won't repeat in live trading. A trustworthy evaluator caps the score based on multi-year average return and requires evidence across at least two distinct market regimes, including a bear market. One year of 126% is not evidence of a durable edge.
Q: When would you use an LLM judge instead of an algorithmic evaluator?
A: When you need to evaluate subjective criteria that are hard to measure with code: did the agent explain its reasoning clearly? Did it test fundamentally different strategy types? Objective metrics like cost, iteration count, and Sharpe ratio should use algorithmic evaluation. Code is faster, cheaper, and deterministic for anything you can count.
Q: True or false: the evaluation feedback loop requires both evaluation AND memory to work. Having only one of the two is not enough.
A: True. Without evaluation, the agent has no signal for what "better" means. Without memory, the agent can't retain the lessons it learned โ every run starts from scratch. Evaluation produces the signal. Memory carries it forward.
The End
You've built the complete agent. This is where most people stop. It's where you start.
Five articles. Five modules. You started with a leaked source file and ended with a production evaluation loop โ router, ReAct engine, long-term memory, rubric design, feedback loop. The only thing left is to run it.
The $676 hill climbing experiment is a better argument for evaluation than anything else in this article. The agent wasn't broken. The rubric was. Build the right rubric and the loop compounds toward something real. Build the wrong one and a perfectly obedient agent will follow it off a cliff, five rounds in a row, at $135 per run.
Before the capstone, here's the full series in five minutes:
Module 6 is where the capstone lives. Aurora runs inline: screens the market, validates with news, builds a watchlist, then wires up a scheduled agent to manage it every week.
Run the Capstone: free, no credit card
Part 5 of 5 in the AI Agents from Scratch series.
Try NexusTrade's AI trading agent free: https://nexustrade.io