Most model evaluations happen in controlled environments: single prompts, tidy inputs, and clear expected outputs.
Production agents don’t operate like that.
They route between capabilities.
They call tools.
They maintain state.
They operate under constraints.
They break in subtle, non-obvious ways.
In this post, I’ll walk through the evaluation framework I built to test LLMs inside a structured, stateful agent workflow.
To make the discussion concrete, the walkthrough uses Mistral models as the implementation example, demonstrating the kind of behavior and results you can expect when running this evaluation in practice.
The goal is simple:
Evaluate how models behave inside a real execution system, not how they perform on isolated prompts.
What I’m Testing
The framework evaluates five core capabilities required for practical agent systems:
1. Routing
Can the model correctly identify intent and select the appropriate execution path?
Agent systems depend heavily on correct internal routing. Misclassification at this layer cascades into downstream failures.
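As a minimal sketch, a routing check can compare the model's chosen route against a labeled expectation set. The route names and the `classify_intent` stub below are hypothetical stand-ins for whatever router (typically an LLM call) your system actually uses:

```python
# Labeled ground truth: user message -> expected route.
# These routes are illustrative, not the post's actual taxonomy.
EXPECTED_ROUTES = {
    "cancel my subscription": "billing",
    "the app crashes on login": "support",
    "hi": "greeting",
}

def classify_intent(message: str) -> str:
    # Stand-in for the LLM routing call; here a toy keyword router
    # so the sketch is runnable.
    text = message.lower()
    if "subscription" in text or "refund" in text:
        return "billing"
    if "crash" in text or "error" in text:
        return "support"
    return "greeting"

def routing_accuracy(cases: dict) -> float:
    # Fraction of messages routed to their expected path.
    hits = sum(classify_intent(msg) == route for msg, route in cases.items())
    return hits / len(cases)
```

In a real harness, `classify_intent` would wrap the model call and the expectation set would come from your scenario definitions.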
2. Tool Use
Does the model call tools correctly with valid structured arguments?
This includes:
- Schema adherence
- Proper parameter formatting
- Calling the correct tool
- Avoiding hallucinated fields
Tool misuse is one of the most common failure modes in agent systems.
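A schema-adherence check for a single tool call can be sketched roughly as follows. The `get_weather` tool and the schema format are illustrative assumptions; a production framework would more likely validate against JSON Schema:

```python
# Hypothetical tool schema: required/optional fields and their types.
TOOL_SCHEMA = {
    "get_weather": {"required": {"city": str}, "optional": {"units": str}},
}

def validate_tool_call(call: dict, schema: dict = TOOL_SCHEMA) -> list:
    """Return a list of violations for one tool call (empty list = valid)."""
    name = call.get("name")
    if name not in schema:
        return [f"unknown tool: {name}"]
    spec = schema[name]
    args = call.get("arguments", {})
    errors = []
    # Schema adherence: every required field present and well-typed.
    for field, ftype in spec["required"].items():
        if field not in args:
            errors.append(f"missing required field: {field}")
        elif not isinstance(args[field], ftype):
            errors.append(f"wrong type for field: {field}")
    # Hallucination check: no arguments outside the declared schema.
    allowed = set(spec["required"]) | set(spec.get("optional", {}))
    for field in args:
        if field not in allowed:
            errors.append(f"hallucinated field: {field}")
    return errors
```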
3. Basic Decision Making
Given system instructions and available tools, can the model take reasonable next steps?
This tests whether the model:
- Understands procedural expectations
- Chooses logical actions
- Avoids unnecessary tool calls
4. Resolving Constraints
Does the model respect hard rules?
Examples:
- Forbidden actions
- Required preconditions
- Deterministic constraint layers overriding reasoning
Real systems include guardrails. Models must operate within them.
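A deterministic constraint layer can be sketched as a check that runs after the model proposes an action, outside the LLM call, so reasoning cannot override it. The tool names and precondition flag are hypothetical:

```python
# Hypothetical hard rule: these tools may never be called.
FORBIDDEN_TOOLS = {"delete_account"}

def enforce(action: dict, preconditions_met: bool) -> dict:
    """Deterministic guardrail applied to the model's proposed action.
    Runs outside the model, so it overrides the model's reasoning."""
    if action["tool"] in FORBIDDEN_TOOLS:
        return {"status": "blocked", "reason": "forbidden action"}
    if not preconditions_met:
        return {"status": "blocked", "reason": "precondition not met"}
    return {"status": "allowed", "tool": action["tool"]}
```

The evaluation then checks both sides: did the model propose forbidden actions at all, and did the system correctly block them when it did.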
5. Handling Multi-Turn Conversation
Can the model maintain state and coherence across turns?
This evaluates:
- Context retention
- Correct updates to system state
- Avoiding contradiction
- Consistent identity and task awareness
Single-turn correctness does not guarantee multi-step stability.
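Explicit state tracking with a contradiction check might look like this minimal sketch (the key names are illustrative):

```python
class ConversationState:
    """Explicit state tracked across turns; flags a contradiction when a
    later turn rewrites an already-established fact."""

    def __init__(self):
        self.facts = {}

    def update(self, key, value):
        # Record a fact extracted from the current turn; report any
        # conflict with what earlier turns established.
        issues = []
        if key in self.facts and self.facts[key] != value:
            issues.append(
                f"contradiction on '{key}': {self.facts[key]!r} -> {value!r}"
            )
        self.facts[key] = value
        return issues
```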
How the Test Is Structured
This is not a prompt benchmark. It is a workflow simulation.
The framework includes:
- Multi-step task execution (not single-prompt evaluation)
- Structured tool interfaces with defined schemas
- Deterministic constraint layers applied over model reasoning
- Explicit state tracking across conversation turns
- Clear evaluation criteria per capability
- Repeatable, controlled scenarios
Each scenario is designed to simulate how LLMs are actually deployed inside agent systems — where planning, execution, constraint handling, and state management are all interacting simultaneously.
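A repeatable scenario bundling these pieces might be declared as a simple record. The field names below are illustrative assumptions, not the framework's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """One repeatable, controlled workflow scenario (illustrative fields)."""
    name: str
    turns: list                     # user messages, in order
    expected_route: str             # ground-truth execution path
    expected_tool_calls: list       # expected tool name + arguments per step
    forbidden_tools: set = field(default_factory=set)
    expected_final_state: dict = field(default_factory=dict)
```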
Why This Matters
Synthetic benchmarks are optimized for clean scoring.
Agent workflows are optimized for messy reality.
When models operate inside structured systems, failure modes shift:
- Partial tool calls
- Invalid arguments
- Broken routing
- State drift
- Constraint violations
These issues rarely show up in leaderboard-style benchmarks.
What This Post Covers
This post (and accompanying video) focuses on:
- The architecture of the evaluation framework
- The capability breakdown
- The workflow simulation model
- A practical walkthrough using Mistral models
- The kind of results this methodology surfaces in real agent scenarios
If you're building agent systems or evaluating models beyond chat use cases, this framework is directly applicable to real-world deployments.
Benchmarks tell you how a model performs in isolation.
Agent evaluations tell you how it behaves inside a system.
Top comments (2)
This distinction between benchmark performance and workflow performance is something the field needs to talk about more honestly.
The five capabilities you listed map closely to where production agents actually break. From what we've seen running multi-model setups in production, the failure hierarchy usually goes: routing breaks first (model picks wrong tool or path), then constraint violation (model does something it was told not to do because context pushed it that direction), then multi-turn drift (coherence erodes over long sessions as context accumulates).
The constraint resolution point is underappreciated. A model can score 95% on a tool-calling benchmark and then routinely violate a business rule when the rule is two paragraphs into the system prompt and there's competing context. Deterministic constraint layers that override reasoning are basically the only reliable fix — the model can't reason its way around a guardrail that runs outside the LLM call.
Curious what you're using as the ground truth signal for routing accuracy. Exact-match expected path, or something fuzzier? That seems like the hardest thing to automate in a repeatable way.
Routing is the challenging part, but also a big payoff when it works right, because it isolates work into a defined 'sandbox' of sorts. If the routing is done correctly, the execution will very likely succeed with the correct output.
Now that the importance of routing is established, to answer your question about the ground truth signal: there is a prompt that governs the ground truth for routing. After routing, you evaluate whether it routed things correctly. Given that the LLM 'understands' natural language well enough, when it doesn't route correctly it's generally due to gaps in the training data or gaps in the prompt that governs routing.
For example, we had a case:
When the user says "hi", the model knows it's a greeting and will get it right essentially every time, because "hi" is universally recognized as a greeting.
However, imagine someone typing something odd with no context, like "blah", or a two-character string in a non-English language. The LLM will often struggle to route that correctly because there is no meaningful intent to recover.
For those scenarios we just route to a catch-all of some kind. If someone types "blah" and it misroutes, we generally don't consider that a failure; it's more a failure on the user's part.
Imagine what you would do if someone came up to you and said "blah". You'd probably think the person is crazy, because you don't know how to process it.
The other scenario is when there is context and history, and the prompt contains conflicting clauses that cause the LLM to misroute. When this happens, you need to evaluate both whether the model routed correctly and whether your prompt (the LLM's ground truth) has conflicting clauses that caused the misroute.
For the most part, though, if the LLM has enough 'common sense' it will route correctly. We are essentially splitting thinking and execution into two pieces.