Most model evaluations happen in controlled environments: single prompts, tidy inputs, and clear expected outputs.
Production agents don’t operate like that.
They route between capabilities.
They call tools.
They maintain state.
They operate under constraints.
They break in subtle, non-obvious ways.
In this post, I’ll walk through the evaluation framework I built to test LLMs inside a structured, stateful agent workflow.
To make the discussion concrete, the walkthrough uses Mistral models as the implementation example, demonstrating the kind of behavior and results you can expect when running this evaluation in practice.
The goal is simple:
Evaluate how models behave inside a real execution system, not how they perform on isolated prompts.
What I’m Testing
The framework evaluates five core capabilities required for practical agent systems:
1. Routing
Can the model correctly identify intent and select the appropriate execution path?
Agent systems depend heavily on correct internal routing. Misclassification at this layer cascades into downstream failures.
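As a minimal sketch, a routing check can compare the model's chosen route against a labeled expectation set. The route names and the `classify_intent` stub below are hypothetical stand-ins for whatever router (typically an LLM call) your system actually uses:

```python
# Labeled ground truth: user message -> expected route.
# These routes are illustrative, not the post's actual taxonomy.
EXPECTED_ROUTES = {
    "cancel my subscription": "billing",
    "the app crashes on login": "support",
    "hi": "greeting",
}

def classify_intent(message: str) -> str:
    # Stand-in for the LLM routing call; here a toy keyword router
    # so the sketch is runnable.
    text = message.lower()
    if "subscription" in text or "refund" in text:
        return "billing"
    if "crash" in text or "error" in text:
        return "support"
    return "greeting"

def routing_accuracy(cases: dict) -> float:
    # Fraction of messages routed to their expected path.
    hits = sum(classify_intent(msg) == route for msg, route in cases.items())
    return hits / len(cases)
```

In a real harness, `classify_intent` would wrap the model call and the expectation set would come from your scenario definitions.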
2. Tool Use
Does the model call tools correctly with valid structured arguments?
This includes:
- Schema adherence
- Proper parameter formatting
- Calling the correct tool
- Avoiding hallucinated fields
Tool misuse is one of the most common failure modes in agent systems.
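A schema-adherence check for a single tool call can be sketched roughly as follows. The `get_weather` tool and the schema format are illustrative assumptions; a production framework would more likely validate against JSON Schema:

```python
# Hypothetical tool schema: required/optional fields and their types.
TOOL_SCHEMA = {
    "get_weather": {"required": {"city": str}, "optional": {"units": str}},
}

def validate_tool_call(call: dict, schema: dict = TOOL_SCHEMA) -> list:
    """Return a list of violations for one tool call (empty list = valid)."""
    name = call.get("name")
    if name not in schema:
        return [f"unknown tool: {name}"]
    spec = schema[name]
    args = call.get("arguments", {})
    errors = []
    # Schema adherence: every required field present and well-typed.
    for field, ftype in spec["required"].items():
        if field not in args:
            errors.append(f"missing required field: {field}")
        elif not isinstance(args[field], ftype):
            errors.append(f"wrong type for field: {field}")
    # Hallucination check: no arguments outside the declared schema.
    allowed = set(spec["required"]) | set(spec.get("optional", {}))
    for field in args:
        if field not in allowed:
            errors.append(f"hallucinated field: {field}")
    return errors
```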
3. Basic Decision Making
Given system instructions and available tools, can the model take reasonable next steps?
This tests whether the model:
- Understands procedural expectations
- Chooses logical actions
- Avoids unnecessary tool calls
4. Resolving Constraints
Does the model respect hard rules?
Examples:
- Forbidden actions
- Required preconditions
- Deterministic constraint layers overriding reasoning
Real systems include guardrails. Models must operate within them.
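A deterministic constraint layer can be sketched as a check that runs after the model proposes an action, outside the LLM call, so reasoning cannot override it. The tool names and precondition flag are hypothetical:

```python
# Hypothetical hard rule: these tools may never be called.
FORBIDDEN_TOOLS = {"delete_account"}

def enforce(action: dict, preconditions_met: bool) -> dict:
    """Deterministic guardrail applied to the model's proposed action.
    Runs outside the model, so it overrides the model's reasoning."""
    if action["tool"] in FORBIDDEN_TOOLS:
        return {"status": "blocked", "reason": "forbidden action"}
    if not preconditions_met:
        return {"status": "blocked", "reason": "precondition not met"}
    return {"status": "allowed", "tool": action["tool"]}
```

The evaluation then checks both sides: did the model propose forbidden actions at all, and did the system correctly block them when it did.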
5. Handling Multi-Turn Conversation
Can the model maintain state and coherence across turns?
This evaluates:
- Context retention
- Correct updates to system state
- Avoiding contradiction
- Consistent identity and task awareness
Single-turn correctness does not guarantee multi-step stability.
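Explicit state tracking with a contradiction check might look like this minimal sketch (the key names are illustrative):

```python
class ConversationState:
    """Explicit state tracked across turns; flags a contradiction when a
    later turn rewrites an already-established fact."""

    def __init__(self):
        self.facts = {}

    def update(self, key, value):
        # Record a fact extracted from the current turn; report any
        # conflict with what earlier turns established.
        issues = []
        if key in self.facts and self.facts[key] != value:
            issues.append(
                f"contradiction on '{key}': {self.facts[key]!r} -> {value!r}"
            )
        self.facts[key] = value
        return issues
```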
How the Test Is Structured
This is not a prompt benchmark. It is a workflow simulation.
The framework includes:
- Multi-step task execution (not single-prompt evaluation)
- Structured tool interfaces with defined schemas
- Deterministic constraint layers applied over model reasoning
- Explicit state tracking across conversation turns
- Clear evaluation criteria per capability
- Repeatable, controlled scenarios
Each scenario is designed to simulate how LLMs are actually deployed inside agent systems — where planning, execution, constraint handling, and state management are all interacting simultaneously.
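A repeatable scenario bundling these pieces might be declared as a simple record. The field names below are illustrative assumptions, not the framework's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Scenario:
    """One repeatable, controlled workflow scenario (illustrative fields)."""
    name: str
    turns: list                     # user messages, in order
    expected_route: str             # ground-truth execution path
    expected_tool_calls: list       # expected tool name + arguments per step
    forbidden_tools: set = field(default_factory=set)
    expected_final_state: dict = field(default_factory=dict)
```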
Why This Matters
Synthetic benchmarks are optimized for clean scoring.
Agent workflows are optimized for messy reality.
When models operate inside structured systems, failure modes shift:
- Partial tool calls
- Invalid arguments
- Broken routing
- State drift
- Constraint violations
These issues rarely show up in leaderboard-style benchmarks.
What This Post Covers
This post (and accompanying video) focuses on:
- The architecture of the evaluation framework
- The capability breakdown
- The workflow simulation model
- A practical walkthrough using Mistral models
- The kind of results this methodology surfaces in real agent scenarios
If you're building agent systems or evaluating models beyond chat use cases, this framework is directly applicable to real-world deployments.
Benchmarks tell you how a model performs in isolation.
Agent evaluations tell you how it behaves inside a system.
Top comments (2)
This distinction between benchmark performance and workflow performance is something the field needs to talk about more honestly.
The five capabilities you listed map closely to where production agents actually break. From what we've seen running multi-model setups in production, the failure hierarchy usually goes: routing breaks first (model picks wrong tool or path), then constraint violation (model does something it was told not to do because context pushed it that direction), then multi-turn drift (coherence erodes over long sessions as context accumulates).
The constraint resolution point is underappreciated. A model can score 95% on a tool-calling benchmark and then routinely violate a business rule when the rule is two paragraphs into the system prompt and there's competing context. Deterministic constraint layers that override reasoning are basically the only reliable fix — the model can't reason its way around a guardrail that runs outside the LLM call.
Curious what you're using as the ground truth signal for routing accuracy. Exact-match expected path, or something fuzzier? That seems like the hardest thing to automate in a repeatable way.
Routing is the challenging part, but also a big payoff when it works right, because it isolates work into a defined 'sandbox' of sorts. If the routing is done correctly, the execution will very likely succeed with the correct output.
Now that the importance of routing is established, to answer your question about the ground truth signal: there is a prompt that governs the ground truth for routing. After routing, you evaluate whether it routed things correctly. Given that the LLM 'understands' natural language well enough, when it doesn't route correctly it's generally due to gaps in the training data or gaps in the prompt that governs routing.
For example, we had a case:
When the user says "hi", the model knows it's a greeting and will get it right essentially every time, because "hi" is universally recognized as a greeting.
However, imagine someone typing something odd with no context, like "blah", or a two-character string in a non-English language. The LLM will often struggle to route that correctly because there is no meaningful intent to recover.
For those scenarios we just route to a catch-all of some kind. If someone types "blah" and it misroutes, we generally don't consider that a failure; it's more a failure on the user's part.
Imagine what you would do if someone came up to you and said "blah". You'd probably think the person is crazy, because you don't know how to process it.
The other scenario is when there is context and history, and the prompt contains conflicting clauses that cause the LLM to misroute. When this happens, you need to evaluate both whether the model routed correctly and whether your prompt (the LLM's ground truth) has conflicting clauses that caused the misroute.
For the most part, though, if the LLM has enough 'common sense' it will route correctly. We are essentially splitting thinking and execution into two pieces.