Zack Siri

Synthetic Benchmarks Are Clean. Agent Workflows Are Not.

Most model evaluations happen in controlled environments: single prompts, tidy inputs, and clear expected outputs.

Production agents don’t operate like that.

They route between capabilities.

They call tools.

They maintain state.

They operate under constraints.

They break in subtle, non-obvious ways.

In this post, I’ll walk through the evaluation framework I built to test LLMs inside a structured, stateful agent workflow.

To make the discussion concrete, the walkthrough uses Mistral models as the implementation example, demonstrating the kind of behavior and outcomes you can expect when running this evaluation in practice.

The goal is simple:

Evaluate how models behave inside a real execution system, not how they perform on isolated prompts.


What I’m Testing

The framework evaluates five core capabilities required for practical agent systems:

1. Routing

Can the model correctly identify intent and select the appropriate execution path?

Agent systems depend heavily on correct internal routing. Misclassification at this layer cascades into downstream failures.
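
One minimal way to score this is to compare the model's chosen execution path against a labeled expected path per scenario. A sketch follows; the `RoutingCase` shape and the `classify_intent` hook are illustrative stand-ins for however your agent exposes its routing step, not a fixed API:

```python
# Minimal routing check: does the model pick the labeled expected path?
# `RoutingCase` and `classify_intent` are illustrative assumptions.
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class RoutingCase:
    user_message: str
    expected_route: str  # ground-truth execution path label

def score_routing(cases: Iterable[RoutingCase],
                  classify_intent: Callable[[str], str]) -> float:
    """Fraction of cases where the model selects the expected path."""
    cases = list(cases)
    hits = sum(classify_intent(c.user_message) == c.expected_route
               for c in cases)
    return hits / len(cases)

cases = [
    RoutingCase("Cancel my subscription", expected_route="billing"),
    RoutingCase("The app crashes on login", expected_route="support"),
]
```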


2. Tool Use

Does the model call tools correctly with valid structured arguments?

This includes:

  • Schema adherence
  • Proper parameter formatting
  • Calling the correct tool
  • Avoiding hallucinated fields

Tool misuse is one of the most common failure modes in agent systems.
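
Here is a sketch of what these checks can look like, assuming tool calls arrive as name-plus-arguments dicts. The schema format below is a deliberate simplification; a real harness might validate with jsonschema or Pydantic instead:

```python
# Validate a tool call against its declared schema: right tool,
# required parameters present, no hallucinated fields.
TOOL_SCHEMAS = {
    "lookup_order": {
        "required": {"order_id"},
        "allowed": {"order_id", "include_items"},
    },
}

def check_tool_call(call: dict) -> list[str]:
    """Return a list of violations; empty means the call is well-formed."""
    schema = TOOL_SCHEMAS.get(call["name"])
    if schema is None:
        return [f"unknown tool: {call['name']}"]  # called the wrong tool
    args = set(call["arguments"])
    errors = []
    if missing := schema["required"] - args:
        errors.append(f"missing required parameters: {missing}")
    if extra := args - schema["allowed"]:
        errors.append(f"hallucinated fields: {extra}")
    return errors
```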


3. Basic Decision Making

Given system instructions and available tools, can the model take reasonable next steps?

This tests whether the model:

  • Understands procedural expectations
  • Chooses logical actions
  • Avoids unnecessary tool calls

4. Resolving Constraints

Does the model respect hard rules?

Examples:

  • Forbidden actions
  • Required preconditions
  • Deterministic constraint layers overriding reasoning

Real systems include guardrails. Models must operate within them.
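
The key property is that the constraint layer is deterministic code running outside the model call, so it can veto an action no matter how the model reasoned its way there. A minimal sketch, with an illustrative rule set and action/state shapes:

```python
# A deterministic guard that runs before any tool executes, so model
# reasoning cannot override it. Rules and shapes are illustrative.
FORBIDDEN_ACTIONS = {"delete_account", "disable_safety_checks"}

def enforce_constraints(action: dict, state: dict) -> tuple[bool, str]:
    """Return (allowed, reason) for a proposed tool call."""
    if action["name"] in FORBIDDEN_ACTIONS:
        return False, f"forbidden action: {action['name']}"
    if action["name"] == "issue_refund" and not state.get("identity_verified"):
        return False, "precondition failed: identity not verified"
    return True, "ok"
```

In the evaluation, a model that proposes an action this layer rejects is scored as a constraint violation, even if its natural-language reply reads as reasonable.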


5. Handling Multi-Turn Conversation

Can the model maintain state and coherence across turns?

This evaluates:

  • Context retention
  • Correct updates to system state
  • Avoiding contradiction
  • Consistent identity and task awareness

Single-turn correctness does not guarantee multi-step stability.
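
One repeatable way to test this is to replay a scripted conversation and assert the tracked state after each turn. The `run_turn` signature and the state keys below are assumptions about the agent loop, not a prescribed interface:

```python
# Replay a scripted conversation and assert state after each turn.
def run_multi_turn(agent, script) -> list[str]:
    state: dict = {}
    failures = []
    for i, turn in enumerate(script):
        # Assumed interface: the agent returns its reply and updated state.
        reply, state = agent.run_turn(turn["user"], state)
        for key, expected in turn.get("expect_state", {}).items():
            if state.get(key) != expected:
                failures.append(
                    f"turn {i}: state[{key!r}] = {state.get(key)!r}, "
                    f"expected {expected!r}"
                )
    return failures

script = [
    {"user": "Book a table for two tonight.", "expect_state": {"party_size": 2}},
    {"user": "Actually, make it four.", "expect_state": {"party_size": 4}},
]
```

The second turn is the interesting one: it checks that the model updates earlier state rather than contradicting it.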


How the Test Is Structured

This is not a prompt benchmark. It is a workflow simulation.

The framework includes:

  • Multi-step task execution (not single-prompt evaluation)
  • Structured tool interfaces with defined schemas
  • Deterministic constraint layers applied over model reasoning
  • Explicit state tracking across conversation turns
  • Clear evaluation criteria per capability
  • Repeatable, controlled scenarios

Each scenario is designed to simulate how LLMs are actually deployed inside agent systems, where planning, execution, constraint handling, and state management all interact at once.
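
To make that concrete, here is one way a scenario could be declared, tying the earlier checks together. Every field name here is illustrative rather than a fixed spec:

```python
# An illustrative scenario declaration combining routing, tool,
# constraint, and state checks.
scenario = {
    "name": "refund_with_verification",
    "turns": [
        {"user": "I want a refund for order 4412."},
        {"user": "Sure, my email is jo@example.com."},
    ],
    "tools": ["lookup_order", "issue_refund"],
    "constraints": ["identity_verified must be True before issue_refund"],
    "evaluate": {
        "routing": "billing",
        "required_tool_calls": ["lookup_order"],
        "final_state": {"identity_verified": True},
    },
}
```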


Why This Matters

Synthetic benchmarks are optimized for clean scoring.

Agent workflows have to survive messy reality.

When models operate inside structured systems, failure modes shift:

  • Partial tool calls
  • Invalid arguments
  • Broken routing
  • State drift
  • Constraint violations

These issues rarely show up in leaderboard-style benchmarks.


What This Post Covers

This post (and accompanying video) focuses on:

  • The architecture of the evaluation framework
  • The capability breakdown
  • The workflow simulation model
  • A practical walkthrough using Mistral models
  • The kind of results this methodology surfaces in real agent scenarios

If you're building agent systems or evaluating models beyond chat use cases, this framework is directly applicable to real-world deployments.

Benchmarks tell you how a model performs in isolation.

Agent evaluations tell you how it behaves inside a system.

Top comments (1)

signalstack

This distinction between benchmark performance and workflow performance is something the field needs to talk about more honestly.

The five capabilities you listed map closely to where production agents actually break. From what we've seen running multi-model setups in production, the failure hierarchy usually goes: routing breaks first (model picks wrong tool or path), then constraint violation (model does something it was told not to do because context pushed it that direction), then multi-turn drift (coherence erodes over long sessions as context accumulates).

The constraint resolution point is underappreciated. A model can score 95% on a tool-calling benchmark and then routinely violate a business rule when the rule is two paragraphs into the system prompt and there's competing context. Deterministic constraint layers that override reasoning are basically the only reliable fix — the model can't reason its way around a guardrail that runs outside the LLM call.

Curious what you're using as the ground truth signal for routing accuracy. Exact-match expected path, or something fuzzier? That seems like the hardest thing to automate in a repeatable way.