Harish Kotra (he/him)

Posted on Jul 3

EvalMaster: Building a Practical Curriculum for AI Agent Evals

#ai #programming #python #dailybuild2026

EvalMaster is a teaching project for builders who want to understand agent evaluation by doing, not just reading theory. It packages the core ideas of evals into a runnable Python codebase using Agno, Pydantic, Rich, asyncio, and a small amount of scaffolding that makes the examples work offline by default.

The core lesson is simple: if you are building agents, you need more than “does the answer look okay?” You need evals that measure exactness, structure, tool use, grounding, robustness, and the quality of multi-step trajectories.

Why EvalMaster Exists

A lot of AI projects start with a demo and end with a surprise. The prompt works on the examples you tried manually, but once the model changes, the tool schema shifts, or the context gets longer, behavior drifts.

EvalMaster is designed to make that drift visible.

It helps answer questions like:

Did the agent still call the right tool after a prompt rewrite?
Did the new model improve quality or just become more verbose?
Is the RAG system actually grounded in its sources?
Does the support agent escalate when it should?
How often does the judge model disagree with itself?

The point is not to eliminate uncertainty. The point is to turn uncertainty into something observable and testable.

Architecture Overview

The architecture is intentionally lightweight:

config.py chooses the provider.
common.py defines the typed eval objects.
runtime_utils.py contains the scoring and reporting helpers.
Each numbered folder, 00_ through 08_, teaches one evaluation pattern.
07_real_world_projects/ shows how to connect the ideas to a real application.

Technology Choices

Why these tools?

Agno: gives the project a consistent agent interface and team/tool abstractions.
Pydantic: keeps the eval inputs and outputs typed and explicit.
Rich: makes the outputs readable and presentation-ready in a terminal.
asyncio: makes batched evaluation practical.
pandas + matplotlib: support simple analysis and reporting.
pytest: gives us confidence that the scaffold stays runnable.
python-dotenv: keeps secrets and provider settings out of the code.

A Few Important Design Decisions

1. Offline First

The demos run without live API calls. That matters because a curriculum should be executable on day one, even if the user has not configured credentials yet.

2. Shared Scoring Primitives

The project reuses scoring helpers for exact match, schema checks, tool-call validation, grounding, and judge-based scoring. That keeps the examples consistent and avoids turning each lesson into a separate mini-framework.
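
As a rough illustration of what "shared scoring primitives" can mean in practice, here is a minimal sketch in which every scorer has the same signature and returns a float between 0 and 1. The names exact_match, grounding_overlap, and run_scorers are hypothetical and are not taken from runtime_utils.py; the word-overlap grounding check is a crude stand-in chosen only to keep the example short.

```python
from typing import Callable

# Hypothetical shared signature: (output, reference) -> score in [0.0, 1.0].
Scorer = Callable[[str, str], float]


def exact_match(output: str, expected: str) -> float:
    """1.0 when the strings match after trimming whitespace and lowercasing."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def grounding_overlap(output: str, source: str) -> float:
    """Fraction of output words that also appear in the source text.

    A crude lexical proxy for grounding, used here only for illustration.
    """
    out_words = output.lower().split()
    if not out_words:
        return 0.0
    source_words = set(source.lower().split())
    return sum(w in source_words for w in out_words) / len(out_words)


def run_scorers(output: str, reference: str, scorers: dict[str, Scorer]) -> dict[str, float]:
    """Apply every named scorer to the same (output, reference) pair."""
    return {name: fn(output, reference) for name, fn in scorers.items()}


scores = run_scorers(
    "Paris is the capital",
    "paris is the capital",
    {"exact": exact_match, "grounding": grounding_overlap},
)
print(scores)
```

Because each scorer has the same shape, a lesson can mix and match them without defining its own result format.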

3. Provider Switching Through Config

The intent is to make provider changes a configuration concern rather than a code rewrite.

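import os

from agno.models.openai import OpenAIChat
from agno.models.openai.like import OpenAILike

PROVIDER = os.getenv("PROVIDER", "openai")

def get_model(model_id: str | None = None):
    if PROVIDER == "openai":
        return OpenAIChat(id=model_id or os.getenv("OPENAI_MODEL", "gpt-4o-mini"))
    if PROVIDER == "lmstudio":
        # LM Studio serves an OpenAI-compatible API on localhost.
        return OpenAILike(id=model_id or "local-model", base_url="http://localhost:1234/v1", api_key="lm-studio")
    raise ValueError(f"Unsupported PROVIDER: {PROVIDER!r}")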

4. Structured Evaluation Artifacts

Using typed EvalCase, EvalResult, and EvalSuite objects means eval data can be serialized, versioned, and exported to JSON or CSV without guessing what shape it has.
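
To make the idea concrete, here is a small sketch of typed eval artifacts that round-trip through JSON. It uses standard-library dataclasses as a stand-in for the project's Pydantic models so that it runs without extra dependencies, and the field names (case_id, prompt, expected, score, passed) are assumptions rather than the actual definitions in common.py.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvalCase:
    case_id: str
    prompt: str
    expected: str


@dataclass
class EvalResult:
    case_id: str
    score: float
    passed: bool


@dataclass
class EvalSuite:
    name: str
    cases: list[EvalCase] = field(default_factory=list)
    results: list[EvalResult] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict recurses into nested dataclasses, so the whole suite serializes.
        return json.dumps(asdict(self), sort_keys=True)


suite = EvalSuite(
    name="smoke",
    cases=[EvalCase("c1", "2 + 2?", "4")],
    results=[EvalResult("c1", 1.0, True)],
)
restored = json.loads(suite.to_json())
print(restored["results"])
```

With Pydantic the same shape would also validate field types on construction, which dataclasses do not.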

What The Curriculum Covers

Foundations

The first section explains:

what evals are
why they are different from tests
the spectrum from rules to human review
when to write an eval before coding

Deterministic Evals

This section covers:

exact match
regex validation
JSON schema checks
tool call validation
latency and cost scoring
factual grounding

LLM-as-Judge Evals

This section shows how to:

run a single judge
compare answers pairwise
apply rubric-based grading
think about judge variance and bias
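
One simple way to put a number on judge variance is to run the same judge several times on the same answer and measure how often it agrees with its own most common verdict. The sketch below uses a deterministic stub in place of a real LLM judge so that it runs offline; self_agreement and stub_judge are hypothetical names, not part of the project.

```python
from collections import Counter
from typing import Callable


def self_agreement(judge: Callable[[str, int], str], answer: str, runs: int = 5) -> float:
    """Share of runs that return the judge's most common verdict (1.0 = fully consistent)."""
    verdicts = [judge(answer, i) for i in range(runs)]
    most_common_count = Counter(verdicts).most_common(1)[0][1]
    return most_common_count / runs


def stub_judge(answer: str, run_index: int) -> str:
    """Stand-in for an LLM judge: flips its verdict on every fifth run."""
    return "fail" if run_index % 5 == 4 else "pass"


agreement = self_agreement(stub_judge, "The capital of France is Paris.", runs=10)
print(agreement)
```

With a real judge, a low agreement rate on a case is a hint that the rubric is ambiguous or that the verdict should be averaged over several runs.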

Agent-Specific Evals

This section focuses on:

tool selection
multi-step trajectories
memory recall
reasoning traces
goal completion
multi-agent handoffs
adversarial robustness

Dataset Management

The dataset section demonstrates:

golden set creation
synthetic case generation
content hashing and versioning
edge-case mining
Pydantic schemas for storage
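
Content hashing and versioning can be sketched with the standard library alone. The idea is to hash a canonical JSON encoding of each case, then derive a dataset version from the sorted case hashes so that neither key order nor case order changes the version. The helper names and the 12-character version length are assumptions.

```python
import hashlib
import json


def case_hash(case: dict) -> str:
    """SHA-256 over a canonical JSON encoding, so key order does not matter."""
    canonical = json.dumps(case, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def dataset_version(cases: list[dict]) -> str:
    """Short version tag derived from the sorted per-case hashes."""
    joined = "".join(sorted(case_hash(case) for case in cases))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()[:12]


v1 = dataset_version([{"id": "a", "prompt": "hi"}, {"id": "b", "prompt": "bye"}])
v1_reordered = dataset_version([{"prompt": "bye", "id": "b"}, {"id": "a", "prompt": "hi"}])
v2 = dataset_version([{"id": "a", "prompt": "hi!"}, {"id": "b", "prompt": "bye"}])
print(v1 == v1_reordered, v1 == v2)
```

Storing the version tag next to each eval run makes it clear whether two scores were measured on the same golden set.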

Running Evals At Scale

This section covers:

asynchronous execution
concurrency limits
retry logic
incremental persistence
regression gates
terminal dashboards
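
A regression gate can be as small as a comparison between a stored baseline and the current run. The following is a minimal sketch, assuming the gate is expressed on pass rate with a fixed tolerance; the function names and the 0.02 default are illustrative choices, not values from the project.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of passing cases; an empty run counts as 0.0."""
    return sum(results) / len(results) if results else 0.0


def regression_gate(baseline_pass_rate: float, current_pass_rate: float, tolerance: float = 0.02) -> bool:
    """Return True when the current run may ship.

    The gate fails only if the pass rate dropped by more than `tolerance`.
    """
    return current_pass_rate >= baseline_pass_rate - tolerance


baseline = pass_rate([True] * 18 + [False] * 2)   # 18 of 20 cases pass
candidate = pass_rate([True] * 17 + [False] * 3)  # 17 of 20 cases pass
ok = regression_gate(baseline, candidate)
print(ok)
```

In CI, a False result would typically translate into a non-zero exit code so the change cannot merge unnoticed.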

Interpreting Results

This section helps answer:

what aggregation metric should I trust?
when does a confidence interval matter?
how do I cluster failure modes?
how do I report results clearly?
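
On the confidence-interval question, a short sketch helps show why sample size matters. The Wilson score interval below is one common choice for a pass rate; picking it here is an assumption, since the project may use a different method such as bootstrapping.

```python
import math


def wilson_interval(passes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if total == 0:
        return (0.0, 1.0)
    p = passes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return (max(0.0, center - half), min(1.0, center + half))


small = wilson_interval(18, 20)     # 90% pass rate on 20 cases
large = wilson_interval(900, 1000)  # 90% pass rate on 1000 cases
print(small, large)
```

Both runs report a 90% pass rate, but the 20-case interval spans roughly 0.70 to 0.97, so a two-point difference between models on a set that small is well within noise.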

Real-World Projects

The final application-style examples cover:

RAG evaluation
customer support evaluation
code generation evaluation

Example: Trajectory Eval

One of the most important ideas in the project is that an agent should be evaluated as a sequence, not just a final string.

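def score(case: EvalCase, steps: list[TrajectoryStep]):
    # Correct if any message step reports "done".
    correctness = 1.0 if any(step.kind == "message" and step.content == "done" for step in steps) else 0.0
    # Full efficiency credit for trajectories of at most two steps.
    efficiency = 1.0 if len(steps) <= 2 else 0.5
    # Fixed at full credit in this excerpt.
    error_recovery = 1.0
    combined = round((correctness + efficiency + error_recovery) / 3, 3)
    # Returns (score, passed, ...); the remaining fields are elided here.
    return combined, combined >= 0.75, ...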

That pattern is important because two agents can produce the same final answer while taking very different paths to get there.

Example: Async Eval Runner

The runner illustrates a practical pattern for scaling up evals:

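import asyncio

async def run_eval_runner(cases, worker, concurrency: int = 5):
    sem = asyncio.Semaphore(concurrency)  # at most `concurrency` cases in flight
    async def guarded(case):  # worker is assumed to be an async callable
        async with sem:
            return await worker(case)
    results = []
    for coro in asyncio.as_completed([guarded(case) for case in cases]):
        results.append(await coro)  # completion order, not input order
    return results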

The useful bit here is not just parallelism. It is the combination of:

concurrency control
retries
incremental saving
progress visibility
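
The four ingredients listed above can be combined in one small sketch. This is not the project's runner: with_retries, run_and_save, flaky_worker, the JSONL output format, and the backoff values are all assumptions made for illustration.

```python
import asyncio
import json
import tempfile
from pathlib import Path


async def with_retries(worker, case, attempts: int = 3, base_delay: float = 0.01):
    """Call an async worker, retrying with exponential backoff on any exception."""
    for attempt in range(attempts):
        try:
            return await worker(case)
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * 2**attempt)


async def run_and_save(cases, worker, out_path: Path, concurrency: int = 2):
    """Run cases under a semaphore and append each result to a JSONL file as it finishes."""
    sem = asyncio.Semaphore(concurrency)

    async def guarded(case):
        async with sem:
            return await with_retries(worker, case)

    done = 0
    with out_path.open("a", encoding="utf-8") as fh:
        for coro in asyncio.as_completed([guarded(c) for c in cases]):
            result = await coro
            fh.write(json.dumps(result) + "\n")
            fh.flush()  # results already written survive a later crash
            done += 1
            print(f"{done}/{len(cases)} done")
    return done


calls = {"n": 0}


async def flaky_worker(case):
    """Hypothetical worker that fails on its very first call, then succeeds."""
    calls["n"] += 1
    if calls["n"] == 1:
        raise RuntimeError("transient failure")
    return {"case": case, "passed": True}


out_file = Path(tempfile.mkdtemp()) / "results.jsonl"
saved = asyncio.run(run_and_save(["a", "b", "c"], flaky_worker, out_file))
```

Writing one line per finished case means a long run that dies halfway still leaves a usable partial result file.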

Example: RAG Metrics

The RAG project splits evaluation into multiple lenses:

faithfulness
answer relevance
context recall
context precision

That separation is important because “the answer is good” and “retrieval is good” are not the same thing.
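
The two retrieval-side metrics can be sketched without any model calls if each chunk has an ID and the relevant IDs are labeled in advance. These set-based definitions are a simplification chosen for illustration; the project's RAG example may compute them differently, and faithfulness and answer relevance usually need a judge or a similarity model.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved:
        return 0.0
    return sum(doc in relevant for doc in retrieved) / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of relevant chunks that were retrieved."""
    if not relevant:
        return 1.0
    return len(relevant & set(retrieved)) / len(relevant)


retrieved_ids = ["doc1", "doc2", "doc3", "doc4"]
relevant_ids = {"doc1", "doc5"}
precision = context_precision(retrieved_ids, relevant_ids)  # 1 of 4 retrieved is relevant
recall = context_recall(retrieved_ids, relevant_ids)        # 1 of 2 relevant was retrieved
print(precision, recall)
```

A retriever can score well on one and poorly on the other, which is why the two are reported separately from answer quality.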

How To Extend The Project

If you fork EvalMaster, some useful additions would be:

a richer judge abstraction with strict, lenient, and expert personas
real embeddings for semantic similarity
an artifact store for historical eval runs
SQLite-backed case/result persistence
a dashboard with trend charts
model routing evals with cost-aware selection
prompt sensitivity experiments
a human review queue with approvals

CLI Entry Point

To make the curriculum easier to explore, EvalMaster now includes a simple Typer-based CLI.

python3 cli.py list
python3 cli.py run foundations
python3 cli.py run projects

This keeps the learning path discoverable without forcing people to remember individual file paths.

How To Contribute

The easiest path to contributing is:

Pick one module.
Add one more case or metric.
Update the tests.
Run the relevant script and inspect the output.
Submit a PR that explains why the new eval matters.

That is a good contribution model because every improvement should tighten the feedback loop, not just add more code.

Why This Matters In Practice

The real value of evals shows up when you need to answer a production question quickly:

Did this model update break something?
Is the new prompt better for one task but worse for another?
Are we handling malformed inputs safely?
Can we ship this without relying on manual spot checks?

EvalMaster is a compact way to practice that thinking.


Code & more: https://www.dailybuild.xyz/project/182-eval-master
