EvalMaster is a teaching project for builders who want to understand agent evaluation by doing, not just reading theory. It packages the core ideas of evals into a runnable Python codebase using Agno, Pydantic, Rich, asyncio, and a small amount of scaffolding that makes the examples work offline by default.
The core lesson is simple: if you are building agents, you need more than “does the answer look okay?” You need evals that measure exactness, structure, tool use, grounding, robustness, and the quality of multi-step trajectories.
Why EvalMaster Exists
A lot of AI projects start with a demo and end with a surprise. The prompt works on the examples you tried manually, but once the model changes, the tool schema shifts, or the context gets longer, behavior drifts.
EvalMaster is designed to make that drift visible.
It helps answer questions like:
- Did the agent still call the right tool after a prompt rewrite?
- Did the new model improve quality or just become more verbose?
- Is the RAG system actually grounded in its sources?
- Does the support agent escalate when it should?
- How often does the judge model disagree with itself?
The point is not to eliminate uncertainty. The point is to turn uncertainty into something observable and testable.
Architecture Overview
The architecture is intentionally lightweight:
- config.py chooses the provider.
- common.py defines the typed eval objects.
- runtime_utils.py contains the scoring and reporting helpers.
- Each folder under 00_ through 08_ teaches one evaluation pattern.
- 07_real_world_projects/ shows how to connect the ideas to a real application.
Technology Choices
Why these tools?
- Agno: gives the project a consistent agent interface and team/tool abstractions.
- Pydantic: keeps the eval inputs and outputs typed and explicit.
- Rich: makes the outputs readable and presentation-ready in a terminal.
- asyncio: makes batched evaluation practical.
- pandas + matplotlib: support simple analysis and reporting.
- pytest: gives us confidence that the scaffold stays runnable.
- python-dotenv: keeps secrets and provider settings out of the code.
A Few Important Design Decisions
1. Offline First
The demos run without live API calls. That matters because a curriculum should be executable on day one, even if the user has not configured credentials yet.
2. Shared Scoring Primitives
The project reuses scoring helpers for exact match, schema checks, tool-call validation, grounding, and judge-based scoring. That keeps the examples consistent and avoids turning each lesson into a separate mini-framework.
3. Provider Switching Through Config
The intent is to make provider changes a configuration concern rather than a code rewrite.
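    import os

    from agno.models.openai import OpenAIChat
    from agno.models.openai.like import OpenAILike  # import paths assumed; the excerpt omits them

    PROVIDER = os.getenv("PROVIDER", "openai")

    def get_model(model_id: str | None = None):
        if PROVIDER == "openai":
            return OpenAIChat(id=model_id or os.getenv("OPENAI_MODEL", "gpt-4o-mini"))
        if PROVIDER == "lmstudio":
            # LM Studio serves an OpenAI-compatible API on localhost.
            return OpenAILike(id=model_id or "local-model", base_url="http://localhost:1234/v1", api_key="lm-studio")
        # Fail loudly on an unknown provider instead of silently returning None.
        raise ValueError(f"Unsupported PROVIDER: {PROVIDER!r}")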
4. Structured Evaluation Artifacts
Using typed EvalCase, EvalResult, and EvalSuite objects means eval cases and results can be serialized, versioned, and exported to JSON or CSV without guessing what shape the data has.
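As a rough illustration of what such typed artifacts could look like, here is a minimal sketch. The field names are assumptions made for this example, not the project's actual definitions in common.py, and stdlib dataclasses stand in for the Pydantic models so the snippet runs without extra dependencies.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class EvalCase:
    case_id: str   # hypothetical field names throughout
    prompt: str
    expected: str


@dataclass
class EvalResult:
    case_id: str
    score: float
    passed: bool


@dataclass
class EvalSuite:
    name: str
    cases: list[EvalCase] = field(default_factory=list)
    results: list[EvalResult] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, so the export has a fixed shape.
        return json.dumps(asdict(self), sort_keys=True)


suite = EvalSuite(
    name="demo",
    cases=[EvalCase("c1", "2 + 2 = ?", "4")],
    results=[EvalResult("c1", 1.0, True)],
)
exported = json.loads(suite.to_json())
```

Because the shape is declared up front, a downstream script can read `exported["results"]` without inspecting the data first.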
What The Curriculum Covers
Foundations
The first section explains:
- what evals are
- why they are different from tests
- the spectrum from rules to human review
- when to write an eval before coding
Deterministic Evals
This section covers:
- exact match
- regex validation
- JSON schema checks
- tool call validation
- latency and cost scoring
- factual grounding
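Several of these checks fit in a few lines each. The sketch below is illustrative only: the function names, the `{"tool": ..., "args": ...}` call shape, and the required-keys check (a lightweight stand-in for full JSON schema validation) are assumptions, not the project's helpers in runtime_utils.py.

```python
import json
import re


def exact_match(output: str, expected: str) -> float:
    # Normalize whitespace and case so trivial formatting differences do not fail the case.
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def regex_valid(output: str, pattern: str) -> float:
    # fullmatch requires the whole output to fit the pattern, not just a substring.
    return 1.0 if re.fullmatch(pattern, output.strip()) else 0.0


def has_required_keys(output: str, required: set[str]) -> float:
    # Stand-in for a schema check: the output must be a JSON object with the required keys.
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(data, dict) and required <= set(data) else 0.0


def tool_call_valid(call: dict, expected_tool: str, required_args: set[str]) -> float:
    # Checks the tool name and that every required argument is present.
    args = call.get("args", {})
    return 1.0 if call.get("tool") == expected_tool and required_args <= set(args) else 0.0
```

Each scorer returns a float in [0, 1], which makes it easy to average them or combine them with judge-based scores later.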
LLM-as-Judge Evals
This section shows how to:
- run a single judge
- compare answers pairwise
- apply rubric-based grading
- think about judge variance and bias
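Judge variance and position bias can both be probed with very little machinery. The following is a sketch under stated assumptions: the judge is any callable that returns "first" or "second", and the two stub judges are hypothetical stand-ins for a real model call.

```python
from collections import Counter
from typing import Callable


def self_agreement(verdicts: list[str]) -> float:
    # Fraction of repeated judge runs that agree with the majority verdict.
    if not verdicts:
        raise ValueError("need at least one verdict")
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return majority_count / len(verdicts)


def pairwise_with_swap(judge: Callable[[str, str], str], a: str, b: str) -> str:
    # Ask twice with the answers in swapped positions.
    first_pass = "a" if judge(a, b) == "first" else "b"
    second_pass = "b" if judge(b, a) == "first" else "a"
    # Only trust a winner that survives the swap; otherwise call it a tie.
    return first_pass if first_pass == second_pass else "tie"


def position_biased_judge(first: str, second: str) -> str:
    # Hypothetical stub: always prefers whatever is shown first.
    return "first"


def length_judge(first: str, second: str) -> str:
    # Hypothetical stub: prefers the longer answer regardless of position.
    return "first" if len(first) >= len(second) else "second"
```

With these stubs, the position-biased judge yields a tie after the swap, while the length-based judge picks the same answer both times. A low `self_agreement` on repeated runs of the same case is a signal that a single judge call should not be trusted on its own.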
Agent-Specific Evals
This section focuses on:
- tool selection
- multi-step trajectories
- memory recall
- reasoning traces
- goal completion
- multi-agent handoffs
- adversarial robustness
Dataset Management
The dataset section demonstrates:
- golden set creation
- synthetic case generation
- content hashing and versioning
- edge-case mining
- Pydantic schemas for storage
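Content hashing gives a dataset a version that changes only when its content changes. Here is a minimal sketch, assuming cases are plain dicts; the function name and the 12-character truncation are arbitrary choices for this example.

```python
import hashlib
import json


def dataset_hash(cases: list[dict]) -> str:
    # Canonical JSON (sorted keys, fixed separators) makes the hash independent of key order.
    # Sorting the serialized cases makes it independent of list order as well.
    canonical = sorted(json.dumps(c, sort_keys=True, separators=(",", ":")) for c in cases)
    payload = "\n".join(canonical).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]
```

Storing this hash next to each eval run makes it clear whether two scores were measured on the same golden set.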
Running Evals At Scale
This section covers:
- asynchronous execution
- concurrency limits
- retry logic
- incremental persistence
- regression gates
- terminal dashboards
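A regression gate can be as simple as comparing the current run's metrics to a stored baseline. This is a sketch, not the project's implementation; the function name, the metric dict shape, and the default tolerance are assumptions.

```python
def regression_gate(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> list[str]:
    # Return the metrics that dropped by more than `tolerance` relative to the baseline.
    # A metric missing from the current run counts as a regression.
    failures = []
    for metric, base_value in baseline.items():
        new_value = current.get(metric)
        if new_value is None or base_value - new_value > tolerance:
            failures.append(metric)
    return sorted(failures)
```

In CI, a non-empty return value would typically translate into a non-zero exit code so the change cannot merge unnoticed.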
Interpreting Results
This section helps answer:
- what aggregation metric should I trust?
- when does a confidence interval matter?
- how do I cluster failure modes?
- how do I report results clearly?
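Confidence intervals matter most when the eval set is small. One common choice for a pass rate is the Wilson score interval; the sketch below uses it as an illustration, which is an assumption about method rather than a statement of what the project computes.

```python
import math


def wilson_interval(passed: int, total: int, z: float = 1.96) -> tuple[float, float]:
    # Wilson score interval for a pass rate; z=1.96 corresponds to roughly 95% confidence.
    if total <= 0:
        raise ValueError("total must be positive")
    p = passed / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width
```

For 18 passes out of 20, the interval runs from roughly 0.70 to 0.97, so a reported "90% pass rate" on 20 cases is compatible with a much weaker system. The same rate on 200 cases gives a visibly narrower interval.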
Real-World Projects
The final application-style examples cover:
- RAG evaluation
- customer support evaluation
- code generation evaluation
Example: Trajectory Eval
One of the most important ideas in the project is that an agent should be evaluated as a sequence, not just a final string.
    def score(case: EvalCase, steps: list[TrajectoryStep]):
        # Correct if any message step reports completion.
        correctness = 1.0 if any(step.kind == "message" and step.content == "done" for step in steps) else 0.0
        # Short paths get full credit; longer ones are penalized.
        efficiency = 1.0 if len(steps) <= 2 else 0.5
        # Fixed at 1.0 in this excerpt.
        error_recovery = 1.0
        combined = round((correctness + efficiency + error_recovery) / 3, 3)
        return combined, combined >= 0.75, ...
That pattern is important because two agents can produce the same final answer while taking very different paths to get there.
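To make that concrete, here is a self-contained sketch that applies the same weighting to two trajectories ending in the same message. The `TrajectoryStep` shape and the `score_trajectory` name are assumptions for this example; the project's own types may differ.

```python
from dataclasses import dataclass


@dataclass
class TrajectoryStep:
    kind: str      # hypothetical shape: "tool" or "message"
    content: str


def score_trajectory(steps: list[TrajectoryStep]) -> float:
    # Same weighting as the scoring snippet, with error_recovery fixed at 1.0.
    correctness = 1.0 if any(s.kind == "message" and s.content == "done" for s in steps) else 0.0
    efficiency = 1.0 if len(steps) <= 2 else 0.5
    return round((correctness + efficiency + 1.0) / 3, 3)


# Both agents finish with the same "done" message, but one takes four tool calls to get there.
direct = [TrajectoryStep("tool", "lookup"), TrajectoryStep("message", "done")]
wandering = [TrajectoryStep("tool", "lookup")] * 4 + [TrajectoryStep("message", "done")]

direct_score = score_trajectory(direct)        # 1.0
wandering_score = score_trajectory(wandering)  # 0.833
```

A final-answer check would score these two runs identically; the trajectory score separates them.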
Example: Async Eval Runner
The runner illustrates a practical pattern for scaling up evals:
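    import asyncio

    async def run_eval_runner(cases, worker, concurrency: int = 5):
        sem = asyncio.Semaphore(concurrency)
        results = []

        async def guarded(case):  # assumed helper, not shown in the excerpt: caps in-flight work
            async with sem:
                return await worker(case)

        for coro in asyncio.as_completed([guarded(case) for case in cases]):
            results.append(await coro)
        return results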
The useful bit here is not just parallelism. It is the combination of:
- concurrency control
- retries
- incremental saving
- progress visibility
Example: RAG Metrics
The RAG project splits evaluation into multiple lenses:
- faithfulness
- answer relevance
- context recall
- context precision
That separation is important because “the answer is good” and “retrieval is good” are not the same thing.
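The two retrieval-side metrics can be sketched with plain set overlap over chunk IDs. These are simplified definitions assumed for illustration; the project may compute them differently, and faithfulness and answer relevance usually need a judge model rather than a set comparison.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    # Share of retrieved chunks that are actually relevant.
    if not retrieved:
        return 0.0
    return sum(1 for chunk_id in retrieved if chunk_id in relevant) / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    # Share of relevant chunks that retrieval managed to surface.
    if not relevant:
        return 1.0
    return len(relevant & set(retrieved)) / len(relevant)
```

A retriever that returns many chunks can score high on recall and low on precision, which is exactly the kind of difference a single "answer quality" number hides.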
How To Extend The Project
If you fork EvalMaster, some useful additions would be:
- a richer judge abstraction with strict, lenient, and expert personas
- real embeddings for semantic similarity
- an artifact store for historical eval runs
- SQLite-backed case/result persistence
- a dashboard with trend charts
- model routing evals with cost-aware selection
- prompt sensitivity experiments
- a human review queue with approvals
CLI Entry Point
To make the curriculum easier to explore, EvalMaster now includes a simple typer-based CLI.
    python3 cli.py list
    python3 cli.py run foundations
    python3 cli.py run projects
This keeps the learning path discoverable without forcing people to remember individual file paths.
How To Contribute
The easiest path to contributing is:
- Pick one module.
- Add one more case or metric.
- Update the tests.
- Run the relevant script and inspect the output.
- Submit a PR that explains why the new eval matters.
That is a good contribution model because it keeps each change small and verifiable: every improvement should tighten the feedback loop, not just add more code.
Why This Matters In Practice
The real value of evals shows up when you need to answer a production question quickly:
- Did this model update break something?
- Is the new prompt better for one task but worse for another?
- Are we handling malformed inputs safely?
- Can we ship this without relying on manual spot checks?
EvalMaster is a compact way to practice that thinking.
Code & more: https://www.dailybuild.xyz/project/182-eval-master