Sivananda Panda

Posted on Jul 2

The Evaluation Layer: The Part of Your LLM System You Keep Skipping

#ai #llm #rag #systemdesign

I've built two agentic AI systems over the past few months, and despite solving very different problems, both exposed the same weakness. The agents worked perfectly during demos. They passed my manual tests. They looked production-ready. But once real users started interacting with them, confidently wrong responses began slipping through. The root cause wasn't the model, the prompts, or the tools. It was much simpler: there was no proper evaluation layer. The system could tell whether it had produced an answer, but it had no way to determine whether that answer was actually good.

This article is about the layer that closes that gap. What it is, why LLM systems specifically can't survive without it, the five ways people actually build it, the tooling, and the mistakes that make an eval layer worse than useless — because a miscalibrated eval gives you false confidence, which is more dangerous than no eval at all.

I'll use a RAG-and-agent stack for the concrete examples (LangGraph, Claude, LangSmith), but nothing here is framework-specific. The principles move to any stack you like.

Why LLM systems need this and normal software doesn't

Here's the uncomfortable property that breaks every habit you brought from traditional engineering: the same input can produce different outputs, and a wrong output usually looks exactly like a right one.

In a normal codebase, add(2, 2) returns 4 or it returns a bug you can see. The failure has a shape — a stack trace, a null, a red test. You write an assertion, it passes forever, you move on. LLM failures don't have that shape. A hallucinated citation, a subtly-off summary, a tool call with the wrong argument that still "sort of" works — these render as fluent, plausible, professional-looking text. The failure is camouflaged inside a success.

Three things follow from that, and they're the whole reason the eval layer exists:

Non-determinism means one passing run proves nothing. You need distributions, not point checks.
The interesting failures live in the middle. An agent that made 12 tool calls and 4 reasoning hops can reach a correct answer through completely broken logic — and reach a wrong one next time from the same code.
Nobody sees the middle by default. The end user gets the final answer. The reasoning, the retrieval, the tool arguments — all invisible unless you deliberately capture them.
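To make "distributions, not point checks" concrete, here is a minimal sketch that runs the same input many times and reports a pass rate instead of a single pass/fail. The names `pass_rate` and `flaky_agent` are hypothetical stand-ins, and the 80% figure is only there to simulate a non-deterministic agent.

```python
import random
from typing import Callable


def pass_rate(
    run_once: Callable[[], str],
    is_correct: Callable[[str], bool],
    n: int = 20,
) -> float:
    """Run the same input n times and return the fraction of runs that pass."""
    passes = sum(1 for _ in range(n) if is_correct(run_once()))
    return passes / n


# Stand-in for a non-deterministic agent: returns the right answer most of the time.
rng = random.Random(0)


def flaky_agent() -> str:
    return "4" if rng.random() < 0.8 else "5"


rate = pass_rate(flaky_agent, lambda out: out == "4", n=50)
print(f"pass rate over 50 runs: {rate:.2f}")
```

A single call to `flaky_agent()` would most likely look fine; only the rate over many runs shows how often it is wrong.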

So the eval layer isn't a testing afterthought. It's the observability and quality infrastructure that lets you answer three questions on a continuous basis: is the system doing the right thing (correctness), doing it efficiently (cost and latency), and doing it safely (no leaked secrets, no disallowed tools, no confident nonsense). Skip it and you're not shipping a product; you're running an ongoing experiment on your users without reading the results.

What the layer actually is

It's tempting to picture "the eval layer" as one service you bolt on at the end. It isn't. It's a cross-cutting concern that taps into every stage of execution and scores the trace, not just the output.

Three moving parts. Tracers capture what happened at every step — inputs, outputs, tool arguments, retrieved chunks, latencies, token counts. Scorers turn those captured artifacts into numbers — a faithfulness score, a latency measurement, a pass/fail on a schema. Evals are the curated sets and thresholds that give those numbers meaning: is 0.81 faithfulness good, and is it better or worse than last week? Lose any one of the three and the other two stop being useful.
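One way to picture how the three parts fit together is the toy sketch below. The class and function names (`Step`, `Trace`, `latency_scorer`, `Eval`) are illustrative assumptions, not any particular library's API.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    """What a tracer captures for one node."""
    name: str
    output: str
    latency_ms: float


@dataclass
class Trace:
    steps: list[Step] = field(default_factory=list)


def latency_scorer(trace: Trace) -> float:
    """A scorer turns captured artifacts into a number: here, total latency."""
    return sum(step.latency_ms for step in trace.steps)


@dataclass
class Eval:
    """An eval gives a scorer's number meaning by attaching a threshold."""
    name: str
    scorer: Callable[[Trace], float]
    max_allowed: float

    def check(self, trace: Trace) -> bool:
        return self.scorer(trace) <= self.max_allowed


trace = Trace(steps=[Step("planner", "plan", 120.0), Step("retriever", "docs", 340.0)])
latency_eval = Eval(name="total_latency", scorer=latency_scorer, max_allowed=1000.0)
print(latency_eval.name, latency_scorer(trace), latency_eval.check(trace))
```

Remove the trace and the scorer has nothing to read; remove the threshold and the number has no verdict attached.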

Five ways to build it

These aren't competing options where you pick one. A mature system runs all five, at different frequencies. But you'll add them in roughly this order, cheapest and most objective first.

1. Deterministic evals — start here, not with an LLM

The instinct is to reach for an LLM judge immediately because the outputs are "fuzzy." Resist it. A surprising amount of what you care about is not fuzzy at all, and plain code checks it faster, cheaper, and without any of the reliability problems a judge brings.

Did the agent call only allowed tools? Did the structured output match the schema? Did it stay under the tool-call budget? Did the JSON parse? These have crisp right answers, and a regular function is the correct instrument.

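from dataclasses import dataclass, field

MAX_TOOL_CALLS = 10  # assumed budget; tune per use case
ALLOWED_TOOLS = {"search", "calculator", "fetch_document"}

@dataclass
class EvalResult:
    passed: bool
    score: float
    details: list[str] = field(default_factory=list)

def eval_tool_call_validity(trace: "AgentTrace") -> EvalResult:
    """Every tool call must use an allowed tool, and the trace must respect the budget."""
    # AgentTrace is assumed to expose .steps, each with .type and .tool_name.
    tool_calls = [step for step in trace.steps if step.type == "tool_call"]
    violations = [
        f"Disallowed tool: {step.tool_name}"
        for step in tool_calls
        if step.tool_name not in ALLOWED_TOOLS
    ]
    # The budget applies to the whole trace, so count the calls once.
    if len(tool_calls) > MAX_TOOL_CALLS:
        violations.append(f"Exceeded call budget: {len(tool_calls)} > {MAX_TOOL_CALLS}")

    return EvalResult(
        passed=not violations,
        score=1.0 if not violations else 0.0,
        details=violations,
    )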

These cost microseconds and never disagree with themselves. Run them on every single trace in production, not just in CI. They're your smoke detectors.
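The schema and "did the JSON parse?" checks work the same way. Here is a minimal sketch using only the standard library; the `REQUIRED_FIELDS` schema and the function name are hypothetical, and a real system might prefer a proper schema validator.

```python
import json

# Hypothetical output contract: field name -> expected Python type.
REQUIRED_FIELDS = {"answer": str, "sources": list}


def eval_output_schema(raw_output: str) -> list[str]:
    """Return a list of schema problems; an empty list means the output passed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return [f"Invalid JSON: {exc.msg}"]

    if not isinstance(data, dict):
        return ["Top-level value is not an object"]

    problems = []
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"Missing field: {name}")
        elif not isinstance(data[name], expected_type):
            problems.append(f"Wrong type for {name}")
    return problems


print(eval_output_schema('{"answer": "Paris", "sources": ["doc-1"]}'))
print(eval_output_schema("Sure! Here is the answer: Paris"))
```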

2. LLM-as-judge — powerful, and the thing most likely to lie to you

For the genuinely subjective stuff — tone, helpfulness, whether an answer is faithful to its source — you hand the output to a separate model with a rubric and collect structured scores.

from anthropic import Anthropic
import json

client = Anthropic()

JUDGE_PROMPT = """You are evaluating an AI agent's response.

User query: {query}
Agent response: {response}
Retrieved context: {context}

Score each dimension 1-5, and cite the specific text that justifies your score.
- Faithfulness: Is every claim grounded in the provided context?
- Relevance:    Does it address what the user actually asked?
- Completeness: Does it fully answer the question?

Respond ONLY as JSON:
{{"faithfulness": N, "relevance": N, "completeness": N, "evidence": "...", "reasoning": "..."}}"""

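def llm_judge(query: str, response: str, context: str) -> dict:
    """Score one response against the rubric and return the parsed JSON verdict."""
    result = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=600,
        messages=[{
            "role": "user",
            "content": JUDGE_PROMPT.format(query=query, response=response, context=context),
        }],
    )
    raw = result.content[0].text
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:
        # Judges sometimes wrap the JSON in prose; fail loudly instead of scoring garbage.
        raise ValueError(f"Judge returned non-JSON output: {raw[:200]!r}") from exc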

Notice I made the judge cite evidence and reason, not just emit a number. That's not decoration. A bare score is unfalsifiable; a score with quoted evidence can be audited when you disagree with it.

Here's the part teams underrate: LLM judges have the same weaknesses as the model being judged. They reward length. They're swayed by confident phrasing. They wave through fluent errors — the exact failure mode you built the judge to catch. An uncalibrated judge doesn't measure quality, it launders your existing biases into an official-looking number. So the judge is not the end of the work; it's a component that itself needs validating against human labels before you trust a word it says. More on that in the mistakes section, because it's the one people get wrong most often.

3. Trace-level evaluation — stop grading the black box

Grading only the final answer is like reviewing a math test by checking the last number and ignoring the working. The number can be right for the wrong reasons and wrong for the same reasons next time.

The fix is to instrument every step and evaluate the whole trace. Observability platforms — LangSmith, Langfuse, Arize Phoenix, W&B Weave — exist for exactly this. You wrap your nodes and they capture the tree of execution:

from langsmith import traceable, Client

ls = Client()

@traceable(run_type="chain", name="research_agent")
def run_agent(query: str) -> str:
    plan   = planner.plan(query)          # captured as a child run
    docs   = retriever.fetch(plan.query)  # captured as a child run
    answer = generator.generate(docs)     # captured as a child run
    return answer

def score_run(run_id: str, scores: dict):
    ls.create_feedback(
        run_id=run_id,
        key="faithfulness",
        score=scores["faithfulness"],
        comment=scores["reasoning"],
    )

The payoff is diagnostic, not just descriptive. When faithfulness tanks, the trace tells you where: the retriever pulled garbage, or the retriever was fine and the generator ignored it. Those are two completely different bugs with two completely different fixes, and output-only evals can't tell them apart.
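As a rough illustration of that diagnosis, the sketch below takes a per-step retrieval score and the final faithfulness score and points at the likely culprit. The function name and the 0.7 threshold are assumptions for the example, not recommended values.

```python
def localize_failure(
    retrieval_score: float,
    faithfulness: float,
    threshold: float = 0.7,
) -> str:
    """Name the step most likely responsible when faithfulness is low."""
    if faithfulness >= threshold:
        return "ok"
    if retrieval_score < threshold:
        # Bad context in, bad answer out: fix retrieval first.
        return "retriever"
    # Context was fine but the answer drifted from it.
    return "generator"


print(localize_failure(retrieval_score=0.35, faithfulness=0.40))
print(localize_failure(retrieval_score=0.92, faithfulness=0.40))
```

With output-only scoring, both calls above would show the same 0.40 and nothing else.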

4. Human-in-the-loop — the ground truth everything else calibrates against

Automated evals are necessary and insufficient. Humans catch what code and judges structurally cannot: domain inaccuracies a generalist judge waves through, tone that's wrong for your specific users, edge cases that are technically correct and practically useless. And critically — human labels are the yardstick you use to check whether your LLM judge is any good. Without them, every other layer is measuring against itself.

Four things separate a useful human-eval pipeline from a box-ticking one:

Sample successes, not just failures. "The agent returned an answer" is not evidence the answer was good. Review 5–10% of all production traces weekly, including the ones nobody complained about.
Stratify the sample. Cover query types, user segments, and tool paths. Review only the easy queries and you'll build a rosy, useless picture.
Write annotation rubrics that force a decision. "Was this response good?" produces noise. "Did the response answer the user's stated question without inventing information they didn't ask for?" produces labels you can act on.
Measure agreement between annotators. Run the same trace past two reviewers now and then. If they disagree more than one time in five, the problem is your rubric, not your reviewers — sharpen it before you collect more labels.
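Two of those points are easy to sketch in code: stratified sampling and the annotator disagreement check. The trace dictionaries and the `query_type` key below are hypothetical; the one-in-five rule from the list corresponds to a disagreement rate of 0.2.

```python
import random
from collections import defaultdict


def stratified_sample(traces: list[dict], key: str, per_stratum: int, seed: int = 0) -> list[dict]:
    """Pick up to per_stratum traces from every value of key, not just the common ones."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for trace in traces:
        buckets[trace[key]].append(trace)

    sample = []
    for stratum in sorted(buckets):
        items = buckets[stratum]
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample


def disagreement_rate(labels_a: list[str], labels_b: list[str]) -> float:
    """Fraction of traces where two annotators gave different labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(a != b for a, b in zip(labels_a, labels_b)) / len(labels_a)


traces = [{"id": i, "query_type": "easy" if i < 8 else "hard"} for i in range(10)]
sample = stratified_sample(traces, key="query_type", per_stratum=2)
print([t["query_type"] for t in sample])

rate = disagreement_rate(["y", "y", "n", "n", "y"], ["y", "n", "n", "n", "y"])
print(f"disagreement: {rate:.2f}")
```

A plain random sample of four from these ten traces could easily contain no hard queries at all; the stratified version always includes them.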

5. Regression and red-teaming — so tomorrow's fix doesn't quietly break today's feature

Every prompt tweak is a potential regression, and LLM regressions are invisible without a locked baseline to compare against. A regression suite is a curated set of (input, expected) pairs you run on every deploy. Red-teaming is its adversarial twin: inputs deliberately built to break things — prompt injection, context-stuffing, multi-hop manipulation, tool misuse. In agentic systems these are nastier than in a single call, because one poisoned step can cascade through every step that follows it.

evals/
  regression/
    core_qa.jsonl          # 50 representative Q&A pairs
    tool_use.jsonl         # 30 traces exercising tool-call patterns
    edge_cases.jsonl       # 20 known-difficult inputs
    adversarial.jsonl      # 15 red-team prompts
  run_suite.py             # runner that diffs scores against baseline
  baseline_scores.json     # locked scores from the last known-good release

Run it in CI before every deploy. Alert when any category drops more than ~5% from baseline. The exact number matters less than the fact that there is one and something screams when you cross it.
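The comparison itself can be very small. This is a sketch of what a baseline diff might look like, assuming scores are stored per category and that "drop" means relative drop; it is not the actual `run_suite.py`, and the category names simply mirror the layout above.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop: float = 0.05,
) -> dict[str, float]:
    """Return categories whose score fell by more than max_drop relative to baseline."""
    regressions = {}
    for category, base in baseline.items():
        # A category missing from the current run counts as a score of 0.
        now = current.get(category, 0.0)
        drop = (base - now) / base if base > 0 else 0.0
        if drop > max_drop:
            regressions[category] = round(drop, 4)
    return regressions


baseline = {"core_qa": 0.90, "tool_use": 0.80}
current = {"core_qa": 0.89, "tool_use": 0.70}
print(find_regressions(baseline, current))
```

In CI, a non-empty result would be the signal to exit with a non-zero status and fail the build.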

Six mistakes that turn an eval layer into a liability

An eval layer isn't automatically good. A bad one is worse than none, because "our scores are green" is a sentence that stops people from looking closer. These are the six failure modes I see most.

Grading only the final output. Covered above, but it's the number-one blind spot so it bears repeating: a correct answer reached through a broken trace is a landmine, not a pass. Evaluate every node, not just the terminal one.

Collapsing everything into one score. A composite of 0.72 tells you something is wrong and nothing about what. Is it faithfulness? Latency? Tool selection? Keep the dimensions separate, track them separately, set thresholds separately. You can always average later; you can't un-average.
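A small worked example shows why. The two runs below are made-up numbers chosen so that both average to 0.72, yet they fail for entirely different reasons; the per-dimension floors are also hypothetical.

```python
from statistics import mean

# Hypothetical per-dimension scores for two runs.
run_a = {"faithfulness": 0.95, "relevance": 0.90, "tool_selection": 0.31}
run_b = {"faithfulness": 0.72, "relevance": 0.72, "tool_selection": 0.72}

# Hypothetical per-dimension minimums.
floors = {"faithfulness": 0.80, "relevance": 0.70, "tool_selection": 0.60}


def failing_dimensions(scores: dict[str, float], floors: dict[str, float]) -> list[str]:
    """List the dimensions that fall below their own floor."""
    return sorted(name for name, floor in floors.items() if scores.get(name, 0.0) < floor)


for label, run in (("run_a", run_a), ("run_b", run_b)):
    print(label, round(mean(run.values()), 2), failing_dimensions(run, floors))
```

The composite is identical for both runs; only the separate dimensions reveal that one has a tool-selection problem and the other a faithfulness problem.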

Evaluating on your training distribution. If your eval set is made from the same queries you used to build and tune the system, you've optimized to the eval, not to reality — Goodhart's Law wearing a lab coat. Hold out a blind set that never touches development, and keep topping it up with real production samples. Treat eval-set contamination as seriously as data leakage.

Trusting an uncalibrated judge. The single most common self-inflicted wound. Before an LLM judge scores anything that matters, run it against 100+ human-labeled examples and compute how often it agrees with humans. Disagreement above ~15%? Fix the judge prompt before it goes anywhere near your dashboard. A judge you haven't validated is a confidence machine, not a measurement.
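The calibration check is a few lines once the human labels exist. This sketch assumes both humans and the judge have been reduced to pass/fail verdicts on the same examples; the function names are hypothetical, and the 100-example and 15% defaults come from the rule of thumb above.

```python
def judge_disagreement(human_pass: list[bool], judge_pass: list[bool]) -> float:
    """Fraction of examples where the judge's verdict differs from the human label."""
    if len(human_pass) != len(judge_pass):
        raise ValueError("label lists must be the same length")
    if not human_pass:
        raise ValueError("need at least one labeled example")
    mismatches = sum(h != j for h, j in zip(human_pass, judge_pass))
    return mismatches / len(human_pass)


def judge_is_calibrated(
    human_pass: list[bool],
    judge_pass: list[bool],
    max_disagreement: float = 0.15,
    min_examples: int = 100,
) -> bool:
    """Trust the judge only with enough labels and low enough disagreement."""
    if len(human_pass) < min_examples:
        return False
    return judge_disagreement(human_pass, judge_pass) <= max_disagreement


# Made-up labels: the judge is stricter than humans on 5 of 100 examples.
human = [True] * 90 + [False] * 10
judge = [True] * 85 + [False] * 15
print(judge_disagreement(human, judge), judge_is_calibrated(human, judge))
```

Raw agreement is the simplest possible measure. If the labels are heavily skewed toward one class, a chance-corrected statistic would be a reasonable next step.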

No baseline, no history. Scores you don't persist can't answer "better or worse than last week?" Every run should write to a durable store with a timestamp, run ID, and git SHA. A SQLite table is enough — the discipline of storing beats the sophistication of the storage.
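For a sense of how little is needed, here is a minimal sketch of such a store using Python's built-in `sqlite3` module. The table and column names are assumptions for illustration.

```python
import sqlite3
from datetime import datetime, timezone


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the score store. Use a file path for a durable store."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS eval_scores (
               run_id      TEXT NOT NULL,
               git_sha     TEXT NOT NULL,
               metric      TEXT NOT NULL,
               score       REAL NOT NULL,
               recorded_at TEXT NOT NULL
           )"""
    )
    return conn


def record_score(conn: sqlite3.Connection, run_id: str, git_sha: str, metric: str, score: float) -> None:
    """Persist one score with its run ID, git SHA, and a UTC timestamp."""
    recorded_at = datetime.now(timezone.utc).isoformat()
    conn.execute(
        "INSERT INTO eval_scores VALUES (?, ?, ?, ?, ?)",
        (run_id, git_sha, metric, score, recorded_at),
    )
    conn.commit()


def metric_history(conn: sqlite3.Connection, metric: str) -> list[tuple]:
    """Return (git_sha, score) pairs for one metric, oldest first."""
    rows = conn.execute(
        "SELECT git_sha, score FROM eval_scores WHERE metric = ? ORDER BY recorded_at, rowid",
        (metric,),
    )
    return rows.fetchall()


conn = open_store()
record_score(conn, "run-1", "abc123", "faithfulness", 0.81)
record_score(conn, "run-2", "def456", "faithfulness", 0.78)
print(metric_history(conn, "faithfulness"))
```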

Treating evals as a one-time setup. Eval suites rot. User behavior drifts, new failure modes appear, and a suite built in month one is riddled with blind spots by month six. Add a case from every production failure within a couple of days of finding it, and audit the whole suite quarterly. It's a living artifact, not a shipped deliverable.

Setting it up: a practical walkthrough

Opinionated defaults for a LangGraph production agent. Swap the tools freely; the sequence is the point.

Step 1 — Define "good" in prose before you write any eval code. For each capability, answer on paper: what does a correct output look like, what are the common failure modes, what constraints must always hold, and what separates "barely acceptable" from "excellent" in a domain expert's eyes? This costs half a day. Skip it and you'll spend weeks precisely measuring the wrong things.

Step 2 — Instrument the graph. Wrap every node. Capture node name, input, output, latency, token counts, and any tool calls with arguments and returns.

from langsmith import traceable

# AgentState, llm, and retriever are assumed to be defined elsewhere in the app.
@traceable(run_type="llm", name="planner_node")
def planner_node(state: AgentState) -> AgentState:
    response = llm.invoke(state["messages"])
    # Spread the old state first so a stale "plan" cannot overwrite the new one.
    return {**state, "plan": response.content}

@traceable(run_type="tool", name="retriever_node")
def retriever_node(state: AgentState) -> AgentState:
    docs = retriever.invoke(state["plan"])
    return {**state, "context": docs}

Step 3 — Build the suite in three tiers, by frequency.

Tier 1 — every trace, real time: schema validation, tool-call constraints, latency thresholds. Cheap deterministic code.
Tier 2 — daily or per release: LLM-judge scores on a sample, retrieval metrics (precision@k, NDCG), end-to-end correctness on the regression suite.
Tier 3 — weekly or per major release: stratified human review, red-team annotation, edge-case triage.

Step 4 — Set thresholds before production, not during the incident. For every metric, define a green zone (normal), a yellow zone (investigate, don't block), and a red zone (block the deploy or page someone). "We'll know bad when we see it" is the sentence people say right before a bad week.
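The three zones can be encoded as plain data so they live in version control next to the suite. This is a sketch for higher-is-better metrics only; the `Threshold` class and the example cut-offs are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Cut-offs for a metric where higher scores are better."""
    yellow_below: float  # below this: investigate, don't block
    red_below: float     # below this: block the deploy or page someone


def zone(score: float, threshold: Threshold) -> str:
    """Map a score to green, yellow, or red."""
    if score < threshold.red_below:
        return "red"
    if score < threshold.yellow_below:
        return "yellow"
    return "green"


faithfulness = Threshold(yellow_below=0.85, red_below=0.75)
for score in (0.90, 0.80, 0.70):
    print(score, zone(score, faithfulness))
```

Lower-is-better metrics such as latency would need the comparisons flipped.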

Step 5 — Wire it into CI/CD. An eval that only runs when someone remembers doesn't run.

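# .github/workflows/eval.yml
name: Eval Suite
on: [push]
jobs:
  run-evals:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"  # assumed version; match your project
      - name: Install dependencies
        run: pip install -r requirements.txt  # assumes a requirements.txt at the repo root
      - name: Run regression suite
        run: python evals/run_suite.py --compare-to evals/baseline_scores.json
      - name: Check score thresholds
        run: python evals/check_thresholds.py --fail-on-regression 5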

What to measure and what to keep

Split metrics into three families and never blend them into one number.

Correctness

Metric	What it measures	How to compute
Faithfulness	Is every claim grounded in retrieved context, with nothing hallucinated?	LLM judge or NLI model comparing response to context
Answer relevance	Does the response address the actual question?	LLM judge or embedding similarity between query and response
Context precision	Of the chunks retrieved, what fraction were useful?	Human or LLM label per chunk
Context recall	Did retrieval surface everything needed to answer?	Compare retrieved set against a gold document set
Tool-call accuracy	Right tools, right arguments?	Deterministic diff against an expected tool trace
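The two retrieval metrics in the table reduce to set arithmetic once the relevance labels exist. A minimal sketch, with made-up document IDs:

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Of the top-k retrieved chunks, what fraction were actually useful?"""
    top_k = retrieved[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant) / len(top_k)


def context_recall(retrieved: list[str], gold: set[str]) -> float:
    """Of the chunks needed to answer, what fraction did retrieval surface?"""
    if not gold:
        return 1.0  # nothing was needed, so nothing was missed
    return len(set(retrieved) & gold) / len(gold)


retrieved = ["d1", "d7", "d3", "d9"]
gold = {"d1", "d3", "d5"}
print(precision_at_k(retrieved, gold, k=4))
print(context_recall(retrieved, gold))
```

Here half of the retrieved chunks were useful, and one needed chunk (`d5`) never showed up: two separate problems that a single blended number would hide.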

Efficiency

Metric	What it measures	Target or how to track
Latency (p50/p95/p99)	User-perceived speed	Track trends; set SLOs per use case
Token consumption	Cost per query	Input + output tokens per run
Tool-call count	Wasted calls	Compare to the minimum viable count
Retry rate	How often steps fail and rerun	Under ~5% in steady state
Context-window utilization	How full the window runs	High → truncation risk
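Latency percentiles are one of the few metrics here that the standard library computes directly. A small sketch using `statistics.quantiles`; the sample latencies are synthetic.

```python
import statistics


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Return p50, p95, and p99 from a list of per-request latencies."""
    # n=100 yields 99 cut points; index i-1 holds the i-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic data: 100 requests taking 1 ms to 100 ms.
latencies = [float(ms) for ms in range(1, 101)]
percentiles = latency_percentiles(latencies)
print(percentiles)
```

Tracking the tail (p95 and p99) alongside the median matters because a handful of slow agent runs can sit behind a healthy-looking p50.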

Safety and reliability

Metric	What it measures
Hallucination rate	% of responses with claims unsupported by context
Refusal rate	% of valid queries wrongly refused
Task-completion rate	% of queries reaching a terminal answer
Error rate by type	Tool failures, timeouts, parse errors — broken out, not summed
Constraint-violation rate	% of runs breaking a defined rule (e.g. a disallowed tool)

And keep the artifacts, because a score with no trace behind it is impossible to debug. For every run, preserve: the full execution trace (at least 30 days of production), per-dimension scores linked to run ID and git SHA, the judge's reasoning (not just its number — this is gold when you contest a score), failure cases tagged by type, the versioned eval-suite definition, the current baseline snapshot, and human annotation logs with annotator ID and timestamp.

Picking your tooling

Tool	Best for	The catch
LangSmith	LangChain/LangGraph shops wanting tight integration	Vendor lock-in; price scales with trace volume
Langfuse	Open-source, self-hostable	More setup; smaller ecosystem
Arize Phoenix	Teams already on Arize for ML monitoring	Stronger on classic ML; newer for LLMs
W&B Weave	Teams already living in Weights & Biases	Natural fit if you also fine-tune
RAGAS	RAG metrics out of the box	Narrow scope — mostly retrieval + generation
Custom (SQLite + an SDK)	Maximum control, minimal dependency	You own the build and the maintenance

My honest default for a LangGraph production system: LangSmith for tracing, RAGAS for RAG-specific metrics, and a small custom Python runner for the deterministic checks. Add human-eval tooling — even a scrappy Streamlit annotation app — once the system is past its first real users. Don't buy the enterprise platform on day one; you don't yet know what you're measuring.

The actual mindset shift

The thing most teams get wrong isn't a tool choice. It's timing. They treat evaluation as a phase that comes after building, and by then the design decisions that would have made the system measurable are already baked in.

Flip it. Evaluation is a lens you hold up while building. When you write a new node, the first question isn't "does this code run?" — it's "how will I know if this node is doing the right thing next Tuesday, in production, on a query I haven't seen?" When you tune a prompt, you don't eyeball three examples and ship on a good feeling; you run the suite and read the diff.

That's the whole difference between a demo and a system you can put your name on. Without an eval layer you're steering on vibes, and vibes don't survive contact with real traffic. With one, every decision has evidence under it. Build it early, treat it as first-class engineering rather than QA cleanup, and never push a change to production without knowing what your scores say about it.

Checklist

Before first deploy

[ ] Eval criteria written down for each capability
[ ] Every node instrumented with tracing
[ ] Tier-1 deterministic evals live (schema, constraints, latency)
[ ] Regression suite built (50+ curated examples)
[ ] Green/yellow/red thresholds set for every metric
[ ] Regression suite wired into CI/CD
[ ] Baseline scores locked

Weekly

[ ] Human review of a random 5–10% production sample
[ ] Post-mortem on every red-zone incident
[ ] New failure cases folded into the suite

Quarterly

[ ] Suite audit — cut stale cases, close coverage gaps
[ ] Re-calibrate LLM judges against fresh human labels
[ ] Revisit thresholds — still the right lines?
[ ] Run a red-team exercise and act on what breaks

DEV Community