How a two-week evaluation design sprint almost ended with us switching tools entirely — and what we learned from not doing that.
There’s a particular kind of confidence that sneaks up on you when you’re building an LLM agent. You test it manually a few times, it gives reasonable answers, and you think: okay, this works. Then someone on the team asks, “but how do you know it works?” and suddenly that confidence gets a lot wobblier.
That question is what led us down the rabbit hole of LLM evaluation infrastructure for our agent system — a multi-layer, tool-heavy setup built on LangGraph, running in AWS Bedrock AgentCore, with a FastMCP server handling tool calls. We had three distinct layers — conversation, orchestration, and search — and each of them could fail in different, non-obvious ways. “It works” wasn’t good enough. We needed proof.
This is the story of how we built the eval stack, nearly replaced it entirely, and ended up with a decision framework that I think applies well beyond our specific setup.
1. The Problem: What Does “Eval” Even Mean for an Agent?
Here’s something that doesn’t get said enough: evaluation for an LLM agent is not one thing. It’s at least three.
In a traditional software system, you write unit tests for functions, integration tests for services, and end-to-end tests for flows. An agent has the same stratification — except that the “functions” are probabilistic, the “services” are external model APIs, and the “flows” involve multi-turn conversations with context that mutates across turns.
For our LangGraph-based agent, we identified three evaluation concerns that had to be addressed independently:
- Conversation quality — Is the final response accurate? Is it grounded in the retrieved context? Is it relevant to what the user actually asked?
- Orchestration quality — Is the agent routing to the right tools? Is it invoking them with correct parameters? Is it retrying sensibly when something goes wrong?
- Search/retrieval quality — When a RAG-like tool call happens, is the context that comes back actually useful? Is the retrieved content faithful to the source?
A monolithic evaluator that just looks at the final output misses everything in the middle. You can have a response that looks good but was assembled from hallucinated intermediate steps, or a retrieval call that returned garbage that the LLM happened to paper over with prior knowledge. You won’t catch either of those without layer-specific evals.
This is why we needed a structured approach rather than ad hoc testing.
2. The Test Fixtures: .eval.yaml
Before picking any tools, we needed a consistent format for defining what we were testing. We settled on a simple convention: one YAML file per test case, tied to the ticket or story that motivated it, with two required fields:
```yaml
# example: booking_intent.eval.yaml
test_input:
  conversation_id: "test-001"
  user_message: "I need to extend my stay by two nights, checking out on Friday instead of Wednesday."
  session_context:
    property_id: "prop_42"
    current_checkout: "2024-11-20"

success_criteria:
  - type: contains_tool_call
    tool: "modify_reservation"
    with_params:
      new_checkout: "2024-11-22"
  - type: llm_judge
    metric: response_relevance
    threshold: 0.85
  - type: deterministic
    check: no_hallucinated_dates
```
The test_input captures a realistic scenario — not a synthetic toy, but something derived from actual usage patterns or edge cases that were reported. The success_criteria is a mixed list of deterministic checks (did the right tool get called with the right parameters?) and LLM-judged metrics (is the response relevant, faithful, grounded?).
Why separate YAML files per test case rather than a big test suite file? A few reasons: they’re easier to review in PRs, they can be tagged and filtered independently, and they map cleanly to the tickets or stories that motivated them. When a new edge case surfaces in production, you create one new file and the eval pipeline picks it up automatically.
3. First Proposal: LangFuse + Ragas + DeepEval
Our initial evaluation stack combined three tools, each with a distinct role:
LangFuse handles tracing and observability. Every LangGraph node execution gets captured — what went in, what came out, how long it took, what the token counts looked like. It’s the backbone that gives you visibility into what the agent actually did, not just what it said.
Ragas provides the core RAG-oriented metrics. Faithfulness (is the response supported by the retrieved context?), answer relevance (does the answer actually address the question?), context precision, context recall. These are the metrics that matter for retrieval-augmented flows.
DeepEval fills in the rest — hallucination detection, task-specific metrics, toxicity checking, and the ability to define custom metrics with your own rubrics. It also provides the test runner infrastructure that ties everything together.
A Docker Compose Setup to Get This Running Locally
```yaml
version: "3.9"

services:
  langfuse:
    image: langfuse/langfuse:latest
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgresql://langfuse:langfuse@postgres:5432/langfuse
      - NEXTAUTH_SECRET=your-secret-here
      - NEXTAUTH_URL=http://localhost:3000
      - SALT=your-salt-here
    depends_on:
      - postgres

  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: langfuse
      POSTGRES_PASSWORD: langfuse
      POSTGRES_DB: langfuse
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  pgdata:
```
The Python-side wiring looks like this:
```python
# eval_runner.py
import os

import yaml
from langfuse.callback import CallbackHandler
from deepeval import evaluate
from deepeval.metrics import (
    AnswerRelevancyMetric,
    FaithfulnessMetric,
    HallucinationMetric,
)
from deepeval.test_case import LLMTestCase
from ragas import evaluate as ragas_evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

LANGFUSE_HANDLER = CallbackHandler(
    public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
    secret_key=os.environ["LANGFUSE_SECRET_KEY"],
    host="http://localhost:3000",
)


def load_eval_fixture(path: str) -> dict:
    with open(path) as f:
        return yaml.safe_load(f)


def run_agent_with_tracing(test_input: dict) -> dict:
    """Run the LangGraph agent with LangFuse tracing attached."""
    from your_agent import graph  # your compiled LangGraph graph

    result = graph.invoke(
        {"messages": [{"role": "user", "content": test_input["user_message"]}]},
        config={"callbacks": [LANGFUSE_HANDLER]},
    )
    return result


def evaluate_response(fixture: dict, agent_output: dict):
    test_case = LLMTestCase(
        input=fixture["test_input"]["user_message"],
        actual_output=agent_output["final_response"],
        retrieval_context=agent_output.get("retrieved_chunks", []),
    )
    metrics = [
        AnswerRelevancyMetric(threshold=0.8),
        FaithfulnessMetric(threshold=0.8),
        HallucinationMetric(threshold=0.3),
    ]
    evaluate([test_case], metrics)
```
The callback handler is the key integration point — LangFuse hooks into every LangGraph step automatically through the callbacks mechanism, so you get full trace visibility without changing your agent code.
4. The Reviewer Comments That Shaped the Design
We opened this design up for internal review, expecting mostly rubber-stamping. We got something more useful: a few pointed questions that fundamentally shaped how we thought about the problem.
“Which of these metrics are deterministic and which use an LLM judge?”
This turned out to be more important than it first appeared. Deterministic checks — did tool X get called, did parameter Y have value Z — are stable across runs. LLM-judge metrics are not. They can vary based on which model you use, how the prompt template is phrased, and even non-determinism in the judge model itself. A score of 0.83 today might be 0.79 tomorrow not because your agent got worse, but because the judge behaved slightly differently. We needed to track these separately and be explicit about which was which in our YAML fixtures.
“What’s the judge model, and do you have a plan for judge model bias?”
If you’re using GPT-4 as your judge and your agent is also using GPT-4, you’re likely getting inflated scores — the judge model tends to favor outputs that look like its own outputs. We added a note in our evaluation config to pin the judge model to a different provider than the agent model, and to document this explicitly.
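In practice that "note in the config" looked something like the sketch below. The field names are our own convention and the model identifiers are placeholders, not a recommendation:

```yaml
# eval_config.yaml — illustrative; field names are our own convention
agent_model:
  provider: bedrock
  model: anthropic.claude-3-5-sonnet    # model under test (placeholder ID)
judge_model:
  provider: openai                      # deliberately a different provider
  model: gpt-4o                         # placeholder ID
  note: "Pinned to a different provider to reduce self-preference bias."
```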
“Where do I see the scores? Per-run? Aggregated over time?”
The answer at that point was “in the terminal.” That wasn’t good enough. We added LangFuse dashboards for score aggregation and set up alerts for when any metric dropped below threshold on consecutive runs.
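Getting scores off the terminal and into LangFuse is mostly a matter of attaching each metric result to its trace. A minimal sketch, assuming the LangFuse v2 Python SDK's `langfuse.score(...)` call (the exact API surface varies by SDK version); the payload-builder helper is our own:

```python
# Turn metric results into LangFuse score payloads so they show up on
# dashboards and can drive threshold alerts. `to_score_payloads` is ours;
# the commented `langfuse.score(...)` call follows the v2 Python SDK.
def to_score_payloads(trace_id: str, results: dict[str, float]) -> list[dict]:
    """One LangFuse score payload per metric result, sorted by metric name."""
    return [
        {"trace_id": trace_id, "name": name, "value": value}
        for name, value in sorted(results.items())
    ]


# Usage (requires a running LangFuse instance):
# from langfuse import Langfuse
# langfuse = Langfuse()
# for payload in to_score_payloads("trace-abc", {"faithfulness": 0.91}):
#     langfuse.score(**payload)
```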
“What does a health check look like for the eval pipeline itself?”
Good question. We added a canary fixture — a trivially easy test case that should always pass — that runs first in every eval job. If the canary fails, something is wrong with the eval infrastructure, not the agent, and the run is aborted before generating misleading results.
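The canary itself is just another fixture file, deliberately trivial. A sketch in our fixture convention (the `response_not_empty` check name is illustrative):

```yaml
# canary.eval.yaml — trivially easy case; if this fails, the eval
# infrastructure itself is broken, so abort the run.
test_input:
  conversation_id: "canary-000"
  user_message: "Say the word OK."
success_criteria:
  - type: deterministic
    check: response_not_empty        # illustrative check name
  - type: llm_judge
    metric: response_relevance
    threshold: 0.5                   # deliberately loose; this tests the pipeline, not the agent
```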
Small comments. They had a bigger impact on the final design than most of the actual architecture decisions.
5. The AWS Alternative: AgentCore Evaluations
While we were finalizing the LangFuse + Ragas + DeepEval design, our colleague Maciej was separately evaluating AWS’s native evaluation offering through Bedrock AgentCore. The pitch was compelling: fewer moving parts, native tracing that integrates with everything else in the AgentCore stack, and a production path that doesn’t require managing three separate services.
The proposal was to run a hybrid approach:
+----------------------+-----------------------------+
| Layer | Tool |
+----------------------+-----------------------------+
| Tracing | AgentCore native tracing |
| Built-in metrics | AgentCore Evaluations |
| Custom metrics | DeepEval (kept) |
| Test fixtures | .eval.yaml (kept) |
+----------------------+-----------------------------+
The reduction in moving parts is real. Instead of running LangFuse locally or self-hosted, you lean on AgentCore’s built-in tracing. Instead of standing up Ragas metrics computation, you use AgentCore’s built-in faithfulness and context relevance metrics.
We were genuinely tempted. The operational simplicity argument is hard to ignore when you’re a small team.
6. Where the Native Replacements Break Down
Then we actually compared the metrics side by side. And this is where the “native” story got complicated.
Faithfulness: Same Name, Different Problem
AgentCore provides a metric called Builtin.Faithfulness. Ragas provides a metric called faithfulness. They sound equivalent. They are not.
Ragas faithfulness asks: “Are the claims in the response supported by the retrieved context?” It decomposes the response into individual claims, checks each claim against the context, and computes a ratio. It’s specifically a RAG faithfulness check.
AgentCore’s Builtin.Faithfulness asks: "Is the response consistent with the input prompt and conversation history?" That's a consistency check, not a grounding check. For a RAG-heavy agent, these catch completely different failure modes. You can pass one and fail the other.
If you swap Ragas faithfulness for AgentCore’s built-in and call it a day, you’ve silently dropped one of your most important safety checks.
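A concrete (made-up) example makes the gap visible. The response below is a perfectly plausible answer to the question — a consistency-style check can pass it — but it contradicts the retrieved context, which is exactly what a grounding-style check exists to catch. The substring test here is a deliberately naive stand-in for a real claim-decomposition check:

```python
# Illustrative case that can pass a consistency check but must fail a
# grounding check. All strings are invented for the example; the substring
# test is a naive stand-in for Ragas-style claim verification.
conversation = [
    {"role": "user", "content": "Is breakfast included in my rate?"},
]
retrieved_context = [
    "Rate plan FLEX-2024 includes room only; breakfast is EUR 18 per person.",
]
response = "Yes, breakfast is included with your rate."

# Consistency-style check (AgentCore-like): the response directly answers
# the question asked, so it looks fine.
# Grounding-style check (Ragas-like): the claim contradicts the context.
grounded = any("breakfast is included" in chunk.lower() for chunk in retrieved_context)
assert grounded is False  # the failure mode only the grounding check catches
```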
Context Relevance vs. Context Precision
Similar story with retrieval quality. Ragas has context_precision, which measures whether the retrieved chunks that were actually used in the response were among the most relevant ones available. It's a quality-of-retrieval metric — it penalizes you for retrieving ten chunks but only using the bottom three.
AgentCore’s ContextRelevance measures whether the retrieved context is relevant to the input query at all. That's a different, weaker check. Passing context relevance just means you retrieved something related to the question. It says nothing about whether the retrieval was precise or whether the agent used the best available context.
Here’s a Side-by-Side Summary
+---------------------+----------------------------------+----------------------------------+
| Metric | Ragas | AgentCore |
+---------------------+----------------------------------+----------------------------------+
| faithfulness | Claims grounded in retrieved | Response consistent with |
| | context (RAG grounding check) | conversation history |
+---------------------+----------------------------------+----------------------------------+
| context quality | context_precision: were the | ContextRelevance: is retrieved |
| | best chunks selected? | context related to the query? |
+---------------------+----------------------------------+----------------------------------+
| answer relevance | answer_relevancy: does response | Similar — reasonable overlap |
| | address the question? | here |
+---------------------+----------------------------------+----------------------------------+
The name similarity is what makes this dangerous. You could swap these metrics, see similar-looking scores on a simple test case, and conclude the migration is safe. The divergence only shows up on the cases where it matters — complex multi-hop retrieval, sparse context, adversarial inputs.
7. The Decision Framework: PoC First
Evaluation infrastructure is not business logic. You can swap it out. But you can also create invisible regressions if you swap it out carelessly — which is exactly what the faithfulness naming collision would have caused.
Instead, we defined a three-outcome PoC:
+-------------------+------------------------------------------+
| Outcome | Action |
+-------------------+------------------------------------------+
| Adopt | AgentCore metrics are equivalent or |
| | better — migrate fully |
+-------------------+------------------------------------------+
| Swap | Some AgentCore metrics work, others |
| | don't — hybrid approach |
+-------------------+------------------------------------------+
| Build custom | Neither works well enough — write |
| | custom metric using AgentCore's |
| | custom evaluator API |
+-------------------+------------------------------------------+
The PoC scope was intentionally small: run the same 10 eval fixtures through both stacks in parallel, compare scores for the same inputs, and flag any case where the scores diverge by more than 15%. Two weeks of data. One shared dashboard.
That’s a much cheaper way to answer the question than migrating your entire eval pipeline and discovering the issue three months into production.
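The comparison logic for the PoC is small enough to sketch in full. Assuming each stack produces a dict of metric name to score per fixture (the score shape is our own convention), flagging divergence above 15% looks like:

```python
# Flag metrics where the two stacks disagree by more than 15% (relative
# to the larger score). Score dicts per fixture are our own convention.
DIVERGENCE_THRESHOLD = 0.15


def flag_divergences(ragas_scores: dict[str, float],
                     agentcore_scores: dict[str, float]) -> list[str]:
    """Return metric names whose relative divergence exceeds the threshold."""
    flagged = []
    for metric in sorted(set(ragas_scores) & set(agentcore_scores)):
        a, b = ragas_scores[metric], agentcore_scores[metric]
        baseline = max(abs(a), abs(b)) or 1.0  # avoid division by zero
        if abs(a - b) / baseline > DIVERGENCE_THRESHOLD:
            flagged.append(metric)
    return flagged
```

Run this over the ten fixtures and any flagged metric gets a manual look — which is how the faithfulness mismatch would have surfaced in week one instead of month three.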
8. A Minimal Local Eval Setup with Ollama
If you want to experiment with this pattern without spinning up cloud infrastructure, here’s a fully local setup using Ollama as the LLM judge. This runs entirely on your machine.
```yaml
# docker-compose.eval-local.yml
version: "3.9"

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    # Pull a model after startup: docker exec -it ollama ollama pull llama3

  langfuse:
    image: langfuse/langfuse:latest
    ports:
      - "3000:3000"
    environment:
      - DATABASE_URL=postgresql://langfuse:langfuse@postgres:5432/langfuse
      - NEXTAUTH_SECRET=dev-secret-change-in-prod
      - NEXTAUTH_URL=http://localhost:3000
      - SALT=dev-salt
    depends_on:
      - postgres

  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: langfuse
      POSTGRES_PASSWORD: langfuse
      POSTGRES_DB: langfuse
    volumes:
      - pgdata:/var/lib/postgresql/data

volumes:
  ollama_data:
  pgdata:
```
```python
# eval_local.py — uses Ollama as the judge model
import requests
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, FaithfulnessMetric
from deepeval.models.base_model import DeepEvalBaseLLM
from deepeval.test_case import LLMTestCase


class OllamaJudge(DeepEvalBaseLLM):
    """Custom DeepEval judge backed by a local Ollama model."""

    def __init__(self, model_name: str = "llama3"):
        self.model_name = model_name

    def load_model(self):
        return self

    def generate(self, prompt: str) -> str:
        response = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": self.model_name, "prompt": prompt, "stream": False},
        )
        return response.json()["response"]

    async def a_generate(self, prompt: str) -> str:
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return f"ollama/{self.model_name}"


def run_local_eval(test_cases: list[dict]):
    judge = OllamaJudge(model_name="llama3")
    cases = [
        LLMTestCase(
            input=tc["input"],
            actual_output=tc["output"],
            retrieval_context=tc.get("context", []),
        )
        for tc in test_cases
    ]
    metrics = [
        AnswerRelevancyMetric(threshold=0.7, model=judge),
        FaithfulnessMetric(threshold=0.7, model=judge),
    ]
    evaluate(cases, metrics)


if __name__ == "__main__":
    sample_cases = [
        {
            "input": "What time is check-out?",
            "output": "Check-out is at 11:00 AM. Late check-out until 2 PM is available for an additional fee.",
            "context": [
                "Hotel policy: standard check-out at 11:00 AM.",
                "Late check-out available until 14:00 for EUR 30.",
            ],
        }
    ]
    run_local_eval(sample_cases)
```
Start everything with `docker compose -f docker-compose.eval-local.yml up -d`, pull a model with `docker exec -it ollama ollama pull llama3`, and you have a fully local eval stack with no API keys, no cloud costs, and no data leaving your machine.
9. Lessons Learned
Eval infrastructure is not your core product — treat it as something you can swap. Don’t get attached to a specific tool. What matters is the fixture format and the success criteria. If your .eval.yaml files are tool-agnostic, you can migrate the underlying runner without losing any of the work you put into defining good tests.
“Native” does not mean “equivalent.” This seems obvious in retrospect, but the naming similarity between AgentCore’s Builtin.Faithfulness and Ragas's faithfulness is genuinely confusing. Always read the metric definition, not just the name. Check what it's actually measuring and whether that maps to the failure mode you care about.
Name similarity is a trap, especially when you’re under time pressure. When you’re evaluating tools quickly, you tend to match on names. That’s fine as a first pass, but it needs to be followed by an actual comparison on real data before you commit.
Keep your PoC scope small. Two weeks, ten fixtures, one shared dashboard. That’s enough to make a data-driven decision. The instinct to do a “comprehensive evaluation” before deciding is usually a way to delay the decision indefinitely. Define the minimum evidence you’d need to choose, run the experiment to get it, then choose.
Deterministic and LLM-judge metrics are different animals. Keep them separate in your fixtures, track them separately in your dashboards, and don’t conflate a drop in one with a drop in the other. A regression in tool-call correctness (deterministic) is a different kind of problem than a regression in faithfulness score (LLM judge) and needs a different debugging approach.
The eval stack is never really done. New failure modes emerge, judge models get updated, new metrics become available. But if you invest upfront in a good fixture format and a clear framework for comparing evaluation tools, you’re set up to evolve the infrastructure without losing ground. And the next time someone asks “but how do you know it works?” — you have an answer.
If you’ve run into interesting eval challenges with LangGraph or other agent frameworks, I’d be curious what metrics ended up being most useful for you. The more people share on this, the better the whole ecosystem gets.