Every team building AI agents hits the same wall.
You ship a LangChain agent. It works great in demos. Then it goes to
production and quietly starts hallucinating, calling the wrong tools,
or giving answers that have nothing to do with what it retrieved.

You don't find out until a user complains.
The root cause is simple: there's no standard way to evaluate agent
quality before and after every deploy.
Every framework has its own story:
- LangChain has LangSmith — but it's a paid SaaS and only works with LangChain
- CrewAI has no eval tooling
- AutoGen has no eval tooling
- OpenAI Agents SDK has basic tracing but no scoring
If you switch frameworks, you rebuild your eval setup from scratch.
If you use multiple frameworks, you have no unified view.
This is the problem I set out to solve.
Introducing EvalForge

EvalForge is a framework-agnostic LLM agent evaluation harness. You
give it a trace JSON from any agent framework; it scores the run on
quality metrics and returns a pass/fail result your CI pipeline
understands.
evalforge run --trace my_agent_run.json --metrics faithfulness
Output:
EvalForge v0.1
─────────────────────────────
Trace ID: my-run-001
Framework: langchain
Model: gpt-4o
Agent: research-agent
Steps: 4
Duration: 3421ms
─────────────────────────────
Scoring Results
─────────────────────────────
faithfulness 0.91 PASS
Reason: The answer accurately reflects the retrieved context.
─────────────────────────────
Overall: PASS
Exit code 0 = pass. Exit code 1 = fail. Plugs straight into any CI
pipeline.
How it works
Every agent run — regardless of framework — goes through the same
lifecycle:
User gives input
→ Agent thinks / plans
→ Agent calls tools
→ Agent produces final answer
EvalForge captures this in a simple universal trace format:
{
  "evalforge_version": "0.1",
  "trace_id": "run-001",
  "metadata": {
    "framework": "langchain",
    "model": "gpt-4o",
    "agent_name": "research-agent",
    "duration_ms": 3421,
    "total_tokens": 1820
  },
  "input": {
    "user": "What are the latest papers on LLM evaluation?",
    "system": "You are a helpful research assistant."
  },
  "steps": [
    {
      "step_id": 1,
      "type": "thought",
      "content": "I need to search for recent papers."
    },
    {
      "step_id": 2,
      "type": "tool_call",
      "tool": "web_search",
      "input": { "query": "LLM evaluation papers 2026" },
      "output": { "results": ["paper1", "paper2"] },
      "duration_ms": 890
    }
  ],
  "output": {
    "answer": "The latest papers on LLM evaluation include..."
  },
  "eval_hints": {
    "expected_tools": ["web_search"],
    "expected_answer": null,
    "context_documents": []
  }
}
Every major framework maps cleanly to this format. LangChain's
AgentAction becomes a tool_call. CrewAI's task results become
steps. AutoGen's conversation messages become thought entries.
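To make one such mapping concrete, here is a minimal Python sketch. The `AgentAction` below is a stand-in dataclass with the same shape as LangChain's (tool, tool_input, log), not the real import, and `action_to_step` is a hypothetical helper, not EvalForge's actual adapter code:

```python
from dataclasses import dataclass

@dataclass
class AgentAction:
    """Stand-in with the same fields as LangChain's AgentAction."""
    tool: str
    tool_input: dict
    log: str

def action_to_step(action: AgentAction, step_id: int, output: dict) -> dict:
    """Map a framework-specific action onto a universal tool_call step."""
    return {
        "step_id": step_id,
        "type": "tool_call",
        "tool": action.tool,
        "input": action.tool_input,
        "output": output,
    }
```

The same shape works for the other frameworks: whatever the native object is, the adapter pulls out the tool name, its input, and its output, and emits one step.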
The scoring — LLM as judge
For v0.1 we ship faithfulness scoring.
Faithfulness asks: did the agent's final answer stay true to what
its tools actually returned?
If the tools returned facts A, B, C and the agent only used A, B, C
— high faithfulness.
If the agent invented D, E that weren't in the tool outputs — low
faithfulness. That's a hallucination.
We score it using Claude as judge. The prompt:
You are evaluating whether an AI agent's answer is faithful
to the context it retrieved.
Question: {question}
Retrieved Context: {context}
Agent's Answer: {answer}
Does the answer only use information from the retrieved context,
without adding facts not present in the context?
Respond in JSON: {"score": 0.0-1.0, "reason": "explanation"}
Score >= 0.7 = PASS. Configurable with --threshold.
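The pass/fail logic on top of the judge is deliberately simple. Here is a sketch of the thresholding step, assuming the judge followed the prompt and replied with that JSON shape (`score_verdict` is an illustrative helper, not EvalForge's internal API; a real implementation would also handle malformed replies):

```python
import json

def score_verdict(judge_response: str, threshold: float = 0.7) -> dict:
    """Parse the judge's JSON reply and apply the pass threshold."""
    parsed = json.loads(judge_response)
    return {
        "score": parsed["score"],
        "reason": parsed["reason"],
        "passed": parsed["score"] >= threshold,
    }

reply = '{"score": 0.91, "reason": "Answer stays within retrieved context."}'
print(score_verdict(reply)["passed"])  # True
```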
Why Rust?
The core is written in Rust with a Python SDK wrapper.
Three reasons:
- Speed — millisecond startup, no GIL bottleneck. Runs 1000 eval
cases in the time Python tools run 100.
- Single binary — curl | sh install. No virtualenv, no
dependency hell in CI. One file that works on Linux, Mac, Windows.
- Python SDK on top — users never think about Rust. They
pip install evalforge and write:
import evalforge

result = evalforge.run(
    trace="my_agent_run.json",
    metrics=["faithfulness"]
)
print(result.passed)             # True
print(result.metrics[0].score)   # 0.91
print(result.metrics[0].reason)  # "Answer stays within retrieved context"
Works with every major framework today
| Framework | Language | Status |
|---|---|---|
| LangChain / LangGraph | Python | ✅ v0.1 |
| CrewAI | Python | ✅ v0.1 |
| AutoGen / AG2 | Python | ✅ v0.1 |
| OpenAI Agents SDK | Python | ✅ v0.1 |
| Mastra | TypeScript | 🔜 Planned |
| Vercel AI SDK | TypeScript | 🔜 Planned |
CI/CD integration
Add to your GitHub Actions workflow:
- name: Evaluate agent quality
  run: evalforge run --trace agent_run.json --metrics faithfulness
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
Every PR now has an automatic quality gate on your agent. Merge only
when your agent passes.
What's coming
- v0.2 — tool_accuracy, goal_completion, hallucination metrics
- v0.3 — Native CI integrations (GitHub Actions marketplace)
- v0.4 — JavaScript SDK + Mastra support
- v0.5 — Auto trace capture from LangChain/CrewAI callbacks
- v1.0 — Web dashboard + team collaboration

Update: What shipped since launch
A lot has happened since I first posted this. Here's what
EvalForge looks like today:
7 metrics now live
| Metric | What it measures |
|---|---|
| faithfulness | Answer stays true to retrieved context |
| tool_accuracy | Agent used the right tools (deterministic) |
| goal_completion | Agent finished the task |
| hallucination | Agent made up facts |
| g_eval | Your custom rubric in plain English |
| context_precision | Was all retrieved context relevant? |
| answer_relevance | Is the answer actually about the question? |
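Of these, tool_accuracy is the deterministic one: it needs no judge model, just a comparison against the trace's eval_hints. A rough sketch of the idea (my own illustrative function, not EvalForge's implementation):

```python
def tool_accuracy(trace: dict) -> float:
    """Fraction of expected tools the agent actually called."""
    expected = set(trace["eval_hints"]["expected_tools"])
    if not expected:
        return 1.0  # nothing was expected, so nothing can be wrong
    called = {s["tool"] for s in trace["steps"] if s["type"] == "tool_call"}
    return len(expected & called) / len(expected)
```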
Framework adapters — no manual JSON needed
from evalforge.adapters import from_langchain
import evalforge
result = agent.invoke({"input": "Your question"})
trace = from_langchain(result, model="gpt-4o")
eval_result = evalforge.run(trace, metrics=["faithfulness"])
print(eval_result.passed)
Supports LangChain, CrewAI, AutoGen, and OpenAI Agents SDK.
RunTrendAnalyzer — catch drift before users do
Four runs at 0.91 → 0.85 → 0.79 → 0.73 all pass
individually. EvalForge catches the regression:
evalforge trend --history results/ \
--metrics faithfulness \
--exit-on-regression
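The underlying check is the kind of thing you can sketch in a few lines: flag a run history where every score dropped and the total decline exceeds some tolerance, even though each run cleared the threshold on its own. This is an illustration of the idea, not the analyzer's actual heuristic:

```python
def detect_regression(scores: list[float], tolerance: float = 0.05) -> bool:
    """Flag steady downward drift across a run history."""
    if len(scores) < 2:
        return False
    steadily_falling = all(b < a for a, b in zip(scores, scores[1:]))
    total_drop = scores[0] - scores[-1]
    return steadily_falling and total_drop > tolerance

print(detect_regression([0.91, 0.85, 0.79, 0.73]))  # True: each run passes, the trend fails
```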
JavaScript/TypeScript SDK
npm install evalforge
import { fromMastra, run } from 'evalforge';
const trace = fromMastra(result, { agentName: 'my-agent' });
const evalResult = run(trace, { metrics: ['faithfulness'] });
Defensible scoring — full audit log
Every --output JSON now includes:
- method: "deterministic" or "llm_judge"
- judge_model: exactly which model scored this
- threshold: the exact value used
- timestamp: UTC time of the run
Install and try
pip install evalforge
python3 -c "import evalforge; print(evalforge.demo())"
Or with npm:
npm install evalforge
Try it now
git clone https://github.com/heManKuMAR6/evalforge
cd evalforge
cargo build --release
# Score a sample trace
cargo run -- run --trace tests/fixtures/sample_trace.json \
--metrics faithfulness --mock
Or with Python:
pip install evalforge
evalforge run --trace my_trace.json --metrics faithfulness
The repo is at https://github.com/heManKuMAR6/evalforge — MIT
license, contributions welcome.
Would love feedback on:
- What metrics matter most to you in production?
- What frameworks should we prioritize next?
- What does your current eval setup look like?
If this solves a problem you have, a GitHub star helps others find it.

Top comments (5)
Update: v0.7.0 is now live with 7 metrics, framework
adapters for LangChain/CrewAI/AutoGen/OpenAI Agents SDK,
JS/TypeScript SDK, RunTrendAnalyzer, and full audit logs.
pip install evalforge
npm install evalforge
Full changelog: github.com/heManKuMAR6/evalforge
This was a strong read because it goes after a real infrastructure problem instead of just shipping another layer of agent hype.
A lot of teams can get an agent demo working. Far fewer have a clean, repeatable way to ask whether the thing is still behaving properly after a change, across frameworks, inside CI, without rebuilding the whole evaluation story every time the stack shifts. That is the part that makes this interesting.
What I like most here is the instinct to normalize the trace instead of marrying the entire tool to one ecosystem. That is the right level to attack the problem. Once evaluation gets trapped inside framework specific tooling, it stops being a reliability layer and starts becoming platform gravity.
The pass fail CI angle is also important. Evaluation only starts to matter when it can actually block bad behavior from gliding into production. Otherwise it stays as a dashboard people glance at and ignore.
I also respect that this starts narrow. Faithfulness is not the whole problem, but it is a real one, and starting with a metric that has operational value is a lot better than pretending to solve all of agent quality in one sweep.
The harder part, obviously, is ahead. LLM as judge can be useful, but it also becomes part of the trust chain, so the long term credibility of a tool like this will live or die on how transparent, reproducible, and defensible the scoring layer becomes.
Still, this is the kind of project that feels worth paying attention to because it is trying to turn agent evaluation from vague aspiration into something teams can actually wire into engineering reality.
This is exactly the kind of feedback that sharpens the direction.
You nailed the core tension - LLM-as-judge adds a trust dependency
to the trust layer itself. That's the hardest problem ahead.
The plan to address it:
→ Deterministic metrics alongside LLM-judge (ROUGE, BERTScore)
so scoring doesn't depend entirely on another model
→ Scoring transparency - log the full judge prompt and response
so teams can audit exactly why something passed or failed
→ Calibration tools - let teams validate judge scores against
their own human labels
The "platform gravity" framing is exactly why we normalize at the
trace level. Once you're locked into framework-specific eval,
you're one migration away from losing your entire quality history.
Would genuinely value your input as we design the scoring
transparency layer for v0.2. What would "defensible scoring"
look like to you in practice?
That direction makes sense.
To me, defensible scoring starts the moment a team can answer not just what score they got, but why they got it, what produced it, and whether that result would hold up under reinspection a month later.
So in practice I would want a few things:
First, every score should come with an evidence trail. Not just the final number, but the trace slice used, the exact judge prompt, the model version, the threshold, and the raw response that led to the verdict.
Second, I would separate deterministic checks from interpretive checks as hard as possible. If something can be measured structurally, measure it structurally. Reserve LLM judgment for the parts that actually require semantic interpretation, not as the default for everything.
Third, I would want calibration to be visible, not implied. Show where judge scores align with human labels, where they drift, and where the model is consistently too generous or too harsh. Otherwise teams are inheriting a confidence ritual, not a reliability layer.
Fourth, reproducibility matters. If the scoring model changes, that should be treated like changing the test harness itself. Version it, surface it, and make score history comparable across time instead of quietly shifting the ground underneath the metric.
And fifth, I would be careful about merge gates driven by a single abstract score. A pass fail gate is powerful, but only if the underlying evidence is inspectable enough that engineers trust the block when it fires.
That is the part I find most interesting here. The trace normalization solves one layer of platform gravity. Defensible scoring is what determines whether the system becomes infrastructure or just another nice looking evaluation surface.
The universal trace format is interesting. Standardizing this layer could simplify a lot of cross-framework tooling.