Every team building AI agents hits the same wall.
You ship a LangChain agent. It works great in demos. Then it goes to
production and quietly starts hallucinating, calling the wrong tools,
or giving answers that have nothing to do with what it retrieved.

You don't find out until a user complains.
The root cause is simple: there's no standard way to evaluate agent
quality before and after every deploy.
Every framework has its own story:
- LangChain has LangSmith — but it's a paid SaaS and only works with LangChain
- CrewAI has no eval tooling
- AutoGen has no eval tooling
- OpenAI Agents SDK has basic tracing but no scoring
If you switch frameworks, you rebuild your eval setup from scratch.
If you use multiple frameworks, you have no unified view.
This is the problem I set out to solve.
## Introducing EvalForge

EvalForge is a framework-agnostic LLM agent evaluation harness. You give it a
trace JSON from any agent framework, it scores it on quality metrics, and
returns a pass/fail result your CI pipeline understands.

```bash
evalforge run --trace my_agent_run.json --metrics faithfulness
```
Output:

```text
EvalForge v0.1
─────────────────────────────
Trace ID:   my-run-001
Framework:  langchain
Model:      gpt-4o
Agent:      research-agent
Steps:      4
Duration:   3421ms
─────────────────────────────
Scoring Results
─────────────────────────────
faithfulness    0.91    PASS
Reason: The answer accurately reflects the retrieved context.
─────────────────────────────
Overall: PASS
```
Exit code 0 = pass. Exit code 1 = fail. Plugs straight into any CI
pipeline.
## How it works
Every agent run — regardless of framework — goes through the same
lifecycle:
```text
User gives input
  → Agent thinks / plans
  → Agent calls tools
  → Agent produces final answer
```
EvalForge captures this in a simple universal trace format:
```json
{
  "evalforge_version": "0.1",
  "trace_id": "run-001",
  "metadata": {
    "framework": "langchain",
    "model": "gpt-4o",
    "agent_name": "research-agent",
    "duration_ms": 3421,
    "total_tokens": 1820
  },
  "input": {
    "user": "What are the latest papers on LLM evaluation?",
    "system": "You are a helpful research assistant."
  },
  "steps": [
    {
      "step_id": 1,
      "type": "thought",
      "content": "I need to search for recent papers."
    },
    {
      "step_id": 2,
      "type": "tool_call",
      "tool": "web_search",
      "input": { "query": "LLM evaluation papers 2026" },
      "output": { "results": ["paper1", "paper2"] },
      "duration_ms": 890
    }
  ],
  "output": {
    "answer": "The latest papers on LLM evaluation include..."
  },
  "eval_hints": {
    "expected_tools": ["web_search"],
    "expected_answer": null,
    "context_documents": []
  }
}
```
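A minimal validator for this shape can be sketched in a few lines of Python. The key names come straight from the example above; the validator itself is illustrative, not part of the EvalForge API:

```python
# Illustrative validator for the universal trace format shown above.
# Not part of the EvalForge API -- just a sketch of the required shape.

REQUIRED_TOP_LEVEL = {"evalforge_version", "trace_id", "metadata",
                      "input", "steps", "output"}
STEP_TYPES = {"thought", "tool_call"}

def validate_trace(trace: dict) -> list[str]:
    """Return a list of problems; an empty list means the trace looks valid."""
    problems = []
    missing = REQUIRED_TOP_LEVEL - trace.keys()
    if missing:
        problems.append(f"missing top-level keys: {sorted(missing)}")
    for step in trace.get("steps", []):
        if step.get("type") not in STEP_TYPES:
            problems.append(f"step {step.get('step_id')}: "
                            f"unknown type {step.get('type')!r}")
        elif step["type"] == "tool_call" and "tool" not in step:
            problems.append(f"step {step.get('step_id')}: "
                            f"tool_call missing 'tool'")
    return problems
```

Running this in CI before scoring catches malformed traces early, before an LLM judge is ever invoked.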
Every major framework maps cleanly to this format. LangChain's
`AgentAction` becomes a `tool_call`. CrewAI's task results become
`steps`. AutoGen's conversation messages become `thought` entries.
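The LangChain mapping can be sketched like this: each (action, observation) pair from an agent run becomes a `thought` step (from the action's reasoning log) plus a `tool_call` step. The sketch below uses plain dicts so it runs without LangChain installed; it mirrors the `tool`, `tool_input`, and `log` attributes that LangChain's `AgentAction` exposes, but it is an illustration, not the EvalForge converter:

```python
# Sketch of the LangChain -> universal-trace mapping described above.
# Each (action, observation) pair yields an optional "thought" step
# plus a "tool_call" step. Plain dicts stand in for AgentAction objects.

def steps_from_langchain(intermediate_steps):
    steps, step_id = [], 0
    for action, observation in intermediate_steps:
        if action.get("log"):  # the agent's reasoning text, if any
            step_id += 1
            steps.append({"step_id": step_id, "type": "thought",
                          "content": action["log"]})
        step_id += 1
        steps.append({"step_id": step_id, "type": "tool_call",
                      "tool": action["tool"],
                      "input": action["tool_input"],
                      "output": observation})
    return steps
```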
## The scoring — LLM as judge
For v0.1 we ship faithfulness scoring.
Faithfulness asks: did the agent's final answer stay true to what
its tools actually returned?
If the tools returned facts A, B, C and the agent only used A, B, C
— high faithfulness.
If the agent invented D, E that weren't in the tool outputs — low
faithfulness. That's a hallucination.
We score it using Claude as judge. The prompt:

```text
You are evaluating whether an AI agent's answer is faithful
to the context it retrieved.

Question: {question}
Retrieved Context: {context}
Agent's Answer: {answer}

Does the answer only use information from the retrieved context,
without adding facts not present in the context?
Respond in JSON: {"score": 0.0-1.0, "reason": "explanation"}
```

Score >= 0.7 = PASS. Configurable with `--threshold`.
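The plumbing around the judge is straightforward: fill the prompt template, parse the JSON verdict, and apply the threshold. Here is a sketch of that flow with the judge stubbed out as a plain callable; in production the callable would hit an LLM API. This is illustrative plumbing, not the EvalForge internals:

```python
import json

# Prompt template taken from the article; {{...}} escapes the literal
# JSON braces in the response-format instruction.
JUDGE_PROMPT = """You are evaluating whether an AI agent's answer is faithful
to the context it retrieved.

Question: {question}
Retrieved Context: {context}
Agent's Answer: {answer}

Does the answer only use information from the retrieved context,
without adding facts not present in the context?
Respond in JSON: {{"score": 0.0-1.0, "reason": "explanation"}}"""

def score_faithfulness(question, context, answer, judge, threshold=0.7):
    """Fill the prompt, parse the judge's JSON verdict, apply the threshold.

    `judge` is any callable prompt -> str (an LLM API call in practice).
    """
    reply = judge(JUDGE_PROMPT.format(
        question=question, context=context, answer=answer))
    verdict = json.loads(reply)
    return {"score": verdict["score"], "reason": verdict["reason"],
            "passed": verdict["score"] >= threshold}
```

Keeping the judge behind a plain callable also makes the scorer trivial to unit-test with a canned response.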
## Why Rust?
The core is written in Rust with a Python SDK wrapper.
Three reasons:

- **Speed** — millisecond startup, no GIL bottleneck. Runs 1000 eval
  cases in the time Python tools run 100.
- **Single binary** — `curl | sh` install. No virtualenv, no
  dependency hell in CI. One file that works on Linux, Mac, and Windows.
- **Python SDK on top** — users never think about Rust. They
  `pip install evalforge` and write:

```python
import evalforge

result = evalforge.run(
    trace="my_agent_run.json",
    metrics=["faithfulness"],
)

print(result.passed)             # True
print(result.metrics[0].score)   # 0.91
print(result.metrics[0].reason)  # "Answer stays within retrieved context"
```
## Works with every major framework today
| Framework | Language | Status |
|---|---|---|
| LangChain / LangGraph | Python | ✅ v0.1 |
| CrewAI | Python | ✅ v0.1 |
| AutoGen / AG2 | Python | ✅ v0.1 |
| OpenAI Agents SDK | Python | ✅ v0.1 |
| Mastra | TypeScript | 🔜 Planned |
| Vercel AI SDK | TypeScript | 🔜 Planned |
## CI/CD integration
Add to your GitHub Actions workflow:
```yaml
- name: Evaluate agent quality
  run: evalforge run --trace agent_run.json --metrics faithfulness
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```
Every PR now has an automatic quality gate on your agent. Merge only
when your agent passes.
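In a full workflow file, that step slots in after checkout and install. The sketch below shows one way to wire it up; the workflow name, trigger, and `pip install` step are illustrative assumptions, not from the EvalForge docs:

```yaml
# Illustrative workflow; job layout and install step are assumptions.
name: agent-quality-gate
on: [pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install EvalForge
        run: pip install evalforge
      - name: Evaluate agent quality
        run: evalforge run --trace agent_run.json --metrics faithfulness
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
```

Because EvalForge exits non-zero on failure, the run step alone is enough to fail the check; no extra result parsing is needed.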
## What's coming
- v0.2 — `tool_accuracy`, `goal_completion`, `hallucination` metrics
- v0.3 — Native CI integrations (GitHub Actions marketplace)
- v0.4 — JavaScript SDK + Mastra support
- v0.5 — Auto trace capture from LangChain/CrewAI callbacks
- v1.0 — Web dashboard + team collaboration
## Try it now
```bash
git clone https://github.com/heManKuMAR6/evalforge
cd evalforge
cargo build --release

# Score a sample trace
cargo run -- run --trace tests/fixtures/sample_trace.json \
    --metrics faithfulness --mock
```
Or with Python:
```bash
pip install evalforge
evalforge run --trace my_trace.json --metrics faithfulness
```
The repo is at https://github.com/heManKuMAR6/evalforge — MIT
license, contributions welcome.
Would love feedback on:
- What metrics matter most to you in production?
- What frameworks should we prioritize next?
- What does your current eval setup look like?
If this solves a problem you have, a GitHub star helps others find it.

---

## Top comments
This was a strong read because it goes after a real infrastructure problem instead of just shipping another layer of agent hype.
A lot of teams can get an agent demo working. Far fewer have a clean, repeatable way to ask whether the thing is still behaving properly after a change, across frameworks, inside CI, without rebuilding the whole evaluation story every time the stack shifts. That is the part that makes this interesting.
What I like most here is the instinct to normalize the trace instead of marrying the entire tool to one ecosystem. That is the right level to attack the problem. Once evaluation gets trapped inside framework specific tooling, it stops being a reliability layer and starts becoming platform gravity.
The pass/fail CI angle is also important. Evaluation only starts to matter when it can actually block bad behavior from gliding into production. Otherwise it stays as a dashboard people glance at and ignore.
I also respect that this starts narrow. Faithfulness is not the whole problem, but it is a real one, and starting with a metric that has operational value is a lot better than pretending to solve all of agent quality in one sweep.
The harder part, obviously, is ahead. LLM as judge can be useful, but it also becomes part of the trust chain, so the long term credibility of a tool like this will live or die on how transparent, reproducible, and defensible the scoring layer becomes.
Still, this is the kind of project that feels worth paying attention to because it is trying to turn agent evaluation from vague aspiration into something teams can actually wire into engineering reality.