DEV Community

hemanth kumar


I built an open source LLM agent evaluation tool that works with any framework

Every team building AI agents hits the same wall.
You ship a LangChain agent. It works great in demos. Then it goes to
production and quietly starts hallucinating, calling the wrong tools,
or giving answers that have nothing to do with what it retrieved.


You don't find out until a user complains.

The root cause is simple: there's no standard way to evaluate agent
quality before and after every deploy.

Every framework has its own story:

  • LangChain has LangSmith — but it's a paid SaaS and only works with LangChain
  • CrewAI has no eval tooling
  • AutoGen has no eval tooling
  • OpenAI Agents SDK has basic tracing but no scoring

If you switch frameworks, you rebuild your eval setup from scratch.
If you use multiple frameworks, you have no unified view.

This is the problem I set out to solve.

Introducing EvalForge


EvalForge is a framework-agnostic LLM agent evaluation harness. You give
it a trace JSON from any agent framework; it scores the trace on quality
metrics and returns a pass/fail result your CI pipeline understands.

evalforge run --trace my_agent_run.json --metrics faithfulness

Output:

EvalForge v0.1
─────────────────────────────
Trace ID:   my-run-001
Framework:  langchain
Model:      gpt-4o
Agent:      research-agent
Steps:      4
Duration:   3421ms
─────────────────────────────
Scoring Results
─────────────────────────────
faithfulness     0.91   PASS
Reason: The answer accurately reflects the retrieved context.
─────────────────────────────
Overall: PASS

Exit code 0 = pass. Exit code 1 = fail. Plugs straight into any CI
pipeline.
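The exit-code contract is easy to mirror when wiring up a pipeline. A minimal sketch of the semantics (`gate` is a hypothetical helper, not part of EvalForge; 0.7 is the documented default threshold):

```python
# Sketch of the exit-code contract: 0 when every metric clears the pass
# threshold, 1 otherwise. `gate` is a hypothetical helper; 0.7 matches
# the documented default threshold.
def gate(metric_scores, threshold=0.7):
    return 0 if all(score >= threshold for score in metric_scores) else 1

print(gate([0.91]))        # 0 -> CI step succeeds
print(gate([0.91, 0.42]))  # 1 -> CI step fails
```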

How it works

Every agent run — regardless of framework — goes through the same
lifecycle:

User gives input
  → Agent thinks / plans
    → Agent calls tools
      → Agent produces final answer

EvalForge captures this in a simple universal trace format:

{
  "evalforge_version": "0.1",
  "trace_id": "run-001",
  "metadata": {
    "framework": "langchain",
    "model": "gpt-4o",
    "agent_name": "research-agent",
    "duration_ms": 3421,
    "total_tokens": 1820
  },
  "input": {
    "user": "What are the latest papers on LLM evaluation?",
    "system": "You are a helpful research assistant."
  },
  "steps": [
    {
      "step_id": 1,
      "type": "thought",
      "content": "I need to search for recent papers."
    },
    {
      "step_id": 2,
      "type": "tool_call",
      "tool": "web_search",
      "input": { "query": "LLM evaluation papers 2026" },
      "output": { "results": ["paper1", "paper2"] },
      "duration_ms": 890
    }
  ],
  "output": {
    "answer": "The latest papers on LLM evaluation include..."
  },
  "eval_hints": {
    "expected_tools": ["web_search"],
    "expected_answer": null,
    "context_documents": []
  }
}
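Before handing a hand-written trace to the CLI, it can be worth a quick sanity check. A minimal validator sketch that uses only the field names visible in the example above (the real schema may be stricter):

```python
import json

# Minimal sketch of validating the universal trace format shown above.
# Field names come from the example trace; the real schema may be stricter.
REQUIRED_TOP_LEVEL = {"evalforge_version", "trace_id", "metadata", "input", "steps", "output"}

def validate_trace(raw: str) -> list:
    """Return a list of problems; an empty list means the trace looks well-formed."""
    trace = json.loads(raw)
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_TOP_LEVEL) if key not in trace]
    for i, step in enumerate(trace.get("steps", [])):
        if step.get("type") == "tool_call" and "tool" not in step:
            problems.append(f"step {i}: tool_call without a tool name")
    return problems

sample = (
    '{"evalforge_version": "0.1", "trace_id": "run-001", "metadata": {}, '
    '"input": {}, "steps": [{"step_id": 1, "type": "thought", "content": "..."}], '
    '"output": {}}'
)
print(validate_trace(sample))  # []
```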

Every major framework maps cleanly to this format. LangChain's
AgentAction becomes a tool_call. CrewAI's task results become
steps. AutoGen's conversation messages become thought entries.
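The LangChain mapping above can be sketched in a few lines. The helper name and the flattened arguments here are assumptions for illustration, not the real adapter code:

```python
# Hypothetical adapter sketch: map a LangChain-style (AgentAction, observation)
# pair onto the universal tool_call step. The helper name and argument shapes
# are assumptions for illustration, not EvalForge's actual adapter.
def agent_action_to_step(step_id, tool_name, tool_input, observation):
    return {
        "step_id": step_id,
        "type": "tool_call",
        "tool": tool_name,
        "input": tool_input,
        "output": observation,
    }

step = agent_action_to_step(
    2, "web_search", {"query": "LLM evaluation papers 2026"}, {"results": ["paper1"]}
)
print(step["type"], step["tool"])  # tool_call web_search
```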

The scoring — LLM as judge

For v0.1 we ship faithfulness scoring.

Faithfulness asks: did the agent's final answer stay true to what
its tools actually returned?

If the tools returned facts A, B, C and the agent only used A, B, C
— high faithfulness.

If the agent invented D, E that weren't in the tool outputs — low
faithfulness. That's a hallucination.

We score it using Claude as judge. The prompt:

You are evaluating whether an AI agent's answer is faithful 
to the context it retrieved.

Question: {question}
Retrieved Context: {context}
Agent's Answer: {answer}

Does the answer only use information from the retrieved context, 
without adding facts not present in the context?

Respond in JSON: {"score": 0.0-1.0, "reason": "explanation"}

Score >= 0.7 = PASS. Configurable with --threshold.
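Given the JSON response format the prompt requests, turning a judge response into a verdict is straightforward. A hedged sketch (`score_to_verdict` is a hypothetical helper, not part of the EvalForge API):

```python
import json

# Sketch: parse the judge's JSON response (shape shown in the prompt above)
# and apply the pass threshold. score_to_verdict is a hypothetical helper;
# 0.7 is the documented default threshold.
def score_to_verdict(judge_response: str, threshold: float = 0.7) -> dict:
    parsed = json.loads(judge_response)
    return {
        "score": parsed["score"],
        "reason": parsed["reason"],
        "verdict": "PASS" if parsed["score"] >= threshold else "FAIL",
    }

print(score_to_verdict('{"score": 0.91, "reason": "Answer stays within retrieved context."}'))
```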

Why Rust?

The core is written in Rust with a Python SDK wrapper.

Three reasons:

Speed — millisecond startup, no GIL bottleneck. Runs 1000 eval
cases in the time Python tools run 100.

Single binary — curl | sh install. No virtualenv, no dependency hell
in CI. One file that works on Linux, Mac, and Windows.

Python SDK on top — users never think about Rust. They
pip install evalforge and write:

import evalforge

result = evalforge.run(
    trace="my_agent_run.json",
    metrics=["faithfulness"]
)

print(result.passed)              # True
print(result.metrics[0].score)   # 0.91
print(result.metrics[0].reason)  # "Answer stays within retrieved context"

Works with every major framework today

Framework               Language     Status
LangChain / LangGraph   Python       ✅ v0.1
CrewAI                  Python       ✅ v0.1
AutoGen / AG2           Python       ✅ v0.1
OpenAI Agents SDK       Python       ✅ v0.1
Mastra                  TypeScript   🔜 Planned
Vercel AI SDK           TypeScript   🔜 Planned

CI/CD integration

Add to your GitHub Actions workflow:

- name: Evaluate agent quality
  run: evalforge run --trace agent_run.json --metrics faithfulness
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Every PR now has an automatic quality gate on your agent. Merge only
when your agent passes.

What's coming

  • v0.2 — tool_accuracy, goal_completion, hallucination metrics
  • v0.3 — Native CI integrations (GitHub Actions marketplace)
  • v0.4 — JavaScript SDK + Mastra support
  • v0.5 — Auto trace capture from LangChain/CrewAI callbacks
  • v1.0 — Web dashboard + team collaboration

Update: What shipped since launch

A lot has happened since I first posted this. Here's what
EvalForge looks like today:

7 metrics now live

Metric              What it measures
faithfulness        Answer stays true to retrieved context
tool_accuracy       Agent used the right tools (deterministic)
goal_completion     Agent finished the task
hallucination       Agent made up facts
g_eval              Your custom rubric in plain English
context_precision   Was all retrieved context relevant?
answer_relevance    Is the answer actually about the question?
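Because tool_accuracy is deterministic, the idea can be sketched directly from the trace: compare the tool_call steps against eval_hints.expected_tools. This is an illustrative reimplementation, not EvalForge's actual scoring code:

```python
# Illustrative sketch of the deterministic tool_accuracy idea: compare the
# tool_call steps in a trace against eval_hints.expected_tools. This is a
# reimplementation for explanation, not EvalForge's actual scoring code.
def tool_accuracy(steps, expected_tools):
    called = [s["tool"] for s in steps if s.get("type") == "tool_call"]
    if not expected_tools:
        return 1.0  # nothing was expected, so nothing can be wrong
    hits = sum(1 for tool in expected_tools if tool in called)
    return hits / len(expected_tools)

steps = [
    {"step_id": 1, "type": "thought", "content": "I need to search."},
    {"step_id": 2, "type": "tool_call", "tool": "web_search"},
]
print(tool_accuracy(steps, ["web_search"]))  # 1.0
```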

Framework adapters — no manual JSON needed

from evalforge.adapters import from_langchain
import evalforge

result = agent.invoke({"input": "Your question"})
trace = from_langchain(result, model="gpt-4o")
eval_result = evalforge.run(trace, metrics=["faithfulness"])
print(eval_result.passed)

Supports LangChain, CrewAI, AutoGen, and OpenAI Agents SDK.

RunTrendAnalyzer — catch drift before users do

Four runs at 0.91 → 0.85 → 0.79 → 0.73 all pass
individually. EvalForge catches the regression:

evalforge trend --history results/ \
  --metrics faithfulness \
  --exit-on-regression
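The idea behind the drift check can be sketched as: every recent run clears the gate individually, but scores fall monotonically, so the trend check fires. This is an illustration of the concept, not RunTrendAnalyzer's actual algorithm:

```python
# Illustration of the drift idea behind `evalforge trend`: all recent runs
# clear the 0.7 gate individually, but scores decline monotonically, so the
# trend check fires. A sketch, not RunTrendAnalyzer's real algorithm.
def detect_regression(history, window=4, threshold=0.7):
    recent = history[-window:]
    all_pass = all(score >= threshold for score in recent)
    declining = all(a > b for a, b in zip(recent, recent[1:]))
    return all_pass and declining

print(detect_regression([0.91, 0.85, 0.79, 0.73]))  # True: passing, but drifting down
```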

JavaScript/TypeScript SDK

npm install evalforge
import { fromMastra, run } from 'evalforge';

const trace = fromMastra(result, { agentName: 'my-agent' });
const evalResult = run(trace, { metrics: ['faithfulness'] });

Defensible scoring — full audit log

Every --output JSON now includes:

  • method: "deterministic" or "llm_judge"
  • judge_model: exactly which model scored this
  • threshold: the exact value used
  • timestamp: UTC time of the run
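A consumer of the audit log can assert those fields are present before trusting a result. A small sketch (the surrounding file layout and the example values are assumptions; the field names come from the list above):

```python
import json

# Sketch: verify a saved --output JSON carries the audit fields listed above.
# Field names are from the post; the example record values are placeholders.
AUDIT_FIELDS = {"method", "judge_model", "threshold", "timestamp"}

def missing_audit_fields(record: dict) -> set:
    return AUDIT_FIELDS - record.keys()

record = json.loads(
    '{"method": "llm_judge", "judge_model": "claude-placeholder", '
    '"threshold": 0.7, "timestamp": "2025-01-01T00:00:00Z", "score": 0.91}'
)
print(missing_audit_fields(record))  # set()
```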

Install and try

pip install evalforge
python3 -c "import evalforge; print(evalforge.demo())"

Or with npm:

npm install evalforge

GitHub: https://github.com/heManKuMAR6/evalforge

Would love to hear what metrics and frameworks matter
most to you — drop a comment below.

Try it now

git clone https://github.com/heManKuMAR6/evalforge
cd evalforge
cargo build --release

# Score a sample trace
cargo run -- run --trace tests/fixtures/sample_trace.json \
  --metrics faithfulness --mock

Or with Python:

pip install evalforge

evalforge run --trace my_trace.json --metrics faithfulness

The repo is at https://github.com/heManKuMAR6/evalforge — MIT
license, contributions welcome.

Would love feedback on:

  • What metrics matter most to you in production?
  • What frameworks should we prioritize next?
  • What does your current eval setup look like?

If this solves a problem you have, a GitHub star helps others find it.

Top comments (5)

hemanth kumar

Update: v0.7.0 is now live with 7 metrics, framework
adapters for LangChain/CrewAI/AutoGen/OpenAI Agents SDK,
JS/TypeScript SDK, RunTrendAnalyzer, and full audit logs.

pip install evalforge
npm install evalforge

Full changelog: github.com/heManKuMAR6/evalforge

CrisisCore-Systems

This was a strong read because it goes after a real infrastructure problem instead of just shipping another layer of agent hype.

A lot of teams can get an agent demo working. Far fewer have a clean, repeatable way to ask whether the thing is still behaving properly after a change, across frameworks, inside CI, without rebuilding the whole evaluation story every time the stack shifts. That is the part that makes this interesting.

What I like most here is the instinct to normalize the trace instead of marrying the entire tool to one ecosystem. That is the right level to attack the problem. Once evaluation gets trapped inside framework specific tooling, it stops being a reliability layer and starts becoming platform gravity.

The pass fail CI angle is also important. Evaluation only starts to matter when it can actually block bad behavior from gliding into production. Otherwise it stays as a dashboard people glance at and ignore.

I also respect that this starts narrow. Faithfulness is not the whole problem, but it is a real one, and starting with a metric that has operational value is a lot better than pretending to solve all of agent quality in one sweep.

The harder part, obviously, is ahead. LLM as judge can be useful, but it also becomes part of the trust chain, so the long term credibility of a tool like this will live or die on how transparent, reproducible, and defensible the scoring layer becomes.

Still, this is the kind of project that feels worth paying attention to because it is trying to turn agent evaluation from vague aspiration into something teams can actually wire into engineering reality.

hemanth kumar

This is exactly the kind of feedback that sharpens the direction.

You nailed the core tension - LLM-as-judge adds a trust dependency
to the trust layer itself. That's the hardest problem ahead.

The plan to address it:
→ Deterministic metrics alongside LLM-judge (ROUGE, BERTScore)
so scoring doesn't depend entirely on another model
→ Scoring transparency - log the full judge prompt and response
so teams can audit exactly why something passed or failed
→ Calibration tools - let teams validate judge scores against
their own human labels

The "platform gravity" framing is exactly why we normalize at the
trace level. Once you're locked into framework-specific eval,
you're one migration away from losing your entire quality history.

Would genuinely value your input as we design the scoring
transparency layer for v0.2. What would "defensible scoring"
look like to you in practice?

CrisisCore-Systems

That direction makes sense.

To me, defensible scoring starts the moment a team can answer not just what score they got, but why they got it, what produced it, and whether that result would hold up under reinspection a month later.

So in practice I would want a few things:

First, every score should come with an evidence trail. Not just the final number, but the trace slice used, the exact judge prompt, the model version, the threshold, and the raw response that led to the verdict.

Second, I would separate deterministic checks from interpretive checks as hard as possible. If something can be measured structurally, measure it structurally. Reserve LLM judgment for the parts that actually require semantic interpretation, not as the default for everything.

Third, I would want calibration to be visible, not implied. Show where judge scores align with human labels, where they drift, and where the model is consistently too generous or too harsh. Otherwise teams are inheriting a confidence ritual, not a reliability layer.

Fourth, reproducibility matters. If the scoring model changes, that should be treated like changing the test harness itself. Version it, surface it, and make score history comparable across time instead of quietly shifting the ground underneath the metric.

And fifth, I would be careful about merge gates driven by a single abstract score. A pass fail gate is powerful, but only if the underlying evidence is inspectable enough that engineers trust the block when it fires.

That is the part I find most interesting here. The trace normalization solves one layer of platform gravity. Defensible scoring is what determines whether the system becomes infrastructure or just another nice looking evaluation surface.

Knowband

The universal trace format is interesting. Standardizing this layer could simplify a lot of cross framework tooling.