DEV Community

hemanth kumar


I built an open source LLM agent evaluation tool that works with any framework

Every team building AI agents hits the same wall.
You ship a LangChain agent. It works great in demos. Then it goes to
production and quietly starts hallucinating, calling the wrong tools,
or giving answers that have nothing to do with what it retrieved.


You don't find out until a user complains.

The root cause is simple: there's no standard way to evaluate agent
quality before and after every deploy.

Every framework has its own story:

  • LangChain has LangSmith — but it's a paid SaaS and only works with LangChain
  • CrewAI has no eval tooling
  • AutoGen has no eval tooling
  • OpenAI Agents SDK has basic tracing but no scoring

If you switch frameworks, you rebuild your eval setup from scratch.
If you use multiple frameworks, you have no unified view.

This is the problem I set out to solve.

Introducing EvalForge


EvalForge is a framework-agnostic LLM agent evaluation harness. You give
it a trace JSON from any agent framework; it scores the trace on quality
metrics and returns a pass/fail result your CI pipeline understands.

evalforge run --trace my_agent_run.json --metrics faithfulness

Output:

EvalForge v0.1
─────────────────────────────
Trace ID:   my-run-001
Framework:  langchain
Model:      gpt-4o
Agent:      research-agent
Steps:      4
Duration:   3421ms
─────────────────────────────
Scoring Results
─────────────────────────────
faithfulness     0.91   PASS
Reason: The answer accurately reflects the retrieved context.
─────────────────────────────
Overall: PASS

Exit code 0 = pass. Exit code 1 = fail. Plugs straight into any CI
pipeline.
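The exit-code contract is easy to mirror when wiring up a pipeline. A minimal sketch of the semantics (`gate` is a hypothetical helper, not part of EvalForge; 0.7 is the documented default threshold):

```python
# Sketch of the exit-code contract: 0 when every metric clears the pass
# threshold, 1 otherwise. `gate` is a hypothetical helper; 0.7 matches
# the documented default threshold.
def gate(metric_scores, threshold=0.7):
    return 0 if all(score >= threshold for score in metric_scores) else 1

print(gate([0.91]))        # 0 -> CI step succeeds
print(gate([0.91, 0.42]))  # 1 -> CI step fails
```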

How it works

Every agent run — regardless of framework — goes through the same
lifecycle:

User gives input
  → Agent thinks / plans
    → Agent calls tools
      → Agent produces final answer

EvalForge captures this in a simple universal trace format:

{
  "evalforge_version": "0.1",
  "trace_id": "run-001",
  "metadata": {
    "framework": "langchain",
    "model": "gpt-4o",
    "agent_name": "research-agent",
    "duration_ms": 3421,
    "total_tokens": 1820
  },
  "input": {
    "user": "What are the latest papers on LLM evaluation?",
    "system": "You are a helpful research assistant."
  },
  "steps": [
    {
      "step_id": 1,
      "type": "thought",
      "content": "I need to search for recent papers."
    },
    {
      "step_id": 2,
      "type": "tool_call",
      "tool": "web_search",
      "input": { "query": "LLM evaluation papers 2026" },
      "output": { "results": ["paper1", "paper2"] },
      "duration_ms": 890
    }
  ],
  "output": {
    "answer": "The latest papers on LLM evaluation include..."
  },
  "eval_hints": {
    "expected_tools": ["web_search"],
    "expected_answer": null,
    "context_documents": []
  }
}
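Before handing a hand-written trace to the CLI, it can be worth a quick sanity check. A minimal validator sketch that uses only the field names visible in the example above (the real schema may be stricter):

```python
import json

# Minimal sketch of validating the universal trace format shown above.
# Field names come from the example trace; the real schema may be stricter.
REQUIRED_TOP_LEVEL = {"evalforge_version", "trace_id", "metadata", "input", "steps", "output"}

def validate_trace(raw: str) -> list:
    """Return a list of problems; an empty list means the trace looks well-formed."""
    trace = json.loads(raw)
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_TOP_LEVEL) if key not in trace]
    for i, step in enumerate(trace.get("steps", [])):
        if step.get("type") == "tool_call" and "tool" not in step:
            problems.append(f"step {i}: tool_call without a tool name")
    return problems

sample = (
    '{"evalforge_version": "0.1", "trace_id": "run-001", "metadata": {}, '
    '"input": {}, "steps": [{"step_id": 1, "type": "thought", "content": "..."}], '
    '"output": {}}'
)
print(validate_trace(sample))  # []
```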

Every major framework maps cleanly to this format. LangChain's
AgentAction becomes a tool_call. CrewAI's task results become
steps. AutoGen's conversation messages become thought entries.
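The LangChain mapping above can be sketched in a few lines. The helper name and the flattened arguments here are assumptions for illustration, not the real adapter code:

```python
# Hypothetical adapter sketch: map a LangChain-style (AgentAction, observation)
# pair onto the universal tool_call step. The helper name and argument shapes
# are assumptions for illustration, not EvalForge's actual adapter.
def agent_action_to_step(step_id, tool_name, tool_input, observation):
    return {
        "step_id": step_id,
        "type": "tool_call",
        "tool": tool_name,
        "input": tool_input,
        "output": observation,
    }

step = agent_action_to_step(
    2, "web_search", {"query": "LLM evaluation papers 2026"}, {"results": ["paper1"]}
)
print(step["type"], step["tool"])  # tool_call web_search
```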

The scoring — LLM as judge

For v0.1 we ship faithfulness scoring.

Faithfulness asks: did the agent's final answer stay true to what
its tools actually returned?

If the tools returned facts A, B, C and the agent only used A, B, C
— high faithfulness.

If the agent invented D, E that weren't in the tool outputs — low
faithfulness. That's a hallucination.

We score it using Claude as judge. The prompt:

You are evaluating whether an AI agent's answer is faithful 
to the context it retrieved.

Question: {question}
Retrieved Context: {context}
Agent's Answer: {answer}

Does the answer only use information from the retrieved context, 
without adding facts not present in the context?

Respond in JSON: {"score": 0.0-1.0, "reason": "explanation"}

Score >= 0.7 = PASS. Configurable with --threshold.
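Given the JSON response format the prompt requests, turning a judge response into a verdict is straightforward. A hedged sketch (`score_to_verdict` is a hypothetical helper, not part of the EvalForge API):

```python
import json

# Sketch: parse the judge's JSON response (shape shown in the prompt above)
# and apply the pass threshold. score_to_verdict is a hypothetical helper;
# 0.7 is the documented default threshold.
def score_to_verdict(judge_response: str, threshold: float = 0.7) -> dict:
    parsed = json.loads(judge_response)
    return {
        "score": parsed["score"],
        "reason": parsed["reason"],
        "verdict": "PASS" if parsed["score"] >= threshold else "FAIL",
    }

print(score_to_verdict('{"score": 0.91, "reason": "Answer stays within retrieved context."}'))
```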

Why Rust?

The core is written in Rust with a Python SDK wrapper.

Three reasons:

Speed — millisecond startup, no GIL bottleneck. Runs 1000 eval
cases in the time Python tools run 100.

Single binary — curl | sh install. No virtualenv, no dependency hell
in CI. One file that works on Linux, Mac, and Windows.

Python SDK on top — users never think about Rust. They
pip install evalforge and write:

import evalforge

result = evalforge.run(
    trace="my_agent_run.json",
    metrics=["faithfulness"]
)

print(result.passed)              # True
print(result.metrics[0].score)   # 0.91
print(result.metrics[0].reason)  # "Answer stays within retrieved context"

Works with every major framework today

Framework               Language     Status
LangChain / LangGraph   Python       ✅ v0.1
CrewAI                  Python       ✅ v0.1
AutoGen / AG2           Python       ✅ v0.1
OpenAI Agents SDK       Python       ✅ v0.1
Mastra                  TypeScript   🔜 Planned
Vercel AI SDK           TypeScript   🔜 Planned

CI/CD integration

Add to your GitHub Actions workflow:

- name: Evaluate agent quality
  run: evalforge run --trace agent_run.json --metrics faithfulness
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Every PR now has an automatic quality gate on your agent. Merge only
when your agent passes.

What's coming

  • v0.2 — tool_accuracy, goal_completion, hallucination metrics
  • v0.3 — Native CI integrations (GitHub Actions marketplace)
  • v0.4 — JavaScript SDK + Mastra support
  • v0.5 — Auto trace capture from LangChain/CrewAI callbacks
  • v1.0 — Web dashboard + team collaboration

Update: What shipped since launch

A lot has happened since I first posted this. Here's what
EvalForge looks like today:

7 metrics now live

Metric              What it measures
faithfulness        Answer stays true to retrieved context
tool_accuracy       Agent used the right tools (deterministic)
goal_completion     Agent finished the task
hallucination       Agent made up facts
g_eval              Your custom rubric in plain English
context_precision   Was all retrieved context relevant?
answer_relevance    Is the answer actually about the question?
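Because tool_accuracy is deterministic, the idea can be sketched directly from the trace: compare the tool_call steps against eval_hints.expected_tools. This is an illustrative reimplementation, not EvalForge's actual scoring code:

```python
# Illustrative sketch of the deterministic tool_accuracy idea: compare the
# tool_call steps in a trace against eval_hints.expected_tools. This is a
# reimplementation for explanation, not EvalForge's actual scoring code.
def tool_accuracy(steps, expected_tools):
    called = [s["tool"] for s in steps if s.get("type") == "tool_call"]
    if not expected_tools:
        return 1.0  # nothing was expected, so nothing can be wrong
    hits = sum(1 for tool in expected_tools if tool in called)
    return hits / len(expected_tools)

steps = [
    {"step_id": 1, "type": "thought", "content": "I need to search."},
    {"step_id": 2, "type": "tool_call", "tool": "web_search"},
]
print(tool_accuracy(steps, ["web_search"]))  # 1.0
```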

Framework adapters — no manual JSON needed

from evalforge.adapters import from_langchain
import evalforge

result = agent.invoke({"input": "Your question"})
trace = from_langchain(result, model="gpt-4o")
eval_result = evalforge.run(trace, metrics=["faithfulness"])
print(eval_result.passed)

Supports LangChain, CrewAI, AutoGen, and OpenAI Agents SDK.

RunTrendAnalyzer — catch drift before users do

Four runs at 0.91 → 0.85 → 0.79 → 0.73 all pass
individually. EvalForge catches the regression:

evalforge trend --history results/ \
  --metrics faithfulness \
  --exit-on-regression
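The idea behind the drift check can be sketched as: every recent run clears the gate individually, but scores fall monotonically, so the trend check fires. This is an illustration of the concept, not RunTrendAnalyzer's actual algorithm:

```python
# Illustration of the drift idea behind `evalforge trend`: all recent runs
# clear the 0.7 gate individually, but scores decline monotonically, so the
# trend check fires. A sketch, not RunTrendAnalyzer's real algorithm.
def detect_regression(history, window=4, threshold=0.7):
    recent = history[-window:]
    all_pass = all(score >= threshold for score in recent)
    declining = all(a > b for a, b in zip(recent, recent[1:]))
    return all_pass and declining

print(detect_regression([0.91, 0.85, 0.79, 0.73]))  # True: passing, but drifting down
```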

JavaScript/TypeScript SDK

npm install evalforge
import { fromMastra, run } from 'evalforge';

const trace = fromMastra(result, { agentName: 'my-agent' });
const evalResult = run(trace, { metrics: ['faithfulness'] });

Defensible scoring — full audit log

Every --output JSON now includes:

  • method: "deterministic" or "llm_judge"
  • judge_model: exactly which model scored this
  • threshold: the exact value used
  • timestamp: UTC time of the run
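A consumer of the audit log can assert those fields are present before trusting a result. A small sketch (the surrounding file layout and the example values are assumptions; the field names come from the list above):

```python
import json

# Sketch: verify a saved --output JSON carries the audit fields listed above.
# Field names are from the post; the example record values are placeholders.
AUDIT_FIELDS = {"method", "judge_model", "threshold", "timestamp"}

def missing_audit_fields(record: dict) -> set:
    return AUDIT_FIELDS - record.keys()

record = json.loads(
    '{"method": "llm_judge", "judge_model": "claude-placeholder", '
    '"threshold": 0.7, "timestamp": "2025-01-01T00:00:00Z", "score": 0.91}'
)
print(missing_audit_fields(record))  # set()
```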

Install and try

pip install evalforge
python3 -c "import evalforge; print(evalforge.demo())"

Or with npm:

npm install evalforge

GitHub: https://github.com/heManKuMAR6/evalforge

Would love to hear what metrics and frameworks matter
most to you — drop a comment below.

Try it now

git clone https://github.com/heManKuMAR6/evalforge
cd evalforge
cargo build --release

# Score a sample trace
cargo run -- run --trace tests/fixtures/sample_trace.json \
  --metrics faithfulness --mock

Or with Python:

pip install evalforge

evalforge run --trace my_trace.json --metrics faithfulness

The repo is at https://github.com/heManKuMAR6/evalforge — MIT
license, contributions welcome.

Would love feedback on:

  • What metrics matter most to you in production?
  • What frameworks should we prioritize next?
  • What does your current eval setup look like?

If this solves a problem you have, a GitHub star helps others find it.

Top comments (5)

hemanth kumar

Update: v0.7.0 is now live with 7 metrics, framework
adapters for LangChain/CrewAI/AutoGen/OpenAI Agents SDK,
JS/TypeScript SDK, RunTrendAnalyzer, and full audit logs.

pip install evalforge
npm install evalforge

Full changelog: github.com/heManKuMAR6/evalforge

CrisisCore-Systems

This was a strong read because it goes after a real infrastructure problem instead of just shipping another layer of agent hype.

A lot of teams can get an agent demo working. Far fewer have a clean, repeatable way to ask whether the thing is still behaving properly after a change, across frameworks, inside CI, without rebuilding the whole evaluation story every time the stack shifts. That is the part that makes this interesting.

What I like most here is the instinct to normalize the trace instead of marrying the entire tool to one ecosystem. That is the right level to attack the problem. Once evaluation gets trapped inside framework specific tooling, it stops being a reliability layer and starts becoming platform gravity.

The pass fail CI angle is also important. Evaluation only starts to matter when it can actually block bad behavior from gliding into production. Otherwise it stays as a dashboard people glance at and ignore.

I also respect that this starts narrow. Faithfulness is not the whole problem, but it is a real one, and starting with a metric that has operational value is a lot better than pretending to solve all of agent quality in one sweep.

The harder part, obviously, is ahead. LLM as judge can be useful, but it also becomes part of the trust chain, so the long term credibility of a tool like this will live or die on how transparent, reproducible, and defensible the scoring layer becomes.

Still, this is the kind of project that feels worth paying attention to because it is trying to turn agent evaluation from vague aspiration into something teams can actually wire into engineering reality.

hemanth kumar

This is exactly the kind of feedback that sharpens the direction.

You nailed the core tension - LLM-as-judge adds a trust dependency
to the trust layer itself. That's the hardest problem ahead.

The plan to address it:
→ Deterministic metrics alongside LLM-judge (ROUGE, BERTScore)
so scoring doesn't depend entirely on another model
→ Scoring transparency - log the full judge prompt and response
so teams can audit exactly why something passed or failed
→ Calibration tools - let teams validate judge scores against
their own human labels

The "platform gravity" framing is exactly why we normalize at the
trace level. Once you're locked into framework-specific eval,
you're one migration away from losing your entire quality history.

Would genuinely value your input as we design the scoring
transparency layer for v0.2. What would "defensible scoring"
look like to you in practice?

CrisisCore-Systems

That direction makes sense.

To me, defensible scoring starts the moment a team can answer not just what score they got, but why they got it, what produced it, and whether that result would hold up under reinspection a month later.

So in practice I would want a few things:

First, every score should come with an evidence trail. Not just the final number, but the trace slice used, the exact judge prompt, the model version, the threshold, and the raw response that led to the verdict.

Second, I would separate deterministic checks from interpretive checks as hard as possible. If something can be measured structurally, measure it structurally. Reserve LLM judgment for the parts that actually require semantic interpretation, not as the default for everything.

Third, I would want calibration to be visible, not implied. Show where judge scores align with human labels, where they drift, and where the model is consistently too generous or too harsh. Otherwise teams are inheriting a confidence ritual, not a reliability layer.

Fourth, reproducibility matters. If the scoring model changes, that should be treated like changing the test harness itself. Version it, surface it, and make score history comparable across time instead of quietly shifting the ground underneath the metric.

And fifth, I would be careful about merge gates driven by a single abstract score. A pass fail gate is powerful, but only if the underlying evidence is inspectable enough that engineers trust the block when it fires.

That is the part I find most interesting here. The trace normalization solves one layer of platform gravity. Defensible scoring is what determines whether the system becomes infrastructure or just another nice looking evaluation surface.

Knowband

The universal trace format is interesting. Standardizing this layer could simplify a lot of cross framework tooling.