A few weeks ago, Promptfoo — one of the most popular open-source LLM evaluation frameworks — joined OpenAI.
I don't think that's inherently bad. But it created a real problem for the ecosystem: the tools we use to evaluate AI systems are increasingly owned by the same companies that build those AI systems. That's a conflict of interest that matters.
So I built Rubric — an independent, MIT-licensed LLM and AI agent evaluation framework. No corporate parent. Open source forever.
Here's what I learned building it, and why I think agent trace evaluation is the missing piece in most teams' LLM testing story.
## The gap: everyone evaluates output, nobody evaluates the journey
Most LLM eval frameworks work like this:
```
input → model → output → did the output match expected?
```
That's fine for simple Q&A. But if you're building an AI agent — something that calls tools, makes decisions, and takes multi-step actions — the final output is only part of the story.
What if the agent got the right answer but called a dangerous tool along the way? What if it looped five times unnecessarily? What if it exceeded your latency or cost budget?
These are questions about the trace, not the output. And most eval frameworks don't answer them.
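To make that concrete, here's a plain-Python sketch of the kinds of checks a trace makes possible. This is not Rubric's implementation; the trace shape here is a hypothetical list of tool-call records, just to show what "evaluating the journey" means:

```python
# Hypothetical trace: a list of tool-call records an agent framework might emit.
trace = [
    {"tool": "search_flights", "args": {"from": "CAI", "to": "PAR"}, "ms": 420},
    {"tool": "search_flights", "args": {"from": "CAI", "to": "PAR"}, "ms": 410},  # repeated call
    {"tool": "book_flight", "args": {"flight_id": "MS799"}, "ms": 900},
]

def check_trace(trace, forbidden_tools, max_steps, max_total_ms):
    """Answer trace-level questions: forbidden tools? loops? budget blown?"""
    problems = []
    tools = [step["tool"] for step in trace]
    for tool in tools:
        if tool in forbidden_tools:
            problems.append(f"forbidden tool called: {tool}")
    # A crude loop check: the same tool called twice in a row.
    for a, b in zip(tools, tools[1:]):
        if a == b:
            problems.append(f"possible loop: {a} called twice in a row")
    if len(trace) > max_steps:
        problems.append(f"too many steps: {len(trace)} > {max_steps}")
    total_ms = sum(step["ms"] for step in trace)
    if total_ms > max_total_ms:
        problems.append(f"latency budget exceeded: {total_ms}ms > {max_total_ms}ms")
    return problems

print(check_trace(trace, forbidden_tools={"charge_card"}, max_steps=10, max_total_ms=1500))
```

On this trace, every output-only check passes (the flight got booked), but the repeated `search_flights` call and the blown latency budget are only visible at the trace level.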
## What Rubric does differently
Rubric treats agent trace evaluation as a first-class feature. Here's what evaluating an agent looks like:
```python
import rubriceval as rubric

result = my_agent.run("Book a flight from Cairo to Paris")

test = rubric.AgentTestCase(
    input="Book a flight from Cairo to Paris",
    actual_output=result.output,
    expected_tools=["search_flights", "book_flight"],
    forbidden_tools=["send_email", "charge_card"],
    tool_calls=result.tool_calls,
    trace=result.trace,
    latency_ms=result.latency_ms,
    max_steps=10,
)

report = rubric.evaluate(
    test_cases=[test],
    metrics=[
        rubric.ToolCallAccuracy(check_order=True),
        rubric.TraceQuality(penalize_loops=True),
        rubric.TaskCompletion(),
        rubric.LatencyMetric(max_ms=5000),
        rubric.CostMetric(max_cost_usd=0.05),
    ],
)
```
You get a full HTML report showing every metric, every step, pass/fail status — all local, no cloud required.
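For a sense of what a fully local report involves (this is an illustration of the idea, not Rubric's actual report generator), a report page is just an HTML string built from the metric results, with no server or cloud dependency:

```python
import html

def render_report(results):
    """Render metric results as a standalone HTML table (illustrative only)."""
    rows = "".join(
        f"<tr><td>{html.escape(r['metric'])}</td>"
        f"<td>{'PASS' if r['passed'] else 'FAIL'}</td>"
        f"<td>{html.escape(r['detail'])}</td></tr>"
        for r in results
    )
    return (
        "<html><body><h1>Eval report</h1>"
        f"<table><tr><th>Metric</th><th>Status</th><th>Detail</th></tr>{rows}</table>"
        "</body></html>"
    )

results = [
    {"metric": "ToolCallAccuracy", "passed": True, "detail": "2/2 expected tools, in order"},
    {"metric": "LatencyMetric", "passed": False, "detail": "6120ms > 5000ms budget"},
]
page = render_report(results)
# Write `page` to a file next to your tests and open it in a browser.
```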
## It's pytest for AI
One thing I prioritized was developer experience. Rubric integrates natively with pytest:
```python
# test_my_llm.py
def test_answers_geography_questions(rubric_eval):
    rubric_eval.add(
        rubric.TestCase(
            input="What is the capital of Egypt?",
            actual_output=my_llm("What is the capital of Egypt?"),
            expected_output="Cairo",
        ),
        metrics=[rubric.Contains("Cairo"), rubric.SemanticSimilarity(threshold=0.8)],
    )
```

```shell
pytest tests/ -v
```
No YAML configs. No custom test runners. Just pytest. This also means you can drop Rubric into your existing CI/CD pipeline with zero friction.
## Zero required dependencies
For the core string-matching metrics — `ExactMatch`, `Contains`, `NotContains`, `RegexMatch` — Rubric has zero dependencies. Just `pip install rubric-eval` and you're evaluating.
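Those core metrics need nothing beyond the standard library, which is why the base install can stay dependency-free. As a sketch of the idea (not Rubric's internals), a `RegexMatch`-style metric reduces to:

```python
import re

class RegexMatch:
    """Minimal sketch of a zero-dependency string metric (illustrative)."""

    def __init__(self, pattern):
        self.pattern = re.compile(pattern)

    def score(self, actual_output):
        # Pass if the pattern appears anywhere in the model's output.
        return bool(self.pattern.search(actual_output))

metric = RegexMatch(r"\bCairo\b")
print(metric.score("The capital of Egypt is Cairo."))       # True
print(metric.score("The capital of Egypt is Alexandria."))  # False
```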
You only install what you need:
```shell
pip install "rubric-eval[semantic]"    # sentence-transformers
pip install "rubric-eval[openai]"      # LLM judge via OpenAI
pip install "rubric-eval[anthropic]"   # LLM judge via Anthropic
```
This is especially useful if you want to use a local model (Ollama, LM Studio) as both the model under test and the judge. Everything runs offline.
## How it compares
| | Rubric | DeepEval | Promptfoo |
|---|---|---|---|
| Open source | ✅ MIT | ✅ Apache | ✅ MIT (now OpenAI-owned) |
| Agent trace evaluation | ✅ First-class | ❌ Limited | ❌ No |
| Zero required dependencies | ✅ | ❌ | ❌ Requires Node.js |
| pytest integration | ✅ Native | ✅ Decorator | ❌ YAML |
| Local HTML dashboard | ✅ Built-in | 💰 Paid cloud | ❌ No |
| Owned by AI company | ❌ Independent | ❌ Independent | ✅ OpenAI |
## What's next
The roadmap includes:

- Web dashboard with local history
- Dataset management from CSV/JSONL
- Regression detection
- Integrations with LangChain, LlamaIndex, and CrewAI
- Production monitoring
## Try it
```shell
pip install rubric-eval
```
Repo: github.com/Kareem-Rashed/rubric-eval
MIT licensed. Contributions welcome — check the `good first issue` label if you want to jump in.