
Anjaiah Methuku


Stop Flying Blind: We Built an LLM Evaluation Framework That Works Across 17+ Agent Frameworks

Let me be brutally honest with you.

I've seen teams demo AI agents that look incredible — smooth responses, beautiful UI, stakeholders impressed. Then that same team ships to production and spends the next three weeks firefighting hallucinations they could have caught in testing.

The problem isn't the AI. The problem is nobody evaluated it properly.

Not because they didn't want to. Because the existing tools made it painful.

You're building with LangGraph on Monday. LlamaIndex RAG pipeline on Wednesday. The product team wants CrewAI by Friday. Every framework has different output shapes. Every eval tool wants you to rebuild your stack around it.

So you ship anyway. With fingers crossed.

That's the exact problem I set out to solve with Custom Evals.


What Is Custom Evals?

Custom Evals is an open-source, lightweight evaluation framework for LLM outputs with support for 17+ agent frameworks and a multi-layer metric system — from fast deterministic checks to full LLM-as-judge scoring.

# Run from the root of a local clone of the repository (editable install with dev extras)
pip install -e ".[dev]"

That's it. No required backend. No dashboard to stand up. No mandatory test runner.

Here's your first evaluation in 10 lines:

from custom.evals import CoherenceEvaluator
from custom.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
evaluator = CoherenceEvaluator(llm)

score = evaluator.evaluate({
    "input": "What is AI?",
    "output": "AI is artificial intelligence, enabling machines to perform intelligent tasks."
})

print(f"{score.label}: {score.explanation}")
# coherent: The response provides a clear, logical explanation...

A Score object. A label. An explanation. That's the entire interface.
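
For orientation, here is a minimal sketch of what a result object with those fields could look like. It is a hypothetical stand-in, not the library's actual class: the field names (score, label, explanation, name, metadata) are taken from the snippets in this post, while the types and defaults are assumptions.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass(frozen=True)
class Score:
    """Hypothetical stand-in for the result type; fields mirror those used in this post."""
    score: Optional[float] = None        # numeric value, e.g. 0.0 to 1.0
    label: Optional[str] = None          # categorical verdict, e.g. "coherent"
    explanation: Optional[str] = None    # the judge's reasoning, when an LLM is involved
    name: Optional[str] = None           # which metric produced this result
    metadata: dict[str, Any] = field(default_factory=dict)  # metric-specific extras


example = Score(score=1.0, label="coherent", explanation="Clear and logical.", name="coherence")
print(f"{example.label}: {example.explanation}")
```

Because every metric returns the same shape, downstream code (reports, thresholds, dashboards) only has to understand one type.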


Why Existing Tools Leave Gaps

I want to be fair here — the existing eval tools are genuinely good. But they each have a niche.

Phoenix Evals (Arize) is brilliant if you're deep in the Arize observability ecosystem. The Custom Evals architecture is openly inspired by it. But Phoenix is a full observability platform. If you just want to score outputs without standing up a tracing infrastructure, it's overkill.

DeepEval has 50+ metrics — impressive. But it requires a specific test runner, a specific file format, and an opinionated workflow. It's a comprehensive evaluation suite, not a lightweight library.

RAGAS is surgical and excellent at RAG evaluation specifically. Faithfulness, AnswerRelevancy, ContextPrecision — the research is solid. But it's RAG-first. It doesn't cover general LLM evaluation, agent tool-use quality, or document extraction accuracy.

The gap: none of them give you a single unified interface that works across 17 different frameworks without requiring a backend.


The Architecture: Four Evaluation Layers

The interesting design choice in Custom Evals is that there's no single "evaluator." There are four distinct layers. Use any or all of them.

Custom Evals
├── Layer 1: Code-Based Metrics       (deterministic, zero LLM cost)
├── Layer 2: LLM-Based Evaluators     (LLM-as-judge, semantic quality)
├── Layer 3: NLP Similarity Metrics   (BLEU, ROUGE, cosine, Jaro-Winkler...)
└── Layer 4: OCR / Document Metrics   (for non-LLM extraction pipelines)

Layer 1 — Deterministic Checks (Free & Fast)

No LLM call. No network latency. No API cost. Just math.

from custom.evals.metrics import exact_match, sentiment_score

score = exact_match({"output": "Paris", "expected": "Paris"})
# Score(score=1.0, label="exact_match")

score = sentiment_score({"output": "The product is absolutely fantastic!"})
# Score(score=0.9, label="positive")

Want to register your own? A decorator turns a plain function into an evaluator:

from custom.evals import create_evaluator, Score

@create_evaluator(name="json_validity", direction="maximize")
def json_validity(output: str) -> Score:
    import json
    try:
        json.loads(output)
        return Score(score=1.0, label="valid", name="json_validity")
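    except (ValueError, TypeError):  # json.JSONDecodeError subclasses ValueError; TypeError covers non-string input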
        return Score(score=0.0, label="invalid", name="json_validity")

Layer 2 — LLM-as-Judge (The Semantic Layer)

Four production-ready evaluators ship out of the box:

Evaluator              | What It Measures                       | Needs Ground Truth?
-----------------------|----------------------------------------|--------------------
HallucinationEvaluator | Does output contradict its context?    | No
CorrectnessEvaluator   | Factually correct vs expected answer?  | Yes
RelevanceEvaluator     | Does it actually answer the question?  | No
CoherenceEvaluator     | Logical flow and internal consistency? | No

Plus two RAG-specific ones: FaithfulnessEvaluator and AnswerRelevancyEvaluator.

One subtle but important detail — every evaluator declares a DIRECTION:

class HallucinationEvaluator(LLMEvaluator):
    DIRECTION = "minimize"  # Lower score = less hallucination = better ✅

class CoherenceEvaluator(LLMEvaluator):
    DIRECTION = "maximize"  # Higher score = more coherent = better ✅

This means your test thresholds work correctly regardless of metric semantics. You don't need to remember "is higher hallucination score good or bad?" — the evaluator tells you.
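
To make that concrete, here is a minimal sketch of a direction-aware threshold check. The passes helper is hypothetical (it is not part of the library's API as shown in this post); it only illustrates how a declared direction lets one comparison function serve both kinds of metric.

```python
def passes(score: float, threshold: float, direction: str) -> bool:
    """Return True when `score` is on the good side of `threshold`.

    direction="maximize": higher is better, so pass when score >= threshold.
    direction="minimize": lower is better, so pass when score <= threshold.
    """
    if direction == "maximize":
        return score >= threshold
    if direction == "minimize":
        return score <= threshold
    raise ValueError(f"unknown direction: {direction!r}")


# Coherence is maximized: 0.9 clears a 0.7 bar.
print(passes(0.9, 0.7, "maximize"))  # True
# Hallucination is minimized: 0.9 is above a 0.2 ceiling, so it fails.
print(passes(0.9, 0.2, "minimize"))  # False
```

In a test suite, the direction would come from the evaluator itself (for example its DIRECTION attribute), so the assertion never hard-codes which way is "good".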

Layer 3 — NLP Similarity Metrics

Seven industry-standard metrics for reference-based comparison, no LLM required:

  • BLEU Score — N-gram precision, the machine translation standard
  • ROUGE-N / ROUGE-L — Recall-oriented overlap, the summarization gold standard
  • Jaro-Winkler — Character-level string similarity with a bonus for shared prefixes, great for entity matching
  • Dice Coefficient — Bigram overlap, fast and symmetric
  • Token F1 Score — Precision/recall at the token level
  • Cosine Similarity (TF-IDF) — Vector-space document comparison

from custom.evals.metrics import bleu_score, rouge_n, cosine_similarity_tfidf

result = bleu_score({
    "output": "The model predicts outcomes accurately",
    "expected": "The model accurately predicts outcomes"
})
print(result.score)  # 0.71
print(result.metadata)  # {"brevity_penalty": 1.0, "n_gram_precisions": [...]}

All seven return the same standardized Score object. Mix and match freely.
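
If you want to see where a number like the 0.71 in the BLEU example can come from, here is a from-scratch sketch of unsmoothed BLEU. It is not the library's implementation, and the max_n parameter is an assumption: with unigrams and bigrams only, the two example sentences score about 0.71, while the classic four-gram variant scores 0 because the sentences share no trigrams.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n):
    """Clipped n-gram precision: a candidate n-gram counts at most as often as it occurs in the reference."""
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0


def bleu(candidate_text, reference_text, max_n=2):
    candidate = candidate_text.lower().split()
    reference = reference_text.lower().split()
    precisions = [modified_precision(candidate, reference, n) for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0  # unsmoothed BLEU collapses to zero if any n-gram order has no matches
    # Brevity penalty: 1.0 unless the candidate is shorter than the reference.
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(reference) / len(candidate))
    # Geometric mean of the per-order precisions, scaled by the brevity penalty.
    return brevity_penalty * math.exp(sum(math.log(p) for p in precisions) / max_n)


output = "The model predicts outcomes accurately"
expected = "The model accurately predicts outcomes"
print(round(bleu(output, expected, max_n=2), 2))  # 0.71
print(bleu(output, expected, max_n=4))            # 0.0
```

The takeaway for thresholds: BLEU is sensitive to word order and to the n-gram order used, so compare scores only between runs that use the same configuration.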

Layer 4 — Document Extraction & OCR Metrics

This is the most underrated part of the framework. Not everything you evaluate is an LLM.

If AWS Textract, Google Cloud Vision, or Azure Form Recognizer is in your pipeline, you need evaluation metrics for those outputs too:

  • text_extraction_accuracy — Fuzzy sequence similarity
  • character_error_rate (CER) — Standard OCR benchmarking metric
  • word_error_rate (WER) — Used in document parsing and speech-to-text
  • bounding_box_iou — Intersection over Union for spatial accuracy
  • field_extraction_f1 — Precision/recall for structured form fields

from custom.evals.metrics import text_extraction_accuracy, character_error_rate, bounding_box_iou

eval_input = {
    "output": "Invoice Date: 12/31/2025\nTotal: $1,234.56",
    "expected": "Invoice Date: 12/31/2025\nTotal: $1,234.56",
    "output_bbox": {"x": 10, "y": 20, "width": 100, "height": 30},
    "expected_bbox": {"x": 10, "y": 20, "width": 100, "height": 30}
}

print(f"Accuracy: {text_extraction_accuracy(eval_input).score:.2%}")  # 100.00%
print(f"CER: {character_error_rate(eval_input).metadata['raw_cer']:.2%}")  # 0.00%
print(f"IoU: {bounding_box_iou(eval_input).score:.2f}")                   # 1.00

None of the pure-LLM evaluation frameworks address this. Custom Evals does.
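
The example above scores a perfect extraction, which hides how these metrics behave on errors. Here is a standalone sketch of the textbook definitions: CER and WER as edit distance divided by reference length, and IoU for boxes in the same x/y/width/height form. The function names are illustrative and this is not the library's code, whose normalization may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of tokens)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(expected: str, output: str) -> float:
    """Character error rate: character edits divided by reference length."""
    return edit_distance(expected, output) / max(len(expected), 1)


def wer(expected: str, output: str) -> float:
    """Word error rate: word-level edits divided by reference word count."""
    ref_words, hyp_words = expected.split(), output.split()
    return edit_distance(ref_words, hyp_words) / max(len(ref_words), 1)


def iou(a: dict, b: dict) -> float:
    """Intersection over Union for boxes given as x, y, width, height."""
    x_left = max(a["x"], b["x"])
    y_top = max(a["y"], b["y"])
    x_right = min(a["x"] + a["width"], b["x"] + b["width"])
    y_bottom = min(a["y"] + a["height"], b["y"] + b["height"])
    intersection = max(0, x_right - x_left) * max(0, y_bottom - y_top)
    union = a["width"] * a["height"] + b["width"] * b["height"] - intersection
    return intersection / union if union else 0.0


# One misread digit in a 16-character field.
print(cer("Total: $1,234.56", "Total: $1,284.56"))   # 0.0625
# A dropped colon changes one of three words.
print(round(wer("Invoice Date: 12/31/2025", "Invoice Date 12/31/2025"), 3))  # 0.333
# Predicted box shifted right by half its width.
box = {"x": 10, "y": 20, "width": 100, "height": 30}
shifted = {"x": 60, "y": 20, "width": 100, "height": 30}
print(round(iou(box, shifted), 3))  # 0.333
```

Note that CER and WER are error rates, so lower is better, while IoU is a maximize metric. This is the same direction question the LLM evaluators answer with DIRECTION.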


17+ Framework Integrations — The Full Picture

The pattern is the same across every framework:

# 1. Run your framework-specific agent (different per framework)
result = your_agent.run(query)
response = extract_response(result)  # framework-specific extraction

# 2. Evaluate with Custom Evals (always the same)
eval_input = {
    "input": query,
    "output": response,
    "context": relevant_context,  # optional
    "expected": ground_truth_answer  # optional
}

score = evaluator.evaluate(eval_input)

The eval_input dict is the universal adapter. Every integration reduces to filling this dict.
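
As a rough illustration of that adapter idea, here is a sketch of a helper that normalizes a few common result shapes into the shared dict. Both extract_response and build_eval_input are hypothetical, and the keys and attributes they probe (output, answer, response, content) are assumptions. Real frameworks differ, so each integration would typically ship its own small extractor.

```python
from typing import Any, Optional

CANDIDATE_FIELDS = ("output", "answer", "response", "content")  # assumed names, not a standard


def extract_response(result: Any) -> str:
    """Hypothetical helper: pull the final text out of a few common result shapes."""
    if isinstance(result, str):
        return result
    if isinstance(result, dict):
        for key in CANDIDATE_FIELDS:
            if key in result:
                return str(result[key])
    for attr in CANDIDATE_FIELDS:
        if hasattr(result, attr):
            return str(getattr(result, attr))
    raise TypeError(f"don't know how to extract text from {type(result).__name__}")


def build_eval_input(query: str, result: Any,
                     context: Optional[str] = None,
                     expected: Optional[str] = None) -> dict:
    """Fill the shared dict; optional keys are left out when not supplied."""
    eval_input = {"input": query, "output": extract_response(result)}
    if context is not None:
        eval_input["context"] = context
    if expected is not None:
        eval_input["expected"] = expected
    return eval_input


print(build_eval_input("What is AI?", {"answer": "Artificial intelligence."}))
# {'input': 'What is AI?', 'output': 'Artificial intelligence.'}
```

Raising on an unknown shape is deliberate: silently evaluating an empty string would produce scores that look valid but mean nothing.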

Here's a quick tour of what's covered:

☁️ Cloud Platforms

  • AWS Strands Agents (Bedrock + Claude)
  • Google ADK (Gemini 1.5 Flash/Pro)
  • Databricks Agent Bricks SDK (native MLflow experiment tracking included)

🏢 Microsoft Ecosystem

  • Microsoft Agent Framework
  • Semantic Kernel (plugin output boundaries)
  • AutoGen (individual turns and full conversation outcomes)

🔗 LangChain & LlamaIndex

  • LangGraph (stateful graph evaluation)
  • LlamaIndex Workflows (event-driven hooks)
  • LangChain RAG + LlamaIndex RAG (full faithfulness/relevancy stack)

🤖 OpenAI

  • OpenAI Agents Framework (tool calls, handoffs)
  • OpenAI Agents SDK (function calling, structured outputs)
  • OpenAI Assistants (threads and run-based responses)
  • OpenAI Swarm (experimental multi-agent)

🌍 Community Frameworks

  • Agno (multi-agent at scale)
  • CrewAI (role-based agent outputs)
  • Pydantic AI (type-safe, structured outputs)

A Real Production Pipeline (Async + Concurrent)

Here's what running evaluation at scale actually looks like:

import asyncio
from custom.evals import (
    HallucinationEvaluator,
    FaithfulnessEvaluator,
    AnswerRelevancyEvaluator,
    RelevanceEvaluator
)
from custom.evals.llm import LLM
from custom.evals.metrics import bleu_score, rouge_n

async def evaluate_rag_batch(queries, rag_pipeline):
    llm = LLM(provider="openai", model="gpt-4o-mini")
    evaluators = {
        "hallucination": HallucinationEvaluator(llm),
        "faithfulness": FaithfulnessEvaluator(llm),
        "answer_relevancy": AnswerRelevancyEvaluator(llm),
        "relevance": RelevanceEvaluator(llm),
    }
    results = []

    for query in queries:
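        # Synchronous call: this blocks the event loop until the pipeline returns
        response = rag_pipeline.query(query.text)
        eval_input = {
            "input": query.text,
            "output": response.answer,
            # Assumes source_nodes is a list of strings; convert first if your pipeline returns node objects
            "context": "\n".join(response.source_nodes),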
            "expected": query.expected_answer
        }

        # Run this query's LLM evaluations concurrently rather than one by one
        llm_scores = await asyncio.gather(*[
            evaluators[name].evaluate_async(eval_input)
            for name in evaluators
        ])

        row = {"query": query.text}
        for name, score in zip(evaluators.keys(), llm_scores):
            row[f"{name}_score"] = score.score
            row[f"{name}_label"] = score.label

        # Add deterministic metrics at zero cost
        if query.expected_answer:
            row["bleu"] = bleu_score(eval_input).score
            row["rouge_1"] = rouge_n(eval_input).score

        results.append(row)

    return results

results = asyncio.run(evaluate_rag_batch(test_queries, my_rag_pipeline))

# Aggregate and report
import statistics
faithfulness_scores = [r["faithfulness_score"] for r in results]
print(f"Mean Faithfulness: {statistics.mean(faithfulness_scores):.3f}")
print(f"Failing: {sum(1 for s in faithfulness_scores if s < 0.7)}/{len(results)}")

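Within each query, the four LLM judge calls run concurrently, and the deterministic checks add no API cost. Note that the loop still handles queries one at a time and that rag_pipeline.query is a blocking call, so for larger batches the next step is to overlap work across queries as well.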
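
Here is one way that could look: a sketch that fans out across queries as well as evaluators, with a semaphore capping the number of in-flight judge calls so you stay under provider rate limits. StubEvaluator is a stand-in that sleeps instead of calling an API, and evaluate_one, evaluate_all, and max_in_flight are illustrative names, not part of Custom Evals. Only the evaluate_async method name is taken from the pipeline above.

```python
import asyncio


class StubEvaluator:
    """Stand-in for an LLM judge; sleeps briefly instead of calling an API."""

    def __init__(self, name: str):
        self.name = name

    async def evaluate_async(self, eval_input: dict) -> float:
        await asyncio.sleep(0.01)  # simulates network latency
        return float(len(eval_input["output"]) > 0)


async def evaluate_one(eval_input, evaluators, limiter):
    async def guarded(evaluator):
        async with limiter:  # cap the number of in-flight judge calls
            return await evaluator.evaluate_async(eval_input)

    scores = await asyncio.gather(*(guarded(e) for e in evaluators))
    return {e.name: s for e, s in zip(evaluators, scores)}


async def evaluate_all(eval_inputs, evaluators, max_in_flight=8):
    limiter = asyncio.Semaphore(max_in_flight)
    # gather preserves input order, so rows line up with eval_inputs
    return await asyncio.gather(*(evaluate_one(x, evaluators, limiter) for x in eval_inputs))


inputs = [{"input": f"q{i}", "output": f"a{i}"} for i in range(20)]
judges = [StubEvaluator("faithfulness"), StubEvaluator("relevance")]
rows = asyncio.run(evaluate_all(inputs, judges))
print(len(rows), rows[0])  # 20 {'faithfulness': 1.0, 'relevance': 1.0}
```

The semaphore is the important design choice: unbounded gather over thousands of queries tends to trade a serial bottleneck for rate-limit errors.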


The Ground Truth Problem (And How It's Handled)

Here's a question most eval frameworks dodge: what if I don't have ground truth?

In production, users ask unpredictable questions. You can't pre-label every possible answer. Custom Evals explicitly handles three scenarios:

Reference-free (no ground truth needed):

# Hallucination only requires context, not an expected answer
hallucination_eval = HallucinationEvaluator(llm)  # llm as constructed in the earlier examples
score = hallucination_eval.evaluate({
    "input": "Who wrote Hamlet?",
    "output": "Shakespeare wrote Hamlet in 1600.",
    "context": "William Shakespeare wrote Hamlet circa 1600."
    # No "expected" key — still meaningful evaluation
})

Soft ground truth (intent-based):

# Answer relevancy checks if the answer addresses the question's intent
answer_relevancy_eval = AnswerRelevancyEvaluator(llm)  # llm as constructed in the earlier examples
score = answer_relevancy_eval.evaluate({
    "input": "Who wrote Hamlet?",
    "output": "Shakespeare wrote Hamlet in 1600."
    # No expected answer — evaluates relevance to the question
})

Hard ground truth (known correct answers):

# Full correctness + BLEU/ROUGE when you have labeled data
correctness_eval = CorrectnessEvaluator(llm)  # llm as constructed in the earlier examples
score = correctness_eval.evaluate({
    "input": "Who wrote Hamlet?",
    "output": "Shakespeare wrote Hamlet in 1600.",
    "expected": "William Shakespeare"
})

This matters. Evaluation infrastructure that only works with labeled datasets is evaluation infrastructure you won't actually use in production.
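
One practical consequence is that a pipeline can decide per record which checks are meaningful. The sketch below shows the idea with a hypothetical applicable_checks helper. The check names echo evaluators and metrics mentioned in this post, but the selection rule is an illustrative assumption, not library behavior.

```python
def applicable_checks(eval_input: dict) -> list[str]:
    """Return which checks the available keys can support (illustrative rule)."""
    checks = ["relevance", "coherence"]               # need only input and output
    if eval_input.get("context"):
        checks += ["hallucination", "faithfulness"]   # compare output against retrieved context
    if eval_input.get("expected"):
        checks += ["correctness", "bleu", "rouge"]    # compare output against a reference answer
    return checks


print(applicable_checks({"input": "Who wrote Hamlet?", "output": "Shakespeare."}))
# ['relevance', 'coherence']
print(applicable_checks({
    "input": "Who wrote Hamlet?",
    "output": "Shakespeare.",
    "expected": "William Shakespeare",
}))
# ['relevance', 'coherence', 'correctness', 'bleu', 'rouge']
```

Production traffic then gets the reference-free checks, and the labeled regression set gets everything.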


Observability: Beyond Individual Scores

Custom Evals integrates with Phoenix Tracing (Arize) for production monitoring. One initialization line instruments everything:

from custom.evals import initialize_tracing

initialize_tracing(
    phoenix_endpoint="http://localhost:6006/v1/traces",
    metrics_enabled=True,
    metrics_export_interval=30000  # in milliseconds: export every 30 seconds
)

After this, every evaluator call automatically:

  • Creates an OpenTelemetry span with timing + attributes
  • Increments evaluation counters
  • Records score distributions
  • Records latency histograms, from which P50/P95/P99 can be read

Real-time dashboards showing score distributions over time — proactive monitoring instead of reactive debugging.
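
If the percentile shorthand is unfamiliar: P95 is the latency that 95% of calls come in under. A small standard-library sketch follows, independent of any tracing backend. The latency_percentiles helper is hypothetical, and a real metrics backend would usually estimate these values from histogram buckets rather than raw samples.

```python
import statistics


def latency_percentiles(latencies_ms):
    """Return P50, P95 and P99 from raw latency samples (needs at least two)."""
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


samples = [float(ms) for ms in range(1, 102)]  # 1 ms through 101 ms
print(latency_percentiles(samples))  # {'p50': 51.0, 'p95': 96.0, 'p99': 100.0}
```

For LLM-as-judge calls, P95 and P99 are usually the numbers to watch, because a few slow judge calls can dominate the wall-clock time of a concurrent batch.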


Framework Comparison

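Feature                   | Custom Evals | Phoenix Evals | DeepEval | RAGAS
--------------------------|--------------|---------------|----------|--------
Installation              | Low friction | Medium        | Medium   | Low
Agent framework support   | 17+          | Limited       | Limited  | Limited
LLM-as-judge metrics      | 6            | Many          | 50+      | 4
Deterministic NLP metrics | 10+          | Few           | Few      | Few
OCR/Document evaluation   | Yes          | No            | No       | No
OpenTelemetry tracing     | Optional     | Core          |          |
No backend required       | Yes          |               |          |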

The honest take: Custom Evals isn't trying to replace DeepEval or RAGAS. It's designed to be the evaluation layer you can plug into any stack. Run it alongside DeepEval for deeper metric coverage. Run it alongside Phoenix for full observability. It's composable by design.


Get Started in 5 Minutes

# Step 1: Install (run from the root of a local clone of the repository)
pip install -e ".[dev]"

# Step 2: Set your API key
export OPENAI_API_KEY="sk-..."
# Step 3: Run your first real evaluation
from custom.evals import HallucinationEvaluator
from custom.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
evaluator = HallucinationEvaluator(llm)

score = evaluator.evaluate({
    "input": "What year was Python created?",
    "output": "Python was created in 1991 by Guido van Rossum.",
    "context": "Python was created in 1991 by Guido van Rossum."
})

print(f"Result: {score.label}")        # factual
print(f"Score: {score.score}")         # 0.0 (no hallucination = good)
print(f"Reason: {score.explanation}")
# Step 4: Add free deterministic checks
from custom.evals.metrics import bleu_score, rouge_n

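# Example values (placeholders: substitute your model's answer and your reference)
my_answer = "Python was created in 1991 by Guido van Rossum."
ground_truth = "Python was created by Guido van Rossum in 1991."

for metric in [bleu_score, rouge_n]: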
    result = metric({"output": my_answer, "expected": ground_truth})
    print(f"{result.name}: {result.score:.3f}")

What's Coming Next

The roadmap has some meaningful additions in the pipeline:

  • Context Precision & Recall — The two RAGAS metrics that complete the standard RAG evaluation quadrant
  • Safety Metrics — Bias and toxicity detection
  • Agentic Metrics — Tool call correctness, task completion rate, step efficiency for multi-agent systems
  • Extended Provider Support — Cohere, Mistral, Ollama (the strategy pattern makes this straightforward)

Wrapping Up

The LLM evaluation space is fragmented. Teams are building on different stacks. Frameworks produce different output shapes. Use cases demand different metrics.

Custom Evals is an honest acknowledgment of that fragmentation — and a practical response to it.

It won't be the only eval library you ever use. It will be the one you can actually drop into any stack without rebuilding your infrastructure around it.

Because in a world where you're choosing between 17 agent frameworks on any given sprint, having a single evaluation interface that works across all of them isn't a nice-to-have.

It's the difference between knowing your agent works and hoping it does.


Resources & Links

  • 🔗 GitHub: anjijava16/cust-evals
  • 📖 Framework Index: docs/FRAMEWORK_INDEX.md — all 17+ integrations
  • 🚀 Quick Start: guides/QUICKSTART.md
  • 📄 Non-LLM Evaluation: docs/NON_LLM_EVALUATION_GUIDE.md — OCR and Textract guide
  • 📊 Advanced Metrics: docs/ADVANCED_METRICS_GUIDE.md — BLEU, ROUGE, and beyond

Building evaluation pipelines for AI systems? What metrics have you found most actionable in production? Drop your experience in the comments — genuinely curious what's working for others.
