Anjaiah Methuku

Posted on May 24

Stop Flying Blind: We Built an LLM Evaluation Framework That Works Across 17+ Agent Frameworks

#agents #llm #showdev #testing

Let me be brutally honest with you.

I've seen teams demo AI agents that look incredible — smooth responses, beautiful UI, stakeholders impressed. Then that same team ships to production and spends the next three weeks firefighting hallucinations they could have caught in testing.

The problem isn't the AI. The problem is nobody evaluated it properly.

Not because they didn't want to. Because the existing tools made it painful.

You're building with LangGraph on Monday. LlamaIndex RAG pipeline on Wednesday. The product team wants CrewAI by Friday. Every framework has different output shapes. Every eval tool wants you to rebuild your stack around it.

So you ship anyway. With fingers crossed.

That's the exact problem I set out to solve with Custom Evals.

What Is Custom Evals?

Custom Evals is an open-source, lightweight evaluation framework for LLM outputs with support for 17+ agent frameworks and a multi-layer metric system — from fast deterministic checks to full LLM-as-judge scoring.

pip install -e ".[dev]"

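That's it, assuming you run it from a local clone of the repository (the -e flag is an editable install of the current directory). No required backend. No dashboard to stand up. No mandatory test runner.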

Here's your first evaluation in 10 lines:

from custom.evals import CoherenceEvaluator
from custom.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
evaluator = CoherenceEvaluator(llm)

score = evaluator.evaluate({
    "input": "What is AI?",
    "output": "AI is artificial intelligence, enabling machines to perform intelligent tasks."
})

print(f"{score.label}: {score.explanation}")
# coherent: The response provides a clear, logical explanation...

A Score object. A label. An explanation. That's the entire interface.
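
For orientation, the result object can be pictured as a small dataclass. The sketch below is an assumption based only on the fields that appear in the examples in this post (score, label, explanation, name, metadata). `ScoreSketch` and `summarize` are hypothetical names, and the library's real Score class may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class ScoreSketch:
    """Hypothetical stand-in for the library's Score object."""
    name: str
    score: Optional[float] = None
    label: Optional[str] = None
    explanation: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


def summarize(result: ScoreSketch) -> str:
    """Format a result the same way regardless of which metric produced it."""
    value = "n/a" if result.score is None else f"{result.score:.2f}"
    return f"{result.name}={value} ({result.label}): {result.explanation}"


example = ScoreSketch(
    name="coherence",
    score=1.0,
    label="coherent",
    explanation="The response is clear and logically ordered.",
)
print(summarize(example))
```

Because every metric returns the same shape, reporting code like `summarize` only has to be written once.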

Why Existing Tools Leave Gaps

I want to be fair here — the existing eval tools are genuinely good. But they each have a niche.

Phoenix Evals (Arize) is brilliant if you're deep in the Arize observability ecosystem. The Custom Evals architecture is openly inspired by it. But Phoenix is a full observability platform. If you just want to score outputs without standing up a tracing infrastructure, it's overkill.

DeepEval has 50+ metrics — impressive. But it requires a specific test runner, a specific file format, and an opinionated workflow. It's a comprehensive evaluation suite, not a lightweight library.

RAGAS is surgical and excellent at RAG evaluation specifically. Faithfulness, AnswerRelevancy, ContextPrecision — the research is solid. But it's RAG-first. It doesn't cover general LLM evaluation, agent tool-use quality, or document extraction accuracy.

The gap: none of them give you a single unified interface that works across 17 different frameworks without requiring a backend.

The Architecture: Four Evaluation Layers

The interesting design choice in Custom Evals is that there's no single "evaluator." There are four distinct layers. Use any or all of them.

Custom Evals
├── Layer 1: Code-Based Metrics       (deterministic, zero LLM cost)
├── Layer 2: LLM-Based Evaluators     (LLM-as-judge, semantic quality)
├── Layer 3: NLP Similarity Metrics   (BLEU, ROUGE, cosine, Jaro-Winkler...)
└── Layer 4: OCR / Document Metrics   (for non-LLM extraction pipelines)

Layer 1 — Deterministic Checks (Free & Fast)

No LLM call. No network round trip. No API cost. Just math.

from custom.evals.metrics import exact_match, sentiment_score

score = exact_match({"output": "Paris", "expected": "Paris"})
# Score(score=1.0, label="exact_match")

score = sentiment_score({"output": "The product is absolutely fantastic!"})
# Score(score=0.9, label="positive")

Want to register your own? The decorator keeps it to a few lines:

import json

from custom.evals import create_evaluator, Score

@create_evaluator(name="json_validity", direction="maximize")
def json_validity(output: str) -> Score:
    # Valid JSON parses cleanly; anything else raises and scores 0.
    try:
        json.loads(output)
        return Score(score=1.0, label="valid", name="json_validity")
    except (json.JSONDecodeError, TypeError):
        return Score(score=0.0, label="invalid", name="json_validity")
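
To make the decorator pattern less magical, here is a minimal sketch of how a registering decorator of this kind could work. It is not the library's implementation: `register_metric`, `REGISTRY`, and the plain-dict result are hypothetical names used only for illustration.

```python
import json
from typing import Callable, Dict, Tuple

# Hypothetical registry: metric name -> (direction, function).
REGISTRY: Dict[str, Tuple[str, Callable]] = {}


def register_metric(name: str, direction: str = "maximize") -> Callable:
    """Return a decorator that records the function under `name`."""
    if direction not in ("maximize", "minimize"):
        raise ValueError(f"unknown direction: {direction!r}")

    def decorator(func: Callable) -> Callable:
        REGISTRY[name] = (direction, func)
        return func  # the function stays directly callable

    return decorator


@register_metric(name="json_validity", direction="maximize")
def json_validity(output: str) -> dict:
    try:
        json.loads(output)
    except (json.JSONDecodeError, TypeError):
        return {"name": "json_validity", "score": 0.0, "label": "invalid"}
    return {"name": "json_validity", "score": 1.0, "label": "valid"}


direction, metric = REGISTRY["json_validity"]
print(direction, metric('{"ok": true}')["label"], metric("not json")["label"])
```

The decorator returns the original function unchanged, so a registered metric can still be called directly in a unit test or looked up by name from a config file.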

Layer 2 — LLM-as-Judge (The Semantic Layer)

Four production-ready evaluators ship out of the box:

Evaluator	What It Measures	Needs Ground Truth?
`HallucinationEvaluator`	Does output contradict its context?	No
`CorrectnessEvaluator`	Factually correct vs expected answer?	Yes
`RelevanceEvaluator`	Does it actually answer the question?	No
`CoherenceEvaluator`	Logical flow and internal consistency?	No

Plus two RAG-specific ones: FaithfulnessEvaluator and AnswerRelevancyEvaluator.

One subtle but important detail — every evaluator declares a DIRECTION:

class HallucinationEvaluator(LLMEvaluator):
    DIRECTION = "minimize"  # Lower score = less hallucination = better ✅

class CoherenceEvaluator(LLMEvaluator):
    DIRECTION = "maximize"  # Higher score = more coherent = better ✅

Because each evaluator declares which way is better, a threshold check can compare in the right direction automatically. You don't need to remember "is a higher hallucination score good or bad?"; the evaluator tells you.
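
A direction-aware threshold check can be sketched in a few lines. The `passes` helper below is hypothetical, not part of the library, and assumes the direction is one of the two strings shown above.

```python
def passes(score: float, threshold: float, direction: str) -> bool:
    """Return True when `score` is on the good side of `threshold`."""
    if direction == "maximize":
        return score >= threshold
    if direction == "minimize":
        return score <= threshold
    raise ValueError(f"unknown direction: {direction!r}")


# Same helper, opposite semantics.
print(passes(0.1, threshold=0.3, direction="minimize"))  # low hallucination score
print(passes(0.8, threshold=0.7, direction="maximize"))  # high coherence score
print(passes(0.8, threshold=0.3, direction="minimize"))  # high hallucination score
```

In a test suite, the direction would come from the evaluator class rather than being hard-coded at each call site.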

Layer 3 — NLP Similarity Metrics

Seven industry-standard metrics for reference-based comparison, no LLM required:

BLEU Score — N-gram precision, the machine translation standard
ROUGE-N / ROUGE-L — Recall-oriented overlap, the summarization gold standard
Jaro-Winkler — Edit distance with prefix weighting, great for entity matching
Dice Coefficient — Bigram overlap, fast and symmetric
Token F1 Score — Precision/recall at the token level
Cosine Similarity (TF-IDF) — Vector-space document comparison

from custom.evals.metrics import bleu_score, rouge_n, cosine_similarity_tfidf

result = bleu_score({
    "output": "The model predicts outcomes accurately",
    "expected": "The model accurately predicts outcomes"
})
print(result.score)  # 0.71
print(result.metadata)  # {"brevity_penalty": 1.0, "n_gram_precisions": [...]}

All seven return the same standardized Score object. Mix and match freely.
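
If you want to see where a number like 0.71 can come from, here is a from-scratch sketch of unsmoothed sentence-level BLEU against a single reference. For the sentence pair above, scoring up to bigrams gives roughly 0.71, while standard 4-gram BLEU without smoothing gives 0 because no trigram matches. Which n-gram order or smoothing the library uses is not stated here, so treat `max_n` as an assumption.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(output: str, expected: str, max_n: int = 2) -> float:
    """Unsmoothed sentence-level BLEU against a single reference."""
    out, ref = output.lower().split(), expected.lower().split()
    if not out:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        out_counts, ref_counts = ngrams(out, n), ngrams(ref, n)
        total = sum(out_counts.values())
        # Clip each n-gram count by its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in out_counts.items())
        if total == 0 or matched == 0:
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty punishes outputs shorter than the reference.
    bp = 1.0 if len(out) >= len(ref) else math.exp(1 - len(ref) / len(out))
    return bp * math.exp(sum(log_precisions) / max_n)


out = "The model predicts outcomes accurately"
ref = "The model accurately predicts outcomes"
print(round(bleu(out, ref, max_n=2), 2))  # 0.71
print(bleu(out, ref, max_n=4))            # 0.0, no trigram matches
```

The takeaway: BLEU on short sentences is very sensitive to word order and to the n-gram order chosen, so compare scores only between runs that use the same settings.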

Layer 4 — Document Extraction & OCR Metrics

This is the most underrated part of the framework. Not everything you evaluate is an LLM.

If AWS Textract, Google Cloud Vision, or Azure AI Document Intelligence (formerly Form Recognizer) is in your pipeline, you need evaluation metrics for those outputs too:

text_extraction_accuracy — Fuzzy sequence similarity
character_error_rate (CER) — Standard OCR benchmarking metric
word_error_rate (WER) — Used in document parsing and speech-to-text
bounding_box_iou — Intersection over Union for spatial accuracy
field_extraction_f1 — Precision/recall for structured form fields

from custom.evals.metrics import text_extraction_accuracy, character_error_rate, bounding_box_iou

eval_input = {
    "output": "Invoice Date: 12/31/2025\nTotal: $1,234.56",
    "expected": "Invoice Date: 12/31/2025\nTotal: $1,234.56",
    "output_bbox": {"x": 10, "y": 20, "width": 100, "height": 30},
    "expected_bbox": {"x": 10, "y": 20, "width": 100, "height": 30}
}

print(f"Accuracy: {text_extraction_accuracy(eval_input).score:.2%}")  # 100.00%
print(f"CER: {character_error_rate(eval_input).metadata['raw_cer']:.2%}")  # 0.00%
print(f"IoU: {bounding_box_iou(eval_input).score:.2f}")                   # 1.00

None of the pure-LLM evaluation frameworks address this. Custom Evals does.
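
For readers who have not met these metrics before, the definitions are short enough to write out. The sketch below shows textbook versions of CER, WER, and IoU. The helper names are hypothetical, and the library's own functions may normalize text or handle edge cases differently.

```python
from typing import Dict, Sequence


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance between two sequences (chars or words)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(output: str, expected: str) -> float:
    """Character error rate: character edits / reference length."""
    return edit_distance(output, expected) / max(len(expected), 1)


def wer(output: str, expected: str) -> float:
    """Word error rate: word edits / reference word count."""
    ref_words = expected.split()
    return edit_distance(output.split(), ref_words) / max(len(ref_words), 1)


def iou(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Intersection over Union for boxes given as x, y, width, height."""
    left, top = max(a["x"], b["x"]), max(a["y"], b["y"])
    right = min(a["x"] + a["width"], b["x"] + b["width"])
    bottom = min(a["y"] + a["height"], b["y"] + b["height"])
    inter = max(0, right - left) * max(0, bottom - top)
    union = a["width"] * a["height"] + b["width"] * b["height"] - inter
    return inter / union if union else 0.0


print(cer("Tota1: $1,234.56", "Total: $1,234.56"))  # one substitution in 16 chars
print(wer("Invoice Date 12/31/2025", "Invoice Date: 12/31/2025"))
print(iou({"x": 0, "y": 0, "width": 10, "height": 10},
          {"x": 5, "y": 0, "width": 10, "height": 10}))
```

Note that CER and WER are error rates, so lower is better, while IoU is an overlap ratio where higher is better.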

17+ Framework Integrations — The Full Picture

The pattern is the same across every framework:

# 1. Run your framework-specific agent (different per framework)
result = your_agent.run(query)
response = extract_response(result)  # framework-specific extraction

# 2. Evaluate with Custom Evals (always the same)
eval_input = {
    "input": query,
    "output": response,
    "context": relevant_context,  # optional
    "expected": ground_truth_answer  # optional
}

score = evaluator.evaluate(eval_input)

The eval_input dict is the universal adapter. Every integration reduces to filling this dict.
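
As an illustration of what "filling this dict" can look like, here is a small adapter sketch. `build_eval_input` and its extraction rules are hypothetical. Each real framework returns its own result type, so the branches below are assumptions that you would replace with that framework's actual accessors.

```python
from typing import Any, Dict, Optional


def build_eval_input(
    query: str,
    response: Any,
    context: Optional[str] = None,
    expected: Optional[str] = None,
) -> Dict[str, str]:
    """Normalize a framework result into the eval_input dict shape."""
    # Hypothetical extraction rules: try common shapes, fall back to str().
    if isinstance(response, str):
        output = response
    elif isinstance(response, dict) and "output" in response:
        output = str(response["output"])
    elif hasattr(response, "content"):
        output = str(response.content)
    else:
        output = str(response)

    eval_input = {"input": query, "output": output}
    # Optional keys are only added when a value is available.
    if context is not None:
        eval_input["context"] = context
    if expected is not None:
        eval_input["expected"] = expected
    return eval_input


class FakeMessage:
    """Stand-in for a framework message object with a .content attribute."""

    def __init__(self, content: str) -> None:
        self.content = content


print(build_eval_input("What is AI?", "Artificial intelligence."))
print(build_eval_input("What is AI?", {"output": "AI."}, context="doc text"))
print(build_eval_input("What is AI?", FakeMessage("AI."), expected="AI"))
```

Keeping the extraction logic in one function means that swapping frameworks only changes this adapter, not the evaluation code downstream.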

Here's a quick tour of what's covered:

☁️ Cloud Platforms

AWS Strands Agents (Bedrock + Claude)
Google ADK (Gemini 1.5 Flash/Pro)
Databricks Agent Bricks SDK (native MLflow experiment tracking included)

🏢 Microsoft Ecosystem

Microsoft Agent Framework
Semantic Kernel (plugin output boundaries)
AutoGen (individual turns and full conversation outcomes)

🔗 LangChain & LlamaIndex

LangGraph (stateful graph evaluation)
LlamaIndex Workflows (event-driven hooks)
LangChain RAG + LlamaIndex RAG (full faithfulness/relevancy stack)

🤖 OpenAI

OpenAI Agents Framework (tool calls, handoffs)
OpenAI Agents SDK (function calling, structured outputs)
OpenAI Assistants (threads and run-based responses)
OpenAI Swarm (experimental multi-agent)

🌍 Community Frameworks

Agno (multi-agent at scale)
CrewAI (role-based agent outputs)
Pydantic AI (type-safe, structured outputs)

A Real Production Pipeline (Async + Concurrent)

Here's what running evaluation at scale actually looks like:

import asyncio
from custom.evals import (
    HallucinationEvaluator,
    FaithfulnessEvaluator,
    AnswerRelevancyEvaluator,
    RelevanceEvaluator
)
from custom.evals.llm import LLM
from custom.evals.metrics import bleu_score, rouge_n

async def evaluate_rag_batch(queries, rag_pipeline):
    llm = LLM(provider="openai", model="gpt-4o-mini")
    evaluators = {
        "hallucination": HallucinationEvaluator(llm),
        "faithfulness": FaithfulnessEvaluator(llm),
        "answer_relevancy": AnswerRelevancyEvaluator(llm),
        "relevance": RelevanceEvaluator(llm),
    }
    results = []

    for query in queries:
        response = rag_pipeline.query(query.text)
        eval_input = {
            "input": query.text,
            "output": response.answer,
            "context": "\n".join(response.source_nodes),
            "expected": query.expected_answer
        }

        # Run all LLM evaluations concurrently — not one-by-one
        llm_scores = await asyncio.gather(*[
            evaluators[name].evaluate_async(eval_input)
            for name in evaluators
        ])

        row = {"query": query.text}
        for name, score in zip(evaluators.keys(), llm_scores):
            row[f"{name}_score"] = score.score
            row[f"{name}_label"] = score.label

        # Add deterministic metrics at zero cost
        if query.expected_answer:
            row["bleu"] = bleu_score(eval_input).score
            row["rouge_1"] = rouge_n(eval_input).score

        results.append(row)

    return results

results = asyncio.run(evaluate_rag_batch(test_queries, my_rag_pipeline))

# Aggregate and report
import statistics
faithfulness_scores = [r["faithfulness_score"] for r in results]
print(f"Mean Faithfulness: {statistics.mean(faithfulness_scores):.3f}")
print(f"Failing: {sum(1 for s in faithfulness_scores if s < 0.7)}/{len(results)}")

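Concurrent async LLM calls + fast deterministic checks. Each query's evaluators run in parallel instead of back to back, which removes the biggest serial bottleneck. Note that this loop still handles queries one at a time; gathering across queries as well is the next step if you need more throughput.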
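
One way to push concurrency further is to gather across queries as well as across evaluators, with a semaphore so you stay under provider rate limits. The sketch below uses a hypothetical `StubEvaluator` that sleeps instead of calling an LLM. Only the `evaluate_async` method name is taken from the pipeline above; everything else is an assumption.

```python
import asyncio
from typing import Dict, List


class StubEvaluator:
    """Hypothetical evaluator that simulates a slow LLM call."""

    def __init__(self, name: str, delay: float = 0.05) -> None:
        self.name = name
        self.delay = delay

    async def evaluate_async(self, eval_input: Dict[str, str]) -> float:
        await asyncio.sleep(self.delay)  # stands in for network latency
        return 1.0 if eval_input["output"] else 0.0


async def evaluate_all(
    inputs: List[Dict[str, str]],
    evaluators: List[StubEvaluator],
    max_concurrency: int = 8,
) -> List[Dict[str, float]]:
    semaphore = asyncio.Semaphore(max_concurrency)

    async def run_one(evaluator: StubEvaluator, item: Dict[str, str]) -> float:
        async with semaphore:  # cap in-flight calls to respect rate limits
            return await evaluator.evaluate_async(item)

    async def run_row(item: Dict[str, str]) -> Dict[str, float]:
        scores = await asyncio.gather(*(run_one(e, item) for e in evaluators))
        return {e.name: s for e, s in zip(evaluators, scores)}

    # gather preserves input order, so rows line up with `inputs`.
    return await asyncio.gather(*(run_row(item) for item in inputs))


inputs = [{"input": f"q{i}", "output": f"a{i}"} for i in range(10)]
evaluators = [StubEvaluator("faithfulness"), StubEvaluator("relevance")]
rows = asyncio.run(evaluate_all(inputs, evaluators))
print(len(rows), rows[0])
```

The `max_concurrency` value is a tuning knob: too low and you are back to near-serial execution, too high and you may hit rate limits.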

The Ground Truth Problem (And How It's Handled)

Here's a question most eval frameworks dodge: what if I don't have ground truth?

In production, users ask unpredictable questions. You can't pre-label every possible answer. Custom Evals explicitly handles three scenarios:

Reference-free (no ground truth needed):

# Hallucination only requires context, not an expected answer
score = hallucination_eval.evaluate({
    "input": "Who wrote Hamlet?",
    "output": "Shakespeare wrote Hamlet in 1600.",
    "context": "William Shakespeare wrote Hamlet circa 1600."
    # No "expected" key — still meaningful evaluation
})

Soft ground truth (intent-based):

# Answer relevancy checks if the answer addresses the question's intent
score = answer_relevancy_eval.evaluate({
    "input": "Who wrote Hamlet?",
    "output": "Shakespeare wrote Hamlet in 1600."
    # No expected answer — evaluates relevance to the question
})

Hard ground truth (known correct answers):

# Full correctness + BLEU/ROUGE when you have labeled data
score = correctness_eval.evaluate({
    "input": "Who wrote Hamlet?",
    "output": "Shakespeare wrote Hamlet in 1600.",
    "expected": "William Shakespeare"
})

This matters. Evaluation infrastructure that only works with labeled datasets is evaluation infrastructure you won't actually use in production.
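
The three scenarios can be folded into one routing step: look at which keys a sample has, and run only the evaluators that apply. The mapping below is a hypothetical sketch based on the examples above, not a table taken from the library.

```python
from typing import Dict, List

# Hypothetical mapping of evaluator name -> keys it needs in eval_input.
REQUIRED_KEYS: Dict[str, List[str]] = {
    "relevance": ["input", "output"],
    "answer_relevancy": ["input", "output"],
    "hallucination": ["input", "output", "context"],
    "correctness": ["input", "output", "expected"],
}


def applicable_evaluators(eval_input: Dict[str, str]) -> List[str]:
    """Return the evaluators whose required keys are present and non-empty."""
    return [
        name
        for name, keys in REQUIRED_KEYS.items()
        if all(eval_input.get(key) for key in keys)
    ]


production_sample = {
    "input": "Who wrote Hamlet?",
    "output": "Shakespeare wrote Hamlet in 1600.",
    "context": "William Shakespeare wrote Hamlet circa 1600.",
}
labeled_sample = {**production_sample, "expected": "William Shakespeare"}

print(applicable_evaluators(production_sample))
print(applicable_evaluators(labeled_sample))
```

With this in place, unlabeled production traffic and labeled regression sets can flow through the same pipeline, and each sample gets every check it qualifies for.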

Observability: Beyond Individual Scores

Custom Evals integrates with Phoenix Tracing (Arize) for production monitoring. One initialization line instruments everything:

from custom.evals import initialize_tracing

initialize_tracing(
    phoenix_endpoint="http://localhost:6006/v1/traces",
    metrics_enabled=True,
    metrics_export_interval=30000  # export every 30 seconds
)

After this, every evaluator call automatically:

Creates an OpenTelemetry span with timing + attributes
Increments evaluation counters
Records score distributions
Computes P50/P95/P99 latency histograms

Real-time dashboards showing score distributions over time — proactive monitoring instead of reactive debugging.

Framework Comparison

Feature	Custom Evals	Phoenix Evals	DeepEval	RAGAS
Installation	Low friction	Medium	Medium	Low
Agent framework support	17+	Limited	Limited	Limited
LLM-as-judge metrics	6	Many	50+	4
Deterministic NLP metrics	10+	Few	Few	Few
OCR/Document evaluation	✅	❌	❌	❌
OpenTelemetry tracing	Optional	Core	❌	❌
No backend required	✅	❌	✅	✅

The honest take: Custom Evals isn't trying to replace DeepEval or RAGAS. It's designed to be the evaluation layer you can plug into any stack. Run it alongside DeepEval for deeper metric coverage. Run it alongside Phoenix for full observability. It's composable by design.

Get Started in 5 Minutes

# Step 1: Install (shell, from a clone of the repository)
pip install -e ".[dev]"

# Step 2: Set your API key (shell)
export OPENAI_API_KEY="sk-..."

# Step 3: Run your first real evaluation (Python)
from custom.evals import HallucinationEvaluator
from custom.evals.llm import LLM

llm = LLM(provider="openai", model="gpt-4o-mini")
evaluator = HallucinationEvaluator(llm)

score = evaluator.evaluate({
    "input": "What year was Python created?",
    "output": "Python was created in 1991 by Guido van Rossum.",
    "context": "Python was created in 1991 by Guido van Rossum."
})

print(f"Result: {score.label}")        # factual
print(f"Score: {score.score}")         # 0.0 (no hallucination = good)
print(f"Reason: {score.explanation}")

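# Step 4: Add free deterministic checks
from custom.evals.metrics import bleu_score, rouge_n

my_answer = "Python was created in 1991 by Guido van Rossum."  # your model's output
ground_truth = "Guido van Rossum created Python in 1991."      # your labeled answer

for metric in [bleu_score, rouge_n]:
    result = metric({"output": my_answer, "expected": ground_truth})
    print(f"{result.name}: {result.score:.3f}")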

What's Coming Next

The roadmap has some meaningful additions in the pipeline:

Context Precision & Recall — The two RAGAS metrics that complete the standard RAG evaluation quadrant
Safety Metrics — Bias and toxicity detection
Agentic Metrics — Tool call correctness, task completion rate, step efficiency for multi-agent systems
Extended Provider Support — Cohere, Mistral, Ollama (the strategy pattern makes this straightforward)

Wrapping Up

The LLM evaluation space is fragmented. Teams are building on different stacks. Frameworks produce different output shapes. Use cases demand different metrics.

Custom Evals is an honest acknowledgment of that fragmentation — and a practical response to it.

It won't be the only eval library you ever use. It will be the one you can actually drop into any stack without rebuilding your infrastructure around it.

Because in a world where you're choosing between 17 agent frameworks on any given sprint, having a single evaluation interface that works across all of them isn't a nice-to-have.

It's the difference between knowing your agent works and hoping it does.

Resources & Links

🔗 GitHub: anjijava16/cust-evals
📖 Framework Index: docs/FRAMEWORK_INDEX.md — all 17+ integrations
🚀 Quick Start: guides/QUICKSTART.md
📄 Non-LLM Evaluation: docs/NON_LLM_EVALUATION_GUIDE.md — OCR and Textract guide
📊 Advanced Metrics: docs/ADVANCED_METRICS_GUIDE.md — BLEU, ROUGE, and beyond

Building evaluation pipelines for AI systems? What metrics have you found most actionable in production? Drop your experience in the comments — genuinely curious what's working for others.
