DEV Community

Thesius Code

Posted on • Originally published at datanest-stores.pages.dev

LLM Evaluation Framework


You can't improve what you can't measure. This framework gives you automated, repeatable evaluation harnesses for LLM outputs — with built-in metrics for accuracy, relevance, coherence, and safety, plus custom metric support. Run evaluations in CI/CD, track quality over time, compare models head-to-head, and catch regressions before they reach production.

Key Features

  • Automated Eval Harnesses — Define test suites as YAML, run them against any model, and get structured scores with statistical significance testing
  • Built-In Metrics — Accuracy, relevance, coherence, faithfulness, toxicity, and latency measured out of the box
  • Custom Metrics — Define your own scoring functions (Python callables) and plug them into the evaluation pipeline
  • Human Feedback Collection — Web-based annotation interface for side-by-side comparisons, Likert scales, and free-text feedback
  • Regression Testing — Compare current model outputs against a golden baseline and flag any score drops exceeding your threshold
  • Model Comparison — Run the same eval suite across multiple models/prompts and generate comparison reports with confidence intervals
  • Quality Monitoring — Continuous evaluation on production traffic with dashboards and alerting on quality degradation
  • Reproducible Runs — Every evaluation run is versioned with the exact prompt, model, parameters, and dataset hash
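The "statistical significance testing" and "confidence intervals" mentioned above can be illustrated with a percentile bootstrap over per-item metric scores. This is a minimal sketch of the general technique, not the library's internal implementation:

```python
import random

def bootstrap_ci(scores, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of per-item scores."""
    rng = random.Random(seed)  # fixed seed for reproducible runs
    means = sorted(
        sum(rng.choices(scores, k=len(scores))) / len(scores)
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

scores = [0.82, 0.91, 0.77, 0.95, 0.88, 0.85, 0.90, 0.79]
low, high = bootstrap_ci(scores)
print(f"mean={sum(scores)/len(scores):.3f}, 95% CI=({low:.3f}, {high:.3f})")
```

A wide interval is a signal that your eval dataset is too small to distinguish two models reliably.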

Quick Start

from llm_eval import EvalSuite, metrics, Runner

# 1. Define evaluation suite
suite = EvalSuite(
    name="customer_support_v2",
    dataset="eval_data/support_questions.jsonl",
    metrics=[
        metrics.Relevance(model="gpt-4o-mini"),     # LLM-as-judge
        metrics.Faithfulness(sources_key="context"),  # Grounded in context?
        metrics.Coherence(),                          # Well-structured output?
        metrics.Toxicity(threshold=0.1),              # Safe output?
        metrics.Latency(max_p95_ms=2000),             # Performance SLA
    ],
)

# 2. Run evaluation
runner = Runner(model="gpt-4o", temperature=0)
results = runner.evaluate(suite)

# 3. View results
print(results.summary())
# ┌─────────────┬────────┬────────┬────────┐
# │ Metric      │ Mean   │ P5     │ P95    │
# ├─────────────┼────────┼────────┼────────┤
# │ Relevance   │ 0.87   │ 0.72   │ 0.96   │
# │ Faithfulness│ 0.91   │ 0.80   │ 0.98   │
# │ Coherence   │ 0.85   │ 0.68   │ 0.95   │
# │ Toxicity    │ 0.02   │ 0.00   │ 0.08   │
# │ Latency (ms)│ 1240   │ 890    │ 1850   │
# └─────────────┴────────┴────────┴────────┘
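The quick start above points at `eval_data/support_questions.jsonl`. A JSONL eval dataset is one JSON object per line; the field names below (`input`, `expected`, `context`) are an assumed shape inferred from the metric arguments (`sources_key="context"`), not a documented schema:

```python
import json

# Hypothetical record shape — field names are assumptions, not the
# framework's documented schema.
records = [
    {
        "input": "How do I reset my password?",
        "expected": "Use the 'Forgot password' link on the login page.",
        "context": "Password resets are self-service via the login page.",
    },
]

# JSONL: one standalone JSON object per line.
with open("support_questions.jsonl", "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

with open("support_questions.jsonl") as f:
    loaded = [json.loads(line) for line in f]
print(loaded[0]["input"])
```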

Architecture

┌──────────────────────────────────────────────┐
│              Eval Suite Definition            │
│  Dataset + Metrics + Model Config + Baseline  │
└───────────────────┬──────────────────────────┘
                    ▼
┌──────────────────────────────────────────────┐
│                 Runner                        │
│  For each (input, expected) in dataset:       │
│    1. Generate output from model              │
│    2. Score with each metric                  │
│    3. Compare against baseline (if set)       │
└───────────────────┬──────────────────────────┘
                    ▼
┌──────────────────────────────────────────────┐
│              Results Store                    │
│  Scores + Metadata + Diffs + Run ID          │
│                                              │
│  ┌────────────┐  ┌───────────┐  ┌─────────┐ │
│  │ Dashboard  │  │ CI Report │  │ Alerts  │ │
│  └────────────┘  └───────────┘  └─────────┘ │
└──────────────────────────────────────────────┘
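The Runner stage in the diagram can be sketched as a plain loop: generate, score with each metric, diff against the baseline. The model call and metrics here are stubs, not the framework's actual internals:

```python
# Minimal sketch of the Runner loop: generate → score → compare.
def run_suite(dataset, generate, metric_fns, baseline=None):
    results = []
    for example in dataset:
        output = generate(example["input"])  # 1. generate output from model
        scores = {name: fn(example, output)  # 2. score with each metric
                  for name, fn in metric_fns.items()}
        diffs = (  # 3. compare against baseline (if set)
            {name: scores[name] - baseline[name] for name in scores}
            if baseline else None
        )
        results.append({"output": output, "scores": scores, "diffs": diffs})
    return results

dataset = [{"input": "hi", "expected": "hello"}]
metric_fns = {"exact": lambda ex, out: 1.0 if out == ex["expected"] else 0.0}
runs = run_suite(dataset, generate=lambda x: "hello",
                 metric_fns=metric_fns, baseline={"exact": 0.9})
print(runs[0]["scores"], runs[0]["diffs"])
```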

Usage Examples

Custom Metrics

from llm_eval.metrics import Metric, MetricResult

class BrandVoiceScore(Metric):
    """Check if output matches brand tone guidelines."""

    name = "brand_voice"

    def __init__(self, guidelines: str):
        self.guidelines = guidelines

    def score(self, input_text: str, output_text: str, **kwargs) -> MetricResult:
        # Use LLM-as-judge to score brand voice adherence
        prompt = f"""Rate how well this response matches our brand voice guidelines.
        Guidelines: {self.guidelines}
        Response: {output_text}
        Score from 0.0 to 1.0:"""

        score = self._llm_judge(prompt)
        return MetricResult(score=score, explanation=f"Brand voice: {score:.2f}")

suite = EvalSuite(
    name="brand_check",
    metrics=[BrandVoiceScore(guidelines="Friendly, concise, no jargon.")],
)
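The `BrandVoiceScore` example calls `self._llm_judge(prompt)` without showing it. One piece a judge helper plausibly needs, regardless of which model API sits behind it, is parsing the judge's free-text reply into a clamped float; the model call itself is out of scope here:

```python
import re

# Hypothetical helper: extract the first number from a judge model's
# reply and clamp it to [0, 1]. Judge models often wrap the score in
# prose ("Score: 0.85"), so strict float() parsing is too brittle.
def parse_judge_score(reply: str) -> float:
    match = re.search(r"\d+(?:\.\d+)?", reply)
    if match is None:
        raise ValueError(f"No numeric score in judge reply: {reply!r}")
    return min(max(float(match.group()), 0.0), 1.0)

print(parse_judge_score("Score: 0.85"))        # 0.85
print(parse_judge_score("I'd rate this 1.2"))  # clamped to 1.0
```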

Regression Testing in CI/CD

from llm_eval import RegressionTest

test = RegressionTest(
    suite=suite,
    baseline_run="runs/baseline_2025_03_15",
    max_regression={
        "relevance": 0.05,     # Allow max 5% drop
        "faithfulness": 0.03,  # Allow max 3% drop
        "toxicity": 0.01,      # Almost zero tolerance
    },
)

result = test.run(model="gpt-4o")
if result.has_regressions:
    print("REGRESSIONS DETECTED:")
    for reg in result.regressions:
        print(f"  {reg.metric}: {reg.baseline:.3f} → {reg.current:.3f} ({reg.delta:+.3f})")
    exit(1)  # Fail CI pipeline
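The threshold logic behind a check like the one above can be sketched in a few lines. Note the direction matters: for toxicity, higher is worse, so a regression is an increase rather than a drop. This is an illustrative reimplementation, not the library's code:

```python
# Flag any metric whose delta vs. baseline exceeds its allowed regression.
def find_regressions(baseline, current, max_regression,
                     higher_is_worse=("toxicity",)):
    regressions = []
    for metric, limit in max_regression.items():
        delta = current[metric] - baseline[metric]
        # For "higher is worse" metrics an increase is bad; otherwise a drop is.
        bad = delta > limit if metric in higher_is_worse else -delta > limit
        if bad:
            regressions.append((metric, baseline[metric], current[metric], delta))
    return regressions

baseline = {"relevance": 0.87, "faithfulness": 0.91, "toxicity": 0.02}
current = {"relevance": 0.80, "faithfulness": 0.90, "toxicity": 0.02}
regs = find_regressions(baseline, current,
                        {"relevance": 0.05, "faithfulness": 0.03, "toxicity": 0.01})
print(regs)  # relevance dropped 0.07, beyond the 0.05 allowance
```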

Model Comparison

from llm_eval import ModelComparison

comparison = ModelComparison(
    suite=suite,
    models=[
        {"name": "gpt-4o", "temperature": 0},
        {"name": "gpt-4o-mini", "temperature": 0},
        {"name": "claude-sonnet-4-20250514", "temperature": 0},
    ],
)

report = comparison.run()
print(report.ranking())        # Models ranked by aggregate score
print(report.cost_efficiency()) # Score per dollar
report.export_html("reports/model_comparison.html")
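A score-per-dollar ranking like `report.cost_efficiency()` reduces to dividing aggregate score by eval cost. The prices and scores below are made up for illustration:

```python
# Rank models by aggregate score per dollar of evaluation cost.
def cost_efficiency(results):
    return sorted(
        ((r["model"], r["score"] / r["cost_usd"]) for r in results),
        key=lambda pair: pair[1],
        reverse=True,
    )

results = [
    {"model": "gpt-4o",      "score": 0.89, "cost_usd": 1.80},
    {"model": "gpt-4o-mini", "score": 0.84, "cost_usd": 0.12},
]
for model, eff in cost_efficiency(results):
    print(f"{model}: {eff:.2f} score/$")
```

A cheaper model that scores within a few points of the frontier model often dominates on this axis, which is exactly the trade-off the comparison report is meant to surface.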

Configuration

# eval_config.yaml
suites:
  customer_support:
    dataset: "eval_data/support_questions.jsonl"
    sample_size: 200             # Evaluate on random subset (null = all)
    metrics:
      - name: "relevance"
        judge_model: "gpt-4o-mini"
      - name: "faithfulness"
        sources_key: "context"
      - name: "coherence"
      - name: "toxicity"
        threshold: 0.1
      - name: "latency"
        max_p95_ms: 2000

runner:
  model: "gpt-4o"
  temperature: 0                 # Deterministic for reproducibility
  max_tokens: 1000
  concurrent_requests: 10
  retry_on_failure: true

regression:
  baseline_dir: "baselines/"
  max_regression:
    relevance: 0.05
    faithfulness: 0.03
    coherence: 0.05
    toxicity: 0.01

monitoring:
  enabled: true
  sample_rate: 0.05              # Evaluate 5% of production traffic
  alert_on_degradation: true
  alert_threshold: 0.1           # Alert if metric drops 10% from baseline
  dashboard_port: 8081

storage:
  backend: "sqlite"              # sqlite | postgres
  results_dir: "eval_results/"
  retention_days: 180
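The `monitoring.sample_rate: 0.05` setting means roughly 5% of production requests get evaluated. One common way to implement such sampling (an illustrative technique, not necessarily what this framework does) is hashing the request ID, which gives a deterministic decision per request instead of a fresh coin flip:

```python
import hashlib

# Deterministic ~5% sampling: the same request ID always gets the same
# decision, and SHA-256 spreads IDs uniformly over [0, 1).
def should_evaluate(request_id: str, sample_rate: float = 0.05) -> bool:
    digest = hashlib.sha256(request_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate

sampled = sum(should_evaluate(f"req-{i}") for i in range(10_000))
print(f"sampled {sampled} of 10000 requests (~{sampled/100:.1f}%)")
```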

Best Practices

  1. Use LLM-as-judge for subjective metrics — Relevance, coherence, and tone are hard to measure with rules. Use a capable model as the judge.
  2. Set baselines early — Run your first eval suite before making changes. You can't detect regression without a baseline.
  3. Evaluate on diverse inputs — Ensure your dataset covers edge cases, long inputs, multi-language queries, and adversarial prompts.
  4. Separate metric concerns — A high relevance score with low faithfulness means the model is making up plausible-sounding answers.
  5. Run evals in CI — Every prompt change, model swap, or system prompt edit should trigger the regression suite.
  6. Monitor production quality — Eval datasets get stale. Sample real production traffic for continuous evaluation.

Troubleshooting

| Problem | Cause | Fix |
| --- | --- | --- |
| LLM-as-judge scores are inconsistent | Judge model temperature > 0 | Set `temperature: 0` for the judge model; run each judgment 3x and average |
| Eval suite takes too long | Dataset too large or concurrency too low | Use `sample_size` to evaluate a subset and increase `concurrent_requests` |
| Regression test fails on every run | Baseline is stale or thresholds too tight | Update the baseline with `test.update_baseline()` and relax thresholds |
| Toxicity scores are always 0 | Test data doesn't include adversarial inputs | Add red-team prompts to your eval dataset to stress-test safety |
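The "run each judgment 3x and average" fix from the table reduces to a small wrapper; `judge_fn` below stands in for a call to the judge model at temperature 0:

```python
import statistics

# Average several judge runs to smooth out per-call variance.
def averaged_judgment(judge_fn, prompt, n_runs=3):
    scores = [judge_fn(prompt) for _ in range(n_runs)]
    return statistics.mean(scores)

replies = iter([0.8, 0.9, 0.85])  # stubbed judge responses
score = averaged_judgment(lambda p: next(replies), "rate this response")
print(round(score, 3))  # 0.85
```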

This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete [LLM Evaluation Framework] with all files, templates, and documentation for $49.

Get the Full Kit →

Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.

Get the Complete Bundle →
