# LLM Evaluation Framework
You can't improve what you can't measure. This framework gives you automated, repeatable evaluation harnesses for LLM outputs — with built-in metrics for accuracy, relevance, coherence, and safety, plus custom metric support. Run evaluations in CI/CD, track quality over time, compare models head-to-head, and catch regressions before they reach production.
## Key Features

- **Automated Eval Harnesses** — Define test suites as YAML, run them against any model, and get structured scores with statistical significance testing
- **Built-In Metrics** — Accuracy, relevance, coherence, faithfulness, toxicity, and latency measured out of the box
- **Custom Metrics** — Define your own scoring functions (Python callables) and plug them into the evaluation pipeline
- **Human Feedback Collection** — Web-based annotation interface for side-by-side comparisons, Likert scales, and free-text feedback
- **Regression Testing** — Compare current model outputs against a golden baseline and flag any score drops exceeding your threshold
- **Model Comparison** — Run the same eval suite across multiple models/prompts and generate comparison reports with confidence intervals
- **Quality Monitoring** — Continuous evaluation on production traffic with dashboards and alerting on quality degradation
- **Reproducible Runs** — Every evaluation run is versioned with the exact prompt, model, parameters, and dataset hash
## Quick Start

```python
from llm_eval import EvalSuite, metrics, Runner

# 1. Define evaluation suite
suite = EvalSuite(
    name="customer_support_v2",
    dataset="eval_data/support_questions.jsonl",
    metrics=[
        metrics.Relevance(model="gpt-4o-mini"),       # LLM-as-judge
        metrics.Faithfulness(sources_key="context"),  # Grounded in context?
        metrics.Coherence(),                          # Well-structured output?
        metrics.Toxicity(threshold=0.1),              # Safe output?
        metrics.Latency(max_p95_ms=2000),             # Performance SLA
    ],
)

# 2. Run evaluation
runner = Runner(model="gpt-4o", temperature=0)
results = runner.evaluate(suite)

# 3. View results
print(results.summary())
# ┌──────────────┬──────┬──────┬──────┐
# │ Metric       │ Mean │ P5   │ P95  │
# ├──────────────┼──────┼──────┼──────┤
# │ Relevance    │ 0.87 │ 0.72 │ 0.96 │
# │ Faithfulness │ 0.91 │ 0.80 │ 0.98 │
# │ Coherence    │ 0.85 │ 0.68 │ 0.95 │
# │ Toxicity     │ 0.02 │ 0.00 │ 0.08 │
# │ Latency (ms) │ 1240 │ 890  │ 1850 │
# └──────────────┴──────┴──────┴──────┘
```
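The dataset referenced above is newline-delimited JSON. The exact schema isn't documented here, so the record below is an assumption: `input`, `expected`, and `context` fields, with `context` matching the suite's `sources_key="context"`.

```python
import json

# Hypothetical record layout for support_questions.jsonl — the field
# names are an assumption inferred from the suite definition above.
record = {
    "input": "How do I reset my password?",
    "expected": "Go to Settings > Security and choose 'Reset password'.",
    "context": "Password resets are self-service under Settings > Security.",
}

# One JSON object per line; each line becomes one eval case
line = json.dumps(record)
case = json.loads(line)
print(case["input"])
```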
## Architecture

```
┌──────────────────────────────────────────────┐
│            Eval Suite Definition             │
│  Dataset + Metrics + Model Config + Baseline │
└───────────────────┬──────────────────────────┘
                    ▼
┌──────────────────────────────────────────────┐
│                   Runner                     │
│  For each (input, expected) in dataset:      │
│    1. Generate output from model             │
│    2. Score with each metric                 │
│    3. Compare against baseline (if set)      │
└───────────────────┬──────────────────────────┘
                    ▼
┌──────────────────────────────────────────────┐
│                Results Store                 │
│  Scores + Metadata + Diffs + Run ID          │
│                                              │
│  ┌────────────┐  ┌───────────┐  ┌─────────┐  │
│  │ Dashboard  │  │ CI Report │  │ Alerts  │  │
│  └────────────┘  └───────────┘  └─────────┘  │
└──────────────────────────────────────────────┘
```
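The Runner stage above boils down to a plain scoring loop. This sketch uses illustrative stand-ins (`run_suite`, `Case`, the metric-callable signature) rather than the library's actual API:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class Case:
    input: str
    expected: str

def run_suite(
    cases: List[Case],
    generate: Callable[[str], str],
    metric_fns: Dict[str, Callable[[str, str, str], float]],
    baseline: Optional[Dict[str, float]] = None,
):
    """Score every case with every metric, then diff means against a baseline."""
    rows = []
    for case in cases:
        output = generate(case.input)                      # 1. generate
        rows.append({name: fn(case.input, output, case.expected)
                     for name, fn in metric_fns.items()})  # 2. score
    means = {name: sum(r[name] for r in rows) / len(rows) for name in metric_fns}
    diffs = ({name: means[name] - baseline[name] for name in baseline}
             if baseline else {})                          # 3. compare
    return means, diffs

# Toy run: an echo "model" and an exact-match metric
means, diffs = run_suite(
    [Case("hi", "hi"), Case("bye", "nope")],
    generate=lambda text: text,
    metric_fns={"exact_match": lambda i, out, exp: 1.0 if out == exp else 0.0},
    baseline={"exact_match": 0.9},
)
print(means["exact_match"])  # 0.5
```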
## Usage Examples

### Custom Metrics

```python
from llm_eval import EvalSuite
from llm_eval.metrics import Metric, MetricResult

class BrandVoiceScore(Metric):
    """Check if output matches brand tone guidelines."""

    name = "brand_voice"

    def __init__(self, guidelines: str):
        self.guidelines = guidelines

    def score(self, input_text: str, output_text: str, **kwargs) -> MetricResult:
        # Use LLM-as-judge to score brand voice adherence
        prompt = f"""Rate how well this response matches our brand voice guidelines.

Guidelines: {self.guidelines}

Response: {output_text}

Score from 0.0 to 1.0:"""
        score = self._llm_judge(prompt)
        return MetricResult(score=score, explanation=f"Brand voice: {score:.2f}")

suite = EvalSuite(
    name="brand_check",
    metrics=[BrandVoiceScore(guidelines="Friendly, concise, no jargon.")],
)
```
### Regression Testing in CI/CD

```python
import sys

from llm_eval import RegressionTest

test = RegressionTest(
    suite=suite,
    baseline_run="runs/baseline_2025_03_15",
    max_regression={
        "relevance": 0.05,     # Allow max 5% drop
        "faithfulness": 0.03,  # Allow max 3% drop
        "toxicity": 0.01,      # Almost zero tolerance
    },
)

result = test.run(model="gpt-4o")
if result.has_regressions:
    print("REGRESSIONS DETECTED:")
    for reg in result.regressions:
        print(f"  {reg.metric}: {reg.baseline:.3f} → {reg.current:.3f} ({reg.delta:+.3f})")
    sys.exit(1)  # Non-zero exit fails the CI pipeline
```
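The threshold check behind `RegressionTest` amounts to simple delta arithmetic. This standalone sketch uses invented field names and assumes higher-is-better scores; a lower-is-better metric like toxicity would flip the comparison:

```python
def find_regressions(baseline, current, max_regression):
    """Flag metrics whose score dropped below baseline by more than allowed.

    Assumes higher-is-better metrics; for a lower-is-better metric
    (e.g. toxicity) the sign of the comparison would be inverted.
    """
    regressions = []
    for metric, allowance in max_regression.items():
        delta = current[metric] - baseline[metric]
        if delta < -allowance:
            regressions.append({
                "metric": metric,
                "baseline": baseline[metric],
                "current": current[metric],
                "delta": delta,
            })
    return regressions

# Relevance fell 8 points against a 5-point allowance; faithfulness is fine
regs = find_regressions(
    baseline={"relevance": 0.87, "faithfulness": 0.91},
    current={"relevance": 0.79, "faithfulness": 0.90},
    max_regression={"relevance": 0.05, "faithfulness": 0.03},
)
print([r["metric"] for r in regs])  # ['relevance']
```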
### Model Comparison

```python
from llm_eval import ModelComparison

comparison = ModelComparison(
    suite=suite,
    models=[
        {"name": "gpt-4o", "temperature": 0},
        {"name": "gpt-4o-mini", "temperature": 0},
        {"name": "claude-sonnet-4-20250514", "temperature": 0},
    ],
)

report = comparison.run()
print(report.ranking())          # Models ranked by aggregate score
print(report.cost_efficiency())  # Score per dollar
report.export_html("reports/model_comparison.html")
```
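A score-per-dollar ranking like `cost_efficiency()` presumably divides aggregate score by eval spend. A standalone sketch, with an invented `results` shape and made-up numbers:

```python
def cost_efficiency(results):
    """Rank models by mean score per dollar of eval spend.

    `results` maps model name -> {"score": ..., "cost_usd": ...};
    both field names are illustrative, not the library's report schema.
    """
    ranked = [(model, r["score"] / r["cost_usd"]) for model, r in results.items()]
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)

# A cheaper model can win on efficiency despite a lower raw score
ranking = cost_efficiency({
    "gpt-4o":      {"score": 0.88, "cost_usd": 4.00},
    "gpt-4o-mini": {"score": 0.83, "cost_usd": 0.40},
})
print(ranking[0][0])  # gpt-4o-mini
```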
## Configuration

```yaml
# eval_config.yaml
suites:
  customer_support:
    dataset: "eval_data/support_questions.jsonl"
    sample_size: 200          # Evaluate on a random subset (null = all)
    metrics:
      - name: "relevance"
        judge_model: "gpt-4o-mini"
      - name: "faithfulness"
        sources_key: "context"
      - name: "coherence"
      - name: "toxicity"
        threshold: 0.1
      - name: "latency"
        max_p95_ms: 2000

runner:
  model: "gpt-4o"
  temperature: 0              # Deterministic for reproducibility
  max_tokens: 1000
  concurrent_requests: 10
  retry_on_failure: true

regression:
  baseline_dir: "baselines/"
  max_regression:
    relevance: 0.05
    faithfulness: 0.03
    coherence: 0.05
    toxicity: 0.01

monitoring:
  enabled: true
  sample_rate: 0.05           # Evaluate 5% of production traffic
  alert_on_degradation: true
  alert_threshold: 0.1        # Alert if a metric drops 10% from baseline
  dashboard_port: 8081

storage:
  backend: "sqlite"           # sqlite | postgres
  results_dir: "eval_results/"
  retention_days: 180
```
## Best Practices

- **Use LLM-as-judge for subjective metrics** — Relevance, coherence, and tone are hard to measure with rules. Use a capable model as the judge.
- **Set baselines early** — Run your first eval suite before making changes. You can't detect regressions without a baseline.
- **Evaluate on diverse inputs** — Ensure your dataset covers edge cases, long inputs, multi-language queries, and adversarial prompts.
- **Separate metric concerns** — A high relevance score with low faithfulness means the model is making up plausible-sounding answers.
- **Run evals in CI** — Every prompt change, model swap, or system prompt edit should trigger the regression suite.
- **Monitor production quality** — Eval datasets get stale. Sample real production traffic for continuous evaluation.
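The last practice, together with the `sample_rate: 0.05` setting in the configuration above, reduces to a per-request coin flip. A minimal sketch:

```python
import random

def should_evaluate(sample_rate: float, rng: random.Random) -> bool:
    """Decide per request whether to route this production response to eval."""
    return rng.random() < sample_rate

# With sample_rate=0.05, roughly 5% of traffic gets scored
rng = random.Random(42)  # seeded here only to make the demo reproducible
sampled = sum(should_evaluate(0.05, rng) for _ in range(10_000))
print(sampled)
```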
## Troubleshooting

| Problem | Cause | Fix |
|---|---|---|
| LLM-as-judge scores are inconsistent | Judge model temperature > 0 | Set `temperature: 0` for the judge model; run each judgment 3x and average |
| Eval suite takes too long | Dataset too large or concurrency too low | Use `sample_size` to evaluate a subset and increase `concurrent_requests` |
| Regression test fails on every run | Baseline is stale or threshold too tight | Update the baseline with `test.update_baseline()` and relax thresholds |
| Toxicity scores are always 0 | Test data doesn't include adversarial inputs | Add red-team prompts to your eval dataset to stress-test safety |
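The first fix above (run each judgment several times and average) can be sketched like this; `judge` is a stand-in for any LLM-as-judge call, and the drifting stub below simulates residual nondeterminism that some providers show even at temperature 0:

```python
from statistics import mean

def stable_judge_score(judge, prompt: str, runs: int = 3) -> float:
    """Average several judge calls to damp residual nondeterminism."""
    return mean(judge(prompt) for _ in range(runs))

# Stubbed judge that drifts slightly between calls
drift = iter([0.80, 0.84, 0.82])
score = stable_judge_score(lambda p: next(drift), "Rate relevance 0..1: ...")
print(round(score, 2))  # 0.82
```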
This is 1 of 11 resources in the AI Builder Pro toolkit. Get the complete LLM Evaluation Framework with all files, templates, and documentation for $49.
Or grab the entire AI Builder Pro bundle (11 products) for $169 — save 30%.