Every team that ships an LLM feature eventually discovers the same problem: the model regressed and nobody noticed until users complained.
Traditional software has unit tests, integration tests, and CI gates. LLM apps have... vibes.
mawlaia-evalforge is an open-source eval runner that gives LLM output quality the same treatment as code — structured scoring, pass/fail thresholds, and CI integration.
## The problem
You optimized a prompt. It got better. Three sprints later, a model update or prompt change silently degraded it. You shipped the regression.
What you actually need is this running in CI:
```python
from openai import OpenAI
from evalforge import Runner, RougeScorer, LLMJudgeScorer, Dataset

openai_client = OpenAI()  # used by the LLM judge scorer

dataset = Dataset.from_jsonl("eval_cases.jsonl")  # your golden test set

runner = Runner(scorers=[
    RougeScorer(threshold=0.7),
    LLMJudgeScorer(client=openai_client, threshold=0.8),
])

results = runner.run(dataset)
results.assert_pass()  # raises if any scorer is below its threshold, so CI fails
```
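Each line of `eval_cases.jsonl` is one golden case. The field names below (`input`, `expected`) are an assumption for illustration, not a documented schema; a minimal sketch of generating the file:

```python
# Sketch of a golden test set. Field names ("input", "expected") are assumptions,
# not a documented evalforge schema; adjust to whatever Dataset.from_jsonl expects.
import json

cases = [
    {"input": "Summarize the ticket: 'App crashes on login after the 2.3 update.'",
     "expected": "Users report login crashes introduced in version 2.3."},
    {"input": "Extract the order ID from: 'Order #48213 was delayed.'",
     "expected": "48213"},
]

with open("eval_cases.jsonl", "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")
```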
## Scorers
Four scorers ship out of the box:
- RougeScorer — ROUGE-L F1 between expected and actual output. Fast, no API calls, good for structured outputs.
- ExactScorer — exact string match, with optional normalization (lowercase, strip whitespace).
- RegexScorer — checks that output matches (or doesn't match) a pattern. Useful for format enforcement.
- LLMJudgeScorer — sends (input, expected, actual) to an LLM and asks for a 0–1 quality score with reasoning. Slow, but handles open-ended outputs that ROUGE can't score.
```python
from evalforge import Runner, RougeScorer, LLMJudgeScorer

# openai_client as constructed in the first example
runner = Runner(scorers=[
    RougeScorer(threshold=0.6),                             # fast gate
    LLMJudgeScorer(client=openai_client, threshold=0.75),   # semantic gate
])
```
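ExactScorer and RegexScorer slot into the same list. The constructor arguments below (`normalize`, `pattern`) are guesses based on the descriptions above, not confirmed API, so treat this as a sketch:

```python
from evalforge import Runner, ExactScorer, RegexScorer

# Argument names (normalize, pattern) are assumptions, not confirmed evalforge API.
runner = Runner(scorers=[
    ExactScorer(normalize=True),        # lowercase + strip whitespace before comparing
    RegexScorer(pattern=r"^\{.*\}$"),   # e.g. enforce a JSON-object-shaped output
])
```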
## TypeScript
```typescript
import { Runner, LLMJudgeScorer } from 'mawlaia-evalforge';

// `openai` is an OpenAI client instance; `dataset` is loaded as in the Python example
const runner = new Runner({ scorers: [new LLMJudgeScorer({ client: openai })] });
const results = await runner.run(dataset);
results.assertPass();
```
## What it doesn't do
- No hosted eval dashboard — coming in Phase 2. Right now it's a library, not a service.
- No dataset management UI — datasets are JSONL files, keep them in your repo.
- No automatic dataset generation — you write the golden test cases.
The scope is deliberate: a library you drop into CI in 10 minutes, not a platform that takes a week to set up.
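In practice, the quickest wiring is a plain test that whatever test runner your CI already invokes will pick up. A sketch using pytest (the file name is hypothetical):

```python
# tests/test_llm_evals.py -- hypothetical file name; any test runner your CI
# already runs works, because assert_pass() raises on a failing scorer.
from evalforge import Runner, RougeScorer, Dataset


def test_golden_set():
    dataset = Dataset.from_jsonl("eval_cases.jsonl")
    runner = Runner(scorers=[RougeScorer(threshold=0.7)])
    runner.run(dataset).assert_pass()  # fails the pytest run, which fails CI
```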
## Installation
```bash
pip install mawlaia-evalforge
npm install mawlaia-evalforge
```
Source, the test suites (35 Python tests, 37 TypeScript tests), and the MIT license are at github.com/Mawlaia-Labs/evalforge.
The hosted version with dataset management, eval history, and team dashboards is coming Q3 2026. Early access: dev@mawlaia.com.