mawlaia

How to add eval quality gates to your LLM app (like CI for AI)

Every team that ships an LLM feature eventually discovers the same problem: the model regressed and nobody noticed until users complained.

Traditional software has unit tests, integration tests, and CI gates. LLM apps have... vibes.

mawlaia-evalforge is an open-source eval runner that gives LLM output quality the same treatment as code — structured scoring, pass/fail thresholds, and CI integration.


The problem

You optimized a prompt. It got better. Three sprints later, a model update or prompt change silently degraded it. You shipped the regression.

What you actually need is this running in CI:

from openai import OpenAI
from evalforge import Runner, RougeScorer, LLMJudgeScorer, Dataset

openai_client = OpenAI()  # the client passed to LLMJudgeScorer below

dataset = Dataset.from_jsonl("eval_cases.jsonl")  # your golden test set
runner = Runner(scorers=[
    RougeScorer(threshold=0.7),
    LLMJudgeScorer(client=openai_client, threshold=0.8)
])

results = runner.run(dataset)
results.assert_pass()  # raises if any scorer below threshold — CI fails
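Each line of eval_cases.jsonl is one golden test case. The post doesn't show the exact schema, so treat the field names below (input, expected) as an assumption and check the repo's examples for the real keys:

{"input": "Summarize: Q3 revenue rose 12% while costs stayed flat.", "expected": "Revenue grew 12% in Q3 with flat costs."}
{"input": "Extract the order ID from: 'Order #A-4821 shipped today'", "expected": "A-4821"}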

Scorers

Four scorers ship out of the box:

RougeScorer — ROUGE-L F1 between expected and actual output. Fast, no API calls, good for structured outputs.

ExactScorer — exact string match, with optional normalization (lowercase, strip whitespace).

RegexScorer — checks that output matches (or doesn't match) a pattern. Useful for format enforcement.

LLMJudgeScorer — sends (input, expected, actual) to an LLM and asks for a 0–1 quality score with reasoning. Slow but handles open-ended outputs that ROUGE can't score.

from evalforge import Runner, RougeScorer, LLMJudgeScorer

runner = Runner(scorers=[
    RougeScorer(threshold=0.6),          # fast gate
    LLMJudgeScorer(client=openai_client, threshold=0.75)  # semantic gate
])
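The two deterministic scorers slot into the same Runner. A sketch of combining ExactScorer and RegexScorer, assuming parameter names like normalize and pattern that the post above doesn't spell out:

from evalforge import Runner, ExactScorer, RegexScorer

runner = Runner(scorers=[
    ExactScorer(normalize=True),          # lowercase + strip whitespace before comparing
    RegexScorer(pattern=r"^\{.*\}\s*$"),  # e.g. enforce JSON-object-shaped output
])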

TypeScript

import { Runner, LLMJudgeScorer } from 'mawlaia-evalforge';

// assumes `openai` (your OpenAI client) and `dataset` (your loaded eval cases) are in scope
const runner = new Runner({ scorers: [new LLMJudgeScorer({ client: openai })] });
const results = await runner.run(dataset);
results.assertPass(); // throws if any scorer falls below its threshold

What it doesn't do

  • No hosted eval dashboard — coming in Phase 2. Right now it's a library, not a service.
  • No dataset management UI — datasets are JSONL files, keep them in your repo.
  • No automatic dataset generation — you write the golden test cases.

The scope is deliberate: a library you drop into CI in 10 minutes, not a platform that takes a week to set up.
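In practice, "drop into CI" can mean nothing more than wrapping the runner in a test your pipeline already executes. A minimal pytest sketch (the file name and the OpenAI client wiring are illustrative, not part of the library):

# test_llm_quality.py -- collected by pytest in CI like any other test
from openai import OpenAI
from evalforge import Runner, RougeScorer, LLMJudgeScorer, Dataset

def test_llm_outputs_meet_quality_bar():
    dataset = Dataset.from_jsonl("eval_cases.jsonl")
    runner = Runner(scorers=[
        RougeScorer(threshold=0.7),
        LLMJudgeScorer(client=OpenAI(), threshold=0.8),
    ])
    results = runner.run(dataset)
    results.assert_pass()  # raises on any threshold miss, so the CI job fails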


Installation

pip install mawlaia-evalforge
npm install mawlaia-evalforge

Source, tests (35 Python, 37 TypeScript), MIT license: github.com/Mawlaia-Labs/evalforge


The hosted version with dataset management, eval history, and team dashboards is coming Q3 2026. Early access: dev@mawlaia.com.
