mawlaia

How to add eval quality gates to your LLM app (like CI for AI)

Every team that ships an LLM feature eventually discovers the same problem: the model regressed and nobody noticed until users complained.

Traditional software has unit tests, integration tests, and CI gates. LLM apps have... vibes.

mawlaia-evalforge is an open-source eval runner that gives LLM output quality the same treatment as code — structured scoring, pass/fail thresholds, and CI integration.


The problem

You optimized a prompt. It got better. Three sprints later, a model update or prompt change silently degraded it. You shipped the regression.

What you actually need is this running in CI:

from openai import OpenAI
from evalforge import Runner, RougeScorer, LLMJudgeScorer, Dataset

openai_client = OpenAI()  # the client passed to LLMJudgeScorer below

dataset = Dataset.from_jsonl("eval_cases.jsonl")  # your golden test set
runner = Runner(scorers=[
    RougeScorer(threshold=0.7),
    LLMJudgeScorer(client=openai_client, threshold=0.8)
])

results = runner.run(dataset)
results.assert_pass()  # raises if any scorer below threshold — CI fails
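Each line of eval_cases.jsonl is one golden test case. The post doesn't show the exact schema, so treat the field names below (input, expected) as an assumption and check the repo's examples for the real keys:

{"input": "Summarize: Q3 revenue rose 12% while costs stayed flat.", "expected": "Revenue grew 12% in Q3 with flat costs."}
{"input": "Extract the order ID from: 'Order #A-4821 shipped today'", "expected": "A-4821"}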

Scorers

Four scorers ship out of the box:

RougeScorer — ROUGE-L F1 between expected and actual output. Fast, no API calls, good for structured outputs.

ExactScorer — exact string match, with optional normalization (lowercase, strip whitespace).

RegexScorer — checks that output matches (or doesn't match) a pattern. Useful for format enforcement.

LLMJudgeScorer — sends (input, expected, actual) to an LLM and asks for a 0–1 quality score with reasoning. Slow but handles open-ended outputs that ROUGE can't score.

from evalforge import Runner, RougeScorer, LLMJudgeScorer

runner = Runner(scorers=[
    RougeScorer(threshold=0.6),          # fast gate
    LLMJudgeScorer(client=openai_client, threshold=0.75)  # semantic gate
])
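The two deterministic scorers slot into the same Runner. A sketch of combining ExactScorer and RegexScorer, assuming parameter names like normalize and pattern that the post above doesn't spell out:

from evalforge import Runner, ExactScorer, RegexScorer

runner = Runner(scorers=[
    ExactScorer(normalize=True),          # lowercase + strip whitespace before comparing
    RegexScorer(pattern=r"^\{.*\}\s*$"),  # e.g. enforce JSON-object-shaped output
])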

TypeScript

import { Runner, LLMJudgeScorer } from 'mawlaia-evalforge';

// assumes `openai` (your OpenAI client) and `dataset` (your loaded eval cases) are in scope
const runner = new Runner({ scorers: [new LLMJudgeScorer({ client: openai })] });
const results = await runner.run(dataset);
results.assertPass(); // throws if any scorer falls below its threshold

What it doesn't do

  • No hosted eval dashboard — coming in Phase 2. Right now it's a library, not a service.
  • No dataset management UI — datasets are JSONL files, keep them in your repo.
  • No automatic dataset generation — you write the golden test cases.

The scope is deliberate: a library you drop into CI in 10 minutes, not a platform that takes a week to set up.
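In practice, "drop into CI" can mean nothing more than wrapping the runner in a test your pipeline already executes. A minimal pytest sketch (the file name and the OpenAI client wiring are illustrative, not part of the library):

# test_llm_quality.py -- collected by pytest in CI like any other test
from openai import OpenAI
from evalforge import Runner, RougeScorer, LLMJudgeScorer, Dataset

def test_llm_outputs_meet_quality_bar():
    dataset = Dataset.from_jsonl("eval_cases.jsonl")
    runner = Runner(scorers=[
        RougeScorer(threshold=0.7),
        LLMJudgeScorer(client=OpenAI(), threshold=0.8),
    ])
    results = runner.run(dataset)
    results.assert_pass()  # raises on any threshold miss, so the CI job fails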


Installation

pip install mawlaia-evalforge
npm install mawlaia-evalforge

Source, tests (35 Python, 37 TypeScript), MIT license: github.com/Mawlaia-Labs/evalforge


The hosted version with dataset management, eval history, and team dashboards is coming Q3 2026. Early access: dev@mawlaia.com.
