1. The moment it became a problem
Six months ago I was shipping a summarization agent for a client. Every document came out of the model as a 2,000-word wall of text. The spec said 500 words max. I had two options: prompt-engineer my way out of it with increasingly desperate instructions, or catch the bad outputs and retry.
I added a simple length check. It worked. Then I needed to check that certain keywords were present. Then that some phrases were absent. Then I had four separate if statements scattered across the pipeline checking different output properties. Then a colleague changed one. Then we had a bug in production.
The real fix was obvious: put the rules in one place, score each one independently, combine them with weights. That is what prompt-eval-rubric is.
No LLM required. No API call to grade the output of the first API call. Just Python logic that runs in microseconds and returns a score between 0.0 and 1.0.
I see people reach for judge models when the quality criteria are actually rule-based. If the spec says "must be under 500 words and must mention the product name", that is not a semantic judgment. A regex and a length check get there in under a millisecond. Save the judge model call for when you genuinely need semantic evaluation.
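As a rough illustration of that point, a spec like "under 500 words and must mention the product name" reduces to a word count and one regex. This is a standard-library sketch, not prompt-eval-rubric code; the function name meets_spec and the product name "Acme" are made up for the example.

```python
import re

def meets_spec(text: str, product_name: str, max_words: int = 500) -> bool:
    """Return True if text is within the word limit and mentions the product."""
    word_count = len(text.split())
    # Whole-word, case-insensitive search for the product name.
    mentions_product = re.search(
        rf"\b{re.escape(product_name)}\b", text, flags=re.IGNORECASE
    ) is not None
    return word_count <= max_words and mentions_product

print(meets_spec("Acme ships a faster sync engine this quarter.", "Acme"))  # True
print(meets_spec("word " * 2000, "Acme"))                                   # False
```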
There is also a practical cost argument. If your pipeline runs 10,000 times a day and you use a small judge model at $0.001 per call, that is $10 a day just for quality scoring. Rule-based rubrics are free to run. Reserve the LLM call budget for generation, not scoring what you generated.
2. Shape of the fix
    from prompt_eval_rubric import Rubric, LengthCriterion, KeywordCriterion, RegexCriterion

    rubric = Rubric([
        LengthCriterion(max_chars=500, weight=0.3),
        KeywordCriterion(required=["summary", "key points"], weight=0.4),
        KeywordCriterion(forbidden=["I apologize", "As an AI"], weight=0.2),
        RegexCriterion(pattern=r"\d{4}-\d{2}-\d{2}", expect_match=False, weight=0.1),
    ])

    result = rubric.score(llm_output)  # llm_output: the text returned by your model
    print(result.total)    # e.g. 0.85
    print(result.passed)   # True (above default threshold of 0.7)
    print(result.details)  # per-criterion breakdown
The Rubric class takes a list of criteria. Each criterion has a weight. The final result.total is the weighted average of all individual scores. If a criterion raises an exception internally, it scores 0.0 for that criterion and keeps going. The other criteria still run.
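To make the weighted average concrete, here is the arithmetic for one hypothetical set of per-criterion scores using the four weights above. The individual scores are invented for illustration (the length check is assumed to come back at 0.5), and the helper weighted_total is a sketch of the formula, not the library's implementation.

```python
def weighted_total(scored: list[tuple[float, float]]) -> float:
    """Weighted average of (weight, score) pairs."""
    total_weight = sum(weight for weight, _ in scored)
    if total_weight == 0:
        return 0.0
    return sum(weight * score for weight, score in scored) / total_weight

# (weight, score) for the four criteria; the scores are made-up inputs.
scored = [(0.3, 0.5), (0.4, 1.0), (0.2, 1.0), (0.1, 1.0)]
total = weighted_total(scored)
print(round(total, 2))  # 0.85
print(total >= 0.7)     # True
```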
You can configure the pass threshold per rubric:
    strict_rubric = Rubric(criteria, pass_threshold=0.9)
    lenient_rubric = Rubric(criteria, pass_threshold=0.5)
And you can define custom criteria:
    from prompt_eval_rubric import BaseCriterion

    class ReadabilityScore(BaseCriterion):
        def score(self, text: str) -> float:
            # Crude readability proxy: average sentence length.
            # Swap in Flesch-Kincaid or whatever you want.
            # Drop empty fragments so a trailing period is not counted as a sentence.
            sentences = [s for s in text.split(".") if s.strip()]
            words = text.split()
            if not sentences or not words:
                return 0.0
            avg_sentence_length = len(words) / len(sentences)
            return 1.0 if avg_sentence_length < 20 else 0.5
Plug it straight into a Rubric with a weight and it works the same way.
3. What it does NOT do
It does not call any model. There is no inference happening. If you need semantic evaluation ("is this answer accurate given the source document?"), this library is not the right tool for that job. That requires a judge model.
It does not parse structured outputs. If your LLM returns JSON and you want to validate the schema, use a JSON schema validator. This library works on plain text strings.
It does not retry. The library scores. Your pipeline decides what to do with the score. If you want automatic retry logic, wire that up yourself or use a library like llm-structured-retry.
It does not store history. Each rubric.score() call is stateless. If you want trend data over multiple outputs, collect result.total values yourself.
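Since retrying and history are both left to the caller, the wiring might look like the sketch below. Everything here is a stand-in: generate is a fake model call, score_output is a placeholder for something like rubric.score(text).total, and the function name and max_attempts parameter are assumptions for illustration.

```python
from typing import Callable

def generate_with_retry(
    generate: Callable[[int], str],
    score_output: Callable[[str], float],
    threshold: float = 0.7,
    max_attempts: int = 3,
) -> tuple[str, list[float]]:
    """Call generate until the score clears threshold or attempts run out.

    Returns the best output seen and the score of every attempt, so the
    caller can keep its own history.
    """
    history: list[float] = []
    best_text, best_score = "", -1.0
    for attempt in range(max_attempts):
        text = generate(attempt)
        score = score_output(text)
        history.append(score)
        if score > best_score:
            best_text, best_score = text, score
        if score >= threshold:
            break  # good enough, stop retrying
    return best_text, history

# Stand-ins for a model call and a rubric: the fake model gets shorter
# on each attempt, and the fake scorer only checks length.
fake_outputs = ["x" * 900, "x" * 600, "x" * 400]
best, scores = generate_with_retry(
    generate=lambda attempt: fake_outputs[attempt],
    score_output=lambda text: 1.0 if len(text) <= 500 else 0.0,
)
print(len(best), scores)  # 400 [0.0, 0.0, 1.0]
```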
4. Inside the library
The repo is at MukundaKatta/prompt-eval-rubric. There are 23 tests.
The core types are:
- BaseCriterion: abstract base with a score(text: str) -> float method and a weight attribute.
- LengthCriterion: scores 1.0 if len(text) <= max_chars, 0.0 if over by more than 50%, linear interpolation in between.
- KeywordCriterion: handles both required and forbidden keyword lists. For required, the score is the fraction of keywords present. For forbidden, it is 1.0 minus the fraction of forbidden keywords found.
- RegexCriterion: 1.0 if expect_match and pattern matches, 0.0 if not. Inverted if expect_match=False.
- NoForbiddenPhraseCriterion: alias for KeywordCriterion with forbidden set and required empty.
- CustomCriterion: takes a callable fn(text: str) -> float so you can define inline without subclassing.
- Rubric: aggregates criteria, runs them with exception isolation, computes weighted total.
- RubricResult: dataclass with total, passed, details.
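The scoring rules in the list above are simple enough to write down as plain functions. The sketch below follows the descriptions as stated; it is not the library's source. Details the list does not specify are assumptions: keyword matching is case-sensitive, an empty keyword list scores 1.0, and the regex check uses re.search.

```python
import re

def length_score(text: str, max_chars: int) -> float:
    """1.0 at or under max_chars, 0.0 at 50% over or more, linear between."""
    overflow = len(text) - max_chars
    if overflow <= 0:
        return 1.0
    if max_chars <= 0:
        return 0.0  # avoid dividing by zero for a degenerate limit
    return max(0.0, 1.0 - overflow / (0.5 * max_chars))

def required_score(text: str, required: list[str]) -> float:
    """Fraction of required keywords present as exact substrings."""
    if not required:
        return 1.0
    return sum(kw in text for kw in required) / len(required)

def forbidden_score(text: str, forbidden: list[str]) -> float:
    """1.0 minus the fraction of forbidden keywords found."""
    if not forbidden:
        return 1.0
    return 1.0 - sum(kw in text for kw in forbidden) / len(forbidden)

def regex_score(text: str, pattern: str, expect_match: bool = True) -> float:
    """1.0 when the search result agrees with expect_match, else 0.0."""
    matched = re.search(pattern, text) is not None
    return 1.0 if matched == expect_match else 0.0

print(length_score("x" * 125, max_chars=100))                          # 0.5
print(required_score("summary only", ["summary", "key points"]))       # 0.5
print(forbidden_score("As an AI, I...", ["I apologize", "As an AI"]))  # 0.5
print(regex_score("due 2024-01-31", r"\d{4}-\d{2}-\d{2}", False))      # 0.0
```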
The exception isolation is intentional. In production pipelines you want a bad custom criterion to score zero, not bring down the whole evaluation pass. The error is captured in RubricResult.details so you can inspect it.
Weights do not need to sum to 1.0. The rubric normalizes by the sum of weights of criteria that actually ran (excluding errored ones if you set ignore_errors=True).
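A minimal sketch of how exception isolation and weight normalization could fit together, based only on the description above. The function name score_all, the tuple layout, and the shape of the details entries are hypothetical; only the ignore_errors behaviour is taken from the text.

```python
from typing import Callable

Criterion = tuple[str, float, Callable[[str], float]]  # (name, weight, fn)

def score_all(criteria: list[Criterion], text: str,
              ignore_errors: bool = False) -> tuple[float, list[dict]]:
    """Run every criterion, isolating exceptions, and return (total, details)."""
    details = []
    weighted_sum = 0.0
    weight_sum = 0.0
    for name, weight, fn in criteria:
        try:
            score, error = fn(text), None
        except Exception as exc:  # a failing criterion must not stop the rest
            score, error = 0.0, repr(exc)
        details.append({"name": name, "score": score, "error": error})
        if error is not None and ignore_errors:
            continue  # errored criterion drops out of the normalization
        weighted_sum += weight * score
        weight_sum += weight
    total = weighted_sum / weight_sum if weight_sum else 0.0
    return total, details

def broken(text: str) -> float:
    raise ValueError("bad custom criterion")

# Weights of 2.0 each: they do not sum to 1.0, the division normalizes them.
criteria = [("short", 2.0, lambda t: 1.0 if len(t) <= 20 else 0.0),
            ("broken", 2.0, broken)]
print(score_all(criteria, "hello")[0])                      # 0.5
print(score_all(criteria, "hello", ignore_errors=True)[0])  # 1.0
```

With ignore_errors left off, the broken criterion counts as a zero and drags the total down; with it on, the total reflects only the criteria that actually ran.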
5. When this is useful, when it is not
Useful when:
- You have clear, rule-based quality criteria for LLM outputs. Length limits, required sections, forbidden phrases, format patterns.
- You want to gate outputs before passing them downstream. Score below threshold = retry or flag for human review.
- You are running evaluations offline against a dataset of model outputs and want a fast, cheap signal alongside human ratings.
- You have a team and want the evaluation logic in one versioned place instead of ad-hoc checks scattered through prompt engineering code.
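For the offline case in the list above, one cheap sanity check is how often the rule-based pass/fail agrees with human labels. The sketch below uses invented data and a placeholder scorer (a bare length check standing in for a rubric); none of these names come from the library.

```python
def agreement_rate(outputs, human_ok, scorer, threshold=0.7):
    """Fraction of outputs where (score >= threshold) matches the human label."""
    if len(outputs) != len(human_ok):
        raise ValueError("outputs and human_ok must be the same length")
    if not outputs:
        return 0.0
    matches = sum(
        (scorer(text) >= threshold) == label
        for text, label in zip(outputs, human_ok)
    )
    return matches / len(outputs)

# Invented dataset: three outputs with human good/bad labels.
outputs = ["short summary", "x" * 800, "another short one"]
human_ok = [True, False, False]
rate = agreement_rate(outputs, human_ok, lambda t: 1.0 if len(t) <= 500 else 0.0)
print(round(rate, 2))  # 0.67
```

A low agreement rate is a hint that the criteria humans care about are not the ones the rules measure.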
Not useful when:
- Your quality criteria are semantic. "Is this factually correct?" "Does this answer the question?" Those require a judge model.
- Your outputs are structured (JSON, YAML, code). Use schema validation tools for structure checks.
- You need fuzzy matching. The keyword criterion is exact substring match. If you need fuzzy keyword presence, write a CustomCriterion with your own matching logic.
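One way to write that matching logic is with difflib from the standard library. The sketch below is an assumption about what "fuzzy" could mean here (a similarity ratio over sliding word windows); the names fuzzy_contains and make_fuzzy_keyword_scorer and the 0.8 cutoff are made up for the example.

```python
import difflib
import re

def fuzzy_contains(text: str, keyword: str, cutoff: float = 0.8) -> bool:
    """True if some run of words in text is similar enough to keyword."""
    words = re.findall(r"\w+", text.lower())
    target = keyword.lower().split()
    if not target:
        return False
    target_str = " ".join(target)
    # Slide a window with as many words as the keyword has.
    for i in range(len(words) - len(target) + 1):
        window = " ".join(words[i:i + len(target)])
        if difflib.SequenceMatcher(None, window, target_str).ratio() >= cutoff:
            return True
    return False

def make_fuzzy_keyword_scorer(required: list[str], cutoff: float = 0.8):
    """Build a fn(text) -> float returning the fraction of keywords fuzzily present."""
    def score(text: str) -> float:
        if not required:
            return 1.0
        hits = sum(fuzzy_contains(text, kw, cutoff) for kw in required)
        return hits / len(required)
    return score

scorer = make_fuzzy_keyword_scorer(["summary", "key points"])
print(scorer("Sumary of the key point list"))  # 1.0
print(scorer("nothing relevant here"))         # 0.0
```

The returned scorer has the fn(text: str) -> float shape that CustomCriterion is described as accepting. How the weight is passed is not shown in this post, so that part is left out.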
6. Install
The package is pending PyPI publication.
    # PyPI (pending):
    pip install prompt-eval-rubric

    # From source:
    git clone https://github.com/MukundaKatta/prompt-eval-rubric
    cd prompt-eval-rubric
    pip install -e .
No runtime dependencies. Python 3.9+.
    # Run the tests:
    pytest tests/ -v
    # 23 tests, all passing
7. Siblings in the stack
These libraries work well alongside prompt-eval-rubric:
| Library | What it does |
|---|---|
| agentsnap | Snapshot agent state at any point for inspection |
| agenttrace | Cost and latency per agent run |
| agent-decision-log | Log WHY the agent made each decision |
| llm-output-validator | Rule-based string validation (simpler, no weights) |
| prompt-replay | Record prompts and replay them across model versions |
| llm-structured-retry | Retry with error injected as follow-up message |
The combination that makes sense most often: agenttrace to measure performance, prompt-eval-rubric to measure output quality, agent-decision-log to explain what happened when quality drops.
8. What comes next
A few things I want to add before the first stable release:
First, a CompositeRubric that lets you AND or OR multiple rubrics. Right now if you want different rubrics for different output types, you instantiate them separately. A composite would let you express "pass rubric A and rubric B" cleanly.
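Until something like CompositeRubric exists, the AND/OR combination can be approximated on the caller's side with all() and any() over the individual pass flags. This is a hypothetical sketch, not the planned API; the checks are stand-ins for something like rubric.score(text).passed.

```python
from typing import Callable

PassCheck = Callable[[str], bool]

def all_of(*checks: PassCheck) -> PassCheck:
    """Pass only if every check passes (AND)."""
    return lambda text: all(check(text) for check in checks)

def any_of(*checks: PassCheck) -> PassCheck:
    """Pass if at least one check passes (OR)."""
    return lambda text: any(check(text) for check in checks)

# Stand-in checks; in practice each would wrap one rubric's pass/fail result.
def is_short(text: str) -> bool:
    return len(text) <= 500

def has_summary(text: str) -> bool:
    return "summary" in text

strict = all_of(is_short, has_summary)
lenient = any_of(is_short, has_summary)
print(strict("short text"), lenient("short text"))  # False True
```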
Second, a calibrate() method. You give it a set of labeled examples (good/bad) and it adjusts weights to maximize agreement with your labels. Basically gradient-free weight tuning.
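To show what gradient-free weight tuning could mean in practice, here is a random-search sketch. It is not the planned calibrate() design: the function signature, the input layout (a matrix of per-criterion scores plus boolean labels), and the search strategy are all assumptions for illustration.

```python
import random

def calibrate(scores, labels, threshold=0.7, trials=500, seed=0):
    """Random-search weights that maximize agreement with good/bad labels.

    scores: one row per example, one per-criterion score per column.
    labels: True for examples a human marked as good.
    """
    rng = random.Random(seed)
    n_criteria = len(scores[0])

    def agreement(weights):
        total_weight = sum(weights)
        hits = 0
        for row, label in zip(scores, labels):
            total = sum(w * s for w, s in zip(weights, row)) / total_weight
            hits += (total >= threshold) == label
        return hits / len(labels)

    best_weights = [1.0] * n_criteria  # start from equal weights
    best_agreement = agreement(best_weights)
    for _ in range(trials):
        candidate = [rng.uniform(0.01, 1.0) for _ in range(n_criteria)]
        candidate_agreement = agreement(candidate)
        if candidate_agreement > best_agreement:
            best_weights, best_agreement = candidate, candidate_agreement
    return best_weights, best_agreement

# Invented data: criterion 0 tracks the human labels, criterion 1 is noise.
scores = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0], [0.0, 0.0]]
labels = [True, True, False, False]
weights, agreement = calibrate(scores, labels)
print(weights, agreement)
```

With equal weights the first example fails the 0.7 threshold, so the search has to shift weight toward criterion 0 to agree with every label.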
Third, better reporting. Right now result.details is a list of dicts. I want a result.to_markdown() method that produces a readable table. Useful for logging to eval dashboards.
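In the meantime, a list of dicts is easy to turn into a table by hand. The sketch below assumes each entry has name, weight, and score keys; the post does not say what keys result.details actually contains, so treat those as placeholders.

```python
def details_to_markdown(details: list[dict]) -> str:
    """Render per-criterion dicts as a Markdown table."""
    lines = ["| Criterion | Weight | Score |", "|---|---|---|"]
    for row in details:
        lines.append(
            f"| {row['name']} | {row['weight']:.2f} | {row['score']:.2f} |"
        )
    return "\n".join(lines)

# Hypothetical details entries; the key names are assumptions.
details = [
    {"name": "length", "weight": 0.3, "score": 0.5},
    {"name": "required keywords", "weight": 0.4, "score": 1.0},
]
print(details_to_markdown(details))
```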
The library is intentionally small. The goal is a fast, dependency-free scoring layer that you can drop into any pipeline. If you need something heavier, the judge-model path is always available.