Mukunda Rao Katta

Posted on May 25

prompt-eval-rubric: Composable 0.0-1.0 Scoring Rubrics for LLM Outputs

#hermeschallenge #ai #python #agents

It was 2am on a Tuesday. The overnight eval run had been crawling through 800 LLM responses, scoring each one against a set of quality rubrics. About halfway through, a rubric raised re.error on a model output that contained a malformed capture group. Not a model error. Not a network failure. A regex error in a scoring function.

The evaluator crashed. The whole run died. Eight hours of compute, gone. When I woke up I had a half-empty results file and no idea which half was good.

That morning I wrote prompt-eval-rubric. It is a small Python library for composable LLM output scoring. The core design choice: rubrics are exception-isolated. If one rubric crashes, the evaluator does not crash. It records the error, assigns that rubric a score of 0.0, and keeps going.

No more 2am failures because someone added an experimental rubric to the pipeline.

The shape of the fix

The library has three pieces: rubrics, a weighted evaluator, and a strict evaluator. You define rubrics and compose them.

Here is the simplest case: scoring whether an LLM response contains expected keywords and is within a reasonable length.

from prompt_eval_rubric import ContainsKeywords, LengthBetween, WeightedRubric

rubrics = WeightedRubric([
    (ContainsKeywords(["summary", "key points"]), 0.4),
    (LengthBetween(min_chars=50, max_chars=500), 0.6),
])

response = "Here is a summary of the meeting. Key points: ..."
result = rubrics.score(response)

print(result.score)       # 0.0 to 1.0
print(result.breakdown)   # per-rubric scores and labels

The breakdown gives you the per-rubric details. Useful when a score is lower than expected and you want to know which rubric pulled it down.
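
When a score comes in lower than expected, it also helps to know how the per-rubric scores are combined. The post does not spell that out, so here is a minimal sketch of the arithmetic, assuming a weighted mean normalized by the sum of the weights. `weighted_mean` is a hypothetical helper written for illustration, not part of the library.

```python
from typing import Sequence, Tuple


def weighted_mean(scored: Sequence[Tuple[float, float]]) -> float:
    """Combine (score, weight) pairs into one 0.0-1.0 score.

    Assumption: weights are normalized by their sum. The library may
    combine scores differently.
    """
    total_weight = sum(weight for _, weight in scored)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(score * weight for score, weight in scored) / total_weight


# Keyword rubric passes (1.0, weight 0.4); length rubric fails (0.0, weight 0.6).
combined = weighted_mean([(1.0, 0.4), (0.0, 0.6)])
print(round(combined, 2))  # 0.4
```

Under that assumption, a failing rubric with weight 0.6 caps the combined score at 0.4 no matter how well the other rubric does.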

Five built-in rubrics ship with the library:

from prompt_eval_rubric import (
    ContainsKeywords,    # checks for required words/phrases
    MatchesRegex,        # pattern match, score 1.0 if matched
    JsonSchema,          # validate output against a JSON schema dict
    LengthBetween,       # score by whether length is in range
    NoForbiddenPhrases,  # score 0.0 if any forbidden phrase is present
)

JsonSchema is the most useful one in practice. If your agent produces structured output, you want to know when the schema drifts between model versions.

from prompt_eval_rubric import JsonSchema, WeightedRubric

schema = {
    "type": "object",
    "required": ["status", "items"],
    "properties": {
        "status": {"type": "string"},
        "items": {"type": "array"},
    }
}

rubric = WeightedRubric([(JsonSchema(schema), 1.0)])
response = '{"status": "ok", "items": [1, 2, 3]}'
result = rubric.score(response)
print(result.score)  # 1.0

bad_response = '{"status": "ok"}'  # missing items
result2 = rubric.score(bad_response)
print(result2.score)  # 0.0
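
Models also like to wrap JSON in prose, or return something that is not JSON at all. As a rough illustration of what a structure check has to cope with, here is a hand-rolled version written as a plain callable. It is a sketch using only the standard library, not how JsonSchema is implemented, and it checks required keys and Python types only, not the JSON Schema spec. `has_required_keys` is a hypothetical name.

```python
import json
from typing import Any, Dict


def has_required_keys(output: str, required: Dict[str, type]) -> float:
    """Return 1.0 if output is a JSON object with the required typed keys."""
    try:
        data: Any = json.loads(output)
    except json.JSONDecodeError:
        return 0.0  # not parseable as JSON at all
    if not isinstance(data, dict):
        return 0.0  # valid JSON, but not an object
    for key, expected_type in required.items():
        if key not in data or not isinstance(data[key], expected_type):
            return 0.0
    return 1.0


required = {"status": str, "items": list}
print(has_required_keys('{"status": "ok", "items": [1, 2, 3]}', required))  # 1.0
print(has_required_keys('{"status": "ok"}', required))                      # 0.0
print(has_required_keys("Sure! Here is the JSON you asked for", required))  # 0.0
```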

Custom rubrics are just callables. Any function that takes a string and returns a float between 0.0 and 1.0 works.

import re

from prompt_eval_rubric import LengthBetween, WeightedRubric

def my_rubric(output: str) -> float:
    # Return 1.0 if the output mentions a YYYY-MM-DD date, 0.0 otherwise.
    return 1.0 if re.search(r"\d{4}-\d{2}-\d{2}", output) else 0.0

rubrics = WeightedRubric([
    (my_rubric, 0.5),
    (LengthBetween(min_chars=20), 0.5),
])
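
The date rubric above is binary, but the callable contract only asks for a float between 0.0 and 1.0, so partial credit works too. A minimal sketch of a graded rubric follows. The function name and the expected terms are made up for illustration.

```python
def keyword_coverage(output: str) -> float:
    """Fraction of expected terms that appear in the output (case-insensitive)."""
    expected = ["refund", "order number", "timeline"]  # illustrative terms
    text = output.lower()
    hits = sum(1 for term in expected if term in text)
    return hits / len(expected)


# Two of the three terms are present, so this scores two thirds.
partial = keyword_coverage("Your refund will arrive soon. Order number: 1234.")
print(partial)
```

A graded rubric like this gives the weighted average something to move on. A binary rubric can only flip it.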

The AllRubric evaluator is stricter. It fails the whole output if any single rubric scores below a threshold. Use it when every rubric must pass on its own and a good weighted average is not enough.

from prompt_eval_rubric import AllRubric, ContainsKeywords, NoForbiddenPhrases

strict = AllRubric(
    rubrics=[ContainsKeywords(["confirmed"]), NoForbiddenPhrases(["error", "failed"])],
    threshold=0.8,
)

result = strict.score("Order confirmed. Thank you.")
print(result.passed)   # True

result2 = strict.score("Request failed due to an error.")
print(result2.passed)  # False
print(result2.failed_rubrics)  # which rubrics failed
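
The all-must-pass rule is simple enough to sketch in plain Python. This is an illustration of the semantics described above, not AllRubric's actual code. It assumes "below a threshold" means a strict less-than comparison, and `all_pass` and the two stand-in checks are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Rubric = Callable[[str], float]


def all_pass(
    output: str, rubrics: Dict[str, Rubric], threshold: float
) -> Tuple[bool, List[str]]:
    """Return (passed, names of rubrics that scored below the threshold)."""
    failed = [name for name, rubric in rubrics.items() if rubric(output) < threshold]
    return (not failed, failed)


# Stand-ins for ContainsKeywords(["confirmed"]) and NoForbiddenPhrases([...]).
checks: Dict[str, Rubric] = {
    "mentions_confirmed": lambda text: 1.0 if "confirmed" in text.lower() else 0.0,
    "no_failure_words": lambda text: (
        0.0 if any(word in text.lower() for word in ("error", "failed")) else 1.0
    ),
}

print(all_pass("Order confirmed. Thank you.", checks, threshold=0.8))
# (True, [])
print(all_pass("Request failed due to an error.", checks, threshold=0.8))
# (False, ['mentions_confirmed', 'no_failure_words'])
```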

What it does NOT do

It does not call the LLM for you. The library scores strings. You produce those strings however you like. Bring your own client.
It does not auto-generate rubrics from a prompt. There is no "here is my prompt, tell me what to check for" magic. You write the rubrics. That is intentional: auto-generated rubrics tend to be vague and miss your actual requirements.
It does not aggregate across multiple LLM outputs into a dataset metric. It scores one output at a time. If you want dataset-level aggregation, call it in a loop and average the results yourself.
It does not persist scores. No database, no file format. It returns results in memory. If you want persistence, write the results to your own store.
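
The last two points, doing your own aggregation and your own persistence, come down to a loop and a writer. Here is a minimal sketch using only the standard library. `score_dataset`, `to_csv`, and the stand-in scorer are hypothetical names. In a real run the scorer would wrap whatever returns your per-output score, for example something like `rubrics.score(output).score`.

```python
import csv
import io
from statistics import mean
from typing import Callable, Iterable, List, Tuple


def score_dataset(
    outputs: Iterable[str], scorer: Callable[[str], float]
) -> List[Tuple[str, float]]:
    """Score each output independently and keep (output, score) rows."""
    return [(output, scorer(output)) for output in outputs]


def to_csv(rows: List[Tuple[str, float]]) -> str:
    """Serialize rows to CSV text; write this to a file in a real run."""
    buffer = io.StringIO()
    writer = csv.writer(buffer)
    writer.writerow(["output", "score"])
    writer.writerows(rows)
    return buffer.getvalue()


def mentions_summary(output: str) -> float:
    # Stand-in scorer so the sketch runs without the library installed.
    return 1.0 if "summary" in output.lower() else 0.0


rows = score_dataset(["Here is a summary.", "No idea."], mentions_summary)
dataset_score = mean(score for _, score in rows)
print(dataset_score)  # 0.5
print(to_csv(rows))
```

For a long run, appending rows to disk as you go means a crash elsewhere does not cost you the scores you already have.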

Inside the lib: exception isolation design

This is the part that matters most.

Every rubric call is wrapped in a try/except. If the rubric raises any exception, the evaluator catches it, records the error, and assigns that rubric a score of 0.0. The other rubrics in the evaluator still run normally.

from prompt_eval_rubric import WeightedRubric

def broken_rubric(output: str) -> float:
    raise ValueError("something went wrong in my custom scoring logic")

def good_rubric(output: str) -> float:
    return 1.0 if "summary" in output else 0.0

rubrics = WeightedRubric([
    (broken_rubric, 0.3),
    (good_rubric, 0.7),
])

result = rubrics.score("here is your summary")

print(result.score)         # 0.7 (only the good rubric contributed)
print(result.errors)        # {"broken_rubric": "ValueError: something went wrong..."}
print(result.breakdown)     # per-rubric scores including the 0.0 for broken_rubric
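
To make the mechanism concrete, here is one way a wrapper like this could look. It is a sketch of the technique in plain Python, not the library's source. `score_isolated` and the two example rubrics are hypothetical, and the weighted mean is an assumption about how scores are combined.

```python
from typing import Callable, Dict, List, Tuple

Rubric = Callable[[str], float]


def score_isolated(
    output: str, weighted: List[Tuple[Rubric, float]]
) -> Tuple[float, Dict[str, float], Dict[str, str]]:
    """Run every rubric; a rubric that raises scores 0.0 and is recorded."""
    total_weight = sum(weight for _, weight in weighted)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")

    breakdown: Dict[str, float] = {}
    errors: Dict[str, str] = {}
    weighted_sum = 0.0
    for rubric, weight in weighted:
        name = getattr(rubric, "__name__", type(rubric).__name__)
        try:
            value = float(rubric(output))
        except Exception as exc:  # broad on purpose: isolate any rubric failure
            value = 0.0
            errors[name] = f"{type(exc).__name__}: {exc}"
        breakdown[name] = value
        weighted_sum += value * weight
    return weighted_sum / total_weight, breakdown, errors


def raises_rubric(output: str) -> float:
    raise ValueError("boom")


def ok_rubric(output: str) -> float:
    return 1.0 if "summary" in output else 0.0


score, breakdown, errors = score_isolated(
    "here is your summary", [(raises_rubric, 0.3), (ok_rubric, 0.7)]
)
print(round(score, 2))  # 0.7
print(breakdown)        # {'raises_rubric': 0.0, 'ok_rubric': 1.0}
print(errors)           # {'raises_rubric': 'ValueError: boom'}
```

The try/except sits inside the loop, around a single rubric call. Put it around the whole loop and one failure still skips every rubric after it.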

The reason this matters in practice: eval pipelines are long-running. You are often scoring hundreds or thousands of outputs. You do not want one bad rubric or one weird model output to kill a four-hour run. The isolation means you can add experimental rubrics to an existing evaluator without putting the whole pipeline at risk. Worst case, the experimental rubric scores 0.0 on everything and you see it in the error log.

The result.errors field is how you find out. After a run you can check which rubrics errored and how often, without having crashed mid-run.
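
Counting errors across a run takes a few lines. This sketch assumes each result's errors is a dict mapping rubric name to message, as in the example output above. `tally_errors` is a hypothetical helper and the sample data is invented for illustration.

```python
from collections import Counter
from typing import Dict, Iterable


def tally_errors(per_output_errors: Iterable[Dict[str, str]]) -> Counter:
    """Count how many outputs each rubric errored on."""
    counts: Counter = Counter()
    for errors in per_output_errors:
        counts.update(errors.keys())
    return counts


# Invented sample: one errors dict per scored output.
run_errors = [
    {"broken_rubric": "ValueError: something went wrong"},
    {},
    {"broken_rubric": "ValueError: something went wrong", "regex_rubric": "error: bad pattern"},
]

error_counts = tally_errors(run_errors)
for name, count in error_counts.most_common():
    print(f"{name}: errored on {count} of {len(run_errors)} outputs")
```

A rubric that errors on every output is a bug in the rubric. One that errors on a handful is usually telling you something about those outputs.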

When this is useful

You are running nightly evals against a prompt and want to track output quality over time. Rubric scores give you a numeric trend.
You are comparing two model versions and want to know which is producing better structured output. Run the same rubrics against both sets of responses.
You have a CI check that should fail if LLM outputs drift outside an expected format. Use AllRubric with a threshold and fail the build if result.passed is False.
You are doing A/B testing on a system prompt. Score both variants with the same rubric set and compare the distributions.
You just shipped a new model version and want to verify that none of your downstream parsing requirements broke. JsonSchema rubric, run it on a sample of outputs.
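
For the CI case, the gate itself is small. A minimal sketch, with a stand-in pass/fail check so it runs without the library installed. `ci_exit_code` and `looks_confirmed` are hypothetical names. In a real pipeline the check would be something like `lambda text: strict.score(text).passed`.

```python
from typing import Callable, Iterable, List


def failing_outputs(outputs: Iterable[str], passes: Callable[[str], bool]) -> List[str]:
    """Collect every output that does not pass the check."""
    return [output for output in outputs if not passes(output)]


def ci_exit_code(outputs: Iterable[str], passes: Callable[[str], bool]) -> int:
    """Return 1 if any output fails, 0 otherwise, and report the failures."""
    failures = failing_outputs(outputs, passes)
    for output in failures:
        print(f"FAIL: {output!r}")
    return 1 if failures else 0


def looks_confirmed(text: str) -> bool:
    # Stand-in for a strict evaluator's pass/fail result.
    return "confirmed" in text.lower()


sample = ["Order confirmed. Thank you.", "Request failed due to an error."]
exit_code = ci_exit_code(sample, looks_confirmed)
print(exit_code)  # 1
# In a CI script, finish with: sys.exit(exit_code)
```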

When this is NOT what you want

You want human-style qualitative feedback on LLM outputs. The library scores against rules you define. It does not provide "this response is too terse" style judgments.
Your rubrics require calling another LLM to judge the output. This library is for deterministic, programmatic rubrics. If you need LLM-as-judge scoring, this is not the right layer.
You want rubric results to automatically update a dataset or trigger a re-run. The library returns results. What you do with them is your problem.

Install

pip install prompt-eval-rubric

No dependencies. No LLM SDK bundled. Bring your own outputs.

GitHub: MukundaKatta/prompt-eval-rubric

23 tests, all passing.

Sibling libraries

Lib	Boundary	Repo
agentsnap	Snapshot tests for agent tool-call traces	MukundaKatta/agentsnap
prompt-replay	Replay recorded LLM prompts against a new model and diff the outputs	MukundaKatta/prompt-replay
driftvane	Eval-based drift detection across embedding, retrieval, and response layers	MukundaKatta/driftvane
agent-decision-log	WHY layer: record which option was chosen and why at each branch	MukundaKatta/agent-decision-log

prompt-replay is the closest sibling. It replays recorded prompts against a new model and diffs the raw output. prompt-eval-rubric takes that a step further: instead of diffing text, you score outputs against defined quality criteria. The two libraries compose naturally. Use prompt-replay to get the new outputs, then feed them into a WeightedRubric to quantify how much the quality shifted.

driftvane covers eval-based drift over time at a higher level: it tracks whether your retrieval, embedding, or response quality is drifting across deployments. prompt-eval-rubric provides the per-output scoring primitives that drift detectors can be built on top of.

What is next

A few things on the list:

A RubricReport.to_csv() helper for writing scored outputs to disk during long eval runs. Right now you handle persistence yourself.
A threshold_rubric factory that wraps any numeric function and turns it into a pass/fail rubric: threshold_rubric(my_fn, min_score=0.7).
Named rubric groups so you can tag rubrics by category (format, tone, accuracy) and get category-level aggregates from the breakdown.
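
The threshold_rubric factory has not shipped yet, so the following is only a guess at its shape, based on the one-line description above. It assumes the comparison is inclusive (a score equal to min_score passes). `length_ratio` is a made-up graded function to wrap.

```python
from typing import Callable


def threshold_rubric(
    fn: Callable[[str], float], min_score: float
) -> Callable[[str], float]:
    """Wrap a graded scoring function into a pass/fail rubric (1.0 or 0.0)."""

    def rubric(output: str) -> float:
        return 1.0 if fn(output) >= min_score else 0.0

    # Give the wrapper a readable name for breakdowns and error logs.
    rubric.__name__ = f"{getattr(fn, '__name__', 'rubric')}_min_{min_score}"
    return rubric


def length_ratio(output: str) -> float:
    """Graded score: output length relative to 100 characters, capped at 1.0."""
    return min(len(output) / 100, 1.0)


long_enough = threshold_rubric(length_ratio, min_score=0.7)
print(long_enough("x" * 80))  # 1.0
print(long_enough("x" * 50))  # 0.0
```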

The core loop is stable: define rubrics, compose them, score outputs, inspect the breakdown. The exception isolation means you can keep adding rubrics to a production eval pipeline without worrying about one bad regex taking down an overnight run.

Built for the Hermes Agent Challenge. Part of a series of small libraries for production agent infrastructure.

DEV Community