DEV Community

Mukunda Rao Katta

prompt-eval-rubric: Score Your Agent's Outputs Without Paying for Another LLM Call

1. The moment it became a problem

Six months ago I was shipping a summarization agent for a client. Every document came out of the model as a 2,000-word wall of text. The spec said 500 words max. I had two options: prompt-engineer my way out of it with increasingly desperate instructions, or catch the bad outputs and retry.

I added a simple length check. It worked. Then I needed to check that certain keywords were present. Then that some phrases were absent. Then I had four separate if statements scattered across the pipeline checking different output properties. Then a colleague changed one. Then we had a bug in production.

The real fix was obvious: put the rules in one place, score each one independently, combine them with weights. That is what prompt-eval-rubric is.

No LLM required. No API call to grade the output of the first API call. Just Python logic that runs in microseconds and tells you a number between 0.0 and 1.0.

I see people reach for judge models when the quality criteria are actually rule-based. If the spec says "must be under 500 words and must mention the product name", that is not a semantic judgment. A regex and a length check get there in under a millisecond. Save the judge model call for when you genuinely need semantic evaluation.

There is also a practical cost argument. If your pipeline runs 10,000 times a day and you use a small judge model at $0.001 per call, that is $10 a day just for quality scoring. Rule-based rubrics are free to run. Reserve the LLM call budget for generation, not scoring what you generated.

2. Shape of the fix

from prompt_eval_rubric import Rubric, LengthCriterion, KeywordCriterion, RegexCriterion

rubric = Rubric([
    LengthCriterion(max_chars=500, weight=0.3),  # limit is in characters, not words
    KeywordCriterion(required=["summary", "key points"], weight=0.4),
    KeywordCriterion(forbidden=["I apologize", "As an AI"], weight=0.2),
    RegexCriterion(pattern=r"\d{4}-\d{2}-\d{2}", expect_match=False, weight=0.1),
])

# llm_output is the plain-text string returned by your model call
result = rubric.score(llm_output)

print(result.total)        # e.g. 0.85
print(result.passed)       # e.g. True (0.85 is above the default threshold of 0.7)
print(result.details)      # per-criterion breakdown

The Rubric class takes a list of criteria, each with a weight. The final result.total is the weighted average of the individual scores. If a criterion raises an exception internally, that criterion scores 0.0 and the remaining criteria still run.
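To make the weighted average concrete, here is a minimal sketch in plain Python that does not import the library. The helper name weighted_total and the per-criterion scores are assumptions chosen for illustration; they show one way a total of 0.85 could arise from the four weights in the example.

```python
def weighted_total(scored):
    """Weighted average of (score, weight) pairs.

    Hypothetical helper for illustration, not the library's own code.
    """
    total_weight = sum(weight for _, weight in scored)
    if total_weight == 0:
        return 0.0
    return sum(score * weight for score, weight in scored) / total_weight


# Assumed per-criterion scores for one output, paired with the example weights.
scored = [
    (0.5, 0.3),  # length: partly over the limit
    (1.0, 0.4),  # both required keywords present
    (1.0, 0.2),  # no forbidden phrases found
    (1.0, 0.1),  # no date-like pattern found
]

total = weighted_total(scored)
print(round(total, 2))  # 0.85
```

A single half-failed criterion with weight 0.3 costs 0.15 points, which still clears a 0.7 threshold but would fail a 0.9 one.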

You can configure the pass threshold per rubric:

# criteria is any list of criterion objects, like the one passed to Rubric earlier
strict_rubric = Rubric(criteria, pass_threshold=0.9)
lenient_rubric = Rubric(criteria, pass_threshold=0.5)

And you can define custom criteria:

from prompt_eval_rubric import BaseCriterion

class ReadabilityScore(BaseCriterion):
    def score(self, text: str) -> float:
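        # Crude readability proxy: average sentence length.
        # Swap in Flesch-Kincaid or whatever you want.
        # Drop empty fragments, such as the one split() leaves after a trailing period.
        sentences = [s for s in text.split(".") if s.strip()]
        words = text.split()
        if not sentences or not words:
            return 0.0
        avg_sentence_length = len(words) / len(sentences)
        return 1.0 if avg_sentence_length < 20 else 0.5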

Plug it straight into a Rubric with a weight and it works the same way.

3. What it does NOT do

It does not call any model. There is no inference happening. If you need semantic evaluation ("is this answer accurate given the source document?"), this library is not the right tool for that job. That requires a judge model.

It does not parse structured outputs. If your LLM returns JSON and you want to validate the schema, use a JSON schema validator. This library works on plain text strings.

It does not retry. The library scores. Your pipeline decides what to do with the score. If you want automatic retry logic, wire that up yourself or use a library like llm-structured-retry.
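If you do wire up retries yourself, the gating loop can stay small. The sketch below is one possible shape, not part of the library: generate_with_gate, generate, and score are hypothetical names, and the two lambdas are stand-ins for a real model call and a real rubric. With the library, score would presumably wrap something like rubric.score(text).total.

```python
from typing import Callable


def generate_with_gate(
    generate: Callable[[], str],
    score: Callable[[str], float],
    threshold: float = 0.7,
    max_attempts: int = 3,
) -> tuple[str, float, bool]:
    """Call generate() until score(output) meets the threshold or attempts run out.

    Returns the best output seen, its score, and whether it passed.
    """
    best_output, best_score = "", -1.0
    for _ in range(max_attempts):
        output = generate()
        current = score(output)
        if current > best_score:
            best_output, best_score = output, current
        if current >= threshold:
            break  # good enough, stop retrying
    return best_output, best_score, best_score >= threshold


# Stand-ins for a model call and a rubric: the second draft is short enough to pass.
drafts = iter(["x" * 900, "x" * 400])
output, final_score, passed = generate_with_gate(
    generate=lambda: next(drafts),
    score=lambda text: 1.0 if len(text) <= 500 else 0.0,
)
print(len(output), final_score, passed)  # 400 1.0 True
```

Keeping the best output seen means a run that never passes still hands back its strongest attempt, which is useful when the fallback is human review.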

It does not store history. Each rubric.score() call is stateless. If you want trend data over multiple outputs, collect result.total values yourself.

4. Inside the library

The repo is at MukundaKatta/prompt-eval-rubric. There are 23 tests.

The core types are:

  • BaseCriterion: abstract base with a score(text: str) -> float method and a weight attribute.
  • LengthCriterion: scores 1.0 if len(text) <= max_chars, 0.0 if over by more than 50%, linear interpolation in between.
  • KeywordCriterion: handles both required and forbidden keyword lists. For required, fraction of keywords present. For forbidden, 1.0 minus fraction of forbidden keywords found.
  • RegexCriterion: with expect_match=True, 1.0 if the pattern matches and 0.0 if it does not. Inverted with expect_match=False.
  • NoForbiddenPhraseCriterion: alias for KeywordCriterion with forbidden set and required empty.
  • CustomCriterion: takes a callable fn(text: str) -> float so you can define inline without subclassing.
  • Rubric: aggregates criteria, runs them with exception isolation, computes weighted total.
  • RubricResult: dataclass with total, passed, details.
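The length and keyword rules described in the list can be sketched as standalone functions. This is a reading of the descriptions, not the library's source: the function names are hypothetical, and the behavior for empty keyword lists and for case sensitivity is an assumption.

```python
def length_score(text: str, max_chars: int) -> float:
    """1.0 at or under the limit, 0.0 at 150% of it or beyond, linear in between."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    overshoot = (len(text) - max_chars) / max_chars  # fraction over the limit
    if overshoot <= 0:
        return 1.0
    return max(0.0, 1.0 - overshoot / 0.5)


def required_score(text: str, required: list[str]) -> float:
    """Fraction of required keywords present (exact, case-sensitive substring match)."""
    if not required:
        return 1.0  # assumption: nothing required means nothing missing
    return sum(kw in text for kw in required) / len(required)


def forbidden_score(text: str, forbidden: list[str]) -> float:
    """1.0 minus the fraction of forbidden keywords found."""
    if not forbidden:
        return 1.0  # assumption: nothing forbidden means nothing violated
    return 1.0 - sum(kw in text for kw in forbidden) / len(forbidden)


print(length_score("x" * 500, 500))  # 1.0
print(length_score("x" * 625, 500))  # 0.5
print(length_score("x" * 750, 500))  # 0.0
print(required_score("A summary with details", ["summary", "key points"]))  # 0.5
print(forbidden_score("As an AI, I cannot", ["I apologize", "As an AI"]))  # 0.5
```

The partial credit is the point: an output that is 25% too long or misses one of two keywords lands at 0.5 for that criterion instead of failing outright.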

The exception isolation is intentional. In production pipelines you want a bad custom criterion to score zero, not bring down the whole evaluation pass. The error is captured in RubricResult.details so you can inspect it.

Weights do not need to sum to 1.0. The rubric normalizes by the sum of weights of criteria that actually ran (excluding errored ones if you set ignore_errors=True).
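Exception isolation and weight normalization interact, so a small sketch helps. The code below is an illustration under assumptions, not the library's implementation: score_isolated, the (function, weight) tuple layout, and the keys in the details dicts are all hypothetical. It shows how an errored criterion either counts as 0.0 or drops out of the normalization.

```python
from typing import Callable

Criterion = tuple[Callable[[str], float], float]  # (scoring function, weight)


def score_isolated(text: str, criteria: list[Criterion], ignore_errors: bool = False):
    """Run every criterion, catching exceptions so one failure cannot stop the rest."""
    weighted_sum, weight_sum, details = 0.0, 0.0, []
    for fn, weight in criteria:
        try:
            score, error = fn(text), None
        except Exception as exc:  # isolate the failing criterion
            score, error = 0.0, repr(exc)
        details.append({"score": score, "weight": weight, "error": error})
        if error is not None and ignore_errors:
            continue  # drop the errored criterion from the normalization
        weighted_sum += score * weight
        weight_sum += weight
    total = weighted_sum / weight_sum if weight_sum else 0.0
    return total, details


def broken(text: str) -> float:
    raise RuntimeError("criterion bug")


criteria = [(lambda t: 1.0, 3.0), (broken, 1.0)]  # weights need not sum to 1.0
print(score_isolated("any text", criteria)[0])                      # 0.75
print(score_isolated("any text", criteria, ignore_errors=True)[0])  # 1.0
```

Counting the error as 0.0 is the conservative choice for gating. Ignoring it suits offline evaluation, where a buggy criterion should not drag down every score in the dataset.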

5. When this is useful, when it is not

Useful when:

  • You have clear, rule-based quality criteria for LLM outputs. Length limits, required sections, forbidden phrases, format patterns.
  • You want to gate outputs before passing them downstream. Score below threshold = retry or flag for human review.
  • You are running evaluations offline against a dataset of model outputs and want a fast, cheap signal alongside human ratings.
  • You have a team and want the evaluation logic in one versioned place instead of ad-hoc checks scattered through prompt engineering code.

Not useful when:

  • Your quality criteria are semantic. "Is this factually correct?" "Does this answer the question?" Those require a judge model.
  • Your outputs are structured (JSON, YAML, code). Use schema validation tools for structure checks.
  • You need fuzzy matching. The keyword criterion is exact substring match. If you need fuzzy keyword presence, write a CustomCriterion with your own matching logic.
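For the fuzzy-matching case, the standard library's difflib is often enough. The function below is a sketch with a hypothetical name and an arbitrary cutoff of 0.8, and it only handles single-word keywords. Its shape fits the fn(text: str) -> float callable that CustomCriterion is described as accepting, once the keyword list is bound with a lambda or functools.partial.

```python
import difflib
import re


def fuzzy_keyword_score(text: str, keywords: list[str], cutoff: float = 0.8) -> float:
    """Fraction of single-word keywords that approximately match some word in text."""
    if not keywords:
        return 1.0
    words = re.findall(r"[a-z0-9]+", text.lower())
    found = sum(
        bool(difflib.get_close_matches(kw.lower(), words, n=1, cutoff=cutoff))
        for kw in keywords
    )
    return found / len(keywords)


# "sumary" is close enough to "summary"; nothing in the text resembles "pricing".
print(fuzzy_keyword_score("A short sumary of the release", ["summary", "pricing"]))  # 0.5
```

Lowering the cutoff catches more typos but also more false positives, so it is worth checking against a handful of real outputs before relying on it.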

6. Install

The package is pending PyPI publication.

# PyPI (pending):
pip install prompt-eval-rubric

# From source:
git clone https://github.com/MukundaKatta/prompt-eval-rubric
cd prompt-eval-rubric
pip install -e .

No runtime dependencies. Python 3.9+.

# Run the tests:
pytest tests/ -v
# 23 tests, all passing

7. Siblings in the stack

These libraries work well alongside prompt-eval-rubric:

| Library | What it does |
| --- | --- |
| agentsnap | Snapshot agent state at any point for inspection |
| agenttrace | Cost and latency per agent run |
| agent-decision-log | Log WHY the agent made each decision |
| llm-output-validator | Rule-based string validation (simpler, no weights) |
| prompt-replay | Record prompts and replay them across model versions |
| llm-structured-retry | Retry with error injected as follow-up message |

The combination that makes sense most often: agenttrace to measure performance, prompt-eval-rubric to measure output quality, agent-decision-log to explain what happened when quality drops.

8. What comes next

A few things I want to add before the first stable release:

First, a CompositeRubric that lets you AND or OR multiple rubrics. Right now if you want different rubrics for different output types, you instantiate them separately. A composite would let you express "pass rubric A and rubric B" cleanly.
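Until a composite exists, AND and OR over pass/fail results can be approximated in a few lines. This is purely a sketch of the idea, not the planned API: all_of, any_of, and the two example checks are hypothetical, and each check stands in for "did this rubric pass?".

```python
from typing import Callable

PassCheck = Callable[[str], bool]  # stand-in for "did this rubric pass?"


def all_of(*checks: PassCheck) -> PassCheck:
    """AND: pass only when every check passes."""
    return lambda text: all(check(text) for check in checks)


def any_of(*checks: PassCheck) -> PassCheck:
    """OR: pass when at least one check passes."""
    return lambda text: any(check(text) for check in checks)


def short_enough(text: str) -> bool:
    return len(text) <= 500


def mentions_summary(text: str) -> bool:
    return "summary" in text


both = all_of(short_enough, mentions_summary)
either = any_of(short_enough, mentions_summary)
print(both("a summary"), both("no keyword"), either("no keyword"))  # True False True
```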

Second, a calibrate() method. You give it a set of labeled examples (good/bad) and it adjusts weights to maximize agreement with your labels. Basically gradient-free weight tuning.
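One gradient-free way to do this is an exhaustive search over a small grid of weights. The sketch below is an assumption about how such a method could work, not the planned implementation: agreement, calibrate, the grid values, and the labeled data are all made up for illustration.

```python
from itertools import product


def agreement(weights, examples, threshold=0.7):
    """Fraction of labeled examples where (weighted score >= threshold) matches the label."""
    hits = 0
    for scores, is_good in examples:
        total = sum(s * w for s, w in zip(scores, weights)) / sum(weights)
        hits += (total >= threshold) == is_good
    return hits / len(examples)


def calibrate(examples, n_criteria, grid=(1, 2, 3, 4)):
    """Try every weight combination on a small positive grid; return the best one found."""
    return max(product(grid, repeat=n_criteria), key=lambda w: agreement(w, examples))


# Each example: (per-criterion scores, human label). Here only criterion 1 tracks the label.
examples = [
    ((1.0, 0.0), True),
    ((0.0, 1.0), False),
    ((1.0, 1.0), True),
    ((0.0, 0.0), False),
]
best = calibrate(examples, n_criteria=2)
print(best, agreement(best, examples))  # (3, 1) 1.0
```

The search cost grows exponentially with the number of criteria, so a grid like this only suits rubrics with a handful of them. Random search or coordinate-wise tuning would scale better.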

Third, better reporting. Right now result.details is a list of dicts. I want a result.to_markdown() method that produces a readable table. Useful for logging to eval dashboards.
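In the meantime, a table can be produced from a list of dicts with a small helper. The dict keys used here (name, score, weight) are assumptions, since the actual layout of result.details is not documented in this post; adjust them to whatever the library emits.

```python
def details_to_markdown(details: list[dict]) -> str:
    """Render per-criterion details as a Markdown table.

    The keys "name", "score", and "weight" are assumed, not the library's.
    """
    lines = ["| Criterion | Score | Weight |", "| --- | --- | --- |"]
    for row in details:
        lines.append(f"| {row['name']} | {row['score']:.2f} | {row['weight']:.2f} |")
    return "\n".join(lines)


details = [
    {"name": "length", "score": 0.5, "weight": 0.3},
    {"name": "required keywords", "score": 1.0, "weight": 0.4},
]
print(details_to_markdown(details))
```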

The library is intentionally small. The goal is a fast, dependency-free scoring layer that you can drop into any pipeline. If you need something heavier, the judge-model path is always available.

Source: github.com/MukundaKatta/prompt-eval-rubric
