Score Your Agent's Responses With a 0.0-1.0 Rubric (No LLM Judge Required)

#hermeschallenge #ai #python #agents

"LLM-as-judge" for evaluating agent responses sounds appealing. You ask another LLM whether the response was good. The problem: it is slow, expensive, and inconsistent. Two runs of the same evaluation often disagree.

prompt-eval-rubric takes a different approach: weighted rule-based scoring with a 0.0-1.0 result. No LLM calls. Consistent. Fast. Cheap to run in CI.

The Shape of the Fix

from prompt_eval_rubric import Rubric, Criterion, RubricResult

rubric = Rubric([
    Criterion(
        name="addresses_question",
        weight=0.40,
        check=lambda response, context: context["question"].split()[0].lower() in response.lower(),
    ),
    Criterion(
        name="appropriate_length",
        weight=0.25,
        check=lambda response, _: 50 <= len(response) <= 800,
    ),
    Criterion(
        name="no_ai_preamble",
        weight=0.20,
        check=lambda response, _: not response.startswith(("As an AI", "I'm an AI")),
    ),
    Criterion(
        name="has_specifics",
        weight=0.15,
        check=lambda response, _: any(char.isdigit() or char == '$' for char in response),
    ),
])

result: RubricResult = rubric.score(
    response="The average salary is $95,000 per year according to the 2026 survey.",
    context={"question": "What is the average salary for ML engineers?"},
)

print(f"Score: {result.score:.2f}")         # 0.60 (0.25 + 0.20 + 0.15; "what" is not in the response, so addresses_question fails)
print(f"Breakdown: {result.per_criterion}") # {"addresses_question": False, "appropriate_length": True, ...}

What It Does NOT Do

prompt-eval-rubric does not evaluate semantic correctness. If the model says "The average salary is $95,000" but the true answer is $120,000, the rubric does not know that. It checks structural quality, not factual accuracy.

It does not replace human evaluation. For high-stakes quality decisions, human review is irreplaceable. The rubric is useful for catching obviously bad responses (too short, missing key terms, wrong format) in CI.

It does not weight criteria automatically. You specify weights. The weights should reflect the relative importance of each criterion for your use case.

Inside the Library

Criterion is a dataclass:

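from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str
    weight: float  # weights must sum to 1.0 across all criteria in a rubric
    check: Callable[[str, dict], bool]  # (response, context) -> passed?
    description: str = ""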

Rubric.score() calls each criterion's check() function and computes the weighted sum:

def score(self, response: str, context: dict | None = None) -> RubricResult:
    ctx = context or {}
    per_criterion = {}
    total_score = 0.0

    for criterion in self._criteria:
        passed = criterion.check(response, ctx)
        per_criterion[criterion.name] = passed
        if passed:
            total_score += criterion.weight

    return RubricResult(
        score=total_score,
        per_criterion=per_criterion,
        response=response,
    )

The score ranges from 0.0 (no criteria pass) to 1.0 (all criteria pass), given non-negative weights that sum to 1.0. A response that passes criteria carrying 75% of the total weight scores 0.75.
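To make the arithmetic concrete, here is a minimal self-contained sketch of the same weighted sum. MiniCriterion and weighted_score are hypothetical stand-ins written for this example, not the library's classes, and the three checks are invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class MiniCriterion:
    """Hypothetical stand-in for the library's Criterion."""
    name: str
    weight: float
    check: Callable[[str, dict], bool]


def weighted_score(criteria: list[MiniCriterion], response: str, context: dict) -> float:
    """Sum the weights of the criteria whose check passes."""
    return sum(c.weight for c in criteria if c.check(response, context))


criteria = [
    MiniCriterion("mentions_refund", 0.50, lambda r, _: "refund" in r.lower()),
    MiniCriterion("under_200_chars", 0.25, lambda r, _: len(r) <= 200),
    MiniCriterion("ends_with_period", 0.25, lambda r, _: r.rstrip().endswith(".")),
]

response = "Your refund was issued on Monday"  # no trailing period
score = weighted_score(criteria, response, {})
print(f"{score:.2f}")  # 0.75: the two passing criteria carry 0.50 + 0.25
```

The failing criterion contributes nothing, so the score is simply the weight that passed.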

Validation: Rubric.__init__() checks that weights sum to approximately 1.0 (within 0.01 tolerance). If they do not, WeightSumError is raised with the actual sum.

When to Use It

Use it in CI for regression testing. Define a rubric for your agent's expected output quality. Run the rubric against a set of test prompts with fixed (mocked or recorded) LLM responses. Fail CI if the average score drops below 0.80.
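A CI gate of this kind can be as small as the sketch below. check_mean_score is a hypothetical helper, and the score list stands in for whatever you collect by running your rubric over recorded responses.

```python
from statistics import mean


def check_mean_score(scores: list[float], threshold: float = 0.80) -> float:
    """Return the mean score; raise AssertionError if it is below the threshold."""
    if not scores:
        raise ValueError("no scores to evaluate")
    average = mean(scores)
    if average < threshold:
        raise AssertionError(f"mean rubric score {average:.2f} is below {threshold:.2f}")
    return average


# Hypothetical scores, one per recorded response in the test set.
recorded_scores = [1.0, 0.75, 1.0, 1.0]
print(f"{check_mean_score(recorded_scores):.2f}")  # 0.94
```

Called from a test function, the AssertionError fails the CI job and the message reports how far the average dropped.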

Use it for A/B comparison. Before deploying a prompt change, score 50 responses from the current prompt and 50 from the new prompt. If the new average is higher by more than the sample-to-sample variation, the change is likely an improvement on the criteria the rubric measures.
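One rough way to read an A/B result is to compare the difference in means with its standard error. The sketch below assumes you already have two lists of rubric scores; compare_scores and the "two standard errors" rule of thumb are illustrative choices, not part of the library.

```python
from math import sqrt
from statistics import mean, stdev


def compare_scores(current: list[float], candidate: list[float]) -> tuple[float, float]:
    """Return (difference in means, standard error of that difference).

    Each list needs at least two scores, because stdev is undefined otherwise.
    """
    diff = mean(candidate) - mean(current)
    stderr = sqrt(stdev(current) ** 2 / len(current) + stdev(candidate) ** 2 / len(candidate))
    return diff, stderr


# Hypothetical rubric scores for the current and the new prompt.
current_scores = [0.5, 0.75, 0.5, 0.75]
candidate_scores = [1.0, 0.75, 1.0, 0.75]

diff, stderr = compare_scores(current_scores, candidate_scores)
looks_better = diff > 2 * stderr  # rough heuristic, not a formal significance test
print(f"diff={diff:.2f}, looks_better={looks_better}")
```

With only a handful of responses the standard error is large, which is one reason to score a few dozen responses per variant.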

Use it for threshold gating. If a response scores below 0.50, route it to a human reviewer rather than returning it to the user.

Skip it for open-ended generation (creative writing, brainstorming). Rule-based rubrics cannot capture creative quality. They work best for fact-based, structured, or task-completion responses.

Install

pip install git+https://github.com/MukundaKatta/prompt-eval-rubric

from prompt_eval_rubric import Rubric, Criterion

# Support ticket classification rubric
support_rubric = Rubric([
    Criterion("is_one_word", 0.30, lambda r, _: len(r.strip().split()) == 1),
    Criterion("valid_category", 0.50, 
              lambda r, _: r.strip().lower() in {"billing", "technical", "account", "other"}),
    Criterion("no_punctuation", 0.20, lambda r, _: r.strip().isalpha()),
])

def classify_and_validate(message: str) -> str:
    # call_llm is a placeholder for your own LLM client call; it should return a string.
    raw = call_llm("Classify this support message into one category "
                   f"(billing/technical/account/other): {message}")

    answer = raw.strip()
    result = support_rubric.score(answer)
    if result.score < 0.80:
        # Invalid classification, use fallback
        return "other"
    return answer.lower()

Sibling Libraries

Library	What it solves
`llm-output-validator`	Rule-based validation of output shape (pass/fail, not scored)
`agentsnap`	Snapshot regression: flag when output structure changes
`prompt-lint`	Static analysis of prompts before deployment
`llm-multi-vote`	Jury voting for categorical decisions
`agent-guard-rails`	Output filter pipeline (clean/block, not score)

The quality stack: prompt-lint pre-deployment, prompt-eval-rubric in CI for regression, llm-output-validator at runtime for structural validation, agent-guard-rails for live filtering.

What's Next

Partial-match criteria: a criterion that scores 0.5 instead of 0.0/1.0 for partial passes. "Response contains at least 2 of these 4 required terms" could score 0.5 instead of failing entirely.
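Since this feature does not exist yet, the following is only a sketch of how partial credit might be computed. fraction_of_terms is a hypothetical helper, and the terms and weight are invented for the example.

```python
def fraction_of_terms(response: str, required_terms: list[str]) -> float:
    """Return the fraction of required terms found in the response (0.0 to 1.0)."""
    if not required_terms:
        return 1.0  # nothing required, so nothing is missing
    text = response.lower()
    hits = sum(1 for term in required_terms if term.lower() in text)
    return hits / len(required_terms)


terms = ["refund", "invoice", "billing", "receipt"]
credit = fraction_of_terms("Your refund appears on the next invoice.", terms)
print(credit)  # 0.5: two of the four terms appear

# A criterion weighted 0.40 would then contribute 0.40 * credit to the total.
contribution = 0.40 * credit
```

The total would still stay between 0.0 and 1.0, because each criterion contributes at most its full weight.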

Rubric serialization: load/save rubrics from YAML or JSON. This would let product managers define quality criteria in config files without writing Python.
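One possible shape for this, sketched under assumptions: checks stay in Python and are registered by name, while the config file only lists names and weights. The CHECKS registry, load_criteria, and the JSON layout are all hypothetical.

```python
import json
from typing import Callable

Check = Callable[[str, dict], bool]

# Hypothetical registry: config files refer to checks by name, code stays in Python.
CHECKS: dict[str, Check] = {
    "non_empty": lambda response, _: bool(response.strip()),
    "max_800_chars": lambda response, _: len(response) <= 800,
}


def load_criteria(config_text: str) -> list[tuple[str, float, Check]]:
    """Parse a JSON list of name/weight/check objects into (name, weight, check) tuples."""
    entries = json.loads(config_text)
    return [(e["name"], float(e["weight"]), CHECKS[e["check"]]) for e in entries]


config = """
[
  {"name": "not_blank", "weight": 0.5, "check": "non_empty"},
  {"name": "short_enough", "weight": 0.5, "check": "max_800_chars"}
]
"""
criteria = load_criteria(config)
print([(name, weight) for name, weight, _ in criteria])
```

Each tuple could then be passed to Criterion(name, weight, check). An unknown check name raises KeyError, which surfaces config typos early.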

Fuzzy string matching for check functions: a built-in helper fuzzy_contains(keyword, threshold=0.8) that handles typos and abbreviations when checking whether a response addresses a keyword.
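A helper along these lines could be built on the standard library's difflib. This sketch assumes fuzzy_contains returns a check function with the (response, context) signature used above. It handles typos only; abbreviations would need a different approach.

```python
import re
from difflib import SequenceMatcher
from typing import Callable


def fuzzy_contains(keyword: str, threshold: float = 0.8) -> Callable[[str, dict], bool]:
    """Build a check that passes when any word in the response is close to the keyword."""
    target = keyword.lower()

    def check(response: str, _context: dict) -> bool:
        words = re.findall(r"[a-z0-9]+", response.lower())
        # ratio() is 1.0 for identical strings and falls toward 0.0 as they diverge
        return any(SequenceMatcher(None, target, word).ratio() >= threshold for word in words)

    return check


salary_check = fuzzy_contains("salary")
print(salary_check("The average salery is high.", {}))  # True: "salery" is close enough
print(salary_check("No numbers here.", {}))             # False
```

Comparing word by word keeps the helper simple, but it will not match multi-word keywords.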

Built as part of the agent-stack family: composable Python primitives for production LLM agents.