Mukunda Rao Katta

Score Your Agent's Responses With a 0.0-1.0 Rubric (No LLM Judge Required)

"LLM-as-judge" for evaluating agent responses sounds appealing. You ask another LLM whether the response was good. The problem: it is slow, expensive, and inconsistent. Two runs of the same evaluation often disagree.

prompt-eval-rubric takes a different approach: weighted rule-based scoring with a 0.0-1.0 result. No LLM calls. Consistent. Fast. Cheap to run in CI.


The Shape of the Fix

from prompt_eval_rubric import Rubric, Criterion, RubricResult

rubric = Rubric([
    Criterion(
        name="addresses_question",
        weight=0.40,
        check=lambda response, context: context["question"].split()[0].lower() in response.lower(),
    ),
    Criterion(
        name="appropriate_length",
        weight=0.25,
        check=lambda response, _: 50 <= len(response) <= 800,
    ),
    Criterion(
        name="no_ai_preamble",
        weight=0.20,
        check=lambda response, _: not response.startswith(("As an AI", "I'm an AI")),
    ),
    Criterion(
        name="has_specifics",
        weight=0.15,
        check=lambda response, _: any(char.isdigit() or char == '$' for char in response),
    ),
])

result: RubricResult = rubric.score(
    response="The average salary is $95,000 per year according to the 2026 survey.",
    context={"question": "What is the average salary for ML engineers?"},
)

# addresses_question fails here: the question's first word, "what", is not in the response.
print(f"Score: {result.score:.2f}")         # 0.60 (weighted sum of passing criteria)
print(f"Breakdown: {result.per_criterion}") # {"addresses_question": False, "appropriate_length": True, ...}

What It Does NOT Do

prompt-eval-rubric does not evaluate semantic correctness. If the model says "The average salary is $95,000" but the true answer is $120,000, the rubric does not know that. It checks structural quality, not factual accuracy.

It does not replace human evaluation. For high-stakes quality decisions, human review is irreplaceable. The rubric is useful for catching obviously bad responses (too short, missing key terms, wrong format) in CI.

It does not weight criteria automatically. You specify weights. The weights should reflect the relative importance of each criterion for your use case.


Inside the Library

Criterion is a dataclass:

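from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    name: str
    weight: float  # weights must sum to 1.0 across all criteria in a rubric
    check: Callable[[str, dict], bool]  # receives (response, context)
    description: str = ""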

Rubric.score() calls each criterion's check() function and computes the weighted sum:

def score(self, response: str, context: dict = None) -> RubricResult:
    ctx = context or {}
    per_criterion = {}
    total_score = 0.0

    for criterion in self._criteria:
        passed = criterion.check(response, ctx)
        per_criterion[criterion.name] = passed
        if passed:
            total_score += criterion.weight

    return RubricResult(
        score=total_score,
        per_criterion=per_criterion,
        response=response,
    )

The score ranges from 0.0 (no criteria pass) to 1.0 (all criteria pass). A response that passes criteria carrying 75% of the total weight scores 0.75.

Validation: Rubric.__init__() checks that weights sum to approximately 1.0 (within 0.01 tolerance). If they do not, WeightSumError is raised with the actual sum.
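
A minimal sketch of that validation step, assuming nothing about the library's internals beyond the description above. The `WeightSumError` class and the `validate_weights` helper here are local stand-ins written for illustration, not the library's own code.

```python
import math


class WeightSumError(ValueError):
    """Local stand-in for the library's error; the real class may differ."""


def validate_weights(weights, tolerance=0.01):
    """Return the weight total, or raise if it is not within tolerance of 1.0."""
    total = math.fsum(weights)  # fsum limits floating-point accumulation error
    if abs(total - 1.0) > tolerance:
        raise WeightSumError(f"criterion weights sum to {total:.4f}, expected 1.0")
    return total


validate_weights([0.40, 0.25, 0.20, 0.15])  # accepted: the weights sum to 1.0

try:
    validate_weights([0.40, 0.25, 0.20])    # rejected: one criterion is missing
except WeightSumError as exc:
    print(exc)
```

A tolerance is needed because decimal weights such as 0.1 and 0.2 are not exactly representable as floats, so an exact comparison against 1.0 would reject some valid rubrics.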


When to Use It

Use it in CI for regression testing. Define a rubric for your agent's expected output quality. Run the rubric against a set of test prompts with fixed (mocked or recorded) LLM responses. Fail CI if the average score drops below 0.80.

Use it for A/B comparison. Before deploying a prompt change, score 50 responses with the current prompt against 50 responses with the new prompt. If the new average is clearly higher, the change is an improvement as far as the rubric can measure.

Use it for threshold gating. If a response scores below 0.50, route it to a human reviewer rather than returning it to the user.

Skip it for open-ended generation (creative writing, brainstorming). Rule-based rubrics cannot capture creative quality. They work best for fact-based, structured, or task-completion responses.
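
The CI, A/B, and gating uses described in this section can be sketched with three small functions. This is an illustration under assumptions: `score_fn` stands in for something like `lambda r: rubric.score(r).score`, and `toy_score`, the helper names, and the sample responses are all hypothetical.

```python
from statistics import mean


def toy_score(response: str) -> float:
    """Hypothetical scorer: 0.6 for a sensible length, 0.4 for containing a digit."""
    score = 0.0
    if 20 <= len(response) <= 800:
        score += 0.6
    if any(ch.isdigit() for ch in response):
        score += 0.4
    return score


def average_score(score_fn, responses) -> float:
    return mean(score_fn(r) for r in responses)


def ci_gate(score_fn, recorded_responses, minimum=0.80) -> bool:
    """Regression check: True when the average score meets the minimum."""
    return average_score(score_fn, recorded_responses) >= minimum


def ab_delta(score_fn, current, candidate) -> float:
    """Positive when the candidate prompt's responses score higher on average."""
    return average_score(score_fn, candidate) - average_score(score_fn, current)


def needs_human_review(score_fn, response, threshold=0.50) -> bool:
    """Threshold gate: route low-scoring responses to a reviewer."""
    return score_fn(response) < threshold


recorded = [
    "Refunds are processed within 5 business days.",
    "Your plan renews on the 1st of each month.",
    "Please contact support.",
]
print(ci_gate(toy_score, recorded))
print(needs_human_review(toy_score, "ok"))
```

In a test suite, `ci_gate` would sit behind a plain `assert` so the build fails when the average drops.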


Install

pip install git+https://github.com/MukundaKatta/prompt-eval-rubric

For example, a rubric can validate a classifier's output before it is used:

from prompt_eval_rubric import Rubric, Criterion

VALID_CATEGORIES = {"billing", "technical", "account", "other"}

# Support ticket classification rubric
support_rubric = Rubric([
    Criterion("is_one_word", 0.30, lambda r, _: len(r.strip().split()) == 1),
    Criterion("valid_category", 0.50, lambda r, _: r.strip().lower() in VALID_CATEGORIES),
    Criterion("no_punctuation", 0.20, lambda r, _: r.strip().isalpha()),
])

def call_llm(prompt: str) -> str:
    # Placeholder: replace with a call to your own LLM client.
    raise NotImplementedError

def classify_and_validate(message: str) -> str:
    raw = call_llm(f"Classify this support message into one category "
                   f"(billing/technical/account/other): {message}").strip()

    result = support_rubric.score(raw)
    if result.score < 0.80:
        # Invalid classification, use fallback
        return "other"
    return raw.lower()

Sibling Libraries

Library               What it solves
llm-output-validator  Rule-based validation of output shape (pass/fail, not scored)
agentsnap             Snapshot regression: flag when output structure changes
prompt-lint           Static analysis of prompts before deployment
llm-multi-vote        Jury voting for categorical decisions
agent-guard-rails     Output filter pipeline (clean/block, not score)

The quality stack: prompt-lint pre-deployment, prompt-eval-rubric in CI for regression, llm-output-validator at runtime for structural validation, agent-guard-rails for live filtering.


What's Next

Partial-match criteria: a criterion that scores 0.5 instead of 0.0/1.0 for partial passes. "Response contains at least 2 of these 4 required terms" could score 0.5 instead of failing entirely.
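
One way such graded criteria could work, sketched outside the library. The names `fraction_of_terms` and `graded_score` are hypothetical, and the current `Criterion.check` returns a bool, so this does not reflect existing behavior.

```python
def fraction_of_terms(response, required_terms):
    """Fraction of required terms found in the response (case-insensitive)."""
    if not required_terms:
        raise ValueError("required_terms must not be empty")
    text = response.lower()
    hits = sum(term.lower() in text for term in required_terms)
    return hits / len(required_terms)


def graded_score(response, graded_criteria):
    """graded_criteria: (weight, check) pairs where check returns a float in [0, 1]."""
    return sum(weight * check(response) for weight, check in graded_criteria)


terms = ["refund", "invoice", "billing", "receipt"]
criteria = [
    (0.6, lambda r: fraction_of_terms(r, terms)),    # graded criterion
    (0.4, lambda r: 1.0 if len(r) >= 20 else 0.0),   # ordinary pass/fail criterion
]
response = "Your refund will appear on the next invoice."

print(fraction_of_terms(response, terms))  # 2 of 4 terms present
print(graded_score(response, criteria))
```

Pass/fail criteria fit the same scheme by returning 1.0 or 0.0, so a graded rubric would remain a weighted sum bounded by 0.0 and 1.0.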

Rubric serialization: load/save rubrics from YAML or JSON. This would let product managers define quality criteria in config files without writing Python.
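
A rough sketch of what config-driven rubrics might look like, using JSON from the standard library. The schema, the `CHECKS` registry, and the `load_criteria` and `score` helpers are all assumptions made for illustration; the library does not define them today.

```python
import json

# Hypothetical registry: config files refer to checks by name, because a
# Python lambda cannot be stored in JSON.
CHECKS = {
    "max_length": lambda response, limit: len(response) <= limit,
    "contains": lambda response, term: term.lower() in response.lower(),
}

RUBRIC_JSON = """
{
  "criteria": [
    {"name": "short_enough", "weight": 0.4, "check": "max_length", "arg": 200},
    {"name": "mentions_refund", "weight": 0.6, "check": "contains", "arg": "refund"}
  ]
}
"""


def load_criteria(text):
    """Parse the assumed JSON schema into (name, weight, check) tuples."""
    criteria = []
    for item in json.loads(text)["criteria"]:
        base, arg = CHECKS[item["check"]], item["arg"]
        # Bind base and arg as defaults so each closure keeps its own values.
        check = lambda response, _base=base, _arg=arg: _base(response, _arg)
        criteria.append((item["name"], item["weight"], check))
    return criteria


def score(response, criteria):
    """Weighted sum of the criteria that pass."""
    return sum(weight for _, weight, check in criteria if check(response))


criteria = load_criteria(RUBRIC_JSON)
print(score("Your refund was issued today.", criteria))
```

A named registry keeps the config file declarative: whoever edits it chooses among vetted checks and cannot inject arbitrary code.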

Fuzzy string matching for check functions: a built-in helper fuzzy_contains(keyword, threshold=0.8) that handles typos and abbreviations when checking whether a response addresses a keyword.
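
A possible shape for that helper, built on `difflib.SequenceMatcher` from the standard library. This is a sketch under assumptions: it follows the `fuzzy_contains(keyword, threshold=0.8)` signature mentioned above, compares single words only, and tolerates typos better than abbreviations.

```python
from difflib import SequenceMatcher


def fuzzy_contains(keyword, threshold=0.8):
    """Build a check(response, context) that passes if any word resembles keyword."""
    target = keyword.lower()

    def check(response, _context=None):
        for word in response.lower().split():
            word = word.strip(".,;:!?()\"'")  # drop surrounding punctuation
            if SequenceMatcher(None, word, target).ratio() >= threshold:
                return True
        return False

    return check


mentions_salary = fuzzy_contains("salary")
print(mentions_salary("The average salery is high.", {}))  # typo still matches
print(mentions_salary("The weather is nice.", {}))
```

Because the helper returns a `(response, context)` callable, it could be passed directly as a criterion's `check` argument.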


Built as part of the agent-stack family: composable Python primitives for production LLM agents.
