Grading LLM Legal Reasoning With an LLM Judge

#ai #nlp #legalnlp #promptengineering

When you evaluate an LLM on a legal-NLP task, the verdict is the easy part. A model can guess "covered" or "not covered" and be right about half the time on a balanced set without understanding anything. The harder question is whether the reasoning that produced the verdict actually holds up, because a correct label can hide a broken chain of thought behind it.

So you grade the explanation, not just the answer. And since grading free-text reasoning by hand doesn't scale, you hand that job to a second model. Here is a small LLM-as-a-judge built on LegalBench's rule-application rubric that you can drop into an eval loop.

What IRAC Reasoning Looks Like
The Rubric: Two Dimensions
The Schema, and the One Move That Matters
A Worked Example
How the Judge Scores It

What IRAC Reasoning Looks Like

Legal analysis tends to follow a four-step skeleton known as IRAC — Issue, Rule, Application, Conclusion. You name the legal question, state the rule that governs it, apply that rule to the specific facts, and land on a conclusion. The middle two steps carry the weight: a good answer doesn't just recite the rule and announce a verdict, it shows the inference that gets from one to the other.

Take a coverage example. A policy covers damage to belongings caused while they are being removed by professional removal contractors. The claimant's furniture is damaged in a truck accident during a move — but the move is being done by her uncle, a retired professional mover helping out as a favour, not a removal contractor engaged for the job.

Walked through IRAC, that reasons as:

Issue — does the removal clause cover this damage?
Rule — coverage applies only when belongings are damaged while being removed by professional removal contractors.
Application — the uncle is moving the belongings privately, not acting as a professional contractor, so the condition that triggers coverage is not met.
Conclusion — not covered.

This is exactly the structure the rubric scores. Correctness checks that the Rule and the facts are stated accurately and the Conclusion is right; Analysis checks the Application step specifically — whether the reasoning actually performed that fact-to-rule inference instead of restating the rule and jumping to a verdict. That Application step is why LegalBench calls it the rule-application framework.

The Rubric: Two Dimensions

The judge scores each reasoning along two axes.

Correctness asks whether the reasoning is free of five specific error types: misstating the rule, misstating the facts, asserting the wrong outcome, a logic error, or an arithmetic error. It is all-or-nothing — a single error of any kind means the reasoning is not correct.

Analysis asks something subtler: did the reasoning actually connect the facts to the conclusion under the rule, or did it merely restate the rule, the facts, or the verdict? Analysis only counts when the reasoning is both correct and contains genuine inference.

The final number for a system is the average of its correctness rate and its analysis rate across the whole set.
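
As a minimal sketch of that roll-up, assume each judged reasoning has already been reduced to a (correctness, analysis) pair of 0/1 values. The function name system_score and the sample data are illustrative, not part of LegalBench:

```python
from statistics import mean


def system_score(scored):
    """Average of the correctness rate and the analysis rate.

    `scored` is a list of (correctness, analysis) pairs, each value 0 or 1.
    """
    if not scored:
        raise ValueError("need at least one scored reasoning")
    correctness_rate = mean(c for c, _ in scored)
    analysis_rate = mean(a for _, a in scored)
    return (correctness_rate + analysis_rate) / 2


# Illustrative data: four reasonings, three correct, two with real analysis.
scored = [(1, 1), (1, 0), (0, 0), (1, 1)]
print(system_score(scored))  # (0.75 + 0.5) / 2 = 0.625
```

Because analysis can only be 1 when correctness is 1, the analysis rate never exceeds the correctness rate, so a system that restates rules without applying them tops out at half of its correctness rate plus half of nothing.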

The Schema, and the One Move That Matters

First, define what the model has to emit. The trick here is to make the judge score with atomic signals only (the five error flags and one categorical analysis label), plus a few free-text diagnostics that never enter the score, and then derive correctness and analysis yourself in Python:

from typing import Literal
from pydantic import BaseModel, Field

Flag = Literal[0, 1]  # 0 = error absent, 1 = error present

class JudgeVerdict(BaseModel):
    rule_misstatement: Flag = Field(description="Misstates the legal rule or policy text")
    fact_misstatement: Flag = Field(description="Misstates the fact pattern in the claim")
    incorrect_outcome: Flag = Field(description="Asserts an incorrect outcome")
    logic_error: Flag = Field(description="Contains a logical error")
    arithmetic_error: Flag = Field(description="Contains an arithmetic or numerical error")
    analysis_case: Literal[
        "incorrect",
        "correct_but_no_analysis",
        "correct_and_contains_analysis",
    ] = Field(description="Whether the reasoning connects facts to conclusion")
    # Free-text diagnostics for human review; they never feed the derived scores.
    error_types: list[str] = Field(default_factory=list)
    missing_inferences: list[str] = Field(default_factory=list)
    brief_justification: str = ""

    @property
    def correctness(self) -> int:
        return int(
            self.rule_misstatement == 0
            and self.fact_misstatement == 0
            and self.incorrect_outcome == 0
            and self.logic_error == 0
            and self.arithmetic_error == 0
        )

    @property
    def analysis(self) -> int:
        return int(
            self.correctness == 1
            and self.analysis_case == "correct_and_contains_analysis"
        )

It is tempting to ask the model for correctness and analysis directly and be done with it. Don't. Constrained decoding guarantees the shape of the output, not its internal consistency — the model can hand you five zeros and a correctness of 0 in the same breath. Deriving the two scoring variables from the atomic fields makes that contradiction impossible by construction. The Literal[0, 1] flags pull their weight too: they constrain decoding to exactly 0 or 1, so you never have to clean up a stray 2.
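
To see the contradiction concretely, here is a stdlib-only sketch of what you would have to police if the judge reported correctness itself. The payload and the helper names (derived_correctness, is_self_contradictory) are hypothetical, made up for illustration:

```python
ERROR_FLAGS = (
    "rule_misstatement",
    "fact_misstatement",
    "incorrect_outcome",
    "logic_error",
    "arithmetic_error",
)


def derived_correctness(raw: dict) -> int:
    """1 only if every error flag is 0, mirroring the all-or-nothing rule."""
    return int(all(raw[flag] == 0 for flag in ERROR_FLAGS))


def is_self_contradictory(raw: dict) -> bool:
    """True when a model-reported 'correctness' disagrees with its own flags."""
    return raw["correctness"] != derived_correctness(raw)


# Hypothetical judge output: every field has a valid shape and value,
# yet the flags say "no errors" while the summary says "not correct".
raw_output = {
    "rule_misstatement": 0,
    "fact_misstatement": 0,
    "incorrect_outcome": 0,
    "logic_error": 0,
    "arithmetic_error": 0,
    "correctness": 0,
}

print(is_self_contradictory(raw_output))  # True
```

With the derived-property schema above there is no correctness field to disagree with, so this check has nothing left to catch.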

A Worked Example

Here is a concrete claim the system under evaluation would receive:

On 14 March around 8:30 a.m. I was reversing out of a parking bay on Mozartstrasse in Augsburg and clipped the car parked behind me with my rear bumper. The other car's tailgate is dented and its rear light is cracked. The owner came out, we exchanged details, and his garage has since sent me a repair estimate of about €1,400, which he is asking me to pay.

The governing rule is the motor third-party liability inclusion: the insurer indemnifies the policyholder against a third party's claim for property damage caused through use of the insured vehicle.

A candidate reasoning that follows the IRAC structure:

Issue — does the third-party liability cover apply to the damage to the parked car?
Rule — cover applies where, through use of the insured vehicle, a third party's property is damaged and a liability claim is asserted against the policyholder.
Application — the policyholder was using the vehicle (pulling into a parking bay) and struck a third party's parked car, denting the tailgate and cracking the rear light; the owner is now asserting a repair claim of about €1,200. All three conditions — vehicle in use, third-party property damage, liability claim — are satisfied.
Conclusion — the claim is covered under the third-party liability provision.
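
The judge needs all three pieces (the claim, the governing rule, and the candidate reasoning) in one prompt. This post does not reproduce its judge prompt, so the wording below is a hypothetical sketch, not LegalBench's official text; build_judge_prompt and its section labels are assumptions chosen for illustration:

```python
ERROR_FLAGS = (
    "rule_misstatement",
    "fact_misstatement",
    "incorrect_outcome",
    "logic_error",
    "arithmetic_error",
)


def build_judge_prompt(claim: str, rule: str, reasoning: str) -> str:
    """Assemble one judge prompt from the claim, the rule, and the reasoning.

    The instructions are illustrative wording, not an official rubric text.
    """
    flag_lines = "\n".join(f"- {name}" for name in ERROR_FLAGS)
    return (
        "You are grading a piece of legal reasoning.\n\n"
        f"CLAIM:\n{claim.strip()}\n\n"
        f"RULE:\n{rule.strip()}\n\n"
        f"REASONING TO GRADE:\n{reasoning.strip()}\n\n"
        "Set each flag to 1 if the error is present, otherwise 0:\n"
        f"{flag_lines}\n\n"
        "Then choose analysis_case: incorrect, correct_but_no_analysis, "
        "or correct_and_contains_analysis.\n"
        "Check every fact in the reasoning against the claim text."
    )


prompt = build_judge_prompt(
    claim="I was reversing out of a parking bay and clipped the car behind me.",
    rule="The insurer indemnifies third-party property damage caused through use of the insured vehicle.",
    reasoning="The policyholder was pulling into a parking bay and struck a parked car, so the claim is covered.",
)
print(prompt)
```

Putting the original claim next to the reasoning matters here: the judge can only flag a misstated fact if it sees the source the fact was taken from.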

How the Judge Scores It

The reasoning lands on the right verdict and even walks the Application step properly, but it slips on two details. The driver was reversing out of the bay, not pulling into it, and the repair claim was about €1,400, not €1,200. Neither slip changes the outcome, yet each is an error, and correctness is all-or-nothing: one error sinks it. Once correctness is 0, the only consistent analysis_case is "incorrect", and analysis derives to 0 no matter how cleanly the reasoning was structured, because the analysis property checks correctness first. The judge returns:

verdict = JudgeVerdict(
    rule_misstatement=0,
    fact_misstatement=1,   # says "pulling into" — the driver was reversing out
    incorrect_outcome=0,   # the verdict ("covered") is still right
    logic_error=0,
    arithmetic_error=1,    # cites €1,200 — the claim was about €1,400
    analysis_case="incorrect",
    error_types=["fact_misstatement", "arithmetic_error"],
    missing_inferences=[],
    brief_justification=(
        "Reaches the correct outcome, but misstates the manoeuvre and the "
        "claimed repair amount; both count as errors under correctness."
    ),
)

verdict.correctness  # -> 0  (two error flags are set)
verdict.analysis     # -> 0  (correctness is 0, so analysis is 0 whatever analysis_case says)

This is exactly the failure a bare label would hide: the system got the answer right while getting the case wrong. Fix the two slips and the same reasoning scores correctness 1 and — because the Application step is intact — analysis 1. And note what the model never gets to do: it reports the atomic signals, but the roll-up to correctness and analysis is computed, so it can't accidentally call a two-error reasoning "correct."