James Lee

Posted on Jun 18

Part 4 — High Semantic Similarity Correct Business Conclusion: A Three-Layer Judgment Engine from Retrieval to Quantifiable Decisions

#rag #ai #llm #architecture

This article covers the fourth layer of the full-stack architecture: the Judgment Engine. Core engineering challenge: retrieval is responsible for "finding relevant content" — but a business conclusion requires "element completeness verification + quantified scoring + auditable output." Vector retrieval can't do any of those three things.

📦 Source code: production-rag-engineering — esg/services/generation_service.py

0. The Pain Point

After Part 3's retrieval layer was in place, the system could accurately surface relevant content. But the first version's judgment logic was:

Feed the retrieved content directly to GPT-4 and let the model decide "compliant or not."

Two problems emerged simultaneously:

Problem 1: Cost spiraled out of control.
GRI has 58 core rules. One GPT-4 call per rule. Cost per judgment: $0.58. Half a dollar per report. Completely unsustainable at scale.

Problem 2: Conclusions had no actionable direction.
The model output "Not Met" — but the company had no idea what was missing. Was it the emissions total? The calculation method? The data source? A qualitative conclusion with no breakdown means no remediation direction. Correction cycles stretched to 3 months.

This dilemma appears in any rule-intensive judgment scenario: pure model judgment is expensive, pure rule judgment has low accuracy, and neither alone is sufficient.

The solution is to add a judgment engine between retrieval and the model — not to replace the model, but to ensure the model only handles what it's actually good at.

1. Three Gaps Between Retrieval and Decision

First, let's be precise about the problem: why does a solid retrieval layer still need a separate judgment engine?

Gap 1: Semantic similarity ≠ element completeness

GRI 305-1 requires disclosure of three elements: total emissions + calculation method + data source.

A company report states: "Scope 1 emissions in 2023: 5,000 tonnes." Vector similarity: 0.88. The retrieval layer considers this a hit.

But only the first element is satisfied. Calculation method and data source are both absent.

Similarity 0.88 does not equal clause compliance. The retrieval layer can only determine "content is relevant" — it cannot determine "elements are complete."

Gap 2: Qualitative conclusion ≠ actionable remediation

"Not Met" is a qualitative conclusion. A company receiving this conclusion doesn't know:

Which element is missing?
How much is missing?
What exactly needs to be added?

Without a quantified score and a breakdown of missing items, the remediation direction is completely unclear.

Gap 3: Single model ≠ scenario-appropriate

A single report contains three completely different judgment requirements:

Employee compensation disclosure → sensitive data, cannot leave the premises
Scope 3 emissions (11 categories) → complex logic, requires high accuracy
Audit scenario → must output reasoning process to support review

Using one model for all three scenarios means either data compliance risk, cost overrun, or an audit trail that doesn't exist.

These three gaps define exactly what the judgment engine needs to do: element completeness verification, quantified scoring, and scenario-adaptive output.

2. The Three-Layer Progressive Judgment Engine

The core logic of the three-layer design: if rules can solve it, don't call a model; if a model must be involved, route by scenario; after the model outputs, use NER for structured verification.

Retrieval results (Top 3 relevant chunks)
        ↓
Layer 1 — Rule Engine (fast filtering, no model calls)
        ↓
Layer 2 — Multi-model routing (select model by scenario)
        ↓
Layer 3 — NER element verification (structured validation, precise gap identification)
        ↓
Judgment conclusion (Fully Met / Partially Met / Not Met) + missing element breakdown

Layer 1: Rule Engine — Filter 60% of Obvious Non-Compliance Cases

The rule engine's design principle: only handle "obvious" cases — no fuzzy judgment.

def rule_engine_filter(retrieval_results: list, clause: dict) -> dict | None:
    """
    Returns None: case needs to proceed to Layer 2
    Returns conclusion: rule engine makes a direct determination
    """
    max_similarity = max(r["similarity_score"] for r in retrieval_results)

    # Rule 1: No relevant content → directly Not Met
    if max_similarity < 0.5:
        return {
            "result": "Not Met",
            "reason": "No relevant content found in report",
            "confidence": 0.95
        }

    # Rule 2: Missing 2+ core elements → directly Not Met
    required_elements = clause["required_elements"]
    found_elements = extract_elements(retrieval_results, required_elements)
    missing_count = len(required_elements) - len(found_elements)

    if missing_count >= 2:
        return {
            "result": "Not Met",
            "reason": f"Missing core elements: {set(required_elements) - set(found_elements)}",
            "confidence": 0.90
        }

    # Rule 3: Content present but 1 element missing → Partially Met, record missing item
    if missing_count == 1:
        missing = list(set(required_elements) - set(found_elements))
        return {
            "result": "Partially Met",
            "missing_elements": missing,
            "confidence": 0.85
        }

    # All other cases proceed to Layer 2
    return None

The rule engine filters out 60% of obvious non-compliance cases — none of these require a model call.

Layer 2: Multi-Model Routing — Select Model by Scenario

Cases that pass the rule engine are routed to different models based on scenario:

Scenario	Model	Selection rationale
Privacy-sensitive (employee compensation, client data)	Local Llama3-70B	Data stays on-premises; satisfies privacy compliance
Complex logic (Scope 3 emissions, 11 categories)	GPT-4	95% accuracy, 90% logical clarity
Audit scenario (reasoning process required)	DeepSeek	Outputs full reasoning chain; supports audit review

def route_to_model(clause: dict, context: dict) -> str:
    # Scenario 1: Privacy-sensitive
    if clause.get("privacy_sensitive"):
        return "llama3_local"

    # Scenario 2: Complex logic (many required elements or cross-chapter)
    if len(clause["required_elements"]) >= 4 or context.get("cross_chapter"):
        return "gpt4"

    # Scenario 3: Audit mode (reasoning process required)
    if context.get("audit_mode"):
        return "deepseek"

    # Default: GPT-4
    return "gpt4"

When multiple models produce conflicting results, GPT-4 takes precedence (highest accuracy). The conflict itself is recorded in metadata for human review reference.

Layer 3: NER Element Verification — Precisely Locate Missing Items

After the model produces a judgment, NER performs structured verification — not re-judging, but converting the model's qualitative conclusion into a precise, element-level breakdown of what's missing:

def ner_element_check(text: str, clause: dict) -> dict:
    """
    Use NER to extract structured elements from the report
    Precisely identify which elements are present and which are missing
    """
    required_elements = clause["required_elements"]
    element_patterns = clause["element_patterns"]  # recognition patterns per element

    found = {}
    missing = []

    for element, patterns in element_patterns.items():
        # NER + regex dual recognition
        extracted = extract_with_ner(text, patterns)
        if extracted:
            found[element] = extracted
        else:
            missing.append(element)

    return {"found": found, "missing": missing}

Real case — GRI 306-3 spill disclosure:

Report text: "Two spill incidents occurred in 2023. Emergency response measures were taken."

NER element verification result:

✅ Spill count: 2 incidents
❌ Spill volume: missing
✅ Response measures: emergency response measures taken

Determination: Partially Met — missing "spill volume."

This missing item feeds directly into the report's remediation recommendations: add "spill volume" (e.g., "total spill volume: 50 litres"). The remediation direction is immediately clear.

Three-layer engine results: manual review rate 100% → 15%, cost per judgment $0.58 → $0.23.

3. Quantified Scoring: From Qualitative Conclusion to Actionable Score

The three-layer engine produces a qualitative conclusion — "Fully Met / Partially Met / Not Met." But what companies need is "how far off, and what to prioritize."

Three-dimension scoring system (0–100 points):

Dimension	Weight	What it measures
Retrieval match score	40%	Semantic similarity between report text and rule
Element completeness	40%	Whether all required disclosure elements are present
Terminology accuracy	20%	Whether standard terms are used (e.g., "Scope 1" vs. "direct emissions")

Rule weight stratification:

Not all rules carry equal weight. Core rules (e.g., GRI 305-1 greenhouse gas emissions) carry 30% weight; standard rules (e.g., GRI 302-5 energy efficiency measures) carry 10%. This stratification ensures that failing a high-risk rule has a proportionally larger impact on the total score.

Full calculation example — GRI 305-1 (fully compliant):

$$\text{Retrieval match: } 0.92 \times 40 = 36.8$$

$$\text{Element completeness: } \frac{3}{3} \times 40 = 40.0$$

$$\text{Terminology accuracy: } \frac{20}{20} = 20.0$$

$$\text{Total: } 36.8 + 40.0 + 20.0 = 96.8 \text{ pts} \rightarrow \textbf{Fully Met}$$

Full calculation example — GRI 306-3 (missing spill volume):

$$\text{Retrieval match: } 0.81 \times 40 = 32.4$$

$$\text{Element completeness: } \frac{2}{3} \times 40 = 26.7$$

$$\text{Terminology accuracy: } \frac{13}{20} = 13.0$$

$$\text{Total: } 32.4 + 26.7 + 13.0 = 72.1 \text{ pts} \rightarrow \textbf{Partially Met}$$

Three score tiers:

Score range	Grade	Meaning
> 85	Fully Met	All elements present, terminology accurate
70–85	Partially Met	Content present, but missing key elements or imprecise terminology
< 70	Not Met	No relevant content, or 2+ key elements missing

The value of the score isn't just classification — it's letting companies know which clauses to prioritize for remediation. A clause scoring 72 is more urgent than one scoring 78. Without scores, both are just "Partially Met" with no clear priority.

4. Prompt Engineering: Getting the Model to Output Auditable Conclusions

The quality of the model's judgment directly determines Layer 2 accuracy. The prompt design went through three rounds of optimization.

Five-section prompt structure:

[Role definition]
You are an ESG compliance audit expert with deep expertise in GRI standards.
Your task is to determine whether corporate report content satisfies clause requirements.

[Task description]
Evaluate whether the following report content satisfies the GRI clause requirement.
Output: conclusion, satisfied elements, missing elements, and confidence score.

[Rule requirements]
Clause ID: GRI 305-1
Required elements: total emissions / calculation method / data source
Importance: core clause (weight 30%)

[Report content]
(Top 3 retrieved paragraphs, each with page number and chunk_id)

[Output format]
{
  "result": "Fully Met / Partially Met / Not Met",
  "found_elements": [...],
  "missing_elements": [...],
  "evidence": "quoted original text + page number",
  "confidence": 0.0–1.0
}

Three rounds of optimization:

Round 1: Added vague language handling rule (accuracy 70% → 92%)

Problem: a report stated "some suppliers have completed ESG assessments." The model returned "Not Met" (no specific numbers). But this is "Partially Met," not "Not Met."

Fix — added rule to prompt:

"If the report uses vague language (e.g., 'some,' 'a portion,' 'most'), classify as Partially Met rather than Not Met, and note in missing elements: 'specific figures/percentages required.'"

Round 2: Added cross-chapter verification rule (accuracy 75% → 88%)

Problem: Scope 3 emissions data across 11 categories was split between the environmental chapter and the data appendix. A single retrieval only surfaced one chapter. The model returned "incomplete."

Fix — added rule to prompt:

"If the same element appears across multiple chapters, evaluate holistically. Do not mark an element as missing because a single paragraph is incomplete."

Round 3: Added confidence score output (manual review efficiency +50%)

Original prompt only output a conclusion — no confidence score. Reviewers had no way to know which conclusions needed closer scrutiny.

Fix: require the model to output a confidence score from 0–1. Conclusions with confidence < 0.8 are automatically flagged as "pending human review." Reviewers only need to check this subset.

Why Few-shot instead of CoT?

We tested Chain-of-Thought. The conclusion:

CoT accuracy was only 2% higher (92% → 94%)
But token consumption increased 30%, cost increased 30%
The task is fundamentally "does this element exist" — it doesn't require complex reasoning chains

Few-shot provides 3 reference examples (one each for Fully Met / Partially Met / Not Met). The model matches against examples and outputs. Accuracy: 92%. Cost: controlled.

5. Auditable Reports: Conclusions Must Be Challengeable

The final output of the judgment engine isn't a number — it's a report that can be challenged and traced back to its source.

Rule coverage formula:

$$\text{Rule Coverage} = \frac{\text{Fully Met} + 0.5 \times \text{Partially Met}}{\text{Total Rules}}$$

Example with 58 core rules:

$$\frac{45 + 0.5 \times 8}{58} = \frac{49}{58} \approx 84.5\%$$

Why does Partially Met count as 0.5 rather than 0 or 1? Because Partially Met means "content exists, direction is right, but incomplete." Counting it as 0 undervalues the company's work; counting it as 1 overstates compliance. 0.5 is the quantification of "partial."

Four components of the compliance report:

ESG Compliance Assessment Report
├─ Rule-by-rule breakdown (score + compliance grade per clause)
│   ├─ GRI 305-1: 96.8 pts — Fully Met
│   ├─ GRI 306-3: 72.1 pts — Partially Met, missing "spill volume"
│   └─ GRI 401-1: 35 pts — Not Met, no relevant content found
│
├─ Missing element breakdown (element-level precision)
│   ├─ GRI 306-3: missing "spill volume"
│   │   Recommendation: add spill volume (e.g., "total spill volume: 50 litres")
│   └─ GRI 305-3: missing "emissions source classification"
│       Recommendation: add source breakdown (business travel / employee commuting / waste transport)
│
├─ Industry benchmark comparison
│   ├─ This company: 84.5%
│   ├─ Industry average: 78% (+6.5%)
│   └─ Industry leader: 92% (-7.5%, primary gap: Scope 3 disclosure)
│
└─ Traceability (every conclusion traceable to source text)
    ├─ GRI 305-1: source — 2023 Annual Report p.45 para.3, chunk_id=chunk_245
    └─ GRI 306-3: source — 2023 Annual Report p.52 para.1, chunk_id=chunk_252

Traceability record format:

traceability_record = {
    "clause_id": "GRI-306-3",
    "result": "Partially Met",
    "score": 72.1,
    "missing_elements": ["spill volume"],
    "evidence": {
        "chunk_id": "chunk_252",
        "page_range": "52",
        "original_text": "Two spill incidents occurred in 2023. Emergency response measures were taken.",
        "similarity_score": 0.81
    }
}

When a company challenges "we clearly disclosed this — why Partially Met?", the complete evidence chain is available in under 5 minutes: chunk_id → original paragraph → NER element verification result → missing item explanation.

6. Cost Control

The cost optimization logic of the three-layer engine is straightforward: let expensive models only handle what they must handle.

All 58 rules enter Layer 1
        ↓
Rule engine filters 60% → 35 rules directly determined, no model calls
        ↓
Remaining 23 rules enter Layer 2 → model called
        ↓
Model calls: 58 → 23 (60% reduction)
Cost per judgment: $0.58 → $0.23

Manual review cost:

After Layer 3 outputs confidence scores, reviewers only need to check conclusions with confidence < 0.8 (approximately 20% of cases — ~12 out of 58 rules).

Before optimization: 100% manual review, 2 hours per report
After optimization: 20% manual review, ~25 minutes per report
Review efficiency improved by 50%

Overall cost comparison:

Approach	Cost per judgment	Accuracy	Manual review rate
Pure model judgment	$0.58	95%	100%
Pure rule judgment	~$0	85%	100%
Three-layer engine (rules + model + NER)	$0.23	95%	15%

60% cost reduction. Accuracy maintained at 95%. Manual review rate dropped from 100% to 15%.

7. Wrapping Up: The Judgment Engine Decision Tree

When facing a new "retrieval → decision" scenario, three questions determine the layer structure:

Q1: Is there relevant content? (similarity < 0.5)
  └─ No → Rule engine directly returns Not Met. No model call.

Q2: Are elements complete? (2+ core elements missing)
  └─ No → Rule engine directly returns Not Met or Partially Met.
           Record missing items.

Q3: Is the scenario sensitive?
  ├─ Privacy-sensitive → Local model, data stays on-premises
  ├─ Complex logic     → GPT-4, accuracy is the priority
  └─ Audit scenario   → DeepSeek, output full reasoning process

Transferability of this three-layer engine:

The core logic — "rule filtering + model routing + structured verification" — is independent of the specific business domain:

Legal document matching: rule engine filters irrelevant statutes, GPT-4 handles complex legal logic, NER extracts key legal elements
Financial compliance review: rule engine filters obvious non-compliance, local model handles sensitive financial data, NER verifies disclosure elements
Medical diagnostic assistance: rule engine filters irrelevant symptoms, specialized model handles complex clinical cases, NER extracts key diagnostic elements

As long as the scenario fits the structure of "retrieval → rule verification → quantified conclusion," this engine transfers directly. The only things you replace are the rule library and element definitions.

Source Code

All implementations referenced in this article are available here:

👉 github.com/muzinan123/production-rag-engineering

Relevant files for this part:

esg/services/generation_service.py — multi-model routing engine (Llama2 / GPT-4 / DeepSeek)

Next up: The judgment engine has produced its conclusions. But what happens when a conclusion is challenged? "We clearly disclosed this — why Partially Met?" Can the system produce a complete evidence chain in under 5 minutes? This isn't a question about judgment logic — it's a question about whether the system has observability infrastructure. → Part 5 — Full-Chain Traceability

DEV Community