In high-stakes domains like biomedical research or legal discovery, a hallucination isn't just a UX glitch—it's a liability.
Most RAG (Retrieval-Augmented Generation) architectures are designed to be helpful "people-pleasers." If they can't find the exact answer, they often synthesize a plausible one from the model's latent space using inductive prediction (predicting the next likely word).
At Flamehaven, we are building LOGOS, a reasoning engine with a "Strict Evidence" policy. We designed it to fail loudly when data is insufficient.
Here is the engineering breakdown of how we implemented Abductive Reasoning with a Zero-Slop Gate, avoiding "generative magic" in favor of strict software constraints.
The Core Problem: "Plausible" is not "True"
We found that standard RAG pipelines would often take a query like "Link protein A to symptom B" and generate a generic, medically sound sentence that wasn't actually in the source text.
To fix this, we moved from semantic similarity to evidence atomization.
1. Stop Treating Text as Strings (Evidence Atomization)
The first mistake in many RAG systems is passing raw strings to the context window. We don't do that. We treat evidence as immutable data structures with stable IDs.
In our module missing_link/evidence.py, we implement Evidence Atomization. Inputs are split into tracked spans. If a hypothesis cannot be traced back to a specific EvidenceSpan ID ($S_1, S_2...$), the system rejects it.
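As a rough illustration of what atomization means in practice, here is a simplified sketch (the function name and the naive sentence split are assumptions for this post; the real logic in missing_link/evidence.py is more involved):

# Hypothetical sketch -- a simplified stand-in for missing_link/evidence.py
from typing import Any, Dict, List

def atomize(raw_text: str, source_id: str) -> List[Dict[str, Any]]:
    """Split raw input into tracked spans with stable IDs (S1, S2, ...)."""
    sentences = [s.strip() for s in raw_text.split(".") if s.strip()]
    return [
        {"sid": f"S{i + 1}", "text": sentence, "source": source_id}
        for i, sentence in enumerate(sentences)
    ]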
Here is the conceptual structure of our context bundle:
# [Source: missing_link/runner.py]
from dataclasses import dataclass
from typing import List, Dict, Any


@dataclass(frozen=True)
class EvidenceBundle:
    """
    Immutable container for the reasoning context.
    'evidence_spans' are atomized facts (e.g., S1, S2) that must be cited.
    """
    query: str
    domain: str
    evidence_spans: List[Dict[str, Any]]  # Normalized spans with 'sid'
    declared_intent: str


@dataclass
class HypothesisCandidate:
    """
    The output structure. Note 'supporting_spans':
    we don't just return text; we return the specific IDs used to construct it.
    """
    hypothesis: str
    supporting_spans: List[str]  # e.g., ["S1", "S4"]
    novelty_score: float
    plausibility_score: float
    # If supporting_spans is empty or overlap < 90%, this candidate is dropped.
By enforcing this structure, the model cannot "invent" a fact without failing the validation layer immediately.
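For clarity, here is a minimal sketch of what that validation check can look like (the helper name, the whitespace tokenization, and the exact overlap handling are assumptions for illustration, not the production LOGOS code):

# Hypothetical validation helper -- illustrative only, not the real LOGOS check.
def is_grounded(candidate: HypothesisCandidate,
                bundle: EvidenceBundle,
                min_overlap: float = 0.9) -> bool:
    """Drop candidates with no citations, unknown span IDs, or weak overlap."""
    known = {span["sid"]: span.get("text", "") for span in bundle.evidence_spans}
    if not candidate.supporting_spans:
        return False  # no citations at all -> dropped
    if any(sid not in known for sid in candidate.supporting_spans):
        return False  # cites a span that does not exist -> dropped
    cited_text = " ".join(known[sid] for sid in candidate.supporting_spans)
    hyp_tokens = set(candidate.hypothesis.lower().split())
    span_tokens = set(cited_text.lower().split())
    if not hyp_tokens:
        return False
    overlap = len(hyp_tokens & span_tokens) / len(hyp_tokens)
    return overlap >= min_overlap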
2. The "Slop Gate": Rejecting Noise Early
Before we burn expensive GPU cycles on inference, we run a deterministic quality filter called the Slop Gate.
Garbage In = Garbage Out. If the input data is full of buzzwords or repetitive scraping errors, no amount of reasoning will save it. We implemented a hard filter in runner.py that acts as a circuit breaker.
The Architecture
We visualize this process as a pre-inference firewall: the evidence bundle has to clear the gate before a single token is generated.
The Code Implementation
Here is a snippet of the detection logic:
# [Source: missing_link/runner.py]
from collections import Counter
from typing import Tuple


def _slop_gate(self, bundle: EvidenceBundle) -> Tuple[bool, str]:
    """
    Pre-inference firewall. Rejects inputs that look like SEO spam or noise.
    """
    # 1. Tokenize and clean
    content = " ".join(span.get("text", "") for span in bundle.evidence_spans)
    tokens = self._tokenize(content)
    if not tokens:
        # Nothing to reason over: reject empty input outright
        return False, "suspicious"

    # 2. Check for 'slop' (repetitive loops common in scraped data)
    token_counts = Counter(tokens)
    most_common = token_counts.most_common(1)[0][1]
    repeat_ratio = most_common / len(tokens)
    if repeat_ratio > self.gate_cfg.max_repeat_ratio:
        # Loop detected: abort immediately
        return False, "suspicious"

    # 3. Buzzword check (domain-specific lists)
    # ... logic to calculate buzzword_ratio ...
    if buzzword_ratio > self.gate_cfg.max_buzzword_ratio:
        return False, "suspicious"

    return True, "clean"
If the gate returns False, the pipeline aborts. We prefer a hard stop over a bad output.
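Inside the runner, that hard stop looks roughly like this (the response payload below is a simplified assumption, not our actual API):

# Hypothetical caller -- payload keys are illustrative only.
ok, verdict = self._slop_gate(bundle)
if not ok:
    # Abort before any model call is made; no partial output is returned.
    return {"status": "rejected", "reason": verdict, "hypotheses": []}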
3. The Verification Loop (The Omega Score)
Instead of standard Inductive Prediction (predicting the next token), we use Abductive Reasoning (inferring the most likely cause given observations).
But Abduction can be overly creative. To rein it in, we use a composite metric called the Omega Score.
It balances two opposing forces:
- Grounding: Can this hypothesis be mapped to existing evidence spans ($S_1, S_2...$) with >90% token overlap?
- Novelty: Is this a new logical connection, or just a summary of the input?
We optimize for High Grounding + High Novelty.
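As a rough sketch of how such a composite can be wired up (the weights, the hard grounding floor, and the function name are assumptions for this post, not the production formula):

# Illustrative only -- the real Omega Score weighting is not shown here.
def omega_score(grounding: float, novelty: float,
                min_grounding: float = 0.9, alpha: float = 0.6) -> float:
    """Composite of grounding (overlap with cited spans) and novelty."""
    if grounding < min_grounding:
        return 0.0  # ungrounded hypotheses are dropped, not merely discounted
    return alpha * grounding + (1.0 - alpha) * novelty

A candidate that merely summarizes the input scores high on grounding but near zero on novelty; a creative leap with no span support never makes it past the floor.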
Summary: Moving to "Audit-Ready" AI
We are trying to move away from "Generative AI" towards "Verifiable Reasoning."
It can be frustrating when the system returns status: tenuous and refuses to answer a vague query. But in B2B contexts, that frustration builds trust. The user knows that if the system does speak, it has the receipts (Evidence Spans) to back it up.
If you are working on hallucination detection, grounding metrics, or refusal-aware architectures, I'd love to hear how you handle the "Novelty vs. Grounding" trade-off in the comments.
The code snippets above are from the missing_link module of Flamehaven-LOGOS, currently under active development for biomedical and legal discovery applications.

