In the last post, we gave our forensic system "Eyes" using local Multimodal Vision. We successfully extracted a mysterious handwritten inscription from a first edition of The Great Gatsby without a single pixel leaving our local network.
But perception is only half the battle. To turn that raw text into a forensic verdict, we often need the "High Reasoning" capabilities of frontier cloud models like Claude 3.5 or GPT-4o. This creates a Privacy Paradox: How do we send the context of a finding to the cloud without leaking the Personally Identifiable Information (PII) contained within it?
Today, we implement the Sovereign Redactor—a precision-guided airlock that scrubs sensitive entities at the edge before they hit the egress pipe.
The Problem: NLP Over-redaction
Traditional redaction is a blunt instrument. A simple regex pass cannot tell one capitalized name from another, and a basic NER (Named Entity Recognition) model will redact the author "F. Scott Fitzgerald" or the publisher "Scribner’s" simply because it tags them as PERSON or ORGANIZATION.
In rare book forensics, for example, the author’s name isn't PII—it’s primary metadata. If we redact the subject of the audit, the cloud-based reasoning agent becomes useless. We need a system that can distinguish between Metadata (to keep) and PII (to hide).
The Stack: Microsoft Presidio + spaCy
To solve this, we integrated Microsoft Presidio. Unlike a standard regex, Presidio allows us to define a complex pipeline of "Recognizers" and "Anonymizers."
We use spaCy’s en_core_web_lg (Large) model as the underlying NLP engine. Its context-sensitive entity recognition gives the Redactor a better chance of telling "Gatsby" in a book title, which should stay, from "Gatsby" used as a person's name in a private letter, which might need to go.
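As a minimal sketch of how this stack might be wired together, the snippet below builds a Presidio analyzer on top of spaCy's large English model. The `NLP_CONFIG` name and the `build_analyzer` helper are our own illustrative choices, not code from the repository, and the sketch assumes `presidio-analyzer` and the `en_core_web_lg` model are installed.

```python
# Configuration understood by Presidio's NlpEngineProvider: use spaCy and
# map English ("en") to the large English model.
NLP_CONFIG = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_lg"}],
}


def build_analyzer():
    """Create a Presidio AnalyzerEngine backed by spaCy (illustrative helper)."""
    # Imported inside the function so the heavy dependencies are only
    # required when an analyzer is actually built.
    from presidio_analyzer import AnalyzerEngine
    from presidio_analyzer.nlp_engine import NlpEngineProvider

    nlp_engine = NlpEngineProvider(nlp_configuration=NLP_CONFIG).create_engine()
    return AnalyzerEngine(nlp_engine=nlp_engine, supported_languages=["en"])
```

The model itself can be fetched beforehand with `python -m spacy download en_core_web_lg`.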
The Architecture: Secure by Default
The Redactor is built on a "Secure by Default" philosophy. In our orchestrator, we don't ask if a provider is "dangerous." We ask if a provider is Local.
If the provider is ollama or none, the data stays raw. If the provider is anything else (Anthropic, OpenAI, etc.), the Sovereign Vault Airlock engages automatically.
The Precision Shield: How the Sovereign Redactor intercepts sensitive PII at the edge while allowing critical metadata to pass through for cloud-based reasoning.
    # The Sovereign Egress Guard
    LOCAL_PROVIDERS = {'ollama', 'none'}

    if provider not in LOCAL_PROVIDERS:
        # Engage the Airlock
        scrubbed_text, count = redactor.scrub(
            text=visual_findings,
            allow_list=metadata_allow_list
        )
        logger.info(f"🛡️ Sovereign Vault: {count} entities redacted from egress.")
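For readers wondering what might sit behind `redactor.scrub`, here is one way such a method could look. This is a hypothetical sketch, not the repository's `SovereignRedactor`: the class name `RedactorSketch` is ours, and the analyzer and anonymizer are injected so that any objects exposing Presidio-style `analyze(...)` and `anonymize(...)` methods will work. The keyword arguments mirror Presidio's `AnalyzerEngine.analyze` and `AnonymizerEngine.anonymize`.

```python
from typing import Any, Optional, Sequence, Tuple


class RedactorSketch:
    """Illustrative wrapper around a Presidio-style analyzer and anonymizer."""

    def __init__(self, analyzer: Any, anonymizer: Any, language: str = "en") -> None:
        self._analyzer = analyzer
        self._anonymizer = anonymizer
        self._language = language

    def scrub(
        self, text: str, allow_list: Optional[Sequence[str]] = None
    ) -> Tuple[str, int]:
        """Return (scrubbed_text, number_of_detections)."""
        # Step 1: detect entities, skipping anything on the allow-list.
        results = self._analyzer.analyze(
            text=text,
            language=self._language,
            allow_list=list(allow_list or []),
        )
        if not results:
            # Nothing sensitive was found, so the text passes through unchanged.
            return text, 0
        # Step 2: replace every detected span with a placeholder.
        anonymized = self._anonymizer.anonymize(text=text, analyzer_results=results)
        return anonymized.text, len(results)
```

With real engines this could be constructed as `RedactorSketch(AnalyzerEngine(), AnonymizerEngine())`. Note that the count here is the number of analyzer detections, which may be higher than the number of replaced spans when detections overlap.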
The "Precision Shield": Using Allow-lists
To prevent the "Fitzgerald" problem, we implement a Precision-Guided Allow-list. Before the Redactor scans the text, the orchestrator dynamically builds a list of "safe" words based on the Master Bibliography:
- The Book Title
- The Author’s Name
- The Publisher’s Name
These entities are passed to the Redactor as an allow_list, instructing Presidio to ignore them even if it’s 99% sure they are PERSON or ORGANIZATION entities.
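The allow-list construction described above can be sketched in a few lines. The `BibliographyEntry` type and `build_allow_list` function are hypothetical names for illustration; we are assuming the Master Bibliography exposes a title, an author, and a publisher as plain strings.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class BibliographyEntry:
    """Hypothetical shape of one Master Bibliography record."""
    title: str
    author: str
    publisher: str


def build_allow_list(entry: BibliographyEntry) -> list[str]:
    """Collect the bibliographic strings that must survive redaction."""
    allow_list: list[str] = []
    seen: set[str] = set()
    for value in (entry.title, entry.author, entry.publisher):
        cleaned = value.strip()
        # Skip empty fields and duplicates while keeping a stable order.
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            allow_list.append(cleaned)
    return allow_list


gatsby = BibliographyEntry(
    title="The Great Gatsby",
    author="F. Scott Fitzgerald",
    publisher="Scribner's",
)
metadata_allow_list = build_allow_list(gatsby)
```

To our knowledge, Presidio compares allow-list entries against the detected span text exactly by default, so variants such as a bare "Fitzgerald" would need their own entries. That behavior is worth verifying against the installed Presidio version.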
Resiliency: The "Safe-Fail" Pattern
One of the biggest challenges with local NLP is the resource cost. Loading a spaCy model of roughly 500MB is expensive in both memory and startup time.
We implemented a Sentinel-based Lazy Loading pattern: the Redactor loads only when it is first needed. If the model fails to load (e.g., because of missing dependencies), the audit doesn't crash. Instead, the Redactor marks itself as _REDACTOR_DISABLED, logs a critical warning for the human auditor, and "fails open", letting the audit continue without redaction to preserve forensic continuity.
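A minimal sketch of this safe-fail pattern follows. Only the `_REDACTOR_DISABLED` name comes from the description above; the `LazyRedactor` class, the `_NOT_LOADED` sentinel, the logger name, and the injected `loader` callable are our assumptions about how such a pattern could be structured.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger("sovereign_vault")

_NOT_LOADED = object()         # sentinel: no load attempted yet
_REDACTOR_DISABLED = object()  # sentinel: load was attempted and failed


class LazyRedactor:
    """Loads the expensive redactor on first use and disables itself on failure."""

    def __init__(self, loader: Callable[[], Any]) -> None:
        self._loader = loader
        self._state: Any = _NOT_LOADED

    @property
    def disabled(self) -> bool:
        return self._state is _REDACTOR_DISABLED

    def get(self) -> Optional[Any]:
        """Return the redactor, or None if it could not be loaded."""
        if self._state is _REDACTOR_DISABLED:
            return None  # do not retry a load that already failed
        if self._state is _NOT_LOADED:
            try:
                self._state = self._loader()
            except Exception:
                # Safe-fail: record the failure loudly instead of crashing.
                logger.critical(
                    "Redactor failed to load; egress will NOT be scrubbed.",
                    exc_info=True,
                )
                self._state = _REDACTOR_DISABLED
                return None
        return self._state
```

A caller treats a `None` result as "redaction unavailable" and proceeds with the raw text, which is the fail-open behavior described above. A fail-closed variant would instead refuse cloud egress whenever `get()` returns `None`.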
"In a forensic system, a hard crash is a loss of data. A safe-fail is a managed risk."
The Result: Privacy-Preserving Reasoning
When we ran the Gatsby audit, the local Vision Agent found a handwritten note. The Redactor identified three sensitive entities (mentions of a name and a location not in our allow-list) and scrubbed them.
The cloud received this:
"Handwritten note found on title page. Content: 'I must have you by . I would like to read it for my English class at .'"
Claude 3.5 was still able to reason that the note was non-canonical and unusual for a first edition, without ever knowing the names or locations written in that 100-year-old pencil.
Architect’s Summary
The Sovereign Redactor proves that Privacy and Intelligence are not a zero-sum game. By moving the redaction logic to the edge and using precision allow-lists, we can utilize the world’s most powerful cloud models while ensuring our "Forensic Vault" remains truly sovereign.
Ready to build your own Sovereign Vault?
Explore the hardened SovereignRedactor logic in the mcp-forensic-analyzer repository. Don't forget to check out the new WALKTHROUGH.md to see how the code evolved from a simple tool to a privacy-preserving airlock.
The Shield is up. Now we need the Verdict.
We have the raw visual data from the Eye. We have the privacy shield from the Redactor. But an audit isn't a list of findings; it's a decision.
In our final installment of this series, The Auditor, we introduce the high-reasoning synthesis layer. We’ll explore how to combine disparate forensic streams into a single, structured verdict and implement the Guardian Pattern—a Human-in-the-Loop handshake that ensures the AI never has the final word on a $50,000 asset.
Coming Next: High-Reasoning Synthesis & The Ethics of Autonomous Verdicts.
