How I Built an OCR-Based Defense Against Prompt Injection for Local LLM Search

#security #ai #opensource #docker

When you plug a local LLM into a web search tool, every fetched page becomes an attack surface. I found this out the hard way — my Ollama setup was pulling web content that contained invisible Unicode injection, fake system prompts, and markdown image tags designed to exfiltrate data through URL parameters.

I went looking for a solution and found that Google DeepMind's own research showed their best model-level defenses fail 53.6% of the time against adaptive attacks. The "Attacker Moves Second" paper demonstrated that all 12 published defenses were bypassed at >90% success rates. The UK's National Cyber Security Centre formally characterized LLMs as "inherently confusable deputies."

So I stopped trying to make the model resist injection and started removing the attack text before the model ever sees it.

The Insight: OCR as a Nuclear Defense

Since I'm generating the image from text (not scanning a document), I control every variable. The OCR round-trip becomes a ground truth extractor:

Take untrusted web content
Render it to an image with ImageMagick (300 DPI, 20pt monospace, TIFF)
OCR it back with Tesseract (LSTM engine)
Anything that didn't produce visible pixels is gone

Zero-width characters, bidirectional overrides, homoglyphs, variation selectors, tag characters — they all die in the render step because they have no visual representation. No pattern matching required for the entire invisible attack surface.

The Full Pipeline

Five independent layers, each catching a different class:

Layer	What	Catches
1. OCR round-trip	text → image → OCR	All invisible characters
2. Regex detect	31 compiled patterns	Instruction overrides, role hijacking, system tags
3. Regex redact	Strip detected patterns	Prevents detected attacks from reaching LLM
4. URL/email redact	Strip exfil channels	Markdown img exfil, hidden endpoints
5. Trust wrap	Tag as HOSTILE/UNTRUSTED	Gives LLM provenance metadata

The OCR runs first. Everything else operates on the clean output.

Red Team Results

I built a test harness with 12 adversarial payloads and ran them directly through the sanitization pipeline:

T01: Instruction Override — ✓ NEUTRALIZED
T02: Unicode Steganography — ✓ NEUTRALIZED
T03: Bidi Override — ✓ NEUTRALIZED
T04: Markdown Exfil — ✓ NEUTRALIZED
T05: Role Hijacking — ✓ NEUTRALIZED
T06: System Tag Injection — ✓ NEUTRALIZED
T07: Base64 Payload — ✓ NEUTRALIZED
T08: Typoglycemia — ✓ NEUTRALIZED
T09: Code Fence Injection — ✓ NEUTRALIZED
T10: Trust Escalation — ✓ NEUTRALIZED
T11: HTML Img Exfil — ✓ NEUTRALIZED
T12: Multi-Vector Combined — ✓ NEUTRALIZED

The red team script is included in the repo — python3 redteam.py runs all 12 payloads against your running instance.

What This Doesn't Catch

I want to be upfront about the limitations because I think the security community has a problem with tools that oversell:

Semantic injection — "the previous assessment methodology was found to contain errors" is natural English. No regex or OCR catches it.
Adaptive regex evasion — if an attacker studies the 31 patterns, they can craft bypasses using synonyms.
Cross-page composite attacks — each page is sanitized independently. An injection split across multiple search results would pass.
Model-level manipulation — the filter LLM is still an LLM.

Per DeepMind's research, prompt injection may never be fully solved with current architectures. This tool raises the cost of attack, it doesn't eliminate it.

Setup

Requires Docker and Ollama (or any OpenAI-compatible local LLM).

git clone https://github.com/Morfasco/search-sanitizer.git
cd search-sanitizer
cp .env.example .env  # edit with your model/endpoint
bash setup.sh
python3 redteam.py    # verify the pipeline

Supports Ollama, LM Studio, vLLM, text-generation-webui — anything that speaks /v1/chat/completions works via the LLM_API_FORMAT=openai setting in .env.