Yunus Emre Altanay

How I Built a PII Tokenization Middleware to Keep Sensitive Data Out of LLM APIs

The Problem I Kept Ignoring

Every time we sent a customer transcript to an LLM API, we were sending real data — credit card numbers, home addresses, full names, national IDs — in plaintext to a third-party server.

Most teams I've talked to handle this in one of two ways:

  1. Ignore it and hope the provider's data processing agreement covers them
  2. Prompt engineer around it — "don't repeat personal information in your response" — which does nothing about what's already been transmitted

Neither is acceptable in a production system handling real user data. So I built llm-hasher — a PII tokenization middleware that sits between your application and any LLM API.


The Core Idea

The LLM doesn't need to see the actual credit card number to summarize a support transcript. It just needs to know a credit card number was mentioned. So instead of:

"Hi, my card is 4111-1111-1111-1111 and email is john@example.com"

The LLM receives:

"Hi, my card is CREDIT_CARD_john12_4f8a2b and email is EMAIL_john12_9c3d1a"

It can still reason about the context. It just never touches the real values. When the response comes back, you detokenize it and restore the originals.
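The whole round trip can be sketched in a few lines of Python. This is a toy illustration of the concept only — real detection and vault storage (encryption, context scoping, dedup) are covered below, and the regexes here are deliberately naive:

```python
import re

# Toy vault: token -> original value (llm-hasher encrypts and persists this)
vault = {}

def tokenize(text, context_id):
    """Replace emails and card numbers with placeholder tokens."""
    patterns = {
        "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+",
        "CREDIT_CARD": r"\b(?:\d[ -]?){13,16}\b",
    }
    for kind, pat in patterns.items():
        for i, match in enumerate(re.findall(pat, text)):
            token = f"{kind}_{context_id}_{i}"
            vault[token] = match
            text = text.replace(match, token)
    return text

def detokenize(text):
    """Restore original values after the LLM response comes back."""
    for token, value in vault.items():
        text = text.replace(token, value)
    return text

masked = tokenize("Hi, my card is 4111-1111-1111-1111 and email is john@example.com", "demo")
# The LLM only ever sees `masked`; the response is restored afterwards.
restored = detokenize(masked)
```

The point is that the third-party API never receives the real values at any step — the mapping only ever lives on your side.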


Architecture

Your App ──► POST /v1/tokenize ──► llm-hasher ──► tokenized text
                                       │
                              detects PII locally
                              (Ollama, no cloud)
                              stores in encrypted vault

Your App ──► [your LLM call with tokenized text]

Your App ──► POST /v1/detokenize ──► llm-hasher ──► original text restored

Three moving parts: a detector, a vault, and an HTTP service wrapping both.


Detection: Hybrid Regex + LLM

PII falls into two categories that require different detection strategies.

Structured PII — credit cards, emails, IBANs, IPv4 addresses — has well-defined patterns. Regex handles these with sub-millisecond latency and 100% recall on valid formats. No need to involve a language model.
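For illustration, here is roughly what the structured path looks like — a card-number pattern plus a Luhn checksum so that random 16-digit runs (order numbers, tracking IDs) don't get flagged. This is a sketch, not llm-hasher's exact patterns:

```python
import re

CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right; the
    total must be divisible by 10 for a plausible card number."""
    digits = [int(d) for d in re.sub(r"[ -]", "", number)]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def find_cards(text):
    # Regex narrows candidates; the checksum rejects lookalike digit runs.
    return [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
```

The checksum step matters: without it, any invoice or order number in the right digit range would be tokenized and the LLM would lose useful context for no privacy gain.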

Contextual PII — names, addresses, national IDs, passports — is where regex breaks down completely. "John Smith" looks identical to "Smith & Wesson" to a pattern matcher. You need semantic understanding.

For contextual PII, llm-hasher sends the text to a locally running Ollama instance. The model (default: llama3.2:3b) extracts entities and returns structured JSON. Because Ollama runs on your own server, this detection step never touches an external API — your raw data stays on your infra.

The hybrid approach gives you the best of both: speed and precision for structured types, semantic understanding for contextual ones.
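For the contextual path, the shape of the call looks something like the sketch below. The prompt wording and parsing are illustrative (the real prompt lives in the repo), but POSTing to Ollama's /api/generate with format: "json" is the standard way to get structured output from a local model:

```python
import json
import urllib.request

PROMPT = """Extract personal data (names, addresses, national IDs) from the text below.
Respond with JSON only: {{"entities": [{{"type": "...", "value": "..."}}]}}

Text: {text}"""

def detect_contextual(text, model="llama3.2:3b", host="http://localhost:11434"):
    """Ask a local Ollama model for contextual PII. Nothing leaves this machine."""
    body = json.dumps({
        "model": model,
        "prompt": PROMPT.format(text=text),
        "format": "json",   # Ollama constrains the output to valid JSON
        "stream": False,
    }).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return parse_entities(json.load(resp)["response"])

def parse_entities(raw):
    """Defensive parse: a small model occasionally emits malformed JSON."""
    try:
        return json.loads(raw).get("entities", [])
    except json.JSONDecodeError:
        return []
```

Treating the model's output defensively is non-optional with a 3B model — a parse failure should degrade to "no entities found in this chunk", not crash the request.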

Chunking for Long Texts

Sending a 5,000-word transcript to Ollama in one shot causes problems — context window limits, degraded accuracy on long inputs, serial latency.

llm-hasher chunks large texts (configurable, default 800 words) and processes chunks in parallel goroutines:

// Simplified — actual implementation handles overlap and deduplication
func (d *Detector) detectParallel(ctx context.Context, text string) ([]Entity, error) {
    chunks := chunk(text, d.cfg.ChunkSize)
    // Buffered to the chunk count, so no goroutine ever blocks on send.
    results := make(chan []Entity, len(chunks))

    var wg sync.WaitGroup
    for _, part := range chunks {
        wg.Add(1)
        go func(c string) {
            defer wg.Done()
            // Error deliberately dropped in this sketch: a chunk that fails
            // detection contributes no entities instead of failing the batch.
            entities, _ := d.detectWithOllama(ctx, c)
            results <- entities
        }(part)
    }

    wg.Wait()
    close(results)
    return merge(results), nil
}

A 5,000-word document with 6 chunks processes in roughly the same time as a single chunk — latency scales with the slowest chunk, not the total count.
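The chunk helper above is straightforward; a word-based version with overlap (so an entity straddling a chunk boundary still appears whole in at least one chunk) might look like this in Python — parameter names and defaults are illustrative:

```python
def chunk(text, size=800, overlap=50):
    """Split text into word-based chunks; consecutive chunks share
    `overlap` words so boundary-spanning entities aren't lost."""
    words = text.split()
    if len(words) <= size:
        return [text]
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, len(words) - overlap, step)]
```

Overlap is also why deduplication matters downstream: an entity inside the overlapping window is detected twice, once per chunk, and the results have to be merged back to a single token.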


The Vault: AES-256-GCM Encrypted SQLite

Token-to-value mappings are stored in a local SQLite database. Each value is encrypted with AES-256-GCM before being written.

Key design decisions:

Context scoping with your own IDs. Instead of generating opaque foreign UUIDs that you'd need to track on your side, you pass a context_id from your domain:

{
  "text": "Hi, my card is 4111-1111-1111-1111",
  "context_id": "zoom_call_789"
}

This means your Zoom call processor can detokenize with zoom_call_789 without needing to store a mapping between your ID and a vault-generated UUID.

Deduplication within a context. The same PII value within a context always maps to the same token. If a name appears five times in a transcript, the LLM sees the same token each time — so it can reason about the entity consistently across the full text.
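Consistent tokens fall out naturally if the token suffix is derived from the value and the context rather than generated randomly. A sketch of the idea (the actual derivation inside llm-hasher may differ):

```python
import hashlib

def make_token(pii_type: str, value: str, context_id: str) -> str:
    """Same (value, context) always yields the same token, so repeated
    mentions of one entity share one token; across contexts, tokens for
    the same value differ, so tokens can't be correlated between calls."""
    digest = hashlib.sha256(f"{context_id}:{value}".encode()).hexdigest()[:6]
    return f"{pii_type}_{context_id}_{digest}"
```

This matches the token shape shown earlier (EMAIL_john12_9c3d1a): type, context, short hash.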

TTL support. Tokens can have an expiry:

{
  "text": "...",
  "context_id": "session_abc",
  "ttl": "24h"
}
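Under the hood, a TTL just becomes an expiry timestamp checked at lookup time. A minimal sketch with a toy duration parser — function and field names here are illustrative, not llm-hasher's API:

```python
import re
import time

UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_ttl(ttl: str) -> int:
    """'24h' -> 86400 seconds. Supports s/m/h/d suffixes."""
    m = re.fullmatch(r"(\d+)([smhd])", ttl)
    if not m:
        raise ValueError(f"bad ttl: {ttl}")
    return int(m.group(1)) * UNITS[m.group(2)]

def is_expired(created_at, ttl, now=None):
    """An expired mapping behaves as if it were deleted."""
    current = now if now is not None else time.time()
    return current > created_at + parse_ttl(ttl)
```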

For compliance scenarios (GDPR right to erasure), there's a hard-delete endpoint:

DELETE /v1/contexts/{context_id}

This removes all mappings for that context from the vault. Once deleted, detokenization is impossible — by design.


Detokenization: Single-Pass Multi-String Replace

Naive detokenization would loop through each token and do a string replace — O(n×m) where n is text length and m is token count. For a transcript with 40 entities, that's 40 passes over the text.

llm-hasher instead replaces all tokens in a single linear pass using Go's strings.NewReplacer, which compiles the token set into a trie (the same idea as an Aho-Corasick automaton):

func (v *Vault) Detokenize(text string, mappings map[string]string) string {
    replacer := strings.NewReplacer(flatten(mappings)...)
    return replacer.Replace(text)
}

Detokenization latency scales with text length, not token count — typically under 5ms even for large documents.
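The same single-pass trick in Python, for comparison: compile one regex alternation over all tokens (longest first, so a token that happens to be a prefix of another can't match too early) and resolve each hit through a lookup callback:

```python
import re

def detokenize(text: str, mappings: dict) -> str:
    """Replace every token in a single scan of the text."""
    if not mappings:
        return text
    pattern = re.compile("|".join(
        re.escape(t) for t in sorted(mappings, key=len, reverse=True)))
    return pattern.sub(lambda m: mappings[m.group()], text)
```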


Real-World Integration

Python — LLM Proxy Pattern

import requests
import openai

# 1. Tokenize before sending to LLM
resp = requests.post("http://localhost:8080/v1/tokenize", json={
    "text": transcript,
    "context_id": f"zoom_{call_id}"
})
tokenized = resp.json()

# 2. Send tokenized text to your LLM
llm_response = openai.chat.completions.create(
    model="gpt-4o-mini",  # `model` is required; any chat-capable model works
    messages=[
        {"role": "system", "content": "Summarize this call transcript."},
        {"role": "user",   "content": tokenized["tokenized_text"]}
    ]
)

# 3. Detokenize the LLM response
final = requests.post("http://localhost:8080/v1/detokenize", json={
    "text": llm_response.choices[0].message.content,
    "context_id": f"zoom_{call_id}"
})
print(final.json()["original_text"])

Go — Library Mode

If you don't want to run a separate HTTP service, import the hasher package directly:

import "github.com/yemrealtanay/llm-hasher/pkg/hasher"

h, err := hasher.New(
    hasher.WithOllama("http://localhost:11434", "llama3.2:3b"),
    hasher.WithVault("data/vault.db", ""),
)
defer h.Close()

result, err := h.Tokenize(ctx, transcript, "zoom_call_789", nil)
// result.Text contains tokenized transcript

original, err := h.Detokenize(ctx, llmResponse, "zoom_call_789")

Performance Characteristics

Scenario                                      Typical latency
Short text, regex PII only                    < 5ms
Short text with LLM detection                 2–8s (model dependent)
Long text (5,000 words), 6 parallel chunks    3–10s
Detokenize (any size)                         < 5ms

The dominant cost is Ollama inference. On a modern laptop with llama3.2:3b, expect 2–4 seconds per chunk. A GPU or a larger/faster model changes this significantly. If your use case is async (batch processing, background jobs), the latency is generally acceptable without hardware changes.

For latency-sensitive paths, run tokenization asynchronously before the user-facing LLM call — most pipelines have a natural point to do this.


What It Doesn't Do (Yet)

It's not a firewall. If someone deliberately encodes PII to evade detection (e.g., spelling out digits), llm-hasher won't catch it. It handles the common case, not adversarial inputs.

Ollama recall isn't 100%. The LLM detector misses things, especially in noisy or multilingual text. Tuning confidence_threshold and chunk size helps, but there's no guarantee of perfect recall without human review.

No streaming support yet. Tokenization requires the full text — SSE/streaming tokenization is on the v2 roadmap.


Running It

git clone https://github.com/yemrealtanay/llm-hasher
cd llm-hasher
make docker-up

Docker Compose starts Ollama, pulls llama3.2:3b (~2GB), and starts the service on port 8080. Check it's running:

curl http://localhost:8080/healthz
# {"status":"ok"}

For production, set an explicit vault encryption key:

# Generate
openssl rand -hex 32

# Set in .env
VAULT_KEY=<your_64_char_hex_key>

If VAULT_KEY is not set, a key is auto-generated and saved to data/vault.key. That's fine for development, but not for production: manage the key explicitly, because if the key file is lost, the vault's contents can never be decrypted again.


What's Next

The v2 roadmap includes built-in LLM proxy endpoints (OpenAI-compatible and Anthropic), so instead of calling llm-hasher then your LLM separately, you point your existing OpenAI client at llm-hasher and it handles tokenization transparently in the middle. This would make adoption essentially zero-config for teams already using the OpenAI SDK.

Contributions are welcome, especially for v2 LLM provider adapters — each provider is a well-defined, self-contained implementation.

