Yunus Emre Altanay

How I Built a PII Tokenization Middleware to Keep Sensitive Data Out of LLM APIs

The Problem I Kept Ignoring

Every time we sent a customer transcript to an LLM API, we were sending real data — credit card numbers, home addresses, full names, national IDs — in plaintext to a third-party server.

Most teams I've talked to handle this in one of two ways:

  1. Ignore it and hope the provider's data processing agreement covers them
  2. Prompt engineer around it — "don't repeat personal information in your response" — which does nothing about what's already been transmitted

Neither is acceptable in a production system handling real user data. So I built llm-hasher — a PII tokenization middleware that sits between your application and any LLM API.


The Core Idea

The LLM doesn't need to see the actual credit card number to summarize a support transcript. It just needs to know a credit card number was mentioned. So instead of:

"Hi, my card is 4111-1111-1111-1111 and email is john@example.com"

The LLM receives:

"Hi, my card is CREDIT_CARD_john12_4f8a2b and email is EMAIL_john12_9c3d1a"

It can still reason about the context. It just never touches the real values. When the response comes back, you detokenize it and restore the originals.
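The whole round trip can be sketched in a few lines of Python. This is a toy illustration of the concept only — real detection and vault storage (encryption, context scoping, dedup) are covered below, and the regexes here are deliberately naive:

```python
import re

# Toy vault: token -> original value (llm-hasher encrypts and persists this)
vault = {}

def tokenize(text, context_id):
    """Replace emails and card numbers with placeholder tokens."""
    patterns = {
        "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+",
        "CREDIT_CARD": r"\b(?:\d[ -]?){13,16}\b",
    }
    for kind, pat in patterns.items():
        for i, match in enumerate(re.findall(pat, text)):
            token = f"{kind}_{context_id}_{i}"
            vault[token] = match
            text = text.replace(match, token)
    return text

def detokenize(text):
    """Restore original values after the LLM response comes back."""
    for token, value in vault.items():
        text = text.replace(token, value)
    return text

masked = tokenize("Hi, my card is 4111-1111-1111-1111 and email is john@example.com", "demo")
# The LLM only ever sees `masked`; the response is restored afterwards.
restored = detokenize(masked)
```

The point is that the third-party API never receives the real values at any step — the mapping only ever lives on your side.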


Architecture

Your App ──► POST /v1/tokenize ──► llm-hasher ──► tokenized text
                                       │
                              detects PII locally
                              (Ollama, no cloud)
                              stores in encrypted vault

Your App ──► [your LLM call with tokenized text]

Your App ──► POST /v1/detokenize ──► llm-hasher ──► original text restored

Three moving parts: a detector, a vault, and an HTTP service wrapping both.


Detection: Hybrid Regex + LLM

PII falls into two categories that require different detection strategies.

Structured PII — credit cards, emails, IBANs, IPv4 addresses — has well-defined patterns. Regex handles these with sub-millisecond latency and 100% recall on valid formats. No need to involve a language model.
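For illustration, here is roughly what the structured path looks like — a card-number pattern plus a Luhn checksum so that random 16-digit runs (order numbers, tracking IDs) don't get flagged. This is a sketch, not llm-hasher's exact patterns:

```python
import re

CARD_RE = re.compile(r"\b(?:\d[ -]?){13,16}\b")

def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right; the
    total must be divisible by 10 for a plausible card number."""
    digits = [int(d) for d in re.sub(r"[ -]", "", number)]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def find_cards(text):
    # Regex narrows candidates; the checksum rejects lookalike digit runs.
    return [m.group() for m in CARD_RE.finditer(text) if luhn_valid(m.group())]
```

The checksum step matters: without it, any invoice or order number in the right digit range would be tokenized and the LLM would lose useful context for no privacy gain.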

Contextual PII — names, addresses, national IDs, passports — is where regex breaks down completely. "John Smith" looks identical to "Smith & Wesson" to a pattern matcher. You need semantic understanding.

For contextual PII, llm-hasher sends the text to a locally running Ollama instance. The model (default: llama3.2:3b) extracts entities and returns structured JSON. Because Ollama runs on your own server, this detection step never touches an external API — your raw data stays on your infra.

The hybrid approach gives you the best of both: speed and precision for structured types, semantic understanding for contextual ones.
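For the contextual path, the shape of the call looks something like the sketch below. The prompt wording and parsing are illustrative (the real prompt lives in the repo), but POSTing to Ollama's /api/generate with format: "json" is the standard way to get structured output from a local model:

```python
import json
import urllib.request

PROMPT = """Extract personal data (names, addresses, national IDs) from the text below.
Respond with JSON only: {{"entities": [{{"type": "...", "value": "..."}}]}}

Text: {text}"""

def detect_contextual(text, model="llama3.2:3b", host="http://localhost:11434"):
    """Ask a local Ollama model for contextual PII. Nothing leaves this machine."""
    body = json.dumps({
        "model": model,
        "prompt": PROMPT.format(text=text),
        "format": "json",   # Ollama constrains the output to valid JSON
        "stream": False,
    }).encode()
    req = urllib.request.Request(f"{host}/api/generate", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return parse_entities(json.load(resp)["response"])

def parse_entities(raw):
    """Defensive parse: a small model occasionally emits malformed JSON."""
    try:
        return json.loads(raw).get("entities", [])
    except json.JSONDecodeError:
        return []
```

Treating the model's output defensively is non-optional with a 3B model — a parse failure should degrade to "no entities found in this chunk", not crash the request.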

Chunking for Long Texts

Sending a 5,000-word transcript to Ollama in one shot causes problems — context window limits, degraded accuracy on long inputs, serial latency.

llm-hasher chunks large texts (configurable, default 800 words) and processes chunks in parallel goroutines:

// Simplified — actual implementation handles overlap and deduplication
func (d *Detector) detectParallel(ctx context.Context, text string) ([]Entity, error) {
    chunks := chunk(text, d.cfg.ChunkSize)
    // Buffered to the chunk count, so no goroutine ever blocks on send.
    results := make(chan []Entity, len(chunks))

    var wg sync.WaitGroup
    for _, part := range chunks {
        wg.Add(1)
        go func(c string) {
            defer wg.Done()
            // Error deliberately dropped in this sketch: a chunk that fails
            // detection contributes no entities instead of failing the batch.
            entities, _ := d.detectWithOllama(ctx, c)
            results <- entities
        }(part)
    }

    wg.Wait()
    close(results)
    return merge(results), nil
}

A 5,000-word document with 6 chunks processes in roughly the same time as a single chunk — latency scales with the slowest chunk, not the total count.
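The chunk helper above is straightforward; a word-based version with overlap (so an entity straddling a chunk boundary still appears whole in at least one chunk) might look like this in Python — parameter names and defaults are illustrative:

```python
def chunk(text, size=800, overlap=50):
    """Split text into word-based chunks; consecutive chunks share
    `overlap` words so boundary-spanning entities aren't lost."""
    words = text.split()
    if len(words) <= size:
        return [text]
    step = size - overlap
    return [" ".join(words[i:i + size])
            for i in range(0, len(words) - overlap, step)]
```

Overlap is also why deduplication matters downstream: an entity inside the overlapping window is detected twice, once per chunk, and the results have to be merged back to a single token.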


The Vault: AES-256-GCM Encrypted SQLite

Token-to-value mappings are stored in a local SQLite database. Each value is encrypted with AES-256-GCM before being written.

Key design decisions:

Context scoping with your own IDs. Instead of generating opaque foreign UUIDs that you'd need to track on your side, you pass a context_id from your domain:

{
  "text": "Hi, my card is 4111-1111-1111-1111",
  "context_id": "zoom_call_789"
}

This means your Zoom call processor can detokenize with zoom_call_789 without needing to store a mapping between your ID and a vault-generated UUID.

Deduplication within a context. The same PII value within a context always maps to the same token. If a name appears five times in a transcript, the LLM sees the same token each time — so it can reason about the entity consistently across the full text.
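Consistent tokens fall out naturally if the token suffix is derived from the value and the context rather than generated randomly. A sketch of the idea (the actual derivation inside llm-hasher may differ):

```python
import hashlib

def make_token(pii_type: str, value: str, context_id: str) -> str:
    """Same (value, context) always yields the same token, so repeated
    mentions of one entity share one token; across contexts, tokens for
    the same value differ, so tokens can't be correlated between calls."""
    digest = hashlib.sha256(f"{context_id}:{value}".encode()).hexdigest()[:6]
    return f"{pii_type}_{context_id}_{digest}"
```

This matches the token shape shown earlier (EMAIL_john12_9c3d1a): type, context, short hash.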

TTL support. Tokens can have an expiry:

{
  "text": "...",
  "context_id": "session_abc",
  "ttl": "24h"
}
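Under the hood, a TTL just becomes an expiry timestamp checked at lookup time. A minimal sketch with a toy duration parser — function and field names here are illustrative, not llm-hasher's API:

```python
import re
import time

UNITS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def parse_ttl(ttl: str) -> int:
    """'24h' -> 86400 seconds. Supports s/m/h/d suffixes."""
    m = re.fullmatch(r"(\d+)([smhd])", ttl)
    if not m:
        raise ValueError(f"bad ttl: {ttl}")
    return int(m.group(1)) * UNITS[m.group(2)]

def is_expired(created_at, ttl, now=None):
    """An expired mapping behaves as if it were deleted."""
    current = now if now is not None else time.time()
    return current > created_at + parse_ttl(ttl)
```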

For compliance scenarios (GDPR right to erasure), there's a hard-delete endpoint:

DELETE /v1/contexts/{context_id}

This removes all mappings for that context from the vault. Once deleted, detokenization is impossible — by design.


Detokenization: Single-Pass Multi-String Replace

Naive detokenization would loop through each token and do a string replace — O(n×m) where n is text length and m is token count. For a transcript with 40 entities, that's 40 passes over the text.

llm-hasher instead replaces all tokens in a single linear pass using Go's strings.NewReplacer, which compiles the token set into a trie (the same idea as an Aho-Corasick automaton):

func (v *Vault) Detokenize(text string, mappings map[string]string) string {
    replacer := strings.NewReplacer(flatten(mappings)...)
    return replacer.Replace(text)
}

Detokenization latency scales with text length, not token count — typically under 5ms even for large documents.
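The same single-pass trick in Python, for comparison: compile one regex alternation over all tokens (longest first, so a token that happens to be a prefix of another can't match too early) and resolve each hit through a lookup callback:

```python
import re

def detokenize(text: str, mappings: dict) -> str:
    """Replace every token in a single scan of the text."""
    if not mappings:
        return text
    pattern = re.compile("|".join(
        re.escape(t) for t in sorted(mappings, key=len, reverse=True)))
    return pattern.sub(lambda m: mappings[m.group()], text)
```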


Real-World Integration

Python — LLM Proxy Pattern

import requests
import openai

# 1. Tokenize before sending to LLM
resp = requests.post("http://localhost:8080/v1/tokenize", json={
    "text": transcript,
    "context_id": f"zoom_{call_id}"
})
tokenized = resp.json()

# 2. Send tokenized text to your LLM
llm_response = openai.chat.completions.create(
    model="gpt-4o-mini",  # `model` is required; any chat-capable model works
    messages=[
        {"role": "system", "content": "Summarize this call transcript."},
        {"role": "user",   "content": tokenized["tokenized_text"]}
    ]
)

# 3. Detokenize the LLM response
final = requests.post("http://localhost:8080/v1/detokenize", json={
    "text": llm_response.choices[0].message.content,
    "context_id": f"zoom_{call_id}"
})
print(final.json()["original_text"])

Go — Library Mode

If you don't want to run a separate HTTP service, import the hasher package directly:

import "github.com/yemrealtanay/llm-hasher/pkg/hasher"

h, err := hasher.New(
    hasher.WithOllama("http://localhost:11434", "llama3.2:3b"),
    hasher.WithVault("data/vault.db", ""),
)
defer h.Close()

result, err := h.Tokenize(ctx, transcript, "zoom_call_789", nil)
// result.Text contains tokenized transcript

original, err := h.Detokenize(ctx, llmResponse, "zoom_call_789")

Performance Characteristics

Scenario                                      Typical latency
Short text, regex PII only                    < 5ms
Short text with LLM detection                 2–8s (model dependent)
Long text (5,000 words), 6 parallel chunks    3–10s
Detokenize (any size)                         < 5ms

The dominant cost is Ollama inference. On a modern laptop with llama3.2:3b, expect 2–4 seconds per chunk. A GPU or a larger/faster model changes this significantly. If your use case is async (batch processing, background jobs), the latency is generally acceptable without hardware changes.

For latency-sensitive paths, run tokenization asynchronously before the user-facing LLM call — most pipelines have a natural point to do this.


What It Doesn't Do (Yet)

It's not a firewall. If someone deliberately encodes PII to evade detection (e.g., spelling out digits), llm-hasher won't catch it. It handles the common case, not adversarial inputs.

Ollama recall isn't 100%. The LLM detector misses things, especially in noisy or multilingual text. Tuning confidence_threshold and chunk size helps, but there's no guarantee of perfect recall without human review.

No streaming support yet. Tokenization requires the full text — SSE/streaming tokenization is on the v2 roadmap.


Running It

git clone https://github.com/yemrealtanay/llm-hasher
cd llm-hasher
make docker-up

Docker Compose starts Ollama, pulls llama3.2:3b (~2GB), and starts the service on port 8080. Check it's running:

curl http://localhost:8080/healthz
# {"status":"ok"}

For production, set an explicit vault encryption key:

# Generate
openssl rand -hex 32

# Set in .env
VAULT_KEY=<your_64_char_hex_key>

If VAULT_KEY is not set, a key is auto-generated and saved to data/vault.key. That's fine for development, but not for production: manage the key explicitly, because if the key file is lost, the vault's contents can never be decrypted again.


What's Next

The v2 roadmap includes built-in LLM proxy endpoints (OpenAI-compatible and Anthropic), so instead of calling llm-hasher then your LLM separately, you point your existing OpenAI client at llm-hasher and it handles tokenization transparently in the middle. This would make adoption essentially zero-config for teams already using the OpenAI SDK.

Contributions are welcome, especially for v2 LLM provider adapters — each provider is a well-defined, self-contained implementation.

