The Problem I Kept Ignoring
Every time we sent a customer transcript to an LLM API, we were sending real data — credit card numbers, home addresses, full names, national IDs — in plaintext to a third-party server.
Most teams I've talked to handle this in one of two ways:
- Ignore it and hope the provider's data processing agreement covers them
- Prompt engineer around it — "don't repeat personal information in your response" — which does nothing about what's already been transmitted
Neither is acceptable in a production system handling real user data. So I built llm-hasher — a PII tokenization middleware that sits between your application and any LLM API.
The Core Idea
The LLM doesn't need to see the actual credit card number to summarize a support transcript. It just needs to know a credit card number was mentioned. So instead of:
"Hi, my card is 4111-1111-1111-1111 and email is john@example.com"
The LLM receives:
"Hi, my card is CREDIT_CARD_john12_4f8a2b and email is EMAIL_john12_9c3d1a"
It can still reason about the context. It just never touches the real values. When the response comes back, you detokenize it and restore the originals.
Architecture
Your App ──► POST /v1/tokenize ──► llm-hasher ──► tokenized text
                                       │
                                       ├─ detects PII locally (Ollama, no cloud)
                                       └─ stores mappings in encrypted vault

Your App ──► [your LLM call with tokenized text]

Your App ──► POST /v1/detokenize ──► llm-hasher ──► original text restored
Three moving parts: a detector, a vault, and an HTTP service wrapping both.
Detection: Hybrid Regex + LLM
PII falls into two categories that require different detection strategies.
Structured PII — credit cards, emails, IBANs, IPv4 addresses — has well-defined patterns. Regex handles these with sub-millisecond latency and 100% recall on valid formats. No need to involve a language model.
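The structured half can be sketched in a few lines of Python (illustrative patterns only — llm-hasher's actual regexes cover more formats and validate more strictly; the Luhn check is how random digit runs get filtered out of credit-card matches):

```python
import re

# Hypothetical simplified patterns — the real detector covers more formats.
PATTERNS = {
    "CREDIT_CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "IPV4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
}

def luhn_ok(number: str) -> bool:
    """Checksum that filters out random digit runs the card regex matches."""
    digits = [int(d) for d in re.sub(r"\D", "", number)][::-1]
    total = sum(d if i % 2 == 0 else (d * 2 - 9 if d * 2 > 9 else d * 2)
                for i, d in enumerate(digits))
    return total % 10 == 0

def detect_structured(text: str) -> list[tuple[str, str]]:
    found = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            if label == "CREDIT_CARD" and not luhn_ok(m.group()):
                continue  # digit run fails the checksum: not a card number
            found.append((label, m.group()))
    return found
```

Because every candidate is checksum- or format-validated, false positives stay low without any model in the loop.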
Contextual PII — names, addresses, national IDs, passports — is where regex breaks down completely. "John Smith" looks identical to "Smith & Wesson" to a pattern matcher. You need semantic understanding.
For contextual PII, llm-hasher sends the text to a locally running Ollama instance. The model (default: llama3.2:3b) extracts entities and returns structured JSON. Because Ollama runs on your own server, this detection step never touches an external API — your raw data stays on your infra.
The hybrid approach gives you the best of both: speed and precision for structured types, semantic understanding for contextual ones.
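In practice the contextual half is one HTTP call per chunk to Ollama's /api/generate endpoint with `"format": "json"` to force parseable output. The prompt wording and entity schema below are illustrative assumptions, not the project's actual prompt:

```python
import json
import requests

# Prompt wording and entity schema are illustrative assumptions.
PROMPT_TEMPLATE = (
    "Extract personal information (names, addresses, national IDs) from the "
    "text below. Reply as JSON: {{\"entities\": [{{\"type\": \"...\", \"value\": \"...\"}}]}}\n\n"
    "Text: {text}"
)

def parse_entities(raw: str) -> list[dict]:
    """Parse the model's JSON reply, tolerating a malformed body."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return []
    return [e for e in data.get("entities", [])
            if isinstance(e, dict) and "type" in e and "value" in e]

def detect_contextual(text: str, model: str = "llama3.2:3b") -> list[dict]:
    # Ollama's /api/generate with format="json" constrains output to valid JSON.
    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": PROMPT_TEMPLATE.format(text=text),
        "format": "json",
        "stream": False,
    })
    return parse_entities(resp.json()["response"])
```

Even with JSON mode, the parse step stays defensive — small models occasionally return an empty or off-schema body.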
Chunking for Long Texts
Sending a 5,000-word transcript to Ollama in one shot causes problems — context window limits, degraded accuracy on long inputs, serial latency.
llm-hasher chunks large texts (configurable, default 800 words) and processes chunks in parallel goroutines:
// Simplified — actual implementation handles overlap and deduplication
func (d *Detector) detectParallel(ctx context.Context, text string) ([]Entity, error) {
	chunks := chunk(text, d.cfg.ChunkSize)
	results := make(chan []Entity, len(chunks))

	var wg sync.WaitGroup
	for _, c := range chunks {
		wg.Add(1)
		go func(c string) {
			defer wg.Done()
			entities, _ := d.detectWithOllama(ctx, c) // errors surfaced in the full version
			results <- entities
		}(c)
	}
	wg.Wait()
	close(results)
	return merge(results), nil
}
A 5,000-word document with 6 chunks processes in roughly the same time as a single chunk — latency scales with the slowest chunk, not the total count.
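The splitting itself is plain word-windowing; sketched here in Python (the overlap parameter is an assumed detail, there so an entity straddling a chunk boundary isn't cut in half):

```python
def chunk(text: str, size: int = 800, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows of at most `size` words."""
    words = text.split()
    if len(words) <= size:
        return [text]
    step = size - overlap  # consecutive windows share `overlap` words
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]
```

Overlap means the same entity can be detected twice near a boundary, which is why the merge step deduplicates.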
The Vault: AES-256-GCM Encrypted SQLite
Token-to-value mappings are stored in a local SQLite database. Each value is encrypted with AES-256-GCM before being written.
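For readers who want to prototype the same scheme, here is what per-value AES-256-GCM looks like with the PyCA `cryptography` package — a sketch of the standard random-nonce construction, not llm-hasher's Go implementation, which may frame the stored bytes differently:

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

def encrypt_value(key: bytes, plaintext: str) -> bytes:
    # A fresh 12-byte nonce per value, stored as a prefix of the blob.
    nonce = os.urandom(12)
    return nonce + AESGCM(key).encrypt(nonce, plaintext.encode(), None)

def decrypt_value(key: bytes, blob: bytes) -> str:
    nonce, ciphertext = blob[:12], blob[12:]
    # GCM authenticates as well as decrypts: a tampered blob raises InvalidTag.
    return AESGCM(key).decrypt(nonce, ciphertext, None).decode()

key = AESGCM.generate_key(bit_length=256)  # the 32-byte VAULT_KEY equivalent
```

The authenticated-encryption property matters here: a corrupted or tampered vault row fails loudly instead of silently detokenizing to garbage.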
Key design decisions:
Context scoping with your own IDs. Instead of the vault handing back opaque UUIDs that you'd have to track on your side, you pass a context_id from your own domain:
{
  "text": "Hi, my card is 4111-1111-1111-1111",
  "context_id": "zoom_call_789"
}
This means your Zoom call processor can detokenize with zoom_call_789 without needing to store a mapping between your ID and a vault-generated UUID.
Deduplication within a context. The same PII value within a context always maps to the same token. If a name appears five times in a transcript, the LLM sees the same token each time — so it can reason about the entity consistently across the full text.
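One way to get this determinism is to derive the token from a keyed hash of the value and the context, so repeated values collapse to one token without extra bookkeeping. This is a sketch of the idea, not llm-hasher's actual token construction:

```python
import hashlib
import hmac

def make_token(pii_type: str, value: str, context_id: str) -> str:
    # HMAC keyed by context: same value + same context -> same token,
    # while tokens are not linkable across contexts.
    digest = hmac.new(context_id.encode(), value.encode(),
                      hashlib.sha256).hexdigest()[:6]
    return f"{pii_type}_{context_id}_{digest}"
```

The cross-context property is a nice side effect: the same email address tokenized in two different contexts yields two unrelated tokens.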
TTL support. Tokens can have an expiry:
{
  "text": "...",
  "context_id": "session_abc",
  "ttl": "24h"
}
For compliance scenarios (GDPR right to erasure), there's a hard-delete endpoint:
DELETE /v1/contexts/{context_id}
This removes all mappings for that context from the vault. Once deleted, detokenization is impossible — by design.
Detokenization: Single-Pass Multi-String Replace
Naive detokenization would loop through each token and do a string replace — O(n×m) where n is text length and m is token count. For a transcript with 40 entities, that's 40 passes over the text.
llm-hasher instead makes a single linear pass. Under the hood, Go's strings.NewReplacer builds a trie over the token set (Aho-Corasick-style matching), so every token is replaced in one scan:

func (v *Vault) Detokenize(text string, mappings map[string]string) string {
	replacer := strings.NewReplacer(flatten(mappings)...)
	return replacer.Replace(text)
}
Detokenization latency therefore scales with text length, not token count — typically under 5ms even for large documents.
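For comparison, the same single-pass idea in Python — one compiled alternation over all tokens, resolved through a dict lookup (a sketch, not part of llm-hasher; longest-first ordering stops a token that prefixes another from shadowing it):

```python
import re

def detokenize(text: str, mappings: dict[str, str]) -> str:
    """Replace every token in one regex pass over the text."""
    if not mappings:
        return text
    # Longest tokens first, so a token that prefixes another can't win early.
    pattern = re.compile("|".join(
        re.escape(t) for t in sorted(mappings, key=len, reverse=True)))
    return pattern.sub(lambda m: mappings[m.group()], text)
```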
Real-World Integration
Python — LLM Proxy Pattern
import requests
import openai

# 1. Tokenize before sending to the LLM
resp = requests.post("http://localhost:8080/v1/tokenize", json={
    "text": transcript,
    "context_id": f"zoom_{call_id}",
})
tokenized = resp.json()

# 2. Send the tokenized text to your LLM
llm_response = openai.chat.completions.create(
    model="gpt-4o-mini",  # any chat model
    messages=[
        {"role": "system", "content": "Summarize this call transcript."},
        {"role": "user", "content": tokenized["tokenized_text"]},
    ],
)

# 3. Detokenize the LLM response
final = requests.post("http://localhost:8080/v1/detokenize", json={
    "text": llm_response.choices[0].message.content,
    "context_id": f"zoom_{call_id}",
})
print(final.json()["original_text"])
Go — Library Mode
If you don't want to run a separate HTTP service, import the hasher package directly:
import "github.com/yemrealtanay/llm-hasher/pkg/hasher"
h, err := hasher.New(
	hasher.WithOllama("http://localhost:11434", "llama3.2:3b"),
	hasher.WithVault("data/vault.db", ""),
)
if err != nil {
	log.Fatal(err)
}
defer h.Close()

result, err := h.Tokenize(ctx, transcript, "zoom_call_789", nil)
// result.Text contains the tokenized transcript

original, err := h.Detokenize(ctx, llmResponse, "zoom_call_789")
Performance Characteristics
| Scenario | Typical Latency |
|---|---|
| Short text, regex PII only | < 5ms |
| Short text with LLM detection | 2–8s (model dependent) |
| Long text (5,000 words), 6 parallel chunks | 3–10s |
| Detokenize (any size) | < 5ms |
The dominant cost is Ollama inference. On a modern laptop with llama3.2:3b, expect 2–4 seconds per chunk. A GPU or a larger/faster model changes this significantly. If your use case is async (batch processing, background jobs), the latency is generally acceptable without hardware changes.
For latency-sensitive paths, run tokenization asynchronously before the user-facing LLM call — most pipelines have a natural point to do this.
What It Doesn't Do (Yet)
It's not a firewall. If someone deliberately encodes PII to evade detection (e.g., spelling out digits), llm-hasher won't catch it. It handles the common case, not adversarial inputs.
Ollama recall isn't 100%. The LLM detector misses things, especially in noisy or multilingual text. Tuning confidence_threshold and chunk size helps, but there's no guarantee of perfect recall without human review.
No streaming support yet. Tokenization requires the full text — SSE/streaming tokenization is on the v2 roadmap.
Running It
git clone https://github.com/yemrealtanay/llm-hasher
cd llm-hasher
make docker-up
Docker Compose starts Ollama, pulls llama3.2:3b (~2GB), and starts the service on port 8080. Check it's running:
curl http://localhost:8080/healthz
# {"status":"ok"}
For production, set an explicit vault encryption key:
# Generate
openssl rand -hex 32
# Set in .env
VAULT_KEY=<your_64_char_hex_key>
If VAULT_KEY is not set, a key is auto-generated and saved to data/vault.key. That's fine for development but not for production — the key must survive restarts, because losing it makes every existing vault entry undecryptable.
What's Next
The v2 roadmap includes built-in LLM proxy endpoints (OpenAI-compatible and Anthropic), so instead of calling llm-hasher then your LLM separately, you point your existing OpenAI client at llm-hasher and it handles tokenization transparently in the middle. This would make adoption essentially zero-config for teams already using the OpenAI SDK.
Contributions are welcome, especially for v2 LLM provider adapters — each provider is a well-defined, self-contained implementation.
Links
- GitHub: github.com/yemrealtanay/llm-hasher
- License: MIT
- Issues / PRs welcome