
Jaydeep Shah (JD)


One Prompt Was Not Enough: Building a 4-Step On-Device Redaction Pipeline

I thought I could solve PII redaction with one LLM call. Classify the document, find every identifier, replace them with placeholders, and verify nothing leaked, all in a single prompt. The model is smart enough, right?

It took three architectural rewrites and a trail of silently failed redactions before I arrived at something that actually works: a 4-step pipeline where each step gets its own conversation, its own prompt, and its own job. This is the story of how I got there, told through the failures that forced each design decision.

The system is called Redacto. It runs entirely on-device: Gemma 4 E2B (~2.3B effective params, 5.1B total via Per-Layer Embeddings) on a Snapdragon 8 Elite NPU via LiteRT-LM. No cloud, no internet permission, no BAA required. The architecture I am about to describe applies to any LLM-driven document processing pipeline, but the constraints of on-device inference made every failure more painful and every fix more deliberate.


The naive approach: one call to rule them all

The first version of Redacto did exactly what seems reasonable. One prompt, one LLM call:

You are a privacy redaction engine. Analyze the following text. Identify the document type. Find all personally identifiable information. Replace each PII item with a category placeholder like [NAME_1]. Verify that no PII remains. Return the redacted text.

This is the architecture most people reach for first. It mirrors how a human would think about the problem: read the document, find the sensitive stuff, black it out, double-check. A single mental pass.

And it works, sometimes. On short, clean, well-formatted inputs, the model can juggle all four tasks in one shot. I saw it handle a three-line medical note perfectly on the first try. So I shipped it.

Then I tested it on real documents.

What actually happened

Three categories of failure emerged immediately.

Inconsistent output format. The prompt asked for structured output: category labels, detected items, redacted text. But the model would sometimes return the redacted text without the detection list. Other times it would return the detection list but skip the redaction. On a small model running on-device, output format reliability is not something you can take for granted. This is the reality of working with small models: they are powerful, but they cannot reliably hold four distinct objectives in working memory at once (Decision D3 in Redacto's engineering log; the D-numbers throughout this post refer to entries in that log).

Missed detections. When the model is simultaneously trying to understand document structure, enumerate PII items, perform text surgery, and self-verify, something has to give. In practice, detection suffered the most. The model would catch obvious items like names and dates but miss relational identifiers such as "the patient's daughter Lisa", where "Lisa" is only PII because of its relationship to the patient. Single-pass processing does not give the model enough cognitive room to reason about context (D3).

Silent failures. The worst category. The model would return confidently formatted output that looked correct but had gaps. No error, no warning, just missing redactions. A medical record number sitting in plain text in the "redacted" output. When your system's job is privacy protection, a silent failure is worse than a crash.

These are not hypothetical problems. They are documented failure modes that drove the architectural decisions in Redacto's engineering log.


The evolution through failed approaches

Getting from single-pass to the final 4-step pipeline was not a clean leap. It was two intermediate architectures, each of which fixed one problem and revealed the next.

Attempt 2: Separate detection, deterministic replacement

The first fix was obvious: split detection from replacement. Let the LLM focus on finding PII (Step 2), then use deterministic code to do the actual text replacement (Step 3). No LLM creativity in the replacement step, just string.replace(), applied to the longest detections first so that a short match cannot clobber part of a longer one.

This is sound software engineering. Separate concerns. Use deterministic logic where you can. Reserve the LLM for what only an LLM can do.

It failed catastrophically.

Here is the specific failure mode, documented as Decision D4 in our engineering log: OCR text has errors. A scanned medical bill might contain "inquire@blucurrent. mail" (note the stray space before "mail"). The LLM in Step 2 receives this OCR text and detects an email address. But the LLM does what LLMs do: it "corrects" the text in its output. It reports the detection as "inquire@ucal.com".

Now string.replace() runs. It searches for "inquire@ucal.com" in the original text. That string does not exist in the original. The replacement silently does nothing. The email address remains in the output, unredacted.

In this test, 6 of 8 detections were skipped because the detected text did not match the original OCR text (D4). That figure is directional, from a single test run rather than a rigorous multi-run study, but the pattern was unmistakable: silent, systematic loss.

This is a fundamental tension: the LLM understands semantics but does not preserve exact string representations. Deterministic string operations preserve exact strings but do not understand semantics. When you chain one into the other, the semantic layer corrupts the data that the deterministic layer depends on.
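The silent no-op is easy to reproduce. The following is a minimal sketch in Python (Redacto itself is a Kotlin app, and this is not its code): `replace_detections` is a hypothetical helper, and the sample sentence is made up around the two email strings quoted above. The one deliberate difference from the failing version is that it returns the detections it could not find, so the miss is at least visible.

```python
def replace_detections(original, detections):
    """Deterministic replacement. Returns (redacted_text, skipped_detections)."""
    redacted = original
    skipped = []
    counters = {}  # per-category placeholder numbering, e.g. PHONE_1, PHONE_2
    # Longest first, so a short detection never clobbers part of a longer one.
    for category, detected in sorted(detections, key=lambda d: len(d[1]), reverse=True):
        if detected not in redacted:
            # str.replace would silently return the text unchanged here.
            skipped.append(detected)
            continue
        counters[category] = counters.get(category, 0) + 1
        redacted = redacted.replace(detected, f"[{category}_{counters[category]}]")
    return redacted, skipped


# OCR text keeps its error; the detector reports a "corrected" email.
ocr_text = "Questions? Write to inquire@blucurrent. mail or call 408-555-1234."
detections = [("EMAIL", "inquire@ucal.com"), ("PHONE", "408-555-1234")]

redacted, skipped = replace_detections(ocr_text, detections)
print(redacted)
print("skipped:", skipped)
```

The phone number is replaced because its string survived detection intact; the email stays in the output because the reported string never occurs in the OCR text.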

Attempt 3: Word-diff for image bounding boxes

For image redaction, the pipeline was: OCR extracts text with bounding boxes, the LLM redacts the text, then we diff the original against the redacted version to figure out which words were replaced, and draw black boxes over their bounding box coordinates.

The diff approach broke for the same reason. The LLM does not just replace PII; it rewrites surrounding text. It fixes grammar, merges sentences, reorders clauses. The word-level diff between original and redacted text becomes noisy and unreliable. Bounding boxes end up on the wrong words, or they cover too much, or they miss the target entirely.

This was the failure that finally forced a complete rethink of the architecture (D9).
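To make the noise concrete, here is a small sketch using Python's standard `difflib`. It is an illustration of the word-diff idea, not Redacto's actual diff code, and the two sentences are invented: the "redacted" version replaces the PII but also expands an abbreviation, fixes a verb, and adds an article, which is the kind of rewriting described above.

```python
from difflib import SequenceMatcher


def words_flagged_by_diff(original_words, redacted_words):
    """Return indices of original words that the diff reports as changed."""
    matcher = SequenceMatcher(a=original_words, b=redacted_words, autojunk=False)
    flagged = set()
    for tag, a_start, a_end, _b_start, _b_end in matcher.get_opcodes():
        # 'replace' and 'delete' are the original words a box would be drawn over.
        if tag in ("replace", "delete"):
            flagged.update(range(a_start, a_end))
    return flagged


original = "Pt Jane Smith were seen on 03/15/78 for compound fracture".split()
rewritten = "Patient [NAME_1] was seen on [DATE_1] for a compound fracture".split()

true_pii = {1, 2, 6}  # Jane, Smith, 03/15/78
flagged = words_flagged_by_diff(original, rewritten)
false_boxes = flagged - true_pii

print("flagged words:", [original[i] for i in sorted(flagged)])
print("boxes on non-PII words:", [original[i] for i in sorted(false_boxes)])
```

Even in this tiny case the diff flags "Pt" and "were" alongside the real PII, so two of five black boxes would land on text that should have stayed visible.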


The 4-step pipeline that works

The final architecture decomposes the problem into four steps, each with a single responsibility:

The 4-step redaction pipeline: an input document flows through Step 1 Classify, Step 2 Detect, Step 3 Redact, and Step 4 Validate (an LLM-as-a-judge audit), each in its own fresh conversation. On PASS the redacted output is returned; on FAIL the missed items are fed back and steps 3 and 4 re-run, up to a maximum of 3 rounds.

Step 1: Classify

Input: raw document text.
Output: document type and redaction category.

DOCUMENT_TYPE: Medical Note
CATEGORY: Medical

This determines which of the 7 specialist prompt sets will drive Step 2. A medical note activates the Medical detector (the HIPAA PHI identifiers, preserving diagnoses and medications). A police report activates the Tactical detector (protecting victim and witness identities while preserving suspect descriptions). One prompt does not fit all (D12).
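A rough sketch of how the classifier's two-line output could be parsed and routed, written in Python for brevity rather than the app's Kotlin. The function name `parse_classification` is hypothetical, and falling back to General on missing or unrecognized output is an assumption based on General being listed as the fallback category later in this post.

```python
KNOWN_CATEGORIES = {
    "Medical", "Financial", "Legal", "Tactical",
    "Journalism", "Field Service", "General",
}


def parse_classification(raw):
    """Parse 'KEY: value' lines into (document_type, category)."""
    fields = {}
    for line in raw.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # ignore lines that have no colon at all
            fields[key.strip().upper()] = value.strip()
    document_type = fields.get("DOCUMENT_TYPE", "Unknown")
    category = fields.get("CATEGORY", "General")
    if category not in KNOWN_CATEGORIES:
        category = "General"  # assumed fallback for unexpected model output
    return document_type, category


print(parse_classification("DOCUMENT_TYPE: Medical Note\nCATEGORY: Medical"))
print(parse_classification("I think this is a receipt."))
```

The returned category is what would select the Detect and Validate prompts for the later steps.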

Step 2: Detect

Input: document text + category-specific detection prompt.
Output: structured list of every PII item found.

NAME: Mrs. Chen
DATE: 3/15/48
MRN:  4471829
PHONE: 408-555-1234

The detector's only job is to find things. It does not replace them, does not think about output format, does not self-verify. This single focus is what lets it catch relational identifiers that the single-pass approach missed, such as "the patient's daughter Lisa", because all of its cognitive budget goes to understanding context, not juggling four tasks.

The output uses a simple line-based format (CATEGORY: text), not JSON. This is deliberate: a model this size sometimes struggles with JSON bracket matching, and the line-based format is more token-efficient (D8).
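One practical upside of the line-based format is that the parser is tiny and degrades gracefully. Below is a minimal Python sketch (not the app's Kotlin parser; `Detection` and `parse_detections` are hypothetical names). It splits on the first colon only, so values that contain colons survive, and it skips any line that does not fit the shape instead of failing the whole response.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    category: str
    text: str


def parse_detections(raw):
    """Parse 'CATEGORY: text' lines, skipping anything malformed."""
    detections = []
    for line in raw.splitlines():
        category, sep, text = line.partition(":")  # split on the first colon only
        category, text = category.strip(), text.strip()
        if sep and category and text:
            detections.append(Detection(category.upper(), text))
    return detections


model_output = """NAME: Mrs. Chen
DATE: 3/15/48
MRN:  4471829
PHONE: 408-555-1234"""

for item in parse_detections(model_output):
    print(item)
```

With JSON, one unbalanced bracket can invalidate the entire payload; here a bad line costs only that line.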

Step 3: Redact

Input: original text + detection list from Step 2.
Output: text with [CATEGORY_N] placeholders.

This is the step where the critical design decision lives. The original plan was deterministic string.replace(). After the 6-of-8 failure described above, this became an LLM call (D4).

The LLM receives both the original OCR text (errors and all) and the detection list (with the LLM's "corrected" versions). Because the LLM understands that "inquire@blucurrent. mail" and "inquire@ucal.com" refer to the same entity, it can perform the substitution correctly despite the mismatch. It handles OCR noise naturally because it operates on semantics, not string equality.

The tradeoff: one extra LLM call, adding roughly 1-5 seconds. If the LLM call fails, we fall back to deterministic replacement as a safety net. But in practice, the LLM call succeeds, and the accuracy improvement is not negotiable for a privacy tool (D6).
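The control flow of "LLM first, deterministic safety net" can be sketched as follows. This is Python pseudostructure under stated assumptions, not Redacto's implementation: `llm_redact` stands in for the real model call, and treating an empty reply as a failure is my assumption about what "fails" means.

```python
def deterministic_redact(original, detections):
    """Exact-match fallback: longest detections first, per-category numbering."""
    counters = {}
    redacted = original
    for category, text in sorted(detections, key=lambda d: len(d[1]), reverse=True):
        if text not in redacted:
            continue  # exact match missing; this is the known weakness of the fallback
        counters[category] = counters.get(category, 0) + 1
        redacted = redacted.replace(text, f"[{category}_{counters[category]}]")
    return redacted


def redact(original, detections, llm_redact):
    """Try the LLM substitution; fall back to exact replacement if it fails."""
    try:
        result = llm_redact(original, detections)
        if result and result.strip():
            return result
    except Exception:
        pass  # any model error falls through to the deterministic path
    return deterministic_redact(original, detections)


def broken_llm(original, detections):
    # Stand-in for a failed model call.
    raise RuntimeError("model unavailable")


note = "Mrs. Chen, MRN 4471829, was seen today."
items = [("NAME", "Mrs. Chen"), ("MRN", "4471829")]
fallback_output = redact(note, items, broken_llm)
print(fallback_output)
```

The fallback inherits the exact-match problem from Attempt 2, which is why it is only a safety net and not the primary path.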

Step 4: Validate

Input: the redacted output from Step 3, and nothing else.
Output: PASS or FAIL with a list of missed items.

Step 4 is an LLM-as-a-judge check: a separate model call whose only job is to grade the redacted output, not to produce it. The validator is an independent auditor. It receives the redacted text in a fresh conversation with no memory of what Steps 1-3 decided (D5). It reads the output cold and asks: "Does any PII remain?"

Using the model as its own judge is what makes the pipeline self-correcting. If it finds missed items, Steps 3 and 4 re-run with the missed items added to the detection list. Maximum 3 rounds total (1 initial + 2 retries) (D7). In testing, most documents pass in round 1. Those that fail after 3 rounds have systemic prompt issues, and retrying will not help.
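The redact-then-validate loop is simple enough to sketch. The Python below is an illustration, not the app's code: `redact_fn` and `validate_fn` are placeholders for the Step 3 and Step 4 model calls, and returning a `passed` flag after the final round is an assumption, since the post does not say what the app does with a document that never passes.

```python
def redact_with_validation(original, detections, redact_fn, validate_fn, max_rounds=3):
    """Run Step 3 and Step 4 until the validator passes or rounds run out.

    validate_fn returns a list of missed items; an empty list means PASS.
    Returns (redacted_text, rounds_used, passed).
    """
    items = list(detections)
    redacted = original
    for round_number in range(1, max_rounds + 1):
        redacted = redact_fn(original, items)
        missed = validate_fn(redacted)
        if not missed:
            return redacted, round_number, True
        # Feed the validator's findings back into the detection list.
        items.extend(m for m in missed if m not in items)
    return redacted, max_rounds, False


# Toy stand-ins so the loop can be exercised without a model.
def toy_redact(original, items):
    for item in items:
        original = original.replace(item, "[X]")
    return original


def toy_validate(redacted):
    return ["Lisa"] if "Lisa" in redacted else []


result = redact_with_validation(
    "Mrs. Chen came in with her daughter Lisa.", ["Mrs. Chen"], toy_redact, toy_validate
)
print(result)
```

In the toy run the first round misses "Lisa", the validator reports it, and the second round passes.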


Why every step gets a fresh conversation

This is Decision D5 in our engineering log, and it deserves its own section because it is counterintuitive. Reusing a conversation across steps would be faster, since there is no repeated context loading. Why throw that away?

Context pollution.

If the validator (Step 4) shares a conversation with the detector (Step 2), it has already "seen" the detection reasoning. It knows what the detector was thinking, what it found, what it was uncertain about. That makes it a terrible auditor. It is biased by its own prior work. It will not catch things the detector missed because it has already internalized the detector's blind spots.

A fresh conversation means the validator approaches the redacted text with no preconceptions. It is reading the output as an outsider would. This is the same principle that drives code review: the person who wrote the code should not be the only one reviewing it.

Two panels contrasting conversation strategies. Left, wrong: a single shared conversation runs Detect, Redact, and Validate in sequence, so the validator is biased by its own prior reasoning and has internalized the detector's blind spots. Right, correct: Detect, Redact, and Validate each run in a separate conversation, so the validator has no memory of prior steps and reads the redacted output as an outsider would.

On-device, the cost of a fresh conversation is low. Inference runs on the local NPU at zero marginal cost: no API fees, no rate limits. The additional latency (roughly 0.4 to 0.5 seconds for the validation step in the measurements later in this post) is acceptable because accuracy matters more than speed when the system's job is protecting private health information and financial data (D6).
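The isolation property can be stated in a few lines. The sketch below is Python with an entirely hypothetical `Conversation` class; it does not use or imitate the LiteRT-LM API. The point it shows is structural: if every step constructs a new conversation, the validator's model call can only ever see its own system prompt and the redacted text.

```python
class Conversation:
    """Hypothetical stand-in for one chat session with the on-device model."""

    def __init__(self, system_prompt):
        self.messages = [("system", system_prompt)]

    def send(self, user_text, model):
        self.messages.append(("user", user_text))
        reply = model(list(self.messages))  # the model sees this session only
        self.messages.append(("assistant", reply))
        return reply


def run_step(system_prompt, user_text, model):
    """Each pipeline step gets a brand-new conversation."""
    return Conversation(system_prompt).send(user_text, model)


seen_by_model = []


def recording_model(messages):
    # Fake model that records exactly what context it was given.
    seen_by_model.append(messages)
    return "PASS"


run_step("Detect PII.", "Patient Jane Smith, DOB 03/15/78", recording_model)
run_step("Audit this text for remaining PII.", "Patient [NAME_1], DOB [DATE_1]", recording_model)

validator_view = seen_by_model[-1]
print(validator_view)
```

The validator's context holds two messages and no trace of the original name or of the detector's output.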


Indexed-element image redaction

Image redaction required its own architectural breakthrough, documented as Decision D9.

The original approach for images: OCR extracts text with bounding boxes. The LLM redacts the text. Then match the redacted words back to the OCR bounding boxes and draw black rectangles.

This is the word-diff approach I described earlier, and it broke for the same reason: the LLM rewrites text, so the diff between original and redacted is unreliable.

The solution eliminates string matching entirely. Instead of giving the LLM raw text, we give it indexed elements:

[0] Patient
[1] Jane
[2] Smith
[3] DOB:
[4] 03/15/78
[5] compound
[6] fracture

The LLM returns index labels, not strings:

1:NAME
2:NAME
4:DATE

We map indices directly to bounding boxes. Index 1 corresponds to the bounding box for "Jane." Index 2 corresponds to "Smith." We draw black rectangles at those coordinates.

The indexed-element image redaction flow: ML Kit OCR emits indexed elements (0 Patient, 1 Jane, 2 Smith, and so on) to the Step 2 Detect LLM, which returns index labels rather than strings (1:NAME, 2:NAME, 4:DATE). An index-to-bounding-box mapping is lossless, so black rectangles are drawn at the coordinates for the flagged indices, producing the redacted image.

The index-to-bounding-box mapping is lossless. No text matching means no OCR error sensitivity. The HUD field count (how many items the UI reports as redacted) and the visual box count (how many black rectangles appear on the image) are always consistent.
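Here is a compact sketch of the index-to-box lookup in Python. The `OcrElement` type, the function name, and the pixel coordinates are invented for illustration; the real app works with ML Kit's OCR results in Kotlin. Ignoring unparseable lines and out-of-range indices is my assumption about sensible handling of stray model output.

```python
from typing import NamedTuple


class OcrElement(NamedTuple):
    text: str
    box: tuple  # (left, top, right, bottom) in pixels; values below are made up


def boxes_to_redact(elements, llm_output):
    """Map 'index:LABEL' lines to (index, label, box) without any string matching."""
    flagged = []
    for line in llm_output.splitlines():
        index_part, sep, label = line.partition(":")
        if not sep:
            continue
        try:
            index = int(index_part.strip())
        except ValueError:
            continue  # not an index line
        if 0 <= index < len(elements):  # drop indices the OCR never produced
            flagged.append((index, label.strip(), elements[index].box))
    return flagged


words = ["Patient", "Jane", "Smith", "DOB:", "03/15/78", "compound", "fracture"]
elements = [OcrElement(w, (10 + 60 * i, 20, 60 + 60 * i, 40)) for i, w in enumerate(words)]

flagged = boxes_to_redact(elements, "1:NAME\n2:NAME\n4:DATE\n99:NAME")
for index, label, box in flagged:
    print(index, label, box)
```

Because the lookup never compares text, an OCR misread of "Jane" would change nothing about which rectangle gets drawn.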

For long documents, OCR elements are chunked at 150 elements per chunk, and each chunk is processed through the detection LLM separately (D10). The limit comes from the deployed build, not from the model: Gemma 4 E2B is trained for a 128K context, but the build I shipped on-device is compiled with a 1,024-token KV cache (cache_length=1024, prefill 256). The app requests maxNumTokens=4000, but the compiled cache is what actually governs how much fits. Chunking at 150 elements keeps each detection pass comfortably inside that deployed budget without sacrificing coverage.
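Chunking itself is a few lines. In this Python sketch the function names are hypothetical, and numbering elements with their global index inside each chunk is an assumption on my part (the post does not say whether indices restart per chunk); global numbering has the advantage that returned labels map straight back to bounding boxes with no offset arithmetic.

```python
def chunk_elements(elements, chunk_size=150):
    """Yield (start_index, chunk) pairs covering every element exactly once."""
    for start in range(0, len(elements), chunk_size):
        yield start, elements[start:start + chunk_size]


def format_chunk(start, chunk):
    """Render a chunk as '[index] text' lines using global indices."""
    return "\n".join(f"[{start + offset}] {text}" for offset, text in enumerate(chunk))


ocr_words = [f"word{i}" for i in range(400)]  # placeholder OCR output
chunks = list(chunk_elements(ocr_words))

print([(start, len(chunk)) for start, chunk in chunks])
print(format_chunk(150, ["Jane", "Smith"]))
```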


The cost of doing it right

The 4-step pipeline is slower than single-pass. Here are the numbers I measured on a Galaxy S25 Ultra with a Snapdragon 8 Elite NPU:

| Architecture | Latency (229-char medical note) |
| --- | --- |
| Single-pass (1 LLM call) | ~1.5s |
| 3-step pipeline (Classify + Detect + Redact) | ~2.8s |
| 4-step pipeline (+ Validate) | ~3.3s |

These are directional figures from a single May-2026 session on one device. I did not retain the raw logs, and that phone is no longer available to me, so treat this as a one-session, order-of-magnitude read rather than a rigorous multi-run benchmark.

That is roughly 2x the latency of single-pass. Is it worth it?

Yes. And the reason is Decision D6 in our engineering log: accuracy over latency, no hard latency budget.

This is a privacy tool. A missed redaction means someone's medical record number, Social Security number, or home address leaks. The cost of a false negative is not "the user has to redo it"; it is a potential HIPAA violation, identity theft, or witness endangerment. In that context, an extra 1.8 seconds is not a meaningful cost.

The economics reinforce this. Inference runs on the local NPU at zero marginal cost. There are no API fees per call. There are no rate limits. The only cost is the user's time, and users of a privacy tool prefer thorough redaction over fast results.

Per-step latency breakdown (NPU, patient record via image pipeline), same single-session caveat as above, approximate values only:

| Step | Latency (approx) | TTFT (approx) | Tokens (approx) | Decode (approx) |
| --- | --- | --- | --- | --- |
| Step 1 (Classify) | ~290ms | ~95ms | ~10 | ~42 tok/s |
| Step 2 (Detect) | ~2,600ms | ~100ms | ~100 | ~42 tok/s |
| Step 3 (Redact) | ~4,700ms | ~100ms | ~190 | ~42 tok/s |
| Step 4 (Validate) | ~380ms | ~95ms | ~12 | ~42 tok/s |

I am deliberately not quoting per-step decode rates to a tenth of a token: I do not have the logs to back that precision. The decode rate lands around 42 tok/s and stays roughly flat across steps, which is what you would expect on this NPU, where decode is memory-bandwidth-bound rather than compute-bound, so generating more tokens does not change the per-token rate much. With a flat decode rate, latency tracks output length. Step 3 is the bottleneck in this run because redaction generates the most tokens: it must write the text out again with placeholders in place. Step 2 comes second because it must enumerate every PII item. Step 4 is fast because most documents pass validation, so the output is little more than a "PASS" verdict. The pipeline progress indicator ("Step 2/4: Detecting Medical identifiers...") keeps users informed during the 3-8 second wait (D18).
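As a sanity check on the table, a step's latency should be roughly time-to-first-token plus tokens divided by decode rate. The Python below applies that back-of-envelope model to the approximate figures above. It is arithmetic on the published numbers, not a new measurement, and the model itself (TTFT plus a flat decode rate) is a simplification.

```python
# (TTFT in ms, output tokens), taken from the approximate table above.
STEPS = {
    "Classify": (95, 10),
    "Detect": (100, 100),
    "Redact": (100, 190),
    "Validate": (95, 12),
}
DECODE_TOK_PER_S = 42  # roughly flat across steps


def estimate_ms(ttft_ms, tokens, tok_per_s=DECODE_TOK_PER_S):
    """Estimated step latency: time to first token plus decode time."""
    return ttft_ms + tokens / tok_per_s * 1000


estimates = {name: estimate_ms(ttft, tokens) for name, (ttft, tokens) in STEPS.items()}
total_s = sum(estimates.values()) / 1000

for name, ms in estimates.items():
    print(f"{name}: ~{ms:.0f} ms")
print(f"total: ~{total_s:.1f} s")
```

The estimates come out within about 15% of each reported latency and total about 7.8 seconds, in line with the upper end of the 3-8 second wait. The token count, not the time to first token, is what separates the steps.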


7 category-specific prompt sets

The final architectural piece is the prompt system, documented as Decision D12. Each of the 7 redaction categories has dedicated Detect and Validate prompts. Classify and Redact prompts are universal (shared across categories).

Why can you not use one prompt for all document types?

Because what counts as PII depends entirely on context. A clinical note must preserve "Type 2 diabetes" (it is a diagnosis, not an identifier) while redacting "Jane Smith" (it is a patient name). A police report must preserve suspect descriptions ("6'2", brown jacket, facial scar") while redacting victim names. A financial document must preserve dollar amounts and institution names while redacting account and routing numbers.

| Category | Detection focus | Key preserve rules |
| --- | --- | --- |
| Medical | HIPAA PHI identifiers (names, DOB, MRN, etc.) | Diagnoses, medications, vitals, body locations |
| Financial | Account/routing/card/SSN/TaxID | Dollar amounts, institution names, toll-free numbers |
| Legal | Buyer/seller/tenant names | Property specs, legal terms, dollar amounts |
| Tactical | Victim/witness/minor protection | Suspect descriptions, officer names, crime scene |
| Journalism | Source identity protection | Public officials, reporter names |
| Field Service | Customer PII + security credentials | Equipment details, fault codes |
| General | Broad PII detection (fallback) | Minimal preserve rules |

A single "find all PII" prompt would either over-redact (blanking out diagnoses in medical notes) or under-redact (missing gate codes in field service reports). The category system lets each prompt set be precise about what to redact and what to preserve.

The prompts are embedded as Kotlin constants in a PipelinePrompts object (D13). No runtime file reading, no asset loading. The prompts are compiled into the APK and available immediately. The design reference files (SKILL.md) in the prompts/ directory are for human review, not runtime consumption.


The pattern: decompose, isolate, verify

Looking back, the 4-step pipeline is an application of a principle that shows up across software engineering: when a single component is unreliable at a complex task, decompose it into focused steps with clear interfaces between them.

The specific insights from building Redacto:

  1. LLMs are unreliable multi-taskers on complex instructions. A small model asked to simultaneously classify, detect, replace, and verify will drop one of those tasks, usually the hardest one. Decomposition is not just good engineering; it is a reliability requirement.

  2. Semantic and deterministic operations do not compose cleanly. The LLM operates on meaning; string operations operate on bytes. When you chain them, the semantic layer's tendency to "correct" input corrupts the deterministic layer's assumptions. Either go fully semantic (LLM for substitution) or fully deterministic (regex for detection). Do not mix them at the boundary.

  3. Verification requires independence. A validator that shares context with the system it is validating is not a validator; it is a rubber stamp. The fix is LLM-as-a-judge: a separate, context-free model call that grades the output instead of producing it. Fresh conversations cost latency but buy genuine error detection.

  4. Index-based mapping eliminates an entire class of errors. Anywhere you find yourself doing string matching between LLM output and source data, ask whether you can use indices instead. The LLM is good at understanding what index 1 refers to. It is bad at reproducing the exact string that index 1 contains.

  5. Domain-specific prompts are not premature optimization. Different document types have genuinely different rules about what constitutes sensitive information. A universal prompt will always be a compromise.

These are lessons I learned by shipping broken versions and measuring the failures. The 4-step pipeline is not clever; it is the simplest architecture that actually works.


Full pipeline architecture

The full Redacto pipeline. Text input and image input converge: images first pass through ML Kit OCR to produce indexed elements, then both feed Step 1 Classify (universal prompt). The detected category (1 of 7) selects a category-specific prompt set, which drives Step 2 Detect. The detection list flows to Step 3 Redact (universal prompt), then to Step 4 Validate, an LLM-as-a-judge audit. On PASS the output is returned; on FAIL, if the round count is under 3, missed items are fed back and steps 3 and 4 re-run.




Jaydeep Shah is a developer with roots in embedded systems, Android platform internals, and silicon-level AI optimization. He now explores on-device AI inference - bringing models from the cloud to phones and edge hardware. Along with his team Edge Artists, he builds applications using LiteRT-LM and Gemma models on mobile hardware, and writes about what works, what breaks, and what he learns along the way. This post is part of the Edge AI from the Trenches series.

Last updated: July 2026
13th of 23 posts in the "Edge AI from the Trenches" series
