DEV Community

prajyu

PII Redaction Built Entirely in the Browser

Hey everyone! I’m gearing up to launch a new project I’ve been pouring a lot of love into. It's called Cloak.

The Problem

We constantly feed data into LLMs, but manually scrubbing sensitive Personally Identifiable Information (PII) is tedious and risky. I wanted a robust tool that could instantly redact sensitive data, with one absolute rule: the data must never leave the device.

Enter Cloak

Cloak is a privacy-first web application designed to redact PII from text, images, and PDFs instantly. I wanted to nail a highly immersive, Apple-inspired interface, so making the experience feel native, fluid, and heavily polished was a massive priority during development.

Here are the core features:

  • Zero Server Uploads: Drag and drop text, images, or PDFs into the app. Everything is processed entirely within your browser.
  • On-Device AI Detection: It uses standard regex patterns for predictable formats (like SSNs, credit cards, and bank accounts), but also includes an optional "Deep Scan". This utilizes an on-device NER model (Xenova/bert-base-NER) running via Web Workers to catch trickier entities.
  • Client-Side OCR: It extracts and redacts text directly from images utilizing tesseract.js.
  • LLM Response Restorer: Instead of just blacking out text, Cloak can generate a "Synthetic" version of your document. It swaps real names and IDs for fake ones. Once your LLM generates a response using the fake data, Cloak’s restorer maps your original data back into the output.
  • Visual Redaction Styles: You can toggle between Black Box, Blur, or Pixelate styles for image and text redactions.
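
The swap-and-restore idea from the "LLM Response Restorer" bullet can be sketched roughly as follows. This is a minimal illustration, not Cloak's actual code: the `Detection` type and the `makeSynthetic` and `restore` functions are hypothetical names, and it swaps in bracketed placeholder tokens such as `[PERSON_1]` rather than realistic fake names, which is a simplification of what the post describes.

```typescript
// Hypothetical shapes and names; the post does not show Cloak's internals.
type Detection = { start: number; end: number; label: string };
type SyntheticResult = { text: string; mapping: Record<string, string> };

// Replace each detected span with a placeholder and remember the original value.
function makeSynthetic(text: string, detections: Detection[]): SyntheticResult {
  // Prototype-free objects so values like "constructor" are safe as keys.
  const mapping: Record<string, string> = Object.create(null); // placeholder -> original
  const seen: Record<string, string> = Object.create(null);    // original -> placeholder
  const counters: Record<string, number> = Object.create(null);

  const sorted = detections.slice().sort((a, b) => a.start - b.start);
  let out = "";
  let cursor = 0;
  for (const d of sorted) {
    if (d.start < cursor) continue; // skip spans that overlap an earlier one
    const original = text.slice(d.start, d.end);
    let placeholder = seen[original];
    if (placeholder === undefined) {
      // First time we see this value: mint a new token for its label.
      const n = (counters[d.label] || 0) + 1;
      counters[d.label] = n;
      placeholder = `[${d.label}_${n}]`;
      seen[original] = placeholder;
      mapping[placeholder] = original;
    }
    out += text.slice(cursor, d.start) + placeholder;
    cursor = d.end;
  }
  out += text.slice(cursor);
  return { text: out, mapping };
}

// Put the original values back into whatever the LLM returned.
function restore(llmOutput: string, mapping: Record<string, string>): string {
  let result = llmOutput;
  for (const placeholder of Object.keys(mapping)) {
    result = result.split(placeholder).join(mapping[placeholder]);
  }
  return result;
}
```

Because repeated values reuse the same placeholder, the LLM still sees that two mentions refer to the same person, and the mapping never has to leave the browser. Restoring only works for placeholders the LLM reproduces verbatim, so a real tool would likely need to handle altered or dropped tokens as well.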

The Stack

Building a heavy computational tool that stays completely client-side meant relying on some great libraries:

  • Framework: Next.js
  • Styling & Animations: Tailwind CSS v4 alongside Framer Motion for buttery-smooth, native-feeling transitions and pill menus.
  • Database: dexie for saving your session history locally via IndexedDB.
  • Document Handling: pdf-lib and pdfjs-dist for client-side PDF parsing and rendering.

I’m finalizing the build and polishing the final animations before the official launch. I’d love to hear your thoughts on building local-first tools or dealing with PII in the age of LLMs. Let me know what you think in the comments!

Top comments (4)

VoltageGPU

Cool project! Using the browser for PII redaction without relying on server-side processing is a solid approach for privacy. I've worked on similar client-side processing pipelines for confidential computing—browser WebAssembly can be a great tool for keeping data local. If you're handling sensitive data, maybe consider how GPU acceleration (like via WebGPU or VoltageGPU) could speed things up for heavier workloads.

prajyu

Hey, thanks so much!
Really appreciate the feedback. You bring up a fantastic point.
Right now, I'm offloading the heavier processing, like the bert-base-NER model and the tesseract.js OCR, to Web Workers so the main thread stays unblocked and the UI stays fluid.
But you're completely right: as I start testing with larger, multi-page PDFs or heavier image batches, CPU-bound processing starts to show its limits. Moving the AI models over to WebGPU is a major milestone on my roadmap, and I'll look into VoltageGPU as part of that research to help push the performance boundaries.
Thanks again for the suggestion and for checking the project out ❤️

Nazar Boyko

The synthetic swap and restore is the clever piece, way better than black boxes that leave the document useless to the LLM. The thing I'd worry about most is a missed entity, since the whole promise breaks the first time the NER model doesn't recognize a name and it sails through unredacted. bert-base-NER is solid but it'll miss unusual names and anything far from what it was trained on. Do you surface a confidence pass or let people eyeball what got caught before it leaves, or is regex plus NER the whole net? For a privacy tool the false negative is the scary case, not the false positive.

prajyu

Hey Nazar, thanks so much, I really appreciate the feedback.

You hit the most critical point: a false negative is the worst-case scenario for a privacy tool. You are 100% right that bert-base-NER (and standard regex) won't catch every edge case or highly unusual name.

To combat this, the entire workflow is built around a "human-in-the-loop" philosophy. It is not a blind pass-through. Here is how Cloak handles the safety net:

  • Visual Review (Eyeballing): Before anything is exported, the user stays on the "Original" tab where every detected entity is highlighted in the text (or boxed on the image). You can clearly see exactly what was caught and what wasn't before moving forward.
  • Confidence Scores: The side panel shows the confidence percentage for every detection, so you can see how sure the engine was when it flagged it (the NER model is currently tuned to a 0.6 threshold).
  • Manual Overrides & Brush Tool: If the engine catches a false positive, you can click the entity to ignore it. If you are working with images and it completely misses a custom string, there is a manual brush tool to paint over the missed PII yourself.
  • Custom Pattern Editor: If your documents have a specific proprietary format or unusual naming conventions that the NER misses, there is a Pattern Editor where you can add your own custom regex rules to run alongside the standard engine.
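
As a rough sketch of how the pieces in this list could fit together, the snippet below merges regex findings (built-in plus user-supplied rules) with NER findings filtered at the 0.6 threshold mentioned above. Everything here is an assumption for illustration: the `Finding` and `PatternRule` types, the `runRules` and `collectFindings` functions, and the two simplified built-in patterns are hypothetical, and regex matches are given a score of 1 purely as a convention.

```typescript
// Hypothetical shapes; the post does not show Cloak's internal types.
type Finding = {
  start: number;
  end: number;
  label: string;
  score: number; // 0..1 confidence; regex matches use 1 by convention here
  source: "regex" | "ner";
};
type PatternRule = { label: string; pattern: RegExp };

const NER_THRESHOLD = 0.6; // the threshold mentioned in the thread

// Simplified illustrative patterns; production rules need more care.
const BUILT_IN_RULES: PatternRule[] = [
  { label: "SSN", pattern: /\b\d{3}-\d{2}-\d{4}\b/ },
  { label: "EMAIL", pattern: /\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b/ },
];

// Run every rule over the text and record each match as a finding.
function runRules(text: string, rules: PatternRule[]): Finding[] {
  const findings: Finding[] = [];
  for (const rule of rules) {
    // Rebuild the regex with the global flag so exec() walks all matches.
    // Only the i and m flags are carried over in this sketch.
    const p = rule.pattern;
    const flags = "g" + (p.ignoreCase ? "i" : "") + (p.multiline ? "m" : "");
    const re = new RegExp(p.source, flags);
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
      if (m[0].length === 0) {
        re.lastIndex += 1; // avoid looping forever on empty matches
        continue;
      }
      findings.push({
        start: m.index,
        end: m.index + m[0].length,
        label: rule.label,
        score: 1,
        source: "regex",
      });
    }
  }
  return findings;
}

// Combine regex findings with NER findings at or above the threshold,
// sorted by position so a review panel can list them in reading order.
function collectFindings(
  text: string,
  nerFindings: Finding[],
  customRules: PatternRule[] = []
): Finding[] {
  const fromRegex = runRules(text, BUILT_IN_RULES.concat(customRules));
  const fromNer = nerFindings.filter((f) => f.score >= NER_THRESHOLD);
  return fromRegex.concat(fromNer).sort((a, b) => a.start - b.start);
}
```

A review panel could then render each finding's `score` as a percentage and let the user dismiss false positives before anything is exported. This sketch does not resolve overlaps between regex and NER findings, which a real pipeline would need to handle.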

The AI and Regex are really just there to do 95% of the tedious heavy lifting, but the user always gets the final visual sign-off before hitting that "Copy Synthetic" button.

Thanks again for bringing this up, it’s the exact right mindset to have when evaluating security tools ❤️