Shuai Shao

Posted on • Originally published at Medium

Why I Built a 100% Offline AI Tool (PySide6 + Microsoft Presidio) to Permanently Redact PDFs

As developers, we handle tons of sensitive data every day: API keys, legal contracts, financial statements, and user metrics. A few months ago, I needed to redact some Personally Identifiable Information (PII) from a batch of PDF documents.

I looked into existing online tools, but they all required one terrifying thing: uploading confidential files to their cloud servers. For anyone handling HIPAA, GDPR, or corporate legal data, that's an absolute dealbreaker. Even worse, many standard PDF editors only place a black vector box over the text, leaving the underlying sensitive data completely scrapable.

So, I spent the last 2 months building a local-first, privacy-focused desktop app to solve this permanently: PII Blackout.

Here is how I built it, the tech stack behind it, and the engineering challenges I faced.


🛠️ The Tech Stack & Architecture

I wanted the app to be cross-platform, highly performant, and capable of running heavy AI inference locally without breaking a sweat.

  • GUI Framework: PySide6 (Qt for Python). It provides native desktop performance and smooth UI rendering, which is essential for handling massive PDF files.
  • Core PII Engine: Built on Microsoft Presidio. I chose Presidio because of its extensible orchestrator design, which let me combine rule-based pattern matchers (regex) with NER models (like GLiNER). With this combination the app auto-detects names, emails, phone numbers, and physical addresses out of the box.
  • PDF Processing: Custom backend that flattens the document and burns the blackouts directly into its image layer, so the redacted text is no longer present in the output file to recover or reverse-engineer.
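
As a rough illustration of the Core PII Engine idea above (several detectors feeding one merged result), here is a minimal plain-Python sketch. It does not use Presidio's actual API and is not the app's code: `Finding`, `regex_detector`, `combine`, and `fake_ner` are hypothetical names, the two regexes are deliberately simplistic, and `fake_ner` is only a stand-in for a model-based detector such as GLiNER.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Finding:
    entity: str   # e.g. "EMAIL_ADDRESS"
    start: int    # character offsets into the analyzed text
    end: int
    score: float  # detector confidence, 0..1


Detector = Callable[[str], Iterable[Finding]]


def regex_detector(entity: str, pattern: str, score: float) -> Detector:
    """Build a rule-based detector from a single regular expression."""
    compiled = re.compile(pattern)

    def detect(text: str) -> list[Finding]:
        return [Finding(entity, m.start(), m.end(), score)
                for m in compiled.finditer(text)]

    return detect


def fake_ner(text: str) -> list[Finding]:
    """Hypothetical stand-in for an NER model: it only knows one name."""
    name = "Jane Doe"
    i = text.find(name)
    return [Finding("PERSON", i, i + len(name), 0.85)] if i >= 0 else []


def combine(text: str, detectors: list[Detector]) -> list[Finding]:
    """Run every detector; among overlapping findings keep the best score."""
    candidates = [f for detect in detectors for f in detect(text)]
    candidates.sort(key=lambda f: (-f.score, f.start))
    kept: list[Finding] = []
    for f in candidates:
        if all(f.end <= k.start or f.start >= k.end for k in kept):
            kept.append(f)
    return sorted(kept, key=lambda f: f.start)


# Simplistic illustrative patterns, not production-quality rules.
EMAIL = regex_detector("EMAIL_ADDRESS", r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+", 0.9)
PHONE = regex_detector("PHONE_NUMBER", r"\+?\d[\d\s().-]{7,}\d", 0.6)
```

Calling `combine(text, [EMAIL, PHONE, fake_ner])` returns non-overlapping findings sorted by position, which is the shape a redaction step needs in order to map character spans back to page coordinates.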

🛡️ Core Engineering Challenges

1. Adapting Microsoft Presidio for Local Execution

Microsoft Presidio is fantastic, but it's often deployed as a cloud service or containerized API. Adapting its architecture to run entirely inside a local Python client environment required optimizing the loading times of the underlying AI models and managing memory efficiently.
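
One common way to keep model loading from hurting startup is to load each model lazily, exactly once, and optionally warm it up on a background thread. The sketch below shows that pattern with the standard library only. It is an assumption about how this could be done, not the app's actual code; `get_model`, `warm_up`, and the `loader` callable are hypothetical names.

```python
import threading
from typing import Any, Callable

_models: dict[str, Any] = {}
_lock = threading.Lock()


def get_model(name: str, loader: Callable[[str], Any]) -> Any:
    """Return the cached model, loading it on first use only.

    The lock is held during loading, so concurrent callers wait for the
    first load instead of each loading their own copy into memory.
    """
    with _lock:
        if name not in _models:
            _models[name] = loader(name)
        return _models[name]


def warm_up(name: str, loader: Callable[[str], Any]) -> threading.Thread:
    """Start loading in the background so the UI thread is not blocked."""
    thread = threading.Thread(target=get_model, args=(name, loader), daemon=True)
    thread.start()
    return thread
```

In a Qt application the same idea would more likely use Qt's own threading tools, but the single-load cache is the part that keeps memory use predictable.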

💡 Lesson learned: Plan for at least 12 GB of RAM; that is the sweet spot for smooth local AI processing when parsing several multi-page documents at once.

2. Destructive Redaction vs. Lazy Masking

Many PDF tools just draw a black box over the text or change its background color to black. The text itself stays in the file, so copy-pasting that section reveals it. PII Blackout instead flattens the document, destroying the sensitive data layers and rendering the result as a single image layer with the blackouts burned in.
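
To make the difference concrete, here is a toy model of a page, not a real PDF library and not the app's backend. `VectorPage`, `lazy_mask`, and `flatten_and_burn` are hypothetical names, and the "pixels" are just a small grid of integers. The point it tries to show: an overlay leaves the text layer extractable, while a flattened page has no text layer at all and the covered pixels are overwritten.

```python
from dataclasses import dataclass, field

WHITE, INK, BLACK = 255, 128, 0  # background, glyph pixels, blackout


@dataclass(frozen=True)
class Box:
    x0: int
    y0: int
    x1: int  # exclusive
    y1: int  # exclusive


@dataclass
class TextSpan:
    text: str
    box: Box


@dataclass
class VectorPage:
    width: int
    height: int
    spans: list[TextSpan]
    overlays: list[Box] = field(default_factory=list)


def extract_text(page: VectorPage) -> str:
    """What copy-paste or a scraper would see: overlays are ignored."""
    return " ".join(span.text for span in page.spans)


def lazy_mask(page: VectorPage, boxes: list[Box]) -> VectorPage:
    """Overlay-only 'redaction': the text layer is left untouched."""
    return VectorPage(page.width, page.height, list(page.spans),
                      page.overlays + list(boxes))


def _fill(pixels: list[list[int]], box: Box, value: int) -> None:
    for y in range(box.y0, box.y1):
        for x in range(box.x0, box.x1):
            pixels[y][x] = value


def flatten_and_burn(page: VectorPage, boxes: list[Box]) -> list[list[int]]:
    """Rasterize the page, then overwrite the boxed pixels with black.

    The result is only a pixel grid: there is no text layer left, and the
    original pixel values under each box are gone.
    """
    pixels = [[WHITE] * page.width for _ in range(page.height)]
    for span in page.spans:
        _fill(pixels, span.box, INK)
    for box in boxes:
        _fill(pixels, box, BLACK)
    return pixels
```

A real implementation would rasterize actual glyphs at a chosen DPI and write the images back into a new PDF, at the cost of losing selectable text everywhere on the page.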

3. Batch Processing Optimization

To make it useful for professionals, I implemented a drag-and-drop batch processing system. Users can drop an entire folder of hundreds of PDFs, and the app will queue and redact them locally in seconds.
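
A folder queue like this can be sketched with the standard library. The example below is an assumption about the general shape, not the app's implementation: `redact_folder` is a hypothetical name, and `redact_pdf` is a placeholder that only copies bytes where the real detection and redaction would run. A PySide6 app would more likely dispatch the work through Qt's thread pool so progress can be reported to the UI.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
from pathlib import Path
from typing import Callable, Union


def redact_pdf(src: Path, out_dir: Path) -> Path:
    """Placeholder worker: copies the file instead of redacting it."""
    dst = out_dir / f"{src.stem}_redacted.pdf"
    dst.write_bytes(src.read_bytes())
    return dst


def redact_folder(
    folder: Path,
    out_dir: Path,
    workers: int = 4,
    redact: Callable[[Path, Path], Path] = redact_pdf,
) -> dict[Path, Union[Path, Exception]]:
    """Queue every PDF under `folder` and process them on a worker pool.

    Returns a mapping of source path to output path, or to the exception
    raised for that file, so one bad PDF does not abort the whole batch.
    Keep `out_dir` outside `folder` so old outputs are not re-queued.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    pdfs = sorted(p for p in folder.rglob("*.pdf") if p.is_file())
    results: dict[Path, Union[Path, Exception]] = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(redact, p, out_dir): p for p in pdfs}
        for future in as_completed(futures):
            src = futures[future]
            try:
                results[src] = future.result()
            except Exception as exc:  # record the failure and continue
                results[src] = exc
    return results
```

For CPU-heavy model inference a process pool may scale better than threads, but each process would then hold its own copy of the models, which matters given the RAM figures above.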


🚀 Check it out (and Feedback Welcome!)

I've just officially released v1.0.2 and would love to get the DEV community's feedback on the UI/UX and the local performance.

🎁 Free Tier for Developers & Solo Users

The free tier lets you process up to 3 PDFs per day (max 15 pages per document, with a light watermark). If you need heavy enterprise usage, there are Pro tiers available too.

If you are familiar with Microsoft Presidio, building local-first AI tools, or working with PySide6, let's connect in the comments! How are you handling PII security in your own workflows?
