How to Automate PII Redaction and AI Data Extraction with Python (Serverless API)

#ai #python #automation #api

Handling unstructured documents is one of the most tedious tasks in software engineering. Whether you are building an HR platform, an accounting ERP, or a legal tech app, you eventually face two massive headaches: Data Privacy (GDPR/LGPD compliance) and Structured Data Extraction.

How do you safely share a resume without leaking the candidate's phone number? How do you extract the total tax amount from a messy PDF invoice without writing hundreds of fragile Regex rules?

Enter the NeoPII & Extract API. In this article, I’ll walk you through how to use this serverless, GPT-4o-powered API to sanitize documents and extract structured JSON data using Python.

🌟 The Opportunities: What Can You Build?
The API is divided into two main engines:

NeoShield (Data Anonymization)
This engine takes a file, scans it using NLP and advanced Regex, and replaces Personally Identifiable Information (PII), such as emails, credit cards, and social security numbers, with safe tags (e.g., [EMAIL], [CPF]). The killer feature? It natively supports and reconstructs binary files like .DOCX, .XLSX, and .PDF. It doesn't just return flat text; it returns your original Excel sheet or Word document with the same formatting, just safely redacted.

Use cases: Unbiased recruitment pipelines, public legal document filings, and secure data sharing.
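
To make the tag replacement concrete, here is a minimal local sketch of the idea using two regular expressions. The patterns and the mask_text helper are illustrative assumptions, not the API's actual detection rules, which also rely on NLP.

```python
import re

# Illustrative patterns only; the real service's rules are not shown in this article.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "CPF": re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),  # Brazilian CPF: 000.000.000-00
}


def mask_text(text: str) -> str:
    """Replace every match with a bracketed tag such as [EMAIL]."""
    for tag, pattern in PATTERNS.items():
        text = pattern.sub(f"[{tag}]", text)
    return text


sample = "Contact ana@example.com, CPF 123.456.789-09."
masked = mask_text(sample)
print(masked)  # Contact [EMAIL], CPF [CPF].
```

Regex alone misses names and addresses, which is presumably where the NLP side of the engine comes in.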

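Neo-Struct (AI Data Extraction)
Powered by Azure OpenAI (GPT-4o), this endpoint reads unstructured files and forces the output into a strict JSON schema that you define. The schema constrains the shape of the response, which avoids broken JSON; the extracted values themselves should still be validated on critical fields.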

Use cases: Automating invoice data entry, extracting clauses from contracts, and digitizing messy historical records.

💻 Tutorial: Python Implementation
Let’s get our hands dirty. First, make sure you have the requests library installed:

pip install requests

Scenario A: Masking a Confidential Document
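When working with APIs that accept files, the biggest pitfall in Python is how the requests library handles multipart/form-data. To ensure the serverless backend correctly parses your file, send it as a tuple (filename, file_object, mime_type) so the uploaded part carries an explicit filename and content type. Do not set the Content-Type header yourself: requests has to generate it so that it includes the multipart boundary.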
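
If you want to see what the tuple actually changes, you can build the request locally and inspect it without sending anything. This is a small sketch using requests' prepared-request mechanism; the in-memory file is a stand-in for a real document.

```python
import io

import requests

# An in-memory stand-in for a real file, so the sketch runs without touching disk.
fake_file = io.BytesIO(b"dummy bytes")

request = requests.Request(
    "POST",
    "https://neopii-extract.p.rapidapi.com/mask_file",
    files={"file": ("confidential_contract.docx", fake_file, "application/octet-stream")},
)
prepared = request.prepare()  # builds headers and body locally; nothing is sent

# requests generates the boundary and puts it in the Content-Type header.
print(prepared.headers["Content-Type"])
# The part headers carry the filename and MIME type taken from the tuple.
print(prepared.body[:200].decode("latin-1"))
```

The first line printed starts with multipart/form-data; boundary=, and the part headers show the filename and application/octet-stream.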

Here is how you mask a .docx file:

import requests
import os

API_URL = "https://neopii-extract.p.rapidapi.com/mask_file"
HEADERS = {
    "X-RapidAPI-Host": "neopii-extract.p.rapidapi.com",
    "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY"
}

file_path = "confidential_contract.docx"

with open(file_path, "rb") as f:
    # The tuple sets the uploaded part's filename and MIME type;
    # requests generates the multipart boundary on its own.
    files = {"file": (os.path.basename(file_path), f, "application/octet-stream")}

    print("Anonymizing document...")
    # requests has no default timeout; 60 s leaves room for a cold start.
    response = requests.post(API_URL, files=files, headers=HEADERS, timeout=60)

    if response.status_code == 200:
        with open("masked_contract.docx", "wb") as out:
            out.write(response.content)
        print("Success! The document was sanitized and saved.")
    else:
        print(f"Error {response.status_code}: {response.text}")
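
Before overwriting anything on disk, it can be worth checking that the response body really is a Word document and not, say, a JSON error page. A .docx file is a ZIP archive whose main part is typically word/document.xml, so a rough check might look like this. The looks_like_docx helper is a hypothetical name, and the stand-in archive below replaces a real API response.

```python
import io
import zipfile


def looks_like_docx(payload: bytes) -> bool:
    """Return True if the bytes form a ZIP archive containing word/document.xml."""
    if not zipfile.is_zipfile(io.BytesIO(payload)):
        return False
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        return "word/document.xml" in archive.namelist()


# Build a tiny stand-in archive to exercise the check without calling the API.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", "<w:document/>")

docx_ok = looks_like_docx(buffer.getvalue())
json_ok = looks_like_docx(b'{"error": "bad request"}')
print(docx_ok, json_ok)  # True False
```

In the script above you would call looks_like_docx(response.content) before writing masked_contract.docx.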

Scenario B: Extracting Structured JSON from a PDF
Imagine you have a messy PDF invoice and you only care about the vendor's name, the total amount, and the due date. You can simply declare a JSON schema and let the AI do the heavy lifting.

import requests
import json
import os

API_URL = "https://neopii-extract.p.rapidapi.com/extract"
HEADERS = {
    "X-RapidAPI-Host": "neopii-extract.p.rapidapi.com",
    "X-RapidAPI-Key": "YOUR_RAPIDAPI_KEY"
}

file_path = "vendor_invoice.pdf"

# Define the exact data structure you want the AI to return
my_schema = json.dumps({
    "vendor_name": "string",
    "total_amount": "number",
    "due_date": "string (ISO format)"
})

with open(file_path, "rb") as f:
    files = {"file": (os.path.basename(file_path), f, "application/octet-stream")}
    data = {"schema": my_schema}

    print("Extracting data via AI...")
    # requests has no default timeout; 60 s leaves room for a cold start.
    response = requests.post(API_URL, files=files, data=data, headers=HEADERS, timeout=60)

    if response.status_code == 200:
        structured_data = response.json()
        print("\nExtraction Complete:")
        print(json.dumps(structured_data, indent=2))
    else:
        print(f"Error {response.status_code}: {response.text}")
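
Even with a schema, it is sensible to validate what comes back before it reaches your database. The sketch below assumes the endpoint returns a flat JSON object whose keys match the schema; that response shape is an assumption, and find_problems is a hypothetical helper, not part of the API.

```python
from datetime import date

# Python types expected for each field of the invoice schema.
EXPECTED_TYPES = {
    "vendor_name": str,
    "total_amount": (int, float),
    "due_date": str,
}


def find_problems(result: dict) -> list:
    """Return a list of human-readable problems; an empty list means the result looks sane."""
    problems = []
    for field, expected in EXPECTED_TYPES.items():
        if field not in result:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(result[field], bool) or not isinstance(result[field], expected):
            problems.append(f"wrong type for {field}: {type(result[field]).__name__}")
    if isinstance(result.get("due_date"), str):
        try:
            date.fromisoformat(result["due_date"])
        except ValueError:
            problems.append("due_date is not an ISO date")
    return problems


good = {"vendor_name": "Acme Ltd", "total_amount": 1250.5, "due_date": "2024-07-31"}
bad = {"vendor_name": "Acme Ltd", "total_amount": "1,250.50", "due_date": "31/07/2024"}
print(find_problems(good))  # []
print(find_problems(bad))
```

For the second sample this reports a wrong type for total_amount and a non-ISO due_date, which are the kinds of slips you want to catch before trusting the numbers.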

⚠️ Technical Limitations & Best Practices
Before you push to production, there are a few architectural constraints you should keep in mind:

Serverless "Cold Starts"
The API is built on a serverless architecture to ensure scalability. If the API hasn't been called in a while, the first request might take around 10 to 15 seconds while the heavy NLP libraries (like spaCy) load into memory. Subsequent requests are much faster (usually under 2 seconds).

Best Practice: Set a client timeout well above 15 seconds and retry requests that time out, or send a simple background ping to the /health endpoint to keep the engine warm during business hours.
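
A retry wrapper does not need to be elaborate. Here is a minimal sketch with exponential backoff; call_with_retry and its parameters are hypothetical names, and the flaky function simulates a request that times out once during a cold start.

```python
import time


def call_with_retry(send, attempts=3, base_delay=2.0, sleep=time.sleep):
    """Call send() until it succeeds or attempts run out, doubling the wait each time."""
    for attempt in range(attempts):
        try:
            return send()
        except Exception:  # narrow this to timeout/connection errors in real code
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Stand-in for a request that times out once during a cold start, then succeeds.
calls = {"count": 0}


def flaky_request():
    calls["count"] += 1
    if calls["count"] == 1:
        raise TimeoutError("cold start")
    return "ok"


waits = []
result = call_with_retry(flaky_request, sleep=waits.append)  # record waits instead of sleeping
print(result, waits)  # ok [2.0]
```

With requests, send would be a small function that opens the file and posts it, so each retry starts from a fresh file handle, and the except clause could be narrowed to requests.exceptions.Timeout and requests.exceptions.ConnectionError.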

Context Window (Token Limits)
The AI extraction endpoint (/extract) uses GPT-4o with an operational limit of around 15,000 tokens. If you send a highly dense 50-page PDF, the text will be truncated to fit the context window, which might lead to missing data.

Best Practice: If you have massive documents, split them into smaller chunks and extract from each chunk separately. The limit is on tokens, not bytes, so even a small but text-dense file can be truncated.
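
One simple way to chunk, assuming you already have the document's text, is to pack paragraphs greedily under an estimated token budget. The 4-characters-per-token ratio below is a rough rule of thumb for English text, not something the API documents, and split_into_chunks is a hypothetical helper.

```python
def split_into_chunks(text: str, max_tokens: int = 15000, chars_per_token: int = 4) -> list:
    """Greedily pack paragraphs into chunks that stay under an estimated token budget."""
    budget = max_tokens * chars_per_token  # approximate character budget per chunk
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        # A single paragraph longer than the budget is kept whole as its own chunk.
        if len(candidate) <= budget or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


text = "\n\n".join(["a" * 30, "b" * 30, "c" * 30])
chunks = split_into_chunks(text, max_tokens=10)  # 40-character budget for the demo
print(len(chunks))  # 3
```

Each chunk could then be sent to /extract as its own small text file, with the per-chunk results merged afterwards. For a precise count, a real tokenizer would replace the character estimate.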

File Size Limits
To prevent timeouts and ensure low latency, the API enforces strict size limits:

PDF: 10 MB

DOCX: 5 MB

XLSX: 3 MB

TXT / JSON / CSV: 1-2 MB
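
Checking the size locally saves a round trip on files that would be rejected anyway. The sketch below encodes the limits listed above; treating "1-2 MB" as 1 MB and one MB as 1024 * 1024 bytes are conservative assumptions, and check_upload is a hypothetical helper.

```python
import os
import tempfile

MB = 1024 * 1024  # assumption: the article does not say whether MB means 10^6 or 2^20 bytes

# Limits from the list above; 1 MB is a conservative reading of "1-2 MB".
SIZE_LIMITS = {
    ".pdf": 10 * MB,
    ".docx": 5 * MB,
    ".xlsx": 3 * MB,
    ".txt": 1 * MB,
    ".json": 1 * MB,
    ".csv": 1 * MB,
}


def check_upload(path: str) -> None:
    """Raise ValueError if the file type is not listed or the file is too large."""
    ext = os.path.splitext(path)[1].lower()
    if ext not in SIZE_LIMITS:
        raise ValueError(f"unlisted file type: {ext or 'none'}")
    size = os.path.getsize(path)
    if size > SIZE_LIMITS[ext]:
        name = os.path.basename(path)
        raise ValueError(f"{name} is {size} bytes; limit for {ext} is {SIZE_LIMITS[ext]} bytes")


message = ""
with tempfile.TemporaryDirectory() as folder:
    small = os.path.join(folder, "notes.txt")
    with open(small, "wb") as handle:
        handle.write(b"x" * 100)
    check_upload(small)  # passes silently

    large = os.path.join(folder, "export.csv")
    with open(large, "wb") as handle:
        handle.write(b"x" * (2 * MB))
    try:
        check_upload(large)
    except ValueError as error:
        message = str(error)
print(message)
```

Calling check_upload(file_path) right before opening the file keeps the upload scripts above unchanged otherwise.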

Final Thoughts
The days of writing endless Regex to parse PDFs or risking data leaks in file processing are over. By combining native binary reconstruction with LLM-based extraction, you can build highly secure, enterprise-grade applications in minutes.

Ready to give it a try? You can grab your API key and test it out for free on the RapidAPI Hub. Happy coding! 🚀
