Introducing gdpr-safe-rag: Build GDPR-Compliant RAG Systems in Minutes

After spending 9 months building production AI systems for UK organisations—including government-funded community programs and OFQUAL-regulated education platforms—I kept encountering the same critical problem:

Most RAG (Retrieval-Augmented Generation) implementations leak Personally Identifiable Information (PII) directly into vector databases.

So I built a solution. Today, I'm open-sourcing it.

🔒 The Problem: RAG Systems and Privacy

RAG has become the go-to architecture for building AI applications that need to reference private documents, customer data, or organisational knowledge. But there's a dangerous gap in most implementations:

# What most RAG implementations do:
documents = load_documents("customer_support_tickets.txt")
vectorstore = Chroma.from_documents(documents, embeddings)

# Problem: customer_support_tickets.txt contains:
# - Email addresses
# - Phone numbers  
# - NHS numbers
# - Home addresses
# - Credit card numbers

# All of this PII is now in your vector database. 😱

This creates three major compliance risks:

  1. Data Minimisation Violation: GDPR Article 5(1)(c) requires collecting only necessary data. Storing raw PII in vectors violates this.

  2. No Audit Trail: GDPR Article 30 requires keeping records of processing activities. Without logging, you can't prove compliance.

  3. Right to Erasure: GDPR Article 17 requires deleting user data on request. Can you delete embeddings? Where's the mapping?

For organisations in regulated sectors—healthcare, finance, education, government—this isn't theoretical. It's a legal liability.

🛠️ The Solution: gdpr-safe-rag

I built gdpr-safe-rag to solve this problem at the architectural level. It's a production-grade Python toolkit that adds three critical capabilities to any RAG system:

1. Automatic PII Detection & Redaction

2. Built-in Audit Logging

3. Compliance Validation

Here's how it works:

📦 Quick Start

Install via pip:

pip install gdpr-safe-rag

Basic usage:

from gdpr_safe_rag import PIIDetector, AuditLogger

# Initialise components
detector = PIIDetector(region="UK")
logger = AuditLogger(storage_path="./audit_logs")

# Process a document with PII
document = """
Customer: John Smith
Email: john.smith@example.co.uk
Phone: 07700 900123
NHS Number: 485 777 3456
Issue: Request refund for order #98765
"""

# Detect PII
pii_items = detector.detect(document)
print(f"Found {len(pii_items)} PII items")
# Output: Found 3 PII items (email, phone, NHS number)

# Redact PII before storing in vector database
clean_doc, mapping = detector.redact(document)

print(clean_doc)
# Output:
# Customer: John Smith
# Email: [EMAIL_1]
# Phone: [UK_PHONE_1]
# NHS Number: [NHS_NUMBER_1]
# Issue: Request refund for order #98765
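# Note: the customer name "John Smith" stays visible here because names are
# only picked up by the optional spaCy NER pass, not the built-in regex patterns.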

# Log the operation for compliance
logger.log_ingestion(
    document_id="ticket_12345",
    user_id="system",
    pii_detected=[item.type for item in pii_items],
    pii_count=len(pii_items)
)

Now your vector database only contains [EMAIL_1] instead of actual email addresses. ✅
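One practical question the snippet leaves open is where the mapping returned by redact() should live. Here's a minimal sketch of one option, assuming the mapping is a plain placeholder-to-value dict (store_mapping, erase_mapping, and the ./pii_mappings path are hypothetical helpers, not part of gdpr-safe-rag): keep it outside the vector database, keyed by document ID, so an Article 17 request can remove the original values.

import json
from pathlib import Path

# Hypothetical helpers, not part of gdpr-safe-rag: keep PII mappings on
# encrypted, access-controlled storage, separate from the vector database.
MAPPING_STORE = Path("./pii_mappings")
MAPPING_STORE.mkdir(exist_ok=True)

def store_mapping(document_id: str, mapping: dict) -> None:
    """Persist the placeholder -> original-value mapping for one document."""
    (MAPPING_STORE / f"{document_id}.json").write_text(json.dumps(mapping))

def erase_mapping(document_id: str) -> None:
    """Erasure request: drop the mapping so placeholders can never be reversed."""
    (MAPPING_STORE / f"{document_id}.json").unlink(missing_ok=True)

# 'mapping' comes from detector.redact(...) above
store_mapping("ticket_12345", mapping)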

🏗️ Architecture Deep Dive

Let me show you how this integrates into a real RAG pipeline:

Component 1: PII Detector

The PII Detector uses a combination of validated regex patterns and optional NER (Named Entity Recognition) to identify sensitive data:

UK/EU-Specific Patterns:

  • UK Postcodes: SW1A 1AA
  • UK Phone Numbers: 07700 900123, +44 7700 900123
  • NHS Numbers: 485 777 3456 (with checksum validation!)
  • NI Numbers: AB 12 34 56 C
  • EU Phone Numbers, IBANs, VAT numbers

Common Patterns:

  • Email addresses
  • Credit card numbers (with Luhn algorithm validation)
  • Generic phone numbers
  • Names and addresses (via spaCy NER)

What makes this production-grade? Validation, not just pattern matching.

# Example: NHS Number validation with checksum
def validate_nhs_number(nhs_str: str) -> bool:
    """Validate a 10-digit NHS number using the modulus 11 algorithm."""
    cleaned = nhs_str.replace(' ', '')
    if len(cleaned) != 10 or not cleaned.isdigit():
        return False
    digits = [int(d) for d in cleaned]
    # Weight the first nine digits from 10 down to 2
    total = sum(d * (10 - i) for i, d in enumerate(digits[:9]))
    check_digit = 11 - (total % 11)
    if check_digit == 11:
        check_digit = 0
    if check_digit == 10:
        return False  # 10 is never a valid check digit
    return check_digit == digits[9]

Most PII detectors just use regex. We validate checksums. This reduces false positives dramatically.
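The same idea applies to card numbers. As a rough illustration, the standard Luhn check looks like this (not necessarily gdpr-safe-rag's exact implementation):

def luhn_valid(card_str: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in card_str if d.isdigit()]
    if len(digits) < 12:
        return False
    checksum = 0
    # Walk from the rightmost digit; double every second digit, subtracting 9 if it exceeds 9
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0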

Component 2: Audit Logger

Every operation is logged to PostgreSQL (or SQLite for testing) with full ACID compliance:

from gdpr_safe_rag import AuditLogger

logger = AuditLogger(
    storage_path="postgresql://user:pass@localhost/audit_logs",
    retention_days=2555  # 7 years for UK GDPR
)

# Log a query
logger.log_query(
    user_id="user_12345",
    query_text="What is the refund policy?",
    retrieved_docs=["doc_001", "doc_002"],
    response_generated=True
)

# Export compliance report
report = logger.export_compliance_report(
    start_date="2024-01-01",
    end_date="2024-12-31",
    format="pdf"
)

This gives you the audit trail required by GDPR Article 30.

Component 3: Compliance Checker

Automated validation of your RAG system against GDPR requirements:

from gdpr_safe_rag import ComplianceChecker

checker = ComplianceChecker(
    vector_db_path="./chroma_db",
    audit_log_path="./audit_logs",
    retention_days=2555
)

# Run all compliance checks
results = checker.run_all_checks()

# Results include:
# ✅ Data Inventory (Article 30)
# ✅ Retention Validation (Article 5)
# ✅ Erasure Support (Article 17)
# ✅ Data Minimisation (Article 5)
# ✅ Security Controls (Article 32)

# Generate report
checker.generate_report(format="pdf")

🔧 Real-World Integration

Here's a complete example with LangChain:

from gdpr_safe_rag import PIIDetector, AuditLogger
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.schema import Document

# Initialise GDPR components
pii_detector = PIIDetector(region="UK")
audit_logger = AuditLogger(storage_path="./audit_logs")

# Load documents
loader = TextLoader("customer_data.txt")
documents = loader.load()

# Clean documents BEFORE embedding
clean_documents = []
for doc in documents:
    # Detect and redact PII
    clean_text, mapping = pii_detector.redact(doc.page_content)

    # Log the ingestion
    audit_logger.log_ingestion(
        document_id=doc.metadata.get("source"),
        pii_count=len(mapping)
    )

    # Store cleaned version
    clean_doc = Document(
        page_content=clean_text,
        metadata={**doc.metadata, "pii_redacted": True}
    )
    clean_documents.append(clean_doc)

# Now proceed with normal RAG pipeline (but with clean data!)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
splits = text_splitter.split_documents(clean_documents)

vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=OpenAIEmbeddings()
)

# Create QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever()
)

# Query with audit logging
def safe_query(question: str, user_id: str):
    audit_logger.log_query(user_id=user_id, query_text=question)
    result = qa_chain({"query": question})
    return result["result"]

# Use it
answer = safe_query("What is the refund policy?", user_id="user_123")

Your vector database now contains zero PII. ✅

Every operation is logged. ✅

You can prove compliance. ✅

🏢 Production-Grade Design

What makes this toolkit production-ready?

1. Abstract Backend Interfaces

# Multiple storage backends supported
from gdpr_safe_rag.audit_logger.backends import (
    PostgreSQLBackend,  # Production
    SQLiteBackend,      # Development/Testing
    MemoryBackend       # Unit Tests
)

# Program to interfaces, not implementations
logger = AuditLogger(
    backend=PostgreSQLBackend(url="postgresql://...")
)
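To make the "program to interfaces" point concrete, here's a rough sketch of what such a backend contract could look like (illustrative only; the actual gdpr-safe-rag interface and method names may differ):

from abc import ABC, abstractmethod
from typing import Any

class AuditBackend(ABC):
    """Storage-agnostic contract that an audit logger can program against."""

    @abstractmethod
    def write_event(self, event: dict[str, Any]) -> None:
        """Persist a single audit event."""

    @abstractmethod
    def purge_older_than(self, days: int) -> int:
        """Delete events outside the retention window and return how many were removed."""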

2. Comprehensive Testing

87 tests covering:

  • Pattern matching accuracy
  • Validation algorithms (NHS checksum, Luhn)
  • Backend operations (CRUD, retention)
  • Compliance checks
  • Integration scenarios

$ pytest
========================= 87 passed in 12.34s =========================

3. Type Safety with Pydantic

from pydantic import BaseModel, Field

class PIIItem(BaseModel):
    type: str = Field(..., description="Type of PII detected")
    value: str = Field(..., description="Detected PII value")
    start: int = Field(..., description="Start position in text")
    end: int = Field(..., description="End position in text")
    confidence: float = Field(1.0, ge=0.0, le=1.0)
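Those fields make the detector's output easy to work with downstream. For example, you could inspect what was found before deciding whether to ingest a document (using the pii_items list from the earlier example; the type names below are illustrative, not guaranteed by the library):

# Inspect detections before ingesting; flag documents with high-risk PII types
HIGH_RISK = {"nhs_number", "credit_card"}  # illustrative type names

for item in pii_items:
    print(f"{item.type} at {item.start}-{item.end} (confidence {item.confidence})")

if any(item.type in HIGH_RISK for item in pii_items):
    print("High-risk PII present - review before ingestion")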

4. Configuration Management

# Use pydantic-settings for environment-based config
from gdpr_safe_rag import Settings

settings = Settings(
    database_url="postgresql://localhost/audit",
    pii_detection_level="strict",
    audit_retention_days=2555,
    enable_ner=True
)

5. Docker Support

# docker-compose.yml included
services:
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: audit_logs
      POSTGRES_USER: audit          # example values; set real credentials via .env
      POSTGRES_PASSWORD: change-me  # required by the postgres image, or the container won't start
    ports:
      - "5432:5432"

Start a compliant dev environment in one command:

docker-compose up -d

📊 Why This Matters: Real-World Context

This toolkit was built from experience deploying AI systems for:

Unity in Diversity - UK government-funded Community Interest Company supporting vulnerable populations. We needed GDPR-compliant case management with risk-aware triage. PII leakage wasn't an option.

IUFP Global - OFQUAL-regulated international education platform serving 10+ countries. Processing student data across jurisdictions required bulletproof compliance.

TEEP FinTech - Payment platform serving 3,500+ users. Financial data + AI = regulatory scrutiny.

In every case, the pattern was the same:

  1. RAG was the right architecture
  2. Compliance was non-negotiable
  3. Existing tools didn't address the gap

So I built one that does.

🚀 Get Started

Installation

# Basic installation
pip install gdpr-safe-rag

# With PostgreSQL support
pip install gdpr-safe-rag[postgres]

# With NER (Named Entity Recognition)
pip install gdpr-safe-rag[ner]

# Everything
pip install gdpr-safe-rag[postgres,ner]


🤝 Contributing

This is an open-source project. Contributions welcome!

Ways to help:

  • ⭐ Star the repo
  • 🐛 Report bugs
  • 💡 Suggest features
  • 🔧 Submit PRs
  • 📖 Improve docs
  • 💬 Share your use case

Particularly interested in:

  • Additional regional patterns (US, Canada, Australia)
  • Integration examples (LlamaIndex, other frameworks)
  • Dashboard UI for compliance monitoring
  • Performance optimisations

🎯 What's Next

Planned for v0.2:

  • LlamaIndex native integration
  • Streamlit compliance dashboard
  • Additional regional patterns (US, Canada, Australia)
  • Kubernetes deployment examples
  • Automated compliance reporting (ICO format)

Planned for v1.0:

  • Right to erasure automation (delete embeddings by user ID)
  • Data portability tools (export user data)
  • Fine-grained access controls
  • Multi-tenancy support

💭 Final Thoughts

Building compliant AI shouldn't require a legal team and six months of architecture design.

RAG is an incredibly powerful pattern for building useful AI applications. But if we want AI adoption in regulated sectors—healthcare, finance, education, government—we need to make compliance accessible.

That's what gdpr-safe-rag does. It takes GDPR compliance from "call the lawyers" to pip install gdpr-safe-rag.

If you're building RAG systems for any organisation that handles personal data (which is basically everyone in the UK/EU), give it a try.

And if you find it useful, star the repo. If you find issues, open them. If you have ideas, share them.

Let's make compliant AI the default, not the exception.


👤 About Me

I'm Charles Nwankpa, Founder at Gen3Block, where I help UK SMEs adopt AI safely through implementation and training.

Background:

  • AWS Certified Machine Learning Engineer
  • MSc Data Science (Coventry University)
  • BCS Certificate in Ethical Build of AI
  • Working with government-funded organisations and regulated platforms

I'm currently building production AI systems for UK organisations and teaching AI implementation to businesses that need regulatory-first approaches.


Like this article?

  • ⭐ Star gdpr-safe-rag on GitHub
  • 💬 Leave a comment below
  • 🔄 Share with your network
  • 📧 Subscribe for updates on the series

Next in this series: "GDPR for Developers: What You Actually Need to Know" - coming next week!

