Introducing gdpr-safe-rag: Build GDPR-Compliant RAG Systems in Minutes

After spending 9 months building production AI systems for UK organisations—including government-funded community programs and OFQUAL-regulated education platforms—I kept encountering the same critical problem:

Most RAG (Retrieval-Augmented Generation) implementations leak Personally Identifiable Information (PII) directly into vector databases.

So I built a solution. Today, I'm open-sourcing it.

🔒 The Problem: RAG Systems and Privacy

RAG has become the go-to architecture for building AI applications that need to reference private documents, customer data, or organisational knowledge. But there's a dangerous gap in most implementations:

# What most RAG implementations do:
documents = load_documents("customer_support_tickets.txt")
vectorstore = Chroma.from_documents(documents, embeddings)

# Problem: customer_support_tickets.txt contains:
# - Email addresses
# - Phone numbers  
# - NHS numbers
# - Home addresses
# - Credit card numbers

# All of this PII is now in your vector database. 😱

This creates three major compliance risks:

  1. Data Minimisation Violation: GDPR Article 5(1)(c) requires collecting only necessary data. Storing raw PII in vectors violates this.

  2. No Audit Trail: GDPR Article 30 requires keeping records of processing activities. Without logging, you can't prove compliance.

  3. Right to Erasure: GDPR Article 17 requires deleting user data on request. Can you delete embeddings? Where's the mapping?

For organisations in regulated sectors—healthcare, finance, education, government—this isn't theoretical. It's a legal liability.

🛠️ The Solution: gdpr-safe-rag

I built gdpr-safe-rag to solve this problem at the architectural level. It's a production-grade Python toolkit that adds three critical capabilities to any RAG system:

1. Automatic PII Detection & Redaction

2. Built-in Audit Logging

3. Compliance Validation

Here's how it works:

📦 Quick Start

Install via pip:

pip install gdpr-safe-rag

Basic usage:

from gdpr_safe_rag import PIIDetector, AuditLogger

# Initialise components
detector = PIIDetector(region="UK")
logger = AuditLogger(storage_path="./audit_logs")

# Process a document with PII
document = """
Customer: John Smith
Email: john.smith@example.co.uk
Phone: 07700 900123
NHS Number: 485 777 3456
Issue: Request refund for order #98765
"""

# Detect PII
pii_items = detector.detect(document)
print(f"Found {len(pii_items)} PII items")
# Output: Found 3 PII items (email, phone, NHS number)

# Redact PII before storing in vector database
clean_doc, mapping = detector.redact(document)

print(clean_doc)
# Output:
# Customer: John Smith
# Email: [EMAIL_1]
# Phone: [UK_PHONE_1]
# NHS Number: [NHS_NUMBER_1]
# Issue: Request refund for order #98765
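# Note: the customer name "John Smith" stays visible here because names are
# only picked up by the optional spaCy NER pass, not the built-in regex patterns.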

# Log the operation for compliance
logger.log_ingestion(
    document_id="ticket_12345",
    user_id="system",
    pii_detected=[item.type for item in pii_items],
    pii_count=len(pii_items)
)

Now your vector database only contains [EMAIL_1] instead of actual email addresses. ✅
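One practical question the snippet leaves open is where the mapping returned by redact() should live. Here's a minimal sketch of one option, assuming the mapping is a plain placeholder-to-value dict (store_mapping, erase_mapping, and the ./pii_mappings path are hypothetical helpers, not part of gdpr-safe-rag): keep it outside the vector database, keyed by document ID, so an Article 17 request can remove the original values.

import json
from pathlib import Path

# Hypothetical helpers, not part of gdpr-safe-rag: keep PII mappings on
# encrypted, access-controlled storage, separate from the vector database.
MAPPING_STORE = Path("./pii_mappings")
MAPPING_STORE.mkdir(exist_ok=True)

def store_mapping(document_id: str, mapping: dict) -> None:
    """Persist the placeholder -> original-value mapping for one document."""
    (MAPPING_STORE / f"{document_id}.json").write_text(json.dumps(mapping))

def erase_mapping(document_id: str) -> None:
    """Erasure request: drop the mapping so placeholders can never be reversed."""
    (MAPPING_STORE / f"{document_id}.json").unlink(missing_ok=True)

# 'mapping' comes from detector.redact(...) above
store_mapping("ticket_12345", mapping)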

🏗️ Architecture Deep Dive

Let me show you how this integrates into a real RAG pipeline:

Component 1: PII Detector

The PII Detector uses a combination of validated regex patterns and optional NER (Named Entity Recognition) to identify sensitive data:

UK/EU-Specific Patterns:

  • UK Postcodes: SW1A 1AA
  • UK Phone Numbers: 07700 900123, +44 7700 900123
  • NHS Numbers: 485 777 3456 (with checksum validation!)
  • NI Numbers: AB 12 34 56 C
  • EU Phone Numbers, IBANs, VAT numbers

Common Patterns:

  • Email addresses
  • Credit card numbers (with Luhn algorithm validation)
  • Generic phone numbers
  • Names and addresses (via spaCy NER)

What makes this production-grade? Validation, not just pattern matching.

# Example: NHS Number validation with checksum
def validate_nhs_number(nhs_str: str) -> bool:
    """Validate a 10-digit NHS number using the modulus 11 algorithm."""
    cleaned = nhs_str.replace(' ', '')
    if len(cleaned) != 10 or not cleaned.isdigit():
        return False
    digits = [int(d) for d in cleaned]
    # Weight the first nine digits from 10 down to 2
    total = sum(d * (10 - i) for i, d in enumerate(digits[:9]))
    check_digit = 11 - (total % 11)
    if check_digit == 11:
        check_digit = 0
    if check_digit == 10:
        return False  # 10 is never a valid check digit
    return check_digit == digits[9]

Most PII detectors just use regex. We validate checksums. This reduces false positives dramatically.
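The same idea applies to card numbers. As a rough illustration, the standard Luhn check looks like this (not necessarily gdpr-safe-rag's exact implementation):

def luhn_valid(card_str: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    digits = [int(d) for d in card_str if d.isdigit()]
    if len(digits) < 12:
        return False
    checksum = 0
    # Walk from the rightmost digit; double every second digit, subtracting 9 if it exceeds 9
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0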

Component 2: Audit Logger

Every operation is logged to PostgreSQL (or SQLite for testing) with full ACID compliance:

from gdpr_safe_rag import AuditLogger

logger = AuditLogger(
    storage_path="postgresql://user:pass@localhost/audit_logs",
    retention_days=2555  # 7 years for UK GDPR
)

# Log a query
logger.log_query(
    user_id="user_12345",
    query_text="What is the refund policy?",
    retrieved_docs=["doc_001", "doc_002"],
    response_generated=True
)

# Export compliance report
report = logger.export_compliance_report(
    start_date="2024-01-01",
    end_date="2024-12-31",
    format="pdf"
)

This gives you the audit trail required by GDPR Article 30.

Component 3: Compliance Checker

Automated validation of your RAG system against GDPR requirements:

from gdpr_safe_rag import ComplianceChecker

checker = ComplianceChecker(
    vector_db_path="./chroma_db",
    audit_log_path="./audit_logs",
    retention_days=2555
)

# Run all compliance checks
results = checker.run_all_checks()

# Results include:
# ✅ Data Inventory (Article 30)
# ✅ Retention Validation (Article 5)
# ✅ Erasure Support (Article 17)
# ✅ Data Minimisation (Article 5)
# ✅ Security Controls (Article 32)

# Generate report
checker.generate_report(format="pdf")

🔧 Real-World Integration

Here's a complete example with LangChain:

from gdpr_safe_rag import PIIDetector, AuditLogger
from langchain.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.llms import OpenAI
from langchain.schema import Document

# Initialise GDPR components
pii_detector = PIIDetector(region="UK")
audit_logger = AuditLogger(storage_path="./audit_logs")

# Load documents
loader = TextLoader("customer_data.txt")
documents = loader.load()

# Clean documents BEFORE embedding
clean_documents = []
for doc in documents:
    # Detect and redact PII
    clean_text, mapping = pii_detector.redact(doc.page_content)

    # Log the ingestion
    audit_logger.log_ingestion(
        document_id=doc.metadata.get("source"),
        pii_count=len(mapping)
    )

    # Store cleaned version
    clean_doc = Document(
        page_content=clean_text,
        metadata={**doc.metadata, "pii_redacted": True}
    )
    clean_documents.append(clean_doc)

# Now proceed with normal RAG pipeline (but with clean data!)
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000)
splits = text_splitter.split_documents(clean_documents)

vectorstore = Chroma.from_documents(
    documents=splits,
    embedding=OpenAIEmbeddings()
)

# Create QA chain
qa_chain = RetrievalQA.from_chain_type(
    llm=OpenAI(),
    retriever=vectorstore.as_retriever()
)

# Query with audit logging
def safe_query(question: str, user_id: str):
    audit_logger.log_query(user_id=user_id, query_text=question)
    result = qa_chain({"query": question})
    return result["result"]

# Use it
answer = safe_query("What is the refund policy?", user_id="user_123")

Your vector database now contains zero PII. ✅

Every operation is logged. ✅

You can prove compliance. ✅

🏢 Production-Grade Design

What makes this toolkit production-ready?

1. Abstract Backend Interfaces

# Multiple storage backends supported
from gdpr_safe_rag.audit_logger.backends import (
    PostgreSQLBackend,  # Production
    SQLiteBackend,      # Development/Testing
    MemoryBackend       # Unit Tests
)

# Program to interfaces, not implementations
logger = AuditLogger(
    backend=PostgreSQLBackend(url="postgresql://...")
)
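To make the "program to interfaces" point concrete, here's a rough sketch of what such a backend contract could look like (illustrative only; the actual gdpr-safe-rag interface and method names may differ):

from abc import ABC, abstractmethod
from typing import Any

class AuditBackend(ABC):
    """Storage-agnostic contract that an audit logger can program against."""

    @abstractmethod
    def write_event(self, event: dict[str, Any]) -> None:
        """Persist a single audit event."""

    @abstractmethod
    def purge_older_than(self, days: int) -> int:
        """Delete events outside the retention window and return how many were removed."""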

2. Comprehensive Testing

87 tests covering:

  • Pattern matching accuracy
  • Validation algorithms (NHS checksum, Luhn)
  • Backend operations (CRUD, retention)
  • Compliance checks
  • Integration scenarios

$ pytest
========================= 87 passed in 12.34s =========================

3. Type Safety with Pydantic

from pydantic import BaseModel, Field

class PIIItem(BaseModel):
    type: str = Field(..., description="Type of PII detected")
    value: str = Field(..., description="Detected PII value")
    start: int = Field(..., description="Start position in text")
    end: int = Field(..., description="End position in text")
    confidence: float = Field(1.0, ge=0.0, le=1.0)
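Those fields make the detector's output easy to work with downstream. For example, you could inspect what was found before deciding whether to ingest a document (using the pii_items list from the earlier example; the type names below are illustrative, not guaranteed by the library):

# Inspect detections before ingesting; flag documents with high-risk PII types
HIGH_RISK = {"nhs_number", "credit_card"}  # illustrative type names

for item in pii_items:
    print(f"{item.type} at {item.start}-{item.end} (confidence {item.confidence})")

if any(item.type in HIGH_RISK for item in pii_items):
    print("High-risk PII present - review before ingestion")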

4. Configuration Management

# Use pydantic-settings for environment-based config
from gdpr_safe_rag import Settings

settings = Settings(
    database_url="postgresql://localhost/audit",
    pii_detection_level="strict",
    audit_retention_days=2555,
    enable_ner=True
)

5. Docker Support

# docker-compose.yml included
services:
  postgres:
    image: postgres:15-alpine
    environment:
      POSTGRES_DB: audit_logs
      POSTGRES_USER: audit          # example values; set real credentials via .env
      POSTGRES_PASSWORD: change-me  # required by the postgres image, or the container won't start
    ports:
      - "5432:5432"

Start a compliant dev environment in one command:

docker-compose up -d

📊 Why This Matters: Real-World Context

This toolkit was built from experience deploying AI systems for:

Unity in Diversity - UK government-funded Community Interest Company supporting vulnerable populations. We needed GDPR-compliant case management with risk-aware triage. PII leakage wasn't an option.

IUFP Global - OFQUAL-regulated international education platform serving 10+ countries. Processing student data across jurisdictions required bulletproof compliance.

TEEP FinTech - Payment platform serving 3,500+ users. Financial data + AI = regulatory scrutiny.

In every case, the pattern was the same:

  1. RAG was the right architecture
  2. Compliance was non-negotiable
  3. Existing tools didn't address the gap

So I built one that does.

🚀 Get Started

Installation

# Basic installation
pip install gdpr-safe-rag

# With PostgreSQL support
pip install gdpr-safe-rag[postgres]

# With NER (Named Entity Recognition)
pip install gdpr-safe-rag[ner]

# Everything
pip install gdpr-safe-rag[postgres,ner]


🤝 Contributing

This is an open-source project. Contributions welcome!

Ways to help:

  • ⭐ Star the repo
  • 🐛 Report bugs
  • 💡 Suggest features
  • 🔧 Submit PRs
  • 📖 Improve docs
  • 💬 Share your use case

Particularly interested in:

  • Additional regional patterns (US, Canada, Australia)
  • Integration examples (LlamaIndex, other frameworks)
  • Dashboard UI for compliance monitoring
  • Performance optimisations

🎯 What's Next

Planned for v0.2:

  • LlamaIndex native integration
  • Streamlit compliance dashboard
  • Additional regional patterns (US, Canada, Australia)
  • Kubernetes deployment examples
  • Automated compliance reporting (ICO format)

Planned for v1.0:

  • Right to erasure automation (delete embeddings by user ID)
  • Data portability tools (export user data)
  • Fine-grained access controls
  • Multi-tenancy support

💭 Final Thoughts

Building compliant AI shouldn't require a legal team and six months of architecture design.

RAG is an incredibly powerful pattern for building useful AI applications. But if we want AI adoption in regulated sectors—healthcare, finance, education, government—we need to make compliance accessible.

That's what gdpr-safe-rag does. It takes GDPR compliance from "call the lawyers" to pip install gdpr-safe-rag.

If you're building RAG systems for any organisation that handles personal data (which is basically everyone in the UK/EU), give it a try.

And if you find it useful, star the repo. If you find issues, open them. If you have ideas, share them.

Let's make compliant AI the default, not the exception.


👤 About Me

I'm Charles Nwankpa, Founder at Gen3Block, where I help UK SMEs adopt AI safely through implementation and training.

Background:

  • AWS Certified Machine Learning Engineer
  • MSc Data Science (Coventry University)
  • BCS Certificate in Ethical Build of AI
  • Working with government-funded organisations and regulated platforms

I'm currently building production AI systems for UK organisations and teaching AI implementation to businesses that need regulatory-first approaches.


Like this article?

  • ⭐ Star gdpr-safe-rag on GitHub
  • 💬 Leave a comment below
  • 🔄 Share with your network
  • 📧 Subscribe for updates on the series

Next in this series: "GDPR for Developers: What You Actually Need to Know" - coming next week!

