Joe Seabrook

Posted on Jun 8

I Built a GDPR Compliance Scanner Using the Claude API - Here's How It Works

#python #ai #showdev #buildinpublic

A few months ago I noticed something that kept bugging me. I was building and handing off websites for clients and every single time, GDPR compliance was either an afterthought or a panic right before launch. Privacy policies copied from templates, cookie banners slapped on at the last minute, no one really sure if the contact form was actually compliant.

The bigger problem: there was no quick, affordable way to check. Enterprise compliance tools cost hundreds per month. Legal consultants cost more. Most small businesses just crossed their fingers.

So I built ClearlyCompliant - an automated GDPR compliance scanner that analyses a website and delivers a detailed PDF report for a one-off fee. No subscription, no jargon, just a clear picture of where a site stands.

Here's how it actually works under the hood.

The Stack

Django (Python) - backend and web app
BeautifulSoup + requests - crawling and HTML parsing
Python threading - async scanning without the overhead of Celery/Redis
Anthropic Claude API (Haiku) - AI-powered policy analysis
ReportLab - PDF report generation
Stripe - payments
IONOS SMTP - email delivery
Gunicorn + Nginx on an IONOS VPS

The Scanning Pipeline

When a user submits a domain and completes payment, the scan kicks off immediately. Rather than making them wait on a loading screen, the scan runs asynchronously in a background thread and the report gets emailed when it's done.

I deliberately avoided Celery and Redis here. For the scale I needed, Python's built-in threading module was more than sufficient and kept the infrastructure simple. One less thing to maintain, one less thing to break.

import threading

def run_scan_async(domain, order_id, customer_email):
    thread = threading.Thread(
        target=run_full_scan,
        args=(domain, order_id, customer_email)
    )
    thread.daemon = True
    thread.start()
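One caveat with a bare thread: if the scan raises, the exception only shows up in that thread's traceback and nobody is told the report never arrived. A minimal sketch of a guard around the target is below. The `run_in_background` helper and its `on_failure` callback are hypothetical names for illustration, not part of the ClearlyCompliant code.

```python
import logging
import threading

logger = logging.getLogger("scanner")


def run_in_background(target, args=(), on_failure=None):
    """Run target(*args) in a daemon thread and report any exception.

    on_failure is a hypothetical hook, e.g. to mark the order as failed
    or to send an apology email.
    """
    def guarded():
        try:
            target(*args)
        except Exception as exc:
            # Log the full traceback, then hand the error to the caller's hook.
            logger.exception("Background scan failed")
            if on_failure is not None:
                on_failure(exc)

    thread = threading.Thread(target=guarded, daemon=True)
    thread.start()
    return thread
```

Because the thread is a daemon, it is still killed if the worker process exits mid-scan, so a guard like this covers crashes inside the scan but not process restarts.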

The scan itself runs 23 individual GDPR checks across several categories.

The 23 Checks

The checks are grouped into logical categories:

Cookie & Consent

Does a cookie banner exist?
Is consent required before non-essential cookies fire?
Are there pre-ticked opt-in boxes?
Is declining as easy as accepting?

Privacy Policy

Does a privacy policy exist and is it linked correctly?
Does it mention data retention periods?
Does it list third-party processors?
Does it cover user rights (access, erasure, portability)?
Does it mention the ICO / supervisory authority?

Forms & Data Collection

Do forms collect more data than necessary?
Is there a privacy policy link at the point of data collection?
Are forms served over HTTPS?

Security

Is HTTPS enforced sitewide?
Are there mixed content issues?
Are security headers present (HSTS, X-Frame-Options, CSP)?

Third-Party Scripts

Are known tracking/analytics scripts detected?
Are advertising pixels present?
Are session recording tools detected?

Technical

Is there a robots.txt?
Is there a sitemap?
Are there any obvious data leakage issues in page source?

Each check returns a pass, fail, or warning status, along with a plain-English explanation of what was found and why it matters.
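To make that concrete, here is a rough sketch of what a single check could look like, using the security headers check as the example. The `Status` enum, the `CheckResult` dataclass and the rule "some missing is a warning, all missing is a fail" are assumptions made for illustration, not the scanner's actual thresholds.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PASS = "pass"
    WARNING = "warning"
    FAIL = "fail"


@dataclass
class CheckResult:
    name: str
    status: Status
    explanation: str


# Lower-cased because HTTP header names are case-insensitive.
EXPECTED_HEADERS = (
    "strict-transport-security",
    "x-frame-options",
    "content-security-policy",
)


def check_security_headers(headers):
    """Evaluate a mapping of response headers (e.g. response.headers)."""
    present = {name.lower() for name in headers}
    missing = [h for h in EXPECTED_HEADERS if h not in present]

    if not missing:
        return CheckResult(
            "Security headers", Status.PASS,
            "All expected security headers are present.",
        )

    # Assumed rule: nothing present is a fail, anything partial is a warning.
    status = Status.FAIL if len(missing) == len(EXPECTED_HEADERS) else Status.WARNING
    return CheckResult(
        "Security headers", status,
        "Missing headers: " + ", ".join(missing),
    )
```

Keeping every check behind the same return type means the report builder can treat deterministic checks and AI-backed checks identically.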

The Interesting Part: Using Claude to Analyse Privacy Policies

Most of the checks are deterministic - I'm looking for specific HTML elements, HTTP headers, or known script signatures. But privacy policy analysis is different. A privacy policy is a natural language document and the question isn't just "does one exist" but "does it actually say the right things?"

This is where the Claude API comes in.

I fetch the privacy policy content and send it to Claude Haiku with a structured prompt asking it to evaluate specific GDPR requirements:

import anthropic

def analyse_privacy_policy(policy_text):
    client = anthropic.Anthropic()

    prompt = f"""
    You are a GDPR compliance analyst. Analyse the following privacy policy and evaluate 
    whether it covers each of these required elements under UK GDPR / EU GDPR:

    1. Identity and contact details of the data controller
    2. Purposes and lawful basis for processing
    3. Legitimate interests (if relied upon)
    4. Recipients or categories of recipients of personal data
    5. Details of transfers to third countries and safeguards
    6. Retention periods
    7. Rights of the data subject (access, rectification, erasure, portability, objection)
    8. Right to withdraw consent
    9. Right to lodge a complaint with a supervisory authority
    10. Whether provision of data is a statutory/contractual requirement

    For each element, respond with: PRESENT, PARTIAL, or MISSING, followed by a brief 
    one-sentence explanation.

    Respond only in the structured format requested. Do not add preamble or commentary.

    Privacy Policy:
    {policy_text[:8000]}
    """

    message = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=1000,
        messages=[{"role": "user", "content": prompt}]
    )

    return message.content[0].text

Haiku is fast and cost-effective for this task - I don't need Claude's full reasoning capability here, just reliable structured analysis of a document. The truncation to 8,000 characters keeps API costs predictable for extremely long policies. The trade-off is that anything past the cut-off is never seen by the model, so an element covered only near the end of a long policy may come back as MISSING.

The response gets parsed and fed into the report alongside the deterministic checks.
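The parsing step might look something like the sketch below. It assumes Claude answers with one numbered line per element, such as `1. PRESENT - explanation`, which is what the prompt asks for but not something the API guarantees. The `parse_policy_analysis` function and the `UNKNOWN` fallback status are illustrative names, not taken from the real codebase.

```python
import re

# Matches lines like "1. PRESENT - ..." or "2. Purposes: PARTIAL - ...".
LINE_RE = re.compile(
    r"^\s*(\d{1,2})[.):]\s*(.*?)\b(PRESENT|PARTIAL|MISSING)\b[\s:.*\-]*(.*)$"
)


def parse_policy_analysis(response_text, expected=10):
    """Turn the model's numbered answer into {element_number: {...}}."""
    results = {}
    for line in response_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            number = int(match.group(1))
            results[number] = {
                "status": match.group(3),
                "explanation": match.group(4).strip(),
            }

    # Elements the model skipped are recorded explicitly, not silently dropped.
    for number in range(1, expected + 1):
        results.setdefault(number, {"status": "UNKNOWN", "explanation": ""})
    return results
```

Treating unparseable or missing lines as `UNKNOWN` rather than `MISSING` avoids telling a customer their policy lacks something when the real problem was the model's formatting.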

Auto-Detecting the Privacy Policy URL

One thing I spent more time on than expected: finding the privacy policy in the first place.

I could have added a form field asking users to paste the URL, but that's friction and users often don't know the exact URL offhand. Instead I built an auto-detection function:

import re
from urllib.parse import urljoin

import requests

def find_privacy_policy_url(base_url, soup):
    # Common privacy policy URL patterns
    privacy_patterns = [
        r'privacy[-_]?policy',
        r'privacy[-_]?notice',
        r'data[-_]?protection',
        r'privacy',
    ]

    # Check all links on the page
    for link in soup.find_all('a', href=True):
        href = link['href'].lower()
        text = link.get_text().lower()

        for pattern in privacy_patterns:
            if re.search(pattern, href) or re.search(pattern, text):
                return urljoin(base_url, link['href'])

    # Fallback: try common paths directly
    common_paths = [
        '/privacy-policy',
        '/privacy',
        '/data-protection',
        '/legal/privacy',
    ]

    for path in common_paths:
        url = urljoin(base_url, path)
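        try:
            # requests.head() does not follow redirects by default, so a
            # /privacy -> /privacy/ redirect would otherwise be missed
            response = requests.head(url, timeout=5, allow_redirects=True)
            if response.status_code == 200:
                return response.url
        except requests.RequestException:
            continue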

    return None

This handles the vast majority of sites cleanly. If no policy is found, that itself becomes a finding in the report.

Generating the PDF Report with ReportLab

I initially looked at WeasyPrint for PDF generation - it produces beautiful output from HTML/CSS. But it has a GTK dependency that caused headaches on my Windows development machine. ReportLab is pure Python, installs cleanly everywhere, and gives you precise control over layout.

The report is structured as:

Cover page - domain scanned, date, overall compliance score
Executive summary - high-level findings with a visual pass/fail breakdown
Detailed findings - each of the 23 checks with status, explanation, and recommendation
Priority actions - the top issues ranked by severity

from reportlab.lib.pagesizes import A4
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
from reportlab.lib import colors

def generate_report(scan_results, domain, output_path):
    doc = SimpleDocTemplate(
        output_path,
        pagesize=A4,
        rightMargin=50,
        leftMargin=50,
        topMargin=50,
        bottomMargin=50
    )

    story = []
    styles = getSampleStyleSheet()

    # Build report sections
    story.extend(build_cover_page(domain, scan_results, styles))
    story.extend(build_executive_summary(scan_results, styles))
    story.extend(build_detailed_findings(scan_results, styles))
    story.extend(build_priority_actions(scan_results, styles))

    doc.build(story)

The overall compliance score is calculated by weighting checks by severity - a missing HTTPS is weighted higher than a missing sitemap, for example.
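As a rough illustration of severity weighting, the sketch below gives each check a weight and partial credit for warnings. The specific weights, the half credit for a warning and the `compliance_score` name are all assumptions for the example. The real scoring values are not published in this post.

```python
# Hypothetical weights: a failed critical check costs five times a low one.
SEVERITY_WEIGHTS = {"critical": 5, "high": 3, "medium": 2, "low": 1}

# Hypothetical credit per status: warnings earn half marks.
STATUS_CREDIT = {"pass": 1.0, "warning": 0.5, "fail": 0.0}


def compliance_score(results):
    """Return a 0-100 score from a list of (severity, status) pairs."""
    total = sum(SEVERITY_WEIGHTS[severity] for severity, _ in results)
    if total == 0:
        return 0
    earned = sum(
        SEVERITY_WEIGHTS[severity] * STATUS_CREDIT[status]
        for severity, status in results
    )
    return round(100 * earned / total)


# Failing HTTPS (critical) hurts far more than failing the sitemap check (low).
print(compliance_score([("critical", "fail"), ("low", "pass")]))  # 17
print(compliance_score([("critical", "pass"), ("low", "fail")]))  # 83
```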

Payments with Stripe

Stripe handles the £29.99 one-off payment. The flow is:

User enters domain → Stripe Checkout session created
User pays → Stripe webhook fires checkout.session.completed
Webhook triggers the async scan
Report emailed on completion
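The first step of that flow, creating the Checkout session, could be sketched as below. The important detail is putting the domain into `metadata`, since that is what the webhook reads back later. The `build_checkout_params` helper, the product name and the URLs are hypothetical. The parameter names themselves are the ones Stripe's Checkout Session API uses.

```python
def build_checkout_params(domain, success_url, cancel_url):
    """Build keyword arguments for stripe.checkout.Session.create()."""
    return {
        "mode": "payment",  # one-off payment, not a subscription
        "line_items": [
            {
                "price_data": {
                    "currency": "gbp",
                    "unit_amount": 2999,  # £29.99 expressed in pence
                    "product_data": {
                        "name": f"GDPR compliance report for {domain}",
                    },
                },
                "quantity": 1,
            }
        ],
        # Carried through to the checkout.session.completed event.
        "metadata": {"domain": domain},
        "success_url": success_url,
        "cancel_url": cancel_url,
    }


# With the stripe library configured, the view would then call:
# session = stripe.checkout.Session.create(**build_checkout_params(...))
```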

Using webhooks rather than redirect-based confirmation means the scan triggers reliably even if the user closes the browser after paying.

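import stripe
from django.conf import settings
from django.http import HttpResponse
from django.views.decorators.csrf import csrf_exempt

@csrf_exempt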
def stripe_webhook(request):
    payload = request.body
    sig_header = request.META.get('HTTP_STRIPE_SIGNATURE')

    try:
        event = stripe.Webhook.construct_event(
            payload, sig_header, settings.STRIPE_WEBHOOK_SECRET
        )
    except ValueError:
        return HttpResponse(status=400)
    except stripe.error.SignatureVerificationError:
        return HttpResponse(status=400)

    if event['type'] == 'checkout.session.completed':
        session = event['data']['object']
        domain = session['metadata']['domain']
        customer_email = session['customer_details']['email']
        order_id = session['id']

        run_scan_async(domain, order_id, customer_email)

    return HttpResponse(status=200)
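One thing worth guarding against: Stripe can deliver the same event more than once, which with the handler above would start a second scan and send a second email. A minimal in-memory sketch of a guard is below. The `ScanDeduplicator` class is a hypothetical helper. With several Gunicorn workers the same idea would need shared storage, such as a unique database column on the order ID.

```python
import threading


class ScanDeduplicator:
    """Remember which order IDs have already triggered a scan."""

    def __init__(self):
        self._seen = set()
        self._lock = threading.Lock()

    def claim(self, order_id):
        """Return True the first time an order ID is seen, False afterwards."""
        with self._lock:
            if order_id in self._seen:
                return False
            self._seen.add(order_id)
            return True
```

In the webhook, the scan would then only start when `claim(order_id)` returns True, while still answering every delivery with a 200.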

What I'd Do Differently

Remediation guidance. The report tells you what's wrong but not always how to fix it. I deliberately left this out to ship faster, but it's the most common piece of feedback from users. It's next on the roadmap.

Recrawling sub-pages. Currently the scanner analyses the homepage and any linked pages it can find. A more thorough scan would systematically crawl deeper, particularly for e-commerce sites with checkout flows on different pages.

Caching policy analysis. If the same privacy policy URL appears in multiple scans, I'm hitting the Claude API each time. A simple hash-based cache would reduce costs significantly at scale.
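A hash-based cache along those lines could be as small as the sketch below. It keys on a SHA-256 of the policy text rather than the URL, so an updated policy at the same address is re-analysed. The `PolicyAnalysisCache` class is a hypothetical name, and the plain dict is an assumption for the example. A shared store such as Django's cache framework would be needed to share entries across worker processes.

```python
import hashlib


class PolicyAnalysisCache:
    """Cache analysis results keyed by a hash of the policy text."""

    def __init__(self, analyse):
        self._analyse = analyse  # e.g. analyse_privacy_policy
        self._store = {}

    @staticmethod
    def _key(policy_text):
        # Collapse whitespace so trivial formatting changes still hit the cache.
        normalised = " ".join(policy_text.split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def get(self, policy_text):
        key = self._key(policy_text)
        if key not in self._store:
            self._store[key] = self._analyse(policy_text)
        return self._store[key]
```

Usage would be `cache = PolicyAnalysisCache(analyse_privacy_policy)` once at start-up, then `cache.get(policy_text)` wherever the analysis function is called today.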

The Live Product

ClearlyCompliant is live at clearlycompliant.co.uk. If you want to see what the report output looks like or run a scan on a site you're working on, the £29.99 one-off report is available directly on the site.

Happy to answer questions on any part of the build in the comments - the threading approach, the ReportLab PDF generation, the Claude API integration, or the Stripe webhook setup. All of it was figured out the hard way so ask away.
