Construction Capital
How I Built an AI-Powered Lead Generation Pipeline by Scraping 1,300+ Planning Applications

A deep dive into building automated council planning scrapers, AI classification systems, and intelligent outreach workflows using Python, Claude API, and n8n — turning public data into qualified business leads.

Most developers build SaaS tools. I built an AI system that reads planning applications from every council in England, classifies them by development type and value, finds the developer’s contact details, and triggers personalised outreach — all while I sleep.

Here’s the full technical breakdown of how I did it, what worked, what failed spectacularly, and the stack that now powers the deal pipeline at Construction Capital, a development finance advisory firm I’m building in the UK.


The Problem: Finding Needles in a Bureaucratic Haystack

If you’ve never looked at UK planning data, here’s the short version: every property development in England requires planning permission from the local council. These applications are public record. They contain the developer’s name, the site address, the type of development, and often the estimated project value.

For someone in development finance, this is gold. A developer who just received planning approval for 12 new-build houses needs funding — usually within weeks. But there are 333 local planning authorities in England, each with their own janky web portal, inconsistent data formats, and zero standardisation.

Manually checking even 10 councils a day would take hours. I needed a system that could monitor all of them, extract what matters, and surface only the opportunities worth pursuing.


Architecture Overview

Here’s what the full pipeline looks like:

┌─────────────────┐     ┌──────────────────┐     ┌─────────────────┐
│ Council Portal  │────▶│  Scraper Engine  │────▶│ Raw Data Store  │
│ (333 sources)   │     │ (Python/Scrapy)  │     │  (PostgreSQL)   │
└─────────────────┘     └──────────────────┘     └────────┬────────┘
                                                          │
                                                          ▼
                                                 ┌─────────────────┐
                                                 │  AI Classifier  │
                                                 │  (Claude API)   │
                                                 └────────┬────────┘
                                                          │
                                                          ▼
                                                 ┌─────────────────┐
                                                 │ Lead Enrichment │
                                                 │ (Companies House│
                                                 │  + LinkedIn API)│
                                                 └────────┬────────┘
                                                          │
                                                          ▼
                                                 ┌─────────────────┐
                                                 │ Outreach Engine │
                                                 │  (n8n + SMTP)   │
                                                 └─────────────────┘

Let’s break down each component.


Step 1: The Scraper Engine

The Challenge with Council Portals

UK council planning portals are a developer’s nightmare. Some use Idox, some use NEC (formerly Civica), a few use bespoke systems built in what I can only assume was a fever dream in 2004. There’s no unified API, no consistent HTML structure, and session-based authentication that would make a security researcher weep.

I started with Scrapy but quickly realised I needed Playwright for the JavaScript-heavy portals:

import asyncio
from playwright.async_api import async_playwright

class CouncilScraper:
    def __init__(self, council_config):
        self.config = council_config
        self.base_url = council_config['portal_url']
        self.portal_type = council_config['portal_type']  # idox, nec, custom

    async def scrape_applications(self, date_from, date_to):
        async with async_playwright() as p:
            browser = await p.chromium.launch(headless=True)
            page = await browser.new_page()
            try:
                # Each portal type has its own navigation strategy
                if self.portal_type == 'idox':
                    return await self._scrape_idox(page, date_from, date_to)
                elif self.portal_type == 'nec':
                    return await self._scrape_nec(page, date_from, date_to)
                else:
                    return await self._scrape_custom(page, date_from, date_to)
            finally:
                await browser.close()

    async def _scrape_idox(self, page, date_from, date_to):
        """Idox portals (used by ~60% of councils)"""
        await page.goto(f"{self.base_url}/search.do?action=advanced")

        # Set the decision-date range (Idox field ids contain
        # parentheses, which must be escaped in CSS selectors)
        await page.fill('#date\\(applicationDecisionDate\\)From',
                        date_from.strftime('%d/%m/%Y'))
        await page.fill('#date\\(applicationDecisionDate\\)To',
                        date_to.strftime('%d/%m/%Y'))

        # Restrict to decided applications; approvals are filtered downstream
        await page.select_option('#searchCriteria\\.caseStatus',
                                 'Decided')

        await page.click('button[type="submit"]')
        await page.wait_for_load_state('networkidle')

        applications = []
        while True:
            # Extract application data from results page
            rows = await page.query_selector_all('.searchresult')
            for row in rows:
                app = await self._extract_idox_application(row)
                if app:
                    applications.append(app)

            # Check for next page
            next_btn = await page.query_selector('a.next')
            if next_btn:
                await next_btn.click()
                await page.wait_for_load_state('networkidle')
            else:
                break

        return applications

Handling the Edge Cases

The real engineering challenge wasn’t the scraping itself — it was the edge cases:

  • Rate limiting: Some councils will block you after 50 requests. I implemented exponential backoff with jitter and rotated through a pool of residential proxies.
  • CAPTCHA walls: A handful of councils use CAPTCHA on their search pages. For these, I fell back to the council’s RSS feeds where available, or the Planning Inspectorate’s centralised data dumps.
  • Data inconsistency: The same type of development might be labelled “New Build Residential” on one portal and “Erection of 4no. dwellings with associated parking” on another. This is where the AI classifier earns its keep.
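To make the rate-limiting point concrete, here's a minimal sketch of exponential backoff with full jitter. The parameter values and helper names are illustrative, not the production settings:

```python
import asyncio
import random

def backoff_delay(attempt: int, base: float = 2.0, cap: float = 120.0) -> float:
    """Full jitter: wait a random amount between 0 and
    min(cap, base * 2**attempt) seconds."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

async def fetch_with_retries(fetch, url: str, max_attempts: int = 5):
    """Retry an async fetch callable, sleeping with jittered backoff
    between failures; re-raise once attempts are exhausted."""
    for attempt in range(max_attempts):
        try:
            return await fetch(url)
        except Exception:
            if attempt == max_attempts - 1:
                raise
            await asyncio.sleep(backoff_delay(attempt))
```

Full jitter (a random wait between zero and the exponential cap) spreads retries out better than fixed backoff when several council scrapers hit a block at the same time.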
# Config-driven approach for 333 councils
COUNCIL_CONFIGS = {
    'manchester': {
        'portal_url': 'https://pa.manchester.gov.uk/online-applications',
        'portal_type': 'idox',
        'rate_limit': 2.0,  # seconds between requests
        'proxy_required': False
    },
    'birmingham': {
        'portal_url': 'https://eplanning.idox.birmingham.gov.uk',
        'portal_type': 'idox',
        'rate_limit': 3.0,
        'proxy_required': True
    },
    # ... 331 more configs
}

After three months of refinement, the scraper had pulled 1,344 approved development sites dating back to September. Not bad for a system running on a £20/month VPS.


Step 2: AI Classification with Claude

Raw planning data is noisy. You get everything from “Replacement of garden shed” to “Construction of 200-unit residential development with commercial ground floor.” I needed to classify each application by:

  1. Development type: New-build residential, conversion, commercial, mixed-use
  2. Estimated project value: Based on unit count, location, and development type
  3. Funding probability: How likely is this developer to need external finance?
  4. Priority score: A composite ranking for outreach ordering
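The composite priority score comes out of the prompt itself, but it helps to see its shape. A hypothetical weighting — the thresholds and bumps here are illustrative, not the exact rules in my prompt:

```python
def priority_score(funding_need: int, estimated_gdv: float,
                   is_experienced: bool) -> int:
    """Illustrative composite: the funding-need score dominates, with
    bumps for larger schemes and a developer track record, clamped to 1-10."""
    score = funding_need
    if estimated_gdv >= 2_000_000:   # mid-size scheme
        score += 1
    if estimated_gdv >= 5_000_000:   # large scheme
        score += 1
    if is_experienced:
        score += 1
    return max(1, min(10, score))
```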

The Classification Prompt

This is where Claude’s structured output capability really shines. Here’s the core of the classification system:

import json

import anthropic

client = anthropic.Anthropic()

def classify_application(application_data: dict) -> dict:
    prompt = f"""Analyse this UK planning application and classify it 
    for development finance lead qualification.

    Application Reference: {application_data['reference']}
    Description: {application_data['description']}
    Address: {application_data['address']}
    Applicant: {application_data['applicant']}
    Decision: {application_data['decision']}
    Decision Date: {application_data['decision_date']}

    Classify this application and respond in JSON format:
    {{
        "development_type": "new_build_residential | conversion | 
            commercial | mixed_use | minor_works | other",
        "estimated_units": <integer or null>,
        "estimated_gdv": <estimated gross development value in GBP>,
        "funding_need_score": <1-10, where 10 = almost certainly 
            needs development finance>,
        "priority_score": <1-10, composite score for outreach>,
        "reasoning": "<brief explanation>",
        "is_qualified_lead": <boolean>
    }}

    Rules for qualification:
    - Minor works (extensions, sheds, fences) = NOT qualified
    - Single dwelling self-builds = LOW priority
    - 2+ residential units = QUALIFIED
    - Any commercial development over 500sqm = QUALIFIED
    - Conversions creating 3+ units = QUALIFIED
    - Estimated GDV under £500k = NOT qualified"""

    response = client.messages.create(
        model="claude-sonnet-4-5-20250929",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}]
    )

    return json.loads(response.content[0].text)
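One caveat on that final `json.loads`: models occasionally wrap JSON in markdown fences or add a line of chatter around it. A small defensive helper (the name is hypothetical, but this is the kind of guard worth putting in front of the parse):

```python
import json

def parse_model_json(raw: str) -> dict:
    """Parse a JSON object from model output, tolerating markdown
    code fences and leading/trailing chatter around the braces."""
    text = raw.strip()
    # Strip ```json ... ``` style fences if present
    if text.startswith('```'):
        text = text.split('\n', 1)[1] if '\n' in text else text
        text = text.rsplit('```', 1)[0]
    # Fall back to the outermost brace pair
    start, end = text.find('{'), text.rfind('}')
    if start == -1 or end == -1:
        raise ValueError('no JSON object found in model output')
    return json.loads(text[start:end + 1])
```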

Classification Results

After running this across the full dataset, the numbers were striking:

Category                         Count   % of Total
Qualified leads (score 7+)         187        13.9%
Moderate potential (score 4-6)     342        25.4%
Not qualified                      815        60.6%

That 13.9% qualified rate means roughly 1 in 7 approved planning applications represents a genuine development finance opportunity. At an average deal size of £2-5M, even converting a small percentage of those leads is meaningful.

Accuracy Validation

I manually reviewed 200 classifications against my own assessment. Claude achieved 94.5% accuracy on the binary qualified/not-qualified decision, and 87% accuracy on the development type classification. The main failure mode was underestimating the complexity (and therefore funding needs) of large conversion projects.
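The validation pass was nothing fancy: a hand-labelled sample and a script comparing the two label sets. A sketch, assuming the field names from the classifier output above:

```python
def validate(predictions: list[dict], ground_truth: list[dict]) -> dict:
    """Compare model classifications against hand labels on two axes:
    the binary qualified decision and the development-type label."""
    assert len(predictions) == len(ground_truth)
    n = len(predictions)
    qualified_hits = sum(
        p['is_qualified_lead'] == g['is_qualified_lead']
        for p, g in zip(predictions, ground_truth)
    )
    type_hits = sum(
        p['development_type'] == g['development_type']
        for p, g in zip(predictions, ground_truth)
    )
    return {
        'qualified_accuracy': qualified_hits / n,
        'type_accuracy': type_hits / n,
    }
```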


Step 3: Lead Enrichment

A planning application gives you a name and an address. To do meaningful outreach, you need more context. My enrichment pipeline pulls from three sources:

async def enrich_lead(application: dict) -> dict:
    enriched = {**application}

    # 1. Companies House lookup
    if company_name := extract_company_name(application['applicant']):
        ch_data = await companies_house_lookup(company_name)
        enriched['company_number'] = ch_data.get('company_number')
        enriched['directors'] = ch_data.get('directors', [])
        enriched['incorporation_date'] = ch_data.get('date_of_creation')
        enriched['sic_codes'] = ch_data.get('sic_codes', [])

    # 2. Previous planning history
    enriched['previous_applications'] = await get_planning_history(
        applicant=application['applicant'],
        council=application['council']
    )
    enriched['is_experienced_developer'] = (
        len(enriched['previous_applications']) >= 2
    )

    # 3. Contact discovery
    enriched['linkedin_profile'] = await find_linkedin_profile(
        name=application['applicant'],
        company=company_name,
        location=application['council']
    )

    return enriched

The Companies House API is free and excellent — one of the best government APIs I’ve ever worked with. Planning history helps distinguish experienced developers (who are more likely to need structured finance) from first-timers doing a single self-build.


Step 4: The Outreach Engine (n8n)

I chose n8n over Zapier for the outreach automation because I needed more control over the workflow logic, and n8n’s self-hosted option meant I could keep sensitive lead data on my own infrastructure.

The workflow triggers when a new qualified lead enters the database:

┌───────────────┐     ┌───────────────┐     ┌──────────────────┐
│ New Lead      │────▶│ Template      │────▶│ Personalisation  │
│ (Webhook)     │     │ Selection     │     │ (Claude API)     │
└───────────────┘     └───────────────┘     └────────┬─────────┘
                                                     │
                                                     ▼
                                            ┌──────────────────┐
                                            │ Email Send       │
                                            │ (SMTP via O365)  │
                                            └────────┬─────────┘
                                                     │
                                                     ▼
                                            ┌──────────────────┐
                                            │ CRM Update       │
                                            │ (Airtable)       │
                                            └──────────────────┘

Personalisation at Scale

Each outreach email is personalised using Claude based on the enriched lead data. The key insight was that generic “Hi, we do development finance” emails get a ~2% response rate. Emails that reference the specific planning application, the development type, and relevant funding structures get 8-12%:

def generate_outreach_email(lead: dict) -> str:
    prompt = f"""Write a brief, professional outreach email to a 
    property developer who has just received planning approval.

    Developer: {lead['applicant']}
    Project: {lead['description']}
    Location: {lead['address']}
    Estimated GDV: £{lead['estimated_gdv']:,.0f}
    Developer Experience: {'Experienced' if lead['is_experienced_developer'] 
                           else 'Newer developer'}

    The email should:
    - Congratulate them on their planning approval
    - Reference their specific project naturally
    - Briefly mention we provide development finance advisory
    - Suggest a quick call to discuss funding options
    - Be under 150 words
    - Sound human, not templated
    - NOT be pushy or salesy

    Sign off as the team at Construction Capital."""

    # ... Claude API call

Email Infrastructure

For scaled outreach, email deliverability is everything. I set up multiple Office 365 accounts across different domains, each warmed up over 2-3 weeks before sending live emails. Key learnings:

  • Warm-up period: Start with 5 emails/day, increase by 5 each week
  • Domain diversity: Don’t send all emails from one domain
  • SPF/DKIM/DMARC: Non-negotiable. Set these up properly or go straight to spam
  • Content variation: The AI-generated personalisation actually helps deliverability because no two emails are identical
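The warm-up rule above ("start with 5 emails/day, increase by 5 each week") is simple enough to encode directly. A sketch of the quota function a sender could check before each batch — the 50/day cap is my assumption, not a hard rule:

```python
def daily_send_quota(days_since_warmup_start: int, cap: int = 50) -> int:
    """5 emails/day in week one, +5/day for each completed week,
    capped once the mailbox is considered warm."""
    week = days_since_warmup_start // 7
    return min(cap, 5 * (week + 1))
```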

The Tech Stack Summary

Component              Technology                                    Cost/month
Scraper runtime        Hetzner VPS (CX21)                            £8
Database               PostgreSQL on same VPS                        £0
AI Classification      Claude API (Sonnet)                           ~£30
Enrichment             Companies House API (free) + custom scripts   £0
Workflow automation    n8n (self-hosted)                             £0
Email infrastructure   Office 365 (3 accounts)                       £30
Monitoring             Uptime Kuma + Grafana                         £0
Total                                                                ~£68/month

For under £70/month, I have a system that monitors 333 councils, classifies thousands of planning applications, enriches qualified leads, and triggers personalised outreach. The Construction Capital pipeline currently represents approximately £600k in potential advisory fees across active opportunities — a return that makes the infrastructure cost look like a rounding error.


What I’d Do Differently

1. Start with the Planning Inspectorate data dumps. I spent weeks building individual council scrapers before discovering that the Planning Inspectorate publishes bulk data extracts. These don’t have everything, but they cover ~70% of what I need. I’d use these as the foundation and only build custom scrapers for the gaps.

2. Use structured outputs from day one. Early versions of the classifier used free-text responses that I parsed with regex. Moving to Claude’s structured JSON output eliminated an entire class of parsing bugs.

3. Don’t over-automate outreach initially. I built the full automation pipeline before validating that the outreach messaging worked. I should have sent the first 50 emails manually, iterated on the messaging, and then automated.

4. Build the CRM integration earlier. I was tracking leads in a spreadsheet for the first month. Moving to Airtable with proper status tracking and follow-up reminders immediately improved conversion rates.


Results After 5 Months

  • 1,344 planning applications scraped and classified
  • 187 qualified leads identified
  • ~60% open rate on personalised outreach emails
  • ~10% response rate (compared to 2% industry average for cold email)
  • 10 active opportunities in the Construction Capital pipeline
  • ~£600k in potential advisory fees

The system runs largely autonomously now. New planning approvals flow in daily, get classified overnight, enriched by morning, and outreach goes out by mid-afternoon. I check in once a day to review the highest-priority leads and handle responses.


Open Source Considerations

I’ve been asked whether I’d open-source the scraper configs. I’m considering releasing the Idox and NEC scraper templates (since these cover ~80% of councils) as a public repo. If there’s interest, drop a comment below and I’ll gauge demand.

The AI classification prompts and the n8n workflow templates are things I’d happily share too — the competitive advantage isn’t in the code, it’s in the domain expertise and the relationships on the lending side.


Key Takeaways for Developers

  1. Public data is massively underutilised. Planning applications, Companies House filings, Land Registry data — there’s a goldmine of structured information sitting in government portals waiting to be systematically harvested.
  2. AI classification transforms noisy data into actionable intelligence. The jump from “raw planning applications” to “qualified, enriched, scored leads” is where all the value is created.
  3. Personalisation at scale is now trivially easy. With Claude or similar LLMs, generating genuinely personalised outreach at scale costs pennies per message and dramatically outperforms templates.
  4. The cheapest infrastructure is often the best. A £20/month VPS running Python scripts and PostgreSQL outperforms most enterprise lead generation platforms I’ve seen.
  5. Domain expertise matters more than code. Anyone can build a scraper. Knowing which planning applications represent genuine funding needs, and how to structure the right finance solution for each project — that’s the moat.

If you’re building something similar in proptech, fintech, or any domain where public data meets AI classification, I’d love to hear about it in the comments. And if you’re a property developer who just got planning approval and needs funding… well, you know where to find us. 😄


Connect with me:

  • 🏗️ Construction Capital — Development finance advisory
  • 💼 LinkedIn — 18k+ connections in property & finance
  • 🐙 GitHub — Scraper templates coming soon (if demand exists)
