Automating Web Intelligence with Python: A Practical Guide


Web intelligence — the systematic extraction of actionable data from the web — sounds like a spy movie plot. In reality, it's a Python script with good error handling and a respectful rate limit.

I run an agency in Berlin that builds this infrastructure for European clients. Here's the practical stack we use.


Pattern 1: Structured Data Extraction

The simplest and most common pattern: a website has predictable HTML structure, and you want specific fields.

import requests
from bs4 import BeautifulSoup
import time

def extract_product(url):
    headers = {
        "User-Agent": "GrahamMirandaBot/1.0 (+https://grahammiranda.com/scraping-policy)"
    }
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # Raise on 4xx/5xx so the retry logic further down can react
    soup = BeautifulSoup(response.text, "html.parser")

    return {
        "title": soup.select_one("h1.product-title").get_text(strip=True),
        "price": soup.select_one("span.price").get_text(strip=True),
        "availability": soup.select_one("div.availability").get_text(strip=True),
        "url": url,
        "scraped_at": time.strftime("%Y-%m-%d %H:%M:%S")
    }

# Usage
for url in product_urls:
    data = extract_product(url)
    save_to_database(data)
    time.sleep(1.0)  # Respectful rate limiting

When this works: E-commerce product pages, directory listings, news article pages — any site with consistent CSS classes.

When it breaks: Sites using JavaScript frameworks (React, Vue) where content loads after initial HTML.


Pattern 2: Dynamic Content with Playwright

For JS-rendered sites, you need a real browser. Playwright is the current standard.

from playwright.sync_api import sync_playwright
import time

def extract_dynamic(url):
    with sync_playwright() as p:
        browser = p.firefox.launch(headless=True)
        page = browser.new_page()

        # Anti-detection
        page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
        """)

        page.goto(url, wait_until="networkidle")
        time.sleep(2)  # Wait for JS hydration

        # Extract after render
        title = page.locator("h1").inner_text()
        price = page.locator("[data-testid='price']").inner_text()

        browser.close()
        return {"title": title, "price": price, "url": url}

Key insight: wait_until="networkidle" waits for network activity to cease — essential for SPAs. But on heavy pages, this can take 10+ seconds. Use domcontentloaded + fixed sleep for faster processing.
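
A minimal sketch of the faster variant, assuming the page hydrates within a fixed window (the two-second pause is a guess you would tune per site):

from playwright.sync_api import sync_playwright

def extract_dynamic_fast(url):
    with sync_playwright() as p:
        browser = p.firefox.launch(headless=True)
        page = browser.new_page()
        # Return as soon as the DOM is parsed, without waiting for every XHR to finish
        page.goto(url, wait_until="domcontentloaded")
        page.wait_for_timeout(2000)  # Fixed pause for JS hydration; tune per site
        title = page.locator("h1").inner_text()
        browser.close()
        return {"title": title, "url": url}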

Cost: ~3-5x slower than requests/BeautifulSoup. Use only when necessary.


Pattern 3: API Aggregation

Many sites expose JSON APIs that are easier to parse than HTML. Finding them:

import requests

# Method 1: Check page source for API calls
response = requests.get("https://example.com/products")
# Look for `window.__INITIAL_STATE__` or similar JS variables

# Method 2: Network inspection (manual, then script)
# Open DevTools → Network → XHR → find the API endpoint
# Usually: /api/v1/products, /graphql, /json/products

# Method 3: Check for GraphQL endpoints
# Common patterns: /graphql, /api/graphql, /gql

The real value: APIs return structured data. No HTML parsing needed. Much more reliable.
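
Once an endpoint is found, consuming it is a few lines. A sketch against a hypothetical /api/v1/products endpoint (the path, parameters, and field names are placeholders; inspect the real response before mapping fields):

import requests

def fetch_products_api(base_url, page=1):
    # Hypothetical endpoint discovered via DevTools -> Network -> XHR
    response = requests.get(
        f"{base_url}/api/v1/products",
        params={"page": page, "per_page": 50},
        headers={"User-Agent": "GrahamMirandaBot/1.0 (+https://grahammiranda.com/scraping-policy)"},
        timeout=30,
    )
    response.raise_for_status()
    payload = response.json()
    # Field names are placeholders; adjust to the actual JSON structure
    return [
        {"title": item.get("name"), "price": item.get("price")}
        for item in payload.get("items", [])
    ]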


Pattern 4: Distributed Crawling at Scale

For thousands of URLs, you need a queue system.

from celery import Celery
import redis

app = Celery('web_intel', broker='redis://localhost:6379/0')
redis_client = redis.Redis(host='localhost', port=6379, db=1)

@app.task
def scrape_url(url):
    # Your extraction logic here
    data = extract_product(url)
    redis_client.hset(f"product:{url}", mapping=data)
    return data

# Queue 10,000 URLs
for url in url_list:
    scrape_url.delay(url)

Architecture:

  • Redis: URL queue + result cache
  • Celery workers: 4-8 parallel scrapers
  • PostgreSQL: Persistent storage
  • Monitoring: Flower (Celery dashboard) + Prometheus

Scaling rule: One worker per CPU core. With rate limiting at 1 req/sec, 8 workers = 8 req/sec sustained throughput.
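
A minimal sketch of the matching Celery configuration. The task name assumes the task module is called web_intel.py; adjust to your layout:

from celery import Celery

app = Celery("web_intel", broker="redis://localhost:6379/0")

# Throttle at the Celery level so every worker respects the cap
app.conf.update(
    worker_concurrency=8,   # one worker process per CPU core
    task_acks_late=True,    # re-queue a task if its worker dies mid-scrape
    task_annotations={"web_intel.scrape_url": {"rate_limit": "1/s"}},  # per-worker cap
)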


Pattern 5: Document Parsing

Web intelligence isn't just web pages. Documents matter too.

# PDF extraction
import requests
import fitz  # PyMuPDF

def extract_pdf_text(url):
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    doc = fitz.open(stream=response.content, filetype="pdf")
    text = "\n".join(page.get_text() for page in doc)
    return text

# DOCX extraction
from docx import Document

def extract_docx_text(filepath):
    doc = Document(filepath)
    return "\n".join(p.text for p in doc.paragraphs)

Use case: Legal discovery, contract analysis, regulatory document monitoring.


Error Handling That Matters

The difference between a prototype and production scraping is error handling.

def robust_extract(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            return extract_product(url)
        except requests.exceptions.Timeout:
            time.sleep(5 * (attempt + 1))  # Back off progressively: 5s, 10s, 15s
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 429:
                time.sleep(60)  # Rate limited — long wait
            elif e.response.status_code == 403:
                return {"error": "blocked", "url": url}
            elif e.response.status_code in (502, 503, 504):
                time.sleep(2 ** attempt)  # Transient server error: retry with backoff
            else:
                raise
        except Exception as e:
            if attempt == max_retries - 1:
                return {"error": str(e), "url": url}
            time.sleep(2 ** attempt)

    return {"error": "max_retries_exceeded", "url": url}

Critical patterns:

  • 429 (Too Many Requests): Back off for 60+ seconds
  • 403 (Forbidden): Switch proxy or skip
  • 502/503/504: Retry with exponential backoff
  • Connection timeout: Retry with longer timeout
  • Parse error: Log and continue, so one bad page doesn't crash the batch (see the loop sketched below)
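
A minimal batch loop putting those rules together: robust_extract (above) handles retries, and failures are logged rather than raised. The logger name and return shape are placeholders:

import logging

logger = logging.getLogger("web_intel")

def run_batch(urls):
    results, failures = [], []
    for url in urls:
        data = robust_extract(url)
        if "error" in data:
            # Log and keep going; one bad page shouldn't sink the batch
            logger.warning("Failed %s: %s", url, data["error"])
            failures.append(data)
        else:
            results.append(data)
    return results, failures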

Rate Limiting: The Non-Negotiable Rule

import time
from functools import wraps

def rate_limited(max_per_second=1.0):
    min_interval = 1.0 / max_per_second
    last_called = [0.0]

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)
            result = fn(*args, **kwargs)
            last_called[0] = time.time()
            return result
        return wrapper
    return decorator

@rate_limited(max_per_second=0.5)  # 1 request per 2 seconds
def fetch_url(url):
    return requests.get(url, timeout=30)

Our defaults (per-domain application is sketched after this list):

  • General sites: 1 req/sec
  • Small sites (< 100 pages): 0.5 req/sec
  • APIs with documented limits: Follow documented limit minus 10%
  • German government sites: 0.2 req/sec (extra polite)
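
A minimal sketch of applying those defaults per domain. The domain names and rates are illustrative; unlisted domains fall back to the general default:

import time
import requests
from urllib.parse import urlparse

# Illustrative per-domain rates in requests per second
RATE_DEFAULTS = {
    "api.example.com": 0.9,    # documented limit of 1 req/sec, minus 10%
    "small.example.org": 0.5,  # small site
    "gov.example.net": 0.2,    # government site: extra polite
}

_last_hit = {}  # domain -> timestamp of last request

def polite_get(url, default_rate=1.0, **kwargs):
    domain = urlparse(url).netloc
    min_interval = 1.0 / RATE_DEFAULTS.get(domain, default_rate)
    elapsed = time.time() - _last_hit.get(domain, 0.0)
    if elapsed < min_interval:
        time.sleep(min_interval - elapsed)
    _last_hit[domain] = time.time()
    return requests.get(url, timeout=30, **kwargs)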

DSGVO (GDPR) Compliance Checklist

For every scraping project:

  • [ ] robots.txt checked and respected (a robots.txt check is sketched after this list)
  • [ ] Public data only (no login-required pages)
  • [ ] No personal data collected (names, emails, addresses stripped)
  • [ ] Rate limiting implemented
  • [ ] User-Agent identifies scraper with contact info
  • [ ] Data retention policy defined (default: 90 days)
  • [ ] Audit trail maintained
  • [ ] Legitimate Interest assessment documented
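
The robots.txt item is easy to automate with the standard library. A minimal sketch using urllib.robotparser, reusing the User-Agent from Pattern 1:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url, user_agent="GrahamMirandaBot"):
    # Fetch and parse robots.txt for the target's origin
    origin = "{0.scheme}://{0.netloc}".format(urlparse(url))
    parser = RobotFileParser()
    parser.set_url(f"{origin}/robots.txt")
    parser.read()
    return parser.can_fetch(user_agent, url)

# Skip URLs the site owner has asked crawlers to avoid
urls_to_scrape = [u for u in product_urls if allowed_by_robots(u)]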

Tools We Recommend

| Tool | Purpose | Cost |
| --- | --- | --- |
| requests + BeautifulSoup | Simple HTML scraping | Free |
| Playwright | JS-rendered sites | Free |
| Scrapy | Large-scale crawling | Free |
| Celery + Redis | Distributed queue | Free |
| PostgreSQL | Structured storage | Free |
| Elasticsearch | Full-text search | Free |
| Hetzner VPS | Hosting | €6–15/month |
| Bright Data / Oxylabs | Residential proxies | $50–500/month |

Real-World Example

Client: Hamburg logistics company
Goal: Monitor shipping lane disruptions from port authority websites
Scope: 15 port authorities across Europe, updated every 6 hours
Implementation:

  • Scrapy spider with rotating User-Agents
  • Celery beat schedule (6-hour intervals)
  • Redis for deduplication
  • Telegram notification on status changes
  • DSGVO compliance: Public data only, 30-day retention

Result: Detected a Rotterdam strike 4 hours before mainstream news. Client rerouted 3 containers, saving €12,000 in demurrage fees.
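
A rough sketch of the scheduling and notification glue, reusing `app` and `redis_client` from Pattern 4. The task name, Redis key, and Telegram details are placeholders, not the client's actual configuration:

import requests

# Celery beat: run the crawl every 6 hours
app.conf.beat_schedule = {
    "scrape-port-authorities": {
        "task": "web_intel.scrape_all_ports",  # placeholder task that fans out per-port scrapes
        "schedule": 6 * 60 * 60,               # seconds
    },
}

def notify_if_changed(port, status, bot_token, chat_id):
    # Redis set membership doubles as deduplication: alert only on a state we haven't seen
    if redis_client.sadd("seen_statuses", f"{port}:{status}"):
        requests.post(
            f"https://api.telegram.org/bot{bot_token}/sendMessage",
            json={"chat_id": chat_id, "text": f"{port}: {status}"},
            timeout=30,
        )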


Graham Miranda is the founder of Graham Miranda UG (Berlin, HRB 36794), specializing in web intelligence, automation, and privacy-first infrastructure for European businesses.
