Automating Web Intelligence with Python: A Practical Guide
Web intelligence — the systematic extraction of actionable data from the web — sounds like a spy movie plot. In reality, it's a Python script with good error handling and a respectful rate limit.
I run an agency in Berlin that builds this infrastructure for European clients. Here's the practical stack we use.
Pattern 1: Structured Data Extraction
The simplest and most common pattern: a website has predictable HTML structure, and you want specific fields.
```python
import requests
from bs4 import BeautifulSoup
import time

def extract_product(url):
    headers = {
        "User-Agent": "GrahamMirandaBot/1.0 (+https://grahammiranda.com/scraping-policy)"
    }
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()  # Raise HTTPError on 4xx/5xx so the retry logic below can react
    soup = BeautifulSoup(response.text, "html.parser")
    # Note: select_one() returns None if a selector is missing; the error
    # handling section below catches the resulting AttributeError.
    return {
        "title": soup.select_one("h1.product-title").get_text(strip=True),
        "price": soup.select_one("span.price").get_text(strip=True),
        "availability": soup.select_one("div.availability").get_text(strip=True),
        "url": url,
        "scraped_at": time.strftime("%Y-%m-%d %H:%M:%S"),
    }

# Usage (product_urls and save_to_database are defined elsewhere)
for url in product_urls:
    data = extract_product(url)
    save_to_database(data)
    time.sleep(1.0)  # Respectful rate limiting
```
When this works: E-commerce product pages, directory listings, news article pages — any site with consistent CSS classes.
When it breaks: Sites built on JavaScript frameworks (React, Vue), where content loads after the initial HTML. A quick way to check is sketched below.
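Before reaching for a browser, it's worth a quick probe: fetch the raw HTML and check whether the target selector is already present. A minimal sketch (the selector is an assumption; use whatever field you actually extract):

```python
import requests
from bs4 import BeautifulSoup

def needs_browser(url, selector="h1.product-title"):
    # Heuristic: if the selector is absent from the raw HTML,
    # the content is probably rendered client-side and needs Pattern 2.
    html = requests.get(url, timeout=30).text
    return BeautifulSoup(html, "html.parser").select_one(selector) is None
```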
Pattern 2: Dynamic Content with Playwright
For JS-rendered sites, you need a real browser. Playwright is the current standard.
```python
from playwright.sync_api import sync_playwright
import time

def extract_dynamic(url):
    with sync_playwright() as p:
        browser = p.firefox.launch(headless=True)
        page = browser.new_page()
        # Anti-detection: hide the webdriver flag some sites check for
        page.add_init_script("""
            Object.defineProperty(navigator, 'webdriver', { get: () => undefined });
        """)
        page.goto(url, wait_until="networkidle")
        time.sleep(2)  # Wait for JS hydration
        # Extract after render
        title = page.locator("h1").inner_text()
        price = page.locator("[data-testid='price']").inner_text()
        browser.close()
        return {"title": title, "price": price, "url": url}
```
Key insight: wait_until="networkidle" waits until the page has had no network activity for at least 500 ms, which is essential for SPAs. But on heavy pages this can take 10+ seconds. For faster processing, use wait_until="domcontentloaded" plus a fixed sleep, as sketched below.
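A minimal sketch of that faster variant, reusing the page object from the pattern above (the function name and the 2-second window are illustrative, not fixed rules):

```python
def extract_dynamic_fast(page, url):
    # Trades robustness for speed: wait only for the DOM,
    # then give the JS framework a fixed window to hydrate.
    page.goto(url, wait_until="domcontentloaded")
    time.sleep(2)  # Tune per site; heavy SPAs may need more
    return {"title": page.locator("h1").inner_text(), "url": url}
```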
Cost: ~3-5x slower than requests/BeautifulSoup. Use only when necessary.
Pattern 3: API Aggregation
Many sites expose JSON APIs that are easier to parse than HTML. Finding them:
```python
import requests

# Method 1: Check page source for embedded state
response = requests.get("https://example.com/products")
# Look for `window.__INITIAL_STATE__` or similar JS variables

# Method 2: Network inspection (manual first, then script it)
# Open DevTools → Network → XHR → find the API endpoint
# Usually: /api/v1/products, /graphql, /json/products

# Method 3: Check for GraphQL endpoints
# Common patterns: /graphql, /api/graphql, /gql
```
The real value: APIs return structured data. No HTML parsing needed. Much more reliable.
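To make Method 1 concrete, here is a hedged sketch that pulls a window.__INITIAL_STATE__ blob out of the page source. The variable name and the `= {...};` layout are assumptions; check the actual page source and adjust, since a regex like this is brittle if the payload contains "};" inside a string:

```python
import json
import re
import requests

def extract_initial_state(url):
    # Assumes the site embeds state as: window.__INITIAL_STATE__ = {...};
    html = requests.get(url, timeout=30).text
    match = re.search(r"window\.__INITIAL_STATE__\s*=\s*(\{.*?\})\s*;", html, re.DOTALL)
    if match is None:
        return None
    return json.loads(match.group(1))
```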
Pattern 4: Distributed Crawling at Scale
For thousands of URLs, you need a queue system.
```python
from celery import Celery
import redis

app = Celery('web_intel', broker='redis://localhost:6379/0')
redis_client = redis.Redis(host='localhost', port=6379, db=1)

@app.task
def scrape_url(url):
    # Your extraction logic here
    data = extract_product(url)
    redis_client.hset(f"product:{url}", mapping=data)
    return data

# Queue 10,000 URLs
for url in url_list:
    scrape_url.delay(url)
```
Architecture:
- Redis: URL queue + result cache
- Celery workers: 4-8 parallel scrapers
- PostgreSQL: Persistent storage
- Monitoring: Flower (Celery dashboard) + Prometheus
Scaling rule: one worker process per CPU core. With per-worker rate limiting at 1 req/sec, 8 workers sustain 8 req/sec of aggregate throughput. Celery can also enforce the limit per task, as sketched below.
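A minimal sketch of the built-in option, reusing app and extract_product from above. Celery's rate_limit is enforced per worker process, so the 8-worker arithmetic still holds:

```python
@app.task(rate_limit="1/s")  # Celery throttles this task to 1 req/sec per worker
def scrape_url_limited(url):
    return extract_product(url)
```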
Pattern 5: Document Parsing
Web intelligence isn't just web pages. Documents matter too.
```python
import requests
import fitz  # PyMuPDF
from docx import Document

# PDF extraction
def extract_pdf_text(url):
    response = requests.get(url, timeout=30)
    doc = fitz.open(stream=response.content, filetype="pdf")
    text = "\n".join(page.get_text() for page in doc)
    return text

# DOCX extraction
def extract_docx_text(filepath):
    doc = Document(filepath)
    return "\n".join(p.text for p in doc.paragraphs)
```
Use case: Legal discovery, contract analysis, regulatory document monitoring.
Error Handling That Matters
The difference between a prototype and production scraping is error handling.
```python
def robust_extract(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            return extract_product(url)
        except requests.exceptions.Timeout:
            time.sleep(5 * (attempt + 1))  # Linear backoff: 5s, 10s, 15s
        except requests.exceptions.HTTPError as e:
            if e.response.status_code == 429:
                time.sleep(60)  # Rate limited — long wait
            elif e.response.status_code == 403:
                return {"error": "blocked", "url": url}
            else:
                raise
        except Exception as e:
            if attempt == max_retries - 1:
                return {"error": str(e), "url": url}
            time.sleep(2 ** attempt)  # Exponential backoff: 1s, 2s, 4s
    return {"error": "max_retries_exceeded", "url": url}
```
Critical patterns:
- 429 (Too Many Requests): Back off for 60+ seconds
- 403 (Forbidden): Switch proxy or skip
- 502/503/504: Retry with exponential backoff
- Connection timeout: Retry with longer timeout
- Parse error: Log and continue (don't crash the batch)
Rate Limiting: The Non-Negotiable Rule
```python
import time
from functools import wraps

import requests

def rate_limited(max_per_second=1.0):
    # Note: per-process limiter. Each Celery worker throttles independently.
    min_interval = 1.0 / max_per_second
    last_called = [0.0]

    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_called[0]
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)
            result = fn(*args, **kwargs)
            last_called[0] = time.time()
            return result
        return wrapper
    return decorator

@rate_limited(max_per_second=0.5)  # 1 request per 2 seconds
def fetch_url(url):
    return requests.get(url, timeout=30)
```
Our defaults:
- General sites: 1 req/sec
- Small sites (< 100 pages): 0.5 req/sec
- APIs with documented limits: Follow documented limit minus 10%
- German government sites: 0.2 req/sec (extra polite)
DSGVO (GDPR) Compliance Checklist
For every scraping project:
- [ ] robots.txt checked and respected (see the sketch after this list)
- [ ] Public data only (no login-required pages)
- [ ] No personal data collected (names, emails, addresses stripped)
- [ ] Rate limiting implemented
- [ ] User-Agent identifies scraper with contact info
- [ ] Data retention policy defined (default: 90 days)
- [ ] Audit trail maintained
- [ ] Legitimate Interest assessment documented
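For the robots.txt item, the standard library already covers the basic check. A minimal sketch with urllib.robotparser, using the user agent from Pattern 1:

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def allowed_by_robots(url, user_agent="GrahamMirandaBot"):
    # Fetch the site's robots.txt and ask whether this URL may be crawled.
    parsed = urlparse(url)
    rp = RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(user_agent, url)
```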
Tools We Recommend
| Tool | Purpose | Cost |
|---|---|---|
| requests + BeautifulSoup | Simple HTML scraping | Free |
| Playwright | JS-rendered sites | Free |
| Scrapy | Large-scale crawling | Free |
| Celery + Redis | Distributed queue | Free |
| PostgreSQL | Structured storage | Free |
| Elasticsearch | Full-text search | Free |
| Hetzner VPS | Hosting | €6–15/month |
| Bright Data / Oxylabs | Residential proxies | $50–500/month |
Real-World Example
Client: Hamburg logistics company
Goal: Monitor shipping lane disruptions from port authority websites
Scope: 15 port authorities across Europe, updated every 6 hours
Implementation:
- Scrapy spider with rotating User-Agents
- Celery beat schedule (6-hour intervals; see the sketch after this list)
- Redis for deduplication
- Telegram notification on status changes
- DSGVO compliance: Public data only, 30-day retention
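A minimal sketch of the beat schedule, reusing the app from Pattern 4. The task path web_intel.tasks.check_port_authority is hypothetical, not the client's actual code:

```python
from celery.schedules import crontab

app.conf.beat_schedule = {
    "check-port-authorities": {
        "task": "web_intel.tasks.check_port_authority",  # hypothetical task path
        "schedule": crontab(minute=0, hour="*/6"),       # every 6 hours, on the hour
    },
}
```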
Result: Detected a Rotterdam strike 4 hours before mainstream news. Client rerouted 3 containers, saving €12,000 in demurrage fees.
Resources
- Our tools: asearchz.online
- Code samples: Available on request
- Consulting: grahammiranda.com
Graham Miranda is the founder of Graham Miranda UG (Berlin, HRB 36794), specializing in web intelligence, automation, and privacy-first infrastructure for European businesses.