How I built a self-healing, robots-respecting web scraper (and put it on the Apify Store)

#webscraping #python #showdev #api

Two things kill most scrapers I've used:

A site ships a redesign and the scraper silently breaks. Every selector was pinned to a CSS class that no longer exists — and you don't notice until your dataset has been empty for a week.
Compliance is an afterthought. Most scrapers ignore robots.txt and happily hoover up personal data. Fine for a weekend hack; not fine if anyone downstream cares about legal risk.

So I built a small family of scrapers that fix both on purpose, and published them on the Apify Store. Here's how they work.

1. Selector-free extraction (so redesigns don't break it)

Instead of hard-coding CSS selectors, I strip the nav/footer chrome and then keep only the <a> tags shaped like a headline, judged by link-text length, word count, and URL structure:

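from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_headlines(html, base_url, max_items=25):
    soup = BeautifulSoup(html, 'html.parser')
    # Drop page chrome and non-content tags before looking at links.
    for tag in soup.find_all(['nav', 'header', 'footer', 'aside', 'script', 'style', 'form']):
        tag.decompose()
    out, seen = [], set()
    for a in soup.find_all('a', href=True):
        # Collapse runs of whitespace in the link text.
        text = ' '.join(a.get_text(' ', strip=True).split())
        # Headline shape: 28-200 characters and at least 5 words.
        if not (28 <= len(text) <= 200) or len(text.split()) < 5:
            continue
        # Resolve to an absolute URL and drop any #fragment.
        href = urljoin(base_url, a['href']).split('#')[0]
        # ...same-host + article-like URL checks, dedupe...
        out.append({'title': text, 'url': href})
    return out[:max_items]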
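
The elided checks in the middle of that loop could look something like the sketch below. This is an illustration and not the code from the repo: the helper names (same_host, looks_like_article, keep_link) and the ARTICLE_PATH pattern are hypothetical, and the real thresholds may differ.

```python
import re
from urllib.parse import urlparse

# Hypothetical pattern (an assumption, not the repo's rule): a /YYYY/MM/ date
# segment, a numeric id of 5+ digits, or a slug with at least four hyphenated words.
ARTICLE_PATH = re.compile(r'/\d{4}/\d{2}/|\d{5,}|/[a-z0-9]+(?:-[a-z0-9]+){3,}')

def same_host(href, base_url):
    """True when both URLs share a host, ignoring a leading 'www.'."""
    def host(u):
        h = urlparse(u).netloc.lower()
        return h[4:] if h.startswith('www.') else h
    return host(href) == host(base_url)

def looks_like_article(href):
    """True when the URL path resembles an article rather than a section page."""
    return bool(ARTICLE_PATH.search(urlparse(href).path.lower()))

def keep_link(href, base_url, seen):
    """Apply same-host, article-shape, and dedupe checks; remember kept URLs."""
    if href in seen or not same_host(href, base_url):
        return False
    if not looks_like_article(href):
        return False
    seen.add(href)
    return True
```

Inside the loop, a call such as `if not keep_link(href, base_url, seen): continue` would sit where the placeholder comment is. The pattern stays generic on purpose: it describes what article URLs tend to look like, not any one site's layout.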

No selectors means there's nothing site-specific to break. When a source redesigns, the heuristic usually keeps finding headlines, because it depends on what a headline link looks like rather than where it sits in the markup. And if a source ever returns zero items, the actor flags it instead of silently shipping an empty dataset.
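
A minimal sketch of that zero-item check, assuming a hypothetical collect helper and a scrape_one callable that maps a source URL to a list of items (neither name comes from the repo, and the real actor may report empty sources differently):

```python
import logging

log = logging.getLogger('scraper')

def collect(sources, scrape_one):
    """Scrape each source and return (items, empty_sources).

    `scrape_one` is any callable that takes a source URL and returns a list
    of items; it is a stand-in for the real fetch-and-extract step.
    """
    items, empty = [], []
    for url in sources:
        found = scrape_one(url) or []
        if not found:
            # Surface the empty source instead of quietly moving on.
            log.warning('no headlines extracted from %s', url)
            empty.append(url)
        items.extend(found)
    return items, empty
```

Returning the list of empty sources next to the items lets the caller decide what "flagging" means: a log line, a failed run, or an alert.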

2. Compliance enforced in code, not in a disclaimer

Before fetching anything, the actor checks the host's robots.txt at runtime and proceeds only if our user-agent is allowed, failing closed on any error:

from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

async def robots_allows(client, url, ua):
    p = urlparse(url)
    try:
        r = await client.get(f'{p.scheme}://{p.netloc}/robots.txt')
    except Exception:          # network error or timeout = blocked (fail closed)
        return False
    if r.status_code == 404:   # no robots.txt = no restriction (RFC 9309)
        return True
    if r.status_code != 200:   # anything else = blocked
        return False
    rp = RobotFileParser()
    rp.parse((r.text or '').splitlines())
    return rp.can_fetch(ua, url)
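
To see how the parsing step decides, here is a small offline sketch that feeds a made-up robots.txt to the standard library's RobotFileParser. The file contents, the user-agent strings, and the allowed helper are all illustrative assumptions; no network access is involved.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

def allowed(robots_txt, ua, url):
    """Parse robots.txt text and ask whether `ua` may fetch `url`."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(ua, url)

if __name__ == '__main__':
    for ua, url in [
        ('MyNewsBot/1.0', 'https://example.com/news/story'),
        ('MyNewsBot/1.0', 'https://example.com/private/page'),
        ('BadBot', 'https://example.com/news/story'),
    ]:
        print(ua, url, allowed(ROBOTS_TXT, ua, url))
```

A user-agent with no group of its own falls back to the `*` rules, so the first request is allowed and the second is not. The agent named in its own group is blocked everywhere.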

The actors fetch only public pages and never collect personal data. Output is uniform structured JSON (title, url, source, category, fetched_at) that drops straight into a model or pipeline.
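
As a rough sketch of that output shape: the five field names come from the list above, while the to_record helper, the sample values, and the ISO 8601 UTC format for fetched_at are assumptions made for illustration.

```python
import json
from datetime import datetime, timezone

def to_record(title, url, source, category):
    """Build one output item with the five documented fields."""
    return {
        'title': title,
        'url': url,
        'source': source,
        'category': category,
        # Assumed format: timezone-aware ISO 8601 timestamp in UTC.
        'fetched_at': datetime.now(timezone.utc).isoformat(),
    }

if __name__ == '__main__':
    record = to_record(
        'Placeholder headline for the example record',
        'https://example.com/news/placeholder-headline-for-example',
        'example.com',
        'commodities',
    )
    print(json.dumps(record, indent=2))
```

Because every item has the same keys, downstream code can load the dataset without per-source special cases.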

The result

Same ~80-line core, three curated source lists, three actors:

Commodity & Energy News Scraper — oil, gas, gold, silver, uranium
Crypto & DeFi News Scraper — Bitcoin, markets, policy, on-chain
AI & ML News & Papers Scraper — HuggingFace papers, lab blogs, AI press

All the code is on GitHub: Casterdly/compliant-scrapers.

Why "compliant" turned out to be the feature

I almost skipped the robots check — it's tempting to just scrape and move on. But "we collected this in a grey area" is a non-answer for anyone in a regulated, enterprise, or agency setting. Making compliance a guarantee enforced in code turned a boring scraper into something a cautious team can actually run. Sometimes the constraint is the product.

Got a public data source you'd want as a clean, compliant feed? Each actor takes a sources list — bring your own, or open an issue on the repo.