Two things kill most scrapers I've used:
- A site ships a redesign and the scraper silently breaks. Every selector was pinned to a CSS class that no longer exists — and you don't notice until your dataset has been empty for a week.
- Compliance is an afterthought. Most scrapers ignore robots.txt and happily hoover up personal data. Fine for a weekend hack; not fine if anyone downstream cares about legal risk.
So I built a small family of scrapers that fix both on purpose, and published them on the Apify Store. Here's how they work.
1. Selector-free extraction (so redesigns don't break it)
Instead of hard-coding CSS selectors, I score every <a> by the shape of a headline — link-text length, word count, URL structure — after stripping nav/footer chrome:
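from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_headlines(html, base_url, max_items=25):
    soup = BeautifulSoup(html, 'html.parser')
    # Drop page chrome and non-content tags so only body links remain.
    for tag in soup.find_all(['nav', 'header', 'footer', 'aside', 'script', 'style', 'form']):
        tag.decompose()
    out, seen = [], set()
    for a in soup.find_all('a', href=True):
        # Collapse runs of whitespace in the link text.
        text = ' '.join(a.get_text(' ', strip=True).split())
        # Headline shape: 28-200 characters and at least 5 words.
        if not (28 <= len(text) <= 200) or len(text.split()) < 5:
            continue
        # Resolve to an absolute URL and drop any #fragment.
        href = urljoin(base_url, a['href']).split('#')[0]
        # ...same-host + article-like URL checks, dedupe...
        out.append({'title': text, 'url': href})
    return out[:max_items]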
No selectors means there's nothing site-specific to break. When a source redesigns, the heuristic keeps finding headlines. And if a source ever returns zero items, the actor flags it instead of silently shipping an empty dataset.
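The snippet above elides the same-host check, the article-like URL check, and the dedupe step, and it doesn't show how an empty result gets flagged. Here is a minimal sketch of what those pieces could look like. Everything in it is an assumption for illustration: the helper names (looks_like_article, dedupe, flag_if_empty), the slug rule, and the use of a log warning as the "flag" are hypothetical, not the code from the repo.

```python
import logging
import re
from urllib.parse import urlparse

log = logging.getLogger("scraper")  # hypothetical logger name

# Assumed rule: article slugs usually contain at least three hyphenated words.
_SLUG = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+){2,}", re.I)


def _host(url):
    """Lower-cased hostname with a leading 'www.' removed."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def looks_like_article(url, base_url):
    """Same host as the source, and a path that is deep or slug-like."""
    if _host(url) != _host(base_url):
        return False
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    return len(segments) >= 2 or bool(_SLUG.search(path))


def dedupe(items):
    """Keep the first item seen for each URL, preserving order."""
    seen, out = set(), []
    for item in items:
        if item["url"] not in seen:
            seen.add(item["url"])
            out.append(item)
    return out


def flag_if_empty(source, items):
    """Log a warning and return True when a source produced no items."""
    if not items:
        log.warning("source %s returned zero headlines", source)
        return True
    return False
```

The thresholds are deliberately loose. Because nothing here depends on a class name or DOM position, a redesign only matters if it changes the URL scheme itself, and the empty-result flag is there to catch that case.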
2. Compliance enforced in code, not in a disclaimer
Before fetching anything, each actor checks the host's robots.txt at runtime and only proceeds if its user-agent is allowed. It fails closed: any response other than a 200 or a 404 counts as blocked:
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


async def robots_allows(client, url, ua):
    p = urlparse(url)
    r = await client.get(f'{p.scheme}://{p.netloc}/robots.txt')
    if r.status_code == 404:  # no robots.txt = no restriction (RFC 9309)
        return True
    if r.status_code != 200:  # anything weird = blocked
        return False
    rp = RobotFileParser()
    rp.parse((r.text or '').splitlines())
    return rp.can_fetch(ua, url)
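One way to exercise this policy without touching the network is to separate the decision from the HTTP call. The sketch below uses only the standard library; robots_decision is a hypothetical helper name, and passing None for a failed request (timeout, DNS, TLS) is an assumed convention rather than something the actor is known to do.

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def robots_decision(status_code: Optional[int], body: str, ua: str, url: str) -> bool:
    """Return True only when the robots.txt response clearly permits the fetch."""
    if status_code is None:   # assumed convention: the request itself failed
        return False
    if status_code == 404:    # no robots.txt published: treated as unrestricted
        return True
    if status_code != 200:    # 403, 5xx, anything unexpected: fail closed
        return False
    rp = RobotFileParser()
    rp.parse(body.splitlines())
    return rp.can_fetch(ua, url)


# Example rules: everything is allowed except /private/.
rules = "User-agent: *\nDisallow: /private/\n"
allowed = robots_decision(200, rules, "my-bot", "https://example.com/news/a")
blocked = robots_decision(200, rules, "my-bot", "https://example.com/private/x")
```

With the decision isolated like this, the async wrapper only has to map a transport exception to None, which keeps the fail-closed behavior in one place.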
Only public pages are fetched, and no personal data is ever collected. Output is uniform structured JSON (title, url, source, category, fetched_at) that drops straight into a model or pipeline.
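To make the record shape concrete, here is a minimal sketch of one output item built from the five fields named above. The NewsItem class, the make_item helper, and the ISO 8601 UTC format for fetched_at are assumptions for illustration; the actors' actual types and timestamp format may differ.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class NewsItem:  # hypothetical name
    title: str
    url: str
    source: str
    category: str
    fetched_at: str  # assumed format: ISO 8601 timestamp in UTC


def make_item(title, url, source, category, now=None):
    """Build one record; `now` is injectable so the timestamp can be tested."""
    now = now or datetime.now(timezone.utc)
    return NewsItem(title, url, source, category, now.isoformat())


def to_json(item):
    """Serialize a record with its fields in declaration order."""
    return json.dumps(asdict(item))
```

Keeping every source on the same five keys means a downstream consumer never has to branch on which actor produced a row.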
The result
Same ~80-line core, three curated source lists, three actors:
- Commodity & Energy News Scraper — oil, gas, gold, silver, uranium
- Crypto & DeFi News Scraper — Bitcoin, markets, policy, on-chain
- AI & ML News & Papers Scraper — HuggingFace papers, lab blogs, AI press
All the code is on GitHub: Casterdly/compliant-scrapers.
Why "compliant" turned out to be the feature
I almost skipped the robots check — it's tempting to just scrape and move on. But "we collected this in a grey area" is a non-answer for anyone in a regulated, enterprise, or agency setting. Making compliance a guarantee enforced in code turned a boring scraper into something a cautious team can actually run. Sometimes the constraint is the product.
Got a public data source you'd want as a clean, compliant feed? Each actor takes a sources list — bring your own, or open an issue on the repo.