After building 77 scrapers for production use, I realized I follow the same 21 steps every time. This is the checklist I give to every developer on my team.
## Before You Write Any Code
- [ ] 1. Check for an official API. 60% of "scraping" projects don't need scraping at all. Check the site's `/api/` path, its developer docs, or look for `application/json` responses in DevTools.
- [ ] 2. Check robots.txt. Visit `example.com/robots.txt`. If your target path is under a `Disallow` rule, proceed with caution.
- [ ] 3. Read the Terms of Service. Search for "scraping", "automated", and "bot". Some sites explicitly prohibit it.
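Step 2 doesn't have to be a manual eyeball check: the standard library ships `urllib.robotparser`. A minimal sketch, parsing hypothetical rules from a string (in a real run you'd call `set_url()` and `read()` against the live file):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration; fetch the real one in practice
rules = """\
User-agent: *
Disallow: /admin/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

print(parser.can_fetch("MyScraper/1.0", "https://example.com/products"))     # True
print(parser.can_fetch("MyScraper/1.0", "https://example.com/admin/users"))  # False
```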
- [ ] 4. Check if the data is available elsewhere. Common Crawl, the Wayback Machine, or public datasets (data.gov, Kaggle) might already have what you need.
- [ ] 5. Decide: HTTP client or browser? If the page works with JavaScript disabled → use `httpx`/`requests`. If not → use Playwright.
## Writing the Scraper
- [ ] 6. Start with one page. Get it working perfectly for one URL before scaling.
- [ ] 7. Use CSS selectors, not XPath. CSS is simpler and covers 95% of cases. Reach for XPath only when you need parent or ancestor traversal.
- [ ] 8. Extract to a schema. Define your output format upfront:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Product:
    name: str
    price: float
    url: str
    scraped_at: datetime
```
- [ ] 9. Handle missing data gracefully. Every `querySelector` can return `None`. Every price can be "Out of Stock".
- [ ] 10. Add rate limiting. 1 request/second is safe for most sites. Use `time.sleep(1)` or Crawlee's built-in throttling.
- [ ] 11. Rotate User-Agents. At minimum, set a realistic `User-Agent` header. Better: rotate from a list of 10+ real browser UAs.
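UA rotation is just `random.choice` over a pool. A sketch with a three-entry sample pool (keep 10+ current, real UA strings in production; these three are examples, not a maintained list):

```python
import random

# Sample pool; rotate 10+ real, current browser UA strings in production
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.1 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]

def random_headers() -> dict:
    """Build request headers with a randomly chosen User-Agent."""
    return {"User-Agent": random.choice(USER_AGENTS)}

headers = random_headers()
print(headers["User-Agent"])
```

Pass `headers=random_headers()` on each request so consecutive pages don't share one fingerprint.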
## Making It Reliable
- [ ] 12. Add retries with exponential backoff.

```python
import time

import httpx

url = "https://example.com"

for attempt in range(3):
    try:
        response = httpx.get(url, timeout=30)
        break
    except httpx.TimeoutException:
        time.sleep(2 ** attempt)  # back off: 1s, then 2s, then 4s
```
- [ ] 13. Log everything. URL, status code, items found, errors. You'll thank yourself when debugging at 3 AM.
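One structured line per page is enough. A stdlib `logging` sketch; the `page_summary` helper and its key=value format are my own convention, not anything the checklist mandates:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("scraper")

def page_summary(url: str, status: int, items: int) -> str:
    """One greppable key=value line per page fetched."""
    return f"url={url} status={status} items={items}"

summary = page_summary("https://example.com/page/1", 200, 24)
log.info(summary)
print(summary)  # url=https://example.com/page/1 status=200 items=24
```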
- [ ] 14. Save raw HTML. Before parsing, save the raw response. When your selectors break, you can re-parse without re-scraping.
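A simple way to do this is to key each saved body by a hash of its URL, so repeat fetches overwrite rather than pile up. A sketch (the `save_raw` helper and directory layout are illustrative, not from the checklist):

```python
import hashlib
import tempfile
from pathlib import Path

def save_raw(html: str, url: str, out_dir: Path) -> Path:
    """Save the raw response body, keyed by a hash of the URL."""
    out_dir.mkdir(parents=True, exist_ok=True)
    name = hashlib.sha256(url.encode()).hexdigest()[:16] + ".html"
    path = out_dir / name
    path.write_text(html, encoding="utf-8")
    return path

# Demo against a throwaway temp directory
raw_dir = Path(tempfile.mkdtemp())
saved = save_raw("<html>...</html>", "https://example.com/p/1", raw_dir)
print(saved.read_text())  # <html>...</html>
```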
- [ ] 15. Dedup by URL or unique ID. Use SQLite's `UNIQUE` constraint or a set of seen URLs.
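The `UNIQUE` constraint plus `INSERT OR IGNORE` handles dedup with no application logic at all. A minimal in-memory sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        url  TEXT UNIQUE,  -- the UNIQUE constraint does the dedup
        name TEXT
    )
""")

for url, name in [
    ("https://example.com/p/1", "Widget"),
    ("https://example.com/p/2", "Gadget"),
    ("https://example.com/p/1", "Widget"),  # duplicate, silently skipped
]:
    conn.execute("INSERT OR IGNORE INTO products VALUES (?, ?)", (url, name))

count = conn.execute("SELECT COUNT(*) FROM products").fetchone()[0]
print(count)  # 2
```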
## Storing Results
- [ ] 16. Use SQLite for anything above 1,000 items. JSON files become unmanageable fast. SQLite is built into Python.
- [ ] 17. Include metadata. Every record needs: source URL, scrape timestamp, scraper version.
- [ ] 18. Validate output. Assert expected fields exist. Assert prices are positive. Assert dates are recent.
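The three asserts in step 18 fit in one small gate. A sketch, assuming records are plain dicts with the schema from step 8 (the `validate` helper is illustrative):

```python
from datetime import datetime, timedelta

def validate(record: dict) -> None:
    """Fail fast on obviously broken records before they hit storage."""
    for field in ("name", "price", "url", "scraped_at"):
        assert field in record, f"missing field: {field}"
    assert record["price"] > 0, f"non-positive price: {record['price']}"
    assert datetime.now() - record["scraped_at"] < timedelta(days=1), "stale timestamp"

validate({
    "name": "Widget",
    "price": 19.99,
    "url": "https://example.com/p/1",
    "scraped_at": datetime.now(),
})
print("ok")
```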
## Deploying
- [ ] 19. Dockerize. Your scraper should run identically on your laptop and in production. Pin browser versions.
- [ ] 20. Schedule, don't run manually. Use GitHub Actions (free), cron, or Apify schedules.
- [ ] 21. Monitor. Set up alerts for:
  - Scraper didn't run (schedule failed)
  - Zero results (site changed)
  - Result count dropped >50% (partial failure)
  - New errors in logs
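The two count-based alerts above reduce to one comparison against the previous run. A sketch (the `should_alert` helper and the 50% threshold as a hard-coded constant are my own framing):

```python
def should_alert(current: int, previous: int) -> bool:
    """Alert on zero results, or a >50% drop from the previous run."""
    if current == 0:
        return True  # site probably changed its markup
    return previous > 0 and current < previous * 0.5

print(should_alert(0, 120))    # True  - zero results
print(should_alert(50, 120))   # True  - dropped more than half
print(should_alert(110, 120))  # False - normal variance
```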
## Quick Reference: Which Tool for What
| Situation | Tool |
|---|---|
| Site has JSON API | `httpx` or `curl_cffi` |
| Static HTML | `httpx` + `selectolax` |
| JS-rendered content | Playwright |
| Anti-bot protection | `curl_cffi` + stealth |
| 10K+ pages | Scrapy or Crawlee |
| Scheduled runs | GitHub Actions or Apify |
| Data storage | SQLite (small) or PostgreSQL (team) |
Full tools list: awesome-web-scraping-2026 (130+ tools)
Starter template: python-web-scraping-starter
What steps would you add to this checklist? What did I miss? 👇
More from me: 10 Dev Tools I Use Daily | 77 Scrapers on a Schedule | 150+ Free APIs