Alex Spinov

The Web Scraping Checklist I Wish I Had When I Started (21 Steps)

After building 77 scrapers for production use, I realized I follow the same 21 steps every time. This is the checklist I give to every developer on my team.


Before You Write Any Code

  • [ ] 1. Check for an official API. 60% of 'scraping' projects don't need scraping at all. Check the site's /api/, developer docs, or look for application/json responses in DevTools.

  • [ ] 2. Check robots.txt. Visit example.com/robots.txt. If your target path is disallowed, proceed with caution.

  • [ ] 3. Read the Terms of Service. Search for "scraping", "automated", "bot". Some sites explicitly prohibit it.

  • [ ] 4. Check if the data is available elsewhere. Common Crawl, Wayback Machine, public datasets (data.gov, Kaggle) might already have what you need.

  • [ ] 5. Decide: HTTP client or browser? If the page works with JavaScript disabled → use httpx/requests. If not → use Playwright.


Writing the Scraper

  • [ ] 6. Start with one page. Get it working perfectly for one URL before scaling.

  • [ ] 7. Use CSS selectors, not XPath. CSS is simpler and covers 95% of cases. Reach for XPath only when you need parent/ancestor axes or text-content matching, which CSS selectors can't express.

  • [ ] 8. Extract to a schema. Define your output format upfront:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class Product:
    name: str
    price: float
    url: str
    scraped_at: datetime
  • [ ] 9. Handle missing data gracefully. Every querySelector can return None. Every price can be "Out of Stock".
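A defensive parsing helper along these lines keeps one bad field from crashing the whole run. The `parse_price` name and its cleanup rules are my own illustration, not any library's API:

```python
from typing import Optional

def parse_price(raw: Optional[str]) -> Optional[float]:
    """Return a float price, or None for missing or non-numeric
    values like 'Out of Stock'. Never raises on bad input."""
    if raw is None:
        return None
    cleaned = raw.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None
```

Returning None instead of raising lets the caller decide whether a missing price is a skip, a warning, or a hard failure.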

  • [ ] 10. Add rate limiting. 1 request/second is safe for most sites. Use time.sleep(1) or Crawlee's built-in throttling.
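If a bare time.sleep(1) gets awkward once you have multiple call sites, the same idea can be sketched as a small reusable class. `RateLimiter` is a hypothetical helper of mine, not a Crawlee API:

```python
import time

class RateLimiter:
    """Minimal sketch: block so that successive calls are at least
    min_interval seconds apart (defaults to 1 request/second)."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = 0.0  # monotonic timestamp of the previous call

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Call `limiter.wait()` right before each request; the first call goes through immediately, later ones are spaced out.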

  • [ ] 11. Rotate User-Agents. At minimum, set a realistic User-Agent header. Better: rotate from a list of 10+ real browser UAs.
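A minimal rotation sketch. The two UA strings below are placeholders to keep the example short; in practice keep a longer list of current real-browser strings, per the step above:

```python
import random

# Placeholder list -- maintain 10+ current browser UA strings in practice.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
]

def random_headers() -> dict:
    """Build headers for one request, picking a fresh UA each call."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

Pass the result as `headers=random_headers()` on each request so no two requests are forced to share one identity.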


Making It Reliable

  • [ ] 12. Add retries with exponential backoff.
import time
import httpx

for attempt in range(3):
    try:
        response = httpx.get(url, timeout=30)
        break
    except httpx.TimeoutException:
        if attempt == 2:
            raise  # out of retries -- don't leave response undefined
        time.sleep(2 ** attempt)
  • [ ] 13. Log everything. URL, status code, items found, errors. You'll thank yourself when debugging at 3 AM.

  • [ ] 14. Save raw HTML. Before parsing, save the raw response. When your selectors break, you can re-parse without re-scraping.
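One way to sketch this with just the standard library; the `raw_html` directory and the `save_raw` helper name are my own choices, not a convention:

```python
import hashlib
from pathlib import Path

RAW_DIR = Path("raw_html")  # assumption: a local cache directory

def save_raw(url: str, html: str) -> Path:
    """Cache the raw response under a hash of its URL, so a later
    selector fix can re-parse the page without re-fetching it."""
    RAW_DIR.mkdir(exist_ok=True)
    name = hashlib.sha256(url.encode()).hexdigest() + ".html"
    path = RAW_DIR / name
    path.write_text(html, encoding="utf-8")
    return path
```

Hashing the URL gives stable, filesystem-safe filenames, and calling this before any parsing means even pages your parser chokes on are preserved for debugging.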

  • [ ] 15. Dedup by URL or unique ID. Use SQLite's UNIQUE constraint or a set of seen URLs.
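A sketch of the SQLite UNIQUE approach; table and function names here are illustrative, and in production you'd pass a file path instead of :memory::

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path in production
conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT UNIQUE)")

def mark_seen(url: str) -> bool:
    """Record the URL; return True if it was new, False if a duplicate.
    The UNIQUE constraint does the dedup work for us."""
    try:
        with conn:
            conn.execute("INSERT INTO pages (url) VALUES (?)", (url,))
        return True
    except sqlite3.IntegrityError:
        return False
```

Unlike an in-memory set of seen URLs, this survives restarts (with a file-backed database) and costs almost nothing to query.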


Storing Results

  • [ ] 16. Use SQLite for anything above 1,000 items. JSON files become unmanageable fast. SQLite is built into Python.

  • [ ] 17. Include metadata. Every record needs: source URL, scrape timestamp, scraper version.

  • [ ] 18. Validate output. Assert expected fields exist. Assert prices are positive. Assert dates are recent.
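Step 18 can be sketched as a validator that collects problems instead of raising on the first bad assert, which is handy when you want one log line per bad record. Field names mirror the Product schema earlier in the post; adjust to your own:

```python
def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field in ("name", "price", "url", "scraped_at"):
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    price = record.get("price")
    if isinstance(price, (int, float)) and price <= 0:
        problems.append("non-positive price")
    return problems
```

Run it on every record before insert; a record with a non-empty problem list gets logged and quarantined rather than silently stored.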


Deploying

  • [ ] 19. Dockerize. Your scraper should run identically on your laptop and in production. Pin browser versions.

  • [ ] 20. Schedule, don't run manually. Use GitHub Actions (free), cron, or Apify schedules.

  • [ ] 21. Monitor. Set up alerts for:

    • Scraper didn't run (schedule failed)
    • Zero results (site changed)
    • Result count dropped >50% (partial failure)
    • New errors in logs
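The zero-results and count-drop alerts above reduce to a pure function you can call at the end of each run. `should_alert` is a hypothetical name; the 50% threshold is the one from the checklist:

```python
def should_alert(previous_count: int, current_count: int) -> bool:
    """Flag a run whose item count dropped more than 50% versus the
    last successful run -- a likely partial failure or site change."""
    if current_count == 0:
        return True   # zero results: the site layout probably changed
    if previous_count == 0:
        return False  # no baseline yet, nothing to compare against
    return current_count < previous_count * 0.5
```

Wire the return value to whatever notification channel you already use (email, Slack webhook, GitHub Actions failure).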

Quick Reference: Which Tool for What

Situation              Tool
---------              ----
Site has JSON API      httpx or curl_cffi
Static HTML            httpx + selectolax
JS-rendered content    Playwright
Anti-bot protection    curl_cffi + stealth
10K+ pages             Scrapy or Crawlee
Scheduled runs         GitHub Actions or Apify
Data storage           SQLite (small) or PostgreSQL (team)

Full tools list: awesome-web-scraping-2026 (130+ tools)

Starter template: python-web-scraping-starter

What steps would you add to this checklist? What did I miss? 👇


More from me: 10 Dev Tools I Use Daily | 77 Scrapers on a Schedule | 150+ Free APIs
