Alex Spinov

The Web Scraping Checklist I Wish I Had When I Started (21 Steps)

After building 77 scrapers for production use, I realized I follow the same 21 steps every time. This is the checklist I give to every developer on my team.


Before You Write Any Code

  • [ ] 1. Check for an official API. 60% of 'scraping' projects don't need scraping at all. Check the site's /api/, developer docs, or look for application/json responses in DevTools.

  • [ ] 2. Check robots.txt. Visit example.com/robots.txt. If your target path is disallowed, proceed with caution.

  • [ ] 3. Read the Terms of Service. Search for "scraping", "automated", "bot". Some sites explicitly prohibit it.

  • [ ] 4. Check if the data is available elsewhere. Common Crawl, Wayback Machine, public datasets (data.gov, Kaggle) might already have what you need.

  • [ ] 5. Decide: HTTP client or browser? If the page works with JavaScript disabled → use httpx/requests. If not → use Playwright.


Writing the Scraper

  • [ ] 6. Start with one page. Get it working perfectly for one URL before scaling.

  • [ ] 7. Use CSS selectors, not XPath. CSS is simpler and covers 95% of cases. Reach for XPath only when you need parent/ancestor axes or text-content matching, which CSS selectors can't express.

  • [ ] 8. Extract to a schema. Define your output format upfront:

from dataclasses import dataclass
from datetime import datetime

@dataclass
class Product:
    name: str
    price: float
    url: str
    scraped_at: datetime
  • [ ] 9. Handle missing data gracefully. Every querySelector can return None. Every price can be "Out of Stock".
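A defensive parsing helper along these lines keeps one bad field from crashing the whole run. The `parse_price` name and its cleanup rules are my own illustration, not any library's API:

```python
from typing import Optional

def parse_price(raw: Optional[str]) -> Optional[float]:
    """Return a float price, or None for missing or non-numeric
    values like 'Out of Stock'. Never raises on bad input."""
    if raw is None:
        return None
    cleaned = raw.replace("$", "").replace(",", "").strip()
    try:
        return float(cleaned)
    except ValueError:
        return None
```

Returning None instead of raising lets the caller decide whether a missing price is a skip, a warning, or a hard failure.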

  • [ ] 10. Add rate limiting. 1 request/second is safe for most sites. Use time.sleep(1) or Crawlee's built-in throttling.
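If a bare time.sleep(1) gets awkward once you have multiple call sites, the same idea can be sketched as a small reusable class. `RateLimiter` is a hypothetical helper of mine, not a Crawlee API:

```python
import time

class RateLimiter:
    """Minimal sketch: block so that successive calls are at least
    min_interval seconds apart (defaults to 1 request/second)."""

    def __init__(self, min_interval: float = 1.0):
        self.min_interval = min_interval
        self._last = 0.0  # monotonic timestamp of the previous call

    def wait(self) -> None:
        elapsed = time.monotonic() - self._last
        if elapsed < self.min_interval:
            time.sleep(self.min_interval - elapsed)
        self._last = time.monotonic()
```

Call `limiter.wait()` right before each request; the first call goes through immediately, later ones are spaced out.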

  • [ ] 11. Rotate User-Agents. At minimum, set a realistic User-Agent header. Better: rotate from a list of 10+ real browser UAs.
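A minimal rotation sketch. The two UA strings below are placeholders to keep the example short; in practice keep a longer list of current real-browser strings, per the step above:

```python
import random

# Placeholder list -- maintain 10+ current browser UA strings in practice.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.1 Safari/605.1.15",
]

def random_headers() -> dict:
    """Build headers for one request, picking a fresh UA each call."""
    return {"User-Agent": random.choice(USER_AGENTS)}
```

Pass the result as `headers=random_headers()` on each request so no two requests are forced to share one identity.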


Making It Reliable

  • [ ] 12. Add retries with exponential backoff.
import time
import httpx

for attempt in range(3):
    try:
        response = httpx.get(url, timeout=30)
        break
    except httpx.TimeoutException:
        if attempt == 2:
            raise  # out of retries -- don't leave response undefined
        time.sleep(2 ** attempt)
  • [ ] 13. Log everything. URL, status code, items found, errors. You'll thank yourself when debugging at 3 AM.

  • [ ] 14. Save raw HTML. Before parsing, save the raw response. When your selectors break, you can re-parse without re-scraping.
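One way to sketch this with just the standard library; the `raw_html` directory and the `save_raw` helper name are my own choices, not a convention:

```python
import hashlib
from pathlib import Path

RAW_DIR = Path("raw_html")  # assumption: a local cache directory

def save_raw(url: str, html: str) -> Path:
    """Cache the raw response under a hash of its URL, so a later
    selector fix can re-parse the page without re-fetching it."""
    RAW_DIR.mkdir(exist_ok=True)
    name = hashlib.sha256(url.encode()).hexdigest() + ".html"
    path = RAW_DIR / name
    path.write_text(html, encoding="utf-8")
    return path
```

Hashing the URL gives stable, filesystem-safe filenames, and calling this before any parsing means even pages your parser chokes on are preserved for debugging.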

  • [ ] 15. Dedup by URL or unique ID. Use SQLite's UNIQUE constraint or a set of seen URLs.
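A sketch of the SQLite UNIQUE approach; table and function names here are illustrative, and in production you'd pass a file path instead of :memory::

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path in production
conn.execute("CREATE TABLE IF NOT EXISTS pages (url TEXT UNIQUE)")

def mark_seen(url: str) -> bool:
    """Record the URL; return True if it was new, False if a duplicate.
    The UNIQUE constraint does the dedup work for us."""
    try:
        with conn:
            conn.execute("INSERT INTO pages (url) VALUES (?)", (url,))
        return True
    except sqlite3.IntegrityError:
        return False
```

Unlike an in-memory set of seen URLs, this survives restarts (with a file-backed database) and costs almost nothing to query.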


Storing Results

  • [ ] 16. Use SQLite for anything above 1,000 items. JSON files become unmanageable fast. SQLite is built into Python.

  • [ ] 17. Include metadata. Every record needs: source URL, scrape timestamp, scraper version.

  • [ ] 18. Validate output. Assert expected fields exist. Assert prices are positive. Assert dates are recent.
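Step 18 can be sketched as a validator that collects problems instead of raising on the first bad assert, which is handy when you want one log line per bad record. Field names mirror the Product schema earlier in the post; adjust to your own:

```python
def validate(record: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field in ("name", "price", "url", "scraped_at"):
        if record.get(field) in (None, ""):
            problems.append(f"missing {field}")
    price = record.get("price")
    if isinstance(price, (int, float)) and price <= 0:
        problems.append("non-positive price")
    return problems
```

Run it on every record before insert; a record with a non-empty problem list gets logged and quarantined rather than silently stored.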


Deploying

  • [ ] 19. Dockerize. Your scraper should run identically on your laptop and in production. Pin browser versions.

  • [ ] 20. Schedule, don't run manually. Use GitHub Actions (free), cron, or Apify schedules.

  • [ ] 21. Monitor. Set up alerts for:

    • Scraper didn't run (schedule failed)
    • Zero results (site changed)
    • Result count dropped >50% (partial failure)
    • New errors in logs
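The zero-results and count-drop alerts above reduce to a pure function you can call at the end of each run. `should_alert` is a hypothetical name; the 50% threshold is the one from the checklist:

```python
def should_alert(previous_count: int, current_count: int) -> bool:
    """Flag a run whose item count dropped more than 50% versus the
    last successful run -- a likely partial failure or site change."""
    if current_count == 0:
        return True   # zero results: the site layout probably changed
    if previous_count == 0:
        return False  # no baseline yet, nothing to compare against
    return current_count < previous_count * 0.5
```

Wire the return value to whatever notification channel you already use (email, Slack webhook, GitHub Actions failure).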

Quick Reference: Which Tool for What

Situation              Tool
---------              ----
Site has JSON API      httpx or curl_cffi
Static HTML            httpx + selectolax
JS-rendered content    Playwright
Anti-bot protection    curl_cffi + stealth
10K+ pages             Scrapy or Crawlee
Scheduled runs         GitHub Actions or Apify
Data storage           SQLite (small) or PostgreSQL (team)

Full tools list: awesome-web-scraping-2026 (130+ tools)

Starter template: python-web-scraping-starter

What steps would you add to this checklist? What did I miss? 👇


More from me: 10 Dev Tools I Use Daily | 77 Scrapers on a Schedule | 150+ Free APIs
