Every scraper I build follows the same four-step process, and it works for almost any website.
Step 1: Check for a JSON API (5 min)
Open browser DevTools → Network → XHR. Browse the page. Look for JSON responses.
If found → use that endpoint. You're done. No HTML parsing needed.
Examples: Reddit (.json), YouTube (Innertube), HN (Firebase)
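The Reddit case can be sketched in a few lines. The `.json` suffix trick is real, but the exact URL shape and the helper names below are illustrative assumptions — always confirm the endpoint in the Network tab first:

```javascript
// Sketch: many sites serve the same data as JSON at a predictable URL.
// Reddit is the classic case — append ".json" to almost any listing URL.

function buildJsonEndpoint(pageUrl) {
  // Strip a trailing slash, then append the .json suffix Reddit accepts.
  return pageUrl.replace(/\/$/, '') + '.json';
}

async function fetchListing(pageUrl) {
  // Node 18+ has fetch built in; Reddit rejects requests with a blank UA.
  const res = await fetch(buildJsonEndpoint(pageUrl), {
    headers: { 'User-Agent': 'demo-scraper/0.1' },
  });
  if (!res.ok) throw new Error(`HTTP ${res.status}`);
  return res.json(); // already structured — no HTML parsing needed
}
```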
Step 2: Check for RSS/Atom (2 min)
Look for <link rel="alternate" type="application/rss+xml"> in the page source.
If found → parse XML. Done.
Example: Google News
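Feed discovery can run on the raw page source. A minimal sketch (a real scraper would use a proper HTML parser; a regex is enough to show the idea on well-formed source):

```javascript
// Sketch: find RSS/Atom feed URLs declared via <link rel="alternate"> tags.

function findFeedUrls(html) {
  const feeds = [];
  const linkTags = html.match(/<link\b[^>]*>/gi) || [];
  for (const tag of linkTags) {
    if (/rel=["']alternate["']/i.test(tag) &&
        /type=["']application\/(rss|atom)\+xml["']/i.test(tag)) {
      const href = tag.match(/href=["']([^"']+)["']/i);
      if (href) feeds.push(href[1]); // may be relative — resolve against the page URL
    }
  }
  return feeds;
}
```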
Step 3: Check for JSON-LD (2 min)
Search for <script type="application/ld+json"> in page source.
If found → parse JSON. Contains structured product/review/organization data.
Example: Trustpilot
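Extracting JSON-LD is a plain string-and-parse job — no browser needed when the data is server-rendered. A sketch:

```javascript
// Sketch: pull structured data out of <script type="application/ld+json"> blocks.

function extractJsonLd(html) {
  const re = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  const blocks = [];
  let m;
  while ((m = re.exec(html)) !== null) {
    try {
      blocks.push(JSON.parse(m[1]));
    } catch {
      // skip malformed blocks rather than failing the whole page
    }
  }
  return blocks;
}
```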
Step 4: Last Resort — HTML Parsing
Only if steps 1-3 fail. Use Cheerio for static HTML (fast) or Playwright when the page needs JavaScript rendering.
This is the MOST COMMON approach but should be your LAST choice.
Apply This to Any Website
- DevTools → Network → look for JSON
- View source → search for RSS
- View source → search for JSON-LD
- Only then → Cheerio/Playwright
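The checklist above can be sketched as a single triage function over a page's source. (Step 1, the Network-tab check, happens in DevTools and can't be detected from source alone, so this starts at step 2; the return labels are my own naming.)

```javascript
// Sketch: given raw page source, report which extraction route is available.

function triage(html) {
  if (/<link\b[^>]*type=["']application\/(rss|atom)\+xml["']/i.test(html)) {
    return 'rss';     // step 2: parse the feed
  }
  if (/<script\b[^>]*type=["']application\/ld\+json["']/i.test(html)) {
    return 'json-ld'; // step 3: parse embedded structured data
  }
  return 'html';      // step 4: fall back to Cheerio/Playwright
}
```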
All 77 scrapers built with this method: GitHub
Custom scraping — $20: Order via Payoneer