Last month I scraped 10,000 Reddit posts across 50 subreddits to answer one question: What is the most reliable way to scrape in 2026?
Not hypothetically. I actually ran 200+ scraping sessions, tested 4 different approaches, and tracked what broke and what survived.
Here are my results.
The 4 Approaches I Tested
1. HTML Parsing (BeautifulSoup + Requests)
The classic approach. Fetch the HTML the server returns, then extract fields with CSS selectors.
Result: Broke 3 times in 2 weeks when the site changed its HTML. Unreliable.
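To show why selector-based extraction is fragile, here is a minimal sketch. It uses the standard library's `html.parser` instead of BeautifulSoup so it runs without dependencies, and the markup, the class names, and the `extract_titles` helper are all hypothetical. The point is only that a renamed class makes the extractor return nothing without raising an error.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text of every <a> tag carrying a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.titles = []
        self._inside = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "a" and self.css_class in classes:
            self._inside = True
            self.titles.append("")

    def handle_data(self, data):
        if self._inside:
            self.titles[-1] += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._inside = False


def extract_titles(html, css_class="title"):
    """Return the stripped text of all matching links (hypothetical helper)."""
    parser = TitleExtractor(css_class)
    parser.feed(html)
    return [t.strip() for t in parser.titles]


# Hypothetical markup before and after a redesign renames the class.
old_html = '<div><a class="title" href="/p/1">First post</a></div>'
new_html = '<div><a class="post-title" href="/p/1">First post</a></div>'

print(extract_titles(old_html))  # ['First post']
print(extract_titles(new_html))  # [] (the selector silently matches nothing)
```

A silent empty result is the dangerous case: the scraper keeps running and simply stops collecting data.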
2. JSON API Endpoints
Many sites expose JSON APIs alongside their HTML pages. Reddit has /r/subreddit.json.
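import requests

# Reddit serves the same listing as JSON when ".json" is appended to the path.
url = "https://old.reddit.com/r/programming/top.json?t=month&limit=100"
response = requests.get(url, headers={"User-Agent": "DataBot/1.0"}, timeout=10)
response.raise_for_status()  # fail loudly on 429/5xx instead of parsing an error body

posts = response.json()["data"]["children"]
for post in posts:
    d = post["data"]
    print(f'[{d["score"]}] {d["title"]}')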
Result: Zero breakages in 30 days. The JSON format hasn't changed in years.
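A single listing request returns at most 100 posts, so collecting thousands means paging. The sketch below follows the `after` cursor that each listing response carries. The `iter_posts` name, the `fetch` parameter, and the 2-second delay are assumptions for illustration, not the code used for the results above.

```python
import json
import time
import urllib.request


def fetch_json(url):
    """Fetch a URL and decode its JSON body."""
    req = urllib.request.Request(url, headers={"User-Agent": "DataBot/1.0"})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)


def iter_posts(subreddit, max_posts=300, fetch=fetch_json, delay=2.0):
    """Yield post dicts, following the listing's `after` cursor page by page."""
    after = None
    seen = 0
    while seen < max_posts:
        url = f"https://old.reddit.com/r/{subreddit}/top.json?t=month&limit=100"
        if after:
            url += f"&after={after}"
        listing = fetch(url)["data"]
        for child in listing["children"]:
            yield child["data"]
            seen += 1
            if seen >= max_posts:
                return
        after = listing.get("after")
        if not after:  # no cursor means this was the last page
            return
        time.sleep(delay)  # hypothetical pause between pages
```

Calling `for post in iter_posts("programming", max_posts=250): ...` would then issue up to three requests. Passing `fetch` as a parameter keeps the paging logic testable without touching the network.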
3. Headless Browser (Playwright)
Full browser rendering. Handles JavaScript-heavy sites.
Result: Works but 10x slower and 5x more expensive. Overkill for data that has a JSON endpoint.
4. Official API (OAuth)
Result: Rate limits are strict, requires app registration, and policies keep changing.
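Strict rate limits are easier to live with if the client reads the limit headers instead of guessing. This is a minimal sketch that assumes the API reports its budget in `X-Ratelimit-Remaining` and `X-Ratelimit-Reset` (seconds until the window resets); check the header names against the API you are actually calling. The `seconds_to_wait` and `throttle` helpers are hypothetical.

```python
import time


def seconds_to_wait(headers, min_remaining=1.0):
    """Return how long to pause before the next request.

    Assumes X-Ratelimit-Remaining / X-Ratelimit-Reset header names;
    verify them against the API in use.
    """
    try:
        remaining = float(headers.get("X-Ratelimit-Remaining", ""))
        reset = float(headers.get("X-Ratelimit-Reset", ""))
    except ValueError:
        return 0.0  # headers missing or malformed: nothing to go on, do not wait
    if remaining > min_remaining:
        return 0.0
    return max(reset, 0.0)


def throttle(headers, sleep=time.sleep):
    """Sleep until the rate-limit window resets if the budget is spent."""
    wait = seconds_to_wait(headers)
    if wait > 0:
        sleep(wait)
    return wait
```

With `requests`, this could be called as `throttle(response.headers)` after each response.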
The Winner
JSON endpoints won by a landslide.
| Approach | Reliability | Speed | Cost | Ease |
|---|---|---|---|---|
| HTML parsing | 2/5 | 4/5 | 5/5 | 3/5 |
| JSON endpoints | 5/5 | 4/5 | 5/5 | 5/5 |
| Headless browser | 4/5 | 2/5 | 2/5 | 3/5 |
| Official API | 3/5 | 3/5 | 4/5 | 2/5 |
Why JSON endpoints win:
- Same data format the site's own mobile app uses
- No authentication required for public data
- Structured response - no HTML parsing needed
- Hasn't changed format in 5+ years
What I Built With This
I turned this into an open-source Reddit scraper that uses the JSON approach. It extracts 20+ fields per post including full comment trees.
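For a sense of what walking a comment tree involves, here is a small sketch. It is not the code from the scraper mentioned above. It assumes the usual Reddit listing shape, where each comment is a child of kind `t1` and its `replies` field is either an empty string or another listing. The `walk_comments` name and the sample data are made up for illustration.

```python
def walk_comments(listing, depth=0):
    """Yield (depth, author, body) for every comment in a Reddit-style listing."""
    for child in listing["data"]["children"]:
        if child["kind"] != "t1":  # skip "more" placeholders and other kinds
            continue
        data = child["data"]
        yield depth, data["author"], data["body"]
        replies = data.get("replies")
        if replies:  # an empty string means the comment has no replies
            yield from walk_comments(replies, depth + 1)


# Hand-written sample shaped like a comments listing (not real data).
sample = {
    "data": {
        "children": [
            {
                "kind": "t1",
                "data": {
                    "author": "alice",
                    "body": "Top-level comment",
                    "replies": {
                        "data": {
                            "children": [
                                {
                                    "kind": "t1",
                                    "data": {"author": "bob", "body": "A reply", "replies": ""},
                                }
                            ]
                        }
                    },
                },
            },
            {"kind": "more", "data": {"children": ["abc123"]}},
            {
                "kind": "t1",
                "data": {"author": "carol", "body": "Another top-level comment", "replies": ""},
            },
        ]
    }
}

for depth, author, body in walk_comments(sample):
    print("  " * depth + f"{author}: {body}")
```

The recursion mirrors the nesting in the response, so reply depth comes for free.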
I also created a Web Scraping Cheatsheet covering Python, JavaScript, CSS selectors, XPath, and anti-detection techniques.
The 3 Rules I Follow Now
1. Always check for JSON endpoints first. Append .json to the URL, check network tab for API calls.
2. Use official APIs only when JSON endpoints don't exist. APIs have rate limits and auth requirements.
3. Headless browsers are the last resort. Only for JavaScript-rendered content with no API alternative.
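The first rule can be partly automated. The sketch below builds the `.json` variant of a page URL and checks whether a `Content-Type` header looks like JSON. Both helper names are hypothetical, and whether a given site answers on the `.json` path is something only a real request can tell you.

```python
from urllib.parse import urlsplit, urlunsplit


def to_json_url(url):
    """Return `url` with `.json` appended to its path, keeping the query string."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    if not path.endswith(".json"):
        path += ".json"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


def looks_like_json(content_type):
    """True if a Content-Type header value indicates a JSON response."""
    return content_type.split(";")[0].strip().lower().endswith("json")


print(to_json_url("https://old.reddit.com/r/programming/top/?t=month"))
# https://old.reddit.com/r/programming/top.json?t=month
```

Appending `.json` to the path rather than to the whole URL matters: naive string concatenation would put it after the query string.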
Sites With Hidden JSON Endpoints
| Site | JSON Endpoint |
|---|---|
| Reddit | reddit.com/r/{sub}.json |
| Hacker News | hacker-news.firebaseio.com/v0/ |
| Wikipedia | en.wikipedia.org/api/rest_v1/ |
| GitHub | api.github.com (60 req/hr free) |
| Stack Overflow | api.stackexchange.com |
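As an example of using one of these endpoints, here is a minimal sketch against the Hacker News API, which lists story IDs at `topstories.json` and serves each story at `item/{id}.json`. The `top_stories` helper and its `fetch` parameter are illustrative names, not part of any library.

```python
import json
import urllib.request

HN_BASE = "https://hacker-news.firebaseio.com/v0"


def fetch_json(url):
    """Fetch a URL and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)


def top_stories(limit=5, fetch=fetch_json):
    """Return (score, title) pairs for the current top stories."""
    ids = fetch(f"{HN_BASE}/topstories.json")[:limit]
    stories = []
    for story_id in ids:
        item = fetch(f"{HN_BASE}/item/{story_id}.json")
        if item:  # deleted items can come back as null
            stories.append((item.get("score", 0), item.get("title", "")))
    return stories
```

Running `for score, title in top_stories(): print(f"[{score}] {title}")` would print the same `[score] title` lines as the Reddit example, with one request per story.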
What Would You Add?
I'm curious - what's your go-to scraping strategy? Have you found other sites with hidden JSON endpoints?
Drop your findings in the comments.