Alex Spinov

Posted on Mar 25 • Edited on Mar 26

I Scraped 10,000 Reddit Posts to Find the Best Web Scraping Strategy in 2026

#python #javascript #webdev #tutorial

Last month I scraped 10,000 Reddit posts across 50 subreddits to answer one question: What is the most reliable way to scrape in 2026?

Not hypothetically. I actually ran 200+ scraping sessions, tested 4 different approaches, and tracked what broke and what survived.

Here are my results.

The 4 Approaches I Tested

1. HTML Parsing (BeautifulSoup + Requests)

The classic approach. Parse the rendered HTML, extract with CSS selectors.

Result: Broke 3 times in 2 weeks when the site changed their HTML. Unreliable.

2. JSON API Endpoints

Many sites expose JSON APIs alongside their HTML pages. Reddit has /r/subreddit.json.

import requests

url = "https://old.reddit.com/r/programming/top.json?t=month&limit=100"
response = requests.get(url, headers={"User-Agent": "DataBot/1.0"})
posts = response.json()["data"]["children"]

for post in posts:
    d = post["data"]
    print(f'[{d["score"]}] {d["title"]}')

Result: Zero breakages in 30 days. The JSON format hasn't changed in years.

3. Headless Browser (Playwright)

Full browser rendering. Handles JavaScript-heavy sites.

Result: Works but 10x slower and 5x more expensive. Overkill for data that has a JSON endpoint.

4. Official API (OAuth)

Result: Rate limits are strict, requires app registration, and policies keep changing.

The Winner

JSON endpoints won by a landslide.

Approach	Reliability	Speed	Cost	Ease
HTML parsing	2/5	4/5	5/5	3/5
JSON endpoints	5/5	4/5	5/5	5/5
Headless browser	4/5	2/5	2/5	3/5
Official API	3/5	3/5	4/5	2/5

Why JSON endpoints win:

Same data format the site's own mobile app uses
No authentication required for public data
Structured response - no parsing needed
Hasn't changed format in 5+ years

What I Built With This

I turned this into an open-source Reddit scraper that uses the JSON approach. It extracts 20+ fields per post including full comment trees.

I also created a Web Scraping Cheatsheet covering Python, JavaScript, CSS selectors, XPath, and anti-detection techniques.

The 3 Rules I Follow Now

Always check for JSON endpoints first. Append .json to the URL, check network tab for API calls.
Use official APIs only when JSON endpoints don't exist. APIs have rate limits and auth requirements.
Headless browsers are the last resort. Only for JavaScript-rendered content with no API alternative.

Sites With Hidden JSON Endpoints

Site	JSON Endpoint
Reddit	`reddit.com/r/{sub}.json`
Hacker News	`hacker-news.firebaseio.com/v0/`
Wikipedia	`en.wikipedia.org/api/rest_v1/`
GitHub	`api.github.com` (60 req/hr free)
Stack Overflow	`api.stackexchange.com`

What Would You Add?

I'm curious - what's your go-to scraping strategy? Have you found other sites with hidden JSON endpoints?

Drop your findings in the comments.

More resources:

Need web scraping or data extraction? I've built 88 production scrapers. Email spinov001@gmail.com — quote in 2 hours. Or try my ready-made Apify actors — no code needed.

DEV Community