DEV Community

Alex Spinov
Alex Spinov

Posted on

I Scraped 10,000 Reddit Posts to Find the Best Web Scraping Strategy in 2026

Last month I scraped 10,000 Reddit posts across 50 subreddits to answer one question: What is the most reliable way to scrape in 2026?

Not hypothetically. I actually ran 200+ scraping sessions, tested 4 different approaches, and tracked what broke and what survived.

Here are my results.

The 4 Approaches I Tested

1. HTML Parsing (BeautifulSoup + Requests)

The classic approach. Parse the rendered HTML, extract with CSS selectors.

Result: Broke 3 times in 2 weeks when the site changed their HTML. Unreliable.

2. JSON API Endpoints

Many sites expose JSON APIs alongside their HTML pages. Reddit has /r/subreddit.json.

import requests

url = "https://old.reddit.com/r/programming/top.json?t=month&limit=100"
response = requests.get(url, headers={"User-Agent": "DataBot/1.0"})
posts = response.json()["data"]["children"]

for post in posts:
    d = post["data"]
    print(f'[{d["score"]}] {d["title"]}')
Enter fullscreen mode Exit fullscreen mode

Result: Zero breakages in 30 days. The JSON format hasn't changed in years.

3. Headless Browser (Playwright)

Full browser rendering. Handles JavaScript-heavy sites.

Result: Works but 10x slower and 5x more expensive. Overkill for data that has a JSON endpoint.

4. Official API (OAuth)

Result: Rate limits are strict, requires app registration, and policies keep changing.

The Winner

JSON endpoints won by a landslide.

Approach Reliability Speed Cost Ease
HTML parsing 2/5 4/5 5/5 3/5
JSON endpoints 5/5 4/5 5/5 5/5
Headless browser 4/5 2/5 2/5 3/5
Official API 3/5 3/5 4/5 2/5

Why JSON endpoints win:

  • Same data format the site's own mobile app uses
  • No authentication required for public data
  • Structured response - no parsing needed
  • Hasn't changed format in 5+ years

What I Built With This

I turned this into an open-source Reddit scraper that uses the JSON approach. It extracts 20+ fields per post including full comment trees.

I also created a Web Scraping Cheatsheet covering Python, JavaScript, CSS selectors, XPath, and anti-detection techniques.

The 3 Rules I Follow Now

  1. Always check for JSON endpoints first. Append .json to the URL, check network tab for API calls.

  2. Use official APIs only when JSON endpoints don't exist. APIs have rate limits and auth requirements.

  3. Headless browsers are the last resort. Only for JavaScript-rendered content with no API alternative.

Sites With Hidden JSON Endpoints

Site JSON Endpoint
Reddit reddit.com/r/{sub}.json
Hacker News hacker-news.firebaseio.com/v0/
Wikipedia en.wikipedia.org/api/rest_v1/
GitHub api.github.com (60 req/hr free)
Stack Overflow api.stackexchange.com

What Would You Add?

I'm curious - what's your go-to scraping strategy? Have you found other sites with hidden JSON endpoints?

Drop your findings in the comments.


More resources:

Top comments (0)