Alex Spinov

Stop Scraping HTML — Use These Hidden JSON APIs Instead

I used to write BeautifulSoup parsers for every website. CSS selectors, XPath, regex on HTML...

Then I discovered that most websites have hidden JSON APIs that return clean, structured data. No parsing needed.

How to Find Hidden APIs

  1. Open Chrome DevTools → Network tab
  2. Filter by XHR/Fetch
  3. Browse the website normally
  4. Watch for JSON responses

You'll be amazed how many sites serve their data through internal APIs.
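Once you've spotted a JSON endpoint in the Network tab, you can right-click it, copy the request, and replay it from Python. A minimal sketch of that replay step (the helper name and default headers below are my own, not from any library):

```python
import requests

# Headers that mimic a browser request; some internal APIs reject bare clients.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)",
    "Accept": "application/json",
}

def fetch_json(url, params=None, headers=None):
    """Replay a discovered endpoint and return its parsed JSON body."""
    merged = {**DEFAULT_HEADERS, **(headers or {})}
    resp = requests.get(url, params=params, headers=merged, timeout=10)
    resp.raise_for_status()  # fail loudly on 4xx/5xx
    return resp.json()

# data = fetch_json("https://example.com/api/items", params={"page": 1})
```

If the replayed request works outside the browser, you've found your data source.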

Real Examples

YouTube (Innertube API)

YouTube's website doesn't scrape its own HTML. It uses an internal API called Innertube:

import requests

resp = requests.post("https://www.youtube.com/youtubei/v1/search", json={
    "context": {"client": {"clientName": "WEB", "clientVersion": "2.20240101.00.00"}},
    "query": "python tutorial"
})

data = resp.json()
# Clean JSON with all video data — no HTML parsing!

No API key. No quotas. Same data YouTube shows you.
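One caveat: Innertube's response nesting is deep and shifts between client versions. Rather than hard-coding a path, I walk the tree and collect every videoRenderer node. The helper below is my own sketch; the videoRenderer, videoId, and title keys are what current responses use, but treat the exact structure as an assumption:

```python
def find_video_renderers(node):
    """Recursively collect every 'videoRenderer' dict from Innertube's
    deeply nested response. A recursive search is sturdier than a
    hard-coded path that breaks when the nesting changes."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "videoRenderer":
                found.append(value)
            else:
                found.extend(find_video_renderers(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_video_renderers(item))
    return found

# for video in find_video_renderers(data):
#     print(video["videoId"], video["title"]["runs"][0]["text"])
```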

Toolkit: youtube-innertube-toolkit

GitHub Trending

GitHub's trending page has no official API, but the official Search API can approximate it with a date-windowed query:

resp = requests.get("https://api.github.com/search/repositories",
    params={"q": "created:>2024-01-01", "sort": "stars", "per_page": 10})

for repo in resp.json()["items"]:
    print(f"{repo['full_name']}: {repo['stargazers_count']} stars")
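To make that query track "trending" instead of a fixed date, compute the created:> window at run time. A small sketch:

```python
from datetime import date, timedelta

# Approximate "trending this week": repos created in the last 7 days,
# sorted by stars. Note the Search API returns at most 1000 results per query.
window = (date.today() - timedelta(days=7)).isoformat()
params = {
    "q": f"created:>{window}",
    "sort": "stars",
    "order": "desc",
    "per_page": 10,
}
```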

npm Registry

Every npm package has a JSON endpoint:

resp = requests.get("https://registry.npmjs.org/react")
data = resp.json()
print(f"react: {data['description']}")
print(f"Latest: {data['dist-tags']['latest']}")
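The registry also has a sibling downloads API at api.npmjs.org. The endpoint is documented by npm; the URL-building helper here is my own:

```python
import requests

def downloads_url(package, period="last-week"):
    """npm downloads endpoint: /downloads/point/{period}/{package}."""
    return f"https://api.npmjs.org/downloads/point/{period}/{package}"

# resp = requests.get(downloads_url("react"))
# stats = resp.json()
# print(f"{stats['package']}: {stats['downloads']:,} downloads last week")
```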

Wikipedia

resp = requests.get("https://en.wikipedia.org/api/rest_v1/page/summary/Python_(programming_language)")
data = resp.json()
print(data["extract"][:200])
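Titles with spaces or parentheses need encoding before they go into the URL path. A small helper (my own, not part of the API) handles that:

```python
from urllib.parse import quote

def summary_url(title):
    """Build the REST summary URL: spaces become underscores,
    everything else gets percent-encoded."""
    return ("https://en.wikipedia.org/api/rest_v1/page/summary/"
            + quote(title.replace(" ", "_")))

# resp = requests.get(summary_url("Python (programming language)"))
```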

Reddit

Add .json to any Reddit URL:

resp = requests.get("https://www.reddit.com/r/python/hot.json",
    headers={"User-Agent": "MyApp/1.0"})

for post in resp.json()["data"]["children"][:5]:
    d = post["data"]
    print(f"{d['score']} pts | {d['title'][:60]}")
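Reddit listings are paginated: each response carries an after token that you pass back to fetch the next page. A sketch of that loop (the generator is my own wrapper around the same .json endpoint):

```python
import requests

def fetch_pages(subreddit, pages=2, limit=25):
    """Yield post dicts from /hot across several pages, following
    the 'after' cursor that each listing response returns."""
    after = None
    for _ in range(pages):
        resp = requests.get(
            f"https://www.reddit.com/r/{subreddit}/hot.json",
            params={"limit": limit, "after": after},
            headers={"User-Agent": "MyApp/1.0"},
        )
        data = resp.json()["data"]
        yield from (child["data"] for child in data["children"])
        after = data["after"]
        if after is None:  # no more pages
            break
```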

The Pattern

Most modern websites are SPAs (Single Page Applications). They:

  1. Load a skeleton HTML
  2. Fetch data from an internal API
  3. Render with JavaScript

That internal API is what you want. It's:

  • Structured — clean JSON, not messy HTML
  • Stable — API changes less often than HTML layout
  • Fast — no need to render JavaScript
  • Complete — often has MORE data than what's shown on the page

JSON API vs HTML Scraping

|                   | JSON API                | HTML Scraping            |
| ----------------- | ----------------------- | ------------------------ |
| Speed             | Fast (small payload)    | Slow (full page)         |
| Reliability       | High (stable structure) | Low (breaks on redesign) |
| Data quality      | Clean, typed            | Messy, needs parsing     |
| JavaScript needed | No                      | Often yes (Playwright)   |
| Maintenance       | Low                     | High                     |

My 77 Apify Scrapers Use This Approach

All 77 of my Apify actors use hidden JSON APIs instead of HTML parsing:

  • YouTube → Innertube API
  • Reddit → .json endpoints
  • npm → Registry API
  • GitHub → REST API

They're API-first: faster, more reliable, and they don't break on redesigns.


What hidden API have you discovered? Share your findings in the comments!

More tools and tutorials: GitHub | Dev.to
