Alex Spinov

Stop Scraping HTML — Use These Hidden JSON APIs Instead

I used to write BeautifulSoup parsers for every website. CSS selectors, XPath, regex on HTML...

Then I discovered that most websites have hidden JSON APIs that return clean, structured data. No parsing needed.

How to Find Hidden APIs

  1. Open Chrome DevTools → Network tab
  2. Filter by XHR/Fetch
  3. Browse the website normally
  4. Watch for JSON responses

You'll be amazed how many sites serve their data through internal APIs.
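Once you've spotted a JSON endpoint in the Network tab, you can right-click it, copy the request, and replay it from Python. A minimal sketch of that replay step (the helper name and default headers below are my own, not from any library):

```python
import requests

# Headers that mimic a browser request; some internal APIs reject bare clients.
DEFAULT_HEADERS = {
    "User-Agent": "Mozilla/5.0 (compatible; MyScraper/1.0)",
    "Accept": "application/json",
}

def fetch_json(url, params=None, headers=None):
    """Replay a discovered endpoint and return its parsed JSON body."""
    merged = {**DEFAULT_HEADERS, **(headers or {})}
    resp = requests.get(url, params=params, headers=merged, timeout=10)
    resp.raise_for_status()  # fail loudly on 4xx/5xx
    return resp.json()

# data = fetch_json("https://example.com/api/items", params={"page": 1})
```

If the replayed request works outside the browser, you've found your data source.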

Real Examples

YouTube (Innertube API)

YouTube's website doesn't scrape its own HTML. It uses an internal API called Innertube:

import requests

resp = requests.post("https://www.youtube.com/youtubei/v1/search", json={
    "context": {"client": {"clientName": "WEB", "clientVersion": "2.20240101.00.00"}},
    "query": "python tutorial"
})

data = resp.json()
# Clean JSON with all video data — no HTML parsing!

No API key. No quotas. Same data YouTube shows you.
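One caveat: Innertube's response nesting is deep and shifts between client versions. Rather than hard-coding a path, I walk the tree and collect every videoRenderer node. The helper below is my own sketch; the videoRenderer, videoId, and title keys are what current responses use, but treat the exact structure as an assumption:

```python
def find_video_renderers(node):
    """Recursively collect every 'videoRenderer' dict from Innertube's
    deeply nested response. A recursive search is sturdier than a
    hard-coded path that breaks when the nesting changes."""
    found = []
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "videoRenderer":
                found.append(value)
            else:
                found.extend(find_video_renderers(value))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_video_renderers(item))
    return found

# for video in find_video_renderers(data):
#     print(video["videoId"], video["title"]["runs"][0]["text"])
```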

Toolkit: youtube-innertube-toolkit

GitHub Trending

GitHub's trending page has no official API, but the official Search API can approximate it with a date-windowed query:

resp = requests.get("https://api.github.com/search/repositories",
    params={"q": "created:>2024-01-01", "sort": "stars", "per_page": 10})

for repo in resp.json()["items"]:
    print(f"{repo['full_name']}: {repo['stargazers_count']} stars")
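To make that query track "trending" instead of a fixed date, compute the created:> window at run time. A small sketch:

```python
from datetime import date, timedelta

# Approximate "trending this week": repos created in the last 7 days,
# sorted by stars. Note the Search API returns at most 1000 results per query.
window = (date.today() - timedelta(days=7)).isoformat()
params = {
    "q": f"created:>{window}",
    "sort": "stars",
    "order": "desc",
    "per_page": 10,
}
```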

npm Registry

Every npm package has a JSON endpoint:

resp = requests.get("https://registry.npmjs.org/react")
data = resp.json()
print(f"react: {data['description']}")
print(f"Latest: {data['dist-tags']['latest']}")
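The registry also has a sibling downloads API at api.npmjs.org. The endpoint is documented by npm; the URL-building helper here is my own:

```python
import requests

def downloads_url(package, period="last-week"):
    """npm downloads endpoint: /downloads/point/{period}/{package}."""
    return f"https://api.npmjs.org/downloads/point/{period}/{package}"

# resp = requests.get(downloads_url("react"))
# stats = resp.json()
# print(f"{stats['package']}: {stats['downloads']:,} downloads last week")
```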

Wikipedia

resp = requests.get("https://en.wikipedia.org/api/rest_v1/page/summary/Python_(programming_language)")
data = resp.json()
print(data["extract"][:200])
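Titles with spaces or parentheses need encoding before they go into the URL path. A small helper (my own, not part of the API) handles that:

```python
from urllib.parse import quote

def summary_url(title):
    """Build the REST summary URL: spaces become underscores,
    everything else gets percent-encoded."""
    return ("https://en.wikipedia.org/api/rest_v1/page/summary/"
            + quote(title.replace(" ", "_")))

# resp = requests.get(summary_url("Python (programming language)"))
```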

Reddit

Add .json to any Reddit URL:

resp = requests.get("https://www.reddit.com/r/python/hot.json",
    headers={"User-Agent": "MyApp/1.0"})

for post in resp.json()["data"]["children"][:5]:
    d = post["data"]
    print(f"{d['score']} pts | {d['title'][:60]}")
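Reddit listings are paginated: each response carries an after token that you pass back to fetch the next page. A sketch of that loop (the generator is my own wrapper around the same .json endpoint):

```python
import requests

def fetch_pages(subreddit, pages=2, limit=25):
    """Yield post dicts from /hot across several pages, following
    the 'after' cursor that each listing response returns."""
    after = None
    for _ in range(pages):
        resp = requests.get(
            f"https://www.reddit.com/r/{subreddit}/hot.json",
            params={"limit": limit, "after": after},
            headers={"User-Agent": "MyApp/1.0"},
        )
        data = resp.json()["data"]
        yield from (child["data"] for child in data["children"])
        after = data["after"]
        if after is None:  # no more pages
            break
```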

The Pattern

Most modern websites are SPAs (Single Page Applications). They:

  1. Load a skeleton HTML
  2. Fetch data from an internal API
  3. Render with JavaScript

That internal API is what you want. It's:

  • Structured — clean JSON, not messy HTML
  • Stable — API changes less often than HTML layout
  • Fast — no need to render JavaScript
  • Complete — often has MORE data than what's shown on the page

JSON API vs HTML Scraping

|                   | JSON API                | HTML Scraping            |
| ----------------- | ----------------------- | ------------------------ |
| Speed             | Fast (small payload)    | Slow (full page)         |
| Reliability       | High (stable structure) | Low (breaks on redesign) |
| Data quality      | Clean, typed            | Messy, needs parsing     |
| JavaScript needed | No                      | Often yes (Playwright)   |
| Maintenance       | Low                     | High                     |

My 77 Apify Scrapers Use This Approach

All 77 of my Apify actors use hidden JSON APIs instead of HTML parsing:

  • YouTube → Innertube API
  • Reddit → .json endpoints
  • npm → Registry API
  • GitHub → REST API

They're API-first: faster, more reliable, and they don't break on redesigns.


What hidden API have you discovered? Share your findings in the comments!

More tools and tutorials: GitHub | Dev.to
