I used to write BeautifulSoup parsers for every website. CSS selectors, XPath, regex on HTML...
Then I discovered that most websites have hidden JSON APIs that return clean, structured data. No parsing needed.
## How to Find Hidden APIs
- Open Chrome DevTools → Network tab
- Filter by XHR/Fetch
- Browse the website normally
- Watch for JSON responses
You'll be amazed how many sites serve their data through internal APIs.
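Once you spot a promising request, you can replay it from Python. A small helper (my own sketch, not part of any library) turns a raw request-headers block copied from DevTools into a dict that `requests` accepts:

```python
def headers_from_devtools(raw: str) -> dict:
    """Parse a request-headers block copied from DevTools into a dict.

    Skips blank lines and HTTP/2 pseudo-headers like `:authority:`,
    which requests rejects.
    """
    headers = {}
    for line in raw.strip().splitlines():
        line = line.strip()
        if not line or line.startswith(":"):
            continue
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return headers

# Replay a captured request with the same headers
# (the endpoint here is a placeholder, not a real API):
# resp = requests.get("https://example.com/internal/api",
#                     headers=headers_from_devtools(raw))
```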
## Real Examples
### YouTube (Innertube API)
YouTube's website doesn't scrape its own HTML. It uses an internal API called Innertube:
```python
import requests

resp = requests.post("https://www.youtube.com/youtubei/v1/search", json={
    "context": {"client": {"clientName": "WEB", "clientVersion": "2.20240101.00.00"}},
    "query": "python tutorial",
})
data = resp.json()
# Clean JSON with all video data — no HTML parsing!
```
No API key. No quotas. Same data YouTube shows you.
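The Innertube response is deeply nested, and the exact key paths shift between clients, so rather than hard-coding them I walk the tree. A generic recursive search (my own helper, not part of the API) collects every value stored under a given key:

```python
def find_key(obj, key):
    """Recursively yield every value stored under `key` in nested JSON."""
    if isinstance(obj, dict):
        for k, v in obj.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(obj, list):
        for item in obj:
            yield from find_key(item, key)

# With the search response above (key name taken from observed responses):
# for video in find_key(data, "videoRenderer"):
#     print(video["videoId"])
```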
Toolkit: youtube-innertube-toolkit
### GitHub Trending
GitHub's trending page has no official API, but you can get similar data from the official search API:
```python
import requests

resp = requests.get(
    "https://api.github.com/search/repositories",
    params={"q": "created:>2024-01-01", "sort": "stars", "per_page": 10},
)
for repo in resp.json()["items"]:
    print(f"{repo['full_name']} — {repo['stargazers_count']} stars")
```
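The search API caps each response at `per_page` items and signals further pages through the standard `Link` response header. A small parser for the `rel="next"` URL (a sketch of the header format, not a GitHub-specific library call):

```python
def next_page_url(link_header):
    """Return the rel="next" URL from a Link response header, or None."""
    for part in (link_header or "").split(","):
        if 'rel="next"' in part:
            return part.split(";")[0].strip().strip("<>")
    return None

# Usage after a request: url = next_page_url(resp.headers.get("Link"))
```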
### npm Registry
Every npm package has a JSON endpoint:
```python
import requests

resp = requests.get("https://registry.npmjs.org/react")
data = resp.json()
print(f"react: {data['description']}")
print(f"Latest: {data['dist-tags']['latest']}")
```
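The registry document carries far more than the description: its `versions` map holds the full manifest of every release. Pulling the latest version's dependency list (field names as they appear in registry documents):

```python
def latest_dependencies(registry_doc):
    """Return the dependency map of the latest published version."""
    latest = registry_doc["dist-tags"]["latest"]
    return registry_doc["versions"][latest].get("dependencies", {})

# Usage with the response above: latest_dependencies(resp.json())
```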
### Wikipedia
```python
import requests

resp = requests.get("https://en.wikipedia.org/api/rest_v1/page/summary/Python_(programming_language)")
data = resp.json()
print(data["extract"][:200])
```
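Page titles with spaces or punctuation need encoding before they go into the URL. A tiny builder (my own helper, mirroring how MediaWiki titles use underscores):

```python
from urllib.parse import quote

def summary_url(title):
    """Build a REST summary URL: spaces become underscores, rest percent-encoded."""
    slug = quote(title.replace(" ", "_"), safe="()_")
    return f"https://en.wikipedia.org/api/rest_v1/page/summary/{slug}"
```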
### Reddit

Add `.json` to any Reddit URL:
```python
import requests

resp = requests.get("https://www.reddit.com/r/python/hot.json",
                    headers={"User-Agent": "MyApp/1.0"})
for post in resp.json()["data"]["children"][:5]:
    d = post["data"]
    print(f"{d['score']} pts | {d['title'][:60]}")
```
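Listings are paginated: each response carries an `after` cursor that you pass back as a query parameter to fetch the next page. Splitting the parsing out into a helper (field names as returned by the `.json` endpoints):

```python
def parse_listing(listing):
    """Extract post dicts and the pagination cursor from a Reddit listing."""
    data = listing["data"]
    posts = [child["data"] for child in data["children"]]
    return posts, data.get("after")

# Next page, using the cursor from the previous response:
# resp = requests.get("https://www.reddit.com/r/python/hot.json",
#                     headers={"User-Agent": "MyApp/1.0"},
#                     params={"after": cursor})
```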
## The Pattern
Most modern websites are SPAs (Single Page Applications). They:
- Load a skeleton HTML
- Fetch data from an internal API
- Render with JavaScript
That internal API is what you want. It's:
- Structured — clean JSON, not messy HTML
- Stable — API changes less often than HTML layout
- Fast — no need to render JavaScript
- Complete — often has MORE data than what's shown on the page
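Some SPAs skip the extra request and embed their bootstrap data directly in the skeleton HTML, typically as `window.__INITIAL_STATE__ = {...};`. A rough extractor for that pattern (the variable name varies per site, and the regex assumes the object literal ends with `};`):

```python
import json
import re

def extract_embedded_state(html, var="__INITIAL_STATE__"):
    """Pull bootstrap JSON embedded in SPA HTML; None if not found."""
    m = re.search(re.escape(var) + r"\s*=\s*(\{.*?\});", html, re.DOTALL)
    return json.loads(m.group(1)) if m else None
```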
## JSON API vs HTML Scraping
| | JSON API | HTML Scraping |
|---|---|---|
| Speed | Fast (small payload) | Slow (full page) |
| Reliability | High (stable structure) | Low (breaks on redesign) |
| Data quality | Clean, typed | Messy, needs parsing |
| JavaScript needed | No | Often yes (Playwright) |
| Maintenance | Low | High |
## My 77 Apify Scrapers Use This Approach
All 77 of my Apify actors use hidden JSON APIs instead of HTML parsing:
- YouTube → Innertube API
- Reddit → .json endpoints
- npm → Registry API
- GitHub → REST API
They're API-first: faster, more reliable, and they don't break on redesigns.
What hidden API have you discovered? Share your findings in the comments!