I've been building web scrapers for years. Here's my controversial take: most web scraping tutorials teach you the wrong thing.
They teach you to parse HTML. To fight with selectors. To handle dynamic JavaScript rendering.
But 80% of the data you need is available through free public APIs that nobody talks about.
The APIs Nobody Knows About
- PyPI has a JSON API: `https://pypi.org/pypi/{package}/json`. No key, no auth.
- YouTube has Innertube, its internal API. No quotas, no key.
- arXiv has a free search API. 2M+ papers, structured XML.
- PubMed returns medical research data in JSON.
- GitHub gives you repo data without a token.
- Crossref searches 130M+ research papers for free.
- WHOIS/RDAP returns domain registration data via REST.
I documented all of them in my free APIs list — 200+ APIs that need zero registration.
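To show how little code the API route takes, here's a minimal sketch against the PyPI JSON endpoint mentioned above. Only the standard library is used; the helper names are mine, not part of any official client:

```python
import json
import urllib.request

PYPI_JSON = "https://pypi.org/pypi/{package}/json"

def pypi_url(package: str) -> str:
    """Build the PyPI JSON API URL for a package: no key, no auth."""
    return PYPI_JSON.format(package=package)

def pypi_info(package: str) -> dict:
    """One GET, structured JSON back; no HTML parsing involved."""
    with urllib.request.urlopen(pypi_url(package)) as resp:
        return json.load(resp)

# Usage (hits the network):
# meta = pypi_info("requests")
# meta["info"]["version"]  -> latest release version
# meta["info"]["summary"]  -> one-line description
```

Compare that to the selector gymnastics needed to scrape the same fields off the project page.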
Why This Matters
Every time you write a BeautifulSoup selector, you're:
- Building something fragile (one HTML change = broken scraper)
- Fighting anti-bot systems unnecessarily
- Ignoring structured data that's already there
APIs don't change their response format every week. HTML does.
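A concrete case of "structured data that's already there": Next.js sites embed their entire page payload in a `<script id="__NEXT_DATA__">` tag. The tag id is a real Next.js convention; the sample payload below is made up for illustration:

```python
import json
import re

# Matches the JSON blob Next.js embeds in every server-rendered page.
NEXT_DATA_RE = re.compile(
    r'<script id="__NEXT_DATA__" type="application/json">(.*?)</script>',
    re.DOTALL,
)

def extract_next_data(html: str) -> dict:
    """Return the embedded JSON payload, or raise if it's absent."""
    match = NEXT_DATA_RE.search(html)
    if match is None:
        raise ValueError("no __NEXT_DATA__ script tag found")
    return json.loads(match.group(1))

sample = ('<script id="__NEXT_DATA__" type="application/json">'
          '{"props": {"page": 1}}</script>')
print(extract_next_data(sample)["props"]["page"])  # 1
```

One regex and `json.loads` instead of a tree of selectors, and it survives every CSS redesign.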
My Rule
Before scraping ANY website, I spend 5 minutes checking:
- Does it have a public API? (check `/api`, `/graphql`, or the docs)
- Does it expose JSON in the page source? (`ytInitialData`, `__NEXT_DATA__`)
- Does it have RSS/Atom feeds?
Only if all three fail do I touch the HTML.
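The three checks above can be sketched as one function. This is my own rough sketch, not a robust detector: the marker strings and path list are assumptions you'd tune per site:

```python
from typing import Optional
import urllib.error
import urllib.request

# Heuristics from the checklist; extend these per target site.
COMMON_API_PATHS = ("/api", "/graphql")
EMBEDDED_JSON_MARKERS = ("ytInitialData", "__NEXT_DATA__")
FEED_MARKERS = ('type="application/rss+xml"', 'type="application/atom+xml"')

def fetch(url: str) -> Optional[str]:
    """Return the response body, or None if the URL doesn't answer."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.read().decode("utf-8", errors="replace")
    except (urllib.error.URLError, ValueError):
        return None

def classify_source(html: str) -> str:
    """Checks 2 and 3: embedded JSON, then feeds, then HTML last."""
    if any(m in html for m in EMBEDDED_JSON_MARKERS):
        return "embedded-json"
    if any(m in html for m in FEED_MARKERS):
        return "feed"
    return "html"

def scrape_strategy(base_url: str) -> str:
    """Check 1 first: does a conventional API path answer at all?"""
    for path in COMMON_API_PATHS:
        if fetch(base_url.rstrip("/") + path) is not None:
            return "api"
    return classify_source(fetch(base_url) or "")
```

Only when `scrape_strategy` comes back `"html"` do I reach for a parser.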
What's your approach? Do you default to HTML scraping or APIs first? Have you discovered any hidden APIs that saved you hours of work?
I'm genuinely curious — drop your experience in the comments.