Over the past month, I built Python toolkits for 10 different academic APIs. Combined, they give you access to 800M+ research papers, 17M researcher profiles, and 30M free PDFs.
All free. All open source. Most require no API key.
Here's what surprised me along the way.
The 10 Toolkits
| # | Toolkit | Records | API Key? | Best For |
|---|---|---|---|---|
| 1 | OpenAlex | 250M+ papers | No | Bibliometrics, citations |
| 2 | Crossref | 150M+ articles | No | DOI metadata |
| 3 | Semantic Scholar | 200M+ papers | Optional | AI summaries |
| 4 | CORE | 300M+ papers | Free key | Full text |
| 5 | PubMed | 36M+ papers | No | Medical research |
| 6 | arXiv | 2M+ preprints | No | AI/Physics/Math |
| 7 | Unpaywall | 30M+ OA links | Email only | Free PDFs |
| 8 | ORCID | 17M+ profiles | No | Researcher lookup |
| 9 | YouTube Innertube | Unlimited | No | Video data |
| 10 | Research CLI | 800M+ combined | No | Unified search |
5 Things I Learned
1. Academic data is surprisingly free
I expected paywalls everywhere. Instead, I found that most major academic databases have free, well-documented APIs. The paywall is on the papers themselves, not the metadata.
2. Crossref is the backbone of everything
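Crossref is the largest DOI registration agency for scholarly publishing (others, such as DataCite, handle datasets and some preprints). When OpenAlex, PubMed, or Semantic Scholar give you a DOI for a journal article, chances are it was registered with Crossref, and Crossref holds the metadata record behind it. It's the invisible infrastructure of academic publishing.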
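To make that concrete, here is a minimal sketch of pulling one DOI's metadata from the Crossref REST API (`https://api.crossref.org/works/{doi}`). The helper names `crossref_url`, `summarize_work`, and `fetch_work` are made up for this illustration, and it sticks to the standard library. Treat it as a starting point under those assumptions rather than the toolkit's actual code.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi: str) -> str:
    """Build the Crossref REST URL for one DOI (slashes stay literal)."""
    return CROSSREF_WORKS + urllib.parse.quote(doi.strip(), safe="/")


def summarize_work(message: dict) -> dict:
    """Pick a few fields out of the "message" object Crossref returns."""
    titles = message.get("title") or []              # titles arrive as a list
    journals = message.get("container-title") or []  # so do journal names
    return {
        "doi": message.get("DOI"),
        "title": titles[0] if titles else None,
        "journal": journals[0] if journals else None,
        "publisher": message.get("publisher"),
        "type": message.get("type"),
    }


def fetch_work(doi: str, timeout: float = 30.0) -> dict:
    """Fetch one record over the network and summarize it."""
    with urllib.request.urlopen(crossref_url(doi), timeout=timeout) as resp:
        return summarize_work(json.load(resp)["message"])
```

Calling `fetch_work(doi)` with any Crossref-registered DOI should return a small dict of title, journal, publisher, and type. Keeping the parsing in `summarize_work` separate from the network call makes it easy to test against a saved response.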
3. ~30% of paywalled papers have free versions
Using Unpaywall, I checked 200 random DOIs from Crossref. About 30% had a free, legal version somewhere. For recent papers (2023+), it was closer to 50%.
The free version is usually on:
- Author's personal website
- University institutional repository
- Preprint server (arXiv, bioRxiv)
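A check like this can be sketched in a few lines. The snippet below assumes the Unpaywall v2 endpoint (`https://api.unpaywall.org/v2/{doi}?email=...`) and its `is_oa` and `best_oa_location` response fields. The function names are hypothetical, and this is not the exact script I ran for the 200-DOI sample.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

UNPAYWALL = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi: str, email: str) -> str:
    """Unpaywall takes the DOI in the path and your email as a query parameter."""
    return (UNPAYWALL + urllib.parse.quote(doi.strip(), safe="/")
            + "?" + urllib.parse.urlencode({"email": email}))


def free_copy(record: dict) -> Optional[dict]:
    """Return where the best free copy lives, or None if there isn't one."""
    best = record.get("best_oa_location")
    if not record.get("is_oa") or not best:
        return None
    return {
        "url": best.get("url_for_pdf") or best.get("url"),
        "host_type": best.get("host_type"),  # "publisher" or "repository"
    }


def oa_share(records: list) -> float:
    """Fraction of records with a free copy, e.g. 0.3 for 30%."""
    if not records:
        return 0.0
    return sum(free_copy(r) is not None for r in records) / len(records)


def fetch_record(doi: str, email: str, timeout: float = 30.0) -> dict:
    """Fetch one Unpaywall record over the network."""
    with urllib.request.urlopen(unpaywall_url(doi, email), timeout=timeout) as resp:
        return json.load(resp)
```

Run `fetch_record` over a list of DOIs, pass the results to `oa_share`, and you get the kind of percentage quoted above. The `host_type` field tells you whether the free copy sits with the publisher or in a repository.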
4. AI summaries change how you read papers
Semantic Scholar's TLDR feature gives you a one-sentence, AI-generated summary for many papers (the field comes back empty where no summary has been generated). When you're scanning 100 papers for a literature review, this saves hours.
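import requests
resp = requests.get("https://api.semanticscholar.org/graph/v1/paper/search",
    params={"query": "large language models", "limit": 1, "fields": "title,tldr"},
    timeout=30)
resp.raise_for_status()  # fail loudly on rate limiting or other HTTP errors
paper = resp.json()["data"][0]
print(paper["title"])
# tldr is null when no summary has been generated for the paper
tldr = paper.get("tldr") or {}
print(f"TLDR: {tldr.get('text', 'not available')}")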
5. OpenAlex is the most underrated API
250M+ works, no API key, rate limits generous enough that you rarely notice them, and data on authors, institutions, journals, funders, and concepts. It's like a free version of Web of Science.
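For a sense of how little code a search takes, here is a rough sketch against the OpenAlex `/works` endpoint. It assumes the `search`, `per-page`, and optional `mailto` query parameters and the `results` list in the response. The helper names are mine for this example, not part of any OpenAlex client.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

OPENALEX_WORKS = "https://api.openalex.org/works"


def works_url(query: str, per_page: int = 5, mailto: Optional[str] = None) -> str:
    """Build a full-text search URL; mailto is optional contact info."""
    params = {"search": query, "per-page": per_page}
    if mailto:
        params["mailto"] = mailto
    return OPENALEX_WORKS + "?" + urllib.parse.urlencode(params)


def summarize_results(payload: dict) -> list:
    """Reduce each work in the response to the fields a quick scan needs."""
    return [
        {
            "title": work.get("display_name"),
            "year": work.get("publication_year"),
            "citations": work.get("cited_by_count") or 0,
            "doi": work.get("doi"),
        }
        for work in payload.get("results", [])
    ]


def search_works(query: str, per_page: int = 5, timeout: float = 30.0) -> list:
    """Run the search over the network and return the summarized works."""
    with urllib.request.urlopen(works_url(query, per_page), timeout=timeout) as resp:
        return summarize_results(json.load(resp))
```

`search_works("CRISPR gene therapy", per_page=3)` would return up to three dicts with title, year, citation count, and DOI.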
The Unified CLI
I also built a CLI that searches multiple databases at once:
$ python research_cli.py search "CRISPR gene therapy" --limit 3
1. CRISPR-Cas9 Gene Editing for Sickle Cell Disease (2023)
Citations: 847 | Source: OpenAlex
2. In Vivo CRISPR Gene Therapy (2024)
Citations: 234 | Source: OpenAlex
research-paper-cli — one command, 7 databases, 800M+ papers.
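The tricky part of searching several databases at once is that the same paper comes back from more than one source. Below is a rough sketch of a merge step that deduplicates by DOI and ranks by citations. The record shape (`title`, `doi`, `citations`, `source`) and the function names are assumptions for illustration only, not the code in research_cli.py.

```python
DOI_PREFIXES = (
    "https://doi.org/", "http://doi.org/",
    "https://dx.doi.org/", "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi):
    """Lowercase a DOI and strip resolver prefixes so copies compare equal."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi or None


def merge_results(*result_lists, limit=10):
    """Merge per-source result lists, keeping the most-cited copy of each paper."""
    by_key = {}
    for results in result_lists:
        for rec in results:
            # Fall back to the lowercased title when a record has no DOI.
            title_key = ("title", (rec.get("title") or "").strip().lower())
            key = normalize_doi(rec.get("doi")) or title_key
            best = by_key.get(key)
            if best is None or (rec.get("citations") or 0) > (best.get("citations") or 0):
                by_key[key] = rec
    ranked = sorted(by_key.values(),
                    key=lambda r: r.get("citations") or 0, reverse=True)
    return ranked[:limit]
```

Title matching is a crude fallback and will miss papers whose titles differ slightly between sources, so a real tool would likely want fuzzier matching there.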
What's Next
I'm thinking about:
- Adding ClinicalTrials.gov support
- Building a web UI for non-technical researchers
- Creating a citation graph visualizer
All 10 toolkits are on my GitHub. Star them if you find them useful.
Which academic API would you add to this suite? And what would you build with access to 800M+ papers?