Alex Spinov

Posted on Mar 25 • Edited on Mar 26

I Built 10 Open Source Toolkits for Academic APIs — Here's What I Learned

#python #opensource #api #productivity

Over the past month, I built Python toolkits for 10 different academic APIs. Combined, they give you access to 800M+ research paper records (counted per database, so the same paper can appear in several of them), 17M researcher profiles, and 30M free PDFs.

All free. All open source. Most require no API key.

Here's what surprised me along the way.

The 10 Toolkits

#	Toolkit	Records	API Key?	Best For
1	OpenAlex	250M+ papers	No	Bibliometrics, citations
2	Crossref	150M+ articles	No	DOI metadata
3	Semantic Scholar	200M+ papers	Optional	AI summaries
4	CORE	300M+ papers	Free key	Full text
5	PubMed	36M+ papers	No	Medical research
6	arXiv	2M+ preprints	No	AI/Physics/Math
7	Unpaywall	30M+ OA links	Email only	Free PDFs
8	ORCID	17M+ profiles	No	Researcher lookup
9	YouTube Innertube	Unlimited	No	Video data
10	Research CLI	800M+ combined	No	Unified search

5 Things I Learned

1. Academic data is surprisingly free

I expected paywalls everywhere. Instead, I found that most major academic databases have free, well-documented APIs. The paywall is on the papers themselves, not the metadata.

2. Crossref is the backbone of everything

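Most journal-article DOIs are registered with Crossref, although it is not the only DOI registration agency (DataCite, for example, handles many dataset and preprint DOIs). When OpenAlex, PubMed, or Semantic Scholar give you a DOI, you can usually look up its metadata in Crossref. It's the invisible infrastructure of academic publishing.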
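Here's a minimal sketch of a Crossref lookup for a single DOI. It assumes the public `https://api.crossref.org/works/{doi}` endpoint and its optional `mailto` parameter. The helper names (`crossref_url`, `summarize_work`, `fetch_work`) are made up for this example and are not taken from the toolkits. It uses only the standard library so it runs without extra installs.

```python
import json
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi: str) -> str:
    """Build the Crossref REST API URL for one DOI (hypothetical helper)."""
    # Quote unusual characters but keep "/" so the DOI stays readable.
    return CROSSREF_WORKS + urllib.parse.quote(doi.strip())


def summarize_work(payload: dict) -> dict:
    """Pull a few common fields out of a /works/{doi} response body."""
    msg = payload.get("message", {})
    titles = msg.get("title") or []  # Crossref returns titles as a list
    return {
        "doi": msg.get("DOI"),
        "title": titles[0] if titles else None,
        "publisher": msg.get("publisher"),
        "type": msg.get("type"),
    }


def fetch_work(doi: str, mailto: str) -> dict:
    """Fetch and summarize one record. Makes a network call."""
    # mailto is your own contact address, passed as a query parameter.
    url = crossref_url(doi) + "?" + urllib.parse.urlencode({"mailto": mailto})
    with urllib.request.urlopen(url, timeout=30) as resp:
        return summarize_work(json.load(resp))
```

Calling `fetch_work` with a DOI and your email should return a small dict with the title, publisher, and record type. A DOI registered with another agency will typically come back as an HTTP error, which `urlopen` raises as an exception.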

3. ~30% of paywalled papers have free versions

Using Unpaywall, I checked 200 random DOIs from Crossref. About 30% had a free, legal version somewhere. For recent papers (2023+), it was closer to 50%.

The free version is usually on:

Author's personal website
University institutional repository
Preprint server (arXiv, bioRxiv)
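A check like this can be sketched against the Unpaywall v2 endpoint, assuming its `is_oa` and `best_oa_location` response fields. The function names are hypothetical and this is not the exact script behind the numbers above, just one way to do it with the standard library.

```python
import json
import urllib.parse
import urllib.request


def unpaywall_url(doi: str, email: str) -> str:
    """Build an Unpaywall v2 URL. The email parameter is required by the API."""
    base = "https://api.unpaywall.org/v2/" + urllib.parse.quote(doi.strip())
    return base + "?" + urllib.parse.urlencode({"email": email})


def best_free_link(record: dict):
    """Return the best free URL from an Unpaywall record, or None if closed."""
    if not record.get("is_oa"):
        return None
    location = record.get("best_oa_location") or {}
    # Prefer a direct PDF link, fall back to the landing page.
    return location.get("url_for_pdf") or location.get("url")


def open_share(records: list) -> float:
    """Fraction of records that Unpaywall flags as open access."""
    if not records:
        return 0.0
    return sum(1 for r in records if r.get("is_oa")) / len(records)


def fetch_record(doi: str, email: str) -> dict:
    """Fetch one Unpaywall record. Makes a network call."""
    with urllib.request.urlopen(unpaywall_url(doi, email), timeout=30) as resp:
        return json.load(resp)
```

Running `fetch_record` over a list of DOIs and passing the results to `open_share` gives the kind of percentage quoted above. `best_oa_location` also carries a `host_type` field that distinguishes publisher-hosted copies from repository copies.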

4. AI summaries change how you read papers

Semantic Scholar's TLDR feature gives you a one-sentence AI summary for many papers (the field is empty when no summary has been generated). When you're scanning 100 papers for a literature review, this saves hours.

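import requests

resp = requests.get(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    params={"query": "large language models", "limit": 1, "fields": "title,tldr"},
    timeout=30,
)
resp.raise_for_status()  # requests without an API key can be rate-limited (HTTP 429)
paper = resp.json()["data"][0]
print(paper["title"])
# "tldr" is null for papers without a generated summary, so guard the lookup.
print(f"TLDR: {(paper.get('tldr') or {}).get('text', 'not available')}")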

5. OpenAlex is the most underrated API

250M+ works, no API key, generous rate limits in practice, and data on authors, institutions, journals, funders, and concepts. It's like a free version of Web of Science.
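A minimal sketch of a keyword search against OpenAlex, assuming the `/works` endpoint with its `search`, `per-page`, and `mailto` query parameters and the `display_name`, `publication_year`, and `cited_by_count` fields in each result. The function names are hypothetical, and the parameters are worth checking against the current OpenAlex docs before relying on them.

```python
import json
import urllib.parse
import urllib.request


def openalex_search_url(query: str, per_page: int = 5, mailto: str = None) -> str:
    """Build a works-search URL (hypothetical helper)."""
    params = {"search": query, "per-page": per_page}
    if mailto:
        params["mailto"] = mailto  # optional contact address
    return "https://api.openalex.org/works?" + urllib.parse.urlencode(params)


def format_hit(work: dict) -> str:
    """Render one OpenAlex work as a single line of text."""
    title = work.get("display_name") or "(untitled)"
    year = work.get("publication_year") or "n.d."
    cites = work.get("cited_by_count") or 0
    return f"{title} ({year}) | Citations: {cites}"


def search_works(query: str, per_page: int = 5, mailto: str = None) -> list:
    """Run the search and return formatted lines. Makes a network call."""
    url = openalex_search_url(query, per_page, mailto)
    with urllib.request.urlopen(url, timeout=30) as resp:
        data = json.load(resp)
    return [format_hit(work) for work in data.get("results", [])]
```

The same pattern should carry over to the other OpenAlex entity types (authors, institutions, and so on) by changing the path segment.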

The Unified CLI

I also built a CLI that searches multiple databases at once:

$ python research_cli.py search "CRISPR gene therapy" --limit 3

 1. CRISPR-Cas9 Gene Editing for Sickle Cell Disease (2023)
    Citations: 847 | Source: OpenAlex

 2. In Vivo CRISPR Gene Therapy (2024)
    Citations: 234 | Source: OpenAlex

research-paper-cli — one command, 7 databases, 800M+ papers.
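Searching several databases at once means the same paper can come back more than once. Below is a sketch of one way to merge results by DOI. This is an illustration, not the CLI's actual code: the `doi`, `title`, and `citations` keys and both function names are assumptions made for the example.

```python
def normalize_doi(doi):
    """Lowercase a DOI and strip common prefixes so duplicates compare equal."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
    return doi or None


def merge_results(*sources):
    """Merge (source_name, hits) pairs, de-duplicating by DOI.

    Hits without a DOI fall back to a case-insensitive title match.
    """
    merged = {}
    for source_name, hits in sources:
        for hit in hits:
            key = normalize_doi(hit.get("doi")) or "title:" + (hit.get("title") or "").casefold()
            entry = merged.get(key)
            if entry is None:
                # First time we see this paper: copy it and start a source list.
                merged[key] = {**hit, "sources": [source_name]}
                continue
            if source_name not in entry["sources"]:
                entry["sources"].append(source_name)
            # Keep the highest citation count reported by any source.
            entry["citations"] = max(entry.get("citations") or 0, hit.get("citations") or 0)
    # Most-cited papers first.
    return sorted(merged.values(), key=lambda e: e.get("citations") or 0, reverse=True)
```

Keeping the highest citation count is a simplification, since each database counts citations differently. A real tool might show the count per source instead.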

What's Next

I'm thinking about:

Adding ClinicalTrials.gov support
Building a web UI for non-technical researchers
Creating a citation graph visualizer

All 10 toolkits are on my GitHub. Star them if you find them useful.

Which academic API would you add to this suite? And what would you build with access to 800M+ papers?

Need web scraping or data extraction? I've built 77+ production scrapers. Email spinov001@gmail.com — quote in 2 hours. Or try my ready-made Apify actors — no code needed.
