DEV Community

Alex Spinov

I Built 10 Open Source Toolkits for Academic APIs — Here's What I Learned

Over the past month, I built Python toolkits for 10 different academic APIs. Combined, they give you access to 800M+ research papers, 17M researcher profiles, and 30M free PDFs.

All free. All open source. Most require no API key.

Here's what surprised me along the way.

The 10 Toolkits

| #  | Toolkit           | Records         | API Key?   | Best For                 |
|----|-------------------|-----------------|------------|--------------------------|
| 1  | OpenAlex          | 250M+ papers    | No         | Bibliometrics, citations |
| 2  | Crossref          | 150M+ articles  | No         | DOI metadata             |
| 3  | Semantic Scholar  | 200M+ papers    | Optional   | AI summaries             |
| 4  | CORE              | 300M+ papers    | Free key   | Full text                |
| 5  | PubMed            | 36M+ papers     | No         | Medical research         |
| 6  | arXiv             | 2M+ preprints   | No         | AI/Physics/Math          |
| 7  | Unpaywall         | 30M+ OA links   | Email only | Free PDFs                |
| 8  | ORCID             | 17M+ profiles   | No         | Researcher lookup        |
| 9  | YouTube Innertube | Unlimited       | No         | Video data               |
| 10 | Research CLI      | 800M+ combined  | No         | Unified search           |

5 Things I Learned

1. Academic data is surprisingly free

I expected paywalls everywhere. Instead, I found that most major academic databases have free, well-documented APIs. The paywall is on the papers themselves, not the metadata.

2. Crossref is the backbone of everything

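Most journal-article DOIs are registered with Crossref, the largest of several DOI registration agencies (DataCite, for example, covers many datasets). When OpenAlex, PubMed, or Semantic Scholar give you a DOI for an article, the metadata behind it most likely lives in Crossref, even though the link itself resolves through doi.org. It's the invisible infrastructure of academic publishing.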
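To make the "backbone" point concrete, here is a minimal sketch of looking up one DOI through Crossref's public REST endpoint (`https://api.crossref.org/works/{doi}`). It is not the code from my toolkit: the helper names (`crossref_url`, `summarize_work`, `fetch_work`) are made up for this example, and it uses only the standard library so it runs without extra installs.

```python
import json
import sys
import urllib.parse
import urllib.request

CROSSREF_WORKS = "https://api.crossref.org/works/"


def crossref_url(doi: str) -> str:
    """Build the Crossref REST URL for a single DOI."""
    # Keep "/" unescaped; percent-encode anything else unusual in the DOI.
    return CROSSREF_WORKS + urllib.parse.quote(doi, safe="/")


def summarize_work(payload: dict) -> dict:
    """Pick a few common fields out of a /works/{doi} response body."""
    msg = payload.get("message", {})
    # Crossref returns title and container-title as lists of strings.
    titles = msg.get("title") or [""]
    journals = msg.get("container-title") or [""]
    return {
        "doi": msg.get("DOI"),
        "title": titles[0],
        "journal": journals[0],
        "publisher": msg.get("publisher"),
    }


def fetch_work(doi: str, timeout: float = 10.0) -> dict:
    """Fetch and summarize one DOI (performs a network request)."""
    with urllib.request.urlopen(crossref_url(doi), timeout=timeout) as resp:
        return summarize_work(json.load(resp))


# Usage (hypothetical file name): python crossref_lookup.py <doi>
if __name__ == "__main__" and len(sys.argv) > 1:
    print(fetch_work(sys.argv[1]))
```

A DOI registered with another agency will typically come back as a 404 here, which is a quick way to see that Crossref is big but not the whole DOI system.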

3. ~30% of paywalled papers have free versions

Using Unpaywall, I checked 200 random DOIs from Crossref. About 30% had a free, legal version somewhere. For recent papers (2023+), it was closer to 50%.

The free version is usually on:

  • Author's personal website
  • University institutional repository
  • Preprint server (arXiv, bioRxiv)
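If you want to repeat that kind of check yourself, a rough sketch looks like the code below. It assumes Unpaywall's v2 endpoint (`https://api.unpaywall.org/v2/{doi}?email=...`) and its `is_oa` and `best_oa_location` response fields. The function names are invented for this example, and this is not the exact script I ran.

```python
import json
import sys
import urllib.parse
import urllib.request
from typing import Iterable, Optional


def unpaywall_url(doi: str, email: str) -> str:
    """Build the Unpaywall v2 URL; the email parameter is required."""
    query = urllib.parse.urlencode({"email": email})
    return "https://api.unpaywall.org/v2/" + urllib.parse.quote(doi, safe="/") + "?" + query


def free_copy(record: dict) -> Optional[dict]:
    """Return where the best open-access copy lives, or None if there is none."""
    if not record.get("is_oa"):
        return None
    loc = record.get("best_oa_location") or {}
    return {
        # Prefer a direct PDF link, fall back to the landing page.
        "url": loc.get("url_for_pdf") or loc.get("url"),
        # "publisher" or "repository"
        "host_type": loc.get("host_type"),
    }


def oa_share(records: Iterable[dict]) -> float:
    """Fraction of records that Unpaywall marks as open access."""
    records = list(records)
    if not records:
        return 0.0
    return sum(1 for r in records if r.get("is_oa")) / len(records)


def lookup(doi: str, email: str, timeout: float = 10.0) -> dict:
    """Fetch one Unpaywall record (performs a network request)."""
    with urllib.request.urlopen(unpaywall_url(doi, email), timeout=timeout) as resp:
        return json.load(resp)


# Usage (hypothetical file name): python unpaywall_check.py <doi> <your-email>
if __name__ == "__main__" and len(sys.argv) > 2:
    print(free_copy(lookup(sys.argv[1], sys.argv[2])))
```

Feed `oa_share` a list of records from `lookup` and you get the kind of percentage quoted above. With a sample of a few hundred DOIs, treat the result as a rough estimate.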

4. AI summaries change how you read papers

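Semantic Scholar's TLDR feature gives you a one-sentence AI summary for many papers (the field is empty when no summary has been generated). When you're scanning 100 papers for a literature review, this saves hours.

import requests
resp = requests.get("https://api.semanticscholar.org/graph/v1/paper/search",
    params={"query": "large language models", "limit": 1, "fields": "title,tldr"}, timeout=10)
resp.raise_for_status()  # fail loudly on rate limiting (HTTP 429) or other errors
paper = resp.json()["data"][0]
print(paper["title"])
# tldr is null for papers without a generated summary, so guard before indexing
tldr = paper.get("tldr") or {}
print(f"TLDR: {tldr.get('text', 'not available')}")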

5. OpenAlex is the most underrated API

250M+ works, no API key, rate limits generous enough that I rarely noticed them, and data on authors, institutions, journals, funders, and concepts. It's like a free version of Web of Science.
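Here is a small sketch of a works search against OpenAlex, assuming the `https://api.openalex.org/works` endpoint with its `search`, `per-page`, and optional `mailto` parameters. The helper names are mine and exist only in this example. Access rules can change, so check the current OpenAlex docs before relying on it.

```python
import json
import sys
import urllib.parse
import urllib.request
from typing import Optional


def openalex_search_url(query: str, per_page: int = 5, mailto: Optional[str] = None) -> str:
    """Build a works-search URL; mailto identifies you to the service."""
    params = {"search": query, "per-page": per_page}
    if mailto:
        params["mailto"] = mailto
    return "https://api.openalex.org/works?" + urllib.parse.urlencode(params)


def top_works(payload: dict) -> list:
    """Reduce a works-search response to title, year, and citation count."""
    return [
        {
            "title": work.get("display_name"),
            "year": work.get("publication_year"),
            "citations": work.get("cited_by_count", 0),
        }
        for work in payload.get("results", [])
    ]


def search_works(query: str, per_page: int = 5, timeout: float = 10.0) -> list:
    """Run the search (performs a network request)."""
    url = openalex_search_url(query, per_page)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return top_works(json.load(resp))


# Usage (hypothetical file name): python openalex_search.py "CRISPR gene therapy"
if __name__ == "__main__" and len(sys.argv) > 1:
    for work in search_works(sys.argv[1]):
        print(f"{work['title']} ({work['year']}), citations: {work['citations']}")
```

The same URL pattern works for the other entity types (authors, institutions, and so on) by swapping `works` for the entity name, though the response fields differ.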

The Unified CLI

I also built a CLI that searches multiple databases at once:

$ python research_cli.py search "CRISPR gene therapy" --limit 3

 1. CRISPR-Cas9 Gene Editing for Sickle Cell Disease (2023)
    Citations: 847 | Source: OpenAlex

 2. In Vivo CRISPR Gene Therapy (2024)
    Citations: 234 | Source: OpenAlex

research-paper-cli — one command, 7 databases, 800M+ papers.
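The fiddly part of searching several databases at once is that the same paper comes back from more than one of them. The sketch below shows one way to merge and deduplicate results by DOI. It is an illustration, not the code inside the CLI: the record shape (`title`, `doi`, `citations`, `source`) and both function names are assumptions made up for this example.

```python
from typing import Iterable, List, Optional

DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def normalize_doi(doi: Optional[str]) -> Optional[str]:
    """Lowercase a DOI and strip resolver prefixes so variants compare equal."""
    if not doi:
        return None
    doi = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if doi.startswith(prefix):
            doi = doi[len(prefix):]
            break
    return doi or None


def merge_results(*sources: Iterable[dict], limit: int = 10) -> List[dict]:
    """Merge per-database result lists, keeping one record per paper."""
    best = {}
    for results in sources:
        for paper in results:
            # Prefer the DOI as the identity; fall back to the lowercased title.
            key = normalize_doi(paper.get("doi")) or (paper.get("title") or "").strip().lower()
            if not key:
                continue
            current = best.get(key)
            # On a duplicate, keep the record reporting more citations.
            if current is None or paper.get("citations", 0) > current.get("citations", 0):
                best[key] = paper
    ranked = sorted(best.values(), key=lambda p: p.get("citations", 0), reverse=True)
    return ranked[:limit]
```

Title matching is a weak fallback (two different papers can share a title), so a real tool would probably want something stricter for records without a DOI.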

What's Next

I'm thinking about:

  • Adding ClinicalTrials.gov support
  • Building a web UI for non-technical researchers
  • Creating a citation graph visualizer

All 10 toolkits are on my GitHub. Star them if you find them useful.


Which academic API would you add to this suite? And what would you build with access to 800M+ papers?