The Problem: Academic Research Data Is Locked Behind Paywalls
Last month, a data scientist asked me: "How do I get citation data for 1,000 papers without paying $50,000 for a Web of Science subscription?"
My answer: Crossref API — 150M+ metadata records, completely free, no API key required.
What Is Crossref?
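Crossref is the backbone of scholarly publishing. When a publisher registers a DOI through Crossref, the work's metadata is deposited there too (DOIs registered with other agencies, such as DataCite, are not included). That means: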
- 150M+ works (papers, books, conferences)
- Citation links between papers
- Funding data — who paid for the research
- License info — is it open access?
- All free, all via REST API
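To make that list concrete, here is a minimal sketch of where those pieces sit inside a single work record. The field names (is-referenced-by-count, reference, funder, license) are the ones Crossref uses in its JSON; the summarize_work helper and the sample record are made up for illustration and are not real data.

```python
def summarize_work(work):
    """Pull the citation, funding, and license fields out of one Crossref work record."""
    return {
        "doi": work.get("DOI"),
        "cited_by": work.get("is-referenced-by-count", 0),
        # "reference" is only present when the publisher deposited a reference list.
        "references": len(work.get("reference", [])),
        "funders": [f.get("name", "Unknown") for f in work.get("funder", [])],
        "licenses": [lic.get("URL") for lic in work.get("license", [])],
    }


# Hand-written sample record (hypothetical), shaped like one entry of message["items"].
sample = {
    "DOI": "10.0000/example.1",
    "is-referenced-by-count": 12,
    "reference": [{"key": "ref1", "DOI": "10.0000/example.2"}, {"key": "ref2"}],
    "funder": [{"name": "Example Funder", "DOI": "10.13039/000000000"}],
    "license": [{"URL": "https://creativecommons.org/licenses/by/4.0/"}],
}

summary = summarize_work(sample)
print(summary)
```

Every field is read with .get() because Crossref records are uneven: a publisher that never deposited funding or license data simply omits those keys.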
Quick Start: Search Papers in a Few Lines
import requests

# Search across all works; no API key is needed.
results = requests.get("https://api.crossref.org/works", params={
    "query": "machine learning drug discovery",
    "rows": 5,
    "sort": "relevance"
}).json()

for item in results["message"]["items"]:
    title = item["title"][0] if item.get("title") else "No title"
    cited = item.get("is-referenced-by-count", 0)
    # "issued" is the earliest known publication date; online-only works often lack "published-print".
    year = item.get("issued", {}).get("date-parts", [[None]])[0][0] or "N/A"
    print(f"[{cited} citations] ({year}) {title}")
Output:
[1247 citations] (2019) Machine Learning for Drug Discovery
[892 citations] (2020) Deep Learning Approaches for Drug Design
[534 citations] (2021) AI-Driven Drug Discovery: Current Progress
[312 citations] (2018) Neural Networks in Pharmaceutical Research
[198 citations] (2022) Transformer Models for Molecular Property Prediction
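Date fields are the part of a record most likely to trip you up: a work may carry issued, published-print, published-online, or some mix of them, and the year inside date-parts can be null. Below is a small sketch of a fallback lookup. The three field names are Crossref's; the publication_year helper and the sample records are my own illustration.

```python
def publication_year(work):
    """Return the first usable year from a work's date fields, or None."""
    for field in ("issued", "published-print", "published-online"):
        # date-parts looks like [[year, month, day]]; month and day are optional.
        parts = work.get(field, {}).get("date-parts") or [[]]
        first = parts[0]
        if first and first[0] is not None:
            return int(first[0])
    return None


# Hand-written records (hypothetical) covering the common shapes.
full_date = publication_year({"issued": {"date-parts": [[2021, 3, 15]]}})
fallback = publication_year({
    "issued": {"date-parts": [[None]]},
    "published-online": {"date-parts": [[2020]]},
})
missing = publication_year({})
print(full_date, fallback, missing)
```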
Advanced: Build a Citation Network
This is where it gets powerful. Find the most-cited papers in any field:
import requests
from collections import Counter

def get_top_cited(query, sample_size=100):
    """Return the 10 most-cited works among the top matches for a query."""
    # Ask the API to sort by citation count so the first rows are the most cited.
    results = requests.get("https://api.crossref.org/works", params={
        "query": query,
        "rows": sample_size,
        "select": "DOI,title,is-referenced-by-count,author",
        "sort": "is-referenced-by-count",
        "order": "desc"
    }).json()
    papers = []
    for item in results["message"]["items"]:
        authors = item.get("author", [])
        first_author = authors[0].get("family", "Unknown") if authors else "Unknown"
        papers.append({
            "doi": item["DOI"],
            "title": item["title"][0][:80] if item.get("title") else "Untitled",
            "citations": item.get("is-referenced-by-count", 0),
            "author": first_author
        })
    return sorted(papers, key=lambda x: x["citations"], reverse=True)[:10]

# Example: Top cited papers in "transformer architecture"
# Single quotes inside the braces: backslash-escaped quotes there are a syntax error.
for i, p in enumerate(get_top_cited("transformer architecture"), 1):
    print(f"{i}. [{p['citations']:,} citations] {p['author']} — {p['title']}")
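A ranking on its own has no edges, so to get an actual network you need the reference lists. Here is a rough sketch of turning them into (citing, cited) pairs. It assumes the records still include the reference field, which Crossref only returns when the publisher deposited references, and that matched references carry a DOI key. The citation_edges helper and the sample records are hypothetical.

```python
from collections import Counter


def citation_edges(works):
    """Return (citing_doi, cited_doi) pairs from the reference lists of work records."""
    edges = []
    for work in works:
        citing = work.get("DOI", "").lower()
        if not citing:
            continue
        for ref in work.get("reference", []):
            cited = ref.get("DOI")
            # Many references are unstructured text without a matched DOI; skip those.
            if cited:
                # DOIs are case-insensitive, so normalize before comparing.
                edges.append((citing, cited.lower()))
    return edges


# Hand-written sample records (not real DOIs).
sample_works = [
    {"DOI": "10.0000/A", "reference": [{"DOI": "10.0000/C"}, {"unstructured": "Some book, 1999"}]},
    {"DOI": "10.0000/B", "reference": [{"DOI": "10.0000/c"}, {"DOI": "10.0000/A"}]},
    {"DOI": "10.0000/C"},
]

edges = citation_edges(sample_works)
# In-degree within the sample: how often each work is cited by the others.
in_degree = Counter(cited for _, cited in edges)
print(edges)
print(in_degree.most_common(1))
```

The edge list can be fed straight into a graph library if you want centrality measures instead of a plain in-degree count.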
Track Funding Sources
Want to know who funds research in a specific area?
def analyze_funding(query, sample=50):
    # Only fetch works that declare at least one funder.
    results = requests.get("https://api.crossref.org/works", params={
        "query": query, "rows": sample,
        "filter": "has-funder:true"
    }).json()
    funders = Counter()
    for item in results["message"]["items"]:
        for funder in item.get("funder", []):
            funders[funder.get("name", "Unknown")] += 1
    print(f"Top funders in \"{query}\":")
    for name, count in funders.most_common(10):
        print(f"  {count:3d} papers — {name}")

analyze_funding("quantum computing")
Output:
Top funders in "quantum computing":
12 papers — National Science Foundation
8 papers — European Research Council
6 papers — Department of Energy
5 papers — DARPA
4 papers — Google Research
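One caveat with counting by name: funder names are free text, so the same organization can show up under several spellings. Below is a sketch that groups by the funder's DOI when the record has one and falls back to the name otherwise. It assumes funder entries carry a DOI key when Crossref matched them to its funder registry; the count_funders helper and the sample data are hypothetical.

```python
from collections import Counter


def count_funders(items):
    """Count works per funder, keyed by funder DOI when available, else by name."""
    counts = Counter()
    labels = {}  # key -> first name seen, used for display
    for item in items:
        seen = set()  # count each funder at most once per work
        for funder in item.get("funder", []):
            name = funder.get("name", "Unknown")
            key = funder.get("DOI") or name
            labels.setdefault(key, name)
            seen.add(key)
        counts.update(seen)
    return [(labels[key], n) for key, n in counts.most_common()]


# Hand-written sample records (not real funders or DOIs).
sample_items = [
    {"funder": [{"name": "Example Science Foundation", "DOI": "10.13039/000000001"}]},
    {"funder": [{"name": "Example Sci. Foundation", "DOI": "10.13039/000000001"},
                {"name": "Example Sci. Foundation", "DOI": "10.13039/000000001"}]},
    {"funder": [{"name": "Other Funder"}]},
    {},
]

ranking = count_funders(sample_items)
print(ranking)
```

The two spellings collapse into one entry, and the work that lists the same funder twice (for two awards, say) is counted once.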
Monitor New Publications (RSS-like)
from datetime import datetime, timedelta

def get_recent_papers(query, days=7):
    today = datetime.now()
    date_from = (today - timedelta(days=days)).strftime("%Y-%m-%d")
    date_until = today.strftime("%Y-%m-%d")
    results = requests.get("https://api.crossref.org/works", params={
        "query": query,
        # Bound both ends: some records carry publication dates in the future,
        # and they would otherwise sort to the top.
        "filter": f"from-pub-date:{date_from},until-pub-date:{date_until}",
        "sort": "published",
        "order": "desc",
        "rows": 10
    }).json()
    print(f"Papers published in last {days} days for \"{query}\":")
    for item in results["message"]["items"]:
        title = item["title"][0][:70] if item.get("title") else "Untitled"
        doi = item["DOI"]
        print(f"  • {title}")
        print(f"    https://doi.org/{doi}")

get_recent_papers("large language models")
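The filter parameter takes several name:value conditions separated by commas, which gets fiddly to write by hand once you combine dates with other constraints. Here is a small sketch of a builder. The filter names in the example (from-pub-date, until-pub-date, type, has-funder) are Crossref filters; the build_filter helper itself is my own convenience wrapper, not part of any library.

```python
from datetime import date, timedelta


def build_filter(**conditions):
    """Join keyword conditions into Crossref's comma-separated filter syntax."""
    parts = []
    for name, value in conditions.items():
        if value is None:
            continue  # lets callers pass optional conditions unconditionally
        if isinstance(value, bool):
            value = "true" if value else "false"
        elif isinstance(value, date):
            value = value.isoformat()
        # Python keywords use underscores; Crossref filter names use hyphens.
        parts.append(f"{name.replace('_', '-')}:{value}")
    return ",".join(parts)


today = date(2024, 5, 8)  # fixed date so the example is reproducible
flt = build_filter(
    from_pub_date=today - timedelta(days=7),
    until_pub_date=today,
    type="journal-article",
    has_funder=True,
)
print(flt)
```

The resulting string goes into the "filter" entry of the params dict exactly like the hand-written one.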
Polite API Usage
Crossref asks you to include your email in the mailto parameter for the "polite pool" (faster responses):
params = {
    "query": "machine learning",
    "mailto": "your@email.com"  # Gets you into the fast lane
}
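Crossref also accepts the contact address inside the User-Agent header, which is handy when you would rather not add mailto to every params dict. A minimal sketch follows; the tool name and version are placeholders, and polite_headers is a hypothetical helper.

```python
def polite_headers(tool, version, email, url=None):
    """Build a User-Agent header that names the client and a contact address."""
    contact = f"mailto:{email}" if url is None else f"{url}; mailto:{email}"
    return {"User-Agent": f"{tool}/{version} ({contact})"}


# Placeholder tool name and version; use your own.
headers = polite_headers("my-crossref-script", "0.1", "your@email.com")
print(headers)
```

Pass the result along with any call, for example requests.get(url, params=params, headers=headers).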
Rate Limits
| Pool | Rate | How to get |
|---|---|---|
| Public | 50 req/sec | Default |
| Polite | 50+ req/sec, priority | Add mailto parameter |
| Plus | Unlimited | Paid subscription |
For most use cases, the polite pool is more than enough. Treat the numbers above as a snapshot, since Crossref can change them.
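Rather than hard-coding a rate, you can read the limit currently in force from the X-Rate-Limit-Limit and X-Rate-Limit-Interval headers that Crossref sends with each response. The sketch below turns them into a pause between requests. It assumes the interval arrives as a number of seconds followed by "s" (for example "1s"); min_delay_seconds is a hypothetical helper, and it falls back to a cautious default when the headers are missing or look different.

```python
def min_delay_seconds(headers, default=1.0):
    """Return the pause between requests implied by Crossref's rate-limit headers."""
    try:
        limit = int(headers["X-Rate-Limit-Limit"])
        interval = headers["X-Rate-Limit-Interval"]  # assumed format: "1s"
        seconds = float(interval.rstrip("s"))
        return seconds / limit
    except (KeyError, ValueError, ZeroDivisionError):
        # Headers missing or in an unexpected format: use the cautious default.
        return default


with_headers = min_delay_seconds({"X-Rate-Limit-Limit": "50", "X-Rate-Limit-Interval": "1s"})
without_headers = min_delay_seconds({})
print(with_headers, without_headers)
```

With requests you would pass response.headers (which matches header names case-insensitively) and call time.sleep() with the result before the next request.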
Real Use Cases
- Literature reviews — find the 50 most-cited papers in your field in seconds
- Funding analysis — discover which organizations fund specific research areas
- Trend detection — monitor publication rates over time for emerging topics
- Open access tracking — use license metadata to find openly licensed papers
- Author analysis — map co-authorship networks
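For trend detection you do not need to download any records: a request with rows set to 0 still returns total-results, the number of matching works. Here is a sketch of the two pieces involved. The parameter and field names are Crossref's; the helper names are mine, and the responses are hand-written stand-ins rather than real counts.

```python
def yearly_count_params(query, year):
    """Query parameters asking Crossref how many works match in one calendar year."""
    return {
        "query": query,
        "filter": f"from-pub-date:{year}-01-01,until-pub-date:{year}-12-31",
        "rows": 0,  # no items needed, only the total
    }


def total_results(response_json):
    """Read the match count from a decoded /works response."""
    return response_json["message"]["total-results"]


# Hand-written stand-ins for decoded responses (made-up numbers).
fake_responses = {
    2021: {"message": {"total-results": 120, "items": []}},
    2022: {"message": {"total-results": 180, "items": []}},
}

params_2022 = yearly_count_params("large language models", 2022)
trend = {year: total_results(resp) for year, resp in fake_responses.items()}
print(params_2022)
print(trend)
```

In real use, each params dict would go to requests.get("https://api.crossref.org/works", params=...).json(), one request per year.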
Tools I Built
I created a Python toolkit for Crossref analysis: crossref-research-tools on GitHub
Features:
- Citation network builder
- Funding analysis
- Publication trend tracker
- Batch DOI resolver
What research API should I cover next? I have already written about arXiv, OpenAlex, and Crossref. PubMed? Semantic Scholar? CORE? Let me know in the comments!
Need custom research data tools? Check my profile or GitHub for more API tutorials and open-source tools.