Alex Spinov

Posted on Mar 24 • Edited on Mar 26

Crossref API: Search 150M+ Academic Papers for Free (No API Key Needed)

#api #python #research #tutorial

The Problem: Academic Research Data Is Locked Behind Paywalls

Last month, a data scientist asked me: "How do I get citation data for 1000 papers without paying $50,000 for a Web of Science subscription?"

My answer: Crossref API — 150M+ metadata records, completely free, no API key required.

What Is Crossref?

Crossref is the backbone of scholarly publishing. When a publisher registers a DOI with Crossref, it deposits the work's metadata there. That means:

150M+ works (papers, books, conferences)
Citation links between papers
Funding data — who paid for the research
License info — is it open access?
All free, all via REST API
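
To make that list concrete, here is a minimal sketch of pulling those fields out of a single work record, such as one entry from `message["items"]` or the `message` returned by `https://api.crossref.org/works/{doi}`. The helper name `summarize_work` and the sample record are my own illustrations, not part of the API; the field names (`is-referenced-by-count`, `funder`, `license`) are the ones Crossref uses, but publishers do not always deposit them, so every lookup has a fallback.

```python
def summarize_work(item):
    """Pull citation, funding, and license fields out of one Crossref work record."""
    return {
        "doi": item.get("DOI"),
        # Number of works that cite this one, as counted by Crossref
        "cited_by": item.get("is-referenced-by-count", 0),
        # Funder names, present only when funding metadata was deposited
        "funders": [f.get("name", "Unknown") for f in item.get("funder", [])],
        # License URLs; an empty list means no license was deposited
        "licenses": [lic["URL"] for lic in item.get("license", []) if lic.get("URL")],
    }


# Hand-written sample record (illustrative values, not real API output)
sample = {
    "DOI": "10.1234/example",
    "is-referenced-by-count": 7,
    "funder": [{"name": "Example Funder"}],
    "license": [{"URL": "https://creativecommons.org/licenses/by/4.0/"}],
}
print(summarize_work(sample))
```

Because the function only reads a dictionary, it can be tried on saved JSON without touching the network.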

Quick Start: Search Papers in a Few Lines

import requests

results = requests.get("https://api.crossref.org/works", params={
    "query": "machine learning drug discovery",
    "rows": 5,
    "sort": "relevance"
}).json()

for item in results["message"]["items"]:
    title = item["title"][0] if item.get("title") else "No title"
    cited = item.get("is-referenced-by-count", 0)
    # "issued" is set for most works; "published-print" is missing for online-only papers
    year = (item.get("issued", {}).get("date-parts") or [[None]])[0][0] or "N/A"
    print(f"[{cited} citations] ({year}) {title}")

Output:

[1247 citations] (2019) Machine Learning for Drug Discovery
[892 citations] (2020) Deep Learning Approaches for Drug Design
[534 citations] (2021) AI-Driven Drug Discovery: Current Progress
[312 citations] (2018) Neural Networks in Pharmaceutical Research
[198 citations] (2022) Transformer Models for Molecular Property Prediction

Advanced: Build a Citation Network

This is where it gets powerful. Find the most-cited papers in any field:

import requests
from collections import Counter

def get_top_cited(query, sample_size=100):
    """Return the 10 most-cited works that match the query."""
    results = requests.get("https://api.crossref.org/works", params={
        "query": query,
        "rows": sample_size,
        "select": "DOI,title,is-referenced-by-count,author",
        "sort": "is-referenced-by-count",
        "order": "desc"
    }).json()

    papers = []
    for item in results["message"]["items"]:
        authors = item.get("author", [{}])
        first_author = authors[0].get("family", "Unknown") if authors else "Unknown"
        papers.append({
            "doi": item["DOI"],
            "title": item["title"][0][:80] if item.get("title") else "Untitled",
            "citations": item.get("is-referenced-by-count", 0),
            "author": first_author
        })

    return sorted(papers, key=lambda x: x["citations"], reverse=True)[:10]

# Example: Top cited papers in "transformer architecture"
for i, p in enumerate(get_top_cited("transformer architecture"), 1):
    # Single quotes inside the braces: backslash escapes there are a syntax error before Python 3.12
    print(f"{i}. [{p['citations']:,} citations] {p['author']} - {p['title']}")
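
The function above ranks papers but does not yet link them. One possible way to get actual network edges is to read each work's `reference` list, which Crossref includes when the publisher deposited it. The sketch below is an assumption-laden starting point: `citation_edges` is a hypothetical helper, the sample records are made up, and if you restrict fields with `select` you would need to add `reference` to that list.

```python
def citation_edges(items):
    """Return (citing DOI, cited DOI) pairs from works that carry a reference list."""
    edges = []
    for item in items:
        citing = item.get("DOI")
        if not citing:
            continue
        for ref in item.get("reference", []):
            cited = ref.get("DOI")
            if cited:  # many references are free text with no DOI attached
                # DOIs are case-insensitive, so normalize before comparing
                edges.append((citing.lower(), cited.lower()))
    return edges


# Hand-written sample records (illustrative, not real API output)
sample_items = [
    {"DOI": "10.1234/A", "reference": [{"DOI": "10.1234/B"}, {"unstructured": "Some book, 1999"}]},
    {"DOI": "10.1234/B"},
]
print(citation_edges(sample_items))
```

The resulting pairs can be fed into any graph library, or simply counted to see which DOIs are cited most often within the sample.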

Track Funding Sources

Want to know who funds research in a specific area?

import requests
from collections import Counter

def analyze_funding(query, sample=50):
    results = requests.get("https://api.crossref.org/works", params={
        "query": query, "rows": sample,
        "filter": "has-funder:true"
    }).json()

    funders = Counter()
    for item in results["message"]["items"]:
        for funder in item.get("funder", []):
            funders[funder.get("name", "Unknown")] += 1

    print(f"Top funders in \"{query}\":")
    for name, count in funders.most_common(10):
        print(f"  {count:3d} papers — {name}")

analyze_funding("quantum computing")

Output:

Top funders in "quantum computing":
  12 papers — National Science Foundation
   8 papers — European Research Council
   6 papers — Department of Energy
   5 papers — DARPA
   4 papers — Google Research

Monitor New Publications (RSS-like)

import requests
from datetime import datetime, timedelta

def get_recent_papers(query, days=7):
    date_from = (datetime.now() - timedelta(days=days)).strftime("%Y-%m-%d")
    results = requests.get("https://api.crossref.org/works", params={
        "query": query,
        "filter": f"from-pub-date:{date_from}",
        "sort": "published",
        "order": "desc",
        "rows": 10
    }).json()

    print(f"Papers published in last {days} days for \"{query}\":")
    for item in results["message"]["items"]:
        title = item["title"][0][:70] if item.get("title") else "Untitled"
        doi = item["DOI"]
        print(f"  • {title}")
        print(f"    https://doi.org/{doi}")

get_recent_papers("large language models")
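
To make this behave more like a feed, you would want to show only papers you have not seen on a previous run. A minimal sketch, assuming you keep a set of already-seen DOIs between runs (in memory here; a file or database in practice). `filter_unseen` is a hypothetical helper name, and the sample batch is made up.

```python
def filter_unseen(items, seen_dois):
    """Return items whose DOI is not yet in seen_dois, and record them as seen."""
    fresh = []
    for item in items:
        doi = item.get("DOI", "").lower()  # DOIs are case-insensitive
        if doi and doi not in seen_dois:
            seen_dois.add(doi)
            fresh.append(item)
    return fresh


# Hand-written sample batch (illustrative, not real API output)
seen = set()
batch = [{"DOI": "10.1234/a"}, {"DOI": "10.1234/b"}, {"DOI": "10.1234/A"}]
first_run = filter_unseen(batch, seen)
second_run = filter_unseen(batch, seen)
print(len(first_run), len(second_run))
```

Passing `results["message"]["items"]` through this filter before printing keeps repeated runs from listing the same paper twice.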

Polite API Usage

Crossref asks you to include your email in the mailto parameter for the "polite pool" (faster responses):

params = {
    "query": "machine learning",
    "mailto": "your@email.com"  # Gets you into the fast lane
}
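
If you reuse the functions from this post, it is easy to forget the parameter in one of them. One option is a small wrapper that adds it everywhere. This is only a sketch: `with_mailto` is a hypothetical helper, and the address is a placeholder you would replace with your own.

```python
from urllib.parse import urlencode


def with_mailto(params, email):
    """Return a copy of params with the mailto field set; the input dict is left untouched."""
    polite = dict(params)
    polite["mailto"] = email
    return polite


base = {"query": "machine learning", "rows": 5}
polite = with_mailto(base, "you@example.org")  # placeholder address
# Show the query string that would be sent to https://api.crossref.org/works
print(urlencode(polite))
```

The returned dictionary can be passed as `params=` to `requests.get` exactly like the ones above.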

Rate Limits

Pool	Rate	How to get
Public	50 req/sec	Default
Polite	50+ req/sec, priority	Add `mailto` parameter
Plus	Unlimited	Paid subscription

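For most use cases, the polite pool is more than enough. Limits can change, so treat the numbers above as a guide: each response reports the limit that currently applies in its `X-Rate-Limit-Limit` and `X-Rate-Limit-Interval` headers.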
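
Rather than hard-coding a request rate, a script can read the limit from the response headers and pace itself. The sketch below assumes the headers are named `X-Rate-Limit-Limit` and `X-Rate-Limit-Interval` and that the interval looks like `1s`; both function names are hypothetical, and the fallback delay of one second is my own conservative choice.

```python
def parse_rate_limit(headers):
    """Read the X-Rate-Limit-* headers; return (requests, seconds) or None if absent."""
    # HTTP header names are case-insensitive, so normalize them first
    lowered = {k.lower(): v for k, v in headers.items()}
    limit = lowered.get("x-rate-limit-limit")
    interval = lowered.get("x-rate-limit-interval")
    if limit is None or interval is None:
        return None
    # Interval is assumed to be a number of seconds with a trailing "s", e.g. "1s"
    return int(limit), int(interval.rstrip("s"))


def min_delay(headers, default=1.0):
    """Seconds to wait between requests to stay under the advertised limit."""
    parsed = parse_rate_limit(headers)
    if parsed is None:
        return default  # headers missing: fall back to a conservative pause
    limit, seconds = parsed
    return seconds / limit


# Example header values (illustrative only)
print(min_delay({"X-Rate-Limit-Limit": "50", "X-Rate-Limit-Interval": "1s"}))
```

With `requests`, you would pass `response.headers` to `min_delay` and `time.sleep` for the result between calls.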

Real Use Cases

Literature reviews — find the 50 most-cited papers in your field in seconds
Funding analysis — discover which organizations fund specific research areas
Trend detection — monitor publication rates over time for emerging topics
Open access tracking — find freely available versions of papers
Author analysis — map co-authorship networks
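
As one example, trend detection does not need the papers themselves, only the counts. A sketch under a few assumptions: it relies on the `from-pub-date` and `until-pub-date` filters plus `rows=0`, and reads the count from `message["total-results"]`. The helper names are hypothetical, and the sample response is hand-written.

```python
def yearly_count_params(query, year):
    """Build params asking how many matching works were published in a given year."""
    return {
        "query": query,
        # Keep only works with a publication date inside that calendar year
        "filter": f"from-pub-date:{year}-01-01,until-pub-date:{year}-12-31",
        # rows=0 skips the items; only the total count is needed
        "rows": 0,
    }


def extract_total(response_json):
    """Read the number of matching works from a /works response."""
    return response_json["message"]["total-results"]


print(yearly_count_params("large language models", 2023))
# Hand-written response fragment (illustrative, not real API output)
print(extract_total({"message": {"total-results": 42, "items": []}}))
```

Calling `requests.get("https://api.crossref.org/works", params=yearly_count_params(query, year)).json()` for each year in a range and passing the result to `extract_total` gives one count per year, at one request per data point.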

Tools I Built

I created a Python toolkit for Crossref analysis: crossref-research-tools on GitHub

Features:

Citation network builder
Funding analysis
Publication trend tracker
Batch DOI resolver

What research API should I cover next? I have already written about arXiv, OpenAlex, and Crossref. PubMed? Semantic Scholar? CORE? Let me know in the comments!

Need custom research data tools? Check my profile or GitHub for more API tutorials and open-source tools.

Need web scraping or data extraction? I've built 77+ production scrapers. Email spinov001@gmail.com — quote in 2 hours. Or try my ready-made Apify actors — no code needed.
