Alex Spinov

Crossref API: Search 150M+ Academic Papers for Free (No API Key Needed)

The Problem: Academic Research Data Is Locked Behind Paywalls

Last month, a data scientist asked me: "How do I get citation data for 1000 papers without paying $50,000 for a Web of Science subscription?"

My answer: Crossref API — 150M+ metadata records, completely free, no API key required.

What Is Crossref?

Crossref is the backbone of scholarly publishing. Every time a journal assigns a DOI, Crossref stores the metadata. That means:

  • 150M+ works (papers, books, conferences)
  • Citation links between papers
  • Funding data — who paid for the research
  • License info — is it open access?
  • All free, all via REST API
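If you already have a DOI in hand, you don't even need search: the `/works/{doi}` endpoint returns the full metadata record directly. A minimal sketch (the helper names and the placeholder DOI are mine, not part of the API):

```python
import requests

def fetch_work(doi, mailto=None):
    """Fetch the full metadata record for one DOI via /works/{doi}."""
    params = {"mailto": mailto} if mailto else {}
    resp = requests.get(f"https://api.crossref.org/works/{doi}",
                        params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["message"]

def summarize(work):
    """Reduce a Crossref work record to a few common fields."""
    return {
        "title": work["title"][0] if work.get("title") else "Untitled",
        "type": work.get("type", "unknown"),
        "citations": work.get("is-referenced-by-count", 0),
    }

# print(summarize(fetch_work("10.1000/xyz123")))  # swap in a real DOI
```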

Quick Start: Search Papers in 3 Lines

import requests

results = requests.get("https://api.crossref.org/works", params={
    "query": "machine learning drug discovery",
    "rows": 5,
    "sort": "relevance"
}).json()

for item in results["message"]["items"]:
    title = item["title"][0] if item.get("title") else "No title"
    cited = item.get("is-referenced-by-count", 0)
    year = (item.get("published-print") or item.get("published-online") or {}).get("date-parts", [["N/A"]])[0][0]
    print(f"[{cited} citations] ({year}) {title}")

Output:

[1247 citations] (2019) Machine Learning for Drug Discovery
[892 citations] (2020) Deep Learning Approaches for Drug Design
[534 citations] (2021) AI-Driven Drug Discovery: Current Progress
[312 citations] (2018) Neural Networks in Pharmaceutical Research
[198 citations] (2022) Transformer Models for Molecular Property Prediction
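The quick start pulls 5 rows, but `rows`/`offset` paging doesn't scale to thousands of records. Crossref's deep-paging cursor does: start with `cursor=*` and follow `next-cursor` until it runs dry. A sketch of that loop (the function name and limits are my own):

```python
import requests

def iter_works(query, mailto="you@example.com", page_size=100, max_records=500):
    """Walk a large result set with Crossref's deep-paging cursor."""
    cursor = "*"
    fetched = 0
    while fetched < max_records:
        resp = requests.get("https://api.crossref.org/works", params={
            "query": query,
            "rows": min(page_size, max_records - fetched),
            "cursor": cursor,
            "mailto": mailto,
        }, timeout=30)
        resp.raise_for_status()
        msg = resp.json()["message"]
        items = msg.get("items", [])
        if not items:
            break
        for item in items:
            yield item
        fetched += len(items)
        cursor = msg.get("next-cursor")  # token for the next page
        if not cursor:
            break
```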

Advanced: Build a Citation Network

This is where it gets powerful. Find the most-cited papers in any field:

import requests
from collections import Counter

def get_top_cited(query, sample_size=100):
    """Find most-cited papers by sampling references."""
    results = requests.get("https://api.crossref.org/works", params={
        "query": query,
        "rows": sample_size,
        "select": "DOI,title,is-referenced-by-count,author",
        "sort": "is-referenced-by-count",
        "order": "desc"
    }).json()

    papers = []
    for item in results["message"]["items"]:
        authors = item.get("author", [{}])
        first_author = authors[0].get("family", "Unknown") if authors else "Unknown"
        papers.append({
            "doi": item["DOI"],
            "title": item["title"][0][:80] if item.get("title") else "Untitled",
            "citations": item.get("is-referenced-by-count", 0),
            "author": first_author
        })

    return sorted(papers, key=lambda x: x["citations"], reverse=True)[:10]

# Example: Top cited papers in "transformer architecture"
for i, p in enumerate(get_top_cited("transformer architecture"), 1):
    print(f"{i}. [{p[\"citations\"]:,} citations] {p[\"author\"]}{p[\"title\"]}")
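The snippet above finds the hubs, but a network needs edges. Each work record carries a `reference` list with the DOIs the publisher deposited, which gives you outgoing citation links. A sketch (helper names are mine; coverage depends on what publishers actually deposited):

```python
import requests

def get_reference_dois(doi, mailto="you@example.com"):
    """DOIs a paper cites (outgoing edges). Only references the
    publisher deposited with a DOI show up here."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}",
                        params={"mailto": mailto}, timeout=30)
    resp.raise_for_status()
    work = resp.json()["message"]
    return [ref["DOI"] for ref in work.get("reference", []) if "DOI" in ref]

def build_edges(seed_dois, fetch=get_reference_dois):
    """One-hop citation network: (citing DOI, cited DOI) pairs."""
    return [(doi, cited) for doi in seed_dois for cited in fetch(doi)]
```

Feed the edges into `networkx` or similar to compute centrality over the graph.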

Track Funding Sources

Want to know who funds research in a specific area?

import requests
from collections import Counter

def analyze_funding(query, sample=50):
    results = requests.get("https://api.crossref.org/works", params={
        "query": query, "rows": sample,
        "filter": "has-funder:true"
    }).json()

    funders = Counter()
    for item in results["message"]["items"]:
        for funder in item.get("funder", []):
            funders[funder.get("name", "Unknown")] += 1

    print(f"Top funders in \"{query}\":")
    for name, count in funders.most_common(10):
        print(f"  {count:3d} papers — {name}")

analyze_funding("quantum computing")

Output:

Top funders in "quantum computing":
  12 papers — National Science Foundation
   8 papers — European Research Council
   6 papers — Department of Energy
   5 papers — DARPA
   4 papers — Google Research
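One caveat: funder names in work records are free text, so the same agency can show up under several spellings. Crossref's `/funders` endpoint returns canonical funder records you can normalize against. A sketch (the function name is mine):

```python
import requests

def search_funders(name, rows=5, mailto="you@example.com"):
    """Look up canonical funder records (registry ID + name)."""
    resp = requests.get("https://api.crossref.org/funders", params={
        "query": name, "rows": rows, "mailto": mailto,
    }, timeout=30)
    resp.raise_for_status()
    return [(f["id"], f["name"]) for f in resp.json()["message"]["items"]]
```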

Monitor New Publications (RSS-like)

import requests
from datetime import datetime, timedelta

def get_recent_papers(query, days=7):
    date_from = (datetime.now() - timedelta(days=days)).strftime("%Y-%m-%d")
    results = requests.get("https://api.crossref.org/works", params={
        "query": query,
        "filter": f"from-pub-date:{date_from}",
        "sort": "published",
        "order": "desc",
        "rows": 10
    }).json()

    print(f"Papers published in last {days} days for \"{query}\":")
    for item in results["message"]["items"]:
        title = item["title"][0][:70] if item.get("title") else "Untitled"
        doi = item["DOI"]
        print(f"{title}")
        print(f"    https://doi.org/{doi}")

get_recent_papers("large language models")

Polite API Usage

Crossref asks you to include your email in the mailto parameter for the "polite pool" (faster responses):

params = {
    "query": "machine learning",
    "mailto": "your@email.com"  # Gets you into the fast lane
}
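An alternative that's harder to forget: put the contact info in a `requests.Session` User-Agent, so every call is polite automatically. Crossref's etiquette docs also accept the mailto in the User-Agent header; the app name below is a placeholder:

```python
import requests

def make_session(email, app="crossref-demo/0.1"):
    """Session whose User-Agent carries your contact info,
    so every request lands in the polite pool."""
    s = requests.Session()
    s.headers["User-Agent"] = f"{app} (mailto:{email})"
    return s

session = make_session("your@email.com")
# session.get("https://api.crossref.org/works", params={"query": "machine learning"})
```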

Rate Limits

| Pool   | Rate                  | How to get             |
| ------ | --------------------- | ---------------------- |
| Public | 50 req/sec            | Default                |
| Polite | 50+ req/sec, priority | Add `mailto` parameter |
| Plus   | Unlimited             | Paid subscription      |

For most use cases, the polite pool is more than enough.
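Even in the polite pool, a big batch job can still hit the occasional 429. A small retry wrapper with exponential backoff keeps the job alive without hammering the API; a sketch (the status-code set and delays are my choices, not Crossref's spec):

```python
import time
import requests

RETRYABLE = {429, 500, 502, 503, 504}  # throttling + transient server errors

def get_with_retry(url, params=None, retries=3, backoff=1.0):
    """GET with exponential backoff on throttling/server errors."""
    for attempt in range(retries + 1):
        resp = requests.get(url, params=params, timeout=30)
        if resp.status_code not in RETRYABLE or attempt == retries:
            resp.raise_for_status()
            return resp
        time.sleep(backoff * (2 ** attempt))  # 1s, 2s, 4s, ...
```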

Real Use Cases

  1. Literature reviews — find the 50 most-cited papers in your field in seconds
  2. Funding analysis — discover which organizations fund specific research areas
  3. Trend detection — monitor publication rates over time for emerging topics
  4. Open access tracking — find freely available versions of papers
  5. Author analysis — map co-authorship networks

Tools I Built

I created a Python toolkit for Crossref analysis: crossref-research-tools on GitHub

Features:

  • Citation network builder
  • Funding analysis
  • Publication trend tracker
  • Batch DOI resolver

What research API should I cover next? I have already written about arXiv, OpenAlex, and Crossref. PubMed? Semantic Scholar? CORE? Let me know in the comments!


Need custom research data tools? Check my profile or GitHub for more API tutorials and open-source tools.
