Alex Spinov

I Automated My Entire Research Workflow With 10 Free APIs

Two weeks ago, I started a research project that required:

  • Academic papers from multiple databases
  • Patent data
  • Clinical trial information
  • Security checks on all downloaded files

Manually, this would take days. With 10 free APIs, I automated it in an afternoon.

Here's the stack I built.

The Research Pipeline

Query → OpenAlex (papers) → Crossref (metadata) → Unpaywall (free PDFs)
     → PubMed (medical) → ClinicalTrials.gov (trials) → Patents (USPTO)
     → Semantic Scholar (AI summaries) → Export → Analyze

Each step is one Python function. Total code: ~200 lines.

Step 1: Find Papers (OpenAlex)

import requests

def find_papers(topic, limit=20):
    resp = requests.get('https://api.openalex.org/works', params={
        'search': topic, 'per_page': limit,
        'sort': 'cited_by_count:desc'
    })
    return [{
        'title': w['title'],
        'doi': w.get('doi'),
        'citations': w['cited_by_count'],
        'year': w.get('publication_year')
    } for w in resp.json()['results']]

papers = find_papers('CRISPR gene editing therapy')
print(f"Found {len(papers)} papers, top cited: {papers[0]['citations']}")
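One tweak not shown in the snippet above: OpenAlex's docs recommend passing a `mailto` parameter so your requests land in their faster "polite pool". A minimal variant — the email address and helper names here are placeholders, not part of the original 200 lines:

```python
import requests

OPENALEX_WORKS = 'https://api.openalex.org/works'

def build_params(topic, limit=20, email='you@example.com'):
    # 'mailto' identifies you to OpenAlex and routes your
    # requests into the faster "polite pool"
    return {
        'search': topic,
        'per_page': limit,
        'sort': 'cited_by_count:desc',
        'mailto': email,
    }

def find_papers_polite(topic, limit=20, email='you@example.com'):
    resp = requests.get(OPENALEX_WORKS, params=build_params(topic, limit, email))
    resp.raise_for_status()  # fail loudly instead of parsing an error body
    return resp.json()['results']
```

Use your real email — it costs nothing and gets you more consistent rate limits.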

Step 2: Enrich Metadata (Crossref)

def get_metadata(doi):
    if not doi: return {}
    doi_id = doi.replace('https://doi.org/', '')
    resp = requests.get(f'https://api.crossref.org/works/{doi_id}')
    if resp.status_code != 200: return {}
    item = resp.json()['message']
    return {
        'publisher': item.get('publisher'),
        'journal': (item.get('container-title') or [''])[0],  # can be missing or an empty list
        'references': item.get('references-count', 0)
    }

Step 3: Find Free PDFs (Unpaywall)

def find_pdf(doi):
    if not doi: return None
    doi_id = doi.replace('https://doi.org/', '')
    # Unpaywall requires an email parameter -- use your real address
    resp = requests.get(f'https://api.unpaywall.org/v2/{doi_id}',
                        params={'email': 'research@example.com'})
    data = resp.json()
    if data.get('is_oa'):
        # best_oa_location can be null; guard before indexing into it
        loc = data.get('best_oa_location') or {}
        return loc.get('url_for_pdf')
    return None

Step 4: Get AI Summaries (Semantic Scholar)

def get_tldr(title):
    resp = requests.get('https://api.semanticscholar.org/graph/v1/paper/search',
        params={'query': title, 'limit': 1, 'fields': 'tldr'})
    papers = resp.json().get('data', [])
    if papers and papers[0].get('tldr'):
        return papers[0]['tldr']['text']
    return 'No summary available'
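The Semantic Scholar free tier is rate limited, so a tight loop over dozens of titles will eventually hit HTTP 429s. A small generic backoff helper — my own sketch, not part of the original 200 lines — smooths that out:

```python
import time

def with_backoff(fn, should_retry, retries=4, base_delay=1.0, sleep=time.sleep):
    # Call fn(); if should_retry(result) says we were rate limited,
    # wait and try again, doubling the delay each attempt (1s, 2s, 4s, ...).
    delay = base_delay
    for _ in range(retries):
        result = fn()
        if not should_retry(result):
            return result
        sleep(delay)
        delay *= 2
    return fn()  # final attempt, returned even if still rate limited

# Hypothetical usage around the get_tldr() request above:
# resp = with_backoff(lambda: requests.get(url, params=params),
#                     lambda r: r.status_code == 429)
```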

Step 5: Check Related Trials (ClinicalTrials.gov)

def find_trials(topic, limit=5):
    resp = requests.get('https://clinicaltrials.gov/api/v2/studies', params={
        'query.term': topic, 'pageSize': limit, 'format': 'json'
    })
    return [{
        'nct_id': s['protocolSection']['identificationModule']['nctId'],
        'title': s['protocolSection']['identificationModule']['briefTitle'],
        'status': s['protocolSection']['statusModule']['overallStatus']
    } for s in resp.json().get('studies', [])]
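To count only active trials from that output, one safe option is filtering client-side on the returned status rather than guessing server-side filter syntax. Which status values count as "active" is my assumption here — adjust the set to taste:

```python
# Statuses treated as "active" -- an assumption, tune as needed
ACTIVE_STATUSES = {'RECRUITING', 'ACTIVE_NOT_RECRUITING', 'ENROLLING_BY_INVITATION'}

def active_trials(trials):
    # trials: list of dicts shaped like find_trials() output
    return [t for t in trials if t.get('status') in ACTIVE_STATUSES]
```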

Step 6: Check Patents (USPTO)

def find_patents(topic, limit=5):
    resp = requests.post('https://api.patentsview.org/patents/query', json={
        'q': {'_text_any': {'patent_abstract': topic}},
        'f': ['patent_number', 'patent_title', 'patent_date'],
        'o': {'per_page': limit},
        's': [{'patent_date': 'desc'}]
    })
    return resp.json().get('patents', [])

The Full Pipeline

def research(topic):
    print(f"Researching: {topic}\n")

    # Papers
    papers = find_papers(topic, limit=10)
    print(f"📚 {len(papers)} papers found")

    # Enrich top 5 with metadata + PDFs
    for p in papers[:5]:
        meta = get_metadata(p['doi'])
        pdf = find_pdf(p['doi'])
        tldr = get_tldr(p['title'])
        print(f"{p['title'][:60]}")
        print(f"    Citations: {p['citations']} | Journal: {meta.get('journal', 'N/A')}")
        print(f"    PDF: {'yes' if pdf else 'no'} | TLDR: {tldr[:80]}...")

    # Clinical trials
    trials = find_trials(topic)
    print(f"\n🏥 {len(trials)} clinical trials")
    for t in trials:
        print(f"  [{t['status']}] {t['title'][:60]}")

    # Patents
    patents = find_patents(topic)
    print(f"\n📜 {len(patents)} patents")
    for p in patents:
        print(f"  [{p['patent_date']}] {p['patent_title'][:60]}")

research('CRISPR gene editing therapy')

Results

For one query, I got:

  • 10 highly-cited papers with metadata
  • 4 free PDFs (via Unpaywall)
  • AI summaries for all papers
  • 5 active clinical trials
  • 5 related patents

All in under 30 seconds.

All Toolkits (Open Source)

I packaged each step into its own toolkit:

#    Toolkit              What it does
1    OpenAlex             250M+ academic works
2    Crossref             150M+ article metadata records
3    PubMed               36M+ medical papers
4    Semantic Scholar     AI summaries
5    arXiv                2.4M+ preprints
6    CORE                 300M+ open access papers
7    Unpaywall            Find free PDFs
8    ClinicalTrials.gov   500K+ trials
9    USPTO Patents        8M+ patents
10   Security Scanner     5 security APIs
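PubMed shows up in the pipeline diagram and the toolkit list but never got its own step. Here's a sketch using NCBI's public E-utilities esearch endpoint — the helper names are mine, and NCBI's own rate limits apply:

```python
import requests

ESEARCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

def build_pubmed_params(topic, limit=10):
    # retmode=json asks E-utilities for JSON instead of the default XML
    return {'db': 'pubmed', 'term': topic, 'retmax': limit, 'retmode': 'json'}

def find_pubmed_ids(topic, limit=10):
    resp = requests.get(ESEARCH, params=build_pubmed_params(topic, limit))
    resp.raise_for_status()
    # esearch returns PMIDs; pass them to esummary/efetch for details
    return resp.json()['esearchresult']['idlist']
```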

Full collection: awesome-free-research-apis


What would you automate if you had all these APIs in one pipeline? I'm curious about creative use cases.


Need custom data pipelines? Check out my tools on GitHub.
