Alex Spinov

I Automated My Entire Research Workflow With 10 Free APIs

Two weeks ago, I started a research project that required:

  • Academic papers from multiple databases
  • Patent data
  • Clinical trial information
  • Security checks on all downloaded files

Manually, this would take days. With 10 free APIs, I automated it in an afternoon.

Here's the stack I built.

The Research Pipeline

Query → OpenAlex (papers) → Crossref (metadata) → Unpaywall (free PDFs)
     → PubMed (medical) → ClinicalTrials.gov (trials) → Patents (USPTO)
     → Semantic Scholar (AI summaries) → Export → Analyze

Each step is one Python function. Total code: ~200 lines.

Step 1: Find Papers (OpenAlex)

import requests

def find_papers(topic, limit=20):
    resp = requests.get('https://api.openalex.org/works', params={
        'search': topic, 'per_page': limit,
        'sort': 'cited_by_count:desc'
    })
    return [{
        'title': w['title'],
        'doi': w.get('doi'),
        'citations': w['cited_by_count'],
        'year': w.get('publication_year')
    } for w in resp.json()['results']]

papers = find_papers('CRISPR gene editing therapy')
print(f"Found {len(papers)} papers, top cited: {papers[0]['citations']}")
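One tweak not shown in the snippet above: OpenAlex's docs recommend passing a `mailto` parameter so your requests land in their faster "polite pool". A minimal variant — the email address and helper names here are placeholders, not part of the original 200 lines:

```python
import requests

OPENALEX_WORKS = 'https://api.openalex.org/works'

def build_params(topic, limit=20, email='you@example.com'):
    # 'mailto' identifies you to OpenAlex and routes your
    # requests into the faster "polite pool"
    return {
        'search': topic,
        'per_page': limit,
        'sort': 'cited_by_count:desc',
        'mailto': email,
    }

def find_papers_polite(topic, limit=20, email='you@example.com'):
    resp = requests.get(OPENALEX_WORKS, params=build_params(topic, limit, email))
    resp.raise_for_status()  # fail loudly instead of parsing an error body
    return resp.json()['results']
```

Use your real email — it costs nothing and gets you more consistent rate limits.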

Step 2: Enrich Metadata (Crossref)

def get_metadata(doi):
    if not doi: return {}
    doi_id = doi.replace('https://doi.org/', '')
    resp = requests.get(f'https://api.crossref.org/works/{doi_id}')
    if resp.status_code != 200: return {}
    item = resp.json()['message']
    return {
        'publisher': item.get('publisher'),
        'journal': (item.get('container-title') or [''])[0],  # can be missing or an empty list
        'references': item.get('references-count', 0)
    }

Step 3: Find Free PDFs (Unpaywall)

def find_pdf(doi):
    if not doi: return None
    doi_id = doi.replace('https://doi.org/', '')
    # Unpaywall requires an email parameter -- use your real address
    resp = requests.get(f'https://api.unpaywall.org/v2/{doi_id}',
                        params={'email': 'research@example.com'})
    data = resp.json()
    if data.get('is_oa'):
        # best_oa_location can be null; guard before indexing into it
        loc = data.get('best_oa_location') or {}
        return loc.get('url_for_pdf')
    return None

Step 4: Get AI Summaries (Semantic Scholar)

def get_tldr(title):
    resp = requests.get('https://api.semanticscholar.org/graph/v1/paper/search',
        params={'query': title, 'limit': 1, 'fields': 'tldr'})
    papers = resp.json().get('data', [])
    if papers and papers[0].get('tldr'):
        return papers[0]['tldr']['text']
    return 'No summary available'
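The Semantic Scholar free tier is rate limited, so a tight loop over dozens of titles will eventually hit HTTP 429s. A small generic backoff helper — my own sketch, not part of the original 200 lines — smooths that out:

```python
import time

def with_backoff(fn, should_retry, retries=4, base_delay=1.0, sleep=time.sleep):
    # Call fn(); if should_retry(result) says we were rate limited,
    # wait and try again, doubling the delay each attempt (1s, 2s, 4s, ...).
    delay = base_delay
    for _ in range(retries):
        result = fn()
        if not should_retry(result):
            return result
        sleep(delay)
        delay *= 2
    return fn()  # final attempt, returned even if still rate limited

# Hypothetical usage around the get_tldr() request above:
# resp = with_backoff(lambda: requests.get(url, params=params),
#                     lambda r: r.status_code == 429)
```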

Step 5: Check Related Trials (ClinicalTrials.gov)

def find_trials(topic, limit=5):
    resp = requests.get('https://clinicaltrials.gov/api/v2/studies', params={
        'query.term': topic, 'pageSize': limit, 'format': 'json'
    })
    return [{
        'nct_id': s['protocolSection']['identificationModule']['nctId'],
        'title': s['protocolSection']['identificationModule']['briefTitle'],
        'status': s['protocolSection']['statusModule']['overallStatus']
    } for s in resp.json().get('studies', [])]
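To count only active trials from that output, one safe option is filtering client-side on the returned status rather than guessing server-side filter syntax. Which status values count as "active" is my assumption here — adjust the set to taste:

```python
# Statuses treated as "active" -- an assumption, tune as needed
ACTIVE_STATUSES = {'RECRUITING', 'ACTIVE_NOT_RECRUITING', 'ENROLLING_BY_INVITATION'}

def active_trials(trials):
    # trials: list of dicts shaped like find_trials() output
    return [t for t in trials if t.get('status') in ACTIVE_STATUSES]
```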

Step 6: Check Patents (USPTO)

def find_patents(topic, limit=5):
    resp = requests.post('https://api.patentsview.org/patents/query', json={
        'q': {'_text_any': {'patent_abstract': topic}},
        'f': ['patent_number', 'patent_title', 'patent_date'],
        'o': {'per_page': limit},
        's': [{'patent_date': 'desc'}]
    })
    return resp.json().get('patents', [])

The Full Pipeline

def research(topic):
    print(f"Researching: {topic}\n")

    # Papers
    papers = find_papers(topic, limit=10)
    print(f"📚 {len(papers)} papers found")

    # Enrich top 5 with metadata + PDFs
    for p in papers[:5]:
        meta = get_metadata(p['doi'])
        pdf = find_pdf(p['doi'])
        tldr = get_tldr(p['title'])
        print(f"{p['title'][:60]}")
        print(f"    Citations: {p['citations']} | Journal: {meta.get('journal', 'N/A')}")
        print(f"    PDF: {'yes' if pdf else 'no'} | TLDR: {tldr[:80]}...")

    # Clinical trials
    trials = find_trials(topic)
    print(f"\n🏥 {len(trials)} clinical trials")
    for t in trials:
        print(f"  [{t['status']}] {t['title'][:60]}")

    # Patents
    patents = find_patents(topic)
    print(f"\n📜 {len(patents)} patents")
    for p in patents:
        print(f"  [{p['patent_date']}] {p['patent_title'][:60]}")

research('CRISPR gene editing therapy')

Results

For one query, I got:

  • 10 highly-cited papers with metadata
  • 4 free PDFs (via Unpaywall)
  • AI summaries for all papers
  • 5 active clinical trials
  • 5 related patents

All in under 30 seconds.

All Toolkits (Open Source)

I packaged each step into its own toolkit:

#    Toolkit              What it does
1    OpenAlex             250M+ academic works
2    Crossref             150M+ article metadata records
3    PubMed               36M+ medical papers
4    Semantic Scholar     AI summaries
5    arXiv                2.4M+ preprints
6    CORE                 300M+ open access papers
7    Unpaywall            Find free PDFs
8    ClinicalTrials.gov   500K+ trials
9    USPTO Patents        8M+ patents
10   Security Scanner     5 security APIs
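PubMed shows up in the pipeline diagram and the toolkit list but never got its own step. Here's a sketch using NCBI's public E-utilities esearch endpoint — the helper names are mine, and NCBI's own rate limits apply:

```python
import requests

ESEARCH = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi'

def build_pubmed_params(topic, limit=10):
    # retmode=json asks E-utilities for JSON instead of the default XML
    return {'db': 'pubmed', 'term': topic, 'retmax': limit, 'retmode': 'json'}

def find_pubmed_ids(topic, limit=10):
    resp = requests.get(ESEARCH, params=build_pubmed_params(topic, limit))
    resp.raise_for_status()
    # esearch returns PMIDs; pass them to esummary/efetch for details
    return resp.json()['esearchresult']['idlist']
```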

Full collection: awesome-free-research-apis


What would you automate if you had all these APIs in one pipeline? I'm curious about creative use cases.


Need custom data pipelines? Check out my tools on GitHub.
