Scraping Google Scholar for Literature Reviews (Python + Apify, 2026)

Every grad student hits the wall around week three of a lit review: you have 400 open tabs, 60% of which are duplicates, and Zotero crashed again. The manual workflow — search, skim, cite, repeat — was already broken before LLMs 10x'd paper output. In 2026, with ~5 million new papers indexed by Scholar annually, anything you are not automating is actively costing you sanity.

To put some real numbers on the problem: a 2025 STM report estimated 3.3 million peer-reviewed articles were published that year across 47,000+ journals, a 7.2% year-over-year increase. ArXiv alone received 23,400 new submissions in January 2026, up from about 15,000 monthly in 2022. The National Institutes of Health's PubMed Central added 1.7 million records in 2025. Even in a narrow subfield — say "retrieval augmented generation for clinical NLP" — a naive Scholar query returns 800+ results now where five years ago it would have returned 40. The bottleneck in research has quietly shifted from discovery to filtering. The researcher's mental model can no longer be "I read every relevant paper"; it has to be "I have a system that surfaces the 20 papers most worth my attention this week." That system is what this post builds.

This post walks through an Apify-based literature-review pipeline: automatic paper collection from Google Scholar and ArXiv, citation graph construction, deduplication across sources, and PRISMA-style export for systematic reviews. Code is included throughout, plus a minimal LLM-based triage step for cases where you want to narrow a 1,000-paper harvest to the 50 worth reading.

Why this is hard

Google Scholar is notorious among scrapers for good reason:

  1. Aggressive bot detection. Scholar's anti-scraping measures rival Cloudflare's enterprise tier. A direct Python requests loop gets blocked after 5-15 queries.
  2. No public API. Unlike PubMed, Scholar has no official API. Semantic Scholar is a partial alternative but lacks some long-tail papers.
  3. Ambiguous deduplication. The same paper shows up three ways: a preprint on ArXiv, a conference version, a journal version. Collapsing them correctly is non-trivial.
  4. Citation counts drift. Counts update daily, so pipelines need to be idempotent and store snapshots.

On top of all this, research integrity standards (PRISMA 2020, for example) require reproducible search queries and documented exclusion counts. Your scraper has to log its query history; a minimal logging sketch follows the list below.

  5. Preprint vs. published attribution. The same paper may be indexed three times (ArXiv preprint, conference proceedings, journal publication) with different DOIs and different citation counts. Scholar sometimes collapses these with a "[PDF] from..." link and sometimes doesn't.
  6. Cross-language coverage. Scholar does index non-English papers, but its coverage of Chinese, Japanese, and Portuguese literature is uneven. For systematic reviews in medicine or engineering, excluding those corpora is a real bias source.
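
On the reproducibility requirement, here is a minimal sketch of an append-only query log. The file name and record shape are arbitrary choices for illustration, not a standard:

import datetime
import json

def log_query(run_input, run_id, logfile="query_log.jsonl"):
    # PRISMA 2020 asks for the exact queries, filters, and dates used,
    # so every actor run appends a timestamped record
    with open(logfile, "a") as f:
        f.write(json.dumps({
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "run_input": run_input,
            "apify_run_id": run_id,
        }) + "\n")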

The architecture

[Query strings + filters]
          |
          v
 [google-scholar-scraper] --> papers with citations, DOIs, authors
          |
          v
 [arxiv-scraper]          --> preprint full text + metadata
          |
          v
 [dedup + embedding cluster]
          |
          v
 [Citation graph (Neo4j or simple DuckDB)]
          |
          v
 [PRISMA flowchart + CSV export]

Step 1: Search Scholar

The google-scholar-scraper handles Scholar's bot detection and returns structured papers with citation counts and author h-index data.

import os

from apify_client import ApifyClient

# read the Apify API token from the environment instead of hard-coding it
client = ApifyClient(os.environ["APIFY_TOKEN"])

run = client.actor("nexgendata/google-scholar-scraper").call(run_input={
    "queries": [
        "large language model hallucination detection",
        "retrieval augmented generation evaluation"
    ],
    "year_from": 2023,
    "year_to": 2026,
    "max_results_per_query": 500,
    "include_patents": False,
    "include_citations": True,
})

papers = list(client.dataset(run["defaultDatasetId"]).iterate_items())

Each paper looks like this:

{
  "title": "Evaluating Hallucinations in LLMs: A Meta-Analysis",
  "authors": ["Smith, J.", "Chen, L."],
  "venue": "NeurIPS 2024",
  "year": 2024,
  "citation_count": 142,
  "doi": "10.48550/arXiv.2404.12345",
  "abstract": "...",
  "pdf_url": "https://arxiv.org/pdf/2404.12345",
  "scholar_id": "xYz123"
}

Step 2: Pull ArXiv full text

Many Scholar results are behind paywalls. If a paper has a preprint on ArXiv, use the arxiv-scraper to fetch the full-text PDF link and structured abstract.

import re

def extract_arxiv_id(doi):
    # ArXiv-assigned DOIs look like 10.48550/arXiv.2404.12345; pull out the bare ID
    m = re.search(r"arxiv\.(\d{4}\.\d{4,5}(?:v\d+)?)", doi, re.IGNORECASE)
    return m.group(1) if m else None

arxiv_ids = [extract_arxiv_id(p["doi"]) for p in papers if "arxiv" in p.get("doi", "").lower()]

run = client.actor("nexgendata/arxiv-scraper").call(run_input={
    "arxiv_ids": arxiv_ids,
    "include_pdf_url": True,
    "include_latex_source": False,
})
arxiv_papers = list(client.dataset(run["defaultDatasetId"]).iterate_items())

Step 3: Deduplicate across sources

Scholar and ArXiv overlap heavily. Dedup on DOI first, fall back to title similarity:

from rapidfuzz import fuzz

def is_dupe(a, b):
    # exact DOI match first, then fuzzy title match as a fallback
    if a.get("doi") and a["doi"] == b.get("doi"):
        return True
    return fuzz.token_set_ratio(a["title"], b["title"]) > 92

unique = []
for p in papers + arxiv_papers:
    # O(n^2) pairwise check; fine for a few thousand papers
    if not any(is_dupe(p, u) for u in unique):
        unique.append(p)

print(f"{len(papers) + len(arxiv_papers)} raw -> {len(unique)} unique")
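
The architecture diagram also mentions embedding clustering. Fuzzy title matching misses reworded or translated near-duplicates; below is a minimal sketch using sentence-transformers (the model choice is an assumption, and any sentence-embedding model works):

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
texts = [p["title"] + " " + p.get("abstract", "")[:500] for p in unique]
emb = model.encode(texts, normalize_embeddings=True)

# with normalized vectors, cosine similarity is a plain dot product;
# pairs above 0.9 are near-duplicate candidates worth a manual look
sims = emb @ emb.T
candidates = [
    (unique[i]["title"], unique[j]["title"])
    for i in range(len(unique))
    for j in range(i + 1, len(unique))
    if sims[i][j] > 0.9
]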

Step 4: Build a citation graph

For snowball sampling — finding papers that cite (or are cited by) a seed set — run forward and backward from high-citation anchor papers. The scraper's include_citations flag returns Scholar's "Cited by" links.
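
Here is a minimal forward-snowball sketch. The cited_by_url field and start_urls parameter are assumptions about the actor's schema; check its documentation before relying on them:

# forward snowball: take the 10 most-cited papers as seeds and harvest
# the papers that cite them
seeds = sorted(unique, key=lambda p: p.get("citation_count", 0), reverse=True)[:10]
cited_by_urls = [p["cited_by_url"] for p in seeds if p.get("cited_by_url")]

run = client.actor("nexgendata/google-scholar-scraper").call(run_input={
    "start_urls": cited_by_urls,  # assumed input name; verify against the actor
    "max_results_per_query": 200,
    "include_citations": True,
})
snowball_papers = list(client.dataset(run["defaultDatasetId"]).iterate_items())
# run snowball_papers back through the Step 3 dedup before loading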

Load into DuckDB:

import duckdb
con = duckdb.connect("lit_review.duckdb")
con.execute("CREATE TABLE papers (scholar_id VARCHAR, title VARCHAR, year INT, citations INT)")
con.executemany(
    "INSERT INTO papers VALUES (?, ?, ?, ?)",
    [(p["scholar_id"], p["title"], p["year"], p["citation_count"]) for p in unique]
)

Then ask questions:

-- Most-cited papers in the corpus
SELECT title, year, citations FROM papers ORDER BY citations DESC LIMIT 20;

-- Papers from 2025 that already have >50 citations (fast movers)
SELECT title, authors FROM papers WHERE year = 2025 AND citations > 50;

Step 5: Export a PRISMA-compliant CSV

Systematic reviews require documented inclusion/exclusion counts. Export:

import csv

# newline="" avoids blank rows on Windows; utf-8 handles non-ASCII author names
with open("prisma.csv", "w", newline="", encoding="utf-8") as f:
    w = csv.writer(f)
    w.writerow(["id", "title", "year", "authors", "venue", "doi", "included", "exclusion_reason"])
    for p in unique:
        w.writerow([p["scholar_id"], p["title"], p["year"], "; ".join(p["authors"]),
                    p.get("venue", ""), p.get("doi", ""), "", ""])

Now fill in inclusion/exclusion manually (or via an LLM screening pass) and you have a PRISMA-ready audit trail.

Step 6: LLM triage (optional but transformative)

If you have harvested 800 papers and realistically only need 50, an LLM screening pass can save a week of skimming. The pattern: feed each paper's title+abstract to a small model with strict inclusion criteria, get a yes/no/maybe, and have a human review only the maybes. This is the same pattern used in Cochrane-style reviews.

import os, json
from openai import OpenAI
oai = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

CRITERIA = """
Include a paper if ALL are true:
- Published 2023 or later
- Involves retrieval augmented generation (RAG) as primary method
- Reports quantitative evaluation on a named benchmark
- Is a peer-reviewed paper or widely-cited preprint (>20 citations)
Exclude if it is a survey, workshop abstract, or tutorial.
"""

def screen(paper):
    prompt = f"""{CRITERIA}

Title: {paper['title']}
Abstract: {paper.get('abstract','')[:1500]}

Respond with JSON: {{"decision": "include" | "exclude" | "maybe", "reason": "..."}}"""
    resp = oai.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)

screened = []
for p in unique:
    try:
        s = screen(p)
        screened.append({**p, **s})
    except Exception as e:
        screened.append({**p, "decision": "maybe", "reason": f"screening error: {e}"})

included = [p for p in screened if p["decision"] == "include"]
maybes = [p for p in screened if p["decision"] == "maybe"]
print(f"Included: {len(included)}, Needs human: {len(maybes)}, Excluded: {len(screened)-len(included)-len(maybes)}")

A 2025 study in the Journal of Medical Internet Research showed LLM-assisted screening with human review of maybes achieved 96% concordance with a fully manual Cochrane review while reducing time-to-screen by 73%. The key is the "maybe" category — you do not trust the model to exclude confidently, only to pre-filter and flag uncertainty.
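
Those screening decisions also yield the numbers a PRISMA 2020 flow diagram needs, computed straight from the pipeline's own variables:

# PRISMA 2020 flow counts, derived from variables defined in earlier steps
records_identified = len(papers) + len(arxiv_papers)
duplicates_removed = records_identified - len(unique)
records_screened = len(screened)
excluded = sum(1 for p in screened if p["decision"] == "exclude")
needs_human = sum(1 for p in screened if p["decision"] == "maybe")
included_auto = sum(1 for p in screened if p["decision"] == "include")

print(f"Identified: {records_identified}")
print(f"After deduplication: {len(unique)} ({duplicates_removed} duplicates removed)")
print(f"Screened: {records_screened}")
print(f"LLM-excluded: {excluded}, human review: {needs_human}, included: {included_auto}")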

Use cases

1. PhD lit review. A computational biology student runs Scholar + ArXiv nightly for 30 queries. Wakes up to 200 new papers/week already deduped and ranked by citation velocity.

2. Patent landscape analysis. A biotech analyst uses the include_patents=True flag to pull both papers and patents for CRISPR+gene therapy, surfacing IP overlaps.

3. Grant writing. A PI needs "key references" for an NIH R01. The pipeline pulls the top 20 papers per aim, sorted by recency, in about 10 minutes.

4. Research trend forecasting. A VC analyst counts yearly papers per topic to identify emerging subfields 12-18 months before they trend.

5. Competitor-lab monitoring. A pharma R&D team tracks publication output from 12 competing research groups. Weekly Scholar queries by author surface new papers, and an LLM summarization pass compresses them into a one-page Monday briefing. The team reports catching a competitor's methodology pivot roughly 4 months before it appeared in any industry press.

6. Teaching updated syllabi. A university professor auto-refreshes reading lists for their graduate seminar each semester. The pipeline pulls the top 30 most-cited papers matching course topics from the past 18 months. Students get a current reading list instead of one last updated in 2019.

Pricing comparison

Service              | Cost for 10k papers | Full text?  | Citation graph?
---------------------|---------------------|-------------|----------------
Scopus API           | Institutional only  | Partial     | Yes
Web of Science       | $300-1000/user/yr   | Partial     | Yes
Semantic Scholar API | Free (rate-limited) | Abstract    | Yes
scholarly (Python)   | Free (blocked fast) | No          | Limited
Apify actors         | ~$15                | Yes (ArXiv) | Yes

For anyone without an academic institutional login, Apify is the most economical path to structured Scholar data.

Common pitfalls

These are the failure modes that turn a promising pipeline into a week of debugging:

  • Rate-limit yourself. Even with the actor's proxy rotation, set max_results_per_query sensibly. 500 is fine; 5000 will be slow and approaches Scholar's deep-pagination ceiling, where results past page 100 become unreliable.
  • Citation counts are snapshots. Re-run weekly if you care about trajectory. A paper with 5 citations in January might have 200 by June if it catches on — that velocity is often the signal you actually want, not the absolute count. A snapshot-table sketch follows this list.
  • Author disambiguation. Scholar merges authors with the same name. "J. Smith" could be 8 different researchers across Scholar's index. Join on author+affiliation+ORCID when possible. ORCID is the gold standard for disambiguation and is present in about 45% of post-2019 papers.
  • Scholar's "include citations" toggle. The checkbox that includes citations and patents in search results can wildly inflate counts. Decide early whether you want scholarly articles only or everything.
  • Preprint vs. final version citation counts. ArXiv preprint citations and journal-version citations are often tracked separately. A paper might show 50 citations as a preprint and 200 as the NeurIPS version. Semantic Scholar merges these; Google Scholar sometimes does not.
  • Retracted papers. Scholar does not always flag retractions. Cross-reference with Retraction Watch's database if you are doing medical or biomedical reviews. A retracted paper cited as supporting evidence is worse than not finding the paper at all.
  • Year drift in metadata. A paper might have year: 2024 in Scholar and year: 2025 in CrossRef because one uses "submitted" and the other uses "published." When you dedup across sources, trust the DOI-resolved year from CrossRef over Scholar's heuristic year.
  • PDF URL freshness. Scholar caches PDF links. Many point to author homepages that 404 a year later. If you need archival access, fetch the PDF at scrape-time and store it in your own bucket rather than relying on the URL later.
  • Supporting vs. contradicting citations. A citation can endorse a result or dispute it. For meta-analysis, consider using scite.ai's signal layer on top of your Scholar data to weight supporting vs. contradicting citations.
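
On the snapshot point above, here is a minimal sketch that appends a dated record per run so citation velocity can be computed later. It reuses the DuckDB file from Step 4:

import datetime
import duckdb

con = duckdb.connect("lit_review.duckdb")
con.execute("""CREATE TABLE IF NOT EXISTS citation_snapshots
               (scholar_id VARCHAR, snapshot_date DATE, citations INT)""")
con.executemany(
    "INSERT INTO citation_snapshots VALUES (?, ?, ?)",
    [(p["scholar_id"], datetime.date.today(), p["citation_count"]) for p in unique]
)

# after a few weekly runs: citations gained per paper over the observed window
fast_movers = con.execute("""
    SELECT scholar_id, max(citations) - min(citations) AS gained
    FROM citation_snapshots
    GROUP BY scholar_id
    ORDER BY gained DESC
    LIMIT 20
""").fetchall()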

How NexGenData handles this

Scraping Scholar reliably is a full-time job for someone. We built the google-scholar-scraper actor to absorb that work:

Anti-bot hardening. We rotate residential proxies by geography, use realistic browser fingerprints, and handle Scholar's rotating captcha challenges automatically. You will get substantially more queries through without blocks than a raw Python script would.

Structured author + citation output. The actor returns authors as separate objects with h-index where available, not as a single flat string. Citation counts, cited-by links, and related-papers links are all preserved.

DOI extraction and normalization. Scholar embeds DOIs inconsistently — sometimes as links, sometimes in snippets, sometimes not at all. The actor extracts and normalizes them so dedup-by-DOI works cleanly.

Snapshotting support. Every run can be tagged with a query fingerprint, so reproducibility for PRISMA audits is built-in. You can replay any past run's exact input.

Native pairing with the ArXiv actor. The two actors share DOI conventions, so merging their outputs is a straightforward DuckDB join. No impedance mismatch.

Pay-per-result. 10,000 papers costs roughly $15. Compare that to Scopus institutional access at $20k+/year, and it is the clear choice for independent researchers and teams without a university affiliation.

Conclusion

Literature review in 2026 is an information-retrieval problem, not a reading problem. Automate the collection, deduplication, and ranking steps, then spend your actual brain cycles on the hard part: reading the 30 papers that matter.

Start with the two actors used above: nexgendata/google-scholar-scraper and nexgendata/arxiv-scraper.

FAQ

Is scraping Google Scholar allowed?
Scholar's ToS prohibit automated access, and Google does actively rate-limit. That said, academic use of search data has been treated leniently in practice, and the aggregation of public bibliometric data is broadly considered fair use. If you publish research based on scraped data, cite Scholar as your source. For true systematic reviews with publication requirements, supplement Scholar with official databases (PubMed, Scopus) that have APIs and documented ToS.

How does this compare to Semantic Scholar's API?
Semantic Scholar has a clean API and free tier (100 req/5min). Its coverage is strong for CS, medicine, and physics but weaker for humanities and some regional journals. For thorough systematic reviews, use both: Scholar for coverage breadth, Semantic Scholar for citation graphs and clean metadata.
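
For that cross-check, Semantic Scholar's Graph API is a plain HTTP call; a minimal sketch:

import requests

resp = requests.get(
    "https://api.semanticscholar.org/graph/v1/paper/search",
    params={
        "query": "retrieval augmented generation evaluation",
        "fields": "title,year,citationCount,externalIds",
        "limit": 100,
    },
    timeout=30,
)
s2_papers = resp.json().get("data", [])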

What about OpenAlex?
OpenAlex is an open bibliometric database covering 240M+ works with a free, unlimited API. It is excellent for citation graphs and author profiles, often better than Scholar for those specific tasks. Scholar still wins for "find me the most-read recent paper on X," because its ranking is informed by actual reader behavior.

Can I get full-text PDFs reliably?
For ArXiv papers, yes. For anything else, it depends on open-access status. The pipeline can fetch PDFs where URLs are public, but paywalled papers need either your institutional login, an interlibrary loan, or (for some reviews) contacting authors directly for a copy.

How do I cite papers I find this way in my own writing?
Export the deduped set to BibTeX via pandoc-citeproc or the bibtexparser Python library. The doi field in the output is what most reference managers key on.
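
A minimal BibTeX writer needs nothing beyond string formatting; here is a sketch over the deduped set (the cite-key scheme is an arbitrary choice for illustration):

def to_bibtex(p):
    # cite key like Smith2024: first author's surname plus year
    key = p["authors"][0].split(",")[0].replace(" ", "") + str(p["year"])
    return (
        f"@article{{{key},\n"
        f"  title  = {{{p['title']}}},\n"
        f"  author = {{{' and '.join(p['authors'])}}},\n"
        f"  year   = {{{p['year']}}},\n"
        f"  doi    = {{{p.get('doi', '')}}}\n"
        f"}}"
    )

with open("review.bib", "w", encoding="utf-8") as f:
    f.write("\n\n".join(to_bibtex(p) for p in unique))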

What about non-English literature?
The actor passes through whatever Scholar returns, which includes non-English papers if your query matches them. For comprehensive non-English coverage, pair with regional databases (CNKI for Chinese, CiNii for Japanese, SciELO for Latin American journals).

How do I handle the "forward snowball" (papers that cite my seed papers)?
Use the include_citations=True flag and the "cited by" URL in the Scholar response. Run a second pass with those URLs as seeds (Step 4 includes a sketch of exactly this). Two iterations of forward snowballing typically saturate a well-defined subfield.

