Google Scholar is a goldmine for researchers, analysts, and anyone tracking academic publications. There's just one problem — there's no official API, and Google actively blocks scraping attempts.
I spent weeks building a production-grade Google Scholar scraper that handles all of this. Here's how it works and how you can use it.
## The Problem
If you've ever tried scraping Google Scholar, you know the pain:
- CAPTCHAs after a handful of requests
- IP bans that last hours or days
- Rate limiting that makes bulk collection impossible
- Dynamic rendering that breaks simple HTTP scrapers
Google doesn't want you programmatically accessing Scholar data. But researchers, competitive intelligence teams, and data scientists need this data.
## The Solution: A Production Scraper with Anti-Bot Handling
I built the Google Scholar Scraper on Apify's platform. It uses headless browsers with fingerprint rotation, automatic proxy management, and retry logic to reliably extract Scholar data at scale.
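The retry logic is the simplest of those three pieces to illustrate. Here is a minimal sketch of the retry-with-exponential-backoff pattern — illustrative only, not the Actor's actual implementation, and `withRetries` is a hypothetical helper name:

```javascript
// Sketch: retry a flaky operation with exponential backoff plus jitter.
// In a real scraper, each retry would also rotate to a fresh
// session/proxy so a blocked IP isn't reused.
async function withRetries(fn, { maxRetries = 3, baseDelayMs = 1000 } = {}) {
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn(attempt);
    } catch (err) {
      lastError = err;
      // Back off exponentially (1s, 2s, 4s, ...) with random jitter
      // so retries don't hammer the server in lockstep.
      const delay = baseDelayMs * 2 ** attempt + Math.random() * baseDelayMs;
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

The jitter matters more than it looks: retrying at fixed intervals is itself a bot fingerprint.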
## What It Extracts
For every search query, you get structured data:
- Paper titles and direct URLs
- Full author lists with profile links
- Citation counts
- Publication year
- Journal/conference names
- Snippets and abstracts
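Getting authors, year, and journal as separate fields takes some parsing, because Scholar renders them as a single byline string (e.g. `A Vaswani, N Shazeer - Advances in Neural Information Processing Systems, 2017 - nips.cc`). A best-effort sketch of splitting that line — assuming the ` - `-separated format, which can vary, and with `parseByline` as a hypothetical helper:

```javascript
// Sketch: split Scholar's "authors - venue, year - source" byline into
// structured fields. Treat this as a heuristic, not a guaranteed parser.
function parseByline(byline) {
  const [authorsPart, venuePart = ''] = byline.split(' - ');
  const yearMatch = venuePart.match(/\b(19|20)\d{2}\b/);
  return {
    authors: authorsPart.split(',').map((a) => a.trim()),
    // Strip the trailing ", 2017" (and anything after it) from the venue.
    journal: venuePart.replace(/,?\s*\b(19|20)\d{2}\b.*$/, '').trim() || null,
    year: yearMatch ? Number(yearMatch[0]) : null,
  };
}
```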
## Quick Start

### Using the Apify API (JavaScript)
```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({
    token: 'YOUR_API_TOKEN',
});

const input = {
    queries: ["machine learning healthcare", "transformer architecture NLP"],
    maxResults: 50,
    includeAuthors: true,
    includeCitations: true,
};

const run = await client.actor("george.the.developer/google-scholar-scraper").call(input);
const { items } = await client.dataset(run.defaultDatasetId).listItems();

console.log(`Found ${items.length} papers`);
items.forEach(paper => {
    console.log(`${paper.title} (${paper.year}) - ${paper.citations} citations`);
});
```
### Using the Apify SDK (Inside an Actor)

```javascript
import { Actor } from 'apify';
import { PuppeteerCrawler } from 'crawlee';

await Actor.init();

// Guard against a missing input so destructuring doesn't throw.
const input = (await Actor.getInput()) ?? {};
const { queries = [], maxResults = 100 } = input;

const crawler = new PuppeteerCrawler({
    async requestHandler({ page, request }) {
        // Wait for the results container before reading the DOM.
        await page.waitForSelector('#gs_res_ccl_mid');

        // Extract one record per result row inside the browser context.
        const papers = await page.evaluate(() => {
            const results = [];
            document.querySelectorAll('.gs_r.gs_or.gs_scl').forEach(el => {
                results.push({
                    title: el.querySelector('.gs_rt a')?.textContent?.trim(),
                    url: el.querySelector('.gs_rt a')?.href,
                    authors: el.querySelector('.gs_a')?.textContent?.trim(),
                    snippet: el.querySelector('.gs_rs')?.textContent?.trim(),
                    citations: parseInt(
                        el.querySelector('a[href*="cites"]')?.textContent?.match(/\d+/)?.[0] || '0',
                        10
                    ),
                });
            });
            return results;
        });

        // Respect the maxResults cap when pushing to the dataset.
        for (const paper of papers.slice(0, maxResults)) {
            await Actor.pushData(paper);
        }
    },
});

await crawler.run(queries.map(q => `https://scholar.google.com/scholar?q=${encodeURIComponent(q)}`));
await Actor.exit();
```
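Scholar also accepts `as_ylo`/`as_yhi` query parameters to restrict results by publication year, so the URL-building step above can be extended. A small sketch — `scholarUrl` is an illustrative helper, not part of the Actor:

```javascript
// Sketch: build a Scholar search URL, optionally bounded by year
// via Scholar's as_ylo (from) and as_yhi (to) parameters.
function scholarUrl(query, { yearFrom, yearTo } = {}) {
  const params = new URLSearchParams({ q: query });
  if (yearFrom) params.set('as_ylo', String(yearFrom));
  if (yearTo) params.set('as_yhi', String(yearTo));
  return `https://scholar.google.com/scholar?${params.toString()}`;
}
```

`URLSearchParams` handles the query encoding, so multi-word queries and special characters come out safe.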
## Sample Output
Here's what the structured JSON output looks like:
```json
[
  {
    "title": "Attention Is All You Need",
    "url": "https://arxiv.org/abs/1706.03762",
    "authors": "A Vaswani, N Shazeer, N Parmar, J Uszkoreit...",
    "year": 2017,
    "citations": 124500,
    "journal": "Advances in Neural Information Processing Systems",
    "snippet": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks..."
  },
  {
    "title": "BERT: Pre-training of Deep Bidirectional Transformers",
    "url": "https://arxiv.org/abs/1810.04805",
    "authors": "J Devlin, MW Chang, K Lee, K Toutanova",
    "year": 2018,
    "citations": 89200,
    "journal": "NAACL-HLT",
    "snippet": "We introduce a new language representation model called BERT..."
  },
  {
    "title": "Deep Residual Learning for Image Recognition",
    "url": "https://arxiv.org/abs/1512.03385",
    "authors": "K He, X Zhang, S Ren, J Sun",
    "year": 2016,
    "citations": 198000,
    "journal": "IEEE CVPR",
    "snippet": "Deeper neural networks are more difficult to train..."
  }
]
```
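Because the output is plain JSON, post-processing is a few lines of code. For example, a hypothetical `topCited` helper that keeps only heavily cited papers, most-cited first:

```javascript
// Sketch: filter scraped papers by citation count and rank them.
function topCited(papers, minCitations = 1000) {
  return papers
    .filter((p) => p.citations >= minCitations)
    .sort((a, b) => b.citations - a.citations);
}
```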
## Use Cases
**Academic Research** — Track citations for your papers, monitor what's being published in your field, build literature review datasets automatically.

**Competitive Analysis** — See what research your competitors are publishing, which papers are gaining traction, and identify emerging trends before they hit mainstream.

**Literature Reviews** — Instead of manually searching and copying paper details, collect thousands of papers matching your criteria in minutes. Export to CSV and filter in your favorite tool.

**Grant Writing** — Quickly find and cite the most relevant and highly-cited papers in your field to strengthen your proposals.

**Trend Analysis** — Track publication volume over time for specific topics. Spot when a research area is heating up or cooling down.
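That CSV export step can be sketched in a few lines — a minimal RFC 4180-style serializer, with `toCsv` as an illustrative helper rather than a built-in export:

```javascript
// Sketch: flatten scraped papers into CSV for spreadsheet tools.
// Every field is quoted, with embedded quotes doubled per RFC 4180.
function toCsv(papers) {
  const cols = ['title', 'authors', 'year', 'citations', 'url'];
  const escape = (v) => `"${String(v ?? '').replace(/"/g, '""')}"`;
  const header = cols.join(',');
  const rows = papers.map((p) => cols.map((c) => escape(p[c])).join(','));
  return [header, ...rows].join('\n');
}
```

(Apify datasets can also be downloaded as CSV directly from the platform, so this is only needed if you're post-processing the JSON yourself.)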
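That volume tracking is just a small aggregation over the scraped records. A sketch, using a hypothetical `countByYear` helper:

```javascript
// Sketch: publication counts per year -- the raw series behind a
// "is this topic heating up?" chart.
function countByYear(papers) {
  const counts = {};
  for (const p of papers) {
    if (p.year != null) counts[p.year] = (counts[p.year] ?? 0) + 1;
  }
  return counts;
}
```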
## Pricing
The scraper runs at $0.004 per paper — that's 250 papers for a dollar. No subscriptions, no minimum spend. You pay only for what you use through Apify's pay-per-event pricing.
For most research projects, you're looking at a few dollars total.
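Back-of-the-envelope, assuming the stated $0.004-per-paper rate (`estimateCostUsd` is just an illustrative helper):

```javascript
// Sketch: estimate run cost at a flat per-paper rate.
function estimateCostUsd(paperCount, pricePerPaper = 0.004) {
  return paperCount * pricePerPaper;
}
```

So a 10,000-paper literature review lands around $40, and a typical few-hundred-paper project stays in single-digit dollars.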
## Why Not Just Use Semantic Scholar or OpenAlex?
Good question. Those are great free APIs, and you should use them when they fit. But Google Scholar has coverage they don't — it indexes PDFs, theses, preprints, and gray literature that other databases miss. If you need the broadest possible coverage of academic work, Scholar is still the best source.
## Get Started
Try the Google Scholar Scraper — run it in the cloud, no setup needed.
Check out my full portfolio of 27+ scrapers and APIs on Apify — including tools for LinkedIn, YouTube, Telegram, Google News, and more.
Questions? Drop a comment below or find me on X/Twitter.
Built with Crawlee + Apify SDK. If you're building scrapers, these tools are a game-changer.