NexGenData

Posted on Jun 2

Citation Analysis at Scale: Track Research Trends with Semantic Scholar Data

#python #api #datascience #tutorial

The research landscape moves faster than any single researcher can keep up with.

You might spend weeks reading the 20 papers everyone cites on neural architecture search, only to miss the emerging thread on efficient fine-tuning that's building momentum in parallel.

Academic databases exist, but they're designed for individual researchers searching one query at a time. If you're trying to map an entire field—understand who's influential, what problems are being solved, what direction the consensus is moving—you need a different approach.

You need to see the citation network itself.

Citation patterns reveal research trajectories months before the average researcher catches on. Papers that are being cited heavily in new contexts are signaling emerging importance. Authors whose work appears across multiple subfields are bridges between communities. Venues where groundbreaking work clusters reveal where innovation is concentrating.

I'll show you how to scrape and analyze citation data at scale to map a research landscape, identify influential work, and spot emerging trends before they hit mainstream awareness.

What Citation Networks Tell You

A citation network is just a directed graph: papers point to papers they reference. But the structure of that graph tells you a lot about a research field.

High in-degree papers (frequently cited) = foundational or benchmark work

Not always the newest papers
Often older foundational work
Tells you the field's consensus baseline

High out-degree papers (cite many others) = comprehensive reviews

Survey papers, literature reviews
Map the landscape comprehensively
Good starting point for understanding a field
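To make the two degree measures concrete, here is a minimal sketch that counts in-degree and out-degree from a list of citation edges. The edge list and paper labels are made up for illustration; they are not scraper output.

```python
from collections import Counter

# Hypothetical toy edge list: (citing_paper, cited_paper)
edges = [
    ("survey", "lora"),
    ("survey", "adapters"),
    ("survey", "prefix"),
    ("dora", "lora"),
    ("dora", "adapters"),
    ("qlora", "lora"),
]

def degree_counts(edges):
    """Return (in_degree, out_degree) Counters for a directed citation graph."""
    in_degree = Counter(cited for _, cited in edges)     # times each paper is cited
    out_degree = Counter(citing for citing, _ in edges)  # references each paper makes
    return in_degree, out_degree

in_degree, out_degree = degree_counts(edges)
print(in_degree.most_common(1))   # the most-cited paper in the toy graph
print(out_degree.most_common(1))  # the paper that cites the most others
```

In this toy graph the survey has the highest out-degree and the paper it shares with the other two citers has the highest in-degree, which is the pattern described above.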

High clustering coefficient (papers that cite each other) = tight research subcommunities

Multiple papers citing the same references (bibliographic coupling) suggest a cohesive group
Different clusters in the same field = parallel research tracks
Bridges between clusters = pivotal papers connecting different subfields
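One simple way to quantify "citing the same references" is the Jaccard similarity between two papers' reference lists. The sketch below is only an illustration: the paper ids, reference labels, and the 0.3 threshold are assumptions, and a true clustering coefficient would need the full graph.

```python
def jaccard(a, b):
    """Jaccard similarity of two reference lists: |A & B| / |A | B|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def coupling_pairs(refs, threshold=0.3):
    """Return (paper_a, paper_b, score) for pairs sharing enough references."""
    ids = sorted(refs)
    pairs = []
    for i, a in enumerate(ids):
        for b in ids[i + 1:]:  # each unordered pair once
            score = jaccard(refs[a], refs[b])
            if score >= threshold:
                pairs.append((a, b, round(score, 2)))
    return pairs

# Hypothetical reference lists keyed by paper id
refs = {
    "p1": ["lora", "adapters", "prefix"],
    "p2": ["lora", "adapters", "bitfit"],
    "p3": ["prompt_tuning", "p_tuning"],
}

pairs = coupling_pairs(refs)
print(pairs)
```

Pairs above the threshold become edges between papers; groups of connected papers are candidate subcommunities.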

Citation velocity (how many new citations per month) = current momentum

Older papers usually gain new citations slowly (treat that rate as the baseline)
Recent papers gaining citations quickly = emerging importance
A 2025 paper already cited 50 times = likely significant
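Citation velocity is just citations divided by months since publication. A minimal sketch, assuming you know the publication month (the function name and the dates are illustrative):

```python
from datetime import date

def citation_velocity(citations, published, today):
    """Average citations per month since publication (at least one month)."""
    months = (today.year - published.year) * 12 + (today.month - published.month)
    return citations / max(months, 1)

# A paper published in January 2025 with 50 citations, measured in June 2025
velocity = citation_velocity(50, date(2025, 1, 15), date(2025, 6, 2))
print(velocity)
```

The `max(months, 1)` guard avoids a division by zero for papers published in the current month.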

Author co-citation patterns = research influence and networks

Frequently cited together = closely related work, often from established collaborators
Cited together but in different contexts = influence spanning multiple areas
Sudden co-citation increase = two lines of work converging, sometimes a new collaboration
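Co-citation counts can be sketched by counting how often two names appear in the same reference list. The author labels below are placeholders, and real data would need author ids rather than surnames.

```python
from collections import Counter
from itertools import combinations

def co_citation_counts(reference_lists):
    """Count how often each pair of items appears in the same reference list."""
    counts = Counter()
    for refs in reference_lists:
        # sorted(set(...)) gives each unordered pair one canonical form
        for a, b in combinations(sorted(set(refs)), 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical cited authors for three citing papers
reference_lists = [
    ["hu", "houlsby", "li"],
    ["hu", "houlsby"],
    ["hu", "li"],
]

counts = co_citation_counts(reference_lists)
print(counts.most_common())
```

Running this on successive snapshots and comparing the counts is one way to spot a sudden co-citation increase.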

These patterns take months to notice by reading papers manually. The data reveals them instantly.

The Data Structure

The Apify Google Scholar Scraper returns structured research data. Here's what a typical query output looks like:

{
  "papers": [
    {
      "id": "gs_paper_2024_001",
      "title": "Efficient Fine-Tuning of Large Language Models",
      "authors": [
        {"name": "Alice Chen", "id": "scholar_alice"},
        {"name": "Bob Kumar", "id": "scholar_bob"}
      ],
      "year": 2024,
      "citationCount": 187,
      "venue": "NeurIPS",
      "abstract": "We propose parameter-efficient methods for adapting...",
      "references": [
        {"paperId": "gs_paper_2023_042", "title": "LoRA: Low-Rank..."},
        {"paperId": "gs_paper_2023_015", "title": "Adapters for..."}
      ],
      "citedBy": [
        {"paperId": "gs_paper_2024_089", "title": "Scaling Efficient..."},
        {"paperId": "gs_paper_2024_102", "title": "Evaluating Parameter..."}
      ]
    },
    {
      "id": "gs_paper_2023_042",
      "title": "LoRA: Low-Rank Adaptation of Large Language Models",
      "authors": [{"name": "Edward Hu", "id": "scholar_edward"}],
      "year": 2023,
      "citationCount": 2847,
      "venue": "ICLR",
      "abstract": "We propose low-rank decomposition for parameter-efficient...",
      "references": [...],
      "citedBy": [...]
    }
  ],
  "metadata": {
    "query": "efficient fine-tuning",
    "resultsCount": 1204,
    "yearRange": "2020-2024"
  }
}

With this structure, you can immediately analyze the research landscape.
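Before analyzing anything, it helps to check how many references actually point at papers inside your dataset, because only those become edges in the graph. A minimal sketch, assuming output in the shape shown above; the sample JSON is a trimmed, made-up stand-in.

```python
import json

# Hypothetical, trimmed-down output in the same shape as the example above
raw = """
{
  "papers": [
    {"id": "p1", "title": "Paper one", "year": 2024,
     "references": [{"paperId": "p2"}, {"paperId": "outside_1"}]},
    {"id": "p2", "title": "Paper two", "year": 2023, "references": []}
  ],
  "metadata": {"query": "efficient fine-tuning"}
}
"""

def reference_coverage(papers):
    """Count references that resolve inside the dataset vs. outside it."""
    known = {p["id"] for p in papers}
    internal = external = 0
    for paper in papers:
        for ref in paper.get("references", []):
            if ref["paperId"] in known:
                internal += 1
            else:
                external += 1
    return internal, external

papers_data = json.loads(raw)["papers"]
internal, external = reference_coverage(papers_data)
print(f"{internal} internal edges, {external} references outside the dataset")
```

If most references fall outside the dataset, in-dataset citation counts will be much lower than the reported `citationCount`, so thresholds should be set accordingly.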

Building Your Citation Analysis Framework

Here's a Python system for processing citation networks and extracting actionable insights:

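from collections import defaultdict, Counter
from datetime import datetime
from itertools import combinations
import json

class ResearchLandscapeAnalyzer:
    def __init__(self, papers_data):
        # papers_data is the "papers" list from the scraper output
        self.papers = {p['id']: p for p in papers_data}
        self.citation_graph = defaultdict(list)  # paper id -> ids it references
        self.author_names = {}                   # author id -> display name
        self.author_graph = defaultdict(set)     # author id -> co-author ids
        self.venue_papers = defaultdict(list)    # venue -> paper ids

        # Build graphs
        for paper_id, paper in self.papers.items():
            # Citation edges (only references that are also in the dataset)
            for ref in paper.get('references', []):
                if ref['paperId'] in self.papers:
                    self.citation_graph[paper_id].append(ref['paperId'])

            # Author co-authorship: link every pair of authors on the paper
            authors = paper.get('authors', [])
            for author in authors:
                self.author_names[author['id']] = author['name']
            for a, b in combinations(authors, 2):
                self.author_graph[a['id']].add(b['id'])
                self.author_graph[b['id']].add(a['id'])

            # Venue tracking (papers without a venue are grouped together)
            self.venue_papers[paper.get('venue') or 'Unknown'].append(paper_id)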

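    def find_foundational_papers(self, min_citations=100, max_age_years=10):
        """Papers most frequently cited by other papers in the dataset"""
        # min_citations counts in-dataset citations, not the global citationCount,
        # so lower it for small datasets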
        in_degree = defaultdict(int)

        for paper_id in self.papers:
            for cited_id in self.citation_graph[paper_id]:
                in_degree[cited_id] += 1

        foundational = []
        for paper_id, citations in in_degree.items():
            if citations >= min_citations:
                paper = self.papers[paper_id]
                age = datetime.now().year - paper['year']

                if age <= max_age_years:
                    foundational.append({
                        'title': paper['title'],
                        'authors': ', '.join([a['name'] for a in paper['authors']]),
                        'year': paper['year'],
                        'venue': paper['venue'],
                        'inbound_citations': citations,
                        'age_years': age
                    })

        return sorted(foundational, key=lambda x: x['inbound_citations'], reverse=True)

    def find_emerging_papers(self, min_citations=20, min_year=2023):
        """Recent papers gaining citation momentum"""
        emerging = []

        for paper_id, paper in self.papers.items():
            if paper['year'] >= min_year:
                citations = paper['citationCount']

                if citations >= min_citations:
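                    # Only the publication year is known, so this assumes a
                    # January publication date; velocity is understated for
                    # papers published later in the year.
                    now = datetime.now()
                    months_since_publication = (
                        (now.year - paper['year']) * 12 + (now.month - 1)
                    )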

                    citation_velocity = citations / max(months_since_publication, 1)

                    emerging.append({
                        'title': paper['title'],
                        'year': paper['year'],
                        'citations': citations,
                        'months_old': months_since_publication,
                        'citation_velocity': round(citation_velocity, 2),
                        'venue': paper['venue']
                    })

        return sorted(emerging, key=lambda x: x['citation_velocity'], reverse=True)

    def find_research_subcommunities(self, min_shared=3):
        """Groups of papers that share references (bibliographic coupling)"""
        # Build each reference set once instead of once per comparison
        ref_sets = {pid: set(self.citation_graph.get(pid, [])) for pid in self.papers}
        reference_overlap = defaultdict(dict)

        # Compare every unordered pair of papers once: O(n^2) in dataset size
        paper_ids = list(self.papers)
        for i, paper_id1 in enumerate(paper_ids):
            for paper_id2 in paper_ids[i + 1:]:
                overlap = len(ref_sets[paper_id1] & ref_sets[paper_id2])
                if overlap >= min_shared:
                    reference_overlap[paper_id1][paper_id2] = overlap
                    reference_overlap[paper_id2][paper_id1] = overlap

        # Greedy grouping: each unprocessed paper seeds a cluster with its
        # not-yet-assigned neighbours
        clusters = []
        processed = set()

        for paper_id, overlaps in reference_overlap.items():
            if paper_id in processed:
                continue

            cluster = {paper_id} | {pid for pid in overlaps if pid not in processed}
            processed.update(cluster)

            if len(cluster) >= 3:
                # Show the five most-cited members as a sample
                sample = sorted(
                    cluster,
                    key=lambda pid: self.papers[pid].get('citationCount', 0),
                    reverse=True
                )[:5]
                clusters.append({
                    'size': len(cluster),
                    'papers': [
                        {
                            'title': self.papers[pid]['title'][:60],
                            'year': self.papers[pid]['year']
                        }
                        for pid in sample
                    ]
                })

        return sorted(clusters, key=lambda x: x['size'], reverse=True)

    def influential_authors(self, min_papers=3):
        """Authors appearing across multiple highly-cited papers"""
        author_stats = defaultdict(lambda: {'papers': [], 'citations': 0})

        for paper in self.papers.values():
            for author in paper['authors']:
                author_stats[author['name']]['papers'].append(paper['id'])
                author_stats[author['name']]['citations'] += paper['citationCount']

        influential = []
        for author_name, stats in author_stats.items():
            if len(stats['papers']) >= min_papers:
                avg_citations = stats['citations'] // len(stats['papers'])
                influential.append({
                    'name': author_name,
                    'papers': len(stats['papers']),
                    'total_citations': stats['citations'],
                    'avg_citations_per_paper': avg_citations
                })

        return sorted(influential, key=lambda x: x['total_citations'], reverse=True)

    def venue_analysis(self):
        """Which venues are publishing important work?"""
        venue_stats = {}

        for venue, paper_ids in self.venue_papers.items():
            papers = [self.papers[pid] for pid in paper_ids]
            avg_citations = sum(p['citationCount'] for p in papers) / len(papers)
            median_year = sorted([p['year'] for p in papers])[len(papers)//2]

            venue_stats[venue] = {
                'papers': len(papers),
                'avg_citations': round(avg_citations, 1),
                'most_recent_year': max(p['year'] for p in papers),
                'median_year': median_year
            }

        return sorted(
            venue_stats.items(),
            key=lambda x: x[1]['avg_citations'],
            reverse=True
        )


# Usage
# papers_data is the "papers" list from the scraper output shown earlier
analyzer = ResearchLandscapeAnalyzer(papers_data)

print("FOUNDATIONAL PAPERS (most cited):")
for paper in analyzer.find_foundational_papers()[:5]:
    print(f"  {paper['title']}")
    print(f"    {paper['authors']} ({paper['year']})")
    print(f"    Cited {paper['inbound_citations']} times | Venue: {paper['venue']}")

print("\nEMERGING PAPERS (rising momentum):")
for paper in analyzer.find_emerging_papers()[:5]:
    print(f"  {paper['title']}")
    print(f"    {paper['citations']} citations in {paper['months_old']} months")
    print(f"    Velocity: {paper['citation_velocity']} citations/month")

print("\nTOP VENUES BY AVERAGE CITATION:")
for venue, stats in analyzer.venue_analysis()[:5]:
    print(f"  {venue}: {stats['papers']} papers, {stats['avg_citations']} avg citations")

print("\nINFLUENTIAL AUTHORS:")
for author in analyzer.influential_authors()[:5]:
    print(f"  {author['name']}: {author['papers']} papers, {author['total_citations']} citations")

Run this over your research domain and you get an instant map of the landscape.
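The grouping in find_research_subcommunities is greedy, so its clusters depend on the order papers are visited. One alternative, sketched here as a swapped-in technique rather than part of the analyzer above, is to treat "shares enough references" as an edge and take connected components with a small union-find. The node and edge names are placeholders.

```python
def connected_components(nodes, edges):
    """Group nodes into connected components using union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, halving the path as we go
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return sorted(groups.values(), key=len, reverse=True)

# Hypothetical coupling edges between paper ids
nodes = ["a", "b", "c", "d", "e"]
coupling_edges = [("a", "b"), ("b", "c"), ("d", "e")]

components = connected_components(nodes, coupling_edges)
print(components)
```

Connected components can merge loosely related groups through a single bridging paper, so a higher shared-reference threshold may be needed than with the greedy version.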

Real Example: Efficient Fine-Tuning Research

Let's say you're tracking the "efficient fine-tuning" landscape across 2023-2024. Your analyzer reveals:

Foundational Papers:

LoRA (2023): 2,847 citations—the baseline everyone builds on
Adapters (2019): 1,204 citations—prior art frequently compared
Prefix-Tuning (2021): 689 citations—related approach, less adopted

Emerging Papers (high citation velocity):

"Scaling Efficient Fine-Tuning" (2024, 2 months old): 156 citations, 78 cites/month—exploding
"DoRA: Decomposed Rank-Aware" (2024, 1 month old): 89 citations, 89 cites/month—just breaking through
"Evaluating Parameter Efficiency Methods" (2024, 3 months old): 67 citations, 22 cites/month—steady growth

Research Subcommunities:

Cluster A (28 papers): LoRA variants and improvements
Cluster B (15 papers): Adapter-based methods
Cluster C (12 papers): Prompt-tuning approaches

Key Insight: Cluster A is far larger and more cited, so LoRA variants are where the field is consolidating. Cluster C (prompt-tuning) is smaller but growing faster, which marks it as an emerging alternative approach.
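"Growing faster" can be checked by comparing how many papers each cluster published in consecutive years. A rough sketch follows; the per-year splits are invented for illustration and only the cluster sizes match the example above.

```python
from collections import Counter

def growth_ratio(years, earlier, later):
    """Papers published in `later` divided by papers published in `earlier`."""
    counts = Counter(years)
    if counts[earlier] == 0:
        return float("inf") if counts[later] else 0.0
    return counts[later] / counts[earlier]

# Invented publication years per cluster, for illustration only
cluster_years = {
    "A": [2023] * 16 + [2024] * 12,  # 28 papers
    "C": [2023] * 4 + [2024] * 8,    # 12 papers
}

ratios = {
    name: growth_ratio(years, 2023, 2024)
    for name, years in cluster_years.items()
}
print(ratios)
```

A ratio above 1 means the cluster published more papers in the later year. With small clusters the ratio is noisy, so it is worth looking at the raw counts too.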

Venues:

NeurIPS and ICLR dominate (highest avg citations)
arXiv preprints appear 3-6 months before the top-tier venue versions
Pattern: a preprint appears first, then the paper is accepted at a top-tier venue in the next conference cycle

Implication: If you're launching an efficient fine-tuning paper, you need to position it against LoRA variants (established field) but also watch for prompt-tuning breakthroughs (emerging alternative).

Using This for Your Research

If you're a researcher, grad student, or R&D leader, this kind of analysis transforms your strategy:

For literature reviews:

Find foundational papers (establish baseline understanding)
Find emerging papers (see where the field is heading)
Identify subcommunities (understand which approach is winning)

For positioning your own work:

Locate yourself relative to the network
Find white space (what's NOT being researched)
Identify emerging areas with less competition

For staying current:

Automate monthly tracking of your field
Get alerts when new papers break citation velocity thresholds
Notice when author collaborations shift (signal of new directions)
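The velocity alert can be sketched as a diff between two monthly snapshots. This assumes you keep a mapping of paper id to citation count from each run; the ids, counts, and threshold below are placeholders.

```python
def velocity_alerts(previous, current, threshold=10):
    """Return (paper_id, citations_gained) for papers at or above the threshold.

    previous and current map paper id -> citation count from two
    consecutive monthly snapshots. Papers missing from `previous`
    are treated as having had zero citations.
    """
    alerts = []
    for paper_id, count in current.items():
        gained = count - previous.get(paper_id, 0)
        if gained >= threshold:
            alerts.append((paper_id, gained))
    return sorted(alerts, key=lambda item: item[1], reverse=True)

# Hypothetical snapshots from last month and this month
last_month = {"a": 100, "b": 5}
this_month = {"a": 104, "b": 40, "c": 12}

alerts = velocity_alerts(last_month, this_month)
print(alerts)
```

Because new papers start from zero, a paper that first appears with many citations will trigger an alert on its first snapshot, which is usually what you want.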

For evaluating research opportunities:

Is the subfield you're interested in consolidating or exploding?
Are the influential authors still publishing actively?
Is your target venue's quality increasing or decreasing?

The Apify Approach

The Google Scholar Scraper handles all the complexity of querying Google Scholar, parsing results, extracting citation networks, and returning clean JSON.

Set it up to run monthly on your research domain:

import requests
import time

actor_id = "q0FlwuKcvz7TM2bNU"
api_token = "your_apify_token"

domains = [
    "efficient fine-tuning",
    "parameter-efficient learning",
    "low-rank adaptation"
]

for domain in domains:
    payload = {
        "queries": [domain],
        "limit": 500,
        "minYear": 2022
    }

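    # Start the actor run; the token goes in the Authorization header
    response = requests.post(
        f"https://api.apify.com/v2/acts/{actor_id}/runs",
        json=payload,
        headers={"Authorization": f"Bearer {api_token}"},
        timeout=30,
    )
    response.raise_for_status()

    run_id = response.json()["data"]["id"]

    # Wait for completion
    # (polling logic omitted)

    # Then run your analyzer

    # Pause before starting the next domain
    time.sleep(60)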

Over time, you accumulate a monthly view of how your field is evolving.
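The omitted polling step could look something like the sketch below. It assumes the run's status can be fetched as a string and that terminal states are named as listed; `fetch_status` is a hypothetical callable you would wire to the run-status endpoint yourself. A fake status sequence keeps the example runnable offline.

```python
import time

# Assumed terminal status names; check them against the API you are calling
TERMINAL_STATUSES = {"SUCCEEDED", "FAILED", "ABORTED", "TIMED-OUT"}

def wait_for_run(fetch_status, interval=10, timeout=600, sleep=time.sleep):
    """Call fetch_status() until it returns a terminal status or time runs out."""
    waited = 0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout:
            raise TimeoutError(f"run still {status} after {waited} seconds")
        sleep(interval)
        waited += interval

# Offline demo: a fake status source standing in for the real HTTP call
fake_statuses = iter(["RUNNING", "RUNNING", "SUCCEEDED"])
final_status = wait_for_run(
    lambda: next(fake_statuses),
    interval=1,
    sleep=lambda seconds: None,  # skip real sleeping in the demo
)
print(final_status)
```

Passing the status fetcher and the sleep function in as arguments keeps the loop easy to test and leaves the HTTP details in one place.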

Why This Matters

Academic research is increasingly collaborative and fast-moving. Individual researchers can't keep up with thousands of papers annually.

But the citation network compresses all that information into patterns. Which papers are foundational? Which are emerging? Who's influential? What's the consensus direction?

That's all visible in the data if you know how to read it.

The researchers and organizations that map their fields systematically will outpace those relying on quarterly literature reviews and conference attendance.

The data is public. You just have to collect and analyze it.

Are you tracking a specific research domain? What patterns would change your strategy if you knew them? Drop your thoughts in the comments.

DEV Community