DEV Community

Cover image for Scrape Google Scholar Using Python
ApiForSeo
ApiForSeo

Posted on • Originally published at serpdog.io

Scrape Google Scholar Using Python

Google Scholar data can be a great choice for businesses that specifically want to access quality researched based content available on the internet.

In this tutorial, we will learn to scrape Google Scholar Results using Python and libraries requests and BeautifulSoup.

Scrape Google Scholar Using Python

Why Scrape Google Scholar?

Scraping Google Scholar can provide you with various benefits:

Content Access — Scraping Google Scholar allows you to get access to research-based content published by highly experienced scientists around the world.

Analyzing Trends — Developers can analyze trends in the scholarly literature by extracting information like articles title, abstracts, and citation counts from Google Scholar Results

Let’s start scraping Google Scholar using Python

In this section, we are going to prepare a basic script using Python to scrape Google Scholar Results, but let us first complete the requirements of this project.

Set-Up

If you have not already installed Python on your device, please consider these videos:

  1. How to install Python on Windows?

  2. How to install Python on MacOS?

Or you can directly install Python from their official website.

Requirements:

To scrape Google Scholar Results, we will be using these two Python libraries:

  1. Beautiful Soup — Used for parsing the raw HTML data.

  2. Requests — Used for making HTTP requests.

You can run the below commands in your project terminal to install the libraries.

pip install requests
pip install beautifulsoup4
Enter fullscreen mode Exit fullscreen mode

Scraping Google Scholar Organic Results

In this section, we are going to scrape the organic results from Google Scholar.

Google Scholar Organic Results

Open this link on your desktop. We will scrape the title, link, id, displayed link, snippet, and site links from those results.

Here is the complete code to scrape the Google Organic Scholar Results:

import requests
from bs4 import BeautifulSoup

def getScholarData():
    try:
        url = "https://www.google.com/scholar?q=Quantum+Physics&hl=en"
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
        }
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.text, 'html.parser')

        scholar_results = []

        for el in soup.select(".gs_ri"):
            scholar_results.append({
                "title": el.select(".gs_rt")[0].text,
                "title_link": el.select(".gs_rt a")[0]["href"],
                "id": el.select(".gs_rt a")[0]["id"],
                "displayed_link": el.select(".gs_a")[0].text,
                "snippet": el.select(".gs_rs")[0].text.replace("\n", ""),
                "cited_by_count": el.select(".gs_nph+ a")[0].text,
                "cited_link": "https://scholar.google.com" + el.select(".gs_nph+ a")[0]["href"],
                "versions_count": el.select("a~ a+ .gs_nph")[0].text,
                "versions_link": "https://scholar.google.com" + el.select("a~ a+ .gs_nph")[0]["href"] if el.select("a~ a+ .gs_nph")[0].text else "",
            })

        for i in range(len(scholar_results)):
            scholar_results[i] = {key: value for key, value in scholar_results[i].items() if value != "" and value is not None}

        print(scholar_results)

    except Exception as e:
        print(e)

getScholarData()
Enter fullscreen mode Exit fullscreen mode

You can use this CSS Selectors Gadget to find the tags for the respective elements you want from the HTML results.

This will return the following data from the web page:

[
 {
  'title': '[BOOK][B] Quantum physics',
  'title_link': 'https://books.google.com/books?hl=en&lr=&id=qFtQiVmjWUEC&oi=fnd&pg=PA28&dq=Quantum+Physics&ots=tuFPLNEOmu&sig=tRYUJyK8VECC_j1Lwpbl6bjC5ag',
  'id': 'cU32d0ZoSA0J',
  'displayed_link': 'S Gasiorowicz - 2007 - books.google.com',
  'snippet': '… the Old Quantum Theory. This material appears in every textbook on modern physics in one … the creation of quantum mechanics, and it highlights the differences between classical and …',
  'cited_by_count': 'Cited by 1400',
  'cited_link': 'https://scholar.google.com/scholar?cites=957129572685860209&as_sdt=2005&sciodt=0,5&hl=en',
  'versions_count': 'All 10 versions',
  'versions_link': 'https://scholar.google.com/scholar?cluster=957129572685860209&hl=en&as_sdt=0,5'
 },
 {
  'title': '[BOOK][B] Quantum physics',
  'title_link': 'https://books.google.com/books?hl=en&lr=&id=hBlZu4M51IMC&oi=fnd&pg=PA1&dq=Quantum+Physics&ots=uEsu7Kytk-&sig=54fJyjI177Zm2gC7HZcgM_iCEJU',
  'id': 'ENaNHF5MoQsJ',
  'displayed_link': 'M Le Bellac - 2011 - books.google.com',
  'snippet': '… allow the fundamentals of quantum mechanics to be tested … in quantum physics. It will also allow us to give a small sample of the current ideas on the notion of measurement in quantum …',
  'cited_by_count': 'Cited by 211',
  'cited_link': 'https://scholar.google.com/scholar?cites=838034972757317136&as_sdt=2005&sciodt=0,5&hl=en',
  'versions_count': 'All 4 versions',
  'versions_link': 'https://scholar.google.com/scholar?cluster=838034972757317136&hl=en&as_sdt=0,5'
 },
 ....
Enter fullscreen mode Exit fullscreen mode

Scraping Google Scholar Cite Results

Google Scholar Cite Results

Then, we will use the IDs we got from scraping organic results to scrape the cite results.

import requests
from bs4 import BeautifulSoup
def getData():
    try:
        url = "https://scholar.google.com/scholar?q=info:cU32d0ZoSA0J:scholar.google.com&output=cite"
        headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"}
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, "html.parser")
        cite_results = []
        for el in soup.select("#gs_citt tr"):
            cite_results.append({
                "title": el.select_one(".gs_cith").text.strip(),
                "snippet": el.select_one(".gs_citr").text.strip()
            })
        links = []
        for el in soup.select("#gs_citi .gs_citi"):
            links.append({
                "name": el.text.strip(),
                "link": el.get("href")
            })
        print(cite_results)
        print(links)
    except Exception as e:
        print(e)
getData()
Enter fullscreen mode Exit fullscreen mode

If you take a look at the URL, after the info we have used an ID of the first organic result we got from the above section.

Our result should look like this:

  [
    {
        'title': 'MLA', 'snippet': 'Gasiorowicz, Stephen. Quantum physics. John Wiley & Sons, 2007.'
    },
    {
        'title': 'APA', 'snippet': 'Gasiorowicz, S. (2007). Quantum physics. John Wiley & Sons.'
    },
    {
        'title': 'Chicago', 'snippet': 'Gasiorowicz, Stephen. Quantum physics. John Wiley & Sons, 2007.'
    },
    {
        'title': 'Harvard', 'snippet': 'Gasiorowicz, S., 2007. Quantum physics. John Wiley & Sons.'
    },
    {
        'title': 'Vancouver', 'snippet': 'Gasiorowicz S. Quantum physics. John Wiley & Sons; 2007 Jan 29.'
    }
  ]
  [
    {
        'name': 'BibTeX',
        'link': 'https: //scholar.googleusercontent.com/scholar.bib?q=info:cU32d0ZoSA0J:scholar.google.com/&output=citation&scisdr=Cm3wRhgwGAA:AGlGAw8AAAAAZFe6DtNdVceWMhOLx32wQFKpqxA&scisig=AGlGAw8AAAAAZFe6DjCj95cYqQaVYG2_H2xJMDY&scisf=4&ct=citation&cd=-1&hl=en'
    },
    {
        'name': 'EndNote',
        'link': 'https://scholar.googleusercontent.com/scholar.enw?q=info:cU32d0ZoSA0J:scholar.google.com/&output=citation&scisdr=Cm3wRhgwGAA:AGlGAw8AAAAAZFe6DtNdVceWMhOLx32wQFKpqxA&scisig=AGlGAw8AAAAAZFe6DjCj95cYqQaVYG2_H2xJMDY&scisf=3&ct=citation&cd=-1&hl=en' 
    },
    {
        'name': 'RefMan',
        'link': 'https://scholar.googleusercontent.com/scholar.ris?q=info:cU32d0ZoSA0J:scholar.google.com/&output=citation&scisdr=Cm3wRhgwGAA:AGlGAw8AAAAAZFe6DtNdVceWMhOLx32wQFKpqxA&scisig=AGlGAw8AAAAAZFe6DjCj95cYqQaVYG2_H2xJMDY&scisf=2&ct=citation&cd=-1&hl=en' 
    },
    { 
        'name': 'RefWorks',
        'link': 'https://scholar.googleusercontent.com/scholar.rfw?q=info:cU32d0ZoSA0J:scholar.google.com/&output=citation&scisdr=Cm3wRhgwGAA:AGlGAw8AAAAAZFe6DtNdVceWMhOLx32wQFKpqxA&scisig=AGlGAw8AAAAAZFe6DjCj95cYqQaVYG2_H2xJMDY&scisf=1&ct=citation&cd=-1&hl=en' 
    }
   ]
Enter fullscreen mode Exit fullscreen mode

Scraping Google Scholar Authors

Google Scholar Authors

Now, let’s scrape the profiles of the authors who have published quality content on Quantum Physics.

Here is our code:

import requests
from bs4 import BeautifulSoup

def getScholarProfiles():
    try:
        url = "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=Quantum+Physics"
        headers = {
            "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"
        }
        response = requests.get(url, headers=headers)
        soup = BeautifulSoup(response.content, 'html.parser')

        scholar_profiles = []
        for el in soup.select('.gsc_1usr'):
            profile = {
                'name': el.select_one('.gs_ai_name').get_text(),
                'name_link': 'https://scholar.google.com' + el.select_one('.gs_ai_name a')['href'],
                'position': el.select_one('.gs_ai_aff').get_text(),
                'email': el.select_one('.gs_ai_eml').get_text(),
                'departments': el.select_one('.gs_ai_int').get_text(),
                'cited_by_count': el.select_one('.gs_ai_cby').get_text().split(' ')[2]
            }
            scholar_profiles.append({k: v for k, v in profile.items() if v})

        print(scholar_profiles)
    except Exception as e:
        print(e)

getScholarProfiles()
Enter fullscreen mode Exit fullscreen mode

Our results should look like this:

[
  {
    name: 'Georg Kresse',
    name_link: 'https://scholar.google.com/citations?hl=en&user=Pn8ouvAAAAAJ',
    position: 'University of Vienna, Faculty of Physics, Professor for Computational Quantum Mechanics',
    email: 'Verified email at univie.ac.at',
    departments: 'density functional theory first principles calculations many body theory condensed matter physics materials science ',
    cited_by_count: '345869'
  },
  {
    name: 'Manuel Proissl',
    name_link: 'https://scholar.google.com/citations?hl=en&user=ikHSFIkAAAAJ',
    position: 'Data Science Leader, Quantum Computing Technologist, Physicist',
    email: 'Verified email at accenture.com',
    departments: 'quantum computing machine learning deep learning causal inference ',
    cited_by_count: '82357'
  },
    ....
Enter fullscreen mode Exit fullscreen mode

Scraping Google Scholar Author Profile

Google Scholar Author Profile

Let us not create a scraper to extract data from Google Author Profile.
First, we will extract some main details about the author then we will move to its published content part.

Google Scholar Author

import requests
from bs4 import BeautifulSoup

def getAuthorProfileData():
    try:
        url = "https://scholar.google.com/citations?hl=en&user=Pn8ouvAAAAAJ"
        headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36"
        }
        response = requests.get(url, headers=headers)
        print(response.status_code)
        soup = BeautifulSoup(response.text, 'html.parser')
        author_results = {}
        author_results['name'] = soup.select_one("#gsc_prf_in").get_text()
        author_results['position'] = soup.select_one("#gsc_prf_inw+ .gsc_prf_il").text
        author_results['email'] = soup.select_one("#gsc_prf_ivh").text
        author_results['published_content'] = soup.select_one("#gsc_prf_int").text
        print(author_results)
    except Exception as e:
        print(e)

getAuthorProfileData()
Enter fullscreen mode Exit fullscreen mode

Our result should look like this:

{
  name: 'Georg Kresse',
  position: 'University of Vienna, Faculty of Physics, Professor for Computational Quantum Mechanics',
  email: 'Verified email at univie.ac.at - Homepage',
  published_content: 'density functional theoryfirst principles calculationsmany body theorycondensed matter physicsmaterials science'
}
Enter fullscreen mode Exit fullscreen mode

Then, we will scrape the content published by the author.

Google Scholar Author Published Content

Here is our code:

for el in soup.select("#gsc_a_b .gsc_a_t"):
            article = {
                'title': el.select_one(".gsc_a_at").text,
                'link': "https://scholar.google.com" + el.select_one(".gsc_a_at")['href'],
                'authors': el.select_one(".gsc_a_at+ .gs_gray").text,
                'publication': el.select_one(".gs_gray+ .gs_gray").text
            }
            articles.append(article)
        for i in range(len(articles)):
            articles[i] = {k: v for k, v in articles[i].items() if v and v != ""}
Enter fullscreen mode Exit fullscreen mode

And the results should look like this:

[
  {
    title: 'Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set',
    link: 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=Pn8ouvAAAAAJ&citation_for_view=Pn8ouvAAAAAJ:a3BOlSfXSfwC',
    authors: 'G Kresse, J Furthmüller',
    publication: 'Physical review B 54 (16), 11169, 1996'
  },
  {
    title: 'From ultrasoft pseudopotentials to the projector augmented-wave method',
    link: 'https://scholar.google.com/citations?view_op=view_citation&hl=en&user=Pn8ouvAAAAAJ&citation_for_view=Pn8ouvAAAAAJ:F9fV5C73w3QC',
    authors: 'G Kresse, D Joubert',
    publication: 'Physical review b 59 (3), 1758, 1999'
  },
Enter fullscreen mode Exit fullscreen mode

Now, we will scrape the Google Scholar Author profile Cited By results in which we will cover citation, h-index, and the i10-index since 2017.

Google Scholar Author Cited By Results

Here is the code:

    cited_by = {}
            cited_by['table'] = []
            cited_by['table'].append({})
            cited_by['table'][0]['citations'] = {}
            cited_by['table'][0]['citations']['all'] = soup.select_one("tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std").text
            cited_by['table'][0]['citations']['since_2017'] = soup.select_one("tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std").text
            cited_by['table'].append({})
            cited_by['table'][1]['h_index'] = {}
            cited_by['table'][1]['h_index']['all'] = soup.select_one("tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std").text
            cited_by['table'][1]['h_index']['since_2017'] = soup.select_one("tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std").text
            cited_by['table'].append({})
            cited_by['table'][2]['i_index'] = {}
            cited_by['table'][2]['i_index']['all'] = soup.select_one("tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std").text
            cited_by['table'][2]['i_index']['since_2017'] = soup.select_one("tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std").text

Here are the results:

    [
      { citations: { all: '345869', since_2017: '168297' } },
      { h_index: { all: '132', since_2017: '82' } },
      { i_index: { all: '367', since_2017: '276' } }
    ]
Enter fullscreen mode Exit fullscreen mode

Here is the complete code to scrape the complete profile of an Author:

    import requests
    from bs4 import BeautifulSoup

    def getAuthorProfileData():
        try:
            url = "https://scholar.google.com/citations?hl=en&user=cOsxSDEAAAAJ"
            headers = {
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36"
            }
            response = requests.get(url, headers=headers)
            soup = BeautifulSoup(response.text, 'html.parser')
            author_results = {}
            articles = []
            author_results['name'] = soup.select_one("#gsc_prf_in").text
            author_results['position'] = soup.select_one("#gsc_prf_inw+ .gsc_prf_il").text
            author_results['email'] = soup.select_one("#gsc_prf_ivh").text
            author_results['departments'] = soup.select_one("#gsc_prf_int").text
            for el in soup.select("#gsc_a_b .gsc_a_t"):
                article = {
                    'title': el.select_one(".gsc_a_at").text,
                    'link': "https://scholar.google.com" + el.select_one(".gsc_a_at")['href'],
                    'authors': el.select_one(".gsc_a_at+ .gs_gray").text,
                    'publication': el.select_one(".gs_gray+ .gs_gray").text
                }
                articles.append(article)
            for i in range(len(articles)):
                articles[i] = {k: v for k, v in articles[i].items() if v and v != ""}
            cited_by = {}
            cited_by['table'] = []
            cited_by['table'].append({})
            cited_by['table'][0]['citations'] = {}
            cited_by['table'][0]['citations']['all'] = soup.select_one("tr:nth-child(1) .gsc_rsb_sc1+ .gsc_rsb_std").text
            cited_by['table'][0]['citations']['since_2017'] = soup.select_one("tr:nth-child(1) .gsc_rsb_std+ .gsc_rsb_std").text
            cited_by['table'].append({})
            cited_by['table'][1]['h_index'] = {}
            cited_by['table'][1]['h_index']['all'] = soup.select_one("tr:nth-child(2) .gsc_rsb_sc1+ .gsc_rsb_std").text
            cited_by['table'][1]['h_index']['since_2017'] = soup.select_one("tr:nth-child(2) .gsc_rsb_std+ .gsc_rsb_std").text
            cited_by['table'].append({})
            cited_by['table'][2]['i_index'] = {}
            cited_by['table'][2]['i_index']['all'] = soup.select_one("tr~ tr+ tr .gsc_rsb_sc1+ .gsc_rsb_std").text
            cited_by['table'][2]['i_index']['since_2017'] = soup.select_one("tr~ tr+ tr .gsc_rsb_std+ .gsc_rsb_std").text
            print(author_results)
            print(articles)
            print(cited_by['table'])
        except Exception as e:
            print(e)

    getAuthorProfileData()
Enter fullscreen mode Exit fullscreen mode

Using Google Scholar API

Scraping Google Scholar can be difficult for a developer with frequent blockage from Google. Also, one has to maintain the scraper accordingly with the changing HTML structure.

Suppose you were provided with a straightforward and efficient solution to scrape Google Scholar Results, wouldn’t that be an excellent choice?

Yes, you heard right! Our Google Scholar API allows businesses to scrape educational content from Googe Scholar at scale using our powerful API infrastructure which is powered by a massive pool of 10M+ residential proxies.

Google Scholar API

We also offer 100 free requests on the first sign-up.

After getting registered on our website, you will get an API Key. Embed this API Key in the below code, you will be able to scrape Google Scholar Results at a much faster speed.

  import requests
    payload = {'api_key': 'APIKEY', 'q':'quantum+physics'}
    resp = requests.get('https://api.serpdog.io/scholar', params=payload)
    print (resp.text)
Enter fullscreen mode Exit fullscreen mode

Conclusion:

This tutorial taught us to scrape Google Scholar using Python. Feel free to message me if I missed something or if anything you need clarification on. Follow me on Twitter Thanks for reading!

Additional Resources

  1. Web Scraping With JavaScript and Node JS — An Ultimate Guide

  2. Scrape Google Play Store

  3. Web Scraping Google Maps

  4. Scrape Google Shopping

Top comments (0)