Scrape Google Scholar in R

#r #data #programming #webscraping

What will be scraped
Prerequisites
Full Code
- Explanation
Google Scholar Python scraper alternatives

Scraping Google Scholar profiles in R can be a powerful tool for academic researchers, librarians, and data analysts.

This blog post will show how to scrape profile data with pagination.

What will be scraped

Prerequisites

In your R console, install all the needed packages:

install.packages("httr")
install.packages("rvest")
install.packages("jsonlite")
install.packages("purrr")
install.packages("stringr")
install.packages("glue")
install.packages("dplyr")

Full Code

Please keep in mind that I'm not an experienced R user and some of the techniques might be better implemented.

library(httr)
library(rvest)
library(jsonlite)
library(purrr)
library(stringr)
library(glue)
library(dplyr)


scrape_all_profiles_from_university <- function(label, university_name) {
  headers <- c("User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36")

  # remove trailing whitespaces/hidden characters
  label <- trimws(label)
  university_name <- trimws(university_name)

  params <- list(
    view_op = "search_authors",
    mauthors = glue("label:{label} '{university_name}'"),
    hl = "en",
    astart = 0
  )

  # empty list to store the profile results
  all_profile_results <- list()

  profiles_is_present <- TRUE
  while (profiles_is_present) {
    response <- GET("https://scholar.google.com/citations", query = params, add_headers(.headers = headers))
    page <- read_html(content(response, "text"))
    print(paste0("extracting authors at page #", params$astart))

    profiles <- page %>% html_elements(".gs_ai_chpr")
    profile_results <- map(profiles, function(profile) {
      name <- profile %>% html_element(".gs_ai_name a") %>% html_text()
      link <- paste0("https://scholar.google.com", profile %>% html_element(".gs_ai_name a") %>% html_attr("href"))
      affiliations <- profile %>% html_element(".gs_ai_aff") %>% html_text(trim = TRUE)
      email <- profile %>% html_element(".gs_ai_eml") %>% html_text()
      cited_by <- profile %>% html_element(".gs_ai_cby") %>% html_text() %>% gsub(pattern = "[^0-9]", replacement = "") # Cited by 17143 -> 17143
      interests <- profile %>% html_elements(".gs_ai_one_int") %>% html_text()

      # scalar values instead of single-value vectors
      # or data.frame() could be used instead here
      list(
        profile_name = name[[1]],
        profile_link = link[[1]],
        profile_affiliations = affiliations[[1]],
        profile_email = email[[1]],
        profile_cited_by_count = cited_by[[1]],
        profile_interests = interests
      )
    })

    # append profile results to the list
    all_profile_results <- c(all_profile_results, profile_results)

    # pagination
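    # html_element() (singular) returns a single value, NA when the button or its onclick attribute is missing
    next_page_button <- page %>% html_element("button.gs_btnPR") %>% html_attr("onclick")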
    if (!is.na(next_page_button)) {
      # extract the "after_author" parameter from the "onclick" attribute of the "Next" button using regex
      # and assign it to the "after_author" URL parameter which is the next token pagination
      # along with "astart" URL param
      params$after_author <- str_match(next_page_button, "after_author\\\\x3d(.*)\\\\x26")[, 2]
      params$astart <- params$astart + 10
    } else {
      profiles_is_present <- FALSE
    }
  }

  # convert to data frame
  all_profile_results <- data.frame(do.call(rbind, all_profile_results), stringsAsFactors = FALSE)

  return(all_profile_results)
}

# Scrape the data
data <- scrape_all_profiles_from_university(label="physics", university_name="Harvard University")

# Select all columns of the data frame using dplyr
all_data <- select(data, everything())

# Extract the email addresses using dplyr
emails <- all_data %>% pull(profile_email)
for (email in emails) {
  cat("- ", email, "\n")
}

Explanation

Import all the needed packages:

library(httr)
library(rvest)
library(jsonlite)
library(purrr)
library(stringr)
library(glue)
library(dplyr)

Next, we create a function with 2 arguments, label and university_name:

scrape_all_profiles_from_university <- function(label, university_name) {
  # ... code
}

The next step is to pass a browser user-agent so the request looks like it comes from an actual user rather than a bot. Check what your user-agent is.

You can read more about this topic in my blog post on reducing the chance of being blocked while web scraping.

The params list is used to create the URL parameters for the request and to dynamically pass the label and university_name values to it.

headers <- c("User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36")

# remove trailing whitespaces/hidden characters
label <- trimws(label)
university_name <- trimws(university_name)

params <- list(
  view_op = "search_authors",
  mauthors = glue("label:{label} '{university_name}'"),
  hl = "en",
  astart = 0
)

# empty list to store the profile results
all_profile_results <- list()
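
To check what URL these parameters produce before sending anything, a small sketch like the one below may help. It assumes httr's modify_url() encodes the query the same way GET() does; build_search_url is a hypothetical helper name that is not part of the original script.

```r
library(httr)
library(glue)

# Hypothetical helper (not in the original script): build the search URL
# without sending a request, so the query string can be inspected.
build_search_url <- function(label, university_name, astart = 0) {
  label <- trimws(label)
  university_name <- trimws(university_name)

  params <- list(
    view_op = "search_authors",
    mauthors = as.character(glue("label:{label} '{university_name}'")),
    hl = "en",
    astart = astart
  )
  modify_url("https://scholar.google.com/citations", query = params)
}

url <- build_search_url(" physics ", "Harvard University")
print(url)
```

Printing the URL this way makes it easier to spot a stray space or an unexpected quote in mauthors before the scraper starts paginating.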

After that, we create a while loop that paginates through all pages dynamically, no matter how many there are. We could hardcode the page range (iterate from page x to page y), but that is not reliable.

If you want to iterate over a fixed number of pages, you can create a variable before the while loop that sets how many iterations should run, and add a condition that checks the current iteration.
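
A minimal sketch of such a page limit, assuming the request-and-parse step is wrapped in a function. paginate_with_limit, fetch_page and fake_fetch are hypothetical names used only for illustration; the stub replaces the real request so the loop can be tried offline.

```r
# Hypothetical sketch: stop after `max_pages` pages even if a "Next" button
# is still present. `fetch_page` stands in for the request-and-parse step and
# must return a list with `results` and `has_next`.
paginate_with_limit <- function(fetch_page, max_pages = 3) {
  all_results <- list()
  astart <- 0
  pages_done <- 0

  while (pages_done < max_pages) {
    page <- fetch_page(astart)
    all_results <- c(all_results, page$results)
    pages_done <- pages_done + 1

    # no next page: stop early, same as the original loop
    if (!page$has_next) break
    astart <- astart + 10
  }
  all_results
}

# Stub that pretends there are 5 pages of 10 profiles each.
fake_fetch <- function(astart) {
  list(
    results = as.list(paste0("profile_", astart + seq_len(10))),
    has_next = astart < 40
  )
}

limited <- paginate_with_limit(fake_fetch, max_pages = 2)
print(length(limited))
```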

When making the request, we pass the previously created params and headers, read the HTML content, and assign it to the page variable:

profiles_is_present <- TRUE
while (profiles_is_present) {
  # on each params update, 'response' will be different (new page) and 'page' accordingly
  response <- GET("https://scholar.google.com/citations", query = params, add_headers(.headers = headers))
  page <- read_html(content(response, "text"))
  # ...
}

At this step we're iterating (map()) over HTML containers with .gs_ai_chpr CSS selector and extracting data.

profiles <- page %>% html_elements(".gs_ai_chpr")
  profile_results <- map(profiles, function(profile) {
    name <- profile %>% html_element(".gs_ai_name a") %>% html_text()
    link <- paste0("https://scholar.google.com", profile %>% html_element(".gs_ai_name a") %>% html_attr("href"))
    affiliations <- profile %>% html_element(".gs_ai_aff") %>% html_text(trim = TRUE)
    email <- profile %>% html_element(".gs_ai_eml") %>% html_text()
    cited_by <- profile %>% html_element(".gs_ai_cby") %>% html_text() %>% gsub(pattern = "[^0-9]", replacement = "") # Cited by 17143 -> 17143
    interests <- profile %>% html_elements(".gs_ai_one_int") %>% html_text()

Right after that, we return the extracted data for each profile as a list and append the results of the current page to all_profile_results by concatenating the two lists, all_profile_results and profile_results:

# not really sure if [[1]] is needed
list(
  profile_name = name[[1]],
  profile_link = link[[1]],
  profile_affiliations = affiliations[[1]],
  profile_email = email[[1]],
  profile_cited_by_count = cited_by[[1]],
  profile_interests = interests
)
})

# append profile results to the list
all_profile_results <- c(all_profile_results, profile_results)
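
To see how these selectors behave without sending a request to Google Scholar, the extraction can be tried on a small inline HTML snippet. The markup below is made up to mimic one profile card (the real page may be structured differently), and the name, link and numbers are invented.

```r
library(rvest)

# Made-up markup that mimics one profile card; the real page may differ.
html <- '
<div class="gs_ai_chpr">
  <h3 class="gs_ai_name"><a href="/citations?user=EXAMPLE">Jane Doe</a></h3>
  <div class="gs_ai_aff">Example University</div>
  <div class="gs_ai_eml">Verified email at example.edu</div>
  <div class="gs_ai_cby">Cited by 17143</div>
  <div class="gs_ai_int">
    <a class="gs_ai_one_int">Optics</a>
    <a class="gs_ai_one_int">Photonics</a>
  </div>
</div>'

page <- read_html(html)
profile <- html_element(page, ".gs_ai_chpr")

name <- profile %>% html_element(".gs_ai_name a") %>% html_text()
link <- paste0("https://scholar.google.com", profile %>% html_element(".gs_ai_name a") %>% html_attr("href"))
# keep digits only: "Cited by 17143" -> "17143"
cited_by <- profile %>% html_element(".gs_ai_cby") %>% html_text() %>% gsub(pattern = "[^0-9]", replacement = "")
# html_elements() (plural) returns every match, so this is a character vector
interests <- profile %>% html_elements(".gs_ai_one_int") %>% html_text()
# a selector that matches nothing yields NA instead of an error
missing <- profile %>% html_element(".no_such_class") %>% html_text()

print(list(name = name, link = link, cited_by = cited_by, interests = interests, missing = missing))
```

Because html_element() returns a single value per profile (or NA when nothing matches), the name, link, email and cited-by fields are already length one here.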

At this point we get to the actual pagination.

Firstly, we extract the onclick attribute from the button HTML element and assign it to next_page_button.
Secondly, we check if (!is.na(next_page_button)) (whether the button is present); if there is no button, we exit the while loop.
Thirdly, we extract the next page token from the button's onclick attribute and pass it to params as a new key, after_author.
Lastly, we increment astart in params by 10; it is used together with the after_author parameter to drive pagination.
- astart 10 = 2nd page, 20 = 3rd page and so on.

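# html_element() (singular) returns a single value, NA when the button or its onclick attribute is missing
next_page_button <- page %>% html_element("button.gs_btnPR") %>% html_attr("onclick")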
if (!is.na(next_page_button)) {
  params$after_author <- str_match(next_page_button, "after_author\\\\x3d(.*)\\\\x26")[, 2]
  params$astart <- params$astart + 10
} else {
  profiles_is_present <- FALSE
}
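
The token extraction can be checked on its own with a sample string. The onclick value below is made up to resemble the attribute described above, and the token AbC123__8J is invented for illustration.

```r
library(stringr)

# Made-up onclick value shaped like the one on the "Next" button.
# In the attribute, "=" and "&" appear as the literal text \x3d and \x26,
# which is why the regex has to match a literal backslash.
onclick <- "window.location='/citations?view_op\\x3dsearch_authors\\x26hl\\x3den\\x26after_author\\x3dAbC123__8J\\x26astart\\x3d10'"

token <- str_match(onclick, "after_author\\\\x3d(.*)\\\\x26")[, 2]
print(token)

# A missing attribute comes back as NA, and str_match() passes NA through,
# so the same is.na() check works on either value.
no_button <- NA_character_
no_token <- str_match(no_button, "after_author\\\\x3d(.*)\\\\x26")[, 2]
print(is.na(no_token))
```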

Finally, we convert all_profile_results to a dataframe:

all_profile_results <- data.frame(do.call(rbind, all_profile_results), stringsAsFactors = FALSE)
return(all_profile_results)

do.call(rbind, ...) stacks all of the per-profile lists row by row into a single matrix, which data.frame() then converts to a data frame.
stringsAsFactors = FALSE keeps character columns as strings instead of converting them to factors.
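
As far as I can tell, binding a list of lists this way leaves every column as a list column, which can be awkward to filter or export. One possible alternative, sketched below on two made-up profiles, is to build a one-row data frame per profile and collapse the interests into a single string. profiles_to_df is a hypothetical helper, not part of the original script.

```r
# Two made-up profiles shaped like the lists built inside map().
profiles <- list(
  list(profile_name = "Jane Doe", profile_email = "Verified email at example.edu",
       profile_interests = c("Optics", "Photonics")),
  list(profile_name = "John Roe", profile_email = NA_character_,
       profile_interests = character(0))
)

# Hypothetical helper: one plain data frame row per profile,
# with the interests collapsed into a single comma-separated string.
profiles_to_df <- function(profiles) {
  rows <- lapply(profiles, function(p) {
    data.frame(
      profile_name = p$profile_name,
      profile_email = p$profile_email,
      profile_interests = paste(p$profile_interests, collapse = ", "),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

df <- profiles_to_df(profiles)
str(df)
```

The trade-off is that the interests are no longer a vector per profile; if you need them separately, keeping a list column may be the better choice.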

And as a final step, here's how we can access the data:

data <- scrape_all_profiles_from_university(label="physics", university_name="Harvard University")

all_data <- select(data, everything())

emails <- all_data %>% pull(profile_email)
for (email in emails) {
  cat("- ", email, "\n")
}

Outputs:

[1] "extracting authors at page #0"
[1] "extracting authors at page #10"
[1] "extracting authors at page #20"
[1] "extracting authors at page #30"
[1] "extracting authors at page #40"
[1] "extracting authors at page #50"
> ...
-  Verified email at neu.edu
-  Verified email at seas.harvard.edu
-  Verified email at physics.harvard.edu
-  Verified email at cfa.harvard.edu
-  Verified email at physics.harvard.edu
-  Verified email at cfa.harvard.edu
-  Verified email at seas.harvard.edu 
-  Verified email at mcb.harvard.edu
-
-  Verified email at mcgill.ca
-  Verified email at cfa.harvard.edu
-  Verified email at bidmc.harvard.edu
-  Verified email at physics.harvard.edu
-  Verified email at physics.harvard.edu
-  Verified email at hsph.harvard.edu
-  Verified email at g.harvard.edu
-  Verified email at fas.harvard.edu
-  Verified email at iisc.ac.in
-  Verified email at fas.harvard.edu
-  Verified email at fas.harvard.edu
-  Verified email at physics.harvard.edu
-  Verified email at fas.harvard.edu
-  Verified email at bwh.harvard.edu
-  Verified email at g.harvard.edu
-  Verified email at seas.harvard.edu
-  Verified email at g.harvard.edu 
-  Verified email at hsph.harvard.edu
-  Verified email at math.harvard.edu
-  Verified email at fas.harvard.edu
-  Verified email at fas.harvard.edu
-  Verified email at cfa.harvard.edu
-  Verified email at harvard.edu
-  Verified email at polytechnique.edu
-  Verified email at seas.harvard.edu
-  Verified email at fas.harvard.edu
-  Verified email at g.harvard.edu
-  Verified email at fas.harvard.edu
-  Verified email at g.harvard.edu
-  Verified email at seas.harvard.edu
-  Verified email at cfa.harvard.edu
-  Verified email at fas.harvard.edu
-  Verified email at fas.harvard.edu 
-  Verified email at college.harvard.edu
-  Verified email at fas.harvard.edu
-  Verified email at go.cambridgecollege.edu
-  Verified email at mgh.harvard.edu
-  Verified email at hsph.harvard.edu
-  Verified email at g.harvard.edu
-
-  Verified email at g.harvard.edu
-  Verified email at college.harvard.edu
-  Verified email at college.harvard.edu
-  Verified email at physics.uoc.gr
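
If you only need the domains, the "Verified email at" prefix can be stripped and the results counted. This is a small base R sketch on a few values taken from the output above; it assumes the emails are a plain character vector (unlist() the column first if it is a list).

```r
# A few values copied from the output above, plus one empty value.
emails <- c(
  "Verified email at seas.harvard.edu",
  "Verified email at physics.harvard.edu",
  "Verified email at physics.harvard.edu",
  "",
  "Verified email at mcgill.ca"
)

# Keep only non-empty values and strip the fixed prefix.
emails <- emails[!is.na(emails) & nzchar(emails)]
domains <- sub("^Verified email at ", "", emails)

# Count profiles per domain, most frequent first.
domain_counts <- sort(table(domains), decreasing = TRUE)
print(domain_counts)
```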

Google Scholar Python scraper alternatives

If you want to extract more data from Google Scholar and haven't figured out how to do it in R, you can use one of the Python alternatives, if you're comfortable with Python:

scrape-google-scholar-py is an open-source project of mine that aims to extract all possible data from Google Scholar. In the future I'll port it to R.
scholarly is also an open-source project that extracts data from Google Scholar. The difference between it and my package is that mine aims to extract all possible pages, while scholarly does not.
