Scraping Google Scholar profiles in R can be a powerful tool for academic researchers, librarians, and data analysts.
This blog post will show how to scrape profile data with pagination.
What will be scraped
Prerequisites
In your R console, install all the needed packages:
install.packages("httr")
install.packages("rvest")
install.packages("jsonlite")
install.packages("purrr")
install.packages("stringr")
install.packages("glue")
install.packages("dplyr")
Full Code
Please keep in mind that I'm not an experienced R user and some of the techniques might be better implemented.
library(httr)
library(rvest)
library(jsonlite)
library(purrr)
library(stringr)
library(glue)
library(dplyr)
scrape_all_profiles_from_university <- function(label, university_name) {
headers <- c("User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36")
# remove trailing whitespaces/hidden characters
label <- trimws(label)
university_name <- trimws(university_name)
params <- list(
view_op = "search_authors",
mauthors = glue("label:{label} '{university_name}'"),
hl = "en",
astart = 0
)
# empty list to store the profile results
all_profile_results <- list()
profiles_is_present <- TRUE
while (profiles_is_present) {
response <- GET("https://scholar.google.com/citations", query = params, add_headers(.headers = headers))
page <- read_html(content(response, "text"))
print(paste0("extracting authors at page #", params$astart))
profiles <- page %>% html_elements(".gs_ai_chpr")
profile_results <- map(profiles, function(profile) {
name <- profile %>% html_element(".gs_ai_name a") %>% html_text()
link <- paste0("https://scholar.google.com", profile %>% html_element(".gs_ai_name a") %>% html_attr("href"))
affiliations <- profile %>% html_element(".gs_ai_aff") %>% html_text(trim = TRUE)
email <- profile %>% html_element(".gs_ai_eml") %>% html_text()
cited_by <- profile %>% html_element(".gs_ai_cby") %>% html_text() %>% gsub(pattern = "[^0-9]", replacement = "") # Cited by 17143 -> 17143
interests <- profile %>% html_elements(".gs_ai_one_int") %>% html_text()
# scalar values instead of single-value vectors
# or data.frame() could be used instead here
list(
profile_name = name[[1]],
profile_link = link[[1]],
profile_affiliations = affiliations[[1]],
profile_email = email[[1]],
profile_cited_by_count = cited_by[[1]],
profile_interests = interests
)
})
# append profile results to the list
all_profile_results <- c(all_profile_results, profile_results)
# pagination
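# html_element() (singular) returns NA when the button or its "onclick" attribute is missing,
# so the check below always receives a single value
next_page_button <- page %>% html_element("button.gs_btnPR") %>% html_attr("onclick")
if (!is.na(next_page_button)) {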
# extract the "after_author" parameter from the "onclick" attribute of the "Next" button using regex
# and assign it to the "after_author" URL parameter which is the next token pagination
# along with "astart" URL param
params$after_author <- str_match(next_page_button, "after_author\\\\x3d(.*)\\\\x26")[, 2]
params$astart <- params$astart + 10
} else {
profiles_is_present <- FALSE
}
}
# convert to data frame
all_profile_results <- data.frame(do.call(rbind, all_profile_results), stringsAsFactors = FALSE)
return(all_profile_results)
}
# Scrape the data
data <- scrape_all_profiles_from_university(label="physics", university_name="Harvard University")
# Select all columns of the data frame using dplyr
all_data <- select(data, everything())
# Extract the email addresses using dplyr
emails <- all_data %>% pull(profile_email)
for (email in emails) {
cat("- ", email, "\n")
}
Explanation
Import all the needed packages:
library(httr)
library(rvest)
library(jsonlite)
library(purrr)
library(stringr)
library(glue)
library(dplyr)
Next, we create a function with 2 arguments, label and university_name:
scrape_all_profiles_from_university <- function(label, university_name) {
# ... code
}
The next step is to pass a browser user-agent so the request looks like it comes from an actual user rather than a bot. Check what your user-agent is.
You can read more about this topic in my blog post on reducing the chance of being blocked while web scraping.
The params list is used to create URL parameters for the request and to dynamically pass the label and university_name data to it.
headers <- c("User-Agent" = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/113.0.0.0 Safari/537.36")
# remove trailing whitespaces/hidden characters
label <- trimws(label)
university_name <- trimws(university_name)
params <- list(
view_op = "search_authors",
mauthors = glue("label:{label} '{university_name}'"),
hl = "en",
astart = 0
)
# empty list to store the profile results
all_profile_results <- list()
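To see what these parameters turn into, here is a minimal sketch that builds the request URL without sending anything. It assumes httr's modify_url() is an acceptable way to preview the query string; the label and university values are just the ones used later in this post, and paste0() stands in for glue() to keep the example short.

```r
library(httr)

label <- "physics"
university_name <- "Harvard University"

params <- list(
  view_op = "search_authors",
  mauthors = paste0("label:", label, " '", university_name, "'"),
  hl = "en",
  astart = 0
)

# modify_url() only assembles and URL-encodes the string, no request is made
url <- modify_url("https://scholar.google.com/citations", query = params)
print(url)
```

Printing the URL like this is a quick way to check that the label and the university name ended up in the mauthors parameter before running the full scraper.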
After that, we create a while loop that paginates through all pages dynamically, no matter how many there are. We could use a hardcoded approach (iterate from page x to page y), but that is not reliable.
If you want to iterate over a fixed number of pages, you can create a variable before the while loop that sets how many iterations should be done, and add a condition that checks the current iteration.
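A page limit like that could look roughly like the sketch below. The names max_pages, pages_done and fetch_next_token() are hypothetical and not part of the scraper; fetch_next_token() only pretends to make a request so the loop can run on its own.

```r
# Hypothetical stand-in for the real request:
# returns a next-page token, or NA once the last page is reached
fetch_next_token <- function(page_number) {
  if (page_number < 5) paste0("token_", page_number) else NA_character_
}

max_pages <- 3        # assumption: stop after this many pages
pages_done <- 0
astart <- 0
has_next_page <- TRUE

# the loop ends when there is no next page OR the page limit is reached
while (has_next_page && pages_done < max_pages) {
  token <- fetch_next_token(pages_done + 1)
  pages_done <- pages_done + 1

  if (is.na(token)) {
    has_next_page <- FALSE
  } else {
    astart <- astart + 10
  }
}

print(pages_done)
```

In the real function the same two-part condition would go into the while loop, with the counter incremented once per request.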
When making the request, we pass the previously created params and headers, read the HTML content, and assign it to the page variable:
profiles_is_present <- TRUE
while (profiles_is_present) {
# on each params update, 'response' will be different (new page) and 'page' accordingly
response <- GET("https://scholar.google.com/citations", query = params, add_headers(.headers = headers))
page <- read_html(content(response, "text"))
# ...
}
At this step we're iterating (map()) over HTML containers with .gs_ai_chpr CSS selector and extracting data.
profiles <- page %>% html_elements(".gs_ai_chpr")
profile_results <- map(profiles, function(profile) {
name <- profile %>% html_element(".gs_ai_name a") %>% html_text()
link <- paste0("https://scholar.google.com", profile %>% html_element(".gs_ai_name a") %>% html_attr("href"))
affiliations <- profile %>% html_element(".gs_ai_aff") %>% html_text(trim = TRUE)
email <- profile %>% html_element(".gs_ai_eml") %>% html_text()
cited_by <- profile %>% html_element(".gs_ai_cby") %>% html_text() %>% gsub(pattern = "[^0-9]", replacement = "") # Cited by 17143 -> 17143
interests <- profile %>% html_elements(".gs_ai_one_int") %>% html_text()
Right after that, we return the extracted data for each profile as a list and append the results of the current page to all_profile_results by concatenating the two lists, all_profile_results and profile_results:
# not really sure if [[1]] is needed
list(
profile_name = name[[1]],
profile_link = link[[1]],
profile_affiliations = affiliations[[1]],
profile_email = email[[1]],
profile_cited_by_count = cited_by[[1]],
profile_interests = interests
)
})
# append profile results to the list
all_profile_results <- c(all_profile_results, profile_results)
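To try the selectors without hitting Google Scholar, here is a small offline sketch. The HTML snippet is made up and only mimics the class names used above, so the real page structure may differ. It also shows what happens when a profile has no email or no interests.

```r
library(rvest)

# Hypothetical markup that reuses the class names from the scraper
html <- '
<div class="gs_ai_chpr">
  <h3 class="gs_ai_name"><a href="/citations?user=AAAA">Ada Example</a></h3>
  <div class="gs_ai_aff">Example University</div>
  <div class="gs_ai_eml">Verified email at example.edu</div>
  <div class="gs_ai_cby">Cited by 17143</div>
  <a class="gs_ai_one_int">optics</a><a class="gs_ai_one_int">lasers</a>
</div>
<div class="gs_ai_chpr">
  <h3 class="gs_ai_name"><a href="/citations?user=BBBB">Bob Example</a></h3>
  <div class="gs_ai_aff">Example University</div>
</div>'

page <- read_html(html)
profiles <- html_elements(page, ".gs_ai_chpr")

results <- lapply(profiles, function(profile) {
  list(
    name = html_text(html_element(profile, ".gs_ai_name a")),
    # html_element() gives a "missing" node when nothing matches, so html_text() returns NA
    email = html_text(html_element(profile, ".gs_ai_eml")),
    cited_by = gsub("[^0-9]", "", html_text(html_element(profile, ".gs_ai_cby"))),
    # html_elements() gives an empty vector when nothing matches
    interests = html_text(html_elements(profile, ".gs_ai_one_int"))
  )
})

str(results)
```

The second profile ends up with NA for the email and the citation count and an empty vector for interests, which is worth keeping in mind when the results are later turned into a data frame.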
At this point we get to the actual pagination.
- Firstly, we extract the onclick attribute from the button HTML element and assign it to next_page_button.
- Secondly, we check if (!is.na(next_page_button)) (the button is present and active); otherwise we exit the while loop.
- Thirdly, we extract the next-page token from the button's onclick attribute and pass it to params as a new key, after_author.
- Lastly, we add 10 to astart under params, which is used in combination with the after_author parameter to drive pagination.
  - astart 10 = 2nd page, 20 = 3rd page and so on.
# html_element() (singular) returns NA when the button or its "onclick" attribute is missing
next_page_button <- page %>% html_element("button.gs_btnPR") %>% html_attr("onclick")
if (!is.na(next_page_button)) {
params$after_author <- str_match(next_page_button, "after_author\\\\x3d(.*)\\\\x26")[, 2]
params$astart <- params$astart + 10
} else {
profiles_is_present <- FALSE
}
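The regex is easier to follow on a concrete string. Below is a sketch with a made-up onclick value; the token AbC123 is invented, and the real attribute may contain more parameters. The important part is that the attribute holds the literal text \x3d (for "=") and \x26 (for "&"), which is why the pattern needs four backslashes in R.

```r
library(stringr)

# Hypothetical "onclick" value; "\\x3d" in R source is the literal text \x3d
onclick <- "window.location='/citations?view_op\\x3dsearch_authors\\x26hl\\x3den\\x26after_author\\x3dAbC123\\x26astart\\x3d10'"

# column 1 of the match matrix is the full match, column 2 is the captured token
token <- str_match(onclick, "after_author\\\\x3d(.*)\\\\x26")[, 2]
print(token)

# with no usable button the attribute is NA, and str_match() passes the NA through
no_button <- str_match(NA_character_, "after_author\\\\x3d(.*)\\\\x26")[, 2]
print(no_button)
```

Note that (.*) is greedy, so this relies on after_author being followed by a single remaining parameter, as in the sample string.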
Finally, we convert all_profile_results to a dataframe:
all_profile_results <- data.frame(do.call(rbind, all_profile_results), stringsAsFactors = FALSE)
return(all_profile_results)
- do.call(rbind, ...) stacks all of the per-profile lists row by row into a single matrix.
- stringsAsFactors = FALSE keeps text columns as character strings instead of converting them to factors.
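One thing to be aware of: because each profile is a list, the columns of the resulting data frame are list columns rather than plain character vectors. Here is a small sketch with two invented records showing that, plus one possible way to flatten them. The flattening step is my assumption about what is convenient, not something the scraper above does.

```r
# Two hypothetical profile records shaped like the lists built in the scraper
rows <- list(
  list(profile_name = "Ada Example", profile_interests = c("optics", "lasers")),
  list(profile_name = "Bob Example", profile_interests = character(0))
)

# a 2 x 2 matrix whose cells are list elements
stacked <- do.call(rbind, rows)
df <- data.frame(stacked, stringsAsFactors = FALSE)

# every column is a list at this point
print(class(df$profile_name))

# flatten: one string per row, interests joined with commas
df$profile_name <- unlist(df$profile_name)
df$profile_interests <- sapply(df$profile_interests, paste, collapse = ", ")

print(df)
```

Flattening like this makes the data frame easier to filter or write to CSV, at the cost of turning the interests into a single string.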
And as a final step, here's how we can access the data:
data <- scrape_all_profiles_from_university(label="physics", university_name="Harvard University")
all_data <- select(data, everything())
emails <- all_data %>% pull(profile_email)
for (email in emails) {
cat("- ", email, "\n")
}
Outputs:
[1] "extracting authors at page #0"
[1] "extracting authors at page #10"
[1] "extracting authors at page #20"
[1] "extracting authors at page #30"
[1] "extracting authors at page #40"
[1] "extracting authors at page #50"
> ...
- Verified email at neu.edu
- Verified email at seas.harvard.edu
- Verified email at physics.harvard.edu
- Verified email at cfa.harvard.edu
- Verified email at physics.harvard.edu
- Verified email at cfa.harvard.edu
- Verified email at seas.harvard.edu
- Verified email at mcb.harvard.edu
-
- Verified email at mcgill.ca
- Verified email at cfa.harvard.edu
- Verified email at bidmc.harvard.edu
- Verified email at physics.harvard.edu
- Verified email at physics.harvard.edu
- Verified email at hsph.harvard.edu
- Verified email at g.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at iisc.ac.in
- Verified email at fas.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at physics.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at bwh.harvard.edu
- Verified email at g.harvard.edu
- Verified email at seas.harvard.edu
- Verified email at g.harvard.edu
- Verified email at hsph.harvard.edu
- Verified email at math.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at cfa.harvard.edu
- Verified email at harvard.edu
- Verified email at polytechnique.edu
- Verified email at seas.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at g.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at g.harvard.edu
- Verified email at seas.harvard.edu
- Verified email at cfa.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at college.harvard.edu
- Verified email at fas.harvard.edu
- Verified email at go.cambridgecollege.edu
- Verified email at mgh.harvard.edu
- Verified email at hsph.harvard.edu
- Verified email at g.harvard.edu
-
- Verified email at g.harvard.edu
- Verified email at college.harvard.edu
- Verified email at college.harvard.edu
- Verified email at physics.uoc.gr
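As a quick follow-up, the email strings can be reduced to domains and counted. This is only a sketch on a few values copied from the output above, and it assumes the email column has already been turned into a plain character vector (for example with unlist()).

```r
# A few sample values taken from the output; NA stands for a profile without an email
emails <- c(
  "Verified email at physics.harvard.edu",
  "Verified email at cfa.harvard.edu",
  "Verified email at physics.harvard.edu",
  NA
)

# drop the fixed prefix so only the domain is left
domains <- sub("^Verified email at ", "", emails)

# table() ignores NA by default
domain_counts <- sort(table(domains), decreasing = TRUE)
print(domain_counts)
```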
Google Scholar Python scraper alternatives
If you want to extract more data from Google Scholar but haven't figured out how to do it in R, you can use one of these Python alternatives if you're comfortable with Python:
- scrape-google-scholar-py is an open-source project of mine that aims to extract all the possible data from Google Scholar. In the future I'll port it to R.
- scholarly is also an open-source project that extracts data from Google Scholar. The difference between it and my package is that mine aims to extract data from all possible pages, while scholarly does not.
Add a Feature Request💫 or a Bug🐞
