Dmitriy Zub ☀️

Posted on • Updated on • Originally published at serpapi.com

Scrape Google Scholar Profiles based on University name in Python

What will be scraped


How university filtering works

| Search engine operator | Explanation | Search query |
|---|---|---|
| Label: `label:<keyword>` | Label is a search keyword | `label:computer_vision` |
| Double quotes: `""` | Specific `<university name>` search | `label:computer_vision "Michigan State University"` |
| Pipe operator: `OR` | `<univ. name>` OR `<univ. abbreviation name>` | `label:computer_vision "Michigan State University" OR "U.Michigan"` |
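
To make the query building explicit, here's a tiny sketch of composing such a mauthors value in Python. The helper name build_mauthors_query is made up for illustration; only the label:, double-quote, and OR operators come from the table above:

def build_mauthors_query(label: str, *university_names: str) -> str:
    # wrap each university name in double quotes and join alternatives with OR
    universities = " OR ".join(f'"{name}"' for name in university_names)
    return f"label:{label} {universities}"

print(build_mauthors_query("computer_vision", "Michigan State University", "U.Michigan"))
# label:computer_vision "Michigan State University" OR "U.Michigan"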


Prerequisites

Basic knowledge of scraping with CSS selectors

CSS selectors declare which part of the markup a style applies to, which also makes them handy for extracting data from matching tags and attributes.

If you haven't scraped with CSS selectors before, there's a dedicated blog post of mine about how to use CSS selectors when web-scraping that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.
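
If you just want a quick taste before reading that post, here's a minimal sketch with parsel on made-up markup (the HTML below is illustrative, not Google Scholar's real markup):

from parsel import Selector

html = '<div class="author"><a href="/profile/1">Jane Doe</a><span class="aff">Some University</span></div>'
selector = Selector(text=html)

name = selector.css(".author a::text").get()         # text inside the <a> tag -> "Jane Doe"
link = selector.css(".author a::attr(href)").get()   # href attribute value -> "/profile/1"
affiliation = selector.css(".aff::text").get()       # -> "Some University"

print(name, link, affiliation)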

Separate virtual environment

If you're on Linux:

python -m venv env && source env/bin/activate

If you're on Windows and using Git Bash:

python -m venv env && source env/Scripts/activate

If you haven't worked with a virtual environment before, have a look at the dedicated Python virtual environments tutorial using Virtualenv and Poetry blog post of mine to get familiar.

In short, a virtual environment creates an independent set of installed libraries, including different Python versions, that can coexist on the same system, preventing library or Python version conflicts.

📌Note: this is not a strict requirement for this blog post.

Install libraries:

pip install requests parsel google-search-results

Reduce the chance of being blocked

There's a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web-scraping; it covers eleven methods to bypass blocks from most websites.
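
Two of the simplest of those methods are rotating User-Agent strings and adding randomized delays between requests. A minimal sketch (the helper name and the User-Agent pool are illustrative; they don't guarantee you won't be blocked):

import random, time
import requests

# small illustrative pool of User-Agent strings
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.1 Safari/605.1.15",
]

def polite_get(url: str, **kwargs) -> requests.Response:
    headers = {"User-Agent": random.choice(user_agents)}  # rotate User-Agent on every request
    time.sleep(random.uniform(1, 3))                      # randomized delay between requests
    return requests.get(url, headers=headers, timeout=30, **kwargs)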


Full Code

import requests, re, json
from parsel import Selector

def scrape_all_authors_from_university(label: str, university_name: str):

    params = {
        "view_op": "search_authors",                       # author results
        "mauthors": f'label:{label} "{university_name}"',  # search query
        "hl": "en",                                        # language
        "astart": 0                                        # page number
    }

    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
    }

    profile_results = []

    profiles_is_present = True
    while profiles_is_present:

        html = requests.get("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
        select = Selector(html.text)

        print(f"extracting authors at page #{params['astart']}.")

        for profile in select.css(".gs_ai_chpr"):
            name = profile.css(".gs_ai_name a::text").get()
            link = f'https://scholar.google.com{profile.css(".gs_ai_name a::attr(href)").get()}'
            affiliations = profile.css(".gs_ai_aff").xpath('normalize-space()').get()
            email = profile.css(".gs_ai_eml::text").get()
            cited_by = re.search(r"\d+", profile.css(".gs_ai_cby::text").get()).group()  # Cited by 17143 -> 17143
            interests = profile.css(".gs_ai_one_int::text").getall()

            profile_results.append({
                "profile_name": name,
                "profile_link": link,
                "profile_affiliations": affiliations,
                "profile_email": email,
                "profile_city_by_count": cited_by,
                "profile_interests": interests
            })

        # if next page token is present -> update next page token and increment 10 to get the next page
        if select.css("button.gs_btnPR::attr(onclick)").get():
            # https://regex101.com/r/e0mq0C/1
            params["after_author"] = re.search(r"after_author\\x3d(.*)\\x26", select.css("button.gs_btnPR::attr(onclick)").get()).group(1)  # -> XB0HAMS9__8J
            params["astart"] += 10
        else:
            profiles_is_present = False

    return profile_results


print(json.dumps(scrape_all_authors_from_university(label="biology", university_name="Michigan University"), indent=2))

Code Explanation

Import libraries:

import requests, re, json
from parsel import Selector
| Library | Explanation |
|---|---|
| `requests` | to make a request. |
| `re` | to match parts of the HTML via a regular expression. |
| `json` | for pretty-printing, in this case. |
| `parsel` | to extract and remove data from HTML and XML documents. |

Define a function:

def scrape_all_authors_from_university(label: str, university_name: str):
    # further code
| Code | Explanation |
|---|---|
| `label: str, university_name: str` | parameter annotations that tell that `label` and `university_name` should be a `str`. |

Create search query params and request headers:

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "view_op": "search_authors",                       # author results
    "mauthors": f'label:{label} "{university_name}"',  # search query
    "hl": "en",                                        # language
    "astart": 0                                        # page number
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}
| Code | Explanation |
|---|---|
| `User-Agent` | to pretend that a "real" user sends the request, not a bot or a script. |

Create temporary list to store extracted data:

profile_results = []

Create a while loop:

profiles_is_present = True
while profiles_is_present:
    # further code..

Make a request and pass URL params and headers:

html = requests.get("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
select = Selector(html.text)
| Code | Explanation |
|---|---|
| `timeout=30` | to tell `requests` to stop waiting for a response after 30 seconds. |
| `Selector()` | similar to a `BeautifulSoup()` object, if you've used it before. |
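
If you're coming from BeautifulSoup, the rough equivalence looks like this (a comparison sketch only; beautifulsoup4 would need to be installed and is not required for this blog post):

from parsel import Selector
from bs4 import BeautifulSoup  # only needed for this comparison

html = '<p class="intro">Hello</p>'  # illustrative markup

print(Selector(text=html).css("p.intro::text").get())                       # Hello
print(BeautifulSoup(html, "html.parser").select_one("p.intro").get_text())  # Hello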

Extract the data:

for profile in select.css(".gs_ai_chpr"):
    name = profile.css(".gs_ai_name a::text").get()
    link = f'https://scholar.google.com{profile.css(".gs_ai_name a::attr(href)").get()}'
    affiliations = profile.css(".gs_ai_aff").xpath('normalize-space()').get()
    email = profile.css(".gs_ai_eml::text").get()
    cited_by = re.search(r"\d+", profile.css(".gs_ai_cby::text").get()).group()  # Cited by 17143 -> 17143
    interests = profile.css(".gs_ai_one_int::text").getall()
| Code | Explanation |
|---|---|
| `::text` / `::attr(<attribute_name>)` | parsel pseudo-elements to grab the text or an attribute of an element node; `get()` grabs the actual data. |
| `xpath('normalize-space()')` | to get the element's text with whitespace normalized, including text spread across child nodes. |
| `re.search(r"\d+", ...)` | to extract only the digits from the "Cited by <number>" string. |
| `getall()` | to return a list of all matches. |
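
Here's a tiny standalone sketch of those methods on made-up markup (not Google Scholar's real HTML), in case you want to poke at them in isolation:

from parsel import Selector

html = """
<div class="card" data-id="42">
  <span>Deep</span> <span>Learning</span>
  <a class="tag">AI</a>
  <a class="tag">ML</a>
</div>
"""
selector = Selector(text=html)

print(selector.css(".card::attr(data-id)").get())              # 42
print(selector.css(".card").xpath("normalize-space()").get())  # Deep Learning AI ML
print(selector.css(".tag::text").getall())                     # ['AI', 'ML']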

Append the extracted data to the temporary list as a dictionary:

profile_results.append({
    "profile_name": name,
    "profile_link": link,
    "profile_affiliations": affiliations,
    "profile_email": email,
    "profile_city_by_count": cited_by,
    "profile_interests": interests
})

Check if the next page token is present:

# if next page token is present -> update next page token and increment 10 to get the next page
if select.css("button.gs_btnPR::attr(onclick)").get():
    # https://regex101.com/r/e0mq0C/1
    params["after_author"] = re.search(r"after_author\\x3d(.*)\\x26", select.css("button.gs_btnPR::attr(onclick)").get()).group(1)  # -> XB0HAMS9__8J
    params["astart"] += 10
else:
    profiles_is_present = False
| Code | Explanation |
|---|---|
| `re.search()` | to search for the next page token via a regular expression. |
| `params["astart"] += 10` | to increment the query parameter to the next page. |

Return and print the data:

return profile_results

print(json.dumps(scrape_all_authors_from_university(label="biology", university_name="Michigan University"), indent=2))

Part of the output:

[
  {
    "profile_name": "Richard McCabe",
    "profile_link": "https://scholar.google.com/citations?hl=en&user=EL414mgAAAAJ",
    "profile_affiliations": "Central Michigan University",
    "profile_email": "Verified email at cmich.edu",
    "profile_city_by_count": "992",
    "profile_interests": [
      "Biology",
      "Physiology",
      "Pathophysiology"
    ]
  }, ... other profiles
]

SerpApi Solution

Alternatively, you can achieve the same by using Google Scholar Profiles API from SerpApi.

The difference is that there's no need to create and maintain the parser, figure out how to bypass blocks from search engines, or work out how to scale it.

Example code to integrate to achieve almost the same as in the parsel example:

import os, json
from urllib.parse import urlsplit, parse_qsl
from serpapi import GoogleSearch


def serpapi_scrape_all_authors_from_university(label: str, university_name: str):
    params = {
        "api_key": os.getenv("API_KEY"),                   # SerpApi API key
        "engine": "google_scholar_profiles",               # profile results search engine
        "mauthors":  f'label:{label} "{university_name}"'  # search query
    }
    search = GoogleSearch(params)

    profile_results_data = []

    profiles_is_present = True
    while profiles_is_present:
        profile_results = search.get_dict()

        for profile in profile_results["profiles"]:
            thumbnail = profile["thumbnail"]
            name = profile["name"]
            link = profile["link"]
            author_id = profile["author_id"]
            affiliations = profile["affiliations"]
            email = profile.get("email")
            cited_by = profile.get("cited_by")
            interests = profile.get("interests")

            profile_results_data.append({
                "thumbnail": thumbnail,
                "name": name,
                "link": link,
                "author_id": author_id,
                "email": email,
                "affiliations": affiliations,
                "cited_by": cited_by,
                "interests": interests
            })

        if "next" in profile_results.get("serpapi_pagination", {}):
            # splits URL in parts as a dict() and update search "params" variable to a new page that will be passed to GoogleSearch()
            search.params_dict.update(dict(parse_qsl(urlsplit(profile_results.get("serpapi_pagination").get("next")).query)))
        else:
            profiles_is_present = False

    return profile_results_data


print(json.dumps(serpapi_scrape_all_authors_from_university(label="biology", university_name="Michigan University"), indent=2))

Import libraries:

import os, json
from urllib.parse import urlsplit, parse_qsl
from serpapi import GoogleSearch
| Library | Explanation |
|---|---|
| `os` | to access the SerpApi API key stored in an environment variable. |
| `json` | for pretty-printing the output, in this case. |
| `urllib` | to split the next page URL into parts and pass the new page parameters to `GoogleSearch()`. |
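
To make that concrete, here's an isolated sketch with a made-up "next page" URL (the parameter values are placeholders, not an actual SerpApi response):

from urllib.parse import urlsplit, parse_qsl

# illustrative next page URL; the real one comes from profile_results["serpapi_pagination"]["next"]
next_page_url = "https://serpapi.com/search.json?engine=google_scholar_profiles&mauthors=label%3Abiology&after_author=ABCDEFG12345"

# grab the query string part of the URL and turn it into a dict of parameters
next_page_params = dict(parse_qsl(urlsplit(next_page_url).query))
print(next_page_params)
# {'engine': 'google_scholar_profiles', 'mauthors': 'label:biology', 'after_author': 'ABCDEFG12345'}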

Define a function with argument annotations:

def serpapi_scrape_all_authors_from_university(label: str, university_name: str):
    # further code

Create search parameters and pass them to the search:

params = {
    "api_key": os.getenv("API_KEY"),                   # SerpApi API key
    "engine": "google_scholar_profiles",               # profile results search engine
    "mauthors":  f'label:{label} "{university_name}"'  # search query
}
search = GoogleSearch(params)                          # where data extraction happens

Create a temporary list where all the extracted data will be stored:

profile_results_data = []

Create a while loop:

profiles_is_present = True
while profiles_is_present:
    profile_results = search.get_dict()  # JSON converted to Python dictionary 
    # further code..
| Code | Explanation |
|---|---|
| `search.get_dict()` | needs to be inside the while loop because the search parameters are updated after each iteration. If it were outside the while loop, the same search parameters (token ID) would be applied over and over again. |

Iterate over profile results:

for profile in profile_results["profiles"]:

    print(f'Currently extracting {profile["name"]} with {profile["author_id"]} ID.')

    thumbnail = profile["thumbnail"]
    name = profile["name"]
    link = profile["link"]
    author_id = profile["author_id"]
    affiliations = profile["affiliations"]
    email = profile.get("email")
    cited_by = profile.get("cited_by")
    interests = profile.get("interests")

Append the extracted data to the temporary list:

profile_results_data.append({
    "thumbnail": thumbnail,
    "name": name,
    "link": link,
    "author_id": author_id,
    "email": email,
    "affiliations": affiliations,
    "cited_by": cited_by,
    "interests": interests
})

Check if next page token is present:

if "next" in profile_results.get("serpapi_pagination", {}):
    # splits URL in parts as a dict() and update search "params" variable to a new page that will be passed to GoogleSearch()
    search.params_dict.update(dict(parse_qsl(urlsplit(profile_results.get("serpapi_pagination").get("next")).query)))
else:
    profiles_is_present = False

return profile_results_data

Print extracted data:

print(json.dumps(serpapi_scrape_all_authors_from_university(label="biology", university_name="Michigan University"), indent=2))

Part of the output:

[
  {
    "thumbnail": "https://scholar.googleusercontent.com/citations?view_op=small_photo&user=EL414mgAAAAJ&citpid=3",
    "name": "Richard McCabe",
    "link": "https://scholar.google.com/citations?hl=en&user=EL414mgAAAAJ",
    "author_id": "EL414mgAAAAJ",
    "email": "Verified email at cmich.edu",
    "affiliations": "Central Michigan University",
    "cited_by": 992,
    "interests": [
      {
        "title": "Biology",
        "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Abiology",
        "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:biology"
      },
      {
        "title": "Physiology",
        "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Aphysiology",
        "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:physiology"
      },
      {
        "title": "Pathophysiology",
        "serpapi_link": "https://serpapi.com/search.json?engine=google_scholar_profiles&hl=en&mauthors=label%3Apathophysiology",
        "link": "https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:pathophysiology"
      }
    ]
  }, ... other profiles results
]

Join us on Twitter | YouTube

Add a Feature Request💫 or a Bug🐞

Oldest comments (6)

Jiefeng Sun • Edited

Hello Dmitriy,
Does it mean that I am blocked if I start to have the following output after it initially worked for a while?
""""
extracting authors at page #0.
[]
""""

Another question is:
"""
params["astart"] += 10 to increment query parameter to a next page.
"""
Why "10"? I tried "params["astart"] += 1". I seems nothing changed.

It seems it won't change anything if I change the initial "params["astart"] = 0" to "params["astart"] = 10".

Can you please explain this a little bit more? Thank you!

Dmitriy Zub ☀️ • Edited

Being blocked

Yes, such output could be because your request is being blocked or something similar:

extracting authors at page #0.
[]

To see what the problem is, you can print the response text to see what HTML is returned:

import requests, re, json
from parsel import Selector

params = {
    "view_op": "search_authors",                       # author results
    "mauthors": 'label:biology "Michigan University"',  # search query
    "hl": "en",                                        # language
    "astart": 0                                        # page number
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://scholar.google.com/citations", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

print(html.text)

You can copy the whole HTML output and paste it into an online HTML previewer if you don't want to read the output in the terminal.

Next page query parameter

Why "10"? I tried "params["astart"] += 1". I seems nothing changed.

It's 10 because this is how Google behaves. If you click the next page arrow button yourself and track the astart parameter value in your browser, you'll see that it increments by 10 each time you click the next page button.

It seems it won't change anything if I change the initial "params["astart"] = 0" to "params["astart"] = 10"

It's my bad, I forgot to add an explanation about the after_author query parameter.

This parameter contains the next page token value, for example XB0HAMS9__8J, and is used in combination with the astart parameter. As you said, astart doesn't affect anything without after_author.

Example

If you look in your browser URL, when you're on the first page, URL would look something like this:

https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label%3Abiology+%22Stanford+University%22&btnG=

When you click next page arrow button (2nd page), your URL would be like this:

https://scholar.google.com/citations?view_op=search_authors&hl=en&mauthors=label:biology+%22Stanford+University%22&after_author=i5ojAdH5__8J&astart=10

Where after_author=i5ojAdH5__8J is the token for the next page (3rd in this case) and astart=10 tracks the current page, i.e. the 2nd page in this case.

📌Note: you might think that astart is useless, because why not just parse the after_author next page token value and call it good? You can do that, but the page order will get messed up, so results could be inconsistent:


11-20 will become 1-10 if you just use after_author next page token for the 3rd page in this case.

To sum up

The after_author (next page token) parameter should (must) be used in combination with the astart parameter that tracks the current page.


If you were using the same search query I've provided, i.e. label:biology "Michigan University", then yes, nothing will change because there's only one page of user profiles, so changing its value won't do anything 🙂

Let me know if it answers your questions 🎈

Jiefeng Sun

Thank you for the detailed response. It helps a lot!

Dmitriy Zub ☀️ • Edited

Of course 🤗 Ask me more whenever you need.

Another way to ask me is by using the #AskSerpApi hashtag on Twitter. However, that's more for SerpApi usage questions.

Jiefeng Sun

Hello Dmitriy,

I believe there is a bug in " cited_by = re.search(r"\d+", profile.xpath('//div[@class="gs_ai_cby"]').get()).group() # Cited by 17143 -> 17143".

The citation numbers for all profiles on a page come out the same.

Dmitriy Zub ☀️ • Edited

Hi, Jiefeng 🙂 Thank you for your notice, there was indeed a bug 🐛 I've updated the code.

A working solution is to use css() with the ::text pseudo-element (grabs the text value) instead: re.search(r"\d+", profile.css(".gs_ai_cby::text").get()).group()

Output from the terminal: