What will be scraped
Prerequisites
Basic knowledge of scraping with CSS selectors
If you haven't scraped with CSS selectors before, I have a dedicated blog post on how to use CSS selectors when web scraping. It covers what they are, their pros and cons, why they matter from a web-scraping perspective, and the most common approaches to using them.
Separate virtual environment
If you haven't worked with a virtual environment before, have a look at my dedicated Python virtual environments tutorial using Virtualenv and Poetry to get familiar.
Reduce the chance of being blocked
There's a chance that a request might be blocked. Have a look at how to reduce the chance of being blocked while web scraping; it covers eleven methods to bypass blocks from most websites.
Install libraries:
pip install parsel playwright
Full Code
from parsel import Selector
from playwright.sync_api import sync_playwright
import json


def scrape_researchgate_publications(query: str):
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True, slow_mo=50)
        page = browser.new_page(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36")

        publications = []
        page_num = 1

        while True:
            page.goto(f"https://www.researchgate.net/search/publication?q={query}&page={page_num}")
            selector = Selector(text=page.content())

            for publication in selector.css(".nova-legacy-c-card__body--spacing-inherit"):
                title = publication.css(".nova-legacy-v-publication-item__title .nova-legacy-e-link--theme-bare::text").get().title()
                title_link = f'https://www.researchgate.net{publication.css(".nova-legacy-v-publication-item__title .nova-legacy-e-link--theme-bare::attr(href)").get()}'
                publication_type = publication.css(".nova-legacy-v-publication-item__badge::text").get()
                publication_date = publication.css(".nova-legacy-v-publication-item__meta-data-item:nth-child(1) span::text").get()
                publication_doi = publication.css(".nova-legacy-v-publication-item__meta-data-item:nth-child(2) span").xpath("normalize-space()").get()
                publication_isbn = publication.css(".nova-legacy-v-publication-item__meta-data-item:nth-child(3) span").xpath("normalize-space()").get()
                authors = publication.css(".nova-legacy-v-person-inline-item__fullname::text").getall()
                source_link = f'https://www.researchgate.net{publication.css(".nova-legacy-v-publication-item__preview-source .nova-legacy-e-link--theme-bare::attr(href)").get()}'

                publications.append({
                    "title": title,
                    "link": title_link,
                    "source_link": source_link,
                    "publication_type": publication_type,
                    "publication_date": publication_date,
                    "publication_doi": publication_doi,
                    "publication_isbn": publication_isbn,
                    "authors": authors
                })

            print(f"page number: {page_num}")

            # checks if next page arrow key is greyed out `attr(rel)` (inactive) and breaks out of the loop
            if selector.css(".nova-legacy-c-button-group__item:nth-child(9) a::attr(rel)").get():
                break
            else:
                page_num += 1

        print(json.dumps(publications, indent=2, ensure_ascii=False))

        browser.close()


scrape_researchgate_publications(query="coffee")
Code explanation
Import libraries:
from parsel import Selector
from playwright.sync_api import sync_playwright
import json
Code | Explanation |
---|---|
parsel | to parse HTML/XML documents. Supports XPath. |
playwright | to render the page with a browser instance. |
json | to convert Python dictionary to JSON string. |
Define a function and open playwright with a context manager:
def scrape_researchgate_publications(query: str):
    with sync_playwright() as p:
        # ...
Code | Explanation |
---|---|
query: str | a type hint indicating that query is expected to be a str. Python does not enforce it at runtime. |
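Type hints document intent but are not checked when the function is called, which a short sketch can show. The function name `build_label` is hypothetical and exists only for illustration:

```python
def build_label(query: str) -> str:
    # The annotation documents intent; Python does not check it at call time.
    return f"query={query}"


# Passing an int still works because annotations are only metadata.
print(build_label("coffee"))
print(build_label(42))

# The hints are stored on the function object for tools such as type checkers.
print(build_label.__annotations__)
```

A static type checker would flag the second call, but the interpreter itself runs it without complaint.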
Launch a browser instance and open a new_page with the passed user-agent:
browser = p.chromium.launch(headless=True, slow_mo=50)
page = browser.new_page(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36")
Code | Explanation |
---|---|
p.chromium.launch() | to launch a Chromium browser instance. |
headless | to explicitly tell playwright to run in headless mode, even though it's the default value. |
slow_mo | to slow down playwright operations by the specified number of milliseconds. |
browser.new_page() | to open a new page. user_agent makes the request look like it comes from a real user's browser. If omitted, the argument defaults to None and playwright falls back to the browser's own default user-agent. Check what's your user-agent. |
Add a temporary list, set up a while loop, and open a new URL:
publications = []
page_num = 1

while True:
    page.goto(f"https://www.researchgate.net/search/publication?q={query}&page={page_num}")
    selector = Selector(text=page.content())
    # ...
Code | Explanation |
---|---|
goto() | to make a request to a specific URL with the passed query and page parameters. |
Selector() | to pass the HTML returned by page.content() and process it. |
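The query is interpolated into the URL as-is, which may misbehave for queries that contain spaces or characters such as `&`. Below is a minimal sketch of building the same URL with the standard library's `urlencode`; `build_search_url` and `BASE_SEARCH_URL` are hypothetical names, not part of the original script:

```python
from urllib.parse import urlencode

BASE_SEARCH_URL = "https://www.researchgate.net/search/publication"


def build_search_url(query: str, page_num: int = 1) -> str:
    # urlencode percent-encodes reserved characters and joins pairs with "&".
    params = urlencode({"q": query, "page": page_num})
    return f"{BASE_SEARCH_URL}?{params}"


print(build_search_url("coffee"))
print(build_search_url("coffee & tea", page_num=3))
```

The returned string could then be passed to `page.goto()` in place of the f-string.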
Iterate over publication results on each page, extract the data, and append it to the temporary list:
for publication in selector.css(".nova-legacy-c-card__body--spacing-inherit"):
    title = publication.css(".nova-legacy-v-publication-item__title .nova-legacy-e-link--theme-bare::text").get().title()
    title_link = f'https://www.researchgate.net{publication.css(".nova-legacy-v-publication-item__title .nova-legacy-e-link--theme-bare::attr(href)").get()}'
    publication_type = publication.css(".nova-legacy-v-publication-item__badge::text").get()
    publication_date = publication.css(".nova-legacy-v-publication-item__meta-data-item:nth-child(1) span::text").get()
    publication_doi = publication.css(".nova-legacy-v-publication-item__meta-data-item:nth-child(2) span").xpath("normalize-space()").get()
    publication_isbn = publication.css(".nova-legacy-v-publication-item__meta-data-item:nth-child(3) span").xpath("normalize-space()").get()
    authors = publication.css(".nova-legacy-v-person-inline-item__fullname::text").getall()
    source_link = f'https://www.researchgate.net{publication.css(".nova-legacy-v-publication-item__preview-source .nova-legacy-e-link--theme-bare::attr(href)").get()}'

    publications.append({
        "title": title,
        "link": title_link,
        "source_link": source_link,
        "publication_type": publication_type,
        "publication_date": publication_date,
        "publication_doi": publication_doi,
        "publication_isbn": publication_isbn,
        "authors": authors
    })
Code | Explanation |
---|---|
css() | to parse data from the passed CSS selector(s). Every CSS query is translated to XPath using the cssselect package under the hood. |
::text / ::attr(attribute) | to extract textual or attribute data from the node. |
get() / getall() | to get the first matched value as a string, or a list of all matched values. |
xpath("normalize-space()") | to return the node's full text (including text of nested elements) with surrounding whitespace trimmed and runs of whitespace collapsed. |
Check if the next page is present and paginate:
# checks if the next page arrow key is greyed out `attr(rel)` (inactive) -> breaks out of the loop
if selector.css(".nova-legacy-c-button-group__item:nth-child(9) a::attr(rel)").get():
    break
else:
    page_num += 1
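The `while True` loop stops only when the `rel` attribute is found, so a changed selector could keep it running indefinitely. Below is a minimal sketch of the same stop-condition pattern with a hard page cap. `paginate`, `fetch_page`, `fake_fetch`, and `max_pages` are hypothetical stand-ins, and no network requests are made:

```python
from typing import Callable, List, Tuple


def paginate(fetch_page: Callable[[int], Tuple[List[str], bool]], max_pages: int = 50) -> List[str]:
    """Collect items page by page until the last page or the cap is reached."""
    items: List[str] = []
    page_num = 1
    while page_num <= max_pages:
        page_items, is_last_page = fetch_page(page_num)
        items.extend(page_items)
        if is_last_page:
            break
        page_num += 1
    return items


def fake_fetch(page_num: int) -> Tuple[List[str], bool]:
    # Stand-in for page.goto() + parsing: pretends the site has three pages.
    return [f"publication {page_num}"], page_num == 3


print(paginate(fake_fetch))
print(paginate(fake_fetch, max_pages=2))
```

In the real script, `fetch_page` would wrap the `page.goto()` call, the extraction loop, and the `rel` attribute check.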
Print extracted data and close the browser instance:
print(json.dumps(publications, indent=2, ensure_ascii=False))
browser.close()
# call the function
scrape_researchgate_publications(query="coffee")
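The `ensure_ascii=False` argument matters for the non-ASCII author names in the results. A small sketch of the difference, using a one-item sample list rather than real scraped data:

```python
import json

sample = [{"authors": ["Gülşen Berat Torusdağ"]}]

# Default behaviour escapes every non-ASCII character as \uXXXX.
escaped = json.dumps(sample)

# ensure_ascii=False keeps the characters readable in the output.
readable = json.dumps(sample, indent=2, ensure_ascii=False)

print(escaped)
print(readable)
```

Both strings decode back to the same data; only the readability of the printed text differs.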
Part of the JSON output:
[
  {
    "title":"The Social Life Of Coffee Turkey’S Local Coffees",
    "link":"https://www.researchgate.netpublication/360540595_The_Social_Life_of_Coffee_Turkey%27s_Local_Coffees?_sg=kzuAi6HlFbSbnLEwtGr3BA_eiFtDIe1VEA4uvJlkBHOcbSjh5XlSQe6GpYvrbi12M0Z2MQ6grwnq9fI",
    "source_link":"https://www.researchgate.netpublication/360540595_The_Social_Life_of_Coffee_Turkey%27s_Local_Coffees?_sg=kzuAi6HlFbSbnLEwtGr3BA_eiFtDIe1VEA4uvJlkBHOcbSjh5XlSQe6GpYvrbi12M0Z2MQ6grwnq9fI",
    "publication_type":"Conference Paper",
    "publication_date":"Apr 2022",
    "publication_doi":null,
    "publication_isbn":null,
    "authors":[
      "Gülşen Berat Torusdağ",
      "Merve Uçkan Çakır",
      "Cinucen Okat"
    ]
  },
  {
    "title":"Coffee With The Algorithm",
    "link":"https://www.researchgate.netpublication/359599064_Coffee_with_the_Algorithm?_sg=3KHP4SXHm_BSCowhgsa4a2B0xmiOUMyuHX2nfqVwRilnvd1grx55EWuJqO0VzbtuG-16TpsDTUywp0o",
    "source_link":"https://www.researchgate.netNone",
    "publication_type":"Chapter",
    "publication_date":"Mar 2022",
    "publication_doi":"DOI: 10.4324/9781003170884-10",
    "publication_isbn":"ISBN: 9781003170884",
    "authors":[
      "Jakob Svensson"
    ]
  }, ... other publications
  {
    "title":"Coffee In Chhattisgarh", # last publication
    "link":"https://www.researchgate.netpublication/353118247_COFFEE_IN_CHHATTISGARH?_sg=CsJ66DoWjFfkMNdujuE-R9aVTZA4kVb_9lGiy1IrYXls1Nur4XFMdh2s5E9zkF5Skb5ZZzh663USfBA",
    "source_link":"https://www.researchgate.netNone",
    "publication_type":"Technical Report",
    "publication_date":"Jul 2021",
    "publication_doi":null,
    "publication_isbn":null,
    "authors":[
      "Krishan Pal Singh",
      "Beena Nair Singh",
      "Dushyant Singh Thakur",
      "Anurag Kerketta",
      "Shailendra Kumar Sahu"
    ]
  }
]
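In the output above, the links lack a `/` after the host name (`researchgate.netpublication/...`), and a missing source link becomes `...netNone`, because the f-string concatenates whatever `get()` returns. Below is a hedged sketch of one way to handle both cases with `urljoin`, plus stripping the `DOI:` and `ISBN:` labels. `absolute_url`, `strip_label`, and `BASE_URL` are hypothetical names, not part of the original script:

```python
from typing import Optional
from urllib.parse import urljoin

BASE_URL = "https://www.researchgate.net/"


def absolute_url(href: Optional[str]) -> Optional[str]:
    # Return None for a missing href instead of building "...netNone".
    if not href:
        return None
    # urljoin resolves the relative path against the base, including the slash.
    return urljoin(BASE_URL, href)


def strip_label(value: Optional[str], label: str) -> Optional[str]:
    # Turn "DOI: 10.4324/..." into "10.4324/..."; pass None through unchanged.
    if value is None:
        return None
    if value.startswith(label):
        return value[len(label):].strip()
    return value


print(absolute_url("publication/359599064_Coffee_with_the_Algorithm"))
print(absolute_url(None))
print(strip_label("DOI: 10.4324/9781003170884-10", "DOI:"))
```

With helpers like these, a missing link would be serialized as `null` in the JSON, matching how missing DOI and ISBN values already appear.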
Links