DEV Community

Scrape ResearchGate all institution members in Python

Dmitriy Zub ☀️ on May 06, 2022

Mohamed Hachaichi 🇺🇦 (@datum_geek)

Dmi, the code does not work? It says: "Error: It looks like you are using Playwright Sync API inside the asyncio loop. Please use the Async API instead." Also, it would be extremely helpful to extract data from each profile (Research Interest, Citations, and h-index).

Dmitriy Zub ☀️ (@dmitryzub) • Edited

Hi, @datum_geek :) The code works; most likely you got a captcha on your end. The provided code should be used together with proxies, or at least with a captcha-solving service.

After a certain number of requests, ResearchGate throws a captcha that needs to be solved.

Try changing the user-agent to your own: check what your browser's user-agent is and replace the one in the code.
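
A minimal sketch of setting a custom user-agent with Playwright's sync API. The `open_page` helper and the `PLACEHOLDER_USER_AGENT` string are hypothetical names made up for illustration, not taken from the original script; the user-agent value should be replaced with the one your own browser reports.

```python
# Hypothetical placeholder: replace with the user-agent string of your own browser.
PLACEHOLDER_USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/101.0.0.0 Safari/537.36"
)


def open_page(url: str, user_agent: str = PLACEHOLDER_USER_AGENT) -> str:
    """Open `url` with a custom user-agent and return the page HTML."""
    # Imported lazily so this module still loads if Playwright is not installed.
    from playwright.sync_api import sync_playwright

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        # The user-agent is configured per browser context.
        context = browser.new_context(user_agent=user_agent)
        page = context.new_page()
        page.goto(url)
        html = page.content()
        browser.close()
        return html
```

Changing the user-agent alone will probably not prevent a captcha after many requests, so this is best combined with the proxies mentioned above.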

Also, I'm not sure about the error, since the code imports sync_api and uses the context manager:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # ...

[GIF showing the output: 624 results in total]

The follow-up blog post will be about scraping the whole profile page 💫

Mohamed Hachaichi 🇺🇦 (@datum_geek)

Yes, indeed. Note that the code does not work in a Jupyter Notebook/Lab environment, but it can run in VS Code.

Dmitriy Zub ☀️ (@dmitryzub)

I don't work with notebooks too often :) I don't know of any way to make it work there other than using the async Playwright API instead.
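
A rough sketch of why the notebook raises that error and what the async alternative could look like. Jupyter already runs an asyncio event loop, and Playwright's sync API refuses to start inside one. The helper names `running_inside_event_loop` and `fetch_page_title` are hypothetical, and the example only fetches a page title rather than reproducing the scraper from the post.

```python
import asyncio


def running_inside_event_loop() -> bool:
    """Return True when called from code driven by a running asyncio loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        # No loop is running, e.g. a plain script started from the terminal.
        return False
    return True


async def fetch_page_title(url: str) -> str:
    """Async counterpart of a sync_playwright block; returns the page title."""
    # Imported lazily so the helper above works without Playwright installed.
    from playwright.async_api import async_playwright

    async with async_playwright() as p:
        browser = await p.chromium.launch(headless=True)
        page = await browser.new_page()
        await page.goto(url)
        title = await page.title()
        await browser.close()
        return title
```

In a notebook cell this would be called as `title = await fetch_page_title("https://example.com")`, since the notebook's loop is already running; in a plain script it would be `asyncio.run(fetch_page_title("https://example.com"))`.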

Dmitriy Zub ☀️ (@dmitryzub) • Edited

A possible workaround is to run the script in a terminal, save the data to a file, and then load that file inside the notebook.

Not very convenient, though.
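
The workaround could be sketched like this, assuming the script collects its results as a list of dicts. The function names `save_results` and `load_results` and the file name in the usage note are made up for illustration.

```python
import json
from pathlib import Path


def save_results(records: list, path: str) -> None:
    """Call at the end of the script that runs in the terminal."""
    text = json.dumps(records, indent=2, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")


def load_results(path: str) -> list:
    """Call from the notebook to read the saved data back."""
    return json.loads(Path(path).read_text(encoding="utf-8"))
```

For example, the script would end with `save_results(records, "members.json")`, and the notebook would start with `records = load_results("members.json")`.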

Dmitriy Zub ☀️ (@dmitryzub)

Let me know whether you figure it out 🙂

Mohamed Hachaichi 🇺🇦 (@datum_geek)

It does: it works from the terminal (P.S. the code inside VS Code also does not work; the loop does not break at all :p).

Dimi, tell me if you manage to collect extra data from ResearchGate (from inside each profile page).

Dmitriy Zub ☀️ (@dmitryzub)

Awesome 👍 What do you mean by it not working in VS Code? The code was written in VS Code. How do you run it there: with the Code Runner extension or from the terminal?

I have a blog post planned about scraping profile data. I'll try to publish it next week.

Dmitriy Zub ☀️ (@dmitryzub) • Edited

@datum_geek, the blog post about scraping profile information is available: dev.to/dmitryzub/scrape-researchga...

Thank you for your suggestion :)

Mohamed Hachaichi 🇺🇦 (@datum_geek)

One small question: how do I save the output as a pandas DataFrame?

Dmitriy Zub ☀️ (@dmitryzub) • Edited
import pandas as pd
df = pd.DataFrame(institution_memebers)

If you need to save it:

# https://stackoverflow.com/a/45141782/15164646
df.to_csv("your_csv_filename.csv", index=False)