Thanks, dear Dmitriy, for a great post.
I have two problems when scraping profiles:
1. When I use your code it returns 0 pages (I don't want to use SerpApi) or "unusual traffic", and I couldn't solve it (maybe I have been blocked by Google).
2. I wrote code that works for me using urllib, but now I notice a problem: the query only returns exact matches, not substrings. For example, I'm looking for ecology at Michigan State University, and it doesn't return "Scott C Stark", but if I write exactly "Tropical_Forest_Ecology" then it does return "Scott C Stark".
How do I make it return results for any word, even as a substring?
Hey, @mohammadreza20 🙂
Yes, most likely you got blocked by Google. But I don't have enough context to give a proper answer 🙂
Have a look at the scrape-google-scholar-py custom backend solution. It uses `selenium-stealth` under the hood, which bypasses the Cloudflare captcha and other captchas (and IP rate limits). It's a package of mine. Note that it's in early alpha. Open an issue if you find any bugs.
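For reference, here's a minimal sketch of the `selenium-stealth` pattern the package relies on; the query and the `.gs_rt` selector are illustrative assumptions, not the package's internals:

```python
# Minimal sketch: fetch a Google Scholar page with selenium-stealth.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium_stealth import stealth

options = webdriver.ChromeOptions()
options.add_argument("--headless=new")
driver = webdriver.Chrome(options=options)

# Patch automation fingerprints (navigator.webdriver, WebGL vendor, etc.)
stealth(
    driver,
    languages=["en-US", "en"],
    vendor="Google Inc.",
    platform="Win32",
    webgl_vendor="Intel Inc.",
    renderer="Intel Iris OpenGL Engine",
    fix_hairline=True,
)

driver.get("https://scholar.google.com/scholar?q=ecology")
for title in driver.find_elements(By.CSS_SELECTOR, ".gs_rt"):
    print(title.text)

driver.quit()
```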
Without seeing your code I can only give you a very generic answer, which wouldn't be helpful. Show the code you're having difficulties with if you want a deeper answer.
@dmitryzub Thank you for your response.
Question 2 is important for me but I couldn't solve it.
My code:
Hi Dmitriy,
Thank you for your very useful article!
However, when using BeautifulSoup I can only scrape the first page, for example when scraping a profile. Or if I need to scrape citations from a specific author, it only scrapes the first few citations instead of all of them. How can I scrape all the results (in multiple pages) using BeautifulSoup or SerpApi?
Thanks!
Hi @rafambraga,
Thank you for finding it helpful! Yes, the example I've shown in this blog post scrapes only the first profiles page. This blog post is old and needs an upgrade.
I've answered a Stack Overflow question about scraping profile results from all pages using both `bs4` and SerpApi, with an example in the online IDE. About citations: I've also written a code snippet to scrape citations in `bs4`. Note that the example scrapes only BibTeX data; you need to add a few lines of code to scrape all of them. Besides that, there are also dedicated blog posts on scraping historic Google Scholar results using Python and scraping all Google Scholar Profile and Author results to CSV with Python and SerpApi.
If you need to scrape profiles from a certain university, there's also a dedicated "Scrape Google Scholar Profiles from a certain University in Python" blog post just about it, with a step-by-step explanation 👀
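As a rough illustration of the `bs4` pagination idea, here's a minimal sketch assuming the current organic-results markup (`.gs_r.gs_or.gs_scl`, `.gs_ico_nav_next`) and the `start` URL parameter, all of which Google may change:

```python
# Minimal sketch: paginate Google Scholar organic results with requests + bs4.
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
params = {"q": "biology", "hl": "en", "start": 0}

while True:
    html = requests.get("https://scholar.google.com/scholar",
                        params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    for result in soup.select(".gs_r.gs_or.gs_scl"):
        print(result.select_one(".gs_rt").text)

    # the "Next" arrow is missing on the last page
    if soup.select_one(".gs_ico_nav_next"):
        params["start"] += 10
    else:
        break
```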
Do you have any recommendations for scraping all pages of an organic search result? I tried adding `"start": 0` to the parameters and just manually changing it, but it seems to repeat results occasionally. I also tried to follow your "Scrape historic Google Scholar results using Python" script but keep getting the error `KeyError: 'serpapi_pagination'`.
Thanks for any help you can provide!
@dlittlewood12 Thank you for reaching out!
Most likely the reason you're getting the `KeyError: 'serpapi_pagination'` error is that you need to pass your `API_KEY` to `os.getenv("API_KEY")`. In the terminal, type `API_KEY=<your-api-key> python your-script-file.py`
Or remove it completely and pass the API key as a string inside the `params` dict, for example: `"api_key": "2132414122asdsadadaa"`
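To show both options side by side, here's a minimal sketch assuming the `google-search-results` package; the query and key are placeholders:

```python
# Minimal sketch: two ways to pass a SerpApi API key.
import os
from serpapi import GoogleSearch

params = {
    "engine": "google_scholar",
    "q": "biology",
    # Option 1: read the key from the environment, i.e. run the script as
    #   API_KEY=<your-api-key> python your-script-file.py
    "api_key": os.getenv("API_KEY"),
    # Option 2: hardcode it as a plain string instead:
    # "api_key": "<your-api-key>",
}

results = GoogleSearch(params).get_dict()
# SerpApi puts an "error" key in the response when the key is missing or wrong
print(results.get("error", "API key accepted"))
```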
Let me know if that answers your question 🎈
@rafambraga I've just published major updates to this blog post, which include:
And other changes 🐱👤🐱🏍
Hello Dmitriy,
I would like to know if there is any parameter (using SerpApi) so that we can scrape author profiles with a certain minimum number of citations.
Hey @sim777, thank you for your question. SerpApi doesn't have such a parameter, and Google Scholar itself (as far as I know) doesn't have it either.
As a workaround you can always do an `if` condition manually: access the `cited_by` key from the SerpApi response and check whether it's bigger or lower than the value you provide, and if the condition is true, extract the profile. Example code of what I mean:
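A minimal sketch of that workaround, assuming the Google Scholar Profiles engine where each entry in `results["profiles"]` carries a `cited_by` count; the query and threshold are illustrative:

```python
# Minimal sketch: keep only author profiles above a citation threshold.
from serpapi import GoogleSearch

params = {
    "engine": "google_scholar_profiles",
    "mauthors": "biology",        # profiles search query
    "api_key": "<your-api-key>",
}

results = GoogleSearch(params).get_dict()

MIN_CITATIONS = 1000
for profile in results.get("profiles", []):
    # profiles with no citations may lack the "cited_by" field
    if profile.get("cited_by", 0) >= MIN_CITATIONS:
        print(profile["name"], profile["cited_by"])
```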
What parameter can I use in the search query so that I can scrape authors with a certain number of citations?
Hey @sim777, I've answered your question in the response above. I showed a SerpApi example, but you can do the same thing with your own solution without SerpApi 👍
Hi Dmitriy, I noticed that you only scrape the shortened descriptions of the papers, not the entire description. If you look at the Google Scholar search results page, only a short excerpt from the abstract, title, or authors (if there are many) is shown, ending with a triple dot (...). The scraper only scrapes this, leaving the rest of the information out. Do you maybe know a solution to this?
Hi, Georg! Thank you for reaching out! Unfortunately, only part of the snippet is provided by the Google backend, and as you wrote, this is what the scraper scrapes.
To make it work you would have to make another request to the desired website, check whether the same text as in the snippet from the Google Scholar organic results exists there, and if so, scrape the rest of it.
But this will only work if the rest of the text is present on that website, which in most cases it is not. I believe it can be done, but it's a tricky task.
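For what it's worth, here's a best-effort sketch of that approach; the helper name and the paragraph-level matching are illustrative assumptions, and it fails exactly as often as described above:

```python
# Best-effort sketch: look for a truncated snippet's full text on the source page.
import requests
from bs4 import BeautifulSoup

def find_full_text(result_url, snippet):
    """Fetch the result's own page and search for the snippet's opening words."""
    html = requests.get(result_url,
                        headers={"User-Agent": "Mozilla/5.0"}, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    # Probe with the first few words, dropping the trailing "..."
    probe = " ".join(snippet.rstrip(" .…").split()[:8])

    for paragraph in soup.find_all("p"):
        text = paragraph.get_text(" ", strip=True)
        if probe in text:
            return text   # the matching paragraph, hopefully untruncated
    return None           # the page doesn't contain the snippet verbatim
```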
All the best,
Dmitriy
@george_z
Hi Dmitriy,
Thank you for the very useful articles!
I was following 'Scrape historic Google Scholar results using Python' article on serpapi.com/blog/scrape-historic-g....
However, whenever I try to call `cite_results()` I am getting a `KeyError: 'citations'`. And if I change the query I am getting a `KeyError: 'serpapi_pagination'`. I also tried to run my script from the terminal by running `API_KEY=my_api_key python test11.py`, but nothing seems to work, especially when I change the query.
Suggestions are highly appreciated.
Thanks.
Hey, @shwetat26 🙂 Thank you for reaching out.
First question: are you actually changing `my_api_key` to your actual API key, and the Python file `test11.py` to your actual Python file in the `API_KEY=my_api_key python test11.py` command you've shown? It should be something like this: `API_KEY=<your-actual-api-key> python <your_script.py>`, changing `<your_script.py>` to your actual script name. Let me know if it makes sense.
Can we scrape the number of citations + year inside the bar plot (the "cited by" graph)?
Do you mean the value of each individual cell?
@datum_geek the blog post has received major updates including graph extraction if it's something you still need 🙂
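A minimal sketch of pulling those graph values, assuming the Google Scholar Author engine whose response exposes `cited_by.graph` as year/citations pairs; the author ID and key are placeholders:

```python
# Minimal sketch: read the cited-by graph (citations per year) for an author.
from serpapi import GoogleSearch

params = {
    "engine": "google_scholar_author",
    "user": "m8dFEawAAAAJ",       # example author ID used elsewhere in this thread
    "api_key": "<your-api-key>",
}

results = GoogleSearch(params).get_dict()

for point in results.get("cited_by", {}).get("graph", []):
    print(point["year"], point["citations"])
```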
Hi @dmitryzub, I'm facing issues scraping data on "samsung" for all pages of Google Scholar, including the title of the publication, the authors of the publication, the year, and the full abstract. Your code (the first piece) does not work; it renders nothing!
Hi, @datum_geek.
What error do you receive? Pagination and data extraction work as expected without changes. I've just tested it in the online IDE that is linked in this post:
Dmitriy, thanks for your tutorial. I'm a newbie at Python and I already tried your Scrape Google Scholar All Author Articles script. I have some questions:
Hey @jayaivan, thank you 🙂
Great questions! We can make a solution for both questions.
Have a look at examples in the online IDE: replit.com/@DimitryZub1/Google-Sch...
We can use the `pandas` `to_csv()` method, or Python's built-in context manager and the built-in `csv` library. The main difference is that `pandas` is an additional dependency (an additional thing to install, which leads to a larger project storage size); however, `pandas` simplifies this task a lot.
📌 Note: I'll be using `json.dumps()` at the very end of the code just to show what is being printed (extracted). Delete it if it's unnecessary to you.
Using `pandas` (don't forget to `pip install` it). Actual example using the code in the blog post (the line you're looking for is almost at the end of the script):
Outputs:
Using context manager:
Actual example from the blog post code:
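A minimal sketch of the context-manager route with the built-in `csv` module, using the same illustrative data:

```python
# Minimal sketch: dump scraped article data to CSV with the built-in csv module.
import csv

articles = [
    {"title": "Article A", "authors": "A. Author", "cited_by": 42},
    {"title": "Article B", "authors": "B. Author", "cited_by": 7},
]

with open("author_articles.csv", "w", newline="", encoding="utf-8") as csv_file:
    writer = csv.DictWriter(csv_file, fieldnames=["title", "authors", "cited_by"])
    writer.writeheader()        # first row: column names
    writer.writerows(articles)
```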
Outputs:
You need a `list` of user IDs. Iterate over it and extract the data as already shown in the blog post. Keep in mind that each user = a new request. More users = more time to extract data. If it takes a lot of time, think about asynchronous requests, as they will speed things up quite a lot.
Iterating over the list of user IDs and passing each `user_id` value to `params["user"]`, which will be passed to the search URL. Actual code:
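A minimal sketch of that loop, assuming the Google Scholar Author engine; the IDs are illustrative:

```python
# Minimal sketch: one request per user ID from a list.
from serpapi import GoogleSearch

user_ids = ["m8dFEawAAAAJ", "<another-user-id>"]   # illustrative list

for user_id in user_ids:
    params = {
        "engine": "google_scholar_author",
        "user": user_id,             # ends up in the search URL
        "api_key": "<your-api-key>",
    }
    results = GoogleSearch(params).get_dict()
    print(user_id, results.get("author", {}).get("name"))
```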
Let me know if any of this makes sense 🙂
Hi, this is a very useful article, and it definitely reduces the time needed. I'm planning to get all articles within the last month. Can you tell me how to do that? Two things: 1) the default display of search results is sorted "by relevance"; how do I set it to "sort by date"? 2) There is information on the number of days since an article was published; how do I get it?
Hi, @deepeshsagar ! I'm glad that the article helped you somehow!
Use the `sortby=pubdate` query parameter, which will sort by published date. In the articles example the link would look like this:
https://scholar.google.com/citations?hl=en&user=m8dFEawAAAAJ&sortby=pubdate
Or you can add a `params` `dict()` to make it more readable and faster to understand (see the sketch below). I updated the code on Replit so you can test it in the browser (try removing the `sortby` param and see the difference in the first articles).
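A minimal sketch of that `params` `dict()` version, assuming plain `requests`; the author ID is the one from the link above:

```python
# Minimal sketch: build the same citations URL from a readable params dict.
import requests

params = {
    "hl": "en",                  # interface language
    "user": "m8dFEawAAAAJ",      # author ID from the link above
    "sortby": "pubdate",         # sort articles by published date
}

response = requests.get(
    "https://scholar.google.com/citations",
    params=params,
    headers={"User-Agent": "Mozilla/5.0"},
    timeout=30,
)
print(response.url)  # the full URL requests assembled from params
```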
@deepeshsagar I've just updated the blog post and now you're able to extract all available articles from the author page. This is possible because of the pagination I've added. 🐱👤
Thanks, Dmitriy, for a great and useful post. I am currently having one problem using your code for "Scrape Google Scholar Organic Results using SerpApi with Pagination". I edited it only so that it loops through different search terms. However, for some reason the code is not able to get past the last page of one particular query. If I use the code for all the other search terms it works, but it does not work for only one of the search terms (regardless of the position it has in the loop). The code simply stays forever at saying "Currently extracting page #6" and never advances or ends. I am guessing it has something to do with what is on the last page, but I haven't been able to fix it or identify the problem. Below is a snapshot of what that last page shows. I hope you can help me with this. Thanks!