Nishant Mittal

Posted on Jul 1, 2020

What is the best way for web scraping?

#help #python #webdev #discuss

Hi Guys,

I know that Scrapy can be used to scrape data. But also I want the code to be presentable on Github. I want to know what are best practices for web scraping using Python.

Also, if you guys know any web scraping project on Github please provide me the link to it.

Top comments (29)

Pacharapol Withayasakpunt • Jul 2 '20 • Edited

Indeed, in the past, I used Python.

Concurrency -- try ThreadPoolExecutor, or some kinds of coroutines. It may speed up things a lot.
GET the content. I guess requests is OK.
Locating the content. I now prefer lxml to BeautifulSoup.

As you have noticed in some of the comments, you might try Node.js, where you can use Cheerio, which is jQuery-ish; but has no problem with CORS. (You may still need to fetch with axios or Node-fetch, though.)

Nishant Mittal • Jul 2 '20

Yup, but still I'm wondering why almost no one is in favour of Scrapy.

Leon Lafayette • Jul 1 '20

I've found puppeteer and cheerio to be a good combo.

Nishant Mittal • Jul 1 '20 • Edited

This is the first time I've heard about cheerio. It looks nice but honestly I'm not a big fan of jquery.

Leon Lafayette • Jul 2 '20

Me either.

I was pushed for time tbf and found cheerio made things a little less verbose allowing me to get things done quickly :D

Nishant Mittal • Jul 2 '20

Oh, ok.

Peter Jachim • Jul 1 '20

I really like to use requests to make http requests, with bs4 to parse data. I am generally able to get what I need in about 4 lines of code, or about 10 to iterate through links that meet specific criteria. I think it generally looks pretty pythonic and does everything I need it to.

If there are a lot of tables, you can use pandas to read them in as dataframes, and if you need to click through pop ups or fill out forms you can use selenium (which is less presentable, but still super interpretable).

I actually used all of those techniques in this paper: arxiv.org/abs/2006.13990 (though the code is not available, so not super helpful for you).

Nishant Mittal • Jul 1 '20

Thanks! Looks like beautiful soup is a common choice!

Jesse • Jul 1 '20

BeautifulSoup is great, and I've had a good experience with it. lxml (XPath) is my go-to though, and I like it!

Nishant Mittal • Jul 1 '20

Thanks for sharing!

Vigneshwaran Chandrasekaran • Jul 1 '20

use puppeteer js

Nishant Mittal • Jul 1 '20

Ok that's new! Maybe you can point towards a already written code or a tutorial for web scraping using puppeteer.

Jason Britto • Jul 1 '20

I would suggest Selenium, its very easy with few methods, which can be used to explore DOM and fetch data.
I myself have made few projects with the help of youtube.
Try any project tutorial from youtube provided if you are alright familiar with basic python you will understand without any problem

Nishant Mittal • Jul 1 '20

Thanks for the suggestion but I don't think selenium would be the best fit for me.

Martin Dawson • Jul 1 '20

There are more than one ways to scrape with Python, but Beautiful Soup is definitely a stable, well documented, tried and tested library to use. I made a video about how to use it if that might help. Also wrote an article too 😀