
Mario

If you needed to scrape many different websites nowadays, which tool/language combo would you pick?

Basically, I want to crawl simple blogs and extract their blog posts. The biggest challenge here would probably be parsing the data and understanding the different content parts within a blog post.

Top comments (6)

Pacharapol Withayasakpunt

Node.js, with or without Puppeteer, would probably be the first natural choice, although I am not that accustomed to Puppeteer.

I used to use the Selenium API with Python when I needed to scrape dynamic websites, but async in Python does not seem as natural as in Node.js.
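Something like this, as a minimal sketch of that Selenium-with-Python approach (the URL and CSS selector are placeholders, not a real site):

```python
# Minimal Selenium sketch for a JS-rendered page.
# The URL and CSS selector below are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/blog")
    # Kept wait-free for brevity; real code should use WebDriverWait
    # so elements exist before you read them.
    titles = [el.text for el in driver.find_elements(By.CSS_SELECTOR, "article h2")]
    print(titles)
finally:
    driver.quit()
```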

I don't know much about Golang. How often is it used for web scraping?

Médéric Burlet

Would depend on the type of scraping.

If you need to interact like a human, then Puppeteer with JS/TS would be good: github.com/puppeteer/puppeteer

If you just need to parse data, I really like using cheerio with JS/TS: github.com/cheeriojs/cheerio
It lets you access webpage information with jQuery syntax, which can be quite practical.

Mario

Thanks for the response!

I do not need to interact like a human, just collect news articles from different websites, at scale. Looking at cheerio, it seems like a very decent option. Thanks!

Talha Mansoor

> I do not need to interact like a human, just collect news articles from different websites, at scale.

If it is scale you are looking for, then the best option would be Scrapy (scrapy.org) with Scrapy Cloud. You can also run multiple Scrapy spiders in a single process.
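Roughly, running more than one spider in a single process looks like this (a sketch; the spider names, URLs, and selectors are placeholders):

```python
# Sketch: two spiders sharing one CrawlerProcess.
# Spider names, start URLs, and selectors are placeholders.
import scrapy
from scrapy.crawler import CrawlerProcess

class BlogSpider(scrapy.Spider):
    name = "blog"
    start_urls = ["https://example.com/blog"]

    def parse(self, response):
        for post in response.css("article"):
            yield {
                "title": post.css("h2::text").get(),
                "url": response.urljoin(post.css("a::attr(href)").get()),
            }

class NewsSpider(BlogSpider):
    name = "news"
    start_urls = ["https://example.org/news"]

process = CrawlerProcess(settings={"FEEDS": {"posts.json": {"format": "json"}}})
process.crawl(BlogSpider)
process.crawl(NewsSpider)  # both spiders run in the same process
process.start()  # blocks until all crawls finish
```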

João Veiga

Elixir + Floki

Jennifer Fadriquela

I'm also a beginner at web scraping. The Scrapy framework is a good tool, but it has a steeper learning curve than just using libraries (Selenium, BeautifulSoup, requests).
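For reference, the simple-library route for a static blog post might look like this (a sketch; the URL and selectors are placeholders for whatever the target site uses):

```python
# Sketch: fetch one blog post and pull out its parts with requests + BeautifulSoup.
# The URL and selectors are placeholders.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/blog/some-post", timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
title = soup.find("h1").get_text(strip=True)
paragraphs = [p.get_text(strip=True) for p in soup.select("article p")]

print(title)
print(paragraphs[:3])
```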