
Mario

If you needed to scrape many different websites nowadays, which tool/language combo would you pick?

Basically, I want to crawl simple blogs and extract their blog posts. The biggest challenge here would probably be parsing the data and understanding the different content parts within a blog post.

Top comments (6)

Pacharapol Withayasakpunt

Node.js, with or without Puppeteer, would probably be the first natural choice, although I am not that accustomed to Puppeteer.

I used to use the Selenium API with Python when I needed to scrape dynamic websites, but async in Python does not seem as natural as in Node.js.
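Something like this, as a minimal sketch of that Selenium-with-Python approach (the URL and CSS selector are placeholders, not a real site):

```python
# Minimal Selenium sketch for a JS-rendered page.
# The URL and CSS selector below are placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/blog")
    # Kept wait-free for brevity; real code should use WebDriverWait
    # so elements exist before you read them.
    titles = [el.text for el in driver.find_elements(By.CSS_SELECTOR, "article h2")]
    print(titles)
finally:
    driver.quit()
```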

I don't know much about Golang. How often is it used for web scraping?

Médéric Burlet

Would depend on the type of scraping.

If you need to interact like a human, then Puppeteer with JS/TS would be good: github.com/puppeteer/puppeteer

If you just need to parse data, I really like using cheerio with JS/TS: github.com/cheeriojs/cheerio
It lets you access webpage information with jQuery syntax, which can be quite practical.

Mario

Thanks for the response!

I do not need to interact like a human, just collect news articles from different websites, at scale. Looking at cheerio, it seems like a very decent option. Thanks!

Talha Mansoor

> I do not need to interact like a human, just collect news articles from different websites, at scale.

If it is scale you are looking for, then the best option would be Scrapy (scrapy.org) with Scrapy Cloud. You can also run multiple Scrapy spiders in a single process.
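Roughly, running more than one spider in a single process looks like this (a sketch; the spider names, URLs, and selectors are placeholders):

```python
# Sketch: two spiders sharing one CrawlerProcess.
# Spider names, start URLs, and selectors are placeholders.
import scrapy
from scrapy.crawler import CrawlerProcess

class BlogSpider(scrapy.Spider):
    name = "blog"
    start_urls = ["https://example.com/blog"]

    def parse(self, response):
        for post in response.css("article"):
            yield {
                "title": post.css("h2::text").get(),
                "url": response.urljoin(post.css("a::attr(href)").get()),
            }

class NewsSpider(BlogSpider):
    name = "news"
    start_urls = ["https://example.org/news"]

process = CrawlerProcess(settings={"FEEDS": {"posts.json": {"format": "json"}}})
process.crawl(BlogSpider)
process.crawl(NewsSpider)  # both spiders run in the same process
process.start()  # blocks until all crawls finish
```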

João Veiga

Elixir + Floki

Jennifer Fadriquela

I'm also a beginner at web scraping. The Scrapy framework is a good tool, but it has a steeper learning curve than just using libraries (Selenium, BeautifulSoup, requests).
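For reference, the simple-library route for a static blog post might look like this (a sketch; the URL and selectors are placeholders for whatever the target site uses):

```python
# Sketch: fetch one blog post and pull out its parts with requests + BeautifulSoup.
# The URL and selectors are placeholders.
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com/blog/some-post", timeout=10)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
title = soup.find("h1").get_text(strip=True)
paragraphs = [p.get_text(strip=True) for p in soup.select("article p")]

print(title)
print(paragraphs[:3])
```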