DEV Community

Ali Abbas

BeautifulSoup or Scrapy?

Which do you prefer?

Top comments (6)

Jean-Michel Plourde • Edited

It really depends on your needs. While both deal with HTML, they differ in scope and capabilities.

Beautiful Soup is a library that parses HTML you have already obtained, for example with requests; it does not download pages itself. It parses the document and then stops (you could add some automation, but other tools already do that). It gives you access to the data without any hassle.

Scrapy is a full-fledged framework for getting the HTML from many pages inside a set of domains. You specify constraints and it fetches all the HTML it can within the limits you set.

It boils down to a library vs a framework.

I'm currently working on a project where I need to fetch some data from a website with requests and then parse the HTML with BeautifulSoup. It's simple, surface-level parsing.
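
As a rough illustration of that requests plus BeautifulSoup workflow, here is a minimal sketch. The helper names `fetch_html` and `extract_links` and the sample markup are made up for this example and are not from the project described above; the parsing step runs on an inline string so it works without network access.

```python
from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page with requests and return its HTML text."""
    import requests  # imported lazily so the parsing helper works without it

    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return response.text


def extract_links(html: str) -> list:
    """Return (text, href) pairs for every <a> tag that has an href."""
    soup = BeautifulSoup(html, "html.parser")
    return [
        (a.get_text(strip=True), a["href"])
        for a in soup.find_all("a", href=True)
    ]


# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = """
<html><body>
  <h1>Articles</h1>
  <a href="/post/1">First post</a>
  <a href="/post/2">Second post</a>
  <a>No link here</a>
</body></html>
"""

links = extract_links(SAMPLE_HTML)
print(links)
```

In real use you would call `extract_links(fetch_html(some_url))`; the fetching and the parsing stay separate, which is the point made above: requests gets the HTML, BeautifulSoup only parses it.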

There is another project where a bot crawls many websites, collects all the data, then sends it to a neural network to work on. In this case Scrapy is the best option, because you just set some rules and let it do its job automatically.

eluzix • Edited

They are not the same thing.

Beautiful Soup lets you build a navigable tree from HTML and XML sources (a string or an open file; for a URL you fetch the content first). After building the tree, you can search it, modify it, or pull data out.

Scrapy is a framework for crawling and scraping content from websites. For each page crawled, you get access to its DOM so you can extract the relevant information. This part is much like BS, so if you are looking for a comparison, that's where you should look.

To give a real-life example, I built a system that crawls a website for its historical content, then extracts and saves the data. After that, it periodically checks the site content via its RSS feed.

For the initial crawl, I used Scrapy to easily navigate through the site content. For the RSS stage, I used BS4 to parse each new URL I got from the RSS.
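
A small sketch of those tree operations (search, navigate, modify, extract), using made-up markup:

```python
from bs4 import BeautifulSoup

html = "<div><p class='intro'>Hello</p><p>World</p></div>"
soup = BeautifulSoup(html, "html.parser")

# Search: find the first <p> with class "intro".
intro = soup.find("p", class_="intro")

# Navigate: move to the next <p> sibling in the tree.
sibling = intro.find_next_sibling("p")

# Modify: replace the sibling's text and change the first tag's class.
sibling.string = "DEV"
intro["class"] = ["lead"]

# Pull data out: collect the text of every paragraph.
texts = [p.get_text() for p in soup.find_all("p")]
print(texts)
```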
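
The RSS stage could look roughly like the sketch below. This is not the actual system: the feed, the page content, and the helper names `new_links` and `page_title` are invented for illustration, the feed itself is read with the standard library's `xml.etree.ElementTree` (an assumption, any feed parser would do), and a dictionary stands in for downloading each page.

```python
import xml.etree.ElementTree as ET

from bs4 import BeautifulSoup

# Hypothetical feed content; a real one would be downloaded periodically.
RSS = """<rss version="2.0"><channel>
  <title>Example feed</title>
  <item><title>Old</title><link>https://example.com/old</link></item>
  <item><title>New</title><link>https://example.com/new</link></item>
</channel></rss>"""

# Stand-in for fetching pages over HTTP.
PAGES = {
    "https://example.com/new": (
        "<html><head><title>New post</title></head>"
        "<body><p>Fresh content</p></body></html>"
    ),
}


def new_links(rss_text, seen):
    """Return item links from the feed that are not in the seen set."""
    root = ET.fromstring(rss_text)
    links = [item.findtext("link") for item in root.iter("item")]
    return [link for link in links if link and link not in seen]


def page_title(html):
    """Parse one page with BS4 and return its <title> text."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.title.get_text(strip=True) if soup.title else None


seen = {"https://example.com/old"}  # URLs already handled by the initial crawl
fresh = new_links(RSS, seen)
titles = {url: page_title(PAGES[url]) for url in fresh}
print(titles)
```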

Edit:
When working with Scrapy, you can use BS to extract information from the HTML you get; see docs.scrapy.org/en/latest/topics/s...

Kamaraj

BeautifulSoup is best. Use the requests module to get the HTML page, then pass it to BeautifulSoup and scrape it. This is a good combination for scraping websites.

Max Ong Zong Bao • Edited

It depends on your use case. I prefer Scrapy.

rhymes

Yes, it can. But you can also use whatever you want as a parser.

eluzix

You can use any output you want so rewriting the DOM is possible.