Copy of Scrapy Vs. Apache Nutch - Which is Better for Web Scraping?

Here are the main differences between Scrapy and Nutch:

Ideally, web crawling and scraping should be separate stages, so that a failure in scraping doesn't bring down the whole project and the (many) issues in each stage can be addressed independently. Nutch separates the two, but in Scrapy both processes are tied together in a single linear flow.
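To make the idea of separated stages concrete, here is a minimal sketch using only the Python standard library. It is not how Nutch or Scrapy is implemented; the names `crawl`, `scrape`, and `TitleParser` and the in-memory "store" are hypothetical, and the fetch step is faked with a dictionary so the example runs offline.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title and self.title is None:
            self.title = data.strip()


def crawl(urls, fetch):
    """Crawl stage: fetch and store raw HTML only, no parsing."""
    return {url: fetch(url) for url in urls}


def scrape(store):
    """Scrape stage: parse stored pages; one bad page does not stop the rest."""
    results, failures = {}, {}
    for url, html in store.items():
        try:
            parser = TitleParser()
            parser.feed(html)
            if parser.title is None:
                raise ValueError("no <title> found")
            results[url] = parser.title
        except Exception as exc:
            failures[url] = str(exc)
    return results, failures


# Fake "web" so the sketch runs without network access (assumption).
fake_web = {
    "https://example.com/a": "<html><head><title>Page A</title></head></html>",
    "https://example.com/b": "<html><body>no title here</body></html>",
}

store = crawl(fake_web.keys(), fake_web.__getitem__)
results, failures = scrape(store)
```

Because the raw pages are stored before any parsing happens, a broken scraper can be fixed and re-run against `store` without crawling again.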

Scrapy has built-in support for XPath and CSS selectors, which makes extracting data from pages straightforward.

Scrapy also provides an interactive shell for trying out CSS and XPath selectors, which makes writing and debugging scrapers much easier.

Nutch is built on Apache Hadoop, so it has built-in support for a distributed file system (HDFS), and it also supports a graph database.

Scrapy handles non-standard and broken encodings by detecting them automatically.

Scrapy has abstractions that simplify things like cookie handling, session handling, compression, authentication, caching, user-agent spoofing, rate limiting, concurrency support, crawl depth restrictions, etc.
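Most of these features are switched on or tuned through a project's `settings.py`. The fragment below is a sketch: the setting names are Scrapy's own, but every value is an arbitrary example rather than a recommendation.

```python
# settings.py (fragment); values are illustrative assumptions.

# Cookie and session handling via the built-in cookies middleware.
COOKIES_ENABLED = True

# Transparent gzip/deflate handling of responses.
COMPRESSION_ENABLED = True

# Cache responses on disk so repeated runs don't re-download pages.
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 3600

# Custom user-agent string sent with every request.
USER_AGENT = "Mozilla/5.0 (compatible; ExampleBot/1.0)"

# Rate limiting: fixed delay plus adaptive throttling.
DOWNLOAD_DELAY = 1.0
AUTOTHROTTLE_ENABLED = True

# Concurrency limits, globally and per domain.
CONCURRENT_REQUESTS = 16
CONCURRENT_REQUESTS_PER_DOMAIN = 4

# Stop following links more than three hops from the start URLs.
DEPTH_LIMIT = 3
```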

Scrapy can also use the Scrapyd daemon to run multiple spiders at once. Combining Scrapyd with a rotating proxy API such as Proxies API lets you scale web crawling projects much further than a single machine and IP address would allow.
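Scrapyd is driven over HTTP, and jobs are started by POSTing to its `schedule.json` endpoint. The sketch below only builds the requests, so it runs without a Scrapyd server. The project name `myproject`, the spider names `books` and `quotes`, and the helper `build_schedule_request` are hypothetical, and the URL assumes Scrapyd is listening on its default port on localhost.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumes a local Scrapyd instance on its default port.
SCRAPYD_URL = "http://localhost:6800"


def build_schedule_request(project, spider, **spider_args):
    """Build (but do not send) a POST request that schedules one spider run."""
    payload = {"project": project, "spider": spider, **spider_args}
    data = urlencode(payload).encode("utf-8")
    return Request(f"{SCRAPYD_URL}/schedule.json", data=data, method="POST")


# One request per spider; Scrapyd queues the jobs and runs them in parallel.
requests_to_send = [
    build_schedule_request("myproject", name) for name in ("books", "quotes")
]

for req in requests_to_send:
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

To actually schedule the jobs, pass each request to `urllib.request.urlopen` while Scrapyd is running and the project has been deployed to it.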
