DEV Community

Mohan Ganesan

Best Open Source Frameworks for Web Scraping

Table of contents

- PHP: Goutte
- Python: Scrapy, MechanicalSoup, PySpider
- Language agnostic: Selenium WebDriver
- Node.js: Apify SDK, NodeCrawler, Puppeteer
- Golang: Colly
Goutte
Goutte is a screen scraping and web crawling library for PHP.

Goutte provides a nice API to crawl websites and extract data from the HTML/XML responses.

PHP version: PHP 7.1.

Example of submitting a form in Goutte
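A minimal sketch, assuming Goutte is installed via Composer; the site, button text, and form field names below are placeholders for whatever page you are targeting:

```php
<?php
require 'vendor/autoload.php';

use Goutte\Client;

$client = new Client();

// Load the page that contains the form
$crawler = $client->request('GET', 'https://example.com/login');

// Select the form by its submit button text, fill it in, and submit
$form = $crawler->selectButton('Sign in')->form();
$crawler = $client->submit($form, ['username' => 'user', 'password' => 'secret']);

// Extract data from the response with CSS selectors
$crawler->filter('h1')->each(function ($node) {
    print $node->text() . "\n";
});
```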
Scrapy

Scrapy is an extremely powerful crawling and scraping library written in Python.

Here is how easy it is to create a crawler that fetches pages concurrently and parses them all with a single callback.
For extracting data, it supports both XPath and CSS selectors.
Mechanical Soup

MechanicalSoup is a super simple library that helps you scrape, store and pass cookies, submit forms, and so on, but it doesn't support JavaScript rendering.

Here is an example of submitting a form and scraping the results on DuckDuckGo.
PySpider

PySpider is useful if you want to crawl and spider at massive scales. It has a web UI for monitoring crawling projects, supports database integrations out of the box, uses message queues, and comes ready with support for a distributed architecture. This library is a beast.

You can do complex operations like:

- Setting crawl priorities.
- Delaying crawls, e.g., queueing a page to be fetched 30 minutes later.
- Automatically recrawling a page every 5 hours.
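A sketch of what those operations look like in a PySpider handler, following PySpider's handler API; the URLs are placeholders, and note that PySpider targets older Python releases:

```python
# A PySpider handler demonstrating priorities, delayed crawls, and recrawls.
import time

from pyspider.libs.base_handler import *


class Handler(BaseHandler):
    @every(minutes=5 * 60)  # re-run on_start every 5 hours
    def on_start(self):
        # Higher-priority crawl jumps ahead in the queue
        self.crawl("http://example.com/", callback=self.index_page, priority=2)
        # Delayed crawl: queued to execute 30 minutes from now
        self.crawl(
            "http://example.com/news",
            callback=self.index_page,
            exetime=time.time() + 30 * 60,
        )

    @config(age=5 * 60 * 60)  # cached results older than 5 hours are re-crawled
    def index_page(self, response):
        return {"url": response.url, "title": response.doc("title").text()}
```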
NodeCrawler

This powerful crawling and scraping package for Node.js allows server-side DOM access with jQuery injection, and has queueing support with controllable pool sizes, priority settings, and rate limit control.

It's great for working with bottlenecks like rate limits that many websites impose.

Here is an example that does that.
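A sketch using the crawler package, showing rate limiting and per-item priorities; the URLs are placeholders:

```javascript
// Rate-limited crawling with node-crawler; res.$ is a server-side
// jQuery-style (cheerio) handle on the fetched DOM.
const Crawler = require("crawler");

const crawler = new Crawler({
  rateLimit: 1000, // wait at least 1000 ms between requests
  callback: (error, res, done) => {
    if (error) {
      console.error(error);
    } else {
      console.log(res.$("title").text());
    }
    done(); // signal that this queue item is finished
  },
});

// Queue URLs; individual items can carry their own priority
crawler.queue("https://example.com/");
crawler.queue({ uri: "https://example.org/", priority: 3 });
```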
Selenium Web Driver

Selenium was built for automating tasks on web browsers but is very effective in web scraping as well.

Here you are controlling the Firefox browser and automating a search query.
It's language agnostic, so here is the same thing accomplished using JavaScript.
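The same search, sketched with the selenium-webdriver package for Node.js; the site is again a stand-in:

```javascript
// The same browser automation, driven from Node.js.
const { Builder, By, Key } = require("selenium-webdriver");

(async () => {
  const driver = await new Builder().forBrowser("firefox").build();
  try {
    await driver.get("https://duckduckgo.com/");
    // Type the query into the search box and press Enter
    await driver.findElement(By.name("q")).sendKeys("web scraping", Key.RETURN);
    console.log(await driver.getTitle());
  } finally {
    await driver.quit();
  }
})();
```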
Puppeteer

Puppeteer lives up to its name and comes closest to full-scale browser automation. It can do more or less everything that a human can do.

It can take screenshots, render JavaScript, submit forms, simulate keyboard input, and more.

This example takes a screenshot of the Y Combinator home page in very few lines of code.
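A minimal sketch, assuming the puppeteer package (which bundles its own Chromium) is installed; the output filename is arbitrary:

```javascript
// Launch a headless browser, load the page, and save a screenshot.
const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://news.ycombinator.com/");
  await page.screenshot({ path: "ycombinator.png" });
  await browser.close();
})();
```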
Colly

Colly is a super fast, scalable, and extremely popular spider/scraper for Go.

It supports web crawling, rate limiting, caching, parallel scraping, cookie and session handling, and distributed scraping.

Here is an example of fetching 2 URLs in parallel.
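A sketch of parallel fetching with Colly v2; the two URLs are placeholders:

```go
// Fetch two URLs in parallel with an async Colly collector.
package main

import (
	"fmt"

	"github.com/gocolly/colly/v2"
)

func main() {
	// Async(true) lets Visit calls run concurrently instead of blocking
	c := colly.NewCollector(colly.Async(true))

	// Cap parallelism per domain (rate limiting lives here too)
	c.Limit(&colly.LimitRule{DomainGlob: "*", Parallelism: 2})

	c.OnResponse(func(r *colly.Response) {
		fmt.Println("visited:", r.Request.URL)
	})

	c.Visit("https://example.com/")
	c.Visit("https://example.org/")

	// Block until all in-flight requests finish
	c.Wait()
}
```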
