Maithilish

Scoopi Web Scraper

In Java, it is easy to scrape with JSoup or HtmlUnit, but things get complicated when the data comes from a large number of pages. Scraping libraries do well when extracting data from a limited set of pages, but they are not meant to handle thousands of pages.
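
For context, this is roughly what single-page scraping with plain JSoup looks like; the URL and CSS selector below are illustrative, not taken from Scoopi:

```java
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupExample {
    public static void main(String[] args) throws IOException {
        // Fetch and parse a single page (illustrative URL).
        Document doc = Jsoup.connect("https://example.com/products").get();

        // Extract the text of elements matching a CSS selector (illustrative selector).
        for (Element name : doc.select("div.product > h2.name")) {
            System.out.println(name.text());
        }
    }
}
```

This works fine for one page, but fetching, parsing, retrying and persisting thousands of such pages is exactly the plumbing that a library leaves to you.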

Scoopi Web Scraper was developed with these aspects in mind. It is built on top of JSoup and HtmlUnit. Some of Scoopi's features are:

  • Scoopi is completely definition driven and requires no coding knowledge. The data structure, task workflow and pages to scrape are defined in a set of YML definition files. It can be configured to use either JSoup or HtmlUnit as the scraper.
  • Queries can be written using either CSS selectors with JSoup or XPath with HtmlUnit (see the sketch after this list).
  • Scoopi persists pages and parsed data to the file system and recovers from a failed state without repeating tasks that have already completed.
  • Scoopi is a multi-threaded application that processes pages in parallel for maximum throughput.
  • It allows the data to be transformed, filtered and sorted.
  • In cluster mode, it can scale horizontally by distributing tasks across multiple nodes.
  • It is designed to run in various environments: in a bare JVM, in Docker containers, or even on high-end container orchestration platforms such as Kubernetes.
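
Since Scoopi can use HtmlUnit as its scraper engine, XPath queries such as the one in this sketch are also an option. The URL and XPath expression are illustrative, and the classic com.gargoylesoftware package names are assumed:

```java
import java.util.List;

import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.DomNode;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnitExample {
    public static void main(String[] args) throws Exception {
        // WebClient is AutoCloseable; JavaScript and CSS are disabled for plain HTML pages.
        try (WebClient webClient = new WebClient()) {
            webClient.getOptions().setJavaScriptEnabled(false);
            webClient.getOptions().setCssEnabled(false);

            HtmlPage page = webClient.getPage("https://example.com/products");

            // Query the page with an XPath expression (illustrative).
            List<?> nodes = page.getByXPath("//div[@class='product']/h2[@class='name']");
            for (Object node : nodes) {
                System.out.println(((DomNode) node).getTextContent());
            }
        }
    }
}
```

In Scoopi, the equivalent selector or XPath goes into the YML definition files instead of Java code, and the framework takes care of fetching, parsing, persistence and parallelism.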

Learn more about Scoopi at Scoopi Web Scraper.
