
Chandra


Scraping Web Pages from the CommonCrawl Archive?

I recently started an open-source project to build a Node.js scraper for the CommonCrawl web archive.

The current version fetches the snapshots for any given URLs, and the next version will parse those snapshots and output the raw HTML. From there, Cheerio can be used to extract individual data points.
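To make the snapshot lookup concrete, here is a minimal sketch of how it can work against CommonCrawl's public index. This is not the project's code: the function names (`buildIndexUrl`, `parseIndexResponse`, `rangeHeader`, `extractHtml`, `decodeRecord`, `fetchSnapshotHtml`) are hypothetical, the crawl ID is only an example, and it assumes Node.js 18+ for the built-in `fetch` and pages that decode cleanly as UTF-8.

```javascript
const zlib = require('node:zlib');

// Example crawl ID (an assumption): pick a current one from the CommonCrawl site.
const CRAWL_ID = 'CC-MAIN-2024-10';

// Build the index query URL for one page URL.
function buildIndexUrl(pageUrl, crawlId = CRAWL_ID) {
  const params = new URLSearchParams({ url: pageUrl, output: 'json' });
  return `https://index.commoncrawl.org/${crawlId}-index?${params}`;
}

// The index answers with one JSON object per line, one per snapshot.
function parseIndexResponse(body) {
  return body
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line));
}

// Each index record points at a byte range inside a large WARC file.
function rangeHeader(record) {
  const start = Number(record.offset);
  const end = start + Number(record.length) - 1;
  return `bytes=${start}-${end}`;
}

// A WARC response record is: WARC headers, blank line, HTTP headers, blank line, body.
function extractHtml(warcText) {
  const [, , ...body] = warcText.split('\r\n\r\n');
  return body.join('\r\n\r\n').trim();
}

// Each record is its own gzip member, so it can be decompressed in isolation.
function decodeRecord(gzBuffer) {
  return extractHtml(zlib.gunzipSync(gzBuffer).toString('utf8'));
}

// Fetch only the bytes of one snapshot and return its raw HTML.
async function fetchSnapshotHtml(record) {
  const res = await fetch(`https://data.commoncrawl.org/${record.filename}`, {
    headers: { Range: rangeHeader(record) },
  });
  if (!res.ok) throw new Error(`Snapshot fetch failed: HTTP ${res.status}`);
  return decodeRecord(Buffer.from(await res.arrayBuffer()));
}

// Look up all snapshots of a URL in one crawl.
async function findSnapshots(pageUrl, crawlId = CRAWL_ID) {
  const res = await fetch(buildIndexUrl(pageUrl, crawlId));
  if (!res.ok) throw new Error(`Index lookup failed: HTTP ${res.status}`);
  return parseIndexResponse(await res.text());
}
```

With something like this in place, `findSnapshots('example.com')` would return the index records, `fetchSnapshotHtml(records[0])` would return the HTML of the first one, and `cheerio.load(html)` could take it from there.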

This lets you crawl web pages about 100x faster than fetching them from their own hosting servers, and the data is fairly up to date.

CommonCrawl data is hosted on AWS. Running Lambda functions in the same data center (I have higher limits), I was able to crawl and parse 300M pages in about a day without worrying about rate limits, bot detection, proxies, or slowing down websites.

Is this something you would like to see? Link to the repo if you are interested.
