Supacrawler is an open-source web scraping API engine written in Go. Out of the box it comes with three endpoints: Scrape, Crawl, and Screenshots.
It's a light wrapper around Playwright, with Dockerfiles for both local development and production. It's also very fast thanks to Go's concurrency and channels. I have a write-up of the benchmarks in the Supacrawler benchmarks section of the documentation.
Going through the endpoints, we have the following (example requests after the list):
Scrape: This endpoint lets you scrape a page with a headless browser and receive the output automatically cleaned up as Markdown.
Crawl: This endpoint lets you systematically crawl an entire website with a headless browser and receive the results back in both Markdown and HTML.
Screenshots: This endpoint renders JavaScript-heavy pages and captures full-page or mobile screenshots, all through an API call.
Watch (app exclusive): This endpoint monitors changes to the contents of a website. You can run a job on a cron schedule that sends you an email notification if anything changes. Works like a charm!
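To give a feel for what calling these endpoints looks like, here are rough example requests against a locally running instance. The base URL, port, paths, and parameter names below are assumptions for illustration only; check the documentation for the exact request shapes.
# Scrape a single page and get cleaned Markdown back (path and params assumed)
curl "http://localhost:8081/v1/scrape?url=https://example.com&format=markdown"
# Crawl a whole site; depth is an illustrative parameter
curl -X POST "http://localhost:8081/v1/crawl" -H "Content-Type: application/json" -d '{"url": "https://example.com", "depth": 2}'
# Capture a full-page screenshot of a JavaScript-rendered page
curl "http://localhost:8081/v1/screenshots?url=https://example.com&full_page=true"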
The best part about Supacrawler is that it works out of the box with just a couple of commands:
curl -O https://raw.githubusercontent.com/supacrawler/supacrawler/main/docker-compose.yml
docker compose up
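Once the containers are up, you can make your first request right away. Again, the port and path here are assumptions based on a typical local setup; use whatever the compose file actually exposes.
# Quick sanity check against the local service (URL assumed)
curl "http://localhost:8081/v1/scrape?url=https://example.com"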
I'm always keen to know more about how people will use tools like this. Let me know if you find this useful or if you have any questions!
If you're interested in seeing more you can visit the following:
Website
GitHub