DEV Community

Clean Code Studio
I built an industrial scale web scraper. Here's what I learned.

Recently, I built an industrial scale web scraper. Here's what I learned.

1. Why build a scalable scraper/crawler?

  • Google's primary product (their search engine) is powered by web scrapers & crawlers extracting data from the internet at an unfathomable level of scale.
  • OpenAI's capability (and willingness) to access data using scrapers & crawlers at internet-wide scale is what empowered them to build (and continually improve) ChatGPT.
  • Unlike last decade, intelligence is something you can build, use, and sell. The one catch is that you need an immense amount of a single resource, and that resource is a hell of a lot of data.

2. Using Chromium programmatically is helpful (I chose Puppeteer)
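Puppeteer is a Node.js library, but the core idea of driving Chromium programmatically carries over to any language. As a minimal sketch in Python, you can shell out to a headless Chromium binary and ask it to print a page's rendered DOM; the binary name here is an assumption and varies by system:

```python
import shutil
import subprocess

def build_headless_cmd(url: str) -> list[str]:
    """Build a headless-Chromium command that prints a page's DOM to stdout."""
    # "chromium" is an assumption; on your machine the binary may be
    # "chromium-browser", "google-chrome", etc.
    binary = shutil.which("chromium") or "chromium"
    return [binary, "--headless", "--disable-gpu", "--dump-dom", url]

if __name__ == "__main__":
    cmd = build_headless_cmd("https://example.com")
    # Only run if a Chromium binary is actually installed.
    if shutil.which(cmd[0]):
        html = subprocess.run(cmd, capture_output=True, text=True).stdout
        print(html[:200])
```

Puppeteer gives you far more than this (clicking, typing, waiting for selectors, intercepting requests), but `--dump-dom` is the simplest way to see that headless Chromium executes JavaScript before handing you the HTML, which plain HTTP clients don't.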

3. Industrial scale requires using proxies (I rotated between residential proxies)
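The rotation itself is simple: cycle through a pool of proxy endpoints so consecutive requests exit from different IPs. A minimal stdlib sketch, where the proxy URLs are placeholders for whatever your residential-proxy provider gives you:

```python
import itertools
import urllib.request

# Hypothetical endpoints; a real residential-proxy provider supplies these.
PROXIES = [
    "http://user:pass@proxy-1.example.net:8000",
    "http://user:pass@proxy-2.example.net:8000",
    "http://user:pass@proxy-3.example.net:8000",
]
_pool = itertools.cycle(PROXIES)

def next_opener() -> urllib.request.OpenerDirector:
    """Return a URL opener routed through the next proxy in the rotation."""
    proxy = next(_pool)
    handler = urllib.request.ProxyHandler({"http": proxy, "https": proxy})
    return urllib.request.build_opener(handler)
```

In practice you'd layer retries and per-proxy health checks on top, but the core pattern is just this: one cheap `itertools.cycle` deciding which exit IP the next request uses.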

4. Bots can find a site's rules via its robots.txt file (ask SEO experts about it)
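Python ships a robots.txt parser in the standard library, so honoring those rules costs almost nothing. A short sketch, where the robots.txt content is made up for illustration (in practice you'd fetch it from the site's `/robots.txt`):

```python
import urllib.robotparser

# Illustrative rules; a real file would come from https://<site>/robots.txt
robots_txt = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 10
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("MyScraper/1.0", "https://example.com/products"))     # True
print(rp.can_fetch("MyScraper/1.0", "https://example.com/admin/users"))  # False
print(rp.crawl_delay("MyScraper/1.0"))                                   # 10
```

Checking `can_fetch` before every request, and sleeping for the declared crawl delay between requests to the same host, is the easiest way to keep an industrial-scale crawler polite.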

5. Bypassing captchas, although ethically questionable, doesn't seem to be illegal to program your bot to do. (I explored GitHub Python programs capable of this to satisfy my own curiosity.)
