DEV Community

Mohan Ganesan
Mohan Ganesan

Posted on • Originally published at proxiesapi.com

5 Web Crawling Habits You Should Adopt

  1. Learn a framework like Scrapy. I am just going to preach it to you. Here is the list of frameworks you HAVE to know about and study because if you don’t and call yourself a programmer that knows web scraping, well, there is something wrong with you.

a. Scrapy — please, for the love of GOD, don’t take a step in any direction before you have explored what this can do for you… let it open your mind to what is out there, the kind of abstractions and weaponry the community has created before you venture out on your own armed with your cute CURL calls and stupid RegEx patterns.

b. Puppeteer — Whatever a human can do, you can simulate that with Puppeteer. Do you see those services that take snapshots of entire webpages, fully rendered in their evening best? Ever wondered how they do it? It is by using Puppeteer man. Go know it now before you read another line or waste another breath.

c. There is a lot more. But I know you are not listening to me. Just do this for now. I have low expectations for you!

  1. Take these tests to ramp up your web crawling genius.

  2. Ask yourself — How can I double the speed of my web crawler?

The path should take you to something like Scrapyd. When you come to it, tell them my name, and they will set you right up.

  1. Just learn XPath and CSS selectors if you haven’t already. That's how you get good at extracting data. XPath is confusing, to be honest, just force yourself to learn it, man. It's your life raft!

  2. I can't remember the 5th…. I KNOW I SAID THERE ARE 5 !@^$#@… Just do the 4 for now, will you?

Sorry, I am in a bad kinda way today.

The author is the founder of Proxies API, a proxy rotation API service.

Top comments (1)

Collapse
 
crawlbase profile image
Crawlbase

Thanks alot! What a fun read! Loved the author's passion for Scrapy and Puppeteer. And those tips on XPath and CSS selectors? Spot on! If you're into web crawling like me, you gotta check out Crawlbase. Seriously, it's a game-changer.