Discussion on: Bye Bye 403: Building a Filter Resistant Web Crawler Part II: Building a Proxy List

Very cool project!

I always thought so! Thanks. One of my personal projects over the past couple of years is a web-server-level filter meant to detect bots like the end product of this post series, as well as those that use more advanced methods, such as reverse engineering JSON requests and simply asking the web server for the information in the header rather than parsing HTML or XML, using webdrivers, etc.

This series is meant to introduce novice web scrapers to the idea of fooling the current, admittedly kinda dumb, filters on most websites, and to expose how arbitrary most filter checks are on the modern web. My hope is that, if enough people know about the problem (in enough detail to execute it), others will rally to the cause of stopping offshore trolls from making it necessary to have an explicit view in Google Analytics that factors out bot traffic and makes us depressed about how many human beings actually visit our sites and services :D
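
As an illustration of how arbitrary such checks can be, a naive filter might look something like the sketch below. This is only an assumption about the general shape of a simple User-Agent check, not the filter described in this thread or any real product; the function name `looks_like_bot` and the substring list are hypothetical.

```python
# Hypothetical naive filter: decides "bot or not" from the User-Agent header alone.
# The blocked substrings are illustrative assumptions, not a real blocklist.
BLOCKED_SUBSTRINGS = ("python-requests", "curl", "scrapy", "bot")


def looks_like_bot(headers: dict) -> bool:
    """Return True if the User-Agent is missing or contains a blocked substring."""
    user_agent = headers.get("User-Agent", "").lower()
    if not user_agent:
        return True
    return any(marker in user_agent for marker in BLOCKED_SUBSTRINGS)


# A scraper using a library's default headers is caught...
default_client = {"User-Agent": "python-requests/2.31.0"}
# ...but the same scraper passes once it copies a browser-style string.
spoofed_client = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

print(looks_like_bot(default_client))
print(looks_like_bot(spoofed_client))
```

Because the check looks at a single header the client fully controls, changing one string is enough to get past it, which is the weakness the post series relies on.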