
Bye Bye 403: Building a Filter Resistant Web Crawler Part II: Building a Proxy List

kaelscion on December 05, 2018

originally published on the Coding Duck blog: www.ccstechme.com/coding-duck-blog

Woohooo! We've got our environment set up and are ready to st...
 
Ben Halpern

Very cool project!

 
kaelscion

I always thought so! Thanks. One of my personal projects over the past couple of years has been a web-server-level filter designed to detect bots like the end product of this post series, as well as bots that use more advanced methods, such as reverse-engineering a site's JSON requests and simply asking the web server for the data directly rather than parsing HTML or XML or driving a browser with a webdriver. This series is meant to introduce novice web scrapers to the idea of fooling the current, admittedly kind of dumb, filters on most websites, and to show how arbitrary most filter checks on the modern web really are. My hope is that if enough people understand the problem in enough detail to execute it, more of us will rally to the cause of stopping the offshore trolls who make it necessary to keep a separate view in Google Analytics that filters out bot traffic and leaves us depressed about how few human beings actually visit our sites and services :D
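
For anyone curious what that JSON-request approach looks like in practice, here is a minimal sketch in Python. The endpoint URL, query parameter, and headers are hypothetical stand-ins; the real ones come from watching the Network tab in your browser's dev tools while the page loads. The point is simply that many sites populate their pages from a backend JSON API, so a scraper can ask that API directly instead of parsing rendered HTML or spinning up a webdriver.

```python
import requests

# Hypothetical JSON endpoint discovered via the browser's Network tab.
API_URL = "https://example.com/api/v1/items"

def fetch_items(page=1):
    # Many JSON endpoints accept the same query parameters the front end uses.
    resp = requests.get(
        API_URL,
        params={"page": page},
        # A browser-like User-Agent is often enough to get past naive filters.
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=10,
    )
    resp.raise_for_status()
    # Structured data comes back directly; no HTML parsing needed.
    return resp.json()

if __name__ == "__main__":
    print(fetch_items())
```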