kaelscion

Posted on Sep 29, 2018 • Edited on Dec 13, 2018

Bye Bye 403: Building a Filter Resistant Web Crawler - Part 1: What is Web Scraping?

#webscraping #requests #python #datascience

originally published on the Coding Duck blog: www.ccstechme.com/coding-duck-blog

If you program in Python, or are interested in the language, much of what was written about it in 2018 concerns its usefulness in machine learning and data analysis with libraries and frameworks like scikit-learn, TensorFlow, pandas, NumPy, and the like. Thing is, if you have no data to analyze, you're kind of stuck. In that regard you have two options. You can use a service like Quandl to get your data, which is great, but if you're like me, the data you want to use is almost always locked behind a paywall. Your other option is to go out and collect the data yourself, which is my personal preference.

Doing so requires skills in another, slightly less buzzy, strength of the language: web scraping/crawling/aggregation, which is the use of automated processes to collect publicly available data from public space on the internet. Now, these terms are used to describe slightly different things, but the difference is moot in the early stages of your journey on the topic. However, regardless of what sub-niche your bot is dedicated to, you will inevitably encounter issues with web server filters and firewalls.

EDIT

So, a user pointed out to me that my understanding of the legality of this issue was flawed. I apologize to anybody who read this before this edit and encourage you to look at this through a different lens. The previous version of this post discussed the legal "landscape" as it had been described to me by an attorney in my area on repeated occasions. It has since been clarified to me that those assertions were incorrect. In light of that, the legal ramifications of web scraping will vary widely based on the jurisdiction in which you perform your data collection. The ethics of the issue remain the same: respect the website and respect the owner. However, there is an inherent risk in scraping a web page that you do not have permission to scrape. When in doubt, PLEASE reach out to the owner of the website and request permission to scrape their content. I will bear no responsibility if you abuse the idea of "public space" and get into trouble. Again, PLEASE RESPECT THE CONTENT OF THE SITE YOU ARE SCRAPING and realize that these websites can only handle a finite number of requests in a given amount of time. Those resources are meant to serve customers who can potentially buy from them, and your bots will use them with no intent to purchase. While web scraping is a fun hobby that has the prospect of growing into a very large industry as data scientists need more and more data to analyze, right now there are a lot of "gray areas" surrounding the topic that could play out differently depending on who hears about it and gets upset. Enjoy this post series, but be respectful.

END EDIT

That is where the ethics come into play. Even with the limits of Python's GIL, concurrency, asynchronous programming, and even true parallelism are becoming more and more a part of the language's culture. Along that line of thinking, spawning 20-30+ daemon threads and tossing them at a web server in the form of Requests Sessions is not that difficult if your machine has the horsepower. While that may not sound like all that much, some cheaper web hosting tiers for eCommerce platforms like Shopify, WooCommerce, or WordPress could easily see a significant reduction in performance, or a complete lock-up, from those threads hitting their server every few seconds. Some of those servers have only 1GB of RAM and a single processor core, so a great rule of thumb is this: if my livelihood or business interest were tied to a server with that little horsepower, how much of it would I be willing to give to a service that provides a 0% chance of a sale? The trick to this rule is to put yourself in the site owner's shoes. Yes, I agree: if your livelihood is based on your server, give it more resources. But that is NOT your decision, as a data collector, to make. No legal entity is going to care whether the person who comes after you should have done a better job with their server. The argument will ultimately be: "Was your site working fine until this person sent their bot after you?" "Yup." "Okay, so Mr./Ms. Web Scraper, the fine you now owe this person is..." Basically, the data you are collecting belongs to somebody else. Respect them, respect their data, and respect their website. Follow those rules, and you will find success or, at the very least, avoid any unwanted attention.
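To make the "don't flood a small server" rule concrete, here is a minimal sketch of one way a group of threads could share a single rate limit. It uses only the standard library, and the names Throttle, worker, and min_interval are hypothetical, made up for this illustration rather than taken from Requests or any other library. The HTTP call itself is left as a comment so the sketch runs without touching anyone's server.

```python
import threading
import time


class Throttle:
    """Allow at most one call per `min_interval` seconds across all threads."""

    def __init__(self, min_interval):
        self.min_interval = min_interval
        self._lock = threading.Lock()
        self._next_allowed = 0.0  # monotonic timestamp of the next free slot

    def wait(self):
        """Reserve the next free slot, then sleep until it arrives."""
        with self._lock:
            now = time.monotonic()
            slot = max(now, self._next_allowed)
            self._next_allowed = slot + self.min_interval
        delay = slot - now
        if delay > 0:
            time.sleep(delay)


def worker(throttle, job_ids, log):
    """Process each job only after the shared throttle grants a slot."""
    for job_id in job_ids:
        throttle.wait()
        # A real crawler would make its HTTP request here.
        log.append((job_id, time.monotonic()))


if __name__ == "__main__":
    throttle = Throttle(min_interval=0.2)  # assumed value; tune per site
    log = []
    threads = [
        threading.Thread(target=worker, args=(throttle, range(2), log))
        for _ in range(3)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    print(f"{len(log)} jobs completed")
```

Because every thread reserves its slot under the same lock, adding more threads does not increase the request rate; the server sees roughly one request per min_interval no matter how many workers are running.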

In this tutorial series, we're going to discuss not only how to avoid being blocked by a web filter, but also how to avoid hammering a web server unnecessarily when collecting data. While this series caters specifically to Python, the process and principles involved can be translated to any language that has an HTTP request library and an HTML parser.

So, now that you know what web scraping is and why it's important, we're simply going to set up our environment.

First things first: open up your favorite editor and create a file called requirements.txt.

In that file, type the following:

requests
bs4
pandas

These libraries handle the retrieval of web pages (requests) and the parsing of their HTML (bs4, which installs Beautiful Soup 4). The final module, pandas, will be used later on to parse, sort, clean, and analyze the data we collect.

The next step is to install our dependencies. To do that without messing with the Python environment on our current machine, we are going to create a virtual environment for this particular Python setup to live in. Since we are going to be using Python 3 for this series, make sure you have it installed and in your PATH, even if the default python is 2. Then, open your terminal application and type:

pip3 install virtualenv

From there, enter the command:

virtualenv -p python3 venv

This command tells virtualenv to create a Python 3 environment that is totally separate from your computer's global environment. Inside it you can install packages, dependencies, Python versions, and other goodies that are only accessible from within that environment, which keeps your global environment from getting confused. This comes in handy when you want to pick and choose which Python version runs with which project, or which versions of certain libraries to use when a project's other dependencies don't play nice with later or earlier versions of a module.

To use our virtualenv, we need to type the following:

source venv/bin/activate

This command activates the environment and gives you access to the dependencies, environment variables, etc. of venv. On Windows, the activation script lives at venv\Scripts\activate instead. (By the way, when you create the virtualenv, you can name it whatever you want; you need not use the name "venv". This is simply a naming convention that has been widely adopted by the community.)
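If you want to confirm that activation worked, a small check like the one below can help. It is only a sketch: the function name in_virtual_environment is made up for this example, and it relies on the standard sys attributes that virtual environments change (older virtualenv releases set sys.real_prefix, while newer ones and the built-in venv module make sys.prefix differ from sys.base_prefix).

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases add a sys.real_prefix attribute.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases point sys.prefix at the environment
    # while sys.base_prefix keeps pointing at the global installation.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("interpreter:", sys.executable)
    print("inside a virtual environment:", in_virtual_environment())
```

Run it with the environment activated and the interpreter path should sit inside your venv directory.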

Now that we have activated our Python 3 environment, let's install the dependencies we put into requirements.txt:

pip install -r requirements.txt

This will install all of the dependencies listed in our text file in one command rather than "pipping" them one at a time.
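A quick way to check that the install succeeded is to ask Python whether it can locate each module. The snippet below is a sketch using the standard library's importlib; find_missing and REQUIRED_MODULES are names invented for this example, and the list simply mirrors the three entries in requirements.txt.

```python
import importlib.util

# Import names for the packages listed in requirements.txt.
REQUIRED_MODULES = ["requests", "bs4", "pandas"]


def find_missing(modules):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All dependencies are importable.")
```

If anything shows up as missing, make sure the virtual environment is activated and run the pip command again.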

Alright, our environment is set up and ready to go! In the next post, we'll start building our bot, then proceed to make it UNBLOCKABLE!!!

Top comments (2)

Thomas H Jones II • Oct 1 '18

Yeah... You're on really shaky ground with your assertion that they have to be explicit in saying "no, you can't do this." The presence of a robots.txt and other software-based discouragements has been interpreted by some courts as being an implicit "don't do this." Also, some courts see such actions not just through a civil-liability lens but criminal liability. So, proceed at your own risk and understand the possible jurisdictions where your actions might be tried ...and don't gloss over it when encouraging others (as doing so can carry its own liabilities).

kaelscion • Oct 1 '18 • Edited

I apologize, let me edit this post to update this information and thank you SO much for pointing it out. I have done a lot of this work over the years and routinely spoke to an Intellectual Property lawyer about this legality and this is how the "playing field" has been described to me on repeated occasions. I really appreciate you stepping up and informing me about this error in my assertion. I suppose the understanding I have is through the lens of this attorney's personal experience and, perhaps even the "landscape" that we have here in ME, USA. I will update this post IMMEDIATELY to include a "proceed at your own risk" rather than giving it the "it is my understanding that...". Thank you so much!

EDIT COMPLETE. POST UPDATED