Intro to Web Scraping a Table


What is web scraping? From my limited knowledge, web scraping is getting info off a webpage by utilizing the underlying HTML code that contains this vital information.

Today, I am going to walk through the general process to web scrape a table off the internet. In the process, I hope to answer a random question: Does cold weather affect quarterback play?


What you need:

  • requests (HTTP library)
  • BeautifulSoup4 (Parser library)
  • pandas (data manipulation library)

Use this link for more setup details

Overview of Web Scraping

1) Get the response
2) Find the object
3) Parse and store the object
4) Finalize the data

1) Get the response.

This is the easy part. You use 'requests' to get the response from the URL and then turn the response into a 'soup' with BS4, which you can then navigate and parse.

from requests import get
from bs4 import BeautifulSoup as soup

url = 'http://www.espn.com/nfl/qbr/_/type/player-week/week/3'
response = get(url)  # getting the URL response
nfl = soup(response.content, 'html.parser')  # turning the response into soup
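
It is also worth sanity-checking the response before parsing:

response.raise_for_status()  # raises an exception for 4xx/5xx responses
print(response.status_code)  # 200 means the page came back OK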

2) Find the object (the hard part)

For beginners, this is the hard part because HTML is a little daunting to decipher at first, but there are some underlying principles to help.

With BS4, you can navigate the HTML soup in two ways.

1) Parent, Child, Sibling Hierarchy
HTML is structured with higher-level 'Parent' tags that encompass 'Child' tags. BS4 has methods to manually move line by line through the HTML if you need to fine-tune where you are, as in the sketch below.
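
Here is a minimal sketch of that navigation, assuming the 'nfl' soup from step 1 (the specific tags are just for illustration):

table = nfl.find('table')   # grab a tag object to start from
print(table.parent.name)    # move up to the enclosing parent tag
row = table.find('tr')      # move down to a child tag
print(row.next_sibling)     # move sideways to the next node (often whitespace)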

2) .find( ) and .find_all( )

This is the easier method. You can tell BS4 which type of tag to find and which tag name or attribute you are looking for.

How do you know what name you want? Easy: navigate the webpage, right-click the object/table you want to parse, and 'inspect' it. This should open a window that directs you to the corresponding HTML code.
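
For example, here is a quick sketch against this page's soup (the 'div' class name is hypothetical; copy the real one from the inspect window):

qbr_table = nfl.find('table')                    # first matching tag, or None
all_tables = nfl.find_all('table')               # every matching tag, as a list
content = nfl.find('div', class_='mod-content')  # you can also match by attribute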

Even easier in HTML are tables, because they are structured fairly similarly across most webpages.

1) First, find the 'table' tag. This will grab the whole table object.
2) Next, find_all row tags as 'tr'. This will grab all the rows in the table.
3) Next, find_all cell tags as 'td' per row. This will grab all the cell items in each row.
4) Sometimes the header row is tagged as 'th'. This is useful if you want to label your columns the same as the headers, as in the sketch after this list.
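
Here is a minimal sketch of that pattern, using the 'nfl' soup from step 1:

table = nfl.find('table')                             # 1) the whole table object
rows = table.find_all('tr')                           # 2) all rows in the table
cells = [td.text for td in rows[1].find_all('td')]    # 3) cell text in one data row
headers = [th.text for th in rows[0].find_all('th')]  # 4) header labels, if present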

I found that slowly working through each object to find my place in the HTML code was best. Many times I would index out different objects to see the response and adjust accordingly. It is a pretty iterative process in the beginning, and you might have to backtrack to move forward. Referring back to the original web page can also help you find where you are in the HTML code.

General work flow:

1) Attempt to access an object/tag
2) Count/verify # of objects
3) See the response
4) Verify with web page
5) Go further into the object or repeat the process.

Also, if you are stuck accessing info or navigating, use this cheatsheet.

http://akul.me/blog/2016/beautifulsoup-cheatsheet/

Below is the code summarizing this process:

tables = nfl.findAll('table')  # finding all tables in this soup (lucky for me, only one table)
len(tables)  # checking the number of tables; as predicted, only 1
rows = tables[0].findAll('tr')  # from that table I now look for rows; 'tr' is a row in HTML
len(rows)  # verifying how many rows I am trying to get
first_row = rows[1]  # inspect one row
first_row.findAll('td')[1].text  # looking at one row and one element

3) Parse and store the object.

Once you have found a way to access the object and the info you want in the table, just repeat the same process over each row. I find that a function that parses one row, combined with a list comprehension to loop over all rows, works best. Below is the code:

import pandas as pd

def parse_row(row):
    return [x.text for x in row.findAll('td')]

list_parsed_rows = [parse_row(row) for row in rows[0:]]  # parse every row into a list
df = pd.DataFrame(list_parsed_rows)
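
One gotcha: since parse_row only looks for 'td' tags, a 'th' header row comes back as an empty list. A hedged fix, assuming row 0 is that header row on this page:

df = pd.DataFrame(list_parsed_rows[1:])                 # keep only the data rows
df.columns = [th.text for th in rows[0].findAll('th')]  # label columns from the headers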

4) Finalize the data.

Use pandas to wrangle the data into whatever your heart desires. I found that some columns held multiple pieces of information and needed to be parsed out even further.
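
For example, here is a hedged sketch of splitting one of those combined columns (the column names are hypothetical; use the ones in your DataFrame):

df[['player', 'team']] = df['PLAYER'].str.split(', ', expand=True)  # split "Name, TEAM" cells
df['QBR'] = pd.to_numeric(df['QBR'], errors='coerce')               # turn text cells into numbers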

Below is the result I got from web scraping ESPN for quarterback rating (QBR) and another weather site for temperature. QBR is normalized from 0 to 100, with 50 being the average QB score/effectiveness. In general, cold does seem to be a hindrance, as seen from the void of high QBR at low temperatures.

Cheers!

[Figure: scatter plot of QBR vs. game-day temperature]
