This is an overview of a blog post I recently wrote about how to scrape web pages using Python's BeautifulSoup and Requests libraries.
What is Web Scraping:
Web scraping is the process of automatically extracting information from a website. Web scraping, or data scraping, is useful for researchers, marketers and analysts interested in compiling, filtering and repackaging data.
A word of caution: Always respect the website’s privacy policy and check robots.txt before scraping. If a website offers an API to interact with its data, it is better to use that instead of scraping.
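For example, Python's built-in urllib.robotparser can check whether a path is allowed before you fetch it. The sketch below uses placeholder URLs, not the site scraped later in this post.

from urllib import robotparser

# quick pre-flight check against robots.txt; the URLs are placeholders
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

if rp.can_fetch("*", "https://example.com/some/page"):
    print("Allowed to scrape this path")
else:
    print("Disallowed by robots.txt")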
Web Scraping with Python and BeautifulSoup:
Web scraping in Python is a breeze. There are a number of ways to access a web page and scrape its data. I have used Python and BeautifulSoup for the purpose.
In this example, I have scraped college footballer data from the ESPN website.
The Process:
- Install the requests and beautifulsoup4 libraries.
- Fetch the web page with Requests and load it into a BeautifulSoup object (a short sketch follows this list).
- Set a parser to parse the HTML of the page. I have used the default html.parser.
- Extract the player name, school, city, playing position and grade.
- Append the data to a list, which will be written to a CSV file at a later stage.
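Before the full script, here is a minimal sketch of the fetch-and-parse steps: requesting a page and handing the HTML to BeautifulSoup with the default html.parser. The URL is a placeholder, not the ESPN endpoint used later.

# Minimal sketch of fetching a page and parsing it; the URL is a placeholder
import requests
from bs4 import BeautifulSoup

response = requests.get("https://example.com")
soup = BeautifulSoup(response.text, "html.parser")

# once parsed, table rows can be walked like this
for row in soup.find_all("tr"):
    print(row.get_text())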
The Code:
'''
Example of web scraping using Python and BeautifulSoup.
Scraping ESPN College Football data
http://www.espn.com/college-sports/football/recruiting/databaseresults/_/sportid/24/class/2006/sort/school/starsfilter/GT/ratingfilter/GT/statuscommit/Commitments/statusuncommit/Uncommited
The script will loop through a defined number of pages to extract footballer data.
'''

from bs4 import BeautifulSoup
import requests
import os
import os.path
import csv
import time


def writerows(rows, filename):
    # append the rows to the CSV file; newline='' keeps csv from
    # inserting blank lines on Windows
    with open(filename, 'a', newline='', encoding='utf-8') as toWrite:
        writer = csv.writer(toWrite)
        writer.writerows(rows)


def getlistings(listingurl):
    '''
    scrape footballer data from the page and return it as a list of rows
    '''
    # prepare headers
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}

    # fetch the url, bailing out if the request fails
    try:
        response = requests.get(listingurl, headers=headers)
    except requests.exceptions.RequestException as e:
        print(e)
        exit()

    soup = BeautifulSoup(response.text, "html.parser")
    listings = []

    # loop through the table rows, get data from the columns
    for rows in soup.find_all("tr"):
        # data rows carry an "oddrow" or "evenrow" class; get() guards
        # against rows that have no class attribute at all
        if ("oddrow" in rows.get("class", [])) or ("evenrow" in rows.get("class", [])):
            name = rows.find("div", class_="name").a.get_text()
            hometown = rows.find_all("td")[1].get_text()
            school = hometown[hometown.find(",")+4:]
            city = hometown[:hometown.find(",")+4]
            position = rows.find_all("td")[2].get_text()
            grade = rows.find_all("td")[4].get_text()

            # append data to the list
            listings.append([name, school, city, position, grade])

    return listings


if __name__ == "__main__":
    '''
    Set the CSV file name.
    Remove the file if it already exists to ensure a fresh start.
    '''
    filename = "footballers.csv"
    if os.path.exists(filename):
        os.remove(filename)

    '''
    The url to fetch consists of 3 parts:
    baseurl, page number, remaining url
    '''
    baseurl = "http://www.espn.com/college-sports/football/recruiting/databaseresults/_/page/"
    page = 1
    parturl = "/sportid/24/class/2006/sort/school/starsfilter/GT/ratingfilter/GT/statuscommit/Commitments/statusuncommit/Uncommited"

    # scrape all pages (1 through 258), pausing between requests
    while page < 259:
        listingurl = baseurl + str(page) + parturl
        listings = getlistings(listingurl)

        # write to CSV
        writerows(listings, filename)

        # take a break
        time.sleep(3)
        page += 1

    if page > 1:
        print("Listings fetched successfully.")
Top comments (2)
hi there dear Kashif - many many thanks for this great idea and nice example - i ran this on a fresh install of ATOM and got some errors, e.g. due to encoding issues. see here
Traceback (most recent call last):
File "/home/martin/dev/vscode/dev_to_scraper.py", line 81, in
writerows(listings, filename)
File "/home/martin/dev/vscode/dev_to_scraper.py", line 17, in writerows
with open(filename, 'a', encoding='utf-8') as toWrite:
TypeError: 'encoding' is an invalid keyword argument for this function
[Finished in 14.479s]
guess that I have to do some setup in the ATOM editor!? what do you think
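That traceback is typical of Python 2 rather than an ATOM setting: the encoding keyword on the built-in open() only exists in Python 3. A rough sketch of a writerows that runs under either interpreter (assuming the rest of the script stays unchanged):

import csv
import sys

def writerows(rows, filename):
    # open() only accepts encoding on Python 3; on Python 2 the csv module
    # works with byte strings, so a plain binary append is enough
    if sys.version_info[0] >= 3:
        out = open(filename, 'a', newline='', encoding='utf-8')
    else:
        out = open(filename, 'ab')
    with out as toWrite:
        writer = csv.writer(toWrite)
        writer.writerows(rows)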