
Kashif Aziz


Web Scraping with Python BeautifulSoup and Requests

This is an overview of a blog post I recently wrote about how to scrape web pages using the Python BeautifulSoup and Requests libraries.

What is Web Scraping:

Web scraping is the process of automatically extracting information from a website. Web scraping, or data scraping, is useful for researchers, marketers and analysts interested in compiling, filtering and repackaging data.

A word of caution: Always respect the website’s privacy policy and check robots.txt before scraping. If a website offers an API to interact with its data, it is better to use that instead of scraping.
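For example, Python's standard library includes urllib.robotparser, which can check a site's robots.txt before you fetch anything. A minimal sketch (the ESPN URLs are used only as placeholders here):

from urllib.robotparser import RobotFileParser

# point the parser at the site's robots.txt (placeholder domain)
robots = RobotFileParser("http://www.espn.com/robots.txt")
robots.read()

# True if a generic crawler ("*") is allowed to fetch the given URL
url = "http://www.espn.com/college-sports/football/recruiting/databaseresults/"
print(robots.can_fetch("*", url))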

Web Scraping with Python and BeautifulSoup:

Web scraping in Python is a breeze. There are a number of ways to access a web page and scrape its data. I have used Python and BeautifulSoup for the purpose.

In this example, I have scraped college footballer data from the ESPN website.

The Process:

  • Install the requests and beautifulsoup4 libraries (a condensed sketch of the first three steps follows this list).
  • Fetch the web page and store it in a BeautifulSoup object.
  • Set a parser to parse the HTML in the web page. I have used the default html.parser.
  • Extract the player name, school, city, playing position and grade.
  • Append the data to a list which will be written to a CSV file at a later stage.
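In condensed form, the install and fetch-and-parse steps look roughly like this (the URL is only a placeholder):

# install first:  pip install requests beautifulsoup4
import requests
from bs4 import BeautifulSoup

# fetch the page and hand the HTML to BeautifulSoup with the built-in parser
response = requests.get("http://www.espn.com/college-sports/")  # placeholder URL
soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.get_text())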

Python BeautifulSoup Tutorial: Web Scraping In 20 Lines Of Code

The Code:

'''
Example of web scraping using Python and BeautifulSoup.
Scraping ESPN College Football data
http://www.espn.com/college-sports/football/recruiting/databaseresults/_/sportid/24/class/2006/sort/school/starsfilter/GT/ratingfilter/GT/statuscommit/Commitments/statusuncommit/Uncommited
The script will loop through a defined number of pages to extract footballer data.
'''
from bs4 import BeautifulSoup
import requests
import os
import os.path
import csv
import time


def writerows(rows, filename):
    # append the scraped rows to the CSV file
    with open(filename, 'a', encoding='utf-8') as toWrite:
        writer = csv.writer(toWrite)
        writer.writerows(rows)


def getlistings(listingurl):
    '''
    scrape footballer data from the page and return it as a list of rows
    '''
    # prepare headers
    headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0'}

    # fetch the url, bail out if the request fails
    try:
        response = requests.get(listingurl, headers=headers)
    except requests.exceptions.RequestException as e:
        print(e)
        exit()

    soup = BeautifulSoup(response.text, "html.parser")
    listings = []

    # loop through the table rows, get data from the columns
    for rows in soup.find_all("tr"):
        # only data rows carry the oddrow/evenrow classes; .get() skips rows without a class
        if ("oddrow" in rows.get("class", [])) or ("evenrow" in rows.get("class", [])):
            name = rows.find("div", class_="name").a.get_text()

            # the hometown cell combines city/state and school; split after the two-letter state code
            hometown = rows.find_all("td")[1].get_text()
            school = hometown[hometown.find(",")+4:]
            city = hometown[:hometown.find(",")+4]

            position = rows.find_all("td")[2].get_text()
            grade = rows.find_all("td")[4].get_text()

            # append data to the list
            listings.append([name, school, city, position, grade])

    return listings


if __name__ == "__main__":
    '''
    Set CSV file name.
    Remove if the file already exists to ensure a fresh start
    '''
    filename = "footballers.csv"
    if os.path.exists(filename):
        os.remove(filename)

    '''
    The url to fetch consists of three parts:
    baseurl, page number and the remaining url (which carries the class year)
    '''
    baseurl = "http://www.espn.com/college-sports/football/recruiting/databaseresults/_/page/"
    page = 1
    parturl = "/sportid/24/class/2006/sort/school/starsfilter/GT/ratingfilter/GT/statuscommit/Commitments/statusuncommit/Uncommited"

    # scrape pages 1 through 258
    while page < 259:
        listingurl = baseurl + str(page) + parturl
        listings = getlistings(listingurl)

        # write to CSV
        writerows(listings, filename)

        # take a break between requests
        time.sleep(3)
        page += 1

    if page > 1:
        print("Listings fetched successfully.")

A detailed blog post is available here.


Top comments (2)

tarifa-man

hi there dear Kashif - many many thanks for this great idea and nice example - I ran this on a freshly installed ATOM and got some errors, e.g. due to encoding issues. See here:

Traceback (most recent call last):
  File "/home/martin/dev/vscode/dev_to_scraper.py", line 81, in <module>
    writerows(listings, filename)
  File "/home/martin/dev/vscode/dev_to_scraper.py", line 17, in writerows
    with open(filename, 'a', encoding='utf-8') as toWrite:
TypeError: 'encoding' is an invalid keyword argument for this function
[Finished in 14.479s]

guess that I have to do some setup in the ATOM editor!? what do you think
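That TypeError is usually not an Atom issue: Python 2's built-in open() does not accept an encoding argument, so the traceback suggests the script was run with a Python 2 interpreter. Running it with python3 resolves it; a small, hypothetical guard (not part of the original script) that makes the failure explicit:

# the script targets Python 3; this guard turns Python 2's confusing
# open()/encoding TypeError into a clear message
import sys

if sys.version_info < (3,):
    raise SystemExit("This script needs Python 3: Python 2's built-in open() "
                     "has no 'encoding' argument.")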


