Introduction
Web scraping is one of those skills that sounds complicated but is surprisingly easy to pick up in Python. In this tutorial, I'll show you how to scrape real data from a website in under 10 minutes using two popular libraries: Requests and BeautifulSoup.
By the end, you'll have a working scraper that pulls data from a webpage and saves it in a format you can actually use.
What is Web Scraping?
Web scraping is the process of automatically extracting data from websites. Instead of manually copying information, you write a script that does it for you — instantly, at scale.
Common use cases:
- Collecting product prices for comparison
- Gathering news headlines
- Building datasets for research or machine learning
- Monitoring job listings
What We'll Need
- Python 3.7+
- Basic Python knowledge (loops, lists, print statements)
- Terminal / command prompt
Step 1: Install the Libraries
Open your terminal and install the two libraries we'll use:
```
pip install requests beautifulsoup4
```
- Requests — fetches the HTML content of a webpage
- BeautifulSoup — parses and navigates that HTML so you can extract what you need
Step 2: Fetch a Web Page
We'll use books.toscrape.com — a website built specifically for scraping practice. No legal issues, no rate limits.
Create a file called scraper.py:
```python
import requests
from bs4 import BeautifulSoup

url = "http://books.toscrape.com/"
response = requests.get(url)

print(response.status_code)  # Should print 200
print(response.text[:500])   # First 500 characters of the HTML
```
Run it:
```
python scraper.py
```
If you see 200 printed, the request worked — you just downloaded an entire webpage with a few lines of code.
Step 3: Parse the HTML with BeautifulSoup
Now let's make sense of that HTML:
```python
import requests
from bs4 import BeautifulSoup

url = "http://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

print(soup.title.text)  # Prints the page title
```
BeautifulSoup turns the raw HTML into a navigable object. Think of it as a map for the webpage.
Step 4: Find the Data We Want
Right-click any book title on books.toscrape.com and hit "Inspect" in your browser. You'll see each book sits inside an `<article class="product_pod">` element, and the full title lives in the `title` attribute of the `<a>` tag inside its `<h3>`.
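Under the hood, each book block looks roughly like the trimmed-down snippet below (the real page has more elements and attributes, so treat this as a sketch), and the same navigation logic works on a hardcoded string:

```python
from bs4 import BeautifulSoup

# Simplified version of one product_pod block from the page
html = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="#" title="A Light in the Attic">A Light in ...</a></h3>
  <div class="product_price"><p class="price_color">£51.77</p></div>
</article>
"""

soup = BeautifulSoup(html, "html.parser")
book = soup.find("article", class_="product_pod")

print(book.h3.a["title"])  # A Light in the Attic
print(book.p["class"][1])  # Three
```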
Let's extract all book titles:
```python
import requests
from bs4 import BeautifulSoup

url = "http://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

books = soup.find_all('article', class_='product_pod')
for book in books:
    title = book.h3.a['title']
    print(title)
```
Run it and you'll see 20 book titles printed in your terminal. That's web scraping — done.
Step 5: Extract More Data (Price + Rating)
Let's also grab the price and star rating for each book:
```python
import requests
from bs4 import BeautifulSoup

url = "http://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

books = soup.find_all('article', class_='product_pod')
for book in books:
    title = book.h3.a['title']
    price = book.find('p', class_='price_color').text
    rating = book.p['class'][1]  # e.g. "Three", "Five"
    print(f"{title} | {price} | {rating} stars")
```
Output will look like:
```
A Light in the Attic | £51.77 | Three stars
Tipping the Velvet | £53.74 | One stars
Soumission | £50.10 | One stars
...
```
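The price comes back as a string like `"£51.77"` and the rating as a word, so if you want to do math on them later, a couple of small helpers can normalize them into numbers. The function names here are my own, not part of either library:

```python
# Map the rating word from the class attribute to an integer
RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_price(text):
    """Strip the currency symbol and return the price as a float."""
    return float(text.replace("£", "").strip())

def parse_rating(word):
    """Convert a rating word like 'Three' into the number 3."""
    return RATING_WORDS[word]

print(parse_price("£51.77"))  # 51.77
print(parse_rating("Three"))  # 3
```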
Step 6: Save the Data to a CSV File
Raw printed data isn't very useful. Let's save it to a CSV so you can open it in Excel or use it in another script:
```python
import requests
from bs4 import BeautifulSoup
import csv

url = "http://books.toscrape.com/"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
books = soup.find_all('article', class_='product_pod')

with open('books.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Price', 'Rating'])  # Header row
    for book in books:
        title = book.h3.a['title']
        price = book.find('p', class_='price_color').text
        rating = book.p['class'][1]
        writer.writerow([title, price, rating])

print("Saved to books.csv!")
```
Open books.csv and you'll have a clean spreadsheet of all 20 books with titles, prices, and ratings.
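If you want to verify the CSV round-trips cleanly before scraping for real, here's a self-contained sketch using a hardcoded row instead of scraped data (the filename `sample_books.csv` is just for illustration):

```python
import csv

sample = [["Title", "Price", "Rating"],
          ["A Light in the Attic", "£51.77", "Three"]]

# Write the sample rows out...
with open("sample_books.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(sample)

# ...and read them back in. csv.reader returns every field as a string.
with open("sample_books.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.reader(f))

print(rows == sample)  # True
```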
Step 7: Scrape Multiple Pages
The site has 50 pages. Let's loop through them all:
```python
import requests
from bs4 import BeautifulSoup
import csv

base_url = "http://books.toscrape.com/catalogue/page-{}.html"
all_books = []

for page_num in range(1, 51):  # Pages 1 to 50
    url = base_url.format(page_num)
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    books = soup.find_all('article', class_='product_pod')
    for book in books:
        title = book.h3.a['title']
        price = book.find('p', class_='price_color').text
        rating = book.p['class'][1]
        all_books.append([title, price, rating])

    print(f"Scraped page {page_num}/50")

with open('all_books.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(['Title', 'Price', 'Rating'])
    writer.writerows(all_books)

print(f"Done! Saved {len(all_books)} books.")
```
This scrapes all 1,000 books from the website and saves them in one CSV file.
Important: Be Responsible When Scraping
Before scraping any real website, always:
- Check the `robots.txt` — visit `website.com/robots.txt` to see what's allowed
- Add delays between requests using `time.sleep(1)` to avoid overloading servers
- Read the Terms of Service — some sites prohibit scraping
- Never scrape personal data without consent
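The first two points can be sketched with the standard library alone. Python's `urllib.robotparser` evaluates robots.txt rules; the `Disallow` rule below is made up for illustration, and in practice you'd load the real file with `set_url()` and `read()` instead of `parse()`:

```python
import time
from urllib import robotparser

# Feed robots.txt rules in directly (illustrative rules, not a real site's)
rp = robotparser.RobotFileParser()
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("*", "http://example.com/private/page"))  # False
print(rp.can_fetch("*", "http://example.com/public/page"))   # True

# Between requests in a scraping loop, pause to be kind to the server
time.sleep(1)
```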
What We Built
Here's a summary of what you can now do:
| Task | Code |
|---|---|
| Fetch a webpage | `requests.get(url)` |
| Parse HTML | `BeautifulSoup(html, 'html.parser')` |
| Find elements | `soup.find_all('tag', class_='name')` |
| Save to CSV | `csv.writer` |
| Scrape multiple pages | Loop with `range()` |