Python Web Scraping Tutorial for Beginners 2026: From Zero to Your First Scraper
This guide takes you from zero scraping experience to a working web scraper in about 30 minutes. We'll build a real scraper, not a toy example, and you'll understand exactly what every line does.
What You'll Build
By the end of this tutorial, you'll have a Python script that:
- Loads any website
- Finds specific data on the page (products, prices, headlines, etc.)
- Exports the results to a CSV file
Prerequisites
- Python 3.x installed (check: python3 --version)
- pip package manager (comes with Python)
- Basic familiarity with Python (variables, loops, functions)
Step 1: Install the Libraries
We need two libraries:
pip install requests beautifulsoup4
- requests — loads web pages (makes HTTP requests)
- beautifulsoup4 — parses HTML and helps us find elements
Step 2: Your First Request
Let's understand what scraping actually is. Every time you visit a website in your browser, your browser sends an HTTP request and receives HTML back. Web scraping does the same thing — except with Python instead of a browser.
import requests
# Load the page
response = requests.get("https://news.ycombinator.com")
# Check if it worked (200 = success)
print(f"Status code: {response.status_code}")
# Look at the HTML
print(response.text[:500]) # First 500 characters
Run this. You'll see a bunch of HTML printed. That's the raw page data.
Step 3: Parse the HTML with BeautifulSoup
Raw HTML is hard to work with. BeautifulSoup makes it easy to navigate:
import requests
from bs4 import BeautifulSoup
url = "https://news.ycombinator.com"
response = requests.get(url)
# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')
# Find elements using CSS selectors
# Inspect the page in browser → right-click → Inspect to find selectors
headlines = soup.select('.titleline a')
# Print the first 5 headlines
for headline in headlines[:5]:
    print(headline.text)
How to find the right selector:
- Open the target page in Chrome/Firefox
- Right-click the element you want → "Inspect"
- Look at the HTML structure — note the class names and tags
- Use those in soup.select('.classname') or soup.find('tag')
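To see select and find in action without hitting a live site, here's a minimal sketch that parses a hardcoded HTML snippet (the class names here are made up for illustration):

```python
from bs4 import BeautifulSoup

html = '<div class="post"><h2 class="headline">Hello</h2><a href="/read">Read more</a></div>'
soup = BeautifulSoup(html, "html.parser")

# CSS selector: all elements with class "headline"
print(soup.select('.headline')[0].text)   # Hello

# Tag search: first <a> tag
print(soup.find('a')['href'])             # /read
```

select always returns a list (possibly empty), while find returns the first matching element or None, which is why checks like `if not title_elem` matter later.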
Step 4: Extract Multiple Data Points
Let's build a real example — scraping Hacker News stories with their scores:
import requests
from bs4 import BeautifulSoup
def scrape_hackernews() -> list:
    url = "https://news.ycombinator.com"
    # Add a User-Agent to look like a browser (polite scraping)
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
    }
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    stories = []
    # Find all story rows
    for row in soup.select('.athing'):
        story_id = row.get('id')
        # Title and URL
        title_elem = row.select_one('.titleline a')
        if not title_elem:
            continue
        title = title_elem.text.strip()
        link = title_elem.get('href', '')
        # Score is in the next row (sibling element)
        score_row = row.find_next_sibling('tr')
        score_elem = score_row.select_one('.score') if score_row else None
        score = score_elem.text.replace(' points', '') if score_elem else '0'
        # Comments count
        comments_elem = score_row.select_one('a[href*="item?id="]') if score_row else None
        comments_text = comments_elem.text if comments_elem else '0 comments'
        comments = comments_text.split()[0] if comments_text else '0'
        stories.append({
            'title': title,
            'url': link,
            'score': score,
            'comments': comments,
        })
    return stories

# Run it
stories = scrape_hackernews()

# Print results
for story in stories[:10]:
    print(f"[{story['score']} pts] {story['title'][:60]}")
    print(f"    URL: {story['url'][:50]}")
    print()
Step 5: Save to CSV
import csv
def save_to_csv(data: list, filename: str):
    if not data:
        print("No data to save")
        return
    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)
    print(f"Saved {len(data)} rows to {filename}")

# Save Hacker News data
stories = scrape_hackernews()
save_to_csv(stories, 'hn_stories.csv')
Open the CSV in Excel or Google Sheets — your data is ready.
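If you'd rather keep processing the data in Python instead of a spreadsheet, csv.DictReader reads the file straight back into dicts. A minimal round-trip sketch (demo.csv is a throwaway filename for illustration):

```python
import csv

# Round-trip sketch: write two rows, then read them back with DictReader
rows_out = [
    {'title': 'Example story', 'score': '120'},
    {'title': 'Another story', 'score': '85'},
]
with open('demo.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=rows_out[0].keys())
    writer.writeheader()
    writer.writerows(rows_out)

with open('demo.csv', newline='', encoding='utf-8') as f:
    rows_in = list(csv.DictReader(f))

print(rows_in[0]['title'])  # Example story
```

Note that everything comes back as strings; convert scores with int() if you want to sort or filter numerically.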
Step 6: Handle Common Errors
Real websites don't always cooperate. Here's how to handle common issues:
import requests
import time
from bs4 import BeautifulSoup

def safe_request(url: str, retries: int = 3) -> requests.Response | None:
    """Make a request with error handling and retries."""
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
    }
    for attempt in range(retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)
            if response.status_code == 200:
                return response
            elif response.status_code == 429:
                # Rate limited — wait and retry
                wait = 10 * (attempt + 1)
                print(f"Rate limited. Waiting {wait}s...")
                time.sleep(wait)
            elif response.status_code == 403:
                print(f"Access denied (403) for {url}")
                return None
            else:
                print(f"Got status {response.status_code}, retrying...")
                time.sleep(2)
        except requests.Timeout:
            print(f"Timeout on attempt {attempt + 1}")
            time.sleep(2)
        except requests.ConnectionError:
            print(f"Connection error on attempt {attempt + 1}")
            time.sleep(5)
    return None

# Usage
response = safe_request("https://example.com")
if response:
    soup = BeautifulSoup(response.text, 'html.parser')
    # ... continue scraping
Step 7: Scraping Multiple Pages
Most useful data spans multiple pages:
import requests
from bs4 import BeautifulSoup
import time
def scrape_all_pages(base_url: str, max_pages: int = 5) -> list:
    all_items = []
    for page_num in range(1, max_pages + 1):
        url = f"{base_url}?page={page_num}"
        print(f"Scraping page {page_num}...")
        response = requests.get(url, headers={
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
        })
        if response.status_code != 200:
            print(f"Page {page_num} failed, stopping")
            break
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract items from this page
        items = soup.select('.item-class')  # Replace with the actual selector
        if not items:
            print(f"No items on page {page_num}, done")
            break
        for item in items:
            all_items.append({
                'text': item.text.strip(),
                'page': page_num,
            })
        # Be polite — wait between pages
        time.sleep(1)
    return all_items
Complete Example: E-commerce Price Scraper
Here's a complete, working scraper for a real use case:
import requests
from bs4 import BeautifulSoup
import csv
import time
def scrape_books(max_pages: int = 3) -> list:
    """
    Scrape books from books.toscrape.com (a practice scraping site).
    Returns a list of dicts with: title, price, rating, availability.
    """
    base_url = "https://books.toscrape.com/catalogue/page-{}.html"
    books = []
    rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}
    for page in range(1, max_pages + 1):
        url = base_url.format(page)
        print(f"Scraping page {page}...")
        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        if r.status_code != 200:
            break
        soup = BeautifulSoup(r.text, 'html.parser')
        for book in soup.select('article.product_pod'):
            # Title
            title = book.find('h3').find('a').get('title', '')
            # Price (strip the £ symbol and any stray mis-decoded bytes)
            price_text = book.find('p', class_='price_color').text
            price = float(price_text.replace('£', '').replace('Â', '').strip())
            # Star rating (stored as a CSS class, e.g. "star-rating Three")
            rating_word = book.find('p', class_='star-rating').get('class', ['', ''])[1]
            rating = rating_map.get(rating_word, 0)
            # Availability
            availability = book.find('p', class_='availability').text.strip()
            books.append({
                'title': title,
                'price': price,
                'rating': rating,
                'availability': availability,
            })
        time.sleep(0.5)  # Polite delay
    return books

# Run and save
books = scrape_books(max_pages=5)
print(f"\nScraped {len(books)} books")

# Save to CSV
with open('books.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price', 'rating', 'availability'])
    writer.writeheader()
    writer.writerows(books)

# Simple analysis
prices = [b['price'] for b in books]
print(f"Price range: £{min(prices):.2f} - £{max(prices):.2f}")
print(f"Average price: £{sum(prices)/len(prices):.2f}")
print("Top rated books (5 stars):")
five_star = [b for b in books if b['rating'] == 5]
for b in five_star[:5]:
    print(f"  {b['title'][:50]} - £{b['price']:.2f}")
This is a complete, working scraper. Run it — you'll get a CSV with 100 books including titles, prices, and ratings.
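The star-rating trick in that scraper (reading a value out of a CSS class name) is worth isolating, because many sites encode data this way. A minimal sketch against a hardcoded snippet mimicking books.toscrape.com's markup:

```python
from bs4 import BeautifulSoup

rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

html = '<p class="star-rating Three">...</p>'
soup = BeautifulSoup(html, "html.parser")

# BeautifulSoup parses the class attribute as a list: ['star-rating', 'Three']
classes = soup.find('p', class_='star-rating').get('class', ['', ''])
rating = rating_map.get(classes[1], 0)
print(rating)  # 3
```

The `['', '']` default and the `rating_map.get(..., 0)` fallback mean a missing or unexpected class degrades to 0 instead of crashing.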
What's Next?
Once you've mastered requests + BeautifulSoup, here are the natural next steps:
For sites that block you:
→ Use curl_cffi with impersonate="chrome124" — bypasses TLS fingerprinting that blocks requests
For JavaScript-rendered sites:
→ Use Playwright — launches a real browser that executes JavaScript
For scraping at scale (100+ pages):
→ Use Scrapy — handles concurrency, retries, and data pipelines automatically
For anti-bot heavy sites:
→ Add residential proxies + session warm-up (visit homepage first)
Recommended reading:
- Web Scraping Tools Comparison 2026: requests vs curl_cffi vs Playwright vs Scrapy
- Web Scraping Without Getting Banned in 2026
- How to Cut Web Scraping Costs by 95%
Quick Reference
# Minimal working scraper template
import requests
from bs4 import BeautifulSoup
import csv, time

URL = "https://target-site.com"
HEADERS = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"}

r = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(r.text, 'html.parser')

results = []
for item in soup.select('.your-selector'):
    results.append({
        'field1': item.select_one('.field1-selector').text.strip(),
        'field2': item.select_one('.field2-selector').text.strip(),
    })

with open('output.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=results[0].keys())
    writer.writeheader()
    writer.writerows(results)

print(f"Done: {len(results)} items saved")
That's it. Change the URL and selectors, and this template works for 80% of basic scraping tasks.
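One caveat with the template: select_one returns None when a selector matches nothing, so calling .text on the result raises AttributeError on any item missing a field. A small helper makes the loop robust (safe_text is my own name for this sketch, not part of the tutorial's code):

```python
from bs4 import BeautifulSoup

def safe_text(parent, selector: str, default: str = "") -> str:
    """Return the stripped text of the first match, or a default if none."""
    elem = parent.select_one(selector)
    return elem.text.strip() if elem else default

# Works even when a field is missing from one item
html = """
<div class="item"><span class="name">Widget</span><span class="price">9.99</span></div>
<div class="item"><span class="name">Gadget</span></div>
"""
soup = BeautifulSoup(html, "html.parser")
rows = [
    {"name": safe_text(item, ".name"), "price": safe_text(item, ".price", "N/A")}
    for item in soup.select(".item")
]
print(rows)  # [{'name': 'Widget', 'price': '9.99'}, {'name': 'Gadget', 'price': 'N/A'}]
```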
Next Steps
If you want to skip the setup and use pre-built scrapers instead of building from scratch:
Apify Scrapers Bundle — $29 one-time — 30 production-ready actors for Google, LinkedIn, Amazon, and more.
All code, documented, ready to deploy.