Vhub Systems

Python Web Scraping Tutorial for Beginners 2026: From Zero to Your First Scraper


This guide takes you from an empty file to a working web scraper in about 30 minutes. We'll build a real scraper — not a toy example — and you'll understand exactly what every line does.

What You'll Build

By the end of this tutorial, you'll have a Python script that:

  • Loads any website
  • Finds specific data on the page (products, prices, headlines, etc.)
  • Exports the results to a CSV file

Prerequisites

  • Python 3.x installed (check: python3 --version)
  • pip package manager (comes with Python)
  • Basic familiarity with Python (variables, loops, functions)

Step 1: Install the Libraries

We need two libraries:

pip install requests beautifulsoup4
  • requests — loads web pages (makes HTTP requests)
  • beautifulsoup4 — parses HTML and helps us find elements

Step 2: Your First Request

Let's understand what scraping actually is. Every time you visit a website in your browser, your browser sends an HTTP request and receives HTML back. Web scraping does the same thing — except with Python instead of a browser.

import requests

# Load the page
response = requests.get("https://news.ycombinator.com")

# Check if it worked (200 = success)
print(f"Status code: {response.status_code}")

# Look at the HTML
print(response.text[:500])  # First 500 characters

Run this. You'll see a bunch of HTML printed. That's the raw page data.

Step 3: Parse the HTML with BeautifulSoup

Raw HTML is hard to work with. BeautifulSoup makes it easy to navigate:

import requests
from bs4 import BeautifulSoup

url = "https://news.ycombinator.com"
response = requests.get(url)

# Parse the HTML
soup = BeautifulSoup(response.text, 'html.parser')

# Find elements using CSS selectors
# Inspect the page in browser → right-click → Inspect to find selectors
headlines = soup.select('.titleline > a')  # direct children only, skipping the "(domain)" links

# Print the first 5 headlines
for headline in headlines[:5]:
    print(headline.text)

How to find the right selector:

  1. Open the target page in Chrome/Firefox
  2. Right-click the element you want → "Inspect"
  3. Look at the HTML structure — note the class names and tags
  4. Use those in soup.select('.classname') or soup.find('tag')
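To see the difference between the two lookup styles, here's a self-contained sketch on a snippet of inline HTML (the class names are made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical markup, just to demo the two lookup styles
html = """
<div class="product"><h2 class="name">Widget</h2></div>
<div class="product"><h2 class="name">Gadget</h2></div>
"""
soup = BeautifulSoup(html, 'html.parser')

# select() takes a CSS selector and returns ALL matches as a list
names = [el.text for el in soup.select('.name')]
print(names)  # ['Widget', 'Gadget']

# find() returns only the FIRST matching tag (or None if nothing matches)
first = soup.find('h2', class_='name')
print(first.text)  # Widget
```

Use `select` when you want every match, `find`/`select_one` when you want just one.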

Step 4: Extract Multiple Data Points

Let's build a real example — scraping Hacker News stories with their scores:

import requests
from bs4 import BeautifulSoup

def scrape_hackernews() -> list:
    url = "https://news.ycombinator.com"

    # Add a User-Agent to look like a browser (polite scraping)
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
    }

    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')

    stories = []

    # Find all story rows
    for row in soup.select('.athing'):
        story_id = row.get('id')

        # Title and URL
        title_elem = row.select_one('.titleline > a')
        if not title_elem:
            continue

        title = title_elem.text.strip()
        link = title_elem.get('href', '')

        # Score is in the next row (sibling element)
        score_row = row.find_next_sibling('tr')
        score_elem = score_row.select_one('.score') if score_row else None
        score = score_elem.text.replace(' points', '') if score_elem else '0'

        # Comments count: several subtext links match item?id=,
        # and the comments link is the last of them
        comment_links = score_row.select('a[href*="item?id="]') if score_row else []
        comments_text = comment_links[-1].text if comment_links else '0 comments'
        comments = comments_text.split()[0] if 'comment' in comments_text else '0'

        stories.append({
            'title': title,
            'url': link,
            'score': score,
            'comments': comments,
        })

    return stories

# Run it
stories = scrape_hackernews()

# Print results
for story in stories[:10]:
    print(f"[{story['score']} pts] {story['title'][:60]}")
    print(f"  URL: {story['url'][:50]}")
    print()

Step 5: Save to CSV

import csv

def save_to_csv(data: list, filename: str):
    if not data:
        print("No data to save")
        return

    with open(filename, 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=data[0].keys())
        writer.writeheader()
        writer.writerows(data)

    print(f"Saved {len(data)} rows to {filename}")

# Save Hacker News data
stories = scrape_hackernews()
save_to_csv(stories, 'hn_stories.csv')

Open the CSV in Excel or Google Sheets — your data is ready.
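If you'd rather sanity-check the data from Python instead of opening a spreadsheet, `csv.DictReader` reads it straight back into dicts. A quick sketch using an in-memory buffer instead of a real file:

```python
import csv
import io

# Write a couple of rows to an in-memory "file"
buf = io.StringIO()
rows = [
    {'title': 'Example story', 'score': '123'},
    {'title': 'Another story', 'score': '45'},
]
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)

# Read it back; note that every value comes back as a string
buf.seek(0)
loaded = list(csv.DictReader(buf))
print(loaded[0]['title'])  # Example story
print(loaded[0]['score'])  # 123
```

The same `DictReader` call works on `open('hn_stories.csv', encoding='utf-8')` — just remember that numeric fields need an explicit `int()`/`float()` conversion after loading.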

Step 6: Handle Common Errors

Real websites don't always cooperate. Here's how to handle common issues:

import requests
import time

def safe_request(url: str, retries: int = 3) -> requests.Response | None:
    """Make a request with error handling and retries"""

    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
    }

    for attempt in range(retries):
        try:
            response = requests.get(url, headers=headers, timeout=10)

            if response.status_code == 200:
                return response

            elif response.status_code == 429:
                # Rate limited — wait and retry
                wait = 10 * (attempt + 1)
                print(f"Rate limited. Waiting {wait}s...")
                time.sleep(wait)

            elif response.status_code == 403:
                print(f"Access denied (403) for {url}")
                return None

            else:
                print(f"Got status {response.status_code}, retrying...")
                time.sleep(2)

        except requests.Timeout:
            print(f"Timeout on attempt {attempt + 1}")
            time.sleep(2)

        except requests.ConnectionError:
            print(f"Connection error on attempt {attempt + 1}")
            time.sleep(5)

    return None

# Usage
response = safe_request("https://example.com")
if response:
    soup = BeautifulSoup(response.text, 'html.parser')
    # ... continue scraping

Step 7: Scraping Multiple Pages

Most useful data spans multiple pages:

import requests
from bs4 import BeautifulSoup
import time

def scrape_all_pages(base_url: str, max_pages: int = 5) -> list:
    all_items = []

    for page_num in range(1, max_pages + 1):
        url = f"{base_url}?page={page_num}"
        print(f"Scraping page {page_num}...")

        response = requests.get(url, headers={
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"
        })

        if response.status_code != 200:
            print(f"Page {page_num} failed, stopping")
            break

        soup = BeautifulSoup(response.text, 'html.parser')

        # Extract items from this page
        items = soup.select('.item-class')  # Replace with actual selector

        if not items:
            print(f"No items on page {page_num}, done")
            break

        for item in items:
            all_items.append({
                'text': item.text.strip(),
                'page': page_num,
            })

        # Be polite — wait between pages
        time.sleep(1)

    return all_items

Complete Example: E-commerce Price Scraper

Here's a complete, working scraper for a real use case:

import requests
from bs4 import BeautifulSoup
import csv
import time

def scrape_books(max_pages: int = 3) -> list:
    """
    Scrape books from books.toscrape.com (a practice scraping site)
    Returns list of dicts with: title, price, rating, availability
    """
    base_url = "https://books.toscrape.com/catalogue/page-{}.html"
    books = []

    rating_map = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}

    for page in range(1, max_pages + 1):
        url = base_url.format(page)
        print(f"Scraping page {page}...")

        r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"})
        if r.status_code != 200:
            break

        soup = BeautifulSoup(r.text, 'html.parser')

        for book in soup.select('article.product_pod'):
            # Title
            title = book.find('h3').find('a').get('title', '')

            # Price: strip the £ sign; the extra 'Â' replace guards against a
            # common mojibake artifact when the response encoding is misdetected
            price_text = book.find('p', class_='price_color').text
            price = float(price_text.replace('£', '').replace('Â', '').strip())

            # Star rating (stored as CSS class)
            rating_word = book.find('p', class_='star-rating').get('class', ['', ''])[1]
            rating = rating_map.get(rating_word, 0)

            # Availability
            availability = book.find('p', class_='availability').text.strip()

            books.append({
                'title': title,
                'price': price,
                'rating': rating,
                'availability': availability,
            })

        time.sleep(0.5)  # Polite delay

    return books

# Run and save
books = scrape_books(max_pages=5)
print(f"\nScraped {len(books)} books")

# Save to CSV
with open('books.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price', 'rating', 'availability'])
    writer.writeheader()
    writer.writerows(books)

# Simple analysis
prices = [b['price'] for b in books]
print(f"Price range: £{min(prices):.2f} - £{max(prices):.2f}")
print(f"Average price: £{sum(prices)/len(prices):.2f}")
print(f"Top rated books (5 stars):")
five_star = [b for b in books if b['rating'] == 5]
for b in five_star[:5]:
    print(f"  {b['title'][:50]} - £{b['price']:.2f}")

This is a complete, working scraper. Run it — you'll get a CSV with 100 books including titles, prices, and ratings.

What's Next?

Once you've mastered requests + BeautifulSoup, here are the natural next steps:

For sites that block you:
→ Use curl_cffi with impersonate="chrome124" — bypasses TLS fingerprinting that blocks requests

For JavaScript-rendered sites:
→ Use Playwright — launches a real browser that executes JavaScript

For scraping at scale (100+ pages):
→ Use Scrapy — handles concurrency, retries, and data pipelines automatically

For anti-bot heavy sites:
→ Add residential proxies + session warm-up (visit homepage first)

Quick Reference

# Minimal working scraper template
import requests
from bs4 import BeautifulSoup
import csv, time

URL = "https://target-site.com"
HEADERS = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36"}

r = requests.get(URL, headers=HEADERS)
soup = BeautifulSoup(r.text, 'html.parser')

results = []
for item in soup.select('.your-selector'):
    results.append({
        'field1': item.select_one('.field1-selector').text.strip(),
        'field2': item.select_one('.field2-selector').text.strip(),
    })

with open('output.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=results[0].keys())
    writer.writeheader()
    writer.writerows(results)

print(f"Done: {len(results)} items saved")

That's it. Change the URL and selectors, and this template works for 80% of basic scraping tasks.
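One caveat: `select_one` returns `None` when a selector misses, and `.text` on `None` raises `AttributeError`. A small guard helper (my addition, not part of the template above; the markup below is hypothetical) keeps one malformed item from killing the whole run:

```python
from bs4 import BeautifulSoup

def safe_text(parent, selector: str, default: str = '') -> str:
    """Return stripped text for a selector, or a default if it's missing."""
    el = parent.select_one(selector)
    return el.text.strip() if el else default

# Demo on made-up markup where one item is missing its price
html = """
<div class="item"><span class="name"> Widget </span><span class="price">9.99</span></div>
<div class="item"><span class="name">Gadget</span></div>
"""
soup = BeautifulSoup(html, 'html.parser')
for item in soup.select('.item'):
    print(safe_text(item, '.name'), safe_text(item, '.price', default='N/A'))
# Widget 9.99
# Gadget N/A
```

Swap `safe_text(item, '.field1-selector')` into the template's dict and a missing field becomes an empty string instead of a crash.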


Next Steps

If you want to skip the setup and use pre-built scrapers instead of building from scratch:

Apify Scrapers Bundle — $29 one-time — 30 production-ready actors for Google, LinkedIn, Amazon, and more.

All code, documented, ready to deploy.

