Let's talk about getting data from websites. I'm not talking about copying and pasting. I mean teaching your computer to visit web pages, read them, and pull out the information you need, all by itself. This is called web scraping. It's how I gather prices for comparison, collect news headlines for analysis, or monitor changes on a competitor's site. Python is my favorite tool for this job because it's like having a well-stocked toolbox. Today, I'll walk you through eight methods I use regularly to collect data from the modern web. Think of it as a practical guide, filled with code you can actually use.
The journey starts with a simple question: how does your browser get a web page? It sends a request. We can do the same in Python. The requests library is my starting point. It's like a polite courier that goes to a website address and brings back the page's content. But the web isn't always friendly. Servers can be busy, or they might temporarily reject you. That's why I never send a request without planning for failure.
Here’s how I set up a reliable courier. I create a session, which is like giving my courier a briefcase. In this briefcase, I put instructions to retry if something goes wrong, and I make him look like a normal web browser by setting headers. If I don't do this, some websites will just turn my courier away at the door.
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
session = requests.Session()
retry_strategy = Retry(
    total=3,
    backoff_factor=1,
    status_forcelist=[429, 500, 502, 503, 504]
)
adapter = HTTPAdapter(max_retries=retry_strategy)
session.mount("http://", adapter)
session.mount("https://", adapter)
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
}
session.headers.update(headers)
try:
    response = session.get('https://example.com/products', timeout=10)
    response.raise_for_status()
    html_content = response.text
    print("Got the page!")
except requests.exceptions.RequestException as e:
    print(f"It didn't work: {e}")
Now I have the raw HTML. It's a mess of tags and text. To make sense of it, I need a parser. This is where BeautifulSoup comes in. I feed it the HTML, and it gives me a structured map of the page. I can then ask it to find specific things, like all the product titles or the main article text. The key is to be specific in your questions. Don't just say "find a price"; tell it to look for a <span> with the class "price".
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_content, 'lxml')
product_cards = soup.select('div.product-card')
for card in product_cards:
    title_element = card.select_one('h2.title')
    title = title_element.text.strip() if title_element else 'No Title'
    price_element = card.find('span', class_='price')
    price = price_element.text.strip() if price_element else 'N/A'
    print(f"{title}: {price}")
This works perfectly for about half the websites I visit. The other half look completely empty when my courier brings back the page. Why? Because modern websites often use JavaScript to build their content after the page loads. My simple request got the skeleton, but not the flesh. For these, I need a different tool: a browser simulator. I use Playwright. It controls a real browser (like Chrome) in the background, loads the page, lets all the JavaScript run, and then gives me the complete HTML.
It feels like magic. I tell it to go to a page, wait for a specific element to appear, and even scroll down to trigger lazy-loaded images. Then, I can extract data from the fully-formed page.
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com/dynamic-dashboard')
    page.wait_for_selector('.data-table', timeout=10000)
    rows = page.query_selector_all('table tr')
    for row in rows:
        cells = row.query_selector_all('td')
        if cells:
            cell_text = [cell.inner_text() for cell in cells]
            print(cell_text)
    browser.close()
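The scrolling trick I mentioned is worth a quick look on its own, since lazy-loaded pages only reveal their content as you move down them. Here's a minimal sketch of how I'd nudge the page along with Playwright; the URL, the scroll count, and the image selector are placeholders, not a recipe for any particular site.

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto('https://example.com/infinite-gallery')  # hypothetical lazy-loading page
    for _ in range(5):
        page.mouse.wheel(0, 2000)       # scroll down roughly two screens
        page.wait_for_timeout(1000)     # pause so new items have time to load
    images = page.query_selector_all('img[src]')
    print(f"Loaded {len(images)} images after scrolling.")
    browser.close()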
Once I have data, I need to put it somewhere. Saving it to a file is fine for a one-time job, but for ongoing projects, I use a database. SQLAlchemy helps me talk to databases in a clean, organized way. I define what my data looks like—a product has a title, a price, a URL—and SQLAlchemy handles creating the tables and storing the information. This makes it easy to avoid duplicates and query the data later.
from sqlalchemy import create_engine, Column, Integer, String, DateTime
from sqlalchemy.ext.declarative import declarative_base
from sqlalchemy.orm import sessionmaker
from datetime import datetime
Base = declarative_base()
class ScrapedArticle(Base):
    __tablename__ = 'articles'
    id = Column(Integer, primary_key=True)
    headline = Column(String(500))
    url = Column(String(1000), unique=True)
    published_date = Column(DateTime)
    saved_at = Column(DateTime, default=datetime.utcnow)
engine = create_engine('sqlite:///scraped_data.db')
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
db_session = Session()
new_article = ScrapedArticle(
    headline='The Future of Web Data',
    url='https://example.com/article/123',
    published_date=datetime(2023, 10, 27)
)
db_session.add(new_article)
db_session.commit()
print("Article saved to database.")
If I start scraping a website too quickly, I become a nuisance. The site might slow down or block me entirely. Being polite is not just ethical; it's practical. I build delays into my code. I space out my requests, and I never hammer a server with dozens of requests per second. I often mimic human reading speed, with a random pause between actions. This keeps my scraper running smoothly for days or weeks without getting shut out.
import time
import random
def polite_request(url, session):
    time.sleep(random.uniform(1.0, 3.0))
    try:
        response = session.get(url, timeout=10)
        return response
    except Exception as e:
        print(f"Error on {url}: {e}")
        return None

urls_to_scrape = ['https://example.com/page1', 'https://example.com/page2']
for url in urls_to_scrape:
    print(f"Fetching {url}...")
    resp = polite_request(url, session)
    if resp:
        print(f"Got {len(resp.text)} characters.")
Even when I'm polite, some sites are designed to block scrapers. They track how many requests come from a single IP address. If the number is too high, they block that IP. The solution is to use multiple IP addresses. This is called proxy rotation. I have a list of proxy servers, and my scraper rotates through them, sending each new request from a different address. It's like wearing a series of different disguises.
import itertools
class ProxyRotator:
    def __init__(self, proxies):
        self.proxies = proxies
        self.proxy_cycle = itertools.cycle(proxies)

    def get_session_with_proxy(self):
        proxy = next(self.proxy_cycle)
        session = requests.Session()
        session.proxies = {"http": proxy, "https": proxy}
        return session, proxy

proxy_list = [
    'http://proxy1.example.com:8080',
    'http://proxy2.example.com:8080',
]
rotator = ProxyRotator(proxy_list)
for i in range(5):
    current_session, current_proxy = rotator.get_session_with_proxy()
    print(f"Attempt {i+1} using proxy: {current_proxy}")
    try:
        resp = current_session.get('https://httpbin.org/ip', timeout=8)
        print(f"Success. Appeared as: {resp.json()['origin']}")
    except requests.exceptions.RequestException:
        print("This proxy failed.")
Websites change. The CSS class name for a price today might be different tomorrow. If my scraper only looks for one specific thing, it will break. I build resilience by giving my parsers multiple ways to find the same data. I might tell it to look for a price in three different places. If the first selector doesn't work, it tries the second, then the third. I also use more flexible tools like XPath, which can find elements based on their position in the document, not just their class name.
from lxml import html
def robust_extraction(page_html):
    tree = html.fromstring(page_html)
    price = None
    possible_selectors = [
        '//span[@class="sale-price"]/text()',
        '//div[@data-testid="price"]/text()',
        '//meta[@property="product:price"]/@content'
    ]
    for selector in possible_selectors:
        result = tree.xpath(selector)
        if result:
            price = result[0]
            break
    return price
html_with_price = '<meta property="product:price" content="49.99" />'
print(f"Found price: {robust_extraction(html_with_price)}")
Sometimes the data isn't in nice, clean HTML tags. It might be buried inside a script tag as a JSON object, or it might be part of the page's text in an irregular format. For these situations, I use regular expressions. They are like powerful pattern-matching tools. I can say, "find me any sequence of numbers that looks like a dollar amount," and it will. I use them sparingly because they can be fragile, but for certain tasks, they are the perfect tool.
import re
text_blob = """
Our special offers: iPhone $999.99, Coffee maker €45.50,
and the famous widget for only 129.00 GBP.
"""
dollar_pattern = r'\$(\d+\.\d{2})'
euro_pattern = r'€(\d+\.\d{2})'
dollar_prices = re.findall(dollar_pattern, text_blob)
euro_prices = re.findall(euro_pattern, text_blob)
print(f"Dollar prices: {dollar_prices}")
print(f"Euro prices: {euro_prices}")
Bringing it all together is the final step. A real-world scraper isn't just one technique; it's several of them combined. I might use requests for simple pages and Playwright for complex ones. I'll always add delays and use proxies for large jobs. Every piece of data goes through a robust parser with fallback options and gets stored in a database. I write logs so I know what succeeded and what failed. The goal is to create a system that works reliably without constant supervision.
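To make that concrete, here's a rough sketch of how those pieces might hang together in one loop. The helper names (fetch_static, fetch_rendered, parse_product) are stand-ins for the functions built throughout this article, so treat this as a shape to copy rather than a finished framework:

import logging
import random
import time

logging.basicConfig(filename='scraper.log', level=logging.INFO)

def run_scraper(urls, needs_javascript=False):
    results = []
    for url in urls:
        time.sleep(random.uniform(1.0, 3.0))  # stay polite between requests
        try:
            # Hypothetical helpers: fetch_rendered() wraps Playwright,
            # fetch_static() wraps the requests session with retries and proxies
            page_html = fetch_rendered(url) if needs_javascript else fetch_static(url)
            record = parse_product(page_html)  # robust parser with fallback selectors
            if record:
                results.append(record)
                logging.info("OK: %s", url)
            else:
                logging.warning("No data extracted: %s", url)
        except Exception as e:
            logging.error("Failed %s: %s", url, e)
    return results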
Web scraping is a skill built on understanding how websites work and how to work with them respectfully. It's part programming, part problem-solving, and part diplomacy. These eight techniques form a strong foundation. Start with simple requests and parsing. Add browser automation for tricky sites. Always be polite and resilient. Store your results properly. With these tools, you can turn the vast amount of data on the web into a structured resource for your projects. Just remember to always check a website's terms of service and to use the data you collect responsibly.