When I started scraping with Scrapy, I had no idea whether to use CSS selectors or XPath. I'd see tutorials using CSS, then others using XPath, and I'd think "Which one should I actually use?"
I wasted weeks bouncing between both, never really mastering either. I'd copy-paste selectors from tutorials without understanding why they worked.
Then I learned the truth: they're both good. But they're good at different things.
Let me save you the confusion I went through. I'll show you what CSS and XPath actually are, when to use each one, and the tricks nobody tells beginners.
What Are Selectors, Really?
Think of a webpage like a massive book. You want to find a specific sentence on page 47. How do you tell someone where to look?
You could say:
- "Go to chapter 3, section 2, paragraph 5" (like XPath)
- "Find the paragraph with the red text" (like CSS)
Both get you to the same sentence. Just different approaches.
In Scrapy, selectors are how you tell your spider where to find data. Instead of saying "get me the price," you say "get me the text inside the span with class 'price'."
CSS Selectors: The Easy One
CSS selectors are patterns used in web design to style elements. You're probably already familiar with them if you know any CSS.
The Basics
# Get the title
response.css('title::text').get()
# Get all paragraph texts
response.css('p::text').getall()
# Get element with specific class
response.css('.product-name::text').get()
# Get element with specific ID
response.css('#price::text').get()
# Get attribute value
response.css('a::attr(href)').get()
See the pattern? It looks like CSS because it mostly is CSS. (One caveat: `::text` and `::attr()` are Scrapy extensions for extracting data, not standard CSS — real CSS has no concept of "give me the text." Everything else is the genuine article.)
Real Example
Let's scrape a product page:
import scrapy
class ProductSpider(scrapy.Spider):
name = 'product'
start_urls = ['https://example.com/products']
def parse(self, response):
for product in response.css('.product'):
yield {
'name': product.css('h2.title::text').get(),
'price': product.css('span.price::text').get(),
'rating': product.css('.rating::text').get(),
'url': product.css('a::attr(href)').get()
}
Clean. Readable. Easy to understand.
XPath: The Powerful One
XPath is a query language for navigating XML and HTML. It's more verbose but way more powerful.
The Basics
# Get the title
response.xpath('//title/text()').get()
# Get all paragraph texts
response.xpath('//p/text()').getall()
# Get element with specific class
response.xpath('//span[@class="product-name"]/text()').get()
# Get element with specific ID
response.xpath('//div[@id="price"]/text()').get()
# Get attribute value
response.xpath('//a/@href').get()
Notice how it's more explicit? You're literally saying "give me all (//) title tags, then get their text (/text())."
Real Example
Same product page, but with XPath:
import scrapy
class ProductSpider(scrapy.Spider):
name = 'product'
start_urls = ['https://example.com/products']
def parse(self, response):
for product in response.xpath('//div[@class="product"]'):
yield {
'name': product.xpath('.//h2[@class="title"]/text()').get(),
'price': product.xpath('.//span[@class="price"]/text()').get(),
'rating': product.xpath('.//*[@class="rating"]/text()').get(),
'url': product.xpath('.//a/@href').get()
}
More verbose, but you'll see why this matters in complex scenarios.
Side-by-Side Comparison
Let's look at common tasks with both:
Task 1: Get All Links
CSS:
response.css('a::attr(href)').getall()
XPath:
response.xpath('//a/@href').getall()
CSS wins for simplicity here.
Task 2: Get Text from Div with Class "content"
CSS:
response.css('div.content::text').get()
XPath:
response.xpath('//div[@class="content"]/text()').get()
Still pretty even. CSS is shorter.
Task 3: Get the Parent of an Element
CSS:
# Can't do this easily with CSS!
XPath:
response.xpath('//span[@class="price"]/parent::div').get()
XPath wins. CSS can't navigate upward.
Task 4: Get Element by Text Content
CSS:
# Can't do this with CSS!
XPath:
response.xpath('//button[contains(text(), "Add to Cart")]').get()
XPath wins again. CSS can't select by text.
Task 5: Get the Third Item in a List
CSS:
response.css('li:nth-child(3)::text').get()
XPath:
response.xpath('//li[3]/text()').get()
Both work here, but they're not quite twins: `:nth-child(3)` matches an `li` that is the third child of any element type, while `//li[3]` matches the third `li` among its `li` siblings. With a clean list they agree; with other elements mixed in, they can diverge.
Benefits of CSS Selectors
Benefit #1: Easier to Learn
If you know any CSS, you already know CSS selectors. The syntax is familiar.
# This looks like CSS because it IS CSS
response.css('.product h2.title::text').get()
Benefit #2: Shorter Syntax
CSS is usually more concise:
# CSS
response.css('div.content p::text').getall()
# XPath equivalent
response.xpath('//div[@class="content"]//p/text()').getall()
Benefit #3: Better for Simple Tasks
When you just need to grab elements by class or ID, CSS is perfect:
# Simple and clean
response.css('#product-price::text').get()
response.css('.product-name::text').get()
Benefit #4: No Performance Penalty
You might expect a speed difference between the two, but there effectively isn't one: Scrapy compiles every CSS selector into an equivalent XPath expression (via the cssselect library) before running it, and the translation is cached. Both end up on the same engine, so choose based on readability, not speed.
Benefits of XPath
Benefit #1: Navigate Anywhere
XPath can go up the tree (parents), down (children), and sideways in both directions (siblings). CSS can only go down, plus forward to later siblings with the `+` and `~` combinators — never up, never backward.
# Get the parent div of a price span
response.xpath('//span[@class="price"]/parent::div').get()
# Get the next sibling
response.xpath('//h2/following-sibling::p[1]').get()
# Get the previous sibling
response.xpath('//p/preceding-sibling::h2').get()
CSS can't reach a parent or a preceding sibling at all; the closest it gets is the forward-only `h2 + p` for the next-sibling case.
Benefit #2: Select by Text Content
This is huge. You can find elements based on what they say, not just their class or ID:
# Find button that says "Add to Cart"
response.xpath('//button[contains(text(), "Add to Cart")]').get()
# Find link with specific text
response.xpath('//a[text()="Next Page"]/@href').get()
# Find div containing specific text
response.xpath('//div[contains(., "Out of Stock")]').get()
Benefit #3: Complex Conditions
XPath supports logical operators:
# Element with class "product" AND data-type "electronics"
response.xpath('//div[@class="product" and @data-type="electronics"]').get()
# Elements with class "item" OR class "product"
response.xpath('//div[@class="item" or @class="product"]').getall()
Benefit #4: More Powerful Filtering
XPath has functions CSS doesn't:
# Get products where price is less than 50 (if price is in text)
response.xpath('//span[number(translate(text(), "$", "")) < 50]').getall()
# Get elements that start with specific text
response.xpath('//h2[starts-with(text(), "Product:")]').getall()
# Case-insensitive matching
response.xpath('//div[contains(translate(@class, "PRODUCT", "product"), "product")]').get()
Drawbacks and Limitations
CSS Limitations
- Can't go up the tree
If you need the parent element, CSS can't help. You'll need XPath.
- Can't select by text
CSS can't find elements based on their content. Only structure and attributes.
- Less powerful filtering
Complex conditions are difficult or impossible with CSS.
- Limited logic
CSS can AND classes together (`.item.product`) and OR whole selectors with a comma (`.item, .product`), but there's nothing like XPath's free-form `and`/`or`/`not()` over arbitrary attributes and text.
XPath Limitations
- Steeper learning curve
XPath syntax is less intuitive if you're coming from web design.
- More verbose
For simple tasks, XPath is longer than CSS:
# CSS
response.css('.price::text').get()
# XPath
response.xpath('//span[@class="price"]/text()').get()
- Harder to read
XPath expressions can get messy fast:
response.xpath('//div[@class="product"]/descendant::span[contains(@class, "price") and not(contains(@class, "old-price"))]/text()').get()
When to Use CSS Selectors
Use CSS when:
1. The HTML is well-structured
If elements have nice classes and IDs, CSS is perfect:
response.css('#main-content .article-title::text').get()
2. You're doing simple selections
Getting elements by class, ID, or tag? CSS is cleaner:
response.css('div.product::text').getall()
3. You're comfortable with CSS
If you already know CSS from web development, stick with what you know.
4. You care about code readability
CSS is usually easier for others to understand:
# Easy to read
for product in response.css('.product'):
name = product.css('h2::text').get()
When to Use XPath
Use XPath when:
1. You need to navigate up the tree
Finding parents or ancestors:
# Get the parent div of a button
response.xpath('//button[@class="submit"]/parent::div').get()
2. You need to select by text content
Finding elements based on what they say:
# Find the "Next" link
response.xpath('//a[text()="Next"]/@href').get()
3. The HTML structure is messy
When elements don't have good classes or IDs:
# Find the third span inside the div with id "content"
response.xpath('//div[@id="content"]/span[3]/text()').get()
4. You need complex conditions
Multiple conditions or logical operators:
# Products that are in-stock AND discounted
response.xpath('//div[@class="product" and @data-stock="true" and @data-sale="true"]').getall()
5. You need sibling navigation
Getting previous or next elements:
# Get the paragraph right after an h2
response.xpath('//h2/following-sibling::p[1]').get()
Mixing CSS and XPath (The Secret Weapon)
Here's something most beginners don't know: you can combine them!
# Start with CSS (clean and easy)
product = response.css('div.product')
# Switch to XPath for complex navigation
price = product.xpath('.//span[contains(text(), "$")]/text()').get()
This is incredibly powerful. Use CSS for the easy stuff, then drop into XPath when you need to.
Real Example
def parse(self, response):
# Use CSS to get all products (clean)
for product in response.css('.product-card'):
# Use XPath to find the price span by its text pattern
price = product.xpath('.//span[contains(text(), "$")]/text()').get()
# Back to CSS for simple selections
name = product.css('h2.product-title::text').get()
# XPath again for complex conditions
in_stock = product.xpath('.//span[@class="stock" and text()="In Stock"]').get()
yield {
'name': name,
'price': price,
'in_stock': bool(in_stock)
}
Practical Examples
Example 1: Scraping a Blog
HTML:
<article class="post">
<h2 class="title">My Blog Post</h2>
<span class="author">John Doe</span>
<div class="content">
<p>First paragraph...</p>
<p>Second paragraph...</p>
</div>
<a href="/read-more">Read More</a>
</article>
With CSS:
def parse(self, response):
for post in response.css('article.post'):
yield {
'title': post.css('h2.title::text').get(),
'author': post.css('span.author::text').get(),
'paragraphs': post.css('div.content p::text').getall(),
'link': post.css('a::attr(href)').get()
}
With XPath:
def parse(self, response):
for post in response.xpath('//article[@class="post"]'):
yield {
'title': post.xpath('.//h2[@class="title"]/text()').get(),
'author': post.xpath('.//span[@class="author"]/text()').get(),
'paragraphs': post.xpath('.//div[@class="content"]/p/text()').getall(),
'link': post.xpath('.//a/@href').get()
}
Both work fine. CSS is shorter and cleaner here.
Example 2: Complex Product Page
HTML:
<div class="product">
<h2>Product Name</h2>
<div class="details">
<span>Price: $29.99</span>
<span>Old Price: $39.99</span>
<button>Add to Cart</button>
</div>
</div>
CSS (struggles here):
# Fragile: grabs whichever span happens to come first in the markup
price = response.css('.details span::text').get()  # order-dependent, might be either price
XPath (handles it easily):
# Get only the span containing "Price:"
current_price = response.xpath('//span[contains(text(), "Price:")]/text()').get()
# Or, scoped to the details block: the span that is NOT the old price
current_price = response.xpath('//div[@class="details"]/span[not(contains(text(), "Old"))]/text()').get()
XPath wins when you need to filter by content.
Example 3: Nested Data
HTML:
<div class="container">
<div class="row">
<h3>Electronics</h3>
<div class="item">Laptop</div>
<div class="item">Phone</div>
</div>
<div class="row">
<h3>Books</h3>
<div class="item">Novel</div>
<div class="item">Magazine</div>
</div>
</div>
CSS:
# Get all categories
for row in response.css('div.row'):
category = row.css('h3::text').get()
items = row.css('div.item::text').getall()
yield {'category': category, 'items': items}
XPath:
# Same thing
for row in response.xpath('//div[@class="row"]'):
category = row.xpath('./h3/text()').get()
items = row.xpath('.//div[@class="item"]/text()').getall()
yield {'category': category, 'items': items}
Both work equally well here.
Common Mistakes
Mistake #1: Forgetting ::text or /text()
# WRONG (returns the whole <h1> element as an HTML string, not just its text)
response.css('h1').get()
# RIGHT
response.css('h1::text').get()
# Or with XPath
response.xpath('//h1/text()').get()
Mistake #2: Not Using .get() or .getall()
# WRONG (returns SelectorList)
titles = response.css('h2::text')
# RIGHT (returns actual text)
titles = response.css('h2::text').getall()
Mistake #3: Absolute XPath Paths
# WRONG (breaks if HTML structure changes)
response.xpath('/html/body/div[1]/div[2]/span/text()').get()
# RIGHT (more flexible)
response.xpath('//span[@class="price"]/text()').get()
Mistake #4: Not Using Relative Paths in Loops
# WRONG (searches entire document every time)
for product in response.css('.product'):
name = response.css('.title::text').get() # Gets first title on entire page!
# RIGHT (searches within current product)
for product in response.css('.product'):
name = product.css('.title::text').get() # Gets title within this product
Testing Your Selectors
Before writing your spider, test selectors in Scrapy shell:
scrapy shell "https://example.com"
Then try both:
# Test CSS
>>> response.css('.product-name::text').get()
'Product Name'
# Test XPath
>>> response.xpath('//span[@class="product-name"]/text()').get()
'Product Name'
# See all results
>>> response.css('.product-name::text').getall()
['Product 1', 'Product 2', 'Product 3']
This saves hours of debugging!
My Recommendation
Start with CSS. It's easier to learn and covers 80% of use cases.
When you hit a wall (can't navigate up, can't select by text), switch to XPath for that specific part.
Mix them freely:
def parse(self, response):
# CSS for structure
for product in response.css('.product'):
# XPath for complex text matching
price = product.xpath('.//span[contains(text(), "$")]/text()').get()
# Back to CSS for simple stuff
name = product.css('h2::text').get()
yield {'name': name, 'price': price}
Quick Reference
CSS Cheat Sheet
# By tag
response.css('div')
# By class
response.css('.classname')
# By ID
response.css('#idname')
# Get text
response.css('h1::text').get()
# Get attribute
response.css('a::attr(href)').get()
# Multiple classes
response.css('.class1.class2')
# Child selector
response.css('div > p')
# Descendant selector
response.css('div p')
# First child
response.css('li:first-child')
# Nth child
response.css('li:nth-child(3)')
XPath Cheat Sheet
# By tag
response.xpath('//div')
# By class
response.xpath('//div[@class="classname"]')
# By ID
response.xpath('//div[@id="idname"]')
# Get text
response.xpath('//h1/text()').get()
# Get attribute
response.xpath('//a/@href').get()
# Contains text
response.xpath('//button[contains(text(), "Click")]')
# Parent
response.xpath('//span/parent::div')
# Following sibling
response.xpath('//h2/following-sibling::p[1]')
# Preceding sibling
response.xpath('//p/preceding-sibling::h2')
# Multiple conditions (AND)
response.xpath('//div[@class="product" and @data-type="book"]')
# Position
response.xpath('//li[3]')
Summary
CSS Selectors:
- Easier to learn
- Shorter syntax
- Perfect for simple tasks
- Can't navigate up or select by text
XPath:
- More powerful
- Can navigate anywhere
- Can select by text content
- More verbose
When to use what:
- Start with CSS for simple selections
- Switch to XPath when you need power
- Mix them freely for best results
Don't stress about choosing one. Learn both basics, then use whichever makes sense for each situation.
Happy scraping! 🕷️