DEV Community

Muhammad Ikramullah Khan


CSS Selectors vs XPath in Scrapy: The Complete Beginner's Guide

When I started scraping with Scrapy, I had no idea whether to use CSS selectors or XPath. I'd see tutorials using CSS, then others using XPath, and I'd think "Which one should I actually use?"

I wasted weeks bouncing between both, never really mastering either. I'd copy-paste selectors from tutorials without understanding why they worked.

Then I learned the truth: they're both good. But they're good at different things.

Let me save you the confusion I went through. I'll show you what CSS and XPath actually are, when to use each one, and the tricks nobody teaches beginners.


What Are Selectors, Really?

Think of a webpage like a massive book. You want to find a specific sentence on page 47. How do you tell someone where to look?

You could say:

  • "Go to chapter 3, section 2, paragraph 5" (like XPath)
  • "Find the paragraph with the red text" (like CSS)

Both get you to the same sentence. Just different approaches.

In Scrapy, selectors are how you tell your spider where to find data. Instead of saying "get me the price," you say "get me the text inside the span with class 'price'."


CSS Selectors: The Easy One

CSS selectors are patterns used in web design to style elements. You're probably already familiar with them if you know any CSS.

The Basics

# Get the title
response.css('title::text').get()

# Get all paragraph texts
response.css('p::text').getall()

# Get element with specific class
response.css('.product-name::text').get()

# Get element with specific ID
response.css('#price::text').get()

# Get attribute value
response.css('a::attr(href)').get()

See the pattern? It looks like CSS. That's because it literally is CSS.

Real Example

Let's scrape a product page:

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'product'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.css('.product'):
            yield {
                'name': product.css('h2.title::text').get(),
                'price': product.css('span.price::text').get(),
                'rating': product.css('.rating::text').get(),
                'url': product.css('a::attr(href)').get()
            }

Clean. Readable. Easy to understand.


XPath: The Powerful One

XPath is a query language for navigating XML and HTML. It's more verbose but way more powerful.

The Basics

# Get the title
response.xpath('//title/text()').get()

# Get all paragraph texts
response.xpath('//p/text()').getall()

# Get element with specific class
response.xpath('//span[@class="product-name"]/text()').get()

# Get element with specific ID
response.xpath('//div[@id="price"]/text()').get()

# Get attribute value
response.xpath('//a/@href').get()

Notice how it's more explicit? You're literally saying "find title tags anywhere in the document (//), then get their text (/text())."

Real Example

Same product page, but with XPath:

import scrapy

class ProductSpider(scrapy.Spider):
    name = 'product'
    start_urls = ['https://example.com/products']

    def parse(self, response):
        for product in response.xpath('//div[@class="product"]'):
            yield {
                'name': product.xpath('.//h2[@class="title"]/text()').get(),
                'price': product.xpath('.//span[@class="price"]/text()').get(),
                'rating': product.xpath('.//*[@class="rating"]/text()').get(),
                'url': product.xpath('.//a/@href').get()
            }

More verbose, but you'll see why this matters in complex scenarios.


Side-by-Side Comparison

Let's look at common tasks with both:

Task 1: Get All Links

CSS:

response.css('a::attr(href)').getall()

XPath:

response.xpath('//a/@href').getall()

CSS wins for simplicity here.

Task 2: Get Text from Div with Class "content"

CSS:

response.css('div.content::text').get()

XPath:

response.xpath('//div[@class="content"]/text()').get()

Still pretty even. CSS is shorter.

Task 3: Get the Parent of an Element

CSS:

# Can't do this easily with CSS!

XPath:

response.xpath('//span[@class="price"]/parent::div').get()

XPath wins. CSS can't navigate upward.

Task 4: Get Element by Text Content

CSS:

# Can't do this with CSS!

XPath:

response.xpath('//button[contains(text(), "Add to Cart")]').get()

XPath wins again. CSS can't select by text.

Task 5: Get the Third Item in a List

CSS:

response.css('li:nth-child(3)::text').get()

XPath:

response.xpath('//li[3]/text()').get()

Both work. CSS uses :nth-child(), XPath uses [3]. One subtlety: both count within each parent element, so with several lists on the page, //li[3] matches the third item of every list, not the third item on the page.


Benefits of CSS Selectors

Benefit #1: Easier to Learn

If you know any CSS, you already know CSS selectors. The syntax is familiar.

# This looks like CSS because it IS CSS
response.css('.product h2.title::text').get()

Benefit #2: Shorter Syntax

CSS is usually more concise:

# CSS
response.css('div.content p::text').getall()

# XPath equivalent
response.xpath('//div[@class="content"]//p/text()').getall()

Benefit #3: Better for Simple Tasks

When you just need to grab elements by class or ID, CSS is perfect:

# Simple and clean
response.css('#product-price::text').get()
response.css('.product-name::text').get()

Benefit #4: No Performance Trade-off

You'll sometimes read that CSS is faster (or slower) than XPath. In Scrapy it's a wash: CSS selectors are translated into equivalent XPath expressions under the hood (via the cssselect library), so the two perform essentially identically. Pick whichever reads better.


Benefits of XPath

Benefit #1: Navigate Anywhere

XPath can go up the tree (parent), down (children), and sideways in both directions (siblings). CSS can only go down and, with the + and ~ combinators, forward to following siblings.

# Get the parent div of a price span
response.xpath('//span[@class="price"]/parent::div').get()

# Get the next sibling
response.xpath('//h2/following-sibling::p[1]').get()

# Get the previous sibling
response.xpath('//p/preceding-sibling::h2').get()

CSS can't reach parents or preceding siblings; its + and ~ combinators only look forward.

Benefit #2: Select by Text Content

This is huge. You can find elements based on what they say, not just their class or ID:

# Find button that says "Add to Cart"
response.xpath('//button[contains(text(), "Add to Cart")]').get()

# Find link with specific text
response.xpath('//a[text()="Next Page"]/@href').get()

# Find div containing specific text
response.xpath('//div[contains(., "Out of Stock")]').get()

Benefit #3: Complex Conditions

XPath supports logical operators:

# Element with class "product" AND data-type "electronics"
response.xpath('//div[@class="product" and @data-type="electronics"]').get()

# Elements with class "item" OR class "product"
response.xpath('//div[@class="item" or @class="product"]').getall()

Benefit #4: More Powerful Filtering

XPath has functions CSS doesn't:

# Get products where price is less than 50 (if price is in text)
response.xpath('//span[number(translate(text(), "$", "")) < 50]').getall()

# Get elements that start with specific text
response.xpath('//h2[starts-with(text(), "Product:")]').getall()

# Case-insensitive matching (translate() maps only the letters you list)
response.xpath('//div[contains(translate(@class, "PRODUCT", "product"), "product")]').get()

Drawbacks and Limitations

CSS Limitations

  1. Can't go up the tree

If you need the parent element, CSS can't help. You'll need XPath.

  2. Can't select by text

CSS can't find elements based on their content. Only structure and attributes.

  3. Less powerful filtering

Complex conditions are difficult or impossible with CSS.

  4. No logical operators

You can't do "AND" or "OR" conditions easily.

XPath Limitations

  1. Steeper learning curve

XPath syntax is less intuitive if you're coming from web design.

  2. More verbose

For simple tasks, XPath is longer than CSS:

   # CSS
   response.css('.price::text').get()

   # XPath
   response.xpath('//span[@class="price"]/text()').get()
  3. Harder to read

XPath expressions can get messy fast:

   response.xpath('//div[@class="product"]/descendant::span[contains(@class, "price") and not(contains(@class, "old-price"))]/text()').get()

When to Use CSS Selectors

Use CSS when:

1. The HTML is well-structured

If elements have nice classes and IDs, CSS is perfect:

response.css('#main-content .article-title::text').get()

2. You're doing simple selections

Getting elements by class, ID, or tag? CSS is cleaner:

response.css('div.product::text').getall()

3. You're comfortable with CSS

If you already know CSS from web development, stick with what you know.

4. You care about code readability

CSS is usually easier for others to understand:

# Easy to read
for product in response.css('.product'):
    name = product.css('h2::text').get()

When to Use XPath

Use XPath when:

1. You need to navigate up the tree

Finding parents or ancestors:

# Get the parent div of a button
response.xpath('//button[@class="submit"]/parent::div').get()

2. You need to select by text content

Finding elements based on what they say:

# Find the "Next" link
response.xpath('//a[text()="Next"]/@href').get()

3. The HTML structure is messy

When elements don't have good classes or IDs:

# Find span that's 3rd child of div with id "content"
response.xpath('//div[@id="content"]/span[3]/text()').get()

4. You need complex conditions

Multiple conditions or logical operators:

# Products that are in-stock AND discounted
response.xpath('//div[@class="product" and @data-stock="true" and @data-sale="true"]').getall()

5. You need sibling navigation

Getting previous or next elements:

# Get the paragraph right after an h2
response.xpath('//h2/following-sibling::p[1]').get()
Enter fullscreen mode Exit fullscreen mode

Mixing CSS and XPath (The Secret Weapon)

Here's something most beginners don't know: you can combine them!

# Start with CSS (clean and easy)
product = response.css('div.product')

# Switch to XPath for complex navigation
price = product.xpath('.//span[contains(text(), "$")]/text()').get()

This is incredibly powerful. Use CSS for the easy stuff, then drop into XPath when you need to.

Real Example

def parse(self, response):
    # Use CSS to get all products (clean)
    for product in response.css('.product-card'):
        # Use XPath to find the price span by its text pattern
        price = product.xpath('.//span[contains(text(), "$")]/text()').get()

        # Back to CSS for simple selections
        name = product.css('h2.product-title::text').get()

        # XPath again for complex conditions
        in_stock = product.xpath('.//span[@class="stock" and text()="In Stock"]').get()

        yield {
            'name': name,
            'price': price,
            'in_stock': bool(in_stock)
        }

Practical Examples

Example 1: Scraping a Blog

HTML:

<article class="post">
    <h2 class="title">My Blog Post</h2>
    <span class="author">John Doe</span>
    <div class="content">
        <p>First paragraph...</p>
        <p>Second paragraph...</p>
    </div>
    <a href="/read-more">Read More</a>
</article>

With CSS:

def parse(self, response):
    for post in response.css('article.post'):
        yield {
            'title': post.css('h2.title::text').get(),
            'author': post.css('span.author::text').get(),
            'paragraphs': post.css('div.content p::text').getall(),
            'link': post.css('a::attr(href)').get()
        }

With XPath:

def parse(self, response):
    for post in response.xpath('//article[@class="post"]'):
        yield {
            'title': post.xpath('.//h2[@class="title"]/text()').get(),
            'author': post.xpath('.//span[@class="author"]/text()').get(),
            'paragraphs': post.xpath('.//div[@class="content"]/p/text()').getall(),
            'link': post.xpath('.//a/@href').get()
        }

Both work fine. CSS is shorter and cleaner here.

Example 2: Complex Product Page

HTML:

<div class="product">
    <h2>Product Name</h2>
    <div class="details">
        <span>Price: $29.99</span>
        <span>Old Price: $39.99</span>
        <button>Add to Cart</button>
    </div>
</div>

CSS (struggles here):

# Hard to get just the current price instead of the old price
price = response.css('.details span::text').get()  # Gets the first span; breaks if the order changes

XPath (handles it easily):

# Get the span whose text starts with "Price:"
# (contains(text(), "Price:") would also match "Old Price: $39.99")
current_price = response.xpath('//span[starts-with(text(), "Price:")]/text()').get()

# Or get the first span inside the details div that's NOT the old price
current_price = response.xpath('//div[@class="details"]/span[not(contains(text(), "Old"))]/text()').get()

XPath wins when you need to filter by content.

Example 3: Nested Data

HTML:

<div class="container">
    <div class="row">
        <h3>Electronics</h3>
        <div class="item">Laptop</div>
        <div class="item">Phone</div>
    </div>
    <div class="row">
        <h3>Books</h3>
        <div class="item">Novel</div>
        <div class="item">Magazine</div>
    </div>
</div>

CSS:

# Get all categories
for row in response.css('div.row'):
    category = row.css('h3::text').get()
    items = row.css('div.item::text').getall()
    yield {'category': category, 'items': items}

XPath:

# Same thing
for row in response.xpath('//div[@class="row"]'):
    category = row.xpath('./h3/text()').get()
    items = row.xpath('.//div[@class="item"]/text()').getall()
    yield {'category': category, 'items': items}

Both work equally well here.


Common Mistakes

Mistake #1: Forgetting ::text or /text()

# WRONG (returns the whole element as an HTML string, like '<h1>Title</h1>')
response.css('h1').get()

# RIGHT
response.css('h1::text').get()

# Or with XPath
response.xpath('//h1/text()').get()

Mistake #2: Not Using .get() or .getall()

# WRONG (returns SelectorList)
titles = response.css('h2::text')

# RIGHT (returns actual text)
titles = response.css('h2::text').getall()

Mistake #3: Absolute XPath Paths

# WRONG (breaks if HTML structure changes)
response.xpath('/html/body/div[1]/div[2]/span/text()').get()

# RIGHT (more flexible)
response.xpath('//span[@class="price"]/text()').get()

Mistake #4: Not Using Relative Paths in Loops

# WRONG (searches entire document every time)
for product in response.css('.product'):
    name = response.css('.title::text').get()  # Gets first title on entire page!

# RIGHT (searches within current product)
for product in response.css('.product'):
    name = product.css('.title::text').get()  # Gets title within this product

Testing Your Selectors

Before writing your spider, test selectors in Scrapy shell:

scrapy shell "https://example.com"

Then try both:

# Test CSS
>>> response.css('.product-name::text').get()
'Product Name'

# Test XPath
>>> response.xpath('//span[@class="product-name"]/text()').get()
'Product Name'

# See all results
>>> response.css('.product-name::text').getall()
['Product 1', 'Product 2', 'Product 3']

This saves hours of debugging!


My Recommendation

Start with CSS. It's easier to learn and covers 80% of use cases.

When you hit a wall (can't navigate up, can't select by text), switch to XPath for that specific part.

Mix them freely:

def parse(self, response):
    # CSS for structure
    for product in response.css('.product'):
        # XPath for complex text matching
        price = product.xpath('.//span[contains(text(), "$")]/text()').get()

        # Back to CSS for simple stuff
        name = product.css('h2::text').get()

        yield {'name': name, 'price': price}

Quick Reference

CSS Cheat Sheet

# By tag
response.css('div')

# By class
response.css('.classname')

# By ID
response.css('#idname')

# Get text
response.css('h1::text').get()

# Get attribute
response.css('a::attr(href)').get()

# Multiple classes
response.css('.class1.class2')

# Child selector
response.css('div > p')

# Descendant selector
response.css('div p')

# First child
response.css('li:first-child')

# Nth child
response.css('li:nth-child(3)')

XPath Cheat Sheet

# By tag
response.xpath('//div')

# By class
response.xpath('//div[@class="classname"]')

# By ID
response.xpath('//div[@id="idname"]')

# Get text
response.xpath('//h1/text()').get()

# Get attribute
response.xpath('//a/@href').get()

# Contains text
response.xpath('//button[contains(text(), "Click")]')

# Parent
response.xpath('//span/parent::div')

# Following sibling
response.xpath('//h2/following-sibling::p[1]')

# Preceding sibling
response.xpath('//p/preceding-sibling::h2')

# Multiple conditions (AND)
response.xpath('//div[@class="product" and @data-type="book"]')

# Position
response.xpath('//li[3]')

Summary

CSS Selectors:

  • Easier to learn
  • Shorter syntax
  • Perfect for simple tasks
  • Can't navigate up or select by text

XPath:

  • More powerful
  • Can navigate anywhere
  • Can select by text content
  • More verbose

When to use what:

  • Start with CSS for simple selections
  • Switch to XPath when you need power
  • Mix them freely for best results

Don't stress about choosing one. Learn both basics, then use whichever makes sense for each situation.

Happy scraping! 🕷️
