When I started scraping with Scrapy, I had no idea whether to use CSS selectors or XPath. I'd see tutorials using CSS, then others using XPath, and I'd think "Which one should I actually use?"
I wasted weeks bouncing between both, never really mastering either. I'd copy-paste selectors from tutorials without understanding why they worked.
Then I learned the truth: they're both good. But they're good at different things.
Let me save you the confusion I went through. I'll show you what CSS and XPath actually are, when to use each one, and the tricks nobody tells beginners.
What Are Selectors, Really?
Think of a webpage like a massive book. You want to find a specific sentence on page 47. How do you tell someone where to look?
You could say:
- "Go to chapter 3, section 2, paragraph 5" (like XPath)
- "Find the paragraph with the red text" (like CSS)
Both get you to the same sentence. Just different approaches.
In Scrapy, selectors are how you tell your spider where to find data. Instead of saying "get me the price," you say "get me the text inside the span with class 'price'."
CSS Selectors: The Easy One
CSS selectors are patterns used in web design to style elements. You're probably already familiar with them if you know any CSS.
The Basics
# Get the title
response.css('title::text').get()
# Get all paragraph texts
response.css('p::text').getall()
# Get element with specific class
response.css('.product-name::text').get()
# Get element with specific ID
response.css('#price::text').get()
# Get attribute value
response.css('a::attr(href)').get()
See the pattern? It looks like CSS because it mostly is CSS. (One caveat: `::text` and `::attr()` are Scrapy extensions for extracting data, not standard CSS — real CSS has no concept of "give me the text." Everything else is the genuine article.)
Real Example
Let's scrape a product page:
import scrapy
class ProductSpider(scrapy.Spider):
name = 'product'
start_urls = ['https://example.com/products']
def parse(self, response):
for product in response.css('.product'):
yield {
'name': product.css('h2.title::text').get(),
'price': product.css('span.price::text').get(),
'rating': product.css('.rating::text').get(),
'url': product.css('a::attr(href)').get()
}
Clean. Readable. Easy to understand.
XPath: The Powerful One
XPath is a query language for navigating XML and HTML. It's more verbose but way more powerful.
The Basics
# Get the title
response.xpath('//title/text()').get()
# Get all paragraph texts
response.xpath('//p/text()').getall()
# Get element with specific class
response.xpath('//span[@class="product-name"]/text()').get()
# Get element with specific ID
response.xpath('//div[@id="price"]/text()').get()
# Get attribute value
response.xpath('//a/@href').get()
Notice how it's more explicit? You're literally saying "give me all (//) title tags, then get their text (/text())."
Real Example
Same product page, but with XPath:
import scrapy
class ProductSpider(scrapy.Spider):
name = 'product'
start_urls = ['https://example.com/products']
def parse(self, response):
for product in response.xpath('//div[@class="product"]'):
yield {
'name': product.xpath('.//h2[@class="title"]/text()').get(),
'price': product.xpath('.//span[@class="price"]/text()').get(),
'rating': product.xpath('.//*[@class="rating"]/text()').get(),
'url': product.xpath('.//a/@href').get()
}
More verbose, but you'll see why this matters in complex scenarios.
Side-by-Side Comparison
Let's look at common tasks with both:
Task 1: Get All Links
CSS:
response.css('a::attr(href)').getall()
XPath:
response.xpath('//a/@href').getall()
CSS wins for simplicity here.
Task 2: Get Text from Div with Class "content"
CSS:
response.css('div.content::text').get()
XPath:
response.xpath('//div[@class="content"]/text()').get()
Still pretty even. CSS is shorter.
Task 3: Get the Parent of an Element
CSS:
# Can't do this easily with CSS!
XPath:
response.xpath('//span[@class="price"]/parent::div').get()
XPath wins. CSS can't navigate upward.
Task 4: Get Element by Text Content
CSS:
# Can't do this with CSS!
XPath:
response.xpath('//button[contains(text(), "Add to Cart")]').get()
XPath wins again. CSS can't select by text.
Task 5: Get the Third Item in a List
CSS:
response.css('li:nth-child(3)::text').get()
XPath:
response.xpath('//li[3]/text()').get()
Both work here, but they're not quite twins: `:nth-child(3)` matches an `li` that is the third child of any element type, while `//li[3]` matches the third `li` among its `li` siblings. With a clean list they agree; with other elements mixed in, they can diverge.
Benefits of CSS Selectors
Benefit #1: Easier to Learn
If you know any CSS, you already know CSS selectors. The syntax is familiar.
# This looks like CSS because it IS CSS
response.css('.product h2.title::text').get()
Benefit #2: Shorter Syntax
CSS is usually more concise:
# CSS
response.css('div.content p::text').getall()
# XPath equivalent
response.xpath('//div[@class="content"]//p/text()').getall()
Benefit #3: Better for Simple Tasks
When you just need to grab elements by class or ID, CSS is perfect:
# Simple and clean
response.css('#product-price::text').get()
response.css('.product-name::text').get()
Benefit #4: No Performance Penalty
You might expect a speed difference between the two, but there effectively isn't one: Scrapy compiles every CSS selector into an equivalent XPath expression (via the cssselect library) before running it, and the translation is cached. Both end up on the same engine, so choose based on readability, not speed.
Benefits of XPath
Benefit #1: Navigate Anywhere
XPath can go up the tree (parents), down (children), and sideways in both directions (siblings). CSS can only go down, plus forward to later siblings with the `+` and `~` combinators — never up, never backward.
# Get the parent div of a price span
response.xpath('//span[@class="price"]/parent::div').get()
# Get the next sibling
response.xpath('//h2/following-sibling::p[1]').get()
# Get the previous sibling
response.xpath('//p/preceding-sibling::h2').get()
CSS can't reach a parent or a preceding sibling at all; the closest it gets is the forward-only `h2 + p` for the next-sibling case.
Benefit #2: Select by Text Content
This is huge. You can find elements based on what they say, not just their class or ID:
# Find button that says "Add to Cart"
response.xpath('//button[contains(text(), "Add to Cart")]').get()
# Find link with specific text
response.xpath('//a[text()="Next Page"]/@href').get()
# Find div containing specific text
response.xpath('//div[contains(., "Out of Stock")]').get()
Benefit #3: Complex Conditions
XPath supports logical operators:
# Element with class "product" AND data-type "electronics"
response.xpath('//div[@class="product" and @data-type="electronics"]').get()
# Elements with class "item" OR class "product"
response.xpath('//div[@class="item" or @class="product"]').getall()
Benefit #4: More Powerful Filtering
XPath has functions CSS doesn't:
# Get products where price is less than 50 (if price is in text)
response.xpath('//span[number(translate(text(), "$", "")) < 50]').getall()
# Get elements that start with specific text
response.xpath('//h2[starts-with(text(), "Product:")]').getall()
# Case-insensitive matching
response.xpath('//div[contains(translate(@class, "PRODUCT", "product"), "product")]').get()
Drawbacks and Limitations
CSS Limitations
- Can't go up the tree
If you need the parent element, CSS can't help. You'll need XPath.
- Can't select by text
CSS can't find elements based on their content. Only structure and attributes.
- Less powerful filtering
Complex conditions are difficult or impossible with CSS.
- Limited logic
CSS can AND classes together (`.item.product`) and OR whole selectors with a comma (`.item, .product`), but there's nothing like XPath's free-form `and`/`or`/`not()` over arbitrary attributes and text.
XPath Limitations
- Steeper learning curve
XPath syntax is less intuitive if you're coming from web design.
- More verbose
For simple tasks, XPath is longer than CSS:
# CSS
response.css('.price::text').get()
# XPath
response.xpath('//span[@class="price"]/text()').get()
- Harder to read
XPath expressions can get messy fast:
response.xpath('//div[@class="product"]/descendant::span[contains(@class, "price") and not(contains(@class, "old-price"))]/text()').get()
When to Use CSS Selectors
Use CSS when:
1. The HTML is well-structured
If elements have nice classes and IDs, CSS is perfect:
response.css('#main-content .article-title::text').get()
2. You're doing simple selections
Getting elements by class, ID, or tag? CSS is cleaner:
response.css('div.product::text').getall()
3. You're comfortable with CSS
If you already know CSS from web development, stick with what you know.
4. You care about code readability
CSS is usually easier for others to understand:
# Easy to read
for product in response.css('.product'):
name = product.css('h2::text').get()
When to Use XPath
Use XPath when:
1. You need to navigate up the tree
Finding parents or ancestors:
# Get the parent div of a button
response.xpath('//button[@class="submit"]/parent::div').get()
2. You need to select by text content
Finding elements based on what they say:
# Find the "Next" link
response.xpath('//a[text()="Next"]/@href').get()
3. The HTML structure is messy
When elements don't have good classes or IDs:
# Find the third span inside the div with id "content"
response.xpath('//div[@id="content"]/span[3]/text()').get()
4. You need complex conditions
Multiple conditions or logical operators:
# Products that are in-stock AND discounted
response.xpath('//div[@class="product" and @data-stock="true" and @data-sale="true"]').getall()
5. You need sibling navigation
Getting previous or next elements:
# Get the paragraph right after an h2
response.xpath('//h2/following-sibling::p[1]').get()
Mixing CSS and XPath (The Secret Weapon)
Here's something most beginners don't know: you can combine them!
# Start with CSS (clean and easy)
product = response.css('div.product')
# Switch to XPath for complex navigation
price = product.xpath('.//span[contains(text(), "$")]/text()').get()
This is incredibly powerful. Use CSS for the easy stuff, then drop into XPath when you need to.
Real Example
def parse(self, response):
# Use CSS to get all products (clean)
for product in response.css('.product-card'):
# Use XPath to find the price span by its text pattern
price = product.xpath('.//span[contains(text(), "$")]/text()').get()
# Back to CSS for simple selections
name = product.css('h2.product-title::text').get()
# XPath again for complex conditions
in_stock = product.xpath('.//span[@class="stock" and text()="In Stock"]').get()
yield {
'name': name,
'price': price,
'in_stock': bool(in_stock)
}
Practical Examples
Example 1: Scraping a Blog
HTML:
<article class="post">
<h2 class="title">My Blog Post</h2>
<span class="author">John Doe</span>
<div class="content">
<p>First paragraph...</p>
<p>Second paragraph...</p>
</div>
<a href="/read-more">Read More</a>
</article>
With CSS:
def parse(self, response):
for post in response.css('article.post'):
yield {
'title': post.css('h2.title::text').get(),
'author': post.css('span.author::text').get(),
'paragraphs': post.css('div.content p::text').getall(),
'link': post.css('a::attr(href)').get()
}
With XPath:
def parse(self, response):
for post in response.xpath('//article[@class="post"]'):
yield {
'title': post.xpath('.//h2[@class="title"]/text()').get(),
'author': post.xpath('.//span[@class="author"]/text()').get(),
'paragraphs': post.xpath('.//div[@class="content"]/p/text()').getall(),
'link': post.xpath('.//a/@href').get()
}
Both work fine. CSS is shorter and cleaner here.
Example 2: Complex Product Page
HTML:
<div class="product">
<h2>Product Name</h2>
<div class="details">
<span>Price: $29.99</span>
<span>Old Price: $39.99</span>
<button>Add to Cart</button>
</div>
</div>
CSS (struggles here):
# Fragile: grabs whichever span happens to come first in the markup
price = response.css('.details span::text').get()  # order-dependent, might be either price
XPath (handles it easily):
# Get only the span containing "Price:"
current_price = response.xpath('//span[contains(text(), "Price:")]/text()').get()
# Or, scoped to the details block: the span that is NOT the old price
current_price = response.xpath('//div[@class="details"]/span[not(contains(text(), "Old"))]/text()').get()
XPath wins when you need to filter by content.
Example 3: Nested Data
HTML:
<div class="container">
<div class="row">
<h3>Electronics</h3>
<div class="item">Laptop</div>
<div class="item">Phone</div>
</div>
<div class="row">
<h3>Books</h3>
<div class="item">Novel</div>
<div class="item">Magazine</div>
</div>
</div>
CSS:
# Get all categories
for row in response.css('div.row'):
category = row.css('h3::text').get()
items = row.css('div.item::text').getall()
yield {'category': category, 'items': items}
XPath:
# Same thing
for row in response.xpath('//div[@class="row"]'):
category = row.xpath('./h3/text()').get()
items = row.xpath('.//div[@class="item"]/text()').getall()
yield {'category': category, 'items': items}
Both work equally well here.
Common Mistakes
Mistake #1: Forgetting ::text or /text()
# WRONG (returns the whole <h1> element as an HTML string, not just its text)
response.css('h1').get()
# RIGHT
response.css('h1::text').get()
# Or with XPath
response.xpath('//h1/text()').get()
Mistake #2: Not Using .get() or .getall()
# WRONG (returns SelectorList)
titles = response.css('h2::text')
# RIGHT (returns actual text)
titles = response.css('h2::text').getall()
Mistake #3: Absolute XPath Paths
# WRONG (breaks if HTML structure changes)
response.xpath('/html/body/div[1]/div[2]/span/text()').get()
# RIGHT (more flexible)
response.xpath('//span[@class="price"]/text()').get()
Mistake #4: Not Using Relative Paths in Loops
# WRONG (searches entire document every time)
for product in response.css('.product'):
name = response.css('.title::text').get() # Gets first title on entire page!
# RIGHT (searches within current product)
for product in response.css('.product'):
name = product.css('.title::text').get() # Gets title within this product
Testing Your Selectors
Before writing your spider, test selectors in Scrapy shell:
scrapy shell "https://example.com"
Then try both:
# Test CSS
>>> response.css('.product-name::text').get()
'Product Name'
# Test XPath
>>> response.xpath('//span[@class="product-name"]/text()').get()
'Product Name'
# See all results
>>> response.css('.product-name::text').getall()
['Product 1', 'Product 2', 'Product 3']
This saves hours of debugging!
My Recommendation
Start with CSS. It's easier to learn and covers 80% of use cases.
When you hit a wall (can't navigate up, can't select by text), switch to XPath for that specific part.
Mix them freely:
def parse(self, response):
# CSS for structure
for product in response.css('.product'):
# XPath for complex text matching
price = product.xpath('.//span[contains(text(), "$")]/text()').get()
# Back to CSS for simple stuff
name = product.css('h2::text').get()
yield {'name': name, 'price': price}
Quick Reference
CSS Cheat Sheet
# By tag
response.css('div')
# By class
response.css('.classname')
# By ID
response.css('#idname')
# Get text
response.css('h1::text').get()
# Get attribute
response.css('a::attr(href)').get()
# Multiple classes
response.css('.class1.class2')
# Child selector
response.css('div > p')
# Descendant selector
response.css('div p')
# First child
response.css('li:first-child')
# Nth child
response.css('li:nth-child(3)')
XPath Cheat Sheet
# By tag
response.xpath('//div')
# By class
response.xpath('//div[@class="classname"]')
# By ID
response.xpath('//div[@id="idname"]')
# Get text
response.xpath('//h1/text()').get()
# Get attribute
response.xpath('//a/@href').get()
# Contains text
response.xpath('//button[contains(text(), "Click")]')
# Parent
response.xpath('//span/parent::div')
# Following sibling
response.xpath('//h2/following-sibling::p[1]')
# Preceding sibling
response.xpath('//p/preceding-sibling::h2')
# Multiple conditions (AND)
response.xpath('//div[@class="product" and @data-type="book"]')
# Position
response.xpath('//li[3]')
Summary
CSS Selectors:
- Easier to learn
- Shorter syntax
- Perfect for simple tasks
- Can't navigate up or select by text
XPath:
- More powerful
- Can navigate anywhere
- Can select by text content
- More verbose
When to use what:
- Start with CSS for simple selections
- Switch to XPath when you need power
- Mix them freely for best results
Don't stress about choosing one. Learn both basics, then use whichever makes sense for each situation.
Happy scraping! 🕷️