Dmitriy Zub ☀️ for SerpApi

Posted on • Updated on • Originally published at serpapi.com

Web Scraping with CSS Selectors using Python

Intro

This blog post is ongoing with somewhat regular updates. It's about understanding CSS selectors when doing web scraping, and what tools might be handy to use.

At SerpApi we've encountered different types of selectors. Some of them are quite complex and include logic built from selectors like :has() and :not(), among others, and we want to share a little of the knowledge we gained while building our APIs.

We want to point out that this blog post is not a complete CSS selectors reference, but a mini-guided tour of frequently used and more advanced types of selectors and how to work with them while web scraping, with code examples.

Prerequisites

A basic familiarity with the bs4 library, or whatever HTML parser package/framework you're using, since the usage of CSS selectors doesn't differ much across languages, frameworks, and packages.

Install libraries:

pip install requests lxml beautifulsoup4

What are CSS selectors

CSS selectors are patterns used to match the element(s) you want to extract from an HTML page.

SelectorGadget

Let's start with the easy one, the SelectorGadget Chrome extension. This extension lets you quickly grab CSS selector(s) by clicking on the desired element in your browser, and it returns the matching CSS selector(s).

SelectorGadget is an open-source tool that makes CSS selector generation and discovery on complicated sites a breeze.

Use cases:

  • for web page scraping with tools such as Nokogiri and BeautifulSoup.
  • to generate jQuery selectors for dynamic sites.
  • as a tool to examine JavaScript-generated DOM structures.
  • as a tool to help you style only particular elements on the page with your stylesheets.
  • for selenium or phantomjs testing.

When using SelectorGadget, it highlights element(s) in:

  • yellow, which means that it's guessing what the user is looking for and may need additional clarification.
  • red, which excludes the element from the match selection.
  • green, which includes the element in the match selection.


Picking CSS Selectors by Hand

Since SelectorGadget isn't a magical all-around tool, sometimes it can't get the desired element. This happens when the website's HTML tree is not well structured, or if the site is rendered via JavaScript.

When this happens, we use the Elements tab in Dev Tools (F12 or CTRL+SHIFT+C on a keyboard) to locate and grab CSS selector(s) or HTML elements by their:

  • type selector: input (matches <input> elements)
  • class selector: .class
  • id selector: #id
  • attribute selector: [attribute]

Types of CSS Selectors

Type Selectors

✍Syntax: element_name

Type selectors match elements by node name. In other words, a type selector selects all elements of the given type within an HTML document.

soup.select('a')      # returns all <a> elements
soup.select('span')   # returns all <span> elements
soup.select('input')  # returns all <input> elements
soup.select('script') # returns all <script> elements

Class Selectors

✍Syntax: .class_name

Class selectors match elements based on the contents of their class attribute. It's like calling a class method PressF().when_playing_cod().

soup.select('.mt-5')                   # returns all elements with current .selector
soup.select('.crayons-avatar__image')  # returns all elements with current .selector
soup.select('.w3-btn')                 # returns all elements with current .selector

ID Selectors

✍Syntax: #id_value

ID selectors match an element based on the value of the element's id attribute. In order for the element to be selected, its id attribute must match exactly the value given in the selector.

soup.select('#eob_16')              # returns all elements with current #selector
soup.select('#notifications-link')  # returns all elements with current #selector
soup.select('#value_hover')         # returns all elements with current #selector
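The snippets above assume an existing soup object. As a minimal, self-contained sketch of type, class, and ID selectors working together, the example below parses a small piece of hypothetical HTML (the markup, class names, and id are made up for illustration). It uses Python's built-in "html.parser" so that only beautifulsoup4 is needed.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration
html = """
<div id="header">
  <a class="nav-link" href="/home">Home</a>
  <a class="nav-link active" href="/blog">Blog</a>
  <span class="badge">3</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

links = soup.select("a")                  # type selector: every <a> element
nav_links = soup.select(".nav-link")      # class selector: every element with class "nav-link"
active = soup.select(".nav-link.active")  # elements that carry both classes
header = soup.select_one("#header")       # ID selector: first element with id="header"

print([a["href"] for a in links])
print([a.text for a in active])
print(header.name)
```

Note that select() always returns a list (possibly empty), while select_one() returns the first match or None.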

Attribute Selectors

✍Syntax: [attribute=attribute_value] or [attribute]

Attribute selectors match elements based on the presence or value of a given attribute.

The only difference in syntax is that these selectors use square brackets [] instead of a dot (.) for a class, or a hash (or octothorpe) symbol (#) for an ID.

soup.select('[jscontroller="K6HGfd"]')         # returns all elements with current [selector]
soup.select('[data-ved="2ascASqwfaspoi_SA8"]') # returns all elements with current [selector]

# elements with an attribute name of data-id
soup.select('[data-id]')                       # returns all elements with current [selector]
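Besides presence and exact-value matching, attribute selectors can also match on part of a value. The sketch below runs on hypothetical HTML (the data-id and data-type attributes and the URLs are invented for illustration) and shows the presence, exact, "starts with" (^=), and "contains" (*=) forms.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration
html = """
<div data-id="101" data-type="organic"><a href="https://example.com/a">A</a></div>
<div data-id="102" data-type="ad"><a href="http://example.org/b">B</a></div>
<div><a href="/relative">C</a></div>
"""

soup = BeautifulSoup(html, "html.parser")

with_id = soup.select("[data-id]")                 # attribute is present, any value
organic = soup.select('[data-type="organic"]')     # attribute equals the value exactly
https_links = soup.select('a[href^="https://"]')   # value starts with "https://"
example_links = soup.select('a[href*="example"]')  # value contains "example"

print(len(with_id), len(organic))
print([a.text for a in https_links])
print([a.text for a in example_links])
```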

Selector List

✍Syntax: element, element, element, ...

A selector list selects all the nodes (elements) that match any of the listed selectors. From a web scraping perspective, this CSS selector is great (in my opinion) for handling different HTML layouts, because if at least one of the selectors is present, it will grab all elements that match it.

As an example from Google Search (carousel results), the HTML layout differs depending on the country the search is coming from.

[Screenshot: carousel layout when the country of the search is not the United States]

[Screenshot: carousel layout when the country of the search is set to the United States]

These two layouts translate to the following code snippet, which handles both:

# will return all elements either by one of these selectors
soup.select('#kp-wp-tab-Albums .PZPZlf, .keP9hb')
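To make the idea reproducible without a live Google page, here is a small sketch with two hypothetical layouts. The class names .old-title and .new-title and the helper extract_titles() are assumptions made for this example, not real Google selectors.

```python
from bs4 import BeautifulSoup

# Two hypothetical layouts of the same data (class names are made up)
html_layout_a = '<div class="old-title">First</div><div class="old-title">Second</div>'
html_layout_b = '<h3 class="new-title">First</h3><h3 class="new-title">Second</h3>'


def extract_titles(html):
    """Return titles from either layout using a single selector list."""
    soup = BeautifulSoup(html, "html.parser")
    # whichever selector is present in the page will produce the matches
    return [el.get_text(strip=True) for el in soup.select(".old-title, .new-title")]


titles_a = extract_titles(html_layout_a)
titles_b = extract_titles(html_layout_b)

print(titles_a)
print(titles_b)
```

The same function works for both layouts, so the calling code doesn't need to know which one was served.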

Descendant combinator

✍Syntax: selector1 selector2

The descendant combinator is represented by a single space ( ) character and combines two selectors such that elements matched by the second selector are selected if they have an ancestor (parent, parent's parent, parent's parent's parent, etc.) element matching the first selector.

soup.select('.NQyKp .REySof')   # dives inside .NQyKp -> dives again to .REySof and grabs data from it
soup.select('div cite.iUh30')   # dives inside div -> dives inside cite.iUh30 and grabs data from it
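# note: an ID selector can't start with an unescaped digit, so '#21Xy' would raise a syntax error
soup.select('span#Xy21 a.XZx2') # dives inside span#Xy21 -> dives inside a.XZx2 and grabs data from it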
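A short sketch on hypothetical HTML (the class names are made up) shows that the descendant combinator matches at any nesting depth. For contrast it also uses the child combinator (>), which is not covered above and only matches direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration
html = """
<div class="result">
  <div class="wrapper"><cite class="url">example.com</cite></div>
</div>
<cite class="url">outside.example</cite>
"""

soup = BeautifulSoup(html, "html.parser")

# descendant combinator: .url anywhere inside .result, at any depth
nested = soup.select(".result .url")

# child combinator: .url must be a direct child of .result (it isn't here)
direct = soup.select(".result > .url")

print([c.text for c in nested])
print(direct)
```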

Selector :nth-child()

✍Syntax: selector|element:nth-child()

The :nth-child() pseudo-class matches elements based on their position among a group of siblings.

soup.select('p.SacA1:nth-child(1)') # selects every p.SacA1 element that is the first child of its parent
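The argument of :nth-child() can be a number, a keyword (odd, even), or a formula such as 2n. A minimal sketch on a hypothetical list:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration
html = """
<ul>
  <li>one</li>
  <li>two</li>
  <li>three</li>
  <li>four</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

first = soup.select("li:nth-child(1)")           # the first child only
every_second = soup.select("li:nth-child(2n)")   # 2nd, 4th, ... children
odd_items = soup.select("li:nth-child(odd)")     # 1st, 3rd, ... children

print([li.text for li in first])
print([li.text for li in every_second])
print([li.text for li in odd_items])
```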

Selector :has()

✍Syntax: selector|element:has(selector|element)

:has() is a pseudo-class that matches parent element(s) containing at least one element that matches the given selector.

soup.select('p:has(.sA1Sg)') # selects every p element that contains an element with the .sA1Sg class
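As a sketch of how :has() can filter containers, the hypothetical product cards below (class names invented for illustration) are selected only when they contain a price. By default :has() looks at any descendant; prefixing the argument with > restricts it to direct children.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration
html = """
<div class="card"><h2>With price</h2><span class="price">$10</span></div>
<div class="card"><h2>No price</h2></div>
<div class="card"><h2>Nested</h2><p><span class="price">$5</span></p></div>
"""

soup = BeautifulSoup(html, "html.parser")

# cards that contain a .price element at any depth
with_price = soup.select(".card:has(.price)")

# cards where .price is a direct child of the card
direct_price = soup.select(".card:has(> .price)")

print([card.h2.text for card in with_price])
print([card.h2.text for card in direct_price])
```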

Function contains()

✍Syntax: selector|element:contains(text)

The :contains() pseudo-class is not part of the current CSS selectors standard. It is a non-standard extension that behaves like the XPath contains() function: it matches an element if the element's text contains the given substring. A little confusing, so let's show an example.

from parsel import Selector

dummy_string_1 = 'I saw a cat that had $3000 in the pocket'
dummy_string_2 = 'I saw a cat that was dancing with pigeon'

selector_1 = Selector(text=dummy_string_1)
selector_2 = Selector(text=dummy_string_2)

# $ has to be escaped with the \ symbol,
# otherwise SelectorSyntaxError will be raised.
# Raw strings (r'...') keep the backslash without an invalid-escape warning.
text_1 = selector_1.css(r':contains(\$)::text').get()  # 👈👈👈
text_2 = selector_2.css(r':contains(\$)::text').get()

print(text_1)
print(text_2)

Outputs:

I saw a cat that had $3000 in the pocket 👈👈👈
None
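The example above uses parsel. With BeautifulSoup, a similar text match can be sketched with the :-soup-contains() pseudo-class from soupsieve, the selector engine behind select(). This assumes a reasonably recent soupsieve (2.1 or newer); the HTML is hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration
html = """
<ul>
  <li>Premium plan $30</li>
  <li>Free plan</li>
  <li>Team plan $99</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# quoting the text avoids having to escape the $ symbol
paid = soup.select('li:-soup-contains("$")')

print([li.text for li in paid])
```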

Selector :not()

✍Syntax: selector|element:not(selector|element)

The :not() pseudo-class is used to prevent specific items from being selected.

The :not() pseudo-class can be chained with :contains() to create a boolean expression, which is really handy.

Continuing with the previous example, we can select an element that doesn't contain the $ symbol in its text with :not(:contains(\$))::text:

from parsel import Selector

dummy_string_1 = 'I saw a cat that had $3000 in the pocket'
dummy_string_2 = 'I saw a cat that was dancing with pigeon'

selector_1 = Selector(text=dummy_string_1)
selector_2 = Selector(text=dummy_string_2)

# $ has to be escaped with the \ symbol,
# otherwise SelectorSyntaxError will be raised.
# Raw strings (r'...') keep the backslash without an invalid-escape warning.
text_1 = selector_1.css(r':contains(\$)::text').get()
text_2 = selector_2.css(r':not(:contains(\$))::text').get()  # 👈👈👈

print(text_1)
print(text_2)

Outputs:

I saw a cat that had $3000 in the pocket
I saw a cat that was dancing with pigeon 👈👈👈
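The same exclusion idea can be sketched with BeautifulSoup. The example below, on hypothetical HTML, uses :not() once with a class and once chained with soupsieve's :-soup-contains() (assumed to be available, soupsieve 2.1 or newer).

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration
html = """
<ul>
  <li class="ad">Sponsored: Premium plan $30</li>
  <li>Free plan</li>
  <li>Team plan $99</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

not_ads = soup.select("li:not(.ad)")                       # skip items with class "ad"
no_price = soup.select('li:not(:-soup-contains("$"))')     # skip items whose text has "$"

print([li.text for li in not_ads])
print([li.text for li in no_price])
```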

Here's a more practical usage where we need to select everything (only the category) that doesn't contain $ in the text string:

[Screenshots: the selection without :not(:contains(\$)) and with :not(:contains(\$))]

Other useful CSS selectors:

| Selector | Explanation |
|---|---|
| :nth-of-type() | Matches elements based on their position among siblings of the same type (tag name). |
| :is() | Pseudo-class function that takes a selector list as its argument and selects any element that can be selected by one of the selectors in that list. |

You can find additional useful CSS selectors in the W3C Level 4 selectors specification, the W3Schools CSS selectors reference, and the MDN CSS selectors documentation.
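To illustrate the difference between :nth-of-type() and :nth-child(), and how :is() groups alternatives, here is a minimal sketch on hypothetical HTML:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, used only for illustration
html = """
<article>
  <h2>Title</h2>
  <p>First paragraph</p>
  <h3>Subtitle</h3>
  <p>Second paragraph</p>
</article>
"""

soup = BeautifulSoup(html, "html.parser")

# the second <p> among its <p> siblings
second_p = soup.select("p:nth-of-type(2)")

# a <p> that is the second child of its parent, counting all element types
p_second_child = soup.select("p:nth-child(2)")

# any element matched by one of the selectors in the list
headings = soup.select(":is(h2, h3)")

print([p.text for p in second_p])
print([p.text for p in p_second_child])
print([h.text for h in headings])
```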

Testing CSS Selectors

To test whether a selector extracts the correct data, you can:

Place the CSS selector(s) in the SelectorGadget window and see which elements are selected.

Use the Dev Tools Console tab via the $$(".selector") method, which creates an array (a list in Python terms) of matching elements:

$$(".DKV0Md")

According to the Chrome Developers website, this is equivalent to calling Array.from() on the result of the document.querySelectorAll(".selector") method:

document.querySelectorAll(".DKV0Md")

Both methods select the same elements in the DevTools Console.

Pros of CSS Selector

  • easy to pick.
  • easy to get used to (especially if you have an HTML background).
  • has tools to help pick (select) them.
  • can be understandable inside the code if the selector itself is understandable, not something like .wtf228YoLo.

Cons of CSS Selector

Betting only on classes might not be a good idea since they can change.

A slightly more reliable way is to use attribute selectors (mentioned above), which are likely to change less frequently.

[Screenshot: attribute selector examples in the HTML of Google organic results]

Many modern websites use autogenerated CSS selectors that change whenever a style component is modified, which means that relying exclusively on them is not a good idea. But again, it depends on how often they really change.

The biggest problem is that when a selector stops matching, the code will blow up with an error, and the maintainer has to manually update the CSS selector(s) to make the code run properly.

That seems like not a big deal, which is true, but it might be annoying if selectors change frequently.
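One way to soften this problem is to avoid calling .text directly on the result of select_one(), which returns None when nothing matches and then raises AttributeError. The sketch below uses a hypothetical helper, text_or_none(), and made-up HTML and class names; it is one possible approach, not the only one.

```python
from bs4 import BeautifulSoup


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if nothing matches."""
    element = parent.select_one(selector)
    return element.get_text(strip=True) if element is not None else None


# Hypothetical HTML: the ".snippet" element is missing, e.g. after a class rename
html = '<div class="result"><h3 class="title">Hello</h3></div>'
soup = BeautifulSoup(html, "html.parser")
result = soup.select_one(".result")

title = text_or_none(result, ".title")      # the selector still matches
snippet = text_or_none(result, ".snippet")  # no match: None instead of an exception

print(title, snippet)
```

A None value can then be logged or reported, which makes a changed selector easier to spot than a crash in the middle of a run.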

Code Examples

This section shows a couple of actual examples from different websites to help you get a bit more familiar with the process.

Extract title, snippet, link, displayed link from Google Search results.

Before writing the code, test the CSS container selector (.tF2Cxc in the code below) in SelectorGadget or the DevTools Console.

Code:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"
}

html = requests.get("https://www.google.com/search?q=minecraft", headers=headers)
soup = BeautifulSoup(html.text, "lxml")

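for result in soup.select(".tF2Cxc"):
    title = result.select_one(".DKV0Md").text
    link = result.select_one(".yuRUbf a")["href"]
    # assumption: cite.iUh30 (seen in the descendant combinator examples) holds
    # the displayed link; Google's class names change often, so verify it first
    displayed_link = result.select_one("cite.iUh30").text
    snippet = result.select_one(".lEBKkf span").text

    print(f"{title}\n{link}\n{displayed_link}\n{snippet}\n")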


# part of the output 
'''
Log in | Minecraft
https://minecraft.net/login
https://minecraft.net › login
Still have a Mojang account? Log in here: Email. Password. Forgot your password? Login. Mojang © 2009-2021. "Minecraft" is a trademark of Mojang AB.

What is Minecraft? | Minecraft
https://www.minecraft.net/en-us/about-minecraft
https://www.minecraft.net › en-us › about-minecraft
Prepare for an adventure of limitless possibilities as you build, mine, battle mobs, and explore the ever-changing Minecraft landscape.
'''

Extract titles from SerpApi Blog


Testing .post-card-title CSS selector in Devtools Console:

$$(".post-card-title")

(7) [h2.post-card-title, h2.post-card-title, h2.post-card-title, h2.post-card-title, h2.post-card-title, h2.post-card-title, h2.post-card-title]
0: h2.post-card-title
1: h2.post-card-title
2: h2.post-card-title
3: h2.post-card-title
4: h2.post-card-title
5: h2.post-card-title
6: h2.post-card-title
length: 7
[[Prototype]]: Array(0)

Code:

import requests, lxml
from bs4 import BeautifulSoup

html = requests.get("https://serpapi.com/blog/")
soup = BeautifulSoup(html.text, "lxml")

for title in soup.select(".post-card-title"):
    print(title.text)

'''
Scrape Google Carousel Results with Python
SerpApi’s YouTube Search API
DuckDuckGo Search API for SerpApi
Extract all search engines ad results at once using Python
Scrape Multiple Google Answer Box Layouts with Python
SerpApi’s Baidu Search API
How to reduce the chance of being blocked while web scraping search engines
'''

Extract title, link from dev.to feed

Test the CSS selector with either SelectorGadget or the DevTools Console.

Code:

import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.45 Safari/537.36"
}

html = requests.get("https://dev.to/", headers=headers)
soup = BeautifulSoup(html.text, "lxml")

for result in soup.select(".crayons-story__title"):
    title = result.text.strip()
    link = f'https://dev.to{result.a["href"].strip()}'

    print(title, link, sep="\n")


# part of the output:
'''
How to Create and Publish a React Component Library
https://dev.to/alexeagleson/how-to-create-and-publish-a-react-component-library-2oe
A One Piece of CSS Art!
https://dev.to/afif/a-one-piece-of-css-art-225l
Windster - Tailwind CSS admin dashboard interface [MIT License]
https://dev.to/themesberg/windster-tailwind-css-admin-dashboard-interface-mit-license-3lb6
'''


Top comments (2)

Max Ong Zong Bao

Not all websites autogenerate HTML tags; in fact, there will be times when it's much better to just use XPath.

There may be websites without any class name, id, or HTML tags for you to select when the page is loaded, because the content is generated on the fly through other methods like JavaScript to prevent scrapers from being used.

Dmitriy Zub ☀️ • Edited

@steelwolf180 Thank you for your reply! This is exactly what is written in this blog post: "Many modern websites use autogenerated CSS selectors", many, not all of them :-) Using XPath is surely a great idea from the start, but this was not the topic of this blog post.