
Waylon Walker

Posted on • Originally published at


Find all Headings with BeautifulSoup

BeautifulSoup is a DOM-like library for Python that is quite useful for manipulating HTML. Here is an example that uses find_all to collect every HTML heading. I stole the regex from Stack Overflow, but who doesn't?

Make an example


Let's make a sample.html file with the following contents. It mainly has some headings, <h1> and <h2> tags, that I want to be able to find.

<!DOCTYPE html>
<html lang="en">
    <body>
        <h1>hello</h1>
        <p>this is a paragraph</p>
        <h2>second heading</h2>
        <p>this is also a paragraph</p>
        <h2>third heading</h2>
        <p>this is the last paragraph</p>
    </body>
</html>


Get the headings with BeautifulSoup

Let's import our packages, read in sample.html using pathlib, and find all headings using BeautifulSoup.

import re
from pathlib import Path

from bs4 import BeautifulSoup

soup = BeautifulSoup(Path('sample.html').read_text(), features="lxml")
# the regex matches tag names h1 through h6
headings = soup.find_all(re.compile("^h[1-6]$"))
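If the regex looks like magic, here is a small sketch of what it accepts, using only the standard library. As far as I know, BeautifulSoup filters tag names against a compiled pattern with its search() method, which is why the ^ and $ anchors matter. The tag_names list is made up purely for illustration.

```python
import re

# Same pattern that is passed to find_all
heading_pattern = re.compile("^h[1-6]$")

# Hypothetical tag names, chosen only to exercise the pattern
tag_names = ["h1", "h2", "h6", "h7", "p", "hr", "head", "html"]

# Keep only the names the pattern accepts
matches = [name for name in tag_names if heading_pattern.search(name)]
print(matches)
```

Only h1 through h6 get through, so tags like <hr>, <head>, and <html> are left alone even though they also start with an h.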

What we get back is a list of bs4.element.Tag objects.

>>> print(headings)
[<h1>hello</h1>, <h2>second heading</h2>, <h2>third heading</h2>]
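Since each item is a Tag, you can pull out the tag name and the text rather than just printing the markup. Here is a minimal sketch, assuming an inline HTML string instead of sample.html and the built-in html.parser instead of lxml, so it runs with nothing but bs4 installed. The outline variable is just a name I picked for this example.

```python
import re

from bs4 import BeautifulSoup

# Inline stand-in for sample.html (an assumption for this sketch)
html = """
<h1>hello</h1>
<p>this is a paragraph</p>
<h2>second heading</h2>
<h2>third heading</h2>
"""

# html.parser ships with Python, so lxml is not required here
soup = BeautifulSoup(html, features="html.parser")
headings = soup.find_all(re.compile("^h[1-6]$"))

# Build (tag name, text) pairs from each Tag
outline = [(h.name, h.get_text(strip=True)) for h in headings]
for name, text in outline:
    print(name, text)
```

This gives you a flat outline of the page in document order, which is a handy starting point for things like a table of contents.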

I recently added a heading_link plugin to markata, you might notice the
🔗's next to each heading on this page, that is powered by this exact
