Max Humber

Posted on Oct 9, 2020

BeautifulSoup is so 2000-and-late: Web Scraping in 2020

#python #webscraping #gazpacho #hacktoberfest

BeautifulSoup (bs4) was created over a decade-and-a-half ago. And it's been the standard for web scraping ever since. But it's time for something new, because bs4 is so 2000-and-late.

In this post we'll explore 10 reasons why gazpacho is the future of web scraping, by scraping parts of this post!

1. No Dependencies

gazpacho is installed at the command line:

pip install gazpacho

With no extra dependencies:

pip freeze
# gazpacho==1.1

In contrast, bs4 depends on soupsieve and is usually paired with a third-party parser such as lxml. I won't tell you how to write software, but minimizing dependencies is usually a good idea...
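If you'd rather check a package's declared dependencies yourself, the standard library can read them from the installed metadata. This is a minimal sketch, assuming Python 3.8+ and that the packages are installed in the current environment; `declared_dependencies` is a hypothetical helper name, not part of gazpacho or bs4:

```python
from importlib.metadata import PackageNotFoundError, requires
from typing import List, Optional


def declared_dependencies(package: str) -> Optional[List[str]]:
    """Return the requirement strings a package declares, or None if it isn't installed."""
    try:
        # requires() returns None when a package declares no requirements at all
        return requires(package) or []
    except PackageNotFoundError:
        return None


for name in ("gazpacho", "beautifulsoup4"):
    print(name, declared_dependencies(name))
```

An empty list means "installed, nothing declared", while `None` means the package isn't installed. Optional extras also show up in the list, marked with an `extra == ...` condition.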

2. Batteries Included

The html for this blog post can be fetched and made parse-able with Soup.get:

from gazpacho import Soup

url = "https://dev.to/maxhumber/beautifulsoup-is-so-2000-and-late-web-scraping-in-2020-2528"
soup = Soup.get(url)

Unfortunately, you'll need requests on top of bs4 to do the same thing:

import requests
from bs4 import BeautifulSoup

url = "https://dev.to/maxhumber/beautifulsoup-is-so-2000-and-late-web-scraping-in-2020-2528"
html = requests.get(url).text
# name the parser explicitly; bs4 warns when it has to guess one
bsoup = BeautifulSoup(html, "html.parser")
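To give a sense of why fetching doesn't have to cost a dependency, here's a rough sketch of a fetch helper built only on the standard library. It is an illustration, not gazpacho's actual code; `fetch_html` and the `User-Agent` value are assumptions made for the example:

```python
from urllib.request import Request, urlopen


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Fetch a URL and decode the response body to text."""
    # Some sites reject requests without a browser-like User-Agent (assumed value)
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server doesn't declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html(url)` with the post's URL should return the page's html as a string, ready to hand to a parser.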

3. Simple `find`ing

bs4 is a monster. There are 184 methods and attributes attached to every BeautifulSoup object, which makes it hard to know what to use and when to use it:

len(dir(BeautifulSoup()))
# 184

In contrast, Soup objects in gazpacho are simple; there are just seven methods and attributes to keep track of:

[method for method in dir(Soup()) if not method.startswith("_")]
# ['attrs', 'find', 'get', 'html', 'strip', 'tag', 'text']

Looking at that list it's clear that to find the title of this post (nested inside of an h1 tag), for example, we'll need to use .find:

soup.find('h1')

4. Prototyping to Production

gazpacho is awesome for prototyping and even better for production. By default, .find will return one Soup object if it finds just one element, or a list of Soup objects if it finds more than one.

To guarantee and enforce return types in production the mode= argument in .find can be set manually:

title = (soup
    .find("header", {'id': 'main-title'}, mode="first")
    .find("h1", mode="all")[0]
    .text
)

In contrast, bs4 has 27 find methods and they all return something different:

[method for method in dir(BeautifulSoup()) if 'find' in method]
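The return-type rules behind mode= can be sketched in a few lines. This is a simplified illustration rather than gazpacho's source: `select` is a hypothetical helper, and returning `None` when nothing matches is an assumption based on the `NoneType` in the `.find` signature:

```python
from typing import List, Optional, TypeVar, Union

T = TypeVar("T")


def select(found: List[T], mode: str = "automatic") -> Union[List[T], T, None]:
    """Shape a list of matches according to the requested mode."""
    if mode == "all":
        # always a list, even when it is empty or has one item
        return found
    if mode == "first":
        # always a single item (or None when nothing matched)
        return found[0] if found else None
    if mode == "automatic":
        # convenient for prototyping: the shape depends on the match count
        if not found:
            return None
        return found[0] if len(found) == 1 else found
    raise ValueError(f"unknown mode: {mode!r}")


print(select(["h1"]))              # h1
print(select(["h1"], mode="all"))  # ['h1']
```

Pinning the mode means calling code never has to check whether it got one element or a list, which is the point of setting it in production.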

5. PEP 561 Compliant

As of version 1.1, gazpacho is PEP 561 compliant, meaning that the entire library is typed and will work with your typed (or standard duck/un-typed!) code-base:

help(soup.find)
# Signature:
# soup.find(
#     tag: str,
#     attrs: Union[Dict[str, Any], NoneType] = None,
#     *,
#     partial: bool = True,
#     mode: str = 'automatic',
#     strict: Union[bool, NoneType] = None,
# ) -> Union[List[ForwardRef('Soup')], ForwardRef('Soup'), NoneType]

6. Automatic Formatting

The html on dev.to and this post is well formatted. But if it weren't:

header = soup.find("div", {'class': 'crayons-article__header__meta'})
html = str(header.find("div", {'class': 'mb-4 spec__tags'}))
bad_html = html.replace("\n", "") # remove new line characters
print(bad_html)
# <div class="mb-4 spec__tags">  <a class="crayons-tag mr-1" href="/t/python" style="background-color:#1E38BB;color:#FFDF5B">    <span class="crayons-tag__prefix">#</span>    python  </a>  <a class="crayons-tag mr-1" href="/t/webscraping" style="background-color:;color:">    <span class="crayons-tag__prefix">#</span>    webscraping  </a>  <a class="crayons-tag mr-1" href="/t/gazpacho" style="background-color:;color:">    <span class="crayons-tag__prefix">#</span>    gazpacho  </a>  <a class="crayons-tag mr-1" href="/t/hacktoberfest" style="background-color:#29161f;color:#ffa368">    <span class="crayons-tag__prefix">#</span>    hacktoberfest  </a></div>

gazpacho would be able to automatically format and indent the bad/malformed html:

tags = Soup(bad_html)

Making things easier to read:

print(tags)
# <div class="mb-4 spec__tags">
#   <a class="crayons-tag mr-1" href="/t/python" style="background-color:#1E38BB;color:#FFDF5B">
#     <span class="crayons-tag__prefix">#</span>
#         python
#   </a>
#   <a class="crayons-tag mr-1" href="/t/webscraping" style="background-color:;color:">
#     <span class="crayons-tag__prefix">#</span>
#         webscraping
#   </a>
#   <a class="crayons-tag mr-1" href="/t/gazpacho" style="background-color:;color:">
#     <span class="crayons-tag__prefix">#</span>
#         gazpacho
#   </a>
#   <a class="crayons-tag mr-1" href="/t/hacktoberfest" style="background-color:#29161f;color:#ffa368">
#     <span class="crayons-tag__prefix">#</span>
#         hacktoberfest
#   </a>
# </div>
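For the curious, this kind of re-indentation can be approximated with the standard library alone. The sketch below uses `xml.dom.minidom`, so it assumes the markup is well-formed XML (every tag closed, attributes quoted); `prettify` is a hypothetical helper, not gazpacho's API:

```python
from xml.dom.minidom import parseString


def prettify(html: str, indent: str = "  ") -> str:
    """Re-indent well-formed markup, one element per line."""
    dom = parseString(html)
    pretty = dom.documentElement.toprettyxml(indent=indent)
    # drop any blank lines left over from the serializer
    return "\n".join(line for line in pretty.split("\n") if line.strip())


flat = '<div class="tags"><a href="/t/python">python</a><a href="/t/gazpacho">gazpacho</a></div>'
print(prettify(flat))
```

Markup that isn't valid XML (an unclosed `<br>`, for example) raises a parse error here, so a real formatter needs a fallback for that case.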

7. Speed

gazpacho is fast. It takes just 258 µs to scrape the tag links for this post:

%%timeit
tags = Soup(bad_html)
tags = tags.find("a")
tag_links = ["https://dev.to" + tag.attrs['href'] for tag in tags]
# 258 µs ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

While bs4 takes nearly twice as long to do the same thing:

%%timeit
tags = BeautifulSoup(bad_html)
tags = tags.find_all("a")
tag_links = ["https://dev.to" + tag.attrs['href'] for tag in tags]
# 465 µs ± 2.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
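`%%timeit` only works inside IPython or Jupyter. To run a comparable measurement from a plain script, the standard `timeit` module does the same job. A small sketch, where `time_per_loop` is a hypothetical helper and `workload` is a stand-in to be swapped for the gazpacho or bs4 snippet being measured:

```python
import statistics
import timeit
from typing import Callable, List, Tuple


def time_per_loop(func: Callable[[], object], number: int = 1000, repeat: int = 7) -> Tuple[float, float]:
    """Return (mean, std dev) in seconds per call over `repeat` runs of `number` calls."""
    totals = timeit.repeat(func, number=number, repeat=repeat)
    per_loop = [total / number for total in totals]
    return statistics.mean(per_loop), statistics.stdev(per_loop)


def workload() -> List[str]:
    # stand-in for the scraping code under test
    return ["https://dev.to" + path for path in ("/t/python", "/t/gazpacho")]


mean, std = time_per_loop(workload)
print(f"{mean * 1e6:.2f} us +/- {std * 1e6:.2f} us per loop")
```

Numbers will vary by machine and library version, so compare both libraries in the same session rather than against the figures quoted here.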

8. Partial Matching

gazpacho can partially match html element attributes. For instance, the sidebar for this page is displayed with the following html:

<aside class="crayons-layout__sidebar-right" aria-label="Right sidebar navigation">

And can be matched exactly with:

soup.find("aside", {"class": "crayons-layout__sidebar-right"}, partial=False)

Or partially (the default behaviour) with:

sidebar = soup.find("aside", {'aria-label': 'Right sidebar'}, partial=True)

# finding my name
sidebar.find("span", {'class': 'crayons-subtitle-2'}, partial=True).text
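The difference between exact and partial matching comes down to equality versus substring checks on attribute values. Here's a rough sketch of that idea, assuming partial matching means "the query string appears somewhere in the attribute value"; `attrs_match` is a hypothetical helper, not gazpacho's implementation:

```python
from typing import Dict


def attrs_match(query: Dict[str, str], actual: Dict[str, str], partial: bool = True) -> bool:
    """Check whether an element's attributes satisfy every queried attribute."""
    for name, wanted in query.items():
        value = actual.get(name)
        if value is None:
            return False  # the element doesn't have this attribute at all
        if partial:
            if wanted not in value:
                return False  # substring check
        elif wanted != value:
            return False  # exact check
    return True


aside = {
    "class": "crayons-layout__sidebar-right",
    "aria-label": "Right sidebar navigation",
}
print(attrs_match({"aria-label": "Right sidebar"}, aside, partial=True))   # True
print(attrs_match({"aria-label": "Right sidebar"}, aside, partial=False))  # False
```

Partial matching is forgiving of long or generated class names, while exact matching is the safer choice when several elements share a common prefix.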

9. Debt-free

gazpacho is Python 3 first, formatted with Black, typed with mypy, and about 400 sloc. It's easy to read through the source:

import inspect

source = inspect.getsource(Soup.find)
print(source)

And, unlike bs4, it isn't riddled with Python 2 technical debt.

10. Open (and Friendly)!

Most importantly, gazpacho is open-source, hosted on GitHub (instead of some clunky custom platform) and looking for contributors.

If you're participating in #hacktoberfest, we'd love to have you. There are a couple of open issues that could use some help!

Top comments (9)

Mohiuddin Sumon • Oct 10 '20

Yes, it's true that bs4 has a lot of methods, and knowing and remembering them is tough, but for most work we can simply use the common ones.

However, where I faced the most difficulty is with dynamic pages; for example, scraping a search result page on an ecommerce site.

In this type of scenario, how would gazpacho work? @maxhumber, maybe you can make a tutorial video with selenium and gazpacho?

Max Humber • Oct 10 '20

There are some examples of gazpacho + selenium on this website: scrape.world/

Tom Quirk • Oct 10 '20

One thing I've never understood about Beautiful Soup is how un-user-friendly the docs are. Gazpacho looks so simple in comparison - I'll definitely check it out!

Max Humber • Oct 10 '20

Right? 🙈

Guilherme Bauer-Negrini • Oct 10 '20

I've used bs4 in 3 projects by now and it had never occurred to me to search for alternatives. I'll certainly give gazpacho a chance; it seems pretty easy.

Robert • Oct 10 '20

Cool!

Pacharapol Withayasakpunt • Oct 10 '20

Why not just simply lxml with xpath? (Who says we have to use BeautifulSoup?)

My favorite is Cheerio (in Node.js / web browser), a jQuery analog, though.

mehmeh-ctrl • Mar 21 '21

Remembering attending your workshops (also these where you've talked about how one of your cousins plunged into a barn without his parachute opened), I wonder why I tomorrow won't be teaching people how to use Gazpacho. I've prepared some extended Pandas exercises (that's only Intro to Big Data). Gazpacho should turn into a tutorial of the kind of Django Girls website to conquer the world.