A General Overview
When people discuss "web scrapers," they commonly refer to a process that involves:
Retrieving HTML data from a domain name
Parsing that data for target information
Storing the target information
Optionally, moving to another page to repeat the process
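As a rough preview of what the rest of this article builds up, here is a minimal sketch of those four steps using urllib and BeautifulSoup (the CSV filename is only a placeholder):
import csv
from urllib.request import urlopen
from bs4 import BeautifulSoup

# 1. Retrieve HTML data from a URL
html = urlopen('https://www.crummy.com/software/BeautifulSoup/bs4/doc/')

# 2. Parse that data for the target information
bs = BeautifulSoup(html.read(), 'html.parser')
title = bs.h1.get_text()

# 3. Store the target information (here, a single CSV row)
with open('scraped.csv', 'w', newline='') as f:
    csv.writer(f).writerow([title])

# 4. Optionally, move to another page and repeat the process
first_link = bs.find('a', href=True)
if first_link:
    print('Next page to visit:', first_link['href'])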
Introduction to BeautifulSoup
Because the BeautifulSoup library is not a default Python library, it must be installed.
!pip install beautifulsoup4
The most commonly used object in the BeautifulSoup library is, appropriately, the BeautifulSoup object. Let's take a look at it in action:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.crummy.com/software/BeautifulSoup/bs4/doc/')
bs = BeautifulSoup(html.read(), 'html.parser')
print(bs.h1)
The output is as follows:
<h1>Beautiful Soup Documentation<a class="headerlink" href="#module-bs4" title="Link to this heading">¶</a></h1>
Note that this returns only the first instance of the h1 tag found on the page. By convention, only one h1 tag should be used on a single page, but conventions are often broken on the web, so you should be aware that this will retrieve the first instance of the tag only, and not necessarily the one that you're looking for.
Another popular parser is lxml. This can be installed through pip:
!pip install lxml
lxml has some advantages over html.parser in that it is generally better at parsing "messy" or malformed HTML code. It is forgiving and fixes problems like unclosed tags, tags that are improperly nested, and missing head or body tags. It is also somewhat faster than html.parser.
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.crummy.com/software/BeautifulSoup/bs4/doc/')
bs = BeautifulSoup(html, 'lxml')
print(bs.h1)
Connecting Reliably and Handling Exceptions
The web is messy. Data is poorly formatted, websites go down, and closing tags go missing. One of the most frustrating experiences in web scraping is to go to sleep with a scraper running, dreaming of all the data you'll have in your database the next day - only to find that the scraper hit an error on some unexpected data format and stopped execution shortly after you stopped looking at the screen.
Two main things can go wrong when requesting a page with urlopen:
- The page is not found on the server (or there was an error in retrieving it)
- The server is not found.
Of course, if the page is retrieved successfully from the server, there is still the issue of the content on the page not quite being what you expected. Every time you access a tag in a BeautifulSoup object, it's smart to add a check to make sure the tag actually exists.
from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup
def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError as e:
        return None
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.body.h1
    except AttributeError as e:
        return None
    return title
title = getTitle('https://www.crummy.com/software/BeautifulSoup/bs4/doc/')
if title is None:
print('Title could not be found')
else:
print(title)
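The function above handles the first failure mode (the page not being found). To also cover the second one, the server not being found, you could catch URLError as well; here is a hedged variant of the same function (the name getTitleSafe is just illustrative):
from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def getTitleSafe(url):
    # Illustrative variant: also handles the case where the server cannot be reached
    try:
        html = urlopen(url)
    except HTTPError:
        # The page was not found, or there was an error retrieving it
        return None
    except URLError:
        # The server could not be reached at all
        return None
    try:
        bs = BeautifulSoup(html.read(), 'html.parser')
        title = bs.body.h1
    except AttributeError:
        # The tag we were looking for does not exist on the page
        return None
    return title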
find() and find_all() with BeautifulSoup
BeautifulSoup's find() and find_all() are the two functions you will likely use the most. With them, you can easily filter HTML pages to find lists of desired tags, or a single tag, based on their various attributes. The two functions are extremely similar, as evidenced by their definitions in the BeautifulSoup documentation:
find_all(tag, attributes, recursive, text, limit, keywords)
find(tag, attributes, recursive, text, keywords)
These examples demonstrate how find_all can be used to locate elements based on different criteria such as tag names, attributes, and text content within an HTML or XML document parsed with BeautifulSoup.
bs.find_all(['h1','h2','h3','h4','h5','h6'])
bs.find_all('span', {'class':{'n'}})
nameList = bs.find_all(text='soup')
title = bs.find_all(id='searchlabel')
The find_all method searches for all occurrences of tags that match the given criteria. The find method, on the other hand, returns only the first occurrence of a tag that matches the given criteria.
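As a small illustration (assuming the BeautifulSoup documentation page parsed earlier into bs):
# find_all returns a list of every matching tag (an empty list if none match)
all_headings = bs.find_all(['h1', 'h2'])
print(len(all_headings))

# find returns only the first matching tag (or None if nothing matches)
first_heading = bs.find(['h1', 'h2'])
print(first_heading)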
Navigating Trees
The find_all function is responsible for finding tags based on their name and attributes. But what if you need to find a tag based on its location in a document?
Dealing with children and other descendants
For example, if you need to retrieve the contents of a table by accessing only its immediate children, you can use the .children attribute:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.crummy.com/software/BeautifulSoup/bs4/doc/')
bs = BeautifulSoup(html, 'html.parser')
for child in bs.find('table').children:
    print(child)
Dealing with siblings
Additionally, to extract data from a table beyond its immediate children, BeautifulSoup offers the next_siblings attribute. This makes it easy to retrieve the siblings of a given tag, for example, every row of a table that follows the title row:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.crummy.com/software/BeautifulSoup/bs4/doc/')
bs = BeautifulSoup(html, 'html.parser')
for child in bs.find('table').tr.next_siblings:
    print(child)
Dealing with parents
When you're scraping web pages, you often focus more on finding children or siblings of tags rather than their parents. Typically, when you begin crawling an HTML page, you start by examining the top-level tags and then work your way down to locate specific pieces of data. However, there are times when you might encounter unusual scenarios that necessitate using BeautifulSoup's parent-finding functions like .parent and .parents.
We can also get the table rows by finding a td tag, then finding its parent and printing all of that parent's siblings:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.crummy.com/software/BeautifulSoup/bs4/doc/')
bs = BeautifulSoup(html, 'html.parser')
element = bs.find('td')
if element:
    parent_element = element.parent
    for sibling in parent_element.next_siblings:
        print(sibling)
Regular Expressions
One classic example of regular expressions can be found in the practice of identifying email addresses. Although the exact rules governing email addresses vary slightly from mail server to mail server, we can create a few general rules.
[A-Za-z0-9._+]+@[A-Za-z]+\.(com|org|edu|net)
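As a quick illustration, here is how that pattern might be used with Python's re module (the sample addresses are made up):
import re

# Same pattern as above; the dot before the top-level domain is escaped
email_pattern = re.compile(r'[A-Za-z0-9._+]+@[A-Za-z]+\.(com|org|edu|net)')

for candidate in ['jane.doe@example.com', 'not-an-email']:
    if email_pattern.fullmatch(candidate):
        print(candidate, '-> looks like an email address')
    else:
        print(candidate, '-> does not match')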
When scraping web pages using BeautifulSoup, the synergy with regular expressions becomes invaluable, especially for tasks like extracting specific elements such as event images from complex HTML structures. For instance, consider a scenario where you need to retrieve URLs of event images from a page.
To address this, regular expressions let you pinpoint images by a specific attribute such as src, which contains the file path. This approach ensures you accurately target the event images regardless of their position or surrounding elements on the page.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen('http://enetcomje.com/gallerie.php')
bs = BeautifulSoup(html, 'html.parser')
# Adjust the regular expression pattern to match 'event' images
images = bs.find_all('img', {'src': re.compile(r'event\d+\.png')})
for image in images:
    print(image['src'])
Accessing Attributes
When web scraping, you often need to access attributes of HTML tags rather than their content. This is especially useful for tags like <a>, where the URL lives in the href attribute, or <img>, where the image source (src) is what matters. BeautifulSoup simplifies this process with its attrs attribute, which returns a Python dictionary containing all attributes of a tag.
print(bs.img.attrs['src'])
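For example, continuing with a page already parsed into bs (the tag chosen here is only illustrative):
# attrs returns a dictionary of every attribute on the tag
first_link = bs.a
if first_link is not None:
    print(first_link.attrs)               # the full attribute dictionary
    print(first_link.attrs.get('href'))   # the URL the link points to, if any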
Writing Web Crawlers
So far, you've seen single static pages. In this part, you'll start looking at real-world problems, with scrapers traversing multiple pages and even multiple sites.
Web crawlers are called such because they crawl across the web. At their core is an element of recursion. They must retrieve page contents for a URL, examine that page for another URL, and retrieve that page, ad infinitum.
Here is code that retrieves an arbitrary Wikipedia page and produces a list of the links on that page:
from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')
for link in bs.find_all('a'):
    if 'href' in link.attrs:
        print(link.attrs['href'])
If you look at the list of links produced, you'll notice that all the articles you'd expect are there: "Apollo 13," "Philadelphia," "Primetime Emmy Award," and so on. However, there are also some links that you don't want, such as //wikimediafoundation.org/wiki/Privacy_policy and //en.wikipedia.org/wiki/Wikipedia:Contact_us.
If you examine the links that point to article pages (as opposed to other internal pages), you'll see that they all have three things in common:
- They reside within the div with the id set to bodyContent.
- The URLs do not contain colons.
- The URLs begin with /wiki/.
So we can improve the previous code to retrieve only the desired article links by using the regular expression ^(/wiki/)((?!:).)*$:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
html = urlopen('http://en.wikipedia.org/wiki/Kevin_Bacon')
bs = BeautifulSoup(html, 'html.parser')
for link in bs.find('div', {'id':'bodyContent'}).find_all(
        'a', href=re.compile('^(/wiki/)((?!:).)*$')):
    if 'href' in link.attrs:
        print(link.attrs['href'])
Crawling an Entire Site
Web scrapers that traverse an entire site are useful for several purposes:
- Generating a Site Map: Using a crawler, you can scan an entire site, gather all internal links, and organize the pages into their actual folder structure. This could help you discover hidden sections and accurately count the number of pages.
- Gathering Data: If you need to collect articles (such as stories, blog posts, news articles, etc.) to create a prototype for a specialized search platform, you would need data from only a few sites but want a broad collection. Therefore, you will build crawlers that traverse each site and collect data specifically from article pages.
The general approach to an exhaustive site crawl is to start with a top-level page (such as the home page), and search for a list of all internal links on that page. Every one of those links is then crawled, and additional lists of links are found on each one of them, triggering another round of crawling.
Here is code that recursively collects all the links reachable on Wikipedia:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks('')
Collecting Data Across an Entire Site
Web crawlers would be fairly boring if all they did was hop from one page to another. To make them useful, you need to be able to do something on the page while you're there. Let's look at how to build a scraper that collects the title, the first paragraph of content, and the link to edit the page (if available).
As always, the first step to determine how best to do this is to look at a few pages from the site and determine a pattern. By looking at a handful of Wikipedia pages (both articles and nonarticle pages such as the privacy policy page), the following things should be clear:
- All titles (on all pages, regardless of whether they are article pages, edit history pages, or any other kind of page) appear under h1 → span tags, and these are the only h1 tags on the page.
- As mentioned before, all body text lives under the div#bodyContent tag. However, if you want to get more specific and access just the first paragraph of text, you might be better off using div#mw-content-text → p (selecting the first paragraph tag only). This is true for all content pages except file pages (for example, https://en.wikipedia.org/wiki/File:Orbit_of_274301_Wikipedia.svg), which do not have sections of content text.
- Edit links occur only on article pages. If they occur, they will be found in the li#ca-edit tag, under li#ca-edit → span → a.
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
pages = set()
def getLinks(pageUrl):
    global pages
    html = urlopen('http://en.wikipedia.org{}'.format(pageUrl))
    bs = BeautifulSoup(html, 'html.parser')
    try:
        print(bs.h1.get_text())
        print(bs.find(id='mw-content-text').find_all('p')[0])
        print(bs.find(id='ca-edit').find('span')
              .find('a').attrs['href'])
    except AttributeError:
        print('This page is missing something! Continuing.')
    for link in bs.find_all('a', href=re.compile('^(/wiki/)')):
        if 'href' in link.attrs:
            if link.attrs['href'] not in pages:
                # We have encountered a new page
                newPage = link.attrs['href']
                print('-'*20)
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)

getLinks('')
Note: Handling Redirects
Redirects allow a web server to point one domain name or URL to a piece of content at a different location. There are two types of redirects:
- Server-side redirects, where the URL is changed before the page is loaded
- Client-side redirects, sometimes seen with a "You will be redirected in 10 seconds" type of message, where the page loads before redirecting to the new one
With server-side redirects, you usually don't have to worry. If you're using the urllib library with Python 3.x, it handles redirects automatically! If you're using the requests library, make sure the allow_redirects flag is set to True:
r = requests.get('http://github.com', allow_redirects=True)
Just be aware that, occasionally, the URL of the page you're crawling might not be exactly the URL that you entered the page on.
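If you need to know where you actually ended up, the requests response object exposes both the final URL and the redirect chain; a short sketch using the same github.com example:
import requests

r = requests.get('http://github.com', allow_redirects=True)
print(r.url)       # the final URL after any redirects
print(r.history)   # the intermediate responses, one per redirect hop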
Crawling Across the Internet
Before you start writing a crawler that follows all outbound links, you should ask yourself a few questions:
- What data am I trying to gather? Can this be accomplished by scraping just a few predefined websites (almost always the easier option), or does my crawler need to be able to discover new websites I might not know about?
- When my crawler reaches a particular website, will it immediately follow the next outbound link to a new website, or will it stick around for a while and drill down into the current website?
- Are there any conditions under which I would not want to scrape a particular site? Am I interested in non-English content?
- How am I protecting myself against legal action if my web crawler catches the attention of a webmaster on one of the sites it runs across?
import time
import random
from urllib.request import urlopen
from urllib.parse import urlparse, urljoin
from bs4 import BeautifulSoup
import re
pages = set()
# Seed random generator with current time
random.seed(time.time())
def getInternalLinks(bs, includeUrl):
    """Retrieves internal links from a BeautifulSoup object"""
    includeUrl = '{}://{}'.format(urlparse(includeUrl).scheme,
                                  urlparse(includeUrl).netloc)
    internalLinks = []
    for link in bs.find_all('a', href=re.compile('^(/|.*'+includeUrl+')')):
        if 'href' in link.attrs:
            if link.attrs['href'] is not None:
                url = urljoin(includeUrl, link.attrs['href'])
                if url not in internalLinks:
                    internalLinks.append(url)
    return internalLinks

def getExternalLinks(bs, excludeUrl):
    """Retrieves external links from a BeautifulSoup object"""
    externalLinks = []
    for link in bs.find_all('a', href=re.compile('^(http|www)((?!'+excludeUrl+').)*$')):
        if 'href' in link.attrs:
            if link.attrs['href'] is not None:
                url = link.attrs['href']
                if url not in externalLinks:
                    externalLinks.append(url)
    return externalLinks

def getRandomExternalLink(startingPage):
    """Retrieves a random external link from the starting page"""
    html = urlopen(startingPage)
    bs = BeautifulSoup(html, 'html.parser')
    externalLinks = getExternalLinks(bs, urlparse(startingPage).netloc)
    if len(externalLinks) == 0:
        print('No external links found, looking for internal links')
        domain = '{}://{}'.format(urlparse(startingPage).scheme, urlparse(startingPage).netloc)
        internalLinks = getInternalLinks(bs, domain)
        if internalLinks:
            return getRandomExternalLink(random.choice(internalLinks))
        else:
            return None
    else:
        return random.choice(externalLinks)

def followExternalOnly(startingSite):
    """Follows external links recursively from the starting site"""
    if startingSite in pages:
        return
    pages.add(startingSite)
    externalLink = getRandomExternalLink(startingSite)
    if externalLink:
        print('Random external link is: {}'.format(externalLink))
        followExternalOnly(externalLink)
    else:
        print('No external links found on {}'.format(startingSite))
# Start following external links from a specified starting site
followExternalOnly('http://oreilly.com')
Web Crawling Models
Planning and Defining Objects
One common trap of web scraping is defining the data that you want to collect based entirely on what's available in front of your eyes.
If you want to collect product data, you may first look at a clothing store and decide that each product you scrape needs to have the following fields:
- Product name
- Price
- Description
- Sizes
- Colors
Looking at another website, you find that it has SKUs (stock keeping units, used to track and order items) listed on the page. You definitely want to collect that data as well, even if it doesn't appear on the first site! You add this field:
- Item SKU
Clearly, this is an unsustainable approach. Simply adding attributes to your product type every time you see a new piece of information on a website will lead to far too many fields to keep track of. Not only that, but every time you scrape a new website, you'll be forced to perform a detailed analysis of the fields the website has and the fields you've accumulated so far, and potentially add new fields (modifying your Python object type and your database structure). This will result in a messy and difficult-to-read dataset that may lead to problems using it.
One of the best things you can do when deciding which data to collect is often to ignore the websites altogether. You don't start a project that's designed to be large and scalable by looking at a single website and saying, "What exists?" but by saying "What do I need?" and then finding ways to seek the information that you need from there.
It's important to take a step back and perform a checklist for each item you consider and ask yourself the following questions:
- Will this information help with the project goals? Will it be a roadblock if I don't have it, or is it just "nice to have" but won't ultimately impact anything?
- If it might help in the future, but I'm unsure, how difficult will it be to go back and collect the data at a later time?
- Is this data redundant to data I've already collected?
- Does it make logical sense to store the data within this particular object? (Storing a description in a product doesn't make sense if that description changes from site to site for the same product.)
If you do decide that you need to collect the data, it's important to ask a few more questions to then decide how to store and handle it in code:
- Is this data sparse or dense? Will it be relevant and populated in every listing, or just a handful out of the set?
- How large is the data?
- Especially in the case of large data, will I need to regularly retrieve it every time I run my analysis, or only on occasion?
- How variable is this type of data? Will I regularly need to add new attributes, modify types, or is it set in stone (shoe sizes)?
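To make this concrete, the product fields discussed above might end up in a small type like the following (a hedged sketch; the class name, field types, and defaults are assumptions, not a prescribed schema):
class Product:
    """One product listing collected by a scraper."""
    def __init__(self, name, price, description, sizes=None, colors=None, sku=None):
        self.name = name
        self.price = price
        self.description = description
        # Sparse fields: not every site will provide these
        self.sizes = sizes if sizes is not None else []
        self.colors = colors if colors is not None else []
        self.sku = sku  # stock keeping unit, if the site lists one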
Dealing with Different Website Layouts
The most obvious approach is to write a separate web crawler or page parser for each website. Each might take in a URL, string, or BeautifulSoup object, and return a Python object for the thing that was scraped.
The following is an example of a Content class (representing a piece of content on a website, such as a news article) and two scraper functions that take in a BeautifulSoup object and return an instance of Content:
import requests
from bs4 import BeautifulSoup
class Content:
    def __init__(self, url, title, body):
        self.url = url
        self.title = title
        self.body = body

def getPage(url):
    req = requests.get(url)
    return BeautifulSoup(req.text, 'html.parser')

def scrapetheguardian(url):
    bs = getPage(url)
    title_tag = bs.find("h1", {"class": "dcr-u0152o"})
    title = title_tag.text.strip() if title_tag else "Title not found."
    article_body_div = bs.find("div", {"class": "article-body-commercial-selector article-body-viewer-selector dcr-fp1ya"})
    if article_body_div:
        paragraphs = article_body_div.find_all("p")
        body = '\n'.join([p.text.strip() for p in paragraphs])
    else:
        body = "Body content not found."
    return Content(url, title, body)

def scrapeBrookings(url):
    bs = getPage(url)
    title_tag = bs.find("h1")
    title = title_tag.text.strip() if title_tag else "Title not found."
    body_tag = bs.find("section", {"id": "content"})
    if body_tag:
        paragraphs = body_tag.find_all("p")
        body = '\n'.join([p.text.strip() for p in paragraphs])
    else:
        body = "Body content not found."
    return Content(url, title, body)
# Example usage
url_brookings = 'https://www.brookings.edu/articles/delivering-inclusive-urban-access-3-uncomfortable-truths/'
content_brookings = scrapeBrookings(url_brookings)
print('Title: {}'.format(content_brookings.title))
print('URL: {}\n'.format(content_brookings.url))
print(content_brookings.body)
url_theguardian = 'https://www.theguardian.com/us-news/article/2024/jul/01/trump-hush-money-supreme-court-immunity'
content_theguardian = scrapetheguardian(url_theguardian)
print('\nTitle: {}'.format(content_theguardian.title))
print('URL: {}\n'.format(content_theguardian.url))
print(content_theguardian.body)
