AngelaMunyao

How to create a web crawler with Beautiful Soup

Web crawling is an essential technique used in data extraction, indexing, and web scraping. Web crawlers are automated programs that navigate through websites, extract data, and save it for further processing. Python is an excellent language for building web crawlers, and the Beautiful Soup library is one of the most popular tools for web scraping.

In this article, we will show you how to create a web crawler using Python and Beautiful Soup. We will extract text data from a website and save it to a CSV file.

First, we need to import the required libraries. We will use the requests library to send GET requests to the website and the Beautiful Soup library to parse the HTML.
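Both requests and Beautiful Soup are third-party packages; if you do not already have them, they can be installed with pip install requests beautifulsoup4 (the csv module is part of the standard library).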

import requests
import csv
from bs4 import BeautifulSoup
Next, we set the URL of the website we want to crawl.

url = "https://www.businessdailyafrica.com/"

We then use the requests library to send a GET request to the URL and store the response.

response = requests.get(url)
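In practice, requests.get can hang or return an error page. A minimal, optional hardening of this step adds a timeout and raises on bad status codes (the 10-second timeout is just an illustrative value):

# Optional: fail fast instead of hanging or silently parsing an error page
response = requests.get(url, timeout=10)   # illustrative timeout value
response.raise_for_status()                # raises requests.HTTPError on 4xx/5xx responses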

Now, we can create a Beautiful Soup object from the response HTML.

soup = BeautifulSoup(response.text, "html.parser")

We will use the find_all method to extract all anchor tags from the HTML.

links = soup.find_all("a")

We need to filter out links that do not belong to the website. We will create an empty list to store the internal links.

internal_links = []

We loop through all the links and, for each href that starts with /bd, build the full URL and append it to the internal_links list.

for link in links:
    href = link.get("href")
    if href and href.startswith("/bd"):
        href1 = 'https://www.businessdailyafrica.com'+href
        internal_links.append(href1)
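String concatenation works here because every matching href is root-relative, but the standard library's urllib.parse.urljoin handles relative links more robustly. A sketch of the same filtering loop using it:

from urllib.parse import urljoin

for link in links:
    href = link.get("href")
    if href and href.startswith("/bd"):
        # urljoin resolves the relative href against the base URL
        internal_links.append(urljoin(url, href))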

Next, we visit each internal link and repeat the process. We use a set for the visited links, because membership checks on a set are fast and it guarantees each page is only crawled once, and a list to store the text data.

visited_links = set()
p_array = []

We open a CSV file to store the text data, passing newline='' as the csv module recommends so that no extra blank rows appear on some platforms.

with open('bdailyps.csv', mode='w', newline='') as csv_file:
    fieldnames = ['words']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()

Still inside the with block, we loop through each internal link and send a GET request to it. We create a Beautiful Soup object from the response HTML and extract all the paragraphs. The loop also breaks once both safety limits are reached (100,000 visited links and more than ten million collected words), so the crawler cannot run forever.

    for link in internal_links:
        if len(visited_links) >= 100000 and len(p_array) >= 10000001:
            break
        if link not in visited_links:
            visited_links.add(link)
            response = requests.get(link)
            soup = BeautifulSoup(response.text, "html.parser")
            h2s = soup.find_all("h2")  # headings are collected here but not used further
            ps = soup.find_all("p")
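Fetching thousands of pages in a tight loop can overload the target server, and a single failed request will crash the script. One way to soften both problems is a small helper like the hypothetical fetch function below, which pauses before each request and returns None on failure (the delay and timeout values are arbitrary):

import time
import requests

def fetch(page_url, delay=1.0, timeout=10):
    """Fetch a page politely: pause first, and return None instead of raising on failure."""
    time.sleep(delay)                    # small pause so we do not hammer the server
    try:
        response = requests.get(page_url, timeout=timeout)
        response.raise_for_status()      # treat 4xx/5xx responses as failures
        return response.text
    except requests.RequestException:
        return None                      # the caller can simply skip this page

Inside the loop you would then call html = fetch(link) and continue when it returns None before building the BeautifulSoup object.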

Still inside the same loop, we extract all the links from the page we just fetched and append any internal ones to internal_links, so the crawl keeps discovering new pages.

            links = soup.find_all("a")
            for sublink in links:
                href = sublink.get("href")
                if href and href.startswith("/bd"):
                    href1 = 'https://www.businessdailyafrica.com' + href
                    internal_links.append(href1)
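Notice that internal_links is appended to while the outer for loop is still iterating over it. In Python, a for loop over a list keeps going as long as new items are appended, which is exactly what lets the crawler move beyond the pages linked from the homepage. A tiny illustration of that behaviour:

queue = [1, 2]
for item in queue:
    if item < 4:
        queue.append(item + 2)   # items appended mid-loop are still visited
print(queue)                     # [1, 2, 3, 4, 5]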

Finally, still inside the loop, we go through all the paragraphs on the page, split their text into words, and save each word to the p_array list and the CSV file.

            for p in ps:
                paragraph = p.text.strip()
                for word in paragraph.split():
                    p_array.append(word)
                    writer.writerow({'words': word})
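The words are written exactly as they appear in the text, punctuation and all. If you want a cleaner word list, a small hypothetical helper like clean_word could normalise each token before it is written (the cleaning rules here are just one reasonable choice):

import string

def clean_word(word):
    """Lowercase a token and strip punctuation from its ends."""
    return word.strip(string.punctuation).lower()

# clean_word("Profits,") -> "profits"   clean_word("(Nairobi)") -> "nairobi"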

Finally, we will print the number of visited links.

print(f'{len(visited_links)} links visited.')

Here is the complete script:

import requests
import csv
from bs4 import BeautifulSoup

# Set the URL of the website to crawl
url = "https://www.businessdailyafrica.com/"

# Send a GET request to the URL and store the response
response = requests.get(url)

# Create a BeautifulSoup object from the response HTML
soup = BeautifulSoup(response.text, "html.parser")

# Find all anchor tags in the HTML
links = soup.find_all("a")

# Filter out links that do not belong to the website
internal_links = []
p_array = []
for link in links:
    href = link.get("href")
    if href and href.startswith("/bd"):
        href1 = 'https://www.businessdailyafrica.com'+href
        internal_links.append(href1)

# Visit each internal link and repeat the process
visited_links = set()
with open('bdailyps.csv', mode='w', newline='') as csv_file:
    fieldnames = ['words']
    writer = csv.DictWriter(csv_file, fieldnames=fieldnames)
    writer.writeheader()

    for link in internal_links:
        if len(visited_links) >= 100000 and len(p_array) >= 10000001:
            break
        if link not in visited_links:
            visited_links.add(link)
            response = requests.get(link)
            soup = BeautifulSoup(response.text, "html.parser")
            h2s = soup.find_all("h2")  # headings are collected here but not used further
            ps = soup.find_all("p")
            links = soup.find_all("a")
            for sublink in links:
                href = sublink.get("href")
                if href and href.startswith("/bd"):
                    href1 = 'https://www.businessdailyafrica.com'+href
                    internal_links.append(href1)

            # Extract the text of each paragraph and write every word to the CSV
            for p in ps:
                print(link, p)  # optional progress output: the page URL and paragraph tag
                paragraph = p.text.strip()
                for word in paragraph.split():
                    p_array.append(word)
                    writer.writerow({'words': word})

# Print the number of visited links
print(f'{len(visited_links)} links visited.')


That's it! We have created a simple web crawler with Beautiful Soup and requests.
