Pranjol Das

Posted on Jun 11, 2024

Building a Web Scraping Tool with Python: Extracting News Headlines

#python #tutorial #beautifulsoup #opensource

Introduction

Web scraping allows us to automatically extract data from websites. In this tutorial, we'll use Python along with the requests and beautifulsoup4 libraries to build a web scraping tool. Our goal is to fetch news headlines from the BBC News website.

Prerequisites

Before we start, ensure you have the following:

Basic understanding of Python programming.
Python 3 installed on your machine (the script uses f-strings, which need Python 3.6 or higher; a currently supported release is the safer choice).
Familiarity with HTML and CSS basics (helpful but not required).

Step 1: Setting Up Your Environment

Installing Libraries

First, let's install the necessary Python libraries. Open your terminal and run the following command:

pip install requests beautifulsoup4

requests makes the HTTP requests that fetch web pages, and beautifulsoup4 parses the returned HTML so we can extract data from it.

Step 2: Writing the Web Scraping Script

Fetching HTML Content

Now, let's create a Python script named scraper.py. Open your favorite code editor and start by importing the required libraries:

import requests
from bs4 import BeautifulSoup

Next, define the URL of the BBC News website we want to scrape:

url = 'https://www.bbc.com/news'

Function to Fetch HTML Content

We'll create a function fetch_html to fetch the HTML content from a given URL using requests:

def fetch_html(url):
    try:
        # A timeout stops the script from hanging if the server never responds
        response = requests.get(url, timeout=10)
        response.raise_for_status()  # Raise an HTTPError for 4xx/5xx responses
        return response.text
    except requests.exceptions.RequestException as e:
        print(f"Error fetching HTML: {e}")
        return None

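This function sends a GET request to the URL and returns the HTML content if successful. If the request fails (for example, a connection error or an HTTP error status raised by raise_for_status()), it prints the error and returns None so the caller can decide what to do.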

Function to Scrape Website for News Headlines

Now, let's define a function scrape_website to parse the HTML and extract news headlines using BeautifulSoup:

def scrape_website(url):
    html = fetch_html(url)
    if html:
        soup = BeautifulSoup(html, 'html.parser')
        headlines = soup.find_all('h3', class_='gs-c-promo-heading__title')
        for headline in headlines:
            title = headline.text.strip()
            print(title)
    else:
        print("Failed to fetch HTML.")

Here's what this function does:

It calls fetch_html(url) to get the HTML content of the BBC News page.
If the HTML content is retrieved (if html:), it uses BeautifulSoup to parse the HTML (soup = BeautifulSoup(html, 'html.parser')).
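It then finds all <h3> elements with the class gs-c-promo-heading__title, the selector this tutorial uses for BBC News headlines. Class names like this are tied to the site's current markup and can change without notice, so if the script prints nothing, inspect the page in your browser's developer tools and update the tag and class to match.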
For each headline found (for headline in headlines:), it extracts the text (headline.text.strip()) and prints it.
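To check the selector logic without hitting the network, it can help to run it against a small inline HTML string. The sketch below is only an illustration: SAMPLE_HTML is made-up markup written for this example (not copied from BBC News), and extract_headlines is a hypothetical helper that returns the titles as a list instead of printing them.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup written for this example; it only mimics the
# structure the tutorial's selector expects.
SAMPLE_HTML = """
<div>
  <h3 class="gs-c-promo-heading__title">  First headline </h3>
  <h3 class="other-class">Not a headline</h3>
  <h3 class="gs-c-promo-heading__title extra">Second headline</h3>
</div>
"""


def extract_headlines(html, tag="h3", css_class="gs-c-promo-heading__title"):
    """Return the stripped text of every matching element as a list."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any element whose class list contains css_class
    return [el.get_text(strip=True) for el in soup.find_all(tag, class_=css_class)]


if __name__ == "__main__":
    for title in extract_headlines(SAMPLE_HTML):
        print(title)
```

Returning a list rather than printing makes the parsing step easy to test and reuse, and the tag and class parameters can be changed if the site's markup differs from what the tutorial assumes.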

Running the Script

To execute the scraping script, add the following code at the end of scraper.py:

if __name__ == "__main__":
    scrape_website(url)

This runs the scrape_website function when you execute python scraper.py in your terminal, but not when scraper.py is imported as a module from another script.

Step 3: Handling Data and Output

Storing Data

To store the extracted headlines in a structured format (e.g., CSV or JSON), modify the scrape_website function to collect the titles in a list and return it instead of printing each one, then write that list to a file with Python's built-in csv or json module.
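One way this could look, as a minimal sketch: save_headlines is a hypothetical helper (it is not part of the original script), the file names are placeholders, and it assumes the headlines have already been collected into a list of strings.

```python
import csv
import json


def save_headlines(headlines, csv_path="headlines.csv", json_path="headlines.json"):
    """Write a list of headline strings to a CSV file and a JSON file."""
    # newline="" lets the csv module control line endings itself
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["headline"])  # header row
        writer.writerows([h] for h in headlines)  # one headline per row

    with open(json_path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII characters readable in the file
        json.dump(headlines, f, ensure_ascii=False, indent=2)


if __name__ == "__main__":
    save_headlines(["Example headline one", "Example headline two"])
```

The csv module quotes values that contain commas, so headlines with punctuation are written safely.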

Advanced Scraping Techniques

For more advanced scraping tasks, you might explore:

Handling pagination (navigating through multiple pages of results).
Dealing with dynamic content (using tools like Selenium for JavaScript-heavy websites).
Implementing rate limiting to avoid overwhelming the target website's servers.
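As a rough illustration of the last point, the simplest form of rate limiting is a fixed pause between requests. In the sketch below, fetch_all_politely is a hypothetical helper, the two-second default delay is an arbitrary assumption rather than a recommended value, and the fetch argument stands in for any function that takes a URL, such as fetch_html from this tutorial.

```python
import time


def fetch_all_politely(urls, fetch, delay_seconds=2.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing delay_seconds between calls.

    The sleep parameter exists so the pause can be replaced in tests.
    """
    results = {}
    for index, page_url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # pause before every request except the first
        results[page_url] = fetch(page_url)
    return results


if __name__ == "__main__":
    # Stand-in fetch function so the example runs without network access
    pages = fetch_all_politely(["page-1", "page-2"], lambda u: f"<html>{u}</html>",
                               delay_seconds=0.1)
    print(list(pages))
```

In the scraper itself, a call such as fetch_all_politely(list_of_urls, fetch_html) would space out the requests; a fixed delay is the simplest option, and more elaborate schemes (backoff, per-host limits) are possible.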

Conclusion

Congratulations! You've built a web scraping tool with Python to extract news headlines from the BBC News website. Web scraping opens up possibilities for automating data collection tasks. Always scrape responsibly and respect the website's terms of service.

GitHub Repository

I've uploaded the complete code to GitHub. You can view it here.
