Are you looking to scrape essential data from YouTube video pages, like the title, channel name, publish date, view count, and video URL? In this tutorial, I’ll walk you through creating a Python script to do just that. By the end of this, you’ll be able to scrape data from a list of YouTube URLs and save it to a cleanly formatted CSV file.
Step 1: Setting Up the Environment
To get started, you’ll need to have Python installed on your computer. If you don’t have it already, you can download it from python.org.
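Once installed, you can confirm that Python is available from your terminal (on some systems the command is python3 instead of python):

python --version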
Install Required Libraries
We’ll use requests for making HTTP requests and BeautifulSoup from bs4 for parsing HTML content. Open your terminal and run:
pip install requests beautifulsoup4
Step 2: Writing the Code
We’ll create a script that reads video URLs from a CSV file, extracts video details using BeautifulSoup, and writes the collected data into a new CSV file.
Full Code
Here's the complete Python script:
import requests
from bs4 import BeautifulSoup
import csv


def extract_youtube_data(url):
    """Extracts relevant data from a YouTube video page.

    Args:
        url: The URL of the YouTube video.

    Returns:
        A dictionary containing the extracted data.
    """
    response = requests.get(url, timeout=10)  # timeout avoids hanging on a slow page
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract title
    title_element = soup.find('meta', itemprop='name')
    title = title_element['content'] if title_element else "N/A"

    # Extract channel name
    channel_name_element = soup.find('link', itemprop='name')
    channel_name = channel_name_element['content'] if channel_name_element else "N/A"

    # Extract publish date
    publish_date_element = soup.find('meta', itemprop='datePublished')
    publish_date = publish_date_element['content'] if publish_date_element else "N/A"

    # Extract view count
    view_count_element = soup.find('meta', itemprop='interactionCount')
    view_count = view_count_element['content'] if view_count_element else "N/A"

    return {
        'Title': title,
        'Channel Name': channel_name,
        'Publish Date': publish_date,
        'View Count': view_count,
        'URL': url
    }


def main():
    input_file = 'my_data.csv'  # CSV file with URLs
    output_file = 'youtube_data_output.csv'

    with open(input_file, 'r') as file:
        reader = csv.reader(file)
        urls = [row[0] for row in reader if row]  # skip blank lines

    with open(output_file, 'w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=['Number', 'Title', 'Channel Name', 'Publish Date', 'View Count', 'URL'])
        writer.writeheader()
        for i, url in enumerate(urls, start=1):
            data = extract_youtube_data(url)
            # Add the 'Number' field to the data dictionary
            data['Number'] = i
            writer.writerow(data)
            print(f"Processed: {i}")


if __name__ == '__main__':
    main()
Step 3: Understanding the Code
Let’s break down what each part of the code does.
1. Import Libraries
import requests
from bs4 import BeautifulSoup
import csv
- requests: Used to send HTTP requests to fetch YouTube video pages.
- BeautifulSoup: Parses and extracts data from HTML content.
- csv: Handles reading and writing CSV files.
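To see how these pieces fit together, here's a minimal sketch of the fetch-and-parse pattern the script relies on (the URL is just a placeholder):

import requests
from bs4 import BeautifulSoup

# Placeholder URL purely for illustration
response = requests.get('https://example.com', timeout=10)
soup = BeautifulSoup(response.text, 'html.parser')

# find() returns the first matching tag, or None if nothing matches
meta = soup.find('meta', itemprop='name')
print(meta['content'] if meta else 'N/A')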
2. Function to Extract YouTube Data
def extract_youtube_data(url):
    # Fetch the page content
    response = requests.get(url, timeout=10)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Extract the title
    title_element = soup.find('meta', itemprop='name')
    title = title_element['content'] if title_element else "N/A"

    # Extract the channel name
    channel_name_element = soup.find('link', itemprop='name')
    channel_name = channel_name_element['content'] if channel_name_element else "N/A"

    # Extract the publish date
    publish_date_element = soup.find('meta', itemprop='datePublished')
    publish_date = publish_date_element['content'] if publish_date_element else "N/A"

    # Extract the view count
    view_count_element = soup.find('meta', itemprop='interactionCount')
    view_count = view_count_element['content'] if view_count_element else "N/A"

    # Return all extracted data
    return {
        'Title': title,
        'Channel Name': channel_name,
        'Publish Date': publish_date,
        'View Count': view_count,
        'URL': url
    }
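Each lookup falls back to "N/A" when the tag is missing, so the function always returns a complete dictionary. For example, calling it on a single video (the video ID below is a placeholder) gives you a plain dictionary to inspect:

data = extract_youtube_data('https://www.youtube.com/watch?v=VIDEO_ID')
print(data['Title'], data['Channel Name'])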
3. The Main Function
def main():
    input_file = 'my_data.csv'  # CSV file with URLs
    output_file = 'youtube_data_output.csv'

    # Read URLs from the input CSV
    with open(input_file, 'r') as file:
        reader = csv.reader(file)
        urls = [row[0] for row in reader if row]  # skip blank lines

    # Open the output file for writing
    with open(output_file, 'w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=['Number', 'Title', 'Channel Name', 'Publish Date', 'View Count', 'URL'])
        writer.writeheader()

        # Loop through each URL, extract data, and write to the CSV
        for i, url in enumerate(urls, start=1):
            data = extract_youtube_data(url)
            data['Number'] = i
            writer.writerow(data)
            print(f"Processed: {i}")
Explanation
- Input File: my_data.csv is expected to contain the list of YouTube video URLs.
- Output File: youtube_data_output.csv will store the extracted data (its header row is shown below).
- Progress Indicator: The script prints the number of videos processed so you can keep track.
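Given the fieldnames passed to DictWriter, the first line of youtube_data_output.csv will be this header row, followed by one row per video:

Number,Title,Channel Name,Publish Date,View Count,URL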
Step 4: Preparing Your Input CSV
Create a file named my_data.csv in the same directory as the script. This file should contain one YouTube video URL per line, with no header row (the script treats every line as a URL), like so:
https://www.youtube.com/watch?v=VIDEO_ID1
https://www.youtube.com/watch?v=VIDEO_ID2
...
Step 5: Running the Script
To run the script, open your terminal and navigate to the directory where the script is saved. Then, execute:
python your_script_name.py
The script will fetch data from each URL, extract the relevant details, and write them to youtube_data_output.csv.
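As it runs, you'll see one progress line per video, matching the print call in main():

Processed: 1
Processed: 2
...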
Conclusion
You now have a fully functional script that can scrape data from YouTube videos and save it to a CSV file. This is especially useful for analyzing video details for research, content management, or SEO purposes.
Feel free to extend the script further by adding more data fields or refining the extraction logic. Happy scraping!
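For example, if you also wanted the video description, one possible extension, assuming YouTube's markup exposes it via a meta tag with itemprop='description' (page markup can change at any time, so verify first), would be to add this inside extract_youtube_data:

# Assumption: the description is exposed via itemprop='description';
# check the live page markup before relying on this
description_element = soup.find('meta', itemprop='description')
description = description_element['content'] if description_element else "N/A"

You would then add 'Description' to both the returned dictionary and the fieldnames list in main().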
Have questions or suggestions? Drop a comment below!