Scraping vital YouTube data can be a game-changer for creators, marketers, and data analysts. Whether you're analyzing comments to understand audience sentiment, checking video engagement metrics, or comparing performance against competitors, automating the process can save hours of manual work.
In this guide, I’ll walk you through how to automate YouTube data collection using Python, so you can get to your insights faster and without the hassle.
Why Scrape Vital YouTube Data
YouTube is an ocean of information. Each video tells a story, from views and likes to comments and channel statistics. However, manually checking these stats for dozens or hundreds of videos? A total pain. That's where scraping comes in. By automating the process, you can analyze vast amounts of data in a fraction of the time. Plus, with the right setup, you can do this safely while avoiding the platform's restrictions.
Step 1: Setting Up the Environment
Before anything, we need to install the right tools. To get started, use the following command:
pip install selenium-wire selenium blinker==1.7.0
This installs selenium-wire (which lets us route traffic through authenticated proxies) and selenium (the core tool for web automation); blinker is pinned to 1.7.0 because newer blinker releases aren't compatible with selenium-wire.
Step 2: Importing Libraries
Once the packages are installed, we need to import the necessary libraries. Here's a rundown:
from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver as wiredriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import json
import time
These imports will help us interact with the YouTube page, perform clicks, gather information, and process the data.
Step 3: Proxy Setup
YouTube has strict policies on scraping, so we need to mask our IP with a proxy to avoid being flagged. Here’s how to do it:
proxy_address = "your.proxy.server:port"
proxy_username = "your_username"
proxy_password = "your_password"

# Chrome itself has no flag for proxy authentication, so the credentials
# are passed through selenium-wire's proxy options instead.
chrome_options = Options()
seleniumwire_options = {
    'proxy': {
        'http': f'http://{proxy_username}:{proxy_password}@{proxy_address}',
        'https': f'https://{proxy_username}:{proxy_password}@{proxy_address}',
    }
}
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=seleniumwire_options)
This setup routes all of the browser's traffic through the proxy, which keeps your own IP out of sight and reduces the chance of being flagged or rate-limited.
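Before pointing the driver at YouTube, it's worth confirming the proxy is actually in use. A minimal sanity check, assuming an IP-echo endpoint such as httpbin.org/ip (any similar service works):

# The IP printed here should be the proxy's address, not your own.
driver.get("https://httpbin.org/ip")
print(driver.find_element(By.TAG_NAME, "body").text)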
Step 4: Scraping the Data
Next, let’s write a function that scrapes the video data. Here’s a breakdown of what happens:
1. Navigate to the YouTube page: we call driver.get() with the video’s URL before running the extraction function (see the usage sketch after the function).
2. Click to reveal more content: Some information is hidden under a “Show More” button, so we need to simulate a click to expand it.
3. Scroll to load more comments: comments load dynamically, so we scroll down a couple of times to pull in the first batches.
Here’s the function:
def extract_information() -> dict:
    try:
        # Wait for the "Show More" button and click it to expand the description
        element = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.XPATH, '//*[@id="expand"]'))
        )
        element.click()
        time.sleep(10)  # Give the expanded description time to load

        actions = ActionChains(driver)
        # Scroll to load more comments
        actions.send_keys(Keys.END).perform()
        time.sleep(10)
        actions.send_keys(Keys.END).perform()
        time.sleep(10)

        # Now, gather the data
        video_title = driver.find_elements(By.XPATH, '//*[@id="title"]/h1')[0].text
        owner = driver.find_elements(By.XPATH, '//*[@id="text"]/a')[0].text
        total_number_of_subscribers = driver.find_elements(By.XPATH, "//div[@id='upload-info']//yt-formatted-string[@id='owner-sub-count']")[0].text
        video_description = ''.join([i.text for i in driver.find_elements(By.XPATH, '//*[@id="description-inline-expander"]/yt-attributed-string/span/span')])
        publish_date = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[2].text
        total_views = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[0].text
        number_of_likes = driver.find_elements(By.XPATH, '//*[@id="top-level-buttons-computed"]/segmented-like-dislike-button-view-model/yt-smartination/div/div/like-button-view-model/toggle-button-view-model/button-view-model/button/div')[1].text

        # Scrape comments
        comment_names = driver.find_elements(By.XPATH, '//*[@id="author-text"]/span')
        comment_content = driver.find_elements(By.XPATH, '//*[@id="content-text"]/span')
        comment_library = []
        for each in range(len(comment_names)):
            name = comment_names[each].text
            content = comment_content[each].text
            comment_library.append({'name': name, 'comment': content})

        # Bundle up the data
        data = {
            'owner': owner,
            'subscribers': total_number_of_subscribers,
            'video_title': video_title,
            'description': video_description,
            'date': publish_date,
            'views': total_views,
            'likes': number_of_likes,
            'comments': comment_library
        }
        return data
    except Exception as err:
        print(f"Error: {err}")
        return {}
Step 5: Output Data to a JSON File
Now that we have the data, let's save it into a file. Here's how you can do that:
def organize_write_data(data: dict):
    # ensure_ascii=False keeps emoji and non-Latin characters intact in the output
    output = json.dumps(data, indent=2, ensure_ascii=False)
    try:
        with open("output.json", 'w', encoding='utf-8') as file:
            file.write(output)
    except Exception as err:
        print(f"Error encountered: {err}")
Final Thoughts
By using this Python script, you’re now able to scrape detailed insights from YouTube videos at scale. Whether you’re gathering data for analysis or comparing video performance, this process automates it all, saving you time and effort. Plus, by using proxies, you can scrape responsibly, minimizing your chances of being blocked.