DEV Community

Scrape YouTube Video Details Efficiently with Python

YouTube boasts a massive user base, creating an immense pool of content—videos, comments, channels—ripe for analysis. However, scraping this treasure trove isn't as simple as clicking "play." YouTube’s dynamic content and sophisticated anti-bot defenses are designed to prevent automated scraping. So, how can you get past these hurdles?
In this guide, I’ll show you how to scrape YouTube video data using Python, Playwright, and lxml. No fluff—just real, actionable steps to help you extract valuable information efficiently and ethically.

Step 1: Initializing Your Environment

Before diving into the code, let's get everything set up.
You’ll need these tools:
1. Playwright: This library automates headless browsers like Chromium, enabling you to interact with web pages just like a human.
2. lxml: A Python library for parsing HTML/XML, perfect for scraping web data with speed and precision.
3. CSV module: A built-in Python library to save the data you scrape into a CSV for easy analysis.
Install the libraries:
First, use pip to install Playwright and lxml:

pip install playwright
pip install lxml
Enter fullscreen mode Exit fullscreen mode

Then, install the necessary browser binaries for Playwright:

playwright install
Enter fullscreen mode Exit fullscreen mode

Or, if you only need Chromium:

playwright install chromium
Enter fullscreen mode Exit fullscreen mode

Step 2: Importing Libraries for the Task

Once you’ve got everything installed, import the libraries that will power your script:

import asyncio
from playwright.async_api import Playwright, async_playwright
from lxml import html
import csv
Enter fullscreen mode Exit fullscreen mode

Step 3: Controlling the Browser with Playwright

Now we get into the fun part: controlling the browser. Playwright allows you to control a browser programmatically. You’ll navigate to the YouTube video, let it load, and even scroll down to load more content. Here’s how:

browser = await playwright.chromium.launch(headless=True)
context = await browser.new_context()
page = await context.new_page()

# Navigate to the YouTube video
await page.goto("https://www.youtube.com/watch?v=Ct8Gxo8StBU", wait_until="networkidle")

# Scroll down to load more comments
for _ in range(20):
    await page.mouse.wheel(0, 200)
    await asyncio.sleep(0.2)

# Let some content load
await page.wait_for_timeout(1000)
Enter fullscreen mode Exit fullscreen mode

Step 4: Handling HTML Content Parsing

Once the page is loaded, we’ll extract its HTML and parse it using lxml. This allows us to easily pull out the data we want.

page_content = await page.content()
parser = html.fromstring(page_content)
Enter fullscreen mode Exit fullscreen mode

Step 5: Pulling the Data

Here's where you get to pull out all the juicy details: the video title, channel name, number of views, comments, and more. Use XPath to grab the information you need:

title = parser.xpath('//div[@id="title"]/h1/yt-formatted-string/text()')[0]
channel = parser.xpath('//yt-formatted-string[@id="text"]/a/text()')[0]
channel_link = 'https://www.youtube.com' + parser.xpath('//yt-formatted-string[@id="text"]/a/@href')[0]
posted = parser.xpath('//yt-formatted-string[@id="info"]/span/text()')[2]
total_views = parser.xpath('//yt-formatted-string[@id="info"]/span/text()')[0]
total_comments = parser.xpath('//h2[@id="count"]/yt-formatted-string/span/text()')[0]
comments_list = parser.xpath('//yt-attributed-string[@id="content-text"]/span/text()')
Enter fullscreen mode Exit fullscreen mode

Step 6: Outputting the Data

Now that you’ve got the data, it’s time to store it. We’ll save it to a CSV file for later analysis. Here’s how:

with open('youtube_video_data.csv', 'w', newline='', encoding='utf-8') as file:
    writer = csv.writer(file)
    writer.writerow(["Title", "Channel", "Channel Link", "Posted", "Total Views", "Total Comments", "Comments"])
    writer.writerow([title, channel, channel_link, posted, total_views, total_comments, ", ".join(comments_list)])
Enter fullscreen mode Exit fullscreen mode

Step 7: Proxies—How to Keep Your Scraping Under the Radar

When scraping at scale, proxies are essential. YouTube can quickly block your IP if you make too many requests. So, how do you get around this?
1. Proxy Setup: Playwright allows you to use proxies easily by adding a proxy parameter when launching the browser.

browser = await playwright.chromium.launch(
    headless=True,
    proxy={"server": "http://your_proxy_ip:port", "username": "your_username", "password": "your_password"}
)
Enter fullscreen mode Exit fullscreen mode

2. The Need for Proxies
Hide IP: Proxies hide your real IP, lowering the chances of getting blocked.
Request Handling: Rotating proxies distribute your requests, making them look like they’re coming from different users.
Bypass Regional Restrictions: Some content is only available in certain regions. Proxies can help you access it.
Proxies make it harder for YouTube to flag your activities, but use them responsibly. Don’t overdo it.

Complete Coding Example

Now that you know the steps, here’s the full implementation in one go:

import asyncio
from playwright.async_api import Playwright, async_playwright
from lxml import html
import csv

# Main function to scrape the YouTube video data
async def run(playwright: Playwright) -> None:
    # Launch the browser with proxy settings
    browser = await playwright.chromium.launch(
        headless=True,
        proxy={"server": "http://your_proxy_ip:port", "username": "your_username", "password": "your_password"}
    )
    context = await browser.new_context()
    page = await context.new_page()

    # Navigate to the YouTube video URL
    await page.goto("https://www.youtube.com/watch?v=Ct8Gxo8StBU", wait_until="networkidle")

    # Scroll to load more comments
    for _ in range(20):
        await page.mouse.wheel(0, 200)
        await asyncio.sleep(0.2)

    # Wait for additional content to load
    await page.wait_for_timeout(1000)

    # Get page content
    page_content = await page.content()

    # Close the browser
    await context.close()
    await browser.close()

    # Parse the HTML
    parser = html.fromstring(page_content)

    # Extract data
    title = parser.xpath('//div[@id="title"]/h1/yt-formatted-string/text()')[0]
    channel = parser.xpath('//yt-formatted-string[@id="text"]/a/text()')[0]
    channel_link = 'https://www.youtube.com' + parser.xpath('//yt-formatted-string[@id="text"]/a/@href')[0]
    posted = parser.xpath('//yt-formatted-string[@id="info"]/span/text()')[2]
    total_views = parser.xpath('//yt-formatted-string[@id="info"]/span/text()')[0]
    total_comments = parser.xpath('//h2[@id="count"]/yt-formatted-string/span/text()')[0]
    comments_list = parser.xpath('//yt-attributed-string[@id="content-text"]/span/text()')

    # Save the data to a CSV file
    with open('youtube_video_data.csv', 'w', newline='', encoding='utf-8') as file:
        writer = csv.writer(file)
        writer.writerow(["Title", "Channel", "Channel Link", "Posted", "Total Views", "Total Comments", "Comments"])
        writer.writerow([title, channel, channel_link, posted, total_views, total_comments, ", ".join(comments_list)])

# Running the async function
async def main():
    async with async_playwright() as playwright:
        await run(playwright)

asyncio.run(main())
Enter fullscreen mode Exit fullscreen mode

Pro Tips for Proxy Selection

Residential Proxies: These are harder to detect and usually offer more anonymity. They're ideal for large-scale scraping.
Static ISP Proxies: Fast and reliable, great for high-speed requests without interruptions.
While scraping YouTube data is powerful, it’s essential to follow ethical standards. Respect YouTube's terms of service. Avoid overwhelming their servers and always consider the impact of your actions.

Wrapping It Up
You’ve now got the tools to scrape YouTube efficiently and effectively. With Playwright, lxml, and the right proxy setup, you're ready to extract valuable insights from the platform. Just make sure to scrape responsibly, and you’ll have a solid, scalable scraping setup in no time.

Top comments (0)