Scraping vital YouTube data can be a game-changer for creators, marketers, and data analysts. Whether you're analyzing comments to understand audience sentiment, checking video engagement metrics, or comparing performance against competitors, automating the process can save hours of manual work.
In this guide, I’ll walk you through how to automate YouTube data collection using Python, so you can get to your insights faster and without the hassle.
Why Scrape Vital YouTube Data
YouTube is an ocean of information. Each video tells a story, from views and likes to comments and channel statistics. However, manually checking these stats for dozens or hundreds of videos? A total pain. That's where scraping comes in. By automating the process, you can analyze vast amounts of data in a fraction of the time. Plus, with the right setup, you can do this safely while avoiding the platform's restrictions.
Step 1: Setting Up the Environment
Before anything, we need to install the right tools. To get started, use the following command:
pip install selenium-wire selenium blinker==1.7.0
This installs selenium-wire (which lets us route traffic through authenticated proxies) and selenium (the core tool for web automation); blinker is pinned to 1.7.0 because newer blinker releases aren't compatible with selenium-wire.
Step 2: Importing Libraries
Once the packages are installed, we need to import the necessary libraries. Here's a rundown:
from selenium.webdriver.chrome.options import Options
from seleniumwire import webdriver as wiredriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import json
import time
These imports will help us interact with the YouTube page, perform clicks, gather information, and process the data.
Step 3: Proxy Setup
YouTube has strict policies on scraping, so we need to mask our IP with a proxy to avoid being flagged. Here’s how to do it:
proxy_address = "your.proxy.server:port"
proxy_username = "your_username"
proxy_password = "your_password"

# Chrome itself has no flag for proxy authentication, so the credentials
# are passed through selenium-wire's proxy options instead.
chrome_options = Options()
seleniumwire_options = {
    'proxy': {
        'http': f'http://{proxy_username}:{proxy_password}@{proxy_address}',
        'https': f'https://{proxy_username}:{proxy_password}@{proxy_address}',
    }
}
driver = wiredriver.Chrome(options=chrome_options, seleniumwire_options=seleniumwire_options)
This setup routes all of the browser's traffic through the proxy, which keeps your own IP out of sight and reduces the chance of being flagged or rate-limited.
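Before pointing the driver at YouTube, it's worth confirming the proxy is actually in use. A minimal sanity check, assuming an IP-echo endpoint such as httpbin.org/ip (any similar service works):

# The IP printed here should be the proxy's address, not your own.
driver.get("https://httpbin.org/ip")
print(driver.find_element(By.TAG_NAME, "body").text)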
Step 4: Scraping the Data
Next, let’s write a function that scrapes the video data. Here’s a breakdown of what happens:
1. Navigate to the YouTube page: we call driver.get() with the video’s URL before running the extraction function (see the usage sketch after the function).
2. Click to reveal more content: Some information is hidden under a “Show More” button, so we need to simulate a click to expand it.
3. Scroll to load more comments: comments load dynamically, so we scroll down a couple of times to pull in the first batches.
Here’s the function:
def extract_information() -> dict:
    try:
        # Wait for the "Show More" button and click it to expand the description
        element = WebDriverWait(driver, 15).until(
            EC.presence_of_element_located((By.XPATH, '//*[@id="expand"]'))
        )
        element.click()
        time.sleep(10)  # Give the expanded description time to load

        actions = ActionChains(driver)
        # Scroll to load more comments
        actions.send_keys(Keys.END).perform()
        time.sleep(10)
        actions.send_keys(Keys.END).perform()
        time.sleep(10)

        # Now, gather the data
        video_title = driver.find_elements(By.XPATH, '//*[@id="title"]/h1')[0].text
        owner = driver.find_elements(By.XPATH, '//*[@id="text"]/a')[0].text
        total_number_of_subscribers = driver.find_elements(By.XPATH, "//div[@id='upload-info']//yt-formatted-string[@id='owner-sub-count']")[0].text
        video_description = ''.join([i.text for i in driver.find_elements(By.XPATH, '//*[@id="description-inline-expander"]/yt-attributed-string/span/span')])
        publish_date = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[2].text
        total_views = driver.find_elements(By.XPATH, '//*[@id="info"]/span')[0].text
        number_of_likes = driver.find_elements(By.XPATH, '//*[@id="top-level-buttons-computed"]/segmented-like-dislike-button-view-model/yt-smartination/div/div/like-button-view-model/toggle-button-view-model/button-view-model/button/div')[1].text

        # Scrape comments
        comment_names = driver.find_elements(By.XPATH, '//*[@id="author-text"]/span')
        comment_content = driver.find_elements(By.XPATH, '//*[@id="content-text"]/span')
        comment_library = []
        for each in range(len(comment_names)):
            name = comment_names[each].text
            content = comment_content[each].text
            comment_library.append({'name': name, 'comment': content})

        # Bundle up the data
        data = {
            'owner': owner,
            'subscribers': total_number_of_subscribers,
            'video_title': video_title,
            'description': video_description,
            'date': publish_date,
            'views': total_views,
            'likes': number_of_likes,
            'comments': comment_library
        }
        return data
    except Exception as err:
        print(f"Error: {err}")
        return {}
Step 5: Output Data to a JSON File
Now that we have the data, let's save it into a file. Here's how you can do that:
def organize_write_data(data: dict):
    # ensure_ascii=False keeps emoji and non-Latin characters intact in the output
    output = json.dumps(data, indent=2, ensure_ascii=False)
    try:
        with open("output.json", 'w', encoding='utf-8') as file:
            file.write(output)
    except Exception as err:
        print(f"Error encountered: {err}")
Final Thoughts
By using this Python script, you’re now able to scrape detailed insights from YouTube videos at scale. Whether you’re gathering data for analysis or comparing video performance, this process automates it all, saving you time and effort. Plus, by using proxies, you can scrape responsibly, minimizing your chances of being blocked.