DEV Community

loading...
Cover image for Scrape YouTube Search with Python (part 1)

Scrape YouTube Search with Python (part 1)

Dimitry Zub
Python Web Scraping
Updated on ・6 min read

Contents: intro, imports, video search, fuckit module, ad, channel results, links, outro.

Intro

This blog post will show how to scrape YouTube organic search, ad and channel results.

Each section will be represented with the screenshot that will show which part is being scraped.

I decided to use not the fastest solution Selenium but I wanted to scrape everything to the bottom of the search results page, which could be done by calling DOM directly like so:

driver.execute_script("var scrollingElement = (document.scrollingElement || document.body);scrollingElement.scrollTop = scrollingElement.scrollHeight;")
# https://stackoverflow.com/a/57076690/15164646 (contains several references for a better understanding)
Enter fullscreen mode Exit fullscreen mode

Imports

from selenium import webdriver
from serpapi import GoogleSearch
import json, time # this two could be skipped (prettier output/time buffer)
Enter fullscreen mode Exit fullscreen mode

Video Search Results

image

from selenium import webdriver
import json, time


def get_video_results():
    driver = webdriver.Chrome()
    driver.get('https://www.youtube.com/results?search_query=minecraft')

    # scrolling to the end of the page
    # https://stackoverflow.com/a/57076690/15164646
    while True:
        # end_result = "No more results" string at the bottom of the page
        # this will be used to break out of the while loop
        end_result = driver.find_element_by_css_selector('#message').is_displayed()
        driver.execute_script("var scrollingElement = (document.scrollingElement || document.body);scrollingElement.scrollTop = scrollingElement.scrollHeight;")
        # time.sleep(1) # could be removed

        youtube_data = []

        for result in driver.find_elements_by_css_selector('.text-wrapper.style-scope.ytd-video-renderer'):
            title = result.find_element_by_css_selector('.title-and-badge.style-scope.ytd-video-renderer').text
            link = result.find_element_by_css_selector('.title-and-badge.style-scope.ytd-video-renderer a').get_attribute('href')
            channel_name = result.find_element_by_css_selector('.long-byline').text
            channel_link = result.find_element_by_css_selector('#text > a').get_attribute('href')
            views = result.find_element_by_css_selector('.style-scope ytd-video-meta-block').text.split('\n')[0]

            try:
                time_published = result.find_element_by_css_selector('.style-scope ytd-video-meta-block').text.split('\n')[1]
            except:
                time_published = None

            try:
                snippet = result.find_element_by_css_selector('.metadata-snippet-container').text
            except:
                snippet = None

            try:
                if result.find_element_by_css_selector('#channel-name .ytd-badge-supported-renderer') is not None:
                    verified_badge = True
                else:
                    verified_badge = False
            except:
                verified_badge = None

            try:
                extensions = result.find_element_by_css_selector('#badges .ytd-badge-supported-renderer').text
            except:
                extensions = None

            youtube_data.append({
                'title': title,
                'link': link,
                'channel': {'channel_name': channel_name, 'channel_link': channel_link},
                'views': views,
                'time_published': time_published,
                'snippet': snippet,
                'verified_badge': verified_badge,
                'extensions': extensions,
            })

        print(json.dumps(youtube_data, indent=2, ensure_ascii=False))

        # once element is located, break out of the loop
        if end_result == True:
            break
    driver.quit()

get_video_results()


# part of the output:
'''
[
  {
    "title": "I Survived 100 Days in Ancient Greece on Minecraft.. Here's What Happened..",
    "link": "https://www.youtube.com/watch?v=hUAjdnhpTXU",
    "channel": {
      "channel_name": "Forrestbono",
      "channel_link": "https://www.youtube.com/user/ForrestboneMC"
    },
    "views": "2.6M views",
    "time_published": "5 days ago",
    "snippet": "I had to survive for 100 Days of Hardcore Minecraft in Ancient Greece and battle Poseidon, God of the Sea, and Cronos, the God ...",
    "verified_badge": true,
    "extensions": "New"
  }
]
'''
Enter fullscreen mode Exit fullscreen mode

Fuckit module

If you don't like too many try/except blocks, or you say something like: "Some code has an error? Fuck it.", then you can use context manager from fuckit module that will continue to run, skipping the statements that cause errors:

# pip install fuckit
import fuckit

with fuckit:
    title = result.find_element_by_css_selector('.title-and-badge.style-scope.ytd-video-renderer').text
    link = result.find_element_by_css_selector('.title-and-badge.style-scope.ytd-video-renderer a').get_attribute('href')
    channel_name = result.find_element_by_css_selector('.long-byline').text
    channel_link = result.find_element_by_css_selector('#text > a').get_attribute('href')
    views = result.find_element_by_css_selector('.style-scope ytd-video-meta-block').text.split('\n')[0]
    time_published = result.find_element_by_css_selector('.style-scope ytd-video-meta-block').text.split('\n')[1]
    snippet = result.find_element_by_css_selector('.metadata-snippet-container').text
    extensions = result.find_element_by_css_selector('#badges .ytd-badge-supported-renderer').text
Enter fullscreen mode Exit fullscreen mode

Using YouTube Video Results API

SerpApi is paid API with a free trial of 5,000 searches.

from serpapi import GoogleSearch

def get_video_results():
    params = {
      "api_key": "YOUR_API_KEY",
      "engine": "youtube",
      "search_query": "minecraft"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for results in results['video_results']:
        title = results['title']
        link = results['link']
        channel = results['channel']
        try:
            published_date = results['published_date']
        except:
            published_date = None
        try:
            views = results['views']
        except:
            views = None
        try:
            video_length = results['length']
        except:
            video_length = None
        try:
            extensions = results['extensions']
        except:
            extensions = None

        print(f'{title}\n{link}\n{channel}\n{published_date}\n{views}\n{video_length}\n{extensions}\n')

get_video_results()

# part of the output:
'''
I Spent 100 Days in Medieval Times in Minecraft... Here's What Happened
https://www.youtube.com/watch?v=hjV30hf6yEM
{'name': 'Forge Labs', 'link': 'https://www.youtube.com/user/AirsoftXX', 'verified': True, 'thumbnail': 'https://yt3.ggpht.com/ytc/AAUvwnjgpo-Pvk7jrXkd4HFErsnrLr2Nwru5f8TgtWGJ7w=s68-c-k-c0x00ffffff-no-rj'}
6 days ago
10136089
1:49:28
['New']
'''
Enter fullscreen mode Exit fullscreen mode

Ad results

image

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def get_video_ad_results():
    options = Options()
    options.headless = True
    driver = webdriver.Chrome(options=options)
    driver.get('https://www.youtube.com/results?search_query=how to tie a tie')

    for result in driver.find_elements_by_css_selector('.style-scope ytd-search-pyv-renderer'):
        title = result.find_element_by_css_selector('#video-title').text
        channel_name = result.find_element_by_css_selector('#channel-name').text
        channel_link = result.find_element_by_css_selector('#text a').get_attribute('href')
        video_link = result.find_element_by_css_selector('#endpoint').get_attribute('href')
        views = result.find_element_by_css_selector('#metadata-line').text
        desc = result.find_element_by_css_selector('#description-text').text
        print(f'{title}\n{channel_name}\n{channel_link}{video_link}\n{views}\n{desc}\n')

get_video_ad_results()

# output:
'''
How to tie a tie EASY WAY
How to tie a tie
https://www.youtube.com/channel/UC4UuK5vs0b8HhqLDE6ssWOA
https://www.googleadservices.com/pagead/aclk?sa=L&ai=Cw7aOxrTOYIL4LdqJ9u8PgtWd-AWTucasY9mO756NDsCNtwEQASAAYKWWo4b0IoIBF2NhLXB1Yi02MjE5ODExNzQ3MDQ5MzcxoAGP3d7QA6kC1tyMLsuvYz6oAwSqBIgCT9Cnglg6NKBsd-PDBGHllhIo3j6gxAjkcwDoAkp7nHUsJEW7DGH5yhXLGFX1ZUysJkVvRncH4iJh7A9q1X-LRwJD1cSE8ZODyrNzKmP3YswA23bToV2p5yCKzb3SJJw7pZnp6HBJFQy3_bV4ZZbR5YU7txo9LNOqyCzXHB0zKe8HIRgLCYwz8_lQJwdjzYvtEfQn84kRsvGs646kym5AM7AuK7ZkzYZs68dxtuZU4EV64-8mG4_0kuyKt6GXcFHxydZSYSqQUBm5N8WBFmVYqTTX2MZs6uv7JL_T2ilO_GSvWAXSm_TeJcvwdI0zQlPNvqIF-8kBfRZzx5xLfimBFeJF4hdIS_cqkgUMCBIw4qn4ztD9_P5fkgUHCBN4qZauMKAGVYAH2aKhL5AHBKgHhAioB6jSG6gHtgeoB-DPG6gH6dQbqAeMzRuoB7HcG6gH8NkbqAekmrECqAeBxhuoB9XOG6gHq8UbqAfezhuoB5zcG5IIC1hfM3o3UW5lRk9JqAgB0ggFCIBBEAGxCXHg2OWyEZICyAkXyAmPAZgLAboLHggDEAUYBiAGKAEwBUABSABYC2AAaABwAYgBAJgBAdALE7gMAbgT____________AbAUA8AVgYCAQNAVAdgVAYAXAaAXAQ&num=1&cid=CAASFeRoSrmeG6BSAe4hx5xjr7z2wLbhwQ&sig=AOD64_2-SpesMfmcgSQlQ9oXqQ3KeRo52g&adurl=https://www.youtube.com/watch%3Fv%3DX_3z7QneFOI&ctype=21&video_id=X_3z7QneFOI&client=ca-pub-6219811747049371
43K views 
How to tie a tie quick and easy Best tutorial
'''
Enter fullscreen mode Exit fullscreen mode

Using YouTube Ad Results API

from serpapi import GoogleSearch

def get_video_ad_results():
    params = {
      "api_key": "YOUR_API_KEY",
      "engine": "youtube",
      "search_query": "how to tie a tie"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for result in results['ads_results']:
        title = result['title']
        link = result['link']
        channel = result['channel']
        description = result['description']
        print(f'{title}\n{link}\n{channel}\n{description}\n')

get_video_ad_results()

# output:
'''
How to tie a tie EASY WAY
https://www.youtube.com/watch?v=X_3z7QneFOI
{'name': 'How to tie a tie', 'link': 'https://www.youtube.com/channel/UC4UuK5vs0b8HhqLDE6ssWOA'}
How to tie a tie quick and easy Best tutorial
'''
Enter fullscreen mode Exit fullscreen mode

Channel results

image

from selenium import webdriver

def get_channel_results():
    driver = webdriver.Chrome()
    driver.get('https://www.youtube.com/results?search_query=mojang')

    title = driver.find_element_by_css_selector('#info #text').text
    link = driver.find_element_by_css_selector('#main-link').get_attribute('href')
    subs = driver.find_element_by_css_selector('#subscribers').text
    video_count = driver.find_element_by_css_selector('#video-count').text
    desc = driver.find_element_by_css_selector('#description').text
    print(f'{title}\n{link}\n{subs}\n{video_count}\n{desc}')

get_channel_results()

# output:
'''
Minecraft
https://www.youtube.com/user/TeamMojang
7.4M subscribers
542 videos
This is the official YouTube channel of Minecraft. We tell stories about the Minecraft Universe. ESRB Rating: Everyone 10+ with ...
'''
Enter fullscreen mode Exit fullscreen mode

Using YouTube Channel Results API

from serpapi import GoogleSearch

def get_channel_results():
    params = {
      "api_key": "YOUR_API_KEY",
      "engine": "youtube",
      "search_query": "mojang"
    }

    search = GoogleSearch(params)
    results = search.get_dict()

    for result in results['channel_results']:
        title = result['title']
        link = result['link']
        verified = result['verified']
        subs = result['subscribers']
        video_count = result['video_count']
        desc = result['description']
        print(f'{title}\n{link}\n{verified}\n{subs}\n{video_count}\n{desc}\n')

get_channel_results()

# output:
'''
Minecraft
https://www.youtube.com/user/TeamMojang
True
7400000.0
542
This is the official YouTube channel of Minecraft. We tell stories about the Minecraft Universe. ESRB Rating: Everyone 10+ with ...
'''
Enter fullscreen mode Exit fullscreen mode

Links

Code in the online IDE
YouTube Search Engine Results API

Outro

You can also scrape YouTube by using requests-html library where you still have to render the page by calling html.render(), I'm not tested how much quicker it compare to selenium.

Selenium could be also run headless mode in Firefox. If the first solution didn't work for you, check out this or this answer from stackoverflow. Firefox webdriver download.

Or if you're using selenium with Chrome, you can do it like so. Chrome webdriver download.

If you would like to see how to scrape something specific or project made with SerpApi, please, write me a message.

Yours, D.

Discussion (0)