What will be scraped
Full Code
If you don't need an explanation, have a look at the full code example in the online IDE.
import re, json, time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from parsel import Selector


def scrape_youtube_autocomplete():
    # download a matching ChromeDriver binary and let Service start/stop it
    service = Service(ChromeDriverManager().install())

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument('--lang=en')
    options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")

    queries = ["lord of the rings ", "fus ro dah ", "harry potter "]
    youtube_autocomplete = []

    for query in queries:
        driver = webdriver.Chrome(service=service, options=options)
        driver.get("https://www.youtube.com/")

        # wait up to 10 seconds for the page body to become visible
        WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, 'body')))

        search_input = driver.find_element(By.XPATH, '//input[@id="search"]')
        search_input.click()
        search_input.send_keys(query)

        # give the autocomplete suggestions time to appear
        time.sleep(1)

        selector = Selector(driver.page_source)

        # https://regex101.com/r/zZb3X0/1
        autocomplete_results = [
            re.search(r'">(.*)</b>', result).group(1).replace("<b>", "")
            for result in selector.css('.sbqs_c').getall()
        ]

        youtube_autocomplete.append({
            "query": query.strip(),
            "autocomplete_results": autocomplete_results
        })

        driver.quit()

    print(json.dumps(youtube_autocomplete, indent=2, ensure_ascii=False))


scrape_youtube_autocomplete()
Preparation
Install libraries:
pip install parsel selenium webdriver-manager
Basic knowledge of scraping with CSS selectors
CSS selectors declare which part of the markup a style applies to, which makes it possible to extract data from matching tags and attributes.
If you haven't scraped with CSS selectors, there's a dedicated blog post of mine about how to use CSS selectors when web-scraping that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.
Reduce the chance of being blocked
Make sure you're sending a user-agent request header so that the request looks like a "real" user visit. The default requests user-agent is python-requests, so websites understand that it's most likely a script that sends the request. Check what's your user-agent.
There's a blog post on how to reduce the chance of being blocked while web scraping that can get you familiar with basic and more advanced approaches.
Scrape without code
You can also get the data from the following URL, which outputs a .txt file:
https://clients1.google.com/complete/search?client=youtube&hl=en&q=minecraft
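If you'd rather work with that URL from Python, a minimal sketch could look like the one below. It only builds the URL and parses a response body, without sending a request. The helper names build_autocomplete_url and parse_suggestions are hypothetical, and the shape of the sample body (a JSONP-style call wrapping a list whose second item holds the suggestions) is an assumption that should be checked against a real response.

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://clients1.google.com/complete/search"


def build_autocomplete_url(query, lang="en"):
    """Build the suggestion URL shown above for an arbitrary query."""
    params = {"client": "youtube", "hl": lang, "q": query}
    return f"{BASE_URL}?{urlencode(params)}"


def parse_suggestions(payload):
    """Extract suggestion strings from a JSONP-style response body.

    Assumption: the body looks like callback([query, [[suggestion, ...], ...], ...]).
    """
    start = payload.index("(") + 1   # first character after the opening parenthesis
    end = payload.rindex(")")        # position of the closing parenthesis
    data = json.loads(payload[start:end])
    return [item[0] for item in data[1]]


# Hypothetical response body, shortened for illustration.
sample = 'window.google.ac.h(["minecraft",[["minecraft",0,[512]],["minecraft song",0,[512]]],{"k":1}])'

print(build_autocomplete_url("minecraft"))
print(parse_suggestions(sample))
```

The body of a real response could be downloaded with urllib.request.urlopen or any HTTP client and then passed to parse_suggestions.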
Code Explanation
Import libraries:
import re, json, time
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from parsel import Selector
Library | Purpose |
---|---|
re | to extract parts of the data via regular expression. |
json | to convert extracted data to a JSON string. |
time | to work with time in Python. |
webdriver | to drive a browser natively, as a user would, either locally or on a remote machine using the Selenium server. |
Service | to manage the starting and stopping of the ChromeDriver. |
ChromeDriverManager | to download a ChromeDriver binary that matches the installed browser. |
By | a set of supported locator strategies (By.ID, By.TAG_NAME, By.XPATH etc). |
WebDriverWait | to wait only as long as required. |
expected_conditions | contains a set of predefined conditions to use with WebDriverWait. |
Selector | XML/HTML parser that has full XPath and CSS selectors support. |
The algorithm for getting autocomplete results is as follows:
1. Go to the main page of YouTube.
2. Click on the search field.
3. Enter a query there.
4. Scrape suggested autocomplete results.
5. Repeat from step 1 until all queries are processed.
To simulate user actions in the browser, let's use the selenium library. This will help complete steps 1, 2 and 3. For selenium to work with Chrome, you need ChromeDriver, which can be downloaded manually or from code. In our case, the second method is used: ChromeDriverManager().install() downloads a matching driver binary under the hood and returns its path, and Service uses that path to control the start and stop of ChromeDriver:
service = Service(ChromeDriverManager().install())
You should also add options for the browser to work correctly:
options = webdriver.ChromeOptions()
options.add_argument("--headless")
options.add_argument('--lang=en')
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36")
Chrome options | Explanation |
---|---|
--headless | to run Chrome in headless mode. |
--lang=en | to set the browser language to English. |
user-agent | to act as a "real" user request from the browser by passing it to request headers. Check what's your user-agent. |
Use the queries list, which holds a few queries, and pass each one through a for loop. Also, create the youtube_autocomplete list that will store the extracted data.
queries = ["lord of the rings ", "fus ro dah ", "harry potter "]
youtube_autocomplete = []
for query in queries:
    # the following code will be here
Now we can start the webdriver and pass the URL to the get() method.
driver = webdriver.Chrome(service=service, options=options)
driver.get("https://www.youtube.com/")
It is hard to predict how long a page will take to load: it depends on the speed of the Internet connection, the power of the computer and other factors. An explicit wait is much better than a fixed delay in seconds, since it ends as soon as the condition is met, here as soon as the body element is visible:
WebDriverWait(driver, 10).until(EC.visibility_of_element_located((By.TAG_NAME, 'body')))
📌Note: The second argument of WebDriverWait is a timeout in seconds. In this case, we give the page up to 10 seconds to load; if it loads earlier, the wait ends right away.
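To make the "wait only as long as required" idea concrete, here is a small standard-library sketch of a polling wait. It is not Selenium's implementation, only an illustration of the same principle: check a condition repeatedly, return as soon as it holds, and give up after a timeout. The names wait_until and page_ready are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, poll_frequency=0.5):
    """Call condition() repeatedly until it is truthy or timeout seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # stop waiting as soon as the condition holds
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(poll_frequency)


# Hypothetical condition that becomes true on the third check.
checks = {"count": 0}


def page_ready():
    checks["count"] += 1
    return checks["count"] >= 3


start = time.monotonic()
wait_until(page_ready, timeout=10, poll_frequency=0.1)
elapsed = time.monotonic() - start

print(f"ready after {checks['count']} checks in {elapsed:.1f} seconds")
```

Even though the timeout is 10 seconds, the call returns after a fraction of a second because the condition is met on the third check.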
When the page has loaded, we need to find the search field. Selenium can locate an element by XPath. To enter a search query, call the click() method on the search_input element to focus it, then type the text with the send_keys() method. After that, wait a moment with the sleep() method so the autocomplete results have time to load.
search_input = driver.find_element(By.XPATH, '//input[@id="search"]')
search_input.click()
search_input.send_keys(query)
time.sleep(1)
To extract suggested queries, we can use the Parsel library, passing it the HTML of the page (driver.page_source) that was received earlier.
Parsing with parsel is much faster than querying elements through Selenium, both because of the engine itself and because there is no network component anymore: no real-time interaction with the page and its elements, only HTML parsing.
selector = Selector(driver.page_source)

# https://regex101.com/r/zZb3X0/1
autocomplete_results = [
    re.search(r'">(.*)</b>', result).group(1).replace("<b>", "")
    for result in selector.css('.sbqs_c').getall()
]

youtube_autocomplete.append({
    "query": query.strip(),
    "autocomplete_results": autocomplete_results
})

driver.quit()
Code | Explanation |
---|---|
autocomplete_results | a temporary list that holds the data extracted for the current query. |
css() | to access elements by the passed selector. |
getall() | to extract the HTML of all matching elements as a list of strings. |
search() | to search for a pattern in a string and return the corresponding match object. |
group() | to extract the found element from the match object. |
replace() | to replace all occurrences of a substring with another one; here it removes the leftover <b> tag. |
youtube_autocomplete.append({}) | to append extracted data to a list as a dictionary. |
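To see what search(), group() and replace() do together, here is a small standalone sketch that runs the same pattern on one suggestion element. The sample markup is hypothetical and the real YouTube HTML may differ. The helper extract_suggestion is also hypothetical; unlike the list comprehension above, it returns None instead of raising an error when the pattern does not match.

```python
import re

# Hypothetical markup for one suggestion; the real YouTube HTML may differ.
sample_html = '<div class="sbqs_c">lord of the rings <b>soundtrack</b></div>'

PATTERN = re.compile(r'">(.*)</b>')


def extract_suggestion(html):
    """Return the suggestion text, or None when the pattern does not match."""
    match = PATTERN.search(html)
    if match is None:
        return None
    # group(1) is everything between '">' and '</b>', still containing '<b>'
    return match.group(1).replace("<b>", "")


print(extract_suggestion(sample_html))
print(extract_suggestion('<div class="sbqs_c">no bold part</div>'))
```

For the first string, group(1) captures the text between the end of the opening tag and the closing b tag, and replace() then strips the opening b tag that is left inside it. The second string has no closing b tag, so the pattern does not match and the helper returns None.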
In the gif below, I demonstrate how this function works:
Output
[
{
"query": "lord of the rings",
"autocomplete_results": [
"lord of the rings amazon trailer",
"lord of the rings soundtrack",
"lord of the rings trailer",
"lord of the rings amazon",
"lord of the rings rings of power",
"lord of the rings reaction",
"lord of the rings music",
"lord of the rings audiobook",
"lord of the rings ambience",
"lord of the rings theme",
"lord of the rings ost",
"lord of the rings full movie",
"lord of the rings online",
"lord of the rings rings of power trailer"
]
},
{
"query": "fus ro dah",
"autocomplete_results": [
"fus ro dah sound effect",
"fus ro dah song",
"fus ro dah skyrim sound effect",
"fus ro dah sound",
"fus ro dah reaction",
"fus ro dah meme",
"fus ro dah all races",
"fus ro dah lyrics",
"fus ro dah misheard lyrics",
"fus ro dah shout",
"fus ro dah remix",
"fus ro dah anime",
"fus ro dah earrape",
"fus ro dah trailer"
]
},
{
"query": "harry potter",
"autocomplete_results": [
"harry potter music",
"harry potter intro",
"harry potter game",
"harry potter audiobook",
"harry potter react to harry as",
"harry potter shittyflute",
"harry potter leviosa",
"harry potter full movie",
"harry potter theme",
"harry potter soundtrack",
"harry potter and the cursed child",
"harry potter kalimba",
"harry potter piano",
"harry potter ambience"
]
}
]
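The output is printed with json.dumps(), using indent=2 and ensure_ascii=False. The sketch below shows what ensure_ascii changes; the data is hypothetical and only mimics the shape of youtube_autocomplete.

```python
import json

# Hypothetical data in the same shape as youtube_autocomplete.
data = [{"query": "pokémon", "autocomplete_results": ["pokémon theme song"]}]

# Default settings: non-ASCII characters are written as \uXXXX escapes.
escaped = json.dumps(data)

# Settings used in this post: pretty-printed, non-ASCII characters kept as-is.
readable = json.dumps(data, indent=2, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode back to the same Python objects.
assert json.loads(escaped) == json.loads(readable) == data
```

With ensure_ascii=False, queries and suggestions in other languages stay readable in the printed output.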