Have you ever written a web scraping program in Python using requests and noticed differences in the content? You might not get exactly the same results as when you open the website in a browser.
This is because some sites use JavaScript to render content on the client side, or the site may be making API calls to the server and rendering that content afterwards. Moreover, requests made with the requests library can be identified as coming from an older browser because of missing headers, which can lead the server to respond with a page built for older browsers.
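If missing headers are the only issue, a partial workaround (before reaching for a webdriver) is to send browser-like headers with the request. A minimal sketch, assuming a placeholder URL and an example User-Agent string:
import requests
url = "https://example.com"  # placeholder URL
# a browser-like User-Agent; the exact string here is just an example
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
res = requests.get(url, headers=headers)
print(res.status_code)
This makes the request look more like a modern browser, but it still won't execute any JavaScript on the page.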
This problem can be easily solved by using webdrivers. The Chrome webdriver can be downloaded from https://chromedriver.chromium.org/ . Make sure to download the driver that matches your Chrome version and put chromedriver.exe in the folder you're running your Python program from, or add it to PATH.
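If you'd rather not touch PATH, you can also point Selenium at the driver explicitly. A minimal sketch, assuming Selenium 4 and a hypothetical driver location:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# hypothetical path to the downloaded driver; adjust for your system
service = Service(executable_path="C:/tools/chromedriver.exe")
driver = webdriver.Chrome(service=service)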
Usually, with requests, the code for web scraping looks something like this:
from bs4 import BeautifulSoup
import requests
url = "https://example.com"  # the page you want to scrape
res = requests.get(url)
res_soup = BeautifulSoup(res.text, 'html.parser')
print(res_soup.prettify())
# print every img tag found in the parsed HTML
for image in res_soup.findAll('img'):
    print(image)
# collect and print the src attribute of each img tag
for image in res_soup.findAll('img'):
    imageSources = image['src']
    print(imageSources)
In the above code we make a GET request to the URL stored in url and then use BeautifulSoup to parse the text of the response into HTML, storing the result in res_soup. We can then look for tags such as the img tag in this parsed document using the findAll() method, which returns all the tags matching the given filter (here, img tags).
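findAll() also accepts filters beyond the tag name, such as attribute filters. A small illustrative sketch (the attribute values here are just examples, not from the code above):
# only img tags that actually have a src attribute
images_with_src = res_soup.findAll('img', src=True)
# only anchor tags with a specific (hypothetical) CSS class
links = res_soup.findAll('a', class_='nav-link')
print(len(images_with_src), len(links))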
Now, using the Selenium Chrome webdriver, the code looks like this:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://example.com"  # the page you want to scrape
driver = webdriver.Chrome()
driver.get(url)
html = driver.execute_script("return document.documentElement.outerHTML")
sel_soup = BeautifulSoup(html, 'html.parser')
print(len(sel_soup.findAll('img')))
images = []
for image in sel_soup.findAll('img'):
    # some sites keep the real image source in data-src instead of src
    try:
        src = image["src"]
    except KeyError:
        src = image["data-src"]
    images.append(src)
First we specify that we are using the Chrome webdriver for Selenium and launch it with driver = webdriver.Chrome(). Then we use the driver's .get() method to open a website. Next, we extract the HTML content of the page by executing some JavaScript in the browser with the driver's .execute_script() method. Then we use BeautifulSoup to parse this text into HTML and use the findAll method to find all the image tags. The notable difference here is that some websites that render content on the client side may use the data-src attribute of the img tag instead of src, and it may hold a data URI containing the base64-encoded image.
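If src or data-src does turn out to hold a base64 data URI, it can be decoded back into image bytes with the standard library. A minimal sketch, continuing from the loop above and assuming src is such a data URI:
import base64
# src comes from the scraping loop above; only decode if it is a data URI
if src.startswith("data:image"):
    header, encoded = src.split(",", 1)
    with open("image.png", "wb") as f:
        f.write(base64.b64decode(encoded))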
Note: Selenium also has methods to obtain the HTML content of individual tags directly from the driver object.
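For example, instead of grabbing the whole page, you can ask the driver for individual elements and read their HTML directly. A small sketch using find_elements and get_attribute:
from selenium.webdriver.common.by import By
# fetch every img element straight from the live page
for element in driver.find_elements(By.TAG_NAME, "img"):
    print(element.get_attribute("outerHTML"))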
We can also run these webdrivers headless so that no window appears, i.e. the browser is not presented to the user. This can be done by adding the few lines below to our code:
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=1920,1200")
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--ignore-certificate-errors")
First we import the Options class, which is used to pass arguments and options to our webdriver. We use --disable-gpu since we are not going to display anything in the browser, and it saves some system resources. The --headless argument is the most essential one here, as it tells the driver to launch the browser in headless mode. The --window-size argument is specified because the default window size in headless mode is 800x600, which can cause issues on some websites. --disable-extensions is provided to stop extensions from interfering in some cases. --no-sandbox and --disable-dev-shm-usage are needed in some cases to help reduce Chrome webdriver crashes in headless mode. Finally, --ignore-certificate-errors allows Chrome to ignore errors due to SSL certificates.
We also need to change
driver = webdriver.Chrome()
to
driver = webdriver.Chrome(options=chrome_options)
to tell the driver to use our Options.
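Putting the pieces together, a minimal headless version of the earlier scraper might look like this (the URL is just a placeholder):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://example.com")  # placeholder URL
html = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()  # close the browser once we have the HTML
sel_soup = BeautifulSoup(html, 'html.parser')
print(len(sel_soup.findAll('img')))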
Docs for reference
Selenium Python Docs
BeautifulSoup Docs
My web scraper for Google Images can be found here:
mushahidq/py_webScraper: A simple web scraper using beautifulsoup and requests
File Descriptions:
simmpleWebScraper.py: a simple web scraper built using requests and BeautifulSoup to get data from any website.
googleImages.py: a Google Images web scraper that obtains images from Google using the Chrome webdriver, Selenium and BeautifulSoup.
googleImagesWithRequests.py: a web scraper for Google Images that uses the requests library and BeautifulSoup.
sampleGoogleImages.html: the page obtained when using the requests library.
As can be seen, more images can be obtained when using the webdriver because it enables the use of JavaScript, while with the requests library we only get the plain HTML and CSS.