Have you ever written a web scraping program in Python using requests and noticed differences in the content? You might not get exactly the same results as when you open the website in a browser.
This is because some sites use JavaScript to render content on the client side, or the site may be making API calls to the server and rendering that content afterwards. Moreover, requests made with the requests library can be identified as coming from an older browser because of missing headers, which can lead the server to respond with a page built for older browsers.
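If missing headers are the only issue, a partial workaround (before reaching for a webdriver) is to send browser-like headers with the request. A minimal sketch, assuming a placeholder URL and an example User-Agent string:
import requests
url = "https://example.com"  # placeholder URL
# a browser-like User-Agent; the exact string here is just an example
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
res = requests.get(url, headers=headers)
print(res.status_code)
This makes the request look more like a modern browser, but it still won't execute any JavaScript on the page.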
This problem can be easily solved by using webdrivers. The Chrome webdriver can be downloaded from https://chromedriver.chromium.org/ . Make sure to download the driver that matches your Chrome version and put chromedriver.exe in the folder you're running your Python program from, or add it to PATH.
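If you'd rather not touch PATH, you can also point Selenium at the driver explicitly. A minimal sketch, assuming Selenium 4 and a hypothetical driver location:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
# hypothetical path to the downloaded driver; adjust for your system
service = Service(executable_path="C:/tools/chromedriver.exe")
driver = webdriver.Chrome(service=service)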
Usually, with requests, the code for web scraping looks something like this:
from bs4 import BeautifulSoup
import requests
url = "https://example.com"  # the page you want to scrape
res = requests.get(url)
res_soup = BeautifulSoup(res.text, 'html.parser')
print(res_soup.prettify())
# print every img tag found in the parsed HTML
for image in res_soup.findAll('img'):
    print(image)
# collect and print the src attribute of each img tag
for image in res_soup.findAll('img'):
    imageSources = image['src']
    print(imageSources)
In the above code we make a GET request to the URL stored in url and then use BeautifulSoup to parse the text of the response into HTML, storing the result in res_soup. We can then look for tags such as the img tag in this parsed document using the findAll() method, which returns all the tags matching the given filter (here, img tags).
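findAll() also accepts filters beyond the tag name, such as attribute filters. A small illustrative sketch (the attribute values here are just examples, not from the code above):
# only img tags that actually have a src attribute
images_with_src = res_soup.findAll('img', src=True)
# only anchor tags with a specific (hypothetical) CSS class
links = res_soup.findAll('a', class_='nav-link')
print(len(images_with_src), len(links))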
Now, using the Selenium Chrome webdriver, the code looks like this:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "https://example.com"  # the page you want to scrape
driver = webdriver.Chrome()
driver.get(url)
html = driver.execute_script("return document.documentElement.outerHTML")
sel_soup = BeautifulSoup(html, 'html.parser')
print(len(sel_soup.findAll('img')))
images = []
for image in sel_soup.findAll('img'):
    # some sites keep the real image source in data-src instead of src
    try:
        src = image["src"]
    except KeyError:
        src = image["data-src"]
    images.append(src)
First we specify that we are using the Chrome webdriver for Selenium and launch it with driver = webdriver.Chrome(). Then we use the driver's .get() method to open a website. Next, we extract the HTML content of the page by executing some JavaScript in the browser with the driver's .execute_script() method. Then we use BeautifulSoup to parse this text into HTML and use the findAll method to find all the image tags. The notable difference here is that some websites that render content on the client side may use the data-src attribute of the img tag instead of src, and it may hold a data URI containing the base64-encoded image.
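If src or data-src does turn out to hold a base64 data URI, it can be decoded back into image bytes with the standard library. A minimal sketch, continuing from the loop above and assuming src is such a data URI:
import base64
# src comes from the scraping loop above; only decode if it is a data URI
if src.startswith("data:image"):
    header, encoded = src.split(",", 1)
    with open("image.png", "wb") as f:
        f.write(base64.b64decode(encoded))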
Note: Selenium also has methods to obtain the HTML content of individual tags directly from the driver object.
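For example, instead of grabbing the whole page, you can ask the driver for individual elements and read their HTML directly. A small sketch using find_elements and get_attribute:
from selenium.webdriver.common.by import By
# fetch every img element straight from the live page
for element in driver.find_elements(By.TAG_NAME, "img"):
    print(element.get_attribute("outerHTML"))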
We can also run these webdrivers headless so that no window appears, i.e. the browser is not presented to the user. This can be done by adding the few lines below to our code:
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=1920,1200")
chrome_options.add_argument("--disable-extensions")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")
chrome_options.add_argument("--ignore-certificate-errors")
First we import the Options class, which is used to pass arguments and options to our webdriver. We use --disable-gpu since we are not going to display anything in the browser, and it saves some system resources. The --headless argument is the most essential one here, as it tells the driver to launch the browser in headless mode. The --window-size argument is specified because the default window size in headless mode is 800x600, which can cause issues on some websites. --disable-extensions is provided to stop extensions from interfering in some cases. --no-sandbox and --disable-dev-shm-usage are needed in some cases to help reduce Chrome webdriver crashes in headless mode. Finally, --ignore-certificate-errors allows Chrome to ignore errors due to SSL certificates.
We also need to change
driver = webdriver.Chrome()
to
driver = webdriver.Chrome(options=chrome_options)
to tell the driver to use our Options.
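Putting the pieces together, a minimal headless version of the earlier scraper might look like this (the URL is just a placeholder):
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=1920,1200")
driver = webdriver.Chrome(options=chrome_options)
driver.get("https://example.com")  # placeholder URL
html = driver.execute_script("return document.documentElement.outerHTML")
driver.quit()  # close the browser once we have the HTML
sel_soup = BeautifulSoup(html, 'html.parser')
print(len(sel_soup.findAll('img')))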
Docs for reference
Selenium Python Docs
BeautifulSoup Docs
My web scraper for Google Images can be found here:
mushahidq/py_webScraper: A simple web scraper using beautifulsoup and requests
File Descriptions:
simmpleWebScraper.py: a simple web scraper built using requests and BeautifulSoup to get data from any website.
googleImages.py: a Google Images web scraper that obtains images from Google using the Chrome webdriver, Selenium and BeautifulSoup.
googleImagesWithRequests.py: a web scraper for Google Images that uses the requests library and BeautifulSoup.
sampleGoogleImages.html: the page obtained when using the requests library.
As can be seen, more images can be obtained when using the webdriver because it enables the use of JavaScript, while with the requests library we only get the plain HTML and CSS.