DEV Community

Scraping Google Images to Create a Mask/No-Mask Dataset Using Selenium

In these tough times, each of us should share knowledge and collaborate. I was trying to build a dataset of people wearing masks and people without masks, and I have collected a small amount of data so far. In this post I share how you can scrape Google Images to do the same. Here is my video explaining the concept.
First of all, we need Selenium and a WebDriver, e.g. the Chromium WebDriver.

Here is the code:

import os
import time
import urllib.request
from selenium import webdriver
from selenium.webdriver.common.keys import Keys

# Path to your local ChromeDriver executable
driver = webdriver.Chrome("C:\\Users\\Sourabh\\chromedriver.exe")
driver.get('https://www.google.com/')  # open Google
search = driver.find_element_by_name('q')  # 'q' is the name of the search box
search.send_keys('people wearing mask', Keys.ENTER)  # type the query and press Enter

Now we need to go to the Images tab:

elem = driver.find_element_by_link_text('Images')  # the "Images" link on the results page
elem.click()

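The `find_element_by_*` helpers used above belong to older Selenium releases; recent Selenium 4 releases look elements up with `find_element(by, value)` instead. Here is a minimal sketch of the same two steps in that style, not a definitive implementation. The function name `open_image_results` is hypothetical, the locator strings `"name"` and `"link text"` are the values behind `By.NAME` and `By.LINK_TEXT`, and it assumes the search box is still named `q` and the tab is still labelled `Images`.

```python
def open_image_results(driver, query):
    """Search Google for `query` and switch to the Images tab.

    `driver` is a Selenium WebDriver (or any object with the same methods).
    """
    driver.get("https://www.google.com/")
    # "name" is the locator strategy behind By.NAME
    search = driver.find_element("name", "q")
    search.send_keys(query)
    search.submit()  # submit the form that the search box belongs to
    # "link text" is the locator strategy behind By.LINK_TEXT
    driver.find_element("link text", "Images").click()
```

You would call it as `open_image_results(driver, 'people wearing mask')` once the driver has been created.
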
Now we need to scroll the page so that more images load and we can collect more src values from the img tags.

for i in range(50):  # scroll the page 50 times
    driver.execute_script('window.scrollBy(0, 100);')  # scroll down by 100 pixels
    time.sleep(4)  # wait for more images to load

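A fixed number of scrolls can stop too early or keep going after the page has run out of results. As an alternative, here is a rough sketch that scrolls to the bottom until the page height stops growing. The function name `scroll_until_stable` and its parameters are my own illustration, and it only assumes that `driver` has Selenium's `execute_script` method.

```python
import time

def scroll_until_stable(driver, pause=2.0, max_rounds=50):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    rounds = 0
    while rounds < max_rounds:
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        rounds += 1
        time.sleep(pause)  # give lazily loaded thumbnails time to appear
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new was loaded, so we are done
        last_height = new_height
    return rounds
```

The `max_rounds` cap keeps the loop from running forever on a page that never stops loading.
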
Now we need to find the class or id of the img tags so that we can read their src attribute. At the time of writing, the img tags on Google Images carry three classes. Keep in mind that Google changes them periodically, so this might stop working after a few weeks.

# Select every <img> that carries all three classes
elements = driver.find_elements_by_xpath('//img[contains(@class,"rg_i") and contains(@class, "Q4LuWd") and contains(@class, "tx8vtf")]')
# Create the output directory; do nothing if it already exists
os.makedirs('peoplewithmask', exist_ok=True)

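Since the class names are likely to change, it can help to keep them in one list and build the XPath from it. Below is a small sketch of that idea. The helper name `img_xpath_for_classes` is hypothetical, and the three class names are the ones from this post, which may already be outdated.

```python
def img_xpath_for_classes(class_names):
    """Build an XPath matching <img> tags whose class attribute contains every given name."""
    if not class_names:
        return "//img"  # no classes given: match every <img>
    conditions = " and ".join(
        'contains(@class, "{}")'.format(name) for name in class_names
    )
    return "//img[{}]".format(conditions)

# Class names taken from this post; update them after inspecting the page
GOOGLE_IMG_CLASSES = ["rg_i", "Q4LuWd", "tx8vtf"]
xpath = img_xpath_for_classes(GOOGLE_IMG_CLASSES)
```

The resulting `xpath` string can be passed to the same `find_elements_by_xpath` call shown above.
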
Finally, we need to retrieve the links and download the images:

count = 0
for element in elements:
    src = element.get_attribute('src')
    if src is None:
        continue  # skip images that have no src attribute
    count += 1
    # save into the 'peoplewithmask' directory created earlier
    urllib.request.urlretrieve(src, os.path.join('peoplewithmask', 'image' + str(count) + '.jpg'))
    if count % 10 == 0:
        print("downloaded", count, "images")

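The loop above names every file `.jpg`. A `src` value is not always a plain URL, though; assuming some thumbnails arrive as inline base64 `data:` URIs, the MIME type in the URI tells us what the image really is. Here is a hedged sketch of a helper that handles both cases. The name `save_image_src` and its parameters are hypothetical, and it only handles base64-encoded data URIs.

```python
import base64
import os
import urllib.request

def save_image_src(src, dest_dir, index):
    """Save one <img> src value into dest_dir and return the file path.

    Handles inline base64 "data:" URIs as well as ordinary http(s) URLs.
    """
    os.makedirs(dest_dir, exist_ok=True)
    if src.startswith("data:"):
        # header looks like "data:image/png;base64"
        header, _, payload = src.partition(",")
        if ";base64" not in header:
            raise ValueError("only base64-encoded data URIs are handled")
        mime = header[len("data:"):].split(";")[0]
        ext = mime.split("/")[-1] if "/" in mime else "jpg"
        path = os.path.join(dest_dir, "image{}.{}".format(index, ext))
        with open(path, "wb") as fh:
            fh.write(base64.b64decode(payload))
    else:
        # plain URL: download it and keep the post's .jpg naming
        path = os.path.join(dest_dir, "image{}.jpg".format(index))
        urllib.request.urlretrieve(src, path)
    return path
```

Inside the download loop you could call `save_image_src(src, 'peoplewithmask', count)` in place of the direct `urlretrieve` call.
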
Done! That is all for today. Feel free to reach out if you need help.
I did not explain how to inspect the page and find the class, id, etc. because I feel that most developers already know how. Still, if you run into a problem, please refer to this video tutorial.

Thanking You
Sourabh
