This article relies on the code written by Fabian Bosler:
I've only modified Bosler's code to make it a bit easier to pull images for multiple search terms.
The full code can be found in the series' GitHub repository:
Magic Symbols
As I've mentioned in my previous article, I needed a lot of images of magic symbols for training a deep convolutional generative adversarial network (DCGAN). Luckily, I landed on Bosler's article early on.
To get my images, I used the Chrome browser, Chromedriver, Selenium, and a Python script to slowly scrape images from Google's image search. The scraping was throttled to near-human speed, but it still automated the collection of a lot of images.
Regarding this process, I'll echo Bosler: I'm in no way a legal expert. I'm not a lawyer, and nothing I state should be taken as legal advice. I'm just some hack on the internet. However, from what I understand, scraping SERPs (search engine results pages) is not illegal, at least not for personal use. But using Google's image search for automated scraping of images is against their terms of service (ToS). Replicate this project at your own risk. I know that when I adjusted my script to search faster, Google banned my IP. I'm glad it was temporary.
Bosler's Modified Script
The script automatically searches for images and collects their underlying URLs. After searching, it uses the Python requests library to download all the images into a folder named after the search term.
Here are the modifications I made to Bosler's original script:
- Added a search term loop. This allows the script to continue running past one search term.
- The script was getting stuck when it ran into the "Show More Results" button; I've fixed that issue.
- The results are saved in directories named after the search terms. If the script is interrupted and rerun, it first checks which directories have already been created and removes those terms from the search.
- I added a timeout feature; thanks to a user on Stack Overflow.
- I parameterized the number of images to look for per search term, sleep times, and timeout.
Code: Libraries
You will need to install Chromedriver and Selenium--this is explained well in the original article.
You will also need to install Pillow--a Python library for managing images.
You can install it with:
pip install pillow
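Selenium itself is also available from pip, in case you haven't installed it yet:
pip install selenium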
After installing all the needed libraries, the following block of code should execute without error:
import os
import time
import io
import hashlib
import signal
from glob import glob
import requests
from PIL import Image
from selenium import webdriver
If you have any trouble, revisit the original article's setup explanation or feel free to ask questions in the comments below.
Code: Parameters
I've added a few parameters to the script to make it easier to use.
number_of_images = 400
GET_IMAGE_TIMEOUT = 2
SLEEP_BETWEEN_INTERACTIONS = 0.1
SLEEP_BEFORE_MORE = 5
IMAGE_QUALITY = 85
output_path = "/path/to/your/image/directory"
The number_of_images parameter tells the script how many images to search for per search term. If the script runs out of images before reaching number_of_images, it will skip to the next term.
GET_IMAGE_TIMEOUT determines how long the script should wait for a response before skipping to the next image URL.
SLEEP_BETWEEN_INTERACTIONS is how long the script should delay before checking the URL of the next image. In theory, this can be set low, as I don't think it makes any requests of Google. But I'm unsure; adjust at your own risk.
SLEEP_BEFORE_MORE is how long the script should wait before clicking the "Show More Results" button. This should not be set lower than you can physically search; your IP will be banned. Mine was.
Code: Search Terms
Here is where the magic happens. The search_terms array should include any terms you think will find the sorts of images you are targeting.
Below is the exact set of terms I used to collect magic symbol images:
search_terms = [
    "black and white magic symbol icon",
    "black and white arcane symbol icon",
    "black and white mystical symbol",
    "black and white useful magic symbols icon",
    "black and white ancient magic sybol icon",
    "black and white key of solomn symbol icon",
    "black and white historic magic symbol icon",
    "black and white symbols of demons icon",
    "black and white magic symbols from book of enoch",
    "black and white historical magic symbols icons",
    "black and white witchcraft magic symbols icons",
    "black and white occult symbols icons",
    "black and white rare magic occult symbols icons",
    "black and white rare medieval occult symbols icons",
    "black and white alchemical symbols icons",
    "black and white demonology symbols icons",
    "black and white magic language symbols icon",
    "black and white magic words symbols glyphs",
    "black and white sorcerer symbols",
    "black and white magic symbols of power",
    "occult religious symbols from old books",
    "conjuring symbols",
    "magic wards",
    "esoteric magic symbols",
    "demon summing symbols",
    "demon banishing symbols",
    "esoteric magic sigils",
    "esoteric occult sigils",
    "ancient cult symbols",
    "gypsy occult symbols",
    "Feri Tradition symbols",
    "Quimbanda symbols",
    "Nagualism symbols",
    "Pow-wowing symbols",
    "Onmyodo symbols",
    "Ku magical symbols",
    "Seidhr And Galdr magical symbols",
    "Greco-Roman magic symbols",
    "Levant magic symbols",
    "Book of the Dead magic symbols",
    "kali magic symbols",
]
Before searching, the script checks the image output directory to determine whether images have already been gathered for a particular term. If they have, the script excludes that term from the search. This is part of my "be cool" code. We don't need to download a bunch of images twice.
The code below grabs all the directories in our output path, then reconstructs the search term from each directory name (i.e., it replaces the "_"s with " "s).
dirs = glob(os.path.join(output_path, "*"))
dirs = [dir.split("/")[-1].replace("_", " ") for dir in dirs]
# Folder names are lowercased when created, so compare lowercased terms.
search_terms = [term for term in search_terms if term.lower() not in dirs]
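For example, a directory left over from a previous run reconstructs back into its search term like this (the directory name here is hypothetical):
# search_and_download() below builds folder names by lowercasing the
# search term and swapping spaces for underscores; this reverses that step.
dir_name = "black_and_white_magic_symbol_icon"
print(dir_name.replace("_", " "))  # -> "black and white magic symbol icon"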
Code: Chromedriver
Before starting the script, we have to kick off a Chromedriver session. Note, you must put the chromedriver executable into a folder listed in your PATH variable for Selenium to find it.
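If you're unsure whether the executable is discoverable, here's a quick sanity check from Python (a minimal sketch using the standard library's shutil):
import shutil

# Prints the full path to chromedriver if it is on your PATH; None otherwise.
print(shutil.which("chromedriver"))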
For macOS users, setting up Chromedriver for Selenium is a bit tough to do manually. But using Homebrew makes it easy:
brew install chromedriver
If everything is set up correctly, executing the following code will open a Chrome browser and bring up the Google search page.
wd = webdriver.Chrome()
wd.get("https://google.com")
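If you'd rather not watch a browser window open for every search term, Chrome can also run headless. A minimal sketch--I ran the script with a visible browser, so consider headless untested for this workflow:
from selenium.webdriver.chrome.options import Options

# Run Chrome without a visible window.
options = Options()
options.add_argument("--headless")
wd = webdriver.Chrome(options=options)
wd.get("https://google.com")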
Code: Chrome Timeout
I borrowed the timeout class below from Thomas Ahle on Stack Overflow. It is a dirty way of creating a timeout for the GET request that downloads each image. Without it, the script can get stuck on unresponsive image downloads.
class timeout:
    def __init__(self, seconds=1, error_message="Timeout"):
        self.seconds = seconds
        self.error_message = error_message

    def handle_timeout(self, signum, frame):
        raise TimeoutError(self.error_message)

    def __enter__(self):
        signal.signal(signal.SIGALRM, self.handle_timeout)
        signal.alarm(self.seconds)

    def __exit__(self, type, value, traceback):
        signal.alarm(0)
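Here's a quick, contrived way to see it work. Note that signal.SIGALRM is Unix-only and must run on the main thread, so this class won't work on Windows:
try:
    with timeout(seconds=1):
        time.sleep(3)  # Simulate a download that hangs.
except TimeoutError as e:
    print(e)  # Prints "Timeout"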
Code: Fetch Images
As I hope I've made clear, I did not write the code below; I just polished it. I'll provide a brief explanation, but refer back to Bosler's article for more information.
Essentially, the script:
1. Creates a directory corresponding to a search term in the array.
2. Passes the search term to fetch_image_urls(), which drives the Chrome session. The script navigates Google Images looking for images relating to the search term and stores each image's URL in a set (to prevent duplicates). After it has searched through all the images or reached number_of_images, it returns the collection of image URLs as res.
3. Passes the image URLs to persist_image(), which downloads each image into the corresponding folder.
4. Repeats steps 1-3 for each search term.
I've added extra comments as a guide:
def fetch_image_urls(
    query: str,
    max_links_to_fetch: int,
    wd: webdriver,
    sleep_between_interactions: int = 1,
):
    def scroll_to_end(wd):
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(sleep_between_interactions)

    # Build the Google query.
    search_url = "https://www.google.com/search?safe=off&site=&tbm=isch&source=hp&q={q}&oq={q}&gs_l=img"

    # Load the page.
    wd.get(search_url.format(q=query))

    # Declared as a set, to prevent duplicates.
    image_urls = set()
    image_count = 0
    results_start = 0
    while image_count < max_links_to_fetch:
        scroll_to_end(wd)

        # Get all image thumbnail results.
        thumbnail_results = wd.find_elements_by_css_selector("img.Q4LuWd")
        number_results = len(thumbnail_results)

        print(
            f"Found: {number_results} search results. Extracting links from {results_start}:{number_results}"
        )

        # Loop through each thumbnail identified.
        for img in thumbnail_results[results_start:number_results]:
            # Try to click every thumbnail so we can get the real image behind it.
            try:
                img.click()
                time.sleep(sleep_between_interactions)
            except Exception:
                continue

            # Extract the image URLs.
            actual_images = wd.find_elements_by_css_selector("img.n3VNCb")
            for actual_image in actual_images:
                if actual_image.get_attribute(
                    "src"
                ) and "http" in actual_image.get_attribute("src"):
                    image_urls.add(actual_image.get_attribute("src"))

            image_count = len(image_urls)

            # If the number of image links found meets max_links_to_fetch, end the search.
            if len(image_urls) >= max_links_to_fetch:
                print(f"Found: {len(image_urls)} image links, done!")
                break
        else:
            # If we haven't found all the images we want, look for more.
            print("Found:", len(image_urls), "image links, looking for more ...")
            time.sleep(SLEEP_BEFORE_MORE)

            # Check for the button signifying no more images.
            not_what_you_want_button = ""
            try:
                not_what_you_want_button = wd.find_element_by_css_selector(".r0zKGf")
            except Exception:
                pass

            # If there are no more images, return.
            if not_what_you_want_button:
                print("No more images available.")
                return image_urls

            # If there is a "Load More" button, click it.
            load_more_button = wd.find_element_by_css_selector(".mye4qd")
            if load_more_button and not not_what_you_want_button:
                wd.execute_script("document.querySelector('.mye4qd').click();")

        # Move the result start point further down.
        results_start = len(thumbnail_results)

    return image_urls
def persist_image(folder_path: str, url: str):
    try:
        print("Getting image")
        # Download the image. If the timeout is exceeded, throw an error.
        with timeout(GET_IMAGE_TIMEOUT):
            image_content = requests.get(url).content
    except Exception as e:
        print(f"ERROR - Could not download {url} - {e}")
        return

    try:
        # Convert the image into a byte stream, then save it.
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert("RGB")
        # Create a unique file path from the contents of the image.
        file_path = os.path.join(
            folder_path, hashlib.sha1(image_content).hexdigest()[:10] + ".jpg"
        )
        with open(file_path, "wb") as f:
            image.save(f, "JPEG", quality=IMAGE_QUALITY)
        print(f"SUCCESS - saved {url} - as {file_path}")
    except Exception as e:
        print(f"ERROR - Could not save {url} - {e}")
def search_and_download(search_term: str, target_path="./images", number_images=5):
    # Create a folder name from the search term.
    target_folder = os.path.join(target_path, "_".join(search_term.lower().split(" ")))

    # Create the image folder if needed.
    if not os.path.exists(target_folder):
        os.makedirs(target_folder)

    # Open Chrome.
    with webdriver.Chrome() as wd:
        # Search for image URLs.
        res = fetch_image_urls(
            search_term,
            number_images,
            wd=wd,
            sleep_between_interactions=SLEEP_BETWEEN_INTERACTIONS,
        )

    # Download the images.
    if res is not None:
        for elem in res:
            persist_image(target_folder, elem)
    else:
        print(f"Failed to return links for term: {search_term}")


# Loop through all the search terms.
for term in search_terms:
    search_and_download(term, output_path, number_of_images)
Results
Scraping the images resulted in a lot of garbage images (noise) along with my ideal training images.
For example, out of all the images shown, I only wanted the image highlighted:
There was also the problem of lots of magic symbols stored in a single image. These "collection" images would need further processing to extract all of the symbols.
However, even with a few rough edges, the script sure as hell beat manually downloading the 10k images I had in the end.