Cihan Köseoğlu

An easy way to scrape public domain images from Pixabay

Originally featured in LearnThinkImplement

Lately I've been doing some Python development. For one of my personal projects I needed a couple hundred images, and I needed them to be in the public domain so that I wouldn't have to deal with any copyright issues. Pixabay is well known for exactly this, but since I needed a lot of images, I couldn't just manually pick, choose, and download them. So I decided to check out its API to see whether it's easy to scrape images, and it turns out that it is!

Be aware of the following request from Pixabay when you use these images. Since I used them for a non-commercial personal project that isn't open to the outside world and lives on my local machine (i.e. I have no users but myself), I don't display attribution links for the images, but they are just as easy to fetch from the API response.

If you make use of the API, show your users where the images and videos are from, whenever search results are displayed. A link to Pixabay is required and you may use our logo for this purpose. That's the one thing we kindly request in return for free API usage.

If you check out the documentation, you'll see that you need an API key. Getting one is very easy: you just need to sign up for a Pixabay account.

Here's how the API works: we send a GET request to the base endpoint https://pixabay.com/api/ with query parameters appended, and we get a JSON response back. Very simple, very effective.
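
Before writing the full script, you can poke at the API with a one-off request just to see the shape of the response. This is a small sketch of my own, not from the original post; replace YOUR_API_KEY with a real key:

import requests

# One-off request just to inspect what comes back.
resp = requests.get("https://pixabay.com/api/",
                    params={"key": "YOUR_API_KEY", "q": "istanbul"})
print(resp.status_code)          # 200 on success
print(list(resp.json().keys()))  # expect keys like 'totalHits' and 'hits'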

The full code is at the very bottom, so if you don't want the walkthrough, feel free to skip straight to it.

Defining constants and variables

Once you have the API key, let's define it along with the endpoint and the parameters we want. The full list of parameters is available in the docs, and you can add any of them to the PARAMS object.

query is the variable holding my search term; I want to search for istanbul, so I'll add it to my PARAMS as well.

We define our PARAMS and ENDPOINT as follows.


API_KEY = "API-KEY"
URL_ENDPOINT = "https://pixabay.com/api/"
query = "istanbul"
PER_PAGE = 200

PARAMS = {'q': query, 'per_page': PER_PAGE, 'page': 1}
ENDPOINT = URL_ENDPOINT + "?key=" + API_KEY

Sending the request to the API

We need to send a GET request to that endpoint; for that we'll import and use the requests package, which makes it easy to send a request and work with the response. The API paginates its results, so we'll compute the total page count by dividing totalHits (which comes back in the response) by our PER_PAGE constant, rounding up so a partially filled last page still counts.

Once we get the response, we call the .json() method on it to parse the body as JSON. You can print the result to your console and look at it with a beautifier.
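
You don't even need an external beautifier; the standard library's json module can pretty-print the parsed response. A small aside of my own, not from the original post:

import json

# Pretty-print the parsed response with two-space indentation.
print(json.dumps(data, indent=2))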

When we do that, we'll see a hits key whose value holds the image objects. Each object has a webformatURL key whose value is a link to the image file; the actual image is not returned in this request. We need to collect these image URLs, so we create a new list named url_links and append each link to it.

import requests

url_links = []

req = requests.get(url=ENDPOINT, params=PARAMS)
data = req.json()

# Round up so a partially filled last page still counts as a page.
num_pages = (data["totalHits"] + PER_PAGE - 1) // PER_PAGE

for image in data["hits"]:
    url_links.append(image["webformatURL"])

What about this pagination?

When we do this, we get the first page, but we still need the rest. Since we now know the page count, let's loop over the remaining pages, using time.sleep() to pause between requests so we don't exceed our query limit.

import time

for page in range(2, num_pages + 1):
    time.sleep(3)
    PARAMS['page'] = page
    req = requests.get(url=ENDPOINT, params=PARAMS)
    data = req.json()
    for image in data["hits"]:
        url_links.append(image["webformatURL"])
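
One defensive tweak worth considering here (my addition, not in the original post): if Pixabay rejects a request, say because the query limit was exceeded, calling .json() on the error response fails in a confusing way. requests can surface the HTTP error directly:

# Fail loudly on a 4xx/5xx response instead of trying to parse it as JSON.
req = requests.get(url=ENDPOINT, params=PARAMS)
req.raise_for_status()  # raises requests.HTTPError on a bad status code
data = req.json()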

Downloading the images

Once we have these links, we need to send a request to each one and save the bytes to a file. We'll use the requests package again.

import os

# Make sure the images folder exists before writing into it.
script_dir = os.path.dirname(__file__)
images_dir = os.path.join(script_dir, "../images")
os.makedirs(images_dir, exist_ok=True)

for index, image in enumerate(url_links, start=1):
    r = requests.get(image, allow_redirects=False)
    file_name = "istanbul_image_" + str(index) + ".jpg"
    abs_file_path = os.path.join(images_dir, file_name)
    # 'with' closes the file even if the write fails.
    with open(abs_file_path, 'wb') as f:
        f.write(r.content)

For each image link in url_links, we send a GET request for the image and write the response body to a new file (I named mine istanbul_image_${index}, but you can use whatever you want, of course). We import the os package to work with the file system: os.path.dirname(__file__) gives us the directory the script itself lives in, we build a path to an images folder relative to it (creating the folder if it doesn't exist yet), and then we write the contents of each image into its own file.

By changing the query string (and of course the file naming), you can download pretty much any image you want from Pixabay.
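
If you do that a lot, it's natural to fold the whole script into a reusable function. Here's a rough sketch of one way to do it; the download_images name and signature are my own invention, not part of the original script:

import os
import time

import requests


def download_images(query, folder, api_key, per_page=200):
    """Download every Pixabay result for `query` into `folder`."""
    endpoint = "https://pixabay.com/api/?key=" + api_key
    params = {'q': query, 'per_page': per_page, 'page': 1}
    os.makedirs(folder, exist_ok=True)
    index = 0
    page = 1
    while True:
        params['page'] = page
        data = requests.get(url=endpoint, params=params).json()
        for image in data["hits"]:
            index += 1
            r = requests.get(image["webformatURL"], allow_redirects=False)
            file_name = query + "_image_" + str(index) + ".jpg"
            with open(os.path.join(folder, file_name), 'wb') as f:
                f.write(r.content)
        # Stop once we've paged past the last accessible result.
        if page * per_page >= data["totalHits"]:
            break
        page += 1
        time.sleep(3)  # stay under the query limit


download_images("istanbul", "images", "API-KEY")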
Here's the whole code. Let me know at @cihankoseoglu on Twitter if you have any questions!

#!/usr/bin/env python3

import requests
import time
import os


API_KEY = "API-KEY"
URL_ENDPOINT = "https://pixabay.com/api/"
query = "istanbul"
PER_PAGE = 200

PARAMS = {'q': query, 'per_page': PER_PAGE, 'page': 1}
ENDPOINT = URL_ENDPOINT + "?key=" + API_KEY

url_links = []

req = requests.get(url=ENDPOINT, params=PARAMS)
data = req.json()

# Round up so a partially filled last page still counts as a page.
num_pages = (data["totalHits"] + PER_PAGE - 1) // PER_PAGE

for image in data["hits"]:
    url_links.append(image["webformatURL"])

for page in range(2, num_pages + 1):
    time.sleep(3)
    PARAMS['page'] = page
    req = requests.get(url=ENDPOINT, params=PARAMS)
    data = req.json()
    for image in data["hits"]:
        url_links.append(image["webformatURL"])

# Make sure the images folder exists before writing into it.
script_dir = os.path.dirname(__file__)
images_dir = os.path.join(script_dir, "../images")
os.makedirs(images_dir, exist_ok=True)

for index, image in enumerate(url_links, start=1):
    r = requests.get(image, allow_redirects=False)
    file_name = "istanbul_image_" + str(index) + ".jpg"
    abs_file_path = os.path.join(images_dir, file_name)
    # 'with' closes the file even if the write fails.
    with open(abs_file_path, 'wb') as f:
        f.write(r.content)
