Python Concurrent Image Downloader

#python #programming #productivity #tutorial

Downloading multiple images or files is one of the best use cases for multithreading. Each download spends most of its time blocked on network I/O, so while one thread waits for a response, the others can make progress.

We are going to retrieve 10 different images from https://picsum.photos/200/300, which is a free API that delivers a different image every time you hit that link. We’ll then store these 10 different images within a temp folder.

Concurrent Download

It’s time to write a quick program that will concurrently download all the images that we require. We’ll be going over creating and starting threads. The key point is to see the performance gains that running I/O-bound work concurrently can bring:

import os
import threading
import urllib.request
import time

def downloadImage(imgPath, fileName):
    # fetch the image at imgPath and save it to fileName
    print("Downloading Image from ", imgPath)
    urllib.request.urlretrieve(imgPath, fileName)
    print("Completed Download")

def createThread(i, url):
    # thread target: build the file name for image i, then download it
    imgName = "temp/image-" + str(i) + ".jpg"
    downloadImage(url, imgName)

def main():
    url = "https://picsum.photos/200/300"
    # urlretrieve does not create missing folders, so make sure
    # the temp folder exists before any thread writes to it
    os.makedirs("temp", exist_ok=True)
    t = time.time()
    # create a list which will store a reference to
    # all of our threads
    threads = []

    # create 10 threads, append them to our list of threads
    # and start them off
    for i in range(10):
        thread = threading.Thread(target=createThread, args=(i, url))
        threads.append(thread)
        thread.start()

    # ensure that all the threads in our list have completed
    # their execution before we log the total time to complete
    for thread in threads:
        thread.join()

    # calculate the total execution time (end time minus start time)
    t1 = time.time()
    totalTime = t1 - t
    print("Total Execution Time {}".format(totalTime))

if __name__ == '__main__':
    main()

At the top of the program, we import the threading module. We then wrap the filename generation and the call to downloadImage in our own createThread function, which serves as the target of each thread.

Within the main function, we first create an empty list of threads, and then iterate 10 times, creating a new thread object, appending it to our list of threads, and then starting that thread.

Finally, we iterate through our list with for thread in threads and call the join method on each thread. This ensures that we do not proceed with the execution of our remaining code until all of our threads have finished downloading their images.

If you execute this on your machine, you should see that it almost instantaneously starts the download of the 10 different images. As each download finishes, its thread prints that it has completed, and you should see the temp folder being populated with these images. The exact time will vary with your network connection.

$ python concurrentImageDownloader.py
Downloading Image from  https://picsum.photos/200/300
Downloading Image from  https://picsum.photos/200/300
Downloading Image from  https://picsum.photos/200/300
Downloading Image from  https://picsum.photos/200/300
Downloading Image from  https://picsum.photos/200/300
Downloading Image from  https://picsum.photos/200/300
Downloading Image from  https://picsum.photos/200/300
Downloading Image from  https://picsum.photos/200/300
Downloading Image from  https://picsum.photos/200/300
Downloading Image from  https://picsum.photos/200/300
Completed Download
Completed Download
Completed Download
Completed Download
Completed Download
Completed Download
Completed Download
Completed Download
Completed Download
Completed Download
Total Execution Time 1.1606624126434326
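
A single timing only means something next to a baseline. The following is a minimal sketch of how the threaded approach could be compared with a plain one-at-a-time loop. To keep it runnable without a network connection, it assumes a hypothetical fake_download function that simply sleeps for 0.2 seconds in place of urlretrieve; the function names and the delay are illustrative and are not part of the program above.

```python
import threading
import time

def fake_download(index, delay=0.2):
    # Hypothetical stand-in for urllib.request.urlretrieve: sleeping
    # blocks the calling thread, much like waiting on the network does.
    time.sleep(delay)

def run_sequential(count):
    # Perform the downloads one after another and return the elapsed time.
    start = time.perf_counter()
    for i in range(count):
        fake_download(i)
    return time.perf_counter() - start

def run_threaded(count):
    # Perform the downloads in one thread each and return the elapsed time.
    start = time.perf_counter()
    threads = [threading.Thread(target=fake_download, args=(i,))
               for i in range(count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return time.perf_counter() - start

if __name__ == "__main__":
    sequential = run_sequential(10)
    threaded = run_threaded(10)
    print("Sequential: {:.2f}s".format(sequential))
    print("Threaded:   {:.2f}s".format(threaded))
```

With real downloads you would swap fake_download for the article's downloadImage. The gap between the two numbers will then depend on your bandwidth and on the server, so treat this as a way to measure rather than a promised speedup.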

my personal blog Programming Geeks Club

Top comments (5)

Vincent A. Cicirello • Oct 21 '22

I have a few comments:

You should really use the timeit module or something else like it designed for benchmarking.
Did you also time the sequential alternative of downloading one at a time? Without doing so, your timing result doesn't tell us much. Did threads actually save time, and if so, how much?
What about using the multiprocessing module instead of threads? Perhaps a process Pool? You can potentially gain the benefit of parallelism that way.

spO0q • Oct 23 '22

what do you mean by "multiprocessing module"? multi-threads?

Using threads is convenient to request remote resources.

Vincent A. Cicirello • Oct 23 '22

Python's multiprocessing module uses multiple processes, instead of multiple threads. With threads in Python (assuming CPython with its GIL) you get concurrency but not really parallelism. The multiprocessing module circumvents that limitation. Not sure whether it would make a difference compared to threads in downloading multiple files though, since you're also constrained by network bandwidth. It would be interesting to compare.

spO0q • Oct 23 '22

ah ok. Thanks for the details. Indeed, concurrency is not parallelism.

Vincent A. Cicirello • Oct 23 '22

You're welcome