To ensure high application performance, programmers need to write efficient code. Code efficiency is directly linked to algorithmic efficiency and the speed of runtime execution.
The 🐍 Python language is a 🐌 slow language compared to C or FORTRAN, and to combat this, various methods have been developed to help speed it up.
One method is CONCURRENCY.
CONCURRENCY 🤷‍♂️
Concurrency is when two tasks can start, run, and complete in overlapping time periods. It doesn't necessarily mean they'll ever both be running at the same instant (when they do, that becomes parallelism). Still confused 😕? Let's paint a scenario:
We are planning a dream wedding for a particular couple. At our disposal, we have Mary, Susan, Max, and Simon. A cake, band, decorations, and invitation cards are needed for the dream wedding. We assign the baking of the cake to Susan, the hiring of a band to Simon, the setting up of decorations to Mary and the sending of invitations to Max.
These four friends (or processors) all perform their tasks (or processes) over the same period of time until each is completed. When they all make progress in overlapping time periods, that is, in layman's terms, CONCURRENCY; if they are all literally working at the same instant, that is parallelism.
TYPES OF CONCURRENCY IN PYTHON 🐍
The basic types of concurrency in Python include:
- multi-threading 🧵
- multiprocessing 🧩(yeah, I know that's a jigsaw piece 😏)
- asyncio ⏰ (more on this in another tutorial)
MULTI-THREADING 🧵
A Thread is the smallest unit of execution in an operating system. Threads are a way for a program to split itself into two or more simultaneously running tasks. A thread is not itself a program; it only runs within a particular program or process.
Multi-threading is game-changing and is mainly used for I/O bound operations. It is the ability of a central processing unit (CPU) (or a single core in a multi-core processor) to provide multiple threads of execution concurrently, supported by the operating system. Each thread shares the same resource supplied by the process it exists in.
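Before the full example, here is a minimal sketch of the idea: several threads overlapping their waits. The task name `simulate_download` is a hypothetical stand-in (using `time.sleep` to mimic waiting on the network), not part of the article's script.

```python
# A minimal sketch of multithreading: three simulated I/O tasks.
# simulate_download is a hypothetical stand-in for a network call.
import threading
import time

def simulate_download(name, results):
    time.sleep(0.5)        # pretend to wait on a network response
    results.append(name)   # list.append is thread-safe in CPython

results = []
start = time.time()
threads = [
    threading.Thread(target=simulate_download, args=(f"task-{i}", results))
    for i in range(3)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start
print(f"Finished {len(results)} tasks in {elapsed:.2f} seconds")
# All three 0.5s waits overlap, so this takes roughly 0.5s rather than 1.5s.
```

Because each thread spends its time waiting rather than computing, the waits overlap and the total wall time stays close to a single task's duration.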
Let's illustrate an example of a multi-threaded operation. First, we take a look at the synchronous process.
Synchronous Process
```python
# A simple Python script that queries a list of sites
import requests
import time

def get_single_site(url):
    with requests.get(url) as response:
        print(f"Read {len(response.content)} from {url}")

def get_all_sites(sites):
    for url in sites:
        get_single_site(url)

if __name__ == "__main__":
    start_time = time.time()
    sites = [
        "https://www.google.com",
        "https://www.facebook.com",
        "https://www.twitter.com/theghostyced"
    ] * 30
    get_all_sites(sites)
    end_time = time.time() - start_time
    print(f"Downloaded {len(sites)} in {end_time} seconds")
```
This is a simple Python program that downloads the content of the provided websites. After downloading the content of each website, it prints out the number of sites visited and the time it took.
The script makes use of the requests library and Python's built-in time library.
The output of running the code is:
...
Read 107786 from https://www.facebook.com
Read 608312 from https://www.twitter.com/theghostyced
Read 11369 from https://www.google.com
Read 107786 from https://www.facebook.com
Read 608077 from https://www.twitter.com/theghostyced
Read 11369 from https://www.google.com
Read 107787 from https://www.facebook.com
Read 608077 from https://www.twitter.com/theghostyced
Read 11369 from https://www.google.com
Read 107351 from https://www.facebook.com
Read 608311 from https://www.twitter.com/theghostyced
Read 11369 from https://www.google.com
Read 107507 from https://www.facebook.com
Read 608312 from https://www.twitter.com/theghostyced
Read 11369 from https://www.google.com
Read 107918 from https://www.facebook.com
Read 608312 from https://www.twitter.com/theghostyced
Read 11369 from https://www.google.com
Read 107149 from https://www.facebook.com
Read 608312 from https://www.twitter.com/theghostyced
Read 11365 from https://www.google.com
Read 107445 from https://www.facebook.com
Read 608077 from https://www.twitter.com/theghostyced
Read 11369 from https://www.google.com
Read 107351 from https://www.facebook.com
Read 608312 from https://www.twitter.com/theghostyced
Read 11369 from https://www.google.com
Read 107482 from https://www.facebook.com
Read 608312 from https://www.twitter.com/theghostyced
Downloaded 90 in 17.5553081035614 seconds
Here, it takes the script 17.5 seconds to complete the task. Now let's try it again and see if we can speed it up using a multi-threaded approach.
Multi Threaded Process
```python
# A simple Python script that queries a list of sites
import requests
import time
import concurrent.futures

def get_single_site(url):
    with requests.get(url) as response:
        print(f"Read {len(response.content)} from {url}")

def get_all_sites(sites):
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
        executor.map(get_single_site, sites)

if __name__ == "__main__":
    start_time = time.time()  # our script's start time
    sites = [
        "https://www.google.com",
        "https://www.facebook.com",
        "https://www.twitter.com/theghostyced"
    ] * 30
    get_all_sites(sites)
    end_time = time.time() - start_time
    print(f"Downloaded {len(sites)} in {end_time} seconds")
```
In the code above, we imported the concurrent.futures module from the Python standard library. The module provides an Executor class, from which the ThreadPoolExecutor subclass derives. Let's break down the ThreadPoolExecutor.
All the ThreadPoolExecutor subclass does is create a pool of threads. The Executor part controls how and when each of the threads in the pool will run.
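Besides map(), the executor also offers submit(), which schedules a single call and returns a Future you can collect a result from. A small self-contained sketch (square is a trivial stand-in task, not the article's download function):

```python
import concurrent.futures

def square(n):
    # trivial stand-in task so the example is self-contained
    return n * n

with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # submit() schedules one call and returns a Future;
    # map() (used in the article) is a shorthand over many items
    futures = [executor.submit(square, n) for n in range(5)]
    results = [f.result() for f in futures]

print(results)  # [0, 1, 4, 9, 16]
```

submit() is handy when the tasks are heterogeneous or you want per-task error handling; map() is the concise choice when you apply one function over a list, as in the download script.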
The output of the above script is shown below:
...
Read 608312 from https://www.twitter.com/theghostyced
Read 11354 from https://www.google.com
Read 107810 from https://www.facebook.com
Read 608312 from https://www.twitter.com/theghostyced
Read 11343 from https://www.google.com
Read 107823 from https://www.facebook.com
Read 608312 from https://www.twitter.com/theghostyced
Read 11326 from https://www.google.com
Read 107388 from https://www.facebook.com
Read 11350 from https://www.google.com
Read 608312 from https://www.twitter.com/theghostyced
Read 107787 from https://www.facebook.com
Read 608311 from https://www.twitter.com/theghostyced
Read 608077 from https://www.twitter.com/theghostyced
Read 11299 from https://www.google.com
Read 11367 from https://www.google.com
Read 608312 from https://www.twitter.com/theghostyced
Read 107785 from https://www.facebook.com
Read 11321 from https://www.google.com
Read 107800 from https://www.facebook.com
Read 107350 from https://www.facebook.com
Read 608076 from https://www.twitter.com/theghostyced
Read 608312 from https://www.twitter.com/theghostyced
Read 608312 from https://www.twitter.com/theghostyced
Read 608311 from https://www.twitter.com/theghostyced
Downloaded 90 in 6.443061351776123 seconds
Here, it takes the script 6.4 seconds to complete the task, compared to the 17.5 seconds it took when running the code synchronously. You could be saying to yourself that it's only an 11-second difference and you can live with that. But imagine we had a larger number of sites, say 1000; then you would see a significant difference between the two approaches.
MULTIPROCESSING 🧩
A Process is simply an instance of a running program. To put it in layman's terms, when we write a computer program/script in a text file and execute it, it becomes a process that performs all the tasks mentioned in the program/script. Unlike threads, a Process does not share any memory space with other processes.
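We can demonstrate the "no shared memory" point with a small sketch: a module-level counter mutated by pool workers never changes in the parent process, because each worker gets its own copy. (The counter and increment names here are illustrative, not from the article.)

```python
import multiprocessing

counter = 0

def increment(_):
    global counter
    counter += 1   # mutates the worker process's own copy only
    return counter

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        pool.map(increment, range(4))
    # Each worker incremented its private copy of `counter`;
    # the parent process's copy is untouched.
    print(counter)  # 0
```

Contrast this with threads, which all see (and can race on) the same variables; that isolation is why processes typically exchange data via return values, queues, or pipes instead.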
Multiprocessing involves the use of two or more cores within a single computer system. By default, running Python 🐍 threads in parallel across multiple cores is not possible due to the GIL, or Global Interpreter Lock.
The GIL Hindrance
Python was developed by Guido van Rossum in the late 1980s, at a time when computers typically had only a single CPU. To simplify memory management, Guido implemented the GIL, which allows only one thread to hold control of the Python interpreter at a time. This means that using more than one CPU core, or separate CPUs, to run threads in parallel was not possible.
The multiprocessing module was introduced in order to bypass this.
NB: The GIL does not prevent the creation of multiple threads. All the GIL does is make sure only one thread is executing Python code at a time; control still switches between threads. If you are still confused, this article will definitely help you out.
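The effect of the GIL is easy to see with a CPU-bound sketch: splitting pure-Python number crunching across two threads does not make it faster on a standard CPython build, because only one thread executes bytecode at a time. (The countdown function and the timings are illustrative assumptions, not measurements from the article.)

```python
import threading
import time

def countdown(n):
    # pure-Python CPU work: the GIL serializes bytecode execution
    while n > 0:
        n -= 1

N = 2_000_000

# run all the work on one thread
start = time.time()
countdown(N)
single = time.time() - start

# split the same work across two threads
start = time.time()
t1 = threading.Thread(target=countdown, args=(N // 2,))
t2 = threading.Thread(target=countdown, args=(N // 2,))
t1.start(); t2.start()
t1.join(); t2.join()
threaded = time.time() - start

print(f"single: {single:.2f}s, threaded: {threaded:.2f}s")
# The two-thread version does the same total work but is typically
# no faster (often slightly slower, due to switching overhead).
```

This is exactly why the download example above still benefits from threads: it spends its time waiting on the network, not executing Python bytecode.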
Let's illustrate how a multiprocessing operation would be written, starting from the synchronous code we had above.
Synchronous Process
```python
# A simple Python script that queries a list of sites
import requests
import time

def get_single_site(url):
    with requests.get(url) as response:
        print(f"Read {len(response.content)} from {url}")

def get_all_sites(sites):
    for url in sites:
        get_single_site(url)

if __name__ == "__main__":
    start_time = time.time()
    sites = [
        "https://www.google.com",
        "https://www.facebook.com",
        "https://www.twitter.com/theghostyced"
    ] * 30
    get_all_sites(sites)
    end_time = time.time() - start_time
    print(f"Downloaded {len(sites)} in {end_time} seconds")
```
Multiprocessing Method
```python
# A simple Python script that queries a list of sites
import requests
import time
import multiprocessing

def get_single_site(url):
    with requests.get(url) as response:
        print(f"Read {len(response.content)} from {url}")

def get_all_sites(sites):
    with multiprocessing.Pool(5) as pool:
        pool.map(get_single_site, sites)

if __name__ == "__main__":
    start_time = time.time()  # our script's start time
    sites = [
        "https://www.google.com",
        "https://www.facebook.com",
        "https://www.twitter.com/theghostyced"
    ] * 30
    get_all_sites(sites)
    end_time = time.time() - start_time
    print(f"Downloaded {len(sites)} in {end_time} seconds")
```
Here we import the multiprocessing package from Python's standard library. The module comes with a few classes, including Process and Pool.
Here we are making use of the Pool class. Pool takes the number of worker processes it needs to spawn as its first argument, which is what is happening on the with multiprocessing.Pool(5) as pool: line. On pool.map(get_single_site, sites), we use the map method provided by the Pool. It takes the function to be called as its first argument and an iterable (our list of sites) as its second. It then chops the iterable into a number of chunks which it submits to the process pool as separate tasks.
The output for this operation is:
...
Read 608423 from https://www.twitter.com/theghostyced
Read 108078 from https://www.facebook.com
Read 11386 from https://www.google.com
Read 11387 from https://www.google.com
Read 11304 from https://www.google.com
Read 11353 from https://www.google.com
Read 108021 from https://www.facebook.com
Read 107985 from https://www.facebook.com
Read 108022 from https://www.facebook.com
Read 608423 from https://www.twitter.com/theghostyced
Read 108079 from https://www.facebook.com
Read 608423 from https://www.twitter.com/theghostyced
Read 11340 from https://www.google.com
Read 608423 from https://www.twitter.com/theghostyced
Read 11321 from https://www.google.com
Read 608423 from https://www.twitter.com/theghostyced
Read 107985 from https://www.facebook.com
Read 608423 from https://www.twitter.com/theghostyced
Read 11384 from https://www.google.com
Read 107549 from https://www.facebook.com
Read 608423 from https://www.twitter.com/theghostyced
Read 11294 from https://www.google.com
Read 608423 from https://www.twitter.com/theghostyced
Read 107985 from https://www.facebook.com
Read 11360 from https://www.google.com
Read 609124 from https://www.twitter.com/theghostyced
Downloaded 90 in 6.056399154663086 seconds
Here, it takes the script 6 seconds to complete the task, only slightly faster than the threading solution. This is expected because the operation being performed is an I/O-bound one. Multiprocessing performs better on CPU-bound operations, like crunching huge amounts of numbers.
CONCLUSION
I know right now you are itching to try it out for yourself, so before you go, a few notes on when to use concurrency.
You need to figure out whether your program is CPU-bound or I/O-bound first. Remember that I/O-bound programs are those that spend most of their time waiting for something to happen (making external calls or requests), while CPU-bound programs spend their time processing data or crunching numbers as fast as they can.
So for I/O-bound operations, multithreading is the best way to go, and for CPU-bound operations, multiprocessing is the right way to go.
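That rule of thumb can be captured in a tiny helper. This is a sketch, not a library API: run_tasks and its parameters are names invented here for illustration.

```python
import concurrent.futures

def run_tasks(fn, items, io_bound=True, workers=4):
    # Rule of thumb: threads for I/O-bound work (waits overlap under
    # the GIL), processes for CPU-bound work (true parallelism).
    executor_cls = (concurrent.futures.ThreadPoolExecutor if io_bound
                    else concurrent.futures.ProcessPoolExecutor)
    with executor_cls(max_workers=workers) as executor:
        return list(executor.map(fn, items))

# I/O-bound style usage (threads):
print(run_tasks(str, [1, 2, 3], io_bound=True))  # ['1', '2', '3']
```

Note that the process-pool branch should only be invoked from under an `if __name__ == "__main__":` guard, and the function passed to it must be picklable (defined at module level).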
👋 👋 👋 👋
I'm CED and I hope you enjoyed the tutorial on how to speed up the runtime of your Python 🐍 scripts.
Top comments (10)
Hi CED, glad you wrote this article. There are a few definitions that contain imprecisions or incorrect statements that I think should be addressed, mostly because they trip up a lot of developers. Concurrency is hard, there's no going around that.
In the beginning you say:
This is strictly incorrect, as it's a possible definition of parallelism, not concurrency. Unfortunately this topic gets tricky really quickly because there is not one way to define both; let's use "looking at the system from the outside" as a lens.
Concurrency means that in, let's say, 10 seconds, more than one unit of work makes progress, but if you observe the system, it might be executing only one of those computations at any given time. This is why single-threaded async programs can be concurrent: every independent unit of work makes progress over the lifetime of the program (basically using a mechanism similar to that of a basic operating system: rotating among units of work and giving each attention for a small period of time).
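The single-threaded async scenario can be sketched concretely (the worker names and the log list are illustrative):

```python
import asyncio

async def worker(name, log):
    for i in range(2):
        log.append(f"{name}-{i}")
        await asyncio.sleep(0)   # yield so the other task can advance

async def main():
    log = []
    # both coroutines run on ONE thread, interleaving at each await
    await asyncio.gather(worker("a", log), worker("b", log))
    return log

log = asyncio.run(main())
print(log)  # e.g. ['a-0', 'b-0', 'a-1', 'b-1']
```

Both units of work make progress over the program's lifetime, yet at any instant exactly one of them is executing: concurrency without parallelism.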
If, by using concurrency and by observing the system, more than one computation is executed at the same time (literally), as my body can breathe and type without alternating between the two, then you achieved parallelism. So in this scenario the system has to be concurrent already (both breathing and typing advance while I'm writing this comment) but it has to have the resources to be parallel (my typing motor skills and the breathing apparatus do not communicate and work independently).
Same as above, they could be, maybe not. You hint at the "not" scenario when the threads are not executing at the same time when you talk about the GIL later on and boundness.
I guess you mean cores here.
This is a typo. The GIL limits multithreading, not multiprocessing. I wouldn't call it a hindrance but more a design choice, but that's entirely a personal point of view, the objective part is that I'm sure you meant multithreading here, not multiprocessing.
Hope this helps, concurrency is tricky :-)
Thanks for the input @ryhmes. I knew concurrency was really tricky; that's why I was a bit skeptical about writing it. Changes have been made
Eheh don't worry, it's just a complicated topic and the fact that a lot of tutorials around still confuse the two doesn't help. It took me a while as well to have the picture clear in my head.
Thanks for the corrections!
If your use of these terms is accurate to current use, I feel like they’ve been misappropriated a little.
Concurrent literally means running simultaneously, while parallelism expresses multiplicity but also similarity, so it seems like parallelism should be reserved for code where the same block or instruction is being executed concurrently (whether via SIMD instructions or SIMD code), while concurrency would refer to e.g. execution of multithreaded code where one thread handles IO and one handles computation, but both still run simultaneously.
By extension, code written to be able to execute multiple threads simultaneously should be incapable of actual concurrency if executing on a single (non-pipelined) processor core because in any sufficiently small slice of time, only one thing is happening at once. That is to say, context-switching shouldn’t be counted as concurrency.
Thanks for the comment, Ian. It's interesting because there's no universal consensus on the definition of the two; it's not the first time I've heard slightly different definitions of the two terms. I believe it's also because we tend to create definitions in the abstract and not relative to a context.
Taking into account your premises: how would you define code that runs in a single thread but that's able to advance multiple units of computation? The age old c10k problem implemented using select/poll/epoll/kqueue/iocp for example.
Hm, yes, this almost seems analogous to the question of "how many things are you juggling at a time" vs "how many things do you have in your hand when you're juggling N things".
As to the c10k question: it's not a question I'd thought about -- it's still a form of multithreading, isn't it? Only one thing happens at once, but context-switching between threads, even if they're not formally labelled as threads, say, as green threads, 'coroutines', or even just handler contexts, can occur. It seems like "multiplexing" might even be an appropriate term to use, analogously to the use in signal transmission.
Very nice post, easy to understand, and lots of new stuff learned. I was wondering: isn't measuring the completion time with a time library directly inside the program tied to how the machine itself performs, so the result could differ on other machines if they are slower or faster?
Also, I don't know why but most of the links are broken and it's a shame because I really wanted to read those complements.
The completion time will 100% have to do with the machine it is run on. This is true of all intensive tasks. There are a lot of factors, too, and I don't just mean clock speed, cache size, etc. For example, if you are performing network tasks (such as the requests in this article), your bandwidth, network saturation, latency, and other variables come into play. If you're doing a lot of disk I/O, your systems could have almost identical hardware, but if one is using a standard hard drive and the other a decent SSD, again, those times will be different.
The multi-threaded test the author posted above gave "Downloaded 90 in 4.992406606674194 seconds" on my Ryzen 5 2600 over gigabit Ethernet, but the same test on my Raspberry Pi 3B over 2.4GHz 802.11n WiFi gave "Downloaded 90 in 7.817208528518677 seconds".
Multiple factors came into play there. Slower processor, slower memory, greater network latency (due to wifi, rasppi3B is about 25+ feet away, lots of other 2.4ghz networks in the building).
To give a greater understanding however, let me run 900 instead of 90 (sorry Google, Facebook, and Twitter, it's for science!):
Ryzen:
Downloaded 900 in 46.64533185958862 seconds
Raspberry Pi 3B:
Downloaded 900 in 71.28835225105286 seconds
The faster system, with a better network connection, took just over 45 seconds while the slower wifi-crippled system took 1 minute and 11 seconds!
If I took strictly computational tasks (cutting out the network and waiting on responses to requests), it would look more like this:
Synchronous:
Ryzen:
Finished computations in 7.4013049602508545 seconds.
Raspberry Pi 3B:
Finished computations in 7.918483018875122 seconds.
Multi-threaded:
Ryzen:
Finished computations in 6.342863082885742 seconds.
Raspberry Pi 3B:
Finished computations in 8.66378116607666 seconds.
BUT WAIT, THE RASPBERRY PI WENT UP IN TIME?!
That's right, because my 6-core, 12-thread Ryzen was able to benefit from using 12 threads at a time to queue up work for concurrency. The 4-core, 4-thread ARM processor in my Pi actually took a hit from breaking up the work in a non-beneficial way AND, of course, was slower, due to its lower performance and lower thread count.
I hope that clears some stuff up! :)
Thanks for this awesome experiment. That's what I was suspecting. That's why big O notation is a more reliable way (though not always) to measure code performance.
Sorry for the inconvenience @jim . The links are working fine now.