Co-authored by me (Daniel) and Haile
In today’s world, data lives at the heart of everything we do, from social media to smart appliances. A program’s performance is largely determined by how efficiently it can manipulate and compute that data, often over a network. Processing large amounts of data comes with problems, most notably an increase in the program’s execution time that users experience as “blocking” or “lag”.
Out of the need for efficient program execution, and with increasingly sophisticated multicore operating systems and hardware, programming languages have evolved to take better advantage of this architecture. The literal meaning of the word concurrency is “simultaneous occurrence”. The execution time of a concurrent program can be significantly decreased because the computer can run multiple instructions at the same time.
Python has three main operating system concepts interwoven into its concurrency model, namely the thread, the task, and the process.
Unweaving the Threads
Threads of the same operating system process distribute computing workloads across multiple cores, as seen in programming languages like C++ and Java. Generally, though, Python uses only one process, from which a single main thread is spawned to execute the runtime. It remains on a single core, irrespective of how many cores the computer has or how many new threads are spawned, due to a locking mechanism called the global interpreter lock, introduced to prevent something called a race condition.
At the mention of racing, you’re probably thinking of NASCAR and Formula One. Let’s use that analogy and imagine all the Formula 1 drivers trying to race in one car at the same time. Sounds absurd, right? This would only be possible if each driver had access to their own car or, better still, if they drove one lap at a time, each time handing the car over to the next driver.
This is very similar to what happens with threads. Threads are “spawned” from a “main” thread, and each subsequent thread shares the execution state of the thread that spawned it. These threads all exist within the same process “context” (the event or competition), so all the resources assigned to that process, such as memory, are shared. For example, in a typical Python interpreter session:
>>> a = 8
Here, a consumes a tiny bit of memory (RAM): some arbitrary location in memory temporarily holds the value 8.
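If you are curious just how little memory that is, the standard library can tell you. Here is a minimal sketch using sys.getsizeof (the exact byte count varies across Python builds and platforms):

import sys

a = 8
# size in bytes of the int object that the name a refers to
print(sys.getsizeof(a))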
So far so good. Let’s spin up a few threads and observe their behavior when adding two numbers x and y:
import time
import threading
from threading import Thread

a = 8

def threaded_add(x, y):
    # simulation of a more complex task by asking
    # python to sleep, since adding happens so quickly!
    for i in range(2):
        global a
        print("computing task in a different thread!")
        time.sleep(1)
        # this is not okay! but python will force sync, more on that later!
        a = 10
        print(a)

# the current thread will be a subset fork!
if __name__ != "__main__":
    current_thread = threading.current_thread()

# here we tell python, from the main thread
# of execution, to spawn the other thread
if __name__ == "__main__":
    thread = Thread(target=threaded_add, args=(1, 2))
    thread.start()
    thread.join()

    print(a)
    print("main thread finished...exiting")
>>> computing task in a different thread!
>>> 10
>>> computing task in a different thread!
>>> 10
>>> 10
>>> main thread finished...exiting
Two threads are currently running. Let’s call them thread_one and thread_two. If thread_one wants to modify a with the value 10 while thread_two is simultaneously trying to update the same variable, we have a problem! A condition known as a data race occurs, and the resulting value of a will be incoherent.
It’s like a NASCAR race you didn’t watch, after which you heard two contradicting results from two of your friends: thread_one tells you one thing and thread_two contradicts it! Here’s a pseudo-code snippet illustrating that:
a = 8
# spawns two different threads 1 and 2
# thread_one updates the value of a to 10
if (a == 10):
    # a check
    # thread_two updates the value of a to 15
    a = 15
b = a * 2
# if thread_one finished first the result will be 20
# if thread_two finished first the result will be 30
# who is right?
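To see a data race in a real program rather than pseudo code, here is a minimal sketch (the unsafe_increment name and the iteration counts are ours, purely for illustration). Two threads hammer an unsynchronized counter; because the read-modify-write is not atomic, updates get lost and the final count usually falls short of the expected 200000 (how often depends on your CPython version and switch interval):

import threading

count = 0

def unsafe_increment(n):
    global count
    for _ in range(n):
        # read-modify-write is not atomic: the other thread can
        # run between the read below and the write that follows
        tmp = count
        tmp += 1
        count = tmp

if __name__ == "__main__":
    threads = [threading.Thread(target=unsafe_increment, args=(100000,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(count)  # expected 200000, but lost updates typically make it smaller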
What’s really going on?
Python is an interpreted language, which means it comes with an interpreter: a program that parses and executes its source code. Some such interpreters include CPython, PyPy, Jython, and IronPython, with CPython being the original reference implementation.
CPython is an interpreter that provides a foreign function interface with C and other programming languages. It compiles Python source code into intermediary bytecode, which is interpreted by the CPython virtual machine. The discussion thus far, and going forward, is about CPython and behavior within that environment.
Memory Model and Locking Mechanisms
A programming language uses objects in its programs to perform operations. These objects include primitive data types like string, integer, or boolean, as well as more complex data structures like lists or classes/objects. The values of your program’s objects are stored in memory for quick access. When a variable is used in a program, the process reads its value from memory and operates on it. In early programming languages, most developers were responsible for all memory management in their programs: before creating a list or an object, you first had to allocate the memory for the variable, and once done with it you had to deallocate it, “freeing” that memory.
In Python, objects are tracked in memory by reference. A reference is a kind of label for an object, so one object can have many names, like how you can have a given name and a nickname; each name ultimately resolves to the same location in memory. Python uses a reference counter for garbage collection, an automatic memory-management process.
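As a quick illustration of one object having many names (a minimal sketch; the variable names are ours):

nickname = ["Dan"]
given_name = nickname            # a second reference to the same list object

nickname.append("Haile")
print(given_name)                # ['Dan', 'Haile'] - both names see the change
print(given_name is nickname)    # True - two labels, one object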
With the help of the reference counter, Python keeps track of each object by incrementing the counter whenever the object is created or referenced, and decrementing it whenever the object is dereferenced. When the reference count reaches 0, the memory for the object is deallocated.
import sys
import gc

hello = "world"  # the name hello now references the string 'world'
print(sys.getrefcount(hello))  # note: getrefcount's own argument adds one temporary reference

bye = "world"
other_bye = bye
print(sys.getrefcount(bye))
print(gc.get_referrers(other_bye))
>>> 4
>>> 6
>>> [['sys', 'gc', 'hello', 'world', 'print', 'sys', 'getrefcount', 'hello', 'bye', 'world', 'other_bye', 'bye', 'print', 'sys', 'getrefcount', 'bye', 'print', 'gc', 'get_referrers', 'other_bye'], (0, None, 'world'), {'__name__': '__main__', '__doc__': None, '__package__': None, '__loader__': <_frozen_importlib_external.SourceFileLoader object at 0x0138ADF0>, '__spec__': None, '__annotations__': {}, '__builtins__': <module 'builtins' (built-in)>, '__file__': 'test.py', '__cached__': None, 'sys': <module 'sys' (built-in)>, 'gc': <module 'gc' (built-in)>, 'hello': 'world', 'bye': 'world', 'other_bye': 'world'}]
These reference count variables need to be protected against race conditions and memory leaks. In order to protect them, locks can be added to every data structure that is shared across threads.
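Application code can use the same tool to protect its own shared state: threading.Lock. Here is a minimal sketch (the safe_increment name and the counts are ours) that fixes the racy counter from earlier by holding a lock across the read-modify-write:

import threading

count = 0
count_lock = threading.Lock()

def safe_increment(n):
    global count
    for _ in range(n):
        # only one thread at a time can execute this block
        with count_lock:
            count += 1

if __name__ == "__main__":
    threads = [threading.Thread(target=safe_increment, args=(100000,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print(count)  # reliably 200000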
CPython’s GIL takes the opposite approach and controls the interpreter itself, by allowing only one thread at a time to execute Python bytecode. This provides a performance boost to single-threaded programs, since only one lock needs to be managed; the tradeoff, however, is that it prevents multithreaded CPython programs from taking full advantage of multiprocessor systems in certain situations.
💡 When a user writes a Python program, there’s a difference between programs that are CPU-bound in their performance and those that are I/O-bound. CPU-bound programs push the processor to its limit by performing many computations, whereas I/O-bound programs spend most of their time waiting for input/output.
It is therefore only in multithreaded programs that spend a lot of time inside the GIL, interpreting CPython bytecode, that the GIL becomes a bottleneck; even then, it can degrade performance where locking is not strictly necessary. For example, consider a Python program that handles both I/O-bound and CPU-bound tasks:
import time, os
from threading import Thread, current_thread
from multiprocessing import current_process

COUNT = 200000000
SLEEP = 10

def io_bound(sec):
    pid = os.getpid()
    threadName = current_thread().name
    processName = current_process().name
    print(f"{pid} * {processName} * {threadName} ---> Start sleeping...")
    time.sleep(sec)
    print(f"{pid} * {processName} * {threadName} ---> Finished sleeping...")

def cpu_bound(n):
    pid = os.getpid()
    threadName = current_thread().name
    processName = current_process().name
    print(f"{pid} * {processName} * {threadName} ---> Start counting...")
    while n > 0:
        n -= 1
    print(f"{pid} * {processName} * {threadName} ---> Finished counting...")

def timeit(function, args, threaded=False):
    start = time.time()
    if threaded:
        t1 = Thread(target=function, args=(args,))
        t2 = Thread(target=function, args=(args,))
        t1.start()
        t2.start()
        t1.join()
        t2.join()
    else:
        # run the task twice sequentially, for a fair
        # comparison with the two-thread version
        function(args)
        function(args)
    end = time.time()
    print('Time taken in seconds for running {} on Argument {} is {}s -{}'.format(
        function, args, end - start, "Threaded" if threaded else "Non-Threaded"))

if __name__ == "__main__":
    # Running io_bound task
    print("IO BOUND TASK NON THREADED")
    timeit(io_bound, SLEEP)
    print("IO BOUND TASK THREADED")
    # Running io_bound task in threads
    timeit(io_bound, SLEEP, threaded=True)
    print("CPU BOUND TASK NON THREADED")
    # Running cpu_bound task
    timeit(cpu_bound, COUNT)
    print("CPU BOUND TASK THREADED")
    # Running cpu_bound task in threads
    timeit(cpu_bound, COUNT, threaded=True)
>>> IO BOUND TASK NON THREADED
>>> 17244 * MainProcess * MainThread ---> Start sleeping...
>>> 17244 * MainProcess * MainThread ---> Finished sleeping...
>>> 17244 * MainProcess * MainThread ---> Start sleeping...
>>> 17244 * MainProcess * MainThread ---> Finished sleeping...
>>> Time taken in seconds for running <function io_bound at 0x00C50810> on Argument 10 is 20.036664724349976s -Non-Threaded
>>> IO BOUND TASK THREADED
>>> 10180 * MainProcess * Thread-1 ---> Start sleeping...
>>> 10180 * MainProcess * Thread-2 ---> Start sleeping...
>>> 10180 * MainProcess * Thread-1 ---> Finished sleeping...
>>> 10180 * MainProcess * Thread-2 ---> Finished sleeping...
>>> Time taken in seconds for running <function io_bound at 0x01B90810> on Argument 10 is 10.01464056968689s -Threaded
>>> CPU BOUND TASK NON THREADED
>>> 14172 * MainProcess * MainThread ---> Start counting...
>>> 14172 * MainProcess * MainThread ---> Finished counting...
>>> 14172 * MainProcess * MainThread ---> Start counting...
>>> 14172 * MainProcess * MainThread ---> Finished counting...
>>> Time taken in seconds for running <function cpu_bound at 0x018F4468> on Argument 200000000 is 44.90199875831604s -Non-Threaded
>>> CPU BOUND TASK THREADED
>>> 15616 * MainProcess * Thread-1 ---> Start counting...
>>> 15616 * MainProcess * Thread-2 ---> Start counting...
>>> 15616 * MainProcess * Thread-1 ---> Finished counting...
>>> 15616 * MainProcess * Thread-2 ---> Finished counting...
>>> Time taken in seconds for running <function cpu_bound at 0x01E44468> on Argument 200000000 is 106.09711360931396s -Threaded
From the results we notice that multithreading does excellently for multiple I/O-bound tasks, with an execution time of 10 seconds, where the non-threaded approach took 20 seconds. We used the same approach for our CPU-bound tasks. It did kick off both threads at the same time, but in the end the whole program took a whopping 106 seconds! What happened? Only the thread that holds the Global Interpreter Lock (GIL) can execute Python bytecode, so when Thread-1 holds the lock, Thread-2 cannot make use of the CPU; the two threads mostly take turns rather than running in parallel, and the constant acquisition and release of the lock adds overhead on top of the work itself. It is therefore safe to say that threading is not an ideal solution for tasks that depend on the CPU for their execution.
This peculiarity makes concurrent programming difficult. If the GIL holds us back in terms of concurrency, shouldn’t we get rid of it, or at least be able to turn it off? Well, it’s not that easy. Other features, libraries, and packages have come to rely on the GIL, so something must replace it, or else the entire ecosystem will break. It’s a hard nut to crack.
Picking the Locks
We have established that CPython uses locks to protect data against races. Though this lock exists, programmers have found ways to achieve concurrency explicitly. When it comes to the GIL, we can use the multiprocessing library to get around the global lock. Multiprocessing achieves concurrency in its true sense, as it executes code across different processes on different CPU cores. It creates a new instance of the Python interpreter to run on each core. Different processes reside in different memory locations, so object sharing among them is not so easy. In this implementation, Python provides a separate interpreter for each process to run, and each process in multiprocessing gets its own single main thread.
import os
import time
from threading import current_thread
from multiprocessing import Process, current_process

SLEEP = 10
COUNT = 200000000

def count_down(cnt):
    pid = os.getpid()
    processName = current_process().name
    print(f"{pid} * {processName} ---> Start counting...")
    while cnt > 0:
        cnt -= 1

def io_bound(sec):
    pid = os.getpid()
    threadName = current_thread().name
    processName = current_process().name
    print(f"{pid} * {processName} * {threadName} ---> Start sleeping...")
    time.sleep(sec)
    print(f"{pid} * {processName} * {threadName} ---> Finished sleeping...")

if __name__ == '__main__':
    # creating processes
    start = time.time()

    # CPU BOUND
    p1 = Process(target=count_down, args=(COUNT,))
    p2 = Process(target=count_down, args=(COUNT,))

    # IO BOUND (swap in these lines to time the I/O-bound task instead)
    # p1 = Process(target=io_bound, args=(SLEEP,))
    # p2 = Process(target=io_bound, args=(SLEEP,))

    # starting processes
    p1.start()
    p2.start()

    # wait until finished
    p1.join()
    p2.join()

    stop = time.time()
    elapsed = stop - start
    print("The time taken in seconds is :", elapsed)
>>> 1660 * Process-2 ---> Start counting...
>>> 10184 * Process-1 ---> Start counting...
>>> The time taken in seconds is : 12.815475225448608
It can be seen that multiprocessing performs exceptionally well for both CPU-bound and I/O-bound tasks. The MainProcess spins up two sub-processes, Process-1 and Process-2, with different PIDs, each of which performs the task of decreasing COUNT to zero. The processes run in parallel, each making use of a separate CPU core and its own instance of the Python interpreter, so the whole program takes only about 12 seconds.
Note that the output might be printed in an unordered fashion, as the processes are independent of each other. Each process executes the function in its own default main thread.
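Because each process has its own memory, sharing state between processes takes explicit machinery. Here is a minimal sketch using multiprocessing.Value, a shared-memory wrapper from the standard library (the worker name and the counts are ours):

from multiprocessing import Process, Value

def worker(counter, n):
    for _ in range(n):
        # Value carries its own lock; hold it across the increment
        with counter.get_lock():
            counter.value += 1

if __name__ == '__main__':
    counter = Value('i', 0)  # a shared integer, initially 0
    procs = [Process(target=worker, args=(counter, 100000)) for _ in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(counter.value)  # reliably 200000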
We could also use the asyncio library to get around the GIL. The basic concept of asyncio is that a single Python object, known as the event loop, controls how and when each task gets run. The event loop is aware of each task and its state: the ready state indicates that a task is ready to run, and the waiting state indicates that a task is waiting for some external operation to complete. In asyncio, tasks never get interrupted in the middle of an operation; they give up control only at explicit await points, so sharing objects between tasks is safe.
import time
import asyncio

COUNT = 200000000

# asynchronous function definition
async def func_name(cnt):
    while cnt > 0:
        cnt -= 1

# asynchronous main function definition
async def main():
    # creating 2 tasks.....you could create as many tasks (n tasks) as you need
    task1 = loop.create_task(func_name(COUNT))
    task2 = loop.create_task(func_name(COUNT))
    # await each task to execute before handing control back to the program
    await asyncio.wait([task1, task2])

if __name__ == '__main__':
    start_time = time.time()
    # get the event loop
    loop = asyncio.get_event_loop()
    # run all tasks in the event loop until completion
    loop.run_until_complete(main())
    loop.close()
    print("--- %s seconds ---" % (time.time() - start_time))
>>> --- 41.74118399620056 seconds ---
We can see that asyncio takes about 41 seconds to complete the countdown, which is better than multithreading’s 106 seconds but not as good as multiprocessing’s 12 seconds for CPU-bound tasks. Asyncio creates an event loop and two tasks, task1 and task2, and places the tasks on the event loop. The program then awaits the execution of the tasks as the event loop runs all tasks to completion.
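It is worth noting that asyncio really shines on I/O-bound work, where tasks spend most of their time waiting. Here is a minimal sketch under that assumption (the fake_io name is ours, asyncio.sleep stands in for real I/O such as a network call, and asyncio.run is the modern Python 3.7+ entry point). The two 10-second waits overlap, so the whole run finishes in roughly 10 seconds rather than 20:

import time
import asyncio

async def fake_io(sec):
    # asyncio.sleep yields control back to the event loop,
    # standing in for a network call or disk read
    await asyncio.sleep(sec)

async def main():
    # both waits run concurrently on the single event loop
    await asyncio.gather(fake_io(10), fake_io(10))

if __name__ == '__main__':
    start = time.time()
    asyncio.run(main())
    print("--- %s seconds ---" % (time.time() - start))  # ~10s, not 20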
Finally, to utilise the full power of concurrency in Python, we could also use a different interpreter. Jython and IronPython have no GIL, which means a user can fully exploit multiprocessor systems.