Tiago Antao

Posted on Jun 9, 2021

Advanced Array access in Cython using a practical example

#python #cython #tutorial #numpy

This is an Excerpt from https://www.manning.com/books/fast-python-for-data-science available from Manning.

Python is slow. The standard implementation is slow and the language dynamic features pay a performance toll. Many Python libraries are performant precisely because they are partially implemented in lower level languages making available efficient data processing algorithms. But sometimes we will need to implement our own high-performance algorithms in something faster than Python. In this article we are going to consider Cython, a superset of Python that is converted to C and is substantially more performant than Python.

We are going to discuss Cython via a concrete example: a color version of Conway’s Game of Life. Conway’s Game of Life is a zero player game which evolves automatically from its initial state — devising interesting initial states is part of the fun. The state of the game is composed of a grid of arbitrary size and each cell can have two states: alive or dead. As time advances each cell will change its state according to the following rules:

Any live cell with 2 or 3 neighbors survives
A dead cell with 3 neighbors becomes live
All other cells die or stay dead

The world wraps around in the sense that the left-most column will look at the right-most column to compute neighbors (and vice-versa). The same applies for the top and bottom rows.
There are 3 examples shown over time in figure 1: the first example is a dash that will be eternally changing directions from vertical to horizontal; the second is a box that is stable and the third dies off completely.

We will be using an extension called QuadLife where each live cell can have 4 different states: Red, Green, Blue and Yellow. We prefer this extension simply because it looks cooler. It includes two new rules for newborns:

If there is a color that is present in the majority of neighbors, it becomes the color of the new cell
If all 3 live neighbors are different then the new cell takes the remaining color of the four possible

A Cython implementation of QuadLife

All the code referred here can be found in https://github.com/tiagoantao/python-performance/tree/master/05-cython/sec6-quadlife

When we say “Cython implementation”, we actually have two parts: the Cython part – which should cover the computational expensive inner loop and all the rest which is done in Python. As such our implementation will include two parts: A calling Python code with the computational expensive part implemented in Cython.

This example is of intermediate complexity from a Cython perspective: there are plenty of introductory tutorials on the Web and we want to add some value above the basic. That being said it should be self-contained.

The Python part has to refer to be able to link Cython. There are several ways to do that, here we will use pyximport which is a simple way to do that – as we don’t want to spend much time in this article discussing linking.

import sys

import numpy as np
import pyximport
pyximport.install(   (1)
    language_level=3,
    setup_args={
        'include_dirs': np.get_include()})

import cquadlife as quadlife

SIZE_X = int(sys.argv[1])   (2)
SIZE_Y = int(sys.argv[2])
GENERATIONS = int(sys.argv[3])

world = quadlife.create_random_world(SIZE_Y, SIZE_X) (3)
for i in range(GENERATIONS):     (4)
    world = quadlife.live(world)

Notes above:

We setup ~pyximport~ to include NumPy
We read parameters from the command line
We use a function — defined later — to create a random world
We apply the Quadlife algorithm for the number of generations specified by the user

We call this script by passing the desired X and Y resolution along with the number of generations. The script doesn’t output anything for now, it just runs the game — later we will do some cool stuff with the results.

To start we will generate a random world with create_random_world — later we will consider better looking alternatives — which is good enough for testing. We will use a NumPy array of dimension SIZE_Y, SIZE_X as specified by the user. It will be filled with random values between 0 and 4. 0 represents a dead cell. We then run the simulation function called live for GENERATIONS generations: the first call gets the random world and its output is then fed to itself in succession.

We now consider our Cython code. Creating a random world doesn’t need to be really optimized as it is called only once at the beginning:

#cython: language_level=3
import numpy as np

cimport cython
cimport numpy as cnp

def create_random_world(y, x):
    cdef cnp.ndarray [cnp.uint8_t, ndim=2] world = np.random.randint(0, 5, (y, x), np.uint8)
    return world

Notice that we are typing the variable world: this is a Cython extension that allows to generate C code that is more specific to our data and hence faster.
Now the fun starts. First off, we want to make sure we our inner loop can be GIL-free. For that we create a Cython top-level live function that mostly converts NumPy arrays into memoryviews.

def live(cnp.ndarray[cnp.uint8_t, ndim=2] old_world):
    cdef int size_y = old_world.shape[0]
    cdef int size_x = old_world.shape[1]
    cdef cnp.ndarray[cnp.uint8_t, ndim=2] extended_world = np.empty((size_y + 2, size_x + 2), dtype=np.uint8)
    cdef cnp.ndarray[cnp.uint8_t, ndim=2] new_world = np.empty((size_y, size_x), np.uint8)
    cdef cnp.ndarray[cnp.uint8_t, ndim=1] states = np.empty((5,), np.uint8)

    live_core(old_world, extended_world, new_world, states)
    return new_world

The actual conversion to memoryviews will be forced by the live_core function signature — see below — but we still need a layer that is able to convert Python objects into potentially GIL-free representations (objects that do not require the Python machinery and thus bring us to a situation where we can release the GIL). old_world is the input world, new_world will have the output. extended_world and states are live_core internal variables that we will pre-allocate here. Before we present the core algorithm in live_core lets discuss how we are going to algorithmically optimize part of it.

In the game of life, the extremes of the board are connected, for example the left-most column cells will "look" at the states of the right-most cells to compute its new state. In order to avoid a lot of corner case testing — which would add a lot of if statements and hence time to compute — we will implement a temporary extended world — in the above mentioned variable extended_world of dimensions (y+2, x+2). The extended boundaries copy what happens on the other side, as in figure 2.

The purpose of this algorithm is that it allows a slightly more efficient approach when computing the new board: we do not need to if for boundary conditions. This is done at the cost of memory — we now need to store a new and bigger version of the board. Making these kinds of tradeoffs — memory vs computation — is our bread and butter across high performance computing problems. It’s difficult to come up with generalized guidelines to make a decision on the tradeoff: It will depend on the computational and memory cost of a specific algorithm and the resources that you have available.

Here is the code that implements this extended world — notice how the code has no boundary tests thus reducing the computation time of using if:

@cython.boundscheck(False)  (1)
@cython.nonecheck(False)
@cython.wraparound(False)
cdef void get_extended_world(   (2)
        cnp.uint8_t[:,:] world,
        cnp.uint8_t[:,:] extended_world): (3)
    cdef int y = world.shape[0]
    cdef int x = world.shape[1]
    extended_world[1:y+1, 1:x+1] = world (4)

    extended_world[0, 1:x+1] = world[y-1, :]  # top
    extended_world[y+1, 1:x+1] = world[0, :]  # bottom
    extended_world[1:y+1, 0] = world[:, x-1]  # left
    extended_world[1:y+1, x+1] = world[:, 0]  # right

    extended_world[0, 0] = world[y-1, x-1]   # top left
    extended_world[0, x+1] = world[y-1, 0]   # top right
    extended_world[y+1, 0] = world[0, x-1]   # bottom left
    extended_world[y+1, x+1] = world[0, 0]   # bottom right

Notes:

Use deactivate bounds, None and wrap checks
We use cdef to be able to avoid the GIL
We type everything on the function signature
The copy of world into the middle of extended_world can be potentially expensive

The copy of world into the middle of extended_world can potentially pay a steep computational — along with memory — price, but the computational part might be compensated by the easier core algorithm. But at least for pedagogical purposes, it makes the core algorithm substantially simpler which is important for learning.

You might notice that many lines on the function above look like they could be written in a more expedient notation. For example, perhaps:

extended_world[1:y+1, 1:x+1] = world

Could maybe be written as:

extended_world[1:-1, 1:-1] = world

It turns out, however, that we cannot do these kinds of rewrites because we deactivated wraparound in order to avoid paying the price of the generated C code to spend time on wrap verification. Deactivating wraparound means we cannot use negative indexes. The tradeoff is worth it, though: wrap around requires the CPython machinery, which slows things down, so our implementation without it is substantially faster and we need to deactivate it to release the GIL as the wrap-around could would use the Python machinery.

NOTE: Not doing wrap around or bounds checking might result
in Segmentation faults of your code. In you see those
errors make sure to deactivate the decorator during
development. You need to make sure your code is robust
enough to tolerate removing this and the other checks.

We also use other optimizations: a cdef (a function that only has a C-interface and cannot thus be called from Python), complete typing of parameters and variables and use of memoryviews instead of NumPy arrays. If you use cython -a cquadlife.pyx (-a generates a HTML with code interations with C and the CPython machinery) you will see that there is no yellow — i.e. Python interaction lines — in the code above.

The main implementation that changes the state makes use of the extended world. The code below implements the QuadLife game rules. Because it is quite long we will carefully annotate it, including issues that we might have alluded upon in the past.

@cython.boundscheck(False)    (1)
@cython.nonecheck(False)
@cython.wraparound(False)
cdef void live_core(  (2)
    cnp.uint8_t[:,:] old_world,  (3)
    cnp.uint8_t[:,:] extended_world,
    cnp.uint8_t[:,:] new_world,
    cnp.uint8_t[:] states):   (4)
    cdef cnp.uint16_t x, y, i   (5)
    cdef cnp.uint8_t num_alive, max_represented
    cdef int size_y = old_world.shape[0]
    cdef int size_x = old_world.shape[1]
    get_extended_world(old_world, extended_world)  (6)

    for x in range(size_x):
        for y in range(size_y):
            for i in range(5):
                states[i] = 0
            for i in range(3):
                states[extended_world[y, x + i]] += 1
                states[extended_world[y + 2, x + i]] += 1
            states[extended_world[y + 1, x]] += 1
            states[extended_world[y + 1, x + 2]] += 1

            num_alive = states[1] + states[2] + states[3] + states[4] (7)
            if num_alive < 2 or num_alive > 3:
               # Too few or too many neighbors
                new_world[y, x] = 0
            elif old_world[y, x] != 0:
                # Stays alive
                new_world[y, x] = old_world[y, x]
            elif num_alive == 3:  # Will be born
                max_represented = max(states[1], max(states[2], max(states[3], states[4]))) (8)
                if max_represented > 1:
                    # majority rule for color
                    for i in range(1, 5):
                        if states[i] == max_represented: (9)
                            new_world[y, x] = i
                            break
                else:
                    # diversity - use whichever color doesn't exist
                    for i in range(1, 5):
                        if states[i] == 0: (10)
                            new_world[y, x] = i
                            break
            else:
                new_world[y, x] = 0  # stays dead

Notes:

We deactivate a lot of checking machinery: bounds and None checking along with wrap-around
We use a cdef to avoid passing standard Python objects. We also declare the return type as void which is C for nothing at all
We type all parameters
Some internal variables — states and extended_world — were actually allocated outside and we use the memory made available
We type all local variables
Everything is pre-allocated when we call get_extended_world`
Implements sum(states[:1])
Implements max(states[:1])
Implements states[1:].index(max_represented)
Implements states[1:].index(0)

This function is complicated but many of the techniques used here were actually introduced before — they are actually combined in a more realistic example. So read the code carefully and you will understand it all.

You might see a few oddities in the code, namely the replacement of sum and index by less declarative versions. We do this because sum and index would make use of the CPython machinery, and we want to avoid that. A similar argument is valid for the max function, but in that case there is a replacement version that is able to compare values at a time. When you are using general functions, you might want to profile them and maybe replace them with optimized non-general functions.

NOTE: Because the game of life produces nice evolving visualizations, we are going to make a simple GUI. We will be using the Python built-in tkinter module for the GUI along with the external library Pillow for image manipulation.
Our implementation is now complete but we would like to gauge the performance gain that we got from our code

Basic performance analysis

You will be able to find a native Python version on the repository, we will use this version to make some basic comparisons with the Cython one. Running the Python version for a resolution of 1000x1000 for 200 generations on my computer takes slightly less than 1000 seconds, i.e. a bit less than 17 minutes. The Cython code takes 2.5 seconds.

Our implementation is memory intensive. Be very careful if you test it with large resolutions.

Let’s take a lighter approach and generate a cool video from the game and make some considerations about computational complexity—especially with regards about the role of theory to help us writing more efficient programs.

A spacewar example using Quadlife

In the repository you will find code that will generate a video using a starting state that includes "spaceships" and "defenses" (The patterns come from http://www.conwaylife.com). The code itself is not very relevant for optimization purposes so we won’t discuss it here. If you want to replicate it you will need Python’s Pillow library for image processing and ffmpeg to generate videos.

Here is the starting state, color inverted:

And here is the video link

This code will run the game of life on the spaceship model for 400 generations on a map of 400x250 and takes less than a second. For an HD resolution of 1920x1080 with the same 400 generations it would take around 11 seconds, 800 frames would take 22 seconds. 400 frames at the 4K resolution of 3840x2160 would take 48 seconds. At 40 frames per second a 4K game of 90 minutes would take roughly 196 minutes to generate. The silver lining is that the same video would take 54 days to generate in a pure Python solution.

This is an Excerpt from https://www.manning.com/books/fast-python-for-data-science available from Manning.

Top comments (1)

Tiago Antao • Jun 10 '21

I think that there are quite a few cases where going from list to set can be a massive boost (search based on scan vs a dictionary). People are so used to lists in any case that sometimes they just skip such a simple change