Concurrency in Python with FastAPI

I've struggled for a long time with concurrency and parallelism. Let's dive in with the hot-cool-new ASGI framework, FastAPI. It is a concurrent, asyncio-friendly framework. Tiangolo, the author, claims its performance is on par with Go and Node web servers. We're going to get a glimpse of why (spoiler: concurrency).

First things first, let's install FastAPI by following the guide.

Purely IO-bound workloads

We are going to simulate a pure IO operation, such as waiting for a database to finish an operation. Let's create the following server.py file:

# server.py

import time

from fastapi import FastAPI

app = FastAPI()


@app.get("/wait")
def wait():
    duration = 1.
    time.sleep(duration)
    return {"duration": duration}

Run it with

uvicorn server:app --reload

You should see at http://127.0.0.1:8000/wait something like:

{ "duration": 1 }

OK, it works. Now, let's dive into the performance comparison. We could use ApacheBench, but here we are going to implement everything in Python for the sake of clarity.

Let's create a client.py file:

# client.py

import functools
import time

import requests


def timed(N, url, fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.time()
        res = fn(*args, **kwargs)
        stop = time.time()
        duration = stop - start
        print(f"{N / duration:.2f} reqs / sec | {N} reqs | {url} | {fn.__name__}")
        return res

    return wrapper


def get(url):
    resp = requests.get(url)
    assert resp.status_code == 200
    return resp.json()


def sync_get_all(url, n):
    # one request after the other, strictly sequential
    results = [get(url) for _ in range(n)]
    return results


def run_bench(n, funcs, urls):
    for url in urls:
        for func in funcs:
            timed(n, url, func)(url, n)


if __name__ == "__main__":
    urls = ["http://127.0.0.1:8000/wait"]
    funcs = [sync_get_all]
    run_bench(10, funcs, urls)

Let's run this:

python client.py
0.99 reqs / sec | 10 reqs | http://127.0.0.1:8000/wait | sync_get_all

So far, so good: ten sequential requests, each sleeping for one second, complete in about 10.1 seconds, so the client-plus-server overhead is on the order of 10 ms per request. Cool!

Threadpools client-side

Now, we are going to simulate multiple simultaneous connections. This is usually a problem we want to have: the more users of our web API or app, the more simultaneous requests. The previous test wasn't very realistic: users rarely browse sequentially, but rather appear simultaneously, forming bursts of activity.

We are going to implement concurrent requests using a threadpool:

# client.py

...
from concurrent.futures import ThreadPoolExecutor as Pool
...


def thread_pool(url, n, limit=None):
    limit_ = limit or n
    with Pool(max_workers=limit_) as pool:
        result = pool.map(get, [url] * n)
    return result


if __name__ == "__main__":
    urls = ["http://127.0.0.1:8000/wait"]
    run_bench(10, [sync_get_all, thread_pool], urls)

We get:

0.99 reqs / sec | 10 reqs | http://127.0.0.1:8000/wait | sync_get_all
9.56 reqs / sec | 10 reqs | http://127.0.0.1:8000/wait | thread_pool

This looks 10x better! Still, there is an overhead of roughly 46 ms for 10 requests (1.05 s instead of the ideal 1 s). Where does that come from?

Also, how come the server was able to handle simultaneous requests, since we only wrote synchronous (regular) Python code? There is no async or await anywhere...

Well, this is how FastAPI works behind the scenes: it runs every synchronous request in a threadpool. So, we have threadpools both client-side and server-side!
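Conceptually, handling a synchronous def route looks roughly like the sketch below (a simplification, not FastAPI's actual source), built on Starlette's run_in_threadpool helper:

# sketch: how a blocking `def` endpoint is run, roughly
import time

from starlette.concurrency import run_in_threadpool


def blocking_endpoint():
    time.sleep(1.0)  # stands in for our wait() route
    return {"duration": 1.0}


async def handle_request():
    # the blocking call runs in a worker thread,
    # so the event loop stays free to accept other requests meanwhile
    return await run_in_threadpool(blocking_endpoint)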

Let's lower the duration:

# server.py

...

@app.get("/wait")
def wait():
    duration = 0.05
    time.sleep(duration)
    return {"duration": duration}

Let's also run the benchmark 100 times:

# client.py

...

if __name__ == "__main__":
    urls = ["http://127.0.0.1:8000/wait"]
    run_bench(100, [sync_get_all, thread_pool], urls)
15.91 reqs / sec | 100 reqs | http://127.0.0.1:8000/wait | sync_get_all
196.06 reqs / sec | 100 reqs | http://127.0.0.1:8000/wait | thread_pool

It looks like there is some overhead on the server side: we should get $100 / 0.05 = 2000$ requests per second if everything worked without any friction, yet we only reach about 196.
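To figure out whether that ceiling comes from the client or the server, one option is to vary the client-side concurrency and watch where throughput stops scaling. Here is a possible sketch (my own addition, to be dropped into client.py) reusing the limit parameter that thread_pool already accepts:

# sketch: sweep the number of client threads and watch where throughput plateaus
url = "http://127.0.0.1:8000/wait"
for limit in (1, 10, 25, 50, 100):
    print(f"max_workers = {limit}")
    timed(100, url, thread_pool)(url, 100, limit=limit)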

async routes

There is another way to declare a route with FastAPI, using the async and await keywords.

# server.py

import asyncio
...

@app.get("/asyncwait")
async def asyncwait():
    duration = 0.05
    await asyncio.sleep(duration)
    return {"duration": duration}

Now just add this route to the client:

# client.py

if __name__ == "__main__":
    urls = ["http://127.0.0.1:8000/wait", "http://127.0.0.1:8000/asyncwait"]
    run_bench(100, [sync_get_all, thread_pool], urls)

And run the benchmark:

15.66 reqs / sec | 100 reqs | http://127.0.0.1:8000/wait | sync_get_all
195.41 reqs / sec | 100 reqs | http://127.0.0.1:8000/wait | thread_pool
15.52 reqs / sec | 100 reqs | http://127.0.0.1:8000/asyncwait | sync_get_all
208.06 reqs / sec | 100 reqs | http://127.0.0.1:8000/asyncwait | thread_pool

We see a small improvement. But isn't asyncio supposed to be very performant? And Uvicorn is based on uvloop, described as:

Ultra fast asyncio event loop.

Maybe the overhead comes from the client? Threadpools maybe?

Drinking the asyncio kool-aid

To check this, we're going to implement a fully-asynchronous client. This is a bit more involved. Yes, this means asyncs and awaits. I know you secretly enjoy these.

Just do pip install aiohttp, then:

# client.py

import asyncio
...
import aiohttp

...


async def aget(session, url):
    async with session.get(url) as response:
        assert response.status == 200
        json = await response.json()
        return json


async def gather_limit(n_workers, *tasks):
    semaphore = asyncio.Semaphore(n_workers)

    async def sem_task(task):
        async with semaphore:
            return await task

    return await asyncio.gather(*(sem_task(task) for task in tasks))


async def aget_all(url, n, n_workers=None):
    limit = n_workers or n
    async with aiohttp.ClientSession() as session:
        result = await gather_limit(limit, *[aget(session, url) for _ in range(n)])
        return result


def async_main(url, n):
    return asyncio.run(aget_all(url, n))

We add async_main to the benchmark, plus a run with 1,000 requests for the two concurrent clients only (the sequential one would take too long). Note the asyncio.Semaphore in gather_limit: it caps the number of requests in flight, mirroring the max_workers limit of the threadpool client.

# client.py

if __name__ == "__main__":
    urls = ["http://127.0.0.1:8000/wait", "http://127.0.0.1:8000/asyncwait"]
    funcs = [sync_get_all, thread_pool, async_main]
    run_bench(100, funcs, urls)
    run_bench(1000, [thread_pool, async_main], urls)

The results can be surprising:

15.84 reqs / sec | 100 reqs | http://127.0.0.1:8000/wait | sync_get_all
191.74 reqs / sec | 100 reqs | http://127.0.0.1:8000/wait | thread_pool
187.36 reqs / sec | 100 reqs | http://127.0.0.1:8000/wait | async_main
15.69 reqs / sec | 100 reqs | http://127.0.0.1:8000/asyncwait | sync_get_all
217.35 reqs / sec | 100 reqs | http://127.0.0.1:8000/asyncwait | thread_pool
666.23 reqs / sec | 100 reqs | http://127.0.0.1:8000/asyncwait | async_main
234.24 reqs / sec | 1000 reqs | http://127.0.0.1:8000/wait | thread_pool
222.16 reqs / sec | 1000 reqs | http://127.0.0.1:8000/wait | async_main
316.08 reqs / sec | 1000 reqs | http://127.0.0.1:8000/asyncwait | thread_pool
1031.05 reqs / sec | 1000 reqs | http://127.0.0.1:8000/asyncwait | async_main

It appears that the bottleneck was indeed on the client side, presumably because spawning 100 threads and opening a fresh connection for every requests.get call costs more than multiplexing everything over a single aiohttp session. When both sides are asynchronous, and there is a lot of IO, the speed is impressive!

CPU-bound workloads

This is all great, until some heavy computation is required. We refer to these as CPU-bound workloads, as opposed to IO-bound. Inspired by David Beazley's legendary live-coding session, we are going to use a naive recursive implementation of Fibonacci numbers to simulate heavy computations.

# server.py

...

def fibo(n):
    if n < 2:
        return 1
    else:
        return fibo(n - 1) + fibo(n - 2)


@app.get("/fib/{n}")
def fib(n: int):
    return {"fib": fibo(n)}

Now, if we open two terminals, run curl -I http://127.0.0.1:8000/fib/42 in one and python client.py in the other, we get the following results:

8.75 reqs / sec | 100 reqs | http://127.0.0.1:8000/wait | sync_get_all
54.94 reqs / sec | 100 reqs | http://127.0.0.1:8000/wait | thread_pool
60.64 reqs / sec | 100 reqs | http://127.0.0.1:8000/wait | async_main
9.52 reqs / sec | 100 reqs | http://127.0.0.1:8000/asyncwait | sync_get_all
53.02 reqs / sec | 100 reqs | http://127.0.0.1:8000/asyncwait | thread_pool
46.81 reqs / sec | 100 reqs | http://127.0.0.1:8000/asyncwait | async_main
72.87 reqs / sec | 1000 reqs | http://127.0.0.1:8000/wait | thread_pool
122.97 reqs / sec | 1000 reqs | http://127.0.0.1:8000/wait | async_main
72.36 reqs / sec | 1000 reqs | http://127.0.0.1:8000/asyncwait | thread_pool
51.73 reqs / sec | 1000 reqs | http://127.0.0.1:8000/asyncwait | async_main

It's not that bad, but a bit disappointing: the previously fastest combination (the asyncwait route with the async_main client) now has 20x lower throughput.

What's happening here? In Python, the Global Interpreter Lock (GIL) lets only one thread execute Python bytecode at a time. While one request is busy with heavy CPU work, the other requests (and the event loop itself) barely get any CPU time, so everything slows down. We will see later how to take care of this.
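To convince yourself that threads don't help with CPU-bound work, here is a small standalone experiment (my own sketch, not part of the app) comparing two sequential calls to fibo(30) with the same two calls spread over two threads:

# sketch: the GIL in action, threads do not speed up CPU-bound code
import time
from concurrent.futures import ThreadPoolExecutor


def fibo(n):
    return 1 if n < 2 else fibo(n - 1) + fibo(n - 2)


start = time.time()
fibo(30)
fibo(30)
print(f"sequential : {time.time() - start:.2f} s")

start = time.time()
with ThreadPoolExecutor(max_workers=2) as pool:
    list(pool.map(fibo, [30, 30]))
# roughly the same (or worse): only one thread holds the GIL at a time
print(f"two threads: {time.time() - start:.2f} s")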

For now, let's see what happens if we make the recursion itself asynchronous. Let's add:

# server.py

...


async def afibo(n):
    if n < 2:
        return 1
    else:
        fib1 = await afibo(n - 1)
        fib2 = await afibo(n - 2)
        return fib1 + fib2


@app.get("/asyncfib/{n}")
async def asyncfib(n: int):
    res = await afibo(n)
    return {"fib": res}

Let's also add a timing middleware to our FastAPI app:

# server.py
...
from fastapi import FastAPI, Request

@app.middleware("http")
async def add_process_time_header(request: Request, call_next):
    start_time = time.time()
    response = await call_next(request)
    process_time = time.time() - start_time
    response.headers["X-Process-Time"] = str(process_time)
    return response

Now let's test the speed:

curl -D - http://127.0.0.1:8000/fib/30
HTTP/1.1 200 OK
server: uvicorn
content-length: 15
content-type: application/json
x-process-time: 0.17467308044433594

{"fib":1346269}

And with async:

curl -D - http://127.0.0.1:8000/asyncfib/30
HTTP/1.1 200 OK
server: uvicorn
content-length: 15
content-type: application/json
x-process-time: 0.46001315116882324

{"fib":1346269}

Not bad, considering that each of the roughly 2.7 million recursive calls now goes through the event loop, and the processing time still less than triples. But we also see a limitation here: in Python, awaiting the two recursive calls does not run them in parallel, everything still happens on a single thread. The same pattern in Julia, with tasks spawned across threads, would lead to an actual speed-up (parallelism)!

Gunicorn and multiprocessing

So far we've run FastAPI with a single Uvicorn process. Uvicorn can also be run under Gunicorn: Gunicorn forks a master process into n worker processes, and each worker runs Uvicorn (with its asynchronous uvloop event loop). This means:

  • Each worker is concurrent
  • The worker pool implements parallelism

This way, we get the best of both worlds: concurrency (the event loop and threadpool inside each worker) and parallelism (multiple worker processes).
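For a longer-lived deployment, these settings can also live in a small Gunicorn configuration file instead of command-line flags. A possible sketch (the file name gunicorn_conf.py is arbitrary), started with gunicorn server:app -c gunicorn_conf.py:

# gunicorn_conf.py (sketch)
import multiprocessing

bind = "127.0.0.1:8000"
workers = multiprocessing.cpu_count()            # parallelism: one process per CPU core
worker_class = "uvicorn.workers.UvicornWorker"   # concurrency: each worker runs an async event loop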

Let's try this with the last setup, running the benchmark while another terminal asks for the 42nd Fibonacci number:

pip install gunicorn
gunicorn server:app -w 2 -k uvicorn.workers.UvicornWorker --reload

we get the following results:

19.02 reqs / sec | 100 reqs | http://127.0.0.1:8000/wait | sync_get_all
216.84 reqs / sec | 100 reqs | http://127.0.0.1:8000/wait | thread_pool
223.52 reqs / sec | 100 reqs | http://127.0.0.1:8000/wait | async_main
18.80 reqs / sec | 100 reqs | http://127.0.0.1:8000/asyncwait | sync_get_all
400.12 reqs / sec | 100 reqs | http://127.0.0.1:8000/asyncwait | thread_pool
208.68 reqs / sec | 100 reqs | http://127.0.0.1:8000/asyncwait | async_main
241.06 reqs / sec | 1000 reqs | http://127.0.0.1:8000/wait | thread_pool
311.40 reqs / sec | 1000 reqs | http://127.0.0.1:8000/wait | async_main
433.80 reqs / sec | 1000 reqs | http://127.0.0.1:8000/asyncwait | thread_pool
1275.48 reqs / sec | 1000 reqs | http://127.0.0.1:8000/asyncwait | async_main   

This is on par with (if not a bit better than!) the single Uvicorn process, even though one worker can be busy computing fib(42): the other worker keeps serving requests.

The final files (client and server) are available as a GitHub gist.

Further resources

I wholeheartedly recommend this amazing live-coding session by David Beazley. You may want to look up WebSockets first, just to know that they open a bidirectional channel between client and server.

You can also read this detailed answer on Stack Overflow to grasp the differences between concurrency and parallelism in Python.
