Monitoring sync and async network calls in Python using TIG stack

Jane Radetska

Republished by author. First appeared in Web Performance Calendar 2020.

Web applications and API endpoints are known to perform backend calls. Often that is all the application does: it fetches data from a couple of backends, combines it, and produces a response.

Monitoring how much time fetching data takes is essential. There are plenty of production-ready buy-and-snap-on solutions that provide such monitoring, but they might not be a good fit for some cases. And I think it's fun to dig deeper into things to get a better understanding of how it all works.

Let's look at code examples that use popular Python networking libraries and are instrumented to report HTTP request execution time.

What I'm going to explore in this post

I'm going to compare how request timings look when fetching HTML pages using the requests library versus asynchronously fetching the same HTML pages using the aiohttp library. I aim to visualize the difference in timings and to introduce tools that can be used for such monitoring.

To be fair, the requests library has plugins that enable asynchronous IO, and there are many other ways to achieve this in Python... I picked aiohttp as it provides neat request timing tracing opportunities, and I use this library a lot in the wild.

To monitor request timings we will use the Telegraf, InfluxDB and Grafana (TIG) stack. These tools are very easy to set up locally, open source, free for personal usage, and can be used in a production environment.

The Running code examples section describes in detail how to run the example code and set up the monitoring infrastructure (Telegraf, InfluxDB, Grafana).

All code from this post is available in the repo.

Example 0: monitor requests request time

Let's dive into the first Python code example. Here's what it does:

  • in a forever loop, executes two HTTP requests using the requests Python library
  • reports request times and request exceptions to Telegraf

Here's request execution time plotted on the dashboard:
Request execution time plot

Full code of Example 0 can be found in example-0-requests-send-stats.py.

The high-level execution flow can be followed from the main part of the program:

if __name__ == '__main__':
    while True:
        result = call_python_and_mozilla_using_requests()
        print(result)
        time.sleep(3)

Inside call_python_and_mozilla_using_requests, two simple HTTP requests are performed one by one, and their response text is used to compose the result:

def call_python_and_mozilla_using_requests():
    py_response = get_response_text('https://www.python.org/')
    moz_response = get_response_text('https://www.mozilla.org/en-US/')
    return (
        f'Py response piece: {py_response[:60].strip()}... ,\n'
        f'Moz response piece: {moz_response[:60].strip()}...'
    )

The get_response_text function executes an HTTP request for a given URL, with primitive exception handling, and hooks in to report the request execution time:

def profile_request(start_time, response, *args, **kwargs):
    elapsed_time = round((
        time.perf_counter() - start_time
    ) * 1000)
    send_stats(
        'requests_request_exec_time',
        elapsed_time,
        {'domain': URL(response.url).raw_host}
    )


def get_response_text(url):
    try:
        request_complete_callback = partial(
            profile_request,
            time.perf_counter()
        )
        response = requests.get(
            url,
            hooks={'response': request_complete_callback}
        )
        response.raise_for_status()
        return response.content.decode()
    except RequestException as e:
        send_stats(
            'requests_request_exception',
            1,
            {'domain': URL(url).raw_host, 'exception_class': e.__class__.__name__}
        )
        return f'Exception occurred: {e}'

This code uses the requests library (docs). Basic usage to get text content from a URL is as follows:

response = requests.get(url).content.decode()

requests.get accepts an optional hooks argument, where a function to be called after the request completes is specified - request_complete_callback.
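
As a quick standalone illustration (the print_status callback here is hypothetical, not part of the example code), requests invokes a response hook with the completed Response object once the request finishes:

import requests

def print_status(response, *args, **kwargs):
    # invoked by requests after the response has been received
    print(response.status_code, response.url)

requests.get('https://www.python.org/', hooks={'response': print_status})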

This callback function may look funny if you're not familiar with functional programming. partial(profile_request, time.perf_counter()) is itself a function. It's the same function as profile_request, but with the first argument already filled in: time.perf_counter() was passed as the start_time argument. This trick supplies the correct start_time for each request, as request_complete_callback is constructed anew for each request, while the code for sending the request execution time stays isolated in the separate profile_request function. We can rewrite that as follows:

def get_response_text(url):
    try:
        start_time = time.perf_counter()

        def profile_request(response, *args, **kwargs):
            elapsed_time = round((time.perf_counter() - start_time) * 1000)
            send_stats('requests_request_exec_time', elapsed_time, ...)

        response = requests.get(url, hooks={'response': profile_request})

And it's going to work just fine. But now there's a function defined inside a function, and get_response_text is bloated with profiling stuff, which is not something I like.

You can read more about partial functions and Python functools.
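
To make the trick concrete, here's a tiny standalone illustration (the power function is hypothetical):

from functools import partial

def power(base, exponent):
    return base ** exponent

# square is power with the exponent argument pre-filled;
# only base remains to be passed at call time
square = partial(power, exponent=2)
print(square(5))  # 25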

time.perf_counter() is used to measure execution time in Python (docs). time.perf_counter() returns seconds as a float, which are converted to milliseconds using * 1000.
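
For example, this measures a ~50 msec sleep in milliseconds:

import time

start_time = time.perf_counter()  # fractional seconds from a monotonic clock
time.sleep(0.05)
elapsed_time = round((time.perf_counter() - start_time) * 1000)
print(elapsed_time)  # ~50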

Sending stats

The send_stats function is used to report measurements to Telegraf: the metric name is 'requests_request_exec_time', the metric value is the time the request execution took, and the tags include additional useful information (the domain of the URL).
get_response_text also invokes send_stats when an exception occurs, passing a different metric name this time - 'requests_request_exception'.

I have another post that describes ways to send stats from a Python program to Telegraf.

In short, send_stats accepts a metric name, a metric value and a tags dictionary. Those are converted to one string and sent to the socket on which Telegraf listens for measurement data. Telegraf forwards the received metrics to a database (InfluxDB). The Grafana dashboard queries the database to put a dot on the graph for each reported metric value.
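
The details are in that post; here's a minimal sketch of send_stats, assuming Telegraf's socket_listener input listens on UDP port 8094 with the influx data format (the address and port are assumptions):

import socket

TELEGRAF_ADDRESS = ('localhost', 8094)  # assumed socket_listener address

def send_stats(metric, value, tags=None):
    # InfluxDB line protocol: measurement,tag1=v1,tag2=v2 value=<value>
    tags_part = ''.join(f',{k}={v}' for k, v in (tags or {}).items())
    line = f'{metric}{tags_part} value={value}'
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(line.encode(), TELEGRAF_ADDRESS)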

profile decorator

A piece of code adapted here is a decorator suitable for any function (async, sync, a class method or a plain function) that measures the execution time of the decorated function.
The profile decorator is used to profile the total execution time of the functions call_python_and_mozilla_using_requests and call_python_and_mozilla_using_aiohttp (see the following examples).
Don't confuse it with another useful tool - line_profiler - which also provides a profile decorator.
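
The adapted code lives in the linked repo; a minimal sketch of what such a dual-purpose decorator might look like, reusing the send_stats helper sketched above:

import asyncio
import functools
import time

def profile(func):
    # choose an async or a sync wrapper depending on the decorated function
    if asyncio.iscoroutinefunction(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            start_time = time.perf_counter()
            try:
                return await func(*args, **kwargs)
            finally:
                elapsed_time = round((time.perf_counter() - start_time) * 1000)
                send_stats(f'{func.__name__}_exec_time', elapsed_time, {})
    else:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start_time = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                elapsed_time = round((time.perf_counter() - start_time) * 1000)
                send_stats(f'{func.__name__}_exec_time', elapsed_time, {})
    return wrapper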

requests execution time on dashboard

Let's run this example and set up all the monitoring tools. See Running code examples on how to run the example code and set up the monitoring infrastructure.

We can configure a panel that shows request execution time:
Request execution time configure panel

The blue dots of total execution time roughly correspond to the sum of the time the requests to python.org and mozilla.org took (green and yellow dots). They measure approximately 150 msec on average.

Need more exceptions

If we change 'www.python.org' to 'www.python1.org' in the function call_python_and_mozilla_using_requests, exceptions appear in the terminal output, and exception metrics are sent to Telegraf:

    Reported stats: requests_request_exception=1, tags={'domain': 'www.python1.org', 'exception_class': 'ConnectionError'}
    'Py response piece: ...Exception occurred: Cannot conn...

Configure a separate Grafana panel to see exceptions on the dashboard:
Configure exceptions panel

The exception class is sent as a tag along with the metric value. This gives us the ability to plot separate lines for exceptions of different classes. To achieve this, pick 'group by - tag(exception_class)' when editing the request exceptions panel.

Example 0 improved: reuse connection

The code of Example 0 can be improved to reuse the same connection for all calls performed in that forever-running while loop - here's an improved version.

The only significant code change is this:

...
session = requests.Session()
while True:
    result = call_python_and_mozilla_using_requests(session)
...

Session creation is moved out of the while loop. Now the underlying connection is established once and reused for all subsequent requests.
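
The fetch helpers need a small matching change to accept the shared session; a condensed sketch of the improved flow (profiling hooks omitted, signatures assumed):

import time

import requests

def get_response_text(session, url):
    # session.get draws from the Session's connection pool, so the
    # TCP/TLS handshake happens once per host instead of once per call
    response = session.get(url)
    response.raise_for_status()
    return response.content.decode()

if __name__ == '__main__':
    session = requests.Session()
    while True:
        print(get_response_text(session, 'https://www.python.org/')[:60])
        time.sleep(3)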

Let's compare how much time request execution takes when a connection is reused:
Compare timing when connection is reused and not, for requests lib

The dots on the left are measurements for the original version of Example 0, and the ones on the right came from the improved version. We can definitely notice how the total execution time gets lower, below 100 msec on average.

Example 1: monitor aiohttp request time

Let's dive into the next code example. Here's what it does:

  • in a forever loop, executes two asynchronous HTTP requests using aiohttp
  • hooks into aiohttp request lifecycle signals
  • reports request time and request exceptions to Telegraf

Full code of Example 1 can be found in example-1-aiohttp-send-stats-basic.py.

The high-level execution flow is similar to Example 0; only the way content is fetched from the URLs differs.

The tale of two HTTP requests

Let's start with the function call_python_and_mozilla_using_aiohttp, which executes two asynchronous HTTP requests and returns pieces of the response content. It is the sister of call_python_and_mozilla_using_requests from Example 0:

async def get_response_text(url):
    try:
        async with ClientSession(trace_configs=[Profiler()]) as session:
            async with session.get(url) as response:
                response.raise_for_status()
                return await response.text()
    except ClientError as e:
        return f'Exception occurred: {e}'

@profile
async def call_python_and_mozilla_using_aiohttp():
    py_response, moz_response = await asyncio.gather(
        get_response_text('https://www.python.org/'),
        get_response_text('https://www.mozilla.org/en-US/')
    )
    return (
        f'Py response piece: {py_response[:60].strip()}... ,\n'
        f'Moz response piece: {moz_response[:60].strip()}...'
    )

Here, the aiohttp library's ClientSession is used to execute the request (docs). Basic usage to get text content from a URL is as follows:

async with ClientSession() as session:
    async with session.get(url) as response:
        return await response.text()

which is basically what happens in get_response_text. get_response_text also calls response.raise_for_status(), which raises an exception when the response status code indicates an error or when a timeout occurs. The exception is silenced in get_response_text, so get_response_text always returns a str, either with the response content or with an exception message.

call_python_and_mozilla_using_aiohttp takes care of calling the two URLs using asyncio.gather. The execution order for call_python_and_mozilla_using_aiohttp is on the right:
Async and sync flow

await asyncio.gather returns the result after both of these requests are complete. The total execution time is approximately the time of the longest of the two requests. You're probably aware that this is called non-blocking IO. Instead of blocking, this kind of IO operation frees the execution thread until it is needed again.

Synchronous, blocking IO, like in Example 0, has a different execution order (see the chart above, on the left). The total execution time is approximately the sum of both requests' execution times. For positive numbers, it's always true that A + B > MAX(A, B): for example, if the python.org request takes 60 msec and the mozilla.org request takes 75 msec, the synchronous version needs roughly 135 msec while the asynchronous one needs only roughly 75 msec. Hence, asynchronous execution takes less time than synchronous execution, provided unlimited CPU is available in both cases.

On the panel that shows the requests' execution times and their total execution time, it's possible to notice that the total execution time call_python_and_mozilla_using_aiohttp_exec_time almost matches the time of the longer-running request:
Async requests execution time and total time of both requests

The total execution time for both requests is 75-100 msec.

Next, we're going to look at how execution time of each aiohttp request is reported.

aiohttp request signals

aiohttp provides a way to execute a custom function when HTTP request execution progresses through its lifecycle stages: before the request is sent, when the connection is established, after a response chunk is received, etc. For that, a tracer object is passed to aiohttp.ClientSession via the trace_configs argument:

class Profiler(TraceConfig):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.on_request_start.append(on_request_start)
        self.on_request_end.append(on_request_end)
        self.on_request_exception.append(on_request_exception)

...
async with ClientSession(trace_configs=[Profiler()]) as session:
...

Profiler is a subclass of aiohttp.TraceConfig. It "hooks up" functions that are going to be executed when a request starts (on_request_start), when it ends (on_request_end) and when a request exception is encountered (on_request_exception):

async def on_request_start(session, trace_config_ctx, params):
    trace_config_ctx.request_start = asyncio.get_event_loop().time()

async def on_request_end(session, trace_config_ctx, params):
    elapsed_time = round((
        asyncio.get_event_loop().time() - trace_config_ctx.request_start
    ) * 1000)
    send_stats(
        'aiohttp_request_exec_time',
        elapsed_time,
        {'domain': params.url.raw_host}
    )

async def on_request_exception(session, trace_config_ctx, params):
    send_stats(
        'aiohttp_request_exception',
        1,
        {'domain': params.url.raw_host, 'exception_class': params.exception.__class__.__name__}
    )

Notice how the timestamp is computed:

asyncio.get_event_loop().time()

It is recommended to use the event loop's internal monotonic clock to compute time deltas in asynchronous code.

The hook functions have the arguments session, trace_config_ctx and params. Let's look at what they are.

session is an instance of aiohttp.ClientSession.

trace_config_ctx is a context that is passed through the callbacks. Custom values can be added to it when the request is made:

await session.get(url, trace_request_ctx={'flag': 'red'})
...

async def on_request_end(session, trace_config_ctx, params):
    if trace_config_ctx.trace_request_ctx['flag'] == 'red':
        ....

This way, a hook function can be programmed to behave differently for different request calls or to report additional data.

The request end hook uses the trace_config_ctx.request_start value to compute the total time the request took; trace_config_ctx.request_start is set in the request start hook.

The params argument in on_request_end is an aiohttp.TraceRequestEndParams and as such has a url property. The url property is of the yarl.URL type, and params.url.raw_host returns the domain of the URL that was requested. The domain is sent as a tag for the metric, which makes it possible to plot separate lines for different URLs.
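
For example:

from yarl import URL

url = URL('https://www.mozilla.org/en-US/')
print(url.raw_host)  # www.mozilla.org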

Calling asynchronous code from synchronous

To call an async function in a sync execution context, special tooling is used, which is adapted from another publication. I'm not going to dive into Python's asynchronous ways in this post. Read more about Python's asyncio, it's pretty cool.
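
That helper isn't reproduced here; for a simple case like this one, the standard library's asyncio.run (Python 3.7+) does the same job of driving an async entry point from synchronous code:

import asyncio

async def main_async():
    # async entry point: create sessions, run the fetch loop, etc.
    ...

if __name__ == '__main__':
    asyncio.run(main_async())  # runs the coroutine from synchronous code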

Compare results for Example 0 and 1

Compare example 0 and 1
The connection is not reused in either case here. The execution time for the async version is lower, as expected.

Example 2: more, more stats

aiohttp provides hooks to measure more than just request execution time and request exceptions.

It's possible to report stats for:

  • DNS resolution time
  • DNS cache hits/misses
  • time spent waiting for an available connection
  • connection establishment time
  • a connection being reused
  • a redirect happening
  • a response content chunk being received
  • a request chunk being sent

Impressive, isn't it? Documentation on tracing in aiohttp is here.

Let's add more request lifecycle hooks:

class Profiler(TraceConfig):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.on_request_start.append(on_request_start)
        self.on_request_end.append(on_request_end)
        self.on_request_redirect.append(on_request_redirect)
        self.on_request_exception.append(on_request_exception)
        self.on_connection_queued_start.append(on_connection_queued_start)
        self.on_connection_queued_end.append(on_connection_queued_end)
        self.on_connection_create_start.append(on_connection_create_start)
        self.on_connection_create_end.append(on_connection_create_end)
        self.on_dns_resolvehost_start.append(on_dns_resolvehost_start)
        self.on_dns_resolvehost_end.append(on_dns_resolvehost_end)
        self.on_response_chunk_received.append(on_response_chunk_received)
        self.on_connection_reuseconn.append(on_connection_reuseconn)
        self.on_dns_cache_hit.append(on_dns_cache_hit)
        self.on_dns_cache_miss.append(on_dns_cache_miss)

I won't bore you with the code for each function like on_dns_resolvehost_end; it's quite similar to on_request_end. Full code of Example 2 is here.
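
For illustration, here's a sketch of the DNS resolution pair following the same pattern as the on_request_* hooks (the metric name is an assumption):

async def on_dns_resolvehost_start(session, trace_config_ctx, params):
    trace_config_ctx.dns_resolve_start = asyncio.get_event_loop().time()

async def on_dns_resolvehost_end(session, trace_config_ctx, params):
    elapsed_time = round((
        asyncio.get_event_loop().time() - trace_config_ctx.dns_resolve_start
    ) * 1000)
    # params is aiohttp.TraceDnsResolveHostEndParams and carries the host name
    send_stats(
        'aiohttp_dns_resolve_time',
        elapsed_time,
        {'domain': params.host}
    )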

Reported stats on dashboard for example 2:
aiohttp reporting more stats

We can see that DNS resolution takes a couple of milliseconds and happens for every call, while establishing the connection takes 30-40 msec and also happens for every call. Also, the DNS cache is never hit; DNS is resolved for every call.

We can definitely improve on that - in Example 3.

Example 3: aiohttp reuse session

Let's modify the Example 2 code so that ClientSession is created once, outside the while loop:

async def main_async():
    async with ClientSession(trace_configs=[Profiler()]) as session:
        while True:
            result = await call_python_and_mozilla_using_aiohttp(session)
            print(result)
            await asyncio.sleep(3)

And check out how stats look now:
aiohttp reuse session timings

There's only one dot for connection establishment, and one DNS resolution per domain. There are plenty of dots for the connection reuse event.
The total execution time is below 50 msec. Cool.

Full source code of Example 3 is here.

Compare sync and async URL fetch, with and without reusing connection

Total time for both requests (very approximate):

          Connection not reused    Connection reused
Sync      150 msec                 80 msec
Async     80 msec                  40 msec
