Ricardo Iván Vieitez Parra for Apeleg Limited

Posted on Sep 5, 2022 • Edited on Aug 6, 2024 • Originally published at apeleg.com

Benchmarking in C (for x86 and x64)

#benchmarking #performance #c #x86

Benchmarks are extremely useful to see how performant some code or operation is and a requirement for any empirical decision making. After all, how can we know with any certainty if some library is faster than another one without testing?

The basics of benchmarking

At its core, benchmarking is quite a simple idea: we want to know how long a certain operation takes. Some simple (and naïve) pseudocode representing this can look like this:

start <- getCurrentTime()
runOperation()
end <- getCurrentTime()

print("Operation took: ", end - start)

Astute readers will notice a few issues with the above snippet. Let's go over some of these.

Operation time

As with any real-world operation, how long runOperation() takes is variable, and by having a single measurement we risk that it may not be representative. We can mitigate this by having several measurements and computing statistics, such as the average time.

A second and related issue is that in many applications there is a warm-up, meaning that the first few times runOperation() is executed it may take longer than subsequent executions, and that, because the operation is executed repeatedly, the time it takes after warming up may be more representative.

Time measurements

Another consideration is how exactly getCurrentTime() works, as we need to understand what we are measuring. Ideally, we are after time measurements that don't interfere with runOperation(), that are high-resolution (so that our measurements are accurate) and that are monotonic.

Out of order executions

A further complication is that just because we wrote our code to execute runOperation() in between getCurrentTime() calls, it doesn't mean that the the code will actually be executed in this manner, because processors often implement out-of-order execution. With benchmarks, we want time measurements to be executed exactly as specified and not while runOperation() is running.

Towards an implementation

Most modern operating systems provide some sort of abstraction for time measurements, such as clock_gettime(2) in POSIX systems or QueryPerformanceCounter in Windows systems. While these abstractions can be useful, it is possible to use native instructions to run accurate benchmarks without relying on any system or library calls.

In the x86 (and the nowadays more common 64-bit x86-64 aka x64 aka amd64) architecture, we can use the rdtsc instruction to read the Time Stamp Counter (TSC) register. If we read the TSC before and after executing our code of interest, we can determine how many CPU cycles elapsed, which we can use for comparisons.

Using the TSC

We can see how the TSC can be used for benchmarks in Gabriele Paoloni's white paper. This white paper presents a good introduction to some of the challenges discussed when carrying out measurements and a viable approach for an implementation.

The basic approach can be summarised as follows:

CPUID ; serialise
RDTSC ; read the clock
MOV start_cycles_high, edx
MOV start_cycles_low, eax
; Call the function to benchmark here
RDTSCP
MOV end_cycles_high, edx
MOV end_cycles_low, eax

Note that cpuid is used for serialisation, i.e., ensuring that all previous instructions before the benchmarking code have been executed. In addition, rdtscp is called for the end measurement. This is because, unlike rdtsc, rdtscp waits for all previous instructions to finish executing before reading the counter. The reason why we need cpuid when starting the benchmark and we can't just use rdtscp is that we also want the function to benchmark to start executing after all other code has finished executing.

Improving measurements

One of the first things we can address is that while a serialisation instruction is needed, cpuid is a relatively onerous instruction. lfence is an less onerous instruction that is also serialising, and in non-Intel manufacturers, mfence, which can be even more performant, is also serialising. Hence, we can use one of these instructions instead of cpuid (let's call it xfence).

A second change we can make is that, to make it more explicit that the mov instructions following rdtscp must be executed in that order, and after the code to be benchmarked, we can call xfence between rdtscp and the first mov. Thus we get:

XFENCE ; serialise
RDTSC ; read the clock
MOV start_cycles_high, edx
MOV start_cycles_low, eax
; Call the function to benchmark here
RDTSCP
XFENCE ; serialise
MOV end_cycles_high, edx
MOV end_cycles_low, eax

Warm-up measurements

As previously discussed, executing instructions the first few times might take a different time than executions that follow them. We can address this by calling rdtsc and xfence a few times before our benchmark code, so that we get something like this:

XFENCE
RDTSC
XFENCE
RDTSC
XFENCE
RDTSC
XFENCE ; serialise
RDTSC ; read the clock
MOV start_cycles_high, edx
MOV start_cycles_low, eax
; Call the function to benchmark here
RDTSCP
XFENCE ; serialise
MOV end_cycles_high, edx
MOV end_cycles_low, eax

Measuring in a loop

Now that we have established how to conduct measurements, we can start writing some code to run benchmark our code.

for(j = 0, k = 0; j < ITERATIONS; j++) {
        START_MEASUREMENT(cycles_high_s, cycles_low_s);
        ret = runOperation();
        if (likely(0 == ret)) {
                END_MEASUREMENT(cycles_high_e, cycles_low_e);
                start = ( ((u64) cycles_high_s << 040) | cycles_low_s );
                end   = ( ((u64) cycles_high_e << 040) | cycles_low_e );
                deltas[k++] = end - start;
        } else {
                fprintf(stderr, "Unexpected error occurred benchmarking type %s.\n", "NULL");
                printf("[%s] ERROR\n", "NULL");
        }
}
compute_statistics(deltas, k, &min, &max, &mean, &variance);

This code won't yet compile, because there are many things yet undefined, but we can see that it calls runOperation() for ITERATIONS time, calling START_MEASUREMENT (which we can later define as a macro) before starting and calling END_MEASUREMENT once it's done. In addition, we're calling compute_statistics at the end of the loop, which will compute some statistics about the operation (such as the minimum, maximum and mean time), which is our desired result from the benchmark.

Putting it all together

Different compilers

Since we're using C and executing native instructions, we need to use the asm keyword. Unfortunately, this is implementation defined. We'll address the Microsoft and GNU dialects, which will cover most implementations relevant for the x86 architecture.

For Microsoft, the syntax is __asm INSTRUCTION, whereas for GNU the syntax is __asm__ ("INSTRUCTION").

C standard versions

A further challenge is that different C versions (for example, ANSI C and C99) provide different types, and for compatibility we might want our code to use the latest types when available while still being able to run on older compilers.

Baseline measurement

Lastly, we want to run a baseline measurement for a do-nothing operation to determine the overhead of our measurement.

Final implementation

#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#ifdef HAVE_CONFIG_H
#include <config.h>
#endif

#ifndef ITERATIONS
#define ITERATIONS (2048)
#endif

/* C99 types */
#if defined(__STDC_VERSION__) && __STDC_VERSION__ >= 199901L /* The compiler supports C99 */
#define SIZE_T_FMT "z"
#define SIZE_T_FMT_TYPE size_t
#define LONGLONG long long
#define LONGDOUBLE long double
#define LONGDOUBLE_FMT "L"
#elif defined(_MSC_VER) /* Some MSVC versions don't fully support C99 */
#define SIZE_T_FMT "I"
#define SIZE_T_FMT_TYPE size_t
#define LONGLONG __int64
#if _MSC_VER >= 1310
#define LONGDOUBLE long double
#define LONGDOUBLE_FMT "L"
#else
#define LONGDOUBLE double
#define LONGDOUBLE_FMT ""
#endif
#else
#define SIZE_T_FMT "l" /* Old compiler, use long */
#define SIZE_T_FMT_TYPE long
#if defined(__GNUC__) /* GCC supports long long as an extension */
#define LONGLONG long long
#else
#define LONGLONG long
#define __extension__ /* */
#endif
#define LONGDOUBLE double
#define powl pow
#define sqrtl sqrt
#define LONGDOUBLE_FMT ""
#endif
/* End C99 types */

/* Compatibility */
#if !defined(__has_attribute) /* Clang & newer GCC */
#define __has_attribute(x) 0
#endif /* !defined(__has_attribute) */

#if !defined(__has_builtin) /* Clang */
#define __has_builtin(x) 0
#endif /* !defined(__has_builtin) */

#if !defined(INFINITY)
#define INFINITY ((double) 1.0e120)
#endif

#if defined(_MSC_VER) /* MSVC-specific */
#define NOINLINE __declspec(noinline) /* MSVC extension */
#elif __has_attribute(__noinline__) || (defined(__GNUC__) && (__GNUC__ > 3) || (__GNUC__ == 3 && defined(__GNUC_MINOR__) && __GNUC_MINOR__ >= 1))
#define NOINLINE __attribute__ ((noinline)) /* GNU extension */
#else
#define NOINLINE /* not supported */
#endif /* defined(_MSC_VER) */

#if __has_builtin(__builtin_expect) || (defined(__GNUC__) && (__GNUC__ > 3) || (__GNUC__ == 3 && defined(__GNUC_MINOR__) && __GNUC_MINOR__ >= 1))
#define likely(x) __builtin_expect((x),1) /* GNU extension */
#else
#define likely(x) x
#endif /* __has_builtin(__builtin_expect) */

#if defined(_MSC_VER) /* MSVC inline assembler */
#if !defined(intel) /* Use LFENCE for Intel processors and MFENCE for other manufacturers */
#define xFENCE __asm _emit 0x0f __asm _emit 0xae __asm _emit 0xf0 /* MFENCE */
#else
#define xFENCE __asm _emit 0x0f __asm _emit 0xae __asm _emit 0xe8 /* LFENCE */
#endif
#define RDTSC rdtsc
#define RDTSCP rdtscp

#define WARMUP_MEASUREMENT() __asm { \
        xFENCE \
       __asm RDTSC \
        xFENCE \
        __asm RDTSC \
        xFENCE \
        __asm RDTSC \
        xFENCE \
}

#define START_MEASUREMENT(HI, LO) __asm { \
        xFENCE \
        __asm RDTSC \
        __asm mov LO,eax \
        __asm mov HI,edx \
}

#define END_MEASUREMENT(HI, LO) __asm { \
        __asm RDTSCP \
        xFENCE \
        __asm mov LO,eax \
        __asm mov HI,edx \
}

#define ZERO_REGISTER(OUT) __asm mov OUT,0
#elif defined(__GNUC__) /* GCC-style extended asm */
#if !defined(intel) /* Use LFENCE for Intel processors and MFENCE for other manufacturers */
#define xFENCE ".byte 0x0f, 0xae, 0xf0\n" /* MFENCE */
#else
#define xFENCE ".byte 0x0f, 0xae, 0xe8\n" /* LFENCE */
#endif
#define RDTSC ".byte 0x0f, 0x31\n"        /* RDTSC */
#define RDTSCP ".byte 0x0f, 0x01, 0xf9\n" /* RDTSCP */

#define WARMUP_MEASUREMENT() __asm__ volatile ( \
        xFENCE \
        RDTSC \
        xFENCE \
        RDTSC \
        xFENCE \
        RDTSC \
        xFENCE \
        : : : "%eax", "%edx", "memory" \
)

#define START_MEASUREMENT(HI, LO) __asm__ volatile ( \
        xFENCE \
        RDTSC \
        : "=d" (HI), "=a" (LO) : : "memory" \
)

#define END_MEASUREMENT(HI, LO) __asm__ volatile ( \
        RDTSCP \
        xFENCE \
        : "=d" (HI), "=a" (LO) : : "%ecx", "memory" \
)
#define ZERO_REGISTER(OUT) __asm__ volatile ("xor %0, %0" : "=r"(OUT) : : )
#else /* Unsupported compiler. */
#define WARMUP_MEASUREMENT() /* */
#define START_MEASUREMENT(HI, LO) HI = 0; LO = 0
#define END_MEASUREMENT(HI, LO) HI = 0; LO = 0
#define ZERO_REGISTER(OUT) OUT = 0
#endif /* defined(_MSC_VER) */

/* End Compatibility */

typedef unsigned LONGLONG u64;
typedef unsigned long u32;

/* Function signatures */
NOINLINE static int do_nothing(void);
NOINLINE static int do_something(void);
static void compute_statistics(u64 const data[], size_t const len, LONGDOUBLE * min, LONGDOUBLE * max, LONGDOUBLE * mean, LONGDOUBLE * variance);
/* End function signatures */

int do_nothing() {
    int r;

    ZERO_REGISTER(r);

    return r;
}

int do_something() {
    int r;

    ZERO_REGISTER(r);

    return r;
}

void compute_statistics(u64 const data[], size_t const len, LONGDOUBLE * const min, LONGDOUBLE * const max, LONGDOUBLE * const mean, LONGDOUBLE * const variance) {
    LONGDOUBLE sum1 = 0.0L, sum2 = 0.0L, mean_, variance_, min_ = (LONGDOUBLE)INFINITY, max_ = (LONGDOUBLE) -INFINITY;
    size_t i;

    for (i = 0; i < len; i++) {
        LONGDOUBLE const t = (LONGDOUBLE) data[i];
        sum1 += t;
        if (t > max_) max_ = t;
        if (t < min_) min_ = t;
    }

    mean_ = sum1 / (LONGDOUBLE) len;

    for (i = 0; i < len; i++) {
        sum2 += powl((LONGDOUBLE)data[i] - mean_, 2);
    }

    variance_ = sum2 / (((LONGDOUBLE)len) - 1.0L);

    if (NULL != min) *min = min_;
    if (NULL != max) *max = max_;
    if (NULL != mean) *mean = mean_;
    if (NULL != variance) *variance = variance_;
}

int main(int argc, char **argv) {
    int ret = 0;
    size_t j, k;
    u64 * deltas;
    LONGDOUBLE min, max, mean, variance;
    u32 cycles_low_s, cycles_high_s, cycles_low_e, cycles_high_e;
    u64 start, end;

    (void)argc;
    (void)argv;

    deltas = (u64 *) malloc(sizeof(*deltas) * ITERATIONS);

    WARMUP_MEASUREMENT();

    /* Nothing */
    {
        printf("[%s] Starting benchmark\n", "NULL");
        for(j = 0, k = 0; j < ITERATIONS; j++) {
            START_MEASUREMENT(cycles_high_s, cycles_low_s);
            ret = do_nothing();
            if (likely(0 == ret)) {
                END_MEASUREMENT(cycles_high_e, cycles_low_e);
                start = ( ((u64) cycles_high_s << 040) | cycles_low_s );
                end   = ( ((u64) cycles_high_e << 040) | cycles_low_e );
                deltas[k++] = end - start;
            } else {
                fprintf(stderr, "Unexpected error occurred benchmarking type %s.\n", "NULL");
                printf("[%s] ERROR\n", "NULL");
            }
        }
        compute_statistics(deltas, k, &min, &max, &mean, &variance);
        printf("[%s]\t [%" LONGDOUBLE_FMT "e - %" LONGDOUBLE_FMT "e] mean: %" LONGDOUBLE_FMT "e (std. dev.: %" LONGDOUBLE_FMT "e) N:%" SIZE_T_FMT "u\n", "NULL", min, max, mean, sqrtl(variance), (SIZE_T_FMT_TYPE)k);
    }

    /* Something */
    {
        printf("[%s] Starting benchmark\n", "NULL");
        for(j = 0, k = 0; j < ITERATIONS; j++) {
            START_MEASUREMENT(cycles_high_s, cycles_low_s);
            ret = do_something();
            if (likely(0 == ret)) {
                END_MEASUREMENT(cycles_high_e, cycles_low_e);
                start = ( ((u64) cycles_high_s << 040) | cycles_low_s );
                end   = ( ((u64) cycles_high_e << 040) | cycles_low_e );
                deltas[k++] = end - start;
            } else {
                fprintf(stderr, "Unexpected error occurred benchmarking type %s.\n", "NULL");
                printf("[%s] ERROR\n", "NULL");
            }
        }
        compute_statistics(deltas, k, &min, &max, &mean, &variance);
        printf("[%s]\t [%" LONGDOUBLE_FMT "e - %" LONGDOUBLE_FMT "e] mean: %" LONGDOUBLE_FMT "e (std. dev.: %" LONGDOUBLE_FMT "e) N:%" SIZE_T_FMT "u\n", "NULL", min, max, mean, sqrtl(variance), (SIZE_T_FMT_TYPE)k);
    }

    free(deltas);

    return !!ret;
}

DEV Community