David Stone

Integer Type Performance

The Problem

C++ defines several integer types: signed and unsigned versions of char, short, int, long, and long long, which typically range in size from 8 bits to 64 bits. If your primary concern is performance and you need to represent an integer between 0 and 100, which type should you use? While all of these types can store the value, the Internet offers contradictory advice: use the smallest possible type, use a signed type, use an unsigned type, or always use int.
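
To make the question concrete, here is a quick sketch (illustrative names only) of the candidate each piece of advice would pick for such a value:

#include <cstdint>

// A value known to be between 0 and 100; every one of these types can hold it.
std::int8_t   smallest_signed   = 100;  // "use the smallest type", signed
std::uint8_t  smallest_unsigned = 100;  // "use the smallest type", unsigned
int           plain_int         = 100;  // "always use int"
unsigned int  plain_unsigned    = 100;  // "use an unsigned type"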

The Solution

My testing finds that the optimal priority order of integer types on AMD64, assuming proper application of [[assume()]], is

  1. i8
  2. u8
  3. i16
  4. u16
  5. u32
  6. i32
  7. u64
  8. i64
  9. u128
  10. i128

Jason Turner's Test

Jason Turner recently posted a video exploring the performance of GCC and Clang at various optimization levels. One interesting thing he found is that Clang gives the best performance in a benchmark application when using 32-bit integers instead of the more space-efficient 8- or 16-bit integers. This surprised me, so I decided to dig a bit deeper.

bounded::integer

I created a library called bounded::integer that allows users to specify the exact range of values a type needs to represent (so bounded::integer<0, 100> represents any integer between 0 and 100, inclusive). My previous benchmarks have shown that bounded::integer often outperforms built-in integers by giving more information to the optimizer and improving data layout.

Because I use this type for all of my integer needs, I can easily change the underlying integer types in my program and see the performance effects. The function that determines the underlying type looks like this:


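// Pick the first type in the chain below whose range covers [minimum, maximum].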
template<auto minimum, auto maximum>
constexpr auto determine_type() {
    if constexpr (range_fits_in_type<unsigned char>(minimum, maximum)) {
        return type<unsigned char>;
    } else if constexpr (range_fits_in_type<signed char>(minimum, maximum)) {
        return type<signed char>;
    } else if constexpr (range_fits_in_type<unsigned short>(minimum, maximum)) {
        return type<unsigned short>;
    } else if constexpr (range_fits_in_type<signed short>(minimum, maximum)) {
        return type<signed short>;
    } else if constexpr (range_fits_in_type<unsigned int>(minimum, maximum)) {
        return type<unsigned int>;
    } else if constexpr (range_fits_in_type<signed int>(minimum, maximum)) {
        return type<signed int>;
    } else if constexpr (range_fits_in_type<unsigned long>(minimum, maximum)) {
        return type<unsigned long>;
    } else if constexpr (range_fits_in_type<signed long>(minimum, maximum)) {
        return type<signed long>;
    } else if constexpr (range_fits_in_type<unsigned long long>(minimum, maximum)) {
        return type<unsigned long long>;
    } else if constexpr (range_fits_in_type<signed long long>(minimum, maximum)) {
        return type<signed long long>;
#if defined NUMERIC_TRAITS_HAS_INT128
    } else if constexpr (range_fits_in_type<numeric_traits::uint128_t>(minimum, maximum)) {
        return type<numeric_traits::uint128_t>;
    } else if constexpr (range_fits_in_type<numeric_traits::int128_t>(minimum, maximum)) {
        return type<numeric_traits::int128_t>;
#endif
    } else {
        static_assert(false, "Bounds cannot fit in any type.");
    }
}

This is my original implementation, and it shows the order of types I tried -- essentially, use the smallest built-in type that fits, preferring unsigned types over signed types.

The Hardware

All tests were run on an otherwise idle AMD Ryzen 9 3950x.

The Compiler

My code was compiled with Clang trunk, built at commit 9919295c, using -O3 as the optimization level.

My Test

The application I tested evaluates a video game state in a similar manner to traditional chess AIs. I set up a sample game state and measured how long it took to analyze with various integer representations.

To test whether smaller representations improve performance, I modified the type-selection code so that the chain starts at 8 bits (the default, matching the code shown above), then at 16, 32, and finally 64 bits. To test signed vs. unsigned, I also reversed each pair so that the signed type is preferred over the unsigned type when the value fits in either.

Results are formatted as [bits in the smallest integer representation (with a u prefix if unsigned types are preferred, and an i prefix if signed types are preferred)]: [time in seconds]. For example, "i32" means use the first type that can represent the value, trying the types in the order i32, u32, i64, u64, i128, u128.

i8: 53.44
u8: 54.61
i16: 54.26
u16: 54.62
i32: 58.73
u32: 55.68
i64: 66.81
u64: 63.35

These results are from a single run. However, my performance numbers are stable since this is a heavily optimized application. For example, running the u8 and i8 benchmarks multiple times gives:

u8: 54.61, 54.40, 54.44, 54.17
i8: 53.44, 53.48, 53.51, 53.35

The u8 runs have a coefficient of variation of 0.33% (181 ms standard deviation), and the i8 runs have a coefficient of variation of 0.13% (70 ms standard deviation).
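
For reference, those spread figures are just the sample standard deviation divided by the mean. A small self-contained sketch of that computation (a hypothetical helper, standard library only):

#include <cmath>
#include <cstdio>
#include <numeric>
#include <vector>

// Coefficient of variation = sample standard deviation / mean.
double coefficient_of_variation(std::vector<double> const & samples) {
    double const mean = std::accumulate(samples.begin(), samples.end(), 0.0) / samples.size();
    double sum_of_squares = 0.0;
    for (double const sample : samples) {
        sum_of_squares += (sample - mean) * (sample - mean);
    }
    double const standard_deviation = std::sqrt(sum_of_squares / (samples.size() - 1));
    return standard_deviation / mean;
}

int main() {
    std::printf("u8: %.2f%%\n", 100.0 * coefficient_of_variation({54.61, 54.40, 54.44, 54.17}));
    std::printf("i8: %.2f%%\n", 100.0 * coefficient_of_variation({53.44, 53.48, 53.51, 53.35}));
}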

The difference between signed and unsigned representations surprised me. In the bounded::integer library, accessing a value includes [[assume(minimum <= m_value and m_value <= maximum)]];, which in theory should give the optimizer everything it needs to erase any difference between signed and unsigned. It's also interesting that, at least in my application, there is no consistent winner between signed and unsigned. This implies that Clang's optimizer is not always able to deduce provable facts about my program. I will be looking into this more to find the root cause.
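
As a rough sketch (not the library's actual implementation), the shape of a type with such an accessor looks something like this:

// Minimal sketch of a bounded type whose accessor carries the range assumption,
// using the C++23 [[assume]] attribute quoted above.
template<typename T, T minimum, T maximum>
struct tiny_bounded {
    T m_value;

    constexpr auto value() const {
        [[assume(minimum <= m_value and m_value <= maximum)]];
        return m_value;
    }
};

// tiny_bounded<signed char, 0, 100> tells the optimizer that the stored value is
// always in [0, 100], regardless of the full range of signed char.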

Based on these results, I tested a new ordering that prefers signed types at small sizes but unsigned types at large sizes. This adjusted preference order is i8, u8, i16, u16, u32, i32, u64, i64, and it gives the best result yet: 52.87 seconds. One might also expect that dropping u8 entirely would improve performance further, but it turns out that actually slows things down slightly, to 53.30 seconds.
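
As a sketch, the if constexpr chain with this adjusted order could look like the following, reusing the same helpers as the original (with the 128-bit branches and the final static_assert unchanged):

// Sketch: prefer signed types at 8 and 16 bits, unsigned types at 32 and 64 bits.
template<auto minimum, auto maximum>
constexpr auto determine_type() {
    if constexpr (range_fits_in_type<signed char>(minimum, maximum)) {
        return type<signed char>;
    } else if constexpr (range_fits_in_type<unsigned char>(minimum, maximum)) {
        return type<unsigned char>;
    } else if constexpr (range_fits_in_type<signed short>(minimum, maximum)) {
        return type<signed short>;
    } else if constexpr (range_fits_in_type<unsigned short>(minimum, maximum)) {
        return type<unsigned short>;
    } else if constexpr (range_fits_in_type<unsigned int>(minimum, maximum)) {
        return type<unsigned int>;
    } else if constexpr (range_fits_in_type<signed int>(minimum, maximum)) {
        return type<signed int>;
    } else if constexpr (range_fits_in_type<unsigned long>(minimum, maximum)) {
        return type<unsigned long>;
    } else if constexpr (range_fits_in_type<signed long>(minimum, maximum)) {
        return type<signed long>;
    } else if constexpr (range_fits_in_type<unsigned long long>(minimum, maximum)) {
        return type<unsigned long long>;
    } else if constexpr (range_fits_in_type<signed long long>(minimum, maximum)) {
        return type<signed long long>;
    } else {
        // ... 128-bit branches and the final static_assert, as in the original
    }
}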

Comparison with Jason's Results

Benchmarking my application did not reproduce Jason's result of "larger integer size == faster performance on Clang". Given the strange, outsized performance improvement tied to a specific integer size and a specific compiler, I'm tempted to say it's something peculiar to his benchmark. Note that my benchmark also differs slightly from Jason's in that it includes assumptions telling the compiler that, regardless of the object representation, the value falls within a certain range.

However, Jason's main point is that you need to test things for yourself to find out what's fastest for your use case, which is hard to disagree with.
