DEV Community: absterdabster

What the futex? A linux concurrency fundamental

absterdabster — Mon, 11 May 2026 18:57:27 +0000

WTF?

Exactly.

What is a futex?

It is a tool that lets your program sleep when there is no work and wake up when work is ready.

Ooooh. That sounds pretty efficient!

(If you read this, comment "WTF" so we can confuse everyone else lol.)

What is a synchronization primitive

For you nerds out there, this may sound somewhat like another tool called condition variables. Fun fact: on Linux, they use futexes internally!

For all the noobs out there, these futexes are what let you synchronize between two threads in a program.

A thread is an execution unit. They are often used to:

Split up work by running in parallel
Do background work in parallel

Why would I use a futex?

Let's say I have a thread that is listening to a user's input for a background task to process.

We don't want this listening thread to stall, so we can pass this work to another background thread and wake it up.

Then the background thread can complete the task and go back to sleep.

Great!

How do I use a futex?

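Under the hood, a futex is just a 32 bit integer in memory that the threads share. The kernel uses the address of that integer to find the queue of sleeping threads, and the integer's value is all the state that is needed for waking up a thread or putting one to sleep.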

#include <linux/futex.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <cstddef>
#include <atomic>

std::atomic_int futex_var = 0;

This integer is what is used to check if the thread is ready to wake up.

FUTEX_WAIT → Sleep if value matches expected
FUTEX_WAKE → Wake one or more waiting threads

The sleep/wait function:

void futex_wait(std::atomic_int *addr, int expected) {
    syscall(SYS_futex, addr, FUTEX_WAIT, expected, NULL, NULL, 0);
}

The wake/notify function:

void futex_wake(std::atomic_int *addr, int threads=1) {
    syscall(SYS_futex, addr, FUTEX_WAKE, threads, NULL, NULL, 0);
}

Here threads lets you control how many threads to wake up from the waiting queue.

The syscall function is a generic way to ask your kernel to do protected work on your behalf, like interacting with hardware devices (disks, the network) or putting a thread to sleep.

A quick crash course to syscalls

Skip back to futexes if you already know this... nerd.

A syscall can be invoked with syscall(). Standard library calls like read(), open(), and write(), which are used for interacting with file descriptors (files, sockets, pipes), are thin wrappers around system calls.

The signature for calling it is like this where ... is the arguments taken for that specific syscall defined by the number:

long syscall(long number, ...);

On error, -1 is returned and errno is set.

If I want to write "Hello" to the terminal via stdout (standard output), I could do something like this.

#include <unistd.h>
#include <sys/syscall.h>

int main() {
    syscall(SYS_write, 1, "Hello\n", 6);
    return 0;
}
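To see the -1 and errno behavior in action, here is a small sketch (not part of the original example). It assumes Linux, and the helper name errno_after_mismatched_wait is made up for illustration. It asks FUTEX_WAIT to sleep while expecting a value that does not match, so the kernel refuses right away.

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

// Hypothetical helper: calls FUTEX_WAIT with an expected value that does not
// match the futex word, so the call fails immediately instead of sleeping.
int errno_after_mismatched_wait() {
    std::atomic_int word{1};
    errno = 0;
    long rc = syscall(SYS_futex, reinterpret_cast<int*>(&word), FUTEX_WAIT,
                      /*expected=*/0, nullptr, nullptr, 0);
    // rc is -1 on failure; the reason is stored in errno.
    return rc == -1 ? errno : 0;
}
```

For a value mismatch, the errno you get back is EAGAIN, which is exactly the "don't sleep, the value already changed" signal the lock below relies on.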

Example of using futexes

We are going to try to make locks with these futexes.

A lock just protects a shared resource so that threads take turns accessing it.

We want to avoid corruption if 2 threads overwrite each other's changes in the shared resource.

Here it is. Let's go over it.

struct Lock{
    // 0 - unlocked, 1 - locked
    std::atomic_int state{0};

    void lock(){
        int unlocked = 0;
        while(!state.compare_exchange_strong(unlocked, 1)){
            unlocked = 0;
            futex_wait(&state, 1);
        }
    }

    void unlock(){
        state.store(0);
        futex_wake(&state);
    }
};

Locking

Let's start with the logic in lock:

int unlocked = 0;
while(!state.compare_exchange_strong(unlocked, 1)){
     unlocked = 0;
     futex_wait(&state, 1);
}

Here we are checking if state is 0 (unlocked), and if it is then we perform a successful swap and state will then store 1. This is what we call a CAS (Compare And Swap). This would return true and the lock function will return without entering the loop.

BUTTT, if the state was 1, the compare would fail. We enter the loop. We reset unlocked to 0 (because a failed CAS overwrites it with the current value of state). Then futex_wait puts us to sleep, but only if state is still 1.

Here we can see how futex_wait comes in handy for a simple lock.

If the thread is sleeping, it must wait for futex_wake in unlock to wake it up from the waiting queue, at which point it can contend for the lock again.

Unlocking

state.store(0);
futex_wake(&state);

Now let's look at the unlocking logic. We simply set the state to unlocked, and wake up one thread that is sleeping in futex_wait so it can recontend for the lock.
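Here is a rough sketch of the lock in use, assuming Linux. The Lock and the futex helpers are copied from above. The count_with_lock function is a made-up demo where several threads bump a plain (non-atomic) counter, so the lock is the only thing keeping the count correct.

```cpp
#include <atomic>
#include <cassert>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <thread>
#include <unistd.h>
#include <vector>

void futex_wait(std::atomic_int* addr, int expected) {
    syscall(SYS_futex, addr, FUTEX_WAIT, expected, nullptr, nullptr, 0);
}

void futex_wake(std::atomic_int* addr, int threads = 1) {
    syscall(SYS_futex, addr, FUTEX_WAKE, threads, nullptr, nullptr, 0);
}

struct Lock {
    std::atomic_int state{0};  // 0 - unlocked, 1 - locked

    void lock() {
        int unlocked = 0;
        while (!state.compare_exchange_strong(unlocked, 1)) {
            unlocked = 0;
            futex_wait(&state, 1);
        }
    }

    void unlock() {
        state.store(0);
        futex_wake(&state);
    }
};

// Hypothetical demo: every thread increments the shared counter `iterations` times.
long count_with_lock(int num_threads, int iterations) {
    Lock lock;
    long counter = 0;  // not atomic on purpose: the lock is what protects it
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; t++) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; i++) {
                lock.lock();
                counter++;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

With the lock in place, 4 threads doing 10000 increments each should always end at 40000. Note that this simple unlock makes a futex_wake syscall every time, even when nobody is waiting. Real mutexes usually track waiters to skip that call.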

Condition Variable

For fun, I will implement a condition variable with futexes too.

A condition variable is a tool that makes threads sleep while a function/condition returns false. It lets them recheck the condition when another thread wakes them up with notify_one (1 thread) or notify_all (all threads).

This is useful: If a thread is preparing a task and then updates a condition once it is finished, it can notify the background thread to finish the task.

For the noobs who want to skip, go to the conclusion. And for the nerds:

struct CondVar{

    // each seq represents the notify epoch
    // this prevents lost wake up (if a notify was called between !predicate and futex_wait)
    // it prevents the lost wakeup because notify causes seq to increment, so we won't be sleeping anymore
    std::atomic_int seq{0};

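    // needs <functional> and <climits> (for INT_MAX)
    // const& so that a lambda can be passed in directly
    void wait(const std::function<bool(void)>& predicate){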
        // if !predicate -> futex_wait
        int cur = seq.load();
        while(!predicate()){
            futex_wait(&seq, cur);
            cur = seq.load();
        }
    }

    void notify_one(){
        seq.fetch_add(1);
        futex_wake(&seq);
    }

    void notify_all(){
        seq.fetch_add(1);
        futex_wake(&seq, INT_MAX);
    }
};

There is a common problem that condition variables try to avoid, called a lost wakeup.

A lost wakeup happens when you check the predicate and it is false, but another thread updates the condition and calls notify before you reach futex_wait. You then go to sleep even though the condition is already true, and nobody is left to wake you up.

To avoid this, we atomically increment seq on every notify. futex_wait only sleeps if seq still equals the value we read before checking the predicate. This way, we only sleep if seq is the same AND predicate is false. seq being the same means that no new notifies happened in between.
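Here is a sketch of how this might be used, assuming Linux. I am also assuming wait takes its predicate by const reference, so that a lambda can be passed in directly. The run_worker_once function is made up for the demo: a worker sleeps until the main thread publishes a task.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <thread>
#include <unistd.h>

void futex_wait(std::atomic_int* addr, int expected) {
    syscall(SYS_futex, addr, FUTEX_WAIT, expected, nullptr, nullptr, 0);
}

void futex_wake(std::atomic_int* addr, int threads = 1) {
    syscall(SYS_futex, addr, FUTEX_WAKE, threads, nullptr, nullptr, 0);
}

struct CondVar {
    std::atomic_int seq{0};  // notify epoch

    void wait(const std::function<bool()>& predicate) {
        int cur = seq.load();  // read the epoch BEFORE checking the predicate
        while (!predicate()) {
            futex_wait(&seq, cur);  // returns right away if a notify bumped seq
            cur = seq.load();
        }
    }

    void notify_one() {
        seq.fetch_add(1);
        futex_wake(&seq);
    }
};

// Hypothetical demo: the worker sleeps until a task is ready, then doubles it.
int run_worker_once(int task_value) {
    CondVar cv;
    std::atomic_bool ready{false};
    int task = 0;
    int result = 0;

    std::thread worker([&] {
        cv.wait([&] { return ready.load(); });
        result = task * 2;
    });

    task = task_value;   // prepare the task
    ready.store(true);   // update the condition first...
    cv.notify_one();     // ...then notify
    worker.join();
    return result;
}
```

The order on the notifying side matters: the condition is updated before notify_one, so a waiter that sees the new seq is guaranteed to also see ready as true.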

Adding a twist: a timeout

Let's say we don't want to sleep until the condition is true. We want to wake up if it takes too long. We can set a timeout by introducing a wait_for function.

Here is an implementation:

    // needs <chrono>; const& so that a lambda can be passed in directly
    void wait_for(const std::function<bool(void)>& predicate, std::chrono::nanoseconds ns){
        // steady_clock never jumps backwards, which is what we want for a timeout
        auto end = std::chrono::steady_clock::now() + ns;
        int cur = seq.load();
        while(!predicate()){
            auto now = std::chrono::steady_clock::now();
            if(now >= end){
                return;
            }
            futex_wait_for(&seq, cur, end-now);
            cur = seq.load();
        }
    }

Here we now pass in a timeout. The structure is mainly the same except now, we are dealing with times. We have to add a condition:

if(now >= end){
    return;
}

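This lets us stop waiting once the deadline has passed. And if a notify wakes us up early while the predicate is still false, we go back to sleep for only the time that is left (end - now).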

We also have this interesting line:

futex_wait_for(&seq, cur, end-now);

We are using a new function. futex_wait_for makes the same syscall, except that it also takes a timeout, like so:

int futex_wait_for(std::atomic_int* addr, int expected, std::chrono::nanoseconds ns){
    timespec ts;
    ts.tv_sec  = ns.count() / 1'000'000'000;
    ts.tv_nsec = ns.count() % 1'000'000'000;
    return syscall(SYS_futex, reinterpret_cast<int*>(addr), FUTEX_WAIT, expected, &ts, nullptr, 0);
}

It does exactly what we need. It only sleeps if seq still equals expected, and it wakes up on its own once the timeout is reached.
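As a quick sanity check of the timeout path, here is a sketch assuming Linux. futex_wait_for is copied from above, and errno_after_timed_wait is a made-up helper. Nobody ever calls futex_wake here, so the only way out of the wait is the timeout.

```cpp
#include <atomic>
#include <cassert>
#include <cerrno>
#include <chrono>
#include <ctime>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

int futex_wait_for(std::atomic_int* addr, int expected, std::chrono::nanoseconds ns) {
    timespec ts;
    ts.tv_sec  = ns.count() / 1'000'000'000;
    ts.tv_nsec = ns.count() % 1'000'000'000;
    return syscall(SYS_futex, reinterpret_cast<int*>(addr), FUTEX_WAIT, expected, &ts, nullptr, 0);
}

// Hypothetical helper: waits on a word that matches `expected` and is never
// woken, then reports why the wait ended.
int errno_after_timed_wait(std::chrono::nanoseconds timeout) {
    std::atomic_int word{0};
    errno = 0;
    int rc = futex_wait_for(&word, 0, timeout);
    return rc == -1 ? errno : 0;
}
```

When the timeout expires, the syscall returns -1 with errno set to ETIMEDOUT. The wait_for loop above doesn't need to look at it, because it rechecks the predicate and the clock anyway.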

Conclusion

Finally, the general public generally does not need to worry about futexes.

Condition variables and mutexes/locks on Linux are implemented with futexes under the hood.

So when is this useful?

It is useful if you want fine-grained, more direct control over when threads should sleep and when threads should do work.

Feel free to drop your questions below.

Peace out
-absterdabster

How to vectorize your code for faster performance 🚀

absterdabster — Wed, 23 Jul 2025 05:31:51 +0000

Hi! Let's say you have a time-sensitive application. Either you have a lot of data that you need to process quickly, or you are trying to write code that is very fast.

It may be possible to make your code very performant. 👀

How so?

With the help of vectorization!

There's a chance you are running a very big loop and running the same set of instructions on all your data.

What if we can shrink this loop a lot? We can process chunks of this loop in one step.

In fact, if you've ever used Python, fast processing libraries like numpy tend to use vectorized instructions as well for handling large amounts of data faster.

Before I show you how to steal the moon.... ahem... I mean vectorize your code, please drop any questions you have in the comments below!

Vectorized Instructions (SIMD)

SIMD stands for Single Instruction Multiple Data. Okay let's explore some instructions.

Before we look at instructions, I must say every computer is different. Every CPU has a different architecture.

So some may support vectorized instructions, but some may not.

Lucky for us, most CPUs these days are x86, x86-64, or ARM. All of these architectures support SIMD instructions. (Even the Apple M1 chips, which are ARM-based.)

How do SIMD instructions work?

Good question. (If you're lazy and want to use SIMD instructions without knowing much about how they work, jump to the I'm Lazy section lol).

If you ever took a computer architecture course, you may have heard of these things called registers.

Registers are like memory holders for tiny pieces of data. Generally, the usual ones your compiler uses are 64 bit or 32 bit registers.

This is for several reasons:

x86-64 is the 64 bit extension of x86 with 64 bit general purpose registers, while classic x86 has 32 bit ones
Memory addresses on modern computers are 64 bit addresses
The largest built-in data types most languages support are 64 bits (uint64_t, double, long).

However, computer architectures have been supporting larger and larger registers for things like vectorization, generally 128 bit, 256 bit, and even 512 bit registers.

`x86/x86-64`

These are the registers generally used for x86 architectures:

mm0-mm7: 64 bit registers for SIMD
xmm0-xmm15: 128 bit registers for SIMD
ymm0-ymm15: 256 bit registers for SIMD
zmm0-zmm31: 512 bit registers for SIMD

The registers and operations available to you on x86 largely depend on the support your CPU has. Here are the SIMD extensions that have become available over the years:

MMX: 64 bit registers for SIMD and instructions, oldest (1997)
SSE: 128 bit registers and introduced 70 new instructions (1999)
SSE2: introduced 144 new instructions to 128 bit registers (2000)
SSE3: introduced 13 new instructions (horizontal add/subtract) (2004)
SSSE3: introduced 16 new instructions (32 if you count the MMX and SSE register variants separately) to extend MMX and SSE (2006)
AVX: introduced 256 bit register vectors (2011)
AVX-512: introduced 512 bit register vectors (2016)
AMX: introduced 8192 bit registers (tmm0...tmm7) (2023)

How do you use them??? No need to fear, superman is here.

There are special functions for the architecture you can use in C/C++ called intrinsics. There are intrinsics that you can use to vectorize add, vectorize multiply, etc.

To use the intrinsics on x86 CPUs, all you have to do to vectorize your code is:

Include one of the following header files based on the intrinsics you want to use, for example:
- <mmintrin.h>: (MMX)
- <xmmintrin.h>: (SSE)
- <emmintrin.h>: (SSE2)
- <pmmintrin.h>: (SSE3)
- <immintrin.h>: (AVX/AVX-512)
Use one of these intrinsics, here is a list: here
Finally compile with one of the flags on gcc, for example:
- -mmmx: (MMX)
- -msse: (SSE)
- -msse3: (SSE3)
- -mavx: (AVX)
- -mavx512f: (AVX-512)
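To make those steps a bit more concrete, here is a minimal sketch using SSE2 intrinsics from <emmintrin.h>. The add_ints helper is made up for illustration. It adds four ints per step, and falls back to a plain loop if the compiler isn't targeting SSE2.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper: c[i] = a[i] + b[i], four ints per step where SSE2 is available.
void add_ints(const int* a, const int* b, int* c, std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE2__)
    for (; i + 4 <= n; i += 4) {
        // unaligned loads of 4 x 32 bit ints into 128 bit registers
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        // one instruction adds all four lanes
        _mm_storeu_si128(reinterpret_cast<__m128i*>(c + i), _mm_add_epi32(va, vb));
    }
#endif
    for (; i < n; i++) c[i] = a[i] + b[i];  // scalar tail (or the whole loop without SSE2)
}
```

_mm_add_epi32 is the intrinsic behind the paddd instruction that shows up in the assembly later in this post.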

ARM

ARM is the other popular architecture for CPUs, sometimes used for mobile devices. It also supports SIMD.

It uses these registers they call NEON registers, but the idea is similar:

D0-D31: 64 bit registers for SIMD
Q0-Q15: 128 bit registers for SIMD

To use these ARM vectorized instructions, you would have to do the following in C/C++:

#include <arm_neon.h> at the top of your file
Use the intrinsic C++ functions (like vaddq_u32) from here
Compile your program with the gcc flag -mfpu=neon (only needed on 32 bit ARM; on 64 bit ARM, NEON is enabled by default)
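Here is what those steps might look like as a sketch. The add_u32 helper is made up for illustration, and the NEON path is guarded so the same file still compiles (as a plain loop) on a non-ARM machine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

// Hypothetical helper: c[i] = a[i] + b[i], four uint32_t per step on NEON.
void add_u32(const uint32_t* a, const uint32_t* b, uint32_t* c, std::size_t n) {
    std::size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        uint32x4_t va = vld1q_u32(a + i);     // load 4 lanes into a 128 bit register
        uint32x4_t vb = vld1q_u32(b + i);
        vst1q_u32(c + i, vaddq_u32(va, vb));  // add all 4 lanes and store
    }
#endif
    for (; i < n; i++) c[i] = a[i] + b[i];    // scalar tail (or the whole loop without NEON)
}
```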

I'm Lazy

Okay lazy boy. Or girl lol.

If you don't want to think about x86 or ARM, compilers are made powerful just for you.

You can let your compiler automagically figure out your architecture and compile your code with SIMD instructions.

Let's keep it short, but all you have to do is compile like this:

g++ -o test test.cpp -O3 -ftree-vectorize -march=native

-O3: extreme optimization. It already includes -ftree-vectorize, so passing -ftree-vectorize as well is redundant. (Newer GCC versions also vectorize at -O2, just more conservatively.)
-ftree-vectorize: in case you forget -O3 or use a lower optimization level like -O1, this turns the vectorizer on. It has no effect on completely unoptimized (-O0) code.
-march=native: if you use -ftree-vectorize or -O3 without this flag, the compiler will use a default set of vectorized instructions (up to SSE2 for x86-64). Including this flag lets the compiler use the best SIMD instructions your CPU has.
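For a feel of what the compiler likes to vectorize, here is a sketch of a typical candidate loop. The function name is made up. __restrict is a compiler extension (supported by GCC, Clang, and MSVC) that promises the arrays don't overlap, which makes the vectorizer's job easier.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example: a simple loop with no dependencies between iterations,
// which is the kind of loop auto-vectorizers handle well.
void scale_and_add(const float* __restrict x, const float* __restrict y,
                   float* __restrict out, float k, std::size_t n) {
    for (std::size_t i = 0; i < n; i++) {
        out[i] = k * x[i] + y[i];
    }
}
```

If you are curious whether GCC actually vectorized a loop, adding -fopt-info-vec to the compile command makes it report the loops it vectorized.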

Comparing speeds

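Let's look at this simple sample program lol, adding two arrays of 30 ints together.

#include <cstdint>
#include <iostream>
uint64_t rdtsc(){
        volatile uint64_t v{0};
        // Caveat: on x86-64 the "A" constraint picks rax or rdx (not the edx:eax pair),
        // so this only captures part of the counter. That is enough for a short interval
        // like this one. The usual fix is separate "=a"(lo) and "=d"(hi) outputs.
        __asm__ volatile(
                "rdtsc"
                :"=A" (v)
        );
        return v;
}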

int main(int argc, char** argv){
        int a[30] = {1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5,1,2,3,4,5};
        int b[30] = {5,4,3,2,1,2,3,4,5,4,3,2,1,2,3,5,4,3,2,1,2,3,4,5,4,3,2,1,2,3};
        int c[30];

        volatile uint64_t start = rdtsc();
        for(size_t i = 0; i < 30; i++){
                c[i] = a[i] + b[i];
        }
        volatile uint64_t end = rdtsc();

        std::cout << end-start << " cycles" << std::endl;
        for(auto v: c){
                std::cout <<  v << " ";
        }
        std::cout << std::endl;
}

Here, I use rdtsc() to time my program. For reference, I am timing and running my program from an x86-64 architecture CPU.

If you are unfamiliar with RDTSC, GO CHECK OUT MY TIMING YOUR PROGRAM BLOG.

If I try to compile and run this program with no optimization:

g++ -o test test.cpp

I get the following results:

463 cycles
6 6 6 6 6 3 5 7 9 9 4 4 4 6 8 6 6 6 6 6 3 5 7 9 9 4 4 4 6 8

For reference, the loop takes 463 CPU cycles. My CPU runs at 2.5GHz. This means it took roughly 185 nanoseconds (463 / 2.5) to run!

Let's take a look at the assembly, the low level instructions, for the juicy part of the code. We can get it with the -S flag when compiling.

        call    _Z5rdtscv
        movq    %rax, -432(%rbp)
        movq    $0, -416(%rbp)
.L5:
        cmpq    $29, -416(%rbp)
        ja      .L4
        movq    -416(%rbp), %rax
        movl    -384(%rbp,%rax,4), %edx
        movq    -416(%rbp), %rax
        movl    -256(%rbp,%rax,4), %eax
        addl    %eax, %edx
        movq    -416(%rbp), %rax
        movl    %edx, -128(%rbp,%rax,4)
        addq    $1, -416(%rbp)
        jmp     .L5
.L4:
        call    _Z5rdtscv

Great, we can see that normal compilation leads to no use of vectorized registers/SIMD instructions. We just have the classic 64 bit registers (rax, rbp) and 32 bit registers (edx, eax, etc.).

Now let's see if we can vectorize these. Okay, let's do something simple with just O3 optimization.

g++ -O3 -o test test.cpp

(Add the -S flag for assembly)

As a fun challenge, try to guess how long the optimized version took to run.

Before I reveal the answer, let's see how this program got optimized in the assembly:

#APP
# 7 "test.cpp" 1
        rdtsc
# 0 "" 2
#NO_APP
        movq    %rax, 16(%rsp)
        movq    16(%rsp), %rax
        movdqa  %xmm1, %xmm5
        movdqa  160(%rsp), %xmm4
        paddd   %xmm0, %xmm5
        paddd   192(%rsp), %xmm3
        paddd   208(%rsp), %xmm2
        movq    %rax, (%rsp)
        paddd   %xmm0, %xmm4
        movl    272(%rsp), %eax
        paddd   240(%rsp), %xmm0
        movaps  %xmm5, 304(%rsp)
        addl    144(%rsp), %eax
        movaps  %xmm4, 288(%rsp)
        paddd   256(%rsp), %xmm1
        movl    %eax, 400(%rsp)
        movl    148(%rsp), %eax
        addl    276(%rsp), %eax
        movaps  %xmm3, 320(%rsp)
        movaps  %xmm2, 336(%rsp)
        movaps  %xmm4, 352(%rsp)
        movaps  %xmm0, 368(%rsp)
        movaps  %xmm1, 384(%rsp)
        movl    %eax, 404(%rsp)
        movq    $0, 24(%rsp)
#APP
# 7 "test.cpp" 1
        rdtsc
# 0 "" 2

Ahhh, now we have xmm registers! We also see SIMD x86 instructions like movaps and paddd.

If you remember from earlier, xmm registers are 128 bit registers, which are covered by the default SSE/SSE2 support on x86-64.

Now to show you the speed difference this has:

217 cycles
6 6 6 6 6 3 5 7 9 9 4 4 4 6 8 6 6 6 6 6 3 5 7 9 9 4 4 4 6 8

Whoa! We 2xed our speed by adding four ints at a time in 128 bit registers!

We were at around 185 nanos, and now it's closer to 87 nanos!

Remember, we're only adding 30 numbers here, but if we were doing a billion numbers, these nanoseconds WILL add up.

Okay, now let's see what happens when we introduce the microarchitecture flag into our compilation. Any better?

g++ -o test test.cpp -O3 -march=native

And here is the new assembly:

#APP
# 7 "test.cpp" 1
        rdtsc
# 0 "" 2
#NO_APP
        movq    %rax, 16(%rsp)
        movq    16(%rsp), %rax
        vpaddd  %ymm3, %ymm0, %ymm0
        vpaddd  160(%rsp), %ymm2, %ymm2
        vpaddd  192(%rsp), %ymm1, %ymm1
        vmovdqa %ymm0, 352(%rsp)
        movq    %rax, (%rsp)
        movl    256(%rsp), %eax
        vmovdqa %ymm2, 288(%rsp)
        addl    128(%rsp), %eax
        movl    %eax, 384(%rsp)
        movl    260(%rsp), %eax
        vmovdqa %ymm1, 320(%rsp)
        addl    132(%rsp), %eax
        movl    %eax, 388(%rsp)
        movl    264(%rsp), %eax
        movq    $0, 24(%rsp)
        addl    136(%rsp), %eax
        movl    %eax, 392(%rsp)
        movl    268(%rsp), %eax
        addl    140(%rsp), %eax
        movl    %eax, 396(%rsp)
        movl    272(%rsp), %eax
        addl    144(%rsp), %eax
        movl    %eax, 400(%rsp)
        movl    148(%rsp), %eax
        addl    276(%rsp), %eax
        movl    %eax, 404(%rsp)
#APP
# 7 "test.cpp" 1
        rdtsc
# 0 "" 2
#NO_APP

WHOAAAAA! Do you see what I see?!

We just got ymm registers! These guys are like 256 bit registers that my CPU supports.

Let's see what that means about our speed. Any guesses?

3....
2....
1....
Boom:

162 cycles
6 6 6 6 6 3 5 7 9 9 4 4 4 6 8 6 6 6 6 6 3 5 7 9 9 4 4 4 6 8

This took 65 nanoseconds to run. It seems like our benefits are starting to decay with only 30 numbers.

We went from 185 nanoseconds to 65 nanoseconds. We basically 3xed our speed!
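If you want to redo the cycles to nanoseconds math yourself, here is a tiny sketch. The helper names are made up, and it assumes the counter ticks at the 2.5GHz clock rate mentioned above (2.5 ticks per nanosecond).

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: a clock at `ghz` GHz ticks `ghz` times per nanosecond.
constexpr double cycles_to_ns(double cycles, double ghz) {
    return cycles / ghz;
}

// Hypothetical helper: how many times faster the second measurement is.
constexpr double speedup(double before_cycles, double after_cycles) {
    return before_cycles / after_cycles;
}
```

Plugging in the measured cycle counts: 463 / 2.5 is about 185 ns, 217 / 2.5 is about 87 ns, and 162 / 2.5 is about 65 ns. The speedups come out to roughly 2.1x for -O3 and 2.9x for -O3 with -march=native.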

Note: I could've alternatively chosen to use intrinsics as well. But I trust you will figure it out with the resources I've given you and the internet <3

Examples in the world

Okay, I shall attempt to find 1 example of SIMD instructions in the wild wild west.

OpenCV is a library used for image and vision processing!

Vectorized instructions can be useful for processing rows of pixels the same way.

Here is an example of SSE x86 intrinsics being used in OpenCV to merge two maps/images together.

void convertMaps_nninterpolate32f1c16s_SSE41(const float* src1f, const float* src2f, short* dst1, int width)
{
    int x = 0;
    for (; x <= width - 16; x += 16)
    {
        __m128i v_dst0 = _mm_packs_epi32(_mm_cvtps_epi32(_mm_loadu_ps(src1f + x)),
            _mm_cvtps_epi32(_mm_loadu_ps(src1f + x + 4)));
        __m128i v_dst1 = _mm_packs_epi32(_mm_cvtps_epi32(_mm_loadu_ps(src1f + x + 8)),
            _mm_cvtps_epi32(_mm_loadu_ps(src1f + x + 12)));

        __m128i v_dst2 = _mm_packs_epi32(_mm_cvtps_epi32(_mm_loadu_ps(src2f + x)),
            _mm_cvtps_epi32(_mm_loadu_ps(src2f + x + 4)));
        __m128i v_dst3 = _mm_packs_epi32(_mm_cvtps_epi32(_mm_loadu_ps(src2f + x + 8)),
            _mm_cvtps_epi32(_mm_loadu_ps(src2f + x + 12)));

        _mm_interleave_epi16(v_dst0, v_dst1, v_dst2, v_dst3);

        _mm_storeu_si128((__m128i *)(dst1 + x * 2), v_dst0);
        _mm_storeu_si128((__m128i *)(dst1 + x * 2 + 8), v_dst1);
        _mm_storeu_si128((__m128i *)(dst1 + x * 2 + 16), v_dst2);
        _mm_storeu_si128((__m128i *)(dst1 + x * 2 + 24), v_dst3);
    }

    for (; x < width; x++)
    {
        dst1[x * 2] = saturate_cast<short>(src1f[x]);
        dst1[x * 2 + 1] = saturate_cast<short>(src2f[x]);
    }
}

Okay let's keep this breakdown brief and simple.

__m128i v_dst0 = _mm_packs_epi32(_mm_cvtps_epi32(_mm_loadu_ps(src1f + x)), _mm_cvtps_epi32(_mm_loadu_ps(src1f + x + 4)));

These instructions are:
- _mm_loadu_ps: load 4 floats into a 128 bit xmm register
- _mm_cvtps_epi32: convert the 4 floats in the xmm into 4 byte ints
- _mm_packs_epi32: convert the 2 x 4 ints into 8 shorts (2 bytes each, with saturation) in another xmm

_mm_interleave_epi16(v_dst0, v_dst1, v_dst2, v_dst3);

Interleaves the 1st and 3rd xmm vectors together. Then it interleaves the 2nd and 4th vectors together. This is done so that src1 and src2 points are interleaved together. (This one is an OpenCV helper rather than a built-in intrinsic.)

_mm_storeu_si128((__m128i *)(dst1 + x * 2), v_dst0);

Uses vectorized instructions to copy from xmm registers to dst1.

The last loop is scalar. It handles the elements that are left over when width is not a multiple of 16.
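To play with the load, convert, and pack steps on their own, here is a stripped-down sketch. It is not OpenCV's code, and the floats_to_shorts helper is made up. It converts 8 floats to 8 shorts per step, with the same "vector loop plus scalar tail" shape as the function above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: round floats to the nearest int and store them as shorts,
// clamping values that don't fit into [-32768, 32767].
void floats_to_shorts(const float* src, short* dst, std::size_t n) {
    std::size_t i = 0;
#if defined(__SSE2__)
    for (; i + 8 <= n; i += 8) {
        __m128i lo = _mm_cvtps_epi32(_mm_loadu_ps(src + i));      // 4 floats -> 4 ints
        __m128i hi = _mm_cvtps_epi32(_mm_loadu_ps(src + i + 4));  // the next 4
        __m128i packed = _mm_packs_epi32(lo, hi);                 // 8 ints -> 8 saturated shorts
        _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), packed);
    }
#endif
    // scalar tail for the leftover elements (or everything without SSE2)
    for (; i < n; i++) {
        float r = std::nearbyint(src[i]);
        if (r > 32767.0f) r = 32767.0f;
        if (r < -32768.0f) r = -32768.0f;
        dst[i] = static_cast<short>(r);
    }
}
```

With 9 input values, the first 8 go through the vector path (on SSE2 machines) and the last one goes through the scalar tail.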

Cool, that's enough examples in the wild.

The Conclusion

Let's keep this simple. We saw that with a small example you can speed up a program by up to 3x with vectorized instructions.

However, with larger programs, you could see even more performance benefits from SIMD (Single Instruction Multiple Data) instructions.

So next time you consider speeding up your program, think about how you can use your CPU's features to your advantage.

All in all:

SIMD support depends on CPU architecture
SIMD instructions can be used with intrinsics
The compiler can handle SIMD optimization for you alternatively
SIMD could speed up your program 2-10x, depending on CPU support/amount of data.
SIMD is used in the wild, like in OpenCV and even numpy libraries.

I hope you are ready to be faster.

That's all I have this time.

Peace
-absterdabster

Creating a list that contains different types in C++ 😎

absterdabster — Fri, 30 May 2025 02:06:48 +0000

Hi! Let's say you are trying to create a list of people and information about them (name, age, gender, address, etc.).

Notice all of these are different types! In Python, lists can handle different types, so you could store all of these.

In C++, you cannot store all of these in a vector or array because these containers require the same type for each element. However, you can make a class or struct to represent a person like this:

struct Person{
    std::string name;
    std::string address;
    uint8_t age;
    char gender;
};

Great you could store all this info together in your program.

However, what if later I told you we are adding email to person?

You would have to stop your program, store the program data somewhere, update the Person struct to have email and reload the program data into the struct properly. Ughhh, so much extra work!

What if I told you, I wanted to iterate through all the fields of Person and print them out. You would have to manually type something like this:

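void print(Person p){
    std::cout << p.name << std::endl;
    // uint8_t is a character type, so cast it to print the number instead of a character
    std::cout << static_cast<int>(p.age) << std::endl;
    ...
}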

Ughhh, so much extra typing :(

If only, there was some solution.......

There is. Today we are going to try to make a list that can maintain different types of values.

This is inspired by std::tuple, which can already hold different types of data. However, you can't add/remove elements to it or iterate through it.

std::tuple<int, char, double> t = std::make_tuple(1, 'a', 4.2);

With this new container, we'll try to support all of this. We'll also try to make operations have fast O(1) constant access where we can, with the help of a C++20 compiler!

In order to build this, we'll talk a bit about C++ variadic templates.

Basics of `templates`

Okay cool. Let me just show you a template. If you already know how templates work, skip past this intro.

template<typename T>
void print(T arg){
     std::cout << arg << std::endl;
}

In this example, we made a template function. Here, we say that the argument has type T. T could be any type.

T could be an int. T could be char. But everything inside the function has to work for the given types.

The compiler looks for the usages of print and creates versions of functions for types that were used.

For instance, if you had in your program:

print('a');
....
print(2.5);

The compiler will expand print by creating functions that essentially look like this behind the scenes:

void print(double arg){
    std::cout << arg << std::endl;
}

void print(char arg){
    std::cout << arg << std::endl;
}

In fact we can even templatize structs and class declarations, and those will expand out to the specialized versions of the declarations after the compiler sees all use cases.

template<typename T>
struct test{
    T val;
};

could expand to something like this if you use test<double> or test<char>:

struct test<double>{
   double val;
};

struct test<char>{
   char val;
};

This will be useful when constructing our tuple as we may want to declare different types that our tuple can store at compile time so that the compiler knows how much memory to use. (In the test struct example, test<double> is a different memory size than test<char>, 8 bytes vs 1 byte)
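You can ask the compiler to confirm the size claim with static_assert. This is a small sketch; the exact sizes are what you'd see on typical platforms where a double is 8 bytes.

```cpp
#include <cassert>

template<typename T>
struct test {
    T val;
};

// Each instantiation is its own type with its own size.
static_assert(sizeof(test<char>) == 1, "test<char> holds a single char");
static_assert(sizeof(test<double>) == sizeof(double), "test<double> holds a single double");
static_assert(sizeof(test<char>) < sizeof(test<double>), "different instantiations, different sizes");
```

If any of these were false, the program would fail to compile, which is a handy way to check template behavior without running anything.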

Now let's talk about custom template specialization. We can declare a template, but create our own specialization of that template for certain types if we want it to behave differently.

template<typename T>
void print(T arg){
     std::cout << arg << std::endl;
}

template<>
void print<int>(int arg){
     std::cout << arg << " is an int. " << std::endl;
}

The compiler won't create a new print<int> implementation. Instead, it'll use our custom specialized one.

Okay, that's enough of basic templates.

Let's talk about advanced templates with variadic templates.

Variadic templates

If you already know variadic templates, jump ahead and we'll start making our list! But for you noobs like me, stick around :).

We'll be very brief about it. There's a lot to explore in this world.

Variadic templates/packs have been around since C++11. The fold expression used in the example below needs C++17.

template<typename... Pack>
void print(Pack... args){
     (std::cout << ... << args) << std::endl;
}

Here we show that print is using a pack in its template for its arguments. Pack... args shows that the print function can use any number of arguments. So:

print('a', 2, "abc");
print(2, 1, '3', 3.5, true);
print(); // empty line

All of these are valid! The example also had this line:

     (std::cout << ... << args) << std::endl;

This is an example of a (binary left) fold expression. Basically, it is syntax that expands out to its full form:
std::cout << arg1 << arg2 ... etc << std::endl;

If you would like to learn more about fold expressions please read here.
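Here are two more fold expressions to experiment with, as a sketch (C++17 or newer). The sum and concat helpers are made up for illustration. Both are binary left folds, meaning they start from an initial value on the left and work through the pack from left to right.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Binary left fold with 0 as the initial value: ((0 + a1) + a2) + ...
template<typename... Pack>
auto sum(Pack... args) {
    return (0 + ... + args);
}

// Same shape as the print example, but writing into a string:
// ((os << a1) << a2) << ...
template<typename... Pack>
std::string concat(Pack... args) {
    std::ostringstream os;
    (os << ... << args);
    return os.str();
}
```

Because both folds have an initial value, they also work with an empty pack: sum() is 0 and concat() is an empty string.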

Okay, this was a very very basic intro, but I think we can start building our list data-structure and learn the rest on the way.

Building a tuple list

Let's make a tuple list!

template<typename...>
struct TupleList;

At the very basic level, this is all the TupleList is.

Here you may notice something weird. We use typename... as opposed to typename... Args.

This is because we are only declaring it here, and we don't intend to use the pack in the declaration, which is valid.

We'll be specializing TupleList and you will see us introduce named packs like typename... Args.

Okay, let's specialize the TupleList so we can actually understand how it works lol. The declaration wasn't very informative.

// declaration
template<typename...>
struct TupleList;

// base case, empty tuple, empty pack
template<>
struct TupleList<>{};

// recursive inheritance case
template<typename T, typename... Rem>
struct TupleList<T, Rem...>: TupleList<Rem...>{
    T val;
};

Here is an implementation of the TupleList. There may be other ways to implement it, but for now we will talk about this method.

After declaring the TupleList, we created a specialization for the empty TupleList. This is our base case, the smallest/simplest unit of TupleList.

All other TupleLists branch off the empty TupleList<>.

Let's talk about the non-empty TupleList, the recursive case. All non-empty TupleLists store a T val.

This represents our TupleList element that we store.

You may also notice that the recursive case is a child of another TupleList, TupleList<Rem...>.

Yes, a templated struct can inherit from another instantiation of itself, as long as the template parameters are different (and there is no infinite recursion/inheritance).

The inheritance must come to an end, which in our case is the base case.

Every non-empty TupleList inherits from its parent, which has one less template argument. So here is an example visualization of how this may look:

TupleList<int, char, double> -> TupleList<char, double> -> TupleList<double> -> TupleList<>

The arrows show its next immediate parent. This means that TupleList<int, char, double> is also a TupleList<> at the very base layer.

But each parent has a different T val because its first template parameter is different.
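We can ask the compiler to confirm this chain of parents. Here is a small sketch using std::is_base_of_v from <type_traits> (C++17), with the TupleList definition repeated so it compiles on its own.

```cpp
#include <cassert>
#include <type_traits>

template<typename...>
struct TupleList;

template<>
struct TupleList<> {};

template<typename T, typename... Rem>
struct TupleList<T, Rem...> : TupleList<Rem...> {
    T val;
};

// Every link in the chain is a (possibly indirect) parent of the full list.
static_assert(std::is_base_of_v<TupleList<char, double>, TupleList<int, char, double>>);
static_assert(std::is_base_of_v<TupleList<double>, TupleList<int, char, double>>);
static_assert(std::is_base_of_v<TupleList<>, TupleList<int, char, double>>);

// A list that is not a suffix of the type list is not a parent.
static_assert(!std::is_base_of_v<TupleList<int>, TupleList<int, char, double>>);
```

Note that only suffixes of the type list show up as parents, since each step strips the first type.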

Okay cool, we have a TupleList. How do we create one in our code?

Creating a `TupleList`

To create a TupleList instance, we will need some constructors.

We'll have to add a constructor to our non empty TupleList so that we can pass values into our new container.

template<typename T, typename... Rem>
struct TupleList<T, Rem...>: TupleList<Rem...>{
    T val;
    // rem... expands the pack so each remaining value is passed on to the parent
    TupleList(T val, Rem... rem): TupleList<Rem...>(rem...), val(val){}
};

This constructor takes in a value, and a pack of remaining values.

The remaining values are passed into its parents constructor.

And its parents will pass it to its parents... and so on.

We strip away one value for each parent until it becomes an empty pack at which point the TupleList parent will be the base case.

Isn't this cool?

It's like a type linked list.

One TupleList stores one type's value and links, via inheritance, to the next one, which stores the next value.

The constructor traverses this linked list of parents that stores various types.

So now if I want to construct a TupleList, it would look like this:

TupleList<int, double, char> tuple(1, 2.5, 'c');

Yayyyy! Now we can store multiple types of values contiguously in memory like a list.
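Before we build a proper get, here is a sketch of how you could already peek at the stored values by hand. It assumes the constructor expands the pack with rem... when calling the parent. The second_of helper is made up for illustration.

```cpp
#include <cassert>

template<typename...>
struct TupleList;

template<>
struct TupleList<> {};

template<typename T, typename... Rem>
struct TupleList<T, Rem...> : TupleList<Rem...> {
    T val;
    TupleList(T val, Rem... rem) : TupleList<Rem...>(rem...), val(val) {}
};

// Hypothetical helper: view the tuple as its parent to reach the second element.
double second_of(TupleList<int, double, char>& t) {
    TupleList<double, char>& parent = t;  // derived-to-base conversion
    return parent.val;
}
```

t.val always refers to the first element, because the child's val hides the ones in its parents. Walking up one parent per index is exactly what the get function in the next section automates.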

Now how do I get values out?

`get` values from `TupleList`

The ideal interface for getting values from a list would be with indexes. We'd like to randomly access different indexes.

It gets weird though because we want a get function, but depending on the index, get is going to have different return types....

Oh no, .... how can we fix this????

Well good thing we have templates.

We'll be able to generate many types of get functions with minimal implementations.

First we need to figure out how to translate indexes to a specific parent of TupleList so that we can extract its value.

I find this part very interesting...

template<size_t idx, typename TupleList>
struct GetIndex;

// base case: index 0 is the first type of the list
template<typename T, typename... Rem>
struct GetIndex<0, TupleList<T, Rem...>>{
     using type = T;
};

// recursive case: drop the first type and decrement the index
template<size_t idx, typename T, typename... Rem>
struct GetIndex<idx, TupleList<T, Rem...>>{
     using type = typename GetIndex<idx-1, TupleList<Rem...>>::type;
};

We created a struct called GetIndex.

GetIndex is a templated empty struct. It is defined by a pair, index and a TupleList.

Let's start with the base case here. When index is 0, we have reached our element in this TupleList.

But to get to index 0, we have to keep decrementing index as we strip away an element in the TupleList type.

As we saw earlier, to get to the next element, we have to go to our parent TupleList. By stripping an element and decrementing an index, we go one step into our parent.

Hence each of these empty GetIndex structs is a node in a type linked list that takes you from one element to the next, with an index for each parent.
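To see the index walk in action, here is a minimal sketch that asks the compiler which type lives at each index. The `Demo` alias and the `size_at` helper are made up for this illustration, and only a declaration of TupleList is used because nothing here needs a value.

```cpp
#include <cstddef>
#include <type_traits>

// Only the type is inspected here, so a declaration of TupleList is enough.
template<typename...>
struct TupleList;

template<std::size_t idx, typename Tuple>
struct GetIndex;

// Base case: index 0 names the first type of the pack.
template<typename T, typename... Rem>
struct GetIndex<0, TupleList<T, Rem...>> {
    using type = T;
};

// Recursive case: drop the first type and decrement the index.
template<std::size_t idx, typename T, typename... Rem>
struct GetIndex<idx, TupleList<T, Rem...>> {
    using type = typename GetIndex<idx - 1, TupleList<Rem...>>::type;
};

// Hypothetical alias used only for this demo.
using Demo = TupleList<int, double, char>;

// Each check is resolved by the compiler, not at runtime.
static_assert(std::is_same_v<GetIndex<0, Demo>::type, int>);
static_assert(std::is_same_v<GetIndex<1, Demo>::type, double>);
static_assert(std::is_same_v<GetIndex<2, Demo>::type, char>);

// Hypothetical helper: size of the type found at an index.
template<std::size_t idx, typename Tuple>
constexpr std::size_t size_at() {
    return sizeof(typename GetIndex<idx, Tuple>::type);
}
```

If any of the static_asserts were wrong, the program would simply refuse to compile.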

Great. We have indexes, now how do we implement get using GetIndex?

template<size_t idx, typename TupleList>
struct GetIndex;

template<typename T, typename... Rem>
struct GetIndex<0, TupleList<T, Rem...>>{
     using type = T;
     static type get(TupleList<T, Rem...>& tuple){
          return tuple.val;
     }
};

template<size_t idx, typename T, typename... Rem>
struct GetIndex<idx, TupleList<T, Rem...>>{
     using type = typename GetIndex<idx-1, TupleList<Rem...>>::type;
     static type get(TupleList<T, Rem...>& tuple){
          // tuple converts to its parent TupleList<Rem...> here
          return GetIndex<idx-1, TupleList<Rem...>>::get(tuple);
     }
};

template<size_t idx, typename T, typename... Rem>
typename GetIndex<idx, TupleList<T, Rem...>>::type get(TupleList<T, Rem...>& tuple){
     return GetIndex<idx, TupleList<T, Rem...>>::get(tuple);
}

So we just implemented the get function.

The main one that gets called is the non-member get function, the one outside the GetIndex struct.

This main get function calls the get functions of the GetIndex structs to recursively go through the linked list and grab the value.

The compiler should optimize this into an O(1) operation as each get function only has 1 line which is to call the next get function or return the value.

But here is the interesting part. The main get function is a templated function. Its return type is dependent on the GetIndex<idx, TupleList<T, Rem...>>::type type.

The return type goes through the templates and runs through the GetIndex type alias recursive chain (of decrementing index and looking at the parent's type) until it hits the definition for type in the base case (index is 0).

The compiler will create definitions by recursing through GetIndex to determine the function's return type.

Now the only part that sucks about this is that the compiler requires you to know the index you want to access at compile time.

Templates are evaluated and specialized by the compiler when you compile your program.

So when you call this get function like this:

TupleList<int, double, char> list(1, 2.5, 'c');
double val = get<1>(list);

The compiler determines at compile time that this get function returns type double and also wants index 1.

(Non-type template arguments must be constant expressions, such as literals or constexpr values.) So you can't do get<variable> with a runtime variable.
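Here is a small self-contained sketch of that rule, reusing the TupleList and GetIndex shapes from this post. The `read_middle` function is a made-up name for the demo. A constexpr index is fine, while the commented-out line with a plain variable would not compile.

```cpp
#include <cstddef>

template<typename...> struct TupleList;
template<> struct TupleList<> {};

template<typename T, typename... Rem>
struct TupleList<T, Rem...> : TupleList<Rem...> {
    T val;
    TupleList(T val, Rem... rem) : TupleList<Rem...>(rem...), val(val) {}
};

template<std::size_t idx, typename Tuple> struct GetIndex;

template<typename T, typename... Rem>
struct GetIndex<0, TupleList<T, Rem...>> {
    using type = T;
    static type get(TupleList<T, Rem...>& tuple) { return tuple.val; }
};

template<std::size_t idx, typename T, typename... Rem>
struct GetIndex<idx, TupleList<T, Rem...>> {
    using type = typename GetIndex<idx - 1, TupleList<Rem...>>::type;
    static type get(TupleList<T, Rem...>& tuple) {
        // The derived-to-base conversion steps into the parent TupleList.
        return GetIndex<idx - 1, TupleList<Rem...>>::get(tuple);
    }
};

template<std::size_t idx, typename T, typename... Rem>
typename GetIndex<idx, TupleList<T, Rem...>>::type get(TupleList<T, Rem...>& tuple) {
    return GetIndex<idx, TupleList<T, Rem...>>::get(tuple);
}

// Hypothetical demo function (the name is not from the post).
inline double read_middle() {
    TupleList<int, double, char> list(1, 2.5, 'c');
    constexpr std::size_t i = 1;          // constant expression: allowed
    // std::size_t j = 1; get<j>(list);   // runtime variable: does not compile
    return get<i>(list);
}
```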

Great, if I have a for loop then, how do I access each index one by one?

You have two options:

implement a loop function
have a get function that we can give an index at runtime

Okay let's try both:

a runtime `get` function

Remember the idea here is we want to use our index at runtime to get the value out of our tuple.

It gets tricky because we no longer get to use our templated index where 0 is our base case. We'll have to check for 0 at runtime.

This makes our return type a bit tricky.

We'll now have to use type erasure because our compiler won't know the return type at compile time anymore without compile time indexes.

What is type erasure? It's where the type of the value is unknown until you query for it.

C++ has std::any and std::variant for this.

template<typename...>
struct TupleList;

template<>
struct TupleList<> {
    std::any get(size_t) const {
        throw std::out_of_range("Index out of bounds");
    }
};

template<typename T, typename... Rem>
struct TupleList<T, Rem...> : TupleList<Rem...> {
    T val;
    TupleList(T val, Rem... rem): TupleList<Rem...>(rem...), val(val) {}

    std::any get(size_t index) const {
        if (index == 0)
            return val;
        else
            return TupleList<Rem...>::get(index - 1);
    }
};

Great! With this, you can call get like this:

TupleList<int, double, char> t(1,2.5,'c');
int idx = 2;
std::any res = t.get(idx);

The difference now is that we can pass in variable indexes at runtime.

We can also optionally make the get function static, so that it is used as before with get(tuple, idx).

Another main thing to note, is that std::any is returned. Now we have to check the type before extracting the value out.

This is an annoying part with a runtime get function.

if (res.type() == typeid(char)) std::cout << std::any_cast<char>(res) << std::endl;
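If comparing typeid by hand gets tiring, std::any_cast also has a pointer overload that returns nullptr when the type doesn't match. Here is a rough sketch of that idea. The `describe` helper is a made-up name and it only knows about the three types from our example tuple.

```cpp
#include <any>
#include <string>

// Hypothetical helper: describe what a std::any holds without throwing.
inline std::string describe(const std::any& value) {
    // The pointer overload of std::any_cast returns nullptr on a type mismatch.
    if (const int* i = std::any_cast<int>(&value)) {
        return "int: " + std::to_string(*i);
    }
    if (const double* d = std::any_cast<double>(&value)) {
        return "double: " + std::to_string(*d);
    }
    if (const char* c = std::any_cast<char>(&value)) {
        return std::string("char: ") + *c;
    }
    return "unknown type";
}
```

You still have to list every type you expect, so the annoyance doesn't fully go away, but there are no exceptions to catch.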

Another annoying part about this function is that the compiler generally can't resolve the index for us at compile time anymore.

So accessing indexes is worst case O(n) now because each time we enter the next parent, we have to check if index is 0 at runtime.

It shouldn't be a problem however if your TupleLists tend to be small. But if possible, prefer the smarter compile time get function.

Okay let's explore the loop option.

Can we loop without runtime indices?

The answer is yes.

template<typename...>
struct TupleList;

template<>
struct TupleList<> {
     // base case: no elements left, so there is nothing to call func on
     static void loop(size_t idx, TupleList<>& tuple, auto&& func){}
};

template<typename T, typename... Rem>
struct TupleList<T, Rem...> : TupleList<Rem...> {
    T val;
    TupleList(T val, Rem... rem): TupleList<Rem...>(rem...), val(val) {}

    static void loop(size_t idx, TupleList<T, Rem...>& tuple, auto&& func ){
         func(idx, tuple.val);
         TupleList<Rem...>::loop(idx+1, tuple, func);
    }
};

template<typename T, typename... Rem>
void loop(TupleList<T, Rem...>& tuple, auto&& func){
    TupleList<T, Rem...>::loop(0, tuple, func);
}

Cool. We have a loop function that takes in a reference to a function.

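Honestly, we could also use std::function instead of auto&&, but only if a single signature can accept every element type, since func gets called with a different type for each element.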

But we can see that we iterate into the static function of the parent which lets us extract the next value as we increment indices. This makes this function O(n).

We can provide both the index and the value to our function for processing.

The only problem is now your func is expected to handle multiple types of values when looping through.

No need to fear, we have templates. Here are some examples:

TupleList<int, double, char> tuple(1,2.5,'c');
// lambda 'auto' arg becomes a templated arg
auto print = [](size_t idx, auto arg){
     std::cout << arg << std::endl;
};
loop(tuple, print);

auto int_nonint_print = [](size_t idx, auto arg){
     if constexpr (std::is_same_v<decltype(arg), int>){
          std::cout << arg << " is an int" << std::endl;
     }else{
          std::cout << arg << " is not an int" <<std::endl;
     }
};
loop(tuple, int_nonint_print);

// template lambda (C++20)
auto print2 = []<typename T>(size_t idx, T arg){
     std::cout << arg << std::endl;
};
loop(tuple, print2);

Here I show 3 examples of using loop:

auto lambda will become a functor with a templated argument for its call operator, so it acts like example 3.
int_nonint_print shows we can use if constexpr with type traits to create different logic for ints and non-ints, so we can have type based logic in one lambda (if constexpr needs C++17)
an explicit template lambda is used (C++20). The compiler will specialize and generate versions of the call operator based on the types in the TupleList
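Putting the pieces together, here is a self-contained sketch of the loop idea that you can compile on its own. It assumes a plain template parameter `F` in place of `auto&&` so that it also builds as C++17, and it writes into a string instead of std::cout. The `describe_all` function is a made-up name for the demo.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <type_traits>

template<typename...> struct TupleList;

template<>
struct TupleList<> {
    // Base case: nothing left to visit.
    template<typename F>
    static void loop(std::size_t, TupleList<>&, F&&) {}
};

template<typename T, typename... Rem>
struct TupleList<T, Rem...> : TupleList<Rem...> {
    T val;
    TupleList(T val, Rem... rem) : TupleList<Rem...>(rem...), val(val) {}

    template<typename F>
    static void loop(std::size_t idx, TupleList<T, Rem...>& tuple, F&& func) {
        func(idx, tuple.val);
        // tuple binds to its parent TupleList<Rem...> in the next call.
        TupleList<Rem...>::loop(idx + 1, tuple, func);
    }
};

template<typename F, typename T, typename... Rem>
void loop(TupleList<T, Rem...>& tuple, F&& func) {
    TupleList<T, Rem...>::loop(0, tuple, func);
}

// Hypothetical demo: label each element by type and collect the text.
inline std::string describe_all() {
    TupleList<int, double, char> tuple(1, 2.5, 'c');
    std::ostringstream out;
    loop(tuple, [&out](std::size_t idx, auto arg) {
        if constexpr (std::is_same_v<decltype(arg), int>) {
            out << idx << ": " << arg << " is an int\n";
        } else {
            out << idx << ": " << arg << " is not an int\n";
        }
    });
    return out.str();
}
```

The generic lambda gets instantiated three times here, once for each element type.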

Conclusion

Ok we've talked about a lot for this blog. This topic is not over yet.

I want to build out the rest of TupleList including adding/removing elements.

But right now we have a fixed-size TupleList. And we can loop through it and access different elements at runtime or compile time.

Let's talk about the differences between the TupleList and std::vector as a baseline.

TupleList is created on the stack (but it could be heap as well) whereas std::vector allocates its elements on the heap
TupleList can store different types and std::vector cannot (unless it uses std::any)
TupleList can have constant compile time access or O(n) runtime access whereas std::vector has constant random access O(1).
You can loop through both.
As of right now, you can only change the size of std::vector until part 2 :)

Okay overall, I hope you liked this mini lesson and cool usage of variadic templates.

And we'll explore more of this in another part.

As usual, drop your questions below :)

And here's example code, with an additional hash example if you want to mess around with it: https://pastebin.com/rERy1avn

Until next time
-absterdabster

Motivation behind C++ Concepts

absterdabster — Tue, 08 Apr 2025 03:21:55 +0000

C++ 20 introduced concepts. What are they? Why should I care about them? How do I use them?

Concepts are a powerful tool to help you write generic code with restrictions evaluated at compile time.

What does that mean?

Let's say I make a library and I want to create a function that allows my users to pass in a singular integer, float, or string into it.

However, I don't want to let them pass in a boolean.

Is there a way to accomplish this by writing only one function?

YES!

We can do this with C++ templates and concepts. (Thanks C++ 20)

Ok let's start at the very beginning... templates

templates

Okay, let's say I'm making a library that accepts an integer, float, or string.

One way to do this would be like this with function overloading:

#include <iostream>
#include <string>

void function(int v){
        std::cout << "function: " << v << std::endl;
}

void function(double v){
        std::cout << "function: " << v << std::endl;
}

void function(std::string v){
        std::cout << "function: " << v << std::endl;
}

int main(int argc, char** argv){
        function("hi");
        function(2);
}

Yes all the functions do the same thing and this compiles.

But why do I have to write it 3 times???

You don't!!!

Let's use templates to simplify this to one.

#include <iostream>

template <typename T>
void function(T v){
        std::cout << "function: " << v << std::endl;
}

int main(int argc, char** argv){
        function("hi");
        function(2);
}

Nice so we have a program that works and compiles.

The compiler sees that we call function for "hi" and 2.

So when compiling, it automatically creates two variations of the function. One accepts the integer and another accepts the string literal (here T is deduced as const char*, not std::string).

If I introduced a third type, let's say a double, it would then compile a double version of function.
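If you want to check which version the compiler generated, here is a tiny sketch. The `deduced_name` helper is made up for this demo and only recognizes a handful of types.

```cpp
#include <string>
#include <type_traits>

// Hypothetical helper: report which instantiation the compiler picked for T.
template <typename T>
std::string deduced_name(T) {
    if constexpr (std::is_same_v<T, int>) {
        return "int";
    } else if constexpr (std::is_same_v<T, double>) {
        return "double";
    } else if constexpr (std::is_same_v<T, const char*>) {
        return "const char*";   // what a string literal decays to
    } else if constexpr (std::is_same_v<T, std::string>) {
        return "std::string";
    } else {
        return "something else";
    }
}
```

Passing "hi" reports const char*, because a string literal passed by value decays to a pointer instead of becoming a std::string.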

Super cool!

I can write multiple versions of my function with less code using templates.

I do want to note there is one other way you could do this. In C++ 17 and after, there are type erasure types. (std::variant and std::any)

These types can hold a value of one of several types. Which type they currently hold is determined at runtime as opposed to compile time.

Long story short, you can do this:

#include <iostream>
#include <string>
#include <variant>

void function(std::variant<int, std::string> v){
        if(std::holds_alternative<int>(v)){
                std::cout << "function: " << std::get<int>(v) << std::endl;
        }else{
                std::cout << "function: " << std::get<std::string>(v) << std::endl;
        }
}

int main(int argc, char** argv){
        function("hi");
        function(2);
}

In my opinion, this looks ugly/verbose. Also there is an extra if statement in there.

As a result, I prefer using a template for this case.

The good thing about a variant is it can restrict the types allowed into the function unlike a general template.

For instance, it only lets an integer or a string type into function.

Okay, great! Is there a way we can do this with templates?

The answer is yes!

We can look at ways to do this shortly.

Just before we get there, I want to show you another limitation of templates.

`std::enable_if`

C++ has std::enable_if to activate certain functions if a condition is true.

std::enable_if is often used with templates.

std::enable_if is itself a templated type. The first template argument is the condition, and the second is the type to use if the condition is true.

In C++, this introduces a concept called SFINAE (Substitution Failure is Not an Error).

Let's see an example of this:

#include <iostream>
#include <string>
#include <type_traits>

template <typename T>
typename std::enable_if<std::is_same<T, int>::value ||
std::is_constructible<std::string, T>::value, void>::type
function(T v){
        std::cout << "function: " << v << std::endl;
}

template <typename T>
typename std::enable_if<!(std::is_same<T, int>::value ||
std::is_constructible<std::string, T>::value), void>::type
function(T v){
        std::cout << "diff function: " << v << std::endl;
}

int main(int argc, char** argv){
        function("hi");
        function(2);
        function(true);
}

The output here is:

function: hi
function: 2
diff function: 1

So both the std::enable_ifs are used to make a void return type if their condition is true for the templated type.

We use type_traits in C++ to build our condition. (std::is_same and std::is_constructible)

If T is an int or can be used to construct a std::string, the first function is given a void return type.

However, the second function's std::enable_if then has no type member, so its return type is invalid. As a result, the function gets thrown away by the compiler instead of causing an error.

Hence, with std::enable_if we can have multiple functions such that if type substitution fails with one function, there may be another function that can be used.

Not only that, by using std::enable_if you can get the effect of partially specialized functions through overloads, like a version for T and a version for std::vector<T> coexisting.
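Here is a rough sketch of a plain version and a std::vector version living side by side. The `is_vector` trait and the `kind` function are made-up names for this illustration and are not part of the standard library.

```cpp
#include <string>
#include <type_traits>
#include <vector>

// Hypothetical trait: true only for std::vector<...>.
template <typename T>
struct is_vector : std::false_type {};

template <typename T, typename A>
struct is_vector<std::vector<T, A>> : std::true_type {};

// Enabled for everything that is not a vector.
template <typename T>
typename std::enable_if<!is_vector<T>::value, std::string>::type
kind(const T&) {
    return "single value";
}

// Enabled only for vectors.
template <typename T>
typename std::enable_if<is_vector<T>::value, std::string>::type
kind(const T& v) {
    return "vector of " + std::to_string(v.size());
}
```

For any given T exactly one of the two conditions is true, so the other overload silently drops out of the running.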

BOOM. SFINAE solved.

Okay, cool, we can restrict types with std::enable_if.

But even this looks ugly. Our enable if conditions look long.

Is there a better way we could fix this?

Concepts

In C++ 20, concepts were introduced.

Concepts are so cool.

Let's solve the problem we had earlier with concepts. Things get more concise.

#include <iostream>
#include <string>
#include <type_traits>
#include <concepts>

template <typename T>
concept ValidType = std::same_as<T, int> || std::is_constructible_v<std::string, T>;

template <ValidType T>
void function(T v){
        std::cout << "function: " << v << std::endl;
}

template <typename T>
void function(T v){
        std::cout << "diff function: " << v << std::endl;
}

int main(int argc, char** argv){
        function("hi");
        function(2);
        function(true);
}

This is kind of cool and way cleaner. Shouldn't 2 work for both functions?

Concepts won't throw an error here.

Instead, the compiler picks the more constrained option, which is the first function.

You can chain multiple concepts together with Concept1 || Concept2 or Concept1 && Concept2.

You can create concepts with other concepts:

template<typename T>
concept IsInt = std::same_as<T, int>;

template<typename T>
concept IsString = std::is_constructible_v<std::string, T>;

template <typename T>
concept ValidType = IsInt<T> || IsString<T>;

And all of this is done at compile time!

This is the modern form of the C++ SFINAE concept.

Concepts go hand in hand with the requires keyword.

I don't want to make this blog tooooo long, but if you would like me to explain requires in more detail, drop a comment! Or read more here.

The Conclusion

Concepts are cool and clean and powerful
std::enable_if is noice but it can get very verbose
std::variant can get messy with type based if statements and runtime running
Generic templates are great! But function templates cannot be partially specialized or restrict types on their own.

Try using concepts. They are easy and make you feel great for using modern C++.

Peace
-absterdabster

Measuring your program speed correctly

absterdabster — Thu, 27 Mar 2025 03:46:48 +0000

Hallo curious friend! Have you ever run a program and wondered how long it took?

Let's say you had two programs and you were trying to figure out which one was faster. Maybe you used a tool and got a measurement for both. How sure are you that it took that long?

Let's explore measuring our program speed and try to find out how to be as accurate as possible. MAYBE EVEN DOWN TO THE NANOSECOND!

As Barney Stinson would say, CHALLENGE ACCEPTED!

Okay there are several ways we could try to do this. Let's look at them one by one.

Why would I care about nanosecond precision?

CPUs these days run at around 3 GHz.

What does that mean?

That means it runs 3 * 1e9 cpu clock cycles per second.

3 BILLION CYCLES PER SECOND.

This also means 3 cpu clock cycles per nanosecond...
(a nanosecond is 1e-9 seconds, super small....).

If I can run 1 addition operation in 1 cpu clock cycle, in 1 second, I can add 3 BILLION things together!

Wow computers are so powerful. But this also means to be precise about our performance, we should probably try to measure as close to nanoseconds as possible.

The POSIX `time` command

If you use the time command, which is implemented in POSIX systems like linux, you will get 3 types of times for your program.

real: The real time is end to end time of your program from invocation to the end of the process.
user: This is the time the cpu spent in user space (the logic of your program).
system: This is the time the cpu spent in kernel space for either system calls or interrupts. (System calls are ways your program can make use of your operating system's resources. Interrupts are the way your operating system prevents you from hogging the cpu for yourself.)

Great let's run a simple C or C++ program and see how things get measured.

int main(int argc, char** argv){
        int sum{0};
        for(int i = 0; i < 20; i++){
                sum += i;
        }
        return sum;
}

Let's make this fun. Give a guess how long you think this will run. If you're brave enough, lock in your answer in the comments lol. We can see who gets the closest.

Okay, I'm going to run it now...

time ./test

Here is what we get:

real    0m0.014s
user    0m0.001s
sys     0m0.000s

Okay it seems like our process took 0.014s overall, but the logic of our program took 0.001s (1 millisecond) of it.

Honestly, at this point I have no clue if this is right or not.

But many people might stop here and believe this value.

Let's see what is happening behind the scenes. It seems like time is a built in shell command, so let's look at the bash shell codebase.

After some digging, we find the time_command C function here. Here is an important part of it.

#if defined (HAVE_GETRUSAGE) && defined (HAVE_GETTIMEOFDAY)
  struct timeval real, user, sys;
  struct timeval before, after;
#  if defined (HAVE_STRUCT_TIMEZONE)
  struct timezone dtz;              /* posix doesn't define this */
#  endif
  struct rusage selfb, selfa, kidsb, kidsa; /* a = after, b = before */
#else
#  if defined (HAVE_TIMES)
  clock_t tbefore, tafter, real, user, sys;
  struct tms before, after;
#  endif
#endif

Let's dissect this. There are 2 important terms to highlight here: timeval and rusage.

`timeval`

timeval holds two variables.

struct timeval {
    time_t      tv_sec;     /* seconds */
    suseconds_t tv_usec;    /* microseconds */
};

It is often used with the gettimeofday system call. This system call is defined here. It gives you microseconds since epoch (Jan 1, 1970).

A system call is a call your program makes to the system to perform privileged tasks like asking for the time from hardware.

Specifically, gettimeofday is a vsyscall and you can see how it works here.

But long story short, it asks the system's real-time clock (RTC) for the time. RTC actually gives you the time in nanoseconds.

But we actually lose this precision and it gets converted to microseconds/seconds when we get it! AHHHHHHH! This sucks...

The idea is we use gettimeofday before the program starts and after it ends to measure the end to end program time.
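As a rough sketch of that before/after idea inside a single program (instead of around a whole process like time does), something like this works on Linux and other POSIX systems. The helper names `elapsed_us` and `time_loop_us` are made up, and the volatile is only there so the compiler keeps the loop.

```cpp
#include <sys/time.h>

// Hypothetical helper: microseconds between two timeval readings.
inline long long elapsed_us(const timeval& before, const timeval& after) {
    return (after.tv_sec - before.tv_sec) * 1000000LL
         + (after.tv_usec - before.tv_usec);
}

// Wall-clock time of a small loop, in microseconds.
inline long long time_loop_us() {
    timeval before{}, after{};
    gettimeofday(&before, nullptr);
    volatile int sum = 0;  // volatile keeps the loop from being optimized away
    for (int i = 0; i < 20; i++) {
        sum = sum + i;
    }
    gettimeofday(&after, nullptr);
    return elapsed_us(before, after);
}
```

Since timeval bottoms out at microseconds, a loop this small will most likely report 0 or 1.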

`rusage`

It's actually short for resource usage. The linux kernel maintains statistics about your program like the amount of time spent in user space and time spent by the system on other tasks.

It also contains other information relating to memory and input/output devices. To give you an idea, it looks like this:

struct rusage {
               struct timeval ru_utime; /* user CPU time used */
               struct timeval ru_stime; /* system CPU time used */
               long   ru_maxrss;        /* maximum resident set size */
               long   ru_ixrss;         /* integral shared memory size */
               long   ru_idrss;         /* integral unshared data size */
               long   ru_isrss;         /* integral unshared stack size */
               long   ru_minflt;        /* page reclaims (soft page faults) */
               long   ru_majflt;        /* page faults (hard page faults) */
               long   ru_nswap;         /* swaps */
               long   ru_inblock;       /* block input operations */
               long   ru_oublock;       /* block output operations */
               long   ru_msgsnd;        /* IPC messages sent */
               long   ru_msgrcv;        /* IPC messages received */
               long   ru_nsignals;      /* signals received */
               long   ru_nvcsw;         /* voluntary context switches */
               long   ru_nivcsw;        /* involuntary context switches */
           };

Yep, it's a lot of statistics.

To collect this data, your linux libc library supports the getrusage syscall (more info).

You can track the entire process (combines all threads), a specific thread, or even children processes!

As you can see, this once again utilizes the timeval struct, which has microsecond precision.

While the microsecond precision sucks a little, it is pretty cool that the kernel tracks stuff for us (especially the user/system time).

Overall, here is what I've concluded from the time command.

It has microsecond precision
It maintains user/system time for a process
It provides data from process startup to process destruction

Ok let's test it out. I was a bit lazy so I asked ChatGPT to help me implement the getrusage calls for this one. It looks like this:

#include <iostream>
#include <sys/resource.h>
int main(int argc, char** argv){
        struct rusage usage;
        struct timeval start_user, start_system, end_user, end_system;
        long long start_user_us, start_system_us, end_user_us, end_system_us;

        // Get starting resource usage
        getrusage(RUSAGE_SELF, &usage);

        start_user = usage.ru_utime;
        start_system = usage.ru_stime;

        // Convert timeval to microseconds for easier calculation
        start_user_us = (start_user.tv_sec * 1000000LL) + start_user.tv_usec;
        start_system_us = (start_system.tv_sec * 1000000LL) + start_system.tv_usec;

        int sum{0};
        for(int i = 0; i < 20; i++){
                sum += i;
        }

        getrusage(RUSAGE_SELF, &usage);

        end_user = usage.ru_utime;
        end_system = usage.ru_stime;

        // Convert to microseconds
        end_user_us = (end_user.tv_sec * 1000000LL) + end_user.tv_usec;
        end_system_us = (end_system.tv_sec * 1000000LL) + end_system.tv_usec;

        // Calculate elapsed time
        long long elapsed_user_us = end_user_us - start_user_us;
        long long elapsed_system_us = end_system_us - start_system_us;
        long long elapsed_total_us = elapsed_user_us + elapsed_system_us;

        // Print results
        std::cout << "User CPU time: " << elapsed_user_us / 1000000.0 << " seconds" << std::endl;
        std::cout << "System CPU time: " << elapsed_system_us / 1000000.0 << " seconds" << std::endl;
        std::cout << "Total CPU time: " << elapsed_total_us / 1000000.0 << " seconds" << std::endl;
}

And so how long did it take??? Here is the output:

User CPU time: 1e-06 seconds
System CPU time: 0 seconds
Total CPU time: 1e-06 seconds

1 MICROSECOND! whoa did we get faster??? Or are we wrong?

Remember the time command said we spent 1 millisecond in user space. And the whole program took 14 milliseconds from process creation to destruction.

So who is more right??? One measurement is off from the other by 1000x!!!

Well, the time command takes into account more than just the logic in the program. Process creation/destruction can be expensive.

At the same time, the time command tends to lose a lot of precision when outputting to users.

`clock_t`

This one is a cool one. Why?

It measures the processor time your program has used rather than reading the wall-clock time from the RTC.

So what exactly is clock_t?

It can be an int or a float or other type depending on your libc implementation as long as it is capable of keeping track of clock ticks.

Clock ticks are an arbitrary measurement of time determined by your hardware clock/timer. It is much more granular than milliseconds/seconds.

In fact, there is a CLOCKS_PER_SEC macro in C that tells you how many clock ticks make up one second. This is sometimes set to values like 1,000,000.

This would mean you would get microsecond precision. If it was a larger value, you could get even more precision.
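The conversion itself is just a division by the CLOCKS_PER_SEC macro from <ctime>. Here is a tiny sketch, where `ticks_to_seconds` is a made-up helper name.

```cpp
#include <ctime>

// Hypothetical helper: convert a tick difference from clock() into seconds.
inline double ticks_to_seconds(std::clock_t ticks) {
    return static_cast<double>(ticks) / CLOCKS_PER_SEC;
}
```

With CLOCKS_PER_SEC at 1,000,000, a difference of 1 tick comes out as 1e-06 seconds, which is one microsecond.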

Note that this is not the same as clock cycles. Clock cycles are the CPU's internal processor frequency.

To use clock_t, you can get the current clock value from the clock() function.

#include <time.h>
#include <iostream>
int main(int argc, char** argv){
        clock_t start = clock();
        int sum{0};
        for(int i = 0; i < 20; i++){
                sum += i;
        }
        clock_t end = clock();
        std::cout << (end - start) << " us"<< std::endl;
        std::cout << "clocks per sec: " << CLOCKS_PER_SEC << std::endl;
        return sum;
}

If we run this program and observe the results... tadaaa:

1 us
clocks per sec: 1000000

This for loop takes 1 microsecond according to our measurement. Similar to getrusage!

Is this the smallest granularity we can go???

Well, I feel like we can do better because even if a cycle were to take 1 nanosecond, that means our code would've taken 1000 cycles.

1000 cycles to loop through 20 items is INSANE!!

But with only microsecond precision, we are definitely overshooting with a huge error from the actual program runtime.

Okay, let's try something else because ticks aren't frequent enough.

In C++, there is a chrono library and it seems like it can support nanosecond granularity for its system clock.

`chrono`

chrono has been around since C++11, which means it's been there for a while now lol. It is a great time library. It can even do timezone conversions in C++20!! (ikr! why'd it take so long...)

Using chrono, I can use the high resolution clock to supposedly get nanosecond level precision. Here we go:

#include <iostream>
#include <chrono>
int main(int argc, char** argv){
        auto start = std::chrono::high_resolution_clock::now();
        int sum{0};
        for(int i = 0; i < 20; i++){
                sum += i;
        }
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<long long, std::nano> duration = end - start;
        std::cout << "Time taken: " << duration.count() << " nanoseconds" << std::endl;
        return sum;
}

The code is short and simple. So what does it say about our loop?
Our program outputted:

Time taken: 254 nanoseconds

Much faster than 1 microsecond. In fact 4x better. And 4000x faster than time. CRAZY!

So now is it really 254 nanoseconds?? If we assumed a cycle took a nanosecond, this says that we went through 254 cycles for a 20 item for loop???

Let's see what this clock is really doing and maybe we can find out why...

After looking at the GNU/linux C++ source code (libstdc++), I found the implementation of now().

It seems like it is an alias to either a system clock or steady clock on different versions, but here is the system clock implementation.

The system clock means a real time clock that users use to perceive time. It can be affected by the calendar and can jump forwards/backwards in time.

The steady clock is a monotonic clock kind of like what clock() used. It only increases at a certain frequency.
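Since the steady clock never jumps backwards, it is usually the safer pick for timing an interval. Here is a minimal sketch of a timing wrapper built on it, where `measure_ns` is a made-up helper name.

```cpp
#include <chrono>

// steady_clock is monotonic by definition; system_clock makes no such promise.
static_assert(std::chrono::steady_clock::is_steady);

// Hypothetical helper: elapsed nanoseconds of a callable, using the monotonic clock.
template <typename F>
long long measure_ns(F&& work) {
    auto start = std::chrono::steady_clock::now();
    work();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(end - start).count();
}
```

You would call it like measure_ns([&]{ /* code to time */ }). The result still includes the cost of the two now() calls.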

It seems like my system clock gives nanosecond precision.

When we look at the source code, we see the following system call:

#ifdef _GLIBCXX_USE_CLOCK_GETTIME_SYSCALL
      syscall(SYS_clock_gettime, CLOCK_REALTIME, &tp);
#else
      clock_gettime(CLOCK_REALTIME, &tp);

This means the kernel has to do some work and get us the time. tp is a timespec struct and it has nanosecond precision (defined in <time.h>).

struct timespec {
    time_t tv_sec;  // seconds
    long   tv_nsec; // nanoseconds
};

The kernel goes to the system's real time clock with CLOCK_REALTIME and fills in this struct. This process of switching from user space code to kernel space and executing kernel instructions takes time!

And in fact, it is affecting us by a lot of nanoseconds.

Not only that, we don't really know the frequency/precision of our system's real time clock with respect to nanoseconds. Does it update every couple nanoseconds? Or every nanosecond?

Lots of questions regarding the real time clock.

There are a lot of factors relating to the hardware and protocol that can impact the frequency, so it is hard to tell.

Clearly, the clock_gettime() call used by chrono's high_resolution_clock and system_clock isn't precise enough for this tiny program.
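One rough way to see how much of the measurement is just the clock itself is to read it twice in a row and keep the smallest gap. This sketch assumes Linux or another POSIX system and swaps in CLOCK_MONOTONIC instead of CLOCK_REALTIME so the gap can't go negative if the wall clock gets adjusted. Both helper names are made up.

```cpp
#include <time.h>
#include <climits>

// Hypothetical helper: read CLOCK_MONOTONIC as a single nanosecond count.
inline long long now_ns() {
    timespec tp{};
    clock_gettime(CLOCK_MONOTONIC, &tp);
    return static_cast<long long>(tp.tv_sec) * 1000000000LL + tp.tv_nsec;
}

// Smallest gap seen between two back-to-back readings: a rough floor on
// what this clock can resolve, including the cost of the call itself.
inline long long min_back_to_back_ns(int attempts = 1000) {
    long long best = LLONG_MAX;
    for (int i = 0; i < attempts; i++) {
        long long a = now_ns();
        long long b = now_ns();
        if (b - a < best) {
            best = b - a;
        }
    }
    return best;
}
```

Whatever number comes back is noise that sits inside every measurement taken with this clock, so timing anything shorter than that is mostly timing the clock.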

SO CAN WE GET MORE PRECISE???????
CAN WE????????

I can't hear you!!!

Ohhhhhhhhhh......

Just kidding. This isn't Spongebob. I'll stop with the cliff hangers.

`TSC`

Introducing the TSC, the Time Stamp Counter.

Silly enough, the answer has been in front of you all along since the beginning of this blog.

We can use the clock cycles to measure how long our program took. TSC is a monotonic counter that increments for each clock cycle (on most modern x86 CPUs it ticks at a constant rate regardless of the core's current frequency).

How do we get access to the TSC?

Specifically, TSC is a 64-bit register on x86 processors. So... I'm sorry if you are on a different processor (like ARM). You probably have something else.

But many of the world's standard machines run x86, so this is very relevant.

Okay, how do we get the 64 bits of data from the x86 register?

Luckily for us there is an assembly instruction that copies the TSC value to two 4 byte registers that we can access.

The instruction is rdtsc.

Also lucky for us, we can embed assembly into our C++. If we do this and then move the register value into our variable, we can use the value.

Here is what that looks like:

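#include <cstdint>
#include <iostream>

// rdtsc puts the low 32 bits in EAX and the high 32 bits in EDX.
// ("=A" only means the EDX:EAX pair on 32-bit x86, so on x86-64 we read both halves.)
static inline uint64_t read_tsc(){
        uint32_t lo, hi;
        __asm__ volatile("rdtsc" : "=a" (lo), "=d" (hi));
        return (static_cast<uint64_t>(hi) << 32) | lo;
}

int main(int argc, char** argv){
        uint64_t start = read_tsc();
        int sum{0};
        for(int i = 0; i < 20; i++){
                sum += i;
        }
        uint64_t end = read_tsc();
        std::cout << "Result: " << (end-start) << " cycles" << std::endl;
        return sum;
}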

And now guess what you think the output is in cycles... How many cycles does this simple for loop of additions take?

And the answer issss

Result: 300 cycles

Alright alright, this isn't as useful unless we know the frequency of our processor cycles.

If I run lscpu in my terminal, it tells me a lot of cool things about my cpu/processor...

In fact it shows me this useful fact:

CPU MHz:                              2495.998

2496 Megahertz! That's roughly 2.5 GHz!!!

If I do 1 / (2.5 * 10^9) to get the number of seconds per cycle, I get roughly 0.4 nanoseconds per cycle!

If I multiply 0.4 nanoseconds per cycle with 300 cycles, I get 120 nanoseconds!!!
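The same arithmetic as a tiny helper, in case you want to plug in your own numbers. `cycles_to_ns` is a made-up name, and it assumes the counter ticks at the frequency you pass in, which may not match what lscpu reports on every machine.

```cpp
// Hypothetical helper: convert a cycle count to nanoseconds for a frequency in MHz.
constexpr double cycles_to_ns(double cycles, double mhz) {
    // mhz million cycles per second means one cycle lasts (1000 / mhz) nanoseconds.
    return cycles * 1000.0 / mhz;
}
```

For 300 cycles at 2500 MHz this gives 120 nanoseconds, matching the rough math above.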

We just halved our time from the previous method of using clock_gettime() with the system clock!

We were a lot more precise as we weren't accounting for any system calls and kernel instructions for the most part.

Wow! We have nanosecond level precision.

That's not all. What if we ran our program with O3 optimization on our compiler. Could we get a lower number?

In fact we can....... and get the following results......

Result: 35 cycles

BOOOOOOOOM!!! Faster than the FLASH.... (jk, i've never seen the Flash)

Okay calm down buddy. Turns out our program was so simple the compiler precomputed the sum (I think).

Let's see the assembly just to make sure.

#APP
# 8 "test.cpp" 1
        rdtsc
# 0 "" 2
#NO_APP
        movq    %rax, %rbx
#APP
# 16 "test.cpp" 1
        rdtsc
# 0 "" 2
#NO_APP
        movl    $8, %edx
        movq    %rax, %rbp
... more assembly ...
_ZSt4endlIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_@PLT
        addq    $8, %rsp
        .cfi_def_cfa_offset 24
        movl    $190, %eax

Okay, we see the two rdtsc commands. BUT THERE IS ONLY 1 INSTRUCTION BETWEEN!

And the compiler precomputed the sum and stores 190 into %eax and returns that.

LOL so that's why our -O3 optimization looks faster. Things got reordered and precomputed.

So our algorithm/for loop in fact no longer exists.

Beautiful.

As my boss says at work...
"The best kind of code is no code"

The Conclusion

time is gross and too slow
getrusage is somewhat better but not as precise but it can show you user vs system time
chrono's system_clock is better but has some imprecision. It has nanoseconds!
rdtsc is the GOAT! And using it is simple. Clock cycles are the finest level of precision, even better than nanoseconds.

We started at 1 millisecond and went to 100 nanos. That is a factor of around 10,000x for our precision! (10^-3 to 10^-7)

I hope you learned how to be insanely precise...

Now I must leave... See you next time
(or many clock cycles from now)

Peace <3

absterdabster

Summarizing "What Every Computer Scientist Should Know About Floating Point Arithmetic"

absterdabster — Mon, 27 Jan 2025 03:14:58 +0000

Hi again! Have you ever used a floating point number in your code? They appear in the forms of float or double usually, but essentially it's a type of data to represent real numbers (like 0.1 or 3.14159265359 or 123456789 * 10^(-23)). While it can represent decimals, it can also do whole numbers like 1 or 12345678.

Regardless of which one you used, there is a chance your code might be in trouble. When you use a number (like 1.5), your computer might not actually be using that number but instead something really close.

Now multiply your wrong number a few times, add it with a few other wrong numbers, and soon your math is chaos! Your computer isn't actually listening to you.

How do we reduce errors with floating point operations?

Today I'll be summarizing "What Every Computer Scientist Should Know About Floating Point Arithmetic" by David Goldberg in 1991. Give it a read if you dare.... hehe

Ok that blog was very long.... so I'mma just cover some of the basic points of what I read. Drop a comment if you want me to try and explain a section from the blog that I didn't cover (I'll try lol).

Get ready to get smarter :)

Representing Floating Points

Floating points are represented with a numerical base $β$ (like decimal or binary or hexadecimal), an exponent range $e_{min}$ and $e_{max}$ , and a precision p (the number of digits). They are represented in scientific notation with a single non-zero digit before the decimal point.

For example, it would look something like this.

d_{0} . d_{1} d_{2} ... d_{p - 1} * β^{e}

The digits of the significand $d_{i}$ are all in the range $0 \leq d_{i} < β$ where there are p digits ( $0 \leq i < p$ ).

So if $β = 2$ and p = 3 and I wanted to represent 0.77 it would be something like this.

1.10 * 2^{- 1}

This would be equal to 0.110 in binary, which is $1/2 + 1/4 = 0.75$ . This is the closest we could get to 0.77 with 3 digit precision p in binary. As you can see, floating point representations lie: 0.75 is not 0.77, butttt it is close enough.

If $β = 10$ and p = 3 then 2734 would be represented as

2.73 * 10^{3}

If we keep saying floating points are "close enough" for our numbers and we start doing operations on them, eventually our representations will be far from the actual number.

Okay, let's measure how far we are from the actual number. AKA, what is the error?

Understanding the error

There are two types of floating point errors or rounding errors that are commonly measured. ulps (units in the last place) and relative error.

ulps

The units in the last place is the total error the last digit is off by compared to the actual number. To be exact, it can be calculated by this complicated formula where z is the actual number we are comparing to.

ulps = ∣ d . dd ... d - z / β^{e} ∣ β^{p - 1}

If this looks confusing, let's just do an example and it'll be a lot easier. Let's say we have a number 314.09 and our z = 314.1592653589.

314.09 = 3.1409 * 10^{2}

From this we know, $β = 10$ , p = 5, and e = 2.

ulps = ∣3.1409 - 314.1592653589 / 10^{2}∣ 10^{4}

= ∣3.1409 - 3.141592653589∣ 10^{4}

= ∣ - 0.000692653589∣ 10^{4}

= 6.92653589

This has roughly ulps = 6.927.

Relative error

This error puts the absolute error in proportion to the magnitude of the real number.

Super simple. First take the absolute error (the difference between the actual and the representation).

absolute error = ∣ (d . dd .. d * β^{e}) - z ∣

Now divide that by the real number to take it into proportion.

relative error = absolute error / z

relative error = ∣ (d . dd .. d * β^{e}) - z ∣/ z

The idea is that the same absolute error matters less for a larger number. If I wanted to buy a million dollar home (if I'm ever rich enough), I wouldn't really mind a difference of $10.
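To make the two error measures concrete, here is a small sketch that follows the formulas above. The function names ulps_error and relative_error are made up for this example, and the caller has to pass the significand d and the exponent e separately (so 314.09 goes in as d = 3.1409, e = 2).

```cpp
#include <cassert>
#include <cmath>

// ulps = | d.dd...d - z / beta^e | * beta^(p - 1)
// d is the stored significand, z the real number, p the number of digits.
double ulps_error(double d, double z, double beta, int e, int p) {
    assert(beta > 1.0 && p >= 1);
    return std::fabs(d - z / std::pow(beta, e)) * std::pow(beta, p - 1);
}

// relative error = | approx - z | / | z |
double relative_error(double approx, double z) {
    assert(z != 0.0);
    return std::fabs(approx - z) / std::fabs(z);
}
```

For the example above, ulps_error(3.1409, 314.1592653589, 10, 2, 5) comes out at roughly 6.93, and relative_error(314.09, 314.1592653589) is roughly 0.00022.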

Converting 0.5 ulps to relative error

Let's say I have a number represented as $d . dd ... dd * β^{e}$ . If this number had 0.5 ulps, the error could be bounded by $0.00...00 β^{'}$ where $β^{'} = β /2$ .

Convincing you about the 0.5 ulps absolute error

Let me convince you. Let's say we're in base 10 ( $β = 10$ ) and we had p=3 with the number 9.97. If the actual number was properly represented via rounding, the actual number would have been between >= 9.965 and < 9.975.

The limits of the actual number are shown to be bounded by the error 0.005. This error is also 0.5 ulps. It is also the same as $0.00...00 β^{'}$ because $β^{'} = β /2 = 5$ .

This might seem obvious in base 10, but let's try something in base 2 ( $β = 2$ ). Let's say we had 1.11 (1.75 in decimal) when $β = 2$ and p = 3. The real number before rounding could have been anywhere from 1.101 up to (but not including) 1.111 in binary. Either end of that range is 0.001 away from 1.11, which is 0.1 ulps written in binary, or 0.5 ulps in decimal. That is the same as $0.00...00 β^{'}$ because $β^{'} = β /2 = 1$ .

Therefore, if a number was rounded properly to its proper representation, it would have an error of < 0.5 ulps.

In other words, 0.5 ulps is always equal to $((β /2) β^{- p}) * β^{e}$ . This would be the absolute error of 0.5 ulps.

Onto the relative error

Now, what is 0.5 ulps in relative error? Remember that relative error was absolute error divided by the actual number.

We just said that the absolute error when we have 0.5 ulps is $((β /2) β^{- p}) * β^{e}$ . But the actual number could be anything.

Specifically, it could be in the range $1 * β^{e}$ and $β * β^{e}$ . So, if 0.5 ulps was 0.001 in binary ( $1 * 2^{- 3} * 2^{0}$ ), then p=3 and e=0. In that case the real number must have been between $1 * 2^{0}$ and $2 * 2^{0}$ which would be 1 and 2 in decimal or 1 and 10 in binary.

From our previous example, we can see that is true. 1.11 (1.75 decimal) was in fact between 1 and 10 in binary.

Cool so we set some bounds on what the actual number could have been, namely: $1 * β^{e}$ and $β * β^{e}$ .

This means we can set bounds for the relative error for 0.5 ulps. So let's divide the absolute error with the bounds of the real number.

Upper bound of relative error:

\frac{(( β /2 ) β ^{- p} ) * β ^{e}}{( 1 * β ^{e} )}

= (\frac{β}{2}) β^{- p}

Lower bound of relative error:

\frac{(( β /2 ) β ^{- p} ) * β ^{e}}{( β * β ^{e} )}

= (\frac{1}{2}) β^{- p}

Therefore,

(\frac{1}{2}) β^{- p} < 0.5 ulps \leq (\frac{β}{2}) β^{- p}

Machine epsilon

The upper bound relative error for 0.5 ulps is called machine epsilon. This is the largest relative error possible from rounding for a given base and precision.

ϵ = (\frac{β}{2}) β^{- p}

Larger precision p, as expected, implies smaller relative error/machine epsilon. We also notice that the relative error of 0.5 ulps is bounded by machine epsilon and $(\frac{1}{2}) β^{- p}$ . These bounds differ by a factor of $β$ which we call wobble.

Yeahhh... wobble baby wobble baby...

BTW, machine epsilon is such a cool name. I'm not judging if you name your dog or child machine epsilon.
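Here is a rough sketch of that formula for the built-in C++ types. goldberg_epsilon is a made-up name. One thing to watch: std::numeric_limits<T>::epsilon() is a different quantity (the gap between 1 and the next representable number), which for $β = 2$ is twice the value computed here.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Machine epsilon as defined above: (beta / 2) * beta^(-p).
template <typename T>
double goldberg_epsilon() {
    const int beta = std::numeric_limits<T>::radix;  // the base
    const int p = std::numeric_limits<T>::digits;    // digits in the significand
    assert(beta >= 2 && p >= 1);
    return (beta / 2.0) * std::pow(static_cast<double>(beta), -p);
}
```

On a typical machine where float has p = 24 and double has p = 53, this gives 2^-24 and 2^-53.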

Relative errors with machine epsilon

Remember machine epsilon was the upper bound for rounding errors or 0.5 ulps. So if we actually got a relative error much lower, we can represent the relative error as a ratio of the machine epsilon like this:

rel. error = k * ϵ

Let's do an example. If I had the number 3.14159 to represent with $β = 10$ and p = 3, I would have to round to 3.14. This would have an absolute error of .00159 or 0.159 ulps. For relative error, I do $0.00159/3.14159$ which leads me to a relative error of 0.0005.

Now, to find the ratio, we must find the machine epsilon:

ϵ = (\frac{β}{2}) β^{- p}

= 5 * 10^{- 3} = 0.005

So... the ratio is:

k = rel. error / ϵ = 0.0005/0.005 = 0.1

So we say that the relative error is $0.1 ϵ$ .

The Wobble

Get ready for things to wobble. First let me show you how ulps and relative error react to each other.

Using 1.0 to represent 1.04 in decimal (so p = 2) has an error of 0.4 ulps and a relative error of 0.038. The machine epsilon is 0.05 which makes the relative error $0.76 ϵ$ .

Great! Hopefully this made sense so far.

Now, let's multiply our number by let's say 8. The actual number would be 8.32 while the calculated number would be 8.0. This has 3.2 ulps which is 8 times larger than before! However, our relative error is still $0.32/8.32 = 0.038$ which is the same as $0.76 ϵ$ .

Whoa! Our ulps increased, but our relative error was the same?

Yep. It turns out whenever you have a fixed relative error, your ulps can wobble by $β$ .

On the other hand, whenever we have a fixed ulps (like we showed earlier with 0.5 ulps), the relative error had bounds which showed it can also wobble by $β$ .

So, the smaller the $β$ , the smaller the wobble and the tighter the error bounds! Using binary can significantly reduce our error.

Contaminated digits

We now know that ulps and relative error's ratio k vary from each other by a factor of $β$ , the wobble. As a result, we can estimate the number of contaminated digits (the number of incorrect digits from the correct representation of the number).

contaminated digits \approx log_{β} n

n is the number of ulps. n can also mean k, the ratio between the relative error and $ϵ$ . It can mean either because of the wobble factor.

So if I had a number in decimal, 3.10 with p=3 and it was trying to represent 3.1415, it would have an error of 4.15 ulps. The contaminated digits would be roughly $log_{10} 4.15$ which is roughly 0.61804809 digits.

LOL we can't have partial digits! We'll see that when pigs fly.

Visually looking, we can see that it is wrong in 1 digit, the last one, which is pretty close to what we got from our calculation.
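The estimate is a one-liner if you want to play with it. This is just a sketch of the formula above, and contaminated_digits is a name I made up.

```cpp
#include <cassert>
#include <cmath>

// contaminated digits ~ log base beta of n, where n is the error in ulps
// (or k, the ratio of the relative error to machine epsilon).
double contaminated_digits(double n, double beta) {
    assert(n > 0.0 && beta > 1.0);
    return std::log(n) / std::log(beta);  // change of base
}
```

contaminated_digits(4.15, 10) gives roughly 0.618, the number from the example above.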

Guard digits

Let's subtract 2 values when $β = 10$ and p=3.

x = 1.01 * 10^{0}

y = 9.93 * 10^{- 1}

x - y = 1.01 - 0.99 = 0.02

It becomes 0.99 and not 0.993 because we had to lose some data with p=3 so that they could be subtracted from each other at the same $β^{e}$ .

As you know, the actual answer is 0.017, but the answer ended up being 0.02. So $2.00 * 10^{- 2}$ and $1.70 * 10^{- 2}$ have an error of 30 ulps!

The relative error from this kind of subtraction is bounded by $β - 1$ . Let me show you why.

If $x = 1.00...00, y = ρ . ρρ ... ρρ * β^{- 1}$ where $ρ = β - 1$ . (ex. $β = 10, ρ = 9$ ). (x and y have p digits.)

If I subtracted them, I should get the actual answer of $1 * β^{- p}$ , but because we shift y to the right and lose a digit, we end up getting $1 * β^{- p + 1}$ .

absolute err. = ∣ β^{- p} - β^{- p + 1} ∣ = ∣ β^{- p} (1 - β) ∣

relative err. = abs. error / z

= \frac{∣ β ^{- p} ( 1 - β ) ∣}{β ^{- p}} = β - 1

If our $β = 2$ that would make our relative error 1. In terms of $ϵ$ , it would mean $1 = k * ϵ$ and so $k = 1/ ϵ$ . The contaminated digits would be $log_{2} 1/ ϵ = log_{2} 2^{p} = p$ . If p digits are contaminated, all of them are contaminated.

Was there a way we could have avoided some of this error? Okay, fine, there iss.... otherwise I wouldn't have written about this section...

Just add a temporary extra digit. And let's call it a ... wait for it ... a guard digit. surprise surprise

x = 1.01 * 10^{0}

y = 9.93 * 10^{- 1}

x - y = 1.010 - 0.993 = 0.017

Now we have no error. That's quite a bit better than before.

Turns out the guard digit bounds the relative error to $2 ϵ$ . I'm lazy but if someone in the comments asks, I'll figure out why and try to explain it (but it's in the linked blog).
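If you want to see the guard digit in action, here is a toy sketch of the subtraction above using integers. The Dec struct and subtract function are made up for this example. It truncates the digits that don't fit (like the 0.993 to 0.99 step above) and it does not round the result back to p digits.

```cpp
#include <cassert>

// A toy decimal float: value = sig * 10^exp, where sig holds p = 3 digits.
struct Dec {
    long sig;
    int exp;
};

// Compute x - y, assuming x.exp >= y.exp.
// guard_digits extra low-order digits are kept while lining up the exponents.
Dec subtract(Dec x, Dec y, int guard_digits) {
    assert(x.exp >= y.exp && guard_digits >= 0);
    int shift = x.exp - y.exp;  // how far y must move right to line up with x
    int kept = shift < guard_digits ? shift : guard_digits;
    int dropped = shift - kept;
    long xs = x.sig;
    for (int i = 0; i < kept; ++i) xs *= 10;     // widen x by the guard digits
    long ys = y.sig;
    for (int i = 0; i < dropped; ++i) ys /= 10;  // truncate what does not fit
    return Dec{xs - ys, x.exp - kept};
}
```

With x = Dec{101, -2} and y = Dec{993, -3}, subtract(x, y, 0) gives 2 * 10^-2 (the wrong 0.02), while subtract(x, y, 1) gives 17 * 10^-3 (the correct 0.017).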

Benign and Catastrophic Cancellation

When we try to subtract two really close numbers, many of the digits cancel out and become 0. We call this cancellation. Sometimes cancellation can be catastrophic or benign.

When we do subtraction, the operands often carry errors in their far right digits (the least significant digits) from rounding or from prior operations. The more accurate digits are at the front (the most significant digits). When the accurate leading digits cancel out, the result is made up mostly of the less accurate trailing digits, so it ends up even more inaccurate. (Like when you calculate the discriminant $b^{2} - 4 a c$ ).

The catastrophic cancellation just exposes the rounding errors from prior operations.

Benign cancellation happens when you subtract numbers that have no rounding errors.
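A tiny sketch of the catastrophic case, assuming float is the usual 32-bit IEEE single precision type. The function name lost_one is made up. The subtraction itself is exact. The damage was already done by the rounding in the addition before it.

```cpp
#include <cassert>

// Computes (big + 1) - big in single precision.
// Mathematically this is 1, but for a large enough `big` the +1 is
// rounded away before the subtraction ever happens.
float lost_one(float big) {
    assert(big >= 0.0f);
    volatile float bumped = big + 1.0f;  // volatile forces rounding to float here
    volatile float diff = bumped - big;  // leading digits cancel, only the error is left
    return diff;
}
```

lost_one(4.0f) returns 1, but lost_one(1.0e8f) returns 0 because neighboring floats near 100,000,000 are 8 apart.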

IEEE Standard

So the IEEE standard is a set of rules that many systems follow to ensure consistency. There are two IEEE standards that are followed: IEEE 754 and IEEE 854. They both support smaller and larger floating points called single precision and double precision.

IEEE 754

The standard requires $β = 2$ . It has single precision (p=24) and double precision (p=53). It also discusses how the bits should be laid out.

In fact, here is a cool table that shows how IEEE 754 sets all its floating point parameters.

The sign of the number gets its own bit (a sign/magnitude split), and the remaining bits are shared between the exponent and the significand. Signed exponents could be stored as sign/magnitude or two's complement, but IEEE 754 uses neither: it stores a biased exponent, which is the real exponent plus a fixed offset (127 for single precision, 1023 for double).

In fact, here is exactly how the bits are laid out to represent different kinds of values. To represent 0, you have to use the exponent $e_{min} - 1$ with the fraction section zeroed out. Infinity is $e_{max} + 1$ with the fraction section zeroed out. NaN is another type (like when 0 is divided by 0, or infinity is subtracted from infinity). NaN is represented the same as infinity but with a non-zero fraction section.

The fraction section is the digits of the significand after the first digit.
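You can poke at this layout yourself. Here is a minimal sketch that splits a float into its fields, assuming float is a 32-bit IEEE 754 single (the static_assert checks that). FloatFields and decode are names I made up. The exponent field here is the raw stored value, which carries a bias of 127.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
              "this sketch assumes float is IEEE 754 single precision");

struct FloatFields {
    std::uint32_t sign;      // 1 bit
    std::uint32_t exponent;  // 8 bits, stored with a bias of 127
    std::uint32_t fraction;  // 23 bits after the implicit leading digit
};

FloatFields decode(float value) {
    std::uint32_t bits = 0;
    assert(sizeof bits == sizeof value);
    std::memcpy(&bits, &value, sizeof bits);  // copy the raw bits out of the float
    return FloatFields{bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}
```

decode(1.0f) has exponent field 127 (real exponent 0) and fraction 0. Zero has exponent field 0, infinity has 255 with a zero fraction, and NaN has 255 with a non-zero fraction.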

IEEE 854

On the other hand, this standard allows $β = 2$ or $β = 10$ . However, there are no rules about how the bits should be laid out for double and single precision.

It allows base 10 because it is also the standard way humans count. Base 2 is also included because of the low wobble.

The Conclusion

Okay, I kinda rushed the last section. But overall, I wanted to say floating points can have a lot of imprecision. If you can avoid them, use integers instead.

In the case that you do use them, try to limit your wobbles and avoid catastrophic cancellations (there are ways you could do it sometimes by rearranging formulas).

Try to read the original blog yourself. This blog you are reading right now is a summary of a fraction of the original blog by David Goldberg.

But as always, drop your questions and comments. And I'm out for now...

Peace
-absterdabster

Trying to predict the performance of file reads/writes

absterdabster — Sun, 05 Jan 2025 01:38:31 +0000

Hi! Let's say you want to read or write to a text file. Maybe you are trying to persist application data, read file input or write output to a file. Will it be fast or slow?

Could we estimate how long it could take?

If you don't want to read, just jump to the conclusion at the bottom.

Long story short, I write this article in frustration because after trial and error, I realized performance can vary a lot from system to system. A couple microseconds on one system could mean a couple milliseconds on another system! 100-1,000,000x slower! smh smh...

If things start to get slow, you could use background threads or start reading/writing in batches of data. But we'll get into that later.

Let's figure out if we even have to worry about things getting too slow.
(Note: I'm a newbie at this stuff, so please correct me if you need to)

Okay. Let's start with how reading from a file works.

How does reading/writing from a file work?

The high level idea begins with your programming language. Pick your favorite programming language (that has file io). There is probably a read/write method/function in there.

But everything boils down to system calls. System calls are the interface used for hardware interactions by programs/users through the guidance/safety of your operating system. (So you don't corrupt your systems accidentally lol)

For reading, it's ssize_t read(int fd, void *buf, size_t count).
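Before we go hunting for it inside language runtimes, here is a minimal sketch of calling it directly on Linux/POSIX. read_prefix is a made-up helper, and the error handling is kept deliberately tiny (it just returns an empty string on failure).

```cpp
#include <cassert>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Read up to `count` bytes from the start of `path` using the raw
// open/read/close calls. Returns an empty string if anything fails.
std::string read_prefix(const char *path, size_t count) {
    assert(path != nullptr);
    int fd = open(path, O_RDONLY);
    if (fd < 0) return {};
    std::string buf(count, '\0');
    ssize_t n = read(fd, &buf[0], count);  // may return fewer bytes than asked for
    close(fd);
    buf.resize(n > 0 ? static_cast<size_t>(n) : 0);
    return buf;
}
```

read_prefix("filename.txt", 1) does roughly what the Python file.read(1) example below does, minus all the layers in between.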

Python

Let's look at an example of file reading in Python:

with open('filename.txt', 'r') as file:
        # Read the first char
        first_char = file.read(1)

Python is an interpreted language, meaning an interpreter is required to execute the Python logic. I dug a little into CPython, the original Python interpreter codebase. (Turns out CPython compiles Python into bytecode which is later interpreted by the Python Virtual Machine (PVM).) Any C extensions are compiled to machine code ahead of time and loaded at runtime.

I found that under the hood of the file io logic, we had the sneaky system call used by both Windows and Linux:

#ifdef MS_WINDOWS
        _doserrno = 0;
        n = read(fd, buf, (int)count);
        // read() on a non-blocking empty pipe fails with EINVAL, which is
        // mapped from the Windows error code ERROR_NO_DATA.
        if (n < 0 && errno == EINVAL) {
            if (_doserrno == ERROR_NO_DATA) {
                errno = EAGAIN;
            }
        }
#else
        n = read(fd, buf, count);
#endif

Java

In Java, there are a lot of ways to read files. For example, you could use a FileInputStream.

        try (FileInputStream fileInputStream = new FileInputStream(filePath)) {
            int byteData;
            while ((byteData = fileInputStream.read()) != -1) {
                System.out.print((char) byteData);  
            }

        } catch (IOException e) {
            e.printStackTrace();
        }

Now as you may know, Java is a compiled language. The Java compiler (javac) turns your source into bytecode stored in class files. When ready to execute, the Java Virtual Machine reads the bytecode and interprets it or compiles it just-in-time into machine code. Like Python's C extensions, native code called through the Java Native Interface (JNI) is compiled to machine code ahead of time and loaded at runtime.

If you dig deep into the Java Development Kit codebase, you can see the JNI implementations of FileInputStream which has the read syscall hidden in its read logic:

ssize_t
handleRead(FD fd, void *buf, jint len)
{
    ssize_t result;
    RESTARTABLE(read(fd, buf, len), result);
    return result;
}

C++

In C/C++, you can directly use the read syscall. But in the case you don't, standard library constructs like std::ifstream also use read under the hood.

I wasn't able to find read in the implementation for std::ifstream, but I suspect you will have to look inside the bits directory of the gcc implementation. (Let me know if you find it! Do it as homework hehe.)

So why am I showing you all this? I suggest you try finding some of these implementations in the interpreters/compilers yourself lol.

If you do, you will probably notice that the read and write syscalls are hidden under a lot of other clutter and logic.

In this blog, I'll discuss the performance of read and write syscalls rather than the programming language higher level functions. We can avoid the overhead of the language if there is any.

Other ways to write

Okayyy so I lied. write isn't the only way to write to a file. Turns out you can also use fprintf, fflush, and fsync. (I've seen a SQL implementation use this.)

So what's the difference?

fprintf, fflush, and fsync split writing into 3 steps respectively:

Write your data into a buffer/cache inside your program
Flush the buffer to your OS's cache
Transfer from the OS's cache to your disk driver to write to the disk (This could involve writing the entire disk cache.)

fsync blocks until your disk signals it is done transferring/writing.
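The three steps can be sketched like this on Linux/POSIX. durable_append is a made-up helper name, and this is only a rough illustration of the sequence, not how any particular database does it.

```cpp
#include <cassert>
#include <stdio.h>
#include <unistd.h>

// Append one line to `path` and push it all the way to the disk.
// Returns true only if every step succeeds.
bool durable_append(const char *path, const char *line) {
    assert(path != nullptr && line != nullptr);
    FILE *fp = fopen(path, "a");
    if (fp == nullptr) return false;
    bool ok = fprintf(fp, "%s\n", line) >= 0;  // step 1: into the stdio buffer
    ok = ok && fflush(fp) == 0;                // step 2: stdio buffer -> OS cache
    ok = ok && fsync(fileno(fp)) == 0;         // step 3: OS cache -> disk, blocks until done
    ok = (fclose(fp) == 0) && ok;
    return ok;
}
```

If you skip step 3, the call returns sooner, but the data may still be sitting in the OS cache.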

This could be useful if you have a lot of modifications you want to make, but you don't want to save them to disk yet. (Maybe you want to make your batch modifications into a giant transaction.)

The issue is now you have to save the entire driver cache which could be like 64MB or 128MB! Here is a nice blog with more info.

However, if we use write, we can limit our writes to just the data we are sending. This would make the write faster than our 3 step process to fsync.

If you use the 3 step process, just keep in mind how much data you are writing, aka your disk driver's cache size.

You can find your disk's cache size by looking at your disk specification.

What kind of disk do I have?

So if you don't know what disk you have (like me), let's figure this out.

If you type lsblk in your linux terminal, you might see something that looks like this or similar to this:

NAME        MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINTS
sda           8:0    0   1.8T  0 disk
├─sda1        8:1    0  1000M  0 part  /boot/efi
├─sda2        8:2    0   600M  0 part  /boot
└─sda3        8:3    0   1.8T  0 part  /
zram0       252:0    0     8G  0 disk  [SWAP]
nvme1n1     259:0    0   1.8T  0 disk
├─nvme1n1p1 259:1    0  1000M  0 part
│ └─md125     9:125  0 999.9M  0 raid1
├─nvme1n1p2 259:2    0   600M  0 part
│ └─md127     9:127  0   599M  0 raid1
└─nvme1n1p3 259:3    0   1.8T  0 part
  └─md126     9:126  0   1.8T  0 raid1
...

sda is a disk device. There may be a sdb or sdc and so on. If the type of any of these disks says raid, these disks are probably part of some kind of hardware or software RAID configuration.

Disks that are part of a RAID 1 configuration (like the raid1 entries above) are basically copying each other. If you write to disk, it'll be written to all of them. It's a way to back up your files.

But remember that if you have a RAID configuration, each disk may have different specifications. Your writes and reads are going to be as slow as the slowest ones because the RAID controller would be writing to both of them.

Overhead from a RAID controller is usually not a bottleneck, but performance can slightly differ between hardware/software controllers because of using separate hardware vs the busy CPU respectively.

Each disk may have a different mountpoint. If that is the case, you only care about the disk(s) that have the file you intend to read/write from in its mountpoint. You can see this in the MOUNTPOINTS column.

Ok final thing to note from the command. The RO column says if the device is read-only, not if it is rotational. To check for a rotational hard drive, ask for the ROTA column with lsblk -o NAME,ROTA (1 means rotational). A rotational hard drive is mechanical, and as a result, HDDs tend to be slower than SSDs as flash memory is faster. The difference is sometimes orders of magnitude in reading/writing, as we'll see later.

Okay... I'll stop stalling. Let's see what disk you have. Just modify the command to lsblk -io NAME,MODEL.

Here is what I get:

NAME        MODEL
sda         PERC H730P Adp
|-sda1
|-sda2
`-sda3
zram0
nvme1n1     Samsung SSD 970 EVO 2TB
|-nvme1n1p1
| `-md125
|-nvme1n1p2
| `-md127
`-nvme1n1p3
  `-md126
...

Now you have to look up that model and find your disk's specifications.

Understanding your disk specs

If I look up PERC H730P Adp, it turns out this is one of DELL's Raid Controllers. Here is a snapshot of some of the specs:

This RAID controller has a huge disk cache of 2GB! And it has a data transfer rate of 12 Gbps. As you can see, it is pretty fast.

If I wanted to load the Bee Movie Script (80,000 characters, ~80KB), it would take about 50 microseconds to transfer!

Note: RAID controllers can sometimes ignore fsync operations. It might not ensure a write to the devices because it has it stored in its cache. At this point, it might lazily store into the disk devices.

Great, now what about the other disk?

Digging deeper

Let's search up the Samsung SSD 970 EVO 2TB.

Here is what we care about. Sequential and Random Access operations/data transfers. Usually they either come in units of IOPS (Input/Output Operations Per Second) or bits/bytes per second.

Sequential, as the name implies, is for sequentially writing, like the Bee Movie Script. If I wanted to modify different parts of a file, this would be random access writing. Generally, sequential is faster since memory is physically located close by.

Here we have Sequential write is 2500 MB/s, but Random write is 480,000 IOPS for a queue depth of 32 (32 writes at the same time). This seems kind of dumb, why are they in two different units?

Also, why are reads faster than writes? How fast is 2500 MB/s???

No need to fear, I'm here to save you.

What are QDs?

QDs are queue depths. Basically when your disk says QD32 or QD1, it refers to having 32 write or read requests or 1 write or read request waiting. This is important because disks could sometimes handle multiple requests at a time. This is why QD32 can be a lot faster than QD1.

If we are writing our Bee Movie Script all at once, we'd be QD1. However, if we use fsync or write multiple times, then we would build a queue of requests.

A nice way to estimate QD1 from QD32 is by taking 10%-20% of its IOPS. If you know a better way, let me know in the comments!

How fast is 2500 MB/s?

You have a Bee Movie Script of 80,000 characters. That is 80KB. 80KB / 2500MB/s is roughly 32 microseconds.

Easy peasy lemon squeezy.

Why are reads faster than writes?

Let's explore how writing/reading disks work at a high level to understand this.

Disks understand memory in regions called sectors. Sectors in HDD originally were 512 bytes. Now, sectors tend to be 4096 bytes as hardware has advanced.

If I ever want to read or write, the minimum you can theoretically read or write at a time from the disk would be a sector size of data. If I want to read or write 1 byte of data, I have to read the entire sector to find that 1 byte. If I am writing, I have to read the entire sector, apply the change, and then write it back in (A 2 step process!)

Okay, I lied a little again. You can't always write a single sector. Our OSes have file systems. File systems operate with blocks rather than sectors. Multiple sectors make up a block. If I want to modify 1 byte, I'd have to actually modify the entire block.

Blocks can range from 1KB - 8KB, but they must be larger than disk sectors.

PS: Blocks are different from OS pages. Pages in OS are like blocks but for accessing physical RAM.

IOPS vs transfer speed (bytes per second)

Great we went over blocks and sectors!

You probably noticed that the random access specs operate in IOPS. If I want to compare it to sequential reads/writes, I'll have to convert it into bytes per second.

I mentioned that disks operate in sectors. Each input/output operation occurs over a sector. We see that a sector size for the Samsung SSD 970 EVO 2TB is 4KB.

So if random writes are 480,000 IOPS, this is 480,000 sectors per second. This is roughly 2,000 MB/s.

Boom! Random writes are slower than sequential writes. (2000 MB/s < 2500 MB/s).

Randomly writing the Bee Movie Script is roughly 40 microseconds.
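Here is that conversion as a small sketch. Both function names are made up, and it leans on the same assumption as the text: every I/O operation moves exactly one sector.

```cpp
#include <cassert>

// Convert an IOPS rating into bytes per second, assuming each operation
// moves exactly one sector.
double iops_to_bytes_per_second(double iops, double sector_bytes) {
    assert(iops >= 0.0 && sector_bytes > 0.0);
    return iops * sector_bytes;
}

// Time to move `bytes` at a steady `bytes_per_second`, in microseconds.
double transfer_time_us(double bytes, double bytes_per_second) {
    assert(bytes_per_second > 0.0);
    return bytes / bytes_per_second * 1e6;
}
```

With 480,000 IOPS and 4096-byte sectors you get about 1.97 GB/s, so transfer_time_us(80000, iops_to_bytes_per_second(480000, 4096)) is around 41 microseconds, while the sequential rate of 2.5e9 bytes per second gives 32.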

Great! We looked at an SSD. Now, so that you can feel my pain, let's look at a HDD.

Comparing an HDD

Let's pretend we have a RAID setup with that Samsung SSD and a HDD disk, for example ST9250610NS. Here are the specs:

It looks a bit different, but remember that HDDs are mechanical. Parts have to physically move and that takes time. We see that a write/read has an average time of 8.5, 9.5 milliseconds respectively.

This average time is for a single sector. A single sector in this disk is 512 bytes according to the specs.

It also mentions a transfer rate of 115 MB/s. Let's test that. If we have 512 bytes/9.5 ms, we get ~50KB/second.

HUHHH??!!?!?!? That doesn't match 115 MB/s!

This average read/write time includes the seek time and rotational latency. This means it includes both the transfer time along with the time it takes for the mechanical parts to move to complete the read/write. (I suspect that sequential writes may be faster, since seek times would be little)

Okay, let's do this again.

If I want to write the Bee Movie Script, 80,000 chars/bytes, it would take about 1.6 seconds if we operated at 50KB/second.
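The same estimate can be done per sector instead of through a rate. This is only a sketch of the pessimistic model used here, where every single sector pays the full average access time. per_sector_time_seconds is a made-up name.

```cpp
#include <cassert>
#include <cstdint>

// Pessimistic estimate: every sector pays the full average access time.
double per_sector_time_seconds(std::uint64_t bytes, std::uint64_t sector_bytes,
                               double access_ms) {
    assert(sector_bytes > 0 && access_ms >= 0.0);
    std::uint64_t sectors = (bytes + sector_bytes - 1) / sector_bytes;  // round up
    return static_cast<double>(sectors) * access_ms / 1000.0;
}
```

per_sector_time_seconds(80000, 512, 9.5) is 157 sectors at 9.5 ms each, so about 1.49 seconds. That lands close to the 1.6 seconds above, which used the rounded 50KB/second rate.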

LOOK AT THAT! We went from 30-40 microseconds to 1.6 seconds from SSD to HDD! That's roughly a 40,000x-50,000x latency increase. FEEL THE PAINNNN AHHHHHHHH!

Remember since we are pretending this is a RAID device, the SSD might complete a write pretty fast, but we would have to wait for the HDD drive to finish before the disk can signal completion.

OH! By the way, this hard drive has a 64MB cache. If you used fsync, your large write may take a long time.

The Conclusion

I hope you felt my pain. jkjk.

But save yourself this pain and predict your read/write latencies.

Find out how many bytes you want to read/write
Find out if you are using write or fsync or read or if there is any overhead
Find out if they are sequential/random
Find out if you have a RAID setup or where the file is mounted on
Find out what kind of disk you have and its specs (IOPS/transfer rates)

In the end, the estimation formula is essentially bytes / rate = latency.

For fun, you could try estimating your own read/write speeds and see if your read/write reflects that.

Caveats

Using a networked file system has its own fun. Maybe I'll come back to this topic another time. There might be more involved than just network latencies. If you know, drop a comment lol.

Okay, I'm done now. Peace!

ML Model for Font Color on Website

absterdabster — Thu, 21 May 2020 02:39:02 +0000

My Final Project

I created an HTML/CSS page that allows for user input of a background color. Based on the background color that you chose, I dynamically render the page. Based on the user input, I also run a trained ML algorithm (deep neural net) using binary classification to determine the best font color that would stand out from the background.

Link to Code

https://github.com/Abinavraj5427/fontColorANN

How I built it (what's the stack? did I run into issues or discover something new along the way?)

I built and constructed my neural network first using python and modules like numpy. I decided to go through the math, so I avoided ML libraries like TensorFlow. After training my model, I decided I should create an interface for users. In order to build this interface, I created a simple HTML page with dynamic CSS through JavaScript. I update real time using AJAX requests via JQuery. Then I pass the data to the server backend, PHP which runs the python script for the model via command line. Then the server responds to the user with their text color.

I never knew that PHP can run python scripts which I found was really cool. This was also one of my first times using WAMP Server to test my full-stack website.

Additional Thoughts / Feelings / Stories

I think this is a great example of how machine learning shouldn't be used. ML is not lightweight, so it cannot be used everywhere. If anything, a mathematical equation for RGB can determine the color, but I did this merely out of excitement and practice. Hope you enjoy it!

Happy Graduation Class of 2020!

