beto-bit

Posted on Sep 23, 2023

How are threads created in Linux x86_64

#linux #assembly #c

Introduction

So I was trying to use atomics in C. Got a little working example.

// multithrd.c
#include <stdio.h>
#include <threads.h>
#include <stdatomic.h>

atomic_int acnt;
int cnt;

int f(void* thr_data) {
    for (int n = 0; n < 10000; ++n) {
        ++cnt;
        ++acnt;
    }

    return 0;
}

int main() {
    thrd_t thr[10];
    for (int n = 0; n < 10; ++n)
        thrd_create(thr + n, f, NULL);

    for (int n = 0; n < 10; ++n)
        thrd_join(thr[n], NULL);

    printf("Atomic counter is: %d\n", acnt);
    printf("Non Atomic counter is: %d\n", cnt);
}

And I got results like this:

$ ./multithrd
Atomic counter is: 100000
Non Atomic counter is: 98860 # Depends on the weather

Great! But how is a thread actually spawned? We don't know much yet, but we do know one thing: threads are created by the OS.

Hello, World! (in assembly)

As you may know, a program can do a lot of things by itself. It can add numbers, pass values between functions, compute digits of pi, etc. But it can't print anything to the screen. That's the kernel's job.
Well, some program can print to the screen; after all, the Linux kernel is itself just a bunch of instructions for the CPU to execute. It just has a lot of privileges. It can turn your USB ports on and off, communicate with your NIC and light up specific pixels on your screen in order to print "Hello, World!".

So how does a normal program, a userland program, print something to the screen? Well, let's figure it out by writing an x86_64 Linux assembly "Hello, World" program!

We're gonna be using NASM because I don't want to deal with GAS.

; main.asm
section .text

global _start
_start:
    mov rsi, msg    ; msg
    mov rdx, msg_l  ; len
    mov rax, 1      ; write syscall
    mov rdi, 1      ; stdout
    syscall

    mov rax, 60     ; exit syscall
    mov rdi, 0      ; exit code = 0
    syscall


section .data
msg:    db "Hi mum", 10
msg_l:  equ $ - msg

And then assemble and link it!

$ nasm -f elf64 main.asm -o main.o    # Assemble it
$ ld main.o -o main                   # Link it
$ ./main                              # Run it!
Hi mum

You can see the syscall instruction there. It basically tells the OS "do this for me, thx". We use it to write to stdout and to exit the program. So whenever a program can't do something by itself, say opening a TCP socket, it asks the OS to do it. That's how threads are created! But how do we know which syscall it uses?
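For comparison, here is a rough C sketch of the same write call, going through the syscall() function from unistd.h instead of printf. The helper name say_hi is made up for this example, and I'm assuming a Linux system with glibc.

```c
#include <sys/syscall.h>  // SYS_write
#include <unistd.h>       // syscall()

// Hypothetical helper (not from the original program): writes "Hi mum\n"
// to the given file descriptor with a raw write syscall.
// Returns the number of bytes written, or -1 on error.
long say_hi(int fd) {
    static const char msg[] = "Hi mum\n";
    // sizeof msg also counts the trailing '\0', so subtract 1.
    return syscall(SYS_write, fd, msg, sizeof msg - 1);
}
```

Calling say_hi(1) from main should show up under strace as the same write(1, "Hi mum\n", 7) line as the assembly version.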

Strace Adventures

The man page for strace says "[...] It intercepts and records the system calls which are called by a process and the signals which are received by a process." So this is the tool we need.

So let's run it against our new program!

$ strace ./main
execve("./main", ["./main"], 0x7ffccab40ae0 /* 67 vars */) = 0
write(1, "Hi mum\n", 7)                 = 7
exit(0)                                 = ?
+++ exited with 0 +++

As you can see, it shows us which syscalls were made. execve is the syscall that actually runs the program! Normally our shell makes that call (here strace does it for us). In fact, the second argument is equivalent to argv in a C main function, and the third to envp.

int main(int argc, char *argv[], char *envp[]);
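To make the argv/envp connection concrete, here is a minimal sketch of what a launcher does: fork a child, then call execve in it. The helper name run_with_args is an assumption of mine, and a real shell does a lot more (PATH lookup, job control, and so on).

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: runs argv[0] in a child process, handing it exactly
// the argv and envp arrays given here. Returns the child's exit status,
// or -1 if something failed along the way.
int run_with_args(char *const argv[], char *const envp[]) {
    pid_t pid = fork();
    if (pid == -1)
        return -1;
    if (pid == 0) {
        // Child: replace this process image with the new program.
        execve(argv[0], argv, envp);
        _exit(127);  // only reached if execve itself failed
    }
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Whatever arrays you pass here are what the new program receives as argv and envp in its main.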

Then we can see the other two syscalls we invoked, write and exit.

Stracing `multithrd.c`

So let's dive in and directly strace our program!

$ strace -o calls.strace ./multithrd
$ cat calls.strace
execve("./multithrd", ["./multithrd"], 0x7ffee0802fc0 /* 72 vars */) = 0
brk(NULL)                               = 0x55b1c8444000
arch_prctl(0x3001 /* ARCH_??? */, 0x7fffb92cf8a0) = -1 EINVAL (Invalid argument)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9dad49e000
...

And it goes like this for 99 lines.

So yeah, let's start with something simpler. What about a program that does nothing? And I will make it without a file for the flex!

$ echo "int main() {}" | gcc -O2 -x c -
$ strace 2>&1 ./a.out | wc -l
34

Much better! 65 fewer lines! A program makes a bunch of syscalls even when it does nothing, so we can ignore the lines that both traces have in common.

And if you squint, you can notice clone3 being called multiple times in our multithreaded program. Let's check how many times.

$ grep -c clone3 calls.strace
10

The same number of threads we create, great!

The `clone3` syscall

So it seems that clone3 is the syscall that creates new threads. Let's use man (better if you use batman) to find out about it.

$ man clone3

The man page is quite large. It gives us information about the glibc wrapper, clone, and about the syscall itself. Let's check the signature of clone.

int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...);

The man page says "When the child process is created with the clone() wrapper function, it commences execution by calling the function pointed to by the argument fn. [...] When the fn(arg) function returns, the child process terminates".

It also says something really important about the stack: "The stack argument specifies the location of the stack used by the child process. Since the child and calling process may share memory, it is not possible for the child process to execute in the same stack as the calling process. The calling process must therefore set up memory space for the child stack and pass a pointer to this space to clone()".

The stack is basically memory. That's about it. Functions need it to work, because the return address, that is, the address you jump to in order to give control back to the caller, is pushed onto the stack. You can also use it to store local variables and to pass more than 6 arguments to a function.

So we just need to pass around values and we have it!
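Here is a minimal sketch of that idea using the glibc clone() wrapper: allocate a stack with mmap, pass its top to clone (the stack grows down on x86_64), and wait for the child. The names child_fn, spawn_with_clone and CHILD_STACK_SIZE are mine, and the flags are the bare minimum I picked for the example; a real thread library passes many more.

```c
#define _GNU_SOURCE       // needed for clone() in <sched.h>
#include <sched.h>
#include <signal.h>       // SIGCHLD
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#define CHILD_STACK_SIZE (64 * 1024)  // arbitrary size for this sketch

// Runs in the child, on the stack we allocated below.
static int child_fn(void *arg) {
    int *shared = arg;
    *shared = 42;  // the parent sees this because CLONE_VM shares memory
    return 0;      // becomes the child's exit status
}

// Hypothetical helper: returns the value written by the child, or -1 on error.
int spawn_with_clone(void) {
    int shared = 0;
    void *stack = mmap(NULL, CHILD_STACK_SIZE, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    if (stack == MAP_FAILED)
        return -1;

    // clone() wants the TOP of the stack, since it grows towards lower addresses.
    char *stack_top = (char *)stack + CHILD_STACK_SIZE;
    pid_t pid = clone(child_fn, stack_top, CLONE_VM | SIGCHLD, &shared);
    if (pid == -1) {
        munmap(stack, CHILD_STACK_SIZE);
        return -1;
    }

    int status;
    pid_t waited = waitpid(pid, &status, 0);  // child is done after this
    munmap(stack, CHILD_STACK_SIZE);
    return waited == -1 ? -1 : shared;
}
```

Without CLONE_THREAD the child is still a separate process as far as waitpid is concerned, which keeps the sketch short. It only shares our memory.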

`glibc`'s `clone` function

But how is the function I pass to it executed? I don't see any func_ptr field in clone3 (the syscall)!
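To see that for yourself, here is a small sketch that calls clone3 directly through syscall(). I'm assuming kernel headers from Linux 5.3 or newer, which is where struct clone_args comes from. The helper name clone3_fork_like is made up, and I use it like fork (no shared memory, no new stack) because a real thread setup can't be done safely from plain C.

```c
#include <linux/sched.h>  // struct clone_args
#include <signal.h>       // SIGCHLD
#include <sys/syscall.h>  // SYS_clone3
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: a fork-like clone3 call. Returns the child's exit
// code, or -1 if clone3 failed (old kernel, seccomp filter, ...).
int clone3_fork_like(int exit_code) {
    struct clone_args args = {0};
    args.exit_signal = SIGCHLD;  // lets waitpid() reap the child
    // flags, stack and stack_size stay 0: the child gets a copy of our
    // memory and keeps running on a copy of our stack.
    // Note that no field holds a function to run.

    long ret = syscall(SYS_clone3, &args, sizeof args);
    if (ret == -1)
        return -1;
    if (ret == 0)
        _exit(exit_code);  // child: resumes right here, with 0 as the result

    int status;  // parent: ret is the ID of the child
    if (waitpid((pid_t)ret, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The syscall only says where the child's stack is and what to share. Both sides continue from the same instruction, and the return value is how they tell each other apart.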

Well, clone (the function) has a function pointer argument, let's check what it does.

The source code for that is here.

Surprisingly, it is somewhat well commented! First some sanity checks, then an ABI compliance thing, moving around some stuff to make the syscall, storing the start function and its argument in the stack and when everything is ready, the syscall is made.

And with this, the two threads are at the exact same position in the code. The only differences are the stack pointer (in the child thread it points to the stack we passed in) and the rax register.
In the parent thread, rax gets the ID of the new thread.
In the child thread, it gets 0.

With this logic, we do a little branching and it just works.
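The "store the function on the stack, then pop it in the child" trick can be simulated in plain C. This is only a model of my reading of the wrapper, not its actual code: the names stash_on_stack, start_from_stack and double_it are made up, and nothing here creates a thread.

```c
#include <stddef.h>
#include <stdint.h>

typedef int (*start_fn)(void *);

// Mimics the setup done before the syscall: reserve two slots at the top
// of the child's stack and store the start function and its argument there.
// Returns the stack pointer the child would start with.
uintptr_t *stash_on_stack(void *stack_base, size_t size, start_fn fn, void *arg) {
    uintptr_t *sp = (uintptr_t *)((char *)stack_base + size);  // stack grows down
    *--sp = (uintptr_t)arg;  // higher slot, popped second
    *--sp = (uintptr_t)fn;   // top of the stack, popped first
    return sp;
}

// Mimics the child once the syscall has returned 0: pop the function and
// its argument, then call fn(arg).
int start_from_stack(uintptr_t *sp) {
    start_fn fn = (start_fn)*sp++;
    void *arg = (void *)*sp++;
    return fn(arg);  // the real wrapper hands this value to the exit syscall
}

// Example start function for trying it out.
int double_it(void *arg) {
    return 2 * *(int *)arg;
}
```

With a buffer such as uintptr_t stack[32] and an int holding 21, stashing double_it and then calling start_from_stack on the returned pointer gives back 42. The child never needs to be told what to run, because it finds it on its own stack.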

Unfortunately, this process isn't as easy to visualize without doing it inside a debugger. But eventually, it calls the function that you passed to it. And then it just exits with the value you returned from the thread.

So I basically did that by myself!

The repo is here: https://github.com/beto-bit/mt_asm

Aside from not using the standard library, it wasn't actually that hard. The glibc implementation is much more complex, but this is a (somewhat) working implementation.
