Introduction
So I was trying to use atomics in C and got a little example working:
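// multithrd.c
#include <stdio.h>
#include <threads.h>
#include <stdatomic.h>

atomic_int acnt;  // incremented atomically by every thread
int cnt;          // plain int: concurrent ++cnt is a data race (on purpose)

int f(void *thr_data) {
    (void)thr_data;  // unused
    for (int n = 0; n < 10000; ++n) {
        ++cnt;   // load, add, store: updates from other threads can be lost
        ++acnt;  // a single atomic read-modify-write
    }
    return 0;
}

int main(void) {
    thrd_t thr[10];
    for (int n = 0; n < 10; ++n)
        thrd_create(thr + n, f, NULL);
    for (int n = 0; n < 10; ++n)
        thrd_join(thr[n], NULL);
    printf("Atomic counter is: %d\n", atomic_load(&acnt));
    printf("Non Atomic counter is: %d\n", cnt);
}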
And I got results like this:
$ ./multithrd
Atomic counter is: 100000
Non Atomic counter is: 98860 # Depends on the weather
Great! But how is a thread actually spawned? We don't know much yet, but we do know one thing: threads are created by the OS.
Hello, World! (in assembly)
As you may know, a program can do a lot of things by itself. It can add numbers, pass values between functions, compute digits of pi, etc. But it can't print anything to the screen on its own. That's the kernel's job.
Well, in a sense a program can print something to the screen; after all, the Linux kernel is itself just a bunch of instructions for the CPU to execute. It simply has a lot of privileges: it can turn your USB ports on and off, communicate with your NIC, and light up specific pixels on your screen in order to print "Hello, World!".
So how does a normal program, a userland program, print something to the screen? Let's figure it out by writing an x86_64 Linux assembly "Hello, World!" program!
We're gonna be using NASM because I don't want to deal with GAS.
; main.asm
section .text
global _start
_start:
mov rsi, msg ; msg
mov rdx, msg_l ; len
mov rax, 1 ; write syscall
mov rdi, 1 ; stdout
syscall
mov rax, 60 ; exit syscall
mov rdi, 0 ; exit code = 0
syscall
section .data
msg: db "Hi mum", 10
msg_l: equ $ - msg
Then assemble and link it!
$ nasm -f elf64 main.asm -o main.o # Assemble it
$ ld main.o -o main # Link it
$ ./main # Run it!
Hi mum
You can see the syscall instruction there. It basically says to the OS "do this for me, thx". We use it for writing to stdout and for exiting the program. So whenever the program can't do something by itself, say opening a TCP socket, it asks the OS to do it. That's also how threads are created! But how do we know which syscall is used for that?
Strace Adventures
The man page for strace says "[...] It intercepts and records the system calls which are called by a process and the signals which are received by a process." So this is the tool we need.
So let's run it against our new program!
$ strace ./main
execve("./main", ["./main"], 0x7ffccab40ae0 /* 67 vars */) = 0
write(1, "Hi mum\n", 7) = 7
exit(0) = ?
+++ exited with 0 +++
As you can see, it shows us which syscalls were made. execve is the syscall that actually runs the program! It is issued by whatever launches our program (normally our shell; here, strace), and in fact its second argument is equivalent to argv in a C main function, and its third to envp:
int main(int argc, char *argv[], char *envp[]);
Then we can see the other two syscalls we invoked: write and exit.
Stracing multithrd.c
So let's dive in and strace our program directly!
$ strace -o calls.strace ./multithrd
$ cat calls.strace
execve("./multithrd", ["./multithrd"], 0x7ffee0802fc0 /* 72 vars */) = 0
brk(NULL) = 0x55b1c8444000
arch_prctl(0x3001 /* ARCH_??? */, 0x7fffb92cf8a0) = -1 EINVAL (Invalid argument)
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f9dad49e000
...
And it goes like this for 99 lines.
So yeah, let's start with something simpler. What about a program that does nothing? And I'll compile it without even creating a file, for the flex!
$ echo "int main() {}" | gcc -O2 -x c -
$ strace 2>&1 ./a.out | wc -l
34
Much better! 65 fewer lines! A program makes a bunch of syscalls even when it does nothing, so we can skip the lines that look the same in both traces.
And if you squint at the trace of our multithreaded program, you can notice the clone3 call being made multiple times. Let's check how many times.
$ grep -c clone3 calls.strace
10
The same number of threads we create, great!
The clone3 syscall
So it seems that clone3 is the syscall that creates new threads. Let's use man (or better, batman) to find out more about it.
$ man clone3
The man page is quite large. It gives us information about the glibc wrapper, clone, and about the syscall itself. Let's check the signature of clone:
int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...);
The man page says "When the child process is created with the clone() wrapper function, it commences execution by calling the function pointed to by the argument fn. [...] When the fn(arg) function returns, the child process terminates".
It also says something really important about the stack: "The stack argument specifies the location of the stack used by the child process. Since the child and calling process may share memory, it is not possible for the child process to execute in the same stack as the calling process. The calling process must therefore set up memory space for the child stack and pass a pointer to this space to clone()".
The stack is basically just memory. It's needed for functions to work, because the return address (the address to jump to in order to give control back to the calling function) is pushed onto it. It's also used to store local variables and to pass arguments beyond the first six to functions.
So we just need to allocate some memory, pass the right values around, and we have it!
glibc's clone function
But how is the function I pass to it executed? I don't see any func_ptr field in clone3 (the syscall)!
Well, clone (the function) does take a function pointer argument, so let's check what it does with it.
The source code for that is here.
Surprisingly, it's fairly well commented! First come some sanity checks, then an ABI-compliance step; then the arguments are moved into the registers the syscall expects, the start function and its argument are stored on the child's stack, and once everything is ready, the syscall is made.
And with that, after the syscall returns, both threads are at exactly the same point in the code. The only differences are the stack pointer (in the child, it's the stack we passed in) and the rax register, which holds the syscall's return value:
In the parent thread, rax holds the new thread's ID.
In the child thread, it holds 0.
By branching on that value, each thread knows which side it's on, and it just works.
Unfortunately, this process is hard to visualize without stepping through it in a debugger. But eventually, the child calls the function you passed in, and then it exits with the value that function returned.
So I basically did that by myself!
The repo is here: https://github.com/beto-bit/mt_asm
Apart from having to do without the standard library, it wasn't actually that hard. The glibc implementation is much more complex, but this is a (somewhat) working implementation.