If you work as a programmer, putting out fires and solving weird problems is part of the job description. A lot of those problems can be described by a single word: overflow. Integers can overflow, and stacks can overflow. The first is notoriously hard to locate, and the second is annoying to fix.
Recently, we had to change the way we define timeouts for our queries because of an integer overflow caused by a strong assumption we had made about the hardware. While implementing a different solution, we ran into trouble with an overflowing stack, but in the end, we managed to avoid it.
Both of those problems are interesting and not something you tackle every day, so I would like to show you how we located them and how we fixed them.
What is a query timeout?
On systems like Memgraph, where you run arbitrary queries and code, you would like to limit the execution time. This is achieved by defining timeouts.
When the user runs a query, the execution is limited by some duration, and then the race between execution and the timer begins. If the query finishes before the defined duration expires, the result is returned, and the query is victorious. If the duration expires, well, bad luck, you get an exception as a consolation prize.
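To make the idea concrete, here is a minimal sketch of such a race using std::future. This is only an illustration, not how Memgraph implements it; RunQuery and the durations are made up:
#include <chrono>
#include <future>
#include <iostream>
#include <thread>

// Toy stand-in for "an arbitrary query".
int RunQuery() {
  std::this_thread::sleep_for(std::chrono::seconds(2));
  return 42;
}

int main() {
  auto result = std::async(std::launch::async, RunQuery);
  // Race the execution against the timer: whoever finishes first wins.
  if (result.wait_for(std::chrono::seconds(1)) == std::future_status::timeout) {
    std::cout << "query timed out\n";  // the "consolation prize"
  } else {
    std::cout << "result: " << result.get() << '\n';
  }
}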
How would you implement a functionality like this?
We had one idea that worked perfectly. It turned out "perfect" wasn't the right adjective, though; "almost perfect" was.
Noticing a problem
One day, a new colleague showed up at work. The first step of the onboarding process at Memgraph is familiarizing yourself with the database: go through the tutorials, play with Memgraph, and try as many things as possible. Pretty straightforward, right?
After a few hours, he sent me a message saying that queries were timing out at random after a few seconds, even though the timeout was set to 3 minutes. Of course, bugs are possible, so I checked which query it was (I was confident it was a single one). It turned out that all kinds of queries timed out, and not always. A non-deterministic bug, how great!
But every bug can be seen as a great opportunity to learn something new and get a deeper understanding of the system. Perfect for our new colleague!
It didn’t take long for him to figure out that the timer we used to measure timeouts had overflowed. When you check the elapsed time, you subtract the old time from the new time. But if the new time is less than the old time, you get an overflow. Because we store time as an unsigned integer, you get a very large number that is guaranteed to be larger than any timeout value you set, including our default of 3 minutes.
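To make the failure mode concrete, here is a tiny sketch of that arithmetic (the numbers are invented, the wraparound is the real effect we were seeing):
#include <cstdint>
#include <iostream>

int main() {
  uint64_t old_time = 1'000'000;
  uint64_t new_time = 999'000;  // the counter went "backwards"
  // Unsigned subtraction wraps around and produces a huge number,
  // far larger than any sane timeout: 18446744073709550616.
  uint64_t elapsed = new_time - old_time;
  std::cout << elapsed << '\n';
}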
Great, the problem was identified. Now, why was there a problem in the first place, and why was it only observed on his laptop? The problem is obviously in the timer, so what kind of timer were we using?
We used a TSC-based timer. The TSC (Time Stamp Counter) is a special register in the CPU that basically counts cycles, and what better way to measure time than to read it directly from a register? You’re probably wondering: why isn’t everyone using that kind of timer?
Because it’s unreliable. If a single CPU has multiple cores, each core gets its own counter. Those counters are, of course, unsynchronized. But at the kernel level they can be kept synchronized, making them monotonically increasing, which allows you to still use them as fast counters.
It turns out that the kernel-based synchronization doesn’t work well with all CPUs. More specifically, it didn’t work on the AMD Ryzen 7 CPUs we tested.
To quickly check whether the TSC is monotonically increasing on your CPU, you can try the following program:
#include <x86intrin.h>

#include <cstdlib>
#include <iostream>
#include <thread>
#include <vector>

// Read the TSC; __rdtscp also reports which CPU the read happened on.
inline unsigned long long ReadTSC() {
  unsigned int cpuid;
  return __rdtscp(&cpuid);
}

int main() {
  constexpr size_t thread_num = 20;
  std::vector<std::thread> threads;
  for (size_t i = 0; i < thread_num; ++i) {
    threads.emplace_back([] {
      unsigned long long previous = 0U;
      while (true) {
        auto current = ReadTSC();
        // If the counter ever goes backwards, it is not monotonic.
        if (current < previous) {
          std::cout << "Invalid counter" << std::endl;
          std::abort();
        }
        previous = current;
      }
    });
  }
  for (auto &thread : threads) {
    thread.join();
  }
}
If the counter isn't monotonically increasing, the program will abort after some time.
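If you want to try it yourself, compiling with something along the lines of g++ -O2 -pthread should be enough; the exact flags aren't important here.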
Finding the alternatives
Any different way of measuring time was bound to introduce a performance hit, because you can't get faster than reading the TSC.
We had to look at completely different approaches.
After analyzing how different databases handle this, we came to the conclusion that POSIX timers were our best bet. We work natively only on Linux, so there shouldn’t be any issues.
The idea is simple: you basically use timer_create with SIGEV_THREAD notification. After implementing it, it worked! Without any performance hit! And it worked on all machines!
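For context, here is a hedged sketch of what such a SIGEV_THREAD based timer looks like in isolation (the callback, payload, and durations are made up for illustration and are not Memgraph's actual code):
#include <signal.h>
#include <time.h>

#include <iostream>

// Hypothetical callback; glibc invokes it on a separate thread when the timer fires.
void OnTimeout(sigval sv) {
  std::cout << "query " << sv.sival_int << " timed out\n";
}

int main() {
  sigevent sev{};
  sev.sigev_notify = SIGEV_THREAD;        // notify by running a function in a new thread
  sev.sigev_notify_function = OnTimeout;
  sev.sigev_value.sival_int = 42;         // arbitrary payload handed to the callback

  timer_t timer_id;
  timer_create(CLOCK_MONOTONIC, &sev, &timer_id);

  itimerspec spec{};
  spec.it_value.tv_sec = 1;               // one-shot timer, fires after 1 second
  timer_settime(timer_id, 0, &spec, nullptr);

  timespec sleep_time{2, 0};
  nanosleep(&sleep_time, nullptr);        // keep the process alive long enough
  timer_delete(timer_id);
}
Depending on your glibc version, you may need to link with -lrt for the timer_* functions.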
And then jemalloc came into play
Of course, we don't work on just one feature at a time; things happen in parallel. While working on the timers, we introduced jemalloc into our codebase. After merging the jemalloc changes, the tests for the timers started to fail. And what kind of failure? Segmentation faults, of course, what else...
The logical thing is to start the debugger and check the backtrace.
This is what we got:
#0 0x0000000000604fdf in extent_split_impl (tsdn=0x7ffff7fcd788, arena=0x7ffff7a00980,
r_extent_hooks=0x0, extent=0x0, size_a=0, szind_a=0, slab_a=false, size_b=2023424,
szind_b=232, slab_b=false, growing_retained=false)
at /home/antonio/Memgraph/memgraph/libs/jemalloc/src/extent.c:2092
#1 0x000000000060dcdd in extent_split_interior (tsdn=0x7ffff7fcd788, arena=0x7ffff7a00980,
r_extent_hooks=0x7ffff7fc8890, rtree_ctx=0x7ffff7fcd7b8, extent=0x7ffff7fc8388,
lead=0x7ffff7fc82e0, trail=0x7ffff7fc82d8, to_leak=0x7ffff7fc82d0,
to_salvage=0x7ffff7fc82c8, new_addr=0x0, size=28672, pad=4096, alignment=64, slab=false,
szind=39, growing_retained=true)
at /home/antonio/Memgraph/memgraph/libs/jemalloc/src/extent.c:1017
#2 0x000000000060c4bc in extent_recycle_split (tsdn=0x7ffff7fcd788, arena=0x7ffff7a00980,
r_extent_hooks=0x7ffff7fc8890, rtree_ctx=0x7ffff7fcd7b8, extents=0x7ffff7a05938,
new_addr=0x0, size=28672, pad=4096, alignment=64, slab=false, szind=39,
extent=0x7ffff7a0a900, growing_retained=true)
at /home/antonio/Memgraph/memgraph/libs/jemalloc/src/extent.c:1066
#3 0x00000000006021a9 in extent_recycle (tsdn=0x7ffff7fcd788, arena=0x7ffff7a00980,
r_extent_hooks=0x7ffff7fc8890, extents=0x7ffff7a05938, new_addr=0x0, size=28672,
pad=4096, alignment=64, slab=false, szind=39, zero=0x7ffff7fc8987, commit=0x7ffff7fc887f,
growing_retained=true) at /home/antonio/Memgraph/memgraph/libs/jemalloc/src/extent.c:1148
#4 0x00000000006035f9 in extent_alloc_retained (tsdn=0x7ffff7fcd788, arena=0x7ffff7a00980,
r_extent_hooks=0x7ffff7fc8890, new_addr=0x0, size=28672, pad=4096, alignment=64,
slab=false, szind=39, zero=0x7ffff7fc8987, commit=0x7ffff7fc887f)
at /home/antonio/Memgraph/memgraph/libs/jemalloc/src/extent.c:1471
#5 0x000000000060340e in extent_alloc_wrapper (tsdn=0x7ffff7fcd788, arena=0x7ffff7a00980,
r_extent_hooks=0x7ffff7fc8890, new_addr=0x0, size=28672, pad=4096, alignment=64,
slab=false, szind=39, zero=0x7ffff7fc8987, commit=0x7ffff7fc887f)
at /home/antonio/Memgraph/memgraph/libs/jemalloc/src/extent.c:1539
#6 0x00000000005ac66e in arena_extent_alloc_large (tsdn=0x7ffff7fcd788,
arena=0x7ffff7a00980, usize=28672, alignment=64, zero=0x7ffff7fc8987)
at /home/antonio/Memgraph/memgraph/libs/jemalloc/src/arena.c:448
#7 0x000000000054b8ea in large_palloc (tsdn=0x7ffff7fcd788, arena=0x7ffff7a00980,
usize=28672, alignment=64, zero=true)
at /home/antonio/Memgraph/memgraph/libs/jemalloc/src/large.c:47
...
It looks complex, but one thing is obvious: the crash happens inside the jemalloc library.
Now we panic. Our new super cool library for allocation is not working. Naturally, I start googling and find nothing.
jemalloc has a lot of configs, maybe changing them will help. Again, nothing.
Maybe changing the configs at random will help. Nothing.
You can build jemalloc in debug mode, so that will probably give us more info. Oh, NOW everything works...
The tools jemalloc provides weren't enough, including the debug build. The only thing left was to start digging through the jemalloc code.
The first hint: the line that causes the segmentation fault is the function signature itself.
static extent_t *
extent_split_impl(tsdn_t *tsdn, arena_t *arena,
extent_hooks_t **r_extent_hooks, extent_t *extent, size_t size_a,
szind_t szind_a, bool slab_a, size_t size_b, szind_t szind_b, bool slab_b,
bool growing_retained) {
My guess was that the SIGSEGV happened because of some invalid argument. Let's check out the line that calls this function right before the SIGSEGV:
*trail = extent_split_impl(tsdn, arena, r_extent_hooks, *extent,
esize, szind, slab, trailsize, SC_NSIZES, false,
growing_retained);
Okay, there is a chance my theory is right: the call dereferences *extent, so maybe extent is NULL at this point. After checking the surrounding code and verifying it in the debugger, that wasn't the case. Every parameter was correct, so what could be wrong if the SIGSEGV happens right at the function call?
The second and most important hint is in the first line of the debugger output:
#0 0x0000000000604fdf in extent_split_impl (tsdn=0x7ffff7fcd788, arena=0x7ffff7a00980,
r_extent_hooks=0x0, extent=0x0, size_a=0, szind_a=0, slab_a=false, size_b=2023424,
szind_b=232, slab_b=false, growing_retained=false)
at /home/antonio/Memgraph/memgraph/libs/jemalloc/src/extent.c:2092
Check out the values of the arguments: they are almost all 0. Those could be valid values, but as you saw above, the caller simply forwards already existing values. I compared the values right before the function call with the values the function received. They didn't match. How can that happen?
Here comes the term you've all been waiting for: stack overflow (and I don't mean the site).
By definition, a segmentation fault happens when you try to access a part of memory you don't have access to. One of those cases is accessing stack memory outside of the allocated stack, which is exactly what happened here.
We are left with one final question: Why does this only happen with jemalloc?
It's not just jemalloc
Taking into account the backtrace and the definition of stack overflow, we know that jemalloc chained a lot of function calls with a lot of arguments. Why didn't it happen in the debug build of jemalloc?
Well, the thread cache is disabled in debug mode. That eliminates enough function calls to avoid the segmentation fault.
But how is it possible that, in this day and age, we don't have enough stack memory for a library like jemalloc?
To answer that, we need to look at a different library: glibc.
We know that something weird is going on when we try to use timer_create together with jemalloc. After downloading the glibc sources, we find the file with the timer_create definition (sysdeps/unix/sysv/linux/timer_create.c). Most of the code seemed fine and unproblematic; the only questionable line was this one:
pthread_once (&__helper_once, __start_helper_thread);
In sysdeps/unix/sysv/linux/timer_routines.c we find the definition of __start_helper_thread and the line responsible for our troubles:
(void) pthread_attr_setstacksize (&attr, __pthread_get_minstack (&attr));
When you use SIGEV_THREAD with timer_create, a helper thread with a minimal-sized stack is created on the first call. That thread runs a helper function that does all the work of processing the signals and starting a new thread for each handler. And, of course, one of the things it does is call malloc:
struct thread_start_data *td = malloc (sizeof (*td));
We had found our problem, but there was no way of fixing it, because there is no option to change the stack size of the hidden helper thread.
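To see the mechanism in isolation, here is a small toy example (completely unrelated to glibc's or jemalloc's code, just an illustration) in which a thread created with a minimal stack dies on a call chain with big frames:
#include <limits.h>
#include <pthread.h>

#include <cstdio>

// A call chain with sizable stack frames; on a minimal stack this
// crashes with SIGSEGV almost immediately.
int Burn(int depth) {
  char frame[16 * 1024];
  frame[0] = static_cast<char>(depth);
  if (depth == 0) return frame[0];
  return Burn(depth - 1) + frame[0];
}

void *Worker(void *) {
  std::printf("%d\n", Burn(100));
  return nullptr;
}

int main() {
  pthread_attr_t attr;
  pthread_attr_init(&attr);
  // Roughly the kind of stack size the hidden glibc helper thread gets.
  pthread_attr_setstacksize(&attr, PTHREAD_STACK_MIN);
  pthread_t thread;
  pthread_create(&thread, &attr, Worker, nullptr);
  pthread_join(thread, nullptr);
}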
At first, our options were to discard jemalloc or to find a different way of implementing timeouts.
I think that jemalloc is a great library, and I didn’t like the idea of discarding it. Fortunately, timer_create enables you to write a slightly different solution on Linux.
The solution - finally
Instead of SIGEV_THREAD, we ended up using SIGEV_THREAD_ID, which behaves like SIGEV_SIGNAL but allows you to direct the signal at a specific thread.
This was also the first time I had tackled signals. It was fun to play with them, but also full of question marks. I would just like to single out the most confusing part I stumbled on: if you want to process a signal synchronously in a thread (with sigwait or sigwaitinfo), you first have to block that signal in that thread, so its default disposition never kicks in and it stays queued until you pick it up. Makes sense, right?
After wrapping my head around signals and figuring out how to use SIGEV_THREAD_ID using the documentation and two existing StackOverflow answers about that topic, we ended up with a solution that works perfectly (and with jemalloc).
Here's a small example of the way we used the POSIX timers for our timeout solution. What do we need for it?
First, we need to set up a timer. timer_create accepts sigevent objects, which hold all the information necessary to create a single timer.
Also, we need to decide which signal to use. There is no real answer here. Just make sure the signal is not used for something else.
Let’s create our own SIGTIMER definition:
#define SIGTIMER (SIGRTMAX - 2)
And the last and most important part is to set _sigev_un._tid to the thread ID of the thread that will receive the signals.
Our sigevent object is defined like this:
sigevent notification_settings{};
notification_settings.sigev_notify = SIGEV_THREAD_ID;
notification_settings.sigev_signo = SIGTIMER;
notification_settings._sigev_un._tid = <thread_id>;
Additionally, we can attach a specific value to the signal and use it when processing the signal. sigev_value is a union of an int and a void *, so here we smuggle a 64-bit value through the pointer member:
uint64_t some_value = 42;
static_assert(sizeof(void *) == sizeof(some_value), "some_value size must be equal to pointer size!");
std::memcpy(&notification_settings.sigev_value.sival_ptr, &some_value, sizeof(some_value));
Now, we just create a timer with those settings:
timer_t timer_id;
timer_create(CLOCK_MONOTONIC, &notification_settings, &timer_id);
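One detail worth spelling out: timer_create only creates the timer, it doesn't start it. The timer is armed with timer_settime, roughly like this (the 3-minute value is just our default timeout used as an illustration):
itimerspec timer_spec{};
timer_spec.it_value.tv_sec = 180;  // one-shot: fire once after 180 seconds, no interval
timer_settime(timer_id, 0, &timer_spec, nullptr);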
The second part is setting up the background thread that replaces the glibc helper thread. We need to somehow forward the underlying kernel ID of that thread, and the way to get it is by calling gettid (here through syscall(SYS_gettid), since older glibc versions don't provide a wrapper).
We set up an object which will help us to forward the thread ID from the background thread to the main thread:
struct ThreadInfo {
  pid_t thread_id;
  std::atomic<bool> setup_done{false};
};
The worker function first receives an object of type ThreadInfo* and fills in its thread ID:
auto *thread_info = static_cast<ThreadInfo *>(args);
thread_info->thread_id = syscall(SYS_gettid);
After that, we set the thread up for processing our SIGTIMER signal. We need to block it (of course):
sigset_t ss;
sigemptyset(&ss);
sigaddset(&ss, SIGTIMER);
sigprocmask(SIG_BLOCK, &ss, nullptr);
thread_info->setup_done.store(true, std::memory_order_release);
Now we can wait for the SIGTIMER signal with sigwaitinfo and run whatever we need when we receive it.
while (true) {
  siginfo_t si;
  int result = sigwaitinfo(&ss, &si);
  if (result <= 0) {
    continue;
  }
  if (si.si_code == SI_TIMER) {
    uint64_t some_value = 0;
    std::memcpy(&some_value, &si.si_value.sival_ptr, sizeof(some_value));
    // do something else with some_value
  } else if (si.si_code == SI_TKILL) {
    pthread_exit(nullptr);
  }
}
We just need to start the background thread from the main thread before setting up the timer, because we need its thread ID first.
static pthread_t background_timer_thread;
static ThreadInfo thread_info;
pthread_create(&background_timer_thread, nullptr, TimerBackgroundWorker, &thread_info);
while (!thread_info.setup_done.load(std::memory_order_acquire))
;
Altogether, it should look like this:
#include <pthread.h>
#include <signal.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

#include <atomic>
#include <cstdint>
#include <cstring>

#define SIGTIMER (SIGRTMAX - 2)

struct ThreadInfo {
  pid_t thread_id;
  std::atomic<bool> setup_done{false};
};

void *TimerBackgroundWorker(void *args) {
  auto *thread_info = static_cast<ThreadInfo *>(args);
  thread_info->thread_id = syscall(SYS_gettid);

  // Block SIGTIMER so it can be consumed synchronously with sigwaitinfo.
  sigset_t ss;
  sigemptyset(&ss);
  sigaddset(&ss, SIGTIMER);
  sigprocmask(SIG_BLOCK, &ss, nullptr);
  thread_info->setup_done.store(true, std::memory_order_release);

  while (true) {
    siginfo_t si;
    int result = sigwaitinfo(&ss, &si);
    if (result <= 0) {
      continue;
    }
    if (si.si_code == SI_TIMER) {
      // A timer expired; recover the value we attached to it.
      uint64_t some_value = 0;
      std::memcpy(&some_value, &si.si_value.sival_ptr, sizeof(some_value));
      // do something else with some_value
    } else if (si.si_code == SI_TKILL) {
      // The signal was sent directly to this thread (e.g. via pthread_kill).
      pthread_exit(nullptr);
    }
  }
}

static pthread_t background_timer_thread;
static ThreadInfo thread_info;

int main() {
  pthread_create(&background_timer_thread, nullptr, TimerBackgroundWorker, &thread_info);
  while (!thread_info.setup_done.load(std::memory_order_acquire))
    ;

  sigevent notification_settings{};
  notification_settings.sigev_notify = SIGEV_THREAD_ID;
  notification_settings.sigev_signo = SIGTIMER;
  notification_settings._sigev_un._tid = thread_info.thread_id;

  uint64_t some_value = 42;
  static_assert(sizeof(void *) == sizeof(some_value), "some_value size must be equal to pointer size!");
  std::memcpy(&notification_settings.sigev_value.sival_ptr, &some_value, sizeof(some_value));

  timer_t timer_id;
  timer_create(CLOCK_MONOTONIC, &notification_settings, &timer_id);

  // ... arm the timer with timer_settime, run queries, and clean up (see below) ...
}
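To round it off, a hedged sketch of the teardown (the exact order is an assumption): delete the timer and stop the worker through the SI_TKILL branch above:
timer_delete(timer_id);
// pthread_kill delivers SIGTIMER to the worker with si_code == SI_TKILL,
// which makes it call pthread_exit.
pthread_kill(background_timer_thread, SIGTIMER);
pthread_join(background_timer_thread, nullptr);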
Conclusion
Finding bugs and fixing them is never a straightforward process, but that’s what keeps it interesting. Small moments like these, which force you to explore things you would never touch otherwise, are a great opportunity to dig deeper into the libraries you use and the environment itself.
So don’t give up, even if you only have a single line from the debugger to work with. Maybe you just have to block a signal to receive it.