Discussion on: I'm an Expert in Memory Management & Segfaults, Ask Me Anything!

Hello Jason,
I have a fairly large codebase written in C++ (not by me).
The code had been working fine for the most part until I upgraded the system it runs on to a newer version of Debian (bullseye).
We are getting a segfault when one specific operation of the code is executed.
I get the segfault with the code compiled in both optimized and debug flavors.
However, when I run the code with either gdb or valgrind, the code works fine. Valgrind doesn't show any errors.
The last thing I tried was running the code without gdb/valgrind and generating a core dump. Opening the core dump with gdb shows that the segfault happens when I call realloc(), so apparently something is messing up that pointer when it is freed. I tried replacing the realloc() with a malloc() followed by a memcpy() (without freeing the previous pointer), and it worked fine.
I have read on other blogs that both gdb and valgrind may change the memory layout of the program so that the bug doesn't show up. If that's the case, how can I track it down?
Any suggestions would be greatly appreciated!
Thanks a lot!
F.

Jason C. McDonald • Jul 14 '22

Ahh, the joys of a Heisenbug. Unfortunately, you seem to have gotten a particularly shy one. There won't be a way to directly observe it. However, your clever investigative efforts with the core dump have pointed you to the responsible realloc(), and thus the pointer giving you trouble, so there really isn't much more that Valgrind could have told you.

Thinking through it, I wonder if the free is causing issues because the memory at the original pointer was partially or completely uninitialized? Desk check the code allocating, initializing, and accessing that pointer, preferably in execution order (or close to it). Remember that a segfault is an operating-system-level error raised because the program attempted to access memory it is not allowed to access: either an address that is not mapped into the process at all, or one that is mapped without the needed permission (for example, writing to a read-only page).
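As one concrete thing to look for during that desk check: realloc() is only defined for a null pointer or for a pointer that came from malloc(), calloc(), or realloc() and has not been freed yet. A pointer member that was never initialized satisfies neither condition. The sketch below shows the safe pattern; `Buffer` and `resize` are hypothetical names for illustration and are not taken from the code in question.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical holder for a growable array of doubles.
struct Buffer {
    double* data = nullptr;   // starts as nullptr, so the first realloc() acts like malloc()
    std::size_t count = 0;

    // Resize to n doubles. Returns false and leaves the old block intact on failure.
    bool resize(std::size_t n) {
        if (n == 0) {                       // avoid realloc(ptr, 0), whose behavior varies
            std::free(data);
            data = nullptr;
            count = 0;
            return true;
        }
        void* p = std::realloc(data, n * sizeof(double));
        if (p == nullptr) return false;     // old pointer is still valid here
        data = static_cast<double*>(p);
        count = n;
        return true;
    }

    ~Buffer() { std::free(data); }          // free(nullptr) is a no-op
};
```

Assigning the result to a temporary first also avoids losing the old block if realloc() returns a null pointer.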

I hope that helps.

sandlocker • Jul 14 '22

Hello Jason,
Thanks a lot for the reply and the suggestions.

This is the gdb/core dump session:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `executable'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f629d991df8 in mremap_chunk (p=0x7f62938d3000,
    p@entry=0x7f629960c000, new_size=<optimized out>, new_size@entry=3787904)
    at malloc.c:2878
2878    malloc.c: No such file or directory.
(gdb) up
#1  0x00007f629d996a70 in __GI___libc_realloc (oldmem=0x7f629960c010,
    bytes=3787896) at malloc.c:3206
3206    in malloc.c
(gdb) up
#2  0x00007f629c37cadb in ?? ()
   from /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.470.129.06
(gdb) up
#3  0x000056491c746ab6 in MyClass::allocateF (this=0x56491d35bdf0, n=157829)
    at program.cpp:46
46      xyz=(double*)realloc((void*)xyz,3*n*sizeof(double));
(gdb) q

It looks like the realloc() call passes through the library libnvidia-tls (??), which is strange (at least to me).

So, to test things out, I created myrealloc() in a different .cpp file that just does malloc, memcpy and free, and the segfault disappeared, although I am not sure whether that really solved the problem or just pushed the bug away for now.
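For reference, a replacement along those lines might look like the sketch below. The actual myrealloc() was not posted, so the name `my_realloc` and the explicit `old_size` parameter are assumptions for illustration. The extra parameter is needed because malloc() does not expose the size of an existing block.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical stand-in for realloc() built from malloc + memcpy + free.
// Unlike realloc(), the caller must supply the size of the old block.
void* my_realloc(void* old_ptr, std::size_t old_size, std::size_t new_size) {
    void* new_ptr = std::malloc(new_size);
    if (new_ptr == nullptr) return nullptr;   // old block left untouched, as with realloc()
    if (old_ptr != nullptr) {
        // Copy only what fits in both the old and the new block.
        std::size_t to_copy = old_size < new_size ? old_size : new_size;
        std::memcpy(new_ptr, old_ptr, to_copy);
        std::free(old_ptr);
    }
    return new_ptr;
}
```

A version like this always allocates a fresh block and never reaches the mremap_chunk path shown in frame #0 of the backtrace, so it would sidestep that code path whether or not the underlying cause has been fixed.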

My first thought was that there is a mismatch between the malloc() that creates the pointer in the first place (maybe from a different library than libnvidia-tls) and the realloc() that is being called here (as it is fairly easy to redefine realloc() with a macro). However, I think this would also have generated a segfault within gdb/valgrind.
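One way to probe the mismatch theory at run time is to ask the dynamic linker which shared object supplies each allocator symbol. The sketch below uses dlsym() and dladdr() from glibc's <dlfcn.h>; `provider_of` and `report_allocator_providers` are hypothetical helper names. It only detects symbol-level interposition by a shared library. A macro redefinition is resolved at compile time and would not show up this way. On glibc older than 2.34 the program needs to be linked with -ldl.

```cpp
#include <dlfcn.h>   // dlsym, dladdr, RTLD_DEFAULT (GNU extensions; g++ defines _GNU_SOURCE)
#include <cstdio>
#include <string>

// Returns the path of the object that provides `symbol` in the process's
// default lookup order, or an empty string if it cannot be resolved.
std::string provider_of(const char* symbol) {
    void* addr = dlsym(RTLD_DEFAULT, symbol);
    if (addr == nullptr) return "";
    Dl_info info;
    if (dladdr(addr, &info) == 0 || info.dli_fname == nullptr) return "";
    return info.dli_fname;
}

// Hypothetical helper: print where malloc, realloc, and free come from.
void report_allocator_providers() {
    const char* names[] = {"malloc", "realloc", "free"};
    for (const char* name : names) {
        std::printf("%-8s -> %s\n", name, provider_of(name).c_str());
    }
}
```

Calling report_allocator_providers() early in main(), in the same environment where the crash occurs, would show whether malloc and realloc resolve to the same library. If they differ, that would support the mismatch theory; if they match, the cause is more likely elsewhere.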

Have you seen something like this before?

Thanks again!