DEV Community

Jason C. McDonald

I'm an Expert in Memory Management & Segfaults, Ask Me Anything!

I'm an expert-level C and C++ developer, with a specialty in memory management. I have experience writing memory-safe code with both the modern safe techniques and the ancient unsafe techniques. I've used malloc and free without killing myself. I love pointers. I've debugged more than my share of undefined behavior, and authored the canonical Stack Overflow question on segfault debugging.

Any burning questions about dynamic allocation, undefined behavior, pointers, memory safety, or anything even remotely related? Ask me anything!

(My main languages are C++, C, and Python, although I also deeply grok the underlying computer science principles.)

Latest comments (163)

David Blash

Hello, I've written a shared library (.so) in C and an executable that links against it. Both run on Ubuntu and were compiled with GCC.
When the standalone executable calls into the shared library once and exits, there's no problem, although Valgrind reports some "still reachable" memory blocks, which the OS reclaims when the program exits.
But when the program calls into the library twice, it hits a segmentation fault, and Valgrind detects nothing.
How should I pinpoint the problem?

Nicolo Torre

Here is a puzzle for you. The context is C++ on Linux. Program A allocates some shared memory and writes a simple array of doubles there. Program B gets a pointer into the shared memory and reads the doubles. Then it runs off and does other work, comes back, and tries to read again. The pointer has not changed and the shared memory is still there, but the second read produces a SIGABRT. All I can guess is that some sort of indirect addressing is being used and a data structure somewhere is corrupted. But I am puzzled, to put it mildly. These are good old double *pt pointers, by the way, not smart pointers.

Raxel01

Hello, I hope you are still alive. I just want to ask:
How is a segfault represented in memory? Thank you!

Jason C. McDonald

Hello, yes! Alive and well.

When an operating system works with virtual memory, it maps ranges of addresses to pages of memory for each process. For example, it might map pages for Chrome to use for anything it wants to allocate and deallocate (heap space). However, it also reserves ranges of addresses (but not memory) that should not be accessed by anything. The purpose of this is to ensure that when you are iterating over addresses in a page of memory, you don't silently leave the page and cross into another. Instead, you hit those reserved addresses, the hardware reports the invalid access, and the operating system immediately raises a segfault (SIGSEGV on Unix-like systems).

Short explanation, but I hope that helps.
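
To make the reserved-address idea concrete, here is a minimal sketch in C, assuming a Linux system. It maps two pages, makes only the first one usable, and lets a child process iterate one byte too far. The helper name signal_from_guard_page_write is made up for this illustration, and real allocators and operating systems arrange their protected ranges in more elaborate ways than this.

```c
#define _DEFAULT_SOURCE /* for MAP_ANONYMOUS on glibc */
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: returns the number of the signal that killed a child
   process which wrote one byte past a usable page into an inaccessible one.
   Returns 0 if the child exited normally and -1 if the setup failed. */
int signal_from_guard_page_write(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);
    size_t page_size = (size_t)page;

    /* Reserve two pages of address space with no access rights at all. */
    char *region = mmap(NULL, 2 * page_size, PROT_NONE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED)
        return -1;

    /* Make only the first page usable; the second stays inaccessible. */
    if (mprotect(region, page_size, PROT_READ | PROT_WRITE) != 0) {
        munmap(region, 2 * page_size);
        return -1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        munmap(region, 2 * page_size);
        return -1;
    }
    if (pid == 0) {
        /* Child: the last iteration (i == page_size) touches the second page. */
        volatile char *p = region;
        for (size_t i = 0; i <= page_size; i++)
            p[i] = 1;
        _exit(0); /* not reached if the write faults */
    }

    int status = 0;
    pid_t waited = waitpid(pid, &status, 0);
    munmap(region, 2 * page_size);
    if (waited != pid)
        return -1;
    return WIFSIGNALED(status) ? WTERMSIG(status) : 0;
}
```

Calling signal_from_guard_page_write() from main and comparing the result with SIGSEGV shows the fault being delivered to the child while the parent keeps running.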

sandlocker

Hello Jason,
I have a fairly large C++ codebase (not written by me).
The code worked fine for the most part until I upgraded the system it runs on to a newer version of Debian (bullseye).
We now get a segfault when one specific operation is executed.
I get the segfault with both optimized and debug builds.
However, when I run the code under either gdb or Valgrind, it works fine. Valgrind doesn't show any errors.
The last thing I tried was running the code without gdb/Valgrind and generating a core dump. Opening the core dump with gdb shows that the segfault happens when I call realloc(), so apparently something is messing up that pointer when it is freed. I tried replacing the realloc() with a malloc() followed by a memcpy() (without freeing the previous pointer), and it worked fine.
I read on other blogs that both gdb and Valgrind may change the memory layout of the program so that the bug doesn't show up. If that's the case, how can I track it down?
Any suggestions would be greatly appreciated!
Thanks a lot!
F.

Jason C. McDonald

Ahh, the joys of a Heisenbug. Unfortunately, you seem to have gotten a particularly shy one. There won't be a way to directly observe it. However, your clever investigative efforts with the core dump have pointed you to the responsible realloc(), and thus the pointer giving you trouble, so there really isn't much more that Valgrind could have told you.

Thinking through it, I wonder if the free is causing issues because the memory at the original pointer was partially or completely uninitialized? Desk check the code allocating, initializing, and accessing that pointer, preferably in execution order (or close to it). Remember that a segfault is an operating-system-level error raised because something attempted to access memory not belonging to the program - or, more specifically, trying to use a memory address in the "protected space" that the OS set aside around the memory assigned to the process.

I hope that helps.

sandlocker

Hello Jason,
Thanks a lot for the reply and the suggestions.

This is the gdb/core dump session:

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `executable'.
Program terminated with signal SIGSEGV, Segmentation fault.

#0  0x00007f629d991df8 in mremap_chunk (p=0x7f62938d3000,
    p@entry=0x7f629960c000, new_size=<optimized out>, new_size@entry=3787904)
    at malloc.c:2878
2878 malloc.c: No such file or directory.
(gdb) up
#1  0x00007f629d996a70 in __GI___libc_realloc (oldmem=0x7f629960c010,
    bytes=3787896) at malloc.c:3206
3206 in malloc.c
(gdb) up
#2  0x00007f629c37cadb in ?? ()
   from /usr/lib/x86_64-linux-gnu/libnvidia-tls.so.470.129.06
(gdb) up
#3  0x000056491c746ab6 in MyClass::allocateF (this=0x56491d35bdf0, n=157829)
    at program.cpp:46
46      xyz=(double*)realloc((void*)xyz,3*n*sizeof(double));

(gdb) q

It looks like realloc() goes through the library libnvidia-tls(??), which is strange (at least to me).

So, to test things out, I created myrealloc() in a different .cpp file that just does malloc, memcpy, and free, and the segfault disappeared, although I am not sure whether that really solved the problem or just pushed the bug away for now.

My first thought was that there is a mismatch between the malloc() that creates the pointer in the first place (maybe from a different library than libnvidia-tls) and the realloc() that is being called here (as it is fairly easy to redefine realloc() with a macro). However, I think this would also have generated a segfault under gdb/Valgrind.

Have you seen something like this before ?

Thanks again!
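
For readers following along, a malloc/memcpy/free replacement like the myrealloc() described above might look roughly like the sketch below. This is not the poster's code: the name resize_doubles and the explicit old_count parameter are assumptions made for illustration (plain malloc cannot tell how large the old block was), and swapping realloc() for a helper like this can change the heap layout enough to hide a corruption bug without fixing it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a malloc + memcpy + free replacement of realloc().
   old_count and new_count are element counts, not byte counts.
   On failure it returns NULL and leaves the old block untouched. */
double *resize_doubles(double *old, size_t old_count, size_t new_count)
{
    /* A NULL old pointer only makes sense with zero old elements. */
    assert(old != NULL || old_count == 0);

    /* Refuse sizes whose byte count would overflow size_t. */
    if (new_count > SIZE_MAX / sizeof(double))
        return NULL;

    size_t bytes = new_count * sizeof(double);
    double *fresh = malloc(bytes != 0 ? bytes : 1);
    if (fresh == NULL)
        return NULL;

    if (old != NULL) {
        /* Copy only what fits in both the old and the new block. */
        size_t keep = old_count < new_count ? old_count : new_count;
        memcpy(fresh, old, keep * sizeof(double));
        free(old);
    }
    return fresh;
}
```

Unlike assigning the result of realloc() straight back to the same pointer, a caller of this helper would store the result in a temporary first, so the original block is not lost if the allocation fails.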

Scott E. Rodgers

Re: my previous post (which I do not see): I figured out the problem. As I suspected, when I traced the spaghetti logic, I found I was doing something with undefined behavior. I fixed it, and there is no problem on either machine now. Thanks for all your contributions!

Scott E. Rodgers

Hi, Jason: THANK YOU for your kind offer. I am developing C/C++ on WSL2 Ubuntu, then transferring the code to a machine running native Ubuntu. I'm writing and compiling C++, but using many C functions and features, notably realloc(). I get a "malloc(): corrupted top size" error followed by "Aborted" on my local WSL2 Ubuntu machine when I first try to realloc() a variable to more than about 64 B. But when I compile and run the same code on the remote machine running native Ubuntu (assuming the same gcc version, though I cannot be sure), I do not run into the error. My local machine (running WSL2 Ubuntu) is much beefier than the remote Ubuntu machine (local is an i9 with 32 GB RAM, and I believe the remote [school's] machine is a P4 Core2Duo with about 8 GB RAM). Do you have any idea what is going on, or could you point me to where I should troubleshoot? THANK YOU AGAIN very much for any time or thoughts you can spare.

Scott E. Rodgers

So the remote [working] gcc is version 7.5.0 on Ubuntu 18.04.
My local [error/aborting] gcc is version 9.3.0 on Ubuntu 20.04, (in WSL2).

Thanks again for any thoughts!

nagarajven

Hello Jason, the zmq-bind-proxyd process crashed. The backtrace is below. According to the stack trace, the crash originated from the zmq_poll function call. I do not see any reason why the process would crash at line 643. The instruction is not dereferencing a NULL, uninitialized, or dangling pointer, and there is no buffer overrun or stack overflow. Can you please help me find the root cause?

643 if ((rc = zmq_poll (in_items, ARRAY_SIZE(in_items), -1)) < 0) {
644 sd_journal_print(LOG_ERR, "%s: custom_proxy - ERROR! (%s) polling-in",
645 name(), zmq_strerror(rc));
646 break;
647 }

Program terminated with signal SIGSEGV, Segmentation fault.
warning: Unexpected size of section `.reg-xstate/346' in core file.

#0 0x00007fe7311ef9cc in ?? () from /neteng/nagaraj.venkatapuram/OS10_repo.f10/10.5.3.2.99999/usr/lib/x86_64-linux-gnu/libzmq.so.5

[Current thread is 1 (Thread 0x7fe72ca40700 (LWP 346))]
Setting debian release code name to 0
Sourcing gdb user defined commands
warning: /home/aruljeniston/gdb-macros/os10/zmq_gdb.gdb: No such file or directory
Redefine command "libevent_active_list_dump_by_flags"? (y or n) [answered Y; input not from terminal]
Redefine command "libevent_walk"? (y or n) [answered Y; input not from terminal]
Sourcing miscellaneous gdb-macros
Sourcing dn_sm gdb macros
--------------trace-----------------

#0 0x00007fe7311ef9cc in ?? () from /neteng/nagaraj.venkatapuram/OS10_repo.f10/10.5.3.2.99999/usr/lib/x86_64-linux-gnu/libzmq.so.5
#1 0x00007fe7311d314c in ?? () from /neteng/nagaraj.venkatapuram/OS10_repo.f10/10.5.3.2.99999/usr/lib/x86_64-linux-gnu/libzmq.so.5
#2 0x00007fe7311d3814 in ?? () from /neteng/nagaraj.venkatapuram/OS10_repo.f10/10.5.3.2.99999/usr/lib/x86_64-linux-gnu/libzmq.so.5
#3 0x00007fe7311f40ea in zmq_poll () from /neteng/nagaraj.venkatapuram/OS10_repo.f10/10.5.3.2.99999/usr/lib/x86_64-linux-gnu/libzmq.so.5
#4 0x0000564cb2ace363 in zmq_proxy_c::custom_proxy() () at zmq-common-proxyd_zmqctx.cpp:643
#5 0x0000564cb2acece7 in zmq_proxy_c::c_proxy(void*) () at zmq-common-proxyd_zmqctx.cpp:529
#6 0x00007fe730cf2fa3 in start_thread (arg=<optimized out>) at pthread_create.c:486
#7 0x00007fe730c234cf in clone () from /neteng/nagaraj.venkatapuram/OS10_repo.f10/10.5.3.2.99999/lib/x86_64-linux-gnu/libc.so.6

(gdb

gaurav-bhardwaj03

Hello Jason!

I am running multithreaded code and getting a segfault at a particular line.

As shown in the image, curr_node is a healthy node that I can inspect in my debugger.

Jason C. McDonald

Hi gaurav,

(1) I can't see the image.

(2) Multi-threaded segfaults are scary. Are you able to reproduce it single-threaded? If you can, it'll be easier to debug. If you cannot, it may be related to the threading itself.

(3) This is probably a lot deeper than just finding a segfault. You will want to audit your code for thread safety.

shrivp1

Hi Jason,

If we encounter a segfault with error code 4, or any other such error code:

localhost kernel: [139154.090095] xxxxx_process[11909]: segfault at 21 ip 00007ff5704b5254 sp 00007ff556bbbb98 error 4 in libmpi_global-release.so[7ff56c832000+6eee000]

but we see no assertions or errors and no core dumps are generated, how do we go about debugging such issues?

The core dump configuration has been verified and is correct:

Limit                Soft Limit   Hard Limit   Units
Max core file size   unlimited    unlimited    bytes

The flag to generate full core dumps has also been enabled.

Collapse
 
codemouse92 profile image
Jason C. McDonald

In almost all cases, it's very hard to debug a segfault without compiling the code in question with debug symbols (-g) and running it through Valgrind.

Core dumps are just snapshots of the raw memory when the program crashed, and will seldom provide any clues unless you are very familiar with the entire raw memory layout of your program.
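
As a side note, the kernel log line quoted in the question already carries a little information by itself. The sketch below assumes an x86 machine, where the "error" number is the hardware page-fault error code, and shows how its low bits and the instruction pointer could be unpacked. The struct and function names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Meaning of the low bits of the x86 page-fault error code. */
struct fault_bits {
    int protection_violation; /* bit 0: 0 = page not present, 1 = access not permitted */
    int write_access;         /* bit 1: 0 = read, 1 = write */
    int user_mode;            /* bit 2: 1 = fault happened in user-mode code */
    int instruction_fetch;    /* bit 4: 1 = fault while fetching an instruction */
};

/* Hypothetical helper: split the "error N" value from the kernel log line. */
struct fault_bits decode_x86_fault_code(unsigned code)
{
    struct fault_bits bits;
    bits.protection_violation = (int)((code >> 0) & 1u);
    bits.write_access         = (int)((code >> 1) & 1u);
    bits.user_mode            = (int)((code >> 2) & 1u);
    bits.instruction_fetch    = (int)((code >> 4) & 1u);
    return bits;
}

/* Hypothetical helper: offset of the faulting instruction inside the module
   named in the log, given the "ip" value and the "[base+size]" pair. */
uint64_t offset_in_module(uint64_t ip, uint64_t base, uint64_t size)
{
    assert(ip >= base && ip - base < size);
    return ip - base;
}
```

Under that assumption, error 4 decodes to a user-mode read of a page that is not mapped, and the faulting address 0x21 is small enough to suggest a null pointer plus a member offset, though that is only a guess. offset_in_module(0x7ff5704b5254, 0x7ff56c832000, 0x6eee000) gives an offset into libmpi_global-release.so that a tool such as addr2line can try to resolve, provided a matching build of the library with symbols is available.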

shrivp1

Thank you for your response !!!
The problem is that we are seeing this issue only in our client's environment. We cannot reproduce it in-house, so we cannot compile our code with -g or attach Valgrind to a process to investigate.

Using Valgrind would add performance overhead in the customer environment, so it's not a viable option. Is there any other way to track this live in a client environment for a particular process?

Jason C. McDonald • Edited

Off the top of my head, I don't know of any practical ways to debug a segfault in production like you describe. You could use logs and observations to determine what behavior(s) precede the segfault, and use that to focus in on part of your code base.

Meanwhile, your best bet would be to try and isolate what's different about their environment versus your test environment, and try and replicate it.

In any case, this won't be easy. This roadblock you're running into is exactly why it is so often said "if we can't reproduce it, we can't fix it".
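
One way to get slightly better logs in a situation like this is to have the process itself record where it faulted. The following is only a sketch of that idea, assuming a POSIX system: it installs a SIGSEGV handler with sigaction() and SA_SIGINFO and writes the faulting address to a file descriptor. The names install_segv_reporter and reported_fault_address are invented here, and the fork-based demo exists only so the behavior can be observed safely.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int report_fd = -1; /* where the handler writes the faulting address */

/* Handler restricted to async-signal-safe calls: write() and _exit(). */
static void on_segv(int sig, siginfo_t *info, void *context)
{
    (void)sig;
    (void)context;
    uintptr_t addr = (uintptr_t)info->si_addr;
    ssize_t ignored = write(report_fd, &addr, sizeof addr);
    (void)ignored;
    _exit(1);
}

/* Hypothetical helper: install the reporting handler. Returns 0 on success. */
int install_segv_reporter(int fd)
{
    assert(fd >= 0);
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    report_fd = fd;
    return sigaction(SIGSEGV, &sa, NULL);
}

/* Demo: a child reads from the address `bad`; the parent returns the address
   that the child's handler reported, or 0 if nothing was reported. */
uintptr_t reported_fault_address(uintptr_t bad)
{
    int fds[2];
    if (pipe(fds) != 0)
        return 0;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return 0;
    }
    if (pid == 0) {
        close(fds[0]);
        if (install_segv_reporter(fds[1]) != 0)
            _exit(2);
        volatile char *p = (volatile char *)bad;
        char c = *p; /* expected to fault */
        (void)c;
        _exit(0);
    }

    close(fds[1]);
    uintptr_t addr = 0;
    ssize_t n = read(fds[0], &addr, sizeof addr);
    close(fds[0]);
    waitpid(pid, NULL, 0);
    return n == (ssize_t)sizeof addr ? addr : 0;
}
```

In a real deployment the handler would more likely write to a log file, and it would need an alternate signal stack (sigaltstack) to survive stack overflows. It tells you where the access went wrong, not why, so it complements rather than replaces reproducing the bug.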

shrivp1

Thank you, Jason, for the advice.
We will see if we can identify a difference between the production environment and ours.
The logs weren't much help in reaching a conclusion in this case.

Vilsenhet

Hi Jason! I've been having some trouble with an open source computational chemistry program; maybe you can help. The main program is called Enso (github.com/grimme-lab/enso), but I'm specifically running into the issue with one of the resources included with it: the statically compiled binary called anmr. I'm running Windows Subsystem for Linux with Ubuntu 20.10. My PC has 32 GB of RAM and a Ryzen 7 4800H processor (8 cores).

The issue:
Every time I call the program, even with 'anmr -h' just to bring up the help menu, I get the following output:
forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image    PC                  Routine    Line       Source
anmr     000000000180DB13    Unknown    Unknown    Unknown
anmr     00000000019E2130    Unknown    Unknown    Unknown
anmr     000000000040A133    Unknown    Unknown    Unknown
anmr     0000000000403542    Unknown    Unknown    Unknown
anmr     00000000019E35B0    Unknown    Unknown    Unknown
anmr     0000000000403427    Unknown    Unknown    Unknown

What I've already tried:
I'm not the first one to encounter this error, however, everyone else was able to solve it by unlimiting stack size with "ulimit -s unlimited". The short version: it didn't work for me. The long version: I couldn't initially use the command because I didn't have permission to change the stack size hard limit. I managed to change that by altering files such as /etc/security/limits.conf, /etc/pam.d/common-session and so on. When I initially open the terminal, the limit changes don't take effect, but they work when I use "su vilsenhet" even though that's my default profile when I open the terminal. After unlimiting stack size, core file size, etc, calling anmr gives the same segfault error. I reached out to the original programmer for help, and he checked for errors and recompiled the binary for me. It didn't fix the problem. I've tested the archive with 7zip which found no errors. I've added every directory I could think of to my path and LD_LIBRARY_PATH. I've tried a few other longshot fixes I've found by googling the problem, but nothing helps. I'm at a loss.

I'm pretty new to Linux and to programming in general, so I really have no idea, but I'm thinking maybe the problem arises from the fact that I have to switch user after my initial login to be able to alter the ulimits, so somehow they don't actually apply to the program when I call it? I don't know. Any insight, tips, or tricks you could offer would be greatly appreciated. Thanks!
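
One way to test the suspicion that the limits are not reaching the program is to ask a process what limits it actually has. The sketch below assumes a POSIX system and uses getrlimit() with RLIMIT_STACK. The helper names stack_limits and print_limit are invented for this example, and the code reports the limits of whichever process runs it, not of anmr itself.

```c
#include <assert.h>
#include <stdio.h>
#include <sys/resource.h>

/* Hypothetical helper: fetch the soft and hard stack limits of the calling
   process. Returns 0 on success and -1 on failure. */
int stack_limits(rlim_t *soft, rlim_t *hard)
{
    assert(soft != NULL && hard != NULL);
    struct rlimit lim;
    if (getrlimit(RLIMIT_STACK, &lim) != 0)
        return -1;
    *soft = lim.rlim_cur;
    *hard = lim.rlim_max;
    return 0;
}

/* Print one limit, spelling out the special "no limit" value. */
void print_limit(const char *label, rlim_t value)
{
    if (value == RLIM_INFINITY)
        printf("%s: unlimited\n", label);
    else
        printf("%s: %llu bytes\n", label, (unsigned long long)value);
}
```

Since child processes inherit resource limits, a small program that calls stack_limits() and print_limit() from main, started from the same shell and in the same way as anmr, should show what anmr would see. On Linux, the file /proc/<pid>/limits gives the same information for a process that is already running.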

Jason C. McDonald

First step w/ these is always to compile the program with debug symbols and run it through Valgrind. It's almost impossible to track down the source of the problem without knowing where in the code the program aborts.