Discussion on: I'm an Expert in Memory Management & Segfaults, Ask Me Anything!

Hi Jason,

Suppose we encounter a segfault with error code 4, or any similar error code:

localhost kernel: [139154.090095] xxxxx_process[11909]: segfault at 21 ip 00007ff5704b5254 sp 00007ff556bbbb98 error 4 in libmpi_global-release.so[7ff56c832000+6eee000]

However, we see no assertions or errors, and no core dump is generated. How do we go about debugging such issues?

The core dump configuration has been verified and is correct:
Limit                     Soft Limit           Hard Limit           Units
Max core file size        unlimited            unlimited            bytes

The flag to generate full core dumps has also been enabled.
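
When the limit is unlimited and still no core file appears, a couple of other kernel settings are worth a look. The following is only a rough sketch, assuming a Linux host with /proc mounted; the function names `read_text_or_none` and `core_dump_report` are made up for illustration. It reads the core limit of a given process, then checks whether `core_pattern` pipes dumps to a handler program (in which case the core will not land in the working directory) and what `suid_dumpable` is set to.

```python
from pathlib import Path


def read_text_or_none(path):
    """Return the stripped contents of a file, or None if it cannot be read."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def core_dump_report(pid="self"):
    """Collect settings that commonly decide whether a core file is written.

    `pid` is a process ID, or "self" for the current process (hypothetical
    helper, not part of any library).
    """
    report = {"core_limit": None}

    # /proc/<pid>/limits holds the limits of the running process, which can
    # differ from what `ulimit -c` shows in an interactive shell.
    limits = read_text_or_none(f"/proc/{pid}/limits")
    for line in (limits or "").splitlines():
        if line.startswith("Max core file size"):
            # Remaining columns: soft limit, hard limit, units
            fields = line[len("Max core file size"):].split()
            report["core_limit"] = {"soft": fields[0], "hard": fields[1]}

    # A pattern starting with "|" hands the dump to a user-space program
    # instead of writing a file named by the pattern.
    pattern = read_text_or_none("/proc/sys/kernel/core_pattern")
    report["core_pattern"] = pattern
    report["piped_to_handler"] = bool(pattern) and pattern.startswith("|")

    # 0 here means processes that changed privileges do not dump core.
    report["suid_dumpable"] = read_text_or_none("/proc/sys/fs/suid_dumpable")
    return report


report = core_dump_report()
for key, value in report.items():
    print(f"{key}: {value}")
```

If `piped_to_handler` comes back true, the dump may be sitting with whatever tool the pattern names rather than missing altogether. A working directory the process cannot write to is another possibility that this sketch does not check.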

Jason C. McDonald • Jul 2 '21

In almost all cases, it's very hard to debug a segfault without compiling the code in question with debug symbols (-g) and running it through Valgrind.

Core dumps are just snapshots of the raw memory when the program crashed, and will seldom provide any clues unless you are very familiar with the entire raw memory layout of your program.

shrivp1 • Jul 3 '21

Thank you for your response!
The problem is that we see this issue only in our client's environment. We cannot reproduce it in-house, where we could compile our code with -g or run a process under Valgrind.

Using Valgrind would add performance overhead in the customer environment, so it is not a viable option. Is there any other way to track this live, in a client environment, for a particular process?

Jason C. McDonald • Jul 3 '21 • Edited

Off the top of my head, I don't know of any practical ways to debug a segfault in production like you describe. You could use logs and observations to determine what behavior(s) precede the segfault, and use that to focus in on part of your code base.
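
As one example of squeezing more out of the logs, the kernel line quoted at the top of the thread can be decoded on its own. This is a minimal sketch, assuming the usual x86 Linux message format; `decode_segfault` and `SEGFAULT_RE` are hypothetical names. It extracts the faulting address, interprets the page-fault error code bits, and computes the instruction offset inside the shared object.

```python
import re

# Matches the x86 Linux kernel message:
#   segfault at <addr> ip <ip> sp <sp> error <code> in <object>[<base>+<size>]
# All numeric fields, including the error code, are printed in hex.
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) "
    r"sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Decode a kernel segfault log line into a small dictionary."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        raise ValueError("not a recognised segfault line")
    err = int(m["err"], 16)
    return {
        "fault_address": int(m["addr"], 16),
        "object": m["obj"],
        # Offset of the faulting instruction from the start of the mapping
        "offset_in_object": int(m["ip"], 16) - int(m["base"], 16),
        # Bit 0: 0 = page not mapped, 1 = protection violation
        "cause": "protection violation" if err & 1 else "page not mapped",
        # Bit 1: 0 = read, 1 = write
        "access": "write" if err & 2 else "read",
        # Bit 2: 0 = kernel mode, 1 = user mode
        "mode": "user" if err & 4 else "kernel",
        # Bit 4: the fault happened while fetching an instruction
        "instruction_fetch": bool(err & 16),
    }


LINE = (
    "xxxxx_process[11909]: segfault at 21 ip 00007ff5704b5254 "
    "sp 00007ff556bbbb98 error 4 in "
    "libmpi_global-release.so[7ff56c832000+6eee000]"
)
info = decode_segfault(LINE)
print(hex(info["fault_address"]))     # 0x21
print(info["cause"], info["access"], info["mode"])
print(hex(info["offset_in_object"]))  # 0x3c83254
```

For that line, error 4 decodes to a user-mode read of an unmapped page, and an address as low as 0x21 often suggests a field access through a null pointer. If the reported mapping starts at the library's load base (an assumption that does not hold for every toolchain), the offset can be looked up with a tool such as `addr2line -e` against an identical build of the library that has debug symbols, with nothing running on the customer machine.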

Meanwhile, your best bet would be to try and isolate what's different about their environment versus your test environment, and try and replicate it.

In any case, this won't be easy. This roadblock you're running into is exactly why it is so often said "if we can't reproduce it, we can't fix it".

shrivp1 • Jul 5 '21

Thank you, Jason, for the advice.
We will see if we can identify a difference in the production environment, and in general.
The logs were not much help in reaching a logical conclusion in this case.