DEV Community

Discussion on: I'm an Expert in Memory Management & Segfaults, Ask Me Anything!

mmvillar • Edited

Hello, Jason

I work on an in-house code that simulates fluid dynamics. It currently has about four hundred thousand lines. An important part of the code uses the PETSc library to solve linear systems. The code uses MPI for parallel communication and is written in C and Fortran, mostly Fortran 90. After moving to a newer version of GCC (>8.0.0), the code started to show a memory leak whenever the PETSc library is active. I tried Valgrind and Dr. Memory to locate the leak, but I still could not find the problem. I'm quite sure that the problem is not in the PETSc library itself, but in the way I communicate the global matrix. I noticed that you have expertise in memory leaks; perhaps you could look at the Valgrind log and help me detect where the leak is happening.

In the Valgrind log there is a lot of information, but all of it points to the MPI library or the HDF5 library; none of it points to the f90 files or the PETSc library. In that situation, is it still possible to locate the memory leak?

This is how I run with Valgrind:
mpirun -n 2 valgrind -v --leak-check=full --show-leak-kinds=all --log-file=valgrind%p.log --track-origins=yes ./amr3d

Thanks in advance
Millena

Jason C. McDonald • Edited

Hi Millena,

Of course, without seeing the Valgrind log itself, it's hard to say. (If you do post that, please use a GitHub Gist, a Hastebin, or something of the sort.)


First, some technical background. Apologies if you already know some/all of this. It's also for other readers:

Memory leaks can be an issue, but they aren't necessarily a sign that something is wrong. In some particularly complex programs or libraries, it is impractical to manually free all of the allocated memory at program termination, so it is acceptable to let the operating system reclaim that memory when the process exits and its entire stack and heap are released. In that sense, a perfectly memory-safe program can still report memory leaks.

However, as you know, a memory leak can become an issue if it is occurring during the program's lifetime, instead of just before termination.
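
One rough way to tell those two cases apart is to run the same binary twice under Valgrind, once briefly and once for longer (for example, with more time steps), and compare the LEAK SUMMARY totals. A leak that happens during the program's lifetime should grow with run length, while one-time allocations left for the OS should stay flat. The sketch below assumes the logs are ordinary Memcheck logs; the function names are my own, and the two sample logs are made up for illustration, not taken from the log in this thread.

```python
import re

# Matches LEAK SUMMARY lines such as
# "==4242==    definitely lost: 1,024 bytes in 2 blocks"
SUMMARY_LINE = re.compile(
    r"^==\d+==\s+"
    r"(definitely lost|indirectly lost|possibly lost|still reachable|suppressed):"
    r"\s+([\d,]+) bytes in ([\d,]+) blocks"
)


def leak_summary(log_text):
    """Return {kind: bytes} from the LEAK SUMMARY section of one Memcheck log."""
    totals = {}
    for line in log_text.splitlines():
        match = SUMMARY_LINE.match(line)
        if match:
            kind, n_bytes, _n_blocks = match.groups()
            totals[kind] = int(n_bytes.replace(",", ""))
    return totals


def growth(short_run, long_run):
    """Bytes gained per leak kind between a short run and a longer run."""
    kinds = sorted(set(short_run) | set(long_run))
    return {kind: long_run.get(kind, 0) - short_run.get(kind, 0) for kind in kinds}


# Made-up sample logs (hypothetical numbers, for illustration only).
SHORT_RUN_LOG = """\
==4242== LEAK SUMMARY:
==4242==    definitely lost: 1,024 bytes in 2 blocks
==4242==    indirectly lost: 0 bytes in 0 blocks
==4242==      possibly lost: 0 bytes in 0 blocks
==4242==    still reachable: 8,192 bytes in 4 blocks
==4242==         suppressed: 0 bytes in 0 blocks
"""

LONG_RUN_LOG = """\
==4343== LEAK SUMMARY:
==4343==    definitely lost: 10,240 bytes in 20 blocks
==4343==    indirectly lost: 0 bytes in 0 blocks
==4343==      possibly lost: 0 bytes in 0 blocks
==4343==    still reachable: 8,192 bytes in 4 blocks
==4343==         suppressed: 0 bytes in 0 blocks
"""

for kind, delta in growth(leak_summary(SHORT_RUN_LOG), leak_summary(LONG_RUN_LOG)).items():
    print(f"{kind}: {delta:+d} bytes")
```

In the made-up sample, "definitely lost" grows with the longer run while "still reachable" does not, which is the pattern you would expect from a leak inside the main loop rather than from setup code.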

In any case, a memory leak always has the same cause: memory is being dynamically allocated, but not freed before the last pointer to it goes out of scope. Thus, it becomes virtually impossible to free said allocated memory, so it can never be reused during the life of the program. If this happens enough times, you can actually run out of heap space.

One more issue that makes this difficult is that a memory leak can occur in one place, but be caused by usage elsewhere. For example, you might call a library function that allocates memory, but not realize you must call another library function to deallocate it. That's far more likely to occur in C than in C++ or Fortran, because C has no automatic cleanup tied to an object's lifetime, such as C++ destructors or Fortran's automatic deallocation of local allocatable variables. That's why the entire stack trace is so important.


Here are my initial thoughts on actually tracking this down.

First, I find it interesting that the problem only occurs after you start using a new version of GCC. This does not necessarily mean there's a bug in the compiler, however. Memory-related bugs have a tendency to behave strangely, turning up as Heisenbugs or Schroedinbugs. There are many more such freaks of nature besides.

I wonder if the conditions for a memory leak have been present in either your project's source or the library source for some time, but particular implementation details (or another bug?) in previous versions of GCC concealed its existence. Once the behavior of GCC or the standard library changed in the latest version, possibly fixing a bug, the memory leak was no longer being coincidentally masked. In fact, I'd wager this is the most likely possibility.

I'd also suspect that the memory leak itself actually is in the PETSc library, given that it must be involved for the leak to occur. It may even be possible that the latent bug existed in PETSc. However, the cause of the memory leak quite possibly originates from your source; your particular usage of the library may be triggering some sort of peculiar corner case in PETSc, wherein the bug resides.

The other possibility is that GCC 8 has a bug itself, but given the size and domain of your program and its libraries, that would be quite difficult to isolate.


And now the bad news: tracking this down is probably non-trivial. If you compile both the library and the source with -g, Valgrind should give you line numbers you can use to check the source. You'll need to work backwards to figure out what's wrong.

That would be the easy solution, and hopefully it's as far as you need to go.
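
For the "work backwards" part, a small script can save some scrolling: it groups the Memcheck loss records and reports, for each one, the first stack frame that lands in your own sources. This is only a sketch under assumptions: the function names and the `user_suffixes` parameter are mine, and the sample log (including the Fortran symbol and file names) is made up to show the shape of the output, not taken from the log in this thread. Frames built without -g show up as "(in /path/to/library)" with no file and line, so they are skipped, which is exactly why rebuilding with -g matters.

```python
import re

# Header of a loss record, e.g.
# "==4242== 1,024 bytes in 2 blocks are definitely lost in loss record 2 of 2"
RECORD_HEADER = re.compile(
    r"^==\d+==\s+(.+?) are "
    r"(definitely lost|indirectly lost|possibly lost|still reachable)"
    r" in loss record [\d,]+ of [\d,]+"
)
# Stack frame with debug info, e.g.
# "==4242==    by 0x10915A: some_function (some_file.f90:42)"
FRAME_WITH_LINE = re.compile(
    r"^==\d+==\s+(?:at|by) 0x[0-9A-Fa-f]+: (.+) \(([^()]+):(\d+)\)\s*$"
)
# Any stack frame, with or without debug info.
ANY_FRAME = re.compile(r"^==\d+==\s+(?:at|by) ")


def parse_loss_records(log_text):
    """Return a list of {"size", "kind", "frames"} dicts, one per loss record."""
    records = []
    current = None
    for line in log_text.splitlines():
        header = RECORD_HEADER.match(line)
        if header:
            current = {"size": header.group(1), "kind": header.group(2), "frames": []}
            records.append(current)
            continue
        if current is None:
            continue
        frame = FRAME_WITH_LINE.match(line)
        if frame:
            function, source_file, line_no = frame.groups()
            current["frames"].append((function, source_file, int(line_no)))
        elif not ANY_FRAME.match(line):
            current = None  # separator line or other output ends the record
    return records


def first_user_frame(record, user_suffixes=(".f90", ".F90", ".c")):
    """First frame whose source file looks like user code, or None."""
    for function, source_file, line_no in record["frames"]:
        if source_file.startswith("vg_"):
            continue  # Valgrind's own malloc/free replacement, not user code
        if source_file.endswith(tuple(user_suffixes)):
            return function, source_file, line_no
    return None


# Made-up sample (hypothetical symbols and files, for illustration only).
SAMPLE_LOG = """\
==4242== 72 bytes in 1 blocks are still reachable in loss record 1 of 2
==4242==    at 0x4C2FB0F: malloc (vg_replace_malloc.c:299)
==4242==    by 0x5A1B2C3: some_library_init (in /usr/lib/libexample.so)
==4242==
==4242== 1,024 bytes in 2 blocks are definitely lost in loss record 2 of 2
==4242==    at 0x4C2FB0F: malloc (vg_replace_malloc.c:299)
==4242==    by 0x10915A: __matrix_mod_MOD_build_matrix (matrix_mod.f90:42)
==4242==    by 0x109201: MAIN__ (main.f90:10)
==4242==
"""

for record in parse_loss_records(SAMPLE_LOG):
    print(record["kind"], "|", record["size"], "|", first_user_frame(record))
```

Records that come back with no user frame are either entirely inside a library or were built without debug info; the ones that do name a file and line in your code are the places to start reading.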

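I would also recommend testing your code against the LLVM Clang compiler, if possible. (Clang only handles the C sources, so the Fortran sources would need a different, non-GCC Fortran compiler.) Does the same error occur there? If it does, be glad! You need only pick apart your code and that of PETSc to find the problem.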

Otherwise, if GCC 8 does prove to be a necessary environmental factor for the memory leak to occur, and you cannot isolate the problem any other way, another rather involved thing you can do is to perform a bisect on compiler versions.

  1. Spin up a clean environment, such as in a virtual machine.

  2. Build your libraries and source with the last compiler version you remember working. Ensure the memory leak is not present.

  3. Build with the compiler version you know isn't working. Verify that the memory leak is present. (If it isn't, you can rule out the compiler; you're now probably dealing with a phase-of-the-moon bug, which will require you to figure out what on your development machine is causing the issue.)

  4. Assuming 2 and 3 have the expected outcomes, use those two compiler versions as your endpoints for a bisect. Check compiler versions in between, following the same workflow as a git bisect, until you know the first version that presents a problem.

  5. If you have the time and access, consider bisecting on the development versions leading up to the first version of the compiler that presents the issue.

  6. Check the changelogs for the version. If you did step 5, look through the commit messages. Try to isolate the change to the compiler that is contributing to the memory leak.
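
Steps 2 through 4 amount to a binary search over an ordered list of compiler versions. Here is a minimal sketch of that search, assuming the leak, once it appears, stays present in every later version. The `first_bad_version` function and the `leaks` callback are names I made up; in practice `leaks` would rebuild the libraries and your code with the given compiler, run under Valgrind, and return True if the leak shows up. The version list and the stand-in result below are only an example, not a claim about which GCC release is responsible.

```python
def first_bad_version(versions, leaks):
    """Binary-search `versions` (ordered oldest to newest) for the first
    version where `leaks(version)` returns True.

    Mirrors steps 2 and 3 by checking both endpoints before bisecting.
    """
    if leaks(versions[0]):
        raise ValueError("oldest version already leaks; pick an older endpoint")
    if not leaks(versions[-1]):
        raise ValueError("newest version does not leak; the compiler may not be the cause")

    good, bad = 0, len(versions) - 1  # known-good index, known-bad index
    while bad - good > 1:
        middle = (good + bad) // 2
        if leaks(versions[middle]):
            bad = middle   # first bad version is at or before `middle`
        else:
            good = middle  # first bad version is after `middle`
    return versions[bad]


# Example endpoints and candidates; substitute the versions you can install.
CANDIDATES = ["7.3.0", "7.4.0", "8.1.0", "8.2.0", "8.3.0"]

# Stand-in for the real "rebuild, run under Valgrind, check the log" step.
# The result here is invented purely so the sketch can run.
PRETEND_LEAKY = {"8.2.0", "8.3.0"}

print(first_bad_version(CANDIDATES, lambda version: version in PRETEND_LEAKY))
```

Each call to `leaks` is a full rebuild of everything, so the point of bisecting is to keep the number of rebuilds close to the logarithm of the number of candidate versions instead of trying them all.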

If you do all this, remember, this might not be an actual bug in the compiler. If you can determine what caused the behavior to change in the library, you may be able to find the latent bug in PETSc.


I hope that helps!

mmvillar

Hello, Jason
First of all, thanks a lot for your opinion and advice! I will try to work through what you suggested above, step by step. Below is a link to the Valgrind log. As the tests are performed, I would like to share the results with you. Could I contact you by email?

gist.github.com/mmvillar/ca0a726a4...

Again, thank you!

Jason C. McDonald • Edited

Sure, my contact info is on my personal website. Link is on my DEV profile. I can't guarantee that I'll be able to solve this remotely, as you know the code far better than I could hope to, but I'll help where I can.

EDIT: Looked at the log. Yeah, you'll definitely need to compile your dependencies with -g, in order to have all the information you need.