
Frits Hoogland for YugabyteDB


Mirage of memory, part 3: overcommit

This is a series of blogposts about linux memory management. This blogpost is about overcommit.

What is 'overcommit'? In linux, overcommit means the kernel allows more memory to be defined as allocated (every allocation adds to the VSZ, or virtual set size) than the virtual memory size of the linux system can provide. Virtual memory here is physically available memory plus swap.

The linux kernel facilitates overcommit because the kernel parameter /proc/sys/vm/overcommit_memory has a default value of 0, which means "heuristic overcommit handling".

This means that a process can start up and allocate more memory than is available, and everything works happily. Once the process that allocated the huge amount of memory starts paging it in, and thus starts growing its actual memory footprint, it will force the use of swap if the paged-in amount exceeds available memory, and once swap has run out too, there is no memory left at all, so the kernel has to resort to the out of memory killer (OOM killer) to free memory.

The first thing to notice is that the kernel does keep track of the initial allocation that adds to the virtual set size. It also doesn't allow arbitrarily excessive allocations; it only allows allocations within a certain range:

Example of VSZ allocation (only):

$ target/debug/eatmemory -s 1000 -a 0 -v
total memory        :      1860 MB
available memory    :      1513 MB
free memory         :       438 MB
used memory         :       171 MB
total swap          :      2158 MB
free swap           :      2158 MB
used swap           :         0 MB
creating vec with size 1000 MB
pointer to hog: 0x7f43dd7ff010
allocating vec for 0 MB (-1 means creation size)
total memory        :      1860 MB
available memory    :      1512 MB
free memory         :       437 MB
used memory         :       172 MB
total swap          :      2158 MB
free swap           :      2158 MB
used swap           :         0 MB
done. press enter to stop and deallocate

In this example I am allocating a certain amount of memory (-s 1000, in megabytes), but do not actually page it in (-a 0). As the second memory overview, taken after the vec (an array in Rust) was created, shows, available memory only decreased by 1M: the 1000M was allocated, but not actually paged in.
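
The eatmemory source isn't shown here, but the allocate-versus-page-in distinction it demonstrates is easy to sketch in Rust. A minimal, hypothetical sketch of the same idea (not the actual eatmemory code):

fn main() {
    const MB: usize = 1024 * 1024;
    let size_mb = 1000;

    // Allocation only: this reserves 1000 MB of address space and adds it
    // to the process VSZ, but no physical page is touched yet.
    let mut hog: Vec<u8> = Vec::with_capacity(size_mb * MB);
    println!("allocated {} MB of address space (VSZ grows, RSS does not)", size_mb);

    // Paging in: writing the memory forces the kernel to back the pages
    // with physical memory, so only now does RSS ('used memory') grow.
    hog.resize(size_mb * MB, 1);
    println!("paged in {} MB (RSS grows)", size_mb);
}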

Now let's perform exactly the same, but with an excessive amount: 10000M, roughly 10G, which is far more than the roughly 2G of memory in the system:

$ target/debug/eatmemory -s 10000 -a 0 -v
total memory        :      1860 MB
available memory    :      1513 MB
free memory         :       438 MB
used memory         :       171 MB
total swap          :      2158 MB
free swap           :      2158 MB
used swap           :         0 MB
creating vec with size 10000 MB
memory allocation of 10485760000 bytes failed
Aborted (core dumped)

The program tried to allocate 10G, and the kernel refused the allocation, upon which the program aborted.

For new allocations, the function __vm_enough_memory() in mm/mmap.c checks whether a process is allowed to allocate the amount of memory it wants.

On a linux system with default overcommit settings (vm.overcommit_memory=0, vm.overcommit_ratio=50, vm.overcommit_kbytes=0), this performs heuristic overcommit handling: the kernel estimates how much memory could be made available (free memory, page cache, reclaimable slab and swap), and refuses only allocations that exceed that estimate. In other words, only clearly excessive single allocations get refused.
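
The actual check is C code inside the kernel; as a rough illustration only (in Rust, with simplified inputs, not the kernel's exact bookkeeping), the heuristic boils down to something like:

// Rough sketch of heuristic overcommit (vm.overcommit_memory=0): a single
// request is refused only if it exceeds an estimate of what could be made
// available. Note that nothing here looks at the total already committed.
fn heuristic_allows(request_pages: u64,
                    free_pages: u64,
                    page_cache_pages: u64,
                    reclaimable_slab_pages: u64,
                    swap_pages: u64,
                    kernel_reserve_pages: u64) -> bool {
    // everything the kernel could plausibly free up for this request
    let available = free_pages + page_cache_pages
        + reclaimable_slab_pages + swap_pages;
    // keep a reserve for the kernel itself
    request_pages <= available.saturating_sub(kernel_reserve_pages)
}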

This is the default, and it is fine if the processes either do not allocate more memory than is available, or, if they do, do not actually page it in.

Hopefully the first blogpost showed that linux indeed tries to minimise paging in (truly allocating pages), both to speed things up and to be economical with memory. So the default setting vm.overcommit_memory=0 makes sense.

But there are cases where overcommit leads to filling up memory, which leads to OOM kills and obviously bad performance. The underlying issue cannot be taken away by changing overcommit settings: too much memory allocation remains too much memory allocation.

This is how that can happen. Here is a test system with no excessive memory allocations:

$ grep Commit /proc/meminfo
CommitLimit:     3088868 kB
Committed_AS:     502396 kB

This means the CommitLimit is roughly 3G: CommitLimit = ((total memory - hugepages) * vm.overcommit_ratio / 100) + total swap, which with the default ratio of 50 is half of the 1860M of memory plus the 2158M of swap. Committed_AS is the currently allocated virtual memory: 502396 kB.
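
To see where that CommitLimit number comes from, here is a small Rust sketch that recomputes it from /proc/meminfo and the overcommit ratio (assuming the default vm.overcommit_kbytes=0, and ignoring hugepages for simplicity):

use std::fs;

fn main() {
    let meminfo = fs::read_to_string("/proc/meminfo").unwrap();
    // fetch a kB value from /proc/meminfo by field name
    let field = |name: &str| -> u64 {
        meminfo.lines()
            .find(|line| line.starts_with(name))
            .and_then(|line| line.split_whitespace().nth(1))
            .and_then(|value| value.parse().ok())
            .unwrap_or(0)
    };
    let ratio: u64 = fs::read_to_string("/proc/sys/vm/overcommit_ratio")
        .unwrap().trim().parse().unwrap();

    // CommitLimit = (MemTotal - hugepages) * overcommit_ratio / 100 + SwapTotal
    let computed = field("MemTotal:") * ratio / 100 + field("SwapTotal:");
    println!("computed CommitLimit : {} kB", computed);
    println!("kernel's CommitLimit : {} kB", field("CommitLimit:"));
}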

Now, using a few sessions running eatmemory -s 2000 -a 0, this is how it looks:

1 session of eatmemory for 2G:

$ grep Commit /proc/meminfo
CommitLimit:     3088868 kB
Committed_AS:    2555488 kB

(Committed_AS has increased by roughly 2G, to 2.5G)
2 sessions of eatmemory for 2G:

$ grep Commit /proc/meminfo
CommitLimit:     3088868 kB
Committed_AS:    4608740 kB

(Yup, we are over the commit limit: the total allocated size, alias Committed_AS, increased to 4.6G)
3 sessions of eatmemory for 2G:

$ grep Commit /proc/meminfo
CommitLimit:     3088868 kB
Committed_AS:    6660000 kB

This means that with the default vm.overcommit_memory=0 setting, we can allocate (not page in!) memory beyond the memory size, and beyond the CommitLimit, as long as each individual allocation fits within available memory, or "is reasonable": three sessions of 2G each, plus the roughly 0.5G baseline, add up to the 6.6G of Committed_AS, more than twice the CommitLimit.

However, as you could see in the beginning, the total amount of memory available to the kernel is 1860M. So if the processes holding the 6.6G of committed memory start paging it in, it's not hard to imagine the system getting into memory stress first, and having to resort to OOM killing later.

This cannot be solved in the absolute sense, because the kernel cannot change what you and your programs do. However, the detection of overallocation can be moved to an earlier point in time, namely the time of the allocation itself, to prevent a linux system from getting overcommitted. This is done by setting vm.overcommit_memory=2.

With vm.overcommit_memory=2, which means 'never overcommit', the kernel's overcommit detection and refusal become strict. This does not change the fact that a process might overallocate. What it does change is that a process fails at allocation time, rather than getting killed by the OOM killer during paging in, when the linux system is already under a lot of stress and performing badly.

How does that work? As we have seen, the kernel keeps track of all allocated memory, whether paged in or not. This is visible as the statistic Committed_AS ('committed address space') in /proc/meminfo.

Especially on systems where performance and low latency are key, it might be beneficial to make linux fail allocations early. That way you are notified about allocating too much memory at allocation time, rather than later when the memory actually gets used. Again: it doesn't take away the problem, it just tells you earlier.

Example:

# echo 2 > /proc/sys/vm/overcommit_memory
# grep Commit /proc/meminfo
CommitLimit:     3088868 kB
Committed_AS:     502384 kB

1 eatmemory session for 2G:

# grep Commit /proc/meminfo
CommitLimit:     3088868 kB
Committed_AS:    2555344 kB

Second eatmemory session for 2G:

$ ./target/debug/eatmemory -s 2000 -a 0 -v
total memory        :      1860 MB
available memory    :      1497 MB
free memory         :       412 MB
used memory         :       187 MB
total swap          :      2158 MB
free swap           :      2158 MB
used swap           :         0 MB
creating vec with size 2000 MB
memory allocation of 2097152000 bytes failed
Aborted (core dumped)

So instead of the OOM killer having to snipe a session at runtime, vm.overcommit_memory=2 prevents a session from allocating too much memory.
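
A side note on the program side of this: the "Aborted (core dumped)" seen above is Rust's default reaction to a failed allocation. A program that wants to handle the refusal gracefully can test the allocation itself, for example with Vec::try_reserve; a minimal sketch:

fn main() {
    const MB: usize = 1024 * 1024;
    let mut hog: Vec<u8> = Vec::new();
    // try_reserve returns an Err on allocation failure instead of aborting
    // the process, so a refusal under vm.overcommit_memory=2 can be handled.
    match hog.try_reserve(10_000 * MB) {
        Ok(()) => println!("kernel accepted the 10000 MB allocation"),
        Err(error) => eprintln!("allocation refused: {error}"),
    }
}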

conclusion

By default, the linux kernel allows the total of all memory allocations (Committed_AS) to exceed the set 'commit limit' (CommitLimit). The size of this commit limit is determined by vm.overcommit_ratio, which is 50 by default; the overcommit behaviour is set by vm.overcommit_memory, which defaults to 0, heuristic overcommit.

Heuristic overcommit means that each individual memory allocation must fit within an estimate of the memory that can be made available; the total of all allocations is not checked.

If your application, or application usage, leads to too much memory allocation, the heuristic overcommit setting means memory allocation issues expose themselves at runtime, when processes get stuck or killed. That might be too late. This can be prevented by setting vm.overcommit_memory to 2, which does not allow allocations to exceed the commit limit, and thus makes them fail much earlier.

This is not new information, and it is actually the advice in the postgres documentation.

closing thoughts and warning

There is a very interesting PostgreSQL side effect of how the postmaster deals with backends that are affected by memory issues: with the default linux overcommit setting, postgres backends can make a server run out of virtual memory, leading to the OOM killer. The OOM killer will snipe a postgres backend or the postmaster. If the postmaster gets sniped, all sessions are terminated, because their parent terminated.

However, if a postgres backend crashes, such as when it is terminated by the OOM killer, the postmaster requests all backends to roll back, terminates them, and reinitialises the postgres server (example of killing a single backend manually):

2022-02-08 11:20:00.221 UTC [10581] LOG:  server process (PID 10639) was terminated by signal 9: Killed
2022-02-08 11:20:00.221 UTC [10581] DETAIL:  Failed process was running: select 1+1;
2022-02-08 11:20:00.221 UTC [10581] LOG:  terminating any other active server processes
2022-02-08 11:20:00.222 UTC [10587] WARNING:  terminating connection because of crash of another server process
2022-02-08 11:20:00.222 UTC [10587] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-02-08 11:20:00.222 UTC [10587] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2022-02-08 11:20:00.223 UTC [10641] WARNING:  terminating connection because of crash of another server process
2022-02-08 11:20:00.223 UTC [10641] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-02-08 11:20:00.223 UTC [10641] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2022-02-08 11:20:00.224 UTC [10581] LOG:  all server processes terminated; reinitializing
2022-02-08 11:20:00.234 UTC [10647] LOG:  database system was interrupted; last known up at 2022-02-08 11:19:22 UTC
2022-02-08 11:20:00.334 UTC [10647] LOG:  database system was not properly shut down; automatic recovery in progress
2022-02-08 11:20:00.335 UTC [10647] LOG:  redo starts at 0/165E598
2022-02-08 11:20:00.335 UTC [10647] LOG:  invalid record length at 0/165E5D0: wanted 24, got 0
2022-02-08 11:20:00.335 UTC [10647] LOG:  redo done at 0/165E598
2022-02-08 11:20:00.341 UTC [10581] LOG:  database system is ready to accept connections

This means that even though (out of memory) termination might initially affect one backend, it escalates through the entire local database and affects all database backends.

If the linux kernel parameter vm.overcommit_memory is set to not overcommit (2), the postgres backend's memory allocation simply fails with an ENOMEM error (an error return from the allocation, not a signal).

PostgreSQL out of memory behaviour can very easily be investigated using this small anonymous code block:

(memory_filler.sql)

do $$
declare
 filler text[];
 counter int:=0;
begin
 loop
  -- append roughly 16 kB of text (10 characters repeated 1638 times) per
  -- iteration, growing the array, and thus backend memory, without bound
  filler[counter] := repeat('0123456789',1638);
  counter:=counter+1;
 end loop;
end $$;

If executed against a linux server with vm.overcommit_memory=0 (default), this results in the earlier mentioned complete postgres restart:

2022-02-08 12:16:40.977 UTC [10581] LOG:  server process (PID 10800) was terminated by signal 9: Killed
2022-02-08 12:16:40.977 UTC [10581] DETAIL:  Failed process was running: do $$
    declare
     filler text[];
     counter int:=0;
    begin
     loop
      filler[counter] := repeat('0123456789',1638);
      counter:=counter+1;
     end loop;
    end $$;
2022-02-08 12:16:40.982 UTC [10581] LOG:  terminating any other active server processes
2022-02-08 12:16:41.034 UTC [10796] WARNING:  terminating connection because of crash of another server process
2022-02-08 12:16:41.034 UTC [10796] DETAIL:  The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2022-02-08 12:16:41.034 UTC [10796] HINT:  In a moment you should be able to reconnect to the database and repeat your command.
2022-02-08 12:16:41.113 UTC [10581] LOG:  all server processes terminated; reinitializing
2022-02-08 12:16:41.139 UTC [10802] LOG:  database system was interrupted; last known up at 2022-02-08 12:15:51 UTC
2022-02-08 12:16:41.251 UTC [10802] LOG:  database system was not properly shut down; automatic recovery in progress
2022-02-08 12:16:41.252 UTC [10802] LOG:  redo starts at 0/165E6E8
2022-02-08 12:16:41.252 UTC [10802] LOG:  invalid record length at 0/165E720: wanted 24, got 0
2022-02-08 12:16:41.252 UTC [10802] LOG:  redo done at 0/165E6E8
2022-02-08 12:16:41.262 UTC [10581] LOG:  database system is ready to accept connections

It should be noted this behaviour is not guaranteed linux behaviour: what happens in the background is that linux runs out of virtual memory, which triggers the OOM killer, which evaluates process memory usage and kills the process with the largest RSS, adjusted by the OOM score adjustment (I did not trace this through the linux kernel code, corrections welcome).

This means that if PostgreSQL is not the only major consumer of memory, or if multiple PostgreSQL databases are running, the OOM killer can potentially pick another process than the allocating backend.
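
The kernel exposes the score it ranks processes by in /proc/<pid>/oom_score (adjustable via /proc/<pid>/oom_score_adj), so you can see who the likely victims are. A small Rust sketch that lists the top candidates:

use std::fs;

// List the top OOM-kill candidates by reading the kernel's per-process
// badness score from /proc/<pid>/oom_score (higher score = killed first).
fn main() {
    let mut scores: Vec<(u64, String)> = Vec::new();
    for entry in fs::read_dir("/proc").unwrap().flatten() {
        let pid = entry.file_name().to_string_lossy().to_string();
        if !pid.chars().all(|c| c.is_ascii_digit()) {
            continue; // not a process directory
        }
        let score = fs::read_to_string(format!("/proc/{pid}/oom_score"))
            .ok().and_then(|s| s.trim().parse().ok()).unwrap_or(0);
        let comm = fs::read_to_string(format!("/proc/{pid}/comm"))
            .unwrap_or_default().trim().to_string();
        scores.push((score, format!("pid {pid} ({comm})")));
    }
    scores.sort_by(|a, b| b.0.cmp(&a.0));
    for (score, process) in scores.iter().take(5) {
        println!("oom_score {score:>4}  {process}");
    }
}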

When vm.overcommit_memory is set to 2, this becomes different; this is what happens when the memory_filler.sql anonymous code block is run:

postgres=# \i memory_filler.sql
psql:memory_filler.sql:10: ERROR:  out of memory
DETAIL:  Failed on request of size 16384 in memory context "expanded array".
CONTEXT:  PL/pgSQL function inline_code_block line 7 at assignment

And this is visible in the PostgreSQL general logfile:

TopMemoryContext: 79496 total in 6 blocks; 8944 free (6 chunks); 70552 used
  Operator lookup cache: 24576 total in 2 blocks; 10760 free (3 chunks); 13816 used
  CFuncHash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  Rendezvous variable hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
...further memory allocation details
  MdSmgr: 8192 total in 1 blocks; 7264 free (0 chunks); 928 used
  LOCALLOCK hash: 8192 total in 1 blocks; 560 free (0 chunks); 7632 used
  Timezones: 104120 total in 2 blocks; 2624 free (0 chunks); 101496 used
  ErrorContext: 8192 total in 1 blocks; 7936 free (5 chunks); 256 used
Grand total: 2563990888 bytes in 156012 blocks; 242296 free (134 chunks); 2563748592 used
2022-02-08 12:27:05.659 UTC [10817] ERROR:  out of memory
2022-02-08 12:27:05.659 UTC [10817] DETAIL:  Failed on request of size 16384 in memory context "expanded array".
2022-02-08 12:27:05.659 UTC [10817] CONTEXT:  PL/pgSQL function inline_code_block line 7 at assignment
2022-02-08 12:27:05.659 UTC [10817] STATEMENT:  do $$
    declare
     filler text[];
     counter int:=0;
    begin
     loop
      filler[counter] := repeat('0123456789',1638);
      counter:=counter+1;
     end loop;
    end $$;

There is no escalation triggering database-wide backend restarts: the backend runs into an out of memory error, and simply fails.
