Currently, I’m configuring Redis as a caching service for our application, and during that I faced the question: do I need to set vm.overcommit_memory to the value 1, i.e. disable the overcommit checks – or not?
The question has been on my mind for a long time (see The story below), but only now have I found the time to get to the real root of it, put everything together, and write the following post.
It was originally posted in Russian, and this is my own translation. As there is really a lot of text, I hope I didn’t confuse anything during the translation. If you spot a mistake – please feel free to select the text with the mouse and press Shift+Enter to send me a notification.
So, the problem itself is that the Redis documentation and almost every HowTo/guide about Redis performance carelessly tell us to disable the Linux overcommit_memory mechanism by setting vm.overcommit_memory to 1, especially as a solution for the “fork — Cannot allocate memory” error.
In this post, we will try to figure out what exactly overcommit_memory is, where and how it is used, and whether we really need to change it in the current case, i.e. when Redis will be used for caching only.
Topics covered in this post:
- Why overcommitting is bad?
- Redis persistence
- Redis save, SAVE and BGSAVE
- Redis rdbSave() and rdbSaveBackground() functions
- fork() vs fork() vs clone()
- Redis – fork: Cannot allocate memory – the cause
- The overcommit_memory values
- The famous “heuristic algorithm” (Heuristic Overcommit handling)
- Checking vm.overcommit_memory
- Conclusions
- The story
Why overcommitting is bad?
As this part was translated for the original Russian post, here I’ll just copy-paste a part of the original English text. Read the full story – What is Overcommit? And why is it bad?
Overcommit refers to the practice of giving out virtual memory with no guarantee that physical storage for it exists. To make an analogy, it’s like using a credit card and not keeping track of your purchases. A system performing overcommit just keeps giving out virtual memory until the debt collector comes calling — that is, until some program touches a previously-untouched page, and the kernel fails to find any physical memory to instantiate it — and then stuff starts crashing down.
What happens when “stuff starts crashing down”? It can vary, but the Linux approach was to design an elaborate heuristic “OOM killer” in the kernel that judges the behavior of each process and decides who’s “guilty” of making the machine run out of memory, then kills the guilty parties. In practice this works fairly well from a standpoint of avoiding killing critical system processes and killing the process that’s “hogging” memory, but the problem is that no process is really “guilty” of using more memory than was available, because everyone was (incorrectly) told that the memory was available.
Suppose you don’t want this kind of uncertainty/danger when it comes to memory allocation? The naive solution would be to immediately and statically allocate physical memory corresponding to all virtual memory. To extend the credit card analogy, this would be like using cash for all your purchases, or like using a debit card. You get the safety from overspending, but you also lose a lot of fluidity. Thankfully, there’s a better way to manage memory.
The approach taken in reality when you want to avoid committing too much memory is to account for all the memory that’s allocated. In our credit card analogy, this corresponds to using a credit card, but keeping track of all the purchases on it, and never purchasing more than you have funds to pay off. This turns out to be the Right Thing when it comes to managing virtual memory, and in fact it’s what Linux does when you set the vm.overcommit_memory sysctl parameter to the value 2. In this mode, all virtual memory that could potentially be modified (i.e. has read-write permissions) or lacks backing (i.e. an original copy on disk or other device that it could be restored from if it needs to be discarded) is accounted for as “commit charge”, the amount of memory the kernel has committed/promised to applications. When a new virtual memory allocation would cause the commit charge to exceed a configurable limit (by default, the size of swap plus half the size of physical ram), the allocation fails.
Redis persistence
Redis uses two mechanisms to achieve data persistence: RDB snapshotting (a point-in-time snapshot), which copies the data from memory to disk, and AOF, which continuously appends to a log every write operation performed by the server. See more in the documentation – Redis Persistence.
The overcommit_memory setting comes into play when Redis creates a data snapshot from memory to the disk, specifically during the BGSAVE and BGREWRITEAOF commands execution.
Below we will concentrate on the BGSAVE command, during which Redis creates a child process that copies the data to the disk.
Redis save, SAVE and BGSAVE
Redis itself can be a bit confusing here: in its configuration file the save option is responsible for the BGSAVE operation.
However, Redis also has the SAVE command, but it works differently:
- SAVE is a synchronous command and blocks the server while the copy is being created
- BGSAVE, in its turn, is an asynchronous mechanism – it works in parallel to the main server’s process and doesn’t affect its operations and connected clients, thus it is the preferable way to create a backup
But in a case when BGSAVE can not be used, for example because of the “Can’t save in background: fork: Cannot allocate memory” error, one can use the SAVE command.
To check this, let’s use the strace tool.
Create a test config file redis-testing.conf:
save 1 1
port 7777
Run strace and redis-server using this config:
root@bttrm-dev-console:/home/admin# strace -o redis-trace.log redis-server redis-testing.conf
strace will write its output to the redis-trace.log file, which we will check to find the system calls used by the redis-server during the SAVE and BGSAVE operations:
root@bttrm-dev-console:/home/admin# tail -f redis-trace.log | grep -v 'gettimeofday\|close\|open\|epoll_wait\|getpid\|read\|write'
Here with grep -v we filtered out the “garbage” calls which we don’t need now.
We could use -e trace= to grab only the necessary calls – but we don’t know yet what exactly we are looking for.
In the Redis configuration file, we set port 7777 and save 1 1, i.e. create a copy of the database on the disk every second if at least one key was changed.
Add a new key:
admin@bttrm-dev-console:~$ redis-cli -p 7777 set test test
OK
And check the strace log:
root@bttrm-dev-console:/home/admin# tail -f redis-trace.log | grep -v 'gettimeofday\|close\|open\|epoll_wait\|getpid\|read\|write'
accept(5, {sa_family=AF_INET, sin_port=htons(60816), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 6
...
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2097, ...}) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7ff26beda190) = 1790
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2097, ...}) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1790, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 1790
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2097, ...}) = 0
Here is the clone() call (why clone() instead of fork() – we will discuss a bit later, in fork() vs fork() vs clone()). This clone() creates a new child process, which in its turn will create the data copy.
Now – run the SAVE command:
admin@bttrm-dev-console:~$ redis-cli -p 7777 save
OK
And check the log:
accept(5, {sa_family=AF_INET, sin_port=htons(32870), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 6
...
rename("temp-1652.rdb", "dump.rdb") = 0
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2097, ...}) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 6, 0x7ffe6712430c) = 0
No clone() this time – the dump was performed by the main Redis process and saved to the dump.rdb file – see the rename("temp-1652.rdb", "dump.rdb") line in strace’s output (we will see shortly where this name – temp-1652.rdb – came from).
Now call the BGSAVE:
admin@bttrm-dev-console:~$ redis-cli -p 7777 bgsave
Background saving started
And check the log again:
accept(5, {sa_family=AF_INET, sin_port=htons(33030), sin_addr=inet_addr("127.0.0.1")}, [128->16]) = 6
...
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7ff26beda190) = 1879
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2097, ...}) = 0
epoll_ctl(3, EPOLL_CTL_DEL, 6, 0x7ffe6712430c) = 0
--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_EXITED, si_pid=1879, si_uid=0, si_status=0, si_utime=0, si_stime=0} ---
wait4(-1, [{WIFEXITED(s) && WEXITSTATUS(s) == 0}], WNOHANG, NULL) = 1879
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=2097, ...}) = 0
And again our clone() is here, which spawned another child process with PID 1879:
...
clone([...]) = 1879
...
Redis rdbSave() and rdbSaveBackground() functions
The dump itself is created by a single Redis function – rdbSave():
...
/* Save the DB on disk. Return C_ERR on error, C_OK on success. */
int rdbSave(char *filename, rdbSaveInfo *rsi) {
...
snprintf(tmpfile,256,"temp-%d.rdb", (int) getpid());
fp = fopen(tmpfile,"w");
...
Which is called when you execute the redis-cli -p 7777 SAVE command.
And here is our temp-1652.rdb file name from the strace output above:
...
snprintf(tmpfile,256,"temp-%d.rdb", (int) getpid());
...
Where 1652 is the Redis server’s main process PID.
For its part, during the BGSAVE command another function is called – rdbSaveBackground():
...
int rdbSaveBackground(char *filename, rdbSaveInfo *rsi) {
...
start = ustime();
if ((childpid = fork()) == 0) {
int retval;
/* Child */
closeListeningSockets(0);
redisSetProcTitle("redis-rdb-bgsave");
retval = rdbSave(filename,rsi);
...
Which in its turn creates a new child process:
...
if ((childpid = fork()) == 0)
...
And this process, in its turn, will execute rdbSave():
...
retval = rdbSave(filename,rsi);
...
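To make the pattern clearer, here is a minimal, self-contained sketch of the same idea – fork a child, let it write the snapshot to a temporary file and atomically rename() it, while the parent keeps working. This is my own illustration of the approach, not Redis code: the payload and file names here are made up.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

/* A toy "rdbSave()": write some payload into a temp file named after our PID,
 * then atomically rename it to the final snapshot name. */
static int toy_save(const char *filename) {
    char tmpfile[256];
    snprintf(tmpfile, sizeof(tmpfile), "temp-%d.rdb", (int) getpid());

    FILE *fp = fopen(tmpfile, "w");
    if (!fp) return -1;
    fprintf(fp, "fake snapshot data\n");
    fclose(fp);

    /* rename() is atomic, so readers see either the old or the new snapshot */
    return rename(tmpfile, filename);
}

int main(void) {
    pid_t childpid = fork();            /* the same pattern as in rdbSaveBackground() */
    if (childpid == 0) {
        /* Child: do the slow disk work and exit */
        _exit(toy_save("dump.rdb") == 0 ? 0 : 1);
    }
    /* Parent: would keep serving clients; Redis reaps its child asynchronously
     * rather than blocking like this simple example does */
    printf("Background saving started by pid %d\n", (int) childpid);
    waitpid(childpid, NULL, 0);
    return 0;
}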
fork() vs fork() vs clone()
Now, let’s go back to the question – why in strace’s output do we see the clone() syscall instead of the fork() which is called by the rdbSaveBackground() function?
Well, that’s just because fork() != fork():
- there is the Linux kernel fork() syscall
- and there is also glibc’s fork() function, which is a wrapper around the clone() syscall
Try to check them by using the apropos tool:
[setevoy@setevoy-arch-work ~/Temp/redis] [unstable*] $ apropos fork
fork (2) - create a child process
fork (3am) - basic process management
fork (3p) - create a new process
So, fork(2) is the system call, whereas fork(3p) describes the POSIX library function, which in glibc is implemented as a wrapper – https://github.com/bminor/glibc/blob/master/sysdeps/nptl/fork.c#L48.
Now, Read the Following Manual – open man 2 fork:
[setevoy@setevoy-arch-work ~/Temp/redis] [unstable*] $ man 2 fork | grep -A 5 NOTES
NOTES
Under Linux, fork() is implemented using copy-on-write pages, so the only penalty that it incurs is the time and memory required to duplicate the parent's page tables, and to create a unique task structure for the child.
C library/kernel differences
Since version 2.3.3, rather than invoking the kernel's fork() system call, the glibc fork() wrapper that is provided as part of the NPTL threading implementation invokes clone(2) with flags that provide the same effect as the
traditional system call. (A call to fork() is equivalent to a call to clone(2) specifying flags as just SIGCHLD.) The glibc wrapper invokes any fork handlers that have been established using pthread_atfork(3).
rather than invoking the kernel’s fork() system call, the glibc fork() wrapper […] invokes clone(2)
Consequently, when rdbSaveBackground() executes fork() – it uses not the fork(2) syscall but glibc’s fork(3p), which in its turn is aliased to __libc_fork():
...
weak_alias (__libc_fork, __fork)
libc_hidden_def (__fork)
weak_alias (__libc_fork, fork)
And inside of __libc_fork() the “magic” itself happens by calling the arch_fork() macro:
...
pid = arch_fork (&THREAD_SELF->tid);
...
Find it in the glibc source code just by grep-ing for it:
[setevoy@setevoy-arch-work ~/Temp/glibc] [master*] $ grep -r arch_fork .
./ChangeLog: (arch_fork): Issue INLINE_CLONE_SYSCALL if defined.
./ChangeLog: * sysdeps/nptl/fork.c (ARCH_FORK): Replace by arch_fork.
./ChangeLog: * sysdeps/unix/sysv/linux/arch-fork.h (arch_fork): New function.
./sysdeps/unix/sysv/linux/arch-fork.h:/* arch_fork definition for Linux fork implementation.
./sysdeps/unix/sysv/linux/arch-fork.h:arch_fork (void *ctid)
./sysdeps/nptl/fork.c: pid = arch_fork (&THREAD_SELF->tid);
The arch_fork() is defined in the sysdeps/unix/sysv/linux/arch-fork.h file, which in its turn will call clone():
...
ret = INLINE_SYSCALL_CALL (clone, flags, 0, NULL, 0, ctid);
...
Which is exactly what we saw in strace’s log.
To check that this is really the case, and that we are indeed using the glibc fork() and not the system call, let’s write a small C program based on the official GNU documentation:
#include <unistd.h>
#include <sys/wait.h>
#include <stdio.h>

int main () {
    pid_t pid;

    /* the same glibc fork() call as used by rdbSaveBackground() */
    pid = fork ();
    if (pid == 0) {
        /* child: report itself and stay alive so we can inspect it with lsof */
        printf("Child created\n");
        sleep(100);
    }
}
Here in the pid = fork() line we call fork() in the same way as it is done by the rdbSaveBackground() function.
Then let’s use ltrace to trace library functions (unlike strace, which is used to trace system calls):
$ ltrace -C -f ./test_fork_lib
[pid 5530] fork( <unfinished ...>
[pid 5531] <... fork resumed> ) = 0
[pid 5530] <... fork resumed> ) = 5531
[pid 5531] puts("Child created" <no return ...>
[pid 5530] +++ exited (status 0) +++
Child created
[pid 5531] <... puts resumed> ) = 14
[pid 5531] sleep(100) = 0
And by using the lsof tool – find all the files opened by our process:
[setevoy@setevoy-arch-work ~/Temp/glibc] [master*] $ lsof -p 5531
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
test_fork 5531 setevoy cwd DIR 254,3 4096 4854992 /home/setevoy/Temp/glibc
test_fork 5531 setevoy rtd DIR 254,2 4096 2 /
test_fork 5531 setevoy txt REG 254,3 16648 4855715 /home/setevoy/Temp/glibc/test_fork_lib
test_fork 5531 setevoy mem REG 254,2 2133648 396251 /usr/lib/libc-2.29.so
...
Or by using ldd – check which libraries will be used to make our code work:
[setevoy@setevoy-arch-work ~/Temp/glibc] [master*] $ ldd test_fork_lib
...
libc.so.6 => /usr/lib/libc.so.6 (0x00007f26ba77f000)
libc-2.29.so is taken from the glibc package:
[setevoy@setevoy-arch-work ~/Temp/glibc] [master*] $ pacman -Ql glibc | grep libc-2.29.so
glibc /usr/lib/libc-2.29.so
Another way to check the functions in a library file is to use the objdump tool:
[setevoy@setevoy-arch-work ~/Temp/linux] [master*] $ objdump -T /usr/lib/libc.so.6 | grep fork
00000000000c93c0 g DF .text 00000000000001fe GLIBC_PRIVATE __libc_fork
__libc_fork – here is our function in the .text section (see more in the Linux: C – the process address space and C: creating and using a shared library in Linux posts, both in Russian).
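One more way to see the difference is to call the glibc fork() wrapper and the raw kernel fork syscall from the same program and compare what strace reports for each. This is my own illustrative sketch (the binary name fork_compare is arbitrary); on x86_64 the first call should show up as clone(2) and the second as fork(2):

#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <sys/wait.h>

int main(void) {
    /* glibc wrapper: ends up in clone(2) */
    pid_t p1 = fork();
    if (p1 == 0) _exit(0);
    waitpid(p1, NULL, 0);

    /* raw kernel syscall: a real fork(2) */
    pid_t p2 = (pid_t) syscall(SYS_fork);
    if (p2 == 0) _exit(0);
    waitpid(p2, NULL, 0);

    printf("glibc fork() child: %d, raw SYS_fork child: %d\n", (int) p1, (int) p2);
    return 0;
}

Running it as strace -f -e trace=fork,clone ./fork_compare should show one clone() call and one fork() call in the output.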
Redis – fork: Cannot allocate memory – the cause
Again, as the text below was originally translated for the Russian post, I’ll just copy-paste the documentation here. Read the full documentation – Background saving fails with a fork() error under Linux even if I have a lot of free RAM.
Redis background saving schema relies on the copy-on-write semantic of fork in modern operating systems: Redis forks (creates a child process) that is an exact copy of the parent. The child process dumps the DB on disk and finally exits. In theory the child should use as much memory as the parent being a copy, but actually thanks to the copy-on-write semantic implemented by most modern operating systems the parent and child process will share the common memory pages. A page will be duplicated only when it changes in the child or in the parent. Since in theory all the pages may change while the child process is saving, Linux can’t tell in advance how much memory the child will take, so if the overcommit_memory setting is set to zero fork will fail unless there is as much free RAM as required to really duplicate all the parent memory pages, with the result that if you have a Redis dataset of 3 GB and just 2 GB of free memory it will fail.
Setting overcommit_memory to 1 tells Linux to relax and perform the fork in a more optimistic allocation fashion, and this is indeed what you want for Redis.
The overcommit_memory values
vm.overcommit_memory can contain one of the following three values:
- 0: the kernel may allocate more virtual memory than the server physically has, but relies on the “heuristic algorithm” (heuristic overcommit handling) to decide whether to approve or decline each memory allocation for a process
- 1: the kernel will always overcommit, which can lead to more Out of Memory errors, but may be good for services that actively use memory
- 2: the kernel will allow overcommit only within the bounds set by the overcommit_ratio or overcommit_kbytes parameters
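For the value 2, the commit limit is derived from the swap size and overcommit_ratio (50 by default). As a rough worked example for the t2.medium instance used later in this post – assuming about 4 GB of RAM and, purely for the sake of arithmetic, 2 GB of swap:

CommitLimit = Swap + RAM * overcommit_ratio / 100
            = 2 GB + 4 GB * 50 / 100
            = 4 GB

Any allocation that would push the total commit charge above roughly 4 GB would then fail with ENOMEM instead of being overcommitted.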
The famous “heuristic algorithm” (Heuristic Overcommit handling)
In most documentation/guides/howtos this algorithm is only mentioned, but it was not so easy to find a detailed description of how it actually works.
As usual – “Just read the source!”
The overcommit check is performed by the helper function __vm_enough_memory() from the memory management module, defined in the kernel’s mm/util.c file.
This function accepts the number of pages a process asks to allocate, and then:
- if overcommit_memory == 1 (if (sysctl_overcommit_memory == OVERCOMMIT_ALWAYS)):
  - it returns 0 and allows the overcommit
- if overcommit_memory == 0 (if (sysctl_overcommit_memory == OVERCOMMIT_GUESS); sysctl_overcommit_memory by default is set to OVERCOMMIT_GUESS, and OVERCOMMIT_GUESS is set to 0 in the linux/mman.h file), it will do the accounting below (see also the simplified sketch after this list):
  - count all currently free pages and save them into the free variable: free = global_zone_page_state(NR_FREE_PAGES)
  - increase free by the number of file-backed memory pages (see File-backed and Swap, Memory-mapped file), i.e. pages which can be freed by writing them back to the disk: free += global_node_page_state(NR_FILE_PAGES)
  - decrease free by the number of shared memory pages (see Shared Memory, Shared memory): free -= global_node_page_state(NR_SHMEM)
  - increase free by the number of free swap pages: free += get_nr_swap_pages()
  - increase it by the SReclaimable number (see man 5 proc, SReclaimable): free += global_node_page_state(NR_SLAB_RECLAIMABLE)
  - increase it by the KReclaimable number (see man 5 proc, KReclaimable): free += global_node_page_state(NR_KERNEL_MISC_RECLAIMABLE)
  - decrease it by the minimal reserved pages (see calculate_totalreserve_pages() and An enhancement of OVERCOMMIT_GUESS): free -= totalreserve_pages
  - decrease it by the memory reserved for the root user (see init_admin_reserve()): free -= sysctl_admin_reserve_kbytes
  - and as the last step, compare the memory available after all the calculations above with the amount requested by the process – if free contains enough pages, return 0: if (free > pages) return 0;
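The kernel counters above are not directly visible from user space, but most of them have close equivalents in /proc/meminfo. Below is a small, runnable approximation written purely for illustration: it sums the corresponding /proc/meminfo fields the same way the heuristic does (it ignores totalreserve_pages and the admin reserve, so it is only an estimate, not the exact kernel logic):

#include <stdio.h>
#include <string.h>

/* Read a "Field:   12345 kB" value from /proc/meminfo; returns 0 if the field is absent */
static long meminfo_kb(const char *field) {
    FILE *fp = fopen("/proc/meminfo", "r");
    char line[256];
    long value = 0;
    if (!fp) return 0;
    while (fgets(line, sizeof(line), fp)) {
        if (strncmp(line, field, strlen(field)) == 0 && line[strlen(field)] == ':') {
            sscanf(line + strlen(field) + 1, "%ld", &value);
            break;
        }
    }
    fclose(fp);
    return value;
}

int main(void) {
    /* roughly the same accounting as the OVERCOMMIT_GUESS branch of __vm_enough_memory(),
     * expressed via /proc/meminfo fields (all values are in kB) */
    long estimate = meminfo_kb("MemFree")
                  + meminfo_kb("Cached")         /* ~ NR_FILE_PAGES       */
                  - meminfo_kb("Shmem")          /* ~ NR_SHMEM            */
                  + meminfo_kb("SwapFree")       /* ~ get_nr_swap_pages() */
                  + meminfo_kb("SReclaimable")   /* ~ NR_SLAB_RECLAIMABLE */
                  + meminfo_kb("KReclaimable");  /* absent on older kernels, then counted as 0 */

    printf("The heuristic would allow roughly %ld kB more to be allocated\n", estimate);
    return 0;
}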
See the How Linux handles virtual memory overcommit, overcommit-accounting, Checking Available Memory.
Checking vm.overcommit_memory
Well – that was all theory; now it’s time to take a real look at how all this works with different vm.overcommit_memory values, and how the memory is allocated in general.
Let’s use the following simple code:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>

int main() {
    printf("main() started\n");

    /* request 4096 bytes of virtual memory */
    long int mem_size = 4096;
    void *mem_stack = malloc(mem_size);

    /* keep the process alive so we can inspect it with ps */
    printf("Parent pid: %d\n", (int) getpid());
    sleep(1200);
}
In the void *mem_stack = malloc(mem_size) line we request a 4096-byte allocation, as set in the mem_size variable.
Check the current overcommit_memory value:
root@bttrm-dev-console:/home/admin# cat /proc/sys/vm/overcommit_memory
0
Run our program:
root@bttrm-dev-console:/home/admin# ./test_vm
main() started
Parent pid: 14353
Check the memory used by the process now:
root@bttrm-dev-console:/home/admin# ps aux | grep test_vm | grep -v grep
root 14353 0.0 0.0 4160 676 pts/4 S+ 17:29 0:00 ./test_vm
4160 KB of VSZ (Virtual Size) – including the allocation we requested in the malloc(mem_size) call.
Now – change the mem_size variable value from 4096 bytes to 1099511627776 – 1 terabyte:
...
long int mem_size = 1099511627776;
...
Build it, run it – and:
root@bttrm-dev-console:/home/admin# ./test_vm
main() started
Segmentation fault
Great!
Check with strace:
root@bttrm-dev-console:/home/admin# strace -e trace=mmap ./test_vm
mmap(NULL, 47657, PROT_READ, MAP_PRIVATE, 3, 0) = 0x7fa268f24000
mmap(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fa268f22000
mmap(NULL, 3795296, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x7fa26896e000
mmap(0x7fa268d03000, 24576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x195000) = 0x7fa268d03000
mmap(0x7fa268d09000, 14688, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7fa268d09000
main() started
mmap(NULL, 1099511631872, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 1099511762944, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
mmap(NULL, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x7fa26096e000
mmap(NULL, 1099511631872, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = -1 ENOMEM (Cannot allocate memory)
--- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0xfffffffff8} ---
+++ killed by SIGSEGV +++
Segmentation fault
And here is our lovely -1 ENOMEM (Cannot allocate memory), the same one we can see in the Redis logs in its “Can’t save in background: fork: Cannot allocate memory” error message.
So, here is the mmap() syscall, during which the kernel calls the security_vm_enough_memory_mm() function:
...
if (security_vm_enough_memory_mm(mm, grow))
return -ENOMEM;
...
Which is defined in the security.h header file, and this is where __vm_enough_memory() is called:
...
static inline int security_vm_enough_memory_mm(struct mm_struct *mm, long pages)
{
return __vm_enough_memory(mm, pages, cap_vm_enough_memory(mm, pages));
}
...
Now – disable the overcommit limit checking:
root@bttrm-dev-console:/home/admin# echo 1 > /proc/sys/vm/overcommit_memory
Run the program again:
root@bttrm-dev-console:/home/admin# ./test_vm
main() started
Parent pid: 11337
Child pid: 11338
Child is running with PID 11338
Check the VSZ used now:
admin@bttrm-dev-console:~/redis-logs$ ps aux | grep -v grep | grep 11337
root 11337 0.0 0.0 1073745988 656 pts/4 S+ 16:34 0:00 ./test_vm
VSZ == **1073745988** – just awesome: we have just allocated 1 terabyte of virtual memory on an AWS [t2.medium](https://www.ec2instances.info/?selected=t2.medium) EC2 instance with only 4 GB of “real” memory!
And now – guess what will happen once the process starts actively using this allocated (so far only virtual) memory?
Add a memset() call (a libc function, not a syscall) which will write zeroes into our mem_stack, filling the whole mem_size, i.e. 1 TB:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>

int main() {
    printf("main() started\n");

    /* request 1 TB of virtual memory */
    long int mem_size = 1099511627776;
    void *mem_stack = malloc(mem_size);

    printf("Parent pid: %d\n", (int) getpid());

    /* now actually touch the memory by filling the whole 1 TB with zeroes */
    memset(mem_stack, 0, mem_size);
    sleep(120);
}
Run it (do NOT do this on a Production environment!):
root@bttrm-dev-console:/home/admin# ./test_vm
main() started
Parent pid: 15219
Killed
And check the operating system’s log:
Aug 27 17:46:43 localhost kernel: [7974462.384723] Out of memory: Kill process 15219 (test_vm) score 818 or sacrifice child
Aug 27 17:46:43 localhost kernel: [7974462.393395] Killed process 15219 (test_vm) total-vm:1073745988kB, anon-rss:3411676kB, file-rss:16kB, shmem-rss:0kB
Aug 27 17:46:43 localhost kernel: [7974462.600138] oom_reaper: reaped process 15219 (test_vm), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
The OOM Killer came and killed everything. This time. Next time it may not make it in time.
Pay attention that our process managed to consume a whole 3.4 GB of real memory – anon-rss:3411676kB – out of the one terabyte of virtual memory it was given – total-vm:1073745988kB.
Conclusions
In our current case, when Redis is used for caching only and has no RDB or AOF backups enabled – there is no need to change overcommit_memory, and it is best to leave it with its default value – 0.
In the case when you really want to set the boundaries yourself, it is best to use overcommit_memory == 2 and limit the overcommit by setting the overcommit_ratio or overcommit_kbytes parameters.
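Just as an illustration of that option (not a recommendation for any particular workload – the ratio value 80 below is an arbitrary assumption, pick one based on your own RAM and swap sizing), the runtime change would look like this; the same vm.overcommit_memory and vm.overcommit_ratio keys can be set in /etc/sysctl.conf to persist across reboots:

root@bttrm-dev-console:/home/admin# echo 2 > /proc/sys/vm/overcommit_memory
root@bttrm-dev-console:/home/admin# echo 80 > /proc/sys/vm/overcommit_ratio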
The story
Actually, the whole story with the vm.overcommit_memory
started for me about a year ago.
I wasn’t too familiar with Redis at that time, and I had just come to a new project where Redis was already in use.
One fine day, our Production server (at the time I joined this project, the whole backend was running on a single AWS EC2 instance) got a bit tired and went down for some rest.
After a magic dropkick via the AWS Console the server went back online, and I started looking for the root cause.
In its logs, I found records about the OOM Killer coming for Redis or RabbitMQ – I’m not sure now which exactly. But anyway, during the investigation I found that vm.overcommit_memory was set to 1, i.e. the checks were disabled altogether.
So, this story firstly gave me a reason to create a more reliable and fault-tolerant architecture for our backend’s infrastructure, and secondly taught me not to blindly trust any documentation.
Useful links
- Background saving fails with a fork() error under Linux even if I have a lot of free RAM!
- Redis Persistence
- Launching Linux threads and processes with clone
- Where did fork go?
- What is Overcommit? And why is it bad?
- Linux – fork system call and its pitfalls
- Searchable Linux Syscall Table for x86 and x86_64
- The Linux COW
- Physical and virtual memory
- mtrace (3) – Linux Man Pages
- Out Of Memory Management
- Kernel overcommit-accounting
- Virtual memory settings in Linux – The Problem with Overcommit
- Memory – Part 1: Memory Types
- Memory mapping