Alex Dzyoba

Posted on Jul 9, 2018 • Originally published at alex.dzyoba.com on Nov 25, 2014

Restricting program memory

#c #linux

The other day I decided to tackle a popular problem: how do you sort 1 million integers in 1 MiB of memory?

But before I even started, I wondered: how can I restrict a process’s memory to 1 MiB, and will it actually work? Here are the answers.

Process virtual memory

Before diving into the various methods, you need to know how a process’s virtual memory is structured. Hands down the best article you can find on the subject is Gustavo Duarte’s “Anatomy of a Program in Memory”. His whole blog is a treasure.

After reading Gustavo’s article I can propose 2 possible options for restricting memory – reduce virtual address space and restrict heap size.

The first is to limit the whole virtual address space of the process. This is nice and easy but not fully correct: we can’t limit the whole virtual address space to 1 MiB, because there would be no room left to map the kernel and the libraries.

The second is to limit the heap size. This is not so easy, and nobody seems to do it, because the only reasonable way is to play with the linker. But for limits as small as 1 MiB it would be the correct approach.

I will also look at other methods: monitoring memory consumption by intercepting the library and system calls related to memory management, and changing the program environment with emulation and sandboxing.

For testing and illustrating I will use this little program big_alloc that allocates (and frees) 100 MiB.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

// 1000 allocations of 100 KiB each = 100 000 KiB (about 100 MiB)
#define NALLOCS 1000
#define ALLOC_SIZE (1024 * 100) // 100 KiB, parenthesized so the macro expands safely

int main(int argc, const char *argv[])
{
    int i = 0;
    int **pp;
    bool failed = false;

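    // calloc zeroes the array, so the cleanup loop below never frees an
    // uninitialized pointer if we bail out early.
    pp = calloc(NALLOCS, sizeof(int *));
    if (!pp)
    {
        perror("calloc");
        return 1;
    }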
    for(i = 0; i < NALLOCS; i++)
    {
        pp[i] = malloc(ALLOC_SIZE);
        if (!pp[i])
        {
            perror("malloc");
            printf("Failed after %d allocations\n", i);
            failed = true;
            break;
        }
        // Touch some bytes so the kernel actually has to back the pages.
        memset(pp[i], 0xA, 100);
        printf("pp[%d] = %p\n", i, pp[i]);
    }

    if (!failed)
        printf("Successfully allocated %d bytes\n", NALLOCS * ALLOC_SIZE);

    for(i = 0; i < NALLOCS; i++)
    {
        if (pp[i])
            free(pp[i]);
    }
    free(pp);

    return 0;
}

All the sources are on github.

ulimit

It’s the first thing an old Unix hacker thinks of when asked to limit program memory. ulimit is a shell builtin (in bash, among others) that lets you restrict the resources of a program, and it is just an interface to setrlimit.

We can set a limit on the resident memory size.

$ ulimit -m 1024

Now check:

$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 7802
max locked memory (kbytes, -l) 64
max memory size (kbytes, -m) 1024
open files (-n) 1024
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 1024
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

We set the limit to 1024 kbytes (-m), i.e. 1 MiB. But when we run our program it doesn’t fail. Even with a more reasonable limit like 30 MiB the program is still able to allocate its 100 MiB. ulimit -m simply doesn’t work: despite the resident set size limit of 1024 kbytes, top shows a resident size of 4872 KiB for my program.

The reason is that Linux doesn’t honor this limit, and man ulimit says so directly:

ulimit [-HSTabcdefilmnpqrstuvx [limit]]
    ...
    -m The maximum resident set size (many systems do not honor this limit)
    ...

There is also ulimit -d, which the kernel does respect, but the program still manages to allocate its memory because malloc falls back to mmap (see the Linker section).
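To make the "interface to setrlimit" remark concrete, here is a minimal sketch of roughly what ulimit -m 1024 asks the kernel to do. The helper names set_soft_limit and get_soft_limit are made up for this illustration, and the sketch assumes Linux with glibc, where RLIMIT_RSS is the resource behind -m.

```c
#include <sys/resource.h>

/* Lower the soft limit of `resource` to `bytes`, leaving the hard limit alone.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 * Hypothetical helper, not part of the article's sources. */
int set_soft_limit(int resource, rlim_t bytes)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return -1;

    rl.rlim_cur = bytes;            /* the soft limit is the one that gets checked */
    return setrlimit(resource, &rl);
}

/* Read back the current soft limit, or RLIM_INFINITY if the call fails. */
rlim_t get_soft_limit(int resource)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return RLIM_INFINITY;
    return rl.rlim_cur;
}
```

Calling set_soft_limit(RLIMIT_RSS, 1024 * 1024) succeeds and the value can be read back, which matches what ulimit -a reports above. The limit being stored says nothing about whether it is enforced, and for RLIMIT_RSS on Linux it is not.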

QEMU

When you want to modify the program environment, QEMU is a natural choice. It has the -R option to limit the virtual address space. But as I said earlier, you can’t restrict the address space to very small values: there would be no room to map libc and the kernel.

Look:

$ qemu-i386 -R 1048576 ./big_alloc
big_alloc: error while loading shared libraries: libc.so.6: failed to map segment from shared object: Cannot allocate memory

Here, -R 1048576 reserves 1 MiB for the guest virtual address space.

For the whole virtual address space we have to pick something more reasonable, like 20 MB. Look:

$ qemu-i386 -R 20M ./big_alloc
malloc: Cannot allocate memory
Failed after 100 allocations

It successfully fails¹ after 100 allocations (10 MB).

So QEMU is the first winner in restricting a program’s memory, though you have to play with the -R value to get the limit you actually want.

Container

Another option after QEMU is to launch the application in a container and restrict the container’s resources. There are several ways to do this:

Use fancy high-level docker.
Use regular usermode tools from the lxc package.
Go hardcore and write your own script with libvirt.
Name it…

In the end, though, the resources are restricted by a native Linux subsystem called cgroups. You can try to poke it directly, but I suggest using lxc. I would have liked to use docker, but it only works on 64-bit machines and my box is a small Intel Atom netbook, which is i386.

OK, quick info. LXC stands for LinuX Containers. It’s a collection of userspace tools and libraries for managing the kernel facilities used to create containers: isolated and secure environments for an application or a whole system.

The kernel facilities that provide such an environment are:

Control groups (cgroups)
Kernel namespaces
chroot
Kernel capabilities
SELinux, AppArmor
Seccomp policies

You can find nice documentation on the official site, on the author’s blog and all over the internet.

To simply run an application in a container you have to pass lxc-execute a config file that describes the container. Every sane person should start from the examples in /usr/share/doc/lxc/examples. The man pages recommend starting with lxc-macvlan.conf. OK, let’s do this:

# cp /usr/share/doc/lxc/examples/lxc-macvlan.conf lxc-my.conf
# lxc-execute -n foo -f ./lxc-my.conf ./big_alloc
Successfully allocated 102400000 bytes

It works!

Now let’s limit memory. This is what cgroups are for. LXC lets you configure the memory subsystem of the container’s cgroup by setting memory limits.

You can find the available tunable parameters of the memory subsystem in this fine RedHat manual. I’ve found two:

memory.limit_in_bytes – sets the maximum amount of user memory (including file cache)
memory.memsw.limit_in_bytes – sets the maximum amount for the sum of memory and swap usage

Here is what I added to lxc-my.conf:

lxc.cgroup.memory.limit_in_bytes = 2M
lxc.cgroup.memory.memsw.limit_in_bytes = 2M

Launch again:

# lxc-execute -n foo -f ./lxc-my.conf ./big_alloc
#

Nothing happened; it looks like that is way too little memory. Let’s try launching the program from a shell inside the container.

# lxc-execute -n foo -f ./lxc-my.conf /bin/bash
#

Looks like bash failed to launch. Let’s try /bin/sh:

# lxc-execute -n foo -f ./lxc-my.conf -l DEBUG -o log /bin/sh
sh-4.2# ./dev/big_alloc/big_alloc 
Killed

Yay! We can see this nice act of killing in dmesg:

[15447.035569] big_alloc invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
...
[15447.035779] Task in /lxc/foo
[15447.035785] killed as a result of limit of 
[15447.035789] /lxc/foo

[15447.035795] memory: usage 3072kB, limit 3072kB, failcnt 127
[15447.035800] memory+swap: usage 3072kB, limit 3072kB, failcnt 0
[15447.035805] kmem: usage 0kB, limit 18014398509481983kB, failcnt 0
[15447.035808] Memory cgroup stats for /lxc/foo: cache:32KB rss:3040KB rss_huge:0KB mapped_file:0KB writeback:0KB swap:0KB inactive_anon:1588KB active_anon:1448KB inactive_file:16KB active_file:16KB unevictable:0KB
[15447.035836] [pid] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
[15447.035963] [9225] 0 9225 942 308 10 0 0 init.lxc
[15447.035971] [9228] 0 9228 833 698 6 0 0 sh
[15447.035978] [9252] 0 9252 16106 843 36 0 0 big_alloc
[15447.035983] Memory cgroup out of memory: Kill process 9252 (big_alloc) score 1110 or sacrifice child
[15447.035990] Killed process 9252 (big_alloc) total-vm:64424kB, anon-rss:2396kB, file-rss:976kB

Though we haven’t seen an error message from big_alloc about malloc failure and how much memory we were able to get, I think we’ve successfully restricted memory via container technology and can stop with it for now.
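For comparison, "poking cgroups directly" on a cgroup v1 system boils down to writing numbers into control files such as memory.limit_in_bytes. The sketch below only shows that write. The function name write_cgroup_limit is made up, and the directory the file lives in (commonly somewhere under /sys/fs/cgroup/memory/) is an assumption that depends on how the hierarchy is mounted on your system.

```c
#include <stdio.h>

/* Write a byte limit into a cgroup v1 control file, e.g.
 * <memory cgroup dir>/memory.limit_in_bytes.
 * Returns 0 on success, -1 on error.
 * Hypothetical helper: the caller supplies the full path. */
int write_cgroup_limit(const char *control_file, unsigned long long bytes)
{
    FILE *f = fopen(control_file, "w");
    if (f == NULL)
        return -1;

    /* The kernel parses the value as a decimal number of bytes. */
    int ok = fprintf(f, "%llu\n", bytes) > 0;

    /* Errors from the kernel may only show up when the buffer is flushed. */
    if (fclose(f) != 0)
        ok = 0;

    return ok ? 0 : -1;
}
```

With a 2 MiB limit the call would be write_cgroup_limit(path, 2ULL * 1024 * 1024), which is the same value as the 2M in lxc-my.conf. Writing to real cgroup files needs root, and the process still has to be placed into the group, which is the part lxc takes care of.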

Linker

Now let’s try to modify the binary image to limit the space available for the heap.

Linking is the final step of building a program, and it involves a linker and a linker script. A linker script describes the program’s sections in memory, along with their attributes.

Here is a simple linker script:

ENTRY(main)

SECTIONS
{
  . = 0x10000;
  .text : { *(.text) }
  . = 0x8000000;
  .data : { *(.data) }
  .bss : { *(.bss) }
}

The dot is the current location. The script says that the .text section starts at address 0x10000, and that starting from 0x8000000 we have two consecutive sections, .data and .bss. The entry point is main.

Nice and sweet, but it will not work for any useful application. The reason is that the main function you write in a C program is not actually the first function to be called. There is a whole lot of initialization and cleanup code around it. That code is provided by the C runtime (crt for short) and is spread across the crt#.o object files in /usr/lib.

You can see the exact details if you run gcc with the -v option. First it invokes cc1 to produce assembly, then it translates that into an object file with as, and finally it combines everything into an ELF file with collect2. collect2 is a wrapper around ld. It takes your object file and 5 additional object files to create the final binary image:

/usr/lib/gcc/i686-redhat-linux/4.8.3/../../../crt1.o
/usr/lib/gcc/i686-redhat-linux/4.8.3/../../../crti.o
/usr/lib/gcc/i686-redhat-linux/4.8.3/crtbegin.o
/tmp/ccEZwSgF.o <-- This one is our program object file
/usr/lib/gcc/i686-redhat-linux/4.8.3/crtend.o
/usr/lib/gcc/i686-redhat-linux/4.8.3/../../../crtn.o

It’s really complicated, so instead of writing my own script I’ll modify the default linker script. You can get the default linker script by passing -Wl,-verbose to gcc:

gcc big_alloc.c -o big_alloc -Wl,-verbose

Now let’s figure out how to modify it. First, let’s see how our binary is built by default. Compile it and look for the .data section address. Here is the objdump -h big_alloc output:

Sections:
Idx Name Size VMA LMA File off Algn
...
12 .text 000002e4 080483e0 080483e0 000003e0 2**4
                 CONTENTS, ALLOC, LOAD, READONLY, CODE
...
23 .data 00000004 0804a028 0804a028 00001028 2**2
                 CONTENTS, ALLOC, LOAD, DATA
24 .bss 00000004 0804a02c 0804a02c 0000102c 2**2
                 ALLOC

The .text, .data and .bss sections are located near 128 MiB.

Now let’s see where the stack is, with the help of gdb:

[restrict-memory]$ gdb big_alloc
...
Reading symbols from big_alloc...done.
(gdb) break main
Breakpoint 1 at 0x80484fa: file big_alloc.c, line 12.
(gdb) r
Starting program: /home/avd/dev/restrict-memory/big_alloc 

Breakpoint 1, main (argc=1, argv=0xbffff164) at big_alloc.c:12
12 int i = 0;
Missing separate debuginfos, use: debuginfo-install glibc-2.18-16.fc20.i686
(gdb) info registers 
eax 0x1 1
ecx 0x9a8fc98f -1701852785
edx 0xbffff0f4 -1073745676
ebx 0x42427000 1111650304
esp 0xbffff0a0 0xbffff0a0
ebp 0xbffff0c8 0xbffff0c8
esi 0x0 0
edi 0x0 0
eip 0x80484fa 0x80484fa <main+10>
eflags 0x286 [PF SF IF]
cs 0x73 115
ss 0x7b 123
ds 0x7b 123
es 0x7b 123
fs 0x0 0
gs 0x33 51

esp points to 0xbffff0a0, which is near 3 GiB. So we have ~2.9 GiB for the heap.

In the real world the stack top address is randomized; you can see that, for example, in the output of

# cat /proc/self/maps

As we all know, the heap grows up from the end of .data towards the stack. What if we move the .data section to the highest possible address?

Let’s put the data segment 2 MiB below the stack. Take the stack top and subtract 2 MiB:

0xbffff0a0 - 0x200000 = 0xbfdff0a0

Now shift all sections starting with .data to that address:

. = 0xbfdff0a0;
.data :
{
  *(.data .data.* .gnu.linkonce.d.*)
  SORT(CONSTRUCTORS)
}

Compile it:

$ gcc big_alloc.c -o big_alloc -Wl,-T hack.lst

-Wl passes an option through to the linker, and -T hack.lst is the linker option itself. It tells the linker to use hack.lst as the linker script.

Now, if we look at the section headers we’ll see this:

Sections:
Idx Name Size VMA LMA File off Algn

 ...

 23 .data 00000004 bfdff0a0 bfdff0a0 000010a0 2**2
                  CONTENTS, ALLOC, LOAD, DATA
 24 .bss 00000004 bfdff0a4 bfdff0a4 000010a4 2**2
                  ALLOC

But nevertheless, it successfully allocates. How? That’s really neat. When I looked at the pointer values that malloc returns, I saw that allocation starts somewhere past the end of the .data section, at something like 0xbf8b7000, continues for a while with increasing pointers, and then jumps to a lower address like 0xb7676000. From that address it allocates for a while with increasing pointers and then jumps again to an even lower address like 0xb5e76000. Overall it looks as if the heap is growing down!

But if you think about it for a minute, it isn’t really that strange. I examined some glibc sources and found out that when brk fails, malloc uses mmap instead. So glibc asks the kernel to map some pages, the kernel sees that the process has lots of holes in its virtual address space and maps pages from one of those holes, and finally glibc returns a pointer into those pages.

Running big_alloc under strace confirmed the theory. Just look at the normal binary:

brk(0) = 0x8135000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77df000
mmap2(NULL, 95800, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb77c7000
mmap2(0x4226d000, 1825436, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x4226d000
mmap2(0x42425000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b8000) = 0x42425000
mmap2(0x42428000, 10908, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x42428000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77c6000
mprotect(0x42425000, 8192, PROT_READ) = 0
mprotect(0x8049000, 4096, PROT_READ) = 0
mprotect(0x42269000, 4096, PROT_READ) = 0
munmap(0xb77c7000, 95800) = 0
brk(0) = 0x8135000
brk(0x8156000) = 0x8156000
brk(0) = 0x8156000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb77de000
brk(0) = 0x8156000
brk(0x8188000) = 0x8188000
brk(0) = 0x8188000
brk(0x81ba000) = 0x81ba000
brk(0) = 0x81ba000
brk(0x81ec000) = 0x81ec000
...
brk(0) = 0x9c19000
brk(0x9c4b000) = 0x9c4b000
brk(0) = 0x9c4b000
brk(0x9c7d000) = 0x9c7d000
brk(0) = 0x9c7d000
brk(0x9caf000) = 0x9caf000
...
brk(0) = 0xe29c000
brk(0xe2ce000) = 0xe2ce000
brk(0) = 0xe2ce000
brk(0xe300000) = 0xe300000
brk(0) = 0xe300000
brk(0) = 0xe300000
brk(0x8156000) = 0x8156000
brk(0) = 0x8156000
+++ exited with 0 +++

And now the modified binary:

brk(0) = 0xbf896000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb778f000
mmap2(NULL, 95800, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7777000
mmap2(0x4226d000, 1825436, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0x4226d000
mmap2(0x42425000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1b8000) = 0x42425000
mmap2(0x42428000, 10908, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x42428000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7776000
mprotect(0x42425000, 8192, PROT_READ) = 0
mprotect(0x8049000, 4096, PROT_READ) = 0
mprotect(0x42269000, 4096, PROT_READ) = 0
munmap(0xb7777000, 95800) = 0
brk(0) = 0xbf896000
brk(0xbf8b7000) = 0xbf8b7000
brk(0) = 0xbf8b7000
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb778e000
brk(0) = 0xbf8b7000
brk(0xbf8e9000) = 0xbf8e9000
brk(0) = 0xbf8e9000
brk(0xbf91b000) = 0xbf91b000
brk(0) = 0xbf91b000
brk(0xbf94d000) = 0xbf94d000
brk(0) = 0xbf94d000
brk(0xbf97f000) = 0xbf97f000
...
brk(0) = 0xbff8e000
brk(0xbffc0000) = 0xbffc0000
brk(0) = 0xbffc0000
brk(0xbfff2000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7676000
brk(0) = 0xbffc0000
brk(0xbfffa000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7576000
brk(0) = 0xbffc0000
brk(0xbfffa000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7476000
brk(0) = 0xbffc0000
brk(0xbfffa000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7376000
...
brk(0) = 0xbffc0000
brk(0xbfffa000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb1c76000
brk(0) = 0xbffc0000
brk(0xbfffa000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb1b76000
brk(0) = 0xbffc0000
brk(0xbfffa000) = 0xbffc0000
mmap2(NULL, 1048576, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb1a76000
brk(0) = 0xbffc0000
brk(0) = 0xbffc0000
brk(0) = 0xbffc0000
...
brk(0) = 0xbffc0000
brk(0) = 0xbffc0000
brk(0) = 0xbffc0000
+++ exited with 0 +++

That being said, shifting the .data section up towards the stack (thus reducing the space for the heap) is pointless, because the kernel will map pages for malloc from the empty areas of the virtual address space.

Sandbox

The other way to restrict program memory is sandboxing. The difference from emulation is that we’re not really emulating anything; instead we track and control certain aspects of the program’s behavior. Sandboxing is usually used in security research, when you have some kind of malware and need to analyze it without harming your system.

I’ve come up with several sandboxing methods and implemented the most promising ones.

LD_PRELOAD trick

LD_PRELOAD is a special environment variable that, when set, makes the dynamic linker load the “preloaded” library before any other library, including libc. It’s used in a lot of scenarios, from debugging to, well, sandboxing.

This trick is also infamously used by some malware.

I have written a simple memory management sandbox that intercepts malloc/free calls, accounts for memory usage and returns ENOMEM if the memory limit is exceeded.

To do this I wrote a shared library with my own malloc/free wrappers that increment a counter by the malloc size and decrement it when free is called. The library is preloaded with LD_PRELOAD when running the application under test.

Here is my malloc implementation.

void *malloc(size_t size)
{
    void *p = NULL;

    if (libc_malloc == NULL) 
        save_libc_malloc();

    if (mem_allocated <= MEM_THRESHOLD)
    {
        p = libc_malloc(size);
    }
    else
    {
        errno = ENOMEM;
        return NULL;
    }

    if (p != NULL && !no_hook) // account only for allocations that succeeded
    {
        no_hook = 1;
        account(p, size);
        no_hook = 0;
    }

    return p;
}

libc_malloc is a pointer to the original malloc from libc. no_hook is a thread-local flag. It makes it possible to call malloc inside the malloc hooks without recursing, an idea taken from Tetsuyuki Kobayashi’s presentation.

malloc is used implicitly in the account function by the uthash hash table library. Why use a hash table? Because free receives only the pointer, so inside free you don’t know how much memory was allocated. So I keep a hash table with the pointer as the key and the allocated size as the value. Here is what I do on malloc:

struct malloc_item *item, *out;

item = malloc(sizeof(*item));
item->p = ptr;
item->size = size;

HASH_ADD_PTR(HT, p, item);

mem_allocated += size;

fprintf(stderr, "Alloc: %p -> %zu\n", ptr, size);

mem_allocated is the static variable that is compared against the threshold in malloc.

Now, when free is called, here is what happens:

struct malloc_item *found;

HASH_FIND_PTR(HT, &ptr, found);
if (found)
{
    mem_allocated -= found->size;
    fprintf(stderr, "Free: %p -> %zu\n", found->p, found->size);
    HASH_DEL(HT, found);
    free(found);
}
else
{
    fprintf(stderr, "Freeing unaccounted allocation %p\n", ptr);
}

Yep, just decrement mem_allocated. It’s that simple.
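To show the bookkeeping in isolation, here is a self-contained sketch of the same pointer-to-size accounting. It deliberately swaps uthash for a tiny fixed-size open-addressing table so that it needs no external headers. The table size and the names account_alloc, account_free and accounted_bytes are made up for this illustration and are not what the library uses.

```c
#include <stddef.h>
#include <stdint.h>

#define SLOTS 1024   /* hypothetical capacity, must be a power of two */

enum slot_state { SLOT_EMPTY = 0, SLOT_FULL, SLOT_DEAD };

struct slot {
    void *p;                 /* key: pointer returned by malloc */
    size_t size;             /* value: size that was requested */
    enum slot_state state;
};

static struct slot table[SLOTS];
static size_t mem_allocated;     /* running total, as in the article */

static size_t home_slot(void *p)
{
    /* Drop the low bits, which are mostly zero for aligned pointers. */
    return ((uintptr_t)p >> 4) & (SLOTS - 1);
}

/* Remember that `p` holds `size` bytes. Returns -1 if the table is full. */
int account_alloc(void *p, size_t size)
{
    size_t s = home_slot(p);
    for (size_t i = 0; i < SLOTS; i++, s = (s + 1) & (SLOTS - 1)) {
        if (table[s].state != SLOT_FULL) {
            table[s].p = p;
            table[s].size = size;
            table[s].state = SLOT_FULL;
            mem_allocated += size;
            return 0;
        }
    }
    return -1;
}

/* Forget `p` and give its bytes back. Returns -1 for unknown pointers. */
int account_free(void *p)
{
    size_t s = home_slot(p);
    for (size_t i = 0; i < SLOTS; i++, s = (s + 1) & (SLOTS - 1)) {
        if (table[s].state == SLOT_EMPTY)
            break;                       /* end of the probe chain */
        if (table[s].state == SLOT_FULL && table[s].p == p) {
            mem_allocated -= table[s].size;
            table[s].state = SLOT_DEAD;  /* tombstone keeps later entries reachable */
            return 0;
        }
    }
    return -1;
}

size_t accounted_bytes(void)
{
    return mem_allocated;
}
```

A static table also sidesteps the recursion problem, since the accounting itself never calls malloc. The price is a hard cap on the number of live allocations, which a real hash table like uthash does not have.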

But the really cool thing is that it works rock solid².

[restrict-memory]$ LD_PRELOAD=./libmemrestrict.so ./big_alloc
pp[0] = 0x25ac210
pp[1] = 0x25c5270
pp[2] = 0x25de2d0
pp[3] = 0x25f7330
pp[4] = 0x2610390
pp[5] = 0x26293f0
pp[6] = 0x2642450
pp[7] = 0x265b4b0
pp[8] = 0x2674510
pp[9] = 0x268d570
pp[10] = 0x26a65d0
pp[11] = 0x26bf630
pp[12] = 0x26d8690
pp[13] = 0x26f16f0
pp[14] = 0x270a750
pp[15] = 0x27237b0
pp[16] = 0x273c810
pp[17] = 0x2755870
pp[18] = 0x276e8d0
pp[19] = 0x2787930
pp[20] = 0x27a0990
malloc: Cannot allocate memory
Failed after 21 allocations

Full source code for the library is on github.

So, LD_PRELOAD is a great way to restrict memory!

ptrace

ptrace is another feature that can be used to build a memory sandbox. ptrace is a system call that lets you control the execution of another process. It’s available on various Unix-like operating systems including, of course, Linux.

ptrace is the foundation of tracers like strace and ltrace, of almost all sandboxing software like systrace, sydbox and mbox, and of all debuggers, including gdb itself.

I have built a custom tool with ptrace. It traces brk calls and measures the distance between the initial program break and the new value set by each subsequent brk call.

The tool forks and becomes 2 processes. The parent process is the tracer and the child process is the tracee. In the child I call ptrace(PTRACE_TRACEME) and then execv. In the parent I use ptrace(PTRACE_SYSCALL) to stop at syscall entry and filter out the brk calls of the child, and then another ptrace(PTRACE_SYSCALL) to get the brk return value.
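The fork-and-trace skeleton described above can be sketched as follows. This is a simplified illustration, not the tool itself: it does not read any registers (which keeps it ABI-neutral) and merely counts how many times the child stops at a syscall boundary. The function name count_syscall_stops is made up, and the sketch assumes Linux.

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run argv[0] under ptrace and count syscall stops (entries and exits).
 * Returns the number of stops, or -1 if fork or the first wait fails. */
long count_syscall_stops(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Tracee: ask to be traced, then replace ourselves with the program.
         * A successful execv makes the kernel stop us with SIGTRAP. */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execv(argv[0], argv);
        _exit(127);                      /* only reached if execv failed */
    }

    /* Tracer: wait for the initial stop caused by execv. */
    int status;
    long stops = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;

    while (!WIFEXITED(status) && !WIFSIGNALED(status)) {
        /* Resume the tracee until the next syscall entry or exit.
         * Any other pending signal is dropped, to keep the sketch short. */
        if (ptrace(PTRACE_SYSCALL, pid, NULL, NULL) < 0)
            break;
        if (waitpid(pid, &status, 0) < 0)
            break;
        if (WIFSTOPPED(status) && WSTOPSIG(status) == SIGTRAP)
            stops++;
    }
    return stops;
}
```

A real tool would inspect the registers at each stop, for example with PTRACE_GETREGS, to find out which syscall it is looking at, and that is exactly where the architecture-specific part begins.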

When brk exceeds the threshold, I set -ENOMEM as the brk return value. The return value lives in the eax register, so I just overwrite it with ptrace(PTRACE_SETREGS). Here is the meaty part:

// Get return value
if (!syscall_trace(pid, &state))
{
    dbg("brk return: 0x%08X, brk_start 0x%08X\n", state.eax, brk_start);

    if (brk_start) // We have start of brk
    {
        diff = state.eax - brk_start;

        // If child process exceeded threshold 
        // replace brk return value with -ENOMEM
        if (diff > THRESHOLD || threshold) 
        {
            dbg("THRESHOLD!\n");
            threshold = true;
            state.eax = -ENOMEM;
            ptrace(PTRACE_SETREGS, pid, 0, &state);
        }
        else
        {
            dbg("diff 0x%08X\n", diff);
        }
    }
    else
    {
        dbg("Assigning 0x%08X to brk_start\n", state.eax);
        brk_start = state.eax;
    }
}

I also intercept mmap/mmap2 calls, because libc is smart enough to call them when brk fails. So once the threshold has been exceeded and I see an mmap call, I just fail it with ENOMEM.

It works!

[restrict-memory]$ ./ptrace-restrict ./big_alloc
pp[0] = 0x8958fb0
pp[1] = 0x8971fb8
pp[2] = 0x898afc0
pp[3] = 0x89a3fc8
pp[4] = 0x89bcfd0
pp[5] = 0x89d5fd8
pp[6] = 0x89eefe0
pp[7] = 0x8a07fe8
pp[8] = 0x8a20ff0
pp[9] = 0x8a39ff8
pp[10] = 0x8a53000
pp[11] = 0x8a6c008
pp[12] = 0x8a85010
pp[13] = 0x8a9e018
pp[14] = 0x8ab7020
pp[15] = 0x8ad0028
pp[16] = 0x8ae9030
pp[17] = 0x8b02038
pp[18] = 0x8b1b040
pp[19] = 0x8b34048
pp[20] = 0x8b4d050
malloc: Cannot allocate memory
Failed after 21 allocations

But… I don’t really like it. It’s ABI-specific, i.e. it has to use rax instead of eax on a 64-bit machine, so I would either have to make a separate version of the tool, use #ifdef to cope with the ABI differences, or make you build it with the -m32 option. That’s not really usable. It also probably won’t work on other POSIX-like systems, because they may have a different ABI.

Other

There are also other things one may try, which I rejected for different reasons:

malloc hooks. Deprecated, according to the man page, so I didn’t bother trying them.
Seccomp and prctl with PR_SET_MM_START_BRK. This might work, but as the seccomp filtering kernel documentation says, it’s not a sandbox but a “mechanism for minimizing the exposed kernel surface”. So I guess it would be even more awkward than using ptrace by hand. Though I might look at it sometime.
libvirt-sandbox. Nope, it’s just a wrapper over lxc and qemu.
SELinux sandbox. Nope. It just doesn’t work, even though it uses cgroups.

Recap

In the end, I’d like to recap:

There are a lot of ways to restrict memory:
- Resource limiting with ulimit and cgroups
- Running under an emulator like QEMU
- Sandboxing with LD_PRELOAD and ptrace
- Modifying segments in the binary image
But not all of them work:
- ulimit doesn’t work.
- cgroups kind of work, by crashing the application.
- Emulation works, by crashing the application.
- LD_PRELOAD works amazingly well!
- ptrace works well enough but is ABI-dependent.
- Linker magic doesn’t work, because the ingenious libc calls mmap.