Sarun Tapee

Posted on Sep 24

Garbage Collector Implementation in C

#c #learning

In C, there is no garbage collector. This article will guide you on how to implement a garbage collector for C. You will learn how data is stored in memory and the challenge that C has. By the end, you will be able to make your own garbage collector. The complete code could be found on GitHub

Algorithm

Let's begin with the algorithm for detecting a memory leak. A memory leak occurs when the program keeps allocating heap memory and never frees it when it is no longer used. If your program will finally terminate, it is fine since the operating system will reclaim the memory for other processes to use. But if your program needs to run continuously, your computer will finally run out of memory and crash. So the purpose of the garbage collector is to somehow free the unused memory during runtime.

The first question is how to detect which memory area needs to be freed. The simplest idea is that we can target those memory areas that are not reachable by the program anymore. So there is no need to keep it.

This problem can be modeled as a graph. As you can see in the following picture.

The blue node represents the root of the graph, which is the memory that is located in the stack and the global scope. The green node is a memory in the heap that is reachable from the root. And the red node is a memory that is located in the heap and not reachable.

In conclusion, the algorithm to detect unreachable nodes is to traverse all the nodes in the graph starting from the root and mark the reachable nodes. Then we can identify which node is reachable and which is not.

Challenge

Since C is not a strongly typed language. The challenge is as follows.

There is no metadata at the runtime. So that we do not know the meaning of each byte stored in the memory.
We can cast a type in C to any type, so the pointer value could be stored in a variable of type int.
A programmer can store a pointer value in any form they want. e.g., a Singly Linked List that can traverse in both directions by storing the previous node's address together with the next node's address using the XOR operator.

We will not handle challenge no.3, But we need to solve the first two challenges since we need to traverse the graph. There are two possible solutions.

Register metadata (struct name, struct size, number of fields, field name, field type, field size, offset of each field) of every structure in our program to our garbage collector library so that it knows how to deserialize the data.
Scan every byte in the stack and the global area and assume that it is a pointer value. If it points to a valid area address in the heap, follow it.

Comparing the solutions

The first solution will be faster when you traverse the graph, since you know which field is a pointer. But the drawback is that it is more verbose to use since we need to register every struct type in the library. The usage would be like this

typedef struct student_ {
  int id;
  char name[32];
  student_t *friend;
} student_t;

register_struct("student_t", (fields_t *) {
  {"id", "int", sizeof(int)}, 
  {"name", "char", sizeof(char) * 32}, 
  {"friend", "student_t", sizeof(student_t *)}
});

student_t *student = my_calloc("student_t", 1);

This will work, but there are some problems that after calling my_calloc we should now cast the pointer student to another type and store another value than student_t. Another problem is that some struct might contain a void * pointer, then we cannot traverse the graph further because we do not know what type that pointer holds.

With those drawbacks of the first solution, I decided to go with solution no.2, which is scanning every byte on the stack and the global memory area. But we do not need to check every possibility of a byte combination (the sliding window of 8 bytes, the size of the pointer is 8 bytes) because in normal compilation of a C program, the struct field gets padded to contain in word size (the size of the pointer) to make a reading operation faster. Let's look at the example code below.

#include <stdio.h>
#include <stdint.h>

#define sizeof_field(struct_name, field_name) \
  (sizeof(((struct_name *) 0)->field_name))

#define offsetof(struct_name, field_name) \
  ((unsigned int) (intptr_t) &((struct_name *) 0)->field_name)

#define show_field(struct_name, field) \
  (printf("%-16s size=%-8lu offset=%u\n", #struct_name "." #field , sizeof_field(struct_name, field), offsetof(struct_name, field)))

typedef struct emp_ {
    char emp_name[30];
    unsigned int emp_id;
    unsigned int age;
    struct emp_ *mgr;
    float salary;
} emp_t;

int main (void) {
  show_field(emp_t, emp_name);
  show_field(emp_t, emp_id);
  show_field(emp_t, age);
  show_field(emp_t, mgr);
  show_field(emp_t, salary);
}

Output

emp_t.emp_name   size=30       offset=0
emp_t.emp_id     size=4        offset=32
emp_t.age        size=4        offset=36
emp_t.mgr        size=8        offset=40
emp_t.salary     size=4        offset=48

As you can see from the output emp_id field does not start from position 30 but 32 because the field emp_name is padded to 32, and the extra 2 bytes are unused. This may seem like an inefficient use of memory, but this makes the read operation faster. Because the CPU reads memory one word at a time. Imagine that if the emp_name is not padded, the field emp_id data will live in 2 different words, so when you access the field emp_id, the CPU needs to read 2 times and combine the results to get the value.

With this memory padding assumption, we can scan 8 bytes and jump to the next 8 bytes.

As can be seen from the picture above, instead of scanning bytes as read color, we can scan as green color. Of course, we might have misunderstood that some heap memory is still being used because some word might contain data that has a meaning that it is a pointer to the heap area, but the possibility is very low, and there is no harm in keeping that memory area. It is better than freeing the memory that we are still using.

Implementation

Now we have almost all the knowledge to implement a garbage collector. The missing part is how we get the boundary of the stack and the global variable area. Their is providing macro __builtin_frame_address(0) to get the address of the current frame function. and etex to get the ending of the code section (the beginning of the global variable section), end to get the ending of the global variable section.

So in the beginning, we can call __builtin_frame_address(0) at the top of the main function and save the address to stack_base, but for ease of use, let's make a macro.

uintptr_t stack_base = 0;

#define gc_init() \
  stack_base = (uintptr_t) __builtin_frame_address(0)

Next, we will make a function to wrap a calloc function, let's name it gc_calloc. This function will store the address of every memory that is allocated on the heap and delegate work to calloc.

Now, thinking of the information that we need to keep for our gc_calloc, we probably need to store the address of allocated memory and its size. Since we need to find the address when traversing the tree quite often. I think a sorted array would be a good data structure to hold this information because we can use the binary search algorithm to find the address. The implementation is as follows.

#define ALLOC_CAP 5000
static chunk_t alloced_chunks[ALLOC_CAP] = {0};
static size_t alloced_size = 0;

Then implement the gc_calloc

void *gc_calloc(size_t n, size_t size) {
  void *p;
  chunk_t c;
  int i;

  if (n*size == 0) return NULL;
  assert(alloced_size < ALLOC_CAP);
  p = calloc(n, size);
  c = (chunk_t) {.size=n*size, .addr=p};
  for (i = alloced_size - 1; i >= 0 && alloced_chunks[i].addr > p; i--) {
    alloced_chunks[i + 1] = alloced_chunks[i];
  }
  alloced_chunks[i+1] = c;
  alloced_size++;
  return p;
}

Next, we need to implement the tree traversal. Let's name it gc_mark. For marking, we need a data structure to store the state of traversing, which is an array of int mark. We assumed that the pointer is stored in one word, so when we scan the byte, we need to make sure that our starting point is aligned to the word by checking start % sizeof(void *) == 0. Since the memory allocation is specified in bytes, if the size has a fragment of a word, we should not scan the remainder to avoid buffer overflow. And the leftover cannot hold a pointer anyway because it is smaller than a pointer size.

static char mark[ALLOC_CAP] = {0};

void gc_mark() {
  uintptr_t stack_start;
  size_t i;

  stack_start = (uintptr_t) __builtin_frame_address(0);
  assert(stack_base != 0 && "gc_init() is not called");
  for (i = 0; i < alloced_size; i++) {
    mark[i] = 0;
  }

  // scanning stack area
  do_mark(stack_start, stack_base + sizeof(void *));

  // scanning global variable area
  _mark_from_global_var();
}

static void do_mark(uintptr_t start, uintptr_t end) {
  int i;
  void **p;
  chunk_t *c;
  uintptr_t bound;

  assert(start % sizeof(void *) == 0);
  bound = start + ((end - start) / sizeof(void *) * sizeof(void*));
  for (p = (void *) start; (uintptr_t) p < bound; p++) {
    i = bs_lte(*p);
    if (i == -1) continue;
    assert(i < (int) alloced_size);
    c = &alloced_chunks[i];
    assert((uintptr_t) c->addr <= (uintptr_t) *p);
    if (!mark[i] && (uintptr_t) *p < (uintptr_t) c->addr + c->size) {
      mark[i] = 1;
      do_mark( (uintptr_t) c->addr, (uintptr_t) c->addr + c->size);
    }
  }
}

There is a little gatcha for global variable scanning. Notice that our allocation information is stored in the variable alloced_chunks, which is also located in the global scope. So we need to ignore this area. Otherwise, our gc_mark will always say that every memory in the heap is in use.

static void _mark_from_global_var() {
  uintptr_t _start;
  uintptr_t _end;

  _start = align_up((uintptr_t) &etext);
  _end = (uintptr_t) &end + sizeof(void *);
  assert(_start % sizeof(void *) == 0);
  assert(end % sizeof(void *) == 0);
  do_mark(_start, (uintptr_t) &alloced_chunks[0]);
  do_mark((uintptr_t) &alloced_chunks[ALLOC_CAP], _end);
}

The last piece is to clean the leaked memory gc_sweep.

void gc_sweep() {
  size_t i;
  size_t j;

  for (i = 0; i < alloced_size; i++) {
    if (!mark[i]) {
      free(alloced_chunks[i].addr);
    }
  }
  j = 0;
  for (i = 0; i < alloced_size; i++) {
    if (mark[i]) {
      alloced_chunks[j++] = alloced_chunks[i];
    }
  }
  alloced_size = j;
}

To debug our garbage collector, let's implement a function to print the state of the allocation gc_dump_alloced.

void gc_dump_alloced() {
  int i;

  printf("alloc (%zu)\n", alloced_size);
  for (i = 0; i < (int) alloced_size; i++) {
    printf("addr: %p, size: %zu, mark: %s\n", alloced_chunks[i].addr, alloced_chunks[i].size, mark[i] ? "true": "false");
  }
}

Test the implementation

Now, Let's test the implementaion.

#include "garbage_collect.h"
#include <stdio.h>

void *global_var;

typedef struct node_ node_t;

typedef struct node_ {
  node_t *left;
  node_t *right;
} node_t;

typedef struct student_ {
  int id;
  node_t node;
} student_t;

int main(void) {
  gc_init();
  node_t *stud_lst;

  global_var = gc_calloc(3, sizeof(int));

  stud_lst = &((student_t *) gc_calloc(1, sizeof(student_t)))->node;
  stud_lst->right = &((student_t *) gc_calloc(1, sizeof(student_t)))->node;
  stud_lst->right->right = &((student_t *) gc_calloc(1, sizeof(student_t)))->node;

  printf("global_var hold: %p\n", global_var);
  gc_mark();
  gc_dump_alloced();

  printf("---\nset stud_lst = NULL\n");
  stud_lst = NULL;
  gc_mark();
  printf("mark\n");
  gc_dump_alloced();
  gc_sweep();
  printf("sweep\n");
  gc_dump_alloced();

  return 0;
}

It's work!!

./exe
global_var hold: 0x55a6ef582310
alloc (4)
addr: 0x55a6ef582310, size: 12, mark: true
addr: 0x55a6ef582330, size: 24, mark: true
addr: 0x55a6ef582350, size: 24, mark: true
addr: 0x55a6ef582370, size: 24, mark: true
---
set stud_lst = NULL
mark
alloc (4)
addr: 0x55a6ef582310, size: 12, mark: true
addr: 0x55a6ef582330, size: 24, mark: false
addr: 0x55a6ef582350, size: 24, mark: false
addr: 0x55a6ef582370, size: 24, mark: false
sweep
alloc (1)
addr: 0x55a6ef582310, size: 12, mark: true

Usage

Instead of calling gc_mark and followed by gc_sweep, let's make a wrapper function for convenience.

void gc_clean() {
  gc_mark();
  gc_sweep();
}

Then the usage will simplify to.

int main(void) {
  gc_init();
  node_t *stud_lst;

  global_var = gc_calloc(3, sizeof(int));

  stud_lst = &((student_t *) gc_calloc(1, sizeof(student_t)))->node;
  stud_lst->right = &((student_t *) gc_calloc(1, sizeof(student_t)))->node;
  stud_lst->right->right = &((student_t *) gc_calloc(1, sizeof(student_t)))->node;

  stud_lst = NULL;
  gc_clean();
  return 0;
}