Garbage Collection

It’s easy to forget, while working in the abstract in terms of functions and algorithms, that the memory our programs depend on is real. The values we use in our programs actually exist on the hardware at specific addresses. If we don’t keep track of where we’ve stored data, we run the risk of overwriting something important and getting the wrong information when we go to look it up again. On the other hand, if we’re too guarded about protecting our data, even after we’re finished with it, we waste memory that the program could better use on other tasks.

Most programming languages today implement garbage collection to automate periodically releasing memory we no longer need. The garbage collector cannot predict exactly which values our program will use again, but it can find values the program no longer has any way to reach, and safely free those.

The simplest algorithm is just to keep a count of references to each piece of memory. If the number of references ever reaches zero, that memory can no longer be reached and can be safely disposed of. However, this strategy fails with circular references. If two objects reference each other, for example, their reference counts will never reach zero, even if they are otherwise inaccessible from the main program.
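
Here’s a tiny toy model of reference counting in JavaScript, with hypothetical Cell/addRef/dropRef helpers I made up purely for illustration (no real engine works this literally). It shows both the appeal – memory is freed the instant its count hits zero – and the cycle problem:

```javascript
// Toy reference counter (illustration only; not how any real engine works).
class Cell {
  constructor(name) {
    this.name = name;
    this.refCount = 0;
    this.refs = new Set(); // outgoing references to other cells
  }
}

function addRef(from, to) {
  from.refs.add(to);
  to.refCount++;
}

function dropRef(from, to) {
  if (!from.refs.delete(to)) return;
  to.refCount--;
  if (to.refCount === 0) {
    console.log(`freeing ${to.name}`);
    // Freeing a cell also drops every reference it was holding.
    for (const child of [...to.refs]) dropRef(to, child);
  }
}

// The failure case: a and b reference each other.
const owner = new Cell("owner");
const a = new Cell("a");
const b = new Cell("b");
addRef(owner, a); // a: 1
addRef(a, b);     // b: 1
addRef(b, a);     // a: 2

dropRef(owner, a); // a drops to 1, b stays at 1 -- a and b are now unreachable,
                   // but a pure reference counter will never free them
```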

Tracing (usually mark-and-sweep) is a more sophisticated approach to memory management. Starting from some defined root(s), the garbage collector visits every piece of memory accessible from the roots or their descendants, marking that memory as still reachable. Any memory not visited is unreachable and is garbage collected. This avoids the problem of circular references “trapping” memory, since an unreachable cycle is simply never marked. However, this approach has more overhead than the reference counting strategy.
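
And here’s a toy mark-and-sweep pass over a small, hand-built object graph (again, just a sketch – real collectors walk the actual heap, not a Set of plain objects). Because reachability is computed from the roots, a cycle with no outside references is collected without any special handling:

```javascript
// Toy mark-and-sweep (illustration only).
function markAndSweep(heap, roots) {
  const marked = new Set();

  // Mark phase: depth-first walk of everything reachable from the roots.
  const stack = [...roots];
  while (stack.length > 0) {
    const node = stack.pop();
    if (marked.has(node)) continue;
    marked.add(node);
    stack.push(...node.refs);
  }

  // Sweep phase: anything left unmarked is garbage.
  for (const node of [...heap]) {
    if (!marked.has(node)) {
      console.log(`sweeping ${node.name}`);
      heap.delete(node);
    }
  }
}

// Build a small graph: root -> x, plus a y <-> z cycle nothing points to.
const makeNode = (name) => ({ name, refs: [] });
const rootNode = makeNode("root");
const x = makeNode("x");
const y = makeNode("y");
const z = makeNode("z");
rootNode.refs.push(x);
y.refs.push(z);
z.refs.push(y);

const heap = new Set([rootNode, x, y, z]);
markAndSweep(heap, [rootNode]); // sweeps y and z; their cycle doesn't save them
```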

Many garbage collectors reduce the overhead of mark-and-sweep by having two (or more) “generations” of allocations. The generational hypothesis states that most allocations die young (think how many variables you use once in a for loop and never again), but those that survive are much more likely to survive a long time. Thus, the pool of young allocations (the “nursery”) is garbage collected frequently, while the old (“tenured”) pool is checked less often.
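
The hypothesis is easy to see in ordinary code. In this sketch, the objects created inside the loop become garbage almost immediately, so a nursery that only scans young allocations can reclaim them cheaply, while the handful that escape into `survivors` would eventually be promoted to the tenured generation:

```javascript
const survivors = [];

for (let i = 0; i < 1_000_000; i++) {
  const point = { x: i, y: i * 2 }; // short-lived: garbage once the iteration ends
  if (i % 100_000 === 0) {
    survivors.push(point); // a few objects escape and live for the whole run
  }
}

console.log(`kept ${survivors.length} of 1,000,000 allocations`);
```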

It’s possible to combine both strategies in a hybrid collector. For example, Python uses reference counting as its primary algorithm, then uses a mark-and-sweep pass over the (now smaller) pool of allocated memory to find and eliminate circular references.

Of course, the garbage collector itself introduces some overhead. Depending on the implementation, it may bring the program to a halt while it scans and frees data or defragments the remaining memory. It is also impossible to create a perfectly efficient garbage collector, due to the inherent uncertainty about which values will be used again.

There are alternatives to garbage collection. A few languages, such as C and C++, require the programmer to manage memory manually (although you can add garbage collectors to both languages yourself if you wish), both when allocating memory for new values and when deciding when to free it. Manual memory management avoids the overhead of garbage collection, but adds to program complexity, since the code itself must now handle what would otherwise happen in the background. It also gives the programmer many opportunities to make mistakes, from creating dangling pointers by freeing too soon to leaking memory by freeing too late or not at all, to say nothing of the difficulty of using pointers themselves.

Rust takes a third option and introduces the concept of “ownership” – only one variable can own a piece of data at a time, and that data is released as soon as its owner goes out of scope. This eliminates the need for garbage collection at runtime, along with its associated performance costs. However, the programmer has to keep track of ownership and borrowing of data, which limits how data can be read or changed at certain points of the program. This requires thinking in a different way than in other languages, since some familiar patterns simply won’t compile, and it steepens Rust’s learning curve considerably.

JavaScript follows the majority of programming languages in using a garbage collector. The garbage collector itself is implemented and run by the JavaScript engine, not by the scripts we write, so the details vary slightly across engines, but the general principles are the same. Modern JavaScript engines all use a mark-and-sweep algorithm with the global object as the algorithm’s root. Since I regularly use Firefox and Node, I’ll look at their engines in a bit more detail.
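
You can actually watch unreachability (though not control it) from a script using a FinalizationRegistry. Once an object can no longer be reached from the root it becomes eligible for collection – but exactly when (or whether) the engine collects it is entirely up to the engine, so in a short-lived script the callback may never fire at all:

```javascript
// The callback runs some time after the registered object has been collected.
const registry = new FinalizationRegistry((label) => {
  console.log(`${label} was garbage collected`);
});

let cache = { data: new Array(1_000_000).fill(0) };
registry.register(cache, "cache object");

cache = null; // the object is now unreachable from the global root,
              // so it's eligible for collection on the engine's schedule
```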

SpiderMonkey, the engine used by Firefox, applies the principle of generational collection, dividing allocations into young and old. It collects incrementally where it can to avoid long pauses, and runs parts of the collection in parallel across multiple helper threads or concurrently with the main thread when possible.

The V8 engine’s Orinoco garbage collector has three generations: nursery, young, and old, and claims (as of 2019) to be a “mostly parallel and concurrent collector with incremental fallback.” V8 also brags about interweaving garbage collection into the idle time between drawing frames when possible, minimizing the time JavaScript execution is forced to pause.
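
If you want to peek at this from Node, the built-in node:v8 module exposes V8’s heap layout (the exact list of spaces depends on your V8 version, but you should see a small new space – the nursery – alongside a larger old space). Running a script with node --trace-gc will also log individual collections as they happen:

```javascript
// Print how much memory each of V8's heap spaces is currently using.
const v8 = require("node:v8");

for (const space of v8.getHeapSpaceStatistics()) {
  const usedKiB = Math.round(space.space_used_size / 1024);
  console.log(`${space.space_name}: ${usedKiB} KiB used`);
}
```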

Based only on these descriptions, V8’s garbage collector seems a bit more advanced, perhaps because V8 is used by Chromium-based browsers in addition to Node.js and thus has more support behind it. However, the two seem to have independently converged on similar architectures. The demands of providing a smooth user experience mean that browser-based garbage collectors must be efficient and eliminate as much overhead as possible, because, as the Node guide to tracing garbage collection neatly summarizes, “when GC is running, your code is not.”

I admit I’ve rather taken memory management for granted, since most of the languages I’ve studied have garbage collectors. I’ve been fascinated by Rust for years but haven’t managed to wrap my head around its ownership and borrowing rules. (Maybe this is the time it will finally click for me.) But if I struggle with memory management when the compiler itself is looking out for me, I’m not sure how I’d fare in a manual memory management scheme without guardrails. So for now, I’m very grateful to garbage collectors everywhere for making my life easier.

The Memory Management Reference was invaluable while researching this blog post, in addition to many other engine- and language-specific references (linked throughout the text).
