More than a decade ago, the LMAX Disruptor pattern entered the scene. I found it interesting but frankly had trouble wrapping my head around it. It wasn't until years later, when I started working on a JVM project, that I finally understood. There is a lot of good performance engineering baked into it, but it basically only exists because of JVM limitations. It wouldn't rise to the level of a pattern in system-level languages like Go or Rust, or even in a higher-level language with more performance knobs like C#.
Context
Let's back up a bit. LMAX was building a high-frequency trading platform. Such a system optimizes for latency and throughput. For these constraints, there's a big problem with their choice of platform. The JVM was designed for Java, which is ostensibly meant for high-level (especially object-oriented) programming. It just doesn't give the user many performance knobs to turn.
One of the reasons I had such a hard time understanding the purpose of many of the details in the Disruptor is that I came from .NET. Even though Microsoft basically copied Sun's Java homework to make C# on .NET, in the process they also managed to bake in some improvements, including performance tunability. Many of the pattern's details were workarounds for limitations I didn't recognize. I'll give you some examples:
Java Limitations
Structs
Java does not allow the user to create primitives, a.k.a. structs, a.k.a. stack-allocated value types. It only allows you to make garbage-collected (heap-allocated) types. C# allows you to create structs. System-level languages tend to take the inverse approach: types are assumed stack-allocated until you explicitly allocate them on the heap.
Here's an oversimplified explanation of how that works. Each code scope (e.g. a function execution) runs in its own stack frame. (Note: this is where a stack trace comes from.) When you return from the function, its stack frame goes away, along with whatever local variables were allocated in it. So stack-allocated variables do not need any special memory management. When you pass struct data into a function or return struct data from it, the data is copied between stack frames. This is efficient until the data gets too large. That's when you might reach for a separate memory allocation on the heap. Then the stack frame only copies the pointer to the heap data (8 bytes on a 64-bit system) to the next stack frame instead of the full data.
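Java has no structs, but its built-in primitives are stack-allocated locals, so a minimal sketch of the difference is still possible in Java (the names here are illustrative):

```java
// A minimal sketch: a primitive stands in for small stack-allocated data,
// and an array stands in for heap-allocated data.
class StackVsHeap {
    static long increment(long value) { // 'value' is copied into this frame
        return value + 1;               // the caller's variable is untouched
    }

    static void mutate(long[] data) {   // only the 8-byte reference is copied
        data[0] = 99;                   // writes through to the caller's heap array
    }

    public static void main(String[] args) {
        long x = 41;
        System.out.println(increment(x)); // 42
        System.out.println(x);            // still 41: it was copied, not shared

        long[] heap = {1, 2, 3};
        mutate(heap);
        System.out.println(heap[0]);      // 99: both frames saw the same heap data
    }
}
```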
What constitutes "too large" depends on the system, but on .NET I recall the threshold being 24-32 bytes (3-4 longs or object references). Above that, object references could perform better, depending on garbage collector behavior. As with all performance optimizations, you don't know for sure which is better until you test it.
High-level languages default to heap-allocated data so that they copy 8-byte pointers between stack frames instead of the full data. This probably performs better in common cases, but optimized code might do better with structs. Java doesn't give you the option to tune performance in this way.
Memory Layout
Alignment to the CPU cache line is another common performance optimization. In short, CPUs have multiple levels of caches, and data is read from memory into the cache in cache lines, which are generally 64 bytes. Especially for collections, there are potential performance advantages if you design your data types to be exactly 8, 16, 32, or 64 bytes. If your data fits evenly into a cache line, the CPU can use optimizations like pipelining, which improves performance when processing collections.
Java has no way to specify memory layout, so the Disruptor makes use of padding fields in the classes used by its ring buffers, named like p1, p2, etc. These fields are never touched by logic; they exist only to align the data structure. .NET, by contrast, has StructLayout and related attributes that let you control layout and add padding explicitly.
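For illustration, here's a sketch of the padding idiom, assuming a 64-byte cache line (the real Disruptor splits the padding across superclass layers so the JVM can't reorder the fields; this flattened version just shows the idea):

```java
// A sketch of the padding idiom (assuming a 64-byte cache line). The p1..p14
// fields are never read or written by logic; they only pad the layout so the
// hot 'value' field sits on a cache line by itself.
class PaddedLong {
    long p1, p2, p3, p4, p5, p6, p7;      // 7 * 8 = 56 bytes of padding before
    volatile long value;                  // the field threads actually use
    long p8, p9, p10, p11, p12, p13, p14; // 56 bytes of padding after
}
```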
Boxing
Even though Java does not allow you to create primitive data types, it has some built in, like int. However, when you use primitives with generics, Java "boxes" them up as heap-allocated objects (an int becomes an Integer). Especially in generic collections, this uses more memory and costs CPU instructions to box and unbox.
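A minimal sketch of what that boxing costs:

```java
import java.util.ArrayList;
import java.util.List;

// Generic collections can only hold reference types, so each int is boxed
// to a heap-allocated Integer going in and unboxed coming out.
class BoxingDemo {
    public static void main(String[] args) {
        List<Integer> boxed = new ArrayList<>();
        for (int i = 0; i < 1_000; i++) {
            boxed.add(i);          // autoboxing: allocates an Integer on the heap
        }                          // (values -128..127 come from a shared cache)

        long sum = 0;
        for (int value : boxed) {  // unboxing on every read
            sum += value;
        }
        System.out.println(sum);

        int[] primitives = new int[1_000]; // contiguous, no per-element objects
        System.out.println(primitives.length);
    }
}
```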
The Disruptor's documentation doesn't explicitly mention this, but it is a well-known drag on JVM generics performance, and one I wasn't familiar with coming from .NET, which doesn't box primitives used in generics.
Performance Optimizations
Overall, the Disruptor is a marriage of qualities from the Actor and SEDA patterns with some distinct performance optimizations, like a ring buffer that doubles as an object pool. From the Actor/Agent pattern, the Disruptor took the message inbox and independent processing. However, like SEDA, the Disruptor's processing is also coordinated step-wise with previous stages.
The ring buffer plays a prominent role in the Disruptor for a few reasons. First, it is a common way to implement a fixed-size queue. This is necessary in high-throughput systems because unbounded queues will eventually hit memory limits. A bounded queue forces you to think about your memory usage and your full-queue behavior (e.g. load shedding vs. blocking), as in the sketch below.
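Here's that policy choice illustrated with a standard JDK bounded queue (not the Disruptor itself):

```java
import java.util.concurrent.ArrayBlockingQueue;

// A bounded queue makes the full-queue policy explicit in the API:
// when there is no space, you must choose to drop or to wait.
class BoundedQueuePolicy {
    public static void main(String[] args) throws InterruptedException {
        ArrayBlockingQueue<String> queue = new ArrayBlockingQueue<>(4);

        // Load shedding: offer() returns false when full; drop or divert the item.
        boolean accepted = queue.offer("event");
        System.out.println(accepted);

        // Blocking: put() waits until space frees up, applying backpressure.
        queue.put("event");
    }
}
```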
The ring buffer simultaneously serves as an object pool. Object pools are pretty common in game development; business devs will be more familiar with them from database or HTTP connection pooling. The general idea is that creating a new object is expensive in some way, so you instead keep a pool of them and soft-reset and reuse an old instance. For DB or HTTP pools, the expensive part is actually making the network connection. But in very high-throughput systems, the problem is the churn of allocating new objects and then garbage collecting them in a tight loop. Because the ring buffer is pre-allocated with blank objects, those objects never need to be garbage collected or instantiated again later. It essentially circumvents garbage collection.
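A minimal sketch of the idea (MutableEvent is a hypothetical event type; the real Disruptor fills slots via an EventFactory and coordinates access with sequences, all elided here):

```java
// A ring buffer doubling as an object pool: every slot is allocated once,
// up front, and then overwritten ("soft reset") on each reuse.
class MutableEvent {
    long payload; // overwritten each time the slot is reused
}

class PreallocatedRing {
    private final MutableEvent[] slots;

    PreallocatedRing(int size) {
        slots = new MutableEvent[size];
        for (int i = 0; i < size; i++) {
            slots[i] = new MutableEvent(); // allocate once, up front
        }
    }

    // Publishing mutates an existing slot: no allocation, nothing for GC.
    // Indexing uses modulus here; see the power-of-2 trick below.
    void publish(long sequence, long payload) {
        slots[(int) (sequence % slots.length)].payload = payload;
    }

    MutableEvent get(long sequence) {
        return slots[(int) (sequence % slots.length)];
    }
}
```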
There are other optimizations around the ring buffer, such as only using sizes which are powers of 2. A ring buffer intrinsically has an ever-increasing queue position which can go past the actual size of the buffer. So if the ring buffer size is 4 (indexes 0 to 3) but we're processing our 15th item out of the queue, we can calculate the ring buffer position with a modulus by the buffer size: 15 modulus 4 is 3, so the item we want is at index 3. The problem is the modulus operation, which requires a division and can be expensive in a tight loop. But not for powers of 2. Since the computer operates in binary, a modulus by a power of 2 reduces to a single cheap bitwise AND with size - 1. So using a power of 2 as the ring buffer size increases performance.
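In code, the trick is replacing the modulus with a bitwise AND, using the same numbers as above:

```java
// Why powers of 2 matter: modulus by a power of 2 equals a bitwise AND
// with (size - 1), which is far cheaper than a division in a hot loop.
class PowerOfTwoIndex {
    public static void main(String[] args) {
        int size = 4;        // must be a power of 2 for the mask to work
        int mask = size - 1; // 3, i.e. binary 0011

        long sequence = 15;  // ever-increasing position, past the buffer size
        long byModulus = sequence % size; // 15 % 4 = 3
        long byMask    = sequence & mask; // 1111 & 0011 = 0011 = 3

        System.out.println(byModulus + " == " + byMask); // 3 == 3
    }
}
```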
I didn't cover all the optimizations, but you get the idea.