CharlieTap

Making my concurrent algorithm 6000% better 🚀

Would you trade a little bit of memory for a whole lot of concurrency?

This is the concept behind a concurrent data structure I've been implementing in my Kotlin Multiplatform library.

The data structure is known as LeftRight and it's here to give you an alternative to those pesky locks that everyone wraps around shared mutable state.

I'll explain

What is LeftRight?

(illustration)

LeftRight is a concurrency primitive: it allows you to turn any mutable data structure into a concurrent version of itself.

For example... Have a mutable set you need to be thread safe?

Voilà

val sharedSet = LeftRight<MutableSet<T>>(::mutableSetOf)

Or more likely you have a mutable HashMap ...

val sharedMap = LeftRight<MutableMap<K, V>>(::mutableMapOf)

Or any other data structure you may want to share. LeftRight is generic over T and just requires a constructor for T to work.

That's cool, but why would I bring in your library as another dependency to my project when I can just wrap my data in a lock?

Well, you'd be saying goodbye to all of your concurrent performance if you did that. Locks don't scale. LeftRight doesn't use locks, at least not for coordinating reads and writes.

How can your library be safely exposing state whilst allowing reads and writes to access the same memory address? That's impossible!

You're right, that's impossible...

What???

Maybe the picture above can give you a clue...

Nope? Okay...

How do we LeftRight?

At its core LeftRight is incredibly simple: it consists of two copies of a given data structure and an atomic pointer. The atomic pointer works like a traffic warden at a fork in the road, directing read traffic to one data structure and write traffic to the other.
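
To make the shape concrete, here's a minimal structural sketch. This is illustrative only, not the library's actual API; the sections below fill in how reads and writes coordinate.

import java.util.concurrent.atomic.AtomicInteger

// Illustrative sketch: two copies of the wrapped structure, plus an atomic "pointer"
// (modelled here as an index) saying which copy readers should currently use.
class SketchLeftRight<T>(constructor: () -> T) {
    private val copies: List<T> = listOf(constructor(), constructor())
    private val readSide = AtomicInteger(0) // readers use copies[readSide], writers use the other copy
}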

So we have 2x the memory footprint?

Probably not, no. Kotlin is a pointer-heavy language; data structures tend to store references to objects rather than the objects themselves. So what we end up with is just twice the number of pointers.

The beauty of LeftRight lies in its ability to allow reads to progress without using locks or even asking them to wait. And this isn't an easy feat!

The problem

Imagine we have a LeftRight<MutableSet>. This means we have two MutableSets internally; we'll call them L and R for short. Initially the atomic pointer is directing writes to L and reads to R.

A write comes in and we update L.

Now we somehow need to let readers know about the change, so we redirect the pointer such that L is now accepting reads and R is now accepting writes.

But we haven't applied the earlier change to R yet. No problem, right? We just apply it to R and we're good?

Nope, this isn't safe, because we don't know if readers have departed from R yet. Maybe they read the pointer just before we redirected it and are now about to read from the side we're about to write to ... Somehow we need to signal to writers that readers have departed, without using traditional lock/wait based solutions like mutexes or semaphores.

LeftRight has a very clever solution to this problem, the algorithm goes something like this:

We create a global array of counters (these are just 32 bit Ints)

val counters: Array<Counter> = Array(maxReaderParallelism) { Counter() }

When a read comes in:

  1. We increment our counter
  2. We read from the data structure designated by the atomic pointer
  3. We increment our counter again

Given each reader thread has its own index into the global array and thus its own private counter, we can infer that there is only ever one thread writing to any given counter. If there is only one writer then we don't need any locks or waits; in fact the counters needn't even be atomic, as a CAS loop would never fail. We do however need to mark them as volatile to ensure all threads see the events in the order they happened and that the compiler doesn't rearrange our code in weird and wonderful ways.
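
Continuing the hypothetical sketch from above, the read path might look roughly like this. It's illustrative only: for brevity the sketch keeps the counters in an AtomicLongArray, whereas the real implementation uses plain volatile (and, as explained further down, padded) counters.

// Additional members for the sketch class above (still illustrative).
// One counter slot per reader thread; only the owning thread ever writes to its slot.
private val readerEpochs = java.util.concurrent.atomic.AtomicLongArray(64) // sized to the max reader parallelism

fun <R> read(readerIndex: Int, block: (T) -> R): R {
    readerEpochs.incrementAndGet(readerIndex)     // counter becomes odd: "I'm inside"
    try {
        return block(copies[readSide.get()])      // read whichever copy the atomic pointer designates
    } finally {
        readerEpochs.incrementAndGet(readerIndex) // counter becomes even: "I've departed"
    }
}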

For writes:

  1. We perform the write on the data structure designated by the atomic pointer
  2. We switch the atomic pointer so now reads will see this update
  3. Now we need to wait for readers to depart from the old read side (the new write side)
  4. We perform the write again on the new write side of the data structure

So how do we determine that readers have departed?

This is where the counters come in and the magic happens. Remember, reads increment the counter twice: once before and once after a read. When the counter is odd, we know a reader is actively in the map, but when it's even we know it has finished its work and the following read will have to go through the atomic pointer again. So we filter out all the counters that are odd and then we loop on them, waiting for them to change, not necessarily to become even, just to change. Any change means that reader has gone back through the atomic pointer and is no longer on the side we're about to write to.
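
Completing the hypothetical sketch, here's roughly what the write path looks like, including the wait-for-readers step. Note that writers are serialised against one another (the benchmarks later reflect this; in this sketch a plain ReentrantLock stands in for that); what LeftRight avoids is readers ever waiting on that lock.

// Completing the sketch (illustrative only). Requires kotlin.concurrent.withLock.
private val writerLock = java.util.concurrent.locks.ReentrantLock() // writers serialised among themselves

fun write(mutation: (T) -> Unit): Unit = writerLock.withLock {
    val readIdx = readSide.get()
    val writeIdx = 1 - readIdx

    mutation(copies[writeIdx])   // 1. apply the write to the current write copy
    readSide.set(writeIdx)       // 2. flip the pointer: new reads now see the update

    // 3. wait for readers that may still be inside the old read copy (copies[readIdx]).
    //    A counter that is odd right now belongs to a reader mid-read; once it changes,
    //    that reader has gone back through the atomic pointer and off the old copy.
    for (i in 0 until readerEpochs.length()) {
        val epoch = readerEpochs.get(i)
        if (epoch % 2 == 1L) {
            while (readerEpochs.get(i) == epoch) {
                Thread.onSpinWait()
            }
        }
    }

    mutation(copies[readIdx])    // 4. replay the write on the other copy
}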

What's with the weird sizing on the counter array?

The size of this array dictates the number of reader threads LeftRight supports; this is one of the constraints of my implementation. If we allowed the array to grow or be resized then we would have to wrap it in a lock. Instead we attempt to anticipate the upper bound on threads in your program, using the same logic Kotlin coroutines uses for thread pool sizing.
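
For what it's worth, a hypothetical way to derive that bound on the JVM would be to mirror how kotlinx.coroutines caps its IO dispatcher; the library's actual logic may differ.

// Hypothetical sizing (an assumption, not the library's actual code), mirroring
// kotlinx.coroutines' default Dispatchers.IO bound: 64 threads, or the number of
// available processors if that is higher.
val maxReaderParallelism: Int =
    maxOf(64, Runtime.getRuntime().availableProcessors())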

Why LR?


I created LR for two reasons:

  • There are no Kotlin Multiplatform concurrent data structures
  • Locks are not a suitable replacement in 2023

Kotlin Multiplatform is still in its infancy; at present there are no Kotlin Multiplatform concurrent data structures. You can't simply reach for ConcurrentHashMap like you could when you were working within the JVM ecosystem; everything depends on what targets your project supports.

In the absence of the wealth of JVM based libraries, I see more and more people reaching for locks. Whilst this may be perfectly acceptable for a given use case, I don't believe generally replacing concurrent data structures with locks is a smart move in 2023.

Locks strip our multicore CPUs of their parallelism superpowers through contention, and contention isn't something that's going to go away. Quite the opposite: it's only going to grow. Computers are gaining more cores year on year; high end consumer chips now have up to 24 cores, that will shortly rise to 32, and within the next decade we will undoubtedly see core counts that breach the 100 mark. Do you really want 99 cores waiting on your mutex? We'll see the performance ramifications of this later on.

LeftRight gives you a scalable alternative to locks and positions itself as a generic concurrent data structure with more than acceptable performance.

LeftRight is not a direct competitor to the highest performing concurrent data structures, like ConcurrentHashMap for example. ConcurrentHashMap is a bespoke concurrent HashMap implementation; it's not a wrapper around a HashMap, it is the HashMap. LeftRight, however, is a generic wrapper around a data structure, and as always with indirection we incur the cost of chasing pointers. LeftRight is also entirely ignorant of the workings of the thing it wraps; it knows just two invariants, that reads will occur and that writes will occur. For this reason it's unable to provide domain specific optimisations.

Less talk more benchmarks


Time to put my money where my mouth is. I've been moaning a fair bit about locks, so let's dive into some benchmarks.

We're going to compare:

  • Java's ConcurrentHashMap (The gold standard 👑)
  • LeftRight (The new kid 🤠)
  • ReentrantReadWriteLock (The best locks have to offer 🔒)

I've been generous here: rather than use a standard mutex I've gone with a read-write mutex in order to really get the best out of locks.

The benchmarks below were all run on an Apple M1 Max; the benchmark suite can be found in the repo here

ReentrantReadWriteLock

Operation           Single Threaded    Multi Threaded
read throughput     0.0069 ops/ns      0.0018 ops/ns
read avg time       14 ns/op           5,783 ns/op
write throughput    0.0789 ops/ns      0.0056 ops/ns
write avg time      13 ns/op           227 ns/op

Single threaded benchmarks look pretty good as you would expect of a lock under no contention but it's not looking so great for the multithreaded side of things. The average time of a read is 400x slower and overall throughput around 6x less when accessing the hashmap from multiple threads.

LeftRight

Operation           Single Threaded    Multi Threaded
read throughput     0.0808 ops/ns      0.6449 ops/ns
read avg time       12 ns/op           16 ns/op
write throughput    0.0381 ops/ns      0.0324 ops/ns
write avg time      28 ns/op           437 ns/op

LeftRight shows a remarkable improvement in both single threaded and multi threaded benchmarks. More importantly the multithreaded throughput of reads is showing an 8x linear improvement which checks out given my CPU has 8 performance cores. Writes are as expected given they are serialised with a mutex.

ConcurrentHashMap

Operation           Single Threaded    Multi Threaded
read throughput     0.1406 ops/ns      1 ops/ns
read avg time       7 ns/op            9 ns/op
write throughput    0.0093 ops/ns      0.0073 ops/ns
write avg time      10 ns/op           1,398 ns/op

As expected reads are blazingly fast and ahead of both of the previous two implementations, but interestingly LR comes within 80-90%, which is better than we could have hoped for a data structure with inherently more indirection. Writes seem to struggle, but this is likely a quirk of the benchmark writing to the same key and thus the same bin (ConcurrentHashMap uses bin level locking); in reality writes would have a much fairer distribution and I would expect ConcurrentHashMap to come out on top.

Here's a graph depicting all the above courtesy of ChatGPT

(graph)

What you came here for

LeftRight is remarkably performant for a generic concurrency primitive. But things weren't always this way... let's take a look at an earlier implementation's benchmarks.

Operation           Single Threaded    Multi Threaded
read throughput     0.0716 ops/ns      0.0107 ops/ns
read avg time       14 ns/op           534 ns/op
write throughput    0.0368 ops/ns      0.0376 ops/ns
write avg time      27 ns/op           442 ns/op

You can see from the above that the performance takes a nosedive with the introduction of more threads/CPU cores. I expected this to be the case for mutations; after all we have a single writer policy which is enforced using a lock.

But for reads this didn't make sense at all; not only is the average time of a read 50x slower but the throughput is 7x less??? I'm using a lock-free/wait-free algorithm for reads! Wtf??? There's no shared memory and nothing for threads to contend on???

Well ...


It transpires that it's not enough to have threads guard their own private memory when accessing shared resources. Due to the nature of modern CPU cache architectures, even adjacent memory locations can be a source of contention.

It's like contention through osmosis 😭 😭

This phenomenon, known as false sharing, occurs when multiple threads, running on different cores, access different variables that reside on the same cache line. Even though they're not directly sharing data, the mere fact that their data resides closely in memory means they effectively compete for the same cache line. This can cause performance degradation as threads constantly invalidate each other's cache lines, even if they're not accessing shared data directly.
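
If you want to see false sharing for yourself, here's a toy JVM demonstration (my own illustration, not a rigorous benchmark and not from the library): two threads bump different counters, and the only thing that changes between runs is how far apart those counters sit in memory.

import java.util.concurrent.atomic.AtomicLongArray
import kotlin.concurrent.thread

// Two threads increment *different* counters. With stride = 1 the counters are
// adjacent (8 bytes apart) and share a cache line; with stride = 16 they are
// 128 bytes apart and land on separate lines, so the threads stop fighting.
fun measure(label: String, stride: Int) {
    val counters = AtomicLongArray(stride + 1)
    val start = System.nanoTime()
    val workers = listOf(0, 1).map { id ->
        thread { repeat(20_000_000) { counters.incrementAndGet(id * stride) } }
    }
    workers.forEach { it.join() }
    println("$label: ${(System.nanoTime() - start) / 1_000_000} ms")
}

fun main() {
    measure("adjacent counters (same cache line)", stride = 1)
    measure("padded counters (separate cache lines)", stride = 16)
}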

Honestly I just keep coming back to this quote over and over again in my career...

There are only two hard things in Computer Science: cache invalidation and naming things.
-- Phil Karlton

So how do we fix this?

The problem stems from the original implementation of my global counter array

val globalEpochArray = Array(64) { AtomicInt(0) }

When the array is allocated, it will more than likely allocate those 64 AtomicInt objects next to one another on the heap.

AtomicInt is just a wrapper around a 4 byte integer, and every class has a 12 byte (on a 64 bit JVM) object header, making one allocation of AtomicInt take 16 bytes of space.

So we now have a 16 * 64 byte block of contiguous memory allocated on the heap.

When thread 1 loads the first counter (which should be just bytes [0 - 16]) it in fact loads an entire cache line worth of memory; on my laptop (M1 Max) an L1 cache line is 128 bytes. So now the core running thread 1 has the first [0 - 128] bytes, and thus the first 8 counters, inside its cache line.

When thread 2 tries to load the second counter it does the same, now thread 2 has counters 2 through to 10 in its cache line.

Now what happens when thread 1 mutates its counter?

It invalidates the cache line of up to 8 other cores, and when poor thread 2 comes around to trying to change its counter, it now has to load it from RAM again.

To give a rough idea of the impact, here are typical load times for L1 CPU cache versus RAM:

L1 Cache Lookup: 0.5 to 1 nanoseconds.
RAM Lookup: 50 to 100 nanoseconds

This explains why my average read time went from 14 ns to 534 ns.

So how do we stop this?

With this abomination

class PaddedVolatileInt(initialValue: Int) {

    @Volatile private var p1: Long = 0
    @Volatile private var p2: Long = 0
    @Volatile private var p3: Long = 0
    @Volatile private var p4: Long = 0
    @Volatile private var p5: Long = 0
    @Volatile private var p6: Long = 0
    @Volatile private var p7: Long = 0
    @Volatile private var p8: Long = 0
    @Volatile private var p9: Long = 0
    @Volatile private var p10: Long = 0
    @Volatile private var p11: Long = 0
    @Volatile private var p12: Long = 0
    @Volatile private var p13: Long = 0
    @Volatile private var p14: Long = 0

    @Volatile var value: Int = initialValue
}

Essentially we pad our counter object such that it fits perfectly inside an L1 cache line.

12 byte header + (14 * 8) + 4 = 128 bytes

This really isn't an ideal solution; the padding should ideally be dynamic so that CPUs with smaller cache lines are accommodated, but we don't have that level of control from Kotlin.

But with that change our multithreaded read throughput rises dramatically and the average read time falls.

Operation           Before (Multi Threaded)    After (Multi Threaded)    % Change
read throughput     0.0107 ops/ns              0.6449 ops/ns             +5927.10%
read avg time       534 ns/op                  16 ns/op                  -97.00%
write throughput    0.0376 ops/ns              0.0324 ops/ns             -13.83%
write avg time      442 ns/op                  437 ns/op                 -1.13%

The value of throughput now grows linearly with the number of cores available, so you would see different increases/decreases depending on the machine.

Wrapping up


You can find my repository containing LeftRight here, you'll notice that the repo is actually called cachemap. That's because the repo contains two libraries, LeftRight and CacheMap, the latter is essentially just a batteries included LeftRight<MutableMap>.

Both LeftRight and CacheMap are experimental at the moment so please proceed with caution despite the positive looking benchmarks. Be extra careful with the suspending variants as these haven't yet been benchmarked nor optimised.

There are lots of optimisations still to be made; writes are slower than they need to be at the moment. I have a rough design of a write batching algorithm which should amortise the cost of the lengthy writer wait step.

I'm currently in the process of publishing the artifacts to Maven Central (the process is not fun 🤯); hopefully by the time you read this I will have instructions on the repo for how to pull each artifact.

Acknowledgements

I learned of LeftRight in a YouTube stream from Jon Gjengset a while back; Jon is in my opinion the best comp sci content creator out there. Please go like and subscribe to his channel if you're keen on learning!

You can also find what I believe to be the first paper on the primitive here.

All illustrations were created by the incredible artist that is DALL·E 3.

Top comments (4)

Christophe Beyls

Hello,
Thanks for the interesting read and the introduction to the LeftRight concurrency algorithm which I didn't know about.

I think you should make it clearer by mentioning right from the start that this algorithm is only lock-free for readers and the writes still use a Mutex lock, and as such it is only optimized for scenarios where you have many more reads than writes.

A more common high performance solution for this use case that you didn't even mention in the article (or benchmark) is to share an immutable data structure, and update the value by building a new one from scratch starting with a copy of the old one, swapping the old reference with the new one atomically using compareAndSet. This would allow writers to be lock-free as well, even if a bit slower because of the initial copy. I think LeftRight is more useful for programming languages where memory needs to be managed manually.

I must admit I didn't understand everything about the signaling between the readers and the writer in the LeftRight algorithm and your Kotlin implementation of it.
I have a few questions about it:

  • Since every reader thread has its own separate counter, is it really needed to increment a counter twice, or would it be enough to simply write the value 1 at the beginning of the read and write the value 0 at the end to signal the read is complete? Would this allow higher performance because the new value only needs to be flushed to RAM without the need to read the old one first? Or would it still create a CPU cache miss for other readers?
  • Instead of your PaddedVolatileInt "abomination" class, wouldn't it be easier to use something like Kotlin's AtomicIntArray, and only write to the last position of it? But more importantly, your padding optimization logic seems to be based on the specs of a 64-bit JVM with a specific CPU, which makes it very much non-multiplatform, contradicting your goal of providing a high-performance multiplatform concurrent data structure.
CharlieTap

Hey @cbeyls

Thanks for the feedback, I'll try to clarify as best as possible

I think you should make it clearer by mentioning right from the start that this algorithm is only lock-free for readers and the writes still use a Mutex lock, and as such it is only optimized for scenarios where you have many more reads than writes.

Maybe it's a little misleading, you're right. From a concurrency perspective, the marvel of LeftRight is that mutexes are not used to coordinate reads and writes. Mutexes, or mutual exclusion more generally, are inescapable during writes; you simply cannot safely have two concurrent mutations on the same piece of memory.

A more common high performance solution for this use case that you didn't even mention in the article (or benchmark) is to share an immutable data structure, and update the value by building a new one from scratch starting with a copy of the old one, swapping the old reference with the new one atomically using compareAndSet. This would allow writers to be lock-free as well, even if a bit slower because of the initial copy. I think LeftRight is more useful for programming languages where memory needs to be managed manually.

This actually came up originally when I posted this article in the Kotlin Slack channel. So... immutable data structures are not capable of serving the same purpose as a concurrent data structure. For example, say we have an atomic pointer to an immutable list like you mentioned

val immutable = atomic<ImmutableList<String>>(persistentListOf("a"))

Imagine two threads concurrently change the same immutable list, one adding "b" and one adding "c". Immutable data structures always return a new object and therefore you would end up with two CAS swaps

  • persistentListOf("a", "b")
  • persistentListOf("a", "c")

The final state of the pointer would be whichever came second. Without mutual exclusion you cannot get around this unfortunately.

Since every reader thread has its own separate counter, is it really needed to increment a counter twice, or would it be enough to simply write the value 1 at the beginning of the read and write the value 0 at the end to signal the read is complete? Would this allow higher performance because the new value only needs to be flushed to RAM without the need to read the old one first? Or would it still create a CPU cache miss for other readers?

The way CPU cache coherency works means that any change to a memory address you have stored in CPU cache will need to be invalidated for all cores when it's mutated, else you could have a scenario where one core keeps reading 1 when it's been reset to zero. Having the counter grow rather than oscillate between 1 and 0 helps establish a causal order: if you see 3 and then 7, you know 7 came after, and this can be useful for a bunch of different reasons.

Instead of your PaddedVolatileInt "abomination" class, wouldn't it be easier to use something like Kotlin's AtomicIntArray, and only write to the last position of it? But more importantly, your padding optimization logic seems to be based on the specs of a 64-bit JVM with a specific CPU, which makes it very much non-multiplatform, contradicting your goal of providing a high-performance multiplatform concurrent data structure.

I believe AtomicIntArray is an array of boxed ints, so this wouldn't work as the pointers could be at different spots in memory and thus you couldn't get the alignment right for the cache line. It's true that this impl is specialised to a 64 bit JVM, but not necessarily to a particular CPU. Sizing the counter to 128 bytes doesn't mean it will only work on CPUs with cache lines of 128; on smaller CPUs we would just occupy more cache lines than we need to, but we would still be preventing false sharing. 128 is the biggest out there right now. Valhalla may land in the next Java release, at which point Kotlin has a path forward to value types and I can update the data structure to be a little less specialised. That being said, 64 bit is the norm so this should be fine for 99% of people.

Christophe Beyls • Edited

Thanks for the clarifications about cache invalidations.

Imagine two threads concurrently change the same immutable list, one adding "b" and one adding "c". Immutable data structures always return a new object and therefore you would end up with two CAS swaps
persistentListOf("a", "b")
persistentListOf("a", "c")
The final state of the pointer would be whichever came second. Without mutual exclusion you cannot get around this unfortunately.

Actually yes you can, if you use an atomic compareAndSet(). The second writer, when attempting to update a to c, will not be able to perform the update and the method will return false, because the comparison will fail (a is now b). The writer just has to retry using the new value until it works. That's basically the algorithm used by MutableStateFlow.update():

public inline fun <T> MutableStateFlow<T>.update(function: (T) -> T) {
    while (true) {
        val prevValue = value
        val nextValue = function(prevValue)
        if (compareAndSet(prevValue, nextValue)) {
            return
        }
    }
}

I believe AtomicIntArray is an array of boxed ints, so this wouldn't work as the pointers could be at different spots in memory and thus you couldn't get the alignment right for the cache line.

You can check the source code here, AtomicIntArray is not using boxing but a native array of primitive ints internally.

That being said, now that I think about it, I think it's a mistake to assume where the system will allocate objects on the heap, and this observation is valid for an array of any kind of objects (except for inline value classes), including PaddedVolatileInt.
If you declare an array of object references, we can assume that the references will be contiguous in memory, yes, but can we really always expect that the objects they're pointing to will follow in memory and will be contiguous too? I don't think so, unless there is some JVM specification about that? And if there is, again, it's only valid for the JVM.

It's true that this impl is specialised to a 64 bit JVM but not necessarily for a particular cpu.

At least you admit that it only targets a single Kotlin platform (the JVM). I'm not even sure Android ART uses the same object overhead size.

That being said 64 bit is the norm so this should be fine for 99% of people

There are still a lot of 32-bit ARM CPUs out there. And you may be surprised how many people run 32-bit JVMs on 64-bit ARM CPUs, because it uses less memory.

Let me know what you think and thank you for the entertaining read ;)

CharlieTap • Edited

Hey again,

These are all great questions, so I'll try to answer them thoroughly so others in the future can learn from them.

Actually yes you can, if you use an atomic compareAndSet(). The second writer, when attempting to update a to c, will not be able to perform the update and the method will return false, because the comparison will fail (a is now b). The writer just has to retry using the new value until it works. That's basically the algorithm used by MutableStateFlow.update():

CAS'ing a whole value like that doesn't make any sense from a hardware perspective; there's no assembly instruction that could do something like that, for example, because a CAS swaps a value that fits in a register with a value from RAM.

Here, if you check the source, it's using synchronized, so it's using a monitor lock

github.com/Kotlin/kotlinx.coroutin...

Even if you could CAS a value it would still be using mutual exclusion, specifically optimistic mutual exclusion. The CAS loop ensures the change is serialised and that each update is guaranteed to be based on the latest state and will be rolled back if not. This type of optimistic mutual exclusion is great in scenarios of low contention like the above, because the reality of two updates happening concurrently in a viewmodel is very very slim. If you were to build a concurrent data structure simply by CAS'ing its value you would run into some big problems for example:

  • The complexity of the CAS op would grow linearly with the size of your data structure, as the equality operation would have to (in order to work correctly) check that all of the values contained within are equal.
  • Thread starvation, because the CAS op is so long and there's no fairness strategy
  • The CAS would fail often under contention, but what's worse is that you would be burning all the cores optimistically assuming they would succeed, when in fact only one will. This is much worse than a mutex in fact, as with a mutex the non-active threads are asleep and the mutex likely has a fairness protocol.

To bring this back to your original point of benchmarking immutable data structures: an immutable data structure is not concurrent, as it creates a new object every time rather than being a single shared data structure. You could, as described, update the data structure atomically (if it could fit in a register) but this would have very unfortunate performance characteristics, which is a little worrying.

The approach you've described and LeftRight are quite unique in that they mutate generic data structures and then synchronise the change atomically. This will always be suboptimal compared to something like ConcurrentMap, which uses knowledge of the data structure (because it's specialised) to coordinate safe access to subsections of it.

That being said, now that I think about it, I think it's a mistake to assume where the system will allocate objects on the heap

Whether it's heap or stack allocated doesn't affect much; both reside in RAM and the CPU pulls memory into cache lines in the same way.

At least you admit that it only targets a single Kotlin platform (the JVM). I'm not even sure Android ART uses the same object overhead size.

So currently this is optimised to work on a 64 bit JVM as you mentioned (it still works in the other cases, and the difference would probably be unnoticeable for mobile workloads on Dalvik and ART just because they are so small). I'm working on a solution for native targets now using C. When the time comes and we have the introduction of value classes, I'll be able to adjust it to be pure Kotlin and it should work across all targets with no "abominations" :P