Flavio Brasil

Posted on Oct 31, 2023 • Edited on Nov 2, 2023

Rendezvous Benchmark Analysis: When Threads (Don't) Meet

#scala #performance #java #loom

I've been considering blogging for some time now, but I always have trouble finding motivation to do something more structured. I'm not sure why writing feels too formal to me. Maybe it's the years of struggle in school environments, or maybe it's just my procrastination habit 😅 I've decided to try to keep a light tone and share things in a less structured way than what I normally do at work for example. Let's see if that works. Here's my first try!

I recently came across a blog post by Adam Warski where he explores the performance limits of Loom when compared to Kotlin's coroutines. Adam has been working on abstractions on top of Loom in a project called Ox. It's interesting how he's trying to bring some of the convenience and safety of Scala's effect systems to Loom. Even though I'm still betting on effect systems, if I were a Java architect, I'd definitely keep an eye on his work. It would be amazing if it could lead to better APIs for concurrency in Java's standard library!

The exploration uses a benchmark where a producer green thread emits numbers, and then a consumer green thread consumes those numbers. I recommend reading Adam's blog post since I'm planning to go directly to the nitty-gritty low-level performance analysis of the benchmark.

I reimplemented the benchmark in a project I've been working on called Kyo. It's a new effect system in Scala that addresses several of the pain points in the current Scala solutions: better effect composition via algebraic effects, very low allocation rate via a radically different monad encoding (unboxed for pure computations), highly-optimized async primitives (only 32B per fiber!), and convenient direct syntax. On a side note, it's interesting how things evolve: from callback hell, to monadic hell, and now direct (hopefully not hell 🤓). I'll likely write more about it, so keep an eye out for new posts. You can see a preview of the documentation on the project's readme: https://github.com/getkyo/kyo.

Results

I've reimplemented the benchmark for four libraries: cats-effect, Kyo, ZIO, and Ox. You can see all the implementations in the source code. Results:

Throughput (higher is better):

Allocation rate (lower is better):

I'm performing this analysis on an M1 Max since it would be difficult to do it on a Linux box at the moment. The JVM version is openjdk 21.0.1.

Allocation profiling

I won't be able to do an analysis of all benchmark targets since the post would get too long, so I'll share the most interesting ones. I'm using async-profiler with its results exported in JFR format.

Let's start with Ox, the library with the lowest allocation rate:

This actually seems great! There are just a few allocations, mainly for Integer boxing and completable futures, but I'm not sure this is a complete picture. Loom's virtual thread suspension happens at a lower level on the JVM. When a virtual thread unmounts, the thread stack is transferred to native memory, which I believe isn't included in the heap allocation profiling. If anyone knows if there's a way to measure it, please let me know! It would be handy to observe production systems on Loom since container OOMs due to native memory usage aren't uncommon.

The library with the most allocations is ZIO:

There are allocations all over the place without a clear bottleneck. The largest stalactite has three main allocation sources: Ref.modify, Promise.await, and for handling interrupts. I'm not sure why these interrupt frames are there, but I imagine it's ZIO ensuring the interruption notification is dispatched at the end of an async computation. Kyo uses a similar mechanism.

The Ref.modify case is an interesting one because it's an approach I often see in Scala libraries that I consider a performance anti-pattern. When working with lock-free code, a fundamental building block is the compare-and-swap loop. Instead of pessimistically locking areas of memory, lock-free code uses an optimistic approach where operations are attempted and retried if other threads are modifying the same memory position concurrently.

The retry mechanism is typically a busy loop since a compare-and-swap is a relatively cheap operation (especially if compared to pessimistic locks), but the loop itself must also be cheap, which isn't the case in these libraries. They provide only a higher level of abstraction with a modify function that requires an interface dispatch and allocations on each retry:

// ZIO and cats-effect
waiting.modify {
  case null => (true, (p, n))
  case v    => (false, v)
}

In comparison, Kyo offers a standard compare-and-swap method without the need for an interface dispatch or allocations:

// Kyo
waiting.cas(null, p)

This issue isn't a major problem in this benchmark, but I thought it was a good topic to cover since it might help optimization efforts for these libraries. I wouldn't be surprised if addressing this issue could lead to a good improvement in contention-heavy code.

CPU Profiling

Now we get to the interesting part of this analysis! At first, I thought a lower allocation rate could be the main factor for Kyo's higher throughput, but that isn't the case. The main bottleneck for all targets, besides Kyo, is thread scheduling.

All libraries use a work-stealing thread pool. It's a great approach to distribute the workload across threads and keep latencies low. Basically, when worker threads become idle, they attempt to steal work from other threads.

But it can also be a major source of overhead. Code that would otherwise be much more efficiently executed by a single thread can end up jumping from one thread to another. Another common issue is threads repeatedly waking up to attempt steals but not succeeding, then going to sleep again. Waking up threads and putting them to sleep is costly if done at a high rate.

Cats-effect spends 49% of its CPU usage on parking threads:

Analyzing the flamegraph, I noticed that the method WorkStealingThreadPool.notifyParked is responsible for virtually all thread wake-ups. It's called from the method that schedules new tasks:

// WorkerThread.scala:131
def schedule(fiber: Runnable): Unit = {
  val rnd = random
  queue.enqueue(fiber, external, rnd)
  pool.notifyParked(rnd)
  ()
}

When a worker submits a new fiber for execution, it enqueues the task in its local queue but then notifies a random worker to wake up. I imagine the intent is that the woken-up worker might be able to steal the new task if the current worker takes too long to get to it.

Another source of wake-ups is when a worker receives new tasks from a thread not managed by cats-effect. There's another call to notifyParked when that happens. It seems the call is there because tasks from unmanaged threads are batched, so it gives an opportunity for other workers to steal tasks from the batch. An easy win seems to be only making that call if the batch has more than one task.

ZIO has a similar pattern:

I had more trouble following ZIO's code, but the main source of wake-up is a LockSupport.unpark call in the worker run loop. It seems when a worker attempts to steal, in certain situations it might end up waking up all idle workers. I wonder if this could be a bug.

Since Ox is based on Loom, it uses ForkJoinPool as its scheduler. Parking threads consumes 70% of the CPU usage 😱

The implementation seems very similar to cats-effect with workers enqueuing locally and then signaling idle workers. I've definitely learned a lot over the years by reading Java's concurrency primitives, but it's just too late at the moment to try to decode the ForkJoinPool.signalWork implementation. There's a for loop, so I have no idea how many threads it can wake up. I'll leave this exercise for the reader 🤣

Let's finally take a look at Kyo's CPU profiling:

Most of the CPU goes into executing the actual benchmark code and handling async operations. The overhead for thread wake-ups and sleeps is minimal.

This design is based on my experience working on a similar solution. When I used to work on the performance of Twitter's (sorry, I'll never call it X!) base libraries, I worked on a project with an adaptive scheduler with work-stealing for Twitter's Future. One of the most challenging aspects of the project was not having a performance regression. The default scheduler always schedules tasks locally using a trampoline mechanism based on a thread-local. It's an extremely efficient implementation that we optimized several times over the years. Users have to explicitly use a FuturePool to shift the execution to other threads, which typically happens in only a handful of places in a system.

Most of my attempts at introducing work-stealing failed to avoid a regression. At the time, I decided to make workers always enqueue locally and then have an adaptive worker parking mechanism to ensure work-stealing would happen within a reasonable timeframe (around 1ms). This approach finally avoided a regression.

In Kyo, I've landed on the following design:

A worker can only enqueue a task locally if its queue is empty.
All idle workers wake up every 1ms and attempt to steal tasks from two random workers.

By allowing only one task to be enqueued locally, the pattern where a computation submits multiple tasks to be executed in parallel always ends up forking the execution to multiple threads since only one can remain local.

The drawback is when a worker enqueues a task locally and takes too long to get to it. In this scenario, there can be some delay for the task to be stolen once an idle worker wakes up. Based on my experience working on the work-stealing scheduler at Twitter, I don't expect this to be an issue since I used a more relaxed design there and still saw a significant latency drop.

Something that was a bit challenging was ensuring the cost of the worker attempting the steals is very low. I landed on a worker queue design based on pessimistic locking but with a mechanism that prioritizes the owner of the queue.

Conclusion

It's interesting that a benchmark focused on the efficiency of threads meeting to exchange data is affected by these very threads failing to efficiently exchange work, often leading to them unnecessarily waking up and sleeping.

I think that's it for today. Please keep in mind that this isn't supposed to be a complete and fully accurate analysis. It's just me spending some time writing about an interesting topic. Feel free to reach out if you have any questions or suggestions!

DEV Community

Rendezvous Benchmark Analysis: When Threads (Don't) Meet

Results

Allocation profiling

CPU Profiling

Conclusion

Top comments (0)