GenPool uses a sharded design to reduce contention. To determine which shard serves a request, it calls the runtime's procPin function to pin the goroutine to its logical processor and then uses the resulting processor ID as an index into the shards slice. Sounds complicated for a load-balancing mechanism, right? What happens if we implement a much simpler way of doing nearly the same thing? Well, as it turns out, things get dramatically worse.
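For readers who haven't seen the pinning trick before, here is a minimal sketch of how runtime.procPin is typically reached from outside the runtime. The declarations are illustrative, not GenPool's actual source, and newer Go toolchains restrict which runtime symbols may be linknamed.
// linkname_sketch.go: illustrative only, not GenPool's actual code.
// A (possibly empty) .s file must exist in the package so the compiler
// accepts the bodiless declarations below.
package genpool

import (
	_ "unsafe" // required so //go:linkname is permitted in this file
)

//go:linkname runtimeProcPin runtime.procPin
func runtimeProcPin() int

//go:linkname runtimeProcUnpin runtime.procUnpin
func runtimeProcUnpin()

// currentPID (hypothetical helper) returns the ID of the logical processor (P)
// the calling goroutine is running on; the pool uses this ID as a shard index.
func currentPID() int {
	id := runtimeProcPin() // pin: returns the current P's ID
	runtimeProcUnpin()     // unpin immediately; only the ID is needed
	return id
}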
Why The Change?
According to benchmarks, the code below was consuming most of the CPU time, but without anything else to compare it to, that doesn't mean much on its own. At best, it highlights the hot spots of a function that may not even be optimizable. Nonetheless, I thought I'd try a simpler way of doing this.
Total: 188.83s
55.38s 70.51s (flat, cum) 37.34% of Total
3.19s 3.21s 316:func (p *ShardedPool[T, P]) getShard() (*Shard[T, P], int) {
2.54s 10.14s 319: id := runtimeProcPin()
. 7.41s 320: runtimeProcUnpin()
49.65s 49.75s 322: return p.Shards[id%numShards], id
}
The Change
This new version creates a dummy variable on the stack and uses its memory address as a pseudo-random number to pick a shard. Since each goroutine has its own stack space, different goroutines will get different addresses, naturally distributing the load across shards.
Instead of relying on the last few bits of the address (which often have low entropy and lead to poor distribution), we shift the address right by 12 bits to tap into the more randomized middle bits. Then, we apply a bitwise AND with (numShards - 1) to ensure the result stays within bounds.
func (p *ShardedPool[T, P]) getShard() (*Shard[T, P], int) {
var dummy byte
addr := uintptr(unsafe.Pointer(&dummy))
id := int(addr>>12) & (numShards - 1)
return p.Shards[id], id
}
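To see the distribution in isolation, here is a small standalone sketch of the same trick (the shard count and output are illustrative only). Note that the AND mask maps addresses across all shards evenly only when numShards is a power of two.
package main

import (
	"fmt"
	"sync"
	"unsafe"
)

const numShards = 8 // illustrative; must be a power of two for the mask below

// pickShard reproduces the stack-address trick from getShard in isolation.
func pickShard() int {
	var dummy byte
	addr := uintptr(unsafe.Pointer(&dummy))
	return int(addr>>12) & (numShards - 1)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			fmt.Printf("goroutine %d -> shard %d\n", i, pickShard())
		}(i)
	}
	wg.Wait()
}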
Performance profile after the change:
Total: 238.81s
13.55s 13.61s (flat, cum) 5.70% of Total
. . 317:func (p *ShardedPool[T, P]) getShard() (*Shard[T, P], int) {
. . 318: var dummy byte
. . 319: addr := uintptr(unsafe.Pointer(&dummy))
130ms 130ms 320: id := int(addr>>12) & (numShards - 1)
. . 321:
13.42s 13.48s 322: return p.Shards[id], id
. . 323:}
I was very happy with the individual result of this optimization, until I checked the overall performance of the system…
What Got Worse
Well, after we get a shard, we can now retrieve an object from it, and that's exactly where all the load went.
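To make the listings below easier to read, here is a hedged reconstruction of what the shard type plausibly looks like, pieced together from the calls visible in the profiles; the exact definitions in GenPool may differ.
package genpool

import "sync/atomic"

// Poolable is assumed here: pooled objects carry their own "next" link,
// which is what makes the free list intrusive.
type Poolable[T any] interface {
	*T
	GetNext() *T
	SetNext(*T)
}

// Shard holds the head of a per-shard, lock-free (Treiber-style) free list.
type Shard[T any, P Poolable[T]] struct {
	Head atomic.Pointer[T] // top of this shard's free list
}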
Before
The performance before the change was okay; there was some contention, but that was expected since I was testing with 1,000–2,000 goroutines.
Total: 189.73s
12.09s 30.37s (flat, cum) 16.01% of Total
1.41s 1.42s 432:func (p *ShardedPool[T, P]) retrieveFromShard(shard *Shard[T, P]) (zero P, success bool) {
. . 433: for {
2.44s 4.03s 434: oldHead := P(shard.Head.Load())
2.33s 2.33s 435: if oldHead == nil {
. . 436: return zero, false
. . 437: }
. . 438:
3.86s 3.87s 439: next := oldHead.GetNext()
. 16.66s 440: if shard.Head.CompareAndSwap(oldHead, next) {
2.05s 2.06s 441: return oldHead, true
. . 442: }
. . 443: }
. . 444:}
After
Essentially, all the computational weight from the getShard function was shifted down to retrieveFromShard — and then some.
Total: 238.81s
30.79s 79.69s (flat, cum) 33.37% of Total
2.26s 2.26s 432:func (p *ShardedPool[T, P]) retrieveFromShard(shard *Shard[T, P]) (zero P, success bool) {
400ms 400ms 433: for {
15.74s 22.78s 434: oldHead := P(shard.Head.Load())
2.26s 2.27s 435: if oldHead == nil {
20ms 20ms 436: return zero, false
. . 437: }
. . 438:
8.55s 8.55s 439: next := oldHead.GetNext()
450ms 42.29s 440: if shard.Head.CompareAndSwap(oldHead, next) {
1.11s 1.12s 441: return oldHead, true
. . 442: }
. . 443: }
. . 444:}
Looking at the profiles alone, performance did get worse, but not by much, since the cost was mostly just transferred. The benchmark results, however, show a 2.58x degradation in performance.
Before: BenchmarkGenPool-8 291123867 3.949 ns/op 0 B/op 0 allocs/op
After:  BenchmarkGenPool-8 150982447 10.17 ns/op 0 B/op 0 allocs/op
If this were merely weight redistribution, what could possibly explain a 2.58x performance degradation?
Breakdown
getShard (before) and retrieveFromShard (after the change) consumed nearly identical amounts of CPU time; the real problem was where this change placed the contention.
The original procPin approach creates temporal locality — the same goroutine uses the same shard repeatedly, and as long as an object is returned by the same goroutine that retrieved it, it goes back to the same shard, establishing predictable access patterns. My "optimized" stack-address approach creates spatial randomness — while this distributes load evenly, it's terrible for cache coherency and creates chaotic contention.
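To make the "goes back to the same shard" point concrete, here is a hedged sketch of what a return path plausibly looks like, building on the Shard and Poolable sketch above; the method name and exact shape are assumptions, not GenPool's actual code.
// push returns obj to this shard's free list (a classic lock-free push).
// Under procPin, the goroutine that took an object usually resolves to the
// same shard when returning it, so this CAS keeps hitting a cache line the
// current core already owns.
func (s *Shard[T, P]) push(obj P) {
	for {
		oldHead := s.Head.Load() // current top of the free list (*T)
		obj.SetNext(oldHead)     // link obj in front of it
		if s.Head.CompareAndSwap(oldHead, (*T)(obj)) {
			return // obj is now the new head
		}
	}
}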
Contention Issue
Before: 189.73s total, with atomic operations taking ~33.76s (17.7%) — CompareAndSwapPointer at 16.5%, Int64.Add barely visible at 1.3%.
After: 238.81s total, with atomic operations taking ~133.45s (55.9%) — CompareAndSwapPointer ballooned to 33.36%, Int64.Add surged to 7.96% (~6.6×), and the total contention cost roughly quadrupled.
The optimized version was indeed distributing load almost perfectly, but that turned out to be a terrible thing. Random goroutines were now grabbing objects from random shards and returning them to completely different random shards. It's like the difference between everyone having an assigned parking spot (procPin) versus everyone fighting for random spots — the assigned spots prevent traffic jams!
More importantly, since each goroutine tends to stay on the same logical processor, the CPU cache can remember where objects were stored much more effectively, and the logical processor's state in Go's runtime stays tuned to that goroutine's access patterns. This is critical because the pool relies on intrusive linked lists, which already sacrifice some cache benefits — we really can't afford to lose any more. Unfortunately, random shard access destroys cache locality entirely, especially across 1,000–2,000 goroutines.
Conclusion
The procPin approach created natural isolation where each logical processor worked with its own shard, minimizing the number of cores competing for the same atomic variables. When I replaced this with "random" load balancing, I inadvertently created a thundering herd scenario where all 1,000–2,000 goroutines were hammering the same atomic variables across all CPU cores.
The hardware couldn't resolve this contention — it was drowning in it. Each atomic operation had to wait for cache line ownership through the CPU's cache coherency protocol, contributing to the catastrophic 2.58x performance degradation.
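As a rough, standalone illustration of that cost (not GenPool code), a toy benchmark contrasting one shared atomic counter with per-worker padded counters shows the same cache-line ownership effect:
package contention

import (
	"sync/atomic"
	"testing"
)

// Every worker hammers the same atomic word: each Add must acquire exclusive
// ownership of a single cache line, bouncing it between cores.
func BenchmarkSharedCounter(b *testing.B) {
	var n atomic.Int64
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			n.Add(1)
		}
	})
}

// Each worker claims its own padded counter and sticks to it, mimicking the
// temporal locality that the procPin approach gives the pool's shards.
func BenchmarkPerWorkerCounter(b *testing.B) {
	type padded struct {
		n atomic.Int64
		_ [56]byte // pad so each counter sits on its own 64-byte cache line
	}
	counters := make([]padded, 64)
	var next atomic.Int64
	b.RunParallel(func(pb *testing.PB) {
		me := &counters[int(next.Add(1))&63] // worker-local counter
		for pb.Next() {
			me.n.Add(1)
		}
	})
}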
Key insight: Synchronization mechanisms are only as good as the access patterns that use them. Perfect load distribution can be perfectly terrible for performance. Don't assume; always measure.