GenPool uses a sharded design to reduce contention. To determine which shard serves a request, it calls the runtime's procPin function to pin the goroutine to its logical processor and then uses the resulting processor ID as an index into the shards slice. Sounds complicated for a load-balancing mechanism, right? What happens if we implement a much simpler way of doing nearly the same thing? Well, as it turns out, things get dramatically worse.
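For readers who haven't seen the pinning trick before, here is a minimal sketch of how runtime.procPin is typically reached from outside the runtime. The declarations are illustrative, not GenPool's actual source, and newer Go toolchains restrict which runtime symbols may be linknamed.
// linkname_sketch.go: illustrative only, not GenPool's actual code.
// A (possibly empty) .s file must exist in the package so the compiler
// accepts the bodiless declarations below.
package genpool

import (
	_ "unsafe" // required so //go:linkname is permitted in this file
)

//go:linkname runtimeProcPin runtime.procPin
func runtimeProcPin() int

//go:linkname runtimeProcUnpin runtime.procUnpin
func runtimeProcUnpin()

// currentPID (hypothetical helper) returns the ID of the logical processor (P)
// the calling goroutine is running on; the pool uses this ID as a shard index.
func currentPID() int {
	id := runtimeProcPin() // pin: returns the current P's ID
	runtimeProcUnpin()     // unpin immediately; only the ID is needed
	return id
}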
Why The Change?
According to benchmarks, the code below was consuming most of the CPU time, but without anything else to compare it to, that doesn't mean much on its own. At best, it highlights the hot spots of a function that may not even be optimizable. Nonetheless, I thought I'd try a simpler way of doing this.
Total: 188.83s
55.38s 70.51s (flat, cum) 37.34% of Total
3.19s 3.21s 316:func (p *ShardedPool[T, P]) getShard() (*Shard[T, P], int) {
2.54s 10.14s 319: id := runtimeProcPin()
. 7.41s 320: runtimeProcUnpin()
49.65s 49.75s 322: return p.Shards[id%numShards], id
}
The Change
This new version creates a dummy variable on the stack and uses its memory address as a pseudo-random number to pick a shard. Since each goroutine has its own stack space, different goroutines will get different addresses, naturally distributing the load across shards.
Instead of relying on the last few bits of the address (which often have low entropy and lead to poor distribution), we shift the address right by 12 bits to tap into the more randomized middle bits. Then, we apply a bitwise AND with (numShards - 1) to ensure the result stays within bounds.
func (p *ShardedPool[T, P]) getShard() (*Shard[T, P], int) {
var dummy byte
addr := uintptr(unsafe.Pointer(&dummy))
id := int(addr>>12) & (numShards - 1)
return p.Shards[id], id
}
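To see the distribution in isolation, here is a small standalone sketch of the same trick (the shard count and output are illustrative only). Note that the AND mask maps addresses across all shards evenly only when numShards is a power of two.
package main

import (
	"fmt"
	"sync"
	"unsafe"
)

const numShards = 8 // illustrative; must be a power of two for the mask below

// pickShard reproduces the stack-address trick from getShard in isolation.
func pickShard() int {
	var dummy byte
	addr := uintptr(unsafe.Pointer(&dummy))
	return int(addr>>12) & (numShards - 1)
}

func main() {
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			fmt.Printf("goroutine %d -> shard %d\n", i, pickShard())
		}(i)
	}
	wg.Wait()
}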
Performance profile after the change:
Total: 238.81s
13.55s 13.61s (flat, cum) 5.70% of Total
. . 317:func (p *ShardedPool[T, P]) getShard() (*Shard[T, P], int) {
. . 318: var dummy byte
. . 319: addr := uintptr(unsafe.Pointer(&dummy))
130ms 130ms 320: id := int(addr>>12) & (numShards - 1)
. . 321:
13.42s 13.48s 322: return p.Shards[id], id
. . 323:}
I was very happy with the individual result of this optimization, until I checked the overall performance of the system…
What Got Worse
Well, after we get a shard, we can now retrieve an object from it, and that's exactly where all the load went.
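To make the listings below easier to read, here is a hedged reconstruction of what the shard type plausibly looks like, pieced together from the calls visible in the profiles; the exact definitions in GenPool may differ.
package genpool

import "sync/atomic"

// Poolable is assumed here: pooled objects carry their own "next" link,
// which is what makes the free list intrusive.
type Poolable[T any] interface {
	*T
	GetNext() *T
	SetNext(*T)
}

// Shard holds the head of a per-shard, lock-free (Treiber-style) free list.
type Shard[T any, P Poolable[T]] struct {
	Head atomic.Pointer[T] // top of this shard's free list
}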
Before
The performance before the change was okay; there was some contention, but that was expected since I was testing with 1,000–2,000 goroutines.
Total: 189.73s
12.09s 30.37s (flat, cum) 16.01% of Total
1.41s 1.42s 432:func (p *ShardedPool[T, P]) retrieveFromShard(shard *Shard[T, P]) (zero P, success bool) {
. . 433: for {
2.44s 4.03s 434: oldHead := P(shard.Head.Load())
2.33s 2.33s 435: if oldHead == nil {
. . 436: return zero, false
. . 437: }
. . 438:
3.86s 3.87s 439: next := oldHead.GetNext()
. 16.66s 440: if shard.Head.CompareAndSwap(oldHead, next) {
2.05s 2.06s 441: return oldHead, true
. . 442: }
. . 443: }
. . 444:}
After
Essentially, all the computational weight from the getShard function was shifted down to retrieveFromShard — and then some.
Total: 238.81s
30.79s 79.69s (flat, cum) 33.37% of Total
2.26s 2.26s 432:func (p *ShardedPool[T, P]) retrieveFromShard(shard *Shard[T, P]) (zero P, success bool) {
400ms 400ms 433: for {
15.74s 22.78s 434: oldHead := P(shard.Head.Load())
2.26s 2.27s 435: if oldHead == nil {
20ms 20ms 436: return zero, false
. . 437: }
. . 438:
8.55s 8.55s 439: next := oldHead.GetNext()
450ms 42.29s 440: if shard.Head.CompareAndSwap(oldHead, next) {
1.11s 1.12s 441: return oldHead, true
. . 442: }
. . 443: }
. . 444:}
Looking at the profiles alone, performance did get worse, but not by much, since the cost was mostly just transferred. The benchmark results, however, show a 2.58x degradation in performance.
Before: BenchmarkGenPool-8 291123867 3.949 ns/op 0 B/op 0 allocs/op
After:  BenchmarkGenPool-8 150982447 10.17 ns/op 0 B/op 0 allocs/op
If this were merely weight redistribution, what could possibly explain a 2.58x performance degradation?
Breakdown
getShard (before) and retrieveFromShard (after the change) consumed nearly identical amounts of CPU time; the real problem was where this change placed the contention.
The original procPin approach creates temporal locality — the same goroutine uses the same shard repeatedly, and as long as an object is returned by the same goroutine that retrieved it, it goes back to the same shard, establishing predictable access patterns. My "optimized" stack-address approach creates spatial randomness — while this distributes load evenly, it's terrible for cache coherency and creates chaotic contention.
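To make the "goes back to the same shard" point concrete, here is a hedged sketch of what a return path plausibly looks like, building on the Shard and Poolable sketch above; the method name and exact shape are assumptions, not GenPool's actual code.
// push returns obj to this shard's free list (a classic lock-free push).
// Under procPin, the goroutine that took an object usually resolves to the
// same shard when returning it, so this CAS keeps hitting a cache line the
// current core already owns.
func (s *Shard[T, P]) push(obj P) {
	for {
		oldHead := s.Head.Load() // current top of the free list (*T)
		obj.SetNext(oldHead)     // link obj in front of it
		if s.Head.CompareAndSwap(oldHead, (*T)(obj)) {
			return // obj is now the new head
		}
	}
}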
Contention Issue
Before: 189.73s total, with atomic operations taking ~33.76s (17.7%) — CompareAndSwapPointer at 16.5%, Int64.Add barely visible at 1.3%.
After: 238.81s total, with atomic operations taking ~133.45s (55.9%) — CompareAndSwapPointer ballooned to 33.36%, Int64.Add surged to 7.96% (~6.6×), and the total contention cost roughly quadrupled.
The optimized version was indeed distributing load almost perfectly, but that turned out to be a terrible thing. Random goroutines were now grabbing objects from random shards and returning them to completely different random shards. It's like the difference between everyone having an assigned parking spot (procPin) versus everyone fighting for random spots — the assigned spots prevent traffic jams!
More importantly, since each goroutine tends to stay on the same logical processor, the CPU cache can remember where objects were stored much more effectively, and the logical processor's state in Go's runtime stays tuned to that goroutine's access patterns. This is critical because the pool relies on intrusive linked lists, which already sacrifice some cache benefits — we really can't afford to lose any more. Unfortunately, random shard access destroys cache locality entirely, especially across 1,000–2,000 goroutines.
Conclusion
The procPin approach created natural isolation where each logical processor worked with its own shard, minimizing the number of cores competing for the same atomic variables. When I replaced this with "random" load balancing, I inadvertently created a thundering herd scenario where all 1,000–2,000 goroutines were hammering the same atomic variables across all CPU cores.
The hardware couldn't resolve this contention — it was drowning in it. Each atomic operation had to wait for cache line ownership through the CPU's cache coherency protocol, contributing to the catastrophic 2.58x performance degradation.
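As a rough, standalone illustration of that cost (not GenPool code), a toy benchmark contrasting one shared atomic counter with per-worker padded counters shows the same cache-line ownership effect:
package contention

import (
	"sync/atomic"
	"testing"
)

// Every worker hammers the same atomic word: each Add must acquire exclusive
// ownership of a single cache line, bouncing it between cores.
func BenchmarkSharedCounter(b *testing.B) {
	var n atomic.Int64
	b.RunParallel(func(pb *testing.PB) {
		for pb.Next() {
			n.Add(1)
		}
	})
}

// Each worker claims its own padded counter and sticks to it, mimicking the
// temporal locality that the procPin approach gives the pool's shards.
func BenchmarkPerWorkerCounter(b *testing.B) {
	type padded struct {
		n atomic.Int64
		_ [56]byte // pad so each counter sits on its own 64-byte cache line
	}
	counters := make([]padded, 64)
	var next atomic.Int64
	b.RunParallel(func(pb *testing.PB) {
		me := &counters[int(next.Add(1))&63] // worker-local counter
		for pb.Next() {
			me.n.Add(1)
		}
	})
}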
Key insight: Synchronization mechanisms are only as good as the access patterns that use them. Perfect load distribution can be perfectly terrible for performance. Don't assume; always measure.