<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alexsander Hamir</title>
    <description>The latest articles on DEV Community by Alexsander Hamir (@alexsanderhamir).</description>
    <link>https://dev.to/alexsanderhamir</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3385858%2Fd79f9069-37ca-40e0-8408-034cd5c97f4f.png</url>
      <title>DEV Community: Alexsander Hamir</title>
      <link>https://dev.to/alexsanderhamir</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alexsanderhamir"/>
    <language>en</language>
    <item>
      <title>When Optimization Backfires: How Aggressive Optimization Made Our Pool 2.58x Slower</title>
      <dc:creator>Alexsander Hamir</dc:creator>
      <pubDate>Tue, 05 Aug 2025 17:16:10 +0000</pubDate>
      <link>https://dev.to/alexsanderhamir/when-optimization-backfires-how-aggressive-optimization-made-our-pool-47x-slower-5b4b</link>
      <guid>https://dev.to/alexsanderhamir/when-optimization-backfires-how-aggressive-optimization-made-our-pool-47x-slower-5b4b</guid>
      <description>&lt;p&gt;GenPool uses a sharded design to reduce contention. To determine which shard serves a request, it uses the &lt;code&gt;procPin&lt;/code&gt; runtime function to pin the goroutine to its logical processor and uses the resulting processor ID as an index into the shards slice. Sounds complicated for a load balancing mechanism, right? What happens if we implement a much simpler way of doing nearly the same thing? Well, apparently things get orders of magnitude worse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why The Change?
&lt;/h2&gt;

&lt;p&gt;According to the CPU profile, the code below was consuming most of the CPU time, but without anything to compare it against, that alone doesn't mean much. At best, it shows the hot spots of a function that may not even be optimizable. Nonetheless, I thought I'd try a simpler way of doing this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total: 188.83s
55.38s     70.51s (flat, cum) 37.34% of Total
3.19s      3.21s     316:func (p *ShardedPool[T, P]) getShard() (*Shard[T, P], int) {
2.54s      10.14s    319: id := runtimeProcPin()
   .       7.41s     320: runtimeProcUnpin()
49.65s     49.75s    322: return p.Shards[id%numShards], id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
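
&lt;p&gt;For context, here is roughly what the pinned version looks like as plain code. This is a minimal sketch, assuming &lt;code&gt;go:linkname&lt;/code&gt; bindings to the runtime's &lt;code&gt;procPin&lt;/code&gt;/&lt;code&gt;procUnpin&lt;/code&gt; and simplified placeholder types; GenPool's real definitions differ.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package pool

import (
    _ "unsafe" // required for go:linkname
)

//go:linkname runtimeProcPin runtime.procPin
func runtimeProcPin() int

//go:linkname runtimeProcUnpin runtime.procUnpin
func runtimeProcUnpin()

const numShards = 64 // hypothetical power-of-two shard count

// Placeholder types standing in for GenPool's real definitions.
type Shard[T any, P any] struct{ /* per-shard free list */ }

type ShardedPool[T any, P any] struct {
    Shards []*Shard[T, P]
}

// getShard pins the goroutine to its logical processor just long enough
// to read the processor ID, then uses that ID to index the shards slice.
func (p *ShardedPool[T, P]) getShard() (*Shard[T, P], int) {
    id := runtimeProcPin()
    runtimeProcUnpin()
    return p.Shards[id%numShards], id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;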



&lt;h2&gt;
  
  
  The Change
&lt;/h2&gt;

&lt;p&gt;This new version creates a dummy variable on the stack and uses its memory address as a pseudo-random number to pick a shard. Since each goroutine has its own stack space, different goroutines will get different addresses, naturally distributing the load across shards.&lt;/p&gt;

&lt;p&gt;Instead of relying on the last few bits of the address (which often have low entropy and lead to poor distribution), we shift the address right by 12 bits to tap into the more randomized middle bits. Then, we apply a bitwise AND with &lt;code&gt;(numShards - 1)&lt;/code&gt; to keep the result within bounds, which works because &lt;code&gt;numShards&lt;/code&gt; is a power of two.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ShardedPool&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="n"&gt;getShard&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Shard&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;dummy&lt;/span&gt; &lt;span class="kt"&gt;byte&lt;/span&gt;
    &lt;span class="n"&gt;addr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kt"&gt;uintptr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pointer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;dummy&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numShards&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Shards&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Performance profile after the change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total: 238.81s
13.55s     13.61s (flat, cum)  5.70% of Total
.          .        317:func (p *ShardedPool[T, P]) getShard() (*Shard[T, P], int) {
.          .        318: var dummy byte
.          .        319: addr := uintptr(unsafe.Pointer(&amp;amp;dummy))
130ms      130ms    320: id := int(addr&amp;gt;&amp;gt;12) &amp;amp; (numShards - 1)
.          .        321:
13.42s     13.48s   322: return p.Shards[id], id
.          .        323:}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I was very happy with this optimization in isolation, at least until I checked the overall performance of the system…&lt;/p&gt;

&lt;h2&gt;
  
  
  What Got Worse
&lt;/h2&gt;

&lt;p&gt;Once we have a shard, we can retrieve an object from it, and that's exactly where all the load went.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before
&lt;/h3&gt;

&lt;p&gt;Performance before the change was acceptable; there was some contention, but that was expected since I was testing with 1,000–2,000 goroutines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total: 189.73s
12.09s     30.37s    (flat, cum) 16.01% of Total
1.41s      1.42s     432:func (p *ShardedPool[T, P]) retrieveFromShard(shard *Shard[T, P]) (zero P, success bool) {
.          .         433: for {
2.44s      4.03s     434:  oldHead := P(shard.Head.Load())
2.33s      2.33s     435:  if oldHead == nil {
.          .         436:   return zero, false
.          .         437:  }
.          .         438:
3.86s      3.87s     439:  next := oldHead.GetNext()
.          16.66s    440:  if shard.Head.CompareAndSwap(oldHead, next) {
2.05s      2.06s     441:   return oldHead, true
.          .         442:  }
.          .         443: }
.          .         444:}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  After
&lt;/h3&gt;

&lt;p&gt;Essentially, all the computational weight from the &lt;code&gt;getShard&lt;/code&gt; function was shifted down to &lt;code&gt;retrieveFromShard&lt;/code&gt; — and then some.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total: 238.81s
30.79s     79.69s (flat, cum) 33.37% of Total
2.26s      2.26s      432:func (p *ShardedPool[T, P]) retrieveFromShard(shard *Shard[T, P]) (zero P, success bool) {
400ms      400ms      433: for {
15.74s     22.78s     434:  oldHead := P(shard.Head.Load())
2.26s      2.27s      435:  if oldHead == nil {
20ms       20ms       436:   return zero, false
.          .          437:  }
.          .          438:
8.55s      8.55s      439:  next := oldHead.GetNext()
450ms     42.29s      440:  if shard.Head.CompareAndSwap(oldHead, next) {
1.11s      1.12s      441:   return oldHead, true
.          .          442:  }
.          .          443: }
.          .          444:}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looking at the profiles alone, performance got worse, but not by much, since most of the cost simply moved from one function to the other. The benchmark results, however, show a &lt;strong&gt;2.58x slowdown&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BenchmarkGenPool-8    291123867      3.949 ns/op        0 B/op        0 allocs/op
BenchmarkGenPool-8    150982447      10.17 ns/op        0 B/op        0 allocs/op
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this was merely weight redistribution, what could possibly explain a 2.58x performance degradation?&lt;/p&gt;

&lt;h2&gt;
  
  
  Breakdown
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;getShard&lt;/code&gt; (before) and &lt;code&gt;retrieveFromShard&lt;/code&gt; (after the change) consumed nearly identical amounts of CPU time, but the problem was &lt;strong&gt;where the contention was placed&lt;/strong&gt; with this change.&lt;/p&gt;

&lt;p&gt;The original &lt;code&gt;procPin&lt;/code&gt; approach creates &lt;strong&gt;temporal locality&lt;/strong&gt; — the same goroutine uses the same shard repeatedly, and as long as the object is returned within the same goroutine that retrieved it, it will go back to the same shard, establishing predictable access patterns. My "optimized" stack-address approach creates &lt;strong&gt;spatial randomness&lt;/strong&gt; — while this distributes load evenly, it's terrible for cache coherency and creates chaotic contention.&lt;/p&gt;

&lt;h3&gt;
  
  
  Contention Issue
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt; 189.73s total, with atomic operations taking ~33.76s (17.7%) — &lt;code&gt;CompareAndSwapPointer&lt;/code&gt; at 16.5%, &lt;code&gt;Int64.Add&lt;/code&gt; barely visible at 1.3%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt; 238.81s total, with atomic operations taking ~133.45s (55.9%) — &lt;code&gt;CompareAndSwapPointer&lt;/code&gt; ballooned to 33.36%, &lt;code&gt;Int64.Add&lt;/code&gt; surged to 7.96% (~6.6×), and total contention cost quadrupled.&lt;/p&gt;

&lt;p&gt;The optimized version was indeed distributing load almost perfectly, but that turned out to be a terrible thing. Goroutines were now grabbing objects from random shards and returning them to completely different random shards. It's like the difference between everyone having an assigned parking spot (&lt;code&gt;procPin&lt;/code&gt;) versus everyone fighting for random spots — the assigned spots prevent traffic jams!&lt;/p&gt;

&lt;p&gt;More importantly, when each goroutine stays pinned to a specific logical processor, the CPU cache can remember where objects were stored much more effectively, and the logical processor's cache in Go's runtime becomes tuned to that goroutine's access patterns. This matters because the pool relies on intrusive linked lists, which already sacrifice some cache benefits; we really can't afford to lose more. Random shard access destroys what locality remains, especially across 1,000–2,000 goroutines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;procPin&lt;/code&gt; approach created natural isolation where each logical processor worked with its own shard, minimizing the number of cores competing for the same atomic variables. When I replaced this with "random" load balancing, I inadvertently created a thundering herd scenario where all 1,000–2,000 goroutines were hammering the same atomic variables across all CPU cores.&lt;/p&gt;

&lt;p&gt;The hardware couldn't resolve this contention — it was drowning in it. Each atomic operation had to wait for cache line ownership through the CPU's cache coherency protocol, contributing to the catastrophic 2.58x performance degradation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; Synchronization mechanisms are only as good as the access patterns that use them. Perfect load distribution can be perfectly terrible for performance. Don't assume; always measure.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://github.com/AlexsanderHamir/GenPool" rel="noopener noreferrer"&gt;GenPool repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/alexsander-baptista/" rel="noopener noreferrer"&gt;Linkedin&lt;/a&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>programming</category>
      <category>performance</category>
    </item>
    <item>
      <title>TokenSpan: Rethinking Prompt Compression with Aliases and Dictionary Encoding</title>
      <dc:creator>Alexsander Hamir</dc:creator>
      <pubDate>Mon, 04 Aug 2025 22:24:34 +0000</pubDate>
      <link>https://dev.to/alexsanderhamir/tokenspan-rethinking-prompt-compression-with-aliases-and-dictionary-encoding-3a27</link>
      <guid>https://dev.to/alexsanderhamir/tokenspan-rethinking-prompt-compression-with-aliases-and-dictionary-encoding-3a27</guid>
      <description>&lt;p&gt;In the era of large language models, prompt size is power — but also a big cost.&lt;br&gt;
The more context you provide, the more tokens you consume. And when working with long, structured prompts or repetitive query templates, that cost can escalate quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TokenSpan isn’t a compression library&lt;/strong&gt;; it’s a thought experiment — a different way of thinking about prompt optimization.&lt;/p&gt;

&lt;p&gt;Can we reduce token usage by substituting repeated phrases with lightweight aliases?&lt;/p&gt;

&lt;p&gt;Can we borrow ideas from dictionary encoding to constrain and compress the language we use to communicate with models?&lt;/p&gt;

&lt;p&gt;This project explores those questions — not by building a full encoding system, but by probing whether such a technique might be &lt;strong&gt;useful, measurable, and worth pursuing&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  💡 The Core Insight: Let the Model Do the Work
&lt;/h2&gt;

&lt;p&gt;A crucial insight behind TokenSpan is recognizing where the real cost lies:&lt;br&gt;
&lt;strong&gt;We pay for tokens, not computation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So why not reduce the tokens we send, and let the model handle the substitution?&lt;br&gt;
LLMs easily understand that &lt;code&gt;§a&lt;/code&gt; means &lt;code&gt;"Microsoft Designer"&lt;/code&gt; — and we’re already paying for those tokens, so there’s no extra cost for that mental mapping.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Dictionary: §a → Microsoft Designer  
Rewritten Prompt: How does §a compare to Canva?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
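
&lt;p&gt;To make the mechanics concrete, here is a small sketch of the substitution step in Go. The dictionary contents, alias names, and prompt text are all hypothetical; a real system would build the dictionary from token-frequency analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "fmt"
    "strings"
)

func main() {
    // Hypothetical dictionary mapping each phrase to its alias.
    dict := map[string]string{
        "Microsoft Designer": "§a",
        "Canva Pro":          "§b",
    }

    prompt := "How does Microsoft Designer compare to Canva Pro? " +
        "Is Microsoft Designer cheaper than Canva Pro?"

    // Build a single-pass replacer from the dictionary.
    pairs := make([]string, 0, 2*len(dict))
    for phrase, alias := range dict {
        pairs = append(pairs, phrase, alias)
    }
    rewritten := strings.NewReplacer(pairs...).Replace(prompt)

    fmt.Println(rewritten)
    // Output: How does §a compare to §b? Is §a cheaper than §b?
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;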






&lt;h2&gt;
  
  
  🔁 Scaling with Reusable Dictionaries
&lt;/h2&gt;

&lt;p&gt;If you were to build a system around this idea, the best strategy wouldn't be to re-send the dictionary with every prompt. Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build the dictionary once&lt;/li&gt;
&lt;li&gt;Embed it in the system prompt or long-term memory&lt;/li&gt;
&lt;li&gt;Reuse it across multiple interactions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This only makes sense when dealing with &lt;strong&gt;large or repetitive prompts&lt;/strong&gt;, where the cost of setting up the dictionary is outweighed by the long-term savings.&lt;/p&gt;
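
&lt;p&gt;For illustration, a reusable setup might look something like this (hypothetical contents):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System prompt (sent once, reused across the session):
  Dictionary: §a → Microsoft Designer, §b → Canva

Turn 1: How does §a compare to §b?
Turn 2: Does §a support team templates?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;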

&lt;p&gt;By encouraging &lt;strong&gt;simpler, more structured language&lt;/strong&gt;, your application can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce costs&lt;/li&gt;
&lt;li&gt;Improve consistency&lt;/li&gt;
&lt;li&gt;Handle diverse user inputs more efficiently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After all, we’re often &lt;strong&gt;asking the same things&lt;/strong&gt; — just in different ways.&lt;/p&gt;




&lt;h2&gt;
  
  
  📐 The Formula
&lt;/h2&gt;

&lt;p&gt;What if we replaced a 2-token phrase like &lt;code&gt;"Microsoft Designer"&lt;/code&gt; with an alias like &lt;code&gt;§a&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;Assume the phrase appears &lt;code&gt;X&lt;/code&gt; times:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Original Cost&lt;/strong&gt;: &lt;code&gt;2 × X&lt;/code&gt; tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compressed Cost&lt;/strong&gt;: &lt;code&gt;X&lt;/code&gt; (alias usage) + &lt;code&gt;4&lt;/code&gt; (dictionary overhead)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Savings Formula&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Saved = (2 × X) - (X + 4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: &lt;code&gt;"Microsoft Designer"&lt;/code&gt; appears 15 times.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Saved = (2 × 15) - (15 + 4) = 30 - 19 = 11 tokens saved
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s &lt;strong&gt;just one phrase&lt;/strong&gt; — real prompts often contain &lt;strong&gt;dozens&lt;/strong&gt; of reusable patterns.&lt;/p&gt;




&lt;h2&gt;
  
  
  🎯 Why Focus on Two-Token Phrases?
&lt;/h2&gt;

&lt;p&gt;This experiment targets &lt;strong&gt;two-token phrases&lt;/strong&gt; for a reason:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Single tokens can’t be compressed&lt;/li&gt;
&lt;li&gt;✅ Longer phrases save more but occur less&lt;/li&gt;
&lt;li&gt;✅ Two-token phrases hit the &lt;strong&gt;sweet spot&lt;/strong&gt;: frequent &lt;em&gt;and&lt;/em&gt; compressible&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🧾 Understanding the Overhead
&lt;/h2&gt;

&lt;p&gt;Each dictionary entry adds &lt;strong&gt;4 tokens&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;1&lt;/code&gt; token for the &lt;strong&gt;replacement code&lt;/strong&gt; (e.g. &lt;code&gt;§a&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;1&lt;/code&gt; token for the &lt;strong&gt;separator&lt;/strong&gt; (e.g. &lt;code&gt;→&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;2&lt;/code&gt; tokens for the &lt;strong&gt;original phrase&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You only start saving tokens &lt;strong&gt;once a phrase appears 5 or more times&lt;/strong&gt;.&lt;/p&gt;
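
&lt;p&gt;A quick sketch of that break-even arithmetic in Go, using the 2-token phrase and 4-token overhead described above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

// tokensSaved returns the net tokens saved by aliasing a 2-token phrase
// that appears x times, given the 4-token dictionary-entry overhead.
func tokensSaved(x int) int {
    original := 2 * x   // 2 tokens per occurrence
    compressed := x + 4 // 1 alias token per occurrence plus the entry
    return original - compressed
}

func main() {
    // Break-even is at 4 occurrences; net savings start at 5.
    for _, x := range []int{4, 5, 15} {
        fmt.Printf("appears %d times, saves %d tokens\n", x, tokensSaved(x))
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;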




&lt;h2&gt;
  
  
  📊 Real-World Results
&lt;/h2&gt;

&lt;p&gt;Using a raw prompt of &lt;strong&gt;8,019 tokens&lt;/strong&gt;:&lt;br&gt;
After substitution → &lt;strong&gt;7,138 tokens&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Savings: 881 tokens (~11.0%)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The model continued performing correctly with the encoded prompt.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 Conclusion
&lt;/h2&gt;

&lt;p&gt;Natural language gives users the freedom to communicate in flexible, intuitive ways.&lt;br&gt;
But that freedom comes at a cost:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔄 Repetition&lt;/li&gt;
&lt;li&gt;❌ Inaccuracy from phrasing variations&lt;/li&gt;
&lt;li&gt;💰 Higher usage costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If applications used a &lt;strong&gt;limited vocabulary&lt;/strong&gt; for most interactions, they could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower token usage&lt;/li&gt;
&lt;li&gt;Encourage more structured prompts&lt;/li&gt;
&lt;li&gt;Improve response consistency&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🧪 Lessons from Tokenization Quirks
&lt;/h2&gt;

&lt;p&gt;Here are some interesting quirks noticed during development:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Common Phrases = Fewer Tokens&lt;/strong&gt;&lt;br&gt;
e.g., &lt;code&gt;"the"&lt;/code&gt; often becomes a single token.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Capitalization Can Split Words&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;"Designer"&lt;/code&gt; vs. &lt;code&gt;"designer"&lt;/code&gt; — tokenizers treat them differently.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rare Words Get Chopped Up&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;"visioneering"&lt;/code&gt; might tokenize into &lt;code&gt;"vision"&lt;/code&gt; + &lt;code&gt;"eering"&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Numbers Don’t Tokenize Nicely&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;"123456"&lt;/code&gt; can break into &lt;code&gt;"123"&lt;/code&gt; + &lt;code&gt;"456"&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Digits as Aliases? Risky.&lt;/strong&gt;&lt;br&gt;
Using &lt;code&gt;"0"&lt;/code&gt; or &lt;code&gt;"1"&lt;/code&gt; as shortcuts often backfires — better to use symbols like &lt;code&gt;§&lt;/code&gt; or &lt;code&gt;@&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔬 Try It Yourself
&lt;/h2&gt;

&lt;p&gt;📍 GitHub: &lt;a href="https://github.com/alexsanderhamir/TokenSpan" rel="noopener noreferrer"&gt;alexsanderhamir/TokenSpan&lt;/a&gt;&lt;br&gt;
💬 Contributions &amp;amp; feedback welcome!&lt;/p&gt;




&lt;p&gt;TokenSpan is a &lt;strong&gt;thought experiment&lt;/strong&gt; in prompt optimization.&lt;br&gt;
The savings are real — but the real value is in rethinking how we balance &lt;strong&gt;cost&lt;/strong&gt;, &lt;strong&gt;compression&lt;/strong&gt;, and &lt;strong&gt;communication&lt;/strong&gt; with LLMs.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>promptengineering</category>
      <category>ai</category>
    </item>
    <item>
      <title>Prof: A Structured Way to Manage and Compare Go Profiles</title>
      <dc:creator>Alexsander Hamir</dc:creator>
      <pubDate>Thu, 31 Jul 2025 14:12:11 +0000</pubDate>
      <link>https://dev.to/alexsanderhamir/prof-a-structured-way-to-manage-and-compare-go-profiles-4kb8</link>
      <guid>https://dev.to/alexsanderhamir/prof-a-structured-way-to-manage-and-compare-go-profiles-4kb8</guid>
      <description>&lt;p&gt;Go’s philosophy emphasizes simplicity and readability, aiming to lower the barrier for beginners to understand and contribute to codebases with less friction compared to many other languages. While &lt;code&gt;pprof&lt;/code&gt; is already a powerful and user-friendly tool, effective profiling still requires experience to maintain good practices — such as organizing previous runs, documenting performance changes as you go, and keeping track of what was improved and when.&lt;/p&gt;

&lt;p&gt;Without that experience, it’s easy to end up with a clutter of files and no clear history, forcing you to dig through old commits just to recall what you did minutes earlier.&lt;/p&gt;

&lt;p&gt;That’s why I built &lt;strong&gt;Prof&lt;/strong&gt; — a tool designed to bring &lt;strong&gt;structure, clarity, and speed&lt;/strong&gt; to Go performance workflows, making life easier for both beginners and experienced engineers alike.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Common Way
&lt;/h2&gt;

&lt;p&gt;The commands below leave organization and documentation entirely up to the developer. And to be fair, these tools already do a lot — but still, why not encourage a more structured approach? Why not simplify the profiling workflow so it doesn’t require running a chain of commands back and forth, or having each team build their own custom scripts around it?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-bench&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;BenchmarkMyFunc &lt;span class="nt"&gt;-cpuprofile&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cpu.out
go tool pprof &lt;span class="nt"&gt;-top&lt;/span&gt; cpu.out &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; results.txt
go tool pprof &lt;span class="nt"&gt;-list&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;MyFunc cpu.out
&lt;span class="c"&gt;# Make changes, repeat...&lt;/span&gt;
&lt;span class="c"&gt;# Hours later: "Wait, was that the baseline or the optimized version?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The New Way
&lt;/h2&gt;

&lt;p&gt;Prof solves this with a simple idea: treat profiling sessions like a well-structured codebase — organized and easy to navigate.&lt;/p&gt;

&lt;p&gt;Instead of wrestling with scattered files, run one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;prof auto &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--benchmarks&lt;/span&gt; &lt;span class="s2"&gt;"BenchmarkGenPool"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profiles&lt;/span&gt; &lt;span class="s2"&gt;"cpu,memory,mutex,block"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--count&lt;/span&gt; 10 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tag&lt;/span&gt; &lt;span class="s2"&gt;"baseline"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That single command replaces dozens of manual steps — creating a neatly organized dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bench/baseline/
├── description.txt              # Your notes for this run
├── bin/BenchmarkGenPool/        # Binary profile files (e.g., .pprof)
├── text/BenchmarkGenPool/       # Human-readable reports (e.g., top, list, disasm)
│
├── cpu_functions/               # ┐
│   ├── &amp;lt;func1&amp;gt;.txt              # │
│   ├── &amp;lt;func2&amp;gt;.txt              # │ Function-level CPU performance data
│   └── ...                      # │
└── memory_functions/            # ┘
    ├── &amp;lt;func1&amp;gt;.txt              # Function-level memory performance data
    ├── &amp;lt;func2&amp;gt;.txt
    └── ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, instead of rerunning commands just to inspect a function’s performance, you can simply open the relevant file or search by its name — everything is structured and ready to explore.&lt;/p&gt;

&lt;p&gt;Prof also offers an option to skip wrapping &lt;code&gt;go test&lt;/code&gt;, giving users the flexibility to run benchmarks however they prefer while still benefiting from Prof’s organization and analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Profiling Diffs at the Function Level
&lt;/h2&gt;

&lt;p&gt;Thanks to Prof’s structured approach, you no longer need to manually track performance changes between optimizations. Simply pass the tags you want to compare, and Prof will generate the diffs for you — available in &lt;strong&gt;HTML&lt;/strong&gt;, &lt;strong&gt;JSON&lt;/strong&gt;, or &lt;strong&gt;terminal output&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;prof track auto &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--base&lt;/span&gt; &lt;span class="s2"&gt;"baseline"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--current&lt;/span&gt; &lt;span class="s2"&gt;"optimized"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile-type&lt;/span&gt; &lt;span class="s2"&gt;"cpu"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bench-name&lt;/span&gt; &lt;span class="s2"&gt;"BenchmarkGenPool"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-format&lt;/span&gt; &lt;span class="s2"&gt;"summary-html"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Get clear, actionable insights:
&lt;/h3&gt;

&lt;h4&gt;
  
  
  ⚠️ Top Regressions:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;internal/cache.getShard&lt;/code&gt;: +200.0% (0.030s → 0.090s)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sync.Pool.Get&lt;/code&gt;: +100.0% (0.010s → 0.020s)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  ✅ Top Improvements:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;encoding/json.Unmarshal&lt;/code&gt;: -95.0% (0.100s → 0.005s)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pool/isFull&lt;/code&gt;: -85.0% (0.020s → 0.003s)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives you a more &lt;strong&gt;organized and automated&lt;/strong&gt; way of doing performance work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Contributions Welcome
&lt;/h2&gt;

&lt;p&gt;Instead of each team building their own scripts, we can come together to create a tool that helps developers handle performance work more easily — whether under pressure or as part of everyday optimization.&lt;/p&gt;

&lt;p&gt;Prof aims to be that shared foundation, making profiling more accessible, consistent, and reliable across teams.&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;&lt;a href="https://github.com/AlexsanderHamir/prof" rel="noopener noreferrer"&gt;Prof Repository&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
🔗 &lt;strong&gt;&lt;a href="https://www.linkedin.com/in/alexsander-baptista" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>programming</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
