<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Alexsander Hamir</title>
    <description>The latest articles on DEV Community by Alexsander Hamir (@alexsanderhamir).</description>
    <link>https://dev.to/alexsanderhamir</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3385858%2Fd79f9069-37ca-40e0-8408-034cd5c97f4f.png</url>
      <title>DEV Community: Alexsander Hamir</title>
      <link>https://dev.to/alexsanderhamir</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/alexsanderhamir"/>
    <language>en</language>
    <item>
      <title>When Optimization Backfires: How Aggressive Optimization Made Our Pool 2.58x Slower</title>
      <dc:creator>Alexsander Hamir</dc:creator>
      <pubDate>Tue, 05 Aug 2025 17:16:10 +0000</pubDate>
      <link>https://dev.to/alexsanderhamir/when-optimization-backfires-how-aggressive-optimization-made-our-pool-47x-slower-5b4b</link>
      <guid>https://dev.to/alexsanderhamir/when-optimization-backfires-how-aggressive-optimization-made-our-pool-47x-slower-5b4b</guid>
      <description>&lt;p&gt;GenPool uses a sharded design to reduce contention. To determine which shard serves a request, it uses the &lt;code&gt;procPin&lt;/code&gt; runtime function to pin the goroutine to its logical processor and uses the resulting processor ID as an index into the shards slice. Sounds complicated for a load balancing mechanism, right? What happens if we implement a much simpler way of doing nearly the same thing? Well, apparently things get orders of magnitude worse.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why The Change?
&lt;/h2&gt;

&lt;p&gt;According to the CPU profile, the code below was consuming most of the CPU time, but without anything to compare it against, that alone doesn't mean much. At best, it shows the hot spots of a function that may not even be optimizable. Nonetheless, I thought I'd try a simpler way of doing this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total: 188.83s
55.38s     70.51s (flat, cum) 37.34% of Total
3.19s      3.21s     316:func (p *ShardedPool[T, P]) getShard() (*Shard[T, P], int) {
2.54s      10.14s    319: id := runtimeProcPin()
   .       7.41s     320: runtimeProcUnpin()
49.65s     49.75s    322: return p.Shards[id%numShards], id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
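
&lt;p&gt;For context, here is roughly what the pinned version looks like as plain code. This is a minimal sketch, assuming &lt;code&gt;go:linkname&lt;/code&gt; bindings to the runtime's &lt;code&gt;procPin&lt;/code&gt;/&lt;code&gt;procUnpin&lt;/code&gt; and simplified placeholder types; GenPool's real definitions differ.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package pool

import (
    _ "unsafe" // required for go:linkname
)

//go:linkname runtimeProcPin runtime.procPin
func runtimeProcPin() int

//go:linkname runtimeProcUnpin runtime.procUnpin
func runtimeProcUnpin()

const numShards = 64 // hypothetical power-of-two shard count

// Placeholder types standing in for GenPool's real definitions.
type Shard[T any, P any] struct{ /* per-shard free list */ }

type ShardedPool[T any, P any] struct {
    Shards []*Shard[T, P]
}

// getShard pins the goroutine to its logical processor just long enough
// to read the processor ID, then uses that ID to index the shards slice.
func (p *ShardedPool[T, P]) getShard() (*Shard[T, P], int) {
    id := runtimeProcPin()
    runtimeProcUnpin()
    return p.Shards[id%numShards], id
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;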



&lt;h2&gt;
  
  
  The Change
&lt;/h2&gt;

&lt;p&gt;This new version creates a dummy variable on the stack and uses its memory address as a pseudo-random number to pick a shard. Since each goroutine has its own stack space, different goroutines will get different addresses, naturally distributing the load across shards.&lt;/p&gt;

&lt;p&gt;Instead of relying on the last few bits of the address (which often have low entropy and lead to poor distribution), we shift the address right by 12 bits to tap into the more randomized middle bits. Then, we apply a bitwise AND with &lt;code&gt;(numShards - 1)&lt;/code&gt; to keep the result within bounds, which works because &lt;code&gt;numShards&lt;/code&gt; is a power of two.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;&lt;span class="k"&gt;func&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;ShardedPool&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="n"&gt;getShard&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;Shard&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;T&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;P&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="n"&gt;dummy&lt;/span&gt; &lt;span class="kt"&gt;byte&lt;/span&gt;
    &lt;span class="n"&gt;addr&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kt"&gt;uintptr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;unsafe&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Pointer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;&amp;amp;&lt;/span&gt;&lt;span class="n"&gt;dummy&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;:=&lt;/span&gt; &lt;span class="kt"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;addr&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt;&lt;span class="m"&gt;12&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;numShards&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Shards&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Performance profile after the change:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total: 238.81s
13.55s     13.61s (flat, cum)  5.70% of Total
.          .        317:func (p *ShardedPool[T, P]) getShard() (*Shard[T, P], int) {
.          .        318: var dummy byte
.          .        319: addr := uintptr(unsafe.Pointer(&amp;amp;dummy))
130ms      130ms    320: id := int(addr&amp;gt;&amp;gt;12) &amp;amp; (numShards - 1)
.          .        321:
13.42s     13.48s   322: return p.Shards[id], id
.          .        323:}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I was very happy with this optimization in isolation, at least until I checked the overall performance of the system…&lt;/p&gt;

&lt;h2&gt;
  
  
  What Got Worse
&lt;/h2&gt;

&lt;p&gt;Once we have a shard, we can retrieve an object from it, and that's exactly where all the load went.&lt;/p&gt;

&lt;h3&gt;
  
  
  Before
&lt;/h3&gt;

&lt;p&gt;Performance before the change was acceptable; there was some contention, but that was expected since I was testing with 1,000–2,000 goroutines.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total: 189.73s
12.09s     30.37s    (flat, cum) 16.01% of Total
1.41s      1.42s     432:func (p *ShardedPool[T, P]) retrieveFromShard(shard *Shard[T, P]) (zero P, success bool) {
.          .         433: for {
2.44s      4.03s     434:  oldHead := P(shard.Head.Load())
2.33s      2.33s     435:  if oldHead == nil {
.          .         436:   return zero, false
.          .         437:  }
.          .         438:
3.86s      3.87s     439:  next := oldHead.GetNext()
.          16.66s    440:  if shard.Head.CompareAndSwap(oldHead, next) {
2.05s      2.06s     441:   return oldHead, true
.          .         442:  }
.          .         443: }
.          .         444:}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  After
&lt;/h3&gt;

&lt;p&gt;Essentially, all the computational weight from the &lt;code&gt;getShard&lt;/code&gt; function was shifted down to &lt;code&gt;retrieveFromShard&lt;/code&gt; — and then some.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Total: 238.81s
30.79s     79.69s (flat, cum) 33.37% of Total
2.26s      2.26s      432:func (p *ShardedPool[T, P]) retrieveFromShard(shard *Shard[T, P]) (zero P, success bool) {
400ms      400ms      433: for {
15.74s     22.78s     434:  oldHead := P(shard.Head.Load())
2.26s      2.27s      435:  if oldHead == nil {
20ms       20ms       436:   return zero, false
.          .          437:  }
.          .          438:
8.55s      8.55s      439:  next := oldHead.GetNext()
450ms     42.29s      440:  if shard.Head.CompareAndSwap(oldHead, next) {
1.11s      1.12s      441:   return oldHead, true
.          .          442:  }
.          .          443: }
.          .          444:}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Looking at the profiles alone, performance got worse, but not by much, since most of the cost simply moved from one function to the other. The benchmark results, however, show a &lt;strong&gt;2.58x slowdown&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;BenchmarkGenPool-8    291123867      3.949 ns/op        0 B/op        0 allocs/op
BenchmarkGenPool-8    150982447      10.17 ns/op        0 B/op        0 allocs/op
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If this was merely weight redistribution, what could possibly explain a 2.58x performance degradation?&lt;/p&gt;

&lt;h2&gt;
  
  
  Breakdown
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;getShard&lt;/code&gt; (before) and &lt;code&gt;retrieveFromShard&lt;/code&gt; (after the change) consumed nearly identical amounts of CPU time, but the problem was &lt;strong&gt;where the contention was placed&lt;/strong&gt; with this change.&lt;/p&gt;

&lt;p&gt;The original &lt;code&gt;procPin&lt;/code&gt; approach creates &lt;strong&gt;temporal locality&lt;/strong&gt; — the same goroutine uses the same shard repeatedly, and as long as the object is returned within the same goroutine that retrieved it, it will go back to the same shard, establishing predictable access patterns. My "optimized" stack-address approach creates &lt;strong&gt;spatial randomness&lt;/strong&gt; — while this distributes load evenly, it's terrible for cache coherency and creates chaotic contention.&lt;/p&gt;

&lt;h3&gt;
  
  
  Contention Issue
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Before:&lt;/strong&gt; 189.73s total, with atomic operations taking ~33.76s (17.7%) — &lt;code&gt;CompareAndSwapPointer&lt;/code&gt; at 16.5%, &lt;code&gt;Int64.Add&lt;/code&gt; barely visible at 1.3%.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After:&lt;/strong&gt; 238.81s total, with atomic operations taking ~133.45s (55.9%) — &lt;code&gt;CompareAndSwapPointer&lt;/code&gt; ballooned to 33.36%, &lt;code&gt;Int64.Add&lt;/code&gt; surged to 7.96% (~6.6×), and total contention cost quadrupled.&lt;/p&gt;

&lt;p&gt;The optimized version was indeed distributing load almost perfectly, but that turned out to be a terrible thing. Goroutines were now grabbing objects from random shards and returning them to completely different random shards. It's like the difference between everyone having an assigned parking spot (&lt;code&gt;procPin&lt;/code&gt;) versus everyone fighting for random spots — the assigned spots prevent traffic jams!&lt;/p&gt;

&lt;p&gt;More importantly, when each goroutine stays pinned to a specific logical processor, the CPU cache can remember where objects were stored much more effectively, and the logical processor's cache in Go's runtime becomes tuned to that goroutine's access patterns. This matters because the pool relies on intrusive linked lists, which already sacrifice some cache benefits; we really can't afford to lose more. Random shard access destroys what locality remains, especially across 1,000–2,000 goroutines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;procPin&lt;/code&gt; approach created natural isolation where each logical processor worked with its own shard, minimizing the number of cores competing for the same atomic variables. When I replaced this with "random" load balancing, I inadvertently created a thundering herd scenario where all 1,000–2,000 goroutines were hammering the same atomic variables across all CPU cores.&lt;/p&gt;

&lt;p&gt;The hardware couldn't resolve this contention — it was drowning in it. Each atomic operation had to wait for cache line ownership through the CPU's cache coherency protocol, contributing to the catastrophic 2.58x performance degradation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key insight:&lt;/strong&gt; Synchronization mechanisms are only as good as the access patterns that use them. Perfect load distribution can be perfectly terrible for performance. Don't assume; always measure.&lt;/p&gt;




&lt;p&gt;&lt;a href="https://github.com/AlexsanderHamir/GenPool" rel="noopener noreferrer"&gt;GenPool repository&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/alexsander-baptista/" rel="noopener noreferrer"&gt;Linkedin&lt;/a&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>programming</category>
      <category>performance</category>
    </item>
    <item>
      <title>TokenSpan: Rethinking Prompt Compression with Aliases and Dictionary Encoding</title>
      <dc:creator>Alexsander Hamir</dc:creator>
      <pubDate>Mon, 04 Aug 2025 22:24:34 +0000</pubDate>
      <link>https://dev.to/alexsanderhamir/tokenspan-rethinking-prompt-compression-with-aliases-and-dictionary-encoding-3a27</link>
      <guid>https://dev.to/alexsanderhamir/tokenspan-rethinking-prompt-compression-with-aliases-and-dictionary-encoding-3a27</guid>
      <description>&lt;p&gt;In the era of large language models, prompt size is power — but also a big cost.&lt;br&gt;
The more context you provide, the more tokens you consume. And when working with long, structured prompts or repetitive query templates, that cost can escalate quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TokenSpan isn’t a compression library&lt;/strong&gt;; it’s a thought experiment — a different way of thinking about prompt optimization.&lt;/p&gt;

&lt;p&gt;Can we reduce token usage by substituting repeated phrases with lightweight aliases?&lt;/p&gt;

&lt;p&gt;Can we borrow ideas from dictionary encoding to constrain and compress the language we use to communicate with models?&lt;/p&gt;

&lt;p&gt;This project explores those questions — not by building a full encoding system, but by probing whether such a technique might be &lt;strong&gt;useful, measurable, and worth pursuing&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  💡 The Core Insight: Let the Model Do the Work
&lt;/h2&gt;

&lt;p&gt;A crucial insight behind TokenSpan is recognizing where the real cost lies:&lt;br&gt;
&lt;strong&gt;We pay for tokens, not computation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So why not reduce the tokens we send, and let the model handle the substitution?&lt;br&gt;
LLMs easily understand that &lt;code&gt;§a&lt;/code&gt; means &lt;code&gt;"Microsoft Designer"&lt;/code&gt; — and we’re already paying for those tokens, so there’s no extra cost for that mental mapping.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Dictionary: §a → Microsoft Designer  
Rewritten Prompt: How does §a compare to Canva?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
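
&lt;p&gt;To make the mechanics concrete, here is a small sketch of the substitution step in Go. The dictionary contents, alias names, and prompt text are all hypothetical; a real system would build the dictionary from token-frequency analysis:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import (
    "fmt"
    "strings"
)

func main() {
    // Hypothetical dictionary mapping each phrase to its alias.
    dict := map[string]string{
        "Microsoft Designer": "§a",
        "Canva Pro":          "§b",
    }

    prompt := "How does Microsoft Designer compare to Canva Pro? " +
        "Is Microsoft Designer cheaper than Canva Pro?"

    // Build a single-pass replacer from the dictionary.
    pairs := make([]string, 0, 2*len(dict))
    for phrase, alias := range dict {
        pairs = append(pairs, phrase, alias)
    }
    rewritten := strings.NewReplacer(pairs...).Replace(prompt)

    fmt.Println(rewritten)
    // Output: How does §a compare to §b? Is §a cheaper than §b?
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;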






&lt;h2&gt;
  
  
  🔁 Scaling with Reusable Dictionaries
&lt;/h2&gt;

&lt;p&gt;If you were to build a system around this idea, the best strategy wouldn't be to re-send the dictionary with every prompt. Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Build the dictionary once&lt;/li&gt;
&lt;li&gt;Embed it in the system prompt or long-term memory&lt;/li&gt;
&lt;li&gt;Reuse it across multiple interactions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This only makes sense when dealing with &lt;strong&gt;large or repetitive prompts&lt;/strong&gt;, where the cost of setting up the dictionary is outweighed by the long-term savings.&lt;/p&gt;
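
&lt;p&gt;For illustration, a reusable setup might look something like this (hypothetical contents):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;System prompt (sent once, reused across the session):
  Dictionary: §a → Microsoft Designer, §b → Canva

Turn 1: How does §a compare to §b?
Turn 2: Does §a support team templates?
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;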

&lt;p&gt;By encouraging &lt;strong&gt;simpler, more structured language&lt;/strong&gt;, your application can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Reduce costs&lt;/li&gt;
&lt;li&gt;Improve consistency&lt;/li&gt;
&lt;li&gt;Handle diverse user inputs more efficiently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After all, we’re often &lt;strong&gt;asking the same things&lt;/strong&gt; — just in different ways.&lt;/p&gt;




&lt;h2&gt;
  
  
  📐 The Formula
&lt;/h2&gt;

&lt;p&gt;What if we replaced a 2-token phrase like &lt;code&gt;"Microsoft Designer"&lt;/code&gt; with an alias like &lt;code&gt;§a&lt;/code&gt;?&lt;/p&gt;

&lt;p&gt;Assume the phrase appears &lt;code&gt;X&lt;/code&gt; times:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Original Cost&lt;/strong&gt;: &lt;code&gt;2 × X&lt;/code&gt; tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Compressed Cost&lt;/strong&gt;: &lt;code&gt;X&lt;/code&gt; (alias usage) + &lt;code&gt;4&lt;/code&gt; (dictionary overhead)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Savings Formula&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Saved = (2 × X) - (X + 4)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Example&lt;/strong&gt;: &lt;code&gt;"Microsoft Designer"&lt;/code&gt; appears 15 times.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Saved = (2 × 15) - (15 + 4) = 30 - 19 = 11 tokens saved
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That’s &lt;strong&gt;just one phrase&lt;/strong&gt; — real prompts often contain &lt;strong&gt;dozens&lt;/strong&gt; of reusable patterns.&lt;/p&gt;




&lt;h2&gt;
  
  
  🎯 Why Focus on Two-Token Phrases?
&lt;/h2&gt;

&lt;p&gt;This experiment targets &lt;strong&gt;two-token phrases&lt;/strong&gt; for a reason:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Single tokens can’t be compressed&lt;/li&gt;
&lt;li&gt;✅ Longer phrases save more but occur less&lt;/li&gt;
&lt;li&gt;✅ Two-token phrases hit the &lt;strong&gt;sweet spot&lt;/strong&gt;: frequent &lt;em&gt;and&lt;/em&gt; compressible&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🧾 Understanding the Overhead
&lt;/h2&gt;

&lt;p&gt;Each dictionary entry adds &lt;strong&gt;4 tokens&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;1&lt;/code&gt; token for the &lt;strong&gt;replacement code&lt;/strong&gt; (e.g. &lt;code&gt;§a&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;1&lt;/code&gt; token for the &lt;strong&gt;separator&lt;/strong&gt; (e.g. &lt;code&gt;→&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;2&lt;/code&gt; tokens for the &lt;strong&gt;original phrase&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You only start saving tokens &lt;strong&gt;once a phrase appears 5 or more times&lt;/strong&gt;.&lt;/p&gt;
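
&lt;p&gt;A quick sketch of that break-even arithmetic in Go, using the 2-token phrase and 4-token overhead described above:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight go"&gt;&lt;code&gt;package main

import "fmt"

// tokensSaved returns the net tokens saved by aliasing a 2-token phrase
// that appears x times, given the 4-token dictionary-entry overhead.
func tokensSaved(x int) int {
    original := 2 * x   // 2 tokens per occurrence
    compressed := x + 4 // 1 alias token per occurrence plus the entry
    return original - compressed
}

func main() {
    // Break-even is at 4 occurrences; net savings start at 5.
    for _, x := range []int{4, 5, 15} {
        fmt.Printf("appears %d times, saves %d tokens\n", x, tokensSaved(x))
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;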




&lt;h2&gt;
  
  
  📊 Real-World Results
&lt;/h2&gt;

&lt;p&gt;Using a raw prompt of &lt;strong&gt;8,019 tokens&lt;/strong&gt;:&lt;br&gt;
After substitution → &lt;strong&gt;7,138 tokens&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Savings: 881 tokens (~11.0%)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The model continued performing correctly with the encoded prompt.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 Conclusion
&lt;/h2&gt;

&lt;p&gt;Natural language gives users the freedom to communicate in flexible, intuitive ways.&lt;br&gt;
But that freedom comes at a cost:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔄 Repetition&lt;/li&gt;
&lt;li&gt;❌ Inaccuracy from phrasing variations&lt;/li&gt;
&lt;li&gt;💰 Higher usage costs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If applications used a &lt;strong&gt;limited vocabulary&lt;/strong&gt; for most interactions, they could:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower token usage&lt;/li&gt;
&lt;li&gt;Encourage more structured prompts&lt;/li&gt;
&lt;li&gt;Improve response consistency&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🧪 Lessons from Tokenization Quirks
&lt;/h2&gt;

&lt;p&gt;Here are some interesting quirks noticed during development:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Common Phrases = Fewer Tokens&lt;/strong&gt;&lt;br&gt;
e.g., &lt;code&gt;"the"&lt;/code&gt; often becomes a single token.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Capitalization Can Split Words&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;"Designer"&lt;/code&gt; vs. &lt;code&gt;"designer"&lt;/code&gt; — tokenizers treat them differently.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rare Words Get Chopped Up&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;"visioneering"&lt;/code&gt; might tokenize into &lt;code&gt;"vision"&lt;/code&gt; + &lt;code&gt;"eering"&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Numbers Don’t Tokenize Nicely&lt;/strong&gt;&lt;br&gt;
&lt;code&gt;"123456"&lt;/code&gt; can break into &lt;code&gt;"123"&lt;/code&gt; + &lt;code&gt;"456"&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Digits as Aliases? Risky.&lt;/strong&gt;&lt;br&gt;
Using &lt;code&gt;"0"&lt;/code&gt; or &lt;code&gt;"1"&lt;/code&gt; as shortcuts often backfires — better to use symbols like &lt;code&gt;§&lt;/code&gt; or &lt;code&gt;@&lt;/code&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔬 Try It Yourself
&lt;/h2&gt;

&lt;p&gt;📍 GitHub: &lt;a href="https://github.com/alexsanderhamir/TokenSpan" rel="noopener noreferrer"&gt;alexsanderhamir/TokenSpan&lt;/a&gt;&lt;br&gt;
💬 Contributions &amp;amp; feedback welcome!&lt;/p&gt;




&lt;p&gt;TokenSpan is a &lt;strong&gt;thought experiment&lt;/strong&gt; in prompt optimization.&lt;br&gt;
The savings are real — but the real value is in rethinking how we balance &lt;strong&gt;cost&lt;/strong&gt;, &lt;strong&gt;compression&lt;/strong&gt;, and &lt;strong&gt;communication&lt;/strong&gt; with LLMs.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>machinelearning</category>
      <category>promptengineering</category>
      <category>ai</category>
    </item>
    <item>
      <title>Prof: A Structured Way to Manage and Compare Go Profiles</title>
      <dc:creator>Alexsander Hamir</dc:creator>
      <pubDate>Thu, 31 Jul 2025 14:12:11 +0000</pubDate>
      <link>https://dev.to/alexsanderhamir/prof-a-structured-way-to-manage-and-compare-go-profiles-4kb8</link>
      <guid>https://dev.to/alexsanderhamir/prof-a-structured-way-to-manage-and-compare-go-profiles-4kb8</guid>
      <description>&lt;p&gt;Go’s philosophy emphasizes simplicity and readability, aiming to lower the barrier for beginners to understand and contribute to codebases with less friction compared to many other languages. While &lt;code&gt;pprof&lt;/code&gt; is already a powerful and user-friendly tool, effective profiling still requires experience to maintain good practices — such as organizing previous runs, documenting performance changes as you go, and keeping track of what was improved and when.&lt;/p&gt;

&lt;p&gt;Without that experience, it’s easy to end up with a clutter of files and no clear history, forcing you to dig through old commits just to recall what you did minutes earlier.&lt;/p&gt;

&lt;p&gt;That’s why I built &lt;strong&gt;Prof&lt;/strong&gt; — a tool designed to bring &lt;strong&gt;structure, clarity, and speed&lt;/strong&gt; to Go performance workflows, making life easier for both beginners and experienced engineers alike.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Common Way
&lt;/h2&gt;

&lt;p&gt;The commands below leave organization and documentation entirely up to the developer. And to be fair, these tools already do a lot — but still, why not encourage a more structured approach? Why not simplify the profiling workflow so it doesn’t require running a chain of commands back and forth, or having each team build their own custom scripts around it?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;go &lt;span class="nb"&gt;test&lt;/span&gt; &lt;span class="nt"&gt;-bench&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;BenchmarkMyFunc &lt;span class="nt"&gt;-cpuprofile&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;cpu.out
go tool pprof &lt;span class="nt"&gt;-top&lt;/span&gt; cpu.out &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; results.txt
go tool pprof &lt;span class="nt"&gt;-list&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;MyFunc cpu.out
&lt;span class="c"&gt;# Make changes, repeat...&lt;/span&gt;
&lt;span class="c"&gt;# Hours later: "Wait, was that the baseline or the optimized version?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  The New Way
&lt;/h2&gt;

&lt;p&gt;Prof solves this with a simple idea: treat profiling sessions like a well-structured codebase — organized and easy to navigate.&lt;/p&gt;

&lt;p&gt;Instead of wrestling with scattered files, run one command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;prof auto &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--benchmarks&lt;/span&gt; &lt;span class="s2"&gt;"BenchmarkGenPool"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profiles&lt;/span&gt; &lt;span class="s2"&gt;"cpu,memory,mutex,block"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--count&lt;/span&gt; 10 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--tag&lt;/span&gt; &lt;span class="s2"&gt;"baseline"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That single command replaces dozens of manual steps — creating a neatly organized dataset:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;bench/baseline/
├── description.txt              # Your notes for this run
├── bin/BenchmarkGenPool/        # Binary profile files (e.g., .pprof)
├── text/BenchmarkGenPool/       # Human-readable reports (e.g., top, list, disasm)
│
├── cpu_functions/               # ┐
│   ├── &amp;lt;func1&amp;gt;.txt              # │
│   ├── &amp;lt;func2&amp;gt;.txt              # │ Function-level CPU performance data
│   └── ...                      # │
└── memory_functions/            # ┘
    ├── &amp;lt;func1&amp;gt;.txt              # Function-level memory performance data
    ├── &amp;lt;func2&amp;gt;.txt
    └── ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, instead of rerunning commands just to inspect a function’s performance, you can simply open the relevant file or search by its name — everything is structured and ready to explore.&lt;/p&gt;

&lt;p&gt;Prof also offers an option to skip wrapping &lt;code&gt;go test&lt;/code&gt;, giving users the flexibility to run benchmarks however they prefer while still benefiting from Prof’s organization and analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Profiling Diffs at the Function Level
&lt;/h2&gt;

&lt;p&gt;Thanks to Prof’s structured approach, you no longer need to manually track performance changes between optimizations. Simply pass the tags you want to compare, and Prof will generate the diffs for you — available in &lt;strong&gt;HTML&lt;/strong&gt;, &lt;strong&gt;JSON&lt;/strong&gt;, or &lt;strong&gt;terminal output&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;prof track auto &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--base&lt;/span&gt; &lt;span class="s2"&gt;"baseline"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--current&lt;/span&gt; &lt;span class="s2"&gt;"optimized"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--profile-type&lt;/span&gt; &lt;span class="s2"&gt;"cpu"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--bench-name&lt;/span&gt; &lt;span class="s2"&gt;"BenchmarkGenPool"&lt;/span&gt; &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--output-format&lt;/span&gt; &lt;span class="s2"&gt;"summary-html"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Get clear, actionable insights:
&lt;/h3&gt;

&lt;h4&gt;
  
  
  ⚠️ Top Regressions:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;internal/cache.getShard&lt;/code&gt;: +200.0% (0.030s → 0.090s)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;sync.Pool.Get&lt;/code&gt;: +100.0% (0.010s → 0.020s)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  ✅ Top Improvements:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;encoding/json.Unmarshal&lt;/code&gt;: -95.0% (0.100s → 0.005s)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;pool/isFull&lt;/code&gt;: -85.0% (0.020s → 0.003s)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This gives you a more &lt;strong&gt;organized and automated&lt;/strong&gt; way of doing performance work.&lt;/p&gt;

&lt;h2&gt;
  
  
  Contributions Welcome
&lt;/h2&gt;

&lt;p&gt;Instead of each team building their own scripts, we can come together to create a tool that helps developers handle performance work more easily — whether under pressure or as part of everyday optimization.&lt;/p&gt;

&lt;p&gt;Prof aims to be that shared foundation, making profiling more accessible, consistent, and reliable across teams.&lt;/p&gt;

&lt;p&gt;🔗 &lt;strong&gt;&lt;a href="https://github.com/AlexsanderHamir/prof" rel="noopener noreferrer"&gt;Prof Repository&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
🔗 &lt;strong&gt;&lt;a href="https://www.linkedin.com/in/alexsander-baptista" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>go</category>
      <category>programming</category>
      <category>performance</category>
    </item>
  </channel>
</rss>
