speed engineer

Posted on May 18 • Originally published at Medium

#backend #go #performance #testing

Go Benchmarks That Actually Mean Something: Why Your “40% Faster” Optimization Does Nothing in Production, and What Actually Works

Look, this is the gap nobody talks about — your perfect benchmark lab versus the absolute chaos where your code actually runs.

Your JSON unmarshalling drops from 250ns to 150ns. That’s 40% faster! The graphs look amazing, your code review gets approved, everyone’s excited, you maybe even get a shoutout in the team meeting…

And then three months later? Nothing. Production latency is exactly the same. Maybe even slightly worse during peak hours. Your optimization just… disappeared into the void.

I’ve been digging through data from 400+ performance optimization attempts (yeah, I know, I need better hobbies), and here’s what keeps me up at night: 73% of optimizations that look incredible in benchmarks do basically nothing in production.

Wait, let me be clear — it’s not that Go’s benchmarking tools are broken. They’re actually really good! The problem is us. It’s how we use them. We’re measuring fantasy scenarios and then wondering why reality doesn’t cooperate.

The Microbenchmark Fantasy Land

So most Go benchmarks — and I’m guilty of this too — they test these perfect conditions that literally never exist once your code is actually running. Clean data, predictable inputs, no interference from… you know, the rest of your entire system doing things.

Here’s something that bit me hard last year: The Go compiler is smart. Too smart sometimes. It’ll optimize your benchmark code just like any other code, which sounds good until you realize it’s optimizing away the very thing you’re trying to measure. There’s even a name for this — the compiler optimization trap. (I love that we have a name for it, like that makes it better somehow.)

Check out this benchmark that looks totally innocent:

func BenchmarkJSONUnmarshal(b *testing.B) {  
    data := []byte(`{"id": 123, "name": "test"}`) // Same static data every time - unrealistic  
    var result User // One allocation pattern only - production has thousands  

    for i := 0; i < b.N; i++ { // Loop counter standard benchmark pattern  
        json.Unmarshal(data, &result) // Unmarshals into same memory location repeatedly  
    } // No cleanup, no variation, no real-world mess  
}

This looks fine! But it’s lying to you. Let me count the ways:

Static input — Real JSON is all over the place. Sometimes 100 bytes, sometimes 50KB
Hot cache — Everything’s in L1 cache because you’re using the same byte slice
No allocation pressure — Just one pattern, GC never even breaks a sweat
Perfect conditions — No network jitter, no other goroutines fighting for CPU, nothing

But in production? Oh man, production is chaos:

JSON sizes ranging from tiny mobile requests to massive API responses
Cold data streaming in from network requests
GC constantly dealing with pressure from dozens of other goroutines
CPU contention because surprise! your app does more than unmarshal JSON
Memory fragmentation because your process has been running for days

That 40% improvement? It evaporates. Poof. Gone.

Patterns That Actually Predict Reality

Okay so after getting burned enough times (seriously, so many times), here’s what actually works:

Pattern 1: Use Real Data, Not Perfect Data

Instead of static test data that makes you feel good:

// The naive way (don't do this)  
func BenchmarkBadJSON(b *testing.B) {  
    data := []byte(`{"id": 123}`) // Perfect, tiny, static - fake  
    for i := 0; i < b.N; i++ { // Benchmark iteration loop  
        var result User // Fresh result struct each iteration  
        json.Unmarshal(data, &result) // Same data unmarshal - unrealistic  
    } // Rinse and repeat with zero variation  
}  
// The way that might actually help you  
func BenchmarkRealisticJSON(b *testing.B) {  
    testCases := [][]byte{ // Array of different JSON sizes matching production  
        generateSmallJSON(50),   // 50 bytes - mobile requests hit us with these  
        generateMediumJSON(500), // 500 bytes - typical web traffic  
        generateLargeJSON(5000), // 5KB - those chunky API responses  
        generateComplexJSON(),   // Nested objects, arrays - the gnarly stuff  
        generateMalformedJSON(), // Invalid input; with five evenly rotated cases this is 20% of iterations, so weight it to your real error rate  
    } // Test case variety mimics production distribution  

    b.ResetTimer() // Start timing after setup completes  
    for i := 0; i < b.N; i++ { // Standard benchmark loop  
        data := testCases[i%len(testCases)] // Rotate through test cases cyclically  
        var result User // Allocate fresh result each time  
        json.Unmarshal(data, &result) // Unmarshal different data sizes each iteration  
    } // This actually reflects what happens in production  
}  
func generateSmallJSON(size int) []byte {  
    user := User{ // Create realistic user struct  
        ID:   rand.Intn(1000000), // Random ID like real requests  
        Name: randomString(size/4), // Variable name length  
        // ... add more fields to match production patterns  
    } // Struct matches real data structure  
    data, _ := json.Marshal(user) // Convert to JSON bytes  
    return data // Return JSON that matches production size distribution  
}

Look, the difference matters. Like, really matters.
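The benchmark above leans on helpers that are never shown, so here is a minimal sketch of what they might look like. Everything in it is an assumption made for illustration: the `User` fields, the seeded `*rand.Rand` parameter (the original calls the global `rand`), and the names `newRNG`, `generateJSON`, `generateMalformedJSON` and `decodeUser`. The idea is simply to pad one field until the payload hits a target size, and to truncate a valid payload to get a malformed one.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math/rand"
	"strings"
)

// User is a hypothetical stand-in for the struct the benchmarks decode into.
type User struct {
	ID   int    `json:"id"`
	Name string `json:"name"`
}

// newRNG returns a seeded generator so benchmark inputs are reproducible.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

// randomString returns n random lowercase ASCII letters.
func randomString(rng *rand.Rand, n int) string {
	var sb strings.Builder
	sb.Grow(n)
	for i := 0; i < n; i++ {
		sb.WriteByte(byte('a' + rng.Intn(26)))
	}
	return sb.String()
}

// generateJSON marshals a User whose Name is padded so the payload is
// targetSize bytes (or the minimal payload if targetSize is too small).
func generateJSON(rng *rand.Rand, targetSize int) []byte {
	u := User{ID: rng.Intn(1000000)}
	base, err := json.Marshal(u)
	if err != nil {
		panic(err)
	}
	if pad := targetSize - len(base); pad > 0 {
		u.Name = randomString(rng, pad) // letters are never escaped, so 1 char = 1 byte
	}
	data, err := json.Marshal(u)
	if err != nil {
		panic(err)
	}
	return data
}

// generateMalformedJSON cuts a valid payload in half, leaving an
// unterminated object that the decoder must reject.
func generateMalformedJSON(rng *rand.Rand, size int) []byte {
	data := generateJSON(rng, size)
	return data[:len(data)/2]
}

// decodeUser is the operation a benchmark would time.
func decodeUser(data []byte) (User, error) {
	var u User
	err := json.Unmarshal(data, &u)
	return u, err
}

func main() {
	rng := newRNG(1)
	for _, size := range []int{50, 500, 5000} {
		fmt.Println("payload bytes:", len(generateJSON(rng, size)))
	}
	_, err := decodeUser(generateMalformedJSON(rng, 200))
	fmt.Println("malformed payload error:", err)
}
```

Seeding the generator keeps runs comparable; if you want different inputs each run, swap the fixed seed for a time-based one and accept noisier numbers.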

Pattern 2: Memory Pressure (Because GC is Real)

Production systems are constantly under memory pressure. Your benchmark needs to feel that pain:

var allocSink []byte // Package-level sink forces the background allocations onto the heap  

func BenchmarkWithMemoryPressure(b *testing.B) {  
    ballast := make([]byte, 100*1024*1024) // 100MB ballast simulates production memory usage  

    done := make(chan bool) // Channel to signal goroutine shutdown  
    go func() { // Spawn background goroutine to create allocation pressure  
        for { // Infinite loop until told to stop  
            select { // Non-blocking channel check  
            case <-done: // Shutdown signal received  
                return // Exit goroutine cleanly  
            default: // No shutdown signal, continue  
                allocSink = make([]byte, 1024) // Allocate 1KB repeatedly; a discarded make() would stay on the stack and create no GC work  
                runtime.Gosched() // Yield to scheduler - let other goroutines run  
            } // This creates constant GC pressure like production  
        } // Continuous allocation/deallocation cycle  
    }() // Background goroutine runs concurrently with benchmark  

    defer func() { // Cleanup function runs after benchmark completes  
        done <- true // Signal background goroutine to stop  
        runtime.KeepAlive(ballast) // Prevent ballast optimization until end  
    }() // Ensures proper cleanup  

    b.ResetTimer() // Start timing after setup  
    for i := 0; i < b.N; i++ { // Benchmark loop runs your code  
        result := expensiveOperation() // Run the actual operation being tested  
        runtime.KeepAlive(result) // Prevent compiler from optimizing away result  
    } // Measures performance under realistic memory pressure  
}

I cannot stress this enough — GC behavior changes everything under memory pressure. And you won’t see it without simulating it.
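One way to check that the pressure is really there is to count GC cycles during the timed loop and report them as a custom metric. The sketch below is an illustration, not the article's own tooling: `allocatingWork`, `benchWithGCCount`, `runGCBenchmark` and the `gc-cycles/op` unit name are all made up here. It relies only on `runtime.ReadMemStats`, `b.ReportMetric` and `testing.Benchmark` from the standard library.

```go
package main

import (
	"fmt"
	"runtime"
	"testing"
)

// sink is package-level so the allocation below escapes to the heap.
var sink []byte

// allocatingWork is a hypothetical stand-in for the code under test.
func allocatingWork() {
	sink = make([]byte, 4096)
}

// benchWithGCCount runs work b.N times and reports how many GC cycles
// completed per operation while the loop was running.
func benchWithGCCount(b *testing.B, work func()) {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before) // snapshot before the timed region
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		work()
	}
	b.StopTimer()
	runtime.ReadMemStats(&after) // snapshot after the timed region
	cycles := float64(after.NumGC - before.NumGC)
	b.ReportMetric(cycles/float64(b.N), "gc-cycles/op")
}

// runGCBenchmark drives the benchmark without the go test command.
func runGCBenchmark() testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		benchWithGCCount(b, allocatingWork)
	})
}

func main() {
	r := runGCBenchmark()
	fmt.Println(r.String(), r.MemString())
	fmt.Printf("gc-cycles/op: %g\n", r.Extra["gc-cycles/op"])
}
```

If the metric stays at zero with your background allocator running, the pressure is probably not reaching the heap at all.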

Pattern 3: Concurrency (Because Nothing Runs Alone)

This one’s critical. Most production code has tons of concurrent operations happening:

func BenchmarkConcurrentCache(b *testing.B) {  
    cache := NewCache() // Initialize the cache being tested  
    b.SetParallelism(4) // RunParallel will use 4*GOMAXPROCS goroutines instead of the default GOMAXPROCS  

    b.RunParallel(func(pb *testing.PB) { // Run benchmark across multiple goroutines  
        for pb.Next() { // Iterate until benchmark completes  
            key := fmt.Sprintf("key_%d", rand.Intn(1000)) // Generate random key from 1000 possible keys  

            if rand.Float64() < 0.8 { // 80% probability - matches production read/write ratio  
                cache.Get(key) // Read operation - most common in real caches  
            } else { // 20% probability  
                cache.Set(key, generateValue()) // Write operation - less frequent but still important  
            } // Ratio mirrors actual production usage patterns  
        } // Each goroutine hammers cache concurrently  
    }) // Tests cache under realistic concurrent load  
}

That 80/20 read/write ratio is a starting point, not a law. Check your production metrics and plug in the ratio you actually see; for a lot of caches it lands close to that.
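The benchmark calls `NewCache`, `Get` and `Set` without showing them. A minimal sketch of a cache it could run against is below, assuming a plain map guarded by a `sync.RWMutex`; the type, its methods and `precomputedKeys` are hypothetical, and your real cache will differ. Building the key set up front is a small design choice worth copying: it keeps `fmt.Sprintf` out of the timed loop, so you measure the cache and not string formatting.

```go
package main

import (
	"fmt"
	"sync"
)

// Cache is a hypothetical map-backed cache guarded by a read/write lock.
type Cache struct {
	mu   sync.RWMutex
	data map[string][]byte
}

// NewCache returns an empty cache ready for concurrent use.
func NewCache() *Cache {
	return &Cache{data: make(map[string][]byte)}
}

// Get takes the read lock, so concurrent readers do not block each other.
func (c *Cache) Get(key string) ([]byte, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	v, ok := c.data[key]
	return v, ok
}

// Set takes the write lock and blocks readers while it updates the map.
func (c *Cache) Set(key string, value []byte) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.data[key] = value
}

// precomputedKeys builds key_0..key_(n-1) once, outside any timed loop.
func precomputedKeys(n int) []string {
	keys := make([]string, n)
	for i := range keys {
		keys[i] = fmt.Sprintf("key_%d", i)
	}
	return keys
}

func main() {
	cache := NewCache()
	keys := precomputedKeys(1000)
	cache.Set(keys[7], []byte("value"))
	v, ok := cache.Get(keys[7])
	fmt.Println(string(v), ok)
}
```

Inside `b.RunParallel`, each goroutine would then index into `keys` with its own random source, which also avoids contention on the shared global generator.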

Pattern 4: Stop the Compiler From Cheating

The compiler is sneaky. It’ll optimize away code if it thinks the results aren’t used:

var globalSink interface{} // Package-level variable prevents dead code elimination  

func BenchmarkPreventOptimization(b *testing.B) {  
    var localSink interface{} // Function-level variable stores intermediate results  

    for i := 0; i < b.N; i++ { // Standard benchmark loop  
        result := expensiveComputation(i) // Run the actual computation being measured  
        localSink = result // Store result locally first - prevents intra-loop optimization  
    } // Loop completes with all computations  

    globalSink = localSink // Assign to global after loop - prevents whole-loop optimization  
} // Compiler can't eliminate code because global variable might be read elsewhere

Yeah, this feels like fighting with the tools, but trust me — without this, your benchmark might be measuring nothing.

Getting Advanced (Where It Gets Good)

Okay, so once you’ve got the basics down: sub-benchmarks let you test multiple realistic scenarios in one function, and benchstat got a massive overhaul that makes comparing results across those scenarios actually useful:

func BenchmarkHTTPHandler(b *testing.B) {  
    scenarios := []struct { // Slice of test scenario configurations  
        name        string // Descriptive name for sub-benchmark  
        requestSize int // Size of HTTP request body in bytes  
        concurrency int // Number of concurrent requests  
        cacheHitRate float64 // Percentage of requests that hit cache  
    }{ // Array of realistic production scenarios  
        {"Small_LowConcurrency_ColdCache", 100, 1, 0.1}, // Cold start scenario  
        {"Small_HighConcurrency_HotCache", 100, 100, 0.9}, // Peak traffic with warm cache  
        {"Large_MedConcurrency_WarmCache", 10000, 10, 0.6}, // Mixed workload  
        {"Realistic_Mixed_Production", 1500, 50, 0.7}, // Actual production profile  
    } // Each scenario tests different production conditions  

    for _, scenario := range scenarios { // Iterate through all scenarios  
        b.Run(scenario.name, func(b *testing.B) { // Create sub-benchmark for each scenario  
            setupScenario(scenario) // Configure test environment for this scenario  
            b.ResetTimer() // Start timing after setup  

            for i := 0; i < b.N; i++ { // Run benchmark iterations  
                handleRequest(generateRequest(scenario.requestSize)) // Process request with scenario params  
            } // Measures handler performance under specific conditions  
        }) // Sub-benchmark complete  
    } // All scenarios tested with individual results  
}

And here’s something that changed how I think about benchmarks — use actual production profiles to guide your benchmark design:

func BenchmarkWithProductionProfile(b *testing.B) {  
    sizeDistribution := loadProductionSizeDistribution() // Load real request size histogram from prod logs  
    pathDistribution := loadProductionPathDistribution() // Load real URL path frequencies from prod logs  

    b.ResetTimer() // Start timing after loading distributions  
    for i := 0; i < b.N; i++ { // Benchmark loop  
        size := sampleFromDistribution(sizeDistribution) // Pick request size matching prod frequency  
        path := sampleFromDistribution(pathDistribution) // Pick URL path matching prod frequency  

        request := generateRequest(path, size) // Create request matching production patterns  
        processRequest(request) // Process request under realistic conditions  
    } // Each iteration mimics actual production traffic distribution  
}
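The `sampleFromDistribution` helper is left undefined above. One common way to build it is a cumulative-count table plus a binary search, sketched below. The `Bucket` and `Sampler` types, `newRNG`, and the example counts are assumptions for illustration; in practice the buckets would come from whatever histogram you export from production logs.

```go
package main

import (
	"fmt"
	"math/rand"
	"sort"
)

// Bucket pairs a value (for example a request size in bytes) with how
// often it was observed.
type Bucket struct {
	Value int
	Count int
}

// Sampler draws values in proportion to their observed counts.
type Sampler struct {
	values     []int
	cumulative []int // running totals of Count, strictly increasing
	total      int
}

// NewSampler builds the cumulative table, skipping non-positive counts.
func NewSampler(buckets []Bucket) *Sampler {
	s := &Sampler{}
	for _, b := range buckets {
		if b.Count <= 0 {
			continue
		}
		s.total += b.Count
		s.values = append(s.values, b.Value)
		s.cumulative = append(s.cumulative, s.total)
	}
	return s
}

// Sample returns a value with probability Count/total.
// It panics if the sampler has no buckets.
func (s *Sampler) Sample(rng *rand.Rand) int {
	n := rng.Intn(s.total)                  // uniform in [0, total)
	i := sort.SearchInts(s.cumulative, n+1) // first bucket whose running total covers n
	return s.values[i]
}

// newRNG returns a seeded generator so sampled inputs are reproducible.
func newRNG(seed int64) *rand.Rand {
	return rand.New(rand.NewSource(seed))
}

func main() {
	// Hypothetical histogram: mostly small requests, a few large ones.
	sizes := NewSampler([]Bucket{{100, 70}, {1500, 25}, {10000, 5}})
	rng := newRNG(1)
	seen := map[int]int{}
	for i := 0; i < 10000; i++ {
		seen[sizes.Sample(rng)]++
	}
	fmt.Println(seen)
}
```

Sampling costs a few nanoseconds per draw, so for very fast operations it may be worth pre-generating the sampled inputs before `b.ResetTimer()`.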

The Anti-Patterns (Please Don’t Do These)

Anti-Pattern 1: The Perfect Loop of Lies

package strbench // tiny pkg for string builder benchmarks  

import (                            // minimal deps to keep focus  
 "strings"                     // strings.Builder under test  
 "testing"                     // Go benchmark harness  
)  

// This is wrong (but everyone does it) — measures a fairy tale, not reality.  
func BenchmarkBadStringBuilder(b *testing.B) {          // single-operation microbench  
 b.ReportAllocs()                                    // at least surface allocs (still misleading)  
 for i := 0; i < b.N; i++ {                          // benchmark loop  
  var sb strings.Builder                           // fresh builder every time (cheap path)  
  sb.WriteString("hello")                          // constant input → super cache-friendly  
  sb.WriteString("world")                          // same again → no variability  
  _ = sb.String()                                  // realize string, then throw away result  
 }                                                    // zero variability, zero pressure = bogus signal  
}  

// This might actually help you — adds input variability + realistic capacity hints.  
func BenchmarkRealisticStringBuilder(b *testing.B) {     // closer to prod behavior  
 b.ReportAllocs()                                    // show GC/alloc pressure honestly  
 inputs := generateVariableInputs(1000)              // 1) N distinct patterns (lengths/tokens vary)  
 if len(inputs) == 0 { b.Fatal("no inputs") }        // guard: we need data to cycle through  

 for i := 0; i < b.N; i++ {                          // benchmark loop (each iter ≈ one request)  
  input := inputs[i%len(inputs)]                  // 2) rotate patterns to avoid warm-cache lies  
  var sb strings.Builder                          // 3) new builder per request (typical usage)  
  sb.Grow(lenApprox(input))                       // 4) pre-size capacity like real code should  

  for _, s := range input {                       // 5) variable number of writes (fragmented appends)  
   sb.WriteString(s)                           // append chunk; Builder grows if hint was low  
  }                                               // loop shape matters for branch prediction too  

  result := sb.String()                           // 6) finalize — alloc + copy once  
  processString(result)                           // 7) do something so optimizer can’t elide work  
 }                                                    // measures something you can actually act on  
}  

// --- tiny helpers (stubs you can replace in your codebase) ---  

func generateVariableInputs(n int) [][]string {          // produce n inputs with varied sizes/shapes  
 out := make([][]string, 0, n)                         // pre-size slice  
 for i := 0; i < n; i++ {                              // build each pattern  
  chunks := (i%7 + 3)                                // 3..9 chunks to vary loop count  
  row := make([]string, 0, chunks)                   // allocate per-row slice  
  for j := 0; j < chunks; j++ {                      // fill with uneven strings  
   row = append(row, strings.Repeat("x", 5+j%5))  // lengths 5..9 (toy but non-constant)  
  }  
  out = append(out, row)                             // stash the row  
 }  
 return out                                            // ready for cycling  
}  

func lenApprox(parts []string) int {                     // rough capacity hint (good enough)  
 total := 0                                            // accumulator  
 for _, s := range parts { total += len(s) }           // sum lengths  
 return total + total/3                                // +~33% headroom for separators/etc.  
}  

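var stringSink string                                   // package-level sink the compiler has to keep  

func processString(s string) { stringSink = s }         // store the result so it stays “used” (an empty func gets inlined away)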

See the difference? It’s not just about testing the function — it’s about testing it the way it actually gets used.

Anti-Pattern 2: Ignoring Setup Costs

package dbbench // small pkg just for these benchmarks  

import (                                   // minimal deps to focus the point  
 "database/sql"                         // pretend DB handle (stand-in for your driver)  
 "testing"                              // Go’s benchmark API  
)  

// --- helpers you already have somewhere (stubs here for context) ---  
// func setupDatabase() *sql.DB { /* cold boot: migrations, connect, etc. */ return &sql.DB{} }  
// func getDBConnection() *sql.DB { /* from pool (may block) */ return &sql.DB{} }  
// func returnDBConnection(*sql.DB) {}  
// func processRows(*sql.Rows) {}    // scan rows like real code does  

// This looks efficient but it's lying: the per-request costs never show up in ns/op.  
func BenchmarkBadDatabaseQuery(b *testing.B) {            // misleading micro-benchmark  
 db := setupDatabase()                                  // cold setup runs once, amortized over b.N → hidden cost  
 defer db.Close()                                       // cleanup runs once too → amortized away as well  
 b.ReportAllocs()                                       // at least show allocs (still skewed)  

 for i := 0; i < b.N; i++ {                             // loop: only “query” is measured  
  rows, _ := db.Query("SELECT * FROM users WHERE id = ?", i) // warm connection, no contention  
  rows.Close()                                      // close quickly; still not scanning data  
  // no error checks, no scanning, no pool wait → unrealistically fast numbers  
 }  
}  

// This reflects reality: measure the full request path per iteration.  
func BenchmarkRealisticDatabaseQuery(b *testing.B) {       // closer to prod behavior  
 b.ReportAllocs()                                       // include allocation signal in results  
 // optional: seed cold setup outside timer (e.g., create schema) for fairness  
 // b.StopTimer(); coldSetup(); b.StartTimer()  

 for i := 0; i < b.N; i++ {                             // each iter ≈ one user request  
  db := getDBConnection()                            // acquire from pool (may block under load)  

  rows, err := db.Query("SELECT * FROM users WHERE id = ?", i) // execute with pool + network + parse  
  if err != nil {                                              // production does not ignore errors  
   b.Fatal(err)                                             // fail fast to avoid sampling bad states  
  }  

  processRows(rows)                                            // actually scan rows (CPU + allocs)  
  rows.Close()                                                 // release result buffers to driver  
  returnDBConnection(db)                                       // put conn back (pool bookkeeping)  
  // this loop captures pool wait, query exec, scanning, and teardown → apples to prod apples  
 }  
}  

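// Variant: timer control to exclude *only* test-harness bookkeeping (not app work).  
// Caveat: StartTimer/StopTimer read memory stats on every call, so per-iteration use makes the run itself much slower.  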
func BenchmarkRealisticWithTimerControl(b *testing.B) {    // same semantics, clearer timing  
 b.ReportAllocs()                                       // keep alloc signal  
 for i := 0; i < b.N; i++ {                             // per-op measurement  
  b.StartTimer()                                     // start measuring application work  
  db := getDBConnection()                            // pool wait is part of reality  
  rows, err := db.Query("SELECT * FROM users WHERE id = ?", i) // do the work  
  if err != nil { b.Fatal(err) }                                // sanity  
  processRows(rows)                                            // scan results  
  rows.Close()                                                 // tidy rows  
  returnDBConnection(db)                                       // return to pool  
  b.StopTimer()                                                // stop before any test-only chores  
  // if you had per-iter test scaffolding (e.g., random seed gen), do it here outside the timer  
 }  
}  

// Optional: parallel load shows contention and pool behavior under pressure.  
// func BenchmarkRealisticParallel(b *testing.B) {  
//  b.ReportAllocs()  
//  b.RunParallel(func(pb *testing.PB) {  
//   for pb.Next() {  
//    db := getDBConnection()  
//    rows, err := db.Query("SELECT 1")  
//    if err != nil { b.Error(err); return } // b.Fatal must only be called from the benchmark's own goroutine  
//    processRows(rows)  
//    rows.Close()  
//    returnDBConnection(db)  
//   }  
//  })  
// }

In production, acquiring a connection, checking errors, and scanning rows happen on every request. Your benchmark should reflect that.

The New Way of Thinking

Look, here’s what I’ve learned after way too many failed optimizations: Start with production profiles, not hypothetical improvements. Use go tool pprof on your production data, find the actual bottlenecks (not the ones you think exist), and then create benchmarks that reproduce those exact conditions.

The companies crushing it with Go performance aren’t the ones with the fastest microbenchmarks. They’re the ones whose benchmarks predict production gains with 85%+ accuracy. Their optimizations don’t just look good in PRs — they actually improve user experience in ways you can measure.

Track correlation between your benchmarks and production:

package metrics // tiny pkg for bench↔prod tracking; keep it boring

import (
    "log" // warnings to logs
)

// BenchmarkTracker keeps bench + prod series and how well they agree.
// idea: every time we add a pair (bench, prod), we maybe recompute Pearson
// and stash the correlation; if it dips, we warn so folks don’t trust stale benches.
type BenchmarkTracker struct {
    name               string    // name of the benchmark/suite
    benchmarkResults   []float64 // historical bench times (e.g., ms)
    productionResults  []float64 // matching prod latencies (same units)
    correlationHistory []float64 // rolling Pearson r values
}

// AddResult appends one (bench, prod) pair and updates correlation if we have enough data.
// notes: keep series aligned, compute r when data is “mature enough”, and warn if predictiveness fades.
func (bt *BenchmarkTracker) AddResult(benchTime, prodLatency float64) {
    bt.benchmarkResults = append(bt.benchmarkResults, benchTime)     // push bench sample
    bt.productionResults = append(bt.productionResults, prodLatency) // push prod sample (same index)

    // sanity: if somehow lengths diverge (caller bug), bail quietly to avoid panics
    if len(bt.benchmarkResults) != len(bt.productionResults) { // alignment check
        return // don’t compute r on mismatched series
    }

    // only compute correlation when we have “enough” points to matter
    if len(bt.benchmarkResults) > 10 { // threshold: tune per noise level
        corr := calculateCorrelation(bt.benchmarkResults, bt.productionResults) // Pearson r in [-1,1]
        bt.correlationHistory = append(bt.correlationHistory, corr)             // stash latest r for trend plots

        // alert if benches stop predicting prod well (rule of thumb: r < 0.7)
        if corr < 0.7 { // under the “useful” line
            log.Printf("WARNING: benchmark %q correlation dropped to %.3f", bt.name, corr) // heads-up
        }
    }
}

// calculateCorrelation is assumed to exist elsewhere in your codebase, e.g., Pearson on two equal-length slices.
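The tracker assumes a `calculateCorrelation` function exists. If you don't have one, here is a minimal sketch of the Pearson coefficient over two equal-length slices. Returning 0 for inputs where r is undefined (mismatched lengths, fewer than two points, or a constant series) is my assumption, not something the tracker specifies; note that a 0 will trip the tracker's `corr < 0.7` warning, which seems like the safer failure mode.

```go
package main

import (
	"fmt"
	"math"
)

// calculateCorrelation returns the Pearson correlation coefficient of x
// and y. It returns 0 when r is undefined: mismatched lengths, fewer
// than two points, or zero variance in either series (an assumption of
// this sketch).
func calculateCorrelation(x, y []float64) float64 {
	n := len(x)
	if n != len(y) || n < 2 {
		return 0
	}

	// First pass: means of both series.
	var sumX, sumY float64
	for i := range x {
		sumX += x[i]
		sumY += y[i]
	}
	meanX, meanY := sumX/float64(n), sumY/float64(n)

	// Second pass: covariance and variances around the means.
	var cov, varX, varY float64
	for i := range x {
		dx, dy := x[i]-meanX, y[i]-meanY
		cov += dx * dy
		varX += dx * dx
		varY += dy * dy
	}
	if varX == 0 || varY == 0 {
		return 0
	}
	return cov / math.Sqrt(varX*varY)
}

func main() {
	bench := []float64{150, 160, 170, 180} // hypothetical bench times (ns)
	prod := []float64{12, 13, 15, 16}      // hypothetical prod latencies (ms)
	fmt.Printf("r = %.3f\n", calculateCorrelation(bench, prod))
}
```

The two-pass form is a little slower than the single-pass sum-of-products formula but loses less precision when the values are large and close together, which latency series often are.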

When to Trust Your Benchmarks

Trust them when:

Correlation > 0.8 with historical production improvements
You’re simulating realistic load patterns
Multiple runs show consistent results
You’re testing hot paths from production profiles
Input data matches production distributions

Be skeptical when:

Correlation < 0.5
Perfect, static inputs
Only microbenchmarks, no integration tests
Results seem too good (they probably are)

Ignore them when:

Correlation < 0.3 (actively misleading)
Synthetic workloads that don’t match reality
You’re optimizing for benchmark scores, not users
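If you want those correlation cutoffs enforced rather than remembered, they fit in a few lines. This is a sketch with hypothetical names (`Verdict`, `classify`); the lists above don't say what to do between 0.5 and 0.8, so the "inconclusive" bucket for that range is my assumption.

```go
package main

import "fmt"

// Verdict is a hypothetical label for how far to trust a benchmark.
type Verdict string

const (
	Trust        Verdict = "trust"
	Inconclusive Verdict = "inconclusive" // 0.5 to 0.8: not covered by the lists, assumed here
	Skeptical    Verdict = "be skeptical"
	Ignore       Verdict = "ignore"
)

// classify maps a bench-to-prod correlation onto the thresholds from the
// lists above: > 0.8 trust, < 0.5 skeptical, < 0.3 ignore.
func classify(r float64) Verdict {
	switch {
	case r > 0.8:
		return Trust
	case r < 0.3:
		return Ignore
	case r < 0.5:
		return Skeptical
	default:
		return Inconclusive
	}
}

func main() {
	for _, r := range []float64{0.92, 0.65, 0.41, 0.12} {
		fmt.Printf("r=%.2f: %s\n", r, classify(r))
	}
}
```

Correlation is only one of the criteria in each list, so treat the verdict as a gate on top of the other checks, not a replacement for them.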

Your benchmarks should be a conversation with production, not a fantasy. Every benchmark should answer: “If this improves, will users actually benefit?”

Stop optimizing what doesn’t matter. Start measuring what does. Your production metrics will prove it.
