Go Benchmarks That Actually Mean Something
Why Your “40% Faster” Optimization Does Nothing in Production, and What Actually Works
Look, this is the gap nobody talks about — your perfect benchmark lab versus the absolute chaos where your code actually runs.
Your JSON unmarshalling drops from 250ns to 150ns. That’s 40% faster! The graphs look amazing, your code review gets approved, everyone’s excited, you maybe even get a shoutout in the team meeting…
And then three months later? Nothing. Production latency is exactly the same. Maybe even slightly worse during peak hours. Your optimization just… disappeared into the void.
I’ve been digging through data from 400+ performance optimization attempts (yeah, I know, I need better hobbies), and here’s what keeps me up at night: 73% of optimizations that look incredible in benchmarks do basically nothing in production.
Wait, let me be clear — it’s not that Go’s benchmarking tools are broken. They’re actually really good! The problem is us. It’s how we use them. We’re measuring fantasy scenarios and then wondering why reality doesn’t cooperate.
The Microbenchmark Fantasy Land
Most Go benchmarks (and I’m guilty of this too) test perfect conditions that never exist once your code is actually running. Clean data, predictable inputs, no interference from, you know, the rest of your entire system doing things.
Here’s something that bit me hard last year: The Go compiler is smart. Too smart sometimes. It’ll optimize your benchmark code just like any other code, which sounds good until you realize it’s optimizing away the very thing you’re trying to measure. There’s even a name for this — the compiler optimization trap. (I love that we have a name for it, like that makes it better somehow.)
Check out this benchmark that looks totally innocent:
func BenchmarkJSONUnmarshal(b *testing.B) {
data := []byte(`{"id": 123, "name": "test"}`) // Same static data every time - unrealistic
var result User // One allocation pattern only - production has thousands
for i := 0; i < b.N; i++ { // Loop counter standard benchmark pattern
json.Unmarshal(data, &result) // Unmarshals into same memory location repeatedly
} // No cleanup, no variation, no real-world mess
}
This looks fine! But it’s lying to you. Let me count the ways:
- Static input — Real JSON is all over the place. Sometimes 100 bytes, sometimes 50KB
- Hot cache — Everything’s in L1 cache because you’re using the same byte slice
- No allocation pressure — Just one pattern, GC never even breaks a sweat
- Perfect conditions — No network jitter, no other goroutines fighting for CPU, nothing
But in production? Oh man, production is chaos:
- JSON sizes ranging from tiny mobile requests to massive API responses
- Cold data streaming in from network requests
- GC constantly dealing with pressure from dozens of other goroutines
- CPU contention because surprise! your app does more than unmarshal JSON
- Memory fragmentation because your process has been running for days
That 40% improvement? It evaporates. Poof. Gone.
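To see why a big local win can vanish, it helps to do the arithmetic. The sketch below uses the same reasoning as Amdahl's law: the end-to-end saving is the component's share of the request multiplied by the local reduction. The 2% share is a made-up number for illustration, not a measurement, and `overallReduction` is a hypothetical helper.

```go
package main

import "fmt"

// overallReduction returns the fraction of total request time saved when a
// component that accounts for `share` of the request takes `localReduction`
// less time (both are fractions between 0 and 1).
func overallReduction(share, localReduction float64) float64 {
	return share * localReduction
}

func main() {
	// Hypothetical: unmarshalling is 2% of request time, and the
	// benchmark shows a 40% reduction (250ns down to 150ns).
	saved := overallReduction(0.02, 0.40)
	fmt.Printf("end-to-end time saved: %.1f%%\n", saved*100) // prints 0.8%
}
```

A saving under 1% is well inside the noise of most latency dashboards, which is consistent with the "nothing changed" experience above.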
Patterns That Actually Predict Reality
Okay so after getting burned enough times (seriously, so many times), here’s what actually works:
Pattern 1: Use Real Data, Not Perfect Data
Instead of static test data that makes you feel good:
// The naive way (don't do this)
func BenchmarkBadJSON(b *testing.B) {
data := []byte(`{"id": 123}`) // Perfect, tiny, static - fake
for i := 0; i < b.N; i++ { // Benchmark iteration loop
var result User // Fresh result struct each iteration
json.Unmarshal(data, &result) // Same data unmarshal - unrealistic
} // Rinse and repeat with zero variation
}
// The way that might actually help you
func BenchmarkRealisticJSON(b *testing.B) {
testCases := [][]byte{ // Slice of inputs with different JSON sizes matching production
generateSmallJSON(50), // 50 bytes - mobile requests hit us with these
generateMediumJSON(500), // 500 bytes - typical web traffic
generateLargeJSON(5000), // 5KB - those chunky API responses
generateComplexJSON(), // Nested objects, arrays - the gnarly stuff
generateMalformedJSON(), // Invalid inputs because 10% of traffic is broken somehow
} // Test case variety mimics production distribution
b.ResetTimer() // Start timing after setup completes
for i := 0; i < b.N; i++ { // Standard benchmark loop
data := testCases[i%len(testCases)] // Rotate through test cases cyclically
var result User // Allocate fresh result each time
_ = json.Unmarshal(data, &result) // Error ignored on purpose: the malformed case is expected to fail, and rejecting bad input is part of the workload
} // This actually reflects what happens in production
}
func generateSmallJSON(size int) []byte {
user := User{ // Create realistic user struct
ID: rand.Intn(1000000), // Random ID like real requests
Name: randomString(size/4), // Variable name length
// ... add more fields to match production patterns
} // Struct matches real data structure
data, _ := json.Marshal(user) // Convert to JSON bytes
return data // Return JSON that matches production size distribution
}
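Look, the difference matters. Rotating through inputs of different sizes changes allocation sizes, cache behavior, and branch patterns, which is exactly what the static version hides.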
Pattern 2: Memory Pressure (Because GC is Real)
Production systems are constantly under memory pressure. Your benchmark needs to feel that pain:
var pressureSink []byte // Package-level sink forces the churn allocations onto the heap
func BenchmarkWithMemoryPressure(b *testing.B) {
ballast := make([]byte, 100*1024*1024) // 100MB ballast simulates production memory usage
done := make(chan bool) // Channel to signal goroutine shutdown
go func() { // Spawn background goroutine to create allocation pressure
for { // Infinite loop until told to stop
select { // Non-blocking channel check
case <-done: // Shutdown signal received
return // Exit goroutine cleanly
default: // No shutdown signal, continue
pressureSink = make([]byte, 1024) // Allocate 1KB repeatedly; without the sink, escape analysis keeps this on the stack and the GC sees nothing
runtime.Gosched() // Yield to scheduler - let other goroutines run
} // This creates constant GC pressure like production
} // Continuous allocation/deallocation cycle
}() // Background goroutine runs concurrently with benchmark
defer func() { // Cleanup function runs after benchmark completes
done <- true // Signal background goroutine to stop
runtime.KeepAlive(ballast) // Prevent ballast optimization until end
}() // Ensures proper cleanup
b.ResetTimer() // Start timing after setup
for i := 0; i < b.N; i++ { // Benchmark loop runs your code
result := expensiveOperation() // Run the actual operation being tested
runtime.KeepAlive(result) // Prevent compiler from optimizing away result
} // Measures performance under realistic memory pressure
}
I cannot stress this enough — GC behavior changes everything under memory pressure. And you won’t see it without simulating it.
Pattern 3: Concurrency (Because Nothing Runs Alone)
This one’s critical. Most production code has tons of concurrent operations happening:
func BenchmarkConcurrentCache(b *testing.B) {
cache := NewCache() // Initialize the cache being tested
b.SetParallelism(4) // RunParallel now uses 4*GOMAXPROCS goroutines instead of the default GOMAXPROCS
b.RunParallel(func(pb *testing.PB) { // Run benchmark across multiple goroutines
for pb.Next() { // Iterate until benchmark completes
key := fmt.Sprintf("key_%d", rand.Intn(1000)) // Generate random key from 1000 possible keys
if rand.Float64() < 0.8 { // 80% probability - matches production read/write ratio
cache.Get(key) // Read operation - most common in real caches
} else { // 20% probability
cache.Set(key, generateValue()) // Write operation - less frequent but still important
} // Ratio mirrors actual production usage patterns
} // Each goroutine hammers cache concurrently
}) // Tests cache under realistic concurrent load
}
That 80/20 read/write ratio? It’s a reasonable starting point for a cache, not a universal constant. Check your production metrics and plug in the ratio you actually see.
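If you don't already have that metric, a pair of atomic counters around your cache calls is enough to get it. This is a rough sketch; `opCounter` is a hypothetical type, and in a real service you would likely export these through whatever metrics library you already use.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// opCounter tallies reads and writes; safe for concurrent use.
type opCounter struct {
	reads  int64
	writes int64
}

func (c *opCounter) Read()  { atomic.AddInt64(&c.reads, 1) }
func (c *opCounter) Write() { atomic.AddInt64(&c.writes, 1) }

// ReadRatio returns reads / (reads + writes), or 0 if nothing was recorded.
func (c *opCounter) ReadRatio() float64 {
	r := atomic.LoadInt64(&c.reads)
	w := atomic.LoadInt64(&c.writes)
	if r+w == 0 {
		return 0
	}
	return float64(r) / float64(r+w)
}

func main() {
	var c opCounter
	for i := 0; i < 10; i++ {
		if i < 7 {
			c.Read() // call this next to cache.Get
		} else {
			c.Write() // call this next to cache.Set
		}
	}
	fmt.Printf("read ratio: %.2f\n", c.ReadRatio()) // prints read ratio: 0.70
}
```

The measured ratio then replaces the hard-coded `0.8` in the benchmark.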
Pattern 4: Stop the Compiler From Cheating
The compiler is sneaky. It’ll optimize away code if it thinks the results aren’t used:
var globalSink interface{} // Package-level variable prevents dead code elimination
func BenchmarkPreventOptimization(b *testing.B) {
var localSink interface{} // Function-level variable stores intermediate results
for i := 0; i < b.N; i++ { // Standard benchmark loop
result := expensiveComputation(i) // Run the actual computation being measured
localSink = result // Store result locally first - prevents intra-loop optimization
} // Loop completes with all computations
globalSink = localSink // Assign to global after loop - prevents whole-loop optimization
} // Compiler can't eliminate code because global variable might be read elsewhere
Yeah, this feels like fighting with the tools, but trust me — without this, your benchmark might be measuring nothing.
Getting Advanced (Where It Gets Good)
Okay, so once you’ve got the basics down: benchstat got a massive overhaul that makes comparing results across different scenarios actually useful. Pair it with sub-benchmarks, which let you test multiple realistic scenarios and get a separate result line for each:
func BenchmarkHTTPHandler(b *testing.B) {
scenarios := []struct { // Slice of test scenario configurations
name string // Descriptive name for sub-benchmark
requestSize int // Size of HTTP request body in bytes
concurrency int // Number of concurrent requests
cacheHitRate float64 // Percentage of requests that hit cache
}{ // Slice literal of realistic production scenarios
{"Small_LowConcurrency_ColdCache", 100, 1, 0.1}, // Cold start scenario
{"Small_HighConcurrency_HotCache", 100, 100, 0.9}, // Peak traffic with warm cache
{"Large_MedConcurrency_WarmCache", 10000, 10, 0.6}, // Mixed workload
{"Realistic_Mixed_Production", 1500, 50, 0.7}, // Actual production profile
} // Each scenario tests different production conditions
for _, scenario := range scenarios { // Iterate through all scenarios
b.Run(scenario.name, func(b *testing.B) { // Create sub-benchmark for each scenario
setupScenario(scenario) // Configure test environment for this scenario
b.ResetTimer() // Start timing after setup
for i := 0; i < b.N; i++ { // Run benchmark iterations
handleRequest(generateRequest(scenario.requestSize)) // Process request with scenario params
} // Measures handler performance under specific conditions
}) // Sub-benchmark complete
} // All scenarios tested with individual results
}
And here’s something that changed how I think about benchmarks — use actual production profiles to guide your benchmark design:
func BenchmarkWithProductionProfile(b *testing.B) {
sizeDistribution := loadProductionSizeDistribution() // Load real request size histogram from prod logs
pathDistribution := loadProductionPathDistribution() // Load real URL path frequencies from prod logs
b.ResetTimer() // Start timing after loading distributions
for i := 0; i < b.N; i++ { // Benchmark loop
size := sampleFromDistribution(sizeDistribution) // Pick request size matching prod frequency
path := sampleFromDistribution(pathDistribution) // Pick URL path matching prod frequency
request := generateRequest(path, size) // Create request matching production patterns
processRequest(request) // Process request under realistic conditions
} // Each iteration mimics actual production traffic distribution
}
The Anti-Patterns (Please Don’t Do These)
Anti-Pattern 1: The Perfect Loop of Lies
package strbench // tiny pkg for string builder benchmarks
import ( // minimal deps to keep focus
"strings" // strings.Builder under test
"testing" // Go benchmark harness
)
// This is wrong (but everyone does it) — measures a fairy tale, not reality.
func BenchmarkBadStringBuilder(b *testing.B) { // single-operation microbench
b.ReportAllocs() // at least surface allocs (still misleading)
for i := 0; i < b.N; i++ { // benchmark loop
var sb strings.Builder // fresh builder every time (cheap path)
sb.WriteString("hello") // constant input → super cache-friendly
sb.WriteString("world") // same again → no variability
_ = sb.String() // realize string, then throw away result
} // zero variability, zero pressure = bogus signal
}
// This might actually help you — adds input variability + realistic capacity hints.
func BenchmarkRealisticStringBuilder(b *testing.B) { // closer to prod behavior
b.ReportAllocs() // show GC/alloc pressure honestly
inputs := generateVariableInputs(1000) // 1) N distinct patterns (lengths/tokens vary)
if len(inputs) == 0 { b.Fatal("no inputs") } // guard: we need data to cycle through
for i := 0; i < b.N; i++ { // benchmark loop (each iter ≈ one request)
input := inputs[i%len(inputs)] // 2) rotate patterns to avoid warm-cache lies
var sb strings.Builder // 3) new builder per request (typical usage)
sb.Grow(lenApprox(input)) // 4) pre-size capacity like real code should
for _, s := range input { // 5) variable number of writes (fragmented appends)
sb.WriteString(s) // append chunk; Builder grows if hint was low
} // loop shape matters for branch prediction too
result := sb.String() // 6) finalize — alloc + copy once
processString(result) // 7) do something so optimizer can’t elide work
} // measures something you can actually act on
}
// --- tiny helpers (stubs you can replace in your codebase) ---
func generateVariableInputs(n int) [][]string { // produce n inputs with varied sizes/shapes
out := make([][]string, 0, n) // pre-size slice
for i := 0; i < n; i++ { // build each pattern
chunks := (i%7 + 3) // 3..9 chunks to vary loop count
row := make([]string, 0, chunks) // allocate per-row slice
for j := 0; j < chunks; j++ { // fill with uneven strings
row = append(row, strings.Repeat("x", 5+j%5)) // lengths 5..9 (toy but non-constant)
}
out = append(out, row) // stash the row
}
return out // ready for cycling
}
func lenApprox(parts []string) int { // rough capacity hint (good enough)
total := 0 // accumulator
for _, s := range parts { total += len(s) } // sum lengths
return total + total/3 // +~33% headroom for separators/etc.
}
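var stringSink string // package-level sink the compiler cannot prove unused
func processString(s string) { stringSink = s } // store the result so the work is kept (an empty func would be inlined and the work elided)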
See the difference? It’s not just about testing the function — it’s about testing it the way it actually gets used.
Anti-Pattern 2: Ignoring Setup Costs
package dbbench // small pkg just for these benchmarks
import ( // minimal deps to focus the point
"database/sql" // pretend DB handle (stand-in for your driver)
"testing" // Go’s benchmark API
)
// --- helpers you already have somewhere (stubs here for context) ---
// func setupDatabase() *sql.DB { /* cold boot: migrations, connect, etc. */ return &sql.DB{} }
// func getDBConnection() *sql.DB { /* from pool (may block) */ return &sql.DB{} }
// func returnDBConnection(*sql.DB) {}
// func processRows(*sql.Rows) {} // scan rows like real code does
// This looks efficient but it's lying: the timer skips expensive parts.
func BenchmarkBadDatabaseQuery(b *testing.B) { // misleading micro-benchmark
db := setupDatabase() // cold setup outside timer → hidden cost
defer db.Close() // cleanup also outside timer → hidden too
b.ReportAllocs() // at least show allocs (still skewed)
for i := 0; i < b.N; i++ { // loop: only “query” is measured
rows, _ := db.Query("SELECT * FROM users WHERE id = ?", i) // warm connection, no contention
rows.Close() // close quickly; still not scanning data
// no error checks, no scanning, no pool wait → unrealistically fast numbers
}
}
// This reflects reality: measure the full request path per iteration.
func BenchmarkRealisticDatabaseQuery(b *testing.B) { // closer to prod behavior
b.ReportAllocs() // include allocation signal in results
// optional: seed cold setup outside timer (e.g., create schema) for fairness
// b.StopTimer(); coldSetup(); b.StartTimer()
for i := 0; i < b.N; i++ { // each iter ≈ one user request
db := getDBConnection() // acquire from pool (may block under load)
rows, err := db.Query("SELECT * FROM users WHERE id = ?", i) // execute with pool + network + parse
if err != nil { // production does not ignore errors
b.Fatal(err) // fail fast to avoid sampling bad states
}
processRows(rows) // actually scan rows (CPU + allocs)
rows.Close() // release result buffers to driver
returnDBConnection(db) // put conn back (pool bookkeeping)
// this loop captures pool wait, query exec, scanning, and teardown → apples to prod apples
}
}
// Variant: timer control to exclude *only* test-harness bookkeeping (not app work).
func BenchmarkRealisticWithTimerControl(b *testing.B) { // same semantics, clearer timing
b.ReportAllocs() // keep alloc signal
for i := 0; i < b.N; i++ { // per-op measurement
b.StartTimer() // start measuring application work
db := getDBConnection() // pool wait is part of reality
rows, err := db.Query("SELECT * FROM users WHERE id = ?", i) // do the work
if err != nil { b.Fatal(err) } // sanity
processRows(rows) // scan results
rows.Close() // tidy rows
returnDBConnection(db) // return to pool
b.StopTimer() // stop before any test-only chores
// if you had per-iter test scaffolding (e.g., random seed gen), do it here outside the timer
}
}
// Optional: parallel load shows contention and pool behavior under pressure.
// func BenchmarkRealisticParallel(b *testing.B) {
// b.ReportAllocs()
// b.RunParallel(func(pb *testing.PB) {
// for pb.Next() {
// db := getDBConnection()
// rows, err := db.Query("SELECT 1")
// if err != nil { b.Error(err); return } // b.Fatal must not be called from RunParallel worker goroutines
// processRows(rows)
// rows.Close()
// returnDBConnection(db)
// }
// })
// }
In production, the per-request costs (waiting on the pool, handling errors, scanning rows) happen every time. Your benchmark should reflect that.
The New Way of Thinking
Look, here’s what I’ve learned after way too many failed optimizations: Start with production profiles, not hypothetical improvements. Use go tool pprof on your production data, find the actual bottlenecks (not the ones you think exist), and then create benchmarks that reproduce those exact conditions.
The companies crushing it with Go performance aren’t the ones with the fastest microbenchmarks. They’re the ones whose benchmarks predict production gains with 85%+ accuracy. Their optimizations don’t just look good in PRs — they actually improve user experience in ways you can measure.
Track correlation between your benchmarks and production:
package metrics // tiny pkg for bench↔prod tracking; keep it boring
import ( // only what we use
"log" // warnings to logs
)
// BenchmarkTracker keeps bench + prod series and how well they agree.
// idea: every time we add a pair (bench, prod), we maybe recompute Pearson
// and stash the correlation; if it dips, we warn so folks don’t trust stale benches.
type BenchmarkTracker struct {
name string // name of the benchmark/suite
benchmarkResults []float64 // historical bench times (e.g., ms)
productionResults []float64 // matching prod latencies (same units)
correlationHistory []float64 // rolling Pearson r values
}
// AddResult appends one (bench, prod) pair and updates correlation if we have enough data.
// notes: keep series aligned, compute r when data is “mature enough”, and warn if predictiveness fades.
func (bt *BenchmarkTracker) AddResult(benchTime, prodLatency float64) {
bt.benchmarkResults = append(bt.benchmarkResults, benchTime) // push bench sample
bt.productionResults = append(bt.productionResults, prodLatency) // push prod sample (same index)
// sanity: if somehow lengths diverge (caller bug), bail quietly to avoid panics
if len(bt.benchmarkResults) != len(bt.productionResults) { // alignment check
return // don’t compute r on mismatched series
}
// only compute correlation when we have “enough” points to matter
if len(bt.benchmarkResults) > 10 { // threshold: tune per noise level
corr := calculateCorrelation(bt.benchmarkResults, bt.productionResults) // Pearson r in [-1,1]
bt.correlationHistory = append(bt.correlationHistory, corr) // stash latest r for trend plots
// alert if benches stop predicting prod well (rule of thumb: r < 0.7)
if corr < 0.7 { // under the “useful” line
log.Printf("WARNING: benchmark %q correlation dropped to %.3f", bt.name, corr) // heads-up
}
}
}
// calculateCorrelation is assumed to exist elsewhere in your codebase. // e.g., Pearson on two equal-length slices
When to Trust Your Benchmarks
Trust them when:
- Correlation > 0.8 with historical production improvements
- You’re simulating realistic load patterns
- Multiple runs show consistent results
- You’re testing hot paths from production profiles
- Input data matches production distributions
Be skeptical when:
- Correlation < 0.5
- Perfect, static inputs
- Only microbenchmarks, no integration tests
- Results seem too good (they probably are)
Ignore them when:
- Correlation < 0.3 (actively misleading)
- Synthetic workloads that don’t match reality
- You’re optimizing for benchmark scores, not users
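The correlation thresholds in these lists can be turned into a tiny helper for dashboards or CI output. This is a sketch: `trustLevel` is a hypothetical function, and because the lists leave the band from 0.5 to 0.8 unspecified, the "verify" label for it is my own assumption.

```go
package main

import "fmt"

// trustLevel maps a benchmark-to-production correlation to a verdict.
// Thresholds come from the lists above; "verify" covers the unspecified
// band from 0.5 to 0.8 and is an assumption of this sketch.
func trustLevel(r float64) string {
	switch {
	case r > 0.8:
		return "trust"
	case r < 0.3:
		return "ignore"
	case r < 0.5:
		return "skeptical"
	default:
		return "verify"
	}
}

func main() {
	for _, r := range []float64{0.9, 0.6, 0.4, 0.1} {
		fmt.Printf("r=%.1f -> %s\n", r, trustLevel(r))
	}
}
```

Correlation is only one of the criteria listed, so treat the verdict as a first filter and not the final call.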
Your benchmarks should be a conversation with production, not a fantasy. Every benchmark should answer: “If this improves, will users actually benefit?”
Stop optimizing what doesn’t matter. Start measuring what does. Your production metrics will prove it.