- Book: The Complete Guide to Go Programming
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You open a CPU profile from a Go service that has been running fine for a year. The team finally got around to looking because the autoscaler is adding pods faster than traffic is growing. The flame graph loads. You squint at the top of it and the picture is the same picture you have seen on three other Go services this quarter.
The numbers below are a representative pattern. The percentages come from a synthetic profile assembled to illustrate the shape, not a captured trace from one named production service. The point is the cadence (mallocgc and scanobject dominating) — not the exact figures.
flat flat% sum% cum cum%
22.3s 28.1% 28.1% 28.6s 36.0% runtime.mallocgc
9.1s 11.5% 39.6% 9.1s 11.5% runtime.scanobject
4.8s 6.0% 45.6% 4.8s 6.0% runtime.memmove
3.2s 4.0% 49.6% 3.2s 4.0% runtime.heapBitsSetType
Almost half of the sampled CPU time is the runtime allocating, scanning, and copying memory. None of it is your business logic. The product code is somewhere underneath the runtime calls, distributed across a thousand small samples that never break 2%.
This shape comes from three causes most of the time. They look similar in pprof and need different fixes. Here is how to tell them apart and what to do for each.
What mallocgc and scanobject mean
runtime.mallocgc is the function the Go runtime calls every time you allocate something on the heap. The source is in src/runtime/malloc.go, and it handles size classes, the per-P cache, and the GC accounting. If it is at the top of your profile, something in a hot loop is allocating many small things instead of reusing them.
runtime.scanobject is the GC's marker. It walks live heap objects looking for pointers. Its share of CPU grows roughly with the number and size of live objects that contain pointers. If it is high, you are either keeping more live, pointer-rich state than the GC can comfortably scan in the background, or allocating fast enough that the GC drafts user goroutines into mark assists instead of leaving the work to the background mark workers.
The two travel together: allocate more, scan more.
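To capture this view from your own service, the standard library's net/http/pprof endpoint is the cheapest path. A minimal sketch; the port is an arbitrary choice, and the real service would start where the comment sits:

package main

import (
    "net/http"
    _ "net/http/pprof" // registers /debug/pprof/* on http.DefaultServeMux
)

func main() {
    // Serve the profiling endpoints on a loopback-only port,
    // separate from the service's real listener.
    go func() {
        _ = http.ListenAndServe("localhost:6060", nil)
    }()
    // ... start the real service here ...
    select {}
}

go tool pprof -top 'http://localhost:6060/debug/pprof/profile?seconds=30' then prints a table like the one above; passing -http=:8080 instead of -top opens the flame graph in a browser.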
Culprit 1: per-request struct creation in hot handlers
The most common version of this is innocent-looking handler code that builds a fresh struct on every call. The struct is small. It does not look expensive. It runs ten thousand times per second.
package handler
import (
"encoding/json"
"net/http"
)
type RequestCtx struct {
UserID string
TenantID string
Headers map[string]string
Trace []string
Tags map[string]string
}
func Handle(w http.ResponseWriter, r *http.Request) {
ctx := &RequestCtx{
UserID: r.Header.Get("X-User"),
TenantID: r.Header.Get("X-Tenant"),
Headers: make(map[string]string, 8),
Trace: make([]string, 0, 16),
Tags: make(map[string]string, 4),
}
for k, v := range r.Header {
if len(v) > 0 {
ctx.Headers[k] = v[0]
}
}
process(ctx)
json.NewEncoder(w).Encode(ctx)
}
Four heap allocations per request (the struct, two maps, and one slice), plus whatever JSON encoding adds. At 10k req/s that is at least 40k allocs per second from this handler alone, and mallocgc lights up.
The fix depends on whether the struct outlives the request. RequestCtx here does not. Handle returns and nothing references it. So pooling the struct and clearing it on Get is straightforward:
var ctxPool = sync.Pool{
New: func() any {
return &RequestCtx{
Headers: make(map[string]string, 8),
Trace: make([]string, 0, 16),
Tags: make(map[string]string, 4),
}
},
}
func HandlePooled(w http.ResponseWriter, r *http.Request) {
ctx := ctxPool.Get().(*RequestCtx)
defer func() {
for k := range ctx.Headers {
delete(ctx.Headers, k)
}
for k := range ctx.Tags {
delete(ctx.Tags, k)
}
ctx.Trace = ctx.Trace[:0]
ctx.UserID = ""
ctx.TenantID = ""
ctxPool.Put(ctx)
}()
ctx.UserID = r.Header.Get("X-User")
ctx.TenantID = r.Header.Get("X-Tenant")
for k, v := range r.Header {
if len(v) > 0 {
ctx.Headers[k] = v[0]
}
}
process(ctx)
json.NewEncoder(w).Encode(ctx)
}
Two notes on this. First, you have to clear the maps and slice. Putting a populated struct back means the next request sees stale data, and in a multi-tenant handler that is a data leak. Second, treat sync.Pool as a hint: the runtime can drop pooled objects on any GC cycle (see src/sync/pool.go).
Often a simpler fix works: pre-size the maps and slices, or keep them on the stack. Reach for sync.Pool when those don't.
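The stack option looks like this when the handler only needs scalar fields. smallCtx, handleSmall, and processSmall are hypothetical names for this sketch; with no maps and nothing retaining a pointer, escape analysis keeps the value in the frame:

type smallCtx struct {
    UserID   string
    TenantID string
}

func processSmall(c smallCtx) {
    // business logic; takes the value, so nothing forces it to escape
}

func handleSmall(w http.ResponseWriter, r *http.Request) {
    // A value, not &smallCtx{...}: -gcflags='-m' reports no
    // "moved to heap" for c.
    c := smallCtx{
        UserID:   r.Header.Get("X-User"),
        TenantID: r.Header.Get("X-Tenant"),
    }
    processSmall(c)
}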
Run the benchmark before and after with -benchmem. On a request-pool change you can see the alloc count drop several-fold and ns/op shave off single-digit percentages of the handler's runtime. The exact win depends on handler shape and Go version, so benchmark your own and write the real number in the report.
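A sketch of that benchmark, assuming Handle and HandlePooled from above live in the same package and process is cheap enough not to drown the signal:

package handler

import (
    "net/http"
    "net/http/httptest"
    "testing"
)

func benchHandler(b *testing.B, h http.HandlerFunc) {
    req := httptest.NewRequest(http.MethodGet, "/", nil)
    req.Header.Set("X-User", "u-123")
    req.Header.Set("X-Tenant", "t-456")
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        // The recorder allocates too, but it costs the same in both
        // benchmarks, so the delta is still the handler's.
        h(httptest.NewRecorder(), req)
    }
}

func BenchmarkHandle(b *testing.B)       { benchHandler(b, Handle) }
func BenchmarkHandlePooled(b *testing.B) { benchHandler(b, HandlePooled) }

go test -bench=. -benchmem prints allocs/op for the two variants side by side.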
Culprit 2: string concatenation in tight loops
The Go compiler is honest about what + does for strings. Every concatenation that cannot be folded at compile time allocates a new backing array and copies. Inside a loop, this turns into quadratic memory traffic.
func renderQuadratic(rows []Row) string {
s := ""
for _, r := range rows {
s += r.Key + "=" + r.Value + "&"
}
return s
}
For 200 rows, this allocates 200 intermediate strings, each one a copy of the previous plus the new piece. mallocgc reflects the alloc count; runtime.memmove reflects the copying.
strings.Builder (or a pre-sized byte slice) fixes both. The builder grows its underlying slice with the standard append doubling strategy, so the total copy cost is O(n) instead of O(n²):
func renderBuilder(rows []Row) string {
var b strings.Builder
b.Grow(len(rows) * 32)
for _, r := range rows {
b.WriteString(r.Key)
b.WriteByte('=')
b.WriteString(r.Value)
b.WriteByte('&')
}
return b.String()
}
Grow is the part most people skip. Without it, the builder still allocates several times while its buffer doubles. With a reasonable upfront estimate, it allocates once. String() on strings.Builder is zero-copy: it returns the underlying bytes as a string without another allocation (the type has existed since Go 1.10).
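The pre-sized byte slice alternative mentioned above has nearly the same shape. renderBytes is a hypothetical name for this sketch; the final string(buf) conversion is the one copy it pays:

func renderBytes(rows []Row) string {
    buf := make([]byte, 0, len(rows)*32) // one allocation, assuming ~32 bytes per row
    for _, r := range rows {
        buf = append(buf, r.Key...)
        buf = append(buf, '=')
        buf = append(buf, r.Value...)
        buf = append(buf, '&')
    }
    return string(buf) // copies once so the result is an immutable string
}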
Benchmark to confirm the win on your input distribution; quadratic concatenation degrades fast as row count grows, and where the inflection point lands depends on your data.
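A sketch of that benchmark, assuming the Row type and the two render functions above:

func BenchmarkRender(b *testing.B) {
    rows := make([]Row, 200)
    for i := range rows {
        rows[i] = Row{Key: "param", Value: "value"}
    }
    b.Run("quadratic", func(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            _ = renderQuadratic(rows)
        }
    })
    b.Run("builder", func(b *testing.B) {
        b.ReportAllocs()
        for i := 0; i < b.N; i++ {
            _ = renderBuilder(rows)
        }
    })
}

Rerun with different lengths for rows to see where the curves cross for your data.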
Culprit 3: escape-analysis surprises
This one is the meanest because the code looks fine. Locals that should live on the stack get pushed to the heap by the escape analyzer and you find out only when you compile with -gcflags='-m'.
The two patterns that cause it most often:
Interface boxing. Passing a concrete value to a function that takes an interface escapes the value to the heap, because the interface header carries a pointer:
func log(v any) {
fmt.Fprintln(os.Stderr, v)
}
func process() {
n := 42
log(n)
}
Save the file as escape.go and build with go build -gcflags='-m' ./escape.go. The compiler prints (lines abbreviated):
./escape.go:7:6: can inline log
./escape.go:11:6: can inline process
./escape.go:8:17: v escapes to heap
./escape.go:13:6: moved to heap: n
n is an int. It should sit on the stack. Calling log(n) boxes it into an any, the box needs an address, and the address has to be heap-stable. So a humble int becomes a mallocgc call (the runtime keeps a static table for byte-sized values 0-255, but anything larger allocates). Multiply that by every log line on the hot path and the per-call boxing dominates.
The fix is either typed loggers (no any parameter), structured logging libraries that special-case scalar types (log/slog does this), or skipping the log call entirely on the hot path.
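With log/slog specifically, the cheap path is LogAttrs, which takes typed slog.Attr values instead of ...any. recordProcessed is a hypothetical wrapper for this sketch:

import (
    "context"
    "log/slog"
)

func recordProcessed(ctx context.Context, n int) {
    // slog.Int stores the int in a typed field of slog.Attr,
    // so n is never boxed into an interface on the way in.
    slog.Default().LogAttrs(ctx, slog.LevelInfo, "processed", slog.Int("n", n))
}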
Closures capturing locals. Returning or passing a closure that captures a local forces the local to the heap so the closure can outlive the frame:
func makeCounter() func() int {
count := 0
return func() int {
count++
return count
}
}
count lands on the heap. That is unavoidable here — the closure does outlive the frame — but the issue shows up in places where the closure does not actually escape, and the analyzer cannot prove it.
A loop that creates a closure per iteration and passes it to a goroutine is the classic offender:
for _, job := range jobs {
job := job
go func() {
process(job)
}()
}
On Go 1.22+ the per-iteration scoping makes the job := job line redundant, but the closure-allocation point still stands. Every iteration allocates a fresh closure environment. If jobs is large and the work is short, the goroutine startup and the closure allocation will both show up in pprof. A worker pool with a channel of jobs avoids both: one goroutine per worker, one allocation for the channel, no closures.
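A sketch of that pool, assuming a Job type and the process function from the loop above; the only closures are the per-worker ones, allocated once each rather than once per job:

func runPool(jobs []Job, workers int) {
    ch := make(chan Job) // one allocation for the channel
    var wg sync.WaitGroup
    for i := 0; i < workers; i++ {
        wg.Add(1)
        go func() { // one closure per worker, not per job
            defer wg.Done()
            for job := range ch {
                process(job)
            }
        }()
    }
    for _, job := range jobs {
        ch <- job
    }
    close(ch)
    wg.Wait()
}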
-gcflags='-m' tells you what the analyzer decided. Read it once per service. It is easy to ship a Go service for years without ever reading the output and miss escape problems the whole time.
How to read the profile in order
When mallocgc is at the top, work top-down:
- Switch the profile view to alloc_objects, not just CPU. go tool pprof -alloc_objects mem.prof shows the call sites that produce the most allocations regardless of size; the sketch after this list shows one way to produce mem.prof. The top three are usually where the wins are.
- Look at the list view for the hottest function. (pprof) list HandlerName shows you the source lines responsible. Allocations on a := line, a make, an append past capacity, or a + between strings are the most common.
- Build with -gcflags='-m=2' and grep for the function name. Every "moved to heap" or "escapes to heap" line is a candidate.
- Fix the highest-traffic site first. Do not micro-optimize a function that allocates once per minute when the next one over allocates ten thousand times per second.
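If the service does not already write mem.prof, runtime/pprof produces it; a minimal sketch, with the file name matching the command in the first step:

import (
    "os"
    "runtime/pprof"
)

func writeHeapProfile() error {
    f, err := os.Create("mem.prof")
    if err != nil {
        return err
    }
    defer f.Close()
    // The heap profile carries alloc_objects, alloc_space,
    // inuse_objects, and inuse_space sample types, so one file
    // serves every view in go tool pprof.
    return pprof.WriteHeapProfile(f)
}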
Run go tool pprof -alloc_objects against your hottest handler today and start with the top call site. When you write it up for your team, report the measured number rather than a promised percentage; every service moves differently after these fixes, and only your benchmark knows the real one.
If you want the runtime mental model
Allocation, escape analysis, the GC mark phase, and the per-P scheduler that all of this lives inside are covered end-to-end in The Complete Guide to Go Programming. If you have ever read a runtime.mallocgc line in pprof and wished you knew what the function was actually doing, that is the chapter to find.
The companion book, Hexagonal Architecture in Go, is the design-layer counterpart: how to structure a service so the hot paths you eventually optimize are isolated from the domain code that should never need to think about pools or escape analysis.
