- Book: The Complete Guide to Go Programming
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
You profile a hot path. The flame graph shows runtime.deferproc and runtime.deferreturn near the top, eating 8% of the CPU. You did not write either of those. You wrote defer mu.Unlock() and defer f.Close() and went home. Now someone on the team is asking why a function that does almost nothing is the most expensive line in the trace.
The answer is that defer has three implementations, not one. The fast one is open-coded defer. It landed in Go 1.14 and runs in around 6ns. The slow one routes through runtime.deferproc, allocates a _defer record, and costs about 35ns. Most of the time you are on the fast path. The compiler has rules for when it bails out, and once you fall off the cliff, every iteration of a tight loop pays the full cost.
Three patterns flip the switch. Each has a tool that makes it visible and a rewrite that gets you back.
What "open-coded defer" actually means
Before Go 1.14, every defer allocated a _defer record either on the heap or on a stack-managed list, and deferreturn walked that list at function exit. The proposal that landed in 1.14 (34481) replaced that for the common case: the compiler inlines the deferred call directly into each return path, tracks which defers fired with an 8-bit bitmap, and skips deferproc entirely. The cost drops from ~35ns to ~6ns per defer.
The optimization has hard preconditions. From the proposal:
- The defer must not appear in a loop in the control-flow graph.
- The function must contain at most 8 defers (one bit per defer in the deferBits byte).
- The function must not have too many exit points (the compiler emits the inlined return sequence at every exit).
Hit any of those and the function reverts to deferproc for every defer in the function, not just the offending one. That is the cliff. Walking off it is silent. The benchmark just gets slower.
Cliff 1: defer inside a hot loop
The most common version of this is a function that opens N files in a loop and defers Close each time:
func processAll(paths []string) error {
for _, p := range paths {
f, err := os.Open(p)
if err != nil {
return err
}
defer f.Close()
if err := process(f); err != nil {
return err
}
}
return nil
}
The famous bug is that no file closes until processAll returns. Open 50,000 paths and you exhaust the file-descriptor table. Fine, you knew that.
The quieter bug is that the defer sits inside the for loop. That disqualifies the whole function from open-coded defer. Every defer in processAll (including any other ones higher up) now goes through deferproc, allocates a _defer record, and runs the slow exit path. You can see it in go test -bench:
BenchmarkLoopDefer-8 1652 698.4 ns/op 240 B/op 10 allocs/op
BenchmarkLoopNoDefer-8 5142 234.1 ns/op 0 B/op 0 allocs/op
Numbers vary by machine and Go version. The shape is the point: the version with defer inside the loop allocates per iteration; the rewrite has zero allocations.
The rewrite lifts cleanup into a helper that runs per iteration:
func processAll(paths []string) error {
for _, p := range paths {
if err := processOne(p); err != nil {
return err
}
}
return nil
}
func processOne(p string) error {
f, err := os.Open(p)
if err != nil {
return err
}
defer f.Close()
return process(f)
}
processOne is a small function with one defer at function scope, so it qualifies for open-coded defer. The file closes at the end of each iteration, and processAll itself has no defer at all, so it pays nothing.
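You can observe the same shape without a full benchmark harness using testing.AllocsPerRun. This is a self-contained sketch, not the benchmark behind the numbers above: resource is a stand-in for the *os.File, and the exact allocation counts depend on your Go version.

```go
package main

import (
	"fmt"
	"testing"
)

// resource stands in for the *os.File in the example above.
type resource struct{ open bool }

func (r *resource) close() { r.open = false }

// deferInLoop mirrors the buggy shape: the defer sits inside the loop,
// so the function loses open-coded defer and pays heap work per iteration.
func deferInLoop(n int) {
	for i := 0; i < n; i++ {
		r := &resource{open: true}
		defer r.close()
	}
}

// hoisted mirrors the rewrite: the loop body lives in a helper with a
// single function-scoped defer, which qualifies for open-coded defer.
func hoisted(n int) {
	for i := 0; i < n; i++ {
		useOne()
	}
}

func useOne() {
	r := &resource{open: true}
	defer r.close()
}

func main() {
	loop := testing.AllocsPerRun(1000, func() { deferInLoop(8) })
	hoist := testing.AllocsPerRun(1000, func() { hoisted(8) })
	fmt.Printf("defer in loop: %.0f allocs/op\n", loop)  // nonzero: heap work per iteration
	fmt.Printf("hoisted:       %.0f allocs/op\n", hoist) // zero: everything stays on the stack
}
```

The absolute numbers vary, but the loop version allocates on every iteration and the hoisted version does not.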
Cliff 2: too many defers in one function
The bitmap that tracks which defers fired is one byte. That is the constraint. The compiler-generated deferBits byte is a uint8: eight bits, eight defers. Add a ninth and the compiler falls back to the heap path for all of them.
A function that grew defers organically over time is the typical victim:
func handleRequest(ctx context.Context, w http.ResponseWriter,
r *http.Request) error {
span, ctx := tracer.Start(ctx, "handleRequest")
defer span.End() // 1
timer := metrics.NewTimer("req")
defer timer.Stop() // 2
conn, err := pool.Get(ctx)
if err != nil { return err }
defer pool.Put(conn) // 3
tx, err := conn.Begin(ctx)
if err != nil { return err }
defer tx.Rollback() // 4
lockCtx := lock.Acquire(r.UserID)
defer lock.Release(lockCtx) // 5
cache := localCache.Snapshot()
defer cache.Free() // 6
f, err := openTempFile(r)
if err != nil { return err }
defer os.Remove(f.Name()) // 7
defer f.Close() // 8
audit := audit.New(r)
defer audit.Flush() // 9 — cliff
// ...
}
Every defer here looks reasonable in isolation. Together, they push the function past the threshold and silently move it onto the slow path. Nobody counts defers per function in code review.
The compiler will tell you which kind it picked, with the right flag:
$ go build -gcflags='-d=defer'
./req.go:42:2: open-coded defer
./req.go:45:2: open-coded defer
./req.go:62:2: stack-allocated defer
stack-allocated defer (and its sibling heap-allocated defer, for defers in loops) marks the fallback path. Once you see either, you know the threshold tripped.
The rewrite is to extract a coherent group of defers into a helper. The natural one here is the request-scoped resource bundle:
type reqResources struct {
span trace.Span
timer *metrics.Timer
conn *pool.Conn
tx *db.Tx
}
func acquireReqResources(
ctx context.Context, r *http.Request,
) (*reqResources, func(), error) {
rr := &reqResources{}
cleanup := func() {
if rr.tx != nil { rr.tx.Rollback() }
if rr.conn != nil { pool.Put(rr.conn) }
if rr.timer != nil { rr.timer.Stop() }
if rr.span != nil { rr.span.End() }
}
// ... acquire each, set fields, return on error
return rr, cleanup, nil
}
The handler now has one defer (cleanup) plus a couple of small ones it actually needs at the top level. Both functions stay under 8 and both qualify for open-coded defer.
Cliff 3: closures that capture loop variables
This one is sneakier. You are not deferring inside a loop. You are deferring a function literal that closes over a loop variable. Escape analysis sees that the closure outlives the iteration and moves the captured variable to the heap. The _defer record itself may also escape because the closure is stored on the defer chain.
func processItems(items []*Item) {
for _, it := range items {
defer logCompletion(it) // (a) plain call — fine
}
for _, it := range items {
defer func() { // (b) closure over `it`
log.Printf("done: %s", it.ID)
}()
}
}
(a) is on the slow path because of cliff 1, but its argument is at least evaluated and captured by value when the defer statement runs. (b) adds a problem: the closure forces the captured variable onto the heap. Run with -gcflags='-m=2' and you get the receipt:
$ go build -gcflags='-m=2' ./...
./loop.go:14:6: moved to heap: it
./loop.go:15:9: func literal escapes to heap
./loop.go:16:25: it.ID escapes to heap
Three allocations per iteration where there should be zero. On a 10K-item slice that is 30K extra heap allocations.
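The semantic difference between (a) and (b) — capturing at the defer statement versus reading at run time — shows up in a toy program. This sketch uses named returns so the observations are easy to check:

```go
package main

import "fmt"

// run returns what each deferred call observed. Deferred funcs run
// after the return values are set, so they can write to named results.
func run() (plainSaw, closureSaw int) {
	x := 1
	defer func(v int) { plainSaw = v }(x) // argument evaluated now: captures 1
	defer func() { closureSaw = x }()     // x read when the defer runs: sees 2
	x = 2
	return
}

func main() {
	p, c := run()
	fmt.Println("plain call captured:", p) // 1
	fmt.Println("closure read:", c)        // 2
}
```

The plain-call form freezes its arguments at the defer statement; the closure drags the live variable along, and in a loop that drag is what sends it to the heap.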
The fix is the same shape as cliff 1: pull the body into a function so the defer lives at function scope and there is no closure left to capture anything.
func processItems(items []*Item) {
for _, it := range items {
processOne(it)
}
}
func processOne(it *Item) {
defer logCompletion(it)
// ...
}
processOne has one defer at function scope. Open-coded. No loop. No closure. it stays on the stack as a parameter. Run -gcflags='-m=2' again and the "moved to heap" lines are gone.
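You can also confirm the fix numerically with testing.AllocsPerRun, without reading compiler output. Item and logCompletion here are stand-ins matching the example; exact counts vary by Go version, but the before/after gap is the point:

```go
package main

import (
	"fmt"
	"testing"
)

type Item struct{ ID string }

var sink string

func logCompletion(it *Item) { sink = it.ID }

// before: deferred closure inside the loop; the closure and its
// captured variable escape to the heap on every iteration.
func closureLoop(items []*Item) {
	for _, it := range items {
		defer func() { sink = it.ID }()
	}
}

// after: loop body in a helper with one plain-call defer at
// function scope; nothing escapes.
func fixedLoop(items []*Item) {
	for _, it := range items {
		fixedOne(it)
	}
}

func fixedOne(it *Item) {
	defer logCompletion(it)
}

func main() {
	items := []*Item{{ID: "a"}, {ID: "b"}, {ID: "c"}}
	before := testing.AllocsPerRun(1000, func() { closureLoop(items) })
	after := testing.AllocsPerRun(1000, func() { fixedLoop(items) })
	fmt.Printf("closure in loop: %.0f allocs/op\n", before) // nonzero
	fmt.Printf("helper rewrite:  %.0f allocs/op\n", after)  // zero
}
```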
How to see the cliff before production does
Three flags get you everything:
go build -gcflags='-d=defer' # which kind of defer
go build -gcflags='-m' # escape analysis decisions
go test -bench=. -benchmem # allocs and ns/op per case
Run the first two on the package you suspect. Look for stack-allocated defer lines. Look for moved to heap lines that point at variables captured by deferred closures. Then write the benchmark that compares the suspect path with the rewritten one. The -benchmem column tells you whether the rewrite worked.
The cost of getting this wrong is not academic. A function that does 35ns of useful work and 35ns of deferproc overhead is half as fast as it should be. Multiply by a few thousand QPS and you have lost a CPU core.
defer is still the right primitive. The runtime is built around it, and the open-coded path is effectively free in most code. The discipline is keeping each function small enough, loop-free enough, and closure-light enough for the compiler to take the optimization. Three rules:
- Defer at function scope, never inside a loop. Push the body into a helper.
- Stay under 8 defers per function. Bundle related cleanups behind one closer.
- Defer a plain call, not a closure that captures loop variables.
Apply those three and your pprof profile stops showing deferproc near the top.
If this saved you a regression
The Complete Guide to Go Programming covers all three end-to-end: escape-analysis output, the open-coded defer rules, and the runtime mechanics that decide stack vs. heap. Most performance surprises in Go come from the same place this one does: a small change in a function's shape flipping the compiler off a fast path you did not know you were on.
The companion book, Hexagonal Architecture in Go, takes the same care to the design layer: how to structure services so the small, hot helpers stay small and the large, mixed-concern handlers do not grow nine defers by accident.
