- Book: The Complete Guide to Go Programming
- Also by me: Thinking in Go (2-book series) — Complete Guide to Go Programming + Hexagonal Architecture in Go
- My project: Hermes IDE | GitHub — an IDE for developers who ship with Claude Code and other AI coding tools
- Me: xgabriel.com | GitHub
Picture two teams flipping GOEXPERIMENT=greenteagc the same week. One is DoltHub, running their oltp_read_write benchmark on Dolt: stock GC and Green Tea produce 73.20 vs 73.36 tx/s, identical median latency, histograms that overlap to the eye. Mark CPU is slightly higher under Green Tea on every cycle. They roll back.
The other is Tile38, the geospatial in-memory store. Their rollout records a 35% reduction in GC CPU, with throughput and latency improvements documented on the upstream issue. They keep the flag on.
Both teams run Go services in production. Both pulled the same compiler. The difference is heap shape, and there is a rule that predicts which side you land on before you flip a flag.
That rule is what this post covers, along with why the 40% number from the Go team's blog is true and useless at the same time, depending on what you ship.
What Green Tea actually changes
Pre-1.25, Go's tri-color mark-sweep walks the live object graph one object at a time. The worker pops a pointer from the work queue, scans the object, pushes any pointers it found, and moves on. This is correct, and also a memory-access pattern that looks like "jump to a random page, read 64 bytes, jump to a different random page, read 32 bytes." On a modern CPU, that's mostly stalls. The Go team measured 35%+ of mark time spent waiting on main memory before Green Tea landed (go.dev/blog/greenteagc).
Green Tea changes the unit of work. Instead of a queue of objects, it keeps a queue of 8 KiB pages (technically: spans of small objects, where small means ≤512 bytes). When a pointer into a page is discovered, the page goes onto the queue. It sits there long enough for multiple pointers into the same page to accumulate. When the worker pops it, every reachable object on that page gets scanned in one pass, left to right, in a tight loop the prefetcher can predict.
The metadata layout matters. Each small-object slot carries two extra bits: a "seen" bit (a pointer into me has been observed) and a "scanned" bit. The seen bits for an entire span pack into one or two machine words, which is what makes the AVX-512 path possible. With that layout the runtime can interrogate every slot on a page with a single vector instruction. The Go blog post documents the use of VGF2P8AFFINEQB (a Galois-field affine transform) to expand the seen-bit mask into per-object work in a few cycles. Practical translation: on a CPU with AVX-512 (Ice Lake server, Sapphire Rapids, AMD Zen 4 or later) the per-page overhead is close to zero.
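To make the access-pattern difference concrete, here is a deliberately simplified sketch in Go. None of the types or names below come from the runtime; they are invented stand-ins, and the real implementation layers write barriers, per-P work buffers, and the bit-packed metadata described above on top of this skeleton.

// A deliberately simplified model of the two strategies. Every type and name
// here is invented for illustration.

type object struct {
	ptrs   []*object
	span   *span
	marked bool
}

// span stands in for an 8 KiB span of small objects.
type span struct {
	objs   []*object
	seen   map[*object]bool
	queued bool
}

// Object-centric marking (pre-1.25): one object per queue entry. Each pop
// jumps to an unrelated address, so much of the time is spent waiting on memory.
func markObjects(roots []*object) {
	work := append([]*object(nil), roots...)
	for len(work) > 0 {
		obj := work[len(work)-1]
		work = work[:len(work)-1]
		for _, p := range obj.ptrs {
			if !p.marked {
				p.marked = true
				work = append(work, p)
			}
		}
	}
}

// Span-centric marking (Green Tea): one span per queue entry. Pointers into a
// span accumulate as "seen" bits while it waits in the queue; when it is popped,
// every seen-but-unscanned object on it is scanned in one sequential pass.
func markSpans(roots []*object) {
	var work []*span
	see := func(p *object) {
		s := p.span
		s.seen[p] = true
		if !s.queued {
			s.queued = true
			work = append(work, s)
		}
	}
	for _, r := range roots {
		see(r)
	}
	for len(work) > 0 {
		s := work[len(work)-1]
		work = work[:len(work)-1]
		s.queued = false
		for _, obj := range s.objs { // contiguous walk, prefetch-friendly
			if !s.seen[obj] || obj.marked {
				continue
			}
			obj.marked = true
			for _, p := range obj.ptrs {
				if !p.marked {
					see(p)
				}
			}
		}
	}
}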
The benchmark headline numbers, all from the Go team's Green Tea blog post and the announcement issue:
- 10–40% GC CPU reduction across the standard benchmark suite, without AVX-512.
- An additional ~10% on top once vector acceleration ships fully.
- Tile38: 35% reduction in GC CPU, with throughput and latency improvements documented on the rollout issue.
- Modal improvement: ~10% GC CPU. The 40% figure is the tail of the distribution, not the median.
- Internal Google deployment: comparable results at scale, no specific number disclosed (go.dev/blog/greenteagc).
If your service spends 10% of total CPU in GC, a 10% GC CPU reduction is a 1% reduction in service CPU. That's the math the Go team itself walks through. It is real, it is unglamorous, and for most teams it does not move a meeting.
What Google said vs what DoltHub measured
DoltHub published one of the more useful real-world reports on Green Tea. They built two Dolt binaries (stock GC and GOEXPERIMENT=greenteagc) and ran sysbench's oltp_read_write against both, first single-threaded, then 20 threads pinned to 8 cores for 10 minutes.
The numbers, single-threaded:
- Stock: 73.20 tx/s, 13.66 ms avg latency, 13.22 ms median.
- Green Tea: 73.36 tx/s, 13.63 ms avg latency, 13.22 ms median.
That is not noise on top of an improvement. That is two runs of the same workload. The 20-thread, 8-core run reproduces the same answer with a wider distribution: histograms that overlap to the eye.
The interesting finding is one layer down. From the GC traces, mark phase CPU was higher under Green Tea, on every cycle. STW pause times were comparable. So Green Tea was, for Dolt's workload, a small regression in mark cost and a wash everywhere else.
DoltHub's own read is honest: they note the Go team's caveat that most real-world programs will not see much difference, and point out that their benchmark grows the heap throughout the run, which inflates GC duration regardless of which collector is doing the work.
What both posts under-explain is why the same flag produced 35% on Tile38 and ~0% on Dolt. The answer is in the heap shape.
The heap-shape rule
Green Tea's whole bet is this: when the GC worker pulls a page off the queue, multiple objects on that page should be reachable. If only one object per page is alive at scan time, the page-centric algorithm is doing the same work as the object-centric one, plus the page bookkeeping.
Three properties of a workload determine whether your pages are dense at scan time:
1. Object-size distribution skewed toward small. Green Tea's optimised path is for small objects (≤512 bytes), which sit in size-class spans. A workload whose live set is dominated by []byte of varying size, large map[string]struct{...} values, big protobuf messages, or DOM-like trees of medium-sized nodes spends most of its mark budget outside the optimised path. The published Go benchmarks that hit 40% are mostly small-struct, small-pointer-rich graphs.
2. Allocation pattern that clusters lifetimes. Pages get dense when objects allocated together also become unreachable together (or stay live together). Web request handlers that build a per-request graph of small structs, hold it for the request lifetime, and drop the whole thing on response are ideal: the survivors during a mark cycle are tightly clustered on a few hot pages, and everything else is empty space the page-centric algorithm doesn't touch. A workload that allocates and survives uniformly across the heap (long-lived caches, graph databases that keep historical state warm) loses this property. (A sketch of both shapes follows this list.)
3. Working-set residency low relative to total heap. When the live working set is a small fraction of the heap and the rest is sparse, page-centric marking wins big. When the working set is most of the heap and it's roughly uniform, you're back to scanning everything anyway.
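To see what "clustered lifetimes" look like in code, here is a caricature of both shapes with invented types: the handler is the Green Tea-friendly pattern, the global cache is the Dolt-shaped one.

// Green Tea-friendly: a burst of small structs that live and die together,
// so any survivors during a mark cycle cluster on a few hot pages.
type node struct {
	key      string
	children []*node
}

func handleRequest(payload []string) int {
	root := &node{key: "root"}
	for _, k := range payload {
		root.children = append(root.children, &node{key: k}) // small, short-lived
	}
	// Everything allocated here becomes garbage as soon as the handler returns.
	return len(root.children)
}

// Dolt-shaped counterexample: a long-lived structure that keeps a large,
// roughly uniform slice of the heap reachable across every GC cycle.
var cache = map[string]*node{} // stays live; every page it touches must be scanned

func remember(k string, n *node) {
	cache[k] = n
}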
Tile38, a geospatial in-memory store, hits all three. It has small index nodes, request-driven allocation, and a hot working set. A 35% drop is plausible.
Dolt is a SQL-on-MVCC database whose job is to keep multi-version structures reachable across a benchmark. It fails the small-object and clustered-lifetime tests. The heap is full of medium-sized chunk nodes, and the live set is most of the heap. There's no sparse page Green Tea can skip and no dense page where multiple survivors get co-scanned for cheap. Mark phase doing slightly more work per cycle is exactly what page bookkeeping costs when there's no payoff to amortise it against.
The heuristic from the published numbers: Green Tea is built for the small-object, request-shaped, mostly-stateless services that dominate web, RPC, and edge tiers. For databases, caches, and anything that keeps a large structured heap warm, expect a wash or a small regression. The split between the two camps has not been measured across the population of Go services; "1 in 4" was a rough call from the benchmark distribution, nothing more. The point is that the median Go service is not the workload Google is measuring.
How to predict your camp in 20 minutes
Before you change a build flag in production, run two diagnostics on a representative load:
// In a process under realistic load, force a GC and dump a heap profile.
import (
	"os"
	"runtime"
	"runtime/pprof"
)

func dumpHeap(path string) error {
	runtime.GC() // force a collection so the profile reflects the current live set
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	// debug=0 writes the binary proto format that go tool pprof expects.
	return pprof.Lookup("heap").WriteTo(f, 0)
}
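One way to trigger it while the service is under real load is to hang the dump off a debug endpoint. The helper below is illustrative, not from the post; if you already use net/http/pprof, its /debug/pprof/heap endpoint gives you the same profile with no custom code.

import (
	"fmt"
	"net/http"
)

// registerHeapDump is an illustrative helper: it exposes dumpHeap on a
// debug-only endpoint so you can capture a profile mid-load.
func registerHeapDump(mux *http.ServeMux) {
	mux.HandleFunc("/debug/heapdump", func(w http.ResponseWriter, r *http.Request) {
		if err := dumpHeap("/tmp/heap.pprof"); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		fmt.Fprintln(w, "wrote /tmp/heap.pprof")
	})
}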
Look at two things in the resulting profile.
Object size mix. go tool pprof -alloc_space heap.pprof, then top10 -cum. If the top allocations are sites returning small structs (16–256 bytes — request contexts, AST nodes, small slices, individual map entries), Green Tea has something to work with. If the top sites are large []byte buffers, large structs, or anything rounded into the >512-byte size classes, Green Tea cannot help on those — large objects bypass its fast path.
Live-set density. Compare runtime.MemStats.HeapInuse with HeapAlloc over a few GC cycles:
import "runtime"
func snapshot() runtime.MemStats {
var m runtime.MemStats
runtime.ReadMemStats(&m)
return m
}
// Live ratio = HeapAlloc / HeapInuse.
// Closer to 1.0 = uniformly live heap (Green Tea has nowhere to skip).
// Closer to 0.3–0.6 = sparse heap with hot pockets (Green Tea wins).
A live ratio of 0.9+ during steady state means almost every page contains live objects. The page-centric algorithm cannot skip pages and can only co-scan survivors that happen to share one. A 0.4–0.6 live ratio means most of the heap is "in flight, soon dead" — the pattern page-centric marking was designed to exploit.
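A minimal way to make that comparison, building on snapshot() above. The sampling cadence is arbitrary, and the thresholds in the comments are this post's rule of thumb, not anything the runtime defines.

import (
	"fmt"
	"time"
)

// sampleLiveRatio prints the live ratio a few times during steady-state load.
// Average over several GC cycles rather than trusting a single snapshot.
func sampleLiveRatio(samples int, interval time.Duration) {
	for i := 0; i < samples; i++ {
		m := snapshot()
		ratio := float64(m.HeapAlloc) / float64(m.HeapInuse)
		fmt.Printf("gc=%d live_ratio=%.2f heap_alloc=%dMiB heap_inuse=%dMiB\n",
			m.NumGC, ratio, m.HeapAlloc>>20, m.HeapInuse>>20)
		time.Sleep(interval)
	}
}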
Combine with GODEBUG=gctrace=1 over 30 minutes:
GODEBUG=gctrace=1 ./your-service 2> gctrace.log
Each line starts gc N @Ts P%: a+b+c ms clock, ...; the P% field is the cumulative share of available CPU the program has spent in GC since it started. If that number is below 5%, skip the experiment: even a 40% reduction is at most a 2% CPU win, and the risk of regression is the same. As a rule of thumb, the teams that get a meaningful win from Green Tea are the ones already burning a double-digit fraction of CPU in GC before they flip the flag.
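If you'd rather read that number programmatically than eyeball trace lines, runtime.MemStats exposes the same signal as GCCPUFraction. A rough sketch; the 0.05 cut-off just mirrors the 5% rule of thumb above, it is not an official threshold.

import (
	"fmt"
	"runtime"
)

// gcWorthExperimenting reports whether GC is eating enough CPU for the
// Green Tea experiment to plausibly matter.
func gcWorthExperimenting() bool {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Printf("gc_cpu_fraction=%.3f\n", m.GCCPUFraction)
	return m.GCCPUFraction >= 0.05
}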
Running the experiment safely
Once you've decided your workload is plausibly small-object and request-shaped, the rollout is simple:
# In your build stage. Pin Go 1.25.x explicitly.
FROM golang:1.25 AS build
ENV GOEXPERIMENT=greenteagc
WORKDIR /src
COPY . .
RUN go build -o /out/app ./cmd/app
The flag is a build-time toggle: GOEXPERIMENT is read by the compiler, not the runtime. There is no flip-it-back-with-a-config-reload escape hatch. You ship two binaries, A/B them at the load balancer, compare.
Three traps the early reports flagged:
RSS may go up. Page-centric marking holds onto more metadata, so expect baseline resident set size to creep up slightly. If you're CPU-bound, that's a fine trade. If you're memory-bound on Kubernetes with a tight limits.memory, your pod might get OOMKilled before it gets a chance to be faster. Watch RSS in the canary; if you see it climb, treat that as a signal to roll back, not as noise.
Mark CPU can go up before total CPU goes down. This is what tripped DoltHub. Green Tea pays a fixed page-bookkeeping cost on every cycle. The win shows up as fewer assists and shorter total GC time, which only manifests once the rest of the pattern (small objects, sparse heap) is doing its job. Looking only at gc-mark-time misleads.
Compare on apples-to-apples load. GC behaviour is allocation-rate-driven. A canary at 10% of production traffic does not produce 10% of production allocation pressure if your hot path is request-shaped. Either shadow-mirror traffic or run a synthetic load that matches your real allocation profile within a factor of two.
If the canary holds for 48 hours with no RSS regression and your gctrace shows a real reduction in mark CPU, ship it. If it doesn't, roll back the build flag. Two hours, not two weeks.
What Go 1.26 changes
Green Tea is the default in Go 1.26. The opt-out is GOEXPERIMENT=nogreenteagc, documented in the Go team's announcement issue and the Green Tea blog post.
For most projects, the upgrade is uneventful. Your binary uses Green Tea, your dashboards either don't move or move slightly in the right direction, and you stop thinking about it. For Dolt-shaped workloads, Green Tea-by-default may produce a small regression, and your options are (a) live with it, or (b) opt out with GOEXPERIMENT=nogreenteagc until the Go team's planned tweaks close the gap.
The longer-term direction is clear from the issue tracker: vector acceleration is rolling out, and the algorithm will get tuned for the workloads that currently regress. Pick a service in your stack this week, run the live-ratio diagnostic, and you'll know which camp it lands in before 1.26 ships the default for you.
If this was useful
The runtime, the GC, the scheduler, the allocator: these are the parts of Go whose mental model pays back when something on the dashboard goes sideways. The Complete Guide to Go Programming covers the runtime end of the language end-to-end: how the GC actually decides when to run, what GOGC and GOMEMLIMIT are doing under the hood, why a goroutine costs 2 KiB of stack and what happens when it grows, and how to read a gctrace line without guessing.
- The Complete Guide to Go Programming — runtime, stdlib, idiom: xgabriel.com/go-book
- Hexagonal Architecture in Go — the architecture follow-up: xgabriel.com/hexagonal-go
- Hermes IDE — an IDE for developers who ship with Claude Code and other AI coding tools: hermes-ide.com
- More posts and contact — xgabriel.com
