TL;DR: Go 1.20 PGO was applied to an existing high-performance application and led to 2% faster execution, 4.4% fewer CPU cycles, and 5.3% fewer CPU instructions, at the cost of an extra 100KB of compiled binary size.
A recent release of Go (1.20) introduced a preview of a new compiler feature that optimizes an application based on the performance counters of a CPU profile.
Profile-guided optimization (PGO) can enable inlining for functions on hot paths; this is a trade-off between increased binary size and improved execution speed.
The CPU profile makes this trade-off informed, allowing a relatively low binary size penalty for a noticeable performance improvement.
I have flatjsonl, an application that processes JSON lines and renders them as tables. Processing speed is important, so any optimization is welcome; let's see how PGO can help.
CPU Profile
First, a relevant CPU profile is needed, and there is already standard tooling for that. A CPU profile is basically a collection of metrics and counters observed during application execution. For PGO purposes, the best results are obtained from profiles of actual production workloads.
For a CLI app, a common practice is to define a flag that enables profile collection.
package main

import (
	"flag"
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	var cpuProfile string

	flag.StringVar(&cpuProfile, "dbg-cpu-prof", "", "Write CPU profile to file.")
	// ... other flags ...
	flag.Parse()

	if cpuProfile != "" {
		f, err := os.Create(cpuProfile)
		if err != nil {
			log.Fatal(err)
		}

		if err = pprof.StartCPUProfile(f); err != nil {
			log.Fatal(err)
		}

		defer pprof.StopCPUProfile()
	}

	// ... actual work ...
}
If you're developing a web service, you can use net/http/pprof to expose a profiling handler and collect data from a running instance.
I have a sample of production work for flatjsonl. If I run it multiple times, it takes roughly the same time and produces the same output, so running this command is effectively an idempotent operation.
flatjsonl -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst
So, to collect a CPU profile, I'll run it again with the -dbg-cpu-prof flag.
flatjsonl -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst -dbg-cpu-prof default.pgo
After the run completes, I can find the default.pgo file. The resulting profile file can have any name; however, default.pgo is a special name picked up automatically by the -pgo=auto mode.
PGO is implemented so that it tolerates obsolete CPU profiles (e.g. collected with an older version of the application), though the more obsolete the profile, the less efficient PGO becomes.
PGO Build
Once the CPU profile is collected, it can be used with go build:
go build -pgo=auto
or, if the profile is stored somewhere other than ./default.pgo:
go build -pgo=/path/to/cpu.pprof
In this case, PGO added ~100KB of binary size.
8527872 (8.2M) flatjsonl
8626176 (8.3M) flatjsonl-pgo
Performance Impact
flatjsonl is optimized for multiple cores; performance was measured on a 32-core machine running Linux.
Go has benchmarking tools in the standard library, but in this case it is easier to use hyperfine to measure the performance of a CLI app.
Hyperfine
hyperfine --warmup 3 'flatjsonl -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst'
Benchmark 1: flatjsonl -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst
Time (mean ± σ): 7.836 s ± 0.147 s [User: 69.622 s, System: 4.279 s]
Range (min … max): 7.504 s … 8.043 s 10 runs
hyperfine --warmup 3 'flatjsonl-pgo -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst'
Benchmark 1: flatjsonl-pgo -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst
Time (mean ± σ): 7.660 s ± 0.305 s [User: 66.129 s, System: 3.783 s]
Range (min … max): 7.180 s … 8.106 s 10 runs
In this case, the binary built with PGO ran ~2.2% faster.
Perf Stat
Linux provides another handy tool to inspect application performance: perf stat can count CPU instructions and cycles during program execution.
perf stat -e task-clock,cycles,instructions,cache-references,cache-misses flatjsonl -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst
<program output omitted>
Performance counter stats for 'flatjsonl -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst':
71,107.29 msec task-clock:u # 9.081 CPUs utilized
152,046,064,828 cycles:u # 2.138 GHz
193,579,158,625 instructions:u # 1.27 insn per cycle
1,498,126,007 cache-references:u # 21.069 M/sec
374,577,209 cache-misses:u # 25.003 % of all cache refs
7.830534779 seconds time elapsed
68.273271000 seconds user
3.311031000 seconds sys
perf stat -e task-clock,cycles,instructions,cache-references,cache-misses flatjsonl-pgo -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst
<program output omitted>
Performance counter stats for 'flatjsonl-pgo -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst':
70,220.34 msec task-clock:u # 9.427 CPUs utilized
145,321,922,303 cycles:u # 2.070 GHz
183,258,560,907 instructions:u # 1.26 insn per cycle
1,543,859,683 cache-references:u # 21.986 M/sec
386,600,109 cache-misses:u # 25.041 % of all cache refs
7.448585196 seconds time elapsed
64.659898000 seconds user
6.057555000 seconds sys
The binary built with PGO used ~4.4% fewer CPU cycles and ~5.3% fewer instructions.
Such performance improvements may seem small, but the effort to obtain them is almost free.