TL;DR: Go 1.20 PGO was applied to an existing high-performance application and led to 2% faster execution, 4.4% fewer CPU cycles, and 5.3% fewer CPU instructions, at the cost of an extra 100KB of compiled binary size.
A recent release of Go (1.20) introduced a preview of a new compiler feature that optimizes an application based on the performance counters of a CPU profile.
Profile-guided optimization (PGO) can enable inlining for functions on hot paths; this is a trade-off between increased binary size and improved execution speed.
The CPU profile makes this trade-off informed, allowing a relatively low binary size penalty for a noticeable performance improvement.
I have flatjsonl, an application that processes JSON lines and renders them as tables. Processing speed is important, so any optimization is welcome; let's see how PGO can help.
CPU Profile
First, a relevant CPU profile is needed, and there is already standard tooling for that. A CPU profile is basically a collection of metrics and counters observed during application execution. For PGO purposes, the best results are obtained from profiles of actual production workloads.
For a CLI app, a common practice is to define a flag that enables profile collection.
package main

import (
	"flag"
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	var cpuProfile string

	flag.StringVar(&cpuProfile, "dbg-cpu-prof", "", "Write CPU profile to file.")
	// ... other flags ...
	flag.Parse()

	if cpuProfile != "" {
		f, err := os.Create(cpuProfile)
		if err != nil {
			log.Fatal(err)
		}

		if err = pprof.StartCPUProfile(f); err != nil {
			log.Fatal(err)
		}

		defer pprof.StopCPUProfile()
	}

	// ... actual work ...
}
If you're developing a web service, you can use net/http/pprof to expose a profiling handler and collect data from a running instance.
I have a sample of production work for flatjsonl. If I run it multiple times, it takes roughly the same time and produces the same output, so running this command is effectively an idempotent operation.
flatjsonl -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst
So, to collect a CPU profile, I'll run it again with the -dbg-cpu-prof flag.
flatjsonl -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst -dbg-cpu-prof default.pgo
After the run completes, I can find the default.pgo file. The resulting profile file can have any name; however, default.pgo is a special name picked up automatically by the -pgo=auto mode.
PGO is implemented so that it tolerates obsolete CPU profiles (e.g. collected with an older version of the application), though the more obsolete the profile, the less efficient PGO becomes.
PGO Build
Once the CPU profile is collected, it can be used with go build:
go build -pgo=auto
or, if the profile is stored somewhere other than ./default.pgo:
go build -pgo=/path/to/cpu.pprof
In this case, PGO added ~100KB of binary size.
8527872 (8.2M) flatjsonl
8626176 (8.3M) flatjsonl-pgo
Performance Impact
flatjsonl is optimized for multiple cores; performance was measured on a 32-core machine running Linux.
Go has benchmarking tools in the standard library, but in this case it is easier to use hyperfine to measure the performance of a CLI app.
Hyperfine
hyperfine --warmup 3 'flatjsonl -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst'
Benchmark 1: flatjsonl -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst
Time (mean ± σ): 7.836 s ± 0.147 s [User: 69.622 s, System: 4.279 s]
Range (min … max): 7.504 s … 8.043 s 10 runs
hyperfine --warmup 3 'flatjsonl-pgo -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst'
Benchmark 1: flatjsonl-pgo -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst
Time (mean ± σ): 7.660 s ± 0.305 s [User: 66.129 s, System: 3.783 s]
Range (min … max): 7.180 s … 8.106 s 10 runs
In this case, the binary built with PGO ran ~2.2% faster.
Perf Stat
Linux provides another handy tool to inspect application performance: perf stat can count CPU instructions and cycles during program execution.
perf stat -e task-clock,cycles,instructions,cache-references,cache-misses flatjsonl -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst
<program output omitted>
Performance counter stats for 'flatjsonl -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst':
71,107.29 msec task-clock:u # 9.081 CPUs utilized
152,046,064,828 cycles:u # 2.138 GHz
193,579,158,625 instructions:u # 1.27 insn per cycle
1,498,126,007 cache-references:u # 21.069 M/sec
374,577,209 cache-misses:u # 25.003 % of all cache refs
7.830534779 seconds time elapsed
68.273271000 seconds user
3.311031000 seconds sys
perf stat -e task-clock,cycles,instructions,cache-references,cache-misses flatjsonl-pgo -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst
<program output omitted>
Performance counter stats for 'flatjsonl-pgo -config ./cfg.json5 -csv ~/all.csv.gz -field-limit 50000 -max-lines 300000 -input ~/slow.log.zst':
70,220.34 msec task-clock:u # 9.427 CPUs utilized
145,321,922,303 cycles:u # 2.070 GHz
183,258,560,907 instructions:u # 1.26 insn per cycle
1,543,859,683 cache-references:u # 21.986 M/sec
386,600,109 cache-misses:u # 25.041 % of all cache refs
7.448585196 seconds time elapsed
64.659898000 seconds user
6.057555000 seconds sys
The binary built with PGO used ~4.4% fewer CPU cycles and ~5.3% fewer instructions.
Such performance improvements may seem small, but the effort to obtain them is almost free.