This article is from bmf-tech.com - Goで始めるコードのパフォーマンス改善 (Improving Code Performance with Go).
Day 9 of Makuake Advent Calendar 2022!
Improving the performance of your code starting with Go
When I decided to improve the performance of goblin, my homebrew HTTP router, I took the opportunity to work on performance improvement in Go, so this article covers my approach and the improvements I put into practice.
Prerequisites
I'm sure there is more knowledge needed for more in-depth tuning, but I'll just list the bare minimum required.
- Garbage Collection
- A function that automatically releases memory areas allocated by a program at runtime that are no longer needed.
- Memory area
- Text area
- Area where the program, converted into machine language, is stored
- Stack area
- Memory area allocated during program execution
- Holds data whose size is fixed at compile time
- Automatically released (e.g., when a function finishes executing and is no longer needed)
- Ex. arguments, return values, temporary variables, etc.
- Heap area
- Memory area allocated during program execution
- Holds data whose size is determined dynamically
- Subject to garbage collection
- Static area
- Memory area allocated when the program starts
- Remains allocated until the program terminates
- Ex. global variables, static variables, etc.
Approach to Performance Improvement
Before improving performance, you should confirm that there is actually a need for it (is it worth sacrificing readability? can you even determine that the application is the bottleneck in the first place?), but here we will proceed on the assumption that there is.
Some ways to improve the performance of code include the following:
- Algorithm optimization
- Optimization of data structures
- Use of cache
- Application of parallel processing
- Compiler optimization
There are a number of approaches that come to mind, but before implementing any improvement, measurement and analysis should come first.
(In some cases the need for improvement may be so pressing that measurement seems secondary, but that depends on your situation and is not discussed here.)
We will introduce some of the packages and tools for measurement and analysis in Go.
Benchmarks
Go's standard testing package includes benchmark support.
You can obtain a benchmark by running go test -bench=. -benchmem
package main

import (
	"math/rand"
	"testing"
)

func BenchmarkRandIn(b *testing.B) {
	for i := 0; i < b.N; i++ { // b.N is adjusted automatically until the benchmark result is statistically reliable
		rand.Int() // function to be measured
	}
}
The output result will look like this.
goos: darwin
goarch: amd64
pkg: bmf-san/go-perfomance-tips
cpu: VirtualApple @ 2.50GHz
BenchmarkRandIn-8 87550500 13.53 ns/op 0 B/op 0 allocs/op
PASS
ok bmf-san/go-perfomance-tips 1.381s
The benchmark results that can be read from this are as follows:
- 87550500
- Number of function executions
- The higher the number of executions, the better the performance is considered
- 13.53 ns/op
- Time required per function execution
- Less time is considered better performance
- 0 B/op
- Size of memory allocated per function execution
- The smaller the number, the better the performance is considered to be
- 0 allocs/op
- Number of memory allocations made per function execution
- The fewer the number of allocations, the better the performance
Go allows for easy benchmarking in this way.
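The testing package also supports sub-benchmarks via b.Run and per-benchmark allocation reporting via b.ReportAllocs. The following is a sketch of my own (the function names are made up for illustration) comparing two ways of joining strings as sub-benchmarks:

```go
package main

import (
	"strings"
	"testing"
)

// BenchmarkJoin groups two implementations as sub-benchmarks so their
// results appear side by side in the output.
func BenchmarkJoin(b *testing.B) {
	parts := []string{"foo", "bar", "baz"}
	b.Run("concat", func(b *testing.B) {
		b.ReportAllocs() // report allocations even without -benchmem
		for i := 0; i < b.N; i++ {
			s := ""
			for _, p := range parts {
				s += p // each += may allocate a new string
			}
			_ = s
		}
	})
	b.Run("builder", func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			var sb strings.Builder
			for _, p := range parts {
				sb.WriteString(p) // Builder amortizes allocations
			}
			_ = sb.String()
		}
	})
}
```

Running go test -bench=. then reports BenchmarkJoin/concat and BenchmarkJoin/builder separately.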
For other Go benchmarking features, see the testing package documentation on benchmarks.
The package benchstat is a good tool for comparing benchmark results, as it shows the percentage of improvement between runs.
In bmf-san/goblin, which I maintain, benchstat is incorporated into CI so that results can be compared before and after a commit.
// This is an example where nothing has improved...
go test -bench . -benchmem -count 1 > new.out
benchstat old.out new.out
name old time/op new time/op delta
Static1-36 248ns ± 0% 246ns ± 0% ~ (p=1.000 n=1+1)
Static5-36 502ns ± 0% 495ns ± 0% ~ (p=1.000 n=1+1)
Static10-36 874ns ± 0% 881ns ± 0% ~ (p=1.000 n=1+1)
WildCard1-36 1.60µs ± 0% 1.50µs ± 0% ~ (p=1.000 n=1+1)
WildCard5-36 4.49µs ± 0% 4.92µs ± 0% ~ (p=1.000 n=1+1)
WildCard10-36 7.68µs ± 0% 7.58µs ± 0% ~ (p=1.000 n=1+1)
Regexp1-36 1.38µs ± 0% 1.48µs ± 0% ~ (p=1.000 n=1+1)
Regexp5-36 4.30µs ± 0% 4.11µs ± 0% ~ (p=1.000 n=1+1)
Regexp10-36 7.66µs ± 0% 7.18µs ± 0% ~ (p=1.000 n=1+1)
If absolutely no performance degradation is allowed, it may be a good idea to add a mechanism that makes CI fail on regressions.
If you want to check the actual memory allocations behind such benchmark results, you can do so by compiling with a build option.
package main

import "fmt"

// Build with go build -o example -gcflags '-m' gcflagsexample.go
func main() {
	a := "hello"
	b := "world"
	fmt.Println(a + b)
}
Running go build -o example -gcflags '-m' gcflagsexample.go
produces the following output.
# command-line-arguments
./gcflagsexample.go:9:13: inlining call to fmt.Println
./gcflagsexample.go:9:13: ... argument does not escape
./gcflagsexample.go:9:16: a + b escapes to heap
./gcflagsexample.go:9:16: a + b escapes to heap
This is a simple and obvious example, but identifying heap allocations in this way and then reducing them is a useful method of analysis for improving memory usage.
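As a further illustration (my own example, not from the article), returning a pointer to a local variable forces a heap allocation, while returning the value lets it stay on the stack; -gcflags '-m' makes the difference visible:

```go
package main

import "fmt"

type point struct{ x, y int }

// newPointPtr returns a pointer to a local variable; under
// go build -gcflags '-m' the compiler reports that p escapes to the heap.
func newPointPtr() *point {
	p := point{x: 1, y: 2}
	return &p
}

// newPointVal returns the struct by value, so it can stay on the stack.
func newPointVal() point {
	return point{x: 1, y: 2}
}

func main() {
	fmt.Println(newPointPtr().x + newPointVal().y) // prints 3
}
```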
Profiling
Go has a tool called pprof to analyze where the bottlenecks are at the function level.
package main

import (
	"sort"
	"testing"
)

func sortAlphabetically() {
	s := []string{"abc", "aba", "cba", "acb"}
	sort.Strings(s)
}

func BenchmarkSortAlphabetically(b *testing.B) {
	for i := 0; i < b.N; i++ {
		sortAlphabetically()
	}
}
If you want to see the CPU profile, run the following.
go test -test.bench=BenchmarkSortAlphabetically -cpuprofile cpu.out && go tool pprof -http=":8888" cpu.out
If you want to see the memory profile, run the following.
go test -test.bench=BenchmarkSortAlphabetically profilingexample_test.go -memprofile mem.out && go tool pprof -http=":8889" mem.out
Utilizing the UI of pprof makes it easier to identify where the bottleneck is in the process.
Practice
As a practical example, let's improve goblin, my homemade HTTP router.
The PR on the subject is here.
Reduce the memory allocation by refactoring explodePath method #68
goblin is an HTTP router based on a trie tree that works well with the net/http interface.
As for functionality, it has the minimum that is considered necessary for routing.
cf. goblin#features
Benchmarks
First, we run a benchmark test to measure performance.
go test -bench=. -cpu=1 -benchmem
The benchmark test has about three patterns for each test case: static routing (ex. /foo/bar), dynamic routing (ex. /foo/:bar), and routing using regular expressions (ex. /foo/:bar[^\d+$]).
The routing process involves:
- putting data into the tree structure (≒ defining routes)
- searching the tree structure for data (returning data based on the requested path)
In this test case, only the latter is measured.
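The shape of such a search-side benchmark can be sketched generically over any net/http handler; the example below uses http.NewServeMux from the standard library as a stand-in, since goblin's exact API is not reproduced here. goblin's actual benchmarks follow this pattern with its own router in place of the mux.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"testing"
)

// BenchmarkStaticRouting defines a route once (the "put" side), then
// measures only the dispatch of a request against it (the "search" side).
func BenchmarkStaticRouting(b *testing.B) {
	mux := http.NewServeMux() // stand-in for the router under test
	mux.HandleFunc("/foo/bar", func(w http.ResponseWriter, r *http.Request) {})

	req := httptest.NewRequest(http.MethodGet, "/foo/bar", nil)
	rec := httptest.NewRecorder()

	b.ResetTimer() // exclude route definition from the measurement
	for i := 0; i < b.N; i++ {
		mux.ServeHTTP(rec, req)
	}
}
```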
The output results are as follows:
goos: darwin
goarch: amd64
pkg: github.com/bmf-san/goblin
cpu: VirtualApple @ 2.50GHz
BenchmarkStatic1 5072353 240.1 ns/op 128 B/op 4 allocs/op
BenchmarkStatic5 2491546 490.0 ns/op 384 B/op 6 allocs/op
BenchmarkStatic10 1653658 729.6 ns/op 720 B/op 7 allocs/op
BenchmarkWildCard1 1602606 747.3 ns/op 456 B/op 9 allocs/op
BenchmarkWildCard5 435784 2716 ns/op 1016 B/op 23 allocs/op
BenchmarkWildCard10 246729 5033 ns/op 1680 B/op 35 allocs/op
BenchmarkRegexp1 1647258 733.2 ns/op 456 B/op 9 allocs/op
BenchmarkRegexp5 456652 2641 ns/op 1016 B/op 23 allocs/op
BenchmarkRegexp10 251998 4780 ns/op 1680 B/op 35 allocs/op
PASS
ok github.com/bmf-san/goblin 14.304s
Several trends can be read from the execution counts, the time per execution, the memory size per execution, and the number of memory allocations per execution.
I am personally concerned that memory allocations are occurring even for static routing. (Other HTTP Router benchmarks show 0 allocs.)
Profiling
Next, use pprof to obtain a profile.
This time, we focus only on memory to obtain a profile.
go test -bench . -memprofile mem.out && go tool pprof -http=":8889" mem.out
Graph output results.
[Image: pprof graph]
We can see that the process with the largest box (using the most memory) is explodePath.
Looking at the Top view (functions listed by execution time), explodePath is also at the top of the list.
Flat is the time spent in the function itself, while Cum is the cumulative time including the functions it calls.
Furthermore, checking Source shows where in the function the heavy processing actually occurs.
Since Search is the core process responsible for the router's matching, I expected the bottleneck to be there, but it turns out that a specific process within it, explodePath, is the bottleneck.
Tuning
explodePath splits the received string by / and returns the result as a []string.
// explodePath removes an empty value in slice.
func explodePath(path string) []string {
	s := strings.Split(path, pathDelimiter)
	var r []string
	for _, str := range s {
		if str != "" {
			r = append(r, str)
		}
	}
	return r
}
The test code is also included so the expected behavior is easy to understand.
func TestExplodePath(t *testing.T) {
	cases := []struct {
		actual   []string
		expected []string
	}{
		{
			actual:   explodePath(""),
			expected: nil,
		},
		{
			actual:   explodePath("/"),
			expected: nil,
		},
		{
			actual:   explodePath("//"),
			expected: nil,
		},
		{
			actual:   explodePath("///"),
			expected: nil,
		},
		{
			actual:   explodePath("/foo"),
			expected: []string{"foo"},
		},
		{
			actual:   explodePath("/foo/bar"),
			expected: []string{"foo", "bar"},
		},
		{
			actual:   explodePath("/foo/bar/baz"),
			expected: []string{"foo", "bar", "baz"},
		},
		{
			actual:   explodePath("/foo/bar/baz/"),
			expected: []string{"foo", "bar", "baz"},
		},
	}
	for _, c := range cases {
		if !reflect.DeepEqual(c.actual, c.expected) {
			t.Errorf("actual:%v expected:%v", c.actual, c.expected)
		}
	}
}
Since the []string variable r is declared without a capacity, it is inferred that memory efficiency is likely to be poor (append has to reallocate as the slice grows).
The following benchmark, prepared for verification, compares appending to slices declared in different ways, along with its results.
package main

import "testing"

func BenchmarkSliceLen0Cap0(b *testing.B) {
	var s []int
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		s = append(s, i)
	}
	b.StopTimer()
}

func BenchmarkSliceLenN(b *testing.B) {
	var s = make([]int, b.N)
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		s = append(s, i)
	}
	b.StopTimer()
}

func BenchmarkSliceLen0CapN(b *testing.B) {
	var s = make([]int, 0, b.N)
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		s = append(s, i)
	}
	b.StopTimer()
}
goos: darwin
goarch: amd64
pkg: example.com
cpu: VirtualApple @ 2.50GHz
BenchmarkSliceLen0Cap0 100000000 11.88 ns/op 45 B/op 0 allocs/op
BenchmarkSliceLenN 78467056 23.60 ns/op 65 B/op 0 allocs/op
BenchmarkSliceLen0CapN 616491007 5.057 ns/op 8 B/op 0 allocs/op
PASS
ok example.com 6.898s
This result suggests that the code could be somewhat more efficient by specifying the capacity.
Therefore, modify explodePath as follows.
func explodePath(path string) []string {
	s := strings.Split(path, "/")
	// var r []string
	r := make([]string, 0, strings.Count(path, "/")) // specify capacity
	for _, str := range s {
		if str != "" {
			r = append(r, str)
		}
	}
	return r
}
Digging a little deeper leads to an even better implementation.
func explodePath(path string) []string {
	splitFn := func(c rune) bool {
		return c == '/'
	}
	return strings.FieldsFunc(path, splitFn)
}
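One point worth noting (a quick check of my own): strings.FieldsFunc skips empty fields by design, so it handles the edge cases from the test table, where leading, trailing, and repeated delimiters produce no empty strings:

```go
package main

import (
	"fmt"
	"strings"
)

func explodePath(path string) []string {
	splitFn := func(c rune) bool {
		return c == '/'
	}
	return strings.FieldsFunc(path, splitFn)
}

func main() {
	// FieldsFunc treats runs of delimiters as a single separator
	// and never emits empty fields.
	fmt.Println(explodePath("///"))           // prints []
	fmt.Println(explodePath("/foo/bar/baz/")) // prints [foo bar baz]
}
```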
We will compare benchmarks for three patterns: the original explodePath implementation, an implementation that reserves slice capacity, and an implementation using strings.FieldsFunc.
package main

import (
	"strings"
	"testing"
)

func explodePath(path string) []string {
	s := strings.Split(path, "/")
	var r []string
	for _, str := range s {
		if str != "" {
			r = append(r, str)
		}
	}
	return r
}

func explodePathCap(path string) []string {
	s := strings.Split(path, "/")
	r := make([]string, 0, strings.Count(path, "/"))
	for _, str := range s {
		if str != "" {
			r = append(r, str)
		}
	}
	return r
}

func explodePathFieldsFunc(path string) []string {
	splitFn := func(c rune) bool {
		return c == '/'
	}
	return strings.FieldsFunc(path, splitFn)
}

func BenchmarkExplodePath(b *testing.B) {
	paths := []string{"", "/", "///", "/foo", "/foo/bar", "/foo/bar/baz"}
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		for _, v := range paths {
			explodePath(v)
		}
	}
	b.StopTimer()
}

func BenchmarkExplodePathCap(b *testing.B) {
	paths := []string{"", "/", "///", "/foo", "/foo/bar", "/foo/bar/baz"}
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		for _, v := range paths {
			explodePathCap(v)
		}
	}
	b.StopTimer()
}

func BenchmarkExplodePathFieldsFunc(b *testing.B) {
	paths := []string{"", "/", "///", "/foo", "/foo/bar", "/foo/bar/baz"}
	b.StartTimer()
	for i := 0; i < b.N; i++ {
		for _, v := range paths {
			explodePathFieldsFunc(v)
		}
	}
	b.StopTimer()
}
goos: darwin
goarch: amd64
pkg: example.com
cpu: VirtualApple @ 2.50GHz
BenchmarkExplodePath 1690340 722.2 ns/op 432 B/op 12 allocs/op
BenchmarkExplodePathCap 1622161 729.5 ns/op 416 B/op 11 allocs/op
BenchmarkExplodePathFieldsFunc 4948364 239.5 ns/op 96 B/op 3 allocs/op
PASS
ok example.com 5.685s
The implementation using strings.FieldsFunc has the best performance, so it is adopted.
Measuring Effectiveness
Check the results after improving the implementation of explodePath.
Benchmarks
# Before
goos: darwin
goarch: amd64
pkg: github.com/bmf-san/goblin
cpu: VirtualApple @ 2.50GHz
BenchmarkStatic1 5072353 240.1 ns/op 128 B/op 4 allocs/op
BenchmarkStatic5 2491546 490.0 ns/op 384 B/op 6 allocs/op
BenchmarkStatic10 1653658 729.6 ns/op 720 B/op 7 allocs/op
BenchmarkWildCard1 1602606 747.3 ns/op 456 B/op 9 allocs/op
BenchmarkWildCard5 435784 2716 ns/op 1016 B/op 23 allocs/op
BenchmarkWildCard10 246729 5033 ns/op 1680 B/op 35 allocs/op
BenchmarkRegexp1 1647258 733.2 ns/op 456 B/op 9 allocs/op
BenchmarkRegexp5 456652 2641 ns/op 1016 B/op 23 allocs/op
BenchmarkRegexp10 251998 4780 ns/op 1680 B/op 35 allocs/op
PASS
ok github.com/bmf-san/goblin 14.304s
# After
go test -bench=. -cpu=1 -benchmem -count=1
goos: darwin
goarch: amd64
pkg: github.com/bmf-san/goblin
cpu: VirtualApple @ 2.50GHz
BenchmarkStatic1 10310658 117.7 ns/op 32 B/op 1 allocs/op
BenchmarkStatic5 4774347 258.1 ns/op 96 B/op 1 allocs/op
BenchmarkStatic10 2816960 435.8 ns/op 176 B/op 1 allocs/op
BenchmarkWildCard1 1867770 653.4 ns/op 384 B/op 6 allocs/op
BenchmarkWildCard5 496778 2484 ns/op 928 B/op 13 allocs/op
BenchmarkWildCard10 258508 4538 ns/op 1560 B/op 19 allocs/op
BenchmarkRegexp1 1978704 608.4 ns/op 384 B/op 6 allocs/op
BenchmarkRegexp5 519240 2394 ns/op 928 B/op 13 allocs/op
BenchmarkRegexp10 280741 4309 ns/op 1560 B/op 19 allocs/op
PASS
ok github.com/bmf-san/goblin 13.666s
Comparing before and after improvements, we can say that the overall trend is toward improvement.
Profiling
pprof's graph.
pprof's top.
You can see that the bottleneck has moved to strings.FieldsFunc, which is called within explodePath.
Further Improvements
Here is the tag released after making other improvements to goblin.
6.0.0
Unfortunately, there were no fundamental improvements to the data structure and algorithm, so to speak, so there are no remarkable gains.
I suspect the data structure and algorithm currently in use are reaching their limits. (I have seen other routers use more advanced trees, so that is probably true...)
This is somewhat off topic, but I created a benchmarker to compare goblin with other routers and look for hints for improvement.
It was interesting to compare them and see how much the results vary.
I would like to keep improving it by studying other router implementations and learning about the more advanced tree structures I failed to adopt before.
Summary
- Go makes benchmarking and profiling easy.
- Don't guess, measure!
- It is hard to get big results with small improvements (that's true)
Reference
- github.com - google/pprof
- github.com - dgryski/go-perfbook
- dave.cheney.net - High Performance Go Workshop
- go.dev - Profiling Go Programs
- go.dev - A Guide to the Go Garbage Collector
- developer.so-tech.co.jp
- segment.com - Allocation efficiency in high-performance Go services
- blog.logrocket.com - Benchmarking in Golang: Improving function performance
- medium.com - Go code refactoring : the 23x performance hunt
- medium.com - Go言語のプロファイリングツール、pprofのWeb UIがめちゃくちゃ便利なので紹介する (Introducing the Web UI of pprof, Go's profiling tool, because it is incredibly useful)
- teivah.medium.com - Good Code vs Bad Code in Golang
- hnakamur.github.io - goで書いたコードがヒープ割り当てになるかを確認する方法 (How to check whether Go code causes heap allocations)
- glog.kazu69.net - Goのメモリ管理を眺めてみた (A look at Go's memory management)
- dsas.blog.klab.org - Goでアロケーションに気をつけたコードを書く方法 (How to write allocation-conscious code in Go)
- tech.speee.jp - Goのロギングライブラリから見たゼロアロケーション (Zero allocation as seen from Go logging libraries)
- kawasin73.hatenablog.com - メモリアロケーションに対する罪悪感 (A guilty conscience about memory allocation)