borislav nikolov for REKKI

Interface Dispatch

TLDR:

Always measure before you optimize. I introduced a super shitty bug into https://github.com/rekki/go-query because I thought the bottleneck was in the binary search implementation, and gained nothing; calling interface methods in hot for loops adds ~10-15 instructions per call.

Hi,

I want to illustrate why intuition in the field of performance is
often misleading, and just like propaganda, the more you think you are immune to the illusion, the more effect it has on you.

Let me tell you how I introduced a terrible, unreproducible, context-dependent bug into a library we use everywhere.

Cost of simple things

First, we will discuss the cost of simple things, like counting from 1
to a million.

package main

import "testing"

var sum = 0

func BenchmarkNoop(b *testing.B) {
        for i := 0; i < b.N; i++ {
                sum++
        }
}

// go test -bench=.
// BenchmarkNoop-8         873429528                1.36 ns/op

per iteration the for loop needs to do one increment, one comparison and one jump

for i := 0; i < 1000000; i++ {
}

is something like

0: inc i
1: cmp i, 1000000
2: jge 10
3: ... code
4: jmp 0
10: .. rest of program

you can also think about it implemented with goto:

i = 0
check:
if i >= 1000000
   goto done

for loop code
...
goto check

done:
rest of program

(it is also very interesting to investigate the modern branch
predictors)

To mitigate the loop overhead, it is sometimes possible to unroll the
loop, which turns the body into something like:

i++
code
i++
code
i++
code
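
In Go, a hand-unrolled summing loop might look like this sketch
(sumUnrolled is a made-up helper just to show the shape; whether
unrolling actually helps is something only a measurement can tell):

// a minimal sketch of manual loop unrolling: four additions per
// compare-and-jump, then a small loop for the leftover elements
func sumUnrolled(xs []int) int {
        s := 0
        i := 0
        for ; i+4 <= len(xs); i += 4 {
                s += xs[i] + xs[i+1] + xs[i+2] + xs[i+3]
        }
        for ; i < len(xs); i++ {
                s += xs[i]
        }
        return s
}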

Of course, not all instructions are created equal. Today it is far
from intuitive what exactly is happening (especially now that people
don't own the computers they run their code on), but for me a good
rule of thumb is:

  • bitwise operations, addition and local jumps are fast
  • division by a power of 2 is just a shift (x >> 1), so it is fast; general division is a search https://youtu.be/o4-CwDo2zpg?t=354
  • same with multiplication (x << 1); multiplying or dividing by anything that is not a power of 2 is slow
  • almost all other math is slow
  • locality is king
  • calling a non-inlined function in a hot loop (for i := 0; i < n; i++ { f() }) adds ~10 instructions per call
  • atomic operations are ~25 instructions (a rough benchmark sketch for the last two numbers follows this list)
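
A rough way to sanity-check those last two numbers on your own machine
is something like the sketch below (results will differ per CPU and Go
version; the names are made up for illustration):

package main

import (
        "sync/atomic"
        "testing"
)

var sink int64

//go:noinline
func bump(x int64) int64 { return x + 1 }

// plain increment in a hot loop
func BenchmarkIncrement(b *testing.B) {
        for i := 0; i < b.N; i++ {
                sink++
        }
}

// non-inlined function call in a hot loop
func BenchmarkCall(b *testing.B) {
        for i := 0; i < b.N; i++ {
                sink = bump(sink)
        }
}

// atomic add in a hot loop
func BenchmarkAtomic(b *testing.B) {
        for i := 0; i < b.N; i++ {
                atomic.AddInt64(&sink, 1)
        }
}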

My CPU is ~3GHz, which means it can do roughly 3 * 10^9 simple things
per second, so 1.36 ns per iteration of the benchmark loop sounds
about right, considering it does about 4 things per iteration
(4 / (3 * 10^9) ≈ 1.3 ns).
(https://en.wikipedia.org/wiki/Arithmetic_logic_unit)
Now that we are so far abstracted from the machine, it is incredibly
difficult to have good intuition about how much things cost, so I
usually multiply my guess by 5 or so (to be cloud native, haha).

Anyway, let's move to the cost of calling a function.

Function calling

Calling functions is quite simple. We need to prepare the function's
parameters and then call it. This involves putting their values on the
call stack (offset from SP) and then invoking CALL. The function uses
those parameters, puts its results back on the stack, and invokes RET.

// tell the go compiler to not inline the function by doing go:noinline
// semantic comment
//go:noinline
func add(a, b int) (int, int) {
        return a + b, 1000
}
// go build main.go && go tool objdump -s main.add main
MOVQ $0x0, 0x18(SP)
MOVQ $0x0, 0x20(SP)
MOVQ 0x8(SP), AX
ADDQ 0x10(SP), AX
MOVQ AX, 0x18(SP)
MOVQ $0x3e8, 0x20(SP)
RET

func main() {
        a, b := add(100, 200)
        add(a, b)
}

// go build main.go && go tool objdump -s main.main main
MOVQ $0x64, 0(SP)
MOVQ $0xc8, 0x8(SP)
CALL main.add(SB)
MOVQ 0x10(SP), AX
MOVQ AX, 0x38(SP)
MOVQ 0x18(SP), AX
MOVQ AX, 0x30(SP)
MOVQ 0x38(SP), AX
MOVQ AX, 0x28(SP)
MOVQ 0x30(SP), AX
MOVQ AX, 0x20(SP)
MOVQ 0x28(SP), AX
MOVQ AX, 0(SP)
MOVQ 0x20(SP), AX
MOVQ AX, 0x8(SP)
CALL main.add(SB)

So calling a function that takes two parameters and returns two values
does at least 6 things (prepare the arguments on the stack, CALL,
write the return values back on the stack, RET).

Inlining

Because of the calling overhead, all compiled languages try to inline functions so they won't have to do the CALL, RET and stack preparation work.
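
In Go you can see what the compiler decides to inline by asking it to
print its optimization decisions (the flag is standard; the exact
wording and line numbers below are just the general shape of the
output, which varies between Go versions):

go build -gcflags='-m' .
# ./main.go:8:6: can inline add
# ./main.go:15:10: inlining call to add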

First let's discuss how the CPU executes code. In the Von Neumann
architecture (which covers pretty much all modern computers), code and
data live in the same memory, so it is only in the eye of the beholder
whether something is code or data, pretty much the same way it is up
to you whether 01100001 is the character 'a' or the integer 97.
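
For example, whether the byte 01100001 is the number 97 or the
character 'a' is purely a matter of how you choose to print it:

package main

import "fmt"

func main() {
        b := byte('a')
        fmt.Printf("%d %c %08b\n", b, b, b) // 97 a 01100001
}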

https://en.wikipedia.org/wiki/Von_Neumann_architecture
https://en.wikipedia.org/wiki/Harvard_architecture (alternative
architecture that has separate memory for code and data)

Let's create a super simple 8-bit CPU with two 1-byte general purpose
registers, R0 and R1:

1 MOV addr, r0   copy addr in r0
2 MOV addr, r1   copy addr in r1
3 MOV r0, addr   copy r0 in addr
4 MOV r1, addr   copy r1 in addr
5 MOV r0, $value store $value in r0
6 MOV r1, $value store $value in r1
7 ADD  adds r0 and r1 and stores value in r0
8 JMP  addr jump to given address
9 HAL  stop

(from Richard Buckland's 4 bit CPU)
https://www.youtube.com/watch?v=gTeDX4yAdyU

in our example, we will set the CPU to execute from memory address 0,
and an example program that counts forever would look like:

0000: MOV r0, $0
0002: MOV r1, $1
0004: ADD
0005: MOV r0, 10
0007: JMP 2

now let's look at the memory layout byte by byte

addr:0  00000101 // mov r0, $0
addr:1  00000000 
addr:2  00000110 // mov r1, $1
addr:3  00000001 
addr:4  00000111 // add
addr:5  00000011 // mov r0, 10
addr:6  00001010
addr:7  00001000 // jmp 1
addr:8  00000001
addr:9  00000000 // nothing
addr:10 00000000 // result of mov r0, 10

As you can see, there is no difference between address 0 and
address 1. The only thing stopping the CPU from executing addresses 8,
9 and 10 as code is the JMP at address 7, which makes it go back to
address 2.

The way the CPU actually executes this is by using the
fetch-decode-execute cycle (https://en.wikipedia.org/wiki/Instruction_cycle):

Fetch Stage: The next instruction is fetched from the memory address
that is currently stored in the program counter and stored into the
instruction register. At the end of the fetch operation, the PC points
to the next instruction that will be read at the next cycle.

Decode Stage: During this stage, the encoded instruction presented in
the instruction register is interpreted by the decoder.

Execute Stage: The control unit of the CPU passes the decoded
information as a sequence of control signals to the relevant function
units of the CPU to perform the actions required by the instruction,
such as reading values from registers, passing them to the ALU to
perform mathematical or logic functions on them, and writing the
result back to a register. If the ALU is involved, it sends a
condition signal back to the CU. The result generated by the operation
is stored in the main memory or sent to an output device. Based on the
feedback from the ALU, the PC may be updated to a different address
from which the next instruction will be fetched.

Repeat Cycle
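
For our toy CPU the whole cycle can be sketched in a few lines of Go
(an illustration, not a real emulator; only the opcodes the example
program uses are implemented):

package main

import "fmt"

// run executes the toy program stored in mem, starting at address 0
func run(mem []byte) {
        var r0, r1 byte
        pc := byte(0)
        for int(pc) < len(mem) {
                op := mem[pc] // fetch
                switch op {   // decode + execute
                case 5: // MOV r0, $value
                        r0 = mem[pc+1]
                        pc += 2
                case 6: // MOV r1, $value
                        r1 = mem[pc+1]
                        pc += 2
                case 7: // ADD
                        r0 += r1
                        pc++
                case 3: // MOV r0, addr
                        mem[mem[pc+1]] = r0
                        pc += 2
                case 8: // JMP addr
                        pc = mem[pc+1]
                default: // HAL or unknown opcode: stop
                        return
                }
        }
}

func main() {
        mem := make([]byte, 16)
        // the example program from above, ending in HAL instead of looping forever
        copy(mem, []byte{5, 0, 6, 1, 7, 3, 10, 9})
        run(mem)
        fmt.Println(mem[10]) // 1
}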

Accessing RAM takes on the order of 100 ns, so if we had to pay 100 ns
for every instruction the whole thing would be horrible; to mitigate
this, CPUs have a hierarchy of caches (TLB[21], L1, L2, L3).

L1 cache reference                           0.5 ns
Executing Instruction                        1   ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns 14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns 20x  L2 cache,
                                                    200x L1 cache
Enter fullscreen mode Exit fullscreen mode

http://norvig.com/21-days.html#answers
https://gist.github.com/jboner/2841832
http://www.cim.mcgill.ca/~langer/273/18-notes.pdf
https://people.freebsd.org/~lstewart/articles/cpumemory.pdf

The L1 cache is usually split into an instruction cache and a data
cache, and it is on the order of ~32KB per core (how this cache
hierarchy works in modern multi-core CPUs is a topic of its own, but
let's think as if we have only one core).

There are other mitigations for how slow the fetch-execute cycle is,
such as instruction pipelining, vectorization, out-of-order execution,
etc. (you can check them out at the bottom of the instruction cycle
wiki page https://en.wikipedia.org/wiki/Instruction_cycle)

https://en.wikipedia.org/wiki/Instruction_pipelining
https://en.wikipedia.org/wiki/Hazard_(computer_architecture)
https://en.wikipedia.org/wiki/Out-of-order_execution
https://en.wikipedia.org/wiki/Speculative_execution

Modern CPUs are incredibly complicated systems, and the model I use,
that in most cases (unless proven otherwise) "it does one thing at a
time", works OK for back-of-the-napkin calculations (though, e.g.,
some modern hashing algorithms that abuse vectorization can hash
incredible amounts of data).

So, inlining is a dance.

I will refer to:
http://www.cs.technion.ac.il/users/yechiel/c++-faq/inline-and-perf.html
[9.3] Do inline functions improve performance?
Yes and no. Sometimes. Maybe.

There are no simple answers. Inline functions might make the code
faster, they might make it slower. They might make the executable
larger, they might make it smaller. They might cause thrashing,
they might prevent thrashing. And they might be, and often are,
totally irrelevant to speed.

inline functions might make it faster:
  As shown above, procedural integration might remove a bunch of
  unnecessary instructions, which might make things run faster.

inline functions might make it slower: 
  Too much inlining might cause code bloat, which might cause
  "thrashing" on demand-paged virtual-memory systems. In other
  words, if the executable size is too big, the system might spend
  most of its time going out to disk to fetch the next chunk of
  code.

inline functions might make it larger:
  This is the notion of code bloat, as described above. For
  example, if a system has 100 inline functions each of which
  expands to 100 bytes of executable code and is called in 100
  places, that's an increase of 1MB. Is that 1MB going to cause
  problems? Who knows, but it is possible that that last 1MB could
  cause the system to "thrash," and that could slow things down.

inline functions might make it smaller:
  The compiler often generates more code to push/pop
  registers/parameters than it would by inline-expanding the
  function's body. This happens with very small functions, and it
  also happens with large functions when the optimizer is able to
  remove a lot of redundant code through procedural integration —
  that is, when the optimizer is able to make the large function
  small.

inline functions might cause thrashing:
  Inlining might increase the size of the binary executable, and
  that might cause thrashing.

inline functions might prevent thrashing
  The working set size (number of pages that need to be in memory
  at once) might go down even if the executable size goes up. When
  f() calls g(), the code is often on two distinct pages; when the
  compiler procedurally integrates the code of g() into f(), the
  code is often on the same page.

inline functions might increase the number of cache misses: 
  Inlining might cause an inner loop to span across multiple lines
  of the memory cache, and that might cause thrashing of the
  memory-cache.

inline functions might decrease the number of cache misses:
  Inlining usually improves locality of reference within the
  binary code, which might decrease the number of cache lines
  needed to store the code of an inner loop. This ultimately could
  cause a CPU-bound application to run faster.

inline functions might be irrelevant to speed: 
  Most systems are not CPU-bound. Most systems are I/O-bound,
  database-bound or network-bound, meaning the bottleneck in the
  system's overall performance is the file system, the database or
  the network. Unless your "CPU meter" is pegged at 100%, inline
  functions probably won't make your system faster. (Even in
  CPU-bound systems, inline will help only when used within the
  bottleneck itself, and the bottleneck is typically in only a
  small percentage of the code.)

There are no simple answers: You have to play with it to see what
is best. Do not settle for simplistic answers like, "Never use
inline functions" or "Always use inline functions" or "Use inline
functions if and only if the function is less than N lines of
code." These one-size-fits-all rules may be easy to write down,
but they will produce sub-optimal results.

Interfaces

Let's say we have a common case where a struct gets an io.Writer
interface to write into:

type Database struct {
        writer io.Writer
}
func (d *Database) SetWriter(f io.Writer) error {
        d.writer = f
        return nil
}
....
func (d *Database) writeBlob(b []byte) error {
        checksum := hash(b)
        _, err := d.writer.Write(checksum)
        if err != nil {
                return err
        }
        _, err = d.writer.Write(b)
        return err
}
....

Now, because d.writer can change at runtime, the compiler can never
know what is on the other end of d.writer, so it can never inline the
call even if it wants to (you can imagine the actual os.File.Write is
just doing the write syscall).

Another issue is that the thing on the other end of d.writer could be
a pointer, so it has to be checked, because if it is nil (0) we must
panic accordingly.

The way this works is pretty much:

if x == nil
   goto panic
... code

return x

panic:
help build a stacktrace

(pseudo code)

START:
  CMPQ 0x10(CX), SP
  JBE CALL_PANIC

  MOVQ 0x40(SP), AX
  TESTQ AX, AX
  JLE PANIC
  ... work work ...
TRACE:
  MOVQ CX, 0x50(SP)
  MOVQ 0x28(SP), BP
  ADDQ $0x30, SP
  RET
PANIC:
  XORL CX, CX // cx = 0
  JMP TRACE
CALL_PANIC:
  CALL runtime.morestack_noctxt(SB)
  JMP START

That means every call to writeBlob has to check whether d.writer is
nil, because it can be and there is no way for the compiler to know at
compile time, and then it has to prepare the stack and call through
the interface.

How Slow is Dynamic Dispatch exactly?

More concrete example:

package main

type Operation interface {
        Apply() int
}

type Number struct {
        n int
}

func (x Number) Apply() int {
        return x.n
}

type Add struct {
        Operations []Operation
}

func (x Add) Apply() int {
        r := 0
        for _, v := range x.Operations {
                r += v.Apply()
        }
        return r
}

type Sub struct {
        Operations []Operation
}

func (x Sub) Apply() int {
        r := 0
        for _, v := range x.Operations {
                r -= v.Apply()
        }
        return r
}

type AddCustom struct {
        Operations []Number
}

func (x AddCustom) Apply() int {
        r := 0
        for _, v := range x.Operations {
                r += v.Apply()
        }
        return r
}

func main() {
        n := 0
        op := Add{Operations: []Operation{Number{n: 5}, Number{n: 6}}}
        n += op.Apply()

        opc := AddCustom{Operations: []Number{Number{n: 5}, Number{n: 6}}}
        n += opc.Apply()
}

Let's look at main.Add.Apply and main.AddCustom.Apply:

// go build main.go && go tool objdump main
TEXT main.Add.Apply(SB) main.go
MOVQ FS:0xfffffff8, CX
CMPQ 0x10(CX), SP
JBE 0x4526c7
SUBQ $0x30, SP
MOVQ BP, 0x28(SP)
LEAQ 0x28(SP), BP
MOVQ 0x40(SP), AX
TESTQ AX, AX
JLE 0x4526c3
MOVQ 0x38(SP), CX
XORL DX, DX
XORL BX, BX
JMP 0x452678
MOVQ 0x20(SP), SI
ADDQ $0x10, SI
MOVQ AX, DX
MOVQ CX, BX
MOVQ SI, CX
MOVQ CX, 0x20(SP)
MOVQ DX, 0x18(SP)
MOVQ BX, 0x10(SP)
MOVQ 0(CX), AX
MOVQ 0x8(CX), SI
MOVQ 0x18(AX), AX
MOVQ SI, 0(SP)
CALL AX
MOVQ 0x18(SP), AX
INCQ AX
MOVQ 0x10(SP), CX
ADDQ 0x8(SP), CX
MOVQ 0x40(SP), DX
CMPQ DX, AX
JL 0x452666
MOVQ CX, 0x50(SP)
MOVQ 0x28(SP), BP
ADDQ $0x30, SP
RET
XORL CX, CX
JMP 0x4526b4
CALL runtime.morestack_noctxt(SB)
JMP main.Add.Apply(SB)

TEXT main.AddCustom.Apply(SB) main.go
MOVQ 0x8(SP), AX
MOVQ 0x10(SP), CX
XORL DX, DX
XORL BX, BX
JMP 0x4526fa
MOVQ 0(AX)(DX*8), SI
INCQ DX
ADDQ SI, BX
CMPQ CX, DX
JL 0x4526f0
MOVQ BX, 0x20(SP)
RET

Benchmark:
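
The numbers below come from benchmarks along these lines (a sketch, in
a _test.go file next to the types above; the original harness may
differ slightly):

package main

import "testing"

// assumes the Operation/Add/AddCustom/Number types from the example above
var result int

func BenchmarkInterface(b *testing.B) {
        op := Add{Operations: []Operation{Number{n: 5}, Number{n: 6}}}
        for i := 0; i < b.N; i++ {
                result += op.Apply() // every inner Apply goes through the Operation interface
        }
}

func BenchmarkInline(b *testing.B) {
        op := AddCustom{Operations: []Number{Number{n: 5}, Number{n: 6}}}
        for i := 0; i < b.N; i++ {
                result += op.Apply() // concrete []Number, the inner calls can be inlined
        }
}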

goos: linux
goarch: amd64
BenchmarkInterface-8    171836106                6.81 ns/op
BenchmarkInline-8       424364508                2.70 ns/op
BenchmarkNoop-8         898746903                1.36 ns/op
PASS
ok      command-line-arguments  4.673s




Optimization

You know the golden rule: always measure first.

When I wrote https://github.com/rekki/go-query I thought the slowest
part was the binary search. I thought: well... the loop can't be
unrolled, and the branchy algorithm is too hard on the branch
predictor, so it is probably the slowest part, I will just optimize
it. And I did.

Total waste of time. The interface dispatch overhead was 70% of the
time, so not only did I waste my time, I also introduced a super
shitty bug that was not caught by 90% test coverage, and it stayed
out of production only by sheer luck.

Anyway, this is a reminder to myself: always measure.

Oh BTW, of course, this whole thing almost never hits you, ~10 extra
instructions per call never matter. (almost)

How to measure, though? What is executed in the benchmark is very
likely not the thing being executed in production; for example, I
could have a benchmark that is super fast and nice, but when the code
runs in prod, the thing gets inlined and starts page thrashing.

There is a whole spectrum of tools to use, from statistics-powered
micro-benchmark tools to flamegraph[19] visualizations, and the whole
topic is a post on its own, but I believe anything you do to measure
will be better than intuition.

Luckily, Go's profiling tools are incredible[20] and you can get good
insights very quickly.
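
For example, grabbing a CPU profile straight from a benchmark and
looking at the hottest functions is only a couple of commands away
(the flags are standard; the benchmark name is from the example
above):

go test -bench=BenchmarkInterface -cpuprofile cpu.out
go tool pprof cpu.out
(pprof) top
(pprof) list Apply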

BTW, I did have a profile for go-query before I went into optimizing,
and I saw that the cost was in the Next() call, but I did not dig
deeper into it and decided to start with optimizing the binary search
instead of looking carefully at the data. As the poet says:
"everywhere you look, you see what you are looking for".

--
Fri 7 Feb 18:08:42 CET 2020,
Borislav Nikolov (https://github.com/jackdoe)

References:
[1] https://github.com/rekki/go-query
Low level full text search library.

[2] https://youtu.be/o4-CwDo2zpg?t=354
Fastware - Andrei Alexandrescu

[3] https://www.youtube.com/watch?v=Qq_WaiwzOtI
CppCon 2014: Andrei Alexandrescu "Optimization Tips - Mo' Hustle
Mo' Problems"

[4] https://www.youtube.com/watch?v=FJJTYQYB1JQ
CppCon 2019: Andrei Alexandrescu “Speed Is Found In The Minds of
People"

[5] https://en.wikipedia.org/wiki/Arithmetic_logic_unit
[6] https://en.wikipedia.org/wiki/Von_Neumann_architecture
[7] https://en.wikipedia.org/wiki/Harvard_architecture
[8] https://en.wikipedia.org/wiki/Instruction_cycle
[9] https://en.wikipedia.org/wiki/Instruction_pipelining
[10] https://en.wikipedia.org/wiki/Hazard_(computer_architecture)
[11] https://en.wikipedia.org/wiki/Out-of-order_execution
[12] https://en.wikipedia.org/wiki/Speculative_execution

[13] https://www.youtube.com/watch?v=gTeDX4yAdyU
Lecture 3: Machine Code - Richard Buckland UNSW

[14] http://norvig.com/21-days.html#answers
Basic numbers for back of the napkin calculations

[15] https://gist.github.com/jboner/2841832
Basic numbers for back of the napkin calculations

[16] http://www.cim.mcgill.ca/~langer/273/18-notes.pdf
TLB misses

[17] https://people.freebsd.org/~lstewart/articles/cpumemory.pdf
What Every Programmer Should Know About Memory

[18] http://www.cs.technion.ac.il/users/yechiel/c++-faq/inline-and-perf.html
[9.3] Do inline functions improve performance?

[19] http://www.brendangregg.com/flamegraphs.html
Flamegraphs

[20] https://blog.golang.org/profiling-go-programs
Profiling Go Programs

[21] https://en.wikipedia.org/wiki/Translation_lookaside_buffer
Translation lookaside buffer

[22] https://godbolt.org
Compiler Explorer is an interactive online compiler which shows the
assembly output of compiled C++, Rust, Go (and many more) code.
