TL;DR
- On arm64 the negative-branch check typically compiles to a single-bit test + branch (e.g., TBNZ on the sign bit). On amd64 it's a flags-setting ALU instruction (TESTQ) followed by a conditional jump and an unconditional jump.
- We'll compile and disassemble a small toy program written in Go.
- A simple micro-benchmark suggests that in this toy program the conditional itself costs very little compared to everything around it. In real code predictability usually matters more than the exact encoding of the branch.
The toy program
We'll keep the program trivial on purpose: call getNumber(), branch on x < 0, print, done. Since we're targeting 64-bit GOARCH values (arm64 and amd64), int is 64-bit and the sign bit we care about is bit 63.
Install the latest Go version. I've used v1.25.4 here. Then save this main program into a directory.
main.go:
package main

import (
	"fmt"
	"math/rand"
)

func getNumber() int { return int(int64(rand.Uint64())) }

func main() {
	x := getNumber()
	if x < 0 {
		fmt.Println("branch")
		return
	}
	fmt.Println("main")
}
Build versions with and without inlining/optimisations for both architectures. The unoptimised one uses gcflags to:
- Disable inlining with -l
- Disable most optimisations with -N
These flags are passed to go tool compile at build time. See the compile command docs for the full list of flags and background info.
# arm64
GOARCH=arm64 go build -o bin/main_arm64_optimised .
GOARCH=arm64 go build -gcflags='all=-N -l' -o bin/main_arm64_unoptimised .
# amd64
GOARCH=amd64 go build -o bin/main_amd64_optimised .
GOARCH=amd64 go build -gcflags='all=-N -l' -o bin/main_amd64_unoptimised .
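By the way, if you only want to keep a single function out of the inliner, rather than disabling it for the whole build with -l, the //go:noinline compiler directive is a handy alternative. Applied to our toy getNumber it would look like this:
//go:noinline
func getNumber() int { return int(int64(rand.Uint64())) }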
Side note: read this post by Dave Cheney if you're interested in how mid-stack inlining works in Go!
Disassemble the things
Recall that our main point of interest is line 12 of main.go:
if x < 0 { ... }
ARM64 disassembly
Let's first disassemble the unoptimised arm64 binary, restricting the output to the main.main symbol.
go tool objdump -s main.main bin/main_arm64_unoptimised
This yields the following:
main.go:12 0x1000bc540 b7f80040 TBNZ $63, R0, 2(PC)
main.go:12 0x1000bc544 14000015 JMP 21(PC)
The first three columns are:
- The source line main.go:12 to which the instruction maps.
- The instruction addresses 0x1000bc540 and 0x1000bc544.
- The encoded machine bytes b7f80040 and 14000015.
TBNZ is an AArch64-specific instruction which literally means "test bit and branch if non-zero".
- In this case it tests whether bit 63 (the sign bit) of R0 (holding the variable x) is non-zero.
- If it is, execution branches forward past the next instruction into the negative-branch body. The 2(PC) operand means "two instructions ahead of the current one"; since every AArch64 instruction is 4 bytes, the target is 0x1000bc540 + 8 = 0x1000bc548, i.e. the instruction right after the JMP.
- If it isn't, the JMP executes and jumps forward 21 instructions, over the negative-branch body, to the x ≥ 0 path of the main function.
Let's compare it to the optimised one:
go tool objdump -s main.main bin/main_arm64_optimised
This yields:
main.go:12 0x10009e180 b6f80220 TBZ $63, R0, 17(PC)
TBZ is another AArch64-specific instruction which means "test bit and branch if zero". But why is it TBZ here (and not TBNZ or CMP)?
Our condition is x < 0, i.e. "sign bit set". After rand.Uint64() the value is cast to int64 without changing bits, so "negative" is exactly "bit 63 is 1". The compiler recognises this as a sign-bit check and lowers it to a single bit test plus branch.
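If you want a quick sanity check of that equivalence, here's a tiny snippet (illustrative only, not part of the toy program) that applies the same bit-preserving cast as getNumber and compares x < 0 against bit 63 directly:
package main

import "fmt"

func main() {
	// "x < 0" is true exactly when bit 63 of the underlying 64-bit value is 1.
	for _, u := range []uint64{5, 0, 1 << 63, 1<<63 | 42} {
		x := int(int64(u)) // same bit-preserving cast as getNumber()
		fmt.Println(x, "x < 0:", x < 0, "bit 63:", u>>63)
	}
}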
In this build, the compiler lays out the "x < 0" path as the fallthrough and the "x ≥ 0" path later, so it can encode the condition as a single TBZ that branches over the fallthrough when the sign bit is zero. If the layout flipped, you'd see TBNZ or even a compare-and-branch sequence instead.
Conceptually the surrounding objdump looks roughly like this:
main.go:10 prologue / stack check
main.go:8 CALL math/rand.Uint64
main.go:12 TBZ $63, R0, 17(PC) # if sign bit == 0 → jump forward (x >= 0)
main.go:13 … # fallthrough: x < 0 path
# → fmt.Fprintln("branch"), return
… return
main.go:16 … # target of TBZ: x >= 0 path
# → fmt.Fprintln("main"), return
So:
- If x < 0 (sign bit = 1): TBZ not taken -> execute "branch" path and return.
- If x ≥ 0 (sign bit = 0): TBZ taken -> skip over the branch path into the "main" path.
You may still see a CMP/BPL style sequence in other shapes of code (e.g., range checks like x < 100) or with different optimisation decisions. A pure sign test is where TBZ/TBNZ shines.
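If you want to see that shape for yourself, a hypothetical helper like the one below (not part of the toy program) is enough: add it to the same package, call it from main with a value the compiler can't constant-fold, e.g. checkRange(getNumber()), rebuild, and disassemble main.checkRange the same way.
//go:noinline
func checkRange(x int) {
	// A range check like this can't be answered by testing a single bit, so the
	// compiler has to emit a compare-and-branch (a CMP against 100 followed by a
	// conditional branch) rather than TBZ/TBNZ.
	if x < 100 {
		fmt.Println("small")
		return
	}
	fmt.Println("large")
}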
All in all a pretty nice optimisation. Exact block layout and call sequences can change between Go versions and build flags, so don't rely on this exact listing, just the general shape.
AMD64 disassembly
Let's then disassemble the AMD64 unoptimised binary:
go tool objdump -s main.main bin/main_amd64_unoptimised
From line 12 we get this:
main.go:12 0x10d0f00 4885c0 TESTQ AX, AX
main.go:12 0x10d0f03 7c02 JL 0x10d0f07
main.go:12 0x10d0f05 eb54 JMP 0x10d0f5b
Before moving on, it's worth mentioning that on AMD64 (x86-64) the processor status register RFLAGS has these flags, which are set by ALU operations:
- ZF (bit 6): zero flag, set to 1 if the result of the operation is 0, otherwise 0.
- SF (bit 7): sign flag, copies the most significant bit of the result; for signed values 1 means negative.
- OF (bit 11): overflow flag, set to 1 if a signed overflow occurred.
On AArch64 you also have NZCV condition flags and generic conditional branches (B.EQ, B.LT, etc.), but TBZ/TBNZ bypass these condition flags and test the selected bit directly.
Breaking it down:
- TESTQ AX, AX is a bitwise test of AX against itself. It leaves AX unchanged and only sets flags (ZF set if x == 0, SF reflects the sign, OF is cleared).
- JL 0x10d0f07 is "jump if less" for signed values, which means "branch if SF != OF". After TESTQ we know OF = 0, so JL is effectively "jump if SF == 1", i.e. if the sign bit is set.
- JMP 0x10d0f5b is an unconditional jump that skips over the negative-branch block to the "x >= 0" (or "main") path.
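If the flag juggling feels abstract, here's a tiny Go sketch (purely illustrative, this is not what the CPU literally executes) of how TESTQ x, x followed by JL boils down to a sign-bit check:
package main

import "fmt"

func main() {
	for _, x := range []int64{42, 0, -7} {
		// TESTQ x, x ANDs x with itself, sets the flags, and discards the result.
		res := x & x
		zf := res == 0 // ZF: result is zero
		sf := res < 0  // SF: copy of the result's sign bit
		of := false    // OF: always cleared by TEST
		jl := sf != of // JL: taken if SF != OF, which after TEST is simply "sign bit set"
		fmt.Printf("x=%d ZF=%v SF=%v JL taken=%v\n", x, zf, sf, jl)
	}
}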
Micro-benchmark: does any of this matter?
Let's do a micro-benchmark that's mostly about "shape": we're not rigorously measuring branch-mispredict cost, just showing that the single "x < 0" check does not dominate the function.
Save this as bench_test.go next to main.go:
package main

import "testing"

var sink int

func hotpathPredictable(x int) {
	if x < 0 {
		sink++
	}
}

func hotpathRandom(x int) {
	if x < 0 {
		sink++
	}
}

func BenchmarkBranchPredictable(b *testing.B) {
	x := -1
	for i := 0; i < b.N; i++ {
		hotpathPredictable(x)
	}
}

func BenchmarkBranchRandom(b *testing.B) {
	var x int
	r := uint64(1)
	for i := 0; i < b.N; i++ {
		// cheap LCG to flip sign bit “randomly”
		r = r*1103515245 + 12345
		x = int(int64(r)) // sign depends on high bit of r
		hotpathRandom(x)
	}
}
Run it with default settings, letting inlining and optimisation do their thing:
go test -count=10 -bench=BenchmarkBranchPredictable | tee predictable_optimised.out
Then run it by disabling inlining and optimisations:
go test -count=10 -bench=BenchmarkBranchPredictable -gcflags=all='-N -l' | tee predictable_unoptimised.out
And the same for the random version:
go test -count=10 -bench=BenchmarkBranchRandom | tee random_optimised.out
go test -count=10 -bench=BenchmarkBranchRandom -gcflags=all='-N -l' | tee random_unoptimised.out
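Since the runs are teed into .out files, you can compare them with benchstat (from golang.org/x/perf/cmd/benchstat) if you have it installed, e.g.:
benchstat predictable_optimised.out predictable_unoptimised.out
benchstat random_optimised.out random_unoptimised.out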
On my M1 Pro (arm64), the predictable branch averages about 2.2 ns/op. Making the condition pseudo-random via a tiny integer LCG bumps that to about 3.6 ns/op in an optimised build (~1.6× slower), and to about 5.7 ns/op without optimisations.
The predictable case barely changes between optimised and unoptimised builds, while the random case is hit much harder. That matches the intuitive picture: the branch itself is cheap when the predictor can learn it, but once you add noise and extra work around it, the cost of missed predictions and extra instructions starts to dominate.
Closing
This is a tiny example, but it's a nice lens into how Go maps simple conditions onto each instruction set architecture (or ISA).
Finally, I do want to mention that the micro-benchmark here is intentionally simple and not a rigorous study of branch prediction. Its main purpose is to show that the conditional itself compiles to a handful of instructions and doesn't dominate runtime in this toy.
