TL;DR
- On arm64 the negative-branch check typically compiles to a single-bit test + branch (e.g., TBNZ on the sign bit). On amd64 it's a flags-setting ALU instruction (TESTQ) followed by a conditional jump and an unconditional jump.
- We'll compile and disassemble a small toy program written in Go.
- A simple micro-benchmark suggests that in this toy program the conditional itself costs very little compared to everything around it. In real code predictability usually matters more than the exact encoding of the branch.
The toy program
We'll keep the program trivial on purpose: call getNumber(), branch on x < 0, print, done. Since we're targeting 64-bit GOARCH values (arm64 and amd64), int is 64-bit and the sign bit we care about is bit 63.
Install the latest Go version. I've used v1.25.4 here. Then save this main program into a directory.
main.go:
package main

import (
	"fmt"
	"math/rand"
)

func getNumber() int { return int(int64(rand.Uint64())) }

func main() {
	x := getNumber()
	if x < 0 {
		fmt.Println("branch")
		return
	}
	fmt.Println("main")
}
Build versions with and without inlining/optimisations for both architectures. The unoptimised one uses gcflags to:
- Disable inlining with -l
- Disable most optimisations with -N
These flags are passed to go tool compile at build time. See the compile command docs for the full list of flags and background info.
# arm64
GOARCH=arm64 go build -o bin/main_arm64_optimised .
GOARCH=arm64 go build -gcflags='all=-N -l' -o bin/main_arm64_unoptimised .
# amd64
GOARCH=amd64 go build -o bin/main_amd64_optimised .
GOARCH=amd64 go build -gcflags='all=-N -l' -o bin/main_amd64_unoptimised .
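By the way, if you only want to keep a single function out of the inliner, rather than disabling it for the whole build with -l, the //go:noinline compiler directive is a handy alternative. Applied to our toy getNumber it would look like this:
//go:noinline
func getNumber() int { return int(int64(rand.Uint64())) }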
Side note: read this post by Dave Cheney if you're interested in how mid-stack inlining works in Go!
Disassemble the things
Recall that our main point of interest is line 12 of main.go:
if x < 0 { ... }
ARM64 disassembly
Let's first disassemble the unoptimised arm64 binary, restricting the output to the main.main symbol.
go tool objdump -s main.main bin/main_arm64_unoptimised
This yields the following:
main.go:12 0x1000bc540 b7f80040 TBNZ $63, R0, 2(PC)
main.go:12 0x1000bc544 14000015 JMP 21(PC)
The first three columns are:
- The source line main.go:12 to which the instruction maps.
- The instruction addresses 0x1000bc540 and 0x1000bc544.
- The encoded machine bytes b7f80040 and 14000015.
TBNZ is an AArch64-specific instruction which literally means "test bit and branch if non-zero".
- In this case it tests whether bit 63 (the sign bit) of R0 (holding the variable x) is non-zero.
- If it is, execution branches forward past the next instruction into the negative-branch body. The 2(PC) operand means "two instructions ahead of the current one"; since every AArch64 instruction is 4 bytes, the target is 0x1000bc540 + 8 = 0x1000bc548, i.e. the instruction right after the JMP.
- If it isn't, the JMP executes and jumps forward 21 instructions, over the negative-branch body, to the x ≥ 0 path of the main function.
Let's compare it to the optimised one:
go tool objdump -s main.main bin/main_arm64_optimised
This yields:
main.go:12 0x10009e180 b6f80220 TBZ $63, R0, 17(PC)
TBZ is another AArch64-specific instruction which means "test bit and branch if zero". But why is it TBZ here (and not TBNZ or CMP)?
Our condition is x < 0, i.e. "sign bit set". After rand.Uint64() the value is cast to int64 without changing bits, so "negative" is exactly "bit 63 is 1". The compiler recognises this as a sign-bit check and lowers it to a single bit test plus branch.
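If you want a quick sanity check of that equivalence, here's a tiny snippet (illustrative only, not part of the toy program) that applies the same bit-preserving cast as getNumber and compares x < 0 against bit 63 directly:
package main

import "fmt"

func main() {
	// "x < 0" is true exactly when bit 63 of the underlying 64-bit value is 1.
	for _, u := range []uint64{5, 0, 1 << 63, 1<<63 | 42} {
		x := int(int64(u)) // same bit-preserving cast as getNumber()
		fmt.Println(x, "x < 0:", x < 0, "bit 63:", u>>63)
	}
}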
In this build, the compiler lays out the "x < 0" path as the fallthrough and the "x ≥ 0" path later, so it can encode the condition as a single TBZ that branches over the fallthrough when the sign bit is zero. If the layout flipped, you'd see TBNZ or even a compare-and-branch sequence instead.
Conceptually the surrounding objdump looks roughly like this:
main.go:10 prologue / stack check
main.go:8 CALL math/rand.Uint64
main.go:12 TBZ $63, R0, 17(PC) # if sign bit == 0 → jump forward (x >= 0)
main.go:13 … # fallthrough: x < 0 path
# → fmt.Fprintln("branch"), return
… return
main.go:16 … # target of TBZ: x >= 0 path
# → fmt.Fprintln("main"), return
So:
- If x < 0 (sign bit = 1): TBZ not taken -> execute "branch" path and return.
- If x ≥ 0 (sign bit = 0): TBZ taken -> skip over the branch path into the "main" path.
You may still see a CMP/BPL style sequence in other shapes of code (e.g., range checks like x < 100) or with different optimisation decisions. A pure sign test is where TBZ/TBNZ shines.
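If you want to see that shape for yourself, a hypothetical helper like the one below (not part of the toy program) is enough: add it to the same package, call it from main with a value the compiler can't constant-fold, e.g. checkRange(getNumber()), rebuild, and disassemble main.checkRange the same way.
//go:noinline
func checkRange(x int) {
	// A range check like this can't be answered by testing a single bit, so the
	// compiler has to emit a compare-and-branch (a CMP against 100 followed by a
	// conditional branch) rather than TBZ/TBNZ.
	if x < 100 {
		fmt.Println("small")
		return
	}
	fmt.Println("large")
}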
All in all a pretty nice optimisation. Exact block layout and call sequences can change between Go versions and build flags, so don't rely on this exact listing, just the general shape.
AMD64 disassembly
Let's then disassemble the AMD64 unoptimised binary:
go tool objdump -s main.main bin/main_amd64_unoptimised
From line 12 we get this:
main.go:12 0x10d0f00 4885c0 TESTQ AX, AX
main.go:12 0x10d0f03 7c02 JL 0x10d0f07
main.go:12 0x10d0f05 eb54 JMP 0x10d0f5b
Before moving on, it's worth mentioning that on AMD64 (x86-64) the processor status register RFLAGS has these flags, which are set by ALU operations:
- ZF (bit 6): zero flag, set to 1 if the result of the operation is 0, otherwise 0.
- SF (bit 7): sign flag, copies the most significant bit of the result; for signed values 1 means negative.
- OF (bit 11): overflow flag, set to 1 if a signed overflow occurred.
On AArch64 you also have NZCV condition flags and generic conditional branches (B.EQ, B.LT, etc.), but TBZ/TBNZ bypass these condition flags and test the selected bit directly.
Breaking it down:
- TESTQ AX, AX is a bitwise test of AX against itself. It leaves AX unchanged and only sets flags (ZF set if x == 0, SF reflects the sign, OF is cleared).
- JL 0x10d0f07 is "jump if less" for signed values, which means "branch if SF != OF". After TESTQ we know OF = 0, so JL is effectively "jump if SF == 1", i.e. if the sign bit is set.
- JMP 0x10d0f5b is an unconditional jump that skips over the negative-branch block to the "x >= 0" (or "main") path.
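If the flag juggling feels abstract, here's a tiny Go sketch (purely illustrative, this is not what the CPU literally executes) of how TESTQ x, x followed by JL boils down to a sign-bit check:
package main

import "fmt"

func main() {
	for _, x := range []int64{42, 0, -7} {
		// TESTQ x, x ANDs x with itself, sets the flags, and discards the result.
		res := x & x
		zf := res == 0 // ZF: result is zero
		sf := res < 0  // SF: copy of the result's sign bit
		of := false    // OF: always cleared by TEST
		jl := sf != of // JL: taken if SF != OF, which after TEST is simply "sign bit set"
		fmt.Printf("x=%d ZF=%v SF=%v JL taken=%v\n", x, zf, sf, jl)
	}
}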
Micro-benchmark: does any of this matter?
Let's do a micro-benchmark that's mostly about "shape": we're not rigorously measuring branch-mispredict cost, just showing that the single "x < 0" check does not dominate the function.
Save this as bench_test.go next to main.go:
package main

import "testing"

var sink int

func hotpathPredictable(x int) {
	if x < 0 {
		sink++
	}
}

func hotpathRandom(x int) {
	if x < 0 {
		sink++
	}
}

func BenchmarkBranchPredictable(b *testing.B) {
	x := -1
	for i := 0; i < b.N; i++ {
		hotpathPredictable(x)
	}
}

func BenchmarkBranchRandom(b *testing.B) {
	var x int
	r := uint64(1)
	for i := 0; i < b.N; i++ {
		// cheap LCG to flip sign bit “randomly”
		r = r*1103515245 + 12345
		x = int(int64(r)) // sign depends on high bit of r
		hotpathRandom(x)
	}
}
Run it with default settings, letting inlining and optimisation do their thing:
go test -count=10 -bench=BenchmarkBranchPredictable | tee predictable_optimised.out
Then run it by disabling inlining and optimisations:
go test -count=10 -bench=BenchmarkBranchPredictable -gcflags=all='-N -l' | tee predictable_unoptimised.out
And the same for the random version:
go test -count=10 -bench=BenchmarkBranchRandom | tee random_optimised.out
go test -count=10 -bench=BenchmarkBranchRandom -gcflags=all='-N -l' | tee random_unoptimised.out
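Since the runs are teed into .out files, you can compare them with benchstat (from golang.org/x/perf/cmd/benchstat) if you have it installed, e.g.:
benchstat predictable_optimised.out predictable_unoptimised.out
benchstat random_optimised.out random_unoptimised.out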
On my M1 Pro (arm64), the predictable branch averages about 2.2 ns/op. Making the condition pseudo-random via a tiny integer LCG bumps that to about 3.6 ns/op in an optimised build (~1.6× slower), and to about 5.7 ns/op without optimisations.
The predictable case barely changes between optimised and unoptimised builds, while the random case is hit much harder. That matches the intuitive picture: the branch itself is cheap when the predictor can learn it, but once you add noise and extra work around it, the cost of missed predictions and extra instructions starts to dominate.
Closing
This is a tiny example, but it's a nice lens into how Go maps simple conditions onto each instruction set architecture (or ISA).
Finally, I do want to mention that the micro-benchmark here is intentionally simple and not a rigorous study of branch prediction. Its main purpose is to show that the conditional itself compiles to a handful of instructions and doesn't dominate runtime in this toy.
