I noticed something interesting while running a GCC vs Clang benchmark.

Same code. Same machine.
Both loops are scalar (no vectorization).
Yet… GCC consistently used fewer CPU cycles.
At first, this doesn't make sense.
If both:
- execute roughly the same instructions
- are not vectorized
Why is there a performance gap?
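The original benchmark loop isn't reproduced here, so as a stand-in, here is a minimal scalar kernel of the kind where this effect tends to show up. The `dot` name and the exact flags are my choices, not taken from the original benchmark; the flags simply keep both compilers honest about staying scalar:

```c
#include <stddef.h>

/* A plain scalar reduction: one load per input per iteration,
 * a loop-carried dependency on the accumulator, nothing fancy.
 *
 * Compile scalar-only on both compilers (illustrative flags):
 *   gcc   -O2 -fno-tree-vectorize kernel.c
 *   clang -O2 -fno-vectorize      kernel.c
 */
double dot(const double *x, const double *y, size_t n) {
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];   /* two loads + one FMA-shaped op per iteration */
    return acc;
}
```

With vectorization off, any cycle-count gap between the two binaries has to come from somewhere else: addressing, scheduling, and how well the loads overlap.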
🔍 The Missing Piece: It's Not Just Instructions
Most people focus on:
- instruction count
- vectorization
But in this case, that's not the full story.
What actually matters more is:
- how address computations are structured
- how instructions are scheduled
- how well latency is hidden
Here is what the data showed:
⚙️ AGU Pressure (Address Generation Units)
On x86 CPUs, memory instructions rely on AGUs (Address Generation Units).
Complex addressing patterns like:
`base + index * scale + offset`
👉 increase AGU pressure
Whereas simpler patterns like:
`pointer++`
👉 are cheaper and easier for the CPU to execute efficiently
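To make the two patterns concrete, here is a small C sketch of both addressing styles (the function names are mine; also note that an optimizer may rewrite one form into the other, so the real difference is only visible in the emitted assembly, not the source):

```c
#include <stddef.h>

/* Indexed form: each access encodes base + index*scale,
 * a heavier address computation for the AGUs. */
long sum_indexed(const long *a, size_t n) {
    long s = 0;
    for (size_t i = 0; i < n; i++)
        s += a[i];            /* address: a + i*sizeof(long) */
    return s;
}

/* Pointer-bump form: the address is a single register that is
 * simply incremented each iteration, a cheaper AGU pattern. */
long sum_pointer(const long *a, size_t n) {
    long s = 0;
    for (const long *p = a; p != a + n; p++)
        s += *p;              /* address: plain register */
    return s;
}
```

Both compute the same result; the interesting part is which addressing mode each compiler chooses for the load inside the loop.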
🧪 What I Observed
GCC:
- Generates simpler addressing patterns
- Reduces AGU contention
- Keeps execution more consistent
Clang:
- Shows higher AGU pressure
- More stalls
- Less efficient scheduling (in this case)
⚡ Key Takeaway
It's not just about what instructions exist.
It's about how efficiently the compiler feeds the CPU pipeline.
Same instruction count ≠ same performance.
📌 Why This Matters
In tight loops:
- AGU pressure
- addressing patterns
- instruction scheduling
👉 can matter as much as (or more than) vectorization
🚀 Want to Dive Deeper?
🔗 Full benchmark + assembly breakdown:
🔗 Complete analysis article:
💬 Discussion
Have you seen cases where:
- similar assembly
- same instruction count
👉 still result in very different performance?
Would love to hear your observations.
