How to force the Rust compiler to use specific x86 instructions (popcount, etc.)

#rust #optimization #tutorial #todayilearned

Sometimes you need tricky operations on binary data, such as counting set bits or leading and trailing zeros (for example, when you want to implement a HAMT or sparse arrays). In Rust you do this with methods like .count_ones() and .trailing_zeros(). The problem with those methods is that, when compiling for a baseline x86-64 target, they are often expanded into a long sequence of assembly instructions. But x86 (and some other architectures) has dedicated instructions for these counts (specifically, popcnt and tzcnt), and they are really fast (on many Intel CPUs, roughly one instruction issued per cycle with a latency of about 3 cycles). Today we will learn how to force the Rust compiler to use those instructions and what the possible pitfalls are.

Originally published on my website.

Let's start with an example. Here's the code that finds population counts of random integers (on Rust Playground):

use rand::prelude::*;

// this is only to have this function separately in the asm output
#[inline(never)]
fn count(a: u32) -> u32 {
    a.count_ones()
}

fn main() {
    let a: u32 = random();
    println!("{}", count(a));
    // wrapping_add avoids an overflow panic in debug builds when a == u32::MAX
    println!("{}", count(a.wrapping_add(1)));
}

And the count function compiles to this:

playground::count:
    mov eax, edi
    shr eax
    and eax, 1431655765
    sub edi, eax
    mov eax, edi
    and eax, 858993459
    shr edi, 2
    and edi, 858993459
    add edi, eax
    mov eax, edi
    shr eax, 4
    add eax, edi
    and eax, 252645135
    imul    eax, eax, 16843009
    shr eax, 24
    ret

Wow. A huge pile of instructions and magic numbers. This is clearly not something we would like to see, especially in performance-critical places. Let's make this a little bit better (on Rust Playground):

use rand::prelude::*;

// this is only to have this function separately in the asm output
#[inline(never)]
#[cfg_attr(target_arch = "x86_64", target_feature(enable = "popcnt"))]
unsafe fn count(a: u32) -> u32 {
    a.count_ones()
}

fn main() {
    let a: u32 = random();
    println!("{}", unsafe { count(a) });
    // wrapping_add avoids an overflow panic in debug builds when a == u32::MAX
    println!("{}", unsafe { count(a.wrapping_add(1)) });
}

And here is the assembly code of count:

playground::count:
    popcnt  eax, edi
    ret

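Only a single very fast instruction! And this works with all integer types. There is a pitfall though: in the Rust version used here, a function that uses target_feature must be marked unsafe, because calling it on a CPU that lacks the feature is undefined behavior (in practice, usually an illegal-instruction crash). So it makes sense to keep the functions that use those features as small as possible.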
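One way to contain the unsafe part is to hide it behind a safe wrapper that checks the CPU at runtime. The following is only a minimal sketch, not part of the original example: the names count_popcnt and count are hypothetical, and it assumes the standard library's is_x86_feature_detected! macro is an acceptable way to do the check in your program.

```rust
// Hypothetical helper: compiled with popcnt enabled, so it must only be
// called after we know the CPU supports that instruction.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "popcnt")]
unsafe fn count_popcnt(a: u32) -> u32 {
    a.count_ones()
}

// Safe entry point (hypothetical name): uses the popcnt version when the
// running CPU has it and falls back to the portable code otherwise.
pub fn count(a: u32) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("popcnt") {
            // SAFETY: the check above confirmed that popcnt is available.
            return unsafe { count_popcnt(a) };
        }
    }
    // Portable fallback for other architectures or older CPUs.
    a.count_ones()
}

fn main() {
    println!("{}", count(0b1011)); // prints 3
    println!("{}", count(u32::MAX)); // prints 32
}
```

The check adds a branch on every call, so in a hot loop it probably makes more sense to do the detection once outside the loop and keep the loop body inside the target_feature function.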

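Also, you can do something similar with .trailing_zeros() by using target_feature(enable = "bmi1"), which provides tzcnt, and with .leading_zeros() by using target_feature(enable = "lzcnt"), since lzcnt is gated by its own feature rather than by bmi1.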
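Here is a rough sketch of the same trick for the zero-counting methods. The function names are hypothetical, and the sketch assumes tzcnt is enabled through the bmi1 feature while lzcnt has a separate feature called lzcnt; the safe wrappers again rely on is_x86_feature_detected! and fall back to the portable methods.

```rust
// Hypothetical helper: trailing_zeros compiled with bmi1 (tzcnt) enabled.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi1")]
unsafe fn trailing_zeros_tzcnt(a: u32) -> u32 {
    a.trailing_zeros()
}

// Hypothetical helper: leading_zeros compiled with lzcnt enabled.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "lzcnt")]
unsafe fn leading_zeros_lzcnt(a: u32) -> u32 {
    a.leading_zeros()
}

// Safe wrapper: picks the tzcnt version only if the CPU reports bmi1.
pub fn trailing_zeros(a: u32) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("bmi1") {
            // SAFETY: bmi1 support was verified just above.
            return unsafe { trailing_zeros_tzcnt(a) };
        }
    }
    a.trailing_zeros()
}

// Safe wrapper: picks the lzcnt version only if the CPU reports lzcnt.
pub fn leading_zeros(a: u32) -> u32 {
    #[cfg(target_arch = "x86_64")]
    {
        if std::arch::is_x86_feature_detected!("lzcnt") {
            // SAFETY: lzcnt support was verified just above.
            return unsafe { leading_zeros_lzcnt(a) };
        }
    }
    a.leading_zeros()
}

fn main() {
    println!("{}", trailing_zeros(0b1000)); // prints 3
    println!("{}", leading_zeros(1)); // prints 31
}
```

Both paths return the same values, including 32 for an input of 0, so the wrappers behave like the plain methods and only the generated instructions differ.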

To find feature names, you can refer to the architecture-specific intrinsics documentation.

That's all, hope you find it useful!