
Rust and SIMD

SIMD stands for Single Instruction, Multiple Data. It is a set of CPU instructions that each perform the same operation on multiple lanes of data simultaneously. Used well, this can significantly improve the performance of repetitive tasks such as vector calculations or audio/video processing.

Rust is a modern language built on very well-established compiler technology (LLVM). That backend can automatically vectorize code: it analyzes the program, recognizes when and where data accesses can be parallelized, and emits SIMD instructions on its own. In principle, that means one could leave it all in the compiler’s capable hands and not worry much about optimization.
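
As a quick illustration of what auto-vectorization means in practice (my own sketch, separate from the Vec4 experiments below): a plain element-wise loop over slices is exactly the kind of code LLVM will typically turn into SIMD instructions at -C opt-level=3.

// An element-wise addition over slices; with optimizations enabled, LLVM
// will usually compile the loop body down to packed SIMD instructions.
fn add_slices(dst: &mut [f32], a: &[f32], b: &[f32]) {
    // Taking the minimum length up front helps the optimizer elide bounds
    // checks, which makes vectorization more likely.
    let n = dst.len().min(a.len()).min(b.len());
    for i in 0..n {
        dst[i] = a[i] + b[i];
    }
}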

So let’s put that to the test: let’s implement Vec4 in a variety of ways and see which implementation gets the best code generated, and whether automatic vectorization does anything useful.

For the tests, I used Compiler Explorer (which can be found here: https://godbolt.org/) and -C opt-level=3 to get the maximum optimization possible.

The Naive Struct

The first implementation is the good old naive struct:

use std::ops::{Add, Mul};

#[derive(Copy, Clone, Debug)]
struct Vec4 {
    x: f32,
    y: f32,
    z: f32,
    w: f32,
}

impl Vec4 {
    fn new(x: f32, y: f32, z: f32, w: f32) -> Vec4 {
        Vec4 {
            x: x,
            y: y,
            z: z,
            w: w,
        }
    }
}

impl Add for Vec4 {
    type Output = Vec4;
    fn add(self, other: Self) -> Self {
        Vec4 {
            x: self.x + other.x,
            y: self.y + other.y,
            z: self.z + other.z,
            w: self.w + other.w,
        }
    }
}

impl Mul<f32> for Vec4 {
    type Output = Vec4;
    fn mul(self, other: f32) -> Self {
        Vec4 {
            x: self.x * other,
            y: self.y * other,
            z: self.z * other,
            w: self.w * other,
        }
    }
}

And for the main program, we have it calculate (a + b) * c for some arbitrary values of a, b and c (c is a scalar).

fn main() {
    let a = Vec4::new(1.0,2.0,3.0,4.0);
    let b = Vec4::new(5.0,6.0,7.0,8.0);
    let result = (a + b) * 9.0;
    println!("{:?} + {:?} = {:?}",a,b,result);
}

Running it through Compiler Explorer (and leaving out the println! part) gives us:

movaps  xmm0, xmmword ptr [rip + .LCPI9_0]
movaps  xmmword ptr [rsp + 160], xmm0
movaps  xmm0, xmmword ptr [rip + .LCPI9_1]
movaps  xmmword ptr [rsp + 176], xmm0
movaps  xmm0, xmmword ptr [rip + .LCPI9_2]
movaps  xmmword ptr [rsp + 336], xmm0

Ah yes… The compiler has realized that all of the values in the calculation are constants, so it simply replaces everything with hardcoded answers for a, b and result. This is not very helpful when exploring different implementations, so we’ll have to make sure it doesn’t get the chance to do this.
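
(As an aside, on toolchains where std::hint::black_box is stable, one option is to launder the constants through black_box, an opaque identity function the optimizer cannot see through. This is just a sketch reusing the Vec4 defined above, not what the rest of this post measures.)

use std::hint::black_box;

fn main() {
    // black_box returns its argument unchanged, but the optimizer must
    // assume the value could be anything, so no constant folding happens.
    let a = Vec4::new(black_box(1.0), black_box(2.0), black_box(3.0), black_box(4.0));
    let b = Vec4::new(black_box(5.0), black_box(6.0), black_box(7.0), black_box(8.0));
    let result = (a + b) * black_box(9.0);
    println!("{:?} + {:?} = {:?}", a, b, result);
}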

The way I’ll use here is to create a function that returns a ‘random’ f32 based on the system time, like so:

use std::time::{SystemTime, UNIX_EPOCH};

// #[inline(never)] keeps the compiler from inlining the call sites and
// constant-folding the results away.
#[inline(never)]
fn rng() -> f32 {
    let duration = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("Time went backwards.");
    ((duration.as_millis() % 1000) as f32) / 1000.0
}

Here the ‘random number’ is the sub-second part of the current time. This turns the main program into:

fn main() {
    let a = Vec4::new(rng(),rng(),rng(),rng());
    let b = Vec4::new(rng(),rng(),rng(),rng());
    let result = (a + b) * rng();
    println!("{:?} + {:?} = {:?}",a,b,result);
}

And the resulting machine language becomes:

call      example::rng
movaps    xmmword ptr [rsp], xmm0
call      example::rng
movaps    xmmword ptr [rsp + 16], xmm0
call      example::rng
movaps    xmmword ptr [rsp + 32], xmm0
call      example::rng
movaps    xmm1, xmm0
movaps    xmmword ptr [rsp + 144], xmm0
movaps    xmm0, xmmword ptr [rsp]
movss     dword ptr [rsp + 240], xmm0
movaps    xmm0, xmmword ptr [rsp + 16]
movss     dword ptr [rsp + 244], xmm0
movaps    xmm0, xmmword ptr [rsp + 32]
movss     dword ptr [rsp + 248], xmm0
movss     dword ptr [rsp + 252], xmm1

call      example::rng
movaps    xmmword ptr [rsp + 224], xmm0
call      example::rng
movaps    xmmword ptr [rsp + 208], xmm0
call      example::rng
movaps    xmmword ptr [rsp + 192], xmm0
call      example::rng
movaps    xmm1, xmmword ptr [rsp + 224]
movss     dword ptr [rsp + 256], xmm1
movaps    xmm2, xmmword ptr [rsp + 208]
movss     dword ptr [rsp + 260], xmm2
movaps    xmm3, xmmword ptr [rsp + 192]
movss     dword ptr [rsp + 264], xmm3
movss     dword ptr [rsp + 268], xmm0

unpcklps  xmm1, xmm2
unpcklps  xmm3, xmm0
movlhps   xmm1, xmm3
movaps    xmm2, xmm1
movaps    xmm0, xmmword ptr [rsp]
unpcklps  xmm0, xmmword ptr [rsp + 16]
movaps    xmm1, xmmword ptr [rsp + 32]
unpcklps  xmm1, xmmword ptr [rsp + 144]
movlhps   xmm0, xmm1
addps     xmm0, xmm2
movaps    xmmword ptr [rsp], xmm0
call      example::rng
shufps    xmm0, xmm0, 0
mulps     xmm0, xmmword ptr [rsp]
movaps    xmmword ptr [rsp + 352], xmm0

Alright, that’s more like it. So what are we looking at? There are nine calls to the random number generator, a lot of shuffling of f32s between the xmm registers and the stack, and at the very end, the vectorized addps and mulps. The compiler clearly had some trouble packing the individual results of the random number generator into a suitable format, but it did eventually arrive at addps and mulps.

It’s a relatively simple case, and it already triggers automatic vectorization. Can we find more optimal implementations? Let’s see what else we can do…

Tuple Struct and Array Struct

Using Rust’s tuple structs, we can also define slightly different versions of the same thing. Maybe the compiler treats these differently.

One such version would be the tuple struct:

struct Vec4(f32,f32,f32,f32);

And similarly, a struct wrapping a fixed-size array, which is easier to use with generics:

struct Vec4([f32; 4]);

I’ve omitted the resulting machine language because, in both cases, it is virtually identical to the naive-struct case. This is good to know: it really doesn’t matter that much how the struct is described.
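
For illustration, here is one way (my own sketch, not the code that was measured) the Add impl could look for the array-backed variant, using std::array::from_fn to build the result element-wise:

use std::ops::Add;

#[derive(Copy, Clone, Debug)]
struct Vec4([f32; 4]);

impl Add for Vec4 {
    type Output = Vec4;
    fn add(self, other: Self) -> Self {
        // Build the result lane by lane from a closure over the lane index.
        Vec4(std::array::from_fn(|i| self.0[i] + other.0[i]))
    }
}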

Underlying SIMD Intrinsics

So, let’s implement the whole thing on top of SIMD intrinsics and tell the compiler exactly which instructions to use, where and how. The implementation now looks something like this:

use std::arch::x86_64::*;
use std::ops::{Add, Mul};

#[derive(Copy, Clone, Debug)]
struct Vec4(__m128);

impl Vec4 {
    fn new(x: f32, y: f32, z: f32, w: f32) -> Vec4 {
        // _mm_set_ps takes its arguments from the highest lane down,
        // hence the reversed order.
        unsafe { Vec4(_mm_set_ps(w, z, y, x)) }
    }
}

impl Add for Vec4 {
    type Output = Vec4;
    fn add(self, other: Self) -> Self {
        unsafe { Vec4(_mm_add_ps(self.0, other.0)) }
    }
}

impl Mul<f32> for Vec4 {
    type Output = Vec4;
    fn mul(self, other: f32) -> Self {
        unsafe { Vec4(_mm_mul_ps(self.0, _mm_set1_ps(other))) }
    }
}

And the resulting machine language is:

call      example::rng10
movaps    xmmword ptr [rsp], xmm0
call      example::rng10
movaps    xmmword ptr [rsp + 16], xmm0
call      example::rng10
movaps    xmmword ptr [rsp + 32], xmm0
call      example::rng10
movaps    xmm1, xmmword ptr [rsp]
unpcklps  xmm1, xmmword ptr [rsp + 16]
movaps    xmm2, xmmword ptr [rsp + 32]
unpcklps  xmm2, xmm0
movlhps   xmm1, xmm2
movaps    xmmword ptr [rsp], xmm1
movaps    xmmword ptr [rsp + 400], xmm1

call      example::rng10
movaps    xmmword ptr [rsp + 16], xmm0
call      example::rng10
movaps    xmmword ptr [rsp + 32], xmm0
call      example::rng10
movaps    xmmword ptr [rsp + 144], xmm0
call      example::rng10
movaps    xmm1, xmmword ptr [rsp + 16]
unpcklps  xmm1, xmmword ptr [rsp + 32]
movaps    xmm2, xmmword ptr [rsp + 144]
unpcklps  xmm2, xmm0
movlhps   xmm1, xmm2
movaps    xmmword ptr [rsp + 416], xmm1

addps     xmm1, xmmword ptr [rsp]
movaps    xmmword ptr [rsp + 16], xmm1
call      example::rng10
shufps    xmm0, xmm0, 0
mulps     xmm0, xmmword ptr [rsp + 16]
movaps    xmmword ptr [rsp + 432], xmm0

This is slightly more compact than the other versions, but in essence it does the same thing. Clearly the Rust compiler and the underlying optimizer already have a fairly advanced understanding of SIMD operations.
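
One practical aside when wrapping __m128 like this: getting the individual lanes back out takes an explicit store. A helper along these lines (my own addition, building on the Vec4 above) does the job, for instance to compare results against the scalar implementations:

fn to_array(v: Vec4) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    // _mm_storeu_ps writes the four lanes to (possibly unaligned) memory.
    unsafe { _mm_storeu_ps(out.as_mut_ptr(), v.0) };
    out
}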

Handcoding SIMD Intrinsics Directly

On this side of the compiler, one last option remains: we can hand-code the intrinsics directly, without any structs, like so:

fn main() {
    unsafe {
        let a = _mm_set_ps(rng10(), rng10(), rng10(), rng10());
        let b = _mm_set_ps(rng10(), rng10(), rng10(), rng10());
        let result = _mm_mul_ps(_mm_add_ps(a, b), _mm_set1_ps(rng10()));
        println!("{:?} + {:?} = {:?}", a, b, result);
    }
}

Given the compiler’s craftiness thus far, we can expect the result to be pretty similar again:

call    example::rng10
movss   dword ptr [rsp], xmm0
call    example::rng10
movss   dword ptr [rsp + 16], xmm0
call    example::rng10
movss   dword ptr [rsp + 32], xmm0
call    example::rng10
movss   dword ptr [rsp + 160], xmm0
movss   xmm0, dword ptr [rsp + 32]
movss   dword ptr [rsp + 164], xmm0
movss   xmm0, dword ptr [rsp + 16]
movss   dword ptr [rsp + 168], xmm0
movss   xmm0, dword ptr [rsp]
movss   dword ptr [rsp + 172], xmm0

call    example::rng10
movss   dword ptr [rsp], xmm0
call    example::rng10
movss   dword ptr [rsp + 16], xmm0
call    example::rng10
movss   dword ptr [rsp + 32], xmm0
call    example::rng10
movss   dword ptr [rsp + 176], xmm0
movss   xmm0, dword ptr [rsp + 32]
movss   dword ptr [rsp + 180], xmm0
movss   xmm0, dword ptr [rsp + 16]
movss   dword ptr [rsp + 184], xmm0
movss   xmm0, dword ptr [rsp]
movss   dword ptr [rsp + 188], xmm0

movaps  xmm0, xmmword ptr [rsp + 160]
addps   xmm0, xmmword ptr [rsp + 176]
movaps  xmmword ptr [rsp], xmm0
call    example::rng10
shufps  xmm0, xmm0, 0
mulps   xmm0, xmmword ptr [rsp]
movaps  xmmword ptr [rsp + 336], xmm0

It is indeed similar, although there are still two seemingly superfluous shuffling blocks.

So??

Well, here’s the thing. I would say all of the above implementations are essentially equivalent; there is not a single clear winner. The compiler vectorizes this code quite effortlessly, yet it still ends up shuffling the values around quite a lot…
