SIMD stands for Single Instruction, Multiple Data. It’s an instruction subset on the CPU that performs the same operation on multiple lanes of data simultaneously. When used right, this can significantly improve the performance of repetitive tasks such as vector calculations or audio/video processing.
Rust is a modern language, built on top of very well-established compiler technology (LLVM). This technology has the capability to automatically vectorize your code: it analyzes the code far enough to recognize when and where data access can be parallelized using SIMD instructions. In principle, that means one could leave it all in the compiler’s capable hands and not worry much about optimization.
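To illustrate what auto-vectorization looks like in practice, here is a plain element-wise loop (a hypothetical warm-up example, not part of the Vec4 experiment below) that LLVM will typically compile down to packed SSE instructions such as addps at -C opt-level=3:

```rust
// A loop the optimizer can usually auto-vectorize: element-wise
// addition over fixed-size arrays. The bounds are known at compile
// time, so LLVM can emit packed instructions for the body.
fn add_arrays(a: &[f32; 8], b: &[f32; 8]) -> [f32; 8] {
    let mut out = [0.0f32; 8];
    for i in 0..8 {
        out[i] = a[i] + b[i];
    }
    out
}

fn main() {
    let a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0];
    let b = [8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0];
    println!("{:?}", add_arrays(&a, &b)); // every lane sums to 9.0
}
```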
So let’s put that to the test. Let’s implement Vec4 in a variety of different ways and see which implementation gets the most optimal code generated, and whether or not automatic vectorization does anything useful.
For the tests, I used Compiler Explorer (https://godbolt.org/) with -C opt-level=3 to get the maximum optimization possible.
The Naive Struct
The first implementation is the good old naive struct:
```rust
use std::ops::{Add, Mul};

#[derive(Copy, Clone, Debug)]
struct Vec4 {
    x: f32,
    y: f32,
    z: f32,
    w: f32,
}

impl Vec4 {
    fn new(x: f32, y: f32, z: f32, w: f32) -> Vec4 {
        Vec4 { x, y, z, w }
    }
}

impl Add for Vec4 {
    type Output = Vec4;
    fn add(self, other: Self) -> Self {
        Vec4 {
            x: self.x + other.x,
            y: self.y + other.y,
            z: self.z + other.z,
            w: self.w + other.w,
        }
    }
}

impl Mul<f32> for Vec4 {
    type Output = Vec4;
    fn mul(self, other: f32) -> Self {
        Vec4 {
            x: self.x * other,
            y: self.y * other,
            z: self.z * other,
            w: self.w * other,
        }
    }
}
```
And for the main program, we have it calculate (a + b) * c for some arbitrary values of a, b and c (c is a scalar).
```rust
fn main() {
    let a = Vec4::new(1.0, 2.0, 3.0, 4.0);
    let b = Vec4::new(5.0, 6.0, 7.0, 8.0);
    let result = (a + b) * 9.0;
    println!("{:?} + {:?} = {:?}", a, b, result);
}
```
Running it through Compiler Explorer and leaving out the println-part gives us:
```asm
movaps  xmm0, xmmword ptr [rip + .LCPI9_0]
movaps  xmmword ptr [rsp + 160], xmm0
movaps  xmm0, xmmword ptr [rip + .LCPI9_1]
movaps  xmmword ptr [rsp + 176], xmm0
movaps  xmm0, xmmword ptr [rip + .LCPI9_2]
movaps  xmmword ptr [rsp + 336], xmm0
```
Ah yes… The compiler has realized that all of the values in the calculation are constants, so it just replaces everything with hardcoded answers for a, b and result. This is not very helpful when exploring different implementations, so we’ll have to make sure it doesn’t get the chance to do this.
One way to accomplish this is to create a function that returns a random f32, based on the system time, like so:
```rust
use std::time::{SystemTime, UNIX_EPOCH};

#[inline(never)]
fn rng() -> f32 {
    let duration = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("Time went backwards.");
    ((duration.as_millis() % 1000) as f32) / 1000.0
}
```
Here the ‘random number’ is the sub-second part of the current time. This turns the main program into:
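On recent toolchains, std::hint::black_box offers an alternative way to keep the optimizer from constant-folding the inputs, without going through the system clock. A minimal sketch (the compute helper is my own illustration, not part of the measured code):

```rust
use std::hint::black_box;

// black_box hides the values from the optimizer, so the calculation
// cannot be folded into compile-time constants; the computation
// itself is unchanged.
fn compute(a: [f32; 4], b: [f32; 4], c: f32) -> [f32; 4] {
    let (a, b, c) = (black_box(a), black_box(b), black_box(c));
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        out[i] = (a[i] + b[i]) * c;
    }
    out
}

fn main() {
    let result = compute([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], 9.0);
    println!("{:?}", result);
}
```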
```rust
fn main() {
    let a = Vec4::new(rng(), rng(), rng(), rng());
    let b = Vec4::new(rng(), rng(), rng(), rng());
    let result = (a + b) * rng();
    println!("{:?} + {:?} = {:?}", a, b, result);
}
```
And the resulting machine language becomes:
```asm
call    example::rng
movaps  xmmword ptr [rsp], xmm0
call    example::rng
movaps  xmmword ptr [rsp + 16], xmm0
call    example::rng
movaps  xmmword ptr [rsp + 32], xmm0
call    example::rng
movaps  xmm1, xmm0
movaps  xmmword ptr [rsp + 144], xmm0
movaps  xmm0, xmmword ptr [rsp]
movss   dword ptr [rsp + 240], xmm0
movaps  xmm0, xmmword ptr [rsp + 16]
movss   dword ptr [rsp + 244], xmm0
movaps  xmm0, xmmword ptr [rsp + 32]
movss   dword ptr [rsp + 248], xmm0
movss   dword ptr [rsp + 252], xmm1
call    example::rng
movaps  xmmword ptr [rsp + 224], xmm0
call    example::rng
movaps  xmmword ptr [rsp + 208], xmm0
call    example::rng
movaps  xmmword ptr [rsp + 192], xmm0
call    example::rng
movaps  xmm1, xmmword ptr [rsp + 224]
movss   dword ptr [rsp + 256], xmm1
movaps  xmm2, xmmword ptr [rsp + 208]
movss   dword ptr [rsp + 260], xmm2
movaps  xmm3, xmmword ptr [rsp + 192]
movss   dword ptr [rsp + 264], xmm3
movss   dword ptr [rsp + 268], xmm0
unpcklps xmm1, xmm2
unpcklps xmm3, xmm0
movlhps xmm1, xmm3
movaps  xmm2, xmm1
movaps  xmm0, xmmword ptr [rsp]
unpcklps xmm0, xmmword ptr [rsp + 16]
movaps  xmm1, xmmword ptr [rsp + 32]
unpcklps xmm1, xmmword ptr [rsp + 144]
movlhps xmm0, xmm1
addps   xmm0, xmm2
movaps  xmmword ptr [rsp], xmm0
call    example::rng
shufps  xmm0, xmm0, 0
mulps   xmm0, xmmword ptr [rsp]
movaps  xmmword ptr [rsp + 352], xmm0
```
Alright, that’s more like it. So what are we looking at? There are 9 calls to the random number generator, a lot of shuffling of f32s around the xmm registers, and at the very end, the vectorized addps and mulps instructions. It seems the compiler had some trouble getting the various results from the random number generator into a suitable format, but it did eventually arrive at addps and mulps.
It’s a relatively simple case, and it already triggers automatic vectorization. Can we find more optimal implementations? Let’s see what else we can do…
Tuple Struct and Array Struct
Using Rust’s tuple structs and arrays, we can also define slightly different versions of the same thing. Maybe the compiler has different ways of treating these.
One such version would be the tuple struct:
```rust
struct Vec4(f32, f32, f32, f32);
```
And similarly, but easier to use with generics:
```rust
struct Vec4([f32; 4]);
```
I’ve omitted the resulting machine language, because it is virtually identical (in both cases) to the naive-struct case. This is good to know: it really doesn’t matter that much how the struct is described.
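The array form also hints at why it plays well with generics: the operators can be written as simple index loops, which the optimizer vectorizes just as readily as the field-by-field version. A sketch of what such an implementation might look like (my own illustration, not the exact code measured above):

```rust
use std::ops::{Add, Mul};

#[derive(Copy, Clone, Debug, PartialEq)]
struct Vec4([f32; 4]);

impl Add for Vec4 {
    type Output = Vec4;
    fn add(self, other: Self) -> Self {
        // A fixed-bound index loop: an easy target for vectorization.
        let mut out = [0.0f32; 4];
        for i in 0..4 {
            out[i] = self.0[i] + other.0[i];
        }
        Vec4(out)
    }
}

impl Mul<f32> for Vec4 {
    type Output = Vec4;
    fn mul(self, other: f32) -> Self {
        let mut out = [0.0f32; 4];
        for i in 0..4 {
            out[i] = self.0[i] * other;
        }
        Vec4(out)
    }
}

fn main() {
    let result = (Vec4([1.0, 2.0, 3.0, 4.0]) + Vec4([5.0, 6.0, 7.0, 8.0])) * 9.0;
    println!("{:?}", result);
}
```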
Underlying SIMD Intrinsics
So, let’s implement the whole thing on top of SIMD intrinsics and tell the compiler more precisely which instructions to use where and how. The implementation now looks something like this:
```rust
use std::arch::x86_64::*;
use std::ops::{Add, Mul};

#[derive(Copy, Clone, Debug)]
struct Vec4(__m128);

impl Vec4 {
    fn new(x: f32, y: f32, z: f32, w: f32) -> Vec4 {
        unsafe { Vec4(_mm_set_ps(w, z, y, x)) }
    }
}

impl Add for Vec4 {
    type Output = Vec4;
    fn add(self, other: Self) -> Self {
        unsafe { Vec4(_mm_add_ps(self.0, other.0)) }
    }
}

impl Mul<f32> for Vec4 {
    type Output = Vec4;
    fn mul(self, other: f32) -> Self {
        unsafe { Vec4(_mm_mul_ps(self.0, _mm_set1_ps(other))) }
    }
}
```
And the resulting machine language is:
```asm
call    example::rng10
movaps  xmmword ptr [rsp], xmm0
call    example::rng10
movaps  xmmword ptr [rsp + 16], xmm0
call    example::rng10
movaps  xmmword ptr [rsp + 32], xmm0
call    example::rng10
movaps  xmm1, xmmword ptr [rsp]
unpcklps xmm1, xmmword ptr [rsp + 16]
movaps  xmm2, xmmword ptr [rsp + 32]
unpcklps xmm2, xmm0
movlhps xmm1, xmm2
movaps  xmmword ptr [rsp], xmm1
movaps  xmmword ptr [rsp + 400], xmm1
call    example::rng10
movaps  xmmword ptr [rsp + 16], xmm0
call    example::rng10
movaps  xmmword ptr [rsp + 32], xmm0
call    example::rng10
movaps  xmmword ptr [rsp + 144], xmm0
call    example::rng10
movaps  xmm1, xmmword ptr [rsp + 16]
unpcklps xmm1, xmmword ptr [rsp + 32]
movaps  xmm2, xmmword ptr [rsp + 144]
unpcklps xmm2, xmm0
movlhps xmm1, xmm2
movaps  xmmword ptr [rsp + 416], xmm1
addps   xmm1, xmmword ptr [rsp]
movaps  xmmword ptr [rsp + 16], xmm1
call    example::rng10
shufps  xmm0, xmm0, 0
mulps   xmm0, xmmword ptr [rsp + 16]
movaps  xmmword ptr [rsp + 432], xmm0
```
Which is slightly more compact than the other versions, but in essence does the same thing. Clearly the Rust compiler and the underlying optimizer have a relatively advanced understanding of SIMD operations.
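One practical note on the intrinsics route: the unsafe blocks above are only sound on CPUs that actually support SSE. That is guaranteed on x86_64, but in portable code it is common to gate the intrinsics behind a runtime check and fall back to scalar code elsewhere. A hedged sketch of that pattern (the add_mul helper and its scalar fallback are my own, not from the measured code):

```rust
#[cfg(target_arch = "x86_64")]
fn add_mul(a: [f32; 4], b: [f32; 4], c: f32) -> [f32; 4] {
    use std::arch::x86_64::*;
    // SSE is part of the x86_64 baseline, but the explicit check
    // documents the assumption (and matters on 32-bit x86 targets).
    if is_x86_feature_detected!("sse") {
        unsafe {
            let va = _mm_loadu_ps(a.as_ptr());
            let vb = _mm_loadu_ps(b.as_ptr());
            let r = _mm_mul_ps(_mm_add_ps(va, vb), _mm_set1_ps(c));
            let mut out = [0.0f32; 4];
            _mm_storeu_ps(out.as_mut_ptr(), r);
            return out;
        }
    }
    scalar_add_mul(a, b, c)
}

#[cfg(not(target_arch = "x86_64"))]
fn add_mul(a: [f32; 4], b: [f32; 4], c: f32) -> [f32; 4] {
    scalar_add_mul(a, b, c)
}

// Scalar fallback for targets (or CPUs) without the SIMD path.
fn scalar_add_mul(a: [f32; 4], b: [f32; 4], c: f32) -> [f32; 4] {
    let mut out = [0.0f32; 4];
    for i in 0..4 {
        out[i] = (a[i] + b[i]) * c;
    }
    out
}

fn main() {
    println!("{:?}", add_mul([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0], 9.0));
}
```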
Handcoding SIMD Intrinsics Directly
On this side of the compiler, one last option remains. We can handcraft the intrinsics directly, without any structs, like so:
```rust
unsafe {
    let a = _mm_set_ps(rng10(), rng10(), rng10(), rng10());
    let b = _mm_set_ps(rng10(), rng10(), rng10(), rng10());
    let result = _mm_mul_ps(_mm_add_ps(a, b), _mm_set1_ps(rng10()));
    println!("{:?} + {:?} = {:?}", a, b, result);
}
```
Given the craftiness of the compiler thus far, we can expect the result to be pretty similar again:
```asm
call    example::rng10
movss   dword ptr [rsp], xmm0
call    example::rng10
movss   dword ptr [rsp + 16], xmm0
call    example::rng10
movss   dword ptr [rsp + 32], xmm0
call    example::rng10
movss   dword ptr [rsp + 160], xmm0
movss   xmm0, dword ptr [rsp + 32]
movss   dword ptr [rsp + 164], xmm0
movss   xmm0, dword ptr [rsp + 16]
movss   dword ptr [rsp + 168], xmm0
movss   xmm0, dword ptr [rsp]
movss   dword ptr [rsp + 172], xmm0
call    example::rng10
movss   dword ptr [rsp], xmm0
call    example::rng10
movss   dword ptr [rsp + 16], xmm0
call    example::rng10
movss   dword ptr [rsp + 32], xmm0
call    example::rng10
movss   dword ptr [rsp + 176], xmm0
movss   xmm0, dword ptr [rsp + 32]
movss   dword ptr [rsp + 180], xmm0
movss   xmm0, dword ptr [rsp + 16]
movss   dword ptr [rsp + 184], xmm0
movss   xmm0, dword ptr [rsp]
movss   dword ptr [rsp + 188], xmm0
movaps  xmm0, xmmword ptr [rsp + 160]
addps   xmm0, xmmword ptr [rsp + 176]
movaps  xmmword ptr [rsp], xmm0
call    example::rng10
shufps  xmm0, xmm0, 0
mulps   xmm0, xmmword ptr [rsp]
movaps  xmmword ptr [rsp + 336], xmm0
```
It is indeed similar, although there are still two seemingly superfluous shuffling blocks.
So??
Well, here’s the thing: all of the above implementations are essentially equivalent. There is no single clear winner. The compiler understands how to automatically vectorize this code quite effortlessly; nevertheless, it ends up shuffling the values around quite a lot…