Speeding up RGB to grayscale conversion in Rust by a factor of 2.2 – and various other multimedia related processing loops

In the previous blog post I wrote about how to write an RGB to grayscale conversion filter for GStreamer in Rust. In this blog post I’m going to write about how to optimize the processing loop of that filter without resorting to unsafe code or SIMD instructions, staying with plain, safe Rust code.

I also tried to implement the processing loop with faster, a Rust crate for writing safe SIMD code. It looks very promising, but unless I missed something in the documentation it is currently missing some features needed to express this specific algorithm in a meaningful way. Once it works on stable Rust (waiting for SIMD to be stabilized) and includes runtime CPU feature detection, it could very well be a good replacement for the ORC library used for the same purpose in various places in GStreamer. ORC works by JIT-compiling a minimal “array operation language” to SIMD assembly for your specific CPU (and has support for x86 MMX/SSE, PPC Altivec, ARM NEON, etc.).

If someone wants to prove me wrong and implement this with faster, feel free to do so and I’ll link to your solution and include it in the benchmark results below.

All code below can be found in this Git repository.

Table of Contents

  1. Baseline Implementation
  2. First Optimization – Assertions
  3. First Optimization – Assertions Try 2
  4. Second Optimization – Iterate a bit more
  5. Third Optimization – Getting rid of the bounds check finally
  6. Summary
  7. Addendum: slice::split_at
  8. Addendum 2: SIMD with faster

Baseline Implementation

This is what the baseline implementation looks like.
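A minimal sketch of what such a loop could look like, operating on BGRx data with 4 bytes per pixel. The function signature, the variable names and the exact coefficient values (approximate BT.601 luma weights in 16.16 fixed point) are illustrative assumptions, not necessarily the original code:

    // Approximate BT.601 luma coefficients in 16.16 fixed point, stored in the
    // same B, G, R, x order as the pixel data; the x coefficient is 0.
    const RGB_Y: [u32; 4] = [7471, 38470, 19595, 0];

    pub fn bgrx_to_gray_chunks(
        in_data: &[u8],
        out_data: &mut [u8],
        in_stride: usize,
        out_stride: usize,
        width: usize,
    ) {
        let in_line_bytes = width * 4;
        let out_line_bytes = width * 4;

        // Outer loop: one input line and one output line per iteration.
        for (in_line, out_line) in in_data
            .chunks(in_stride)
            .zip(out_data.chunks_mut(out_stride))
        {
            // Inner loop: one BGRx chunk of 4 bytes per iteration.
            for (in_p, out_p) in in_line[..in_line_bytes]
                .chunks(4)
                .zip(out_line[..out_line_bytes].chunks_mut(4))
            {
                let b = u32::from(in_p[0]);
                let g = u32::from(in_p[1]);
                let r = u32::from(in_p[2]);
                let x = u32::from(in_p[3]);

                // Weighted sum in 16.16 fixed point, then scale back down to u8.
                let grey =
                    (b * RGB_Y[0] + g * RGB_Y[1] + r * RGB_Y[2] + x * RGB_Y[3]) / 65536;
                let grey = grey as u8;

                // Store the same grayscale value in all four output bytes.
                out_p[0] = grey;
                out_p[1] = grey;
                out_p[2] = grey;
                out_p[3] = grey;
            }
        }
    }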

This basically iterates over each line of the input and output frame (outer loop), and then for each BGRx chunk of 4 bytes in each line it converts the values to u32, multiplies them with a constant coefficient array, converts the result back to u8 and stores that same value in all four bytes of the output BGRx chunk.

Note: This is only doing the actual conversion from linear RGB to grayscale (and in BT.601 colorspace). To do this conversion correctly you need to know your colorspaces and use the correct coefficients for conversion, and also do gamma correction. See this about why it is important.

So what can be improved on this? For starters, let’s write a small benchmark for this so that we know whether any of our changes actually improve something. This is using the (unfortunately still) unstable benchmark feature of Cargo.
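A sketch of such a benchmark, assuming the conversion function from the sketch above and an arbitrary frame size (all names and dimensions here are illustrative):

    // The feature gate goes into the crate root; benchmarks only build on nightly.
    #![feature(test)]
    extern crate test;

    #[cfg(test)]
    mod benches {
        use super::*;
        use test::Bencher;

        #[bench]
        fn bench_chunks(b: &mut Bencher) {
            let width = 320;
            let height = 240;
            let in_stride = width * 4;
            let out_stride = width * 4;

            // Allocate and initialize the frames outside the closure so that
            // only the conversion itself is measured.
            let in_data = vec![0u8; in_stride * height];
            let mut out_data = vec![0u8; out_stride * height];

            b.iter(|| {
                bgrx_to_gray_chunks(&in_data, &mut out_data, in_stride, out_stride, width)
            });
        }
    }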

This can be run with cargo bench and then prints the number of nanoseconds each iteration of the closure was taking. To only really measure the processing itself, allocations and initializations of the input/output frames happen outside of the closure. We’re not interested in times for that.

First Optimization – Assertions

To actually start optimizing this function, let’s take a look at the assembly that the compiler is outputting. The easiest way of doing that is via the Godbolt Compiler Explorer website. Select “rustc nightly” and use “-C opt-level=3” for the compiler flags, and then copy & paste your code in there. Once it compiles, to find the assembly that corresponds to a line, simply right-click on the line and “Scroll to assembly”.

Alternatively you can use cargo rustc --release -- -C opt-level=3 --emit asm and check the assembly file that is output in the target/release/deps directory.

What we see then for our inner loop is something like the following

This is already quite optimized. For each loop iteration the first few instructions do some bounds checking and, if it fails, jump to the .LBB4_34 or .LBB4_35 labels. How can we tell that this is bounds checking? Scroll down in the assembly to where these labels are defined and you’ll see something like the following

Also if you check (with the colors, or the “scroll to source” feature) which Rust code these correspond to, you’ll see that it’s the first and third access to the 4-byte slice that contains our BGRx values.

Afterwards in the assembly, the following steps happen:

  0. incrementing of the “loop counter” representing the number of iterations we’re going to do (r9),
  1. the actual reading of the B, G and R values and conversion to u32 (the 3 movzx; note that the reading of the x value is optimized away, as the compiler sees that it is always multiplied by 0 later),
  2. the multiplications with the array elements (the 3 imul),
  3. combining of the results and the division (i.e. shift) (the 2 add and the shr),
  4. storing of the result in the output (the 4 mov).

Afterwards the slice pointers are increased by 4 (rbx and r10) and the lengths (used for bounds checking) are decreased by 4 (r8 and r15). Finally there’s a check (cmp) to see if our loop counter (r9) is at the end of the slice, and if not we jump back to the beginning and operate on the next BGRx chunk.

Generally what we want to do for optimizations is to get rid of unnecessary checks (bounds checking), memory accesses, conditions (cmp, cmov) and jumps (the instructions starting with j). These are all things that are slowing down our code.

So the first thing that seems useful to optimize here is the bounds checking at the beginning. It definitely seems unnecessary to do two checks instead of one for the two slices (the checks are really for both slices at once, but Godbolt does not detect that and attributes them only to the input slice). And ideally we could teach the compiler that no bounds checking is needed at all.

As I wrote in the previous blog post, often this knowledge can be given to the compiler by inserting assertions.

To prevent two checks and just have a single check, you can insert an assert_eq!(in_p.len(), 4) at the beginning of the inner loop, and the same for the output slice. Now we only have a single bounds check left per iteration.
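In the inner loop of the baseline sketch above, this would look roughly like the following:

    for (in_p, out_p) in in_line[..in_line_bytes]
        .chunks(4)
        .zip(out_line[..out_line_bytes].chunks_mut(4))
    {
        assert_eq!(in_p.len(), 4);
        assert_eq!(out_p.len(), 4);

        // ... grayscale conversion as before ...
    }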

As a next step we might want to try to move this knowledge outside the inner loop so that there is no bounds checking at all in there anymore. We might then want to add assertions like the following outside the outer loop, to give the compiler all the knowledge we have
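For example something along these lines, expressing invariants that hold by construction; the exact set of assertions is a sketch based on the variable names used in the baseline sketch above:

    assert!(in_line_bytes <= in_stride);
    assert!(out_line_bytes <= out_stride);
    assert!(in_data.len() % in_stride == 0);
    assert!(out_data.len() % out_stride == 0);
    assert_eq!(in_data.len() / in_stride, out_data.len() / out_stride);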

Unfortunately adding those has no effect at all on the inner loop, but having them outside the outer loop for good measure is not the worst idea, so let’s just keep them. At least they can serve as some kind of documentation of the invariants of this code for future readers.

So let’s benchmark these two implementations now. The results on my machine are the following

This is surprising: our version without the assertions is actually faster by a factor of ~1.1, even though the assertion version was supposed to have fewer conditions. So let’s take a closer look at the assembly at the top of the loop again, where the bounds checking happens, in the version with assertions

While this indeed has only one jump as expected for the bounds checking, the number of comparisons is the same and even worse: 3 memory writes to the stack are happening right before the jump. If we follow to the .LBB4_33 label we will see that the assert_eq! macro is going to do something with core::fmt::Debug. This is setting up the information needed for printing the assertion failure, the “expected X equals to Y” output. This is certainly not good and the reason why everything is slower now.

First Optimization – Assertions Try 2

All the additional instructions and memory writes were happening because the assert_eq! macro outputs something user-friendly that actually contains the values of both sides. Let’s try again with the assert! macro instead
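That is, the two assertions at the top of the inner loop simply become

    assert!(in_p.len() == 4);
    assert!(out_p.len() == 4);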

This already looks more promising. Compared to our baseline version this gives us a speedup of a factor of 1.12, and compared to the version with assert_eq! 1.23. If we look at the assembly for the bounds checks (everything else stays the same), it also looks more like what we would’ve expected

One cmp less, only one jump left. And no memory writes anymore!

So keep in mind that assert_eq! is more user-friendly, but quite a bit more expensive than assert! even in the “good case”.

Second Optimization – Iterate a bit more

This is still not very satisfying though. No bounds checking should be needed at all, as each chunk is going to be exactly 4 bytes. We’re just not able to convince the compiler that this is the case. While it may be possible (let me know if you find a way!), let’s try something different. The zip iterator is done when the shorter of the two iterators is done, and there are optimizations implemented specifically for zipped slice iterators. Let’s try that and replace the grayscale value calculation with
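A sketch of that replacement, reusing the RGB_Y coefficient array from the baseline sketch above (which is stored in the same B, G, R, x order as the pixel data, so the element-wise zip matches up):

    let grey = in_p
        .iter()
        .zip(RGB_Y.iter())
        .map(|(i, c)| u32::from(*i) * c)
        .sum::<u32>()
        / 65536;
    let grey = grey as u8;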

If we run that through our benchmark after removing the assert!(in_p.len() == 4) (and the same for the output slice), these are the results

We’re actually 2.9 times slower! Even when adding back the assert!(in_p.len() == 4) assertion (and the same for the output slice) we’re still slower

If we look at the assembly of the assertion-less variant, it’s a complete mess now

In short, there are now various new conditions and jumps for short-circuiting the zip iterator in the various cases. And because of all the added noise, the compiler was not even able to optimize the bounds check for the output slice away anymore (the .LBB0_35 cases). While it was able to unroll the iterator (note that the 3 imul multiplications are not interleaved with jumps and are actually 3 multiplications instead of yet another loop), which is quite impressive, it couldn’t do anything meaningful with the information it apparently had (to unroll like this it must have understood that each chunk has 4 bytes!). This looks like something going wrong somewhere in the optimizer to me.

If we take a look at the variant with the assertions, things look much better

This is literally the same as the assertion version we had before, except that the reading of the input slice, the multiplications and the additions happen in iterator order instead of being batched all together. It’s quite impressive that the compiler was able to completely optimize away the zip iterator here, but unfortunately it’s still many times slower than the original version. The reason must be the instruction reordering. The previous version had all memory reads batched and then the operations batched, which is apparently much better for the internal pipelining of the CPU (it can already execute the next instructions that have no dependencies on previous ones while waiting for the pending instructions to finish).

It’s also not clear to me why the LLVM optimizer is not able to schedule the instructions the same way here. It apparently has all the information it needs if no iterator is involved, and both versions lead to exactly the same assembly except for the order of instructions. This also seems fishy.

Nonetheless, we still have our manual bounds check (the assertion) left here and we should really try to get rid of that. No progress so far.

Third Optimization – Getting rid of the bounds check finally

Let’s tackle this from a different angle now. Our problem is apparently that the compiler is not able to understand that each chunk is exactly 4 bytes.

So why don’t we write a new chunks iterator that always has exactly the requested number of items, instead of potentially fewer for the very last iteration? And instead of panicking if there are leftover elements, it seems useful to just ignore them. That way we have an API that is functionally different from the existing chunks iterator and provides behaviour that is useful in various cases. It’s basically the slice equivalent of the exact_chunks iterator of the ndarray crate.

As it is functionally different from the existing iterator, and not just an optimization, I also submitted it for inclusion in Rust’s standard library, and it’s nowadays available as an unstable feature in nightly, like all newly added API. Nonetheless, the same can also be implemented inside your own code with basically the same effect; there are no dependencies on standard library internals.
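As a rough illustration of the idea, a minimal safe version for immutable slices could be sketched like this (the real iterator additionally comes with a mutable counterpart and more of the usual iterator trait implementations):

    // A minimal, safe sketch of an exact-chunks style iterator: it only ever
    // yields complete chunks and silently ignores any leftover elements.
    struct ExactChunks<'a, T: 'a> {
        slice: &'a [T],
        chunk_size: usize,
    }

    impl<'a, T> ExactChunks<'a, T> {
        fn new(slice: &'a [T], chunk_size: usize) -> Self {
            assert!(chunk_size != 0);
            ExactChunks { slice, chunk_size }
        }
    }

    impl<'a, T> Iterator for ExactChunks<'a, T> {
        type Item = &'a [T];

        fn next(&mut self) -> Option<&'a [T]> {
            if self.slice.len() < self.chunk_size {
                // Less than a full chunk left: stop and ignore the remainder.
                None
            } else {
                let (head, tail) = self.slice.split_at(self.chunk_size);
                self.slice = tail;
                Some(head)
            }
        }
    }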

So, let’s use our new exact_chunks iterator that is guaranteed (by API) to always give us exactly 4 bytes. In our case this is exactly equivalent to the normal chunks as by construction our slices always have a length that is a multiple of 4, but the compiler can’t infer that information. The resulting code looks as follows
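Sketched on top of the baseline version from above (assertions omitted, and exact_chunks/exact_chunks_mut being the method names of the nightly feature), the two loops then become:

    for (in_line, out_line) in in_data
        .exact_chunks(in_stride)
        .zip(out_data.exact_chunks_mut(out_stride))
    {
        for (in_p, out_p) in in_line[..in_line_bytes]
            .exact_chunks(4)
            .zip(out_line[..out_line_bytes].exact_chunks_mut(4))
        {
            let b = u32::from(in_p[0]);
            let g = u32::from(in_p[1]);
            let r = u32::from(in_p[2]);
            let x = u32::from(in_p[3]);

            let grey =
                (b * RGB_Y[0] + g * RGB_Y[1] + r * RGB_Y[2] + x * RGB_Y[3]) / 65536;
            let grey = grey as u8;

            out_p[0] = grey;
            out_p[1] = grey;
            out_p[2] = grey;
            out_p[3] = grey;
        }
    }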

It’s exactly the same as the previous version with assertions, except for using exact_chunks instead of chunks, and the same for the mutable iterator. The resulting benchmark of all our variants now looks as follows

Compared to our initial version this is a speedup of a factor of 2.2, compared to our version with assertions a factor of 1.98. This seems like a worthwhile improvement, and if we look at the resulting assembly there are no bounds checks at all anymore

Also due to this the compiler is able to apply some more optimizations, and we only have one loop counter for the number of iterations (r10) and the two pointers (rcx and rsi) that are increased/decreased in each iteration. There is no tracking of the remaining slice lengths anymore, as there was in the assembly of the original version (and the versions with assertions).

Summary

So overall we got a speedup of a factor of 2.2 while still writing very high-level Rust code with iterators and without falling back to unsafe code or SIMD. The optimizations the Rust compiler is applying are quite impressive, and Rust’s marketing promise of zero-cost abstractions really shows here.

The same approach should also work for many similar algorithms, and thus for many multimedia-related processing loops where you iterate over slices and operate on fixed-size chunks.

Also the above shows that as a first step it’s better to write clean and understandable high-level Rust code without worrying too much about performance (assume the compiler can optimize well), and only afterwards take a look at the generated assembly and check which instructions should really go away (like bounds checking). In many cases this can be achieved by adding assertions in strategic places, or, like in this case, by switching to a slightly different abstraction that is closer to the actual requirements (however, I believe the compiler should be able to produce the same code from the normal chunks iterator with the help of assertions; making that possible probably requires improvements to the LLVM optimizer).

And if all does not help, there’s still the escape hatch of unsafe (for using functions like slice::get_unchecked() or going down to raw pointers) and the possibility of using SIMD instructions (by using faster or stdsimd directly). But in the end this should be a last resort for those little parts of your code where optimizations are needed and the compiler can’t be easily convinced to do it for you.

Addendum: slice::split_at

User newpavlov suggested on Reddit using repeated slice::split_at in a while loop for similar performance.

This would for example look like the following
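A sketch of what the inner loop could look like with this approach, again reusing the names from the baseline sketch; std::mem::replace is used so that the mutable slice variable can be reassigned to the remaining part on every iteration:

    let mut in_pp = &in_line[..in_line_bytes];
    let mut out_pp = &mut out_line[..out_line_bytes];

    while in_pp.len() >= 4 && out_pp.len() >= 4 {
        let (in_p, in_rest) = in_pp.split_at(4);
        // Temporarily move the mutable slice out of the variable so that the
        // remainder can be stored back into it below.
        let out_tmp = std::mem::replace(&mut out_pp, &mut []);
        let (out_p, out_rest) = out_tmp.split_at_mut(4);

        let b = u32::from(in_p[0]);
        let g = u32::from(in_p[1]);
        let r = u32::from(in_p[2]);
        let x = u32::from(in_p[3]);

        let grey = (b * RGB_Y[0] + g * RGB_Y[1] + r * RGB_Y[2] + x * RGB_Y[3]) / 65536;
        let grey = grey as u8;

        out_p[0] = grey;
        out_p[1] = grey;
        out_p[2] = grey;
        out_p[3] = grey;

        in_pp = in_rest;
        out_pp = out_rest;
    }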

Performance-wise this brings us very close to the exact_chunks version

and the assembly is also very similar

Here the compiler even optimizes the storing of the value into a single write operation of 4 bytes, at the cost of an additional multiplication and zero-extend register move.

Overall this code performs very well too, but in my opinion it looks rather ugly compared to the versions using the different chunks iterators. Also this is basically what the exact_chunks iterator does internally: repeatedly calling slice::split_at. In theory both versions could lead to the very same assembly, but the LLVM optimizer currently handles them slightly differently.

Addendum 2: SIMD with faster

Adam Niederer, author of faster, provided a PR that implements the same algorithm with faster to make explicit use of SIMD instructions if available.

Due to some codegen issues, this currently has to be compiled with the CPU being selected as nehalem, i.e. by running RUSTFLAGS="-C target-cpu=nehalem" cargo +nightly bench, but it provides yet another speedup of up to 1.27x compared to the fastest previous solution and 2.7x compared to the initial solution:

The code in question is very similar to what you would’ve written with ORC, especially the unstrided version. You basically operate on multiple elements at once, doing the same operation on each, but both versions do this in a slightly different way.
