In the previous blog post I wrote about how to write an RGB to grayscale conversion filter for GStreamer in Rust. In this blog post I’m going to write about how to optimize the processing loop of that filter, without resorting to unsafe code or SIMD instructions, by staying with plain, safe Rust code.
I also tried to implement the processing loop with faster, a Rust crate for writing safe SIMD code. It looks very promising, but unless I missed something in the documentation, it is currently missing some features needed to express this specific algorithm in a meaningful way. Once it works on stable Rust (waiting for SIMD to be stabilized) and includes runtime CPU feature detection, it could very well be a good replacement for the ORC library used for the same purpose in various places in GStreamer. ORC works by JIT-compiling a minimal “array operation language” to SIMD assembly for your specific CPU (and has support for x86 MMX/SSE, PPC Altivec, ARM NEON, etc.).
If someone wants to prove me wrong and implement this with faster, feel free to do so and I’ll link to your solution and include it in the benchmark results below.
All code below can be found in this Git repository.
Table of Contents
- Baseline Implementation
- First Optimization – Assertions
- First Optimization – Assertions Try 2
- Second Optimization – Iterate a bit more
- Third Optimization – Getting rid of the bounds check finally
- Summary
- Addendum: slice::split_at
- Addendum 2: SIMD with faster
Baseline Implementation
This is what the baseline implementation looks like.
pub fn bgrx_to_gray_chunks_no_asserts(
    in_data: &[u8],
    out_data: &mut [u8],
    in_stride: usize,
    out_stride: usize,
    width: usize,
) {
    let in_line_bytes = width * 4;
    let out_line_bytes = width * 4;

    for (in_line, out_line) in in_data
        .chunks(in_stride)
        .zip(out_data.chunks_mut(out_stride))
    {
        for (in_p, out_p) in in_line[..in_line_bytes]
            .chunks(4)
            .zip(out_line[..out_line_bytes].chunks_mut(4))
        {
            let b = u32::from(in_p[0]);
            let g = u32::from(in_p[1]);
            let r = u32::from(in_p[2]);
            let x = u32::from(in_p[3]);

            let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2]) + (x * RGB_Y[3])) / 65536;
            let grey = grey as u8;

            out_p[0] = grey;
            out_p[1] = grey;
            out_p[2] = grey;
            out_p[3] = grey;
        }
    }
}
This basically iterates over each line of the input and output frame (outer loop), and then for each BGRx chunk of 4 bytes in each line it converts the values to u32, multiplies with a constant array, converts back to u8 and stores the same value in the whole output BGRx chunk.
Note: This is only doing the actual conversion from linear RGB to grayscale (and in BT.601 colorspace). To do this conversion correctly you need to know your colorspaces and use the correct coefficients for conversion, and also do gamma correction. See this about why it is important.
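The RGB_Y constant is not shown in the snippet above. Judging from the multiplication constants that appear in the generated assembly further below (and in the faster version at the end), it is the BT.601 luma coefficients scaled by 65536, roughly like this:

// R, G, B and x weights of the BT.601 luma formula
// (0.299, 0.587, 0.114 and 0), scaled by 65536.
const RGB_Y: [u32; 4] = [19595, 38470, 7471, 0];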
So what can be improved on this? For starters, let’s write a small benchmark for this so that we know whether any of our changes actually improve something. This is using the (unfortunately still) unstable benchmark feature of Cargo.
#![feature(test)]
#![feature(exact_chunks)]

extern crate test;

pub fn bgrx_to_gray_chunks_no_asserts(...) {
    [...]
}

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;
    use std::iter;

    fn create_vec(w: usize, h: usize) -> Vec<u8> {
        iter::repeat(0).take(w * h * 4).collect::<_>()
    }

    #[bench]
    fn bench_chunks_1920x1080_no_asserts(b: &mut Bencher) {
        let i = test::black_box(create_vec(1920, 1080));
        let mut o = test::black_box(create_vec(1920, 1080));

        b.iter(|| bgrx_to_gray_chunks_no_asserts(&i, &mut o, 1920 * 4, 1920 * 4, 1920));
    }
}
This can be run with cargo bench and then prints the number of nanoseconds each iteration of the closure took. To only measure the processing itself, allocation and initialization of the input/output frames happen outside of the closure; we’re not interested in the times for that.
First Optimization – Assertions
To actually start optimizing this function, let’s take a look at the assembly that the compiler is outputting. The easiest way of doing that is via the Godbolt Compiler Explorer website. Select “rustc nightly” and use “-C opt-level=3” for the compiler flags, and then copy & paste your code in there. Once it compiles, to find the assembly that corresponds to a line, simply right-click on the line and “Scroll to assembly”.
Alternatively you can use cargo rustc --release -- -C opt-level=3 --emit asm and check the assembly file that is output in the target/release/deps directory.
What we see then for our inner loop is something like the following
.LBB4_19:
    cmp r15, r11
    mov r13, r11
    cmova r13, r15
    mov rdx, r8
    sub rdx, r13
    je .LBB4_34
    cmp rdx, 3
    jb .LBB4_35
    inc r9
    movzx edx, byte ptr [rbx - 1]
    movzx ecx, byte ptr [rbx - 2]
    movzx esi, byte ptr [rbx]
    imul esi, esi, 19595
    imul edx, edx, 38470
    imul ecx, ecx, 7471
    add ecx, edx
    add ecx, esi
    shr ecx, 16
    mov byte ptr [r10 - 3], cl
    mov byte ptr [r10 - 2], cl
    mov byte ptr [r10 - 1], cl
    mov byte ptr [r10], cl
    add r10, 4
    add r8, -4
    add r15, -4
    add rbx, 4
    cmp r9, r14
    jb .LBB4_19
This is already quite optimized. For each loop iteration the first few instructions are doing some bounds checking and, if they fail, jump to the .LBB4_34 or .LBB4_35 labels. How can we tell that this is bounds checking? Scroll down in the assembly to where these labels are defined and you’ll see something like the following
.LBB4_34:
    lea rdi, [rip + .Lpanic_bounds_check_loc.D]
    xor esi, esi
    xor edx, edx
    call core::panicking::panic_bounds_check@PLT
    ud2
.LBB4_35:
    cmp r15, r11
    cmova r11, r15
    sub r8, r11
    lea rdi, [rip + .Lpanic_bounds_check_loc.F]
    mov esi, 2
    mov rdx, r8
    call core::panicking::panic_bounds_check@PLT
    ud2
Also if you check (with the colors, or the “scroll to source” feature) which Rust code these correspond to, you’ll see that it’s the first and third access to the 4-byte slice that contains our BGRx values.
Afterwards in the assembly, the following steps are happening: 0) incrementing of the “loop counter” representing the number of iterations we’re going to do (r9), 1) actual reading of the B, G and R values and conversion to u32 (the 3 movzx; note that the reading of the x value is optimized away, as the compiler sees that it is always multiplied by 0 later), 2) the multiplications with the array elements (the 3 imul), 3) combining of the results and the division (i.e. shift) (the 2 add and the shr), 4) storing of the result in the output (the 4 mov). Afterwards the slice pointers are increased by 4 (rbx and r10) and the lengths (used for bounds checking) are decreased by 4 (r8 and r15). Finally there’s a check (cmp) to see if r9 (our loop counter) has reached the end of the slice, and if not we jump back to the beginning and operate on the next BGRx chunk.
Generally what we want to do for optimizations is to get rid of unnecessary checks (bounds checking), memory accesses, conditions (cmp, cmov) and jumps (the instructions starting with j). These are all things that are slowing down our code.
So the first thing that seems useful to optimize here is the bounds checking at the beginning. It definitely does not seem useful to do two checks instead of one for the two slices (the checks cover both slices at once, but Godbolt does not detect that and believes it’s only the input slice). And ideally we could teach the compiler that no bounds checking is needed at all.
As I wrote in the previous blog post, often this knowledge can be given to the compiler by inserting assertions.
To prevent two checks and just have a single one, you can insert an assert_eq!(in_p.len(), 4) at the beginning of the inner loop, and the same for the output slice. Now we only have a single bounds check left per iteration.
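The beginning of the inner loop then looks like this, with the rest of the body unchanged:

for (in_p, out_p) in in_line[..in_line_bytes]
    .chunks(4)
    .zip(out_line[..out_line_bytes].chunks_mut(4))
{
    // Both chunks are always exactly 4 bytes long; this collapses the
    // per-element bounds checks into a single check per iteration.
    assert_eq!(in_p.len(), 4);
    assert_eq!(out_p.len(), 4);

    let b = u32::from(in_p[0]);
    [...]
}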
As a next step we might want to try to move this knowledge outside the inner loop so that no bounds checking is needed in there anymore. We might then want to add assertions like the following outside the outer loop, to give the compiler all the knowledge we have
assert_eq!(in_data.len() % 4, 0);
assert_eq!(out_data.len() % 4, 0);
assert_eq!(out_data.len() / out_stride, in_data.len() / in_stride);

assert!(in_line_bytes <= in_stride);
assert!(out_line_bytes <= out_stride);
Unfortunately adding those has no effect at all on the inner loop, but having them outside the outer loop for good measure is not the worst idea, so let’s just keep them. At least they can serve as some kind of documentation of the invariants of this code for future readers.
So let’s benchmark these two implementations now. The results on my machine are the following
test tests::bench_chunks_1920x1080_no_asserts ... bench: 4,420,145 ns/iter (+/- 139,051)
test tests::bench_chunks_1920x1080_asserts ... bench: 4,897,046 ns/iter (+/- 166,555)
This is surprising: our version without the assertions is actually faster, by a factor of ~1.1, even though the version with assertions should have fewer checks. So let’s take a closer look at the assembly at the top of the loop again, where the bounds checking happens, in the version with assertions
.LBB4_19:
    cmp rbx, r11
    mov r9, r11
    cmova r9, rbx
    mov r14, r12
    sub r14, r9
    lea rax, [r14 - 1]
    mov qword ptr [rbp - 120], rax
    mov qword ptr [rbp - 128], r13
    mov qword ptr [rbp - 136], r10
    cmp r14, 5
    jne .LBB4_33
    inc rcx
    [...]
While this indeed has only one jump as expected for the bounds checking, the number of comparisons is the same, and even worse: 3 memory writes to the stack are happening right before the jump. If we follow the jump to the .LBB4_33 label we will see that the assert_eq! macro is going to do something with core::fmt::Debug. This is setting up the information needed for printing the assertion failure, the “expected X equals Y” output. This is certainly not good and is the reason why everything is slower now.
First Optimization – Assertions Try 2
All the additional instructions and memory writes were happening because the assert_eq! macro is outputting something user-friendly that actually contains the values of both sides. Let’s try again with the assert! macro instead.
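In the code this simply means replacing the two assertions in the inner loop:

// Same invariants as before, but on failure assert! only prints a static
// message instead of setting up Debug formatting for both values.
assert!(in_p.len() == 4);
assert!(out_p.len() == 4);

The benchmark results now look like this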
test tests::bench_chunks_1920x1080_no_asserts ... bench: 4,420,145 ns/iter (+/- 139,051)
test tests::bench_chunks_1920x1080_asserts ... bench: 4,897,046 ns/iter (+/- 166,555)
test tests::bench_chunks_1920x1080_asserts_2 ... bench: 3,968,976 ns/iter (+/- 97,084)
This already looks more promising. Compared to our baseline version this gives us a speedup of a factor of 1.12, and compared to the version with assert_eq! 1.23. If we look at the assembly for the bounds checks (everything else stays the same), it also looks more like what we would’ve expected
.LBB4_19:
    cmp rbx, r12
    mov r13, r12
    cmova r13, rbx
    add r13, r14
    jne .LBB4_33
    inc r9
    [...]
One cmp less, only one jump left. And no memory writes anymore!
So keep in mind that assert_eq! is more user-friendly, but even in the “good case” it is quite a bit more expensive than assert!.
Second Optimization – Iterate a bit more
This is still not very satisfying though. No bounds checking should be needed at all, as each chunk is going to be exactly 4 bytes. We’re just not able to convince the compiler that this is the case. While it may be possible (let me know if you find a way!), let’s try something different. The zip iterator is done when the shorter of the two iterators is done, and there are optimizations implemented specifically for zipped slice iterators. Let’s try that and replace the grayscale value calculation with
let grey = in_p.iter()
    .zip(RGB_Y.iter())
    .map(|(i, c)| u32::from(*i) * c)
    .sum::<u32>() / 65536;
If we run that through our benchmark after removing the assert!(in_p.len() == 4) (and the same for the output slice), these are the results
test tests::bench_chunks_1920x1080_asserts_2 ... bench: 3,968,976 ns/iter (+/- 97,084)
test tests::bench_chunks_1920x1080_iter_sum ... bench: 11,393,600 ns/iter (+/- 347,958)
We’re actually 2.9 times slower! Even when adding back the assert!(in_p.len() == 4) assertion (and the same for the output slice) we’re still slower
test tests::bench_chunks_1920x1080_asserts_2 ... bench: 3,968,976 ns/iter (+/- 97,084)
test tests::bench_chunks_1920x1080_iter_sum ... bench: 11,393,600 ns/iter (+/- 347,958)
test tests::bench_chunks_1920x1080_iter_sum_2 ... bench: 10,420,442 ns/iter (+/- 242,379)
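For reference, the inner loop body of this iter_sum_2 variant presumably combines both approaches, roughly like this:

// Manual length assertions as in the previous attempt, plus the
// iterator-based grayscale calculation.
assert!(in_p.len() == 4);
assert!(out_p.len() == 4);

let grey = in_p.iter()
    .zip(RGB_Y.iter())
    .map(|(i, c)| u32::from(*i) * c)
    .sum::<u32>() / 65536;
let grey = grey as u8;

out_p[0] = grey;
out_p[1] = grey;
out_p[2] = grey;
out_p[3] = grey;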
If we look at the assembly of the assertion-less variant, it’s a complete mess now
.LBB0_19:
    cmp rbx, r13
    mov rcx, r13
    cmova rcx, rbx
    mov rdx, r8
    sub rdx, rcx
    cmp rdx, 4
    mov r11d, 4
    cmovb r11, rdx
    test r11, r11
    je .LBB0_20
    movzx ecx, byte ptr [r15 - 2]
    imul ecx, ecx, 19595
    cmp r11, 1
    jbe .LBB0_22
    movzx esi, byte ptr [r15 - 1]
    imul esi, esi, 38470
    add esi, ecx
    movzx ecx, byte ptr [r15]
    imul ecx, ecx, 7471
    add ecx, esi
    test rdx, rdx
    jne .LBB0_23
    jmp .LBB0_35
.LBB0_20:
    xor ecx, ecx
.LBB0_22:
    test rdx, rdx
    je .LBB0_35
.LBB0_23:
    shr ecx, 16
    mov byte ptr [r10 - 3], cl
    mov byte ptr [r10 - 2], cl
    cmp rdx, 3
    jb .LBB0_36
    inc r9
    mov byte ptr [r10 - 1], cl
    mov byte ptr [r10], cl
    add r10, 4
    add r8, -4
    add rbx, -4
    add r15, 4
    cmp r9, r14
    jb .LBB0_19
In short, there are now various new conditions and jumps for short-circuiting the zip iterator in the various cases. And because of all the noise added, the compiler was not even able to optimize away the bounds check for the output slice anymore (the .LBB0_35 cases). While it was able to unroll the iterator (note that the 3 imul multiplications are not interleaved with jumps and are actually 3 multiplications instead of yet another loop), which is quite impressive, it couldn’t do anything meaningful with the information it apparently had (it must have understood that each chunk has 4 bytes!). This looks to me like something going wrong somewhere in the optimizer.
If we take a look at the variant with the assertions, things look much better
.LBB3_19:
    cmp r11, r12
    mov r13, r12
    cmova r13, r11
    add r13, r14
    jne .LBB3_33
    inc r9
    movzx ecx, byte ptr [rdx - 2]
    imul r13d, ecx, 19595
    movzx ecx, byte ptr [rdx - 1]
    imul ecx, ecx, 38470
    add ecx, r13d
    movzx ebx, byte ptr [rdx]
    imul ebx, ebx, 7471
    add ebx, ecx
    shr ebx, 16
    mov byte ptr [r10 - 3], bl
    mov byte ptr [r10 - 2], bl
    mov byte ptr [r10 - 1], bl
    mov byte ptr [r10], bl
    add r10, 4
    add r11, -4
    add r14, 4
    add rdx, 4
    cmp r9, r15
    jb .LBB3_19
This is literally the same as the assertion version we had before, except that the reading of the input slice, the multiplications and the additions happen in iterator order instead of being batched together. It’s quite impressive that the compiler was able to completely optimize away the zip iterator here, but unfortunately it’s still more than twice as slow as the original version. The reason must be the instruction reordering: the previous version had all memory reads batched and then the operations batched, which is apparently much better for the internal pipelining of the CPU (it can already execute subsequent instructions that have no dependencies on previous ones while waiting for the pending instructions to finish).
It’s also not clear to me why the LLVM optimizer is not able to schedule the instructions the same way here. It apparently has all the information it needs if no iterator is involved, and both versions lead to exactly the same assembly except for the order of instructions. This also seems fishy.
Nonetheless, we still have our manual bounds check (the assertion) left here and we should really try to get rid of that. No progress so far.
Third Optimization – Getting rid of the bounds check finally
Let’s tackle this from a different angle now. Our problem is apparently that the compiler is not able to understand that each chunk is exactly 4 bytes.
So why don’t we write a new chunks iterator that always has exactly the requested number of items, instead of potentially fewer for the very last iteration? And instead of panicking if there are leftover elements, it seems useful to just ignore them. That way we have an API that is functionally different from the existing chunks iterator and provides behaviour that is useful in various cases. It’s basically the slice equivalent of the exact_chunks iterator of the ndarray crate.
Because it is functionally different from the existing iterator, and not just an optimization, I also submitted it for inclusion in Rust’s standard library, and it is nowadays available as an unstable feature in nightly, like all newly added API. Nonetheless, the same can also be implemented inside your own code with basically the same effect; there are no dependencies on standard library internals.
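As a rough idea of how little is needed for this, a minimal sketch of such an iterator in your own code could look like the following. The name is made up here and it is limited to byte slices; the actual standard library version is generic over the element type and also comes with a mutable counterpart:

// Hypothetical stand-in for the unstable exact_chunks API: always yields
// slices of exactly `chunk_size` bytes and silently ignores any leftover
// elements at the end.
struct ExactChunks<'a> {
    data: &'a [u8],
    chunk_size: usize,
}

impl<'a> ExactChunks<'a> {
    fn new(data: &'a [u8], chunk_size: usize) -> Self {
        assert!(chunk_size != 0);
        ExactChunks { data, chunk_size }
    }
}

impl<'a> Iterator for ExactChunks<'a> {
    type Item = &'a [u8];

    fn next(&mut self) -> Option<&'a [u8]> {
        if self.data.len() < self.chunk_size {
            return None;
        }

        let (chunk, rest) = self.data.split_at(self.chunk_size);
        self.data = rest;
        Some(chunk)
    }
}

The mutable variant works the same way with split_at_mut; it just needs a small mem::replace dance to satisfy the borrow checker.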
So, let’s use our new exact_chunks iterator, which is guaranteed (by its API) to always give us exactly 4 bytes. In our case this is exactly equivalent to the normal chunks iterator, as by construction our slices always have a length that is a multiple of 4, but the compiler can’t infer that information. The resulting code looks as follows
pub fn bgrx_to_gray_exact_chunks(
    in_data: &[u8],
    out_data: &mut [u8],
    in_stride: usize,
    out_stride: usize,
    width: usize,
) {
    assert_eq!(in_data.len() % 4, 0);
    assert_eq!(out_data.len() % 4, 0);
    assert_eq!(out_data.len() / out_stride, in_data.len() / in_stride);

    let in_line_bytes = width * 4;
    let out_line_bytes = width * 4;

    assert!(in_line_bytes <= in_stride);
    assert!(out_line_bytes <= out_stride);

    for (in_line, out_line) in in_data
        .exact_chunks(in_stride)
        .zip(out_data.exact_chunks_mut(out_stride))
    {
        for (in_p, out_p) in in_line[..in_line_bytes]
            .exact_chunks(4)
            .zip(out_line[..out_line_bytes].exact_chunks_mut(4))
        {
            assert!(in_p.len() == 4);
            assert!(out_p.len() == 4);

            let b = u32::from(in_p[0]);
            let g = u32::from(in_p[1]);
            let r = u32::from(in_p[2]);
            let x = u32::from(in_p[3]);

            let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2]) + (x * RGB_Y[3])) / 65536;
            let grey = grey as u8;

            out_p[0] = grey;
            out_p[1] = grey;
            out_p[2] = grey;
            out_p[3] = grey;
        }
    }
}
It’s exactly the same as the previous version with assertions, except for using exact_chunks instead of chunks, and the same for the mutable iterator. The resulting benchmark of all our variants now looks as follows
test tests::bench_chunks_1920x1080_no_asserts ... bench: 4,420,145 ns/iter (+/- 139,051)
test tests::bench_chunks_1920x1080_asserts ... bench: 4,897,046 ns/iter (+/- 166,555)
test tests::bench_chunks_1920x1080_asserts_2 ... bench: 3,968,976 ns/iter (+/- 97,084)
test tests::bench_chunks_1920x1080_iter_sum ... bench: 11,393,600 ns/iter (+/- 347,958)
test tests::bench_chunks_1920x1080_iter_sum_2 ... bench: 10,420,442 ns/iter (+/- 242,379)
test tests::bench_exact_chunks_1920x1080 ... bench: 2,007,459 ns/iter (+/- 112,287)
Compared to our initial version this is a speedup of a factor of 2.2, compared to our version with assertions a factor of 1.98. This seems like a worthwhile improvement, and if we look at the resulting assembly there are no bounds checks at all anymore
.LBB0_10:
    movzx edx, byte ptr [rsi - 2]
    movzx r15d, byte ptr [rsi - 1]
    movzx r12d, byte ptr [rsi]
    imul r13d, edx, 19595
    imul edx, r15d, 38470
    add edx, r13d
    imul ebx, r12d, 7471
    add ebx, edx
    shr ebx, 16
    mov byte ptr [rcx - 3], bl
    mov byte ptr [rcx - 2], bl
    mov byte ptr [rcx - 1], bl
    mov byte ptr [rcx], bl
    add rcx, 4
    add rsi, 4
    dec r10
    jne .LBB0_10
Also due to this the compiler is able to apply some more optimizations, and we only have one loop counter for the number of iterations (r10) and the two pointers (rcx and rsi) that are advanced in each iteration. There is no more tracking of the remaining slice lengths, as in the assembly of the original version (and the versions with assertions).
Summary
So overall we got a speedup of a factor of 2.2 while still writing very high-level Rust code with iterators, without falling back to unsafe code or using SIMD. The optimizations the Rust compiler is applying are quite impressive, and the Rust marketing line of zero-cost abstractions really shows here.
The same approach should also work for many similar algorithms, including many multimedia-related ones where you iterate over slices and operate on fixed-size chunks.
The above also shows that, as a first step, it’s better to write clean and understandable high-level Rust code without worrying too much about performance (assume the compiler can optimize well), and only afterwards take a look at the generated assembly and check which instructions should really go away (like the bounds checking). In many cases this can be achieved by adding assertions in strategic places, or, as in this case, by switching to a slightly different abstraction that is closer to the actual requirements (although I believe the compiler should be able to produce the same code with the normal chunks iterator and the help of assertions; making that possible probably requires improvements to the LLVM optimizer).
And if all does not help, there’s still the escape hatch of unsafe (for using functions like slice::get_unchecked() or going down to raw pointers) and the possibility of using SIMD instructions (by using faster or stdsimd directly). But in the end this should be a last resort for those little parts of your code where optimizations are needed and the compiler can’t be easily convinced to do it for you.
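For illustration only (this variant is not part of the benchmarks above), the unsafe escape hatch for the inner loop body could look roughly like this. It is only sound because the surrounding loop guarantees that both chunks are exactly 4 bytes long:

unsafe {
    // No bounds checks at all: get_unchecked()/get_unchecked_mut() are
    // undefined behaviour if an index is ever out of range.
    let b = u32::from(*in_p.get_unchecked(0));
    let g = u32::from(*in_p.get_unchecked(1));
    let r = u32::from(*in_p.get_unchecked(2));
    let x = u32::from(*in_p.get_unchecked(3));

    let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2]) + (x * RGB_Y[3])) / 65536;
    let grey = grey as u8;

    *out_p.get_unchecked_mut(0) = grey;
    *out_p.get_unchecked_mut(1) = grey;
    *out_p.get_unchecked_mut(2) = grey;
    *out_p.get_unchecked_mut(3) = grey;
}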
Addendum: slice::split_at
User newpavlov suggested on Reddit to use repeated slice::split_at in a while loop for similar performance.
This would for example look like the following
pub fn bgrx_to_gray_split_at(
    in_data: &[u8],
    out_data: &mut [u8],
    in_stride: usize,
    out_stride: usize,
    width: usize,
) {
    assert_eq!(in_data.len() % 4, 0);
    assert_eq!(out_data.len() % 4, 0);
    assert_eq!(out_data.len() / out_stride, in_data.len() / in_stride);

    let in_line_bytes = width * 4;
    let out_line_bytes = width * 4;

    assert!(in_line_bytes <= in_stride);
    assert!(out_line_bytes <= out_stride);

    for (in_line, out_line) in in_data
        .exact_chunks(in_stride)
        .zip(out_data.exact_chunks_mut(out_stride))
    {
        let mut in_pp: &[u8] = in_line[..in_line_bytes].as_ref();
        let mut out_pp: &mut [u8] = out_line[..out_line_bytes].as_mut();
        assert!(in_pp.len() == out_pp.len());

        while in_pp.len() >= 4 {
            let (in_p, in_tmp) = in_pp.split_at(4);
            let (out_p, out_tmp) = { out_pp }.split_at_mut(4);
            in_pp = in_tmp;
            out_pp = out_tmp;

            let b = u32::from(in_p[0]);
            let g = u32::from(in_p[1]);
            let r = u32::from(in_p[2]);
            let x = u32::from(in_p[3]);

            let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2]) + (x * RGB_Y[3])) / 65536;
            let grey = grey as u8;

            out_p[0] = grey;
            out_p[1] = grey;
            out_p[2] = grey;
            out_p[3] = grey;
        }
    }
}
Performance-wise this brings us very close to the exact_chunks version
test tests::bench_exact_chunks_1920x1080 ... bench: 1,965,631 ns/iter (+/- 58,832)
test tests::bench_split_at_1920x1080 ... bench: 2,046,834 ns/iter (+/- 35,990)
and the assembly is also very similar
.LBB0_10:
    add rbx, -4
    movzx r15d, byte ptr [rsi]
    movzx r12d, byte ptr [rsi + 1]
    movzx edx, byte ptr [rsi + 2]
    imul r13d, edx, 19595
    imul r12d, r12d, 38470
    imul edx, r15d, 7471
    add edx, r12d
    add edx, r13d
    shr edx, 16
    movzx edx, dl
    imul edx, edx, 16843009
    mov dword ptr [rcx], edx
    lea rcx, [rcx + 4]
    add rsi, 4
    cmp rbx, 3
    ja .LBB0_10
Here the compiler even optimizes the storing of the value into a single write operation of 4 bytes, at the cost of an additional multiplication and a zero-extending register move: the multiplication by 16843009 (0x01010101) replicates the grayscale byte into all four bytes of a 32-bit value, which is then stored at once.
Overall this code performs very well too, but in my opinion it looks rather ugly compared to the versions using the different chunks iterators. Also, this is basically what the exact_chunks iterator does internally: repeatedly calling slice::split_at. In theory both versions could lead to the very same assembly, but the LLVM optimizer currently handles them slightly differently.
Addendum 2: SIMD with faster
Adam Niederer, author of faster, provided a PR that implements the same algorithm with faster to make explicit use of SIMD instructions if available.
Due to some codegen issues, this currently has to be compiled with the CPU being selected as nehalem, i.e. by running RUSTFLAGS="-C target-cpu=nehalem" cargo +nightly bench, but it provides yet another speedup by a factor of up to 1.27x compared to the fastest previous solution and 2.7x compared to the initial solution:
test tests::bench_chunks_1920x1080_asserts ... bench: 4,539,286 ns/iter (+/- 106,265)
test tests::bench_chunks_1920x1080_asserts_2 ... bench: 3,550,683 ns/iter (+/- 96,917)
test tests::bench_chunks_1920x1080_iter_sum ... bench: 5,233,238 ns/iter (+/- 114,671)
test tests::bench_chunks_1920x1080_iter_sum_2 ... bench: 3,532,059 ns/iter (+/- 94,964)
test tests::bench_chunks_1920x1080_no_asserts ... bench: 4,468,269 ns/iter (+/- 89,329)
test tests::bench_chunks_1920x1080_no_asserts_faster ... bench: 2,476,077 ns/iter (+/- 54,877)
test tests::bench_chunks_1920x1080_no_asserts_faster_unstrided ... bench: 1,642,980 ns/iter (+/- 108,034)
test tests::bench_exact_chunks_1920x1080 ... bench: 2,078,950 ns/iter (+/- 64,536)
test tests::bench_split_at_1920x1080 ... bench: 2,096,603 ns/iter (+/- 107,420)
The code in question is very similar to what you would’ve written with ORC, especially the unstrided version. You basically operate on multiple elements at once, doing the same operation on each, but both versions do this in a slightly different way.
pub fn bgrx_to_gray_chunks_no_asserts_faster_unstrided(in_data: &[u8], out_data: &mut [u8]) {
    // Relies on vector width which is a multiple of 4
    assert!(u8s::WIDTH % 4 == 0 && u32s::WIDTH % 4 == 0);

    const RGB_Y: [u32; 16] = [
        19595, 38470, 7471, 0,
        19595, 38470, 7471, 0,
        19595, 38470, 7471, 0,
        19595, 38470, 7471, 0,
    ];
    let rgbvec = u32s::load(&RGB_Y, 0);

    in_data.simd_iter(u8s(0)).simd_map(|v| {
        let (a, b) = v.upcast();
        let (a32, b32) = a.upcast();
        let (c32, d32) = b.upcast();

        let grey32a = a32 * rgbvec / u32s(65536);
        let grey32b = b32 * rgbvec / u32s(65536);
        let grey32c = c32 * rgbvec / u32s(65536);
        let grey32d = d32 * rgbvec / u32s(65536);

        let grey16a = grey32a.saturating_downcast(grey32b);
        let grey16b = grey32c.saturating_downcast(grey32d);

        let grey = grey16a.saturating_downcast(grey16b);
        grey
    }).scalar_fill(out_data);
}

pub fn bgrx_to_gray_chunks_no_asserts_faster(in_data: &[u8], out_data: &mut [u8]) {
    // Sane, but slowed down by faster's current striding implementation.
    in_data.stride_four(tuplify!(4, u8s(0))).zip().simd_map(|(r, g, b, _)| {
        let (r16a, r16b) = r.upcast();
        let (r32a, r32b) = r16a.upcast();
        let (r32c, r32d) = r16b.upcast();

        let (g16a, g16b) = g.upcast();
        let (g32a, g32b) = g16a.upcast();
        let (g32c, g32d) = g16b.upcast();

        let (b16a, b16b) = b.upcast();
        let (b32a, b32b) = b16a.upcast();
        let (b32c, b32d) = b16b.upcast();

        let grey32a = (r32a * u32s(19595) + g32a * u32s(38470) + b32a * u32s(7471)) / u32s(65536);
        let grey32b = (r32b * u32s(19595) + g32b * u32s(38470) + b32b * u32s(7471)) / u32s(65536);
        let grey32c = (r32c * u32s(19595) + g32c * u32s(38470) + b32c * u32s(7471)) / u32s(65536);
        let grey32d = (r32d * u32s(19595) + g32d * u32s(38470) + b32d * u32s(7471)) / u32s(65536);

        let grey16a = grey32a.saturating_downcast(grey32b);
        let grey16b = grey32c.saturating_downcast(grey32d);

        let grey = grey16a.saturating_downcast(grey16b);
        grey
    }).scalar_fill(out_data);
}