{"id":525,"date":"2018-01-21T14:48:48","date_gmt":"2018-01-21T13:48:48","guid":{"rendered":"https:\/\/coaxion.net\/blog\/?p=525"},"modified":"2020-09-12T08:51:48","modified_gmt":"2020-09-12T06:51:48","slug":"speeding-up-rgb-to-grayscale-conversion-in-rust-by-a-factor-of-2-2-and-various-other-multimedia-related-processing-loops","status":"publish","type":"post","link":"https:\/\/coaxion.net\/blog\/2018\/01\/speeding-up-rgb-to-grayscale-conversion-in-rust-by-a-factor-of-2-2-and-various-other-multimedia-related-processing-loops\/","title":{"rendered":"Speeding up RGB to grayscale conversion in Rust by a factor of 2.2 \u2013 and various other multimedia related processing loops"},"content":{"rendered":"<p>In the <a href=\"https:\/\/coaxion.net\/blog\/2018\/01\/how-to-write-gstreamer-elements-in-rust-part-1-a-video-filter-for-converting-rgb-to-grayscale\/\" rel=\"noopener noreferrer\" target=\"_blank\">previous blog post<\/a> I wrote about how to write a RGB to grayscale conversion filter for <a href=\"https:\/\/gstreamer.freedesktop.org\" rel=\"noopener noreferrer\" target=\"_blank\">GStreamer<\/a> in <a href=\"https:\/\/rust-lang.org\" rel=\"noopener noreferrer\" target=\"_blank\">Rust<\/a>. In this blog post I&#8217;m going to write about how to optimize the processing loop of that filter, without resorting to <em>unsafe<\/em> code or <a href=\"https:\/\/en.wikipedia.org\/wiki\/SIMD\" rel=\"noopener noreferrer\" target=\"_blank\">SIMD<\/a> instructions by staying with plain, safe Rust code.<\/p>\n<p>I also tried to implement the processing loop with <a href=\"https:\/\/github.com\/AdamNiederer\/faster\" rel=\"noopener noreferrer\" target=\"_blank\">faster<\/a>, a Rust crate for writing safe SIMD code. It looks very promising, but unless I missed something in the documentation it currently is missing some features to be able to express this specific algorithm in a meaningful way. 
Once it works on stable Rust (waiting for SIMD to be stabilized) and includes runtime CPU feature detection, this could very well be a good replacement for the <a href=\"https:\/\/cgit.freedesktop.org\/gstreamer\/orc\" rel=\"noopener noreferrer\" target=\"_blank\">ORC<\/a> library used for the same purpose in GStreamer in various places. ORC works by JIT-compiling a minimal &#8220;array operation language&#8221; to SIMD assembly for your specific CPU (and has support for x86 MMX\/SSE, PPC Altivec, ARM NEON, etc.).<\/p>\n<p>If someone wants to prove me wrong and implement this with faster, feel free to do so and I&#8217;ll link to your solution and include it in the benchmark results below.<\/p>\n<p>All code below can be found in this <a href=\"https:\/\/github.com\/sdroege\/bgrx-to-grayscale-benchmark.rs\" rel=\"noopener noreferrer\" target=\"_blank\">GIT repository<\/a>.<\/p>\n<h3 id=\"toc\">Table of Contents<\/h3>\n<ol>\n<li><a href=\"#baseline\">Baseline Implementation<\/a><\/li>\n<li><a href=\"#assertions\">First Optimization \u2013 Assertions<\/a><\/li>\n<li><a href=\"#assertions-2\">First Optimization \u2013 Assertions Try 2<\/a><\/li>\n<li><a href=\"#iterator\">Second Optimization \u2013 Iterate a bit more<\/a><\/li>\n<li><a href=\"#no-more-bounds\">Third Optimization \u2013 Getting rid of the bounds check finally<\/a><\/li>\n<li><a href=\"#summary\">Summary<\/a><\/li>\n<li><a href=\"#split-at\">Addendum: <em>slice::split_at<\/em><\/a><\/li>\n<li><a href=\"#faster\">Addendum 2: SIMD with faster<\/a><\/li>\n<\/ol>\n<h3 id=\"baseline\">Baseline Implementation<\/h3>\n<p>This is what the baseline implementation looks like.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"rust\" data-enlighter-title=\"Baseline implementation\">pub fn bgrx_to_gray_chunks_no_asserts(\n    in_data: &amp;[u8],\n    out_data: &amp;mut [u8],\n    in_stride: usize,\n    out_stride: usize,\n    width: usize,\n) {\n    let in_line_bytes = width * 4;\n    let out_line_bytes = width * 
4;\n\n    for (in_line, out_line) in in_data\n        .chunks(in_stride)\n        .zip(out_data.chunks_mut(out_stride))\n    {\n        for (in_p, out_p) in in_line[..in_line_bytes]\n            .chunks(4)\n            .zip(out_line[..out_line_bytes].chunks_mut(4))\n        {\n            let b = u32::from(in_p[0]);\n            let g = u32::from(in_p[1]);\n            let r = u32::from(in_p[2]);\n            let x = u32::from(in_p[3]);\n\n            let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2]) + (x * RGB_Y[3])) \/ 65536;\n            let grey = grey as u8;\n            out_p[0] = grey;\n            out_p[1] = grey;\n            out_p[2] = grey;\n            out_p[3] = grey;\n        }\n    }\n}<\/pre>\n<p>This basically iterates over each line of the input and output frame (outer loop), and then for each BGRx chunk of 4 bytes in each line it converts the values to <em>u32<\/em>, multiplies with a constant array, converts back to <em>u8<\/em> and stores the same value in the whole output BGRx chunk.<\/p>\n<p><strong>Note:<\/strong> This is only doing the actual conversion from linear RGB to grayscale (and in <a href=\"https:\/\/en.wikipedia.org\/wiki\/YUV#SDTV_with_BT.601\" rel=\"noopener noreferrer\" target=\"_blank\">BT.601<\/a> colorspace). To do this conversion correctly you need to know your colorspaces and use the correct coefficients for conversion, and also do <a href=\"https:\/\/en.wikipedia.org\/wiki\/Gamma_correction\" rel=\"noopener noreferrer\" target=\"_blank\">gamma correction<\/a>. See <a href=\"https:\/\/web.archive.org\/web\/20161024090830\/http:\/\/www.4p8.com\/eric.brasseur\/gamma.html\" rel=\"noopener noreferrer\" target=\"_blank\">this<\/a> about why it is important.<\/p>\n<p>So what can be improved on this? For starters, let&#8217;s write a small benchmark for this so that we know whether any of our changes actually improve something. 
This is using the (unfortunately <a href=\"https:\/\/github.com\/rust-lang\/rfcs\/pull\/2287\" rel=\"noopener noreferrer\" target=\"_blank\">still<\/a>) unstable benchmark feature of Cargo.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"rust\" data-enlighter-title=\"Benchmark\">#![feature(test)]\n#![feature(exact_chunks)]\n\nextern crate test;\n\npub fn bgrx_to_gray_chunks_no_asserts(...)\n    [...]\n}\n\n#[cfg(test)]\nmod tests {\n    use super::*;\n    use test::Bencher;\n    use std::iter;\n\n    fn create_vec(w: usize, h: usize) -&gt; Vec&lt;u8&gt; {\n        iter::repeat(0).take(w * h * 4).collect::&lt;_&gt;()\n    }\n\n    #[bench]\n    fn bench_chunks_1920x1080_no_asserts(b: &amp;mut Bencher) {\n        let i = test::black_box(create_vec(1920, 1080));\n        let mut o = test::black_box(create_vec(1920, 1080));\n\n        b.iter(|| bgrx_to_gray_chunks_no_asserts(&amp;i, &amp;mut o, 1920 * 4, 1920 * 4, 1920));\n    }\n}<\/pre>\n<p>This can be run with <em>cargo bench<\/em> and then prints the number of nanoseconds each iteration of the closure takes. To only really measure the processing itself, allocations and initializations of the input\/output frame are happening outside of the closure. We&#8217;re not interested in times for that.<\/p>\n<h3 id=\"assertions\">First Optimization \u2013 Assertions<\/h3>\n<p>To actually start optimizing this function, let&#8217;s take a look at the assembly that the compiler is outputting. The easiest way of doing that is via the <a href=\"https:\/\/godbolt.org\" rel=\"noopener noreferrer\" target=\"_blank\">Godbolt Compiler Explorer<\/a> website. Select &#8220;rustc nightly&#8221; and use <em>&#8220;-C opt-level=3&#8221;<\/em> for the compiler flags, and then copy &amp; paste your code in there. 
Once it compiles, to find the assembly that corresponds to a line, simply right-click on the line and &#8220;Scroll to assembly&#8221;.<\/p>\n<p>Alternatively you can use <em>cargo rustc --release -- -C opt-level=3 --emit asm<\/em> and check the assembly file that is output in the <em>target\/release\/deps<\/em> directory.<\/p>\n<p>What we see then for our inner loop is something like the following<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"asm\" data-enlighter-title=\"\">.LBB4_19:\n  cmp r15, r11\n  mov r13, r11\n  cmova r13, r15\n  mov rdx, r8\n  sub rdx, r13\n  je .LBB4_34\n  cmp rdx, 3\n  jb .LBB4_35\n  inc r9\n  movzx edx, byte ptr [rbx - 1]\n  movzx ecx, byte ptr [rbx - 2]\n  movzx esi, byte ptr [rbx]\n  imul esi, esi, 19595\n  imul edx, edx, 38470\n  imul ecx, ecx, 7471\n  add ecx, edx\n  add ecx, esi\n  shr ecx, 16\n  mov byte ptr [r10 - 3], cl\n  mov byte ptr [r10 - 2], cl\n  mov byte ptr [r10 - 1], cl\n  mov byte ptr [r10], cl\n  add r10, 4\n  add r8, -4\n  add r15, -4\n  add rbx, 4\n  cmp r9, r14\n  jb .LBB4_19<\/pre>\n<p>This is already quite optimized. For each loop iteration the first few instructions are doing some bounds checking and if they fail jump to the <em>.LBB4_34<\/em> or <em>.LBB4_35<\/em> labels. How to understand that this is bounds checking? 
Scroll down in the assembly to where these labels are defined and you&#8217;ll see something like the following<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"asm\" data-enlighter-title=\"\">.LBB4_34:\n  lea rdi, [rip + .Lpanic_bounds_check_loc.D]\n  xor esi, esi\n  xor edx, edx\n  call core::panicking::panic_bounds_check@PLT\n  ud2\n.LBB4_35:\n  cmp r15, r11\n  cmova r11, r15\n  sub r8, r11\n  lea rdi, [rip + .Lpanic_bounds_check_loc.F]\n  mov esi, 2\n  mov rdx, r8\n  call core::panicking::panic_bounds_check@PLT\n  ud2<\/pre>\n<p>Also if you check (with the colors, or the &#8220;scroll to source&#8221; feature) which Rust code these correspond to, you&#8217;ll see that it&#8217;s the first and third access to the 4-byte slice that contains our BGRx values.<\/p>\n<p>Afterwards in the assembly, the following steps are happening: 0) incrementing of the &#8220;loop counter&#8221; representing the number of iterations we&#8217;re going to do (<em>r9<\/em>), 1) actual reading of the B, G and R value and conversion to <em>u32<\/em> (the 3 <em>movzx<\/em>, note that the reading of the <em>x<\/em> value is optimized away as the compiler sees that it is always multiplied by 0 later), 2) the multiplications with the array elements (the 3 <em>imul<\/em>), 3) combining of the results and division (i.e. shift) (the 2 <em>add<\/em> and the <em>shr<\/em>), 4) storing of the result in the output (the 4 <em>mov<\/em>). Afterwards the slice pointers are increased by 4 (<em>rbx<\/em> and <em>r10<\/em>) and the lengths (used for bounds checking) are decreased by 4 (<em>r8<\/em> and <em>r15<\/em>). 
Finally there&#8217;s a check (<em>cmp<\/em>) to see if <em>r9<\/em> (our loop counter) is at the end of the slice, and if not we jump back to the beginning and operate on the next BGRx chunk.<\/p>\n<p>Generally what we want to do for optimizations is to get rid of unnecessary checks (bounds checking), memory accesses, conditions (<em>cmp<\/em>, <em>cmov<\/em>) and jumps (the instructions starting with <em>j<\/em>). These are all things that are slowing down our code.<\/p>\n<p>So the first thing that seems useful to optimize here is the bounds checking at the beginning. It definitely does not seem useful to do two checks instead of one for the two slices (the checks are for both slices at once, but Godbolt does not detect that and believes it&#8217;s only the input slice). And ideally we could teach the compiler that no bounds checking is needed at all.<\/p>\n<p>As I wrote in the previous blog post, often this knowledge can be given to the compiler by inserting assertions.<\/p>\n<p>To prevent two checks and just have a single check, you can insert an <em>assert_eq!(in_p.len(), 4)<\/em> at the beginning of the inner loop and the same for the output slice. Now we only have a single bounds check left per iteration.<\/p>\n<p>As a next step we might want to try to move this knowledge outside the inner loop so that there is no bounds checking at all in there anymore. 
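<\/p>\n<p>For reference, the version with these per-chunk assertions looks roughly as follows. It is identical to the baseline except for the two assertions, and the <em>RGB_Y<\/em> coefficients shown here are the fixed-point BT.601 values matching the multiplications visible in the assembly above (this is a sketch; see the repository linked above for the exact code):<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"rust\" data-enlighter-title=\"Implementation with assertions (sketch)\">\/\/ Fixed-point BT.601 luma coefficients for R, G, B, x (scaled by 65536)\nconst RGB_Y: [u32; 4] = [19595, 38470, 7471, 0];\n\npub fn bgrx_to_gray_chunks_asserts(\n    in_data: &amp;[u8],\n    out_data: &amp;mut [u8],\n    in_stride: usize,\n    out_stride: usize,\n    width: usize,\n) {\n    let in_line_bytes = width * 4;\n    let out_line_bytes = width * 4;\n\n    for (in_line, out_line) in in_data\n        .chunks(in_stride)\n        .zip(out_data.chunks_mut(out_stride))\n    {\n        for (in_p, out_p) in in_line[..in_line_bytes]\n            .chunks(4)\n            .zip(out_line[..out_line_bytes].chunks_mut(4))\n        {\n            \/\/ A single check per chunk instead of one bounds check per index\n            assert_eq!(in_p.len(), 4);\n            assert_eq!(out_p.len(), 4);\n\n            let b = u32::from(in_p[0]);\n            let g = u32::from(in_p[1]);\n            let r = u32::from(in_p[2]);\n            let x = u32::from(in_p[3]);\n\n            let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2]) + (x * RGB_Y[3])) \/ 65536;\n            let grey = grey as u8;\n            out_p[0] = grey;\n            out_p[1] = grey;\n            out_p[2] = grey;\n            out_p[3] = grey;\n        }\n    }\n}<\/pre>\n<p>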
We might want to add assertions like the following outside the outer loop then to give all knowledge we have to the compiler<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"rust\" data-enlighter-title=\"\">assert_eq!(in_data.len() % 4, 0);\nassert_eq!(out_data.len() % 4, 0);\nassert_eq!(out_data.len() \/ out_stride, in_data.len() \/ in_stride);\n\nassert!(in_line_bytes &lt;= in_stride);\nassert!(out_line_bytes &lt;= out_stride);<\/pre>\n<p>Unfortunately adding those has no effect at all on the inner loop, but having them outside the outer loop for good measure is not the worst idea so let&#8217;s just keep them. At least they can serve as some kind of documentation of the invariants of this code for future readers.<\/p>\n<p>So let&#8217;s benchmark these two implementations now. The results on my machine are the following<\/p>\n<pre class=\"decode:1 \" >test tests::bench_chunks_1920x1080_no_asserts ... bench:   4,420,145 ns\/iter (+\/- 139,051)\ntest tests::bench_chunks_1920x1080_asserts    ... bench:   4,897,046 ns\/iter (+\/- 166,555)<\/pre>\n<p>This is surprising: our version without the assertions is actually faster by a factor of ~1.1, although the assertion version was supposed to get rid of conditions. So let&#8217;s take a closer look at the assembly at the top of the loop again, where the bounds checking happens, in the version with assertions<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"asm\" data-enlighter-title=\"\">.LBB4_19:\n  cmp rbx, r11\n  mov r9, r11\n  cmova r9, rbx\n  mov r14, r12\n  sub r14, r9\n  lea rax, [r14 - 1]\n  mov qword ptr [rbp - 120], rax\n  mov qword ptr [rbp - 128], r13\n  mov qword ptr [rbp - 136], r10\n  cmp r14, 5\n  jne .LBB4_33\n  inc rcx\n  [...]<\/pre>\n<p>While this indeed has only one jump as expected for the bounds checking, the number of comparisons is the same and even worse: 3 memory writes to the stack are happening right before the jump. 
If we follow the jump to the <em>.LBB4_33<\/em> label we will see that the <em>assert_eq!<\/em> macro is going to do something with <em>core::fmt::Debug<\/em>. This is setting up the information needed for printing the assertion failure, the &#8220;expected X equals Y&#8221; output. This is certainly not good and the reason why everything is slower now.<\/p>\n<h3 id=\"assertions-2\">First Optimization \u2013 Assertions Try 2<\/h3>\n<p>All the additional instructions and memory writes were happening because the <em>assert_eq!<\/em> macro is outputting something user friendly that actually contains the values of both sides. Let&#8217;s try again with the <em>assert!<\/em> macro instead, i.e. <em>assert!(in_p.len() == 4)<\/em> and the same for the output slice<\/p>\n<pre class=\"decode:1 \" >test tests::bench_chunks_1920x1080_no_asserts ... bench:   4,420,145 ns\/iter (+\/- 139,051)\ntest tests::bench_chunks_1920x1080_asserts    ... bench:   4,897,046 ns\/iter (+\/- 166,555)\ntest tests::bench_chunks_1920x1080_asserts_2  ... bench:   3,968,976 ns\/iter (+\/- 97,084)\n<\/pre>\n<p>This already looks more promising. Compared to our baseline version this gives us a speedup of a factor of 1.12, and compared to the version with <em>assert_eq!<\/em> 1.23. If we look at the assembly for the bounds checks (everything else stays the same), it also looks more like what we would&#8217;ve expected<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"asm\" data-enlighter-title=\"\">.LBB4_19:\n  cmp rbx, r12\n  mov r13, r12\n  cmova r13, rbx\n  add r13, r14\n  jne .LBB4_33\n  inc r9\n  [...]<\/pre>\n<p>One <em>cmp<\/em> less, only one jump left. And no memory writes anymore!<\/p>\n<p>So keep in mind that <em>assert_eq!<\/em> is more user-friendly but quite a bit more expensive even in the &#8220;good case&#8221; compared to <em>assert!<\/em>.<\/p>\n<h3 id=\"iterator\">Second Optimization \u2013 Iterate a bit more<\/h3>\n<p>This is still not very satisfying though. No bounds checking should be needed at all as each chunk is going to be exactly 4 bytes. 
We&#8217;re just not able to convince the compiler that this is the case. While it may be possible (let me know if you find a way!), let&#8217;s try something different. The <em>zip<\/em> iterator is done when the shorter of the two iterators is done, and there are optimizations implemented specifically for zipped slice iterators. Let&#8217;s try that and replace the grayscale value calculation with<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"rust\" data-enlighter-title=\"\">let grey = in_p.iter()\n    .zip(RGB_Y.iter())\n    .map(|(i, c)| u32::from(*i) * c)\n    .sum::&lt;u32&gt;() \/ 65536;<\/pre>\n<p>If we run that through our benchmark after removing the <em>assert!(in_p.len() == 4)<\/em> (and the same for the output slice), these are the results<\/p>\n<pre class=\"decode:1 \" >test tests::bench_chunks_1920x1080_asserts_2  ... bench:   3,968,976 ns\/iter (+\/- 97,084)\ntest tests::bench_chunks_1920x1080_iter_sum   ... bench:  11,393,600 ns\/iter (+\/- 347,958)\n<\/pre>\n<p>We&#8217;re actually 2.9 times slower! Even when adding back the <em>assert!(in_p.len() == 4)<\/em> assertion (and the same for the output slice) we&#8217;re still slower<\/p>\n<pre class=\"decode:1 \" >test tests::bench_chunks_1920x1080_asserts_2  ... bench:   3,968,976 ns\/iter (+\/- 97,084)\ntest tests::bench_chunks_1920x1080_iter_sum   ... bench:  11,393,600 ns\/iter (+\/- 347,958)\ntest tests::bench_chunks_1920x1080_iter_sum_2 ... 
bench:  10,420,442 ns\/iter (+\/- 242,379)\n<\/pre>\n<p>If we look at the assembly of the assertion-less variant, it&#8217;s a complete mess now<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"asm\" data-enlighter-title=\"\">.LBB0_19:\n  cmp rbx, r13\n  mov rcx, r13\n  cmova rcx, rbx\n  mov rdx, r8\n  sub rdx, rcx\n  cmp rdx, 4\n  mov r11d, 4\n  cmovb r11, rdx\n  test r11, r11\n  je .LBB0_20\n  movzx ecx, byte ptr [r15 - 2]\n  imul ecx, ecx, 19595\n  cmp r11, 1\n  jbe .LBB0_22\n  movzx esi, byte ptr [r15 - 1]\n  imul esi, esi, 38470\n  add esi, ecx\n  movzx ecx, byte ptr [r15]\n  imul ecx, ecx, 7471\n  add ecx, esi\n  test rdx, rdx\n  jne .LBB0_23\n  jmp .LBB0_35\n.LBB0_20:\n  xor ecx, ecx\n.LBB0_22:\n  test rdx, rdx\n  je .LBB0_35\n.LBB0_23:\n  shr ecx, 16\n  mov byte ptr [r10 - 3], cl\n  mov byte ptr [r10 - 2], cl\n  cmp rdx, 3\n  jb .LBB0_36\n  inc r9\n  mov byte ptr [r10 - 1], cl\n  mov byte ptr [r10], cl\n  add r10, 4\n  add r8, -4\n  add rbx, -4\n  add r15, 4\n  cmp r9, r14\n  jb .LBB0_19<\/pre>\n<p>In short, there are now various new conditions and jumps for short-circuiting the zip iterator in the various cases. And because of all the noise added, the compiler was not even able to optimize the bounds check for the output slice away anymore (<em>.LBB0_35<\/em> cases). While it was able to unroll the iterator (note that the 3 <em>imul<\/em> multiplications are not interleaved with jumps and are actually 3 multiplications instead of yet another loop), which is quite impressive, it couldn&#8217;t do anything meaningful with that information it somehow got (it must&#8217;ve understood that each chunk has 4 bytes!). 
This looks like something going wrong somewhere in the optimizer to me.<\/p>\n<p>If we take a look at the variant with the assertions, things look much better<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"asm\" data-enlighter-title=\"\">.LBB3_19:\n  cmp r11, r12\n  mov r13, r12\n  cmova r13, r11\n  add r13, r14\n  jne .LBB3_33\n  inc r9\n  movzx ecx, byte ptr [rdx - 2]\n  imul r13d, ecx, 19595\n  movzx ecx, byte ptr [rdx - 1]\n  imul ecx, ecx, 38470\n  add ecx, r13d\n  movzx ebx, byte ptr [rdx]\n  imul ebx, ebx, 7471\n  add ebx, ecx\n  shr ebx, 16\n  mov byte ptr [r10 - 3], bl\n  mov byte ptr [r10 - 2], bl\n  mov byte ptr [r10 - 1], bl\n  mov byte ptr [r10], bl\n  add r10, 4\n  add r11, -4\n  add r14, 4\n  add rdx, 4\n  cmp r9, r15\n  jb .LBB3_19<\/pre>\n<p>This is literally the same as the assertion version we had before, except that the reading of the input slice, the multiplications and the additions are happening in iterator order instead of being batched all together. It&#8217;s quite impressive that the compiler was able to completely optimize away the zip iterator here, but unfortunately it&#8217;s still many times slower than the original version. The reason must be the instruction-reordering. The previous version had all memory reads batched and then the operations batched, which is apparently much better for the internal pipelining of the CPU (it is going to perform the next instructions without dependencies on the previous ones already while waiting for the pending instructions to finish).<\/p>\n<p>It&#8217;s also not clear to me why the LLVM optimizer is not able to schedule the instructions the same way here. It apparently has all information it needs for that if no iterator is involved, and both versions are leading to exactly the same assembly except for the order of instructions. 
This also seems fishy.<\/p>\n<p>Nonetheless, we still have our manual bounds check (the assertion) left here and we should really try to get rid of that. No progress so far.<\/p>\n<h3 id=\"no-more-bounds\">Third Optimization \u2013 Getting rid of the bounds check finally<\/h3>\n<p>Let&#8217;s tackle this from a different angle now. Our problem is apparently that the compiler is not able to understand that each chunk is exactly 4 bytes.<\/p>\n<p>So why don&#8217;t we write a new chunks iterator that always has exactly the requested number of items, instead of potentially fewer for the very last iteration? And instead of panicking if there are leftover elements, it seems useful to just ignore them. That way we have an API that is functionally different from the existing chunks iterator and provides behaviour that is useful in various cases. It&#8217;s basically the slice equivalent of the <a href=\"https:\/\/docs.rs\/ndarray\/0.11.0\/ndarray\/struct.ArrayBase.html#method.exact_chunks\" rel=\"noopener noreferrer\" target=\"_blank\">exact_chunks<\/a> iterator of the <em>ndarray<\/em> crate.<\/p>\n<p>By having it functionally different from the existing one, and not just an optimization, I also <a href=\"https:\/\/github.com\/rust-lang\/rust\/pull\/47126\" rel=\"noopener noreferrer\" target=\"_blank\">submitted<\/a> it for inclusion in Rust&#8217;s standard library and it&#8217;s nowadays available as an unstable feature in nightly, like all newly added API. Nonetheless, the same can also be implemented inside your code with basically the same effect; there are no dependencies on standard library internals.<\/p>\n<p>So, let&#8217;s use our new <a href=\"https:\/\/doc.rust-lang.org\/nightly\/std\/primitive.slice.html#method.exact_chunks\" rel=\"noopener noreferrer\" target=\"_blank\"><em>exact_chunks<\/em><\/a> iterator that is guaranteed (by API) to always give us exactly 4 bytes. 
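<\/p>\n<p>If you want to avoid the unstable feature, a minimal version of such an iterator can be sketched in plain safe Rust along the following lines (hypothetical names, not the actual standard library implementation):<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"rust\" data-enlighter-title=\"Minimal exact_chunks sketch\">\/\/ Iterator that yields chunks of exactly `size` bytes and silently\n\/\/ ignores any leftover elements at the end of the slice\nstruct ExactChunks&lt;&#039;a&gt; {\n    slice: &amp;&#039;a [u8],\n    size: usize,\n}\n\nimpl&lt;&#039;a&gt; Iterator for ExactChunks&lt;&#039;a&gt; {\n    type Item = &amp;&#039;a [u8];\n\n    fn next(&amp;mut self) -&gt; Option&lt;&amp;&#039;a [u8]&gt; {\n        if self.slice.len() &lt; self.size {\n            \/\/ Fewer elements than one chunk left: we are done\n            return None;\n        }\n        let (chunk, rest) = self.slice.split_at(self.size);\n        self.slice = rest;\n        Some(chunk)\n    }\n}\n\nfn exact_chunks(slice: &amp;[u8], size: usize) -&gt; ExactChunks {\n    assert!(size != 0);\n    ExactChunks { slice, size }\n}<\/pre>\n<p>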
In our case this is exactly equivalent to the normal chunks iterator, as by construction our slices always have a length that is a multiple of 4, but the compiler can&#8217;t infer that information. The resulting code looks as follows<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"rust\" data-enlighter-title=\"\">pub fn bgrx_to_gray_exact_chunks(\n    in_data: &amp;[u8],\n    out_data: &amp;mut [u8],\n    in_stride: usize,\n    out_stride: usize,\n    width: usize,\n) {\n    assert_eq!(in_data.len() % 4, 0);\n    assert_eq!(out_data.len() % 4, 0);\n    assert_eq!(out_data.len() \/ out_stride, in_data.len() \/ in_stride);\n\n    let in_line_bytes = width * 4;\n    let out_line_bytes = width * 4;\n\n    assert!(in_line_bytes &lt;= in_stride);\n    assert!(out_line_bytes &lt;= out_stride);\n\n    for (in_line, out_line) in in_data\n        .exact_chunks(in_stride)\n        .zip(out_data.exact_chunks_mut(out_stride))\n    {\n        for (in_p, out_p) in in_line[..in_line_bytes]\n            .exact_chunks(4)\n            .zip(out_line[..out_line_bytes].exact_chunks_mut(4))\n        {\n            assert!(in_p.len() == 4);\n            assert!(out_p.len() == 4);\n\n            let b = u32::from(in_p[0]);\n            let g = u32::from(in_p[1]);\n            let r = u32::from(in_p[2]);\n            let x = u32::from(in_p[3]);\n\n            let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2]) + (x * RGB_Y[3])) \/ 65536;\n            let grey = grey as u8;\n            out_p[0] = grey;\n            out_p[1] = grey;\n            out_p[2] = grey;\n            out_p[3] = grey;\n        }\n    }\n}<\/pre>\n<p>It&#8217;s exactly the same as the previous version with assertions, except for using <em>exact_chunks<\/em> instead of <em>chunks<\/em> and the same for the mutable iterator. The resulting benchmark of all our variants now looks as follows<\/p>\n<pre class=\"decode:1 \" >test tests::bench_chunks_1920x1080_no_asserts ... 
bench:   4,420,145 ns\/iter (+\/- 139,051)\ntest tests::bench_chunks_1920x1080_asserts    ... bench:   4,897,046 ns\/iter (+\/- 166,555)\ntest tests::bench_chunks_1920x1080_asserts_2  ... bench:   3,968,976 ns\/iter (+\/- 97,084)\ntest tests::bench_chunks_1920x1080_iter_sum   ... bench:  11,393,600 ns\/iter (+\/- 347,958)\ntest tests::bench_chunks_1920x1080_iter_sum_2 ... bench:  10,420,442 ns\/iter (+\/- 242,379)\ntest tests::bench_exact_chunks_1920x1080      ... bench:   2,007,459 ns\/iter (+\/- 112,287)\n<\/pre>\n<p>Compared to our initial version this is a speedup of a factor of 2.2, compared to our version with assertions a factor of 1.98. This seems like a worthwhile improvement, and if we look at the resulting assembly there are no bounds checks at all anymore<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"asm\" data-enlighter-title=\"\">.LBB0_10:\n  movzx edx, byte ptr [rsi - 2]\n  movzx r15d, byte ptr [rsi - 1]\n  movzx r12d, byte ptr [rsi]\n  imul r13d, edx, 19595\n  imul edx, r15d, 38470\n  add edx, r13d\n  imul ebx, r12d, 7471\n  add ebx, edx\n  shr ebx, 16\n  mov byte ptr [rcx - 3], bl\n  mov byte ptr [rcx - 2], bl\n  mov byte ptr [rcx - 1], bl\n  mov byte ptr [rcx], bl\n  add rcx, 4\n  add rsi, 4\n  dec r10\n  jne .LBB0_10<\/pre>\n<p>Also due to this the compiler is able to apply some more optimizations and we only have one loop counter for the number of iterations <em>r10<\/em> and the two pointers <em>rcx<\/em> and <em>rsi<\/em> that are increased\/decreased in each iteration. There is no tracking of the remaining slice lengths anymore, as in the assembly of the original version (and the versions with assertions).<\/p>\n<h3 id=\"summary\">Summary<\/h3>\n<p>So overall we got a speedup of a factor of 2.2 while still writing very high-level Rust code with iterators and not falling back to unsafe code or using SIMD. 
The optimizations the Rust compiler is applying are quite impressive and the Rust marketing line of <em>zero-cost abstractions<\/em> really holds up in reality here.<\/p>\n<p>The same approach should also work for many similar algorithms, and thus for many multimedia related algorithms where you iterate over slices and operate on fixed-size chunks.<\/p>\n<p>Also the above shows that as a first step it&#8217;s better to write clean and understandable high-level Rust code without worrying too much about performance (assume the compiler can optimize well), and only afterwards take a look at the generated assembly and check which instructions should really go away (like bounds checking). In many cases this can be achieved by adding assertions in strategic places, or like in this case by switching to a slightly different abstraction that is closer to the actual requirements (however, I believe the compiler should be able to produce the same code for the normal chunks iterator with the help of assertions, but making that possible probably requires improvements to the LLVM optimizer).<\/p>\n<p>And if all of that does not help, there&#8217;s still the escape hatch of <em>unsafe<\/em> (for using functions like <a href=\"https:\/\/doc.rust-lang.org\/std\/primitive.slice.html#method.get_unchecked\" rel=\"noopener noreferrer\" target=\"_blank\"><em>slice::get_unchecked()<\/em><\/a> or going down to raw pointers) and the possibility of using SIMD instructions (by using <a href=\"https:\/\/github.com\/AdamNiederer\/faster\" rel=\"noopener noreferrer\" target=\"_blank\"><em>faster<\/em><\/a> or <a href=\"https:\/\/crates.io\/crates\/stdsimd\" rel=\"noopener noreferrer\" target=\"_blank\"><em>stdsimd<\/em><\/a> directly). 
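<\/p>\n<p>As an illustration of that escape hatch (a hypothetical sketch, not code from the repository), the per-pixel calculation with manually elided bounds checks could look like this. The caller has to guarantee that both slices are at least 4 bytes long, which is exactly the kind of invariant that makes <em>unsafe<\/em> code easy to get wrong:<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"rust\" data-enlighter-title=\"unsafe escape hatch (sketch)\">\/\/ Safety: in_p and out_p must both be at least 4 bytes long\nunsafe fn bgrx_to_gray_pixel_unchecked(in_p: &amp;[u8], out_p: &amp;mut [u8]) {\n    let b = u32::from(*in_p.get_unchecked(0));\n    let g = u32::from(*in_p.get_unchecked(1));\n    let r = u32::from(*in_p.get_unchecked(2));\n\n    \/\/ Same fixed-point BT.601 coefficients as in the assembly above\n    let grey = ((r * 19595) + (g * 38470) + (b * 7471)) \/ 65536;\n    let grey = grey as u8;\n    for i in 0..4 {\n        *out_p.get_unchecked_mut(i) = grey;\n    }\n}<\/pre>\n<p>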
But in the end this should be a last resort for those little parts of your code where optimizations are needed and the compiler can&#8217;t be easily convinced to do it for you.<\/p>\n<h3 id=\"split-at\">Addendum: <em>slice::split_at<\/em><\/h3>\n<p>User <em>newpavlov<\/em> <a href=\"https:\/\/www.reddit.com\/r\/rust\/comments\/7rxrka\/speeding_up_rgb_to_grayscale_conversion_in_rust\/dt0ejao\/\" rel=\"noopener noreferrer\" target=\"_blank\">suggested on Reddit<\/a> to use repeated <a href=\"https:\/\/doc.rust-lang.org\/std\/primitive.slice.html#method.split_at\" rel=\"noopener noreferrer\" target=\"_blank\"><em>slice::split_at<\/em><\/a> in a while loop for similar performance.<\/p>\n<p>This would for example look like the following<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"rust\" data-enlighter-title=\"\">pub fn bgrx_to_gray_split_at(\n    in_data: &amp;[u8],\n    out_data: &amp;mut [u8],\n    in_stride: usize,\n    out_stride: usize,\n    width: usize,\n) {\n    assert_eq!(in_data.len() % 4, 0);\n    assert_eq!(out_data.len() % 4, 0);\n    assert_eq!(out_data.len() \/ out_stride, in_data.len() \/ in_stride);\n\n    let in_line_bytes = width * 4;\n    let out_line_bytes = width * 4;\n\n    assert!(in_line_bytes &lt;= in_stride);\n    assert!(out_line_bytes &lt;= out_stride);\n\n    for (in_line, out_line) in in_data\n        .exact_chunks(in_stride)\n        .zip(out_data.exact_chunks_mut(out_stride))\n    {\n        let mut in_pp: &amp;[u8] = in_line[..in_line_bytes].as_ref();\n        let mut out_pp: &amp;mut [u8] = out_line[..out_line_bytes].as_mut();\n        assert!(in_pp.len() == out_pp.len());\n\n        while in_pp.len() &gt;= 4 {\n            let (in_p, in_tmp) = in_pp.split_at(4);\n            let (out_p, out_tmp) = { out_pp }.split_at_mut(4);\n            in_pp = in_tmp;\n            out_pp = out_tmp;\n\n            let b = u32::from(in_p[0]);\n            let g = u32::from(in_p[1]);\n            let r = u32::from(in_p[2]);\n            let x = 
u32::from(in_p[3]);\n\n            let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2]) + (x * RGB_Y[3])) \/ 65536;\n            let grey = grey as u8;\n            out_p[0] = grey;\n            out_p[1] = grey;\n            out_p[2] = grey;\n            out_p[3] = grey;\n        }\n    }\n}<\/pre>\n<p>Performance-wise this brings us very close to the <em>exact_chunks<\/em> version<\/p>\n<pre class=\"decode:1 \" >test tests::bench_exact_chunks_1920x1080      ... bench: 1,965,631 ns\/iter (+\/- 58,832)\ntest tests::bench_split_at_1920x1080          ... bench: 2,046,834 ns\/iter (+\/- 35,990)<\/pre>\n<p>and the assembly is also very similar<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"asm\" data-enlighter-title=\"\">.LBB0_10:\n  add rbx, -4\n  movzx r15d, byte ptr [rsi]\n  movzx r12d, byte ptr [rsi + 1]\n  movzx edx, byte ptr [rsi + 2]\n  imul r13d, edx, 19595\n  imul r12d, r12d, 38470\n  imul edx, r15d, 7471\n  add edx, r12d\n  add edx, r13d\n  shr edx, 16\n  movzx edx, dl\n  imul edx, edx, 16843009\n  mov dword ptr [rcx], edx\n  lea rcx, [rcx + 4]\n  add rsi, 4\n  cmp rbx, 3\n  ja .LBB0_10<\/pre>\n<p>Here the compiler even optimizes the storing of the value into a single write operation of 4 bytes, at the cost of an additional multiplication and zero-extend register move.<\/p>\n<p>Overall this code performs very well too, but in my opinion it looks rather ugly compared to the versions using the different chunks iterators. Also this is basically what the <em>exact_chunks<\/em> iterator does internally: repeatedly calling <em>slice::split_at<\/em>. 
In theory both versions could lead to the very same assembly, but the LLVM optimizer currently handles the two slightly differently.<\/p>\n<h3 id=\"faster\">Addendum 2: SIMD with faster<\/h3>\n<p><a href=\"https:\/\/github.com\/AdamNiederer\" rel=\"noopener noreferrer\" target=\"_blank\">Adam Niederer<\/a>, author of <a href=\"https:\/\/github.com\/AdamNiederer\/faster\" rel=\"noopener noreferrer\" target=\"_blank\">faster<\/a>, <a href=\"https:\/\/github.com\/sdroege\/bgrx-to-grayscale-benchmark.rs\/pull\/1\" rel=\"noopener noreferrer\" target=\"_blank\">provided a PR<\/a> that implements the same algorithm with faster to make explicit use of SIMD instructions if available.<\/p>\n<p>Due to some codegen issues, this currently has to be compiled with the CPU being selected as <em>nehalem<\/em>, i.e. by running <em>RUSTFLAGS=\"-C target-cpu=nehalem\" cargo +nightly bench<\/em>, but it provides yet another speedup by a factor of up to 1.27x compared to the fastest previous solution and 2.7x compared to the initial solution:<\/p>\n<pre class=\"decode:1 \" >test tests::bench_chunks_1920x1080_asserts                     ... bench:   4,539,286 ns\/iter (+\/- 106,265)\ntest tests::bench_chunks_1920x1080_asserts_2                   ... bench:   3,550,683 ns\/iter (+\/- 96,917)\ntest tests::bench_chunks_1920x1080_iter_sum                    ... bench:   5,233,238 ns\/iter (+\/- 114,671)\ntest tests::bench_chunks_1920x1080_iter_sum_2                  ... bench:   3,532,059 ns\/iter (+\/- 94,964)\ntest tests::bench_chunks_1920x1080_no_asserts                  ... bench:   4,468,269 ns\/iter (+\/- 89,329)\ntest tests::bench_chunks_1920x1080_no_asserts_faster           ... bench:   2,476,077 ns\/iter (+\/- 54,877)\ntest tests::bench_chunks_1920x1080_no_asserts_faster_unstrided ... bench:   1,642,980 ns\/iter (+\/- 108,034)\ntest tests::bench_exact_chunks_1920x1080                       ... 
bench:   2,078,950 ns\/iter (+\/- 64,536)\ntest tests::bench_split_at_1920x1080                           ... bench:   2,096,603 ns\/iter (+\/- 107,420)<\/pre>\n<p>The code in question is very similar to what you would&#8217;ve written with <a href=\"https:\/\/cgit.freedesktop.org\/gstreamer\/orc\" rel=\"noopener noreferrer\" target=\"_blank\">ORC<\/a>, especially the unstrided version. You basically operate on multiple elements at once, doing the same operation on each, but both versions do this in a slightly different way.<\/p>\n<pre class=\"EnlighterJSRAW\" data-enlighter-language=\"rust\" data-enlighter-title=\"\">pub fn bgrx_to_gray_chunks_no_asserts_faster_unstrided(in_data: &amp;[u8], out_data: &amp;mut [u8]) {\n    \/\/ Relies on vector width which is a multiple of 4\n    assert!(u8s::WIDTH % 4 == 0 &amp;&amp; u32s::WIDTH % 4 == 0);\n\n    const RGB_Y: [u32; 16] = [19595, 38470, 7471, 0, 19595, 38470, 7471, 0, 19595, 38470, 7471, 0, 19595, 38470, 7471, 0];\n    let rgbvec = u32s::load(&amp;RGB_Y, 0);\n    in_data.simd_iter(u8s(0)).simd_map(|v| {\n        let (a, b) = v.upcast();\n        let (a32, b32) = a.upcast();\n        let (c32, d32) = b.upcast();\n\n        let grey32a = a32 * rgbvec \/ u32s(65536);\n        let grey32b = b32 * rgbvec \/ u32s(65536);\n        let grey32c = c32 * rgbvec \/ u32s(65536);\n        let grey32d = d32 * rgbvec \/ u32s(65536);\n\n        let grey16a = grey32a.saturating_downcast(grey32b);\n        let grey16b = grey32c.saturating_downcast(grey32d);\n\n        let grey = grey16a.saturating_downcast(grey16b);\n        grey\n    }).scalar_fill(out_data);\n}\n\npub fn bgrx_to_gray_chunks_no_asserts_faster(in_data: &amp;[u8], out_data: &amp;mut [u8]) {\n    \/\/ Sane, but slowed down by faster&#039;s current striding implementation.\n    in_data.stride_four(tuplify!(4, u8s(0))).zip().simd_map(|(r, g, b, _)| {\n        let (r16a, r16b) = r.upcast();\n        let (r32a, r32b) = r16a.upcast();\n        let (r32c, r32d) = 
r16b.upcast();\n\n        let (g16a, g16b) = g.upcast();\n        let (g32a, g32b) = g16a.upcast();\n        let (g32c, g32d) = g16b.upcast();\n\n        let (b16a, b16b) = b.upcast();\n        let (b32a, b32b) = b16a.upcast();\n        let (b32c, b32d) = b16b.upcast();\n\n        let grey32a = (r32a * u32s(19595) + g32a * u32s(38470) + b32a * u32s(7471)) \/ u32s(65536);\n        let grey32b = (r32b * u32s(19595) + g32b * u32s(38470) + b32b * u32s(7471)) \/ u32s(65536);\n        let grey32c = (r32c * u32s(19595) + g32c * u32s(38470) + b32c * u32s(7471)) \/ u32s(65536);\n        let grey32d = (r32d * u32s(19595) + g32d * u32s(38470) + b32d * u32s(7471)) \/ u32s(65536);\n\n        let grey16a = grey32a.saturating_downcast(grey32b);\n        let grey16b = grey32c.saturating_downcast(grey32d);\n\n        let grey = grey16a.saturating_downcast(grey16b);\n        grey\n    }).scalar_fill(out_data);\n}<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>In the previous blog post I wrote about how to write a RGB to grayscale conversion filter for GStreamer in Rust. In this blog post I&#8217;m going to write about how to optimize the processing loop of that filter, without resorting to unsafe code or SIMD instructions by staying with plain, safe Rust code. 
I &hellip; <a href=\"https:\/\/coaxion.net\/blog\/2018\/01\/speeding-up-rgb-to-grayscale-conversion-in-rust-by-a-factor-of-2-2-and-various-other-multimedia-related-processing-loops\/\" class=\"more-link\">Continue reading <span class=\"screen-reader-text\">Speeding up RGB to grayscale conversion in Rust by a factor of 2.2 \u2013 and various other multimedia related processing loops<\/span><\/a><\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"jetpack_post_was_ever_published":false,"_jetpack_newsletter_access":"","_jetpack_dont_email_post_to_subs":false,"_jetpack_newsletter_tier_id":0,"_jetpack_memberships_contains_paywalled_content":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[3,5,53],"tags":[],"class_list":["post-525","post","type-post","status-publish","format-standard","hentry","category-free-software","category-gstreamer","category-rust"],"jetpack_featured_media_url":"","jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/coaxion.net\/blog\/wp-json\/wp\/v2\/posts\/525","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/coaxion.net\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/coaxion.net\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/coaxion.net\/blog\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/coaxion.net\/blog\/wp-json\/wp\/v2\/comments?post=525"}],"version-history":[{"count":17,"href":"https:\/\/coaxion.net\/blog\/wp-json\/wp\/v2\/posts\/525\/revisions"}],"predecessor-version":[{"id":768,"href":"https:\/\/coaxion.net\/blog\/wp-json\/wp\/v2\/posts\/525\/revisions\/768"}],"wp:attachment":[{"href":"https:\/\/coaxion.net\/blog\/wp-json\/wp\/v2\/media?parent=525"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/coaxion.net\/blog\/wp-json\/wp\/v2\/categories?post=525"},{"taxo
nomy":"post_tag","embeddable":true,"href":"https:\/\/coaxion.net\/blog\/wp-json\/wp\/v2\/tags?post=525"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}