In the previous blog post I wrote about how to write an RGB to grayscale conversion filter for GStreamer in Rust. In this blog post I’m going to write about how to optimize the processing loop of that filter, without resorting to unsafe code or SIMD instructions, by staying with plain, safe Rust code.
I also tried to implement the processing loop with faster, a Rust crate for writing safe SIMD code. It looks very promising, but unless I missed something in the documentation, it is currently missing some features needed to express this specific algorithm in a meaningful way. Once it works on stable Rust (waiting for SIMD to be stabilized) and includes runtime CPU feature detection, it could very well be a good replacement for the ORC library used for the same purpose in various places in GStreamer. ORC works by JIT-compiling a minimal “array operation language” to SIMD assembly for your specific CPU (and has support for x86 MMX/SSE, PPC Altivec, ARM NEON, etc.).
If someone wants to prove me wrong and implement this with faster, feel free to do so and I’ll link to your solution and include it in the benchmark results below.
All code below can be found in this Git repository.
Table of Contents
- Baseline Implementation
- First Optimization – Assertions
- First Optimization – Assertions Try 2
- Second Optimization – Iterate a bit more
- Third Optimization – Getting rid of the bounds check finally
- Summary
- Addendum: slice::split_at
- Addendum 2: SIMD with faster
Baseline Implementation
This is what the baseline implementation looks like.
pub fn bgrx_to_gray_chunks_no_asserts(
    in_data: &[u8],
    out_data: &mut [u8],
    in_stride: usize,
    out_stride: usize,
    width: usize,
) {
    let in_line_bytes = width * 4;
    let out_line_bytes = width * 4;

    for (in_line, out_line) in in_data
        .chunks(in_stride)
        .zip(out_data.chunks_mut(out_stride))
    {
        for (in_p, out_p) in in_line[..in_line_bytes]
            .chunks(4)
            .zip(out_line[..out_line_bytes].chunks_mut(4))
        {
            let b = u32::from(in_p[0]);
            let g = u32::from(in_p[1]);
            let r = u32::from(in_p[2]);
            let x = u32::from(in_p[3]);

            let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2]) + (x * RGB_Y[3])) / 65536;
            let grey = grey as u8;

            out_p[0] = grey;
            out_p[1] = grey;
            out_p[2] = grey;
            out_p[3] = grey;
        }
    }
}
This basically iterates over each line of the input and output frame (outer loop), and then for each BGRx chunk of 4 bytes in each line it converts the values to u32, multiplies with a constant array, converts back to u8 and stores the same value in the whole output BGRx chunk.
Note: This is only doing the actual conversion from linear RGB to grayscale (and in BT.601 colorspace). To do this conversion correctly you need to know your colorspaces and use the correct coefficients for conversion, and also do gamma correction. See this about why it is important.
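The RGB_Y constant is not shown in the snippet above. Judging from the multiplication constants that appear in the generated assembly further below (and in the faster version at the end), it is the BT.601 luma coefficients scaled by 65536, roughly like this:

// R, G, B and x weights of the BT.601 luma formula
// (0.299, 0.587, 0.114 and 0), scaled by 65536.
const RGB_Y: [u32; 4] = [19595, 38470, 7471, 0];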
So what can be improved on this? For starters, let’s write a small benchmark for this so that we know whether any of our changes actually improve something. This is using the (unfortunately still) unstable benchmark feature of Cargo.
#![feature(test)]
#![feature(exact_chunks)]

extern crate test;

pub fn bgrx_to_gray_chunks_no_asserts(...) {
    [...]
}

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;
    use std::iter;

    fn create_vec(w: usize, h: usize) -> Vec<u8> {
        iter::repeat(0).take(w * h * 4).collect::<_>()
    }

    #[bench]
    fn bench_chunks_1920x1080_no_asserts(b: &mut Bencher) {
        let i = test::black_box(create_vec(1920, 1080));
        let mut o = test::black_box(create_vec(1920, 1080));

        b.iter(|| bgrx_to_gray_chunks_no_asserts(&i, &mut o, 1920 * 4, 1920 * 4, 1920));
    }
}
This can be run with cargo bench and then prints the number of nanoseconds each iteration of the closure took. To only measure the processing itself, allocation and initialization of the input/output frames happen outside of the closure; we’re not interested in the times for that.
First Optimization – Assertions
To actually start optimizing this function, let’s take a look at the assembly that the compiler is outputting. The easiest way of doing that is via the Godbolt Compiler Explorer website. Select “rustc nightly” and use “-C opt-level=3” for the compiler flags, and then copy & paste your code in there. Once it compiles, to find the assembly that corresponds to a line, simply right-click on the line and “Scroll to assembly”.
Alternatively you can use cargo rustc --release -- -C opt-level=3 --emit asm and check the assembly file that is output in the target/release/deps directory.
What we see then for our inner loop is something like the following
.LBB4_19:
    cmp r15, r11
    mov r13, r11
    cmova r13, r15
    mov rdx, r8
    sub rdx, r13
    je .LBB4_34
    cmp rdx, 3
    jb .LBB4_35
    inc r9
    movzx edx, byte ptr [rbx - 1]
    movzx ecx, byte ptr [rbx - 2]
    movzx esi, byte ptr [rbx]
    imul esi, esi, 19595
    imul edx, edx, 38470
    imul ecx, ecx, 7471
    add ecx, edx
    add ecx, esi
    shr ecx, 16
    mov byte ptr [r10 - 3], cl
    mov byte ptr [r10 - 2], cl
    mov byte ptr [r10 - 1], cl
    mov byte ptr [r10], cl
    add r10, 4
    add r8, -4
    add r15, -4
    add rbx, 4
    cmp r9, r14
    jb .LBB4_19
This is already quite optimized. For each loop iteration the first few instructions are doing some bounds checking and, if they fail, jump to the .LBB4_34 or .LBB4_35 labels. How can we tell that this is bounds checking? Scroll down in the assembly to where these labels are defined and you’ll see something like the following
.LBB4_34:
    lea rdi, [rip + .Lpanic_bounds_check_loc.D]
    xor esi, esi
    xor edx, edx
    call core::panicking::panic_bounds_check@PLT
    ud2
.LBB4_35:
    cmp r15, r11
    cmova r11, r15
    sub r8, r11
    lea rdi, [rip + .Lpanic_bounds_check_loc.F]
    mov esi, 2
    mov rdx, r8
    call core::panicking::panic_bounds_check@PLT
    ud2
Also if you check (with the colors, or the “scroll to source” feature) which Rust code these correspond to, you’ll see that it’s the first and third access to the 4-byte slice that contains our BGRx values.
Afterwards in the assembly, the following steps are happening: 0) incrementing of the “loop counter” representing the number of iterations we’re going to do (r9), 1) actual reading of the B, G and R values and conversion to u32 (the 3 movzx; note that the reading of the x value is optimized away, as the compiler sees that it is always multiplied by 0 later), 2) the multiplications with the array elements (the 3 imul), 3) combining of the results and the division (i.e. shift) (the 2 add and the shr), 4) storing of the result in the output (the 4 mov). Afterwards the slice pointers are increased by 4 (rbx and r10) and the lengths (used for bounds checking) are decreased by 4 (r8 and r15). Finally there’s a check (cmp) to see if r9 (our loop counter) has reached the end of the slice, and if not we jump back to the beginning and operate on the next BGRx chunk.
Generally what we want to do for optimizations is to get rid of unnecessary checks (bounds checking), memory accesses, conditions (cmp, cmov) and jumps (the instructions starting with j). These are all things that are slowing down our code.
So the first thing that seems useful to optimize here is the bounds checking at the beginning. It definitely does not seem useful to do two checks instead of one for the two slices (the checks cover both slices at once, but Godbolt does not detect that and believes it’s only the input slice). And ideally we could teach the compiler that no bounds checking is needed at all.
As I wrote in the previous blog post, often this knowledge can be given to the compiler by inserting assertions.
To prevent two checks and just have a single one, you can insert an assert_eq!(in_p.len(), 4) at the beginning of the inner loop, and the same for the output slice. Now we only have a single bounds check left per iteration.
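The beginning of the inner loop then looks like this, with the rest of the body unchanged:

for (in_p, out_p) in in_line[..in_line_bytes]
    .chunks(4)
    .zip(out_line[..out_line_bytes].chunks_mut(4))
{
    // Both chunks are always exactly 4 bytes long; this collapses the
    // per-element bounds checks into a single check per iteration.
    assert_eq!(in_p.len(), 4);
    assert_eq!(out_p.len(), 4);

    let b = u32::from(in_p[0]);
    [...]
}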
As a next step we might want to try to move this knowledge outside the inner loop so that no bounds checking is needed in there anymore. We might then want to add assertions like the following outside the outer loop, to give the compiler all the knowledge we have
assert_eq!(in_data.len() % 4, 0);
assert_eq!(out_data.len() % 4, 0);
assert_eq!(out_data.len() / out_stride, in_data.len() / in_stride);

assert!(in_line_bytes <= in_stride);
assert!(out_line_bytes <= out_stride);
Unfortunately adding those has no effect at all on the inner loop, but having them outside the outer loop for good measure is not the worst idea, so let’s just keep them. At least they can serve as some kind of documentation of the invariants of this code for future readers.
So let’s benchmark these two implementations now. The results on my machine are the following
test tests::bench_chunks_1920x1080_no_asserts ... bench: 4,420,145 ns/iter (+/- 139,051)
test tests::bench_chunks_1920x1080_asserts ... bench: 4,897,046 ns/iter (+/- 166,555)
This is surprising: our version without the assertions is actually faster, by a factor of ~1.1, even though the version with assertions should have fewer checks. So let’s take a closer look at the assembly at the top of the loop again, where the bounds checking happens, in the version with assertions
.LBB4_19:
    cmp rbx, r11
    mov r9, r11
    cmova r9, rbx
    mov r14, r12
    sub r14, r9
    lea rax, [r14 - 1]
    mov qword ptr [rbp - 120], rax
    mov qword ptr [rbp - 128], r13
    mov qword ptr [rbp - 136], r10
    cmp r14, 5
    jne .LBB4_33
    inc rcx
    [...]
While this indeed has only one jump as expected for the bounds checking, the number of comparisons is the same, and even worse: 3 memory writes to the stack are happening right before the jump. If we follow the jump to the .LBB4_33 label we will see that the assert_eq! macro is going to do something with core::fmt::Debug. This is setting up the information needed for printing the assertion failure, the “expected X equals Y” output. This is certainly not good and is the reason why everything is slower now.
First Optimization – Assertions Try 2
All the additional instructions and memory writes were happening because the assert_eq! macro is outputting something user-friendly that actually contains the values of both sides. Let’s try again with the assert! macro instead.
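In the code this simply means replacing the two assertions in the inner loop:

// Same invariants as before, but on failure assert! only prints a static
// message instead of setting up Debug formatting for both values.
assert!(in_p.len() == 4);
assert!(out_p.len() == 4);

The benchmark results now look like this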
test tests::bench_chunks_1920x1080_no_asserts ... bench: 4,420,145 ns/iter (+/- 139,051)
test tests::bench_chunks_1920x1080_asserts ... bench: 4,897,046 ns/iter (+/- 166,555)
test tests::bench_chunks_1920x1080_asserts_2 ... bench: 3,968,976 ns/iter (+/- 97,084)
This already looks more promising. Compared to our baseline version this gives us a speedup of a factor of 1.12, and compared to the version with assert_eq! 1.23. If we look at the assembly for the bounds checks (everything else stays the same), it also looks more like what we would’ve expected
.LBB4_19:
    cmp rbx, r12
    mov r13, r12
    cmova r13, rbx
    add r13, r14
    jne .LBB4_33
    inc r9
    [...]
One cmp less, only one jump left. And no memory writes anymore!
So keep in mind that assert_eq! is more user-friendly, but even in the “good case” it is quite a bit more expensive than assert!.
Second Optimization – Iterate a bit more
This is still not very satisfying though. No bounds checking should be needed at all, as each chunk is going to be exactly 4 bytes. We’re just not able to convince the compiler that this is the case. While it may be possible (let me know if you find a way!), let’s try something different. The zip iterator is done when the shorter of the two iterators is done, and there are optimizations implemented specifically for zipped slice iterators. Let’s try that and replace the grayscale value calculation with
let grey = in_p.iter()
    .zip(RGB_Y.iter())
    .map(|(i, c)| u32::from(*i) * c)
    .sum::<u32>() / 65536;
If we run that through our benchmark after removing the assert!(in_p.len() == 4) (and the same for the output slice), these are the results
test tests::bench_chunks_1920x1080_asserts_2 ... bench: 3,968,976 ns/iter (+/- 97,084)
test tests::bench_chunks_1920x1080_iter_sum ... bench: 11,393,600 ns/iter (+/- 347,958)
We’re actually 2.9 times slower! Even when adding back the assert!(in_p.len() == 4) assertion (and the same for the output slice) we’re still slower
test tests::bench_chunks_1920x1080_asserts_2 ... bench: 3,968,976 ns/iter (+/- 97,084)
test tests::bench_chunks_1920x1080_iter_sum ... bench: 11,393,600 ns/iter (+/- 347,958)
test tests::bench_chunks_1920x1080_iter_sum_2 ... bench: 10,420,442 ns/iter (+/- 242,379)
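For reference, the inner loop body of this iter_sum_2 variant presumably combines both approaches, roughly like this:

// Manual length assertions as in the previous attempt, plus the
// iterator-based grayscale calculation.
assert!(in_p.len() == 4);
assert!(out_p.len() == 4);

let grey = in_p.iter()
    .zip(RGB_Y.iter())
    .map(|(i, c)| u32::from(*i) * c)
    .sum::<u32>() / 65536;
let grey = grey as u8;

out_p[0] = grey;
out_p[1] = grey;
out_p[2] = grey;
out_p[3] = grey;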
If we look at the assembly of the assertion-less variant, it’s a complete mess now
.LBB0_19:
    cmp rbx, r13
    mov rcx, r13
    cmova rcx, rbx
    mov rdx, r8
    sub rdx, rcx
    cmp rdx, 4
    mov r11d, 4
    cmovb r11, rdx
    test r11, r11
    je .LBB0_20
    movzx ecx, byte ptr [r15 - 2]
    imul ecx, ecx, 19595
    cmp r11, 1
    jbe .LBB0_22
    movzx esi, byte ptr [r15 - 1]
    imul esi, esi, 38470
    add esi, ecx
    movzx ecx, byte ptr [r15]
    imul ecx, ecx, 7471
    add ecx, esi
    test rdx, rdx
    jne .LBB0_23
    jmp .LBB0_35
.LBB0_20:
    xor ecx, ecx
.LBB0_22:
    test rdx, rdx
    je .LBB0_35
.LBB0_23:
    shr ecx, 16
    mov byte ptr [r10 - 3], cl
    mov byte ptr [r10 - 2], cl
    cmp rdx, 3
    jb .LBB0_36
    inc r9
    mov byte ptr [r10 - 1], cl
    mov byte ptr [r10], cl
    add r10, 4
    add r8, -4
    add rbx, -4
    add r15, 4
    cmp r9, r14
    jb .LBB0_19
In short, there are now various new conditions and jumps for short-circuiting the zip iterator in the various cases. And because of all the noise added, the compiler was not even able to optimize away the bounds check for the output slice anymore (the .LBB0_35 cases). While it was able to unroll the iterator (note that the 3 imul multiplications are not interleaved with jumps and are actually 3 multiplications instead of yet another loop), which is quite impressive, it couldn’t do anything meaningful with the information it apparently had (it must have understood that each chunk has 4 bytes!). This looks to me like something going wrong somewhere in the optimizer.
If we take a look at the variant with the assertions, things look much better
.LBB3_19:
    cmp r11, r12
    mov r13, r12
    cmova r13, r11
    add r13, r14
    jne .LBB3_33
    inc r9
    movzx ecx, byte ptr [rdx - 2]
    imul r13d, ecx, 19595
    movzx ecx, byte ptr [rdx - 1]
    imul ecx, ecx, 38470
    add ecx, r13d
    movzx ebx, byte ptr [rdx]
    imul ebx, ebx, 7471
    add ebx, ecx
    shr ebx, 16
    mov byte ptr [r10 - 3], bl
    mov byte ptr [r10 - 2], bl
    mov byte ptr [r10 - 1], bl
    mov byte ptr [r10], bl
    add r10, 4
    add r11, -4
    add r14, 4
    add rdx, 4
    cmp r9, r15
    jb .LBB3_19
This is literally the same as the assertion version we had before, except that the reading of the input slice, the multiplications and the additions happen in iterator order instead of being batched together. It’s quite impressive that the compiler was able to completely optimize away the zip iterator here, but unfortunately it’s still more than twice as slow as the original version. The reason must be the instruction reordering: the previous version had all memory reads batched and then the operations batched, which is apparently much better for the internal pipelining of the CPU (it can already execute subsequent instructions that have no dependencies on previous ones while waiting for the pending instructions to finish).
It’s also not clear to me why the LLVM optimizer is not able to schedule the instructions the same way here. It apparently has all the information it needs if no iterator is involved, and both versions lead to exactly the same assembly except for the order of instructions. This also seems fishy.
Nonetheless, we still have our manual bounds check (the assertion) left here and we should really try to get rid of that. No progress so far.
Third Optimization – Getting rid of the bounds check finally
Let’s tackle this from a different angle now. Our problem is apparently that the compiler is not able to understand that each chunk is exactly 4 bytes.
So why don’t we write a new chunks iterator that always has exactly the requested number of items, instead of potentially fewer for the very last iteration? And instead of panicking if there are leftover elements, it seems useful to just ignore them. That way we have an API that is functionally different from the existing chunks iterator and provides behaviour that is useful in various cases. It’s basically the slice equivalent of the exact_chunks iterator of the ndarray crate.
Because it is functionally different from the existing iterator, and not just an optimization, I also submitted it for inclusion in Rust’s standard library, and it is nowadays available as an unstable feature in nightly, like all newly added API. Nonetheless, the same can also be implemented inside your own code with basically the same effect; there are no dependencies on standard library internals.
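As a rough idea of how little is needed for this, a minimal sketch of such an iterator in your own code could look like the following. The name is made up here and it is limited to byte slices; the actual standard library version is generic over the element type and also comes with a mutable counterpart:

// Hypothetical stand-in for the unstable exact_chunks API: always yields
// slices of exactly `chunk_size` bytes and silently ignores any leftover
// elements at the end.
struct ExactChunks<'a> {
    data: &'a [u8],
    chunk_size: usize,
}

impl<'a> ExactChunks<'a> {
    fn new(data: &'a [u8], chunk_size: usize) -> Self {
        assert!(chunk_size != 0);
        ExactChunks { data, chunk_size }
    }
}

impl<'a> Iterator for ExactChunks<'a> {
    type Item = &'a [u8];

    fn next(&mut self) -> Option<&'a [u8]> {
        if self.data.len() < self.chunk_size {
            return None;
        }

        let (chunk, rest) = self.data.split_at(self.chunk_size);
        self.data = rest;
        Some(chunk)
    }
}

The mutable variant works the same way with split_at_mut; it just needs a small mem::replace dance to satisfy the borrow checker.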
So, let’s use our new exact_chunks iterator, which is guaranteed (by its API) to always give us exactly 4 bytes. In our case this is exactly equivalent to the normal chunks iterator, as by construction our slices always have a length that is a multiple of 4, but the compiler can’t infer that information. The resulting code looks as follows
pub fn bgrx_to_gray_exact_chunks(
    in_data: &[u8],
    out_data: &mut [u8],
    in_stride: usize,
    out_stride: usize,
    width: usize,
) {
    assert_eq!(in_data.len() % 4, 0);
    assert_eq!(out_data.len() % 4, 0);
    assert_eq!(out_data.len() / out_stride, in_data.len() / in_stride);

    let in_line_bytes = width * 4;
    let out_line_bytes = width * 4;

    assert!(in_line_bytes <= in_stride);
    assert!(out_line_bytes <= out_stride);

    for (in_line, out_line) in in_data
        .exact_chunks(in_stride)
        .zip(out_data.exact_chunks_mut(out_stride))
    {
        for (in_p, out_p) in in_line[..in_line_bytes]
            .exact_chunks(4)
            .zip(out_line[..out_line_bytes].exact_chunks_mut(4))
        {
            assert!(in_p.len() == 4);
            assert!(out_p.len() == 4);

            let b = u32::from(in_p[0]);
            let g = u32::from(in_p[1]);
            let r = u32::from(in_p[2]);
            let x = u32::from(in_p[3]);

            let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2]) + (x * RGB_Y[3])) / 65536;
            let grey = grey as u8;

            out_p[0] = grey;
            out_p[1] = grey;
            out_p[2] = grey;
            out_p[3] = grey;
        }
    }
}
It’s exactly the same as the previous version with assertions, except for using exact_chunks instead of chunks, and the same for the mutable iterator. The resulting benchmark of all our variants now looks as follows
test tests::bench_chunks_1920x1080_no_asserts ... bench: 4,420,145 ns/iter (+/- 139,051)
test tests::bench_chunks_1920x1080_asserts ... bench: 4,897,046 ns/iter (+/- 166,555)
test tests::bench_chunks_1920x1080_asserts_2 ... bench: 3,968,976 ns/iter (+/- 97,084)
test tests::bench_chunks_1920x1080_iter_sum ... bench: 11,393,600 ns/iter (+/- 347,958)
test tests::bench_chunks_1920x1080_iter_sum_2 ... bench: 10,420,442 ns/iter (+/- 242,379)
test tests::bench_exact_chunks_1920x1080 ... bench: 2,007,459 ns/iter (+/- 112,287)
Compared to our initial version this is a speedup of a factor of 2.2, compared to our version with assertions a factor of 1.98. This seems like a worthwhile improvement, and if we look at the resulting assembly there are no bounds checks at all anymore
.LBB0_10:
    movzx edx, byte ptr [rsi - 2]
    movzx r15d, byte ptr [rsi - 1]
    movzx r12d, byte ptr [rsi]
    imul r13d, edx, 19595
    imul edx, r15d, 38470
    add edx, r13d
    imul ebx, r12d, 7471
    add ebx, edx
    shr ebx, 16
    mov byte ptr [rcx - 3], bl
    mov byte ptr [rcx - 2], bl
    mov byte ptr [rcx - 1], bl
    mov byte ptr [rcx], bl
    add rcx, 4
    add rsi, 4
    dec r10
    jne .LBB0_10
Also due to this the compiler is able to apply some more optimizations, and we only have one loop counter for the number of iterations (r10) and the two pointers (rcx and rsi) that are advanced in each iteration. There is no more tracking of the remaining slice lengths, as in the assembly of the original version (and the versions with assertions).
Summary
So overall we got a speedup of a factor of 2.2 while still writing very high-level Rust code with iterators, without falling back to unsafe code or using SIMD. The optimizations the Rust compiler is applying are quite impressive, and the Rust marketing line of zero-cost abstractions really shows here.
The same approach should also work for many similar algorithms, including many multimedia-related ones where you iterate over slices and operate on fixed-size chunks.
The above also shows that, as a first step, it’s better to write clean and understandable high-level Rust code without worrying too much about performance (assume the compiler can optimize well), and only afterwards take a look at the generated assembly and check which instructions should really go away (like the bounds checking). In many cases this can be achieved by adding assertions in strategic places, or, as in this case, by switching to a slightly different abstraction that is closer to the actual requirements (although I believe the compiler should be able to produce the same code with the normal chunks iterator and the help of assertions; making that possible probably requires improvements to the LLVM optimizer).
And if all does not help, there’s still the escape hatch of unsafe (for using functions like slice::get_unchecked() or going down to raw pointers) and the possibility of using SIMD instructions (by using faster or stdsimd directly). But in the end this should be a last resort for those little parts of your code where optimizations are needed and the compiler can’t be easily convinced to do it for you.
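For illustration only (this variant is not part of the benchmarks above), the unsafe escape hatch for the inner loop body could look roughly like this. It is only sound because the surrounding loop guarantees that both chunks are exactly 4 bytes long:

unsafe {
    // No bounds checks at all: get_unchecked()/get_unchecked_mut() are
    // undefined behaviour if an index is ever out of range.
    let b = u32::from(*in_p.get_unchecked(0));
    let g = u32::from(*in_p.get_unchecked(1));
    let r = u32::from(*in_p.get_unchecked(2));
    let x = u32::from(*in_p.get_unchecked(3));

    let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2]) + (x * RGB_Y[3])) / 65536;
    let grey = grey as u8;

    *out_p.get_unchecked_mut(0) = grey;
    *out_p.get_unchecked_mut(1) = grey;
    *out_p.get_unchecked_mut(2) = grey;
    *out_p.get_unchecked_mut(3) = grey;
}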
Addendum: slice::split_at
User newpavlov suggested on Reddit to use repeated slice::split_at in a while loop for similar performance.
This would for example look like the following
pub fn bgrx_to_gray_split_at(
    in_data: &[u8],
    out_data: &mut [u8],
    in_stride: usize,
    out_stride: usize,
    width: usize,
) {
    assert_eq!(in_data.len() % 4, 0);
    assert_eq!(out_data.len() % 4, 0);
    assert_eq!(out_data.len() / out_stride, in_data.len() / in_stride);

    let in_line_bytes = width * 4;
    let out_line_bytes = width * 4;

    assert!(in_line_bytes <= in_stride);
    assert!(out_line_bytes <= out_stride);

    for (in_line, out_line) in in_data
        .exact_chunks(in_stride)
        .zip(out_data.exact_chunks_mut(out_stride))
    {
        let mut in_pp: &[u8] = in_line[..in_line_bytes].as_ref();
        let mut out_pp: &mut [u8] = out_line[..out_line_bytes].as_mut();
        assert!(in_pp.len() == out_pp.len());

        while in_pp.len() >= 4 {
            let (in_p, in_tmp) = in_pp.split_at(4);
            let (out_p, out_tmp) = { out_pp }.split_at_mut(4);
            in_pp = in_tmp;
            out_pp = out_tmp;

            let b = u32::from(in_p[0]);
            let g = u32::from(in_p[1]);
            let r = u32::from(in_p[2]);
            let x = u32::from(in_p[3]);

            let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2]) + (x * RGB_Y[3])) / 65536;
            let grey = grey as u8;

            out_p[0] = grey;
            out_p[1] = grey;
            out_p[2] = grey;
            out_p[3] = grey;
        }
    }
}
Performance-wise this brings us very close to the exact_chunks version
test tests::bench_exact_chunks_1920x1080 ... bench: 1,965,631 ns/iter (+/- 58,832)
test tests::bench_split_at_1920x1080 ... bench: 2,046,834 ns/iter (+/- 35,990)
and the assembly is also very similar
.LBB0_10:
    add rbx, -4
    movzx r15d, byte ptr [rsi]
    movzx r12d, byte ptr [rsi + 1]
    movzx edx, byte ptr [rsi + 2]
    imul r13d, edx, 19595
    imul r12d, r12d, 38470
    imul edx, r15d, 7471
    add edx, r12d
    add edx, r13d
    shr edx, 16
    movzx edx, dl
    imul edx, edx, 16843009
    mov dword ptr [rcx], edx
    lea rcx, [rcx + 4]
    add rsi, 4
    cmp rbx, 3
    ja .LBB0_10
Here the compiler even optimizes the storing of the value into a single write operation of 4 bytes, at the cost of an additional multiplication and a zero-extending register move: the multiplication by 16843009 (0x01010101) replicates the grayscale byte into all four bytes of a 32-bit value, which is then stored at once.
Overall this code performs very well too, but in my opinion it looks rather ugly compared to the versions using the different chunks iterators. Also, this is basically what the exact_chunks iterator does internally: repeatedly calling slice::split_at. In theory both versions could lead to the very same assembly, but the LLVM optimizer currently handles them slightly differently.
Addendum 2: SIMD with faster
Adam Niederer, author of faster, provided a PR that implements the same algorithm with faster to make explicit use of SIMD instructions if available.
Due to some codegen issues, this currently has to be compiled with the CPU being selected as nehalem, i.e. by running RUSTFLAGS="-C target-cpu=nehalem" cargo +nightly bench, but it provides yet another speedup by a factor of up to 1.27x compared to the fastest previous solution and 2.7x compared to the initial solution:
test tests::bench_chunks_1920x1080_asserts ... bench: 4,539,286 ns/iter (+/- 106,265)
test tests::bench_chunks_1920x1080_asserts_2 ... bench: 3,550,683 ns/iter (+/- 96,917)
test tests::bench_chunks_1920x1080_iter_sum ... bench: 5,233,238 ns/iter (+/- 114,671)
test tests::bench_chunks_1920x1080_iter_sum_2 ... bench: 3,532,059 ns/iter (+/- 94,964)
test tests::bench_chunks_1920x1080_no_asserts ... bench: 4,468,269 ns/iter (+/- 89,329)
test tests::bench_chunks_1920x1080_no_asserts_faster ... bench: 2,476,077 ns/iter (+/- 54,877)
test tests::bench_chunks_1920x1080_no_asserts_faster_unstrided ... bench: 1,642,980 ns/iter (+/- 108,034)
test tests::bench_exact_chunks_1920x1080 ... bench: 2,078,950 ns/iter (+/- 64,536)
test tests::bench_split_at_1920x1080 ... bench: 2,096,603 ns/iter (+/- 107,420)
The code in question is very similar to what you would’ve written with ORC, especially the unstrided version. You basically operate on multiple elements at once, doing the same operation on each, but both versions do this in a slightly different way.
pub fn bgrx_to_gray_chunks_no_asserts_faster_unstrided(in_data: &[u8], out_data: &mut [u8]) {
    // Relies on vector width which is a multiple of 4
    assert!(u8s::WIDTH % 4 == 0 && u32s::WIDTH % 4 == 0);

    const RGB_Y: [u32; 16] = [
        19595, 38470, 7471, 0,
        19595, 38470, 7471, 0,
        19595, 38470, 7471, 0,
        19595, 38470, 7471, 0,
    ];
    let rgbvec = u32s::load(&RGB_Y, 0);

    in_data.simd_iter(u8s(0)).simd_map(|v| {
        let (a, b) = v.upcast();
        let (a32, b32) = a.upcast();
        let (c32, d32) = b.upcast();

        let grey32a = a32 * rgbvec / u32s(65536);
        let grey32b = b32 * rgbvec / u32s(65536);
        let grey32c = c32 * rgbvec / u32s(65536);
        let grey32d = d32 * rgbvec / u32s(65536);

        let grey16a = grey32a.saturating_downcast(grey32b);
        let grey16b = grey32c.saturating_downcast(grey32d);

        let grey = grey16a.saturating_downcast(grey16b);
        grey
    }).scalar_fill(out_data);
}

pub fn bgrx_to_gray_chunks_no_asserts_faster(in_data: &[u8], out_data: &mut [u8]) {
    // Sane, but slowed down by faster's current striding implementation.
    in_data.stride_four(tuplify!(4, u8s(0))).zip().simd_map(|(r, g, b, _)| {
        let (r16a, r16b) = r.upcast();
        let (r32a, r32b) = r16a.upcast();
        let (r32c, r32d) = r16b.upcast();

        let (g16a, g16b) = g.upcast();
        let (g32a, g32b) = g16a.upcast();
        let (g32c, g32d) = g16b.upcast();

        let (b16a, b16b) = b.upcast();
        let (b32a, b32b) = b16a.upcast();
        let (b32c, b32d) = b16b.upcast();

        let grey32a = (r32a * u32s(19595) + g32a * u32s(38470) + b32a * u32s(7471)) / u32s(65536);
        let grey32b = (r32b * u32s(19595) + g32b * u32s(38470) + b32b * u32s(7471)) / u32s(65536);
        let grey32c = (r32c * u32s(19595) + g32c * u32s(38470) + b32c * u32s(7471)) / u32s(65536);
        let grey32d = (r32d * u32s(19595) + g32d * u32s(38470) + b32d * u32s(7471)) / u32s(65536);

        let grey16a = grey32a.saturating_downcast(grey32b);
        let grey16b = grey32c.saturating_downcast(grey32d);

        let grey = grey16a.saturating_downcast(grey16b);
        grey
    }).scalar_fill(out_data);
}