Speeding up RGB to grayscale conversion in Rust by a factor of 2.2 – and various other multimedia related processing loops

In the previous blog post I wrote about how to write an RGB to grayscale conversion filter for GStreamer in Rust. In this blog post I’m going to write about how to optimize the processing loop of that filter, without resorting to unsafe code or SIMD instructions, staying with plain, safe Rust code.

I also tried to implement the processing loop with faster, a Rust crate for writing safe SIMD code. It looks very promising, but unless I missed something in the documentation it currently is missing some features to be able to express this specific algorithm in a meaningful way. Once it works on stable Rust (waiting for SIMD to be stabilized) and includes runtime CPU feature detection, this could very well be a good replacement for the ORC library used for the same purpose in GStreamer in various places. ORC works by JIT-compiling a minimal “array operation language” to SIMD assembly for your specific CPU (and has support for x86 MMX/SSE, PPC Altivec, ARM NEON, etc.).

If someone wants to prove me wrong and implement this with faster, feel free to do so and I’ll link to your solution and include it in the benchmark results below.

All code below can be found in this Git repository.

Table of Contents

  1. Baseline Implementation
  2. First Optimization – Assertions
  3. First Optimization – Assertions Try 2
  4. Second Optimization – Iterate a bit more
  5. Third Optimization – Getting rid of the bounds check finally
  6. Summary
  7. Addendum: slice::split_at

Baseline Implementation

This is what the baseline implementation looks like.
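A minimal sketch of such a baseline (the function name and the exact coefficient constants are illustrative, not necessarily the original code):

```rust
// Approximate BT.601 luma coefficients for B, G, R (and 0 for the unused x
// byte), scaled to 16-bit fixed point. The exact values are an assumption.
const BGRX_Y: [u32; 4] = [7471, 38470, 19595, 0];

pub fn bgrx_to_gray_baseline(
    in_data: &[u8],
    out_data: &mut [u8],
    in_stride: usize,
    out_stride: usize,
    width: usize,
) {
    let line_bytes = width * 4;

    // Outer loop: one iteration per line of the input/output frame.
    for (in_line, out_line) in in_data
        .chunks(in_stride)
        .zip(out_data.chunks_mut(out_stride))
    {
        // Inner loop: one iteration per BGRx pixel (4 bytes).
        for (in_p, out_p) in in_line[..line_bytes]
            .chunks(4)
            .zip(out_line[..line_bytes].chunks_mut(4))
        {
            let b = u32::from(in_p[0]);
            let g = u32::from(in_p[1]);
            let r = u32::from(in_p[2]);

            let grey = (b * BGRX_Y[0] + g * BGRX_Y[1] + r * BGRX_Y[2]) >> 16;
            let grey = grey as u8;

            // Store the same grayscale value in the whole output BGRx chunk.
            out_p[0] = grey;
            out_p[1] = grey;
            out_p[2] = grey;
            out_p[3] = grey;
        }
    }
}
```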

This basically iterates over each line of the input and output frame (outer loop), and then for each BGRx chunk of 4 bytes in each line it converts the values to u32, multiplies them with a constant array, converts the result back to u8 and stores the same value in the whole output BGRx chunk.

So what can be improved here? For starters, let’s write a small benchmark for this so that we know whether any of our changes actually improve something. This is using the (unfortunately still) unstable benchmark feature of Cargo.
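A sketch of what such a benchmark could look like (frame size and the function name are placeholders):

```rust
#![feature(test)]
extern crate test;

#[bench]
fn bench_baseline(b: &mut test::Bencher) {
    let width = 320;
    let height = 240;
    let in_stride = width * 4;
    let out_stride = width * 4;

    // Allocate and initialize the frames outside of the closure so that only
    // the conversion itself is measured.
    let in_data = vec![0u8; in_stride * height];
    let mut out_data = vec![0u8; out_stride * height];

    b.iter(|| bgrx_to_gray_baseline(&in_data, &mut out_data, in_stride, out_stride, width));
}
```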

This can be run with cargo bench and then prints the number of nanoseconds each iteration of the closure took. To really measure only the processing itself, the allocations and initializations of the input/output frames happen outside of the closure; we’re not interested in the time for that.

First Optimization – Assertions

To actually start optimizing this function, let’s take a look at the assembly that the compiler is outputting. The easiest way of doing that is via the Godbolt Compiler Explorer website. Select “rustc nightly” and use “-C opt-level=3” for the compiler flags, and then copy & paste your code in there. Once it compiles, to find the assembly that corresponds to a line, simply right-click on the line and “Scroll to assembly”.

Alternatively you can use cargo rustc --release -- -C opt-level=3 --emit asm and check the assembly file that is output in the target/release/deps directory.

What we see then for our inner loop is something like the following

This is already quite optimized. For each loop iteration the first few instructions do some bounds checking and, if they fail, jump to the .LBB4_34 or .LBB4_35 labels. How do we know that this is bounds checking? Scroll down in the assembly to where these labels are defined and you’ll see something like the following

Also if you check (with the colors, or the “scroll to source” feature) which Rust code these correspond to, you’ll see that it’s the first and third access to the 4-byte slice that contains our BGRx values.

Afterwards in the assembly, the following steps happen:

  0. incrementing of the “loop counter” representing the number of iterations we’re going to do (r9),
  1. actual reading of the B, G and R values and conversion to u32 (the 3 movzx; note that the reading of the x value is optimized away, as the compiler sees that it is always multiplied by 0 later),
  2. the multiplications with the array elements (the 3 imul),
  3. combining of the results and division (i.e. shift) (the 2 add and the shr),
  4. storing of the result in the output (the 4 mov).

Afterwards the slice pointers are increased by 4 (rbx and r10) and the lengths (used for bounds checking) are decreased by 4 (r8 and r15). Finally there’s a check (cmp) to see whether r9 (our loop counter) is at the end of the slice, and if not we jump back to the beginning and operate on the next BGRx chunk.

Generally what we want to do for optimizations is to get rid of unnecessary checks (bounds checking), memory accesses, conditions (cmp, cmov) and jumps (the instructions starting with j). These are all things that are slowing down our code.

So the first thing that seems useful to optimize here is the bounds checking at the beginning. It definitely seems unnecessary to do two checks instead of one for the two slices (the checks cover both slices at once, but Godbolt does not detect that and believes it’s only the input slice). And ideally we could teach the compiler that no bounds checking is needed at all.

As I wrote in the previous blog post, often this knowledge can be given to the compiler by inserting assertions.

To prevent two checks and just have a single check, you can insert an assert_eq!(in_p.len(), 4) at the beginning of the inner loop, and the same for the output slice. Now we only have a single bounds check left per iteration.
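In the inner loop this could look roughly like this:

```rust
for (in_p, out_p) in in_line[..line_bytes]
    .chunks(4)
    .zip(out_line[..line_bytes].chunks_mut(4))
{
    assert_eq!(in_p.len(), 4);
    assert_eq!(out_p.len(), 4);

    // ... same conversion as before ...
}
```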

As a next step we might want to try to move this knowledge outside the inner loop so that there is no bounds checking at all in there anymore. We might want to add assertions like the following outside the outer loop then to give all knowledge we have to the compiler
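For example something along these lines (a sketch; the exact invariants depend on how the function is called):

```rust
assert_eq!(in_data.len() % 4, 0);
assert_eq!(out_data.len() % 4, 0);
assert_eq!(out_data.len() / out_stride, in_data.len() / in_stride);
```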

Unfortunately adding those has no effect at all on the inner loop, but having them outside the outer loop for good measure is not the worst idea, so let’s just keep them. At least they can be used as some kind of documentation of the invariants of this code for future readers.

So let’s benchmark these two implementations now. The results on my machine are the following

This is surprising: our version without the assertions is actually faster by a factor of ~1.1, even though the version with the assertions has fewer conditions. So let’s take a closer look at the assembly at the top of the loop again, where the bounds checking happens, in the version with assertions

While this indeed has only one jump for the bounds checking, as expected, the number of comparisons is the same and, even worse, 3 memory writes to the stack happen right before the jump. If we follow the .LBB4_33 label we will see that the assert_eq! macro is going to do something with core::fmt::Debug. This is setting up the information needed for printing the assertion failure, the “expected X equals Y” output. This is certainly not good, and the reason why everything is slower now.

First Optimization – Assertions Try 2

All the additional instructions and memory writes were happening because the assert_eq! macro is outputting something user-friendly that actually contains the values of both sides. Let’s try again with the assert! macro instead
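That is, the two checks in the inner loop become:

```rust
assert!(in_p.len() == 4);
assert!(out_p.len() == 4);
```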

This already looks more promising. Compared to our baseline version this gives us a speedup of a factor of 1.12, and compared to the version with assert_eq! 1.23. If we look at the assembly for the bounds checks (everything else stays the same), it also looks more like what we would’ve expected

One cmp less, only one jump left. And no memory writes anymore!

So keep in mind that assert_eq! is more user-friendly but quite a bit more expensive even in the “good case” compared to assert!.

Second Optimization – Iterate a bit more

This is still not very satisfying though. No bounds checking should be needed at all, as each chunk is going to be exactly 4 bytes. We’re just not able to convince the compiler that this is the case. While it may be possible (let me know if you find a way!), let’s try something different. The zip iterator is done when the shorter of the two iterators is done, and there are optimizations implemented specifically for zipped slice iterators. Let’s try that and replace the grayscale value calculation with the following
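Something along these lines, zipping each pixel with the coefficient array from the baseline sketch (again only a sketch of the idea):

```rust
// Inside the inner loop, replacing the indexed reads and multiplications:
let grey = in_p
    .iter()
    .zip(BGRX_Y.iter())
    .map(|(i, c)| u32::from(*i) * c)
    .sum::<u32>()
    >> 16;
let grey = grey as u8;
for o in out_p.iter_mut() {
    *o = grey;
}
```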

If we run that through our benchmark after removing the assert!(in_p.len() == 4) (and the same for the output slice), these are the results

We’re actually 2.9 times slower! Even when adding back the assert!(in_p.len() == 4) assertion (and the same for the output slice) we’re still slower

If we look at the assembly of the assertion-less variant, it’s a complete mess now

In short, there are now various new conditions and jumps for short-circuiting the zip iterator in the various cases. And because of all the noise added, the compiler was not even able to optimize the bounds check for the output slice away anymore (the .LBB0_35 cases). While it was able to unroll the iterator (note that the 3 imul multiplications are not interleaved with jumps and are actually 3 multiplications instead of yet another loop), which is quite impressive, it couldn’t do anything meaningful with the information it must have had (it must’ve understood that each chunk has 4 bytes!). This looks like something going wrong somewhere in the optimizer to me.

If we take a look at the variant with the assertions, things look much better

This is literally the same as the assertion version we had before, except that the reading of the input slice, the multiplications and the additions happen in iterator order instead of being batched all together. It’s quite impressive that the compiler was able to completely optimize away the zip iterator here, but unfortunately it’s still many times slower than the original version. The reason must be the instruction reordering: the previous version had all memory reads batched and then the operations batched, which is apparently much better for the internal pipelining of the CPU (it can already execute upcoming instructions that don’t depend on previous ones while waiting for the pending instructions to finish).

It’s also not clear to me why the LLVM optimizer is not able to schedule the instructions the same way here. It apparently has all the information it needs for that if no iterator is involved, and both versions lead to exactly the same assembly except for the order of instructions. This also seems fishy.

Nonetheless, we still have our manual bounds check (the assertion) left here and we should really try to get rid of that. No progress so far.

Third Optimization – Getting rid of the bounds check finally

Let’s tackle this from a different angle now. Our problem is apparently that the compiler is not able to understand that each chunk is exactly 4 bytes.

So why don’t we write a new chunks iterator that always has exactly the requested number of items, instead of potentially fewer for the very last iteration. And instead of panicking if there are leftover elements, it seems useful to just ignore them. That way we have an API that is functionally different from the existing chunks iterator and provides behaviour that is useful in various cases. It’s basically the slice equivalent of the exact_chunks iterator of the ndarray crate.

Because it is functionally different from the existing one, and not just an optimization, I also submitted it for inclusion in Rust’s standard library, and it’s nowadays available as an unstable feature in nightly, like all newly added API. Nonetheless, the same can also be implemented inside your own code with basically the same effect; there are no dependencies on standard library internals.
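For illustration, a minimal hand-rolled version for immutable slices could look like this (the mutable variant works analogously; this is a sketch, not the code that went into the standard library):

```rust
pub struct ExactChunks<'a, T: 'a> {
    slice: &'a [T],
    chunk_size: usize,
}

impl<'a, T> ExactChunks<'a, T> {
    pub fn new(slice: &'a [T], chunk_size: usize) -> Self {
        assert!(chunk_size != 0);
        // Drop any remainder up-front so every chunk is exactly chunk_size long.
        let len = slice.len() - (slice.len() % chunk_size);
        ExactChunks {
            slice: &slice[..len],
            chunk_size,
        }
    }
}

impl<'a, T> Iterator for ExactChunks<'a, T> {
    type Item = &'a [T];

    fn next(&mut self) -> Option<&'a [T]> {
        if self.slice.is_empty() {
            None
        } else {
            let (head, tail) = self.slice.split_at(self.chunk_size);
            self.slice = tail;
            Some(head)
        }
    }
}
```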

So, let’s use our new exact_chunks iterator that is guaranteed (by API) to always give us exactly 4 bytes. In our case this is exactly equivalent to the normal chunks, as by construction our slices always have a length that is a multiple of 4, but the compiler can’t infer that information. The resulting code looks as follows
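A sketch of the resulting loop; exact_chunks/exact_chunks_mut are the unstable slice methods mentioned above (or a hand-rolled equivalent like the one sketched earlier):

```rust
for (in_line, out_line) in in_data
    .exact_chunks(in_stride)
    .zip(out_data.exact_chunks_mut(out_stride))
{
    for (in_p, out_p) in in_line[..line_bytes]
        .exact_chunks(4)
        .zip(out_line[..line_bytes].exact_chunks_mut(4))
    {
        // ... exactly the same conversion as in the baseline,
        // no assertions needed anymore ...
    }
}
```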

It’s exactly the same as the previous version with assertions, except for using exact_chunks instead of chunks, and the same for the mutable iterator. The resulting benchmark of all our variants now looks as follows

Compared to our initial version this is a speedup of a factor of 2.2, compared to our version with assertions a factor of 1.98. This seems like a worthwhile improvement, and if we look at the resulting assembly there are no bounds checks at all anymore

Also due to this the compiler is able to apply some more optimizations: we only have one loop counter (r10) for the number of iterations, and the two pointers (rcx and rsi) that are increased/decreased in each iteration. There is no tracking of the remaining slice lengths anymore, as there was in the assembly of the original version (and the versions with assertions).

Summary

So overall we got a speedup of a factor of 2.2 while still writing very high-level Rust code with iterators, without falling back to unsafe code or using SIMD. The optimizations the Rust compiler is applying are quite impressive, and the Rust marketing line of zero-cost abstractions really holds up here.

The same approach should also work for many similar algorithms, and thus for many multimedia-related algorithms where you iterate over slices and operate on fixed-size chunks.

Also, the above shows that as a first step it’s better to write clean and understandable high-level Rust code without worrying too much about performance (assume the compiler can optimize well), and only afterwards take a look at the generated assembly and check which instructions should really go away (like bounds checking). In many cases this can be achieved by adding assertions in strategic places, or, like in this case, by switching to a slightly different abstraction that is closer to the actual requirements (however, I believe the compiler should be able to produce the same code with the normal chunks iterator and the help of assertions, but making that possible probably requires improvements to the LLVM optimizer).

And if all of that does not help, there’s still the escape hatch of unsafe (for using functions like slice::get_unchecked() or going down to raw pointers) and the possibility of using SIMD instructions (via faster or stdsimd directly). But in the end this should be a last resort for those little parts of your code where optimizations are needed and the compiler can’t easily be convinced to do them for you.

Addendum: slice::split_at

User newpavlov suggested on Reddit to use repeated slice::split_at in a while loop for similar performance.

This would for example look like the following
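A sketch of what that could look like, reusing the signature and the BGRX_Y constant from the baseline sketch above:

```rust
pub fn bgrx_to_gray_split_at(
    in_data: &[u8],
    out_data: &mut [u8],
    in_stride: usize,
    out_stride: usize,
    width: usize,
) {
    let line_bytes = width * 4;

    for (in_line, out_line) in in_data
        .chunks(in_stride)
        .zip(out_data.chunks_mut(out_stride))
    {
        let mut in_pp: &[u8] = &in_line[..line_bytes];
        let mut out_pp: &mut [u8] = &mut out_line[..line_bytes];

        while in_pp.len() >= 4 && out_pp.len() >= 4 {
            // Split off exactly one BGRx pixel from the front of each slice.
            let (in_p, in_rest) = in_pp.split_at(4);
            // The braces move `out_pp` so that the returned sub-slices keep the
            // full lifetime and `out_pp` can be re-assigned below.
            let (out_p, out_rest) = { out_pp }.split_at_mut(4);

            let b = u32::from(in_p[0]);
            let g = u32::from(in_p[1]);
            let r = u32::from(in_p[2]);

            let grey = ((b * BGRX_Y[0] + g * BGRX_Y[1] + r * BGRX_Y[2]) >> 16) as u8;

            out_p[0] = grey;
            out_p[1] = grey;
            out_p[2] = grey;
            out_p[3] = grey;

            in_pp = in_rest;
            out_pp = out_rest;
        }
    }
}
```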

Performance-wise this brings us very close to the exact_chunks version

and the assembly is also very similar

Here the compiler even optimizes the storing of the value into a single write operation of 4 bytes, at the cost of an additional multiplication and zero-extend register move.

Overall this code performs very well too, but in my opinion it looks rather ugly compared to the versions using the different chunks iterators. Also, this is basically what the exact_chunks iterator does internally: repeatedly calling slice::split_at. In theory both versions could lead to the very same assembly, but the LLVM optimizer currently handles the two slightly differently.

How to write GStreamer Elements in Rust Part 1: A Video Filter for converting RGB to grayscale

This is part one of a series of blog posts that I’ll write in the next weeks, as previously announced in the GStreamer Rust bindings 0.10.0 release blog post. Since the last series of blog posts about writing GStreamer plugins in Rust ([1] [2] [3] [4]) a lot has changed, and the content of those blog posts only has historical value now, documenting the journey of experimentation that led to what exists today.

In this first part we’re going to write a plugin that contains a video filter element. The video filter can convert from RGB to grayscale, either outputting 8-bit-per-pixel grayscale or 32-bit-per-pixel RGB. In addition there’s a property to invert all grayscale values, and one to shift them by up to 255 values. In the end this will allow you to watch Big Buck Bunny, or anything else really that can somehow go into a GStreamer pipeline, in grayscale. Or encode the output to a new video file, send it over the network via WebRTC or something else, or basically do anything you want with it.

Big Buck Bunny – Grayscale

This will show the basics of how to write a GStreamer plugin and element in Rust: the basic setup for registering a type and implementing it in Rust, and how to use the various GStreamer APIs and APIs from the Rust standard library to do the processing.

The final code for this plugin can be found here, and it is based on the 0.1 version of the gst-plugin crate and the 0.10 version of the gstreamer crate. At least Rust 1.20 is required for all this. I’m also assuming that you have GStreamer (at least version 1.8) installed for your platform, see e.g. the GStreamer bindings installation instructions.

Table of Contents

  1. Project Structure
  2. Plugin Initialization
  3. Type Registration
  4. Type Class & Instance Initialization
  5. Caps & Pad Templates
  6. Caps Handling Part 1
  7. Caps Handling Part 2
  8. Conversion of BGRx Video Frames to Grayscale
  9. Testing the new element
  10. Properties
  11. What next?

Project Structure

We’ll create a new cargo project with cargo init --lib --name gst-plugin-tutorial. This will create a basically empty Cargo.toml and a corresponding src/lib.rs. We will use this structure: lib.rs will contain all the plugin-related code, and separate modules will contain any GStreamer elements that are added.

The empty Cargo.toml has to be updated to list all the dependencies that we need, and to define that the crate should result in a cdylib, i.e. a C library that does not contain any Rust-specific metadata. The final Cargo.toml looks as follows
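Roughly along these lines; the gst-plugin and gstreamer versions are the ones mentioned above, while the remaining version numbers are assumptions:

```toml
[package]
name = "gst-plugin-tutorial"
version = "0.1.0"

[dependencies]
glib = "0.4"
gstreamer = "0.10"
gstreamer-base = "0.10"
gstreamer-video = "0.10"
gst-plugin = "0.1"

[lib]
name = "gstrstutorial"
crate-type = ["cdylib"]
path = "src/lib.rs"
```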

We’re depending on the gst-plugin crate, which provides all the basic infrastructure for implementing GStreamer plugins and elements. In addition we depend on the gstreamer, gstreamer-base and gstreamer-video crates for various GStreamer APIs that we’re going to use later, and the glib crate to be able to use some GLib API that we’ll need. GStreamer builds upon GLib, and this leaks through in various places.

With the basic project structure set up, we should be able to compile the project with cargo build now, which will download and build all dependencies and then create a file called target/debug/libgstrstutorial.so (or .dll on Windows, .dylib on macOS). This is going to be our GStreamer plugin.

To allow GStreamer to find our new plugin and make it available in every GStreamer-based application, we could install it into the system- or user-wide GStreamer plugin path or simply point the GST_PLUGIN_PATH environment variable to the directory containing it:
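For example, from the project directory:

```sh
export GST_PLUGIN_PATH=`pwd`/target/debug
```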

If you now run the gst-inspect-1.0 tool on the libgstrstutorial.so, it will not yet print all the information it can extract from the plugin; for now it just complains that this is not a valid GStreamer plugin. Which is true, as we didn’t write any code for it yet.

Plugin Initialization

Let’s start editing src/lib.rs to make this an actual GStreamer plugin. First of all, we need to add various extern crate directives to be able to use our dependencies and also mark some of them #[macro_use] because we’re going to use macros defined in some of them. This looks like the following
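A sketch of what this can look like (exactly which crates need #[macro_use] depends on which macros you end up using):

```rust
#[macro_use]
extern crate gst_plugin;
#[macro_use]
extern crate gstreamer as gst;
extern crate gstreamer_base as gst_base;
extern crate gstreamer_video as gst_video;
extern crate glib;
```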

Next we make use of the plugin_define! macro from the gst-plugin crate to set up the static metadata of the plugin (and make the shared library recognizable by GStreamer as a valid plugin), and to define the name of our entry point function (plugin_init) where we will register all the elements that this plugin provides.

This is unfortunately not very beautiful yet due to a) GStreamer requiring this information to be statically available in the shared library, not returned by a function (starting with GStreamer 1.14 it can be a function), and b) Rust not allowing byte strings (b"blabla") to be concatenated with a macro like the std::concat macro (so that the b and \0 parts could be hidden away). Expect this to become better in the future.

The static plugin metadata that we provide here is

  1. name of the plugin
  2. short description for the plugin
  3. name of the plugin entry point function
  4. version number of the plugin
  5. license of the plugin (only a fixed set of licenses is allowed here)
  6. source package name
  7. binary package name (only really makes sense for e.g. Linux distributions)
  8. origin of the plugin
  9. release date of this version
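Putting that together, a sketch of such a plugin_define! invocation could look like the following, with the fields in the order of the list above (all the strings here are placeholders):

```rust
plugin_define!(
    b"rstutorial\0",              // name
    b"Rust Tutorial Plugin\0",    // short description
    plugin_init,                  // entry point function
    b"1.0\0",                     // version
    b"MIT/X11\0",                 // license
    b"rstutorial\0",              // source package
    b"gst-plugin-tutorial\0",     // binary package
    b"https://example.org\0",     // origin
    b"2017-12-30\0"               // release date
);
```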

In addition we’re defining an empty plugin entry point function that just returns true
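A sketch of such an entry point (the exact signature is defined by the gst-plugin crate; here it is assumed to take the plugin and return a bool):

```rust
fn plugin_init(_plugin: &gst::Plugin) -> bool {
    true
}
```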

With all that given, gst-inspect-1.0 should print exactly this information when running on the libgstrstutorial.so file (or .dll on Windows, or .dylib on macOS)

Type Registration

As a next step, we’re going to add another module rgb2gray to our project, and call a function called register from our plugin_init function.

With that our src/lib.rs is complete, and all following code is only in src/rgb2gray.rs. At the top of the new file we first need to add various use-directives to import various types and functions we’re going to use into the current module’s scope

GStreamer is based on the GLib object system (GObject). C (just like Rust) does not have built-in support for object-oriented programming, inheritance, virtual methods and related concepts, and GObject makes these features available in C as a library. Without language support this is a quite verbose endeavour in C, and the gst-plugin crate tries to expose all this in a (as much as possible) Rust-style API while hiding all the details that do not really matter.

So, as a next step we need to register a new type for our RGB to Grayscale converter GStreamer element with the GObject type system, and then register that type with GStreamer to be able to create new instances of it. We do this with the following code

This defines a zero-sized struct Rgb2GrayStatic that is used to implement the ImplTypeStatic<BaseTransform> trait on it for providing static information about the type to the type system. In our case this is a zero-sized struct, but in other cases this struct might contain actual data (for example if the same element code is used for multiple elements, e.g. when wrapping a generic codec API that provides support for multiple decoders and then wanting to register one element per decoder). By implementing ImplTypeStatic<BaseTransform> we also declare that our element is going to be based on the GStreamer BaseTransform base class, which provides a relatively simple API for 1:1 transformation elements like ours is going to be.

ImplTypeStatic provides functions that return a name for the type, and functions for initializing/returning a new instance of our element (new) and for initializing the class metadata (class_init, more on that later). We simply let those functions proxy to associated functions on the Rgb2Gray struct that we’re going to define at a later time.

In addition, we also define a register function (the one that is already called from our plugin_init function) and in there first register the Rgb2GrayStatic type metadata with the GObject type system to retrieve a type ID, and then register this type ID to GStreamer to be able to create new instances of it with the name “rsrgb2gray” (e.g. when using gst::ElementFactory::make).

Type Class & Instance Initialization

As a next step we declare the Rgb2Gray struct and implement the new and class_init functions on it. In the first version, this struct is almost empty but we will later use it to store all state of our element.

In the new function we return a boxed (i.e. heap-allocated) version of our struct, containing a newly created GStreamer debug category of name “rsrgb2gray”. We’re going to use this debug category later for making use of GStreamer’s debug logging system for logging the state and changes of our element.

In the class_init function we, again, set up some metadata for our new element. In this case these are a description, a classification of our element, a longer description and the author. The metadata can later be retrieved and made use of via the Registry and PluginFeature/ElementFactory API. We also configure the BaseTransform class and define that we will never operate in-place (producing our output in the input buffer), and that we don’t want to work in passthrough mode if the input/output formats are the same.

Additionally we need to implement various traits on the Rgb2Gray struct, which will later be used to override virtual methods of the various parent classes of our element. For now we can keep the trait implementations empty. There is one trait implementation required per parent class.

With all this defined, gst-inspect-1.0 should be able to show some more information about our element already but will still complain that it’s not complete yet.

Caps & Pad Templates

Data flow of GStreamer elements is happening via pads, which are the input(s) and output(s) (or sinks and sources) of an element. Via the pads, buffers containing actual media data, events or queries are transferred. An element can have any number of sink and source pads, but our new element will only have one of each.

To be able to declare what kinds of pads an element can create (they are not necessarily all static but could be created at runtime by the element or the application), it is necessary to install so-called pad templates during the class initialization. These pad templates contain the name (or rather “name template”, it could be something like src_%u for e.g. pad templates that declare multiple possible pads), the direction of the pad (sink or source), the availability of the pad (is it always there, sometimes added/removed by the element or to be requested by the application) and all the possible media types (called caps) that the pad can consume (sink pads) or produce (src pads).

In our case we only have always pads: one sink pad called “sink”, on which we can only accept RGB (BGRx to be exact) data with any width/height/framerate, and one source pad called “src”, on which we will produce either RGB (BGRx) data or GRAY8 (8-bit grayscale) data. We do this by adding the following code to the class_init function.

The names “src” and “sink” are pre-defined by the BaseTransform class and this base-class will also create the actual pads with those names from the templates for us whenever a new element instance is created. Otherwise we would have to do that in our new function but here this is not needed.

If you now run gst-inspect-1.0 on the rsrgb2gray element, these pad templates with their caps should also show up.

Caps Handling Part 1

As a next step we will add caps handling to our new element. This involves overriding 4 virtual methods from the BaseTransformImpl trait, and actually storing the configured input and output caps inside our element struct. Let’s start with the latter

We define a new struct State that contains the input and output caps, stored in a VideoInfo. VideoInfo is a struct that contains various fields like width/height, framerate and the video format, and allows conveniently working with the properties of (raw) video formats. We have to store it inside a Mutex in our Rgb2Gray struct, as this can (in theory) be accessed from multiple threads at the same time.
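A sketch of these two structs (field and type names are assumptions based on the description):

```rust
use std::sync::Mutex;

// All stream-specific state: the configured input and output caps as VideoInfo.
struct State {
    in_info: gst_video::VideoInfo,
    out_info: gst_video::VideoInfo,
}

struct Rgb2Gray {
    cat: gst::DebugCategory,
    state: Mutex<Option<State>>,
}
```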

Whenever input/output caps are configured on our element, the set_caps virtual method of BaseTransform is called with both caps (i.e. in the very beginning before the data flow and whenever it changes), and all following video frames that pass through our element should be according to those caps. Once the element is shut down, the stop virtual method is called and it would make sense to release the State as it only contains stream-specific information. We’re doing this by adding the following to the BaseTransformImpl trait implementation

This code should be relatively self-explanatory. In set_caps we’re parsing the two caps into a VideoInfo and then store this in our State, in stop we drop the State and replace it with None. In addition we make use of our debug category here and use the gst_info! and gst_debug! macros to output the current caps configuration to the GStreamer debug logging system. This information can later be useful for debugging any problems once the element is running.

Next we have to provide information to the BaseTransform base class about the size in bytes of a video frame with specific caps. This is needed so that the base class can allocate an appropriately sized output buffer for us, that we can then fill later. This is done with the get_unit_size virtual method, which is required to return the size of one processing unit in specific caps. In our case, one processing unit is one video frame. In the case of raw audio it would be the size of one sample multiplied by the number of channels.

We simply make use of the VideoInfo API here again, which conveniently gives us the size of one video frame already.

Instead of get_unit_size it would also be possible to implement the transform_size virtual method, which gets passed one size and the corresponding caps plus a second caps, and is supposed to return the size converted to the second caps. Depending on how your element works, one or the other can be easier to implement.

Caps Handling Part 2

We’re not done yet with caps handling though. As a very last step it is required that we implement a function that is converting caps into the corresponding caps in the other direction. For example, if we receive BGRx caps with some width/height on the sinkpad, we are supposed to convert this into new caps with the same width/height but BGRx or GRAY8. That is, we can convert BGRx to BGRx or GRAY8. Similarly, if the element downstream of ours can accept GRAY8 with a specific width/height from our source pad, we have to convert this to BGRx with that very same width/height.

This has to be implemented in the transform_caps virtual method, and looks as follows

This caps conversion happens in 3 steps. First we check if we got caps for the source pad. In that case, the caps on the other pad (the sink pad) are going to be exactly the same caps, but no matter whether the caps contained BGRx or GRAY8, they must become BGRx, as that’s the only format that our sink pad can accept. We do this by creating a clone of the input caps, then making sure that those caps are actually writable (i.e. we’re having the only reference to them, or a copy is going to be created), and then iterating over all the structures inside the caps and setting their “format” field to BGRx. After this, all structures in the new caps will have the format field set to BGRx.

Similarly, if we get caps for the sink pad and are supposed to convert it to caps for the source pad, we create new caps and in there append a copy of each structure of the input caps (which are BGRx) with the format field set to GRAY8. In the end we append the original caps, giving us first all caps as GRAY8 and then the same caps as BGRx. With this ordering we signal to GStreamer that we would prefer to output GRAY8 over BGRx.

In the end the caps we created for the other pad are filtered against optional filter caps to reduce the potential size of the caps. This is done by intersecting the caps with that filter, while keeping the order (and thus preferences) of the filter caps (gst::CapsIntersectMode::First).

Conversion of BGRx Video Frames to Grayscale

Now that all the caps handling is implemented, we can finally get to the implementation of the actual video frame conversion. For this we start with defining a helper function bgrx_to_gray that converts one BGRx pixel to a grayscale value. The BGRx pixel is passed as a &[u8] slice with 4 elements and the function returns another u8 for the grayscale value.
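A sketch of such a helper; the coefficients are the usual BT.601 luma weights normalized to unsigned 16-bit integers:

```rust
#[inline]
fn bgrx_to_gray(in_p: &[u8]) -> u8 {
    assert_eq!(in_p.len(), 4);

    const R_Y: u32 = 19595; // 0.299 * 65536
    const G_Y: u32 = 38470; // 0.587 * 65536
    const B_Y: u32 = 7471;  // 0.114 * 65536

    let b = u32::from(in_p[0]);
    let g = u32::from(in_p[1]);
    let r = u32::from(in_p[2]);

    let gray = (r * R_Y + g * G_Y + b * B_Y) / 65536;
    (gray & 0xff) as u8
}
```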

This function works by extracting the blue, green and red components from each pixel (remember: we work on BGRx, so the first value will be blue, the second green, the third red and the fourth unused), extending them from 8 to 32 bits for a wider value range and then converting them to the Y component of the YUV colorspace (basically what your grandparents’ black & white TV would’ve displayed). The coefficients come from the Wikipedia page about YUV and are normalized to unsigned 16 bit integers so we can keep some accuracy, don’t have to work with floating point arithmetic and stay inside the range of 32 bit integers for all our calculations. As you can see, the green component is weighted more than the others, which comes from our eyes being more sensitive to green than to other colors.

Afterwards we have to actually call this function on every pixel. For this the transform virtual method is implemented, which gets an input and an output buffer passed, and we’re supposed to read the input buffer and fill the output buffer. The implementation looks as follows, and is going to be our biggest function for this element

What happens here is that we first of all lock our state (the input/output VideoInfo) and error out if we don’t have any yet (which can’t really happen unless other elements have a bug, but better safe than sorry). After that we map the input buffer readable and the output buffer writable with the VideoFrameRef API. By mapping the buffers we get access to the underlying bytes of them, and the mapping operation could for example make GPU memory available or just do nothing and give us access to a normally allocated memory area. We have access to the bytes of the buffer until the VideoFrameRef goes out of scope.

Instead of VideoFrameRef we could’ve also used the gst::Buffer::map_readable() and gst::Buffer::map_writable() API, but unlike those, the VideoFrameRef API also extracts various metadata from the raw video buffers and makes it available. For example we can directly access the different planes as slices without having to calculate the offsets ourselves, and we get direct access to the width and height of the video frame.

After mapping the buffers, we store various information we’re going to need later in local variables, to save some typing. This is the width (same for input and output, as we never changed the width in transform_caps), the input and output (row) stride (the number of bytes per row/line, which possibly includes some padding at the end of each line for alignment reasons), the output format (which can be BGRx or GRAY8 because of how we implemented transform_caps) and the pointers to the first plane of the input and output (which in this case is also the only plane; BGRx and GRAY8 both have only a single plane containing all the RGB/gray components).

Then based on whether the output is BGRx or GRAY8, we iterate over all pixels. The code is basically the same in both cases, so I’m only going to explain the case where BGRx is output.

We start by iterating over each line of the input and output, and do so by using the chunks iterator to give us chunks of as many bytes as the (row-) stride of the video frame is, do the same for the other frame and then zip both iterators together. This means that on each iteration we get exactly one line as a slice from each of the frames and can then start accessing the actual pixels in each line.

To access the individual pixels in each line, we again use the chunks iterator the same way, but this time to always give us chunks of 4 bytes from each line. As BGRx uses 4 bytes for each pixel, this gives us exactly one pixel. Instead of iterating over the whole line, we only take the actual sub-slice that contains the pixels, not the whole line with stride number of bytes containing potential padding at the end. Now for each of these pixels we call our previously defined bgrx_to_gray function and then fill the B, G and R components of the output buffer with that value to get grayscale output. And that’s all.
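A sketch of that BGRx-output loop (variable names are illustrative; in the real element these values come from the mapped VideoFrameRefs, and in_line_bytes/out_line_bytes are width * 4):

```rust
for (in_line, out_line) in in_data
    .chunks(in_stride)
    .zip(out_data.chunks_mut(out_stride))
{
    for (in_p, out_p) in in_line[..in_line_bytes]
        .chunks(4)
        .zip(out_line[..out_line_bytes].chunks_mut(4))
    {
        let gray = bgrx_to_gray(in_p);
        out_p[0] = gray;
        out_p[1] = gray;
        out_p[2] = gray;
        out_p[3] = gray;
    }
}
```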

Using Rust high-level abstractions like the chunks iterators and bounds-checking slice accesses might seem like it’s going to cause quite some performance penalty, but if you look at the generated assembly most of the bounds checks are completely optimized away and the resulting assembly code is close to what one would’ve written manually (especially when using the newly-added exact_chunks iterators). Here you’re getting safe and high-level looking code with low-level performance!

You might’ve also noticed the various assertions in the processing function. These are there to give further hints to the compiler about properties of the code, potentially allowing it to optimize the code better, e.g. by moving bounds checks out of the inner loop and just having the assertion outside the loop check for the same. In Rust adding assertions can often improve performance by allowing further optimizations to be applied, but in the end always check the resulting assembly to see if what you did made any difference.

Testing the new element

Now we have implemented almost all the functionality of our new element and can run it on actual video data. This can be done with the gst-launch-1.0 tool, or any application using GStreamer that allows us to insert our new element somewhere in the video part of the pipeline. With gst-launch-1.0 you could run, for example, the following pipelines
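For example something like the following (the file location is a placeholder):

```sh
gst-launch-1.0 videotestsrc ! rsrgb2gray ! videoconvert ! autovideosink

gst-launch-1.0 filesrc location=video.mp4 ! decodebin ! videoconvert ! rsrgb2gray ! videoconvert ! autovideosink
```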

Note that you will likely want to compile with cargo build --release and add the target/release directory to GST_PLUGIN_PATH instead. The debug build might be too slow, and generally the release builds are multiple orders of magnitude (!) faster.

Properties

The only feature missing now are the properties I mentioned in the opening paragraph: one boolean property to invert the grayscale value and one integer property to shift the value by up to 255. Implementing this on top of the previous code is not a lot of work. Let’s start with defining a struct for holding the property values and defining the property metadata.
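The Settings struct itself is plain Rust and could look roughly like this (the property metadata array uses types from the gst-plugin crate and is not shown here):

```rust
const DEFAULT_INVERT: bool = false;
const DEFAULT_SHIFT: u32 = 0;

#[derive(Debug, Clone, Copy)]
struct Settings {
    invert: bool,
    shift: u32,
}

impl Default for Settings {
    fn default() -> Self {
        Settings {
            invert: DEFAULT_INVERT,
            shift: DEFAULT_SHIFT,
        }
    }
}
```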

This should all be rather straightforward: we define a Settings struct that stores the two values, implement the Default trait for it, then define a two-element array with property metadata (names, description, ranges, default value, writability), and then store the default value of our Settings struct inside another Mutex inside the element struct.

In the next step we have to make use of these: we need to tell the GObject type system about the properties, and we need to implement functions that are called whenever a property value is set or get.

Property values can be changed from any thread at any time, that’s why the Mutex is needed here to protect our struct. And we’re using a new mutex to be able to have it locked only for the shortest possible amount of time: we don’t want to keep it locked for the whole time of the transform function, otherwise applications trying to set/get values would block for up to one frame.

In the property setter/getter functions we are working with a glib::Value. This is a dynamically typed value type that can contain values of any type, together with the type information of the contained value. Here we’re using it to handle an unsigned integer (u32) and a boolean for our two properties. To know which property is currently set/get, we get an identifier passed which is the index into our PROPERTIES array. We then simply match on the name of that to decide which property was meant

With this implemented, we can already compile everything, see the properties and their metadata in gst-inspect-1.0 and can also set them on gst-launch-1.0 like this
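For example like this, assuming the two properties are registered under the names invert and shift:

```sh
gst-launch-1.0 videotestsrc ! rsrgb2gray invert=true shift=100 ! videoconvert ! autovideosink
```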

If we set GST_DEBUG=rsrgb2gray:6 in the environment before running that, we can also see the corresponding debug output when the values are changing. The only thing missing now is to actually make use of the property values for the processing. For this we add the following changes to bgrx_to_gray and the transform function
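For the helper function, the change amounts to something like this sketch (parameter names are illustrative):

```rust
#[inline]
fn bgrx_to_gray(in_p: &[u8], shift: u8, invert: bool) -> u8 {
    assert_eq!(in_p.len(), 4);

    const R_Y: u32 = 19595;
    const G_Y: u32 = 38470;
    const B_Y: u32 = 7471;

    let b = u32::from(in_p[0]);
    let g = u32::from(in_p[1]);
    let r = u32::from(in_p[2]);

    // Same weighted sum as before, then apply the two property values.
    let gray = ((r * R_Y + g * G_Y + b * B_Y) / 65536) as u8;
    let gray = gray.wrapping_add(shift);

    if invert {
        255 - gray
    } else {
        gray
    }
}
```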

And that’s all. If you run the element in gst-launch-1.0 and change the values of the properties you should also see the corresponding changes in the video output.

Note that we always take a copy of the Settings struct at the beginning of the transform function. This ensures that we take the mutex only for the shortest possible amount of time and then have a local snapshot of the settings for each frame.

Also keep in mind that the usage of the property values in the bgrx_to_gray function is far from optimal. It means the addition of another condition to the calculation of each pixel, thus potentially slowing it down a lot. Ideally this condition would be moved outside the inner loops and the bgrx_to_gray function would be made generic over that. See for example this blog post about “branchless Rust” for ideas on how to do that; the actual implementation is left as an exercise for the reader.

What next?

I hope the code walkthrough above was useful to understand how to implement GStreamer plugins and elements in Rust. If you have any questions, feel free to ask them here in the comments.

The same approach also works for audio filters or anything that can be handled in some way with the API of the BaseTransform base class. You can find another filter, an audio echo filter, using the same approach here.

In the next blog post in this series I’ll show how to use another base class to implement another kind of element, but for the time being you can also check the Git repository for various other element implementations.

GStreamer Rust bindings release 0.10.0 & gst-plugin release 0.1.0

Today I’ve released version 0.10.0 of the Rust GStreamer bindings, and after a journey of more than 1½ years the first release of the GStreamer plugin writing infrastructure crate “gst-plugin”.

Check the repositories of both for more details, the code and various examples.

GStreamer Bindings

Some of the changes since the 0.9.0 release were already outlined in the previous blog post, and most of the other changes were also things I found while writing GStreamer plugins. For the full changelog, take a look at the CHANGELOG.md in the repository.

Other changes include

  • I went over the whole API in the last days, added any missing things I found, simplified API as it made sense, changed functions to take Option<_> if allowed, etc.
  • Bindings for using and writing typefinders. Typefinders are the part of GStreamer that tries to guess what kind of media is to be handled, based on looking at the bytes. Especially writing those in Rust seems worthwhile, considering that basically all of the Git log of the existing typefinders consists of fixes for various kinds of memory-safety problems.
  • Bindings for the Registry and PluginFeature were added, as well as fixing the relevant API that works with paths/filenames to actually work on Paths
  • Bindings for the GStreamer Net library were added, allowing to build applications that synchronize their media over the network by using PTP, NTP or a custom GStreamer protocol (for which there also exists a server). This could be used for building video walls, systems recording the same scene from multiple cameras, etc., and provides (depending on network conditions) synchronization of better than 1 ms between devices.

Generally, this is something like a “1.0” release for me now (due to depending on too many pre-1.0 crates this is not going to be 1.0 anytime soon). The basic API is all there and nicely usable now, and hopefully without any bugs; the known-missing APIs are not too important for now and can easily be added at a later time when needed. At this point I don’t expect many API changes anymore.

GStreamer Plugins

The other important part of this announcement is the first release of the “gst-plugin” crate. This provides the basic infrastructure for writing GStreamer plugins and elements in Rust, without having to write any unsafe code.

I started experimenting with using Rust for this more than 1½ years ago, and while a lot of things have changed in that time, this release is a nice milestone. In the beginning there were no GStreamer bindings and I was writing everything manually, and there were also still quite a few pieces of code written in C. Nowadays everything is in Rust and using the automatically generated GStreamer bindings.

Unfortunately there is no real documentation for any of this yet, there’s only the autogenerated rustdoc documentation available from here, and various example GStreamer plugins inside the Git repository that can be used as a starting point. And various people have already written their GStreamer plugins in Rust based on this.

The basic idea of the API is however that everything is as Rust-y as possible. Which might not be too much due to having to map subtyping, virtual methods and the like to something reasonable in Rust, but I believe it’s nice to use now. You basically only have to implement one or more traits on your structs, and that’s it. There’s still quite some boilerplate required, but it’s far less than what would be required in C. The best example at this point might be the audioecho element.

Over the next days (or weeks?) I’m not going to write any documentation yet, but instead will write a couple of very simple, minimal elements that do basically nothing and can be used as starting points to learn how all this works together. And will write another blog post or two about the different parts of writing a GStreamer plugin and element in Rust, so that all of you can get started with that.

Let’s hope that the number of new GStreamer plugins written in C is going to decrease in the future, and maybe even new people who would’ve never done that in C, with all the footguns everywhere, can get started with writing GStreamer plugins in Rust now.

A GStreamer Plugin like the Rec Button on your Tape Recorder – A Multi-Threaded Plugin written in Rust

As Rust is known for “Fearless Concurrency”, that is, being able to write concurrent, multi-threaded code without fear, it seemed like a good fit for a GStreamer element that we had to write at Centricular.

Previous experience with Rust for writing (mostly) single-threaded GStreamer elements and applications (also multi-threaded) was all quite successful and promising already. And in the end, this new element was also a pleasure to write and probably faster to write than the equivalent in C. For the impatient, the code, tests and a GTK+ example application (written with the great Rust GTK bindings, but the GStreamer element is also usable from C or any other language) can be found here.

What does it do?

The main idea of the element is that it basically works like the rec button on your tape recorder. There is a single boolean property called “record”, and whenever it is set to true it will pass through data, and whenever it is set to false it will drop all data. But unlike the existing valve element, it

  • Outputs a contiguous timeline without gaps, i.e. there are no gaps in the output when not recording. Similar to the recording you get on a tape recorder, you don’t have 10s of silence if you didn’t record for 10s.
  • Handles and synchronizes multiple streams at once. When recording e.g. a video stream and an audio stream, every recorded segment starts and stops with both streams at the same time
  • Is key-frame aware. If you record a compressed video stream, each recorded segment starts at a keyframe and ends right before the next keyframe to make it most likely that all frames can be successfully decoded

The multi-threading aspect here comes from the fact that in GStreamer each stream usually has its own thread, so in this case the video stream and the audio stream(s) would come from different threads but would have to be synchronized between each other.

The GTK+ example application for the plugin is playing a video with the current playback time and a beep every second, and allows to record this as an MP4 file in the current directory.

How did it go?

This new element was again based on the Rust GStreamer bindings and the infrastructure that I was writing over the last year or two for writing GStreamer plugins in Rust.

As written above, it generally went all fine and was quite a pleasure but there were a few things that seem noteworthy. But first of all, writing this in Rust was much more convenient and fun than writing it in C would’ve been, and I’ve written enough similar code in C before. It would’ve taken quite a bit longer, I would’ve had to debug more problems in the new code during development (there were actually surprisingly few things going wrong during development, I expected more!), and probably would’ve written less exhaustive tests because writing tests in C is just so inconvenient.

Rust does not prevent deadlocks

While this should be clear, and was also clear to myself before, this seems like it might need some reiteration. Safe Rust prevents data races, but not all possible bugs that multi-threaded programs can have. Rust is not magic, only a tool that helps you prevent some classes of potential bugs.

For example, you can’t just stop thinking about lock order when multiple mutexes are involved, or carelessly use condition variables without making sure that your conditions actually make sense and are accessed atomically. As a wise man once said, “the safest program is the one that does not run at all”, and a deadlocking program is very close to that.

The part about condition variables might be something that can be improved in Rust. Without such care, you can easily end up in situations where you wait forever or your conditions are actually inconsistent. Currently Rust’s condition variables only require a mutex to be passed to the functions for waiting for the condition to be notified, but it would probably also make sense to require passing the same mutex to the constructor and notify functions, to make it absolutely clear that you need to ensure that your conditions are always accessed/modified while this specific mutex is locked. Otherwise you might end up in debugging hell.
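For illustration, this is roughly how the current std API looks; nothing in the types ties the Condvar to one specific Mutex, the association only happens at wait time:

```rust
use std::sync::{Condvar, Mutex};

fn wait_until_ready(lock: &Mutex<bool>, cvar: &Condvar) {
    // The mutex guard is only handed to the condition variable here, at wait
    // time; another thread would lock the same mutex, set the flag and notify.
    let mut ready = lock.lock().unwrap();
    while !*ready {
        ready = cvar.wait(ready).unwrap();
    }
}
```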

Fortunately during development of the plugin I only ran into a simple deadlock, caused by accidentally keeping a mutex locked for too long and then running into conflict with another one. Which is probably an easy trap if the most common way of unlocking a mutex is to let the mutex lock guard fall out of scope. This makes it impossible to forget to unlock the mutex, but also makes it less explicit when it is unlocked and sometimes explicit unlocking by manually dropping the mutex lock guard is still necessary.

So in summary, while a big group of potential problems with multi-threaded programs are prevented by Rust, you still have to be careful to not run into any of the many others. Especially if you use lower-level constructs like condition variables, and not just e.g. channels. Everything is however far more convenient than doing the same in C, and with more support by the compiler, so I definitely prefer writing such code in Rust over doing the same in C.

Missing API

As usual, for the first dozen projects using a new library or new bindings to an existing library, you’ll notice some missing bits and pieces. That I missed a relatively core part of GStreamer, the GstRegistry API, was surprising nonetheless. True, you usually don’t use it directly and I only need it here for loading the new plugin from a non-standard location, but it was still surprising. Let’s hope this was the biggest oversight. If you look at the issues page on GitHub, you’ll find a few other things that are still missing though. But nobody needed them yet, so it’s probably fine for the time being.

Another part of missing API that I noticed during development was that many manual (i.e. not auto-generated) bindings didn’t have the Debug trait implemented, or not in a very useful way. This is solved now, as otherwise I wouldn’t have been able to properly log what is happening inside the element to allow easier debugging later if something goes wrong.

Apart from that there were also various other smaller things that were missing, or bugs (see below) that I found in the bindings while going through all these. But those seem not very noteworthy – check the commit logs if you’re interested.

Bugs, bugs, bugs

I also found a couple of bugs in the bindings. They can be broadly categorized in two categories

  • Annotation bugs in GStreamer. The auto-generated parts of the bindings are generated from an XML description of the API, that is generated from the C headers and code and annotations in there. There were a couple of annotations that were wrong (or missing) in GStreamer, which then caused memory leaks in my case. Such mistakes could also easily cause memory-safety issues though. The annotations are fixed now, which will also benefit all the other language bindings for GStreamer (and I’m not sure why nobody noticed the memory leaks there before me).
  • Bugs in the manually written parts of the bindings. Similarly to the above, there was one memory leak and another case where a function could’ve returned NULL but did not have this case covered on the Rust-side by returning an Option<_>.

Generally I was quite happy with the lack of bugs though, the bindings are really ready for production at this point. And especially, all the bugs that I found are things that are unfortunately “normal” and common when writing code in C, while Rust prevents exactly these classes of bugs. As such, they have to be solved only once at the bindings layer, and then you’re free of them and don’t have to spend any brain capacity on their existence anymore, and can use your brain to solve the actual task at hand.

Inconvenient API

Similar to the missing API, whenever using some rather new API you will find things that are inconvenient and could ideally be done better. The biggest case here was the GstSegment API. A segment represents a (potentially open-ended) playback range and contains all the information to convert timestamps to the different time bases used in GStreamer. I’m not going to get into details here, best check the documentation for them.

A segment can be in different formats, e.g. in time or bytes. In the C API this is handled by storing the format inside the segment and requiring you to pass the format together with the value to every function call, and internally there are some checks that let the function fail if there is a format mismatch. In the previous version of the Rust segment API this was done the same way, which caused lots of unwrap() calls in this element.

But in Rust we can do better, and the new API for the segment now encodes the format in the type system (i.e. there is a Segment<Time>) and only values with the correct type (e.g. ClockTime) can be passed to the corresponding functions of the segment. In addition there is a type for a generic segment (which still has all the runtime checks) and functions to “cast” between the two.

Overall this gives more type-safety (the compiler already checks that you don’t mix calculations between seconds and bytes) and makes the API usage more convenient, as various error conditions just can’t happen and thus don’t have to be handled. Or, like in C, are simply ignored and not handled, potentially leaving a trap that can cause hard-to-debug bugs at a later time.

That Rust requires all errors to be handled makes it very obvious how many potential error cases the average C code out there is not handling at all, and also shows that a more expressive language than C can easily prevent many of these error cases at compile-time already.

GStreamer Rust bindings release 0.9

About 3 months, a GStreamer Conference and two bug-fix releases have passed now since the GStreamer Rust bindings release 0.8.0. Today version 0.9.0 (and 0.9.1 with a small bugfix to export some forgotten types) with a couple of API improvements and lots of additions and cleanups was released. This new version depends on the new set of releases of the gtk-rs crates (glib/etc).

The full changelog can be found here, but below is a short overview of the (in my opinion) most interesting changes.

Tutorials

The basic tutorials 1 to 8 were ported from C to Rust by various contributors. The C versions and the corresponding explanatory text can be found here, and it should be relatively easy to follow the text together with the Rust code.

This should make learning to use GStreamer from Rust much easier, in combination with the few example applications that exist in the repository.

Type-safety Improvements

Previously, querying the current playback position from a pipeline (and various other analogous things) gave you a plain 64-bit integer, just like in C. However in Rust we can easily do better.

The main problem with just getting an integer was that there are “special” values that have the meaning of “no value known”, specifically GST_CLOCK_TIME_NONE for values in time. In C this often causes bugs by code ignoring this special case and then doing calculations with such a value, resulting in completely wrong numbers. In the Rust bindings these are now expressed as an Option<_> so that the special case has to be handled separately, and in combination with that for timed values there is a new type called ClockTime that is implementing all the arithmetic traits and others so you can still do normal arithmetic operations on the values, while the implementation of those operations takes care of GST_CLOCK_TIME_NONE. Also it was previously easy to get a value in bytes and add it to a value in time. Whenever multiple formats are possible, a new type called FormatValue is now used that combines the value itself with its format to prevent such mistakes.

Error Handling

Various operations in GStreamer can fail with a custom enum type: linking pads (PadLinkReturn), pushing a buffer (FlowReturn), changing an element’s state (StateChangeReturn). Previously, handling these was not as convenient as the usual Result-based error handling in Rust. With this release, all these types provide a function into_result() that converts them into a Result, splitting the enum into its good and bad cases, e.g. FlowSuccess and FlowError. Based on this, the usual Rust error handling is possible, including usage of the ?-operator. Once the Try trait is stable, it will also be possible to use the ?-operator directly on FlowReturn and the others, before conversion into a Result.
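
A rough sketch of how this looks (assuming a pad called srcpad, and following the 0.9-era API described here):

```rust
use gstreamer as gst;

// Sketch only: in these versions push() returned a FlowReturn; into_result()
// splits it into its FlowSuccess / FlowError halves so the ?-operator works.
fn forward_buffer(
    srcpad: &gst::Pad,
    buffer: gst::Buffer,
) -> Result<gst::FlowSuccess, gst::FlowError> {
    srcpad.push(buffer).into_result()
}
```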

All these enums are also marked as #[must_use] now, which causes a compiler warning if code does not specifically handle them (which can also mean explicitly ignoring them), making it even harder to overlook failures of such operations.

In addition, all the examples and tutorials make use of the above now and many examples were ported to the failure crate and implement proper error handling in all situations now, for example the decodebin example.

Various New API

Apart from all of the above, a lot of new API was added. Both for writing GStreamer-based applications, and making that easier, as well as for writing GStreamer plugins in Rust. For the latter, the gst-plugin-rs repository with various crates (and plugins) was ported to the GStreamer bindings and completely rewritten, but more on that in another blog post in the next couple of days once the gst-plugin crate is released and published on crates.io.

Multi-threaded raw video conversion and scaling in GStreamer

Another new feature that landed in GStreamer a while ago, and is included in the 1.12 release, is multi-threaded raw video conversion and scaling. The short story is that it led to, for example, a 3.2x speed-up when converting 1080p video to 4k with 4 cores.

I had a few cases where a single core was not able to do rescaling in real-time anymore, even on a quite fast machine. One of the cases was 60fps 4k video in the v210 (10 bit YUV) color format, which is a lot of bytes per second in a not very processing-friendly format. GStreamer’s video converter and scaler is already quite optimized and using SIMD instructions like SSE or Neon, so there was not much potential for further optimizations in that direction.
However, basically every machine nowadays has multiple CPU cores that could be used, raw video conversion/scaling is an almost perfectly parallelizable problem, and given how the conversion code was already structured it was relatively easy to add.

The way it works now is similar to the processing model of libraries like OpenMP or Rayon. The whole work is divided into smaller, equal sub-problems that are handled in parallel, the code waits until all parts are done, and the result is combined. In our specific case that means that each plane of the video frame is cut into 2, 4, or more slices of full rows, which are then converted separately. The “combining” step does not actually exist here: all sub-conversions are written directly to the correct place in the output already.
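
The actual implementation is C, but the processing model can be sketched in a few lines of Rust with Rayon (illustrative only; input and output are assumed to have the same size and stride, and the per-slice “conversion” is just a copy here):

```rust
use rayon::prelude::*;

// Illustrative sketch of the processing model only. Each slice of full rows is
// handled independently and written straight to its place in the output, so
// there is no separate "combine" step.
fn convert_plane(src: &[u8], dst: &mut [u8], stride: usize, rows_per_slice: usize) {
    let slice_size = stride * rows_per_slice;
    dst.par_chunks_mut(slice_size)
        .zip(src.par_chunks(slice_size))
        .for_each(|(dst_slice, src_slice)| {
            // Placeholder for the real conversion kernel
            dst_slice.copy_from_slice(src_slice);
        });
}
```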

As a small helper object for this kind of processing model, I wrote GstParallelizedTaskRunner which might also be useful for other pieces of code that want to do the same.

In the end it was not much work, but the results were satisfying. For example the conversion of 1080p to 4k video in the v210 color format with 4 threads gave a speedup of 3.2x. At that point it looked like the main bottleneck was memory bandwidth, but I didn’t look closer as this was already more than enough for the use cases I was interested in.

Rendering HTML5 video in Servo with GStreamer

At the Web Engines Hackfest in A Coruña at the beginning of October 2017, I was working on adding some proof-of-concept code to Servo to render HTML5 videos with GStreamer. For the impatient, the results can be seen in this video here

And the code can be found here and here.

Details

Servo is Mozilla‘s experimental browser engine written in Rust, optimized for high-performance, parallelized rendering. Parts of Servo are being merged into Firefox as part of Project Quantum, and already provide a lot of performance and stability improvements there.

During the hackfest I actually spent most of the time trying to wrap my head around the huge Servo codebase. It seems very well-structured and designed, exactly what you would expect when a company with decades of experience writing browser engines starts such a project from scratch. After also having worked on WebKit in the past, I would say that you can see the difference between a legacy codebase from the end of the 90s and something written in a modern language with modern software engineering practices.

For the actual implementation of HTML5 video rendering via GStreamer, I started on top of the initial implementation that Philippe Normand had already begun. That one rendered the video in a separate window though, and did not work with the latest version of Servo anymore. I cleaned it up and made it work again (probably the best task you can do to learn a new codebase), and then added support for actually rendering the video inside the web view.

This required quite a few additions on the Servo side, some of which are probably more hacks than anything else, but from the GStreamer side it was extremely simple. In Servo currently all the infrastructure for media rendering is still missing, while GStreamer has more than a decade of polishing behind it for making integration into other software as easy as possible.

All the GStreamer code was written with the GStreamer Rust bindings, containing not a single line of unsafe code.

As you can see from the above video, the results work quite well already. Media controls or anything more fancy are not working though. Also, rendering is currently done completely in software, and an RGBA frame is then uploaded via OpenGL to the GPU for rendering. However, hardware codecs can already be used just fine, and basically every media format out there is supported.

Future

While this all might sound great, unfortunately Mozilla’s plans for media support in Servo are different. They’re planning to use the C++ Firefox/Gecko media backend instead of GStreamer. Best to ask them for the reasons; I would probably not repeat them correctly.

Nonetheless, I’ll try to keep the changes updated with latest Servo and once they add more things for media support themselves add the corresponding GStreamer implementations in my branch. It still provides value for both showing that GStreamer is very well capable of handling web use cases (which it already showed in WebKit), as well as being a possibly better choice for people trying to use Servo on embedded systems or with hardware codecs in general. But as I’ll have to work based on what they do, I’m not going to add anything fundamentally new myself at this point as I would have to rewrite it around whatever they decide for the implementation of it anyway.

Once that part is there, support for having GStreamer render directly to an OpenGL texture could be added as well, which would allow direct rendering with hardware codecs to the screen without the CPU having to touch all the raw video data.

But for now, it’s waiting until they catch up with the Firefox/Gecko media backend.

DASH trick-mode playback in GStreamer: Fast-forward/rewind without saturating your network and CPU

GStreamer now has support for I-frame-only (aka keyframe) trick mode playback of DASH streams. It works only on DASH streams with ISOBMFF (aka MP4) fragments, and only if these contain all the required information. This is something I wanted to blog about since many months already, and it’s even included in the GStreamer 1.10 release already.

When trying to play back a DASH stream at rates much higher than real-time (say 32x), or playing the stream in reverse, you can easily run into various problems. This was already supported by GStreamer in older versions, for DASH streams as well as local files and HLS streams, but it was far from ideal. What would usually happen is that you run out of available network bandwidth (you need to be able to download the stream 32x faster than usual), or out of CPU/GPU resources (it needs to be decoded 32x faster than usual), and even if all that works, there’s no point in displaying 960 (30fps at 32x) frames per second.

To get around that, GStreamer 1.10 can now (if explicitly requested with GST_SEEK_FLAG_TRICKMODE_KEY_UNITS) download and decode only I-frames. Depending on the distance of I-frames in the stream and the selected playback speed, this looks more or less smooth. Depending on those same factors, this might still mean too many frames have to be downloaded or decoded in real-time, so GStreamer also measures the distance between I-frames, how fast data can be downloaded and whether decoders and sinks can catch up, to decide whether to skip over a couple of I-frames and maybe only download every third I-frame.
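
From application code this is requested roughly as follows with the Rust bindings (hedged sketch; the C API takes the same flag, and exact signatures vary between bindings versions):

```rust
use gstreamer as gst;
use gst::prelude::*;

// Sketch: request keyframe-only trick mode playback at 32x from the given
// position. Flag and type names follow the gstreamer Rust crate.
fn enable_trick_mode(
    pipeline: &gst::Pipeline,
    position: gst::ClockTime,
) -> Result<(), glib::BoolError> {
    pipeline.seek(
        32.0,
        gst::SeekFlags::FLUSH | gst::SeekFlags::TRICKMODE_KEY_UNITS,
        gst::SeekType::Set,
        position,
        gst::SeekType::None,
        gst::ClockTime::NONE,
    )
}
```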

If you want to test this, grab the playback-test from GStreamer, select the trickmode key-units mode, and seek in a DASH stream while providing a higher positive or negative (reverse) playback rate.

Let us know if you run into any problems with any specific streams!

Short Implementation Overview

From an implementation point of view this works by having the DASH element in GStreamer (dashdemux) not only download the ISOBMFF fragments but also parse the headers of each to get the positions and distances of the I-frames in the fragment. Based on that it then decides which ones to download or whether to skip ahead one or more fragments. The ISOBMFF headers are then passed to the MP4 demuxer (qtdemux), followed by discontinuous buffers that only contain the actual I-frames and nothing else. While this sounds rather simple from a high-level point of view, getting all the details right was the result of a couple of months of work by Edward Hervey and myself.

Currently the heuristics for deciding which I-frames to download and how much to skip ahead are rather minimal, but it’s working fine in many situations already. A lot of tuning can still be done though, and some streams are working less well than others which can also be improved.

Exporting a GObject C API from Rust code and using it from C, Python, JavaScript and others

During the last days I was experimenting a bit with implementing a GObject C API in Rust. The results can be found in this repository, and this is something like an overview of the work, a code walkthrough and a status report. Note that this is quite long; a little bit further down you can find a table of contents so you can jump to the area you’re interested in. Or read it chapter by chapter.

GObject is a C library that allows writing object-oriented, cross-platform APIs in C (which has no built-in support for that), and provides a very expressive runtime type system with many features known from languages like Java, C# or C++. It is also used by various C libraries, most notably the cross-platform GTK UI toolkit and the GStreamer multimedia framework. GObject also comes with strong conventions about how an API is supposed to look and behave, which makes it relatively easy to learn new GObject-based APIs, compared to generic C libraries that could do anything unexpected.

I’m not going to give a full overview about how GObject works internally and how it is used. If you’re not familiar with that it might be useful to first read the documentation and especially the tutorial. Also some C & Rust (especially unsafe Rust and FFI) knowledge would be good to have for what follows.

If you look at the code, you will notice that there is a lot of unsafe code, boilerplate and glue code. And especially code duplication. I don’t expect anyone to manually write all this code, and the final goal of all this is to have Rust macros that make life easier. Simple Rust macros that make it as easy as in C are almost trivial to write, but what we really want here is to be able to write it all in safe Rust only, in code that looks a bit like C# or Java. There is a prototype for that already written by Niko Matsakis, and a blog post with further details about it. The goal for this code is to work as a manual example that can be integrated one step at a time into the macro-based solution. Code written with that macro should in the end look similar to the following
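
(purely illustrative; this is not the actual syntax of the macro prototype, just the general shape)

```rust
// Hypothetical syntax, only meant to show the general shape; the real
// prototype macro looks different in the details.
gobject_gen! {
    class Foo: GObject {
        name: RefCell<Option<String>>,
        counter: RefCell<i32>,
    }

    impl Foo {
        virtual fn increment(&self, inc: i32) -> i32 {
            let mut counter = self.counter.borrow_mut();
            *counter += inc;
            *counter
        }
    }
}
```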

and be usable like
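
(again purely illustrative)

```rust
// Hypothetical usage from Rust, matching the sketch above
let foo = Foo::new();
foo.increment(1);
```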

The code in my repository is already integrated well into GTK-rs, but the macro generated code should also be integrated well into GTK-rs and work the same as other GTK-rs code from Rust. In addition the generated code should of course make use of all the type FFI conversion infrastructure that already exists in there and was explained by Federico in his blog post (part 1, part 2).
In the end, I would like to see such a macro solution integrated directly into the GLib bindings.

Table of Contents

  1. Why?
  2. Simple (boxed) types
  3. Object types
    1. Inheritance
    2. Virtual Methods
    3. Properties
    4. Signals
  4. Interfaces
  5. Usage from C
  6. Usage from Rust
  7. Usage from Python, JavaScript and Others
  8. What next?

Why?

Now one might ask: why? GObject is yet another C library, and Rust can export a plain C API without any other dependencies just fine. While that is true, C is not very expressive at all, and there are no conventions about how C APIs should look and behave, so everybody does their own thing. With GObject you get all kinds of object-oriented programming features and strong conventions about API design. You also get a couple of features (inheritance, properties/signals, a full runtime type system) that Rust does not have. And as a bonus, you get bindings for various other languages (Python, JavaScript, C++, C#, …) for free. More on that last point later.

Another reason why you might want to do this is to be able to interact with existing C libraries that use GObject. For example if you want to create a subclass of some GTK widget to give it your own custom behaviour or modify its appearance, or write a completely new GTK widget that should be placed together with other widgets in your UI, or implement a new GStreamer element for some fancy filter or codec or … that you want to use.

Simple (boxed) types

Let’s start with the simple and boring case, which already introduces various GObject concepts. Let’s assume you already have some simple Rust type that you want to expose a C API for, and it should be GObject-style to get all the above advantages. For that, GObject has the concept of boxed types. These have to come with a “copy” and a “free” function, which can do an actual copy of the object or just implement reference counting, and GObject allows registering these together with a string name for the type, giving back a type ID (GType) that allows referencing this type.

Boxed types, together with any C API they provide, can then be used from C and from any other language for which GObject support exists (i.e. basically all of them). Instances of these boxed types can be used in signals and properties (see further below), can be stored in a GValue (a container type that can store an instance of any other type together with its type ID), etc.

So how does all this work? In my repository I’m implementing a boxed type around an Option<String>, once as a “copy” type (RString), and once reference counted (SharedRString). Outside Rust, both are just passed as pointers and their implementation is private/opaque. As such, it is possible to use any kind of Rust struct or enum, and e.g. marking them as #[repr(C)] is not needed. It is also possible to use #[repr(C)] structs though, in which case the memory layout could be public and any struct fields could be available from C and other languages.

RString

The actual implementation of the type is in the imp.rs file, i.e. in the imp module. I’ll cover the other files in there at a later time, but mod.rs is providing a public Rust API around all this that integrates with GTK-rs.

The following is the whole implementation, in safe Rust:
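
Condensed to its core (the exact code is in the repository), it amounts to this:

```rust
// Minimal sketch of imp.rs: a plain Rust type wrapping an Option<String>,
// with safe constructor, getter and setter. Clone is derived so that the
// boxed "copy" function can use it.
#[derive(Clone)]
pub struct RString(Option<String>);

impl RString {
    pub fn new(s: Option<String>) -> RString {
        RString(s)
    }

    pub fn get(&self) -> Option<String> {
        self.0.clone()
    }

    pub fn set(&mut self, s: Option<String>) {
        self.0 = s;
    }
}
```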

Type Registration

Once the macro based solution is complete, this would be more or less all that would be required to also make this available to C via GObject, and any other languages. But we’re not there yet, and the goal here is to do it all manually. So first of all, we need to register this type somehow to GObject, for which (by convention) a C function called ex_rstring_get_type() should be defined which registers the type on the first call to get the type ID, and on further calls just returns that type ID. If you’re wondering what ex is: this is the “namespace” (C has no built-in support for namespaces) of the whole library, short for “example”. The get_type() function looks like this:
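
Condensed a bit (the exact code is in the repository; the type name “ExRString” and the use of today’s std APIs are assumptions of this sketch):

```rust
#[no_mangle]
pub unsafe extern "C" fn ex_rstring_get_type() -> glib_sys::GType {
    // callback_guard!() from the glib crate of that time would go first here,
    // to prevent panics from unwinding into C.
    static ONCE: std::sync::Once = std::sync::Once::new();
    static mut TYPE: glib_sys::GType = 0;

    ONCE.call_once(|| {
        let type_name = std::ffi::CString::new("ExRString").unwrap();
        TYPE = gobject_sys::g_boxed_type_register_static(
            type_name.as_ptr(),
            // Cast to a generic pointer, then transmute to the function
            // pointer type that GObject expects
            Some(std::mem::transmute(ex_rstring_copy as *const ())),
            Some(std::mem::transmute(ex_rstring_free as *const ())),
        );
    });

    TYPE
}
```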

This is all unsafe Rust that calls directly into the GObject C library. We use std::sync::Once for the one-time registration of the type, and store the result in a static mut called TYPE (super unsafe, but OK here as we only ever write to it once). For registration we call g_boxed_type_register_static() from GObject (provided to Rust via the gobject-sys crate) and provide the name (via std::ffi::CString for C interoperability) and the copy and free functions. Unfortunately we have to cast them to a generic pointer, and then transmute them to a different function pointer type, as the arguments and return value pointers that GObject wants there are plain void * pointers while in our code we would at least like to use RString *. And that’s all there is to the registration. We mark the whole function as extern “C” to use the C calling conventions, and use #[no_mangle] so that the function is exported with exactly that symbol name (otherwise Rust does symbol name mangling), and lastly we make sure that no panic unwinding happens from this Rust code back to the C code, via the callback_guard!() macro from the glib crate.

Memory Management Functions

Now let’s take a look at the actual copy and free functions, and the actual constructor function called ex_rstring_new():
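
Reconstructed roughly (the exact code is in the repository):

```rust
use std::os::raw::c_char;
use glib::translate::*;

#[no_mangle]
pub unsafe extern "C" fn ex_rstring_new(s: *const c_char) -> *mut RString {
    // from_glib_none(): we do not take ownership of the C string (which can
    // also be NULL, hence the Option)
    let s: Option<String> = from_glib_none(s);
    Box::into_raw(Box::new(RString::new(s)))
}

#[no_mangle]
pub unsafe extern "C" fn ex_rstring_copy(rstring: *const RString) -> *mut RString {
    let rstring = &*rstring;
    Box::into_raw(Box::new(rstring.clone()))
}

#[no_mangle]
pub unsafe extern "C" fn ex_rstring_free(rstring: *mut RString) {
    // Re-create the Box and let its Drop implementation free everything
    let _ = Box::from_raw(rstring);
}
```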

These are also unsafe Rust functions that work with raw pointers and C types, but fortunately not too much is happening here.

In the constructor function we get a C string (char *) passed as argument and convert it to a Rust string (actually an Option, as it can be NULL) via from_glib_none() from the glib crate, and then pass that to the Rust constructor of our type. from_glib_none() means that we don’t take ownership of the C string passed to us; the other variant would be from_glib_full(), in which case we would take ownership. We then pack up the result in a Rust Box to place the new RString in heap-allocated memory (otherwise it would be stack-allocated), and use Box’s into_raw() function to get a raw pointer to the memory and not have its Drop implementation called anymore. This is then returned to the caller.

Similarly, in the copy and free functions we just do some juggling with Boxes: copy takes a raw pointer to our RString, calls the derived clone() function to copy it all, and then packs it up in a new Box to return to the caller. The free function converts the raw pointer back into a Box, and then lets the Drop implementation of Box take care of freeing all memory related to it.

Actual Functionality

The two remaining functions are C wrappers for the get() and set() Rust functions:
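
Roughly (again, the exact code is in the repository):

```rust
use std::os::raw::c_char;
use glib::translate::*;

#[no_mangle]
pub unsafe extern "C" fn ex_rstring_get(rstring: *const RString) -> *mut c_char {
    let rstring = &*rstring;
    // to_glib_full(): ownership of the returned C string goes to the caller
    rstring.get().to_glib_full()
}

#[no_mangle]
pub unsafe extern "C" fn ex_rstring_set(rstring: *mut RString, s: *const c_char) {
    let rstring = &mut *rstring;
    rstring.set(from_glib_none(s));
}
```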

These only call the corresponding Rust functions. The set() function again uses glib’s from_glib_none() to convert from a C string to a Rust string. The get() function uses ToGlibPtrFull::to_glib_full() from GLib to convert from a Rust string (an Option<String> to be accurate) to a C string, while passing ownership of the C string to the caller (which then also has to free it at a later time).

This was all quite verbose, which is why a macro based solution for all this would be very helpful.

Corresponding C Header

Now if this API were used from C, the header file to do so would look something like this. Probably no surprises here.

Ideally this would also be autogenerated from the Rust code in one way or another, maybe via rusty-cheddar or rusty-binder.

SharedRString

The shared, reference counted, RString works basically the same. The only differences are in how the pointers between C and Rust are converted. For this, let’s take a look at the constructor, copy (aka ref) and free (aka unref) functions again:
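
Sketched (the exact code and function names are in the repository; SharedRString in imp.rs is assumed to look like RString above, minus the Clone derive):

```rust
use std::os::raw::c_char;
use std::sync::Arc;
use glib::translate::*;

#[no_mangle]
pub unsafe extern "C" fn ex_shared_rstring_new(s: *const c_char) -> *mut SharedRString {
    let s: Option<String> = from_glib_none(s);
    Arc::into_raw(Arc::new(SharedRString::new(s))) as *mut _
}

#[no_mangle]
pub unsafe extern "C" fn ex_shared_rstring_ref(shared: *mut SharedRString) -> *mut SharedRString {
    let old = Arc::from_raw(shared);
    // Cloning the Arc increases the reference count
    let new = old.clone();
    // Don't drop the caller's reference: "leak" the original Arc again
    let _ = Arc::into_raw(old);
    Arc::into_raw(new) as *mut _
}

#[no_mangle]
pub unsafe extern "C" fn ex_shared_rstring_unref(shared: *mut SharedRString) {
    // Re-create the Arc and let Drop decrease the reference count
    let _ = Arc::from_raw(shared);
}
```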

The only difference here is that instead of a Box, std::sync::Arc is used, and there are some differences in the copy (aka ref) function. Previously with the Box, we just created an immutable reference from the raw pointer and cloned it, but with the Arc we want to clone the Arc itself (i.e. keep the same underlying object but increase the reference count). For this we use Arc::from_raw() to get back an Arc, and then clone that Arc. If we didn’t do anything else, at the end of the function our original Arc would get its Drop implementation called and the reference count decreased, defeating the whole point of the function. To prevent that, we convert the original Arc to a raw pointer again and “leak” it. That is, we don’t destroy the reference owned by the caller, which would otherwise cause double-free problems later.

Apart from this, everything is really the same. And also the C header looks basically the same.

Object types

Now let’s start with the more interesting part: actual subclasses of GObject with all the features you know from object-oriented languages. Everything up to here was only warm-up, even if useful by itself already to expose normal Rust types to C with a slightly more expressive API.

In GObject, subclasses of the GObject base class (think of Object in Java or C#, the most basic type from which everything inherits) all get the main following features from the base class: reference counting, inheritance, virtual methods, properties, signals. Similarly to boxed types, some functions and structs are registered at runtime with the GObject library to get back a type ID but it is slightly more involved. And our structs must be #[repr(C)] and be structured in a very specific way.

Struct Definitions

Every GObject subclass has two structs: 1) one instance struct that is used for the memory layout of every instance and could contain public fields, and 2) one class struct which is storing the class specific data and the instance struct contains a pointer to it. The class struct is more or less what in C++ the vtable would be, i.e. the place where virtual methods are stored, but in GObject it can also contain fields for example. We define a new type Foo that inherits from GObject.
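A sketch of the two structs (the versions in the repository additionally contain the fields discussed later):

```rust
#[repr(C)]
pub struct Foo {
    parent: gobject_sys::GObject,
}

#[repr(C)]
pub struct FooClass {
    parent_class: gobject_sys::GObjectClass,
}
```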

The first element of the structs must be the corresponding struct of the class we inherit from. This later allows casting pointers of our subclass to pointers of the base class, and re-using all API implemented for the base class. In our example here we don’t define any public fields or virtual methods; the version in the repository has them, but we get to that later.

Now we will actually need to be able to store some state with our objects, but we want to have that state private. For that we define another struct, a plain Rust struct this time
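
For this sketch, the private state consists of the name and the counter used further below:

```rust
use std::cell::RefCell;

struct FooPrivate {
    name: RefCell<Option<String>>,
    counter: RefCell<i32>,
}
```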

This uses RefCell for each field, as in GObject modifications of objects are all done conceptually via interior mutability. For a thread-safe object these would have to be Mutex instead.

Type Registration

In the end we glue all this together and register it to the GObject type system via a get_type() function, similar to the one for boxed types
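
Condensed (the exact code is in the repository; field names follow the gobject-sys GTypeInfo struct, and class_init / instance_init are shown just below):

```rust
#[no_mangle]
pub unsafe extern "C" fn ex_foo_get_type() -> glib_sys::GType {
    static ONCE: std::sync::Once = std::sync::Once::new();
    static mut TYPE: glib_sys::GType = 0;

    ONCE.call_once(|| {
        let type_info = gobject_sys::GTypeInfo {
            class_size: std::mem::size_of::<FooClass>() as u16,
            base_init: None,
            base_finalize: None,
            class_init: Some(class_init),
            class_finalize: None,
            class_data: std::ptr::null(),
            instance_size: std::mem::size_of::<Foo>() as u16,
            n_preallocs: 0,
            instance_init: Some(instance_init),
            value_table: std::ptr::null(),
        };

        let type_name = std::ffi::CString::new("ExFoo").unwrap();
        TYPE = gobject_sys::g_type_register_static(
            gobject_sys::g_object_get_type(),
            type_name.as_ptr(),
            &type_info,
            0,
        );
    });

    TYPE
}
```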

The main difference here is that we call g_type_register_static(), which takes a struct as parameter that contains all the information about our new subclass. In that struct we provide sizes of the class and instance struct (GObject is allocating them for us), various uninteresting fields for now and two function pointers: 1) class_init for initializing the class struct as allocated by GObject (here we would also override virtual methods, define signals or properties for example) and 2) instance_init to do the same with the instance struct. Both structs are zero-initialized in the parts we defined, and the parent parts of both structs are initialized by the code for the parent class already.

Struct Initialization

These two functions look like the following for us (the versions in the repository already do more things)
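
In condensed form (the full versions are in the repository):

```rust
unsafe extern "C" fn class_init(klass: glib_sys::gpointer, _klass_data: glib_sys::gpointer) {
    // Tell GObject how much private storage every instance needs; it is
    // wrapped in an Option so that it can later be dropped by replacing it
    // with None.
    gobject_sys::g_type_class_add_private(klass, std::mem::size_of::<Option<FooPrivate>>());
}

unsafe extern "C" fn instance_init(obj: *mut gobject_sys::GTypeInstance, _klass: glib_sys::gpointer) {
    // The private area was already allocated and zero-initialized by GObject;
    // write an initialized struct into it without reading the old memory.
    let private = gobject_sys::g_type_instance_get_private(obj, ex_foo_get_type())
        as *mut Option<FooPrivate>;
    std::ptr::write(
        private,
        Some(FooPrivate {
            name: std::cell::RefCell::new(None),
            counter: std::cell::RefCell::new(0),
        }),
    );
}
```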

During class initialization, we tell GObject about the size of our private struct but we actually wrap it into an Option. This allows us to later replace it simply with None to deallocate all memory related to it. During instance initialization this private struct is already allocated for us by GObject (and zero-initialized), so we simply get a raw pointer to it via g_type_instance_get_private() and write an initialized struct to that pointer. Raw pointers must be used here so that the Drop implementation of Option is not called for the old, zero-initialized memory when replacing the struct.

As you might’ve noticed, we currently never set the private struct to None to release the memory, effectively leaking memory, but we get to that later when talking about virtual methods.

Constructor

With what we have so far, it’s already possible to create new instances of our subclass, and for that we also define a constructor function now
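
The constructor is tiny (sketch; g_object_new() is variadic and gets no properties here):

```rust
#[no_mangle]
pub unsafe extern "C" fn ex_foo_new() -> *mut Foo {
    // Allocate a new instance of our type; class_init / instance_init are
    // called by GObject as needed.
    gobject_sys::g_object_new(ex_foo_get_type(), std::ptr::null()) as *mut Foo
}
```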

There is probably not much that has to be explained here: we only tell GObject to allocate a new instance of our specific type (by providing the type ID), which then causes the memory to be allocated and our initialization functions to be called. class_init is only called the very first time; instance_init is called for every new instance.

Methods

All this would be rather boring at this point because there is no way to actually do something with our object, so various functions are defined to work with the private data. For example to get the value of the counter
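
For instance (sketch; the repository wraps the private-struct lookup in a get_priv() helper and passes a FooWrapper to the safe implementation):

```rust
#[no_mangle]
pub unsafe extern "C" fn ex_foo_get_counter(this: *mut Foo) -> i32 {
    // Same private-struct lookup as in instance_init
    let private = gobject_sys::g_type_instance_get_private(
        this as *mut gobject_sys::GTypeInstance,
        ex_foo_get_type(),
    ) as *mut Option<FooPrivate>;

    // The real code calls a safe Rust method that receives a FooWrapper
    // instead of &self; reading the RefCell directly keeps the sketch short.
    *(*private).as_ref().unwrap().counter.borrow()
}
```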

This gets the private struct from GObject (get_priv() is a helper function that does the same as we did in instance_init), and then calls a safe Rust function implemented on our struct to actually get the value. Notable here is that we don’t pass &self to the function, but something called FooWrapper. This is a GTK-rs style wrapper type that directly allows to use any API implemented on parent classes and provides various other functionality. It is defined in mod.rs but we will talk about that later.

Inheritance

GObject allows single-inheritance from a base class, similar to Java and C#. All behaviour of the base class is inherited, and API of the base class can be used on the subclass.

I shortly hinted at how that works above already: 1) instance and class struct have the parent class’ structs as first field, so casting to pointers of the parent class work just fine, 2) GObject is told what the parent class is in the call to g_type_register_static(). We did that above already, as we inherited from GObject.

By inheriting from GObject, we e.g. can call g_object_ref() to do reference counting, or any of the other GObject API. Also it allows the Rust wrapper type defined in mod.rs to provide appropriate API for the base class to us without any casts, and to do memory management automatically. How that works is probably going to be explained in one of the following blog posts on Federico’s blog.

In the example repository, there is also another type defined which inherits from our type Foo, called Bar. It’s basically the same code again, except for the name and parent type.

Virtual Methods

Overriding Virtual Methods

Inheritance alone is already useful for reducing code duplication, but to make it really useful virtual methods are needed so that behaviour can be adjusted. In GObject this works similar to how it’s done in e.g. C++, just manually: you place function pointers to the virtual method implementations into the class struct and then call those. As every subclass has its own copy of the class struct (initialized with the values from the parent class), it can override these with whatever function it wants. And as it’s possible to get the actual class struct of the parent class, it is possible to chain up to the implementation of the virtual function of the parent class. Let’s look at the example of the GObject::finalize virtual method, which is called at the very end when the object is to be destroyed and which should free all memory. In there we will free our private data struct with the RefCells.

As a first step, we need to override the function pointer in the class struct in our class_init function and replace it with another function that implements the behaviour we want
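
Sketched, this extends class_init and adds the new finalize implementation:

```rust
unsafe extern "C" fn class_init(klass: glib_sys::gpointer, _klass_data: glib_sys::gpointer) {
    // ... private struct registration as shown before ...

    let gobject_klass = klass as *mut gobject_sys::GObjectClass;
    (*gobject_klass).finalize = Some(finalize);
}

unsafe extern "C" fn finalize(obj: *mut gobject_sys::GObject) {
    // Take the private struct out of its Option and let it be dropped
    let private = gobject_sys::g_type_instance_get_private(
        obj as *mut gobject_sys::GTypeInstance,
        ex_foo_get_type(),
    ) as *mut Option<FooPrivate>;
    let _ = (*private).take();

    // Chain up to the parent class' finalize implementation (PRIV is
    // explained right below)
    let _ = (*PRIV.parent_class).finalize.map(|f| f(obj));
}
```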

This new function could call into a safe Rust implementation, like it’s done for other virtual methods (see a bit later), but for finalize we have to do manual memory management and that’s all unsafe Rust. The way we free the memory here is by replacing, that is take()ing, the Some value out of the Option that contains our private struct, and then letting it be dropped. Afterwards we have to chain up to the parent class’ implementation of finalize, which is done by calling map() on the Option that contains the function pointer.

All the function pointers in glib-sys and related crates are stored in Options, to be able to distinguish between a NULL function pointer and an actual pointer to a function.

Now for chaining up to the parent class’ finalize implementation, there’s a static, global variable containing a pointer to the parent class’ class struct, called PRIV. This is also initialized in the class_init function
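
Sketched (the repository stores more in here, e.g. the property and signal related data used later):

```rust
struct FooClassPrivate {
    parent_class: *const gobject_sys::GObjectClass,
}

static mut PRIV: FooClassPrivate = FooClassPrivate {
    parent_class: std::ptr::null(),
};

// In class_init, before anything else:
//     PRIV.parent_class =
//         gobject_sys::g_type_class_peek_parent(klass) as *const gobject_sys::GObjectClass;
```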

While this is a static mut global variable, this is fine as it’s only ever written to once from class_init, and can only ever be accessed after class_init is done.

Defining New Virtual Methods

For defining new virtual methods, we would add a corresponding function pointer to the class struct and optionally initialize it to a default implementation in the class_init function, or otherwise keep it at NULL/None.
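
As a sketch for a hypothetical increment() virtual method (names assumed): the class struct grows a function pointer, a default implementation acts as the trampoline, and the exported C function dispatches through the class struct of the actual type.

```rust
#[repr(C)]
pub struct FooClass {
    parent_class: gobject_sys::GObjectClass,
    increment: Option<unsafe extern "C" fn(*mut Foo, i32) -> i32>,
}

// In class_init:
//     (*(klass as *mut FooClass)).increment = Some(increment_default);

unsafe extern "C" fn increment_default(this: *mut Foo, inc: i32) -> i32 {
    // Trampoline: in the real code this converts the arguments and calls a
    // safe Rust implementation; here the private struct is poked directly.
    let private = gobject_sys::g_type_instance_get_private(
        this as *mut gobject_sys::GTypeInstance,
        ex_foo_get_type(),
    ) as *mut Option<FooPrivate>;
    let private = (*private).as_ref().unwrap();

    let mut counter = private.counter.borrow_mut();
    *counter += inc;
    *counter
}

#[no_mangle]
pub unsafe extern "C" fn ex_foo_increment(this: *mut Foo, inc: i32) -> i32 {
    // Dispatch through the class struct of the actual type of `this`
    let klass = (*(this as *mut gobject_sys::GTypeInstance)).g_class as *const FooClass;
    ((*klass).increment.unwrap())(this, inc)
}
```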

The trampoline function provided here is responsible for converting from the C types to the Rust types, and then calling a safe Rust implementation of the virtual method.

To make it possible to call these virtual methods from the outside, a C function has to be defined again similar to the ones for non-virtual methods. Instead of calling the Rust implementation directly, this gets the class struct of the type that is passed in and then calls the function pointer for the virtual method implementation of that specific type.

Subclasses would override this default implementation (or provide an actual implementation) exactly the same way, and also chain up to the parent class’ implementation like we saw before for GObject::finalize.

Properties

Similar to Objective-C and C#, GObject has support for properties. These are registered per type, have some metadata attached to them (property type, name, description, writability, valid value range, etc.), and subclasses inherit them and can override them. The main difference between properties and struct fields is that setting/getting a property value executes some code instead of just reading or writing a memory location, and that you can connect a callback to a property to be notified whenever its value changes. Properties can also be queried at runtime from a specific type, and set/get via their string names instead of actual C API. Allowed types for properties are everything that has a GObject type ID assigned, including all GObject subclasses, many fundamental types (integers, strings, …) and boxed types like our RString and SharedRString above.

Defining Properties

To define a property, we have to register it in the class_init function and also implement the GObject::get_property() and GObject::set_property() virtual methods (or only one of them for read-only / write-only properties). Internally inside the implementation of our GObject, the properties are identified by an integer index for which we define a simple enum, and when registered we get back a GParamSpec pointer that we should also store (for notifying about property changes for example).

In class_init we then override the two virtual methods and register a new property, by providing the name, type, value of our enum corresponding to that property, default value and various other metadata. We then store the GParamSpec related to the property in a Vec, indexed by the enum value. In our example we add a string-typed “name” property that is readable and writable, but can only ever be written to during object construction.
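
Sketched (the property enum values start at 1 because ID 0 is reserved by GObject; this again extends class_init, and the set_property / get_property trampolines follow below):

```rust
#[repr(u32)]
enum Property {
    Name = 1,
}

unsafe extern "C" fn class_init(klass: glib_sys::gpointer, _klass_data: glib_sys::gpointer) {
    let gobject_klass = klass as *mut gobject_sys::GObjectClass;

    // ... everything shown before ...

    (*gobject_klass).set_property = Some(set_property);
    (*gobject_klass).get_property = Some(get_property);

    let name = std::ffi::CString::new("name").unwrap();
    let nick = std::ffi::CString::new("Name").unwrap();
    let blurb = std::ffi::CString::new("Name of this object").unwrap();
    let pspec = gobject_sys::g_param_spec_string(
        name.as_ptr(),
        nick.as_ptr(),
        blurb.as_ptr(),
        std::ptr::null(),
        gobject_sys::G_PARAM_READABLE
            | gobject_sys::G_PARAM_WRITABLE
            | gobject_sys::G_PARAM_CONSTRUCT_ONLY,
    );
    gobject_sys::g_object_class_install_property(gobject_klass, Property::Name as u32, pspec);
}
```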

Afterwards we define the trampoline implementations for the set_property and get_property virtual methods.
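
Sketched (signatures follow the C prototypes of GObjectClass::set_property / get_property; the calls into the safe Rust getters and setters are elided here):

```rust
use std::os::raw::c_uint;
use glib::translate::*;

unsafe extern "C" fn set_property(
    obj: *mut gobject_sys::GObject,
    id: c_uint,
    value: *const gobject_sys::GValue,
    _pspec: *mut gobject_sys::GParamSpec,
) {
    match id {
        x if x == Property::Name as c_uint => {
            let name: Option<String> = from_glib_none(gobject_sys::g_value_get_string(value));
            // ... hand `name` to the safe Rust setter on the private struct ...
            let _ = (obj, name);
        }
        _ => unreachable!(),
    }
}

unsafe extern "C" fn get_property(
    obj: *mut gobject_sys::GObject,
    id: c_uint,
    value: *mut gobject_sys::GValue,
    _pspec: *mut gobject_sys::GParamSpec,
) {
    match id {
        x if x == Property::Name as c_uint => {
            // ... fetch the value from the safe Rust getter instead of this placeholder ...
            let name: Option<String> = None;
            let _ = obj;
            gobject_sys::g_value_set_string(value, name.to_glib_none().0);
        }
        _ => unreachable!(),
    }
}
```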

In there we decide based on the index which property is meant, and then convert from/to the GValue container provided by GObject, and then call into safe Rust getters/setters.

This property can now be used via the GObject API, e.g. its value can be retrieved via g_object_get(obj, “name”, &pointer_to_a_char_pointer, NULL) in C.

Construct Properties

The property we defined above had one special feature: it can only ever be set during object construction. Similarly, every property that is writable can also be set during object construction. This works by providing a value to g_object_new() in the constructor function, which then causes GObject to pass this to our set_property() implementation.

Signals

GObject also supports signals. These are similar to events in e.g. C#, Qt or the C++ Boost signals library, and not to be confused with UNIX signals. GObject signals allow you to connect a callback that is called every time a specific event happens.

Signal Registration

Similarly to properties, these are registered in class_init together with various metadata, can be queried at runtime and are usually used by string name. Notification about property changes is implemented with signals, the GObject::notify signal.

Also similarly to properties, internally in our implementation the signals are identified by an integer index. We also store that globally, indexed by a simple enum.

In class_init we then register the signal for our type. For that we provide a name, the parameters of the signal (anything that can be stored in a GValue can be used for this again), the return value (we don’t have one here) and various other metadata. GObject then tells us the ID of the signal, which we store in our vector. In our case we define a signal named “incremented”, that is emitted every time the internal counter of the object is incremented and provides the current value of the counter and by how much it was incremented.

One special part here is the class_offset. GObject allows to (optionally) define a default class handler for the signal. This is always called when the signal is emitted, and is usually a virtual method that can be overridden by subclasses. During signal registration, the offset in bytes to the function pointer of that virtual method inside the class struct is provided.

This is all exactly the same as for virtual methods, just that it will be automatically called when the signal is emitted.

Signal Emission

For emitting the signal, we have to provide the instance and the arguments in an array as GValues, and then emit the signal by the ID we got back during signal registration.

While all parameters to the signal are provided as a GValue here, GObject calls our default class handler and other C callbacks connected to the signal with the corresponding C types directly. The conversion is done inside GObject and then the corresponding function is called via libffi. It is also possible to directly get the array of GValues instead though, by using the GClosure API, for which there are also Rust bindings.

Connecting to the signal can now be done via e.g. g_object_connect() from C.

C header

Similarly to the boxed types, we also have to define a C header for the exported GObject C API. This ideally would also be autogenerated from the macro based solution (e.g. with rusty-cheddar), but here we write it manually. This is mostly GObject boilerplate and conventions.