GLib/GIO async operations and Rust futures + async/await

Unfortunately I was not able to attend the Rust+GNOME hackfest in Madrid last week, but I could at least spend some of my work time at Centricular on implementing one of the things I wanted to work on during the hackfest. The other one, more closely related to the gnome-class work, will be the topic of a future blog post once I actually have something to show.

So back to the topic. With the latest GIT version of the Rust bindings for GLib, GTK, etc it is now possible to make use of the Rust futures infrastructure for GIO async operations and various other functions. This should make writing of GNOME, and in general GLib-using, applications in Rust quite a bit more convenient.

For the impatient, the summary is that you can use Rust futures with GLib and GIO now, that it works both on the stable and nightly version of the compiler, and with the nightly version of the compiler it is also possible to use async/await. An example with the latter can be found here, and an example just using futures without async/await here.

Table of Contents

  1. Futures
    1. Futures in Rust
    2. Async/Await
    3. Tokio
  2. Futures & GLib/GIO
    1. Callbacks
    2. GLib Futures
    3. GIO Asynchronous Operations
    4. Async/Await
  3. The Future

Futures

First of all, what are futures and how do they work in Rust. In a few words, a future (also called promise elsewhere) is a value that represents the result of an asynchronous operation, e.g. establishing a TCP connection. The operation itself (usually) runs in the background, and only once the operation is finished (or fails), the future resolves to the result of that operation. There are all kinds of ways to combine futures, e.g. to execute some other (potentially async) code with the result once the first operation has finished.

It’s a concept that is also widely used in various other programming languages (e.g. C#, JavaScript, Python, …) for asynchronous programming and can probably be considered a proven concept at this point.

Futures in Rust

In Rust, a future is basically an implementation of relatively simple trait called Future. The following is the definition as of now, but there are discussions to change/simplify/generalize it currently and to also move it to the Rust standard library:

Anything that implements this trait can be considered an asynchronous operation that resolves to either an Item or an Error. Consumers of the future would call the poll method to check if the future has resolved already (to a result or error), or if the future is not ready yet. In case of the latter, the future itself would at a later point, once it is ready to proceed, notify the consumer about that. It would get a way for notifications from the Context that is passed, and proceeding does not necessarily mean that the future will resolve after this but it could just advance its internal state closer to the final resolution.

Calling poll manually is kind of inconvenient, so generally this is handled by an Executor on which the futures are scheduled and which is running them until their resolution. Equally, it’s inconvenient to have to implement that trait directly so for most common operations there are combinators that can be used on futures to build new futures, usually via closures in one way or another. For example the following would run the passed closure with the successful result of the future, and then have it return another future (Ok(()) is converted via IntoFuture to the future that always resolves successfully with ()), and also maps any errors to ()

A future represents only a single value, but there is also a trait for something producing multiple values: a Stream. For more details, best to check the documentation.

Async/Await

The above way of combining futures via combinators and closures is still not too great, and is still close to callback hell. In other languages (e.g. C#, JavaScript, Python, …) this was solved by introducing new features to the language: async for declaring futures with normal code flow, and await for suspending execution transparently and resuming at that point in the code with the result of a future.

Of course this was also implemented in Rust. Currently based on procedural macros, but there are discussions to actually move this also directly into the language and standard library.

The above example would look something like the following with the current version of the macros

This looks almost like normal, synchronous code but is internally converted into a future and completely asynchronous.

Unfortunately this is currently only available on the nightly version of Rust until various bits and pieces get stabilized.

Tokio

Most of the time when people talk about futures in Rust, they implicitly also mean Tokio. Tokio is a pure Rust, cross-platform asynchronous IO library and based on the futures abstraction above. It provides a futures executor and various types for asynchronous IO, e.g. sockets and socket streams.

But while Tokio is a great library, we’re not going to use it here and instead implement a futures executor around GLib. And on top of that implement various futures, also around GLib’s sister library GIO, which is providing lots of API for synchronous and asynchronous IO.

Just like all IO operations in Tokio, all GLib/GIO asynchronous operations are dependent on running with their respective event loop (i.e. the futures executor) and while it’s possible to use both in the same process, each operation has to be scheduled on the correct one.

Futures & GLib/GIO

Asynchronous operations and generally everything event related (timeouts, …) are based on callbacks that you have to register, and are running via a GMainLoop that is executing events from a GMainContext. The latter is just something that stores everything that is scheduled and provides API for polling if something is ready to be executed now, while the former does exactly that: executing.

Callbacks

The callback based API is also available via the Rust bindings, and would for example look as follows

As can be seen here already, the callback-based approach leads to quite non-linear code and deep indentation due to all the closures. Also error handling becomes quite tricky due to somehow having handle them from a completely different call stack.

Compared to C this is still far more convenient due to actually having closures that can capture their environment, but we can definitely do better in Rust.

The above code also assumes that somewhere a main loop is running on the default main context, which could be achieved with the following e.g. inside main()

It is also possible to explicitly select for various operations on which main context they should run, but that’s just a minor detail.

GLib Futures

To make this situation a bit nicer, I’ve implemented support for futures in the Rust bindings. This means, that the GLib MainContext is now a futures executor (and arbitrary futures can be scheduled on it), all the GSource related operations in GLib (timeouts, UNIX signals, …) have futures- or stream-based variants and all the GIO asynchronous operations also come with futures variants now. The latter are autogenerated with the gir bindings code generator.

For enabling usage of this, the futures feature of the glib and gio crates have to be enabled, but that’s about it. It is currently still hidden behind a feature gate because the futures infrastructure is still going to go through some API incompatible changes in the near future.

So let’s take a look at how to use it. First of all, setting up the main context and executing a trivial future on it

Apart from spawn(), there is also a spawn_local(). The former can be called from any thread but requires the future to implement the Send trait (that is, it must be safe to send it to other threads) while the latter can only be called from the thread that owns the main context but it allows any kind of future to be spawned. In addition there is also a block_on() function on the main context, which allows to run non-static futures up to their completion and returns their result. The spawn functions only work with static futures (i.e. they have no references to any stack frame) and requires the futures to be infallible and resolve to ().

The above code already showed one of the advantages of using futures: it is possible to use all generic futures (that don’t require a specific executor), like futures::lazy or the mpsc/oneshot channels with GLib now. And any of the combinators that are available on futures

This example also shows the block_on functionality to return an actual value from the future (1 in this case).

GIO Asynchronous Operations

Similarly, all asynchronous GIO operations are now available as futures. For example to open a file asynchronously and getting a gio::InputStream to read from, the following could be done

A bigger example can be found in the gtk-rs examples repository here. This example is basically reading a file asynchronously in 64 byte chunks and printing it to stdout, then closing the file.

In the same way, network operations or any other asynchronous operation can be handled via futures now.

Async/Await

Compared to a callback-based approach, that bigger example is already a lot nicer but still quite heavy to read. With the async/await extension that I mentioned above already, the code looks much nicer in comparison and really almost like synchronous code. Except that it is not synchronous.

For compiling this code, the futures-nightly feature has to be enabled for the glib crate, and a nightly compiler must be used.

The bigger example from before with async/await can be found here.

With this we’re already very close in Rust to having the same convenience as in other languages with asynchronous programming. And also it is very similar to what is possible in Vala with GIO asynchronous operations.

The Future

For now this is all finished and available from GIT of the glib and gio crates. This will have to be updated in the future whenever the futures API is changing, but it is planned to stabilize all this in Rust until the end of this year.

In the future it might also make sense to add futures variants for all the GObject signal handlers, so that e.g. handling a click on a GTK+ button could be done similarly from a future (or rather from a Stream as a signal can be emitted multiple times). If this is in the end more convenient than the callback-based approach that is currently used, is to be seen. Some experimentation would be necessary here. Also how to handle return values of signal handlers would have to be figured out.

Improving GStreamer performance on a high number of network streams by sharing threads between elements with Rust’s tokio crate

For one of our customers at Centricular we were working on a quite interesting project. Their use-case was basically to receive an as-high-as-possible number of audio RTP streams over UDP, transcode them, and then send them out via UDP again. Due to how GStreamer usually works, they were running into some performance issues.

This blog post will describe the first set of improvements that were implemented for this use-case, together with a minimal benchmark and the results. My colleague Mathieu will follow up with one or two other blog posts with the other improvements and a more full-featured benchmark.

The short version is that CPU usage decreased by about 65-75%, i.e. allowing 3-4x more streams with the same CPU usage. Also parallelization works better and usage of different CPU cores is more controllable, allowing for better scalability. And a fixed, but configurable number of threads is used, which is independent of the number of streams.

The code for this blog post can be found here.

Table of Contents

  1. GStreamer & Threads
  2. Thread-Sharing GStreamer Elements
  3. Available Elements
  4. Little Benchmark
  5. Conclusion

GStreamer & Threads

In GStreamer, by default each source is running from its own OS thread. Additionally, for receiving/sending RTP, there will be another thread in the RTP jitterbuffer, yet another thread for receiving RTCP (another source) and a last thread for sending RTCP at the right times. And RTCP has to be received and sent for the receiver and sender side part of the pipeline, so the number of threads doubles. In the sum this gives at least 1 + 1 + (1 + 1) * 2 = 6 threads per RTP stream in this scenario. In a normal audio scenario, there will be one packet received/sent e.g. every 20ms on each stream, and every now and then an RTCP packet. So most of the time all these threads are only waiting.

Apart from the obvious waste of OS resources (1000 streams would be 6000 threads), this also brings down performance as all the time threads are being woken up. This means that context switches have to happen basically all the time.

To solve this we implemented a mechanism to share threads, and in the end as a result we have a fixed, but configurable number of threads that is independent from the number of streams. And can run e.g. 500 streams just fine on a single thread with a single core, which was completely impossible before. In addition we also did some work to reduce the number of allocations for each packet, so that after startup no additional allocations happen per packet anymore for buffers. See Mathieu’s upcoming blog post for details.

In this blog post, I’m going to write about a generic mechanism for sources, queues and similar elements to share their threads between each other. For the RTP related bits (RTP jitterbuffer and RTCP timer) this was not used due to reuse of existing C codebases.

Thread-Sharing GStreamer Elements

The code in question can be found here, a small benchmark is in the examples directory and it is going to be used for the results later. A full-featured benchmark will come in Mathieu’s blog post.

This is a new GStreamer plugin, written in Rust and around the Tokio crate for asynchronous IO and generally a “task scheduler”.

While this could certainly also have been written in C around something like libuv, doing this kind of work in Rust is simply more productive and fun due to its safety guarantees and the strong type system, which definitely reduced the amount of debugging a lot. And in addition “modern” language features like closures, which make working with futures much more ergonomic.

When using these elements it is important to have full control over the pipeline and its elements, and the dataflow inside the pipeline has to be carefully considered to properly configure how to share threads. For example the following two restrictions should be kept in mind all the time:

  1. Downstream of such an element, the streaming thread must never ever block for considerable amounts of time. Otherwise all other elements inside the same thread-group would be blocked too, even if they could do any work now
  2. This generally all works better in live pipelines, where media is produced in real-time and not as fast as possible

Available Elements

So this repository currently contains the generic infrastructure (see the src/iocontext.rs source file) and a couple of elements:

  • an UDP source: ts-udpsrc, a replacement for udpsrc
  • an app source: ts-appsrc, a replacement for appsrc to inject packets into the pipeline from the application
  • a queue: ts-queue, a replacement for queue that is useful for adding buffering to a pipeline part. The upstream side of the queue will block if not called from another thread-sharing element, but if called from another thread-sharing element it will pause the current task asynchronously. That is, stop the upstream task from producing more data.
  • a proxysink/src element: ts-proxysrc, ts-proxysink, replacements for proxysink/proxysrc for connecting two pipelines with each other. This basically works like the queue, but split into two elements.
  • a tone generator source around spandsp: ts-tonesrc, a replacement for tonegeneratesrc. This also contains some minimal FFI bindings for that part of the spandsp C library.

All these elements have more or less the same API as their non-thread-sharing counterparts.

API-wise, each of these elements has a set of properties for controlling how it is sharing threads with other elements, and with which elements:

  • context: A string that defines in which group this element is. All elements with the same context are running on the same thread or group of threads,
  • context-threads: Number of threads to use in this context. -1 means exactly one thread, 1 and above used N+1 threads (1 thread for polling fds, N worker threads) and 0 sets N to the number of available CPU cores. As long as no considerable work is done in these threads, -1 has shown to be the most efficient. See also this tokio GitHub issue
  • context-wait: Number of milliseconds that the threads will wait on each iteration. This allows to reduce CPU usage even further by handling all events/packets that arrived during that timespan to be handled all at once instead of waking up the thread every time a little event happens, thus reducing context switches again

The elements are all pushing data downstream from a tokio thread whenever data is available, assuming that downstream does not block. If downstream is another thread-sharing element and it would have to block (e.g. a full queue), it instead returns a new future to upstream so that upstream can asynchronously wait on that future before producing more output. By this, back-pressure is implemented between different GStreamer elements without ever blocking any of the tokio threads. All this is implemented around the normal GStreamer data-flow mechanisms, there is no “tokio fast-path” between elements.

Little Benchmark

As mentioned above, there’s a small benchmark application in the examples directory. This basically sets up a configurable number of streams and directly connects them to a fakesink, throwing away all packets. Additionally there is another thread that is sending all these packets. As such, this is really the most basic benchmark and not very realistic but nonetheless it shows the same performance improvement as the real application. Again, see Mathieu’s upcoming blog post for a more realistic and complete benchmark.

When running it, make sure that your user can create enough fds. The benchmark will just abort if not enough fds can be allocated. You can control this with ulimit -n SOME_NUMBER, and allowing a couple of thousands is generally a good idea. The benchmarks below were running with 10000.

After running cargo build –release to build the plugin itself, you can run the benchmark with:

and in another shell the UDP sender with

This runs 1000 streams, uses ts-udpsrc (alternative would be udpsrc), configures exactly one thread -1, 1 context, and a wait time of 20ms. See above for what these settings mean. You can check CPU usage with e.g. top. Testing was done on an Intel i7-4790K, with Rust 1.25 and GStreamer 1.14. One packet is sent every 20ms for each stream.

Source Streams Threads Contexts Wait CPU
udpsrc 1000 1000 x x 44%
ts-udpsrc 1000 -1 1 0 18%
ts-udpsrc 1000 -1 1 20 13%
ts-udpsrc 1000 -1 2 20 15%
ts-udpsrc 1000 2 1 20 16%
ts-udpsrc 1000 2 2 20 27%
Source Streams Threads Contexts Wait CPU
udpsrc 2000 2000 x x 95%
ts-udpsrc 2000 -1 1 20 29%
ts-udpsrc 2000 -1 2 20 31%
Source Streams Threads Contexts Wait CPU
ts-udpsrc 3000 -1 1 20 36%
ts-udpsrc 3000 -1 2 20 47%

Results for 3000 streams for the old udpsrc are not included as starting up that many threads needs too long.

The best configuration is apparently a single thread per context (see this tokio GitHub issue) and waiting 20ms for every iterations. Compared to the old udpsrc, CPU usage is about one third in that setting, and generally it seems to parallelize well. It’s not clear to me why the last test has 11% more CPU with two contexts, while in every other test the number of contexts does not really make a difference, and also not for that many streams in the real test-case.

The waiting does not reduce CPU usage a lot in this benchmark, but on the real test-case it does. The reason is most likely that this benchmark basically sends all packets at once, then waits for the remaining time, then sends the next packets.

Take these numbers with caution, the real test-case in Mathieu’s blog post will show the improvements in the bigger picture, where it was generally a quarter of CPU usage and almost perfect parallelization when increasing the number of contexts.

Conclusion

Generally this was a fun exercise and we’re quite happy with the results, especially the real results. It took me some time to understand how tokio works internally so that I can implement all kinds of customizations on top of it, but for normal usage of tokio that should not be required and the overall design makes a lot of sense to me, as well as the way how futures are implemented in Rust. It requires some learning and understanding how exactly the API can be used and behaves, but once that point is reached it seems like a very productive and performant solution for asynchronous IO. And modelling asynchronous IO problems based on the Rust-style futures seems a nice and intuitive fit.

The performance measurements also showed that GStreamer’s default usage of threads is not always optimal, and a model like in upipe or pipewire (or rather SPA) can provide better performance. But as this also shows, it is possible to implement something like this on top of GStreamer and for the common case, using threads like in GStreamer reduces the cognitive load on the developer a lot.

For a future version of GStreamer, I don’t think we should make the threading “manual” like in these two other projects, but instead provide some API additions that make it nicer to implement thread-sharing elements and to add ways in the GStreamer core to make streaming threads non-blocking. All this can be implemented already, but it could be nicer.

All this “only” improved the number of threads, and thus the threading and context switching overhead. Many other optimizations in other areas are still possible on top of this, for example optimizing receive performance and reducing the number of memory copies inside the pipeline even further. If that’s something you would be interested in, feel free to get in touch.

And with that: Read Mathieu’s upcoming blog posts about the other parts, RTP jitterbuffer / RTCP timer thread sharing, and no allocations, and the full benchmark.