Multi-threaded raw video conversion and scaling in GStreamer

Another new feature that landed in GStreamer a while ago, and is included in the 1.12 release, is multi-threaded raw video conversion and scaling. The short story is that it led to, for example, a 3.2x speed-up when converting 1080p video to 4k with 4 cores.

I had a few cases where a single core could no longer do rescaling in real time, even on a quite fast machine. One of the cases was 60fps 4k video in the v210 (10 bit YUV) color format, which is a lot of bytes per second in a not very processing-friendly format. GStreamer’s video converter and scaler is already quite optimized and uses SIMD instructions like SSE or Neon, so there was not much potential for further optimization in that direction.
However, basically every machine nowadays has multiple CPU cores that could be used, and raw video conversion/scaling is an almost perfectly parallelizable problem. With the way the conversion code was already written, multi-threading was relatively easy to add.
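To put a number on "a lot of bytes per second", here is a small back-of-the-envelope calculation in Python. It assumes that 4k means 3840x2160, and it relies on the v210 layout of 6 pixels packed into 16 bytes, with every row padded to a multiple of 48 pixels (128 bytes). The function names are made up for this example, and the figure covers reading the input only.

```python
def v210_stride_bytes(width: int) -> int:
    """Bytes per row of v210: rows are padded to groups of 48 pixels (128 bytes)."""
    return ((width + 47) // 48) * 128


def v210_bytes_per_second(width: int, height: int, fps: int) -> int:
    """Raw data rate of a v210 stream, counting row padding."""
    return v210_stride_bytes(width) * height * fps


# Assumption: "4k" is 3840x2160 here.
rate = v210_bytes_per_second(3840, 2160, 60)
print(f"{rate / 1e9:.2f} GB/s")  # prints "1.33 GB/s"
```

The converted output has to be written as well, so the memory traffic of the conversion is higher than this input-only figure.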

The way it works now is similar to the processing model of OpenMP or Rayon. The whole work is divided into smaller, equal sub-problems that are handled in parallel; the caller then waits until all parts are done and the results are combined. In our specific case that means that each plane of the video frame is cut into 2, 4, or more slices of full rows, which are then converted separately. The “combining” step does not exist here: every sub-conversion writes directly to the correct place in the output.
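The slicing model can be sketched in a few lines of Python. This is only an illustration of the structure, not the actual GStreamer code: the function names are invented, the per-row "conversion" is a stand-in that inverts 8-bit samples, and Python threads will not give the real speed-up here because of the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor


def split_rows(height, n_slices):
    """Return (start, end) row ranges that cover [0, height) as evenly as possible."""
    base, extra = divmod(height, n_slices)
    ranges, start = [], 0
    for i in range(n_slices):
        end = start + base + (1 if i < extra else 0)
        if end > start:  # drop empty slices when there are more threads than rows
            ranges.append((start, end))
        start = end
    return ranges


def convert_row(row):
    """Stand-in for a real per-row conversion: invert every 8-bit sample."""
    return bytes(255 - b for b in row)


def convert_plane_parallel(src_rows, n_threads):
    """Convert one plane, one slice of full rows per worker."""
    dst_rows = [None] * len(src_rows)

    def work(bounds):
        start, end = bounds
        for y in range(start, end):
            # Each slice writes straight into its own rows of the output,
            # so no separate combining step is needed afterwards.
            dst_rows[y] = convert_row(src_rows[y])

    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        # list() waits for all slices and re-raises any worker exception.
        list(pool.map(work, split_rows(len(src_rows), n_threads)))
    return dst_rows
```

Because the slices never overlap, the workers need no locking on the output; the only synchronization point is waiting for all of them to finish.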

As a small helper object for this kind of processing model, I wrote GstParallelizedTaskRunner, which might also be useful for other pieces of code that want to do the same.

In the end it was not much work, but the results were satisfying. For example, the conversion of 1080p to 4k video in the v210 color format with 4 threads gave a speedup of 3.2x. At that point it looked like the main bottleneck was memory bandwidth, but I didn’t look closer as this is already more than enough for the use cases I was interested in.

Rendering HTML5 video in Servo with GStreamer

At the Web Engines Hackfest in A Coruña at the beginning of October 2017, I was working on adding some proof-of-concept code to Servo to render HTML5 videos with GStreamer. For the impatient, the results can be seen in this video here

And the code can be found here and here.

Details

Servo is Mozilla’s experimental browser engine written in Rust, optimized for high-performance, parallelized rendering. Some parts of Servo are being merged into Firefox as part of Project Quantum, and already provide a lot of performance and stability improvements there.

During the hackfest I actually spent most of the time trying to wrap my head around the huge Servo codebase. It seems very well-structured and designed, exactly what you would expect when such a project is started from scratch by a company that already has decades of experience writing browser engines. Having also worked on WebKit in the past, I would say that you can see the difference between a legacy codebase from the end of the 90s and something written in a modern language with modern software engineering practices.

As for the actual implementation of HTML5 video rendering via GStreamer, I built on top of the initial implementation that Philippe Normand had started earlier. That one was rendering the video in a separate window though, and did not work with the latest version of Servo anymore. I cleaned it up and made it work again (probably the best task you can do to learn a new codebase), and then added support for actually rendering the video inside the web view.

This required quite a few additions on the Servo side, some of which are probably more hacks than anything else, but on the GStreamer side it was extremely simple. In Servo all the infrastructure for media rendering is currently still missing, while GStreamer has more than a decade of polishing behind it to make integration into other software as easy as possible.

All the GStreamer code was written with the GStreamer Rust bindings, containing not a single line of unsafe code.

As you can see from the above video, the results work quite well already. Media controls or anything more fancy are not working though. Also, rendering is currently done completely in software, and an RGBA frame is then uploaded via OpenGL to the GPU for rendering. However, hardware codecs can already be used just fine, and basically every media format out there is supported.

Future

While this all might sound great, unfortunately Mozilla’s plans for media support in Servo are different. They’re planning to use the C++ Firefox/Gecko media backend instead of GStreamer. Best to ask them for reasons, I would probably not repeat them correctly.

Nonetheless, I’ll try to keep the changes updated with the latest Servo and, once they add more things for media support themselves, add the corresponding GStreamer implementations in my branch. It still provides value, both by showing that GStreamer is very well capable of handling web use cases (which it already showed in WebKit) and by being a possibly better choice for people trying to use Servo on embedded systems or with hardware codecs in general. But as I’ll have to work based on what they do, I’m not going to add anything fundamentally new myself at this point, as I would have to rewrite it around whatever they decide for the implementation anyway.

Also, once that part is there, I would add support for having GStreamer render directly to an OpenGL texture, which would allow direct rendering with hardware codecs to the screen without having the CPU worry about all the raw video data.

But for now, it’s waiting until they catch up with the Firefox/Gecko media backend.

DASH trick-mode playback in GStreamer: Fast-forward/rewind without saturating your network and CPU

GStreamer now has support for I-frame-only (aka keyframe) trick mode playback of DASH streams. It works only on DASH streams with ISOBMFF (aka MP4) fragments, and only if these contain all the required information. This is something I have wanted to blog about for many months, and it’s even included in the GStreamer 1.10 release already.

When trying to play back a DASH stream at rates that are much higher than real-time (say 32x), or playing the streams in reverse, you can easily run into various problems. This was already supported by GStreamer in older versions, for DASH streams as well as local files or HLS streams, but it’s far from ideal. What would happen is that you usually run out of available network bandwidth (you need to be able to download the stream 32x faster than usual), or out of CPU/GPU resources (it needs to be decoded 32x faster than usual), and even if all that works, there’s no point in displaying 960 (30fps at 32x) frames per second.

To get around that, GStreamer 1.10 can now (if explicitly requested with GST_SEEK_FLAG_TRICKMODE_KEY_UNITS) download and decode only I-frames. Depending on the distance between I-frames in the stream and the selected playback speed, this looks more or less smooth. Also depending on that, this might still yield too many frames to download or decode in real-time, so GStreamer also measures the distance between I-frames, how fast data can be downloaded, and whether decoders and sinks can keep up, in order to decide whether to skip over a couple of I-frames and maybe only download every third I-frame.
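To give an idea of the kind of decision involved, here is a minimal sketch in Python. The function name, its parameters and the formula are assumptions made up for this illustration; this is not the heuristic that GStreamer actually implements.

```python
import math


def keyframe_stride(keyframe_interval_s, rate, cost_per_keyframe_s):
    """Hypothetical heuristic: return N, meaning "use every N-th I-frame".

    keyframe_interval_s: distance between I-frames in stream time (seconds)
    rate: playback rate, negative for reverse playback
    cost_per_keyframe_s: measured wall-clock time to download and decode one I-frame
    """
    if rate == 0:
        raise ValueError("rate must be non-zero")
    # Wall-clock time available per I-frame at this playback rate.
    budget_s = keyframe_interval_s / abs(rate)
    # Skip enough I-frames so that each one we keep fits into its budget.
    return max(1, math.ceil(cost_per_keyframe_s / budget_s))


# I-frames every 2 s, 32x playback, 150 ms to fetch and decode each one.
print(keyframe_stride(2.0, 32.0, 0.15))  # prints 3
```

With I-frames 2 seconds apart and 32x playback there are only 62.5 ms of wall-clock time per I-frame, so at 150 ms of work per frame this sketch ends up keeping every third one.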

If you want to test this, grab the playback-test from GStreamer, select the trickmode key-units mode, and seek in a DASH stream while providing a higher positive or negative (reverse) playback rate.

Let us know if you run into any problems with any specific streams!

Short Implementation Overview

From an implementation point of view this works by having the DASH element in GStreamer (dashdemux) not only download the ISOBMFF fragments but also parse the headers of each to get the positions and distances of the I-frames in the fragment. Based on that it then decides which ones to download or whether to skip ahead one or more fragments. The ISOBMFF headers are then passed to the MP4 demuxer (qtdemux), followed by discontinuous buffers that only contain the actual I-frames and nothing else. While this sounds rather simple from a high-level point of view, getting this all right in the details was the result of a couple of months of work by Edward Hervey and myself.
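For readers who have never looked inside an ISOBMFF file, the following Python sketch shows the generic box layout that any such header parsing starts from: a 32-bit big-endian size, a four-character type, and optionally a 64-bit size. It is not the dashdemux code (which is written in C), it does not interpret any specific box, and the function names and the synthetic test fragment are invented for this example.

```python
import struct


def iter_boxes(data, offset=0, end=None):
    """Yield (type, payload_start, box_end) for each box in data[offset:end]."""
    end = len(data) if end is None else end
    while offset + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, offset)
        header = 8
        if size == 1:
            # A size of 1 means a 64-bit "largesize" follows the type.
            if offset + 16 > end:
                break
            (size,) = struct.unpack_from(">Q", data, offset + 8)
            header = 16
        elif size == 0:
            # A size of 0 means the box extends to the end of the data.
            size = end - offset
        if size < header or offset + size > end:
            break  # truncated or malformed box, stop here
        yield box_type.decode("ascii", "replace"), offset + header, offset + size
        offset += size


def make_box(box_type, payload=b""):
    """Build a box with a 32-bit size, only used for the synthetic example below."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Synthetic fragment: a "moof" box containing an "mfhd" box, followed by "mdat".
fragment = make_box(b"moof", make_box(b"mfhd", b"\x00" * 8)) + make_box(b"mdat", b"\x00" * 32)
for box_type, start, stop in iter_boxes(fragment):
    print(box_type, start, stop)
```

Calling iter_boxes again with the payload range of a container box walks its children, which is how nested headers can be reached without touching the media data.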

Currently the heuristics for deciding which I-frames to download and how much to skip ahead are rather minimal, but they work fine in many situations already. A lot of tuning can still be done though, and some streams work less well than others, which can also be improved.