Rendering HTML5 video in Servo with GStreamer

At the Web Engines Hackfest in A Coruña at the beginning of October 2017, I was working on adding some proof-of-concept code to Servo to render HTML5 videos with GStreamer. For the impatient, the results can be seen in this video here.

And the code can be found here and here.

Details

Servo is Mozilla’s experimental browser engine written in Rust, optimized for high-performance, parallelized rendering. Some parts of Servo are being merged into Firefox as part of Project Quantum, and already provide a lot of performance and stability improvements there.

During the hackfest I actually spent most of the time trying to wrap my head around the huge Servo codebase. It seems very well structured and designed, exactly what you would expect when a company with decades of experience writing browser engines starts such a project from scratch. Having also worked on WebKit in the past, I would say that you can see the difference between a legacy codebase from the end of the 90s and something written in a modern language with modern software engineering practices.

On to the actual implementation of HTML5 video rendering via GStreamer: I started on top of the initial implementation that Philippe Normand had already done before. That one rendered the video in a separate window though, and did not work with the latest version of Servo anymore. I cleaned it up and made it work again (probably the best task you can do to learn a new codebase), and then added support for actually rendering the video inside the web view.

This required quite a few additions on the Servo side, some of which are probably more hacks than anything else, but from the GStreamer side it was extremely simple. In Servo, all the infrastructure for media rendering is currently still missing, while GStreamer has had more than a decade of polishing to make integration into other software as easy as possible.

All the GStreamer code was written with the GStreamer Rust bindings, containing not a single line of unsafe code.

As you can see from the above video, the results already work quite well. Media controls or anything more fancy are not working though. Also, rendering is currently done completely in software, and an RGBA frame is then uploaded via OpenGL to the GPU for display. However, hardware codecs can already be used just fine, and basically every media format out there is supported.
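To give an idea of what the GStreamer side of such an integration can look like, here is a minimal, hypothetical sketch with the Rust bindings that pulls decoded RGBA frames out of playbin via an appsink (a recent version of the gstreamer and gstreamer-app crates is assumed, and the URI is just a placeholder). It is not the actual Servo integration code; there the frames end up in Servo’s renderer instead of just being counted.

```rust
use gstreamer as gst;
use gstreamer_app as gst_app;
use gst::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    gst::init()?;

    // playbin takes care of everything from downloading to decoding
    let playbin = gst::ElementFactory::make("playbin").build()?;
    playbin.set_property("uri", "https://example.com/video.webm"); // placeholder URI

    // An appsink restricted to RGBA hands us raw video frames in memory
    let appsink = gst::ElementFactory::make("appsink")
        .build()?
        .dynamic_cast::<gst_app::AppSink>()
        .unwrap();
    appsink.set_caps(Some(
        &gst::Caps::builder("video/x-raw").field("format", "RGBA").build(),
    ));
    playbin.set_property("video-sink", &appsink);

    playbin.set_state(gst::State::Playing)?;

    // Each sample contains one decoded frame; a real renderer would upload
    // these bytes to an OpenGL texture instead of only printing the size.
    while let Ok(sample) = appsink.pull_sample() {
        let buffer = sample.buffer().expect("sample without buffer");
        let map = buffer.map_readable()?;
        println!("got an RGBA frame of {} bytes", map.size());
    }

    playbin.set_state(gst::State::Null)?;
    Ok(())
}
```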

Future

While this all might sound great, unfortunately Mozilla’s plans for media support in Servo are different. They’re planning to use the C++ Firefox/Gecko media backend instead of GStreamer. Best to ask them for the reasons; I would probably not repeat them correctly.

Nonetheless, I’ll try to keep my changes updated with the latest Servo, and once they add more infrastructure for media support themselves, I’ll add the corresponding GStreamer implementations in my branch. This still provides value by showing that GStreamer is very well capable of handling web use cases (which it already showed in WebKit), and by possibly being a better choice for people trying to use Servo on embedded systems or with hardware codecs in general. But as I’ll have to work based on what they do, I’m not going to add anything fundamentally new myself at this point, as I would have to rewrite it around whatever they decide for their implementation anyway.

Also, once that part is there, the next step would be to have GStreamer render directly to an OpenGL texture, which would allow hardware codecs to render straight to the screen without the CPU ever having to touch the raw video data.

But for now, it’s waiting until they catch up with the Firefox/Gecko media backend.

DASH trick-mode playback in GStreamer: Fast-forward/rewind without saturating your network and CPU

GStreamer now has support for I-frame-only (aka keyframe) trick mode playback of DASH streams. It only works on DASH streams with ISOBMFF (aka MP4) fragments, and only if these contain all the required information. This is something I have wanted to blog about for many months, and it is already included in the GStreamer 1.10 release.

When trying to play back a DASH stream at rates much higher than real-time (say 32x), or playing the stream in reverse, you can easily run into various problems. This was already supported by GStreamer in older versions, for DASH streams as well as local files and HLS streams, but it was far from ideal. What would happen is that you usually run out of available network bandwidth (you need to be able to download the stream 32x faster than usual), or out of CPU/GPU resources (it needs to be decoded 32x faster than usual), and even if all that works, there’s no point in displaying 960 frames per second (30fps at 32x).

To get around that, GStreamer 1.10 can now (if explicitly requested with GST_SEEK_FLAG_TRICKMODE_KEY_UNITS) download and decode only the I-frames. Depending on the distance between I-frames in the stream and the selected playback speed, this looks more or less smooth. Also depending on that, this might still result in too many frames to download or decode in real time, so GStreamer also measures the distance between I-frames, how fast data can be downloaded and whether decoders and sinks can catch up, to decide whether to skip over a couple of I-frames and maybe only download every third one.
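From the application side such a seek looks roughly like the following sketch with the GStreamer Rust bindings (a recent crate version is assumed, and the pipeline can be any already-playing element, e.g. playbin on a DASH URI); the flags map to GST_SEEK_FLAG_TRICKMODE and GST_SEEK_FLAG_TRICKMODE_KEY_UNITS in the C API.

```rust
use gstreamer as gst;
use gst::prelude::*;

// Switch an already-playing pipeline to 16x keyframe-only playback,
// continuing from the current position.
fn keyframe_fast_forward(pipeline: &gst::Element) -> Result<(), Box<dyn std::error::Error>> {
    // Current playback position; fall back to the start if it is unknown
    let position = pipeline
        .query_position::<gst::ClockTime>()
        .unwrap_or(gst::ClockTime::ZERO);

    pipeline.seek(
        16.0, // playback rate; a negative rate would play backwards
        gst::SeekFlags::FLUSH
            | gst::SeekFlags::TRICKMODE
            | gst::SeekFlags::TRICKMODE_KEY_UNITS,
        gst::SeekType::Set,
        position,
        // No stop position: this value is ignored because of SeekType::None
        gst::SeekType::None,
        gst::ClockTime::ZERO,
    )?;

    Ok(())
}
```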

If you want to test this, grab the playback-test from GStreamer, select the trickmode key-units mode, and seek in a DASH stream while providing a higher positive or negative (reverse) playback rate.

Let us know if you run into any problems with any specific streams!

Short Implementation Overview

From an implementation point of view this works by having the DASH element in GStreamer (dashdemux) not only download the ISOBMFF fragments but also parse the headers of each one to get the positions and distances of the I-frames in the fragment. Based on that it then decides which ones to download or whether to skip ahead one or more fragments. The ISOBMFF headers are then passed to the MP4 demuxer (qtdemux), followed by discontinuous buffers that only contain the actual I-frames and nothing else. While this sounds rather simple from a high-level point of view, getting all the details right was the result of a couple of months of work by Edward Hervey and myself.

Currently the heuristics for deciding which I-frames to download and how far to skip ahead are rather minimal, but they already work fine in many situations. A lot of tuning can still be done though, and some streams work less well than others, which can also be improved.

Getting RSS feeds for news websites that don’t provide them

This is about a fun little project I wrote a few months ago already, completely unrelated to the things I usually write about, but I thought it might be useful for others too.

When I moved to Greece last year, I had the problem that there were not many news websites that provided local news in English and actually had an RSS feed. And having local news, next to the global ones about what happens all over the world, seems like a good way to know what happens close around you.
The only useful website I found was Ekathimerini. There were two others that seemed useful to have, The Press Project (a crowd-funded project) and To Vima, both of which don’t have an RSS feed (or at least not for their English version). Of course the real solution to this problem is to learn Greek, which I’m doing now, but it’s going to take a while until I’m able to understand news articles without too much effort.

So what did I do? I wrote a small web service that parses the HTML of those websites and returns an RSS feed based on that, together with having it update regularly in the background and keeping some history of items. You can find it here: html-rss-proxy. The resulting RSS feeds seem to work very well in Liferea and Newsblur at least.

Since then it has also been extended with another news website, and generally it’s rather simple to add new ones to the code. You just need to figure out how to extract the relevant information from the HTML of the website and convert that into code like here, and then wrap it up in the general interface that the other parts of the code expect, like here.

If you add some website yourself, feel free to send me a pull request and I’ll merge it!

Technical bits

On the technical side, this seems to be one of the most stable pieces of software I have ever written. It has never crashed or otherwise failed since I started running it, and fortunately I also haven’t had to update the HTML parsing code yet because of website changes.

It’s written in Haskell, using the Scotty web framework, the Cereal serialization library for storing the history of past articles, http-conduit for fetching the websites, and html-conduit for parsing the HTML. Overall a very pleasant experience, thanks to the language being very convenient to write in and preventing most silly mistakes at compile time, and to the high quality of the libraries.

The only part I’m not yet too happy about is the actual HTML parsing; it seems too verbose and redundant. I might port it to another library at a later time, maybe xml-html-conduit-lens.

Update

After saying that I don’t like the HTML parsing, I actually reimplemented it around xml-html-conduit-lens now. The result is much shorter code that resembles the structure of the HTML, as you can see here for example.

Considering that people always say lens is so complicated, and this is more than simple getters, I have to say it went rather painlessly. Only the compiler errors when the types don’t match are a bit tricky to understand at first.

Web Engines Hackfest 2014

During the last days I attended the Web Engines Hackfest 2014 in A Coruña, which was kindly hosted by Igalia in their office. This gave me some time to work on WebKit again, and especially to improve and clean up its GStreamer media backend, and even more importantly of course an opportunity to meet again all the great people working on it.

Apart from various smaller and bigger cleanups, and getting 12 patches merged (thanks to Philippe Normand and Gustavo Noronha reviewing everything immediately), I was working on improving the WebAudio AudioDestination implementation, which allows websites to programmatically generate audio via Javascript and output it. This should be in a much better state now.

But the biggest chunk of work was my attempt at a cleaner and actually working reimplementation of the Media Source Extensions. The Media Source Extensions basically allow a website to provide a container or elementary stream for e.g. the video tag via Javascript, and can for example be used to implement custom streaming protocols. This is still far from finished, but at least it already works on YouTube with their Javascript DASH player and should give a good starting point for finishing the implementation. I hope to have some more time to continue this work, but any help would be welcome.

Next to the Media Source Extensions, I also spent some time on reviewing a few GStreamer related patches, e.g. the WebAudio AudioSourceProvider implementation by Philippe, or his patch to use GStreamer’s OpenGL library directly in WebKit instead of reimplementing many pieces of it.

I also took a closer look at Servo, Mozilla’s research browser engine written in Rust. It looks like a very promising and well-designed project (both Servo and Rust, actually!). I’m sure I’ll play around with Rust in the near future, and will also try to make some time available to work a bit on Servo. Thanks also to Martin Robinson for answering all my questions about Servo and Rust.

And then there were of course lots of discussions about everything!

Good things are going to happen in the future, and WebKit still seems to be a very active project with enthusiastic people 🙂
I hope I’ll be able to visit next year’s Web Engines Hackfest again, it was a lot of fun.

HTTP Adaptive Streaming with GStreamer

Let’s talk a bit about HTTP Adaptive streaming and GStreamer, what it is and how it works. Especially the implementation in GStreamer is not exactly trivial and can be a bit confusing at first sight.

If you’re just interested in knowing whether GStreamer supports any HTTP adaptive streaming protocols, and which ones, you can stop after this paragraph: yes, there are currently elements for handling HLS, MPEG DASH and Microsoft SmoothStreaming.
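From the application’s point of view nothing special is needed to use them: you hand the Manifest URI to playbin and everything described below happens behind the scenes. A minimal sketch with the GStreamer Rust bindings (a recent crate version is assumed, and the URI is only a placeholder) could look like this:

```rust
use gstreamer as gst;
use gst::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    gst::init()?;

    // The URI points at the Manifest (e.g. an HLS .m3u8 or DASH .mpd);
    // playbin/uridecodebin plug the matching adaptive streaming demuxer.
    let pipeline = gst::parse_launch(
        "playbin uri=https://example.com/stream/master.m3u8",
    )?;
    pipeline.set_state(gst::State::Playing)?;

    // Run until error or end of stream
    let bus = pipeline.bus().expect("pipeline without bus");
    for msg in bus.iter_timed(gst::ClockTime::NONE) {
        match msg.view() {
            gst::MessageView::Eos(..) => break,
            gst::MessageView::Error(err) => {
                eprintln!("error: {}", err.error());
                break;
            }
            _ => (),
        }
    }

    pipeline.set_state(gst::State::Null)?;
    Ok(())
}
```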

What is it?

So what exactly do I mean when talking about HTTP Adaptive Streaming? There are a few streaming protocols out there that basically work the following way:

  1. You download a Manifest file with metadata about the stream via HTTP. This Manifest contains the location of the actual media, possibly in multiple different bitrates and/or resolutions and/or languages and/or separate audio or subtitle streams or any other different type of variant. It might also contain the location of additional metadata or sub-Manifests that provide more information about a specific variant.
  2. The actual media is also downloaded via HTTP, split into fragments of a specific duration, usually 2 to 10 seconds. Depending on the actual protocol these separate fragments can be played standalone or need additional information from the Manifest. The actual media is usually in a container format like MPEG TS or a variant of ISO MP4.

Examples of this are HLS, MPEG DASH and Microsoft SmoothStreaming.

This kind of protocol is used for both video-on-demand and “live” streaming, but obviously can’t provide low-latency live streams. When used for live streaming it is usually required to add more than one fragment of latency: the client downloads one or more fragments, reloads the playlist to get the location of the next fragments, and then downloads those.

Why?

You might wonder why one would like to implement such a complicated protocol on top of HTTP that can’t even provide low-latency live streaming and why one would choose it over other streaming protocols like RTSP, or any RTP based protocol, or simply serving a stream over HTTP at a single location.

The main reason is that these other approaches can’t make use of the HTTP-based CDNs that are deployed on thousands of servers all around the world, nor of their caching mechanisms. One would need to deploy a specialised CDN just for this other kind of streaming protocol.

A secondary reason is that UDP-based protocols like RTP, and really anything else not based on HTTP, can be complicated to deploy because of all the middleboxes (e.g. firewalls, NATs, proxies) that exist in almost any network out there. HTTP(S) generally works everywhere; in the end it’s the only protocol most people consciously use or know about.

And yet another reason is that splitting the streams into fragments makes it trivial to implement switching between bitrates or any other stream alternatives at fragment boundaries.

GStreamer client-side design

In GStreamer the three protocols mentioned above are implemented, and after trying some different approaches for implementing this kind of protocol, all of them converged on a single design. This is what I’m going to describe now.

Source

The naive approach would consider implementing all this as a single source element. Because that’s how network protocols are usually implemented in GStreamer, right?

While this seems to make sense, there’s one problem with it: there is no separate URI scheme defined for such HTTP adaptive streams. They all use normal HTTP URIs that point to the Manifest file. And GStreamer chooses the source element to use based on the URI scheme alone, and especially not based on the data that would be received from that URI.

So what are we left with? HTTP URIs are handled by the standard GStreamer elements for HTTP, and using such a source element gives us a stream containing the Manifest. To make any sense of this collection of bytes and detect its type, we additionally have to implement a typefinder that determines the media type by looking at the data and actually tells us that this is e.g. an MPEG DASH Manifest.
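Just to illustrate the idea (this is not the actual typefinder shipped with GStreamer, and the exact registration API of the Rust bindings is assumed from memory here): such a typefinder peeks at the first bytes of the stream and, if it recognizes them, suggests caps like application/x-hls or application/dash+xml, which decodebin then uses to plug the right demuxer.

```rust
use gstreamer as gst;

// Hypothetical typefinder sketch: an HLS Manifest always starts with "#EXTM3U".
// The real typefinders shipped with GStreamer are more careful than this.
fn register_hls_typefind() -> Result<(), gst::glib::BoolError> {
    gst::TypeFind::register(
        None,                 // not registered as part of a plugin in this sketch
        "sketch-hls",
        gst::Rank::Secondary,
        Some("m3u8"),         // file extension hint
        Some(&gst::Caps::builder("application/x-hls").build()),
        |typefind| {
            let is_hls = typefind
                .peek(0, 7)
                .map(|data| data.starts_with(b"#EXTM3U"))
                .unwrap_or(false);
            if is_hls {
                typefind.suggest(
                    gst::TypeFindProbability::Likely,
                    &gst::Caps::builder("application/x-hls").build(),
                );
            }
        },
    )
}
```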

Demuxer

This Manifest stream is usually rather short and followed by the EOS event like every other stream. We now need another element that does something with this Manifest and implements the specific HTTP adaptive streaming protocol. In GStreamer terminology this acts like a demuxer: it has one stream of input and outputs one or more streams based on that. Strictly speaking it’s not really a demuxer though, as it does not demultiplex the input stream into the separate streams, but that’s just an internal detail in the end.

This demuxer now has to wait until it has received the EOS event of the Manifest, then parse it and do whatever the protocol defines. Specifically, it starts a second thread that handles the control flow of the protocol. This thread downloads any additional resources specified in the Manifest, decides which media fragment(s) are supposed to be downloaded next, starts downloading them and makes sure they leave the demuxer on the source pads in a meaningful way.

The demuxer also has to handle the SEEK event, and based on the time specified in there jump to a different media fragment.

Examples of such demuxers are hlsdemux, dashdemux and mssdemux.

Downloading of data

For downloading additional resources listed in the Manifest that are not actual media fragments (sub-Manifests, reloading the Manifest, headers, encryption keys, …) there is a helper object called GstURIDownloader, which basically provides a blocking (but cancellable) API like this: GstBuffer *buffer = fetch_uri(uri)
Internally it creates a GStreamer source element based on the URI, starts it with that URI, collects all buffers and then returns them as a single buffer.
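GstURIDownloader itself is implemented in C as part of the adaptive demuxer library, but the same idea can be sketched with the Rust bindings (recent crate versions assumed; this is purely an illustration, not the actual implementation): build a tiny source ! appsink pipeline, run it until EOS and concatenate all received buffers.

```rust
use gstreamer as gst;
use gstreamer_app as gst_app;
use gst::prelude::*;

// Hypothetical blocking fetch helper in the spirit of GstURIDownloader.
// Assumes gst::init() was already called.
fn fetch_uri(uri: &str) -> Result<Vec<u8>, Box<dyn std::error::Error>> {
    // Whatever source element handles this URI scheme (e.g. souphttpsrc)
    let src = gst::Element::make_from_uri(gst::URIType::Src, uri, None)?;
    let sink = gst::ElementFactory::make("appsink")
        .build()?
        .dynamic_cast::<gst_app::AppSink>()
        .unwrap();
    sink.set_property("sync", false); // download as fast as possible

    let pipeline = gst::Pipeline::new(None);
    pipeline.add(&src)?;
    pipeline.add(&sink)?;
    src.link(&sink)?;

    pipeline.set_state(gst::State::Playing)?;

    // Collect buffers until the source signals EOS
    let mut data = Vec::new();
    while let Ok(sample) = sink.pull_sample() {
        if let Some(buffer) = sample.buffer() {
            let map = buffer.map_readable()?;
            data.extend_from_slice(map.as_slice());
        }
    }

    pipeline.set_state(gst::State::Null)?;
    Ok(data)
}
```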

Initially this was also used to download the media fragments, but it was noticed that this is not ideal. Mostly because the demuxer has to download a complete fragment before it can be passed downstream, and such huge buffers being pushed downstream cause any buffering elements to mostly jump between 0% and 100% all the time.

Instead, thanks to the recent work of Thiago Santos, the media fragments are now downloaded differently. The demuxer element is actually a GstBin now and it has a child element that is connected to the source pad: a source element for downloading the media fragments.

This allows the data to be forwarded immediately as it is downloaded, and allows downstream elements to handle it one block after another instead of getting it in multi-second chunks. In particular it also means that playback can start much sooner, as you don’t have to wait for a complete fragment but can already fill up your buffers with a partial one.

Internally this is a bit tricky, as the demuxer has to catch EOS events from the source (we don’t want to stop streaming just because a fragment is done), catch errors and other messages (maybe instead of forwarding the error we want to retry downloading this media fragment), and switch between different source elements (or just switch the URI of the existing one) during streaming once a media fragment is finished. I won’t describe this here; best to look at the code for that.

Switching between alternatives

Now what happens if the demuxer wants to switch between different alternative streams, e.g. because it has noticed that it can barely keep up with downloading a high-bitrate stream and wants to switch to a lower bitrate, or even to an alternative that is audio-only? Here the demuxer has to select the next media fragment of the chosen alternative stream and forward that downstream.

But it’s of course not that simple, because currently we don’t support renegotiation of decoder pipelines. It could easily happen that codecs change between different alternatives, or that the topology changes (e.g. the video stream disappears). Note that not supporting automatic renegotiation for cases like this in decodebin and related elements is not a design deficit of GStreamer but just a limitation of the current implementation.

There is a similar case that is already handled in GStreamer: chained Ogg files (i.e. different Ogg files concatenated with each other). That should also behave like a single stream, but codecs could change or the topology could change. Here the demuxer just has to add new pads for the following streams after having emitted the no-more-pads signal, and then remove all the old pads. decodebin and playbin then first drain the old streams and then handle the new ones, while making sure their times align perfectly with the old ones.

(uri)decodebin and playbin

If we now look at the complete (source) pipeline that is created by uridecodebin for this case we come up with the following (simplified):

HTTP Adaptive Streaming Pipeline

We have a source element for the Manifest, which is added by uridecodebin. uridecodebin also uses a typefinder to detect that this is actually an HTTP adaptive streaming Manifest. This is then connected to a decodebin instance, which is configured to do buffering after the demuxers with the multiqueue element. For normal HTTP stream buffering, the buffering is instead done between the source element and decodebin with the queue2 element.

decodebin then selects an HTTP adaptive streaming protocol demuxer for the current protocol, waits until it has decided what to output, and then connects it to a multiqueue element like every other demuxer. However it uses different buffering settings, as this demuxer is going to behave a bit differently.

Followed by that is the actual demuxer for the media fragments, which could for example be tsdemux if they’re MPEG TS media fragments. If an elementary stream is used for the media fragments, decodebin will instead insert a parser for that elementary stream, which chunks the stream into codec frames and also puts timing information on them.

Afterwards there comes another multiqueue element, which would not be there after a parser if no HTTP adaptive streaming demuxer were used. decodebin configures this multiqueue element to handle buffering and to send buffering messages to the application, allowing it to pause playback until the buffer is full.

In playbin and playsink no special handling is necessary. Now let’s take a look at a complete pipeline graph for an HLS stream that contains one video and one audio stream inside MPEG TS containers. Note: this is huge, like all playbin pipelines. Everything on the right half is for converting raw audio/video and postprocessing it, and then outputting it on the display and speakers.

playbin: HLS with audio and video

Keep in mind that the audio and video streams, and also subtitle streams, could also be in separate media fragments. In that case the HTTP adaptive streaming demuxer would have multiple source pads, each of them followed by a demuxer or parser and a multiqueue. And decodebin would aggregate the buffering messages of all the multiqueues to give the application a consistent view of the buffering status.

Possible optimisations

Now all of this sounds rather complex, and probably slow compared to the naive approach of just implementing everything in a single source element without so many different elements involved. As time has shown, it actually is not slow at all, and if you consider a single-element design for such a protocol you will notice that all the components that are separate elements here will also show up in your design. But as GStreamer is like Lego and we like having lots of generic components that are put together to build a more complex whole, the current design follows the idea of GStreamer in a more consistent way. In particular it is possible to reuse lots of existing elements, and to transparently replace individual elements with custom implementations thanks to GStreamer’s autoplugging mechanisms.

So let’s talk about a few possible optimisations here, some of which are implemented in the default GStreamer elements and should be kept in mind when replacing elements, and some of which could be implemented on top of the existing elements.

Keep-alive Connections and other HTTP features

All these HTTP adaptive streaming protocols require making lots of HTTP requests, which traditionally meant creating a new TCP connection for every single request. This involves quite some overhead and increases latency because of TCP’s handshake protocol, and even more so if you use HTTPS and also have to handle the SSL/TLS handshake on top of that. We’re talking about multiple 100ms up to seconds per connection setup here. HTTP 1.1 allows connections to be kept alive for some period of time and reused for multiple HTTP requests. Browsers have been using this for a long time already to efficiently show you websites composed of many different files with low latency.

Also, previously all GStreamer HTTP source elements closed their connection(s) when going back to the READY state, and setting them to the READY state is required to switch URIs. This basically meant that although HTTP 1.1 allows reusing connections for multiple requests, we were not able to make use of it. Now the souphttpsrc HTTP source element keeps connections open until it goes back to the NULL state if the keep-alive property is set to TRUE, and other HTTP source elements could implement this too. The HTTP adaptive streaming demuxers make use of this “implicit interface” to reuse connections for as many requests as possible.

Compression

HTTP also defines a way for clients and servers to negotiate the encodings that both support. In particular this allows both sides to agree that the actual data (the response body) should be compressed with gzip (or another method) instead of being transferred as plaintext. For media fragments this is not very useful, but for Manifests it can be very useful, especially in the case of HLS, where the Manifest is a plaintext ASCII file that can easily be a few hundred kB in size.

The HTTP adaptive streaming demuxers are using another “implicit interface” on the HTTP source element to enable compression (if supported by the server) for Manifest files. This is also currently only handled in the souphttpsrc element.

Other minor features

HTTP defines many other headers, and the HTTP adaptive streaming demuxers make use of two more if supported by the HTTP source element. The “implicit interface” for setting more headers in the HTTP request is the extra-headers property, which can be set to arbitrary headers.

The HTTP adaptive streaming demuxers currently set the Referer header to the URI of the Manifest file, which is not mandated by any standard to my knowledge, but there are streams out there that actually refuse to serve media fragments without it. And then the demuxers also set the Cache-Control header to a) tell caches/proxies to update their internal copy of the Manifest file when redownloading it and b) tell caches/proxies that some requests must not be cached (if the Manifest indicates so). The latter can of course be ignored by the caches.

If you implement your own HTTP source element it is probably a good idea to copy the interface of the souphttpsrc element at least for these properties.
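To make this a bit more concrete, here is a sketch of how these properties can be set on a souphttpsrc instance with the Rust bindings (a recent crate version is assumed, and the URIs are placeholders). keep-alive and extra-headers are the property names mentioned above; the compression property is called compress on souphttpsrc.

```rust
use gstreamer as gst;
use gst::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    gst::init()?;

    let src = gst::ElementFactory::make("souphttpsrc").build()?;

    // Placeholder Manifest URI
    let manifest_uri = "https://example.com/stream/manifest.mpd";
    src.set_property("location", manifest_uri);

    // Keep the TCP/TLS connection open and reuse it for multiple requests
    src.set_property("keep-alive", true);

    // Ask the server for a compressed response body; mostly useful for Manifests
    src.set_property("compress", true);

    // Extra request headers, e.g. a Referer pointing at the Manifest
    let headers = gst::Structure::builder("extra-headers")
        .field("Referer", manifest_uri)
        .build();
    src.set_property("extra-headers", headers);

    Ok(())
}
```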

Caching HTTP source

Another area that could easily be optimised is implementing a cache for the downloaded media fragments and also the Manifests. This is especially useful for video-on-demand streams, and even more so when the user likes to seek in the stream. Without any cache, all media fragments have to be downloaded again after seeking, even if the data at the seek position was already downloaded before.

A simple way for implementing this is a caching HTTP source element. This basically works like an HTTP cache/proxy like Squid, only one level higher. It behaves as if it is an HTTP source element, but actually it does magic inside.

From a conceptual point of view this caching HTTP source element would implement the GstURIHandler interface and handle the HTTP(S) protocols, and ideally also implement some of the properties of souphttpsrc as mentioned above. Internally it would actually be a GstBin that dynamically creates a pipeline for downloading a URI when transitioning from the READY to the PAUSED state. It could have the following internal configurations:

Caching HTTP Source

The only tricky bits here are proxying the different properties to the relevant internal elements, and error handling. You probably don’t want to stop the complete pipeline if for some reason writing to or reading from your cache file fails. For that you could catch the error messages and also intercept data flow (to ignore error flow returns from filesink), and then dynamically reconfigure the internal pipeline as if nothing had happened.

Based on the Cache-Control header you could implement different functionality, e.g. for refreshing data stored in your cache already. Based on the Referer header you could correlate URIs of media fragments to their corresponding Manifest URI. But how to actually implement the storage, cache invalidation and pruning is a standard problem of computer science and covered elsewhere already.

Creating Streams – The Server Side

And in the end some final words about the server side of things. How to create such HTTP adaptive streams with GStreamer and serve them to your users.

In general this is a relatively simple task with GStreamer and the standard tools and elements that are already available. We have muxers for all the relevant container formats (one for the MP4 variant used in MPEG DASH is available in Bugzilla), there is API to request keyframes at specific positions (so that you can actually start a new media fragment at that position) and the dynamic pipeline mechanisms allow for dynamically switching muxers to start a new media fragment.

On top of that, you only need to create the Manifest files and then serve all of this with some HTTP server, for which there are already more than enough implementations out there; and in the end you want to hide your server behind a CDN anyway.

Additionally there are the hlssink and dashsink elements, the latter currently only available in Bugzilla. These already implement the creation of the media fragments with the above-mentioned GStreamer tools and generate a Manifest file for you.
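As a rough sketch (again with the Rust bindings and gst::parse_launch; whether x264enc and hlssink are available depends on your GStreamer installation, and the encoder and sink settings are only placeholders), generating a simple HLS stream from a test source could look like this:

```rust
use gstreamer as gst;
use gst::prelude::*;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    gst::init()?;

    // Encode a live test pattern to H.264 inside MPEG TS fragments and let
    // hlssink write the fragments plus the playlist (Manifest) to disk.
    // A keyframe every 60 frames gives roughly 2 second fragments at 30fps.
    let pipeline = gst::parse_launch(
        "videotestsrc is-live=true ! video/x-raw,framerate=30/1 ! \
         x264enc key-int-max=60 ! mpegtsmux ! \
         hlssink location=/tmp/segment%05d.ts \
                 playlist-location=/tmp/playlist.m3u8 \
                 target-duration=2 max-files=10",
    )?;

    pipeline.set_state(gst::State::Playing)?;

    // Stream until an error occurs (clean shutdown would need extra handling)
    let bus = pipeline.bus().expect("pipeline without bus");
    for msg in bus.iter_timed(gst::ClockTime::NONE) {
        if let gst::MessageView::Error(err) = msg.view() {
            eprintln!("error: {}", err.error());
            break;
        }
    }

    pipeline.set_state(gst::State::Null)?;
    Ok(())
}
```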

And then there is the GStreamer Streaming Server, which also has support for HTTP adaptive streaming protocols and can be used to easily serve multiple streams. But that one deserves its own article.

FOMS Workshop 2013

This week and next week I’m going to be in San Francisco and will attend the Foundations of Open Media Standards and Software (FOMS) workshop 2013. Thanks to Silvia Pfeiffer, the other organisers and the sponsors of the event (Google and brightcove) for making this possible by sponsoring me.

Topics that are going to be discussed include WebRTC, Web multimedia in general, open codecs and related subjects, and the list of attendees seems very diverse to me. I expect lots of interesting discussions and a few interesting and hopefully productive days 🙂

If anybody else is in that area currently and wants to meet for a coffee, beer or some food on the 2-3 days before and after FOMS, please write me a mail or use some other communication channel 🙂