HTTP Adaptive Streaming with GStreamer

Let’s talk a bit about HTTP adaptive streaming and GStreamer: what it is and how it works. The implementation in GStreamer especially is not exactly trivial and can be a bit confusing at first sight.

If you’re just interested in knowing whether GStreamer supports any HTTP adaptive streaming protocols, and which ones, you can stop after this paragraph: yes, and there are currently elements for handling HLS, MPEG DASH and Microsoft SmoothStreaming.

What is it?

So what exactly do I mean when talking about HTTP adaptive streaming? There are a few streaming protocols out there that basically work the following way:

  1. You download a Manifest file with metadata about the stream via HTTP. This Manifest contains the location of the actual media, possibly in multiple different bitrates and/or resolutions and/or languages and/or separate audio or subtitle streams or any other different type of variant. It might also contain the location of additional metadata or sub-Manifests that provide more information about a specific variant.
  2. The actual media is also downloaded via HTTP and split into fragments of a specific duration, usually 2 to 10 seconds. Depending on the actual protocol these separate fragments can be played standalone or need additional information from the Manifest. The actual media is usually in a container format like MPEG TS or a variant of ISO MP4.

Examples of this are HLS, MPEG DASH and Microsoft SmoothStreaming.
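
To make this a bit more concrete, here is what such a Manifest can look like in the case of HLS (a shortened, hypothetical example with made-up URIs; MPEG DASH and SmoothStreaming use XML-based Manifests instead). First a master playlist that lists the available bitrate variants:

#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=1400000,RESOLUTION=1280x720
http://example.com/video/720p.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=500000,RESOLUTION=640x360
http://example.com/video/360p.m3u8

And then each variant playlist (the sub-Manifest) lists the actual media fragments:

#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:0
#EXTINF:10.0,
fragment0.ts
#EXTINF:10.0,
fragment1.ts
#EXT-X-ENDLIST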

These kinds of protocols are used for both video-on-demand and “live” streaming, but obviously can’t provide low-latency live streams. When used for live streaming it is usually required to add more than one fragment of latency: the client downloads one or more fragments, reloads the playlist to get the location of the next fragments and then downloads those.

Why?

You might wonder why one would like to implement such a complicated protocol on top of HTTP that can’t even provide low-latency live streaming and why one would choose it over other streaming protocols like RTSP, or any RTP based protocol, or simply serving a stream over HTTP at a single location.

The main reason for this is that these other approaches don’t allow usage of the HTTP-based CDNs that are deployed on thousands of servers all around the world, and don’t allow using their caching mechanisms. One would need to deploy a specialised CDN just for this other kind of streaming protocol.

A secondary reason is that especially UDP-based protocols like RTP, but also anything else not based on HTTP, can be a bit complicated to deploy because of all the middle boxes (e.g. firewalls, NATs, proxies) that exist in almost any network out there. HTTP(S) generally works everywhere; in the end it’s the only protocol most people consciously use or know about.

And yet another reason is that splitting the streams into fragments makes it trivial to implement switching between bitrates or any other stream alternatives at fragment boundaries.

GStreamer client-side design

In GStreamer the three protocols mentioned above are implemented, and after trying some different approaches for implementing this kind of protocol, all of them converged on a single design. This is what I’m going to describe now.

Source

The naive approach would consider implementing all this as a single source element. Because that’s how network protocols are usually implemented in GStreamer, right?

While this seems to make sense there’s one problem with it: there is no separate URI scheme defined for such HTTP adaptive streams. They all use normal HTTP URIs that point to the Manifest file. And GStreamer chooses the source element that should be used based on the URI scheme alone, not on the data that would be received from that URI.

So what are we left with? HTTP URIs are handled by the standard GStreamer elements for HTTP, and using such a source element gives us a stream containing the Manifest. To make any sense of this collection of bytes and detect its type, we additionally have to implement a typefinder that provides the media type based on looking at the data and actually tells us that this is e.g. an MPEG DASH Manifest.

Demuxer

This Manifest stream is usually rather short and followed by the EOS event like every other stream. We now need another element that does something with this Manifest and implements the specific HTTP adaptive streaming protocol. In GStreamer terminology this would act like a demuxer, it has one stream of input and outputs one or more streams based on that. Strictly speaking it’s not really a demuxer though, it does not demultiplex the input stream into the separate streams but that’s just an internal detail in the end.

This demuxer now has to wait until it has received the EOS event of the Manifest, then parse it and do whatever the protocol defines. Specifically, it starts a second thread that handles the control flow of the protocol. This thread downloads any additional resources specified in the Manifest, decides which media fragment(s) are supposed to be downloaded next, starts downloading them and makes sure they leave the demuxer on the source pads in a meaningful way.

The demuxer also has to handle the SEEK event, and based on the time specified in there jump to a different media fragment.

Examples of such demuxers are hlsdemux, dashdemux and mssdemux.

Downloading of data

For downloading additional resources listed in the Manifest that are not actual media fragments (sub-Manifests, reloading the Manifest, headers, encryption keys, …) there is a helper object called GstURIDownloader, which basically provides a blocking (but cancellable) API like this:

GstBuffer * buffer = fetch_uri (uri);

Internally it creates a GStreamer source element based on the URI, starts it with that URI, collects all buffers and then returns all of them as a single buffer.
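
Just as an illustration of the idea (this is not the actual GstURIDownloader code, and error handling and cancellation are omitted), such a blocking fetch could be approximated with a small source-plus-appsink pipeline:

#include <gst/gst.h>
#include <gst/app/gstappsink.h>

static GstBuffer *
fetch_uri (const gchar * uri)
{
  gchar *desc = g_strdup_printf ("souphttpsrc location=\"%s\" ! appsink name=sink sync=false", uri);
  GstElement *pipeline = gst_parse_launch (desc, NULL);
  GstElement *sink = gst_bin_get_by_name (GST_BIN (pipeline), "sink");
  GstBuffer *result = gst_buffer_new ();
  GstSample *sample;

  g_free (desc);
  gst_element_set_state (pipeline, GST_STATE_PLAYING);

  /* Pull samples until EOS and concatenate them into a single buffer */
  while ((sample = gst_app_sink_pull_sample (GST_APP_SINK (sink)))) {
    result = gst_buffer_append (result,
        gst_buffer_ref (gst_sample_get_buffer (sample)));
    gst_sample_unref (sample);
  }

  gst_element_set_state (pipeline, GST_STATE_NULL);
  gst_object_unref (sink);
  gst_object_unref (pipeline);

  return result;
}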

Initially this was also used to download media fragments, but it turned out not to be ideal: the demuxer has to download a complete fragment before it can be passed downstream, and the resulting huge buffers cause any buffering elements to mostly jump between 0% and 100% all the time.

Instead, thanks to the recent work of Thiago Santos, the media fragments are now downloaded differently. The demuxer element is actually a GstBin now and it has a child element that is connected to the source pad: a source element for downloading the media fragments.

This allows the data to be forwarded immediately as it is downloaded, and lets downstream elements handle it one block after another instead of getting it in multi-second chunks. In particular it also means that playback can start much sooner, as you don’t have to wait for a complete fragment but can already fill up your buffers with a partial one.

Internally this is a bit tricky as the demuxer will have to catch EOS events from the source (we don’t want to stop streaming just because a fragment is done), catch errors and other messages (maybe instead of forwarding the error we want to retry downloading this media fragment) and switch between different source elements (or just switch the URI of the existing one) during streaming once a media fragment is finished. I won’t describe this here, best to look at the code for that.
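
Just to illustrate the first of these points, the EOS handling boils down to a pad probe along these lines (a heavily simplified sketch; the real demuxers also handle errors, retries and URI switching here):

#include <gst/gst.h>

static GstPadProbeReturn
fragment_event_probe (GstPad * pad, GstPadProbeInfo * info, gpointer user_data)
{
  GstEvent *event = GST_PAD_PROBE_INFO_EVENT (info);

  if (GST_EVENT_TYPE (event) == GST_EVENT_EOS) {
    /* A single fragment finished downloading. Don't forward EOS downstream,
     * instead the next fragment download would be scheduled here. */
    return GST_PAD_PROBE_DROP;
  }

  return GST_PAD_PROBE_OK;
}

static void
watch_fragment_source (GstPad * internal_srcpad)
{
  /* internal_srcpad is the source pad of the internal download source */
  gst_pad_add_probe (internal_srcpad, GST_PAD_PROBE_TYPE_EVENT_DOWNSTREAM,
      fragment_event_probe, NULL, NULL);
}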

Switching between alternatives

Now what happens if the demuxer wants to switch between different alternative streams, e.g. because it has noticed that it can barely keep up with downloading streams of a high bitrate and wants to switch to a lower bitrate. Or even to an alternative that is audio-only. Here the demuxer has to select the next media fragment for the chosen alternative stream and forward that downstream.

But it’s of course not that simple because currently we don’t support renegotiation of decoder pipelines. It could easily happen that codecs change between different alternatives, or the topology changes (e.g. the video stream disappears). Note that not supporting automatic renegotiation for cases like this in decodebin and related elements is not a design deficit of GStreamer but just a limitation of the current implementation.

There is a similar case that is already handled in GStreamer: chained Ogg files (i.e. different Ogg files concatenated to each other). In that case it should also behave like a single stream, but codecs could change or the topology could change. Here the demuxer just has to add new pads for the following streams after having emitted the no-more-pads signal, and then remove all the old pads. decodebin and playbin then first drain the old streams and then handle the new ones, while making sure their times align perfectly with the old ones.

(uri)decodebin and playbin

If we now look at the complete (source) pipeline that is created by uridecodebin for this case we come up with the following (simplified):

HTTP Adaptive Streaming Pipeline

We have a source element for the Manifest, which is added by uridecodebin. uridecodebin also uses a typefinder to detect that this is actually an HTTP adaptive streaming Manifest. This is then connected to a decodebin instance, which is configured to do buffering after the demuxers with the multiqueue element. For normal HTTP streams, buffering is instead done between the source element and decodebin with the queue2 element.

decodebin then selects an HTTP adaptive streaming protocol demuxer for the current protocol, waits until it has decided on what to output and then connects it to a multiqueue element like every other demuxer. However, it uses different buffering settings as this demuxer is going to behave a bit differently.

Followed by that is the actual demuxer for the media fragments, which could for example be tsdemux if they’re MPEG TS media fragments. If an elementary stream is used for the media fragment, decodebin will insert a parser for that elementary stream which will chunk the stream into codec frames and also put timing information on them.

Afterwards comes another multiqueue element, which would not be placed after a parser if no HTTP adaptive streaming demuxer were used. decodebin will configure this multiqueue element to handle buffering and send buffering messages to the application, to allow it to pause playback until the buffer is full.

In playbin and playsink no special handling is necessary. Now let’s take a look at a complete pipeline graph for an HLS stream that contains one video and one audio stream inside MPEG TS containers. Note: this is huge, like all playbin pipelines. Everything on the right half is for converting raw audio/video and postprocessing it, and then outputting it on the display and speakers.

playbin: HLS with audio and video

Keep in mind that the audio and video streams, and also subtitle streams, could also be in separate media fragments. In that case the HTTP adaptive streaming demuxer would have multiple source pads, each of them followed by a demuxer or parser and multiqueues. And decodebin would aggregate the buffering messages of each of the multiqueues to give the application a consistent view of the buffering status.

Possible optimisations

Now all of this sounds rather complex and probably slow compared to the naive approach of just implementing all this in a single source element and not having so many different elements involved. As time has shown it actually is not slow at all, and if you consider a design for such a protocol in a single element you will notice that all the different components that are separate elements here will also show up in your design. But as GStreamer is like Lego and we like having lots of generic components that are put together to build a more complex whole, the current design follows the idea of GStreamer in a more consistent way. And especially it makes it possible to reuse lots of existing elements and to replace individual elements with custom implementations, transparently thanks to GStreamer’s autoplugging mechanisms.

So let’s talk about a few possible optimisations here, some of which are implemented in the default GStreamer elements and should be kept in mind when replacing elements, and some of which could be implemented on top of the existing elements.

Keep-alive Connections and other HTTP features

All these HTTP adaptive streaming protocols require lots of HTTP requests, which traditionally meant creating a new TCP connection for every single request. This involves quite some overhead and increases latency because of TCP’s handshake protocol; it’s even worse if you use HTTPS and also have to handle the SSL/TLS handshake on top of that. We’re talking about multiple hundreds of milliseconds to seconds per connection setup here. HTTP 1.1 allows connections to be kept alive for some period of time and reused for multiple HTTP requests. Browsers have been using this for a long time already to efficiently show you websites composed of many different files with low latency.

Also, previously all GStreamer HTTP source elements closed their connection(s) when going back to the READY state, but setting them to the READY state is required to switch URIs. This basically means that although HTTP 1.1 allows reusing connections for multiple requests, we were not able to make use of this. Now the souphttpsrc HTTP source element keeps connections open until it goes back to the NULL state if the keep-alive property is set to TRUE, and other HTTP source elements could implement this too. The HTTP adaptive streaming demuxers make use of this “implicit interface” to reuse connections for multiple requests as much as possible.

Compression

HTTP also defines a way for clients and servers to negotiate the content encoding that both support. In particular this allows them to agree that the actual data (the response body) should be compressed with gzip (or another method) instead of being transferred as plaintext. For media fragments this is not very useful, but for Manifests it can be very useful, especially in the case of HLS, where the Manifest is a plaintext ASCII file that can easily be a few hundred kilobytes in size.

The HTTP adaptive streaming demuxers are using another “implicit interface” on the HTTP source element to enable compression (if supported by the server) for Manifest files. This is also currently only handled in the souphttpsrc element.

Other minor features

HTTP defines many other headers, and the HTTP adaptive streaming demuxers make use of two more if supported by the HTTP source element. The “implicit interface” for setting more headers in the HTTP request is the extra-headers property, which can be set to arbitrary headers.

The HTTP adaptive streaming demuxers currently set the Referer header to the URI of the Manifest file, which to my knowledge is not mandated by any standard, but there are streams out there that actually refuse to serve media fragments without it. The demuxers also set the Cache-Control header to a) tell caches/proxies to update their internal copy of the Manifest file when redownloading it and b) tell caches/proxies that some requests must not be cached (if indicated so in the Manifest). The latter can of course be ignored by the caches.

If you implement your own HTTP source element it is probably a good idea to copy the interface of the souphttpsrc element at least for these properties.
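
As a rough sketch of how a demuxer (or your own element) can use this “implicit interface”, checking for the souphttpsrc-style property names before setting them (the function and variable names here are illustrative):

#include <gst/gst.h>

static void
configure_http_source (GstElement * src, const gchar * manifest_uri, gboolean is_manifest)
{
  GObjectClass *klass = G_OBJECT_GET_CLASS (src);

  /* Reuse the connection for multiple requests if the source supports it */
  if (g_object_class_find_property (klass, "keep-alive"))
    g_object_set (src, "keep-alive", TRUE, NULL);

  /* Only ask for compressed transfers for Manifests, not media fragments */
  if (g_object_class_find_property (klass, "compress"))
    g_object_set (src, "compress", is_manifest, NULL);

  /* Set the Referer header to the Manifest URI */
  if (g_object_class_find_property (klass, "extra-headers")) {
    GstStructure *headers = gst_structure_new ("extra-headers",
        "Referer", G_TYPE_STRING, manifest_uri, NULL);
    g_object_set (src, "extra-headers", headers, NULL);
    gst_structure_free (headers);
  }
}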

Caching HTTP source

Another area that could easily be optimised is implementing a cache for the downloaded media fragments and also Manifests. This is especially useful for video-on-demand streams, and even more when the user likes to seek in the stream. Without any cache it would be required to download all media fragments again after seeking, even if the seek position was already downloaded before.

A simple way for implementing this is a caching HTTP source element. This basically works like an HTTP cache/proxy like Squid, only one level higher. It behaves as if it is an HTTP source element, but actually it does magic inside.

From a conceptual point of view this caching HTTP source element would implement the GstURIHandler interface and handle the HTTP(S) protocols, and ideally also implement some of the properties of souphttpsrc as mentioned above. Internally it would actually be a GstBin that dynamically creates a pipeline for downloading an URI when transitioning from the READY to the PAUSED state. It could have the following internal configurations:

Caching HTTP Source

The only tricky bits here are proxying of the different properties to the relevant internal elements, and error handling. You probably don’t want to stop the complete pipeline if for some reason writing to or reading from your cache file fails. For that you could catch the error messages and also intercept data flow (to ignore error flow returns from filesink), and then dynamically reconfigure the internal pipeline as if nothing had happened.

Based on the Cache-Control header you could implement different functionality, e.g. for refreshing data stored in your cache already. Based on the Referer header you could correlate URIs of media fragments to their corresponding Manifest URI. But how to actually implement the storage, cache invalidation and pruning is a standard problem of computer science and covered elsewhere already.
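
As a starting point, the GstURIHandler part of such a caching source element could look roughly like the following sketch (the type is hypothetical, and the interesting parts, namely creating the internal download/cache pipeline, proxying properties and error handling, are left out):

#include <gst/gst.h>

typedef struct {
  GstBin parent;
  gchar *uri;
} MyCacheSrc;

typedef struct {
  GstBinClass parent_class;
} MyCacheSrcClass;

static void my_cache_src_uri_handler_init (gpointer iface, gpointer data);

G_DEFINE_TYPE_WITH_CODE (MyCacheSrc, my_cache_src, GST_TYPE_BIN,
    G_IMPLEMENT_INTERFACE (GST_TYPE_URI_HANDLER, my_cache_src_uri_handler_init));

static void
my_cache_src_class_init (MyCacheSrcClass * klass)
{
}

static void
my_cache_src_init (MyCacheSrc * self)
{
  /* Here the internal elements and ghost pads would be created */
}

static GstURIType
my_cache_src_uri_get_type (GType type)
{
  return GST_URI_SRC;
}

static const gchar *const *
my_cache_src_uri_get_protocols (GType type)
{
  static const gchar *protocols[] = { "http", "https", NULL };
  return protocols;
}

static gchar *
my_cache_src_uri_get_uri (GstURIHandler * handler)
{
  return g_strdup (((MyCacheSrc *) handler)->uri);
}

static gboolean
my_cache_src_uri_set_uri (GstURIHandler * handler, const gchar * uri, GError ** error)
{
  MyCacheSrc *self = (MyCacheSrc *) handler;

  g_free (self->uri);
  self->uri = g_strdup (uri);
  /* ... (re)configure the internal download/cache pipeline for this URI ... */
  return TRUE;
}

static void
my_cache_src_uri_handler_init (gpointer iface, gpointer data)
{
  GstURIHandlerInterface *i = iface;

  i->get_type = my_cache_src_uri_get_type;
  i->get_protocols = my_cache_src_uri_get_protocols;
  i->get_uri = my_cache_src_uri_get_uri;
  i->set_uri = my_cache_src_uri_set_uri;
}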

Creating Streams – The Server Side

And in the end, some final words about the server side of things: how to create such HTTP adaptive streams with GStreamer and serve them to your users.

In general this is a relatively simple task with GStreamer and the standard tools and elements that are already available. We have muxers for all the relevant container formats (one for the MP4 variant used in MPEG DASH is available in Bugzilla), there is API to request keyframes at specific positions (so that you can actually start a new media fragment at that position) and the dynamic pipeline mechanisms allow for dynamically switching muxers to start a new media fragment.

On top of that you would only need to create the Manifest files and then serve all of this with some HTTP server, of which there are already plenty of implementations out there; and in the end you want to hide your server behind a CDN anyway.

Additionally there are the hlssink and dashsink elements, the latter currently only being available in Bugzilla. These already implement the creation of the media fragments with the above-mentioned GStreamer tools and generate a Manifest file for you.

And then there is the GStreamer Streaming Server, which also has support for HTTP adaptive streaming protocols and can be used to easily serve multiple streams. But that one deserves its own article.

OpenGL support in GStreamer

Over the last few months Matthew Waters, Julien Isorce and, to a lesser degree, myself worked on integrating proper OpenGL support into GStreamer.

Previously there were a few sinks based on OpenGL (osxvideosink for Mac OS X and eglglessink for Android and iOS), but they all only allowed rendering to a window. They did not allow rendering of a video into a custom texture that is then composited inside the application into an OpenGL scene. And then there was gst-plugins-gl, which allowed more flexible handling of OpenGL inside GStreamer pipelines, including uploading and downloading of video frames to the GPU, provided various filters and base classes to easily implement shader-based filters, provided infrastructure for sharing OpenGL contexts between different elements (even if they run in different threads) and also provided a video sink. The latter was now improved a lot, ported to all the new features for hardware integration and finally merged into gst-plugins-bad. Starting with GStreamer 1.4 in a few weeks, OpenGL will be a first-class citizen in GStreamer pipelines.

After yesterday’s addition of EAGL support for iOS (EAGL is Apple’s iOS API for handling GLES contexts), nothing is missing any more to use this new library and set of plugins on all platforms supported by GStreamer. And finally we can get rid of eglglessink, which was only meant as an intermediate solution until we had all the infrastructure for real OpenGL support.

EFL and Enlightenment GStreamer 1.x support

Over the past few weeks I did some work on porting Emotion to GStreamer 1.x. Emotion is the media library used by Enlightenment and part of the Enlightenment Foundation Libraries (EFL). It provides a media playback library abstraction (there are also Xine and VLC backends).

Previously there was a GStreamer 0.10 backend (which was the default one for Emotion), but GStreamer 0.10 is no longer maintained and supported by the community. At Centricular we want to make sure that GStreamer and other Free Software shine, so I started porting the backend to GStreamer 1.0.

I started doing a straightforward port of the old GStreamer 0.10 backend. That was a few hours of work, but the old 0.10 backend was rather bitrotten and was lacking a lot of features compared to the other backends. So I spent some more time on cleaning it up, fixing a lot of bugs on the way and making it (almost) feature-complete. Some of the new features I added were selection and switching of audio/video/text streams, support for the navigation interface for DVDs, buffering for network streams, improved support for live streams and proper support for non-1:1 pixel-aspect-ratios. I’ll work on adding some further features and improvements (like zerocopy rendering) to it over the next weeks every now and then, but overall I would say that this is ready for general use now and definitely an improvement over the old code.

The code can be found here and will also be in the 1.9 release, which will be released soonish. It’s also the default backend for Emotion now, and should give proper out-of-the-box multimedia experience with Enlightenment. The GStreamer 0.10 backend is still available but has to be enabled explicitly, if anybody needs it for whatever reason.

GStreamer Dynamic Pipelines

Another topic that has been recurring with GStreamer for a long time is how to build applications with dynamic pipelines. That is, pipelines in which elements are relinked while the pipeline is playing and without stopping the pipeline.

So, let’s write a bit about it and explain how it all works.

Note however that I’m not covering the most common and simple case here: a demuxer or decodebin adding pads when set to PLAYING, and then connecting to these pads. My example code does this too, but there’s enough documentation about it already.

Also these two examples unfortunately need GStreamer 1.2.3 or newer because of some bugfixes.

The Theory

What’s difficult about dynamic pipelines? Why can’t you just relink elements and their pads at any time like you do when the pipeline is not running? Let’s consider the example of the plumbing in your house. If you want to change something there, you’d better make sure nothing is flowing through the pipes at that time, or otherwise there will be a big mess 🙂

Pad Probes

In GStreamer this is handled with the pad probe mechanism. Pad probes allow registering a callback that is called whenever a specific condition is met. These conditions are expressed with a flags type, and are e.g. GST_PAD_PROBE_TYPE_BUFFER for a buffer arriving at the pad or GST_PAD_PROBE_TYPE_QUERY_UPSTREAM for an upstream query. Additionally these flags specify the scheduling type (not so important), and can specify a blocking type: GST_PAD_PROBE_TYPE_IDLE and GST_PAD_PROBE_TYPE_BLOCK.

gst_pad_add_probe() adds a probe and returns an identifier, which can later be used to remove the probe again from the pad with gst_pad_remove_probe().

The Callback

The probe callback is called whenever the condition is met. In this callback we get an info structure passed, which contains the exact condition that caused the callback to be called and the data that is associated with this. This can be for example the current buffer, the current event or the current query.

From the callback this data can be inspected but it’s also possible to replace the data stored in the info structure.

Once everything we want to do is done inside the callback, the callback has to return a value. This specifies whether the data should be passed on (GST_PAD_PROBE_PASS), dropped (GST_PAD_PROBE_DROP), the probe should be removed and the data passed on (GST_PAD_PROBE_REMOVE), or the default action for this probe type should happen (GST_PAD_PROBE_OK, more on that later).

Note that the callback can be called from an arbitrary thread, and especially is not guaranteed to be called from your main application thread. For all serialized events, buffers and queries it will be called from the corresponding streaming thread.

Also it is important to keep in mind that the callback can be called multiple times (even concurrently), and that it can still be called after returning GST_PAD_PROBE_REMOVE from it (another thread might’ve just called into it). It is the job of the callback to protect against that.

Blocking Types

The blocking types of the conditions are of further interest here. Without a blocking type the probe callback can be used to get notified whenever the condition is met, or intercept data flow or even modify events or buffers. That can also be very useful but not for our topic.

Whenever one of the blocking types is specified in the condition, triggering the probe will cause the pad to be blocked. That means that the pad will not pass on any data related to the condition until the probe is removed (with gst_pad_remove_probe() or by returning GST_PAD_PROBE_REMOVE), unless GST_PAD_PROBE_PASS is returned from the callback. This guarantees that nothing else that matches the condition can pass and the callback can safely do its work. Especially if GST_PAD_PROBE_TYPE_DATA_BOTH is specified, no data flow can happen and everything downstream of the pad up to the next queue can be safely relinked. To be able to relink parts after the next queues you additionally need to make sure that all data flow has finished up to that point too, which can be done with further pad probes (see also the advanced variant of the first example).

Probes with the GST_PAD_PROBE_TYPE_IDLE blocking type will be called the next time the pad is idle, i.e. there is no data flow happening currently. This can also happen immediately when gst_pad_add_probe() is called, directly from the thread that calls it, or after the next buffer, event or query is handled.

Probes with the GST_PAD_PROBE_TYPE_BLOCK blocking type will be called the next time the conditions match, and will block the pad before passing on the data. This allows to inspect the buffer, event or query that is currently pending for the pad while still blocking the pad from doing anything else.

The main advantage of GST_PAD_PROBE_TYPE_BLOCK probes is that they provide the data that is currently pending, while the main advantage of GST_PAD_PROBE_TYPE_IDLE is that it is guaranteed to be called as soon as possible (independent of any data coming or not, there might not be any further data at all). It comes with the disadvantage that it might be called directly from the thread that calls gst_pad_add_probe() though. Depending on the use case, one or both of them should be chosen.
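
Before getting to the examples, the general pattern looks something like this minimal sketch (not taken from the example code):

#include <gst/gst.h>

static gint relinked = 0;  /* guards against multiple invocations of the callback */

static GstPadProbeReturn
pad_blocked_cb (GstPad * pad, GstPadProbeInfo * info, gpointer user_data)
{
  if (!g_atomic_int_compare_and_exchange (&relinked, 0, 1))
    return GST_PAD_PROBE_OK;

  /* The pad is blocked now: it is safe to relink everything downstream
   * of it up to the next queue */

  return GST_PAD_PROBE_REMOVE;  /* removes the probe and unblocks the pad */
}

static void
block_and_relink (GstPad * srcpad)
{
  gst_pad_add_probe (srcpad, GST_PAD_PROBE_TYPE_BLOCK_DOWNSTREAM,
      pad_blocked_cb, NULL, NULL);
}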

Now to the examples.

Example 1: Inserting & removing a filter

In this example we will have a decodebin, connected to a video sink with the navseek element. This allows us to watch any supported video file and seek with the cursor keys. Every 5 seconds a video effect filter will be inserted in front of the sink, or removed if it was inserted last time. All this without ever stopping playback or breaking because of seeking. The code is available here.

Setting up everything

In main() we set up the pipeline and link all parts we can already link, connect to the GstElement::pad-added signal of decodebin and then start a mainloop.

From the pad-added callback we then connect the first video pad that is added on decodebin to the converter in front of the video sink. We also add our periodic 5 second timeout, which will insert/remove the filter here. After this point the pipeline will be PLAYING and the video will be shown.

The insertion/removal of the filter

The timeout callback is quite boring, nothing is happening here other than calling gst_pad_add_probe() to add an IDLE probe. And here we also initialize a variable that protects our probe callback from multiple concurrent calls. We use an IDLE probe here as we’re not interested in the data causing the callback call, and also just want to get the callback called as soon as possible, even from the current thread.

Now the actual insertion or removal of the filter happens in the probe callback. This is the actually interesting part. Here we first check with an atomic operation if the callback was already called, and afterwards either insert or remove the filter. In both cases we need to make sure that all elements are properly linked on their pads afterwards and have the appropriate states. We also have to insert a videoconvert element in front of the filter to make sure that the output of the decoder can be handled by our filter.
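
A condensed sketch of what the insertion part of such an IDLE probe callback does (element names and the surrounding pipeline layout are made up here, see the linked code for the real thing):

#include <gst/gst.h>

static GstPadProbeReturn
insert_filter_cb (GstPad * blocked_pad, GstPadProbeInfo * info, gpointer user_data)
{
  GstElement *pipeline = user_data;
  GstElement *convert, *filter, *sink;
  GstPad *sinkpad;

  /* blocked_pad is the src pad directly in front of the video sink */
  sink = gst_bin_get_by_name (GST_BIN (pipeline), "sink");
  convert = gst_element_factory_make ("videoconvert", NULL);
  filter = gst_element_factory_make ("agingtv", NULL);
  gst_bin_add_many (GST_BIN (pipeline), convert, filter, NULL);

  /* Unlink the sink and insert videoconvert ! filter in front of it */
  sinkpad = gst_element_get_static_pad (sink, "sink");
  gst_pad_unlink (blocked_pad, sinkpad);
  gst_object_unref (sinkpad);

  gst_element_link_many (convert, filter, sink, NULL);
  sinkpad = gst_element_get_static_pad (convert, "sink");
  gst_pad_link (blocked_pad, sinkpad);
  gst_object_unref (sinkpad);

  /* Bring the new elements to the state of the pipeline */
  gst_element_sync_state_with_parent (convert);
  gst_element_sync_state_with_parent (filter);

  gst_object_unref (sink);
  return GST_PAD_PROBE_REMOVE;
}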

A slightly more advanced variant

And that’s already all there is to know about this case. A slightly more complex variant of this is also in gst-plugins-base. The main difference is that BLOCK probes are used here, and the filter is drained with an EOS event before it is replaced. This is done by first adding a BLOCK probe in front of the filter, then from the callback adding another one after the filter and then sending an EOS event to the filter. From the probe after the filter we pass through all data until the EOS event is received and only then remove the filter. This is done for the case that the filter has multiple buffers queued internally. BLOCK probes are used here instead of IDLE probes because we would otherwise potentially send the EOS event from the application’s main thread, which would then block until the EOS event arrived on the other side of the filter and the filter was removed.

Example 2: Adding & removing sinks

The second example also plays a video with decodebin, but randomly adds or removes another video sink every 3 seconds. This uses the tee element for duplicating the video stream. The code can be found here.

Setting up everything

In main() we set up the pipeline and link all parts we can already link, connect to the GstElement::pad-added signal of decodebin and then start a mainloop. Same as in the previous example. We don’t add a sink here yet.

From the pad-added callback we now link decodebin to the tee element, request a first srcpad from tee and link a first sink. This first sink is a fakesink (with sync=TRUE to play in realtime), and is always present. This makes sure that the video is always playing in realtime, even if we have no visible sinks currently. At the end of the callback we add our periodic 3 second timer.

Addition of sinks

In the timeout callback we first get a random number to decide if we now add or remove a sink. If we add a new sink this is all done from the timeout callback (i.e. the application’s main thread) directly. We can do all this from the main thread and without pad probes because there’s no data flow to disrupt. The new tee srcpad is just created here and if tee pushes any buffer through it now it will just be dropped. For adding a sink we just request a new srcpad from the tee and link it to a queue, video converter and sink, sync all the states and remember that we added this sink. A queue is necessary after every tee srcpad because otherwise the tee will lock up (because all tee srcpads are served from a single thread).
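
In code, adding such a branch could look roughly like this (a sketch; pipeline and tee are assumed to exist, and the real example additionally remembers the requested tee pad so the branch can be removed again later):

#include <gst/gst.h>

static void
add_sink_branch (GstElement * pipeline, GstElement * tee)
{
  GstElement *queue = gst_element_factory_make ("queue", NULL);
  GstElement *convert = gst_element_factory_make ("videoconvert", NULL);
  GstElement *sink = gst_element_factory_make ("autovideosink", NULL);
  GstPad *teepad, *sinkpad;

  gst_bin_add_many (GST_BIN (pipeline), queue, convert, sink, NULL);
  gst_element_link_many (queue, convert, sink, NULL);

  /* Bring the new branch to the pipeline's state before linking it
   * to the running tee */
  gst_element_sync_state_with_parent (queue);
  gst_element_sync_state_with_parent (convert);
  gst_element_sync_state_with_parent (sink);

  /* Request a new src pad from tee and link it to the queue */
  teepad = gst_element_get_request_pad (tee, "src_%u");
  sinkpad = gst_element_get_static_pad (queue, "sink");
  gst_pad_link (teepad, sinkpad);
  gst_object_unref (sinkpad);
  gst_object_unref (teepad);
}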

Removal of sinks

Removal of sinks is a bit more complicated as now we have to block the relevant pad because there might be data flow happening just now. For this we add an IDLE probe and from the callback unlink and destroy the sink. Again we protect against multiple calls to the callback, and we pass our sink information structure to the callback to know which sink actually should be removed. Note here that we pass g_free() to gst_pad_add_probe() as destroy notify for the sink information structure and don’t free the memory from the callback. This is necessary because the callback can still be called after we released the sink, and we would access already freed memory then.

I hope this helps to understand how dynamic pipelines can be implemented with GStreamer. It should be easily possible to extend these examples to real, more complicated use cases. The concepts are the same in all cases.

New GStreamer plugins: OpenNI2 (3D sensors, e.g. Kinect), RTP over TCP and OpenEXR (HDR image format)

Over the last few days a few new GStreamer plugins were added to latest GIT, and will be part of the 1.3/1.4 releases.

OpenNI2 – 3D sensors support

Miguel Casas-Sanchez worked on implementing a source element for the OpenNI2 API, which supports the Kinect camera of the XBox and some other cameras that provide depth information as another channel. This is not to be confused with stereoscopic video as supported by some codecs and used in cinema, which uses two frames from slightly different angles and creates a 3D image from that.

This source element handles dumps of the camera input to files, and also capturing from cameras directly. Currently the output of it is either RGB (without depth information), 16bit grayscale (only depth information) or RGBA (with depth information in the alpha channel). This can be configured with the “sourcetype” property right now. At a later time we should try to define a proper interface for handling depth information, especially something that does not feel completely contradictory with the stereoscopic video API. Maybe there could be just another plane for the depth information in a GstMeta.

The plugin is available in gst-plugins-bad GIT.

RTP over TCP

One question that was asked often in the past is how to stream RTP over a TCP connection. As RTP packets are expected to be sent over a datagram protocol, like UDP, while TCP provides a stream protocol, it is necessary to collect a complete packet at the receiver side and then pass this onwards. RTP has no size information in the packets, so an additional framing protocol is required on top of RTP. Fortunately there’s an RFC which defines a very simple one that is used in practice: RFC4571. Thanks to Olivier Crête for mentioning this RFC on the gstreamer-devel mailinglist. The framing protocol is very simple: it’s defined as just prepending a 16 bit unsigned, big-endian integer containing the packet length in front of each RTP or RTCP packet.
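
For illustration, the framing of a single packet boils down to something like this (a sketch of what rtpstreampay does for each RTP/RTCP packet):

#include <gst/gst.h>

static GstBuffer *
frame_packet (GstBuffer * packet)
{
  gsize size = gst_buffer_get_size (packet);
  GstBuffer *header = gst_buffer_new_allocate (NULL, 2, NULL);
  GstMapInfo map;

  /* RFC 4571: a 16 bit unsigned, big-endian length field in front of the
   * packet (packets larger than 65535 bytes cannot be framed this way) */
  gst_buffer_map (header, &map, GST_MAP_WRITE);
  GST_WRITE_UINT16_BE (map.data, (guint16) size);
  gst_buffer_unmap (header, &map);

  /* gst_buffer_append() takes ownership of both buffers */
  return gst_buffer_append (header, packet);
}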

Yesterday I wrote two simple elements implementing this RFC and integrated them into gst-plugins-good GIT.

They can for example be used like this:

gst-launch-1.0 audiotestsrc ! "audio/x-raw,rate=48000" ! vorbisenc ! rtpvorbispay config-interval=1 ! rtpstreampay ! tcpserversink port=5678

gst-launch-1.0 tcpclientsrc port=5678 host=127.0.0.1 do-timestamp=true ! "application/x-rtp-stream,media=audio,clock-rate=48000,encoding-name=VORBIS" ! rtpstreamdepay ! rtpvorbisdepay ! decodebin ! audioconvert ! audioresample ! autoaudiosink

A more elaborate solution could also use RTCP communication between the sender and receiver. RTCP can also be passed through the rtpstreampay and rtpstreamdepay elements the same way.

For anything more complicated you should consider looking into RTSP though as it is much more flexible, feature-rich and allows exchanging the stream configurations automatically, and it also allows streams to be delivered via TCP (or UDP unicast/multicast). GStreamer has a source plugin and a server library for receiving or serving RTSP streams.

OpenEXR – HDR image formats

Another thing that I worked on (and still do) is an OpenEXR decoder plugin. OpenEXR is an HDR image format, but unfortunately we don’t have support for any HDR-compatible raw video format in GStreamer… yet! The OpenEXR decoder element is inside gst-plugins-bad GIT now but internally converts the 16 bit floating point color components to 16 bit integers, choosing 1.0 as clipping point. Once there’s consensus about how to expose such raw video formats in GStreamer (see Bugzilla #719902), support for the 16 bit floating point RGBA format should be easy to add.

Synchronized audio mixing in GStreamer

Over the last few weeks I worked on a new GStreamer element: audiomixer. This new element is based on adder, i.e. it mixes multiple audio streams together and produces a single audio stream. It’s already merged into GIT master of the gst-plugins-bad module.

The main and important difference to adder is that it actually synchronizes the different audio streams against each other instead of just mixing samples together as they come while ignoring all timing information. This is a feature that had been requested for a long time, and the lack of it causes problems in practice if streams are supposed to start at different times, have gaps or have slightly different clock rates (and thus one is supposed to run a bit faster than the other). It’s also an important first step towards properly supporting mixing of live streams. The video element for mixing, videomixer, has properly implemented synchronization for a few years now.

A very simple example to see the difference between both elements would be the following gst-launch command:

gst-launch-1.0 audiomixer name=mix \
  mix. ! audioconvert ! audioresample ! autoaudiosink \
  audiotestsrc num-buffers=400 volume=0.2 ! mix. \
  audiotestsrc num-buffers=300 volume=0.2 freq=880 timestamp-offset=1000000000 ! mix. \
  audiotestsrc num-buffers=100 volume=0.2 freq=660 timestamp-offset=2000000000 ! mix.

If you replace audiomixer by adder, you’ll hear all streams starting at the same time while with audiomixer they start with the correct offsets to each other.

So, what’s left to be done? Currently reverse playback/mixing is not supported; that’s somewhere next on my todo list. Also the handling of flushing seeks and flushes in general in mixers (and muxers) is currently rather suboptimal; that’s something I’m working on next. As a side effect this will also bring us one step nearer to proper mixing of live streams.

FOMS Workshop 2013

This week and next week I’m going to be in San Francisco and will attend the Foundations of Open Media Standards and Software (FOMS) workshop 2013. Thanks to Silvia Pfeiffer, the other organisers and the sponsors of the event (Google and brightcove) for making this possible by sponsoring me.

Topics that are going to be discussed are WebRTC, Web multimedia in general, open codecs and related topics and the list of attendees seems very diverse to me. I expect lots of interesting discussions and a few interesting and hopefully productive days 🙂

If anybody else is in that area currently and wants to meet for a coffee, beer or some food on the 2-3 days before and after FOMS, please write me a mail or use some other communication channel 🙂

GStreamer 1.0 examples for iOS, Android and in general

As the folks at gstreamer.com (not to be confused with the GStreamer project) are still at the old and unmaintained GStreamer 0.10 release series, I started to port all their tutorials and examples to 1.x. You can find the code here: http://cgit.freedesktop.org/~slomo/gst-sdk-tutorials/

This includes the generic tutorials and examples, and ones for iOS and Android. Over the past months many people wanted to try the 1.x binaries for iOS and Android and were asking for examples how to use them. Especially the fourth and fifth tutorials should help to get people started fast, you can find them here (Android) and here (iOS).

If there are any problems with these, please report them to me, or if you suspect any GStreamer bugs report them in Bugzilla. The Xcode OS X project files and the Visual Studio project files are ported but I didn’t test them, so please report if they work 🙂

The never-ending story: GStreamer and hardware integration

This is the companion article for my talk at the GStreamer Conference 2013: The never-ending story: GStreamer and hardware integration

The topic of hardware integration into GStreamer (think of GPUs, DSPs, hardware codecs, OpenMAX IL, OpenGL, VAAPI, video4linux, etc.) was always a tricky one. In the past, in the 0.10 release series, it often was impossible to make full and efficient use of special hardware features and many ugly hacks were required. In 1.x we reconsidered many of these problems and tried to adjust the existing APIs, or add new APIs to make it possible to solve these problems in a cleaner way. Although all these APIs are properly documented in detail, what’s missing is some kind of high-level/overview documentation to understand how it all fits together, which API to use when for what. That’s what I’m trying to solve here now.

What’s new in GStreamer 1.0 & 1.2

The relevant changes in GStreamer 1.x for hardware integration are the much better memory management and control with the GstMemory and GstAllocator abstractions and GstBufferPool, the support for arbitrary buffer metadata with GstMeta, sharing of device contexts between elements with GstContext and the reworked (re-)negotiation and (re-)configuration mechanisms, especially the new ALLOCATION query. In the following I will explain how to use them in the context of hardware integration and how they all work together and will assume that you already know and understand the basics of GStreamer.

Special memory

The most common problem when dealing with special hardware and the APIs to use it is that it requires the usage of special memory. Maybe there are some constraints on the memory (e.g. regarding alignment, or physically contiguous memory is required), or you need to use some special API to access the memory and it doesn’t behave like normal system memory. Examples for the second case would be VAAPI and OpenGL, where you have to use special APIs to do something with the memory, or the new DMA-BUF Linux kernel API, where you just pass around a file descriptor that might be mappable to the process’ virtual address space via mmap().

Custom memory types

In 0.10 special memory was usually handled by subclassing GstBuffer and somehow ensuring that you could get a buffer of the correct type and that it was not copied to a normal GstBuffer in between. It was usually very unreliable and rather hackish.

In 1.0 it’s no longer possible to subclass GstBuffer, and GstBuffer is only a collection of memory abstraction objects and metadata. You would now implement a custom GstMemory and GstAllocator. GstMemory here just represents an instance of, or handle to, a specific piece of memory but has no logic implemented itself. For the logic it points to its corresponding GstAllocator. The reason for this split is mainly that a) there might be multiple allocators for a single memory type (think of the different ways of allocating an EGLImage), and b) you want to negotiate the allocator and memory type between elements before allocating any of these memories.

This new abstraction of memory has explicit operations to request read, write or exclusive access to the memory, and usage like normal system memory requires an explicit call to gst_memory_map(). This mapping to system memory allows using it from C like memory allocated via malloc(), but might not be possible for every type of memory. Subclasses of GstMemory have to implement all these operations and map them to whatever this specific type of memory provides, and can provide additional API on top of that. The memory could also be marked as read-only, or could even be inaccessible to the process and just passed around as an opaque handle between elements.

Let me give some examples. For DMA-BUF you would for example implement the mapping to the virtual address space of the process via mmap() (which is done by GstDmabufMemory), for VAAPI you could implement additional API for the memory that could provide the memory as an GL texture id or convert it to a different color format (which is done by gst-vaapi), for memory that represents a GL texture id you could implement mapping to the process’ virtual address space by dispatching an operation to the GL thread and using glTexImage2D() or glReadPixels() and implement copying of memory in a similar way (which is done by gst-plugins-gl). You could also implement some kind of access control for the memory, which would allow easier implementation of DRM schemes (all engineers are probably dying a bit inside while reading this now…), or do crazy things like implementing some kind of file-backed memory.

With GstBuffer just being a collection of memories now, it provides convenience API to handle all the contained memories, e.g. to copy all of them, or map all of them to the process’ virtual address space.
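
For example, accessing a buffer's content as normal system memory now always goes through an explicit map/unmap (a trivial sketch; for special memory types the map might fail or be expensive):

#include <gst/gst.h>

static void
read_buffer (GstBuffer * buffer)
{
  GstMapInfo map;

  /* Maps (and if necessary merges) all memories of the buffer */
  if (gst_buffer_map (buffer, &map, GST_MAP_READ)) {
    /* map.data / map.size can now be used like malloc'd memory */
    g_print ("size: %" G_GSIZE_FORMAT "\n", map.size);
    gst_buffer_unmap (buffer, &map);
  }
}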

Memory constraints

If all you need is just some memory that behaves like normal, malloc’d memory but fulfils some additional constraints like alignment or being physically contiguous, you don’t need to implement your own GstMemory and GstAllocator subclass. The allocation function of GstAllocator, gst_allocator_alloc(), allows providing parameters to specify such things and the default memory implementation handles this just fine.
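
For example, requesting 16-byte-aligned memory from the default system memory allocator could look like this minimal sketch:

#include <gst/gst.h>

static GstMemory *
alloc_aligned (gsize size)
{
  GstAllocationParams params;

  gst_allocation_params_init (&params);
  params.align = 15;  /* alignment is given as a mask: 15 means 16 byte aligned */

  /* NULL selects the default (system memory) allocator */
  return gst_allocator_alloc (NULL, size, &params);
}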

For format specific constraints, like having a special rowstride of raw video frames, or having their planes in separate memory regions, something else can be done which I will mention later when talking about arbitrary buffer metadata.

So far I’m not aware of any special type of memory that can’t be handled with the new API, and I think it’s rather unlikely that there ever will be one. In the worst case it will just be a bit inconvenient to write the code.

Buffer pools

In many cases it will be useful to not just allocate a new block of memory whenever one is required, but instead keep track of memory in a pool and recycle memory that is currently unused. For example because allocations are just too expensive or because there is only a very limited number of memory blocks available anyway.

For this GstBufferPool is available. It provides a generic buffer pool class with standard functionality (setting of min/max number of buffers, pre-allocation of buffers, allocation parameters, allow/forbid dynamic allocation of buffers, selection of the GstAllocator, etc) and can be extended by subclasses with custom API and configuration.

As the most common case for buffer pools is raw video, there’s also a GstVideoBufferPool subclass which already implements a lot of common functionality. For example it takes care of allocating memory of the right size and of caps handling, allows for configuration of custom rowstrides or using separate memory regions for the different planes, and implements a custom configuration interface for specifying additional padding around each plane (GstVideoAlignment).
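
Configuring and using such a pool could look roughly like this (a sketch; caps and size are assumed to describe the negotiated video format):

#include <gst/gst.h>
#include <gst/video/gstvideopool.h>

static GstBufferPool *
setup_pool (GstCaps * caps, guint size)
{
  GstBufferPool *pool = gst_video_buffer_pool_new ();
  GstStructure *config = gst_buffer_pool_get_config (pool);

  /* At least 2 buffers, no maximum, and allow GstVideoMeta on the buffers */
  gst_buffer_pool_config_set_params (config, caps, size, 2, 0);
  gst_buffer_pool_config_add_option (config, GST_BUFFER_POOL_OPTION_VIDEO_META);
  gst_buffer_pool_set_config (pool, config);

  gst_buffer_pool_set_active (pool, TRUE);
  return pool;
}

/* Buffers acquired from the pool are returned to it automatically when
 * their last reference is dropped */
static GstBuffer *
get_buffer (GstBufferPool * pool)
{
  GstBuffer *buffer = NULL;

  gst_buffer_pool_acquire_buffer (pool, &buffer, NULL);
  return buffer;
}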

Arbitrary buffer metadata

In GStreamer 1.0 you can attach GstMeta to a buffer to represent arbitrary metadata, which can be used for all kinds of different things. GstMeta provides generic descriptions about the aspects of a buffer it applies to (tags) and functions to transform the meta (e.g. if a filter changes the aspect the meta applies to). The GstMeta supported by elements can also be negotiated between them to make sure all elements in a chain support a specific kind of metadata if they need to care about it.

Examples for GstMeta usage would be the addition of simple metadata to buffers: metadata like a rectangle defining a region of interest on a video frame (e.g. for face detection), the gamma transfer function that applies to a video frame, custom rowstrides, or special downmix matrices for converting 5.1 audio to stereo. While these are all examples that mostly apply to raw audio and video, it is also possible to implement metas that describe compressed formats, e.g. a meta that provides an MPEG sequence header as parsed C structures.

Apart from simple metadata it is also possible to use GstMeta for delayed processing, for example if all elements in a chain support a cropping metadata, it would be possible to not crop the video frame in the decoder but instead let the video sink handle the cropping based on a cropping rectangle put on the buffer as metadata. Or subtitles could be put on a buffer as metadata instead of overlaying them on top of the frame, and only in the sink after all other processing the subtitles would be overlaid on top of the video frame.

And then it’s also possible to use GstMeta to define “dynamic interfaces” on buffers. The metadata could for example contain a function pointer to a function that allows uploading the buffer to a GL texture, which could simply be implemented via glTexImage2D() for normal memory or via special VAAPI API for memory that is backed by VAAPI.

All the examples given in the previous paragraphs are already implemented by GStreamer and can just be re-used by other elements and we plan to add more generic GstMeta in the future. Look for GstVideoMeta, GstVideoCropMeta, GstVideoGLTextureUploadMeta, GstVideoOverlayMeta and GstAudioDownmixMeta for example.
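
For example, attaching a crop rectangle as metadata instead of actually cropping the pixels could look like this (a sketch, assuming a coded frame with 8 lines of padding at the bottom):

#include <gst/gst.h>
#include <gst/video/gstvideometa.h>

static void
attach_crop (GstBuffer * buffer)
{
  GstVideoCropMeta *crop = gst_buffer_add_video_crop_meta (buffer);

  crop->x = 0;
  crop->y = 0;
  crop->width = 1920;
  crop->height = 1080;  /* hide the 8 padding lines of the 1920x1088 coded frame */
}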

Do not use GstMeta for memory-specific information, e.g. as a lazy way to get around implementing a custom GstMemory and GstAllocator. It won’t work optimally!

Negotiation

Compared to 0.10, negotiation between elements was extended by another step. There’s a) caps negotiation and b) allocation negotiation. Each of these happens before data-flow and after receiving a RECONFIGURE event.

caps negotiation

During caps negotiation the pads of the elements decide on one specific format that is supported by both linked pads and also make it possible for further downstream elements to decide on a specific format. Caps in GStreamer represent such a specific format and its properties, and are written in a MIME-type-like way with additional fields, e.g. “video/x-h264,width=1920,height=1080,profile=main”.

GstCaps can contain a list of such formats, represented as GstStructures, and each of these structures contains a name and a list of fields. The values of the fields can either be concrete, single values or describe multiple values (e.g. by using integer ranges). On top of caps many generic operations are defined to check compatibility of caps, to create intersections, to merge caps or to convert them into a single, fixed format (which is required during negotiation). If multiple structures are contained in a caps, they are ordered by preference.

All this is nothing new, but in GStreamer 1.2 caps were extended with GstCapsFeatures, which allow adding additional constraints to caps. For example a specific memory type can be requested or the existence of specific metadata. This is used during negotiation to make sure only compatible elements are linked (e.g. ones that can only work with EGLImage memory would not be seen as compatible with ones that can only handle VAAPI memory, although everything else of the caps is the same), and is also used in autoplugging elements like playbin to select the most optimal decoder and sink combinations. An example of such caps with caps features would be “video/x-raw(memory:EGLImage),width=1920,height=1080” to specify that EGLImage memory is required.
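
Programmatically, building such caps could look like this small sketch:

#include <gst/gst.h>

static GstCaps *
make_eglimage_caps (void)
{
  GstCaps *caps = gst_caps_new_simple ("video/x-raw",
      "format", G_TYPE_STRING, "RGBA",
      "width", G_TYPE_INT, 1920,
      "height", G_TYPE_INT, 1080, NULL);

  /* Constrain the first (and only) structure to EGLImage memory */
  gst_caps_set_features (caps, 0,
      gst_caps_features_new ("memory:EGLImage", NULL));

  return caps;
}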

So, the actual negotiation of the caps happens from upstream to downstream. The most upstream element uses a CAPS query, which replaces the getcaps function of pads from 0.10 and has an additional filter argument in 1.0. The filter argument is used to tell downstream which caps are actually supported by upstream and in which order of preference, so it doesn’t have to create unnecessary caps. This query is then propagated further downstream to the next elements, translated if necessary (e.g. from raw video caps to h264 caps by an encoder, while still proxying the values of the width/height fields), then answered by the most downstream element that doesn’t propagate the query further downstream, filled with the caps supported by this element’s pad. On the way back upstream these result caps are further narrowed down, translated between different formats and potentially reordered by preference. Once the results are received in the most upstream element, it tries to choose one specific format from those supported by downstream and itself, while keeping the preferences of itself and downstream under consideration. Then as a last step it sends a CAPS event downstream to notify about the actually chosen caps. This CAPS event is then further propagated downstream by elements (translated if necessary) that don’t need further information to decide on their output format, or it is stored and later, when all necessary information is available (e.g. after the header buffers are received and parsed), a new CAPS event is sent downstream with the new caps.

This is all very similar to how the same worked in 0.10.

allocation negotiation

The next step, the allocation negotiation, is something new in 1.0. The ALLOCATION query is used by every element that needs to allocate buffers (i.e. sources, converters, decoders, encoders, etc.) to get information about the allocation possibilities from downstream.

The ALLOCATION query is created with the caps that the upstream element wants, and is first answered by the most downstream element that would by itself allocate new buffers again for its own downstream (i.e. converters, decoders, encoders). It is filled with any buffer pools that can be provided by downstream, allocators (and thus memory types), the minimum and maximum number of buffers, the allocation size, allocation parameters and the supported metas. On its way back upstream these results are filtered and narrowed down by the further upstream elements (e.g. to remove metas that are not supported by an element, or a memory type), and especially the minimum and maximum number of buffers is updated by each element. The querying element will then decide to use one of the buffer pools and/or allocators and which metas can be used and which not. It can also decide not to use any of the buffer pools or allocators and use its own that is compatible with the results of the query.
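
As an illustration of the answering side, a video sink could fill in the ALLOCATION query in its propose_allocation vfunc roughly like this (a simplified sketch with error handling omitted):

#include <gst/gst.h>
#include <gst/base/gstbasesink.h>
#include <gst/video/video.h>
#include <gst/video/gstvideopool.h>

static gboolean
my_sink_propose_allocation (GstBaseSink * bsink, GstQuery * query)
{
  GstCaps *caps;
  gboolean need_pool;
  GstVideoInfo info;

  gst_query_parse_allocation (query, &caps, &need_pool);
  if (caps == NULL || !gst_video_info_from_caps (&info, caps))
    return FALSE;

  if (need_pool) {
    GstBufferPool *pool = gst_video_buffer_pool_new ();
    GstStructure *config = gst_buffer_pool_get_config (pool);

    gst_buffer_pool_config_set_params (config, caps, info.size, 2, 0);
    gst_buffer_pool_set_config (pool, config);

    /* Offer this pool to upstream: buffer size, minimum 2 buffers, no maximum */
    gst_query_add_allocation_pool (query, pool, info.size, 2, 0);
    gst_object_unref (pool);
  }

  /* Announce that we can handle GstVideoMeta, e.g. custom rowstrides */
  gst_query_add_allocation_meta (query, GST_VIDEO_META_API_TYPE, NULL);

  return TRUE;
}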

The usage of downstream buffer pools or allocators is the 1.0 replacement of gst_pad_alloc_buffer() in 0.10, and it’s much more efficient as the complete way downstream does not have to be walked for every buffer allocation but only once.

Note: the caps in the ALLOCATION query are not required to be the same as in the CAPS event. For example if the video crop metadata is used, the caps in the CAPS event will contain the cropped width and height while the caps in the ALLOCATION query will contain the uncropped (larger) width and height.

Context sharing

Another problem that often needs to be solved for hardware integration (but also other use cases), is the sharing of some kind of context between elements. Maybe even before these elements are linked together. For example you might need to share a VADisplay or EGLDisplay between your decoder and your sink, or an OpenGL context (plus thread dispatching functionality), or unrelated to hardware integration you might want to share an HTTP session (e.g. cookies) with multiple elements.

In GStreamer 1.2 support for doing this was added to GStreamer core via GstContext.

GstContext allows sharing such a context in a generic way via a name to identify the type of context and a GstStructure to store various information about it. It is distributed in the pipeline the following way: Whenever an element needs a specific context it first checks if one of the correct type was set on it before (for which the GstElement::set_context vfunc would’ve been called). Otherwise it will query downstream and then upstream with the CONTEXT query to retrieve any locally known contexts of the correct type (e.g. a decoder could get a context from a sink). If this doesn’t give any result either, the element is posting a NEED_CONTEXT message on the bus. This is then sent upwards in the bin hierarchy and if one of the containing bins knows a context of the correct type it will synchronously set it on the element. Otherwise the application will get the NEED_CONTEXT message and has the possibility to set a context from a sync bus handler on the element. If the element still has no context set, it can create one itself and advertise it in the bin hierarchy and to the application with the HAVE_CONTEXT message. When receiving this message, bins will cache the contexts and will use them the next time a NEED_CONTEXT message is seen. The same goes for contexts that were set on a bin, those will also be cached by the bin for future use.
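
From an element's point of view the lookup could be sketched like this (heavily simplified; "my.example.Context" is a made-up context type, and real elements query both downstream and upstream and also advertise a newly created context with a HAVE_CONTEXT message):

#include <gst/gst.h>

static void
ensure_my_context (GstElement * element, GstPad * srcpad)
{
  GstQuery *query = gst_query_new_context ("my.example.Context");

  /* Ask the neighbouring elements via the CONTEXT query */
  if (gst_pad_peer_query (srcpad, query)) {
    GstContext *context = NULL;

    gst_query_parse_context (query, &context);
    if (context)
      gst_element_set_context (element, context);
  } else {
    /* Otherwise ask the surrounding bins and the application: one of them
     * may now call gst_element_set_context() on us */
    gst_element_post_message (element,
        gst_message_new_need_context (GST_OBJECT (element), "my.example.Context"));
  }

  gst_query_unref (query);
}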

All in all this does not require any interaction with the application, unless the application wants to configure which contexts are set on which elements.

To make the distribution of contexts more reliable, decodebin has a new “autoplug-query” signal, which can be used to forward CONTEXT (and ALLOCATION) queries of not-yet-exposed elements during autoplugging to downstream elements (e.g. sinks). This is also used by playbin to make sure elements can reach the sinks during autoplugging.
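
An application using decodebin directly could do something similar by connecting to this signal and forwarding interesting queries to a sink it already knows about. A sketch, assuming the handler receives the pad, the candidate element and the query as documented for decodebin:

    /* Sketch: forward CONTEXT/ALLOCATION queries of not-yet-linked elements
     * to a known sink ("sink" is passed as user_data here) */
    static gboolean
    on_autoplug_query (GstElement * decodebin, GstPad * pad,
        GstElement * element, GstQuery * query, gpointer user_data)
    {
      GstElement *sink = user_data;

      switch (GST_QUERY_TYPE (query)) {
        case GST_QUERY_CONTEXT:
        case GST_QUERY_ALLOCATION:
          return gst_element_query (sink, query);
        default:
          return FALSE;
      }
    }

    /* connected with:
     *   g_signal_connect (decodebin, "autoplug-query",
     *       G_CALLBACK (on_autoplug_query), sink);
     */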

Open issues

Does this mean that GStreamer is complete now? No 🙂 There are still some open issues around this, but these should be fixable in 1.x without any problems. All the building blocks are there. Overall hardware integration is already working much much better than in 0.10.

Reconfiguration

One remaining issue is the handling of device reconfiguration. Often it is required to release all memory allocated from a device before being able to set a new configuration on the device and allocate new memory. The problem with this is that GStreamer elements currently have no control over downstream or upstream elements that use their memory; there is no API to request that they release it. And even if there were, it would be a bit tricky to integrate into 3rd party libraries like libav, as they keep references to memory internally for things like reference frames.

Related to this is that currently buffer pools only keep track of buffers, but there’s nothing that makes it easy to keep track of memories. Implementing this over and over again in GstAllocator subclasses is error-prone and inconvenient.

All discussions about this and some initial ideas are currently being handled in Bugzilla #707534

Device probing

The other remaining issue is missing API for device probing, e.g. to get all available camera devices together with the GStreamer elements that can be used to access them. In 0.10 there was a very suboptimal solution for this, and it was removed for 1.0 without any replacement so far.

There is a patch with a new implementation of this in Bugzilla #678402. The new idea is to implement a custom GstPluginFeature that allows creating device probing instances for elements from the registry (similar to GstElementFactory). This is planned to be integrated really soon now and should be in 1.4.

Summary

In practice this currently means that it’s possible to use gst-vaapi transparently in playbin, without the application having to know about anything VAAPI related. playbin will automatically select the correct decoders and sinks and have them work together in an optimal way. This even works inside a WebKit HTML5 website together with fancy CSS3 effects.

gst-omx and the v4l2 plugin can now provide zerocopy operation, a slow ARM based device like the Raspberry Pi can decode HD video in realtime to EGLImages and display them, gst-plugins-gl has a solution to all OpenGL related threading problems, and many many more.

I think with what we have in GStreamer 1.2 we should be able to integrate all kinds of hardware in an optimal way and have it used transparently in pipelines. Some APIs might be missing, but there should be nothing that can’t be added rather easily now.

Streaming GStreamer pipelines via HTTP

In the past many people joined the GStreamer IRC channel on FreeNode and were asking how to stream a GStreamer pipeline to multiple clients via HTTP. Just explaining how to do it and that it’s actually quite easy might not be that convincing, so here’s a small tool that does exactly that. I called it http-launch and you can get it from GitHub here.

Given a GStreamer pipeline in GstParse syntax (same as e.g. gst-launch), it will start an HTTP server on port 8080, start the pipeline once the first client connects, and then serve all following clients from that single pipeline with the data it produces.

For example you could call it like this to stream a WebM stream:

http-launch webmmux streamable=true name=stream   videotestsrc ! vp8enc ! stream.   audiotestsrc ! vorbisenc ! stream.

Note that this is just a simple example of what you can do with GStreamer and not meant for production use. Something like gst-streaming-server would be better suited for that, especially once it gets support for HLS/DASH or similar protocols.

Now let’s walk through the most important parts of the code.

The HTTP server

First some short words about the HTTP server part. Instead of just using libsoup, I implemented a trivial HTTP server with GIO. Probably not 100% standards compliant or bug-free, but good enough for demonstration purposes :). Also this should be a good example of how the different network classes of GIO go together.

The HTTP server is based on a GSocketService, which listens on a specific port for new connections via a GLib main context, and notifies via a signal whenever there is a new connection. These new connections are provided as a GSocketConnection. These are line 424 and following, and line 240 and following.
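
In a nutshell, the server setup looks roughly like the following sketch (simplified; the callback name is made up):

    /* Sketch: listen on TCP port 8080 and get a callback per new connection */
    GError *error = NULL;
    GSocketService *service = g_socket_service_new ();

    g_socket_listener_add_inet_port (G_SOCKET_LISTENER (service), 8080,
        NULL, &error);
    g_signal_connect (service, "incoming", G_CALLBACK (on_new_connection), NULL);
    g_socket_service_start (service);

    /* on_new_connection() receives a GSocketConnection for each client */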

In lines 240 and following we start polling the GIOStream of the connection, to be notified whenever new data can be read from the stream. Based on this, non-blocking reading from the connection is implemented in line 188 and following. Something like this pattern for non-blocking reading/writing to a socket is also implemented in GStreamer’s GstRTSPConnection.
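
With GIO this pattern can look roughly like this (a sketch, not the exact http-launch code; on_read_ready and client are placeholders):

    /* Sketch: watch the connection's input stream and read without blocking */
    GInputStream *in =
        g_io_stream_get_input_stream (G_IO_STREAM (connection));
    GSource *source = g_pollable_input_stream_create_source (
        G_POLLABLE_INPUT_STREAM (in), NULL);

    g_source_set_callback (source, (GSourceFunc) on_read_ready, client, NULL);
    g_source_attach (source, NULL);

    /* later, inside on_read_ready(): */
    gchar buf[4096];
    gssize n = g_pollable_input_stream_read_nonblocking (
        G_POLLABLE_INPUT_STREAM (in), buf, sizeof (buf), NULL, NULL);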

Here we trivially read data until a complete HTTP message is received (i.e. “\r\n\r\n” is detected in what we read), which is then parsed with the GLib string functions. Only GET and HEAD requests are handled in very simple ways. The GET request will then lead us to the code that connects this HTTP server with GStreamer.
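
The parsing itself boils down to something like this sketch (simplified; buf is assumed to hold the nul-terminated request data accumulated so far):

    /* Sketch: detect a complete HTTP request and dispatch GET/HEAD */
    if (g_strstr_len (buf, -1, "\r\n\r\n") != NULL) {
      gchar **lines = g_strsplit (buf, "\r\n", -1);

      if (g_str_has_prefix (lines[0], "GET ")) {
        /* send the response headers, then hand the socket over to GStreamer */
      } else if (g_str_has_prefix (lines[0], "HEAD ")) {
        /* send only the response headers */
      }
      g_strfreev (lines);
    }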

Really, consider using libsoup if you want to implement an HTTP server or client!

The GStreamer pipeline

Now to the GStreamer part of this small application. The actual pipeline is, as explained above, passed via the commandline. This is then parsed and properly set up in line 362 and following. For this GstParse is used, which parses a pipeline string into a real GstBin.

As the pipeline string passed to http-launch must not contain a sink element but end in an element with the name “stream”, we’ll have to get this element now and add our own sink to the bin. We do this by getting the “stream” element via gst_bin_get_by_name(), setting up a GstGhostPad that proxies its source pad as a source pad of the bin, and then putting the bin created from the pipeline string and a sink element into a GstPipeline, where both (the bin and the sink) are then connected.
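
A simplified version of that setup, with the example pipeline from above hard-coded instead of taken from the command line, could look like this (a sketch, not the actual http-launch code):

    /* Sketch: parse a pipeline description, ghost the source pad of the
     * "stream" element and link the resulting bin to multisocketsink */
    GError *error = NULL;
    GstElement *bin = gst_parse_bin_from_description (
        "webmmux streamable=true name=stream "
        "videotestsrc ! vp8enc ! stream. audiotestsrc ! vorbisenc ! stream.",
        FALSE, &error);

    GstElement *stream = gst_bin_get_by_name (GST_BIN (bin), "stream");
    GstPad *srcpad = gst_element_get_static_pad (stream, "src");

    gst_element_add_pad (bin, gst_ghost_pad_new ("src", srcpad));
    gst_object_unref (srcpad);
    gst_object_unref (stream);

    GstElement *pipeline = gst_pipeline_new (NULL);
    GstElement *sink = gst_element_factory_make ("multisocketsink", NULL);

    gst_bin_add_many (GST_BIN (pipeline), bin, sink, NULL);
    gst_element_link (bin, sink);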

The sink we are using here is multisocketsink, which sends all data it receives to a set of application-provided GSockets. In line 390 and following we set up some properties on the sink that make sure that newly connected clients start from a keyframe and that the buffering for all clients inside multisocketsink is handled in a sensible way. Instead of letting new clients wait for the next keyframe, we could also explicitly request the pipeline to generate a new keyframe each time a client connects.
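
The configuration could look roughly like this (a sketch; the property values are illustrative, not necessarily what http-launch itself uses):

    /* Sketch: let new clients start from the most recent keyframe and drop
     * clients that lag behind for too long */
    gst_util_set_object_arg (G_OBJECT (sink), "sync-method", "latest-keyframe");
    gst_util_set_object_arg (G_OBJECT (sink), "recover-policy", "keyframe");
    g_object_set (sink, "timeout", (guint64) 10 * GST_SECOND, NULL);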

The last missing part is that, whenever we have successfully received a GET request from a client, we stop handling any reads/writes from the socket ourselves and pass it to multisocketsink. This is done in line 146. From this point onwards the socket for this client is only handled by multisocketsink. Additionally we start the pipeline here for the first client that has successfully connected.
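
In essence this boils down to the following sketch (the started flag is a stand-in for however the application tracks whether the pipeline is already running):

    /* Sketch: hand the client's socket over to multisocketsink and start the
     * pipeline for the very first client */
    GSocket *socket = g_socket_connection_get_socket (connection);

    g_signal_emit_by_name (sink, "add", socket);

    if (!started) {
      gst_element_set_state (pipeline, GST_STATE_PLAYING);
      started = TRUE;
    }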

I hope this showed a bit how one of the lesser known GStreamer elements can be used to stream media to multiple clients, and that GStreamer provides building blocks for almost everything already 😉