Over the last few weeks, GStreamer’s RTP stack got a couple of new and quite useful features. As they are not trivial to configure, mostly because there are so many different possible configurations, I decided to write about this a bit with some example code.
The features are RFC 6051-style rapid synchronization of RTP streams, which can be used for inter-stream (e.g. audio/video) synchronization as well as inter-device (i.e. network) synchronization, and the ability to easily retrieve absolute sender clock times per packet on the receiver side.
Note that all of this was already possible before with GStreamer via different mechanisms with different trade-offs. Obviously, not having working audio/video synchronization would simply not be acceptable, and I have talked about how to do inter-device synchronization with GStreamer before, for example at the GStreamer Conference 2015 in Düsseldorf.
The example code below will make use of the GStreamer RTSP Server library but can be applied to any kind of RTP workflow, including WebRTC, and is written in Rust, but the same can also be achieved in any other language. The full code can be found in this repository.
And for reference, the merge requests to enable all this are [1], [2] and [3]. You probably don’t want to backport those to an older version of GStreamer though, as there are dependencies on various other changes elsewhere. All of the following needs at least GStreamer from the git main branch as of today, or the upcoming 1.22 release.
Baseline Sender / Receiver Code
The starting point of the example code can be found here in the baseline branch. All the important steps are commented so it should be relatively self-explanatory.
Sender
The sender starts an RTSP server on the local machine on port 8554 and provides media with H264 video and Opus audio on the mount point /test. It can be started with
$ cargo run -p rtp-rapid-sync-example-send
After starting the server it can be accessed via GStreamer with e.g. gst-play-1.0 rtsp://127.0.0.1:8554/test, or similarly via VLC or any other software that supports RTSP.
This does not do anything special yet but lays the foundation for the following steps. It creates an RTSP server instance with a custom RTSP media factory, which in turn creates custom RTSP media instances. All this is not needed at this point yet but will allow for the necessary customization later.
One important aspect here is that the base time of the media’s pipeline is set to zero
pipeline.set_base_time(gst::ClockTime::ZERO);
pipeline.set_start_time(gst::ClockTime::NONE);
This allows the timeoverlay element that is placed in the video part of the pipeline to render the clock time over the video frames. We’re going to use this later to confirm on the receiver that the clock time on the sender and the one retrieved on the receiver are the same.
let video_overlay = gst::ElementFactory::make("timeoverlay", None)
    .context("Creating timeoverlay")?;
[...]
video_overlay.set_property_from_str("time-mode", "running-time");
It actually only supports rendering the running time of each buffer, but in a live pipeline with the base time set to zero the running time and pipeline clock time are the same. See the documentation for some more details about the time concepts in GStreamer.
Overall this creates the following RTSP stream producer bin, which will also be used in all the following steps:
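For orientation, a rough sketch of such a producer bin built via gst::parse_launch might look like the following. The use of test sources and the exact element settings are assumptions and not necessarily what the example repository does; the pay0 / pay1 names follow the usual GStreamer RTSP server convention for payloaders.
// Rough sketch only: an H264 + Opus producer bin with a timeoverlay in the
// video branch. Element choices here are assumptions, not the repository's code.
let bin = gst::parse_launch(concat!(
    "videotestsrc is-live=true ! timeoverlay time-mode=running-time ",
    "! x264enc tune=zerolatency ! rtph264pay name=pay0 ",
    "audiotestsrc is-live=true ! opusenc ! rtpopuspay name=pay1",
))
.context("Creating producer bin")?;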
Receiver
The receiver is a simple playbin pipeline that plays an RTSP URI given via command-line parameters and runs until the stream is finished or an error has happened.
It can be run with the following once the sender is started
$ cargo run -p rtp-rapid-sync-example-recv -- "rtsp://192.168.1.101:8554/test"
Please don’t forget to replace the IP with the IP of the machine that is actually running the server.
All the code should be familiar to anyone who ever wrote a GStreamer application in Rust, except for one part that might need a bit more explanation
pipeline.connect_closure(
    "source-setup",
    false,
    glib::closure!(|_playbin: &gst::Pipeline, source: &gst::Element| {
        source.set_property("latency", 40u32);
    }),
);
playbin is going to create an rtspsrc, and at that point it will emit the source-setup signal so that the application can do any additional configuration of the source element. Here we’re connecting a signal handler to that signal to do exactly that.
By default rtspsrc introduces a latency of 2 seconds, which is a lot more than what is usually needed. For live, non-VOD RTSP streams this value should be around the network jitter, and here we’re configuring it to 40 milliseconds.
Retrieval of absolute sender clock times
Now as the first step we’re going to retrieve the absolute sender clock times for each video frame on the receiver. They will be rendered by the receiver at the bottom of each video frame and will also be printed to stdout. The changes between the previous version of the code and this version can be seen here, and the final code here in the sender-clock-time-retrieval branch.
When running the sender and receiver as before, the video from the receiver should look similar to the following
The upper time that is rendered on the video frames is rendered by the sender, the bottom time is rendered by the receiver and both should always be the same unless something is broken here. Both times are the pipeline clock time when the sender created/captured the video frame.
In this configuration the absolute clock times of the sender are provided to the receiver via the NTP / RTP timestamp mapping provided by the RTCP Sender Reports. That’s also the reason why it takes about 5s until the receiver knows the sender’s clock time, as RTCP packets are not scheduled very often and by default only after about 5s. The RTCP interval can be configured on rtpbin together with many other things.
Sender
On the sender-side the configuration changes are rather small and not even absolutely necessary.
rtpbin.set_property_from_str("ntp-time-source", "clock-time");
By default the NTP time used in the RTCP packets is based on the local machine’s walltime clock, converted to the NTP epoch. While this works fine, this is not the clock that is used for synchronizing the media and as such there will be drift between the RTP timestamps of the media and the NTP time from the RTCP packets, which will be reset every time the receiver receives a new RTCP Sender Report from the sender.
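As a side note on what “converted to the NTP epoch” means: NTP time starts at 1900-01-01 while Unix time starts at 1970-01-01, so the conversion boils down to a fixed offset. A minimal illustration, not code from the example:
// Fixed offset between the Unix epoch (1970) and the NTP epoch (1900):
// 70 years including 17 leap days = 2_208_988_800 seconds.
const UNIX_TO_NTP_OFFSET_SECS: u64 = 2_208_988_800;

fn unix_to_ntp_seconds(unix_secs: u64) -> u64 {
    unix_secs + UNIX_TO_NTP_OFFSET_SECS
}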
Instead, we configure rtpbin here to use the pipeline clock as the source for the NTP timestamps used in the RTCP Sender Reports. This doesn’t give us (by default at least, see later) an actual NTP timestamp but it doesn’t have the drift problem mentioned before. Without further configuration, the clock used in this pipeline is the monotonic system clock.
rtpbin.set_property("rtcp-sync-send-time", false);
rtpbin normally uses the time when a packet is sent out for the NTP / RTP timestamp mapping in the RTCP Sender Reports. This is changed with this property to instead use the time when the video frame / audio sample was captured, i.e. it does not include all the latency introduced by encoding and other processing in the sender pipeline.
This doesn’t make any big difference in this scenario but usually one would be interested in the capture clock times and not the send clock times.
Receiver
On the receiver-side there are a few more changes. First of all we have to opt in to rtpjitterbuffer putting reference timestamp metadata on every received packet with the sender’s absolute clock time.
pipeline.connect_closure(
    "source-setup",
    false,
    glib::closure!(|_playbin: &gst::Pipeline, source: &gst::Element| {
        source.set_property("latency", 40u32);
        source.set_property("add-reference-timestamp-meta", true);
    }),
);
rtpjitterbuffer will start putting the metadata on packets once it knows the NTP / RTP timestamp mapping, i.e. after the first RTCP Sender Report is received in this case. Between the Sender Reports it is going to interpolate the clock times. The normal timestamps (PTS) on each packet are not affected by this and are still based on whatever clock is used locally by the receiver for synchronization.
To actually make use of the reference timestamp metadata we add a timeoverlay element as video-filter on the receiver:
let timeoverlay =
    gst::ElementFactory::make("timeoverlay", None).context("Creating timeoverlay")?;
timeoverlay.set_property_from_str("time-mode", "reference-timestamp");
timeoverlay.set_property_from_str("valignment", "bottom");
pipeline.set_property("video-filter", &timeoverlay);
This will then render the sender’s absolute clock times at the bottom of each video frame, as seen in the screenshot above.
And last we also add a pad probe on the sink pad of the timeoverlay element to retrieve the reference timestamp metadata of each video frame and then print the sender’s clock time to stdout:
let sinkpad = timeoverlay
    .static_pad("video_sink")
    .expect("Failed to get timeoverlay sinkpad");
sinkpad
    .add_probe(gst::PadProbeType::BUFFER, |_pad, info| {
        if let Some(gst::PadProbeData::Buffer(ref buffer)) = info.data {
            if let Some(meta) = buffer.meta::<gst::ReferenceTimestampMeta>() {
                println!("Have sender clock time {}", meta.timestamp());
            } else {
                println!("Have no sender clock time");
            }
        }

        gst::PadProbeReturn::Ok
    })
    .expect("Failed to add pad probe");
Rapid synchronization via RTP header extensions
The main problem with the previous code is that the sender’s clock times are only known once the first RTCP Sender Report is received by the receiver. There are many ways to configure rtpbin to make this happen faster (e.g. by reducing the RTCP interval or by switching to the AVPF RTP profile) but in any case the information would be transmitted outside the actual media data flow and it can’t be guaranteed that it is actually known on the receiver from the very first received packet onwards. This is of course not a problem in every use-case, but for the cases where it is, there is a solution for this problem.
RFC 6051 defines an RTP header extension that allows transmitting the NTP timestamp that corresponds to an RTP packet directly together with this very packet. And that’s what the next changes to the code are making use of.
The changes between the previous version of the code and this version can be seen here and the final code here in the rapid-synchronization branch.
Sender
To add the header extension on the sender-side it is only necessary to add an instance of the corresponding header extension implementation to the payloaders.
let hdr_ext = gst_rtp::RTPHeaderExtension::create_from_uri(
    "urn:ietf:params:rtp-hdrext:ntp-64",
)
.context("Creating NTP 64-bit RTP header extension")?;
hdr_ext.set_id(1);
video_pay.emit_by_name::<()>("add-extension", &[&hdr_ext]);
This first instantiates the header extension based on the uniquely defined URI for it, then sets its ID to 1 (see RFC 5285) and then adds it to the video payloader. The same is then done for the audio payloader.
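For illustration, the audio side might look like the following sketch; the audio_pay variable name is an assumption, and since header extension IDs are scoped per media stream, the same ID can be used again:
// Sketch of the audio payloader counterpart; audio_pay is a hypothetical variable name.
let hdr_ext = gst_rtp::RTPHeaderExtension::create_from_uri(
    "urn:ietf:params:rtp-hdrext:ntp-64",
)
.context("Creating NTP 64-bit RTP header extension")?;
hdr_ext.set_id(1);
audio_pay.emit_by_name::<()>("add-extension", &[&hdr_ext]);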
By default this will add the header extension to every RTP packet that has a different RTP timestamp than the previous one. In other words: on the first packet that corresponds to an audio or video frame. Via properties on the header extension this can be configured but generally the default should be sufficient.
Receiver
On the receiver-side no changes would actually be necessary. The use of the header extension is signaled via the SDP (see RFC 5285) and it will automatically be made use of inside rtpbin as another source of NTP / RTP timestamp mappings in addition to the RTCP Sender Reports.
However, we configure one additional property on rtpbin
source.connect_closure(
    "new-manager",
    false,
    glib::closure!(|_rtspsrc: &gst::Element, rtpbin: &gst::Element| {
        rtpbin.set_property("min-ts-offset", gst::ClockTime::from_mseconds(1));
    }),
);
Inter-stream audio/video synchronization
The reason for configuring the min-ts-offset property on the rtpbin is that the NTP / RTP timestamp mapping is not only used for providing the reference timestamp metadata but it is also used for inter-stream synchronization by default. That is, for getting correct audio / video synchronization.
With RTP alone there is no mechanism to synchronize multiple streams against each other as the RTP timestamps of different streams have no correlation to each other. This is not too much of a problem as usually the packets for audio and video are received approximately at the same time, but there’s still some inaccuracy in there.
One approach to fix this is to use the NTP / RTP timestamp mapping for each stream, either from the RTCP Sender Reports or from the RTP header extension, and that’s what is made use of here. And because the mapping is provided very often via the RTP header extension but the RTP timestamps are only accurate up to the clock rate (1/90000s for video and 1/48000s for audio in this case), we configure a threshold of 1ms for adjusting the inter-stream synchronization. Without this it would be adjusted almost continuously by a very small amount back and forth.
Other approaches for inter-stream synchronization are provided by RTSP itself before streaming starts (via the RTP-Info header), but due to a bug this is currently not made use of by GStreamer.
Yet another approach would be via the clock information provided by RFC 7273, about which I already wrote previously and which is also supported by GStreamer. This also allows inter-device, network synchronization and is used for that purpose as part of e.g. AES67, Ravenna, SMPTE 2022 / 2110 and many other protocols.
Inter-device network synchronization
Now for the last part, we’re going to add actual inter-device synchronization to this example. The changes between the previous version of the code and this version can be seen here and the final code here in the network-sync branch. This does not use the clock information provided via RFC 7273 (which would be another option) but uses the same NTP / RTP timestamp mapping that was discussed above.
When starting the receiver multiple times on different (or the same) machines, each of them should play back the media synchronized to each other and exactly 2 seconds after the corresponding audio / video frames are produced on the sender.
For this, both the sender and all receivers are using an NTP clock (pool.ntp.org in this case) instead of the local monotonic system clock for media synchronization (i.e. as the pipeline clock). Instead of an NTP clock it would also be possible to use any other mechanism for network clock synchronization, e.g. PTP or the GStreamer netclock.
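Creating such an NTP clock with the gstreamer-net bindings might look roughly like the following sketch; the exact constructor arguments used by the example repository may differ:
// Sketch: an NTP clock following pool.ntp.org on the standard NTP port 123.
let clock = gst_net::NtpClock::new(None, "pool.ntp.org", 123, gst::ClockTime::ZERO);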
println!("Syncing to NTP clock"); clock .wait_for_sync(gst::ClockTime::from_seconds(5)) .context("Syncing NTP clock")?; println!("Synced to NTP clock");
This code instantiates a GStreamer NTP clock and then synchronously waits up to 5 seconds for it to synchronize. If that fails then the application simply exits with an error.
Sender
On the sender side all that is needed is to configure the RTSP media factory, and as such the pipeline used inside it, to use the NTP clock
factory.set_clock(Some(&clock));
This causes all media inside the sender’s pipeline to be synchronized according to this NTP clock and to also use it for the NTP timestamps in the RTCP Sender Reports and the RTP header extension.
Receiver
On the receiver side the same has to happen
pipeline.use_clock(Some(&clock));
In addition a couple more settings have to be configured on the receiver though. First of all we configure a static latency of 2 seconds on the receiver’s pipeline.
pipeline.set_latency(gst::ClockTime::from_seconds(2));
This is necessary as GStreamer can’t know the latency of every receiver (e.g. different decoders might be used), and also because the sender latency can’t be automatically known. Each audio / video frame will be timestamped on the receiver with the NTP timestamp when it was captured / created, but since then all the latency of the sender, the network and the receiver pipeline has passed and for this some compensation must happen.
Which value to use here depends a lot on the overall setup, but 2 seconds is a (very) safe guess in this case. The value only has to be larger than the sum of sender, network and receiver latency and in the end has the effect that the receiver is showing the media exactly that much later than the sender has produced it.
And last we also have to tell rtpbin
- that sender and receiver clocks are synchronized to each other, i.e. in this case both are using exactly the same NTP clock, and that no translation to the pipeline’s clock is necessary, and
- that the outgoing timestamps on the receiver should be exactly the sender timestamps and that this conversion should happen based on the NTP / RTP timestamp mapping
source.set_property_from_str("buffer-mode", "synced");
source.set_property("ntp-sync", true);
And that’s it.
A careful reader will also have noticed that all of the above would also work without the RTP header extension, but then the receivers would only be synchronized once the first RTCP Sender Report is received. That’s what the test-netclock.c / test-netclock-client.c example from the GStreamer RTSP server is doing.
As usual with RTP, the above is by far not the only way of doing this and GStreamer also supports various other synchronization mechanisms. Which one is the correct one for a specific use-case depends on a lot of factors.