September 2020 – coaxion.net

Porting EBU R128 audio loudness analysis from C to Rust – Porting Details

This blog post is part two of a four part series

Overview, summary and motivation
Porting approach with various details, examples and problems I ran into along the way
Performance optimizations
Building Rust code into a C library as drop-in replacement

In this part I’ll go through the actual porting process of the libebur128 C code to Rust, the approach I’ve chosen with various examples and a few problems I was running into.

It will be rather technical. I won’t explain details about how the C code works but will only focus on the aspects that are relevant for porting to Rust, otherwise this blog post would become even longer than it already is.

Porting

With the warnings out of the way, let’s get started. As a reminder, the code can be found on GitHub and you can also follow along the actual chronological porting process by going through the git history there. It’s not very different to what will follow here but just in case you prefer looking at diffs instead.

Approach

The approach I’ve taken is basically the same that Federico took for librsvg or Joe Neeman’s took for nnnoiseless:

Start with the C code and safe Rust bindings around the C API
Look for a function or component very low in the call graph without dependencies on (too much) other C code
Rewrite that code in Rust and add an internal C API for it
Call the new internal C API for that new Rust code from the C code and get rid of the C implementation of the component
Make sure the tests are still passing
Go to 2. and repeat

Compared to what I did when porting the FFmpeg loudness normalization filter this has the advantage that at every step there is a working version of the code and you don’t only notice at the very end that somewhere along the way you made a mistake. At each step you can validate that what you did was correct and the amount of code to debug if something went wrong is limited.

Thanks to Rust having a good FFI story for interoperability with C in either direction, writing the parts of the code that are called from C or calling into C is not that much of a headache and not worse than actually writing C.

Rust Bindings around C Library

This step could’ve been skipped if all I cared about was having a C API for the ported code later, or if I wanted to work with the tests of the C library for validation and worry about calling it from Rust at a later point. In this case I had already done safe Rust bindings around the C library before, and having a Rust API made it much easier to write tests that could be used during the porting and that could be automatically run at each step.

`bindgen`

As a first step for creating the Rust bindings there needs to be a way to actually call into the C code. In C there are the header files with the type definitions and function declarations, but Rust can’t directly work from those. The solution to this was in this case bindgen, which basically converts the C header files into something that Rust can understand. The resulting API is completely unsafe still but can be used in a next step to write safe Rust bindings around it.

I would recommend using bindgen for any non-trivial C API for which there is no better translation tool available, or for which there is no machine-readable description of the API that could be used instead by another tool. Parsing C headers is no fun and there is very little information available in C for generating safe bindings. For example for GObject-based libraries, using gir would be a better idea as it works from a rich XML description of the API that contains information about e.g. ownership transfer and allows to autogenerate safe Rust bindings in many cases.

Also the dependency on clang makes it hard to run bindgen as part of every build, so instead I’ve made sure that the code generated by bindgen is platform independent and included it inside the repository. If you use bindgen, please try to do the same. Requiring clang for building your crate makes everything more complicated for your users, especially if they’re unfortunate enough to use Windows.

But back to the topic. What bindgen generates is basically a translation of the C header into Rust: type definitions and function declarations. This looks for example as follows

\#[repr(C)]
\#[derive(Debug, Copy, Clone)]
pub struct ebur128_state {
    pub mode: ::std::os::raw::c_int,
    pub channels: ::std::os::raw::c_uint,
    pub samplerate: ::std::os::raw::c_ulong,
    pub d: *mut ebur128_state_internal,
}

extern "C" {
    pub fn ebur128_init(
        channels: ::std::os::raw::c_uint,
        samplerate: ::std::os::raw::c_ulong,
        mode: ::std::os::raw::c_int,
    ) -> *mut ebur128_state;

    pub fn ebur128_destroy(st: *mut *mut ebur128_state);

    pub fn ebur128_add_frames_int(
        st: *mut ebur128_state,
        src: *const ::std::os::raw::c_int,
        frames: usize,
    ) -> ::std::os::raw::c_int;
}

Based on this it is possible to call the C functions directly from unsafe Rust code and access the members of all the structs. It requires working with raw pointers and ensuring that everything is done correctly at any point to not cause memory corruption or worse. It’s just like using the API from C with a slightly different syntax.

Build System

To be able to call into the C API its implementation somehow has to be linked into your crate. As the C code later also has to be modified to call into the already ported Rust functions instead of the original C code, it makes most sense to build it as part of the crate instead of linking to an external version of it.

This can be done with the cc crate. It is called into from cargo‘s build.rs for configuring it, for example for configuring which C files to compile and how. Once done it is possible to call any exported C function from the Rust code. The build.rs is not really complicated in this case

fn main() {
    cc::Build::new()
        .file("src/c/ebur128.c")
        .compile("ebur128");
}

Safe Rust API

With all that in place a safe Rust API around the unsafe C functions can be written now. How this looks in practice differs from API to API and might require some more thought in case of a more complex API to ensure everything is still safe and sound from a Rust point of view. In this case it was fortunately rather simple.

For example the struct definition, the constructor and the destructor (Drop impl) looks as follows based on what bindgen generated above

pub struct EbuR128(ptr::NonNull<ffi::ebur128_state>);

The struct is a simple wrapper around std::ptr::NonNull, which itself is a zero-cost wrapper around raw pointers that additionally ensures that the stored pointer is never NULL and allows additional optimizations to take place based on that.

In other words: the Rust struct is just a raw pointer but with additional safety guarantees.

impl EbuR128 {
    pub fn new(channels: u32, samplerate: u32, mode: Mode) -> Result<Self, Error> {
        static ONCE: std::sync::Once = std::sync::Once::new();

        ONCE.call_once(|| unsafe { ffi::ebur128_libinit() });

        unsafe {
            let ptr = ffi::ebur128_init(channels, samplerate as _, mode.bits() as i32);
            let ptr = ptr::NonNull::new(ptr).ok_or(Error::NoMem)?;
            Ok(EbuR128(ptr))
        }
    }
}

The constructor is slightly more complicated as it also has to ensure that the one-time initialization function is called, once. This requires using std::sync::Once as above.

After that it calls the C constructor with the given parameters. This can return NULL in various cases when not enough memory could be allocated as described in the documentation of the C library. This needs to be handled gracefully here and instead of panicking an error is returned to the caller. ptr::NonNull::new() is returning an Option and if NULL is passed it would return None. If this happens it is transformed into an error together with an early return via the ? operator.

In the end the pointer then only has to be wrapped in the struct and be returned.

impl Drop for EbuR128 {
    fn drop(&mut self) {
        unsafe {
            let mut state = self.0.as_ptr();
            ffi::ebur128_destroy(&mut state);
        }
    }
}

The Drop trait is used for defining what should happen if a value of the struct goes out of scope and what should be done to clean up after it. In this case this means calling the destroy function of the C library. It takes a pointer to a pointer to its state, which is then set to NULL. As such it is necessary to store the raw pointer in a local variable and pass a mutable reference to it. Otherwise the ptr::NonNull would end up with a NULL pointer inside it, which would result in undefined behaviour.

The last function that I want to mention here is the one that takes a slice of audio samples for processing

    pub fn add_frames_i32(&mut self, frames: &[i32]) -> Result<(), Error> {
        unsafe {
            if frames.len() % self.0.as_ref().channels != 0 {
                return Err(Error::NoMem);
            }

            let res = ffi::ebur128_add_frames_int(
                self.0.as_ptr(),
                frames.as_ptr(),
                frames.len() / self.0.as_ref().channels,
            );
            Error::from_ffi(res as ffi::error, || ())
        }
    }

Apart from calling the C function it is again necessary to check various pre-conditions before doing so. The C function will cause out of bounds reads if passed a slice that doesn’t contain a sample for each channel, so this must be checked beforehand or otherwise the caller (safe Rust code!) could cause out of bounds memory accesses.

In the end after calling the function its return value is converted into a Result, converting any errors into the crate’s own Error enum.

As can be seen here, writing safe Rust bindings around the C API requires reading of the documentation of the C code and keeping all the safety guarantees of Rust in mind to ensure that it is impossible to violate those safety guarantees, no matter what the caller passes into the functions.

Not having to read the documentation and still being guaranteed that the code can’t cause memory corruption is one of the big advantages of Rust. If there even is documentation or it mentions such details.

Replacing first function: Resampler

Once the safe Rust API is done, it is possible to write safe Rust code that makes use of the C code. And among other things that allows to write tests to ensure that the ported code still does the same as the previous C code. But this will be the topic of the next section. Writing tests is boring, porting code is more interesting and that’s what I will start with.

To find the first function to port, I first read the C code to get a general idea of how the different pieces fit together and what the overall structure of the code is. Based on this I selected the resampler that is used in the true peak measurement. It is one of the leaf functions of the call-graph, that is, it is not calling into any other C code and is relatively independent from everything else. Unlike many other parts of the code it is also already factored out into separate structs and functions.

In the C code this can be found in the interp_create(), interp_destroy() and interp_process() functions. The resulting Rust code can be found in the interp module, which provides basically the same API in Rust except for the destroy() function which Rust provides for free, and corresponding extern "C" functions that can be called from C.

The create() function is not very interesting from a porting point of view: it simply allocates some memory and initializes it. The C and Rust versions of it look basically the same. The Rust version is only missing the checks for allocation failures as those can’t currently be handled in Rust, and handling allocation failures like in the C code is almost useless with modern operating systems and overcommitting allocators.

Also the struct definition is not too interesting, it is approximately the same as the one in C just that pointers to arrays are replaced by Vecs and lengths of them are directly taken from the Vec instead of storing them separately. In a later version the Vec were replaced by boxed slices and SmallVec but that shouldn’t be a concern here for now.

The main interesting part here is the processing function and how to provide all the functions to the C code.

Processing Function: Using Iterators

The processing function, interp_process(), is basically 4 nested for loops over each frame in the input data, each channel in each frame, the interpolator factors and finally the filter coefficients.

unsigned int out_stride = interp->channels * interp->factor;

for (frame = 0; frame < frames; frame++) {
  for (chan = 0; chan < interp->channels; chan++) {
    interp->z[chan][interp->zi] = *in++;
    outp = out + chan;
    for (f = 0; f < interp->factor; f++) {
      acc = 0.0;
      for (t = 0; t < interp->filter[f].count; t++) {
        int i = (int) interp->zi - (int) interp->filter[f].index[t];
        if (i < 0) {
          i += (int) interp->delay;
        }
        c = interp->filter[f].coeff[t];
        acc += (double) interp->z[chan][i] * c;
      }
      *outp = (float) acc;
      outp += interp->channels;
    }
  }
  out += out_stride;
  interp->zi++;
  if (interp->zi == interp->delay) {
    interp->zi = 0;
  }
}

This could be written the same way in Rust but slice indexing is not very idiomatic in Rust and using Iterators is preferred as it leads to more declarative code. Also all the indexing would lead to suboptimal performance due to the required bounds checks. So the main task here is to translate the code to iterators as much as possible.

When looking at the C code, the outer loop iterates over chunks of channels samples from the input and chunks of channels * factor samples from the output. This is exactly what the chunks_exact iterator on slices does in Rust. Similarly the second outer loop iterates over all samples of the input chunks and the two-dimensional array z, which has delay items per channel. On the Rust side I represented z as a flat, one-dimensional array for simplicty, so instead of indexing the chunks_exact() iterator is used again for iterating over it.

This leads to the following for the two outer loops

for (src, dst) in src
    .chunks_exact(self.channels)
    .zip(dst.chunks_exact_mut(self.channels * self.factor)) {
    for (channel_idx, (src, z)) in src
        .iter()
        .zip(self.z.chunks_exact_mut(delay))
        .enumerate() {
        // insert more code here
    }
}

Apart from making it more clear what the data access patterns are, this is also less error prone and gives more information to the compiler for optimizations.

Inside this loop, a ringbuffer z / zi is used to store the incoming samples for each channel and keeping the last delay number of samples for further processing. We’ll keep this part as it is for now and use explicit indexing. While a VecDeque, or any similar data structure from other crates, could be used here, it would complicate the code more, cause more allocations. I’ll revisit this piece of code in part 3 of this blog post.

The first inner loop now iterates over all filters, of which there are factor many and chunks of size channels from the output, for which the same translation as before is used. The second inner loop iterates over all coefficients of the filter and the z ringbuffer, and sums up the product of both which is then stored in the output.

So overall the body of the second outer loop above with the two inner loops would look as follows

z[self.zi] = *src;

for (filter, dst) in self
    .filter
    .iter()
    .zip(dst.chunks_exact_mut(self.channels)) {
        let mut acc = 0.0;
        for (c, index) in &filter.coeff {
            let mut i = self.zi as i32 - *index as i32;
            if i < 0 {
                i += self.delay as i32;
            }
            acc += z[i] as f64 * c;
        }

        dst[channel_idx] = acc as f32;
    }
}

self.zi += 1;
if self.zi == self.delay {
    self.zi = 0;
}

The full code after porting can be found here.

My general approach for porting C code with loops is what I did above: first try to understand the data access patterns and then find ways to express these with Rust iterators, and only if there is no obvious way use explicit indexing. And if the explicit indexing turns out to be a performance problem due to bounds checks, first try to reorganize the data so that direct iteration is possible (which usually also improves performance due to cache locality) and otherwise use some well-targeted usages of unsafe code. But more about that in part 3 of this blog post in the context of optimizations.

Exporting C Functions

To be able to call the above code from C, a C-compatible function needs to be exported for it. This involves working with unsafe code and raw pointers again as that’s the only thing C understands. The unsafe Rust code needs to assert the implicit assumptions that can’t be expressed in C and that the calling C code has to follow.

For the interpolator, functions with the same API as the previous C code are exported to keep the amount of changes minimal: create(), destroy() and process() functions.

\#[no_mangle]
pub unsafe extern "C" fn interp_create(taps: u32, factor: u32, channels: u32) -> *mut Interp {
    Box::into_raw(Box::new(Interp::new(
        taps,
        factor,
        channels,
    )))
}

\#[no_mangle]
pub unsafe extern "C" fn interp_destroy(interp: *mut Interp) {
    drop(Box::from_raw(interp));
}

The #[no_mangle] attribute on functions makes sure that the symbols are exported as is instead of being mangled by the compiler. extern "C" makes sure that the function has the same calling convention as a corresponding C function.

For the create() function the interpolator struct is wrapped in a Box, which is the most basic mechanism to do heap allocations in Rust. This is needed because the C code shouldn’t have to know about the layout of the Interp struct and should just handle it as an opaque pointer. Box::into_raw() converts the Box into a raw pointer that can be returned directly to C. This also passes ownership of the memory to C.

The destroy() function is doing the inverse of the above and calls Box::from_raw() to get a Box back from the raw pointer. This requires that the raw pointer passed in was actually allocated via a Box and is of the correct type, which is something that can’t really be checked. The function has to trust the C code to do this correctly. The standard approach to memory safety in C: trust that everybody is writing correct code.

After getting back the Box, it only has to be dropped and Rust will take care of deallocating any memory as needed.

\#[no_mangle]
pub unsafe extern "C" fn interp_process(
    interp: *mut Interp,
    frames: usize,
    src: *const f32,
    dst: *mut f32,
) -> usize {
    use std::slice;

    let interp = &mut *interp;
    let src = slice::from_raw_parts(src, interp.channels * frames);
    let dst = slice::from_raw_parts_mut(dst, interp.channels * interp.factor * frames);

    interp.process(src, dst);

    interp.factor * frames
}

The main interesting part for the processing function is the usage of slice::from_raw_parts. From the C side, the function again has to trust that the pointer are correct and actually contains frames number of audio frames. In Rust a slice knows about its size so some conversion between the two is needed: a slice of the correct size has to be created around the pointer. This does not involve any copying of memory, it only stores the length information together with the raw pointer. This is also the reason why it’s not required anywhere to pass the length separately to the Rust version of the processing function.

With this the interpolator is fully ported and the C functions can be called directly from the C code. On the C side they are declared as follows and then called as before

typedef void * interpolator;

extern interpolator* interp_create(unsigned int taps, unsigned int factor, unsigned int channels);
extern void interp_destroy(interpolator* interp);
extern size_t interp_process(interpolator* interp, size_t frames, float* in, float* out);

The commit that did all this also adds some tests to ensure that everything still works correctly. It also contains some optimizations on top of the code above and is not 100% the same code.

Writing Tests

For testing that porting the code part by part to Rust doesn’t introduce any problems I went for the common two layer approach: 1. integration tests that test if the whole thing still works correctly, and 2. unit tests for the rewritten component alone.

The integration tests come in two variants inside the ebur128 module: one variant just testing via assertions that the results on a fixed input are the expected ones, and one variant comparing the C and Rust implementations. The unit tests only come in the second variant for now.

To test the C implementation in the integration tests an old version of the crate that had no ported code yet is pulled in. For comparing the C implementation of the individual components, I extracted the C code into a separate C file that exported the same API as the corresponding Rust code and called both from the tests.

Comparing Floating Point Numbers

The first variant is not very interesting apart from the complications involved when comparing floating point numbers. For this I used the float_eq crate, which provides different ways of comparing floating point numbers.

The first variant of tests use it by checking if the absolute difference between the expected and actual result is very small, which is less strict than the ULP check used for the second variant of tests. Unfortunately this was required because depending on the used CPU and toolchain the results differ from the expected static results (generated with one CPU and toolchain), while for the comparison between the C and Rust implementation on the same CPU the results are basically the same.

`quickcheck`

quickcheck is a crate that allows to write randomized property tests. This seemed like the perfect tool for writing tests that compare the two implementations: the property to check for is equality of results, and it should be true for any possible input.

Using quickcheck is simple in principle. You write a test function that takes the inputs as parameters, do the processing and then check that the property you want to test for holds by either using assertions or returning a bool or TestResult.

\#[quickcheck]
fn compare_c_impl_i16(signal: Signal<i16>) {
    let mut ebu = EbuR128::new(signal.channels, signal.rate, ebur128::Mode::all()).unwrap();
    ebu.add_frames_i16(&signal.data).unwrap();

    let mut ebu_c =
        ebur128_c::EbuR128::new(signal.channels, signal.rate, ebur128_c::Mode::all()).unwrap();
    ebu_c.add_frames_i16(&signal.data).unwrap();

    compare_results(&ebu, &ebu_c, signal.channels);
}

quickcheck will then generate random values for the function parameters via the Arbitrary impl of the given types and call it many times. If one run fails, it tries to find a minimal testcase that fails based on “shrinking” the initial failure case and then prints that failing, shrunk testcase.

And this is the part of using quickcheck that involves some more effort: writing a reasonable Arbitrary impl for the inputs that can also be shrunk in a useful way on failures.

For the tests here I came up with a Signal type. Its Arbitrary implementation creates an audio signal with 1-16 channels and multiple sine waves of different amplitudes and frequencies. Shrinking first tries to reproduce the problem with a single channel, and then halving the signal length.

This worked well in practice so far. It doesn’t cover all possible inputs but should cover anything that can fail, and the simple shrinking approach also helped to find smallish testcases if something failed. But of course it’s not a perfect solution, only a practical one.

Based on these sets of tests I could be reasonably certain that the C and Rust implementation provide exactly the same results, so I could start porting the next part of the code.

True Peak: Macros vs. Traits

C doesn’t really have any mechanisms for writing code that is generic over different types (other than void *), or any more advanced means of abstraction than functions and structs. For that reason the C code uses macros via the C preprocessor in various places to write code once for the different input types (i16, i32, f32 and f64 or in C terms short, int, float and double). The C preprocessor is just a fancy mechanism for string concatenation so this is rather unpleasant to write and read, results might not even be valid C code and the resulting compiler errors are often rather confusing.

In Rust macros could also be used for this. While more clean than C macros because of macro hygiene rules and working on a typed token tree instead of just strings, this still would end up with hard to write and read code with possibly confusing compiler errors. For abstracting over different types, Rust provides traits. These allow to write code generic over different types with a well-defined interface, and allow to do much more but I won’t cover more here.

One example of macro usage in the C code is the input processing, of which the true peak measurement is one part. In C this basically looks as follows, with some parts left out because they’re not relevant for the true peak measurement itself

static void ebur128_check_true_peak(ebur128_state* st, size_t frames) {
  size_t c, i, frames_out;

  frames_out =
      interp_process(st->d->interp, frames, st->d->resampler_buffer_input,
                     st->d->resampler_buffer_output);

  for (i = 0; i < frames_out; ++i) {
    for (c = 0; c < st->channels; ++c) {
      double val =
          (double) st->d->resampler_buffer_output[i * st->channels + c];

      if (EBUR128_MAX(val, -val) > st->d->prev_true_peak[c]) {
        st->d->prev_true_peak[c] = EBUR128_MAX(val, -val);
      }
    }
  }
}

\#define EBUR128_FILTER(type, min_scale, max_scale)                             \
  static void ebur128_filter_##type(ebur128_state* st, const type* src,        \
                                    size_t frames) {                           \
    static double scaling_factor =                                             \
        EBUR128_MAX(-((double) (min_scale)), (double) (max_scale));            \
    double* audio_data = st->d->audio_data + st->d->audio_data_index;          \
                                                                               \
    // some other code                                                         \
                                                                               \
    if ((st->mode & EBUR128_MODE_TRUE_PEAK) == EBUR128_MODE_TRUE_PEAK &&       \
        st->d->interp) {                                                       \
      for (i = 0; i < frames; ++i) {                                           \
        for (c = 0; c < st->channels; ++c) {                                   \
          st->d->resampler_buffer_input[i * st->channels + c] =                \
              (float) ((double) src[i * st->channels + c] / scaling_factor);   \
        }                                                                      \
      }                                                                        \
      ebur128_check_true_peak(st, frames);                                     \
    }                                                                          \
                                                                               \
    // some other code                                                         \
}

EBUR128_FILTER(short, SHRT_MIN, SHRT_MAX)
EBUR128_FILTER(int, INT_MIN, INT_MAX)
EBUR128_FILTER(float, -1.0f, 1.0f)
EBUR128_FILTER(double, -1.0, 1.0)

What the invocations of the macro at the bottom do is to take the whole macro body and replace the usages of type, min_scale and max_scale accordingly. That is, one ends up with a function ebur128_filter_short() that works on a const short *, uses 32768.0 as scaling_factor and the corresponding for the 3 other types.

To convert this to Rust, first a trait that provides all the required operations has to be defined and then implemented on the 4 numeric types that are supported as audio input. In this case, the only required operation is to convert the input values to an f32 between -1.0 and +1.0.

pub(crate) trait AsF32: Copy {
    fn as_f32(self) -> f32;
}

impl AsF32 for i16 {
    fn as_f32(self) -> f32 {
        self as f32 / -(i16::MIN as f32)
    }
}

// And the same for i32, f32 and f64

Once this trait is defined and implemented on the needed types the Rust function can be written generically over the trait

pub(crate) fn check_true_peak<T: AsF32>(&mut self, src: &[T], peaks: &mut [f64]) {
    assert!(src.len() <= self.buffer_input.len());
    assert!(peaks.len() == self.channels);

    for (o, i) in self.buffer_input.iter_mut().zip(src.iter()) {
        *o = i.as_f32();
    }

    self.interp.process(
        &self.buffer_input[..(src.len())],
        &mut self.buffer_output[..(src.len() * self.interp_factor)],
    );

    for (channel_idx, peak) in peaks.iter_mut().enumerate() {
        for o in self.buffer_output[..(src.len() * self.interp_factor)]
            .chunks_exact(self.channels) {
            let v = o[channel_idx].abs() as f64;
            if v > *peak {
              *peak = v;
            }
        }
    }
}

This is not a direct translation of the C code though. As part of rewriting the C code I also factored out the true peak detection from the filter function into its own function. It is called from the filter function shown in the C code a bit further above. This way it was easy to switch only this part from a C implementation to a Rust implementation while keeping the other parts of the filter, and to also test it separately from the whole filter function.

All this can be found in this commit together with tests and benchmarks. The overall code is a bit different than what is listed above, and also the latest version in the repository looks a bit different but more on that in part 3 of this blog post.

One last thing worth mentioning here is that the AsF32 trait is not public API of the crate and neither are the functions generic over the input type. Instead, the generic functions are only used internally and the public API only provides 4 functions that are specialized to the concrete input types. This keeps the API surface smaller and makes the API easier to understand for users.

Loudness History

The next component I ported to Rust was the loudness history data structure. This is used to keep a history of previous loudness measurements for giving longer-term results and it exists in two variants: a histogram-based one and a queue-based one with a maximum length.

As part of the history data structure there are also a couple of operations to calculate values from it, but I won’t go into details of them here. They were more or less direct translations of the C code.

In C this data structure and the operations on it were distributed all over the code and part of the main state struct, so the first step to port it over to Rust was to identify all those pieces and put them behind some kind of API.

This is also something that I noticed when porting the FFmpeg loudness normalization filter: Rust making it much less effort to define new structs, functions or even modules than C seems to often lead to more modular code with clear component boundaries instead of everything put together in the same place. Requirements from the borrow checker often also make it more natural to split components into separate structs and functions.

Check the commit for the full details but in the end I ended up with the following functions that are called from the C code

typedef void * history;

extern history* history_create(int use_histogram, size_t max);
extern void history_add(history* hist, double energy);
extern void history_set_max_size(history* hist, size_t max);
extern double history_gated_loudness(const history* hist);
extern double history_relative_threshold(const history* hist);
extern double history_loudness_range(const history* hist);
extern void history_destroy(history *hist);

Enums

In the C code the two variants of the history were implemented by having both of them always present in the state struct but only one of them initialized and then at every usage site having code like the following

if (st->d->use_histogram) {
    // histogram code follows here
} else {
    // queue code follows here
}

Doing it like this is error prone, easy to forget and having fields in the struct that are unused in certain configurations seems wasteful. In Rust this situation is naturally expressed with enums

enum History {
    Histogram(Histogram),
    Queue(Queue),
}

struct Histogram { ... }
struct Queue { ... }

This allows to use the same storage for both variants and at each usage site the compiler enforces that both variants are explicitly handled

match history {
    History::Histogram(ref hist) => {
        // histogram code follows here
    }
    History::Queue(ref queue) => {
        // queue code follows here
    }
}

Splitting it up like this with an enum also leads to implementing the operations of the two variants directly on their structs and only implementing the common code directly implemented on the history enum. This also improves readability because it’s immediately clear what a piece of code applies to.

Logarithms

As mentioned in the overview already, there are some portability concerns in the C code. One of them showed up when porting the history and comparing the results of the ported code with the C code. This resulted in the following rather ugly code

fn energy_to_loudness(energy: f64) -> f64 {
    #[cfg(feature = "internal-tests")]
    {
        10.0 * (f64::ln(energy) / f64::ln(10.0)) - 0.691
    }
    #[cfg(not(feature = "internal-tests"))]
    {
        10.0 * f64::log10(energy) - 0.691
    }
}

In the C code, ln(x) / ln(10) is used everywhere for calculating the base 10 logarithm. Mathematically that’s the same thing but in practice this is unfortunately not true and the explicit log10() function is both faster and more accurate. Unfortunately it’s not available everywhere in C (it’s available since C99) so was not used in the C code. In Rust it is always available so I first used it unconditionally.

When running the tests later, they failed because the results of C code where slightly different than the results of the Rust code. In the end I tracked it down to the usage of log10(), so for now when comparing the two implementations the slower and less accurate version is used.

Lazy Initialization

Another topic I already mentioned in the overview is one-time initialization. For using the histogram efficiently it is necessary to calculate the values at the edges of each histogram bin as well as the center values. These values are always the same so they could be calculated once up-front. The C code calculated them whenever a new instance was created.

In Rust, one could build something around std::sync::Once together with static mut variables for storing the data, but that would be not very convenient and also would require using some unsafe code. static mut variables are inherently unsafe. Instead this can be simplified with the lazy_static or once_cell crates, and the API of the latter is also available now as part of the Rust standard library in the nightly versions.

Here I used lazy_static, which leads to the following code

lazy_static::lazy_static! {
    static ref HISTOGRAM_ENERGIES: [f64; 1000] = {
        let mut energies = [0.0; 1000];

        for (i, o) in energies.iter_mut().enumerate() {
            *o = f64::powf(10.0, (i as f64 / 10.0 - 69.95 + 0.691) / 10.0);
        }

        energies
    };

    static ref HISTOGRAM_ENERGY_BOUNDARIES: [f64; 1001] = {
        let mut boundaries = [0.0; 1001];

        for (i, o) in boundaries.iter_mut().enumerate() {
            *o = f64::powf(10.0, (i as f64 / 10.0 - 70.0 + 0.691) / 10.0);
        }

        boundaries
    };
}

On first access to e.g. HISTOGRAM_ENERGIES the corresponding code would be executed and from that point onwards it would be available as a read-only array with the calculated values. In practice this later turned out to cause performance problems, but more on that in part 3 of this blog post.

Another approach for calculating these constant numbers would be to calculate them at compile-time via const functions. This is almost possible with Rust now, the only part missing is a const variant of f64::powf(). It is also not available as a const function in C++ so there is probably a deeper reason behind this. Otherwise the code would look exactly like the code above except that the variables would be plain static instead of static ref and all calculations would happen at compile-time.

In the latest version of the code, and until f64::powf() is available as a const function, I’ve decided to simply include a static array with the calculated values inside the code.

Data Structures

And the last topic for the history is an implementation detail of the queue-based implementation. As I also mentioned during the overview, the C code is using a linked-list-based queue, and this is exactly where it is used.

The queue is storing one f64 value per entry, which means that in the end there is one heap memory allocation of 12 or 16 bytes per entry, depending on pointer size. That’s a lot of very small allocations, each allocation is 50-100% bigger than the actual payload and that’s ignoring any overhead by the allocator itself. By this quite some memory is wasted, and by having each value in a different allocation and having to follow a pointer to the next one, any operations on all values is not going to be very cache efficient.

As C doesn’t have any data structures in its standard library and this linked-list-based queue is something that is readily available on the BSDs and Linux at least, it probably makes sense to use it here instead of implementing a more efficient data structure inside the code. But it still seems really suboptimal for this use-case.

In Rust the standard library provides an ringbuffer-based VecDeque, which offers exactly the API that is needed here, stores all values tightly packed and thus doesn’t have any memory wasted per value and at the same time provides better cache efficiency. And it is available everywhere where the Rust standard library is available, unlike the BSD queue used by the C implementation.

In practice, apart from the obvious savings in memory, this also caused the Rust implementation without any further optimizations to take only 50%-70% of the time that the C implementation took, depending on the operation.

Filter: Flushing Denormals to Zero

Overall porting the filter function from C was the same as everything mentioned before so I won’t go into details here. The whole commit porting it can be found here.

There is only one aspect I want to focus on here: if available on x86/x86-64 the MXCSR register temporarily gets the _MM_FLUSH_ZERO_ON bit set to flush denormal floating point number to zero. That is, denormals (i.e. very small numbers close to zero) as result of any floating point operation are considered to be zero. If hardware support is not available, at the end of each filter call the values kept for the next call are manually set to zero if they contain denormals.

This is done in the C code for performance reasons. Operations on denormals are generally much slower than on normalized floating point numbers and it has a measurable impact on the performance in this case.

In Rust this had to be replicated. Not only for the performance reasons but also because otherwise the results of both implementations would be slightly different and comparing them in the tests would be harder.

On the C side, this requires some build system integration and usage of the C preprocessor to decide whether the hardware support for this can be used or not, and then some conditional code that is used in the EBUR128_FILTER macro that was shown a few sections above already. Specifically this is the code

\#if defined(__SSE2_MATH__) || defined(_M_X64) || _M_IX86_FP >= 2
\#include <xmmintrin.h>
\#define TURN_ON_FTZ                                                            \
  unsigned int mxcsr = _mm_getcsr();                                           \
  _mm_setcsr(mxcsr | _MM_FLUSH_ZERO_ON);
\#define TURN_OFF_FTZ _mm_setcsr(mxcsr);
\#define FLUSH_MANUALLY
\#else
\#warning "manual FTZ is being used, please enable SSE2 (-msse2 -mfpmath=sse)"
\#define TURN_ON_FTZ
\#define TURN_OFF_FTZ
\#define FLUSH_MANUALLY                                                         \
  st->d->v[c][4] = fabs(st->d->v[c][4]) < DBL_MIN ? 0.0 : st->d->v[c][4];      \
  st->d->v[c][3] = fabs(st->d->v[c][3]) < DBL_MIN ? 0.0 : st->d->v[c][3];      \
  st->d->v[c][2] = fabs(st->d->v[c][2]) < DBL_MIN ? 0.0 : st->d->v[c][2];      \
  st->d->v[c][1] = fabs(st->d->v[c][1]) < DBL_MIN ? 0.0 : st->d->v[c][1];
\#endif

This is not really that bad and my only concern here would be that it’s relatively easy to forget calling TURN_OFF_FTZ once the filter is done. This would then affect all future floating point operations outside the filter and potentially cause a lot of problems. This blog post gives a nice example of an interesting bug caused by this and shows how hard it was to debug it.

When porting this to more idiomatic Rust, this problem does not exist anymore.

This is the Rust implementation I ended up with

\#[cfg(all(
    any(target_arch = "x86", target_arch = "x86_64"),
    target_feature = "sse2"
))]
mod ftz {
    #[cfg(target_arch = "x86")]
    use std::arch::x86::{_mm_getcsr, _mm_setcsr, _MM_FLUSH_ZERO_ON};
    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::{_mm_getcsr, _mm_setcsr, _MM_FLUSH_ZERO_ON};

    pub struct Ftz(u32);

    impl Ftz {
        pub fn new() -> Option<Self> {
            unsafe {
                let csr = _mm_getcsr();
                _mm_setcsr(csr | _MM_FLUSH_ZERO_ON);
                Some(Ftz(csr))
            }
        }
    }

    impl Drop for Ftz {
        fn drop(&mut self) {
            unsafe {
                _mm_setcsr(self.0);
            }
        }
    }
}

\#[cfg(not(any(all(
    any(target_arch = "x86", target_arch = "x86_64"),
    target_feature = "sse2"
))))]
mod ftz {
    pub struct Ftz;

    impl Ftz {
        pub fn new() -> Option<Self> {
            None
        }
    }
}

While a bit longer, it is also mostly whitespace. The important part to notice here is that when using the hardware support, a struct with a Drop impl is returned and once this struct is leaving the scope it would reset the MXCSR register again to its previous value. This way it can’t be forgotten and would also be reset as part of stack unwinding in case of panics.

On the usage side this looks as follows

let ftz = ftz::Ftz::new();

// all the calculations

if ftz.is_none() {
    // manual flushing of denormals to zero
}

No macros are required in Rust for this and all the platform-specific code is nicely abstracted away in a separate module. In the future support for this on e.g. ARM could be added and it would require no changes anywhere else, just the addition of another implementation of the Ftz struct.

Making it bullet-proof

As Anthony Ramine quickly noticed and told me on Twitter, the above is not actually sufficient. For non-malicious code using the Ftz type everything is alright: on every return path, including panics, the register would be reset again.

However malicious (or simply very confused?) code could make use of e.g. mem::forget(), Box::leak() or some other function to “leak” the Ftz value and cause the Drop implementation to never actually run and reset the register’s value. It’s perfectly valid to leak memory in safe Rust, so it’s not a good idea to rely on Drop implementations too much.

The solution for this can be found in this commit but the basic idea is to never actually give out a value of the Ftz type but only pass an immutable reference to safe Rust code. This then basically looks as follows

mod ftz {
    pub fn with_ftz<F: FnOnce(Option<&Ftz>) -> T, T>(func: F) -> T {
        unsafe {
            let ftz = Ftz::new();
            func(Some(&ftz))
        }
    }
}

ftz::with_ftz(|ftz| {
    // do things or check if `ftz` is None
});

This way it is impossible for any code outside the ftz module to leak the value and prevent resetting of the register.

Input Processing: Order of Operations Matters

The other parts of the processing code were relatively straightforward to port and not really different from anything I already mentioned above. However as part of porting that code I ran into a problem that took quite a while to debug: Once ported, the results of the C and Rust implementation were slightly different again.

I went through the affected code in detail and didn’t notice anything obvious. Both the C code and the Rust code were doing the same, so why are the results different?

This is the relevant part of the C code

size_t i;
double channel_sum;

channel_sum = 0.0;
if (st->d->audio_data_index < frames_per_block * st->channels) {
  for (i = 0; i < st->d->audio_data_index / st->channels; ++i) {
    channel_sum += st->d->audio_data[i * st->channels + c] *
                   st->d->audio_data[i * st->channels + c];
  }
  for (i = st->d->audio_data_frames -
           (frames_per_block - st->d->audio_data_index / st->channels);
       i < st->d->audio_data_frames; ++i) {
    channel_sum += st->d->audio_data[i * st->channels + c] *
                   st->d->audio_data[i * st->channels + c];
  }
} else {
  for (i = st->d->audio_data_index / st->channels - frames_per_block;
       i < st->d->audio_data_index / st->channels; ++i) {
    channel_sum += st->d->audio_data[i * st->channels + c] *
                   st->d->audio_data[i * st->channels + c];
  }
}

and the first version of the Rust code

let mut channel_sum = 0.0;

if audio_data_index < frames_per_block * channels {
    channel_sum += audio_data[..audio_data_index]
        .chunks_exact(channels)
        .map(|f| f[c] * f[c])
        .sum();

    channel_sum += audio_data
        [(audio_data.len() - frames_per_block * channels + audio_data_index)..]
        .chunks_exact(channels)
        .map(|f| f[c] * f[c])
        .sum();
} else {
    channel_sum += audio_data
        [(audio_data_index - frames_per_block * channels)..audio_data_index]
        .chunks_exact(channels)
        .map(|f| f[c] * f[c])
        .sum();
}

The difference between the two variants is the order of floating point operations in the if branch. The C code sums up all values into the same accumulator, while the Rust code first sums both parts into a separate accumulator and then adds them together. I changed it to do exactly the same and that caused the tests to pass again.

The order in which floating point operations are done matters, unfortunately, and in the example above the difference was big enough to cause the tests to fail. And the above is a nice practical example that shows that addition on floating point numbers is actually not associative.

Last C Code: Replacing API Layer

The last step was the most satisfying one: getting rid of all the C code. This can be seen in this commit. Note that the performance numbers in the commit message are wrong. At that point both versions were much closer already performance-wise and the Rust implementation was even faster in some configurations.

Up to that point all the internal code was already ported to Rust and the only remaining C code was the public C API, which did some minor tasks and then called into the Rust code. Technically this was not very interesting so I won’t get into any details here. It doesn’t add any new insights and this blog post is already long enough!

If you check the git history, all commits that followed after this one were cleanups, addition of some new features, adding more tests, performance optimizations (see the next part of this blog post) and adding a C-compatible API around the Rust implementation (see the last part of this blog post). The main part of the work was done.

Difficulty and Code Size

With the initial porting out of the way, I can now also answer the first two questions I wanted to get answered as part of this project.

Porting the C code to Rust was not causing any particular difficulties. The main challenges were

Understanding what the C code actually does and mapping the data access patterns to Rust iterators for more idiomatic and faster code. This also has the advantage of making the code clearer than with the C-style for loops. Porting was generally a pleasant experience and the language did not get in the way when implementing such code.
Also, by being able to port it incrementally I could always do a little bit during breaks and didn’t have to invest longer, contiguous blocks of time into the porting.
Refactoring the C code to be able to replace parts of it step by step with Rust implementations. This was complicated a bit by the fact that the C code did not factor out the different logical components nicely but instead kept everything entangled.
From having worked a lot with C for the more than 15 years, I wouldn’t say that this is because the code is bad (it is not!) but simply because C encourages writing code like this. Defining new structs or new functions seems like effort, and even worse if you try to move code into separate files because then you also have to worry about a header file and keeping the code and the header file in sync. Rust simplifies this noticeably and the way how the language behaves encourages splitting the code more into separate components.

Now for the size of the code. This is a slightly more complicated question. Rust and the default code style of rustfmt cause code to be spread out over more lines and to have more whitespace than the structurally same code in C with the common code styles used in C. In my experience, visually Rust code looks much less dense than C code for this reason.

Intuitively I would say that I have written much less Rust code for the actual implementation than there was C code, but lacking any other metrics let’s take a look at the lines of code while ignoring tests and comments. I used tokei for this.

1211 lines for the Rust code
1741 lines for the C code. Of this, 494 lines are headers and 367 lines of the headers are the queue implementation. That is, there are 1247 lines of non-header C code.

This makes the Rust implementation only slightly smaller if we ignore the C headers. Rust allows to write more concise code so I would have expected the difference to be bigger. At least partially this can probably be attributed to the different code formatting that causes Rust code to be less visually dense and as a result have code spread out over more lines than otherwise.

In any case, overall I’m happy with the results so far.

I will look at another metric of code size in the last part of this blog post for some further comparison: the size of the compiled code.

Next Part

In the next part of this blog post I will describe the performance optimizations I did to make the Rust implementation at least as fast as the C one and the problems I ran into while doing so. The previous two parts of the blog post had nothing negative to say about Rust but this will change in the third part. The Rust implementation without any optimizations was already almost as fast as the C implementation thanks to how well idiomatic Rust code can be optimized by the compiler, but the last few percent were not as painless as one would hope. In practice the performance difference probably wouldn’t have mattered.

From looking at the git history and comparing the code, you will also notice that some of the performance optimizations already happened as part of the porting. The final code is not exactly what I presented above.

Porting EBU R128 audio loudness analysis from C to Rust

Over the last few weeks I ported the libebur128 C library to Rust, both with a proper Rust API as well as a 100% compatible C API.

This blog post will be split into 4 parts that will be published over the next weeks

Overview and motivation
Porting approach with various details, examples and problems I ran into along the way
Performance optimizations
Building Rust code into a C library as drop-in replacement

If you’re only interested in the code, that can be found on GitHub and in the ebur128 crate on crates.io.

The initial versions of the ebur128 crate was built around the libebur128 C library (and included its code for ease of building), version 0.1.2 and newer is the pure Rust implementation.

Contents

1 EBU R128
2 Why?
- 2.1 Curiosity
- 2.2 Portability
3 Overview
4 Next Part

EBU R128

libebur128 implements the EBU R128 loudness standard. The Wikipedia page gives a good summary of the standard, but in short it describes how to measure loudness of an audio signal and how to use this for loudness normalization.

While this intuitively doesn’t sound very complicated, there are lots of little details (like how human ears are actually working) that make this not as easy as one might expect. This results in there being many different ways for measuring loudness and is one of the reasons why this standard was introduced. Of course it is also not the only standard for this.

libebur128 is also the library that I used in the GStreamer loudness normalization plugin, about which I wrote a few weeks ago already. By porting the underlying loudness measurement code to Rust, the only remaining C dependency of that plugin is GStreamer itself.

Apart from that it is used by FFmpeg, but they include their own modified copy, as well as many other projects that need some kind of loudness measurement and don’t use ReplayGain, another older but widely used standard for the same problem.

Why?

Before going over the details of what I did, let me first explain why I did this work at all. libebur128 is a perfectly well working library, in wide use for a long time and probably rather bug-free at this point and it was already possible to use the C implementation from Rust just fine. That’s what the initial versions of the ebur128 crate were doing.

My main reason for doing this was simply because it seemed like a fun little project. It isn’t a lot of code that is changing often so once ported it should be more or less finished and it shouldn’t be much work to stay in sync with the C version. I started thinking about doing this already after the initial release of the C-based ebur128 release, but after reading Joe Neeman’s blog post about porting another C audio library (RNNoise) to Rust this gave me the final push to actually start with porting the code and to follow through until it’s done.

However, don’t go around and ask other people to rewrite their projects in Rust (don’t be rude) or think that your own rewrite is magically going to be much faster and less buggy than the existing implementation. While Rust saves you from a big class of possible bugs, it doesn’t save you from yourself and usually rewrites contain bugs that didn’t exist in the original implementation. Also getting good performance in Rust requires, like in every other language, some effort. Before rewriting any software, think about the goals of this rewrite realistically as well as the effort required to actually get it finished.

Apart from fun there were also a few technical and non-technical reasons for me to look into this. I’m going to just list two here (curiosity and portability). I will skip the usual Rust memory-safety argument as that seems less important with this code: the C code is widely used for a long time, not changing a lot and has easy to follow memory access patterns. While it definitely had a memory safety bug (see above), it was rather difficult to trigger and it was fixed in the meantime.

Curiosity

Personally and at my company Centricular we try to do any new projects where it makes sense in Rust. While this worked very well in the past and we got great results, there were some questions for future projects that I wanted to get some answers, hard data and personal experience for

How difficult is it to port a C codebase function by function to Rust while keeping everything working along the way?
How difficult is it to get the same or better performance with idiomatic Rust code for low-level media processing code?
How much bigger or smaller is the resulting code and do Rust’s higher-level concepts like iterators help to keep code concise?
How difficult is it to create a C-compatible library in Rust with the same API and ABI?

I have some answers to all these questions already but previous work on this was not well structured and the results were also not documented, which I’m trying to change here now. Both to have a reference for myself in the future as well as for convincing other people that Rust is a reasonable technology choice for such projects.

As you can see the general pattern of these questions are introducing Rust into an existing codebase, replacing existing components with Rust and writing new components in Rust, which is also relates to my work on the Rust GStreamer bindings.

Portability

C is a very old language and while there is a standard, each compiler has its own quirks and each platform different APIs on top of the bare minimum that the C standard defines. C itself is very portable, but it is not easy to write portable C code, especially when not using a library like GLib that hides these differences and provides basic data structures and algorithms.

This seems to be something that is often forgotten when the portability of C is given as an argument against Rust, and that’s the reason why I wanted to mention this here specifically. While you can get a C compiler basically everywhere, writing C code that also runs well everywhere is another story and C doesn’t make this easy by design. Rust on the other hand makes writing portable code quite easy in my experience.

In practice there were three specific issues I had for this codebase. Most of the advantages of Rust here are because it is a new language and doesn’t have to carry a lot of historical baggage.

Mathematical Constants and Functions

Mathematical constants are not actually part of any C standard. While most compilers just define M_PI (for π), M_E (for 𝖾) and others in math.h nonetheless as they’re defined by POSIX and UNIX98.

Microsoft’s MSVC doesn’t, but instead you have to #define _USE_MATH_DEFINES before including math.h.

While not a big problem per-se, it is annoying and indeed caused the initial version of the ebur128 Rust crate to not compile with MSVC because I forgot about it.

Similarly, which mathematical functions are available depends a lot on the target platform and which version of the C standard is supported. An example of this is the log10 function to calculate the base-10 logarithm. For portability reasons, libebur128 didn’t use it but instead calculated it via the natural logarithm (ln(x) / ln(10) = log10(x)) because it’s only available in POSIX and since C99. While C99 is from 1999, there are still many compilers out there that don’t fully support it, again most prominently MSVC until very recently.

Using log10 instead of going via the natural logarithm is faster and more precise due to floating point number reasons, which is why the Rust implementation uses it but in C it would be required to check at build-time if the function is available or not, which complicates the build process and can easily be forgotten. libebur128 decided to not bother with these complications and simply not use it. Because of that, some conditional code in the Rust implementation is necessary for ensuring that both implementations return the same results in the tests.

Data Structures

libebur128 uses a linked-list-based queue data structure. As the C standard library is very minimal, no collection data structures are included. However on the BSDs and also on Linux with the GNU C library there is one available in sys/queue.h.

Of course MSVC does not have this and other compilers/platforms probably won’t have it either, so libebur128 included a local copy of that queue implementation. Now when building, one has to decide whether there is a system implementation available or otherwise use the internal version. Or simply always use the internal version.

Copying implementations of basic data structures and algorithms into every single project is ugly and error-prone, so let’s maybe not do that. C not having a standardized mechanism for dependency handling doesn’t help with this, which is unfortunately why this is very common in C projects.

One-time Initialization

Thread-safe one-time initialization is another thing that is not defined by the C standard, and depending on your platform there are different APIs available for it or none at all. POSIX again defines one that is widely available, but you can’t really depend on it unconditionally.

This complicates the code and build procedure, so libebur128 simply did not do that and did its one-time initializations of some global arrays every time a new instance was created. Which is probably fine, but a bit wasteful and probably strictly-speaking according to the C standard not actually thread-safe.

The initial version of the ebur128 Rust crate side-stepped this problem by simply doing this initialization once with the API provided by the Rust standard library. See part 2 and part 3 of this blog post for some more details about this.

Easier to Compile and Integrate

A Rust port only requires a Rust compiler, a mixed C/Rust codebase requires at least a C compiler in addition and some kind of build system for the C code.

libebur128 uses CMake, which would be an additional dependency so in the initial version of the ebur128 crate I went via cargo‘s build.rs build scripts and the cc crate as building libebur128 is easy enough. This works but build scripts are problematic for integration of the Rust code into other build systems than cargo.

The Rust port also makes use of conditional compilation in various places. Unlike in C with the preprocessor, non-standardized and inconsistent platform #defines and it being necessary to integrate everything in a custom way into the build system, Rust has a principled and well-designed approach to this problem. This makes it easier to keep the code clean, easier to maintain and more portable.

In addition to build system related simplifications, by not having any C code it is also much easier to compile the code to other targets like WebAssembly, which is natively supported by Rust. It is also possible to compile C to WebAssembly but getting both toolchains to agree with each other and produce compatible code seems not very easy.

Overview

As mentioned above, the code can be found on GitHub and in the ebur128 crate on crates.io.

The current version of the code produces the exact same results as the C version. This is enforced by the quickcheck tests that are running randomized inputs through both versions and check that the results are the same. The code also succeeds all the tests in the EBU loudness test set, so should hopefully be standards compliant as long as the test implementation is not wrong.

Performance-wise the Rust implementation is at least as fast as the C implementation. In some configurations it’s a few percent faster but probably not enough that it actually matters in practice. There are various benchmarks for both versions in different configurations available. The benchmarks are based on the criterion crate, which uses statistical methods to give as accurate as possible results. criterion also generates nice results with graphs for making analysis of the results more pleasant. See part 3 of this blog post for more details.

Writing tests and benchmarks for Rust is so much easier and feels more natural then doing it in C, so the Rust implementation has quite good coverage of the different code paths now. Especially no struggling with build systems was necessary like it would have been in C thanks to cargo and Rust having built-in support. This alone seems to have the potential to cause Rust code having, on average, better quality than similar code written in C.

It is also possible to compile the Rust implementation into a C library with the great cargo-c tool. This easily builds the code as a static/dynamic C library and installs the library, a C header file and also a pkg-config file. With this the Rust implementation is a 100% drop-in replacement of the C libebur128. It is not even necessary to recompile existing code. See part 4 of this blog post for more details.

Dependencies

Apart from the Rust standard library the Rust implementation depends on two other, small and widely used crates. Unlike with C, depending on external dependencies is rather simple with Rust and cargo. The two crates in question are

smallvec for a dynamically sized vectors/arrays that can be stored on the stack up to a certain size and only then fall back to heap allocations. This allows to avoid a couple of heap allocations under normal usage.
bitflags, which provides a macro for implementing properly typed bitflags. This is used in the constructor of the main type for selecting the features and modes that should be enabled, which directly maps to how the C API works (just with less type-safety).

Unsafe Code

A common question when announcing a Rust port of some C library is how much unsafe code was necessary to reach the same performance as the C code. In this case there are two uses of unsafe code outside the FFI code to call the C implementation in the tests/benchmarks and the C API.

Resampler

The True Peak measurement is using a resampler to upsample the audio signal to a higher sample rate. As part of the most inner loop of the resampler a statically sized ringbuffer is used.

As part of that ringbuffer, explicit indexing of a slice is needed. While the indexes are already manually checked to wrap around when needed, the Rust compiler and LLVM can’t figure that out so additional bounds checks plus panic handling is present in the compiled code. Apart from slowing down the loop with the additional condition, the panic code also causes the whole loop to be optimized less well.

So to get around that, unsafe indexing into the slice is used for performance reasons. While it requires a human now to check the memory safety of the code instead of relying on the compiler, the code in question is simple and small enough that it shouldn’t be a problem in practice.

More on this in part 2 and part 3 of this blog post.

Flushing Denormals to Zero

The other use of unsafe code is in the filter that is applied to the incoming audio signal. On x86/x86-64 the MXCSR register temporarily gets the _MM_FLUSH_ZERO_ON bit set to flush denormal floating point number to zero. That is, denormals (i.e. very small numbers close to zero) as result of any floating point operation are considered as zero.

This happens both for performance reasons as well as correctness reasons. Operations on denormals are generally much slower than on normalized floating point numbers. This has a measurable impact on the performance in this case.

Also as the C library does the same and not flushing denormals to zero would lead to slightly different results. While this difference doesn’t matter in practice as it’s very very small, it would make it harder to compare the results of both implementations as they wouldn’t be as close to each other anymore.

Doing this affects every floating point operation that happens while that bit is set, but because these are only the floating point operations performed by this crate and it’s guaranteed that the bit is unset again (even in case of panics) before leaving the filter, this shouldn’t cause any problems for other code.

Additional Features

Once the C library was ported and performance was comparable to the C implementation, I shortly checked the issues reported on the C library to check if there’s any useful feature requests or bug reports that I could implement / fix in the Rust implementation. There were three, one of which I also wanted for a future project.

None of the new features are available via the C API at this point for compatibility reasons.

Resetting the State

For this one there was a PR already for the C library. Previously the only way to reset all measurements was to create a new instance, which involves new memory allocations, filter initialization, etc..

It’s easy enough to provide a reset method to do only the minimal work required to reset all measurements and restart with a fresh state so I’ve added that to the Rust implementation.

Fix `set_max_window()` to actually work

This was a bug introduced in the C implementation a while ago in an attempt to prevent integer overflows when calculating sizes of memory allocations, which then would cause memory safety bugs because less memory was allocated than expected. Accidentally this fix restricted the allowed values for the maximum window size too much. There is a PR for fixing this in the C implementation.

On the Rust side this bug also existed because I simply ported over the checks. If I hadn’t ported over the checks, or ported an earlier version without the checks, there fortunately wouldn’t have been any memory safety bug on the Rust side though but instead one of two situations would have happened instead

In debug builds integer overflows cause a panic, so instead of allocating less memory than expected during the setting of the parameters there would’ve been a panic immediately instead of invalid memory accesses later.
In release builds integer overflows simply wrap around for performance reasons. This would’ve caused less memory than expected to be allocated, but later when trying to access the memory there would’ve been a panic when trying to access memory outside the allocated area.

While a panic is also not nice, it at least leads to no undefined behaviour and prevents worse things from happening.

The proper fix in this case was to not restrict the maximum window size statically but to instead check for overflows during the calculations. This is the same the PR for the C implementation does, but on the Rust side this is much easier because of built-in operations like checked_mul for doing an overflow-checking multiplication. In C this requires some rather convoluted code (check the PR for details).

Support for Planar Audio Input

The last additional feature that I implemented was support for planar audio input, for which also a PR to the C implementation exists already.

Most of the time audio signals have the samples of each channel interleaved with each other, so for example for stereo you have an array of samples with the first sample for the left channel, the first sample for the right channel, the second sample for the left channel, etc.. While this representation has some advantages, in other situations it is easier or faster to work with planar audio: the samples of each channel are contiguous one after another, so you have e.g. first all the samples of the left channel one after another and only then all samples of the right channel.

The PR for the C implementation does this with some code duplication of existing macro code (which can be prevented by making the macros more complicated), on the Rust side I implemented this without any code duplication by adding an internal abstraction for interleaved/planar audio and iterating over the samples and then working with that in normal, generic Rust code. This required some minor refactoring and code reorganization but in the end was rather painless. Note that most of the change is addition of new tests and moving some code around.

When looking at the Samples trait, the main part of this refactoring, one might wonder why I used closures instead of Rust iterators for iterating over the samples and the reason is unfortunately performance. More on this in part 3 of this blog post.

Next Part

In the next part of this blog post I will describe the porting approach in detail and also give various examples for how to port C code to idiomatic Rust, and some examples of problems I was running into.

Porting

Approach

Rust Bindings around C Library

bindgen

Build System

Safe Rust API

Replacing first function: Resampler

Processing Function: Using Iterators

Exporting C Functions

Writing Tests

Comparing Floating Point Numbers

quickcheck

True Peak: Macros vs. Traits

Loudness History

Enums

Logarithms

Lazy Initialization

Data Structures

Filter: Flushing Denormals to Zero

Making it bullet-proof

Input Processing: Order of Operations Matters

Last C Code: Replacing API Layer

Difficulty and Code Size

Next Part

EBU R128

Why?

Curiosity

Portability

Mathematical Constants and Functions

Data Structures

One-time Initialization

Easier to Compile and Integrate

Overview

Dependencies

Unsafe Code

Resampler

Flushing Denormals to Zero

Additional Features

Resetting the State

Fix set_max_window() to actually work

Support for Planar Audio Input

Next Part

`bindgen`

`quickcheck`

Fix `set_max_window()` to actually work