This blog post is part two of a four-part series:
- Overview, summary and motivation
- Porting approach with various details, examples and problems I ran into along the way
- Performance optimizations
- Building Rust code into a C library as drop-in replacement
In this part I’ll go through the actual porting process of the libebur128 C code to Rust: the approach I chose, various examples, and a few problems I ran into along the way.
It will be rather technical. I won’t explain details of how the C code works but will only focus on the aspects relevant for porting to Rust, as otherwise this blog post would become even longer than it already is.
Contents
Porting
With the warnings out of the way, let’s get started. As a reminder, the code can be found on GitHub, and you can also follow the actual chronological porting process by going through the git history there. It’s not very different from what follows here, but you might prefer looking at diffs instead.
Approach
The approach I’ve taken is basically the same one Federico took for librsvg and Joe Neeman took for nnnoiseless:
1. Start with the C code and safe Rust bindings around the C API
2. Look for a function or component low in the call graph, without dependencies on (too much) other C code
3. Rewrite that code in Rust and add an internal C API for it
4. Call the new internal C API for the new Rust code from the C code and get rid of the C implementation of the component
5. Make sure the tests are still passing
6. Go to 2. and repeat
Compared to what I did when porting the FFmpeg loudness normalization filter, this has the advantage that at every step there is a working version of the code, and you don’t only notice at the very end that you made a mistake somewhere along the way. At each step you can validate that what you did was correct, and the amount of code to debug if something went wrong is limited.
Thanks to Rust’s good FFI story for interoperability with C in either direction, writing the parts of the code that are called from C, or that call into C, is not much of a headache and no worse than actually writing C.
Rust Bindings around C Library
This step could’ve been skipped if all I cared about was having a C API for the ported code later, or if I had wanted to use the C library’s own tests for validation and worry about calling it from Rust at a later point. In this case I had already written safe Rust bindings around the C library before, and having a Rust API made it much easier to write tests that could be used during the porting and run automatically at each step.
bindgen
As a first step for creating the Rust bindings there needs to be a way to actually call into the C code. In C there are header files with the type definitions and function declarations, but Rust can’t directly work from those. The solution in this case was bindgen, which basically converts the C header files into something that Rust can understand. The resulting API is still completely unsafe, but it can be used in a next step to write safe Rust bindings around it.
I would recommend using bindgen for any non-trivial C API for which there is no better translation tool available, or for which there is no machine-readable description of the API that another tool could use instead. Parsing C headers is no fun, and C headers carry very little of the information needed for generating safe bindings. For GObject-based libraries, for example, using gir would be a better idea, as it works from a rich XML description of the API that contains information about e.g. ownership transfer and allows autogenerating safe Rust bindings in many cases.
Also, the dependency on clang makes it hard to run bindgen as part of every build, so instead I’ve made sure that the code generated by bindgen is platform independent and included it inside the repository. If you use bindgen, please try to do the same. Requiring clang for building your crate makes everything more complicated for your users, especially if they’re unfortunate enough to use Windows.
But back to the topic. What bindgen generates is basically a translation of the C header into Rust: type definitions and function declarations. For example, it looks as follows
```rust
#[repr(C)]
#[derive(Debug, Copy, Clone)]
pub struct ebur128_state {
    pub mode: ::std::os::raw::c_int,
    pub channels: ::std::os::raw::c_uint,
    pub samplerate: ::std::os::raw::c_ulong,
    pub d: *mut ebur128_state_internal,
}

extern "C" {
    pub fn ebur128_init(
        channels: ::std::os::raw::c_uint,
        samplerate: ::std::os::raw::c_ulong,
        mode: ::std::os::raw::c_int,
    ) -> *mut ebur128_state;

    pub fn ebur128_destroy(st: *mut *mut ebur128_state);

    pub fn ebur128_add_frames_int(
        st: *mut ebur128_state,
        src: *const ::std::os::raw::c_int,
        frames: usize,
    ) -> ::std::os::raw::c_int;
}
```
Based on this it is possible to call the C functions directly from unsafe Rust code and access the members of all the structs. This requires working with raw pointers and ensuring that everything is done correctly at every point to avoid memory corruption or worse. It’s just like using the API from C, with slightly different syntax.
Build System
To be able to call into the C API its implementation somehow has to be linked into your crate. As the C code later also has to be modified to call into the already ported Rust functions instead of the original C code, it makes most sense to build it as part of the crate instead of linking to an external version of it.
This can be done with the cc crate. It is called from cargo’s build.rs, where it is configured, for example with which C files to compile and how. Once that is done it is possible to call any exported C function from the Rust code. The build.rs is not really complicated in this case
```rust
fn main() {
    cc::Build::new()
        .file("src/c/ebur128.c")
        .compile("ebur128");
}
```
Safe Rust API
With all that in place a safe Rust API around the unsafe C functions can be written now. How this looks in practice differs from API to API and might require some more thought in case of a more complex API to ensure everything is still safe and sound from a Rust point of view. In this case it was fortunately rather simple.
For example, the struct definition, the constructor and the destructor (Drop impl) look as follows, based on what bindgen generated above
```rust
pub struct EbuR128(ptr::NonNull<ffi::ebur128_state>);
```
The struct is a simple wrapper around std::ptr::NonNull, which itself is a zero-cost wrapper around raw pointers that additionally ensures that the stored pointer is never NULL and allows additional optimizations based on that.
In other words: the Rust struct is just a raw pointer but with additional safety guarantees.
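The “zero cost” claim can be checked directly: NULL serves as a niche value, so Option<NonNull<T>> takes no more space than a plain raw pointer. A small standalone sketch:

```rust
use std::mem::size_of;
use std::ptr::NonNull;

fn main() {
    // NULL is the niche for Option's None, so the Option wrapper
    // adds no size over a plain raw pointer.
    assert_eq!(size_of::<Option<NonNull<u8>>>(), size_of::<*mut u8>());

    // NonNull::new() rejects NULL pointers at construction time.
    assert!(NonNull::new(std::ptr::null_mut::<u8>()).is_none());

    let mut x = 42u8;
    let ptr = NonNull::new(&mut x as *mut u8).expect("non-null");
    assert_eq!(unsafe { *ptr.as_ref() }, 42);
}
```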
```rust
impl EbuR128 {
    pub fn new(channels: u32, samplerate: u32, mode: Mode) -> Result<Self, Error> {
        static ONCE: std::sync::Once = std::sync::Once::new();

        ONCE.call_once(|| unsafe { ffi::ebur128_libinit() });

        unsafe {
            let ptr = ffi::ebur128_init(channels, samplerate as _, mode.bits() as i32);
            let ptr = ptr::NonNull::new(ptr).ok_or(Error::NoMem)?;
            Ok(EbuR128(ptr))
        }
    }
}
```
The constructor is slightly more complicated as it also has to ensure that the one-time initialization function is called exactly once. This requires using std::sync::Once as above.
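The Once guarantee can be demonstrated in isolation. In this sketch, an atomic counter stands in for the one-time C library initialization:

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Once;

static ONCE: Once = Once::new();
static INIT_COUNT: AtomicU32 = AtomicU32::new(0);

// Stand-in for the one-time C library initialization.
fn ensure_initialized() {
    ONCE.call_once(|| {
        INIT_COUNT.fetch_add(1, Ordering::SeqCst);
    });
}

fn main() {
    // No matter how many constructors run, the closure executes once.
    ensure_initialized();
    ensure_initialized();
    ensure_initialized();
    assert_eq!(INIT_COUNT.load(Ordering::SeqCst), 1);
}
```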
After that it calls the C constructor with the given parameters. This can return NULL in various cases when not enough memory could be allocated, as described in the documentation of the C library. This needs to be handled gracefully here: instead of panicking, an error is returned to the caller. ptr::NonNull::new() returns an Option; if NULL is passed it returns None. If this happens it is transformed into an error, together with an early return via the ? operator.
In the end the pointer then only has to be wrapped in the struct and be returned.
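This NULL-check-and-early-return pattern can be sketched in isolation. The wrap() function and Error enum here are illustrative stand-ins, not the crate’s actual API:

```rust
use std::ptr::NonNull;

#[derive(Debug, PartialEq)]
enum Error {
    NoMem,
}

// Hypothetical stand-in for wrapping a C constructor's return value.
fn wrap(raw: *mut u8) -> Result<NonNull<u8>, Error> {
    // NULL becomes Err(Error::NoMem); `?` returns early on error.
    let ptr = NonNull::new(raw).ok_or(Error::NoMem)?;
    Ok(ptr)
}

fn main() {
    assert_eq!(wrap(std::ptr::null_mut()), Err(Error::NoMem));

    let mut value = 1u8;
    assert!(wrap(&mut value as *mut u8).is_ok());
}
```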
```rust
impl Drop for EbuR128 {
    fn drop(&mut self) {
        unsafe {
            let mut state = self.0.as_ptr();
            ffi::ebur128_destroy(&mut state);
        }
    }
}
```
The Drop trait defines what should happen when a value of the struct goes out of scope and how to clean up after it. In this case this means calling the destroy function of the C library. It takes a pointer to a pointer to its state, which is then set to NULL. As such it is necessary to store the raw pointer in a local variable and pass a mutable reference to it. Otherwise the ptr::NonNull would end up with a NULL pointer inside it, which would result in undefined behaviour.
The last function that I want to mention here is the one that takes a slice of audio samples for processing
```rust
pub fn add_frames_i32(&mut self, frames: &[i32]) -> Result<(), Error> {
    unsafe {
        if frames.len() % self.0.as_ref().channels as usize != 0 {
            return Err(Error::NoMem);
        }

        let res = ffi::ebur128_add_frames_int(
            self.0.as_ptr(),
            frames.as_ptr(),
            frames.len() / self.0.as_ref().channels as usize,
        );
        Error::from_ffi(res as ffi::error, || ())
    }
}
```
Apart from calling the C function it is again necessary to check various pre-conditions before doing so. The C function will cause out of bounds reads if passed a slice that doesn’t contain a sample for each channel, so this must be checked beforehand or otherwise the caller (safe Rust code!) could cause out of bounds memory accesses.
In the end, after calling the function, its return value is converted into a Result, with any errors converted into the crate’s own Error enum.
As can be seen here, writing safe Rust bindings around a C API requires reading the documentation of the C code and keeping all the safety guarantees of Rust in mind, to ensure that it is impossible to violate them no matter what the caller passes into the functions.
Not having to read the documentation and still being guaranteed that the code can’t cause memory corruption is one of the big advantages of Rust. That is, if documentation even exists and mentions such details.
Replacing first function: Resampler
Once the safe Rust API is done, it is possible to write safe Rust code that makes use of the C code. Among other things, that allows writing tests to ensure that the ported code still does the same as the previous C code. But that will be the topic of the next section. Writing tests is boring, porting code is more interesting, and that’s what I will start with.
To find the first function to port, I first read the C code to get a general idea of how the different pieces fit together and what the overall structure of the code is. Based on this I selected the resampler that is used in the true peak measurement. It is one of the leaf functions of the call graph; that is, it does not call into any other C code and is relatively independent from everything else. Unlike many other parts of the code it is also already factored out into separate structs and functions.
In the C code this can be found in the interp_create(), interp_destroy() and interp_process() functions. The resulting Rust code can be found in the interp module, which provides basically the same API in Rust, except for the destroy() function, which Rust provides for free, plus corresponding extern "C" functions that can be called from C.
The create() function is not very interesting from a porting point of view: it simply allocates some memory and initializes it. The C and Rust versions look basically the same. The Rust version only lacks the checks for allocation failures, as those can’t currently be handled in Rust, and handling allocation failures like the C code does is almost useless anyway with modern operating systems and overcommitting allocators.
The struct definition is not too interesting either: it is approximately the same as the one in C, except that pointers to arrays are replaced by Vecs and their lengths are taken directly from the Vec instead of being stored separately. In a later version the Vecs were replaced by boxed slices and SmallVec, but that isn’t a concern here for now.
The main interesting part here is the processing function and how to provide all the functions to the C code.
Processing Function: Using Iterators
The processing function, interp_process(), is basically 4 nested for loops: over each frame in the input data, over each channel of each frame, over the interpolator factors and finally over the filter coefficients.
```c
unsigned int out_stride = interp->channels * interp->factor;

for (frame = 0; frame < frames; frame++) {
    for (chan = 0; chan < interp->channels; chan++) {
        interp->z[chan][interp->zi] = *in++;
        outp = out + chan;
        for (f = 0; f < interp->factor; f++) {
            acc = 0.0;
            for (t = 0; t < interp->filter[f].count; t++) {
                int i = (int) interp->zi - (int) interp->filter[f].index[t];
                if (i < 0) {
                    i += (int) interp->delay;
                }
                c = interp->filter[f].coeff[t];
                acc += (double) interp->z[chan][i] * c;
            }
            *outp = (float) acc;
            outp += interp->channels;
        }
    }
    out += out_stride;
    interp->zi++;
    if (interp->zi == interp->delay) {
        interp->zi = 0;
    }
}
```
This could be written the same way in Rust, but slice indexing is not very idiomatic in Rust and using Iterators is preferred, as it leads to more declarative code. All that indexing would also lead to suboptimal performance due to the required bounds checks. So the main task here is to translate the code to iterators as much as possible.
Looking at the C code, the outer loop iterates over chunks of channels samples from the input and chunks of channels * factor samples from the output. This is exactly what the chunks_exact iterator on slices does in Rust. Similarly, the second outer loop iterates over all samples of the input chunks and over the two-dimensional array z, which has delay items per channel. On the Rust side I represented z as a flat, one-dimensional array for simplicity, so instead of indexing, the chunks_exact() iterator is used again for iterating over it.
This leads to the following for the two outer loops
```rust
for (src, dst) in src
    .chunks_exact(self.channels)
    .zip(dst.chunks_exact_mut(self.channels * self.factor))
{
    for (channel_idx, (src, z)) in src
        .iter()
        .zip(self.z.chunks_exact_mut(delay))
        .enumerate()
    {
        // insert more code here
    }
}
```
Apart from making it more clear what the data access patterns are, this is also less error prone and gives more information to the compiler for optimizations.
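To see the access pattern in isolation, here is a toy sketch with a trivial stand-in “interpolator” that just duplicates each frame (not the real filter):

```rust
fn main() {
    let channels = 2;
    let factor = 2;

    // Two interleaved stereo frames in, 2x as many frames out.
    let src = [1.0f32, 2.0, 3.0, 4.0];
    let mut dst = [0.0f32; 8];

    for (s, d) in src
        .chunks_exact(channels)
        .zip(dst.chunks_exact_mut(channels * factor))
    {
        // Stand-in for the real interpolation: duplicate each input frame.
        for f in 0..factor {
            d[f * channels..(f + 1) * channels].copy_from_slice(s);
        }
    }

    assert_eq!(dst, [1.0, 2.0, 1.0, 2.0, 3.0, 4.0, 3.0, 4.0]);
}
```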
Inside this loop, a ring buffer z / zi is used to store the incoming samples for each channel, keeping the last delay samples for further processing. We’ll keep this part as it is for now and use explicit indexing. While a VecDeque, or a similar data structure from another crate, could be used here, it would complicate the code and cause more allocations. I’ll revisit this piece of code in part 3 of this blog post.
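The z / zi indexing scheme can be sketched on its own, without the interpolation around it:

```rust
// Standalone sketch of the z / zi ring buffer indexing described above.
fn main() {
    let delay = 4;
    let mut z = vec![0.0f32; delay];
    let mut zi = 0;

    // Write 6 samples into a 4-slot buffer; the oldest two get overwritten.
    for sample in 1..=6 {
        z[zi] = sample as f32;
        zi += 1;
        if zi == delay {
            zi = 0;
        }
    }

    assert_eq!(z, vec![5.0, 6.0, 3.0, 4.0]);
    assert_eq!(zi, 2);
}
```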
The first inner loop now iterates over all filters, of which there are factor many, and over chunks of size channels from the output, using the same translation as before. The second inner loop iterates over all coefficients of the filter and the z ring buffer, and sums up the products of both, which are then stored in the output.
So overall the body of the second outer loop above with the two inner loops would look as follows
```rust
    z[self.zi] = *src;
    for (filter, dst) in self
        .filter
        .iter()
        .zip(dst.chunks_exact_mut(self.channels))
    {
        let mut acc = 0.0;
        for (c, index) in &filter.coeff {
            let mut i = self.zi as i32 - *index as i32;
            if i < 0 {
                i += self.delay as i32;
            }
            acc += z[i as usize] as f64 * c;
        }
        dst[channel_idx] = acc as f32;
    }
}

self.zi += 1;
if self.zi == self.delay {
    self.zi = 0;
}
```
The full code after porting can be found here.
My general approach for porting C code with loops is what I did above: first try to understand the data access patterns and then find ways to express them with Rust iterators, and only if there is no obvious way, use explicit indexing. If the explicit indexing turns out to be a performance problem due to bounds checks, first try to reorganize the data so that direct iteration is possible (which usually also improves performance due to cache locality), and otherwise use some well-targeted unsafe code. But more about that in part 3 of this blog post, in the context of optimizations.
Exporting C Functions
To be able to call the above code from C, a C-compatible function needs to be exported for it. This involves working with unsafe code and raw pointers again, as that’s the only thing C understands. The unsafe Rust code needs to assert the implicit assumptions that can’t be expressed in C and that the calling C code has to follow.
For the interpolator, functions with the same API as the previous C code are exported to keep the amount of changes minimal: create(), destroy() and process() functions.
```rust
#[no_mangle]
pub unsafe extern "C" fn interp_create(taps: u32, factor: u32, channels: u32) -> *mut Interp {
    Box::into_raw(Box::new(Interp::new(taps, factor, channels)))
}

#[no_mangle]
pub unsafe extern "C" fn interp_destroy(interp: *mut Interp) {
    drop(Box::from_raw(interp));
}
```
The #[no_mangle] attribute on functions makes sure that the symbols are exported as-is instead of being mangled by the compiler. extern "C" makes sure that the function uses the same calling convention as a corresponding C function.
In the create() function the interpolator struct is wrapped in a Box, which is the most basic mechanism for heap allocations in Rust. This is needed because the C code shouldn’t have to know about the layout of the Interp struct and should just handle it as an opaque pointer. Box::into_raw() converts the Box into a raw pointer that can be returned directly to C. This also passes ownership of the memory to C.
The destroy() function does the inverse of the above and calls Box::from_raw() to get a Box back from the raw pointer. This requires that the raw pointer passed in was actually allocated via a Box and is of the correct type, which is something that can’t really be checked. The function has to trust the C code to do this correctly. The standard approach to memory safety in C: trust that everybody writes correct code.
After getting back the Box, it only has to be dropped and Rust will take care of deallocating any memory as needed.
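The ownership round-trip between into_raw() and from_raw() can be sketched without any C involved. The Interp struct here is a hypothetical stand-in, not the real interpolator:

```rust
// Hypothetical stand-in for the real interpolator state.
struct Interp {
    factor: u32,
}

// Heap-allocate and hand ownership out as a raw pointer, as create() does.
fn create(factor: u32) -> *mut Interp {
    Box::into_raw(Box::new(Interp { factor }))
}

// Reclaim ownership and free, as destroy() does. The caller must pass a
// pointer that really came from create(); this cannot be checked.
unsafe fn destroy(ptr: *mut Interp) {
    drop(Box::from_raw(ptr));
}

fn main() {
    let ptr = create(4);
    unsafe {
        assert_eq!((*ptr).factor, 4);
        destroy(ptr);
    }
    // `ptr` is dangling now; using it again would be undefined behaviour.
}
```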
```rust
#[no_mangle]
pub unsafe extern "C" fn interp_process(
    interp: *mut Interp,
    frames: usize,
    src: *const f32,
    dst: *mut f32,
) -> usize {
    use std::slice;

    let interp = &mut *interp;
    let src = slice::from_raw_parts(src, interp.channels * frames);
    let dst = slice::from_raw_parts_mut(dst, interp.channels * interp.factor * frames);

    interp.process(src, dst);

    interp.factor * frames
}
```
The main interesting part of the processing function is the usage of slice::from_raw_parts. On the C side, the function again has to trust that the pointers are correct and actually point to frames number of audio frames. In Rust a slice knows its size, so some conversion between the two is needed: a slice of the correct size has to be created around the pointer. This does not involve any copying of memory; it only stores the length information together with the raw pointer. This is also the reason why it’s not necessary to pass the length separately to the Rust version of the processing function.
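A minimal sketch of that conversion, with a hypothetical helper in place of the real processing function:

```rust
use std::slice;

// Rebuild a slice from a raw pointer plus an explicit length, as done at
// the C boundary. No data is copied; the length is simply stored
// alongside the pointer.
fn sum_samples(ptr: *const f32, len: usize) -> f32 {
    let samples = unsafe { slice::from_raw_parts(ptr, len) };
    samples.iter().sum()
}

fn main() {
    let data = [1.0f32, 2.0, 3.0];
    assert_eq!(sum_samples(data.as_ptr(), data.len()), 6.0);
}
```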
With this the interpolator is fully ported and the C functions can be called directly from the C code. On the C side they are declared as follows and then called as before
```c
typedef void * interpolator;

extern interpolator* interp_create(unsigned int taps, unsigned int factor, unsigned int channels);
extern void interp_destroy(interpolator* interp);
extern size_t interp_process(interpolator* interp, size_t frames, float* in, float* out);
```
The commit that did all this also adds some tests to ensure that everything still works correctly. It also contains some optimizations on top of the code above and is not 100% the same code.
Writing Tests
For testing that porting the code part by part to Rust doesn’t introduce any problems, I went for the common two-layer approach: 1. integration tests that check whether the whole thing still works correctly, and 2. unit tests for the rewritten component alone.
The integration tests come in two variants inside the ebur128 module: one variant just tests via assertions that the results on a fixed input are the expected ones, and the other compares the C and Rust implementations. The unit tests only come in the second variant for now.
To test the C implementation in the integration tests, an old version of the crate that had no ported code yet is pulled in. For comparing the C implementations of the individual components, I extracted the C code into a separate C file that exported the same API as the corresponding Rust code, and called both from the tests.
Comparing Floating Point Numbers
The first variant is not very interesting, apart from the complications involved when comparing floating point numbers. For this I used the float_eq crate, which provides various ways of comparing floating point numbers.
The first variant of tests uses it by checking that the absolute difference between the expected and actual results is very small, which is less strict than the ULP check used for the second variant of tests. Unfortunately this was required because, depending on the CPU and toolchain used, the results differ from the expected static results (which were generated with one particular CPU and toolchain), while for the comparison between the C and Rust implementations on the same CPU the results are basically the same.
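The difference between the two strategies can be sketched with plain std. Both checks are hand-rolled here for illustration; the actual tests use the float_eq crate:

```rust
// Absolute-difference check: tolerant of small platform-dependent drift.
fn abs_diff_eq(a: f64, b: f64, eps: f64) -> bool {
    (a - b).abs() <= eps
}

// ULP distance: how many representable f64 values lie between a and b.
// Only meaningful for finite values of the same sign.
fn ulp_distance(a: f64, b: f64) -> u64 {
    (a.to_bits() as i64 - b.to_bits() as i64).unsigned_abs()
}

fn main() {
    let a = 0.1 + 0.2;
    let b = 0.3;

    assert!(a != b); // naive equality fails
    assert!(abs_diff_eq(a, b, 1e-12)); // the absolute check passes easily
    assert!(ulp_distance(a, b) <= 1); // the values are only 1 ULP apart
}
```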
quickcheck
quickcheck is a crate for writing randomized property tests. This seemed like the perfect tool for writing tests that compare the two implementations: the property to check for is equality of results, and it should hold for any possible input.

Using quickcheck is simple in principle. You write a test function that takes the inputs as parameters, do the processing and then check that the property you want to test for holds, either via assertions or by returning a bool or TestResult.
```rust
#[quickcheck]
fn compare_c_impl_i16(signal: Signal<i16>) {
    let mut ebu = EbuR128::new(signal.channels, signal.rate, ebur128::Mode::all()).unwrap();
    ebu.add_frames_i16(&signal.data).unwrap();

    let mut ebu_c =
        ebur128_c::EbuR128::new(signal.channels, signal.rate, ebur128_c::Mode::all()).unwrap();
    ebu_c.add_frames_i16(&signal.data).unwrap();

    compare_results(&ebu, &ebu_c, signal.channels);
}
```
quickcheck will then generate random values for the function parameters via the Arbitrary impl of the given types and call the function many times. If one run fails, it tries to find a minimal testcase by “shrinking” the initial failure case, and then prints that failing, shrunk testcase.
And this is the part of using quickcheck that involves some more effort: writing a reasonable Arbitrary impl for the inputs that can also be shrunk in a useful way on failures.
For the tests here I came up with a Signal type. Its Arbitrary implementation creates an audio signal with 1-16 channels and multiple sine waves of different amplitudes and frequencies. Shrinking first tries to reproduce the problem with a single channel, and then by halving the signal length.
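The shrinking strategy can be sketched without the quickcheck machinery. This Signal is a simplified stand-in for the real type, and shrink() returns the candidates a shrinker would try:

```rust
// Simplified stand-in for the real Signal type from the tests.
#[derive(Clone, Debug, PartialEq)]
struct Signal {
    channels: u32,
    data: Vec<f32>,
}

impl Signal {
    // Shrink candidates: first a single-channel version, then one with
    // half the number of frames, mirroring the strategy described above.
    fn shrink(&self) -> Vec<Signal> {
        let mut candidates = Vec::new();

        if self.channels > 1 {
            let data: Vec<f32> = self
                .data
                .chunks_exact(self.channels as usize)
                .map(|frame| frame[0])
                .collect();
            candidates.push(Signal { channels: 1, data });
        }

        let frames = self.data.len() / self.channels as usize;
        if frames > 1 {
            let half = frames / 2 * self.channels as usize;
            candidates.push(Signal {
                channels: self.channels,
                data: self.data[..half].to_vec(),
            });
        }

        candidates
    }
}

fn main() {
    let sig = Signal {
        channels: 2,
        data: vec![1.0, -1.0, 2.0, -2.0],
    };
    let shrunk = sig.shrink();

    // First candidate keeps only channel 0, second halves the frame count.
    assert_eq!(shrunk[0], Signal { channels: 1, data: vec![1.0, 2.0] });
    assert_eq!(shrunk[1], Signal { channels: 2, data: vec![1.0, -1.0] });
}
```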
This worked well in practice so far. It doesn’t cover all possible inputs but should cover anything that can fail, and the simple shrinking approach also helped to find smallish testcases if something failed. But of course it’s not a perfect solution, only a practical one.
Based on these sets of tests I could be reasonably certain that the C and Rust implementation provide exactly the same results, so I could start porting the next part of the code.
True Peak: Macros vs. Traits
C doesn’t really have any mechanism for writing code that is generic over different types (other than void *), or any more advanced means of abstraction than functions and structs. For that reason the C code uses preprocessor macros in various places to write the code once for the different input types (i16, i32, f32 and f64, or in C terms short, int, float and double). The C preprocessor is just a fancy mechanism for string concatenation, so this is rather unpleasant to write and read, the result might not even be valid C code, and the resulting compiler errors are often rather confusing.
In Rust, macros could also be used for this. While cleaner than C macros thanks to macro hygiene rules and operating on a typed token tree instead of just strings, this would still end up as hard-to-write and hard-to-read code with possibly confusing compiler errors. For abstracting over different types, Rust instead provides traits. These allow writing code that is generic over different types with a well-defined interface; they allow much more than that, but I won’t cover it here.
One example of macro usage in the C code is the input processing, of which the true peak measurement is one part. In C this basically looks as follows, with some parts left out because they’re not relevant for the true peak measurement itself
```c
static void ebur128_check_true_peak(ebur128_state* st, size_t frames) {
  size_t c, i, frames_out;

  frames_out =
      interp_process(st->d->interp, frames, st->d->resampler_buffer_input,
                     st->d->resampler_buffer_output);

  for (i = 0; i < frames_out; ++i) {
    for (c = 0; c < st->channels; ++c) {
      double val =
          (double) st->d->resampler_buffer_output[i * st->channels + c];

      if (EBUR128_MAX(val, -val) > st->d->prev_true_peak[c]) {
        st->d->prev_true_peak[c] = EBUR128_MAX(val, -val);
      }
    }
  }
}

#define EBUR128_FILTER(type, min_scale, max_scale)                            \
  static void ebur128_filter_##type(ebur128_state* st, const type* src,       \
                                    size_t frames) {                          \
    static double scaling_factor =                                           \
        EBUR128_MAX(-((double) (min_scale)), (double) (max_scale));          \
    double* audio_data = st->d->audio_data + st->d->audio_data_index;        \
                                                                             \
    // some other code                                                       \
                                                                             \
    if ((st->mode & EBUR128_MODE_TRUE_PEAK) == EBUR128_MODE_TRUE_PEAK &&     \
        st->d->interp) {                                                     \
      for (i = 0; i < frames; ++i) {                                         \
        for (c = 0; c < st->channels; ++c) {                                 \
          st->d->resampler_buffer_input[i * st->channels + c] =              \
              (float) ((double) src[i * st->channels + c] / scaling_factor); \
        }                                                                    \
      }                                                                      \
      ebur128_check_true_peak(st, frames);                                   \
    }                                                                        \
                                                                             \
    // some other code                                                       \
  }

EBUR128_FILTER(short, SHRT_MIN, SHRT_MAX)
EBUR128_FILTER(int, INT_MIN, INT_MAX)
EBUR128_FILTER(float, -1.0f, 1.0f)
EBUR128_FILTER(double, -1.0, 1.0)
```
What the invocations of the macro at the bottom do is take the whole macro body and replace the usages of type, min_scale and max_scale accordingly. That is, one ends up with a function ebur128_filter_short() that works on a const short *, uses 32768.0 as scaling_factor, and correspondingly for the 3 other types.
To convert this to Rust, first a trait that provides all the required operations has to be defined, and then implemented on the 4 numeric types that are supported as audio input. In this case, the only required operation is converting the input values to an f32 between -1.0 and +1.0.
```rust
pub(crate) trait AsF32: Copy {
    fn as_f32(self) -> f32;
}

impl AsF32 for i16 {
    fn as_f32(self) -> f32 {
        self as f32 / -(i16::MIN as f32)
    }
}

// And the same for i32, f32 and f64
```
Once this trait is defined and implemented on the needed types the Rust function can be written generically over the trait
```rust
pub(crate) fn check_true_peak<T: AsF32>(&mut self, src: &[T], peaks: &mut [f64]) {
    assert!(src.len() <= self.buffer_input.len());
    assert!(peaks.len() == self.channels);

    for (o, i) in self.buffer_input.iter_mut().zip(src.iter()) {
        *o = i.as_f32();
    }

    self.interp.process(
        &self.buffer_input[..(src.len())],
        &mut self.buffer_output[..(src.len() * self.interp_factor)],
    );

    for (channel_idx, peak) in peaks.iter_mut().enumerate() {
        for o in self.buffer_output[..(src.len() * self.interp_factor)]
            .chunks_exact(self.channels)
        {
            let v = o[channel_idx].abs() as f64;
            if v > *peak {
                *peak = v;
            }
        }
    }
}
```
This is not a direct translation of the C code though. As part of rewriting the C code I also factored out the true peak detection from the filter function into its own function. It is called from the filter function shown in the C code a bit further above. This way it was easy to switch only this part from a C implementation to a Rust implementation while keeping the other parts of the filter, and to also test it separately from the whole filter function.
All this can be found in this commit together with tests and benchmarks. The overall code is a bit different than what is listed above, and the latest version in the repository looks a bit different still, but more on that in part 3 of this blog post.
One last thing worth mentioning here is that the AsF32 trait is not public API of the crate, and neither are the functions that are generic over the input type. Instead, the generic functions are only used internally, and the public API provides 4 functions specialized to the concrete input types. This keeps the API surface smaller and makes the API easier to understand for users.
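This pattern of an internal trait with concrete public entry points can be sketched as follows. The peak() reduction here is a toy stand-in for illustration, not the real measurement:

```rust
// Internal trait: not exported from the crate.
trait AsF32: Copy {
    fn as_f32(self) -> f32;
}

impl AsF32 for i16 {
    fn as_f32(self) -> f32 {
        self as f32 / -(i16::MIN as f32)
    }
}

impl AsF32 for f32 {
    fn as_f32(self) -> f32 {
        self
    }
}

// Internal generic implementation; a toy peak reduction for illustration.
fn peak<T: AsF32>(src: &[T]) -> f64 {
    src.iter().map(|s| s.as_f32().abs() as f64).fold(0.0, f64::max)
}

// The public API only exposes concrete, monomorphic entry points.
pub fn peak_i16(src: &[i16]) -> f64 {
    peak(src)
}

pub fn peak_f32(src: &[f32]) -> f64 {
    peak(src)
}

fn main() {
    // i16::MIN scales to -1.0, so its absolute peak is exactly 1.0.
    assert_eq!(peak_i16(&[i16::MIN, 0, 16384]), 1.0);
    assert_eq!(peak_f32(&[0.25, -0.5]), 0.5);
}
```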
Loudness History
The next component I ported to Rust was the loudness history data structure. This is used to keep a history of previous loudness measurements for giving longer-term results and it exists in two variants: a histogram-based one and a queue-based one with a maximum length.
As part of the history data structure there are also a couple of operations to calculate values from it, but I won’t go into details of them here. They were more or less direct translations of the C code.
In C this data structure and the operations on it were distributed all over the code and part of the main state struct, so the first step to port it over to Rust was to identify all those pieces and put them behind some kind of API.
This is also something I noticed when porting the FFmpeg loudness normalization filter: because Rust makes it much less effort than C to define new structs, functions or even modules, it seems to often lead to more modular code with clear component boundaries, instead of everything being put together in the same place. Requirements from the borrow checker often also make it more natural to split components into separate structs and functions.
Check the commit for the full details, but in the end I ended up with the following functions that are called from the C code
```c
typedef void * history;

extern history* history_create(int use_histogram, size_t max);
extern void history_add(history* hist, double energy);
extern void history_set_max_size(history* hist, size_t max);
extern double history_gated_loudness(const history* hist);
extern double history_relative_threshold(const history* hist);
extern double history_loudness_range(const history* hist);
extern void history_destroy(history *hist);
```
Enums
In the C code the two variants of the history were implemented by always having both of them present in the state struct, with only one of them initialized, and then at every usage site having code like the following
```c
if (st->d->use_histogram) {
    // histogram code follows here
} else {
    // queue code follows here
}
```
Doing it like this is error prone and easy to forget, and having fields in the struct that are unused in certain configurations seems wasteful. In Rust this situation is naturally expressed with enums
```rust
enum History {
    Histogram(Histogram),
    Queue(Queue),
}

struct Histogram { ... }
struct Queue { ... }
```
This allows using the same storage for both variants, and at each usage site the compiler enforces that both variants are explicitly handled
```rust
match history {
    History::Histogram(ref hist) => {
        // histogram code follows here
    }
    History::Queue(ref queue) => {
        // queue code follows here
    }
}
```
Splitting it up like this with an enum also leads to implementing the operations of the two variants directly on their structs, with only the common code implemented on the History enum itself. This also improves readability, because it’s immediately clear what a piece of code applies to.
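A minimal sketch of this shape, with toy structs standing in for the real histogram and queue:

```rust
// Toy stand-ins for the real variants.
struct Histogram {
    count: u64,
}

struct Queue {
    entries: Vec<f64>,
}

impl Histogram {
    fn add(&mut self, _energy: f64) {
        self.count += 1;
    }
}

impl Queue {
    fn add(&mut self, energy: f64) {
        self.entries.push(energy);
    }
}

enum History {
    Histogram(Histogram),
    Queue(Queue),
}

impl History {
    // Common code lives on the enum; the match must handle both variants.
    fn add(&mut self, energy: f64) {
        match self {
            History::Histogram(hist) => hist.add(energy),
            History::Queue(queue) => queue.add(energy),
        }
    }
}

fn main() {
    let mut history = History::Queue(Queue { entries: Vec::new() });
    history.add(-23.0);

    match history {
        History::Queue(queue) => assert_eq!(queue.entries, vec![-23.0]),
        History::Histogram(_) => unreachable!(),
    }
}
```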
Logarithms
As mentioned in the overview already, there are some portability concerns in the C code. One of them showed up when porting the history and comparing the results of the ported code with the C code, and it resulted in the following rather ugly code
```rust
fn energy_to_loudness(energy: f64) -> f64 {
    #[cfg(feature = "internal-tests")]
    {
        10.0 * (f64::ln(energy) / f64::ln(10.0)) - 0.691
    }
    #[cfg(not(feature = "internal-tests"))]
    {
        10.0 * f64::log10(energy) - 0.691
    }
}
```
In the C code, `ln(x) / ln(10)` is used everywhere for calculating the base-10 logarithm. Mathematically that's the same thing, but in practice it unfortunately is not: the explicit `log10()` function is both faster and more accurate. However, it's not available everywhere in C (only since C99), so it was not used in the C code. In Rust it is always available, so I first used it unconditionally.
When running the tests later, they failed because the results of the C code were slightly different from the results of the Rust code. In the end I tracked it down to the usage of `log10()`, so for now, when comparing the two implementations, the slower and less accurate version is used.
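The difference is easy to see in isolation. The following sketch compares the two formulas (the constant 0.691 and the factor 10 are the ones from `energy_to_loudness` above); the results agree to many decimal places but are not guaranteed to be bit-identical:

```rust
// Both compute the same loudness mathematically, but ln(x) / ln(10)
// involves two separate roundings, so the last bits of the result
// can differ from the single-rounding log10(x).
fn loudness_via_ln(energy: f64) -> f64 {
    10.0 * (f64::ln(energy) / f64::ln(10.0)) - 0.691
}

fn loudness_via_log10(energy: f64) -> f64 {
    10.0 * f64::log10(energy) - 0.691
}
```

Since the gating in the loudness calculation compares against fixed thresholds, even a difference in the last bit can flip a comparison and change the final result, which is exactly what made the tests fail.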
Lazy Initialization
Another topic I already mentioned in the overview is one-time initialization. For using the histogram efficiently it is necessary to calculate the values at the edges of each histogram bin as well as the center values. These values are always the same, so they could be calculated once up-front; the C code instead calculated them whenever a new instance was created.
In Rust, one could build something around `std::sync::Once` together with `static mut` variables for storing the data, but that would not be very convenient and would also require some `unsafe` code: `static mut` variables are inherently unsafe. Instead this can be simplified with the `lazy_static` or `once_cell` crates, and the API of the latter is also available now as part of the Rust standard library in the nightly versions.
Here I used `lazy_static`, which leads to the following code
```rust
lazy_static::lazy_static! {
    static ref HISTOGRAM_ENERGIES: [f64; 1000] = {
        let mut energies = [0.0; 1000];

        for (i, o) in energies.iter_mut().enumerate() {
            *o = f64::powf(10.0, (i as f64 / 10.0 - 69.95 + 0.691) / 10.0);
        }

        energies
    };

    static ref HISTOGRAM_ENERGY_BOUNDARIES: [f64; 1001] = {
        let mut boundaries = [0.0; 1001];

        for (i, o) in boundaries.iter_mut().enumerate() {
            *o = f64::powf(10.0, (i as f64 / 10.0 - 70.0 + 0.691) / 10.0);
        }

        boundaries
    };
}
```
On first access to e.g. `HISTOGRAM_ENERGIES`, the corresponding code is executed, and from that point onwards the result is available as a read-only array with the calculated values. In practice this later turned out to cause performance problems, but more on that in part 3 of this blog post.
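For comparison, here is a sketch of the same table with the `once_cell`-style API. I'm using `std::sync::OnceLock` here, which is the form in which this API has since been stabilized in the standard library; the blog's actual code used `lazy_static` as shown above:

```rust
use std::sync::OnceLock;

// Same table as in the lazy_static version, but initialized explicitly
// through an accessor function instead of on first dereference.
static HISTOGRAM_ENERGIES: OnceLock<[f64; 1000]> = OnceLock::new();

fn histogram_energies() -> &'static [f64; 1000] {
    HISTOGRAM_ENERGIES.get_or_init(|| {
        let mut energies = [0.0; 1000];

        for (i, o) in energies.iter_mut().enumerate() {
            *o = f64::powf(10.0, (i as f64 / 10.0 - 69.95 + 0.691) / 10.0);
        }

        energies
    })
}
```

The main practical difference is that the initialization point is an explicit function call rather than hidden behind a `Deref` implementation, which also makes the one-time cost easier to see in profiles.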
Another approach for calculating these constant numbers would be to compute them at compile-time via `const` functions. This is almost possible with Rust now; the only part missing is a `const` variant of `f64::powf()`. It is not available as a `constexpr` function in C++ either, so there is probably a deeper reason behind this. Otherwise the code would look exactly like the code above, except that the variables would be plain `static` instead of `static ref` and all calculations would happen at compile-time.
In the latest version of the code, and until `f64::powf()` is available as a `const` function, I've decided to simply include a static array with the pre-calculated values in the code.
Data Structures
And the last topic for the history is an implementation detail of the queue-based implementation. As I also mentioned during the overview, the C code is using a linked-list-based queue, and this is exactly where it is used.
The queue stores one `f64` value per entry, which means that in the end there is one heap allocation of 12 or 16 bytes per entry, depending on pointer size. That's a lot of very small allocations; each allocation is 50-100% bigger than the actual payload, and that's ignoring any overhead from the allocator itself. Quite some memory is wasted this way, and because each value lives in a separate allocation and a pointer has to be followed to reach the next one, any operation over all values is not going to be very cache-efficient.
As C doesn’t have any data structures in its standard library and this linked-list-based queue is something that is readily available on the BSDs and Linux at least, it probably makes sense to use it here instead of implementing a more efficient data structure inside the code. But it still seems really suboptimal for this use-case.
In Rust the standard library provides a ring-buffer-based `VecDeque`, which offers exactly the API needed here, stores all values tightly packed and thus doesn't waste any memory per value, and at the same time provides better cache efficiency. And it is available everywhere the Rust standard library is available, unlike the BSD queue used by the C implementation.
In practice, apart from the obvious memory savings, this also caused the Rust implementation, without any further optimizations, to take only 50%-70% of the time the C implementation took, depending on the operation.
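As an illustration of why `VecDeque` fits so well here, a bounded history essentially reduces to a push at the back plus an eviction from the front. The following is only a sketch with made-up names, not the actual implementation:

```rust
use std::collections::VecDeque;

// Illustrative sketch of a bounded energy history on top of VecDeque;
// the real implementation in the crate differs in naming and details.
struct Queue {
    max: usize,
    queue: VecDeque<f64>,
}

impl Queue {
    fn new(max: usize) -> Self {
        Queue {
            max,
            queue: VecDeque::with_capacity(max),
        }
    }

    fn add(&mut self, energy: f64) {
        // Drop the oldest value once the configured maximum is reached.
        if self.queue.len() == self.max {
            self.queue.pop_front();
        }
        self.queue.push_back(energy);
    }

    fn set_max_size(&mut self, max: usize) {
        self.max = max;
        // Shrink from the front so the most recent values are kept.
        while self.queue.len() > self.max {
            self.queue.pop_front();
        }
    }
}
```

Both `push_back` and `pop_front` are amortized O(1), and iterating over the values walks at most two contiguous slices of memory, which is where the cache-efficiency win over the linked list comes from.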
Filter: Flushing Denormals to Zero
Overall, porting the filter function from C worked the same way as everything mentioned before, so I won't go into details here. The whole commit porting it can be found here.
There is only one aspect I want to focus on here: if available on x86/x86-64, the `MXCSR` register temporarily gets the `_MM_FLUSH_ZERO_ON` bit set to flush denormal floating point numbers to zero. That is, denormals (i.e. very small numbers close to zero) resulting from any floating point operation are treated as zero. If hardware support is not available, at the end of each filter call the values kept for the next call are manually set to zero if they contain denormals.
This is done in the C code for performance reasons. Operations on denormals are generally much slower than on normalized floating point numbers and it has a measurable impact on the performance in this case.
In Rust this had to be replicated, not only for performance reasons but also because otherwise the results of the two implementations would be slightly different and comparing them in the tests would be harder.
On the C side, this requires some build system integration and usage of the C preprocessor to decide whether the hardware support can be used, plus some conditional code that is used in the `EBUR128_FILTER` macro that was shown a few sections above. Specifically this is the code
```c
#if defined(__SSE2_MATH__) || defined(_M_X64) || _M_IX86_FP >= 2
#include <xmmintrin.h>
#define TURN_ON_FTZ \
        unsigned int mxcsr = _mm_getcsr(); \
        _mm_setcsr(mxcsr | _MM_FLUSH_ZERO_ON);
#define TURN_OFF_FTZ _mm_setcsr(mxcsr);
#define FLUSH_MANUALLY
#else
#warning "manual FTZ is being used, please enable SSE2 (-msse2 -mfpmath=sse)"
#define TURN_ON_FTZ
#define TURN_OFF_FTZ
#define FLUSH_MANUALLY \
        st->d->v[c][4] = fabs(st->d->v[c][4]) < DBL_MIN ? 0.0 : st->d->v[c][4]; \
        st->d->v[c][3] = fabs(st->d->v[c][3]) < DBL_MIN ? 0.0 : st->d->v[c][3]; \
        st->d->v[c][2] = fabs(st->d->v[c][2]) < DBL_MIN ? 0.0 : st->d->v[c][2]; \
        st->d->v[c][1] = fabs(st->d->v[c][1]) < DBL_MIN ? 0.0 : st->d->v[c][1];
#endif
```
This is not really that bad, and my only concern is that it's relatively easy to forget calling `TURN_OFF_FTZ` once the filter is done. That would then affect all future floating point operations outside the filter and potentially cause a lot of problems. This blog post gives a nice example of an interesting bug caused by exactly that and shows how hard it was to debug.
When porting this to more idiomatic Rust, this problem does not exist anymore.
This is the Rust implementation I ended up with
```rust
#[cfg(all(
    any(target_arch = "x86", target_arch = "x86_64"),
    target_feature = "sse2"
))]
mod ftz {
    #[cfg(target_arch = "x86")]
    use std::arch::x86::{_mm_getcsr, _mm_setcsr, _MM_FLUSH_ZERO_ON};
    #[cfg(target_arch = "x86_64")]
    use std::arch::x86_64::{_mm_getcsr, _mm_setcsr, _MM_FLUSH_ZERO_ON};

    pub struct Ftz(u32);

    impl Ftz {
        pub fn new() -> Option<Self> {
            unsafe {
                let csr = _mm_getcsr();
                _mm_setcsr(csr | _MM_FLUSH_ZERO_ON);
                Some(Ftz(csr))
            }
        }
    }

    impl Drop for Ftz {
        fn drop(&mut self) {
            unsafe {
                _mm_setcsr(self.0);
            }
        }
    }
}

#[cfg(not(any(all(
    any(target_arch = "x86", target_arch = "x86_64"),
    target_feature = "sse2"
))))]
mod ftz {
    pub struct Ftz;

    impl Ftz {
        pub fn new() -> Option<Self> {
            None
        }
    }
}
```
While a bit longer, it is also mostly whitespace. The important part to notice is that when the hardware support is used, a struct with a `Drop` impl is returned, and once this struct leaves scope it resets the `MXCSR` register to its previous value. This way it can't be forgotten, and the register is also reset as part of stack unwinding in case of panics.
On the usage side this looks as follows
```rust
let ftz = ftz::Ftz::new();

// all the calculations

if ftz.is_none() {
    // manual flushing of denormals to zero
}
```
No macros are required in Rust, and all the platform-specific code is nicely abstracted away in a separate module. In the future, support for e.g. ARM could be added and it would require no changes anywhere else, just another implementation of the `Ftz` struct.
Making it bullet-proof
As Anthony Ramine quickly noticed and told me on Twitter, the above is not actually sufficient. For non-malicious code using the `Ftz` type everything is alright: on every return path, including panics, the register is reset again.
However, malicious (or simply very confused?) code could use e.g. `mem::forget()`, `Box::leak()` or some other function to leak the `Ftz` value, causing the `Drop` implementation to never actually run and reset the register's value. Leaking memory is perfectly valid in safe Rust, so it's not a good idea to rely too much on `Drop` implementations for correctness.
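A minimal demonstration of the problem, with a hypothetical guard type standing in for `Ftz`:

```rust
use std::mem;
use std::sync::atomic::{AtomicBool, Ordering};

// Tracks whether the guard's cleanup ever ran.
static CLEANED_UP: AtomicBool = AtomicBool::new(false);

struct Guard;

impl Drop for Guard {
    fn drop(&mut self) {
        CLEANED_UP.store(true, Ordering::SeqCst);
    }
}

fn leak_the_guard() {
    let guard = Guard;
    // Entirely safe code, but `drop()` never runs: the cleanup is skipped.
    mem::forget(guard);
}

fn drop_the_guard() {
    let _guard = Guard;
    // `_guard` goes out of scope here and `drop()` runs normally.
}
```

If `Guard` were `Ftz`, the leaking caller would leave the `MXCSR` register in flush-to-zero mode for the rest of the program.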
The solution can be found in this commit, but the basic idea is to never actually give out a value of the `Ftz` type and instead only pass an immutable reference to safe Rust code. This basically looks as follows
```rust
mod ftz {
    pub fn with_ftz<F: FnOnce(Option<&Ftz>) -> T, T>(func: F) -> T {
        unsafe {
            let ftz = Ftz::new();
            func(Some(&ftz))
        }
    }
}

ftz::with_ftz(|ftz| {
    // do things or check if `ftz` is None
});
```
This way it is impossible for any code outside the `ftz` module to leak the value and prevent the register from being reset.
Input Processing: Order of Operations Matters
The other parts of the processing code were relatively straightforward to port and not really different from anything I already mentioned above. However, as part of porting that code I ran into a problem that took quite a while to debug: once ported, the results of the C and Rust implementations were slightly different again.
I went through the affected code in detail and didn't notice anything obvious. Both the C code and the Rust code seemed to be doing the same thing, so why were the results different?
This is the relevant part of the C code
```c
size_t i;
double channel_sum;

channel_sum = 0.0;
if (st->d->audio_data_index < frames_per_block * st->channels) {
  for (i = 0; i < st->d->audio_data_index / st->channels; ++i) {
    channel_sum += st->d->audio_data[i * st->channels + c] *
                   st->d->audio_data[i * st->channels + c];
  }

  for (i = st->d->audio_data_frames -
           (frames_per_block - st->d->audio_data_index / st->channels);
       i < st->d->audio_data_frames; ++i) {
    channel_sum += st->d->audio_data[i * st->channels + c] *
                   st->d->audio_data[i * st->channels + c];
  }
} else {
  for (i = st->d->audio_data_index / st->channels - frames_per_block;
       i < st->d->audio_data_index / st->channels; ++i) {
    channel_sum += st->d->audio_data[i * st->channels + c] *
                   st->d->audio_data[i * st->channels + c];
  }
}
```
and the first version of the Rust code
```rust
let mut channel_sum = 0.0;

if audio_data_index < frames_per_block * channels {
    channel_sum += audio_data[..audio_data_index]
        .chunks_exact(channels)
        .map(|f| f[c] * f[c])
        .sum();

    channel_sum += audio_data
        [(audio_data.len() - frames_per_block * channels + audio_data_index)..]
        .chunks_exact(channels)
        .map(|f| f[c] * f[c])
        .sum();
} else {
    channel_sum += audio_data
        [(audio_data_index - frames_per_block * channels)..audio_data_index]
        .chunks_exact(channels)
        .map(|f| f[c] * f[c])
        .sum();
}
```
The difference between the two variants is the order of floating point operations in the `if` branch. The C code sums all values into the same accumulator, while the Rust code first sums both parts into separate accumulators and then adds them together. I changed the Rust code to do exactly the same as the C code, and the tests passed again.
The order in which floating point operations are performed matters, unfortunately, and in the example above the difference was big enough for the tests to fail. This is a nice practical example showing that addition of floating point numbers is not actually associative.
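This is easy to reproduce in isolation: summing the same three values with a single accumulator versus two partial sums can already produce different results:

```rust
// Floating point addition is not associative: the grouping of
// operations changes where intermediate rounding happens.
fn sum_left(values: &[f64]) -> f64 {
    // One accumulator over all values, like the C code.
    values.iter().sum()
}

fn sum_split(values: &[f64], mid: usize) -> f64 {
    // Two partial sums added at the end, like the first Rust version.
    let a: f64 = values[..mid].iter().sum();
    let b: f64 = values[mid..].iter().sum();
    a + b
}
```

For example, with `[0.1, 0.2, 0.3]` the two functions disagree in the last bit of the result, which is exactly the kind of discrepancy that made the comparison tests fail.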
Last C Code: Replacing API Layer
The last step was the most satisfying one: getting rid of all the C code. This can be seen in this commit. Note that the performance numbers in the commit message are wrong: at that point both versions were already much closer performance-wise, and the Rust implementation was even faster in some configurations.
Up to that point all the internal code was already ported to Rust and the only remaining C code was the public C API, which did some minor tasks and then called into the Rust code. Technically this was not very interesting so I won’t get into any details here. It doesn’t add any new insights and this blog post is already long enough!
If you check the git history, all commits that followed after this one were cleanups, addition of some new features, adding more tests, performance optimizations (see the next part of this blog post) and adding a C-compatible API around the Rust implementation (see the last part of this blog post). The main part of the work was done.
Difficulty and Code Size
With the initial porting out of the way, I can now also answer the first two questions I wanted to get answered as part of this project.
Porting the C code to Rust was not causing any particular difficulties. The main challenges were
- Understanding what the C code actually does and mapping the data access patterns to Rust iterators for more idiomatic and faster code. This also has the advantage of making the code clearer than with the C-style `for` loops. Porting was generally a pleasant experience and the language did not get in the way when implementing such code.
  Also, by being able to port it incrementally, I could always do a little bit during breaks and didn't have to invest longer, contiguous blocks of time into the porting.
- Refactoring the C code to be able to replace parts of it step by step with Rust implementations. This was complicated a bit by the fact that the C code did not factor out the different logical components nicely but instead kept everything entangled.
From having worked a lot with C for more than 15 years, I wouldn't say this is because the code is bad (it is not!) but simply because C encourages writing code like this. Defining new structs or new functions seems like effort, and even more so if you move code into separate files, because then you also have to worry about a header file and keeping it in sync with the code. Rust simplifies this noticeably, and the way the language behaves encourages splitting the code into separate components.
Now for the size of the code, which is a slightly more complicated question. Rust and the default code style of `rustfmt` cause code to be spread out over more lines and to have more whitespace than structurally identical C code formatted in the common C styles. In my experience, Rust code therefore looks visually much less dense than C code.
Intuitively I would say that I have written much less Rust code for the actual implementation than there was C code, but lacking any other metrics let’s take a look at the lines of code while ignoring tests and comments. I used tokei for this.
- 1211 lines for the Rust code
- 1741 lines for the C code. Of this, 494 lines are headers and 367 lines of the headers are the queue implementation. That is, there are 1247 lines of non-header C code.
This makes the Rust implementation only slightly smaller, if we ignore the C headers. Rust allows writing more concise code, so I would have expected the difference to be bigger. At least partially this can probably be attributed to the different code formatting, which causes Rust code to be less visually dense and as a result spread out over more lines than it otherwise would be.
In any case, overall I’m happy with the results so far.
I will look at another metric of code size in the last part of this blog post for some further comparison: the size of the compiled code.
Next Part
In the next part of this blog post I will describe the performance optimizations I did to make the Rust implementation at least as fast as the C one and the problems I ran into while doing so. The previous two parts of the blog post had nothing negative to say about Rust but this will change in the third part. The Rust implementation without any optimizations was already almost as fast as the C implementation thanks to how well idiomatic Rust code can be optimized by the compiler, but the last few percent were not as painless as one would hope. In practice the performance difference probably wouldn’t have mattered.
From looking at the git history and comparing the code, you will also notice that some of the performance optimizations already happened as part of the porting. The final code is not exactly what I presented above.