fre:ac Developer Blog

Welcome to the fre:ac developer blog. I will post status updates and other information about fre:ac development here.



fre:ac development status update 04/2018
Written by Robert   
Tuesday, 01 May 2018 12:28

It's time for a new development status update after an interesting month.

Optimized CRC routines for audio codecs

In case you missed it, here is my article on speeding up LAME, FLAC, Ogg and Monkey's Audio with faster CRC checks. The proposed CRC algorithm is roughly 5 times faster than the one previously used and results in a speedup of about 5% for FLAC encoding and decoding. Patches have been submitted to the upstream projects and I hope for integration in official releases of these codecs.

Fixed crashes with local CDDB queries

A user reported occasional crashes when querying a local CDDB database on Linux. This turned out to be a thread-safety issue that manifested itself only when the CDDB query dialog was displayed and then immediately closed before the main thread finished processing the window mapping event.

The issue affects all systems using the X11 window system, so it can happen on Linux, FreeBSD and other Unix-like systems.

This and another issue that I found while investigating it will be fixed in the next alpha release.

SuperFast LAME nearing completion

A whole bunch of changes have been incorporated into the SuperFast version of the LAME MP3 encoder component. It's almost complete now and an official preview release is getting closer.

This month's changes include:

  • Support for CBR mode and VBR rate limiting
  • Support for MP3s with frame CRCs
  • Writing Xing header table of contents
  • Writing Xing header CRCs

There is just one item left on my list now, which is related to handling the bit reservoir in high complexity situations (especially with MPEG 2 streams at 22.05 or 24 kHz). In that case, it can happen that an encoder thread tries to use more reservoir than is actually available. Special handling has to be implemented to resolve such situations. I hope to be able to finish this in May.

While waiting for SuperFast LAME, make sure to check out the 2nd SuperFast preview release with added support for FDK-AAC and Speex and tuning for Opus and Core Audio AAC.

 
Faster CRC checks to speed up codecs
Written by Robert   
Sunday, 29 April 2018 22:33

So, I kind of stumbled into this, but always looking for possible optimizations, I simply had to explore it...

tl;dr: I accelerated checksum calculations and thus encoding times of LAME, FLAC, Ogg and Monkey's Audio using an optimized CRC algorithm. Find patches at the end of this post. These will be part of the next fre:ac 1.1 alpha release.

Calculating Xing/LAME header CRCs

Working on the LAME MP3 implementation of my SuperFast technology, I came across the necessity to do CRC checksum calculations. Every MP3 created by LAME has a Xing or LAME VBR header at the beginning. It contains index points to the MP3 as well as information about duration and gapless playback. At the end of this header, there are two CRC checksums, one for the MP3 bitstream and one for the header itself.

As the bitstream repacker used in SuperFast LAME changes the MP3's internal structure, an update of the Xing/LAME header's CRC values is necessary afterwards. I started with a simple implementation of the CRC16 algorithm that I wrote for the smooth Class Library. This created a small delay at the end of each conversion when the CRC for the MP3 file was updated. Not a big deal for the usually small MP3s weighing in at 3-4 MB. With larger files, however, like when converting a whole album to a single output file, it became painful. The CRC calculation added a delay of half a second for a 60 MB file on my i7 6900K system. On slower systems it would be much more.

Steps to optimize the calculation

The first thing I tried was using compiler optimizations for the CRC routines (GCC's -O3 instead of -Os). This brought the delay down to about a quarter second. Still too much for my taste, though.

I then started looking for optimized CRC algorithms and found Matt Stancliff's crcspeed repository. It is based on an algorithm developed by Intel that uses additional lookup tables to enable processing of multiple input bytes in a single step. There are different variants of this algorithm in circulation, processing different numbers of bytes in each step, but it's generally called slicing-by-X (where X is usually 2, 4, 8 or 16).

I updated my CRC implementation to use the slicing algorithm and did some measurements. The slicing-by-8 variant turned out to be roughly 10 times faster than my original version and 5 times faster than the GCC -O3 compiled one. There was very little additional speedup when using slicing-by-12 (which I found to be the fastest) or slicing-by-16, so I decided to stick with slicing-by-8 as a good compromise between speed and memory requirements. Using the slicing-by-8 algorithm reduced the delay at the end of the 60 MB MP3 conversion to just a few tens of milliseconds.
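To make the slicing idea more concrete, here is a minimal sketch of a slicing-by-8 CRC next to the classic byte-wise loop. It is not the code from crcspeed or from the patches. The reflected polynomial 0xA001 with a zero start value (the CRC-16/ARC parameters) and the names MakeTables, Tables, Crc16Bytewise and Crc16Slice8 are assumptions chosen for illustration only. The input is read byte by byte, so the same code works on big-endian and little-endian CPUs.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Crc16Tables = std::array<std::array<std::uint16_t, 256>, 8>;

// Build the eight lookup tables. tables[0] is the classic byte-wise table,
// tables[k][v] is the CRC state after byte v followed by k zero bytes.
Crc16Tables MakeTables()
{
    Crc16Tables t{};

    for (unsigned i = 0; i < 256; i++) {
        std::uint16_t crc = static_cast<std::uint16_t>(i);

        for (int bit = 0; bit < 8; bit++)
            crc = static_cast<std::uint16_t>((crc >> 1) ^ ((crc & 1) ? 0xA001 : 0));

        t[0][i] = crc;
    }

    for (unsigned i = 0; i < 256; i++)
        for (unsigned k = 1; k < 8; k++)
            t[k][i] = static_cast<std::uint16_t>((t[k - 1][i] >> 8) ^ t[0][t[k - 1][i] & 0xFF]);

    return t;
}

// Tables are computed on first use. Initialization of a function-local
// static is thread-safe since C++11.
const Crc16Tables &Tables()
{
    static const Crc16Tables tables = MakeTables();

    return tables;
}

// Reference implementation: one table lookup per input byte.
std::uint16_t Crc16Bytewise(const std::uint8_t *data, std::size_t size, std::uint16_t crc = 0)
{
    assert(data != nullptr || size == 0);

    const Crc16Tables &t = Tables();

    while (size--) crc = static_cast<std::uint16_t>((crc >> 8) ^ t[0][(crc ^ *data++) & 0xFF]);

    return crc;
}

// Slicing-by-8: eight independent lookups per step, combined with XOR.
std::uint16_t Crc16Slice8(const std::uint8_t *data, std::size_t size, std::uint16_t crc = 0)
{
    assert(data != nullptr || size == 0);

    const Crc16Tables &t = Tables();

    while (size >= 8) {
        crc = static_cast<std::uint16_t>(
            t[7][(crc ^ data[0]) & 0xFF] ^ t[6][((crc >> 8) ^ data[1]) & 0xFF] ^
            t[5][data[2]] ^ t[4][data[3]] ^ t[3][data[4]] ^
            t[2][data[5]] ^ t[1][data[6]] ^ t[0][data[7]]);

        data += 8;
        size -= 8;
    }

    // Handle the remaining 0 to 7 bytes with the byte-wise loop.
    return Crc16Bytewise(data, size, crc);
}
```

Both functions return the same value for the same input. The sliced version only trades 4 KB of tables (8 x 256 x 2 bytes) for fewer dependent steps per byte.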

But I did not stop there...

Looking further

So, if I have to calculate CRC checksums for the Xing/LAME header, LAME itself will have to do the same. You just don't notice a delay, because the calculation is not done at once at the end, but spread over the whole encoding process. But does LAME use an optimized CRC implementation? As it turned out, no.

I updated the LAME CRC routines with the slicing-by-8 algorithm and got a speed-up of only 0.5%. Not much, but I wondered if other codecs (especially lossless ones that generate more data) might benefit more.

I looked further and found non-optimal CRC implementations in FLAC, Ogg (used for Opus, Vorbis and other codecs) and Monkey's Audio. Replacing them with the optimized algorithm yielded similar results to LAME for the lossy formats. The lossless formats, however, benefit more from the optimization and are sped up by about 5% due to more data being generated. When using Ogg FLAC, the speed-up is roughly 10% due to CRCs being calculated for both the FLAC audio frames and the Ogg container pages.

So we get up to 5% speed-up in the usual case and around 10% improvement for the Ogg FLAC format. All by simply replacing the CRC algorithm with an optimized version.

Technical considerations

The original Intel algorithm and Matt Stancliff's version require separate implementations for big-endian and little-endian CPUs. I converted the algorithm to an endian-independent form, i.e. only one variant for all processors. I did not measure any significant speed difference after making the code endian-independent when compiling with optimizations turned on.

It's possible to speed up the CRC calculations even more with other methods, such as using the PCLMULQDQ instruction on modern x86 CPUs. However, that would make the code depend on that platform and probably provide only marginal additional speed gains.

My implementation uses static lookup tables for LAME, FLAC and Ogg. This blows up code size a bit and I would have preferred calculating the tables on the fly on first use. That's difficult to get right in a portable, thread-safe way in plain C though, so on-the-fly table generation is used only for Monkey's Audio, which is written in C++ (allowing dynamic initialization of static data).

Speed gains

Here are some numbers showing relative speed gain when encoding and decoding with different codecs (all used with default settings):

Codec            Encode   Decode
LAME             0.5%     -
Opus*            0.5%     1%
Vorbis*          0.5%     2%
Monkey's Audio   4%       -
FLAC             5%       5%
Ogg FLAC         10%      15%

* Opus and Vorbis themselves are not optimized, but use the optimized Ogg container library.

The patches

Here are my patches to update the mentioned codecs' CRC calculations to the optimized slicing algorithm:

Update: The Monkey's Audio patch has been integrated in the official Monkey's Audio 4.34 release.

Here is a proof-of-concept FLAC build for Win64 for everyone to try out: flac-1.3.2-fastcrc-win64.zip

The patched codecs will be used in the next fre:ac 1.1 alpha release and I will contact the maintainers of these projects to request integration of the patches in official releases.

 
fre:ac development status update 03/2018
Written by Robert   
Saturday, 31 March 2018 17:45

Hi and welcome to the March 2018 status update on fre:ac development.

The past month finally saw the 1.0.32 release and a new fre:ac 1.1 alpha providing lots of fixes and new features. Apart from that, I worked mostly on two things: making the config dialogs for external codecs more beautiful and useful, and implementing the repacker part of the upcoming SuperFast LAME encoder.

Improved config dialogs for external codecs

The configuration dialogs for external codecs are generated on the fly from a description provided by the external codec's XML script. Until now, that dialog was very small and displayed only one option at a time:

The old config dialog for external codecs

The next alpha will feature an improved configuration dialog generator. It will create dialogs that show all the options at the same time and provide more space for them:

The new config dialog for external codecs

Progress on SuperFast codecs

I spent a lot of time on writing the repacker part for the SuperFast LAME encoder in the past few weeks. A proof-of-concept was done quickly, but implementing all the edge cases turned out to be more difficult than I initially thought. In particular, correct handling of the bit reservoir was a lot of work. Nevertheless, it's working great and I'm now in the testing stage.

The code for the SuperFast LAME component is now available in the SuperFast GitHub repository.

A few things are still missing, like support for CBR mode, CRC checksums and generating a valid Xing header, but I am confident that I can implement these things in the next few weeks.

I'm currently preparing a new SuperFast preview release based on fre:ac 1.1 alpha 20180306 which I plan to release in the next few days. As mentioned earlier, this will not include the SuperFast LAME encoder yet, but introduce SuperFast FDK-AAC and Speex along with some tweaks for the other codecs.

That's it for the moment. Make sure to come back in a month for another update.

 
fre:ac development status update 02/2018
Written by Robert   
Wednesday, 21 February 2018 19:51

Hi all, it's time for an update on fre:ac development again.

The past month has been quite productive. However, due to some planned features turning out to be more difficult to implement than expected and new bugs surfacing, there is no new alpha release yet.

But let's have a look at the good things - the changes implemented in the past month.

Format selection for single file conversions

Selection of desired output sample format

Until now, when converting multiple tracks to a single output file, fre:ac could only do that when all input tracks had the same sample format. When mixing different sample rates or mono and stereo tracks, an error message appears to inform you about the issue.

That will officially be over with the next alpha release. From then on, when encountering different sample formats, fre:ac will bring up a dialog to let you choose the desired output format.

Improved Delete after encoding option

The option to delete input files after a successful conversion has to be enabled for each new conversion task in current versions of fre:ac. This is to prevent you from accidentally deleting your music library when forgetting about that option being enabled.

Some users, however, use that option regularly and it's tedious for them to re-enable it all the time. Therefore, the next alpha will introduce some changes to it.

There will be an indicator on the Start conversion toolbar button and the corresponding menu entry to signal that the option is active:

New indicator for delete after encoding option.

Also, it will not disable itself after every conversion, but stay active until fre:ac is restarted. In addition, a checkbox in the confirmation dialog will allow you to keep the option enabled even across restarts, until you explicitly disable it.

Minor improvements

Some minor improvements and fixes have been implemented in the previous month and will be included in the next alpha:

  • Reading CD-Text will be supported on Linux and FreeBSD (not on macOS yet, though)
  • All CDDB dialogs will be resizable and remember the previous size
  • Fixed the quality setting for the FAAC encoder, which was not working correctly
  • Fixed a typing issue on macOS where edit fields would stop accepting input
  • Improved build system to auto-detect CDK path on Windows (no more editing Makefiles)

Progress on SuperFast LAME

I have made some progress on implementing a SuperFast version of the LAME encoder component. This requires some additional work due to how MP3 framing works. Frames are not independent entities in the MP3 bitstream, but can be intermixed with previous frames, i.e. the actual frame data can start before the frame header in a sparse area of a previous frame in order to make optimal use of the available bytes in frames.

This bitstream has to be unpacked into distinct frames in order for the SuperFast technology to work. The frames are then put in the correct order and are repacked into a new MP3 bitstream. I have a working proof-of-concept implementation of the unpacker already and will go for the repacker next. Once that is done, I will be able to assess performance and integrity of the implementation. What will be left then is to implement a Xing/LAME info header generator and make the proof-of-concept ready for a preview release.
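The unpacking step can be sketched roughly as follows. This is a heavily simplified model and not the SuperFast LAME code. The PhysicalFrame struct, its field names and the function UnpackFrames are hypothetical. It assumes that header and side info have already been parsed, that sizes are whole bytes, and that the back pointer (called main_data_begin in the MP3 format) counts only bytes of earlier data areas.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified view of one frame as stored in the bitstream.
struct PhysicalFrame {
    std::size_t               mainDataBegin;  // back pointer into earlier data areas, in bytes
    std::size_t               mainDataSize;   // bytes of main data belonging to this frame
    std::vector<std::uint8_t> data;           // data area following header and side info
};

// Returns the main data of every frame as an independent byte vector.
std::vector<std::vector<std::uint8_t>> UnpackFrames(const std::vector<PhysicalFrame> &frames)
{
    // All data areas seen so far, concatenated. A real implementation would
    // only keep the last few hundred bytes instead of the whole stream.
    std::vector<std::uint8_t>              reservoir;
    std::vector<std::vector<std::uint8_t>> unpacked;

    for (const PhysicalFrame &frame : frames) {
        const std::size_t slotStart = reservoir.size();

        // A frame cannot reach back further than the data written so far.
        assert(frame.mainDataBegin <= slotStart);

        reservoir.insert(reservoir.end(), frame.data.begin(), frame.data.end());

        // The frame's main data starts mainDataBegin bytes before its own data area.
        const std::size_t start = slotStart - frame.mainDataBegin;

        // Main data must end within the frame's own data area.
        assert(start + frame.mainDataSize <= reservoir.size());

        unpacked.emplace_back(reservoir.data() + start, reservoir.data() + start + frame.mainDataSize);
    }

    return unpacked;
}
```

With three frames whose data areas are {1,2,3,4}, {5,6,7} and {8,9}, back pointers 0, 2 and 1 and main data sizes 2, 4 and 3, the sketch yields {1,2}, {3,4,5,6} and {7,8,9}. The repacker would then have to do the reverse, writing such distinct frames into a new bitstream.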

SuperFast LAME will not be ready for the next update, but progress is being made and there should be a first preview available in a few months.

Next alpha and stable release

The new alpha originally planned for January should be available within the next two weeks followed by the release of fre:ac 1.0.32 a few days later.

 
fre:ac development status update 01/2018
Written by Robert   
Tuesday, 09 January 2018 21:57

Hi all, after many months, here is a new update on fre:ac development. As it has been a long time since the last update, many things have happened and I will concentrate on the most important.

Repositories on GitHub

The code repositories for fre:ac, BoCA and the smooth Class Library have been moved to GitHub. The repositories still used the ancient CVS version control system before, so the migration is also a modernization of the project infrastructure. It will make working with feature branches and collaboration with other developers much easier in the future.

Please show your support by starring the repositories on GitHub!

Digital Signal Processing Engine

Since the 20171119 release, fre:ac finally has a working DSP engine that will be further improved in the next alpha.

While the current release enables you to use the resampler component to control the sample rate, the next version will add more format converters and some effect DSPs:

  • Sample Format Converter
    This component will convert between different sample resolutions (i.e. 8, 16, 24 or 32 bit) and between integer and floating point samples.
  • Channel Converter
    The channel converter will provide channel downmixing capabilities, e.g. converting from 5.1 to stereo and similar transformations.
  • RNNoise
    This is a noise reduction component based on a neural network and designed for speech. You should try it when converting speech recordings the next time - the results are impressive. If you are doing speech recordings regularly, make sure to have a look at the website for more information.
  • Rubber Band
    This component can control the speed and pitch of recordings independently. It's great if you are into speed-listening to audio books, but also fun to play around changing the speed of music tracks.

In addition, fre:ac will be able to do automatic conversions when a specific sample format is required by an encoder. For example, the AAC format only supports certain sample rates and earlier versions of fre:ac showed an error message if the input did not match it. Future fre:ac releases will automatically upsample to the next supported rate instead.
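Picking the next supported rate could look roughly like the sketch below. It is an illustration only, not fre:ac code. The list contains the sample rates of the standard MPEG-4 AAC sampling frequency table (without 7350 Hz). A particular encoder may support only a subset, and falling back to the highest rate for inputs above 96 kHz is an assumption. The names kAacRates and NextSupportedRate are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Sample rates from the MPEG-4 AAC sampling frequency table, in ascending order.
constexpr std::array<int, 12> kAacRates = { 8000, 11025, 12000, 16000, 22050, 24000,
                                            32000, 44100, 48000, 64000, 88200, 96000 };

// Returns the input rate if it is supported, otherwise the next higher
// supported rate. Rates above the maximum fall back to the maximum
// (an assumption, this case means downsampling).
int NextSupportedRate(int inputRate)
{
    assert(inputRate > 0);

    // lower_bound finds the first entry that is not less than inputRate.
    auto it = std::lower_bound(kAacRates.begin(), kAacRates.end(), inputRate);

    return it != kAacRates.end() ? *it : kAacRates.back();
}
```

For example, an input at 37800 Hz would be resampled to 44100 Hz, while 44100 Hz input would be passed through unchanged.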

SuperFast Codecs

In September, I published my work on speeding up encoding using multiple parallel encoder instances. Check out the corresponding article for a fre:ac preview release with this technology - dubbed SuperFast Conversion - enabled. It will soon make its appearance in an official fre:ac release as an experimental option and I hope to be able to enable it by default in a later release.

The preview release mentioned above includes SuperFast versions of the Opus, FAAC and Core Audio encoders. Since that release, the technology has been implemented for the FDK-AAC and Speex encoders as well and there will soon be an update. I also plan on implementing this for the LAME MP3 encoder, but that will take some time due to the peculiarities of the MP3 format and related difficulties in applying this technology.

Core Audio on x64 Linux

Prior to the 20171119 release, to get the Core Audio encoder working on 64 bit Linux, you needed a 32 bit Wine installation. While this combination is quite common, there are some who use 64 bit Wine instead. The 20171119 release now supports both variants making it much easier to use the Apple encoder on 64 bit Linux.

Codec Updates

There have been several important codec updates during the last few months. LAME 3.100 has been released more than five years after 3.99.5 and work on the FAAC encoder has resumed after more than 9 years of almost complete inactivity. The new versions are included in fre:ac 1.0.31a and the 20171119 alpha release.

Also, the Monkey's Audio (APE) encoder is now available on non-Windows systems thanks to the help of its developer. It was great to work with him to make Monkey's Audio more portable.

And more...

There are many more smaller improvements and fixes that went into the latest releases or will go into the upcoming ones, but after such a long time it's really too much to mention everything here in detail. Here are just some keywords for the more notable ones:

  • Fixes for running on macOS 10.13
  • Allow CDDB queries when no CD drive is present
  • Fixes for USB CD-ROM drive detection
  • Made AAC decoders work independent of file extensions

Upcoming Releases

Work on the next fre:ac 1.1 alpha release is progressing really well and I expect it to be out in January. It will be almost feature complete in terms of my plans for 1.1, so it should be the last alpha release before fre:ac 1.1 beta 1. Hooray!!! :)

There will also be an update to the 1.0.x release series with some minor fixes probably in February.

 
Multi-threaded Opus and AAC encoding
Written by Robert   
Sunday, 03 September 2017 15:23

I always liked doing performance optimizations – such projects are often fun to work on and it's very rewarding when an idea actually works out in practice.

tl;dr: I implemented multi-threaded drivers for Opus Audio as well as the FAAC and Apple Core Audio AAC codecs – more to come. Scroll down or click here to download experimental fre:ac builds with this technology.

Introducing SuperFast codec drivers for fre:ac

The biggest step for the fre:ac project regarding performance was the introduction of a multi-threaded conversion engine in 2014. Running multiple conversions at the same time allowed fre:ac to scale almost linearly with the number of CPU cores, compared to converting only one file at a time.

Serial vs. parallel conversion in fre:ac

Parallel conversions, however, only help when there actually are multiple items to convert in the first place. When converting fewer files than there are CPU cores available, it's still inefficient. And when combining multiple files into a single output file, there is still only one CPU core used.

Converting fewer files than there are CPU cores and converting to a single output file

So, while a big step forward, the situation was still a little unsatisfying in certain cases which made me explore ways to further optimize the conversion engine.

Choosing the best method

In general, there are three different ways to distribute the work done by an audio codec to multiple CPU cores:

  1. Low level data parallelism – parallelizing loops
    To make use of low level data parallelism, you look for loops that do the same operation on many data values. A basic example would be multiplying a large array of values by a constant. You can spread the work over multiple threads, each doing the operation on a part of the values. For such optimization, you would usually use something like OpenMP which makes loop parallelization pretty easy. When it comes to complex algorithms, though, not all loops will be parallelizable, limiting the maximum speed-up you can gain with this method. The Lancer optimizations for the Vorbis audio codec are an example of this approach.

  2. Task parallelism – pipelining subtasks
    A second method is pipelining. It can be used if the work to be done consists of multiple stages. For an audio codec, these stages might be MDCT processing (converting time-domain samples to frequency values), psychoacoustic analysis and actual audio encoding. While the later stages depend on the results of the earlier ones, audio codecs usually work on blocks of samples, so a later stage can start its work as soon as the preceding stage finished a block. However, this approach only scales well as long as you have enough compute-intensive stages to fully utilize the available CPU cores. It can be combined with loop parallelization, though, to achieve further speedup. LAME MT is an example of a project using this method (without additional loop parallelization).

  3. High level data parallelism – splitting work into fragments
    The third method splits the work to be done into multiple parts each processed by a separate thread. For example, the first and second half of an audio file can be encoded separately and the results later be joined back together. Seeming very easy at first glance, this method has some issues due to frames in an audio file not being completely independent. The LAME MT project tried this method at first, but later found that the project's constraint of producing a result bit-identical to the non-parallel version could not be met with this approach and resorted to pipelining.
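As an illustration of the first method, the basic example of multiplying an array by a constant can be sketched with plain std::thread instead of OpenMP. This is a toy example and not taken from Lancer or any codec. The function name ScaleParallel and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Multiply all samples by gain, splitting the loop over threadCount threads.
void ScaleParallel(std::vector<float> &samples, float gain, unsigned threadCount)
{
    assert(threadCount > 0);

    std::vector<std::thread> threads;

    // Number of values per thread, rounded up.
    const std::size_t chunk = (samples.size() + threadCount - 1) / threadCount;

    for (unsigned t = 0; t < threadCount; t++) {
        const std::size_t begin = std::min(samples.size(), t * chunk);
        const std::size_t end   = std::min(samples.size(), begin + chunk);

        if (begin == end) break;

        // Each thread works on its own, non-overlapping part of the array.
        threads.emplace_back([&samples, gain, begin, end] {
            for (std::size_t i = begin; i < end; i++) samples[i] *= gain;
        });
    }

    for (std::thread &thread : threads) thread.join();
}
```

This works so easily only because every loop iteration is independent of the others. Loops in a real codec often carry state from one iteration to the next, which is what limits this method.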

Looking at and thinking about the three methods, I quickly came to the conclusion that the first two were not feasible for a project like fre:ac. They would require modifications deep in the respective codecs and each new codec release would require reviewing the changes. Such modifications are better suited for being done in dedicated projects like Lancer or LAME MT.

The third method, however, could theoretically be implemented on a higher level – just run several instances of a codec in parallel and concatenate their output back together in the correct order.

This seemed reasonably easy and had the advantage that the same method might be usable for different codecs. I decided to try implementing this on top of the Opus codec first, because it has a constant block size and outputs raw packets that can be intercepted and reordered before being written to the output file.

How it works

For the first proof-of-concept implementation, I made the codec driver split the input data into blocks of several frames of audio data and distribute those blocks to worker threads in a round-robin manner. Except for very short files, there will usually be many more fragments than there are threads to be used, which allows for quicker distribution of work to the threads.

The worker threads then do their job in the background and collect packets of encoded audio data obtained from their respective codec instances.

When it's a thread's turn for getting new data, the codec driver asks it for the collected audio data packets, writes those to the output stream and finally assigns a new block to the thread.

Basically, the inside of the main conversion loop looks like this:

Worker *worker = workers.GetNth(nextWorker++ % workers.Length());

/* See if the worker has some packets for us.
*/

if (worker->GetPackets().Length() != 0) ProcessPackets(worker->GetPackets(), ...);

/* Pass new frames to worker.
*/

worker->Encode(samplesBuffer, ...);
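For readers who want to experiment with the pattern, here is a self-contained toy version of that loop. It is a sketch under stated assumptions, not the fre:ac codec driver. Worker, Block, Packet and EncodeAll are hypothetical stand-ins, std::async replaces the real worker threads, and the "encoding" just doubles each value.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <utility>
#include <vector>

using Block  = std::vector<int>;  // stand-in for a block of audio frames
using Packet = int;               // stand-in for an encoded packet

class Worker {
public:
    // Start encoding a block in the background.
    void Encode(Block block)
    {
        assert(!pending.valid());  // previous packets must be collected first

        pending = std::async(std::launch::async, [block = std::move(block)] {
            std::vector<Packet> packets;

            for (int frame : block) packets.push_back(frame * 2);  // fake "encoding"

            return packets;
        });
    }

    // Wait for the current block (if any) and hand over its packets.
    std::vector<Packet> GetPackets()
    {
        return pending.valid() ? pending.get() : std::vector<Packet>();
    }

private:
    std::future<std::vector<Packet>> pending;
};

std::vector<Packet> EncodeAll(const std::vector<Block> &blocks, std::size_t workerCount)
{
    assert(workerCount > 0);

    std::vector<Worker> workers(workerCount);
    std::vector<Packet> output;
    std::size_t         nextWorker = 0;

    auto collect = [&output](Worker &worker) {
        for (Packet packet : worker.GetPackets()) output.push_back(packet);
    };

    for (const Block &block : blocks) {
        Worker &worker = workers[nextWorker++ % workers.size()];

        collect(worker);       // packets of the block this worker got one round earlier
        worker.Encode(block);  // pass the next block to the worker
    }

    // Drain the remaining workers in the same round-robin order.
    for (std::size_t i = 0; i < workers.size(); i++) collect(workers[nextWorker++ % workers.size()]);

    return output;
}
```

Because blocks are handed out and collected in the same round-robin order, the packets end up in the output in the original order without any sorting.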

Details

The naive implementation of splitting the audio data into several parts encoded by separate codec instances and concatenating the output back together proved to be problematic, though.

As mentioned above, frames are not completely independent of each other. Transformations and filters applied during the encoding process need multiple frames worth of audio data to adapt, so concatenating packets created with different initial filter states can lead to heavy distortions.

Difference signal of single vs. multi-threaded conversion

To avoid this effect, the audio fragments to encode have to overlap by a few frames to give the filters enough time to adapt. After trying out different values, I settled on an overlap of 8 frames for an Opus frame size of 10ms or more (a larger overlap is necessary at shorter frame sizes).

Difference signals with 1 and 8 frames overlap

With an overlap of 4 or more frames there are no more distortions. The remaining differences are on the same level as when encoding the same file starting at different offsets and should be unnoticeable.

Resulting implementation

The current implementation of this method for Opus and AAC uses blocks of 128 frames with an overlap of 8 frames. That's a 6.25% or 1/16th overlap limiting the maximum possible speedup to 16x. In practice, though, even if you have a 16 core CPU, the actual speedup will be lower because of non-parallelized decoders and additional work needed for thread management.
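The block layout can be sketched as follows. This is one possible reading and not necessarily how the codec drivers do it. It assumes that each block delivers 128 frames of output and that the 8 overlap frames are extra warm-up frames fed to the codec instance before the block, with their packets being discarded. The names Fragment and SplitIntoFragments are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

struct Fragment {
    std::size_t feedStart;  // first frame fed to the codec instance (includes warm-up)
    std::size_t keepStart;  // first frame whose packets are kept
    std::size_t end;        // one past the last frame of the fragment
};

// Split totalFrames into blocks of blockSize frames, each preceded by up to
// overlap warm-up frames taken from the end of the previous block.
std::vector<Fragment> SplitIntoFragments(std::size_t totalFrames, std::size_t blockSize = 128, std::size_t overlap = 8)
{
    assert(blockSize > 0);

    std::vector<Fragment> fragments;

    for (std::size_t start = 0; start < totalFrames; start += blockSize) {
        Fragment fragment;

        fragment.keepStart = start;
        fragment.feedStart = start - std::min(start, overlap);  // the first block has no warm-up
        fragment.end       = std::min(totalFrames, start + blockSize);

        fragments.push_back(fragment);
    }

    return fragments;
}
```

For 300 frames this yields the fragments [0, 128), [128, 256) and [256, 300), with the second and third being fed from frames 120 and 248. In total, 316 frames are encoded to produce 300 frames of output. For full blocks, that is the 8/128 = 6.25% of extra work mentioned above.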

The multi-threaded codec driver solves the aforementioned problems of parallelizing conversions to a single output file and improving efficiency when converting fewer files than there are CPU cores.

Serial versus multi-threaded processing when converting to a single output file

Multi-threaded processing when converting fewer files than there are CPU cores

Performance numbers

The speedups actually achieved are quite significant. So good, in fact, that even on a 6-core CPU the conversion speed is often limited by decoding speed, which makes me think about future possibilities of parallelizing the decoder side as well.

Multi file mode vs. single file mode

Speed comparison of regular vs. SuperFast codecs in multi and single file mode

In multi file mode, the SuperFast codec drivers are only slightly faster than the regular ones. Only one codec instance per track was used in this test and the performance advantage solely stems from decoding and encoding being done in separate threads.

In single file mode, though, the SuperFast codecs show their full potential. Using four threads, the conversion is already three times faster than with the regular drivers, which use only a single thread in this mode.

Note that the multi-threaded Opus encoder will exhaust its full potential only with 48 kHz audio data as that is the only sample rate natively used by Opus. The resampler used for feeding other sample rates to Opus still runs in non-parallel mode. I'm looking into ways of parallelizing it too.

What's next?

I chose Opus and AAC for the initial implementation of this idea, because they use constant frame sizes and straightforward frame packing. Thus, it was relatively easy to implement this idea on top of them without having to worry about too many side effects.

Adapting this method to other AAC codecs like FDK-AAC or other codecs with constant frame sizes like Speex should be relatively easy and will be done soon.

After that, the next step will be implementing this technology on top of the LAME MP3 encoder, which will be a bit more challenging. In MP3 files, frames can actually start before their frame header to make use of unused bits from previous frames. Therefore, you cannot simply concatenate the output of the encoder instances, but need to unpack the frames first and re-pack them when writing the output stream.

Some other codecs don't lend themselves very well to this kind of optimization. These include Ogg Vorbis (because of how block size switching changes the resulting frame lengths) as well as FLAC and WMA (because their APIs do not allow intercepting packets before writing the output stream). It currently does not seem feasible to implement this technology on top of those codecs.

Downloads

[Update: Multi-threaded FDK-AAC and Speex components have been added in the second preview release. The links below have been updated.]

Download an experimental fre:ac build with multi-threaded Opus, FAAC*, FDK-AAC, Core Audio and Speex converters:

* The FAAC codec is provided as a fallback when neither FDK-AAC nor the Core Audio encoder is available.

Source code

The source code for the multi-threaded codec drivers can be obtained at:

https://github.com/enzo1982/superfast

 