Complete list of C/C++ FFT libraries

DSP, Plugin and Host development discussion.
Post Reply New Topic
RELATED
PRODUCTS

Post

I got carried away and made this list.
https://community.vcvrack.com/t/complet ... aries/9153
Hopefully you might find this useful for finding an FFT library you didn't know about.
Reply or email contact@vcvrack.com for corrections or to recommend a missing library.
VCV Rack, the Eurorack simulator

Post

Interesting overview, thanks for that. I'm using PFFFT and feel kind of reconfirmed that this was a good choice. The code is a bit ugly, but it seems to be the best option if you need a reasonably fast implementation with a liberal license.

Post

karrikuh wrote: Thu Apr 09, 2020 5:50 am Interesting overview, thanks for that. I'm using PFFFT and feel kind of reconfirmed that this was a good choice. The code is a bit ugly, but it seems to be the best option if you need a reasonably fast implementation with a liberal license.
Looking quickly at PFFFT code, it doesn't really seem particularly ugly. The basic problem with realistic (ie. faster than text-book) FFT routines is that you typically need quite a bit of duplication with small variations, because you tend to need different strategies for different transform sizes. In theory, you could use C++ template expansion for some of this, but back when I was working on these I found that you can get significantly faster code in practice by explicitly separating the different routines manually (or in my case, by using an external code-generator).

For example, with very small sizes the overhead of rearranging the data into a SIMD friendly format can be significant, to the point that a seemingly less efficient approach ends up faster and special case "peep-hole" optimisations are attractive. On the other hand, if the small transform is a sub-transform with the data already in the ideal order, then a different set of special-case optimisations might apply.

With slightly larger transforms, one can generally assume that the data fits caches and hence striding multiple sub-transforms together can be faster (including the very small special-cased sub-transforms). On the other hand, once the transform grows large enough that cache size becomes and issue, it can be faster to just process one sub-transform at a time. The actual break-points also vary by CPU, which is why some libraries (eg. FFTW) prefer to profile the actual target machine in order to find the fastest combination.

This also leads to a situation where given a moderate to large transform, you basically have to use different strategies for different sizes of sub-transforms. You might first divide into fully separate sub-transforms, then divide those further into strided transforms and finally process the smallest blocks with special case code. Especially when using split-radix, a single large transform might also eventually decompose into small transforms of various different sizes, which might benefit from different special case code.

There's a lot more to optimising an FFT, but this should give you an idea why most practical libraries have a lot of seemingly repetitive code, even though the text-book algorithm is very short.

ps. The algorithm is also very sensitive to all kinds of overheads. There are some algebraic optimisations that can be done, but in practice the biggest challenge in optimising a transform is to figure out how to actually keep the FPUs busy.

Post Reply

Return to “DSP and Plugin Development”