New real FFT routines

DSP, Plug-in and Host development discussion.
KVRist
50 posts since 7 Jun, 2018

Post Sat Jan 23, 2021 12:40 pm

For all those who may be interested, I have just written a new fft/ifft C function pair working directly on real sequences and on half-complex spectra. I had been searching for such a thing for a long time without luck, so I decided to set to work and create my own implementation. Compared to the "standard" way of feeding the real and imaginary inputs of a complex fft with a scrambled sequence and rearranging the result into a half-complex spectrum, these routines are about 30% faster, as well as more compact (no macros or sub-function calls; everything is contained in a single source file).
Instructions and information included in the source.
The source is freely distributable and usable and can be adapted or modified if needed, on the sole condition that the information text at the beginning is left in place.

http://www.elenadomain.it/listing/SE/ElenaRFFT.c

KVRAF
6192 posts since 12 Feb, 2006 from Helsinki, Finland

Post Sat Jan 23, 2021 7:17 pm

I've written a somewhat realistic (single-file) FFT. Performance-wise it generally beats Ooura (which is also single-file), although not necessarily in a fair comparison (mine uses SSE, Ooura doesn't), and as far as I know mine was pretty competitive among the free double-precision options at the time I wrote it, although these days I believe there are faster choices available (and I think some licenses have been made more liberal too). The process of writing one was a very good learning exercise and I would recommend it to everyone, but it also means that I know a thing or two about making FFTs actually competitive.

Now, when you make claims about the performance of an algorithm such as the FFT, which a lot of people have spent a LOT of time optimizing, you had better benchmark your code against existing libraries at different transform sizes. Everyone and their pet chicken claims that their FFT is faster than some naive text-book algorithm, but that doesn't mean anything.

There is a reason most realistic FFTs are fairly complicated: the strategies that are fastest for medium-sized blocks end up being slow for larger sizes (because cache is everything once you reach a certain transform size), while very small blocks benefit from special cases (because a large fraction of the twiddles are easier than the general case).

For fairly small blocks, even trivial things like shuffling the bit-reversed order can be slow enough that it's highly profitable to compute the required swaps in advance and just unroll it all into the code for every transform size directly. When I wrote mine, I did this for blocks up to 256 elements, which is around the breakpoint where the overhead of computing the swaps on the fly starts to become insignificant compared to the actual cost of computing the transform and blowing more code-cache on it becomes less attractive.
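As a concrete (hypothetical) illustration of precomputing the shuffle: the bit-reversal permutation for a power-of-two size can be built once with a simple recurrence, then either used as a lookup table or unrolled into explicit swaps for each fixed size:

```c
#include <stddef.h>

/* Sketch: precompute the bit-reversal permutation once, so the shuffle
   pass is a table lookup (or, for small fixed sizes, a fully unrolled
   swap list) instead of per-element bit twiddling on the fly. */
static void build_bitrev_table(size_t *tab, unsigned log2n)
{
    size_t n = (size_t)1 << log2n;
    tab[0] = 0;
    for (size_t i = 1; i < n; i++) {
        /* reversed(i) = reversed(i >> 1) shifted down one position,
           with the low bit of i moved to the top position */
        tab[i] = (tab[i >> 1] >> 1) | ((i & 1) << (log2n - 1));
    }
}
```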

Ignoring the shuffling, the general idea with the main transform is to stride multiple sub-transforms of smaller blocks, but only in chunks large enough that your entire working set fits in L1. Once you're bigger than L1 (something you never experience with a real transform of max size 16384 in single precision, because that's just about 128kB), computing one sub-transform at a time quickly becomes much faster (until the sub-transforms are smaller than L1 again, at which point you should stride again). So you definitely need a minimum of two different strategies, otherwise you'll be limited to relatively small blocks only. On the other hand, the very small transforms can take advantage of special rotations for pi/2, pi/4 and possibly pi/8 and should be special-cased; split-radix is also very profitable here, even though radix-4 might be easier at larger sizes (for regularity; trying to use split-radix all the way becomes quite complicated).

From looking at your code I can immediately tell that you've taken the first steps towards making something realistic, but at the same time we're still looking at something that would be classified as a de-pessimized text-book algorithm, where some of the obvious problems are fixed but very little real-world performance tuning has been done. I seriously suggest benchmarking against some production library to get an idea of where you actually stand in terms of performance. I suspect it will be slow for small (eg. <256) transforms, probably sort of reasonable for medium-size transforms, and very slow once you get past the L1 barrier.

It's all cool to try to write a "fast" simple FFT, but let's be real (pun intended): simple FFTs are never truly competitive with more complex FFTs that separately optimize transforms of different sizes.
Preferred pronouns would be "it/it" because according to this country, I'm a piece of human trash.

KVRAF
6192 posts since 12 Feb, 2006 from Helsinki, Finland

Post Sat Jan 23, 2021 7:30 pm

Also, please don't get me wrong: I'm sharing these thoughts purely because I find the problem of optimizing FFTs quite interesting, yet the complexity of the problem is often severely under-appreciated. It's one of those problems where the seemingly simple algorithm necessarily becomes more and more complex as you push further down the performance slope.

Seriously though, benchmark it against existing (production-quality!) libraries. If you're 50% slower than FFTW, that's fine; but if you're 50 times slower, then it hardly matters how much faster you are than some naive text-book algorithm.

KVRist
247 posts since 8 May, 2007

Post Sun Jan 24, 2021 10:32 am

Hi all,

I just wanted to say that, based on writing my own FFT library in 2006, I completely agree with everything that mystran wrote in the previous posts. The subject is really fascinating and is probably largely misunderstood by those who merely use FFT libraries or have never written anything beyond power-of-two routines. The advice about multiple strategies is especially important if you don't restrict yourself to small primes, etc. As an aside, some library routines from several years ago for calculating Bessel functions could have benefited from two strategies: they were way off for either high or low indices and were apparently not tested very well at all.

Regards,
Dave Clark

KVRAF
6192 posts since 12 Feb, 2006 from Helsinki, Finland

Post Sun Jan 24, 2021 12:58 pm

DaveClark wrote:
Sun Jan 24, 2021 10:32 am
The subject is really fascinating and is probably largely misunderstood by those who use FFT libraries or have never written anything beyond power-of-two routines.
I actually have not looked into anything other than power-of-two, but even power-of-two libraries benefit from quite a diverse set of techniques depending on the size of the transform. :)

KVRist

Topic Starter

50 posts since 7 Jun, 2018

Post Sun Jan 24, 2021 4:25 pm

In no way have I claimed that my fft is "faster". I just specified that, with respect to the common approach of splitting a real sequence, feeding a more or less standard complex fft with both real and imag parts, and post-processing the result, my routines are 30% faster because they have no redundant passes and operations. Also, I have never previously seen any natively real fft implementation; correct me if I am wrong. All those I have seen adopt, directly or indirectly, the approach described above. Unfortunately I have no time to run benchmarks, but anybody who is curious can do that. Mine is not a commercial product but open source code that I wrote for my own purposes and decided to release as a bonus. Those who don't need it are not forced to use it, and I am not selling anything. But please consider what follows. Other than performing the least number of operations needed, or almost so, my code uses the following optimizations:

1. In the first pass, where data are scrambled into bit-reversed order, two-point partial spectra are already generated, thus avoiding one pass with its consequent memory accesses.

2. Access to every data cell is guaranteed to happen only once per read and once per write at every pass.

3. The roots of unity (twiddle factors) are precomputed only up to pi/4; all other symmetric cases are obtained by trivial changes of sign, thus avoiding accesses to the table and saving many complex multiplications.

It surely won't compete with a similar thing written in assembler, but I don't think I achieved such a bad result either...

User avatar
KVRist
75 posts since 3 Jan, 2021

Post Sun Jan 24, 2021 4:48 pm

// This source code can be freely used, distributed, adapted or even modified until the present
// accompanying header is not removed
I think you mean "as long", not "until".

KVRist
247 posts since 8 May, 2007

Post Mon Jan 25, 2021 3:15 pm

elena2 wrote:
Sun Jan 24, 2021 4:25 pm
Also I have never seen previously any natively real fft implementation and correct me if I am wrong. All those I have seen adopt directly or indirectly the approach described above.
Why don't you count Hartley? Just curious why you didn't mention it ...

Regards,
Dave Clark

KVRAF
6192 posts since 12 Feb, 2006 from Helsinki, Finland

Post Mon Jan 25, 2021 6:11 pm

elena2 wrote:
Sun Jan 24, 2021 4:25 pm
1. In the first passage where data are scrambled in bit reverse order, two-points partial spectra are already generated thus avoiding one pass with consequent memory access
I'm curious, have you looked into split radix and taking the simple twiddles for 4-point and 8-point FFTs into account?

For 4-point the twiddle is just i (ie. complex multiplication is a component shuffle; this is also why radix-4 is so great in the general case) and for 8-point it's +/-sqrt(.5)+/-i*sqrt(.5) (ie. complex multiplication is just sums/differences with an overall scale factor). So if you have a small transform, say 32 points, you can use split radix to get two 8-point transforms (= fast special case) and one 16-point transform, where the latter can then be split into 4+4+8, which are all fast special cases.
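A hedged sketch of why these two twiddles are "free" (the function names are mine, not from any particular library): multiplying by -i is a component swap plus one sign flip, and multiplying by (1 - i)/sqrt(2) is one add, one subtract and a scale, with no general complex multiply:

```c
/* Illustration of the cheap twiddles: no full complex multiplication
   (4 mults + 2 adds) is needed for either case. */
static void mul_by_minus_i(double *re, double *im)
{
    double t = *re;
    *re = *im;   /* (re + i*im) * (-i) = im - i*re */
    *im = -t;
}

static void mul_by_w8(double *re, double *im)  /* w8 = (1 - i)/sqrt(2) */
{
    const double s = 0.70710678118654752440;   /* sqrt(0.5) */
    double t = *re;
    *re = s * (t + *im);  /* (re + i*im)(1-i)/sqrt2 = ((re+im) + i(im-re))/sqrt2 */
    *im = s * (*im - t);
}
```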

This does blow up the code complexity a bit with all the special cases, though, and if you do something like this then it's not clear whether special-casing 2-point transforms is profitable, because you probably should not be computing any 2-point transforms at all, which might explain why it's not too common in more complicated libraries. If you don't do this optimization (to keep the code more compact), then sure, it sounds like a great idea.
3.The roots of unity (twiddle factors) are precomputed only upto pi/4, all other symmetric cases are obtained by trivial changes of sign thus avoiding accesses to the table and saving many complex multiplications.
I think this is actually one of the more common optimizations when you go beyond text-book, because it's very easy to implement and pretty much always profitable (eg. I don't think it should cause any problems even with very large transforms, as long as your L1 is at least 4-way associative, and basically all of them are), whether or not you use tables (eg. it saves some computations with recurrent twiddles too).

KVRian
636 posts since 21 Feb, 2006 from FI

Post Wed Feb 03, 2021 3:48 am

Anyone tried this implementation (source code link on page #2)?

KVRAF
6192 posts since 12 Feb, 2006 from Helsinki, Finland

Post Wed Feb 03, 2021 4:46 am

juha_p wrote:
Wed Feb 03, 2021 3:48 am
Anyone tried this implementation (source code link on page #2)?
I tried that back in 2014(?) when working on DustFFT, but never managed to get really good performance out of that kind of template setup. In some sense it's very similar to what DustFFT does, except that rather than using C++ templates I used a code generator to produce C code for all the variants (and the code generator is really just a sort-of ad-hoc template expander). The template approach is still better than text-book for sure, but at least back then generated C code was a better choice.

The funny thing is that with some compilers I observed that DustFFT would run faster if forced to compile as C rather than C++, and I'm pretty sure some versions had an #error to sanity-check that __cplusplus wasn't defined; we are talking about literally the same code with the same compiler, and it was happening even with -fno-exceptions (or the MSVC equivalent; I can't remember which compilers it was). I don't know what the situation with current compilers would be, but the current version doesn't seem to have the #error anymore.

User avatar
KVRist
75 posts since 3 Jan, 2021

Post Wed Feb 03, 2021 4:58 am

juha_p wrote:
Wed Feb 03, 2021 3:48 am
Anyone tried this implementation (source code link on page #2)?
Why pick that over fftw3?

KVRAF
6192 posts since 12 Feb, 2006 from Helsinki, Finland

Post Wed Feb 03, 2021 6:46 am

uOpt wrote:
Wed Feb 03, 2021 4:58 am
juha_p wrote:
Wed Feb 03, 2021 3:48 am
Anyone tried this implementation (source code link on page #2)?
Why pick that over fftw3?
GPL.

User avatar
KVRist
75 posts since 3 Jan, 2021

Post Wed Feb 03, 2021 7:30 am

mystran wrote:
Wed Feb 03, 2021 6:46 am
uOpt wrote:
Wed Feb 03, 2021 4:58 am
juha_p wrote:
Wed Feb 03, 2021 3:48 am
Anyone tried this implementation (source code link on page #2)?
Why pick that over fftw3?
GPL.
Oh. I didn't realize. Thanks.

KVRist
247 posts since 8 May, 2007

Post Wed Feb 03, 2021 9:33 am

uOpt wrote:
Wed Feb 03, 2021 7:30 am
mystran wrote:
Wed Feb 03, 2021 6:46 am
uOpt wrote:
Wed Feb 03, 2021 4:58 am
juha_p wrote:
Wed Feb 03, 2021 3:48 am
Anyone tried this implementation (source code link on page #2)?
Why pick that over fftw3?
GPL.
Oh. I didn't realize. Thanks.
Another reason for choosing something else or writing one's own is that FFTW3 is almost impossible to debug, or at least I found that to be the case. Although the authors boldly claimed it was written in the "lingua franca" of the scientific community (C, in case anyone like me assumed that meant FORTRAN, which was transitioning to C++ at that time, with C a distinct minority in science), the C code for the routines is machine-generated by Caml Light scripts and is effectively read-only. It might as well have been purposefully obfuscated. The authors themselves have admitted that there are cases in which they don't know what the code is doing.

Now it's been many years since I considered all of this, so perhaps better information is out there and perhaps the most recent versions are eminently readable.
