New real FFT routines

 KVRist
 50 posts since 7 Jun, 2018
For all those who may be interested, I have just written a new FFT/IFFT C function pair working directly on real sequences and on half-complex spectra. I had been searching for such a thing for a long time without luck, so I decided to set to work and create my own implementation. Compared to the "standard" way of feeding the real and imaginary parts of a complex FFT with a scrambled sequence and rearranging the result into a half-complex spectrum, these routines are about 30% faster as well as more compact (no macros or sub-function calls; everything is contained in a single source file).
Instructions and information included in the source.
The source is freely distributable and usable and can be adapted/modified if needed at the only condition that the information text at beginning is left in place.
http://www.elenadomain.it/listing/SE/ElenaRFFT.c

 KVRAF
 6192 posts since 12 Feb, 2006 from Helsinki, Finland
I've written a somewhat realistic (single-file) FFT. Performance-wise it generally beats Ooura (which is also single-file) in most cases, although not necessarily in a fair comparison (mine uses SSE, Ooura doesn't), and as far as I know mine was pretty competitive among the free ones for double precision at the time I wrote it, although these days I believe there are faster choices available (and I think some licenses have been made more liberal too). The process of writing one was a very good learning exercise and I would recommend it to everyone, but it also means that I know a thing or two about making FFTs actually competitive.
Now, when you make claims about the performance of an algorithm such as the FFT, which a lot of people have spent a LOT of time optimizing, you had better benchmark your code against existing libraries at different transform sizes. Everyone and their pet chicken claims that their FFT is faster than some naive textbook algorithm, but that doesn't mean anything.
There is a reason most realistic FFTs are fairly complicated, and it has to do with the fact that the strategies that are fastest for medium-sized blocks end up being slow at larger sizes (because cache is everything once you reach a certain transform size), while very small blocks benefit from special cases (because a large fraction of the twiddles are easier than the general case).
For fairly small blocks, even trivial things like shuffling into bit-reversed order can be slow enough that it's highly profitable to compute the required swaps in advance and just unroll them all into the code for every transform size directly. When I wrote mine, I did this for blocks up to 256 elements, which is around the break point where the overhead of computing the swaps on the fly starts to become insignificant compared to the actual cost of computing the transform, and blowing more code cache on it becomes less attractive.
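As an illustration, the precomputed-swap idea can be sketched like this (a minimal sketch, not code from any of the libraries mentioned; the function name is made up):

```c
#include <stddef.h>

/* Precompute the swap pairs for a bit-reversal permutation of
   n = 2^log2n elements. Only pairs with i < rev(i) are stored, so
   applying the list once performs the full reordering without
   swapping anything back. Returns the number of pairs written. */
static size_t bitrev_pairs(size_t log2n, size_t *src, size_t *dst)
{
    size_t n = (size_t)1 << log2n;
    size_t count = 0;
    for (size_t i = 0; i < n; ++i) {
        size_t r = 0;
        for (size_t b = 0; b < log2n; ++b)
            r |= ((i >> b) & 1) << (log2n - 1 - b);
        if (i < r) {            /* store each swap exactly once */
            src[count] = i;
            dst[count] = r;
            ++count;
        }
    }
    return count;
}
```

At code-generation (or table-build) time these pairs can then be unrolled into straight-line swap code for each supported transform size, which is the trick being described above.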
Ignoring the shuffling, the general idea with the main transform is to stride multiple sub-transforms of smaller blocks, but only in chunks large enough that your entire working set fits in L1. Once you're bigger than L1 (something you never experience with a real transform of max size 16384 in single precision, because that's just about 128kB), computing one sub-transform at a time quickly becomes much faster (until the sub-transforms are smaller than L1 again, at which point you should stride again). So you definitely need a minimum of two different strategies; otherwise you'll be limited to relatively small blocks only. On the other hand, the very small transforms can take advantage of special rotations for pi/2, pi/4 and possibly pi/8 and should be special-cased; split-radix is also very profitable here, even though radix-4 might be easier at larger sizes (for regularity; trying to use split-radix all the way becomes quite complicated).
From looking at your code I can immediately tell that you've obviously taken the first steps towards making something realistic, but at the same time we're still looking at something that would be classified as a de-pessimized textbook algorithm, where some of the obvious problems are fixed but very little real-world performance tuning has been done. I seriously suggest benchmarking against some production library to get an idea of where you actually stand in terms of performance. I suspect it will be slow for small (e.g. <256) transforms, it'll probably be sort of reasonable for medium-size transforms, and it'll be very slow once you get past the L1 barrier.
It's all cool to try to write a "fast" simple FFT, but let's be real (pun intended): simple FFTs are never truly competitive with more complex FFTs that separately optimize transforms of different sizes.
Preferred pronouns would be "it/it" because according to this country, I'm a piece of human trash.

 KVRAF
 6192 posts since 12 Feb, 2006 from Helsinki, Finland
Also, please don't get me wrong: I'm sharing these thoughts purely because I find the problem of optimizing FFTs quite interesting, but the complexity of the problem is often severely underappreciated. It's one of those problems where the seemingly simple algorithm necessarily becomes more and more complex as you push further down the performance slope.
Seriously though, benchmark it against existing (production-quality!) libraries. If you're 50% slower than FFTW then that's fine, but if you're 50 times slower then it hardly matters how much faster you are compared to some naive textbook algorithm.

 KVRist
 247 posts since 8 May, 2007
Hi all,
I just wanted to say that, based on writing my own FFT library in 2006, I completely agree with everything that mystran wrote in the previous posts. The subject is really fascinating and is probably largely misunderstood by those who use FFT libraries or have never written anything beyond power-of-two routines. The advice about multiple strategies is very important if you don't restrict yourself to small primes, etc. As an aside, some library routines from several years ago for calculating Bessel functions could have benefited from two strategies, given that they were way off for either high or low indices and were apparently not tested very well at all.
Regards,
Dave Clark

 KVRAF
 6192 posts since 12 Feb, 2006 from Helsinki, Finland
I actually have not looked into anything other than power-of-two, but even power-of-two libraries benefit from quite a diverse set of techniques depending on the size of the transform.

 KVRist
 50 posts since 7 Jun, 2018
Topic Starter
In no way have I claimed that my FFT is "faster". I just specified that, _with respect to the common approach of splitting a real sequence, feeding a more or less standard complex FFT with both real and imag parts, and post-processing the result_, my routines are 30% faster because they have no redundant passes and operations. Also, I have never previously seen any natively real FFT implementation (correct me if I am wrong); all those I have seen adopt, directly or indirectly, the approach described above. Unfortunately I have no time to carry out benchmarks, but anybody who is curious can do that. Mine is not a commercial product but open source code that I wrote for my own purposes and decided to release as a bonus. Whoever doesn't need it is not forced to use it, and I am not selling anything. But please consider what follows. Other than performing the least number of operations needed, or almost so, my code uses the following optimizations:
1. In the first pass, where data are scrambled into bit-reversed order, two-point partial spectra are already generated, thus avoiding one pass with its consequent memory accesses.
2. Access to every data cell is guaranteed to happen only once per read and once per write at every pass.
3. The roots of unity (twiddle factors) are precomputed only up to pi/4; all other symmetric cases are obtained by trivial changes of sign, thus avoiding accesses to the table and saving many complex multiplications.
It surely won't compete with a similar thing written in assembler, but I don't think I achieved such a bad result either...

 KVRist
 247 posts since 8 May, 2007
Why don't you count Hartley? Just curious why you didn't mention it ...
Regards,
Dave Clark

 KVRAF
 6192 posts since 12 Feb, 2006 from Helsinki, Finland
I'm curious, have you looked into split radixes and taking the simple twiddles for 4-point and 8-point FFTs into account?
For 4-point the twiddle is just i (i.e. complex multiplication is a component shuffle; this is also why radix-4 is so great in the general case) and for 8-point it's ±sqrt(.5) ± i*sqrt(.5) (i.e. complex multiplication is just sums/differences with an overall scale factor). So if you have a small transform like, say, 32 points, you can use split radix to get two 8-point transforms (= fast special case) and one 16-point transform, where the latter can then be split into 4+4+8, which are all fast special cases.
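Written out, the two special cases look like this (a minimal sketch of the arithmetic only, not code from any particular library; one sign combination shown for each):

```c
#include <math.h>

/* Multiplying (re, im) by i is just a component shuffle with one
   sign flip: (a + bi) * i = -b + ai. No multiplies at all. */
static void mul_i(double *re, double *im)
{
    double t = *re;
    *re = -*im;
    *im = t;
}

/* Multiplying by (1 + i)*sqrt(0.5) reduces to a sum and a difference
   followed by a common scale factor:
   (a + bi)(1 + i)*s = s*(a - b) + s*(a + b)i, with s = sqrt(0.5).
   The other +/- sign combinations work the same way. */
static void mul_pi4(double *re, double *im)
{
    const double s = sqrt(0.5);
    double t = *re;
    *re = s * (t - *im);
    *im = s * (t + *im);
}
```

Compared to a general complex multiply (four multiplies and two adds, or three and five), the first case costs nothing and the second trades the table lookup and most of the multiplies for two adds and a scale.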
This does blow up the code complexity a bit though with all the special cases, but if you do something like this then it's not clear whether special-casing 2-point transforms is profitable, because you probably should not be computing any 2-point transforms at all, which might explain why it's not too common in more complicated libraries. If you don't do this optimization (to keep code size more compact), then sure, it sounds like a great idea.
I think this is actually one of the more common optimizations when you go beyond textbook, because it's very easy to implement and pretty much always profitable (e.g. I don't think it should cause any problems even with very large transforms, as long as your L1 is at least 4-way associative, and basically all of them are), whether or not you use tables (e.g. it saves some computations with recurrent twiddles too).

 KVRian
 636 posts since 21 Feb, 2006 from FI
Anyone tried this implementation (source code link on page #2)?

 KVRAF
 6192 posts since 12 Feb, 2006 from Helsinki, Finland
I tried that back in 2014(?) when working on DustFFT, but never managed to get really good performance out of that kind of template setup. In some sense, this is actually very similar to what DustFFT does, except that rather than using C++ templates I used a code generator to produce C code for all the variants (and the code generator is really just a sort of ad-hoc template expander). The template approach is still better than a textbook version for sure, but at least back then generated C code was a better choice.
The funny thing is that with some compilers I observed that DustFFT would run faster if forced to compile as C rather than C++, and I'm pretty sure some versions had an #error to sanity-check that __cplusplus wasn't defined; we are talking about literally the same code with the same compiler, and it was happening even with -fno-exceptions (or the MSVC equivalent; I can't remember which compilers it was). I don't know what the situation with current compilers would be, but the current version doesn't seem to have the #error anymore.
 KVRist
 75 posts since 3 Jan, 2021
Why pick that over fftw3?

 KVRAF
 6192 posts since 12 Feb, 2006 from Helsinki, Finland
GPL.

 KVRist
 247 posts since 8 May, 2007
Another reason for choosing something else, or for writing one's own, is that FFTW3 is almost impossible to debug, or at least I found that to be the case. Although the authors boldly claimed it was written in the "lingua franca" of the scientific community (C, in case anyone like me assumed it was FORTRAN transitioning to C++ at that time, with C a distinct minority in science), the C code for the routines is machine-generated by Caml Light scripts and is effectively read-only. It might as well have been purposefully obfuscated. The authors themselves have admitted that there are cases in which they don't know what the code is doing.
Now it's been many years since I considered all of this, so perhaps better information is out there and perhaps the most recent versions are eminently readable.