
https://github.com/signaldust/DustFFT
[edit 2012: updated link, also comes with wrappers for real-to-half-complex and back]
This is a "single C file" library, though there is a header embedded into the file that you can copy-past into a separate .h file if you feel like it. Currently does complex double precision (and power of 2) transforms only. It will currently only compile on x86 (or x64) because for whatever reason I didn't #ifdef the SSE2 intrinsics (yet).
See the source, but basically there's simple _fwdD / _revD functions (the 'D' is for doubles, in case I decide to add single precision) to do forward/reverse transforms, which take a buffer (double*) and the size of the transform and the buffer should contain twice that many numbers, with real/imag inputs interleaved. The following should be safe with any half-decent std::complex implementation:
Code: Select all
std::complex<double> buf[1024]; /* fill this with whatever */
DustFFT_fwdD(reinterpret_cast<double*>(buf), 1024);
Neither forward or reverse transforms do any scaling, so as usual, you probably want to divide everything by N at some point (typically after reverse .. but it varies, hence not done automatically).
Some day I might add other stuff like single precision, AVX or maybe real-only transforms (let me know if you feel like sponsoring something; personally I just need complex doubles most of the time), or .. I don't know.. but for now it does complex double precision. The source code looks like horrid garbage, but that's mostly because (1) getting decent performance out of an FFT means duplicating the same code a dozen times with just some slight variations, (2) I got tired of trying to convince C++ compilers to do the right thing with templates and (3) I didn't feel like wasting time writing a pretty-printer into the generation script (which is written in a variant of Lisp and not provided). So yeah, it's not the actual source .. just something you can compile.

Anyway, it appears to work. It also appears to work reasonably fast. So how fast is it? Well, it probably won't compete with FFTW or IMKL, but the scalar code-path is typically slightly slower than Ooura split-radix (both compiled on MSVC08/fp:fast + SSE2 instructions; for x87 it'll probably be much slower), while the SSE2 code-path should be faster (except for some tiny transforms.. but if your bottleneck is 8 point FFTs then .. well, I could probably special case that). How much faster? Well, that depends on the system; on my sandy-i7 desktop, for most larger transforms it takes about 0.75 times what Ooura takes (as long as we fit into L3; at 2^10 points that's no longer the case, and I should probably add another strategy there) and on my netbook where the difference is smaller, I actually beat both with a C++ template mess that I wrote last week, so there's a pretty good chance I could make it even faster.. but even there it's a bit faster.
Obviously speed is not the only reason I wrote this. I also needed the ability to easily add extra features, but I'm providing the "bare-bones" version for the general public in case you might find it useful.

In terms of accuracy it won't break any records, but it's good enough that for audio purposes you really shouldn't have to worry about that. If you observe the scalar version losing precision (when compiled with -ffast-mat or something), please let me know and I'll see if I can try to convince your compiler to get it right.