DustFFT - in case you need an FFT :P
-
- KVRian
- 627 posts since 30 Aug, 2012
mystran,
Thank you so much for sharing this code.
Now, I noticed right away that you have each iteration written out rather than running through loops. Is that significantly faster for the CPU? It seems risky at higher N, where a single hard-to-spot typo could mess up the process. But if it's a lot faster...
- KVRAF
- Topic Starter
- 7897 posts since 12 Feb, 2006 from Helsinki, Finland
Fender19 wrote: Now, I noticed right away that you have each iteration written out rather than running through loops. Is that significantly faster for the CPU?
There are three general strategies used (and anything said about "transforms" applies to sub-transforms as well). Very small transforms use special-cased code, different for each transform size, because there are significant savings from simplified twiddle-multiplies there. This is currently done up to 16 points, though I should probably special-case 32 and 64 as well. The shuffle code special-cases up to 256, since that seemed to be the point where the overhead of calculating things on the fly becomes negligible, but really it is all generated by the same algorithm that is used for the general case.
Transforms larger than the special cases use basic radix-4 striding iteration, similar to what you find in most textbooks (at least those that use radix-4). Finally, large transforms work the same way, except that instead of striding, the sub-transforms are done recursively (though the actual work is still done by the same function), because this gives much better cache behavior (and hence performance).
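For readers who want to see the recursion idea in isolation, here is a minimal depth-first sketch (this is not DustFFT's code; radix-2 is used for brevity, and `fft_recursive` is a made-up name). Each half-transform finishes completely before the combine pass reads it again, which is what keeps sub-transforms hot in cache:

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Minimal recursive radix-2 DIT FFT (out-of-place; n must be a power of two).
// Each half is fully transformed before the combine pass touches it again, so
// a sub-transform that fits in cache is still hot when it is needed -- the
// cache benefit of recursing instead of striding across the whole array.
static std::vector<cd> fft_recursive(const std::vector<cd>& x)
{
    const std::size_t n = x.size();
    if (n == 1) return x;

    // split into even- and odd-indexed samples
    std::vector<cd> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }
    even = fft_recursive(even);
    odd  = fft_recursive(odd);

    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n / 2; ++k) {
        // twiddle factor exp(-2*pi*i*k/n)
        const cd w = std::polar(1.0, -2.0 * pi * double(k) / double(n));
        out[k]         = even[k] + w * odd[k];
        out[k + n / 2] = even[k] - w * odd[k];
    }
    return out;
}
```

A production FFT would avoid the temporary vectors and switch to a strided (or special-cased) kernel once the sub-transform fits in L1, which is exactly the structure described above.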
But like I said above, the code is generated by a script. While the basic operations and special-case twiddles are necessarily hard-coded, the actual functions and such are fairly high-level code, specialized on types somewhat like C++ templates, just allowing a bit more flexibility. That's also why it looks ugly; I didn't feel like writing a pretty-printer. Whenever you see a radix-4 butterfly in the code, it is actually generated by the same script function, just specialized for whatever twiddles and types are used in that particular spot.
On top of that, I have a test-routine that will check every possible transform for correctness and accuracy (the former is done by comparing to a known good implementation, the latter is done by performing a couple of rounds of transforms back and forth and checking against the original).
The whole FFT process in general is also pretty sensitive, so if you have a mistake somewhere, it'll actually give you incorrect results pretty much 100% of the time, unless the signal to be transformed is some regular special case (I use random values for testing to avoid this).
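A round-trip test of the kind described can be sketched roughly like this (this is not DustFFT's actual test harness; `dft` and `roundTripError` are made-up names, and a naive O(n^2) DFT stands in for the transform and the "known good implementation"):

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <cstddef>
#include <random>
#include <vector>

using cd = std::complex<double>;

// Naive O(n^2) DFT: slow, but simple enough to verify by eye, which is what
// makes it usable as the reference a fast FFT gets compared against.
static std::vector<cd> dft(const std::vector<cd>& x, bool inverse)
{
    const double pi = std::acos(-1.0);
    const std::size_t n = x.size();
    const double sign = inverse ? 1.0 : -1.0;
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            out[k] += x[j] * std::polar(1.0, sign * 2.0 * pi * double(k) * double(j) / double(n));
    if (inverse)
        for (auto& v : out) v /= double(n);
    return out;
}

// Round-trip check on random data: forward then inverse should reproduce the
// input to near machine precision. Random inputs matter because a regular
// signal (all zeros, a single sine, ...) can mask an indexing mistake.
static double roundTripError(std::size_t n)
{
    std::mt19937 rng(12345); // fixed seed keeps the test reproducible
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    std::vector<cd> x(n);
    for (auto& v : x) v = cd(dist(rng), dist(rng));

    const auto y = dft(dft(x, false), true);
    double err = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        err = std::max(err, std::abs(y[i] - x[i]));
    return err;
}
```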
-
- KVRian
- 573 posts since 1 Jan, 2013 from Denmark
I've grown quite fond of this. Rudimentary testing shows that it is even faster than KissFFT's real-optimized float FFT. I can't help but wonder how fast this would be using real inputs and 4-float (even 8-float using AVX) vectors. The numbers suggest a 4x increase; it probably wouldn't be quite that much, but I'm really curious how difficult it would be to change?
A matter of changing instructions/types, or fundamental changes in the algorithm? I briefly looked over the source, but as you said yourself, it is not intended for human eyes...
- KVRAF
- Topic Starter
- 7897 posts since 12 Feb, 2006 from Helsinki, Finland
Adding some AVX support should theoretically be quite straightforward; it mostly just requires another set of shuffling strategies to rearrange the data layouts.
If you're doing real-valued transforms for multiple channels, you can pack two of them into a single complex transform (which is usually the fastest thing to do) as the real and imaginary components. This requires an extra shuffling pass to separate the two transforms in the spectral domain, and I've been meaning to add a convenience wrapper for it ... but it's not all that complicated; see "two for the price of one" on this page if you want to do it yourself.
If you want to understand the source, look at the dispatch logic at the bottom and then fft_r4_sd (radix-4, single doubles, general case, used recursively) or fft_r4s_sd (the same thing, a strided variant for sub-transforms expected to fit in L1 cache). Ignoring the bit-reversal, these do the main work for the scalar code-path; the rest is just special cases. The _pd2 variants work with the data layout rearranged so that there are two real components and then two imaginary components (loaded into separate registers) and work exactly like the _sd versions, just two components at a time. The small-transform special cases are a bit messier, since the 8/16-point transforms use split-radix and special-case twiddles, and the _pd2 variants of those also need to undo the data reordering .. but mostly everything other than _r4_ and _r4s_ is there simply to reduce the logic overhead.
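The packing step that precedes such a two-channels-in-one transform is just interleaving: channel A goes into the real slots and channel B into the imaginary slots of one complex buffer. A minimal sketch, assuming an interleaved re/im layout of 2*N doubles (`packChannels` is a made-up name):

```cpp
#include <cstddef>

// Pack two length-N real channels into one length-N complex buffer
// (interleaved re, im, i.e. 2*N doubles) so a single complex FFT
// transforms both channels at once. A separation pass in the spectral
// domain then recovers the two individual spectra.
static void packChannels(const double* a, const double* b,
                         double* packed, std::size_t N)
{
    for (std::size_t i = 0; i < N; ++i) {
        packed[2 * i]     = a[i]; // real part <- channel A
        packed[2 * i + 1] = b[i]; // imag part <- channel B
    }
}
```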
-
- KVRian
- 573 posts since 1 Jan, 2013 from Denmark
Thanks, that's a very neat trick; I wasn't aware you could do that. Since I didn't find any examples of this anywhere else, here's my solution:

Code: Select all
#include <cstddef>

void seperateTransforms(const double * tsf, double * real, double * imag, std::size_t N)
{
    N <<= 1;
    double x1, x2, y1, y2;
    for (auto k = 2u; k < N; k += 2)
    {
        x1 = tsf[k];
        x2 = tsf[N - k];
        y1 = tsf[k + 1];
        y2 = tsf[N - k + 1];
        real[k] = (x1 + x2) * 0.5;
        real[k + 1] = (y1 - y2) * 0.5;
        imag[k] = (y1 + y2) * 0.5;
        imag[k + 1] = -(x1 - x2) * 0.5;
    }
    // set the DC bins (purely real for real input signals;
    // don't leave the imaginary slots uninitialized)
    real[0] = tsf[0];
    real[1] = 0.0;
    imag[0] = tsf[1];
    imag[1] = 0.0;
}

or maybe a bit more readable like this:

Code: Select all
#include <complex>
#include <vector>

typedef std::complex<double> cmplx;
typedef std::vector<cmplx> vec;

void seperateTransforms(const vec & transform, vec & real, vec & imag)
{
    auto N = transform.size();
    for (auto k = 1u; k < N; ++k)
    {
        auto z1 = transform[k];
        auto z2 = std::conj(transform[N - k]);
        real[k] = (z1 + z2) * 0.5;
        // (z1 - z2) / (2i): multiply by -i, i.e. swap real and imag
        // and negate the new imaginary part, then halve
        imag[k] = cmplx((z1 - z2).imag(), -(z1 - z2).real()) * 0.5;
    }
    // set DC offset
    real[0] = transform[0].real();
    imag[0] = transform[0].imag();
}

Note that you can potentially cut the loop in half, since for real signals the second half of the FFT is, by definition, the complex conjugate mirror of the first half.

The transformed result has an error of about -350 dB on arbitrary inputs, compared to FFT'ing the signals separately, which is not that bad I guess - definitely useful!
- KVRist
- 168 posts since 19 Apr, 2014 from London
Haha, funny - Mayae, that's just how my method works, which I shared just a couple of days ago.
- KVRAF
- Topic Starter
- 7897 posts since 12 Feb, 2006 from Helsinki, Finland
Mayae wrote: The transformed result has an error of about -350 dB on arbitrary inputs, compared to FFT'ing the signals separately, which is not that bad I guess - definitely useful!
Yeah, the cross-bleed is not a huge problem with double-precision transforms as long as the signals are roughly in the same range (which, fortunately, is usually the case with audio).
Oh, and .. you can similarly combine two "half-spectra" into a single spectrum that gives the two inverse transforms as its real and imaginary components:
pack as complex -> FFT -> separate -> process independently -> recombine -> IFFT -> unpack from complex
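The recombine step in that pipeline is the exact inverse of the spectral separation: each combined bin is A[k] + i*B[k], so one inverse FFT returns channel A in the real parts and channel B in the imaginary parts. A minimal sketch using std::complex (`combineSpectra` is a made-up name; full-length complex spectra assumed):

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cmplx = std::complex<double>;

// Combine two (already processed) spectra so that a single inverse FFT
// yields both time-domain signals at once: channel A comes out in the
// real parts of the result, channel B in the imaginary parts.
static std::vector<cmplx> combineSpectra(const std::vector<cmplx>& specA,
                                         const std::vector<cmplx>& specB)
{
    std::vector<cmplx> combined(specA.size());
    for (std::size_t k = 0; k < specA.size(); ++k)
        combined[k] = specA[k] + cmplx(0.0, 1.0) * specB[k]; // A[k] + i*B[k]
    return combined;
}
```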
-
- KVRian
- 573 posts since 1 Jan, 2013 from Denmark
avasopht wrote: Haha, funny, Mayae that's just how my method works, which I shared just a couple days ago.
Huh, I didn't see that. If that's the case, I wasted a fair bit of time. Can you link it?
mystran wrote: Oh and .. you can similarly combine two "half-spectra" into a single spectrum that gives the two inverse transforms as real and imaginary components.
Yeah, I guess the process would be the exact reverse of the 'decomposition' method I posted. Cool stuff.
- KVRist
- 168 posts since 19 Apr, 2014 from London
Mayae wrote: Huh, I didn't see that. If that's the case, I wasted a fair bit of time. Can you link it?
http://www.kvraudio.com/forum/viewtopic ... 1#p5877291
I wouldn't call it a waste of time at all - a lesson learnt is a lesson gained.
The code shown has N log N real multiplies, but there's a property that lets you halve that: place the second half of the already-folded signal in the imaginary plane and multiply your complex sinusoid against only half of the already-folded signal.
I have to apologise in advance for my poor use of terminology, as my DSP knowledge is very much self-discovered, so I will be refining my language and jargon over the week.
Last edited by avasopht on Thu Sep 18, 2014 12:09 pm, edited 1 time in total.
-
- KVRian
- 573 posts since 1 Jan, 2013 from Denmark
avasopht wrote: http://www.kvraudio.com/forum/viewtopic ... 1#p5877291
Heh, yeah.
avasopht wrote: The code shown has N log N real multiplies, but there's a property that allows you to halve that by placing the second half of the already folded signal into the imaginary plane and multiplying that with your complex sinusoid for only half of the already folded signal.
I can sort of see that it would work - but wouldn't this be a property of all complex transforms? I can't intuitively see why it works, though; I'll have to look at it once I get home.
-
- KVRist
- 50 posts since 21 Apr, 2008 from Germany
Hi,
first of all thanks for the code. I am using it for a spectral display on a parametric EQ (similar to Ableton's EQ) and it works fine for this purpose.
However, I needed to change

Code: Select all
static const unsigned ALIGN16 _mask_pd_sign_lo[4] = { 0, 1 << 31, 0, 0 };

to

Code: Select all
static const unsigned ALIGN16 _mask_pd_sign_lo[4] = { 0, static_cast<unsigned int>(1 << 31), 0, 0 };

to be able to compile it with Xcode on OS X. With this change it works, but I didn't dig too deep into your implementation details (the script-generated stuff is really scary) to check whether the static_cast may have unwanted side-effects.
- KVRAF
- Topic Starter
- 7897 posts since 12 Feb, 2006 from Helsinki, Finland
Chrisboy2000 wrote: With this change it works, but I didn't dig too deep into your implementation details (the script-generated stuff is really scary) to check whether the static_cast may have unwanted side-effects.
Yeah, this was discussed on page 1 and happens when compiling as C++ (which I didn't test in advance, because .. well, it's C code). You can use (1U << 31) instead of the cast, though.
-
- KVRist
- 50 posts since 21 Apr, 2008 from Germany
Stupid me, sorry for bothering
-
- KVRian
- 1379 posts since 26 Apr, 2004 from UK
Chris> Actually, your version is less correct than mystran's fix: you are casting only after the shift has already happened, so 1 << 31 is still evaluated on a signed int.
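To spell out the point: in `static_cast<unsigned int>(1 << 31)` the shift is evaluated first, on a signed int, where shifting into the sign bit overflows (undefined behavior in C, where this code lives); the cast only converts the already-problematic result. Shifting an unsigned operand avoids the issue entirely, assuming the usual 32-bit unsigned int:

```cpp
// The sign-bit mask built without shifting into a signed int's sign bit:
// with 1u the shift operand is unsigned from the start, so no signed
// overflow ever happens and no cast is needed afterwards.
static const unsigned sign_bit_mask = 1u << 31;
```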