DustFFT - in case you need an FFT :P

 KVRian
 551 posts since 30 Aug, 2012
mystran,
Thank you so much for sharing this code.
Now, I noticed right away that you have each iteration written out rather than running through loops. Is that significantly faster for the CPU? It is kind of risky at higher N, where a single hard-to-spot typo could mess up the process. But if it's a lot faster...

 KVRAF
 6321 posts since 12 Feb, 2006 from Helsinki, Finland
Topic Starter
Fender19 wrote: Now, I noticed right away that you have each iteration written out rather than running through loops. Is that significantly faster for the CPU? It is kind of risky at higher N, where a single hard-to-spot typo could mess up the process. But if it's a lot faster...

There are three general strategies used (and anything said about "transforms" applies to subtransforms as well). Very small transforms use special-cased code, different depending on the transform size, because there are significant savings from simplified twiddle multiplies here. It is currently done up to 16 points, but I probably should special-case 32 and 64 as well. The shuffle code special-cases up to 256, since that seemed like the point where the overhead of calculating things on the fly starts becoming negligible, but really that's all generated by the same algorithm that is used for the general case.
Transforms larger than the special cases use basic radix-4 striding iteration, similar to what you find in most textbooks (at least those where radix-4 is used). Finally, large transforms work the same, except instead of striding, the subtransforms are done recursively (although the actual work is still done by the same function), because this gives much better cache behavior (and hence performance).
But like I said above, the code is generated by a script. While the basic operations and special-case twiddles are necessarily hardcoded, the actual functions and such are just fairly high-level code, specialized on types somewhat like C++ templates, just allowing for a bit more flexibility. That's also why it looks ugly; I didn't feel like writing a pretty printer. Whenever you see a radix-4 butterfly in the code, it is actually generated by the same script function, just specialized for whatever twiddles and types we might be using in that particular spot.
On top of that, I have a test routine that checks every possible transform for correctness and accuracy (the former by comparing to a known-good implementation, the latter by performing a couple of rounds of transforms back and forth and checking against the original).
The whole FFT process in general is also pretty sensitive, so if you have a mistake somewhere, it'll actually give you incorrect results pretty much 100% of the time, unless the signal to be transformed is some regular special case (I use random values for testing to avoid this).
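A minimal sketch of that kind of round-trip check, with a naive O(N²) DFT pair standing in for the FFT under test; the function names, sizes, and tolerances here are my own illustration, not DustFFT's actual test harness:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <random>
#include <vector>

using cmplx = std::complex<double>;

// Naive O(N^2) reference DFT; a real test would call the FFT under test.
std::vector<cmplx> dft(const std::vector<cmplx>& x)
{
    const double pi = std::acos(-1.0);
    const std::size_t N = x.size();
    std::vector<cmplx> X(N);
    for (std::size_t k = 0; k < N; ++k)
        for (std::size_t n = 0; n < N; ++n)
            X[k] += x[n] * std::polar(1.0, -2.0 * pi * double(k * n) / double(N));
    return X;
}

// Matching inverse, with the usual 1/N normalization.
std::vector<cmplx> idft(const std::vector<cmplx>& X)
{
    const double pi = std::acos(-1.0);
    const std::size_t N = X.size();
    std::vector<cmplx> x(N);
    for (std::size_t n = 0; n < N; ++n)
    {
        for (std::size_t k = 0; k < N; ++k)
            x[n] += X[k] * std::polar(1.0, 2.0 * pi * double(k * n) / double(N));
        x[n] /= double(N);
    }
    return x;
}

// Run a few forward/inverse rounds on random data and report the worst
// deviation from the original signal. Random input avoids the "regular
// special case" signals that can mask a broken transform.
double roundTripError(std::size_t N, int rounds)
{
    std::mt19937 rng(1234);
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    std::vector<cmplx> orig(N);
    for (auto& v : orig) v = cmplx(dist(rng), dist(rng));
    auto work = orig;
    for (int r = 0; r < rounds; ++r)
        work = idft(dft(work));
    double err = 0.0;
    for (std::size_t i = 0; i < N; ++i)
        err = std::max(err, std::abs(work[i] - orig[i]));
    return err;
}
```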
Preferred pronouns would be "it/it" because according to this country, I'm a piece of human trash.

 KVRian
 553 posts since 1 Jan, 2013 from Denmark
I've grown quite fond of this. Rudimentary testing shows that it is even faster than KissFFT's real-optimized float FFT. I can't help but wonder how fast this would be using real inputs and 4-float (even 8-float, using AVX) vectors. The numbers say a 4x increase; although it probably wouldn't be in that range, I'm really curious how difficult it would be to change?
A matter of changing instructions/types, or fundamental changes in the algorithm? I briefly looked over the source, but as you said yourself, it is not intended for human eyes...

 KVRAF
 6321 posts since 12 Feb, 2006 from Helsinki, Finland
Topic Starter
Adding some AVX support should theoretically be quite straightforward, it mostly just requires another set of shuffling strategies to rearrange the data layouts.
If you're doing real-valued transforms for multiple channels, you can pack two of them into a single complex transform (which is usually the fastest thing to do), as the real and imaginary components. This requires an extra shuffling pass to separate the two transforms in the spectral domain, and I've been meaning to add a convenience wrapper to do it... but it's not all that complicated; check "two for the price of one" from this page if you want to do it yourself.
If you want to understand the source, look at the dispatch logic at the bottom and then fft_r4_sd (radix-4, single doubles, general case, used recursively) or fft_r4s_sd (same thing, strided variant for subtransforms expected to fit in the L1 cache). Ignoring the bit-reversal, these do the main work for the scalar code path; the rest is just special cases. The _pd2 variants work with the data layout rearranged such that there are two real components and then two imaginary components (loaded into separate registers) and work exactly the same as the _sd versions, just two components at a time. The small-transform special cases are a bit messier, since the 8/16-point transforms use split-radix and special-case twiddles, and the _pd2 variants of those also need to undo the data reordering... but mostly everything other than _r4_ and _r4s_ is there simply to reduce the logic overhead.
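For illustration only, here is the recursive idea in its simplest form: a radix-2 decimation-in-time FFT, out-of-place for clarity. DustFFT itself is radix-4/split-radix and works in place, so this is not its actual structure, just the same recursive strategy in miniature:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cmplx = std::complex<double>;

// Recursive radix-2 DIT FFT sketch. Each half-size subtransform touches a
// smaller working set, which is what buys the cache-friendliness for large
// N compared to a flat strided loop over the whole array.
void fftRecursive(std::vector<cmplx>& a)
{
    const std::size_t n = a.size();
    if (n == 1) return;
    std::vector<cmplx> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i)
    {
        even[i] = a[2 * i];
        odd[i]  = a[2 * i + 1];
    }
    fftRecursive(even);  // subtransforms done recursively, as in the
    fftRecursive(odd);   // large-transform strategy described above
    const double pi = std::acos(-1.0);
    for (std::size_t k = 0; k < n / 2; ++k)
    {
        // Twiddle and butterfly combining the two half-size results.
        const cmplx t = std::polar(1.0, -2.0 * pi * double(k) / double(n)) * odd[k];
        a[k]         = even[k] + t;
        a[k + n / 2] = even[k] - t;
    }
}
```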

 KVRian
 553 posts since 1 Jan, 2013 from Denmark
Thanks, that's a very neat trick. I wasn't aware you were able to do that. Since I didn't find any examples of doing this anywhere else, here's my solution:
Note that you can potentially cut the loop in half, since the second half of the FFT of a real signal is, by definition, the complex conjugate of the first half.

Code: Select all
void separateTransforms(const double * tsf, double * real, double * imag, std::size_t N)
{
    N <<= 1; // tsf holds N interleaved complex values (2N doubles)
    double x1, x2, y1, y2;
    for (auto k = 2u; k < N; k += 2)
    {
        x1 = tsf[k];
        x2 = tsf[N - k];
        y1 = tsf[k + 1];
        y2 = tsf[N - k + 1];
        real[k] = (x1 + x2) * 0.5;
        real[k + 1] = (y1 - y2) * 0.5;
        imag[k] = (y1 + y2) * 0.5;
        imag[k + 1] = (x1 - x2) * 0.5;
    }
    real[0] = tsf[0];
    imag[0] = tsf[1];
}
or maybe a bit more readable like this:
Code: Select all
typedef std::complex<double> cmplx;
typedef std::vector<cmplx> vec;

void separateTransforms(const vec & transform, vec & real, vec & imag)
{
    auto N = transform.size();
    for (auto k = 1u; k < N; ++k)
    {
        auto z1 = transform[k];
        auto z2 = std::conj(transform[N - k]);
        real[k] = (z1 + z2) * 0.5;
        // i * conj(z1 - z2): swap real and imag parts
        imag[k] = cmplx((z1 - z2).imag(), (z1 - z2).real()) * 0.5;
    }
    // DC bins
    real[0] = transform[0].real();
    imag[0] = transform[0].imag();
}
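The halving mentioned in the note can be sketched like this. This sketch assumes the textbook e^(-2πikn/N) forward-transform convention defined by the reference DFT below; a different sign convention moves the conjugations around, and dftRef/separateHalfLoop are my own names, not part of DustFFT:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cmplx = std::complex<double>;

// Reference forward DFT (e^{-2*pi*i*k*n/N} convention), used only to
// sanity-check the separation below.
std::vector<cmplx> dftRef(const std::vector<cmplx>& x)
{
    const double pi = std::acos(-1.0);
    const std::size_t N = x.size();
    std::vector<cmplx> X(N);
    for (std::size_t k = 0; k < N; ++k)
        for (std::size_t n = 0; n < N; ++n)
            X[k] += x[n] * std::polar(1.0, -2.0 * pi * double(k * n) / double(N));
    return X;
}

// Separate the packed spectrum Z of (a + i*b) into the spectra A and B of
// the two real signals, walking only half the bins: because a and b are
// real, A[N-k] == conj(A[k]) and B[N-k] == conj(B[k]), so the upper half
// of each output spectrum is just the conjugate mirror of the lower half.
void separateHalfLoop(const std::vector<cmplx>& Z,
                      std::vector<cmplx>& A, std::vector<cmplx>& B)
{
    const std::size_t N = Z.size();
    A[0] = Z[0].real();
    B[0] = Z[0].imag();
    for (std::size_t k = 1; k <= N / 2; ++k)
    {
        const cmplx z1 = Z[k];
        const cmplx z2 = std::conj(Z[N - k]);
        A[k] = (z1 + z2) * 0.5;
        B[k] = (z1 - z2) / cmplx(0.0, 2.0);  // (z1 - z2) / (2i)
        A[N - k] = std::conj(A[k]);          // conjugate-symmetric mirror
        B[N - k] = std::conj(B[k]);
    }
}
```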
The transformed result has an error of about -350 dB on arbitrary inputs, compared to FFT'ing the signals separately, which is not that bad I guess. Definitely useful!
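For reference, a figure like that can be measured as the peak deviation relative to the peak of the reference spectrum, expressed in dB; errorDb here is a hypothetical helper of my own, not part of DustFFT:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cmplx = std::complex<double>;

// Peak error of a spectrum relative to a reference, in dB. Values around
// -300 dB or lower mean the two agree to roughly double precision.
double errorDb(const std::vector<cmplx>& x, const std::vector<cmplx>& ref)
{
    double err = 0.0, peak = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i)
    {
        err  = std::max(err, std::abs(x[i] - ref[i]));
        peak = std::max(peak, std::abs(ref[i]));
    }
    return 20.0 * std::log10(err / peak);
}
```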

 KVRAF
 6321 posts since 12 Feb, 2006 from Helsinki, Finland
Topic Starter
Mayae wrote: The transformed result has an error of about -350 dB on arbitrary inputs, compared to FFT'ing the signals separately, which is not that bad I guess. Definitely useful!

Yeah, the crossbleed is not a huge problem with double-precision transforms as long as the signals are roughly in the same range (which, fortunately, is usually the case with audio).
Oh and... you can similarly combine two "half-spectra" into a single spectrum that gives the two inverse transforms as real and imaginary components:
pack as complex -> FFT -> separate -> process independently -> recombine -> IFFT -> unpack from complex
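The "recombine" step is just the packing identity run backwards: A[k] + i·B[k] reproduces the packed bin Z[k] exactly, since (z1+z2)/2 + i·(z1-z2)/(2i) = z1. A hedged sketch, with function names of my own (not DustFFT's API):

```cpp
#include <cassert>
#include <complex>
#include <vector>

using cmplx = std::complex<double>;

// Matching separation step (full loop for brevity), same algebra as the
// separation code posted above.
void separate(const std::vector<cmplx>& Z,
              std::vector<cmplx>& A, std::vector<cmplx>& B)
{
    const std::size_t N = Z.size();
    A[0] = Z[0].real();
    B[0] = Z[0].imag();
    for (std::size_t k = 1; k < N; ++k)
    {
        const cmplx z2 = std::conj(Z[N - k]);
        A[k] = (Z[k] + z2) * 0.5;
        B[k] = (Z[k] - z2) / cmplx(0.0, 2.0);  // divide by 2i
    }
}

// Recombine the two separated spectra into one packed spectrum. This
// inverts the separation exactly:
//   A[k] + i*B[k] = (Z[k] + conj(Z[N-k]))/2 + i*(Z[k] - conj(Z[N-k]))/(2i)
//                 = Z[k]
void recombine(const std::vector<cmplx>& A, const std::vector<cmplx>& B,
               std::vector<cmplx>& Z)
{
    for (std::size_t k = 0; k < Z.size(); ++k)
        Z[k] = A[k] + cmplx(0.0, 1.0) * B[k];
}
```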

 KVRian
 553 posts since 1 Jan, 2013 from Denmark
avasopht wrote: Haha, funny, Mayae, that's just how my method works, which I shared just a couple of days ago.

Huh, I didn't see that. If that's the case, I wasted a fair bit of time. Can you link it?
mystran wrote: Yeah, the crossbleed is not a huge problem with double-precision transforms as long as the signals are roughly in the same range. Oh and... you can similarly combine two "half-spectra" into a single spectrum that gives the two inverse transforms as real and imaginary components.

Yeah, I guess the process would be the exact reverse of the 'decomposition' method I posted. Cool stuff!
 KVRist
 162 posts since 19 Apr, 2014 from London
Mayae wrote: Huh, I didn't see that. If that's the case, I wasted a fair bit of time. Can you link it?

http://www.kvraudio.com/forum/viewtopic ... 1#p5877291
I wouldn't call it a waste of time at all, as a lesson learnt is a lesson gained.
The code shown has N log N real multiplies, but there's a property that allows you to halve that: place the second half of the already folded signal in the imaginary plane and multiply your complex sinusoid against only half of the folded signal.
I have to apologise in advance for my poor use of terminology, as my DSP knowledge is very much self-discovered, so I will be refining my language and jargon over the week.
Last edited by avasopht on Thu Sep 18, 2014 4:09 am, edited 1 time in total.

 KVRian
 553 posts since 1 Jan, 2013 from Denmark
avasopht wrote: I wouldn't call it a waste of time at all, as a lesson learnt is a lesson gained. The code shown has N log N real multiplies, but there's a property that allows you to halve that: place the second half of the already folded signal in the imaginary plane and multiply your complex sinusoid against only half of the folded signal.

Heh, yeah.
I can sort of see that it would work, but wouldn't this be a property of all complex transforms? I can't intuitively see why it would work, though; I'll have to look at it once I get home...

 KVRist
 48 posts since 21 Apr, 2008 from Germany
Hi,
first of all thanks for the code. I am using it for a spectral display on a parametric EQ (similar to Ableton's EQ) and it works fine for this purpose.
However, I needed to change
Code: Select all
static const unsigned ALIGN16 _mask_pd_sign_lo[4] = { 0, 1 << 31, 0, 0 };
to
Code: Select all
static const unsigned ALIGN16 _mask_pd_sign_lo[4] = { 0, static_cast<unsigned int>(1 << 31), 0, 0 };
to be able to compile it on Xcode / OS X. If I do this, it works; however, I didn't dig too deep into your implementation details (the script-generated stuff is really scary) to check whether the static_cast may have some unwanted side effects.

 KVRAF
 6321 posts since 12 Feb, 2006 from Helsinki, Finland
Topic Starter
Chrisboy2000 wrote: ...to be able to compile it on Xcode / OS X. If I do this, it works; however, I didn't dig too deep into your implementation details (the script-generated stuff is really scary) to check whether the static_cast may have some unwanted side effects.
Code: Select all
static const unsigned ALIGN16 _mask_pd_sign_lo[4] = { 0, static_cast<unsigned int>(1 << 31), 0, 0 };

Yeah, this was discussed on page 1 and happens when compiling as C++ (which I didn't test in advance, because... well, it's C code). You can use (1U << 31) instead of the cast, though.

 KVRist
 48 posts since 21 Apr, 2008 from Germany
Stupid me, sorry for bothering

 KVRian
 1379 posts since 26 Apr, 2004 from UK
Chris > Actually, your version is less correct than mystran's fix, as you are casting only after the (signed) shift has already happened.
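For the record, the reason the cast alone is the weaker fix: 1 << 31 shifts a bit into the sign position of a signed int, which is undefined behaviour in C and in C++ before C++20, so the static_cast only converts a value that was already computed illegally. Making the operand unsigned keeps the whole expression well defined:

```cpp
#include <cassert>

// The shift happens on an unsigned operand, so it is fully defined and
// needs no cast; the sign-bit mask word comes out as 0x80000000 as
// intended for the double-precision sign masks.
const unsigned sign_hi = 1U << 31;
```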