DustFFT - in case you need an FFT :P
-
- KVRian
- 627 posts since 30 Aug, 2012
mystran,
Thank you so much for sharing this code.
Now, I noticed right away that you have each iteration written out rather than running through loops. Is that significantly faster for the CPU? It seems risky at higher N, where a single hard-to-spot typo could mess up the process. But if it's a lot faster...
- KVRAF
- Topic Starter
- 7897 posts since 12 Feb, 2006 from Helsinki, Finland
Fender19 wrote: Now, I noticed right away that you have each iteration written out rather than running through loops. Is that significantly faster for the CPU?
There are three general strategies used (and anything said about "transforms" applies to sub-transforms as well). Very small transforms use special-cased code, different for each transform size, because there are significant savings from simplified twiddle-multiplies there. This is currently done up to 16 points, though I should probably special-case 32 and 64 as well. The shuffle code special-cases up to 256, since that seemed to be the point where the overhead of calculating things on the fly becomes negligible, but really it is all generated by the same algorithm that is used for the general case.
Transforms larger than the special cases use basic radix-4 striding iteration, similar to what you find in most textbooks (at least those that use radix-4). Finally, large transforms work the same way, except that instead of striding, the sub-transforms are done recursively (though the actual work is still done by the same function), because this gives much better cache behavior (and hence performance).
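For readers who want to see the recursion idea in isolation, here is a minimal depth-first sketch (this is not DustFFT's code; radix-2 is used for brevity, and `fft_recursive` is a made-up name). Each half-transform finishes completely before the combine pass reads it again, which is what keeps sub-transforms hot in cache:

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cd = std::complex<double>;

// Minimal recursive radix-2 DIT FFT (out-of-place; n must be a power of two).
// Each half is fully transformed before the combine pass touches it again, so
// a sub-transform that fits in cache is still hot when it is needed -- the
// cache benefit of recursing instead of striding across the whole array.
static std::vector<cd> fft_recursive(const std::vector<cd>& x)
{
    const std::size_t n = x.size();
    if (n == 1) return x;

    // split into even- and odd-indexed samples
    std::vector<cd> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }
    even = fft_recursive(even);
    odd  = fft_recursive(odd);

    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n / 2; ++k) {
        // twiddle factor exp(-2*pi*i*k/n)
        const cd w = std::polar(1.0, -2.0 * pi * double(k) / double(n));
        out[k]         = even[k] + w * odd[k];
        out[k + n / 2] = even[k] - w * odd[k];
    }
    return out;
}
```

A production FFT would avoid the temporary vectors and switch to a strided (or special-cased) kernel once the sub-transform fits in L1, which is exactly the structure described above.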
But like I said above, the code is generated by a script. While the basic operations and special-case twiddles are necessarily hard-coded, the actual functions and such are fairly high-level code, specialized on types somewhat like C++ templates, just allowing a bit more flexibility. That's also why it looks ugly; I didn't feel like writing a pretty-printer. Whenever you see a radix-4 butterfly in the code, it is actually generated by the same script function, just specialized for whatever twiddles and types are used in that particular spot.
On top of that, I have a test-routine that will check every possible transform for correctness and accuracy (the former is done by comparing to a known good implementation, the latter is done by performing a couple of rounds of transforms back and forth and checking against the original).
The whole FFT process in general is also pretty sensitive, so if you have a mistake somewhere, it'll actually give you incorrect results pretty much 100% of the time, unless the signal to be transformed is some regular special case (I use random values for testing to avoid this).
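A round-trip test of the kind described can be sketched roughly like this (this is not DustFFT's actual test harness; `dft` and `roundTripError` are made-up names, and a naive O(n^2) DFT stands in for the transform and the "known good implementation"):

```cpp
#include <algorithm>
#include <cmath>
#include <complex>
#include <cstddef>
#include <random>
#include <vector>

using cd = std::complex<double>;

// Naive O(n^2) DFT: slow, but simple enough to verify by eye, which is what
// makes it usable as the reference a fast FFT gets compared against.
static std::vector<cd> dft(const std::vector<cd>& x, bool inverse)
{
    const double pi = std::acos(-1.0);
    const std::size_t n = x.size();
    const double sign = inverse ? 1.0 : -1.0;
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            out[k] += x[j] * std::polar(1.0, sign * 2.0 * pi * double(k) * double(j) / double(n));
    if (inverse)
        for (auto& v : out) v /= double(n);
    return out;
}

// Round-trip check on random data: forward then inverse should reproduce the
// input to near machine precision. Random inputs matter because a regular
// signal (all zeros, a single sine, ...) can mask an indexing mistake.
static double roundTripError(std::size_t n)
{
    std::mt19937 rng(12345); // fixed seed keeps the test reproducible
    std::uniform_real_distribution<double> dist(-1.0, 1.0);
    std::vector<cd> x(n);
    for (auto& v : x) v = cd(dist(rng), dist(rng));

    const auto y = dft(dft(x, false), true);
    double err = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        err = std::max(err, std::abs(y[i] - x[i]));
    return err;
}
```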
-
- KVRian
- 573 posts since 1 Jan, 2013 from Denmark
I've grown quite fond of this. Rudimentary testing shows that it is even faster than KissFFT's real-optimized float FFT. I can't help but wonder how fast this would be using real inputs and 4-float (even 8-float using AVX) vectors. The numbers suggest a 4x increase; it probably wouldn't be quite that much, but I'm really curious how difficult it would be to change?
A matter of changing instructions/types, or fundamental changes in the algorithm? I briefly looked over the source, but as you said yourself, it is not intended for human eyes...
- KVRAF
- Topic Starter
- 7897 posts since 12 Feb, 2006 from Helsinki, Finland
Adding some AVX support should theoretically be quite straightforward; it mostly just requires another set of shuffling strategies to rearrange the data layouts.
If you're doing real-valued transforms for multiple channels, you can pack two of them into a single complex transform (which is usually the fastest thing to do) as the real and imaginary components. This requires an extra shuffling pass to separate the two transforms in the spectral domain, and I've been meaning to add a convenience wrapper for it ... but it's not all that complicated; see "two for the price of one" on this page if you want to do it yourself.
If you want to understand the source, look at the dispatch logic at the bottom and then fft_r4_sd (radix-4, single doubles, general case, used recursively) or fft_r4s_sd (the same thing, a strided variant for sub-transforms expected to fit in L1 cache). Ignoring the bit-reversal, these do the main work for the scalar code-path; the rest is just special cases. The _pd2 variants work with the data layout rearranged so that there are two real components and then two imaginary components (loaded into separate registers) and work exactly like the _sd versions, just two components at a time. The small-transform special cases are a bit messier, since the 8/16-point transforms use split-radix and special-case twiddles, and the _pd2 variants of those also need to undo the data reordering .. but mostly everything other than _r4_ and _r4s_ is there simply to reduce the logic overhead.
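The packing step that precedes such a two-channels-in-one transform is just interleaving: channel A goes into the real slots and channel B into the imaginary slots of one complex buffer. A minimal sketch, assuming an interleaved re/im layout of 2*N doubles (`packChannels` is a made-up name):

```cpp
#include <cstddef>

// Pack two length-N real channels into one length-N complex buffer
// (interleaved re, im, i.e. 2*N doubles) so a single complex FFT
// transforms both channels at once. A separation pass in the spectral
// domain then recovers the two individual spectra.
static void packChannels(const double* a, const double* b,
                         double* packed, std::size_t N)
{
    for (std::size_t i = 0; i < N; ++i) {
        packed[2 * i]     = a[i]; // real part <- channel A
        packed[2 * i + 1] = b[i]; // imag part <- channel B
    }
}
```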
-
- KVRian
- 573 posts since 1 Jan, 2013 from Denmark
Thanks, that's a very neat trick; I wasn't aware you could do that. Since I didn't find any examples of this anywhere else, here's my solution:

Code: Select all
#include <cstddef>

void seperateTransforms(const double * tsf, double * real, double * imag, std::size_t N)
{
    N <<= 1;
    double x1, x2, y1, y2;
    for (auto k = 2u; k < N; k += 2)
    {
        x1 = tsf[k];
        x2 = tsf[N - k];
        y1 = tsf[k + 1];
        y2 = tsf[N - k + 1];
        real[k] = (x1 + x2) * 0.5;
        real[k + 1] = (y1 - y2) * 0.5;
        imag[k] = (y1 + y2) * 0.5;
        imag[k + 1] = -(x1 - x2) * 0.5;
    }
    // set the DC bins (purely real for real input signals;
    // don't leave the imaginary slots uninitialized)
    real[0] = tsf[0];
    real[1] = 0.0;
    imag[0] = tsf[1];
    imag[1] = 0.0;
}

or maybe a bit more readable like this:

Code: Select all
#include <complex>
#include <vector>

typedef std::complex<double> cmplx;
typedef std::vector<cmplx> vec;

void seperateTransforms(const vec & transform, vec & real, vec & imag)
{
    auto N = transform.size();
    for (auto k = 1u; k < N; ++k)
    {
        auto z1 = transform[k];
        auto z2 = std::conj(transform[N - k]);
        real[k] = (z1 + z2) * 0.5;
        // (z1 - z2) / (2i): multiply by -i, i.e. swap real and imag
        // and negate the new imaginary part, then halve
        imag[k] = cmplx((z1 - z2).imag(), -(z1 - z2).real()) * 0.5;
    }
    // set DC offset
    real[0] = transform[0].real();
    imag[0] = transform[0].imag();
}

Note that you can potentially cut the loop in half, since for real signals the second half of the FFT is, by definition, the complex conjugate mirror of the first half.

The transformed result has an error of about -350 dB on arbitrary inputs, compared to FFT'ing the signals separately, which is not that bad I guess - definitely useful!
- KVRist
- 168 posts since 19 Apr, 2014 from London
Haha, funny - Mayae, that's just how my method works, which I shared just a couple of days ago.
- KVRAF
- Topic Starter
- 7897 posts since 12 Feb, 2006 from Helsinki, Finland
Mayae wrote: The transformed result has an error of about -350 dB on arbitrary inputs, compared to FFT'ing the signals separately, which is not that bad I guess - definitely useful!
Yeah, the cross-bleed is not a huge problem with double-precision transforms as long as the signals are roughly in the same range (which, fortunately, is usually the case with audio).
Oh, and .. you can similarly combine two "half-spectra" into a single spectrum that gives the two inverse transforms as its real and imaginary components:
pack as complex -> FFT -> separate -> process independently -> recombine -> IFFT -> unpack from complex
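The recombine step in that pipeline is the exact inverse of the spectral separation: each combined bin is A[k] + i*B[k], so one inverse FFT returns channel A in the real parts and channel B in the imaginary parts. A minimal sketch using std::complex (`combineSpectra` is a made-up name; full-length complex spectra assumed):

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using cmplx = std::complex<double>;

// Combine two (already processed) spectra so that a single inverse FFT
// yields both time-domain signals at once: channel A comes out in the
// real parts of the result, channel B in the imaginary parts.
static std::vector<cmplx> combineSpectra(const std::vector<cmplx>& specA,
                                         const std::vector<cmplx>& specB)
{
    std::vector<cmplx> combined(specA.size());
    for (std::size_t k = 0; k < specA.size(); ++k)
        combined[k] = specA[k] + cmplx(0.0, 1.0) * specB[k]; // A[k] + i*B[k]
    return combined;
}
```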
-
- KVRian
- 573 posts since 1 Jan, 2013 from Denmark
avasopht wrote: Haha, funny, Mayae that's just how my method works, which I shared just a couple days ago.
Huh, I didn't see that. If that's the case, I wasted a fair bit of time. Can you link it?
mystran wrote: Oh and .. you can similarly combine two "half-spectra" into a single spectrum that gives the two inverse transforms as real and imaginary components.
Yeah, I guess the process would be the exact reverse of the 'decomposition' method I posted. Cool stuff.
- KVRist
- 168 posts since 19 Apr, 2014 from London
Mayae wrote: Huh, I didn't see that. If that's the case, I wasted a fair bit of time. Can you link it?
http://www.kvraudio.com/forum/viewtopic ... 1#p5877291
I wouldn't call it a waste of time at all - a lesson learnt is a lesson gained.
The code shown has N log N real multiplies, but there's a property that lets you halve that: place the second half of the already-folded signal in the imaginary plane and multiply your complex sinusoid against only half of the already-folded signal.
I have to apologise in advance for my poor use of terminology, as my DSP knowledge is very much self-discovered, so I will be refining my language and jargon over the week.
Last edited by avasopht on Thu Sep 18, 2014 12:09 pm, edited 1 time in total.
-
- KVRian
- 573 posts since 1 Jan, 2013 from Denmark
avasopht wrote: http://www.kvraudio.com/forum/viewtopic ... 1#p5877291
Heh, yeah.
avasopht wrote: The code shown has N log N real multiplies, but there's a property that allows you to halve that by placing the second half of the already folded signal into the imaginary plane and multiplying that with your complex sinusoid for only half of the already folded signal.
I can sort of see that it would work - but wouldn't this be a property of all complex transforms? I can't intuitively see why it works, though; I'll have to look at it once I get home.
-
- KVRist
- 50 posts since 21 Apr, 2008 from Germany
Hi,
first of all thanks for the code. I am using it for a spectral display on a parametric EQ (similar to Ableton's EQ) and it works fine for this purpose.
However, I needed to change

Code: Select all
static const unsigned ALIGN16 _mask_pd_sign_lo[4] = { 0, 1 << 31, 0, 0 };

to

Code: Select all
static const unsigned ALIGN16 _mask_pd_sign_lo[4] = { 0, static_cast<unsigned int>(1 << 31), 0, 0 };

to be able to compile it with Xcode on OS X. With this change it works, but I didn't dig too deep into your implementation details (the script-generated stuff is really scary) to check whether the static_cast may have unwanted side-effects.
- KVRAF
- Topic Starter
- 7897 posts since 12 Feb, 2006 from Helsinki, Finland
Chrisboy2000 wrote: With this change it works, but I didn't dig too deep into your implementation details (the script-generated stuff is really scary) to check whether the static_cast may have unwanted side-effects.
Yeah, this was discussed on page 1 and happens when compiling as C++ (which I didn't test in advance, because .. well, it's C code). You can use (1U << 31) instead of the cast, though.
-
- KVRist
- 50 posts since 21 Apr, 2008 from Germany
Stupid me, sorry for bothering
-
- KVRian
- 1379 posts since 26 Apr, 2004 from UK
Chris> Actually, your version is less correct than mystran's fix: you are casting only after the shift has already happened, so 1 << 31 is still evaluated on a signed int.
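To spell out the point: in `static_cast<unsigned int>(1 << 31)` the shift is evaluated first, on a signed int, where shifting into the sign bit overflows (undefined behavior in C, where this code lives); the cast only converts the already-problematic result. Shifting an unsigned operand avoids the issue entirely, assuming the usual 32-bit unsigned int:

```cpp
// The sign-bit mask built without shifting into a signed int's sign bit:
// with 1u the shift operand is unsigned from the start, so no signed
// overflow ever happens and no cast is needed afterwards.
static const unsigned sign_bit_mask = 1u << 31;
```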