Anti-Aliasing Filters
- KVRAF
- 7891 posts since 12 Feb, 2006 from Helsinki, Finland
Just in case the whole "polyphase" part is confusing someone, let me explain, because this is so simple it's not even funny:
For downsampling you simply convolve the input as usual, but rather than incrementing the input offset by 1 after every sample, you increment the input offset by the oversampling ratio N, but still increment the output offset by only 1. That's it. This way we compute only the samples we are going to keep, only need 1/Nth of the work and there is no discarding of samples.
For upsampling things are only slightly more complicated. What you want to do here is reorder the kernel into N separate shorter kernel "branches" in a tap-by-tap round-robin fashion. Then you convolve the input (without zero-stuffing) with each of these separate shorter kernels and interleave the results. Again, this takes 1/Nth of the work compared to zero-stuffing and simple convolution.
Some descriptions of polyphase resampling complicate the whole thing by also trying to explain the N:M ratio case, where you're supposed to combine both of these optimisations. This is somewhat more complicated (well, not really, but it's harder to wrap your head around). Ignore that stuff; you only need the simple 1:N and N:1 cases for oversampling.
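To make the two cases concrete, here's a plain scalar C++ sketch of both (function names and conventions are mine; the inner loops use the dot-product form, which matches zero-stuffed convolution for the symmetric kernels used in practice):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Decimate by N: convolve as usual, but advance the input offset by N per
// output sample, so only the samples we keep are ever computed.
std::vector<float> decimate(const std::vector<float>& in,
                            const std::vector<float>& kernel, std::size_t N)
{
    std::vector<float> out;
    for (std::size_t i = 0; i + kernel.size() <= in.size(); i += N) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < kernel.size(); ++j)
            acc += in[i + j] * kernel[j];
        out.push_back(acc);
    }
    return out;
}

// Interpolate by N: split the kernel round-robin into N shorter branches,
// convolve the input (no zero-stuffing) with each branch, and interleave
// the branch outputs.
std::vector<float> interpolate(const std::vector<float>& in,
                               const std::vector<float>& kernel, std::size_t N)
{
    const std::size_t branchLen = kernel.size() / N; // assumes kernel size is a multiple of N
    std::vector<float> out;
    for (std::size_t i = 0; i + branchLen <= in.size(); ++i)
        for (std::size_t phase = 0; phase < N; ++phase) {
            float acc = 0.0f;
            for (std::size_t j = 0; j < branchLen; ++j)
                acc += in[i + j] * kernel[phase + N * j]; // branch 'phase' holds taps phase, phase+N, ...
            out.push_back(acc);
        }
    return out;
}
```

For a 4-tap moving-average kernel and N = 2, decimate({1..8}, k, 2) yields {2.5, 4.5, 6.5}: exactly the samples you would have kept after filtering at the full rate, with none of the discarded ones ever computed.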
-
- KVRAF
- 1607 posts since 12 Apr, 2002
As I'm pretty sure I already mentioned elsewhere, I find the whole idea of polyphase filtering in resampling unnecessarily complicated. It's much simpler to think in terms of convolving a discrete-time signal with a continuous-time kernel, where you resample only the points which you need. This gives you exactly the same math, but the reasoning is straightforward and simple. YMMV.
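For what it's worth, that continuous-kernel view can be sketched like this (a windowed sinc stands in for the continuous-time kernel; the names and the Hann window choice are mine, not the poster's):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// A continuous-time interpolation kernel: Hann-windowed sinc with 'halfTaps'
// zero crossings on each side. kernel(0) == 1 and kernel(n) == 0 at every
// other integer n, so integer sample positions are reproduced exactly.
double kernel(double t, int halfTaps)
{
    if (std::abs(t) >= halfTaps) return 0.0;
    if (t == 0.0) return 1.0;
    const double pi = 3.14159265358979323846;
    double sinc = std::sin(pi * t) / (pi * t);
    double window = 0.5 + 0.5 * std::cos(pi * t / halfTaps); // Hann window
    return sinc * window;
}

// Evaluate the (approximately) band-limited reconstruction of 'in' at an
// arbitrary fractional time 't', in input-sample units. This is "convolving
// a discrete-time signal with a continuous-time kernel": only the requested
// points are ever computed, which for rational t gives exactly the same
// arithmetic as the polyphase decomposition.
double resampleAt(const std::vector<double>& in, double t, int halfTaps)
{
    double acc = 0.0;
    int first = (int)std::ceil(t) - halfTaps;
    int last  = (int)std::floor(t) + halfTaps;
    for (int m = first; m <= last; ++m)
        if (m >= 0 && m < (int)in.size())
            acc += in[m] * kernel(t - (double)m, halfTaps);
    return acc;
}
```

Upsampling by N is then just evaluating resampleAt at t = 0, 1/N, 2/N, ...; each distinct fractional offset corresponds to one polyphase branch.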
- KVRian
- 1253 posts since 31 Dec, 2008
mystran wrote: ↑Thu Dec 03, 2020 6:05 pm
For downsampling you simply convolve the input as usual, but rather than incrementing the input offset by 1 after every sample, you increment the input offset by the oversampling ratio N, but still increment the output offset by only 1. That's it. This way we compute only the samples we are going to keep, only need 1/Nth of the work and there is no discarding of samples.
That's exactly what I'm doing without even knowing about polyphase in the first place. BUT the problem is the number of taps would still have to be multiplied by the oversampling factor, which grows exponentially.
What I can't get my head around is that all the references out there say that you can actually use taps*factor/m, where m is the number of branches (filters), which basically means just taps (when we set m=factor). BUT, since there are m branches that need computation, isn't it the same thing? Where is the saving? The only way I can understand this as a saving is to exploit hardware parallelism by computing multiple branches concurrently.
Edit: they didn't literally say taps*factor/m. It's more like branch_taps = FIR_taps / m, which is the same thing.
www.solostuff.net
Advice is heavy. So don’t send it like a mountain.
- KVRAF
- Topic Starter
- 2239 posts since 25 Sep, 2014 from Specific Northwest
I've just been looking into/thinking about this. The only way the polyphase looks good is if there are at least 100 taps, which comes out to more than 3x the number of multiplies of a 9th-order Chebyshev II/elliptic LP filter.
Does the FIR perform better just because they are literally multiply/adds that vectorize more easily? Unfortunately, biquads are currently my worst performing filters, although I've done nothing towards optimization at this point, other than letting the compiler handle it.
Also, with the FIR, it looks like there is going to be at least a 50 sample latency I have to carry. Ugh... Unless I am missing something here.
Finally, are the FIR filter coefficients necessarily mirrored from end to end? I haven't looked at the actual math yet to see how they are generated.
I started on Logic 5 with a PowerBook G4 550Mhz. I now have a MacBook Air M1 and it's ~165x faster! So, why is my music not proportionally better?
-
- KVRian
- 1273 posts since 9 Jan, 2006
Downsampling, you essentially only need to process the retained samples, so at x32 you only do 1/32 of the processing with a FIR.
Latency is an issue, but it is possible to transform the kernel to minimum phase.
- KVRAF
- 7891 posts since 12 Feb, 2006 from Helsinki, Finland
S0lo wrote: ↑Thu Dec 03, 2020 7:07 pm
That's exactly what I'm doing without even knowing about polyphase in the first place. BUT the problem is the number of taps would still have to be multiplied by the oversampling factor, which grows exponentially.
The work-per-sample stays the same when measured at the oversampled rate. IIR filtering also has to run at the oversampled rate, so the number of samples it has to process grows by a factor of the oversampling rate. The point I'm trying to make is that there is no additional penalty for FIR beyond what you will have to accept with literally every type of filter.
- KVRAF
- 7891 posts since 12 Feb, 2006 from Helsinki, Finland
Remember that your IIR has to run at the oversampled rate, so you should compare the per-sample IIR cost against per-sample FIR cost which is the total kernel size divided by the oversampling factor.
Does the FIR perform better just because they are literally multiply/adds that vectorize more easily?
You should usually expect them to pipeline better as well.
Also, with the FIR, it looks like there is going to be at least a 50 sample latency I have to carry. Ugh... Unless I am missing something here.
Latency is a matter of phase response of the IR: you can convert linear-phase FIR into minimum-phase FIR if you want latency similar to an IIR filter.
Finally, are the FIR filter coefficients necessarily mirrored from end to end? I haven't looked at the actual math yet to see how they are generated.
Linear-phase filters are symmetric. Trying to take advantage of the symmetry is likely to be a pessimisation, because it interferes with both vectorization and polyphase decomposition.
ps. Also, there is no good reason to design an oversampling FIR where the "taps per branch" isn't a multiple of your SIMD width. Something like 100 taps just doesn't make any sense. For an efficient implementation you're going to have to pad it anyway, so you might just as well design a slightly longer kernel instead.
- KVRAF
- Topic Starter
- 2239 posts since 25 Sep, 2014 from Specific Northwest
I'm looking at 2x oversampling with a 90 dB cut for both filters. The elliptic has minimal passband ripple; the FIR has an acceptable amount of roll-off/fold-back. So they are both performing about the same, more or less.
My thumbnail estimate on the number of multiplies was taking into account that I only calculate the needed samples with the FIR as opposed to every sample with the IIR.
On a side note, all of this information is finally being grokked by my puny human brain and being retained. So hopefully, I can finally put this topic to bed and not come back to it again so soon.
- KVRAF
- Topic Starter
- 2239 posts since 25 Sep, 2014 from Specific Northwest
Whoa! Slow down! I can only go down single rabbit holes, not entire rabbit warrens!
This'll be next on my to-do list.
-
- KVRian
- 631 posts since 21 Jun, 2013
// Computes the horizontal sum of each of the four input vectors.
// Result: [hsum(a0), hsum(a1), hsum(a2), hsum(a3)]
// Based on the single-vector reduction from
// https://stackoverflow.com/questions/13219146/how-to-sum-m256-horizontally
__m128 vector_hsum(__m256 a0, __m256 a1, __m256 a2, __m256 a3)
{
//interleave a0 and a1
//lo - [a0_0, a1_0, a0_1, a1_1]
//hi - [a0_2, a1_2, a0_3, a1_3]
__m256 a01lo = _mm256_unpacklo_ps(a0, a1);
__m256 a01hi = _mm256_unpackhi_ps(a0, a1);
__m256 a23lo = _mm256_unpacklo_ps(a2, a3);
__m256 a23hi = _mm256_unpackhi_ps(a2, a3);
// [a0_02,a1_02,a0_13,a1_13]
__m256 a01psum = _mm256_add_ps(a01lo, a01hi);
__m256 a23psum = _mm256_add_ps(a23lo, a23hi);
// [a0_02,a1_02,a2_02,a3_02]
__m256 a0123_02 = _mm256_shuffle_ps(a01psum, a23psum, _MM_SHUFFLE(1, 0, 1, 0));
// [a0_13,a1_13,a2_13,a3_13]
__m256 a0123_13 = _mm256_shuffle_ps(a01psum, a23psum, _MM_SHUFFLE(3, 2, 3, 2));
__m256 sum = _mm256_add_ps(a0123_02, a0123_13);
// Add the low and high 128-bit halves of the 8-wide sum
__m128 vlow = _mm256_castps256_ps128(sum);
__m128 vhigh = _mm256_extractf128_ps(sum, 1); // high 128
__m128 res = _mm_add_ps(vlow, vhigh);
return res;
}
void decimate_fma(const float* __restrict coeffs, const float* __restrict input, float* __restrict output, size_t size, size_t taps, size_t factor)
{
size_t g = 0;
for (size_t i = 0; i < size - taps; i += factor*8)
{
__m256 sum0, sum1, sum2, sum3, sum4, sum5, sum6, sum7;
sum0 = sum1 = sum2 = sum3 = sum4 = sum5 = sum6 = sum7 = _mm256_setzero_ps();
const float* input_i = input + i;
for (size_t j = 0; j < taps; j += 8)
{
const float* input_i_j = input_i + j;
__m256 coefficient = _mm256_loadu_ps(coeffs + j);
sum0 = _mm256_fmadd_ps(_mm256_loadu_ps(input_i_j + factor * 0), coefficient, sum0);
sum1 = _mm256_fmadd_ps(_mm256_loadu_ps(input_i_j + factor * 1), coefficient, sum1);
sum2 = _mm256_fmadd_ps(_mm256_loadu_ps(input_i_j + factor * 2), coefficient, sum2);
sum3 = _mm256_fmadd_ps(_mm256_loadu_ps(input_i_j + factor * 3), coefficient, sum3);
sum4 = _mm256_fmadd_ps(_mm256_loadu_ps(input_i_j + factor * 4), coefficient, sum4);
sum5 = _mm256_fmadd_ps(_mm256_loadu_ps(input_i_j + factor * 5), coefficient, sum5);
sum6 = _mm256_fmadd_ps(_mm256_loadu_ps(input_i_j + factor * 6), coefficient, sum6);
sum7 = _mm256_fmadd_ps(_mm256_loadu_ps(input_i_j + factor * 7), coefficient, sum7);
}
__m128 v0 = vector_hsum(sum0, sum1, sum2, sum3);
__m128 v1 = vector_hsum(sum4, sum5, sum6, sum7);
_mm_storeu_ps(output + g, v0);
_mm_storeu_ps(output + g + 4, v1);
g += 8;
}
}
Not tested, but the strategy should be obvious.
This should be close to optimal for high rates (8x, 16x, 24x, 32x) when everything is aligned. For lower rates a different strategy should be applied to avoid unaligned reads (multiple FIR kernels with offsets).
-
- KVRian
- 919 posts since 4 Jan, 2007
Sorry for the bump.
At some point I might want to add a lower-than-FIR-latency oversampling filter.
Has someone tried the parallel (partial fraction expansion) version of the elliptic/type-2 Chebyshev filters?
From my limited research and knowledge, it seems that partial fraction expansion of higher order filters is tricky to do on the fly, but as these are fixed designs I can see myself offloading the first steps to math packages, e.g.: "scipy.signal.iirdesign", then "scipy.signal.residuez".
From there I don't know enough to see:
- How to go from poles/zeros to second-order sections.
- How to do polyphase on a parallel IIR.
- How to handle close poles (might happen on high-order elliptics, I guess).
I can't be the first one thinking about this, there must be some hurdle or not be worth CPU-wise versus the cascaded implementation.
- KVRAF
- 7891 posts since 12 Feb, 2006 from Helsinki, Finland
Chebyshev type-2 at least is very simple to design by simply placing the poles and zeroes individually (e.g. just take the formulas out of Wikipedia), so you'll get the denominators for PFE directly, and a bit of linear algebra should solve the zeroes for you.
... but seriously, as I've pointed out before, "lower than FIR latency" is sort of a nonsensical statement.
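A sketch of that pole/zero placement, using the textbook formulas (type-1 poles reciprocated, zeros on the imaginary axis); all names are mine, and this is the analog prototype only, normalized to a stopband edge of w = 1:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

// Analog Chebyshev type-2 lowpass prototype by direct pole/zero placement.
struct Cheby2 {
    std::vector<std::complex<double>> zeros, poles;
    double gain;
};

Cheby2 cheby2Prototype(int order, double stopbandAttenDb)
{
    const double pi = 3.14159265358979323846;
    double eps = 1.0 / std::sqrt(std::pow(10.0, stopbandAttenDb / 10.0) - 1.0);
    double mu = std::asinh(1.0 / eps) / order;
    Cheby2 f;
    for (int k = 1; k <= order; ++k) {
        double theta = pi * (2 * k - 1) / (2 * order);
        // Chebyshev type-1 pole (on an ellipse in the left half-plane) ...
        std::complex<double> p1(-std::sinh(mu) * std::sin(theta),
                                 std::cosh(mu) * std::cos(theta));
        f.poles.push_back(1.0 / p1); // ... reciprocated for type 2
        if (std::abs(std::cos(theta)) > 1e-12) // odd orders: one zero at infinity
            f.zeros.push_back(std::complex<double>(0.0, 1.0 / std::cos(theta)));
    }
    // Normalize for unity gain at DC: H(0) = gain * prod(-z) / prod(-p) = 1.
    std::complex<double> num(1.0), den(1.0);
    for (auto& z : f.zeros) num *= -z;
    for (auto& p : f.poles) den *= -p;
    f.gain = (den / num).real();
    return f;
}

// Evaluate |H(jw)| from the zeros, poles and gain.
double magnitudeAt(const Cheby2& f, double w)
{
    std::complex<double> s(0.0, w), h(f.gain);
    for (auto& z : f.zeros) h *= (s - z);
    for (auto& p : f.poles) h /= (s - p);
    return std::abs(h);
}
```

At the stopband edge the response hits the stopband ripple level exactly (e.g. -60 dB for a 60 dB design), which is a handy sanity check on the placement. The poles here are the denominators you would feed into a partial fraction expansion.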
- KVRAF
- 7891 posts since 12 Feb, 2006 from Helsinki, Finland
It is relatively straight-forward to convert pretty much any linear-phase FIR into a minimum-phase FIR. The "log-cepstrum" method for doing this is essentially just a few rounds of FFT:
1. FFT the original FIR (using an FFT size that's at least a couple of times longer than the FIR is a good idea, to combat "time aliasing")
2. for each bin replace the real component with the logarithm of the magnitude (e.g. with some -200 dB "small value" as a lower threshold to avoid -inf from the logarithm if the magnitude is zero), and set the imaginary component to zero
3. take IFFT of the "log spectrum" to get "log cepstrum"
4. if the FFT size is N, then multiply elements in range [1, N/2-1] by 2 and zero the elements in range [N/2+1, N-1], bins 0 and N/2 are left "as-is" - this step is where the magic happens
5. take the FFT of the result, take the complex(!) exponential of each bin, IFFT the result and (typically) truncate to the original length.
This method is not exact (there are numerical error sources from potential time-aliasing, from the "small value" and from the final truncation), so it might not work great if the FIR is very short, but it's pretty robust (i.e. the numerical errors typically cause slight errors rather than something catastrophic) and for a typical oversampling kernel it usually works great.
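The five steps above can be sketched as follows. A naive O(N^2) DFT stands in for a real FFT library, which is fine for an offline design step; the function names are mine:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Naive DFT/IDFT, O(N^2) - fine for an offline design step.
cvec dft(const cvec& x, bool inverse)
{
    const double pi = 3.14159265358979323846;
    std::size_t n = x.size();
    cvec y(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc(0.0);
        for (std::size_t t = 0; t < n; ++t) {
            double ang = 2.0 * pi * k * t / n * (inverse ? 1.0 : -1.0);
            acc += x[t] * std::complex<double>(std::cos(ang), std::sin(ang));
        }
        y[k] = inverse ? acc / (double)n : acc;
    }
    return y;
}

// Steps 1-5: linear-phase FIR -> minimum-phase FIR via the log cepstrum.
std::vector<double> minimumPhase(const std::vector<double>& fir, std::size_t fftSize)
{
    cvec buf(fftSize);
    for (std::size_t i = 0; i < fir.size(); ++i) buf[i] = fir[i];
    cvec spec = dft(buf, false);                     // 1. FFT (zero-padded)
    for (auto& b : spec)                             // 2. log magnitude,
        b = std::log(std::max(std::abs(b), 1e-10));  //    floored near -200 dB
    cvec cep = dft(spec, true);                      // 3. log cepstrum
    for (std::size_t i = 1; i < fftSize / 2; ++i)    // 4. fold: double [1, N/2-1],
        cep[i] *= 2.0;                               //    keep bins 0 and N/2,
    for (std::size_t i = fftSize / 2 + 1; i < fftSize; ++i)
        cep[i] = 0.0;                                //    zero [N/2+1, N-1]
    cvec spec2 = dft(cep, false);                    // 5. FFT, complex exp,
    for (auto& b : spec2) b = std::exp(b);           //    IFFT, truncate
    cvec out = dft(spec2, true);
    std::vector<double> res(fir.size());
    for (std::size_t i = 0; i < fir.size(); ++i) res[i] = out[i].real();
    return res;
}
```

The result has (approximately) the same magnitude response as the input but with the energy pushed toward the front of the impulse response, which is what kills the bulk of the latency.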