Anti-Aliasing Filters
- KVRAF
- 7891 posts since 12 Feb, 2006 from Helsinki, Finland
Just in case the whole "polyphase" part is confusing someone, let me explain, because this is so simple it's not even funny:
For downsampling you simply convolve the input as usual, but rather than incrementing the input offset by 1 after every sample, you increment the input offset by the oversampling ratio N, but still increment the output offset by only 1. That's it. This way we compute only the samples we are going to keep, only need 1/Nth of the work and there is no discarding of samples.
For upsampling things are only slightly more complicated. What you want to do here is reorder the kernel into N separate shorter kernel "branches" in a tap-by-tap round-robin fashion. Then you convolve the input (without zero-stuffing) with each of these separate shorter kernels and interleave the results. Again, this takes 1/Nth of the work compared to zero-stuffing and simple convolution.
Some descriptions of polyphase resampling complicate the whole thing by also trying to explain the N:M ratio case, where you're supposed to combine both of these optimisations. This is somewhat more complicated (well, not really, but it's harder to wrap your head around). Ignore that stuff; you only need the simple 1:N and N:1 cases for oversampling.
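To make the two cases concrete, here's a plain scalar C++ sketch of both (function names and conventions are mine; the inner loops use the dot-product form, which matches zero-stuffed convolution for the symmetric kernels used in practice):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Decimate by N: convolve as usual, but advance the input offset by N per
// output sample, so only the samples we keep are ever computed.
std::vector<float> decimate(const std::vector<float>& in,
                            const std::vector<float>& kernel, std::size_t N)
{
    std::vector<float> out;
    for (std::size_t i = 0; i + kernel.size() <= in.size(); i += N) {
        float acc = 0.0f;
        for (std::size_t j = 0; j < kernel.size(); ++j)
            acc += in[i + j] * kernel[j];
        out.push_back(acc);
    }
    return out;
}

// Interpolate by N: split the kernel round-robin into N shorter branches,
// convolve the input (no zero-stuffing) with each branch, and interleave
// the branch outputs.
std::vector<float> interpolate(const std::vector<float>& in,
                               const std::vector<float>& kernel, std::size_t N)
{
    const std::size_t branchLen = kernel.size() / N; // assumes kernel size is a multiple of N
    std::vector<float> out;
    for (std::size_t i = 0; i + branchLen <= in.size(); ++i)
        for (std::size_t phase = 0; phase < N; ++phase) {
            float acc = 0.0f;
            for (std::size_t j = 0; j < branchLen; ++j)
                acc += in[i + j] * kernel[phase + N * j]; // branch 'phase' holds taps phase, phase+N, ...
            out.push_back(acc);
        }
    return out;
}
```

For a 4-tap moving-average kernel and N = 2, decimate({1..8}, k, 2) yields {2.5, 4.5, 6.5}: exactly the samples you would have kept after filtering at the full rate, with none of the discarded ones ever computed.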
-
- KVRAF
- 1607 posts since 12 Apr, 2002
As I'm pretty sure I already mentioned elsewhere, I find the whole idea of polyphase filtering in resampling unnecessarily complicated. It's much simpler to think in terms of convolving a discrete-time signal with a continuous-time kernel, where you resample only the points which you need. This gives you exactly the same math, but the reasoning is straightforward and simple. YMMV.
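For what it's worth, that continuous-kernel view can be sketched like this (a windowed sinc stands in for the continuous-time kernel; the names and the Hann window choice are mine, not the poster's):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// A continuous-time interpolation kernel: Hann-windowed sinc with 'halfTaps'
// zero crossings on each side. kernel(0) == 1 and kernel(n) == 0 at every
// other integer n, so integer sample positions are reproduced exactly.
double kernel(double t, int halfTaps)
{
    if (std::abs(t) >= halfTaps) return 0.0;
    if (t == 0.0) return 1.0;
    const double pi = 3.14159265358979323846;
    double sinc = std::sin(pi * t) / (pi * t);
    double window = 0.5 + 0.5 * std::cos(pi * t / halfTaps); // Hann window
    return sinc * window;
}

// Evaluate the (approximately) band-limited reconstruction of 'in' at an
// arbitrary fractional time 't', in input-sample units. This is "convolving
// a discrete-time signal with a continuous-time kernel": only the requested
// points are ever computed, which for rational t gives exactly the same
// arithmetic as the polyphase decomposition.
double resampleAt(const std::vector<double>& in, double t, int halfTaps)
{
    double acc = 0.0;
    int first = (int)std::ceil(t) - halfTaps;
    int last  = (int)std::floor(t) + halfTaps;
    for (int m = first; m <= last; ++m)
        if (m >= 0 && m < (int)in.size())
            acc += in[m] * kernel(t - (double)m, halfTaps);
    return acc;
}
```

Upsampling by N is then just evaluating resampleAt at t = 0, 1/N, 2/N, ...; each distinct fractional offset corresponds to one polyphase branch.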
- KVRian
- 1253 posts since 31 Dec, 2008
mystran wrote: ↑Thu Dec 03, 2020 6:05 pm
For downsampling you simply convolve the input as usual, but rather than incrementing the input offset by 1 after every sample, you increment the input offset by the oversampling ratio N, but still increment the output offset by only 1. That's it. This way we compute only the samples we are going to keep, only need 1/Nth of the work and there is no discarding of samples.
That's exactly what I'm doing without even knowing about polyphase in the first place. BUT the problem is the number of taps would still have to be multiplied by the oversampling factor, which grows exponentially.
What I can't get my head around is that all the references out there say that you can actually use taps*factor/m, where m is the number of branches (filters), which basically means just taps (when we set m=factor). BUT, since there are m branches that need computation, isn't it the same thing? Where is the saving? The only way I can understand this as a saving is to exploit hardware parallelism by computing multiple branches concurrently.
Edit: they didn't literally say taps*factor/m. It's more like branch_taps = FIR_taps / m, which is the same thing.
www.solostuff.net
Advice is heavy. So don’t send it like a mountain.
- KVRAF
- Topic Starter
- 2239 posts since 25 Sep, 2014 from Specific Northwest
I've just been looking into/thinking about this. The only way the polyphase looks good is if there are at least 100 taps, which comes out to more than 3x the number of multiplies of a 9th-order Chebyshev II/elliptic LP filter.
Does the FIR perform better just because they are literally multiply/adds that vectorize more easily? Unfortunately, biquads are currently my worst performing filters, although I've done nothing towards optimization at this point, other than letting the compiler handle it.
Also, with the FIR, it looks like there is going to be at least a 50 sample latency I have to carry. Ugh... Unless I am missing something here.
Finally, are the FIR filter coefficients necessarily mirrored from end to end? I haven't looked at the actual math yet to see how they are generated.
I started on Logic 5 with a PowerBook G4 550Mhz. I now have a MacBook Air M1 and it's ~165x faster! So, why is my music not proportionally better?
-
- KVRian
- 1273 posts since 9 Jan, 2006
Downsampling, you essentially only need to process the retained samples, so at x32 you only do 1/32 of the processing with a FIR.
Latency is an issue, but it is possible to transform the kernel to minimum phase.
- KVRAF
- 7891 posts since 12 Feb, 2006 from Helsinki, Finland
S0lo wrote: ↑Thu Dec 03, 2020 7:07 pm
That's exactly what I'm doing without even knowing about polyphase in the first place. BUT the problem is the number of taps would still have to be multiplied by the oversampling factor, which grows exponentially.
The work-per-sample stays the same when measured at the oversampled rate. IIR filtering also has to run at the oversampled rate, so the number of samples it has to process grows by a factor of the oversampling rate. The point I'm trying to make is that there is no additional penalty for FIR beyond what you will have to accept with literally every type of filter.
- KVRAF
- 7891 posts since 12 Feb, 2006 from Helsinki, Finland
Remember that your IIR has to run at the oversampled rate, so you should compare the per-sample IIR cost against per-sample FIR cost which is the total kernel size divided by the oversampling factor.
Does the FIR perform better just because they are literally multiply/adds that vectorize more easily?
You should usually expect them to pipeline better as well.
Also, with the FIR, it looks like there is going to be at least a 50 sample latency I have to carry. Ugh... Unless I am missing something here.
Latency is a matter of phase response of the IR: you can convert linear-phase FIR into minimum-phase FIR if you want latency similar to an IIR filter.
Finally, are the FIR filter coefficients necessarily mirrored from end to end? I haven't looked at the actual math yet to see how they are generated.
Linear-phase filters are symmetric. Trying to take advantage of the symmetry is likely to be a pessimisation, because it interferes with both vectorization and polyphase decomposition.
ps. Also, there is no good reason to design an oversampling FIR where the "taps per branch" isn't a multiple of your SIMD width. Something like 100 taps just doesn't make any sense. For an efficient implementation you're going to have to pad it anyway, so you might just as well design a slightly longer kernel instead.
- KVRAF
- Topic Starter
- 2239 posts since 25 Sep, 2014 from Specific Northwest
I'm looking at 2x oversampling with a 90 dB cut for both filters. The elliptic has minimal passband ripple; the FIR has an acceptable amount of roll-off/fold-back. So they are both performing about the same, more or less.
My thumbnail estimate on the number of multiplies was taking into account that I only calculate the needed samples with the FIR as opposed to every sample with the IIR.
On a side note, all of this information is finally being grokked by my puny human brain and being retained. So hopefully, I can finally put this topic to bed and not come back to it again so soon.
- KVRAF
- Topic Starter
- 2239 posts since 25 Sep, 2014 from Specific Northwest
Whoa! Slow down! I can only go down single rabbit holes, not entire rabbit warrens!
This'll be next on my to-do list.
-
- KVRian
- 631 posts since 21 Jun, 2013
// Computes the horizontal sum of each of the four input vectors.
// Result: [hsum(a0), hsum(a1), hsum(a2), hsum(a3)]
// Based on the single-vector reduction from
// https://stackoverflow.com/questions/13219146/how-to-sum-m256-horizontally
__m128 vector_hsum(__m256 a0, __m256 a1, __m256 a2, __m256 a3)
{
//interleave a0 and a1
//lo - [a0_0, a1_0, a0_1, a1_1]
//hi - [a0_2, a1_2, a0_3, a1_3]
__m256 a01lo = _mm256_unpacklo_ps(a0, a1);
__m256 a01hi = _mm256_unpackhi_ps(a0, a1);
__m256 a23lo = _mm256_unpacklo_ps(a2, a3);
__m256 a23hi = _mm256_unpackhi_ps(a2, a3);
// [a0_02,a1_02,a0_13,a1_13]
__m256 a01psum = _mm256_add_ps(a01lo, a01hi);
__m256 a23psum = _mm256_add_ps(a23lo, a23hi);
// [a0_02,a1_02,a2_02,a3_02]
__m256 a0123_02 = _mm256_shuffle_ps(a01psum, a23psum, _MM_SHUFFLE(1, 0, 1, 0));
// [a0_13,a1_13,a2_13,a3_13]
__m256 a0123_13 = _mm256_shuffle_ps(a01psum, a23psum, _MM_SHUFFLE(3, 2, 3, 2));
__m256 sum = _mm256_add_ps(a0123_02, a0123_13);
// Add the low and high 128-bit halves of the 8-wide sum
__m128 vlow = _mm256_castps256_ps128(sum);
__m128 vhigh = _mm256_extractf128_ps(sum, 1); // high 128
__m128 res = _mm_add_ps(vlow, vhigh);
return res;
}
void decimate_fma(const float* __restrict coeffs, const float* __restrict input, float* __restrict output, size_t size, size_t taps, size_t factor)
{
size_t g = 0;
for (size_t i = 0; i < size - taps; i += factor*8)
{
__m256 sum0, sum1, sum2, sum3, sum4, sum5, sum6, sum7;
sum0 = sum1 = sum2 = sum3 = sum4 = sum5 = sum6 = sum7 = _mm256_setzero_ps();
const float* input_i = input + i;
for (size_t j = 0; j < taps; j += 8)
{
const float* input_i_j = input_i + j;
__m256 coefficient = _mm256_loadu_ps(coeffs + j);
sum0 = _mm256_fmadd_ps(_mm256_loadu_ps(input_i_j + factor * 0), coefficient, sum0);
sum1 = _mm256_fmadd_ps(_mm256_loadu_ps(input_i_j + factor * 1), coefficient, sum1);
sum2 = _mm256_fmadd_ps(_mm256_loadu_ps(input_i_j + factor * 2), coefficient, sum2);
sum3 = _mm256_fmadd_ps(_mm256_loadu_ps(input_i_j + factor * 3), coefficient, sum3);
sum4 = _mm256_fmadd_ps(_mm256_loadu_ps(input_i_j + factor * 4), coefficient, sum4);
sum5 = _mm256_fmadd_ps(_mm256_loadu_ps(input_i_j + factor * 5), coefficient, sum5);
sum6 = _mm256_fmadd_ps(_mm256_loadu_ps(input_i_j + factor * 6), coefficient, sum6);
sum7 = _mm256_fmadd_ps(_mm256_loadu_ps(input_i_j + factor * 7), coefficient, sum7);
}
__m128 v0 = vector_hsum(sum0, sum1, sum2, sum3);
__m128 v1 = vector_hsum(sum4, sum5, sum6, sum7);
_mm_storeu_ps(output + g, v0);
_mm_storeu_ps(output + g + 4, v1);
g += 8;
}
}
Not tested, but the strategy should be obvious.
This should be close to optimal for high rates (8x, 16x, 24x, 32x) when everything is aligned. For lower rates a different strategy should be applied to avoid unaligned reads (multiple FIR kernels with offsets).
-
- KVRian
- 919 posts since 4 Jan, 2007
Sorry for the bump.
At some point I might want to add a lower-than-FIR-latency oversampling filter.
Has someone tried the parallel (partial fraction expansion) version of the elliptic/type-2 Chebyshev filters?
From my limited research and knowledge, it seems that partial fraction expansion of higher order filters is tricky to do on the fly, but as these are fixed designs I can see myself offloading the first steps to math packages, e.g.: "scipy.signal.iirdesign", then "scipy.signal.residuez".
From there I don't know enough to see:
- How to go from poles/zeros to second-order sections.
- How to do polyphase on a parallel IIR.
- How to handle close poles (might happen on high-order elliptics, I guess).
I can't be the first one thinking about this, there must be some hurdle or not be worth CPU-wise versus the cascaded implementation.
- KVRAF
- 7891 posts since 12 Feb, 2006 from Helsinki, Finland
Chebyshev type-2 at least is very simple to design by simply placing the poles and zeroes individually (e.g. just take the formulas out of Wikipedia), so you'll get the denominators for PFE directly, and a bit of linear algebra should solve the zeroes for you.
... but seriously, as I've pointed out before, "lower than FIR latency" is sort of a nonsensical statement.
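A sketch of that pole/zero placement, using the textbook formulas (type-1 poles reciprocated, zeros on the imaginary axis); all names are mine, and this is the analog prototype only, normalized to a stopband edge of w = 1:

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

// Analog Chebyshev type-2 lowpass prototype by direct pole/zero placement.
struct Cheby2 {
    std::vector<std::complex<double>> zeros, poles;
    double gain;
};

Cheby2 cheby2Prototype(int order, double stopbandAttenDb)
{
    const double pi = 3.14159265358979323846;
    double eps = 1.0 / std::sqrt(std::pow(10.0, stopbandAttenDb / 10.0) - 1.0);
    double mu = std::asinh(1.0 / eps) / order;
    Cheby2 f;
    for (int k = 1; k <= order; ++k) {
        double theta = pi * (2 * k - 1) / (2 * order);
        // Chebyshev type-1 pole (on an ellipse in the left half-plane) ...
        std::complex<double> p1(-std::sinh(mu) * std::sin(theta),
                                 std::cosh(mu) * std::cos(theta));
        f.poles.push_back(1.0 / p1); // ... reciprocated for type 2
        if (std::abs(std::cos(theta)) > 1e-12) // odd orders: one zero at infinity
            f.zeros.push_back(std::complex<double>(0.0, 1.0 / std::cos(theta)));
    }
    // Normalize for unity gain at DC: H(0) = gain * prod(-z) / prod(-p) = 1.
    std::complex<double> num(1.0), den(1.0);
    for (auto& z : f.zeros) num *= -z;
    for (auto& p : f.poles) den *= -p;
    f.gain = (den / num).real();
    return f;
}

// Evaluate |H(jw)| from the zeros, poles and gain.
double magnitudeAt(const Cheby2& f, double w)
{
    std::complex<double> s(0.0, w), h(f.gain);
    for (auto& z : f.zeros) h *= (s - z);
    for (auto& p : f.poles) h /= (s - p);
    return std::abs(h);
}
```

At the stopband edge the response hits the stopband ripple level exactly (e.g. -60 dB for a 60 dB design), which is a handy sanity check on the placement. The poles here are the denominators you would feed into a partial fraction expansion.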
- KVRAF
- 7891 posts since 12 Feb, 2006 from Helsinki, Finland
It is relatively straight-forward to convert pretty much any linear-phase FIR into a minimum-phase FIR. The "log-cepstrum" method for doing this is essentially just a few rounds of FFT:
1. FFT the original FIR (using an FFT size that's at least a couple of times longer than the FIR is a good idea, to combat "time aliasing")
2. for each bin replace the real component with the logarithm of the magnitude (e.g. with some -200 dB "small value" as a lower threshold to avoid -inf from the logarithm if the magnitude is zero), and set the imaginary component to zero
3. take IFFT of the "log spectrum" to get "log cepstrum"
4. if the FFT size is N, then multiply elements in range [1, N/2-1] by 2 and zero the elements in range [N/2+1, N-1], bins 0 and N/2 are left "as-is" - this step is where the magic happens
5. take the FFT of the result, take the complex(!) exponential of each bin, IFFT the result and (typically) truncate to the original length.
This method is not exact (there are numerical error sources from potential time-aliasing, from the "small value" and from the final truncation), so it might not work great if the FIR is very short, but it's pretty robust (i.e. the numerical errors typically cause slight errors rather than something catastrophic) and for a typical oversampling kernel it usually works great.
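The five steps above can be sketched as follows. A naive O(N^2) DFT stands in for a real FFT library, which is fine for an offline design step; the function names are mine:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using cvec = std::vector<std::complex<double>>;

// Naive DFT/IDFT, O(N^2) - fine for an offline design step.
cvec dft(const cvec& x, bool inverse)
{
    const double pi = 3.14159265358979323846;
    std::size_t n = x.size();
    cvec y(n);
    for (std::size_t k = 0; k < n; ++k) {
        std::complex<double> acc(0.0);
        for (std::size_t t = 0; t < n; ++t) {
            double ang = 2.0 * pi * k * t / n * (inverse ? 1.0 : -1.0);
            acc += x[t] * std::complex<double>(std::cos(ang), std::sin(ang));
        }
        y[k] = inverse ? acc / (double)n : acc;
    }
    return y;
}

// Steps 1-5: linear-phase FIR -> minimum-phase FIR via the log cepstrum.
std::vector<double> minimumPhase(const std::vector<double>& fir, std::size_t fftSize)
{
    cvec buf(fftSize);
    for (std::size_t i = 0; i < fir.size(); ++i) buf[i] = fir[i];
    cvec spec = dft(buf, false);                     // 1. FFT (zero-padded)
    for (auto& b : spec)                             // 2. log magnitude,
        b = std::log(std::max(std::abs(b), 1e-10));  //    floored near -200 dB
    cvec cep = dft(spec, true);                      // 3. log cepstrum
    for (std::size_t i = 1; i < fftSize / 2; ++i)    // 4. fold: double [1, N/2-1],
        cep[i] *= 2.0;                               //    keep bins 0 and N/2,
    for (std::size_t i = fftSize / 2 + 1; i < fftSize; ++i)
        cep[i] = 0.0;                                //    zero [N/2+1, N-1]
    cvec spec2 = dft(cep, false);                    // 5. FFT, complex exp,
    for (auto& b : spec2) b = std::exp(b);           //    IFFT, truncate
    cvec out = dft(spec2, true);
    std::vector<double> res(fir.size());
    for (std::size_t i = 0; i < fir.size(); ++i) res[i] = out[i].real();
    return res;
}
```

The result has (approximately) the same magnitude response as the input but with the energy pushed toward the front of the impulse response, which is what kills the bulk of the latency.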