Accurate SSE2/FMA Tan for BLT prewarp.

DSP, Plugin and Host development discussion.

Post

Since tan is frequently used for filters, and we don't need complicated range reduction, it seems like a good candidate to get additional perf.

Edit: even better version, based on a rational minimax approximation.

Code: Select all

// Approximates tan(x*pi), x on (1e-35, 0.4999999). Useful for BLT prewarp.
// SSE2 instructions, 4-way parallel
inline __m128 tanpi_blt(__m128 x) {
  // R(x^2)*x is a relative-error minimax rational fit to tan(x*pi):
  // R(x) =
  // (-3.13832354545593x + 3.31913113594055) / (x^2 - 4.47475242614746x + 1.05651223659515)
  // ~4 ULP error
  const __m128 num0 = _mm_set1_ps(3.319131136e+00);
  const __m128 num1 = _mm_set1_ps(-3.138323545e+00);
  const __m128 den0 = _mm_set1_ps(1.056512237e+00);
  const __m128 den1 = _mm_set1_ps(-4.474752426e+00);

  const __m128 half = _mm_set1_ps(0.5f);
  const __m128 quarter = _mm_set1_ps(0.25f);
  const __m128 signmask = _mm_set1_ps(-0.0f);

  // We use this handy identity to keep the argument in a bounded range:
  // tan(x - pi/2) == 1 / -tan(x)
  // Since we compute tan(x*pi), the identity becomes tanpi(x-0.5) == 1 / -tanpi(x)

  // Force the value into the range (-0.25, 0.25]

  // Compute the mask: whether we compute via the identity (x > 0.25) or
  // directly (x <= 0.25)
  __m128 inverse_mask = _mm_cmpgt_ps(x, quarter);
  x = _mm_sub_ps(x, _mm_and_ps(inverse_mask, half));

  __m128 f2 = _mm_mul_ps(x, x);
  // Numerator, Horner's scheme
  __m128 num = _mm_add_ps(_mm_mul_ps(num1, f2), num0);
  // Andnot with the sign mask takes |x|, which supplies the sign flip
  // needed by the identity above
  num = _mm_mul_ps(_mm_andnot_ps(signmask, x), num);

  // Denominator, Horner's scheme
  __m128 denom = _mm_add_ps(f2, den1);
  denom = _mm_add_ps(den0, _mm_mul_ps(denom, f2));

  // Since we already compute a rational function, we just need to swap
  // numerator and denominator to produce an inverse

  // If (mask) then swap(a,b)
  // Conditional swap with one simple trick...
  __m128 swap_xor = _mm_and_ps(inverse_mask, _mm_xor_ps(num, denom));
  num = _mm_xor_ps(num, swap_xor);
  denom = _mm_xor_ps(denom, swap_xor);

  return _mm_div_ps(num, denom);
}

// Approximates tan(x*pi), x on (1e-35, 0.4999999). Useful for BLT prewarp.
// AVX2+FMA instructions, 8-way parallel
inline __m256 tanpi_blt_fma(__m256 x) {
  const __m256 num0 = _mm256_set1_ps(3.319131136e+00);
  const __m256 num1 = _mm256_set1_ps(-3.138323545e+00);
  const __m256 den0 = _mm256_set1_ps(1.056512237e+00);
  const __m256 den1 = _mm256_set1_ps(-4.474752426e+00);
  const __m256 half = _mm256_set1_ps(0.5f);
  const __m256 quarter = _mm256_set1_ps(0.25f);
  const __m256 signmask = _mm256_set1_ps(-0.0f);

  __m256 inverse_mask = _mm256_cmp_ps(x, quarter, _CMP_GT_OS);
  x = _mm256_sub_ps(x, _mm256_and_ps(inverse_mask, half));

  __m256 f2 = _mm256_mul_ps(x, x);

  __m256 num = _mm256_fmadd_ps(f2, num1, num0);
  num = _mm256_mul_ps(_mm256_andnot_ps(signmask, x), num);

  __m256 denom = _mm256_add_ps(f2, den1);
  denom = _mm256_fmadd_ps(denom, f2, den0);

  __m256 denom_select = _mm256_blendv_ps(denom, num, inverse_mask);
  __m256 num_select = _mm256_blendv_ps(num, denom, inverse_mask);

  return _mm256_div_ps(num_select, denom_select);
}
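
Usage-wise, the typical BLT prewarp step looks something like this sketch (mine, not from the post; the clamp keeps the argument inside the approximation's domain):

Code: Select all

// Hypothetical helper: prewarped tuning coefficient g = tan(pi*fc/fs),
// four voices at once.
inline __m128 prewarp_g(__m128 cutoff_hz, float sample_rate) {
  __m128 f = _mm_mul_ps(cutoff_hz, _mm_set1_ps(1.0f / sample_rate));
  f = _mm_min_ps(f, _mm_set1_ps(0.4999999f)); // stay inside the fit's domain
  return tanpi_blt(f);
}
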
Perf, cycles/value.

Code: Select all

MSVC vector tan: 5.71924
MSVC scalar tan: 29.8364
Old tanpi4_blt : 3.44653
Old tanpi4_blt_fma: 1.06836
New tanpi_blt : 2.48047
New tanpi_blt_fma : 0.902832
Error is around 3.6 ULPs, 2.3e-07 relative error.
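
For reference, a minimal harness along these lines can check the relative error against double-precision std::tan (a sketch, not the exact benchmark behind the numbers above):

Code: Select all

#include <cmath>
#include <cstdio>
#include <immintrin.h>

// assumes tanpi_blt() from above is in scope
int main() {
  const double pi = 3.14159265358979323846;
  const int N = 1000000;
  float max_rel = 0.0f;
  for (int i = 1; i < N; ++i) {
    float x = 0.4999999f * (float)i / (float)N;
    float approx = _mm_cvtss_f32(tanpi_blt(_mm_set1_ps(x)));
    double exact = std::tan((double)x * pi);
    float rel = (float)(std::fabs((double)approx - exact) / exact);
    if (rel > max_rel) max_rel = rel;
  }
  std::printf("max relative error: %g\n", max_rel);
}
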
Last edited by 2DaT on Tue Aug 20, 2019 6:21 pm, edited 3 times in total.

Post

For filter tuning, I actually like using Taylor expansion of sine (with 4 terms for 7th order below) and then computing sin/cos using the identity 1/cos = 1/sqrt(1-sin^2) which is made fast by using rsqrt with one Newton round:

Code: Select all

    // rsqrt with one Newton-Raphson refinement: y' = y*(1.5 - 0.5*x*y*y)
    static inline __m128 f4_rsqrt(const __m128 & x)
    {
        const __m128 cn05 = _mm_set1_ps(0.5f);
        const __m128 cn15 = _mm_set1_ps(1.5f);

        __m128 y = _mm_rsqrt_ps(x);
        return _mm_mul_ps(y, _mm_sub_ps(cn15,
            _mm_mul_ps(_mm_mul_ps(cn05, x), _mm_mul_ps(y,y))));
    }

    static inline __m128 f4_fastTan(const __m128 & x)
    {
        const __m128 cn7 = _mm_set1_ps(1/5040.f);
        const __m128 cn5 = _mm_set1_ps(1/120.f);
        const __m128 cn3 = _mm_set1_ps(1/6.f);
        const __m128 cn1 = _mm_set1_ps(1.f);

        // sin(x) ~ x*(1 - x^2*(1/6 - x^2*(1/120 - x^2/5040))), Horner form
        // with alternating signs (both inner steps subtract)
        __m128 x2 = _mm_mul_ps(x,x);
        __m128 y = _mm_sub_ps(cn5, _mm_mul_ps(cn7, x2));
        y = _mm_sub_ps(cn3, _mm_mul_ps(x2, y));
        y = _mm_mul_ps(x, _mm_sub_ps(cn1, _mm_mul_ps(x2, y)));

        // tan = sin/cos = sin * rsqrt(1 - sin^2)
        __m128 r = _mm_sub_ps(cn1, _mm_mul_ps(y,y));
        return _mm_mul_ps(y, f4_rsqrt(r));
    }
While this gets a little less accurate at high frequencies, it's still pretty decent in practice.

ps. I think I already shared this idea in the thread where I suggested a similar scheme for tanh(), approximating sinh() and then computing 1/cosh() using rsqrt(1+sinh^2).
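
A minimal sketch of that tanh() scheme (reconstructed here under the same assumptions, not copied from that thread), reusing f4_rsqrt from above:

Code: Select all

    // sinh(x) ~ x*(1 + x^2*(1/6 + x^2*(1/120 + x^2/5040))), all terms positive;
    // then tanh = sinh/cosh = sinh * rsqrt(1 + sinh^2). Small |x| only.
    static inline __m128 f4_fastTanh(const __m128 & x)
    {
        const __m128 cn7 = _mm_set1_ps(1/5040.f);
        const __m128 cn5 = _mm_set1_ps(1/120.f);
        const __m128 cn3 = _mm_set1_ps(1/6.f);
        const __m128 cn1 = _mm_set1_ps(1.f);

        __m128 x2 = _mm_mul_ps(x,x);
        __m128 y = _mm_add_ps(cn5, _mm_mul_ps(cn7, x2));
        y = _mm_add_ps(cn3, _mm_mul_ps(x2, y));
        y = _mm_mul_ps(x, _mm_add_ps(cn1, _mm_mul_ps(x2, y)));

        __m128 r = _mm_add_ps(cn1, _mm_mul_ps(y,y));
        return _mm_mul_ps(y, f4_rsqrt(r));
    }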

Post

mystran wrote: Wed Aug 14, 2019 9:07 pm For filter tuning, I actually like using Taylor expansion of sine (with 4 terms for 7th order below) and then computing sin/cos using the identity 1/cos = 1/sqrt(1-sin^2) which is made fast by using rsqrt with one Newton round:
I don't think the whole rsqrtps/rcpps + NR thing is worth it on modern processors. In fact, LLVM tries to be smart and replaces my div intrinsic with rcp+NR and makes my code slower.
LLVM with strict math:

Code: Select all

vcmpltps    ymm10,ymm0,ymm9
vandps      ymm11,ymm10,ymm1
vsubps      ymm9,ymm9,ymm11
vmulps      ymm11,ymm9,ymm9
vmovaps     ymm12,ymm3
vfmadd213ps ymm12,ymm11,ymm2
vfmadd213ps ymm12,ymm11,ymm4
vfmadd213ps ymm12,ymm11,ymm5
vfmadd213ps ymm12,ymm11,ymm6
vfmadd213ps ymm12,ymm11,ymm7
vandps      ymm9,ymm9,ymm8
vmulps      ymm9,ymm12,ymm9
vdivps      ymm11,ymm0,ymm9
vblendvps   ymm9,ymm9,ymm11,ymm10
LLVM with fast math:

Code: Select all

vcmpltps    ymm10,ymm0,ymm9
vandps      ymm11,ymm10,ymm1
vsubps      ymm9,ymm9,ymm11
vmulps      ymm11,ymm9,ymm9
vmovaps     ymm12,ymm3
vfmadd213ps ymm12,ymm11,ymm2
vfmadd213ps ymm12,ymm11,ymm4
vfmadd213ps ymm12,ymm11,ymm5
vfmadd213ps ymm12,ymm11,ymm6
vfmadd213ps ymm12,ymm11,ymm7
vandps      ymm9,ymm9,ymm8
vmulps      ymm9,ymm12,ymm9
vrcpps      ymm11,ymm9
vmovaps     ymm12,ymm11
vfnmadd213ps ymm12,ymm9,ymm0
vfmadd132ps ymm12,ymm11,ymm11
vblendvps   ymm9,ymm9,ymm12,ymm10

Code: Select all

LLVM with fast math: 1.22485
LLVM with strict math: 1.05029
Modern CPUs allow divisions to be as fast as multiplies, as long as you don't divide too often. I'd guess the optimal ratio is something like 5-10 arithmetic operations per divide (provided enough ILP, of course).

Post

2DaT wrote: Thu Aug 15, 2019 12:04 am
mystran wrote: Wed Aug 14, 2019 9:07 pm For filter tuning, I actually like using Taylor expansion of sine (with 4 terms for 7th order below) and then computing sin/cos using the identity 1/cos = 1/sqrt(1-sin^2) which is made fast by using rsqrt with one Newton round:
I don't think the whole rsqrtps/rcpps + NR thing is worth it on modern processors.
In my approximation above though, we specifically want reciprocal square root, so it's either rsqrtps+NR or sqrtps+divps. I'm too broken to benchmark it against yours (other than to note that mine is much less accurate at high frequencies), but figured I could still share.

Post

Another thing with filter tuning is that typically what I need is something like:

tan(pi*min(0.49,baseTune*exp2(cv)))

The reason I mention this is that you now have the additional problem that putting it all into one loop results in high register pressure from all the constants, while splitting it into two loops means bouncing through memory. Would be nice to have a good approximation for the whole thing directly.

ps. the 0.49 being an arbitrary "slightly less than Nyquist" value that hardly matters
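
For concreteness, the shape meant here is something like the following sketch, where exp2_ps() is a stand-in for whatever SIMD exp2 approximation is available and tanpi_blt() is from the first post:

Code: Select all

inline __m128 filter_tune(__m128 cv, float baseTune) {
  __m128 f = _mm_mul_ps(_mm_set1_ps(baseTune), exp2_ps(cv));
  f = _mm_min_ps(f, _mm_set1_ps(0.49f)); // "slightly less than Nyquist"
  return tanpi_blt(f);                   // tan(pi * f)
}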

Post

mystran wrote: Thu Aug 15, 2019 8:31 am In my approximation above though, we specifically want reciprocal square root, so it's either rsqrtps+NR or sqrtps+divps. I'm too broken to benchmark it against yours (other than to note that mine is much less accurate at high frequencies), but figured I could still share.
It's not particularly efficient, because it uses a Taylor polynomial, but the idea is nice. :tu: It's a shame that CPUs don't have a full-precision rsqrtps; perhaps the existing div/sqrt hardware could be used for it.
mystran wrote: Thu Aug 15, 2019 8:55 am The reason I mention this is that you now have the additional problem that putting it all into one loop results in high register pressure from all the constants, while splitting it into two loops means bouncing through memory.
I don't think that reading constants from memory is bad, since they'll be cached and the loads fold into the arithmetic instructions as memory operands.
Tried this augmented version, and it does OK. There are loads, but there is no register spilling, which is good.

Code: Select all

  ASM for:
  
  for (size_t i = 0; i < size; i += 4) {
    _mm_store_ps(out + i, tanpi4_blt(exp_mp(_mm_load_ps(in + i))));
  }
  
 movups      xmm3,xmmword ptr [rcx+rdx]  
 movaps      xmm4,xmm6  
 mulps       xmm3,xmm13  
 cvtps2dq    xmm5,xmm3  
 cvtdq2ps    xmm0,xmm5  
 pslld       xmm5,17h  
 subps       xmm3,xmm0  
 movaps      xmm2,xmm3  
 movaps      xmm1,xmm3  
 mulps       xmm1,xmm3  
 movaps      xmm0,xmm3  
 mulps       xmm2,xmm12  
 mulps       xmm3,xmm8  
 mulps       xmm0,xmm10  
 addps       xmm2,xmm11  
 addps       xmm3,xmm7  
 addps       xmm0,xmm9  
 mulps       xmm2,xmm1  
 addps       xmm2,xmm0  
 mulps       xmm2,xmm1  
 addps       xmm2,xmm3  
 paddd       xmm5,xmm2  
 cmpltps     xmm4,xmm5  
 movaps      xmm0,xmm4  
 andps       xmm0,xmmword ptr [__xmm@40000000400000004000000040000000 (07FF64F333620h)]  
 subps       xmm5,xmm0  
 movaps      xmm2,xmm5  
 mulps       xmm2,xmm5  
 movaps      xmm3,xmm2  
 movaps      xmm1,xmm2  
 mulps       xmm3,xmmword ptr [__xmm@3abb92563abb92563abb92563abb9256 (07FF64F333560h)]  
 movaps      xmm0,xmm2  
 mulps       xmm0,xmmword ptr [__xmm@3c2cbce53c2cbce53c2cbce53c2cbce5 (07FF64F333580h)]  
 mulps       xmm1,xmm2  
 addps       xmm3,xmmword ptr [__xmm@3aba7be43aba7be43aba7be43aba7be4 (07FF64F333550h)]  
 addps       xmm0,xmmword ptr [__xmm@3d229da63d229da63d229da63d229da6 (07FF64F333590h)]  
 mulps       xmm2,xmm15  
 mulps       xmm3,xmm1  
 addps       xmm2,xmm14  
 addps       xmm3,xmm0  
 movaps      xmm0,xmmword ptr [__xmm@80000000800000008000000080000000 (07FF64F333630h)]  
 andnps      xmm0,xmm4  
 mulps       xmm3,xmm1  
 movaps      xmm1,xmm6  
 addps       xmm3,xmm2  
 mulps       xmm3,xmm5  
 divps       xmm1,xmm3  
 andnps      xmm4,xmm3  
 andps       xmm1,xmm0  
 orps        xmm1,xmm4  

mystran wrote: Thu Aug 15, 2019 8:55 am Another thing with filter tuning is that typically what I need is something like:
tan(pi*min(0.49,baseTune*exp2(cv)))
Would be nice to have a good approximation for the whole thing directly.
I don't know of any formula that would allow a good approximation over a big range of values.

Post

2DaT wrote: Thu Aug 15, 2019 12:50 pm
mystran wrote: Thu Aug 15, 2019 8:31 am In my approximation above though, we specifically want reciprocal square root, so it's either rsqrtps+NR or sqrtps+divps. I'm too broken to benchmark it against yours (other than to note that mine is much less accurate at high frequencies), but figured I could still share.
It's not particularly efficient, because it uses taylor poly, but the idea is nice. :tu:
Actually, in this case I opted for Taylor over minimax specifically because I was only really interested in good accuracy at low frequencies (and really, you could probably drop a term and it'd still be good enough for oversampled filters)... but... yeah.
2DaT wrote: Thu Aug 15, 2019 12:50 pm
mystran wrote: Thu Aug 15, 2019 8:55 am Another thing with filter tuning is that typically what I need is something like:
tan(pi*min(0.49,baseTune*exp2(cv)))
Would be nice to have a good approximation for the whole thing directly.
I don't know of any formula that will allow good approximation over big range of values.
Yup. One can rewrite it as tan(pi*exp2(min(log2(0.49),cv+log2(baseTune)))) or something similar in order to move the range clipping out of the approximation, but that doesn't really help if exp2() relies on integer arithmetic in the exponent.
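
That rearrangement as code, for clarity (a sketch; exp2_ps() again stands in for some SIMD exp2 approximation):

Code: Select all

#include <math.h> // log2f

inline __m128 filter_tune_log2(__m128 cv, float baseTune) {
  __m128 e = _mm_add_ps(cv, _mm_set1_ps(log2f(baseTune)));
  e = _mm_min_ps(e, _mm_set1_ps(log2f(0.49f))); // clamp in the log2 domain
  return tanpi_blt(exp2_ps(e));                 // tan(pi * exp2(e))
}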

Post

2DaT wrote: Wed Aug 14, 2019 5:55 pm Since tan is frequently used for filters, and we don't need complicated range reduction, it seems like a good candidate to get additional perf.

Code: Select all

//Approximates tan(x*pi*0.25), x on (1e-34, 2-1e-7). Useful for BLT prewarp. 4-way parallel.
...
Some comments in the code would be nice.

BTW, a theoretical question: in the case of BLT prewarping, how accurate does the tan() approximation need to be (compared to std::tan)? Has any testing been done on this?

Edit:
Here are a few Padé approximations I did for tan(x*pi/4); the (5,5) version returns 2.85328203149421 and the (6,6) version 2.85329005907606 for the N frequency, compared to std::tan's 2.85329014753834.
Last edited by juha_p on Sun Aug 18, 2019 7:08 am, edited 5 times in total.

Post

juha_p wrote: Sat Aug 17, 2019 11:53 am BTW, a theoretical question: in the case of BLT prewarping, how accurate does the tan() approximation need to be (compared to std::tan)? Has any testing been done on this?
This is actually more of a practical than a theoretical question. Most of the time it's sufficient for the tuning to be good enough to fool the human ear, and additionally, for non-linear filters the actual tuning tends to be signal-dependent anyway. From a pure stability point of view, as long as your tuning coefficient doesn't hit infinity before the tuning frequency reaches Nyquist (or go negative before you hit DC), you're good to go.

In practice, relative error in the ballpark of 1e-4 (i.e. about 13 or 14 mantissa bits) is probably "good enough" for pretty much everything, although once you have something that can do this efficiently, it's usually straightforward to add some extra terms to your approximation if you want to push the error further down. With significantly larger error you might start to notice some tuning inconsistencies when playing melodies with a self-oscillating filter, or when trying to track specific harmonics... but you probably won't find any analog filters that track anywhere near this well.

Either way, note that you really want to look at relative error, because we're mostly interested in the log-domain (i.e. pitch) error, and minimaxing the frequency directly without weighting could result in rather poor tuning at low frequencies.
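
To put rough numbers on that (a back-of-envelope addition, not from the post): a relative frequency error eps corresponds to a pitch error of 1200*log2(1+eps) cents, so 1e-4 relative error is only about 0.17 cents:

Code: Select all

#include <cmath>
#include <cstdio>

int main() {
  const double eps[] = {1e-3, 1e-4, 1e-5};
  for (double e : eps) {
    // pitch error in cents for a relative frequency error e
    std::printf("rel err %.0e -> %.4f cents\n", e, 1200.0 * std::log2(1.0 + e));
  }
}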

Post

juha_p wrote: Sat Aug 17, 2019 11:53 am Some comments in the code would be nice.
Intrinsics are kinda scary to read, but the operations themselves are simple. Do you need comments on the algorithm?

Post

2DaT wrote: Sat Aug 17, 2019 9:48 pm
juha_p wrote: Sat Aug 17, 2019 11:53 am Some comments in the code would be nice.
Intrinsics are kinda scary to read, but the operations themselves are simple. Do you need comments on the algorithm?
That would be nice, since I don't fully get why you do what you do in the code (is it just a generic tan(x*pi/4) approximation, or is it designed for BLT prewarping)?

Post

juha_p wrote: Sat Aug 17, 2019 9:52 pm That would be nice, since I don't fully get why you do what you do in the code (is it just a generic tan(x*pi/4) approximation, or is it designed for BLT prewarping)?
Added some algorithm comments. It can be adapted to the full range, but it would be a bit less efficient. I've never needed a full-range tan for anything in DSP anyway.

Post

Thanks, that clarifies a lot.

Post

2DaT wrote: Thu Aug 15, 2019 12:50 pm I don't think that reading constants from memory is bad, since they'll be cached and the loads fold into the arithmetic instructions as memory operands.
Tried this augmented version, and it does OK. There are loads, but there is no register spilling, which is good.
Actually, given how few memory references the disassembly has, it's probably just a pure latency penalty here.

Post

mystran wrote: Sun Aug 18, 2019 8:29 pm
2DaT wrote: Thu Aug 15, 2019 12:50 pm I don't think that reading constants from memory is bad, since they'll be cached and the loads fold into the arithmetic instructions as memory operands.
Tried this augmented version, and it does OK. There are loads, but there is no register spilling, which is good.
Actually, given how few memory references the disassembly has, it's probably just a pure latency penalty here.
And it's the "good" kind of latency, because it doesn't add to the dependency chain. Any decent out-of-order CPU will issue those loads ahead of time, and by decent I mean anything better than a Pentium 4.
