## Accurate SSE2/FMA Tan for BLT Prewarp

DSP, Plug-in and Host development discussion.
2DaT
KVRist
377 posts since 21 Jun, 2013
Since tan is frequently used for filters, and we don't need complicated range reduction, it seems like a good candidate to get additional perf.

``````//Approximates tan(x*pi*0.25), x on (1e-34, 2-1e-7). Useful for BLT prewarp. 4-way parallel.
__m128 tanpi4_blt(__m128 x) {
    //Polynomial coeffs for tan(x*pi/4) on (-1,1), odd powers only, so it goes c0*x + c1*x^3 + ...
    //Remez relative error fit, ~6 ULPs
    const __m128 c0 = _mm_set1_ps(7.853980064e-01f);
    const __m128 c1 = _mm_set1_ps(1.615037620e-01f);
    const __m128 c2 = _mm_set1_ps(3.970112652e-02f);
    const __m128 c3 = _mm_set1_ps(1.054308284e-02f);
    const __m128 c4 = _mm_set1_ps(1.422759611e-03f);
    const __m128 c5 = _mm_set1_ps(1.431057928e-03f);

    const __m128 one = _mm_set1_ps(1.0f);
    const __m128 two = _mm_set1_ps(2.0f);
    const __m128 signmask = _mm_set1_ps(-0.0f);

    //We use this useful identity to compute on a finite range of values:
    // tan(x - pi/2) == 1 / -tan(x)
    //Since we compute tan(x*pi/4), the identity is tanpi4(x-2) == 1 / -tanpi4(x)

    //Compute the mask: use the identity (x > 1) or evaluate directly (x <= 1)
    __m128 inverse_mask = _mm_cmpgt_ps(x, one);
    //Subtract 2 for values greater than 1, forcing the argument into (-1,1)
    x = _mm_sub_ps(x, _mm_and_ps(inverse_mask, two));

    //Compute the polynomial for tan(x*pi/4)
    //Hybrid scheme for polynomial evaluation, which is slightly faster than Horner's
    __m128 f2 = _mm_mul_ps(x, x); //x^2
    __m128 f4 = _mm_mul_ps(f2, f2); //x^4
    __m128 p = _mm_add_ps(_mm_mul_ps(c5, f2), c4); //c5*x^2 + c4
    p = _mm_add_ps(_mm_mul_ps(p, f4), _mm_add_ps(_mm_mul_ps(c3, f2), c2)); //(c5*x^2 + c4)*x^4 + (c3*x^2 + c2)
    p = _mm_add_ps(_mm_mul_ps(p, f4), _mm_add_ps(_mm_mul_ps(c1, f2), c0)); //(...)*x^4 + (c1*x^2 + c0)
    p = _mm_mul_ps(p, x); //odd polynomial: multiply by x last

    //Compute the inverse for all values even if we don't need it, because it's a vector function
    __m128 inverse = _mm_div_ps(one, p);

    //Take the abs of the inverted value with and(~signmask, inverse), because the result is positive
    //Rearranged to reduce instruction dependence: clearing the sign bit of the mask itself
    //lets a single andps both select the lane and drop the sign
    __m128 abs_select = _mm_andnot_ps(signmask, inverse_mask);
    //Select the right value based on the mask
    return _mm_or_ps(_mm_and_ps(inverse, abs_select), _mm_andnot_ps(inverse_mask, p));
}
//Approximates tan(x*pi*0.25), x on (1e-34, 2-1e-7). Useful for BLT prewarp. FMA instructions. 8-way parallel.
__m256 tanpi4_blt_fma(__m256 x) {
    const __m256 c0 = _mm256_set1_ps(7.853980064e-01f);
    const __m256 c1 = _mm256_set1_ps(1.615037620e-01f);
    const __m256 c2 = _mm256_set1_ps(3.970112652e-02f);
    const __m256 c3 = _mm256_set1_ps(1.054308284e-02f);
    const __m256 c4 = _mm256_set1_ps(1.422759611e-03f);
    const __m256 c5 = _mm256_set1_ps(1.431057928e-03f);
    const __m256 one = _mm256_set1_ps(1.0f);
    const __m256 two = _mm256_set1_ps(2.0f);
    const __m256 signmask = _mm256_set1_ps(-0.0f);

    //Same range reduction as above: tanpi4(x-2) == 1 / -tanpi4(x)
    __m256 inverse_mask = _mm256_cmp_ps(x, one, _CMP_GT_OS);
    x = _mm256_sub_ps(x, _mm256_and_ps(inverse_mask, two));

    //Polynomial, with the mul+add pairs fused into FMAs
    __m256 f2 = _mm256_mul_ps(x, x);
    __m256 f4 = _mm256_mul_ps(f2, f2);
    __m256 p = _mm256_fmadd_ps(c5, f2, c4);
    p = _mm256_fmadd_ps(p, f4, _mm256_fmadd_ps(c3, f2, c2));
    p = _mm256_fmadd_ps(p, f4, _mm256_fmadd_ps(c1, f2, c0));
    p = _mm256_mul_ps(p, x);

    __m256 inverse = _mm256_div_ps(one, p);
    __m256 abs_select = _mm256_andnot_ps(signmask, inverse_mask);
    return _mm256_or_ps(_mm256_and_ps(inverse, abs_select), _mm256_andnot_ps(inverse_mask, p));
}``````
Accuracy:

``````[1e-34, 2-1e-7]
tanpi4_blt, Max error, ULPs : 6.00925
tanpi4_blt_fma, Max error, ULPs : 5.487955``````
Performance and comparison to msvc std::tan. CPU Cycles/value.

``````tanpi4_blt: 3.56934
tanpi4_blt_fma: 1.09766
msvc tan vec: 5.72974
msvc tan scalar: 29.979``````
Maybe it's possible to get even better results using a rational minimax fit, since we need a divide anyway. I might try this if there is any interest.
Last edited by 2DaT on Sat Aug 17, 2019 3:26 pm, edited 1 time in total.

mystran
KVRAF
5325 posts since 12 Feb, 2006 from Helsinki, Finland
For filter tuning, I actually like using Taylor expansion of sine (with 4 terms for 7th order below) and then computing sin/cos using the identity 1/cos = 1/sqrt(1-sin^2) which is made fast by using rsqrt with one Newton round:

``````static inline __m128 f4_rsqrt(const __m128 &x)
{
    const __m128 cn05 = _mm_set1_ps(0.5f);
    const __m128 cn15 = _mm_set1_ps(1.5f);

    __m128 y = _mm_rsqrt_ps(x);
    //One Newton-Raphson round: y * (1.5 - 0.5*x*y*y)
    return _mm_mul_ps(y, _mm_sub_ps(cn15,
        _mm_mul_ps(_mm_mul_ps(cn05, x), _mm_mul_ps(y, y))));
}

static inline __m128 f4_fastTan(const __m128 &x)
{
    const __m128 cn7 = _mm_set1_ps(1/5040.f);
    const __m128 cn5 = _mm_set1_ps(1/120.f);
    const __m128 cn3 = _mm_set1_ps(1/6.f);
    const __m128 cn1 = _mm_set1_ps(1.f);

    //7th-order Taylor sine: x*(1 - x^2*(1/6 - x^2*(1/120 - x^2/5040)))
    __m128 x2 = _mm_mul_ps(x, x);
    __m128 y = _mm_sub_ps(cn5, _mm_mul_ps(cn7, x2));
    y = _mm_sub_ps(cn3, _mm_mul_ps(x2, y));
    y = _mm_mul_ps(x, _mm_sub_ps(cn1, _mm_mul_ps(x2, y)));

    //tan = sin/cos = sin * rsqrt(1 - sin^2)
    __m128 r = _mm_sub_ps(cn1, _mm_mul_ps(y, y));
    return _mm_mul_ps(y, f4_rsqrt(r));
}
``````
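The Newton step in f4_rsqrt, y' = y*(1.5 - 0.5*x*y*y), roughly squares the relative error of the hardware estimate (~12 bits in, ~23 bits out). A scalar sketch of just that refinement (illustrative name; a deliberately off-by-0.2% guess stands in for the rsqrtps seed):

```c
#include <math.h>

/* One Newton-Raphson round for y ≈ 1/sqrt(x). */
static double rsqrt_newton(double x, double y) {
    return y * (1.5 - 0.5 * x * y * y);
}
```

A seed with relative error e comes out with error about 1.5*e^2, so a 0.2% guess lands within ~1e-5 after one round, and an exact seed is a fixed point.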
While this gets a little less accurate at high frequencies, it's still pretty decent in practice.

ps. I think I already shared this idea in the thread where I suggested a similar scheme for tanh(), by approximating sinh() and then computing 1/cosh() using rsqrt(1+sinh^2).
If you'd like Signaldust to return, please ask Katinka Tuisku to resign.

2DaT
KVRist
377 posts since 21 Jun, 2013
mystran wrote:
Wed Aug 14, 2019 1:07 pm
For filter tuning, I actually like using Taylor expansion of sine (with 4 terms for 7th order below) and then computing sin/cos using the identity 1/cos = 1/sqrt(1-sin^2) which is made fast by using rsqrt with one Newton round:
I don't think the whole rsqrtps/rcpps + NR thing is worth it on modern processors. In fact, LLVM tries to be smart and replaces my div intrinsic with rcp+NR and makes my code slower.
LLVM with strict math:

``````vcmpltps    ymm10,ymm0,ymm9
vandps      ymm11,ymm10,ymm1
vsubps      ymm9,ymm9,ymm11
vmulps      ymm11,ymm9,ymm9
vmovaps     ymm12,ymm3
vandps      ymm9,ymm9,ymm8
vmulps      ymm9,ymm12,ymm9
vdivps      ymm11,ymm0,ymm9
vblendvps   ymm9,ymm9,ymm11,ymm10  ``````
LLVM with fast math:

``````vcmpltps    ymm10,ymm0,ymm9
vandps      ymm11,ymm10,ymm1
vsubps      ymm9,ymm9,ymm11
vmulps      ymm11,ymm9,ymm9
vmovaps     ymm12,ymm3
vandps      ymm9,ymm9,ymm8
vmulps      ymm9,ymm12,ymm9
vrcpps      ymm11,ymm9
vmovaps     ymm12,ymm11
vblendvps   ymm9,ymm9,ymm12,ymm10 ``````

``````LLVM with fast math: 1.22485
LLVM with strict math: 1.05029``````
Modern CPUs allow divisions to be as fast as multiplies, as long as you don't divide too often. I guess optimal number would be like 5-10 arithmetic operations per divide (provided enough ILP of course).

mystran
KVRAF
5325 posts since 12 Feb, 2006 from Helsinki, Finland
2DaT wrote:
Wed Aug 14, 2019 4:04 pm
mystran wrote:
Wed Aug 14, 2019 1:07 pm
For filter tuning, I actually like using Taylor expansion of sine (with 4 terms for 7th order below) and then computing sin/cos using the identity 1/cos = 1/sqrt(1-sin^2) which is made fast by using rsqrt with one Newton round:
I don't think the whole rsqrtps/rcpps + NR thing is worth it on modern processors.
In my approximation above though, we specifically want reciprocal square root, so it's either rsqrtps+NR or sqrtps+divps. I'm too broken to benchmark it against yours (other than to note that mine is much less accurate at high frequencies), but figured I could still share.

mystran
KVRAF
5325 posts since 12 Feb, 2006 from Helsinki, Finland
Another thing with filter tuning is that what I usually need is something like:

tan(pi*min(0.49,baseTune*exp2(cv)))

The reason I mention this is that now you have the additional problem where putting it all into one loop results in high register pressure from all the constants, while splitting it into two loops means bouncing through memory. It would be nice to have a good approximation for the whole thing directly.

ps. the 0.49 being an arbitrary "slightly less than Nyquist" value that hardly matters

2DaT
KVRist
377 posts since 21 Jun, 2013
mystran wrote:
Thu Aug 15, 2019 12:31 am
In my approximation above though, we specifically want reciprocal square root, so it's either rsqrtps+NR or sqrtps+divps. I'm too broken to benchmark it against yours (other than to note that mine is much less accurate at high frequencies), but figured I could still share.
It's not particularly efficient, because it uses a Taylor poly, but the idea is nice. It's a shame that CPUs don't have a full-precision rsqrtps; perhaps the existing div/sqrt hardware could be used for it.
mystran wrote:
Thu Aug 15, 2019 12:55 am
The reason I mention this is that now you have the additional problem where putting it all into one loop results in high register pressure from all the constants, while splitting it into two loops means bouncing through memory.
I don't think that reading constants from memory is bad, since it will be cached and memory loads integrate with arithmetic.
Tried this augmented version, and it does OK. There are loads, but there is no register spilling which is good.

``````  ASM for:

for (size_t i = 0; i < size; i += 4) {
_mm_store_ps(out + i, tanpi4_blt(exp_mp(_mm_load_ps(in + i))));
}

movups      xmm3,xmmword ptr [rcx+rdx]
movaps      xmm4,xmm6
mulps       xmm3,xmm13
cvtps2dq    xmm5,xmm3
cvtdq2ps    xmm0,xmm5
pslld       xmm5,17h
subps       xmm3,xmm0
movaps      xmm2,xmm3
movaps      xmm1,xmm3
mulps       xmm1,xmm3
movaps      xmm0,xmm3
mulps       xmm2,xmm12
mulps       xmm3,xmm8
mulps       xmm0,xmm10
mulps       xmm2,xmm1
mulps       xmm2,xmm1
cmpltps     xmm4,xmm5
movaps      xmm0,xmm4
andps       xmm0,xmmword ptr [__xmm@40000000400000004000000040000000 (07FF64F333620h)]
subps       xmm5,xmm0
movaps      xmm2,xmm5
mulps       xmm2,xmm5
movaps      xmm3,xmm2
movaps      xmm1,xmm2
mulps       xmm3,xmmword ptr [__xmm@3abb92563abb92563abb92563abb9256 (07FF64F333560h)]
movaps      xmm0,xmm2
mulps       xmm0,xmmword ptr [__xmm@3c2cbce53c2cbce53c2cbce53c2cbce5 (07FF64F333580h)]
mulps       xmm1,xmm2
mulps       xmm2,xmm15
mulps       xmm3,xmm1
movaps      xmm0,xmmword ptr [__xmm@80000000800000008000000080000000 (07FF64F333630h)]
andnps      xmm0,xmm4
mulps       xmm3,xmm1
movaps      xmm1,xmm6
mulps       xmm3,xmm5
divps       xmm1,xmm3
andnps      xmm4,xmm3
andps       xmm1,xmm0
orps        xmm1,xmm4  ``````

mystran wrote:
Thu Aug 15, 2019 12:55 am
Another thing with filter tuning is that typically what I usually need is something like:
tan(pi*min(0.49,baseTune*exp2(cv)))
Would be nice to have a good approximation for the whole thing directly.
I don't know of any formula that will allow a good approximation over a big range of values.

mystran
KVRAF
5325 posts since 12 Feb, 2006 from Helsinki, Finland
2DaT wrote:
Thu Aug 15, 2019 4:50 am
mystran wrote:
Thu Aug 15, 2019 12:31 am
In my approximation above though, we specifically want reciprocal square root, so it's either rsqrtps+NR or sqrtps+divps. I'm too broken to benchmark it against yours (other than to note that mine is much less accurate at high frequencies), but figured I could still share.
It's not particularly efficient, because it uses taylor poly, but the idea is nice.
Actually, in this case I opted for Taylor over minimax specifically because I was only really interested in good accuracy at low frequencies (and really you could probably drop a term and it'd still be good enough for oversampled filters)... but ... yeah.
2DaT wrote:
Thu Aug 15, 2019 4:50 am
mystran wrote:
Thu Aug 15, 2019 12:55 am
Another thing with filter tuning is that typically what I usually need is something like:
tan(pi*min(0.49,baseTune*exp2(cv)))
Would be nice to have a good approximation for the whole thing directly.
I don't know of any formula that will allow a good approximation over a big range of values.
Yup. One can rewrite it as tan(pi*exp2(min(log2(0.49),cv+log2(baseTune)))) or something similar in order to move the range clipping out of the approximation, but that doesn't really help if exp2() relies on integer arithmetic in the exponent.

juha_p
KVRian
548 posts since 21 Feb, 2006 from FI
2DaT wrote:
Wed Aug 14, 2019 9:55 am
Since tan is frequently used for filters, and we don't need complicated range reduction, it seems like a good candidate to get additional perf.

``````//Approximates tan(x*pi*0.25), x on (1e-34, 2-1e-7). Useful for BLT prewarp. 4-way parallel.
...
``````
Some commenting in code would be nice.

BTW, a theoretical question: in the case of BLT prewarping, how accurate does the tan() approximation need to be (compared to std::tan) ... any testing done regarding this?

Edit:
Here are a few Padé approximations I did for tan(x*pi/4); the 5,5 version returns 2.85328203149421 and the 6,6 version 2.85329005907606 for frequency N, compared to std::tan's 2.85329014753834.
Last edited by juha_p on Sat Aug 17, 2019 11:08 pm, edited 5 times in total.

mystran
KVRAF
5325 posts since 12 Feb, 2006 from Helsinki, Finland
juha_p wrote:
Sat Aug 17, 2019 3:53 am
BTW, a theoretical question: in the case of BLT prewarping, how accurate does the tan() approximation need to be (compared to std::tan) ... any testing done regarding this?
This is actually more of a practical than a theoretical question. Most of the time it's sufficient for the tuning to be good enough to fool the human ear, and additionally for non-linear filters the actual tuning tends to be signal-dependent anyway. From a pure stability point of view, as long as your tuning coefficient doesn't hit infinity before the tuning frequency reaches Nyquist (or go negative before you hit DC), you're good to go.

In practice, relative error in the ballpark of 1e-4 (ie. about 13 or 14 mantissa bits) is probably "good enough" for pretty much everything, although once you have something that can do this efficiently, it's usually straightforward to add some extra terms to your approximation if you want to push the error further down. With significantly larger error you might start to notice some tuning inconsistencies when playing melodies with a self-oscillating filter, or when trying to track specific harmonics... but you probably won't find any analog filters that track anywhere near this well.

Either way, note that you really want to look at relative error, because we're mostly interested in the log-domain (ie. pitch) error, and minimaxing the frequency directly without weighting could result in rather poor tuning at low frequencies.

2DaT
KVRist
377 posts since 21 Jun, 2013
juha_p wrote:
Sat Aug 17, 2019 3:53 am
Some commenting in code would be nice.
Intrinsics are kinda scary to read, but the operations themselves are simple. Do you need comments on the algorithm?

juha_p
KVRian
548 posts since 21 Feb, 2006 from FI
2DaT wrote:
Sat Aug 17, 2019 1:48 pm
juha_p wrote:
Sat Aug 17, 2019 3:53 am
Some commenting in code would be nice.
Intrinsics are kinda scary to read, but the operations themselves are simple. Do you need comments on the algorithm?
That would be nice, since I don't fully get why you do what you do there in the code (is it just a generic tan(x*pi/4) approximation, or is it designed for BLT prewarping)?

2DaT
KVRist
377 posts since 21 Jun, 2013
juha_p wrote:
Sat Aug 17, 2019 1:52 pm
That would be nice, since I don't fully get why you do what you do there in the code (is it just a generic tan(x*pi/4) approximation, or is it designed for BLT prewarping)?
Added some algorithm comments. It can be adapted to the full range, but it would be a bit less efficient. I've never needed a full-range tan for anything in DSP anyway.

juha_p
KVRian
548 posts since 21 Feb, 2006 from FI
Thanks, that clarifies a lot.

mystran
KVRAF
5325 posts since 12 Feb, 2006 from Helsinki, Finland
2DaT wrote:
Thu Aug 15, 2019 4:50 am
I don't think that reading constants from memory is bad, since it will be cached and memory loads integrate with arithmetic.
Tried this augmented version, and it does OK. There are loads, but there is no register spilling which is good.
Actually given how few memory references the disassembly actually has, it's probably just a pure latency penalty here.

2DaT
KVRist
377 posts since 21 Jun, 2013
mystran wrote:
Sun Aug 18, 2019 12:29 pm
2DaT wrote:
Thu Aug 15, 2019 4:50 am
I don't think that reading constants from memory is bad, since it will be cached and memory loads integrate with arithmetic.
Tried this augmented version, and it does OK. There are loads, but there is no register spilling which is good.
Actually given how few memory references the disassembly actually has, it's probably just a pure latency penalty here.
And it's the "good" kind of latency, because it doesn't add to the dependency chain. Any decent out-of-order CPU will issue those loads ahead of time, and by decent I mean anything better than a Pentium 4.