First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?

mystran
 KVRAF
 5166 posts since 12 Feb, 2006 from Helsinki, Finland
Re: First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?
For performance-sensitive applications, you might want to implement some of the "expensive" math functions manually anyway, since this allows you to fine-tune the trade-off between accuracy and performance.
In DSP applications, you might even want up to three different approximations for exp(): one that is low-accuracy for general purpose modulation mapping (e.g. for exponential gain modulation and such it's probably enough to be accurate to within 0.1 dB or so), one that is accurate to within about 1 cent for tuning purposes, and then maybe one more that's accurate to within a few ulp for things like Newton iteration, where trying to save on cycles can hurt convergence (which quickly ends up costing more than just spending the extra cycles if it forces you to do extra iterations).
If you'd like Signaldust to return, please ask Katinka Tuisku to resign.

Nowhk
 KVRian
 864 posts since 2 Oct, 2013
Re: First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?
mystran wrote: ↑Fri Dec 21, 2018 3:07 am
For performance-sensitive applications, you might want to implement some of the "expensive" math functions manually anyway, since this allows you to fine-tune the trade-off between accuracy and performance.
Right now I need it to retrieve a pitch multiplier, with the range being modulated (at audio rate) between -48.0 and 48.0: exp(value * ln2per12). What do you suggest?

mystran
 KVRAF
 5166 posts since 12 Feb, 2006 from Helsinki, Finland
Re: First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?
There was just very recently a thread about exp() approximations, maybe see if there is something useful there.
Anyway, the basic idea is to convert to base 2 (or just approximate exp2() directly), then split the argument into fractional and integer parts. You approximate the fractional part using a minimax polynomial or something (check that thread for ideas). For the integer part, you reinterpret the floating point bit pattern of the single-octave approximation's result as an integer and use an integer add to adjust the exponent bits directly (which is why you always want to use base 2 internally).
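To illustrate, here is a rough scalar sketch of that scheme (my own code, not from any particular library; the degree-3 coefficients are plain Taylor terms of 2^f, so a proper minimax fit would do better at the same degree):

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Sketch: reduce to exp2, split into integer and fractional parts,
// approximate 2^f on [-0.5, 0.5] with a short polynomial, then add the
// integer part straight into the exponent bits of the result.
inline float exp2_sketch(float x)
{
    const int i = (int)std::lrintf(x);   // round to nearest: f in [-0.5, 0.5]
    const float f = x - (float)i;

    // degree-3 Taylor terms for 2^f (illustrative accuracy only)
    const float c1 = 0.69314718f;        // ln 2
    const float c2 = 0.24022651f;        // (ln 2)^2 / 2
    const float c3 = 0.05550411f;        // (ln 2)^3 / 6
    float p = ((c3 * f + c2) * f + c1) * f + 1.0f;

    // scale by 2^i by adding i directly to the biased exponent field
    uint32_t bits;
    std::memcpy(&bits, &p, sizeof bits);
    bits += (uint32_t)i << 23;
    std::memcpy(&p, &bits, sizeof p);
    return p;
}
```

exp(x) then falls out as exp2_sketch(x * 1.4426950f), and a vector version replaces lrintf/memcpy with the _mm_cvtps_epi32 and cast intrinsics.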
If you'd like Signaldust to return, please ask Katinka Tuisku to resign.

Nowhk
 KVRian
 864 posts since 2 Oct, 2013
Re: First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?
mystran wrote: ↑Sat Dec 22, 2018 11:58 am
There was just very recently a thread about exp() approximations, maybe see if there is something useful there.
Anyway, the basic idea is to convert to base 2 (or just approximate exp2() directly), then split the argument into fractional and integer parts.
Didn't find the recent thread about exp(). However, someone showed me some of these fancy approximations (which, I hope, could also interest you all), using only SSE2 (and FMA) intrinsics:
Code: Select all
__m128d fastExp1(__m128d x) {
    // (1 + x/256)^256, via eight repeated squarings
    __m128d ret = _mm_mul_pd(_mm_set1_pd(1.0 / 256), x);
    ret = _mm_add_pd(_mm_set1_pd(1.0), ret);
    ret = _mm_mul_pd(ret, ret);
    ret = _mm_mul_pd(ret, ret);
    ret = _mm_mul_pd(ret, ret);
    ret = _mm_mul_pd(ret, ret);
    ret = _mm_mul_pd(ret, ret);
    ret = _mm_mul_pd(ret, ret);
    ret = _mm_mul_pd(ret, ret);
    ret = _mm_mul_pd(ret, ret);
    return ret;
}
__m128d fastExp2(__m128d x) {
    // degree-7 Taylor series of exp(x) around 0
    const __m128d a0 = _mm_set1_pd(1.0);
    const __m128d a1 = _mm_set1_pd(1.0);
    const __m128d a2 = _mm_set1_pd(1.0 / 2);
    const __m128d a3 = _mm_set1_pd(1.0 / 2 / 3);
    const __m128d a4 = _mm_set1_pd(1.0 / 2 / 3 / 4);
    const __m128d a5 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5);
    const __m128d a6 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5 / 6);
    const __m128d a7 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5 / 6 / 7);
    __m128d ret = _mm_fmadd_pd(a7, x, a6);
    ret = _mm_fmadd_pd(ret, x, a5);
    // If the FMA extension is not present, use
    // ret = _mm_add_pd(_mm_mul_pd(ret, x), a5);
    ret = _mm_fmadd_pd(ret, x, a4);
    ret = _mm_fmadd_pd(ret, x, a3);
    ret = _mm_fmadd_pd(ret, x, a2);
    ret = _mm_fmadd_pd(ret, x, a1);
    ret = _mm_fmadd_pd(ret, x, a0);
    return ret;
}
__m128d fastExp3(__m128d x) {
    // Schraudolph-style: shift the scaled argument into the exponent bits
    const __m128d a = _mm_set1_pd(1.0 / M_LN2);
    const __m128d b = _mm_set1_pd(3 * 1024.0 - 1.05);
    __m128d t = _mm_fmadd_pd(x, a, b);
    return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(t), 11));
}
__m128d fastExp5(__m128d x) {
    // exp(x) = exp(x/2) / exp(-x/2), each half Schraudolph-style
    const __m128d ap = _mm_set1_pd(0.5 / M_LN2);
    const __m128d an = _mm_set1_pd(-0.5 / M_LN2);
    const __m128d b = _mm_set1_pd(3 * 1024.0 - 1.05);
    __m128d tp = _mm_fmadd_pd(x, ap, b);
    __m128d tn = _mm_fmadd_pd(x, an, b);
    tp = _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(tp), 11));
    tn = _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(tn), 11));
    return _mm_div_pd(tp, tn);
}
Which is, as written above: values ranging from -48.0 to 48.0, thus exp(value * ln2per12).

mystran
 KVRAF
 5166 posts since 12 Feb, 2006 from Helsinki, Finland
Re: First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?
Nowhk wrote: ↑Tue Jan 15, 2019 2:28 am
Didn't find the recent thread about exp(). Which one, in your opinion, is the best deal (accuracy, performance and stability) for my target?
viewtopic.php?f=33&t=510794
My opinion on this is that you should probably measure them and form your own opinion.
If you'd like Signaldust to return, please ask Katinka Tuisku to resign.

Nowhk
 KVRian
 864 posts since 2 Oct, 2013
Re: First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?
Thanks
Tried! It seems that none of them is good enough: https://rextester.com/UDT46215 (regardless of the performance, which I haven't considered yet).
Here are the biggest errors I get on some values, over a range of -48 to 48 with a step of 0.1, compared to the "scalar" exp() function:
Code: Select all
fastExp1 error: 0.2367371124568560247780624195002019405364990234375
fastExp2 error: 0.1230974105602360424427388352341949939727783203125
fastExp3 error: 0.4000000000014534151659972849301993846893310546875
fastExp5 error: 0.2714721354081657267443006276153028011322021484375
I need another approx I believe.
If I use ippsExp_64f_I, forcing IPP to only use SSE2, I reach a max error of 3.552713678800500929355621337890625e-15 compared to the scalar exp() version.
Quite impressive. The downside is that with IPP I can only compute over arrays, not packed __m128d values (which is what I need right now).
I'll have a look at the KVR link above and report back.

mystran
 KVRAF
 5166 posts since 12 Feb, 2006 from Helsinki, Finland
Re: First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?
Normally, after converting to base 2, you want to split the argument into integer and fractional parts, approximate only the fractional part, and then add the integer part into the floating point exponent of the result directly. If you look at the actual errors of the approximations, you'll notice that they are much more accurate over a limited range, typically either [-0.5, 0.5] or [0, 1], and you only want to use the actual approximation over that range.
If you'd like Signaldust to return, please ask Katinka Tuisku to resign.

2DaT
 KVRist
 344 posts since 21 Jun, 2013
Re: First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?
Nowhk wrote: ↑Tue Jan 15, 2019 7:25 am
Tried! It seems that none of them is good enough: https://rextester.com/UDT46215 (regardless of the performance, which I haven't considered yet).
Here are the biggest errors I get on some values, over a range of -48 to 48 with a step of 0.1, compared to the "scalar" exp() function:
Code: Select all
fastExp1 error: 0.2367371124568560247780624195002019405364990234375
fastExp2 error: 0.1230974105602360424427388352341949939727783203125
fastExp3 error: 0.4000000000014534151659972849301993846893310546875
fastExp5 error: 0.2714721354081657267443006276153028011322021484375
fastExp2 seems the best, even though a drift of ~0.12 for pitch is TOO much. I need another approx, I believe.
Remember: FMA comes with AVX2 and is available only on modern processors. Your listed methods are trash. Use this (for FMA):
Works OK for the range [-60, 60]. No out-of-range/denormal/NaN checking, so use at your own risk.
Code: Select all
__m128 exp_mp2(__m128 x)
{
    // [-0.5, 0.5] 2^x approx polynomial, ~2.4 ulp
    const __m128 c0 = _mm_set1_ps(1.000000119e+00f);
    const __m128 c1 = _mm_set1_ps(6.931469440e-01f);
    const __m128 c2 = _mm_set1_ps(2.402212024e-01f);
    const __m128 c3 = _mm_set1_ps(5.550713092e-02f);
    const __m128 c4 = _mm_set1_ps(9.675540961e-03f);
    const __m128 c5 = _mm_set1_ps(1.327647245e-03f);
    const __m128 l2e = _mm_set1_ps(1.442695041f); // log2(e)
    __m128 f = _mm_mul_ps(x, l2e);
    __m128i i = _mm_cvtps_epi32(f);          // round to nearest
    f = _mm_sub_ps(f, _mm_cvtepi32_ps(i));   // fractional part in [-0.5, 0.5]
    __m128 p = c5;
    p = _mm_fmadd_ps(p, f, c4);
    p = _mm_fmadd_ps(p, f, c3);
    p = _mm_fmadd_ps(p, f, c2);
    p = _mm_fmadd_ps(p, f, c1);
    p = _mm_fmadd_ps(p, f, c0);
    i = _mm_slli_epi32(i, 23);               // integer part into the exponent
    return _mm_castsi128_ps(_mm_add_epi32(i, _mm_castps_si128(p)));
}

mystran
 KVRAF
 5166 posts since 12 Feb, 2006 from Helsinki, Finland
Re: First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?
That range sounds a little conservative, given that single precision has an 8-bit exponent, so about twice (well, I guess 1.4 times, given that it's base e) that range should be perfectly fine, or am I missing some subtle detail here? I mean, as long as the result is a "normal" number, it should work, right?
Also, if the integer part is negative, then don't you end up with the sign bit set unless you mask it out explicitly, or is there some cancellation here that I just can't see?
edit: just to clarify, I don't necessarily doubt its correctness, I'm just having trouble figuring out how it's supposed to work, so please enlighten me
edit2: oh... I think I get it now. The exponent bias means we have bit 7 set, so as long as the negative value results in a carry from bit 7, it'll overflow into the sign bit and cancel out the sign of the integer part, saving the bitmask at the cost of losing just a tiny bit of range (or does it even lose any?)... yeah, that seems pretty clever, somehow I never thought about that
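The carry trick is easy to convince yourself of with a couple of scalar probes (my own sketch, not from the thread; memcpy is used for the bit reinterpretation to dodge aliasing issues):

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Scales a normal float p by 2^i by integer-adding i into the exponent
// field. For negative i the add borrows through the sign bit, which
// cancels the sign of (i << 23), so no masking is needed as long as the
// result stays a normal number.
inline float scale_pow2(float p, int i)
{
    uint32_t bits;
    std::memcpy(&bits, &p, sizeof bits);
    bits += (uint32_t)i << 23;  // two's-complement add, mod 2^32
    std::memcpy(&p, &bits, sizeof p);
    return p;
}
```

For example, scale_pow2(1.5f, -4) gives 0.09375f with the sign bit clear, even though (-4) << 23 on its own has the sign bit set.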
Last edited by mystran on Tue Jan 15, 2019 6:07 pm, edited 1 time in total.
If you'd like Signaldust to return, please ask Katinka Tuisku to resign.

2DaT
 KVRist
 344 posts since 21 Jun, 2013
Re: First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?
mystran wrote: ↑Tue Jan 15, 2019 4:33 pm
That range sounds a little conservative, given that single precision has an 8-bit exponent, so about twice (well, I guess 1.4 times, given that it's base e) that range should be perfectly fine, or am I missing some subtle detail here?
I checked it, and actually the range of values which does not produce garbage is [-87, 88].
mystran wrote: ↑Tue Jan 15, 2019 4:33 pm
Also, if the integer part is negative, then don't you end up with the sign bit set unless you mask it out explicitly, or is there some cancellation here that I just can't see?
There is a left logical shift (shifting in zeroes); it boils down to the addition of two 9-bit two's-complement numbers. 2^x takes values ~[0.7, 1.4], i.e. biased floating point exponents 126-127. Unless you add a big negative exponent, it won't go negative.

mystran
 KVRAF
 5166 posts since 12 Feb, 2006 from Helsinki, Finland
Re: First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?
2DaT wrote: ↑Tue Jan 15, 2019 6:06 pm
There is a left logical shift (shifting in zeroes); it boils down to the addition of two 9-bit two's-complement numbers. Unless you add a big negative exponent, it won't go negative.
Yeah, I pretty much figured it out right before you managed to answer.
edit: oh, and left shifts are kinda always logical, unless you plan on triggering overflow traps (which x86 doesn't support anyway; logical and arithmetic left shift are literally the same binary encoding)
edit2: I don't think you actually even lose any range; you might get negative (or otherwise wrong) values instead of denormals or infinity, but any normal results should still be correct, I think
If you'd like Signaldust to return, please ask Katinka Tuisku to resign.

juha_p
 KVRian
 521 posts since 21 Feb, 2006 from FI
Re: First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?
Nowhk wrote: ↑Tue Jan 15, 2019 7:25 am
...
Here are the biggest errors I get on some values, over a range of -48 to 48 with a step of 0.1, compared to the "scalar" exp() function:
Code: Select all
fastExp1 error: 0.2367371124568560247780624195002019405364990234375
fastExp2 error: 0.1230974105602360424427388352341949939727783203125
fastExp3 error: 0.4000000000014534151659972849301993846893310546875
fastExp5 error: 0.2714721354081657267443006276153028011322021484375
I need another approx, I believe.
...
Wasn't your needed range just between ±48.0 * (ln(2)/12), i.e. about ±2.77?
These implementations (as SSE versions) are usually faster than std::exp() and can be about as accurate as well:
(Padé) - https://root.cern.ch/doc/v606/exp_8h_source.html
(Remez, Taylor) - http://www.chokkan.org/software/dist/fastexp.c.html
There's also this nice Padé-related function ... - https://math.stackexchange.com/a/2196435
Last edited by juha_p on Wed Jan 16, 2019 1:38 am, edited 1 time in total.

Kraku
 KVRian
 1404 posts since 13 Oct, 2003 from Prague, Czech Republic
Re: First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?
You can try these 2^x and e^x functions and evaluate whether some of them have high enough accuracy for you. I found them in my archives. It should be possible to convert them into SIMD/intrinsics versions.
// 2^X
inline float Fast_Pow2X(const float x)
{
    // [-0.5, 0.5] 2^x approx polynomial, ~2.4 ulp
    constexpr float c0 = 1.000000119e+00f;
    constexpr float c1 = 6.931469440e-01f;
    constexpr float c2 = 2.402212024e-01f;
    constexpr float c3 = 5.550713092e-02f;
    constexpr float c4 = 9.675540961e-03f;
    constexpr float c5 = 1.327647245e-03f;
    const int32_t i = int32_t(std::lrintf(x)); // i = round(x)
    const float f = x - float(i);              // f = x - i, in [-0.5, 0.5]
    // Estrin-Horner evaluation scheme
    const float f2 = f * f;
    float p = c5 * f + c4;
    p *= f2;
    p += c3 * f + c2;
    p *= f2;
    p += c1 * f + c0;
    // (2^i)*(2^f): add i directly into the floating point exponent
    const uint32_t temp2 = (i << 23) + *((uint32_t*) &p);
    return *((float*) &temp2);
}
// e^X
inline float Fast_Exp(const float x)
{
    constexpr float scale = 1.442695041f; // log2(e)
    return Fast_Pow2X(scale * x);
}
// 2^X
static inline float Fast_Pow2(float p)
{
    float offset = (p < 0) ? 1.0f : 0.0f;
    float clip_p = (p < -126) ? -126.0f : p;
    int32_t w = (int32_t) clip_p;
    float z = clip_p - (float) w + offset;
    union uint_float
    {
        uint32_t i;
        float f;
    };
    const uint_float v = { (uint32_t) ((1 << 23) * (clip_p + 121.2740575f + 27.7280233f / (4.84252568f - z) - 1.49012907f * z)) };
    return v.f;
}
// 2^X - 1
static inline double Fast_Pow2_Minus_1(double x)
{
    return 0.15639e-16 + (0.69314718055995154416 + (0.24022650695869054994 +
        (0.55504108675285271942e-1 + (0.96181289721472527028e-2 +
        (0.13333568212100120656e-2 + (0.15403075872034576440e-3 +
        (0.15265399313676342402e-4 + (0.13003468358428470167e-5 +
        0.12113766044841794408e-6 * x) * x) * x) * x) * x) * x) * x) * x) * x;
}

Nowhk
 KVRian
 864 posts since 2 Oct, 2013
Re: First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?
Yes of course, but it can be converted easily to plain SSE2; just look at the comment inside the code:
Code: Select all
ret = _mm_fmadd_pd(ret, x, a5);
// If the FMA extension is not present, use
// ret = _mm_add_pd(_mm_mul_pd(ret, x), a5);
Oh thanks! Do you have the same code "on the fly" for __m128d (i.e. double)? Otherwise I'd need to manually refactor it from __m128 to __m128d...
Yes, correct! But if you look at the code, it already multiplies by ln2per12, so the errors are actually for that range.
I'll check it out now, thanks!juha_p wrote: ↑Wed Jan 16, 2019 12:09 amThese implementations (as SSE) versions are usually faster than std::exp() and can be about as accurate aswell:
(Pade)  https://root.cern.ch/doc/v606/exp_8h_source.html
(Remez, Taylor)  http://www.chokkan.org/software/dist/fastexp.c.html
There's also this nice Pade related function ...  https://math.stackexchange.com/a/2196435

Nowhk
 KVRian
 864 posts since 2 Oct, 2013
Re: First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?
Awesome...
I've tried this, and in that range the max error is 1.7763568394002504646778106689453125e-15
It seems perfect to calculate a pitch multiplier!!! Thanks!!!
Do you already have the SSE2 version?
Note: in this worst case (since that is the max error), how much will a sin(freq * pitchMultiplier) drift?
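A back-of-the-envelope way to answer that last question (my own sketch, not from the thread): an absolute error d in a pitch multiplier m detunes the resulting frequency by 1200 * log2(1 + d/m) cents.

```cpp
#include <cassert>
#include <cmath>

// Detune, in cents, caused by an absolute error abs_err in a pitch
// multiplier mult (both assumed positive, abs_err much smaller than mult).
inline double detune_cents(double mult, double abs_err)
{
    return 1200.0 * std::log2(1.0 + abs_err / mult);
}
```

Even against the smallest multiplier in the ±48-semitone range (2^-4 = 0.0625), an absolute error of ~1.78e-15 works out to on the order of 1e-10 cents, far below anything audible.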