First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?

DSP, Plugin and Host development discussion.
Post

For performance-sensitive applications, you might want to implement some of the "expensive" math functions manually anyway, since this lets you fine-tune the trade-off between accuracy and performance.

In DSP applications, you might even want up to three different approximations for exp(): a low-accuracy one for general-purpose modulation mapping (e.g. for exponential gain modulation and such it's probably enough to be accurate to within 0.1 dB or so), one accurate to within about 1 cent for tuning purposes, and maybe one more that's accurate to within a few ulp for things like Newton iteration, where trying to save on cycles can hurt convergence (which quickly ends up costing more than just spending the extra cycles, if spending them saves you from having to do extra iterations).
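
(For scale: 0.1 dB corresponds to a relative gain error of about 1.2%, since 10^(0.1/20) ≈ 1.0116, and 1 cent corresponds to a relative frequency error of about 0.06%, since 2^(1/1200) ≈ 1.000578.)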

Post

mystran wrote: Fri Dec 21, 2018 11:07 am For performance-sensitive applications, you might want to implement some of the "expensive" math functions manually anyway [...] maybe one more that's accurate to within a few ulp for things like Newton iteration [...]
Right now I need it to compute a pitch multiplier, with the range being modulated (at audio rate) between -48.0 and 48.0: exp(value * ln2per12). What do you suggest?

Post

There was just very recently a thread about exp() approximations; maybe see if there is something useful there.

Anyway, the basic idea is to convert to base 2 (or just approximate exp2() directly), then split the argument into fractional and integer parts. You approximate the fractional part using a minimax polynomial or something similar (check that thread for ideas). For the integer part, you reinterpret the resulting floating-point bit pattern (from the single-octave approximation) as an integer and use an integer add to adjust the exponent bits directly (which is why you always want to work in base 2 internally).
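
In scalar code the skeleton looks something like this (an illustrative sketch only: it assumes the result stays a normal number, and the polynomial is a truncated Taylor series rather than a properly fitted minimax polynomial):

Code: Select all

#include <cmath>
#include <cstdint>
#include <cstring>

float exp2_sketch(float x) {
   int i = (int)lrintf(x);       // integer part, round to nearest
   float f = x - (float)i;       // fractional part, in [-0.5, 0.5]
   // 2^f on [-0.5, 0.5]; truncated Taylor series here, a real
   // implementation would use properly fitted minimax coefficients
   float p = 1.0f + f * (0.693147f + f * (0.240227f
           + f * (0.0555041f + f * 0.00961813f)));
   uint32_t bits;
   std::memcpy(&bits, &p, sizeof bits);
   bits += (uint32_t)i << 23;    // add i directly to the exponent field
   std::memcpy(&p, &bits, sizeof bits);
   return p;
}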

Post

mystran wrote: Sat Dec 22, 2018 7:58 pm There was just very recently a thread about exp() approximations; maybe see if there is something useful there. [...]
Didn't find the recent thread about exp().

However, someone showed me some of these fancy approximations (which, I hope, might also interest you all), using only SSE2 (and FMA) intrinsics:

Code: Select all

__m128d fastExp1(__m128d x) {
   // e^x ~= (1 + x/256)^256, via 8 successive squarings
   __m128d ret = _mm_mul_pd(_mm_set1_pd(1.0 / 256), x);
   ret = _mm_add_pd(_mm_set1_pd(1.0), ret);
   ret = _mm_mul_pd(ret, ret);
   ret = _mm_mul_pd(ret, ret);
   ret = _mm_mul_pd(ret, ret);
   ret = _mm_mul_pd(ret, ret);
   ret = _mm_mul_pd(ret, ret);
   ret = _mm_mul_pd(ret, ret);
   ret = _mm_mul_pd(ret, ret);
   ret = _mm_mul_pd(ret, ret);
   return ret;
}

__m128d fastExp2(__m128d x) {
   // degree-7 Taylor polynomial of e^x (coefficients 1/k!), Horner scheme
   const __m128d a0 = _mm_set1_pd(1.0);
   const __m128d a1 = _mm_set1_pd(1.0);
   const __m128d a2 = _mm_set1_pd(1.0 / 2);
   const __m128d a3 = _mm_set1_pd(1.0 / 2 / 3);
   const __m128d a4 = _mm_set1_pd(1.0 / 2 / 3 / 4);
   const __m128d a5 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5);
   const __m128d a6 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5 / 6);
   const __m128d a7 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5 / 6 / 7);

   __m128d ret = _mm_fmadd_pd(a7, x, a6);
   ret = _mm_fmadd_pd(ret, x, a5);
   // If the FMA extension is not present, use
   // ret = _mm_add_pd(_mm_mul_pd(ret, x), a5);
   ret = _mm_fmadd_pd(ret, x, a4);
   ret = _mm_fmadd_pd(ret, x, a3);
   ret = _mm_fmadd_pd(ret, x, a2);
   ret = _mm_fmadd_pd(ret, x, a1);
   ret = _mm_fmadd_pd(ret, x, a0);
   return ret;
}

__m128d fastExp3(__m128d x) {
   // Schraudolph-style bit trick: builds the result's exponent (and a rough
   // mantissa) directly from x/ln(2) by shifting the bit pattern
   const __m128d a = _mm_set1_pd(1.0 / M_LN2);
   const __m128d b = _mm_set1_pd(3 * 1024.0 - 1.05);
   __m128d t = _mm_fmadd_pd(x, a, b);
   return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(t), 11));
}

__m128d fastExp5(__m128d x) {
   // same bit trick as fastExp3, but as a ratio: e^x = e^(x/2) / e^(-x/2)
   const __m128d ap = _mm_set1_pd(0.5 / M_LN2);
   const __m128d an = _mm_set1_pd(-0.5 / M_LN2);
   const __m128d b = _mm_set1_pd(3 * 1024.0 - 1.05);
   __m128d tp = _mm_fmadd_pd(x, ap, b);
   __m128d tn = _mm_fmadd_pd(x, an, b);
   tp = _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(tp), 11));
   tn = _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(tn), 11));
   return _mm_div_pd(tp, tn);
}
Which one, in your opinion, is the best deal (accuracy, performance and stability) for my target?
Which is, as written above: values from -48.0 to 48.0, i.e. exp(value * ln2per12).

Post

Nowhk wrote: Tue Jan 15, 2019 10:28 am Didn't find the recent thread about exp().
viewtopic.php?f=33&t=510794
Nowhk wrote: Tue Jan 15, 2019 10:28 am Which one, in your opinion, is the best deal (accuracy, performance and stability) for my target?
My opinion on this is that you should probably measure them and form your own opinion.

Post

mystran wrote: Tue Jan 15, 2019 11:33 am viewtopic.php?f=33&t=510794
Thanks :tu:
mystran wrote: Tue Jan 15, 2019 11:33 am My opinion on this is that you should probably measure them and form your own opinion.
Tried! It seems that none of them is good enough: https://rextester.com/UDT46215 (performance aside; I haven't considered it yet).

Here are the biggest errors I get on some values, over a range of -48 to 48 with a step of 0.1, compared to the "scalar" exp() function:

Code: Select all

fastExp1 error: 0.2367371124568560247780624195002019405364990234375
fastExp2 error: 0.1230974105602360424427388352341949939727783203125
fastExp3 error: 0.4000000000014534151659972849301993846893310546875
fastExp5 error: 0.2714721354081657267443006276153028011322021484375
fastExp2 seems the best, even though a drift of ~0.12 for pitch is TOO much.
I believe I need another approximation.

If I use ippsExp_64f_I, forcing IPP to use only SSE2, I reach a max error of 3.552713678800500929355621337890625e-15 compared to the scalar exp() version :idea:
Quite impressive. The downside is that IPP can only process arrays, not packed __m128d values (which is what I need right now).

I'll have a look at the KVR link above and report back 8)

Post

Normally, after converting to base 2, you want to split the argument into integer and fractional parts, approximate only the fractional part, and then add the integer part directly into the floating-point exponent of the result. If you look at the actual errors of the approximations, you'll notice that they are much more accurate over a limited range, typically either [-0.5, 0.5] or [0, 1], and you only want to use the actual approximation over that range.
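
For instance, a quick-and-dirty check like this (a hypothetical harness; it assumes fastExp2 from the post above is in scope) would show how much better fastExp2 does when restricted to [-0.5, 0.5]:

Code: Select all

#include <emmintrin.h>
#include <algorithm>
#include <cmath>
#include <cstdio>

// measure max relative error of fastExp2 on [-0.5, 0.5] only
int main() {
   double maxErr = 0.0;
   for (double v = -0.5; v <= 0.5; v += 1e-4) {
      double approx = _mm_cvtsd_f64(fastExp2(_mm_set1_pd(v)));
      maxErr = std::max(maxErr, std::fabs(approx / std::exp(v) - 1.0));
   }
   std::printf("max relative error on [-0.5, 0.5]: %g\n", maxErr);
}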

Post

Nowhk wrote: Tue Jan 15, 2019 3:25 pm Tried! It seems that none of them is good enough: https://rextester.com/UDT46215 [...] fastExp2 seems the best, even though a drift of ~0.12 for pitch is TOO much. I believe I need another approximation.
Remember: FMA arrived alongside AVX2, so it's available only on modern processors.
Your listed methods are trash. Use this (for FMA):
Works OK for range [-60,60]. No out-of-range/denormal/NaN checking, so use at your own risk.

Code: Select all

__m128 exp_mp2(__m128 x)
{
	// [-0.5,0.5] 2^x approx polynomial ~ 2.4 ulp
	const __m128 c0 = _mm_set1_ps(1.000000119e+00f);
	const __m128 c1 = _mm_set1_ps(6.931469440e-01f);
	const __m128 c2 = _mm_set1_ps(2.402212024e-01f);
	const __m128 c3 = _mm_set1_ps(5.550713092e-02f);
	const __m128 c4 = _mm_set1_ps(9.675540961e-03f);
	const __m128 c5 = _mm_set1_ps(1.327647245e-03f);
	const __m128 l2e = _mm_set1_ps(1.442695041f);   // log2(e)

	__m128 f = _mm_mul_ps(x, l2e);           // f = x * log2(e), so e^x = 2^f
	__m128i i = _mm_cvtps_epi32(f);          // i = round(f)
	f = _mm_sub_ps(f, _mm_cvtepi32_ps(i));   // f -= i, now f in [-0.5, 0.5]
	__m128 p = c5;                           // evaluate 2^f polynomial (Horner)
	p = _mm_fmadd_ps(p, f, c4);
	p = _mm_fmadd_ps(p, f, c3);
	p = _mm_fmadd_ps(p, f, c2);
	p = _mm_fmadd_ps(p, f, c1);
	p = _mm_fmadd_ps(p, f, c0);
	i = _mm_slli_epi32(i, 23);               // move i into the exponent field
	return _mm_castsi128_ps(_mm_add_epi32(i, _mm_castps_si128(p)));  // 2^i * 2^f
}
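
For your case, usage would be something like this (with semitones standing in for your __m128 of [-48, 48] offsets; M_LN2 from <cmath>):

Code: Select all

// pitch multiplier: exp(value * ln2per12) == 2^(value/12)
const __m128 ln2per12 = _mm_set1_ps((float)(M_LN2 / 12.0));
__m128 pitchMul = exp_mp2(_mm_mul_ps(semitones, ln2per12));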

Post

2DaT wrote: Tue Jan 15, 2019 4:59 pm Works OK for range [-60,60]. No out-of-range/denormal/NaN checking, so use at your own risk.
That range sounds a little conservative, given that single precision has an 8-bit exponent, so about twice that range (well, I guess 1.4 times, given that it's base e) should be perfectly fine, or am I missing some subtle detail here? I mean, as long as the result is a "normal" number, it should work, right?

Also, if the integer part is negative, then don't you end up with the sign-bit set unless you mask it out explicitly, or is there some cancellation here that I just can't see?

edit: just to clarify, I don't necessarily doubt its correctness; I'm just having trouble figuring out how it's supposed to work, so please enlighten me :D

edit2: oh... I think I get it now... the exponent bias means we have bit 7 set, so as long as the negative value results in a carry from bit 7, it'll overflow into the sign bit and cancel out the sign of the integer part, saving the bit-mask at the cost of losing just a tiny bit of range (or does it even lose any?)... yeah, that seems pretty clever, somehow I never thought about that
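
A concrete example of that carry, with made-up values (1.2 * 2^-3):

Code: Select all

#include <cstdint>
#include <cstring>

// demonstrates the sign-bit cancellation
float carry_demo()
{
   float p = 1.2f;              // bits 0x3F99999A, exponent field = 127
   int32_t i = -3;              // (i << 23) = 0xFE800000, sign bit set
   uint32_t bits;
   std::memcpy(&bits, &p, sizeof bits);
   bits += (uint32_t)i << 23;   // wraps to 0x3E19999A: the carry out of the
                                // exponent field flips the sign bit back to 0
   std::memcpy(&p, &bits, sizeof bits);
   return p;                    // 0.15f, i.e. 1.2f * 2^-3
}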
Last edited by mystran on Wed Jan 16, 2019 2:07 am, edited 1 time in total.

Post

mystran wrote: Wed Jan 16, 2019 12:33 am That range sounds a little conservative [...] as long as the result is a "normal" number, it should work, right?
I checked it, and the range of values that doesn't produce garbage is actually [-87, 88].
mystran wrote: Wed Jan 16, 2019 12:33 am Also, if the integer part is negative, then don't you end up with the sign bit set unless you mask it out explicitly, or is there some cancellation here that I just can't see?
There is a logical left shift (shifting in zeroes), so it boils down to the addition of two 9-bit two's-complement numbers.
2^x takes values in ~[0.7, 1.4], i.e. floating-point exponents 126-127. Unless you add a big negative exponent, it won't go negative.

Post

2DaT wrote: Wed Jan 16, 2019 2:06 am There is a logical left shift (shifting in zeroes), so it boils down to the addition of two 9-bit two's-complement numbers. [...]
Yeah, I pretty much figured it out right before you managed to answer that. :)

edit: oh, and left shifts are pretty much always logical, unless you plan on triggering overflow traps (which x86 doesn't support anyway; logical and arithmetic left shift are literally the same binary encoding) :)

edit2: I don't think you actually lose any range; you might get negative (or otherwise wrong) values instead of denormals or infinities, but any normal results should still be correct, I think

Post

Nowhk wrote: Tue Jan 15, 2019 3:25 pm [...] fastExp2 seems the best, even though a drift of ~0.12 for pitch is TOO much. I believe I need another approximation. [...]
Wasn't your needed range just ±48.0 * (ln(2)/12), i.e. about ±2.77?

These implementations (as SSE versions) are usually faster than std::exp() and can be about as accurate as well:
(Pade) - https://root.cern.ch/doc/v606/exp_8h_source.html
(Remez, Taylor) - http://www.chokkan.org/software/dist/fastexp.c.html

There's also this nice Pade related function ... - https://math.stackexchange.com/a/2196435
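
(For reference, the classic [2/2] Padé approximant is e^x ≈ (12 + 6x + x^2) / (12 - 6x + x^2); like the polynomial fits, it's only accurate near zero, so it still wants the same base-2 range reduction.)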
Last edited by juha_p on Wed Jan 16, 2019 9:38 am, edited 1 time in total.

Post

You can try to evaluate the accuracy of these 2^X and e^X functions and see if some of them are accurate enough for you. I found them in my archives. It should be possible to convert them into SIMD/intrinsics versions.
Code: Select all

#include <cmath>
#include <cstdint>

// 2^X
inline float Fast_Pow2X(const float x)
{
   // [-0.5, 0.5] 2^x approx polynomial ~ 2.4 ulp
   // (note: same coefficients as 2DaT's exp_mp2 above)
   constexpr float c0 = 1.000000119e+00f;
   constexpr float c1 = 6.931469440e-01f;
   constexpr float c2 = 2.402212024e-01f;
   constexpr float c3 = 5.550713092e-02f;
   constexpr float c4 = 9.675540961e-03f;
   constexpr float c5 = 1.327647245e-03f;

   const int32_t i = (int32_t)lrintf(x); // i = round(x); a plain int32_t(x)
                                         // would truncate and push f outside
                                         // the fitted [-0.5, 0.5] range
   const float f = x - float(i);         // f = x - i

   // Estrin-Horner evaluation scheme
   const float f2 = f * f;
   float p = c5 * f + c4;
   p *= f2;
   p += c3 * f + c2;
   p *= f2;
   p += c1 * f + c0;

   // (2^i)*(2^f): add i directly into the floating-point exponent
   const uint32_t temp2 = (i << 23) + *((uint32_t*)&p);
   return *((float*)&temp2);
}

// e^X
inline float Fast_Exp(const float x)
{
   constexpr float scale = 1.442695041f; // log2(e)
   return Fast_Pow2X(scale * x);
}

// 2^X
static inline float Fast_Pow2(float p)
{
   float offset = (p < 0) ? 1.0f : 0.0f;
   float clip_p = (p < -126) ? -126.0f : p;
   int32_t w = (int32_t)clip_p;
   float z = clip_p - w + offset;

   union uint_float
   {
      uint32_t i;
      float f;
   };

   const uint_float v = { (uint32_t)((1 << 23) * (clip_p + 121.2740575f + 27.7280233f / (4.84252568f - z) - 1.49012907f * z)) };

   return v.f;
}

// 2^X - 1
static inline double Fast_Pow2_Minus_1(double x)
{
   return -0.15639e-16 + (0.69314718055995154416 + (0.24022650695869054994 +
      (0.55504108675285271942e-1 + (0.96181289721472527028e-2
      + (0.13333568212100120656e-2 + (0.15403075872034576440e-3
      + (0.15265399313676342402e-4 + (0.13003468358428470167e-5
      + 0.12113766044841794408e-6 * x) * x) * x) * x) * x) * x) * x) * x) * x;
}

Post

2DaT wrote: Tue Jan 15, 2019 4:59 pm Remember: FMA arrived alongside AVX2, so it's available only on modern processors.
Yes, of course, but it can easily be converted to SSE2; just look at the comment inside the code:

Code: Select all

ret = _mm_fmadd_pd(ret, x, a5);
// If the FMA extension is not present, use
// ret = _mm_add_pd(_mm_mul_pd(ret, x), a5);
2DaT wrote: Tue Jan 15, 2019 4:59 pm Use this (for FMA):
Works OK for range [-60,60]. No out-of-range/denormal/NaN checking, so use at your own risk.
Oh, thanks! Do you have the same code "on the fly" for __m128d (i.e. double)? Otherwise I'll need to manually refactor it from __m128 to __m128d...
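
Something along these lines, maybe (a rough, untested SSE2-only sketch; the polynomial is a plain Taylor series rather than a proper minimax fit, so the coefficients are only illustrative):

Code: Select all

#include <emmintrin.h>

// exp(x) = 2^(x*log2(e)): round off the integer part, evaluate the fractional
// part as e^y with y = f*ln(2), |y| <= ln(2)/2, then add the integer part into
// the exponent bits. Degree-11 Taylor series: truncation error roughly 1e-14
// on this interval, ignoring rounding.
__m128d fastExpPd(__m128d x) {
   const __m128d l2e = _mm_set1_pd(1.4426950408889634);  // log2(e)
   const __m128d ln2 = _mm_set1_pd(0.6931471805599453);  // ln(2)

   __m128d f = _mm_mul_pd(x, l2e);
   __m128i i32 = _mm_cvtpd_epi32(f);          // round to nearest (2 lanes)
   f = _mm_sub_pd(f, _mm_cvtepi32_pd(i32));   // f in [-0.5, 0.5]

   __m128d y = _mm_mul_pd(f, ln2);            // 2^f = e^y
   __m128d p = _mm_set1_pd(1.0 / 39916800);   // 1/11!
   p = _mm_add_pd(_mm_mul_pd(p, y), _mm_set1_pd(1.0 / 3628800));
   p = _mm_add_pd(_mm_mul_pd(p, y), _mm_set1_pd(1.0 / 362880));
   p = _mm_add_pd(_mm_mul_pd(p, y), _mm_set1_pd(1.0 / 40320));
   p = _mm_add_pd(_mm_mul_pd(p, y), _mm_set1_pd(1.0 / 5040));
   p = _mm_add_pd(_mm_mul_pd(p, y), _mm_set1_pd(1.0 / 720));
   p = _mm_add_pd(_mm_mul_pd(p, y), _mm_set1_pd(1.0 / 120));
   p = _mm_add_pd(_mm_mul_pd(p, y), _mm_set1_pd(1.0 / 24));
   p = _mm_add_pd(_mm_mul_pd(p, y), _mm_set1_pd(1.0 / 6));
   p = _mm_add_pd(_mm_mul_pd(p, y), _mm_set1_pd(1.0 / 2));
   p = _mm_add_pd(_mm_mul_pd(p, y), _mm_set1_pd(1.0));
   p = _mm_add_pd(_mm_mul_pd(p, y), _mm_set1_pd(1.0));

   // sign-extend i32 to 64-bit lanes (SSE2-only) and shift into the exponent
   __m128i sign = _mm_srai_epi32(i32, 31);
   __m128i i64 = _mm_unpacklo_epi32(i32, sign);
   i64 = _mm_slli_epi64(i64, 52);
   return _mm_castsi128_pd(_mm_add_epi64(i64, _mm_castpd_si128(p)));
}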

juha_p wrote: Wed Jan 16, 2019 8:09 am Wasn't your needed range just ±48.0 * (ln(2)/12), i.e. about ±2.77?
Yes, correct! But as you can see in the code, it already multiplies by ln2per12, so the errors are actually for that range.

juha_p wrote: Wed Jan 16, 2019 8:09 am These implementations (as SSE versions) are usually faster than std::exp() and can be about as accurate as well: [...]
I'll check it out now, thanks!

Post

juha_p wrote: Wed Jan 16, 2019 8:09 am (Pade) - https://root.cern.ch/doc/v606/exp_8h_source.html
Awesome...
I've tried this, and in that range the max error is 1.7763568394002504646778106689453125e-15 :love:

It seems perfect for calculating a pitch multiplier!!! Thanks!!!
Do you already have the SSE2 version?

Note: how much will a sin(freq * pitchMultiplier) drift (in this worst case, since that's the max error)?
