First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?
- KVRAF
- 7888 posts since 12 Feb, 2006 from Helsinki, Finland
For performance sensitive applications, you might want to implement some of the "expensive" math functions manually anyway, since this allows you to fine tune the trade-off between accuracy and performance.
In DSP applications, you might even want up to three different approximations for exp(): one that is low-accuracy for general purpose modulation mapping (e.g. for exponential gain modulation it's probably enough to be accurate to within 0.1dB or so), one that is accurate to within about 1 cent for tuning purposes, and maybe one more that's accurate to within a few ulp for things like Newton iteration, where trying to save on cycles can hurt convergence (which quickly ends up costing more than just spending the extra cycles if it forces extra iterations).
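For concreteness, here is a small helper (my own sketch, not from the post; the function names are illustrative) turning those accuracy targets into relative-error tolerances on the output of exp():

```cpp
#include <cmath>

// 0.1 dB of gain error corresponds to a relative error of
// 10^(0.1/20) - 1 ~ 1.16e-2 on the output of exp();
// 1 cent of tuning error corresponds to 2^(1/1200) - 1 ~ 5.8e-4.
double db_tolerance(double db)       { return std::pow(10.0, db / 20.0) - 1.0; }
double cents_tolerance(double cents) { return std::pow(2.0, cents / 1200.0) - 1.0; }
```

So the modulation-grade approximation is allowed roughly 20x the relative error of the tuning-grade one, which is why keeping separate versions can pay off.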
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
Right now I need it to retrieve a pitch multiplier as the range is modulated (at audio rate) between -48.0 and 48.0: exp(value * ln2per12). What do you suggest?
mystran wrote: ↑Fri Dec 21, 2018 11:07 am For performance sensitive applications, you might want to implement some of the "expensive" math functions manually anyway, since this allows you to fine tune the trade-off between accuracy and performance.
In DSP applications, you might even want up to three different approximations for exp(): one that is low-accuracy for general purpose modulation mapping (eg. for exponential gain modulation and such it's probably enough to be accurate to within 0.1dB or so), one that is more accurate to within about 1 cent for tuning purposes and then maybe one more that's accurate to within a few ulp for things like Newton where trying to save on cycles can hurt convergence (which quickly ends up costing more than just spending the extra cycles if this saves you from having to do extra iterations).
- KVRAF
- 7888 posts since 12 Feb, 2006 from Helsinki, Finland
There was just very recently a thread about exp() approximations, maybe see if there is something useful there.
Anyway, the basic idea is to convert to base-2 (or just approximate exp2() directly), then split the argument into fractional and integer parts. Then you approximate the fractional part using a minimax polynomial or something (check that thread for ideas). For the integer part you reinterpret the resulting (from the single-octave approximation) floating point bit-pattern as an integer and use an integer add to adjust the exponent bits directly (which is why you always want to use base-2 internally).
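A hypothetical scalar sketch of this scheme (plain C++, double precision, my own code; the function name is made up and the polynomial coefficients are borrowed from the truncated series posted later in the thread — a proper minimax fit would do better):

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// exp(x) = 2^(x * log2(e)): split into integer part r and fractional
// part f, approximate 2^f on [-0.5, 0.5] with a polynomial, then splice
// r into the exponent bits of the result with an integer add.
double exp_sketch(double x)
{
    const double y = x * 1.4426950408889634;  // x * log2(e)
    const double r = std::nearbyint(y);       // integer part (round to nearest)
    const double f = y - r;                   // fractional part in [-0.5, 0.5]
    // degree-5 polynomial for 2^f (illustrative accuracy, ~1e-6 relative)
    double p = 1.0 + f * (0.69314718055995154
             + f * (0.24022650695869055
             + f * (0.055504108675285272
             + f * (0.0096181289721472527
             + f * 0.0013333568212100121))));
    // add r directly into the biased exponent field (bits 52..62)
    std::uint64_t bits;
    std::memcpy(&bits, &p, sizeof bits);      // memcpy avoids aliasing UB
    bits += static_cast<std::uint64_t>(static_cast<std::int64_t>(r)) << 52;
    std::memcpy(&p, &bits, sizeof bits);
    return p;
}
```

No range/NaN checking, so like the other code in this thread it misbehaves once the adjusted exponent under- or overflows.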
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
Didn't find the recent thread about exp().
mystran wrote: ↑Sat Dec 22, 2018 7:58 pm There was just very recently a thread about exp() approximations, maybe see if there is something useful there.
Anyway, the basic idea is to convert to base-2 (or just approximate exp2() directly), then split argument into fractional and integer part. Then you approximate the fractional part using a min-max polynomial or something (check that thread for ideas). For the integer part you reinterpret the resulting (from the single-octave approximation) floating point bit-pattern as an integer and use integer add to adjust the exponent bits directly (which is why you always want to use base-2 internally).
However, let me share some of these fancy approximations (which, I hope, could also interest you all), using only SSE2 (and FMA) intrinsics:
__m128d fastExp1(__m128d x) {
    // exp(x) ~ (1 + x/256)^256, computed as eight successive squarings
    __m128d ret = _mm_mul_pd(_mm_set1_pd(1.0 / 256), x);
    ret = _mm_add_pd(_mm_set1_pd(1.0), ret);
    ret = _mm_mul_pd(ret, ret);
    ret = _mm_mul_pd(ret, ret);
    ret = _mm_mul_pd(ret, ret);
    ret = _mm_mul_pd(ret, ret);
    ret = _mm_mul_pd(ret, ret);
    ret = _mm_mul_pd(ret, ret);
    ret = _mm_mul_pd(ret, ret);
    ret = _mm_mul_pd(ret, ret);
    return ret;
}
__m128d fastExp2(__m128d x) {
    // degree-7 Taylor polynomial of exp() around 0, Horner scheme with FMA
    const __m128d a0 = _mm_set1_pd(1.0);
    const __m128d a1 = _mm_set1_pd(1.0);
    const __m128d a2 = _mm_set1_pd(1.0 / 2);
    const __m128d a3 = _mm_set1_pd(1.0 / 2 / 3);
    const __m128d a4 = _mm_set1_pd(1.0 / 2 / 3 / 4);
    const __m128d a5 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5);
    const __m128d a6 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5 / 6);
    const __m128d a7 = _mm_set1_pd(1.0 / 2 / 3 / 4 / 5 / 6 / 7);
    __m128d ret = _mm_fmadd_pd(a7, x, a6);
    ret = _mm_fmadd_pd(ret, x, a5);
    // If the FMA extension is not present, use
    // ret = _mm_add_pd(_mm_mul_pd(ret, x), a5);
    ret = _mm_fmadd_pd(ret, x, a4);
    ret = _mm_fmadd_pd(ret, x, a3);
    ret = _mm_fmadd_pd(ret, x, a2);
    ret = _mm_fmadd_pd(ret, x, a1);
    ret = _mm_fmadd_pd(ret, x, a0);
    return ret;
}
__m128d fastExp3(__m128d x) {
    // Schraudolph-style bit trick: scale x/ln(2) and a bias offset b so
    // that the shifted bit pattern lands in the double's exponent field
    const __m128d a = _mm_set1_pd(1.0 / M_LN2);
    const __m128d b = _mm_set1_pd(3 * 1024.0 - 1.05);
    __m128d t = _mm_fmadd_pd(x, a, b);
    return _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(t), 11));
}
__m128d fastExp5(__m128d x) {
    // same trick evaluated as exp(x/2) / exp(-x/2) for better accuracy
    const __m128d ap = _mm_set1_pd(0.5 / M_LN2);
    const __m128d an = _mm_set1_pd(-0.5 / M_LN2);
    const __m128d b = _mm_set1_pd(3 * 1024.0 - 1.05);
    __m128d tp = _mm_fmadd_pd(x, ap, b);
    __m128d tn = _mm_fmadd_pd(x, an, b);
    tp = _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(tp), 11));
    tn = _mm_castsi128_pd(_mm_slli_epi64(_mm_castpd_si128(tn), 11));
    return _mm_div_pd(tp, tn);
}
The input is, as written above, a value in the range -48.0 to 48.0, evaluated as exp(value * ln2per12).
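For what it's worth, the error measurement reported in the next post can be reproduced with a scalar harness along these lines (my own sketch; fastExp1_scalar mirrors fastExp1 above without intrinsics, and the function names are mine):

```cpp
#include <algorithm>
#include <cmath>

// Scalar mirror of fastExp1: exp(x) ~ (1 + x/256)^256 via 8 squarings.
double fastExp1_scalar(double x)
{
    double r = 1.0 + x * (1.0 / 256.0);
    for (int k = 0; k < 8; ++k)
        r *= r;  // after 8 squarings: r^256
    return r;
}

// Max absolute error against std::exp over value in [-48, 48], step 0.1,
// after the value * ln2per12 scaling used in the thread.
double max_error_fastExp1()
{
    const double ln2per12 = std::log(2.0) / 12.0;
    double worst = 0.0;
    for (double v = -48.0; v <= 48.0; v += 0.1) {
        const double x = v * ln2per12;
        worst = std::max(worst, std::fabs(fastExp1_scalar(x) - std::exp(x)));
    }
    return worst;
}
```

This lands close to the ~0.237 absolute error quoted in the next post, almost all of it at the top of the range where exp() itself is largest (~16).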
- KVRAF
- 7888 posts since 12 Feb, 2006 from Helsinki, Finland
viewtopic.php?f=33&t=510794
Nowhk wrote: ↑Tue Jan 15, 2019 10:28 am Didn't find the recent thread about exp().
mystran wrote: ↑Sat Dec 22, 2018 7:58 pm There was just very recently a thread about exp() approximations, maybe see if there is something useful there.
Anyway, the basic idea is to convert to base-2 (or just approximate exp2() directly), then split argument into fractional and integer part. Then you approximate the fractional part using a min-max polynomial or something (check that thread for ideas). For the integer part you reinterpret the resulting (from the single-octave approximation) floating point bit-pattern as an integer and use integer add to adjust the exponent bits directly (which is why you always want to use base-2 internally).
Which one, in your opinion, is the best deal (accuracy, performance and stability) for my target?
My opinion on this is that you should probably measure them and form your own opinion.
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
Thanks
Tried! It seems that none of them is good enough: https://rextester.com/UDT46215 (performance aside; not considered yet).
Here are the biggest errors I get, over the range -48/48 with a step of 0.1, compared to the "scalar" exp() function:
fastExp1 error: 0.2367371124568560247780624195002019405364990234375
fastExp2 error: 0.1230974105602360424427388352341949939727783203125
fastExp3 error: 0.4000000000014534151659972849301993846893310546875
fastExp5 error: 0.2714721354081657267443006276153028011322021484375
I need another approx I believe.
If I use ippsExp_64f_I, forcing IPP to only use SSE2, I reach a max error of 3.552713678800500929355621337890625e-15 compared to the scalar exp() version.
Quite impressive. The downside is that IPP only processes arrays, not double-packed __m128d values (which is what I need right now).
I'll have a look at the KVR link above and report back.
- KVRAF
- 7888 posts since 12 Feb, 2006 from Helsinki, Finland
Normally after converting to base-2, you want to split the argument into integer and fractional parts, only approximate the fractional part and then add the integer part into the floating point exponent of the result directly. If you look at the actual errors of the approximations, you'll notice that they are much more accurate over a limited range, typically either [-0.5,0.5] or [0,1] and you only want to use this range of the actual approximation.
-
- KVRian
- 631 posts since 21 Jun, 2013
Remember: FMA is a separate instruction-set extension (introduced alongside AVX2 on most x86 CPUs) and is available only on modern processors.
Nowhk wrote: ↑Tue Jan 15, 2019 3:25 pm Tried! It seems that nobody of them could be good enough: https://rextester.com/UDT46215 (despite the performance; not considerated yet).
Here's biggest errors I can have on some values, considering a range of -48/48 and a step of 0.1, compared to the "scalar" exp() function:
fastExp2 seems the best, even though a drift of ~0.12 for pitch is TOO much.
fastExp1 error: 0.2367371124568560247780624195002019405364990234375
fastExp2 error: 0.1230974105602360424427388352341949939727783203125
fastExp3 error: 0.4000000000014534151659972849301993846893310546875
fastExp5 error: 0.2714721354081657267443006276153028011322021484375
I need another approx I believe.
Your listed methods are trash. Use this (for FMA):
Works OK for the range [-60, 60]. No out-of-range/denormal/NaN checking, so use at your own risk.
__m128 exp_mp2(__m128 x)
{
    // [-0.5, 0.5] 2^x approx polynomial ~ 2.4 ulp
    const __m128 c0 = _mm_set1_ps(1.000000119e+00f);
    const __m128 c1 = _mm_set1_ps(6.931469440e-01f);
    const __m128 c2 = _mm_set1_ps(2.402212024e-01f);
    const __m128 c3 = _mm_set1_ps(5.550713092e-02f);
    const __m128 c4 = _mm_set1_ps(9.675540961e-03f);
    const __m128 c5 = _mm_set1_ps(1.327647245e-03f);
    const __m128 l2e = _mm_set1_ps(1.442695041f);   // log2(e)
    __m128 f = _mm_mul_ps(x, l2e);
    __m128i i = _mm_cvtps_epi32(f);                 // integer part (round to nearest)
    f = _mm_sub_ps(f, _mm_cvtepi32_ps(i));          // fractional part in [-0.5, 0.5]
    __m128 p = c5;
    p = _mm_fmadd_ps(p, f, c4);
    p = _mm_fmadd_ps(p, f, c3);
    p = _mm_fmadd_ps(p, f, c2);
    p = _mm_fmadd_ps(p, f, c1);
    p = _mm_fmadd_ps(p, f, c0);
    i = _mm_slli_epi32(i, 23);                      // integer part into exponent bits
    return _mm_castsi128_ps(_mm_add_epi32(i, _mm_castps_si128(p)));
}
- KVRAF
- 7888 posts since 12 Feb, 2006 from Helsinki, Finland
That range sounds a little conservative, given that single-precision has 8 bits exponent, so about twice (well, I guess 1.4 times, given that it's base-e) that range should be perfectly fine, or am I missing some subtle detail here? I mean as long as the result is a "normal" number, it should work, right?
Also, if the integer part is negative, then don't you end up with the sign-bit set unless you mask it out explicitly, or is there some cancellation here that I just can't see?
edit: just to clarify, I don't necessarily doubt its correctness, I'm just having trouble figuring out how it's supposed to work, so please enlighten me
edit2: oh... I think I get it now.. the exponent bias means we have bit 7 set, so as long as the negative value results in a carry from bit 7, it'll overflow into the sign-bit and cancel out the sign of the integer part, saving the bit-mask at the cost of losing just a tiny bit of range (or does it even lose any?)... yeah, that seems pretty clever, somehow never thought about that
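The carry argument can be checked directly on the bit patterns. This standalone snippet (my own, with a made-up name) applies the same integer add used in exp_mp2, in scalar form:

```cpp
#include <cstdint>
#include <cstring>

// bits(1.0f) = 0x3F800000: sign 0, biased exponent 127 (so bit 7 of the
// exponent field is set thanks to the bias). Adding i << 23 for negative
// i wraps modulo 2^32; the carry out of exponent bit 7 cancels the sign
// bit that the negative i shifted in. For example, with i = -3:
//   0x3F800000 + 0xFE800000 = 0x3E000000 -> exponent 124, sign 0 -> 0.125f
float apply_exponent_offset(float p, std::int32_t i)
{
    std::uint32_t bits;
    std::memcpy(&bits, &p, sizeof bits);          // memcpy avoids aliasing UB
    bits += static_cast<std::uint32_t>(i) << 23;  // same add as in exp_mp2
    std::memcpy(&p, &bits, sizeof bits);
    return p;
}
```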
Last edited by mystran on Wed Jan 16, 2019 2:07 am, edited 1 time in total.
-
- KVRian
- 631 posts since 21 Jun, 2013
I checked it and actually the range of values which does not produce garbage is [-87, 88].
mystran wrote: ↑Wed Jan 16, 2019 12:33 am That range sounds a little conservative, given that single-precision has 8 bits exponent, so about twice (well, I guess 1.4 times, given that it's base-e) that range should be perfectly fine, or am I missing some subtle detail here? I mean as long as the result is a "normal" number, it should work, right?
There is a logical left shift (shifting in zeroes); it boils down to the addition of two 9-bit two's-complement numbers.
mystran wrote: ↑Wed Jan 16, 2019 12:33 am Also, if the integer part is negative, then don't you end up with the sign-bit set unless you mask it out explicitly, or is there some cancellation here that I just can't see?
edit: just clarify, I don't necessarily doubt it's correctness, I'm just having trouble figuring out how it's supposed to work, so please enlighten me
2^x takes values ~[0.7,1.4]. Floating point exponents 126-127. Unless you add a big negative exponent, it won't go negative.
- KVRAF
- 7888 posts since 12 Feb, 2006 from Helsinki, Finland
Yeah, I pretty much figured it out right before you managed to answer that.
2DaT wrote: ↑Wed Jan 16, 2019 2:06 am There is a left logical shift (shifting in zeroes), it boils down to addition of two 9 bit twos-complement numbers.
mystran wrote: ↑Wed Jan 16, 2019 12:33 am Also, if the integer part is negative, then don't you end up with the sign-bit set unless you mask it out explicitly, or is there some cancellation here that I just can't see?
edit: just clarify, I don't necessarily doubt it's correctness, I'm just having trouble figuring out how it's supposed to work, so please enlighten me
2^x takes values ~[0.7,1.4]. Floating point exponents 126-127. Unless you add a big negative exponent, it won't go negative.
edit: oh and left shifts are kinda always logical, unless you plan on triggering overflow traps (which x86 doesn't support anyway; in fact, logical and arithmetic left shift share literally the same binary encoding)
edit2: I don't think you actually even lose any range; you might get negative (or otherwise wrong) values instead of denormals or infinity, but any normal results should still be correct, I think
-
- KVRian
- 833 posts since 21 Feb, 2006 from FI
Wasn't your needed range just ±48.0 * (ln(2)/12) (±~2.77)?
Nowhk wrote: ↑Tue Jan 15, 2019 3:25 pm ...
Here's biggest errors I can have on some values, considering a range of -48/48 and a step of 0.1, compared to the "scalar" exp() function:
fastExp2 seems the best, even though a drift of ~0.12 for pitch its TOO much.
fastExp1 error: 0.2367371124568560247780624195002019405364990234375
fastExp2 error: 0.1230974105602360424427388352341949939727783203125
fastExp3 error: 0.4000000000014534151659972849301993846893310546875
fastExp5 error: 0.2714721354081657267443006276153028011322021484375
I need another approx I believe.
...
These implementations (also available as SSE versions) are usually faster than std::exp() and can be about as accurate as well:
(Pade) - https://root.cern.ch/doc/v606/exp_8h_source.html
(Remez, Taylor) - http://www.chokkan.org/software/dist/fastexp.c.html
There's also this nice Pade related function ... - https://math.stackexchange.com/a/2196435
Last edited by juha_p on Wed Jan 16, 2019 9:38 am, edited 1 time in total.
-
- KVRAF
- 1609 posts since 13 Oct, 2003 from Oulu, Finland
You can try and evaluate the accuracy of these 2^X and e^X functions and see if some of them have high enough accuracy for you. I found them from my archives. It should be possible to convert them into SIMD/intrinsics versions.
#include <cmath>    // lrintf
#include <cstdint>
#include <cstring>  // memcpy

// 2^X
inline float Fast_Pow2X(const float x)
{
    // [-0.5, 0.5] 2^x approx polynomial ~ 2.4 ulp
    constexpr float c0 = 1.000000119e+00f;
    constexpr float c1 = 6.931469440e-01f;
    constexpr float c2 = 2.402212024e-01f;
    constexpr float c3 = 5.550713092e-02f;
    constexpr float c4 = 9.675540961e-03f;
    constexpr float c5 = 1.327647245e-03f;
    const int32_t i = int32_t(lrintf(x)); // i = round(x); a plain int32_t(x) would truncate
    const float f = x - float(i);         // f = x - i, in [-0.5, 0.5]
    // Estrin-Horner evaluation scheme
    const float f2 = f*f;
    float p = c5*f + c4;
    p *= f2;
    p += c3*f + c2;
    p *= f2;
    p += c1*f + c0;
    // (2^i)*(2^f): add i directly into the floating point exponent
    // (memcpy instead of pointer casts to avoid strict-aliasing UB)
    uint32_t temp2;
    memcpy(&temp2, &p, sizeof temp2);
    temp2 += uint32_t(i) << 23;
    float result;
    memcpy(&result, &temp2, sizeof result);
    return result;
}
// e^X
inline float Fast_Exp(const float x)
{
constexpr float scale = 1.442695041f;
return Fast_Pow2X(scale * x);
}
// 2^X
static inline float Fast_Pow2(float p)
{
    float offset = (p < 0) ? 1.0f : 0.0f;
    float clip_p = (p < -126) ? -126.0f : p;
    int32_t w = clip_p;              // truncation is intended; offset corrects negatives
    float z = clip_p - w + offset;
    union uint_float
    {
        uint32_t i;
        float f;
    };
    const uint_float v = { (uint32_t) ((1 << 23) * (clip_p + 121.2740575f + 27.7280233f / (4.84252568f - z) - 1.49012907f * z)) };
    return v.f;
}
// 2^X - 1
static inline double Fast_Pow2_Minus_1(double x)
{
return -0.15639e-16+(0.69314718055995154416+(0.24022650695869054994+
(0.55504108675285271942e-1+(0.96181289721472527028e-2
+(0.13333568212100120656e-2+(0.15403075872034576440e-3
+(0.15265399313676342402e-4+(0.13003468358428470167e-5
+0.12113766044841794408e-6*x)*x)*x)*x)*x)*x)*x)*x)*x;
}
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
Yes of course, but it can be converted easily to plain SSE2; just look at the comment inside the code:
ret = _mm_fmadd_pd(ret, x, a5);
// If the FMA extension is not present, use
// ret = _mm_add_pd(_mm_mul_pd(ret, x), a5);
Oh thanks! Do you have the same code "on the fly" for __m128d (i.e. double)? Otherwise I need some sort of manual refactoring from __m128 to __m128d...
Yes, correct! But as you can see in the code, it already multiplies by ln2per12, so the errors are actually for that range.
I'll check it out now, thanks!
juha_p wrote: ↑Wed Jan 16, 2019 8:09 am These implementations (as SSE) versions are usually faster than std::exp() and can be about as accurate aswell:
(Pade) - https://root.cern.ch/doc/v606/exp_8h_source.html
(Remez, Taylor) - http://www.chokkan.org/software/dist/fastexp.c.html
There's also this nice Pade related function ... - https://math.stackexchange.com/a/2196435
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
Awesome...
I've tried this, and in that range the max error is 1.7763568394002504646778106689453125e-15
It seems perfect to calculate a pitch multiplier!!! Thanks!!!
Do you already have the SSE2 version?
Note: how much will sin(freq * pitchMultiplier) drift in this worst case (since that is the max error)?
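A rough answer to that last question (my own back-of-envelope, not from the thread): treating the reported ~1.78e-15 as an absolute error on a pitch multiplier near 1.0, the resulting detune is about (1200/ln 2) * err, i.e. a few picocents:

```cpp
#include <cmath>

// Detune in cents caused by an absolute error on a pitch multiplier m:
// cents = 1200 * log2((m + err) / m) ~ (1200 / ln 2) * err / m for tiny err.
double drift_cents(double multiplier, double abs_err)
{
    return 1200.0 * std::log2((multiplier + abs_err) / multiplier);
}
```

drift_cents(1.0, 1.78e-15) comes out around 3e-12 cents, so the pitch error of this approximation is far below anything audible.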