KVR Audio

2DaT · Post by **2DaT** » Sat Sep 08, 2018 12:02 pm

mystran wrote: The error doesn't look like proper minmax. I would guess this is a Chebychev approximation?

Yes, it drifted a bit after I quantized the coefficients.

Division in general is slower than sqrt() actually.

I have checked the instruction timings on https://www.agner.org/optimize/instruction_tables.pdf and I can't verify this claim. Division and sqrt have the same performance across multiple microarchitectures.

camsr · Post by **camsr** » Sat Sep 08, 2018 10:00 pm

Here is the EpowX approximation. To be able to use the bitshift requires modulo of the input. And that's when I learned that a rounding is a simple case of modulo

Code: Select all

double exec_approx_EpowX(float in)
{
    // this still doesn't work with negative inputs

    double j = .346573590279973;
    double l = 1.0 / (2.0 * j);
    double h = 0.99999022527305;

    uint64_t shift = (uint64_t)( (l * (double)in) + 0.5)  ;
    double x = (double)in;
    double x_mod = (((double)shift) * -2.0 * j) + x;
    x_mod *= h;
    double result_scaler = (double)( 1ull << shift );

    double dividend = x_mod+64.0;
    double divisor = x_mod-64.0;
    double result = (dividend / divisor);
    result *= result;
    result *= result;
    result *= result;
    result *= result;
    result *= result;
    result *= result_scaler;

    return result;
}

juha_p · Post by **juha_p** » Sun Sep 09, 2018 6:43 am

For me std:exp has worked well enough for 2^x as well :

Code: Select all

exp(x * M_LN2); // 2^x

accurate: https://www.desmos.com/calculator/lczbo65wpy

Here's one simple approximation I found.

Code: Select all

double exp_alt(double x) {
  x = 1.0 + x / 1048576; // ... | 16384 | 65536 | 1048576 | ...
 
  x *= x; x *= x; x *= x; x *= x; // 4
  x *= x; x *= x; x *= x; x *= x; // 8
  x *= x; x *= x; x *= x; x *= x; // 12
  x *= x; x *= x; // 14 
  x *= x; x *= x; // 16
  x *= x; x *= x; x *= x; x *= x; // 20
// ...
  return x;
}

EDIT: Made this as SSE version with accuracy option (1st SSE code ever for me

):

Code: Select all

__m128d exp_alt_sse(__m128d x_in, int level){
// error level 2^16  [-0.5, 0.5] ~3.14579329e-03
float l [32] = {2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 
		8192, 16384, 32768, 65536, 131072, 262144, 524288, 1048576, 
		2097152, 4194304, 8388608, 16777216, 33554432, 67108864, 134217728, 
		268435456, 536870912, 1073741824, 2147483648, 4294967296 };
    __m128d x, r;
    __m128d c0  = _mm_set1_pd (1.0f);
    __m128d c1  = _mm_set1_pd (l[level-1]);
     
    x = _mm_div_pd (x_in, c1);	  	// x/2^level
    x = _mm_add_pd (x, c0);  		//  x+1
    
    for(int i = 0; i < level; i++){
        x = _mm_mul_pd (x, x); // x*=x
    }
    	
    return x;
}

camsr · Post by **camsr** » Sun Sep 09, 2018 5:55 pm

juha-P thanks for refreshing my memory about how to change the base of an exponential. It's been a long time since I've needed to do that!

I'm trying to figure out a way to write this approx function where it's able to take some arguments and "optimize" itself, without needing to rewrite any source or add complementary functions.
Basically it would look like this:
approx( result_power , base , domain , x_input )

But it actuality, when it would be called in use, it would be approx( in ), the arguments of the previous function are just to define the function before compilation.

Not sure if that's possible in a clean way with C.

camsr · Post by **camsr** » Mon Sep 10, 2018 6:59 am

The test program is running but I have to admit that it's not as informative as a graphic plot of the functions' errata.
Maybe someone could explain how to calculate ULP?
Soon it will include execution speed profiling as well.
I got caught up in making an easy to modify source and I could have done this a lot sooner but the effort was worth it. I can simply drop in a function, build it's tiny wrapper to print it's identifier, and a function address in one array, and it's off to the races.

Maximum Difference Absolute is calculated as:
max( |basis - approx| )
Maximum Difference Relative:
max( |(basis - approx)/basis| )
Average Relative Value:
accumulate( |(basis - approx)/basis| ) / units_in_domain (units being every bit in the domain, not sure yet if this is accurate)

Code: Select all

Basis function stndrd_EpowX compared to approx_EpowX
Denormals are allowed

Domain:
low       = 1.000000
high      = 32.000000

Maximum Difference Absolute = 79105605.328125000000000000
Maximum Difference Relative = 0.000001303941352186
Average Relative Value      = 0.000000846578606836

juha_p · Post by **juha_p** » Mon Sep 10, 2018 7:32 pm

camsr wrote: ...
Maybe someone could explain how to calculate ULP?
...

Maybe you'll find this helpful to start with - https://en.wikipedia.org/wiki/Unit_in_the_last_place

camsr · Post by **camsr** » Tue Sep 11, 2018 8:13 am

camsr wrote: I'm trying to figure out a way to write this approx function where it's able to take some arguments and "optimize" itself, without needing to rewrite any source or add complementary functions.
Basically it would look like this:
approx( result_power , base , domain , x_input )

But it actuality, when it would be called in use, it would be approx( in ), the arguments of the previous function are just to define the function before compilation.

Not sure if that's possible in a clean way with C.

Making some progress on this idea. Not sure what to say about GCC but it's not easy to coax a compilation that works.

Using a macro definately works:

Code: Select all

#define U 1.f
AT(noinline) float abc_pur(  const float const a, float b, float c)
{
    float q = 0.f;
    if (U > 0.f) { q = b-c; }
    else { q = b/c; }
    return q;
}

The emitted function will have removed the division and the branch.

But using a very very constant only seems to work if the function is inlined:

Code: Select all

static const float const A[] = {1.f};
AT(inline) float abc_inl(  const float const a, float b, float c)
{
    float q = 0.f;
    if (a > 0.f) { q = b-c; }
    else { q = b/c; }
    return q;
}

And if not inlined, the branch will remain. Not sure yet why. It might have something to do with 'a' being a function argument.

juha_p · Post by **juha_p** » Tue Sep 11, 2018 10:50 am

Just FYI, accuracy of this simple approximation (I found mentioned on some thread at math.exchance forums) is told to be around 1.3x10^-10 when -½<x<½ and actually, it seem to be little faster than std::exp() :

https://www.desmos.com/calculator/u7n7iui9ip

(Added in there a plot for Maclaurin expansion for e^x (level 9) ... slow compared to std::exp() but accuracy can be adjusted by just changing the 'level' in range 1...15)

juha_p · Post by **juha_p** » Wed Sep 12, 2018 9:52 pm

Made a 'little' faster SSE version of the simple function mentioned in my earlier post (2nd SSE thingy ever

(1st can be found few post earlier )) :

Code: Select all

__m128d easyEXP_sse_tweaked (__m128d x){
// plotted @ https://www.desmos.com/calculator/tj0dhylwyd
// max error [-0.5, 0.5] ~1.61612233e-15
	__m128d t1, t2, t3, r, a, b;
  
    __m128d c0  = _mm_set1_pd (1680.0);
    __m128d c1  = _mm_set1_pd (840.0);
    __m128d c2  = _mm_set1_pd (180.0);
    __m128d c3  = _mm_set1_pd (20.0);
    __m128d c4  = _mm_set1_pd (0.25);

	x = _mm_mul_pd (x, c4); 	// tweak  x --> 0.25x
	__m128d x025 = _mm_mul_pd (x, c4);
    __m128d x2 = _mm_mul_pd (x, x);
	__m128d x3 = _mm_mul_pd (x2, x);
	__m128d x4 = _mm_mul_pd (x2, x2);
     
    t1 = _mm_mul_pd (c1, x);	// t1 = 840x
    t2 = _mm_mul_pd (c2, x2);	// f2 = 180x^2
    t3 = _mm_mul_pd (c3, x3);   	// t3 = 20x^3
	
    a = _mm_add_pd (c0, t1);	// 1680+840x             
    a = _mm_add_pd (a, t2);     	// + 180*x^2          
    a = _mm_add_pd (a, t3);     	// + 20x^3         
    a = _mm_add_pd (a, x4);     	// + x^4        
    
    b = _mm_sub_pd (c0, t1);  	// 1680-840x
    b = _mm_add_pd (b, t2);     	// + 180x^2
    b = _mm_sub_pd (b, t3);     	// - 20x^3
    b = _mm_add_pd (b, x4);     	// + x^4

    r = _mm_div_pd (a, b);      	// a/b
    r = _mm_mul_pd (r, r);    	// ^2
    r = _mm_mul_pd (r, r);  	        // ^4
			    
    return r;
}

NOTE: Updated the code with tweak camsar pointed out in post below!

Also, made a SSE version of this MacLaurin based thingy (3rd SSE routine ever

):

Code: Select all

__m128d MacLaurin_EXP_SSE(__m128d x, int level){
// function plotted @ https://www.desmos.com/calculator/u7n7iui9ip

	if(level < 1 || level > 15){ printf("MacLaurinEXP_SSE: check level %i ", level); exit; }

	__m128d ret, tmp;
	
	__m128d x2 = _mm_mul_pd (x, x);		// x^2
	__m128d x3 = _mm_mul_pd (x2, x);	// x^3
	__m128d x4 = _mm_mul_pd (x2, x2);	// x^4
	__m128d x5 = _mm_mul_pd (x4, x);	// x^5
	__m128d x8 = _mm_mul_pd (x4, x4);	// x^8
	__m128d x10 = _mm_mul_pd (x5, x5);	// x^10

	// (dividers) n! 
	__m128d c15 = _mm_set1_pd (7.6471637318198164759011319857880704441551002397563244e-13);
	__m128d c14 = _mm_set1_pd (1.1470745597729724713851697978682105666232650359634486e-11);
	__m128d c13 = _mm_set1_pd (1.6059043836821614599392377170154947932725710503488281e-10);
	__m128d c12 = _mm_set1_pd (2.0876756987868098979210090321201432312543423654534765e-9);
	__m128d c11 = _mm_set1_pd (2.5052108385441718775052108385441718775052108385441718e-8);
	__m128d c10 = _mm_set1_pd (2.7557319223985890652557319223985890652557319223985890e-7);
	__m128d c9  = _mm_set1_pd (2.7557319223985890652557319223985890652557319223985890e-6);
	__m128d c8  = _mm_set1_pd (0.000024801587301587301587301587301587301587301587301587301);
	__m128d c7  = _mm_set1_pd (0.000198412698412698412698412698412698412698412698412698412);
	__m128d c6  = _mm_set1_pd (0.001388888888888888888888888888888888888888888888888888888);
	__m128d c5  = _mm_set1_pd (0.008333333333333333333333333333333333333333333333333333333);
	__m128d c4  = _mm_set1_pd (0.041666666666666666666666666666666666666666666666666666666);
	__m128d c3  = _mm_set1_pd (0.166666666666666666666666666666666666666666666666666666666);
	__m128d c2  = _mm_set1_pd (0.5);
	__m128d c1  = _mm_set1_pd (1.0);    
	
	switch (level){
	case 15: // error [-0.5, 0.5] b^x ~2.00914267e-16  2^x ~1.57009246e-16
		ret = _mm_mul_pd (x10, x5); // x^15
	    ret = _mm_mul_pd (ret, c15); // x^15*1/15!
	case 14: // error [-0.5, 0.5] b^x ~2.00914267e-16  2^x ~1.57009246e-16
		tmp = _mm_mul_pd (x10, x4); // x^14
	    tmp = _mm_mul_pd (tmp, c14);
	    ret = _mm_add_pd (ret, tmp);
	case 13: // b^x error [-0.5, 0.5] b^x ~1.09826899e-15  2^x ~1.57009246e-16
		tmp = _mm_mul_pd (x8, x5);
	    tmp = _mm_mul_pd (tmp, c13);
	    ret = _mm_add_pd (ret, tmp);
	case 12: // b^x error [-0.5, 0.5] b^x ~3.11176214e-14  2^x ~1.57009246e-16
	   	tmp = _mm_mul_pd (x8, x4);
	   	tmp = _mm_mul_pd (tmp, c12);
	   	ret = _mm_add_pd (ret, tmp);
	case 11: // b^x error [-0.5, 0.5] b^x ~8.09058156e-13  2^x ~8.79251777e-15
	    tmp = _mm_mul_pd (x8, x3);
	   	tmp = _mm_mul_pd (tmp, c11);
	   	ret = _mm_add_pd (ret, tmp);
	case 10: // b^x error [-0.5, 0.5] b^x ~1.93588214e-11  2^x ~2.98160558e-13
	    tmp = _mm_mul_pd (x5, x5);
	   	tmp = _mm_mul_pd (tmp, c10);
	   	ret = _mm_add_pd (ret, tmp);
	case 9: // b^x error [-0.5, 0.5] b^x ~4.24335993e-10  2^x ~9.44504819e-12
	    tmp = _mm_mul_pd (x8, x);
	   	tmp = _mm_mul_pd (tmp, c9);
	   	ret = _mm_add_pd (ret, tmp);
	case 8: // b^x error [-0.5, 0.5] ~8.44955828e-09  2^x ~2.71687386e-10
	    tmp = _mm_mul_pd (x8, c1);
	   	tmp = _mm_mul_pd (tmp, c8);
	   	ret = _mm_add_pd (ret, tmp);
	case 7: // b^x error [-0.5, 0.5] b^x ~1.51280538e-07  2^x ~7.02890655e-09
	    tmp = _mm_mul_pd (x4, x3);
	   	tmp = _mm_mul_pd (tmp, c7);
	   	ret = _mm_add_pd (ret, tmp);
	case 6: // b^x error [-0.5, 0.5] b^x ~2.40440100e-06  2^x ~1.61491586e-07
	    tmp = _mm_mul_pd (x3, x3);
	   	tmp = _mm_mul_pd (tmp, c6);
	   	ret = _mm_add_pd (ret, tmp);
	case 5: // b^x error [-0.5, 0.5] ~3.33751405e-05  2^x ~3.24223991e-06
	    tmp = _mm_mul_pd (x4, x);
	   	tmp = _mm_mul_pd (tmp, c5);
	   	ret = _mm_add_pd (ret, tmp);
	case 4: // b^x error [-0.5, 0.5] b^x ~3.95979357e-04  2^x ~5.56843187e-05
	    tmp = _mm_mul_pd (x4, c1);
	   	tmp = _mm_mul_pd (tmp, c4);
	   	ret = _mm_add_pd (ret, tmp);
	case 3: // b^x error [-0.5, 0.5] b^x ~3.89756562e-03  2^x ~7.94446221e-04
	    tmp = _mm_mul_pd (x3, c1);
	   	tmp = _mm_mul_pd (tmp, c3);
	   	ret = _mm_add_pd (ret, tmp);
	case 2: // b^x error [-0.5, 0.5] b^x ~3.04507942e-02  2^x ~9.01738668e-03
	    tmp = _mm_mul_pd (x, x);
	   	tmp = _mm_mul_pd (tmp, c2);
	   	ret = _mm_add_pd (ret, tmp);
	default: //b^x error [-0.5, 0.5] b^x ~1.75639365e-01  2^x ~7.59155094e-02
		tmp = _mm_add_pd (c1, x);
		ret = _mm_add_pd (ret, tmp);
	}

	return ret;
}

camsr · Post by **camsr** » Thu Sep 13, 2018 12:02 am

a(x)=(1680+840*x+180*(x^2)+20*(x^3)+(x^4))
b(x)=(1680-840*x+180*(x^2)-20*(x^3)+(x^4))

( a(.25x)/b(.25x) )^4

This is quite good for accuracy!

juha_p · Post by **juha_p** » Thu Sep 13, 2018 3:02 pm

... yes!.

Tried it this way :

Code: Select all

#define myexp_a(x) (1680 + 840 * x + 180 * (x * x) + 20 * (x * x * x) + (x * x * x * x))
#define myexp_b(x) (1680 - 840 * x + 180 * (x * x) - 20 * (x * x * x) + (x * x * x * x))
#define myexp(x) pow((myexp_a(0.25*x)/myexp_b(0.25*x)), 4)

BTW, if one wants to try do above functionality as inline __m128d type functions, is there a change to get it working with 'full speed of SSE' ?

EDIT: Got the 'original' formula simplied a bit (maybe speed improve as well) :

Code: Select all

#define myexp(x) (40 * x * (x * x + 42))/(x * (x * ((x - 20) * x + 180) - 840) + 1680) + 1

and with the .25x tweak:

Code: Select all

x *= 0.25; 
return (7965941760000 + x * (15931883520000 + x * (15362887680000 + x * (9483264000000 +
x * (4194989568000 + x * (1410296832000 + x * (372798720000 + x * (79059456000 +
x * (13598150400 + x * (1904102400 + x * (216470400 + x * (19785600 + x * (1426720 + 
x * (78560 + x * (3120 + x * (80 + x))))))))))))))))/(7965941760000 + x * (-15931883520000 + 
x * (15362887680000 + x * (-9483264000000 + x * (4194989568000 + x * (-1410296832000 + 
x * (372798720000 + x * (-79059456000 + x * (13598150400 + x * (-1904102400 + 
x * (216470400 + x * (-19785600 + x * (1426720 + x * (-78560 + x * (3120 + 
(-80 + x) * x)))))))))))))));

, but this is maybe slower than std::exp.

camsr · Post by **camsr** » Fri Sep 14, 2018 3:52 pm

camsr wrote:
Making some progress on this idea. Not sure what to say about GCC but it's not easy to coax a compilation that works.

Using a macro definately works:
Code: Select all
#define U 1.f
AT(noinline) float abc_pur(  const float const a, float b, float c)
{
    float q = 0.f;
    if (U > 0.f) { q = b-c; }
    else { q = b/c; }
    return q;
}
The emitted function will have removed the division and the branch.

But using a very very constant only seems to work if the function is inlined:
Code: Select all
static const float const A[] = {1.f};
AT(inline) float abc_inl(  const float const a, float b, float c)
{
    float q = 0.f;
    if (a > 0.f) { q = b-c; }
    else { q = b/c; }
    return q;
}
And if not inlined, the branch will remain. Not sure yet why. It might have something to do with 'a' being a function argument.

And.. it happens that if the function is declared static, it will eliminate the branch with the compile-time constant as an argument. Problem solved.

juha_p · Post by **juha_p** » Sat Sep 15, 2018 6:23 am

camsr wrote:...

( a(.25x)/b(.25x) )^4

This is quite good for accuracy!

Updated the SSE version with this tweak ... slowed the original function ~40% but, still it's ~8 times faster than std::exp() and ~32 times faster than std::pow() ([-0.5, 0.5] tested)
(by test routine I linked earlier).

2DaT · Post by **2DaT** » Sat Sep 15, 2018 7:58 am

juha_p wrote: Updated the SSE version with this tweak ... slowed the original function ~40% but, still it's ~8 times faster than std::exp() and ~32 times faster than std::pow() ([-0.5, 0.5] tested)
(by test routine I linked earlier).

That's a really slow exp implementation. What compiler is that?

juha_p · Post by **juha_p** » Sat Sep 15, 2018 9:44 am

2DaT wrote:That's a really slow exp implementation. What compiler is that?

MonoDevelop in Ubuntu g++ as compiler w/o optimizations ... also, remember that I don't know much of implementing SSE code so many things can be wrong there and the test process is what it is (just look the code linked in one of my posts earlier in this thread).

BTW, the original function I did without camsr's ()^4 tweak is (by the test I used) actually equal by speed with your exp_mp() implementation on page 1 of this thread. Dunno how much it would effect in speed if data type of your function is __m128d ?

Test:

Code: Select all

starttime=times(&timestats);
for (unsigned long i=0;i<1000000;i++)
     for (unsigned long j=0;j<100;j++)
     arg.f[0] = farray[i];
	memcpy (&x, &arg, sizeof(x));
	y = exp_mp (x);
	memcpy (&res, &y, sizeof(y));
cout << "\nexp_mp() time = " << times(&timestats) - starttime << endl;

Results:
std::exp() time = 197ms
std::pow() time = 779ms
exp_mp() time = 17ms
easyEXP_sse() time = 17ms
easyEXP_sse_tweaked() time = 23ms

My b^x approximation