The best method of creating custom envelopes

DSP, Plugin and Host development discussion.

Post

mystran wrote: Sat Jul 10, 2021 5:20 pm The problem with anything having to do with caches is that there really are no nutshells and there are no simple answers. Sequential access as such usually doesn't even matter much (well, again, "it depends"), but using complete cache lines is much faster than fetching a few bytes here and a few more bytes somewhere else.
I probably won't use a LUT in this particular EG case. But still, the simple guideline holds most of the time: I've never seen a randomly accessed LUT outperform serial/ordered/localized access. At worst, they perform the same when many other things are happening. So in terms of good code design, one should always try to localize memory access within a given time window; the longer that window, the better. In other words, traversal order matters. Put differently, data access patterns matter. All of these are nutshells ;)

As for the order of code execution and its cache interference with data access patterns: modern CPUs have separate L1 data and instruction caches, which should help to a certain extent.

If you don't like nutshells, here is a good one-hour talk about the subject: https://www.youtube.com/watch?v=WDIkqP4JbkE
www.solostuff.net
Advice is heavy. So don’t send it like a mountain.

Post

S0lo wrote: Fri Jul 09, 2021 9:51 pm I usually try not to use exp/log/pow/tan etc. until I have to. But if I do, yes, I see what you mean. But then isn't that an exp/log optimization problem, not strictly a multi-segment EG problem?
For envelopes, an exponential approximation is fine. For example, this implementation:

Code: Select all

#include <stdint.h>

float fastExp3(register float x)  // cubic spline approximation
{
    union { float f; int32_t i; } reinterpreter;

    reinterpreter.i = (int32_t)(12102203.0f*x) + 127*(1 << 23);
    int32_t m = (reinterpreter.i >> 7) & 0xFFFF;  // copy mantissa
    // empirical values for small maximum relative error (8.34e-5):
    reinterpreter.i +=
         ((((((((1277*m) >> 14) + 14825)*m) >> 14) - 79749)*m) >> 11) - 626;
    return reinterpreter.f;
}

float fastExp4(register float x)  // quartic spline approximation
{
    union { float f; int32_t i; } reinterpreter;

    reinterpreter.i = (int32_t)(12102203.0f*x) + 127*(1 << 23);
    int32_t m = (reinterpreter.i >> 7) & 0xFFFF;  // copy mantissa
    // empirical values for small maximum relative error (1.21e-5):
    reinterpreter.i += (((((((((((3537*m) >> 16)
        + 13668)*m) >> 18) + 15817)*m) >> 14) - 80470)*m) >> 11);
    return reinterpreter.f;
}

Post

vasyan wrote: Mon Jul 12, 2021 3:31 am float fastExp3(register float x)  // cubic spline approximation
{
    union { float f; int32_t i; } reinterpreter;
...
It seems that the modern way of type punning is either to use memcpy (most compilers should optimize it away for int/float conversion) or C++20's std::bit_cast. Unions might work in some or even most compilers, but I'd be cautious.

Post

You might also want to consider this thread viewtopic.php?f=33&t=519852
Maybe also this one viewtopic.php?f=33&t=510794

Post

Z1202 wrote: Mon Jul 12, 2021 7:35 am You might also want to consider this thread viewtopic.php?f=33&t=519852
Maybe also this one viewtopic.php?f=33&t=510794
As a variant, use this: https://github.com/herumi/fmath

Post

Z1202 wrote: Mon Jul 12, 2021 7:05 am It seems that the modern way of type punning is either use memcpy (most compilers should optimize it out for int/float conversion), or c++20 bit_cast. The unions might work in some or even most compilers, but I'd be cautious.
The ugly situation with unions is that punning through them is legal in C but not in C++... yet because C code is regularly compiled as C++, this is probably one of those things that is technically illegal but practically works fairly reliably anyway. At least I've never seen it actually cause any issues.

Another reliable (in practice) method for C++20 is using reinterpret_cast with pointers (or references), which is also one of the few ways you can turn an address into a function pointer (which strictly speaking I guess is completely illegal, but necessary for every JIT).

Post

mystran wrote: Mon Jul 12, 2021 10:58 am Another reliable (in practice) method for C++20 is using reinterpret_cast with pointers (or references), which is also one of the few ways you can turn an address into a function pointer (which strictly speaking I guess is completely illegal, but necessary for every JIT).
Not in my book. I have been regularly running into UB with reinterpret_cast-based type punning in GCC. You need to use the "may_alias" attribute to make it work.

Post

Z1202 wrote: Mon Jul 12, 2021 12:54 pm
mystran wrote: Mon Jul 12, 2021 10:58 am Another reliable (in practice) method for C++20 is using reinterpret_cast with pointers (or references), which is also one of the few ways you can turn an address into a function pointer (which strictly speaking I guess is completely illegal, but necessary for every JIT).
Not in my book. I have been regularly running into UB with reinterpret_cast-based type punning in GCC. You need to use the "may_alias" attribute to make it work.
Good to know. I don't use GCC because I disapprove of the GNU project.

Post

Z1202 wrote: Mon Jul 12, 2021 7:35 am You might also want to consider this thread viewtopic.php?f=33&t=519852
Maybe also this one viewtopic.php?f=33&t=510794
I'm trying your 2^x approx. With N=5 it's giving me at least an 8% CPU improvement over MSVC's pow(). I'm using it for pitch (1 volt/oct to Hz conversion).

Is there a particular reason you went for i being the nearest integer to x instead of just the floor of x? Rounding just requires an extra conditional step. I've tried it both with and without the rounding, and both sound precise enough to my ears.

Here is the code I'm using

Code: Select all

float Fast2PowX5(float x)
{
    float r, t;
    int i;
    union
    {
        int   as_int;
        float as_float;
    } v;

    // Integer part
    i = (int)x; t = x - i;

    // Make i the nearest integer to x
    if (t > 0.5f) { i++; t--; }

    v.as_int = (i + 127) << 23;

    // Fractional part: degree-5 polynomial approximation of 2^t
    x = t;
    r  = 1.0f + 0.69313401135f*x; x *= t;
    r += 0.24016664624f*x; x *= t;
    r += 0.05554874811f*x; x *= t;
    r += 0.00980272025f*x; x *= t;
    r += 0.00130702937f*x;
    r *= v.as_float;
    return r;
}

Post

For straight exp() the one I'd typically use is exp_mp from this post (with a few trivial changes like doing base-2 directly, since that's usually more common in audio): viewtopic.php?p=7161124#p7161124

Post

S0lo wrote: Tue Jul 13, 2021 11:45 am Is there a particular reason you went for i being the nearest integer to x instead of just the floor of x? Rounding just requires an extra conditional step. I've tried it both with and without the rounding, and both sound precise enough to my ears.
Actually it's the other way around, floor often requires an extra conditional step, rounding (assuming your FPU is in rounding to nearest mode, normally it's the default) doesn't. Depending on the STL version and the compiler it might be (I think) std::lrint() or you often can use a dedicated processor intrinsic for that if lrint is not efficiently implemented.
(int)x is not floor, it's truncation.

Post

Branching on the fractional bits is a bad idea. It's going to predict poorly.

If you want to use an integer cast (i.e. not worry about the current rounding mode), then add +0.5 to positive numbers and -0.5 to negative numbers (before the cast). You don't even need a branch for this: you can simply copy the sign bit of the argument into the offset, but even a simple branch on the sign bit is going to predict much better in almost all real-world situations.

Post

@mystran, thanks. Still trying to wrap my head around what you said. But the point was actually to remove that conditional without needing any std or library calls.

Z1202 wrote: Tue Jul 13, 2021 1:35 pm Actually it's the other way around, floor often requires an extra conditional step, rounding (assuming your FPU is in rounding to nearest mode, normally it's the default) doesn't. Depending on the STL version and the compiler it might be (I think) std::lrint() or you often can use a dedicated processor intrinsic for that if lrint is not efficiently implemented.
(int)x is not floor, it's truncation.
My bad for saying it's floor :oops:. Truncation will give the ceiling when x is negative. But even then it will work, won't it? I mean without that conditional line and without any rounding lib calls to std::lrint(), floor(), or ceil(). Admittedly, I think (int) uses a CPU intrinsic though.

Well, practically it is working (same code as above, without that if line). It seems to be tracking 9 octaves well, with x negative below middle C and positive above it.

Post

S0lo wrote: Tue Jul 13, 2021 3:50 pmAdmittedly, I think (int) uses a CPU intrinsic though.
On modern x86 compilers, casting to int will produce a CVTTSS2SI (or CVTTSD2SI for doubles) instruction, which converts a scalar floating-point value from an SSE register into an integer register, truncating toward zero. I'd imagine this instruction exists precisely for the purpose of casting floats to ints.

If you use lrintf/lrint (or the C++11 std::lrint overloads) you should get a single CVTSS2SI/CVTSD2SI instead which applies the current rounding mode (edit: but otherwise it's identical and has exactly the same cost), which normally defaults to nearest. If you want rounding to nearest and you already have the correct rounding mode (=default) then this is faster than anything you can build around an integer cast.

If you use floor/round/ceil then you'll usually get an actual function call, so that these can set (and then restore) the rounding mode as required.

Post

camsr wrote: Sat Jul 10, 2021 8:30 am Using some polynomials, I was trying to approximate something like a second-order exponential decay. I found an ugly formula to do it; maybe there is a way to optimize it using the function's derivative. Why not take a look?
https://www.desmos.com/calculator/iceqafamt8
seems like the approximation gets better for narrower gaussians, so it would seem advisable to use a large s and pre-scale the input accordingly. but i think the range of shapes produced by the formula is quite interesting in its own right.

...however, i just came up with a recursion to produce the gaussian shape. here's the code for a little numerical experiment i just did - what's going on is explained in the comments:

Code: Select all

#include <cmath>         // for exp
#include <vector>
#include "GNUPlotter.h"  // from https://github.com/RobinSchmidt/GNUPlotCPP

void gaussianIterator()
{
  // We iteratively produce a Gaussian bell function y(x) = exp(a*x^2) where a < 0 and compare it
  // to a directly computed function. To derive the iterative/recursive rule, we consider:
  //   y(x+h) = exp(a*(x+h)^2) = exp(a*(x^2+2*h*x+h^2)) = exp(a*x^2)*exp(2*a*h*x)*exp(a*h^2)
  //          = y(x) * exp(2*a*h*x) * exp(a*h^2)
  // where the 2nd factor is just a regular decaying exponential, which we shall define as z(x). 
  // We already routinely compute such things recursively and in this case, it can be done by 
  // considering:
  //   z(x)   = exp(2*a*h*x)
  //   z(x+h) = exp(2*a*h*(x+h)) = exp(2*a*h*x) * exp(2*a*h*h) = z(x) * exp(2*a*h*h) = z(x) * d
  // where we have defined d as the decay factor. The 3rd factor is just a constant scale factor 
  // given by:
  //   s = exp(a*h^2)
  // from which we also see that d = s^2, which is also quite convenient.
  // Alternatively, we could have taken the derivative y'(x) = 2*a*x*y and solve that approximately
  // via an initial value solver. But since this will only give an approximate solution, the 
  // technique above is better. However, other functions may not admit such a computationally 
  // efficient exact recursion. In these cases, the approach via an ODE may be useful.


  int    N  = 1000;    // number of data points to generate
  double x0 = -5.0;    // start value for the x-values
  double h  =  0.01;   // step size
  double a  = -0.2;    // factor in front of the x in the exponent

  // Compute abscissa values and true values for y(x):
  using Vec = std::vector<double>;
  Vec x(N), yt(N), zt(N);
  for(int n = 0; n < N; n++)
  {
    x[n]  = x0 + n*h;
    yt[n] = exp(a*x[n]*x[n]);
    zt[n] = exp(2*a*h*x[n]);
  }

  // Compute z(x) and y(x) by exact recursion:
  Vec yr(N), zr(N);
  yr[0] = yt[0]; zr[0] = zt[0];  // initial values copied from directly computed values
  double s = exp(a*h*h);         // scaler
  double d = s*s;                // == exp(2*a*h*h), decay for exponential z(x)
  Vec errAbs(N), errRel(N);      // absolute and relative error
  for(int n = 1; n < N; n++)
  {
    //yr[n] = yr[n-1] * zt[n-1] * s;  // using the exact z(x)
    yr[n] = yr[n-1] * zr[n-1] * s;    // using the recursive z(x)
    zr[n] = zr[n-1] * d;
    errAbs[n] = yr[n] - yt[n];
    errRel[n] = errAbs[n] / yt[n];
  }

  GNUPlotter plt;
  plt.addDataArrays(N, &x[0], &yt[0], &yr[0], &errRel[0]);
  plt.plot();

  // Observations:
  // -When using the exact z(x) together with the recursion for y(x), the (accumulated) relative 
  //  error in the last datapoint is around 2.7e-15, when we use the recursively computed z(x), it's
  //  around 1.7e-11
}

here is what i get. black is the ground truth, blue is the recursively computed gaussian (there's no visible difference, so you only see blue) and green is the error.
Gauss.png
at the final datapoint, the accumulated roundoff error is around 1.7e-11. since the recursion is exact, roundoff is the only type of error that occurs here - there's no approximation error. btw, the stuff i mention there about using an ODE solver - i think this is precisely what Big Tick has done for the x^n and 1/x functions, using forward Euler... if i'm not mistaken

if you want to run the code yourself, you will either need my GNUPlotter:
https://github.com/RobinSchmidt/GNUPlotCPP
or you just comment out the plotting stuff at the bottom and inspect the resulting arrays in the debugger, or use some other means of creating the plot

i think it may be worthwhile to explore this stuff further. maybe we can create sinusoids with gaussian envelopes, aka "gaborets", by using a complex version of the recursion:
https://books.google.de/books?id=h90HIV ... et&f=false
and/or compute things like y(x) = exp(i*p(x)), where p(x) is a polynomial and i is the imaginary unit... i guess that may be useful for recursively creating sinusoids with a polynomial frequency envelope, as needed in additive synthesis and sinusoidal modeling (typically cubic)... more stuff to explore...
My website: rs-met.com, My presences on: YouTube, GitHub, Facebook
