KVR Audio

mystran · Post by **mystran** » Thu Nov 16, 2023 5:50 pm

juha_p wrote: Thu Nov 16, 2023 5:03 pm I could not test this njuffa's implementation right now (I've done it couple years ago and remember it being slow on emulated fma) but, he says it's 1.5 ULP in range [0:PI].

You can probably get away with error quite a bit bigger than that (at least one order of magnitude, but perhaps more, especially if you force C0 or C1 continuity, 'cos then it's mostly the 3rd harmonic which gets enough psychoacoustic masking that it's mostly a matter of whether you need it clean on an FFT). As far as performance, Estrin's scheme might help if the stuff doesn't otherwise pipeline well enough (though it might pipeline multiple evaluations just fine, in which case it's just a question of latency, idk).

rigatoni_modular · Post by **rigatoni_modular** » Thu Nov 16, 2023 6:03 pm

I'm not too picky about accuracy as long as I don't have a bunch of spurious harmonics. Right now I'm using this archived article's method but reduced to 7th order which still seems to give decent results. I process my partials in groups of 4 using exclusively SIMD instructions which helps a bit to keep things fast.

Also, to reduce the amount of sin calculations I'm doing I've been experimenting with a `sinToCos` implementation using the fact that sqrt(1-sin^2) gives a rectified cosine. I'm only using this to get the derivative values for BLEP.

mystran · Post by **mystran** » Thu Nov 16, 2023 7:04 pm

rigatoni_modular wrote: Thu Nov 16, 2023 6:03 pm Also, to reduce the amount of sin calculations I'm doing I've been experimenting with a `sinToCos` implementation using the fact that sqrt(1-sin^2) gives a rectified cosine. I'm only using this to get the derivative values for BLEP.

It this actually faster than just computing another poly or perhaps derivative of your main poly if you don't wanna range-reduce separately (and want to reuse the x*x terms, etc)? On x86 sqrtps is generally ~7 cycles throughput (11 latency) and I think the poly might go faster than that (most CPUs do 2 multiplies per cycle, some do 2 adds as well.. even if you don't want to use FMA which should usually be 2 per cycle).

rigatoni_modular · Post by **rigatoni_modular** » Thu Nov 16, 2023 8:55 pm

mystran wrote: Thu Nov 16, 2023 7:04 pm
rigatoni_modular wrote: Thu Nov 16, 2023 6:03 pm Also, to reduce the amount of sin calculations I'm doing I've been experimenting with a `sinToCos` implementation using the fact that sqrt(1-sin^2) gives a rectified cosine. I'm only using this to get the derivative values for BLEP.
It this actually faster than just computing another poly or perhaps derivative of your main poly if you don't wanna range-reduce separately (and want to reuse the x*x terms, etc)? On x86 sqrtps is generally ~7 cycles throughput (11 latency) and I think the poly might go faster than that (most CPUs do 2 multiplies per cycle, some do 2 adds as well.. even if you don't want to use FMA which should usually be 2 per cycle).

Hmm, making a combination Chebyshev sin/cos reusing some of the range reduction and sin terms (comes in handy because of the product rule) worked well - thanks!

rigatoni_modular · Post by **rigatoni_modular** » Thu Nov 16, 2023 9:49 pm

Despite my order 0 blep working great now, the 1st-3rd orders are a bit iffy and also introduce weird artifacts, most notably some high frequency noise bumps a little over 1/4 and 1/2 of the nyquist frequency. I'm tempted to chalk this up to the inaccuracies in my residuals, which I mentioned in a previous reply to this topic but still haven't been able to address.

Screenshot 2023-11-16 at 1.47.30 PM.png

mystran · Post by **mystran** » Thu Nov 16, 2023 9:59 pm

rigatoni_modular wrote: Thu Nov 16, 2023 9:49 pm Despite my order 0 blep working great now, the 1st-3rd orders are a bit iffy and also introduce weird artifacts, most notably some high frequency noise bumps a little over 1/4 and 1/2 of the nyquist frequency. I'm tempted to chalk this up to the inaccuracies in my residuals, which I mentioned in a previous reply to this topic but still haven't been able to address.

How long is your BLEP and what window are you using? Are you computing the higher order BLEPs by numerical integration or using the analytic approach?

rigatoni_modular · Post by **rigatoni_modular** » Thu Nov 16, 2023 10:13 pm

mystran wrote: Thu Nov 16, 2023 9:59 pm How long is your BLEP and what window are you using? Are you computing the higher order BLEPs by numerical integration or using the analytic approach?

I'm using 8 zero crossings with 16x oversampling, Blackman window. I'm still trying to figure out why my higher order BLEPs are all requiring me to window them at each step to get a decent result. My 0th residual is scaled to have a zero-crossing per unit of input (x is scaled by pi) and normalized to have the numerically integrated BLEP go from -0.5 to 0.5. I'm computing the rest analytically using:
- ∆h_1 = pi*x*∆h_0 + cos(pi*x)/pi
- ∆h_2 = 1/2 * (pi*x*∆h_0 + sin(pi*x)/pi)
- ∆h_3 = 1/3 * (pi*x*∆h_0 - cos(pi*x)/pi)

as per

2DaT wrote: Tue Nov 07, 2023 6:48 pm Here is corrected version that computes analytic blep residuals.

https://www.desmos.com/calculator/j3jzbjc10m

Bleps 1, 2 and 3 look about right, but I am not 100% sure about oscillating nature of higher bleps.

which is based on the Native Instruments paper (though the sin/cos terms and the lack of division by n in that desmos link confused me)

Either way, I'm not getting even the results from that Desmos link perhaps because of small errors in my numerically integrated 0th order BLEP.

2DaT · Post by **2DaT** » Thu Nov 16, 2023 10:56 pm

rigatoni_modular wrote: Thu Nov 16, 2023 10:13 pm I'm using 8 zero crossings with 16x oversampling, Blackman window. I'm still trying to figure out why my higher order BLEPs are all requiring me to window them at each step to get a decent result. My 0th residual is scaled to have a zero-crossing per unit of input (x is scaled by pi) and normalized to have the numerically integrated BLEP go from -0.5 to 0.5. I'm computing the rest analytically using:
- ∆h_1 = pi*x*∆h_0 + cos(pi*x)/pi
- ∆h_2 = 1/2 * (pi*x*∆h_0 + sin(pi*x)/pi)
- ∆h_3 = 1/3 * (pi*x*∆h_0 - cos(pi*x)/pi)

Where does that 1/n scaling comes from?

Also take extra care about how high order bleps are scaled; 1st order blep is always scaled by delta, 2nd order by delta^2 etc. That extra scaling comes from chain rule of derivatives, f(g(x))' = f'(g(x))*g'(x) and g(x) in that particular case is in fact delta for simple phasor.

That also ensures the lower frequency we have, the scaling for high order bleps gets asymptotically lower. So going below let's say 2nd order blep even for sine may not be very productive, since for realistic frequencies (let's say below 5kHz) scaling for high order bleps gets very low.

Oversampling also changes the things somewhat, because it scales deltas by 0.5 and that means that the impact from high order bleps gets reduced by 2^n, making high order bleps even less necessary.

rigatoni_modular · Post by **rigatoni_modular** » Thu Nov 16, 2023 11:10 pm

2DaT wrote: Thu Nov 16, 2023 10:56 pm
rigatoni_modular wrote: Thu Nov 16, 2023 10:13 pm I'm using 8 zero crossings with 16x oversampling, Blackman window. I'm still trying to figure out why my higher order BLEPs are all requiring me to window them at each step to get a decent result. My 0th residual is scaled to have a zero-crossing per unit of input (x is scaled by pi) and normalized to have the numerically integrated BLEP go from -0.5 to 0.5. I'm computing the rest analytically using:
- ∆h_1 = pi*x*∆h_0 + cos(pi*x)/pi
- ∆h_2 = 1/2 * (pi*x*∆h_0 + sin(pi*x)/pi)
- ∆h_3 = 1/3 * (pi*x*∆h_0 - cos(pi*x)/pi)
Where does that 1/n scaling comes from?

Also take extra care about how high order bleps are scaled; 1st order blep is always scaled by delta, 2nd order by delta^2 etc. That extra scaling comes from chain rule of derivatives, f(g(x))' = f'(g(x))*g'(x) and g(x) in that particular case is in fact delta for simple phasor.

That also ensures the lower frequency we have, the scaling for high order bleps gets asymptotically lower. So going below let's say 2nd order blep even for sine may not be very productive, since for realistic frequencies (let's say below 5kHz) scaling for high order bleps gets very low.

Oversampling also changes the things somewhat, because it scales deltas by 0.5 and that means that the impact from high order bleps gets reduced by 2^n, making high order bleps even less necessary.

On page 4 of the Native Instruments paper they derive that `n * ∆h_n(t)` = the expression on the RHS, so I'd expect to have to divide the RHS by n to get just the residual.

Also, what delta are you referring to in this instance? The discontinuity? The phase delta at the discontinuity? The phase delta 0-1 for a given phasor across the whole sample?

mystran · Post by **mystran** » Thu Nov 16, 2023 11:17 pm

rigatoni_modular wrote: Thu Nov 16, 2023 10:13 pm I'm using 8 zero crossings with 16x oversampling, Blackman window. I'm still trying to figure out why my higher order BLEPs are all requiring me to window them at each step to get a decent result. My 0th residual is scaled to have a zero-crossing per unit of input (x is scaled by pi) and normalized to have the numerically integrated BLEP go from -0.5 to 0.5. I'm computing the rest analytically using:
- ∆h_1 = pi*x*∆h_0 + cos(pi*x)/pi
- ∆h_2 = 1/2 * (pi*x*∆h_0 + sin(pi*x)/pi)
- ∆h_3 = 1/3 * (pi*x*∆h_0 - cos(pi*x)/pi)

I don't think this is going to work. The analytical integrals assume the base case ∆h_0 is unwindowed integrated sinc, the sine integral Si(x) (minus the trivial step? too lazy to check the paper right now). If you are numerically integrating the base case, then you should probably just numerically integrate the rest too, though your kernel is probably too short (and perhaps the window too poor as well) to get three good derivatives.

I'd also use more branches, like maybe 256 or something, 'cos it both brings down linear interpolation error and improves the accuracy of numerical integration... and with numerical integration make sure to use trapezoidal (or better, but trapezoidal has been enough for me) if you are going to try integrating multiple times, because Euler is really no good for that.

2DaT · Post by **2DaT** » Thu Nov 16, 2023 11:34 pm

rigatoni_modular wrote: Thu Nov 16, 2023 11:10 pm On page 4 of the Native Instruments paper they derive that `n * ∆h_n(t)` = the expression on the RHS, so I'd expect to have to divide the RHS by n to get just the residual.

Maybe that's right, I didn't do these myself, numerical approach work just fine.

rigatoni_modular wrote: Thu Nov 16, 2023 11:10 pm Also, what delta are you referring to in this instance? The discontinuity? The phase delta at the discontinuity? The phase delta 0-1 for a given phasor across the whole sample?

The delta from this code.

Code: Select all

phase += delta;
if ( phase >= 1.0f )
{
	phase -= 1.0f;
	fraction = phase / delta;
}

rigatoni_modular · Post by **rigatoni_modular** » Fri Nov 17, 2023 12:44 am

mystran wrote: Thu Nov 16, 2023 11:17 pm I'd also use more branches, like maybe 256 or something, 'cos it both brings down linear interpolation error and improves the accuracy of numerical integration... and with numerical integration make sure to use trapezoidal (or better, but trapezoidal has been enough for me) if you are going to try integrating multiple times, because Euler is really no good for that.

Switched to trapezoidal integration. It's working well for the 0th order sinc -> step. Getting slightly more influence from negative areas than positive areas in the integration so debugging that.

Come to think of it, my 0th order looks a little bit off too, the oscillations aren't centered around 0.

Screenshot 2023-11-16 at 5.08.35 PM.png

2DaT · Post by **2DaT** » Fri Nov 17, 2023 5:40 am

juha_p wrote: Thu Nov 16, 2023 5:03 pm I could not test this njuffa's implementation right now (I've done it couple years ago and remember it being slow on emulated fma) but, he says it's 1.5 ULP in range [0:PI].

Could be a lot better than this.

For oscillator purposes we don't care about relative accuracy whatsoever, so ULP calculations don't matter as much. And we don't need accurate range reductions, because the range is [0,1] and is known ahead of time. Probably can get with much more error than for typical library implementation.

I've tried two approaches: range reducing to [-0.25,0.25] and do extra tricks to correct the sign and range reducing to [-0.5,0.5] and compute a polynomial without anything extra.

The thing with [-0.25, 0.25] is that we can use a low order polynomial for both sin/cos and reuse some part of computation to compute both sin/cos at the same time.

In practice however, approximating half-cycle instead of quarter cycle doesn't grow the polynomial order by a lot, and processors and so good at evaluating FMAs that "trickery" is not profitable.

Here are both approaches in SSE2 on [0,1]

Code: Select all

inline void sincos2pi_sse2(__m128 x, __m128& sin_out, __m128& cos_out)
	{
		const __m128 half = _mm_set1_ps(0.5f);
		const __m128 quarter = _mm_set1_ps(0.25f);
		const __m128 signmask = _mm_set1_ps(-0.0f);

		__m128 mask = _mm_cmpgt_ps(x,half);

		x = _mm_sub_ps(x, quarter);
		x = _mm_sub_ps(x, _mm_and_ps(mask, half));

		__m128 sign_flip = _mm_and_ps(signmask, mask);
		__m128 f2 = _mm_mul_ps(x, x);

#if 1
		const __m128 c0 = _mm_set1_ps(-6.283164024e+00);
		const __m128 c1 = _mm_set1_ps(4.133714294e+01);
		const __m128 c2 = _mm_set1_ps(-8.134076691e+01);
		const __m128 c3 = _mm_set1_ps(7.099343109e+01);
		__m128 c = c3;
		c = _mm_add_ps(_mm_mul_ps(c, f2), c2);
		c = _mm_add_ps(_mm_mul_ps(c, f2), c1);
		c = _mm_add_ps(_mm_mul_ps(c, f2), c0);
		c = _mm_mul_ps(c, _mm_xor_ps(x,sign_flip));
		cos_out = c;
#endif
#if 1

		const __m128 s0 = _mm_set1_ps(9.999933243e-01);
		const __m128 s1 = _mm_set1_ps(-1.973575211e+01);
		const __m128 s2 = _mm_set1_ps(6.466053772e+01);
		const __m128 s3 = _mm_set1_ps(-7.821613312e+01);
		__m128 s = s3;
		s = _mm_add_ps(_mm_mul_ps(s, f2), s2);
		s = _mm_add_ps(_mm_mul_ps(s, f2), s1);
		s = _mm_add_ps(_mm_mul_ps(s, f2), s0);
		s = _mm_xor_ps(s, sign_flip);
		sin_out = s;
#endif

	}

	inline __m128 sin2pi_sse2(__m128 x)
	{
		const __m128 half = _mm_set1_ps(0.5f);

		x = _mm_sub_ps(x, half);

		__m128 f2 = _mm_mul_ps(x, x);

		const __m128 s0 = _mm_set1_ps(-6.283161640e+00);
		const __m128 s1 = _mm_set1_ps(4.133687973e+01);
		const __m128 s2 = _mm_set1_ps(-8.144869232e+01);
		const __m128 s3 = _mm_set1_ps(7.491574097e+01);
		const __m128 s4 = _mm_set1_ps(-3.356100464e+01);
		__m128 s = s4;
		s = _mm_add_ps(_mm_mul_ps(s, f2), s3);
		s = _mm_add_ps(_mm_mul_ps(s, f2), s2);
		s = _mm_add_ps(_mm_mul_ps(s, f2), s1);
		s = _mm_add_ps(_mm_mul_ps(s, f2), s0);
		s = _mm_mul_ps(s, x);
		return s;

	}

So in the end of the day, don't bother with anything too complicated for sine/cosine, just subtract 0.5, and do a polynomial of you liking. It has to be a P(x^2) and P(x^2)*x for cos and sine respectively.

mystran · Post by **mystran** » Fri Nov 17, 2023 8:36 am

rigatoni_modular wrote: Fri Nov 17, 2023 12:44 am Come to think of it, my 0th order looks a little bit off too, the oscillations aren't centered around 0.

What happens if you use a longer kernel?

rigatoni_modular · Post by **rigatoni_modular** » Fri Nov 17, 2023 8:50 am

Finally my residuals look good! Thanks for all the help

I have no idea what fixed my issue tbh. But they sound good and look good now.

Sine hard sync