You can probably get away with error quite a bit bigger than that (at least one order of magnitude, but perhaps more, especially if you force C0 or C1 continuity, 'cos then it's mostly the 3rd harmonic which gets enough psychoacoustic masking that it's mostly a matter of whether you need it clean on an FFT). As far as performance, Estrin's scheme might help if the stuff doesn't otherwise pipeline well enough (though it might pipeline multiple evaluations just fine, in which case it's just a question of latency, idk).juha_p wrote: Thu Nov 16, 2023 5:03 pm I could not test this njuffa's implementation right now (I've done it couple years ago and remember it being slow on emulated fma) but, he says it's 1.5 ULP in range [0:PI].
Sine hard sync
- KVRAF
- 8487 posts since 12 Feb, 2006 from Helsinki, Finland
-
rigatoni_modular rigatoni_modular https://www.kvraudio.com/forum/memberlist.php?mode=viewprofile&u=682452
- KVRer
- 23 posts since 7 Nov, 2023
I'm not too picky about accuracy as long as I don't have a bunch of spurious harmonics. Right now I'm using this archived article's method but reduced to 7th order which still seems to give decent results. I process my partials in groups of 4 using exclusively SIMD instructions which helps a bit to keep things fast.
Also, to reduce the amount of sin calculations I'm doing I've been experimenting with a `sinToCos` implementation using the fact that sqrt(1-sin^2) gives a rectified cosine. I'm only using this to get the derivative values for BLEP.
Also, to reduce the amount of sin calculations I'm doing I've been experimenting with a `sinToCos` implementation using the fact that sqrt(1-sin^2) gives a rectified cosine. I'm only using this to get the derivative values for BLEP.
- KVRAF
- 8487 posts since 12 Feb, 2006 from Helsinki, Finland
It this actually faster than just computing another poly or perhaps derivative of your main poly if you don't wanna range-reduce separately (and want to reuse the x*x terms, etc)? On x86 sqrtps is generally ~7 cycles throughput (11 latency) and I think the poly might go faster than that (most CPUs do 2 multiplies per cycle, some do 2 adds as well.. even if you don't want to use FMA which should usually be 2 per cycle).rigatoni_modular wrote: Thu Nov 16, 2023 6:03 pm Also, to reduce the amount of sin calculations I'm doing I've been experimenting with a `sinToCos` implementation using the fact that sqrt(1-sin^2) gives a rectified cosine. I'm only using this to get the derivative values for BLEP.
-
rigatoni_modular rigatoni_modular https://www.kvraudio.com/forum/memberlist.php?mode=viewprofile&u=682452
- KVRer
- 23 posts since 7 Nov, 2023
Hmm, making a combination Chebyshev sin/cos reusing some of the range reduction and sin terms (comes in handy because of the product rule) worked well - thanks!mystran wrote: Thu Nov 16, 2023 7:04 pmIt this actually faster than just computing another poly or perhaps derivative of your main poly if you don't wanna range-reduce separately (and want to reuse the x*x terms, etc)? On x86 sqrtps is generally ~7 cycles throughput (11 latency) and I think the poly might go faster than that (most CPUs do 2 multiplies per cycle, some do 2 adds as well.. even if you don't want to use FMA which should usually be 2 per cycle).rigatoni_modular wrote: Thu Nov 16, 2023 6:03 pm Also, to reduce the amount of sin calculations I'm doing I've been experimenting with a `sinToCos` implementation using the fact that sqrt(1-sin^2) gives a rectified cosine. I'm only using this to get the derivative values for BLEP.
-
rigatoni_modular rigatoni_modular https://www.kvraudio.com/forum/memberlist.php?mode=viewprofile&u=682452
- KVRer
- 23 posts since 7 Nov, 2023
Despite my order 0 blep working great now, the 1st-3rd orders are a bit iffy and also introduce weird artifacts, most notably some high frequency noise bumps a little over 1/4 and 1/2 of the nyquist frequency. I'm tempted to chalk this up to the inaccuracies in my residuals, which I mentioned in a previous reply to this topic but still haven't been able to address.
You do not have the required permissions to view the files attached to this post.
- KVRAF
- 8487 posts since 12 Feb, 2006 from Helsinki, Finland
How long is your BLEP and what window are you using? Are you computing the higher order BLEPs by numerical integration or using the analytic approach?rigatoni_modular wrote: Thu Nov 16, 2023 9:49 pm Despite my order 0 blep working great now, the 1st-3rd orders are a bit iffy and also introduce weird artifacts, most notably some high frequency noise bumps a little over 1/4 and 1/2 of the nyquist frequency. I'm tempted to chalk this up to the inaccuracies in my residuals, which I mentioned in a previous reply to this topic but still haven't been able to address.
-
rigatoni_modular rigatoni_modular https://www.kvraudio.com/forum/memberlist.php?mode=viewprofile&u=682452
- KVRer
- 23 posts since 7 Nov, 2023
I'm using 8 zero crossings with 16x oversampling, Blackman window. I'm still trying to figure out why my higher order BLEPs are all requiring me to window them at each step to get a decent result. My 0th residual is scaled to have a zero-crossing per unit of input (x is scaled by pi) and normalized to have the numerically integrated BLEP go from -0.5 to 0.5. I'm computing the rest analytically using:mystran wrote: Thu Nov 16, 2023 9:59 pm How long is your BLEP and what window are you using? Are you computing the higher order BLEPs by numerical integration or using the analytic approach?
- ∆h_1 = pi*x*∆h_0 + cos(pi*x)/pi
- ∆h_2 = 1/2 * (pi*x*∆h_0 + sin(pi*x)/pi)
- ∆h_3 = 1/3 * (pi*x*∆h_0 - cos(pi*x)/pi)
as per
which is based on the Native Instruments paper (though the sin/cos terms and the lack of division by n in that desmos link confused me)2DaT wrote: Tue Nov 07, 2023 6:48 pm Here is corrected version that computes analytic blep residuals.
https://www.desmos.com/calculator/j3jzbjc10m
Bleps 1, 2 and 3 look about right, but I am not 100% sure about oscillating nature of higher bleps.
Either way, I'm not getting even the results from that Desmos link perhaps because of small errors in my numerically integrated 0th order BLEP.
-
- KVRian
- 636 posts since 21 Jun, 2013
Where does that 1/n scaling comes from?rigatoni_modular wrote: Thu Nov 16, 2023 10:13 pm I'm using 8 zero crossings with 16x oversampling, Blackman window. I'm still trying to figure out why my higher order BLEPs are all requiring me to window them at each step to get a decent result. My 0th residual is scaled to have a zero-crossing per unit of input (x is scaled by pi) and normalized to have the numerically integrated BLEP go from -0.5 to 0.5. I'm computing the rest analytically using:
- ∆h_1 = pi*x*∆h_0 + cos(pi*x)/pi
- ∆h_2 = 1/2 * (pi*x*∆h_0 + sin(pi*x)/pi)
- ∆h_3 = 1/3 * (pi*x*∆h_0 - cos(pi*x)/pi)
Also take extra care about how high order bleps are scaled; 1st order blep is always scaled by delta, 2nd order by delta^2 etc. That extra scaling comes from chain rule of derivatives, f(g(x))' = f'(g(x))*g'(x) and g(x) in that particular case is in fact delta for simple phasor.
That also ensures the lower frequency we have, the scaling for high order bleps gets asymptotically lower. So going below let's say 2nd order blep even for sine may not be very productive, since for realistic frequencies (let's say below 5kHz) scaling for high order bleps gets very low.
Oversampling also changes the things somewhat, because it scales deltas by 0.5 and that means that the impact from high order bleps gets reduced by 2^n, making high order bleps even less necessary.
-
rigatoni_modular rigatoni_modular https://www.kvraudio.com/forum/memberlist.php?mode=viewprofile&u=682452
- KVRer
- 23 posts since 7 Nov, 2023
On page 4 of the Native Instruments paper they derive that `n * ∆h_n(t)` = the expression on the RHS, so I'd expect to have to divide the RHS by n to get just the residual.2DaT wrote: Thu Nov 16, 2023 10:56 pmWhere does that 1/n scaling comes from?rigatoni_modular wrote: Thu Nov 16, 2023 10:13 pm I'm using 8 zero crossings with 16x oversampling, Blackman window. I'm still trying to figure out why my higher order BLEPs are all requiring me to window them at each step to get a decent result. My 0th residual is scaled to have a zero-crossing per unit of input (x is scaled by pi) and normalized to have the numerically integrated BLEP go from -0.5 to 0.5. I'm computing the rest analytically using:
- ∆h_1 = pi*x*∆h_0 + cos(pi*x)/pi
- ∆h_2 = 1/2 * (pi*x*∆h_0 + sin(pi*x)/pi)
- ∆h_3 = 1/3 * (pi*x*∆h_0 - cos(pi*x)/pi)
Also take extra care about how high order bleps are scaled; 1st order blep is always scaled by delta, 2nd order by delta^2 etc. That extra scaling comes from chain rule of derivatives, f(g(x))' = f'(g(x))*g'(x) and g(x) in that particular case is in fact delta for simple phasor.
That also ensures the lower frequency we have, the scaling for high order bleps gets asymptotically lower. So going below let's say 2nd order blep even for sine may not be very productive, since for realistic frequencies (let's say below 5kHz) scaling for high order bleps gets very low.
Oversampling also changes the things somewhat, because it scales deltas by 0.5 and that means that the impact from high order bleps gets reduced by 2^n, making high order bleps even less necessary.
Also, what delta are you referring to in this instance? The discontinuity? The phase delta at the discontinuity? The phase delta 0-1 for a given phasor across the whole sample?
- KVRAF
- 8487 posts since 12 Feb, 2006 from Helsinki, Finland
I don't think this is going to work. The analytical integrals assume the base case ∆h_0 is unwindowed integrated sinc, the sine integral Si(x) (minus the trivial step? too lazy to check the paper right now). If you are numerically integrating the base case, then you should probably just numerically integrate the rest too, though your kernel is probably too short (and perhaps the window too poor as well) to get three good derivatives.rigatoni_modular wrote: Thu Nov 16, 2023 10:13 pm I'm using 8 zero crossings with 16x oversampling, Blackman window. I'm still trying to figure out why my higher order BLEPs are all requiring me to window them at each step to get a decent result. My 0th residual is scaled to have a zero-crossing per unit of input (x is scaled by pi) and normalized to have the numerically integrated BLEP go from -0.5 to 0.5. I'm computing the rest analytically using:
- ∆h_1 = pi*x*∆h_0 + cos(pi*x)/pi
- ∆h_2 = 1/2 * (pi*x*∆h_0 + sin(pi*x)/pi)
- ∆h_3 = 1/3 * (pi*x*∆h_0 - cos(pi*x)/pi)
I'd also use more branches, like maybe 256 or something, 'cos it both brings down linear interpolation error and improves the accuracy of numerical integration... and with numerical integration make sure to use trapezoidal (or better, but trapezoidal has been enough for me) if you are going to try integrating multiple times, because Euler is really no good for that.
-
- KVRian
- 636 posts since 21 Jun, 2013
Maybe that's right, I didn't do these myself, numerical approach work just fine.rigatoni_modular wrote: Thu Nov 16, 2023 11:10 pm On page 4 of the Native Instruments paper they derive that `n * ∆h_n(t)` = the expression on the RHS, so I'd expect to have to divide the RHS by n to get just the residual.
The delta from this code.rigatoni_modular wrote: Thu Nov 16, 2023 11:10 pm Also, what delta are you referring to in this instance? The discontinuity? The phase delta at the discontinuity? The phase delta 0-1 for a given phasor across the whole sample?
Code: Select all
phase += delta;
if ( phase >= 1.0f )
{
phase -= 1.0f;
fraction = phase / delta;
}-
rigatoni_modular rigatoni_modular https://www.kvraudio.com/forum/memberlist.php?mode=viewprofile&u=682452
- KVRer
- 23 posts since 7 Nov, 2023
Switched to trapezoidal integration. It's working well for the 0th order sinc -> step. Getting slightly more influence from negative areas than positive areas in the integration so debugging that.mystran wrote: Thu Nov 16, 2023 11:17 pm I'd also use more branches, like maybe 256 or something, 'cos it both brings down linear interpolation error and improves the accuracy of numerical integration... and with numerical integration make sure to use trapezoidal (or better, but trapezoidal has been enough for me) if you are going to try integrating multiple times, because Euler is really no good for that.
Come to think of it, my 0th order looks a little bit off too, the oscillations aren't centered around 0.
You do not have the required permissions to view the files attached to this post.
-
- KVRian
- 636 posts since 21 Jun, 2013
Could be a lot better than this.juha_p wrote: Thu Nov 16, 2023 5:03 pm I could not test this njuffa's implementation right now (I've done it couple years ago and remember it being slow on emulated fma) but, he says it's 1.5 ULP in range [0:PI].
For oscillator purposes we don't care about relative accuracy whatsoever, so ULP calculations don't matter as much. And we don't need accurate range reductions, because the range is [0,1] and is known ahead of time. Probably can get with much more error than for typical library implementation.
I've tried two approaches: range reducing to [-0.25,0.25] and do extra tricks to correct the sign and range reducing to [-0.5,0.5] and compute a polynomial without anything extra.
The thing with [-0.25, 0.25] is that we can use a low order polynomial for both sin/cos and reuse some part of computation to compute both sin/cos at the same time.
In practice however, approximating half-cycle instead of quarter cycle doesn't grow the polynomial order by a lot, and processors and so good at evaluating FMAs that "trickery" is not profitable.
Here are both approaches in SSE2 on [0,1]
Code: Select all
inline void sincos2pi_sse2(__m128 x, __m128& sin_out, __m128& cos_out)
{
const __m128 half = _mm_set1_ps(0.5f);
const __m128 quarter = _mm_set1_ps(0.25f);
const __m128 signmask = _mm_set1_ps(-0.0f);
__m128 mask = _mm_cmpgt_ps(x,half);
x = _mm_sub_ps(x, quarter);
x = _mm_sub_ps(x, _mm_and_ps(mask, half));
__m128 sign_flip = _mm_and_ps(signmask, mask);
__m128 f2 = _mm_mul_ps(x, x);
#if 1
const __m128 c0 = _mm_set1_ps(-6.283164024e+00);
const __m128 c1 = _mm_set1_ps(4.133714294e+01);
const __m128 c2 = _mm_set1_ps(-8.134076691e+01);
const __m128 c3 = _mm_set1_ps(7.099343109e+01);
__m128 c = c3;
c = _mm_add_ps(_mm_mul_ps(c, f2), c2);
c = _mm_add_ps(_mm_mul_ps(c, f2), c1);
c = _mm_add_ps(_mm_mul_ps(c, f2), c0);
c = _mm_mul_ps(c, _mm_xor_ps(x,sign_flip));
cos_out = c;
#endif
#if 1
const __m128 s0 = _mm_set1_ps(9.999933243e-01);
const __m128 s1 = _mm_set1_ps(-1.973575211e+01);
const __m128 s2 = _mm_set1_ps(6.466053772e+01);
const __m128 s3 = _mm_set1_ps(-7.821613312e+01);
__m128 s = s3;
s = _mm_add_ps(_mm_mul_ps(s, f2), s2);
s = _mm_add_ps(_mm_mul_ps(s, f2), s1);
s = _mm_add_ps(_mm_mul_ps(s, f2), s0);
s = _mm_xor_ps(s, sign_flip);
sin_out = s;
#endif
}
inline __m128 sin2pi_sse2(__m128 x)
{
const __m128 half = _mm_set1_ps(0.5f);
x = _mm_sub_ps(x, half);
__m128 f2 = _mm_mul_ps(x, x);
const __m128 s0 = _mm_set1_ps(-6.283161640e+00);
const __m128 s1 = _mm_set1_ps(4.133687973e+01);
const __m128 s2 = _mm_set1_ps(-8.144869232e+01);
const __m128 s3 = _mm_set1_ps(7.491574097e+01);
const __m128 s4 = _mm_set1_ps(-3.356100464e+01);
__m128 s = s4;
s = _mm_add_ps(_mm_mul_ps(s, f2), s3);
s = _mm_add_ps(_mm_mul_ps(s, f2), s2);
s = _mm_add_ps(_mm_mul_ps(s, f2), s1);
s = _mm_add_ps(_mm_mul_ps(s, f2), s0);
s = _mm_mul_ps(s, x);
return s;
}- KVRAF
- 8487 posts since 12 Feb, 2006 from Helsinki, Finland
What happens if you use a longer kernel?rigatoni_modular wrote: Fri Nov 17, 2023 12:44 am Come to think of it, my 0th order looks a little bit off too, the oscillations aren't centered around 0.
-
rigatoni_modular rigatoni_modular https://www.kvraudio.com/forum/memberlist.php?mode=viewprofile&u=682452
- KVRer
- 23 posts since 7 Nov, 2023
Finally my residuals look good! Thanks for all the help
I have no idea what fixed my issue tbh. But they sound good and look good now.
You do not have the required permissions to view the files attached to this post.
