KVR Audio

potato6 · Post by **potato6** » Thu Oct 09, 2014 2:00 am

I'm experimenting with several oscillator and LFO classes. In particular, I found this wavetable class quite good: http://www.earlevel.com/main/2012/05/03 ... roduction/
I'm also using a generic table-based LFO.

Everything works, but the CPU usage is way too high compared to Reaktor with almost the same settings.
What's the best approach to get decent oscillators without too much CPU load?

itoa · Post by **itoa** » Thu Oct 09, 2014 6:00 am

Use SIMD (e.g. SSE, or the Accelerate framework on OS X) and limit memory access.

Mayae · Post by **Mayae** » Thu Oct 09, 2014 4:46 pm

Using IIR oscillators you can generate nearly perfect sine and cosine pairs extremely quickly. With careful optimization, you can get a sine+cosine pair every clock cycle, with no memory access; results vary with the available instruction sets, of course. The fastest algorithm I found takes 6-7 arithmetic operations per rotation, which, if pipelined properly on my CPU, all take a single clock cycle. Using AVX, it generates 8 pairs in parallel per rotation.

I don't see anything beating that.

This approach is not the best for modulation, though. Read this topic: http://www.kvraudio.com/forum/viewtopic ... 3&t=412674
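
A minimal scalar sketch of the rotation idea, assuming the plain "multiply the phasor by a fixed complex number" form (not necessarily the 6-7 operation algorithm mentioned above). The struct and member names are hypothetical. The coefficients are computed once in the constructor; each sample after that costs 4 multiplies and 2 adds with no table reads.

```cpp
#include <cassert>
#include <cmath>

// Recursive sine/cosine generator: rotates a unit phasor by a fixed angle per sample.
// Names are hypothetical, chosen for this sketch.
struct RotationOscillator {
    double c, s;    // cos and sin of the per-sample angle (the "coefficients")
    double re, im;  // current cosine and sine outputs

    RotationOscillator(double freqHz, double sampleRate, double startPhase = 0.0) {
        assert(sampleRate > 0.0);
        const double kTwoPi = 6.283185307179586;
        const double w = kTwoPi * freqHz / sampleRate;  // radians per sample
        c = std::cos(w);
        s = std::sin(w);
        re = std::cos(startPhase);
        im = std::sin(startPhase);
    }

    // Advance one sample: 4 multiplies and 2 adds, no memory lookups.
    void tick() {
        const double nre = c * re - s * im;
        const double nim = s * re + c * im;
        re = nre;
        im = nim;
    }
};
```

Read `im` for the sine and `re` for the cosine, then call `tick()`. Rounding errors accumulate slowly over very long runs, so a real implementation would likely renormalize the pair now and then.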

potato6 · Post by **potato6** » Thu Oct 09, 2014 11:07 pm

Thanks for the replies.

I don't know anything about SIMD/SSE, I'll try to check that.

The IIR oscillators look interesting, but why are they not the best for modulation? I'll study the thread you linked. Thanks!

Mayae · Post by **Mayae** » Fri Oct 10, 2014 3:21 pm

potato6 wrote:Thanks for the replies.

I don't know anything about SIMD/SSE, I'll try to check that.

For the IIR oscillators, looks interesting but why they are not the best for modulation? I'll study that thread you linked. Thanks!

The basic principle is that you calculate the coefficients for them (the more precisely, the better) and let them rotate indefinitely. Calculating the coefficients is quite expensive, but you only have to do it once. If you want to alter the frequency, though, you have to recalculate the coefficients; ergo, the more often you modulate, the less suited they are, computationally.
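
To make the cost split concrete, here is a hedged sketch (hypothetical names, plain rotation form assumed) where the expensive part is isolated in `setFrequency` and the cheap part in `render`. Updating the frequency once per block instead of once per sample keeps the `cos`/`sin` calls rare, while the sine/cosine state carries over so the phase stays continuous.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

struct ModulatedRotationOscillator {
    double sampleRate;
    double c = 1.0, s = 0.0;    // rotation coefficients
    double re = 1.0, im = 0.0;  // cosine / sine state, kept across frequency changes

    explicit ModulatedRotationOscillator(double sr) : sampleRate(sr) { assert(sr > 0.0); }

    // The costly part: one cos and one sin call per frequency change.
    void setFrequency(double freqHz) {
        const double w = 6.283185307179586 * freqHz / sampleRate;
        c = std::cos(w);
        s = std::sin(w);
    }

    // The cheap part: render a block of sine samples with the current coefficients.
    void render(float* out, std::size_t count) {
        for (std::size_t i = 0; i < count; ++i) {
            out[i] = static_cast<float>(im);
            const double nre = c * re - s * im;
            im = s * re + c * im;  // uses the old re, which is still intact here
            re = nre;
        }
    }
};
```

With a 64-sample block, this pays for two trigonometric calls per 64 output samples; modulating every sample would pay for them every time.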

Smashed Transistors · Post by **Smashed Transistors** » Sat Oct 11, 2014 9:48 am

With SSE you can run 4 oscillators for the cost of one.

If you need sawtooth oscillators, DPW oscillators can easily benefit from SSE.
For sine/cosine, IIR oscillators are not good for modulation, but you can use a complex resonator instead. These can also benefit from SSE.
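
For readers who have not met DPW (differentiated parabolic waveform), here is a scalar sketch of the second-order variant: square a naive sawtooth, take the first difference, and rescale. All names are hypothetical, and the scale factor `sampleRate / (4 * freq)` is the simple form without any further correction. The per-sample work is a handful of arithmetic operations, which is the kind of code that maps naturally onto SIMD lanes (one voice per lane).

```cpp
#include <cassert>
#include <cstddef>

// Second-order DPW sawtooth sketch (scalar). Names are hypothetical.
struct DpwSaw {
    double phase = 0.5;   // normalized phase in [0, 1); 0.5 maps to a naive saw value of 0
    double inc;           // phase increment per sample = freq / sampleRate
    double scale;         // sampleRate / (4 * freq), undoes the differentiator's gain
    double prevSq = 0.0;  // previous squared saw value (matches the starting phase)

    DpwSaw(double freqHz, double sampleRate) {
        assert(freqHz > 0.0 && freqHz < 0.5 * sampleRate);
        inc = freqHz / sampleRate;
        scale = sampleRate / (4.0 * freqHz);
    }

    double next() {
        phase += inc;
        if (phase >= 1.0) phase -= 1.0;
        const double saw = 2.0 * phase - 1.0;      // naive (aliasing) sawtooth in [-1, 1)
        const double sq = saw * saw;               // parabolic waveform
        const double out = (sq - prevSq) * scale;  // first difference, rescaled
        prevSq = sq;
        return out;
    }
};
```

Away from the wrap point, the output equals the average of the current and previous naive saw values; at the wrap, the difference of squares smooths the jump, which is where the alias reduction comes from.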

mystran · Post by **mystran** » Sun Oct 12, 2014 5:28 am

I just have to say this: wavetable lookup oscillators (using linear interpolation), when properly optimized, are really one of the more efficient algorithms. In fact, I can't imagine anything else that could come close to the quality (which you can make arbitrarily good by using more memory) and speed at the same time (especially if you use fixed-point math).

Now there might be various reasons why you might want to use something else (to reduce the memory footprint, or to get more flexibility in terms of morphing stuff analytically, etc.), but performance shouldn't be one of them. In other words: this was the algorithm of choice back when we didn't have the CPU for anything fancier.

Another thing is whether a particular implementation is efficient. Example code is often written to be easy to understand rather than as fast as possible.
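
As an illustration of the fixed-point variant, here is a sketch that assumes a single 2048-entry sine table and a 32-bit phase accumulator; all names are hypothetical, and a band-limited oscillator would add per-octave tables on top of this. The top 11 bits of the phase are the table index, the low 21 bits are the interpolation fraction, and the phase wrap comes for free from unsigned overflow.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a fixed-point wavetable oscillator; all names are hypothetical.
struct FixedPointWavetableOsc {
    static constexpr int kTableBits = 11;              // 2048-entry table
    static constexpr int kFracBits = 32 - kTableBits;  // remaining bits are the fraction
    static constexpr std::uint32_t kFracMask = (1u << kFracBits) - 1u;

    std::vector<float> table;  // 2048 + 1 entries; the last one repeats the first
    std::uint32_t phase = 0;   // full 32-bit phase, wraps on overflow
    std::uint32_t inc = 0;     // phase increment per sample

    FixedPointWavetableOsc() : table((1u << kTableBits) + 1u) {
        const double kTwoPi = 6.283185307179586;
        const std::size_t n = table.size() - 1;
        for (std::size_t i = 0; i <= n; ++i)
            table[i] = static_cast<float>(std::sin(kTwoPi * double(i) / double(n)));
    }

    void setFrequency(double freqHz, double sampleRate) {
        assert(freqHz >= 0.0 && freqHz < sampleRate);
        inc = static_cast<std::uint32_t>(freqHz / sampleRate * 4294967296.0);  // * 2^32
    }

    float next() {
        const std::uint32_t idx = phase >> kFracBits;
        const float frac = float(phase & kFracMask) * (1.0f / float(1u << kFracBits));
        const float a = table[idx];
        const float b = table[idx + 1];  // safe because of the extra last entry
        phase += inc;                    // unsigned overflow is the phase wrap
        return a + frac * (b - a);
    }
};
```

The inner loop is one shift, one mask, two loads, and a multiply-add, with no branches and no floating-point wrap logic.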

hibrasil · Post by **hibrasil** » Sun Oct 12, 2014 10:00 am

Does anyone have any examples of SSE oscillators?

Mayae · Post by **Mayae** » Sun Oct 12, 2014 10:45 am

hibrasil wrote:Does anyone have any examples of SSE oscillators?

This will run 4 oscillators in parallel (untested):

Code:

#include <cmath>
#include <xmmintrin.h> // SSE intrinsics

void simdosc(float * freqs, float * phases, float sampleRate)
{

	__m128 oc1, oc2, osin, ocos, t0, t1, t2;

	{
		// alignas must precede the type; _mm_load_ps needs 16-byte alignment
		alignas(16) float buf[4][4];
		const double pi = 3.14159265358979323846;

		for (int i = 0; i < 4 /* 8 for avx*/; ++i)
		{
			// omega = tan(w / 2), where w is the phase step per sample in radians
			auto const omega = std::tan(pi * freqs[i] / sampleRate);
			auto const z = 2 / (1 + omega * omega);
			buf[0][i] = z - 1;      // c1 = cos(w)
			buf[1][i] = omega * z;  // c2 = sin(w)
			buf[2][i] = std::sin(phases[i]);
			buf[3][i] = std::cos(phases[i]);
		}

		oc1 = _mm_load_ps(buf[0]);
		oc2 = _mm_load_ps(buf[1]);
		osin = _mm_load_ps(buf[2]);
		ocos = _mm_load_ps(buf[3]);
	}
	while (true) // sketch only: replace with a loop over your output buffer
	{

		// use oscillators or store here

		/*
			generate the next sines and cosines:
			auto const t0 = oc1 * ocos - oc2 * osin;
			auto const t1 = oc2 * ocos + oc1 * osin;
			ocos = t0;
			osin = t1;

			if overloaded operators exist, use prev code 
		*/

		t0 = _mm_mul_ps(ocos, oc1); // c1 * cos
		t1 = _mm_mul_ps(osin, oc2); // c2 * sin
		t2 = _mm_sub_ps(t0, t1); // final cosine, t0

		ocos = _mm_mul_ps(ocos, oc2); // c2 * cos
		osin = _mm_mul_ps(osin, oc1); // c1 * sin

		osin = _mm_add_ps(ocos, osin); // final sine, t1
		ocos = t2;

	}

}

As for running them serially (i.e., getting the next 4 pairs of sines / cosines of a single oscillator), I haven't got that to work yet, see this thread.

earlevel · Post by **earlevel** » Mon Oct 13, 2014 5:13 pm

potato6 wrote:I'm experimenting with several Oscillators and LFOs classes, in particular I found this wavetable class quite good: http://www.earlevel.com/main/2012/05/03 ... roduction/
And I'm using a generic LFO using tables.

Everything works but the CPU is way too much compared to Reaktor with almost same settings.
What's the best approach to get decent oscillators with not too much CPU load?

Glad you like the wavetable class.

Some ideas: The oscillator is pretty lightweight, but it's also a tutorial, so I didn't want to make it too confusing or unreadable. There are some optimizations possible even before SIMD.

As you can see, looking at the getOutput function, the CPU load is in two pieces: the while loop that figures out which table to grab (which takes longer for higher octaves), and the linear interpolation. For the while loop, if you limit the step between tables to a fixed size (an octave, which is what will be used in almost all cases anyway), you could calculate which table to use in a single step, without the loop. It's a little difficult to understand, reading the code, and I didn't see it giving a consistent or significant improvement, so I left the cruder while loop.

For the linear interpolation, one extremely easy improvement is to make your wavetables one sample longer (2049 samples instead of 2048, etc.), then you can dispense with the test and branch for the linear interpolation.
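
A small sketch of that guard-sample trick, with hypothetical function names and a normalized phase in [0, 1) assumed (the tutorial class may organize this differently). The first function needs a wrap test on the second index; the second reads from a table with one extra entry and needs none.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Linear interpolation with a wrap test on the second index.
// `table` holds exactly one period (e.g. 2048 samples); phase is in [0, 1).
inline float lookupWithWrapTest(const std::vector<float>& table, double phase) {
    const std::size_t len = table.size();
    const double pos = phase * double(len);
    const std::size_t i0 = static_cast<std::size_t>(pos);
    std::size_t i1 = i0 + 1;
    if (i1 >= len) i1 = 0;  // the test and branch
    const float frac = static_cast<float>(pos - double(i0));
    return table[i0] + frac * (table[i1] - table[i0]);
}

// Same interpolation on a table with one extra guard sample
// (e.g. 2049 entries, where guarded[2048] == guarded[0]): no test needed.
inline float lookupWithGuard(const std::vector<float>& guarded, double phase) {
    assert(guarded.size() >= 2);
    const std::size_t len = guarded.size() - 1;  // period length without the guard
    const double pos = phase * double(len);
    const std::size_t i0 = static_cast<std::size_t>(pos);
    const float frac = static_cast<float>(pos - double(i0));
    return guarded[i0] + frac * (guarded[i0 + 1] - guarded[i0]);
}

// Build the guarded copy of a one-period table.
inline std::vector<float> addGuardSample(std::vector<float> table) {
    assert(!table.empty());
    const float first = table.front();
    table.push_back(first);
    return table;
}
```

Both functions return the same values for any phase in [0, 1); the guarded version just trades one float of memory per table for a branch-free inner loop.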

Also, by making the wavetables a bit larger, you could go with zero-order interpolation (which is supported in the oscillator via a #define). That may seem horrible, but consider that with a constant wavetable size for each octave, the higher octaves are progressively more oversampled. (The oscillator supports variable table size per octave, but that can only save you less than 50% in memory anyway.) That is, with a 2048 table size (1x oversampling in the bottom octave), the upper octaves are already oversampled way more than necessary for linear interpolation. Upping the table size further, they are good enough for truncating. (And the lower octaves don't need as much oversampling in general: if the upper octaves sound OK, the lower ones will be fine.)

At this point, the function call overhead is not so insignificant, so you could try making it inline (compilers can be finicky about this). There's also template metaprogramming. And there are other ways to reduce the per-function-call overhead: if you can generate samples by the buffer, unroll loops, etc., you can keep execution in cache much more easily, at the expense of flexibility (feedback within the synth algorithm, etc.).

Using the oscillator as an LFO-only, you'd probably just use one big table, getting rid of the while loop and using no interpolation, making it inline. Of course, a far bigger improvement is to not update the LFO at every sample tick—run your control rate at a different rate than the audio rate (you mentioned Reaktor, and of course this is what Reaktor does, and it lets you set the control update rate).
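
A rough sketch of the control-rate idea, with hypothetical names throughout. The LFO is only evaluated once every `interval` audio samples and its last value is held in between; `std::sin` stands in here for the single big table read described above.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical LFO that is only evaluated once every `interval` audio samples.
struct ControlRateLfo {
    double phase = 0.0;         // normalized phase in [0, 1)
    double incPerUpdate;        // phase advance per control update
    std::size_t interval;       // audio samples between control updates
    std::size_t countdown = 0;  // samples left until the next update
    float value = 0.0f;         // last computed LFO value, held between updates
    std::size_t updates = 0;    // how many times the LFO was actually evaluated

    ControlRateLfo(double freqHz, double sampleRate, std::size_t updateInterval)
        : incPerUpdate(freqHz * double(updateInterval) / sampleRate),
          interval(updateInterval) {
        assert(updateInterval > 0 && sampleRate > 0.0);
    }

    // Call once per audio sample; the waveform is only evaluated every `interval` calls.
    float tick() {
        if (countdown == 0) {
            value = static_cast<float>(std::sin(6.283185307179586 * phase));
            phase += incPerUpdate;
            phase -= std::floor(phase);  // wrap back into [0, 1)
            countdown = interval;
            ++updates;
        }
        --countdown;
        return value;
    }
};
```

With an interval of 32 at 48 kHz, one second of audio costs 1500 LFO evaluations instead of 48000. The held value is stepped, so a parameter that is sensitive to zipper noise may want smoothing on the receiving end.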

potato6 · Post by **potato6** » Sun Oct 19, 2014 8:19 pm

Thanks a lot for the reply. I really like your blog and I think you should write more stuff.

I did a few tweaks as you suggested and the CPU usage improved a bit. I'm not really sure how to avoid the while loop in getOutput, could you explain this a bit more?
Thanks.

earlevel · Post by **earlevel** » Sun Oct 19, 2014 11:05 pm

potato6 wrote:I did a few tweaks as you suggested and the CPU usage improved a bit. I'm not really sure how to avoid the while loop in getOutput, could you explain this a bit more?

Well, since you ask, I did try this once:

Code:

    double f = this->phaseInc * (sampleRate / baseFrequency * 2);
    int32_t c = (*(((int32_t *)(&f))+1) >> 20) - 1023 + 1;  // grab exponent, round it up
    if (c < 0)
        c = 0;
    else if (c >= this->numWavetables)
        c = this->numWavetables - 1;
    
    wavetable *wavetable = &this->wavetables[c];

It was a little faster and a lot uglier, and it requires access to sampleRate and baseFrequency (you could save the result of sampleRate / baseFrequency * 2 as a static variable, or something). It's just getting the frequency relative to the lowest wavetable octave as a double, and using its exponent to tell you which octave the frequency is in. Note that the pointer cast assumes a little-endian IEEE 754 double. Maybe there's a more efficient implementation, but I was just exploring the general idea of using the exponent.

Still, I didn't find the while loop to be much of a hit, and I don't think I got much improvement going this way, but wanted to show you a loop-free method. I'm sure there are ways to pre-calculate some things and get the table number more directly from the phase increment, but that makes the code harder to understand, for a pretty small improvement, and after all it's a tutorial, and pretty efficient overall considering the simplicity.
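
For comparison, a portable sketch of the same exponent idea using `std::ilogb` from `<cmath>` instead of a pointer cast. The function name is hypothetical, and `ratio` plays the role of `f` in the snippet above. It is not claimed to be faster than either the cast or the while loop; it only avoids depending on the in-memory layout of a double.

```cpp
#include <cassert>
#include <cmath>

// Pick a wavetable index from the frequency ratio without a loop.
// Computes floor(log2(ratio)) + 1, clamped to [0, numWavetables - 1].
inline int tableIndexFromRatio(double ratio, int numWavetables) {
    assert(numWavetables > 0);
    if (!(ratio > 0.0)) return 0;      // zero, negative, or NaN: use the lowest table
    const int e = std::ilogb(ratio);   // unbiased binary exponent, floor(log2(ratio))
    if (e < 0) return 0;               // ratio below 1.0
    if (e >= numWavetables - 1) return numWavetables - 1;  // also covers infinity
    return e + 1;                      // exponent rounded up
}
```

For example, ratios in [1, 2) select table 1 and ratios in [2, 4) select table 2, matching the "grab exponent, round it up" step of the cast version for positive inputs.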

earlevel · Post by **earlevel** » Sun Oct 19, 2014 11:57 pm

BTW, be sure to compile with optimizations enabled. I let the compiler do trivial optimizations, and keep the code clean. The reason I didn't pursue the above optimization (essentially calculating log frequency by grabbing the floating point exponent) is that although it gave me a decent improvement in testing, when optimizations were enabled, the gain went away, and I actually lost a trivial amount compared with the while loop for a full frequency sweep. (It would probably win by a tiny bit if you were doing mostly top-octave stuff, since the while loop takes longer for higher octaves, while the log calculation is constant at any frequency.) In a simple test of generating a frequency sweep of a sawtooth oscillator, it's about three times faster with "release" compiler optimizations enabled.

earlevel · Post by **earlevel** » Mon Oct 20, 2014 7:06 am

I did a quick test: 10 ms to generate 20 seconds of samples at 44.1 kHz for the best case (low frequencies—that's 2000x real time), 18 ms for the highest octave (14 ms in the middle—for typical uses, the base frequency of an oscillator is rarely in the top octave). Then, I made getOutput "inline", which cut an additional couple of milliseconds off the time.

All of this is without changing the frequency: just calls to update the phase and get the output. Calling setFrequency every time only adds around half a millisecond or less. The oscillator doesn't seem very CPU intensive to me, but I don't have anything handy to compare it with. Still, the best case (low octave) is at around 8.6 ms here (with the inlined getOutput, and calling setFrequency), and adding a single multiply to update the frequency variable on each call for the exponential sweep adds 4 ms (while sweeping the oscillator throughout the audio range). That tells me that the wavetable oscillator is extremely efficient (12.7 ms for the 20 sec sweep, 20 Hz-20 kHz). So, I'm wondering if it's some of the other things you're doing to calculate frequency? Are you using something like the exp() function?

If that was confusing, here's my test, repeated 88,200 times; I did unroll the "for" loop by a factor of 16 to minimize its effect and get more of just the oscillator calls. It took about 5 ms off the total time:

Code:

        osc->setFrequency(freqVal);
        soundBuf[idx] = osc->getOutput() * gainMult;
        osc->updatePhase();
        freqVal *= freqMult;    // exponential frequency sweep
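
The `freqMult` constant is what keeps `exp()` out of the inner loop. A sketch of how it could be derived, with a hypothetical helper name (the test code above does not show how its value was computed): one `pow()` call up front gives the per-sample ratio, and the loop then needs only a multiply.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Per-sample multiplier for an exponential sweep from startHz to endHz over numSamples.
// One pow() call up front replaces an exp()/pow() call per sample.
inline double sweepMultiplier(double startHz, double endHz, std::size_t numSamples) {
    assert(startHz > 0.0 && endHz > 0.0 && numSamples > 0);
    return std::pow(endHz / startHz, 1.0 / double(numSamples));
}
```

For a 20 Hz to 20 kHz sweep over 882,000 samples (20 s at 44.1 kHz), start with `freqVal = 20.0` and multiply by `sweepMultiplier(20.0, 20000.0, 882000)` once per sample; after the last sample `freqVal` lands on 20 kHz up to rounding error.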

Ichad.c · Post by **Ichad.c** » Mon Oct 20, 2014 7:37 am

Agner Fog gives a decent description/example of wavetables in his vectorclass manual,
http://www.agner.org/optimize/vectorclass.pdf - you don't need to use his class. He also gives some general descriptions of vectorization strategies, which are very useful.

Oscillators and LFOs not too heavy on CPU