KVR Audio

DukeRoodee · Post by **DukeRoodee** » Wed Dec 16, 2009 3:55 pm

Hi all,

I need some advice on code optimisation. Hopefully somebody here has good C/C++ knowledge and can help me?

I am developing a VSTi which (basically) works with 1 to 100-odd oscillators (sine wave generators which render an instrument's partials).

I have this code in my sources:

for( ii=0; ii<nrOfOscillators; ii++ ) {

    //
    // render partials signal (not shown here)
    // signal = (......)

    spsignal += signal;

    nDegrees[ii] += nDegreesPerFrame[ii];
    if( nDegrees[ii] > 360000 )
        nDegrees[ii] -= 360000;
}

where nDegrees[ii] holds the current angle of each oscillator (in thousandths of a degree), so I can render the signal of each partial with something like "sin( nDegrees[ii] / 1000.0)".
I don't count from 0..360 because I actually access a precalculated sine table with 360000 values (I also tried a few other approaches to calculating/approximating sine values, but no way... the precalculated table is the fastest so far).

So far so good: this is not the problem, it works fine.

The problem is that these 3 lines

nDegrees[ii] += nDegreesPerFrame[ii];
if( nDegrees[ii] > 360000 )
    nDegrees[ii] -= 360000;

cost me around 8-10% of CPU (on a 2.2 GHz Core 2 Duo) when running around 100 oscillators (meaning the 3 lines run around 100 times for each sample frame).

The calculation of the entire signal (with modulators, envelopes, etc., really a great many multiplications) takes LESS CPU than this addition, compare and subtraction.
nDegrees[] and nDegreesPerFrame[] are arrays of double. I also tried arrays of int, but saw no visible improvement.

How could I optimise this loop? I am not really an expert in writing performant code, so... does somebody have an idea?

Thanks in advance

regards
Rudi

antto · Post by **antto** » Wed Dec 16, 2009 4:07 pm

EDIT: nDegrees is your phase, nDegreesPerFrame is the phase-increment coefficient, which i also call "omega"
anyway, i see you "wrap" the phase at 360000, uh, why is that?

if your phase was 0.0 to 1.0 then here is a possible solution:
nDegrees[ii] += nDegreesPerFrame[ii];
nDegrees[ii] -= floor(nDegrees[ii]);

tho, the floor() call is probably gonna be bad too (not sure) but i'm sure there are approximations of it (erm, faster floor() tricks)
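
A minimal C++ sketch of the floor-based wrap described above, assuming the phase is stored as a fraction of one cycle (0.0 up to, but not including, 1.0). The function name `advancePhase` is made up for illustration and is not from the thread.

```cpp
#include <cassert>
#include <cmath>

// Advance a normalised phase by one sample and wrap it back into [0.0, 1.0).
// `increment` is freq / samplerate, i.e. cycles per sample (assumed >= 0).
inline double advancePhase(double phase, double increment)
{
    assert(increment >= 0.0);
    phase += increment;
    phase -= std::floor(phase);   // drop the integer part, keep the fraction
    return phase;
}
```

Unlike the compare-and-subtract version, this also wraps correctly when the increment is larger than one full cycle, at the price of a floor() call per sample.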

otristan · Post by **otristan** » Wed Dec 16, 2009 4:10 pm

1) Try to modify your code so the wrapping is done automatically using the natural integer type overflow.

2) Try using a local copy in the loop and only save the values at the end.

johnrule · Post by **johnrule** » Wed Dec 16, 2009 4:18 pm

nDegrees[ii] += nDegreesPerFrame[ii];
if( nDegrees[ii] > 360000 )
    nDegrees[ii] -= 360000;

Try putting the value into a local var rather than searching the array every time...

int nD = nDegrees[ii];
int nDF = nDegreesPerFrame[ii];

nD += nDF;

if( nD > 360000 )
    nD -= 360000;

JR
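
One way to combine the local-variable idea with the write-back it needs is sketched below. It assumes the plugin renders a block of several frames per call, so the loops can be swapped (oscillators outside, frames inside) and each array element is loaded and stored once per block instead of once per frame. The function name `advancePhases` and the `numFrames` parameter are hypothetical, and the wrap uses `>=` on the assumption that the table has exactly 360000 entries.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Advances every oscillator by `numFrames` samples.
// Phases are in thousandths of a degree and kept in [0, 360000).
// Assumes every per-frame increment is smaller than 360000.
void advancePhases(std::vector<int>& nDegrees,
                   const std::vector<int>& nDegreesPerFrame,
                   int numFrames)
{
    assert(nDegrees.size() == nDegreesPerFrame.size());
    for (std::size_t ii = 0; ii < nDegrees.size(); ++ii) {
        int nD = nDegrees[ii];                 // one load per oscillator per block
        const int nDF = nDegreesPerFrame[ii];
        for (int frame = 0; frame < numFrames; ++frame) {
            // ... render this partial from nD and add it to an output buffer ...
            nD += nDF;
            if (nD >= 360000)
                nD -= 360000;
        }
        nDegrees[ii] = nD;                     // one store per oscillator per block
    }
}
```

Whether this actually beats the original loop depends on the compiler and on how the rest of the per-partial rendering is structured, so it is worth measuring.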

nollock · Post by **nollock** » Wed Dec 16, 2009 4:50 pm

DukeRoodee wrote: I tried also arrays of int, but no visible improvement.

int omega = int((freq / samplerate) * 4294967296.0);

That gets your phase step in integer format. You have to be sure freq stays below samplerate/2, or else it'll overflow the signed int and you'll get something weird. Maybe even a floating point exception, can't remember which atm.

phase += omega;   // phase is an unsigned 32-bit int

That will do your phase increment, and it will naturally wrap around.

Then make your lookup table size a power of 2, and shift the phase to the right to create the index. E.g.:

float sinlut[32768];

output = sinlut[phase >> 17];
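
Put together, the scheme above could look roughly like the following C++ sketch. The names `makeSineTable`, `makeOmega` and `Oscillator` are invented for illustration, and the table is filled with std::sin here only to keep the example self-contained.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

const int TABLE_BITS = 15;                         // 32768 entries, as in the post
const std::uint32_t TABLE_SIZE = 1u << TABLE_BITS;

// One full sine cycle in TABLE_SIZE entries.
std::vector<float> makeSineTable()
{
    const double twoPi = 6.283185307179586;
    std::vector<float> table(TABLE_SIZE);
    for (std::uint32_t i = 0; i < TABLE_SIZE; ++i)
        table[i] = static_cast<float>(std::sin(twoPi * i / TABLE_SIZE));
    return table;
}

// Phase step per sample, expressed as a 32-bit fraction of one cycle.
std::uint32_t makeOmega(double freq, double samplerate)
{
    assert(freq >= 0.0 && freq < samplerate / 2.0);
    return static_cast<std::uint32_t>((freq / samplerate) * 4294967296.0);
}

struct Oscillator {
    std::uint32_t phase = 0;
    std::uint32_t omega = 0;

    // Returns the current sample, then advances the phase.
    float next(const std::vector<float>& table)
    {
        const float out = table[phase >> (32 - TABLE_BITS)];
        phase += omega;            // unsigned overflow wraps modulo 2^32
        return out;
    }
};
```

Because the phase is unsigned, the wrap-around is well defined in C++ and needs no compare or subtract; the top 15 bits of the phase select the table entry.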

antto · Post by **antto** » Wed Dec 16, 2009 5:25 pm

nollock: hm, interesting..
how does it sound when you "sweep" or "fine-tune" the pitch of an oscillator that uses this "quantized" coefficient?

DukeRoodee · Post by **DukeRoodee** » Wed Dec 16, 2009 5:48 pm

Hi everybody

Thanks for your advice.
I think the trick with the "natural integer type overflow" is the most promising and I have already implemented it. It seems (not sure, because I haven't measured this yet) to give me back a few %.

I'll also try the "lookup table a power of 2" later. Let's see...

I know that the table with 360000 entries may sound a bit weird, but as the instrument is still in an experimental stage, I didn't give too much importance to such things. I just wanted to make things work in a short time, to get a prototype working. Now that it works and seems to make sense, I'm trying to decrease the CPU load as much as possible (currently at 30% with all modulation sources, playing 1 voice... that's pretty much) :-/

>>Try putting the value into a local var rather than searching the array every time... <<
I have to write the new value back into the array... don't I lose the previously gained performance at that point?

Rudi

nollock · Post by **nollock** » Wed Dec 16, 2009 6:59 pm

antto wrote:nollock: hm, interesting..
how does it sound when you "sweep" or "fine-tune" the pitch of an oscillator that uses this "quantized" coefficient?

Ok, the smaller omega gets, the more accuracy it loses, so this is an example of a worst case scenario, to reflect that...

samplerate = 192000;
freq = 10;   // Hz

int omega = int((freq / samplerate) * 4294967296.0);

gives an omega of 223696.

So between 2 adjacent frequencies there is a ratio of

223696/223695 = 1.00000447037

Convert that back into octaves

pitch_step = Log2(1.00000447037) = 6.449369973e-6

Multiply that by 1200 and it's 0.0077392 cents.

Which means it's accurate to around a 129th of a cent, even in such extreme cases.
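
The same arithmetic can be wrapped in a small function to check other frequency and sample-rate combinations. This is only a sketch of the calculation above; the function name `pitchStepCents` is made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Size, in cents, of the pitch step between two adjacent integer omega
// values at the given frequency and sample rate.
double pitchStepCents(double freq, double samplerate)
{
    assert(freq > 0.0 && freq < samplerate / 2.0);
    const double omega = std::floor((freq / samplerate) * 4294967296.0);
    assert(omega >= 2.0);                       // need a neighbour below omega
    return 1200.0 * std::log2(omega / (omega - 1.0));
}
```

Calling `pitchStepCents(10.0, 192000.0)` reproduces the worst-case figure worked out above; higher frequencies or lower sample rates give a larger omega and therefore an even finer step.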

MadBrain · Post by **MadBrain** » Wed Dec 16, 2009 7:56 pm

Yeah, I use the same technique as Nollock described:

inline float channel::generate15()
{
	int t;

	if(!--clock)
	{
		clock=16;
		osc[3].env.work();
		osc[2].env.work();
		osc[1].env.work();
		osc[0].env.work();
		lpf.env.work();
		lpf.minitick();
	}

	osc[3].env.ramp_out += osc[3].env.ramp;
	osc[2].env.ramp_out += osc[2].env.ramp;
	osc[1].env.ramp_out += osc[1].env.ramp;
	osc[0].env.ramp_out += osc[0].env.ramp;

	osc[3].pos+=osc[3].freq;
	osc[2].pos+=osc[2].freq;
	osc[1].pos+=osc[1].freq;
	osc[0].pos+=osc[0].freq;

	t =(osc[3].wave[osc[3].pos >>20] * osc[3].env.ramp_out);
	t+=(osc[2].wave[osc[2].pos >>20] * osc[2].env.ramp_out);
	t+=(osc[1].wave[osc[1].pos >>20] * osc[1].env.ramp_out);
	t+=(osc[0].wave[osc[0].pos >>20] * osc[0].env.ramp_out);

	return lpf.generate(t/65536.0f/65536.0f);
}

This is from an FM synth, so it generates 4 oscillators at the same time. Note that it doesn't run the envelopes (and thus the potentially more costly calculations) every cycle: it does so only every 16 cycles, and then simply interpolates linearly between calculations (using ramp and ramp_out). Other stuff is also calculated only every 16 cycles or less often, such as filter cutoff, freq... I used waveforms that are 4096 samples long and stored as 16 bit, as a compromise between cache usage and better sound (256 sample waveforms were very grainy), although by that point it's probably reading mostly from L2 cache anyway. In your case, though, you might see a performance gain if you can get it to read most data from L1 cache (with some linear interpolation you can make the tables much smaller, but then linear interpolation has a CPU cost, so I dunno if it's a good trade-off).
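
The "recalculate every 16 samples, ramp linearly in between" idea can be sketched in isolation as follows. This is not the code from the synth above: `RampedEnvelope`, `kBlock` and the `computeTarget` callback are hypothetical names, and the callback merely stands in for whatever expensive envelope calculation is being amortised.

```cpp
#include <cassert>

struct RampedEnvelope {
    static constexpr int kBlock = 16;  // control-rate interval in samples
    float value = 0.0f;                // interpolated value, updated every sample
    float step = 0.0f;                 // per-sample increment towards the target
    int countdown = 1;                 // forces a recalculation on the first call

    // `computeTarget` is invoked only once every kBlock samples.
    template <typename F>
    float next(F computeTarget)
    {
        if (--countdown == 0) {
            countdown = kBlock;
            const float target = computeTarget();
            step = (target - value) / kBlock;  // reach the target in kBlock samples
        }
        value += step;                         // cheap per-sample work
        return value;
    }
};
```

The per-sample cost is one decrement, one rarely taken branch and one addition, while the expensive part runs at one sixteenth of the sample rate.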

My code uses about 1.6% CPU per channel on a P3 600 MHz (1.3% without the filter, 0.7% without the envelope ramping and with only grainy 256 sample waveforms). Also note that it relies on the compiler's optimizations to remove potentially costly operations (floating point division and function calls, for instance). The clock code probably generates a branch miss every 16 samples (branch misses can cost something like 20 cycles on a P4, for instance) but that's an acceptable cost imho. Also note that float to integer conversion can be slow depending on the compiler and CPU (due to some x86 quirk), which is why array indexes are best left as integers.

In your case, you should also see whether, instead of processing all the oscillators in an iterative loop, it might help to process a few oscillators (say, maybe 8) at the same time in each loop iteration, and then potentially add the result to some buffer. This is equivalent to some loop unrolling, and with some luck the compiler will put some values in registers (or it might not make any difference).

If this fails to give you enough improvement on CPU usage to get you as many oscs as you want, you might want to look at mathematical techniques. AFAIK there's a technique for synthesizing lots of sine waves at various frequencies at the same time, and that is used in some additive synth VSTs (the ones by Camel Audio, I think). If your frequencies are harmonic or nearly harmonic, you can have a huge speed gain by using some form of FFT, of course.

DukeRoodee · Post by **DukeRoodee** » Thu Dec 17, 2009 9:56 am

OK, I followed some of the advice, namely
-"value into a local var rather than searching the array every time..."
-"natural integer type overflow"
-"lookup table a power of 2"

All in all it gives me back some % of CPU. Running 96 oscillators without any modulation (this is the lowest note), it seems to be an improvement of something like 8%. Well, I am not talking about CPU load itself, but "VST" load.

What do you guys think... running a 96-oscillator synth (which is basically comparable to a synth playing 96 sine voices at a time, I would say) at around 20% VST load on a 2.2 GHz Core 2 Duo... is this something you would call an acceptable value (for this specific technique)?

@Madbrain:
>>AFAIK there's a technique for synthesizing lots of sine waves at various frequencies at the same time, and that is used in some additive synth VSTs (the ones by Camel audio I think). If your frequencies are harmonic or nearly harmonic, you can have a huge speed gain by using some form FFT of course.<<

Yes, this would be a possibility. I am "aware" of this FFT technique, although I still do not understand it very well mathematically. Implementation would not be the main problem, as there are really many implementation examples available. What I see is that FFT creates blocks of n samples, so I lose a great part of the ability to control each harmonic in realtime. Anyway, I'll have a deeper look into this.

Thanks to all !
Rudi

dadaumpa · Post by **dadaumpa** » Thu Dec 17, 2009 10:43 am

nollock wrote:int omega = int((freq / samplerate) * 4294967296.0);
phase += omega;   // phase is an unsigned 32-bit int
float sinlut[32768];
output = sinlut[phase >> 17];

nollock++, I knew the technique, but never found such a simple, beautiful, elegant explanation, with sample code. Amazing.

I have a question, though. You have no linear interpolation between lut samples, meaning the same value is repeated for very (very, very) small omegas.

I understand that the difference could be negligible, but wouldn't it affect the sound making it slightly "poorer" than a float-based lut with interpolation?

Or do you have a smart trick to take those other 17 bits into consideration too?

cheers,
Aldo

nollock · Post by **nollock** » Thu Dec 17, 2009 6:53 pm

dadaumpa wrote: I understand that the difference could be negligible, but wouldn't it affect the sound making it slightly "poorer" than a float-based lut with interpolation?

I personally wouldn't do it without interpolation, because you need massive lookup tables to get a reasonable SNR. And with huge lookup tables you're going to have lots of cache misses, which can cost 100s of CPU cycles.

So to add linear interpolation I'd do something like this.

const int WAVEBITS = 10;
const int WAVESIZE = 1 << WAVEBITS;
const int FRACBITS = 32-WAVEBITS;
const int FRACMASK = (1 << FRACBITS)-1;
const double FRACSCALE = 1.0 / (1 << FRACBITS);

// For the wave lookup, repeat the very first entry at the end
// so we don't need to mask the index; that's why the array
// size is one larger.

float wavetable[WAVESIZE+1];

// Omega calc (freq must stay below samplerate/2)

int omega = int((freq / samplerate) * 4294967296.0);

// Oscillator code, run once per sample.
// phase is an unsigned 32-bit int, so it wraps around by itself.

phase += omega;

int idx = phase >> FRACBITS;

float tmp = wavetable[idx];
output = tmp + (phase & FRACMASK)*FRACSCALE*(wavetable[idx+1]-tmp);
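
For reference, a self-contained version of the interpolated lookup might look like the sketch below. It keeps the constants from the post but makes them unsigned; `makeWavetable` and `lookup` are names invented here, and filling the table with a sine is an assumption made only so the example can run on its own.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

const int WAVEBITS = 10;
const std::uint32_t WAVESIZE = 1u << WAVEBITS;
const int FRACBITS = 32 - WAVEBITS;
const std::uint32_t FRACMASK = (1u << FRACBITS) - 1u;
const float FRACSCALE = 1.0f / static_cast<float>(1u << FRACBITS);

// WAVESIZE + 1 entries: the last one repeats the first, so idx + 1 is
// always a valid index and no masking is needed.
std::vector<float> makeWavetable()
{
    const double twoPi = 6.283185307179586;
    std::vector<float> table(WAVESIZE + 1);
    for (std::uint32_t i = 0; i < WAVESIZE; ++i)
        table[i] = static_cast<float>(std::sin(twoPi * i / WAVESIZE));
    table[WAVESIZE] = table[0];
    return table;
}

// Reads the table at a 32-bit phase: the top WAVEBITS bits pick the entry,
// the remaining FRACBITS bits blend it with the next entry.
float lookup(const std::vector<float>& table, std::uint32_t phase)
{
    assert(table.size() == WAVESIZE + 1);
    const std::uint32_t idx = phase >> FRACBITS;
    const float frac = static_cast<float>(phase & FRACMASK) * FRACSCALE;
    const float a = table[idx];
    return a + frac * (table[idx + 1] - a);
}
```

A per-sample oscillator would then call `lookup(table, phase)` followed by `phase += omega`, with `phase` and `omega` both held as unsigned 32-bit integers.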

nollock · Post by **nollock** » Thu Dec 17, 2009 7:17 pm

DukeRoodee wrote:
What do you guys think... running a 96-oscillator synth (which is basically comparable to a synth playing 96 sine voices at a time, I would say) at around 20% VST load on a 2.2 GHz Core 2 Duo... is this something you would call an acceptable value (for this specific technique)?

I'd say it's very high for the actual instruction count. What's likely happening is that you're thrashing the cache with your huge lookup table. An easy way to test would be to reduce the lookup table size to something like 256 samples. (Yeah, it'll sound awful.) But it'll give you an idea of how much time is spent waiting for the CPU to fetch data from memory compared to how much is spent actually crunching numbers.

DukeRoodee · Post by **DukeRoodee** » Thu Dec 17, 2009 7:48 pm

nollock wrote: I'd say it's very high for the actual instruction count. What's likely happening is that you're thrashing the cache with your huge lookup table. An easy way to test would be to reduce the lookup table size to something like 256 samples. (Yeah, it'll sound awful.) But it'll give you an idea of how much time is spent waiting for the CPU to fetch data from memory compared to how much is spent actually crunching numbers.

Hmmm, yes, this is probably true; I am going to test this a little later.
Hmmm 2: what could I do about it? Using a smaller lookup table makes it sound worse. Using sin() or the approximations which I found here in the forum makes it even slower (I tested this a while ago).
Your "linear interpolation" approach in your previous post seems to be a good compromise between lookup table size and accuracy, right? What would you say is a "good", performant lookup table size? 2^15?

Sorry for these maybe dumb questions, but I am relatively new to such performance topics...
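
One way to approach the table-size question is to measure how far a linearly interpolated sine table strays from the true sine for each candidate size. The sketch below does that in double precision; the function name `maxInterpolationError` and the probing scheme are assumptions made for illustration, not something from the thread.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Worst absolute error of a linearly interpolated sine table with
// 2^bits entries, probed at `probes` evenly spaced phases over one cycle.
double maxInterpolationError(int bits, int probes)
{
    assert(bits >= 2 && bits <= 20 && probes > 0);
    const double twoPi = 6.283185307179586;
    const int size = 1 << bits;
    std::vector<double> table(size + 1);          // +1 guard entry
    for (int i = 0; i <= size; ++i)
        table[i] = std::sin(twoPi * i / size);

    double worst = 0.0;
    for (int p = 0; p < probes; ++p) {
        const double pos = static_cast<double>(p) / probes * size;  // in [0, size)
        const int idx = static_cast<int>(pos);
        const double frac = pos - idx;
        const double approx = table[idx] + frac * (table[idx + 1] - table[idx]);
        const double exact = std::sin(twoPi * pos / size);
        worst = std::fmax(worst, std::fabs(approx - exact));
    }
    return worst;
}
```

For a sine, the worst-case error of linear interpolation is bounded by h²/8 with h = 2π/N for an N-entry table, so each doubling of the table size cuts the error by roughly a factor of four. That gives a rough basis for trading table size (and cache footprint) against accuracy, although how audible the error is still has to be judged by ear.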

DukeRoodee · Post by **DukeRoodee** » Thu Dec 17, 2009 8:53 pm

OK, the approach with a 2^8 lookup table (with linear interpolation) gives me back around 7%.
On my computer at home (only a 2 GHz Core 2 Duo) I am now at 17% VST load (instead of 24% with a 2^19 lookup table, no interpolation) for 96 oscillators.
With a 2^15 lookup table with interpolation it is 18.5%.
I don't know about the resulting sound because I am listening through notebook speakers.

This is not bad i must say...

VSTi/C++ code optimisation ...need some advice.