VSTi/C++ code optimisation ...need some advice.
-
- KVRAF
- 2393 posts since 28 Mar, 2005
don't use those personally but worth taking a look imho
http://www.b-nm.at/sine-generation-tutorial/
-
- KVRian
- 1153 posts since 10 Dec, 2003
DukeRoodee wrote: What would you say is a "good", performant lookup table size? 2^15?
For a single sine lookup, a 1024-point table with linear interpolation will give 120 dB SNR.
So there's little point going higher than that.
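A quick way to check a signal-to-noise figure like this for yourself is to compare the interpolated table against `std::sin` at many phases. The sketch below is only an illustration: the function name `tableSnrDb` and the probe count are made up for this example, the table is kept in double precision, and the number you get depends on how the error is measured (RMS here).

```cpp
#include <cmath>
#include <vector>

// Measures the signal-to-noise ratio (in dB) of a linearly interpolated
// sine table against std::sin, probing `probes` evenly spaced phases.
// Hypothetical helper, not taken from the thread.
double tableSnrDb(int tableSize, int probes)
{
    const double twoPi = 6.283185307179586;

    // One extra guard point so table[idx + 1] never needs wrapping.
    std::vector<double> table(tableSize + 1);
    for (int i = 0; i <= tableSize; ++i)
        table[i] = std::sin(twoPi * i / tableSize);

    double signalPower = 0.0;
    double errorPower = 0.0;
    for (int k = 0; k < probes; ++k) {
        // Position in table units, always inside [0, tableSize).
        const double pos = (k + 0.5) * tableSize / probes;
        const int idx = static_cast<int>(pos);
        const double frac = pos - idx;

        const double approx = table[idx] + frac * (table[idx + 1] - table[idx]);
        const double exact = std::sin(twoPi * pos / tableSize);

        signalPower += exact * exact;
        errorPower += (approx - exact) * (approx - exact);
    }
    return 10.0 * std::log10(signalPower / errorPower);
}
```

Calling `tableSnrDb(1024, 1000003)` and then `tableSnrDb(2048, 1000003)` should show a gain of about 12 dB per doubling, because the linear interpolation error scales with the square of the step size.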
-
- KVRist
- Topic Starter
- 106 posts since 12 May, 2006
otristan wrote: don't use those personally but worth taking a look imho
http://www.b-nm.at/sine-generation-tutorial/
hi
gave it a try.
The "fast" algorithm works like a charm for lower frequencies. At 7 kHz it already generates everything but a sine wave, and somewhere above 12 kHz it gets completely out of control. So I'd have to oversample it by a factor of at least 2.0 (sounds awful and gives no performance advantage... well, actually I don't know if "oversampling" is the right word for this, I'll just call it that).
I got "good" sound with an oversampling factor of 10, but then it is by far slower than the sine table approach.
Anyway, thanks, it was interesting and worth a try
-
- KVRist
- Topic Starter
- 106 posts since 12 May, 2006
nollock wrote:
DukeRoodee wrote: What would you say is a "good", performant lookup table size? 2^15?
For a single sine lookup, a 1024-point table with linear interpolation will give 120 dB SNR.
So there's little point going higher than that.
OK, i'll keep that in mind, thanks
-
- KVRAF
- 2393 posts since 28 Mar, 2005
DukeRoodee wrote:
otristan wrote: don't use those personally but worth taking a look imho
http://www.b-nm.at/sine-generation-tutorial/
hi
gave it a try.
The "fast" algorithm works like a charm for lower frequencies. At 7 kHz it already generates everything but a sine wave, and somewhere above 12 kHz it gets completely out of control. So I'd have to oversample it by a factor of at least 2.0 (sounds awful and gives no performance advantage... well, actually I don't know if "oversampling" is the right word for this, I'll just call it that).
I got "good" sound with an oversampling factor of 10, but then it is by far slower than the sine table approach.
Anyway, thanks, it was interesting and worth a try
What about recomputing the seed from time to time depending on the frequency?
Like every 1024 samples, or based on the acceptable drift (audible test).
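The reseeding idea could look roughly like this. Big assumption: the tutorial's "fast" algorithm is not reproduced in this thread, so the sketch uses the common two-term recurrence y[n] = 2*cos(w)*y[n-1] - y[n-2] as a stand-in. The class name `ReseededSine` and the default interval of 1024 samples are just for illustration.

```cpp
#include <cmath>

// Recursive sine generator that recomputes its two seed values from
// std::sin every `reseedInterval` samples, so rounding drift cannot
// accumulate for longer than one interval.
// Assumption: the recurrence below stands in for the tutorial's algorithm.
class ReseededSine {
public:
    ReseededSine(double freq, double sampleRate, long reseedInterval = 1024)
        : w_(6.283185307179586 * freq / sampleRate),
          k_(2.0 * std::cos(w_)),
          interval_(reseedInterval)
    {
    }

    // Returns sin(w * n) for n = 0, 1, 2, ...
    double next()
    {
        if (n_ % interval_ == 0)
            reseed();                       // exact restart, costs two std::sin calls
        const double y = k_ * y1_ - y2_;    // y[n] = 2cos(w) y[n-1] - y[n-2]
        y2_ = y1_;
        y1_ = y;
        ++n_;
        return y;
    }

private:
    void reseed()
    {
        y1_ = std::sin(w_ * (n_ - 1));
        y2_ = std::sin(w_ * (n_ - 2));
    }

    double w_;          // phase increment per sample in radians
    double k_;          // 2 * cos(w)
    long interval_;     // samples between reseeds
    long n_ = 0;        // index of the next sample
    double y1_ = 0.0;   // previous output
    double y2_ = 0.0;   // output before that
};
```

The cost of reseeding is two `std::sin` calls per interval, so a longer interval is cheaper but lets more drift build up; the right value is something to find by measurement or by ear.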
- KVRAF
- 7893 posts since 12 Feb, 2006 from Helsinki, Finland
You can get a "perfect" sine with vector rotations all the way up to Nyquist (and beyond, though it aliases then) with something very similar to:
Code: Select all
// ySin, yCos are the state variables that need to be preserved
cosW = cos(2*PI*freq / samplerate);
sinW = sin(2*PI*freq / samplerate);
for each sample:
    // multiply vector with a rotation matrix
    tmpCos = cosW*yCos - sinW*ySin;
    tmpSin = sinW*yCos + cosW*ySin;
    // update the state vector with the new vector
    yCos = tmpCos; ySin = tmpSin;
every once in a while:
    // renormalize to cancel rounding issues
    tmpLen = sqrt(yCos*yCos + ySin*ySin);
    yCos /= tmpLen; ySin /= tmpLen;
In other words, basic Euler rotation (with simultaneous update of both variables; the renormalization just scales the length of the rotating vector back to unity). The commonly proposed versions are basically optimized versions of the above that need fewer registers (by avoiding the temporaries). If you're not register starved, the above might even be faster in practice (edit: though obviously changing frequency is slow with the cos() and sin() required; the inner loop should be easy to SSE optimize even for the single-sine-at-a-time case because of the simultaneous update).
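For anyone who wants to try the rotation pseudocode from this post directly, here is one way it might be written in C++. This is a sketch only: the class name `RotatingSine`, the starting state (cos = 1, sin = 0) and the renormalization interval of 1024 samples are assumptions, since the post only says "every once in a while".

```cpp
#include <cmath>

// Sine oscillator based on rotating a unit vector (yCos, ySin) by a fixed
// angle per sample, with occasional renormalization to cancel rounding.
class RotatingSine {
public:
    RotatingSine(double freq, double sampleRate) { setFrequency(freq, sampleRate); }

    // Changing frequency needs one cos() and one sin() call.
    void setFrequency(double freq, double sampleRate)
    {
        const double w = 2.0 * 3.14159265358979323846 * freq / sampleRate;
        cosW = std::cos(w);
        sinW = std::sin(w);
    }

    // Returns the current sine value, then advances the state by one sample.
    double next()
    {
        const double out = ySin;

        // Multiply the state vector with the rotation matrix.
        const double tmpCos = cosW * yCos - sinW * ySin;
        const double tmpSin = sinW * yCos + cosW * ySin;
        yCos = tmpCos;
        ySin = tmpSin;

        // Renormalize now and then (interval is an assumed value).
        if (++count >= 1024) {
            count = 0;
            const double tmpLen = std::sqrt(yCos * yCos + ySin * ySin);
            yCos /= tmpLen;
            ySin /= tmpLen;
        }
        return out;
    }

private:
    double cosW = 1.0, sinW = 0.0;  // rotation coefficients
    double yCos = 1.0, ySin = 0.0;  // state vector, starts at phase 0
    int count = 0;                  // samples since last renormalization
};
```

Usage would be something like `RotatingSine osc(440.0, 44100.0);` followed by one `osc.next()` call per output sample.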
-
- KVRist
- Topic Starter
- 106 posts since 12 May, 2006
Hi, it's me again
So far I was successful with the sine wave rendering. I am now using an n=1024 table with linear interpolation. Works like a charm...
....if there wasn't another little problem
I have this code
unsigned int degrees;
for( ii=0; ii<oscis; ii++ ) {
    degrees = nPhase[ii];
    omega = degreesPerFramePerHzL * frq*(ii+1);
    /* Here processing */
    degrees += (unsigned int) omega;
    nPhase[ii] = degrees;
}
omega is a double,
nPhase[] is an array of unsigned int.
My problem seems to be the very last line (nPhase[ii] = degrees)
This line eats more CPU than the entire processing (which is not shown here because I actually deleted it for this measurement experiment).
If I leave the (nPhase[ii]=degrees) line out, I get a VST load of around 2.3%, of which something like 1.2% is caused by prior components. So the loop (measured with oscis=79) takes around 1.1-1.3%. Fine.
If I put the (nPhase[ii]=degrees) line back into the code, the VST load rises to 6.5%.
If I instead use this:
for( ii=0; ii<oscis; ii++ ) {
    degrees = nPhase[ii];
    omega = degreesPerFramePerHzL * frq*(ii+1);
    /* Here processing */
    degrees += (unsigned int) omega;
    nPhase[ii] = nPhase[ii]*2;   // <=== only change!
}
(the code itself does not make sense, I know, it's just for the experiment)
then the VST load is at 2.5%. So multiplication + assignment into the array costs like 0.2%
So, 2 array accesses (including 1 value assignment) plus a multiplication are faster than just 1 value assignment into the same array?
I cannot reproduce the situations reliably, but the really weird thing is that I sometimes get superb performance with the "right" code (yesterday I was at around 5% with the entire signal processing), then I make some modifications which don't touch the signal rendering path at all, and suddenly I am back to 10% load (with the same loop, same compiler settings etc... of that 10%, the assignment into the array alone takes more than 4%).
I can't understand this.
I already tried to access the array in different ways, e.g. with a pointer to an unsigned int, which I simply ++ at the end of the loop to get the pointer to the next entry... makes no difference.
It seems that it is not the array access that causes the problem, but the assignment of the unsigned int degrees.
Anybody have an idea? (I don't <LOL>)
thanks again,
Rudi
- KVRist
- 411 posts since 25 Apr, 2007 from Northern CA
DukeRoodee wrote:
Code: Select all
unsigned int degrees;
for( ii=0; ii<oscis; ii++ ) {
    degrees = nPhase[ii];
    omega = degreesPerFramePerHzL * frq*(ii+1);
    /* Here processing */
    degrees += (unsigned int) omega;
    nPhase[ii] = degrees;
}
This line bothers me for some reason:
Code: Select all
degrees += (unsigned int) omega;
Does it have to be converted?
Code: Select all
unsigned int num = (unsigned int) omega;
degrees += num;
I also wondered if you could compare the value of the array with the calculated result and only store it if it was different:
Code: Select all
if( degrees != result ) {
    nPhase[ii] = degrees;
}
/* Where 'result' is a var containing the calculation result */
You might also try a more discrete loop (longhand instead of shorthand) to find your points of contention. For example, instead of this line:
Code: Select all
degrees += (unsigned int) omega;
try this:
Code: Select all
unsigned int num = (unsigned int) omega;
if( num > 0 ) {
    degrees = degrees + num;
}
Obviously I am just throwing out suggestions based on this snippet of code. Maybe something will at least trigger an idea...
Good luck,
JR
- KVRAF
- 2554 posts since 4 Sep, 2006 from 127.0.0.1
interesting, i was digging into the dust for this thread yesterday ;]
well, yes, storing the phase into the array might eat CPU
you can also try to use a pointer, but i'm not sure if it'll have any improvement..
It doesn't matter how it sounds..
..as long as it has BASS and it's LOUD!
irc.libera.chat >>> #kvr
-
- KVRian
- 1153 posts since 10 Dec, 2003
DukeRoodee wrote: My problem seems to be the very last line (nPhase[ii] = degrees)
This line eats more CPU than the entire processing (which is deleted here because actually i deleted it for this measurement experiment).
My hunch is that it's the conversion from double to int in the previous line that is causing your slowdown. Such conversions can be expensive.
The reason I think this is that the CPU load drops when you alter the last line so that it does not use the result of the conversion. So it's likely that the compiler has not even compiled the previous line, as it has determined that it does nothing. (Assuming you're testing with optimizations enabled.)
You can check by putting a breakpoint on each of the last two lines and then running the program. In Visual C++, if a line was optimized away, its red breakpoint icon will change to a hollow circle with a small exclamation mark, signifying that the breakpoint could not be set. Otherwise you can drop into disassembly mode to see if the line was compiled or not.
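If the optimizer is silently dropping the conversion whenever its result is unused, one way to keep the measurement honest without the array store is to feed every result into something the compiler has to keep. The sketch below is one possible setup, not the poster's code: `g_benchmarkSink` and `phaseLoopChecksum` are hypothetical names, and the loop body just mirrors the snippet quoted above.

```cpp
// Hypothetical sink: a volatile write that the optimizer must keep.
volatile unsigned int g_benchmarkSink = 0;

// Runs the phase loop without storing back into nPhase, but still uses
// every converted value so the double-to-int conversion stays in the code.
unsigned int phaseLoopChecksum(const unsigned int* nPhase, int oscis,
                               double degreesPerFramePerHzL, double frq)
{
    unsigned int checksum = 0;
    for (int ii = 0; ii < oscis; ++ii) {
        unsigned int degrees = nPhase[ii];
        const double omega = degreesPerFramePerHzL * frq * (ii + 1);
        degrees += (unsigned int) omega;  // the conversion under test
        checksum += degrees;              // result is used, so it cannot be dropped
    }
    g_benchmarkSink = checksum;           // observable side effect
    return checksum;
}
```

Timing this version against the original loop should show whether the cost comes from the conversion or from the store itself.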
And/or you could try this..
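Code: Select all
// MSVC 32-bit inline assembly; the result is left in EAX as the return value.
// Note: FISTP rounds using the current FPU rounding mode (round-to-nearest
// by default), whereas a C cast truncates toward zero.
inline int DoubleToInt(double f)
{
    __asm
    {
        PUSH EAX
        FLD [f]
        FISTP DWORD PTR [ESP]
        POP EAX
    }
}
And this..
Code: Select all
degrees += DoubleToInt(omega);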
-
- KVRist
- Topic Starter
- 106 posts since 12 May, 2006
nollock wrote:
My hunch is that it's the conversion from double to int in the previous line that is causing your slowdown. They can be expensive.
....
And/or you could try this..
Code: Select all
inline int DoubleToInt(double f)
{
    __asm
    {
        PUSH EAX
        FLD [f]
        FISTP DWORD PTR [ESP]
        POP EAX
    }
}
And this..
Code: Select all
degrees += DoubleToInt(omega);
Yes, I think you are completely right. I had compiler optimizations set to maximum, and already last night I found some hints that it might be the conversion.
Finally, after trying some of the fast conversion routines (including yours), I'll go with the SSE2 _mm_cvttsd_si32 function. This gave me even better performance than DoubleToInt(..), and I need every bit because the algorithm is really "expensive".
Thanks a lot !
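A minimal sketch of how the SSE2 route might be wrapped is shown below. The wrapper name `truncateToInt` is made up for this example, and the fallback branch is only there so the snippet also builds on targets without SSE2. Note that `_mm_cvttsd_si32` produces a signed 32-bit result with truncation, so values of 2^31 and above do not convert correctly; whether that matters depends on the range of omega.

```cpp
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>

// Truncating double-to-int conversion using the SSE2 intrinsic.
inline int truncateToInt(double x)
{
    return _mm_cvttsd_si32(_mm_set_sd(x));
}
#else
// Portable fallback with the same truncating behaviour.
inline int truncateToInt(double x)
{
    return static_cast<int>(x);
}
#endif
```

In the loop from earlier in the thread this would be used as `degrees += (unsigned int) truncateToInt(omega);`.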
-
- KVRist
- Topic Starter
- 106 posts since 12 May, 2006
antto wrote: interesting, i was digging into the dust for this thread yesterday ;]
well, yes, storing the phase into the array might eat CPU
you can also try to use a pointer, but i'm not sure if it'll have any improvement..
unfortunately not, I tried this already
well, it MIGHT bring some improvement, but none that I could "see". I refer to the VST load shown by vsthost, which is displayed in percent and is not stable enough to judge whether a different implementation gives me back some tenths of a percent. I can only see improvements in the "bigger" range, maybe 1% or more.
-
- KVRist
- 239 posts since 22 Jan, 2007 from Germany
If you are only using even harmonics, you might want to have a look at the code I posted about additive sine oscillators, here.
... when time becomes a loop ...
---
Intel i7 3770k @3.5GHz, 16GB RAM, Windows 7 / Ubuntu 16.04, Cubase Artist, Reaktor 6, Superior Drummer 3, M-Audio Audiophile 2496, Akai MPK-249, Roland TD-11KV+
-
- KVRist
- Topic Starter
- 106 posts since 12 May, 2006
@johnrule:
>>Does it have to be converted?<<
unfortunately, yes. I tried to use unsigned int instead, but the impact on other parts is too big.
>>I also wondered if you could compare the value of the array with the calculated result and only store it if it was different: <<
mmmh...the phase array is there just to increase the values in each step, so I am afraid it must be this way. Also, checking if the increase is <> 0 does not make much sense because the increase IS <> 0 in each step.
Finally, the problem was the double to int conversion. I am now using the SSE2 conversion routine, with a great gain in performance.
Anyway Thanks !
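Another option, assuming frq changes much less often than once per sample (the thread does not say how often it changes), would be to do the double-to-int conversion only when the frequency is set and keep integer increments around. The names `PhaseBank`, `setFrequency` and `advance` are hypothetical; the arithmetic is the same as in the loop above, so the phases come out identical.

```cpp
#include <cstddef>
#include <vector>

// Keeps one phase and one precomputed integer increment per oscillator.
// Assumption: the increment only needs recomputing when frq changes.
struct PhaseBank {
    std::vector<unsigned int> phase;
    std::vector<unsigned int> increment;

    explicit PhaseBank(std::size_t oscis)
        : phase(oscis, 0u), increment(oscis, 0u)
    {
    }

    // The only place where a double is converted to an integer.
    void setFrequency(double degreesPerFramePerHzL, double frq)
    {
        for (std::size_t ii = 0; ii < increment.size(); ++ii)
            increment[ii] =
                static_cast<unsigned int>(degreesPerFramePerHzL * frq * (ii + 1));
    }

    // Per-sample work: integer adds only, wrapping on overflow.
    void advance()
    {
        for (std::size_t ii = 0; ii < phase.size(); ++ii)
            phase[ii] += increment[ii];
    }
};
```

With this layout the per-sample loop contains no floating-point conversion at all, which may or may not beat the SSE2 intrinsic depending on how often the frequency actually moves.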
-
- KVRAF
- 2875 posts since 28 Jan, 2004 from Da Nang, Vietnam
nollock wrote:
const int WAVEBITS = 10;
const int WAVESIZE = 1 << WAVEBITS;
const int FRACBITS = 32-WAVEBITS;
const int FRACMASK = (1 << FRACBITS)-1;
const double FRACSCALE = 1.0 / (1 << FRACBITS);
// For the wave lookup, repeat the very first entry at the end,
// so we don't need to mask the index; that's why the array
// size is one larger.
float wavetable[WAVESIZE+1];
// Omega calc
int omega = int((freq / samplerate) * 4294967296.0);
// Oscillator code (phase is an unsigned int that persists between samples)
phase += omega;
int idx = phase >> FRACBITS;
float tmp = wavetable[idx];
output = tmp + (phase & FRACMASK)*FRACSCALE*(wavetable[idx+1]-tmp);
Just came across this. Slick!
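To make the quoted snippet easier to try out, here is one way it could be wrapped into a complete class. This is a sketch under a few assumptions: the table is filled with one sine cycle, the class name `TableSine` is invented, `std::uint32_t` replaces the plain int types so the phase wraps cleanly, and freq is assumed to be below the sample rate so the increment fits into 32 bits.

```cpp
#include <cmath>
#include <cstdint>

constexpr int WAVEBITS = 10;
constexpr int WAVESIZE = 1 << WAVEBITS;             // 1024 table points
constexpr int FRACBITS = 32 - WAVEBITS;             // low bits = fractional position
constexpr std::uint32_t FRACMASK = (1u << FRACBITS) - 1u;
constexpr float FRACSCALE = 1.0f / (1u << FRACBITS);

// Fixed-point phase accumulator driving a linearly interpolated sine table.
class TableSine {
public:
    TableSine(double freq, double sampleRate)
    {
        // The first entry is repeated at the end so idx + 1 never needs masking.
        for (int i = 0; i <= WAVESIZE; ++i)
            table_[i] = static_cast<float>(
                std::sin(6.283185307179586 * i / WAVESIZE));

        // Phase increment as a fraction of the full 32-bit range.
        // Assumes 0 <= freq < sampleRate.
        omega_ = static_cast<std::uint32_t>((freq / sampleRate) * 4294967296.0);
    }

    // Advances the phase, then returns the interpolated table value.
    float next()
    {
        phase_ += omega_;                            // wraps around automatically
        const std::uint32_t idx = phase_ >> FRACBITS;
        const float frac = (phase_ & FRACMASK) * FRACSCALE;
        const float tmp = table_[idx];
        return tmp + frac * (table_[idx + 1] - tmp);
    }

private:
    float table_[WAVESIZE + 1];
    std::uint32_t omega_ = 0;
    std::uint32_t phase_ = 0;
};
```

The nice part of the design is that the unsigned 32-bit overflow does the phase wrapping for free, and the top WAVEBITS bits of the phase are directly the table index.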