KVR Audio

otristan · Post by **otristan** » Thu Dec 17, 2009 9:44 pm

don't use those personally but worth taking a look imho

http://www.b-nm.at/sine-generation-tutorial/

nollock · Post by **nollock** » Thu Dec 17, 2009 11:02 pm

DukeRoodee wrote: What would you say is a "good", performant lookup table size ? 2^15 ?

For a single sine lookup, a 1024-point table with linear interpolation will give about 120 dB SNR.

So there's little point going higher than that.
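The table-size claim above can be checked empirically. The following is a minimal sketch, not anything posted in the thread: the function name `measureTableSnrDb` and the choice of test points are assumptions made for illustration. It builds a sine table with one guard entry, reads it back with linear interpolation at many phases, and compares against `std::sin`. A back-of-envelope estimate (interpolation error proportional to the square of the table step) puts the RMS figure nearer 109 dB for 1024 points, with roughly 12 dB gained per doubling of the table size, so the exact number depends on how SNR is defined.

```cpp
#include <cmath>
#include <vector>

// Hypothetical helper: measures the SNR (in dB) of a linearly interpolated
// sine table against std::sin, using RMS signal power over RMS error power.
double measureTableSnrDb(int tableSize, int numTestPoints) {
    const double twoPi = 6.283185307179586476925;

    // One extra guard entry so that idx + 1 never needs wrapping.
    std::vector<double> table(tableSize + 1);
    for (int i = 0; i <= tableSize; ++i)
        table[i] = std::sin(twoPi * i / tableSize);

    double signalPower = 0.0;
    double errorPower = 0.0;
    for (int n = 0; n < numTestPoints; ++n) {
        const double phase = (n + 0.5) / numTestPoints;  // in [0, 1)
        const double pos = phase * tableSize;
        const int idx = static_cast<int>(pos);           // integer part
        const double frac = pos - idx;                   // fractional part
        const double approx = table[idx] + frac * (table[idx + 1] - table[idx]);
        const double exact = std::sin(twoPi * phase);
        signalPower += exact * exact;
        errorPower += (approx - exact) * (approx - exact);
    }
    return 10.0 * std::log10(signalPower / errorPower);
}
```

Calling `measureTableSnrDb(1024, 1000003)` and then the same with 2048 shows how much each doubling of the table buys, which is the practical basis for stopping at a small table.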

DukeRoodee · Post by **DukeRoodee** » Fri Dec 18, 2009 11:40 am

otristan wrote: don't use those personally but worth taking a look imho

http://www.b-nm.at/sine-generation-tutorial/

hi

gave it a try.
The "fast" algorithm works like a charm for lower frequencies. At 7 kHz it already generates everything but a sine wave, and somewhere above 12 kHz it gets completely out of control. So i'd have to oversample it by a factor of at least 2.0 (sounds awful and gives no performance advantage... well, actually i don't know if "oversampling" is the right word for this, i'll just call it that).
Got "good" sound with an oversampling factor of 10, but then it is by far slower than the sine table approach.

Anyway, thanks, it was interesting and worth a try

DukeRoodee · Post by **DukeRoodee** » Fri Dec 18, 2009 11:40 am

nollock wrote:
DukeRoodee wrote: What would you say is a "good", performant lookup table size ? 2^15 ?
For a single sine lookup, a 1024-point table with linear interpolation will give about 120 dB SNR.

So there's little point going higher than that.

OK, i'll keep that in mind, thanks

otristan · Post by **otristan** » Fri Dec 18, 2009 11:51 am

DukeRoodee wrote:
otristan wrote: don't use those personally but worth taking a look imho

http://www.b-nm.at/sine-generation-tutorial/

hi

gave it a try.
The "fast" algorithm works like a charm for lower frequencies. At 7 kHz it already generates everything but a sine wave, and somewhere above 12 kHz it gets completely out of control. So i'd have to oversample it by a factor of at least 2.0 (sounds awful and gives no performance advantage... well, actually i don't know if "oversampling" is the right word for this, i'll just call it that).
Got "good" sound with an oversampling factor of 10, but then it is by far slower than the sine table approach.

Anyway, thanks, it was interesting and worth a try

What about recomputing the seed from time to time, depending on the frequency?
For example every 1024 samples, or at an interval based on the acceptable drift (judged by a listening test).

mystran · Post by **mystran** » Fri Dec 18, 2009 12:06 pm

You can get "perfect" sine with vector rotations all the way up to Nyquist (and beyond, though it aliases then) with something very similar to:

Code:


   // ySin, yCos are the state variables that need to be preserved

   cosW = cos(2*PI*freq / samplerate);
   sinW = sin(2*PI*freq / samplerate);

   for each sample:
      // multiply vector with a rotation matrix
      tmpCos = cosW*yCos - sinW*ySin;
      tmpSin = sinW*yCos + cosW*ySin;

      // update the state vector with the new vector
      yCos = tmpCos; ySin = tmpSin;

      every once in a while:
         // renormalize to cancel rounding issues
         tmpLen = sqrt(yCos*yCos + ySin*ySin);
         yCos /= tmpLen; ySin /= tmpLen;

In other words, basic Euler rotation (with simultaneous update of both variables; the renormalization just scales the length of the rotating vector back to unity). The commonly proposed versions are basically optimized versions of the above that need fewer registers (by avoiding the temporaries). If you're not register starved, the above might even be faster in practice. (Edit: obviously, changing frequency is slow because of the cos() and sin() calls required. The inner loop should be easy to optimize with SSE even for the single-sine-at-a-time case, because of the simultaneous update.)
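For readers who want something compilable, here is a minimal C++ sketch of the pseudocode above. The struct name `RotationOsc`, the helper `renderSine`, and the renormalization interval of 1024 samples are assumptions chosen for illustration; the post itself only says "every once in a while".

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed interval; the original post leaves this open.
constexpr unsigned kRenormInterval = 1024;

struct RotationOsc {
    double yCos = 1.0, ySin = 0.0;   // state vector, starts at phase 0
    double cosW = 1.0, sinW = 0.0;   // rotation coefficients
    unsigned sinceRenorm = 0;

    // Expensive part: one cos() and one sin() per frequency change.
    void setFrequency(double freq, double sampleRate) {
        const double w = 2.0 * 3.14159265358979323846 * freq / sampleRate;
        cosW = std::cos(w);
        sinW = std::sin(w);
    }

    // Returns the current sine value, then rotates the state by one sample.
    double next() {
        const double out = ySin;
        // Multiply the state vector by the rotation matrix; the temporaries
        // ensure both components are updated from the old state.
        const double tmpCos = cosW * yCos - sinW * ySin;
        const double tmpSin = sinW * yCos + cosW * ySin;
        yCos = tmpCos;
        ySin = tmpSin;
        if (++sinceRenorm >= kRenormInterval) {
            // Scale the vector back to unit length to cancel rounding drift.
            const double len = std::sqrt(yCos * yCos + ySin * ySin);
            yCos /= len;
            ySin /= len;
            sinceRenorm = 0;
        }
        return out;
    }
};

// Convenience wrapper: renders n samples of a sine starting at phase 0.
std::vector<double> renderSine(double freq, double sampleRate, std::size_t n) {
    RotationOsc osc;
    osc.setFrequency(freq, sampleRate);
    std::vector<double> out(n);
    for (std::size_t i = 0; i < n; ++i) out[i] = osc.next();
    return out;
}
```

Using double precision keeps the accumulated phase error small over long runs; with single precision the renormalization interval would probably need to be shorter.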

DukeRoodee · Post by **DukeRoodee** » Mon Jan 04, 2010 11:57 pm

Hi, it's me again

so far i was successful with the sine wave rendering. i am now using an n=1024 table with linear interpolation. Works like a charm...

....if there wasn't another little problem

i have this code

unsigned int degrees;
for( ii=0; ii<oscis; ii++ ) {
    degrees = nPhase[ii];
    omega = degreesPerFramePerHzL * frq*(ii+1);

    /* Here processing */

    degrees += (unsigned int) omega;
    nPhase[ii] = degrees;
}

omega is double,
nPhase[] is an array of unsigned int.

My problem seems to be the very last line (nPhase[ii] = degrees).
This line eats more CPU than the entire processing (which is not shown here because i actually removed it for this measurement experiment).

if i leave the (nPhase[ii]=degrees) line out, i get a VST load of around 2.3%, where something like 1.2% is load caused by prior components. So the loop (measured with oscis=79) takes around 1.1-1.3 % Fine.

If i take the (nPhase[ii]=degrees) line into the code, VST load rises to 6.5%.

If i would instead use this:

for( ii=0; ii<oscis; ii++ ) {
    degrees = nPhase[ii];
    omega = degreesPerFramePerHzL * frq*(ii+1);

    /* Here processing */

    degrees += (unsigned int) omega;
    nPhase[ii] = nPhase[ii]*2;    /* <=== only change! */
}

(the code itself does not make sense, i know, its just for experiment)

then the VST load is at 2.5%. So multiplication + assignment into the array costs like 0.2%

So, 2 array accesses (including 1 value assignment) plus a multiplication are faster than just 1 value assignment into the same array?

I cannot recreate the situations, but the really weird thing is that i sometimes get superb performance with the "right" code (yesterday i was at around 5% with the entire signal processing), then i make some modifications which don't touch the signal rendering path at all, and suddenly i am back at 10% load (with the same loop, same compiler settings etc... and of that 10%, the assignment into the array alone takes more than 4%).
I can't understand this.
I already tried to access the array in different ways, e.g. with a pointer to an unsigned int, which i simply ++ at the end of the loop to get the pointer to the next entry... makes no difference.
It seems that it is not the array access that causes the problem, but the assignment of the unsigned int degrees.

Somebody an idea ? ( i don't <LOL> )

thanks again,

Rudi

johnrule · Post by **johnrule** » Tue Jan 05, 2010 12:57 am

DukeRoodee wrote:

Code:

unsigned int degrees;
		for( ii=0; ii<oscis; ii++ ) {
			degrees = nPhase[ii];
			omega = degreesPerFramePerHzL * frq*(ii+1);

                        /* Here processing */
                        
			degrees += (unsigned int) omega;
			nPhase[ii] = degrees;
		}

This line bothers me for some reason:

Code:

degrees += (unsigned int) omega;

Does it have to be converted? Try putting the conversion on a separate line instead of part of the "+=" operator line:

Code:

uint num = (unsigned int) omega;
degrees += num;

I also wondered if you could compare the value of the array with the calculated result and only store it if it was different:

Code:

if( degrees != result ) {
    nPhase[ii] = degrees;
}

/* Where 'result' is a var containing the calculation result */

So if the value is unchanged, it will skip touching the array. I always try to avoid any conversion within a loop, as well as triggering something that may cause behind-the-scenes processing that takes longer than I expected.

You might also try a more discrete loop (longhand instead of shorthand) to find your points of contention. For example, instead of this line:

Code:

degrees += (unsigned int) omega;

You could spell-it-out:

Code:

uint num = (unsigned int) omega;
if( num > 0 ) {
  degrees = degrees + num;
}

You would define the "uint" var outside of the loop of course...the idea is to give you more code to find your point of contention.

Obviously I am just throwing out suggestions based on this snippet of code. Maybe something will at least trigger an idea...

Good luck,
JR

antto · Post by **antto** » Tue Jan 05, 2010 5:17 am

interesting, i was digging into the dust for this thread yesterday ;]

well, yes, storing the phase into the array might eat CPU
you can also try to use a pointer, but i'm not sure if it'll have any improvement..

nollock · Post by **nollock** » Tue Jan 05, 2010 7:54 am

DukeRoodee wrote: My problem seems to be the very last line (nPhase[ii] = degrees)
This line eats more CPU than the entire processing (which is deleted here because actually i deleted it for this measurment experiment).

My hunch is that it's the conversion from double to int in the previous line that is causing your slowdown. They can be expensive.

The reason I think this is that the cpu load drops when you alter the last line so that it does not use the result of the conversion. So it's likely that the compiler has not even compiled the previous line as it's determined that it does nothing. (Assuming you're testing with optimizations enabled).

You can check by putting a breakpoint on each of the last two lines and then running the program. In Visual C++ the red breakpoint icon will change to a hollow circle with a small exclamation mark, signifying that the breakpoint could not be set. Otherwise you can drop into disassembly mode to see whether the line was compiled or not.

And/or you could try this..

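Code:

// MSVC 32-bit x86 inline assembly; the result is left in EAX as the return value
inline int DoubleToInt(double f)
{
    __asm
    {
        PUSH    EAX                 // reserve 4 bytes on the stack
        FLD     [f]                 // load the double onto the FPU stack
        FISTP   DWORD PTR [ESP]     // store as int using the current FPU rounding mode
        POP     EAX                 // pop the converted value into EAX
    }
}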

And this..

Code:

   degrees += DoubleToInt(omega);
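One behavioural difference is worth keeping in mind when swapping a C cast for FISTP: the cast truncates toward zero, whereas FISTP rounds according to the current FPU rounding mode (round-to-nearest-even by default). The sketch below shows the same distinction in portable C++, using `std::lrint` as a stand-in for the FISTP behaviour. The function names are made up for illustration, and whether `std::lrint` is actually fast depends on the compiler and target.

```cpp
#include <cmath>

// Rounds using the current floating-point rounding mode
// (round-to-nearest-even unless the mode has been changed).
inline int roundLikeFistp(double x) {
    return static_cast<int>(std::lrint(x));
}

// Plain C/C++ conversion: always truncates toward zero.
inline int truncLikeCast(double x) {
    return static_cast<int>(x);
}
```

For a phase increment the two results differ by at most one unit, which is usually inaudible, but it means the two variants will not produce bit-identical phase values.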

DukeRoodee · Post by **DukeRoodee** » Tue Jan 05, 2010 11:44 am

nollock wrote:
My hunch is that it's the conversion from double to int in the previous line that is causing your slowdown. They can be expensive.
....
And/or you could try this..
Code:
inline int DoubleToInt(double f)
{
    __asm
    {
        PUSH    EAX
        FLD     [f]
        FISTP   DWORD PTR [ESP]
        POP     EAX
    }
}
And this..
Code:
   degrees += DoubleToInt(omega);

Yes, i think you are completely right. I had compiler optimizations set to maximum, and already yesterday night i found some traces that it might be the conversion.

Finally, after trying some of the fast conversion routines (including yours) I'll go with the SSE2 _mm_cvttsd_si32 function. This gave me even better performance than DoubleToInt(..) and i need every bit because the algorithm is really "expensive".

Thanks a lot !
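For reference, a minimal sketch of how the SSE2 intrinsic mentioned above might be wrapped. This is not DukeRoodee's actual code: the names `truncToInt` and `advancePhase` are hypothetical, and the fallback to a plain cast on targets without SSE2 is an assumption added so the snippet compiles everywhere. `_mm_cvttsd_si32` converts with truncation, so it matches the behaviour of the C cast for values that fit in a signed 32-bit int.

```cpp
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2_CVT 1
#else
#define HAVE_SSE2_CVT 0
#endif

// Truncating double-to-int conversion. Only meaningful for values that fit
// in a signed 32-bit int; out-of-range inputs give an implementation-defined
// or undefined result in both branches.
inline int truncToInt(double x) {
#if HAVE_SSE2_CVT
    return _mm_cvttsd_si32(_mm_set_sd(x));   // CVTTSD2SI: convert with truncation
#else
    return static_cast<int>(x);              // portable fallback
#endif
}

// Adds a (non-negative) phase increment to an unsigned phase accumulator.
// Unsigned overflow wraps, which is what a phase accumulator wants.
inline unsigned int advancePhase(unsigned int phase, double omega) {
    return phase + static_cast<unsigned int>(truncToInt(omega));
}
```

Note the range caveat: the original `(unsigned int) omega` cast covers increments up to 2^32 - 1, while the signed intrinsic only covers increments below 2^31.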

DukeRoodee · Post by **DukeRoodee** » Tue Jan 05, 2010 11:48 am

antto wrote:interesting, i was digging into the dust for this thread yesterday ;]

well, yes, storing the phase into the array might eat CPU
you can also try to use a pointer, but i'm not sure if it'll have any improvement..

unfortunately not, i tried this already

well, it MIGHT bring some improvement, but none that i could "see". I refer to the vst load of vsthost, which displays in percent and is not so stable as to judge if a different implementation gives me back some tenths of a percent. I can see improvement in the "bigger" range, maybe at least 1%

neotec · Post by **neotec** » Tue Jan 05, 2010 11:54 am

If you are only using even harmonics, you might want to have a look at the code I posted about additive sine oscillators, here.

DukeRoodee · Post by **DukeRoodee** » Tue Jan 05, 2010 11:56 am

@johnrule:

>>Does it have to be converted?<<
Unfortunately, yes. I tried to use unsigned int instead, but the impact on other parts is too big.

>>I also wondered if you could compare the value of the array with the calculated result and only store it if it was different: <<
mmmh...the phase array is there just to increase the values in each step, so i am afraid it must be this way. Also checking if the increase is <>0 makes not much sense because the increase IS <> 0 in each step.

In the end, the problem was the double to int conversion. I am now using the SSE2 conversion routine, with a great gain in performance.

Anyway Thanks !

kuniklo · Post by **kuniklo** » Thu Jan 06, 2011 3:20 am

nollock wrote:
const int WAVEBITS = 10;
const int WAVESIZE = 1 << WAVEBITS;
const int FRACBITS = 32-WAVEBITS;
const int FRACMASK = (1 << FRACBITS)-1;
const double FRACSCALE = 1.0 / (1 << FRACBITS);

// For the wave lookup, repeat the very first entry at the end
// so we don't need to mask the index; that's why the array
// size is one larger.

float wavetable[WAVESIZE+1];

// Omega calc (unsigned, so that freq up to samplerate fits in 32 bits)

unsigned int omega = (unsigned int)((freq / samplerate) * 4294967296.0);

// Oscillator code (phase is an unsigned int that persists between samples)

phase += omega;

unsigned int idx = phase >> FRACBITS;

float tmp = wavetable[idx];
output = tmp + (phase & FRACMASK)*FRACSCALE*(wavetable[idx+1]-tmp);

Just came across this. Slick!
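The quoted snippet is a fragment, so here is a self-contained C++ sketch of the same idea: a 32-bit phase accumulator whose top 10 bits index the table and whose low 22 bits give the interpolation fraction. The class name `TableOsc` and its methods are assumptions made for illustration, and unlike the quoted code it returns the sample before advancing the phase so that the first output is at phase 0.

```cpp
#include <cmath>
#include <cstdint>

constexpr int WAVEBITS = 10;
constexpr int WAVESIZE = 1 << WAVEBITS;                     // 1024 entries
constexpr int FRACBITS = 32 - WAVEBITS;                     // 22 fraction bits
constexpr std::uint32_t FRACMASK = (1u << FRACBITS) - 1u;
constexpr double FRACSCALE = 1.0 / (1u << FRACBITS);

class TableOsc {
public:
    TableOsc() {
        const double twoPi = 6.283185307179586476925;
        for (int i = 0; i < WAVESIZE; ++i)
            table_[i] = static_cast<float>(std::sin(twoPi * i / WAVESIZE));
        // Guard entry: repeat the first sample so idx + 1 never needs masking.
        table_[WAVESIZE] = table_[0];
    }

    // One full cycle is 2^32 phase units; valid for 0 <= freq < sampleRate.
    void setFrequency(double freq, double sampleRate) {
        omega_ = static_cast<std::uint32_t>((freq / sampleRate) * 4294967296.0);
    }

    float next() {
        const std::uint32_t idx = phase_ >> FRACBITS;        // table index
        const float frac = static_cast<float>((phase_ & FRACMASK) * FRACSCALE);
        const float a = table_[idx];
        const float out = a + frac * (table_[idx + 1] - a);  // linear interpolation
        phase_ += omega_;                                    // wraps at 2^32 for free
        return out;
    }

private:
    float table_[WAVESIZE + 1];
    std::uint32_t phase_ = 0;
    std::uint32_t omega_ = 0;
};
```

The appeal of the scheme is that phase wrap-around comes from unsigned integer overflow, so the per-sample loop contains no branch, no modulo, and no float-to-int conversion.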

VSTi/C++ code optimisation ...need some advice.