Nowhk
KVRian
 
549 posts since 2 Oct, 2013

Postby Nowhk; Wed May 10, 2017 2:08 am How much CPU would take 10 Envelopes per voice? It seems a lot...

This is a very simplified version of the code I have in my audio plugin, which processes 10 different envelopes for each voice:

Code:
   // buffer
   for (int i = 0; i < nFrames; i++) {
      // init audio buffer
      outputLeft[i] = 0.0;
      outputRight[i] = 0.0;

      // handle midi
      pMidiHandler->Process();

      // manage voices
      pVoiceManager->Process();

      // process voices
      int processedPlayingVoices = 1;
      for (int voiceIndex = 0; voiceIndex < PLUG_VOICES_BUFFER_SIZE; voiceIndex++) {
         if (processedPlayingVoices > pVoiceManager->mNumPlayingVoices) {
            break;
         }

         Voice &voice = pVoiceManager->mVoices[voiceIndex];
         if (voice.mIsPlaying) {
            // envelopes
            for (int envelopeIndex = 0; envelopeIndex < ENVELOPES_CONTAINER_NUM_ENVELOPE_MANAGER; envelopeIndex++) {
               Envelope &envelope = pEnvelopesContainer->pEnvelopeManager[envelopeIndex]->mEnvelope;
               VoiceParameters &voiceParameters = envelope.mVoiceParameters[voice.mIndex];

               // next phase
               voiceParameters.mBlockStep += envelope.mRate;
               voiceParameters.mStep += envelope.mRate;
            }
            
            processedPlayingVoices++;
         }
      }
   }

Basically, these "envelopes" do nothing right now. But even so, they already take 4% of CPU when playing 16 voices simultaneously in the DAW.

What the heck is wrong with my code? Is there a way to simplify it? I don't think per-voice envelope processing in your plugins (which here actually does nothing) costs this much...

Are 3 nested "for" loops (samples x 16 x 10) a big deal when working with audio?
I could use a control rate per voice, but those iterations/increments are needed anyway (i.e. mBlockStep or mStep), and those still take 4% of CPU...

Or maybe 4% for 160 envelopes running all together (even if they do nothing) is something that must be accepted? If so, I'm afraid of what will happen when I try to implement heavy math :hyper:

Thanks to everybody that will help me :P
Last edited by Nowhk on Wed May 10, 2017 7:25 am, edited 6 times in total.
clau_ste
KVRer
 
25 posts since 20 Jan, 2017

Postby clau_ste; Wed May 10, 2017 6:29 am Re: How much CPU would take 10 Envelopes per voice? It seems a lot...

I'm not a big expert, but vectorization (see https://github.com/NumScale/boost.simd) is probably a good place to start.

Practically, vectorization lets you perform the same calculation on several values at the same time
http://meseec.ce.rit.edu/756-projects/s ... 13/2-2.pdf
PurpleSunray
KVRian
 
533 posts since 13 Mar, 2012

Postby PurpleSunray; Wed May 10, 2017 8:48 am Re: How much CPU would take 10 Envelopes per voice? It seems a lot...

First - there is a reason why audio and control signals are handled differently on most DAWs - it is performance. ;)

Think about the following:
Your audio signal should be able to capture frequencies up to 24kHz, so you need a sample rate of 48kHz.
Can your control signal also contain such high frequencies? The answer is no in most cases - if it did, it would be an audio signal. A control signal, such as an envelope or an LFO, is meant to control the processing of your audio signal, but it doesn't contain "music content" on its own.
So this means that control signals don't necessarily need to run at the audio sample rate; they can run at a lower rate.
For example, if your control rate is 1/8 of the audio rate, you have just cut the cost of your control-signal code to roughly 1/8 (since you only calculate the control signal every 8th audio sample, instead of every audio sample).

Having said that.. there are a couple of pitfalls with this.
One is "zipper noise". A gain modulation should be sample-accurate. If you apply the modulation value in blocks of 8 audio samples each, the gain changes in steps and you create noise (often called "zipper noise"). So up-sample (= interpolate) to audio rate before you apply a gain modulation. It's not that critical for e.g. filter cutoff modulation; there 1/8 works fine.
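
A minimal sketch of that idea, under assumptions: the helper name `applyInterpolatedGain`, the callback `nextControlValue` and the use of `std::vector<float>` are all made up for illustration and are not from anyone's plugin. The envelope is evaluated once per control block, and a cheap linear ramp fills in the per-sample gain in between.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: applies a gain envelope that is evaluated only once
// every `controlPeriod` samples. Between two control values the gain is
// linearly interpolated, so it moves smoothly instead of in steps.
// `nextControlValue` stands in for whatever computes the (expensive) envelope.
template <typename EnvelopeFn>
void applyInterpolatedGain(std::vector<float>& audio,
                           std::size_t controlPeriod,
                           float startGain,
                           EnvelopeFn nextControlValue) {
    assert(controlPeriod > 0);
    float current = startGain;
    for (std::size_t start = 0; start < audio.size(); start += controlPeriod) {
        // One envelope evaluation per control block.
        const float target = nextControlValue();
        const float step = (target - current) / static_cast<float>(controlPeriod);
        const std::size_t end = std::min(start + controlPeriod, audio.size());
        for (std::size_t i = start; i < end; ++i) {
            current += step;      // per-sample cost: one add ...
            audio[i] *= current;  // ... and one multiply
        }
    }
}
```

With `controlPeriod = 8` the envelope function runs once per 8 samples, while the audio still sees a gain that changes on every sample.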

Nowhk wrote:Are 3 nested "for" (samples x 16 x 10) a big deal working with audio?

Nested for's are always something to look at when you have performance issues.
It's not a problem per se, but it depends on how you access data in them (which I cannot see in your code snippet ;) )
It is a problem if it causes a lot of cache faults (cache misses).
Example:
You have a block of audio samples that represents a voice.
You want to apply 10 envelopes to that voice.

Solution 1:
Get sample 0 from voice
Get value 0 from envelope 0 -> apply modulation
Get value 0 from envelope 1 -> apply modulation
..
Get sample 1 from voice
..
Get sample 2 from voice
..

This means:
cache fault - load the cache line that has the voice data into the CPU cache, load value 0 into a register
cache fault - load the cache line that has envelope 0, load value 0 into a register
cache fault - load the cache line that has envelope 1, load value 0 into a register
...........
cache fault - load the cache line that has the voice data into the CPU cache, load value 1 into a register
cache fault .....


Solution 2:
Apply all of envelope 0 to the voice, then envelope 1, then envelope 2.

This means:
cache fault - load the cache line that has the voice data into the CPU cache, load value 0 into a register
cache fault - load the cache line that has envelope 0, load value 0 into a register, calculate
load value 1 into a register, calculate
load value 2 into a register, calculate
..
cache fault - load the cache line that has envelope 1, load value 0 into a register
load value 1 into a register, calculate
..
A lot less traffic on caches / RAM.

And as a last side-note.. it's not that bad in reality. :D
CPUs have big caches and are pretty smart about prefetching nowadays.
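
To make the two loop orders concrete, here is a rough sketch. It assumes the envelope values already sit in per-envelope buffers, which may not match your plugin; `Block`, `applySampleMajor` and `applyEnvelopeMajor` are names invented for this example. Both functions produce the same output and only the memory access pattern differs.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Block = std::vector<float>;

// "Solution 1": for every sample, hop through all envelope buffers.
// Each inner step reads from a different buffer.
void applySampleMajor(Block& voice, const std::vector<Block>& envelopes) {
    for (std::size_t i = 0; i < voice.size(); ++i) {
        for (const Block& env : envelopes) {
            voice[i] *= env[i];
        }
    }
}

// "Solution 2": stream through one envelope buffer from start to end,
// then move on to the next one. Reads are sequential within each pass.
void applyEnvelopeMajor(Block& voice, const std::vector<Block>& envelopes) {
    for (const Block& env : envelopes) {
        assert(env.size() == voice.size());
        for (std::size_t i = 0; i < voice.size(); ++i) {
            voice[i] *= env[i];
        }
    }
}
```

Whether the second form is measurably faster depends on buffer sizes and the CPU, so it is worth profiling both rather than assuming.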
Miles1981
KVRian
 
1175 posts since 26 Apr, 2004, from UK

Postby Miles1981; Wed May 10, 2017 9:07 am Re: How much CPU would take 10 Envelopes per voice? It seems a lot...

For 10 envelopes, you may be able to fit that inside your L1 cache, so not that many cache faults.
The issue is also pipeline flushes. When you apply the modulation, you have a test, depending on your current power value, to apply the attack ratio or the release ratio. When the branch predictor guesses this test wrong, the pipeline is flushed, and that can slow things down by a huge factor. And as the power is different for each value, you are stuck...
The solution is to process several envelopes at the same time and use the mask capability of AVX512 (I think it's AVX512, there are some AVX2 capable processors that don't support masking). But of course, you are limited to this kind of core.
So in the general case, you are stuck with something really bad, but there is nothing you can really do about it, except try to minimize cache misses (but your data is already in your L1 anyway).
Nowhk
KVRian
 
549 posts since 2 Oct, 2013

Postby Nowhk; Mon May 15, 2017 11:14 am Re: How much CPU would take 10 Envelopes per voice? It seems a lot...

Thanks both for the replies...
Well...

PurpleSunray wrote:Solution 2:
Apply all of envelope 0 to the voice, then envelope 1, then envelope 2.

This means:
cache fault - load the cache line that has the voice data into the CPU cache, load value 0 into a register
cache fault - load the cache line that has envelope 0, load value 0 into a register, calculate
load value 1 into a register, calculate
load value 2 into a register, calculate

Not sure: why is there no cache fault here? When I load value 1 (or 2; of envelope 0, right?) I don't have it in the cache. Envelope values are not stored in an array. I calculate them on the fly for each sample (also because they can be automated). So I still need to read them every time. Or did I misunderstand what you are suggesting?

Also: when I read value 0 of envelope 0 and try to "calculate", value 0 of the voice has been replaced by the envelope's value, so the data is lost again and I should get a cache fault again :O

Miles1981 wrote:For 10 envelopes, you may be able to fit that inside your L1 cache, so not that many cache faults.

How can I do it? Is it something I can force? Or does it depend on the code design?
DJ Warmonger
KVRAF
 
2197 posts since 7 Jun, 2012, from Warsaw

Postby DJ Warmonger; Mon May 15, 2017 11:32 am Re: How much CPU would take 10 Envelopes per voice? It seems a lot...

Miles1981 wrote:
When you apply the modulation, you have a test, depending on your current power value, to apply the attack ratio or the release ratio. This test can reset your computations and can slow down by a huge factor.

That's why you should never use branch instructions (if) in DSP. In this case, just write the factor once when the envelope changes, then multiply by that same variable every time.
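
Something along these lines, as a sketch only: `MultiplicativeEnvelope`, `setStage` and `tick` are hypothetical names, and a real envelope would also need to decide when a stage ends (which can be done outside the per-sample loop).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical envelope: the stage decision is made once, when the stage
// changes, and stored as a per-sample factor. The inner loop then contains
// no "attack or release?" test at all.
struct MultiplicativeEnvelope {
    float level = 1.0f;
    float factor = 1.0f;

    // Called rarely (note on/off, stage change). Any branching needed to
    // pick the factor happens before this call, not per sample.
    void setStage(float perSampleFactor) {
        assert(perSampleFactor > 0.0f);
        factor = perSampleFactor;
    }

    // Called for every sample: a single multiply, no "if".
    float tick() {
        level *= factor;
        return level;
    }
};

// Fills `out` with `n` envelope values using the branch-free tick.
void renderEnvelope(MultiplicativeEnvelope& env, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = env.tick();
    }
}
```

A factor below 1 gives an exponential decay, a factor above 1 an exponential rise; switching between them is just another `setStage` call.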
http://djwarmonger.wordpress.com/
Tricky-Loops wrote: (...)someone like Armin van Buuren who claims to make a track in half an hour and all his songs sound somewhat boring(...)
camsr
KVRAF
 
6542 posts since 16 Feb, 2005

Postby camsr; Mon May 15, 2017 12:50 pm Re: How much CPU would take 10 Envelopes per voice? It seems a lot...

samples * 16 * 10 is like 160x the overhead versus just the sample... this will undoubtedly cause memory slowdowns even on today's fast and large-cache CPUs. An x86 cacheline is 64 bytes, which can fit 64/4 floats (16 floats) or 64/8 doubles (8 doubles). If you attempt to parallelize the problem with SSE/AVX or whatever, you are trying to pull data from many different places, and this is what causes the faults.

The Itanium processors (which are being phased out now) have 128-byte cachelines. I think these would have been better for DSP, but what programmer would optimize a VST for Itanium? :neutral:
Nowhk
KVRian
 
549 posts since 2 Oct, 2013

Postby Nowhk; Mon May 15, 2017 1:27 pm Re: How much CPU would take 10 Envelopes per voice? It seems a lot...

:o
So let me understand: can I do something to improve it? Or must I just accept these limits and go ahead?

From what I see, removing branches (if I'm lucky) can improve the whole thing by 1% :D

I'm not sure about "cache-friendly", but honestly in the code above I do "nothing" and it takes lots of CPU... Not sure if it can be optimized (by me, or by the compiler).
MadBrain
KVRian
 
931 posts since 1 Dec, 2004

Postby MadBrain; Mon May 15, 2017 2:24 pm Re: How much CPU would take 10 Envelopes per voice? It seems a lot...

Suggestions:
- Process voice by voice instead of sample by sample! CPUs like this (even though you have to do an extra memset at the start).
- Process some of your voice control at a slower rate, such as sampleRate / 16. This is especially beneficial for everything that has slow operations like sqrt(), pow(), log(), sin() etc. Use a simple ramping algo for stuff like volume.
- Turn on your compiler's optimization for release builds.

Code:
// Clear the output first: each voice adds (+=) its samples into these buffers.
memset(outputLeft, 0, nFrames * sizeof(float));
memset(outputRight, 0, nFrames * sizeof(float));

int samplesLeft = nFrames;
while(samplesLeft > 0) {
  pMidiHandler->Process();
  pVoiceManager->Process();

  int blockSize = samplesLeft;
  int offsetNextEvent = pMidiHandler->GetSamplesTillNextEvent();
  if(offsetNextEvent < blockSize)
    blockSize = offsetNextEvent;
  offsetNextEvent = pVoiceManager->GetSamplesTillNextEvent();
  if(offsetNextEvent < blockSize)
    blockSize = offsetNextEvent;

  for(int voice = 0; voice < PLUG_VOICES_BUFFER_SIZE; voice++) {
    pVoiceManager->mVoices[voice].mixToBlock(outputLeft, outputRight, blockSize);
  }
  mFancyReverbEffectOrWhatever.processBlock(outputLeft, outputRight, blockSize);

  pMidiHandler->IncrementTime(blockSize);
  pVoiceManager->IncrementTime(blockSize);
  samplesLeft -= blockSize;
  outputLeft += blockSize;
  outputRight += blockSize;
}

Code:
void VoiceClass::mixToBlock(float *left, float *right, int nbSamples) {
  if(!mIsPlaying)
    return;

  while(nbSamples > 0) {
    mModulationUpdate -= 1;
    if(mModulationUpdate <= 0) {
      mModulationUpdate += mModulationUpdatePeriod;
      PROCESS_ALL_SLOW_MODULATIONS_AND_EXPONENTIALS_ETC_HERE();
    }

    float output = INSERT_YOUR_ALGO_HERE();
    *left++ += output;
    *right++ += output;

    for(int env=0; env<10; env++)
      mEnvelope[env].FAST_INLINED_RAMPING_UPDATE_THAT_ONLY_USES_ADDITION_OR_MULTIPLY();

    nbSamples--;
  }
}
camsr
KVRAF
 
6542 posts since 16 Feb, 2005

Postby camsr; Mon May 15, 2017 4:56 pm Re: How much CPU would take 10 Envelopes per voice? It seems a lot...

Nowhk wrote:must I just accept these limits and go ahead?


Yes. There's no faster memory than the cacheline, so your efforts should revolve around using it as much as possible. The simplest example of this: you have an array of 16 floats (this array must be aligned on a 64-byte boundary); do as much as possible with it before you have to access (read or write) any other memory.
https://en.wikipedia.org/wiki/Locality_of_reference
The second paragraph here basically tells you that when there is a fault, all those fancy processor features don't work anymore :)
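
For illustration only, a sketch of such an array in C++11 and later. It assumes 64-byte cache lines and 4-byte floats; `EnvelopeLine` and `scaleAndSum` are invented names, not part of any plugin framework.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines (typical for x86) and 4-byte floats.
constexpr std::size_t kCacheLineBytes = 64;
constexpr std::size_t kFloatsPerLine = kCacheLineBytes / sizeof(float);

// 16 floats that start on a cache-line boundary, so the whole array
// occupies exactly one line.
struct alignas(kCacheLineBytes) EnvelopeLine {
    float values[kFloatsPerLine];
};

static_assert(sizeof(EnvelopeLine) == kCacheLineBytes, "must fill one line");
static_assert(alignof(EnvelopeLine) == kCacheLineBytes, "must be line-aligned");

// Does several operations on the same line before touching other memory:
// scales every value in place and returns their sum.
inline float scaleAndSum(EnvelopeLine& line, float gain) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < kFloatsPerLine; ++i) {
        line.values[i] *= gain;
        sum += line.values[i];
    }
    return sum;
}
```

Note that `alignas` covers objects with static or automatic storage; for heap allocations you need an aligned allocator (C++17 `new` handles over-aligned types, older code needs something platform-specific).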
Nowhk
KVRian
 
549 posts since 2 Oct, 2013

Postby Nowhk; Mon May 15, 2017 11:18 pm Re: How much CPU would take 10 Envelopes per voice? It seems a lot...

MadBrain wrote:- Process voice by voice instead of sample by sample! CPUs like this (even though you have to do an extra memset at the start).

camsr wrote:Yes. There's no faster memory than the cacheline, so your efforts should revolve around using it as much as possible.

I guess MadBrain's nice suggestion above is meant to "improve" the use of that cacheline. But why would it? I don't see why the CPU is "faster" when processing a block of samples within a voice (and mixing it at the end) instead of processing voices within a sample :o It has to do the same number of iterations...

Also: why do you use memset? What's the benefit here? From ProcessDoubleReplacing, outputLeft and outputRight (i.e. double **inputs, double **outputs) are already allocated (I guess).

Many thanks guys, I'm learning aalllottttttt!!!!!
S0lo
KVRist
 
401 posts since 31 Dec, 2008

Postby S0lo; Tue May 16, 2017 12:41 am Re: How much CPU would take 10 Envelopes per voice? It seems a lot...

DJ Warmonger wrote:
When you apply the modulation, you have a test, depending on your current power value, to apply the attack ratio or the release ratio. This test can reset your computations and can slow down by a huge factor.

That's why you should never use branch instructions (if) in DSP. In this time just write factor once when envelope changes, then multiply by one and the same variable every time.


Actually, if statements can save you CPU if you use them in the right place, for example to avoid entering a CPU-intensive block of code when that block's output is not likely to change very often.
DJ Warmonger
KVRAF
 
2197 posts since 7 Jun, 2012, from Warsaw

Postby DJ Warmonger; Tue May 16, 2017 1:18 am Re: How much CPU would take 10 Envelopes per voice? It seems a lot...

S0lo wrote:
Actually if statements can save you CPU if you use them in the right place, like for example avoiding to enter a CPU intensive block of code if this block is not likely to change its output so often.

No one denies the obvious. But no branching in the innermost loop; it eats all the performance.
S0lo
KVRist
 
401 posts since 31 Dec, 2008

Postby S0lo; Tue May 16, 2017 1:51 am Re: How much CPU would take 10 Envelopes per voice? It seems a lot...

DJ Warmonger wrote:
No one denies the obvious. But no branching in innermost loop, it eats all the performance.


In that case, I agree. Yet the obvious can sometimes be masked by a habitual rule of thumb.
Miles1981
KVRian
 
1175 posts since 26 Apr, 2004, from UK

Postby Miles1981; Tue May 16, 2017 2:22 am Re: How much CPU would take 10 Envelopes per voice? It seems a lot...

Nowhk wrote:
Miles1981 wrote:For 10 envelopes, you may be able to fit that inside your L1 cache, so not that many cache faults.

How can I do it? Is it something I can force? Or does it depend on the code design?

You always get a chunk of data, never one sample after another. That would cost too much from the hardware/OS point of view. All low-level APIs give you that access (VST, AU, ASIO...). If you don't have it, then your API is wrong (like the one from Pirkle, dead wrong in terms of performance).