## How much CPU would take 10 Envelopes per voice? It seems a lot...

matt42
KVRian

996 posts since 9 Jan, 2006
S0lo wrote:problem is the max and min implementation itself uses conditionals.
oh well, I just need non-conditional min/max functions.

S0lo wrote:Any way, you can do this:

x = min(x, 1);
Indeed, that was daft of me
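For what it's worth, a clamp built from std::min/std::max is usually already branch-free on x86: the compiler typically lowers it to minsd/maxsd instructions rather than conditional jumps. A minimal sketch:

```cpp
#include <algorithm>

// Branchless clamp to [0, 1]: std::min/std::max on doubles typically
// compile to minsd/maxsd on x86, i.e. no conditional jump is emitted.
inline double clamp01(double x) {
    return std::min(std::max(x, 0.0), 1.0);
}
```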
S0lo
KVRian

551 posts since 31 Dec, 2008
Miles1981 wrote:
S0lo wrote:Both codes do the same thing: sorting. But Code1 (insertion sort) uses two conditionals (an if and a while), while Code2 (quicksort) has four conditionals (2 ifs and 2 whiles). Yet Code2 is well known to perform much faster (O(n log n)) than Code1 (quadratic) on average.

check here: https://en.wikipedia.org/wiki/Quicksort
and here: https://en.wikipedia.org/wiki/Insertion_sort

You are comparing apples and oranges.

apples can be compared to oranges if there is a comparative goal, like "Health".

Miles1981 wrote:Of course if the complexity is different, this changes everything! We are talking about the same complexity in number of instructions.

And I'm actually not talking about that.
matt42
KVRian

996 posts since 9 Jan, 2006
Miles1981 wrote:
S0lo wrote:Both codes do the same thing: sorting. But Code1 (insertion sort) uses two conditionals (an if and a while), while Code2 (quicksort) has four conditionals (2 ifs and 2 whiles). Yet Code2 is well known to perform much faster (O(n log n)) than Code1 (quadratic) on average.

check here: https://en.wikipedia.org/wiki/Quicksort
and here: https://en.wikipedia.org/wiki/Insertion_sort

You are comparing apples and oranges. Of course if the complexity is different, this changes everything! We are talking about the same complexity in number of instructions.
I think the point here is that the goal is to find the most efficient algorithm and by extension reduce the complexity. Once we find that then we needn't concern ourselves with details such as conditionals == bad, or whatever.

In the general case, I agree, conditionals will tend to be expensive and most likely should be avoided.
camsr
KVRAF

6805 posts since 16 Feb, 2005
Also important to note: comparisons are cheap, branches are expensive. If you can do some logic around comparison results without branching, that could be the best option.
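One way to sketch "logic around comparison results without branching": treat the 0/1 result of the comparison as arithmetic instead of steering a jump. The helper name `selectGE` is made up for illustration:

```cpp
// Branchless select: the comparison yields 0 or 1, which is used as a
// multiplier rather than as a branch condition. Compilers often emit a
// conditional move (cmov) or plain math for code shaped like this.
inline double selectGE(double x, double threshold, double a, double b) {
    const double mask = static_cast<double>(x >= threshold); // 1.0 or 0.0
    return mask * a + (1.0 - mask) * b;  // a if x >= threshold, else b
}
```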
S0lo
KVRian

551 posts since 31 Dec, 2008
camsr wrote:Also important to note that, comparisons are cheap, branches are expensive.. if you can do some logic around comparison results without branching it could be the best option.

Yup, that's why the most probable case should be put in the TRUE part, because the FALSE path is what causes the branch.
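If the compiler supports it, the expected direction can also be stated explicitly. A sketch using the GCC/Clang `__builtin_expect` builtin (the denormal fix-up is a made-up rare case for illustration):

```cpp
// __builtin_expect tells the compiler which outcome is likely, so it can
// lay out the hot path as straight-line fall-through code (GCC/Clang builtin).
inline double processSample(double x, bool denormalFix) {
    if (__builtin_expect(denormalFix, 0)) {  // hint: almost never taken
        x += 1e-18;                          // hypothetical rare fix-up
    }
    return x * 0.5;
}
```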
MadBrain
KVRian

943 posts since 1 Dec, 2004
Nowhk wrote:
MadBrain wrote:- Process voice by voice instead of sample by sample! CPUs like this (even though you have to do an extra memset at the start).

camsr wrote:Yes. There's no faster memory than the cacheline, so your efforts should revolve around using it as much as possible.

I guess MadBrain's nice suggestion above is to "improve" that cacheline usage. But why should it? I don't see why the CPU is "faster" at processing a sample block within a voice (and mixing at the end) instead of processing voices within a sample. It has to do the same number of iterations...

- It takes pMidiHandler->Process(); and pVoiceManager->Process(); out of the per-sample loop, which reduces the amount of code the instruction cache "sees" in the sample-per-sample section if those functions are large, making it more likely that it fits in L1 instruction cache or micro-op cache. This also applies if you reduce how often your modulations are calculated.

- It makes it more likely that your if()s will keep always going the same way in a row (or go in a short pattern that the branch predictor can follow), since you're processing the same voice over and over. For instance for your 160 envelopes, if you mix up the processing for all voices, it's likely that your different voices will be in different states, so the branch predictor will go all over the place. If you process per voice then the branch predictor will only see the 10 same envelopes over and over so the chances that it will predict accurately are higher.

- In the process of generating a sample for a voice, the CPU has to load and store a bunch of state variables from your voice object. If you process per voice, all those loads and stores are going to fall on the same addresses over and over, which makes it quite more likely that it's all in L1 cache, and it's also a better pattern for the prefetcher. If you're reading data from some wavetable, then all your loads are going to cluster together and advance more or less linearly, instead of falling all over the place depending on oscillator phases.

- You'll also probably have more luck getting the compiler to keep some of your values in registers instead of loading/storing them over and over, since you can read them into function local values (though on x86 cutting down on loads/stores is hard!).

- The CPU will not have to calculate the address of each of your voice objects every sample. This is only an addition plus an optimized multiplication per size of your voice object size, but all the loads/stores to your voice object variables depend on it... I don't know how much effect this has on bleeding edge CPUs but I could definitely see the effect on a pre-Ryzen multicore AMD.
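A minimal sketch of the voice-major structure these points describe (all names are hypothetical, with a simple phase ramp standing in for a real oscillator):

```cpp
#include <cstring>
#include <vector>

// Hypothetical minimal voice state.
struct Voice {
    double phase   = 0.0;
    double freq    = 0.01;
    bool   playing = true;
};

// Voice-major processing: the outer loop walks voices, the inner loop walks
// samples. One voice's state stays in locals (likely registers) for a whole
// block, its loads/stores hit the same addresses, and its branches repeat.
void processBlock(std::vector<Voice>& voices, double* out, int nFrames) {
    std::memset(out, 0, nFrames * sizeof(double));  // clear once, then accumulate
    for (Voice& v : voices) {
        if (!v.playing) continue;
        double phase      = v.phase;   // pull state into locals once
        const double freq = v.freq;
        for (int i = 0; i < nFrames; ++i) {
            out[i] += phase;           // placeholder "oscillator" output
            phase  += freq;
        }
        v.phase = phase;               // write state back once per block
    }
}
```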

Nowhk wrote:Also: why do you use memset? What's the benefit here? From ProcessDoubleReplacing, outputLeft and outputRight (i.e. double **inputs, double **outputs) are already allocated (I guess).

Many thanks guys, I'm learning aalllottttttt!!!!!

Memset just fills the buffer with 0.0. It's equivalent to doing this:
Code: Select all
for(int i=0; i<nbSamples; i++)  outputLeft[i] = 0.0;
for(int i=0; i<nbSamples; i++)  outputRight[i] = 0.0;

The advantage is that the standard library memset() generally has a fast assembly-optimized version.
camsr
KVRAF

6805 posts since 16 Feb, 2005
What madbrain is saying is, when you process voice-by-voice the CPU has an easier time caching because it uses the same data sequentially. This improves throughput for the actual instructions since there is less waiting on data that may not be cached (in L1).
Nowhk
KVRian

749 posts since 2 Oct, 2013
Thanks for the replies dudes
Well, in order... (sorry if I reply with questions, but it's the only way I know to check whether I've understood)

MadBrain wrote:Memset just fills the buffer with 0.0. It's equivalent to doing this:
Code: Select all
for(int i=0; i<nbSamples; i++)  outputLeft[i] = 0.0;
for(int i=0; i<nbSamples; i++)  outputRight[i] = 0.0;

The advantage is that the standard library memset() generally has a fast assembly-optimized version.

I see. Nice. So instead of doing it for each sample, you "pre-set" them with 0.0. Got it.

MadBrain wrote:- It takes pMidiHandler->Process(); and pVoiceManager->Process(); out of the per-sample loop, which reduces the amount of code the instruction cache "sees" in the sample-per-sample section if those functions are large, making it more likely that it fits in L1 instruction cache or micro-op cache. This also applies if you reduce how often your modulations are calculated.

So you mean something like "the fewer variables/functions not needed for every sample in the critical section (per-sample loop), the more space you have to store "main/audio" data within cache and registers (reducing cache misses/faults)", right?

MadBrain wrote:- It makes it more likely that your if()s will keep always going the same way in a row (or go in a short pattern that the branch predictor can follow), since you're processing the same voice over and over. For instance for your 160 envelopes, if you mix up the processing for all voices, it's likely that your different voices will be in different states, so the branch predictor will go all over the place. If you process per voice then the branch predictor will only see the 10 same envelopes over and over so the chances that it will predict accurately are higher.

Basically, you are isolating the code without branch/if statements (which, for reasons I don't get yet, will slow the code; but I think it's a sort of holy grail, watching S0lo and PurpleSunray fight ) within the per-sample loop (i.e. critical section). Have I got your words?

MadBrain wrote:- In the process of generating a sample for a voice, the CPU has to load and store a bunch of state variables from your voice object. If you process per voice, all those loads and stores are going to fall on the same addresses over and over, which makes it quite more likely that it's all in L1 cache, and it's also a better pattern for the prefetcher. If you're reading data from some wavetable, then all your loads are going to cluster together and advance more or less linearly, instead of falling all over the place depending on oscillator phases.

What does "all those loads and stores are going to fall on the same addresses over and over" mean? When I switch to another voice, I don't care if the new Voice object loads/stores its data at the previous voice's addresses. Why should it care?

MadBrain wrote:- You'll also probably have more luck getting the compiler to keep some of your values in registers instead of loading/storing them over and over, since you can read them into function local values (though on x86 cutting down on loads/stores is hard!).

Are you saying it's more probable that the CPU places a local var directly in a register if the loop iteration is 100 samples long (i.e. a block size) instead of 16 (the voice iteration)? If so, that really makes sense...

MadBrain wrote:- The CPU will not have to calculate the address of each of your voice objects every sample. This is only an addition plus an optimized multiplication per size of your voice object size, but all the loads/stores to your voice object variables depend on it... I don't know how much effect this has on bleeding edge CPUs but I could definitely see the effect on a pre-Ryzen multicore AMD.

Another great reason

Tonight I'll try to "re-design" my whole PDR code with these (very kind) suggestions
PurpleSunray
KVRian

780 posts since 13 Mar, 2012
Nowhk wrote:Are you saying it's more probable that the CPU places a local var directly in a register if the loop iteration is 100 samples long (i.e. a block size) instead of 16 (the voice iteration)? If so, that really makes sense...

Not really (your biggest issue will be that the compiler manages the registers, but the loop count is not necessarily known at compile time).

And it's also based on the idea that the CPU will execute instructions sequentially, otherwise you can't share a register state across two instructions. But that is not necessarily the case on modern compilers & CPUs.

https://en.wikipedia.org/wiki/Instruction_pipelining

If you understand how pipelining works, it suddenly gets very clear why stuff like branching matters: an i7 has 14 pipeline stages, which means that if the pipeline is fully loaded and requires a complete flush (aka you suckered the branch prediction), you must wait on / give up 14 instructions.
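The classic demonstration of this is that the same loop runs much faster on sorted data, because the branch settles into long predictable runs instead of flipping near-randomly. A sketch of the loop in question (timing harness omitted):

```cpp
#include <vector>

// Summing elements above a threshold. On sorted input the branch is taken
// in one long run, so prediction is near-perfect; on shuffled input the
// outcome is close to random, and each misprediction costs a pipeline
// flush (on the order of the pipeline depth, ~14+ stages).
long long sumAbove(const std::vector<int>& v, int threshold) {
    long long sum = 0;
    for (int x : v) {
        if (x >= threshold)   // predictable on sorted data, random otherwise
            sum += x;
    }
    return sum;
}
```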
Nowhk
KVRian

749 posts since 2 Oct, 2013
PurpleSunray wrote:Not really (your biggest issue will be that the compiler manages the registers, but the loop count is not necessarily known at compile time).

Sorry typo (I meant compiler not CPU).

But what do you mean by "is not known at compile time"? It "knows" my code

Can you give me a basic example of a v1 piece of code whose compilation won't use registers for vars (because it "can't know") and a v2 (which does the same thing) that will use them? (or cache).

PurpleSunray
KVRian

780 posts since 13 Mar, 2012

Loop count is the buffer size of the audio driver, for example.

But as said, keeping values in registers is a little optimization compared to coding a loop that perfectly fits into the SIMD pipeline of an i7 and/or Zen (it might even be counterproductive if that "register sync" causes pipeline stalls)
MadBrain
KVRian

943 posts since 1 Dec, 2004
Nowhk wrote:So you mean something like "the fewer variables/functions not needed for every sample in the critical section (per-sample loop), the more space you have to store "main/audio" data within cache and registers (reducing cache misses/faults)", right?

Modern CPUs have something like 32k of L1 data cache, made out of cache lines of 64 bytes. This means that if you load some variable, the whole adjacent 64 bytes will be loaded into L1 cache as well, and if your loop deals with more than 32k of data it will be overwriting parts of L1 so those parts will be loaded from L2 every iteration, and data from L2 takes 10+ cycles to load in instead of 3, and you can't do 2 loads per cycle on L2.
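To put rough numbers on the figures above (32 KB L1D, 64-byte lines; the VoiceState layout is hypothetical):

```cpp
#include <cstddef>

// Hypothetical per-voice state: 4 scalars + 10 envelope values = 112 bytes.
struct VoiceState {
    double phase, freq, amp, pan;
    double env[10];
};

constexpr std::size_t kL1DataCache = 32 * 1024;  // typical L1D size (assumed)
constexpr std::size_t kCacheLine   = 64;         // typical line size (assumed)

// Cache lines touched per voice, rounding up to whole lines.
constexpr std::size_t kLinesPerVoice =
    (sizeof(VoiceState) + kCacheLine - 1) / kCacheLine;

// How many voices' worth of state fits in L1 at once (data only,
// ignoring everything else the loop touches).
constexpr std::size_t kVoicesInL1 = kL1DataCache / sizeof(VoiceState);
```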

Nowhk wrote:Basically, you are isolating the code without branch/if statements (which, for reasons I don't get yet, will slow the code; but I think it's a sort of holy grail, watching S0lo and PurpleSunray fight ) within the per-sample loop (i.e. critical section). Have I got your words?

Yes, the problem is that it takes many cycles for the CPU to make an instruction ready to execute because it has to (1) read the instruction from instruction cache and check that the address from the virtual page mapping matches (2) align the 3~5 instructions from the 16 byte instruction stream block that it has read into the decoders (3) undo the instruction encoding and split the instructions into micro-ops (for instance math+memory instructions are split into one micro-op for the load and one for the math part) (4) rename the registers used in the instructions (5) queue the instructions to execution units.

For the instruction right after a conditional jump, the CPU could start this whole process right after doing it for the conditional jump if it KNEW that the next instruction is going to run. But that's going to take dozens of cycles, since you'd need to wait for the conditional jump to actually get loaded-decoded-renamed-queued, and often even more cycles for it to actually run, as it has to wait for the instructions that produce the value it uses.

So CPUs do the next best thing: they GUESS if the jump is going to be taken or not and start loading the instructions from the guess right away. If the guess is right, then the jump is almost free. If the guess is wrong, then the CPU has to undo all the partially executed instructions it started doing after the conditional jump, which can easily be dozens of instructions!

This is why if() is very cheap if the CPU can do an accurate or mostly accurate guess, and quite expensive if it guesses wrong often! Generally the predictors deal well with if()'s that almost always go the same way as the previous time, or have short repeated patterns. This also applies to switch(), for(), calling function pointers and virtual functions, etc...

Nowhk wrote:What does "all those loads and stores are going to fall on the same addresses over and over" mean? When I switch to another voice, I don't care if the new Voice object loads/stores its data at the previous voice's addresses. Why should it care?

It cares that the data it loads/stores is in L1 cache (fastest but only 32k) or L2 cache (slower but larger) etc instead of memory. If the second time your loop runs, it's reloading the same memory addresses (because it's the same voice), then your chances that it's in the fastest L1 cache are like 99%... If you're iterating over a whole bunch of voices it might still all fit in L1 cache, but your chances go down...

Nowhk wrote:Are you saying it's more probable that the CPU places a local var directly in a register if the loop iteration is 100 samples long (i.e. a block size) instead of 16 (the voice iteration)? If so, that really makes sense...

From what I've seen, 99% of the time the compiler will only registerify variables that are function-local, and not variables that are inside an object. This means that in some cases you can do the following trick:

Code: Select all
uint32_t phase = voice.m_phase;
uint32_t freq = voice.m_freq;
int16_t *wave = voice.m_wavePtr;
while(samplesLeft > 0) {
  *output++ += wave[phase >> 20];
  phase += freq;
  samplesLeft--;
}
voice.m_phase = phase;

The compiler will likely keep phase, freq, wave, output and samplesLeft in registers, whereas if you ran sample-per-sample instead of voice-per-voice you'd likely have to reload phase, freq, and wave, and store phase on every iteration, instead of just once for the whole block.
PurpleSunray
KVRian

780 posts since 13 Mar, 2012
MadBrain wrote:The compiler will likely keep phase, freq, wave, i, output and samplesLeft in registers,

Not if it compiles for a 32-bit x86 target. samplesLeft might stay in ECX. Then you have EAX and EDX left, and you need EAX as an accumulator already. So you pretty much have one register you could use as a cache, but you want to cache 6 values.
Nowhk
KVRian

749 posts since 2 Oct, 2013
Hi all,

yesterday I tried to implement all of your suggestions in code (before totally understanding the "theory", which I'll return to later). Here's the final result (mainly following MadBrain's suggestions, since his code is smart), processing only the envelopes (for 16 voices running simultaneously):

Code: Select all
void MainIPlug::ProcessDoubleReplacing(double **inputs, double **outputs, int nFrames) {
   double *outputLeft = outputs[0];
   double *outputRight = outputs[1];
   memset(outputLeft, 0, nFrames * sizeof(double));
   memset(outputRight, 0, nFrames * sizeof(double));

   // buffer
   int samplesLeft = nFrames;
   while (samplesLeft > 0) {
      // events
      pMidiHandler->Process();
      pVoiceManager->Process();
      int blockSize = samplesLeft;
      blockSize = pMidiHandler->GetSamplesTillNextEvent(blockSize);

      // voices
      for (int voiceIndex = 0; voiceIndex < PLUG_VOICES_BUFFER_SIZE; voiceIndex++) {
         Voice &voice = pVoiceManager->mVoices[voiceIndex];
         if (!voice.mIsPlaying) {
            continue;
         }

         int nbSamples = blockSize;
         while (nbSamples > 0) {
            for (int envelopeIndex = 0; envelopeIndex < ENVELOPES_CONTAINER_NUM_ENVELOPE_MANAGER; envelopeIndex++) {
               Envelope &envelope = pEnvelopesContainer->pEnvelopeManager[envelopeIndex]->mEnvelope;

               // new voice restart envelope
               if (voice.mSample == 0.0) {
                  envelope.Reset(voice);
               }
               envelope.Process(voice);
            }
            nbSamples--;
         }
      }

      pMidiHandler->Flush(blockSize);
      pVoiceManager->Flush(blockSize);
      samplesLeft -= blockSize;
      outputLeft += blockSize;
      outputRight += blockSize;
   }
}

As you can see, an adaptation of MadBrain's code within my plug.
Below, the Envelope's functions:

Code: Select all
void Envelope::Reset(Voice &voice) {
   mVoiceParameters[voice.mIndex].mControlRateIndex = 0;
   mVoiceParameters[voice.mIndex].mBlockStep = gBlockSize;
   mVoiceParameters[voice.mIndex].mStep = 0.0;
   mVoiceParameters[voice.mIndex].mValue = mAmps[0];
}

void Envelope::Process(Voice &voice) {
   VoiceParameters &voiceParameters = mVoiceParameters[voice.mIndex];

   // control rate
   if (voiceParameters.mControlRateIndex-- == 0) {
      voiceParameters.mControlRateIndex = PLUG_CONTROL_RATE - 1;

      if (mIsEnabled) {
         // refresh at block size
         if (voiceParameters.mBlockStep >= gBlockSize) {
            // loop
            unsigned int sectionIndex = RefreshSectionIndex(voiceParameters.mStep);
            double sectionStep = RefreshSectionStep(sectionIndex, voiceParameters.mStep);
            unsigned int blockIndex = RefreshBlockIndex(sectionStep);
            double sectionLength = RefreshSectionLength(sectionIndex);

            int numBlocks = (int)(sectionLength / gBlockSize);
            double numBlocksFraction = 1.0 / numBlocks;
            double pos0 = blockIndex * numBlocksFraction;
            double pos1 = (blockIndex + 1) * numBlocksFraction;

            double a = 1.0 - (1.0 / mTensions[sectionIndex]);
            double p0 = pos0 / (pos0 + a * (pos0 - 1.0));
            double p1 = pos1 / (pos1 + a * (pos1 - 1.0));

            double sectionStartAmp = mAmps[sectionIndex];
            double sectionEndAmp = mAmps[sectionIndex + 1];
            double sectionDeltaAmp = sectionEndAmp - sectionStartAmp;

            voiceParameters.mBlockStartAmp = sectionStartAmp + p0 * sectionDeltaAmp;
            double blockEndAmp = sectionStartAmp + p1 * sectionDeltaAmp;

            voiceParameters.mBlockFraction = (blockEndAmp - voiceParameters.mBlockStartAmp) * (1.0 / gBlockSize);
            voiceParameters.mBlockStep = fmod(voiceParameters.mBlockStep, gBlockSize);
         }

         // update value
         voiceParameters.mValue = (voiceParameters.mBlockStartAmp + (voiceParameters.mBlockStep * voiceParameters.mBlockFraction));
         ScaleValue(voiceParameters.mValue);
      }
      else {
         // update value
         voiceParameters.mValue = 0.0;
      }
   }

   // next phase
   voiceParameters.mBlockStep += mRate;
   voiceParameters.mStep += mRate;
}

Unfortunately, it changes nothing. The CPU used is exactly the same, even using the control rate. Keeping envelope points automatable, I'm at about 10% CPU (even if the first code I wrote in this topic has been reduced from 4% to 2%, nice one ).

At this point, is it really the "if"s that cause this amount of CPU usage? OK, they are 16 voices and 10 envelopes, but I still feel that 10% is too much
camsr
KVRAF

6805 posts since 16 Feb, 2005
Can someone explain what an if statement branching to a continue statement is doing?