Any tips for optimizing this code?

DSP, Plugin and Host development discussion.

Post

@Nowhk
The local copy of your envelope data is again in RAM (on the stack). So the memcpy copies from RAM to RAM, which is pretty useless.
You gain performance from a "local copy" if the value fits into a CPU register. Then the compiler might decide to load it from RAM once and keep it in a register across loop iterations.

Post

PurpleSunray wrote: Wed Oct 17, 2018 8:33 pm @Nowhk
The local copy of your envelope data is again in RAM (on the stack). So the memcpy copies from RAM to RAM, which is pretty useless.
You gain performance from a "local copy" if the value fits into a CPU register. Then the compiler might decide to load it from RAM once and keep it in a register across loop iterations.
So how can I ensure that the data I need (for that loop) stays in CPU registers during the loop?

Post

PurpleSunray wrote: Wed Oct 17, 2018 8:20 pm
2DaT wrote: Wed Oct 17, 2018 4:15 pm
PurpleSunray wrote: Wed Oct 17, 2018 1:50 pm Why would you do all of this integer stuff? You can increment in a SIMD-friendly way.

Code: Select all

__m128d i1 = _mm_set_pd(1.0, 0.0);
__m128d i2 = _mm_set_pd(3.0, 2.0);
__m128d increment = _mm_set1_pd(4.0);
i1 = _mm_add_pd(i1, increment);
i2 = _mm_add_pd(i2, increment);
It has the drawback of a loop-carried dependency chain, but that's not a big deal for this loop.
Because it's 1 add_epi32 vs 2 add_pd. There is no point calculating doubles if you only need ints, is there?
Your code also contains these instructions:
_mm_set1_epi32, which is a cross-domain mov plus a shuffle.
_mm_cvtepi32_pd * 2, each roughly equivalent to an add but sometimes worse.
_mm_srli_si128, a shift or shuffle.
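For reference, here is a rough sketch of what that integer-counter approach typically looks like (my reconstruction for illustration, not the original code), so the extra instructions are visible:

Code: Select all

#include <emmintrin.h> // SSE2

// Hypothetical integer-counter variant: keep the sample index in an epi32 vector
// and convert to doubles when needed. 'out' and 'n' are placeholders; n is
// assumed to be a multiple of 4.
void fill_indices(double *out, int n)
{
    __m128i index = _mm_set_epi32(3, 2, 1, 0);   // four int32 lane indices
    const __m128i four = _mm_set1_epi32(4);      // the cross-domain mov + shuffle
    for (int i = 0; i < n; i += 4)
    {
        __m128d lo = _mm_cvtepi32_pd(index);                     // lanes 0,1 as doubles
        __m128d hi = _mm_cvtepi32_pd(_mm_srli_si128(index, 8));  // byte shift, then lanes 2,3
        _mm_storeu_pd(out + i, lo);
        _mm_storeu_pd(out + i + 2, hi);
        index = _mm_add_epi32(index, four);                      // the single integer add
    }
}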

Post

@2DaT OK
@Nowhk: you can't.
The compiler decides. All you can do is make it easy for it, and you do that by knowing your CPU (your CPU will most likely have 16 XMM registers to play with).
You did that already by storing some values in a local double instead of reading them from the struct. That way the compiler can generate code that reads the value from RAM once and keeps it in an XMM register. When you read from the struct directly, the compiler can't keep it in a register, because you asked to read it from the struct ;)
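A minimal sketch of that pattern (a hypothetical helper, not your actual code):

Code: Select all

// Copy the struct member into a local once, so the compiler is free to keep it
// in an XMM register for the whole loop instead of reloading it every iteration
// (it may not be able to prove that the store to mStep never touches mRate).
void advanceSteps(Envelope &envelope, EnvelopeVoiceData &voiceData, int blockSize)
{
    const double rate = envelope.mRate;   // one load from RAM
    for (int i = 0; i < blockSize; ++i)
        voiceData.mStep += rate;          // uses the register copy, not envelope.mRate
}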

Post

PurpleSunray wrote: Wed Oct 17, 2018 9:11 pm You did that already by storing some values in a local double instead of reading them from the struct. That way the compiler can generate code that reads the value from RAM once and keeps it in an XMM register. When you read from the struct directly, the compiler can't keep it in a register, because you asked to read it from the struct ;)
Exactly :) I place most data in local variables, so the compiler will use registers instead of accessing RAM.
That's why I memcpy "later": I'm filling a local array to avoid the CPU accessing RAM. But once I've filled it (at the end of the for loop) I need to copy it back to the array in RAM, otherwise I lose the calculations :)

Why did you say the local copy of my envelope data is again in RAM (on the stack)?
I'm missing that point, since (as you already said) I'm "storing some values in a local double instead of reading them from the struct" :D


Post

While I can't comment on the local-variable/SIMD situation (it seems like you guys have that covered), I did some code review of my own on the original and found some redundancies. Also, performance-wise, I don't recommend splitting code out into a separate function unless two or more functions use that code.

Code: Select all

  // buffer
  for (int remainingSamples = nFrames; remainingSamples > 0; remainingSamples -= PLUG_MAX_PROCESS_BLOCK)
  {
    // voices (buffer is 32: 16 simultaneous + 16 free slots to cut off the previous ones)
    for (int voiceIndex = 0; voiceIndex != PLUG_VOICES_BUFFER_SIZE; ++voiceIndex)
    {
      if (pVoiceManager->mVoices[voiceIndex].mIsPlaying)
      {
        for (int remainingVoiceSamples = (remainingSamples > PLUG_MAX_PROCESS_BLOCK) ? PLUG_MAX_PROCESS_BLOCK : remainingSamples; remainingVoiceSamples != 0; --remainingVoiceSamples)
        {
          for (int envelopeIndex = 0; envelopeIndex != mNumEnvelopes; ++envelopeIndex)
          {
            Envelope &envelope = *pEnvelope[envelopeIndex];
            EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];

            if (envelope.mIsEnabled)
            {
              // process block
              if (envelopeVoiceData.mBlockStep >= gBlockSize)
              {
                // calculate new envelope values for this block; it's processed every 100 samples, not such a heavy operation, so I think the core of my code can be ignored here
              }

              // update output value
              const double value = envelopeVoiceData.mBlockStartAmp + (envelopeVoiceData.mBlockStep * envelopeVoiceData.mBlockDeltaAmp);
              envelope.mValue[voiceIndex] = ((1 + envelope.mIsBipolar) / 2.0 * value + (1 - envelope.mIsBipolar) / 2.0) * envelope.mAmount;

              // next phase
              envelopeVoiceData.mBlockStep += envelope.mRate;
              envelopeVoiceData.mStep += envelope.mRate;
            }
          }

          ++pVoiceManager->mVoices[voiceIndex].mSampleIndex;
        }
      }
    }
  }

blockSize isn't needed, and remainingSamples isn't unsigned, so you can safely subtract the macro from it and the loop will still terminate. For loops often compile to better code than while loops in my experience, depending on the compiler. Also, just basic stuff: it may help the compiler optimize when the code's intent is clearer. Or not; compilers are super crazy nowadays. Speaking of which, I recommend compiling with nuwen MinGW if you can, not MSVC. Codelite is an excellent IDE that picks up MinGW easily. Much faster code output, often up to 20% in my benchmarks (with -march=native). It won't auto-vectorize until -O3, but sometimes that's not faster.
Hope it helps.

Post

Nowhk wrote: Thu Oct 18, 2018 7:36 am
PurpleSunray wrote: Wed Oct 17, 2018 9:11 pm You did that already by storing some values in a local double instead of reading them from the struct. That way the compiler can generate code that reads the value from RAM once and keeps it in an XMM register. When you read from the struct directly, the compiler can't keep it in a register, because you asked to read it from the struct ;)
Exactly :) I place most data in local variables, so the compiler will use registers instead of accessing RAM.
That's why I memcpy "later": I'm filling a local array to avoid the CPU accessing RAM. But once I've filled it (at the end of the for loop) I need to copy it back to the array in RAM, otherwise I lose the calculations :)

Why did you say the local copy of my envelope data is again in RAM (on the stack)?
I'm missing that point, since (as you already said) I'm "storing some values in a local double instead of reading them from the struct" :D
Your envelope data is not stored in a CPU register only.
An SSE2 CPU has 16 XMM registers, but your envelope data is an array of 64 doubles.
That doesn't fit, so it is allocated on the stack.

An example with some dummy code to show the difference.

Code: Select all

for (int i = 0; i < n; i++) {
    envelope.mValue[i] = envelope.data0 + envelope.data1;
}
this might become something like this:

Code: Select all

for (int i = 0; i < n; i++) {
    mov XMM0, envelope.data0;     // Load envelope.data0 from RAM to register
    mov XMM1, envelope.data1;     // Load envelope.data1 from RAM to register
    add XMM0, XMM1, XMM2;         // Add XMM0 to XMM1 and store on XMM2
    mov envelope.mValue[i], XMM2; // Store the result to RAM
}
If you change it to:

Code: Select all

double d0 = envelope.data0;
double d1 = envelope.data1;
for (int i = 0; i < n; i++) {
    envelope.mValue[i] = d0 + d1;
}
this might become something like this:

Code: Select all

mov XMM0, envelope.data0;     // Load envelope.data0 from RAM to register
mov XMM1, envelope.data1;     // Load envelope.data1 from RAM to register
for (int i = 0; i < n; i++) {
    add XMM0, XMM1, XMM2;         // Add XMM0 to XMM1 and store on XMM2
    mov envelope.mValue[i], XMM2; // Store the result to RAM
}
Two load operations have been removed from the loop, because the data0 and data1 values can be kept in registers.
Now suppose you do this:

Code: Select all

double d0 = envelope.data0;
double d1 = envelope.data1;
double values[64];
for (int i = 0; i < n; i++) {
    values[i] = d0 + d1;
}
double values[64] will not go into a register, because it doesn't fit. Instead it is allocated on the stack:

Code: Select all

mov XMM0, envelope.data0;     // Load envelope.data0 from RAM to register
mov XMM1, envelope.data1;     // Load envelope.data1 from RAM to register
sub esp, 512;                 // grow the stack by sizeof(double[64]) and remember the address (that is values[64])
for (int i = 0; i < n; i++) {
    add XMM0, XMM1, XMM2;    // Add XMM0 to XMM1 and store on XMM2
    mov values[i], XMM2;     // Store the result to RAM (on stack, as allocated above).
}
If you compare the last two examples, you will see that the loop body is the same. It doesn't matter that values[64] is "local".

In general, as soon as you need a pointer to a value, it cannot reside in a register only.
CPU registers do not have memory addresses. So the fact that you are passing a double* into the memcpy already tells you that the data cannot live in a CPU register only; it needs a representation in RAM, because you need its address.
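A tiny illustration of that last point (dummy function; data0 as in the examples above):

Code: Select all

#include <cstring>

void example(const Envelope &envelope, double *dst)
{
    double v = envelope.data0;       // on its own, v could live purely in an XMM register
    std::memcpy(dst, &v, sizeof v);  // &v demands an address, so v must get a stack slot
}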

Post

And I still don't get this:

Code: Select all

for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
	// update output value
	values[sampleIndex] = value;
	value += deltaValue;

	// next phase
	blockStep += rate;
}
Why all these increments?
Why not this:

Code: Select all

for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
	values[sampleIndex] = value + deltaValue * sampleIndex;
}
blockStep += rate * blockSize;
?

This "value += deltaValue;" makes it hard to vectorize. A dumb compiler won't do, while the second verson is easy to vectorize, even for a dumb compiler.

Long story:
The FPU registers can be packed with multiple values, so you can do an operation on 2 doubles (SSE) or 4 doubles (AVX) in parallel. But "value += deltaValue" means "calculate one after the other": you need the result of the previous operation to do the next one. It is hard to calculate two values in parallel if one depends on the previous.
On the other hand, calculating "value + deltaValue * sampleIndex" in parallel is incredibly easy. Just load a register with [sampleIndex][sampleIndex + 1] and you calculate two values for the cost of one (more or less.. :P )
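Roughly what that looks like if you write the vectorized version out by hand with SSE2 intrinsics (a sketch only; blockSize assumed even, names as in the snippets above):

Code: Select all

#include <emmintrin.h>

void fillRamp(double *values, double value, double deltaValue, int blockSize)
{
    __m128d index = _mm_set_pd(1.0, 0.0);     // [sampleIndex + 1, sampleIndex]
    const __m128d two   = _mm_set1_pd(2.0);
    const __m128d base  = _mm_set1_pd(value);
    const __m128d delta = _mm_set1_pd(deltaValue);
    for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex += 2)
    {
        // value + deltaValue * sampleIndex, two samples at once
        __m128d v = _mm_add_pd(base, _mm_mul_pd(delta, index));
        _mm_storeu_pd(&values[sampleIndex], v);
        index = _mm_add_pd(index, two);       // both lane indices advance together
    }
}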

Try changing it; MSVC seems to be a dumb compiler here. I got a boost of about 50% by replacing += with *.

Post

Extraneous variables are not a concern when using SSA optimization (in GCC this is enabled to varying degrees at all -O levels).
However, the typing of the variables might matter. For example, if two different conditional comparisons use an operand based on a single value, but that value is typecast to a second type (which effectively changes the underlying "machine mode") for one of the two comparisons, the compiler might generate a second value when only one is necessary.
I say might because I don't know.

Code: Select all

for (int i = 0; ; i++)  // 'i' should be assigned a runtime input, NOT a compile-time constant.
{
    if (i == 10)                   { printf("10\n"); }
    if ((char)i == 'C')            { printf("C\n"); }
    if ((long long int)i == 255ll) { printf("break"); break; }
}
So this code has 3 valid comparisons of literal constants against an incrementer that is typecast in 2 of its 3 uses. Notice that one cast is to a narrower type and the other to a wider type, but the static limit of the loop also fits in the range of the smallest type used with the incrementer.

Anyway, back on topic: I have observed that loop counters used for pointer dereferencing work best with a type that is the same size as the platform's pointer size. Use the 'size_t' type to index your arrays and pointers, but be aware that too many of them cost more memory.
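A trivial example of that indexing pattern (dummy code, not from this thread):

Code: Select all

#include <stddef.h>

void copy_buffer(double *dst, const double *src, size_t count)
{
    // size_t matches the platform pointer width, so the index feeds the address
    // calculation directly, without a widening/sign-extension step.
    for (size_t i = 0; i < count; ++i)
        dst[i] = src[i];
}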

Post

camsr wrote: Thu Oct 18, 2018 1:55 pm Extraneous variables are not a concern when using SSA optimization (in GCC this is enabled to varying degrees at all -O levels).
It depends on how they're used. In this case it did matter, so I restructured the code.

Post

PurpleSunray wrote: Thu Oct 18, 2018 9:57 am double values[64] will not go into a register, because it doesn't fit. Instead it is allocated on the stack
Got it. I've always thought that it would be "inserted" in chunks, so if it doesn't all fit, it would place "part of it, process it, then swap in another part". It seems I was wrong.

That said, in your opinion is it better to access the array on the stack "directly" or through a pointer?
I mean, this way:

Code: Select all

double mValue[PLUG_VOICES_BUFFER_SIZE][PLUG_MAX_PROCESS_BLOCK];
...
double *pValue = envelope.mValue[voiceIndex];
*pValue++ = value;
or this:

Code: Select all

double mValue[PLUG_VOICES_BUFFER_SIZE][PLUG_MAX_PROCESS_BLOCK];
...
envelope.mValue[voiceIndex][sampleIndex] = value;
Which would be faster, in your opinion?
PurpleSunray wrote: Thu Oct 18, 2018 11:14 am Try changing it; MSVC seems to be a dumb compiler here. I got a boost of about 50% by replacing += with *.
You're right again: this really speeds up the processing! Thanks!

Post

It shouldn't make much of a difference, but if you are down to counting CPU cycles:

Code: Select all

for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
	values[voiceIndex][sampleIndex] = value;
}
To write to values, the compiler needs to load the address of values into a register, then add voiceIndex * PLUG_MAX_PROCESS_BLOCK * sizeof(double), then add sampleIndex * sizeof(double), and only then does it have the right address to write to.
The compiler might hoist the first add out of the loop, since it is the same on every run, so you get:

Code: Select all

double* values = envelope.mValue[voiceIndex];
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
	values[sampleIndex] = value;
}
Now it is down to loading an address and adding sampleIndex.
Still not optimal, because sampleIndex is still there.
What we actually do is increment sampleIndex, then add sampleIndex to the values address.
So that's again 2 adds. You can reduce it to 1 add in total by removing sampleIndex and incrementing the pointer directly (this is not always possible though: if you need sampleIndex inside the loop you cannot remove it).

Code: Select all

double* start = envelope.mValue[voiceIndex];
double* end = start + blockSize;
for (double* ptr = start ; ptr < end; ptr++) {
	*ptr = value;
}

Post

And to confuse you even further:
Nowhk wrote: Fri Oct 19, 2018 7:10 am
PurpleSunray wrote: Thu Oct 18, 2018 9:57 am double values[64] will not go into a register, because it doesn't fit. Instead it is allocated on the stack
Got it. I've always thought that it would be "inserted" in chunks, so if it doesn't all fit, it would place "part of it, process it, then swap in another part". It seems I was wrong.
Nope, you were right.
It actually does work in chunks; they are just much smaller than you were thinking: 2 doubles if you turned on SSE, or 4 if you turned on AVX :lol: :P
A single instruction can load or store (or add or mul) 2 doubles at once. That chunk processing is the speed gain you see after you removed the += (vectorized code vs one-by-one code).

Post

PurpleSunray wrote: Fri Oct 19, 2018 7:32 am So that's again 2 adds. You can reduce it to 1 add in total by removing sampleIndex and incrementing the pointer directly (this is not always possible though: if you need sampleIndex inside the loop you cannot remove it).
Which I can't, since I use sampleIndex internally :) Something else I've noticed that makes a difference is changing this:

Code: Select all

double mValue[PLUG_VOICES_BUFFER_SIZE][PLUG_MAX_PROCESS_BLOCK];
to this:

Code: Select all

double mValue[PLUG_VOICES_BUFFER_SIZE * PLUG_MAX_PROCESS_BLOCK];
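With the flat layout the indexing then becomes a single computed offset, something like this (just a sketch):

Code: Select all

// hypothetical indexing into the flattened buffer: one computed offset
envelope.mValue[voiceIndex * PLUG_MAX_PROCESS_BLOCK + sampleIndex] = value;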
