Any tips for optimize this code?

PurpleSunray
KVRian
801 posts since 13 Mar, 2012

Post Fri Oct 12, 2018 4:53 am

Quickly wrote down some SSE code for that and did a runtime measurement. Turns out it is about twice as fast.. but I'm not sure if the code is correct or if I measured correctly, it was just a very quick test :D

Code: Select all


#include <emmintrin.h>

void ProcessBlock_new(int voiceIndex, int remainingVoiceSamples) 
{
	for (int envelopeIndex = 0; envelopeIndex < 10; envelopeIndex++)
	{
		Envelope &envelope = *pEnvelope[envelopeIndex];
		EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];

		// load values into SSE (XMM) registers

		const double h = 0.5;
		__m128d half = _mm_load1_pd(&h);

		__m128d rate = _mm_load1_pd(&envelope.mRate);

		__m128d blockstart = _mm_load1_pd(&envelopeVoiceData.mBlockStartAmp);
		__m128d blockdelta = _mm_load1_pd(&envelopeVoiceData.mBlockDeltaAmp);
		__m128d blockstep = _mm_load1_pd(&envelopeVoiceData.mBlockStep);

		// run the loop

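		// ProcessBlock_old uses bp0 = bp1 = 0.5 only when mIsBipolar is false,
		// so the 0.5 * value + 0.5 branch has to be the unipolar one
		if (!envelope.mIsBipolar)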
		{
			for (int sample = 0; sample < remainingVoiceSamples; sample++)
			{
				//  value = envelopeVoiceData.mBlockStartAmp + (envelopeVoiceData.mBlockStep * envelopeVoiceData.mBlockDeltaAmp);
				__m128d value = _mm_add_pd(blockstart, _mm_mul_pd(blockstep, blockdelta));

				// value = (0.5 * value + 0.5);
				value = _mm_add_pd(_mm_mul_pd(half, value), half);

				// envelope.mValue[voiceIndex] = value;
				_mm_storel_pd(&envelope.mValue[voiceIndex], value);

				// blockstep += rate;
				blockstep = _mm_add_pd(blockstep, rate);
			}

			// envelopeVoiceData.mBlockStep = blockstep;
			_mm_storel_pd(&envelopeVoiceData.mBlockStep, blockstep);
		}
		else
		{
			for (int sample = 0; sample < remainingVoiceSamples; sample++)
			{
				//  value = envelopeVoiceData.mBlockStartAmp + (envelopeVoiceData.mBlockStep * envelopeVoiceData.mBlockDeltaAmp);
				_mm_storel_pd(&envelope.mValue[voiceIndex], _mm_add_pd(blockstart, _mm_mul_pd(blockstep, blockdelta)));

				// blockstep += rate;
				blockstep = _mm_add_pd(blockstep, rate);
			}

			// envelopeVoiceData.mBlockStep = blockstep;
			_mm_storel_pd(&envelopeVoiceData.mBlockStep, blockstep);
		}
	}
}
vs

Code: Select all


void ProcessBlock_old(int voiceIndex, int remainingVoiceSamples) {
	for (int envelopeIndex = 0; envelopeIndex < 10; envelopeIndex++) {
		Envelope &envelope = *pEnvelope[envelopeIndex];
		EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];

		double bp0 = (1 + envelope.mIsBipolar) * 0.5;
		double bp1 = (1 - envelope.mIsBipolar) * 0.5;

		// process block
		for (int sample = 0; sample < remainingVoiceSamples; sample++) {
			// update output value
			double value = envelopeVoiceData.mBlockStartAmp + (envelopeVoiceData.mBlockStep * envelopeVoiceData.mBlockDeltaAmp);
			envelope.mValue[voiceIndex] = (bp0 * value + bp1);

			// next phase
			envelopeVoiceData.mBlockStep += envelope.mRate;
		}
	}
}
result:

Code: Select all

Start...
NEW: 100000 runs in 1031ms
OLD: 100000 runs in 2188ms
Might be worth trying to code it in assembly (or intrinsics). I'm not using any packing in the code above, so I'm pretty sure it could be way faster if you spend some more time on it to actually use SIMD.
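Since the measurement itself is in doubt, here is a minimal timing sketch. It is not the harness used for the numbers above: the helper name timeRunsMs is hypothetical, and it assumes std::chrono::steady_clock is good enough for a rough comparison.

```cpp
#include <chrono>

// Hypothetical helper (not from the original test): calls `fn` `runs` times
// and returns the elapsed wall-clock time in milliseconds.
template <typename Fn>
double timeRunsMs(int runs, Fn&& fn)
{
	const auto start = std::chrono::steady_clock::now();
	for (int i = 0; i < runs; ++i)
		fn();
	const auto stop = std::chrono::steady_clock::now();
	return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

It could be called as timeRunsMs(100000, [] { ProcessBlock_new(0, 100); }) and again with ProcessBlock_old. The computed values should be read afterwards (printed or summed), otherwise an optimizing compiler may drop part of the work and the comparison says little.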
Last edited by PurpleSunray on Fri Oct 12, 2018 5:33 am, edited 1 time in total.

PurpleSunray
KVRian
801 posts since 13 Mar, 2012

Re: Any tips for optimize this code?

Post Fri Oct 12, 2018 5:11 am

"Interesting! Placing local copy (and re-store them outside the loop) switch from 5% to 3%."
Oh.. actually this might also be the reason why my code is faster :lol:
_mm_storel_pd has a cost.. so it's not a good idea to write back mBlockStep on every loop iteration.

Code: Select all

envelopeVoiceData.mBlockStep += envelope.mRate;
mBlockStep resides in RAM (in the envelopeVoiceData struct).
So every time you read or write it, it causes a RAM access (the _mm_storel_pd).
If you make mBlockStep a local, there is no RAM traffic - instead mBlockStep will be kept in a CPU register ... most likely (the compiler will decide, unless you code it in assembly and manage the registers on your own while the loop is running ;) )
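As a rough sketch of the local-copy idea in plain C++ (no intrinsics): VoiceData and processWithLocalStep are hypothetical stand-ins, not the real classes from this thread, and the struct layout is assumed.

```cpp
// Simplified stand-in for the per-voice data (assumed layout).
struct VoiceData
{
	double mBlockStartAmp;
	double mBlockDeltaAmp;
	double mBlockStep;
};

// Runs `numSamples` steps and returns the last computed value.
// mBlockStep is copied into a local so the compiler is free to keep it
// in a register; the struct is written exactly once, after the loop.
double processWithLocalStep(VoiceData& data, double rate, int numSamples)
{
	const double start = data.mBlockStartAmp;
	const double delta = data.mBlockDeltaAmp;
	double step = data.mBlockStep;          // local copy
	double value = start + step * delta;

	for (int sample = 0; sample < numSamples; ++sample)
	{
		value = start + step * delta;
		step += rate;
	}

	data.mBlockStep = step;                 // single write-back
	return value;
}
```

Whether the compiler really keeps step in a register is its decision, so it is worth checking the generated assembly or simply measuring.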

Nowhk
KVRian
769 posts since 2 Oct, 2013

Re: Any tips for optimize this code?

Post Fri Oct 12, 2018 6:10 am

I see, clear: thanks! Now, a further question.

What if, instead of a one-dimensional mValue array:

Code: Select all

  double mValue[PLUG_VOICES_BUFFER_SIZE];
I have a 2D array? That way I can fill it during the sample block iteration and use it later for audio processing:

Code: Select all

  double mValue[PLUG_VOICES_BUFFER_SIZE][PLUG_MAX_PROCESS_BLOCK];
How would you access it faster within the loop, to optimize the code?
I tried to make a "local" array and then copy it at the end with:

Code: Select all

  double values[PLUG_MAX_PROCESS_BLOCK];
  
  for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
    values[sampleIndex] = (bp0 * value + bp1);
  }
 
  // memcpy takes a size in bytes, so scale the element count by sizeof(double)
  std::memcpy(envelope.mValue[voiceIndex], values, blockSize * sizeof(double));
But I don't really like it. It doesn't seem very performant. Is there any way to "fill" the local array values[PLUG_MAX_PROCESS_BLOCK] faster and, at the end, assign it to the mValue[voiceIndex] position?

If I use a pointer to mValue[PLUG_MAX_PROCESS_BLOCK], I'll be writing back and forth to RAM, I guess...

IN SHORT: since I'm iterating over all samples in the block, doing this:

Code: Select all

  envelope.mValue[voiceIndex] = (bp0 * value + bp1);
does nothing hehe :) I need to "store" each calculated value for the processing that follows. What's the best (local) way to optimize this?
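One way to avoid the temporary array and the memcpy, sketched under assumptions: each row of a 2D array is contiguous, so the loop can write into it through a plain pointer while the running state stays in locals. EnvelopeRows, fillVoiceRow, kVoices and kMaxBlock are hypothetical names standing in for the real class and the PLUG_ constants; the sizes are made up.

```cpp
constexpr int kVoices = 16;      // hypothetical stand-in for PLUG_VOICES_BUFFER_SIZE
constexpr int kMaxBlock = 512;   // hypothetical stand-in for PLUG_MAX_PROCESS_BLOCK

struct EnvelopeRows
{
	double mValue[kVoices][kMaxBlock];
};

// Fills one voice's row with bp0 * value + bp1 for each sample, where
// value = startAmp + step * deltaAmp. Returns the step after the block.
double fillVoiceRow(EnvelopeRows& envelope, int voiceIndex, int blockSize,
                    double startAmp, double deltaAmp, double step, double rate,
                    double bp0, double bp1)
{
	double* out = envelope.mValue[voiceIndex];   // the row is contiguous in memory
	for (int sampleIndex = 0; sampleIndex < blockSize; ++sampleIndex)
	{
		const double value = startAmp + step * deltaAmp;
		out[sampleIndex] = bp0 * value + bp1;     // one sequential store per sample
		step += rate;                             // running state stays in a local
	}
	return step;
}
```

The stores to the row are needed anyway because the values are used later; what the sketch removes is the extra copy and the repeated access to the step member inside the loop. Whether that is measurably faster should be checked with a profiler.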

PurpleSunray
KVRian
801 posts since 13 Mar, 2012

Re: Any tips for optimize this code?

Post Fri Oct 12, 2018 7:06 am

You need to arrange the data so that you can easily process a bunch of it at once (if you're after SIMD code).

Example:
Inside the loop there is a multiply+add.
The SSE instruction for the multiply would be https://software.intel.com/sites/landin ... d=115,3886
If you read carefully, you'll notice that it does a multiplication on a 128-bit register, packed with two 64-bit doubles.
If you use the float operations, you can pack 4 floats instead of 2 doubles.
Now your data should be arranged so that it can easily be processed with these operations. You will most likely end up with some kind of interleaved layout, such as:
value[SampleIndex0Voice0], value[SampleIndex1Voice0], value[SampleIndex0Voice1], value[SampleIndex1Voice1]
This will allow you to load the first two samples of the first voice (_mm_load_pd packs them into a 128-bit register), then calculate via SIMD, then store the result back - then the next voice, or the next 2 samples, or ... go figure it out on your own :D :P
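A minimal sketch of what the packed version could look like for one voice, two consecutive samples per iteration. The function name fillRampSse2 is hypothetical, it assumes an even sample count, and it uses an unaligned store so the output buffer needs no special alignment.

```cpp
#include <emmintrin.h>  // SSE2

// Writes out[i] = startAmp + (step + i * rate) * deltaAmp for i in [0, numSamples),
// two samples per iteration. Returns the step value after the block.
// Assumption: numSamples is even.
double fillRampSse2(double* out, int numSamples,
                    double startAmp, double deltaAmp, double step, double rate)
{
	const __m128d start = _mm_set1_pd(startAmp);
	const __m128d delta = _mm_set1_pd(deltaAmp);
	const __m128d inc   = _mm_set1_pd(2.0 * rate);   // advance both lanes by two samples
	__m128d steps = _mm_set_pd(step + rate, step);   // low lane = step, high lane = step + rate

	for (int i = 0; i < numSamples; i += 2)
	{
		const __m128d value = _mm_add_pd(start, _mm_mul_pd(steps, delta));
		_mm_storeu_pd(out + i, value);               // low lane -> out[i], high lane -> out[i + 1]
		steps = _mm_add_pd(steps, inc);
	}
	return _mm_cvtsd_f64(steps);                     // low lane holds the step after the block
}
```

With floats instead of doubles the same pattern would handle four samples per iteration. Processing several voices per register instead would need the voices interleaved in memory.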

PurpleSunray
KVRian
801 posts since 13 Mar, 2012

Re: Any tips for optimize this code?

Post Fri Oct 12, 2018 7:14 am

Nowhk wrote: "If I use pointer to mValue[PLUG_MAX_PROCESS_BLOCK], I'll write go/back to RAM, I guess..."
Yes, I haven't understood this line from the beginning :D
You are writing to the same value over and over again. Unless there is operator overloading or other magic involved, this is completely useless. The sample-by-sample increment loop becomes obsolete then too, because: value = envelope.mRate * sampleIndex * blockDelta + blockStart; .. or something like that :hihi:
Just calculate it when you do

Code: Select all

if (envelopeVoiceData.mBlockStep >= gBlockSize) {
	// calculate new envelope values for this block. its processed every 100 samples, not so heavy as operation, so it seems I can ignore the core of my code here
}
instead of running that increment-one-by-one loop.
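To make the "or something like that" concrete, here is a sketch of the closed form, assuming only the last value of the block is needed. BlockResult and evaluateBlock are hypothetical names, and numSamples is assumed to be at least 1.

```cpp
// Result of jumping straight to the end of the block.
struct BlockResult
{
	double lastValue;   // what the final loop iteration would have written
	double nextStep;    // mBlockStep after the block
};

// The per-sample loop evaluates the ramp at step, step + rate, step + 2 * rate, ...
// so iteration n (0-based) sees step + n * rate. The last iteration is n = numSamples - 1.
BlockResult evaluateBlock(double startAmp, double deltaAmp,
                          double step, double rate, int numSamples)
{
	const double lastStep = step + (numSamples - 1) * rate;
	BlockResult result;
	result.lastValue = startAmp + lastStep * deltaAmp;
	result.nextStep  = step + numSamples * rate;
	return result;
}
```

The result can differ from the accumulated loop in the last bits, because one multiplication rounds differently than many additions. If every sample of the block is needed later, the loop (or a SIMD version of it) is still required.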
