Any tips for optimize this code?

DSP, Plugin and Host development discussion.
RELATED
PRODUCTS

Post

Hmm. That shoudn't make a difference. Size of the memory is computed by the compiler at compiler-time. And the memory region allocated looks same for both (a piece of continuous memory). Have you also change the way on how you access it?

Post

PurpleSunray wrote: Fri Oct 19, 2018 8:13 am Nope, you were right.
It actually works in chunks now. They are just way smaller as you were thinking: 2 double if you turned on SSE or 4 if you turned on AVX :lol: :P
A single instruction can load or store (or add or mul) 2 doubles now. That chunk-processing is the speed gain you see after you removed the += (vectorized code vs one-by-one code).
I meant: I was thinking that it could load (for example) all the array in memory; thats why I revert it to ram later, thiking it will go directly on registry (on the fly).

Instead, do you confirm its somethings like this? So...
Let say I've 16 XMM, 128bit each (16byte), so 2 double each registry. I can "local copy" 32 doubles.
Let say I'm already using 3 double for that process (one for value, one for deltaValue, one for blockStep). It remains 29 double I can use in registry. It will fill these 29 from values[64] (locate in ram), process the remain during time, and so on:

value[29]<>ram
value[29]<>ram
value[6]<>ram

Only 3 switches between ram, right?
PurpleSunray wrote: Fri Oct 19, 2018 8:47 am Hmm. That shoudn't make a difference. Size of the memory is computed by the compiler at compiler-time. And the memory region allocated looks same for both (a piece of continuous memory). Have you also change the way on how you access it?
It does :?
2% using this:

Code: Select all

double mValue[PLUG_VOICES_BUFFER_SIZE * PLUG_MAX_PROCESS_BLOCK];
...
double *pValue = &envelope.mValue[voiceIndex];
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
	pValue[sampleIndex] = value + deltaValue * sampleIndex;
}
3% with this:

Code: Select all

double mValue[PLUG_VOICES_BUFFER_SIZE][PLUG_MAX_PROCESS_BLOCK];
...
double *pValue = envelope.mValue[voiceIndex];
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
	pValue[sampleIndex] = value + deltaValue * sampleIndex;
}

Post

Nowhk wrote:it could load all the array in memory; thats why I revert it to ram later, thiking it will go directly on registry
[pedantic]
Your terminology needs some straightening up: [/pedantic]
We are the KVR collective. Resistance is futile. You will be assimilated. Image
My MusicCalc is served over https!!

Post

Instead, do you confirm its somethings like this? So...
The compiler will decide about this, unless you write your own assembly code.
Most compilers will actually do something like that already. They do loop-unrolling to minimize loop overhead and/or optimize memory/cache access. After the compiler has generated code, the loop could i.e. do: load 4 doubles into 2 registers. Do calculation, 2 times on 2 doubles, store 4 doubles from the 2 registers and than increase sampleIndex by 4. So now the loop proccesses 4 doubles per run - loop overhead decreased to half and 4*8=32byte chunks instead 2*8=16byte chunks are processes on each run (caching might benefit from that).
But again, unless you write assembly code, the compiler will take care about this.

About the code:
the first one writes on same values over and over again.
Cache will be verry happy about that, but user probably won't.

Code: Select all

double *pValue = &envelope.mValue[voiceIndex]; 
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
	pValue[sampleIndex] = value + deltaValue * sampleIndex;
}
envelope.mValue is a 1D array now, no more 2D.
envelope.mValue[1] is not the second array but the second sample.
==> voiceIndex++ is same as sampleIndex++.
When you move to next voice, you just add an offset of 1 to the sample index, not += MAX_BLOCK_SIZE

Post

PurpleSunray wrote: Fri Oct 19, 2018 10:32 am When you move to next voice, you just add an offset of 1 to the sample index, not += MAX_BLOCK_SIZE
Yep, wrong. Now if I correct:

Code: Select all

double mValue[PLUG_VOICES_BUFFER_SIZE * PLUG_MAX_PROCESS_BLOCK];
...
double *pValue = &envelope.mValue[voiceIndex * PLUG_MAX_PROCESS_BLOCK];
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
	pValue[sampleIndex] = value + deltaValue * sampleIndex;
}
Once corrected, its returned back to 3% :D

Code: Select all

for (int envelopeIndex = 0; envelopeIndex < pEnvelopesManager->mNumEnvelopes; envelopeIndex++) {
	Envelope &envelope = *pEnvelopesManager->pEnvelope[envelopeIndex];
	EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];

	// skip disabled envelopes
	if (!envelope.mIsEnabled) { continue; }

	// local copy
	double rate = envelope.mRate;
	double blockStep = envelopeVoiceData.mBlockStep;

	double value;
	double deltaValue;
	double *pValue = envelope.mValue[voiceIndex];

	// precomputed data
	double bp0 = ((1 + envelope.mIsBipolar) * 0.5) * envelope.mAmount;
	double bp1 = ((1 - envelope.mIsBipolar) * 0.5) * envelope.mAmount;
	value = (bp0 * envelopeVoiceData.mBlockStartAmp) + (bp0 * blockStep * envelopeVoiceData.mBlockDeltaAmp) + bp1;
	deltaValue = bp0 * rate * envelopeVoiceData.mBlockDeltaAmp;

	// block
	for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
		// update output value
		pValue[sampleIndex] = value + deltaValue * sampleIndex;
	}

	blockStep += blockSize * rate;

	// revert local copy
	envelopeVoiceData.mBlockStep = blockStep;
}
BUT.... if instead of use "envelope.mValue" I use a "local array" placed in the top (as suggested by 2DaT):

Code: Select all

double envelopeValues[PLUG_VOICES_BUFFER_SIZE][PLUG_MAX_PROCESS_BLOCK];
for (int envelopeIndex = 0; envelopeIndex < pEnvelopesManager->mNumEnvelopes; envelopeIndex++) {
	Envelope &envelope = *pEnvelopesManager->pEnvelope[envelopeIndex];
	EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];

	// skip disabled envelopes
	if (!envelope.mIsEnabled) { continue; }

	// local copy
	double rate = envelope.mRate;
	double blockStep = envelopeVoiceData.mBlockStep;

	double value;
	double deltaValue;
	double *pValue = envelopeValues[voiceIndex];

	// precomputed data
	double bp0 = ((1 + envelope.mIsBipolar) * 0.5) * envelope.mAmount;
	double bp1 = ((1 - envelope.mIsBipolar) * 0.5) * envelope.mAmount;
	value = (bp0 * envelopeVoiceData.mBlockStartAmp) + (bp0 * blockStep * envelopeVoiceData.mBlockDeltaAmp) + bp1;
	deltaValue = bp0 * rate * envelopeVoiceData.mBlockDeltaAmp;

	// block
	for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
		// update output value
		pValue[sampleIndex] = value + deltaValue * sampleIndex;
	}

	blockStep += blockSize * rate;

	// revert local copy
	envelopeVoiceData.mBlockStep = blockStep;
}
It become 2% again :D

Isn't the same?

Post

Have you copied it back to somewhere? I don't see a memcpy anymore.
Who know what's gonna happen to envelopeValues if the code ends here. Some compiler might be smart enough to remove it because you never use it again after filling it once.

Post

PurpleSunray wrote: Fri Oct 19, 2018 1:46 pm Have you copied it back to somewhere? I don't see a memcpy anymore.
Who know what's gonna happen to envelopeValues if the code ends here. Some compiler might be smart enough to remove it because you never use it again after filling it once.
Uhm... I've added this after the loop to "emulate" real data usage, maybe is it not enough?:

Code: Select all

Envelope &envelope = *pEnvelopesManager->pEnvelope[2];
std::memcpy(envelope.mValue[3], envelopeValues[6], sizeof envelope.mValue[3]);
Still 2%...

(note that the second version, the 3% one, is called warped into a "function" such as "void EnvelopesManager::ProcessBlock(int voiceIndex, int blockSize)"... but I don't think it would make differences, its called only for each voice)

Post

2%, 3%... that's marginal. You need a more precise benchmark. Some weeks ago you had 50 digits precision, then why today just one?
We are the KVR collective. Resistance is futile. You will be assimilated. Image
My MusicCalc is served over https!!

Post

I've switched from 32-bit build (on my 64-bit machine, running the 32-bit version of the DAW) to a 64-bit build (on the same 64-bit machine, running the 64-bit version of the DAW): gain another unit :D

Probably because the 32-bit architecture use only 8 instead of 16 XMM registers? Not sure...

Always used the 32-bit version of my DAW, for retro compatibility of plugins which don't have a 64-bit version (which run slower, since they are bridged).

But this will be the reason to switch (and compile) 64-bit forever... :hug:

Post

Ok, since its related to optimization, now its time to figure out how to proper process Oscs part :)
Here's the actual piece of code I have (basic osc synthesis, nothing special; still learning):

Code: Select all

mRadiansPerSample = TWOPI / mSampleRate;
...
void Oscillator::ProcessVoiceBlock(int voiceIndex, double voiceFrequency, int blockSize, double *left, double *right) {
	// local copy
	double phase = mPhase[voiceIndex];

	// precomputed data
	double bp0 = voiceFrequency * mHostPitch;
	
	// wave
	switch (mWave) {
	case OSCILLATOR_WAVE_SINE:
	{
		// process block
		for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
			double output = (sin(phase)) * pParamGain->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex);

			*left++ += output;
			*right++ += output;
			
			// next phase
			phase += BOUNDED(mRadiansPerSample * (bp0 * WarpPitch(pParamPitch->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex)) + WarpOffset(pParamOffset->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex)), 0, PI));
			while (phase >= TWOPI) { phase -= TWOPI; }
		}
		
		break;
	}
	case OSCILLATOR_WAVE_SQUARE:
	{
		// process block
		for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
			double output = (phase <= PI ? 1.0 : -1.0) * pParamGain->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex);

			*left++ += output;
			*right++ += output;

			// next phase
			phase += BOUNDED(mRadiansPerSample * (bp0 * WarpPitch(pParamPitch->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex)) + WarpOffset(pParamOffset->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex)), 0, PI));
			while (phase >= TWOPI) { phase -= TWOPI; }
		}

		break;
	}
	case OSCILLATOR_WAVE_SAW:
	{
		// process block
		for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
			double output = ((2.0 * phase / TWOPI) - 1.0) * pParamGain->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex);

			*left++ += output;
			*right++ += output;

			// next phase
			phase += BOUNDED(mRadiansPerSample * (bp0 * WarpPitch(pParamPitch->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex)) + WarpOffset(pParamOffset->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex)), 0, PI));
			while (phase >= TWOPI) { phase -= TWOPI; }
		}

		break;
	}
	case OSCILLATOR_WAVE_TRIANGLE:
	{
		// process block
		for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
			double output = (2.0 * (0.5 - fabs((2.0 * phase / TWOPI) - 1.0))) * pParamGain->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex);

			*left++ += output;
			*right++ += output;

			// next phase
			phase += BOUNDED(mRadiansPerSample * (bp0 * WarpPitch(pParamPitch->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex)) + WarpOffset(pParamOffset->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex)), 0, PI));
			while (phase >= TWOPI) { phase -= TWOPI; }
		}

		break;
	}
	}

	// revert local copy
	mPhase[voiceIndex] = phase;
}

inline double WarpOffset(double normalizedValue) {
        // min = -900; max = 900; range = 1800;
	return min + normalizedValue * range;
}

inline double WarpPitch(double normalizedValue) {
        // bp0 = log(2.0) / 12.0; min = -48; max = 48; range = 96;
	return exp((min + normalizedValue * range) * bp0);
}
Two main problems here:

1 - for each sample it swap from the domain of normalized value (param value are normalized) to the domain of "warped" value, so I need to apply a transfer function on each sample (WarpPitch and WarpOffset); maybe I can remain into the [0-1] domain also for pitch and frequency offset (as for the gain)? Would be nice and save to me lot of (useless) conversion;
2 - the phase calc is redundant on every switch case, would be nice to have a single one for different osc type;

Dreaming your brilliant suggestions, as usual! Thanks dudes!

Post

The basic pattern to avoid source code duplication of the common parts while specializing some inner parts without taking a runtime hit in the loop is to write the common parts into a function (or if you want to structure the common parts as multiple functions or something more complicated, you can also wrap the whole thing into a templated class) with a type template. The template parameter is a type that implements the specialized parts in (non-virtual) methods and we expect the compiler to inline these (which is an entirely reasonable expectation as long as you don't hide the implementation from the compiler). Then the main interface function becomes a switch that checks the run-time enumeration and dispatches to the correctly templated function.

This way you get duplicate code in the binary output (so each version can be optimized separately), but get to keep it only once in the source code. The amount of boiler-plate this whole thing takes is actually surprisingly small, so it can pay of even for fairly simple stuff.

Post

mystran wrote: Tue Nov 13, 2018 3:56 pm The basic pattern to avoid source code duplication of the common parts while specializing some inner parts without taking a runtime hit in the loop is to write the common parts into a function (or if you want to structure the common parts as multiple functions or something more complicated, you can also wrap the whole thing into a templated class) with a type template. The template parameter is a type that implements the specialized parts in (non-virtual) methods and we expect the compiler to inline these (which is an entirely reasonable expectation as long as you don't hide the implementation from the compiler). Then the main interface function becomes a switch that checks the run-time enumeration and dispatches to the correctly templated function.

This way you get duplicate code in the binary output (so each version can be optimized separately), but get to keep it only once in the source code. The amount of boiler-plate this whole thing takes is actually surprisingly small, so it can pay of even for fairly simple stuff.
Great! Yeah, didn't think about templates, probably they will fit perfect here! Thanks

What about keeping the phase into the normalized domain? Is it possible in your opinion? Or some kind of conversion/warp is always needed?

Post

Nowhk wrote: Wed Nov 14, 2018 1:29 pm What about keeping the phase into the normalized domain? Is it possible in your opinion? Or some kind of conversion/warp is always needed?
I always just run the phase from 0 to 1, then multiply by a suitable (constant) scaling factor for lookups and functions as required. The actual range you pick is largely irrelevant, but [0,1] is simple to work with.

Anyway, you don't really want to do stuff like this:

Code: Select all

((2.0 * phase / TWOPI) - 1.0)
That can cost you a slow division. You can rewrite it as:

Code: Select all

((2.0 / TWOPI) * phase - 1.0)
This way the (2.0/TWOPI) can be evaluated at compile-time (since it's now all constants) and you only get a single multiply at run-time, saving you the division. In theory you might (or might not, depending on the specifics of a specific compiler) get that optimisation even with the first one when using "fast math" but it's still better to write the second, because that'll work across all (reasonable, optimising) compilers.

Post

mystran wrote: Wed Nov 14, 2018 11:18 pmI always just run the phase from 0 to 1, then multiply by a suitable (constant) scaling factor for lookups and functions as required. The actual range you pick is largely irrelevant, but [0,1] is simple to work with.
Uhm, how?
If I need to pitch the signal up of 48 semitones, the multiplier is * 16 of the note frequency.
Keep it simply, considering playing a note at 400hz, Pitch + 48 Semitones, its somethings like this:

Code: Select all

double phase = 0.0;
loop
  double output = sin(phase);
  phase += 16 * 440.0 * radiansPerSample;
  while (phase >= TWOPI) { phase -= TWOPI; }
I can keep 440 and precalculate radiansPerSample (which is TWOPI / sampleRate), but that "16" pitch value?
Can you have a suitable (constant) scaling factor for every normalized (0.0 to 1.0) pitch values without exp and log?

i.e. can you make:

Code: Select all

double pitchPd = log(2.0) / 12.0;
exp((minPitch + pitchNormalizedValue * rangePitch) * pitchPd);
as constant? Would be awesome :love:

Post

Nowhk wrote: Thu Nov 15, 2018 3:55 pm
mystran wrote: Wed Nov 14, 2018 11:18 pmI always just run the phase from 0 to 1, then multiply by a suitable (constant) scaling factor for lookups and functions as required. The actual range you pick is largely irrelevant, but [0,1] is simple to work with.
Uhm, how?
First of all, you need a tuning frequency, which is recomputed whenever sampling rate changes (and can be shared across the whole plugin):

Code: Select all

double baseFreq = 440. / samplingRate;
Then you can use this to convert a pitch (in semitones, octaves, MIDI notes, etc), into normalised frequency (ie. "cycles per sample"), separately for each voice obviously:

Code: Select all

double freq = baseFreq * pow(2., (midiNote - 69.) / 12.);

// the above can also be written in terms of exp()
// combining the division with the logarithm constant
const double ln2per12 = log(2.) / 12.;
double freq = baseFreq * exp((midiNote - 69.) * ln2per12);
There's no way to really avoid the exp() other than by approximating it or using a tuning table of some sort.

Now the main phase increment loop would look something like this (assuming we're not updating "freq" on per-sample basis):

Code: Select all

for(samples)
{
   // increment phase by frequency
   phase += freq;
   
   // wrap around back to [0,1[
   // note: decrement while loop is probably faster
   // but this is O(1) worst-case bound ;)
   if(phase >= 1.) phase -= (int) phase;
   
   // get output
   double output = Wave::getValue(phase);
}   
Here "Wave" would be the template-class (as per my earlier suggestion), let's say WaveSin:

Code: Select all

struct WaveSin
{
   static double getValue(double phase) { return sin(TWOPI * phase); }
};
That should hopefully be all the building blocks you need. :)

Post Reply

Return to “DSP and Plugin Development”