Any tips for optimize this code?
-
- KVRian
- 853 posts since 13 Mar, 2012
Hmm. That shoudn't make a difference. Size of the memory is computed by the compiler at compiler-time. And the memory region allocated looks same for both (a piece of continuous memory). Have you also change the way on how you access it?
~~ ॐ http://soundcloud.com/mfr ॐ ~~
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
I meant: I was thinking that it could load (for example) all the array in memory; thats why I revert it to ram later, thiking it will go directly on registry (on the fly).PurpleSunray wrote: ↑Fri Oct 19, 2018 8:13 am Nope, you were right.
It actually works in chunks now. They are just way smaller as you were thinking: 2 double if you turned on SSE or 4 if you turned on AVX
A single instruction can load or store (or add or mul) 2 doubles now. That chunk-processing is the speed gain you see after you removed the += (vectorized code vs one-by-one code).
Instead, do you confirm its somethings like this? So...
Let say I've 16 XMM, 128bit each (16byte), so 2 double each registry. I can "local copy" 32 doubles.
Let say I'm already using 3 double for that process (one for value, one for deltaValue, one for blockStep). It remains 29 double I can use in registry. It will fill these 29 from values[64] (locate in ram), process the remain during time, and so on:
value[29]<>ram
value[29]<>ram
value[6]<>ram
Only 3 switches between ram, right?
It doesPurpleSunray wrote: ↑Fri Oct 19, 2018 8:47 am Hmm. That shoudn't make a difference. Size of the memory is computed by the compiler at compiler-time. And the memory region allocated looks same for both (a piece of continuous memory). Have you also change the way on how you access it?
2% using this:
Code: Select all
double mValue[PLUG_VOICES_BUFFER_SIZE * PLUG_MAX_PROCESS_BLOCK];
...
double *pValue = &envelope.mValue[voiceIndex];
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
pValue[sampleIndex] = value + deltaValue * sampleIndex;
}
Code: Select all
double mValue[PLUG_VOICES_BUFFER_SIZE][PLUG_MAX_PROCESS_BLOCK];
...
double *pValue = envelope.mValue[voiceIndex];
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
pValue[sampleIndex] = value + deltaValue * sampleIndex;
}
- KVRAF
- 15273 posts since 8 Mar, 2005 from Utrecht, Holland
[pedantic]Nowhk wrote:it could load all the array in memory; thats why I revert it to ram later, thiking it will go directly on registry
Your terminology needs some straightening up: [/pedantic]
We are the KVR collective. Resistance is futile. You will be assimilated.
My MusicCalc is served over https!!
My MusicCalc is served over https!!
-
- KVRian
- 853 posts since 13 Mar, 2012
The compiler will decide about this, unless you write your own assembly code.Instead, do you confirm its somethings like this? So...
Most compilers will actually do something like that already. They do loop-unrolling to minimize loop overhead and/or optimize memory/cache access. After the compiler has generated code, the loop could i.e. do: load 4 doubles into 2 registers. Do calculation, 2 times on 2 doubles, store 4 doubles from the 2 registers and than increase sampleIndex by 4. So now the loop proccesses 4 doubles per run - loop overhead decreased to half and 4*8=32byte chunks instead 2*8=16byte chunks are processes on each run (caching might benefit from that).
But again, unless you write assembly code, the compiler will take care about this.
About the code:
the first one writes on same values over and over again.
Cache will be verry happy about that, but user probably won't.
Code: Select all
double *pValue = &envelope.mValue[voiceIndex];
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
pValue[sampleIndex] = value + deltaValue * sampleIndex;
}
envelope.mValue[1] is not the second array but the second sample.
==> voiceIndex++ is same as sampleIndex++.
When you move to next voice, you just add an offset of 1 to the sample index, not += MAX_BLOCK_SIZE
~~ ॐ http://soundcloud.com/mfr ॐ ~~
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
Yep, wrong. Now if I correct:PurpleSunray wrote: ↑Fri Oct 19, 2018 10:32 am When you move to next voice, you just add an offset of 1 to the sample index, not += MAX_BLOCK_SIZE
Code: Select all
double mValue[PLUG_VOICES_BUFFER_SIZE * PLUG_MAX_PROCESS_BLOCK];
...
double *pValue = &envelope.mValue[voiceIndex * PLUG_MAX_PROCESS_BLOCK];
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
pValue[sampleIndex] = value + deltaValue * sampleIndex;
}
Code: Select all
for (int envelopeIndex = 0; envelopeIndex < pEnvelopesManager->mNumEnvelopes; envelopeIndex++) {
Envelope &envelope = *pEnvelopesManager->pEnvelope[envelopeIndex];
EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];
// skip disabled envelopes
if (!envelope.mIsEnabled) { continue; }
// local copy
double rate = envelope.mRate;
double blockStep = envelopeVoiceData.mBlockStep;
double value;
double deltaValue;
double *pValue = envelope.mValue[voiceIndex];
// precomputed data
double bp0 = ((1 + envelope.mIsBipolar) * 0.5) * envelope.mAmount;
double bp1 = ((1 - envelope.mIsBipolar) * 0.5) * envelope.mAmount;
value = (bp0 * envelopeVoiceData.mBlockStartAmp) + (bp0 * blockStep * envelopeVoiceData.mBlockDeltaAmp) + bp1;
deltaValue = bp0 * rate * envelopeVoiceData.mBlockDeltaAmp;
// block
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
// update output value
pValue[sampleIndex] = value + deltaValue * sampleIndex;
}
blockStep += blockSize * rate;
// revert local copy
envelopeVoiceData.mBlockStep = blockStep;
}
Code: Select all
double envelopeValues[PLUG_VOICES_BUFFER_SIZE][PLUG_MAX_PROCESS_BLOCK];
for (int envelopeIndex = 0; envelopeIndex < pEnvelopesManager->mNumEnvelopes; envelopeIndex++) {
Envelope &envelope = *pEnvelopesManager->pEnvelope[envelopeIndex];
EnvelopeVoiceData &envelopeVoiceData = envelope.mEnvelopeVoicesData[voiceIndex];
// skip disabled envelopes
if (!envelope.mIsEnabled) { continue; }
// local copy
double rate = envelope.mRate;
double blockStep = envelopeVoiceData.mBlockStep;
double value;
double deltaValue;
double *pValue = envelopeValues[voiceIndex];
// precomputed data
double bp0 = ((1 + envelope.mIsBipolar) * 0.5) * envelope.mAmount;
double bp1 = ((1 - envelope.mIsBipolar) * 0.5) * envelope.mAmount;
value = (bp0 * envelopeVoiceData.mBlockStartAmp) + (bp0 * blockStep * envelopeVoiceData.mBlockDeltaAmp) + bp1;
deltaValue = bp0 * rate * envelopeVoiceData.mBlockDeltaAmp;
// block
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
// update output value
pValue[sampleIndex] = value + deltaValue * sampleIndex;
}
blockStep += blockSize * rate;
// revert local copy
envelopeVoiceData.mBlockStep = blockStep;
}
Isn't the same?
-
- KVRian
- 853 posts since 13 Mar, 2012
Have you copied it back to somewhere? I don't see a memcpy anymore.
Who know what's gonna happen to envelopeValues if the code ends here. Some compiler might be smart enough to remove it because you never use it again after filling it once.
Who know what's gonna happen to envelopeValues if the code ends here. Some compiler might be smart enough to remove it because you never use it again after filling it once.
~~ ॐ http://soundcloud.com/mfr ॐ ~~
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
Uhm... I've added this after the loop to "emulate" real data usage, maybe is it not enough?:PurpleSunray wrote: ↑Fri Oct 19, 2018 1:46 pm Have you copied it back to somewhere? I don't see a memcpy anymore.
Who know what's gonna happen to envelopeValues if the code ends here. Some compiler might be smart enough to remove it because you never use it again after filling it once.
Code: Select all
Envelope &envelope = *pEnvelopesManager->pEnvelope[2];
std::memcpy(envelope.mValue[3], envelopeValues[6], sizeof envelope.mValue[3]);
(note that the second version, the 3% one, is called warped into a "function" such as "void EnvelopesManager::ProcessBlock(int voiceIndex, int blockSize)"... but I don't think it would make differences, its called only for each voice)
- KVRAF
- 15273 posts since 8 Mar, 2005 from Utrecht, Holland
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
I've switched from 32-bit build (on my 64-bit machine, running the 32-bit version of the DAW) to a 64-bit build (on the same 64-bit machine, running the 64-bit version of the DAW): gain another unit
Probably because the 32-bit architecture use only 8 instead of 16 XMM registers? Not sure...
Always used the 32-bit version of my DAW, for retro compatibility of plugins which don't have a 64-bit version (which run slower, since they are bridged).
But this will be the reason to switch (and compile) 64-bit forever...
Probably because the 32-bit architecture use only 8 instead of 16 XMM registers? Not sure...
Always used the 32-bit version of my DAW, for retro compatibility of plugins which don't have a 64-bit version (which run slower, since they are bridged).
But this will be the reason to switch (and compile) 64-bit forever...
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
Ok, since its related to optimization, now its time to figure out how to proper process Oscs part
Here's the actual piece of code I have (basic osc synthesis, nothing special; still learning):
Two main problems here:
1 - for each sample it swap from the domain of normalized value (param value are normalized) to the domain of "warped" value, so I need to apply a transfer function on each sample (WarpPitch and WarpOffset); maybe I can remain into the [0-1] domain also for pitch and frequency offset (as for the gain)? Would be nice and save to me lot of (useless) conversion;
2 - the phase calc is redundant on every switch case, would be nice to have a single one for different osc type;
Dreaming your brilliant suggestions, as usual! Thanks dudes!
Here's the actual piece of code I have (basic osc synthesis, nothing special; still learning):
Code: Select all
mRadiansPerSample = TWOPI / mSampleRate;
...
void Oscillator::ProcessVoiceBlock(int voiceIndex, double voiceFrequency, int blockSize, double *left, double *right) {
// local copy
double phase = mPhase[voiceIndex];
// precomputed data
double bp0 = voiceFrequency * mHostPitch;
// wave
switch (mWave) {
case OSCILLATOR_WAVE_SINE:
{
// process block
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
double output = (sin(phase)) * pParamGain->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex);
*left++ += output;
*right++ += output;
// next phase
phase += BOUNDED(mRadiansPerSample * (bp0 * WarpPitch(pParamPitch->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex)) + WarpOffset(pParamOffset->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex)), 0, PI));
while (phase >= TWOPI) { phase -= TWOPI; }
}
break;
}
case OSCILLATOR_WAVE_SQUARE:
{
// process block
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
double output = (phase <= PI ? 1.0 : -1.0) * pParamGain->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex);
*left++ += output;
*right++ += output;
// next phase
phase += BOUNDED(mRadiansPerSample * (bp0 * WarpPitch(pParamPitch->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex)) + WarpOffset(pParamOffset->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex)), 0, PI));
while (phase >= TWOPI) { phase -= TWOPI; }
}
break;
}
case OSCILLATOR_WAVE_SAW:
{
// process block
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
double output = ((2.0 * phase / TWOPI) - 1.0) * pParamGain->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex);
*left++ += output;
*right++ += output;
// next phase
phase += BOUNDED(mRadiansPerSample * (bp0 * WarpPitch(pParamPitch->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex)) + WarpOffset(pParamOffset->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex)), 0, PI));
while (phase >= TWOPI) { phase -= TWOPI; }
}
break;
}
case OSCILLATOR_WAVE_TRIANGLE:
{
// process block
for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
double output = (2.0 * (0.5 - fabs((2.0 * phase / TWOPI) - 1.0))) * pParamGain->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex);
*left++ += output;
*right++ += output;
// next phase
phase += BOUNDED(mRadiansPerSample * (bp0 * WarpPitch(pParamPitch->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex)) + WarpOffset(pParamOffset->GetProcessedVoiceNormalizedValue(voiceIndex, sampleIndex)), 0, PI));
while (phase >= TWOPI) { phase -= TWOPI; }
}
break;
}
}
// revert local copy
mPhase[voiceIndex] = phase;
}
inline double WarpOffset(double normalizedValue) {
// min = -900; max = 900; range = 1800;
return min + normalizedValue * range;
}
inline double WarpPitch(double normalizedValue) {
// bp0 = log(2.0) / 12.0; min = -48; max = 48; range = 96;
return exp((min + normalizedValue * range) * bp0);
}
1 - for each sample it swap from the domain of normalized value (param value are normalized) to the domain of "warped" value, so I need to apply a transfer function on each sample (WarpPitch and WarpOffset); maybe I can remain into the [0-1] domain also for pitch and frequency offset (as for the gain)? Would be nice and save to me lot of (useless) conversion;
2 - the phase calc is redundant on every switch case, would be nice to have a single one for different osc type;
Dreaming your brilliant suggestions, as usual! Thanks dudes!
- KVRAF
- 7890 posts since 12 Feb, 2006 from Helsinki, Finland
The basic pattern to avoid source code duplication of the common parts while specializing some inner parts without taking a runtime hit in the loop is to write the common parts into a function (or if you want to structure the common parts as multiple functions or something more complicated, you can also wrap the whole thing into a templated class) with a type template. The template parameter is a type that implements the specialized parts in (non-virtual) methods and we expect the compiler to inline these (which is an entirely reasonable expectation as long as you don't hide the implementation from the compiler). Then the main interface function becomes a switch that checks the run-time enumeration and dispatches to the correctly templated function.
This way you get duplicate code in the binary output (so each version can be optimized separately), but get to keep it only once in the source code. The amount of boiler-plate this whole thing takes is actually surprisingly small, so it can pay of even for fairly simple stuff.
This way you get duplicate code in the binary output (so each version can be optimized separately), but get to keep it only once in the source code. The amount of boiler-plate this whole thing takes is actually surprisingly small, so it can pay of even for fairly simple stuff.
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
Great! Yeah, didn't think about templates, probably they will fit perfect here! Thanksmystran wrote: ↑Tue Nov 13, 2018 3:56 pm The basic pattern to avoid source code duplication of the common parts while specializing some inner parts without taking a runtime hit in the loop is to write the common parts into a function (or if you want to structure the common parts as multiple functions or something more complicated, you can also wrap the whole thing into a templated class) with a type template. The template parameter is a type that implements the specialized parts in (non-virtual) methods and we expect the compiler to inline these (which is an entirely reasonable expectation as long as you don't hide the implementation from the compiler). Then the main interface function becomes a switch that checks the run-time enumeration and dispatches to the correctly templated function.
This way you get duplicate code in the binary output (so each version can be optimized separately), but get to keep it only once in the source code. The amount of boiler-plate this whole thing takes is actually surprisingly small, so it can pay of even for fairly simple stuff.
What about keeping the phase into the normalized domain? Is it possible in your opinion? Or some kind of conversion/warp is always needed?
- KVRAF
- 7890 posts since 12 Feb, 2006 from Helsinki, Finland
I always just run the phase from 0 to 1, then multiply by a suitable (constant) scaling factor for lookups and functions as required. The actual range you pick is largely irrelevant, but [0,1] is simple to work with.
Anyway, you don't really want to do stuff like this:
Code: Select all
((2.0 * phase / TWOPI) - 1.0)
Code: Select all
((2.0 / TWOPI) * phase - 1.0)
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
Uhm, how?
If I need to pitch the signal up of 48 semitones, the multiplier is * 16 of the note frequency.
Keep it simply, considering playing a note at 400hz, Pitch + 48 Semitones, its somethings like this:
Code: Select all
double phase = 0.0;
loop
double output = sin(phase);
phase += 16 * 440.0 * radiansPerSample;
while (phase >= TWOPI) { phase -= TWOPI; }
Can you have a suitable (constant) scaling factor for every normalized (0.0 to 1.0) pitch values without exp and log?
i.e. can you make:
Code: Select all
double pitchPd = log(2.0) / 12.0;
exp((minPitch + pitchNormalizedValue * rangePitch) * pitchPd);
- KVRAF
- 7890 posts since 12 Feb, 2006 from Helsinki, Finland
First of all, you need a tuning frequency, which is recomputed whenever sampling rate changes (and can be shared across the whole plugin):
Code: Select all
double baseFreq = 440. / samplingRate;
Code: Select all
double freq = baseFreq * pow(2., (midiNote - 69.) / 12.);
// the above can also be written in terms of exp()
// combining the division with the logarithm constant
const double ln2per12 = log(2.) / 12.;
double freq = baseFreq * exp((midiNote - 69.) * ln2per12);
Now the main phase increment loop would look something like this (assuming we're not updating "freq" on per-sample basis):
Code: Select all
for(samples)
{
// increment phase by frequency
phase += freq;
// wrap around back to [0,1[
// note: decrement while loop is probably faster
// but this is O(1) worst-case bound ;)
if(phase >= 1.) phase -= (int) phase;
// get output
double output = Wave::getValue(phase);
}
Code: Select all
struct WaveSin
{
static double getValue(double phase) { return sin(TWOPI * phase); }
};