Optimize plugin code for balanced load or least load?

DSP, Plug-in and Host development discussion.
Vokbuz
KVRist
117 posts since 24 Aug, 2014 from Moscow

Post Tue Nov 19, 2019 1:15 pm

Fender19 wrote:
Tue Nov 19, 2019 12:10 pm
I access the elements of that array using:

Code:

	  
leftdelayed = DelayArray[0][delayPtr];    // rotating buffer - retrieve saved value
rightdelayed = DelayArray[1][delayPtr];
DelayArray[0][delayPtr] = leftin;         // rotating buffer - store new value
DelayArray[1][delayPtr] = rightin;

delayPtr += 1;
delayPtr %= DelayLength;
Is this a good way to do this - or is it SLOW?
I would swap the channel and sample indices: DelayArray[delayPtr][0/1]. That way you get a cache-friendly layout.
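A minimal sketch of that interleaved layout (names are illustrative, not from the original code): each frame's left and right samples sit next to each other in memory, so reading and writing one frame touches a single cache line.

```cpp
#include <cstddef>
#include <vector>

// Interleaved stereo delay line: frames stored as [L,R][L,R]... so both
// channels of a frame are adjacent in memory.
struct StereoDelay {
    std::vector<float> buf;            // 2 * length floats, interleaved
    std::size_t length = 0, pos = 0;

    explicit StereoDelay(std::size_t len) : buf(2 * len, 0.0f), length(len) {}

    // Read the delayed frame, then overwrite it with the new input frame.
    void process(float inL, float inR, float& outL, float& outR) {
        float* frame = &buf[2 * pos];
        outL = frame[0];               // retrieve saved values
        outR = frame[1];
        frame[0] = inL;                // store new values
        frame[1] = inR;
        if (++pos >= length) pos = 0;  // wrap without %
    }
};
```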

quikquak
KVRian
602 posts since 6 Aug, 2005 from England

Re: Optimize plugin code for balanced load or least load?

Post Tue Nov 19, 2019 1:29 pm

Or use <double> SIMDs and do the stereo in one go. :)
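For what it's worth, a sketch of that idea with SSE2 intrinsics (x86-specific, and assuming the interleaved [L,R] layout suggested above; all names are hypothetical): a single 128-bit load/store moves both double-precision channels of a frame at once.

```cpp
#include <emmintrin.h>   // SSE2 intrinsics (x86)
#include <cstddef>

// With the interleaved [L,R] layout, one 128-bit load/store handles both
// double-precision channels of a frame in one go.
void stereoDelaySimd(double* buf, std::size_t length, std::size_t& pos,
                     const double* in, double* out, std::size_t frames) {
    for (std::size_t i = 0; i < frames; ++i) {
        __m128d delayed = _mm_loadu_pd(&buf[2 * pos]);  // read delayed L+R
        _mm_storeu_pd(&out[2 * i], delayed);
        __m128d fresh = _mm_loadu_pd(&in[2 * i]);       // write new L+R
        _mm_storeu_pd(&buf[2 * pos], fresh);
        if (++pos >= length) pos = 0;
    }
}
```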

Vokbuz
KVRist
117 posts since 24 Aug, 2014 from Moscow

Re: Optimize plugin code for balanced load or least load?

Post Tue Nov 19, 2019 2:29 pm

quikquak wrote:
Tue Nov 19, 2019 1:29 pm
Or use <double> SIMDs and do the stereo in one go. :)
It depends on how leftin and rightin are computed. If it can be vectorized, I'm pretty sure the compiler can do it itself. And yes, I hadn't thought about it: the layout becomes not only cache-friendly but also vector-friendly.

Fender19
KVRist
350 posts since 30 Aug, 2012

Re: Optimize plugin code for balanced load or least load?

Post Tue Nov 19, 2019 4:05 pm

Thanks everyone for all your input. Much appreciated and hopefully helpful to others here as well.

I just discovered a significant oversight in my code. I will confess my mistake here just as a reminder for everyone else. I had some code that prepares data for the GUI display running inside the for(i<nFrames) loop, i.e., computing on every sample. It wasn't needed there because I only update the display once per ProcessReplacing block. I moved that code outside the "for" loop where it now runs once every 512 samples or so and only when needed to update the display. Significant improvement as you might expect!
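A toy illustration of that fix (the names and the per-sample processing are made up, not from the actual plugin): per-sample DSP stays in the loop, while the display value is derived once per block.

```cpp
#include <algorithm>
#include <cmath>

// Per-sample work stays in the loop; display preparation (here, a peak
// value for a meter) runs once per block instead of once per sample.
float processBlock(const float* in, float* out, int nFrames) {
    for (int i = 0; i < nFrames; ++i)
        out[i] = in[i] * 0.5f;                 // per-sample DSP (placeholder)

    float peak = 0.0f;                         // once-per-block GUI data
    for (int i = 0; i < nFrames; ++i)
        peak = std::max(peak, std::fabs(out[i]));
    return peak;                               // handed off to the display
}
```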

Sometimes a fresh set of eyes and rearranging things can make giant leaps. :)

mystran
KVRAF
5491 posts since 12 Feb, 2006 from Helsinki, Finland

Re: Optimize plugin code for balanced load or least load?

Post Tue Nov 19, 2019 4:17 pm

quikquak wrote:
Tue Nov 19, 2019 12:44 pm
Just my 2ps worth... I've always avoided % as it internally uses a divide (yeah, it may not be relevant these days). I use something simple like this which also catches additions of larger than 1:-

Code:

delayPtr += 1;
if (delayPtr >= DelayLength)
     delayPtr -= DelayLength;
General integer division is really slow, and it's slow at such a fundamental level that it's quite impractical to even build silicon that could perform it fast. For a compile-time constant divisor, modern compilers can typically use a modular inverse to convert it into a multiplication plus a couple of shifts (depending on the constant). Even in this case, though, you're adding latency to the critical path.

In comparison, the compare+branch is effectively a NOP as long as it's predicted correctly. While you typically take a branch misprediction each time the index wraps, it's still typically the fastest approach.

There is also another approach that is potentially even faster in some cases (but not always, because there is some overhead): don't do the wrap around check in the loop at all. Instead you can use a separate outer loop to figure out how many samples your inner loop should process before you need to wrap around (or reach the end of block).
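A sketch of that outer-loop approach (hypothetical names; a plain delay-line copy for illustration): the outer loop computes how many samples fit before the next wrap or the end of the block, so the inner loop carries no wrap check at all.

```cpp
#include <algorithm>
#include <cstddef>

// Outer loop: figure out how many samples can be processed before the
// buffer wraps (or the block ends); inner loop has no wrap check at all.
void delayCopy(const float* in, float* out, std::size_t nFrames,
               float* buf, std::size_t length, std::size_t& pos) {
    std::size_t done = 0;
    while (done < nFrames) {
        std::size_t run = std::min(nFrames - done, length - pos);
        for (std::size_t i = 0; i < run; ++i) {
            out[done + i] = buf[pos + i];   // retrieve delayed sample
            buf[pos + i]  = in[done + i];   // store new sample
        }
        pos += run;
        if (pos == length) pos = 0;         // wrap at most once per run
        done += run;
    }
}
```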
If you'd like Signaldust to return, please ask Katinka Tuisku to resign.

JCJR
KVRAF
2895 posts since 17 Apr, 2005 from S.E. TN

Re: Optimize plugin code for balanced load or least load?

Post Tue Nov 19, 2019 6:52 pm

quikquak wrote:
Tue Nov 19, 2019 12:44 pm
Just my 2ps worth... I've always avoided % as it internally uses a divide (yeah, it may not be relevant these days). I use something simple like this which also catches additions of larger than 1:-

Code:

delayPtr += 1;
if (delayPtr >= DelayLength)
     delayPtr -= DelayLength;
I used that kind of pattern lots over the years. The following is probably only useful as comic relief. Maybe a good belly laugh. :)

Haven't used C++ or ObjPascal for a few years since retirement. Maybe real languages would never compile this kludge:

In JIT-compiled Reaper jsfx, I was recently playing with algorithmic reverby stuff that had so many allpass head and tail pointers that all the range-checking conditionals appeared to add significant overhead. Especially since the default processing is always one sample frame at a time, with no option for asm tight loops, etc.

So, trying to think up a non-branching, non-mod pointer-wrapping method that might run a little faster in jsfx, I came up with monstrosities like this for "monotonically increasing by integer" pointer/index wrapping:

delayIdx *= ((delayIdx += 1) < DelayLength);

Assuming jsfx compiles a branchless comparison, TRUE comparisons return 1 and FALSE comparisons return 0. Also jsfx seems to run fastest the fewer times you reference vars.

So the above line first incs delayIdx, then if the new [delayIdx < DelayLength] it multiplies the new delayIdx by 1 (no change), otherwise it multiplies the new delayIdx by 0, resetting the pointer to the buffer bottom. PERHAPS it does this without any low-level branching. At least I think I recall disassembling (and writing asm) compares which didn't need a branch in order to do the actual compare.

Regardless whether this abomination actually avoids a low-level branch in jsfx, it does run faster when range checking a big array of head and tail pointers, compared to an if--then-- branch equivalent. Maybe no self-respecting C++ compiler would compile such a thing? Or maybe this kind of abomination is so old and well-known that this is only the zillionth time this wheel was reinvented? Dunno.

Here is an even uglier version for modulated delay lines with non-integer pointer increments--

delayIdx -= ((delayIdx += delayInc) >= DelayLength) * DelayLength;

If the new delayIdx < DelayLength it subtracts zero from delayIdx, otherwise it subtracts DelayLength from delayIdx.

To repeat, the only reason the above isn't completely silly is that, at least in my situation in jsfx, it actually seems to run faster than if--then--.
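For reference, here is roughly what the same trick looks like in C++. Note that writing the jsfx one-liner verbatim was undefined behavior in C++ before C++17 (unsequenced modification of delayIdx), so the steps are split here:

```cpp
#include <cstddef>

// Integer increment: multiply the new index by the 0/1 predicate,
// so it resets to 0 on wrap without a branch in the source.
inline std::size_t wrap_mul(std::size_t idx, std::size_t length) {
    idx += 1;
    return idx * (idx < length);
}

// Non-integer increment (the second, "even uglier" variant): subtract
// either 0 or the delay length depending on the predicate.
inline double wrap_sub(double idx, double inc, double length) {
    idx += inc;
    return idx - (idx >= length) * length;
}
```

Whether a compiler actually emits branchless code for these is up to it; the source-level form merely makes that easy.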

2DaT
KVRist
399 posts since 21 Jun, 2013

Re: Optimize plugin code for balanced load or least load?

Post Tue Nov 19, 2019 10:01 pm

karrikuh wrote:
Tue Nov 19, 2019 11:28 am
It's possible I misinterpreted 2Dats comment, I thought he was recommending to generally prefer plain dynamic arrays (using new [] / delete []) over std::vector to avoid any overhead he assumed that std::vector would add. The term "array" is obviously a bit fuzzy...
It can have some overhead compared to a plain array if it's std::vector<uint8_t>, or on MSVC (which doesn't optimize on strict aliasing). And while std::vector<float> is not that bad, I don't find it very useful either. A simple RAII wrapper around new/delete can be useful if you need alignment control.
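A minimal sketch of such a wrapper (assuming C++17 for aligned operator new; the 64-byte default is a cache-line assumption, not anything from the thread):

```cpp
#include <cstddef>
#include <new>

// RAII-managed, alignment-controlled buffer: allocated with C++17 aligned
// operator new, elements value-initialized, released in the destructor.
template <typename T, std::size_t Align = 64>   // 64 = cache-line assumption
class AlignedBuffer {
public:
    explicit AlignedBuffer(std::size_t n)
        : size_(n),
          data_(static_cast<T*>(::operator new(n * sizeof(T),
                                               std::align_val_t(Align)))) {
        for (std::size_t i = 0; i < n; ++i) new (data_ + i) T();
    }
    ~AlignedBuffer() {
        for (std::size_t i = size_; i > 0; --i) data_[i - 1].~T();
        ::operator delete(data_, std::align_val_t(Align));
    }
    AlignedBuffer(const AlignedBuffer&) = delete;
    AlignedBuffer& operator=(const AlignedBuffer&) = delete;
    T& operator[](std::size_t i) { return data_[i]; }
    std::size_t size() const { return size_; }
private:
    std::size_t size_;
    T* data_;
};
```

Unlike std::vector, this never reallocates or grows, which is usually what you want for audio buffers anyway.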
Fender19 wrote:
Tue Nov 19, 2019 4:05 pm
I just discovered a significant oversight in my code. I will confess my mistake here just as a reminder for everyone else. I had some code that prepares data for the GUI display running inside the for(i<nFrames) loop, i.e., computing on every sample. It wasn't needed there because I only update the display once per ProcessReplacing block. I moved that code outside the "for" loop where it now runs once every 512 samples or so and only when needed to update the display. Significant improvement as you might expect!

Sometimes a fresh set of eyes and rearranging things can make giant leaps. :)
Profiling can be helpful for determining performance bottlenecks. It's not a silver bullet: sometimes there are no obvious bottlenecks and performance can still be improved a lot. But when there are bottlenecks, it's probably worth working on them first.

mystran
KVRAF
5491 posts since 12 Feb, 2006 from Helsinki, Finland

Re: Optimize plugin code for balanced load or least load?

Post Tue Nov 19, 2019 11:03 pm

JCJR wrote:
Tue Nov 19, 2019 6:52 pm
Regardless whether this abomination actually avoids a low-level branch in jsfx, it does run faster when range checking a big array of head and tail pointers, compared to an if--then-- branch equivalent. Maybe no self-respecting C++ compiler would compile such a thing? Or maybe this kind of abomination is so old and well-known that this is only the zillionth time this wheel was reinvented? Dunno.
In terms of doing this in native code, the basic idea with the branching version is that it's zero-cost when predicted and it's expected to always predict correctly, except when you need a wrap around. So doing something other than compare + branch must be cheaper than the average cost of the expected mispredictions once per buffer wraparound...

...but then again you don't necessarily need a branch, because the whole conditional assignment can usually be compiled into a CMOVcc, which is 1 cycle (whereas SETcc+IMUL would be 4 cycles). Whether this is faster essentially depends on whether or not the delay is long enough that the occasional branch cost averages less than a cycle per iteration (assuming this is what is limiting the pipeline in the first place).

There's a further "gotcha" though and that's with loop induction variables. The classical induction variable analysis looks for a single linear operation of the form "x*a+b" where a and b are loop invariant (so we replace x with x1 and initialise x1=b, then increment x1+=a per iteration) and every type of wrap-around handling will cause the classic analysis to fail. No idea if modern compilers use more flexible variants, but at least when this is done manually, you can combine it with CMOVcc (or assignment in branch; doesn't matter), since we can also rewrite the conditional assignment "x=0" into "x1=b" allowing us to just increment a pointer, rather than using ptr+index*sizeof(element) for addressing.
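A sketch of that manual rewrite (illustrative names): the index is strength-reduced to a pointer that is simply incremented each iteration, and the conditional reset assigns the buffer base instead of zero, so addressing needs no ptr+index*sizeof(element) arithmetic.

```cpp
#include <cstddef>

// The induction variable is a pointer; the wrap resets it to the buffer
// base, which a compiler can emit as CMOVcc or a cheap predicted branch.
void delayCopyPtr(const float* in, float* out, std::size_t nFrames,
                  float* buf, std::size_t length, std::size_t& pos) {
    float* p = buf + pos;
    float* const end = buf + length;
    for (std::size_t i = 0; i < nFrames; ++i) {
        out[i] = *p;             // retrieve delayed sample
        *p = in[i];              // store new sample
        ++p;
        if (p == end) p = buf;   // conditional reset of the induction variable
    }
    pos = static_cast<std::size_t>(p - buf);
}
```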

DJ Warmonger
KVRAF
3444 posts since 7 Jun, 2012 from Warsaw

Re: Optimize plugin code for balanced load or least load?

Post Tue Nov 19, 2019 11:56 pm

JCJR wrote:
delayIdx *= ((delayIdx += 1) < DelayLength);

Assuming jsfx compiles a branchless comparison, TRUE comparisons return 1 and FALSE comparisons return 0. Also jsfx seems to run fastest the fewer times you reference vars.
This is called "multiplication by predicate" and is a common optimization technique for all conditional instructions. I've seen it recommended in CUDA code as well as in the Synthmaker code module.
Mathematical and logical operations don't cause branching (or jumps). Ideally, you wouldn't want any branches in a DSP loop at all. This can replace not only arithmetic comparisons but also boolean conditions and FSM states.
http://djwarmonger.wordpress.com/
Tricky-Loops wrote: (...)someone like Armin van Buuren who claims to make a track in half an hour and all his songs sound somewhat boring(...)

2DaT
KVRist
399 posts since 21 Jun, 2013

Re: Optimize plugin code for balanced load or least load?

Post Wed Nov 20, 2019 12:44 am

DJ Warmonger wrote:
Tue Nov 19, 2019 11:56 pm
Ideally, you'd not want to use any branches in DSP loop. This can not only replace arithmetical comparisons, but boolean condition and FSM states.
ONLY unpredictable ones.

mystran
KVRAF
5491 posts since 12 Feb, 2006 from Helsinki, Finland

Re: Optimize plugin code for balanced load or least load?

Post Wed Nov 20, 2019 1:24 am

DJ Warmonger wrote:
Tue Nov 19, 2019 11:56 pm
This is called "multiplication by predicate" and is common optimization technique for all conditional instructions. I already saw it recommended in CUDA code as well as Synthmaker code module.
Mathematical and logical operations don't cause branching (or jumps). Ideally, you'd not want to use any branches in DSP loop. This can not only replace arithmetical comparisons, but boolean condition and FSM states.
Either directly predicated instructions (eg. CMOVcc on x86) or bitmasking are typically slightly faster than multiplication (even SETcc+DEC+AND beats SETcc+IMUL by one cycle on your average x86).
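One way to phrase the bitmasking variant at the source level (a sketch; what the compiler actually emits, whether SETcc+NEG+AND, CMOVcc, or something else, is up to it):

```cpp
// Build an all-ones/all-zero mask from the predicate and AND with it,
// instead of multiplying by 0/1.
inline int wrap_mask(int idx, int length) {
    ++idx;
    int mask = -(idx < length);   // -1 (all ones) if in range, else 0
    return idx & mask;            // unchanged, or reset to 0 on wrap
}
```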

That said, "prefer predicates to branches" might be somewhat dated advice when it comes to GPUs. In general, for coherent branches (ie. the whole warp takes the same path) you save having to compute the other path at all, and for incoherent branches you'll basically get the predicated masking automatically. If you're working on some SM3 hardware, then you should certainly avoid branches, but these days they are not really a problem anymore. For short branches where predication is preferable, the compiler can take care of it anyway.

If I'm not mistaken, ISPC actually does a similar thing with SIMD, where it generates code that only evaluates the branches that are required for at least one element. This way different elements can take different branches or different numbers of loop iterations, and the compiler will automatically predicate for you.

Z1202
KVRian
1092 posts since 12 Apr, 2002

Re: Optimize plugin code for balanced load or least load?

Post Wed Nov 20, 2019 1:28 am

syntonica wrote:
Tue Nov 19, 2019 10:49 am
Now that I look into it, -Ofast on clang adds -fno-signed-zeros -freciprocal-math -ffp-contract=fast -menable-unsafe-fp-math -menable-no-nans -menable-no-infs, so that may be why the extra flag in Xcode never does anything. :lol:
Interesting. I'd consider being careful with no-signed-zeros though, if SIMD is involved.

Z1202
KVRian
1092 posts since 12 Apr, 2002

Re: Optimize plugin code for balanced load or least load?

Post Wed Nov 20, 2019 1:31 am

syntonica wrote:
Tue Nov 19, 2019 10:49 am
However, I'm sure I've tried it just by itself using -O0 and not seen any significant gains. My tests use 5-10 instances of my plugin using different styles of patches so I get a good, overall average of CPU use.
Hmmm, that's weird. For one I remember that e.g. NaN handling can make comparisons much slower, since they need (at least in certain cases) to be compiled into a sequence of several commands if NaNs are to be properly handled.

mystran
KVRAF
5491 posts since 12 Feb, 2006 from Helsinki, Finland

Re: Optimize plugin code for balanced load or least load?

Post Wed Nov 20, 2019 2:34 am

Z1202 wrote:
Wed Nov 20, 2019 1:31 am
Hmmm, that's weird. For one I remember that e.g. NaN handling can make comparisons much slower, since they need (at least in certain cases) to be compiled into a sequence of several commands if NaNs are to be properly handled.
It's actually not a huge penalty on x86. With SSE at least, the actual comparisons are generic (similar to integer CMP) and just set flags (ie. ZF/CF/PF). If either operand is a NaN, then PF is set. While you can check any combination of ZF/CF together, you can only check PF on its own, so if you care about NaNs you need to check the flags twice. For regular branches, this just means you emit two conditional jumps (first check parity, then whatever comparison you had) rather than one.

syntonica
KVRian
545 posts since 25 Sep, 2014 from Specific Northwest

Re: Optimize plugin code for balanced load or least load?

Post Wed Nov 20, 2019 2:41 am

Z1202 wrote:
Wed Nov 20, 2019 1:31 am
syntonica wrote:
Tue Nov 19, 2019 10:49 am
However, I'm sure I've tried it just by itself using -O0 and not seen any significant gains. My tests use 5-10 instances of my plugin using different styles of patches so I get a good, overall average of CPU use.
Hmmm, that's weird. For one I remember that e.g. NaN handling can make comparisons much slower, since they need (at least in certain cases) to be compiled into a sequence of several commands if NaNs are to be properly handled.
I've weeded out almost all possible sources of NaNs. The one place I still need to check isn't horribly critical: the wave display in the GUI. I just make sure the sample equals itself, since NaNs don't. It's cheaper to do that check than to block the thread until the wave calculation is complete.
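The self-comparison check mentioned above, as a one-liner. One caveat worth noting: under -Ofast/-ffast-math the compiler may assume no NaNs exist and delete this test, so it has to live in code compiled without those flags.

```cpp
#include <cmath>   // std::nanf, for producing a test NaN

// NaN is the only IEEE-754 value that compares unequal to itself.
inline bool is_nan_sample(float x) {
    return x != x;
}
```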

I've listened very carefully to -Ofast and -O0 and there is no audible difference, nor have I seen any blow-ups. Any of those are strictly due to me and array bounds that I didn't lock down. Threading. :roll:
