Optimize plugin code for balanced load or least load?

DSP, Plugin and Host development discussion.

Post

Fender19 wrote: Tue Nov 19, 2019 8:10 pm I access the elements of that array using:

Code: Select all

leftdelayed  = DelayArray[0][delayPtr];    // rotating buffer - retrieve saved value
rightdelayed = DelayArray[1][delayPtr];
DelayArray[0][delayPtr] = leftin;          // rotating buffer - store new value
DelayArray[1][delayPtr] = rightin;

delayPtr += 1;
delayPtr %= DelayLength;
Is this a good way to do this - or is it SLOW?
I would swap the channel and sample indices: DelayArray[delayPtr][0/1]. That gives you a cache-friendly layout, since both channels of a frame end up adjacent in memory.
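As an illustration of the interleaved layout, here is a minimal sketch (hypothetical names, not the OP's actual code) where both channels of a frame sit next to each other, so a read/write of one frame touches a single cache line:

```cpp
#include <cstddef>
#include <vector>

// Interleaved stereo delay line: frames laid out as [L0, R0, L1, R1, ...].
struct StereoDelay {
    std::vector<float> buf;   // 2 floats per frame, interleaved
    size_t length;            // delay length in frames
    size_t pos = 0;

    explicit StereoDelay(size_t frames) : buf(2 * frames, 0.0f), length(frames) {}

    // Read the delayed frame, then overwrite it with the new input frame.
    void process(float inL, float inR, float& outL, float& outR) {
        float* frame = &buf[2 * pos];
        outL = frame[0];
        outR = frame[1];
        frame[0] = inL;
        frame[1] = inR;
        if (++pos >= length) pos = 0;   // wrap without %
    }
};
```

Both channel accesses now hit adjacent addresses instead of two arrays a full delay length apart.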

Post

Or use <double> SIMDs and do the stereo in one go. :)

Post

quikquak wrote: Tue Nov 19, 2019 9:29 pm Or use <double> SIMDs and do the stereo in one go. :)
It depends on how leftin and rightin are computed. If it can be vectorized, I'm pretty sure the compiler can do it itself. And yes, I hadn't thought of it, but the layout becomes not only cache-friendly but also vector-friendly.
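As a sketch of the SIMD idea, assuming x86 with SSE2 and an interleaved double-precision buffer (names are illustrative), both channels can be moved with one 128-bit load/store per frame:

```cpp
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// 'buf' holds 2*length doubles, frames interleaved as [L, R].
// One __m128d carries both channels of a frame.
void delay_frame_simd(double* buf, size_t& pos, size_t length,
                      const double in[2], double out[2]) {
    __m128d delayed = _mm_loadu_pd(&buf[2 * pos]);   // load L and R together
    _mm_storeu_pd(out, delayed);                     // emit the delayed frame
    _mm_storeu_pd(&buf[2 * pos], _mm_loadu_pd(in));  // store the new frame
    if (++pos >= length) pos = 0;
}
```

Whether this beats what the auto-vectorizer already does depends on the surrounding code, as noted above.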

Post

Thanks everyone for all your input. Much appreciated and hopefully helpful to others here as well.

I just discovered a significant oversight in my code. I will confess my mistake here just as a reminder for everyone else. I had some code that prepares data for the GUI display running inside the for(i<nFrames) loop, i.e., computing on every sample. It wasn't needed there because I only update the display once per ProcessReplacing block. I moved that code outside the "for" loop where it now runs once every 512 samples or so and only when needed to update the display. Significant improvement as you might expect!

Sometimes a fresh set of eyes and rearranging things can make giant leaps. :)

Post

quikquak wrote: Tue Nov 19, 2019 8:44 pm Just my 2ps worth... I've always avoided % as it internally uses a divide (yeah, it may not be relevant these days). I use something simple like this which also catches additions of larger than 1:-

Code: Select all

delayPtr += 1;
if (delayPtr >= DelayLength)
     delayPtr -= DelayLength;
General integer division is really slow; in fact, it's slow at such a fundamental level that it's quite impractical to even build silicon that could perform it fast. For a compile-time constant divisor, modern compilers can typically use a modular inverse to convert it into a multiplication plus a couple of shifts (depending on the constant). Even then, though, you're adding latency to the critical path.

In comparison, the compare+branch is effectively a NOP as long as it's predicted correctly. While you'll typically take a branch misprediction each time the index wraps, it's still usually the fastest approach.

There is also another approach that is potentially even faster in some cases (but not always, because there is some overhead): don't do the wrap-around check in the loop at all. Instead, use a separate outer loop to figure out how many samples your inner loop can process before it needs to wrap around (or reaches the end of the block).
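A sketch of that outer-loop idea, with hypothetical names (a simple swap-style delay, not anyone's actual plugin code): the inner loop has no wrap check at all, and the wrap happens at most once per run:

```cpp
#include <algorithm>
#include <cstddef>

// Process nFrames samples through a delay buffer, wrapping only between runs.
void process_block(float* io, size_t nFrames,
                   float* delayBuf, size_t delayLen, size_t& delayPos) {
    size_t done = 0;
    while (done < nFrames) {
        // Largest run that fits before the buffer end and the block end.
        size_t run = std::min(nFrames - done, delayLen - delayPos);
        for (size_t i = 0; i < run; ++i) {       // tight loop, no wrap check
            float delayed = delayBuf[delayPos + i];
            delayBuf[delayPos + i] = io[done + i];
            io[done + i] = delayed;
        }
        delayPos += run;
        if (delayPos == delayLen) delayPos = 0;  // wrap once per run
        done += run;
    }
}
```

The overhead mentioned above is the outer-loop bookkeeping, which only pays off when the runs are reasonably long.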

Post

quikquak wrote: Tue Nov 19, 2019 8:44 pm Just my 2ps worth... I've always avoided % as it internally uses a divide (yeah, it may not be relevant these days). I use something simple like this which also catches additions of larger than 1:-

Code: Select all

delayPtr += 1;
if (delayPtr >= DelayLength)
     delayPtr -= DelayLength;
I used that kind of pattern a lot over the years. The following is probably only useful as comic relief. Maybe a good belly laugh. :)

I haven't used C++ or ObjPascal for a few years since retirement. Maybe real languages would never compile this kludge:

In JIT-compiled Reaper jsfx, I was recently playing with algorithmic reverby stuff that had so many allpass head and tail pointers that all the range-checking conditionals appeared to have become a significant overhead. Especially since the default processing is always one sample frame at a time, with no option for asm tight loops, etc.

So, trying to think up a non-branching, non-mod pointer-wrapping method that might run a little faster in jsfx, I came up with monstrosities like this for "monotonically increasing by integer" pointer/index wrapping:

delayIdx *= ((delayIdx += 1) < DelayLength);

Assuming jsfx compiles a branchless comparison: TRUE comparisons return 1 and FALSE comparisons return 0. Also, jsfx seems to run fastest the fewer times you reference variables.

So the above line first increments delayIdx; then, if the new delayIdx < DelayLength, it multiplies the new delayIdx by 1 (no change); otherwise it multiplies the new delayIdx by 0, resetting the pointer to the bottom of the buffer. PERHAPS it does this without any low-level branching. At least I think I recall disassembling (and writing in asm) compares that didn't need a branch to do the actual comparison.

Regardless of whether this abomination actually avoids a low-level branch in jsfx, it does run faster when range-checking a big array of head and tail pointers, compared to an if--then-- branch equivalent. Maybe no self-respecting C++ compiler would compile such a thing? Or maybe this kind of abomination is so old and well known that this is only the zillionth time the wheel was reinvented? Dunno.

Here is an even uglier version for modulated delay lines with non-integer pointer increments:

delayIdx -= ((delayIdx += delayInc) >= DelayLength) * DelayLength;

If the new delayIdx < DelayLength, it subtracts zero from delayIdx; otherwise it subtracts DelayLength from delayIdx.

To repeat, the only reason the above isn't completely silly is that, at least in my situation in jsfx, it actually seems to run faster than if--then--.
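For what it's worth, here is a C++-legal sketch of the same trick (the jsfx one-liners mutate delayIdx twice in one expression, which is undefined behavior in C++, so the increment is split out):

```cpp
#include <cstddef>

// Integer variant: the comparison yields 0 or 1, so the multiply either
// keeps the incremented index or resets it to zero, with no branch in source.
inline size_t wrap_mul(size_t idx, size_t length) {
    idx += 1;
    return idx * (idx < length);   // idx if in range, else 0
}

// Fractional-increment variant for modulated delays: subtract length times
// the predicate, i.e. subtract 0 or DelayLength.
inline double wrap_sub(double idx, double inc, double length) {
    idx += inc;
    return idx - (idx >= length) * length;
}
```

Whether a C++ compiler emits a branch, a CMOV, or the literal multiply for these is its own decision, as discussed further down the thread.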

Post

karrikuh wrote: Tue Nov 19, 2019 7:28 pm It's possible I misinterpreted 2Dats comment, I thought he was recommending to generally prefer plain dynamic arrays (using new [] / delete []) over std::vector to avoid any overhead he assumed that std::vector would add. The term "array" is obviously a bit fuzzy...
It can have some overhead compared to a plain array if it's std::vector<uint8_t>, or on MSVC, which doesn't optimize based on strict aliasing. Though std::vector<float> is not that bad, I don't find it very useful either. A simple RAII wrapper around new/delete can be useful if you need alignment control.
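A minimal sketch of such a RAII wrapper, assuming C++17 aligned operator new (the class name and interface are illustrative, not from any particular library):

```cpp
#include <cstddef>
#include <new>

// Fixed-size float buffer with explicit alignment control (e.g. 32 bytes
// for AVX loads/stores). Contents are left uninitialized, like new[].
class AlignedBuffer {
public:
    explicit AlignedBuffer(size_t count, size_t alignment = 32)
        : size_(count),
          data_(static_cast<float*>(::operator new(
              count * sizeof(float), std::align_val_t(alignment)))),
          align_(alignment) {}
    ~AlignedBuffer() { ::operator delete(data_, std::align_val_t(align_)); }

    AlignedBuffer(const AlignedBuffer&) = delete;             // no accidental
    AlignedBuffer& operator=(const AlignedBuffer&) = delete;  // deep copies

    float* data() { return data_; }
    size_t size() const { return size_; }
    float& operator[](size_t i) { return data_[i]; }

private:
    size_t size_;
    float* data_;
    size_t align_;
};
```

Unlike std::vector, there is no capacity/growth machinery and no zero-initialization pass over the buffer.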
Fender19 wrote: Wed Nov 20, 2019 12:05 am I just discovered a significant oversight in my code. I will confess my mistake here just as a reminder for everyone else. I had some code that prepares data for the GUI display running inside the for(i<nFrames) loop, i.e., computing on every sample. It wasn't needed there because I only update the display once per ProcessReplacing block. I moved that code outside the "for" loop where it now runs once every 512 samples or so and only when needed to update the display. Significant improvement as you might expect!

Sometimes a fresh set of eyes and rearranging things can make giant leaps. :)
Profiling can be helpful for finding performance bottlenecks. It's not a silver bullet; sometimes there are no obvious bottlenecks but performance can still be improved a lot. When there are bottlenecks, though, it's probably worth working on them first.

Post

JCJR wrote: Wed Nov 20, 2019 2:52 am Regardless whether this abomination actually avoids a low-level branch in jsfx, it does run faster when range checking a big array of head and tail pointers, compared to an if--then-- branch equivalent. Maybe no self-respecting C++ compiler would compile such a thing? Or maybe this kind of abomination is so old and well-known that this is only the zillionth time this wheel was reinvented? Dunno.
In terms of doing this in native code, the basic idea with the branching version is that it's zero-cost when predicted and it's expected to always predict correctly, except when you need a wrap around. So doing something other than compare + branch must be cheaper than the average cost of the expected mispredictions once per buffer wraparound...

...but then again you don't necessarily need a branch, because the whole conditional assignment can usually be compiled into a CMOVcc, which is 1 cycle (whereas SETcc+IMUL would be 4 cycles). Whether this is faster essentially depends on whether the delay is long enough that the occasional branch cost averages less than a cycle per iteration (assuming this is what is limiting the pipeline in the first place).

There's a further "gotcha" though, and that's with loop induction variables. The classical induction variable analysis looks for a single linear operation of the form "x*a+b" where a and b are loop invariant (so we replace x with x1 and initialize x1=b, then increment x1+=a per iteration), and every type of wrap-around handling will cause the classic analysis to fail. No idea if modern compilers use more flexible variants, but at least when this is done manually you can combine it with CMOVcc (or assignment in a branch; it doesn't matter), since we can also rewrite the conditional assignment "x=0" into "x1=b", allowing us to just increment a pointer rather than using ptr+index*sizeof(element) for addressing.
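A sketch of that manual rewrite, with hypothetical names: walk a raw pointer instead of indexing, and conditionally reset it to the buffer start (a select a compiler can lower to CMOVcc):

```cpp
#include <cstddef>

// Delay loop with the index strength-reduced to a pointer; the wrap is a
// conditional reassignment of the pointer itself, not of an index that
// then feeds into ptr + index*sizeof(float) addressing.
void delay_ptr_loop(float* io, size_t nFrames,
                    float* buf, size_t len, size_t& idx) {
    float* p = buf + idx;
    float* end = buf + len;
    for (size_t i = 0; i < nFrames; ++i) {
        float delayed = *p;
        *p = io[i];
        io[i] = delayed;
        ++p;
        p = (p == end) ? buf : p;   // select, candidate for CMOVcc
    }
    idx = static_cast<size_t>(p - buf);   // write the index back once
}
```

Whether the select actually becomes a CMOV or a predicted branch is up to the compiler and target, per the discussion above.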

Post

delayIdx *= ((delayIdx += 1) < DelayLength);

Assuming jsfx compiles a branchless comparison, TRUE comparisons return 1 and FALSE comparisons return 0. Also jsfx seems to run fastest the fewer times you reference vars.
This is called "multiplication by a predicate" and is a common optimization technique for conditional instructions. I've seen it recommended in CUDA code as well as in the Synthmaker code module.
Mathematical and logical operations don't cause branching (or jumps). Ideally, you don't want any branches in a DSP loop at all. This can replace not only arithmetic comparisons but also boolean conditions and FSM states.

Post

DJ Warmonger wrote: Wed Nov 20, 2019 7:56 amIdeally, you'd not want to use any branches in DSP loop. This can not only replace arithmetical comparisons, but boolean condition and FSM states.
ONLY unpredictable ones.

Post

DJ Warmonger wrote: Wed Nov 20, 2019 7:56 am This is called "multiplication by predicate" and is common optimization technique for all conditional instructions. I already saw it recommended in CUDA code as well as Synthmaker code module.
Mathematical and logical operations don't cause branching (or jumps). Ideally, you'd not want to use any branches in DSP loop. This can not only replace arithmetical comparisons, but boolean condition and FSM states.
Either directly predicated instructions (e.g. CMOVcc on x86) or bitmasking is typically slightly faster than multiplication (even SETcc+DEC+AND beats SETcc+IMUL by one cycle on your average x86).

That said, "prefer predicates to branches" might be somewhat dated advice when it comes to GPUs. In general, for coherent branches (i.e. the whole warp takes the same path) you save having to compute the other path at all, and for incoherent branches you basically get the predicated masking automatically. If you're working on some SM3 hardware you should certainly avoid branches, but these days they are not really a problem anymore. For short branches where predication is preferable, the compiler can take care of it anyway.

If I'm not mistaken, ISPC actually does a similar thing with SIMD, where it generates code that only evaluates those branches that are required for at least one element. This way different elements can take different branches or different numbers of loop iterations, and the compiler will automatically predicate for you.
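As a sketch, the bitmasking alternative mentioned above for the index-wrap case might look like this (illustrative, not from any particular codebase): turn the 0/1 predicate into an all-zeros/all-ones mask and AND with it, instead of multiplying.

```cpp
#include <cstdint>

// Wrap-to-zero via bitmask: AND with 0xFFFFFFFF keeps idx, AND with 0
// resets it, avoiding the IMUL of the multiply-by-predicate version.
inline uint32_t wrap_mask(uint32_t idx, uint32_t length) {
    idx += 1;
    // (idx < length) is 0 or 1; negating gives 0x00000000 or 0xFFFFFFFF.
    uint32_t keep = static_cast<uint32_t>(-static_cast<int32_t>(idx < length));
    return idx & keep;
}
```

This is the SETcc+DEC+AND-style sequence referred to above, expressed in portable C++.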

Post

syntonica wrote: Tue Nov 19, 2019 6:49 pmNow that I look into it, -Ofast on clang adds -fno-signed-zeros -freciprocal-math -ffp-contract=fast -menable-unsafe-fp-math -menable-no-nans -menable-no-infs, so that may be why the extra flag in Xcode never does anything. :lol:
Interesting. I'd consider being careful with no-signed-zeros though, if SIMD is involved.

Post

syntonica wrote: Tue Nov 19, 2019 6:49 pmHowever, I'm sure I've tried it just by itself using -O0 and not seen any significant gains. My tests use 5-10 instances of my plugin using different styles of patches so I get a good, overall average of CPU use.
Hmmm, that's weird. For one thing, I remember that e.g. NaN handling can make comparisons much slower, since they need (at least in certain cases) to be compiled into a sequence of several instructions if NaNs are to be properly handled.

Post

Z1202 wrote: Wed Nov 20, 2019 9:31 am Hmmm, that's weird. For one I remember that e.g. NaN handling can make comparisons much slower, since they need (at least in certain cases) to be compiled into a sequence of several commands if NaNs are to be properly handled.
It's actually not a huge penalty on x86. With SSE at least, the actual comparisons are generic (similar to integer CMP) and just set flags (i.e. ZF/CF/PF). If either operand is a NaN, then PF is set. While you can check any combination of ZF/CF together, you can only check PF on its own, so if you care about NaNs you need to check the flags twice. For regular branches, this just means you emit two conditional jumps (first check parity, then whatever comparison you had) rather than one.

Post

Z1202 wrote: Wed Nov 20, 2019 9:31 am
syntonica wrote: Tue Nov 19, 2019 6:49 pmHowever, I'm sure I've tried it just by itself using -O0 and not seen any significant gains. My tests use 5-10 instances of my plugin using different styles of patches so I get a good, overall average of CPU use.
Hmmm, that's weird. For one thing, I remember that e.g. NaN handling can make comparisons much slower, since they need (at least in certain cases) to be compiled into a sequence of several instructions if NaNs are to be properly handled.
I've weeded out almost all possible sources of NaNs. The one place I still need to check isn't horribly critical: the wave display in the GUI. I just make sure the sample equals itself, since a NaN doesn't. It's cheaper to do that check than to block the thread until the wave calculation is complete.

I've listened very carefully to -Ofast and -O0 and there is no audible difference, nor have I seen any blow-ups. Any blow-ups have been strictly due to me and array bounds that I didn't lock down. Threading. :roll:
