Optimize plugin code for balanced load or least load?

DSP, Plugin and Host development discussion.

Post

Fender19 wrote: Tue Nov 19, 2019 8:10 pm I access the elements of that array using:

Code: Select all

leftdelayed  = DelayArray[0][delayPtr];    // rotating buffer - retrieve saved value
rightdelayed = DelayArray[1][delayPtr];
DelayArray[0][delayPtr] = leftin;          // rotating buffer - store new value
DelayArray[1][delayPtr] = rightin;

delayPtr += 1;
delayPtr %= DelayLength;
Is this a good way to do this - or is it SLOW?
I would swap the channel and sample indices: DelayArray[delayPtr][0/1]. That gives you a cache-friendly layout, since both channels of a frame end up adjacent in memory.
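As an illustration of the interleaved layout, here is a minimal sketch (hypothetical names, not the OP's actual code) where both channels of a frame sit next to each other, so a read/write of one frame touches a single cache line:

```cpp
#include <cstddef>
#include <vector>

// Interleaved stereo delay line: frames laid out as [L0, R0, L1, R1, ...].
struct StereoDelay {
    std::vector<float> buf;   // 2 floats per frame, interleaved
    size_t length;            // delay length in frames
    size_t pos = 0;

    explicit StereoDelay(size_t frames) : buf(2 * frames, 0.0f), length(frames) {}

    // Read the delayed frame, then overwrite it with the new input frame.
    void process(float inL, float inR, float& outL, float& outR) {
        float* frame = &buf[2 * pos];
        outL = frame[0];
        outR = frame[1];
        frame[0] = inL;
        frame[1] = inR;
        if (++pos >= length) pos = 0;   // wrap without %
    }
};
```

Both channel accesses now hit adjacent addresses instead of two arrays a full delay length apart.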

Post

Or use <double> SIMDs and do the stereo in one go. :)

Post

quikquak wrote: Tue Nov 19, 2019 9:29 pm Or use <double> SIMDs and do the stereo in one go. :)
It depends on how leftin and rightin are computed. If it can be vectorized, I'm pretty sure the compiler can do it itself. And yes, I hadn't thought of it, but the layout becomes not only cache-friendly but also vector-friendly.
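As a sketch of the SIMD idea, assuming x86 with SSE2 and an interleaved double-precision buffer (names are illustrative), both channels can be moved with one 128-bit load/store per frame:

```cpp
#include <cstddef>
#include <emmintrin.h>  // SSE2 intrinsics

// 'buf' holds 2*length doubles, frames interleaved as [L, R].
// One __m128d carries both channels of a frame.
void delay_frame_simd(double* buf, size_t& pos, size_t length,
                      const double in[2], double out[2]) {
    __m128d delayed = _mm_loadu_pd(&buf[2 * pos]);   // load L and R together
    _mm_storeu_pd(out, delayed);                     // emit the delayed frame
    _mm_storeu_pd(&buf[2 * pos], _mm_loadu_pd(in));  // store the new frame
    if (++pos >= length) pos = 0;
}
```

Whether this beats what the auto-vectorizer already does depends on the surrounding code, as noted above.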

Post

Thanks everyone for all your input. Much appreciated and hopefully helpful to others here as well.

I just discovered a significant oversight in my code. I will confess my mistake here just as a reminder for everyone else. I had some code that prepares data for the GUI display running inside the for(i<nFrames) loop, i.e., computing on every sample. It wasn't needed there because I only update the display once per ProcessReplacing block. I moved that code outside the "for" loop where it now runs once every 512 samples or so and only when needed to update the display. Significant improvement as you might expect!

Sometimes a fresh set of eyes and rearranging things can make giant leaps. :)

Post

quikquak wrote: Tue Nov 19, 2019 8:44 pm Just my 2ps worth... I've always avoided % as it internally uses a divide (yeah, it may not be relevant these days). I use something simple like this which also catches additions of larger than 1:-

Code: Select all

delayPtr += 1;
if (delayPtr >= DelayLength)
     delayPtr -= DelayLength;
General integer division is really slow; in fact, it's slow at such a fundamental level that it's quite impractical to even build silicon that could perform it fast. For a compile-time constant divisor, modern compilers can typically use a modular inverse to convert it into a multiplication plus a couple of shifts (depending on the constant). Even then, though, you're adding latency to the critical path.

In comparison, the compare+branch is effectively a NOP as long as it's predicted correctly. While you'll typically take a branch misprediction each time the index wraps, it's still usually the fastest approach.

There is also another approach that is potentially even faster in some cases (but not always, because there is some overhead): don't do the wrap-around check in the loop at all. Instead, use a separate outer loop to figure out how many samples your inner loop can process before it needs to wrap around (or reaches the end of the block).
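A sketch of that outer-loop idea, with hypothetical names (a simple swap-style delay, not anyone's actual plugin code): the inner loop has no wrap check at all, and the wrap happens at most once per run:

```cpp
#include <algorithm>
#include <cstddef>

// Process nFrames samples through a delay buffer, wrapping only between runs.
void process_block(float* io, size_t nFrames,
                   float* delayBuf, size_t delayLen, size_t& delayPos) {
    size_t done = 0;
    while (done < nFrames) {
        // Largest run that fits before the buffer end and the block end.
        size_t run = std::min(nFrames - done, delayLen - delayPos);
        for (size_t i = 0; i < run; ++i) {       // tight loop, no wrap check
            float delayed = delayBuf[delayPos + i];
            delayBuf[delayPos + i] = io[done + i];
            io[done + i] = delayed;
        }
        delayPos += run;
        if (delayPos == delayLen) delayPos = 0;  // wrap once per run
        done += run;
    }
}
```

The overhead mentioned above is the outer-loop bookkeeping, which only pays off when the runs are reasonably long.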

Post

quikquak wrote: Tue Nov 19, 2019 8:44 pm Just my 2ps worth... I've always avoided % as it internally uses a divide (yeah, it may not be relevant these days). I use something simple like this which also catches additions of larger than 1:-

Code: Select all

delayPtr += 1;
if (delayPtr >= DelayLength)
     delayPtr -= DelayLength;
I used that kind of pattern a lot over the years. The following is probably only useful as comic relief. Maybe a good belly laugh. :)

I haven't used C++ or ObjPascal for a few years since retirement. Maybe real languages would never compile this kludge:

In JIT-compiled Reaper jsfx, I was recently playing with algorithmic reverby stuff that had so many allpass head and tail pointers that all the range-checking conditionals appeared to have become a significant overhead. Especially since the default processing is always one sample frame at a time, with no option for asm tight loops, etc.

So, trying to think up a non-branching, non-mod pointer-wrapping method that might run a little faster in jsfx, I came up with monstrosities like this for "monotonically increasing by integer" pointer/index wrapping:

delayIdx *= ((delayIdx += 1) < DelayLength);

Assuming jsfx compiles a branchless comparison: TRUE comparisons return 1 and FALSE comparisons return 0. Also, jsfx seems to run fastest the fewer times you reference variables.

So the above line first increments delayIdx; then, if the new delayIdx < DelayLength, it multiplies the new delayIdx by 1 (no change); otherwise it multiplies the new delayIdx by 0, resetting the pointer to the bottom of the buffer. PERHAPS it does this without any low-level branching. At least I think I recall disassembling (and writing in asm) compares that didn't need a branch to do the actual comparison.

Regardless of whether this abomination actually avoids a low-level branch in jsfx, it does run faster when range-checking a big array of head and tail pointers, compared to an if--then-- branch equivalent. Maybe no self-respecting C++ compiler would compile such a thing? Or maybe this kind of abomination is so old and well known that this is only the zillionth time the wheel was reinvented? Dunno.

Here is an even uglier version for modulated delay lines with non-integer pointer increments:

delayIdx -= ((delayIdx += delayInc) >= DelayLength) * DelayLength;

If the new delayIdx < DelayLength, it subtracts zero from delayIdx; otherwise it subtracts DelayLength from delayIdx.

To repeat, the only reason the above isn't completely silly is that, at least in my situation in jsfx, it actually seems to run faster than if--then--.
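For what it's worth, here is a C++-legal sketch of the same trick (the jsfx one-liners mutate delayIdx twice in one expression, which is undefined behavior in C++, so the increment is split out):

```cpp
#include <cstddef>

// Integer variant: the comparison yields 0 or 1, so the multiply either
// keeps the incremented index or resets it to zero, with no branch in source.
inline size_t wrap_mul(size_t idx, size_t length) {
    idx += 1;
    return idx * (idx < length);   // idx if in range, else 0
}

// Fractional-increment variant for modulated delays: subtract length times
// the predicate, i.e. subtract 0 or DelayLength.
inline double wrap_sub(double idx, double inc, double length) {
    idx += inc;
    return idx - (idx >= length) * length;
}
```

Whether a C++ compiler emits a branch, a CMOV, or the literal multiply for these is its own decision, as discussed further down the thread.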

Post

karrikuh wrote: Tue Nov 19, 2019 7:28 pm It's possible I misinterpreted 2Dats comment, I thought he was recommending to generally prefer plain dynamic arrays (using new [] / delete []) over std::vector to avoid any overhead he assumed that std::vector would add. The term "array" is obviously a bit fuzzy...
It can have some overhead compared to a plain array if it's std::vector<uint8_t>, or on MSVC, which doesn't optimize based on strict aliasing. Though std::vector<float> is not that bad, I don't find it very useful either. A simple RAII wrapper around new/delete can be useful if you need alignment control.
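A minimal sketch of such a RAII wrapper, assuming C++17 aligned operator new (the class name and interface are illustrative, not from any particular library):

```cpp
#include <cstddef>
#include <new>

// Fixed-size float buffer with explicit alignment control (e.g. 32 bytes
// for AVX loads/stores). Contents are left uninitialized, like new[].
class AlignedBuffer {
public:
    explicit AlignedBuffer(size_t count, size_t alignment = 32)
        : size_(count),
          data_(static_cast<float*>(::operator new(
              count * sizeof(float), std::align_val_t(alignment)))),
          align_(alignment) {}
    ~AlignedBuffer() { ::operator delete(data_, std::align_val_t(align_)); }

    AlignedBuffer(const AlignedBuffer&) = delete;             // no accidental
    AlignedBuffer& operator=(const AlignedBuffer&) = delete;  // deep copies

    float* data() { return data_; }
    size_t size() const { return size_; }
    float& operator[](size_t i) { return data_[i]; }

private:
    size_t size_;
    float* data_;
    size_t align_;
};
```

Unlike std::vector, there is no capacity/growth machinery and no zero-initialization pass over the buffer.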
Fender19 wrote: Wed Nov 20, 2019 12:05 am I just discovered a significant oversight in my code. I will confess my mistake here just as a reminder for everyone else. I had some code that prepares data for the GUI display running inside the for(i<nFrames) loop, i.e., computing on every sample. It wasn't needed there because I only update the display once per ProcessReplacing block. I moved that code outside the "for" loop where it now runs once every 512 samples or so and only when needed to update the display. Significant improvement as you might expect!

Sometimes a fresh set of eyes and rearranging things can make giant leaps. :)
Profiling can be helpful for finding performance bottlenecks. It's not a silver bullet; sometimes there are no obvious bottlenecks but performance can still be improved a lot. When there are bottlenecks, though, it's probably worth working on them first.

Post

JCJR wrote: Wed Nov 20, 2019 2:52 am Regardless whether this abomination actually avoids a low-level branch in jsfx, it does run faster when range checking a big array of head and tail pointers, compared to an if--then-- branch equivalent. Maybe no self-respecting C++ compiler would compile such a thing? Or maybe this kind of abomination is so old and well-known that this is only the zillionth time this wheel was reinvented? Dunno.
In terms of doing this in native code, the basic idea with the branching version is that it's zero-cost when predicted and it's expected to always predict correctly, except when you need a wrap around. So doing something other than compare + branch must be cheaper than the average cost of the expected mispredictions once per buffer wraparound...

...but then again you don't necessarily need a branch, because the whole conditional assignment can usually be compiled into a CMOVcc, which is 1 cycle (whereas SETcc+IMUL would be 4 cycles). Whether this is faster essentially depends on whether the delay is long enough that the occasional branch cost averages less than a cycle per iteration (assuming this is what is limiting the pipeline in the first place).

There's a further "gotcha" though, and that's with loop induction variables. The classical induction variable analysis looks for a single linear operation of the form "x*a+b" where a and b are loop invariant (so we replace x with x1 and initialize x1=b, then increment x1+=a per iteration), and every type of wrap-around handling will cause the classic analysis to fail. No idea if modern compilers use more flexible variants, but at least when this is done manually you can combine it with CMOVcc (or assignment in a branch; it doesn't matter), since we can also rewrite the conditional assignment "x=0" into "x1=b", allowing us to just increment a pointer rather than using ptr+index*sizeof(element) for addressing.
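A sketch of that manual rewrite, with hypothetical names: walk a raw pointer instead of indexing, and conditionally reset it to the buffer start (a select a compiler can lower to CMOVcc):

```cpp
#include <cstddef>

// Delay loop with the index strength-reduced to a pointer; the wrap is a
// conditional reassignment of the pointer itself, not of an index that
// then feeds into ptr + index*sizeof(float) addressing.
void delay_ptr_loop(float* io, size_t nFrames,
                    float* buf, size_t len, size_t& idx) {
    float* p = buf + idx;
    float* end = buf + len;
    for (size_t i = 0; i < nFrames; ++i) {
        float delayed = *p;
        *p = io[i];
        io[i] = delayed;
        ++p;
        p = (p == end) ? buf : p;   // select, candidate for CMOVcc
    }
    idx = static_cast<size_t>(p - buf);   // write the index back once
}
```

Whether the select actually becomes a CMOV or a predicted branch is up to the compiler and target, per the discussion above.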

Post

delayIdx *= ((delayIdx += 1) < DelayLength);

Assuming jsfx compiles a branchless comparison, TRUE comparisons return 1 and FALSE comparisons return 0. Also jsfx seems to run fastest the fewer times you reference vars.
This is called "multiplication by a predicate" and is a common optimization technique for conditional instructions. I've seen it recommended in CUDA code as well as in the Synthmaker code module.
Mathematical and logical operations don't cause branching (or jumps). Ideally, you don't want any branches in a DSP loop at all. This can replace not only arithmetic comparisons but also boolean conditions and FSM states.

Post

DJ Warmonger wrote: Wed Nov 20, 2019 7:56 amIdeally, you'd not want to use any branches in DSP loop. This can not only replace arithmetical comparisons, but boolean condition and FSM states.
ONLY unpredictable ones.

Post

DJ Warmonger wrote: Wed Nov 20, 2019 7:56 am This is called "multiplication by predicate" and is common optimization technique for all conditional instructions. I already saw it recommended in CUDA code as well as Synthmaker code module.
Mathematical and logical operations don't cause branching (or jumps). Ideally, you'd not want to use any branches in DSP loop. This can not only replace arithmetical comparisons, but boolean condition and FSM states.
Either directly predicated instructions (e.g. CMOVcc on x86) or bitmasking is typically slightly faster than multiplication (even SETcc+DEC+AND beats SETcc+IMUL by one cycle on your average x86).

That said, "prefer predicates to branches" might be somewhat dated advice when it comes to GPUs. In general, for coherent branches (i.e. the whole warp takes the same path) you save having to compute the other path at all, and for incoherent branches you basically get the predicated masking automatically. If you're working on some SM3 hardware you should certainly avoid branches, but these days they are not really a problem anymore. For short branches where predication is preferable, the compiler can take care of it anyway.

If I'm not mistaken, ISPC actually does a similar thing with SIMD, where it generates code that only evaluates those branches that are required for at least one element. This way different elements can take different branches or different numbers of loop iterations, and the compiler will automatically predicate for you.
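As a sketch, the bitmasking alternative mentioned above for the index-wrap case might look like this (illustrative, not from any particular codebase): turn the 0/1 predicate into an all-zeros/all-ones mask and AND with it, instead of multiplying.

```cpp
#include <cstdint>

// Wrap-to-zero via bitmask: AND with 0xFFFFFFFF keeps idx, AND with 0
// resets it, avoiding the IMUL of the multiply-by-predicate version.
inline uint32_t wrap_mask(uint32_t idx, uint32_t length) {
    idx += 1;
    // (idx < length) is 0 or 1; negating gives 0x00000000 or 0xFFFFFFFF.
    uint32_t keep = static_cast<uint32_t>(-static_cast<int32_t>(idx < length));
    return idx & keep;
}
```

This is the SETcc+DEC+AND-style sequence referred to above, expressed in portable C++.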

Post

syntonica wrote: Tue Nov 19, 2019 6:49 pmNow that I look into it, -Ofast on clang adds -fno-signed-zeros -freciprocal-math -ffp-contract=fast -menable-unsafe-fp-math -menable-no-nans -menable-no-infs, so that may be why the extra flag in Xcode never does anything. :lol:
Interesting. I'd consider being careful with no-signed-zeros though, if SIMD is involved.

Post

syntonica wrote: Tue Nov 19, 2019 6:49 pmHowever, I'm sure I've tried it just by itself using -O0 and not seen any significant gains. My tests use 5-10 instances of my plugin using different styles of patches so I get a good, overall average of CPU use.
Hmmm, that's weird. For one thing, I remember that e.g. NaN handling can make comparisons much slower, since they need (at least in certain cases) to be compiled into a sequence of several instructions if NaNs are to be properly handled.

Post

Z1202 wrote: Wed Nov 20, 2019 9:31 am Hmmm, that's weird. For one I remember that e.g. NaN handling can make comparisons much slower, since they need (at least in certain cases) to be compiled into a sequence of several commands if NaNs are to be properly handled.
It's actually not a huge penalty on x86. With SSE at least, the actual comparisons are generic (similar to integer CMP) and just set flags (i.e. ZF/CF/PF). If either operand is a NaN, then PF is set. While you can check any combination of ZF/CF together, you can only check PF on its own, so if you care about NaNs you need to check the flags twice. For regular branches, this just means you emit two conditional jumps (first check parity, then whatever comparison you had) rather than one.

Post

Z1202 wrote: Wed Nov 20, 2019 9:31 am
syntonica wrote: Tue Nov 19, 2019 6:49 pmHowever, I'm sure I've tried it just by itself using -O0 and not seen any significant gains. My tests use 5-10 instances of my plugin using different styles of patches so I get a good, overall average of CPU use.
Hmmm, that's weird. For one thing, I remember that e.g. NaN handling can make comparisons much slower, since they need (at least in certain cases) to be compiled into a sequence of several instructions if NaNs are to be properly handled.
I've weeded out almost all possible sources of NaNs. The one place I still need to check isn't horribly critical: the wave display in the GUI. I just make sure the sample equals itself, since a NaN doesn't. It's cheaper to do that check than to block the thread until the wave calculation is complete.

I've listened very carefully to -Ofast and -O0 and there is no audible difference, nor have I seen any blow-ups. Any blow-ups have been strictly due to me and array bounds that I didn't lock down. Threading. :roll:
