Optimize plugin code for balanced load or least load?

DSP, Plugin and Host development discussion.

Post

When optimizing plugin code for CPU use is it best to go for:
A) lowest load - only run what needs to run at any given time (typically results in fluctuating CPU load) or,
B) balanced load - design code to maintain a steady CPU load, for example use continuous algorithms vs. if/else statements, etc.

(I'm not asking about parts of code that a user can turn on/off, I'm asking about the internal parts of code that run within a "turned on" block.)

Something like a limiter, for example, can be optimized so the gain reduction part of the code only runs when gain reduction is required (otherwise gain is simply "1"). This can make CPU use low during some passages and high in others. Is that fluctuating, but minimized, CPU load a good thing - or can it cause problems?
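To illustrate, here's a made-up sketch of the kind of branch I mean (the one-pole detector and the 0.8 threshold are placeholders, not real product code):

Code:

#include <cmath>

struct SimpleLimiter {
    float threshold = 0.8f; // placeholder value
    float env = 0.0f;       // peak-detector state

    float processSample(float x)
    {
        float a = std::fabs(x);
        env = (a > env) ? a : env * 0.9995f; // instant attack, slow release

        float gain = 1.0f;       // unity gain: the cheap path
        if (env > threshold)     // reduction math runs only when needed
            gain = threshold / env;
        return x * gain;
    }
};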

It seems it would make sense to always work to achieve the lowest CPU load at any time but I also envision the possibility of multiple plugins spiking the CPU meter at that "one spot" in a song when they all kick up at the same instant.

What are your thoughts/approaches to this?

Post

The user is more likely to notice when the CPU is tapping out if the process is always running (until bypassed). That's about the only advantage to having it run when nothing is affected.

Post

Unless you feel the need to "reserve" some CPU, go for the least load. Your end-user will be happier with the saved CPU cycles and it should be obvious that your plugin is eating them up when running, but barely nibbling when idle.
I started on Logic 5 with a PowerBook G4 550Mhz. I now have a MacBook Air M1 and it's ~165x faster! So, why is my music not proportionally better? :(

Post

Thank you for the replies. Seems the answer is unanimous - and that's good to know!

I am finding code optimization to be a bear. Once the obvious things are taken care of, it seems VERY difficult to do better than the standard math libraries and compiler optimizations (they've been around a long time, so it's no surprise they are already well tuned). For example, replacing x * 2 with x << 1 for an int doesn't run any faster, because the compiler is already doing that!

So the next step is to find alternative, faster ways of doing things. For example, I have found that exp(x) runs faster than pow(x, n), so using exp() whenever possible saves some cycles. I'm compiling cross-platform, so whatever I do needs to work on both Mac and PC.
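For instance, since pow(b, x) equals exp(x * log(b)) for positive b, a constant base means the log() can be paid once up front, leaving only an exp() per call. A sketch (the semitone helper is just an example I made up):

Code:

#include <cmath>

const float kLog2 = std::log(2.0f); // computed once

float semitonesToRatio(float semis)
{
    // same result as std::pow(2.0f, semis / 12.0f):
    return std::exp((semis / 12.0f) * kLog2);
}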

Are there any "fast plugin code" alternatives/suggestions/rules of thumb published anywhere? I'm "Googling" every step and a lot of the methods I have tried actually run SLOWER on my system.

Any suggestions appreciated. Thank you!

Post

Fender19 wrote: Mon Nov 18, 2019 12:43 am When optimizing plugin code for CPU use is it best to go for:
A) lowest load - only run what needs to run at any given time (typically results in fluctuating CPU load) or,
B) balanced load - design code to maintain a steady CPU load, for example use continuous algorithms vs. if/else statements, etc.
If there are bits of computation you can transparently switch off without affecting anything else, I think that's straightforward. However, it could be worth considering the peak computation per block as well.

As an example: I had a first draft of a plugin which performed largish FFTs (500 ms) several times a second. The overall CPU use was low, but every ~12th block had a much bigger chunk of computation in it.

The problem didn't present itself until I used it on live input in a fairly loaded project. In normal use the sudden CPU spike was fine, because the host was computing audio a little ahead (so it had some slack if one block took longer), but for low-latency live input it wasn't doing that, so the computation-heavy blocks delayed things enough to cause a small dropout a couple of times a second.

Once I tweaked things a bit, the CPU use was more even across the blocks (still not perfect, but better) and the problem improved. I had to adjust the algorithm a bit, and took a very marginal average-CPU hit, but it was worth it.

Anyway, that's my personal experience: most of the time average CPU use is what matters, but beware of per-block peaks, and live input is a slightly different case.

Post

Fender19 wrote: Mon Nov 18, 2019 5:54 pm
Are there any "fast plugin code" alternatives/suggestions/rules of thumb published anywhere? I'm "Googling" every step and a lot of the methods I have tried actually run SLOWER on my system.

Any suggestions appreciated. Thank you!
Your compiler can handle quite a bit of optimization for you, but not all. I wish there were modern guidelines, but most people just toe the party line with "let the compiler handle it." :dog: Lame, because this is the most fun part of programming for me!

My highly optimized code (-Ofast) runs, on average, twice as fast as the unoptimized (-O0). Before I started hand-optimizing, I was averaging only about a 50% gain, so there are definitely opportunities. I used to do tons of testing and experimentation to see what worked and why. I'm still learning, but hand-optimization is still quite necessary.

Partly, it's using smart programming practices.
Partly, it's using mathy tricks.
Partly, it's avoiding the stdlib like the plague and writing your own bespoke routines where it makes sense.

Off the top of my head:
- Read up on auto-vectorization so you can clean up your tight loops into good candidates.
- Use mathy solutions rather than conditionals (e.g. for a power-of-two wrap, x & mask instead of if (x > n) x -= n; see the sketch after this list).
- Don't call out of tight loops; send data pointers into them.
- Don't predefine nonce variables that are only used once or twice. Define them as needed and they can live in registers rather than on the stack.
- A weird one (I always thought the compiler would fix it): use ++i instead of i++. i++ can force an unnecessary copy.
- My GUI has to display a number of floats from 0.0000 to 1.0000 quite often, so writing my own inline function to format them, rather than using one of the built-in (slowwwww.....) print functions, cut its CPU use quite a bit.
- memset and memcpy look like they should be slower, but with optimization enabled they beat hand-rolled loops.
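Here's a quick sketch of that branchless wrap trick from the second bullet (sizes made up; it only works when the buffer size is a power of two):

Code:

constexpr unsigned kSize = 1024;       // must be a power of two
constexpr unsigned kMask = kSize - 1;
float buffer[kSize];
unsigned writePos = 0;

void push(float x)
{
    buffer[writePos] = x;
    writePos = (writePos + 1) & kMask; // branchless; replaces
                                       // if (++writePos >= kSize) writePos = 0;
}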

You'll find huge gains, small gains, no gains and holy crap! I killed it! :lol:

Post

Fender19 wrote: Mon Nov 18, 2019 5:54 pm Are there any "fast plugin code" alternatives/suggestions/rules of thumb published anywhere? I'm "Googling" every step and a lot of the methods I have tried actually run SLOWER on my system.
1. -ffast-math or equivalent. Helps the compiler optimize floating-point expressions. On MSVC this also enables cheaper transcendentals.
2. Prefer placing your dsp code in headers for better inlining.
3. Prefer block processing. Make sure your functions process blocks of samples instead of a single sample at a time. This helps the auto-vectorizer and reduces function call overhead.
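A minimal illustration (the gain is just a stand-in operation):

Code:

// One call per buffer, with a tight inner loop the auto-vectorizer
// can work on, instead of one function call per sample.
void processBlock(float* out, const float* in, int n, float gain)
{
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * gain;
}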
4. Reduce floating point divides when possible. Avoid integer division and modulo even more so.

Write

Code:

a += b * sampleRateInv;

instead of

Code:

a += b / sampleRate;

(where sampleRateInv = 1.0f / sampleRate is computed once, when the sample rate changes, not per sample).
5. Floating point transcendentals. Prefer single precision when possible. Avoid pow(). Prefer reasonable approximations (but don't trust random code, test it for precision and performance).
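For example, one widely circulated Padé-style tanh approximation (not my posted code; test it yourself, per the above) looks like this:

Code:

#include <algorithm>

// Absolute error stays under ~0.025; input is clamped because the
// rational form exceeds +/-1 beyond |x| = 3.
inline float fastTanh(float x)
{
    x = std::clamp(x, -3.0f, 3.0f);
    const float x2 = x * x;
    return x * (27.0f + x2) / (27.0f + 9.0f * x2);
}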

6. Vectorization. Auto-vectorization is very unreliable, so sometimes you need to vectorize yourself to get the best performance. This is complicated, but on some problems it can be worthwhile (up to 4x, or 8x on modern processors). May require CPU and assembly knowledge to get the best results.
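E.g. a hand-vectorized gain with SSE intrinsics (a sketch; assumes the buffer length is a multiple of 4, a real version needs a scalar tail loop):

Code:

#include <xmmintrin.h> // SSE (x86; NEON is the ARM analogue)

void applyGain(float* buf, int n, float gain)
{
    const __m128 g = _mm_set1_ps(gain);
    for (int i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(buf + i);         // unaligned-safe load
        _mm_storeu_ps(buf + i, _mm_mul_ps(v, g)); // 4 multiplies at once
    }
}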

7. Prefer local variables. Sometimes more code allows better optimization. Aliasing is a big performance killer in C/C++, and __restrict sometimes doesn't help.
This is more likely to vectorize:

Code:

size_t size = this->bufferSize;
for (size_t i = 0; i < size; i++) { ... }

than this:

Code:

for (size_t i = 0; i < this->bufferSize; i++) { ... }

Sometimes you need to copy members into locals to get vectorization, especially when you write through pointers:

Code:

float c0 = this->c0, c1 = this->c1;
for (size_t i = 0; i < s; i++)
    res[i] = in[i] * c0 + c1;


8. Avoid std library when possible. Prefer arrays to std::vector.

9. Avoid using large amounts of memory when not needed; this helps with the cache. Reuse hot memory when possible. Avoid big look-up tables when possible.

Post

syntonica wrote: Mon Nov 18, 2019 8:21 pmYou'll find huge gains, small gains, no gains and holy crap! I killed it!
LOL - so far I'm only achieving the second half of those possibilities!

Thank you for the info and suggestions. :tu:

Post

Thank you 2Dat! :tu:
2DaT wrote: Mon Nov 18, 2019 9:46 pm 5. Floating point transcendentals. Prefer single precision when possible. Avoid pow(). Prefer reasonable approximations (but don't trust random code, test it for precision and performance).

8. Avoid std library when possible.
These are two big ones for me. Are there preferred "fast math" libraries that everyone here uses - or do all the functions need to be home grown?

Post

2DaT wrote: Mon Nov 18, 2019 9:46 pm 1.--ffast-math or equivalent. Helps the compiler to optimize floating point expressions. On MSVC this also enables cheaper transcendentals.
...
6. Vectorization. Auto-vectorization is very unreliable, so sometimes you need to vectorize yourself to get best performance. This is complicated, but on some problems can be worthwhile (up to 4x perf or 8x on modern processors). May require cpu and asm knowledge to get best results.
...
8. Avoid std library when possible. Prefer arrays to std::vector.
...
9. Avoid using big amounts of memory when not needed. This will help with cache. Reuse hot memory when possible. Avoid big look-up tables when possible.
A few additional thoughts:

I've never had much luck on the Mac with fast-math; it never does a thing for me. It does seem to do some good with gcc. I don't recall if I even used it on MSVC; I was busy learning the Windows Way of things.

I just suggested the first steps of working with auto-vectorization; I didn't want to scare the OP off! :lol: Turn on the compiler's verbose mode and it will tell you where vectorization is a yes, where it's a no because there's no gain, and where it's a flat no. To be honest, all my MACs (multiply-accumulates) get vectorized easily, and the things that don't usually have good reasons and can't be refactored either. But I'd recommend going this route before trying to use intrinsics directly.

I think there are a few who would argue vehemently against not using vectors. :lol: I hate pretty much all of the STL and the stdlib in C++. I once spent half a day trying to figure out how to use their linked list, finally said feck it, and wrote my own in 10 minutes. Yes, I'm that stupid-clever. :oops:

Finally, memory: I try to avoid all new/delete calls except at the beginning and the end. I prefer pools of objects over randomly allocating chunks of memory.
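Something like this sketch of the idea (yes, it uses std::vector, but only for setup, never on the audio path):

Code:

#include <cstddef>
#include <vector>

// All memory is grabbed up front; acquire/release never touch the
// heap, since the free list can't grow past its reserved capacity.
template <typename T>
class Pool {
public:
    explicit Pool(std::size_t n) : items(n) {
        freeList.reserve(n);
        for (auto& it : items) freeList.push_back(&it);
    }
    T* acquire() {
        if (freeList.empty()) return nullptr; // pool exhausted
        T* p = freeList.back();
        freeList.pop_back();
        return p;
    }
    void release(T* p) { freeList.push_back(p); }
private:
    std::vector<T>  items;
    std::vector<T*> freeList;
};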

If I said half of this on Stack Overflow, I'd probably get perma-banned! :lol:

Post

Fender19 wrote: Mon Nov 18, 2019 9:51 pm These are two big ones for me. Are there preferred "fast math" libraries that everyone here uses - or do all the functions need to be home grown?
I posted some vectorized functions here on KVR: exp, log, tanh and tan for BLT. These are nice approximations, almost library accuracy.
Vectorized elementary functions are useful because they are more performant and help with manual vectorization. MSVC can vectorize elementary functions with fast math, but last time I checked, clang and gcc couldn't - at least not without an external library such as SVML, and even then an approximation is probably faster. Of course, this only makes sense if you are heavily bottlenecked by function evaluation.
syntonica wrote: Mon Nov 18, 2019 10:52 pm I've never had that much luck on the Mac with the fast-math. Never does a thing for me. It does seem to do some good with gcc. I don't recall if I even used it on MSVC, but I was busy learning the Windows Way of things.
IIRC there is another flag that helps to optimize further. (-ffp-contract=fast?).
syntonica wrote: Mon Nov 18, 2019 10:52 pmTurn on the verbose mode for the compiler and it will tell you where it's a yes, where it's a no since there's no gain, and where it's a flat no.
These diagnostic messages are not that helpful: they don't tell you how to get the thing to vectorize. You need to spoon-feed the compiler to get good auto-vectorization of non-trivial code (such as the local copying I mentioned before).
syntonica wrote: Mon Nov 18, 2019 10:52 pmFinally, memory, I try to avoid all news/deletes except at the beginning and the end. I prefer pools of objects over randomly allocating chunks of memory.
Heap allocations and deallocations are usually to be avoided on the audio thread anyway.

Post

Thanks all for your help and suggestions.

I have been employing some of these suggestions and am getting some strange results. For example, I was able to eliminate one exp(x) and one log(x) call from my code - which I thought would be significant - yet saw zero change in CPU load. I then eliminated a single floating-point divide and saw a ~5% reduction in CPU load!

So things that SHOULD make a huge difference sometimes don't, and little things DO. There doesn't seem to be any logic to it. Is that usually how optimization goes (hit and miss, interactions, etc.), or am I just not reading the results properly?

I am testing my plugin by stacking 10 instances of it in one track in Reaper. When all 10 plugins are running Reaper reports 1.9% total CPU usage - and each plugin instance shows 0.2%. But when I remove all but one plugin it reports 0.3% usage (looks like 50% more for just one plugin by itself). That doesn't make sense to me - does it to you? Or are these numbers too "low in the weeds" to be meaningful?

Post

syntonica wrote: Mon Nov 18, 2019 10:52 pm I've never had that much luck on the Mac with the fast-math. Never does a thing for me. It does seem to do some good with gcc. I don't recall if I even used it on MSVC, but I was busy learning the Windows Way of things.
What "fast-math" does is to allow the compiler to perform algebraic simplification in floating-point without having to worry about changing the results slightly. Since the order of operations has effect on the rounding, there are very few optimisations you can do with floating-point code if you need to guarantee a particular bit-pattern (ie. you can't optimise (a+1)+2 into a+3, because this will change rounding).

Typically it also allows the compiler to ignore signed zeroes and ignore the possibility of NaNs (eg. you can't normally invert a floating-point comparison to reorder branches, because NaNs always compare false with everything). That said, if your code has very few opportunities for such optimisations (ie. you mostly already simplified your expressions), then fast-math might not have much performance impact.
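A classic case is a plain summation loop: under strict IEEE semantics the compiler must keep the additions in order, which blocks vectorizing the reduction, while fast-math lets it reassociate them into SIMD partial sums.

Code:

float sum(const float* x, int n)
{
    float acc = 0.0f;
    for (int i = 0; i < n; ++i)
        acc += x[i]; // strictly sequential without fast-math
    return acc;
}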

Post

As for the original question, in my opinion you should usually try to find a balance between the two: go for the lowest average CPU usage, except where it results in significant variation from one block to the next.

For example, there is little purpose in processing something like modulation slots that are disabled. The load will vary depending on how many features are enabled, but as long as there are no CPU spikes this is fine. The host CPU meter still gives the user a reasonable estimate on whether or not there are going to be glitches with whatever feature set they currently have enabled.

On the other hand, if you try to do something like long FFTs in the audio thread, your load can vary a lot from one block to another. Such random variation is bad, because now you might end up with glitches if two plugins happen to have a spike at the same time, even though the average CPU load might be lower. In this case it might be better to use an algorithm that is slower if it means that the CPU load is more predictable.
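One way to even things out (a sketch of the idea, not code from any particular plugin) is to split the big periodic job into fixed slices and run one slice per block:

Code:

struct AmortizedJob {
    int totalSlices = 16; // job split into 16 pieces
    int nextSlice = 0;
    bool pending = false;

    void start() { pending = true; nextSlice = 0; }

    // Call once per audio block: runs at most one slice.
    void runSlice() {
        if (!pending) return;
        doSlice(nextSlice); // ~1/16th of the work
        if (++nextSlice == totalSlices)
            pending = false;
    }
    void doSlice(int /*index*/) { /* partial work goes here */ }
};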

Sometimes it depends on the plugin. For a static EQ, you might tolerate more spikes when changing parameters, compared to something like a filter plugin where you expect the user to do a lot of automation.

Post

2DaT wrote: Mon Nov 18, 2019 9:46 pm8. Avoid std library when possible. Prefer arrays to std::vector.
Why? I'm using std::vector all over the place, and to me it leads to much clearer and safer code than taking care of deleting memory myself. Also, I didn't observe any performance hit compared to a manually allocated array.
https://isocpp.github.io/CppCoreGuideli ... Rsl-arrays
I agree with most of your other points, though.

Points I would add are:
- Use a recent and decent compiler like GCC >= 8. You can use mingw-w64 on Windows. Last time I checked, MSVC's optimizer couldn't really compete. Also, you get better standard conformance.
- Use SIMD intrinsics and optimize your internal state layout to make the best use of them (SoA vs AoS, proper 16-byte alignment on the heap to allow aligned loads/stores). Forget about auto-vectorization: except for some trivial cases, it will not give you optimal code. There is no free lunch.
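For example (a sketch; assumes the buffers are 16-byte aligned and the length is a multiple of 4):

Code:

#include <xmmintrin.h>

// alignas gives 16-byte alignment for static/stack arrays; heap
// storage needs an aligned allocator (not shown here).
alignas(16) float accum[1024];

void mixInto(float* __restrict dst, const float* __restrict src,
             int n, float g)
{
    const __m128 vg = _mm_set1_ps(g);
    for (int i = 0; i < n; i += 4) {
        __m128 d = _mm_load_ps(dst + i); // aligned load forms
        __m128 s = _mm_load_ps(src + i);
        _mm_store_ps(dst + i, _mm_add_ps(d, _mm_mul_ps(s, vg)));
    }
}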
