Secrets to writing fast DSP (VST) code?

DSP, Plugin and Host development discussion.

Post

I have purchased VST plugins from some of you folks that have tremendous graphics, tons of controls and intensive audio processing - yet they have very low CPU usage. Incredible. My hat's off to you!

My plugins, on the other hand - even simple, non-GUI ones - are CPU hogs. Terrible.

I realize some of this knowledge is proprietary but are there any basic "rules of thumb" for writing fast DSP code (especially for VST plugins)? Things like choice of compiler brand and/or certain settings; using "while" loops instead of "for" loops, "don't do this", etc.? How do you get so much plugin with so little CPU?

Any advice appreciated!

Post

IMHO it comes down to two things:
- a good algorithmic compromise
- efficient C++
It's not about while vs. for loops (they compile to the same binary code); it's about which functions you call (exponential, sin...), the algorithm you choose (table look-up or direct computation, FFT or direct convolution) and the way you access memory (can the operations be vectorized?).

Post

- Find the "core" loop of your plugin (normally this is the "for" that calculates each sample) and make sure it does the least possible calculations (so that everything heavy is calculated beforehand).

- In particular, make sure the core loop avoids all slow math operations: sin(), cos(), tan(), asin(), acos(), atan(), pow(), sqrt(), log(), exp(), log10(), and also division. (these operations are acceptable in setup and "control rate" operations)

- Make sure the core loop isn't doing slow accesses to heavy data structures (for instance, don't use an std::deque to implement a delay line).

- Make sure you're making your release builds with optimisation on and "fast math" (the "fast" floating-point model in MSVC).

- Instead of calculating envelopes/lfo/pitch/filter coefs every sample, calculate them every few samples (I like calculating them every 16 samples or so) and ramp the volume (pitch and cutoff don't have to be ramped), this makes it a lot easier to have a fast core loop.
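The last point can be sketched roughly as follows. This is a minimal illustration, not anyone's actual plugin code: the names (`computeCutoffCoef`, `processBlock`) and the one-pole lowpass are assumptions made for the example. The expensive `tan()` runs only once every 16 samples, while the gain is ramped linearly per sample:

```cpp
#include <cmath>
#include <cstddef>

// Control-rate update interval: recompute expensive coefficients only
// every CONTROL_RATE samples, and ramp the gain per sample in between.
constexpr int CONTROL_RATE = 16;

// Hypothetical coefficient computation containing a slow call (tan).
float computeCutoffCoef(float cutoffHz, float sampleRate)
{
    return std::tan(3.14159265f * cutoffHz / sampleRate);
}

void processBlock(const float* in, float* out, std::size_t n,
                  float cutoffHz, float sampleRate,
                  float gainStart, float gainEnd)
{
    float g    = gainStart;
    // Per-sample increment for a linear gain ramp across the block.
    float dg   = (gainEnd - gainStart) / static_cast<float>(n);
    float coef = computeCutoffCoef(cutoffHz, sampleRate);
    float z1   = 0.0f; // one-pole filter state

    for (std::size_t i = 0; i < n; ++i) {
        // Control-rate: in a real plugin cutoffHz would be modulated,
        // so the coefficient is refreshed here, not every sample.
        if (i % CONTROL_RATE == 0)
            coef = computeCutoffCoef(cutoffHz, sampleRate);
        // Cheap core loop: one multiply-accumulate filter + ramped gain.
        z1 += coef * (in[i] - z1);
        out[i] = z1 * g;
        g += dg;
    }
}
```

The ramp keeps the gain change click-free even though the "real" parameter math only runs at control rate.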

Post

Profile your code to find bottlenecks.

Post

MadBrain wrote: - Make sure you're making your release builds with optimisation on and "fast math" (the "fast" floating-point model in MSVC).
Are most people using the fast math mode instead of precise? I am wondering what kinds of side-effects happen (or don't) with samples, convolutions, IIRs, etc.

Post

Prefer multiplication over division wherever possible.

Here is a small and simple example

Code: Select all

void setfs(float thesamplerate)
{
    fs = 1.f / thesamplerate; // called rarely: store the reciprocal once
}
void setf(float fc)
{
    f = tanf(pi * (fc * fs)); // called often: multiply instead of divide
    f2 = 1.f / (1.f + f);
}
Since setfs() is called very sparingly and setf() very frequently, precomputing the reciprocal of the sample rate once in setfs() and multiplying by it (fc * fs) is much faster than dividing by the sample rate (fc / samplerate) inside setf() every time.

Post

Thanks, all. I am aware of most of those principles with the exception of the "multiply instead of divide" suggestion. Will try that. Thank you!

Now, one area I don't thoroughly understand is memory access to/from large arrays (for FIR, FFT, delay line, etc.). I've heard that "indexing math" is slow but what, exactly, does that mean? Does that mean that any variable inside the index makes it slow, like this:

Code: Select all

for (i = 0; i < N; i++) output = buffer[i];
or that doing math inside the index brackets is slow, like this:

Code: Select all

offset = 20;
for (i = 0; i < N; i++) output = buffer[i + offset];
Are these methods of accessing data in an array slow in general or is that how it's typically done? Is there a better/faster way?

Post

Doing index arithmetic is okay (in my experience) if the array contains floats: integer index math and floating-point math run on separate execution units of the CPU, so they can overlap.

Post

MadBrain wrote: - In particular, make sure the core loop avoids all slow math operations: sin(), cos(), tan(), asin(), acos(), atan(), pow(), sqrt(), log(), exp(), log10(), and also division. (these operations are acceptable in setup and "control rate" operations)
Yes, and the processing loop should avoid calling any functions at all unless it's unavoidable.
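As one illustration of getting a transcendental call out of the loop entirely (a standard trick, sketched here, not something specific from this thread): a sine LFO can be generated with two multiplies and two adds per sample by rotating a 2-D vector, instead of calling sinf() every sample:

```cpp
#include <cmath>
#include <cstddef>

// Fills out[i] with sin(i * w) without any per-sample function calls:
// rotate the unit vector (c, s) by the angle w each sample.
void sineBlock(float* out, std::size_t n, float w /* radians per sample */)
{
    // One-time setup cost: two transcendental calls per block.
    const float cw = std::cos(w), sw = std::sin(w);
    float c = 1.0f, s = 0.0f; // cos(0), sin(0)
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = s;                 // current sine value
        float c2 = c * cw - s * sw; // 2-D rotation by w
        s = c * sw + s * cw;
        c = c2;
    }
}
```

For very long runs the recurrence drifts slightly in float precision, so renormalising (c, s) back to unit length every few thousand samples is advisable.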

Post

Dunno if it is worth the trouble, and maybe too old-fashioned, but something I've done a lot when possible on tight loops: 32-bit code. I suppose the same would work for 64-bit; I haven't studied 64-bit asm.

The CPU can conveniently hold up to 6 int values or pointers in its general-purpose registers, and 8 doubles in the FPU registers. When possible I try to make tight-loop asm modules that use that many vars or fewer. Load all the FPU constants, in/out pointers, loop comparison vars, etc. into the CPU and FPU up front. Then run a tight loop over the buffer, ideally doing NO MEMORY ACCESS except reading input samples and writing output samples. Everything else needed was loaded into the CPU and FPU before the tight loop begins.

Maybe some compilers can do as good or better on stripped down tight loops, but I've never noticed a compiler usually filling the fpu more than a few numbers deep, and if a certain constant is used by the fpu in several lines, I've never noticed a compiler loading that constant one time and leaving it in the fpu for multiple operations. The ones I've looked at seem to want to load the fpu for every line, then flush the fpu, then load again for the next line. That would seem to involve a lot of redundant memory access, but maybe some compilers are smart enough to do such optimizations. Am very out of date on most everything.

Post

I noticed in another thread that you mention you are using an old borland compiler. You don't sound too keen on using MSVS but it's free, generates efficient code and the VST examples compile out of the box.

You may be able to get performance increases just by switching. I never used borland, but when I compared MSVS with GCC a few years ago MSVS generated significantly faster code. (Although I did see a thread here, also quite a while back, listing about a page worth of compiler options that apparently brought GCC up to speed).

Post

Fender19 wrote:Now, one area I don't thoroughly understand is memory access to/from large arrays (for FIR, FFT, delay line, etc.). I've heard that "indexing math" is slow but what, exactly, does that mean? Does that mean that any variable inside the index makes it slow, like this:
No, index math is really fast. What is not fast is getting the array from memory into the cache levels. The compiler has to optimize those accesses, and if you work with only one float array, it knows it can vectorize the computation without having to re-fetch updated data. The issue is that you usually have several arrays (one input, one output) and the compiler HAS to assume they may share memory. So each time you compute one value, it has to fetch the input data back from the cache, which is dead slow (this is actually why Fortran is so fast: it doesn't allow aliasing pointers).
There are compiler extensions that deal with this efficiently, but you need an up-to-date compiler (one that handles vectorization properly), and then you need to learn how to write such code. You can also use tools like Boost.SIMD.
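The aliasing problem can be sketched like this. `__restrict` is a widely supported compiler extension (MSVC, GCC, Clang), not standard C++; it promises the compiler that the two pointers never overlap, so it can keep values in registers and vectorize the loop instead of reloading from memory after every store:

```cpp
#include <cstddef>

// __restrict tells the compiler that in and out never alias,
// so it may vectorize freely. (Standard C99 spells this `restrict`;
// in C++ it is a compiler extension.)
void applyGain(float* __restrict out, const float* __restrict in,
               std::size_t n, float gain)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * gain;
}
```

Calling this with overlapping buffers would be undefined behaviour, which is exactly the promise that buys the speed.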

Post

Two key things IMHO:
- learn the assembly language(s) of your target machines (I don't mean "write in assembly", just learn to read it). This will give you a clear understanding of the effects of your choice of compiler, which functions to call, which high-level language constructs to avoid, etc.
- make sure your math knowledge is at a sufficient level that you can optimize the algorithms you use. E.g. you don't need to call exp() or pow() to generate an exponential envelope.

This might be a time investment but it's gonna be worth it. Otherwise you will always be relying on a set of known tricks and other people's opinions.
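The exponential-envelope point rests on a standard identity: exp(-(n+1)/tau) = exp(-n/tau) * exp(-1/tau), so one exp() at note-on plus one multiply per sample suffices. A minimal sketch (names are illustrative):

```cpp
#include <cmath>
#include <cstddef>

// Fills out[i] with exp(-i / tauSamples) using a single exp() call at
// setup and one multiply per sample, instead of exp() per sample.
void expDecay(float* out, std::size_t n, float tauSamples)
{
    const float coef = std::exp(-1.0f / tauSamples); // one-time slow call
    float env = 1.0f;
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = env;
        env *= coef; // exp(-(i+1)/tau) = exp(-i/tau) * exp(-1/tau)
    }
}
```

The same recurrence idea generalises to exponential pitch glides and per-sample smoothing filters.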

Post

Block processing. Instead of rendering oscillators, LFOs, envelopes, filters, and what not in one loop, create a method to call for each. Make those methods process not just one sample, but a stream of 64 or more. Each module then renders into its own temporary buffer where necessary.

Prefer stack memory over heap, and oversample individually as needed:

void render( float* in, float* out, int numSamples )
{
    // assumes numSamples <= MAX_BLOCK_SIZE (a constant you define);
    // a runtime-sized array (float tmp[numSamples * 4]) is not standard C++
    float tmp[ MAX_BLOCK_SIZE * 4 ];

    upsample4( tmp, in, numSamples ); // upsample into tmp buffer

    // … process tmp …

    downsample4( out, tmp, numSamples ); // downsample into out buffer
}

This takes into account that oversampling alone doesn't help anti-aliasing much (need I explain?). Anti-aliasing is only handled properly if each individual module is oversampled. Alternatively, keep the large buffers and lowpass in between each module (that's the same thing but might require more stack memory). The advantage for CPU optimisation is that each module can be oversampled exactly as much as it needs: a bandlimited oscillator doesn't need to be oversampled at all; a non-linear filter does.

