Secrets to writing fast DSP (VST) code?

DSP, Plugin and Host development discussion.

Post

This pdf comes up quite often. Too bad it still refers to old architectures (Pentium 4 and Pentium M) and old languages (no C++11). Still good for the broad picture.

Post

Fender19 wrote: Thanks, all. I am aware of most of those principles with the exception of the "multiply instead of divide" suggestion. Will try that. Thank you!

Now, one area I don't thoroughly understand is memory access to/from large arrays (for FIR, FFT, delay line, etc.). I've heard that "indexing math" is slow but what, exactly, does that mean? Does that mean that any variable inside the index makes it slow, like this:

Code:

for (i = 0; i < N; i++) output = buffer[i];
or that doing math inside the index brackets is slow, like this:

Code:

offset = 20;
for (i = 0; i < N; i++) output = buffer[i + offset];
Are these methods of accessing data in an array slow in general or is that how it's typically done? Is there a better/faster way?
The indexing is just calculating a memory address and then loading from that address so it's normally pretty fast. In fact, the CPU can often predict your memory access and precache data for you (if your access pattern is predictable). But there are some things that can slow it down:

- If your array is enormous and your accesses fall all over the place, the CPU might not be able to cache it, and in extreme cases this can have large penalties.

- If your loop does memory allocations or deallocations, this can result in an OS call, a context switch, and so on. Operations that can result in memory allocations: adding anything to a std::list or std::map, growing a std::vector or std::deque (though those are written to try to reduce the impact), the new and delete operators, and creating or destroying an object that does any of the previously mentioned things (see the sketch after this list).

- Some code can result in copying large objects (operator = on large objects, passing large objects by value to functions) or doing long traversal loops (random accesses into lists, inserting data at the start or middle of a vector). This is slow if it happens on every loop iteration, of course.
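To make the allocation point concrete, here is a minimal sketch (the class and member names are made up, not from this thread) of a delay line that allocates its buffer once up front, so the per-sample loop only does index arithmetic plus loads and stores:

Code:

#include <cstddef>
#include <vector>

// Minimal sketch: allocate once in the constructor, never in the audio loop.
// Assumes 0 < delay < maxDelay; names are illustrative only.
class DelayExample {
public:
    explicit DelayExample(std::size_t maxDelay) : buffer_(maxDelay, 0.0f) {}

    void process(float* out, const float* in, std::size_t n, std::size_t delay) {
        const std::size_t size = buffer_.size();
        for (std::size_t i = 0; i < n; ++i) {
            buffer_[writePos_] = in[i];                           // store the new sample
            out[i] = buffer_[(writePos_ + size - delay) % size];  // read the delayed sample
            writePos_ = (writePos_ + 1) % size;                   // advance the write head
        }
    }

private:
    std::vector<float> buffer_;  // sized once, reused for every block
    std::size_t writePos_ = 0;
};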

Post

Miles1981 wrote:
This pdf comes up quite often. Too bad it still refers to old architectures (Pentium 4 and Pentium M) and old languages (no C++11). Still good for the broad picture.
First off, there are five PDFs, not one, and only one of them deals with C++.

Second, they cover pretty much everything up to the very latest Intel and AMD architectures. To say they give a broad picture is a gross understatement; they are far more detailed than anything else you will find, including Intel's and AMD's own SDMs / optimization guides.

Probably the reason he doesn't address C++11 is that there's nothing in C++11 that is relevant. It's all either stuff that doesn't directly have a runtime cost, or stuff whose speed comes down to the compiler / stdlib implementation. I.e. how fast a range-based loop will be depends on the implementation of the container and how the compiler puts it all together.
Chris Jones
www.sonigen.com

Post

MadBrain wrote: The indexing is just calculating a memory address and then loading from that address so it's normally pretty fast. In fact, the CPU can often predict your memory access and precache data for you (if your access pattern is predictable). But there are some things that can slow it down:
As long as the actual address can be handled by one of the addressing modes, it makes no difference how complicated it is. I.e.:

[eax], [eax+ebx], [eax+ebx*2 or 4 or 8], or any of those plus an immediate offset.

They all cost the same.

Cache prediction (on x86/64) pretty much just assumes you'll be reading sequentially forwards from the last read. AFAIK there's no mechanism to predict complicated patterns: you ask for 0x10000, it'll load N bytes from there, and then fetch the next N bytes as soon as it can.
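To illustrate the addressing-mode point, here is a small sketch (my own example, not from the thread); the assembly in the comment is only indicative, since the exact output depends on the compiler and flags:

Code:

#include <cstddef>

// With optimisation on, a typical x86-64 compiler folds the whole index
// expression into a single addressing mode, producing something like
//     movss xmm0, [rdi + rsi*4 + 80]
// i.e. base + index*scale + displacement in one load.
float readSample(const float* buffer, std::size_t i) {
    const std::size_t offset = 20;   // constant offset folded into the displacement
    return buffer[i + offset];
}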
Chris Jones
www.sonigen.com

Post

Urs wrote: Prefer stack memory over heap.
Could you explain exactly why this is faster? I recently noticed an improvement that may have been down to this but I'd like to understand it!

thanks

Post

sonigen wrote:
MadBrain wrote: The indexing is just calculating a memory address and then loading from that address so it's normally pretty fast. In fact, the CPU can often predict your memory access and precache data for you (if your access pattern is predictable). But there are some things that can slow it down:
As long as the actual address can be handled by one of the addressing modes, it makes no difference how complicated it is. I.e.:

[eax], [eax+ebx], [eax+ebx*2 or 4 or 8], or any of those plus an immediate offset.

They all cost the same.

Cache prediction (on x86/64) pretty much just assumes you'll be reading sequentially forwards from the last read. AFAIK there's no mechanism to predict complicated patterns: you ask for 0x10000, it'll load N bytes from there, and then fetch the next N bytes as soon as it can.
Sometimes a larger stride than a unit stride is faster.

Post

hibrasil wrote:
Urs wrote: Prefer stack memory over heap.
Could you explain exactly why this is faster? I recently noticed an improvement that may have been down to this but I'd like to understand it!

thanks
Reserving memory on the stack is just moving a pointer.

Reserving memory from the heap is a stdlib subroutine: it has to be thread safe, it involves checking freelists, maybe splitting larger blocks, maybe calling the OS, etc.

Last time I checked, calling malloc cost at least 300 cycles; setting up a stack frame is maybe 2.
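A minimal sketch of the difference (the function names and the 512-sample upper bound are my own assumptions, not from the thread); it is deliberately contrived just to show where the scratch buffer comes from:

Code:

#include <array>
#include <cstddef>
#include <vector>

// Heap version: every call can hit the allocator (freelists, locks, maybe the OS).
float sumHeap(const float* in, std::size_t n) {
    std::vector<float> scratch(in, in + n);   // malloc under the hood, every call
    float s = 0.0f;
    for (float v : scratch) s += v;
    return s;
}

// Stack version: the "allocation" is just the compiler adjusting the stack pointer.
// Assumes block sizes never exceed a fixed upper bound (512 here, made up).
float sumStack(const float* in, std::size_t n) {
    std::array<float, 512> scratch;
    const std::size_t m = n < scratch.size() ? n : scratch.size();
    for (std::size_t i = 0; i < m; ++i) scratch[i] = in[i];
    float s = 0.0f;
    for (std::size_t i = 0; i < m; ++i) s += scratch[i];
    return s;
}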
Chris Jones
www.sonigen.com

Post

camsr wrote: Sometimes a larger stride than a unit stride is faster.
Not seen that myself but I wouldn't be surprised!
Chris Jones
www.sonigen.com

Post

random_id wrote:
MadBrain wrote: - Make sure you're making your release builds with optimisation on and "fast math" (the "fast" floating-point model in MSVC).
Are most people using the fast math mode instead of precise? I am wondering what kinds of side-effects happen (or don't) with samples, convolutions, IIRs, etc.
It's about the same for most compilers: setting "fast math" essentially means you don't want to spend extra CPU on getting strict IEEE behaviour where the hardware doesn't quite follow the rules directly. So fast-math typically skips all the extra checks, doesn't force any intermediate rounding (on x87), and just generates the obvious code most of the time.

This is essentially what you always want. It's basically "free performance" and you almost certainly want to set it for any number crunching code, unless you explicitly have a reason why you can't use it. For all practical purposes it should be treated as the "default math" except it can't be a default option in compilers, because it breaks IEEE rules.

Most of the time using fast-math will break nothing, unless you are relying on something like exceptions, exact "bit-correct" results or some other corner cases. For about 99.9% of anything you ever want to do, it shouldn't matter. Just enjoy the free speed. There are a couple of gotchas, though: most importantly, compilers typically ignore NaN comparison rules under fast-math, so (value != value) would get optimized away (solution: use a volatile temporary), and there is some related trouble with NaNs and less-than/greater-than comparisons (the compiler might use the inverted comparison if it's more efficient, which fails with NaNs).

In theory fast-math also allows the compiler to do some arithmetic simplifications and such, so if you're relying on the exact order of calculations (in order to improve/maintain precision) you might want to double-check that you're getting what you're asking for, but in practice I've not had much trouble with fast-math even in such cases. This can vary between compilers, though; MSVC is fairly conservative in my experience. Anyway, typically stuff just works, except it works faster.
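A small sketch of the NaN gotcha mentioned above (the function names are mine): the volatile temporary keeps the self-comparison alive under fast-math, while checking the bit pattern avoids floating-point comparison semantics entirely.

Code:

#include <cstdint>
#include <cstring>

// Under fast-math the compiler may assume NaNs never occur, so a plain
// (x != x) test can be optimised away.

// 1) The volatile temporary mentioned above.
bool isNaNVolatile(float x) {
    volatile float v = x;
    return v != v;
}

// 2) Inspect the bit pattern directly: exponent all ones, mantissa non-zero.
bool isNaNBits(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return (bits & 0x7F800000u) == 0x7F800000u && (bits & 0x007FFFFFu) != 0u;
}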

Post

So far only one person has mentioned profiling your code. That has to be the very first thing you do, so use the best profiler you can. It is great fun optimising code to shave off the very last cycle, but too often we spend time in places that make only a very minute difference.

Once you know where most of the time is spent, look for algorithmic changes first; those will have the biggest effect. Again, we can spend hours looking at a line of C++ code trying to speed it up, whereas an algorithmic change could remove the need for that line altogether. Remember, the fastest code is doing nothing at all (see the sketch below).

Next I would look at optimising data access to the cache. Nowadays memory access is a large bottleneck, and complete redesigns of code to take cache access into account are a common way of producing large speed increases. This often means moving away from the nice OO design patterns where all data and the functionality that works on it are encapsulated in types, and instead splitting the data apart and closely binding it to one function - using a component-based design approach.
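One illustration of the "doing nothing at all" point (my own example, not the poster's): many effects can skip their expensive path entirely when the input block is silent and the effect has no tail left. The threshold here is made up.

Code:

#include <cmath>
#include <cstddef>

// Returns true if every sample in the block is below a (made-up) threshold.
// A real plugin would also check its own tail/decay state before skipping work.
bool isSilent(const float* in, std::size_t n, float threshold = 1.0e-6f) {
    for (std::size_t i = 0; i < n; ++i)
        if (std::fabs(in[i]) > threshold) return false;
    return true;
}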

Post

Keith99 wrote: So far only one person has mentioned profiling your code. That has to be the very first thing you do, so use the best profiler you can. It is great fun optimising code to shave off the very last cycle, but too often we spend time in places that make only a very minute difference. Once you know where most of the time is spent, look for algorithmic changes first; those will have the biggest effect. Again, we can spend hours looking at a line of C++ code trying to speed it up, whereas an algorithmic change could remove the need for that line altogether. Remember, the fastest code is doing nothing at all! Next I would look at optimising data access to the cache. Nowadays memory access is a large bottleneck, and complete redesigns of code to take cache access into account are a common way of producing large speed increases. This often means moving away from the nice OO design patterns where all data and the functionality that works on it are encapsulated in types, and instead splitting the data apart and closely binding it to one function - using a component-based design approach.
Exactly. Through profiling I discovered that the boost circular list could be greatly improved upon, and thus I doubled the speed in a single sitting.

Post

Oh right, for Windows, Very Sleepy is a great free choice for a sampling profiler. Sampling profilers are great in that they often (including Sleepy) require no instrumentation (though you want debug info, so you get source-line mapping for more useful results). This way you can profile unmodified release builds to get results for the "real deal" rather than something that would be "sort of like the actual thing."

Pretty much the only gotchas with these are that the results are noisy (more samples -> better results, but they take longer to collect) and that when they poll the process they do some memory inspection that burns caches, so if you have a high polling rate you'll get unreasonably bad cache performance. So better to set a reasonable polling rate and then have it run for a while (say, have a coffee break or something).

As for OO vs. caches... I'd say if your OO structure interferes with cache usage, there are probably other problems with it as well (such as architecture over utility).

Post

Those are a bunch of great tips. Thanks all.

Fender19, on the Borland compiler: I've used BCB a good bit, but Delphi more. There are a couple of foibles in some versions of Delphi which may also apply to BCB. Maybe you know all about this, so I will be brief; it can be googled.

The default in the past was to enable FPU exceptions; maybe it still is. The Borland frameworks were conceived to be more elegant, relying on exceptions to handle errors. If you turn that off it may improve speed and also make your program less flaky when interacting with other programs.

Last time I looked, it seemed that many plugins hardly pay attention to divide-by-zero, inf, or NaN at all. My perspective was from the host side. Maybe they are better nowadays, and maybe a host would take more care to avoid sending you a buffer full of garbage. Dunno.

But if you are set to use FPU exceptions and you get a buffer with weird values, then the exception your code throws may well be treated as an error coming from the host.

Once FPU exceptions are disabled (in Delphi), calling trunc would turn them back on, so I had to write a trunc replacement that preserved the exception state (if on, leave it on, and vice versa). Round and the others seemed to preserve the state. That may also be applicable to BCB.
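The thread is talking about Delphi/BCB here, but a rough C++ analogue of "control the FPU exception state around a call" might look like the sketch below, using the MSVC-specific _controlfp_s; the class name is mine and this is only one way to do it.

Code:

#include <float.h>   // _controlfp_s, _MCW_EM (MSVC-specific)

// RAII guard: read the current FP control word, mask (disable) all FP
// exceptions, and restore the previous exception mask on destruction.
struct ScopedFpExceptionMask {
    unsigned int saved = 0;
    ScopedFpExceptionMask() {
        _controlfp_s(&saved, 0, 0);            // read the current control word
        unsigned int tmp;
        _controlfp_s(&tmp, _MCW_EM, _MCW_EM);  // mask all FP exceptions
    }
    ~ScopedFpExceptionMask() {
        unsigned int tmp;
        _controlfp_s(&tmp, saved, _MCW_EM);    // restore the saved exception mask
    }
};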

Another thing that can dramatically slow down any code is denormals, even if they don't throw an exception. If a host or plugin hits a buffer full of denormals, the result can be a dramatic slow-to-a-stall effect. A lot of code takes precautions to avoid processing denormals, on both the host and plugin side. Dunno how universal that is.

So the host may take care not to send you buffers of NaN, inf and denormals, and if your own code just naturally never produces such values, then maybe you would never see the issue even if you take no precautions.

But you never know what a host might send you, and it can be easy to accidentally generate denormals in some of the common algorithms you might decide to use. So it is just a thing to read up on and keep in the back of your mind.
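One common precaution on x86/x64 (not specific to any poster here) is to set the SSE flush-to-zero and denormals-are-zero flags for the audio thread, typically at the start of each process callback since the flags are per-thread and the host may change them:

Code:

#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE
#include <pmmintrin.h>   // _MM_SET_DENORMALS_ZERO_MODE

// Flush denormal results to zero and treat denormal inputs as zero for the
// calling thread.  This only changes values below roughly 1e-38 in single
// precision, far below anything audible.
void enableFlushToZero() {
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}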

Post

"As for OO vs. caches .. I'd say if your OO structure interferes with cache usage, there's probably other problems with it as well (such as architecture over utility)."

Sorry, I was not saying OO is inherently bad for caching, but neither is it designed with the cache in mind. A common approach is to put all data and the functionality that works on it in a class. It is a nice encapsulated black box with lots of advantages, not least maintenance and extensibility. However, if you work on one piece of data across many instances of the class at a time, that is not very cache friendly. Instead you would be better off organising data according to how it is accessed, which differs from common OO designs.
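A minimal sketch of what "organising data according to how it is accessed" can look like (the voice/field names are made up): the hot loop only touches the two arrays it actually needs, instead of striding across whole objects.

Code:

#include <cstddef>
#include <vector>

// "OO" layout: each voice is a self-contained object (array of structures).
// A loop that only updates phases still drags all the other fields through the cache.
struct Voice {
    float phase;
    float increment;
    float amplitude;
    float pan, cutoff, resonance;   // ...plus whatever other per-voice state
};

// Cache-friendlier layout for the hot loop: the fields the loop reads live in
// their own contiguous arrays (structure of arrays).
struct VoiceBank {
    std::vector<float> phase;       // both vectors assumed resized to the voice count
    std::vector<float> increment;

    void advance(std::size_t numVoices) {
        for (std::size_t i = 0; i < numVoices; ++i) {
            phase[i] += increment[i];
            if (phase[i] >= 1.0f) phase[i] -= 1.0f;   // wrap a normalised phase
        }
    }
};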
