Secrets to writing fast DSP (VST) code?

Post

Miles1981 wrote:
BachRules wrote:Do your own memory management instead of C++ native when it makes a difference.
Never had to do that. I don't even know what you mean by that. Allocating everything before the processing loop is enough, and usually, you don't even need to reallocate anything the second time the processing loop is called.
I'm not making assumptions about what a VST does when it's called, besides it's somehow filling its output buffer. It could use a loop for that, but that's not necessary.
If you criticize Spitfire Audio, the mods will lock the thread.
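
To make "allocate everything before the processing loop" concrete, here is a minimal sketch. It assumes a plugin with a setup call that runs outside the audio callback; the class name `Delay` and the methods `prepare()` and `process()` are hypothetical and not taken from any SDK.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical skeleton: prepare() and process() are illustrative names,
// not part of any particular plugin SDK.
class Delay
{
public:
    // Called outside the audio callback, e.g. when the host changes settings.
    void prepare(std::size_t delaySamples)
    {
        assert(delaySamples > 0);
        buffer_.assign(delaySamples, 0.0f); // the only allocation happens here
        pos_ = 0;
    }

    // Called once per block: only touches memory reserved in prepare().
    void process(const float* in, float* out, std::size_t n)
    {
        assert(!buffer_.empty()); // prepare() must have been called first
        for (std::size_t i = 0; i < n; ++i) {
            const float delayed = buffer_[pos_];
            buffer_[pos_] = in[i];
            out[i] = delayed;
            if (++pos_ == buffer_.size()) pos_ = 0; // wrap the ring buffer
        }
    }

private:
    std::vector<float> buffer_;
    std::size_t pos_ = 0;
};
```

As long as the delay length stays the same, repeated calls to `process()` never allocate, which matches the "no reallocation the second time" remark in the quote.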

Post

Here is something I would like to know: how do you write SSE2 packed-double code that actually runs any faster than simply doing the same thing in scalars? Is this supposed to be possible? When it comes to SSE packed singles, you just put 4 jobs in parallel, and the thing magically runs much faster (often almost 4 times).. but when I try to do the same with 2x doubles, half the time scalars seem to be faster and the rest of the time I can't measure a meaningful difference one way or another (on either Intel or AMD hardware; I have multiple specimens of both).

So what's the magic with this stuff? Is it actually possible, or do you need to go 4-way AVX in order to actually see some improvement, or what's the deal? :P
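
For anyone who wants to measure this themselves, here is a minimal sketch of the kind of pair to benchmark: a scalar loop and the same loop with SSE2 packed doubles. The function names and the gain-and-add operation are made up for illustration. Keep in mind that two lanes cap the arithmetic gain at 2x, and an optimizing compiler may already vectorize the scalar version, so the two can end up very close.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h> // SSE2 intrinsics
#define HAVE_SSE2 1
#else
#define HAVE_SSE2 0    // other targets fall back to the scalar tail loop
#endif

// Scalar reference: out[i] = a[i] * gain + b[i]
void mixScalar(double* out, const double* a, const double* b,
               double gain, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * gain + b[i];
}

// Same operation, two doubles per iteration where SSE2 is available.
void mixPacked(double* out, const double* a, const double* b,
               double gain, std::size_t n)
{
    assert(n == 0 || (out && a && b));
    std::size_t i = 0;
#if HAVE_SSE2
    const __m128d g = _mm_set1_pd(gain);
    for (; i + 2 <= n; i += 2) {
        const __m128d va = _mm_loadu_pd(a + i); // unaligned loads: no
        const __m128d vb = _mm_loadu_pd(b + i); // alignment is assumed
        _mm_storeu_pd(out + i, _mm_add_pd(_mm_mul_pd(va, g), vb));
    }
#endif
    for (; i < n; ++i) // odd leftover sample, or all samples without SSE2
        out[i] = a[i] * gain + b[i];
}
```

Timing both over a few thousand blocks, with buffers small enough to stay in cache, is one way to check whether the packed version buys anything on a given machine.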

Post

It's not as easy as that.
I don't do that by hand, some do, but I suggest you use Boost.SIMD instead (supports SSE2 up to AVX, ARM, x64, MIPS...).
Now, you have to check your code to see if it is possible to use SIMD or not. The compiler can do a good job if you tell it that it has non aliased arrays (restrict keyword/extension). If this doesn't do the trick, you have to roll up your sleeves.

To efficiently use SIMD, you need to be able to process n*4 pieces of data with the same operation (NOT 4 jobs). There is an article that I need to find again that explains that SIMD may not be what you are looking for if you don't have enough data for the same operation.
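
As a rough illustration of the non-aliasing hint, here is a sketch using `__restrict`, which is a compiler extension in C++ (GCC, Clang and MSVC accept this spelling). The function name and the multiply-accumulate operation are just examples; whether the loop actually gets vectorized depends on the compiler and flags.

```cpp
#include <cassert>
#include <cstddef>

// __restrict promises the compiler that the three buffers never overlap,
// so it is free to load several elements ahead of the stores.
void multiplyAdd(float* __restrict out,
                 const float* __restrict a,
                 const float* __restrict b,
                 std::size_t n)
{
    assert(out != a && out != b); // cheap sanity check of that promise
    for (std::size_t i = 0; i < n; ++i)
        out[i] += a[i] * b[i];    // same operation on every element
}
```

Compiler reports such as GCC's `-fopt-info-vec` or Clang's `-Rpass=loop-vectorize` show whether a loop like this was vectorized.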

Post

Miles1981 wrote:It's not as easy as that.
I don't do that by hand, some do, but I suggest you use Boost.SIMD instead (supports SSE2 up to AVX, ARM, x64, MIPS...).
Now, you have to check your code to see if it is possible to use SIMD or not. The compiler can do a good job if you tell it that it has non aliased arrays (restrict keyword/extension). If this doesn't do the trick, you have to roll up your sleeves.
Most of the time it's relatively easy to extract a lot of performance from packed singles as long as the process is fairly SIMD friendly. Having 4-way data isn't always necessary either, as processing some junk (while wasteful) in the extra values might still end up faster than scalar code. On the other hand, I've yet to manage to come up with any actually useful code using packed doubles (at least not in SSE2) that improves over scalar code at all (usually rather the opposite). I used to use packed doubles for stereo processing actually (at least where the streams were independent and equal), but at some point I found that replacing all that stuff with a "fake SIMD" (2 scalar paths interleaved) class actually ended up a bit faster (and the results translate pretty much over all my test systems, with different generations and CPU vendors). :shrug:

I find this weird, hence the question of whether other people are observing the same thing.
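
For readers wondering what a "fake SIMD" class might look like, here is one possible sketch: a two-lane value type built from plain scalars, so stereo code is written once and the compiler schedules both paths. This is a guess at the general idea, not the poster's actual class; `Stereo2` and `OnePole2` are made-up names.

```cpp
#include <cassert>

// Hypothetical two-lane value type: plain scalars, no intrinsics.
struct Stereo2
{
    double l, r;
    Stereo2() : l(0.0), r(0.0) {}
    explicit Stereo2(double v) : l(v), r(v) {}
    Stereo2(double left, double right) : l(left), r(right) {}
};

inline Stereo2 operator+(Stereo2 a, Stereo2 b) { return Stereo2(a.l + b.l, a.r + b.r); }
inline Stereo2 operator-(Stereo2 a, Stereo2 b) { return Stereo2(a.l - b.l, a.r - b.r); }
inline Stereo2 operator*(Stereo2 a, Stereo2 b) { return Stereo2(a.l * b.l, a.r * b.r); }

// One-pole lowpass written once, run on both channels in lockstep:
// y += coeff * (x - y)
struct OnePole2
{
    Stereo2 state;
    Stereo2 coeff;
    explicit OnePole2(double c) : coeff(c) { assert(c >= 0.0 && c <= 1.0); }

    Stereo2 process(Stereo2 x)
    {
        state = state + coeff * (x - state);
        return state;
    }
};
```

Because the interface looks like a vector type, the same filter code could later be switched to real packed doubles and benchmarked against this version.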

Post

Urs wrote:
lkjb wrote:
Urs wrote:float tmp[ numSamples * 4 ];
Out of curiosity, at least VS2012 doesn't allow setting the size of a C array dynamically. Do you set numSamples to a value you don't expect to be exceeded or am I missing something?
We just allocate enough. Otherwise there's always calloc() which does the same thing.
But you wouldn't call calloc() in a DSP processing block, would you?

Post

Big Tick wrote:
Urs wrote:
lkjb wrote:
Urs wrote:float tmp[ numSamples * 4 ];
Out of curiosity, at least VS2012 doesn't allow setting the size of a C array dynamically. Do you set numSamples to a value you don't expect to be exceeded or am I missing something?
We just allocate enough. Otherwise there's always calloc() which does the same thing.
But you wouldn't call calloc() in a DSP processing block, would you?
Couple of solutions:

1. allocate on stack, use malloc()/calloc() as backup.. sure it might cause drop-outs, but at least things will still work, and if the block is long (which is the case if it didn't fit in your normal buffer), then the probability of drop-outs goes down (since we're not pushing for low latency, obviously)

2. use alloca() .. be careful though, you still need to check for maximum size here, so you don't run out of stack space (which is typically quite finite for threads other than the main one).

3. wrap your actual processing method with a wrapper function (which you might want to do for a dozen other reasons as well).. and have the wrapper check the block-size and split it into smaller parts if needed.. then rest of the code everywhere can pretend that the maximum block is never larger than whatever you set as the threshold (at 1024 samples the overhead is already negligible)

Personally, I like solution 3.
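
Solution 3 can be sketched roughly as follows. The names `process`, `processChunk` and `kMaxBlock` are hypothetical, and the two "stages" stand in for real DSP; the only number taken from the post is the 1024-sample threshold.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

const std::size_t kMaxBlock = 1024; // threshold mentioned in the post

// Hypothetical inner routine: it may assume n <= kMaxBlock, so a fixed-size
// stack buffer is always big enough and nothing is allocated.
void processChunk(const float* in, float* out, std::size_t n)
{
    assert(n <= kMaxBlock);
    float tmp[kMaxBlock];
    for (std::size_t i = 0; i < n; ++i) tmp[i] = in[i] * 0.5f;  // stage 1
    for (std::size_t i = 0; i < n; ++i) out[i] = tmp[i] + 1.0f; // stage 2
}

// Wrapper for the host-facing callback: splits any block size into
// chunks of at most kMaxBlock samples.
void process(const float* in, float* out, std::size_t numSamples)
{
    for (std::size_t pos = 0; pos < numSamples; ) {
        const std::size_t n = std::min(kMaxBlock, numSamples - pos);
        processChunk(in + pos, out + pos, n);
        pos += n;
    }
}
```

Everything below the wrapper can then size its temporaries for `kMaxBlock` and never needs alloca() or a heap fallback.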

Post

THIS.
Urs wrote:Block processing. Instead of rendering oscillators, LFOs, envelopes, filters, and what not in one loop, create a method to call for each. Make those methods take not just one sample to process, but a stream of 64 or more. When they process, each module renders into its own temporary buffer when necessary.
I spent the past year developing my plugin in a one-sample-at-a-time fashion, and now that it's getting close to release I decided to do what I could to optimize its performance. After applying the usual known techniques and doing some profiling, I took the time to refactor my code to work in blocks instead of a sample at a time and damn, was it worth it. My plugin now runs 3x faster than it did originally.
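
A minimal sketch of what such a block-based layout might look like, assuming two toy modules (a naive saw oscillator and an exponential decay envelope). All names here are hypothetical and this is not code from either poster's plugin.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kBlock = 64; // "a stream of 64 or more" from the quote

// Hypothetical modules: each renders a whole block per call.
struct RampOsc
{
    float phase = 0.0f;
    float inc = 0.0f;
    void render(float* out, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) {
            out[i] = 2.0f * phase - 1.0f; // naive saw in [-1, 1)
            phase += inc;
            if (phase >= 1.0f) phase -= 1.0f;
        }
    }
};

struct DecayEnv
{
    float level = 1.0f;
    float rate = 0.0f;
    void render(float* out, std::size_t n)
    {
        for (std::size_t i = 0; i < n; ++i) {
            out[i] = level;
            level *= rate;
        }
    }
};

// Voice: each module fills its own temporary buffer, then they are combined.
void renderVoice(RampOsc& osc, DecayEnv& env, float* out, std::size_t n)
{
    assert(n <= kBlock);
    float oscBuf[kBlock];
    float envBuf[kBlock];
    osc.render(oscBuf, n);
    env.render(envBuf, n);
    for (std::size_t i = 0; i < n; ++i)
        out[i] = oscBuf[i] * envBuf[i];
}
```

Each inner loop does one simple job over contiguous memory, which tends to be easier for the compiler to optimize than one big per-sample loop that hops between modules.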

Urs, I owe you (at least) another beer. If you're going to NAMM, PM me.

Thanks again.

Post

Hey Josh, you're welcome!

- U

Post

Big Tick wrote:
Urs wrote:
lkjb wrote:
Urs wrote:float tmp[ numSamples * 4 ];
Out of curiosity, at least VS2012 doesn't allow setting the size of a C array dynamically. Do you set numSamples to a value you don't expect to be exceeded or am I missing something?
We just allocate enough. Otherwise there's always calloc() which does the same thing.
But you wouldn't call calloc() in a DSP processing block, would you?
ugh, I meant alloca, not calloc, to allocate memory on stack...

I guess the better solution is to allocate memory for a fixed number of samples, and cut the audio stream into pieces that fit.

Post

Interesting thread.
Some of the biggest performance gains I've seen have come from using the Accelerate framework on Mac and, more recently, the (now free) Intel IPP.
Static code analyzers are helpful too, such as the clang one in Xcode http://clang-analyzer.llvm.org/xcode.html

Post

Is IPP now free for commercial use? If so, that's really interesting.

Post

Yeah IPP is now free, I think MKL too. You only pay if you want support services: https://software.intel.com/en-us/articles/free-ipp

Post

Is there a problem with using doubles in my VST?
I'm following this guide http://www.martin-finke.de/blog/tags/ma ... ugins.html
The guide uses doubles. I'm currently building a 32-bit VST, and when I build the DLL and load it in SAVIHost, the performance percentage always goes over 10%, which seems like a lot to me. I don't know what to do. Any suggestions?

PS: To access my oscillators, I've used this code:

Code:

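// Average of all oscillators, each scaled by its own volume envelope.
double output = 0.0;
const std::size_t numOscillators = oscillators.size();
const double oscillatorDivisions = 1.0 / numOscillators; // assumes at least one oscillator
for (std::size_t i = 0; i < numOscillators; ++i) {
	output += oscillators[i].nextSample() * oscillatorVolumeEnvelopeGenerators[i].nextSample();
}

// Apply the averaging factor and the MIDI velocity (0..127) once, outside the loop.
output *= oscillatorDivisions * midiVelocity / 127.0;

return output;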
For all the active oscillators I'm trying to do a simple thing: take each sample and apply the matching envelope generator. It works, but it's using a lot of CPU :(

Post

The whole algorithm needs to be vectorized ( http://sci.tuomastonteri.fi/programming/sse ), which is a lot of work and only worth it if you are bringing something serious to market :) It's not necessary for prototyping or learning.


For example, in the above code, if oscillators.size()==4, the processor could calculate all four oscillators in parallel (I haven't checked exactly how many, 4 or 8).
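
One way to picture the four-oscillator case is a structure-of-arrays layout, where lane j of every array belongs to oscillator j. The sketch below is plain C++ with no intrinsics and uses made-up names (`OscBank4`, `renderBank`), a naive saw instead of the poster's oscillator class, and a constant gain per oscillator in place of a running envelope.

```cpp
#include <cassert>
#include <cstddef>

const std::size_t kLanes = 4; // the oscillators.size()==4 case

// Hypothetical structure-of-arrays bank: one array per parameter.
struct OscBank4
{
    float phase[kLanes];
    float inc[kLanes];
    float gain[kLanes]; // envelope level per oscillator, held constant here
};

// Renders n samples of the averaged bank. The inner loop applies the same
// operations to all four lanes, which is the shape SIMD needs.
void renderBank(OscBank4& bank, float* out, std::size_t n)
{
    assert(out != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        float lane[kLanes];
        for (std::size_t j = 0; j < kLanes; ++j) {
            lane[j] = (2.0f * bank.phase[j] - 1.0f) * bank.gain[j]; // naive saw
            const float p = bank.phase[j] + bank.inc[j];
            bank.phase[j] = p - static_cast<float>(p >= 1.0f);      // branch-free wrap
        }
        out[i] = (lane[0] + lane[1] + lane[2] + lane[3]) * (1.0f / kLanes);
    }
}
```

With floats, the four lanes fit one SSE register, so the inner loop could be replaced by intrinsics or left to the compiler's vectorizer.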

p.s. There are probably more recent websites with more up-to-date info about SSE instructions; the one above was the first one Google found.
~stratum~

Post

Gonna try this process of parallelizing algorithms, seems interesting.
