Secrets to writing fast DSP (VST) code?

BachRules wrote: Do your own memory management instead of C++ native when it makes a difference.
Miles1981 wrote: Never had to do that. I don't even know what you mean by that. Allocating everything before the processing loop is enough, and usually you don't even need to reallocate anything the second time the processing loop is called.

I'm not making assumptions about what a VST does when it's called, beyond that it somehow fills its output buffer. It could use a loop for that, but that's not necessary.
-
- Banned
- 228 posts since 3 Feb, 2014
- KVRAF
- 7964 posts since 12 Feb, 2006 from Helsinki, Finland
Here is something I would like to know: how do you actually write SSE2 packed-double code that runs any faster than simply doing the same thing in scalars? Is this actually supposed to be possible? When it comes to SSE packed singles, you just put 4 jobs in parallel and the thing magically runs much faster (often almost 4 times). But when I try to do the same with 2x doubles, half the time scalars seem to be faster, and the rest of the time I can't measure a meaningful difference one way or another (on either Intel or AMD hardware; I have multiple specimens of both).
So what's the magic with this stuff? Is it actually possible, or do you need to go 4-way AVX in order to actually see some improvement, or what's the deal?
-
- KVRian
- 1379 posts since 26 Apr, 2004 from UK
It's not as easy as that.
I don't do that by hand, some do, but I suggest you use Boost.SIMD instead (supports SSE2 up to AVX, ARM, x64, MIPS...).
Now, you have to check your code to see whether it is possible to use SIMD or not. The compiler can do a good job if you tell it that it has non-aliased arrays (the restrict keyword/extension). If this doesn't do the trick, you have to roll up your sleeves.
To use SIMD efficiently, you need to be able to process n*4 pieces of data with the same operation (NOT 4 jobs). There is an article I need to find again which explains that SIMD may not be what you are looking for if you don't have enough data for the same operation.
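To illustrate the point about non-aliased arrays, here is a minimal sketch (function and parameter names are mine, not from the thread): with restrict the compiler knows the output can't overlap the input, so it is free to auto-vectorize the loop into 4-wide (SSE) or 8-wide (AVX) operations.

```cpp
#include <cstddef>

// Hypothetical example: restrict tells the compiler out and in never
// alias, so this loop can be auto-vectorized without runtime checks.
void mixGain(float* __restrict out, const float* __restrict in,
             float gain, std::size_t n)
{
    // One multiply per sample; with vectorization the compiler emits
    // one packed multiply per 4 (or 8) samples instead.
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * gain;
}
```

Compiling with -O2 and inspecting the assembly (or using -fopt-info-vec on GCC) is the way to confirm the loop actually got vectorized.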
- KVRAF
- 7964 posts since 12 Feb, 2006 from Helsinki, Finland
Miles1981 wrote: It's not as easy as that.

Most of the time it's relatively easy to extract a lot of performance from packed singles, as long as the process is fairly SIMD-friendly. Having 4-way data isn't always necessary either: processing some junk in the extra lanes, while wasteful, can still end up faster than scalar code. On the other hand, I've yet to come up with any actually useful packed-double code (at least in SSE2) that improves on scalar code at all; usually it's rather the opposite. I used to use packed doubles for stereo processing (at least where the streams were independent and equal), but at some point I found that replacing all that stuff with a "fake SIMD" class (two scalar paths interleaved) actually ended up a bit faster, and the results hold across all my test systems, with different CPU generations and vendors.

I find this weird, hence the question of whether other people are observing the same.
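A rough sketch of what such a "fake SIMD" class could look like (this is my illustration of the idea, not the poster's actual code): two independent scalar paths packed in one struct, so the compiler keeps both values in registers and can schedule the two dependency chains in parallel.

```cpp
#include <cstddef>

// Hypothetical "fake SIMD" pair: two scalar doubles processed with
// plain scalar ops, interleaved lock-step. No SSE2 packed doubles,
// yet the two independent chains can still execute in parallel.
struct Double2 {
    double l, r;
    Double2 operator*(const Double2& o) const { return { l * o.l, r * o.r }; }
    Double2 operator+(const Double2& o) const { return { l + o.l, r + o.r }; }
};

// Example use: a one-pole lowpass over an interleaved stereo stream.
void onePoleStereo(Double2* buf, std::size_t n, double coeff)
{
    Double2 state { 0.0, 0.0 };
    const Double2 a { coeff, coeff };
    const Double2 b { 1.0 - coeff, 1.0 - coeff };
    for (std::size_t i = 0; i < n; ++i) {
        state = state * a + buf[i] * b;  // left and right advance together
        buf[i] = state;
    }
}
```

Whether this beats _mm_mul_pd and friends depends on the CPU and the surrounding code, which is presumably why the measurements in the thread vary so much.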
-
- KVRAF
- 3388 posts since 29 May, 2001 from New York, NY
Urs wrote: float tmp[ numSamples * 4 ];
lkjb wrote: Out of curiosity, at least VS2012 doesn't allow to dynamically set the size of a C-array. Do you set numSamples to a value you don't expect to be exceeded, or am I missing something?
Urs wrote: We just allocate enough. Otherwise there's always calloc() which does the same thing.

But you wouldn't call calloc() in a DSP processing block, would you?
- KVRAF
- 7964 posts since 12 Feb, 2006 from Helsinki, Finland
Big Tick wrote: But you wouldn't call calloc() in a DSP processing block, would you?

A couple of solutions:
1. Allocate on the stack, with malloc()/calloc() as a backup. Sure, it might cause drop-outs, but at least things will still work, and if the block is long (which is the case if it didn't fit in your normal buffer), the probability of drop-outs goes down, since we're obviously not pushing for low latency then.
2. Use alloca(). Be careful though: you still need to check against a maximum size so you don't run out of stack space (which is typically quite limited for threads other than the main one).
3. Wrap your actual processing method in a wrapper function (which you might want to do for a dozen other reasons as well), and have the wrapper check the block size and split it into smaller parts if needed. The rest of the code can then pretend the maximum block is never larger than whatever threshold you set (at 1024 samples the overhead is already negligible).
I personally like solution 3.
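Solution 3 can be sketched in a few lines (all names here are illustrative; the stand-in processImpl() just applies a gain so the example is self-contained):

```cpp
#include <algorithm>
#include <cstddef>

// Inner processing can assume n <= kMaxBlock, so all its temporary
// buffers can be fixed-size stack arrays.
constexpr std::size_t kMaxBlock = 1024;

// Stand-in for the real DSP; a hypothetical half-gain pass.
void processImpl(float* out, const float* in, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * 0.5f;
}

// Public entry point: splits arbitrarily large host blocks into
// chunks of at most kMaxBlock before calling the real processor.
void process(float* out, const float* in, std::size_t n)
{
    while (n > 0) {
        const std::size_t chunk = std::min(n, kMaxBlock);
        processImpl(out, in, chunk);
        out += chunk;
        in  += chunk;
        n   -= chunk;
    }
}
```

The per-call loop overhead is tiny compared to 1024 samples of actual DSP, which is the point made above.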
-
- KVRist
- 134 posts since 13 Apr, 2016
THIS.

Urs wrote: Block processing. Instead of rendering oscillators, LFOs, envelopes, filters, and what not in one loop, create a method to call for each. Make those methods take not just one sample to process, but a stream of 64 or more. When they process, each module renders into its own temporary buffer when necessary.

I spent the past year developing my plugin in a one-sample-at-a-time fashion, and now that it's getting close to release I decided to do what I could to optimize its performance. After applying the somewhat standard known techniques, and definitely doing some profiling, I took the time to refactor my code to work in blocks instead of a sample at a time, and damn, was it worth it. My plugin's execution time is now 3x faster than it was originally.

Urs, I owe you (at least) another beer. If you're going to NAMM, PM me.
Thanks again.
- u-he
- 28119 posts since 8 Aug, 2002 from Berlin
Big Tick wrote: But you wouldn't call calloc() in a DSP processing block, would you?

Ugh, I meant alloca(), not calloc(), to allocate memory on the stack...
I guess the better solution is to allocate memory for a fixed number of samples, and cut the audio stream into pieces that fit.
-
- KVRist
- 210 posts since 11 Feb, 2006
Interesting thread.
Some of the biggest performance gains I've seen have come from using the Accelerate framework on Mac and, more recently, the (now free) Intel IPP.
Static code analyzers are helpful too, such as the Clang one in Xcode: http://clang-analyzer.llvm.org/xcode.html
- KVRian
- 1341 posts since 15 Nov, 2005 from Italy
-
- KVRist
- 210 posts since 11 Feb, 2006
Yeah, IPP is now free; I think MKL is too. You only pay if you want support services: https://software.intel.com/en-us/articles/free-ipp
-
- KVRer
- 26 posts since 20 Jan, 2017
Is there a problem if I'm using doubles for my VST?
I'm following this guide http://www.martin-finke.de/blog/tags/ma ... ugins.html
Doubles are used there. I'm currently building a 32-bit VST, and when I build the DLL and add it to SAVIHost, the performance percentage always goes over 10%, which in my opinion is a lot. I don't know what to do; any suggestions?
PS: to access my oscillators, I've used this code.
For all the active oscillators I'm trying to do a simple thing: take each sample with the correct envelope generator attached. It works, but it uses a lot of CPU.
Code: Select all

// Mix all oscillators, each scaled by its own volume-envelope
// generator and normalized by the oscillator count.
double output = 0.0;
const double oscillatorDivisions = 1.0 / oscillators.size();
for (size_t i = 0; i < oscillators.size(); i++) {
    output += oscillators[i].nextSample()
            * oscillatorVolumeEnvelopeGenerators[i].nextSample()
            * oscillatorDivisions;
}
// Scale by MIDI velocity (0..127).
output = output * midiVelocity / 127.0;
return output;
-
- KVRAF
- 2256 posts since 29 May, 2012
The whole algorithm needs to be vectorized ( http://sci.tuomastonteri.fi/programming/sse ), which is a lot of work, so unless you're bringing something serious to the market :) it isn't necessary for prototyping or learning.
For example, in the above code, if oscillators.size() == 4, the processor could calculate all four oscillators in parallel (I haven't checked exactly how many, 4 or 8).
P.S. There are probably more recent websites with more up-to-date info about SSE instructions; the one above was just the first one Google found.
~stratum~