Secrets to writing fast DSP (VST) code?

DSP, Plugin and Host development discussion.

Post

earlevel wrote:... is there any chance that you are hitting denormalization problems?
I add -140dB denormal noise (as a habit) to my inputs but not the sine and cosine functions that are generated for my DFTs/iDFTs. Those values DO cross through zero in many places. Could THAT be a problem? Should I add denormal noise to those signals as well?

Post

Decorrelated noise mixed with the signal will still fall into the denormal range.

Post

Denormal numbers were an issue on AMD cores in the past (because they extended the range of numbers treated as denormal), but that was years ago. Denormals are handled more efficiently now.
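For reference, the usual alternative to mixing in denormal noise on x86/x64 is to set the flush-to-zero (FTZ) and denormals-are-zero (DAZ) bits in the MXCSR register, so SSE math treats denormals as zero instead of taking the slow microcode path. A minimal sketch (the function name is mine; the intrinsics are the standard ones from `pmmintrin.h`):

```cpp
#include <pmmintrin.h>  // SSE3 header; provides the FTZ and DAZ mode macros

// Enable flush-to-zero and denormals-are-zero for the calling thread.
// Affects SSE/SSE2 math only, not x87.
static void disableDenormals()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
```

Since the host or another plugin may change these flags, it's common to set them at the top of each audio callback (or use a scoped guard that restores the previous state on exit).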

And yes, heap and stack data have to be initialized, so they are in the cache at one point, but they can be evicted from it, whereas the instruction block will still be in the cache, meaning that stack data has a higher probability of still being in the cache.
Anyway, I don't actually care where it is located, as long as the access patterns are clear.

Post

Given the regular usage of the stack for calls and stack variables, I'd imagine it's unlikely to be evicted.

Bear locality of reference in mind too. Profile and A/B test the speed of long batches. If your code permits, you can even carry out both in the same run.

Has anyone profiled differences between stack / heap or batch loop / per sample processing?

Post

Miles1981 wrote:And yes, heap and stack data has to be initialized, so they are in the cache at one point, but they can be removed from it, whereas the instruction block will still be in the cache, meaning that stack data has a higher propbability of still being in the cache.
What "instruction block"?

"Still be in the cache" When?
Chris Jones
www.sonigen.com

Post

I think I mixed things up in my head. It's only the PC that is on the stack, not the instructions :ud:

Post

PC? Program counter? That's in a register.

The only thing that automatically gets put on the stack is the return address when you do a CALL instruction.

As you didn't answer my other question I'll explain anyway...

"still in the cache" is irrelevant for stack data. Stack data is not persistent, it is local to the function, every single time the function is called the stack frame has to be created and any stack data will have to be written to before it can be read.

If you have...

void Foo()
{
    float tmp[128];
    // do something with tmp
}

If you call Foo() a second time in succession, what good is it that the stack memory that was used for 'tmp' is still in the cache? The data that was in there last time you called Foo() is completely irrelevant. You still have to write each float before you can use it, because it is by definition a new variable on entry to Foo(). Not to mention that there is no guarantee the stack pointer is in the same place, nor that the actual data is still the same. When you call Foo(), tmp is garbage; it has to be written to before you can read it.

And because you have to write to it before you can use it, it'll be serviced by the SFB / cache whether it's on the heap or the stack.

Yes, it's likely that stack data is still in the cache simply because of its more frequent use. But it's actually irrelevant.
Chris Jones
www.sonigen.com

Post

When calling a subroutine, the PC is pushed on the stack; that's what I meant.
I never said that the stack data is persistent. I just said that stack data has a higher probability of being in the cache, which is one of the reasons it is "faster" than the equivalent on the heap. That's it, I didn't say anything more; I just tried to explain why this was said in the PDF.

Post

I've been writing DSP code in managed environments (namely Flash/Haxe; it's kind of a hobby project for me) and it's like working with one hand tied behind my back; I can't pull most of the low-level tricks used by a C++ plugin, so there are a lot of things that are just slow and always will be. And I can't turn off denormals, so I have to manage them instead.

That said, I'm on my second or third generation synthesizer/sequencer now, depending on how you count, and I have coaxed it into playing 128 good-quality, filtered, modulated sawtooths in real-time. I have several higher-level strategies that carry over:

Architecture. The inner loops are super sensitive, of course, and doing unnecessary loops is cost-prohibitive in this environment. But doing a bunch of processing in one pass destroys the maintainability. What to do? My current trade-off at the top level is to allow one buffer per voice and mix those, but to also aggressively monitor the volume level of each voice and cut them off at an estimated -80 dB peak, so that quiet and unused voices are effectively free, and there is room to group voices and do an effects pass. Volume cutoffs are a huge help for amplitude at release time - a simple linear amplitude fade can leave you with a huge number of values in the denormal range. Most of the output timbre is intended to be made at the oscillator stage, so I allow that loop to be bulkier and contain envelope and LFO processing and an IIR filter.
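The voice-cutoff idea above can be sketched roughly like this (names such as `Voice` and `updateVoiceActivity` are illustrative, not from the actual project):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Snap-off threshold: an estimated -80 dB peak, as described above.
const float kCutoffLin = std::pow(10.0f, -80.0f / 20.0f); // = 1e-4 linear

struct Voice {
    float peak = 0.0f;  // last block's peak estimate
    bool active = true;
};

// After rendering a block, estimate its peak; if the voice has decayed
// below threshold, mark it inactive so later blocks skip it entirely.
void updateVoiceActivity(Voice& v, const float* buf, std::size_t n)
{
    float peak = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        peak = std::max(peak, std::fabs(buf[i]));
    v.peak = peak;  // a real estimator might smooth across blocks
    if (v.peak < kCutoffLin)
        v.active = false;
}
```

The point is that the check runs once per block, not per sample, so the cost of making quiet voices "free" is negligible.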

Be smart about caching and reusing data. I have a single-cycle wavetable oscillator. It calculates an additive waveform for the cycle at runtime. How does it do this performantly with low aliasing? It caches the requested frequency at a fixed power-of-two size, and then drop-samples from that buffer every time afterwards - memory usage grows to the amount needed for that session. If the environment were more memory-constrained this could be more troublesome - but these are modern desktops we're talking about, it's only a few megabytes at most for any one wavetable. For the case of pitch bending, the cache is snapped to 4ths of a semitone and we let the intermediate values alias some more.

(The next level of this would be: for a given note at a given velocity, buffer the entire attack-sustain phase and only compute new envelopes for release. It's confounded by having varying modulations though. The less modulated the data is, the more likely you can make this kind of caching work out. Working with a single cycle, instead of arbitrary PCM, is also a big deal; you have more information, so you can "cheat" more.)
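A rough sketch of that caching scheme - lazy additive tables keyed by pitch snapped to quarter-semitones, read back later by drop-sampling - might look like this. All names are hypothetical, and a 44.1 kHz output rate is assumed:

```cpp
#include <cmath>
#include <map>
#include <vector>

// Lazily built additive sawtooth tables, keyed by MIDI pitch snapped to
// quarter-semitone steps; intermediate pitches reuse the nearest table
// and alias slightly, as described above.
struct WavetableCache {
    std::map<int, std::vector<float>> tables;  // key: pitch * 4, rounded
    static constexpr int kSize = 2048;         // fixed power-of-two length

    const std::vector<float>& tableFor(float midiNote)
    {
        const int key = static_cast<int>(std::lround(midiNote * 4.0f));
        auto it = tables.find(key);
        if (it != tables.end())
            return it->second;  // cache hit: no recomputation

        // Band-limit: sum only the harmonics below Nyquist for this pitch.
        const float freq = 440.0f * std::pow(2.0f, (key / 4.0f - 69.0f) / 12.0f);
        const int harmonics = static_cast<int>(22050.0f / freq);
        const float twoPi = 6.2831853f;
        std::vector<float> t(kSize, 0.0f);
        for (int h = 1; h <= harmonics; ++h)
            for (int i = 0; i < kSize; ++i)
                t[i] += std::sin(twoPi * h * (static_cast<float>(i) / kSize)) / h;
        return tables.emplace(key, std::move(t)).first->second;
    }
};
```

Memory grows with the number of distinct quarter-semitone keys actually requested during the session, which matches the "grows to the amount needed" behavior described above.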

Do more linear interpolations. As with many DSP architectures, I have chunks of samples being passed around. I set an internal notion of framerate and lock parameter changes into that framerate. Then I set up the inner loop so that those parameters are interpolated every sample by extracting a per-sample accumulator value and then doing simple additions within the loop. So I can have some pretty smooth envelopes at a low cost. If I were to get more fancy, I could go to the 2nd order: calculate an acceleration value and accumulate that.
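The per-sample accumulator trick can be sketched like this for a single gain parameter (a minimal hypothetical example, not the actual synth code):

```cpp
#include <cstddef>

// A control-rate parameter change is turned into one per-sample increment,
// so the inner loop does a single add per sample - no divide, no branch,
// no envelope re-evaluation.
void applyGainRamp(float* buf, std::size_t n, float gainStart, float gainEnd)
{
    float gain = gainStart;
    const float step = (gainEnd - gainStart) / static_cast<float>(n);
    for (std::size_t i = 0; i < n; ++i) {
        buf[i] *= gain;
        gain += step;  // simple accumulation within the loop
    }
}
```

The second-order variant mentioned above would add an acceleration accumulator (`step += accel;` inside the loop) for curved ramps at the same per-sample cost.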



My last thought is - early DSP gear usually found a creative way to reframe the problem to get the performance needed. Today we have more ability to fling computational resources at stuff, but there are two benefits from that - one is in opening the possibility of doing really fancy things, and the other is in being able to prototype what you're doing by following the most naive methods and simply profiling your way down to the parts that need optimization. If you're looking for "effectively negligible" resource usage, you still have to bring yourself back to the mindset of someone coding for that old hardware, even if the specific techniques you have are different. Doing "one to throw out" means you can be "deadly accurate" in optimizing the second one.

Post

Inline assembler language is an option, but it only speeds things up linearly, so first you need to get your algorithms right. If you find yourself calling trig functions repeatedly with the same arguments, consider pre-computing tables instead. Do your own memory management instead of C++ native when it makes a difference.
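A minimal sketch of the trig-table suggestion (truncating lookup with no interpolation; a real oscillator would usually interpolate between adjacent entries):

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// One cycle of sine precomputed at construction; per-sample use is then
// a table index instead of a std::sin call.
struct SineTable {
    std::vector<float> t;

    explicit SineTable(int size) : t(size)
    {
        const double twoPi = 6.283185307179586;
        for (int i = 0; i < size; ++i)
            t[i] = static_cast<float>(std::sin(twoPi * i / size));
    }

    // phase in [0, 1); truncating (drop-sample) lookup
    float operator()(float phase) const
    {
        return t[static_cast<std::size_t>(phase * t.size()) % t.size()];
    }
};
```

A power-of-two size lets the modulo become a bitwise AND; it's left as `%` here for clarity.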
If you criticize Spitfire Audio, the mods will lock the thread.

Post

BachRules wrote:Inline assembler language is an option, but it only speeds things up linearly, so first you need to get your algorithms right.
It is usually significantly easier to just figure out how to get the compiler to generate good code for you, than to actually beat good compiler generated code using assembler (inline or not). Modern compilers (basically any production compiler from the last 10 years) are really quite good at optimizing the low-level details, as long as you give them source code that is straight-forward enough.

Basically.. most (and yes there might be rare exceptions) of the time that your assembly ends up faster than your high-level code.. you are either doing something extra in the high level code (that can't be simplified out at compile time, for one reason or another; figure out what it is and get rid of it) or you are using an ancient compiler .. or maybe forgot to turn the optimizer on. Sometimes there are optimizer bugs that you can stumble on though, my all-time favorite probably being the MSVC (08 and 10 at least) 64-bit version failing to do the right thing with the std::complex that ships with the thing (whereas a basic text-book replacement will work just fine).

edit: one thing worth noting is that some compilers traditionally had trouble with the x87 register stack weirdness, probably because it's so far from the usual RISC style register allocation that most of the literature deals with.. but that shouldn't be an issue anymore, since you probably want to ignore the whole x87 for any high performance code anyway.

Post

mystran wrote:
BachRules wrote:Inline assembler language is an option, but it only speeds things up linearly, so first you need to get your algorithms right.
It is usually significantly easier to just figure out how to get the compiler to generate good code for you, than to actually beat good compiler generated code using assembler (inline or not). Modern compilers (basically any production compiler from the last 10 years) are really quite good at optimizing the low-level details, as long as you give them source code that is straight-forward enough.

Basically.. most (and yes there might be rare exceptions) of the time that your assembly ends up faster than your high-level code.. you are either doing something extra in the high level code (that can't be simplified out at compile time, for one reason or another; figure out what it is and get rid of it) or you are using an ancient compiler .. or maybe forgot to turn the optimizer on. Sometimes there are optimizer bugs that you can stumble on though, my all-time favorite probably being the MSVC (08 and 10 at least) 64-bit version failing to do the right thing with the std::complex that ships with the thing (whereas a basic text-book replacement will work just fine).

edit: one thing worth noting is that some compilers traditionally had trouble with the x87 register stack weirdness, probably because it's so far from the usual RISC style register allocation that most of the literature deals with.. but that shouldn't be an issue anymore, since you probably want to ignore the whole x87 for any high performance code anyway.
I haven't done this since 1997, so I won't argue with any of that, and sorry my info was outdated. I don't miss coding assembler either.
If you criticize Spitfire Audio, the mods will lock the thread.

Post

BachRules wrote: I haven't done this since 1997, so I won't argue with any of that, and sorry my info was outdated. I don't miss coding assembler either.
Yeah, the average compiler now does a whole lot better than the average compiler from 1997... but on the other hand, you probably wouldn't want to use a modern compiler if your development system was hardware from 1997, because it'd take ages for the compilation to finish (well, assuming you didn't run out of memory/swap).
Last edited by mystran on Sun Jun 29, 2014 6:49 am, edited 1 time in total.

Post

I got new hardware.
If you criticize Spitfire Audio, the mods will lock the thread.

Post

BachRules wrote:Do your own memory management instead of C++ native when it makes a difference.
Never had to do that. I don't even know what you mean by that. Allocating everything before the processing loop is enough, and usually, you don't even need to reallocate anything the second time the processing loop is called.
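A minimal sketch of what "allocating everything before the processing loop" typically looks like (generic names, not tied to any particular plugin API):

```cpp
#include <cstddef>
#include <vector>

// Scratch memory is sized once in prepare(); process() then runs without
// ever touching the allocator, keeping the audio thread allocation-free.
class Processor {
    std::vector<float> scratch;
public:
    void prepare(std::size_t maxBlockSize)
    {
        scratch.assign(maxBlockSize, 0.0f);  // the only allocation
    }
    void process(float* out, std::size_t n)
    {
        // assumes n <= maxBlockSize passed to prepare()
        for (std::size_t i = 0; i < n; ++i)
            out[i] = scratch[i] * 0.5f;  // placeholder DSP on preallocated memory
    }
};
```

Hosts generally report a maximum block size before playback starts, which is what makes this pattern sufficient for most plugins.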
