Keith99 wrote:
> A common approach is to put all data and the functionality that works on it in a class. It is a nice encapsulated black box with lots of advantages, not least maintenance and extensibility. However, if you work on one piece of data across many instances of the class at a time, that is not very cache friendly. Instead you would be better off keeping data organized in terms of how it is accessed, which differs from common OO designs.

That sounds an awful lot like an object that's far too large. If you need some piece of data from multiple "objects", then just split that data into a separate object. There's a chance it'll make your design easier to maintain too, because when your object size is only as large as you typically need at once, you'll usually end up with far less code for "navigation", and it'll usually be much easier to make sure that any internal consistency requirements are properly maintained.
Secrets to writing fast DSP (VST) code?
- KVRAF
- 7899 posts since 12 Feb, 2006 from Helsinki, Finland
-
- KVRian
- 1000 posts since 1 Dec, 2004
mystran wrote:
> That sounds an awful lot like an object that's far too large. If you need some piece of data from multiple "objects", then just split that data into a separate object.

True, and most of the time this mass of data is variable sized, so it has to be kept in a vector or chunk of memory anyway. :3
-
- KVRian
- 563 posts since 23 Nov, 2010
AFAIK the only issue with regard to OO and cache friendliness is that you can end up jumping around in memory, i.e. an array of objects by value will all be sequential, while the same array by pointer/reference could be scattered all over the heap.
However, you can fix that pretty easily with a custom allocation scheme and placement new. For a synth you could allocate one large chunk of memory for each voice, and placement-new each DSP object into that memory sequentially. If you avoid internal pointers you can make those objects movable, and so insert/remove becomes doable.
That said, I think it's pretty much irrelevant for audio DSP code. We're running over the same data/code 44100 times a second, usually in chunks of 128 samples or thereabouts. Those are very small data sets / code loops as far as the cache is concerned. It's not like a game engine with an enormous data set that gets processed only 50..100 times a second. By the time we hit the second iteration of the process loop, all our data will still be in the cache. In a game engine, 99% of it will be gone. That's when you really need to think about being cache friendly.
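The placement-new scheme described above might look something like this minimal sketch. All names (`VoiceArena`, `Osc`, `Filter`, the 1024-byte chunk size) are illustrative assumptions, not from any real codebase; a real version would also need to run destructors for non-trivial types.

```cpp
#include <cstddef>
#include <new>      // placement new

// Toy DSP objects standing in for real oscillator/filter classes.
struct Osc    { float phase = 0.0f; float process() { phase += 0.01f; return phase; } };
struct Filter { float state = 0.0f; float process(float x) { state += 0.5f * (x - state); return state; } };

// One raw chunk per voice; DSP objects are placement-new'd into it back to
// back, so they end up sequential in memory.
struct VoiceArena {
    alignas(std::max_align_t) unsigned char buf[1024];
    std::size_t used = 0;

    // Bump-allocate within the chunk. No internal pointers are stored, so
    // the whole block stays movable as a unit.
    template <typename T, typename... Args>
    T* create(Args&&... args) {
        // round 'used' up to T's alignment
        std::size_t offset = (used + alignof(T) - 1) / alignof(T) * alignof(T);
        used = offset + sizeof(T);
        return new (buf + offset) T(static_cast<Args&&>(args)...);
    }
};
```

Usage would be along the lines of `VoiceArena arena; Osc* o = arena.create<Osc>(); Filter* f = arena.create<Filter>();`, giving the two objects adjacent addresses inside the same chunk.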
Chris Jones
www.sonigen.com
- KVRAF
- 7899 posts since 12 Feb, 2006 from Helsinki, Finland
sonigen wrote:
> However you can fix that pretty easily with a custom allocation scheme and placement new. For a synth you could allocate one large chunk of memory for each voice, and placement-new each DSP object into that memory sequentially. If you avoid internal pointers you can make those objects movable, and so insert/remove becomes doable.

Or you could fix it by making a class/struct called "Voice" which has all the relevant objects (for a single voice of a given plugin) directly as members. Then you make an array of those (for whatever maximum amount you allow) and put it into the plugin class. Then you automatically know that the lifetime of a "Voice" is the same as the lifetime of the instance of the plugin, and you can safely pass (and even store) references to any sub-object, and you don't even need any smart pointers anymore.
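A sketch of that "everything by value" layout, with made-up component types (`Oscillator`, `SvfFilter`, `Envelope` and the voice count are illustrative):

```cpp
#include <array>

// Toy components standing in for real per-voice DSP objects.
struct Oscillator { float phase = 0; };
struct SvfFilter  { float lp = 0, bp = 0; };
struct Envelope   { float level = 0; };

// Direct members, not pointers: one Voice is one contiguous blob, and its
// lifetime is tied to whatever owns it.
struct Voice {
    Oscillator osc;
    SvfFilter  filter;
    Envelope   env;
};

// The plugin owns a plain array of voices, so all voice data is sequential
// in memory and no heap allocation happens at all.
struct Plugin {
    static constexpr int kMaxVoices = 16;
    std::array<Voice, kMaxVoices> voices;
};
```

Because the array lives inside the plugin object, a reference to any sub-object (say `plugin.voices[3].filter`) stays valid for the plugin's whole lifetime, which is the point made above.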
Most of the time (not always, but more often than one would think), when one needs "arbitrary amounts" of something, one can actually just pick an upper limit instead, and often there's a realistic limit that's small enough that one can just allocate an array. Wrap the whole thing into a (say, template-specialized) wrapper and it'll be easy to increase the limit later, or add overflow handling if necessary (or, if you know it'll always be a fixed limit, just define "maxFoobar" and you can change that later). It can still look like any regular heap container (and even fall back to one when full, if you absolutely cannot have an upper limit), except it's more efficient.
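Such a fixed-capacity wrapper could be sketched like this (name and interface are illustrative; no heap fallback, and trivially-copyable element types are assumed):

```cpp
#include <cstddef>

// Looks like a regular container at the call site, but the storage is a
// plain array inside the object, so there is no heap traffic at all.
template <typename T, std::size_t MaxN>
class FixedVector {
    T items[MaxN];
    std::size_t count = 0;
public:
    // Returns false when full; the caller decides how to handle overflow.
    bool push_back(const T& v) {
        if (count == MaxN) return false;
        items[count++] = v;
        return true;
    }
    std::size_t size() const { return count; }
    T&       operator[](std::size_t i)       { return items[i]; }
    const T& operator[](std::size_t i) const { return items[i]; }
    T* begin() { return items; }
    T* end()   { return items + count; }
};
```

Raising the limit later is just a matter of changing the `MaxN` template argument at the one place the container is declared.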
Last edited by mystran on Sat Jun 21, 2014 12:07 pm, edited 1 time in total.
- u-he
- 28065 posts since 8 Aug, 2002 from Berlin
Urs wrote:
> Prefer stack memory over heap.

hibrasil wrote:
> Could you explain exactly why this is faster? I recently noticed an improvement that may have been down to this, but I'd like to understand it! Thanks.

sonigen wrote:
> To reserve memory from the stack, it's just moving a pointer. To reserve memory from the heap... it's a stdlib subroutine: it has to be thread safe, involves checking freelists, maybe splitting larger blocks, maybe calling the OS, etc. Last time I checked, calling malloc cost at least 300 cycles; setting up a stack frame is maybe 2.

Even if the memory is already malloced elsewhere, we observed that using stack memory for locally used sample buffers is faster. I cannot explain why, but I think it may have something to do with caching. I guess stack memory is more likely to be kept in cache than memory that's allocated on the heap.

Mallocs in a dsp block are of course a no-no.
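The two approaches being compared might look like this sketch (function names, the scratch-buffer use, and `kMaxBlock` are all illustrative). The stack version reserves its buffer by a pointer bump in the function prologue; the heap version pays for a malloc/free round trip on every block, which is exactly the "no-no" above.

```cpp
#include <cstdlib>

constexpr int kMaxBlock = 512;  // assumed upper bound on host block size

// Scratch buffer on the stack: reserved by moving the stack pointer.
float processBlockStack(const float* in, int n) {
    float tmp[kMaxBlock];
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) tmp[i] = in[i] * 0.5f;
    for (int i = 0; i < n; ++i) acc += tmp[i];
    return acc;
}

// Same work, but the scratch buffer comes from malloc: a stdlib call with
// locks, freelists, possibly the OS -- hundreds of cycles per block.
float processBlockHeap(const float* in, int n) {
    float* tmp = static_cast<float*>(std::malloc(n * sizeof(float)));
    float acc = 0.0f;
    for (int i = 0; i < n; ++i) tmp[i] = in[i] * 0.5f;
    for (int i = 0; i < n; ++i) acc += tmp[i];
    std::free(tmp);
    return acc;
}
```

Both compute the same result; only where the temporary lives differs.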
-
- KVRian
- 1379 posts since 26 Apr, 2004 from UK
Stack memory will be at least in L2, as you also have the instructions that need to be in cache (usually the L1 instruction cache, so the stack won't necessarily be in L1 data).

That being said, placement new and custom allocators are not the nicest beasts on earth. Just preload your data and it will be in cache. We are on a modern CPU, not on a SoC without an MMU.
-
- KVRian
- 653 posts since 4 Apr, 2010
Fender19 wrote:
> My plugins, on the other hand - even simple, non-GUI ones - are CPU hogs. Terrible.

I only took a quick skim through the thread, but besides the good points already mentioned: is there any chance that you are hitting denormal problems?
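For context, denormals bite when feedback state (a filter, a delay line) decays toward zero: once values fall into the subnormal float range (below about 1.2e-38), many x86 CPUs handle each operation in microcode, at a huge cycle cost. A common portable fix, sketched here with illustrative function names, is to flush the state to zero below a threshold; on x86 one can alternatively enable the FTZ/DAZ modes via the SSE control register.

```cpp
#include <cmath>

// Hypothetical one-pole decay. Without protection the state eventually
// lands in the subnormal range, where arithmetic gets very slow on many CPUs.
float decay_unprotected(float state, int steps) {
    for (int i = 0; i < steps; ++i) state *= 0.5f;
    return state;
}

// Simple protection: snap the state to zero once it drops below a threshold
// chosen well above the float subnormal range.
float decay_flushed(float state, int steps) {
    for (int i = 0; i < steps; ++i) {
        state *= 0.5f;
        if (std::fabs(state) < 1e-30f) state = 0.0f;
    }
    return state;
}
```

The audible result is identical (anything below 1e-30 is far under the noise floor), but the flushed version never executes a single subnormal operation.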
My audio DSP blog: earlevel.com
-
- KVRian
- 563 posts since 23 Nov, 2010
Urs wrote:
> I cannot explain why, but I think it may have something to do with caching. I guess stack memory is more likely to be kept in cache than memory that's allocated on the heap.

How much faster are you talking about? It might be reduced register contention, since stack memory can be referenced via EBP, whereas heap memory would need an extra register for each block.
Chris Jones
www.sonigen.com
-
- KVRian
- 563 posts since 23 Nov, 2010
Miles1981 wrote:
> Stack memory will be at least in L2, as you have the instructions that need to be in cache as well (usually L1 instruction, so the stack won't necessarily be in L1 data).

You can't use uninitialized memory, so the fact that the stack memory is already in the cache is irrelevant: either way it will be cached as soon as you initialize it. I.e. whether you use heap or stack memory, you have to initialize it before you can use it, and that step puts it in the cache, because writes are cached as well as reads.

Miles1981 wrote:
> That being said, placement new or new allocators are not the nicest beast on earth. Just preload your data, and it will be in cache. We are on a modern CPU, not on a SoC without a MMU.

Which is it: abandon OO, or preload your data? Or maybe just help the MMU do what it's good at, and make your memory usage patterns easy to predict?

Well, they all have their place I suppose, but the latter would be my first choice.
Chris Jones
www.sonigen.com
- KVRAF
- 7899 posts since 12 Feb, 2006 from Helsinki, Finland
sonigen wrote:
> You can't use uninitialized memory, so the fact that the stack memory is already in the cache is irrelevant: either way it will be cached as soon as you initialize it.

Even if you filled complete cache lines and the processor could skip the reads, you could still save some potential TLB misses (the CPU still has to check that there's no page fault), which could result in further cache misses reading page tables, etc.
Beyond caches, though, another potential advantage is that it could make it a lot easier for the compiler to do aliasing analysis, especially if the buffer is allocated in the same function (either directly or through inline expansion, link-time code generation, etc). It could be interesting to look at the generated assembly in cases where a non-negligible difference is observed, and see whether there are differences beyond just a different base pointer.
Edit: oh, and since x86 traditionally passes arguments on the stack (and generally spills a lot of registers, etc.), I wouldn't be surprised if there was some special handling for stack accesses built into the processors.
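The aliasing point can be sketched like this (function names are illustrative; `__restrict` is a common compiler extension on MSVC, GCC, and Clang, not standard C++). In the first version the compiler must assume `out` and `in` may overlap, so `in[0]` has to be re-read after every store; in the second, the no-overlap promise lets it hoist the load and vectorize freely. A local stack buffer gives the compiler the same guarantee automatically, since a fresh local array cannot alias a function argument.

```cpp
// Compiler must assume out and in may overlap, so in[0] is reloaded
// after each store to out[i].
void mix_unknown(float* out, const float* in, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = in[i] + in[0];
}

// Promise of no overlap: in[0] can be kept in a register for the whole loop.
void mix_restrict(float* __restrict out, const float* __restrict in, int n) {
    for (int i = 0; i < n; ++i)
        out[i] = in[i] + in[0];
}
```

Both produce identical results for non-overlapping buffers; only the generated code differs.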
-
- Banned
- 194 posts since 30 Aug, 2008
Yeah but if you cache the shlangerbanger with knickerbocker you will save .0000001 cycles DSP or not!
-
- KVRist
- 194 posts since 13 Oct, 2012
Urs wrote:
> float tmp[ numSamples * 4 ];

Out of curiosity: at least VS2012 doesn't allow dynamically setting the size of a C array. Do you set numSamples to a value you don't expect to be exceeded, or am I missing something?
-
- KVRian
- 563 posts since 23 Nov, 2010
mystran wrote:
> Even if you filled complete cache lines and the processor could skip the reads, you could still save some potential TLB misses (the CPU still has to check that there's no page fault), which could result in further cache misses reading page tables, etc.

I'm just saying that it's not because the data is already in the cache, because in actual fact it's not. Either heap or stack, you have to initialize it first, which means it'll go through the store-forwarding buffer to the cache either way. I.e. as soon as you do a write, any subsequent read from the same address will be serviced by the SFB or the cache.

I agree there could be other reasons it's faster; I'd bet it's just easier to optimize: less register contention, or some of the things you suggest.
Chris Jones
www.sonigen.com
- u-he
- 28065 posts since 8 Aug, 2002 from Berlin
lkjb wrote:
> Out of curiosity: at least VS2012 doesn't allow dynamically setting the size of a C array. Do you set numSamples to a value you don't expect to be exceeded, or am I missing something?

We just allocate enough. Otherwise there's always alloca(), which does the same thing.
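The "allocate enough" approach might look like this sketch (the function name and `kMaxBlock` limit are illustrative assumptions). Since the size is a compile-time constant, it compiles fine on MSVC, which has no C99-style VLAs; for a truly dynamic size on the stack, the alternative mentioned above would be `alloca()` (`_alloca` on MSVC), used with care since it cannot fail gracefully.

```cpp
#include <cassert>

constexpr int kMaxBlock = 4096;  // assumed upper limit on host block size

void scaleBlock(const float* in, float* out, int numSamples) {
    assert(numSamples <= kMaxBlock);   // guard the fixed upper bound
    float tmp[kMaxBlock];              // compile-time size: fine on MSVC, lives on the stack
    for (int i = 0; i < numSamples; ++i) tmp[i] = in[i] * 0.5f;
    for (int i = 0; i < numSamples; ++i) out[i] = tmp[i] * 2.0f;
}
```

The buffer costs nothing to "allocate" beyond the stack-pointer bump, at the price of committing to the `kMaxBlock` upper limit.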