Secrets to writing fast DSP (VST) code?

DSP, Plugin and Host development discussion.

Post

Keith99 wrote:A common approach is to put all data and functionality that works on it in a class. It is a nice encapsulated black box with lots of advantages not least maintenance and extensibility. However if you work on one piece of data across many instances of the class at a time that is not very cache friendly. Instead you would be better to keep data held in terms of how it is accessed which differs from common OO designs.
That sounds an awful lot like an object that's far too large. If you need some piece of data from multiple "objects" then just split that data into a separate object. There's a chance it'll make your design easier to maintain too, because when your object size is only as large as what you typically need at once, you'll usually end up with far less "navigation" code, and it'll usually be much easier to make sure that any internal consistency requirements are properly maintained. :)
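To make the point concrete, here's a minimal sketch of splitting the per-sample "hot" fields out of a big voice object so they sit contiguously in memory (all names here are made up for illustration):

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical example: only `phase`/`freq` are touched every sample,
// so they get their own dense array; everything else lives elsewhere.
struct VoiceHot {            // touched in the audio loop
    float phase = 0.0f;
    float freq  = 0.01f;
};
struct VoiceCold {           // touched only on note-on, UI changes, etc.
    float pan = 0.0f, detune = 0.0f;
    char  name[64] = {};
};

constexpr std::size_t kVoices = 32;
VoiceHot  hot[kVoices];      // sequential, cache-friendly in the hot loop
VoiceCold cold[kVoices];     // out of the way of the hot data

float tick_all() {
    float sum = 0.0f;
    for (std::size_t v = 0; v < kVoices; ++v) {
        hot[v].phase += hot[v].freq;   // strictly sequential access
        sum += hot[v].phase;
    }
    return sum;
}
```

The hot loop now walks a tightly packed array instead of striding over pan/detune/name fields it never reads.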

Post

mystran wrote:
Keith99 wrote:A common approach is to put all data and functionality that works on it in a class. It is a nice encapsulated black box with lots of advantages not least maintenance and extensibility. However if you work on one piece of data across many instances of the class at a time that is not very cache friendly. Instead you would be better to keep data held in terms of how it is accessed which differs from common OO designs.
That sounds an awful lot like an object that's far too large. If you need some piece of data from multiple "objects" then just split that data into a separate object. There's a chance it'll make your design easier to maintain too, because when your object size is only as large as what you typically need at once, you'll usually end up with far less "navigation" code, and it'll usually be much easier to make sure that any internal consistency requirements are properly maintained. :)
True, and most of the time this mass of data is variable sized, so it has to be kept in a vector or chunk of memory anyways. :3

Post

AFAIK the only issue with regard to OO and cache friendliness is that you can end up jumping around in memory. I.e. an array of objects by value will all be sequential, while the same array by pointer/reference could be scattered all over the heap.

However you can fix that pretty easily with a custom allocation scheme and placement new. For a synth you could allocate one large chunk of memory for each voice, and placement new each dsp object into that memory sequentially. If you avoid internal pointers you can make those objects movable and so insert / remove becomes doable.
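A minimal sketch of that scheme (this is an illustration of the idea, not sonigen's actual code; the types and sizes are made up):

```cpp
#include <cstddef>
#include <new>

// Toy DSP objects; both trivially destructible, so we can skip
// destructor calls and treat the arena as a plain memory block.
struct Osc    { float phase = 0; float run() { return phase += 0.1f; } };
struct Filter { float z1 = 0;    float run(float x) { return z1 = 0.5f * (x + z1); } };

struct VoiceArena {
    alignas(std::max_align_t) unsigned char buf[256];  // one chunk per voice
    std::size_t used = 0;

    template <typename T, typename... A>
    T* make(A&&... a) {
        // Bump-pointer allocation: align, placement-new, advance.
        std::size_t off = (used + alignof(T) - 1) & ~(alignof(T) - 1);
        T* p = ::new (buf + off) T(static_cast<A&&>(a)...);
        used = off + sizeof(T);
        return p;                   // objects end up back-to-back in memory
    }
};

float demo() {
    VoiceArena arena;
    Osc*    osc = arena.make<Osc>();
    Filter* flt = arena.make<Filter>();   // lands right after the Osc
    return flt->run(osc->run());
}
```

Since the objects hold no internal pointers, the whole buffer could also be moved or copied wholesale, which is what makes insert/remove workable.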

That said, I think it's pretty much irrelevant for audio DSP code. We're running over the same data/code 44100 times a second, usually in chunks of 128 or thereabouts. Those are very small data sets / code loops as far as the cache is concerned. It's not like a game engine with an enormous data set that gets processed only 50..100 times a second. By the time we hit the second iteration of the process loop all our data will still be in the cache. In a game engine 99% of it will be gone. That's when you really need to think about being cache friendly.
Chris Jones
www.sonigen.com

Post

sonigen wrote: However you can fix that pretty easily with a custom allocation scheme and placement new. For a synth you could allocate one large chunk of memory for each voice, and placement new each dsp object into that memory sequentially. If you avoid internal pointers you can make those objects movable and so insert / remove becomes doable.
Or you could fix it by making a class/struct called "Voice" which has all the relevant objects (for a single voice of a given plugin) directly as members. Then you make an array of those (for whatever maximum amount you allow) and put them into the plugin class. And then you automatically know that the lifetime of a "Voice" is the same as the lifetime of the plugin instance, and you can safely pass (and even store) references to any sub-object, and you don't even need any smart pointers anymore.

Most of the time (not always, but really more often than one would think), when one needs "arbitrary amounts" one can actually just pick an upper limit instead, and often there's a realistic limit that's small enough that one can just allocate an array. Wrap the whole thing into a (say, template-specialized) wrapper and it'll be easy to increase the limit later, or add overflow handling or whatever if necessary (or if you know it'll always be a fixed limit, just define "maxFoobar" and change that later). It can still look like any regular heap container (and even fall back to one when full, if you absolutely cannot have an upper limit), except it's more efficient.
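A minimal sketch of such a fixed-capacity wrapper (the names and the overflow behaviour are assumptions for illustration, not mystran's actual code):

```cpp
#include <cstddef>

// A "vector" whose storage lives entirely in-place: no heap traffic at
// all, and the capacity is a template parameter, so raising the limit
// later is a one-character change at the use site.
template <typename T, std::size_t MaxN>
class FixedVector {
    T data_[MaxN];
    std::size_t size_ = 0;
public:
    bool push_back(const T& v) {
        if (size_ == MaxN) return false;   // overflow handling goes here
        data_[size_++] = v;
        return true;
    }
    std::size_t size() const { return size_; }
    T&       operator[](std::size_t i)       { return data_[i]; }
    const T& operator[](std::size_t i) const { return data_[i]; }
};

struct Voice { float phase = 0, freq = 0; };

// E.g. inside the plugin class: all voices allocated up-front, in-place,
// with the same lifetime as the plugin instance itself.
FixedVector<Voice, 16> voices;
```

The interface mimics the std::vector calls the rest of the code would already use, so swapping in an overflow strategy (or a real heap fallback) later doesn't disturb callers.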
Last edited by mystran on Sat Jun 21, 2014 12:07 pm, edited 1 time in total.

Post

sonigen wrote:
hibrasil wrote:
Urs wrote:Prefer stack memory over heap.
could you explain exactly why this is faster? I recently noticed an improvement that may have been down to this, but I'd like to understand it!

thanks
To reserve memory from the stack it's just moving a pointer.

To reserve memory from the heap... it's a stdlib subroutine: it has to be thread safe, and it involves checking freelists, maybe splitting larger blocks, maybe calling the OS, etc...

Last time I checked, calling malloc cost at least 300 cycles; setting up a stack frame is maybe 2.
Even if the memory is already malloced elsewhere, we observed that using stack memory for locally used sample buffers is faster.

I cannot explain why, but I think it may have something to do with caching. I guess stack memory is more likely to be kept in cache than memory that's allocated on the heap.

Mallocs in a dsp block are of course a no-no.
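The pattern being discussed looks roughly like this (a hedged sketch; `kMaxBlock` is an assumed worst-case block size, not part of any plugin API):

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kMaxBlock = 512;   // assumed host-imposed upper limit

float processBlock(const float* in, float* out, std::size_t n) {
    assert(n <= kMaxBlock);
    float tmp[kMaxBlock];        // "allocation" is just a stack-pointer move
    float peak = 0.0f;

    // First stage writes into the scratch buffer...
    for (std::size_t i = 0; i < n; ++i)
        tmp[i] = in[i] * 0.5f;

    // ...second stage reads it back; by now it's hot in cache anyway.
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = tmp[i] + tmp[i];
        if (out[i] > peak) peak = out[i];
    }
    return peak;
}
```

The scratch buffer costs essentially nothing to "allocate" per call, and it disappears when the function returns, so there's no lifetime bookkeeping either.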

Post

Stack memory will be at least in L2, as you have the instructions that need to be in cache as well (usually L1 instruction, so the stack won't necessarily be in L1 data).
That being said, placement new or new allocators are not the nicest beast on earth. Just preload your data, and it will be in cache. We are on a modern CPU, not on a SoC without a MMU.

Post

Fender19 wrote:My plugins, on the other hand - even simple, non-GUI ones - are CPU hogs. Terrible.
I only took a quick skim through the thread—besides some of the good points mentioned, is there any chance that you are hitting denormal (subnormal float) problems?
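For reference, a minimal sketch of one classic workaround (the constant and function names here are made up): feedback paths (filters, delays, reverbs) decay toward zero and end up producing subnormal floats, which many x86 CPUs handle in slow microcode. Adding a tiny constant keeps the state out of the subnormal range; on SSE-capable targets, enabling FTZ/DAZ mode via `_MM_SET_FLUSH_ZERO_MODE` / `_MM_SET_DENORMALS_ZERO_MODE` is the other common fix.

```cpp
// One step of a decaying one-pole feedback path, with an anti-denormal
// offset mixed in. 1e-18 is far below audibility but large enough that
// the state never decays into the subnormal range.
float onePoleDecay(float state) {
    const float kAntiDenormal = 1e-18f;
    return state * 0.5f + kAntiDenormal;
}
```

The DC offset this introduces is tiny and is removed by any DC blocker later in the chain; alternating its sign per sample is a common refinement.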
My audio DSP blog: earlevel.com

Post

Urs wrote: I can not explain why, but I think it may have something to do with caching. I guess stack memory is more likely to be kept in cache than memory that's allocated on the heap.
How much faster are you talking about? It might be reduced register contention, since stack memory can be referenced via EBP, whereas heap memory would need an extra register for each block?
Chris Jones
www.sonigen.com

Post

Miles1981 wrote:Stack memory will be at least in L2, as you have the instructions that need to be in cache as well (usually L1 instruction, so the stack won't necessarily be in L1 data).
You can't use uninitialized memory, so the fact that the stack memory is already in the cache is irrelevant, because either way it will be cached as soon as you initialize it. I.e. whether you use heap or stack memory you have to initialize it before you can use it, and that step puts it in the cache, because writes are cached as well as reads.

That being said, placement new or new allocators are not the nicest beast on earth. Just preload your data, and it will be in cache. We are on a modern CPU, not on a SoC without a MMU.
Which is it: abandon OO, or preload your data? Or maybe just help the MMU do what it's good at, and make your memory usage patterns easy to predict?

Well, they all have their place I suppose, but the latter would be my first choice.
Chris Jones
www.sonigen.com

Post

sonigen wrote:
Miles1981 wrote:Stack memory will be at least in L2, as you have the instructions that need to be in cache as well (usually L1 instruction, so the stack won't necessarily be in L1 data).
You can't use uninitialized memory, so the fact that the stack memory is already in the cache is irrelevant, because either way it will be cached as soon as you initialize it. I.e. whether you use heap or stack memory you have to initialize it before you can use it, and that step puts it in the cache, because writes are cached as well as reads.
Even if you filled complete cache lines and the processor could skip the reads, you could still save some potential TLB misses (the CPU still has to check that there's no page fault), which could result in further cache misses reading page tables, etc.

Beyond caches though, another potential advantage is that it could make it a lot easier for the compiler to do aliasing analysis, especially if the buffer is allocated in the same function (either directly or through inline expansion, link-time code generation, etc). Could be interesting to look at the generated assembly in some cases where a non-negligible difference is observed, and see if there are differences beyond just a different base pointer..

edit: oh .. and since x86 traditionally passes arguments on the stack (and generally spills a lot of registers, etc.), I wouldn't be surprised if there was some special handling built into the processors..
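The aliasing point can be illustrated with a small sketch (made-up names; a sketch of the idea, not anyone's actual plugin code): when the scratch buffer is a local array, the compiler can prove it doesn't alias `in`/`out` and is freer to keep values in registers and vectorize, whereas with a heap pointer handed in from elsewhere it often has to assume the worst.

```cpp
#include <cstddef>

constexpr std::size_t kMax = 256;

void gainLocal(const float* in, float* out, std::size_t n) {
    float tmp[kMax];                 // provably distinct from in/out:
                                     // its address never escapes
    for (std::size_t i = 0; i < n && i < kMax; ++i)
        tmp[i] = in[i] * 0.5f;       // stage 1
    for (std::size_t i = 0; i < n && i < kMax; ++i)
        out[i] = tmp[i];             // stage 2; loads can be freely reordered
}
```

With a heap-allocated `tmp` passed as a parameter, the compiler would have to consider that writes through `out` might clobber `tmp` (or vice versa) unless told otherwise via `__restrict`.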

Post

Yeah but if you cache the shlangerbanger with knickerbocker you will save .0000001 cycles DSP or not! :lol:

Post

Urs wrote:float tmp[ numSamples * 4 ];
Out of curiosity: at least VS2012 doesn't allow the size of a C array to be set dynamically. Do you set numSamples to a value you don't expect to be exceeded, or am I missing something?

Post

C++

You could also use the pointer to the array.

EDIT: that array should probably be declared outside of that function's scope.

Post

mystran wrote: Even if you filled complete cache lines and the could processor skips the reads, you could still save some potential TLB misses (the CPU still has to check that there's no page fault), which could result in further cache misses reading page tables, etc.
I'm just saying that it's not because the data is already in the cache, because in actual fact it's not. Either heap or stack, you have to initialize it first, which means it'll go through the store forwarding buffer to the cache either way. I.e. as soon as you do a write, any subsequent read from the same address will be serviced by the SFB or the cache.

I agree there could be other reasons it's faster; I'd bet it's just easier to optimize, less register contention, or some of the things you suggest.
Chris Jones
www.sonigen.com

Post

lkjb wrote:
Urs wrote:float tmp[ numSamples * 4 ];
Out of curiosity: at least VS2012 doesn't allow the size of a C array to be set dynamically. Do you set numSamples to a value you don't expect to be exceeded, or am I missing something?
We just allocate enough. Otherwise there's always alloca(), which does the same thing.
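To spell out the portability issue in the exchange above: `float tmp[ numSamples * 4 ]` is a C99-style variable-length array, which GCC/Clang accept in C++ as an extension but MSVC does not. A sketch of the "just allocate enough" alternative (`kMaxSamples` is an assumed worst case, not a value from the thread):

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kMaxSamples = 4096;   // assumed upper limit

float sumScaled(const float* in, std::size_t numSamples) {
    assert(numSamples <= kMaxSamples);
    float tmp[kMaxSamples * 4];             // fixed worst-case size on the stack
    for (std::size_t i = 0; i < numSamples; ++i)
        tmp[i] = in[i] * 2.0f;
    float s = 0.0f;
    for (std::size_t i = 0; i < numSamples; ++i)
        s += tmp[i];
    return s;
}
```

The other option mentioned, alloca(), gives a truly dynamically-sized stack buffer but is non-standard (on MSVC it's spelled `_alloca`/`_malloca`) and offers no protection against stack overflow, so the fixed worst-case array plus an assert is the more defensive choice.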
