Optimize plugin code for balanced load or least load?

DSP, Plugin and Host development discussion.

Post

karrikuh wrote: Tue Nov 19, 2019 6:31 am
2DaT wrote: Mon Nov 18, 2019 9:46 pm 8. Avoid std library when possible. Prefer arrays to std::vector.
Why? I'm using std::vector all over the place, and to me it leads to much clearer and safer code than taking care of deleting memory myself. Also, I didn't observe any performance hit compared to a manually allocated array.
It depends on what you do with your vectors, but if you're just using them to allocate dynamic memory (ie. you can't use static arrays anyway) for access with operator[] there shouldn't really be any impact whatsoever, unless you have some debug features enabled.
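For instance, something like this (a rough sketch; the buffer name and the prepare()/process() signatures are made up): allocate once up front, then only index into the vector on the audio path.

Code: Select all

#include <cstddef>
#include <vector>

// Hypothetical scratch buffer whose size depends on the host, so a static
// array won't do. Allocated once, outside the audio path.
std::vector<float> scratch;

void prepare(int maxBlockSize)
{
    scratch.assign(static_cast<std::size_t>(maxBlockSize), 0.0f); // the only allocation
}

void process(const float* in, float* out, int numSamples)
{
    for (int i = 0; i < numSamples; ++i)
        out[i] = scratch[i] = in[i] * 0.5f; // plain operator[] access, no allocation
}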

Post

What's wrong with std::array? You can get debug-time bounds-checking and all the useful size/length features in a static array with zero runtime overhead.

Using a vector (heap, dynamic) instead of a static array is guaranteed to be way less efficient in a variety of cases. You ought to be using aligned allocation of "whole" objects anyway, including any static arrays. Those are required for SIMD.
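For example (a minimal sketch; the struct name and member sizes are made up): carry the alignment on the type itself, so every static array inside it can be used with aligned SIMD loads.

Code: Select all

#include <array>

// Hypothetical voice state: the whole object is over-aligned so SIMD
// loads/stores on its members never need unaligned access.
struct alignas(32) voice_state
{
    std::array<float, 8> coeffs;
    std::array<float, 8> state;
};

static_assert(alignof(voice_state) == 32, "alignment is carried by the type");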

Some of the most important optimization problems aren't about hand-sharpening a blade down to the bare-metal level. They're about knowing what types of algorithms and data structures to use, and when. Using a dynamic structure like a vector, with its overhead, where a static array would work is nuts.

Many of the problems you see in modern software are due to obsessive application of data structures or patterns where they aren't beneficial. The small overhead adds up when you use thousands of vectors where static arrays would have worked. Or it might be something like storing 64-bit words where 8-bit would do, and suddenly you're cache-smashed by huge multi-MB arrays that could have been 64 KB instead.
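As a rough illustration (made-up sizes): the same number of entries, but a very different cache footprint.

Code: Select all

#include <cstddef>
#include <cstdint>

constexpr std::size_t kEntries = 1u << 16;  // 65536 entries

std::uint64_t wide[kEntries];   // 512 KB: far larger than a typical L1 cache
std::uint8_t  narrow[kEntries]; //  64 KB: 8x fewer cache lines touched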

(cache-smash!)
Free plug-ins for Windows, MacOS and Linux. Xhip Synthesizer v8.0 and Xhip Effects Bundle v6.7.
The coder's credo: We believe our work is neither clever nor difficult; it is done because we thought it would be easy.
Work less; get more done.

Post

Fender19 wrote: Tue Nov 19, 2019 12:34 am I am testing my plugin by stacking 10 instances of it in one track in Reaper. When all 10 plugins are running Reaper reports 1.9% total CPU usage - and each plugin instance shows 0.2%. But when I remove all but one plugin it reports 0.3% usage (looks like 50% more for just one plugin by itself). That doesn't make sense to me - does it to you? Or are these numbers too "low in the weeds" to be meaningful?
Why don't you do proper unit tests for these things?
Run your process(blockOfSamples) 10000 times while logging a performance timer.
You then also have control over what it actually will process, and can do comparisons.

0.2 or 0.3% is not a significant difference as you have seen.
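Something like this, for example (a rough sketch; the Plugin type, its process() signature and the block contents are placeholders):

Code: Select all

#include <chrono>
#include <cstdio>

// Plugin stands in for whatever object wraps your processing code.
template <typename Plugin>
void benchmark(Plugin& plugin, float* block, int numSamples)
{
    constexpr int kIterations = 10000;
    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < kIterations; ++i)
        plugin.process(block, numSamples);   // same input block every time, for repeatability
    const auto t1 = std::chrono::steady_clock::now();
    const double us = std::chrono::duration<double, std::micro>(t1 - t0).count();
    std::printf("average %.3f us per block\n", us / kIterations);
}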
We are the KVR collective. Resistance is futile. You will be assimilated.
My MusicCalc is served over https!!

Post

Those small measurements in real-time tests are usually below the accuracy threshold anyway. There is likely more variation from cache behaviour over time than the effect you're actually measuring... so the 0.3% might really be due to a few small peaks on the first sample of each block, or under certain other conditions, while over the whole block, on a per-sample basis, the cost is much lower.

The question then isn't actually "Who trashed my cache!?", but "What can I do to ensure this data remains as long as possible in a minimal number of consecutive cache lines?"

Code: Select all

struct voice_t { bool active; /* data, data, data... */ };
std::array<voice_t, N> v;
vs.

Code: Select all

struct voice_t { /* data, data, data... */ };
std::array<voice_t, N> v;
std::array<bool, N> v_active;
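The payoff shows up in code that only needs the flags (a sketch continuing the second layout above):

Code: Select all

// Scanning for active voices now reads N contiguous bools (a few cache
// lines) instead of striding through N full voice structs.
std::size_t numActive = 0;
for (std::size_t i = 0; i < N; ++i)
    if (v_active[i])
        ++numActive;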

Performance profiling is a whole universe in itself apart from mere optimization. If you're already up to your neck with optimization, don't jump face-first down that rabbit hole, Alice!
Last edited by aciddose on Tue Nov 19, 2019 8:17 am, edited 1 time in total.
Free plug-ins for Windows, MacOS and Linux. Xhip Synthesizer v8.0 and Xhip Effects Bundle v6.7.
The coder's credo: We believe our work is neither clever nor difficult; it is done because we thought it would be easy.
Work less; get more done.

Post

I'm using std::vector all over the place, and to me it leads to much clearer and safer code than taking care of deleting memory myself
Many of the problems you see in modern software are due to obsessive application of data structures or patterns where they aren't beneficial.
Vector is beneficial when you actually need to delete memory and allocate structures dynamically, in things like a preset list or pieces of GUI. But not for audio buffers, which need to be optimized as much as possible and are probably fixed-size anyway. That is a completely different area of application and design philosophy.
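Roughly this kind of split (a sketch; names and sizes are made up):

Code: Select all

#include <array>
#include <string>
#include <vector>

std::vector<std::string> presetNames;        // edit-time data: grows and shrinks freely
std::array<float, 512>   scratchBlock{};     // audio-path buffer: size fixed at compile time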
Blog ------------- YouTube channel
Tricky-Loops wrote: (...)someone like Armin van Buuren who claims to make a track in half an hour and all his songs sound somewhat boring(...)

Post

Code: Select all

#include <boost/align/aligned_allocator.hpp>
#include <vector>

template <typename Value, std::size_t Alignment = 16>
using AlignedVector = std::vector<Value, boost::alignment::aligned_allocator<Value, Alignment>>;
Vector is perfect for an audio buffer; you just need to ensure alignment for SIMD operations and use it as a float*. At least it will take care of resizing and automatic deletion.
float *pBuffer = &myVector[0];
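In practice it might look like this (a sketch; the member name and the prepare()/process() hooks are made up, AlignedVector as defined above):

Code: Select all

#include <cstddef>

AlignedVector<float> myVector;

void prepare(std::size_t maxBlockSize)
{
    myVector.assign(maxBlockSize, 0.0f);   // allocate once, outside the audio callback
}

void process(int numSamples)
{
    float* pBuffer = myVector.data();      // same as &myVector[0], 16-byte aligned
    for (int i = 0; i < numSamples; ++i)
        pBuffer[i] *= 0.5f;                // use it like a plain float array
}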

Best of both worlds.
Olivier Tristan
Developer - UVI Team
http://www.uvi.net

Post

It's horrible for a short static buffer, of which you need 1000s. For example filter state buffers and bi-delay buffers used in a reverb mesh (or any other type of matrix). You might allocate the whole buffer once in a single block vs. thousands of tiny <1k allocations that trash not just the cache, but make a mess of the heap too.

Why would anyone ever need to re-size a static buffer? It's only allocated once (on init) and never changes, ever.
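One way to get that single-block allocation (a sketch; the names are made up): one pool allocated at init, carved into per-delay spans.

Code: Select all

#include <cstddef>
#include <vector>

struct delay_span { float* data; std::size_t length; };

std::vector<float>      pool;     // one allocation, done once at init
std::vector<delay_span> delays;   // lightweight views into the pool

void init_delays(const std::vector<std::size_t>& lengths)
{
    std::size_t total = 0;
    for (std::size_t n : lengths) total += n;
    pool.assign(total, 0.0f);                // single contiguous block

    delays.clear();
    std::size_t offset = 0;
    for (std::size_t n : lengths) {
        delays.push_back({ pool.data() + offset, n });
        offset += n;
    }
}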
Free plug-ins for Windows, MacOS and Linux. Xhip Synthesizer v8.0 and Xhip Effects Bundle v6.7.
The coder's credo: We believe our work is neither clever nor difficult; it is done because we thought it would be easy.
Work less; get more done.

Post

syntonica wrote: Mon Nov 18, 2019 10:52 pm I've never had that much luck on the Mac with the fast-math. Never does a thing for me. It does seem to do some good with gcc. I don't recall if I even used it on MSVC, but I was busy learning the Windows Way of things.
IIRC on clang the fast math option can be overly aggressive, which sometimes can cause trouble. However there are more granular options of controlling various aspects of fast math. Don't remember details though, it was just some quick experiment.

Post

Z1202 wrote: Tue Nov 19, 2019 10:17 am
syntonica wrote: Mon Nov 18, 2019 10:52 pm I've never had that much luck on the Mac with the fast-math. Never does a thing for me. It does seem to do some good with gcc. I don't recall if I even used it on MSVC, but I was busy learning the Windows Way of things.
IIRC on clang the fast math option can be overly aggressive, which sometimes can cause trouble. However there are more granular options of controlling various aspects of fast math. Don't remember details though, it was just some quick experiment.
I'm still learning about controlling compilers with a whip and a chair rather than using IDE built-in settings. I never got any problems with the Relax IEEE Compliance setting; I just never got any speed boost.

Now that I look into it, -Ofast on clang adds -fno-signed-zeros -freciprocal-math -ffp-contract=fast -menable-unsafe-fp-math -menable-no-nans -menable-no-infs, so that may be why the extra flag in Xcode never does anything. :lol: However, I'm sure I've tried it just by itself using -O0 and not seen any significant gains. My tests use 5-10 instances of my plugin using different styles of patches so I get a good, overall average of CPU use.
I started on Logic 5 with a PowerBook G4 550Mhz. I now have a MacBook Air M1 and it's ~165x faster! So, why is my music not proportionally better? :(

Post

aciddose wrote: Tue Nov 19, 2019 8:44 am It's horrible for a short static buffer, of which you need 1000s. For example filter state buffers and bi-delay buffers used in a reverb mesh (or any other type of matrix). You might allocate the whole buffer once in a single block vs. thousands of tiny <1k allocations that trash not just the cache, but make a mess of the heap too.

Why would anyone ever need to re-size a static buffer? It's only allocated once (on init) and never changes, ever.
To add, dynamic buffers are fine if you only occasionally need to change the size in large chunks. However, when you are sweeping that size by samples in live audio, that's a ton of overhead you don't need. I'll never understand that slavish "if it's there, you must use it" mentality, rather than looking at what's best for your use case.
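One way to avoid per-sample resizing (a sketch; the names are made up): allocate the maximum once and sweep only the length you actually use.

Code: Select all

#include <cstddef>
#include <vector>

std::vector<float> delayBuf;        // sized once for the worst case
std::size_t        usedLength = 0;

void prepare(std::size_t maxDelaySamples)
{
    delayBuf.assign(maxDelaySamples, 0.0f);   // the only allocation
}

void setDelay(std::size_t samples)
{
    usedLength = samples;                     // sweeping the delay never allocates
}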
I started on Logic 5 with a PowerBook G4 550Mhz. I now have a MacBook Air M1 and it's ~165x faster! So, why is my music not proportionally better? :(

Post

aciddose wrote: Tue Nov 19, 2019 7:43 am What's wrong with std::array ? You can get debug-time bounds-checking and all the useful size/length features in a static array with zero runtime overhead. <snip>
There's absolutely nothing wrong with std::array, of course. But it has different semantics than std::vector (compile-time sized vs. dynamic size only known at run-time). The decision of which to use should be primarily based on whether the size is known statically or not.

Practical examples:
For the buffer of a delay line I would use a std::vector, because its size is chosen as a multiple of the current host samplerate (run-time).
For intermediate sample blocks inside a modular synth I use std::arrays, because the blocksize is kept fixed (e.g. 16 samples) and can, for example, be optimized to fit in a cache line.
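A sketch of both cases (names and sizes are made up):

Code: Select all

#include <array>
#include <cstddef>
#include <vector>

// run-time size: depends on the host sample rate, so allocated dynamically (once)
std::vector<float> delayLine;
void setSampleRate(double fs) { delayLine.assign(static_cast<std::size_t>(fs * 2.0), 0.0f); }

// compile-time size: fixed internal block, small enough to sit in a single cache line
constexpr std::size_t kBlockSize = 16;
std::array<float, kBlockSize> block{};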

It's possible I misinterpreted 2DaT's comment; I thought he was recommending to generally prefer plain dynamic arrays (using new[] / delete[]) over std::vector, to avoid any overhead he assumed std::vector would add. The term "array" is obviously a bit fuzzy...

Post

BertKoor wrote: Tue Nov 19, 2019 7:52 am
Fender19 wrote: Tue Nov 19, 2019 12:34 am I am testing my plugin by stacking 10 instances of it in one track in Reaper. When all 10 plugins are running Reaper reports 1.9% total CPU usage - and each plugin instance shows 0.2%. But when I remove all but one plugin it reports 0.3% usage (looks like 50% more for just one plugin by itself). That doesn't make sense to me - does it to you? Or are these numbers too "low in the weeds" to be meaningful?
Why don't you do proper unit tests for these things?
Run your process(blockOfSamples) 10000 times while logging a performance timer.
You then also have control over what it actually will process, and can do comparisons.

0.2 or 0.3% is not a significant difference as you have seen.
Yes, for accurate testing of blocks of code I need to do what you suggest - and I will do that to develop my own reference list of "dos and don'ts".

However, by running the "optimized" plugin in a DAW I am seeing how it actually behaves in real-world use with dynamic signals (looped for repeatability and comparisons). That is what a customer sees, and if what I'm doing makes no difference there, then there really isn't much point in spending a lot of time on it.

Post

mystran wrote: Tue Nov 19, 2019 6:49 am It depends on what you do with your vectors, but if you're just using them to allocate dynamic memory (ie. you can't use static arrays anyway) for access with operator[] there shouldn't really be any impact whatsoever, unless you have some debug features enabled.
I know it is not memory efficient, but I declare and set up my delay line (for latency compensation) in the constructor as a static array of the maximum required length. I then account for different sample rates by simply setting the delay length "modulus" point accordingly, i.e. DelayLength at 44.1 kHz is 100, at 88.2 kHz it's 200, etc.

I access the elements of that array using:

Code: Select all

	  
leftdelayed  = DelayArray[0][delayPtr];    // rotating buffer - retrieve saved value
rightdelayed = DelayArray[1][delayPtr];
DelayArray[0][delayPtr] = leftin;          // rotating buffer - store new value
DelayArray[1][delayPtr] = rightin;

delayPtr += 1;
delayPtr %= DelayLength;
Is this a good way to do this - or is it SLOW?

Post

Just my 2p's worth... I've always avoided % as it internally uses a divide (yeah, it may not be relevant these days). I use something simple like this, which also handles increments larger than 1:

Code: Select all

delayPtr += 1;
if (delayPtr >= DelayLength)
     delayPtr -= DelayLength;
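Another option, if you can afford to round the buffer size up to a power of two, is to wrap with a bitwise AND; the delay amount then comes from the read offset rather than from the buffer length (a sketch with made-up names):

Code: Select all

#include <cstddef>

constexpr std::size_t kBufSize = 256;                     // power of two >= longest delay
static_assert((kBufSize & (kBufSize - 1)) == 0, "power of two");

float       buf[2][kBufSize] = {};
std::size_t writePtr = 0;

// per sample:
// const std::size_t readPtr = (writePtr - DelayLength) & (kBufSize - 1);
// leftdelayed  = buf[0][readPtr];
// rightdelayed = buf[1][readPtr];
// buf[0][writePtr] = leftin;
// buf[1][writePtr] = rightin;
// writePtr = (writePtr + 1) & (kBufSize - 1);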

Post

aciddose wrote: Tue Nov 19, 2019 8:44 am It's horrible for a short static buffer, of which you need 1000s. For example filter state buffers and bi-delay buffers used in a reverb mesh (or any other type of matrix). You might allocate the whole buffer once in a single block vs. thousands of tiny <1k allocations that trash not just the cache, but make a mess of the heap too.
True, in specific cases. Hence I would not use this technique by default, for the same reason you don't code your whole plugin in assembly but only when it's necessary / makes sense.
Olivier Tristan
Developer - UVI Team
http://www.uvi.net
