Any downsides to allocate array on Heap instead of Stack?

DSP, Plugin and Host development discussion.

Post

Hi,

I'm trying to build a sort of delay plugin that will trace incoming audio over a fixed window.
In my case, the window is at most 4 bars (16 beats).
In the worst case, with a sample rate of 192000 and 10 BPM, I get:

Code: Select all

const int sampleRate = 192000;
const int bpm = 10;
const int beatsPerBar = 4;
const int bar = 4;

const int size = static_cast<int>(sampleRate * (60.0 / bpm) * beatsPerBar * bar);
float megaBytes = (size * sizeof(float)) / 1e+6;
Around ~74 megabytes that need to be allocated in memory.
Of course that can't go on the stack, so it goes on the heap:

Code: Select all

float *buffer = new float[size];
...
delete buffer;
The question is: would this impact performance (SIMD, operations, latency, etc.) or the DSP in general?
I've always worked on the stack; this is the first time I'll approach the heap for processing blocks.

So, before designing everything, I think it's better to ask the experts :)

Thanks

Post

1) If you allocate with new float[size], you *must* deallocate with delete[], not delete. (delete might happen to work with your particular system, compiler and runtime, but it's still wrong and can easily fail on some other system, compiler or runtime.)
2) Don't do the allocations (new) and deallocations (delete) in the audio thread. (Impossible to tell from your pseudocode if you are doing that, but just pointing out that as a potential problem.)
3) Consider using std::vector (or equivalent) instead of raw pointers to floats. If you need to be resizing your buffers, you will eventually get the new/delete[] calls wrong. (Point 2 would still hold, though, no manipulation of the vector's size/capacity in the audio thread.)
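As a hedged sketch of these three points (class and method names are made up for illustration, not taken from the poster's code): allocate once in the constructor via std::vector, and keep the audio-thread path free of new/delete and resizing:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical delay-line skeleton: all allocation happens in the
// constructor; the audio callback only reads and writes existing memory.
class DelayLine {
public:
    explicit DelayLine(int maxSamples)
        : buffer(static_cast<std::size_t>(maxSamples), 0.0f) {} // single allocation, outside the audio thread

    // Audio-thread path: no new/delete, no resize, just reads and writes.
    float process(float in) {
        float out = buffer[writePos];
        buffer[writePos] = in;
        writePos = (writePos + 1) % buffer.size();
        return out;
    }

private:
    std::vector<float> buffer;  // freed automatically; no delete[] to get wrong
    std::size_t writePos = 0;
};
```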

Post

Xenakios wrote: Wed Jun 12, 2019 10:03 am 1) If you allocate with new float[size], you *must* deallocate with delete[], not delete. (delete might happen to work with your particular system, compiler and runtime, but it's still wrong and can easily fail on some other system, compiler or runtime.)
2) Don't do the allocations (new) and deallocations (delete) in the audio thread. (Impossible to tell from your pseudocode if you are doing that, but just pointing out that as a potential problem.)
3) Consider using std::vector (or equivalent) instead of raw pointers to floats. If you need to be resizing your buffers, you will eventually get the new/delete[] calls wrong. (Point 2 would still hold, though, no manipulation of the vector's size/capacity in the audio thread.)
1) True! A typo from writing this topic (in fact I typed it by hand, without copying and pasting real code :P)
2) Yeah, of course: I only allocate in the ctor and delete in the dtor.
3) This is a long-standing question I've never found an answer to: is vector really faster than raw pointers? :)

Anyway, is accessing a value on the heap as fast as accessing one on the stack?
Also: does a generic SSE2 load take the same amount of time?

Post

Stack is sometimes faster since you can in some situations avoid cache misses, but for big chunks of memory like that, the heap is the only option anyhow.
A vector can be as fast as raw pointers, but not faster. Vectors store their memory in one big chunk, just like memory allocated and pointed to by a raw pointer, so you can see them as raw pointers with some smart stuff around them.
David Guda gudaaudio.com

Post

Nowhk wrote: Wed Jun 12, 2019 10:29 am 3) This is a long-standing question I've never found an answer to: is vector really faster than raw pointers? :)

Anyway, is accessing a value on the heap as fast as accessing one on the stack?
3) Vectors or similar abstractions would be used for correctness, not for speed. In any case, with optimized release builds, they should compile to the same machine code you would get writing manually with raw pointers, new and delete[].

The stack versus heap access speed is an academic issue if you simply *need* large amounts of memory to work with. You don't really have another choice but to use the heap. (If you are writing a stand-alone application, you could maybe use some build flags to get a larger stack, but I don't think that applies to plugins...)

Post

I'll answer the original question (ie. stack vs. heap) first:

Stack has basically two advantages and one disadvantage. The first advantage is that you save a register, because the compiler doesn't need to track an extra pointer (ie. usually it can just offset from the stack pointer instead). The second advantage is that the top of stack is usually likely to stay in cache (and some people have speculated that CPUs even do some special magic to make this happen, although this might be urban legend as well), so you might save a bunch of cache misses.

The disadvantage is that the amount of stack space available is rather unpredictable and it might at times be less than what you expect. For example, on Windows the default stack size for threads created by CreateThread is usually 1MB! That's plenty even for most recursive algorithms with a couple of local variables, but not a whole lot if you start allocating a bunch of large buffers. What's worse, the errors you end up with when you exhaust the stack space are sometimes quite annoying to track down. The key here is that unless you created the thread yourself, you really don't know.

As a rule of thumb, I'd usually allocate from the heap anything that measures in the dozens of kilobytes or more. At that point the potential cache advantages of stack are unlikely to be relevant anymore and you save yourself the trouble of having to think about the implications of potentially finite stack size.

[edit: also NEVER use anything like alloca() that lets you do dynamic allocation from the stack; that's just asking for trouble and you'll end up regretting it sooner or later]
Nowhk wrote: Wed Jun 12, 2019 10:29 am 3) This is a long-standing question I've never found an answer to: is vector really faster than raw pointers? :)
The answer is no, but it can be as fast as raw pointers.

Basically, unless you have additional debug checks enabled in your particular STL implementation (which some of them provide), then operator[] should optimise to a regular pointer access. You can also get the raw pointer with data() in case you need to pass the array to some function that expects a raw pointer, but really there's no reason to avoid operator[] unless you really just need the pointer.

There's also the at() method, which is like operator[] but throws on out of bounds access. This one is obviously slightly slower, unless the compiler manages to optimise the bounds checking away.
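For illustration (not from the thread), a tiny sketch of that at() behaviour, with a helper name of my own choosing:

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Returns true if vector::at(i) throws for this index: at() does a bounds
// check and throws std::out_of_range, while operator[] would not check.
bool throwsOutOfRange(const std::vector<float>& v, std::size_t i) {
    try {
        (void)v.at(i);
        return false;
    } catch (const std::out_of_range&) {
        return true;
    }
}
```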

The cool thing about vectors though is that even the methods that change the size are real-time safe as long as you reserve() an upper bound in advance, because a vector never reallocates storage unless it grows larger than the reserved size or you explicitly call shrink_to_fit(). This means you can also use vectors in situations where you might need a variable-sized buffer: just call resize() (or even push_back/pop_back) as required and it won't ever reallocate unless your reserve() was too conservative (in which case you risk a glitch from reallocation, which is usually a better "fail-safe" than randomly trashing memory).
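A minimal sketch of that reserve() pattern (the function name is illustrative): after reserving an upper bound, resize() and push_back() within that bound never move the storage, which you can observe because data() keeps returning the same pointer:

```cpp
#include <cstddef>
#include <vector>

// Create a scratch buffer whose one and only allocation is the reserve()
// call; do this outside the audio thread (e.g. in a constructor).
std::vector<float> makeScratch(std::size_t upperBound) {
    std::vector<float> v;
    v.reserve(upperBound); // the only allocation
    return v;
}
```

Afterwards, resize()/push_back() on the audio thread are allocation-free as long as the size stays within the reserved capacity.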

Post

With the dataset being so large relative to the page size, there will not be as many misses versus hits when it is accessed. If there were only ~20 kB to access, the page misses would be relatively more significant.

Post

edit: also NEVER use anything like alloca() that lets you do dynamic allocation from the stack; that's just asking for trouble and you'll end up regretting it sooner or later
Can you elaborate on that? I am using alloca pretty successfully for scratch buffers that have a limited lifetime in the inner processing algorithm.

Post

Chrisboy2000 wrote: Fri Jun 14, 2019 11:39 pm
edit: also NEVER use anything like alloca() that lets you do dynamic allocation from the stack; that's just asking for trouble and you'll end up regretting it sooner or later
Can you elaborate on that? I am using alloca pretty successfully for scratch buffers that have a limited lifetime in the inner processing algorithm.
First and most importantly, unless you know the upper bound, there's no way to tell whether or not it'll actually fit into the stack, and if you do know the upper bound, then you could just as well use a statically sized array. What's worse, unlike bad heap allocs, you don't usually even get any sort of error or exception, unless you hit a page fault on a guard page, which often might not happen at the call site.

Second, it can sometimes interfere with some compiler optimizations. The most obvious example is that on register-starved architectures like x86 you can often benefit from avoiding frame pointers which is essentially impossible with variably sized stack frames, but it might also interfere with some loop optimizations, mem2reg tricks, etc.

In general, from the compiler's point of view alloca() is largely an "annoying special case" that's not even part of any standard, but has to be supported for legacy Unix compatibility reasons. It might even disable some optimization passes completely "just to be safe" either in current or future versions of any given compiler... and since it's generally discouraged anyway (eg. it's a pretty good source of security issues) there's not a whole lot of incentive for compiler writers to really care very much.

Post

>is vector really faster than raw pointers

If accessing data with at least -O1, it's exactly the same assembly. See for yourself https://godbolt.org/z/lvDzcB. Note that some vector operations can be slow, like resizing or appending, which might involve an allocation with unbounded runtime, so avoid doing this in the audio thread.

>is accessing value to Heap as fast as access to Stack?

Usually the same if you're accessing the array frequently. If infrequently, the stack is more likely to already be in L1-3 cache. You can't say for sure unless you benchmark your exact application.

>if you do know the upper bound, then you could just as well use a statically sized array

Usually this is the best method.
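A minimal sketch of that method (the names and the kMaxLen bound are illustrative assumptions): a fixed-size local array with an explicit length check instead of alloca():

```cpp
#include <cstddef>

// Documented upper bound on the lengths this function accepts.
constexpr std::size_t kMaxLen = 256;

// Scales x[0..len) by gain through a scratch copy. Stack usage is fixed
// and known at compile time; lengths over the bound are rejected rather
// than silently overflowing the stack as alloca() might.
bool scaleInPlace(float* x, std::size_t len, float gain) {
    if (len > kMaxLen)
        return false;        // caller exceeded the documented bound
    float tmp[kMaxLen];      // fixed-size scratch, no alloca()
    for (std::size_t i = 0; i < len; ++i)
        tmp[i] = x[i] * gain;
    for (std::size_t i = 0; i < len; ++i)
        x[i] = tmp[i];
    return true;
}
```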

The only reason you might need alloca() or VLAs (standard in C99, optional since C11; nonstandard in C++ but most compilers support them as an extension) is if
- you have a reason to minimize your function's stack size because you're calling other functions inside the function and want to decrease the probability of a stack overflow
- or if you want to leave the responsibility of determining the maximum stack size up to the caller.
For example,

Code: Select all

template <typename T, typename F>
void stepEuler(T t, T dt, T x[], int len, F f) {
	T k[len]; // VLA: nonstandard in C++, but widely supported as an extension

	f(t, x, k);
	for (int i = 0; i < len; i++) {
		x[i] += dt * k[i];
	}
}
If `f` requires over half your stack size (unlikely) and you choose to use a fixed array of half your stack size, you'll get a stack overflow even when `len` could have been just 2 or something.
VCV Rack, the Eurorack simulator

Post

vortico wrote: Sat Jun 15, 2019 9:51 am The only reason you might need alloca() or VLAs (standard in C99+, nonstandard in C++ but most compiler versions support it as an extension) is if
vortico wrote: Sat Jun 15, 2019 9:51 am want to decrease the probability of a stack overflow
That's a contradiction.

Anyway, for performance reasons you almost certainly don't want to stack-allocate more than a few kilobytes (< L1 size) for temporary buffers. Do cache-blocking like this:

Code: Select all

template <typename T, typename F>
void stepEuler(T t, T dt, T x[], size_t len, F f)
{
	//Didn't test, probably has errors
	T k[1024];
	size_t rem = len % 1024;
	size_t i = 0;
	f(t, x, k, rem);
	for (; i < rem; i++)
		x[i] += dt * k[i];
	for (; i < len; i += 1024)
	{
		f(t, x + i, k, 1024);
		for (size_t j = 0; j < 1024; j++)
		{
			x[i + j] += dt * k[j];
		}
	}
}
It has some overhead, but it's safe and cache-efficient for large lengths. Also, function f is never called for buffers larger than a block size. Though it may be better to cache-block at a higher level and then give the max-size guarantees to all called functions. (and template the block size).

Code: Select all

template <size_t max_length, typename T, typename F>
void stepEuler(T t, T dt, T x[], int len, F f) {
	//Didn't test, probably has errors
	assert(len <= max_length);
	T k[max_length];
	f<max_length>(t, x, k, len);
	for (int i = 0; i < len; i++) {
		x[i] += dt * k[i];
	}
}

Post

2DaT wrote: Sun Jun 16, 2019 2:37 am It has some overhead, but it's safe and cache-efficient for large lengths. Also, function f is never called for buffers larger than a block size. Though it may be better to cache-block at a higher level and then give the max-size guarantees to all called functions. (and template the block size).
Something else to keep in mind is that if you split right at the external interface level to some compile-time maximum block-size (eg. 512 or 1024, something that's large enough to make the overhead negligible), then any buffers with sizes that only depend on block size can become regular member arrays of the plugin (or module, whatever) class directly, even if you need their contents to be preserved from one block to the next.
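A rough sketch of that splitting idea (all names and the placeholder DSP are illustrative, not anyone's actual code): chop the external call into chunks of at most a compile-time block size, so scratch space can be a plain fixed-size member array:

```cpp
#include <algorithm>
#include <cstddef>

// Compile-time maximum block size; internal buffers are sized against it.
constexpr std::size_t kBlock = 512;

struct Plugin {
    float scratch[kBlock]; // fixed-size member buffer, no heap in the callback

    // Internal processing, guaranteed n <= kBlock.
    void processBlock(float* out, const float* in, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            scratch[i] = in[i] * 0.5f; // placeholder DSP
            out[i] = scratch[i];
        }
    }

    // External interface: splits an arbitrary-length call into kBlock chunks.
    void process(float* out, const float* in, std::size_t n) {
        for (std::size_t i = 0; i < n; i += kBlock)
            processBlock(out + i, in + i, std::min(kBlock, n - i));
    }
};
```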

Post

If you set `len = 1025`, your implementation is incorrect because `f` is incapable of coupling equation 0 with equation 1024. My function solves a system of N ODEs, not N independent ODEs (otherwise why would it support arrays at all?)

>That's a contradiction.

No it's not. If I call `stepEuler` with `len = 2`, then your implementation preserves less of the stack for `f`, while mine preserves 1022*sizeof(T) more bytes for `f` to use. If the maximum stack size consumed by `f` follows some probability distribution over all use cases, then keeping `stepEuler`'s stack smaller will indeed decrease the probability of a stack overflow. Of course, with an 8MiB stack this difference is negligible, but what if you modified your code to support ODEs up to 1M float elements? You could say "at that point why not just malloc() inside `stepEuler`?". Although malloc() will *usually* be faster than processing a 1M-element array, it is not guaranteed to return in a bounded time, which breaks the real-time requirement. You could say "why not just tell the caller to malloc() beforehand and pass the data in a `workingBuffer` argument?" Because it exposes implementation details to the API, so a change to the implementation (such as increasing the required buffer size) would break the API. Therefore alloca()/VLAs are the "best" solution in this case.

Your second implementation is a good one. However, what's the difference between an unknowing programmer using `stepEuler(t, dt, x, 10000000, f)` with my implementation and `stepEuler<10000000>(t, dt, x, 10000000, f)` with your implementation? Both will crash, so it doesn't solve anything.
VCV Rack, the Eurorack simulator

Post

vortico wrote: Sun Jun 16, 2019 10:30 amIf the maximum stack size consumed by `f` follows some probability distribution over all use cases, then keeping `stepEuler`'s stack smaller will indeed decrease the probability of a stack overflow.
Most of us prefer to write code that works with a probability exactly equal to 1.

edit: also, in my humble opinion, the so-called "unknowing programmers" and real-time code are fundamentally incompatible with each other

Post

vortico wrote: Sun Jun 16, 2019 10:30 am
>That's a contradiction.

No it's not. If I call `stepEuler` with `len = 2`, then your implementation preserves less of the stack for `f`, while mine preserves 1022*sizeof(T) more bytes for `f` to use. If the maximum stack size consumed by `f` follows some probability distribution over all use cases, then keeping `stepEuler`'s stack smaller will indeed decrease the probability of a stack overflow. Of course, with an 8MiB stack this difference is negligible, but what if you modified your code to support ODEs up to 1M float elements? You could say "at that point why not just malloc() inside `stepEuler`?". Although malloc() will *usually* be faster than processing a 1M-element array, it is not guaranteed to return in a bounded time, which breaks the real-time requirement. You could say "why not just tell the caller to malloc() beforehand and pass the data in a `workingBuffer` argument?" Because it exposes implementation details to the API, so a change to the implementation (such as increasing the required buffer size) would break the API. Therefore alloca()/VLAs are the "best" solution in this case.
If the 'f' function has a data-dependent stack size, it's already unsafe to begin with.
