Dealing with variable buffer sizes in VS...

DSP, Plugin and Host development discussion.

Post

Hey gents,

I've got my plugin working fast and efficiently by operating on blocks instead of sample by sample, as Urs suggested. I'm using JUCE and have the Mac version working great. While bringing the Windows version up to speed, I ran into one issue with buffer sizes and compiler differences.

In my processBlock function, I allocate some buffers so I can operate efficiently on the entire buffer:

Code: Select all

// obviously simplified...
void MyAudioProcessor::processBlock (AudioSampleBuffer& buffer, MidiBuffer& midiMessages)
{
    const int numSamples = buffer.getNumSamples();
    const float* inBuffer = buffer.getReadPointer(0); // getReadPointer returns a pointer to const
    float* outBuffer = buffer.getWritePointer(0);

    // temp buffers...this is fine in Xcode, but VS doesn't like it
    float aBuffer[numSamples];
    float bBuffer[numSamples];
    float cBuffer[numSamples];

    // process intermediate steps and put in temp buffers
    doSomething1(inBuffer, aBuffer, numSamples);
    doSomething2(inBuffer, bBuffer, numSamples);
    doSomething3(inBuffer, cBuffer, numSamples);

    // use the temp buffers to create final output
    doMore(aBuffer, bBuffer, cBuffer, outBuffer, numSamples);
}
As you can see, I allocate some buffers on the stack inside the function, but Visual Studio 15 doesn't like it.

I was thinking about doing something like this:

Code: Select all

    // temp buffers
#if MAC
    float aBuffer[numSamples];
    float bBuffer[numSamples];
    float cBuffer[numSamples];
#else
    #define kMaxBufferSize 2048
    float aBuffer[kMaxBufferSize];
    float bBuffer[kMaxBufferSize];
    float cBuffer[kMaxBufferSize];
#endif
...but I'm not sure if that's safe.

I guess it's possible to do this:

Code: Select all

    // temp buffers
    std::vector<float> aBuffer(numSamples);
    std::vector<float> bBuffer(numSamples);
    std::vector<float> cBuffer(numSamples);
...but I don't know if I'll get the same performance out of it.

Thoughts?

Post

Your Xcode code works because it uses a C feature called variable-length arrays (VLAs). VLAs are not part of any version of standard C++ (Clang and GCC only accept them as an extension), so the code isn't portable. Beyond that, people usually avoid them because it is easy to overflow the stack if the array bound can vary wildly, as the number of samples per processing block notoriously does (especially if you consider something like offline rendering).

That said, you can get the same effect with the alloca() function. It also reserves stack memory that is released automatically when the function returns, and virtually every compiler supports it in some form (MSVC spells it _alloca()). It carries the same stack overflow risk as a VLA.
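A minimal sketch of the alloca() route, assuming MSVC provides _alloca() through <malloc.h> and glibc/macOS provide alloca() through <alloca.h>. The STACK_ALLOC macro, the kMaxStackSamples cap, and the scaleViaStackTemp function are hypothetical names made up for illustration, not part of JUCE or any plugin API.

```cpp
#include <cassert>
#include <cstddef>

#if defined(_MSC_VER)
  #include <malloc.h>   // _alloca on MSVC
  #define STACK_ALLOC(bytes) _alloca(bytes)
#else
  #include <alloca.h>   // alloca on glibc and macOS (assumed platforms)
  #define STACK_ALLOC(bytes) alloca(bytes)
#endif

// Hypothetical upper bound so an unusually large block cannot exhaust the stack.
constexpr std::size_t kMaxStackSamples = 1024;

// Doubles every input sample, going through a temporary buffer on the stack.
void scaleViaStackTemp(const float* in, float* out, std::size_t numSamples)
{
    assert(numSamples <= kMaxStackSamples);

    // The memory is released automatically when this function returns.
    float* temp = static_cast<float*>(STACK_ALLOC(numSamples * sizeof(float)));

    for (std::size_t i = 0; i < numSamples; ++i)
        temp[i] = in[i] * 2.0f;

    for (std::size_t i = 0; i < numSamples; ++i)
        out[i] = temp[i];
}
```

The cap matters more than the allocation call itself: without it, a host that sends a very large block can still overflow the stack.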

The vector approach is really bad because it allocates memory on every call. When people say to render in blocks, they usually mean some number between 16 and 128 samples. They then subdivide the processing call if the incoming buffer happens to be larger.
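The subdivision idea can be sketched roughly like this. The chunk size of 64 and the names processChunk and processBlockInChunks are assumptions for illustration, and the "DSP" is just a gain of 0.5 standing in for real work. Because the chunk size is a compile-time constant, the temporary buffer can live on the stack in standard C++ without VLAs.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical internal chunk size, within the 16..128 range mentioned above.
constexpr std::size_t kChunkSize = 64;

// Stand-in DSP for one chunk: halves the input via a fixed-size temp buffer.
void processChunk(const float* in, float* out, std::size_t n)
{
    assert(n <= kChunkSize);
    std::array<float, kChunkSize> temp;  // fixed size, so no VLA is needed

    for (std::size_t i = 0; i < n; ++i)
        temp[i] = in[i] * 0.5f;

    std::copy_n(temp.data(), n, out);
}

// Splits an arbitrarily large host block into chunks of at most kChunkSize.
void processBlockInChunks(const float* in, float* out, std::size_t numSamples)
{
    for (std::size_t offset = 0; offset < numSamples; offset += kChunkSize)
    {
        const std::size_t n = std::min(kChunkSize, numSamples - offset);
        processChunk(in + offset, out + offset, n);
    }
}
```

With this shape the stack usage stays constant no matter how large a block the host sends, including during offline rendering.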

Post

Nowadays memory is so big and typical audio buffers so small that I pre-allocated at initialization and then reused the buffers on each process call. If some plugin setting or host condition needed bigger buffers, I would set a flag that caused the plugin to discard the old buffers and allocate new, bigger ones from the GUI thread. But the initial default size was chosen generously enough to handle most cases, and the buffers rarely needed to grow.

Maybe better ways to do it. Just how I did it.

When I did "block-based" processing it wasn't to reduce the update rate. It was to "hopefully" make the processing faster by minimizing memory churning and minimizing function call overhead. It seemed to actually help at the time, though maybe it is too labor-intensive or maybe doesn't offer the same advantages on modern hardware.

I tried to make "tight loop" processes that could rip thru an audio buffer doing one simple thing each. With all needed variables stored in CPU registers. So each process could do its "one simple thing" on an arbitrary-sized audio buffer with no memory access except load and store the audio samples. Using old 32 bit intel asm, I could push environment values and such on function entry, and typically make use of 6 or 7 integer values for pointer address or integer values, and 8 doubles in the FPU stack. Quite a bit of "one simple thing" ASM functions could be written to rip thru an audio buffer with all needed vars in registers like that. Then after the function has processed the buffer, I would pop any register values that needed preserving to keep the OS, language or framework happy.

Function entry would push some registers, load needed values and coeffs, and calculate loop bounds in the audio buffer. Then do the tight loop. Then save off any values that need saving for "next time", clear the FPU registers and pop any registers that need restoring, then exit.

For instance, doing a simple equalizer with N bands-- At initialization create N peaking filter objects and set them to initial default values. The peaking filter object contains the "optimized ASM tight loop" process function, along with filter-specific values, and value setter-getter functions and such.

So when ProcessReplacing, do a loop 0 to (N - 1) calling each filter's "optimized tight loop" on the audio buffer. Whether the audio buffer is 32 samples or 4096 samples, each tight loop rips thru the buffer only reading/writing sample values, all other access in registers. So the biggest overhead is probably reading all samples N times, and writing all samples N times.

On older machines, this seemed a lot faster than loading sample[0], filter sample[0] N times, write sample[0], load sample[1], etc. Maybe nowadays it wouldn't be faster, haven't tested lately.
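In C++ rather than hand-written asm, the "each filter rips thru the whole buffer" structure might look like the sketch below. Band is a hypothetical stand-in (a one-pole smoother, not a real peaking filter), and whether the locals actually end up in registers is up to the compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified stand-in for one EQ band: a one-pole smoother, not a peaking filter.
struct Band
{
    float coeff = 0.5f;  // filter coefficient (hypothetical default)
    float state = 0.0f;  // value carried over to the next block

    // The "tight loop": runs over the whole buffer in place.
    void processBlock(float* buffer, std::size_t numSamples)
    {
        // Copy members to locals so the loop body only touches the samples.
        const float c = coeff;
        float s = state;

        for (std::size_t i = 0; i < numSamples; ++i)
        {
            s += c * (buffer[i] - s);
            buffer[i] = s;
        }

        state = s;  // save for "next time"
    }
};

// Outer loop over the N bands; each band reads and writes every sample once.
void processAllBands(std::vector<Band>& bands, float* buffer, std::size_t numSamples)
{
    assert(buffer != nullptr || numSamples == 0);
    for (Band& band : bands)
        band.processBlock(buffer, numSamples);
}
```

The bands vector would be sized at initialization, so nothing here allocates during processing.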

Post

I would recommend using pre-allocated buffers via std::vector<T>::reserve(). VST 2.4, for example, provides effSetBlockSize, which tells the plugin the maximum length of the audio blocks to be processed. Pass this value to reserve() (or to resize() if you index the elements directly, since reserve() only sets the capacity and leaves size() unchanged). I don't know JUCE, but I'm sure it has similar functionality to effSetBlockSize.
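One way to read the reserve() suggestion, as a hedged sketch: reserve the maximum once, then resize() to the actual block length inside the process call. A resize() that stays within the reserved capacity does not reallocate, although it does value-initialize any newly added elements. ScratchBuffer, setMaxBlockSize and beginBlock are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scratch buffer wrapper.
struct ScratchBuffer
{
    std::vector<float> data;

    // Call from the host's "max block size" notification, outside processing.
    void setMaxBlockSize(std::size_t maxBlockSize)
    {
        data.reserve(maxBlockSize);  // capacity grows, size() stays unchanged
    }

    // Call at the start of each processing block.
    float* beginBlock(std::size_t numSamples)
    {
        assert(numSamples <= data.capacity());
        data.resize(numSamples);     // within capacity, so no reallocation
        return data.data();
    }
};
```

Indexing a vector that has only been reserved, without the resize(), is undefined behavior, which is why the two calls are paired here.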

Post

Allocating on the stack instead of the heap could be dangerous if the thread calling processBlock has a limited stack size. I've seen it go as low as 32 KB, of which three buffers of 2048 floats (2048 * 4 * 3 bytes = 24 KB) would already use three-quarters.

I'd suggest pre-allocating outside the real-time processing thread (in the constructor or a reset function) and storing the buffers as member variables somewhere.
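A rough sketch of the member-variable approach in plain C++. BlockProcessor, prepare and process are hypothetical names, and the gains stand in for real intermediate steps. In a JUCE plugin, prepareToPlay is the customary place for the allocation, though a defensive size check in the process call is still worth keeping.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical processor that owns its temp buffers as members.
class BlockProcessor
{
public:
    // Call outside the audio callback, once the maximum block size is known.
    void prepare(std::size_t maxBlockSize)
    {
        aBuffer.assign(maxBlockSize, 0.0f);
        bBuffer.assign(maxBlockSize, 0.0f);
    }

    // Real-time path: only reads and writes memory that already exists.
    void process(const float* in, float* out, std::size_t numSamples)
    {
        assert(numSamples <= aBuffer.size());

        for (std::size_t i = 0; i < numSamples; ++i)
            aBuffer[i] = in[i] * 0.5f;    // stand-in intermediate step 1

        for (std::size_t i = 0; i < numSamples; ++i)
            bBuffer[i] = in[i] * 0.25f;   // stand-in intermediate step 2

        for (std::size_t i = 0; i < numSamples; ++i)
            out[i] = aBuffer[i] + bBuffer[i];
    }

private:
    std::vector<float> aBuffer;
    std::vector<float> bBuffer;
};
```

Typical usage would be one prepare() call when the host announces its block size, followed by many process() calls that never touch the allocator.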

Post

Mayae wrote: The vector approach is really bad due to memory allocations.
Could you (or someone) elaborate on this please ?
Why shouldn't I use vectors as buffers, what are the drawbacks ?

Post

As an aside, declaring C++ vectors inside the process function is bad because their sizes are only known at run time, so their storage lives on the heap. Each vector's constructor allocates that storage (a 2048-element chunk, say) and its destructor frees it again every time the function gets called, which is very often.

Post

No_Use wrote:
Mayae wrote: The vector approach is really bad due to memory allocations.
Could you (or someone) elaborate on this please ?
Why shouldn't I use vectors as buffers, what are the drawbacks ?
You can use std::vectors just fine, just don't construct/resize them in the audio processing function. (That is, put them as member variables of your DSP class and initialize them to a proper size somewhere else than in your audio processing function.)

However, if you are already using JUCE, you probably really want to use the juce::AudioBuffer class instead for your buffers, to be consistent with JUCE's coding style. (Unless you have some 3rd party code that deals with interleaved audio buffers instead of the split channels buffers.) Of course the same thing applies here : you should not construct/resize them in the processBlock function.
Last edited by Xenakios on Thu Feb 09, 2017 5:26 pm, edited 1 time in total.

Post

Xenakios wrote:
No_Use wrote:
Mayae wrote: The vector approach is really bad due to memory allocations.
Could you (or someone) elaborate on this please ?
Why shouldn't I use vectors as buffers, what are the drawbacks ?
You can use std::vectors just fine, just don't construct/resize them in the audio processing function. (That is, put them as member variables of your DSP class and initialize them to a proper size somewhere else than in your audio processing function.)

However, since you are already using JUCE, you probably really want to use the juce::AudioBuffer class instead for your buffers, to be consistent with JUCE's coding style. (Unless you have some 3rd party code that deals with interleaved audio buffers instead of the split channels buffers.) Of course the same thing applies here : you should not construct/resize them in the processBlock function.
Sorry for being unclear, yes, as you replied I actually wanted to ask if using them in the audio processing function as buffers could be a problem.
Got it, thanks.

Post

No_Use wrote: Sorry for being unclear, yes, as you replied I actually wanted to ask if using them in the audio processing function as buffers could be a problem.
Got it, thanks.
Actually, under Visual Studio with the default debug build settings, std::vector will have some overhead when accessing the vector elements. That should not be a real problem, however, since aiming for performance with debug builds is not a sensible goal anyway.

Post

In VST2 the DAW calls setBlockSize to tell you the max size of the buffer you need to allocate.

In VST3 the DAW calls setupProcessing with a max samples per block.

You can allocate your memory directly in these VST handlers as they are guaranteed to be called outside the processing loop.

As others have said, heap allocation can block (the allocator may take a lock or call into the OS), so you shouldn't do it in your processing handler.

The processing handler will be passed a number of samples less than or equal to the maximum you were told about in setBlockSize/setupProcessing.

As others have also noted, std::vector is awfully slow in debug mode in MSVC (even in VS 2017). I tend to use std::unique_ptr<T[]> and std::make_unique<T[]>(size). It is a lot leaner in debug.
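A minimal sketch of the std::unique_ptr<T[]> variant, assuming C++14 for std::make_unique. RawScratch and its member names are hypothetical. The allocation is meant to happen in the setBlockSize/setupProcessing handler, never in the processing call.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical scratch buffer backed by a plain heap array.
class RawScratch
{
public:
    // Call from the block-size handler, outside the processing loop.
    void allocate(std::size_t maxBlockSize)
    {
        assert(maxBlockSize > 0);
        storage = std::make_unique<float[]>(maxBlockSize);  // zero-initialized
        capacity = maxBlockSize;
    }

    // Raw pointer for the processing loop; no checks, no debug overhead.
    float* get() { return storage.get(); }

    std::size_t size() const { return capacity; }

private:
    std::unique_ptr<float[]> storage;
    std::size_t capacity = 0;
};
```

Unlike std::vector, the array form does not remember its own length, so the size has to be tracked alongside the pointer.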

Post

The slow debug is due to a required feature: bounds checking. I think it is far better to be slow and have this in debug, especially if your code crashes!

Post

Miles1981 wrote: The slow debug is due to a required feature: bounds checking. I think it is far better to be slow and have this in debug, especially if your code crashes!
I agree, it has saved me a lot of time. Otherwise, if you don't care for those features, you can just debug a profiling build or a build with optimizations disabled (/Od in MSVC, -O0 in GCC and Clang).

Post

I agree bounds checking is good, but if you use std::unique_ptr<T[]> in, say, a buffer helper class, you can switch your range checks on and off with a macro so that for the majority of the time you don't have them turned on.

I recommend a single preprocessor definition that enables range checking in any and all code that does index-based access, but only enable it every so often for testing purposes and to rule out out-of-bounds access as the cause of a bug.
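The single-switch idea could be sketched as follows. MYPLUGIN_RANGE_CHECKS and CheckedBuffer are hypothetical names, and throwing std::out_of_range is just one possible reaction to a bad index (a debugger break or a log call would work too). The switch defaults to on here for demonstration, while the suggestion above would leave it off most of the time.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>

// Hypothetical project-wide switch; define it as 0 to compile the checks out.
#ifndef MYPLUGIN_RANGE_CHECKS
  #define MYPLUGIN_RANGE_CHECKS 1
#endif

// Hypothetical buffer helper with optional range checking on index access.
class CheckedBuffer
{
public:
    explicit CheckedBuffer(std::size_t n)
        : storage(std::make_unique<float[]>(n)), count(n)
    {
        assert(n > 0);
    }

    float& operator[](std::size_t i)
    {
#if MYPLUGIN_RANGE_CHECKS
        if (i >= count)
            throw std::out_of_range("CheckedBuffer: index out of range");
#endif
        return storage[i];
    }

    std::size_t size() const { return count; }

private:
    std::unique_ptr<float[]> storage;
    std::size_t count;
};
```

With the macro set to 0, operator[] reduces to a plain array access, so the helper costs nothing in normal builds.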

Post

Well, in that case, why don't you use the existing facilities inside the STL? https://msdn.microsoft.com/en-us/library/hh697468.aspx
