KVR Audio

joshb · Post by **joshb** » Wed Feb 08, 2017 7:15 pm

Hey gents,

I've got my plugin working fast and efficiently by operating on blocks instead of sample by sample, as Urs suggested. I'm using JUCE and have the Mac version working great. While bringing the Windows version up to speed, I ran into one issue with buffer sizes and compiler differences.

In my processBlock function, I allocate some buffers so I can operate efficiently on the entire buffer:

// obviously simplified...
void MyAudioProcessor::processBlock (AudioSampleBuffer& buffer, MidiBuffer& midiMessages)
{
    const int numSamples = buffer.getNumSamples();
    const float* inBuffer = buffer.getReadPointer(0);
    float* outBuffer = buffer.getWritePointer(0);

    // temp buffers...this is fine in Xcode, but VS doesn't like it
    float aBuffer[numSamples];
    float bBuffer[numSamples];
    float cBuffer[numSamples];

    // process intermediate steps and put in temp buffers
    doSomething1(inBuffer, aBuffer, numSamples);
    doSomething2(inBuffer, bBuffer, numSamples);
    doSomething3(inBuffer, cBuffer, numSamples);

    // use the temp buffers to create final output
    doMore(aBuffer, bBuffer, cBuffer, outBuffer, numSamples);
}

As you can see, I allocate some buffers on the stack inside the function, but Visual Studio 15 doesn't like it.

I was thinking about doing something like this:

    // temp buffers
#if MAC
    float aBuffer[numSamples];
    float bBuffer[numSamples];
    float cBuffer[numSamples];
#else
    #define kMaxBufferSize 2048
    float aBuffer[kMaxBufferSize];
    float bBuffer[kMaxBufferSize];
    float cBuffer[kMaxBufferSize];
#endif

...but I'm not sure if that's safe.

I guess it's possible to do this:

    // temp buffers
    std::vector<float> aBuffer(numSamples);
    std::vector<float> bBuffer(numSamples);
    std::vector<float> cBuffer(numSamples);

...but I don't know if I'll get the same performance out of it.

Thoughts?

Mayae · Post by **Mayae** » Wed Feb 08, 2017 7:38 pm

Your Xcode code works because it uses a C feature called a variable-length array (VLA). This is not part of standard C++ in any version, so it won't be portable. Beyond that, people usually avoid it because it is easy to overflow the stack if the array bound can vary wildly, as the number of samples in a processing block notoriously does (especially if you consider something like offline rendering).

That said, you can get the same effect with the alloca() function. It also reserves stack memory that is automatically deallocated, and it is supported in some form by virtually every compiler.

The vector approach is really bad due to memory allocations. When people say render in blocks, they usually mean some number between 16 and 128. They then subdivide processing calls if incoming buffers happen to be larger.
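
The subdividing idea can be sketched roughly as below. This is only an illustration, not Mayae's actual code: kChunkSize, scaleInto, sumInto and processInChunks are hypothetical names, and the two helpers merely stand in for doSomething1..3 and doMore from the first post. Because the scratch arrays have a fixed compile-time size, no VLA, alloca() or heap allocation is needed, whatever block length the host sends.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical chunk size; the thread suggests something between 16 and 128.
constexpr std::size_t kChunkSize = 64;

// Stand-in for doSomething1..3: writes a scaled copy of the input.
inline void scaleInto(const float* in, float* out, std::size_t n, float gain)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = in[i] * gain;
}

// Stand-in for doMore: combines the three temp buffers into the output.
inline void sumInto(const float* a, const float* b, const float* c,
                    float* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] + b[i] + c[i];
}

// Processes a block of any length with fixed-size stack scratch buffers.
// in and out may point to the same memory (in-place processing).
void processInChunks(const float* in, float* out, std::size_t numSamples)
{
    assert(in != nullptr && out != nullptr);

    std::array<float, kChunkSize> a, b, c; // 3 * 64 floats of stack in total

    for (std::size_t pos = 0; pos < numSamples; pos += kChunkSize)
    {
        // The last chunk may be shorter than kChunkSize.
        const std::size_t n = std::min(kChunkSize, numSamples - pos);

        scaleInto(in + pos, a.data(), n, 1.0f);
        scaleInto(in + pos, b.data(), n, 2.0f);
        scaleInto(in + pos, c.data(), n, 3.0f);
        sumInto(a.data(), b.data(), c.data(), out + pos, n);
    }
}
```

With stateful steps such as filters, the state would have to be carried from one chunk to the next; the stateless helpers here sidestep that.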

JCJR · Post by **JCJR** » Wed Feb 08, 2017 9:56 pm

Nowadays memory is so big and typical audio buffers so small, I pre-allocated at initialization and then re-use any needed buffers on each process call. If some plugin setting or host condition would need bigger buffers, I would set a flag and cause it to discard the old pointers and make new bigger pointers running from the GUI thread. But the initial default size was chosen generous enough to handle most cases and they rarely needed resizing bigger.

Maybe better ways to do it. Just how I did it.

When I did "block-based" processing it wasn't to reduce the update rate. It was to "hopefully" make the processing faster by minimizing memory churning and minimizing function call overhead. It seemed to actually help at the time, though maybe it is too labor-intensive or maybe doesn't offer the same advantages on modern hardware.

I tried to make "tight loop" processes that could rip thru an audio buffer doing one simple thing each. With all needed variables stored in CPU registers. So each process could do its "one simple thing" on an arbitrary-sized audio buffer with no memory access except load and store the audio samples. Using old 32 bit intel asm, I could push environment values and such on function entry, and typically make use of 6 or 7 integer values for pointer address or integer values, and 8 doubles in the FPU stack. Quite a bit of "one simple thing" ASM functions could be written to rip thru an audio buffer with all needed vars in registers like that. Then after the function has processed the buffer, I would pop any register values that needed preserving to keep the OS, language or framework happy.

Function entry would push some registers, load needed values and coeffs, and calculate loop bounds in the audio buffer. Then do the tight loop. Then save off any values that need saving for "next time", clear the FPU registers and pop any registers that need restoring, then exit.

For instance, doing a simple equalizer with N bands-- At initialization create N peaking filter objects and set them to initial default values. The peaking filter object contains the "optimized ASM tight loop" process function, along with filter-specific values, and value setter-getter functions and such.

So when ProcessReplacing, do a loop 0 to (N - 1) calling each filter's "optimized tight loop" on the audio buffer. Whether the audio buffer is 32 samples or 4096 samples, each tight loop rips thru the buffer only reading/writing sample values, all other access in registers. So the biggest overhead is probably reading all samples N times, and writing all samples N times.

On older machines, this seemed a lot faster than loading sample[0], filter sample[0] N times, write sample[0], load sample[1], etc. Maybe nowadays it wouldn't be faster, haven't tested lately.
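
In plain C++ rather than hand-written ASM, the "one pass over the buffer per filter" structure might look something like the sketch below. It assumes a one-pole low-pass as a simple stand-in for the peaking filters described above; OnePole and processSeries are hypothetical names. Whether the compiler actually keeps the locals in registers depends on the compiler and its settings.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A one-pole low-pass stands in for the peaking filters described above.
struct OnePole
{
    float coeff = 0.5f; // smoothing coefficient, 0 < coeff <= 1
    float state = 0.0f; // last output, carried over between blocks

    // Tight loop: coefficient and state are copied into locals for the whole
    // pass, so the loop body only loads and stores audio samples.
    void processBlock(float* samples, std::size_t numSamples)
    {
        assert(samples != nullptr || numSamples == 0);

        const float a = coeff;
        float y = state;
        for (std::size_t i = 0; i < numSamples; ++i)
        {
            y += a * (samples[i] - y);
            samples[i] = y;
        }
        state = y; // save for the next call
    }
};

// One full pass over the buffer per filter, as in the equalizer description.
void processSeries(std::vector<OnePole>& filters, float* samples,
                   std::size_t numSamples)
{
    for (auto& f : filters)
        f.processBlock(samples, numSamples);
}
```

The alternative ordering (run every filter on sample 0, then on sample 1, and so on) produces the same output; only the memory access pattern differs.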

karrikuh · Post by **karrikuh** » Thu Feb 09, 2017 7:11 am

I would recommend using pre-allocated buffers via std::vector<T>::reserve(). VST 2.4, for example, provides effSetBlockSize, which tells the plugin the maximum length of the audio blocks to be processed. Pass this value to reserve(). I don't know JUCE, but I'm sure it has similar functionality to effSetBlockSize.
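
One detail worth keeping in mind with this approach: reserve() only raises the vector's capacity, while size() stays at 0, so indexing into a reserved-but-empty vector is out of bounds. A minimal sketch that uses resize() instead, assuming a hypothetical Scratch helper whose prepare() is called from the set-block-size callback:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical scratch-buffer holder; prepare() would be called from a
// set-block-size style callback, get() from the audio callback.
struct Scratch
{
    std::vector<float> data;

    void prepare(std::size_t maxBlockSize)
    {
        // reserve() alone would leave size() at 0. resize() allocates the
        // storage and creates zero-initialized elements that can be indexed.
        data.resize(maxBlockSize);
    }

    float* get(std::size_t numSamples)
    {
        assert(numSamples <= data.size()); // never grow on the audio thread
        return data.data();
    }
};
```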

nonnaci · Post by **nonnaci** » Thu Feb 09, 2017 2:56 pm

Allocating on the stack instead of the heap could be dangerous if the thread calling processBlock has a limited stack size. I've seen it go as low as 32 KB, of which 2048 * 4 * 3 bytes = 24 KB already takes up three-quarters.
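
The arithmetic can be checked at compile time. The constant names below are made up for illustration, and the sketch assumes a 4-byte float:

```cpp
#include <cstddef>

constexpr std::size_t kMaxBufferSize  = 2048;      // samples per temp buffer
constexpr std::size_t kNumTempBuffers = 3;         // aBuffer, bBuffer, cBuffer
constexpr std::size_t kStackBytes     = 32 * 1024; // smallest stack mentioned above

constexpr std::size_t kScratchBytes =
    kMaxBufferSize * sizeof(float) * kNumTempBuffers;

static_assert(sizeof(float) == 4, "sketch assumes a 4-byte float");
static_assert(kScratchBytes == 24 * 1024, "three buffers take 24 KB");
static_assert(kScratchBytes * 4 == kStackBytes * 3, "that is 3/4 of a 32 KB stack");
```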

I'd suggest pre-allocating outside the real-time processing thread (either in constructor/reset functions) and storing it as a member variable somewhere.

No_Use · Post by **No_Use** » Thu Feb 09, 2017 3:34 pm

Mayae wrote: The vector approach is really bad due to memory allocations.

Could you (or someone) elaborate on this please ?
Why shouldn't I use vectors as buffers, what are the drawbacks ?

nonnaci · Post by **nonnaci** » Thu Feb 09, 2017 3:47 pm

As an aside, declaring C++ vectors inside the process function is bad because, by default, their sizes aren't known at compile time. At run time, each vector's 2048-element chunk gets allocated on the heap (and freed again) every time the function is called, which is very often.

Xenakios · Post by **Xenakios** » Thu Feb 09, 2017 4:23 pm

No_Use wrote:
Mayae wrote: The vector approach is really bad due to memory allocations.
Could you (or someone) elaborate on this please ?
Why shouldn't I use vectors as buffers, what are the drawbacks ?

You can use std::vectors just fine, just don't construct/resize them in the audio processing function. (That is, put them as member variables of your DSP class and initialize them to a proper size somewhere else than in your audio processing function.)

However, if you are already using JUCE, you probably really want to use the juce::AudioBuffer class instead for your buffers, to be consistent with JUCE's coding style. (Unless you have some 3rd party code that deals with interleaved audio buffers instead of the split channels buffers.) Of course the same thing applies here : you should not construct/resize them in the processBlock function.
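
A minimal sketch of the member-variable approach, written without JUCE so it stays self-contained. ScratchProcessor, prepare() and process() are hypothetical names; in a JUCE plugin the sizing would presumably go in prepareToPlay() and the buffers could just as well be juce::AudioBuffer members. The per-sample maths is a placeholder.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a plugin's DSP class (does not use JUCE).
class ScratchProcessor
{
public:
    // Called off the audio thread, before processing starts.
    void prepare(std::size_t maxBlockSize)
    {
        a.assign(maxBlockSize, 0.0f);
        b.assign(maxBlockSize, 0.0f);
        c.assign(maxBlockSize, 0.0f);
    }

    // Called on the audio thread: no construction or resizing here.
    void process(const float* in, float* out, std::size_t numSamples)
    {
        assert(numSamples <= a.size()); // host must not exceed the maximum

        // Placeholder intermediate steps writing into the temp buffers.
        for (std::size_t i = 0; i < numSamples; ++i)
        {
            a[i] = in[i] * 0.5f;
            b[i] = in[i] * 0.25f;
            c[i] = in[i] * 0.25f;
        }

        // Combine the temp buffers into the final output.
        for (std::size_t i = 0; i < numSamples; ++i)
            out[i] = a[i] + b[i] + c[i];
    }

private:
    std::vector<float> a, b, c;
};
```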

No_Use · Post by **No_Use** » Thu Feb 09, 2017 5:24 pm

Xenakios wrote:
No_Use wrote:
Mayae wrote: The vector approach is really bad due to memory allocations.
Could you (or someone) elaborate on this please ?
Why shouldn't I use vectors as buffers, what are the drawbacks ?
You can use std::vectors just fine, just don't construct/resize them in the audio processing function. (That is, put them as member variables of your DSP class and initialize them to a proper size somewhere else than in your audio processing function.)

However, since you are already using JUCE, you probably really want to use the juce::AudioBuffer class instead for your buffers, to be consistent with JUCE's coding style. (Unless you have some 3rd party code that deals with interleaved audio buffers instead of the split channels buffers.) Of course the same thing applies here : you should not construct/resize them in the processBlock function.

Sorry for being unclear, yes, as you replied I actually wanted to ask if using them in the audio processing function as buffers could be a problem.
Got it, thanks.

Xenakios · Post by **Xenakios** » Thu Feb 09, 2017 5:30 pm

No_Use wrote: Sorry for being unclear, yes, as you replied I actually wanted to ask if using them in the audio processing function as buffers could be a problem.
Got it, thanks.

Actually, under Visual Studio with the default debug build settings, std::vector will have some overhead when accessing the vector elements. That should not be a real problem, however, since aiming for performance with debug builds is not a sensible goal anyway.

keithwood · Post by **keithwood** » Thu Feb 09, 2017 9:53 pm

In VST2 the DAW calls setBlockSize to tell you the max size of the buffer you need to allocate.

In VST3 the DAW calls setupProcessing with a max samples per block.

You can allocate your memory directly in these VST handlers as they are guaranteed to be called outside the processing loop.

As others have said, heap allocation is a blocking operation, so you shouldn't do it in your processing handler.

The processing handler will send in less than or equal to the number of samples you were told about in setBlockSize/setupProcessing.

As others have also noted, std::vector is awfully slow in debug mode in MSVC (even in VS 2017). I tend to use std::unique_ptr<T[]> and std::make_unique<T[]>(size). It is a lot leaner in debug.
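
The unique_ptr variant could look roughly like this. RawScratch and its members are hypothetical names, and allocate() is assumed to be called from the setBlockSize/setupProcessing handler rather than from the processing handler:

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical scratch buffer backed by std::unique_ptr<float[]>.
struct RawScratch
{
    std::unique_ptr<float[]> data;
    std::size_t size = 0;

    // Call from the block-size handler, never from the processing handler.
    void allocate(std::size_t maxBlockSize)
    {
        // make_unique<T[]>(n) value-initializes, so every float starts at 0.
        data = std::make_unique<float[]>(maxBlockSize);
        size = maxBlockSize;
    }

    // Returns the raw pointer for use inside the processing handler.
    float* get(std::size_t numSamples)
    {
        assert(numSamples <= size);
        return data.get();
    }
};
```

Unlike std::vector, the array form of unique_ptr does not remember its length, which is why the size is stored alongside it here.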

Miles1981 · Post by **Miles1981** » Thu Feb 09, 2017 10:03 pm

The slow debug is due to a required feature: bounds checking. I think it is far better to be slow and have this in debug, especially if your code crashes!

Mayae · Post by **Mayae** » Thu Feb 09, 2017 10:26 pm

Miles1981 wrote: The slow debug is due to a required feature: bounds checking. I think it is far better to be slow and have this in debug, especially if your code crashes!

I agree, it has saved me a lot of time. Otherwise, you should just debug in profiling or /O0 optimized mode if you don't care for those features.

keithwood · Post by **keithwood** » Fri Feb 10, 2017 9:40 am

I agree bounds checking is good, but if you use std::unique_ptr<T[]> in, say, a buffer helper class, you can macro on/off your range checks so that for the majority of the time you don't have them turned on.

I recommend a single pre-processor definition that enables range checking in any and all code that does index-based access, but only enable it every so often for testing purposes and to rule out out-of-bounds access for any bugs.
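
Such a helper might be sketched as follows. MY_RANGE_CHECKS and CheckedBuffer are hypothetical names chosen for illustration, not anything from keithwood's code; the check aborts on a bad index, but logging or a debugger break would work equally well.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <memory>

// Hypothetical switch: define as 1 (for example on the compiler command
// line) to compile the range checks in; they are off by default.
#ifndef MY_RANGE_CHECKS
 #define MY_RANGE_CHECKS 0
#endif

class CheckedBuffer
{
public:
    explicit CheckedBuffer(std::size_t n)
        : data(std::make_unique<float[]>(n)), count(n)
    {
        assert(n > 0); // an empty scratch buffer is most likely a caller bug
    }

    float& operator[](std::size_t i)
    {
#if MY_RANGE_CHECKS
        if (i >= count)
            std::abort(); // out-of-bounds access caught
#endif
        return data[i];
    }

    std::size_t size() const { return count; }

private:
    std::unique_ptr<float[]> data; // zero-initialized storage
    std::size_t count;             // number of elements in data
};
```

Since the switch is independent of NDEBUG, the checks can be turned on in an otherwise optimized build.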

Miles1981 · Post by **Miles1981** » Fri Feb 10, 2017 9:57 am

Well, in that case, why don't you use the existing facilities inside the STL? https://msdn.microsoft.com/en-us/library/hh697468.aspx

Dealing with variable buffer sizes in VS...