KVR Audio

Fender19 · Post by **Fender19** » Fri Jul 18, 2014 3:23 pm

How much time do we have between "processReplacing()" calls before the next block arrives? Does it depend on how many other plugins are running or is there a set period regardless?

And is the host waiting for that processing to be finished before it moves on - or does it send the data to the plugin and come back later when the next block is transferred?

I have a plugin that uses an FFT/iFFT and I'm trying to figure out why it is taxing the CPU so hard (one plugin instance using >10% CPU). I have seen big convolution reverb plugins with multiple FFTs use similar, if not less, CPU resources so it seems I am doing something wrong.

I'm wondering if it has to do with WHERE I am doing the processing. Currently, I do all of my audio processing inside the processReplacing() function. I load the FFT buffer while --sampleFrames>=0 and then compute the FFT when its buffer is full. My FFT is large (8192+) which means sometimes it runs within that call and other times it is simply waiting for more data to fill its buffer. I can see that fluctuation in various host VST performance meters.

Is there a better way/place to do this?

camsr · Post by **camsr** » Fri Jul 18, 2014 5:05 pm

You have as much time as the CPU allows.

Keith99 · Post by **Keith99** » Fri Jul 18, 2014 5:25 pm

Well, you know the block size and the sampling rate, so you can work out the maximum time you can spend on one block before you need to process the next one.
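As a rough illustration of that arithmetic, here is a minimal sketch. The helper name `blockBudgetMs` is hypothetical and not part of the VST SDK; it only computes the upper bound, which in practice is shared with the host and every other plugin in the chain.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not from the VST SDK): the wall-clock duration of one
// audio block in milliseconds. This is an upper bound on the time available
// to ALL processing for that block, not a per-plugin guarantee.
inline double blockBudgetMs(int sampleFrames, double sampleRate) {
    return 1000.0 * static_cast<double>(sampleFrames) / sampleRate;
}
```

For example, 512 frames at 44.1 kHz works out to roughly 11.6 ms per block.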

Zaphod (giancarlo) · Post by **Zaphod (giancarlo)** » Fri Jul 18, 2014 6:31 pm

It's frames/samplerate, and only if your plugin is the only one running and your host is blazing fast. You can't do that. No way.
Google overlap-save and overlap-add.
Another tip: add a delay, so you can return processed samples for every possible frame count (even a single sample).
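The "add a delay" tip can be sketched as follows. This is a minimal illustration under stated assumptions, not anyone's actual plugin code: the class name `FixedBlockAdapter` and its callback type are hypothetical, and the block callback stands in for whatever FFT-based processing you do. The adapter accepts any number of frames per call and always returns that many samples, at the cost of one block of latency.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical adapter: accepts any number of samples per call, runs a
// fixed-size block callback whenever a full block has been collected, and
// plays back the previously processed block in the meantime.
class FixedBlockAdapter {
public:
    using BlockFn = std::function<void(const float* in, float* out, std::size_t n)>;

    FixedBlockAdapter(std::size_t blockSize, BlockFn fn)
        : blockSize_(blockSize), fn_(std::move(fn)),
          in_(blockSize, 0.0f), out_(blockSize, 0.0f) {}

    // The delay the plugin would have to report to the host.
    std::size_t latencySamples() const { return blockSize_; }

    // Works for any sampleFrames, including a single sample.
    void process(const float* input, float* output, std::size_t sampleFrames) {
        for (std::size_t i = 0; i < sampleFrames; ++i) {
            output[i] = out_[pos_];  // play back the previous block's result
            in_[pos_] = input[i];    // collect input for the next block
            if (++pos_ == blockSize_) {
                fn_(in_.data(), out_.data(), blockSize_);  // block is full
                pos_ = 0;
            }
        }
    }

private:
    std::size_t blockSize_;
    BlockFn fn_;
    std::vector<float> in_;
    std::vector<float> out_;
    std::size_t pos_ = 0;
};
```

Note that the block callback still runs inside the audio callback here, so the load stays bursty; this only decouples the processing block size from the host's frame count.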

mystran · Post by **mystran** » Sat Jul 19, 2014 1:38 am

The way to do long convolution etc with low ASIO load is to collect a buffer while playing back already processed stuff, then schedule the block for a background thread, which would then only need to finish it around the time you've collected the next buffer and so on. This necessarily introduces some latency (basically one block for the background thread, one block for the FFT processing), so for things like low-latency convolution you use multiple strategies, processing part of the convolution directly and probably scheduling multiple block sizes for the background thread too.

The details are kinda messy, but that's the basic idea. Whatever you do in processReplacing directly has to have a low "per-sample cost" (or at least "per ASIO block" cost), which is what generally shows up in a host meter. This is different from long-term average CPU usage, which is what FFTs are good at optimizing. You can hide the "bursty" nature of the CPU load by using a separate thread that runs outside the processReplacing dependency chain. The "true CPU" (as it would show up in system meters) then is just a matter of using a fast FFT and good block divisions.
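A minimal sketch of that hand-off, under several assumptions: the class name `BackgroundBlockProcessor` is hypothetical, the block callback stands in for the FFT work, and a mutex plus condition variable is used for brevity where a real plugin would likely prefer a lock-free hand-off and would manage thread priority. If the worker misses its deadline, this sketch simply outputs one block of silence and drops the block it had collected.

```cpp
#include <algorithm>
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch: the audio thread collects and plays back; a worker
// thread processes each full block. Total latency is two blocks.
class BackgroundBlockProcessor {
public:
    using BlockFn = std::function<void(const float* in, float* out, std::size_t n)>;

    BackgroundBlockProcessor(std::size_t n, BlockFn fn)
        : n_(n), fn_(std::move(fn)), collect_(n, 0.0f), workIn_(n, 0.0f),
          workOut_(n, 0.0f), play_(n, 0.0f), worker_([this] { run(); }) {}

    ~BackgroundBlockProcessor() {
        { std::lock_guard<std::mutex> lk(m_); quit_ = true; }
        cv_.notify_one();
        worker_.join();
    }

    // Audio thread: cheap per-sample work only, for any sampleFrames.
    void process(const float* in, float* out, std::size_t sampleFrames) {
        for (std::size_t i = 0; i < sampleFrames; ++i) {
            out[i] = play_[pos_];
            collect_[pos_] = in[i];
            if (++pos_ == n_) { pos_ = 0; handOver(); }
        }
    }

    bool busy() const { return busy_.load(); }

private:
    void handOver() {
        if (busy()) {  // worker missed the deadline: silence for one block
            std::fill(play_.begin(), play_.end(), 0.0f);
            return;
        }
        play_.swap(workOut_);    // pick up the block the worker finished
        collect_.swap(workIn_);  // hand over the block just collected
        { std::lock_guard<std::mutex> lk(m_); busy_.store(true); }
        cv_.notify_one();
    }

    void run() {  // worker thread
        std::unique_lock<std::mutex> lk(m_);
        for (;;) {
            cv_.wait(lk, [this] { return quit_ || busy_.load(); });
            if (quit_) return;
            lk.unlock();
            fn_(workIn_.data(), workOut_.data(), n_);  // the expensive part
            busy_.store(false);
            lk.lock();
        }
    }

    std::size_t n_;
    BlockFn fn_;
    std::vector<float> collect_, workIn_, workOut_, play_;
    std::size_t pos_ = 0;
    std::mutex m_;
    std::condition_variable cv_;
    std::atomic<bool> busy_{false};
    bool quit_ = false;
    std::thread worker_;  // declared last so it starts after the other members
};
```

The audio thread only copies samples and swaps buffers, so its per-block cost stays flat; the worker has a whole block period to finish before the next swap.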

edit: another possibility to even out the CPU load is to subdivide the FFT process into smaller chunks of work and advance it a bit every few samples or some such logic.. basically running sort-of co-operative threading.. this is still a bit wasteful since it's still counting towards audio load, though.. but it saves you from having to mess with thread priorities on the fly, which is something you'll probably end up doing with an actual background thread
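The co-operative variant might look something like the sketch below. It is an assumption-laden illustration, not IRDust's code: the class name `SteppedFft` is hypothetical, it uses a plain iterative radix-2 FFT, the size must be a power of two (at least 2), and the granularity is one butterfly pass per `step()` call. The bit-reversal permutation runs in one go inside `start()`.

```cpp
#include <cmath>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical resumable radix-2 FFT. Call start() once, then step() a little
// at a time (e.g. once per audio callback) until it returns true.
class SteppedFft {
public:
    // n must be a power of two and at least 2 (assumption, not checked).
    explicit SteppedFft(std::size_t n) : n_(n), data_(n), len_(2 * n) {}

    void start(const std::vector<std::complex<double>>& input) {
        data_ = input;
        // Bit-reversal permutation, done in one go.
        for (std::size_t i = 1, j = 0; i < n_; ++i) {
            std::size_t bit = n_ >> 1;
            for (; j & bit; bit >>= 1) j ^= bit;
            j ^= bit;
            if (i < j) std::swap(data_[i], data_[j]);
        }
        len_ = 2;
    }

    bool done() const { return len_ > n_; }

    // Runs a single butterfly pass; returns true once the transform is complete.
    bool step() {
        if (done()) return true;
        const double pi = std::acos(-1.0);
        const double ang = -2.0 * pi / static_cast<double>(len_);
        const std::complex<double> wlen(std::cos(ang), std::sin(ang));
        for (std::size_t i = 0; i < n_; i += len_) {
            std::complex<double> w(1.0, 0.0);
            for (std::size_t k = 0; k < len_ / 2; ++k) {
                const std::complex<double> u = data_[i + k];
                const std::complex<double> v = data_[i + k + len_ / 2] * w;
                data_[i + k] = u + v;
                data_[i + k + len_ / 2] = u - v;
                w *= wlen;
            }
        }
        len_ <<= 1;
        return done();
    }

    const std::vector<std::complex<double>>& result() const { return data_; }

private:
    std::size_t n_;
    std::vector<std::complex<double>> data_;
    std::size_t len_;  // current butterfly span; greater than n_ means finished
};
```

An 8192-point transform takes 13 passes this way, so calling `step()` once per audio callback spreads the cost over 13 callbacks, provided the next FFT buffer takes at least that many callbacks to fill.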

LemonLime · Post by **LemonLime** » Sat Jul 19, 2014 3:07 pm

Here is a paper on multithreaded FFT processing by the author of that recent plug-ins book:
http://www.willpirkle.com/project-galle ... notes/#AN2

I haven't tried implementing this myself, but it provides a good explanation of the theory and process behind it.

Fender19 · Post by **Fender19** » Sat Jul 19, 2014 6:04 pm

mystran wrote:The way to do long convolution etc with low ASIO load is to collect a buffer while playing back already processed stuff

OK, that part I am doing - using rotating buffers for FFT in/out, etc. But those rotating buffers really only have meaning at the boundaries of sampleFrames - i.e., they are really just a means of filling/reading a processing buffer size that is different than sampleFrame size. They are still both "blocks" of data. If the FFT size was the same size as sampleFrames the CPU load would be consistent. But I understand from the VST spec that we cannot assume that sampleFrames is any certain size - or is constant. So, one must use rotating buffers (I think).

mystran wrote:...you can hide the "bursty" nature of the CPU load by using a separate thread that runs outside the processReplacing dependency chain.

I am convinced this MUST be how most of the big convolution plugs work. I just don't see, otherwise, how it's possible to convolve 10+ second long impulses using multiple FFTs - and still have such low CPU usage. It's quite ingenious however it's being done.

Fender19 · Post by **Fender19** » Sat Jul 19, 2014 6:06 pm

LemonLime wrote:Here is a paper on multithreaded FFT processing by the author of that recent plug-ins book:
http://www.willpirkle.com/project-galle ... notes/#AN2

I haven't tried implementing this myself, but it provides a good explanation of the theory and process behind it.

Thank you, I will read up on it.

mystran · Post by **mystran** » Sat Jul 19, 2014 8:50 pm

Fender19 wrote:
mystran wrote:...you can hide the "bursty" nature of the CPU load by using a separate thread that runs outside the processReplacing dependency chain.
I am convinced this MUST be how most of the big convolution plugs work. I just don't see, otherwise, how it's possible to convolve 10+ second long impulses using multiple FFTs - and still have such low CPU usage. It's quite ingenious however it's being done.

Yeah, my IRDust uses up to 5 different FFT sizes, with just the shortest (64 samples) done in the audio thread directly. The longest one is currently 256k samples, which is 5.8 seconds at 44.1kHz and trying to do that synchronously would be totally ridiculous. During the time that such a long block gets processed, it also gets interrupted many times by processing of shorter, higher priority blocks.

edit: if you try IRDust, please note that the biggest bottleneck with the performance currently is actually the FFT algorithm used.. which is a bit slow, and the main reason I wrote "DustFFT" (that I mentioned in the other thread) which will eventually replace the one in IRDust.

How much time do we have during processReplacing()?