KVR Audio

Kraku · Post by **Kraku** » Sun Oct 31, 2021 1:04 pm

I'm pondering what's the best and fastest way to handle upsampling of both large and small audio waveforms when loading them from disk. First I thought that traditional FIR filter would be great, but after a bit of thinking, I remembered hearing that FFT is a faster way to do it. This in turn got me thinking that if I used FFT for the job, are there any downsides to actually using one huge FFT window (one which covers the whole long sample) instead of splitting the audio wave into smaller sections which are processed with smaller FFT windows? I.e. using 4096 samples long FFT windows instead of one 2 minute long window for example (if the software user happens to load such a huge waveform into the software)? Doesn't the FFT get more inaccurate due to floating point resolution as the FFT window gets larger? I.e. 2 minute long FFT window might be a bad idea?

Kraku · Post by **Kraku** » Mon Nov 01, 2021 9:22 pm

Never mind. I just went with 65 taps long FIR and it's fast enough.

mystran · Post by **mystran** » Thu Nov 04, 2021 8:22 am

For 65 taps kernel you probably can't get anything out of FFT convolution no matter what, it's just too short.

edit: Well.. you could TRY with FFT block size of 8 or 16 samples, then frequency domain convolution, but I'd imagine just bruteforcing is probably going to be faster. Using FFT block that's longer than twice the kernel size is never profitable (due to the logN factor in O(NlogN)), and usually it makes sense to use at most about half your kernel size and then do the rest with frequency-domain delays, 'cos the cost of having to multiply accumulate over a small number of buffers tends to be less than the overhead of a larger FFT... but it's very much one of those things where the big-O complexity is not the whole truth and you'll have to manually tune for the platform (eg. caches and all that).

Kraku · Post by **Kraku** » Thu Nov 04, 2021 10:47 am

Yep, I'll stick to using FIR for now

mystran · Post by **mystran** » Fri Nov 05, 2021 9:42 am

Kraku wrote: ↑Sun Oct 31, 2021 1:04 pmThis in turn got me thinking that if I used FFT for the job, are there any downsides to actually using one huge FFT window (one which covers the whole long sample) instead of splitting the audio wave into smaller sections which are processed with smaller FFT windows? I.e. using 4096 samples long FFT windows instead of one 2 minute long window for example (if the software user happens to load such a huge waveform into the software)? Doesn't the FFT get more inaccurate due to floating point resolution as the FFT window gets larger? I.e. 2 minute long FFT window might be a bad idea?

The pedant in me still wants to answer these questions properly, 'cos I was sorta vague.

To convolve (in a non-circular fashion) a block of N samples with a kernel of K samples, you need an FFT size of (N+K-1)=B. The cost of FFT is on the order of (N+K)*log(N+K). If the total audio signal has length N*M, then the total processing of all blocks takes on the order of M*((N+K)*log(N+K)) time and if you plot the curve, it's clear that the theoretical processing time first goes down quickly with increasing N, then plateaus around N=K and then slowly increases. In practice, FFT cost doesn't increase smoothly (usually you go straight to next power of two) so generally speaking you pick the block-size as 2*K, rounded to the next power of two, and then set N to whatever will fit that block.

The above however ignores constant factors and the fact that we can transform the kernel once (in any number of blocks), transform the input once, multiply-accumulate with block-delays in frequency domain and then transform the output once.. and it turns out that constant factors of FFT as such that it's usually faster to multiply-accumulate something like 8 blocks before increasing the FFT size is worth the trouble... but if your kernel is 64 samples, then accumulating 8 blocks takes the same amount of time as the direct convolution, so you're just losing for the overhead of the FFT. Practically speaking, the whole thing only becomes profitable when your kernel is 256+ or possibly more depending on what sort of SIMD you have available. YMMV.

Also.. for upsampling that's assuming the 64 (or 63) samples is one branch of a polyphase kernel. Actually zero-stuffing is not profitable even with FFT, so you should still polyphase decompose... so when I say "kernel of minimum of 256+" that's per branch when it comes to resampling.

Finally to answer the question of floating-point accuracy, the answer is that .. yes ... floating point precision does get crappy once the FFT size goes up. I'd argue that single-precision is completely unusable for any FFT block-size that's actually worth using for any audio purpose whatsoever... but then again my idea of "minimum acceptable quality" is known to exceed the industry standards in some cases (but joking aside, there will be non-trivial noise with single-precision even with blocksizes in the 1k-2k range).

Kraku · Post by **Kraku** » Fri Nov 05, 2021 2:36 pm

Great answer, thanks!

rafa1981 · Post by **rafa1981** » Fri Nov 05, 2021 7:04 pm

Indeed.

Upsampling with FIR vs. FFT