Upsampling with FIR vs. FFT

 KVRAF
 Topic Starter
 1609 posts since 13 Oct, 2003 from Oulu, Finland
I'm pondering what's the best and fastest way to handle upsampling of both large and small audio waveforms when loading them from disk. First I thought that traditional FIR filter would be great, but after a bit of thinking, I remembered hearing that FFT is a faster way to do it. This in turn got me thinking that if I used FFT for the job, are there any downsides to actually using one huge FFT window (one which covers the whole long sample) instead of splitting the audio wave into smaller sections which are processed with smaller FFT windows? I.e. using 4096 samples long FFT windows instead of one 2 minute long window for example (if the software user happens to load such a huge waveform into the software)? Doesn't the FFT get more inaccurate due to floating point resolution as the FFT window gets larger? I.e. 2 minute long FFT window might be a bad idea?

 KVRAF
 Topic Starter
 1609 posts since 13 Oct, 2003 from Oulu, Finland
Never mind. I just went with 65 taps long FIR and it's fast enough.
 KVRAF
 7890 posts since 12 Feb, 2006 from Helsinki, Finland
For 65 taps kernel you probably can't get anything out of FFT convolution no matter what, it's just too short.
edit: Well.. you could TRY with FFT block size of 8 or 16 samples, then frequency domain convolution, but I'd imagine just bruteforcing is probably going to be faster. Using FFT block that's longer than twice the kernel size is never profitable (due to the logN factor in O(NlogN)), and usually it makes sense to use at most about half your kernel size and then do the rest with frequencydomain delays, 'cos the cost of having to multiply accumulate over a small number of buffers tends to be less than the overhead of a larger FFT... but it's very much one of those things where the bigO complexity is not the whole truth and you'll have to manually tune for the platform (eg. caches and all that).
edit: Well.. you could TRY with FFT block size of 8 or 16 samples, then frequency domain convolution, but I'd imagine just bruteforcing is probably going to be faster. Using FFT block that's longer than twice the kernel size is never profitable (due to the logN factor in O(NlogN)), and usually it makes sense to use at most about half your kernel size and then do the rest with frequencydomain delays, 'cos the cost of having to multiply accumulate over a small number of buffers tends to be less than the overhead of a larger FFT... but it's very much one of those things where the bigO complexity is not the whole truth and you'll have to manually tune for the platform (eg. caches and all that).

 KVRAF
 Topic Starter
 1609 posts since 13 Oct, 2003 from Oulu, Finland
Yep, I'll stick to using FIR for now
 KVRAF
 7890 posts since 12 Feb, 2006 from Helsinki, Finland
The pedant in me still wants to answer these questions properly, 'cos I was sorta vague.Kraku wrote: ↑Sun Oct 31, 2021 5:04 amThis in turn got me thinking that if I used FFT for the job, are there any downsides to actually using one huge FFT window (one which covers the whole long sample) instead of splitting the audio wave into smaller sections which are processed with smaller FFT windows? I.e. using 4096 samples long FFT windows instead of one 2 minute long window for example (if the software user happens to load such a huge waveform into the software)? Doesn't the FFT get more inaccurate due to floating point resolution as the FFT window gets larger? I.e. 2 minute long FFT window might be a bad idea?
To convolve (in a noncircular fashion) a block of N samples with a kernel of K samples, you need an FFT size of (N+K1)=B. The cost of FFT is on the order of (N+K)*log(N+K). If the total audio signal has length N*M, then the total processing of all blocks takes on the order of M*((N+K)*log(N+K)) time and if you plot the curve, it's clear that the theoretical processing time first goes down quickly with increasing N, then plateaus around N=K and then slowly increases. In practice, FFT cost doesn't increase smoothly (usually you go straight to next power of two) so generally speaking you pick the blocksize as 2*K, rounded to the next power of two, and then set N to whatever will fit that block.
The above however ignores constant factors and the fact that we can transform the kernel once (in any number of blocks), transform the input once, multiplyaccumulate with blockdelays in frequency domain and then transform the output once.. and it turns out that constant factors of FFT as such that it's usually faster to multiplyaccumulate something like 8 blocks before increasing the FFT size is worth the trouble... but if your kernel is 64 samples, then accumulating 8 blocks takes the same amount of time as the direct convolution, so you're just losing for the overhead of the FFT. Practically speaking, the whole thing only becomes profitable when your kernel is 256+ or possibly more depending on what sort of SIMD you have available. YMMV.
Also.. for upsampling that's assuming the 64 (or 63) samples is one branch of a polyphase kernel. Actually zerostuffing is not profitable even with FFT, so you should still polyphase decompose... so when I say "kernel of minimum of 256+" that's per branch when it comes to resampling.
Finally to answer the question of floatingpoint accuracy, the answer is that .. yes ... floating point precision does get crappy once the FFT size goes up. I'd argue that singleprecision is completely unusable for any FFT blocksize that's actually worth using for any audio purpose whatsoever... but then again my idea of "minimum acceptable quality" is known to exceed the industry standards in some cases (but joking aside, there will be nontrivial noise with singleprecision even with blocksizes in the 1k2k range).

 KVRAF
 Topic Starter
 1609 posts since 13 Oct, 2003 from Oulu, Finland
Great answer, thanks!