## Floating point / fixed point limitations in ZDF calculations

Kraku
KVRian
1403 posts since 13 Oct, 2003 from Prague, Czech Republic
Has anyone tested how quickly you run into floating point or fixed point inaccuracies when implementing ZDF filters? How simple or complex can the filter be before you start hearing unwanted artifacts that exist only because of the number of bits used in the calculations?

I would imagine 64-bit floats can handle almost everything you throw at them, but what about 32-bit floats? Linear filters are probably OK with both, but how many, or how complex, nonlinearity calculations can you add to the filter before you have to switch from 32 bits to 64 bits?

What implementation methods are there for ZDF filters and their nonlinearities that make good use of the available bit depth?

I imagine fixed point would be a pain in the ass for ZDF filter implementation (that's why I've avoided it), but it would be interesting to hear someone's experiences with it.

camsr
KVRAF
6879 posts since 17 Feb, 2005
With floating point, it's safe to assume that the more mantissa bits are available, the higher the precision, and so the usable dynamic range above the rounding noise. The mantissa bits of a float behave basically like an integer, so you can apply the same quantization reasoning to each. It's always best to use the fewest operations possible, as this limits the amount of rounding error that can accumulate.
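
To make the "mantissa bits behave like integers" point concrete, here is a minimal sketch that measures the gap between neighbouring 32-bit floats at a given magnitude. The helper name `ulpAt` is made up for illustration, and the sketch assumes IEEE 754 single precision.

```cpp
#include <cmath>
#include <limits>

// Gap between x and the next representable float above it (one "ulp" at x).
// Anything much smaller than this gap is lost when it is added to x.
inline float ulpAt(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

With 24 significant bits, `ulpAt(1.0f)` is 2^-23 (about 1.19e-7), and the gap doubles with every power of two, so the absolute quantization step follows the magnitude of the value being stored.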

matt42
KVRian
1068 posts since 9 Jan, 2006
Different topologies vary in numerical stability, and this affects the performance of linear filters too, so you can't assume that linear filters will be stable.

ZDF filters are usually built from TDF2 integrators, which have good numerical stability and also play nicely with modulation. So ZDF would typically be my default choice for linear filters.

The main pain with ZDF and nonlinearities is finding an efficient numerical solution.
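
As a rough illustration of a linear ZDF filter built around a TDF2 trapezoidal integrator, here is a minimal one-pole lowpass sketch. The names `OnePoleZdf`, `setCutoff` and `process` are invented for this example, and the structure is the common textbook form rather than anyone's specific implementation.

```cpp
#include <cmath>

// Linear one-pole ZDF (trapezoidal) lowpass around a TDF2 integrator.
struct OnePoleZdf {
    double g = 0.0;  // prewarped integrator gain, tan(pi * fc / fs)
    double s = 0.0;  // integrator state

    void setCutoff(double fc, double fs) {
        const double pi = 3.14159265358979323846;
        g = std::tan(pi * fc / fs);
    }

    double process(double x) {
        // The delay-free loop y = g * (x - y) + s is solved analytically:
        // v = g * (x - y) = g * (x - s) / (1 + g)
        const double v = g * (x - s) / (1.0 + g);
        const double y = v + s;
        s = y + v;  // TDF2 state update, equal to s + 2 * v
        return y;
    }
};
```

Because the loop is linear, no iteration is needed; the nonlinear case is where a numerical solver comes in.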

Urs
u-he
22535 posts since 8 Aug, 2002 from Berlin
If we're talking vectorial Newton-Raphson, here are two or three observations regarding numerical accuracy with 32-bit floating point:

- it might be wise to choose the margin of error based on signal RMS. Otherwise harder-driven filters will converge more slowly and softer tones will have more audible noise.

- when using approximations to tanh, exp, division, sqrt and so on, the algorithm converges faster if the derivatives are built from the approximations, not from the original operations. Sometimes it's better to use a more expensive operation if its derivative is closer and simpler. For example, sometimes a full division (_mm_div_ps) is faster overall than a reciprocal estimate (_mm_rcp_ps) because it needs fewer iterations. And you'll always want approximations to tanh() to be as good as possible, since its derivative is so easily computed.

- study Andy Simper's delta method to update the state variables (similar to Vadim's example in his book IIRC), as apparently it is numerically more precise than calculating the variables directly.
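
The first two observations can be sketched on a scalar toy problem. This is not code from any of the plug-ins discussed here: it assumes a hypothetical one-pole equation y = s + g * (x - tanh(y)), a Padé-style tanh approximation chosen for illustration, and made-up tolerance constants (`relTol`, `rmsFloor`). The delta method from the third point is not shown.

```cpp
#include <algorithm>
#include <cmath>

// Rational tanh approximation, clamped so that it reaches exactly +/-1 at |x| = 3.
inline float tanhApprox(float x) {
    x = std::max(-3.0f, std::min(3.0f, x));
    const float x2 = x * x;
    return x * (27.0f + x2) / (27.0f + 9.0f * x2);
}

// Derivative of tanhApprox itself, not of the true tanh.
inline float tanhApproxDeriv(float x) {
    if (std::fabs(x) >= 3.0f) return 0.0f;
    const float x2 = x * x;
    const float r = (x2 - 9.0f) / (3.0f * (x2 + 3.0f));
    return r * r;
}

// Newton-Raphson solve of y = s + g * (x - tanhApprox(y)).
// rms is assumed to be a running level estimate supplied by the caller.
inline float solveNonlinearOnePole(float x, float s, float g, float rms) {
    const float relTol = 1e-5f;    // illustrative value
    const float rmsFloor = 1e-6f;  // keeps the tolerance nonzero in silence
    const float tol = relTol * std::max(rms, rmsFloor);
    float y = s;  // initial guess
    for (int n = 0; n < 16; ++n) {
        const float f = y - s - g * (x - tanhApprox(y));
        const float df = 1.0f + g * tanhApproxDeriv(y);
        const float delta = f / df;
        y -= delta;
        if (std::fabs(delta) <= tol) break;  // error margin scales with level
    }
    return y;
}
```

Scaling the tolerance with the level keeps the relative error roughly constant, so quiet passages are not left with a fixed absolute error and loud ones do not burn iterations chasing digits a 32-bit float cannot hold.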

mystran
KVRAF
5001 posts since 12 Feb, 2006 from Helsinki, Finland
What others said... plus if you're using something like LU factorisation to solve your linear systems (which you probably should if you want performance; less efficient algorithms can be more accurate though), then make sure your pivot selection doesn't get too bad (e.g. very small pivot elements tend to lose you a lot of precision).

You can easily sanity check the linear solver part though: once you have x from Ax=b, just calculate Ax - b in the other direction and if (when) it comes out as non-zero, you'll get an idea of how far the solution is from the desired.
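
A minimal sketch of that sanity check, assuming a small dense system stored in `std::array`; the function name `residualMaxAbs` is made up for this example. It works with whatever solver produced x.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>

// Largest absolute entry of the residual A*x - b.
// Zero means x solves the system exactly; larger values mean lost precision.
template <std::size_t N>
double residualMaxAbs(const std::array<std::array<double, N>, N>& A,
                      const std::array<double, N>& x,
                      const std::array<double, N>& b) {
    double worst = 0.0;
    for (std::size_t i = 0; i < N; ++i) {
        double r = -b[i];
        for (std::size_t j = 0; j < N; ++j) r += A[i][j] * x[j];
        worst = std::max(worst, std::fabs(r));
    }
    return worst;
}
```

If the solver itself runs in 32-bit floats, converting its output to double before this check avoids adding fresh rounding error to the measurement.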

PS: I'd guess part of the benefit of optimising Ax-b=0 using deltas (rather than iterating on Ax-b directly) is precisely that it then lets you feed the error back into the iteration. I never looked at this formally though.

Kraku
KVRian
1403 posts since 13 Oct, 2003 from Prague, Czech Republic
Lots of useful tips. Thank you!

You're free to post even more tips if you want to

S0lo
KVRian
625 posts since 31 Dec, 2008
Using 32-bit floats, one thing I've noticed is that adding very small values thousands of times to a much larger value results in rounding that eventually stops changing the larger value at all! This leads to very bad inaccuracies. You may run into it with high oversampling, because the changes between samples become very small. Parameter smoothing is one example of a situation where this can happen.

If you want my advice: make everything a 64-bit double. You're hardly going to notice any CPU hit; in fact it can be lighter than floats on modern CPUs. If you mix doubles and floats, you're probably going to end up with many register conversions, which may result in slower code than if you used all doubles. The only case I can think of where all floats would be lighter on the CPU is when handling very large amounts of floating point data. In that case, using floats halves the data size and so may reduce cache misses and improve performance.
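
The accumulation problem described above is easy to reproduce. The sketch below uses a made-up helper, `repeatedAdd`, and assumes IEEE 754 arithmetic with no extended-precision intermediates (the usual case for SSE code on x86-64).

```cpp
#include <cstddef>

// Adds `step` to `start` `count` times, rounding to type T after every addition.
template <typename T>
T repeatedAdd(T start, T step, std::size_t count) {
    T acc = start;
    for (std::size_t i = 0; i < count; ++i) acc += step;
    return acc;
}
```

With `float`, `repeatedAdd(1.0f, 1e-8f, 100000)` stays at exactly 1.0f because each increment is below half the spacing between floats near 1.0. With `double`, the same call reaches roughly 1.001, as expected.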

Kraku
KVRian
1403 posts since 13 Oct, 2003 from Prague, Czech Republic
Urs wrote:
- study Andy Simper's delta method to update the state variables (similar to Vadim's example in his book IIRC), as apparently it is numerically more precise than calculating the variables directly.

I'm not sure what this means exactly. I've read most of Vadim's book but can't remember this kind of thing mentioned in the chapters I read.

Do you mean that you calculate "u", which is the point in a 1-pole LPF right after the feedback point, and use that to update the integrator's state? I.e. you do this:

s = 2 * g * u + s