KVR Audio

nollock · Post by **nollock** » Sun Jun 10, 2007 7:44 pm

l0calh05t wrote: basically the algo is:

let a be a q.fa fixed-point number, b a q.fb fixed-point number
to compute a/b:
a << fb
a / b (64 bit division)

deceptively simple, right? but when actually implementing it, it gets really ugly.
of course if you don't care about precision, do the following (which is fast and simple):
res = a / b (normal integer division)
res << fb
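
For reference, a minimal Python sketch of the two quoted variants. It assumes unsigned values and fa = fb = 16; the helper names (`to_fixed`, `div_precise`, `div_fast`) are made up for illustration, and Python's unbounded integers stand in for the 64-bit intermediate you would need in C.

```python
FRAC_BITS = 16  # assumption: both operands are q.16 fixed-point


def to_fixed(x, frac_bits=FRAC_BITS):
    """Quantize a non-negative real number to fixed-point (round to nearest)."""
    return int(round(x * (1 << frac_bits)))


def to_real(q, frac_bits=FRAC_BITS):
    """Convert a fixed-point integer back to a real number."""
    return q / (1 << frac_bits)


def div_precise(a, b, fb=FRAC_BITS):
    """Shift the numerator up first, then divide; the result keeps a's fraction bits."""
    return (a << fb) // b  # the shifted numerator needs a 64-bit type in C


def div_fast(a, b, fb=FRAC_BITS):
    """Divide first, then shift; the fraction bits of the quotient are lost."""
    return (a // b) << fb


a = to_fixed(3.0)
b = to_fixed(7.0)
precise = to_real(div_precise(a, b))  # close to 3/7
fast = to_real(div_fast(a, b))        # integer quotient 3 // 7 is 0, so this is 0.0
```

With these inputs the shift-first version lands within one fixed-point step of 3/7, while the divide-first version collapses to zero because the integer quotient is zero before the shift.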

You can also do

res = a * (1 / b)

By taking the reciprocal of b first with a fixed numerator, you can control where the precision of the division is focused. So you can do:

res = a * (0x100000000 / b);

That way you are getting the absolute maximum range you can out of the 64/32 bit division. Then you can do the 32x32 bit multiply and shift that 64 bit result to remove the extra mantissa bits.
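
A rough Python sketch of the reciprocal approach, assuming unsigned q.16 operands; `div_by_reciprocal` is a hypothetical name and the final shift amount (32 minus the fraction bits of b) is my reading of "remove the extra bits", not something spelled out above.

```python
FRAC_BITS = 16            # assumption: a and b are both q.16 fixed-point
ONE_Q32 = 0x100000000     # fixed numerator, 2**32


def div_by_reciprocal(a, b, fb=FRAC_BITS):
    """Approximate a/b as a * (2**32 / b), then drop the extra scale bits."""
    recip = ONE_Q32 // b          # 1/b scaled by 2**(32 - fb); 64/32-bit divide in C
    product = a * recip           # 32x32 -> 64-bit multiply in C
    return product >> (32 - fb)   # back to the fraction bits of a


a = int(3.0 * (1 << FRAC_BITS))
b = int(7.0 * (1 << FRAC_BITS))
result = div_by_reciprocal(a, b) / (1 << FRAC_BITS)  # close to 3/7
```

The reciprocal only carries roughly 32 minus log2(b) significant bits, so the accuracy depends on how large the raw value of b is; the same reciprocal can be reused for many numerators, which is where this pays off.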

arakula · Post by **arakula** » Sun Jun 10, 2007 8:45 pm

#The fREaK! wrote:
aciddose wrote:(are those qubits ints or floats? )
Both.

Thank you, this made my day

vonRed · Post by **vonRed** » Sun Jun 10, 2007 9:56 pm

aciddose wrote: well, when you're using a very low frequency it means you'll be operating with very small numbers. let's take 50 Hz at 96 kHz for example in a simple lossy integrator. the whole calculation will end up as equal to the floating point representation of

exp(-2*pi * 50 / 96000)

the error for a float version (1:8:23) is
(sign*(1+mantissa/2^23))*(2^(exponent-127))
with [1][127][8333794]
= 0.00000002541812071502541143185416

the maximum error for an int version is
1 / (2^31)
= 0.0000000004656612873077392578125

54 times more error for float here.

for double...
(sign*(1+mantissa/2^52))*(2^(exponent-2047))
with [1][2046][4474171814146206]
= 4.2972895747497294664630987600103e-18

obviously the error is less than for 32bit int.
however, double has 40 times more error than for 64bit int.

That's the maximum error for the storage of numbers, not the error added by performing the actual calculations on an x86 CPU. But given that the FPU registers actually use a 64bit mantissa (plus an additional sign bit), that doesn't tell the whole story. The floating point calculations are not less precise than 64bit int calculations are. They are more precise, by at least one bit.

Edit: I almost forgot (again), you didn't include the implied MSB of the mantissa that IEEE 754 uses for 32bit and 64bit storage formats.
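
If anyone wants to measure the storage error rather than derive it by hand, here is a small Python sketch. It uses the double-precision value as the reference, rounds to nearest in both formats, and treats the int version as Q31 (31 fraction bits), which is an assumption on my part; `to_float32` and `to_q31` are illustrative helper names.

```python
import math
import struct


def to_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_q31(x):
    """Round x in [0, 1) to the nearest multiple of 2**-31."""
    return round(x * 2**31) / 2**31


coeff = math.exp(-2 * math.pi * 50 / 96000)  # the lossy integrator coefficient, about 0.9967

err_float32 = abs(to_float32(coeff) - coeff)
err_q31 = abs(to_q31(coeff) - coeff)

# With round-to-nearest the worst case is half a step:
# 2**-25 for binary32 values in [0.5, 1), 2**-32 for Q31.
print(f"float32 error: {err_float32:.3e} (bound {2**-25:.3e})")
print(f"Q31 error:     {err_q31:.3e} (bound {2**-32:.3e})")
```

This only covers the stored coefficient, not what happens to intermediate results in the FPU registers.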

JonHodgson · Post by **JonHodgson** » Sun Jun 10, 2007 10:00 pm

One thing to add to all this. There has been mention of TDM systems, well they are 24 bit fixed point, which gives the best case coefficient accuracy as being as good as a 32 bit float. Now it is true they have 56 bit accumulators, which can give advantages in a few situations, but I spent nearly two years programming a DSP with a very similar architecture (24 bit fixed point, 56 bit accumulators), programming codecs and other DSP, and let me tell you... I DREAMED of floating point. Getting enough precision was often a complete nightmare to achieve, especially given cycle constraints.

aciddose · Post by **aciddose** » Mon Jun 11, 2007 10:48 am

"I almost forgot (again), you didn't include the implied MSB of the mantissa that IEEE 754 uses for 32bit and 64bit storage formats"

what do you mean? that is part of the function to convert the sign/exponent/mantissa to a real number. i haven't forgotten anything.

jon, yes 24 bits just isn't enough for many tasks, especially when calculating coefficients. it doesn't give enough headroom much of the time. using 24 bits out of 32 (with 8 for headroom) is generally enough for me though.

i calculate many of my coefficients in double though, and then store them in long. there isn't much advantage to doing coefficient calculations or anything else not directly attached to the audio path in int. if you're calculating a table, it is best to use the highest precision available, then decrease the precision to the lowest level possible while storing the table. when the table is accessed/interpolated, then it is generally best to use the same format as used in the audio path. this allows you to decrease the amount of cache used up by tables and eliminate most conversions from the real-time sections.
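
A minimal sketch of that workflow in Python: compute in double, store as integers. The 24 fraction bits in a signed 32-bit word, the cutoff list, and the names `quantize` and `one_pole_coeff` are all assumptions for illustration; the coefficient formula is the lossy integrator one from earlier in the thread.

```python
import math

FRAC_BITS = 24  # assumption: 24 fraction bits, the rest of the 32-bit word is headroom and sign


def quantize(x, frac_bits=FRAC_BITS):
    """Round a double to fixed-point and check it fits a signed 32-bit word."""
    q = round(x * (1 << frac_bits))
    if not -(1 << 31) <= q < (1 << 31):
        raise OverflowError("value does not fit in a signed 32-bit word")
    return q


def one_pole_coeff(cutoff_hz, sample_rate_hz):
    """Feedback coefficient of a simple lossy integrator, computed in double."""
    return math.exp(-2.0 * math.pi * cutoff_hz / sample_rate_hz)


# Build the table at full precision, store it in the audio-path format.
cutoffs = [50.0, 500.0, 5000.0]
table = [quantize(one_pole_coeff(fc, 96000.0)) for fc in cutoffs]
```

The real-time code then reads `table` directly as integers, so no float-to-int conversion happens in the audio loop.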

thorkz · Post by **thorkz** » Mon Jun 11, 2007 10:50 am

What I don't get, though, is why the error in floating point would be more biased (leaning in one direction?) than in int?

thorK

aciddose · Post by **aciddose** » Mon Jun 11, 2007 10:57 am

thorkz, because in float the error increases as the scale of the values increase while it remains linear (the error is constant) in int. the fact that the accuracy of float is distributed non-linearly means that the error terms will be non-linear and dependent upon the scale of the values being operated on.

think about this: in int if you add two numbers with some fractional component, then you divide the result to eliminate the fractions, the fractions will add up linearly and you'll just get ordinary quantization error as the result of the division. the error will be equal in both directions. if both numbers are less than x.5, the error will be toward zero. if the numbers are above 0.5, it'll be toward one. any combination of values will sum up to the correct result.

in float, if we add two numbers which are above 0.5, we'll get a level of error, let's call this "1" point of error. now, if the numbers are below 0.5, we'll get "2" points of error. if the numbers are below 0.25 we'll get "3" points of error and so on. every time the numbers decrease by half, the error is doubled. if one of the numbers is very small and one is very high, the very small number will have a lot of error and the large number will have less error in the sum. we're assuming here that we're adding numbers like 1.6 and 0.02. if we're adding something like 0.03 and 0.031, there will be very little bias error in the sum.

this can create all kinds of problems in systems with feedback. if there are multiple feedback paths and one feedback signal level is very small, while the other is large, the very small signal will have a lot more error when they are combined.

think of it like a noise floor. every time the numbers are doubled in scale, the noise floor will increase by 6 dB. if we're combining a loud signal with a quiet signal, the noise in the loud signal will cover up the quiet signal.

if we combine a very quiet signal with a loud signal in int, the noise remains unchanged, it is at a fixed value.
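
The noise floor picture can be checked with a short Python sketch that prints the spacing between adjacent 32-bit float values at a few signal levels and compares it to a fixed-point step. The Q31 step and the helper name `float32_ulp` are assumptions for illustration.

```python
import math
import struct


def float32_ulp(x):
    """Gap between x (rounded to binary32) and the next larger binary32 value.

    Assumes x is positive and finite.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    lo = struct.unpack("<f", struct.pack("<I", bits))[0]
    hi = struct.unpack("<f", struct.pack("<I", bits + 1))[0]
    return hi - lo


Q31_STEP = 2.0 ** -31  # the fixed-point step is the same at every level

for level in (1.0, 0.5, 0.25, 0.125):
    step = float32_ulp(level)
    rel_db = 20 * math.log10(step / Q31_STEP)
    print(f"level {level:<6} float32 step {step:.3e} ({rel_db:+.1f} dB vs Q31 step)")
```

Each halving of the level halves the float step (about 6 dB), while the fixed-point step stays where it is.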

think of a sine wave. if you're generating a sine wave through an iterative method like in a filter with feedback then the error will become greater at the far edges of the wave. if you think of a sine wave, you should remember that the edges move the slowest. the rate drops to 0 and then grows again in the opposite direction. so, in float the error for this is greatest at the time when we need the most accuracy, and the error is least during the time when it doesn't matter much since the waveform is moving quickly.

thorkz · Post by **thorkz** » Mon Jun 11, 2007 11:30 am

That "level of error" is giving me trouble. I thought (without mixing) in float the error decreases as the amplitude falls towards zero and float's precision comes into play?

thorK

thorkz · Post by **thorkz** » Mon Jun 11, 2007 11:35 am

OK. I didn't see your last paragraph. Lemme think.

thorkz · Post by **thorkz** » Mon Jun 11, 2007 11:57 am

OK I see where you see that bias. It's towards low level signals and thus not good for filters with feedback.

But isn't that error still at most equal to 1 bit of the integer?

thorkz · Post by **thorkz** » Mon Jun 11, 2007 12:30 pm

what did you mean by this:

"ints are linear, meaning that you can always add any two numbers and the result will always be perfect. there will never be an error in addition or subtraction.

because of this difference, error signals in floating point will have a bias. depending upon how the filter/function works, the resulting numbers can be too small or too large. over time this will cause the numbers to "creep" in one direction and the error will be allowed to build up.

in int, errors are also possible. with int we do not suffer from errors caused by a non-linear representation, but we do suffer from quantization error which can take effect during scaling (multiplication and division). the distribution of these errors however will not be biased into one direction. instead, the error will usually be noise, often close to white noise. this means that over time the error will tend to cancel itself out and will be unable to "creep" like in float code."

What makes floating-point errors creep?
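
One small piece of the quoted claim is easy to try out: integer addition never rounds, floating-point addition can. A Python sketch, using doubles rather than 32-bit floats and an assumed fixed-point format with 24 fraction bits:

```python
FRAC_BITS = 24  # assumption: fixed-point format with 24 fraction bits

step_float = 0.1                            # not exactly representable in binary
step_fixed = round(0.1 * (1 << FRAC_BITS))  # quantized once, up front

acc_float = 0.0
acc_fixed = 0
for _ in range(10):
    acc_float += step_float  # every add may round the running sum
    acc_fixed += step_fixed  # integer add is exact (barring overflow)

float_is_exact = acc_float == 10 * step_float
fixed_is_exact = acc_fixed == 10 * step_fixed
print(float_is_exact, fixed_is_exact)
```

This shows that the float sum picks up rounding inside the loop while the int sum only carries the initial quantization; it does not by itself show which direction the float error drifts in a given filter.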

thorkz · Post by **thorkz** » Mon Jun 11, 2007 1:21 pm

Now that nobody is here i can do it, can i?

i'll do it.

This whole int vs float thang is just a pile of shit!!!!!!!

oh I see. That's why. It feels quite nice. And doesn't mean nothing either.

mmh.......

aciddose · Post by **aciddose** » Mon Jun 11, 2007 1:39 pm

the direction of creep depends upon the function. the error isn't constant in float, that is what is important. it isn't just "toward zero", it's that you get more error with larger numbers. it's kind of like a 2nd order error that isn't entirely obvious at first glance.

it is a load of shit though, you're right. if you remember what i said originally, int and float for most purposes are pretty much the same. for some cases int can be better like some filters, but for many cases it doesn't really make a noticeable difference. it does indeed make a difference and it is indeed measurable and has an effect which could theoretically be noticed - but that doesn't make it really "noticeable" since you can't really tell unless you're really listening for it and looking in specific places.

i said i think on average, int code tends to be of high quality for many different reasons, but it isn't some magic wand you can wave over everything and get an instant enchanted synthesizer of +5 phatness.

in practical terms, it is absolutely useless for somebody using vst plugins to care if they're written using an int or float audio path. it doesn't make a difference. in my opinion it can be useful for coders to experiment with integer versions of their code because i think you can do a better job with a larger tool set. for each problem there is a matching tool to provide the best solution and many times a combination of tools will be the proper choice.

mostly the reason i'm interested in the float vs. int thing is to demonstrate that there are real differences between the formats so that more coders will try using int. the only possible outcome of this would be that we get higher quality code, better compilers, better processors and so on. right now int is, in my opinion, fairly neglected for dsp (audio and video). i think it excels in these areas, so it should not be.

mauseoleum · Post by **mauseoleum** » Mon Jun 11, 2007 1:43 pm

Imo you should rather say that "int is neglected in 'affordable' dsp solutions".

stefancrs · Post by **stefancrs** » Mon Jun 11, 2007 1:55 pm

aciddose: I have been wondering, since you often mention the "bigger error with bigger numbers" issue... For some applications, especially in audio, is it not good that the relative, instead of absolute, error is basically constant, so that the "signal-to-error" ratio stays the same?
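
To put numbers on the "signal-to-error" idea, here is a Python sketch that quantizes one value at progressively lower levels and prints the ratio in dB for 32-bit float and for fixed-point. The Q31 format, the test value 1/3, and the helper names are assumptions for illustration.

```python
import math
import struct


def to_float32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_q31(x):
    """Round x to the nearest multiple of 2**-31."""
    return round(x * 2**31) / 2**31


def snr_db(exact, approx):
    """Signal-to-error ratio in dB for a single value."""
    err = abs(approx - exact)
    return math.inf if err == 0 else 20 * math.log10(abs(exact) / err)


base = 1 / 3  # arbitrary test value, scaled down in 20 dB steps
for k in range(7):
    x = base * 10.0 ** -k
    print(f"{-20 * k:5d} dB  float32 {snr_db(x, to_float32(x)):6.1f} dB  "
          f"Q31 {snr_db(x, to_q31(x)):6.1f} dB")
```

The float column stays in roughly the same range at every level because the relative error is bounded by 2^-24, while the fixed-point column drops as the level drops because its absolute error bound is constant.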

Integer is King? - final thoughts about the EQ challenge