What is the absolute denormal threshold for floats and doubles (using SSE)?

DSP, Plugin and Host development discussion.

Post

I would test this myself, but thought I could save some time in case somebody already knows. I am working on a dithering/denormal prevention algo.

Post

FLT_MIN and DBL_MIN?

Post

Probably. Where are those defined?
I got 2^-23 for floats; I guess it's the same formula for doubles, 2^-(mantissa bits).
Although my test was simple, based only on decrementing the exponent.

Code: Select all

#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main()
{
    uint32_t cw = _mm_getcsr();
    cw &= 0b11111111111111110111111111111111; // FTZ OFF
    //cw |= 0b00000000000000001000000000000000; // FTZ ON
    //cw &= 0b11111111111111111111111110111111; // DAZ OFF
    cw |= 0b00000000000000000000000001000000; // DAZ ON
    _mm_setcsr(cw);

    float zero = 0.f;
    float half = 0.5f;
    uint32_t addend = 0b01000000000000000000000000000000;
    float a = 0.f;
    float res = 0.f;

    uint64_t c = 128;

    while(c)
    {
        a = *(float*)&addend;
        //res = zero + a;
        res = half * a;
        if(res == 0.f) {break;}

        addend = addend - 0b00000000100000000000000000000000; // step the exponent field down by one
        c--;
    }
    addend >>= 23;
    printf("%u\n", addend);
    system("pause");
    return 0;
}
The adding portion didn't work with this simple recursion, so I switched to multiplying, and the answer was always the bottom exponent bit.
Last edited by camsr on Sun Jun 14, 2015 9:35 pm, edited 1 time in total.

Post

<float.h>:
http://www.cplusplus.com/reference/cfloat/

They reference it as the minimum representable number, which is a bit weird wording.

Post

Mayae wrote:<float.h>:
http://www.cplusplus.com/reference/cfloat/

They reference it as the minimum representable number, which is a bit weird wording.
I think then they are defining the actual bottom of the mantissa range. It has nothing to do with denormals, unless they are talking about a different precision. Remember, pre-SSE floats were calculated with 80-bits.

Post

Mayae wrote:FLT_MIN and DBL_MIN?
No, that's the smallest float that can be represented, so the last denormal before 0.
The issue is that there is a value in the standard, but not all processors have the same limit.
And they can be flushed to 0 anyway if speed is required.

Post

Miles1981 wrote:
Mayae wrote:FLT_MIN and DBL_MIN?
No, that's the smallest float that can be represented, so the last denormal before 0.
The issue is that there is a value in the norm, but not all the processors have the same limit.
And they can be flushed to 0 anyway if speed is required.
I must admit I'm not 100% sure, and the info around the net varies..

Raymond cites it from the standard as the minimum normalized float, here:
http://stackoverflow.com/a/7973776/1287254

Post

What prompted this thread from me was I thought I read somewhere that AMD processors have a different denormal behavior. So far the behavior on my Core 2 CPU has been to treat any float with an exponent value of zero as a denormal. If somebody using a newer AMD would like to test this, I've made the test work better now :)

Code: Select all

#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main()
{
    uint32_t cw = _mm_getcsr();
    //cw &= 0b11111111111111110111111111111111; // FTZ OFF
    cw |= 0b00000000000000001000000000000000; // FTZ ON
    //cw &= 0b11111111111111111111111110111111; // DAZ OFF
    cw |= 0b00000000000000000000000001000000; // DAZ ON
    _mm_setcsr(cw);

    float zero = 0.f;
    uint32_t addend = 0b01111111111111111111111111111111;
    float a = 0.f;
    float res = 0.f;

    uint64_t c = (uint64_t)addend;

    while(c)
    {
        a = *(float*)&addend;
        res = zero + a;
        if(res == 0.f) {break;}
        addend = addend - 0b00000000000000000000000000000001;
        c--;
    }

    printf("%u\n", addend);
    system("pause");
    return 0;
}
With either denormal prevention mode turned on, the program outputs 8388607, with neither on, the output is 0. I compiled this as x64 with no optimizations in MinGW.

Post

camsr wrote:
Mayae wrote:<float.h>:
http://www.cplusplus.com/reference/cfloat/

They reference it as the minimum representable number, which is a bit weird wording.
I think then they are defining the actual bottom of the mantissa range. It has nothing to do with denormals, unless they are talking about a different precision. Remember, pre-SSE floats were calculated with 80-bits.
But the actual bottom of the exponent is by definition the threshold - that is, denormal numbers are floats without bits in the exponent, thus the number is treated as having no implied leading bit in the mantissa. Thus, the smallest non-denormal number is a number without any bits in the mantissa and one bit in the exponent.

This can be seen here:
http://www.h-schmidt.net/FloatConverter/IEEE754.html

This value equals FLT_MIN/DBL_MIN, which is 1.17549435e-38 for 32-bit precision. The smallest denormal number is 1.4e-45.

Constants dealing with exponent ranges are FLT_MIN_EXP and DBL_MIN_EXP.

e: oops, misread your answer. answer still stands though

Post

I made an oops also :D
Mixed my thinking of integers and real numbers :lol:
But that was the result of thinking in terms of bits, reciprocals, etc. Hopefully I don't look stupid again :pray:
I tend to not think in terms of real numbers while programming, since there isn't a real number line.

Post

I will just bump this even if it is embarrassing :)

Post

Just figured it out, the DAZ flag makes the denormal operand ZERO, FTZ makes the result zero. Hooray for me :)

Try it, DAZ will not iterate in this example, FTZ will.

Code: Select all

#include <stdlib.h>
#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main()
{
    uint32_t cw = _mm_getcsr();
    cw &= 0b11111111111111110111111111111111; // FTZ OFF
    //cw |= 0b00000000000000001000000000000000; // FTZ ON
    cw &= 0b11111111111111111111111110111111; // DAZ OFF
    //cw |= 0b00000000000000000000000001000000; // DAZ ON
    _mm_setcsr(cw);

    //float zero = 0.f;
    uint32_t addend = 0b00000000111111111111111111111111;
    uint32_t one = 1;
    //float a = 0.f;
    float a = *(float*)&addend;
    float res = 0.f;
    float disp = 0.f;

    uint32_t c = addend;

    for(;;) // addend is unsigned, so the original "addend >= 0" was always true
    {
        //a = *(float*)&addend;
        //res = zero + a;
        disp = res;
        res = a - (*(float*)&one);
        a = res;
        addend = *(uint32_t*)&res;
        if(res == 0.f) {break;}

        //addend -= 0x00000001U;

        c--;
    }

    printf("%u\n", addend);
    printf("%u\n", c);
    printf("%.25e\n", disp);
    printf("%.25e\n", res);
    printf("%.25e\n", a);
    //system("pause");
    return 0;
}

Post

camsr wrote: Sat Apr 13, 2024 11:05 pm Just figured it out, the DAZ flag makes the denormal operand ZERO, FTZ makes the result zero. Hooray for me :)
DAZ = "denormals are zero"
FTZ = "flush to zero"

These names are quite descriptive. One says "if you see a denormal, just pretend that it's a zero" while the other says "if the result would be a denormal, just round it down to zero instead."

Realistically there is typically little reason to enable one without enabling the other. I'm pretty sure on ARM there's just one flag (and I think it defaults to just avoiding denormals on macOS). For whatever reason, on x86 they added one of the flags (forgot which one, probably FTZ) and then it turned out that it wasn't enough to avoid all issues caused by denormals, so they added the second one too.

I guess the main issue with treating denormals as zero is that it sort of cuts off precision early.. like we have 23 mantissa bits as long as the implied leading bit is one.. but then we subtract a least significant mantissa bit and suddenly we lose all the 23 bits. What this means in practice (for numerical computations like audio) is that you actually sort of lose some of the range of the exponent, 'cos you don't want to let it go so small that you need to worry about this... but then again that's a small price to pay most of the time given that the alternative is ridiculously poor performance (on some older systems it was hilarious, you could literally slow down the CPU so much the OS could no longer move the mouse cursor smoothly).

Post

mystran wrote: Sun Apr 14, 2024 2:00 pm
DAZ = "denormals are zero"
FTZ = "flush to zero"

These names are quite descriptive. One says "if you see a denormal, just pretend that it's a zero" while the other says "if the result would be a denormal, just round it down to zero instead."
…
Yes, and it's important to realize zero (or negative zero) is not a denormal, in case that's useful.
One without the other does not seem useful I suppose, using both or neither seems the way to go by default. Could switching between CW states cause latency issues?
My small example above has a denormal (instead of zero) as an operand, but I realized that with the addition operator and DAZ, the result will be the value of the non-denormal operand. With multiplication, however, I think the result will be zero.

Post

camsr wrote: Sun Apr 14, 2024 7:53 pm Yes, and it's important to realize zero (or negative zero) is not a denormal, in case that's useful.
A "denormal" number is a non-zero value without an "implied" leading one in the mantissa. As we discussed in the other thread, normally there is an implied one, followed by 23 fraction bits. If the exponent is already at the minimum value, we can't normalize the number so that the leading bit is one, hence "denormal" (i.e. "not normalized").

The problem with denormals is that (on the logic level) arithmetic with them works differently from regular normalized floats. So you end up with all these special cases where one or both of the operands, or perhaps the result, might be denormal, and rather than throwing a ton of extra silicon at the problem (not sure if this is even feasible), processors treat it as an exceptional situation and handle it very slowly (probably using some form of microcode, but no idea to be honest).

Now, you might think that a zero is also a special case.. but it turns out that zero is a very easy special case: a+0=a, a-0=a, 0-a=-a, a*0=0, 0/a=0, a/0=inf and 0/0 = NaN... so basically with zeroes we can take a fast-path and not even really compute anything (at most we need to xor the signbits).

Why are NaNs and infinities also slow? I don't know. It would seem that these could be handled as easily as zeroes.. but perhaps it has something to do with floating point exceptions (eg. signalling NaNs) or some such thing.

In any case, unlike denormals, zeroes don't cause performance issues, so as long as you can tolerate losing a bit of precision with very small numbers, just turn the FTZ/DAZ bits on and forget about it.

Return to “DSP and Plugin Development”