Are denormals still an issue?
- KVRist
- 77 posts since 8 Nov, 2020
I was under the assumption that modern compilers and processors had gotten rid of the denormals issue, but this article from Earlevel in 2019 says differently: https://www.earlevel.com/main/2019/04/1 ... denormals/
Any thoughts?
- KVRAF
- 8476 posts since 12 Feb, 2006 from Helsinki, Finland
Denormals might not be quite as bad on modern CPUs as they were in the past (you used to be able to pretty much freeze the whole system with them), but you should still expect a performance hit.
Compilers won't normally do anything to avoid denormals as such, but as long as no floating-point code is using the legacy x87 FPU, you can set the relevant CPU flags yourself (assuming the host doesn't do it already; I think in another thread about denormals on M1 someone suggested that Apple disables them by default already). With modern compilers the x87 FPU is only used for 32-bit builds if you don't enable SSE2+ instruction sets (so if you compile 32-bit, sanity check that SSE2 is allowed); for amd64 (i.e. 64-bit builds) SSE2 is guaranteed to be available and the legacy FPU generally won't be used, except possibly for extended precision, which is not relevant for audio.
So... the TL;DR version is, set (on entry, restore on exit) the CPU flags (eg. FTZ/DAZ on x86) and make sure your x86 32-bit compiler is allowed to use SSE2... and you won't see a denormal and won't need to worry about them.
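A build-time sanity check for the SSE2 point could look something like the sketch below. `__SSE2__` (GCC/Clang) and `_M_IX86_FP` (MSVC) are macros those compilers predefine; `kIsX86` and `kSse2Enabled` are names made up for this illustration. It only confirms that SSE2 is enabled, not which unit the compiler actually picks for float math (on 32-bit GCC you may also need `-mfpmath=sse`).

```cpp
// Compile-time check that a 32-bit x86 build has SSE2 enabled.
// kIsX86 and kSse2Enabled are illustrative names, not part of any library.
#if defined(__x86_64__) || defined(_M_X64)
constexpr bool kIsX86 = true;
constexpr bool kSse2Enabled = true;    // SSE2 is part of the 64-bit x86 baseline
#elif defined(__i386__) || defined(_M_IX86)
constexpr bool kIsX86 = true;
  #if defined(__SSE2__) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
constexpr bool kSse2Enabled = true;    // e.g. -msse2 (GCC/Clang) or /arch:SSE2 (MSVC)
  #else
constexpr bool kSse2Enabled = false;   // float math would fall back to the x87 FPU
  #endif
#else
constexpr bool kIsX86 = false;         // not x86: this particular check does not apply
constexpr bool kSse2Enabled = false;
#endif

// Fail the build early instead of discovering x87 denormal behaviour at runtime.
static_assert(!kIsX86 || kSse2Enabled,
              "32-bit x86 build without SSE2: FTZ/DAZ would not affect x87 code");
```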
- KVRian
- 653 posts since 4 Apr, 2010
Mystran covered it, but if you want to see an example, scroll down to the two images in my article here:
https://www.earlevel.com/main/2019/04/1 ... denormals/
This was on my older computer (2009 Mac Pro, Nehalem), but as you can see the processor hit is 3x when playing nothing, if denormal protection is turned off. That's not nearly as bad as the old Pentium 4, IIRC, but can still murder a plugin.
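To illustrate why "playing nothing" is the bad case, here is a small sketch (not taken from the article): once the input stops, any feedback state that keeps getting multiplied by a coefficient below 1 decays into the denormal range. `stepsUntilNotNormal` is a hypothetical helper name; it counts how many multiplications that takes.

```cpp
#include <cmath>

// Hypothetical helper: multiplies `value` by `decay` repeatedly and returns
// the number of multiplications after which the result is no longer a normal
// float (subnormal, or zero if flush-to-zero is already active).
// Returns -1 if that does not happen within maxSteps.
int stepsUntilNotNormal(float value, float decay, int maxSteps = 10000)
{
    for (int n = 1; n <= maxSteps; ++n)
    {
        value *= decay; // what a feedback path does to its state on silent input
        const int cls = std::fpclassify(value);
        if (cls == FP_SUBNORMAL || cls == FP_ZERO)
            return n;
    }
    return -1;
}
```

Starting from 1.0f with a decay of 0.5f, the value leaves the normal range after 127 multiplications, since the smallest normal float is 2^-126. At audio sample rates that is a tiny fraction of a second.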
My audio DSP blog: earlevel.com
- KVRAF
- 8476 posts since 12 Feb, 2006 from Helsinki, Finland
earlevel wrote: Tue Jan 04, 2022 10:47 pm
This was on my older computer (2009 Mac Pro, Nehalem), but as you can see the processor hit is 3x when playing nothing, if denormal protection is turned off. That's not nearly as bad as the old Pentium 4, IIRC, but can still murder a plugin.
I literally once had to (on an old computer, can't remember which CPU it was) hold the power button until it forced a power-off to get the computer to recover from some code that was doing a bit too much denormal arithmetic ... so yeah, modern computers are better, but you don't need a whole lot to be just "better" in this case.
- KVRian
- 653 posts since 4 Apr, 2010
mystran wrote: Tue Jan 04, 2022 11:08 pm
I literally had to once (on an old computer, can't remember which CPU it was) hold the power button until force power-off to get a computer to recover from some code that was doing a bit too much denormal arithmetics ... so yeah, modern computers are better, but you don't need a whole lot to be just "better" in this case.
Similarly, when I first worked on native DSP code, I was aware of the issue but hadn't yet addressed it. Everything ran fine on my PowerPC Mac, which had a modest penalty for denormals. I knew the Pentium 4 was considerably worse, but it still caught me off guard when I gave it the first run on Windows with a P4: it locked up completely and immediately...
- KVRAF
- 1752 posts since 2 Jul, 2018
The approach below looks interesting. He uses an object that is destroyed as soon as it goes out of scope. However, I am not sure how badly creating and destroying objects affects performance. Maybe inlining the code and restoring manually would be better.
He also does not seem to restore the original register content, so it can collide with the denormal settings of the DAW and/or other plugins.
https://github.com/rcliftonharvey/rchundenormal
https://www.tone2.com
Our award-winning synthesizers offer true high-end sound quality.
- KVRist
- 362 posts since 1 Apr, 2009 from Hannover, Germany
JUCE has a very nice scoped crossplatform implementation here (class ScopedNoDenormals):
https://github.com/juce-framework/JUCE/ ... erations.h
And no, creating and destroying such an object on the stack does not affect performance (no overhead compared to calling set/unset functions manually, it really just calls the constructor and destructor). It'd be a bad idea to create it on the heap using new/delete though, because that'll allocate memory in the realtime thread, which we don't do around here.
If you want to make sure it can be inlined because performance, you can implement it header-only. But that's probably overkill for something that is done once per process block.
- KVRAF
- 1752 posts since 2 Jul, 2018
I haven't investigated the JUCE code in detail, but it seems that none of the solutions restores the old register status at the end. This is a no-go in the assembler world, since it can have unexpected side effects on other software/plugins/DAW.
I suggest this solution:
#include <xmmintrin.h>

// Call this at the beginning of your processing block.
// Returns the previous MXCSR value so it can be restored afterwards.
inline unsigned int disableDenormals()
{
    const unsigned int maskFTZ = 0x8000; // Mask to switch FLUSH TO ZERO mode
    const unsigned int maskDAZ = 0x0040; // Mask to switch DENORMALS ARE ZERO mode
    const unsigned int oldRegisterStatus = _mm_getcsr();
    _mm_setcsr(oldRegisterStatus | maskFTZ | maskDAZ); // set both bits in one write
    return oldRegisterStatus;
}

// Recover the old register status at the end of your processing block.
inline void recoverOldDenormalsRegisterStatus(unsigned int oldRegisterStatus)
{
    _mm_setcsr(oldRegisterStatus);
}

void myPlugin::processReplacing (float **inputs, float **outputs, VstInt32 sampleFrames)
{
    unsigned int oldRegisterStatus = disableDenormals();
    ...
    //process your stuff here
    ...
    recoverOldDenormalsRegisterStatus(oldRegisterStatus);
}
- KVRist
- 362 posts since 1 Apr, 2009 from Hannover, Germany
It does:
ScopedNoDenormals::ScopedNoDenormals() noexcept
{
   #if JUCE_USE_SSE_INTRINSICS || (JUCE_USE_ARM_NEON || defined (__arm64__) || defined (__aarch64__))
    #if JUCE_USE_SSE_INTRINSICS
    intptr_t mask = 0x8040;
    #else /*JUCE_USE_ARM_NEON*/
    intptr_t mask = (1 << 24 /* FZ */);
    #endif
    fpsr = FloatVectorOperations::getFpStatusRegister();
    FloatVectorOperations::setFpStatusRegister (fpsr | mask);
   #endif
}
ScopedNoDenormals::~ScopedNoDenormals() noexcept
{
   #if JUCE_USE_SSE_INTRINSICS || (JUCE_USE_ARM_NEON || defined (__arm64__) || defined (__aarch64__))
    FloatVectorOperations::setFpStatusRegister (fpsr);
   #endif
}
- KVRAF
- 8476 posts since 12 Feb, 2006 from Helsinki, Finland
This is pretty much a canonical example of a situation where you really want to use a RAII wrapper (eg. similar to the JUCE one).
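For reference, a minimal x86-only sketch of such a wrapper, built on the same `_mm_getcsr`/`_mm_setcsr` calls as the manual version earlier in the thread. This is not the JUCE implementation: `ScopedFlushDenormals`, `readCsr`, `writeCsr`, `kHasMxcsr`, `kFlushMask` and `processBlock` are hypothetical names, and on non-SSE targets the sketch simply compiles to a no-op.

```cpp
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
  #include <xmmintrin.h>
  constexpr bool kHasMxcsr = true;
  inline unsigned int readCsr()             { return _mm_getcsr(); }
  inline void writeCsr(unsigned int value)  { _mm_setcsr(value); }
#else
  // Assumption: no MXCSR on this target, so the guard below does nothing.
  constexpr bool kHasMxcsr = false;
  inline unsigned int readCsr()             { return 0; }
  inline void writeCsr(unsigned int)        {}
#endif

constexpr unsigned int kFlushMask = 0x8000u | 0x0040u; // FTZ | DAZ

// Sets FTZ/DAZ on construction and restores the saved register on destruction,
// so every exit path from the enclosing scope puts the old value back.
class ScopedFlushDenormals
{
public:
    ScopedFlushDenormals() noexcept : saved(readCsr()) { writeCsr(saved | kFlushMask); }
    ~ScopedFlushDenormals() noexcept                   { writeCsr(saved); }

    ScopedFlushDenormals(const ScopedFlushDenormals&) = delete;
    ScopedFlushDenormals& operator=(const ScopedFlushDenormals&) = delete;

private:
    unsigned int saved; // register value found on entry
};

// Hypothetical processing callback: the guard lives on the stack for one block.
void processBlock(float* samples, int numSamples)
{
    ScopedFlushDenormals guard;
    for (int i = 0; i < numSamples; ++i)
        samples[i] *= 0.5f; // placeholder for the real DSP
}
```

The point of the wrapper over the manual set/restore pair is that an early return or an exception cannot skip the restore.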
