Table Lookup for tanh() versus other solutions
-
- KVRAF
- 2458 posts since 3 Oct, 2002 from SF CA USA NA Earth
-
- KVRAF
- 2875 posts since 28 Jan, 2004 from Da Nang, Vietnam
Are you talking about some kind of auto-vectorization? I think if you used the Accelerate framework instead of explicit intrinsics you shouldn't have to do anything more than recompile.
Borogove wrote: I am now hoping in vain that Apple gets a relatively transparent vectorized Clang/LLVM solution in place before they switch their low end laptops to ARM.
-
- KVRAF
- 2458 posts since 3 Oct, 2002 from SF CA USA NA Earth
f**k recompiling. f**k it right in the ear. With a rusty harpoon.
kuniklo wrote: Are you talking about some kind of auto-vectorization? I think if you used the Accelerate framework instead of explicit intrinsics you shouldn't have to do anything more than recompile.
Borogove wrote: I am now hoping in vain that Apple gets a relatively transparent vectorized Clang/LLVM solution in place before they switch their low end laptops to ARM.
It's 2011.
- KVRAF
- Topic Starter
- 3426 posts since 15 Nov, 2006 from Pacific NW
If Apple switches over to ARM, my guess is that recompiling will be the least of our worries. Having a laptop platform that none of the existing DAWs or plugins work on would allow Apple to institute all sorts of exciting new policies about what and how things can run on these computers.
Borogove wrote: f**k recompiling. f**k it right in the ear. With a rusty harpoon.
kuniklo wrote: Are you talking about some kind of auto-vectorization? I think if you used the Accelerate framework instead of explicit intrinsics you shouldn't have to do anything more than recompile.
Borogove wrote: I am now hoping in vain that Apple gets a relatively transparent vectorized Clang/LLVM solution in place before they switch their low end laptops to ARM.
But I digress. I'm currently working on my "higher quality" inverse square root sigmoid code, that uses a single iteration of the Newton-Raphson thing. I'll post the code once it is working.
I'm not saying that I'm going to use the higher quality version in my new plugin. Seriously, 12 bits of accuracy might end up providing some very nice artifacts in my feedback loops. I was thinking about sticking a quantizer in there, but if I could get some artifacts for free, I'll take them.
I do wonder what the actual SNR of the ISRS code is without the Newton iterations, and whether this translates into noise that would be audible, or just some deviation from the "ideal" curve. If the SSE reciprocal square root instruction uses any form of interpolation for its table look up, then the curve would still have a relatively smooth slope, even if things were piecewise continuous. Of course, it could be a very small table with interpolation being used. I just don't have a feel for how the accuracy of the reciprocal square root function translates into output noise.
Sean Costello
-
- KVRAF
- 2458 posts since 3 Oct, 2002 from SF CA USA NA Earth
There's no way Apple puts out an ARM laptop without a transparent emulation layer to run existing apps. Performance may suffer, but DAWs will run. They've done it beautifully in the past, and they get better at it each time. The only question is whether most apps get emulated LLVM-to-ARM or x86-to-ARM.
valhallasound wrote: If Apple switches over to ARM, my guess is that recompiling will be the least of our worries. Having a laptop platform that none of the existing DAWs or plugins work on, would allow Apple to institute all sorts of exciting new policies about what and how things can run on these computers.
- KVRAF
- Topic Starter
- 3426 posts since 15 Nov, 2006 from Pacific NW
Please check out the following. Small text size necessary to preserve comments.
Borogove wrote: If you support a high-quality mode in your algorithm, you should probably implement a version that does one iteration of Newton-Raphson refinement, which gets you better than 20-bit precision for about 3 times the cost (13-19 cycles instead of 5-6*, depending on architecture, which is still hell of cheap).
Code: Select all
// Calculate a block of the input signal, using the Inverse Square Root Sigmoid
// (http://www.tinygod.com/sigmoid at the bottom of the page)
// implements x / √( x² + 1 )
// accuracy improved via one iteration of Newton Raphson method
inline void ProcessISRSBlockPrecise (float *input, float *output, int blockSize)
{
const float *localin = input;
float *localout = output;
const __m128 theOnes = _mm_set_ps1(1.f);
const __m128 pointFiveVector = _mm_set_ps1(0.5f);
const __m128 threeVector = _mm_set_ps1(3.f);
for(int i=0;i<blockSize;i+=4)
{
__m128 vin = _mm_load_ps(localin + i); // loads input into SSE variable
__m128 vtmp = _mm_mul_ps(vin, vin); // calculate in*in
__m128 vtmp2 = _mm_add_ps(vtmp, theOnes); // add 1.f to in*in
// reciprocal square root part of code,
// with 1 iteration of the Newton-Raphson method
// implements the following C code:
// x = rsqrt_approx(c);
// xprime = 0.5 * x * (3 - x*x*c);
__m128 vrecip = _mm_rsqrt_ps(vtmp2); // estimate of 1/sqrt(in*in + 1.f)
vtmp = _mm_mul_ps(vrecip, vtmp2); // x * c
vtmp2 = _mm_mul_ps(vtmp, vrecip); // x * x * c
vtmp = _mm_sub_ps(threeVector, vtmp2); // 3 - x*x*c
vtmp2 = _mm_mul_ps(vtmp, vrecip); // x * (3 - x*x*c)
vtmp = _mm_mul_ps(vtmp2, pointFiveVector); // 0.5 * x * (3 - x*x*c)
__m128 vresult = _mm_mul_ps(vtmp, vin); // multiply refined 1/sqrt(in*in+1) by in
_mm_store_ps(localout + i, vresult); // place result into output vector
}
}
This seems to work for me. I'm not hearing any audible improvements, but I need to do some more stringent listening. Please let me know if you see any obvious bugs in the code.
Sean Costello
-
- KVRer
- 23 posts since 11 Feb, 2003
I simplified it a bit:
Code: Select all
template <typename FTYPE>
void process(FTYPE* __restrict pout, FTYPE const * __restrict pin, int n)
{
for (int i = 0; i < n; i++)
{
FTYPE input = pin[i];
pout[i] = input * rsqrt_1nri(input * input + 1.0f);
}
}
Code: Select all
float4:
$LL3@process:
00024 0f 28 0c 01 movaps xmm1, XMMWORD PTR [ecx+eax]
00028 0f 28 c1 movaps xmm0, xmm1
0002b 0f 59 c1 mulps xmm0, xmm1
0002e 0f 58 c3 addps xmm0, xmm3
00031 0f 52 d0 rsqrtps xmm2, xmm0
00034 0f 28 f2 movaps xmm6, xmm2
00037 0f 59 f2 mulps xmm6, xmm2
0003a 0f 59 f0 mulps xmm6, xmm0
0003d 0f 28 c4 movaps xmm0, xmm4
00040 0f 5c c6 subps xmm0, xmm6
00043 0f 28 f5 movaps xmm6, xmm5
00046 0f 59 f2 mulps xmm6, xmm2
00049 0f 59 c6 mulps xmm0, xmm6
0004c 0f 59 c1 mulps xmm0, xmm1
0004f 0f 29 00 movaps XMMWORD PTR [eax], xmm0
00052 83 c0 10 add eax, 16 ; 00000010H
00055 4a dec edx
00056 75 cc jne SHORT $LL3@process
float4 VEX:
$LL3@process:
00030 c5 f8 28 04 01 vmovaps xmm0, XMMWORD PTR [ecx+eax]
00035 c5 f8 59 c8 vmulps xmm1, xmm0, xmm0
00039 c5 f0 58 d3 vaddps xmm2, xmm1, xmm3
0003d c5 f8 52 ca vrsqrtps xmm1, xmm2
00041 c5 f0 59 f1 vmulps xmm6, xmm1, xmm1
00045 c5 c8 59 d2 vmulps xmm2, xmm6, xmm2
00049 c5 d8 5c d2 vsubps xmm2, xmm4, xmm2
0004d c5 d0 59 c9 vmulps xmm1, xmm5, xmm1
00051 c5 e8 59 c9 vmulps xmm1, xmm2, xmm1
00055 c5 f0 59 c0 vmulps xmm0, xmm1, xmm0
00059 c5 f8 29 00 vmovaps XMMWORD PTR [eax], xmm0
0005d 83 c0 10 add eax, 16 ; 00000010H
00060 4a dec edx
00061 75 cd jne SHORT $LL3@process
float8 AVX:
$LL3@process:
00030 c5 fc 28 04 01 vmovaps ymm0, YMMWORD PTR [ecx+eax]
00035 c5 fc 59 c8 vmulps ymm1, ymm0, ymm0
00039 c5 f4 58 d3 vaddps ymm2, ymm1, ymm3
0003d c5 fc 52 ca vrsqrtps ymm1, ymm2
00041 c5 f4 59 f1 vmulps ymm6, ymm1, ymm1
00045 c5 cc 59 d2 vmulps ymm2, ymm6, ymm2
00049 c5 dc 5c d2 vsubps ymm2, ymm4, ymm2
0004d c5 d4 59 c9 vmulps ymm1, ymm5, ymm1
00051 c5 ec 59 c9 vmulps ymm1, ymm2, ymm1
00055 c5 f4 59 c0 vmulps ymm0, ymm1, ymm0
00059 c5 fc 29 00 vmovaps YMMWORD PTR [eax], ymm0
0005d 83 c0 20 add eax, 32 ; 00000020H
00060 4a dec edx
00061 75 cd jne SHORT $LL3@process
-
- KVRer
- 27 posts since 17 Feb, 2011
Sorry to resurrect a thread which is getting a little old, but I am learning SSE instructions and this is one of the prime examples I was able to find on this board.
valhallasound wrote: Please check out the following. Small text size necessary to preserve comments.
I'm not sure if this is an actual issue, but shouldn't you use _mm_loadu_ps instead of _mm_load_ps for "// loads input into SSE variable"? I believe that _mm_load_ps requires the address to be 16-byte aligned - and your float pointer localin is not aligned unless input is aligned. I think in order to align you need to do something like
Code: Select all
float *localIn = (float*) _aligned_malloc(blockSize * sizeof(float), 16);
I am also unsure as to why you're using localin and localout at all instead of using input and output directly. Does this have to do with alignment... or something else?
Thanks for any insight,
-Colin
- KVRAF
- Topic Starter
- 3426 posts since 15 Nov, 2006 from Pacific NW
They are. Things would crash horribly in the XP versions of my code if they weren't aligned. Not sure why they would crash, but I have found this in the past.
noisetteaudio wrote: I'm guessing that if it works as-is then either input and output are already aligned
This is more out of habit than anything else. A lot of my code uses the same pointer for the input and output, and sometimes bad things happen if the low-level vector functions overwrite the input vector. I could probably change things in this function without worrying about overwriting stuff - I'll take a second look at it.
noisetteaudio wrote: I am also unsure as to why you're using localin and localout at all instead of using input and output directly. Does this have to do with alignment... or something else?
Sean Costello
-
- KVRer
- 27 posts since 17 Feb, 2011
So the pointers you're passing to the function are allocated somewhere else in a way that ensures they are aligned? Have you found a good cross-platform way to allocate aligned arrays? E.g., _aligned_malloc but GCC-compatible.
FWIW, I have found that on my new MacBook Pro with 10.7, trying to use load instead of loadu (for example) on a non-aligned array will also cause a crash. I bet on different architectures/OSs mileage varies.
-Colin
- KVRAF
- Topic Starter
- 3426 posts since 15 Nov, 2006 from Pacific NW
In my OSX vector code, I declare the following:
noisetteaudio wrote: So the pointers you're passing to the function are allocated somewhere else in a way that ensures they are aligned? Have you found a good cross-platform way to allocate aligned arrays? E.g., _aligned_malloc but GCC-compatible.
Code: Select all
#define _aligned_malloc(N, A) malloc(N)
#define _aligned_free(P) free(P)
The crashes can be kind of helpful, in that they keep bad coding from creeping into other OSes/DAWs that are less tolerant.
noisetteaudio wrote: FWIW, I have found that on my new MacBook Pro with 10.7, trying to use load instead of loadu (for example) on a non-aligned array will also cause a crash. I bet on different architectures/OSs mileage varies.
An example of this: I have written plenty of AUs, that work great in Ableton Live, but crash in Logic. I tend to swear at Logic in these cases, but invariably the crash is due to some code error on my part.
Sean Costello
-
- KVRer
- 27 posts since 17 Feb, 2011
Interesting. I will investigate thoroughly where this crash is coming from, as it shows up quite reliably when I switch from loadu to load without doing alignment - but I believe you that malloc on the Mac is supposed to return 16-byte-aligned memory (although to be fair I am using LLVM GCC right now).
-
- KVRist
- 46 posts since 20 Jun, 2013 from Berkeley, CA
My obsessive nature turned its attention to tanh() recently, and I figured this thread would be as good as any to post an update on what I found. I tried a whole bunch of different things and have come up with what I consider a gold-plated SIMD tanh that is only moderately more expensive than the "good enough for rock'n'roll" one that Sean posted earlier.
First, Urs wondered how much error the recip sqrt estimate would add, and Sean wondered whether it would be artifacts or just a different shape to the curve. The answer is that it's about 70dB down. The error seems to be piecewise linear (a continuous function of the input, in other words), so not too bad with most inputs. But put a 20 Hz sine into it and what you get sounds aliasy and digital. So it is something you could probably get away with but I would not want it for the absolute highest quality work. I would do the Newton-Raphson step, which seems to put the non-harmonic error down well below 120dB.
Second, on the difference between the sqrt based sigmoid and "real" tanh. The wonderful thing about tanh is that the higher harmonics in the spectrum drop off rapidly, so even a small amount of oversampling will make the aliased harmonics all but inaudible (4x would seem to be good enough for just about all practical purposes). The falloff of harmonics is also true for the sqrt-based sigmoid (it is very good), but not quite as much as the real tanh. For example, with a sine wave of amplitude 1, the seventh harmonic is about 10dB higher in the sqrt sigmoid than tanh. This is audible, for example, when putting a 6200 Hz sine in, where the seventh harmonic aliases to 700 Hz at 44.1KHz sampling rate. (Note that this is also a 12.4kHz sine input with 2x oversampling). So, again, sqrt is good, but tanh is a little better.
Fortunately, it's not too bad to compute a moderately high accuracy approximation:
Code: Select all
a = x + 0.16489087 * x**3 + 0.00985468 * x**5
return a / sqrt(1 + a * a)
This function has the same shape as tanh and almost identical harmonic structure. In particular, it doesn't need any clamping for large inputs, as it has the correct asymptotic approach to +/-1. It's within 2e-4 of the real tanh, which is not super impressive from a strictly numerical point of view, but gives some idea how closely it fits. By comparison, x/sqrt(1 + x^2) has a max error of 7.4e-2 as an approximation to tanh (i.e. almost 400 times as much error).
More importantly, the approximation is made out of smooth functions, so whatever small error is left comes out as low harmonics, where it is totally masked by the legitimate harmonic distortion of the input signal. By contrast, I found that the approximations that had abs() or clamping tended to push the error into high harmonics. In all my poking and prodding with sine waves, I was not able to find a case where you could audibly tell the difference between this approximation and calling tanh() directly.
A straightforward SIMD implementation takes 1.04 ns/tanh on my MacBook Pro, which is about 1.7x the time of the sqrt one with the Newton-Raphson step (and still about 8x faster than calling tanhf). So there are likely few applications in which the extra CPU cost of the higher quality approximation would be an issue.
So I'd say this is what you should use when you want the absolute highest quality, though other approximations may well be good enough for most applications.
I could clean up and post my IPython notebook for this if people want to see it.
- KVRAF
- Topic Starter
- 3426 posts since 15 Nov, 2006 from Pacific NW
Cool!
raphx wrote: My obsessive nature turned its attention to tanh() recently
I use the Newton-Raphson step in my release code (ValhallaÜberMod is where I made use of this). Now I'm interested in the aliasing noise you describe without the Newton-Raphson step - I might be able to use that in the future.
That is very interesting. I hadn't done any real studies of the harmonic generation of the sqrt sigmoid.
What is interesting about this approximation is that it is essentially the sqrt sigmoid, with a polynomial used to process the input signal. Is this correct?
If this IS correct, the next question: does this code also benefit from using the Newton-Raphson method? It seems like the same reciprocal square root approximation issues would arise when using SSE2 or other SIMD variants that only provide an approximate reciprocal square root.
The follow-up to the above question: if the tanh() approximation benefits from the Newton-Raphson method, how does its SIMD performance compare to other tanh() approximations?
Sean Costello
- u-he
- 28043 posts since 8 Aug, 2002 from Berlin
Also, hehehe, if a Newton-Raphson step is not required, I wonder if the reciprocal square root estimate is guaranteed to produce the same results on every CPU - I remember that the G5 had a far worse version than the G4.
It does sound like a great method though… I'll check it out shortly and compare it to our Padé-based version. I guess not needing boundary checks will already provide a speed bump.