Table Lookup for tanh() versus other solutions
- KVRAF
- Topic Starter
- 3426 posts since 15 Nov, 2006 from Pacific NW
Hi all,
I'm looking into implementing table lookup as a way of waveshaping in my current plugin. I would use a relatively small table (1024 entries would be fine, with an extra guard point) and bound the input to the table with some form of clipping. I would use linear interpolation, where I can take advantage of my SIMD-optimized float-to-int conversions. However, some platforms would probably require the C code version of the float-to-int (and frac) conversion, or the standard scalar float-to-int conversion that has been available in Intel assembly for over a decade now.
The question is: Good idea, or no? Let's say, hypothetically, that this is being used for tanh() approximation over a limited range, where anything outside of the (-3.f, 3.f) range is clipped to +/-1. How will interpolated table lookups compare in such a situation to the other optimizations that require a divide per sample?
Back in 2000, I used table lookups in Synthbuilder quite a bit, and things seemed to run efficiently. Just wondering if things have changed since then, such that a table lookup is a bad choice today.
Thanks,
Sean Costello
EDIT: Changed thread title to reflect where the thread went. I'm trying to figure out the best tanh() approximation, so the new thread title is more accurate.
EDIT #2: Around page 2, I figure out a different sigmoid nonlinearity solution, that is close to tanh() but with a slower approach to +/-1.f. It is super cheap if implemented with SSE - I've posted SSE intrinsics for both the "accurate" version and a less accurate but super efficient version.
Last edited by valhallasound on Sun Nov 06, 2011 9:14 pm, edited 2 times in total.
-
- KVRer
- 23 posts since 11 Feb, 2003
If you can do 4 in parallel, div is clearly better. But why not just time it?
- KVRAF
- Topic Starter
- 3426 posts since 15 Nov, 2006 from Pacific NW
oskari wrote: If you can do 4 in parallel, div is clearly better. But why not just time it?
Interesting. Tell me more about div with 4 in parallel, if you don't mind sharing. I work with SIMD-friendly block sizes inside my plugins.
Thanks,
Sean Costello
- KVRAF
- Topic Starter
- 3426 posts since 15 Nov, 2006 from Pacific NW
Also, any thoughts about using divide SIMD instructions, as opposed to the reciprocal SIMD instructions?
Sean Costello
-
- KVRer
- 23 posts since 11 Feb, 2003
divps (4 single precision divs) is only 14 cycles on Sandy Bridge and not much more on the previous x86 architectures. It's the _mm_div_ps intrinsic.
- u-he
- 28063 posts since 8 Aug, 2002 from Berlin
You can do a Padé approximation rather quickly in SSE (and Altivec). IIRC it was slightly faster than lookups with clipping and all.
Dunno about that new ARM stuff. Do they have a parallel reciprocal estimate?
- u-he
- 28063 posts since 8 Aug, 2002 from Berlin
oskari wrote: divps (4 single precision divs) is only 14 cycles on Sandy Bridge and not much more on the previous x86 architectures. It's the _mm_div_ps intrinsic.
14 cycles?
Want!
-
- KVRer
- 23 posts since 11 Feb, 2003
Here's what I do in discoDSP Corona (it's the usual Padé approximation):
FTYPE is either float4 or float8, both of which are simple wrappers for the SSE intrinsics.
Code: Select all
template <typename FTYPE>
__noalias __forceinline FTYPE tanh_approx(FTYPE const &_x)
{
FTYPE x = maximum(minimum(_x, 3.0f), -3.0f);
FTYPE x2 = x * x;
return x * (x2 + 27.0f) / (x2 * 9.0f + 27.0f);
}
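For anyone without the float4/float8 wrappers, here is a scalar equivalent of the same Padé approximant (my own rendering of the code above, not the original source):

```c
#include <math.h>

/* Scalar version of the clamped Pade approximant of tanh():
   tanh(x) ~ x * (x^2 + 27) / (9*x^2 + 27), with x clamped to [-3, 3]. */
static float tanh_pade(float x)
{
    if (x < -3.0f) x = -3.0f;
    if (x >  3.0f) x =  3.0f;
    float x2 = x * x;
    return x * (x2 + 27.0f) / (x2 * 9.0f + 27.0f);
}
```

A handy property: the approximant hits exactly +/-1 at x = +/-3 (3 * 36 / 108 = 1), so the clamp introduces no discontinuity; the worst-case error against the true tanh() is a couple of percent, around x of roughly 1.5 to 2.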
-
- KVRer
- 23 posts since 11 Feb, 2003
Code: Select all
00000 0f 28 01 movaps xmm0, XMMWORD PTR [ecx]
00 00 minps xmm0, XMMWORD PTR __xmm@10
00 00 maxps xmm0, XMMWORD PTR __xmm@11
00 00 movaps xmm1, XMMWORD PTR __xmm@12
00 00 movaps xmm3, XMMWORD PTR __xmm@13
0001f 0f 28 d0 movaps xmm2, xmm0
00022 0f 59 d0 mulps xmm2, xmm0
00025 0f 59 da mulps xmm3, xmm2
00028 0f 58 d9 addps xmm3, xmm1
0002b 0f 58 ca addps xmm1, xmm2
0002e 0f 59 c8 mulps xmm1, xmm0
00031 0f 5e cb divps xmm1, xmm3
00034 0f 29 08 movaps XMMWORD PTR [eax], xmm1
- u-he
- 28063 posts since 8 Aug, 2002 from Berlin
Yep, same thing here, but we do reciprocal and a round or two of Newton-Raphson. Might benchmark that.
-
Christian Budde
- KVRAF
- 1538 posts since 14 May, 2004 from Europe
If you wonder which approximation will get you what you need, have a look at my old forum thread: Tanh approximations
- KVRAF
- Topic Starter
- 3426 posts since 15 Nov, 2006 from Pacific NW
Urs wrote: Yep, same thing here, but we do reciprocal and a round or two of Newton-Raphson. Might benchmark that.
I had no idea what the hell this Newton-Raphson thing was. So I used Google, and found this page:
http://www.virtualdub.org/blog/pivot/entry.php?id=229
The relevant info:
The usual way to improve the accuracy of the reciprocal and reciprocal square root operations is by iteration through Newton's Method. For the reciprocal, it looks like this:
x = reciprocal_approx(c);
x' = x * (2 - x * c);
...and for the reciprocal square root, it looks like this:
x = rsqrt_approx(c);
x' = 0.5 * x * (3 - x*x*c);
This seems like it could take advantage of the RSQRTPS instruction in SSE to make things VERY fast. For the Mac, I will probably just call the vvrsqrtf function in vecLib, so I can keep things working on PPC as well. I'll hand code a reciprocal square root function using SSE for the PC.
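A quick scalar sketch of those two refinement steps, to see how fast the error shrinks (function names are mine; in SSE the initial estimates would come from RCPPS/RSQRTPS instead of being passed in):

```c
/* One Newton-Raphson step refining y ~ 1/c:        y' = y * (2 - y*c) */
static float recip_nr(float y, float c)
{
    return y * (2.0f - y * c);
}

/* One Newton-Raphson step refining y ~ 1/sqrt(c):  y' = 0.5*y*(3 - y*y*c) */
static float rsqrt_nr(float y, float c)
{
    return 0.5f * y * (3.0f - y * y * c);
}
```

Each step roughly squares the relative error, so one round applied to the roughly 12-bit RCPPS/RSQRTPS estimates gets you to around 22-23 good bits, close to full single precision.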
I've also decided against using tanh(), in favor of the Inverse Square Root Sigmoid function posted by Russell Borogove at
http://www.tinygod.com/sigmoid/
This is fairly simple:
out = in / sqrt(in*in + 1.f);
For low-level signals it tracks tanh(), in that the gain is close to 1 for signals with amplitudes near zero. It is bounded by +/-1, which is useful, and it converges towards +/-1 more slowly than tanh() does. Most importantly, it should be easy to implement with a few SIMD multiplies and a SIMD-optimized reciprocal square root function as described above.
Thanks to everyone who helped out here. Hopefully the above info will be useful for people implementing soft clipping functions in their own projects.
Sean Costello
-
- KVRer
- 19 posts since 30 Apr, 2005
If you want to compute the full tanh() using SIMD instructions, the exp_ps function here might be useful:
http://gruntthepeon.free.fr/ssemath/
- u-he
- 28063 posts since 8 Aug, 2002 from Berlin
Hey Sean,
That's a good one... might be fast as a whip...
I just wonder how much error/noise the reciprocal square root estimate will add?
-
- KVRian
- 639 posts since 19 Apr, 2007 from Frankfurt, Germany
valhallasound wrote: out = in / sqrt(in*in + 1.f);
Be aware that the sqrt is even more expensive than the div.
http://varietyofsound.wordpress.com/201 ... -fraction/
SSE code included, uses around 20 cycles.
peace,
bootsie