Table Lookup for tanh() versus other solutions
- KVRAF
- Topic Starter
- 3426 posts since 15 Nov, 2006 from Pacific NW
Hi all,
I'm looking into implementing table lookup as a way of waveshaping in my current plugin. I would use a relatively small table (1024 entries would be fine, with an extra guard point) and bound the input to the table with some form of clipping. I would use linear interpolation, where I can take advantage of my SIMD-optimized float-to-int conversions. However, some platforms would probably require the C code version of the float-to-int (and frac) conversion, or the standard scalar float-to-int conversion that has been available in Intel assembly for over a decade now.
The question is: Good idea, or no? Let's say, hypothetically, that this is being used for tanh() approximation over a limited range, where anything outside of the (-3.f, 3.f) range is clipped to +/-1. How will interpolated table lookups compare in such a situation to the other optimizations that require a divide per sample?
Back in 2000, I used table lookups in Synthbuilder quite a bit, and things seemed to run efficiently. Just wondering if things have changed since then, such that a table lookup is a bad choice today.
Thanks,
Sean Costello
EDIT: Changed thread title to reflect where the thread went. I'm trying to figure out the best tanh() approximation, so the new thread title is more accurate.
EDIT #2: Around page 2, I figure out a different sigmoid nonlinearity solution, that is close to tanh() but with a slower approach to +/-1.f. It is super cheap if implemented with SSE - I've posted SSE intrinsics for both the "accurate" version and a less accurate but super efficient version.
Last edited by valhallasound on Sun Nov 06, 2011 9:14 pm, edited 2 times in total.
-
- KVRer
- 23 posts since 11 Feb, 2003
If you can do 4 in parallel, div is clearly better. But why not just time it?
- KVRAF
- Topic Starter
- 3426 posts since 15 Nov, 2006 from Pacific NW
oskari wrote: If you can do 4 in parallel, div is clearly better. But why not just time it?
Interesting. Tell me more about div with 4 in parallel, if you don't mind sharing. I work with SIMD-friendly block sizes inside my plugins.
Thanks,
Sean Costello
- KVRAF
- Topic Starter
- 3426 posts since 15 Nov, 2006 from Pacific NW
Also, any thoughts about using divide SIMD instructions, as opposed to the reciprocal SIMD instructions?
Sean Costello
-
- KVRer
- 23 posts since 11 Feb, 2003
divps (4 single precision divs) is only 14 cycles on Sandy Bridge and not much more on the previous x86 architectures. It's the _mm_div_ps intrinsic.
- u-he
- 28063 posts since 8 Aug, 2002 from Berlin
You can do a Padé approximation rather quickly in SSE (and Altivec). IIRC it was slightly faster than lookups with clipping and all.
Dunno about that new ARM stuff. Do they have a parallel reciprocal estimate?
- u-he
- 28063 posts since 8 Aug, 2002 from Berlin
oskari wrote: divps (4 single precision divs) is only 14 cycles on Sandy Bridge and not much more on the previous x86 architectures. It's the _mm_div_ps intrinsic.
14 cycles?
Want!
-
- KVRer
- 23 posts since 11 Feb, 2003
Here's what I do in discoDSP Corona (it's the usual Padé approximation):
FTYPE is either float4 or float8, both of which are simple wrappers for the SSE intrinsics.
Code: Select all
template <typename FTYPE>
__noalias __forceinline FTYPE tanh_approx(FTYPE const &_x)
{
FTYPE x = maximum(minimum(_x, 3.0f), -3.0f);
FTYPE x2 = x * x;
return x * (x2 + 27.0f) / (x2 * 9.0f + 27.0f);
}
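For anyone without the float4/float8 wrappers, here is a scalar equivalent of the same Padé approximant (my own rendering of the code above, not the original source):

```c
#include <math.h>

/* Scalar version of the clamped Pade approximant of tanh():
   tanh(x) ~ x * (x^2 + 27) / (9*x^2 + 27), with x clamped to [-3, 3]. */
static float tanh_pade(float x)
{
    if (x < -3.0f) x = -3.0f;
    if (x >  3.0f) x =  3.0f;
    float x2 = x * x;
    return x * (x2 + 27.0f) / (x2 * 9.0f + 27.0f);
}
```

A handy property: the approximant hits exactly +/-1 at x = +/-3 (3 * 36 / 108 = 1), so the clamp introduces no discontinuity; the worst-case error against the true tanh() is a couple of percent, around x of roughly 1.5 to 2.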
-
- KVRer
- 23 posts since 11 Feb, 2003
Code: Select all
00000 0f 28 01 movaps xmm0, XMMWORD PTR [ecx]
00 00 minps xmm0, XMMWORD PTR __xmm@10
00 00 maxps xmm0, XMMWORD PTR __xmm@11
00 00 movaps xmm1, XMMWORD PTR __xmm@12
00 00 movaps xmm3, XMMWORD PTR __xmm@13
0001f 0f 28 d0 movaps xmm2, xmm0
00022 0f 59 d0 mulps xmm2, xmm0
00025 0f 59 da mulps xmm3, xmm2
00028 0f 58 d9 addps xmm3, xmm1
0002b 0f 58 ca addps xmm1, xmm2
0002e 0f 59 c8 mulps xmm1, xmm0
00031 0f 5e cb divps xmm1, xmm3
00034 0f 29 08 movaps XMMWORD PTR [eax], xmm1
- u-he
- 28063 posts since 8 Aug, 2002 from Berlin
Yep, same thing here, but we do reciprocal and a round or two of Newton-Raphson. Might benchmark that.
-
Christian Budde
- KVRAF
- 1538 posts since 14 May, 2004 from Europe
If you wonder which approximation will get you what you need, have a look at my old forum thread: Tanh approximations
- KVRAF
- Topic Starter
- 3426 posts since 15 Nov, 2006 from Pacific NW
Urs wrote: Yep, same thing here, but we do reciprocal and a round or two of Newton-Raphson. Might benchmark that.
I had no idea what the hell this Newton-Raphson thing was. So I used Google, and found this page:
http://www.virtualdub.org/blog/pivot/entry.php?id=229
The relevant info:
The usual way to improve the accuracy of the reciprocal and reciprocal square root operations is by iteration through Newton's Method. For the reciprocal, it looks like this:
x = reciprocal_approx(c);
x' = x * (2 - x * c);
...and for the reciprocal square root, it looks like this:
x = rsqrt_approx(c);
x' = 0.5 * x * (3 - x*x*c);
This seems like it could take advantage of the RSQRTPS instruction in SSE to make things VERY fast. For the Mac, I will probably just call the vvrsqrtf function in vecLib, so I can keep things working on PPC as well. I'll hand code a reciprocal square root function using SSE for the PC.
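A quick scalar sketch of those two refinement steps, to see how fast the error shrinks (function names are mine; in SSE the initial estimates would come from RCPPS/RSQRTPS instead of being passed in):

```c
/* One Newton-Raphson step refining y ~ 1/c:        y' = y * (2 - y*c) */
static float recip_nr(float y, float c)
{
    return y * (2.0f - y * c);
}

/* One Newton-Raphson step refining y ~ 1/sqrt(c):  y' = 0.5*y*(3 - y*y*c) */
static float rsqrt_nr(float y, float c)
{
    return 0.5f * y * (3.0f - y * y * c);
}
```

Each step roughly squares the relative error, so one round applied to the roughly 12-bit RCPPS/RSQRTPS estimates gets you to around 22-23 good bits, close to full single precision.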
I've also decided against using tanh(), in favor of the Inverse Square Root Sigmoid function posted by Russell Borogove at
http://www.tinygod.com/sigmoid/
This is fairly simple:
out = in / sqrt(in*in + 1.f);
For low-level signals it tracks tanh(), in that the gain is close to 1 for signals with amplitudes near zero. It is bounded by +/-1, which is useful, and it converges towards +/-1 more slowly than tanh() does. Most importantly, it should be easy to implement with a few SIMD multiplies and a SIMD-optimized reciprocal square root function as described above.
Thanks to everyone who helped out here. Hopefully the above info will be useful for people implementing soft clipping functions in their own projects.
Sean Costello
-
- KVRer
- 19 posts since 30 Apr, 2005
If you want to compute the full tanh() using SIMD instructions, the exp_ps function here might be useful:
http://gruntthepeon.free.fr/ssemath/
- u-he
- 28063 posts since 8 Aug, 2002 from Berlin
Hey Sean,
That's a good one... might be fast as a whip...
I just wonder how much error/noise the reciprocal square root estimate will add?
-
- KVRian
- 639 posts since 19 Apr, 2007 from Frankfurt, Germany
valhallasound wrote: out = in / sqrt(in*in + 1.f);
Be aware that the sqrt is even more expensive than the div.
http://varietyofsound.wordpress.com/201 ... -fraction/
SSE code included, uses around 20 cycles.
peace,
bootsie