KVR Audio

random_id · Post by **random_id** » Thu Feb 06, 2014 11:08 pm

Do you know how this compares with Aleksey Vaneev's fasttanh from this post? The code is:

inline double vox_fasttanh2( const double x )
{
   const double ax = fabs( x );
   const double x2 = x * x;

   return( x * ( 2.45550750702956 + 2.45550750702956 * ax +
      ( 0.893229853513558 + 0.821226666969744 * ax ) * x2 ) /
      ( 2.44506634652299 + ( 2.44506634652299 + x2 ) *
      fabs( x + 0.814642734961073 * x * ax )));
}

valhallasound · Post by **valhallasound** » Thu Feb 06, 2014 11:19 pm

Urs wrote:Also, hehehe, if a Newton-Raphson step is not required, I wonder if the reciprocal square root estimate is guaranteed to produce same results on every CPU - I remember that the G5 has a far worse version than the G4.

This is a good question. Also, I didn't know about the G5 versus G4. I only tested on a G4 machine, and my Intel machines. I guess the more important question in 2014 would be how the Intel SSE2(?) instruction performs, versus whatever the corresponding ARM NEON instruction is.

Sean Costello

valhallasound · Post by **valhallasound** » Thu Feb 06, 2014 11:21 pm

random_id wrote:Do you know how this compares with Aleksey Vaneev's fasttanh from this post? The code is:

inline double vox_fasttanh2( const double x )
{
   const double ax = fabs( x );
   const double x2 = x * x;

   return( x * ( 2.45550750702956 + 2.45550750702956 * ax +
      ( 0.893229853513558 + 0.821226666969744 * ax ) * x2 ) /
      ( 2.44506634652299 + ( 2.44506634652299 + x2 ) *
      fabs( x + 0.814642734961073 * x * ax )));
}

I suppose that it depends on the speed of the divide in the code you posted, versus the speed of the reciprocal square root function in SIMD.

Sean Costello

Urs · Post by **Urs** » Fri Feb 07, 2014 12:10 am

I think the reciprocal square root estimate is 1 cycle latency while divide is still typically 20+ cycles, no?

I could be wrong though...

valhallasound · Post by **valhallasound** » Fri Feb 07, 2014 12:13 am

Urs wrote:I think the reciprocal square root estimate is 1 cycle latency while divide is still typically 20+ cycles, no?

I could be wrong though...

This is what I thought, as well. I don't know for sure if the reciprocal square root estimate is 1 cycle. I sure hope this is the case.

Sean Costello

raphx · Post by **raphx** » Fri Feb 07, 2014 12:18 am

@valhalladsp: your understanding is correct, it's a simple shaping polynomial followed by the sqrt sigmoid. It's mix-or-match whether you want to do the NR step. If you're ok with a 70dB noise floor, skip it, otherwise include it. It costs ballpark of 0.38ns (that's right, 380 picoseconds) per tanh. I included it in my performance evaluation - ie, my tanh with the NR step is 1.7x the cost of sqrt sigmoid with NR step.

@Urs: what these recip approx instructions actually do is not documented. It's easy to believe that you could get decent results on one chip, then test on another and find that it sounds much worse. Plus, results won't in general be consistent. Another reason to include the NR step, imho.

I didn't test against a plain 'ol Pade, but I suspect that the total cost is in the same ballpark - you need to do a small amount of polynomial evaluation followed by a reciprocal-ish operation followed by an NR step to provide good noise behavior. The bounds clamping on the Pade will cost extra, as you point out, so my guess is that the total time is pretty much the same.

@random_id: I just tried the vox_fasttanh2. Because of the abs()'s, the falloff at higher harmonics is nowhere near as good - in fact it's comparable to just the sqrt one. But the numerical error is not bad, it's 4.3e-4 (ie about double the error of mine). In addition, the vox one doesn't go to +/-1 in the limits. Speed is almost identical - about 1ns/tanh.

raphx · Post by **raphx** » Fri Feb 07, 2014 12:30 am

@Urs: the definitive source for instruction timings is http://www.agner.org/optimize/instruction_tables.pdf. On Ivy Bridge (the chip in my MacBook Pro), rsqrtps is indeed 1 cycle recip throughput, while divps is 7 cycles. Note that divps has much more stringent rounding behavior - it's overkill for audio. You're almost always going to get better performance for estimated reciprocal followed by an NR step, which is what I did for all my comparative tests.

Urs · Post by **Urs** » Fri Feb 07, 2014 12:46 am

Top notch info, raphx!

Thanks,

- Urs

Richard_Synapse · Post by **Richard_Synapse** » Sat Feb 08, 2014 11:44 pm

raphx wrote:@Urs: the definitive source for instruction timings is http://www.agner.org/optimize/instruction_tables.pdf. On Ivy Bridge (the chip in my MacBook Pro), rsqrtps is indeed 1 cycle recip throughput, while divps is 7 cycles. Note that divps has much more stringent rounding behavior - it's overkill for audio. You're almost always going to get better performance for estimated reciprocal followed by an NR step, which is what I did for all my comparative tests.

Great info, thanks! I wonder if I interpret the data correctly though:

- On Haswell, multiplying floats via SSE can be faster than adding them?
- Some AVX instructions (e.g. division) seem to have about twice the cost compared to their SSE equivalents?

I thought the bottleneck of AVX had to do with fetching/storing data, but now I'm not sure anymore.

Richard

raphx · Post by **raphx** » Sun Feb 09, 2014 1:20 am

Richard:

Yes, in particular the fused multiply-accumulate speed on the Haswell is insane. A single cycle on a single core can dispatch two 256 bit wide FMA's, each of which is 8 multiplies and 8 adds, for a total of 32 floating point ops per cycle, or 100 GFLOPs at typical clock speeds. This Stack Overflow answer has a really detailed explanation: http://stackoverflow.com/questions/1565 ... 2-avx-avx2

One consequence is that it's actually faster to multiply by one and add (using FMA) than it is to just add. This is one reason to tend towards matrix math, where you just stick a 1.0 in your coefficient matrix, over the old ways. One way to think about IIR direct forms is that they're a similarity transform on the state space matrix to make as many of the coefficients 0 or 1 as possible, so the multiplies can be skipped. But in a SIMD world, that doesn't buy you anything, and it costs you numerical precision.

Note that actually making use of all this FMA muscle is hard, as the latency is 5 cycles. So you need to have 10 of these in the pipeline (ie 160 floating point ops) at the same time to achieve maximal utilization. That is not easy, and another reason to prefer matrix operations.

The double cost of the AVX instructions is for the 256 bit variant only. So the cost per vector of 4 is the same as the SSE variant.

I also find all these variations of SIMD architectures to be quite a pain. These days, to get top performance, you need 32 and 64 bit ARM versions, an SSE2 version (the stuff added in SSE3 and SSE4 is not very important for audio), an AVX version, and an AVX + FMA version for Haswell. On the desktop side, it's quite understandable to just write SSE2 and not worry about it too much.

andy-cytomic · Post by **andy-cytomic** » Fri Nov 19, 2021 6:30 am

raphx wrote: ↑Thu Feb 06, 2014 7:45 am ...
Fortunately, it's not too bad to compute a moderately high accuracy approximation:
a = x + 0.16489087 * x**3 + 0.00985468 * x**5
return a / sqrt(1 + a * a)
This function has the same shape as tanh and almost identical harmonic structure. In particular, it doesn't need any clamping for large inputs, as it has the correct asymptotic approach to +/-1. It's within 2e-4 of the real tanh, which is not super impressive from a strictly numerical point of view, but gives some idea how closely it fits. By comparison, x/sqrt(1 + x^2) has a max error 7.4e-2 as an approximation to tanh (ie almost 400 times as much error).
...

I think for most approximations you're better off minimising the relative error of the function. Also a hyperbolic identity is useful here to show what is going on:

tanh(x) = sinh(x)/cosh(x) = sinh(x)/sqrt(cosh(x)^2) = sinh(x)/sqrt(sinh(x)^2 + 1)

So we're approximating sinh(x) with our polynomial and generating tanh(x) from that.

Here are a few approximations of this form, starting with 5th order, and going to 9th order that match the derivative at x = 0 (something that minimax approximations do not do a lot of the time) as well as minimising the relative error:

If you don't want to use a fast reciprocal sqrt with 1 Newton Raphson step then a straight out rational approximation can also be good, but then you have to tidy up the endpoints by matching the derivative(s) at the endpoints to keep things smooth. If anyone is interested in some regular rational approximations please let me know.

2DaT · Post by **2DaT** » Fri Nov 19, 2021 8:34 am

andy-cytomic wrote: ↑Fri Nov 19, 2021 6:30 am If you don't want to use a fast reciprocal sqrt with 1 Newton Raphson step then a straight out rational approximation can also be good, but then you have to tidy up the endpoints by matching the derivative(s) at the endpoints to keep things smooth. If anyone is interested in some regular rational approximations please let me know.

Straightforward relative error minimax is a very viable strategy for tanh approximations. Check out this version:
viewtopic.php?p=7503081#p7503081

Derivatives are not as big a concern for precise approximations - if the function is approximated within a few ulps, the derivatives are going to be approximated decently as well.

On modern processors single precision division is way better than rcpss+NR. In some cases it can be as cheap as ordinary arithmetic, as long as you mix divs with other operations.

mystran · Post by **mystran** » Fri Nov 19, 2021 12:15 pm

2DaT wrote: ↑Fri Nov 19, 2021 8:34 am On modern processors single precision division is way better than rcpss+NR. In some cases it can be as cheap as ordinary arithmetic, as long as you mix divs with other operations.

Has RCPSS+NR ever been profitable?

RSQRTSS+NR can be profitable over sqrt+division if that's what you need (eg. vector normalization). Whether it's profitable when you're free to choose another type of approximation, I don't know.

As for the basic strategy of approximating sin/sinh and using rsqrt to get tan/tanh, I remember starting a thread about this a few years ago and I'd argue that it's a great strategy if you truly don't care about accuracy too much, in which case you can use something like a 3rd order poly to approximate your sin/sinh and skip the NR for the rsqrt.... but I feel like you're right and it's better to just minmax a rational function if you want something better.

mystran · Post by **mystran** » Fri Nov 19, 2021 12:17 pm

Also I feel someone (Vadim?) posted some derivative matched approximation a while ago too.. but I don't have a link 'cos I just use either the super-cheap approximations or 2DaT's approx depending on how much I happen to care at any given moment.

juha_p · Post by **juha_p** » Fri Nov 19, 2021 12:18 pm

Hmm... just curious, but why include the last term (e-6x⁹) in the equations ...

Sorry, did not look carefully, you seem to go > 1.

Table Lookup for tanh() versus other solutions