- KVRist
- 296 posts since 30 Apr, 2006, from lancaster, pa

Do you know how this compares with Aleksey Vaneev's fasttanh from this post? The code is:

```cpp
inline double vox_fasttanh2( const double x )
{
    const double ax = fabs( x );
    const double x2 = x * x;

    return( x * ( 2.45550750702956 + 2.45550750702956 * ax +
        ( 0.893229853513558 + 0.821226666969744 * ax ) * x2 ) /
        ( 2.44506634652299 + ( 2.44506634652299 + x2 ) *
        fabs( x + 0.814642734961073 * x * ax )));
}
```

- KVRAF
- 3414 posts since 14 Nov, 2006, from Pacific NW

Urs wrote: Also, hehehe, if a Newton-Raphson step is not required, I wonder if the reciprocal square root estimate is guaranteed to produce the same results on every CPU - I remember that the G5 has a far worse version than the G4.

This is a good question. Also, I didn't know about the G5 versus G4. I only tested on a G4 machine, and my Intel machines. I guess the more important question in 2014 would be how the Intel SSE2(?) instruction performs, versus whatever the corresponding ARM NEON instruction is.

Sean Costello

Plugins: http://www.valhalladsp.com

Blog: https://valhalladsp.com/blog/

KVR Subforum: http://www.kvraudio.com/forum/viewforum.php?f=146


- KVRAF
- 3414 posts since 14 Nov, 2006, from Pacific NW

random_id wrote: Do you know how this compares with Aleksey Vaneev's fasttanh from this post? The code is:

```cpp
inline double vox_fasttanh2( const double x )
{
    const double ax = fabs( x );
    const double x2 = x * x;

    return( x * ( 2.45550750702956 + 2.45550750702956 * ax +
        ( 0.893229853513558 + 0.821226666969744 * ax ) * x2 ) /
        ( 2.44506634652299 + ( 2.44506634652299 + x2 ) *
        fabs( x + 0.814642734961073 * x * ax )));
}
```

I suppose that it depends on the speed of the divide in the code you posted, versus the speed of the reciprocal square root function in SIMD.

Sean Costello

Plugins: http://www.valhalladsp.com

Blog: https://valhalladsp.com/blog/

KVR Subforum: http://www.kvraudio.com/forum/viewforum.php?f=146


- u-he
- 21776 posts since 7 Aug, 2002, from Berlin

I think the reciprocal square root estimate is 1 cycle latency while divide is still typically 20+ cycles, no?

I could be wrong though...


- KVRAF
- 3414 posts since 14 Nov, 2006, from Pacific NW

Urs wrote: I think the reciprocal square root estimate is 1 cycle latency while divide is still typically 20+ cycles, no?

I could be wrong though...

This is what I thought, as well. I don't know for sure if the reciprocal square root estimate is 1 cycle. I sure hope this is the case.

Sean Costello

Plugins: http://www.valhalladsp.com

Blog: https://valhalladsp.com/blog/

KVR Subforum: http://www.kvraudio.com/forum/viewforum.php?f=146


- KVRist
- 46 posts since 20 Jun, 2013, from Berkeley, CA

@valhalladsp: your understanding is correct, it's a simple shaping polynomial followed by the sqrt sigmoid. It's mix-and-match whether you want to do the NR step. If you're ok with a 70dB noise floor, skip it, otherwise include it. It costs in the ballpark of 0.38ns (that's right, 380 picoseconds) per tanh. I included it in my performance evaluation - i.e., my tanh with the NR step is 1.7x the cost of the sqrt sigmoid with the NR step.

@Urs: what these recip approx instructions actually do is not documented. It's easy to believe that you could get decent results on one chip, then test on another and find that it sounds much worse. Plus, results won't in general be consistent. Another reason to include the NR step, imho.

I didn't test against a plain ol' Padé, but I suspect that the total cost is in the same ballpark - you need to do a small amount of polynomial evaluation followed by a reciprocal-ish operation followed by an NR step to provide good noise behavior. The bounds clamping on the Padé will cost extra, as you point out, so my guess is that the total time is pretty much the same.

@random_id: I just tried the vox_fasttanh2. Because of the abs()'s, the falloff at higher harmonics is nowhere near as good - in fact it's comparable to just the sqrt one. But the numerical error is not bad, it's 4.3e-4 (i.e. about double the error of mine). In addition, the vox one doesn't go to +/-1 in the limits. Speed is almost identical - about 1ns/tanh.
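For anyone who wants to check the two observations above (the limit not being +/-1, and the size of the error), here is a minimal sketch. It reuses the vox_fasttanh2 formula quoted in this thread. `vox_limit` and `max_abs_error` are hypothetical helper names, and the grid range and step count are arbitrary choices, not the setup behind the numbers quoted in this post.

```cpp
#include <cmath>

// Same formula as the vox_fasttanh2 quoted earlier in the thread.
inline double vox_fasttanh2(const double x)
{
    const double ax = std::fabs(x);
    const double x2 = x * x;

    return (x * (2.45550750702956 + 2.45550750702956 * ax +
        (0.893229853513558 + 0.821226666969744 * ax) * x2) /
        (2.44506634652299 + (2.44506634652299 + x2) *
        std::fabs(x + 0.814642734961073 * x * ax)));
}

// Hypothetical helper: ratio of the leading x^4 coefficients of the
// numerator and denominator, i.e. the value approached as x grows.
inline double vox_limit()
{
    return 0.821226666969744 / 0.814642734961073;
}

// Hypothetical helper: largest absolute deviation from std::tanh on a
// uniform grid over [-range, range]. Range and steps are up to the caller.
inline double max_abs_error(const double range, const int steps)
{
    double worst = 0.0;
    for (int i = 0; i <= steps; ++i) {
        const double x = -range + 2.0 * range * i / steps;
        worst = std::fmax(worst, std::fabs(vox_fasttanh2(x) - std::tanh(x)));
    }
    return worst;
}
```

For large |x| the x^4 terms dominate both numerator and denominator, so the output tends to the coefficient ratio returned by `vox_limit()`, which is slightly above 1 rather than exactly 1.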


- KVRist
- 46 posts since 20 Jun, 2013, from Berkeley, CA

@Urs: the definitive source for instruction timings is http://www.agner.org/optimize/instruction_tables.pdf. On Ivy Bridge (the chip in my MacBook Pro), rsqrtps is indeed 1 cycle recip throughput, while divps is 7 cycles. Note that divps has much more stringent rounding behavior - it's overkill for audio. You're almost always going to get better performance for estimated reciprocal followed by an NR step, which is what I did for all my comparative tests.
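To make the "estimate followed by an NR step" idea concrete, here is a portable scalar sketch. Big assumption: `rsqrt_estimate` is only a stand-in for the hardware estimate. It takes the exact value and throws away the low 12 mantissa bits, which is not what rsqrtps actually does internally (that is undocumented, as noted earlier in the thread). All function names here are hypothetical; real code would use the SIMD intrinsic for the estimate and keep the refinement step the same.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Stand-in for a low-precision hardware estimate of 1/sqrt(a), a > 0.
// Assumption for illustration only: exact result with the low 12
// mantissa bits cleared, leaving roughly 11 to 12 good bits.
inline float rsqrt_estimate(const float a)
{
    float y = 1.0f / std::sqrt(a);
    std::uint32_t bits;
    std::memcpy(&bits, &y, sizeof bits);
    bits &= 0xFFFFF000u;
    std::memcpy(&y, &bits, sizeof y);
    return y;
}

// One Newton-Raphson step for y ~ 1/sqrt(a):
//   y' = y * (1.5 - 0.5 * a * y * y)
// The relative error is roughly squared by each step.
inline float rsqrt_refined(const float a)
{
    const float y = rsqrt_estimate(a);
    return y * (1.5f - 0.5f * a * y * y);
}

// Square-root sigmoid x / sqrt(1 + x^2), with or without the NR step.
inline float sqrt_sigmoid(const float x, const bool refine)
{
    const float a = 1.0f + x * x;
    return x * (refine ? rsqrt_refined(a) : rsqrt_estimate(a));
}
```

The refinement costs a few multiplies and one subtract and needs no divide, which is why it is cheap next to a full-precision division.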

- u-he
- 21776 posts since 7 Aug, 2002, from Berlin

Top notch info, raphx!

Thanks,

- Urs


- KVRian
- 807 posts since 19 Dec, 2010

raphx wrote: @Urs: the definitive source for instruction timings is http://www.agner.org/optimize/instruction_tables.pdf. On Ivy Bridge (the chip in my MacBook Pro), rsqrtps is indeed 1 cycle recip throughput, while divps is 7 cycles. Note that divps has much more stringent rounding behavior - it's overkill for audio. You're almost always going to get better performance for estimated reciprocal followed by an NR step, which is what I did for all my comparative tests.

Great info, thanks! I wonder if I interpret the data correctly though:

- On Haswell, multiplying floats via SSE can be faster than adding them?

- Some AVX instructions (e.g. division) seem to have about twice the cost compared to their SSE equivalents?

I thought the bottleneck of AVX had to do with fetching/storing data, but now I'm not sure anymore.

Richard

Synapse Audio Software - www.synapse-audio.com

- KVRist
- 46 posts since 20 Jun, 2013, from Berkeley, CA

Richard:

Yes, in particular the fused multiply-accumulate speed on the Haswell is insane. A single cycle on a single core can dispatch two 256-bit-wide FMAs, each of which is 8 multiplies and 8 adds, for a total of 32 floating point ops per cycle, or 100 GFLOPs at typical clock speeds. This Stack Overflow answer has a really detailed explanation: http://stackoverflow.com/questions/15655835/flops-per-cycle-for-sandy-bridge-and-haswell-sse2-avx-avx2

One consequence is that it's actually faster to multiply by one and add (using FMA) than it is to just add. This is one reason to tend towards matrix math, where you just stick a 1.0 in your coefficient matrix, over the old ways. One way to think about IIR direct forms is that they're a similarity transform on the state space matrix to make as many of the coefficients 0 or 1 as possible, so the multiplies can be skipped. But in a SIMD world, that doesn't buy you anything, and it costs you numerical precision.

Note that actually making use of all this FMA muscle is hard, as the latency is 5 cycles. So you need to have 10 of these in the pipeline (i.e. 160 floating point ops) at the same time to achieve maximal utilization. That is not easy, and another reason to prefer matrix operations.
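One common way to keep several FMAs in flight is to give each its own accumulator, so no FMA has to wait for the previous one to finish. Here is a minimal scalar sketch of that idea for a dot product. The function name and the lane count of 8 are illustrative assumptions, not a tuned choice; in real SIMD code each accumulator would be a whole vector register. Whether `std::fma` compiles to a hardware FMA instruction depends on the compiler and target flags.

```cpp
#include <cmath>
#include <cstddef>

// Dot product with several independent accumulators. Each acc[k] forms
// its own dependency chain, so consecutive FMAs do not wait on each other.
inline double dot_interleaved(const double* a, const double* b, const std::size_t n)
{
    constexpr std::size_t kLanes = 8;  // illustrative, not tuned
    double acc[kLanes] = {0.0};

    std::size_t i = 0;
    for (; i + kLanes <= n; i += kLanes)
        for (std::size_t k = 0; k < kLanes; ++k)
            acc[k] = std::fma(a[i + k], b[i + k], acc[k]);

    // Combine the partial sums, then handle the leftover tail elements.
    double sum = 0.0;
    for (std::size_t k = 0; k < kLanes; ++k)
        sum += acc[k];
    for (; i < n; ++i)
        sum = std::fma(a[i], b[i], sum);
    return sum;
}
```

A single running sum would serialize every FMA behind the previous one's latency; splitting the sum trades a slightly different rounding order for the chance to overlap them.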

The double cost of the AVX instructions is for the 256-bit variant only. So the cost per vector of 4 is the same as the SSE variant.

I also find all these variations of SIMD architectures to be quite a pain. These days, to get top performance, you need 32- and 64-bit ARM versions, an SSE2 version (the stuff added in SSE3 and SSE4 is not very important for audio), an AVX version, and an AVX + FMA version for Haswell. On the desktop side, it's quite understandable to just write SSE2 and not worry about it too much.
