Table Lookup for tanh() versus other solutions
-
- KVRAF
- 2875 posts since 28 Jan, 2004 from Da Nang, Vietnam
Thanks for sharing the details. I've been working on a saturator effect for my next iPad app and this stuff will be useful.
I'm finding the vectorized tanh in veclib to be pretty cheap on iOS already though.
- KVRAF
- Topic Starter
- 3426 posts since 15 Nov, 2006 from Pacific NW
kuniklo wrote: Thanks for sharing the details. I've been working on a saturator effect for my next iPad app and this stuff will be useful. I'm finding the vectorized tanh in vecLib to be pretty cheap on iOS already though.

I really like the vecLib tanh() in OSX. In general, all of the Accelerate library stuff is awesome. Unfortunately, it isn't available for Windows, so I would have to roll my own vectorized tanh() function, or just use the math library one.
It looks like there are some interesting tanh() approaches in this thread, but the inverse square root sigmoid function seems like it would be easier to implement in SSE. I've already tried it out in my code (both in straight C and using vDSP and vecLib functions) and it sounds great for my purposes, and doesn't require any clipping before or afterwards to keep things within bounds. Honestly, it might be somewhat "smoother" than tanh() for general distortion purposes.
Sean Costello
-
- KVRAF
- 2393 posts since 28 Mar, 2005
Use the Intel Integrated Performance Primitives (IPP) on Windows.
-
- KVRAF
- 2458 posts since 3 Oct, 2002 from SF CA USA NA Earth
valhallasound wrote: I really like the vecLib tanh() in OSX. In general, all of the Accelerate library stuff is awesome. Unfortunately, it isn't available for Windows. So I would have to roll my own vectorized tanh() function, or just use the math library one.

You could disassemble the vecLib tanh() and recreate it, or at least verify what approximation it uses.

Also, Fons Adriaensen suggested the 1/sqrt approximation, but I can't find the original posting anywhere on the googles.
Might be time for me to revisit "Fun With Sigmoids" with plots and performance comparisons of the sigmoids themselves. In my copious spare time.
- KVRAF
- Topic Starter
- 3426 posts since 15 Nov, 2006 from Pacific NW
Borogove wrote: You could disassemble the vecLib tanh() and recreate it, or at least verify what approximation it uses.

I'm really liking the 1/sqrt approximation more than tanh(). Plus, the suggestion above to use Intel IPP is a good one. I tried to get it running before and got stuck in the configuration, but spending a few hundred bucks for IPP might save me a lot of hand-coding of SSE in the long run. Although I actually really enjoy hand-coding this stuff, for some perverse reason.

Borogove wrote: Also, Fons Adriaensen suggested the 1/sqrt approximation, but I can't find the original posting anywhere on the googles.

Well, it's a goodun.

Borogove wrote: Might be time for me to revisit "Fun With Sigmoids" with plots and performance comparisons of the sigmoids themselves. In my copious spare time.

What is this "spare time" of which you speak?
There is an earlier tanh() optimization thread, that shows aliasing plots for different realizations of optimized tanh functions. It would be interesting to see the same plots taken of the 1/sqrt function, as well as some of the other sigmoids.
The soundfiles on the "Fun With Sigmoids" page are pretty useful in listening to the results of different sigmoids in feedback filters. I used the Antti Huovilainen Moog examples to evaluate things, as I know that filter pretty well (a few of the TAL plugins use this filter). The 1/sqrt sigmoid sounded at least as good as the tanh() in those examples, to my ears.
Sean Costello
-
- KVRAF
- 2458 posts since 3 Oct, 2002 from SF CA USA NA Earth
valhallasound wrote: The soundfiles on the "Fun With Sigmoids" page are pretty useful in listening to the results of different sigmoids in feedback filters. I used the Antti Huovilainen Moog examples to evaluate things, as I know that filter pretty well (a few of the TAL plugins use this filter). The 1/sqrt sigmoid sounded at least as good as the tanh() in those examples, to my ears.

One of my shameful little secrets is that I can barely qualify the difference between the filters, let alone the sigmoids, on that page.
Also, at least one of the filters has the sigmoid at the wrong point in the signal path.
-
- KVRAF
- 2875 posts since 28 Jan, 2004 from Da Nang, Vietnam
otristan wrote: Use the Intel Integrated Performance Primitives (IPP) on Windows.

So far I haven't tried to port anything to Windows, but I think I'd take a serious look at this as a first step. IIRC Urs said he'd had good results with it too.
Of course, there's nothing wrong with tinkering with this stuff yourself if you're having fun and can spare the time. You might even make some interesting discoveries along the way.
-
- KVRAF
- 2875 posts since 28 Jan, 2004 from Da Nang, Vietnam
valhallasound wrote: I really like the vecLib tanh() in OSX. In general, all of the Accelerate library stuff is awesome. Unfortunately, it isn't available for Windows. So I would have to roll my own vectorized tanh() function, or just use the math library one.

Some trivia there: I was foolish enough to go into graduate school in chemistry years ago. I came to my senses and bailed out, but a fellow student actually finished his PhD and then somehow taught himself enough about all this stuff to get hired as an expert on AltiVec, and eventually to become one of the architects of Accelerate.
I guess there are plenty of good reasons for plugin devs to take Apple's name in vain but Accelerate is a great freebie once you've paid the other costs of admission.
- KVRAF
- Topic Starter
- 3426 posts since 15 Nov, 2006 from Pacific NW
kuniklo wrote: Some trivia there: I was foolish enough to go into graduate school in chemistry years ago. I came to my senses and bailed out, but a fellow student actually finished his PhD and then somehow taught himself enough about all this stuff to get hired as an expert on AltiVec, and eventually to become one of the architects of Accelerate.

I understand that. I just had far too much fun rolling an inverse square root sigmoid function in SSE assembly. And I was an Anthropology major.

kuniklo wrote: I guess there are plenty of good reasons for plugin devs to take Apple's name in vain, but Accelerate is a great freebie once you've paid the other costs of admission.

I love Accelerate. But I am starting to wonder if I should use my hand-rolled vector library for Intel OSX, instead of the vector library that calls the Accelerate function. ValhallaRoom runs most efficiently as a 64-bit VST in Windows, where it can use my SSE/SSE2 intrinsics. I need to conduct some experiments in the next few days.
Sean Costello
- KVRAF
- Topic Starter
- 3426 posts since 15 Nov, 2006 from Pacific NW
Here's my SSE implementation of the Fons Adriaensen (via Russell Borogove) inverse square root sigmoid. The function follows my vector library format: input and output buffers are presumed to be the same size, and are aligned to 16-byte boundaries.
Code:

// implements x / √( x² + 1 )
inline void ProcessISRSBlock(float *input, float *output, int blockSize)
{
    const float *localin = input;
    float *localout = output;
    const __m128 theOnes = _mm_set_ps1(1.f);
    for (int i = 0; i < blockSize; i += 4)
    {
        __m128 vin     = _mm_load_ps(localin + i);   // load four input samples
        __m128 vtmp    = _mm_mul_ps(vin, vin);       // in*in
        __m128 vtmp2   = _mm_add_ps(vtmp, theOnes);  // in*in + 1.f
        vtmp           = _mm_rsqrt_ps(vtmp2);        // 1/sqrt(in*in + 1.f)
        __m128 vresult = _mm_mul_ps(vtmp, vin);      // in * 1/sqrt(in*in + 1)
        _mm_store_ps(localout + i, vresult);         // write to output
    }
}
To my ears, it sounds really smooth. I originally implemented the code with the Accelerate library reciprocal square root function, which should be around twice as accurate as the _mm_rsqrt_ps intrinsic I am using. The optimization above sounds about the same in my application, and is ridiculously light on the CPU.
Any suggestions on improving efficiency, etc. much appreciated. If you see anything profoundly or even mildly stupid in the above code, please let me know, as I'm always trying to learn about this stuff.
Sean Costello
-
- KVRAF
- 2875 posts since 28 Jan, 2004 from Da Nang, Vietnam
valhallasound wrote: I love Accelerate. But I am starting to wonder if I should use my handrolled vector library for Intel OSX, instead of the vector library that calls the Accelerate function.

I can imagine that your own hand-rolled code might be better than Accelerate for your use cases. Accelerate probably has to handle cases you know you'll never encounter in your own code. And besides, as your example shows, it doesn't necessarily take a lot of code to write some of these functions.
If Apple ever starts making ARM laptops as is widely rumored you may have to go back and write another version of these though.
-
- KVRAF
- 2458 posts since 3 Oct, 2002 from SF CA USA NA Earth
valhallasound wrote: Here's my SSE implementation of the Fons Adriaensen (via Russell Borogove) inverse square root sigmoid. The function follows my vector library format: input and output buffers are presumed to be the same size, and are aligned to 16-byte boundaries.

If you maintain debug and release versions of your plugs, it's worth throwing in assert()s for your alignment, blocksize divisibility, and other invariants like that.
The code looks sane.
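A sketch of what those guards might look like, folded into a function along the lines of the one posted above (the particular assert set is just my guess at the relevant invariants; asserts compile away in release builds that define NDEBUG):

```c
#include <assert.h>
#include <stdint.h>
#include <stdalign.h>
#include <math.h>
#include <xmmintrin.h>

// x / sqrt(x*x + 1) over a block, with debug-build invariant checks.
static inline void ProcessISRSBlock(const float *input, float *output, int blockSize)
{
    assert(((uintptr_t)input  & 15) == 0);  // 16-byte alignment for _mm_load_ps
    assert(((uintptr_t)output & 15) == 0);  // ...and for _mm_store_ps
    assert(blockSize % 4 == 0);             // loop consumes whole 4-float vectors

    const __m128 theOnes = _mm_set_ps1(1.f);
    for (int i = 0; i < blockSize; i += 4)
    {
        __m128 vin  = _mm_load_ps(input + i);
        __m128 vsum = _mm_add_ps(_mm_mul_ps(vin, vin), theOnes);       // x*x + 1
        _mm_store_ps(output + i, _mm_mul_ps(vin, _mm_rsqrt_ps(vsum))); // x / sqrt(x*x+1)
    }
}
```

The tolerance in any test of this routine has to allow for the roughly 12-bit precision of _mm_rsqrt_ps.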
valhallasound wrote: To my ears, it sounds really smooth. I originally implemented the code with the Accelerate library reciprocal square root function, which should be around twice as accurate as the _mm_rsqrt_ps intrinsic I am using.

My understanding is that rsqrtps uses a table lookup internally and gives 11 or 12 bits of precision. That's not awful; it should correspond to 66-72 dB SNR, "good enough for rock 'n' roll". If you support a high-quality mode in your algorithm, you should probably implement a version that does one iteration of Newton-Raphson refinement, which gets you better than 20-bit precision for about 3 times the cost (13-19 cycles instead of 5-6*, depending on architecture, which is still hell of cheap).
* all cycle counts pulled directly from my bottom
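A sketch of that refinement, using the standard Newton-Raphson identity y1 = 0.5 * y0 * (3 - x * y0 * y0) (the function name is mine):

```c
#include <math.h>
#include <xmmintrin.h>

// _mm_rsqrt_ps estimate refined by one Newton-Raphson step:
// y1 = 0.5 * y0 * (3 - x * y0 * y0), taking the ~12-bit y0 to ~22+ bits.
static inline __m128 rsqrt_nr_ps(__m128 x)
{
    const __m128 half  = _mm_set_ps1(0.5f);
    const __m128 three = _mm_set_ps1(3.0f);
    __m128 y = _mm_rsqrt_ps(x);                                    // rough estimate
    __m128 t = _mm_sub_ps(three, _mm_mul_ps(x, _mm_mul_ps(y, y))); // 3 - x*y*y
    return _mm_mul_ps(_mm_mul_ps(half, y), t);
}
```

This would be a drop-in replacement for the bare _mm_rsqrt_ps call in the sigmoid loop posted earlier when a high-quality mode is wanted.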
- KVRAF
- Topic Starter
- 3426 posts since 15 Nov, 2006 from Pacific NW
kuniklo wrote: I can imagine that your own hand-rolled code might be better than Accelerate for your use cases. Accelerate probably has to handle cases you know you'll never encounter in your own code. And besides, as your example shows, it doesn't necessarily take a lot of code to write some of these functions.

I'm actually really surprised at how compact this code is. This creates a smooth sigmoid clipping function, with similar behavior to tanh() (maybe better in some ways), with no branching required for clipping. Hooray for _mm_rsqrt_ps!
I wouldn't be surprised if Apple's Accelerate reciprocal square root uses the same SSE assembly instruction, but with a few Newton's iterations to get the desired accuracy. It may have to have some sort of guards in place to prevent dividing by zero, or whatever the equivalent is for this assembly instruction. The nice thing about rolling your own code is that you can determine what level of accuracy you need. For this function, the 12 bits of accuracy in _mm_rsqrt_ps seem to work fine.
kuniklo wrote: If Apple ever starts making ARM laptops as is widely rumored you may have to go back and write another version of these though.

I already have the Accelerate version of this code in place, and will need to keep an Accelerate version of these functions around as long as I support PPC. That being said, I just started looking at the ARM NEON instruction set, in order to see what the future may bring.
Sean Costello
-
- KVRAF
- 2875 posts since 28 Jan, 2004 from Da Nang, Vietnam
valhallasound wrote: I already have the Accelerate version of this code in place, and will need to keep an Accelerate version of these functions around as long as I support PPC. That being said, I just started looking at the ARM NEON instruction set, in order to see what the future may bring.

Yeah, I guess Accelerate makes planning for that contingency fairly painless. I haven't gotten into the NEON instructions much yet, but I'd be interested to see what you come up with if you do dive into it.
I've been mired in GUI architecture for weeks now instead of fun DSP stuff.
- KVRAF
- Topic Starter
- 3426 posts since 15 Nov, 2006 from Pacific NW
kuniklo wrote: Yeah, I guess Accelerate makes planning for that contingency fairly painless. I haven't gotten into the NEON instructions much yet, but I'd be interested to see what you come up with if you do dive into it.

I just found this in some random blog comments (http://omcfadde.blogspot.com/2011/02/ma ... ation.html):

"For x86 (SSE), fast inverse square-root approximation can be calculated using the RSQRTSS instruction. For ARM (NEON), this is achieved using the VRSQRTE (initial approximation) instruction followed by VRSQRTS instructions (Newton-Raphson iteration)."

So it seems like VRSQRTE would work in a NEON version of the function I posted above. I think that vrsqrteq_f32 would be the appropriate NEON intrinsic.
Sean Costello