Table Lookup for tanh() versus other solutions

DSP, Plugin and Host development discussion.

Post

Thanks for sharing the details. I've been working on a saturator effect for my next iPad app and this stuff will be useful.

I'm finding the vectorized tanh in veclib to be pretty cheap on iOS already though.

Post

kuniklo wrote:Thanks for sharing the details. I've been working on a saturator effect for my next iPad app and this stuff will be useful.

I'm finding the vectorized tanh in veclib to be pretty cheap on iOS already though.
I really like the vecLib tanh() in OSX. In general, all of the Accelerate library stuff is awesome. Unfortunately, it isn't available for Windows. So I would have to roll my own vectorized tanh() function, or just use the math library one.

It looks like there are some interesting tanh() approaches in this thread, but the inverse square root sigmoid function seems like it would be easier to implement in SSE. I've already tried it out in my code (both in straight C and using vDSP and vecLib functions) and it sounds great for my purposes, and doesn't require any clipping before or afterwards to keep things within bounds. Honestly, it might be somewhat "smoother" than tanh() for general distortion purposes.

Sean Costello

Post

Use Intel Integrated Performance Primitives (IPP) on Windows.
Olivier Tristan
Developer - UVI Team
http://www.uvi.net

Post

valhallasound wrote:I really like the vecLib tanh() in OSX. In general, all of the Accelerate library stuff is awesome. Unfortunately, it isn't available for Windows. So I would have to roll my own vectorized tanh() function, or just use the math library one.
You could disassemble the vecLib tanh() and recreate it, or at least verify what approximation it uses.

Also, Fons Adriaensen suggested the 1/sqrt approximation, but I can't find the original posting anywhere on the googles.

Might be time for me to revisit "Fun With Sigmoids" with plots and performance comparisons of the sigmoids themselves. In my copious spare time.
Don't do it my way.

Post

Borogove wrote:
valhallasound wrote:I really like the vecLib tanh() in OSX. In general, all of the Accelerate library stuff is awesome. Unfortunately, it isn't available for Windows. So I would have to roll my own vectorized tanh() function, or just use the math library one.
You could disassemble the vecLib tanh() and recreate it, or at least verify what approximation it uses.
I'm really liking the 1/sqrt approximation more than tanh(). Plus, the suggestion above to use Intel IPP is a good one. I tried to get it running before, and got stuck in the configuration, but spending a few hundred bucks for the IPP might save me a lot of handcoding of SSE in the long run. Although I actually really enjoy handcoding this stuff, for some perverse reason.
Also, Fons Adriaensen suggested the 1/sqrt approximation, but I can't find the original posting anywhere on the googles.
Well, it's a goodun.
Might be time for me to revisit "Fun With Sigmoids" with plots and performance comparisons of the sigmoids themselves. In my copious spare time.
What is this "spare time" of which you speak?

There is an earlier tanh() optimization thread that shows aliasing plots for different realizations of optimized tanh() functions. It would be interesting to see the same plots for the 1/sqrt function, as well as some of the other sigmoids.

The soundfiles on the "Fun With Sigmoids" page are pretty useful in listening to the results of different sigmoids in feedback filters. I used the Antti Huovilainen Moog examples to evaluate things, as I know that filter pretty well (a few of the TAL plugins use this filter). The 1/sqrt sigmoid sounded at least as good as the tanh() in those examples, to my ears.

Sean Costello

Post

valhallasound wrote:The soundfiles on the "Fun With Sigmoids" page are pretty useful in listening to the results of different sigmoids in feedback filters. I used the Antti Huovilainen Moog examples to evaluate things, as I know that filter pretty well (a few of the TAL plugins use this filter). The 1/sqrt sigmoid sounded at least as good as the tanh() in those examples, to my ears.
One of my shameful little secrets is that I can barely hear the difference between the filters, let alone the sigmoids, on that page. :)

Also, at least one of the filters has the sigmoid at the wrong point in the signal path.
Don't do it my way.

Post

otristan wrote: Use Intel Integrated Performance Primitives (IPP) on Windows.
So far I haven't tried to port anything to Windows but I think I'd take a serious look at this as a first step. IIRC Urs said he'd had good results with it too.

Of course, there's nothing wrong with tinkering with this stuff yourself if you're having fun and can spare the time. You might even make some interesting discoveries along the way.

Post

valhallasound wrote: I really like the vecLib tanh() in OSX. In general, all of the Accelerate library stuff is awesome. Unfortunately, it isn't available for Windows. So I would have to roll my own vectorized tanh() function, or just use the math library one.
Some trivia there. I was foolish enough to go into graduate school in Chemistry years ago. I came to my senses and bailed out but a fellow student actually finished his PhD and then somehow went on to teach himself enough about all this stuff to get hired on as an expert on Altivec and then to become one of the architects of Accelerate.

I guess there are plenty of good reasons for plugin devs to take Apple's name in vain but Accelerate is a great freebie once you've paid the other costs of admission.

Post

kuniklo wrote:
valhallasound wrote: I really like the vecLib tanh() in OSX. In general, all of the Accelerate library stuff is awesome. Unfortunately, it isn't available for Windows. So I would have to roll my own vectorized tanh() function, or just use the math library one.
Some trivia there. I was foolish enough to go into graduate school in Chemistry years ago. I came to my senses and bailed out but a fellow student actually finished his PhD and then somehow went on to teach himself enough about all this stuff to get hired on as an expert on Altivec and then to become one of the architects of Accelerate.
I understand that. I just had far too much fun rolling an inverse square root sigmoid function in SSE assembly. And I was an Anthropology major.
I guess there are plenty of good reasons for plugin devs to take Apple's name in vain but Accelerate is a great freebie once you've paid the other costs of admission.
I love Accelerate. But I am starting to wonder if I should use my handrolled vector library for Intel OSX, instead of the vector library that calls the Accelerate function. ValhallaRoom runs most efficiently as a 64-bit VST in Windows, where it can use my SSE/SSE2 intrinsics. I need to conduct some experiments in the next few days.

Sean Costello

Post

Here's my SSE implementation of the Fons Adriaensen (via Russell Borogove) Inverse Square Root Sigmoid. The function follows my vector library format: input and output buffers are presumed to be the same size, and are aligned to 16-byte boundaries.

Code:

// implements x / sqrt( x*x + 1 )
// assumes blockSize is a multiple of 4, and both buffers are 16-byte aligned
inline void ProcessISRSBlock (float *input, float *output, int blockSize)
{
   const float *localin = input;
   float *localout = output;
   const __m128 theOnes = _mm_set_ps1(1.f);

   for (int i = 0; i < blockSize; i += 4)
   {
      __m128 vin = _mm_load_ps(localin + i);     // load 4 input samples
      __m128 vtmp = _mm_mul_ps(vin, vin);        // in*in
      __m128 vtmp2 = _mm_add_ps(vtmp, theOnes);  // in*in + 1.f
      vtmp = _mm_rsqrt_ps(vtmp2);                // 1/sqrt(in*in + 1.f)
      __m128 vresult = _mm_mul_ps(vtmp, vin);    // in * 1/sqrt(in*in + 1)
      _mm_store_ps(localout + i, vresult);       // write to output
   }
}

To my ears, it sounds really smooth. I originally implemented the code with the Accelerate library reciprocal square root function, which should be around twice as accurate as the _mm_rsqrt_ps intrinsic I am using. The above version sounds about the same in my application, and is ridiculously light on the CPU.

Any suggestions on improving efficiency, etc. much appreciated. If you see anything profoundly or even mildly stupid in the above code, please let me know, as I'm always trying to learn about this stuff.

Sean Costello

Post

valhallasound wrote: I love Accelerate. But I am starting to wonder if I should use my handrolled vector library for Intel OSX, instead of the vector library that calls the Accelerate function.
I can imagine that your own hand-rolled code might be better than Accelerate for your use cases. Accelerate probably has to handle cases you know you'll never encounter in your own code. And besides, as your example shows, it doesn't necessarily take a lot of code to write some of these functions.

If Apple ever starts making ARM laptops, as is widely rumored, you may have to go back and write another version of these, though.

Post

valhallasound wrote:Here's my SSE implementation of the Fons Adriaensen (via Russell Borogove) Inverse Square Root Sigmoid. The function follows my vector library format: input and output buffers are presumed to be the same size, and are aligned to 16-byte boundaries.
If you maintain debug and release versions of your plugs, it's worth throwing in assert()s for your alignment, blockSize divisibility, and other invariants like that.

The code looks sane.
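A sketch of what those guards might look like (the helper name is mine), assuming the alignment and block-size invariants stated with the posted function:

```c
#include <assert.h>
#include <stdint.h>

/* Debug-build guards for the block-processing invariants:
   both buffers 16-byte aligned, blockSize a whole number of 4-float vectors.
   (Hypothetical helper; compiles out with NDEBUG in release builds.) */
static void CheckBlockArgs(const float *input, const float *output, int blockSize)
{
    assert(((uintptr_t)input  % 16) == 0);
    assert(((uintptr_t)output % 16) == 0);
    assert(blockSize > 0 && blockSize % 4 == 0);
}
```

Called at the top of each block-processing function, this catches a misaligned buffer or ragged block size at the call site instead of as a mysterious crash inside _mm_load_ps.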
To my ears, it sounds really smooth. I originally implemented the code with the Accelerate library reciprocal square root function, which should be around twice as accurate as the _mm_rsqrt_ps intrinsic I am using. The above optimization sound around the same in my application, and is ridiculously light on the CPU.
My understanding is that rsqrtps uses a table lookup internally and is 11 or 12 bits of precision. That's not awful; it should correspond to 66-72dB SNR - "good enough for rock 'n' roll". If you support a high-quality mode in your algorithm, you should probably implement a version that does one iteration of Newton-Raphson refinement, which gets you better than 20-bit precision for about 3 times the cost (13-19 cycles instead of 5-6*, depending on architecture, which is still hell of cheap).

* all cycle counts pulled directly from my bottom
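One Newton-Raphson refinement step on the rsqrt estimate can be sketched like this (my code, not measured; the refinement formula is y1 = y0 * (3 - x*y0*y0) / 2):

```c
#include <xmmintrin.h>

/* rsqrt estimate plus one Newton-Raphson step:
   y1 = 0.5f * y0 * (3.0f - x * y0 * y0)
   which roughly doubles the ~12-bit precision of the rsqrtps estimate. */
static inline __m128 rsqrt_nr_ps(__m128 x)
{
    const __m128 half  = _mm_set_ps1(0.5f);
    const __m128 three = _mm_set_ps1(3.0f);
    __m128 y = _mm_rsqrt_ps(x);                    // initial estimate
    __m128 xyy = _mm_mul_ps(_mm_mul_ps(x, y), y);  // x * y * y
    return _mm_mul_ps(_mm_mul_ps(half, y),         // 0.5 * y * (3 - x*y*y)
                      _mm_sub_ps(three, xyy));
}
```

Dropping this in place of the bare _mm_rsqrt_ps call in the posted ProcessISRSBlock would be one way to implement the high-quality mode.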
Don't do it my way.

Post

kuniklo wrote:
valhallasound wrote: I love Accelerate. But I am starting to wonder if I should use my handrolled vector library for Intel OSX, instead of the vector library that calls the Accelerate function.
I can imagine that your own hand-rolled code might be better than Accelerate for your use cases. Accelerate probably has to handle cases you know you'll never encounter in your own code. And besides, as your example shows it doesn't necessarily take a lot of code to write some of these functions.
I'm actually really surprised at how compact this code is. This creates a smooth sigmoid clipping function, with similar behavior to tanh() (maybe better in some ways), with no branching required for clipping. Hooray for _mm_rsqrt_ps!

I wouldn't be surprised if Apple's Accelerate reciprocal square root uses the same SSE instruction, but with a few Newton iterations to get the desired accuracy. It may also have guards in place to prevent division by zero, or whatever the equivalent is for this instruction. The nice thing about rolling your own code is that you can determine what level of accuracy you need. For this function, the 12 bits of accuracy in _mm_rsqrt_ps seem to work fine.
If Apple ever starts making ARM laptops as is widely rumored you may have to go back and write another version of these though.
I already have the Accelerate version of this code in place, and will need to keep an Accelerate version of functions around as long as I support PPC. That being said, I just started looking at the ARM Neon instruction set, in order to see what the future may bring.

Sean Costello

Post

valhallasound wrote: I already have the Accelerate version of this code in place, and will need to keep an Accelerate version of functions around as long as I support PPC. That being said, I just started looking at the ARM Neon instruction set, in order to see what the future may bring.
Yeah I guess Accelerate makes planning for that contingency fairly painless. I haven't gotten into the Neon instructions much yet but I'd be interested to see what you come up with if you do dive into it.

I've been mired in GUI architecture for weeks now instead of fun DSP stuff.

Post

kuniklo wrote:
valhallasound wrote: I already have the Accelerate version of this code in place, and will need to keep an Accelerate version of functions around as long as I support PPC. That being said, I just started looking at the ARM Neon instruction set, in order to see what the future may bring.
Yeah I guess Accelerate makes planning for that contingency fairly painless. I haven't gotten into the Neon instructions much yet but I'd be interested to see what you come up with if you do dive into it.
I just found this in some random blog comments (http://omcfadde.blogspot.com/2011/02/ma ... ation.html):
For x86 (SSE), fast inverse square-root approximation can be calculated using RSQRTSS instruction.

For ARM (NEON), this is achieved using VRSQRTE (initial approximation) instruction followed by VRSQRTS instructions (Newton-Raphson iteration).
So it seems like VRSQRTE would work in a NEON version of the function I posted above. I think that vrsqrteq_f32 would be the appropriate NEON intrinsic (with vrsqrtsq_f32 for the Newton-Raphson refinement step, if more accuracy is needed).

Sean Costello
