KVR Audio

Ichad.c · Post by **Ichad.c** » Wed May 16, 2012 9:56 pm

Spent a couple of hours (new to SSE) translating a filter into SSE code, and now it's about 180% SLOWER. The code doesn't use any trig or fancy math, just */+-. For constants I followed this example, outside the loop: __m128 m5 = _mm_set1_ps(5.0f); Inputs are loaded like this: __m128 InStream = _mm_loadu_ps(in1); followed by just a bunch of multiplies, adds etc. At the end I used: float c; _mm_storeu_ps(&c,LP); *out1=c; I also tried the above with the scalar (ss) versions and the performance is the same. I also used _MM_SET_FLUSH_ZERO_MODE(0x8000); which I think is correct(?).

Now I know unaligned loads and stores are a bit pricey, but could they be that pricey? Or did I mess up something drastically? Is _mm_set1_ps (outside the loop) the right way to do constants with SSE? Do you get speed-ups or slow-downs when implementing filters with SSE?

Regards
Andrew
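
One thing worth flagging in the snippet described above: `_mm_storeu_ps` always writes four floats, so pointing it at a single `float c` writes past the end of that variable, and `_mm_loadu_ps(in1)` likewise reads four consecutive samples. A minimal sketch of handling one sample at a time, assuming only lane 0 is of interest (the helper names are hypothetical, not from the original code):

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical helper: returns lane 0 of v without going through memory. */
static float first_lane(__m128 v)
{
    return _mm_cvtss_f32(v);
}

/* Hypothetical helper: stores all four lanes; dst must have room for
   four floats, but needs no particular alignment. */
static void store_all_lanes(float dst[4], __m128 v)
{
    _mm_storeu_ps(dst, v);
}

/* Hypothetical example: computes x * 5 + 1 in lane 0 only, using the
   scalar (ss) forms of the instructions. */
static float scale_one_sample(float x)
{
    const __m128 five = _mm_set1_ps(5.0f);  /* constants: hoist out of real loops */
    const __m128 one  = _mm_set1_ps(1.0f);
    __m128 in = _mm_load_ss(&x);            /* loads one float, zeroes lanes 1..3 */
    __m128 y  = _mm_add_ss(_mm_mul_ss(in, five), one);
    return first_lane(y);
}
```

This does not by itself explain the slowdown, but it removes the out-of-bounds write before any timing is done.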

Urs · Post by **Urs** » Wed May 16, 2012 10:06 pm

unaligned stores/loads may come with a penalty. Never did them myself.

SSE code often seems slow. Don't do it unless you do 2 or 4 filters in parallel.

Otherwise write "normal" floating point code and let the compiler translate it to SSE where applicable ("instruction set=SSE2" or so).
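
As a rough illustration of the "2 or 4 filters in parallel" case, here is a sketch that runs four independent one-pole lowpass filters, one per SSE lane. The struct and function names are hypothetical, and the interleaved buffer layout (four floats per frame) is an assumption of the sketch:

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical state for four independent one-pole lowpass filters. */
typedef struct {
    __m128 a;   /* per-lane coefficient, 0 < a <= 1 */
    __m128 y;   /* per-lane previous output */
} OnePole4;

static void onepole4_init(OnePole4 *f, float a0, float a1, float a2, float a3)
{
    f->a = _mm_set_ps(a3, a2, a1, a0);  /* _mm_set_ps takes lanes high to low */
    f->y = _mm_setzero_ps();
}

/* in/out hold interleaved frames: 4 floats per frame, one per filter.
   Every lane computes y += a * (x - y). */
static void onepole4_process(OnePole4 *f, const float *in, float *out,
                             size_t frames)
{
    __m128 a = f->a;   /* keep coefficient and state in registers */
    __m128 y = f->y;
    for (size_t n = 0; n < frames; ++n) {
        __m128 x = _mm_loadu_ps(in + 4 * n);
        y = _mm_add_ps(y, _mm_mul_ps(a, _mm_sub_ps(x, y)));
        _mm_storeu_ps(out + 4 * n, y);
    }
    f->y = y;          /* write the state back once, after the loop */
}
```

The recursion is still sequential in time, but each instruction now does useful work for four filters, which is where the gain can come from.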

Ichad.c · Post by **Ichad.c** » Wed May 16, 2012 10:19 pm

Urs wrote: Otherwise write "normal" floating point code and let the compiler translate it to SSE where applicable ("instruction set=SSE2" or so).

Well, gcc will not "vectorize" any recursive structure (or nested loop) btw - it's in the manual. I got a hefty speedup when gcc "auto-vectorized" some waveshaping functions I tried a while back. So I thought hey - filter+nonlinearities sounds fun - gcc won't auto-vectorize - so you'll get slower non-linearities + possibly a slower filter. So I decided hey - why not do it manually? Now I'm stuck
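
To make the difference concrete, here is a small sketch of the two loop shapes being discussed. Both functions are hypothetical illustrations (the cubic shaper and the one-pole filter are stand-ins, not the code from this thread); whether a given compiler actually vectorizes the first one depends on version and flags:

```c
#include <stddef.h>

/* Each output depends only on in[i]: iterations are independent, so an
   auto-vectorizer is free to process several samples per instruction. */
static void soft_clip(const float *in, float *out, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        float x = in[i];
        out[i] = x * (1.5f - 0.5f * x * x);  /* cubic shaper, assumes |x| <= 1 */
    }
}

/* Each output depends on the previous output y: iteration i cannot start
   before iteration i-1 has finished, so the loop cannot simply be split
   across SSE lanes. Returns the final state. */
static float one_pole(const float *in, float *out, size_t n, float a, float y)
{
    for (size_t i = 0; i < n; ++i) {
        y += a * (in[i] - y);   /* loop-carried dependency on y */
        out[i] = y;
    }
    return y;
}
```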

mystran · Post by **mystran** » Wed May 16, 2012 11:37 pm

Urs wrote:unaligned stores/loads may come with a penalty. Never did them myself.

I remember trying them a couple of times, only to find out that it's mostly cheaper to either shuffle or arrange things so you don't need them. Unfortunately shuffling isn't exactly free either.

SSE code often seems slow. Don't do it unless you do 2 or 4 filters in parallel.

If you can calculate costly things in parallel, e.g. several non-linearities or something, then it might speed up single filters as well. I've done this in the past, but it takes a whole lot of profiling to even match the trivial thing, so I don't think I'd bother anymore. If you go for it, remember to budget for latency; I've observed scalar code generally pipelining much better.

Anyway, I got a Stereo class that's basically a 2d vector, that allows me to use template parameters to compile anything for either mono or stereo operation. I've got two implementations of it, one mapping to 2-way SSE2 and one just using regular doubles. The SSE2 version used to be marginally faster on my previous Core2 (and my old Turion laptop), but curiously on my current i7 desktop the scalar "fall-back" (compiled for SSE2 scalar math, which usually outperforms x87 and lets you forget about denormals if you set FTZ and DAZ) performs better.

Four-way single precision still seems a win in some cases, but I'm not convinced it makes much sense to waste time trying to micromanage some tight loops that don't vectorize naturally. My take is to just enable the instruction sets and write regular scalar code.
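
For reference, a minimal sketch of setting both FTZ and DAZ with the standard macros. The function name is hypothetical; `_MM_FLUSH_ZERO_ON` is the same 0x8000 value used in the first post, which sets FTZ only, while DAZ is a separate bit with its macros in pmmintrin.h:

```c
#include <xmmintrin.h>  /* _MM_SET_FLUSH_ZERO_MODE, _MM_FLUSH_ZERO_ON */
#include <pmmintrin.h>  /* _MM_SET_DENORMALS_ZERO_MODE, _MM_DENORMALS_ZERO_ON */

/* Hypothetical helper: call once at the start of the audio thread. */
static void enable_ftz_daz(void)
{
    /* FTZ: denormal results of SSE operations are flushed to zero. */
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    /* DAZ: denormal inputs to SSE operations are treated as zero. */
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);
}
```

Both bits live in the MXCSR register, which is per thread, so this needs to run on the thread that does the processing. It only affects SSE math, not x87.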

Caco · Post by **Caco** » Thu May 17, 2012 7:14 am

Similar to what I found using SSE. Unless you are doing large amounts of number crunching within the SSE section then the cost of unaligned stores/loads will probably be more than your savings. I only use SSE now for very specific sections of code as the savings don't generally justify the hassle of writing the SSE.

Ichad.c · Post by **Ichad.c** » Thu May 17, 2012 10:32 am

Thanks for all the replies and suggestions so far! There still are a couple of "oddities" that still bother me though...
I've looked at some SSE math libraries, and none of them seem to use _mm_set1_ps for constants - they create their own "defines". So it got me intrigued, and then I checked out the xmmintrin.h file, and set and load are basically the same thing.

Now maybe, because of the recursive structure, the compiler cannot optimize efficiently because the loop analyzer gets confused, so it first loads the constants into the registers, then goes uhm - need space - unload register -> going on and on like that - in a vicious cycle. 'Cause seriously - I thought, worst case scenario - no speedup - not 180% slower. The thing is - some filters, for example Antti's Moog, have a lot of non-linearities that in theory would benefit from SSE. I just started with a simpler linear filter - just to check that my manual 'normal' to SSE translation works.
Will experiment with a defined constants class - and see if there is any difference.

On a maybe related note - what the hell is this SSE function used for?

/* The execution of the next instruction is delayed by an implementation
   specific amount of time.  The instruction does not modify the
   architectural state.  */
extern __inline void __attribute__(( __always_inline__, __artificial__))
_mm_pause (void)
{
  __asm__ __volatile__ ("rep; nop" : : );
}

Andrew
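
For what it's worth, `_mm_pause` has nothing to do with floating point math: it emits the PAUSE instruction ("rep; nop" is its encoding), which is a hint meant for spin-wait loops. A minimal sketch of where it would typically sit, assuming a C11 atomic flag that some other thread sets (the helper name is hypothetical):

```c
#include <stdatomic.h>
#include <xmmintrin.h>  /* _mm_pause */

/* Hypothetical helper: busy-waits until *flag becomes nonzero and
   returns how many times it had to spin. */
static unsigned long spin_until_set(atomic_int *flag)
{
    unsigned long spins = 0;
    while (atomic_load_explicit(flag, memory_order_acquire) == 0) {
        _mm_pause();   /* tells the CPU this is a spin-wait loop */
        ++spins;
    }
    return spins;
}
```

So it is not something you would put inside a filter loop.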

mystran · Post by **mystran** » Thu May 17, 2012 3:27 pm

Ichad.c wrote: The thing is - some filters, for example Antti's Moog, have a lot of non-linearities that in theory would benefit from SSE.

For a traditional sequential implementation, you have 8 tanh evaluations per sample, and 4 of those (i.e. the tanhs of the old states) could be calculated in parallel. However, if you unroll the loop once, you'll notice that 3 of those were already calculated as outputs during the previous sample (and can be cached). So assuming a sequential solver (e.g. traditional Euler discretization) there is no parallel potential whatsoever (when you cache, which is more efficient).

If you're not solving sequentially, then sure, you might be able to pull out more parallelism.
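
The caching argument can be sketched roughly as follows. This is a simplified four-stage ladder with a per-stage Euler update, not anyone's actual filter code: the names, the update `y[i] += g * (nl(in) - nl(y[i]))` and the rational `nl()` used in place of tanh (so the example needs no math library) are all assumptions of the sketch. The call counter is only there to show 8 versus 5 evaluations per sample:

```c
/* Cheap stand-in for tanh (an assumption, not the real nonlinearity).
   Counts its own calls so the two variants can be compared. */
static unsigned long nl_calls = 0;

static float nl(float x)
{
    ++nl_calls;
    float ax = x < 0.0f ? -x : x;
    return x / (1.0f + ax);
}

/* y: stage states; t: cached nl(y) values, used by the cached variant only. */
typedef struct { float y[4]; float t[4]; } Ladder;

/* Naive variant: evaluates the nonlinearity for every stage input and
   every old state, i.e. 8 times per sample. */
static float ladder_naive(Ladder *s, float x, float g, float k)
{
    float in = x - k * s->y[3];          /* feedback from the last stage */
    for (int i = 0; i < 4; ++i) {
        s->y[i] += g * (nl(in) - nl(s->y[i]));
        in = s->y[i];                    /* new output feeds the next stage */
    }
    return s->y[3];
}

/* Cached variant: nl() of each new state is computed once, then reused as
   the next stage's input and as the old-state term on the following
   sample, i.e. 5 evaluations per sample. */
static float ladder_cached(Ladder *s, float x, float g, float k)
{
    float tin = nl(x - k * s->y[3]);
    for (int i = 0; i < 4; ++i) {
        s->y[i] += g * (tin - s->t[i]);
        s->t[i] = nl(s->y[i]);
        tin = s->t[i];
    }
    return s->y[3];
}
```

In the cached form each evaluation depends on the one before it, which is the "no parallel potential" point made above.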

Slow Filter in SSE Hell...