Slow Filter in SSE Hell...
- KVRian
- 1091 posts since 8 Feb, 2012 from South - Africa
Spent a couple of hours (new to SSE) translating a filter into SSE code, and now it's about 180% SLOWER. The code doesn't use any trig or fancy math - just */+-. For constants I followed the example __m128 m5 = _mm_set1_ps(5.0f); and set them outside the loop. Inputs are loaded as in the example __m128 InStream = _mm_loadu_ps(in1); followed by just a bunch of multiplies, adds, etc. At the end I used float c; _mm_storeu_ps(&c,LP); *out1=c; I also tried the above with the scalar (ss) versions and the performance is the same. I also used _MM_SET_FLUSH_ZERO_MODE(0x8000); which I think is correct(?).
Now I know - unaligned loads and stores are a bit pricey, but could they be that pricey? Or did I mess up something drastically? Is _mm_set1_ps (outside the loop) the right way to do constants with SSE? Do you get speed-ups or slow-downs when using filters with SSE?
Regards
Andrew
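One detail in the description above is worth a hedged note: _mm_storeu_ps always writes four floats (16 bytes), so if the final store is literally float c; _mm_storeu_ps(&c,LP); it writes past the end of a single float. The sketch below is not the original filter; scale_block and first_lane are made-up names for illustration. It shows constants broadcast once outside the loop, whole-vector unaligned loads and stores, and one way to pull out just lane 0 when only a single float is wanted.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cstddef>

// Hypothetical helper: turn on flush-to-zero for the calling thread.
// _MM_FLUSH_ZERO_ON has the value 0x8000, as used in the post.
inline void enable_flush_to_zero()
{
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
}

// Hypothetical gain stage: multiplies n samples by a constant, four at a time.
void scale_block(const float* in, float* out, std::size_t n, float gain)
{
    assert(n % 4 == 0);                    // sketch only handles whole vectors
    const __m128 g = _mm_set1_ps(gain);    // broadcast once, outside the loop
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 x = _mm_loadu_ps(in + i);   // reads 4 floats, any alignment
        _mm_storeu_ps(out + i, _mm_mul_ps(x, g));  // writes 4 floats
    }
}

// Extract only lane 0 into a scalar. This touches a single float,
// unlike _mm_storeu_ps, which needs room for four.
inline float first_lane(__m128 v)
{
    return _mm_cvtss_f32(v);   // alternative: float c; _mm_store_ss(&c, v);
}
```

Note that a loop like this only pays off when all four lanes carry useful samples; using one lane of four just adds load/store overhead on top of scalar work.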
- u-he
- 30189 posts since 8 Aug, 2002 from Berlin
unaligned stores/loads may come with a penalty. Never did them myself.
SSE code often seems slow. Don't do it unless you do 2 or 4 filters in parallel.
Otherwise write "normal" floating point code and let the compiler translate it to SSE if applicable ("instruction set=SSE2" or so)
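To make the "2 or 4 filters in parallel" idea concrete, here is a minimal sketch assuming four independent one-pole lowpass filters (say, four voices or channels), one per SSE lane. OnePoleX4 and process_interleaved are hypothetical names, and the one-pole update y += a * (x - y) is only a stand-in for whatever filter is actually being run.

```cpp
#include <xmmintrin.h>  // SSE intrinsics
#include <cassert>
#include <cstddef>

// Four independent one-pole lowpass filters, one per SSE lane.
// The recursion runs inside each lane; the parallelism is across filters.
struct OnePoleX4 {
    __m128 a;  // per-lane coefficient
    __m128 y;  // per-lane state

    OnePoleX4(float a0, float a1, float a2, float a3)
        : a(_mm_setr_ps(a0, a1, a2, a3)), y(_mm_setzero_ps()) {}

    // One sample for all four filters: y += a * (x - y)
    __m128 tick(__m128 x)
    {
        y = _mm_add_ps(y, _mm_mul_ps(a, _mm_sub_ps(x, y)));
        return y;
    }
};

// Interleaved layout (an assumption of this sketch): frame i holds the
// inputs of filters 0..3 at in[4*i] .. in[4*i + 3].
void process_interleaved(OnePoleX4& f, const float* in, float* out,
                         std::size_t frames)
{
    assert(in != nullptr && out != nullptr);
    for (std::size_t i = 0; i < frames; ++i) {
        __m128 x = _mm_loadu_ps(in + 4 * i);
        _mm_storeu_ps(out + 4 * i, f.tick(x));
    }
}
```

Every instruction here does useful work in all four lanes, which is the situation where hand-written SSE has a realistic chance of beating scalar code.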
- KVRian
- Topic Starter
- 1091 posts since 8 Feb, 2012 from South - Africa
Urs wrote: Otherwise code "normal" floating point code and let the compiler translate to SSE if applicable ("instruction set=SSE2" or so)
Well, gcc will not "vectorize" any recursive structure (or nested loop) btw - it's in the manual. I got a hefty speedup when gcc "auto-vectorized" some waveshaping functions I tried a while back. So I thought, hey - filter + nonlinearities sounds fun, but gcc won't auto-vectorize it, so you'd get slower non-linearities plus possibly a slower filter. So I decided, hey - why not do it manually? Now I'm stuck.
- KVRAF
- 8476 posts since 12 Feb, 2006 from Helsinki, Finland
Urs wrote: unaligned stores/loads may come with a penalty. Never did them myself.
I remember trying them a couple of times, only to find out that it's mostly cheaper to either shuffle or arrange things so you don't need them. Unfortunately shuffling isn't exactly free either.
Urs wrote: SSE code often seems slow. Don't do it unless you do 2 or 4 filters in parallel.
If you can calculate costly things in parallel, e.g. several non-linearities or something, then it might speed up single filters as well. I've done this in the past, but it takes a whole lot of profiling to even match the trivial thing, so I don't think I'd bother anymore. If you go for it, remember to budget for latency; I've observed scalar code generally pipelining much better.
Anyway, I got a Stereo class that's basically a 2d vector, that allows me to use template parameters to compile anything for either mono or stereo operation. I've got two implementations of it, one mapping to 2-way SSE2 and one just using regular doubles. The SSE2 version used to be marginally faster on my previous Core2 (and my old Turion laptop), but curiously on my current i7 desktop the scalar "fall-back" (compiled for SSE2 scalar math, which usually outperforms x87 and lets you forget about denormals if you set FTZ and DAZ) performs better.
Four-way single precision still seems a win in some cases, but I'm not convinced it makes much sense to waste time trying to micromanage some tight loops that don't vectorize naturally. My take is to just enable the instruction sets and write regular scalar code.
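As a rough illustration of the templated mono/stereo approach described above (not mystran's actual class; StereoScalar, StereoSSE2 and OnePole are names invented for this sketch), the same filter code can be instantiated for a plain double, a pair of doubles, or a 2-way SSE2 vector:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cassert>

// Scalar fall-back: two plain doubles.
struct StereoScalar {
    double l, r;
    StereoScalar(double v = 0.0) : l(v), r(v) {}
    StereoScalar(double l_, double r_) : l(l_), r(r_) {}
    friend StereoScalar operator+(StereoScalar a, StereoScalar b) { return StereoScalar(a.l + b.l, a.r + b.r); }
    friend StereoScalar operator-(StereoScalar a, StereoScalar b) { return StereoScalar(a.l - b.l, a.r - b.r); }
    friend StereoScalar operator*(StereoScalar a, StereoScalar b) { return StereoScalar(a.l * b.l, a.r * b.r); }
    double left() const { return l; }
    double right() const { return r; }
};

// 2-way SSE2 version with the same interface.
struct StereoSSE2 {
    __m128d v;
    StereoSSE2(double x = 0.0) : v(_mm_set1_pd(x)) {}
    StereoSSE2(double l, double r) : v(_mm_setr_pd(l, r)) {}
    explicit StereoSSE2(__m128d x) : v(x) {}
    friend StereoSSE2 operator+(StereoSSE2 a, StereoSSE2 b) { return StereoSSE2(_mm_add_pd(a.v, b.v)); }
    friend StereoSSE2 operator-(StereoSSE2 a, StereoSSE2 b) { return StereoSSE2(_mm_sub_pd(a.v, b.v)); }
    friend StereoSSE2 operator*(StereoSSE2 a, StereoSSE2 b) { return StereoSSE2(_mm_mul_pd(a.v, b.v)); }
    double left() const { return _mm_cvtsd_f64(v); }
    double right() const { return _mm_cvtsd_f64(_mm_unpackhi_pd(v, v)); }
};

// The DSP code is written once; T selects mono (double) or either stereo type.
template <typename T>
struct OnePole {
    T a, y;
    explicit OnePole(double coeff) : a(coeff), y(0.0)
    {
        assert(coeff >= 0.0 && coeff <= 1.0);
    }
    T tick(T x) { y = y + a * (x - y); return y; }
};
```

Because the choice is a template parameter, swapping OnePole<StereoSSE2> for OnePole<StereoScalar> is a one-line change, which makes it easy to profile both on a given CPU instead of guessing which one wins.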
- KVRian
- 995 posts since 25 Apr, 2005
Urs wrote: unaligned stores/loads may come with a penalty. Never did them myself.
mystran wrote: I remember trying them a couple of times, only to find out that it's mostly cheaper to either shuffle or arrange things so you don't need them. Unfortunately shuffling isn't exactly free either.
Similar to what I found using SSE. Unless you are doing large amounts of number crunching within the SSE section, the cost of unaligned stores/loads will probably be more than your savings. I only use SSE now for very specific sections of code, as the savings don't generally justify the hassle of writing the SSE.
- KVRian
- Topic Starter
- 1091 posts since 8 Feb, 2012 from South - Africa
Thanks for all the replies and suggestions so far! There are still a couple of "oddities" that bother me though...
I've looked at some SSE math libraries, and none of them seem to use _mm_set1_ps for constants - they create their own "defines". So it got me intrigued, and then I checked out the xmmintrin.h file, and set and load are basically the same thing.
Now maybe, because of the recursive structure, the compiler cannot optimize efficiently, because the loop analyzer goes
, so it first loads the constants into the registers, then goes uhm - need space - unload register -> going on and on like that, in a vicious cycle. 'Cause seriously, I thought: worst case scenario - no speedup - not 180% slower. The thing is, some filters, for example Antti's Moog, have a lot of non-linearities that in theory would benefit from SSE. I just started with a simpler linear filter, just to check that my manual 'normal' to SSE translation works.
Will experiment with a defined constants class and see if there is any difference.
On a maybe related note - what the hell is this SSE function used for?
Andrew
Code:
/* The execution of the next instruction is delayed by an implementation
   specific amount of time.  The instruction does not modify the
   architectural state.  */
extern __inline void __attribute__((__always_inline__, __artificial__))
_mm_pause (void)
{
  __asm__ __volatile__ ("rep; nop" : : );
}
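For context on the question above: "rep; nop" encodes the PAUSE instruction, which is a hint for spin-wait loops and has nothing to do with SIMD math. A minimal sketch of the usual pattern follows, assuming a simple test-and-set spin lock (SpinLock is a made-up class for illustration, not something from the header):

```cpp
#include <xmmintrin.h>  // _mm_pause
#include <atomic>
#include <cassert>

// Hypothetical minimal spin lock, only to show where _mm_pause goes.
class SpinLock {
    std::atomic<bool> locked{false};
public:
    void lock()
    {
        // Busy-wait until the flag flips from false to true for us.
        while (locked.exchange(true, std::memory_order_acquire))
            _mm_pause();   // tell the CPU this is a spin-wait loop
    }

    // Returns true if the lock was free and is now held by the caller.
    bool try_lock()
    {
        return !locked.exchange(true, std::memory_order_acquire);
    }

    void unlock()
    {
        assert(locked.load(std::memory_order_relaxed));  // must be held
        locked.store(false, std::memory_order_release);
    }
};
```

The pause does not change what the loop computes; it just makes the busy-wait friendlier to the other hardware thread on the same core and to power consumption.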
- KVRAF
- 8476 posts since 12 Feb, 2006 from Helsinki, Finland
Ichad.c wrote: The thing is - some filters per example -> Antti's Moog, has a lot of non-linearities that in theory would benefit from SSE.
For a traditional sequential implementation, you have 8 tanh evaluations per sample, and 4 of those (i.e. tanhs of the old states) could be calculated in parallel. However, if you unroll the loop once, you'll notice that 3 of those were already calculated as outputs during the previous sample (and can be cached). So assuming a sequential solver (e.g. traditional Euler discretization) there is no parallel potential whatsoever (when you cache, which is more efficient).
If you're not solving sequentially, then sure, you might be able to pull out more parallelism.
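The tanh count above can be checked with a small sketch. It assumes a simplified 4-stage ladder with an explicit Euler-style update, y[k] += g * (tanh(stage input) - tanh(y[k])), which is only in the spirit of the model being discussed; LadderNaive, LadderCached, g, r and the call counter are all names and parameters made up for illustration.

```cpp
#include <cassert>
#include <cmath>

// Naive version: 1 input tanh + 4 old-state tanhs + 3 new-state tanhs = 8.
struct LadderNaive {
    double g, r;
    double y[4];
    long tanhCalls;
    LadderNaive(double g_, double r_) : g(g_), r(r_), y{0, 0, 0, 0}, tanhCalls(0)
    {
        assert(g_ > 0.0);
    }
    double th(double v) { ++tanhCalls; return std::tanh(v); }
    double tick(double x)
    {
        double drive = th(x - r * y[3]);      // input with feedback
        for (int k = 0; k < 4; ++k) {
            double old = th(y[k]);            // tanh of the old state
            y[k] += g * (drive - old);
            if (k < 3) drive = th(y[k]);      // new state drives the next stage
        }
        return y[3];
    }
};

// Cached version: tanh(y[0..2]) computed as stage outputs is kept and
// reused as the "old state" tanh of the next sample, leaving 5 per sample.
struct LadderCached {
    double g, r;
    double y[4];
    double t[3];      // t[k] == tanh(y[k]); starts at tanh(0) == 0
    long tanhCalls;
    LadderCached(double g_, double r_)
        : g(g_), r(r_), y{0, 0, 0, 0}, t{0, 0, 0}, tanhCalls(0)
    {
        assert(g_ > 0.0);
    }
    double th(double v) { ++tanhCalls; return std::tanh(v); }
    double tick(double x)
    {
        double drive = th(x - r * y[3]);
        for (int k = 0; k < 3; ++k) {
            y[k] += g * (drive - t[k]);       // reuse last sample's tanh
            t[k] = th(y[k]);                  // used now and in the next sample
            drive = t[k];
        }
        y[3] += g * (drive - th(y[3]));       // last stage has nothing cached
        return y[3];
    }
};
```

In the cached version each stage's tanh needs the result of the stage before it, so the evaluations that remain largely form a dependency chain rather than a batch that could be handed to one SIMD tanh.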
