Slow Filter in SSE Hell...
Ichad.c
KVRist
Posted: Wed May 16, 2012 1:56 pm
Spent a couple of hours (new to SSE) translating a filter into SSE code, and now it's about 180% SLOWER. The code doesn't use any trig or fancy math, just */+-. For constants I used, for example, __m128 m5 = _mm_set1_ps(5.0f); outside the loop. I loaded the inputs with, for example, __m128 InStream = _mm_loadu_ps(in1); and followed that with just a bunch of multiplies, adds etc. At the end I used float c; _mm_storeu_ps(&c,LP); *out1=c; I also tried the above with the normal scalar (ss) versions and the performance is the same. I also used _MM_SET_FLUSH_ZERO_MODE(0x8000); which I think is correct(?).

Now I know unaligned loads and stores are a bit pricey, but could they be that pricey? Or did I mess something up drastically? Is _mm_set1_ps (outside the loop) the right way to do constants with SSE? Do you get speed-ups or slow-downs when using SSE for filters?

Regards
Andrew
^ Joined: 08 Feb 2012  Member: #274678  Location: South - Africa
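A minimal sketch of the load/store pattern described in the post above, assuming a plain gain-and-offset loop rather than the actual filter (which the post does not show). The name process_block and the constants are hypothetical. One detail worth checking: _mm_storeu_ps always writes four floats, so pointing it at a single float variable writes 12 bytes past it, while _mm_store_ss writes only the lowest lane. Whether that explains the slowdown cannot be told from the post alone.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Hypothetical example: out[i] = in[i] * 5.0f + 0.5f for n samples. */
static void process_block(const float *in, float *out, int n)
{
    assert(n >= 0);

    /* Constants are broadcast once, outside the loop. */
    const __m128 gain = _mm_set1_ps(5.0f);
    const __m128 offs = _mm_set1_ps(0.5f);

    int i = 0;
    /* Main loop: four samples per iteration, unaligned load and store. */
    for (; i + 4 <= n; i += 4) {
        __m128 x = _mm_loadu_ps(in + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(x, gain), offs));
    }
    /* Tail: one sample at a time with the scalar (ss) forms.
       _mm_load_ss reads one float, _mm_store_ss writes one float. */
    for (; i < n; ++i) {
        __m128 x = _mm_load_ss(in + i);
        _mm_store_ss(out + i, _mm_add_ss(_mm_mul_ss(x, gain), offs));
    }
}
```

Calling process_block(in, out, 6) handles the first four samples in the vector loop and the remaining two in the scalar tail, without ever writing outside out[0..5].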
Urs
KVRAF
Posted: Wed May 16, 2012 2:06 pm
Unaligned stores/loads may come with a penalty. Never did them myself.

SSE code often seems slow. Don't do it unless you do 2 or 4 filters in parallel.

Otherwise write "normal" floating point code and let the compiler translate it to SSE where applicable ("instruction set=SSE2" or so).
^ Joined: 07 Aug 2002  Member: #3542  Location: Berlin
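To illustrate the "2 or 4 filters in parallel" suggestion, here is a rough sketch that runs four independent one-pole lowpass filters, one per SSE lane. The filter form y += a * (x - y), the struct onepole4, the function names, and the interleaved buffer layout are all assumptions made for the example, not anything taken from the posters' code.

```c
#include <assert.h>
#include <xmmintrin.h>

/* Four independent one-pole lowpasses, one per lane (hypothetical layout). */
typedef struct {
    __m128 a;  /* per-lane coefficient */
    __m128 y;  /* per-lane state */
} onepole4;

static void onepole4_init(onepole4 *f, const float a[4])
{
    f->a = _mm_loadu_ps(a);
    f->y = _mm_setzero_ps();
}

/* in/out are interleaved: frame n holds the four channels at [4*n .. 4*n+3]. */
static void onepole4_process(onepole4 *f, const float *in, float *out, int frames)
{
    assert(frames >= 0);
    __m128 a = f->a;
    __m128 y = f->y;
    for (int n = 0; n < frames; ++n) {
        __m128 x = _mm_loadu_ps(in + 4 * n);
        /* y[n] = y[n-1] + a * (x[n] - y[n-1]), for all four lanes at once */
        y = _mm_add_ps(y, _mm_mul_ps(a, _mm_sub_ps(x, y)));
        _mm_storeu_ps(out + 4 * n, y);
    }
    f->y = y;
}
```

The recursion still runs one sample at a time; the gain comes only from doing four filters' worth of work per instruction, which is why a single filter rarely benefits.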
Ichad.c
KVRist
Posted: Wed May 16, 2012 2:19 pm
Urs wrote:

Otherwise write "normal" floating point code and let the compiler translate it to SSE where applicable ("instruction set=SSE2" or so).


Well, gcc will not "vectorize" any recursive structure (or nested loop) btw - it's in the manual. I got a hefty speedup when gcc "auto-vectorized" some waveshaping functions I tried a while back. So I thought: filter + nonlinearities sounds fun, but gcc won't auto-vectorize it, so you'd get slower non-linearities plus possibly a slower filter. So I decided, why not do it manually? Now I'm stuck :(
^ Joined: 08 Feb 2012  Member: #274678  Location: South - Africa
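The difference between the two cases mentioned above can be sketched as follows. The cubic shaper and both function names are hypothetical; whether a particular compiler actually vectorizes the first loop depends on its version and flags, so treat this as an illustration of the dependency structure only.

```c
#include <assert.h>

/* Stateless waveshaper: each output depends only on its own input,
   so the iterations are independent and a compiler may vectorize them. */
static void waveshape(const float *in, float *out, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        float x = in[i];
        out[i] = x * (1.5f - 0.5f * x * x);  /* hypothetical cubic shaper */
    }
}

/* Recursive one-pole filter: each output depends on the previous output
   (a loop-carried dependency), so the iterations cannot simply be run
   four at a time. Returns the final state. */
static float onepole(const float *in, float *out, int n, float a, float y)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        y += a * (in[i] - y);
        out[i] = y;
    }
    return y;
}
```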
mystran
KVRAF
Posted: Wed May 16, 2012 3:37 pm
Urs wrote:
unaligned stores/loads may come with a penalty. Never did them myself.


I remember trying them a couple of times, only to find out that it's mostly cheaper to either shuffle or arrange things so you don't need them. Unfortunately shuffling isn't exactly free either.

Quote:

SSE code often seems slow. Don't do it unless you do 2 or 4 filters in parallel.


If you can calculate costly things in parallel, e.g. several non-linearities or something, then it might speed up single filters as well. I've done this in the past, but it takes a whole lot of profiling to even match the trivial thing, so I don't think I'd bother anymore. If you go for it, remember to budget for latency; I've observed scalar code generally pipelining much better.

Anyway, I got a Stereo class that's basically a 2d vector, that allows me to use template parameters to compile anything for either mono or stereo operation. I've got two implementations of it, one mapping to 2-way SSE2 and one just using regular doubles. The SSE2 version used to be marginally faster on my previous Core2 (and my old Turion laptop), but curiously on my current i7 desktop the scalar "fall-back" (compiled for SSE2 scalar math, which usually outperforms x87 and lets you forget about denormals if you set FTZ and DAZ) performs better.

Four-way single precision still seems a win in some cases, but I'm not convinced it makes much sense to waste time trying to micromanage some tight loops that don't vectorize naturally. My take is to just enable the instruction sets and write regular scalar code.
^ Joined: 11 Feb 2006  Member: #97939  Location: Helsinki, Finland
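For the FTZ and DAZ remark above, a minimal sketch of setting both bits in the MXCSR control register, assuming an x86 target with SSE and a CPU that supports DAZ. The helper names enable_ftz_daz and restore_mxcsr are made up for the example; the bit values are the documented MXCSR positions (FTZ is bit 15, DAZ is bit 6). MXCSR is per-thread state, so this would need to run on the audio thread itself.

```c
#include <assert.h>
#include <xmmintrin.h>

#define MXCSR_FTZ 0x8000u  /* flush-to-zero: denormal results become 0 */
#define MXCSR_DAZ 0x0040u  /* denormals-are-zero: denormal inputs read as 0 */

/* Sets FTZ and DAZ for the calling thread and returns the previous MXCSR. */
static unsigned int enable_ftz_daz(void)
{
    unsigned int old = _mm_getcsr();
    _mm_setcsr(old | MXCSR_FTZ | MXCSR_DAZ);
    assert((_mm_getcsr() & (MXCSR_FTZ | MXCSR_DAZ)) == (MXCSR_FTZ | MXCSR_DAZ));
    return old;
}

/* Restores a previously saved MXCSR value. */
static void restore_mxcsr(unsigned int old)
{
    _mm_setcsr(old);
}
```

A typical pattern would be to call enable_ftz_daz() at the top of the processing callback and restore_mxcsr() with the returned value before returning to the host. This only affects SSE math; x87 code ignores MXCSR.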
Caco
KVRian
Posted: Wed May 16, 2012 11:14 pm
Similar to what I found using SSE. Unless you are doing large amounts of number crunching within the SSE section, the cost of unaligned stores/loads will probably be more than your savings. I now only use SSE for very specific sections of code, as the savings don't generally justify the hassle of writing it.
^ Joined: 25 Apr 2005  Member: #66287  
Ichad.c
KVRist
Posted: Thu May 17, 2012 2:32 am
Thanks for all the replies and suggestions so far! There are still a couple of "oddities" that bother me though...
I've looked at some SSE math libraries, and none of them seem to use _mm_set1_ps for constants; they create their own "defines". That got me intrigued, so I checked out the xmmintrin.h file, and set and load are basically the same thing :o Now maybe, because of the recursive structure, the compiler cannot optimize efficiently: the loop analyzer gets confused, so it first loads the constants into the registers, then goes uhm - need space - unload register, and so on in a vicious cycle. 'Cause seriously, I thought the worst case scenario would be no speedup, not 180% slower. The thing is, some filters, for example Antti's Moog, have a lot of non-linearities that in theory would benefit from SSE. I just started with a simpler linear filter to check that my manual 'normal'-to-SSE translation works.
I will experiment with a defined constants class and see if there is any difference.

On a maybe related note: what the hell is this SSE function used for?

/* The execution of the next instruction is delayed by an implementation
   specific amount of time.  The instruction does not modify the
   architectural state.  */
extern __inline void __attribute__(( __always_inline__, __artificial__))
_mm_pause (void)
{
  __asm__ __volatile__ ("rep; nop" : : );
}
 


Andrew
^ Joined: 08 Feb 2012  Member: #274678  Location: South - Africa
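On the _mm_pause question above: the intrinsic emits the x86 PAUSE instruction (encoded as "rep; nop"), which is a hint for spin-wait loops and has nothing to do with SIMD arithmetic. A minimal sketch of the typical use, assuming C11 atomics are available; spin_lock, spin_acquire and spin_release are hypothetical names, not part of any library.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical spin lock built on a C11 atomic_flag. */
typedef struct {
    atomic_flag locked;
} spin_lock;

static void spin_acquire(spin_lock *l)
{
    assert(l != NULL);
    /* Keep trying to set the flag; while another thread holds it,
       _mm_pause tells the CPU this is a busy-wait loop. */
    while (atomic_flag_test_and_set_explicit(&l->locked, memory_order_acquire)) {
        _mm_pause();
    }
}

static void spin_release(spin_lock *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

A lock would be declared as spin_lock l = { ATOMIC_FLAG_INIT }; before first use. None of this is relevant to the filter loop itself.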
mystran
KVRAF
Posted: Thu May 17, 2012 7:27 am
Ichad.c wrote:
The thing is, some filters, for example Antti's Moog, have a lot of non-linearities that in theory would benefit from SSE.


For a traditional sequential implementation, you have 8 tanh evaluations per sample, and 4 of those (i.e. the tanhs of the old states) could be calculated in parallel. However, if you unroll the loop once, you'll notice that 3 of those were already calculated as outputs during the previous sample (and can be cached). So assuming a sequential solver (e.g. traditional Euler discretization) there is no parallel potential whatsoever (when you cache, which is more efficient).

If you're not solving sequentially, then sure, you might be able to pull out more parallelism.
^ Joined: 11 Feb 2006  Member: #97939  Location: Helsinki, Finland
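The caching argument above can be sketched with a simplified four-stage ladder updated sequentially, where each stage does y[k] += g * (nl(input_k) - nl(y[k])). This is an assumed, stripped-down form for illustration, not the actual published model: the struct, function names and coefficients are hypothetical, and shape() is a cheap stand-in for tanh so the example needs no math library.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for tanh: a clamped rational approximation. */
static float shape(float x)
{
    if (x > 3.0f) x = 3.0f;
    if (x < -3.0f) x = -3.0f;
    return x * (27.0f + x * x) / (27.0f + 9.0f * x * x);
}

/* Naive step: 8 nonlinearity evaluations per sample
   (1 for the input, 4 for the old states, 3 for the new stage outputs). */
static float ladder_step_naive(float y[4], float x, float g, float r)
{
    float in = shape(x - r * y[3]);
    for (int k = 0; k < 4; ++k) {
        y[k] += g * (in - shape(y[k]));   /* nonlinearity of the old state */
        if (k < 3) in = shape(y[k]);      /* nonlinearity of the new output */
    }
    return y[3];
}

typedef struct {
    float y[4];  /* stage states */
    float t[3];  /* cached shape(y[0..2]) from the previous sample */
} ladder;

/* Cached step: the outputs computed for stages 0..2 are exactly the
   old-state values needed next sample, so only 5 evaluations remain. */
static float ladder_step_cached(ladder *s, float x, float g, float r)
{
    assert(s != NULL);
    float in = shape(x - r * s->y[3]);
    for (int k = 0; k < 3; ++k) {
        s->y[k] += g * (in - s->t[k]);    /* reuse cached value */
        s->t[k] = shape(s->y[k]);         /* compute once, keep for next sample */
        in = s->t[k];
    }
    s->y[3] += g * (in - shape(s->y[3])); /* last stage was never cached */
    return s->y[3];
}
```

With all states and cache entries starting at zero, both versions produce the same output. In the cached version the remaining evaluations form a chain where each one depends on the result of the previous stage, which is the point made above about there being nothing left to run in parallel.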
All times are GMT - 8 Hours
