Fast FIR implementation

DSP, Plugin and Host development discussion.

Post

The fastest implementation I've found uses a ring buffer:

Code: Select all

// inspired by https://www.youtube.com/watch?v=fdCAsZkjpSg
inline double process (double input) noexcept
{
    buffer [offset] = input;

    double y = 0.0;
    int i;
    int n = 0;

    // newest sample back down to the start of the delay line...
    for (i = offset; i != -1; i--)
        y += taps[n++] * buffer[i];

    // ...then wrap around to the older samples at the top
    for (i = numTaps-1; i != offset; i--)
        y += taps[n++] * buffer[i];

    if (++offset >= numTaps) offset = 0;

    return y;
}
But it is not fast enough to process filters with ~70 taps.
I've tried to implement this with SSE2 intrinsics, but I don't get any improvement.

Code: Select all

// y, a, b, mul: __m128d
// in the for loop:
 a  = _mm_loadu_pd (taps + n++);
 b  = _mm_loadu_pd (buffer + i);
 mul = _mm_mul_pd (a, b);
 y = _mm_add_pd (y, mul);
Then I've also tried casting taps[] and buffer[] to __m128d outside the for loop:

Code: Select all

 __m128d* t = (__m128d*)taps;      // and __m128* t = reinterpret_cast<__m128*>(taps)
 __m128d* b = (__m128d*)buffer;

// and then in the for loop:
 mul = _mm_mul_pd (*(t+n++), *(b+i));
 y = _mm_add_pd (y, mul);
But __m128d pointers into a C array advance the index by two elements instead of one, for example:

Code: Select all

double cArray[] = {1, 2, 3, 4, 5};
*(m128Array + 0) == 1
*(m128Array + 1) == 3
*(m128Array + 2) == 5
So, I'm a bit confused.

Post

The quick answer is that SSE is overrated. Across a diverse range of practical cases I never got more than about a 30% performance increase, be it floats (SSE) or doubles (SSE2). Multi-threaded processing is far more rewarding.

Post

It looks like you're constantly loading and unloading data to and from the SIMD registers. This isn't very efficient. Use correctly aligned buffers and __m128d pointers and iterate with those.

For an FIR, 32-bit precision is fine in my experience, with a noise floor under -100 dB, so you can get a further speed-up with SSE1.

The problem is, even with correctly aligned buffers things get messy (you might need multiple copies of the kernel, each offset by a sample) and you'll need to work around it. There was a tutorial on just that somewhere, but I can't seem to track it down.

I can say that with SSE1 I had significant performance improvements in my FIR code; I haven't tested SSE2 for this.
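Something along these lines, for illustration (untested sketch; assumes C++11 alignas and lengths padded up to a multiple of 4 floats, member names are made up):

Code: Select all

// sketch: 16-byte aligned storage so __m128 loads/pointer iteration are legal
struct FirState
{
    static const int maxTaps = 72;          // e.g. ~70 taps rounded up to a multiple of 4
    alignas(16) float taps   [maxTaps];     // filter kernel, zero padded
    alignas(16) float buffer [maxTaps];     // delay line, zero padded
};

// then iterate whole registers instead of reloading scalars:
// const __m128* h = reinterpret_cast<const __m128*>(state.taps);
// const __m128* z = reinterpret_cast<const __m128*>(state.buffer);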

Post

plusfer wrote:But __m128d pointers into a C array advance the index by two elements instead of one, for example:

Code: Select all

double cArray[] = {1, 2, 3, 4, 5};
*(m128Array + 0) == 1
*(m128Array + 1) == 3
*(m128Array + 2) == 5
So, I'm a bit confused.
That's kind of the point. You only need half the operations for SSE2 or a quarter for SSE1.

Edit: I think you should find m128Array[0] == [1, 2]

Edit 2: you need to make sure your buffer length is divisible by 2 (SSE2), otherwise, as in your example:
m128Array[2] == [5, out of range]
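To make it concrete, a rough sketch of the pairwise version (untested; assumes taps and buffer are 16-byte aligned doubles and numTaps is even):

Code: Select all

// each __m128d holds two doubles, so the vector index advances half as fast
const __m128d* h = reinterpret_cast<const __m128d*>(taps);    // taps 16-byte aligned
const __m128d* z = reinterpret_cast<const __m128d*>(buffer);  // buffer 16-byte aligned
__m128d acc = _mm_setzero_pd();

for (int i = 0; i < numTaps / 2; ++i)                         // numTaps assumed even
    acc = _mm_add_pd (acc, _mm_mul_pd (h[i], z[i]));

double partial[2];
_mm_storeu_pd (partial, acc);
double y = partial[0] + partial[1];                           // horizontal sum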

Post

Aleksey Vaneev wrote:The quick answer is that SSE is overrated. Across a diverse range of practical cases I never got more than about a 30% performance increase, be it floats (SSE) or doubles (SSE2). Multi-threaded processing is far more rewarding.
Multithreading in processBlock? How do you synchronize threads without using locks? With my limited knowledge, I guess it must be hard to do.

matt42 wrote: That's kind of the point. You only need half the operations for SSE2 or a quarter for SSE1.

Edit: I think you should find m128Array[0] == [1, 2]

Edit 2: you need to make sure your buffer length is divisible by 2 (SSE2), otherwise, as in your example:
m128Array[2] == [5, out of range]
I understand.
It's hard to implement, because I need an odd number of taps to get linear phase.
Or I could process taps[:len-1] with SSE and process the last tap in plain C++.
Another way is to remove the for loop and hardcode the whole filter; I don't like to do that, but...

Thanks guys!

Post

plusfer wrote:I understand.
It's hard to implement, because I need an odd number of taps to get linear phase.
Or I could process taps[:len-1] with SSE and process the last tap in plain C++.
I'd just increase the kernel length by 1 and zero the last element.
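Something like this (sketch only; assumes the taps live in a std::vector<float>):

Code: Select all

// pad the kernel with zeros up to the next multiple of 2 (SSE2) or 4 (SSE1);
// the extra zero taps cost a few multiplies but don't change the output
while (taps.size() % 4 != 0)
    taps.push_back (0.0f);
numTaps = static_cast<int> (taps.size());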
plusfer wrote:Multithreading in processBlock? How do you synchronize threads without using locks? With my limited knowledge, I guess it must be hard to do.
From my point of view, multithreading is best handled by the host. Exceptions would be if the algorithm has to be multithreaded to function correctly, or if the CPU demands of a single instance are high enough to max out (or come close to maxing out) a single core.

Found the tutorial. It's for sse1, but the principles are basically the same.

Speedy Fast 1d Convolution with SSE

Also, why do you have two loops to iterate through the kernel?

Post

plusfer wrote:Multithreading in processBlock? How do you synchronize threads without using locks? With my limited knowledge, I guess it must be hard to do.
We had a long thread about that some time ago.
The gist is that there is apparently little agreement about how to do it right, complicated by the fact that there are 3rd-party libraries that solve the problem without their users actually knowing what's going on behind the scenes.

Threading without locks is not the problem in processBlock. Since it is a task that has a definite beginning and end (it returns to the host), some kind of synchronization is necessary. When you do it without locks, the result is busy waiting. When you use a lock, the result is still busy waiting, as the operating system's implementation of a mutex uses busy waiting behind the scenes before finally giving up and sleeping the thread (this is the behavior when the CPU has multiple cores). Apparently there are operating system hacks that alter the behavior of such sleep calls (and of the thread scheduler) and make them more friendly to real-time tasks, but this does not alter the overall picture much.

The conclusion was that threading is best handled by the host, as matt42 also notes. One reason is that only in that case can those busy-waiting CPU cycles be used for doing something useful (*), instead of just waiting for something to happen. Another reason is that having more CPU-intensive threads than the actual number of CPU cores is rarely useful. In short, the operating system has a thread scheduler which is a great piece of engineering, but there is no reason not to give it a simpler problem to solve.

(*) A perfectly balanced load between spawned "tasks" that always predictably take the same amount of time to complete, regardless of possible cache misses, might be another scenario.
~stratum~

Post

stratum wrote:When you do it without locks, the result is busy waiting.
In Windows, if we use WaitForSingleObject(), for example, the calling thread is put to sleep, either until the object is signaled or we hit a predetermined timeout (0 ms to infinity), avoiding a busy wait. I'd be surprised if there weren't equivalent methods for OS X also. It's a minor point in this context, and other than that I pretty much agree.
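Roughly the shape of it (sketch only; names are made up and error handling is omitted):

Code: Select all

// sketch: auto-reset event, worker sleeps until signalled or the timeout passes
#include <windows.h>

HANDLE workReady = CreateEvent (NULL, FALSE, FALSE, NULL);  // auto-reset, initially unsignalled

// producer side (e.g. when a block of work is queued):
//   SetEvent (workReady);

// worker side:
DWORD r = WaitForSingleObject (workReady, 2 /* ms timeout */);
if (r == WAIT_OBJECT_0)
{
    // event was signalled - do the work
}
else if (r == WAIT_TIMEOUT)
{
    // timed out - nothing to do this time
}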

Post

matt42 wrote:In Windows, if we use WaitForSingleObject(), for example, the calling thread is put to sleep, either until the object is signaled or we hit a predetermined timeout (0 ms to infinity), avoiding a busy wait. I'd be surprised if there weren't equivalent methods for iOS also. Other than that I pretty much agree.
I have noticed that Windows was noticeably "better" in this respect than, say, Linux, too, but the result was inconclusive.

For one thing, boost::mutex, which also happens to use WaitForSingleObject, had that busy-waiting problem when used for synchronization in a low-latency audio pipeline together with boost::condition, to implement a producer-consumer pattern that connected another thread to the PortAudio audio callback. I know this design does not make sense - it's just an example that shows the problem. The test was done on Windows.

In another case, I ignored a busy-waiting problem observed only on Linux, and it turned out that the app still worked well without consuming CPU unnecessarily, which made me think that the difference between Windows and Linux was just the way the "task manager" calculated and displayed CPU usage, not an actual difference in the thread scheduler. I'm not sure about that.
~stratum~

Post

Maybe it's not directly applicable to the thrust of this thread, but locks are worth using at times. As I remember, Windows critical sections are much faster than mutexes because they don't enter the kernel, though of course that limits them to a single process. This can be useful if you want to run some sort of anytime algorithm where you are, for example, iteratively increasing the accuracy of a result or generating logging data. Of course, having metrics for your application makes it even more useful, as you can fine-tune the spin time to trade off the wait against switching to another thread.
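For reference, a critical section with a tuned spin count looks roughly like this (sketch; the spin count value is just a placeholder to tune):

Code: Select all

// sketch: spin briefly in user space before falling back to a kernel wait
#include <windows.h>

CRITICAL_SECTION cs;
InitializeCriticalSectionAndSpinCount (&cs, 4000);  // spin count is the value to tune

// on each thread:
EnterCriticalSection (&cs);
//  ... short shared-state update ...
LeaveCriticalSection (&cs);

// at shutdown:
DeleteCriticalSection (&cs);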

Post

I've found compilers are competitive with hand-written SSE2 code for double precision, and AVX is necessary to see better results. Mileage also varies on different CPUs; Intel seems to have done a better job at optimizing AVX instructions than AMD on the CPUs I've been testing on.

Also Agner Fog has a nice VectorClass you can experiment with if you've not done much with intrinsics: http://www.agner.org/optimize/#vectorclass
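For example, an FIR inner loop with the vector class might look roughly like this (untested sketch; assumes the lengths are padded to a multiple of 4 floats):

Code: Select all

// rough sketch using Agner Fog's VCL (Vec4f = four floats in one SSE register)
#include "vectorclass.h"

float dotProduct (const float* taps, const float* buffer, int numTaps)
{
    Vec4f acc (0.0f);
    for (int i = 0; i < numTaps; i += 4)        // numTaps assumed to be a multiple of 4
    {
        Vec4f h, z;
        h.load (taps + i);                       // unaligned load
        z.load (buffer + i);
        acc += h * z;
    }
    return horizontal_add (acc);                 // sum the four partial products
}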

Post

There is nothing wrong with using "locks" (whether that's semaphores or mutexes or condition variables or whatever the local OS calls its various implementations of the same) of pretty much any type in the audio thread.

There is absolutely everything wrong with having the audio thread wait for low priority threads (most notably the GUI thread) whether or not such waiting involves "locks" of one type or another.

If you create your own set of worker threads running at "real-time priority" (or some approximation of the same) then there isn't really anything wrong with using "locks" as such.

The main "hairy" part here is figuring out which priority to put your worker threads at, but for the most part if your setup is more or less reasonable the answer is pretty much "the higher the better."

Ps. As others have noted, the actual overhead of various methods varies somewhat, but in general they all have some overhead, yet they all work fine if you are willing to accept said overhead.

Post

matt42 wrote:
plusfer wrote:I understand.
It's hard to implement, because I need an odd number of taps to get linear phase.
Or I could process taps[:len-1] with SSE and process the last tap in plain C++.
I'd just increase the kernel length by 1 and zero the last element.
plusfer wrote:Multithreading in processBlock? How do you synchronize threads without using locks? With my limited knowledge, I guess it must be hard to do.
From my point of view, multithreading is best handled by the host. Exceptions would be if the algorithm has to be multithreaded to function correctly, or if the CPU demands of a single instance are high enough to max out (or come close to maxing out) a single core.

Found the tutorial. It's for sse1, but the principles are basically the same.

Speedy Fast 1d Convolution with SSE
Great! Thanks matt42, my FIR filter is 2x faster using SSE, processing 4 floats at a time.
matt42 wrote:Also, why do you have two loops to iterate through the kernel?
Because of the ring buffer. If the first for loop runs 9 times, the second runs 11 times in a 20-tap filter.
Now I've replaced the ring buffer so I can process 4 taps per loop iteration.

Here is the code:

Code: Select all

    inline float processBasicSSE (float input) noexcept
    {
        int i;
        buffer[0] = input;

        __m128 _ALIGNED_(16) *h = reinterpret_cast<__m128*>(taps);
        __m128 _ALIGNED_(16) *z = reinterpret_cast<__m128*>(buffer);
        __m128  y = _mm_set_ps (0.0, 0.0, 0.0, 0.0);
    
        for (i = 0; i < numTaps; i += 4)
            y = _mm_add_ps (y, _mm_mul_ps (*h++, *z++));

        // note: source and destination overlap here, so strictly this should be memmove()
        memcpy (&buffer[1], &buffer[0], (numTaps - 2) * sizeof (buffer[0]));
    
        float _ALIGNED_(16) result[4];
        _mm_store_ps (result, y);
        return result[0] + result[1] + result[2] + result[3];

    }

Post

I think the memcpy() may be inefficient, and this would get worse the larger the buffer, I suppose. I haven't profiled it, but I think it would be worth looking into.

Perhaps make the input buffer a power-of-2 length. Every time the index is incremented you could do something like:

Code: Select all

bufferIndex++;
bufferIndex &= bufferLength-1;
Obviously bufferLength-1 would be a precalculated value.

The only thing is that now you will need multiple offset copies of the kernel, so it'd only be worth it if it really speeds things up.
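For comparison, a scalar sketch of the masked ring buffer, leaving the SSE/offset-kernel part aside (hypothetical members bufferIndex, mask, taps, buffer; bufferLength a power of 2 >= numTaps):

Code: Select all

// sketch: power-of-two ring buffer, no per-sample memcpy
// mask = bufferLength - 1, precalculated
inline float processMasked (float input) noexcept
{
    buffer[bufferIndex] = input;

    float y = 0.0f;
    for (int n = 0; n < numTaps; ++n)
        y += taps[n] * buffer[(bufferIndex - n) & mask];   // wrap via the mask

    bufferIndex = (bufferIndex + 1) & mask;                // advance the write position
    return y;
}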

Post

In my experience (disclaimer: I'm speaking about doubles):

1) Asm instructions provide faster convolution than plain C++. Speaking about VC8 and VC10, GCC 4.0, GCC 4.2.
2) SSE2 provides a faster one only on some systems; funnily enough, more on AMD ones than Intel. The improvement ranges between 10% and 30%.
3) I wrote around 8 slightly different asm versions, and they perform very differently on different CPU architectures. Cache is the reason for the different performance. So in the end we implemented a sort of "tuner" (like IPP).
4) Normally direct convolution is used in a partitioned convolution scheme; in that case the partition strategy helps more than asm if block sizes are properly chosen.
5) Multithreading is a good strategy, but synchronization is not so straightforward. Yes, there are convention papers where multithreading does miracles, but that is not the case for real-time audio plugins, perhaps processing per sample frame.
6) There is nothing on the Windows side which is not replicable on the Mac OS X/Linux side. You can have the same building blocks; you can wake up threads or wait for them in similar ways using, for example, POSIX primitives. The general strategy is a spin lock for shorter tasks (see the sketch below), and events for longer ones. Context switches require a lot of attention; they waste a lot of CPU cycles. Sometimes it is not worth it. Pay attention: test your things at a high instance count, with CPU close to 100%. Many strategies are cool before you reach the real CPU limit, but they perform poorly when stressed, because mutex contention increases.
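A minimal spin lock along those lines (sketch, C++11):

Code: Select all

// minimal spin lock sketch for the "shorter tasks" case mentioned in point 6
#include <atomic>

struct SpinLock
{
    std::atomic_flag flag = ATOMIC_FLAG_INIT;

    void lock()   noexcept { while (flag.test_and_set (std::memory_order_acquire)) { /* spin */ } }
    void unlock() noexcept { flag.clear (std::memory_order_release); }
};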
