Multiple threads running in DSP code

DSP, Plugin and Host development discussion.

Post

In my experience, parallelizing audio applications is not necessarily the problem; the hard parts are real-time-safe synchronization and wakeup latencies, especially if low latencies/small block sizes are required.
Current hardware has worst-case scheduling latencies of several hundred microseconds, which can be a limiting factor when using block sizes of 64 samples at 44.1/48 kHz. And of course, one needs to take care not to block the real-time threads.
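(As a rough illustration of the "don't block the real-time threads" point, a common pattern is try_lock on the audio side, falling back to the previous value when the lock is contended. This is only a sketch with made-up names, not code from this thread.)

Code: Select all

// Sketch only: the audio thread never waits on the mutex; if the UI/worker
// thread happens to hold it, we simply reuse the value from the last block.
#include <mutex>

std::mutex paramMutex;
float sharedCutoff = 1000.0f;   // written by a non-real-time thread under paramMutex
float lastCutoff   = 1000.0f;   // owned by the audio thread

float readCutoffRealtimeSafe()
{
    if (paramMutex.try_lock()) {        // returns immediately, success or not
        lastCutoff = sharedCutoff;
        paramMutex.unlock();
    }
    return lastCutoff;                  // worst case: one block of stale data
}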

Some self-promo: I wrote my master's thesis about parallelizing the synthesis engine of SuperCollider, which includes a discussion of the technical challenges with a focus on low-latency real-time safety:
http://tim.klingt.org/publications/tim_ ... ernova.pdf

Post

DaveHoskins wrote:It does a 'WaitFor...' in TinyThreads' code. It's in the 'arg.cond.wait(arg.m);' call.
And this is what they do in their, albeit limited, examples.
You're doing it by the book; the only thing is that arg.wait and arg.finished are duplicates, so you don't need to keep them both.

Post

Ciozzi wrote:
DaveHoskins wrote:It does a 'WaitFor...' in TinyThreads' code. It's in the 'arg.cond.wait(arg.m);' call.
And this is what they do in their, albeit limited, examples.
You're doing it by the book; the only thing is that arg.wait and arg.finished are duplicates, so you don't need to keep them both.
Yeah, 'finished' was put in to "fix" an unrelated sync bug. I now wait for 'wait' to be true to know it has finished the current cycle.
I may have found a way to do long convolution using threading properly, but I've yet to finish the code. I'll post all the details later, if it works.
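(For reference, a minimal sketch of the wait-on-a-flag pattern being discussed, written with the std:: equivalents of the TinyThread++ calls. A single 'finished' flag guarded by the mutex is enough, and the predicate form of wait() also protects against spurious wakeups. Names here are illustrative, not the actual plug-in code.)

Code: Select all

#include <condition_variable>
#include <mutex>

struct WorkerSync {
    std::mutex m;
    std::condition_variable cond;
    bool finished = false;              // one flag is enough
};

// Worker thread, after it has processed one cycle:
void signalDone(WorkerSync& arg)
{
    std::lock_guard<std::mutex> lock(arg.m);
    arg.finished = true;
    arg.cond.notify_one();
}

// Consumer side, waiting for the cycle to complete:
void waitForCycle(WorkerSync& arg)
{
    std::unique_lock<std::mutex> lock(arg.m);
    arg.cond.wait(lock, [&] { return arg.finished; });   // handles spurious wakeups
    arg.finished = false;               // re-arm for the next cycle
}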

Post

I couldn't find much stuff on the Net describing threaded convolution, well, not anything I could understand anyway. :) So here's mine.
My convolution has 3 or more partitions, I'm still balancing the process.

Code: Select all

Time----->

1 |---|---|
2         |-------|-------|
3                         |---------------|--------------- etc
(Overlap-save is omitted from this explanation for simplicity.)

Partition 3 is a large FFT that can go all the way up to over a dozen seconds, so this is the one everything is waiting for.
If I put all three partitions in different threads I still have to wait for the largest convolution to finish, which means I'm sitting idle in the main process, something I don't want, of course.

So instead I use only one thread, for partition 3, but delay waiting for it to finish by the duration of a partition-2 block; see 'added' in the diagram below.

Code: Select all

1 |---|---|
2         |-------|-------|-added-|
3                          Delayed|---------------|--------------- etc
By the time the samples of the extra partition 2 have completed, thread 3 has finished, so there's no idle waiting.
The delay means I have to store the processed samples in a ring buffer and grab them from the buffer at the current time minus the partition-2 size.
I'm going to try another thread and delay partition 2 as well, but it might not be worth it as those partitions are quite short anyway. And people's warnings about breaking the host's threading system might have dissuaded me anyway.

This halved the metered CPU and stopped the periodic spikes in CPU usage in all the DAWs I've tried so far.

I've only just finished, so much testing is still needed, and this is nowhere near a complete explanation, but it'll hopefully nudge the old thinking buds into gear for anyone struggling like I have with really fast convolution. And I'm sure there are many ways to do it. :wink:

Cheers,
Dave.
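(A hedged sketch of the scheme above, using std::thread rather than TinyThread++, whose API is very similar. The audio thread hands a partition-3 block to the worker with startJob(), keeps producing output from the smaller partitions, and only calls collectResult() one partition-2 block later, by which time the worker has normally finished, so the wait should return at once. All names and the trivial convolveLargePartition() body are placeholders, not the actual plug-in code.)

Code: Select all

#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

class LargePartitionWorker {
public:
    explicit LargePartitionWorker(size_t partitionSize)
        : input(partitionSize), output(partitionSize)
    {
        worker = std::thread([this] {
            std::unique_lock<std::mutex> lock(m);
            for (;;) {
                cv.wait(lock, [this] { return jobReady || quit; });
                if (quit) return;
                jobReady = false;
                lock.unlock();                        // don't compute under the lock
                convolveLargePartition(input, output);
                lock.lock();
                resultReady = true;
                cv.notify_all();
            }
        });
    }

    ~LargePartitionWorker() {
        { std::lock_guard<std::mutex> lock(m); quit = true; }
        cv.notify_all();
        worker.join();
    }

    // Audio thread: hand over the next partition-3 block and return immediately.
    void startJob(const std::vector<float>& block) {
        std::lock_guard<std::mutex> lock(m);
        input = block;        // safe here: the worker is idle between collect and start
        jobReady = true;
        cv.notify_all();
    }

    // Audio thread, one partition-2 block later: should not actually block.
    const std::vector<float>& collectResult() {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return resultReady; });
        resultReady = false;
        return output;
    }

private:
    static void convolveLargePartition(const std::vector<float>& in,
                                       std::vector<float>& out) {
        out = in;   // stand-in for the real frequency-domain convolution
    }

    std::thread worker;
    std::mutex m;
    std::condition_variable cv;
    bool jobReady = false, resultReady = false, quit = false;
    std::vector<float> input, output;
};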

Post

DaveHoskins wrote:I couldn't find much stuff on the Net describing threaded convolution, well, not anything I could understand anyway. :) So here's mine.
My convolution has 3 or more partitions, I'm still balancing the process.

Code: Select all

Time----->

1 |---|---|
2         |-------|-------|
3                         |---------------|--------------- etc
(Overlap-save is omitted from this explanation for simplicity.)

Partition 3 is a large FFT that can go all the way up to over a dozen seconds, so this is the one everything is waiting for.
If I put all three partitions in different threads I still have to wait for the largest convolution to finish, which means I'm sitting idle in the main process, something I don't want, of course.

So instead I use only one thread, for partition 3, but delay waiting for it to finish by the duration of a partition-2 block; see 'added' in the diagram below.

Code: Select all

1 |---|---|
2         |-------|-------|-added-|
3                          Delayed|---------------|--------------- etc
By the time the samples of the extra partition 2 have completed, thread 3 has finished, so there's no idle waiting.
The delay means I have to store the processed samples in a ring buffer and grab them from the buffer at the current time minus the partition-2 size.
I'm going to try another thread and delay partition 2 as well, but it might not be worth it as those partitions are quite short anyway. And people's warnings about breaking the host's threading system might have dissuaded me anyway.

This halved the metered CPU and stopped the periodic spikes in CPU usage in all the DAWs I've tried so far.

I've only just finished, so much testing is still needed, and this is nowhere near a complete explanation, but it'll hopefully nudge the old thinking buds into gear for anyone struggling like I have with really fast convolution. And I'm sure there are many ways to do it. :wink:

Cheers,
Dave.
Optimizing convolution is a hell of a job. I did it myself and it took me about three months to do properly (I know... it's a lot). The first step should be to optimize the engine so the load doesn't increase much when the latency goes down. Have you already taken care of that?!? Another problem you mentioned is the spikes: what FFT are you using?!?

Post

Ciozzi wrote:Optimizing convolution is a hell of a job, I did it myself and it took me about three months to do it properly (I know...it's a lot) . The first step should be to optimize the engine so load doesn't increase much when latency goes down. Have you already taken care of that ?!? Another problem you mentioned is spikes, what fft are you using ?!?
Hi, can you be more specific about "when latency goes down"? Thanks. Perhaps I have some horror around the corner...

The spikes are thrown into the waiting thread, so the VST/AU process thread is not halted when convolving a massive reverb tail; the audio flows through the process thread for a few thousand samples before waiting for the convolve thread to finish. The waiting time is long enough at most sample rates for it to have finished well before its result is needed, although the sizes may have to be increased for 192 kHz sample rates.
I moved this over to the Mac recently and I'm getting equal speed-ups, which is lucky, as I just expected TinyThreads++ to work on Mac - oh dear - :hihi:
I'm not using an FFT; I've optimised an FHT (Fast Hartley Transform) for my frequency-domain stuff.
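(The "grab them from the buffer at the current time minus the partition-2 size" step could look roughly like this: a plain ring buffer read back with a fixed delay. Sizes and names are illustrative only.)

Code: Select all

#include <cstddef>
#include <vector>

struct DelayRing {
    std::vector<float> buf;
    size_t writePos = 0;
    size_t delaySamples;

    DelayRing(size_t capacity, size_t delay) : buf(capacity, 0.0f), delaySamples(delay) {}

    void write(float x) {
        buf[writePos] = x;
        writePos = (writePos + 1) % buf.size();
    }

    float readDelayed() const {
        // the sample written 'delaySamples' (one partition-2 block) ago
        return buf[(writePos + buf.size() - delaySamples) % buf.size()];
    }
};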

Post

DaveHoskins wrote: Hi, can you be more specific about "when latency goes down"? Thanks. Perhaps I have some horror around the corner...
When you work with small buffer sizes (32, 64, 128, ...) convolution becomes progressively heavier on the CPU. What buffer size(s) are you testing your algorithm with?!?
DaveHoskins wrote:I moved this over to the Mac recently and I'm getting equal speed-ups, which is lucky, as I just expected TinyThreads++ to work on Mac - oh dear - :hihi:
I'm not using an FFT; I've optimised an FHT (Fast Hartley Transform) for my frequency-domain stuff.
Unless you're extremely good at optimizing code in assembly, I highly suggest you use an already-optimized library. Unfortunately I'm not aware of many libraries that provide an FHT; one is Intel IPP, but its FHT is deprecated and will eventually be removed. The most common way to implement convolution is via the FFT (which is, by the way, closely related to the FHT). The FFT in IPP is hands down the best-performing algorithm for one-dimensional transforms on Intel-compatible processors, and along with it you'd also get all the other primitives for the multiply-and-add stuff. Those primitives alone could improve your code's performance 2 or 3 times without even using additional threads. The library isn't free, but you get a 30-day trial period, plenty of time to see if it's really worth it for you.
If you want something free, you could use FFTW.
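(For anyone following along, one partition's worth of FFT convolution with FFTW's single-precision API might look roughly like this. The names irSpectrum and N are my assumptions, overlap-add/-save bookkeeping is omitted, and in real code the plans and buffers would be created once at initialisation rather than per block, since planning is slow and not real-time safe.)

Code: Select all

#include <fftw3.h>
#include <algorithm>

void convolvePartition(const float* block, int blockLen,
                       const fftwf_complex* irSpectrum,  // precomputed FFT of the IR partition
                       float* out, int N)                // padded FFT size, N >= 2 * blockLen
{
    float*         time = (float*) fftwf_malloc(sizeof(float) * N);
    fftwf_complex* freq = (fftwf_complex*) fftwf_malloc(sizeof(fftwf_complex) * (N / 2 + 1));

    fftwf_plan fwd = fftwf_plan_dft_r2c_1d(N, time, freq, FFTW_ESTIMATE);
    fftwf_plan inv = fftwf_plan_dft_c2r_1d(N, freq, time, FFTW_ESTIMATE);

    std::fill(time, time + N, 0.0f);
    std::copy(block, block + blockLen, time);
    fftwf_execute(fwd);

    // Complex multiply with the impulse-response spectrum, i.e. the
    // "multiply and add stuff" that IPP/FFTW-style primitives vectorise.
    for (int k = 0; k <= N / 2; ++k) {
        const float re = freq[k][0] * irSpectrum[k][0] - freq[k][1] * irSpectrum[k][1];
        const float im = freq[k][0] * irSpectrum[k][1] + freq[k][1] * irSpectrum[k][0];
        freq[k][0] = re;
        freq[k][1] = im;
    }

    fftwf_execute(inv);
    for (int n = 0; n < N; ++n)
        out[n] = time[n] / (float) N;   // FFTW transforms are unnormalised

    fftwf_destroy_plan(fwd);
    fftwf_destroy_plan(inv);
    fftwf_free(freq);
    fftwf_free(time);
}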

Post

Thanks.
The smallest frequency-domain partition is 256 samples, and the latency is brought to zero with SSE code doing time-domain convolution at the beginning.
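(The "zero latency" head mentioned above is just direct time-domain convolution of the first partition; a scalar sketch for orientation, where the SSE version would process several taps at once. Names are illustrative.)

Code: Select all

// Plain FIR over the newest 'headLen' samples: no FFT, therefore no block delay.
// 'history' holds the most recent input sample first.
float directConvolveHead(const float* history, const float* ir, int headLen)
{
    float acc = 0.0f;
    for (int i = 0; i < headLen; ++i)
        acc += history[i] * ir[i];      // the SSE version would do four taps per iteration
    return acc;
}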

From very basic tests my FHT came in just slightly slower than FFTW, which I'm perfectly happy with! :)
Yeah, it surprised me as well, but I did work hard to achieve it.

Dave.

*edit*
I don't think FFTW is actually free? (It's GPL, so a closed-source plugin would need the commercial license.)

Post

You should look at FFTReal from Laurent De Soras, which is free to use too ;-)

Post

Wolfen666 wrote:You should look at FFTReal from Laurent De Soras, which is free to use too ;-)
Just looked at that and shivered in C++ obfuscation horror. Even the testing code has 24 files and layers of hierarchy. Making something so complicated seems like a 'having fun in C++' exercise rather than something worthy of DSP.
IMHO of course! :)

Again, I'm happy with my own tiny and fast FHT code, thanks. I might publish it at some point, I suppose.

Post

Has anyone tried applying multithreading to a polyphase filter? I'm thinking it would be quite applicable to that structure, as each subfilter can be processed in parallel. I'm currently using polyphase interpolating filters to implement resampling, and this got me thinking it would be a good way to optimize.

I'm not quite up to speed on multithreading in C++ though. :? I'm familiar with mutexes, but not with actually creating and assigning threads.
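(A minimal sketch of creating and joining threads with std::thread (C++11), one per polyphase branch. processSubfilter is a stand-in for whatever the real per-branch FIR routine would be. For short sub-filters the thread-creation overhead will usually outweigh any gain, so in practice a persistent worker pool woken by condition variables, as discussed earlier in the thread, is the way to go.)

Code: Select all

#include <thread>
#include <vector>

// Stand-in for the real per-branch FIR; it just scales so the example runs.
void processSubfilter(int branch, const float* in, float* out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = in[i] * (1.0f / (branch + 1));
}

// 'outs' must hold numBranches buffers of at least n samples each.
void processAllBranches(int numBranches, const float* in,
                        std::vector<std::vector<float>>& outs, int n)
{
    std::vector<std::thread> workers;
    for (int b = 0; b < numBranches; ++b)
        workers.emplace_back(processSubfilter, b, in, outs[b].data(), n);
    for (auto& t : workers)
        t.join();   // wait until every branch has produced its output
}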

Post

Ciozzi wrote:
DaveHoskins wrote: Hi, can you be more specific about "when latency goes down"? Thanks. Perhaps I have some horror around the corner...
When you work with small buffer sizes (32, 64, 128, ...) convolution becomes progressively heavier on the CPU. What buffer size(s) are you testing your algorithm with?!?
DaveHoskins wrote:I moved this over to the Mac recently and I'm getting equal speed-ups, which is lucky, as I just expected TinyThreads++ to work on Mac - oh dear - :hihi:
I'm not using an FFT; I've optimised an FHT (Fast Hartley Transform) for my frequency-domain stuff.
Unless you're extremely good at optimizing code in assembly, I highly suggest you use an already-optimized library. Unfortunately I'm not aware of many libraries that provide an FHT; one is Intel IPP, but its FHT is deprecated and will eventually be removed. The most common way to implement convolution is via the FFT (which is, by the way, closely related to the FHT). The FFT in IPP is hands down the best-performing algorithm for one-dimensional transforms on Intel-compatible processors, and along with it you'd also get all the other primitives for the multiply-and-add stuff. Those primitives alone could improve your code's performance 2 or 3 times without even using additional threads. The library isn't free, but you get a 30-day trial period, plenty of time to see if it's really worth it for you.
If you want something free, you could use FFTW.
+1

Post

I'm not converting the FHT into an FFT; I'm doing everything in the Hartley domain. It's FAST!... Oh, never mind.
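(For anyone wondering how that works: circular convolution can be done directly in the Hartley domain using the even/odd split of the IR's DHT, with no conversion to an FFT needed. A rough sketch, with buffer names of my own; X and H are the forward DHTs of the input block and the impulse-response partition.)

Code: Select all

#include <vector>

void hartleyMultiply(const std::vector<float>& X, const std::vector<float>& H,
                     std::vector<float>& Y)
{
    const size_t N = X.size();
    for (size_t k = 0; k < N; ++k) {
        const size_t nk = (N - k) % N;          // mirrored bin; index 0 maps to itself
        const float he = 0.5f * (H[k] + H[nk]); // even part of the IR spectrum
        const float ho = 0.5f * (H[k] - H[nk]); // odd part
        Y[k] = X[k] * he + X[nk] * ho;          // Hartley-domain convolution theorem
    }
}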

Post

DaveHoskins wrote:I'm not converting the FHT into an FFT; I'm doing everything in the Hartley domain. It's FAST!... Oh, never mind.
Doesn't an FHT require a few more operations than an FFT?
I've never actually used the Hartley transform, so I might be missing something.

Post

I have no idea how many instructions it has, or its efficiency with the memory cache or SSE instruction weaving in the compiler, sorry. I seem to remember at the time I tried a few different FFTs but settled on this one after my optimisations.
It may also help that the exact same code is used for the forward and inverse transforms, both are real and in-place, and the entire transform fits in the original input array, which is nice.

The FFT conversion is a loop with a small amount of maths; I don't need it for most of my work, but it's not really the slow-down. The setup loops and if-statements at the beginning, which run through the same thing every time the function is called, are more wasteful, so I've moved them out of the main function and replaced them with a simple look-up table.
One downside of the FHT is that the final output is in a strange order, but that can be quickly rearranged in-place, ready for SSE convolution.
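(As an illustration of the "precompute the setup into a table" idea, here is a generic sketch that builds a reordering table once, using plain bit reversal purely as an example, and then only walks the table per call. The actual FHT output ordering may well differ.)

Code: Select all

#include <cstddef>
#include <utility>
#include <vector>

std::vector<size_t> makeBitReversalTable(size_t n)   // n must be a power of two
{
    std::vector<size_t> table(n);
    size_t bits = 0;
    while ((size_t(1) << bits) < n) ++bits;
    for (size_t i = 0; i < n; ++i) {
        size_t r = 0;
        for (size_t b = 0; b < bits; ++b)
            if (i & (size_t(1) << b)) r |= size_t(1) << (bits - 1 - b);
        table[i] = r;
    }
    return table;
}

// Per call: no setup loops or if-statements, just walk the precomputed table.
void reorderInPlace(float* data, const std::vector<size_t>& table)
{
    for (size_t i = 0; i < table.size(); ++i)
        if (table[i] > i) std::swap(data[i], data[table[i]]);
}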

Dave.

*edit* This has gone a bit OT; as far as the multi-threaded DSP is concerned, it appears to be rock solid. It's brain-meltingly complex when dealing with shared resources and debugging, but the results are great and I'm sure I'll get used to it quite quickly.
