
Multiple threads running in DSP code

DSP, Plug-in and Host development discussion.

Moderator: Moderators (Main)

timblechmann
KVRer
 
6 posts since 7 Jan, 2013

Postby timblechmann; Mon Jan 21, 2013 3:18 am

in my experience, parallelizing audio applications is not necessarily the problem, but real-time safe synchronization and wakeup latencies ... especially if low latencies/small block sizes are required.
current hardware has worst-case scheduling latencies of several hundred microseconds, which can be a limiting factor when using block sizes of 64 samples at 44.1/48khz. and of course, one needs to pay attention not to block the real-time threads.

some self-promo: i wrote my master's thesis on parallelizing the synthesis engine of supercollider, which includes a discussion of the technical challenges with a focus on low-latency real-time safety:
http://tim.klingt.org/publications/tim_blechmann_supernova.pdf
Ciozzi
KVRer
 
12 posts since 14 Jan, 2013

Postby Ciozzi; Wed Jan 23, 2013 3:08 pm

DaveHoskins wrote:It does a 'WaitFor...' in TinyThreads' code. It's in the 'arg.cond.wait(arg.m);' call.
And this is what they do in their, albeit limited, examples.


You're doing it by the book; the only thing is that arg.wait and arg.finished are duplicates, so you don't need to keep them both.
DaveHoskins
KVRian
 
569 posts since 7 Jan, 2009, from Gloucestershire

Postby DaveHoskins; Thu Jan 24, 2013 4:41 am

Ciozzi wrote:
DaveHoskins wrote:It does a 'WaitFor...' in TinyThreads' code. It's in the 'arg.cond.wait(arg.m);' call.
And this is what they do in their, albeit limited, examples.


You're doing it by the book; the only thing is that arg.wait and arg.finished are duplicates, so you don't need to keep them both.

Yeah, 'finished' was put in to "fix" an unrelated sync bug. I now wait for 'wait' to be true to know it's finished the current cycle.
I may have found a way to do long convolution using threading properly, but I've yet to finish the code. I'll post all the details later, if it works.
DaveHoskins
KVRian
 
569 posts since 7 Jan, 2009, from Gloucestershire

Postby DaveHoskins; Thu Jan 24, 2013 7:30 am

I couldn't find much on the Net describing threaded convolution - well, nothing I could understand anyway. :) So here's mine.
My convolution has 3 or more partitions; I'm still balancing the process.
Code: Select all
Time----->

1 |---|---|
2         |-------|-------|
3                         |---------------|--------------- etc

(Overlap-save not used in explanation for simplicity reasons.)

Partition 3 is a large FFT that can go all the way up to over a dozen seconds, so this is the one everything is waiting for.
If I put all three partitions in different threads, I still have to wait for the largest convolution to finish, which means I'm sitting idle in the main process - something I don't want, of course.

So instead I only use one thread, for the third partition, but delay waiting for it to finish for the duration of a partition-2 size - see 'added' in the diagram.
Code: Select all
1 |---|---|
2         |-------|-------|-added-|
3                          Delayed|---------------|--------------- etc


By the time the samples of the extra partition 2 have completed, thread 3 has finished, so there'll be no waiting idle.
The delay means I have to store the processed samples in a ring buffer, and grab them from the buffer at the current time minus the 'partition 2' size.
I'm going to try another thread and delay partition 2 as well, but it might not be worth it as they are quite short anyway. And people's warnings about breaking the host's thread system might have dissuaded me anyway.

This halved the metered CPU and stopped the periodic spikes in CPU usage in all the DAWs tried so far.

I've only just finished, so much testing is needed, and this is nowhere near a complete explanation, but it'll hopefully nudge the old thinking buds into gear for someone struggling like I have with really fast convolution. And I'm sure there are many ways to do it. :wink:

Cheers,
Dave.
Ciozzi
KVRer
 
12 posts since 14 Jan, 2013

Postby Ciozzi; Sat Jan 26, 2013 6:12 am

DaveHoskins wrote:I couldn't find much on the Net describing threaded convolution - well, nothing I could understand anyway. :) So here's mine.
My convolution has 3 or more partitions; I'm still balancing the process.
Code: Select all
Time----->

1 |---|---|
2         |-------|-------|
3                         |---------------|--------------- etc

(Overlap-save not used in explanation for simplicity reasons.)

Partition 3 is a large FFT that can go all the way up to over a dozen seconds, so this is the one everything is waiting for.
If I put all three partitions in different threads, I still have to wait for the largest convolution to finish, which means I'm sitting idle in the main process - something I don't want, of course.

So instead I only use one thread, for the third partition, but delay waiting for it to finish for the duration of a partition-2 size - see 'added' in the diagram.
Code: Select all
1 |---|---|
2         |-------|-------|-added-|
3                          Delayed|---------------|--------------- etc


By the time the samples of the extra partition 2 have completed, thread 3 has finished, so there'll be no waiting idle.
The delay means I have to store the processed samples in a ring buffer, and grab them from the buffer at the current time minus the 'partition 2' size.
I'm going to try another thread and delay partition 2 as well, but it might not be worth it as they are quite short anyway. And people's warnings about breaking the host's thread system might have dissuaded me anyway.

This halved the metered CPU and stopped the periodic spikes in CPU usage in all the DAWs tried so far.

I've only just finished, so much testing is needed, and this is nowhere near a complete explanation, but it'll hopefully nudge the old thinking buds into gear for someone struggling like I have with really fast convolution. And I'm sure there are many ways to do it. :wink:

Cheers,
Dave.


Optimizing convolution is a hell of a job; I did it myself and it took me about three months to do it properly (I know... it's a lot). The first step should be to optimize the engine so the load doesn't increase much when latency goes down. Have you already taken care of that? Another problem you mentioned is spikes: what FFT are you using?
DaveHoskins
KVRian
 
569 posts since 7 Jan, 2009, from Gloucestershire

Postby DaveHoskins; Sat Jan 26, 2013 7:09 am

Ciozzi wrote:Optimizing convolution is a hell of a job; I did it myself and it took me about three months to do it properly (I know... it's a lot). The first step should be to optimize the engine so the load doesn't increase much when latency goes down. Have you already taken care of that? Another problem you mentioned is spikes: what FFT are you using?


Hi, can you be more specific about "when latency goes down"? Thanks. Perhaps I have some horror around the corner...

The spikes are thrown into the waiting thread so the VST's/AU's process thread is not halted when convolving a massive reverb tail; the audio flows through the process thread for a few thousand samples before waiting for the convolve thread to finish. The time to wait is long enough at most sample rates for it to have finished well before it's needed, although sizes may have to be increased for 192 kHz sample rates.
I moved this over to the Mac recently and I'm getting equal speed-ups, which is lucky, as I just expected TinyThreads++ to work on Mac - Oh dear - :hihi:
I'm not using an FFT, I've optimised an FHT (Fast Hartley) for my frequency domain stuff.
Ciozzi
KVRer
 
12 posts since 14 Jan, 2013

Postby Ciozzi; Sat Jan 26, 2013 7:36 am

DaveHoskins wrote:Hi, can you be more specific by "when latency goes down"? Thanks. Perhaps I have some horror around the corner...

When you work with small buffer sizes (32, 64, 128, ...) convolution becomes progressively heavier for the CPU. What buffer size(s) are you testing your algorithm with?

I moved this over to the Mac recently and I'm getting equal speed-ups, which is lucky, as I just expected TinyThreads++ to work on Mac - Oh dear - :hihi:
I'm not using an FFT, I've optimised an FHT (Fast Hartley) for my frequency domain stuff.


Unless you're extremely good at optimizing code in assembly, I highly suggest you use an already-optimized library. Unfortunately I'm not aware of many libraries that provide an FHT - one being Intel IPP, where it's deprecated and will be removed eventually. The most common way to implement convolution is via the FFT (which is, by the way, closely related to the FHT). The FFT in IPP is hands down the best-performing algorithm for one-dimensional transforms running on Intel-compatible processors, and along with it you'd also get all the other primitives for the multiply-and-add stuff. Those primitives alone could improve your code's performance two or three times without even using additional threads. The library isn't free, but you get a 30-day trial period - plenty of time to see if it's really worth it for you.
If you want something free, you could get FFTW.
DaveHoskins
KVRian
 
569 posts since 7 Jan, 2009, from Gloucestershire

Postby DaveHoskins; Sat Jan 26, 2013 7:56 am

Thanks.
The smallest frequency-domain partition is 256 samples, and the latency is brought to zero with SSE code doing time-domain convolution at the beginning.

From very basic tests, my FHT came in just slightly slower than FFTW, which I'm perfectly happy with! :)
Yeah, it surprised me as well, but I did work hard to achieve it.

Dave.

*edit*
I don't think FFTW is actually free?
Wolfen666
KVRian
 
595 posts since 11 Aug, 2004, from Marcoussis, France

Postby Wolfen666; Sun Jan 27, 2013 2:12 pm

You should look at FFTReal from Laurent De Soras, which is free to use too ;-)
DaveHoskins
KVRian
 
569 posts since 7 Jan, 2009, from Gloucestershire

Postby DaveHoskins; Mon Jan 28, 2013 5:16 am

Wolfen666 wrote:You should look at FFTReal from Laurent De Soras, which is free to use too ;-)

Just looked at that, and shivered in C++ obfuscation horror. Even the testing code has 24 files and layers of hierarchy. Making something so complicated seems like a 'having fun in C++' exercise rather than something worthy of DSP.
IMHO, of course! :)

Again, I'm happy with my own tiny and fast FHT code, thanks. I might publish it at some point I suppose.
LemonLime
KVRist
 
226 posts since 15 Apr, 2012, from Toronto, ON

Postby LemonLime; Mon Jan 28, 2013 8:05 am

Has anyone tried applying multithreading to a polyphase filter? I'm thinking it would be quite applicable to that structure, as each subfilter can be processed in parallel. I'm currently using polyphase interpolating filters to implement resampling, and this got me thinking it would be a good way to optimize.

I'm not quite up to speed on multithreading in C++ though. :? I'm familiar with mutexes, but not with actually creating and assigning threads.
Zaphod (giancarlo)
KVRAF
 
1956 posts since 23 Jun, 2006

Postby Zaphod (giancarlo); Mon Jan 28, 2013 1:07 pm

Ciozzi wrote:
DaveHoskins wrote:Hi, can you be more specific by "when latency goes down"? Thanks. Perhaps I have some horror around the corner...

When you work with small buffer sizes (32, 64, 128, ...) convolution becomes progressively heavier for the CPU. What buffer size(s) are you testing your algorithm with?

I moved this over to the Mac recently and I'm getting equal speed-ups, which is lucky, as I just expected TinyThreads++ to work on Mac - Oh dear - :hihi:
I'm not using an FFT, I've optimised an FHT (Fast Hartley) for my frequency domain stuff.


Unless you're extremely good at optimizing code in assembly, I highly suggest you use an already-optimized library. Unfortunately I'm not aware of many libraries that provide an FHT - one being Intel IPP, where it's deprecated and will be removed eventually. The most common way to implement convolution is via the FFT (which is, by the way, closely related to the FHT). The FFT in IPP is hands down the best-performing algorithm for one-dimensional transforms running on Intel-compatible processors, and along with it you'd also get all the other primitives for the multiply-and-add stuff. Those primitives alone could improve your code's performance two or three times without even using additional threads. The library isn't free, but you get a 30-day trial period - plenty of time to see if it's really worth it for you.
If you want something free, you could get FFTW.


+1
DaveHoskins
KVRian
 
569 posts since 7 Jan, 2009, from Gloucestershire

Postby DaveHoskins; Mon Jan 28, 2013 1:26 pm

I'm not converting the FHT into an FFT; I'm doing everything in the Hartley domain. It's FAST! ...Oh, never mind.
AXP
KVRist
 
84 posts since 24 Mar, 2012

Postby AXP; Wed Jan 30, 2013 1:51 am

DaveHoskins wrote:I'm not converting the FHT into an FFT; I'm doing everything in the Hartley domain. It's FAST! ...Oh, never mind.


Doesn't an FHT require a few more operations than an FFT?
I've never actually used the Hartley transform, so I might be missing something.
DaveHoskins
KVRian
 
569 posts since 7 Jan, 2009, from Gloucestershire

Postby DaveHoskins; Wed Jan 30, 2013 4:53 am

I have no idea how many instructions it has, or its efficiency with the memory cache or the compiler's SSE instruction weaving, sorry. I seem to remember at the time I tried a few different FFTs, but settled on this one after my optimisations.
It may help that the exact same code is used for forward and inverse transforms, it's real and in-place, and the entire transform fits in the original input array, which is nice.

The FFT conversion is a loop with a small amount of maths; I don't need it for most of my work, and it's not really the slowdown. It's the setup loops and if statements at the beginning, which always run through the same thing each time the function is called, that are more wasteful, so I've moved them out of the main function, replacing it all with a simple look-up table.
One downside of the FHT is that the final output is in a strange order, but that can be quickly moved around in-place, ready for SSE convolution.

Dave.

*edit* This has gone a bit OT; as far as the multi-threaded DSP is concerned, it appears to be rock solid. It's brain-meltingly complex when dealing with shared resources and debugging, but the results are great and I'm sure I'll get used to it quite quickly.