Multiple threads running in DSP code
-
- KVRer
- 6 posts since 7 Jan, 2013
In my experience, parallelizing audio applications is not necessarily the problem; real-time-safe synchronization and wakeup latencies are, especially if low latencies/small block sizes are required.
Current hardware has worst-case scheduling latencies of several hundred microseconds, which can be a limiting factor when using block sizes of 64 samples at 44.1/48 kHz. And of course, one needs to take care not to block the real-time threads.
Some self-promo: I wrote my master's thesis about parallelizing the synthesis engine of SuperCollider, which includes a discussion of the technical challenges with a focus on low-latency real-time safety:
http://tim.klingt.org/publications/tim_ ... ernova.pdf
-
- KVRer
- 12 posts since 14 Jan, 2013
DaveHoskins wrote:
It does a 'WaitFor...' in TinyThreads' code. It's in the 'arg.cond.wait(arg.m);' call. And this is what they do in their, albeit limited, examples.

You're doing it by the book; the only thing is that arg.wait and arg.finished are duplicates, you don't need to keep them both.
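The pattern under discussion (a worker signalled per audio block, with the caller waiting on a condition variable) needs only one flag guarded by the mutex, used as the wait predicate. A minimal sketch with the standard library rather than TinyThreads++; names like `done` and `BlockJob` are illustrative, not from anyone's actual code:

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>
#include <vector>

// One flag, guarded by the mutex, serves as both the wait predicate
// and the "finished" marker: no duplicate state needed.
struct BlockJob {
    std::mutex m;
    std::condition_variable cond;
    bool done = false;
    std::vector<float> buffer;
};

// Worker side: process the block, then set the flag and notify.
void process_block(BlockJob& job) {
    for (float& s : job.buffer) s *= 0.5f;  // stand-in for real DSP work
    {
        std::lock_guard<std::mutex> lock(job.m);
        job.done = true;
    }
    job.cond.notify_one();
}

// Caller side: wait until the worker has finished this cycle.
// The predicate overload re-checks the flag under the lock, so both
// spurious wakeups and notify-before-wait races are handled.
void wait_for_block(BlockJob& job) {
    std::unique_lock<std::mutex> lock(job.m);
    job.cond.wait(lock, [&] { return job.done; });
    job.done = false;  // re-arm for the next cycle
}
```

The same single-flag shape works with any condition-variable API; the important part is that the flag is only ever read or written with the mutex held.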
-
- KVRian
- Topic Starter
- 614 posts since 7 Jan, 2009 from Gloucestershire
Ciozzi wrote:
You're doing it by the book; the only thing is that arg.wait and arg.finished are duplicates, you don't need to keep them both.

Yeah, 'finish' was put in to "fix" an unrelated sync bug. I now wait for 'wait' to be true to know it's finished the current cycle.
I may have found a way to do long convolution using threading properly, but I've yet to finish the code. I'll post all the details later, if it works.
-
- KVRian
- Topic Starter
- 614 posts since 7 Jan, 2009 from Gloucestershire
I couldn't find much stuff on the Net describing threaded convolution, well, not anything I could understand anyway. So here's mine.
My convolution has 3 or more partitions; I'm still balancing the process.
(Overlap-save not used in the explanation, for simplicity.)
Code:
Time----->
1 |---|---|
2 |-------|-------|
3 |---------------|--------------- etc
Partition 3 is a large FFT that can go all the way up to over a dozen seconds, so this is the one everything is waiting for.
If I put all three partitions in different threads I still have to wait for the largest convolution to finish, which means I'm sitting idle in the main process, something I don't want, of course.
So instead I only use one thread, for partition 3, but delay waiting for it to finish by the duration of a partition-2 block; see 'added' in the diagram.
Code:
1 |---|---|
2 |-------|-------|-added-|
3 Delayed|---------------|--------------- etc
By the time the samples of the extra partition 2 have completed, thread 3 has finished, so there'll be no waiting idle.
The delay means I have to store the processed samples in a ring buffer, and grab them from the buffer at the current time minus a partition-2 size.
I'm going to try another thread and delay partition 2 as well, but it might not be worth it, as those blocks are quite short anyway. And people's warnings about breaking the host's thread system might have dissuaded me anyway.
This all saved me half the metered CPU and stopped the periodic spikes in CPU usage in all the DAWs tried so far.
I've only just finished, so much testing is needed, and this is nowhere near a complete explanation, but it'll hopefully nudge the old thinking buds into gear for anyone struggling like I have with really fast convolution. And I'm sure there are many ways to do it.
Cheers,
Dave.
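The delayed-read ring buffer described above can be sketched as follows: the background thread's output is written at the current position and read back a fixed number of frames later, hiding the large partition's compute time. Names and sizes are illustrative, not Dave's actual code:

```cpp
#include <cstddef>
#include <vector>

// Ring buffer that hands back the sample written `delay` pushes ago,
// i.e. "the time minus a partition-2 size" from the post above.
class DelayRing {
public:
    DelayRing(std::size_t delay, std::size_t capacity)
        : buf_(capacity, 0.0f), delay_(delay), writePos_(0) {}

    // Store one processed sample from the convolution thread.
    void push(float s) {
        buf_[writePos_] = s;
        writePos_ = (writePos_ + 1) % buf_.size();
    }

    // Fetch the sample from `delay_` frames in the past.
    float delayed() const {
        std::size_t readPos =
            (writePos_ + buf_.size() - delay_) % buf_.size();
        return buf_[readPos];
    }

private:
    std::vector<float> buf_;
    std::size_t delay_;
    std::size_t writePos_;
};
```

In a real plugin the capacity would be a power of two so the modulo becomes a mask, and writes/reads would be done a block at a time rather than per sample.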
-
- KVRer
- 12 posts since 14 Jan, 2013
DaveHoskins wrote:
I couldn't find much stuff on the Net describing threaded convolution, well, not anything I could understand anyway. So here's mine.

Optimizing convolution is a hell of a job. I did it myself and it took me about three months to do it properly (I know... it's a lot). The first step should be to optimize the engine so the load doesn't increase much when the latency goes down. Have you already taken care of that? Another problem you mentioned is spikes: what FFT are you using?
-
- KVRian
- Topic Starter
- 614 posts since 7 Jan, 2009 from Gloucestershire
Ciozzi wrote:
The first step should be to optimize the engine so the load doesn't increase much when the latency goes down. Have you already taken care of that? Another problem you mentioned is spikes: what FFT are you using?

Hi, can you be more specific about "when latency goes down"? Thanks. Perhaps I have some horror around the corner...
The spikes are thrown into the waiting thread so the VST's/AU's process thread is not halted when convolving a massive reverb tail; the audio flows through the process thread for a few thousand samples before waiting for the convolve thread to finish. The time to wait is long enough at most sample rates for it to have finished well before it gets to the time to use it, although sizes may have to be increased for 192 kHz sample rates.
I moved this over to the Mac recently and I'm getting equal speed-ups, which is lucky, as I just expected TinyThreads++ to work on Mac - oh dear.
I'm not using an FFT; I've optimised an FHT (Fast Hartley) for my frequency-domain stuff.
-
- KVRer
- 12 posts since 14 Jan, 2013
DaveHoskins wrote:
Hi, can you be more specific about "when latency goes down"? Thanks. Perhaps I have some horror around the corner...

When you work with small buffer sizes (32, 64, 128, ...) convolution becomes progressively heavier for the CPU. What buffer size(s) are you testing your algorithm with?

DaveHoskins wrote:
I'm not using an FFT; I've optimised an FHT (Fast Hartley) for my frequency-domain stuff.

Unless you're extremely good at optimizing code in assembly, I highly suggest you use an already optimized library. Unfortunately I'm not aware of many libraries that provide an FHT, one being Intel IPP's, but it's deprecated and will be removed eventually. The most common way to implement convolution is via the FFT (which is, by the way, closely related to the FHT). The FFT in IPP is hands down the best-performing algorithm for one-dimensional transforms that runs on Intel-compatible processors, and along with it you'd also get all the other primitives for the multiply-and-add stuff. Those primitives alone could improve your code's performance 2 or 3 times without even using additional threads. The library isn't free, but you'll get a 30-day trial period, plenty of time to see if it's really worth it for you.
If you want something free, you could get FFTW.
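The "multiply and add stuff" of FFT-based partitioned convolution boils down to a complex multiply-accumulate across stored input spectra and filter partitions. A plain scalar sketch of one such accumulation (libraries like IPP ship vectorised equivalents; the names here are illustrative):

```cpp
#include <complex>
#include <cstddef>
#include <vector>

using Spectrum = std::vector<std::complex<float>>;

// Accumulate one partition's contribution: out[k] += in[k] * filt[k].
// Partitioned convolution runs this over every (input-block spectrum,
// filter partition) pair whose delays line up, then inverse-transforms
// the accumulator back to the time domain.
void multiply_accumulate(const Spectrum& in, const Spectrum& filt,
                         Spectrum& out) {
    for (std::size_t k = 0; k < out.size(); ++k)
        out[k] += in[k] * filt[k];
}
```

This inner loop dominates the cost at long impulse-response lengths, which is why hand-vectorised primitives can give the 2-3x speed-up mentioned above.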
-
- KVRian
- Topic Starter
- 614 posts since 7 Jan, 2009 from Gloucestershire
Thanks.
The smallest frequency-domain partition is 256 samples, and the latency is brought to zero with SSE code doing time-domain convolution at the beginning.
From very basic tests my FHT came in just slightly slower than FFTW, which I'm perfectly happy with!
Yeah, it surprised me as well, but I did work hard to achieve it.
Dave.
*edit*
I don't think FFTW is actually free?
-
- KVRian
- 1153 posts since 11 Aug, 2004 from Breuillet, France
You should look at FFTReal from Laurent de Soras, which is free to use too.
-
- KVRian
- Topic Starter
- 614 posts since 7 Jan, 2009 from Gloucestershire
Wolfen666 wrote:
You should look at FFTReal from Laurent de Soras, which is free to use too.

Just looked at that, and shivered in C++ obfuscation horror. Even the testing code has 24 files and layers of hierarchy. Making something so complicated seems like a 'having fun in C++' exercise rather than something worthy of DSP. IMHO, of course!
Again, I'm happy with my own tiny and fast FHT code, thanks. I might publish it at some point, I suppose.
-
- KVRist
- 231 posts since 15 Apr, 2012 from Toronto, ON
Has anyone tried applying multithreading to a polyphase filter? I'm thinking it would be quite applicable for that structure as each subfilter can be processed in parallel. I'm currently using polyphase interpolating filters to implement resampling, and this got me thinking it would be a good way to optimize.
I'm not quite up to speed on multithreading in C++ though. I'm familiar with mutexes, but not with actually creating and assigning threads.
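Since each polyphase branch is an independent FIR over the same input block, the data split really is embarrassingly parallel. A deliberately naive sketch with std::thread (spawning threads per block is far too slow for a real audio callback, where a persistent thread pool would be used; this only shows how the work divides):

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Run each polyphase subfilter (a plain direct-form FIR here) over the
// same input block in its own thread. The branches share read-only
// input and write disjoint outputs, so no locking is needed.
std::vector<std::vector<float>> run_subfilters(
    const std::vector<std::vector<float>>& subfilters,  // one FIR per branch
    const std::vector<float>& input) {
    std::vector<std::vector<float>> outputs(
        subfilters.size(), std::vector<float>(input.size(), 0.0f));
    std::vector<std::thread> threads;
    for (std::size_t b = 0; b < subfilters.size(); ++b) {
        threads.emplace_back([&, b] {
            const std::vector<float>& h = subfilters[b];
            for (std::size_t n = 0; n < input.size(); ++n)
                for (std::size_t t = 0; t <= n && t < h.size(); ++t)
                    outputs[b][n] += h[t] * input[n - t];
        });
    }
    for (std::thread& t : threads) t.join();
    return outputs;
}
```

Whether this wins over a single-threaded version depends on branch length: for the short subfilters typical of resampling, the wakeup and join overhead discussed earlier in the thread can easily exceed the filtering work itself.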
-
- Zaphod (giancarlo)
- KVRAF
- 2596 posts since 23 Jun, 2006
Ciozzi wrote:
Unless you're extremely good at optimizing code in assembly, I highly suggest you use an already optimized library.

+1
-
- KVRian
- Topic Starter
- 614 posts since 7 Jan, 2009 from Gloucestershire
I'm not converting the FHT into an FFT, I'm doing everything in the Hartley domain. It's FAST!...Oh never mind.
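For anyone puzzled by "everything in the Hartley domain": circular convolution has a direct DHT-domain form (Bracewell's convolution theorem), which combines each bin k with its mirror bin N-k instead of a simple pointwise product. A naive O(N^2) reference sketch to show the maths, nothing like the poster's optimised FHT:

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Naive discrete Hartley transform, kernel cas(t) = cos(t) + sin(t).
// The DHT is its own inverse up to a factor of 1/N.
std::vector<double> dht(const std::vector<double>& x) {
    const std::size_t N = x.size();
    const double pi = std::acos(-1.0);
    std::vector<double> X(N, 0.0);
    for (std::size_t k = 0; k < N; ++k)
        for (std::size_t n = 0; n < N; ++n) {
            double t = 2.0 * pi * double(n) * double(k) / double(N);
            X[k] += x[n] * (std::cos(t) + std::sin(t));
        }
    return X;
}

// Hartley-domain "multiply" (DHT convolution theorem):
//   Y[k] = 0.5 * ( X[k]*(H[k]+H[N-k]) + X[N-k]*(H[k]-H[N-k]) )
std::vector<double> hartley_multiply(const std::vector<double>& X,
                                     const std::vector<double>& H) {
    const std::size_t N = X.size();
    std::vector<double> Y(N);
    for (std::size_t k = 0; k < N; ++k) {
        std::size_t m = (N - k) % N;  // mirror index
        Y[k] = 0.5 * (X[k] * (H[k] + H[m]) + X[m] * (H[k] - H[m]));
    }
    return Y;
}

// Cross-check against direct circular convolution; returns max abs error.
double verify_hartley_convolution(const std::vector<double>& x,
                                  const std::vector<double>& h) {
    const std::size_t N = x.size();
    std::vector<double> y(N, 0.0);  // direct circular convolution
    for (std::size_t n = 0; n < N; ++n)
        for (std::size_t t = 0; t < N; ++t)
            y[n] += x[t] * h[(n + N - t) % N];
    std::vector<double> Y = dht(hartley_multiply(dht(x), dht(h)));
    double err = 0.0;
    for (std::size_t n = 0; n < N; ++n)
        err = std::max(err, std::fabs(Y[n] / double(N) - y[n]));
    return err;
}
```

The extra mirror-bin arithmetic is the "few more operations than an FFT" mentioned below, but the transform itself is purely real, which is where an optimised FHT claws the cost back.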
-
- KVRist
- 96 posts since 24 Mar, 2012
DaveHoskins wrote:
I'm not converting the FHT into an FFT, I'm doing everything in the Hartley domain. It's FAST!

Doesn't the FHT require a few more operations than an FFT? I've never actually used the Hartley transform, so I might be missing something.
-
- KVRian
- Topic Starter
- 614 posts since 7 Jan, 2009 from Gloucestershire
I have no idea how many instructions it has, or its efficiency with the memory cache or SSE instruction weaving in the compiler, sorry. I seem to remember at the time I tried a few different FFTs but settled on this one after my optimisations.
It may be helped by the fact that the exact same code is used for forward and inverse transforms, it's real and in-place, and the entire transform fits in the original input array, which is nice.
The FFT conversion is a loop with a small amount of maths; I don't need it for most of my work, but it's not really the slowdown. It's the setup loops and if-statements at the beginning, which always run through the same thing each time the function is called, that are more wasteful, so I've moved them out of the main function, replacing it all with a simple lookup table.
One downside of the FHT is that the final output is in a strange order, but that can quickly be moved around in-place, ready for SSE convolution.
Dave.
*edit* This has gone OT a bit; as far as the multi-threaded DSP is concerned, it appears to be rock solid. It's brain-meltingly complex when dealing with shared resources and debugging, but the results are great and I'm sure I'll get used to it quite quickly.
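The "setup loops replaced with a lookup table" idea is just hoisting per-call invariants (trig factors, reorder indices) into tables built once at initialisation. A generic sketch of the shape, not the poster's FHT; the reversal permutation stands in for whatever reorder the real transform needs:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Tables an N-point transform would otherwise rebuild on every call:
// cas() trig factors and an output-reorder permutation. Built once at
// init, then only indexed inside the hot per-block function.
class TransformTables {
public:
    explicit TransformTables(std::size_t n) : cas_(n), order_(n) {
        const double pi = std::acos(-1.0);
        for (std::size_t k = 0; k < n; ++k) {
            double t = 2.0 * pi * double(k) / double(n);
            cas_[k] = std::cos(t) + std::sin(t);  // cas(2*pi*k/N)
            order_[k] = n - 1 - k;                // illustrative permutation
        }
    }
    double cas(std::size_t k) const { return cas_[k]; }
    std::size_t order(std::size_t k) const { return order_[k]; }

private:
    std::vector<double> cas_;
    std::vector<std::size_t> order_;
};
```

The pay-off is exactly what the post describes: the per-block function loses its setup branches and becomes straight-line table reads, which also compiles to friendlier SIMD code.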