Using Multiple Threads In Audio Thread

DSP, Plugin and Host development discussion.

Post

Urs wrote: Sun Oct 17, 2021 3:19 pm Yeah, a proper documentation of our goals is on the todo list. Not sure if the host controlled threading will be part of our initial release (u-he stuff), but the first major host to support CLAP will have it built in. I'll start a thread about it once we can supply proof of concept (DAW + our full product line + information).
wow..

Code:

nm -D ./Hive.so
tells me there's a "clap_plugin_entry" symbol in there!
So, this CLAP plugin format is really happening?
I have an idea what the "first major host" will be, but eagerly awaiting more info..:-)

(i just started experimenting with the format in my own framework, btw)

Post

Yeah, we're currently finishing up our own support and then there's some more work on the example code (it's easy on Linux, but we have yet to make it compile and run out of the box on PC and Mac). I guess we'll roll this out here in a few weeks.

We're actually looking for freelance devs with experience in VSTGUI and/or VST2/3... -> jobs at u-he dot com

Post

SNFK wrote: Fri Oct 15, 2021 12:23 pm Forgive me if this is a dumb idea. I’ve been working on an oscillator that can already generate a lot of unison voices. I was researching how different synths did their optimization and realized that I could (maybe) use multiple threads on the main audio thread.

This might be a huge red flag, I’m not sure, so please tell me if this is a bad idea. What if each voice had its own thread? Yeah, there would be a lot of architecture and memory stuff to work on, but would it be worth it? If not, is there another place in the synth architecture that could benefit from multi-threading?

I had come across DUNE 3 and saw it used multi-threading, but I may have interpreted it incorrectly…
Yes, DUNE 3 does have the option to multi-thread voices across several cores. Generally it works very well; there are just a few caveats:
  • Cubase does not seem to like plugin MT in some configurations, and in some more exotic hosts it may not work at all
  • If the buffer size is too low, MT should be turned off automatically since the sync overhead becomes too high
If we see CPUs with 16 cores or more become standard, plugin MT could become yet more interesting than currently. Even more so if the cores themselves are kinda slow.

Richard
Synapse Audio Software - www.synapse-audio.com

Post

Richard_Synapse wrote: Sun Nov 07, 2021 7:21 pm If the buffer size is too low, MT should be turned off automatically since the sync overhead becomes too high
I mostly use FL Studio myself, which is well known for sometimes giving you very short buffers. What works pretty well there, though, is using a task model (each voice is a task) and checking on a per-buffer basis whether the buffer length (= nsamples) is very short: if that's the case, compute all the tasks directly, otherwise dispatch to the thread pool.
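A minimal C++ sketch of that per-buffer decision. The names `VoiceTask`, `RunParallel`, `render_block` and the `min_parallel_samples` tuning value are all hypothetical, and the extra "fewer than two voices" shortcut is an assumption added here, not something stated above. `std::function` is used only to keep the sketch short; real audio code would avoid anything that can allocate.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical task type: renders one voice for `nsamples` samples.
using VoiceTask = std::function<void(std::size_t nsamples)>;

// Hypothetical pool entry point: runs every task and returns when all are done.
using RunParallel =
    std::function<void(const std::vector<VoiceTask>&, std::size_t nsamples)>;

// Returns true if the block went to the pool, false if it was rendered inline.
bool render_block(const std::vector<VoiceTask>& voices,
                  std::size_t nsamples,
                  std::size_t min_parallel_samples,  // assumed tuning value
                  const RunParallel& run_parallel)
{
    assert(min_parallel_samples > 0);

    // Very short buffer (or nothing to parallelize): skip the pool entirely.
    if (nsamples < min_parallel_samples || voices.size() < 2) {
        for (const auto& voice : voices)
            voice(nsamples);
        return false;
    }

    // Otherwise hand the whole set of voice tasks to the pool.
    run_parallel(voices, nsamples);
    return true;
}
```

The check runs once per process call, so a host that alternates between tiny and normal buffers just alternates between the two paths.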

Another thing I'd suggest (and apologies if this is very obvious) is using an atomic counter for task completion: set the counter to the number of tasks before dispatch, have every task decrement (atomically) the counter when finished and then have the last task post() on a semaphore when the counter hits zero after decrement. The main thread can then wait() on that semaphore, just once, and gets woken up only after all the tasks are finished (which is infinitely better than having it sync separately with every task).
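Roughly, the counter-plus-semaphore scheme could look like the sketch below. Everything here is illustrative: `Semaphore` is a small stand-in built from a mutex and a condition variable (a real plug-in would more likely use an OS semaphore), `Batch`, `finish_task` and `run_and_wait` are made-up names, and the demo spawns one short-lived thread per task where real code would use a persistent pool.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Stand-in semaphore for the sketch (not realtime-grade).
class Semaphore {
public:
    void post() {
        { std::lock_guard<std::mutex> lock(m_); ++count_; }
        cv_.notify_one();
    }
    void wait() {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return count_ > 0; });
        --count_;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    int count_ = 0;
};

struct Batch {
    std::atomic<int> remaining{0};  // tasks not yet finished
    Semaphore done;                 // posted exactly once per batch
};

// Called by a worker after it finishes one task.
void finish_task(Batch& batch) {
    // fetch_sub returns the value before the decrement, so 1 means "I was last".
    int before = batch.remaining.fetch_sub(1, std::memory_order_acq_rel);
    assert(before >= 1);
    if (before == 1)
        batch.done.post();
}

// Demo: one task per slot in `results`; the caller blocks exactly once.
void run_and_wait(std::vector<int>& results) {
    if (results.empty()) return;  // nobody would post, so do not wait
    Batch batch;
    batch.remaining.store(static_cast<int>(results.size()),
                          std::memory_order_release);
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < results.size(); ++i)
        workers.emplace_back([&batch, &results, i] {
            results[i] = static_cast<int>(i) * 2;  // placeholder for voice DSP
            finish_task(batch);
        });
    batch.done.wait();                 // single wake-up after the last task
    for (auto& w : workers) w.join();  // demo cleanup only
}
```

The counter is set before any task is dispatched, so no worker can see it hit zero early.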

That doesn't reduce the sync overhead of actually dispatching tasks though. Something I haven't done (since I just thought of it about 10 minutes ago), but which might be profitable for small buffers, is that one could potentially group such tasks into "macro tasks" where you basically combine the processing of multiple voices into a single task such that the total number of voices*samples exceeds some minimum value. This way, if we have something like 16 voices and we put the threshold at 32 samples, and the host gives a tiny buffer size of 8, you'll dispatch 32/8 = 4 voices per task. You're still generating 4 tasks for multi-threading, but we don't need to go through the dispatch queue 16 times (whereas with longer buffers, having separate tasks makes sense to load-balance, with multiple threads often processing at different rates).

You could do that with a very simple algorithm too: have each task consist of the "first voice", "number of samples" and "number of voices", and then when generating tasks, loop over the voices: if the nsamples*nvoices of the current task is less than the threshold, bump nvoices, otherwise bump the task index, and when done, dispatch however many tasks you ended up with. If "nsamples" is larger than the threshold, every voice gets its own task. If the work doesn't divide evenly, the last task will end up with less work (i.e. just the leftovers), but I'd imagine that's not a huge deal.
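The grouping loop described above can be sketched in a few lines of C++. `MacroTask` and `group_voices` are hypothetical names, and the `std::vector` return is only for readability; on the audio thread one would presumably fill a preallocated array instead.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One dispatchable unit: a contiguous run of voices rendered for nsamples.
struct MacroTask {
    std::size_t first_voice;
    std::size_t nvoices;
    std::size_t nsamples;
};

// Packs voices into tasks until each task covers at least `threshold`
// voice*samples of work; the last task simply takes whatever is left over.
std::vector<MacroTask> group_voices(std::size_t total_voices,
                                    std::size_t nsamples,
                                    std::size_t threshold)
{
    std::vector<MacroTask> tasks;
    for (std::size_t v = 0; v < total_voices; ++v) {
        // Start a new task when there is none yet or the current one is "full".
        if (tasks.empty() || tasks.back().nvoices * nsamples >= threshold)
            tasks.push_back({v, 0, nsamples});
        ++tasks.back().nvoices;
    }
    // Debug-only sanity check: the last task must end at the final voice.
    assert(tasks.empty() ||
           tasks.back().first_voice + tasks.back().nvoices == total_voices);
    return tasks;
}
```

With 16 voices, a threshold of 32 and a buffer of 8 samples this gives 4 tasks of 4 voices each; with a buffer of 64 samples every voice gets its own task.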

Post

From our experience with multicore support with several plug-ins, multithreading in plug-ins can be very useful, as soon as there is enough work to be done in parallel. The overhead of task switching is otherwise too high for any benefit. With PatchWork, that is both a host and plug-in, multithreading lets you load many more plug-ins in parallel. We have experimented with some CPU-heavy virtual instruments, and you can increase polyphony drastically by using several instances of the instrument in parallel.

IMHO "clashing" with the host's own scheduling is not really possible, unless the host or the plug-in is trying to be smarter than the operating system's scheduler (which is in general a pretty bad idea), or the plug-in's multithreading system is not re-entrant. When using a high-priority worker thread pool, a fundamental idea to avoid problems is to make sure that the calling DSP thread does not merely wait for the worker threads: it has to poll the task list and get jobs done too, while (potentially) other threads are working if available. If you do not do that, there is no guarantee that the host and OS will let you wake up the threads in time (some hosts manipulate thread affinity and scheduling quite a bit, so most of your worker threads may just be discarded during the DSP call, especially if the app is already quite busy), and you will get dropouts pretty quickly.

It also solves the issue with ultra-low buffer lengths: the DSP thread may have already finished the work before the worker threads are even starting to look at the jobs list.
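A rough C++ sketch of a DSP thread that helps drain the job list instead of only waiting. This is an illustration under assumptions, not Blue Cat's implementation: `JobList`, `drain` and `process_block` are invented names, the demo spawns its workers per call, and the final wait is a simple yield loop where a real plug-in might block on a semaphore.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

struct JobList {
    std::vector<std::function<void()>> jobs;
    std::atomic<std::size_t> next{0};      // index of the next unclaimed job
    std::atomic<std::size_t> finished{0};  // number of completed jobs
};

// Claims and runs jobs until none are left; returns how many this thread ran.
std::size_t drain(JobList& list) {
    std::size_t ran = 0;
    for (;;) {
        std::size_t i = list.next.fetch_add(1, std::memory_order_relaxed);
        if (i >= list.jobs.size())
            return ran;
        list.jobs[i]();
        list.finished.fetch_add(1, std::memory_order_release);
        ++ran;
    }
}

// Called from the DSP thread. Workers are optional helpers: even if none of
// them gets scheduled in time, the caller finishes every job by itself.
std::size_t process_block(JobList& list, std::size_t nworkers) {
    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < nworkers; ++w)
        workers.emplace_back([&list] { drain(list); });

    std::size_t ran_here = drain(list);  // the caller works too

    // Only jobs already claimed by a worker can still be in flight here.
    while (list.finished.load(std::memory_order_acquire) < list.jobs.size())
        std::this_thread::yield();

    for (auto& w : workers) w.join();    // demo cleanup only
    assert(list.finished.load() == list.jobs.size());
    return ran_here;
}
```

With zero workers the caller runs every job itself, which is also what effectively happens with ultra-short buffers when the workers wake up too late.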

I am not much in favor of adding APIs for the hosts to manage the worker threads for you. There is already enough room for nasty bugs in host/plug-in communication without adding more complexity, especially at such a low level.

Post

Blue Cat Audio wrote: Mon Nov 08, 2021 10:26 am I am not much in favor of adding APIs for the hosts to manage the worker threads for you. There is already enough room for nasty bugs in host/plug-in communication without adding more complexity, especially at such a low level.
This is why we've set out to simplify communication: make it fully clear to developers what is called and what can be called, and when.

As it is now, as soon as a single plug-in has a priority in its realtime threads that does not match any others, there'll be conflicts, simply because the order of execution will be jumbled one way or the other. The advantage of the host based thread pool is that all plug-ins get the same priority, and the implementation is completely free of mutexes or whatever else plug-ins (or hosts...) could get wrong.

Post

Urs wrote: Mon Nov 08, 2021 10:44 am The advantage of the host based thread pool is that all plug-ins get the same priority, and the implementation is completely free of mutexes or whatever else plug-ins (or hosts...) could get wrong.
In theory yes, I agree. But multithreading is a complex thing for the human brain, and it becomes even more complex if you do not know how things get called (documentation has limits...). As a host developer myself, I can bet this would produce even more crashes and odd issues, as developers who have no clue about multithreading use these APIs to call concurrent tasks that share unprotected data structures, unless the API is already designed for such cases.

And anyway, as each host will have its own private implementation, it will be almost impossible to find out where the issue comes from. Just like with VST3 and its messaging system / parameters handling etc. : as soon as a host acts as a middleware for services that are implementation-dependent, you can be sure that it will cause unexpected behaviors by adding more complexity.

Having it as an option is nice though. I would just be very cautious about relying on it! :-)
As it is now, as soon as a single plug-in has a priority in its realtime threads that does not match any others, there'll be conflicts, simply because the order of execution will be jumbled one way or the other.
Unless the priority is indeed very low and other processes are already taking over, the OS scheduler should be able to manage such cases properly. If the plug-in has a re-entrant thread pool, you should not even notice it. I guess that if the developer was able to write a broken plug-in, new APIs won't fix it anyway :-)

Post

mystran wrote: Mon Nov 08, 2021 5:42 am The main thread can then wait() on that semaphore, just once, and gets woken up only after all the tasks are finished (which is infinitely better than having it sync separately with every task).
Yes this is the approach we are using, main thread syncs just once at the end and otherwise never waits, like Blue Cat wrote above. Probably the only way.
mystran wrote: Mon Nov 08, 2021 5:42 am That doesn't reduce the sync overhead of actually dispatching tasks though. Something I haven't done (since I just thought of it about 10 minutes ago), but which might be profitable for small buffers, is that one could potentially group such tasks into "macro tasks" where you basically combine the processing of multiple voices into a single task such that the total number of voices*samples exceeds some minimum value. This way, if we have something like 16 voices and we put the threshold at 32 samples, and the host gives a tiny buffer size of 8, you'll dispatch 32/8 = 4 voices per task. You're still generating 4 tasks for multi-threading, but we don't need to go through the dispatch queue 16 times (whereas with longer buffers, having separate tasks makes sense to load-balance, with multiple threads often processing at different rates).
You could do that with a very simple algorithm too: have each task consist of the "first voice", "number of samples" and "number of voices", and then when generating tasks, loop over the voices: if the nsamples*nvoices of the current task is less than the threshold, bump nvoices, otherwise bump the task index, and when done, dispatch however many tasks you ended up with. If "nsamples" is larger than the threshold, every voice gets its own task. If the work doesn't divide evenly, the last task will end up with less work (i.e. just the leftovers), but I'd imagine that's not a huge deal.
Not sure I get this tbh. In a small-buffer situation, like 32 samples or less, it is unlikely the worker threads will be able to do anything at all.

Richard
Synapse Audio Software - www.synapse-audio.com

Post

Urs wrote: Mon Nov 08, 2021 10:44 am The advantage of the host based thread pool is that all plug-ins get the same priority, and the implementation is completely free of mutexes or whatever else plug-ins (or hosts...) could get wrong.
Nice!! But how do you prevent plugins or hosts from ignoring all this and directly calling the OS threading API?

Edit: for example, threads can spawn new threads and then change their priority.
Last edited by S0lo on Mon Nov 08, 2021 11:34 am, edited 2 times in total.
www.solostuff.net
Advice is heavy. So don’t send it like a mountain.

Post

Blue Cat Audio wrote: Mon Nov 08, 2021 10:26 am IMHO "clashing" with the host's own scheduling is not really possible, unless the host or the plug-in is trying to be smarter than the operating system's scheduler (which is in general a pretty bad idea)
There is one exotic Windows host which crashes plugin MT, but I cannot remember which one it was. IIRC it was abandonware though, and it was also the only host that really caused a massive conflict with MT. Usually, if MT does not work, there are no severe consequences other than weak performance, from our experience thus far :)

Richard
Synapse Audio Software - www.synapse-audio.com

Post

S0lo wrote: Mon Nov 08, 2021 11:12 am
Urs wrote: Mon Nov 08, 2021 10:44 am The advantage of the host based thread pool is that all plug-ins get the same priority, and the implementation is completely free of mutexes or whatever else plug-ins (or hosts...) could get wrong.
Nice!! But how do you prevent plugins or hosts from ignoring all this and directly calling the OS threading API?
A host ignores it by not implementing the extension. Then it's up to the plug-in developer to do what they like.

But if a host implements it, the plug-in developer does not need to use any threading-specific code. The developer surely still needs to know what they're doing, i.e. avoid access to shared memory and not expect a certain order of execution.

We will be testing this with a major host and our own implementation shortly and we'll see if it runs smoother or not. We're in good spirits that it'll remove a lot of the context switching overhead.
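To make the idea concrete, here is a hypothetical C++ sketch of what a host-provided pool could look like from the plug-in side. None of these names come from the CLAP headers; `HostThreadPool`, `request_exec`, `Synth` and `make_toy_host_pool` are invented for illustration, and the toy host simply spawns one thread per task. The plug-in contains no threading code of its own and falls back to a plain loop when the host offers nothing or declines.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical host-side interface (names invented for this sketch).
struct HostThreadPool {
    // Calls exec(0) .. exec(num_tasks - 1), possibly concurrently and in any
    // order, and returns true once all calls have finished. Returns false if
    // the host declines, in which case nothing was executed.
    std::function<bool(std::uint32_t num_tasks,
                       const std::function<void(std::uint32_t)>& exec)>
        request_exec;
};

struct Synth {
    std::vector<float> voice_out;  // each task writes only its own slot

    void render_voice(std::uint32_t v) {
        assert(v < voice_out.size());
        voice_out[v] = 1.0f / (1.0f + static_cast<float>(v));  // placeholder DSP
    }

    void process(const HostThreadPool* pool) {
        auto n = static_cast<std::uint32_t>(voice_out.size());
        auto exec = [this](std::uint32_t v) { render_voice(v); };
        if (pool && pool->request_exec && pool->request_exec(n, exec))
            return;  // the host ran every voice for us
        for (std::uint32_t v = 0; v < n; ++v)
            render_voice(v);  // no extension or host declined: run inline
    }
};

// Toy host for the demo: one short-lived thread per task.
HostThreadPool make_toy_host_pool() {
    HostThreadPool pool;
    pool.request_exec = [](std::uint32_t n,
                           const std::function<void(std::uint32_t)>& exec) {
        std::vector<std::thread> threads;
        for (std::uint32_t i = 0; i < n; ++i)
            threads.emplace_back(exec, i);
        for (auto& t : threads) t.join();
        return true;
    };
    return pool;
}
```

Each task touches only its own output slot and makes no assumption about ordering, which is the discipline the developer still has to bring.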

Post

Urs wrote: Mon Nov 08, 2021 11:28 am We will be testing this with a major host and our own implementation shortly and we'll see if it runs smoother or not. We're in good spirits that it'll remove a lot of the context switching overhead.
It would be nice to see a benchmark when/if you have the opportunity!

Post

Urs wrote: Sun Oct 17, 2021 3:19 pm Yeah, a proper documentation of our goals is on the todo list. Not sure if the host controlled threading will be part of our initial release (u-he stuff), but the first major host to support CLAP will have it built in. I'll start a thread about it once we can supply proof of concept (DAW + our full product line + information).
Great stuff Urs, sounds exciting! :)

Richard
Synapse Audio Software - www.synapse-audio.com

Post

Urs wrote: Mon Nov 08, 2021 11:28 am
S0lo wrote: Mon Nov 08, 2021 11:12 am
Urs wrote: Mon Nov 08, 2021 10:44 am The advantage of the host based thread pool is that all plug-ins get the same priority, and the implementation is completely free of mutexes or whatever else plug-ins (or hosts...) could get wrong.
Nice!! But how do you prevent plugins or hosts from ignoring all this and directly calling the OS threading API?
A host ignores it by not implementing the extension. Then it's up to the plug-in developer to do what they like.

But if a host implements it, the plug-in developer does not need to use any threading-specific code. The developer surely still needs to know what they're doing, i.e. avoid access to shared memory and not expect a certain order of execution.
I'm sure that you've already thought of this, but I'd like to emphasize it anyway.

A plugin dev usually settles for the lowest common denominator across all hosts. If too many hosts don't implement the same extensions, that common denominator shrinks to the bare minimum, simply so the plug-in stays compatible with everything without having to do "if this host, do this; if that host, do that". I think this discourages plugin devs from implementing more extensions.

To help resolve this, I recommend a "CLAP compatible" logo as a reward for hosts that implement a certain essential set of extensions and/or pass a few tests. A dev would only be allowed to place the logo on their host product if it implements those essential extensions and/or passes the tests.

Such logos have been done before, for example "Windows compatible" drivers or "Certified for use on ....."

However, I don't recommend doing this from the get-go :). The format needs to gain some momentum for a while first.
www.solostuff.net
Advice is heavy. So don’t send it like a mountain.

Post

Richard_Synapse wrote: Mon Nov 08, 2021 11:10 am
mystran wrote: Mon Nov 08, 2021 5:42 am The main thread can then wait() on that semaphore, just once, and gets woken up only after all the tasks are finished (which is infinitely better than having it sync separately with every task).
Yes this is the approach we are using, main thread syncs just once at the end and otherwise never waits, like Blue Cat wrote above. Probably the only way.
Oh.. I've always let the main thread do nothing except block on the semaphore from the moment it dispatches the worker threads to the moment the workers are all finished. I have no idea what Blue Cat is talking about with the whole "need to use main thread" thing. Maybe I'm missing something.

edit: Is this a macOS thing? I'm aware the RT scheduling there is a bit "weird", although I've never observed any issues there either, but I admit my code for that platform has not seen very wide circulation... whereas on Windows I can't imagine how anything could possibly go wrong as long as you tell MMCSS to bump your threads to "Pro Audio"?
Not sure I get this tbh. In a small-buffer situation, like 32 samples or less, it is unlikely the worker threads will be able to do anything at all.
Depends on how heavy your voices are... you could put the threshold a lot higher.. point is that there might be a middle ground between "one task per voice" and "everything in main thread."
