Using Multiple Threads In Audio Thread

DSP, Plug-in and Host development discussion.
KVRAF
6582 posts since 12 Feb, 2006 from Helsinki, Finland

Post Sat Oct 16, 2021 1:19 pm

I'd like to add that it's important to remember we're not trying to optimize the number of CPU cycles spent or the maximum throughput or anything like that, but rather we're trying to minimize the wallclock time it takes until the last plugin on the master bus is done. From this point of view, pretty much the worst thing that can happen is that you're processing serially something that you could have processed in parallel. What you really don't want to do is risk having CPUs sit idle, while you could have scheduled some useful work on them.

Having "too many threads" is much less of an issue, because at least they are doing something, keeping the CPUs busy and moving you towards the final goal. The issue with plugins multi-threading on their own has nothing to do with "too many threads" and everything to do with DAW losing control over which plugin is processed first (because a slightly lower priority plugin might delay the highest priority plugin by competing for the CPU cores).

So which plugin is the highest priority then? It's the one that's blocking the longest chain to the master bus output (ideally measured in wall-clock time, though you could use the number of dependent plugins as a rough approximation). If this plugin can use all the CPU cores, then the DAW really wouldn't want to do anything else in parallel until that plugin is done. But with current plugin APIs the DAW has no idea whether a plugin can use all the cores, or perhaps only some. Even if an API allowed such a thing, you really wouldn't want CPU cores sitting idle while a given plugin is trying to figure out what to schedule for them; you'd rather have them do something else and let the work pile up for the future. So the best the DAW can do is assume that the plugin is single-threaded and process other plugins on the other cores. This is the real issue, not the actual number of threads launched.
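A minimal sketch of that priority rule, assuming the host had per-plugin cost estimates (which, as noted, it typically doesn't). The graph, the plugin names and the costs are all made up for illustration:

```python
from functools import lru_cache

# Hypothetical plugin graph: each node lists the nodes that consume its output.
# "master" stands for the last plugin on the master bus.
downstream = {
    "synth_a": ["fx_a"],
    "fx_a": ["bus"],
    "synth_b": ["bus"],
    "bus": ["master"],
    "master": [],
}

# Assumed per-block cost of each plugin, in arbitrary wall-clock units.
cost = {"synth_a": 4.0, "fx_a": 2.0, "synth_b": 5.0, "bus": 1.0, "master": 1.0}


@lru_cache(maxsize=None)
def remaining_chain(node):
    """Length of the longest chain from `node` to the output, `node` included."""
    tails = [remaining_chain(n) for n in downstream[node]]
    return cost[node] + (max(tails) if tails else 0.0)


def pick_next(ready):
    """Among plugins whose inputs are ready, pick the one blocking the longest chain."""
    return max(ready, key=remaining_chain)


print(pick_next(["synth_a", "synth_b"]))  # prints "synth_a"
```

Note that synth_b is the more expensive plugin on its own (5 vs 4), but synth_a sits on the longer chain (8 vs 7), so it is the one that should go first.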
Preferred pronouns would be "it/it" because according to this country, I'm a piece of human trash.

KVRian
589 posts since 4 Jan, 2007

Post Sat Oct 16, 2021 2:12 pm

mystran wrote:
Sat Oct 16, 2021 12:50 pm
It's perfectly fine to have a hundred threads on RT priority if they mostly sit on a semaphore waiting. It's perfectly fine to have a queue where you put some work, then you post on a semaphore, a worker thread wakes up, does the job and goes back to sleep.
As I said in my previous message, notice that the text you quoted says "busy mix", implying a high-usage scenario with contention, not idling threads.
mystran wrote:
Sat Oct 16, 2021 12:50 pm
From the project's point of view, it typically makes sense to give one plugin as many resources as it can use, then use the rest for the next one and so on. The trickiest part of parallel processing an audio graph is finding enough actual parallel work and it makes sense to try to optimize the (real-)time that it takes to release further serially dependent work.
I started from the assumption that the DAW can process in parallel each track that doesn't have serial dependencies (receives). To me it seems that at the top of the graph of a given project there is a lot of potential for parallelization. And the plugins using multiple threads are most probably synths/generators, which mostly sit at the top.

As there is no way for the DAW to know how many cores a plugin requires, why should it keep them free just in case? I work with the assumption that the DAW treats each plugin as an individual black-box process and only measures the cycles each node of the graph takes to complete.
mystran wrote:
Sat Oct 16, 2021 12:50 pm
Now, the "non-optimality" of having multiple plugins have their own threads has nothing to do with any scheduling (which the OS can handle just fine and which is negligible overhead in practice when you're doing short chunks of work that won't typically run out of their time-slices at all). The real issue here is that one plugin using CPU cores that another plugin could use to finish faster means you might not be able to release further serial work as fast. This hurts the DAW "graph scheduler" whereas the actual OS "thread scheduler" won't care... but note that those CPU heavy plugins not multi-threading internally and therefore finishing slower hurts the graph scheduler more so. Without a protocol for plugins to borrow DAW's worker threads, multi-threading in plugins internally is still a win for the whole project.
Yes, the problem is one or more threads of the fastest (time-wise) graph path(s) holding up threads from the slowest path(s), because OS scheduling is dynamic and, from an external viewer's POV, non-deterministic, and the DAW scheduler can't optimize around that or see what is going on.

In other words: if I take N_CPUS instances of e.g. Diva on N_CPUS tracks in single-threaded mode, each track playing a chord with just the number of voices that a single thread can handle (assumed to be around N_CPUS voices or more), the moment I enable multicore the project will start clicking.

KVRian
589 posts since 4 Jan, 2007

Post Sat Oct 16, 2021 3:00 pm

mystran wrote:
Sat Oct 16, 2021 1:19 pm
I'd like to add that it's important to remember we're not trying to optimize the number of CPU cycles spent or the maximum throughput or anything like that, but rather we're trying to minimize the wallclock time it takes until the last plugin on the master bus is done. From this point of view, pretty much the worst thing that can happen is that you're processing serially something that you could have processed in parallel. What you really don't want to do is risk having CPUs sit idle, while you could have scheduled some useful work on them.
Of course. But how would a DAW make the best decisions if e.g. each and every plugin processed on its own thread pool? Now every parallel short path could block progress on the long one and the DAW couldn't see it, which is just as bad as not processing in parallel when able to.

Creating a per-plugin thread pool is betting that most of the time there are free CPU resources somewhere else. If the plugin is on top of the graph (synths) this may not be true on busy mixes. It is just a bet. Maybe for mastering plugins, limiters, etc this could work, but those are not as easily parallelized (if at all) as synths.

For me the only sane thing to do, absent a suitable shared DAW-owned work queue API, is to make multi-threading a configuration parameter, as many plugins already do. No bets.
mystran wrote:
Sat Oct 16, 2021 1:19 pm
Having "too many threads" is much less of an issue, because at least they are doing something, keeping the CPUs busy and moving you towards the final goal. The issue with plugins multi-threading on their own has nothing to do with "too many threads" and everything to do with DAW losing control over which plugin is processed first (because a slightly lower priority plugin might delay the highest priority plugin by competing for the CPU cores).
"Too many threads" doesn't seem to be something I wrote. I might have worded some parts poorly, but it is definitely not what I meant. From my first message:
I'm not very convinced this is optimal from the project's point of view. Normally the best thread priorities can be assigned at a higher level of abstraction than a single plugin. A plugin opening threads is betting on the assumption that other cores are free to do work.
You could correctly say that "thread priorities" should be substituted by "counts", but I was tired and thinking in terms of $DAILY_JOB there, which is not audio, where having the plugin thread pools at a correctly selected priority would achieve something similar to the desired result.
Last edited by rafa1981 on Sat Oct 16, 2021 10:50 pm, edited 3 times in total.

KVRian
589 posts since 4 Jan, 2007

Post Sat Oct 16, 2021 3:02 pm

And this is ignoring that if every synth used threads and they competed for CPUs, it would probably add jitter, which is probably undesirable at near-full CPU usage too.

KVRAF
6582 posts since 12 Feb, 2006 from Helsinki, Finland

Post Sat Oct 16, 2021 4:58 pm

rafa1981 wrote:
Sat Oct 16, 2021 3:00 pm
Of course. But how would a DAW make the best decisions if e.g. each and every plugin processed on its own thread pool? Now every parallel short path could block progress on the long one and the DAW couldn't see it, which is just as bad as not processing in parallel when able to.
It can't. It can't make the best decisions even if every plugin is single-threaded, because the actual processing time of any given plugin is typically unknown (ie. varies from one block to the next).
Creating a per-plugin thread pool is betting that most of the time there are free CPU resources somewhere else. If the plugin is on top of the graph (synths) this may not be true on busy mixes. It is just a bet. Maybe for mastering plugins, limiters, etc this could work, but those are not as easily parallelized (if at all) as synths.
I'm trying to argue that it's especially the synths on top of the mix where it's ideal that one plugin at a time steals all the CPU cores. But in general it's safe to bet that there are free CPU resources, because the host "spends" one CPU core when it calls into your plugin. Hence either someone "steals" your core (which is what you actually want) or your thread pool has at least one core available. And if it has one core available, then it should process your plugin in approximately the same time you would have taken single-threaded, hence you cannot possibly lose.
You could correctly say that "thread priorities" should be substituted by "counts", but I was tired and thinking in terms of $DAILY_JOB there, which is not audio, where having the plugin thread pools at a correctly selected priority would achieve something similar to the desired result.
No. No "thread priorities" at all. Ideally put them all at the highest RT priority (eg. "Pro Audio" on Windows). You don't want any priorities between different realtime threads.

KVRian
589 posts since 4 Jan, 2007

Post Sun Oct 17, 2021 1:45 am

mystran wrote:
Sat Oct 16, 2021 4:58 pm
It can't. It can't make the best decisions even if every plugin is single-threaded, because the actual processing time of any given plugin is typically unknown (ie. varies from one block to the next).
OK, let's rephrase that as "a probably better decision", since one of the multiple sources of jitter is removed.
mystran wrote:
Sat Oct 16, 2021 4:58 pm
I'm trying to argue that it's especially the synths on top of the mix where it's ideal that one plugin at a time steals all the CPU cores. But in general it's safe to bet that there are free CPU resources, because the host "spends" one CPU core when it calls into your plugin. Hence either someone "steals" your core (which is what you actually want) or your thread pool has at least one core available. And if it has one core available, then it should process your plugin in approximately the same time you would have taken single-threaded, hence you cannot possibly lose.
I'm not sure I follow. I can't see why.

I will dump a simplified model of how I think a DAW does a block, with just enough resolution for explaining my point of view:
  • It calculates the work packages by traversing graph branches (parallel paths) until the next vertex, so it ends up with N chunks of serial work packages. Notice that the first batch of work packages is where most generators (synths) would usually end up.
  • It starts processing a work package as soon as all the vertices above it are done (for the graph entry points, where synths are likely to end up, that means immediately and all at the "same" time).
  • It then keeps going with the first step until it clears the graph, i.e. reaches the bottom of the master bus.
So if every generator (instrument, VSTi) were processed one at a time and stole all the cores, then whenever such a generator is not using all of them (e.g. not enough voices playing to fill all the cores, which seems more likely the more cores a machine has), some cores would sit idle while other work packages on the graph (from other tracks) could be making forward progress.

Notice that the trend is for the number of cores to increase. Not every part is playing 6+ note chords (and plugins might even be packing N voices per core thanks to SIMD). I might be missing something though.
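The simplified block model above can be sketched as a greedy simulation. This is only a toy under stated assumptions: the graph, the durations and the function name `simulate_block` are made up, every package is single-threaded, and `n_cores` must be at least 1.

```python
import heapq


def simulate_block(deps, duration, n_cores):
    """Greedy scheduling of one block; returns when the last package finishes."""
    remaining = {node: len(d) for node, d in deps.items()}
    consumers = {node: [] for node in deps}
    for node, parents in deps.items():
        for parent in parents:
            consumers[parent].append(node)

    # Graph entry points (typically the synths) are ready immediately.
    ready = sorted(node for node, count in remaining.items() if count == 0)
    running = []  # min-heap of (finish_time, node)
    now = 0.0
    free = n_cores
    while ready or running:
        # Start as many ready packages as there are free cores.
        while ready and free > 0:
            node = ready.pop(0)
            heapq.heappush(running, (now + duration[node], node))
            free -= 1
        # Advance to the next completion and release its dependents.
        now, done = heapq.heappop(running)
        free += 1
        for child in consumers[done]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    return now


# Hypothetical project: three synth tracks, one insert effect, one master bus.
deps = {
    "synth1": [], "synth2": [], "synth3": [],
    "fx1": ["synth1"],
    "master": ["fx1", "synth2", "synth3"],
}
duration = {"synth1": 3, "synth2": 3, "synth3": 3, "fx1": 1, "master": 1}

print(simulate_block(deps, duration, 1))  # 11.0, everything runs serially
print(simulate_block(deps, duration, 4))  # 5.0, the critical path synth1 -> fx1 -> master
```

With 2 cores the same project takes 7.0 in this model, because the third synth has to wait for a core while the cores are busy with the other two.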
mystran wrote:
Sat Oct 16, 2021 4:58 pm
rafa1981 wrote: You could correctly say that "thread priorities" should be substituted by "counts", but I was tired and thinking in terms of $DAILY_JOB there, which is not audio, where having the plugin thread pools at a correctly selected priority would achieve something similar to the desired result.
No. No "thread priorities" at all. Ideally put them all at the highest RT priority (eg. "Pro Audio" on Windows). You don't want any priorities between different realtime threads.
Exactly what I was saying, hence the "would substitute". Intuitively, I could argue that the plugins' thread pools might sit just one priority level below the DAW threads but still above everything else, so that a plugin worker doesn't block the CPU-pinned threads the DAW uses to traverse the graph (while the plugin can still make forward progress with at least the thread that the DAW has assigned to it), but I don't want to go down that rabbit hole.

I ran a practical test:
Ryzen 5800X (8 cores), Reaper, Windows. Bazille with multithreading disabled on one track, adding polyphony until a single core starts clicking, then backing off two voices, then duplicating the track to see how/when it breaks.

At 88 kHz, 512 samples:
-Multicore disabled: 12 tracks with very sporadic clicks. 13 a bit more frequent. 14 unusable.
-Multicore enabled: 10 tracks with clicks. 11 unusable.

At 88 kHz, 64 samples:
-Multicore disabled: 12 tracks with clicks. 13 unusable.
-Multicore enabled: 9 tracks OK. 10 unusable.

This is just a stress test with N=1, but I still wanted to see how small/big the effect was. Around 20% in this cherry-picked worst-case scenario.
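One way to arrive at the "around 20%" figure, taking the first track count listed on each result line as the comparison point. That choice is an assumption, and the click descriptions are not strictly like-for-like between the two modes:

```python
# Track counts read off the results above (first count listed on each line).
results = {
    "88 kHz / 512 samples": {"disabled": 12, "enabled": 10},
    "88 kHz / 64 samples": {"disabled": 12, "enabled": 9},
}


def loss(disabled, enabled):
    """Fraction of tracks lost when multicore is enabled."""
    return (disabled - enabled) / disabled


losses = {name: loss(**counts) for name, counts in results.items()}
average = sum(losses.values()) / len(losses)

for name, value in losses.items():
    print(f"{name}: {value:.1%}")  # 16.7% and 25.0%
print(f"average: {average:.1%}")   # 20.8%
```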
Last edited by rafa1981 on Sun Oct 17, 2021 4:45 am, edited 2 times in total.

Urs
u-he
26018 posts since 8 Aug, 2002 from Berlin

Post Sun Oct 17, 2021 3:25 am

rafa1981 wrote:
Sat Oct 16, 2021 11:19 am
Maybe a plugin standard should abstract a DAW-managed work queue for audio processing purposes? Sounds useful but also with the potential to be a viper's nest. EDIT: I see that Apple already did it.
CLAP will have this as well, but even easier - instead of notifying the host of the plug-in's threads, in CLAP the host provides its own thread pool to the plug-in. The plug-in tells the host to schedule N threaded calls, and the host then uses its worker threads to call N times into the plug-in with one out of N IDs, e.g. one per voice.
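A rough sketch of the shape of that protocol, in Python purely for illustration. The class and method names here (`Host`, `Synth`, `request_exec`, `exec_task`) are hypothetical and not copied from the CLAP headers, and Python threads say nothing about realtime behaviour; the point is only who owns the threads and who gets called with which ID.

```python
from concurrent.futures import ThreadPoolExecutor


class Host:
    """Hypothetical host that owns the only worker pool."""

    def __init__(self, n_workers):
        self._pool = ThreadPoolExecutor(max_workers=n_workers)

    def request_exec(self, plugin, n_tasks):
        """Call plugin.exec_task(i) for each of the N IDs on host workers, then return."""
        futures = [self._pool.submit(plugin.exec_task, i) for i in range(n_tasks)]
        for future in futures:
            future.result()  # blocks; re-raises any exception from a worker

    def shutdown(self):
        self._pool.shutdown()


class Synth:
    """Hypothetical plugin: one task per voice, each writing only to its own slot."""

    def __init__(self, n_voices):
        self.voice_out = [0.0] * n_voices

    def exec_task(self, voice_id):
        self.voice_out[voice_id] = 0.1 * (voice_id + 1)  # stand-in for voice DSP

    def process(self, host):
        host.request_exec(self, len(self.voice_out))
        return sum(self.voice_out)  # mix the voices on the calling thread


host = Host(n_workers=4)
synth = Synth(n_voices=8)
mixed = synth.process(host)
host.shutdown()
```

The plugin never creates a thread itself, so the host stays free to decide how many workers, if any, it lends out for a given call.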

KVRian
939 posts since 31 Dec, 2008

Post Sun Oct 17, 2021 4:00 am

Windows has a relatively new threading API which is work-oriented instead of thread-oriented. Basically, the OS manages the thread pool(s). All you have to do is tell the OS what work/function you want done using SubmitThreadpoolWork(), and the OS takes care of the rest.

The problem is, these modern APIs (well, even the old ones) don't seem to be well designed to handle sample-by-sample processing, which is what modern modulars do. The work you submit has to be a sizable chunk and not too frequent. Otherwise, the overhead seems to outweigh the benefit.

KVRian
589 posts since 4 Jan, 2007

Post Sun Oct 17, 2021 4:51 am

Urs wrote:
Sun Oct 17, 2021 3:25 am
CLAP will have this as well, but even easier - instead of notifying the host of ...
Makes sense. I guess that this is the project:
https://github.com/free-audio/clap

The readme doesn't explain what the project is or what its goals are.

EDIT:
https://news.ycombinator.com/item?id=8809659

Urs
u-he
26018 posts since 8 Aug, 2002 from Berlin

Post Sun Oct 17, 2021 7:19 am

rafa1981 wrote:
Sun Oct 17, 2021 4:51 am
The readme doesn't explain what the project is or what its goals are.
Yeah, a proper documentation of our goals is on the todo list. Not sure if the host controlled threading will be part of our initial release (u-he stuff), but the first major host to support CLAP will have it built in. I'll start a thread about it once we can supply proof of concept (DAW + our full product line + information).

KVRian
589 posts since 4 Jan, 2007

Post Sun Oct 17, 2021 7:51 am

Wow, I didn't know it was that ambitious. Let's hope it gains traction.

EDIT: good to see that the interface is plain C, so people can create bindings to every language.

KVRian
939 posts since 31 Dec, 2008

Post Sun Oct 17, 2021 8:35 am

rafa1981 wrote:
Sun Oct 17, 2021 7:51 am
Wow, I didn't know it was that ambitious.
I didn't even know about it. way-ta-go Urs :tu:

KVRAF
6582 posts since 12 Feb, 2006 from Helsinki, Finland

Post Sun Oct 17, 2021 10:02 am

Urs wrote:
Sun Oct 17, 2021 3:25 am
rafa1981 wrote:
Sat Oct 16, 2021 11:19 am
Maybe a plugin standard should abstract a DAW-managed work queue for audio processing purposes? Sounds useful but also with the potential to be a viper's nest. EDIT: I see that Apple already did it.
CLAP will have this as well, but even easier - instead of notifying the host of the plug-in's threads, in CLAP the host provides its own thread pool to the plug-in. The plug-in tells the host to schedule N threaded calls, and the host then uses its worker threads to call N times into the plug-in with one out of N IDs, e.g. one per voice.
This sounds great. Looking at the example in the draft header this would handle the simple cases of multiple voices the way it should be done. I can think of some cases (eg. if the plugin has an internal processing graph of some sort) where it would be nice to also be able to queue additional tasks from the worker threads, but that's not something you need for simple multi-threading of voice processing (and arguably makes everything a tiny bit more complicated, so perhaps not worth it).

The remark in the draft header about thread-pools "possibly breaking hard realtime constraints" is a bit scary though. As far as I can see, if a thread-pool is specified in such a way that (1) all workers run at realtime priority and (2) only real-time threads are allowed to post work (i.e. no priority inversions allowed when synchronizing multiple "producers" of the underlying queue), then I don't see why you couldn't guarantee real-time just fine.

Strictly speaking you also need to allow queueing to fail on a full queue (to allow for a fixed-size queue, to avoid allocs), but this need not necessarily be pushed to the client if you just do the work on the original thread instead (well.. usually I'd do "one item" and then retry queueing the rest).
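The control flow of that fallback can be sketched as follows. This is only a sketch: Python's `queue.Queue` takes locks internally, so it is not what you would use on a realtime thread (a real implementation would presumably be a preallocated ring buffer plus a semaphore), and `worker`, `run_all` and `make_job` are made-up names.

```python
import queue
import threading
from collections import deque


def worker(q):
    """Sleep on the queue, run one job, go back to sleep."""
    while True:
        job = q.get()
        try:
            if job is None:  # sentinel: shut down
                return
            job()
        finally:
            q.task_done()


def run_all(jobs, q):
    """Queue jobs; if the queue is full, run that one job here and retry with the rest."""
    pending = deque(jobs)
    inline = 0
    while pending:
        job = pending.popleft()
        try:
            q.put_nowait(job)  # never blocks, never grows the queue
        except queue.Full:
            job()  # do "one item" on the posting thread instead
            inline += 1
    q.join()  # wait until the workers have finished everything that was queued
    return inline


done = []
lock = threading.Lock()


def make_job(i):
    def job():
        with lock:
            done.append(i)
    return job


q = queue.Queue(maxsize=2)  # fixed capacity
threads = [threading.Thread(target=worker, args=(q,)) for _ in range(2)]
for t in threads:
    t.start()
inline = run_all([make_job(i) for i in range(16)], q)
for _ in threads:
    q.put(None)
for t in threads:
    t.join()
```

How many jobs end up running inline depends on timing, but every job runs exactly once either way, and the posting thread is never left waiting on a full queue.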

Urs
u-he
26018 posts since 8 Aug, 2002 from Berlin

Post Sun Oct 17, 2021 2:13 pm

mystran wrote:
Sun Oct 17, 2021 10:02 am
The remark in draft header about thread-pools "possibly breaking hard realtime constraints" is a bit scary though.
Yeah, I should ask Alex about that. Thing is, everything is converging at the moment and we hope to have real working examples in a few weeks. I assume that things will even out, e.g. there might be a recommendation then that plug-ins should avoid using the threadpool with a block size of less than, say, 32 samples, or whatever makes its use counterproductive.

I think there's way too little talk about all of this and if we start exchanging more experience, we'll all make better use of the means we have.

KVRian
589 posts since 4 Jan, 2007

Post Mon Oct 18, 2021 3:42 am

Probably the thread pool should be abstracted as simply a task executor, so the plugin can make no assumptions about how things will run or how it is implemented.

The plugin would just make a single function call, passing a list of tasks that are totally parallelizable (and don't synchronize between themselves); the DAW would decide how they are run and return when all of them are completed.

This is to allow a DAW to make decisions like:
  • Which CPU each work package should run on, so the DAW can use only the CPUs it knows have reached the bottom of the graph or are waiting for a dependency to clear.
  • As the call is blocking, if the DAW knew that all other CPUs were processing, it could utilize the current CPU (assigned for the track where the plugin is running) and run in-place with no parallelization.
  • Same with extremely small block sizes. The DAW would make the decision, not every plugin.
Notice that these thread pools break with plugin sandboxing. But in this case the DAW is still able to run the passed tasks serially if the user has plugin sandboxing enabled.
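A minimal sketch of such an executor, with the host making the serial/parallel decision. Everything here is hypothetical: the `TaskExecutor` name, the `idle_workers` counter (a real host would track this from its graph scheduler), and the 32-sample threshold, which just borrows the number floated earlier in the thread.

```python
from concurrent.futures import ThreadPoolExecutor


class TaskExecutor:
    """Hypothetical host-side executor; the plugin only ever sees run_all()."""

    def __init__(self, n_workers, min_block_size=32, sandboxed=False):
        self._pool = ThreadPoolExecutor(max_workers=n_workers)
        self.min_block_size = min_block_size  # assumed threshold, not a spec value
        self.sandboxed = sandboxed
        self.idle_workers = n_workers  # placeholder for real scheduler state

    def run_all(self, tasks, block_size):
        """Run independent tasks, block until all are done, report how they ran."""
        serial = (
            self.sandboxed
            or self.idle_workers == 0
            or block_size < self.min_block_size
        )
        if serial:
            for task in tasks:
                task()  # in place, on the thread already assigned to this plugin
            return "serial"
        for future in [self._pool.submit(task) for task in tasks]:
            future.result()
        return "parallel"

    def shutdown(self):
        self._pool.shutdown()


out = [0] * 4


def make_task(i):
    def task():
        out[i] = i * i  # each task touches only its own slot
    return task


executor = TaskExecutor(n_workers=2)
mode_big = executor.run_all([make_task(i) for i in range(4)], block_size=512)
mode_small = executor.run_all([make_task(i) for i in range(4)], block_size=16)
executor.shutdown()
```

The plugin code is identical in both cases; only the host knows (or cares) whether the tasks actually ran in parallel.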
