Using Multiple Threads In Audio Thread

DSP, Plug-in and Host development discussion.
] Peter:H [
KVRian
1453 posts since 22 Sep, 2016

Post Fri Nov 12, 2021 10:04 am

mystran wrote:
Fri Nov 12, 2021 8:17 am
juha_p wrote:
Fri Nov 12, 2021 7:16 am
SNFK wrote:
Fri Nov 12, 2021 6:57 am
...
How would you create a lock-free thread pool?
...
If you google it you'll find many examples. An example.
Please don't do this kind of nonsense. Having your workers properly wait (=block) on a semaphore (=lock) is infinitely better than wasting CPU by having your threads poll for work... and it's also better for latency, 'cos you don't need to sleep() in order to avoid burning 100% CPU and you'll get woken up as soon as there's some work, rather than the next time your sleep() expires.

edit: Also one should never do a busy-loop with a short sleep() on macOS, because it simply doesn't work: the OS scheduler will (at least sometimes) detect such behaviour, it'll spit out "thread XYZ is waking up too often" into console, and then it will throttle your thread by only waking it up once in a while. This can happen with sleeps on the order of 1ms (which is already terrible for latency) and it doesn't care what sort of realtime priority you've tried to give your thread. It's not really a profitable strategy on Windows either, but on macOS it's not even going to work properly... and if someone is thinking about busylooping without sleep they should immediately stop programming, for ever.
I'm afraid you are not 100% up to date. We as purple belt threaders are a little behind what the black belt guys are doing. Have you studied the LMAX Disruptor? None of those guys talk about busy looping... google it ... here's one article https://www.fatalerrors.org/a/high-perf ... trong.html and here is the GitHub page from the creators https://lmax-exchange.github.io/disruptor/ Actually there are even successors already (i.e. more performant) to Disruptor afaik. So I would say it's good to at least peek into the black belt dojo before claiming there are no black belts.

rafa1981
KVRian
669 posts since 4 Jan, 2007

Post Fri Nov 12, 2021 10:21 am

mystran wrote:
Fri Nov 12, 2021 8:17 am
and if someone is thinking about busylooping without sleep they should immediately stop programming, for ever.
Just nitpicking because I saw the word "never". I'm sure we agree on mostly everything :)

There are people in HFT (high frequency trading) who dedicate cores to a single purpose and busy-loop on them to win some nanoseconds. Those systems are single-purpose of course.

The mutex approach is better in the low-contention, queue-often-empty, latency-sensitive case and the lockfree one in the high-contention, queue-seldom-empty, throughput-oriented case. It depends on what is being optimized. Both approaches have their place as long as one knows what one is doing.

For the audio case yes, no lockfree stuff mostly. The queues are empty very often (DAW doing something else), busy looping makes no sense (shared resources) and sleeping causes latency spikes.
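
To make the "workers block until there is work" idea concrete, here is a minimal sketch of a pool whose idle workers wait on a condition variable. It is not code from anyone in this thread; `BlockingPool` and `runBlockingPoolDemo` are hypothetical names, and the use of `std::function` and a `std::deque` is an assumption made for brevity (both can allocate, so a real audio pool would likely preallocate its job storage).

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// Minimal worker pool: idle workers block on a condition variable,
// so they use no CPU until a job arrives or the pool shuts down.
class BlockingPool {
public:
    explicit BlockingPool(std::size_t workerCount) {
        for (std::size_t i = 0; i < workerCount; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }

    ~BlockingPool() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wake_.notify_all();
        for (auto& worker : workers_) worker.join();
    }

    void submit(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push_back(std::move(job));
        }
        wake_.notify_one();  // wakes one blocked worker, no polling involved
    }

private:
    void workerLoop() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return stopping_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // stopping and nothing left to do
                job = std::move(jobs_.front());
                jobs_.pop_front();
            }
            job();  // run the job outside the lock
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::deque<std::function<void()>> jobs_;
    bool stopping_ = false;
    std::vector<std::thread> workers_;
};

// Hypothetical demo: runs `jobCount` trivial jobs on two workers and
// returns how many of them completed.
int runBlockingPoolDemo(int jobCount) {
    std::atomic<int> done{0};
    {
        BlockingPool pool(2);
        for (int i = 0; i < jobCount; ++i)
            pool.submit([&done] { ++done; });
    }  // the destructor drains the queue and joins the workers
    return done.load();
}
```

The workers never call sleep() and never spin: with an empty queue they sit inside `wait()` until `submit()` or the destructor notifies them.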
Last edited by rafa1981 on Fri Nov 12, 2021 11:00 am, edited 1 time in total.

rafa1981
KVRian
669 posts since 4 Jan, 2007

Post Fri Nov 12, 2021 10:54 am

] Peter:H [ wrote:
Fri Nov 12, 2021 10:04 am
I'm afraid you are not 100% up to date. We as purple belt threaders are a little behind what the black belt guys are doing. Have you studied the LMAX Disruptor? None of those guys talk about busy looping... google it ... here's one article https://www.fatalerrors.org/a/high-perf ... trong.html and here is the GitHub page from the creators https://lmax-exchange.github.io/disruptor/ Actually there are even successors already (i.e. more performant) to Disruptor afaik. So I would say it's good to at least peek into the black belt dojo before claiming there are no black belts.
In this case it's not a belt thing, but trying to apply a cookie-cutter solution to a problem it's not suited for. From the link:
Disruptor is an open source framework. The original intention of research and development is to solve the problem of queue locking under high concurrency.
Which is exactly the opposite of the audio case: low concurrency, low contention, a queue that is empty most of the time, very sensitive to spikes, many different processes in a graph.

At 48 kHz with a 256-sample buffer, a plugin gets/has to deliver around 188 blocks per second, which is a period of about 5.3 ms; ages in CPU time.
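
For anyone who wants to check the arithmetic for other settings, a small sketch follows. The function names are hypothetical; the only inputs are the sample rate and buffer size quoted above.

```cpp
#include <cassert>
#include <cstdio>

// Number of audio blocks the host requests per second.
constexpr double blocksPerSecond(double sampleRateHz, int blockSizeSamples) {
    return sampleRateHz / blockSizeSamples;
}

// Time budget for one block, in milliseconds.
constexpr double blockPeriodMs(double sampleRateHz, int blockSizeSamples) {
    return 1000.0 * blockSizeSamples / sampleRateHz;
}

// Prints the figures for the 48 kHz / 256-sample case from the post.
void printBlockBudget() {
    std::printf("%.1f blocks/s, %.2f ms per block\n",
                blocksPerSecond(48000.0, 256), blockPeriodMs(48000.0, 256));
}
```

For 48 kHz and 256 samples this works out to 187.5 blocks per second and roughly 5.33 ms per block.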

From the article. Waiting strategies:
BlockingWaitStrategy, SleepingWaitStrategy, YieldingWaitStrategy, BusySpinWaitStrategy, PhasedBackoffWaitStrategy.
It says that for the lowest latency either YieldingWaitStrategy or BusySpinWaitStrategy should be used. So basically: yield the time slice if the queue is empty, or burn the CPU waiting until the thread is preempted. Having a mix with, let's say, 20 plugins, each with its own disruptor configured this way, would be insane; I don't think it even needs an explanation.
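
To illustrate what those two strategies amount to, here is a rough C++ sketch of waiting on a flag by spinning or by yielding. It is not the Disruptor's API; `WaitStrategy`, `waitUntilReady` and `runWaitDemo` are hypothetical names chosen for this example.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

enum class WaitStrategy { BusySpin, Yielding };

// Polls `ready` until it becomes true and returns how many times the loop
// found it still false. Both strategies keep the waiting thread runnable
// for the whole wait; Yielding merely offers the core to other threads
// between polls.
std::uint64_t waitUntilReady(const std::atomic<bool>& ready, WaitStrategy strategy) {
    std::uint64_t polls = 0;
    while (!ready.load(std::memory_order_acquire)) {
        ++polls;
        if (strategy == WaitStrategy::Yielding)
            std::this_thread::yield();
    }
    return polls;
}

// Hypothetical demo: a second thread sets the flag after about 2 ms while
// the caller waits with the chosen strategy. Returns the poll count.
std::uint64_t runWaitDemo(WaitStrategy strategy) {
    std::atomic<bool> ready{false};
    std::thread producer([&ready] {
        std::this_thread::sleep_for(std::chrono::milliseconds(2));
        ready.store(true, std::memory_order_release);
    });
    const std::uint64_t polls = waitUntilReady(ready, strategy);
    producer.join();
    return polls;
}
```

The poll count is only a crude indicator of how much work the waiter did for nothing; multiply that by a pool of workers in each of 20 plugins and the objection above becomes clear.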

The sanest way to use it for audio would be "BlockingWaitStrategy", but at that point why bother?
BlockingWaitStrategy
The default policy for Disruptor is BlockingWaitStrategy. Within BlockingWaitStrategy, locks and conditions are used to control the wake-up of threads. BlockingWaitStrategy is the most inefficient strategy, but it consumes the least CPU and provides more consistent performance in a variety of deployment environments.
PS: The lmax disruptor must be nearing 10 years old now. Far from new stuff. I remember seeing it long long ago on Dmitry Vyukov's site.
Last edited by rafa1981 on Sat Nov 13, 2021 1:09 am, edited 2 times in total.

mystran
KVRAF
6733 posts since 12 Feb, 2006 from Helsinki, Finland

Post Fri Nov 12, 2021 11:04 am

rafa1981 wrote:
Fri Nov 12, 2021 10:21 am
mystran wrote:
Fri Nov 12, 2021 8:17 am
and if someone is thinking about busylooping without sleep they should immediately stop programming, for ever.
Just nitpicking because I saw the word "never". I'm sure we agree on everything :)

There are people in HFT (high frequency trading) who dedicate cores to a single purpose and busy-loop on them to win some nanoseconds. Those systems are single-purpose of course.
Should people doing HFT be allowed to program in the first place?

On a more serious note though, that strategy works if the hardware is dedicated for the purpose (and you don't care about energy usage). It also works in embedded systems, especially with microcontrollers that don't eat that much power either way. When your thread is competing for resources with other threads, wasting them on busy looping is always counterproductive though, because any half-way decent OS scheduler is just going to punish you for it (ie. if not explicitly, then at least implicitly by counting that busy-looping towards your thread's fair timeshare).
The mutex approach is better in the low-contention, queue-often-empty, latency-sensitive case and the lockfree one in the high-contention, queue-seldom-empty, throughput-oriented case. It depends on what is being optimized. Both approaches have their place as long as one knows what one is doing.

For the audio case yes, no lockfree stuff mostly. The queues are empty very often (DAW doing something else), busy looping makes no sense (shared resources) and sleeping causes latency spikes.
Exactly.

edit: Also I want to emphasize that I'm by no means trying to talk against lock-free (or wait-free) queues as such. They are truly great when you're not interested in waiting, with the most obvious example being communication between UI and DSP threads (where a wait-free queue also saves you from having to worry about priority inversions).... but the threads of a typical DSP threadpool spend much of their time waiting for something to do, one way or another and if you're going to wait anyway, then waiting on a sync object is by far the best option.
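
As a rough illustration of the UI-to-DSP case, here is a minimal single-producer/single-consumer ring buffer. It is a sketch under assumptions rather than anyone's production code: `SpscQueue` is a hypothetical name, exactly one thread may push and one thread may pop, and one slot is left unused so that "full" and "empty" can be told apart.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>

// Single-producer/single-consumer ring buffer. push() and pop() never block
// or retry: each call either succeeds or reports full/empty immediately.
// Holds at most Capacity - 1 items.
template <typename T, std::size_t Capacity>
class SpscQueue {
public:
    // Call from the producer thread only (e.g. the UI). Returns false if full.
    bool push(const T& value) {
        const std::size_t head = head_.load(std::memory_order_relaxed);
        const std::size_t next = (head + 1) % Capacity;
        if (next == tail_.load(std::memory_order_acquire)) return false;
        slots_[head] = value;
        head_.store(next, std::memory_order_release);  // publish the new item
        return true;
    }

    // Call from the consumer thread only (e.g. the DSP). Returns false if empty.
    bool pop(T& out) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire)) return false;
        out = slots_[tail];
        tail_.store((tail + 1) % Capacity, std::memory_order_release);  // free the slot
        return true;
    }

private:
    std::array<T, Capacity> slots_{};
    std::atomic<std::size_t> head_{0};  // next slot to write
    std::atomic<std::size_t> tail_{0};  // next slot to read
};
```

The DSP side would typically call `pop()` in a loop at the top of each process call and carry on as soon as it returns false; neither side ever takes a lock, so there is nothing to invert priorities on.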
Preferred pronouns would be "it/it" because according to this country, I'm a piece of human trash.

rafa1981
KVRian
669 posts since 4 Jan, 2007

Post Fri Nov 12, 2021 11:36 am

mystran wrote:
Fri Nov 12, 2021 11:04 am
Should people doing HFT be allowed to program in the first place?
:D very good point.

mystran
KVRAF
6733 posts since 12 Feb, 2006 from Helsinki, Finland

Post Fri Nov 12, 2021 11:54 am

rafa1981 wrote:
Fri Nov 12, 2021 10:54 am
] Peter:H [ wrote:
Fri Nov 12, 2021 10:04 am
I'm afraid you are not 100% up to date. We as purple belt threaders are a little behind what the black belt guys are doing. Have you studied the LMAX Disruptor? None of those guys talk about busy looping... google it ... here's one article https://www.fatalerrors.org/a/high-perf ... trong.html and here is the GitHub page from the creators https://lmax-exchange.github.io/disruptor/ Actually there are even successors already (i.e. more performant) to Disruptor afaik. So I would say it's good to at least peek into the black belt dojo before claiming there are no black belts.
In this case it's not a belt thing, but trying to apply a cookie-cutter solution to a problem it's not suited for. From the link:
Disruptor is an open source framework. The original intention of research and development is to solve the problem of queue locking under high concurrency.
That stuff is so silly anyway. If you're serious about low-latency, then why on earth would you pick Java in the first place? If you look at the tech paper, it's clear that much of what they do is to try to work around the problems that make Java generally unsuitable for anything where you care about latency. It's like trying to enter the global shipping business with a fleet of wheelbarrows... beats me.

rafa1981
KVRian
669 posts since 4 Jan, 2007

Post Sat Nov 13, 2021 1:42 am

From a quick search, it seems that there are C++ ports. So I guess PeterH may mean one of those given the context.

About the Java version, intuitively it seems that using a garbage-collected language is going to be a drag. Maybe there are some advantages, I don't know the domain to be honest. My first impression is the same as yours.

I vaguely remember reading that a GC enables some non-blocking algorithms but I don't remember the details and I have no practical experience doing that.

] Peter:H [
KVRian
1453 posts since 22 Sep, 2016

Post Sat Nov 13, 2021 1:54 am

mystran wrote:
Fri Nov 12, 2021 11:54 am
That stuff is so silly anyway. If you're serious about low-latency, then why on earth would you pick Java in the first place? If you look at the tech paper, it's clear that much of what they do is to try to work around the problems that make Java generally unsuitable for anything where you care about latency. It's like trying to enter the global shipping business with a fleet of wheelbarrows... beats me.
Omg ... are you able to get the difference between a Pattern and an Implementation? Disruptor is a Pattern. A Pattern is a Blueprint that comes with a Problem, Description, a Context, Liabilities, Description of Components/Roles and their Interactions. An Implementation of a Pattern is the concrete mapping of a Pattern to a Language/Framework.
Disruptor = Pattern, LMAX Disruptor = Reference Impl of Pattern in Java. LMAX is making money with this, you know, they are not going to start crying now that you tell them it's "silly"... *rofl*. Additionally your "comments" about Java tell me that there's no basis for "discussing" this any further. It kind of speaks for itself ... If you like - go and discuss with Martin Fowler and tell him that all this is/was stupid. I actually cannot take this any longer: https://martinfowler.com/articles/lmax.html
Simply go on and stick to whatever you think is best for you. I will keep my eyes open and be curious to learn stuff before I say it's silly.
--
Anyway, I mentioned Disruptor only because it is beyond the typical Queue/Worker stuff. Back to the discussion about Threadpools and Clap interface.
I think the approach is
1.) DAW dispatches to plugin
2.) Plugin re-dispatches "units of work" to DAW thread pool
3.) Urs wrote that he uses some sort of heuristics, such as always doing 2 more voices in the main thread than in the thread pool... Which seems like the common "This mt code perfectly worked in my lab" trap.

Is this really a good way to go? Or am I simply missing the point?
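
Steps 1 and 2 above can be sketched roughly as follows. `MockHostPool` and `MockPlugin` are hypothetical stand-ins invented for this example; they only imitate the general request/execute shape and are not the real CLAP API. The voice-splitting heuristic from step 3 is deliberately left out, since its details are not given here.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a host-owned pool. requestExec() runs
// task(0) .. task(numTasks - 1) and returns only after all have finished.
// It returns false without running anything if the host declines.
struct MockHostPool {
    bool available = true;

    bool requestExec(std::uint32_t numTasks,
                     const std::function<void(std::uint32_t)>& task) const {
        if (!available) return false;
        std::vector<std::thread> threads;  // a real host would reuse its workers
        for (std::uint32_t i = 0; i < numTasks; ++i)
            threads.emplace_back([&task, i] { task(i); });
        for (auto& t : threads) t.join();
        return true;
    }
};

// Hypothetical plugin: each voice renders into its own output slot, so the
// tasks never write to the same element.
struct MockPlugin {
    std::vector<float> voiceOutput;

    explicit MockPlugin(std::uint32_t voices) : voiceOutput(voices, 0.0f) {}

    void renderVoice(std::uint32_t index) {
        voiceOutput[index] = 1.0f + static_cast<float>(index);  // placeholder "DSP"
    }

    // Step 1: the host calls process(). Step 2: the plugin hands its voices
    // back to the host pool as units of work. If the host declines, all
    // voices are rendered on the calling thread instead.
    // Returns true when the pool was used.
    bool process(const MockHostPool& host) {
        const auto count = static_cast<std::uint32_t>(voiceOutput.size());
        const bool pooled =
            host.requestExec(count, [this](std::uint32_t i) { renderVoice(i); });
        if (!pooled)
            for (std::uint32_t i = 0; i < count; ++i) renderVoice(i);
        return pooled;
    }
};
```

The point of the fallback branch is that the plugin produces the same output either way; the host alone decides whether the work is spread across cores.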

Urs
u-he
26183 posts since 8 Aug, 2002 from Berlin

Post Sat Nov 13, 2021 2:09 am

] Peter:H [ wrote:
Sat Nov 13, 2021 1:54 am
Urs wrote that he uses some sort of heuristics, such as always doing 2 more voices in the main thread than in the thread pool... Which seems like the common "This mt code perfectly worked in my lab" trap.
Why would scheduling "as few voices as possible, but as many as necessary" be a bad thing?

FYI, CLAP does not impose any such pattern on the plug-in developer. When I speak about "what we do" in the context of multithreading, I commonly refer to our past ten years of AU/VST development.

mystran
KVRAF
6733 posts since 12 Feb, 2006 from Helsinki, Finland

Post Sat Nov 13, 2021 2:28 am

] Peter:H [ wrote:
Sat Nov 13, 2021 1:54 am
mystran wrote:
Fri Nov 12, 2021 11:54 am
That stuff is so silly anyway. If you're serious about low-latency, then why on earth would you pick Java in the first place? If you look at the tech paper, it's clear that much of what they do is to try to work around the problems that make Java generally unsuitable for anything where you care about latency. It's like trying to enter the global shipping business with a fleet of wheelbarrows... beats me.
Omg ... are you able to get the difference between a Pattern and an Implementation?
Sure.. and I'm not actually trying to ridicule the pattern, as much of what they say makes a whole lot of sense (and much of it is well known in other programming subfields... for example, the gamedev folks tend to spend quite a bit of time figuring out how to do stuff efficiently), I just find it silly that someone tries to create an implementation for a platform that's pretty much designed to be as hostile as possible towards anything involving low-latency.

What I do not understand though is why are we even discussing this criminal activity. All HFT contributes to the society is wasting tons of energy in the middle of a global warming crisis in order to more effectively destabilize the economy. The fact that it's actually legal is for all intents and purposes a crime against humanity. Does it make a lot of money to the people doing it? Sure, but so does cocaine trading.

However, the main technical point I was trying to make earlier is that this kind of strategy only works if you can dedicate resources to it. That's not a problem for the HFT folks, but it's a problem for a plugin dev, so the whole discussion is almost completely irrelevant from the point of view of plugin development. If you try to use aggressive busy-looping in a shared-resource environment (ie. your thread is competing for CPU resources with other threads, while at the same time your thread spends most of its time effectively idle just waiting for something to do) the OS scheduler will just punish you for it.

Urs
u-he
26183 posts since 8 Aug, 2002 from Berlin

Post Mon Dec 13, 2021 1:55 am

A quick update on our progress:

Comparing plug-ins doing their own threading vs. plug-ins using a host-controlled thread pool shows, in the examples we tried, clear advantages for the thread pool. The figures range from 10% more performance to a magnitude of that. But they also show that no matter what, if multithreading can be avoided, it should be.

In short, CPU usage is very homogeneous when using the thread pool, which allows pushing the CPU cores to much higher load before audio crackles and dropouts occur than with individual threading.

This relates to a very small sample batch of hosts, plug-ins and CPUs though. I'm curious to see how this pans out once it's widely available.

rafa1981
KVRian
669 posts since 4 Jan, 2007

Post Mon Dec 13, 2021 10:48 am

Very nice work!

otristan
KVRAF
2223 posts since 28 Mar, 2005

Post Tue Dec 14, 2021 6:27 am

Do you guys use the pthread API to create those worker threads on OSX or use custom OSX stuff?

Thanks !
Olivier Tristan
Developer - UVI Team
http://www.uvi.net

Urs
u-he
26183 posts since 8 Aug, 2002 from Berlin

Post Tue Dec 14, 2021 6:54 am

otristan wrote:
Tue Dec 14, 2021 6:27 am
Do you guys use the pthread API to create those worker threads on OSX or use custom OSX stuff?

Thanks !
For the thread pool? - I have no idea, this comes from the host developers we work with.

In our plug-ins we use boost::thread afaik.

mystran
KVRAF
6733 posts since 12 Feb, 2006 from Helsinki, Finland

Post Tue Dec 14, 2021 7:29 am

Shouldn't matter a whole lot how a thread is created, it's "just a thread" either way (and almost any method will give you access to the native handle you'll need to set an RT policy too).
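
A short sketch of the "native handle" route, assuming a POSIX platform where `std::thread::native_handle()` yields a `pthread_t`. The function names are hypothetical, SCHED_FIFO at minimum priority is just an example policy, and the call commonly fails with EPERM when the process lacks the privilege. macOS audio code commonly uses the Mach time-constraint policy instead, which this sketch does not show.

```cpp
#include <cassert>
#include <future>
#include <pthread.h>
#include <sched.h>
#include <thread>

// Asks the OS for SCHED_FIFO on an existing std::thread via its native handle.
// Returns 0 on success or an errno-style code (often EPERM without privileges).
// Assumption: POSIX, where native_handle() is a pthread_t.
int requestFifoPolicy(std::thread& thread) {
    sched_param param{};
    param.sched_priority = sched_get_priority_min(SCHED_FIFO);
    return pthread_setschedparam(thread.native_handle(), SCHED_FIFO, &param);
}

// Hypothetical demo: starts a worker that blocks until released, tries to
// change its policy while it is alive, then lets it finish.
int runPolicyDemo() {
    std::promise<void> release;
    std::future<void> gate = release.get_future();
    std::thread worker([&gate] { gate.wait(); });  // blocked, not spinning

    const int result = requestFifoPolicy(worker);

    release.set_value();
    worker.join();
    return result;
}
```

The same pattern applies to a thread created through boost::thread or raw pthreads: whichever wrapper is used, the policy is set on the underlying native thread.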
