Using Multiple Threads In Audio Thread

DSP, Plugin and Host development discussion.

Post

mystran wrote: Mon Nov 08, 2021 12:44 pm Oh.. I've always let the main thread do nothing except block on the semaphore from the moment it dispatches the worker threads to the moment the workers are all finished. I have no idea what Blue Cat is talking about with the whole "need to use main thread" thing. Maybe I'm missing something.
It depends; if you've found some magic way to start the worker threads instantly, then any method is fine :) I'm just not aware of such a trick on either Windows or OS X. As I see it, we cannot rely on any thread to do anything at all within a given time interval (except for the main thread).

Richard
Synapse Audio Software - www.synapse-audio.com

Post

Richard_Synapse wrote: Mon Nov 08, 2021 6:47 pm
mystran wrote: Mon Nov 08, 2021 12:44 pm Oh.. I've always let the main thread do nothing except block on the semaphore from the moment it dispatches the worker threads to the moment the workers are all finished. I have no idea what Blue Cat is talking about with the whole "need to use main thread" thing. Maybe I'm missing something.
It depends; if you've found some magic way to start the worker threads instantly, then any method is fine :) I'm just not aware of such a trick on either Windows or OS X. As I see it, we cannot rely on any thread to do anything at all within a given time interval (except for the main thread).
You can't actually rely on the main thread either (you need a real RTOS before you can guarantee anything), but I don't see why a magic trick would be needed, especially on Windows, which is more or less a pure priority scheduler. When you block the main thread, you've got a free core and it'll pick the highest-priority runnable thread, which should usually be your worker that you just made runnable by giving it work to do... or perhaps it's a worker in some other plugin that's working towards the same deadline, so it's not THAT important which one runs first.

Post

mystran wrote: Mon Nov 08, 2021 7:45 pm You can't actually rely on the main thread either (you need a real RTOS before you can guarantee anything), but I don't see why a magic trick would be needed, especially on Windows, which is more or less a pure priority scheduler. When you block the main thread, you've got a free core and it'll pick the highest-priority runnable thread, which should usually be your worker that you just made runnable by giving it work to do... or perhaps it's a worker in some other plugin that's working towards the same deadline, so it's not THAT important which one runs first.
So let's take a simple example. Buffer size is 2ms, and we run into some kind of worst-case scenario, where waking up the worker threads takes 1ms. So after 1ms,

- with the main thread alive we have fully rendered the audio already, worker threads can exit immediately
- with the main thread asleep, we haven't even started rendering anything at all

Quite a difference, unless I'm missing something :)

Richard
P.S. So do you use WaitForSingleObject() under Windows to block the main thread, or some other method?
Synapse Audio Software - www.synapse-audio.com

Post

Richard_Synapse wrote: Tue Nov 09, 2021 7:13 am P.S. So do you use WaitForSingleObject() under Windows to block the main thread, or some other method?
WaitForSingleObject on a semaphore, typically. Both ways (i.e. workers wait on a semaphore guarding a task queue, and the main thread waits on a semaphore once the tasks are dispatched).

Don't get me wrong though, I'm just trying to understand what you are talking about, 'cos I just have not observed any delays with regards to wakeups, certainly not on Windows.
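
For what it's worth, here's a minimal sketch of that arrangement on Windows. The names, the fixed-size pool and the simple task-counting scheme are just for illustration, not exactly how any real plugin does it: the workers block on a semaphore guarding a shared task list, and the audio thread releases that semaphore once per task, then blocks on a second semaphore until the last worker signals completion.

```cpp
#include <windows.h>
#include <atomic>
#include <functional>
#include <vector>

// Sketch of a fixed-size worker pool: workers block on g_workSem, the audio
// thread blocks on g_doneSem after dispatching. Error handling omitted.
static HANDLE g_workSem;                           // counts queued tasks
static HANDLE g_doneSem;                           // signalled by the last worker to finish
static std::vector<std::function<void()>> g_tasks; // filled by the audio thread per block
static std::atomic<int> g_nextTask{0};
static std::atomic<int> g_remaining{0};

static DWORD WINAPI workerMain(LPVOID)
{
    for (;;)
    {
        WaitForSingleObject(g_workSem, INFINITE);    // sleep until a task exists
        const int idx = g_nextTask.fetch_add(1);
        g_tasks[idx]();                              // render this slice of audio
        if (g_remaining.fetch_sub(1) == 1)           // was this the last task?
            ReleaseSemaphore(g_doneSem, 1, nullptr); // wake the audio thread
    }
}

static void initPool(int numWorkers, LONG maxTasks)
{
    g_workSem = CreateSemaphoreW(nullptr, 0, maxTasks, nullptr);
    g_doneSem = CreateSemaphoreW(nullptr, 0, 1, nullptr);
    for (int i = 0; i < numWorkers; ++i)
        CreateThread(nullptr, 0, workerMain, nullptr, 0, nullptr); // bump priority in real code
}

// Called from the audio thread once per block.
static void runTasksAndWait(std::vector<std::function<void()>> tasks)
{
    g_tasks = std::move(tasks);
    g_nextTask.store(0);
    g_remaining.store(static_cast<int>(g_tasks.size()));
    ReleaseSemaphore(g_workSem, static_cast<LONG>(g_tasks.size()), nullptr); // make workers runnable
    WaitForSingleObject(g_doneSem, INFINITE); // block here until the last worker signals
}
```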

Post

We have also implemented things along the lines of Richard's concept: we always let the original realtime thread handle one or two more voices than the worker threads do, so that when the worker threads finish, the main thread is probably still busy.

Maybe I'm missing something or maybe we're talking about different ways of waking up worker threads?

Anyhow, the concept we're going for in CLAP takes that burden off us: the plug-in calls into the host, and the call returns only after the work is done.
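
Roughly, that looks like this from the plug-in side. This is a simplified sketch of the CLAP thread-pool extension as I read the headers; check the actual clap/ext/thread-pool.h for the exact struct and field names. The plug-in exposes an exec() callback, and from process() it asks the host to run N tasks; request_exec() returns only once every task has been executed.

```cpp
#include <clap/clap.h>  // assumes the thread-pool extension header is pulled in via clap.h

// Host worker threads (and possibly the audio thread itself) call this once
// per task index, in parallel. Each task must write only to its own buffers.
static void my_thread_pool_exec(const clap_plugin_t *plugin, uint32_t task_index)
{
    // render voice / partial number `task_index` into a private buffer
}

static const clap_plugin_thread_pool_t s_thread_pool_ext = { my_thread_pool_exec };

// Inside clap_plugin.process():
static clap_process_status my_process(const clap_plugin_t *plugin,
                                      const clap_process_t *process)
{
    const clap_host_t *host = /* pointer saved from the factory/init call */ nullptr;
    auto *pool = static_cast<const clap_host_thread_pool_t *>(
        host->get_extension(host, CLAP_EXT_THREAD_POOL));

    const uint32_t num_tasks = 8;  // e.g. one task per active voice
    if (!pool || !pool->request_exec(host, num_tasks)) {
        // host has no thread pool (or declined): fall back to serial rendering
        for (uint32_t i = 0; i < num_tasks; ++i)
            my_thread_pool_exec(plugin, i);
    }
    // once request_exec() has returned, every task has finished; mix and return
    return CLAP_PROCESS_CONTINUE;
}
```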

Post

So, is this going to be in final Bitwig 4.1 or some for-testing-only fork for now? ;)
Music tech enthusiast
DAW, VST & hardware hoarder
My "music": https://soundcloud.com/antic604

Post

antic604 wrote: Thu Nov 11, 2021 8:00 am So, is this going to be in final Bitwig 4.1 or some for-testing-only fork for now? ;)
I wouldn't know.

I can only speak for us, and we have added CLAP support to all our plug-ins and installers, so if nothing devastating happens in the next few days and weeks, it'll be available whenever we publish any betas or other new versions. The CLAP example host on GitHub also has an example implementation of the thread pool, but as it only hosts one plug-in at a time, it really is merely example code.

Post

I bet Mr Frankel will be on board if it's a free and technically sound format.

Post

Nice thread ...
I'm a developer with 20+ years of experience and quite some experience in parallel programming... mostly Java.

I think I would rethink whether "threads" are what you want. You want things to get done, and each thread comes with a cost, so you have a ratio: unit of work scheduled to a thread / cost.

And the cost is a sum of keeping a thread pool in a consistent state, thread context switches (i.e. kicking out all the registers of all the fancy SSE units, for instance) and stalling the caches. My gut feeling is that music software is rather a data-driven business, i.e. I have to optimize the throughput of data pipelines (a data pipeline being the path of a voice through all plugins). Therefore I would in any case avoid the cost of stalling caches, i.e. working first on voice 1, filter 1 (CPU 1), then kicking that off the CPU for voice 2, filter 3 (CPU 2), then again voice 1, amp 1 (CPU 5) and then voice 3, EQ 2 (CPU 6)... I would rather organize the pipeline for voice 1 as "one unit of work" and thus guarantee cache consistency...
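
In code terms that means cutting the work per voice rather than per processing stage, so each task walks one voice through its whole chain on a single core. A toy sketch (the voice structure and the placeholder DSP are purely illustrative):

```cpp
#include <vector>

// Hypothetical voice state: in a real synth this would hold oscillator,
// filter and envelope state for one voice.
struct Voice {
    float phase = 0.0f;
    float filterState = 0.0f;
    float env = 1.0f;
};

// One "unit of work": the whole chain for one voice, so the voice's state
// stays in a single core's cache for the entire block.
void renderVoiceBlock(Voice &v, float *out, int numFrames)
{
    for (int i = 0; i < numFrames; ++i) {
        float s = v.phase;                            // stage 1: oscillator (placeholder)
        v.phase += 0.01f;
        v.filterState += 0.5f * (s - v.filterState);  // stage 2: one-pole filter
        out[i] += v.filterState * v.env;              // stage 3: amp
    }
}

// Dispatch one task per voice, instead of one task per (voice, stage) pair:
//   for (auto &v : activeVoices)
//       pool.submit([&, n]{ renderVoiceBlock(v, bufferFor(v), n); });
```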

Having more threads than the CPU has cores - my gut feeling - only makes sense if you optimize for the "first-response reaction time of a single request" in I/O-bound systems (like web apps that put threads on wait for DB queries), at the cost of degrading your overall throughput...

Anyway... that said, I would never implement such stuff on my own, but rely on pros who implement those frameworks. And guess what, there are guys, for instance in the finance industry, who invented the Disruptor pattern: https://lmax-exchange.github.io/disruptor/. Freaky that they implemented stuff in Java to be lightning fast, for profit. But it is lock-free and caters for cache consistency... I'm not really into all of the magic they do, but I'd use solutions like that rather than getting my own thread pool scheduling and optimization going.
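
Not the Disruptor itself, but the kind of lock-free, cache-friendly structure it is built around is a pre-allocated ring buffer where producer and consumer each write only their own index. A minimal single-producer/single-consumer sketch in C++:

```cpp
#include <array>
#include <atomic>
#include <cstddef>

// Minimal single-producer / single-consumer lock-free ring buffer.
// Capacity must be a power of two; indices only ever grow.
template <typename T, std::size_t N>
class SpscRing {
    static_assert((N & (N - 1)) == 0, "N must be a power of two");
    std::array<T, N> buf_;
    alignas(64) std::atomic<std::size_t> head_{0};  // written by the producer only
    alignas(64) std::atomic<std::size_t> tail_{0};  // written by the consumer only
public:
    bool push(const T &v) {                          // call from the producer thread
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h - tail_.load(std::memory_order_acquire) == N)
            return false;                            // full
        buf_[h & (N - 1)] = v;
        head_.store(h + 1, std::memory_order_release);
        return true;
    }
    bool pop(T &v) {                                 // call from the consumer thread
        std::size_t t = tail_.load(std::memory_order_relaxed);
        if (head_.load(std::memory_order_acquire) == t)
            return false;                            // empty
        v = buf_[t & (N - 1)];
        tail_.store(t + 1, std::memory_order_release);
        return true;
    }
};
```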

Edit: Thread pools that pick "units of work" from a queue that clients push onto have another problem: they are mostly not fair. They don't have priorities like OS processes, where the OS scheduler guarantees with some fairness policy that a process will get CPU time. You can easily spam a thread pool with thousands of "units of work" in one go and thus block other clients from getting CPU time.

Post

I did not notice fibers mentioned here, so: would fibers be of use in these types of applications, or are they an abandoned technology, like what Raymond suggests in his blog entry?
A recent benchmark shows fibers can perform faster than threads, and it looks like the Boost library provides classes and functions for fiber operations.
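
A minimal Boost.Fiber example (assuming the Boost fiber library is available) looks something like this; fibers are cooperatively scheduled inside one thread, so a switch is just a user-space context swap rather than a trip through the kernel scheduler:

```cpp
#include <boost/fiber/all.hpp>
#include <iostream>

int main()
{
    // Two fibers sharing the calling thread; yield() hands control to the
    // other fiber without involving the OS scheduler.
    boost::fibers::fiber a([] {
        for (int i = 0; i < 3; ++i) {
            std::cout << "fiber A step " << i << '\n';
            boost::this_fiber::yield();
        }
    });
    boost::fibers::fiber b([] {
        for (int i = 0; i < 3; ++i) {
            std::cout << "fiber B step " << i << '\n';
            boost::this_fiber::yield();
        }
    });
    a.join();
    b.join();
    return 0;
}
```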

Post

] Peter:H [ wrote: Thu Nov 11, 2021 6:37 pm Having more threads than the CPU has cores - my gut feeling - only makes sense if you optimize for the "first-response reaction time of a single request" in I/O-bound systems (like web apps that put threads on wait for DB queries), at the cost of degrading your overall throughput...
It's not entirely clear whether it is profitable to have more than one thread per core even in this situation. I'd imagine that the more common reason this is done is because either (1) not all your IO needs support async (so you're stuck blocking your threads) or (2) you're writing code in a language without decent support for something like coroutines and you don't want to deal with the (arguably rather painful) "inversion of control" manually.

edit: I suppose if you're also doing long running computations, then it might make sense to have at most 2 threads per core (one lower priority "computation thread" and one higher priority thread that juggles the IO), but I don't think that actually blocking for IO is ever ideal if the async and language support is good enough.

Post

mystran wrote: Thu Nov 11, 2021 7:24 pm
] Peter:H [ wrote: Thu Nov 11, 2021 6:37 pm Having more threads than the CPU has cores - my gut feeling - only makes sense if you optimize for the "first-response reaction time of a single request" in I/O-bound systems (like web apps that put threads on wait for DB queries), at the cost of degrading your overall throughput...
It's not entirely clear whether it is profitable to have more than one thread per core even in this situation. I'd imagine that the more common reason this is done is because either (1) not all your IO needs support async (so you're stuck blocking your threads) or (2) you're writing code in a language without decent support for something like coroutines and you don't want to deal with the (arguably rather painful) "inversion of control" manually.

edit: I suppose if you're also doing long running computations, then it might make sense to have at most 2 threads per core (one lower priority "computation thread" and one higher priority thread that juggles the IO), but I don't think that actually blocking for IO is ever ideal if the async and language support is good enough.
You're completely right. That is what node.js is doing. It actually handles "everything" in one thread by using OS services like POSIX select() (waiting for signals on handles, like TCP/IP accept/connect or data available) and async I/O, in the Windows case WriteFileEx/ReadFileEx and async completion queues... Work in the application is only done to "wire up" I/O operations that are then handled by the OS: like you ReadFileEx a file from the disk and WriteFileEx it to a client that requested it over HTTP, i.e. upon completion of the ReadFileEx you can pump the buffer into the socket with WriteFileEx and start the next ReadFileEx. But I think in audio software it's a bit different.
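
A bare-bones sketch of that chaining with Win32 alertable I/O (hypothetical handles, no error handling): the ReadFileEx completion routine issues the WriteFileEx, and the single thread just sits in an alertable wait.

```cpp
#include <windows.h>

// Hypothetical handles: a file and a connected client, both opened/created
// with FILE_FLAG_OVERLAPPED so the Ex calls can be used.
static HANDLE     g_file;
static HANDLE     g_client;
static char       g_buf[64 * 1024];
static OVERLAPPED g_ov = {};

static VOID CALLBACK onReadDone(DWORD err, DWORD bytes, LPOVERLAPPED ov);

static VOID CALLBACK onWriteDone(DWORD err, DWORD bytes, LPOVERLAPPED ov)
{
    // previous chunk sent to the client: read the next chunk from the file
    ReadFileEx(g_file, g_buf, sizeof(g_buf), &g_ov, onReadDone);
}

static VOID CALLBACK onReadDone(DWORD err, DWORD bytes, LPOVERLAPPED ov)
{
    if (err != 0 || bytes == 0)
        return;                               // error or end of file: stop the chain
    g_ov.Offset += bytes;                     // the next read continues after this chunk
    WriteFileEx(g_client, g_buf, bytes, &g_ov, onWriteDone); // pump chunk to the client
}

int main()
{
    ReadFileEx(g_file, g_buf, sizeof(g_buf), &g_ov, onReadDone); // start the chain
    for (;;)
        SleepEx(INFINITE, TRUE);  // alertable wait: completion routines run here
}
```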

Post

Some of this stuff is a bit above me, but I understand the majority of it. How would you create a lock-free thread pool? What class/library would you use to achieve this?

Post

SNFK wrote: Fri Nov 12, 2021 2:57 pm ...
How would you create a lock-free thread pool?
...
If you google it you'll find many examples. An example.

Post

juha_p wrote: Fri Nov 12, 2021 3:16 pm
SNFK wrote: Fri Nov 12, 2021 2:57 pm ...
How would you create a lock-free thread pool?
...
If you google it you'll find many examples. An example.
Please don't do this kind of nonsense. Having your workers properly wait (=block) on a semaphore (=lock) is infinitely better than wasting CPU by having your threads poll for work... and it's also better for latency, 'cos you don't need to sleep() in order to avoid burning 100% CPU, and you'll get woken up as soon as there's some work, rather than the next time your sleep() expires.

edit: Also, one should never do a busy-loop with a short sleep() on macOS, because it simply doesn't work: the OS scheduler will (at least sometimes) detect such behaviour, spit out "thread XYZ is waking up too often" into the console, and then throttle your thread by only waking it up once in a while. This can happen with sleeps on the order of 1ms (which is already terrible for latency), and it doesn't care what sort of realtime priority you've tried to give your thread. It's not really a profitable strategy on Windows either, but on macOS it's not even going to work properly... and if someone is thinking about busy-looping without a sleep, they should immediately stop programming, forever.
