Reaktor can't handle my big projects anymore (no multicore). Good next option? Flowstone? Synthedit?

DSP, Plugin and Host development discussion.
Post

mystran wrote:Urgh.. can you maybe explain something about this "host forcing" thingie?

I mean... if I create a thread pool (ie. a bunch of threads and a queue) in the plugin (eg. either when the dynamic library is loaded or first time a plugin is instantiated.. obviously one shouldn't create threads in processing methods) and put some jobs in the queue for them to process, how is the host even supposed to know which threads are processing which plugin when the actual host thread just sits on a generic semaphore?
Your thread pool is created with the affinity mask of the calling thread; it is then your responsibility to pin its threads to cores inside that affinity mask. And you are screwed if the host defines an affinity mask for the thread used to create your plugin (which may not even be the audio thread; it could be the GUI thread, so all thread pools may end up sharing one core).
So once again, very _bad_ practice to use thread pools in a plugin. If you want to, ask Apple and Steinberg to add the proper API to their interfaces.

Post

Miles1981 wrote:
mystran wrote:Urgh.. can you maybe explain something about this "host forcing" thingie?

I mean... if I create a thread pool (ie. a bunch of threads and a queue) in the plugin (eg. either when the dynamic library is loaded or first time a plugin is instantiated.. obviously one shouldn't create threads in processing methods) and put some jobs in the queue for them to process, how is the host even supposed to know which threads are processing which plugin when the actual host thread just sits on a generic semaphore?
Your thread pool is created with the affinity mask of the calling thread; it is then your responsibility to pin its threads to cores inside that affinity mask.
Ok.. so you're saying that a certain host exists that uses a thread with a reduced affinity mask to create plugin instances. In my book, that translates to "add code to thread creation to ensure that the CPU affinity mask is reset back to all CPUs".

I would actually really appreciate the specifics of "which hosts" do this, because I would like to test this.
And you are screwed if the host defines an affinity mask for the thread used to create your plugin (which may not even be the audio thread; it could be the GUI thread, so all thread pools may end up sharing one core).
I would like to argue that if a given host loads plugins from audio thread, it's certainly FUBAR (that's a technical term)... but that's sort of not important. What's important is that if you are correct and there are hosts that play retard and go pinning their CPUs, then apparently it's necessary for plugin code to sanity check and undo the damage. Doesn't seem like a big deal, as you already have to do a bunch of low-level system-specific crap to get your threads running on real-time priorities and all.
So once again, very _bad_ practice to use thread pools in a plugin. If you want to, ask Apple and Steinberg to add the proper API to their interfaces.
When your choices come down to hitting the audio deadline or not, the question of "good" and "bad" tends to become quite meaningless and the actually interesting question is "what do I have to do to make it work" instead. :)

Post

mystran wrote:When your choices come down to hitting the audio deadline or not, the question of "good" and "bad" tends to become quite meaningless and the actually interesting question is "what do I have to do to make it work" instead. :)
Yeah, if Diva and even ACE didn't have the multicore option, I simply wouldn't be able to use them on this shitty PC.
So I'm glad for the "bad practice".

Post

I would assume "fruit" referred to Apple (since it can't refer to "FL"; "FL" certainly doesn't do anything stupid with affinities).. but then we have:
OS X does not export interfaces that identify processors or control thread placement—explicit thread to processor binding is not supported. Instead, the kernel manages all thread placement. Applications expect that the scheduler will, under most circumstances, run its threads using a good processor placement with respect to cache affinity.

However, the application itself knows the detailed caching characteristics of its threads and its data—in particular, the organization of threads as disjoint sets characterized by their association with (affinity to) distinct shared data.

While threads within such a set exhibit affinity with each other via shared data, they share a disaffinity or negative affinity with respect to other sets. In other words, a set expresses an affinity with an L2 cache and the scheduler should seek to run threads in a set on processors sharing that L2 cache.
Unlike POSIX pthread affinity, which usually gets inherited, I don't see anything in the documentation suggesting that these affinity tags get inherited. The default affinity tag appears to be "null", which I suppose one could explicitly force on new threads too.. but like.. I'm seriously confused at this point.

Post

mystran wrote:Ok.. so you're saying that a certain host exists that uses a thread with a reduced affinity mask to create plugin instances. In my book, that translates to "add code to thread creation to ensure that the CPU affinity mask is reset back to all CPUs".
That's a big BAD thing to do. If you get an affinity mask, there is a good reason for it. Don't change it. You may not be authorized to use the full machine.
mystran wrote:I would like to argue that if a given host loads plugins from audio thread, it's certainly FUBAR (that's a technical term)... but that's sort of not important. What's important is that if you are correct and there are hosts that play retard and go pinning their CPUs, then apparently it's necessary for plugin code to sanity check and undo the damage. Doesn't seem like a big deal, as you already have to do a bunch of low-level system-specific crap to get your threads running on real-time priorities and all.
No, they are not playing retards; it's because you don't realize how important thread pinning, and even memory pinning, is for processing. If they do this, they may even provide you with a thread pinned to a core and memory local to that core, with the audio data local to that core. If you screw up by changing the pinning, then your code is FUBAR.
mystran wrote:When your choices come down to hitting the audio deadline or not, the question of "good" and "bad" tends to become quite meaningless and the actually interesting question is "what do I have to do to make it work" instead. :)
No, you have to play with the limits you are given, or you change them by proposing a new standard that extends the limits without FUBARing everyone else (and yourself in the process).

Post

Miles1981 wrote: No, they are not playing retards; it's because you don't realize how important thread pinning, and even memory pinning, is for processing.
Pinning threads to cores on a general purpose desktop system where thread migration is reasonably cheap and the schedulers already "soft pin" anyway is just plain retarded about 100% of the time... and really it appears that the macOS developers probably agree, since they don't even seem to provide an API for that... so like whatever..

Post

It is probably DAW-dependent, but I'm pretty sure there is no fixed thread on which each loaded plugin is processed. Instead, the DAW picks the first available thread from the pool to invoke plugin processing on (or track processing, or whatever the processing unit is in that DAW). At least the DAWs I checked use different threads to run processing of the same plugin. Otherwise, some audio processing threads would be overloaded while others idle most of the time, because different plugins, and different presets within those plugins, require different amounts of CPU time to generate their output.

Post

Currently I can get up to around 50 partials at 6-voice polyphony before my 4.4 GHz core maxes out. I would like upwards of 100 partials.
Dunno how your partials are implemented, but on my synth (in development, C++ WDL-OL), a voice with 64 partials eats up something like 1% of CPU time.

Post

Pure Data might be a good idea. It does things differently to Reaktor and some things are much easier - such as up and down sampling and pitch shifting.

Delta Sign wrote:
mikejm wrote:
Delta Sign wrote:Yeah, most U-He synths can distribute voices between cores, and I think there are a bunch of other synths as well. It's definitely possible.
This is probably the most realistic way to run my synths that would help with the current CPU issue, unless simply coding it in JUCE automatically makes it 40% more CPU efficient.
That's most likely going to happen. Reaktor is terribly CPU inefficient in my experience.

You'll still have to think about optimizing things, of course.
It does seem to be. However, efficiency improved between versions 5 and 6. That seems rare among software companies, so I have to applaud Native Instruments for being serious developers. There is also the option to run the GUI in a separate thread in Reaktor 6, which is nice. I don't know how much that improves efficiency as I've yet to test it properly.

If you can, use Primary elements as much as possible; they seem more efficient than anything you can do in Core. I don't see the point in rewriting Primary elements in Core unless you need to adapt them in some manner.

Post

mystran wrote: Pinning threads to cores on a general purpose desktop system where thread migration is reasonably cheap and the schedulers already "soft pin" anyway is just plain retarded about 100% of the time... and really it appears that the macOS developers probably agree, since they don't even seem to provide an API for that... so like whatever..
+1
When I tried affinity masks in a real application, I wasn't able to get any speed-up. Usually the OS did better.
Does anyone else have a different experience?

Post

Guillaume Piolat wrote:+1
When I tried affinity masks in a real application, I wasn't able to get any speed-up. Usually the OS did better.
Does anyone else have a different experience?
All the biggest scientific applications in the world.

Post

Miles1981 wrote:
Guillaume Piolat wrote:+1
When I tried affinity masks in a real application, I wasn't able to get any speed-up. Usually the OS did better.
Does anyone else have a different experience?
All the biggest scientific applications in the world.
It's important to keep in mind that if you're working on a NUMA system, the cost of thread migration can be a LOT more than the handful of cache misses you take on a typical desktop CPU. When you combine relatively cheap thread migrations with the fact that OS schedulers typically try to avoid moving threads across cores anyway, affinity masks become a lot less attractive. And when you additionally take into account that on a typical desktop system (let alone in a plugin situation) there's probably a whole bunch of other stuff running at the same time, the chance that a strict affinity policy ends up hurting more than it helps is actually pretty high.

As I pointed out above, at least the documentation I can find suggests that macOS just doesn't let you do any of this anyway. The affinity API they provide lets you group multiple threads into a single affinity group that should intentionally be scheduled on the same CPU (which is arguably useful if you're multi-threading for control-flow reasons and the threads access a lot of the same memory), but beyond that there doesn't seem to be anything you can do to force threads onto different CPUs... and frankly I think this is probably the best design you could hope for on a desktop system. A bunch of code I found online maps pthreads-style affinity onto this system, but really it's doing a completely different thing.

YMMV.

Post

mikejm wrote:[...]

However, I do a lot of modal synthesis. This is where every partial of a sound is created additively on an individual basis, generally each with a sine wave or resonant bandpass filter. This gets very CPU intensive. [...]
Do you already use the Sine Bank module in your ensembles? I think this will save quite a lot of CPU because the module is very likely heavily CPU optimized (SIMD, etc.).

Here's an introduction video for the Sine Bank:
https://www.youtube.com/watch?v=FDCuzRJkYj4
