Heterogeneous compute and plugins/DSP

DSP, Plugin and Host development discussion.

Post

Microsoft recently announced their own dev kit device built around a Qualcomm ARM SoC with an integrated NPU, formalising what everyone already expected: the "PC" side of the industry is heading in the direction Apple went with Macs. (Ironically, Linux and Windows were both ported to ARM years before the M1 move, but I have no doubt Apple marketing will have a great time with this announcement, especially given the shape of the dev box.)

I dabbled with GPGPU back before CUDA and OpenCL, when the only option that wasn't completely low level was BrookGPU, and it was painfully obvious that memory access, i.e. data bandwidth, was a huge limiting factor unless the data flow was unidirectional, i.e. unless the bulk of the data would eventually leave the GPU by way of the "RAMDAC", i.e. video out.

Then the AMD APU concept surfaced, but it amounted to little in GPGPU terms despite seemingly addressing the main issue of data bandwidth through integration: the software wasn't there and neither was the performance, i.e. they never leveraged the bandwidth opportunity.

Meanwhile, all the major phone SoC manufacturers have been steadily integrating GPUs, DSPs and, lately, NPUs (which are similar SIMD number crunchers) for years. In fact, the term "heterogeneous compute" has started to be thrown around for the concept of distributing tasks to the GPU or other compute units based on fitness (I believe it was first coined by Qualcomm or HiSilicon), with interconnect bandwidth on the order of lower-level cache and inter-core link bandwidth, or very close to it.

I am a bit out of the loop, but does this mean that, with desktop/laptop class computers now starting to come equipped with these "hetero-compute" SoCs as standard, we could finally have a viable future of low-latency DSP acceleration for general software, and thus for real-time audio (and presumably video) processing plugins?

Post

I wonder if Windows on ARM would have been more successful in the past if MS hadn't made the questionable decision to try to lock it down. It's clear they wanted to replicate the App Store / Play Store model, but that's not really the model that made Windows a success on x86, and I have a hard time believing it's going to be a success on ARM either.

Personally I don't really have any strong feelings in terms of CPUs, but... I have no interest in developing for any platform where distribution involves messing with vendor stores.

Post

One issue with using GPUs and NPUs is that dispatching work to them typically involves blocking operations, with relatively high overhead in setting up the task to be performed.
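
Back-of-envelope numbers make the problem concrete. The 0.5 ms round-trip figure below is an assumption for illustration only; real overheads vary wildly by API, driver and hardware:

```cpp
// How much of a real-time audio budget does GPU dispatch overhead eat?
#include <cstdio>

int main() {
    const double sampleRate = 48000.0;  // Hz
    const int    bufferSize = 64;       // samples per callback (low-latency host)
    const double budgetMs   = 1000.0 * bufferSize / sampleRate;  // ~1.33 ms

    // Assumed round-trip cost of launching a GPU task and reading the
    // result back (setup + transfer + sync). Illustrative, not measured.
    const double dispatchOverheadMs = 0.5;

    std::printf("budget per callback: %.2f ms\n", budgetMs);
    std::printf("overhead fraction  : %.0f%%\n",
                100.0 * dispatchOverheadMs / budgetMs);
    return 0;
}
```

At those (assumed) numbers, setup alone eats over a third of the deadline before any useful work happens.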
My audio programming blog: https://audiodev.blog

Post

kerfuffle wrote (Thu May 26, 2022 8:54 am): One issue with using GPUs and NPUs is that dispatching work to them typically involves blocking operations, with relatively high overhead in setting up the task to be performed.
Well... I think the bigger issue tends to be reading results back to the CPU, rather than the raw data bandwidth. Even if you're reading a single value (e.g. one pixel), if you do it synchronously (as BrookGPU certainly was forced to do with the old 3D APIs) then your performance tends to go down the drain: the CPU needs to wait for the GPU pipeline to finish, and the GPU might further have to wait for the CPU to issue more work... it's a mess. As long as you're just sending commands to the GPU, it's all queued and can be done "whenever", but once you read back you're stuck waiting.
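
To illustrate, here's a minimal CUDA sketch (the processBlock kernel is a made-up stand-in for real work; error handling omitted):

```cpp
#include <cuda_runtime.h>

// Stand-in for real DSP/compute work.
__global__ void processBlock(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 0.5f;
}

void synchronousReadback(float* dev, float* host, int n) {
    processBlock<<<(n + 255) / 256, 256>>>(dev, n);
    // Plain cudaMemcpy blocks: the CPU sits here until the GPU has
    // finished everything queued before it, then copies. This is the
    // "read one value back and stall the whole pipeline" case.
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);
}

void queuedReadback(float* dev, float* pinnedHost, int n,
                    cudaStream_t stream, cudaEvent_t done) {
    processBlock<<<(n + 255) / 256, 256, 0, stream>>>(dev, n);
    // Queue the copy too (needs page-locked host memory), record an
    // event, and return without waiting. Poll for completion later,
    // e.g. cudaEventQuery(done) == cudaSuccess.
    cudaMemcpyAsync(pinnedHost, dev, n * sizeof(float),
                    cudaMemcpyDeviceToHost, stream);
    cudaEventRecord(done, stream);
}
```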

You can actually hit this "pipeline" problem on the GPU sometimes even with rendering, even if nothing is ever read back to the CPU. Often the only thing you really need to do is introduce a suitable frame-to-frame dependency (i.e. you can't start rendering the next frame before the results from the previous one are ready) and suddenly what used to run at 500fps might have trouble hitting 60fps. Then you add a bunch of completely unrelated computation at the same time and there might not be any difference in performance whatsoever, because the performance problem was never about running out of computational resources, but rather about having most of the device sitting idle while waiting for that one specific computation to finish.
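
The usual mitigation is to keep more than one block of work in flight, so nothing ever waits on work submitted in the same frame. A sketch of the idea, with illustrative names and setup omitted:

```cpp
#include <cuda_runtime.h>

__global__ void processBlock(float* data, int n);  // as in the sketch above

// Two slots alternate: while the GPU chews on block k, the CPU harvests
// block k-1, so there is no same-frame dependency to stall on.
struct Slot {
    float*       dev;
    float*       pinnedHost;  // page-locked, for async copies
    cudaStream_t stream;
    cudaEvent_t  done;
};

void onAudioBlock(Slot slot[2], int k, int n) {
    Slot& cur  = slot[k % 2];        // submit into this slot
    Slot& prev = slot[(k + 1) % 2];  // harvest the older slot

    processBlock<<<(n + 255) / 256, 256, 0, cur.stream>>>(cur.dev, n);
    cudaMemcpyAsync(cur.pinnedHost, cur.dev, n * sizeof(float),
                    cudaMemcpyDeviceToHost, cur.stream);
    cudaEventRecord(cur.done, cur.stream);

    // Only wait for work submitted one whole block ago; it has had an
    // entire buffer period to finish, so this rarely blocks.
    cudaEventSynchronize(prev.done);
    // prev.pinnedHost now holds block k-1's output (one block of latency).
}
```

The price is one block of added latency, which is exactly the trade-off that makes this pattern awkward for low-latency audio.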

GPU programming is really fun when you're doing something that maps well to a GPU as you can do insane amounts of computations without really even thinking about it... but then when you want to try and do something that doesn't map well to the massively parallel latency-hiding paradigm it's just horrible. YMMV.

Post

You don't gain quite as much going from SIMD CPU to GPU for audio as for graphics workloads, because audio doesn't involve bilinear interpolation (the one thing that massively blows up CPU pipelines but that GPUs handle in dedicated hardware).
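
For a feel of why, here's what a single bilinear tap costs in scalar CPU code (a generic textbook implementation, not from any particular codebase); GPU texture units do all of this in fixed-function hardware (e.g. tex2D in CUDA):

```cpp
#include <cmath>

// Assumes 0 <= x <= w-1 and 0 <= y <= h-1.
float bilinear(const float* img, int w, int h, float x, float y) {
    int   x0 = (int)std::floor(x), y0 = (int)std::floor(y);
    int   x1 = x0 + 1 < w ? x0 + 1 : x0;
    int   y1 = y0 + 1 < h ? y0 + 1 : y0;
    float fx = x - x0, fy = y - y0;

    // Four data-dependent loads per sample: hard to vectorise, and
    // cache-unfriendly when (x, y) jumps around.
    float a = img[y0 * w + x0], b = img[y0 * w + x1];
    float c = img[y1 * w + x0], d = img[y1 * w + x1];

    return (a * (1 - fx) + b * fx) * (1 - fy)
         + (c * (1 - fx) + d * fx) * fy;
}
```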

As for DSPs, AFAIK the main reason mobile devices have them is to improve battery life when the device is used as an MP3 player or a telephone (since the CPU can power down), not because the CPU can't handle DSP processing just fine.

Post

There’s some good discussion here from an Apple engineer about why bursts of work with small gaps between them (which is, of course, how audio tasks will appear to a GPU) can map quite poorly onto the GPU.

https://developer.apple.com/forums/thread/46817

It’s great that we’re now getting architectures where transferring memory back and forth between GPU and CPU isn’t so much of a problem, but audio is still not a throughput-heavy computing task (which is what GPUs need in order to excel), and that remains the case for most scenarios.

There is some interesting work being done by the GPUAudio team on batching work together, among other new techniques; it’ll be interesting to see how they get on.
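
I'd guess the core of the batching idea looks something like the sketch below (purely speculative on my part; the names and structure are made up, not how their product actually works): gather blocks from many plugin instances and pay the kernel-launch overhead once for all of them.

```cpp
#include <cuda_runtime.h>

// One thread block per plugin instance/channel; threads stride the samples.
// The per-instance gain is a stand-in for real per-instance DSP state.
__global__ void processBatch(float* blocks, int samplesPerBlock,
                             const float* gains) {
    float* block = blocks + blockIdx.x * samplesPerBlock;
    float  g     = gains[blockIdx.x];
    for (int i = threadIdx.x; i < samplesPerBlock; i += blockDim.x)
        block[i] *= g;  // stand-in for real DSP work
}

void processAllInstances(float* devBlocks, const float* devGains,
                         int numInstances, int samplesPerBlock,
                         cudaStream_t stream) {
    // One launch for the whole batch instead of numInstances launches,
    // so the fixed dispatch cost is amortised across every instance.
    processBatch<<<numInstances, 128, 0, stream>>>(devBlocks,
                                                   samplesPerBlock, devGains);
}
```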
