KVR Audio

monsterbeetle · Post by **monsterbeetle** » Sat Mar 25, 2017 7:05 pm

[quote] Does this 'zero copy' feature also mean 'near zero latency'? Is processing latency affected by the fact that the operating system or some other app may also be using the same GPU? It's a shared resource, after all. I wonder how fast/frequently that resource may change hands. [/quote]

Zero copy is the main contributor to reduced latency on this board. On a standard PC, transfers between CPU and GPU have a lot of bandwidth but high latency as well, and you can only alleviate this by processing data in large batches (so you'll have latency anyway). Doing lots of short transfers is inefficient.
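To make the batching trade-off concrete, here is a rough latency model, a sketch under stated assumptions rather than a measurement. It assumes each host/device transfer costs a fixed overhead plus payload/bandwidth, that every block needs one upload and one download, and that a block of N samples first has to be buffered, which takes N / fs seconds. The overhead (10 us) and bandwidth (8 GB/s) figures, and the function names, are hypothetical placeholders, not numbers for any particular PC or board; kernel launch and sync costs are ignored.

```python
# Rough latency model for processing audio blocks on a discrete GPU.
# All hardware figures below are hypothetical placeholders, not measurements.

def block_latency_s(block_size, sample_rate=48_000, bytes_per_sample=4,
                    transfer_overhead_s=10e-6, bandwidth_bytes_per_s=8e9):
    """Return (buffering, transfer, total) latency in seconds for one block."""
    # Time spent just collecting N samples before the block can be processed.
    buffering = block_size / sample_rate
    # One upload and one download per block, each paying a fixed overhead
    # plus the time to move the payload at the given bandwidth.
    payload = block_size * bytes_per_sample
    one_way = transfer_overhead_s + payload / bandwidth_bytes_per_s
    transfer = 2 * one_way
    return buffering, transfer, buffering + transfer


def transfer_share(block_size, **kwargs):
    """Fraction of the block period (N / fs) spent on transfers."""
    buffering, transfer, _ = block_latency_s(block_size, **kwargs)
    return transfer / buffering


if __name__ == "__main__":
    for n in (16, 64, 256, 1024, 4096):
        buf, xfer, _ = block_latency_s(n)
        print(f"N={n:5d}  buffering={buf * 1e3:7.3f} ms  "
              f"transfer={xfer * 1e6:7.2f} us  share={transfer_share(n):.2%}")
```

With these placeholder numbers, transfers take about 6% of a 16-sample block's period but a negligible share of a 4096-sample block; the price of the bigger block is roughly 85 ms of buffering at 48 kHz, which is the "you'll have latency anyway" part.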

From CPU to GPU: the CPU writes the data to be processed into a pinned memory page, then sends a command to the GPU over PCIe to signal that there is data to be pulled from main RAM into GPU RAM. The GPU then, whenever it wants, initiates a DMA to pull the data from main RAM into GPU RAM, runs the specified kernels, and writes the results back into main RAM. Only then does control return to the host, which has been blocked waiting for a signal from the GPU to proceed.

On Jetson, the CPU writes data to a pinned page and hands control to the GPU, which can process the data straight away, in place; control is then handed back to the CPU.
So you go from having to stream data around twice, with non-deterministic latency, to in-place processing with just a control-context switch.
In fact, Jetson is able to run very costly CNN-based vision processing with lower latency than a traditional PC.
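The two flows can be compared with a toy step-count model. This is only a sketch: the step names follow the description above, but the per-step costs (in microseconds) and the helper names are hypothetical placeholders. The point is that the discrete-GPU path pays for two copies on top of the kernel and the hand-offs, while the zero-copy path keeps only the hand-offs.

```python
# Toy comparison of the two control flows described above.
# Step costs are hypothetical placeholders (microseconds), not measurements.

from dataclasses import dataclass


@dataclass
class Step:
    name: str
    cost_us: float


def discrete_gpu_path(kernel_us, copy_us, handoff_us):
    """CPU writes pinned buffer -> GPU pulls it over PCIe -> kernels run ->
    GPU writes results back to main RAM -> host is released."""
    return [
        Step("signal GPU over PCIe", handoff_us),
        Step("DMA host RAM -> GPU RAM", copy_us),
        Step("run kernels", kernel_us),
        Step("write results GPU RAM -> host RAM", copy_us),
        Step("signal host, host resumes", handoff_us),
    ]


def zero_copy_path(kernel_us, handoff_us):
    """CPU writes pinned buffer -> GPU processes it in place -> host resumes."""
    return [
        Step("hand control to GPU", handoff_us),
        Step("run kernels in place", kernel_us),
        Step("hand control back to CPU", handoff_us),
    ]


def total_us(steps):
    """Sum of all step costs along one path."""
    return sum(s.cost_us for s in steps)


if __name__ == "__main__":
    discrete = discrete_gpu_path(kernel_us=50, copy_us=40, handoff_us=10)
    shared = zero_copy_path(kernel_us=50, handoff_us=10)
    print("discrete:", total_us(discrete), "us")
    print("zero-copy:", total_us(shared), "us")
```

A fixed cost per step hides the other half of the argument: in the discrete path the GPU starts the DMA whenever it chooses, so the copy steps are also where the run-to-run variation comes from.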

However, besides parallel-friendly algorithms such as FFT/IFFT, convolution and FIR (IIR?) filtering (or more advanced things like the Volterra transform in Nebula, which is a sort of higher-dimensional convolution), most audio algorithms will not be able to make the most of the 768 CUDA cores.
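The difference between FIR and IIR filtering shows up directly in the data dependencies. The pure-Python sketch below is illustrative only (the function names are made up for this example): every FIR output sample depends only on input samples, so each one could in principle be computed by its own GPU thread, whereas even a one-pole IIR filter needs y[n - 1] before it can produce y[n]. Block and parallel-scan formulations of IIR filters do exist, but they are less direct than the FIR case.

```python
# Why FIR filtering maps well onto many GPU cores and IIR filtering does not.
# Pure-Python sketch; on a GPU each FIR output sample could go to its own thread.

def fir_output_sample(x, h, n):
    """y[n] = sum_k h[k] * x[n - k]; depends only on inputs, never on other outputs."""
    acc = 0.0
    for k, coeff in enumerate(h):
        if 0 <= n - k < len(x):
            acc += coeff * x[n - k]
    return acc


def fir(x, h):
    # Every n is independent of the others, so this loop is fully parallelisable.
    return [fir_output_sample(x, h, n) for n in range(len(x))]


def one_pole_iir(x, a, b):
    """y[n] = b * x[n] + a * y[n - 1]; each output needs the previous output."""
    y = []
    prev = 0.0
    for sample in x:
        prev = b * sample + a * prev  # serial dependency on y[n - 1]
        y.append(prev)
    return y


if __name__ == "__main__":
    impulse = [1.0, 0.0, 0.0, 0.0]
    print(fir(impulse, [0.5, 0.25]))
    print(one_pole_iir(impulse, a=0.5, b=1.0))
```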

A new, really interesting breed of processors I came across recently:

http://www.analog.com/en/products/proce ... sharc.html

Two SHARC DSPs at 450 MHz combined with two ARM cores. The SHARCs offer low-latency, hardware-accelerated FFT, convolution, mixing and single-cycle exp/sin/log functions, and the ARM cores make the processor self-contained, i.e. you can run a Linux kernel on the ARM cores, have your UI code running on them, and talk directly to the SHARC DSPs from the ARM cores. These SoCs seem to cost about $35 apiece and come with a seemingly pretty extensive DSP development suite.
I'll be looking into this in the not-so-distant future.

stratum · Post by **stratum** » Sat Mar 25, 2017 7:08 pm

The most typical and commonly available hardware for this kind of processing is the integrated GPU on Intel processors. It shares memory with the CPU too. Unfortunately, it is disabled when a discrete GPU is installed; at least that's what happens on the Haswell chipset. I don't have a more recent PC to try.

Nvidia Jetson k1 for embedded real-time audio processing?