KVR Audio

Angus_FX · Post by **Angus_FX** » Sun Sep 20, 2015 3:26 pm

GPU compute for audio.. hmm, not convinced. Wide, yes, but a bit too shallow. From where I'm sitting, the rumoured Skylake-Xeon arch is about perfect. 6-10 cores, high clock rate, each core has 2x 16-wide SP vector units. Xeon Phi's also interesting though.. 72x2x16 although at a much lower clockrate.

- a bunch of threads that pick up work as they go, from whichever track (reasonable)

The only down side being that you lose core affinity, and are likely to encounter a load of L2 cache misses at the start of each plug-in's process(). Perhaps not a problem on single-socket i3/i5/i7 (modern single socket chips have a big L3 cache shared across all cores) - used to be a big deal on multi-socket Core2 Xeon, PPC G4 / G5 etc., still is on big dual-socket Xeon systems but who runs one of those for music nowadays?

juha_p · Post by **juha_p** » Sun Sep 20, 2015 4:06 pm

Angus_FX wrote:GPU compute for audio.. hmm, not convinced. Wide, yes, but a bit too shallow.
...

Not for all for sure ... just hoping more inventions to come...

https://queue.acm.org/detail.cfm?id=2484010

Couple VST plug-ins from LiquidSonics utilizes the GUDA power well.

Angus_FX · Post by **Angus_FX** » Sun Sep 20, 2015 5:13 pm

Yep, good for big, dumb brute force convolution (with apologies to all the very smart people working on convolution in this thread). Not so good for synthesis, unless perhaps some kind of additive with zillions of partials.

Urs · Post by **Urs** » Sun Sep 20, 2015 5:17 pm

Angus_FX wrote:I'm not sure that an iterative solver would do so well.

Say, we try to solve something like y = tanh( s2 + f*tanh( s1 - s2 + f*tanh ( x - r * y - s1 ))); as y is on both sides of the equation and can't be isolated due to non-linearities, we need to iteratively estimate y, run the equation and refine the guess. On SSE we can do 4 guesses at a time, with AVX we could do 8. As the accuracy of the guess doubles, we might be able to run almost twice as fast, depending on the root finding algorithm. Or, we could do two filters at the expense of a single one.

(on a GPU we could solve a Diva filter in just two iterations insetad of, say, 4-15. It would be iterative solver heaven)

Angus_FX · Post by **Angus_FX** » Sun Sep 20, 2015 6:08 pm

I see! Was under the impression that those class of algorithms tend to branch during convergence (run until fabs(error) < threshold). SSE & AVX don't allow per-lane branching (they have masks, but what you can do with those is limited) - although you could iterate until the slowest lane converges acceptably.

Aleksey Vaneev · Post by **Aleksey Vaneev** » Mon Sep 21, 2015 10:05 am

On a side note, I've also found out that such algorithms as convolution with kernel size > cache size 256KB (that's the usual size of about any useful reverb impulse), SSE or any other vectorisation is not useful and does not give much boost. It may be useful for pure mathematical problems, but if you mix in a big data array there, no vectorization helps. In fact, I never measured a considerable SIMD performance gain on modern processors, maybe 20-30% only.

Here's why. Just think about it: why would SIMD be much faster when a processor's instruction decoder and scheduler actually assigns both vector and non-vector instructions into the same basic on-crystal operation blocks? In SIMD you still need to run N multiplications even if they are parallelized. It's also a reason why CISC turned out to be as power-expensive as RISC on modern generation of processors - because schedulers became more complex and more efficient at turning complex CISC code into basic instructions. There will be no inherent gain in SIMD as architectures mature further. (GPUs is a different story, but only due to its huge SIMD scale-factor and no real need to have cache memory)

Miles1981 · Post by **Miles1981** » Mon Sep 21, 2015 10:19 am

Urs wrote:
Angus_FX wrote:I'm not sure that an iterative solver would do so well.
Say, we try to solve something like y = tanh( s2 + f*tanh( s1 - s2 + f*tanh ( x - r * y - s1 ))); as y is on both sides of the equation and can't be isolated due to non-linearities, we need to iteratively estimate y, run the equation and refine the guess. On SSE we can do 4 guesses at a time, with AVX we could do 8. As the accuracy of the guess doubles, we might be able to run almost twice as fast, depending on the root finding algorithm. Or, we could do two filters at the expense of a single one.

(on a GPU we could solve a Diva filter in just two iterations insetad of, say, 4-15. It would be iterative solver heaven)

How do you select these 4 values? Newton only gives you one

Angus_FX · Post by **Angus_FX** » Mon Sep 21, 2015 10:42 am

Good point miles1981 - SIMD isn't all that helpful for iterative algorithms, but it's good for running stuff in parallel without data dependencies.

On a side note, I've also found out that such algorithms as convolution with kernel size > cache size 256KB (that's the usual size of about any useful reverb impulse), SSE or any other vectorisation is not useful and does not give much boost. It may be useful for pure mathematical problems, but if you mix in a big data array there, no vectorization helps. In fact, I never measured a considerable SIMD performance gain on modern processors, maybe 20-30% only.

If your algorithm is fundamentally bandwidth-constrained, that's likely to be true.

Here's why. Just think about it: why would SIMD be much faster when a processor's instruction decoder and scheduler actually assigns both vector and non-vector instructions into the same basic on-crystal operation blocks? In SIMD you still need to run N multiplications even if they are parallelized.

Not sure if you understand - the CPU's decode/dispatch hardware on a Haswell can issue two FMA instructions (x += y*z) per clock. If you run Haswell-optimised scalar code, you'll get 2 FMAs (4 flops/clock). If you run pre-Haswell scalar code you'll get one multiply and one add (2 flops/clock). If you can rewrite your code to use vector instruction sets, you may get 2x8 (16 flops/clock).

But - whether you can actually use that much power depends on whether you can keep your algorithm fed with data. A quad Haswell can easily become memory-constrained if you have a multithreaded algorithm whose data won't fit in cache. Main memory can only transfer about 16 bytes per CPU clock cycle, each core can consume 64.

So for something like a VA synth oscillator or filter, which can be parallelised nicely but has very little memory requirement, you can get huge speedups. For FFT it'd depend greatly on your data layout. I'd be inclined to just let Intel MKL make the hard decisions there.

Urs · Post by **Urs** » Mon Sep 21, 2015 11:10 am

Miles1981 wrote:
Urs wrote:
Angus_FX wrote:I'm not sure that an iterative solver would do so well.
Say, we try to solve something like y = tanh( s2 + f*tanh( s1 - s2 + f*tanh ( x - r * y - s1 ))); as y is on both sides of the equation and can't be isolated due to non-linearities, we need to iteratively estimate y, run the equation and refine the guess. On SSE we can do 4 guesses at a time, with AVX we could do 8. As the accuracy of the guess doubles, we might be able to run almost twice as fast, depending on the root finding algorithm. Or, we could do two filters at the expense of a single one.

(on a GPU we could solve a Diva filter in just two iterations insetad of, say, 4-15. It would be iterative solver heaven)
How do you select these 4 values? Newton only gives you one

Newton requires too much overhead (need to calculate derivate) and is not guaranteed to converge*. Following Brent we're doing a mixed approach of secant and regula falsi and we test two more points inbetween. We also start with one "clever" guess, 2 estimates on the expected maxima (brackets to ensure convergence) and one guess somewhere inbetween.

I think. 4 years have gone by quickly. But - coincidentally - I'm currently evaluating Maple trying to see how it could help me improve the speed of that stuff.

Alternatively you could do a race between 4 different initial guesses, say, one by "last output", one by "linear estimate" and so on. I'd love to mess with 8 different guesses.

With 1024 parallel guesses on a GPU one could of course do a fixed 2 iteration approach to achieve 16 bit accuracy in a result space between -2 and +2.

- Urs

* should converge with true tanh() but not necessarily with fast approximations.

Urs · Post by **Urs** » Mon Sep 21, 2015 11:15 am

Angus_FX wrote:Intel MKL

When I tried it on Windows a decade ago it required libraries to be installed on the user's machine and it would clash if two plug-ins used different versions of MKL. Has that changed, i.e. is it usable for plug-ins?

I'm still toying with looking into IPP or something - once time permits.

Aleksey Vaneev · Post by **Aleksey Vaneev** » Mon Sep 21, 2015 12:17 pm

Angus_FX wrote:Not sure if you understand - the CPU's decode/dispatch hardware on a Haswell can issue two FMA instructions (x += y*z) per clock. If you run Haswell-optimised scalar code, you'll get 2 FMAs (4 flops/clock). If you run pre-Haswell scalar code you'll get one multiply and one add (2 flops/clock). If you can rewrite your code to use vector instruction sets, you may get 2x8 (16 flops/clock).

Imagine a future generation processor that can unwind a CISC code (e.g. a for() loop) into micro code eqivalent to SIMD FMA instructions, and also execute them in 1 clock cycle. This isn't so unrealistic as it seems considering how optimized CISC execution became over time. In such scenario manual SIMD optimization may become a bottleneck as it may be more efficient for scheduler to turn a simple non-SIMD for() loop into SIMD instruction.

Miles1981 · Post by **Miles1981** » Mon Sep 21, 2015 1:10 pm

Urs wrote:
Angus_FX wrote:Intel MKL
When I tried it on Windows a decade ago it required libraries to be installed on the user's machine and it would clash if two plug-ins used different versions of MKL. Has that changed, i.e. is it usable for plug-ins?

I'm still toying with looking into IPP or something - once time permits.

You still have issues, I think. It got worse atbsome point...

Thx for the explanations. Still a huge Newton fan, especially since derivative is for free for tanh, but I'll try your solutions one day. Perhaps it's less costly than my current approach.

Angus_FX · Post by **Angus_FX** » Mon Sep 21, 2015 1:19 pm

You can static link MKL - it'll fatten up your binaries significantly, but that's not such a big deal now.

Imagine a future generation processor that can unwind a CISC code (e.g. a for() loop) into micro code eqivalent to SIMD FMA instructions, and also execute them in 1 clock cycle. This isn't so unrealistic as it seems considering how optimized CISC execution became over time. In such scenario manual SIMD optimization may become a bottleneck as it may be more efficient for scheduler to turn a simple non-SIMD for() loop into SIMD instruction.

Perhaps. I've got enough work to do getting things running as well as possible on existing architectures to worry too much about hypothetical ones

There are certainly some scenarios where letting the compiler auto-vectorise simple loops may be more efficient than hand-coding them. I've found an approach that, for the sort of things I'm building at least, strikes a good balance between targeting different CPU architectures' capabilities and letting the compiler figure out the finer points of scheduling.

Aleksey Vaneev · Post by **Aleksey Vaneev** » Mon Sep 21, 2015 1:55 pm

Angus_FX wrote:
Imagine a future generation processor that can unwind a CISC code (e.g. a for() loop) into micro code eqivalent to SIMD FMA instructions, and also execute them in 1 clock cycle. This isn't so unrealistic as it seems considering how optimized CISC execution became over time. In such scenario manual SIMD optimization may become a bottleneck as it may be more efficient for scheduler to turn a simple non-SIMD for() loop into SIMD instruction.
Perhaps. I've got enough work to do getting things running as well as possible on existing architectures to worry too much about hypothetical ones

Well, that was just a side note. I wonder why Intel at the time MMX and SSE were introduced did not implement a SIMD prefix instruction applicable to MOST instructions (at first calculated iteratively, but as silicon grows, calculated in 1 cycle). If they did that in the past, by this time we would have 1024-element SIMD cores, and finally say bye-bye to GPUs. But instead of this we have an awful collection of 3-word "extensions". They can still do that.

Aleksey Vaneev · Post by **Aleksey Vaneev** » Mon Sep 21, 2015 1:57 pm

..double post...

Plug-ins with multithreaded audio engines