First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?

DSP, Plugin and Host development discussion.

Post

Yes, I've used those documents often recently, trying to understand the underlying levels.
They are more about "intrinsics" than "SIMD" though.
Think of intrinsics as functions built directly into the compiler.
It works the same as when you write "if": that translates into some cmp and jmp assembly instructions.
So using intrinsics is not the same as writing assembly code. You use functions that are built into the compiler, and the compiler translates them into assembly code.
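A minimal illustration (my own example, assuming SSE2): the intrinsic reads like a normal function call, but with optimization enabled the compiler lowers it to essentially a single instruction.

Code: Select all

#include <emmintrin.h>  // SSE2 intrinsics

// Looks like a plain function call...
__m128d AddTwoDoubles(__m128d a, __m128d b) {
	return _mm_add_pd(a, b);  // ...but compiles down to a single addpd instruction
}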

How the compiler does that is left to the compiler.
E.g. gcc will throw an error when you use AVX intrinsics and compile with -msse, while MSVC with AVX intrinsics + /arch:sse will generate some strange "256-bit on 128-bit registers" code.

Post

I'm curious: do you use only SSE2? Or your own instruction-set dispatching code?
Or out-of-the-box libraries like IPP?
It really depends... you can put as much effort as you want into optimizing things. It can always be faster.
If you know C/C++ and have just started to understand how a CPU pipeline works and what SIMD is, then intrinsics are a good way to start using that knowledge. Start coding for the minimum supported platform.
When it comes to intrinsics + optimizing for specific systems, like having SSE2, AVX and AVX2 branches.. idk. You probably end up with lots of #ifdefs and struggles when changing the compiler (e.g. how do you convince gcc to compile C into SSE2 code but still accept the AVX2 intrinsics dispatched by you? no idea if that is possible at all).
If you want to dispatch / optimize for a specific system, you will end up writing assembly code - using nasm (or similar), not the C/C++ compiler.
But before you try that: keep in mind that asm is a different language from C. There are things to deal with that you have probably never heard of yet (prologue, epilogue, calling conventions.. you know that stuff?). A lot to learn before you can even do a "hello world"... and now IPP comes into the game ;)

Post

p.s. The web contains many articles about the CPU pipeline, but this one appears to be good: http://www.cs.utexas.edu/~pingali/CS377 ... e-areg.pdf
~stratum~

Post

PurpleSunray wrote: Thu Dec 13, 2018 2:37 pm and now IPP comes into the game ;)
Yes, that's what makes me wonder: why should one implement one's own / unwrapped / non-portable / home-made "SIMD oriented" functions when we have such great libraries? :D

Libraries make the code portable, are surely written better than the average programmer would manage, and will use the best SIMD available (at the cost of a small overhead, surely compensated by the great implementation).

Surely straight asm would be faster and more specific to the actual problem (if you know what you are doing), but then you have "lots" of trouble making it portable. The same with intrinsics, I believe.

Anyway, learning is something really awesome to me, so here is my first attempt to add two equal-sized arrays (of even length) with Intel intrinsics, taking advantage of SSE2:

Code: Select all

#include <emmintrin.h>  // SSE2 intrinsics

// bufferSize and voiceSize are compile-time constants defined elsewhere
alignas(16) double a[bufferSize];
alignas(16) double b[voiceSize][bufferSize];
alignas(16) double c[voiceSize][bufferSize];

inline void AddIntrinsics(int voiceIndex, int blockSize) {
	// assuming blockSize % 2 == 0 and voiceIndex is within range
	int iters = blockSize / 2;

	double *pA = a;
	double *pB = b[voiceIndex];
	double *pC = c[voiceIndex];

	int step = 0;
	for (int i = 0; i < iters; i++, step += 2) {
		__m128d vA = _mm_load_pd(&pA[step]);  // aligned load of 2 doubles
		__m128d vB = _mm_load_pd(&pB[step]);

		_mm_store_pd(&pC[step], _mm_add_pd(vA, vB));  // add and store 2 doubles
	}
}
IPP restricted to SSE2 would probably beat me by 1000%, but I'm getting a general idea of what is happening under the hood :)

And this will also help when using IPP, as suggested by you all, heroes! In fact, learning this, I'm seeing why using float instead of double makes sense: it almost halves the computation by fitting more values per register. With audio it seems a great deal, since you don't need that much precision (except in filters, I believe).
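For example, I imagine the same add loop with floats would look like this, processing 4 samples per register instead of 2 (just a sketch, assuming the arrays were declared as float, 16-byte aligned, and blockSize % 4 == 0):

Code: Select all

#include <xmmintrin.h>  // SSE intrinsics (single precision)

// Same add loop, but with floats: 4 samples per 128-bit register instead of 2
inline void AddIntrinsicsFloat(const float *pA, const float *pB, float *pC, int blockSize) {
	for (int i = 0; i < blockSize; i += 4) {  // assumes blockSize % 4 == 0, aligned buffers
		__m128 vA = _mm_load_ps(&pA[i]);
		__m128 vB = _mm_load_ps(&pB[i]);
		_mm_store_ps(&pC[i], _mm_add_ps(vA, vB));
	}
}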

Do you use floats or doubles?
Last edited by Nowhk on Fri Dec 14, 2018 10:31 am, edited 1 time in total.

Post

<deleted>

Post

Yes, that's what makes me wonder: why should one implement one's own / unwrapped / non-portable / home-made "SIMD oriented" functions when we have such great libraries?
Because you might not find what you are looking for in there.
IPP is really strong if you work at algorithm level, e.g. try to beat the IPP FFT.. good luck.
If you are at CPU instruction level, you will always move memory in and out of IPP, while the compiler might be able to keep things in registers if you code intrinsics.
Example: extend your for loop with some more math than a single _mm_add_pd. If you don't find that math in IPP, you need to call a lot of IPP functions with blocksize=1 (slower than intrinsics for sure), or you need to re-arrange the code and break it up into multiple loops that are pre-implemented by IPP.
So you can't say IPP will be faster than your SSE2 intrinsics just because of the dispatcher.
It will depend on what you do, how you do it, and what your target system is actually capable of (I remember an Intel (or was it AMD??) CPU generation that supported AVX (256-bit) instructions, but on an SSE (128-bit) execution unit - by doing the same op twice. Result: AVX instructions were terribly slow compared to SSE2, so your SSE2->AVX optimization made it worse).

But as you say.. it is more important to first understand what's going on under the hood and how to use a CPU effectively. Afterwards think about implementation details :D Starting with SSE2 intrinsics is a good way. Once you have understood it, porting to AVX or AVX2 or IPP or.. will be no big deal, just typing work.
Surely straight asm would be faster and more specific to the actual problem (if you know what you are doing), but then you have "lots" of trouble making it portable. The same with intrinsics, I believe.
Same with IPP. If you leave the C spec, you enter the world of CPU architectures: there is no IPP for ARM/NEON or PPC or any other CPU that does not support the Intel instruction set. IPP != portable ;D :P

Post

PurpleSunray wrote: Thu Dec 13, 2018 2:00 pm MSVC with AVX intrinsics + /arch:sse will generate some strange "256-bit on 128-bit registers" code.
Last I checked the disassembly (can't remember which MSVC version, either 2013 or 2015), it would produce "correct" code, except that MSVC would happily mix VEX-encoded instructions from AVX intrinsics with "legacy"-encoded SSE operations (from SSE intrinsics or scalar operations). The problem with this is that CPUs really don't like such a mix at all. In one particular case that I actually measured, the resulting code would "work", except (on my Sandy at least) it would run approximately a thousand(!) times slower than straight scalar code (but it would work; that's gotta be worth something).

ICC on the other hand seems to be intelligent enough to realize that it should always use VEX encoding in functions(?) that use AVX intrinsics, even if you're nominally compiling your code for SSE2; not sure what the exact logic for the boundaries is, but it still seems to produce normal SSE2 code (ie. with the legacy encoding) outside those code-paths, so you can dispatch based on CPUID. Unfortunately "use ICC" is probably not the "portable" solution one would hope for.
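(For what it's worth, when you control the code yourself, the usual manual mitigation is to issue _mm256_zeroupper() at the end of any AVX code path, which is what a VEX-aware compiler inserts automatically. A rough sketch; the file still has to be compiled with AVX enabled:)

Code: Select all

#include <immintrin.h>  // AVX intrinsics

void AvxWork(double *p) {
	__m256d v = _mm256_load_pd(p);  // assumes p is 32-byte aligned
	_mm256_store_pd(p, _mm256_add_pd(v, v));
	_mm256_zeroupper();  // clear the upper 128 bits before any legacy-encoded
	                     // SSE code runs, avoiding the transition penalty
}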

Post

mystran wrote: Fri Dec 14, 2018 8:51 pm
PurpleSunray wrote: Thu Dec 13, 2018 2:00 pm MSVC with AVX intrinsics + /arch:sse will generate some strange "256-bit on 128-bit registers" code.
Last I checked the disassembly (can't remember which MSVC version, either 2013 or 2015), it would produce "correct" code, except that MSVC would happily mix VEX-encoded instructions from AVX intrinsics with "legacy"-encoded SSE operations (from SSE intrinsics or scalar operations). The problem with this is that CPUs really don't like such a mix at all. In one particular case that I actually measured, the resulting code would "work", except (on my Sandy at least) it would run approximately a thousand(!) times slower than straight scalar code (but it would work; that's gotta be worth something).
I think they fixed that in modern releases.

Post

PurpleSunray wrote: Fri Dec 14, 2018 12:28 pm Example: extend your for loop with some more math than a single _mm_add_pd. If you don't find that math in IPP, you need to call a lot of IPP functions with blocksize=1 (slower than intrinsics for sure), or you need to re-arrange the code and break it up into multiple loops that are pre-implemented by IPP.
So you can't say IPP will be faster than your SSE2 intrinsics just because of the dispatcher.
Uhm... what could you do straight with SSE2 that an IPP function can't do (in an optimized way)? Can you give me an example?
As far as I can see, pretty much every SIMD operation is covered by IPP.
PurpleSunray wrote: Fri Dec 14, 2018 12:28 pm Same with IPP. If you leave the C spec, you enter the world of CPU architectures: there is no IPP for ARM/NEON or PPC or any other CPU that does not support the Intel instruction set. IPP != portable ;D :P
Heheh, yes, right! But...
Since (as you said) it would be hard to #ifdef your code in C++, do you directly write assembly code for every target system?

What I got from that post is that I would then enter the domain of asm, which could be hard, hence the tip to use IPP. But if IPP is not portable either, well, I don't understand that suggestion :)

Post

Nowhk wrote: Sun Dec 16, 2018 5:14 pm
PurpleSunray wrote: Fri Dec 14, 2018 12:28 pm Example: extend your for loop with some more math than a single _mm_add_pd. If you don't find that math in IPP, you need to call a lot of IPP functions with blocksize=1 (slower than intrinsics for sure), or you need to re-arrange the code and break it up into multiple loops that are pre-implemented by IPP.
So you can't say IPP will be faster than your SSE2 intrinsics just because of the dispatcher.
Uhm... what could you do straight with SSE2 that an IPP function can't do (in an optimized way)? Can you give me an example?
As far as I can see, pretty much every SIMD operation is covered by IPP.
Function calls have overhead. Dynamic dispatch has overhead. Memory access (eg. for passing data in and out of a function) has overhead.

When a single function call to an optimised routine performs a large enough chunk of work at once, these overheads don't matter (they become a tiny fraction of the total running time), but when the amount of work done is small, it's entirely possible that the overheads alone will cost more than computing the whole thing in straight scalar C code.

Post

mystran wrote: Sun Dec 16, 2018 6:18 pm Function calls have overhead. Dynamic dispatch has overhead. Memory access (eg. for passing data in and out of a function) has overhead.

When a single function call to an optimised routine performs a large enough chunk of work at once, these overheads don't matter (they become a tiny fraction of the total running time), but when the amount of work done is small, it's entirely possible that the overheads alone will cost more than computing the whole thing in straight scalar C code.
Of course :) I meant: what could you do straight with SSE2 that an IPP function can't do (in an optimized way, considering a blockSize of e.g. 100 samples)? :D

It's clear that if blockSize is 1, using IPP is penalizing...

Post

Uhm... what could you do straight with SSE2 that an IPP function can't do (in an optimized way)? Can you give me an example?
Soft-clip, for example. You have already used ippsThreshold_64f_I to implement a hard clip / limiter. Now try the same, but with a soft clip, i.e. scale from -3dB to 0dB instead of hard-limiting at 0dB, or maybe even allow adjusting the "knee" setting. ippsThreshold_64f_I can't do that, so how do you do it in IPP?
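For illustration, one possible soft-clip curve (a simple cubic; the actual curve and knee handling are up to you) as a single fused SSE2 loop - untested sketch:

Code: Select all

#include <emmintrin.h>  // SSE2 intrinsics

// Cubic soft clip: clamp to [-1, 1], then y = 1.5x - 0.5x^3.
// Everything stays in registers inside one loop - no intermediate buffers.
inline void SoftClip(const double *pIn, double *pOut, int blockSize) {
	const __m128d one      = _mm_set1_pd(1.0);
	const __m128d minusOne = _mm_set1_pd(-1.0);
	const __m128d c15      = _mm_set1_pd(1.5);
	const __m128d c05      = _mm_set1_pd(0.5);

	for (int i = 0; i < blockSize; i += 2) {  // assumes blockSize % 2 == 0, aligned buffers
		__m128d x  = _mm_load_pd(&pIn[i]);
		x = _mm_max_pd(minusOne, _mm_min_pd(one, x));  // clamp to [-1, 1]
		__m128d x3 = _mm_mul_pd(_mm_mul_pd(x, x), x);  // x^3
		__m128d y  = _mm_sub_pd(_mm_mul_pd(c15, x), _mm_mul_pd(c05, x3));
		_mm_store_pd(&pOut[i], y);
	}
}

With IPP you would have to express that same curve as a chain of whole-buffer passes instead.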
Since (as you said) it would be hard to #ifdef your code in C++, do you directly write assembly code for every target system?
I have never tried to implement such a dispatcher for intrinsics, but it has already turned out that mixing SSE and AVX code in the same file is a bad idea (see mystran's post about the mixed VEX & legacy code produced by MSVC). So you want to avoid #ifdefs that could mix SSE2 (C code) and AVX (intrinsics) or vice versa, and at least split into files (so you can have proper compile options for xxx_SSE.cpp and xxx_AVX.cpp). You need to figure out what works and what won't on your own, can't help there :P
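The runtime side would then be a function pointer picked once via CPUID, something like this sketch (the file and function names are made up; __builtin_cpu_supports is gcc/clang-only, on MSVC you would query __cpuid yourself):

Code: Select all

// Implemented in xxx_SSE.cpp and xxx_AVX.cpp, each compiled with its own flags:
void ProcessBlock_SSE2(const double *in, double *out, int n);
void ProcessBlock_AVX(const double *in, double *out, int n);

using ProcessFn = void (*)(const double *, double *, int);

// Pick the best implementation once, at startup (gcc/clang builtin)
static ProcessFn SelectProcessBlock() {
	return __builtin_cpu_supports("avx") ? ProcessBlock_AVX
	                                     : ProcessBlock_SSE2;
}

static const ProcessFn processBlock = SelectProcessBlock();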

Edit:
Yes, in the past I always wrote assembly code for CPU-level optimization (if I did any at all on my private projects - at the company we have asm coders for the asm code). But it is also mind-set driven; I never considered intrinsics a real alternative for me. If I'm going to write machine-level instructions, I want full control over them. I don't trust the compiler... it might use my SIMD intrinsics, but who knows what code it produces around them. So it's all or nothing :lol:

Post

PurpleSunray wrote: Mon Dec 17, 2018 10:15 am Soft-clip, for example. You have already used ippsThreshold_64f_I to implement a hard clip / limiter. Now try the same, but with a soft clip, i.e. scale from -3dB to 0dB instead of hard-limiting at 0dB, or maybe even allow adjusting the "knee" setting. ippsThreshold_64f_I can't do that, so how do you do it in IPP?
But also with SSE2 (or other SIMD) I'll end up with a "combination" of add/mul (i.e. basic) math operations.
So, since I would use those basic operations in SSE2 anyway, why can't I simply use them through IPP (which also already supports the different SIMD sets)?
I'm not sure SSE2 implements a dedicated soft-clip instruction :ud:
PurpleSunray wrote: Mon Dec 17, 2018 10:15 am Yes, in the past I always wrote assembly code for CPU-level optimization (if I did any at all on my private projects - at the company we have asm coders for the asm code). But it is also mind-set driven; I never considered intrinsics a real alternative for me. If I'm going to write machine-level instructions, I want full control over them. I don't trust the compiler... it might use my SIMD intrinsics, but who knows what code it produces around them. So it's all or nothing :lol:
I see, thanks for reporting your experience!
mystran wrote: Mon Dec 10, 2018 1:48 pm Whether or not IPP has a function for doing this, your naive scalar code is almost certainly the fastest you can get, unless you can run several such filters in parallel. Breaking the serial dependency inherent in recursive filters generally involves at least log(n) parallel passes and that's never profitable on any CPU (not even close; it's quite tricky to make it profitable even on GPUs), because the SIMD architectures are far too narrow to make the parallel passes parallel enough.
Thanks to a man called Peter Cordes, I ended up translating my smoothing filters' "scalar" code into this:

Code: Select all

// scalar version: z1 = inputA0 + z1 * b0;
__m128d zv = _mm_setr_pd(z0, inputA0 + z0 * b0);  // lanes hold two consecutive states

__m128d step2_mul = _mm_set1_pd(b0 * b0);
__m128d step2_add = _mm_set1_pd(inputA0 + inputA0 * b0);

for (int i = 0; i < blockSize - 1; i += 2) {
	_mm_store_pd(pC + i, zv);

	zv = _mm_mul_pd(zv, step2_mul);  // advance both lanes by two samples
	zv = _mm_add_pd(zv, step2_add);
}
if (blockSize % 2 != 0) {
	_mm_store_sd(pC + blockSize - 1, zv);  // odd tail: store the low lane only
}
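(The algebra behind the two constants, as I understand it: from z[n+1] = inputA0 + z[n] * b0 it follows that z[n+2] = (inputA0 + inputA0 * b0) + z[n] * (b0 * b0), which is exactly step2_add and step2_mul, so each lane can jump two samples ahead per iteration.)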
The gain is amazing. The same test that took ~300ms to complete now takes only ~80ms. :borg:

Post

I'm not sure SSE2 implements a dedicated soft-clip instruction :ud:
Nope, SSE2 does not have a soft-clipping instruction, but you can implement one using SSE2 instructions.

Just extend your loop further with more math and try to port it to IPP at the same time to see what I mean.
At the moment you would run 1 loop with SSE2 mul+add (your code) vs 1 loop with AVX2 mul + 1 loop with AVX2 add (IPP). In your code the compiler might be able to keep the mul result in a register for the add, so you save a store and a load. With IPP you store after the mul, only to load it again for the add. At some point the memory-moving overhead will negate the improvement you get from IPP; just make your formula complex enough.
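To make it concrete, a sketch of the two versions for a constant multiply-then-add (if I recall the IPP signatures correctly, ippsMulC_64f and ippsAddC_64f_I are the matching whole-buffer calls):

Code: Select all

#include <emmintrin.h>  // SSE2 intrinsics
#include <ipps.h>       // IPP signal processing

// (a) one fused intrinsics loop: the mul result stays in a register for the add
void FusedSSE2(const double *pIn, double *pOut, int blockSize, double mulC, double addC) {
	const __m128d vMul = _mm_set1_pd(mulC);
	const __m128d vAdd = _mm_set1_pd(addC);
	for (int i = 0; i < blockSize; i += 2) {  // assumes blockSize % 2 == 0, aligned buffers
		__m128d v = _mm_load_pd(&pIn[i]);
		v = _mm_add_pd(_mm_mul_pd(v, vMul), vAdd);
		_mm_store_pd(&pOut[i], v);
	}
}

// (b) the same math as two IPP passes: the whole buffer is written after the mul
// and read back again for the add
void TwoPassIPP(const double *pIn, double *pOut, int blockSize, double mulC, double addC) {
	ippsMulC_64f(pIn, mulC, pOut, blockSize);
	ippsAddC_64f_I(addC, pOut, blockSize);
}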
The gain is amazing. The same test that took ~300ms to complete now takes only ~80ms.
That's the whole point... do you want to put all that effort into a dispatcher and an additional AVX2 code branch just to bring it down to, hmm.. 60ms on Cannon Lake?

Post

Nowhk wrote: Mon Dec 17, 2018 2:30 pm Thanks to a man called Peter Cordes, I ended up translating my smoothing filters' "scalar" code into this:

Code: Select all

// scalar version: z1 = inputA0 + z1 * b0;
__m128d zv = _mm_setr_pd(z0, inputA0 + z0 * b0);  // lanes hold two consecutive states

__m128d step2_mul = _mm_set1_pd(b0 * b0);
__m128d step2_add = _mm_set1_pd(inputA0 + inputA0 * b0);

for (int i = 0; i < blockSize - 1; i += 2) {
	_mm_store_pd(pC + i, zv);

	zv = _mm_mul_pd(zv, step2_mul);  // advance both lanes by two samples
	zv = _mm_add_pd(zv, step2_add);
}
if (blockSize % 2 != 0) {
	_mm_store_sd(pC + blockSize - 1, zv);  // odd tail: store the low lane only
}
The gain is amazing. The same test that took ~300ms to complete now takes only ~80ms. :borg:
Why do you use an IIR filter? Linear interpolation performs better because it does not have a dependency chain.
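A linear ramp has no feedback between samples, so every SIMD lane advances independently; a minimal sketch (my example, same assumptions as above about alignment and even blockSize):

Code: Select all

#include <emmintrin.h>  // SSE2 intrinsics

// Linear parameter ramp: pC[i] = start + i * step. No dependency chain,
// so both lanes advance independently (and this extends trivially to AVX).
void LinearRamp(double *pC, int blockSize, double start, double step) {
	__m128d v    = _mm_setr_pd(start, start + step);  // samples 0 and 1
	__m128d incr = _mm_set1_pd(2.0 * step);           // advance both lanes by 2 samples
	for (int i = 0; i < blockSize; i += 2) {          // assumes blockSize % 2 == 0
		_mm_store_pd(&pC[i], v);
		v = _mm_add_pd(v, incr);
	}
}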
