First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?

DSP, Plugin and Host development discussion.

Post

PurpleSunray wrote: Mon Dec 17, 2018 3:01 pm At the moment you would run 1 loop with SSE2 mul+add (your code) vs 1 loop with AVX2 mul + 1 loop with AVX2 add (IPP). In your code the compiler might be able to keep the mul result in a register for the add, so you save a store and a load. With IPP you store after the mul only to load it again for the add. At some point the memory-moving overhead will negate the improvement you get via IPP; just make your formula complex enough.
Ok, I see what you mean: using "basic" (not already "encapsulated") functions helps with chaining computations :wink: Clear!
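Just to check I got the idea, a sketch of the difference (hypothetical buffers and function names, not the actual IPP code):

Code: Select all

// One fused loop keeps the mul result in a register...
void fused(const float* a, const float* b, const float* c, float* out, int n)
{
    for (int i = 0; i < n; ++i)
        out[i] = a[i] * b[i] + c[i];   // the product never leaves the registers
}

// ...while two chained library-style passes store it and load it back.
void chained(const float* a, const float* b, const float* c,
             float* tmp, float* out, int n)
{
    for (int i = 0; i < n; ++i)
        tmp[i] = a[i] * b[i];          // store the intermediate result...
    for (int i = 0; i < n; ++i)
        out[i] = tmp[i] + c[i];        // ...and load it again
}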
PurpleSunray wrote: Mon Dec 17, 2018 3:01 pm That's the whole point... you wanna put all that effort into a dispatcher and an additional AVX2 code branch just to bring it down to, hmm.. 60ms on Cannon Lake?
I see, but what do you do when the target platform (CPU) doesn't support SSE2?
Or (more importantly) what do you do if you use SSE2 Intel intrinsics and the CPU that will run your plugin is AMD?

A sort of dispatcher is still required, isn't it?
2DaT wrote: Mon Dec 17, 2018 6:16 pm Why do you use an IIR filter? Linear interpolation is more performant because it does not have a dependency chain.
Isn't linear interpolation "more complex" to implement for smoothing params (which depend on variable buffer sizes)?

I mean: I would need to keep track of the "start" step, "end" step and "counter" step, and reset the "start/counter/end" points every time the param value changes (introducing another check, i.e. is it different from the previous value?)...

Or maybe there's a fancy way? How do you smooth params with linear interpolation?

Post

I see, but what do you do when the target platform (CPU) doesn't support SSE2?
It won't run, but you have this problem even without using SSE2 intrinsics, because you also need to define the target platform when compiling C/C++ code. MSVC says about the /arch option: if no option is specified, the compiler will use instructions found on processors that support SSE2; use of the enhanced instructions can be disabled with /arch:IA32.
So unless you explicitly compile for IA32, you already have SSE2 as a minimum requirement.
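If you want to make that minimum explicit, a compile-time check along these lines works (a sketch; these are the predefined macros GCC/Clang and MSVC set for their -m and /arch switches):

Code: Select all

// Fail the build if SSE2 is not enabled for this translation unit
// (GCC/Clang define __SSE2__ with -msse2 or on x86-64; MSVC defines _M_X64
// on x64 and _M_IX86_FP >= 2 with /arch:SSE2).
#if !defined(__SSE2__) && !defined(_M_X64) && !(defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#error "SSE2 support required: build with /arch:SSE2 or -msse2"
#endif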
Or (more importantly) what do you do if you use SSE2 Intel intrinsics and the CPU that will run your plugin is AMD?
What AMD processor? One supporting the Intel instruction set, such as Ryzen or Threadripper? That will work.
Or one with an ARM core, such as the Opteron A? Then SSE2 won't work; you need NEON instead.

Post

I mean: I would need to keep track of the "start" step, "end" step and "counter" step, and reset the "start/counter/end" points every time the param value changes (introducing another check, i.e. is it different from the previous value?)...
Why check? Just interpolate.
v = (1 - t) * v0 + t * v1;
Last value is v0, param value is v1, t (0..1) is the position in between. Now pre-fill 2 arrays with t's and 1-t's (depending on block size) and you have a nice loop of 2 muls and 1 add, without dependencies, that scales very well on any kind of parallelism (a GPU would love to calc that, while it would hate your IIR :lol: . A CPU might even be smart enough to run 2x128-bit muls at once on a 256-bit exec unit, leveraging CPU features you have not even coded for.)
No need to reset anything; if v0 and v1 are the same, v will be the same for all t's.
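Something like this (a rough sketch with hypothetical buffer names; t[] and one_minus_t[] are the two pre-filled arrays):

Code: Select all

// Pre-filled once per block size: t[i] = i / (float)blocksize,
// one_minus_t[i] = 1.0f - t[i].
void lerp_block(const float* t, const float* one_minus_t,
                float v0, float v1, float* out, int blocksize)
{
    for (int i = 0; i < blocksize; ++i)
        out[i] = one_minus_t[i] * v0 + t[i] * v1;  // 2 muls + 1 add, no dependency chain
}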

Post

Nowhk wrote: Tue Dec 18, 2018 8:28 am I see, but what do you do when the target platform (CPU) doesn't support SSE2?
Every 64-bit x86 processor is guaranteed to support SSE2 (by specification). If you are serious about supporting old 32-bit CPUs then go ahead; in that case your best bet is probably to compile separate "legacy compatible" builds for those CPUs, since allowing your compiler to use SSE2 instructions typically speeds up even scalar floating point code slightly.
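If you do end up supporting pre-SSE2 CPUs via a runtime dispatcher, the check itself is a one-off CPUID query. A sketch for MSVC (SSE2 is reported in EDX bit 26 of leaf 1; GCC/Clang offer <cpuid.h> instead):

Code: Select all

#include <intrin.h>   // MSVC intrinsics; GCC/Clang: <cpuid.h> and __get_cpuid

static bool cpu_has_sse2()
{
    int info[4];                          // EAX, EBX, ECX, EDX
    __cpuid(info, 1);                     // leaf 1: processor feature flags
    return (info[3] & (1 << 26)) != 0;    // EDX bit 26 = SSE2
}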

Post

PurpleSunray wrote: Tue Dec 18, 2018 10:25 am Why check? Just interpolate.
v = (1 - t) * v0 + t * v1;
Last value is v0, param value is v1, t (0..1) is the position in between. Now pre-fill 2 arrays with t's and 1-t's (depending on block size) and you have a nice loop of 2 muls and 1 add, without dependencies
If you rewrite v=v0+t*(v1-v0) then you need 1 multiply-add since (v1-v0) is loop invariant.

That said, you probably want to compute the t-values on the fly; lookup tables for simple computations tend to be pessimisations on modern CPUs.
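Something like this (a sketch, hypothetical names), with t computed on the fly rather than read from a table:

Code: Select all

void lerp_block(float v0, float v1, float* out, int blocksize)
{
    float dv = v1 - v0;                      // loop invariant
    float tstep = 1.0f / (float)blocksize;
    float t = 0.0f;
    for (int i = 0; i < blocksize; ++i)
    {
        out[i] = v0 + t * dv;                // one multiply-add per sample
        t += tstep;
    }
}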

Post

I usually prefer the precise method; I've had problems with v != v1 when t=1 too many times already with the 'simple' one :/
But since Nowhk is hunting after the last CPU cycle, he should go for the 1 mul ofc :D (or add an FMA branch, that should bring a significant boost while maintaining precision, I think ^^)

Post

PurpleSunray wrote: Tue Dec 18, 2018 12:04 pm I usually prefer the precise method; I've had problems with v != v1 when t=1 too many times already with the 'simple' one :/
It doesn't really matter if you start from v1 as the initial value in the next block, since the error can only accumulate for one block (and even if you don't, as long as you compute the new delta with the "wrong" value as the initial point, you'll get negative feedback and it won't drift very far).

Still, we can make it faster by computing step = (v1-v0)/blocksize, so that we get a single (dependent) add per sample: v += step. But then, since step is loop invariant, we can break the dependency by unrolling, which leads to something like:

Code: Select all

float step = (v1-v0)/blocksize, step2 = 2*step, step3 = 3*step, step4 = 4*step;
for(int i = 0; i < blocksize; i += 4) // assumes blocksize is a multiple of 4
{
  out[i+0] = v0;        // the four outputs depend only on the v0 carried over
  out[i+1] = v0+step;   // from the previous iteration, not on each other
  out[i+2] = v0+step2;
  out[i+3] = v0+step3;
  v0 += step4;          // single dependent add per 4 samples
}
This can then be trivially converted to SSE (or whatever you prefer), which gives you one ADDPS and one store per iteration. Then you can either unroll the loop some more or hope that the compiler does it for you; otherwise you probably risk an extra stall due to the loop body being too short.

edit: brainfart, only need one ADDPS. :)
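For completeness, a sketch of that SSE2 conversion (assuming blocksize is a multiple of 4 and out is 16-byte aligned, otherwise use an unaligned store):

Code: Select all

#include <emmintrin.h>  // SSE2 intrinsics

void ramp_sse2(float v0, float v1, float* out, int blocksize)
{
    float step   = (v1 - v0) / blocksize;
    __m128 v     = _mm_setr_ps(v0, v0 + step, v0 + 2*step, v0 + 3*step);
    __m128 step4 = _mm_set1_ps(4 * step);
    for (int i = 0; i < blocksize; i += 4)
    {
        _mm_store_ps(out + i, v);   // one store per iteration
        v = _mm_add_ps(v, step4);   // one ADDPS per iteration
    }
}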

Post

PurpleSunray wrote: Tue Dec 18, 2018 9:35 am What AMD processor? One supporting the Intel instruction set, such as Ryzen or Threadripper? That will work.
But I'm talking about Intel Intrinsics, not SIMD.

If I take my code that uses _mm_mul_pd on MSVC and my desktop i7 CPU, and I open it with GCC on a Ryzen CPU, will it compile?
Are they called Intel Intrinsics because they refer to SIMD instructions introduced by Intel, or are they only compilable with compilers that run "over" Intel? :D

PurpleSunray wrote: Tue Dec 18, 2018 10:25 am
I mean: I would need to keep track of the "start" step, "end" step and "counter" step, and reset the "start/counter/end" points every time the param value changes (introducing another check, i.e. is it different from the previous value?)...
Why check? Just interpolate.
v = (1 - t) * v0 + t * v1;
Last value is v0, param value is v1, t (0..1) is the position in between. Now pre-fill 2 arrays with t's and 1-t's (depending on block size) and you have a nice loop without dependencies that scales very well on any kind of parallelism (a GPU would love to calc that, while it would hate your IIR :lol: )
No need to reset anything; if v0 and v1 are the same, v will be the same for all t's.
Ofc, but let's say I smooth the param over 1000 samples.
Assume the buffers I get from the DAW are of variable size.
Assume a param value change occurs between buffers.

If after 4 buffers (250+135+220+256 samples) I do a param change (i.e. I set a new value for the param, from the DAW or the user GUI), I need to "reset" t, else it will smooth to the new value (from the current one) only for 1000-250-135-220-256=139 samples, not 1000. So I need to check whether the old param value and the new one differ first, then reset t. No?

mystran wrote: Tue Dec 18, 2018 11:29 am Every 64-bit x86 processor is guaranteed to support SSE2 (by specification).
I see. In fact SSE2 seems to have been supported on 32-bit x86 since '00, which is obviously acceptable to me :) Question: any examples of architectures/CPUs that run Windows/VSTs and are not x86-based?

Post

Nowhk wrote: Tue Dec 18, 2018 1:07 pm I see. In fact SSE2 seems to have been supported on 32-bit x86 since '00, which is obviously acceptable to me :) Question: any examples of architectures/CPUs that run Windows/VSTs and are not x86-based?
start here: https://en.wikipedia.org/wiki/Instructi ... chitecture

Post

Nowhk wrote: Tue Dec 18, 2018 1:07 pm But I'm talking about Intel Intrinsics, not SIMD.

If I take my code that uses _mm_mul_pd on MSVC and my desktop i7 CPU, and I open it with GCC on a Ryzen CPU, will it compile?
Yes. Ryzen is an x86 CPU with SSE2, same as the i7. It doesn't matter whether an AMD or Intel fab built the chip, as long as it supports the same instruction set.
Nowhk wrote: Ofc, but let's say I smooth the param over 1000 samples.
Assume the buffers I get from the DAW are of variable size.
Assume a param value change occurs between buffers.

If after 4 buffers (250+135+220+256 samples) I do a param change (i.e. I set a new value for the param, from the DAW or the user GUI), I need to "reset" t, else it will smooth to the new value (from the current one) only for 1000-250-135-220-256=139 samples, not 1000. So I need to check whether the old param value and the new one differ first, then reset t. No?
Yes.
But t starts at 0 with every block, not only when the parameter changes.
Let's say the block size is 1000.
You read the param value from the GUI, interpolate 1000 values between the old param value and the new one, then modulate the 1000 audio samples coming from the DAW with it. After 1000 samples have passed, you read the param value again and interpolate the next block of 1000 values.
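In (hypothetical) code, something like this; getGuiParamValue() is a made-up helper and v0 is the smoother state kept between blocks:

Code: Select all

float getGuiParamValue();    // hypothetical: returns the current GUI/automation value
static float v0 = 0.0f;      // smoother state, kept between blocks

void process(float* audio, int blocksize)
{
    float v1 = getGuiParamValue();              // read the param once per block
    float step = (v1 - v0) / (float)blocksize;
    for (int i = 0; i < blocksize; ++i)
    {
        v0 += step;                             // linear ramp towards v1
        audio[i] *= v0;                         // e.g. apply as a smoothed gain
    }
    // after the loop v0 has reached v1 (up to rounding); the next block continues from here
}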

Post

PurpleSunray wrote: Tue Dec 18, 2018 4:16 pm Yes. Ryzen is an x86 CPU with SSE2, same as the i7. It doesn't matter whether an AMD or Intel fab built the chip, as long as it supports the same instruction set.
Why are they called Intel Intrinsics then? Because Intel introduced them as SSE2/AVX/etc.?
I mean: is SSE2 on AMD exposed through the same Intel intrinsics?
Or will I use something different than emmintrin.h? :D
(Sorry for the "stupid" question, but I also think terminology helps with learning stuff hehe.)
PurpleSunray wrote: After 1000 samples have passed, you read the param value again and interpolate the next block of 1000 values.
I see. So basically you "sync" the refresh of param changes at a fixed rate (the smoothing length).
If I change a param value in between blocks (after 2 blocks, where only 500 samples have been processed instead of 1000, for example), you simply ignore the param change at that step, "delaying" it to the next block, right?

Since the smoothing window is pretty small (probably 100 samples @ 44100 is enough; or which length is preferable?), in effect this won't really be noticeable.

Post

Nowhk wrote: Wed Dec 19, 2018 9:31 am Why are they called Intel Intrinsics then? Because Intel introduced them as SSE2/AVX/etc.?
I mean: is SSE2 on AMD exposed through the same Intel intrinsics?
Or will I use something different than emmintrin.h? :D
(Sorry for the "stupid" question, but I also think terminology helps with learning stuff hehe.)
Because Intel invented the x86 architecture and owns the instruction set specification for it.
AMD has licensed it and builds chips based on the same ISA.

Look at ARM and the "ISA model" becomes even clearer.
ARM (a UK company) specifies instruction sets (i.e. NEON, which is the SSE of ARM) and the chip designs, but they do not build actual hardware. Instead they license their designs to Samsung, Qualcomm, Huawei, Apple, ... and those companies build the chips.
So you end up with a lot of different ARM processors; they all support the ARM instructions, but they are built by Samsung or Huawei.. not by ARM.

I see. So basically you "sync" the refresh of param changes at a fixed rate (the smoothing length).
If I change a param value in between blocks (after 2 blocks, where only 500 samples have been processed instead of 1000, for example), you simply ignore the param change at that step, "delaying" it to the next block, right?
Yes and no. I "sync" the refresh (aka the control rate) to the sample rate, but the smoothing length is the length of the interpolation (= how many steps from t=0 to t=1). This does not necessarily have to be the same as the control rate (but being it, or a factor of it, helps a lot ;) ).
If I pick a 1:1000 control rate to sample rate ratio at 48kHz, I read the UI value on sample 0, sample 1000, sample 2000, ..
If it changes on sample 500, the change will be picked up 500 samples later.
I avoid any kind of dynamic / event-driven logic in there because it would make the control rate dynamic, depending on how many UI events you receive within a certain timeframe. Like: how many control param samples is 1 second? No idea, this will change if the user moves the knob, depending on how many UI events arrive within this second.
It just complicates things... :D I prefer a fixed control rate : sample rate ratio.
If a 1:1000 control rate to sample rate ratio is too slow / you want to pick up changes faster, simply reduce the ratio. 1:1000 at a 48kHz sample rate would be 20.8ms of worst-case latency; a good drummer can notice this. 1:64 would be 1.3ms - nobody can notice this, and you have still reduced the rate of the control signal by a factor of 64 compared to the audio signal.
Nowhk wrote: Since the smoothing window is pretty small (probably 100 samples @ 44100 is enough; or which length is preferable?), in effect this won't really be noticeable.
If it is only about avoiding the click, and the fade shouldn't be noticeable, I usually go for something around 10ms. That would be 480 samples at 48kHz.

Post

I see! Thank you so much!!!!!!
I'm learning a lot :)

Post

I've basically converted all the code I did with IPP to use SSE2 intrinsics: nice, now I get what you mean about making functions "complex". It's better to stay within the same registers for reads/math, in the same block. Performance is better :)

But now: what about computing math not covered by SSE2?
For example, exp: it seems logical to use IPP for (only) such a function... which is an optimized approximation. Do you?
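For reference, the kind of rough approximation I could hand-roll with plain SSE2 would be something like a Schraudolph-style bit-trick (just a sketch, only a few percent accurate, so presumably much worse than IPP's optimized version):

Code: Select all

#include <emmintrin.h>  // SSE2

// Very rough exp() for 4 floats at once: builds 2^(x/ln2) by writing the
// exponent bits directly. Good enough for some envelope/gain curves only.
static inline __m128 fast_exp_ps(__m128 x)
{
    const __m128  a = _mm_set1_ps(12102203.0f);    // 2^23 / ln(2)
    const __m128i b = _mm_set1_epi32(1064866805);  // 127*2^23 minus an error-correction term
    __m128i i = _mm_add_epi32(_mm_cvtps_epi32(_mm_mul_ps(a, x)), b);
    return _mm_castsi128_ps(i);                    // reinterpret the integer bits as floats
}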
