PurpleSunray wrote: ↑Mon Dec 17, 2018 3:01 pm
At the moment you would run 1 loop with SSE2 mul+add (your code) vs 1 loop with AVX2 mul + 1 loop with AVX2 add (IPP). In your code the compiler might be able to keep the mul result in a register for the add, so you save a store and a load. With IPP you store after the mul only to load it again for the add. At some point, the memory-moving overhead will negate the improvement you get via IPP; just make your formula complex enough.

OK, I see what you mean: using "basic" (not already "encapsulated") functions helps when chaining computations. Clear!
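Just to check my understanding, here is a rough sketch of the two shapes being compared. It is plain scalar C++ (no IPP calls, no intrinsics), and the function names are made up for illustration. The only point is that the two-pass version pushes the product through memory, while the single loop can keep it in a register:

```cpp
#include <cassert>
#include <cstddef>

// Two passes, library style: the product is written to tmp,
// then read back from memory for the add.
void mul_then_add(const float* a, const float* b, const float* c,
                  float* tmp, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) tmp[i] = a[i] * b[i];    // pass 1: store
    for (std::size_t i = 0; i < n; ++i) out[i] = tmp[i] + c[i];  // pass 2: reload
}

// One pass, hand-written style: the product never has to leave a register.
void mul_add_fused(const float* a, const float* b, const float* c,
                   float* out, std::size_t n) {
    assert(a && b && c && out);  // hypothetical precondition for this sketch
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] * b[i] + c[i];
}
```

Both produce the same values; whether the fused loop is actually faster would have to be measured on the real formula and buffer sizes.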
PurpleSunray wrote: ↑Mon Dec 17, 2018 3:01 pm
That's the whole point... you wanna put all that effort into a dispatcher and an additional AVX2 code branch just to bring it down to hmm.. 60ms on Cannon Lake?

I see, but what do you do when the target platform (CPU) doesn't support SSE2?
Or (more importantly) what do you do if you use Intel SSE2 intrinsics and the CPU that runs your plugin is an AMD?
Some sort of dispatcher is still required, isn't it?
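By "dispatcher" I mean something like the sketch below: pick a kernel once at startup and call it through a function pointer. This is only a minimal sketch under assumptions. The function names are hypothetical, and the runtime check uses the GCC/Clang builtin `__builtin_cpu_supports`, so other compilers would need their own CPUID query. As far as I can tell the check asks about the feature bit, not the vendor, so an AMD CPU that reports SSE2 should take the same branch as an Intel one:

```cpp
#include <cassert>
#include <cstddef>

using MulAddFn = void (*)(const float*, const float*, const float*,
                          float*, std::size_t);

// Portable fallback: works on any CPU.
void mul_add_scalar(const float* a, const float* b, const float* c,
                    float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = a[i] * b[i] + c[i];
}

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <emmintrin.h>
#define SKETCH_X86_DISPATCH 1

// SSE2 branch: 4 floats per iteration, scalar tail for the remainder.
__attribute__((target("sse2")))
void mul_add_sse2(const float* a, const float* b, const float* c,
                  float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        _mm_storeu_ps(out + i, _mm_add_ps(_mm_mul_ps(va, vb), vc));
    }
    for (; i < n; ++i) out[i] = a[i] * b[i] + c[i];
}
#endif

// Called once (e.g. at plugin init); the result is cached by the caller.
MulAddFn pick_mul_add() {
#ifdef SKETCH_X86_DISPATCH
    __builtin_cpu_init();  // GCC/Clang builtin; must run before the query below
    if (__builtin_cpu_supports("sse2")) return mul_add_sse2;
#endif
    return mul_add_scalar;
}
```

An AVX2 branch would be one more function plus one more `__builtin_cpu_supports("avx2")` test placed before the SSE2 one.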
Isn't linear interpolation "more complex" to implement for smoothing params (which depends on variable buffers)?
I mean: I would need to keep track of the "start" step, the "end" step and the "counter" step, and reset the start/counter/end points every time the param value changes (introducing another check, i.e. is it different from the previous value?)...
Or maybe there's a fancy way? How do you smooth params with linear interpolation?
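To make the question concrete, this is roughly the bookkeeping I have in mind. It is a minimal sketch, not taken from any library: the struct and member names are hypothetical, and the ramp length is an arbitrary assumption. It stores a per-sample step and a countdown instead of explicit start/end points, and the state lives across calls so the host buffer size should not matter:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical per-parameter linear smoother.
struct LinearSmoother {
    float current = 0.0f;          // value returned for the last sample
    float target = 0.0f;           // value the ramp is heading to
    float step = 0.0f;             // per-sample increment
    std::size_t remaining = 0;     // samples left in the current ramp
    std::size_t rampSamples = 64;  // ramp length (assumed, tune to taste)

    // Jump straight to a value, e.g. on plugin init.
    void reset(float v) {
        current = target = v;
        step = 0.0f;
        remaining = 0;
    }

    // The "is it different from the previous value?" check lives here.
    void setTarget(float v) {
        if (v == target) return;
        target = v;
        if (rampSamples == 0) { reset(v); return; }
        remaining = rampSamples;
        step = (target - current) / static_cast<float>(remaining);
    }

    // Call once per sample, from any buffer size.
    float next() {
        if (remaining == 0) return current;
        current += step;
        if (--remaining == 0) current = target;  // snap to avoid float drift
        return current;
    }
};
```

Usage would be `setTarget()` whenever the host reports a param change and `next()` inside the sample loop. A change arriving mid-ramp simply starts a new ramp from wherever `current` is at that moment.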