First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?
-
- KVRAF
- 2256 posts since 29 May, 2012
Non-Intel CPUs do not fail, but Intel's code may not be optimal for them. There are also initialization functions that need to be called for proper execution. For IPP, that's ippInit (or maybe ippInitStatic). That selects the optimal code for the CPU in use. For MKL, I don't recall needing to call any initialization function, but that's a possibility, as we rarely use MKL.
~stratum~
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
Oh nice! It seems I don't need any extra effort, since by default it dispatches the optimal function automatically based on the running CPU: https://software.intel.com/en-us/articl ... -functions
-
- KVRAF
- 2256 posts since 29 May, 2012
If ippInit fails to detect an AMD processor correctly, you can also use this: https://software.intel.com/en-us/ipp-de ... pufeatures Not really practical to test, though, as you would need many different machines.
~stratum~
- KVRAF
- 2239 posts since 25 Sep, 2014 from Specific Northwest
I played around with auto-vectorization using pragma hints, but the compiler just kinda laughed at my code and threw up its hands. Unfortunately, not much of what I do lends itself to vectorization. There's nothing I use multiply/adds for that uses constants--any scalars are usually computed per sample. I finally gave up. The compiler already gets me 40-50% on the Mac side of things. I forget what I get on the PC side, but it's quite a bit less, just due to the compiler. The MS defaults seem to do quite a bit of optimization automatically. In the end, I get about the same end result (my PC specs are almost identical to my Mac specs).
That said, do learn about how your compiler works so you can set up your code for the compiler to take advantage of it. Keep things simple in your for loops, especially.
And I do recommend using a SIMD framework as it will save you a ton of time and hair loss if you do want to go down this road.
I started on Logic 5 with a PowerBook G4 550Mhz. I now have a MacBook Air M1 and it's ~165x faster! So, why is my music not proportionally better?
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
Yeah, in fact I'm trying IPP, which seems nice.
Do you link those libraries statically or dynamically? For the latter, I believe you include the DLLs in your install bundle.
Any performance differences? Especially when dispatching...
- u-he
- 28065 posts since 8 Aug, 2002 from Berlin
Auto-vectorization *never* worked for me. Never. The only wonder I've ever seen was for this loop: for( int i = 0; i < 256; i++ ) var[ i ] = i; - the compiler did amazing things for this. Never figured out how it worked.
We use vector intrinsics wrapped into objects. This usually gives us 2x the performance over scalar code. I more and more use templated functions so that scalar code and vectorized code are identical, and I just implement for either float or float vector. This makes the scalar code a tad slower (no conditional branches), but then it's only there for reference anyway.
Wrapping intrinsics into objects has helped the transistion from PowerPC to Intel, and possibly to ARM as well (whenever/if that's taking over).
-
- KVRAF
- 2256 posts since 29 May, 2012
If it's for a plugin you should use the static version otherwise those shared libraries will clutter the user's plugin folder.
No. But if you are sure about the CPU type (for example, AVX only), then there is a way to avoid dispatching altogether: https://software.intel.com/en-us/articl ... ce-guide#3 — the section named "Single Processor Static Linkage".
~stratum~
- KVRAF
- 7890 posts since 12 Feb, 2006 from Helsinki, Finland
With regards to branches: ISPC uses a strategy similar to GPUs, where you maintain masks for conditions, then evaluate all branches (using predication, which, if not supported by hardware, can be emulated by bitwise logic or similar) that are required for at least one "thread." This method can support essentially arbitrary control flow. For example, with loops you simply keep looping until your mask indicates that all the "threads" are done. Obviously you only get the full benefit from SIMD when your control flow is reasonably coherent.
- KVRian
- 1091 posts since 8 Feb, 2012 from South - Africa
Yeah, auto-vectorization tends not to work most of the time, especially with the fun stuff (i.e. anything with feedback). Sometimes just rearranging an algorithm can help a bit; in cases where vector and scalar code are in use at the same time, forwarding stalls can eat performance, especially on older machines (e.g. Sandy Bridge). I really should just sit down one day and figure out how to do a simple 1-pole 1-zero in vector form; they have a habit of breaking my intrinsic sequences often.
A template/wrapper is a bit of upfront work but will save you a lot of time in the long run.
P.S. Somebody at Intel had Fabrication-Diarrhea when they did AVX512, all those different versions seems pretty half-baked to me.
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
Why would one do this? AVX is not supported by all CPUs. Users without it won't be able to use the plug-in. Isn't the dispatch a better deal? At least it works; not optimized, but it still works.
Is dispatch so heavy?
-
- KVRAF
- 2256 posts since 29 May, 2012
Roughly the same as calling a virtual function, probably. Not an issue with large amounts of data (image processing). For audio, I don't know (didn't measure).
~stratum~
-
Richard_Synapse
- KVRian
- 1136 posts since 20 Dec, 2010
Of course dispatching is fine when it works. But when it doesn't work, it's worse than not having it at all, because your plugins will simply crash without any warning message. Note that there's more than one potential source of crashes: the CPU, the OS, as well as the packages you use (such as IPP) come to mind.
Richard
Synapse Audio Software - www.synapse-audio.com
- KVRian
- Topic Starter
- 878 posts since 2 Oct, 2013
From what I've gathered about IPP, it resolves at runtime (by default) which overload to call, based on the current CPU. So in the worst case (i.e. no CPU match), it will call the "basic", non-optimized one.
Not sure what you mean by "crash" here.
Any example?
-
- KVRAF
- 2256 posts since 29 May, 2012
IPP doesn't crash because it fails to detect the CPU. Rather, it may crash because on your own machines it dispatches to AVX and SSE2 (assuming those are your test CPUs), whereas it may have a bug in the MMX implementation, which might be a forgotten piece of rusty code, and a customer may discover that. Just theoretical, you may say, but a possibility nevertheless.
~stratum~
-
Richard_Synapse
- KVRian
- 1136 posts since 20 Dec, 2010
Like I wrote, if it works properly, then yes. But compilers and libraries are not free of bugs; dispatch issues have been around ever since SSE (of course they get fixed at some point, so you may want to google whether your specific compiler version or library has such issues or not).
A crash is typically caused by an illegal instruction.
Richard
Synapse Audio Software - www.synapse-audio.com