Supporting AVX / AVX2 with reasonable effort


Post

The minimum requirement for Windows10 is a CPU with SSE2.
The minimum requirement for Windows11 is now a CPU with SSE4.1 (this has changed recently).

I am currently still compiling with SSE2 in Visual Studio for compatibility reasons and to be safe. There is no option for SSE3 or SSE4. However there are several options for AVX.
In my general-purpose tests I could only measure a small performance boost from AVX in practice (around 3%). Results might be quite different for FFTs with specialized libraries.

What SSE/AVX settings do you use in Visual Studio?
Is it safe these days to require AVX?
Is it safe these days to require SSE4.0? Are there still common CPUs out there that do not support it?
Last edited by Tone2 Synthesizers on Wed Feb 21, 2024 3:11 pm, edited 1 time in total.
We do not have a support forum on kvr. Please refer to our official location: https://www.tone2.com/faq.html

Post

I use runtime dispatching with code paths for SSE2, SSE4, AVX, AVX2 and AVX512. I get close to 4x and 8x speed ups on the 256 and 512 bit lanes using templated C++ intrinsic wrappers on critical paths. If I was being lazy about it I'd do an AVX2 path and SSE2 path. I suspect AVX2 is pretty common now and SSE2 will cover everyone else. I don't think there are flags for SSE3 and 4 in MSVC (there are in Clang). When using intrinsics, MSVC doesn't complain when using SSE4 instructions with SSE2 flag.
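
A minimal sketch of what "templated C++ intrinsic wrappers" could look like, so one kernel source serves several vector widths. The names vec1, vec4, native_vec and gain_offset are made up for illustration and are not the poster's actual code; only an SSE2 wrapper and a scalar fallback are shown, and an AVX or AVX-512 wrapper would follow the same pattern with __m256 or __m512.

```cpp
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#endif

// Hypothetical scalar "vector" of width 1: the portable fallback.
struct vec1 {
    static constexpr std::size_t width = 1;
    float v;
    static vec1 load(const float* p) { return {*p}; }
    static vec1 broadcast(float x) { return {x}; }
    void store(float* p) const { *p = v; }
    friend vec1 operator+(vec1 a, vec1 b) { return {a.v + b.v}; }
    friend vec1 operator*(vec1 a, vec1 b) { return {a.v * b.v}; }
};

#ifdef SKETCH_HAS_SSE2
// Hypothetical 4-wide wrapper around the SSE float intrinsics.
struct vec4 {
    static constexpr std::size_t width = 4;
    __m128 v;
    static vec4 load(const float* p) { return {_mm_loadu_ps(p)}; }
    static vec4 broadcast(float x) { return {_mm_set1_ps(x)}; }
    void store(float* p) const { _mm_storeu_ps(p, v); }
    friend vec4 operator+(vec4 a, vec4 b) { return {_mm_add_ps(a.v, b.v)}; }
    friend vec4 operator*(vec4 a, vec4 b) { return {_mm_mul_ps(a.v, b.v)}; }
};
using native_vec = vec4;
#else
using native_vec = vec1;
#endif

// out[i] = in[i] * gain + offset, written once for any wrapper type V.
template <class V>
void gain_offset(const float* in, float* out, std::size_t n, float gain, float offset)
{
    const V g = V::broadcast(gain);
    const V o = V::broadcast(offset);
    std::size_t i = 0;
    for (; i + V::width <= n; i += V::width)   // full vectors
        (V::load(in + i) * g + o).store(out + i);
    for (; i < n; ++i)                         // scalar tail
        out[i] = in[i] * gain + offset;
}
```

A critical path would then be called as gain_offset<native_vec>(in, out, n, gain, offset), with the wrapper type chosen per code path.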

Post

The mini NUC PC I bought 2 years ago has Celeron N5105 processor (release Q1 2021).
It doesn't even have AVX. I assumed it would, but it doesn't.
If you are also targeting bedroom producers or students on a tight budget, an AVX requirement can still be a problem today.
That said, current cheap PCs tend to ship with an Intel N95 or N100 (Q1 2023), and those support AVX2 :)

Post

I also bought a weak and cheap €195 Mini PC with a Celeron CPU in 2022. It has SSE 4.1 but no AVX.
Intel still sold Celeron and some crappy Pentium CPUs without AVX till 2020. :o

On the other hand, those who use such hardware most likely won't buy much software.

Post

Tone2 Synthesizers wrote: Tue Feb 20, 2024 8:04 am The minimum requirement for Windows10 is a CPU with SSE2.
The minimum requirement for Windows11 is now a CPU with SSE4.1 (this has changed recently).

I am currently still compiling with SSE2 in Visual Studio for compatibility reasons and to be safe. There is no option for SSE3 or SSE4. However there are several options for AVX.
In my general-purpose tests I could only measure a small performance boost from AVX in practice (around 3%). Results might be quite different for FFTs with specialized libraries.
There is almost nothing useful in between SSE2 (which is the default for x64 systems) and AVX that is worth targeting for DSP that mostly operates on floats. There are a couple of instructions that can be used in some rare cases (e.g. pmulld for implementing an RNG), but the overall speedup from these won't be significant.

The transition from SSE2 to AVX does enable VEX encoding, which allows non-destructive three-operand instructions, and that alone should account for a 5%-10% speedup due to fewer moves in the code. On top of that it is twice as wide, but only for floats, which makes it a little inconvenient for implementing elementary functions (e.g. sine, exp, tan, log, etc.).

Obviously, the speedup from the extra width only applies to successful auto-vectorization, which almost never happens in anything more complicated than trivial examples.

AVX2 adds all the remaining integer operations and, most importantly, comes with FMA (fused multiply-add) instructions. FMA instructions are hugely beneficial for DSP, even in pure scalar code. The slight catch is that AVX2 doesn't formally imply FMA support, even though almost all AVX2 CPUs have it (except one obscure VIA one). Even MSVC doesn't hesitate to generate FMA instructions when AVX2 is enabled, so it's a very safe assumption. GCC and Clang don't make that assumption, so FMA needs to be enabled separately (with -mfma).
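
To illustrate why FMA helps even scalar DSP code, here is a small sketch of polynomial evaluation with Horner's rule, where every step is exactly one multiply-add. The function name horner_fma is made up for this example. With FMA enabled the compiler can usually turn each std::fma call into a single instruction; without hardware FMA, std::fma may fall back to a much slower library routine, so treat this as an illustration rather than a drop-in.

```cpp
#include <cmath>
#include <cstddef>

// Evaluate c[0]*x^(n-1) + c[1]*x^(n-2) + ... + c[n-1] with Horner's rule.
// Each loop iteration is one fused multiply-add: acc * x + c[i], rounded once.
inline float horner_fma(const float* c, std::size_t n, float x)
{
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc = std::fma(acc, x, c[i]);
    return acc;
}
```

For example, the coefficients {2, 3, 4} describe 2x^2 + 3x + 4, which evaluates to 18 at x = 2.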

AVX512 isn't too widespread, and the majority of modern CPUs that do support it use "double pumping", which means that the instructions take twice as long to execute as the corresponding AVX2 ones. So in practice AVX512 is barely faster than AVX2 on those CPUs that support it. And Intel didn't enable AVX512 on CPUs with E-cores, which means that the instruction set is pretty much dead.

Now for the interesting part: dynamic dispatch is not trivial, in some cases if done in a naive way it can lead to crashes (https://randomascii.wordpress.com/2016/ ... any-speed/). As far as I am aware that issue is still not fixed on msvc so one should be extra careful.

Post

So the only reasonable way to support AVX without a headache is to build the plugin DLL twice: one version with AVX and one without. Then let the installer run CPUID, check for the AVX bit and install the right version.
So far I haven't found an easy way to do this in inno setup.

According to the steam database still 4% of the users have AVX incompatible CPUs. These people would experience hard crashes when AVX is used without a fallback.
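
A sketch of the CPUID check itself, in case it helps. Note that the AVX bit alone is not sufficient: the OS also has to save and restore the YMM registers, which is reported through OSXSAVE and XGETBV. The names CpuFeatures and detect_cpu_features are made up for this example; the bit positions are the documented ones (leaf 1 ECX bit 12 FMA, bit 27 OSXSAVE, bit 28 AVX; leaf 7 EBX bit 5 AVX2). On non-x86 targets the sketch simply reports nothing.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
  #include <intrin.h>
  #include <immintrin.h>
  #define SKETCH_X86 1
  static void cpuid(int leaf, int sub, std::uint32_t r[4]) {
      int regs[4];
      __cpuidex(regs, leaf, sub);
      for (int i = 0; i < 4; ++i) r[i] = static_cast<std::uint32_t>(regs[i]);
  }
  static std::uint64_t xcr0() { return _xgetbv(0); }
#elif defined(__x86_64__) || defined(__i386__)
  #include <cpuid.h>
  #define SKETCH_X86 1
  static void cpuid(int leaf, int sub, std::uint32_t r[4]) {
      __cpuid_count(leaf, sub, r[0], r[1], r[2], r[3]);
  }
  static std::uint64_t xcr0() {
      std::uint32_t lo, hi;
      __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(0));
      return (static_cast<std::uint64_t>(hi) << 32) | lo;
  }
#endif

struct CpuFeatures { bool avx = false; bool avx2 = false; bool fma = false; };

inline CpuFeatures detect_cpu_features()
{
    CpuFeatures f;
#ifdef SKETCH_X86
    std::uint32_t r[4];
    cpuid(0, 0, r);
    const std::uint32_t max_leaf = r[0];
    if (max_leaf < 1) return f;

    cpuid(1, 0, r);
    const bool cpu_fma = ((r[2] >> 12) & 1u) != 0;
    const bool osxsave = ((r[2] >> 27) & 1u) != 0;
    const bool cpu_avx = ((r[2] >> 28) & 1u) != 0;
    // XGETBV is only legal when OSXSAVE is set. XCR0 bits 1 and 2 mean the
    // OS preserves XMM and YMM state across context switches.
    const bool os_ymm = osxsave && ((xcr0() & 0x6u) == 0x6u);

    f.avx = cpu_avx && os_ymm;
    f.fma = cpu_fma && f.avx;
    if (f.avx && max_leaf >= 7) {
        cpuid(7, 0, r);
        f.avx2 = ((r[1] >> 5) & 1u) != 0;
    }
#endif
    return f;
}
```

A tiny helper executable built around detect_cpu_features() could report the result to the installer, for example through its exit code. How the installer consumes that is an assumption here, not something I have tested with Inno Setup.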

Post

To avoid linker issues with runtime dispatching, use templates and a processor tag that is defined at compile time, then for each class (or free function) you want to use across architectures do something like this:

template <class = current_arch>
void foo()
{
// no linker issues here as I'm unique across architectures...
}
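
A slightly fuller sketch of the same idea, assuming a shared header that is included from several source files, each built with different /arch (or -m) flags. The tag names arch_sse2, arch_avx, arch_avx2 and the functions sum and sum_entry are made up for this example. The point is that the tag becomes part of the mangled symbol name, so the linker can no longer merge an AVX2-compiled copy of a shared function with an SSE2-compiled one.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical tag types, one per instruction set the project builds for.
struct arch_sse2 {};
struct arch_avx {};
struct arch_avx2 {};

// __AVX__ and __AVX2__ are predefined by MSVC, GCC and Clang when the
// matching /arch or -m flag is active for this translation unit.
#if defined(__AVX2__)
using current_arch = arch_avx2;
#elif defined(__AVX__)
using current_arch = arch_avx;
#else
using current_arch = arch_sse2;
#endif

// Shared helper. sum<arch_sse2> and sum<arch_avx2> are distinct symbols,
// so each translation unit keeps the copy built with its own flags.
template <class Arch = current_arch>
float sum(const float* x, std::size_t n)
{
    float acc = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        acc += x[i];
    return acc;
}

// Per-file entry point. In a real project this would get a distinct name in
// each source file (say sum_sse2 and sum_avx2), and the runtime dispatcher
// would pick between those plain functions after checking CPUID.
float sum_entry(const float* x, std::size_t n)
{
    return sum(x, n);   // uses this file's current_arch
}

static_assert(!std::is_same<arch_sse2, arch_avx2>::value, "tags must be distinct types");
```

The dispatcher should only ever refer to the per-file entry points. If the baseline file named sum<arch_avx2> directly, it would instantiate that symbol with baseline flags and bring the duplicate-symbol problem back.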

Post

Tone2 Synthesizers wrote: Tue Feb 20, 2024 10:49 pm So the only reasonable way to support AVX without a headache is to build the plugin DLL twice: one version with AVX and one without. Then let the installer run CPUID, check for the AVX bit and install the right version.
As far as headaches go, this is much easier, obviously. But dynamic dispatch is possible with some precautions.

keithwood wrote: Tue Feb 20, 2024 11:26 pm To avoid linker issues with runtime dispatching, use templates and a processor tag that is defined at compile time, then for each class (or free function) you want to use across architectures do something like this:

template <class = current_arch>
void foo()
{
// no linker issues here as I'm unique across architectures...
}
This works (and generally it is recommended), but with a few caveats:
1. Do not use the C++ standard library, which unfortunately includes the math library. It is possible to write replacements for the most common math functions, and by cutting some precision and error checking they can even be preferable to the standard ones.
2. Every class that crosses the dispatch boundary must be fully forward-declared, and that includes default constructors too. The implementation of such classes must be compiled in a file built with the lowest instruction set available.

Post

Well, you generally have to develop for NEON too which is four floats. I'd be interested to know if people are using SIMD in such a way that you can enable AVX and see a real corresponding performance improvement.

For my own part I've seen performance improvements in FFT library, stereo processing for filters etc. Didn't see any improvement for wavetable rendering largely because it's memory bound.

Post

I did some benchmarks with typical synthesizer dsp code.
Visual Studio 17, Ryzen 12 Core, Windows 11.

AVX has a performance boost of 7% compared to compiling with SSE2/SSE4.1.
AVX2 has a performance boost of 15% compared to compiling with SSE2/SSE4.1.

According to the steam database 96% of the PC users have AVX and 92% have AVX2.

So if you support AVX you already should support AVX2 as the older standard does not have enough benefit for the additional effort.

An interesting find is also that Visual Studio does branch stuff automatically with SSE2 vs SSE 4.1 in 64 Bit mode. So I think chances are not bad that it does this also with AVX in future versions?

Post

2DaT wrote: Tue Feb 20, 2024 6:55 pm Transition from SSE2 to AVX does enable VEX encoding, which allows non-destructive moves and that alone should account for 5%-10% speedup due to fewer moves in the code. On top of that it is twice as wide, but only for floats, which makes it a little inconvenient for implementing elementary functions (e.g. sine, exp, tan, log, etc.).
Theoretically register-register renames shouldn't really cost anything more than decode, so whether or not you see speedup just from 3-reg vs. 2-reg in practice probably depends on where the bottleneck happens to be. Likewise, whether double-wide is significantly faster depends on whether you have a "real" AVX processor or one that breaks them down into two 4-wide operations (the same "double pumping" has been a thing every time especially lower end CPUs add support for wider stuff), though obviously in both cases we save half the decode.

edit: The decode renaming actually means that it makes sense to use vector moves even in scalar code, because (say) MOVAPS is a register rename while MOVSS is a shuffle (in SSE encoding; apparently when using the VEX encoding from AVX it can also be told to zero the upper bits like Neon FMOV, which is fine as it breaks the dependency and therefore allows the fast path).
Obviously, speedup from wideness only applies to successful auto-vectorization, which almost never happens in anything that is more complicated than the trivial examples.
Exactly, and this is probably why 4-wide is so successful: it's often enough to make the cost of SIMD code low enough vs. other non-SIMD code that going wider gives diminishing benefits in most cases where you aren't just crunching huge vectors.

Post

Ableton Live 12 now requires AVX2 for its Windows build.
Let's see how others do in the next major updates.
