KVR Audio

Caco · Post by **Caco** » Mon Jun 21, 2010 7:50 am

I am starting a new thread based on a discussion that arose in the MVerb thread about CPU efficiency and the optimizations performed by different compilers, after MVerb was ported to Linux by ccern and lubomir.ivanov.

lubomir.ivanov found that there was a noticeable difference between the msvc and mingw versions:

lubomir.ivanov wrote: i get the following performance differences from less to more cpu usage on my 10y/o pc:
msvc (original Mverb.dll?) < mingw-3.4.2-official < mingw-4.4.1-tdm < mingw-4.5.0-tdm

the mingw order is a bit in contradiction with what the gcc improvements and changes in standard optimization flags suggest.

interesting to investigate a bit more into this...

This matches my own findings. I originally developed MVerb using mingw to keep the code as cross-platform as possible, but for the binary release I compiled with msvc because the resulting VST was much more CPU efficient. I wonder how the Intel compiler would perform?

Ten instances of MVerb in REAPER:

My original version using msvc = 18.2%
My original version using mingw 3.4.2 = 23.7%
My original version using mingw 4.4.1 = 29.2%
ccern's version for windows cross-compiled under Linux using axonlib and mingw = 32.3%
(Compiler flags were basic and kept the same for all, e.g. -O3 -Os)

The later versions of mingw appear to produce less efficient code for both myself and lubomir.ivanov. It would be interesting to find out what other people have seen. Of course, this all depends on the actual code, the compiler flags, etc. Perhaps mingw would outperform msvc if the code were rewritten or different compiler flags were used?

MichaelAudio · Post by **MichaelAudio** » Mon Jun 21, 2010 8:16 am

Caco wrote: (Compiler flags were basic and kept the same for all, e.g. -O3 -Os)

In that case -Os will override -O3 and -Os will optimize for _size_!
http://gcc.gnu.org/onlinedocs/gcc-4.5.0 ... ze-Options

-Os basically includes all optimization flags of -O2 but will disable some of them and optimize size over speed (as far as I know). Also I found that for DSP many times -O2 yields better results than -O3.

From http://www.network-theory.co.uk/docs/gc ... ro_49.html :

-O0 or no -O option (default)
At this optimization level GCC does not perform any optimization and compiles the source code in the most straightforward way possible. Each command in the source code is converted directly to the corresponding instructions in the executable file, without rearrangement. This is the best option to use when debugging a program and is the default if no optimization level option is specified.
-O1 or -O
This level turns on the most common forms of optimization that do not require any speed-space tradeoffs. With this option the resulting executables should be smaller and faster than with -O0. The more expensive optimizations, such as instruction scheduling, are not used at this level. Compiling with the option -O1 can often take less time than compiling with -O0, due to the reduced amounts of data that need to be processed after simple optimizations.
-O2
This option turns on further optimizations, in addition to those used by -O1. These additional optimizations include instruction scheduling. Only optimizations that do not require any speed-space tradeoffs are used, so the executable should not increase in size. The compiler will take longer to compile programs and require more memory than with -O1. This option is generally the best choice for deployment of a program, because it provides maximum optimization without increasing the executable size. It is the default optimization level for releases of GNU packages.
-O3
This option turns on more expensive optimizations, such as function inlining, in addition to all the optimizations of the lower levels -O2 and -O1. The -O3 optimization level may increase the speed of the resulting executable, but can also increase its size. Under some circumstances where these optimizations are not favorable, this option might actually make a program slower.
-Os
This option selects optimizations which reduce the size of an executable. The aim of this option is to produce the smallest possible executable, for systems constrained by memory or disk space. In some cases a smaller executable will also run faster, due to better cache usage.

Hope that helps.

helium · Post by **helium** » Mon Jun 21, 2010 8:25 am

edit: too slow, see MichaelAudio

Caco · Post by **Caco** » Mon Jun 21, 2010 8:59 am

That is interesting, MichaelAudio. These flags were based on what ccern reported using for his Linux builds:

ccern wrote: we're using only some basic compiler flags (opt="-mfpmath=387 -O3 -Os"),

I have added the CPU usage for the mingw builds using the compiler flag -O2 without -Os:

mingw 3.4.2 = 21.1%
mingw 4.4.1 = 23.2%

Both have come down in CPU usage, especially the mingw 4.4.1 build, but are still a few percent worse than using msvc. I am still getting more CPU efficient code from mingw 3.4.2 though. The dll file size differs between the two compilers even with the same flags, so they are clearly generating different code. It is a lot better than before though.

It will be interesting to see what would happen using more exotic flags.

tor.helge.skei · Post by **tor.helge.skei** » Mon Jun 21, 2010 9:11 am

very interesting!!
thanx for the info about -Os
re-tried everything, but in win7, and with gcc 4.4.1 (TDM-2 mingw32)
reaper, win7, 44.1 khz

50 instances of your original MVerb: 26%
50 of axonlib version: 25.7%

so, the differences are shrinking...
then, i tried a bunch of more exotic flags:

-march=nocona
-O3
-fexpensive-optimizations
-combine
-ffast-math
-ftree-vectorize
-msse
-funroll-loops
-fvariable-expansion-in-unroller
-funsafe-loop-optimizations
-ftree-loop-im
-funswitch-loops
-ftree-loop-ivcanon
-ftracer
-fprefetch-loop-arrays
-fno-exceptions
-freorder-blocks-and-partition
-funsafe-math-optimizations
-ffinite-math-only
-fdata-sections

and the result?
50 instances: 21.6%...
yoho !!

- ccernn

Caco · Post by **Caco** » Mon Jun 21, 2010 9:52 am

tor.helge.skei wrote: then, i tried a bunch of more exotic flags: [...]
and the result?
50 instances: 21.6%...

Awesome, looks like I can stick to using mingw instead of msvc. Just tried it here using all the compiler flags above in win xp with gcc 4.4.1, and the mingw version is now at 17.2% for 10 instances in Reaper.

lubomir.ivanov · Post by **lubomir.ivanov** » Mon Jun 21, 2010 1:40 pm

MichaelAudio wrote: In that case -Os will override -O3 and -Os will optimize for _size_!
http://gcc.gnu.org/onlinedocs/gcc-4.5.0 ... ze-Options

noticed the override fact just yesterday, so thanks for confirming.
for mverb and all plugins that i've tried testing it seems that -O3 generates faster code than -O2, but there are a lot of variables involved..

tor.helge.skei wrote: -march=nocona
-O3
-fexpensive-optimizations
-combine
-ffast-math
-ftree-vectorize
-msse
-funroll-loops
-fvariable-expansion-in-unroller
-funsafe-loop-optimizations
-ftree-loop-im
-funswitch-loops
-ftree-loop-ivcanon
-ftracer
-fprefetch-loop-arrays
-fno-exceptions
-freorder-blocks-and-partition
-funsafe-math-optimizations
-ffinite-math-only
-fdata-sections

i admit i have tried some of those, but i haven't checked in detail what they do. using "-ffast-math" alone will probably generate faster code than the msvc version. but i doubt that msvc's optimization for speed does what "-ffast-math" does, since it breaks some standards and may also break the actual code (it did for me once). i suggest that we leave "-mfpmath=387 -O3" as the default.

Caco wrote: Both have come down in CPU usage, especially the mingw 4.4.1 build, but are still a few percent worse than using msvc. I am still getting more CPU efficient code from mingw 3.4.2 though.

i get faster code from mingw-4.4.1-tdm with just -O3 than from mingw-3.4.2-official, but both are a pinch slower than the msvc build. adding some of the extra flags that ccernn has posted (even excluding fast-math) makes the gcc build faster.

Caco wrote: Awesome, looks like I can stick to using mingw instead of msvc. Just tried it here using all the compiler flags above in win xp with gcc 4.4.1, and the mingw version is now at 17.2% for 10 instances in Reaper.

yeah, mingw is definitely a substitute for msvc, and gcc is also free and multiplatform (so double win). i don't have any plans to use msvc at all..

unfortunately, apart from the occasional compiler bugs, one of the major issues with mingw is its C++ ABI, which is incompatible with msvc's. for example, reaper extensions *have* to be compiled with msvc, and mingw will not work for these.
(following examples on the web) i have run some tests in which a mingw library is used properly, but this requires either including the definitions in the host app or manually renaming the mangled names before linking.

i would like to address this issue as well on this forum. any ideas how to make it easy for users (preferably with some automation)? is mingw going to add some magical options for this?

Caco · Post by **Caco** » Mon Jun 21, 2010 1:58 pm

lubomir.ivanov wrote: yeah, mingw is definitely a substitute for msvc, and gcc is also free and multiplatform (so double win). i don't have any plans to use msvc at all..

unfortunately, apart from the occasional compiler bugs, one of the major issues with mingw is its C++ ABI, which is incompatible with msvc's. for example, reaper extensions *have* to be compiled with msvc, and mingw will not work for these.
(following examples on the web) i have run some tests in which a mingw library is used properly, but this requires either including the definitions in the host app or manually renaming the mangled names before linking.

i would like to address this issue as well on this forum. any ideas how to make it easy for users (preferably with some automation)? is mingw going to add some magical options for this?

Definitely. I have had issues before when using mingw because of code that relied on MS-specific features.

Now that I can get code running at equivalent speeds using mingw, I would much prefer to use just gcc/mingw in future.

Ninjan · Post by **Ninjan** » Tue Jun 22, 2010 10:50 am

I think this may be interesting to read.
http://www.gamasutra.com/view/feature/4 ... _simd_.php

It compares GCC, Intel and Microsoft SIMD compilation.

The author also goes into float comparison, and into how to write C++ to get the most efficient code depending on the compiler.

In short, very informative.
Hope that helps out.

-Che

Compiler Optimizations