Audio Programming Environment 0.5.0: C++ DSP directly in your DAW!

DSP, Plugin and Host development discussion.

Post

Great, maybe finally I'll code my own plugin :)

Post

mystran wrote: Fri Feb 12, 2021 11:58 am
camsr wrote: Fri Feb 12, 2021 8:17 am The compiler supports function-scope optimization attributes, so that may be useful. I would not compile an entire program using -Ofast just to be safe...
I've been compiling basically everything with -ffast-math or equivalent since the 90s, it's even the default in ICC with optimized builds I think. During that time I've seen maybe 2-3 issues related to this and all of them were easy to fix (eg. use isnan() instead of x!=x to check for NaNs, stuff like that).
Sure, there are small issues with -ffast-math, but -Ofast specifically I would avoid. -O2 is the safest setting for plugins, and it does most of the useful optimizations.

Post

I always use -Ofast for deployment and have never had any issues with -ffast-math or using x != x to test for NaNs. This is using clang on Mac, but I've never seen any issues with gcc on the PC side either.

I think optimization features have gotten better/smarter over the last few compiler iterations. I've stopped trying to game my compiler and now just let it do its thing.

Oh, and at least on the clang side of things, the gains for -Ofast over -O2 are quite noticeable. Oddly, -Os does better too, but I think that's because it fits in cache and doesn't require paging.
I started on Logic 5 with a PowerBook G4 550Mhz. I now have a MacBook Air M1 and it's ~165x faster! So, why is my music not proportionally better? :(

Post

syntonica wrote: Sat Feb 13, 2021 1:04 am I always use -Ofast for deployment and have never had any issues with -ffast-math or using x != x to test for NaNs.
I can't remember if it's MSVC or ICC (or maybe both) that will optimize such a branch away with fast-math, but using isnan() works fine and is arguably cleaner anyway. The bigger danger is that in theory the compiler could optimize a numerically stable algorithm into a less stable one, but in practice I've yet to see this happen where it would matter.
Oh, and at least on the clang side of things, the gains for -Ofast over -O2 are quite noticeable. Oddly, -Os does better too, but I think that's because it fits in cache and doesn't require paging.
The -Ofast switch is really just -O3 -ffast-math and it's the latter that enables the compiler to actually optimize floating point in a meaningful way. In some cases that might be the difference between a nice fast auto-vectorized loop vs. much slower scalar code. The tricky question is the -O3 part which allows the compiler to spend more time optimizing and (more importantly) generate larger code, although I can't seem to find the exact details of what it does in clang currently. It's typically a win for hot loops, but it could be a loss for less frequently executed stuff.

As for -Os, that's a tricky one. While "paging" isn't really a thing these days, it's true that optimizing for size can be profitable for code that typically runs cache-cold and doesn't spend a whole lot of time in loops (eg. apparently it's a win for the Linux kernel). I don't know about clang, but the general trouble with this is that some compilers (eg. GCC, MSVC) can generate pretty silly code just to save a byte or two if you tell them to optimize for size (MSVC actually generates completely ridiculous code when optimizing for size, but fortunately one can instead tell it to optimize for speed and favor smaller code; this way your code isn't three times slower just to be 20% smaller).
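To make the NaN-check point concrete, here's a minimal sketch of the two spellings (the helper name is mine, not from any SDK):

```cpp
#include <cmath>

// Under -ffast-math the compiler may assume NaNs never occur and fold
// `x != x` to false, silently breaking the check. std::isnan() states the
// intent explicitly and is the fix described above (though pedantically,
// -ffinite-math-only can affect even isnan() on some compilers, so some
// projects compile just their NaN-flushing code without fast-math).
inline bool is_bad_sample(float x)
{
    return std::isnan(x);   // prefer this...
    // return x != x;       // ...over this, which fast-math may optimize away
}
```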

Post

camsr wrote: Fri Feb 12, 2021 9:39 pm
mystran wrote: Fri Feb 12, 2021 11:58 am
camsr wrote: Fri Feb 12, 2021 8:17 am The compiler supports function-scope optimization attributes, so that may be useful. I would not compile an entire program using -Ofast just to be safe...
I've been compiling basically everything with -ffast-math or equivalent since the 90s, it's even the default in ICC with optimized builds I think. During that time I've seen maybe 2-3 issues related to this and all of them were easy to fix (eg. use isnan() instead of x!=x to check for NaNs, stuff like that).
Sure, there are small issues with -ffast-math, but -Ofast specifically I would avoid. -O2 is the safest setting for plugins, and it does most of the useful optimizations.
See my previous message: -Ofast is -O3 + fast-math and the only "unsafe" part is fast-math. If you don't enable fast-math, then you should expect your floating-point code to not get much of any meaningful optimization.

Post

syntonica wrote: Sat Feb 13, 2021 1:04 am I think optimization features have gotten better/smarter over the last few compiler iterations. I've stopped trying to game my compiler and now just let it do its thing.
There is actually one thing one can do to really help a modern compiler: tell it about (the lack of) aliasing.

Basically when you have something like a function that reads from one buffer and then writes to another (which is not exactly an uncommon situation in DSP code; the famous process() function is like this), the compiler will often be able to optimize this a lot better if it can figure out that the two buffers are either completely distinct (use "restrict") or the same (use the same pointer).

While compilers can sometimes figure these things out through inter-procedural analysis (especially for inline functions or within a single translation unit), in the worst case the compiler must assume that the buffers can partially overlap and this means that (unless it wants to dispatch at run-time, which is bad for cache) it has to more or less preserve the exact order in which loads and stores were written in code (ie. no common sub-expression elimination or rematerialization of loads across stores, no load/store reordering to enable auto-vectorization; the whole situation truly sucks for a compiler).

With a good modern compiler, most other things you can think of are probably irrelevant, but aliasing is sometimes really tricky for a compiler to prove one way or another, yet it's also something where the compiler can potentially save a lot of (slow) loads and stores if you give it better information to work with.
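A minimal sketch of what that looks like in practice (the function name and signature here are illustrative, not from any particular SDK):

```cpp
#include <cstddef>

// Declaring the buffers __restrict promises the compiler that they never
// overlap, so it is free to reorder loads and stores and auto-vectorize the
// loop. (__restrict is the common C++ extension in GCC/Clang/MSVC; C99 has
// the standard `restrict` keyword.)
void process(const float* __restrict in, float* __restrict out,
             std::size_t frames, float gain)
{
    for (std::size_t n = 0; n < frames; ++n)
        out[n] = in[n] * gain;  // no aliasing assumed: vectorizable
}
```

Without the qualifiers, the compiler has to assume `out` might alias part of `in` and preserve the original load/store order.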

Post

After experiencing problems using -O3 I decided to get into the fine details of compiler optimization settings. Ever since, I have used -O2 as a basis for including more optimizations. Settings may depend on which version of the compiler is in use and how the resulting binary is used as well. I have yet to encounter a situation where -O2 causes an issue, other than a slight lack of optimization.

I noticed -O3 likes to generate lots of small critical sections, for what purposes IDK.

Post

syntonica wrote: Sat Feb 13, 2021 1:04 am I think optimization features have gotten better/smarter over the last few compiler iterations. I've stopped trying to game my compiler and now just let it do its thing.
GCC 10 added a 4th FRE optimization pass, which does a recursion for better folding. I have no idea if clang does the same.

Post

camsr wrote: Sat Feb 13, 2021 9:11 am After experiencing problems using -O3 I decided to get into the fine details of compiler optimization settings. Ever since, I have used the -O2 as a basis for including more optimizations.
Both -O2 and -O3 only do "safe" optimizations (and really they are typically about 99% the same), so if you experience problems with one and not the other, then it's probably a problem with your code and I'd suggest getting to the bottom of it, because the numbered levels are moving targets: what -O3 does today, -O2 might do tomorrow.

Post

camsr wrote: Sat Feb 13, 2021 9:48 am
syntonica wrote: Sat Feb 13, 2021 1:04 am I think optimization features have gotten better/smarter over the last few compiler iterations. I've stopped trying to game my compiler and now just let it do its thing.
GCC 10 added a 4th FRE optimization pass, which does a recursion for better folding. I have no idea if clang does the same.
I'm sorry, but what is "FRE"? It's not a common acronym for any optimization that I'm aware of. Maybe you're talking about PRE, although I'm not exactly sure what that has to do with folding....

Post

mystran wrote: Sat Feb 13, 2021 2:54 pm I'm sorry, but what is "FRE"? It's not a common acronym for any optimization that I'm aware of. Maybe you're talking about PRE, although I'm not exactly sure what that has to do with folding....
From GCC Optimize Options:
-ftree-fre
Perform full redundancy elimination (FRE) on trees. The difference between FRE and PRE is that FRE only considers expressions that are computed on all paths leading to the redundant computation. This analysis is faster than PRE, though it exposes fewer redundancies. This flag is enabled by default at -O and higher.
https://gcc.gnu.org/onlinedocs/gcc/Opti ... tions.html
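A small illustration of the distinction the quoted passage draws (hypothetical functions, just to show the two shapes of redundancy):

```cpp
// "Full" redundancy (FRE's case): b * c is computed on *every* path reaching
// the final use, so the last computation is always redundant and can simply
// reuse the earlier value.
int fully_redundant(int b, int c, bool flag)
{
    int x = flag ? b * c + 1 : b * c - 1;  // b * c on both paths
    return x + b * c;                      // FRE can eliminate this b * c
}

// "Partial" redundancy (PRE's case): b * c is computed on only one path, so
// the final computation is redundant only sometimes; PRE hoists a copy onto
// the other path to make it fully redundant first.
int partially_redundant(int b, int c, bool flag)
{
    int x = flag ? b * c : 0;              // b * c on one path only
    return x + b * c;                      // needs PRE, not just FRE
}
```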

Post

thank you so much for this great tool. this will be really useful! i think, it may become a cornerstone in my workflow for experimenting with ideas for realtime dsp algorithms. the existing scripting/prototyping tools are all well and good (i guess) but this one is very different in one key aspect: it supports programming in c++ which means, i can have access to my whole dsp library. that's the killer feature for me that sets it apart from the existing tools. i've already managed to make that work and i am *really* happy about that! oh - and it's open source too, and i'm a big proponent of that.

i have a couple of questions:
-[edit: you can ignore that, i figured it out] am i supposed to pull the values of the parameter inside the process call - like once per block or something (that's what i currently do) - or is there some sort of callback mechanism that gets called when a parameter changes?
-are you handling denormals somewhere (supposedly with ftz/daz stuff*), so i don't have to worry about them in my scripts? i have occasionally seen the cpu and accps measurements go through the roof - but i'm not sure, if it was about denormals because it was somewhat weird (the sound was not necessarily close to zero and the values were in the billions and accps even negative (i.e. integer overflow or something?))
-are you planning to add support for handling midi events?

(*) i think, i will call that fitzdazzing from now on. i had some mulled wine and just made that up :hihi:
My website: rs-met.com, My presences on: YouTube, GitHub, Facebook

Post

aah - ok - i see. i should have looked at the examples more closely. from the svf:

Code: Select all

	void process(umatrix<const float> inputs, umatrix<float> outputs, size_t frames) override
	{
		const auto shared = sharedChannels();

		for (std::size_t n = 0; n < frames; ++n)
		{
			// If you only want to calculate coefficients once in a while, move this out of the loop.
			const auto coeffs = SVF::Coefficients::design(
				response,
				cutoff[n] / config().sampleRate,
				quality[n],
				dB::from<fpoint>(gain[n])
			);

			for (std::size_t c = 0; c < shared; ++c)
				outputs[c][n] = filters[c].filter(inputs[c][n], coeffs);
		}

		clear(outputs, shared);
	}
...so the parameter objects actually contain arrays with per-sample values for each block? i wonder a bit, what the

Code: Select all

clear(outputs, shared);
call does. is this for clearing any channels that are not used, i.e. not already filled by the loop? if so, in what circumstance is that supposed to happen? and why are channels "shared"? with whom?

Post

juha_p wrote: Sat Feb 13, 2021 4:22 pm
mystran wrote: Sat Feb 13, 2021 2:54 pm I'm sorry, but what is "FRE"? It's not a common acronym for any optimization that I'm aware of. Maybe you're talking about PRE, although I'm not exactly sure what that has to do with folding....
From GCC Optimize Options:
-ftree-fre
Perform full redundancy elimination (FRE) on trees. The difference between FRE and PRE is that FRE only considers expressions that are computed on all paths leading to the redundant computation. This analysis is faster than PRE, though it exposes fewer redundancies. This flag is enabled by default at -O and higher.
https://gcc.gnu.org/onlinedocs/gcc/Opti ... tions.html
Thanks. I actually tried to Google "full redundancy elimination" ('cos I sort of guessed that's what it might mean) but that didn't find anything useful for me (ie. it doesn't seem to be a commonly used term). I guess it's basically a compromise between simple CSE and full PRE then.

Post

Try the Compiler Explorer website. Select a GCC compiler, then in the assembler window use "Add new" to add a Tree/RTL output. It pops up a new workspace window with a dropdown box that shows all the optimization passes GCC makes. Not available with clang for some reason.
