KVR Audio

laserbeak · Post by **laserbeak** » Sat Nov 10, 2007 4:26 pm

xoxos wrote:best thread title ever.

lol did i actually type it like that or did you change it?

laserbeak · Post by **laserbeak** » Sat Nov 10, 2007 4:29 pm

Aleksey Vaneev wrote:It's important to understand that (to my knowledge) no compiler will be able to 'optimize' accesses to such function arguments as pointer to structure or reference to structure:

Code:

inline void process( CChannel& Data, int l, double* p )
{
    while( l > 0 )
    {
        Data.y += ( *p - Data.y ) * Data.c;
        *p = Data.y;
        p++;
        l--;
    }
}

The problem is that the compiler cannot be sure p does not point into the structure (aliasing), so it does not know when it may keep the intermediate values in registers and has to write Data.y back on every iteration.

To get optimized code, write it this way (a compiler can unroll this without a problem):

Code:

inline void process( CChannel& Data, int l, double* p )
{
    const double c = Data.c;
    double y = Data.y;

    while( l > 0 )
    {
        y += ( *p - y ) * c;
        *p = y;
        p++;
        l--;
    }

    Data.y = y;
}

wow how do you guys learn about all of this?

Kingston · Post by **Kingston** » Sat Nov 10, 2007 5:56 pm

Chris Walton wrote:(2) How in the world does block processing hinder object oriented design?

I meant that if you start avoiding function calls for arbitrary reasons you will end up with obfuscated code that no one will ever be able to understand, let alone fix, and it won't be any faster in the end either.

I've heard people still unroll function call chains and end up with a large collection of highly specialised non-reusable code. Great for those much needed optimisations yielding 0.001% more performance that only work on certain CPU models.

I personally leave that to the obsessive compulsive types.

Aleksey Vaneev · Post by **Aleksey Vaneev** » Sat Nov 10, 2007 6:01 pm

laserbeak wrote:wow how do you guys learn about all of this?

Experience is the answer.

Really, of the two code examples I've posted, the second one gives the best performance, and I've described the reason.

To be more specific, in the second example the compiler is able to run the whole loop in registers without even touching memory (errhm, I mean, without touching the Data structure). In the first example it may need to access the Data structure to store the intermediate result (while access to the "Data.c" variable may still get optimized by preloading it into a register). Also, accessing a structure in such a loop costs at least one register in the context of the loop.

I've made a simple test with MinGW with -O3 --unroll-loops -no-inline settings.

Here's the typical unrolled output of the first variant (unrolled 4 times, typical part in ***):

Code:

L7:
fldl	(%ecx)
subl	$4, %eax
fldl	(%edx)
fsub	%st(1), %st
fmull	8(%ecx)
faddp	%st, %st(1)
fstl	(%ecx)
fstpl	(%edx)
***
fldl	(%ecx)
fldl	8(%edx)
fsub	%st(1), %st
fmull	8(%ecx)
faddp	%st, %st(1)
fstl	(%ecx)
fstpl	8(%edx)
***

Second variant (unrolled 8 times, typical part in ***):

Code:

L7:
fldl	(%edx)
subl	$8, %eax
fsub	%st(1), %st
fmul	%st(2), %st
faddp	%st, %st(1)
***
fldl	8(%edx)
fsub	%st(1), %st
fxch	%st(1)
fstl	(%edx)
fxch	%st(1)
fmul	%st(2), %st
faddp	%st, %st(1)
***

As you can see, the second variant makes far fewer memory accesses.
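For anyone who wants to reproduce the comparison, here is a minimal self-contained sketch of both variants. The layout of CChannel is an assumption (the thread never shows its definition; the listings only suggest y at offset 0 and c at offset 8), and the names processDirect, processLocal and variantsAgree are made up for illustration. Note that the two variants are only equivalent when p does not point into Data, which is exactly what the compiler is not allowed to assume.

```cpp
// Assumed layout: the thread never shows CChannel's definition.
struct CChannel
{
    double y; // filter state (last output)
    double c; // smoothing coefficient
};

// First variant: state is read and written through the reference.
inline void processDirect( CChannel& Data, int l, double* p )
{
    while( l > 0 )
    {
        Data.y += ( *p - Data.y ) * Data.c;
        *p = Data.y;
        p++;
        l--;
    }
}

// Second variant: state is kept in locals and stored back once.
inline void processLocal( CChannel& Data, int l, double* p )
{
    const double c = Data.c;
    double y = Data.y;

    while( l > 0 )
    {
        y += ( *p - y ) * c;
        *p = y;
        p++;
        l--;
    }

    Data.y = y;
}

// Runs both variants on identical input and reports whether they agree.
inline bool variantsAgree()
{
    CChannel a = { 0.0, 0.5 };
    CChannel b = { 0.0, 0.5 };
    double bufA[ 4 ] = { 1.0, 1.0, 1.0, 1.0 };
    double bufB[ 4 ] = { 1.0, 1.0, 1.0, 1.0 };

    processDirect( a, 4, bufA );
    processLocal( b, 4, bufB );

    for( int i = 0; i < 4; i++ )
    {
        if( bufA[ i ] != bufB[ i ] ) return false;
    }

    return a.y == b.y;
}
```

Compiling this with your own compiler and comparing the generated assembly for the two functions is the quickest way to see whether it behaves like the MinGW build above.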

Chris Walton · Post by **Chris Walton** » Sat Nov 10, 2007 6:01 pm

Kingston wrote:
Chris Walton wrote:(2) How in the world does block processing hinder object oriented design?
I meant that if you start avoiding function calls for arbitrary reasons you will end up with obfuscated code that no one will ever be able to understand, let alone fix, and it won't be any faster in the end either.

I've heard people still unroll function call chains and end up with a large collection of highly specialised non-reusable code. great for those much needed optimisations yielding 0.001% more performance that only work on certain CPU models.

I personally leave that to the obsessive compulsive types.

Ah, I see, and I fully agree

MadBrain · Post by **MadBrain** » Sat Nov 10, 2007 6:23 pm

The penalty for a function call inside the processing loop isn't too bad. An optimizing compiler may emit a real call even if you add "inline", especially when the loop is long and calls the same function repeatedly, so that the loop stays inside the cache instead of growing too large.

laserbeak · Post by **laserbeak** » Sat Nov 10, 2007 9:31 pm

Aleksey Vaneev wrote:
laserbeak wrote:wow how do you guys learn about all of this?
Experience is the answer. Really, in the two code examples I've posted the second one produces best performance, and I've described the reason.

cool thanks i can see that it's smaller, but would you be what kingston calls "obsessive compulsive"?

FEV · Post by **FEV** » Sat Nov 10, 2007 9:36 pm

Aleksey Vaneev wrote:It's important to understand that (to my knowledge) no compiler will be able to 'optimize' accesses to such function arguments as pointer to structure or reference to structure:

This also applies to accessing class member variables in class methods (which should be obvious, as calling a member function is basically a "normal" function call with an implicit this pointer passed): sometimes it's better to preload member variables into local ones before doing lots of writing into them (reading from memory won't be that bad, I think). It's also worth noting that this kind of optimization should be done, hmmm... carefully (check asm output), especially if you're going to preload more variables than you have registers (in that case you may actually decrease performance).
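A minimal sketch of the member-variable case, assuming a hypothetical one-pole smoother class (COnePole, mY, mC and state() are invented here and do not come from anyone's actual code). The state is copied into a local, updated there, and stored back once after the loop:

```cpp
// Hypothetical class, invented purely to illustrate the point.
class COnePole
{
public:
    explicit COnePole( double c ) : mY( 0.0 ), mC( c ) {}

    // Members are reached through the implicit 'this' pointer, so the
    // state is preloaded into a local before the loop.
    void process( double* p, int l )
    {
        double y = mY;

        while( l > 0 )
        {
            y += ( *p - y ) * mC; // mC is only read, so it is used directly
            *p = y;
            p++;
            l--;
        }

        mY = y; // single write-back after the loop
    }

    double state() const { return mY; }

private:
    double mY; // written on every sample
    double mC; // read-only inside the loop
};
```

Whether this actually helps depends on the compiler and on register pressure, so the asm output is still worth a look.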

cheers,
Bart

Aleksey Vaneev · Post by **Aleksey Vaneev** » Sun Nov 11, 2007 6:41 am

laserbeak wrote:cool thanks i can see that it's smaller, but would you be what kingston calls "obsessive compulsive"?

Not at all.

Unfortunately, to write fast C code you need to know the compiler's habits, at least on the macro level: what the compiler will and what it won't do for sure (from the 'paradigm' point of view).

However 'obsessive' it may be, it's also useful to know that at least MinGW won't unroll a construct like this (I do not know whether the Intel C++ Compiler would):

Code:

while( l-- > 0 )
{
	*(p++) = *p * 5.0;
}

You should write it this way:

Code:

while( l > 0 )
{
	*p = *p * 5.0;
	p++;
	l--;
}

The two look identical from the programmer's point of view, but a compiler can't 'crack' them identically. Postfix ++ inside a larger expression is a bit of an evil to process, since it implicitly tells the compiler to do two things: first save the original value, then immediately increment the variable. (Strictly speaking, the first form also modifies p and reads it in the same expression with no sequencing between the two, which is undefined behavior in C and was in C++ until C++17.) A lone postfix ++ on its own line is simpler: the compiler does not need to save the previous value and can safely just increment the variable. I prefer the second variant myself; some people use the first one trying to look ingenious (while it sucks). Prefix notation is easier for the compiler to cope with (e.g. while( --l >= 0 ) ), but it needs a bit of code rearrangement (i.e. using >= instead of > in the expression), and I do not think it's worth it.
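To make the last point concrete, here is a small sketch of the recommended form next to the prefix-decrement form. The function names scaleSeparate and scalePrefix are made up for this example; both run exactly l iterations and do nothing when l is zero or negative.

```cpp
// Recommended form: one side effect per statement.
inline void scaleSeparate( double* p, int l )
{
    while( l > 0 )
    {
        *p = *p * 5.0;
        p++;
        l--;
    }
}

// Prefix-decrement form: note '>=' instead of '>' in the condition.
inline void scalePrefix( double* p, int l )
{
    while( --l >= 0 )
    {
        *p = *p * 5.0;
        ++p;
    }
}
```

Whether a given compiler unrolls one form and not the other is something to check in its asm output rather than take on faith.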

Leslie Sanford · Post by **Leslie Sanford** » Sun Nov 11, 2007 7:53 am

Aleksey Vaneev wrote: You should write it this way:
Code:
while( l > 0 )
{
	*p = *p * 5.0;
	p++;
	l--;
}

Efficiency questions aside, I find this to be a better programming style. I don't like embedded expressions within a statement that have side effects. It obscures the intent of the statement and makes it tougher to debug. All in my opinion, of course.

Aleksey Vaneev · Post by **Aleksey Vaneev** » Sun Nov 11, 2007 12:21 pm

FEV wrote:It's also worth noting, that this kind of optimization should be done hmmm... carefully (check asm output) - especially if you're going to preload more variables than you have registers (in that case you may actually decrease performance).

I think it is always beneficial to preload variables that are going to be updated, even if you think there are not enough registers available. Of course, I'm talking about block processing loops. If a state variable is on the stack, the compiler can manage it in whatever way it wants. If the state variable belongs to an external structure, the compiler's hands are tied. In fact, preloading a variable does not increase overhead: the compiler usually has to issue a load instruction anyway, and it can still remove the preloading instruction if it's not required. So preloading is merely a hint to the compiler's optimizer, much like putting a 'const' keyword on a variable or function that is not going to change.

FEV · Post by **FEV** » Sun Nov 11, 2007 1:47 pm

Aleksey Vaneev wrote:I think it is always beneficial to pre-load variables that are going to be updated, even if you think not enough registers are available.

I disagree...

Aleksey Vaneev wrote: Of course, I'm talking about block processing loops.

Same here

Aleksey Vaneev wrote: In fact, preloading a variable does not increase overhead, because compiler should usually issue a loading instruction anyway while it can still remove that pre-loading instruction if it's not required.

I'm only using MSVC++ and "mangle" only SSE data types (__m128, __m128d) in block processing loops, and I've often seen the compiler generate a mess in the preloading part. If there are not enough registers, VC++ may preload a member variable into a register and immediately save it to memory again (a performance hit). In the processing loop it will then read and write that memory block instead of using a register. If the variable is used often in an inner loop, the compiler may actually load it into a register again, but that happens inside the loop, so you also get a memory read and write per loop iteration.

In such cases the execution speed might still improve if you write often (inside the loop) to such a "badly preloaded" variable (reading does not cost much, at least not with movaps, which is what I'm mostly dealing with). But that shouldn't happen anyway, as writing to the final destination should always take place once per iteration (preferably at its end, so the compiler can do a better job optimizing the whole loop). And even if you get some improvement in loop speed, you also have to consider how much of a penalty you pay for pointlessly preloading variables (those that didn't fit into registers) before the loop runs.

So like I said, "preloading" is a good idea, but not in all cases. I tend to pick only a couple of variables to preload instead of all of them (the ones I'll be writing to get priority), and I always check the asm output to make sure that the compiler does exactly what I want.

And another thing worth mentioning - inlining functions that do that kind of preloading is also mostly a bad idea, but I see (by looking at the compiler flags you're using) that you already know that

Ps.: I'm not sure if my "bad case scenario" description is quite clear, so if you want I can provide an example along with an assembly output

cheers,
Bart

laserbeak · Post by **laserbeak** » Sun Nov 11, 2007 2:08 pm

sure. if you can spare an example.

i could use a bit of asm exposure anyway.

FEV · Post by **FEV** » Sun Nov 11, 2007 3:49 pm

laserbeak wrote:sure. if you can spare an example.

Ok. Definitely not the best example, but I'm a little past my deadlines and am busy with some other things (I just hope you're not reading this, Peter), so instead of writing some code from scratch I've dug up a piece that should prove my point.

Here's a processing function: example.cpp (it's a member function of some class whose member variables are prefixed with 'm') and what msvc++ 8.0 produced out of it: example.asm

It's not really important what that thing does (well, you can try to guess if you have nothing better to do); it's just to show that I'm not doing something really stupid (or maybe I am???). As you can see, there's a "preloading" stage before the loop, and as you can also observe in the asm output, two of the variables haven't been loaded into registers (well, they were, but then they were immediately saved into memory again):

Code:

movaps	xmm0, XMMWORD PTR [ecx+464]
...
movaps	XMMWORD PTR _highFilterPrev$[esp+64], xmm0
movaps	xmm0, XMMWORD PTR [ecx+480]
movaps	XMMWORD PTR _highFilterLast$[esp+64], xmm0

They are read from memory later inside the loop and saved to memory again, which is basically what would happen if no preloading took place. So you can see that there's no real performance gain from this preloading (for those variables), and because of those silly memory<->register operations at the beginning of the function, you actually do more than is needed (a performance hit). If there were more variables to preload, it would be even worse (especially if this process function is called often; mine mostly process 8 or 16 samples, so they are called quite often, I guess). In this example it isn't a big hit, but it's not hard to imagine situations where it might be a bigger problem.

Still, most of the time it's good to create local copies of member variables (but only those you'll be writing to), though not always: check the asm or profile it to be sure it's a good solution for a given task.
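A rough sketch of that "only the ones you write to" rule, using a hypothetical two-state recursive filter. Every name here (CTwoState, mA0, mB1, mB2, mLast, mPrev) is invented for illustration and is not taken from example.cpp, and plain doubles stand in for the SSE types. Only the two state members are preloaded; the coefficients are read in place.

```cpp
// Hypothetical filter, invented for illustration only.
class CTwoState
{
public:
    CTwoState( double a0, double b1, double b2 )
        : mA0( a0 ), mB1( b1 ), mB2( b2 ), mLast( 0.0 ), mPrev( 0.0 ) {}

    void process( double* p, int l )
    {
        // Preload only the members that are written inside the loop.
        double last = mLast;
        double prev = mPrev;

        while( l > 0 )
        {
            // Coefficients are only read, so they are used directly.
            const double out = mA0 * *p + mB1 * last + mB2 * prev;
            prev = last;
            last = out;
            *p = out;
            p++;
            l--;
        }

        // One store per state variable, after the loop.
        mLast = last;
        mPrev = prev;
    }

    double last() const { return mLast; }
    double prev() const { return mPrev; }

private:
    double mA0, mB1, mB2; // read-only coefficients
    double mLast, mPrev;  // state, updated on every sample
};
```

If the asm shows the preloaded locals being spilled straight back to the stack, as in the listing above, the preload bought nothing for those variables.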

Other than that, I completely agree with Aleksey

cheers,
Bart

laserbeak · Post by **laserbeak** » Sun Nov 11, 2007 5:52 pm

cool i don't know a lot of those opcodes i'm very new to asm. but i do get your point and i've noticed that the asm version has a few things thrown around in a diff order. did the compiler do this or did you?

oh and my guess is, a filter module with an envelope? lol

how to generate random noise without making function calls?