Newton's Method for vectorial functions (4-pole filter)

DSP, Plugin and Host development discussion.

Post

Urs wrote: As far as I've tried, vectorial NR outperforms doing NR on each stage and then NR on the feedback around it by an order of magnitude.
Hmm, how is that possible? The inverted Jacobian of a 4x4 matrix (two zeros per row) looks very expensive.
giq

Post

itoa wrote:
Urs wrote: As far as I've tried, vectorial NR outperforms doing NR on each stage and then NR on the feedback around it by an order of magnitude.
Hmm, how is that possible? The inverted Jacobian of a 4x4 matrix (two zeros per row) looks very expensive.
You don't need to invert it. You can solve the whole thing as a linear equation system, as I have done in the code posted (requires Maxima/Maple/Mathematica or a lot of nerves)

There's also another thread with a neat set of SSE/Clang code examples that do matrix inversion with pretty much the same number of instructions.

However... look at the code carefully. With SSE you can calculate all 4 filter stages at once, because you don't have to wait for any results - they are all estimated. You can furthermore calculate all results of the linear equation system at once. With a bit of SSE, this code flies (please excuse me for not posting SSE code... we use an intermediate abstraction that's just for in-house dev and I don't have the drive to write it all in SSE assembly/intrinsics again)
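Just to illustrate the lane-parallel idea with plain intrinsics - a generic sketch, not our in-house code, and the tanh approximation and the name tanh4 are made up for the example:

Code: Select all

#include <immintrin.h>

// Evaluate the nonlinearity of all four stage estimates at once.
// tanh(x) is approximated here by x*(27 + x^2) / (27 + 9*x^2).
static inline __m128 tanh4(__m128 x)
{
    const __m128 c27 = _mm_set1_ps(27.0f);
    const __m128 c9  = _mm_set1_ps(9.0f);
    __m128 x2  = _mm_mul_ps(x, x);
    __m128 num = _mm_mul_ps(x, _mm_add_ps(c27, x2));
    __m128 den = _mm_add_ps(c27, _mm_mul_ps(c9, x2));
    return _mm_div_ps(num, den);
}

// usage: no lane depends on another, so all four stages run in parallel
// __m128 vE = _mm_set_ps(e4, e3, e2, e1);
// __m128 t  = tanh4(vE);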

If you keep precision in the 16 bit bracket (i.e. keep the convergence level at InputRMS/100000 or something) it'll typically converge in 2-6 iterations, let's say 4 for simplicity. Each iteration costs approximately as much CPU as, hmm, 3 iterations of a one-pole filter, considering that tanh is the most expensive part.
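To show what the stopping rule looks like, here's a self-contained toy on a single implicit equation (hypothetical code, just demonstrating the convergence check - the vectorial case does the same thing on 4 values at once):

Code: Select all

#include <cmath>

// Toy: solve y = tanh(g*(x - y)) with Newton-Raphson.
// eps would be something like inputRMS / 100000.
float solveImplicit(float x, float g, float eps)
{
    float y = 0.0f;                            // initial estimate
    for (int it = 0; it < 16; ++it)
    {
        float t    = std::tanh(g * (x - y));
        float F    = y - t;                    // residual, F(y) = 0 at the solution
        float dF   = 1.0f + g * (1.0f - t*t);  // F'(y), using tanh' = 1 - tanh^2
        float step = F / dF;
        y -= step;
        if (std::fabs(step) < eps)             // typically hits after 2-6 rounds
            break;
    }
    return y;
}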

With the other method, if you need, say, 3 iterations per one-pole, then a single round of your outer feedback loop already costs 3 x 4 = 12 one-pole iterations, i.e. the CPU of 4 vectorial iterations, the entire vectorial solve - and that's for each round. If you need 4 rounds for the outer loop as well... even if we're optimistic it'll be twice as slow.

Post

Thanks, damn, I hadn't noticed the source code.

Ok.. now trying to understand this:

"However, to divide by a Matrix we need to invert it. Instead of inverting, we change the terms around a bit to bring the Matrix on top:

JF(vE)y = -F(vE)
vE[ n + 1 ] = vE[ n ] + y
"

So you moved the Jacobian to the left by multiplying, but what happens after that? It's too late, but I must understand this ;)
giq

Post

itoa wrote:Thanks, damn, I hadn't noticed the source code.

Ok.. now trying to understand this:

"However, to divide by a Matrix we need to invert it. Instead of inverting, we change the terms around a bit to bring the Matrix on top:

JF(vE)y = -F(vE)
vE[ n + 1 ] = vE[ n ] + y
"

So you moved the Jacobian to the left by multiplying, but what happens after that? It's too late, but I must understand this ;)
Half past eleven pm here, that's too late too!

What you basically have is JF * y = -x. If JF is a 4x4 matrix and y (unknown) and x (known) are 4x1 vectors, you end up with 4 equations. Those you can solve for each element of y.

More later if need be, but not tonight :ud:

Post

Urs wrote:

Code: Select all


float y1 =  (m14*m22*m33*x4-m14*m22*m43*x3+m14*m32*m43*x2-m22*m33*m44*x1)/(m11*m22*m33*m44-m14*m21*m32*m43);
float y2 = -(m11*m33*m44*x2+m14*m21*m33*x4-m14*m21*m43*x3-m21*m33*m44*x1)/(m11*m22*m33*m44-m14*m21*m32*m43);
float y3 = -(m11*m22*m44*x3-m11*m32*m44*x2-m14*m21*m32*x4+m21*m32*m44*x1)/(m11*m22*m33*m44-m14*m21*m32*m43);
float y4 = -(m11*m22*m33*x4-m11*m22*m43*x3+m11*m32*m43*x2-m21*m32*m43*x1)/(m11*m22*m33*m44-m14*m21*m32*m43);
This looks an awful lot like a multiplication of the inverse matrix with the input vector.

You don't actually need to do it like this. You can also just use something like LU to solve the thing directly, which can potentially give you large savings in terms of operations (not to mention it's usually more accurate; unfortunately it gets somewhat harder to run stuff in parallel) once your system gets larger. More importantly, if only some of the dimensions are non-linear, then LU makes it fairly simple to only iterate those dimensions (and back-substitute the rest afterwards).
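For reference, a direct solve via LU looks roughly like this - a sketch without pivoting, assuming the system is ordered so the diagonal stays nonzero (not code from this thread):

Code: Select all

// Solve A*y = b for a 4x4 system via in-place LU (Doolittle, no pivoting).
// A is the Jacobian, b = -F(vE), y receives the Newton step.
void luSolve4(float A[4][4], const float b[4], float y[4])
{
    // factorization A = L*U; L has a unit diagonal and is stored below it
    for (int k = 0; k < 4; ++k)
        for (int i = k + 1; i < 4; ++i)
        {
            A[i][k] /= A[k][k];
            for (int j = k + 1; j < 4; ++j)
                A[i][j] -= A[i][k] * A[k][j];
        }

    // forward substitution: L*z = b
    float z[4];
    for (int i = 0; i < 4; ++i)
    {
        z[i] = b[i];
        for (int j = 0; j < i; ++j)
            z[i] -= A[i][j] * z[j];
    }

    // back substitution: U*y = z
    for (int i = 3; i >= 0; --i)
    {
        y[i] = z[i];
        for (int j = i + 1; j < 4; ++j)
            y[i] -= A[i][j] * y[j];
        y[i] /= A[i][i];
    }
}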

But like.. if you have a 4x4 system you might just as well use an explicit inversion, no big deal.

Post

Urs: Damn, it was so obvious, and you're right, I shouldn't think about this after 10pm :).

Nevertheless, the final result is the same as multiplying by the inverted Jacobian, quite expensive stuff.
giq

Post

mystran wrote:This looks an awful lot like a multiplication of the inverse matrix with the input vector.
It isn't all that much... the denominator term is identical among the four rows, and many of the multiplications can be done in parallel with a bit of shuffling (provided the developer does it in SSE)

One can also just calculate one row, then substitute that y into the 4 original equations. This however leads to more iterations because of precision errors. Calculating y1 and y3 directly, then harvesting y2 and y4 from the equations works well, but is almost as expensive as doing the whole lot (one saves only 3 or 4 shuffles IIRC).

I will try LU decomposition. So far I've abandoned the idea because I thought that pivoting and checks (were they for zeros on the diagonal?) might make it more expensive.

(that said, maybe you're right... maybe Maple just solved it by inverting the matrix... can it do that after I transformed it into an equation system? - I have no idea, I just know it works perfectly well 8) )

Post

Urs wrote: I will try LU decomposition. So far I abandoned the idea because I thought that pivoting and checks (were they for zeros in the diagonal?) might make it more expensive.
The idea is to find an ordering for which it "just works" without further pivoting. It turns out that this is usually more of a theoretical than a practical problem (with some caveats), since putting the (1+g) terms on the diagonal is sufficient most of the time.

Post

A bit off-topic, but while you're here...

I want to test the pivotal method with a Heun-like correction, to compare performance. I do the following:

1. Compute the filter for the current sample (using states s to compute the transconductances t1), update the state
2. Take the next sample and compute the transconductances t2
3. Revert the filter state
4. Compute the filter using the average of t1 and t2

Am I right?
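In code I mean something like this - a hypothetical one-pole stand-in, not my actual filter, with the transconductance taken as the tanh(v)/v pivot:

Code: Select all

#include <cmath>

struct OnePole { float s = 0.0f; };             // single filter state

// pivot-style transconductance: tanh(v)/v, linearizing tanh around v
float transconductance(const OnePole& f, float x)
{
    float v = x - f.s;
    return std::fabs(v) > 1e-9f ? std::tanh(v) / v : 1.0f;
}

float process(OnePole& f, float x, float g, float t)
{
    float y = f.s + g * t * (x - f.s);          // linearized one-pole update
    f.s = y;                                    // state update
    return y;
}

// the four steps:
// OnePole f; OnePole saved = f;                 // remember state s
// float t1 = transconductance(f, x[n]);
// process(f, x[n], g, t1);                      // 1. provisional pass, updates state
// float t2 = transconductance(f, x[n+1]);       // 2. transconductances at next sample
// f = saved;                                    // 3. revert the state
// float y = process(f, x[n], g, 0.5f*(t1+t2));  // 4. average of t1 and t2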
giq

Post

Solving a complex system can sometimes be done by solving each independent part, but the cost is always higher than doing it in one go.
The reason why you may not want to solve it in one go is if some equations don't have the same behavior as others (for instance, one set of equations is parabolic, the other one hyperbolic). In that case you may want to reduce the system so that you only solve one set, and then both in an additional/final step.
By solving everything together, you ensure the consistency of your system and of the feedback.

Also, some matrix libraries may help you solve that system faster than if you did it via a symbolic solution. You need to test the different versions to see which one is the fastest.

Post

I did the LU decomposition in Maple (using the Gaussian elimination option)... it might save some CPU, but at least one term is identical to the ones in my code. Hence I'm not certain whether it really saves a lot or not, considering that all 4 terms can be done in parallel, plus a few shuffles in SSE...

I hope I find the time to do this in actual code within the next few weeks.

Post

Urs wrote:I did the LU decomposition in Maple (using the Gaussian elimination option)... it might save some CPU, but at least one term is identical to the ones in my code.
It is the LU decomposition process that is of interest (rather than the end result), because the algorithm itself is a very efficient way to calculate all the required terms while sharing most of the computations. :)

edit: Oh and yes, it tends to be slightly more serial, so it might take a larger matrix before it actually starts to give you significant advantages.

Post

Urs, this example uses tanh, with its cheap derivative. What about other, fancier curves? Do you use lookup tables? Random memory access in this situation may be a show-stopper :) (e.g. 16*2 with interpolation per sample)

The secant method is an alternative, but most probably the number of iterations will increase.
giq

Post

itoa wrote:Urs, this example uses tanh, with its cheap derivative. What about other, fancier curves? Do you use lookup tables? Random memory access in this situation may be a show-stopper :) (e.g. 16*2 with interpolation per sample)

The secant method is an alternative, but most probably the number of iterations will increase.
Secant converges slower, but it's worth keeping in mind that any kind of inaccuracy in the derivatives will also hurt Newton (sometimes a lot more than one would expect)... so be careful when you start using approximations or lookup tables, you might end up losing more (due to extra iterations) than you saved with the approximation.

I would also highly recommend keeping track of the number of iterations (eg. with an exponential moving average that is cheap to calculate on the fly) and displaying it on your UI while you test the thing, so you can tweak knobs, try different input signals and see how it affects your iteration counts.
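Something as simple as this does the job (a sketch, all names made up):

Code: Select all

// Running average of iteration counts, cheap enough to update per sample.
struct IterationMeter
{
    float average = 0.0f;
    float alpha   = 0.001f;   // smoothing factor, roughly a 1000-sample window

    void update(int iterations)
    {
        average += alpha * (float(iterations) - average);
    }
};

// call meter.update(n) once per solved sample, show meter.average on the UI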

Post

I do a lot of tests that look like unit tests but plot out the number of iterations and the measured time for, say, 100000 samples. I do those for different settings (cutoff, resonance, input material) and different accuracy (use sqrte() for tanh or 1/sqrt(), use 3rd order polynomials or 7th order etc.). On top of that I try different initial estimates (last sample's Vs, linear prediction, Vs from a linearly computed filter). For this you'd better use C++ templates.

I also often define a waveshaper function like this and use it as a template parameter for the filter algorithm:

Code: Select all

float waveshape( float x, float* derivative )
{
	float out = x * x * x;     // cubic shaper, f(x) = x^3
	*derivative = 3 * x * x;   // its exact derivative, f'(x) = 3x^2
	return out;
}
It's often faster to use the actual derivative of a tanh approximation than to pair the derivative of tanh with an approximation of it.
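For example, with the usual rational approximation (a sketch, not our production code):

Code: Select all

// x*(27 + x^2) / (27 + 9*x^2) approximates tanh(x) well for |x| <= 3.
// The derivative below is the exact derivative of this approximation,
// not an approximation of 1 - tanh(x)^2.
float tanhApprox( float x, float* derivative )
{
	float x2  = x * x;
	float out = x * (27.0f + x2) / (27.0f + 9.0f * x2);
	float d   = (9.0f - x2) / (3.0f * (3.0f + x2));
	*derivative = d * d;
	return out;
}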
