KVR Audio

raphx · Post by **raphx** » Mon Jul 08, 2013 3:17 pm

Yep, you got it right. You build the matrix every time you update the parameters, but for every audio sample you're doing much less work: in the linear case, multiplying a vector by a scalar and multiplying a vector by a matrix (and adding these two results to form the new state vector). That's 20 multiplies.

There are a couple optimizations on top of what you said. First, you can square a Toeplitz matrix in 7 dot products, not 16, because of the symmetries. Lastly, I'm not just doing the repeated squaring any more, I'm starting with a few terms of the Taylor's series for exponentiation and then refining that with repeated squaring. This lets me exceed the accuracy of the original method, with only 7 matrix multiplications.

And all of these operations run pretty fast on modern SIMD hardware, without a need to do a lot of repacking, because they're highly optimized for exactly this kind of math. I seriously doubt x86 is going to need 50 cycles to do a 4x4 matrix multiplication. Even an ARM should be able to do it faster than that.

Hope this illuminates.

Syrou · Post by **Syrou** » Mon Jul 08, 2013 10:42 pm

raphx wrote:Ok, I withdraw my bold claim. I was surprised how bad the performance of Will Pirke's implementation was - it didn't actually seem any better than the Huovilainen one. As you say, it's now obvious that it's a wrong implementation.

Now I'll have to more deeply understand the TPT approach so that I can implement it correctly. This will no doubt take a few days.

I'm still fairly optimistic about total bang for the buck, but now it's clear there's more of a tradeoff.

The appnote has been revised and updated with a more accurate implementation of the TPT approach.

sonigen · Post by **sonigen** » Tue Jul 09, 2013 8:43 am

Thanks for the explanation.

raphx wrote: And all of these operations run pretty fast on modern SIMD hardware, without a need to do a lot of repacking, because they're highly optimized for exactly this kind of math. I seriously doubt x86 is going to need 50 cycles to do a 4x4 matrix multiplication.

The DPPS (dot product packed singles) op on X86 takes has a reciprocal throughput of 2 cycles and a latency of 12, so for 16 it'll take 32 cycles + 12 best case. If you don't have SSE4 it takes 3 separate ops to do each dot product, so 48 + approx 12. Then you have a bunch of shuffling to get the scalar results back into matrix form. I figure about 2 cycles per register.

mystran · Post by **mystran** » Tue Jul 09, 2013 12:52 pm

sonigen wrote:Thanks for the explanation.

raphx wrote: And all of these operations run pretty fast on modern SIMD hardware, without a need to do a lot of repacking, because they're highly optimized for exactly this kind of math. I seriously doubt x86 is going to need 50 cycles to do a 4x4 matrix multiplication.
The DPPS (dot product packed singles) op on X86 takes has a reciprocal throughput of 2 cycles and a latency of 12, so for 16 it'll take 32 cycles + 12 best case. If you don't have SSE4 it takes 3 separate ops to do each dot product, so 48 + approx 12. Then you have a bunch of shuffling to get the scalar results back into matrix form. I figure about 2 cycles per register.

He's not working on x86 though.. but how about packing the matrix in column-major?

edit: I'll expand a bit. Basically, you don't need dot-products if you store the columns of the (leftmost) matrix and then calculate the matrix multiply as a "sum of outer products" which results in the same amount of total multiplies and adds, but you get to do the adds with regular addps and move the shuffling to the input (rather than output). I haven't tried, but I suspect it almost might pipeline a tiny bit better.

raphx · Post by **raphx** » Tue Jul 09, 2013 5:33 pm

Syrou: Yes, he's been in touch with me as well. I took a look at the revised app note, and it looks good to me now. Certainly the frequency response plots are as expected.

mystran and sonigen: Yes, the matrix is in column major order, so the core of matrix multiply is just vmla.f32, and the results end up exactly where they need to go, no repacking necessary. There's a pretty good blog on how to do it in the general 4x4 case: http://blogs.arm.com/software-enablemen ... plication/

But I had to do it better, of course. I wanted to exploit the Toeplitz symmetry of this particular problem and reduce the number of operations required. Here's the core of matrix squaring (input is q0 and q2, output is q12 and q13):

Code: Select all

neon_ladder_mkmatrix2:
	vext.32 q8, q2, q0, #3

	@ q0 = {a11, a21, a31, a41}, q2 = {0, a14, a24, a34}

	@ square the matrix
	vmul.f32 q12, q0, d0[0]
	vmul.f32 q13, q8, d4[1]
	vext.32 q9, q2, q0, #2
	vmla.f32 q12, q8, d0[1]
	vmla.f32 q13, q9, d5[0]
	vext.32 q10, q2, q0, #1
	vmla.f32 q12, q9, d1[0]
	vmla.f32 q13, q10, d5[1]
	vmla.f32 q12, q10, d1[1]
	vmla.f32 q13, q2, d0[0]

Here the vext instructions are striping the two fragments of the matrix into a full 4x4 matrix (q0, q8, q9, q10 are the columns). According to Pulsar, the whole thing takes 32 cycles on Cortex-A8.

http://pulsar.webshaker.net/ccc/sample-75cbfede

I was pleased by the way the whole thing fit together. The whole matrix generation code, from frequency and resonance to A matrix and B vector, is 100 lines of assembler, not bad for how obsessively optimized this is.

mystran · Post by **mystran** » Tue Jul 09, 2013 9:27 pm

Yeah well, sadly SSE [at least the versions you can realistically expect] in comparison lacks a scalar-vector multiply, so you have to shuffle any scalars into all the components for regular packed multiply.. but I suspect it'd still be more efficient to do it with outer-products, especially since you don't need any bleeding edge extensions (apparently the fused multiply-add is coming, but in practice it's probably another 10 years before you can expect to rely on people actually having it).

sonigen · Post by **sonigen** » Tue Jul 09, 2013 10:43 pm

After seeing that ARM code it makes you wonder how Intel keeps getting things so wrong with SSE. It's like they design a bicycle and then decide to only give you one peddle.

I'm trying very hard not to rant here.

Smashed Transistors · Post by **Smashed Transistors** » Sat Oct 03, 2015 9:09 pm

Hello,

thanks to all for this thread, i've read it some 10 days ago while digging up some old C code to use them within Reaper (jsfx) . I've digged up an implementation of the state variable filter that used the same kind of method http://forum.cockos.com/showthread.php?t=166480.

Step invariant transform seem to me the best method for linear low pass filters, but it is a pain in the feedback to get it working.

I've got a simplified iterative method that gives the coefficients based on the "virtual oversampling"

.
I've also tried to use the taylor approximation of the exp(M)

, it converges faster but it provided me some unstable results.
I found an analytical form a few days ago for the SVF... i can imagine the hard work for finding it for the 24dB-filter-made-of-four-lossy-integrators-with-negative-feedback !!!

But as the numerical method converges very fast, i will only use it as a test/reference

Again, thanks for this thread.

Thierry

aciddose · Post by **aciddose** » Sat Oct 03, 2015 9:45 pm

Please stop using the term "state-variable" to refer to something concrete. This is an abstract term which applies to all of these filter implementations.

Likewise the term "moog filter" is ridiculous, although it can be assumed by those aware of the jargon, there were numerous different filters used in equipment manufactured by moog (corp) and others in association with it.

I believe when you use the term "state-variable" you intend to describe a "kerwin-huelsman-newcomb filter" which is a particular implementation of a state-variable filter.

When you use the term "moog filter" you intend to describe four inverting active lossy integrators with negative global feedback. I would suggest the term "generic four-pole OTA" or "four cascaded inverting lossy-integrator".

If you mean "mini-moog filter" (which I assume you do) you must recognize that it is foolish to use the term in a generic way when you are not intending to implement the specific properties resulting from the very specific layout of this particular filter.

I find use of the term "moog filter" as a generic to be especially offensive due to the fact I'm left with little to no idea what you're actually referring to when you use the term. It would be equally productive to use the term "warm filter".

Only if you truly have such a bold and narrow focus on one particular electronic circuit can it be justified to use the term "mini-moog filter". This would make it clear what you are referring to. Otherwise I'm sorry but you'll simply need to use more words to clearly describe the topic.

Most often I am left with the impression the term "four-pole/24db filter" is the true intent.

Smashed Transistors · Post by **Smashed Transistors** » Sat Oct 03, 2015 9:57 pm

aciddose · Post by **aciddose** » Sat Oct 03, 2015 10:01 pm

Explain which topology and which properties you want to refer to.

I suspect "four-pole filter" is what you want.

https://en.wikipedia.org/wiki/Electroni ... r_topology

BTW, I know it often looks like I'm replying to a particular post. In general if I did not directly quote a post, my reply is to the thread and not to a specific post.

Smashed Transistors · Sat Oct 03, 2015 10:04 pm

Sure...
i wanted to thank people here, not to trigger this kind of allergic reaction Mr aciddose.

aciddose · Post by **aciddose** » Sat Oct 03, 2015 10:16 pm

It's important to point out when people are using irrational/unnecessary jargon. I've been doing it for over a decade now and I've made what appears to be some small difference during this time, although perhaps my efforts simply coincidentally correspond to more knowledgeable individuals appearing in discussions of these topics.

https://en.wikipedia.org/wiki/Topology_ ... tronics%29

Some of these Wikipedia articles are relatively recent and I think it would be quite helpful for people to read up on them.

In many cases terms such as "ladder filter" or "moog filter" are used as generics when I suspect people don't realize the ambiguity in these terms. For example a series of repeated kerwin-huelsman-newcomb filters (or integrators comprising sections thereof) are classified as a "ladder" in terms of topology.

"Moog filter" is ambiguous as I described due to the fact that there were many different circuits employed in both the various separate synthesizers as well as synthesizer modules produced by r.a. moog co / moog music over the years, some of which are still produced today.

In fact you could refer to a "moog ladder filter" thinking you're referring to a generic four-pole low-pass filter with negative feedback, while in fact there was a moog music module implementing a high-pass ladder without any feedback!

Being technical and specific is important to avoid confusion. In these cases if the desire were to refer specifically to the minimoog filter configured as a low-pass, I'd go with "minimoog low-pass filter" which should avoid any possibility for confusion.

In all other cases, when referring to a generic four-pole low-pass filter, simply use the term "generic four-pole low-pass filter" !

Smashed Transistors · Sat Oct 03, 2015 11:19 pm

Sorry, i do not agree with you Mr aciddose, the term has been used for the generic kind of 24dB low pass filter for decades, long before kvr, vstis, musicdsp or even the analog heaven mailing list .
Moog's filter has been copied in many ways getting around his patents. May it be specifically a ladder or a series of CA3080, Norton amplifiers or diodes or any digital incarnation ... to me they have the same origin.

Smashed Transistors · Sat Oct 03, 2015 11:21 pm

There is also a lot of things that i do not agree with here and there... but i do not go in "i'm fed up, everybody is ridiculous" mode. Well... so far.

I may admit that i find the generic term "Zero Delay Feedback" very confusing.

What do we call "feedback" ? the complete loop or just the feedback connection (which is generally explicit) ?

What kind of "delay" are we talking about ? group delay, phase delay ? "Z-1" delays caused by explicit vs implicit vs bilinear integrators ?

What does "Zero" mean here ? Analog circuits induce frequency dependant delays http://www.radiolab.com.au/DesignFile/DN004.pdf. So why would we have "zero delay" in a digital implementation ?

As far as i understood, it seems that the so called ZDF consists in solving the explicit feedback connection in order to optimise resonance. As far as i tested, it does not work very well. I had to add some HP filter in the feedback connection much like any other implementation to get correct resonance on HF (i will continue to experiment, maybe i missed something).

The matrix approach goes a step further: it solves the whole feedback loop.

So let's exponentiate that Kerwin-Huelsman-Newcomb-Biquad-Filter and that 24dB-filter-made-of-four-lossy-integrators-with-negative-feedback filter.

A matrix approach to the Moog ladder filter