Big Muff Pi emulation

DSP, Plugin and Host development discussion.

Post

Fire Sledge - Ohm Force wrote: Sun Jun 07, 2020 8:20 pm That’s why the DK method is nice: it formalizes what you described. The outer loop is just made of a few matrix-vector products. Non-linearities are isolated so the inner loop can focus on solving the voltages or currents at the junctions
I think you're missing the point I'm trying to make. You don't want any matrix-vector products. You want to expand (and then optimize) the whole thing as scalar code, and you want to apply whatever high-level optimisations make said scalar code as short as possible (i.e. the least number of FLOPs). Working at this level, the best strategy I know of involves partial LU factorisations (tracking fill-in to take advantage of whatever sparsity is left), but feel free to compare other options.

Basically, if you don't have a high-level optimizing code generator that spits out something that looks like glorified assembly, then you're doing it wrong. You probably don't want to produce actual assembly though, because you want to take advantage of your compiler's register allocator (which is basically the single compiler optimization that gives you the most speedup here).
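To make the contrast concrete, here's a toy sketch (the 2×2 system, the baked-in zero, and all names are invented for the example): the same solve written once as generic array code and once fully expanded into scalar code with the known sparsity folded in.

```cpp
#include <cassert>
#include <cmath>

// Generic dense 2x2 solve of Ax = b via Cramer's rule, reading from arrays.
void solve2_generic(const double A[2][2], const double b[2], double x[2]) {
    double det = A[0][0] * A[1][1] - A[0][1] * A[1][0];
    x[0] = (b[0] * A[1][1] - b[1] * A[0][1]) / det;
    x[1] = (A[0][0] * b[1] - A[1][0] * b[0]) / det;
}

// The same solve "unrolled" into scalar code with the known sparsity
// A[0][1] == 0 baked in: fewer FLOPs, and everything can live in registers.
void solve2_scalar(double a00, double a10, double a11,
                   double b0, double b1, double& x0, double& x1) {
    double inv_det = 1.0 / (a00 * a11);   // a01 == 0, so det = a00 * a11
    x0 = b0 * a11 * inv_det;              // the b1 * a01 term vanished
    x1 = (a00 * b1 - a10 * b0) * inv_det;
}
```

Already at this size the scalar version drops several multiplies; on a real circuit matrix with lots of structural zeros the savings compound through the whole factorisation.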

Post

Sorry, I’m not sure I understand. The fastest way to simulate the circuit is to solve a naive (sparse) MNA matrix whose occupied cells are stored as individual variables, which would be significantly faster than a DK-method system? In other words, the Jacobian of the function whose roots have to be found is much simpler to solve from a raw MNA matrix than from the DK method?

Post

Fire Sledge - Ohm Force wrote: Mon Jun 08, 2020 3:10 pm Sorry, I’m not sure I understand. The fastest way to simulate the circuit is to solve a naive (sparse) MNA matrix whose occupied cells are stored as individual variables, which would be significantly faster than a DK-method system? In other words, the Jacobian of the function whose roots have to be found is much simpler to solve from a raw MNA matrix than from the DK method?
It's not so much about the method (and the "ideal" way to choose a method is to count which one produces the least number of FLOPs), but rather about making sure you get rid of any kind of interpretative overheads and fully optimize the code that you need and nothing else. I would avoid anything involving matrix inversions in particular because those (1) are numerically poor and (2) take about twice as long as solving directly, but that might actually be somewhat minor in the grand scheme of things.

The idea with forcing everything into scalar variables and unrolling absolutely everything except the sample loop and the inner iteration loop (if using NR) is to make sure that the compiler tries to keep as much of the data in registers as possible and to make it explicit that we really don't care about the memory layouts (because we really don't), we just want the least number of spills possible. We also want to give the CPU simple straight-line code with the data-dependencies limited to exactly those we can't avoid and with no branches to predict (well, other than the iteration and sample loops, which are unavoidable).
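As a minimal illustration of what that straight-line code looks like (the single diode-node equation, the constants, and all names here are invented for the sketch, not taken from any real model): everything is scalar, the only branches are the iteration loop and its convergence exit, and the shared exponential is computed once.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical example: one nonlinear junction voltage solved with
// Newton-Raphson. Everything except the iteration loop is unrolled into
// scalar straight-line code: no arrays, no matrix indexing, no branches
// in the body, and the data dependencies are exactly the unavoidable ones.
double solve_diode_node(double i_in, double R, double Is, double Vt) {
    double v = 0.0;                          // initial guess
    for (int it = 0; it < 50; ++it) {        // inner NR loop
        double e  = Is * std::exp(v / Vt);   // shared subexpression
        double f  = v / R + e - Is - i_in;   // residual f(v)
        double df = 1.0 / R + e / Vt;        // derivative f'(v)
        double dv = f / df;
        v -= dv;
        if (std::fabs(dv) < 1e-12) break;    // converged
    }
    return v;
}
```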

In my old Salt prototype, I had two solvers, "reference" and "JIT", which used the exact same high-level algorithm, but with "reference" using generic (although still optimized) C++ code and "JIT" just unrolling the actual arithmetic into raw binary code (with register allocation and a bunch of peephole optimizations). The performance difference in favor of the JIT was typically something like 10-20 times, even though my generated binary code was somewhat slower than what a production C++ compiler would produce. Since then I haven't even bothered trying to use a generic solver; it's code generation all the way. YMMV.

Post

I get it now. So what you suggest is a “whole-algorithm unrolling” optimization, so the ALU is kept saturated with directly required calculations. Because even if I use algorithms taking advantage of the sparsity of the MNA matrix, even if I use SIMD-optimized linear algebra libraries, there is still significant overhead caused by the algorithm structure. I wouldn’t have figured it was this important; I already got a 5× speed-up just by switching from MNA to the DK method.

So I have to write code-spitting linear algebra algorithms instead of actually computing the results. I have to keep track of the data dependencies so I can cut paths that generate unused results or use data known to be 0. I can also use formal expressions to factor out products whose coefficients are known to be of the same magnitude, or to eliminate unity products. Maybe the compiler will be able to vectorize some computations. It sounds exciting, but this will be a huge amount of work.

Then I could check if it is better to start from a raw MNA or from the DK method.

Thank you very much for the tip.

Post

Fire Sledge - Ohm Force wrote: Tue Jun 09, 2020 9:38 am I get it now. So what you suggest is a “whole-algorithm unrolling” optimization, so the ALU is kept saturated with directly required calculations.
Well, that's one part of it. The other part is that "local" optimizations (i.e. within a single basic block of straight-line code without branches) are almost always much easier for the compiler to do (e.g. even something as basic as register allocation goes from "NP-complete" to "linear-time" for an "optimal" solution).

So you want to give the CPU straight-line code where it spends most of the time actually computing something, but you also want to give the compiler straight-line code where all the possible optimization opportunities are actually visible without any fancy data-flow analysis. Here, things like moving all the matrix elements into scalar variables are essentially about making sure that the compiler doesn't get a chance to apply a poor heuristic where it chooses to treat your matrix as memory. In fact, I would suggest going further and producing SSA, just to make sure the compiler doesn't accidentally think two different values need to share a variable... but this is obviously only useful if you are working with a compiler that doesn't do such an SSA conversion already anyway.
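A toy before/after of the scalarization point (the two-element "row" and all names are invented for the example; it stands in for one row of a matrix-vector product inside a solver):

```cpp
#include <cassert>

// Before: the compiler has to treat A and x as memory, and must prove
// (or fail to prove) that nothing aliases before it can optimize freely.
double row_mem(const double A[2], const double x[2]) {
    return A[0] * x[0] + A[1] * x[1];
}

// After: every value has its own scalar name, nothing can alias, and
// register allocation over this straight-line code is trivial.
double row_scalar(double a0, double a1, double x0, double x1) {
    double p0 = a0 * x0;   // each intermediate gets a fresh name (SSA-style)
    double p1 = a1 * x1;
    return p0 + p1;
}
```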

Post

Hi mystran & Fire,

First, thanks to both of you for continuing to post here.

Would it not be possible to write a "generic solver" that would be optimized for register usage, etc. ? From your (mystran's) earlier postings about this subject on KVR Audio forums, you mention MNA, stamps, etc. so it appears (to me anyway) that this should be possible. If possible but not recommended, would you please say a few words about that --- for example, about performance impact or other issues that don't come to my mind?

Also, if the performance asymptotically approaches that of a generic solver as the number of nodes increases, do you have any feeling for the number of nodes where the performances are, say, 50% different? I'm just trying to get some sort of an idea of how the two compare over the complete range of circuits.

Thanks again to both of you.

Regards,
Dave Clark

Post

DaveClark wrote: Tue Jun 09, 2020 6:12 pm Would it not be possible to write a "generic solver" that would be optimized for register usage, etc. ?
Sure.. you just write a code generator and feed the output to a JIT. :)

As long as your JIT is at least somewhat optimizing (ie. at least register allocation is a must), this should get you fairly close to "optimal" performance.

ps. I want to emphasize that this is not a joke. For a high-performance generic solver you really want a JIT and if I'm not entirely mistaken, that's what some traditional circuit simulators do as well (although the code you want to generate will be quite different from the code a regular circuit simulator would want to use).

Post

Hi mystran,

Thanks for your response. Perhaps there are others out there who, like me, may be misunderstanding what you are saying. If not, I can send a PM to discuss offline so as not to derail this thread entirely. Meanwhile, please allow me to explain just a bit what my confusion is.

In another KVR thread from 2013, you mentioned code autogeneration and JIT as preferable to "manual." I was assuming that you meant that some coders were manually building programs that solved only one specific circuit at a time. I concluded that you were suggesting that it was far preferable to build a program that takes a circuit description and autogenerates code for that same specific circuit, but could also autogenerate code for any specific circuit. It also appeared to me that you were saying that the solvers used for this approach were, for some reason I could not perceive, always superior to solvers used for SPICE. If someone had replaced the solver in a program like SPICE with the type of solver that you are talking about, then how much difference would remain between autogeneration/JIT and SPICE? I see SPICE as a kind of JIT compared to manually coding a pile of specific-circuit simulators, because it compiles matrices from stamps on-the-fly from the input "deck."

Obviously, I'm missing something! Perhaps it is simply that the code for a circuit simulator like SPICE cannot be further optimized after the input has been read in the same way that JIT code could be. It's more frozen into place. On the other hand, the optimizations possible with JIT are also frozen into place in some ways. So the tradeoff might be reduced to differences between optimization schemes of the compilers, with no guarantee that either is superior. The more I think about it, the more confusing it seems.

Thanks for any comments/additional info, and once again, I'd be happy to take this offline or just drop it if you are busy.

Regards,
Dave Clark

Post

DaveClark wrote: Thu Jun 11, 2020 4:46 pm In another KVR thread from 2013, you mentioned code autogeneration and JIT as preferable to "manual." I was assuming that you meant that some coders were manually building programs that solved only one specific circuit at a time. I concluded that you were suggesting that it was far preferable to build a program that takes a circuit description and autogenerates code for that same specific circuit, but could also autogenerate code for any specific circuit.
I think this might have been with regard to ZDF filters at the time, although I'm not quite sure, because the basic situation is essentially the same. You can build a filter or circuit model by writing down the equations, then working out a way to solve those equations manually and/or with the assistance of some computer algebra system. However, this is ridiculously slow and error-prone. Why would you want to spend 15 minutes (or more) every time you need to make the tiniest change to your equations, when you could just as well hit a button and get the results instantly?

So what I've been trying to suggest for years is that it is ultimately faster to write a code-generator that takes the whole system and produces code for you. Once you have such a code-generator it literally takes seconds to produce new code when you change your model in one way or another, which means that you can try orders of magnitude more variations. This in turn allows you to quickly explore the differences of different device models and what not. As long as your code-generator is correct, the output code will also always be correct, so you save a huge chunk of debugging time as well.

Whether you actually JIT depends on whether or not you need to produce code for a circuit on the fly. If you only have a static circuit that you want to model once at development time, then there is no reason to bother with JIT. In fact, ahead-of-time compilation will probably give you slightly better code as ahead-of-time compilers will typically optimize more aggressively and you can write the code-generator in some high-level scripting language which will make it a whole lot easier too.
DaveClark wrote: Thu Jun 11, 2020 4:46 pm It also appeared to me that you were saying that the solvers used for this approach were, for some reason I could not perceive, always superior to solvers used for SPICE. If someone had replaced the solver in a program like SPICE with the type of solver that you are talking about, then how much difference would remain between autogeneration/JIT and SPICE? I see SPICE as a kind of JIT compared to manually coding a pile of specific-circuit simulators because it compiles matrices from stamps on-the-fly from the input "deck."
I believe some SPICE variants actually use JIT to accelerate the simulation, but it's not really about producing "superior" solvers, rather about producing solvers with a different goal. An industrial circuit simulator needs to (try to) produce accurate results in as many situations as possible, and it usually solves the simulation for the whole circuit, so you can plot whatever voltages or currents you might choose afterwards. For simulation at audio rates, for the purpose of processing audio where you truly only care about the input and output, you can skip much of the overhead by not solving for stuff you don't care about, and by using device models that only model exactly what you need at audio rates, rather than also modelling behaviour that might be critical for radio-frequency engineering but is entirely irrelevant for audio.

Still, the main point I'm trying to make here is not about JIT as such, but rather the fact that a generic code path that takes a circuit description and then simulates said circuit has non-trivial "interpretative overhead", which can be avoided by generating code for the specific circuit (whether you use a JIT or ahead-of-time compilation). If you think about the circuit description as a programming language, then this is very clearly the exact same thing as the difference between interpreted and compiled languages: the latter are usually significantly faster, which is why even modern "interpreted" languages often end up using a JIT when ahead-of-time compilation is not available.

Post

Hi mystran,

Thanks for taking the time to reply. I have also requested information from a former colleague and friend who has been a SPICE developer for over 35 years (hard to believe!). I'll PM you if he says anything that I believe you may be interested in, if that's OK with you. I'd like to get him interested in circuit simulation for audio, but he probably won't do it. The art could use a serious push.

Regards,
Dave Clark

Post

I quickly hacked a code generator just for a dense 10×10 matrix LU decomposition plus the down/up traversal, which accounted for about half of the time spent, and already got a 2× speed-up on the first try. I've tested it only on my home computer, not on the RPi4 yet, but it's a very promising improvement indeed!

Edit: not as good on the RPi4, only +25%, but it is still appreciable.

Post

Hey Fire Sledge, what did you use for the code generation? And do you mean that you essentially generated a function which is the whole matrix LU and traversal "unrolled"?

Post

drofdissonance wrote: Wed Aug 05, 2020 5:56 am Hey Fire Sledge, what did you use for the code generation?
I just used a C++ function of my own. It’s not very complicated. Basically I started from a standard LU decomposition function, then turned every operation into printing an equivalent line of code. Of course you have to generate a lot of temporary variables with their own names, and this becomes slightly messy. The very first step could be optimised a bit further because the row order is still completely known at that point, but this is a detail.
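Roughly, the print-instead-of-compute idea can be sketched like this (all names are hypothetical, and it skips the pivot search, so it's simpler than a real version): run a textbook in-place elimination, but each matrix cell holds the *name* of the scalar currently storing it, and every arithmetic operation emits a line of C code with a fresh temporary.

```cpp
#include <string>
#include <vector>

// Sketch: emit unrolled C code for an n x n LU elimination (no pivoting).
// Each cell tracks the name of the scalar variable currently holding it;
// every update gets a fresh SSA-style temporary name.
std::string emit_lu(int n) {
    std::vector<std::string> cell(n * n);
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < n; ++c)
            cell[r * n + c] = "a" + std::to_string(r) + std::to_string(c);

    std::string out;
    int tmp = 0;
    for (int k = 0; k < n; ++k) {                  // eliminate column k
        for (int r = k + 1; r < n; ++r) {
            // factor = a_rk / a_kk
            std::string f = "t" + std::to_string(tmp++);
            out += "double " + f + " = " + cell[r * n + k]
                 + " / " + cell[k * n + k] + ";\n";
            for (int c = k + 1; c < n; ++c) {      // update trailing submatrix
                // a_rc = a_rc - factor * a_kc, under a fresh name
                std::string v = "t" + std::to_string(tmp++);
                out += "double " + v + " = " + cell[r * n + c]
                     + " - " + f + " * " + cell[k * n + c] + ";\n";
                cell[r * n + c] = v;
            }
        }
    }
    return out;
}
```

For a 2×2 input this emits just two lines (the factor and one update); a sparsity-aware version would additionally skip emitting any line whose inputs are known zeros or ones.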

drofdissonance wrote: Wed Aug 05, 2020 5:56 am And do you mean that you essentially generated a function which is the whole matrix LU and traversal "unrolled"?
Yes, more or less. The function runs with no hint about the pivoting order, so it has to find suitable pivots, and that causes some conditionals. If you’re curious, the generated code looks like this.

But now I’m working on the unrolling of the whole DK-method and this is much more complex.

Post

Right, that makes sense. Sounds like a great first strategy; I should give it a go on something. I came across some C code generators in Python, and I'm thinking of doing a bit of digging into those to see if they'd also be suitable.
