FM with feedback/closed loops

DSP, Plugin and Host development discussion.

Post

Hi folks, I'm programming an FM synth and I get the basics of it, but I'm confused about how FM works when you add self-modulating oscillators/operators or closed loops à la FM8. For instance, something like:

Code: Select all

operator 1 modulates 2, 2 modulates 3, 3 modulates 1. 
In that case, which is the "must do" approach?
1. All oscillators are rendered each block, their output is cached somewhere, and each destination oscillator updates its phase with that cached modulation. Repeat this process each block/sample.

Code: Select all

each block: 
          step 1: render 1, 2 and 3 
          step 2: modulate 2, 3 and 1
Way more efficient, but I don't know if it's accurate when it comes to modulation.

2. Somehow order the oscillators depending on their connections: the sources are rendered first, then they update the destinations' phases with the cached output, then the destinations are rendered. Stay in a loop until all the modulations have been performed.

Code: Select all

each block: 
          step 1: render 1
          step 2: modulate 2
          step 3: render 2
          step 4: modulate 3
          step 5: render 3
          step 6: modulate 1
Maybe more accurate, but much harder on the CPU with multiple loops/modulations.

I'd really love to maintain a "block rendering" approach (i.e. rendering the X samples of the oscillator each block) instead of tick rendering, since it's much more cache friendly and better for optimizations.

Post

For feedback loops, you'll really want to process on a per-sample basis to keep the loop delays down. Strictly speaking this is already one unit delay too much, but there's not a whole lot you can do about that without running iterative solvers, and it's really what all the classics have anyway. Either way, just forget block rendering for feedback FM.

The question then really is whether you should render one sample of each oscillator, then process all modulations, and then render the next sample, or process the modulations from one oscillator before processing the next one. In the first case, you're basically adding 1 sample of delay to all modulations, whereas in the latter case the delay only applies to the feedback loops.
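Sketched in C++ (all struct and function names hypothetical), the latter case for the 1 -> 2 -> 3 -> 1 ring from the original question might look like this: each operator reads the freshest available source sample, so only the loop-closing 3 -> 1 link carries the unit delay.

```cpp
#include <cmath>
#include <vector>

const double kTwoPi = 6.283185307179586;

struct Op {
    double phase = 0.0; // in cycles, [0, 1)
    double freq  = 0.0; // cycles per sample
    double out   = 0.0; // last rendered sample
};

inline double renderOp(Op& op, double modIn, double index) {
    // phase modulation, which is what most "FM" synths actually do
    double s = std::sin(kTwoPi * (op.phase + index * modIn));
    op.phase += op.freq;
    op.phase -= std::floor(op.phase); // wrap to [0, 1)
    op.out = s;
    return s;
}

std::vector<double> renderBlock(Op ops[3], double index, int n) {
    std::vector<double> out(n);
    for (int i = 0; i < n; ++i) {
        double fb = ops[2].out;              // 3 -> 1: previous sample (the unit delay)
        renderOp(ops[0], fb, index);
        renderOp(ops[1], ops[0].out, index); // 1 -> 2: same sample, no delay
        renderOp(ops[2], ops[1].out, index); // 2 -> 3: same sample, no delay
        out[i] = ops[2].out;
    }
    return out;
}
```

The first case would instead read all three `out` values before rendering any operator, giving every link the one-sample delay.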

With fixed algorithms (eg. DX7) you already have an obvious order to process the oscillators and modulations. With an FM8-style matrix, I guess the most obvious thing to do would be to process one column at a time, such that modulations below the diagonal would be instant while direct feedback and modulations above the diagonal would have the single-sample delay. Since the unit delays are not entirely negligible in terms of sound design, this has the advantage of being predictable, whereas some automatic sorting scheme might lead to some user frustration down the line.
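A minimal sketch of that matrix scheme (the layout and names are my own assumptions): `mod[dst][src]` is the amount by which operator `src` modulates operator `dst`, operators are processed in index order, so links from an already-processed source (below the diagonal) are instant, while feedback and above-diagonal links see the previous sample.

```cpp
#include <cmath>

constexpr int N = 4;
const double kTwoPi = 6.283185307179586;

struct MatrixFM {
    double phase[N]  = {};
    double freq[N]   = {};
    double prev[N]   = {};  // previous-sample outputs
    double mod[N][N] = {};  // mod[dst][src]: amount src modulates dst

    double tick() {
        double fresh[N] = {};
        for (int dst = 0; dst < N; ++dst) {
            double m = 0.0;
            for (int src = 0; src < N; ++src)
                // already-rendered sources are instant, the rest delayed
                m += mod[dst][src] * (src < dst ? fresh[src] : prev[src]);
            fresh[dst] = std::sin(kTwoPi * (phase[dst] + m));
            phase[dst] += freq[dst];
            phase[dst] -= std::floor(phase[dst]); // wrap to [0, 1)
        }
        for (int i = 0; i < N; ++i) prev[i] = fresh[i];
        return fresh[N - 1]; // e.g. treat the last operator as the carrier
    }
};
```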

Post

For feedback loops, you'll really want to process on per-sample basis to keep the loop delays down
That's what I was afraid of. I'm already doing it per tick/sample, but it's hitting the cache so hard that SIMD barely improves performance over a normal unoptimized release build, so I was hoping something could be done in blocks. I can't test the FM8 matrix in demo mode (or it just doesn't work on my PC), but I believe it's quite efficient even with a high number of voices and unisons.
In the first case, you're basically adding 1 sample of delay to all modulations, where as in the latter cases the delay only applies to the feedback loops.
Taking the latter approach, I guess it should be a matter of establishing some kind of priority in the rendering order, like the column perspective you mentioned, or ordering by some other criterion (i.e. the fewer modulation inputs, the higher the priority). Anyway, I guess as long as it does "real" FM with basic linear modulations, the uncertainty of feedback loops will be more a matter of sonic taste than anything "mathematically right/wrong", since there's no way around choosing one origin and having it modulated on the next sample.

Post

You can compute the oscillators' phases for each sample in blocks, and use these phase buffers to render and modulate later.

Post

jgalt91 wrote: Wed Dec 04, 2019 7:11 pm
For feedback loops, you'll really want to process on per-sample basis to keep the loop delays down
That's what I was afraid of, I'm already doing it per tick/sample but it's hitting the cache so hard that SIMD barely improves performance over a normal unoptimized release build, so I was hoping something could be done in blocks.
Well, it's not possible to SIMD a wavetable lookup (lots of FM synths use wavetables). Make sure the wavetables are not too large and have an efficient implementation. You can also utilize memory parallelism for lookups if you do more than 1 voice at a time (lookups in different voices have no dependency on each other; that can help if your code is bottlenecked by the latency of lookups, or the latency of instructions in general). For example:

Code: Select all

each block: 
          step 1a: render 1 voice1
          step 1b: render 1 voice2
          step 2a: modulate 2 voice1
          step 2b: modulate 2 voice2
I wouldn't bother with interleaving more than 2-4 times (theoretically it should be 8 times for modern CPUs, but this technique has diminishing returns), and even then it's very sensitive to inlining and overheads of any kind. It's not elegant and you need to write a lot of code for it (C++ templates don't solve this problem either), but it's the only way I know to accelerate anything with feedback without modifying the original algorithm.
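A sketch of 2-way voice interleaving in C++ (struct and names hypothetical): the two wavetable lookups have no data dependency on each other, so their load latencies can overlap in the pipeline even though each voice's own feedback chain stays strictly serial.

```cpp
#include <cstdint>

struct Voice {
    uint32_t phase = 0;           // fixed-point phase, 2^32 == one cycle
    uint32_t inc = 0;             // phase increment per sample
    float prev = 0.0f;            // previous output, fed back as modulation
    const float* table = nullptr; // 4096-entry wavetable
};

static inline float lookup(const Voice& v, float fb) {
    // feedback scaled to phase units (2^20 per table step); cast via
    // int32_t so negative samples wrap correctly in unsigned arithmetic
    uint32_t p = v.phase + (uint32_t)(int32_t)(fb * v.prev * 1048576.0f);
    return v.table[p >> 20]; // top 12 bits index the 4096-entry table
}

void renderPair(Voice& a, Voice& b, float fb, float* outA, float* outB, int n) {
    for (int i = 0; i < n; ++i) {
        float sa = lookup(a, fb); // both lookups issued back to back:
        float sb = lookup(b, fb); // independent, so the loads can overlap
        a.phase += a.inc;  b.phase += b.inc;
        a.prev = sa;       b.prev = sb;
        outA[i] = sa;      outB[i] = sb;
    }
}
```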

Post

You can compute oscillators' phases for each sample in blocks. And use this phase buffers to render and modulate later.
But then, wouldn't the "looped modulations and self-modulated oscillators" make the modulations effective only in the next block, thus having an initial delay of X samples (the size of the audio block)?
Well it's not possible to SIMD a wavetable lookup
Well, I'm using SIMD on the linear interpolation, calculating all the unison voices' interpolations in only a few SIMD operations. But doing one sample at a time and then moving on to something else (updating phases), instead of processing the whole block continuously, takes a toll on the cache (or so I think, since with SIMD the profiler shows data-movement instructions as the highest cost instead of arithmetic).

The memory parallelism on voice rendering seems like a good idea; the problem is I'm using JUCE, which manages voice rendering internally, so there's not much I can do without modifying the framework itself :(

Post

mystran wrote: Wed Dec 04, 2019 6:29 pm For feedback loops, you'll really want to process on per-sample basis to keep the loop delays down. Strictly speaking this is already one unit delay too much, but there's not a whole lot you can do about that without running iterative solvers and it's really what all the classics have anyway. Either way, just forget block rendering for feedback FM.
That's why I'm doing FM on an FPGA at an extreme sampling rate. A one-cycle delay is not a problem. There is something no current software will be able to clone, because any loop will break down at some point in self-oscillation unless you lowpass filter each osc to the point where the sound becomes dull. In other words, for really high feedback factors over several oscillators, the only real solution seems to be an "analog"-like sampling rate to keep the architecture straightforward and expandable.

Post

One of the first synths I wrote when I started getting into audio was a pretty typical FM synth with 6x6 free matrix modulation. This was some 15 years ago, I guess, but back then the bottleneck was actually converting floating point into LUT indexes. Converting all the oscillator code into fixed point gave such a speedup that it really just wasn't even funny.

Post

mystran wrote: Thu Dec 05, 2019 10:11 am One of the first synths I wrote when I started getting into audio was a pretty typical FM synth with 6x6 free matrix modulation. This was some 15 years ago, I guess, but back then the bottleneck was actually converting floating point into LUT indexes. Converting all the oscillator code into fixed point gave such a speedup that it really just wasn't even funny.
Interesting. I'm not using floating point at all, it's all fixed point. I have a similar approach, just with fewer options: a 4x4 matrix. But FM is only one part of the list. Every signal has its own resolution; of course sometimes you need to shift a number, but the exponent is fixed.

Post

I don't think programming in fixed-point generally makes sense unless you're stuck on a platform with poor floating-point, but there are some odd situations where it does lead to some advantages.

Post

mystran wrote: Thu Dec 05, 2019 1:29 pm I don't think programming in fixed-point generally makes sense unless you're stuck on a platform with poor floating-point, but there are some odd situations where it does lead to some advantages.
Well, an FPGA is not a processor architecture. In a processor, all numbers carry their exponents around; a summation involves initial shifting before the actual addition. In terms of dynamic range, floating point offers no advantage; on the contrary, if you sum a very large number and a very small number, it would have been better to spend the exponent bits as mantissa bits, increasing the effective dynamic range.

Post

jgalt91 wrote: Thu Dec 05, 2019 2:57 am But then, wouldn't the "looped modulations and automodulated oscilators" make the modulations effective in the next block, thus having an initial delay of X samples (the size of the audio block)?
If you use phase modulation (which is what most FM synths do), you compute the phases without any modulation first anyway, and only afterwards apply the modulation to them. These are the phases of the unmodulated oscillator and they do not depend on any modulation, so you can make computing them a separate step in your process function.
If you really modulate the oscillator frequency, then you can't use SIMD in any way.
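A sketch of that split (function names are my own assumptions): the unmodulated phase ramp depends only on the oscillator's frequency, so it can be filled for a whole block in a tight, vectorizable loop, and the per-sample modulation is only added at render time.

```cpp
#include <cmath>
#include <vector>

// Block step: fill the unmodulated phase ramp; no dependency on any
// modulator, so this loop is trivially block/SIMD friendly.
void fillPhases(double phase0, double inc, std::vector<double>& phases) {
    for (size_t i = 0; i < phases.size(); ++i) {
        double p = phase0 + inc * (double)i;
        phases[i] = p - std::floor(p); // wrap to [0, 1)
    }
}

// Per-sample step: phase modulation applied on top of the precomputed ramp.
double renderSample(double basePhase, double modIn) {
    return std::sin(6.283185307179586 * (basePhase + modIn));
}
```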

Post

My thoughts:
- I'm not sold on the idea of modulation matrices for FM. Sure, they look neat, they're a generalization so they're very flexible, and they don't have the issue of starting a patch off with the wrong operator, but they tend to lead to overly modulated sounds. I kinda feel that adding together simple 2-operator stacks, with the modulator having some feedback on itself, is where it's at for a lot of FM sounds, and then more complex modulation setups can be built as an elaboration of this for when you need something more gritty (such as combining two 2-op groups into a 4-op stack, or having some extra bonus feedback/modulation links that can be optionally added on top, like on the SY-77/99, etc).

- For delays between operators, real Yamaha synths have 0, 1 or 2 samples of delay between operators (depending on the chip, algorithm, and the individual voice). These small delays are pretty hard to notice. Feedback is usually done as a mix of the previous sample and the sample before that.

- If there ever was an algorithm that you'd do in fixed point rather than floating point, it's FM synthesis. The thing is that for some reason, mainstream CPU pipelines seem to REALLY hate it when you convert a float into an int and then use that as a table index. I'm not sure if it's because of very long-latency memory address computations blocking the issue of new memory operations, CPU domain-crossing penalties, CPUs intentionally running the FPU later in the pipeline, defeating the prefetcher with crazy address patterns, or whatnot, but it seems to completely dwarf the other operations in terms of slowness. For FM synthesis, another benefit is that you can take advantage of integer wrapping to do some of the work for you.

- FM is kind of the archetype of an algorithm where SIMD makes no sense. All the independent wavetable loads with unpredictable patterns create way more instructions in flight in the pipeline than whatever you save on parallel multiplications and additions. Generally SIMD only makes sense when your memory accesses are 100% linear in and out. If you have a table lookup somewhere in there (which is the case in FM), you can pretty much forget about any kind of SSE or AVX. Incidentally, this is what sunk the infamous Intel Larrabee: it could do most of 3D rendering pretty well, but texture mapping with bilinear interpolation was essentially impossible to do in any kind of fast way, and that completely blew up performance.
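The fixed-point and integer-wrapping points above can be sketched like this (names hypothetical): with the phase held in a `uint32_t`, 2^32 counts span one cycle, so the unsigned overflow on `phase += inc` performs the phase wrap for free, and the table index is a plain shift with no float-to-int conversion anywhere in the inner loop.

```cpp
#include <cstdint>

// One oscillator tick in pure fixed point. `modIn` is a modulation input
// already expressed in the same fixed-point phase scale.
float fixedOscTick(uint32_t& phase, uint32_t inc,
                   const float* table /* 4096 entries */, int32_t modIn) {
    float out = table[(phase + (uint32_t)modIn) >> 20]; // top 12 bits index
    phase += inc; // wraps automatically at 2^32, i.e. one full cycle
    return out;
}
```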

If you want your FM synth to run faster, I'd look into the following possibilities:
- You can typically run some calculations (LFOs, envelope generators) once per N samples instead of every sample, and use volume ramping to prevent this from creating artifacts.
- Processing everything voice-per-voice is normally fast enough, processing the individual voice components separately is generally not necessary.
- If your wavetable is large enough (something on the order of ~4096 samples), you might not need to interpolate your wavetable lookup, and then your oscillator process becomes literally op_phase += op_freq; out = op_vol * op_wave[(op_phase + modulation_in) >> 20]; which is pretty efficient.
- Copying your voice variables into stack variables can generate a speedup (especially with stuff like pointers). This is because it lets the compiler put the variable into a register rather than doing memory loads/stores every time. You don't need to go overboard with this (there are only 16 registers, after all), but it isn't too hard to apply, and it often generates a larger speedup than SIMD.
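The last point might look like this in practice (struct and names hypothetical): members are copied into locals before the hot loop, so the compiler can keep them in registers without re-proving on every iteration that writes through `out` don't alias the voice state.

```cpp
struct OscState {
    double phase; // current phase in cycles, [0, 1)
    double inc;   // phase increment per sample
};

void renderRamp(OscState& s, double* out, int n) {
    double phase = s.phase; // copy members into locals...
    double inc   = s.inc;
    for (int i = 0; i < n; ++i) {
        out[i] = phase;
        phase += inc;
        if (phase >= 1.0) phase -= 1.0; // wrap to [0, 1)
    }
    s.phase = phase;        // ...and write back once at the end
}
```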

Post

MadBrain wrote: Fri Dec 06, 2019 7:54 am - Copying your voice variables into stack variables can generate a speedup (especially with stuff like pointers). This is because it lets the compiler put the variable into a register rather than doing memory loads/stores every time. You don't need to go overboard with this (there are only 16 registers, after all), but it isn't too hard to apply, and it often generates a larger speedup than SIMD.
It should be noted that all modern compilers other than MSVC can normally cache heap variables in registers as well, unless there is some aliasing hazard. Whether MSVC can be convinced to do it in some cases, I don't know. When the compiler can already do this, the best case for manual promotion to locals is identical performance, and the worst case is a slight theoretical overhead.

That said, with a compiler that doesn't do this, you might potentially still benefit from having more locals than registers, because compilers can typically split live ranges, allocating the variable to a register for part of its lifetime rather than using loads/stores exclusively. It might also enable some other optimisations that the compiler would normally not do on heap variables.

On the other hand, if the variable is locally read-only, then there is not necessarily much to be gained by moving it into a register, unless having it in memory blocks some other optimisation. Having a read-only memory operand does add some latency, but since this latency comes from a separate uop with no dependency on the first operand (and it can therefore execute before the first operand is available), it doesn't really add anything to the critical path. I suppose you might waste a bit of L1 bandwidth, but that's about it.

edit: This confirms that MSVC doesn't do type-based aliasing, which is what the other compilers rely on in order to do this optimisation automatically: https://developercommunity.visualstudio ... piler.html

Post

mystran wrote: Fri Dec 06, 2019 10:32 am
It should be noted that all modern compilers other than MSVC can normally cache heap variables in registers as well, unless there is some aliasing hazard.
There will be aliasing hazards with writes through pointers or to dynamic structures of any kind, such as a vector. Compilers often insert additional code before loops to check for aliasing, but it's not reliable. It may even prevent autovectorization, with the heuristic "too many aliasing checks = no profit". But I've also had the clang vectorizer do something like 20 address aliasing checks before entering a loop, and it was worth it though :D .
