Frequency domain simulation of temporal domain processes, FFT stuff

DSP, Plugin and Host development discussion.

Post

I have an extensive blueprint for an instrument I've been prototyping for years, and I need to start thinking specifically about how I'm going to code certain things (w/ C++, JUCE). I'm trying to learn as much as I can about FFT stuff and spectral manipulation because it's very relevant to the instrument and CPU is a major consideration. Below are some questions I need to figure out. These questions are being asked independent of whether it's an audio sample or an oscillator being played. I'm sure some of these questions don't make sense -- I'm very new to this and still learning.

1. How is it that I can just add partials to a sound and then IFFT and not have it sound mangled? Like if I have a sine wave, do an FFT, then add a partial one octave below, then IFFT, won't the synchronization/phase stuff get messed up because it takes twice the time for the lower partial to cycle than the original? So wouldn't the lower partial effectively become a half-sine waveform and sound totally different? I'm probably just confused how the FFT stuff works.

2. With an oscillator wave shape that you want to spectrally manipulate, can you do an FFT ahead of time when the shape changes in the UI instead of repeatedly doing an FFT in the processor? That way, you can just recall the saved FFT and start in the frequency domain, and then you would only have to do an IFFT in the processor to resynthesize instead of both FFT and IFFT. If this is possible for an oscillator, is there any way it can also be done for an audio sample? I assume the answer is no, because there's so much information in an audio sample that it would require an unrealistic amount of storage, whereas an oscillator is just a single FFT on a single cycle.

3. I've seen instruments that "simulate" doing hard sync waveform manipulation (a temporal process) through the frequency domain. Is it possible to easily simulate all typical temporal domain processes in the frequency domain? For example, if you want to do stuff like FM, PD, etc. can that be done after the FFT? If so, where would I learn about doing that?

4. Is it faster or slower to do stuff in the frequency domain vs. temporal domain? If it's slower in the frequency domain, would the benefit of avoiding an FFT/IFFT for say 100 simultaneous voices make it in turn faster? I'm wondering if I should just do everything in the frequency domain until the very end of the process chain, sum all spectrums, then do a single IFFT.

5. What are the main things that actually eat up CPU in a typical instrument? I feel like it's possible I'm focusing too much on avoiding FFTs when maybe they aren't as significant to speed as other things. Of course I will be profiling things extensively regardless.

Just spitballing here. I'm an experienced C++ programmer (not in audio), so feel free to use whatever programming lingo. Any response to any of these questions is appreciated, thanks.

Post

As to item 1, think of it this way. You start with a set of sample values in a table. Let's call that a wave table, and this gets input to the FFT process. That process doesn't know or care if the wave table contains a single cycle waveform or more complex audio, maybe even multiple cycles of a single cycle waveform. If the wave table holds exactly one cycle of a sine wave, your FFT data will have a non-zero level value in the first slot and zeroes in all the others. You would not be able to add a sub-sine because you have no lower slots to put it into.

If your wave table held three sine wave cycles, the third slot would be non-zero and all the others zero. So here you could add a sub-sine in slot 2 and a sub-sub-sine in slot 1.
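The slot picture above can be checked numerically: put three sine cycles in a table, transform it, and the energy lands in slot 3, leaving the lower slots free for sub-partials. A minimal sketch, using a naive per-bin DFT instead of a real FFT library (all names here are made up for illustration):

```cpp
#include <cmath>
#include <complex>
#include <vector>

// Fill a table with 'cycles' full sine cycles.
std::vector<double> sineTable(int length, int cycles)
{
    const double pi = 3.141592653589793;
    std::vector<double> t(length);
    for (int n = 0; n < length; ++n)
        t[n] = std::sin(2.0 * pi * cycles * n / length);
    return t;
}

// Magnitude of one DFT bin, naive O(N) per bin -- fine for a demo;
// a real synth would use an FFT for the full transform.
double binMagnitude(const std::vector<double>& table, int k)
{
    const double pi = 3.141592653589793;
    const double N = (double)table.size();
    std::complex<double> acc(0.0, 0.0);
    for (int n = 0; n < (int)table.size(); ++n)
        acc += table[n] * std::polar(1.0, -2.0 * pi * k * n / N);
    return std::abs(acc) / N;   // 0.5 for a full-scale sine at bin k
}
```

With `sineTable(64, 3)`, `binMagnitude(..., 3)` comes out at 0.5 while slots 1 and 2 are numerically zero, so sub-partials can be written into those lower slots before the inverse transform.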

As to item 2: "whereas an oscillator is just a single FFT on a single cycle". Not necessarily. One common approach to building oscillators is to keep the waveform data in the frequency domain. Use it to generate a wave table (definition as above) and generate the output stream from the wave table, staying in the time domain. Complex waveforms have many higher partials. If we use a wave table to generate a high-pitch signal, we risk aliasing. However, if we zero out the top slots in the frequency domain data before doing an IFFT, we can eliminate the frequencies (higher partials) that would cause aliasing - rather elegant, I think. So using this approach, we don't need to do an IFFT for every cycle in the generated signal, but we might need to do one at the start of each new pitch.
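The "zero the top slots, then IFFT" step might look like the following sketch, where the cutoff is applied by simply not summing the discarded partials (a naive inverse transform from harmonic amplitudes; the function name is made up):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Resynthesize one wavetable cycle from harmonic amplitudes amp[1..],
// keeping only partials up to maxPartial. Zeroing the top spectral slots
// before an IFFT has the same effect; the naive sum keeps the demo small.
std::vector<double> makeBandlimitedTable(const std::vector<double>& amp,
                                         int tableLength, int maxPartial)
{
    const double pi = 3.141592653589793;
    std::vector<double> table(tableLength, 0.0);
    const int top = std::min(maxPartial, (int)amp.size() - 1);
    for (int k = 1; k <= top; ++k)            // partial index (1 = fundamental)
        for (int n = 0; n < tableLength; ++n)
            table[n] += amp[k] * std::sin(2.0 * pi * k * n / tableLength);
    return table;
}
```

Called with a smaller `maxPartial` for higher pitches, this produces the band-limited table the post describes, at the cost of one (I)FFT-equivalent per pitch change rather than per cycle.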

There's an extensive thread here somewhere that discusses how to use this technique with wave tables (contemporary definition, meaning an array of single cycle waveforms) if you can find it. This won't answer all your questions, but it should provide a good deal of insight.

[Edit]

I should clarify something: I wrote "That process doesn't know or care if the wave table contains a single cycle waveform, or more complex audio". OK, by definition, what's in the wave table is a single cycle waveform as far as FFT is concerned. It's got a fundamental harmonic and higher harmonics like all single cycle waveforms. It doesn't have to be a sine, triangle, saw, etc. to be a single cycle waveform.
Last edited by dmbaer on Tue Apr 23, 2024 11:57 am, edited 1 time in total.

Post

Thanks for the response dmbaer. Yes, from recent reading I quickly realized how confusing the wavetable definitions are...

My lack of knowledge on how FFT works makes it a little difficult to understand your answer to question 1. I've tried watching videos but they were all very theoretical. I think I just need to go through some FFT code tutorials to actually grasp how it works. I think what you're saying might have something to do with the uncertainty principle? Are you implying that I should take the waveform geometry and duplicate it many many times, and then do the FFT on that so that I have more resolution to play with? What is typically the ideal quantity of partial 'slots' (if that even makes sense)? Because if I know with certainty the static waveform shape ahead of time, I can have the resolution be as large as I want.

As to your response to question 2, I don't intend on having the user ever draw stuff in an additive manner or see anything like that. This strategy wouldn't work for my case anyway, because there's a temporal geometry editor and then a spectral manipulation fx section that occurs after the waveform. This is why I need to repeatedly do an IFFT in the processor: I have no idea what the spectrum will end up being after the spectral fx. As far as aliasing, if I have the correct Nyquist sampling rate and everything, wouldn't whatever is drawn temporally be fine? If not, couldn't I just roll off the top at some point?

Post

rou58 wrote: Mon Apr 22, 2024 4:30 pm
Are you implying that I should take the waveform geometry and duplicate it many many times, and then do the FFT on that so that I have more resolution to play with?
If I'm not mistaken, just repeating a length-N waveform M times and then doing a length M*N FFT instead of a length-N FFT should just intersperse zeros into the spectrum. The spectral amplitude (and phase) that formerly appeared in slot k will now appear in slot M*k - and all the other slots will be zero. (Depending on the FFT implementation, there might be a scaling factor involved - there are different conventions in use for how to scale the spectral data with respect to the buffer length - but the information is the same either way.)

Another typical thing to do to increase the *apparent* resolution is zero-padding. You just take your waveform, append zeros, and then do an FFT on that longer buffer. That has the effect of smoothly interpolating between the original spectral bins in a way that is, in some sense, ideal. It doesn't gain you any real additional information, though - but it can be useful for display purposes (and also for some spectral computations - although when dealing with single cycles of periodic waveforms, I currently can't see any benefit to zero-padding).
What is typically the ideal quantity of partial 'slots' (if that even makes sense)? Because if I know with certainty the static waveform shape ahead of time, I can have the resolution be as large as I want.
Depends on what you want to do with it. For the synthesis of single-cycle waveforms via mip-mapped lookup-table synthesis, I personally settled on length 2048 more than 15 years ago. The rationale was: I want a full spectrum up to 20 kHz even when the fundamental is as low as 20 Hz. That means I need at least 1000 partials. The next power of 2 is 1024, and you need twice that for the FFT size (the number of spectral "slots" is half of the FFT size).

I've since seen many other synths do the same - although back then in the 00s, it seemed to be quite common to use smaller lookup table sizes. This can be noticed by playing the synth in the lowest registers and observing how the spectrum attains a more and more lowpass-ish character the lower you go on the keyboard. The lower notes sound dull - in an unpleasant/unmusical way, because it's a brickwall filter. The trade-off between space requirements and quality was often struck a bit more stingily back then. 2048 seems to be the sweet spot for me. I've also seen this size used in a couple of single-cycle sample packs. Of course, you can use more - but you'll soon get into the territory of diminishing returns.
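The sizing argument above compresses into a few lines (the helper name is made up for the sketch):

```cpp
#include <cmath>

// Table size needed so a fundamental of lowestHz still carries partials
// out to topHz: partial count, rounded up to a power of two, then doubled
// because the number of usable spectral slots is half the FFT size.
int requiredTableLength(double lowestHz, double topHz)
{
    int partials = (int)std::ceil(topHz / lowestHz); // e.g. 20000 / 20 = 1000
    int slots = 1;
    while (slots < partials)
        slots *= 2;                                  // next power of two: 1024
    return 2 * slots;                                // FFT size: 2048
}
```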
Last edited by Music Engineer on Tue Apr 23, 2024 8:58 am, edited 5 times in total.
My website: rs-met.com, My presences on: YouTube, GitHub, Facebook

Post

rou58 wrote: Mon Apr 22, 2024 4:30 pm My lack of knowledge on how FFT works makes it a little difficult to understand your answer to question 1. I've tried watching videos but they were all very theoretical. I think I just need to go through some FFT code tutorials to actually grasp how it works.
To answer your questions, you don't really need to understand the first "F" part of the FFT (fast Fourier transform). I'd recommend learning about the DFT (discrete Fourier transform) first and, for the time being, forget about the "fast" and treat it as an implementation detail. A very good resource for a deep dive into the DFT is this book:

https://ccrma.stanford.edu/~jos/mdft/

Worrying about this "F" would be a distraction - it's not really relevant for the questions about how to use it. The DFT is conceptually much simpler and the FFT is simply a particular algorithm (actually, family of algorithms) to compute the DFT. The difference between DFT and FFT is like the difference between bubble sort and heap sort.

Post

dmbaer wrote: Mon Apr 22, 2024 12:54 pm
There's an extensive thread here somewhere that discusses how to use this technique with wave tables (contemporary definition, meaning an array of single cycle waveforms) if you can find it. This won't answer all your questions, but it should provide a good deal of insight.
Is it this one?

viewtopic.php?t=585568

By the way - as for terminology: yeah, it is somewhat unfortunate that the term "wavetable synthesis" has these two conflicting meanings of "array of single-cycle waveforms" vs. "use a lookup table for a single waveform". I've used the term "wavetable" in the latter sense in the past, but I think I'd prefer using the term in the former sense in the future - but what should I then call the other? Maybe "lookup table synthesis"? I may have to rename a couple of classes in my DSP library...

Post

Music Engineer wrote: Tue Apr 23, 2024 9:40 am
dmbaer wrote: Mon Apr 22, 2024 12:54 pm There's an extensive thread here somewhere that discusses how to use this technique with wave tables [...]
Is it this one?

viewtopic.php?t=585568

By the way - as for terminology: yeah, it is somewhat unfortunate that the term "wavetable synthesis" has these two conflicting meanings of "array of single-cycle waveforms" vs "use a lookup table for a single waveform". I've used the term "wavetable" in the latter sense in the past but I think, I'd prefer using the term in the former sense in the future - but how should I then call the other? Maybe "lookup table synthesis"? I may have to rename a couple of classes in my DSP library...
That is indeed the thread I was thinking of. Thanks.

Seems to me the original meaning of wavetable should take precedence, since it was established years before the new guy showed up. The subject gets even murkier since there are wavetable (new definition) synths that use wavetable (classic definition) oscillators to generate the audio.

Post

Music Engineer wrote: Tue Apr 23, 2024 9:40 am
By the way - as for terminology: yeah, it is somewhat unfortunate that the term "wavetable synthesis" has these two conflicting meanings of "array of single-cycle waveforms" vs "use a lookup table for a single waveform". [...] Maybe "lookup table synthesis"?
single-cycle wavetable versus multi-cycle wavetable?

I think in the past I've suggested "wavesequencing" as descriptive of the latter multi-cycle usage.
my other modular synth is a bugbrand

Post

I don't know what mipmaps are or what band-limited means, so I've got a lot to learn before I can even understand these conversations. That linked thread is great though, just read through it.

Something mentioned there (and I think here) is this idea of processing the waveform when there's a new pitch. Quote from other thread:

"This method can get more complicated when the user changes pitch, say using a pitch envelope, or if the synth uses a morphing wavetable. In this case you could generate a new wavetable WT with each new audio block, or every 500 or so samples. This won't burden a CPU too much, any aliasing caused by rapid increase in pitch before the WT regenerates won't be noticeable, and the transition between morphing wavetables should not be coarse."

Is this to say that the waveform is being updated at a slower rate in the processor than everything else (i.e. the modulation)? If so, are they in separate threads, or is it more of a "if x time has passed, update the waveform" polling type of thing, whereas modulation is updating regardless every time? Why would the waveform update at a slower rate than the modulation? I mean, a modulation system could be massive and complicated, I can't imagine the waveform processing would be so much more expensive than the modulation processing that it would warrant totally separate rates (especially since that might introduce frequent branch mispredictions). Maybe I'm confused.

Post

rou58 wrote: Tue Apr 23, 2024 5:50 pm Is this to say that the waveform is being updated at a slower rate in the processor than everything else (i.e. the modulation)?
I don't think so. Modulate too slowly and you'll hear zipper artefacts, but updating modulations every sample is over the top as well (unless of course you want to support audio-rate modulations).

If you ask me then "each new audio block, or every 500 or so samples" is in the same ballpark as the rate of all other modulations. With default sample rate of 48kHz "every 500 or so samples" is above 100x/sec. Sounds good enough for RocknRoll.
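For concreteness, the update-rate arithmetic (the function name is illustrative, not from any library):

```cpp
// Control-rate updates per second for a given refresh interval, e.g.
// regenerating the wavetable or modulation values every 'blockSamples'.
double controlRateHz(double sampleRate, int blockSamples)
{
    return sampleRate / (double)blockSamples;  // 48000 / 500 = 96 updates/sec
}
```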

But a block of audio is not well defined in size. Some hosts will give you blocks of a full buffer (which could very well be 32 or 2048 samples) and occasionally blocks of a single sample.

Modulation strategy is something you'll have to think about, and decide what compromise works best or well enough for you.
We are the KVR collective. Resistance is futile. You will be assimilated.
My MusicCalc is served over https!!

Post

rou58 wrote: Tue Apr 23, 2024 5:50 pm I don't know what mipmaps are or what band-limited means, so I've got a lot to learn before I can even understand these conversations. That linked thread is great though, just read through it.
There's your first port of call then. Mip-maps are just different versions of your time domain single cycle waveform at different resolutions. So, your highest resolution version would be 2048 samples, then 1024, 512, 256, 128, etc. If you're storing spectral content (FFT bins), those can also be mip-mapped: 1024, 512, 256, 128, 64, as they're more or less half the size of your time domain waveforms (plus one bin at the start for DC offset and one at the end for Nyquist).

In some schemes, and based on what frequency you're playing, you can select a different mip-map to render to audio. Main advantage is helping with memory caching. I don't think it's that uncommon now to stick with 2048 time domain size for the lot though.

Now to band-limiting. Say your sample rate is 48kHz, so your Nyquist limit is at 24kHz. Many non-sine waveforms (saw, square, etc.) have spectral content, known as partials, that extends up from the fundamental frequency. Any of those partials that exceed the Nyquist frequency reflect back into the audible spectrum. As more of them reflect back - say, when you increase the frequency - you're able to hear them very clearly. It produces a generally unpleasant 'interference' sound, something like chirps and whistles.

To fix this, you band-limit the signal. This means trying to get rid of or reduce the effects of spectral content that extends beyond Nyquist and would reflect back into parts of the audible spectrum. There are many ways to tackle this, but if you're dealing with spectral data then you're in luck. In this case you can make a copy of your spectral bins and knowing the current play frequency you can work out which bins would exceed the Nyquist limit. Then just zero them. Now, when you convert the spectral bins back to time domain to play the result, it will be band-limited and alias free.
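The "work out which bins would exceed the Nyquist limit" step might be sketched like this (the function name is made up for illustration):

```cpp
#include <cmath>

// Index of the highest partial strictly below Nyquist for a given
// fundamental; every spectral bin above this gets zeroed before the IFFT.
int highestSafePartial(double fundamentalHz, double sampleRate)
{
    const double nyquist = 0.5 * sampleRate;
    // ceil(...) - 1 gives a strictly-below cutoff, so a partial landing
    // exactly on Nyquist is also dropped.
    return (int)std::ceil(nyquist / fundamentalHz) - 1;
}
```

For a 440 Hz fundamental at 48 kHz this allows 54 partials; at 1 kHz, 23.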

Just to be clear, you'll use FFT to convert from time domain to frequency domain (spectral bins) and IFFT to convert from frequency domain to time domain (samples) ready for rendering audio. In-between these operations you have chance to monkey around with the spectral data if you wish.

With regards to how and when you convert from frequency domain to time domain: a reasonable scheme is to do this initially on 'note on' and every 64, 128 or even 256 samples thereafter. It will of course be independent for each synth voice too. To reduce artefacts you can linearly interpolate between the old time domain waveform and the newly generated one over the intervening samples.
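The interpolate-between-tables idea above might look like the following sketch (all names illustrative; a real oscillator would also interpolate the fractional phase instead of truncating):

```cpp
#include <cmath>
#include <vector>

// Render one block while crossfading linearly from the previous wavetable
// to a freshly generated one. Both tables have the same length; 'phase'
// advances by 'phaseInc' table samples per output sample.
void renderCrossfadeBlock(const std::vector<double>& oldTable,
                          const std::vector<double>& newTable,
                          double& phase, double phaseInc,
                          std::vector<double>& out)
{
    const int n = (int)out.size();
    const int len = (int)oldTable.size();
    for (int i = 0; i < n; ++i) {
        double mix = (double)(i + 1) / (double)n;  // ramps to 1 across the block
        int idx = (int)phase;                      // truncating lookup for brevity
        out[i] = (1.0 - mix) * oldTable[idx] + mix * newTable[idx];
        phase += phaseInc;
        while (phase >= (double)len)
            phase -= (double)len;                  // wrap around the table
    }
}
```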

Post

Music Engineer wrote: Tue Apr 23, 2024 6:51 am I have personally settled to length 2048 more than 15 years ago. The rationale was: I want a full spectrum up to 20 kHz even when the fundamental is as low as 20 Hz. [...] 2048 seems to be the sweet spot for me.
It's still common for additive synths to use fewer partials (64, 128, 256), and you'll see the same partial truncation if you play lower notes and inspect with a frequency analyser. It's easy to see in Ableton's Operator, which has 64 partials, and also in the additive engine in Pigments. Not sure what the answer is in this case - for additive synths, summing up to 1024 sines (partials) for each sample, even with SIMD, is a real strain. I'm still actively concocting schemes that aim to make it viable.

Anyhow, thought I'd mention it. For wavetable synths with frequency-domain wave generation, 2048 is totally the sweet spot.

Post

"...for additive synths summing up to 1024 sines (partials) for each sample, even with SIMD, is a real strain."

I'm confused how additive synths are any different from anything else. Like, technically everything is "additive", as additive is just referring to the frequency domain. So I'm confused how you say 1024 is really expensive but then say 2048 is great in some other context. By saying "summing up to 1024", aren't you just saying "do an IFFT"? I guess you said every sample so maybe that's the difference that you're talking about vs. every 500 or so.

Thanks for the explanation on mipmaps and band-limiting. I think some of this could be a problem for me though, like when you say, "knowing the current play frequency you can work out which bins would exceed the Nyquist limit". Take oscillators out of the picture for a second. Say it's an audio file being played back and you want to FFT -> modify -> IFFT in a way that's consistently accurate (so that if nothing is modified, it will sound the same as if you didn't FFT at all). Is this feasible with the same 2048-every-500 method? A lot of this discussion is about wavetable oscillators, but I'm concerned with finding a methodology that is totally independent of sound source. So for example, there's no way that I'll know the pitch if it's an audio sample (as it may not even have one). The key being pressed is all relative.

Post

rou58 wrote: Wed Apr 24, 2024 6:18 am So I'm confused how you say 1024 is really expensive but then say 2048 is great in some other context.
Rendering a lookup table mip-map of length 2048 once (when the user loads a waveform, say) is not an issue. It may only become expensive if you need to render it repeatedly during signal synthesis because you want to modulate some spectral parameters. Realtime additive synthesis is an entirely different story - you'd typically use some sort of (hopefully heavily parallelized) oscillator bank rather than an IFFT. At least, that's what I would use. You can do additive synthesis with IFFT, too - but it's quite unnatural, clunky and messy.
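A minimal (scalar, unoptimized) oscillator bank of the kind meant here: each partial keeps its own phase accumulator, and partials at or above Nyquist are skipped. Struct and member names are made up for the sketch; production code would vectorize the loop and smooth amplitude changes.

```cpp
#include <cmath>
#include <vector>

// Minimal additive sine bank: one phase accumulator per partial.
struct SineBank
{
    std::vector<double> amp;    // amp[k] = level of partial k+1
    std::vector<double> phase;  // radians, one entry per partial
    double f0 = 440.0;          // fundamental in Hz
    double sampleRate = 48000.0;

    double tick()
    {
        const double pi = 3.141592653589793;
        double s = 0.0;
        for (size_t k = 0; k < amp.size(); ++k) {
            const double freq = f0 * (double)(k + 1);
            if (freq >= 0.5 * sampleRate)
                break;                           // band-limit: skip aliasing partials
            s += amp[k] * std::sin(phase[k]);
            phase[k] += 2.0 * pi * freq / sampleRate;
            if (phase[k] > 2.0 * pi)
                phase[k] -= 2.0 * pi;            // keep the accumulator bounded
        }
        return s;
    }
};
```

Unlike IFFT resynthesis, each partial here can have its frequency, level, and phase modulated independently at any rate, which is why oscillator banks feel more natural for additive synthesis.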

Post

"Rendering a lookup table mip-map of length 2048 once (when the user loads a waveform, say) is not an issue. It may only become expensive, if you need render it repeatedly during signal synthesis because you want to modulate some spectral parameters."

Yes, the latter is what I'll need to do. Though I'm not sure what specifically you mean by "render" in this context. A lot of this stuff about changing pitch and re-rendering is not making sense to me.

This is my thinking: do an FFT in the UI thread whenever a waveform in the WT (modern definition) changes, and save it. Then, when the WT position modulation is evaluated in the processor (somewhere morphed between two waveforms in the WT), you just do a quick linear interpolation between the two arrays of spectral data, resulting in the spectral data at that moment -- this way you skip the temporal domain/FFT completely. Then you scale the spectral data depending on what the current pitch is resolved to be relative to the root pitch of the data. Then you manipulate the data, filter out everything above Nyquist, and then IFFT.

I don't understand how pitch increasing complicates this, because the spectral data you just did an IFFT on has the pitch inherently in it. The only way this would be complicated is if pitch modulation were being processed at a faster rate than the WT changes, which would require going back to the frequency domain, adjusting, rolling off above Nyquist, and IFFTing again. But as someone above said, both modulation and WT processing should occur at about the same rate (every 500 samples or so), so... am I missing something? And unfortunately, if it's an audio file, I think you will also have to do an FFT at the beginning of the processor chain, because you can't prepare the FFT ahead of time.

"Realtime additive synthesis is an entirely different story - you'd typically use some sort of (hopefully heavily parallelized) oscillator bank rather than an IFFT."

Are you saying each individual partial (sine wave) is being played, so like a 64 part additive table would be played by 64 sine wave oscillators? If so, that seems insane to me. Why would you not just IFFT and play with a single oscillator?
