BrainDamage, a neural network synthesizer - feedback welcome


Post

kippertoffee wrote: what becomes apparent is that sound design is totally trial and error. Seems like there's very little chance of actually learning the thing and being able to produce a sound I imagine. Not that this is necessarily a bad thing, but it might limit its appeal.
Agreed, it is very exploratory / random. Not sure if people will like that or not. I find it pretty fun - and actually you can treat a single neural net model as a synth in itself and get to know particular neurons. Curation might be an interesting feature - you could label particular neuron parameters descriptively (BrownFuzz, SawPhaser, etc.). Music Engineer had some interesting ideas above about sorting/highlighting the most salient neurons.

Earlier/later layers will produce certain types of effects on the waveform in general, so there is some broad knowledge you can gain from understanding the neural nets. But it's mostly experimental due to the randomness in how models are initialized/trained.

bitwise wrote: Is it an alias-free oscillator ?
Not at the moment. I have an oversampling knob currently (1x - 8x) to help with high pitch notes or waveforms with high harmonics. But need to explore if there are some more efficient techniques.

Post

QuadrupleA wrote: Thu Jun 09, 2022 2:56 pm
bitwise wrote: Is it an alias-free oscillator ?
Not at the moment. I have an oversampling knob currently (1x - 8x) to help with high pitch notes or waveforms with high harmonics. But need to explore if there are some more efficient techniques.
Assuming it works by waveshaping something like phase, then ADAA-style anti-aliasing should at least be theoretically workable (eg. replace activation functions with ADAA-versions), though I'm not sure if it would get expensive to deal with the degenerate situations (ie. very small delta). Might be worth a try if you choose to pursue this further.

One random thing (not related to anti-aliasing) one could possibly try is teaching structurally similar networks different things and then crossfade those by cross-fading coefficients. That might (or might not) result in some interesting morphs with both end-points (or even corners on a square morph-pad or whatever) being the result of training. I have no idea if that'd work out in practice, but if it did then that might reduce the perceived randomness while still keeping a certain degree of unpredictability.

Post

QuadrupleA wrote: Thu Jun 09, 2022 2:56 pm
bitwise wrote: Is it an alias-free oscillator ?
Not at the moment. I have an oversampling knob currently (1x - 8x) to help with high pitch notes or waveforms with high harmonics. But need to explore if there are some more efficient techniques.
Ok. I thought that your method was also an alternative to oversampling.

Post

mystran wrote: Thu Jun 09, 2022 3:37 pm Assuming it works by waveshaping something like phase, then ADAA-style anti-aliasing should at least be theoretically workable (eg. replace activation functions with ADAA-versions), though I'm not sure if it would get expensive to deal with the degenerate situations (ie. very small delta). Might be worth a try if you choose to pursue this further.
Yeah, curious about ADAA. So a simplified approximation of this synth's oscillator would be a polynomial that takes phase (x) and where several of the coefficients can be automated and modulated dynamically. Say, y = Ax^3 + Bx^2 + Cx + D and the synth lets you play around with A thru D either every sample or on a 128 sample timebase or something. From your understanding of ADAA would it apply here? Or would the coefficients have to be fixed? For fixed damages (coefficients) you could probably also sample and cache a wave cycle on the fly and generate some mipmaps to remove aliasing. But if the waveform is constantly morphing then the cache doesn't really help.
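
To make the question concrete, a toy stand-in for that model might look like this (purely illustrative, not the actual plugin code; A through D would be driven by whatever modulation is running):

Code:

// Toy stand-in: a cubic waveshaper on the phase whose coefficients can be
// changed every sample or every 128-sample block. Purely illustrative.
float toyOsc(float phase, float A, float B, float C, float D)
{
    float x = phase;                        // phase in [0, 1)
    return ((A * x + B) * x + C) * x + D;   // A*x^3 + B*x^2 + C*x + D in Horner form
}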

mystran wrote: Thu Jun 09, 2022 3:37 pm One random thing (not related to anti-aliasing) one could possibly try is teaching structurally similar networks different things and then crossfade those by cross-fading coefficients. That might (or might not) result in some interesting morphs with both end-points (or even corners on a square morph-pad or whatever) being the result of training. I have no idea if that'd work out in practice, but if it did then that might reduce the perceived randomness while still keeping a certain degree of unpredictability.
Interesting idea. Here's what you get lerp'ing all the biases and weights between two crude sine and sawtooth 8x8x8 networks - definitely a bit more complex than a straight wave interpolation:

[Animated image: morph between the sine and sawtooth networks]
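
For reference, the morph is just an element-wise lerp over all the coefficients - something like this sketch (illustrative C++, not the actual plugin code, assuming both nets share the same topology and keep their parameters in flat arrays):

Code:

#include <vector>
#include <cassert>

// Hypothetical flat storage for a fixed-topology feed-forward net.
struct NetParams {
    std::vector<float> weights; // all layer weights, concatenated
    std::vector<float> biases;  // all biases, concatenated
};

// Element-wise crossfade between two identically-structured nets:
// t = 0 gives net a, t = 1 gives net b.
NetParams lerpNets(const NetParams& a, const NetParams& b, float t)
{
    assert(a.weights.size() == b.weights.size() && a.biases.size() == b.biases.size());
    NetParams out = a;
    for (size_t i = 0; i < out.weights.size(); ++i)
        out.weights[i] += t * (b.weights[i] - a.weights[i]);
    for (size_t i = 0; i < out.biases.size(); ++i)
        out.biases[i] += t * (b.biases[i] - a.biases[i]);
    return out;
}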

Post

So the basic idea with ADAA is that rather than evaluating a non-linear function at a point, you evaluate its integral over the sampling period (which in the first order version you'd approximate as a linear ramp). In a typical feed-forward neural network only the activation functions are non-linear, hence it would seem that you could just use ADAA with each activation function evaluation like you would any other waveshaper.

Looking at the animated image though (the morph looks cool; I suspect it'd be even more cool with more complex waveforms), there seems to be a problem where there is a discontinuity in the phase itself which will directly cause aliasing (unless it happens to cancel out). One could probably get rid of that by using a complex phasor instead (ie. let the neural network take cosine and sine as two inputs) so that the input itself is continuous and there's no need to worry about aliasing there. I'd imagine this should still train fairly well with other layers being more or less the same size (and it might then open the possibility of also driving the network with other Lissajous curves, which might also give something interesting).
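
A minimal sketch of that continuous input, assuming the net simply takes two inputs instead of one (helper name is hypothetical):

Code:

#include <cmath>

// Map a wrapping phase in [0, 1) to a point on the unit circle. Because
// (cos, sin) is periodic, the wrap from 1 back to 0 no longer produces a jump
// at the network's input.
void phaseToPhasor(float phase, float& x0, float& x1)
{
    const float twoPi = 6.283185307179586f;
    x0 = std::cos(twoPi * phase);
    x1 = std::sin(twoPi * phase);
}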

The choice of activation function will also have some effect, with a piece-wise activation like ReLU likely to cause more aliasing than something smoother (eg. tanh, some gaussian bump, whatever). With the morphing idea in particular, it might actually make sense to experiment with different types of activation functions as these might result in different morphs where the traditional ones (eg. ReLU and sigmoid variations) might not be the most interesting ones. While the choice of activation function can have an effect on how well the network trains and/or how low you can get the error with a given number of neurons, in theory the universal approximation theorem applies with pretty much any non-linear function... but the way the network will configure itself during training will be different depending on what the function looks like (eg. sigmoids and similar partition the space with a hyperplane, while a symmetric "gaussian like" bump will extract the neighborhood of a hyperplane and something like a sine will extract periodic offsets of the plane... and so on).

I also kinda wonder whether there would be a nice way to train a network to do its own anti-aliasing, perhaps by feeding some of the past outputs back to additional inputs and then using band-limited training data at various frequencies.. or something like that. You should still be able to train such a setup as if it was a simple feed-forward network as the ideal "feedback" is readily available from synthesized training data, but the network itself would probably need to be somewhat larger for this to work out in a reasonable fashion.

Post

mystran wrote: Thu Jun 09, 2022 11:22 pm I also kinda wonder whether there would be a nice way to train a network to do its own anti-aliasing, perhaps by feeding some of the past outputs back to additional inputs and then using band-limited training data at various frequencies.. or something like that. You should still be able to train such a setup as if it was a simple feed-forward network as the ideal "feedback" is readily available from synthesized training data, but the network itself would probably need to be somewhat larger for this to work out in a reasonable fashion.
I suggested this on page 1, as well as morphing between pre-trained coefficients :cry: (Well, not exactly: I suggested using an additional pitch input rather than history.) I think the trouble either way is that the aliasing will be affected by “damage”, which is cool when using it as an actual damage simulator but not so great when using it as a fun morphing oscillator.

I really like the idea of using a complex sine/cosine phasor, that seems like it would really help the network to make nicely looping waveforms.

I’ve been thinking it would be a lot of fun to try and implement this algorithm as a Eurorack VCO. In which case another antialiasing scheme becomes available: the Noise Engineering “clock your DAC at a multiple of the fundamental” trick, which doesn’t remove aliasing but could make it more pleasant.

Post

imrae wrote: Fri Jun 10, 2022 12:54 am
mystran wrote: Thu Jun 09, 2022 11:22 pm I also kinda wonder whether there would be a nice way to train a network to do its own anti-aliasing, perhaps by feeding some of the past outputs back to additional inputs and then using band-limited training data at various frequencies.. or something like that. You should still be able to train such a setup as if it was a simple feed-forward network as the ideal "feedback" is readily available from synthesized training data, but the network itself would probably need to be somewhat larger for this to work out in a reasonable fashion.
I suggested this on page 1, as well as morphing between pre-trained coefficients :cry: (Well, not exactly: I suggested using an additional pitch input rather than history.)
I must have missed the morphing part, but I did notice the pitch input suggestion, and the trouble is that you'd have to train the network to associate the pitch input with the bandwidth of the harmonics, which seems complicated (ie. something that would probably require a fairly big NN).

The idea with the memory based approach is that perhaps it'd be possible to teach the networks something similar to an ADAA-scheme without actually specifying the exact algorithm, which turns the problem essentially into finding a function approximation in a low number of dimensions... which is what NNs are theoretically pretty good at.

Post

mystran wrote: So the basic idea with ADAA is that rather than evaluating a non-linear function at a point, you evaluate its integral over the sampling period (which in the first order version you'd approximate as a linear ramp). In a typical feed-forward neural network only the activation functions are non-linear, hence it would seem that you could just use ADAA with each activation function evaluation like you would any other waveshaper.
So I've been using ReLU activation on most neurons for performance, and sigmoid (logistic function) on the output neuron since it clamps the waveform nicely to 0..1 and provides some cheap curvature.

Antiderivatives of both functions are available - ReLU would be a parabola for x > 0 and 0 otherwise, and sigmoid is a gentle upward curve. So integrating the activations individually is easy. But integrating the whole net during feedforward is tricky, e.g. here's the calculation for a simple 1x2x2x1 net output (quick derivation, I didn't check it carefully but should be roughly correct):

out = sig((relu((relu(in * w1 + b1) * w3 + relu(in * w2 + b2) * w4) + b3) * w7 + relu((relu(in * w1 + b1) * w5 + relu(in * w2 + b2) * w6) + b4) * w8) + b5)

So it gets complicated - the activations are nested, etc. My calculus skills aren't up to it, but if that looks feasible to integrate, and the computation time of the integral is roughly the same as a regular feedforward, it might be viable. Sounds like ADAA requires at least a couple of computations / samplings of the integrated function?
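
For reference, the two antiderivatives mentioned above would be roughly the following (a sketch, constants of integration dropped; the numerically stable softplus form for the sigmoid is standard):

Code:

#include <cmath>

// Antiderivative of ReLU: x^2/2 for x > 0, 0 otherwise.
float reluAD(float x)
{
    return x > 0.0f ? 0.5f * x * x : 0.0f;
}

// Antiderivative of the logistic sigmoid 1/(1+e^-x): softplus, ln(1 + e^x),
// written so it doesn't overflow for large positive x.
float sigmoidAD(float x)
{
    return x > 0.0f ? x + std::log1p(std::exp(-x)) : std::log1p(std::exp(x));
}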

mystran wrote: Looking at the animated image though (the morph looks cool; I suspect it'd be even more cool with more complex waveforms), there seems to be a problem where there is a discontinuity in the phase itself which will directly cause aliasing (unless it happens to cancel out). One could probably get rid of that by using a complex phasor instead (ie. let the neural network take cosine and sine as two inputs) so that the input itself is continuous and there's no need to worry about aliasing there.
That's a very cool idea. Yeah there is a discontinuity when phase wraps from 1 to 0, since the net has no knowledge that they're related. I added a "de-buzz" knob to combat that (smoothstep crossfade of the wave ends) but having the inputs be in a continuous form would be really interesting. Still on the fence about how much time I should pour in but intrigued to try that...

mystran wrote: I also kinda wonder whether there would be a nice way to train a network to do its own anti-aliasing, perhaps by feeding some of the past outputs back to additional inputs and then using band-limited training data at various frequencies.. or something like that.
Not sure. My intuition says no, but I could be wrong - it'd have to handle all the degenerate waveforms you get as weights & biases are randomly damaged which is a big set of possibilities to train on. A separate "smoothing" network might work, but not sure it could work at point-in-time samplings. If it's just smoothing a big batch of samples you might as well cache a block and wavetable / mipmap it etc.

imrae wrote: I suggested this on page 1, as well as morphing between pre-trained coefficients
You're right, sorry :)

imrae wrote: I’ve been thinking it would be a lot of fun to try and implement this algorithm as a Eurorack VCO.
Cool idea. I did come across this Eurorack module while researching other neural net music projects:

https://www.analogueresearch.com/produc ... al-network

The whole module is just two analog neurons though so you'd need a big rack :)

Post

QuadrupleA wrote: Fri Jun 10, 2022 5:28 pm Antiderivatives of both functions are available - ReLU would be a parabola for x > 0 and 0 otherwise, and sigmoid is a gentle upward curve. So integrating the activations individually is easy. But integrating the whole net during feedforward is tricky, e.g. here's the calculation for a simple 1x2x2x1 net output (quick derivation, I didn't check it carefully but should be roughly correct):
So normally in a feed-forward network every layer basically first computes y=Ax+b (where x is inputs, A is the weight matrix, b is bias) and then you evaluate the activation function for each element on y, which then becomes the "x" for the next layer.

What I was suggesting is simply replacing the way in which the activation functions are evaluated (at runtime; I'd just ignore aliasing while training). This requires a tiny bit of memory so that you keep the previous values also (from previous sample) so that you can take the definite integral, but other than that it should literally change absolutely nothing except how the activation functions are evaluated.
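
In code, one layer is roughly the sketch below (illustrative names only, not anyone's actual implementation); swapping act() for an ADAA version, with one float of previous-value state per neuron, would be the only runtime change:

Code:

#include <vector>

// One feed-forward layer: y = A*x + b, then a per-element activation.
std::vector<float> layerForward(const std::vector<std::vector<float>>& A,
                                const std::vector<float>& b,
                                const std::vector<float>& x,
                                float (*act)(float))
{
    std::vector<float> out(b.size());
    for (size_t i = 0; i < b.size(); ++i) {
        float y = b[i];
        for (size_t j = 0; j < x.size(); ++j)
            y += A[i][j] * x[j];
        out[i] = act(y); // this evaluation is what an ADAA version would replace
    }
    return out;
}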

Post

mystran wrote: Fri Jun 10, 2022 5:45 pm So normally in a feed-forward network every layer basically first computes y=Ax+b (where x is inputs, A is the weight matrix, b is bias) and then you evaluate the activation function for each element on y, which then becomes the "x" for the next layer.
Yup, exactly.

mystran wrote: Fri Jun 10, 2022 5:45 pm What I was suggesting is simply replacing the way in which the activation functions are evaluated (at runtime; I'd just ignore aliasing while training). This requires a tiny bit of memory so that you keep the previous values also (from previous sample) so that you can take the definite integral, but other than that it should literally change absolutely nothing except how the activation functions are evaluated.
I think I follow. Are the integrals used sort of the integrative equivalent of the finite difference method for derivatives? Sorry if I'm slow in following, my calculus / math background is mostly practical, not that broad. Curious to sit down with an ADAA paper to understand the technique better.

Post

QuadrupleA wrote: Fri Jun 10, 2022 5:28 pm Yeah there is a discontinuity when phase wraps from 1 to 0, since the net has no knowledge that they're related.
Why is that? Is it because you train the network only for t strictly in [0...1]? If so, why not just also train it for input values slightly beyond...say for the interval [0...1.1]. You have target values for t=1.1 etc. because that's just the same target as for t=0.1. Or am I missing something?

Post

Music Engineer wrote: Fri Jun 10, 2022 7:18 pm Is it because you train the network only for t strictly in [0...1]? If so, why not just also train it for input values slightly beyond...say for the interval [0...1.1].
They do train just in [0..1], yeah. It's customary in neural nets for inputs to be normalized to that range (like VST parameters :) ) but you could of course remap -0.1 ... 1.1 to 0..1 too. Not sure the mathematical implications of inputs going outside 0..1, might be fine in most cases, although I could see negative values causing problems.

To clarify, the discontinuity isn't a problem when the net is outputting the wave it's trained on. t=0 generally equals t=1, or is close enough for no audible click. But when you start morphing two nets together, or damaging neurons, all bets are off and t=0 and t=1 don't generally match up. So I think even if you trained some extra boundary on either end, if you mess with a neuron's bias randomly you'd probably distort things to the point of slipping past the 0.1 safety margin.

But a continuous / cyclical input domain like phasor / sin-cos pair might handle it gracefully, curious to try that.

Post

QuadrupleA wrote: Fri Jun 10, 2022 6:41 pm I think I follow. Are the integrals used sort of the integrative equivalent of the finite difference method for derivatives? Sorry if I'm slow in following, my calculus / math background is mostly practical, not that broad. Curious to sit down with an ADAA paper to understand the technique better.
I might not be the best person to explain it, but let me try.

The basic idea with the 1st order version is that if we "reconstruct" the signal in continuous time with a triangular kernel (which results in linear interpolation between the current and previous sample values), put the resulting linear slope through a non-linear function f(x), and then filter the result with a box-filter before resampling (ie. average over the sampling period), then we can compute the whole thing "analytically" from the two sample values by taking a definite integral as (F(x1)-F(x0))/(x1-x0), where F(x) is the antiderivative (=indefinite integral) of f(x) and x0, x1 are the previous and current sample values.

Feel free to derive this formula as an exercise; it's actually fairly elementary once you get your head around it. There's a slight practical gotcha: while the limit for x1=x0 works out as f(x1), numerically speaking you need some strategy for avoiding the division by zero. The papers suggest just branching when abs(x1-x0) < epsilon, though another possibility is to let a=copysign(epsilon,x1-x0) and rewrite as (F(x1)-F(x0)+a*f((x1+x0)/2))/(x1-x0+a), which is a bit more expensive but will reach the limit smoothly.
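
In code, the two variants just described would look roughly like this (a sketch; f is the nonlinearity, F its antiderivative, x0/x1 the previous and current inputs, and the epsilon value is arbitrary):

Code:

#include <cmath>

// First-order ADAA, branching near the x1 == x0 limit.
float adaa1(float x0, float x1, float (*f)(float), float (*F)(float))
{
    const float eps = 1e-5f;
    if (std::fabs(x1 - x0) < eps)
        return f(0.5f * (x1 + x0));            // limit as x1 -> x0 is f(x1)
    return (F(x1) - F(x0)) / (x1 - x0);
}

// The copysign variant, which approaches the limit smoothly with no branch.
float adaa1Smooth(float x0, float x1, float (*f)(float), float (*F)(float))
{
    const float eps = 1e-5f;
    float a = std::copysign(eps, x1 - x0);
    return (F(x1) - F(x0) + a * f(0.5f * (x1 + x0))) / (x1 - x0 + a);
}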

Post

QuadrupleA wrote: Fri Jun 10, 2022 7:52 pm
Music Engineer wrote: Fri Jun 10, 2022 7:18 pm Is it because you train the network only for t strictly in [0...1]? If so, why not just also train it for input values slightly beyond...say for the interval [0...1.1].
They do train just in [0..1], yeah. It's customary in neural nets for inputs to be normalized to that range (like VST parameters :) ) but you could of course remap -0.1 ... 1.1 to 0..1 too. Not sure the mathematical implications of inputs going outside 0..1, might be fine in most cases, although I could see negative values causing problems.
Emphasis on "customary", as literally nothing bad will happen (with the math, anyway... no idea if some framework enforces a limit, even though that would be slightly weird) if you use whatever range happens to be suitable for your purposes. The little I've played around with NNs, I just wrote the training code from scratch in C++ (partially to gain a better understanding of what is going on), so I have no idea about frameworks.

Note that beyond the input layer, the values are rarely in any "nice" range unless you happen to have a bounded activation function (but for example ReLU is not bounded).
QuadrupleA wrote: Fri Jun 10, 2022 7:52 pm To clarify, the discontinuity isn't a problem when the net is outputting the wave it's trained on. t=0 generally equals t=1, or is close enough for no audible click. But when you start morphing two nets together, or damaging neurons, all bets are off and t=0 and t=1 don't generally match up. So I think even if you trained some extra boundary on either end, if you mess with a neuron's bias randomly you'd probably distort things to the point of slipping past the 0.1 safety margin.
Yeah. As far as I can see, the only way to get continuous waveforms when messing with the weights randomly is to make the input continuous... hence the suggestion for a phasor. As long as every neuron has a continuous activation function, this will result in a continuous waveform. Plus, now all the inputs lie on the unit circle, so we'll probably get random (potentially fun) results elsewhere, so another form of "damage" would be to distort the input phasor circle. :)
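
A sketch of that kind of input "damage" (the amounts and shapes here are arbitrary, purely illustrative):

Code:

#include <cmath>

// Distort the unit-circle input before it reaches the net: squash it into an
// ellipse and apply a phase-dependent twist. Arbitrary choices, just one idea.
void damagePhasor(float& c, float& s, float squash, float twist)
{
    c *= (1.0f - squash);
    float a  = twist * c * s;
    float c2 = c * std::cos(a) - s * std::sin(a);
    float s2 = c * std::sin(a) + s * std::cos(a);
    c = c2;
    s = s2;
}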

Post

Also.. ReLU might not be the best choice for activation when doing function approximation, because it'll essentially force your network to learn a piecewise linear approximation. It's popular for classification tasks because it apparently performs better than sigmoids in those tasks, but even there I think you can get better results with one of the smooth functions like swish. Yet there is an important distinction between classification and approximation tasks and what works well for one might not be ideal for the other. From what I've seen, this is rarely talked about these days because much of the NN discussion seems to assume you're doing classification.

For function approximation, the plain old tanh() is actually not a bad choice (and using a smooth function might be preferable as discontinuities in the first derivative also create infinite harmonics, even though they decay 6dB/octave faster than a straight discontinuity). In my (limited) experience bipolar functions almost always outperform unipolar functions here, so even though the logistic function and tanh are the same up to shifting/scaling, the latter typically performs (much!) better in approximation tasks ... but really just about anything non-linear can work (eg. sin() is a perfectly fine activation function).
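
For reference, a few of the smooth candidates mentioned in this thread, written as drop-in activation functions (a sketch):

Code:

#include <cmath>

float actTanh(float x)  { return std::tanh(x); }
float actSine(float x)  { return std::sin(x); }
float actGauss(float x) { return std::exp(-x * x); }          // symmetric "gaussian bump"
float actSwish(float x) { return x / (1.0f + std::exp(-x)); } // x * sigmoid(x)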

Curiously... if the activation function is a boxcar (or some smoother version of the same), then a single neuron can implement any traditional two-input logic gate. With sigmoids you need two layers for XOR, I think.
