Use machine learning to generate brand new never-before-heard sounds


Post

Maybe it's a bit of blasphemy to post this here, but I would love to see someone break sound design by using machine learning to generate endless new never-before-heard sounds to be utilized like presets in a VSTi. It may be cutting edge at the moment, but it's certainly possible. See this video:

https://www.youtube.com/watch?v=oitGRdHFNWw

Instead of using images as in the video, train an unsupervised deep learning algorithm on tons of different sounds (probably both VSTi notes and other musical samples). Then refine its output with an adversarial network, etc. Does anyone else get giddy with excitement at this prospect, or does it make you fear for your job? Anyone out there with the skills to make an attempt at this?

Post

The problem is that most of the datasets that deep learning algorithms train on contain real-world, natural content with regularities defined by physics. The generative model's objective is often to produce a plausible output that would fool an adversarial network/classifier, or a human, into believing that the output is real/valid.

The analogy in audio space would be the production of natural data such as speech, language, or sounds of nature. So unless you want the network to generate plausible sounds of existing VSTi instruments, it will still lack that creativity element unless it receives feedback from a human listener during training, e.g. have a human report on the "interestingness" of its output and have the algorithm optimize for that.
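To make the adversarial objective concrete, here's a minimal training-step sketch in PyTorch. Everything in it is an illustrative assumption (raw 16384-sample clips, tiny fully-connected networks), not a recipe from the video:

```python
import torch
import torch.nn as nn

SAMPLE_LEN = 16384   # assumed length of one audio clip, in samples
LATENT_DIM = 100     # assumed size of the generator's noise input

generator = nn.Sequential(
    nn.Linear(LATENT_DIM, 512), nn.ReLU(),
    nn.Linear(512, SAMPLE_LEN), nn.Tanh(),        # waveform in [-1, 1]
)
discriminator = nn.Sequential(
    nn.Linear(SAMPLE_LEN, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 1),                             # real-vs-fake logit
)

bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(generator.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

def train_step(real_clips: torch.Tensor) -> None:
    """One adversarial update: D learns to spot fakes, G learns to fool D."""
    n = real_clips.size(0)
    fake_clips = generator(torch.randn(n, LATENT_DIM))

    # Discriminator: push real clips toward "real" (1), fakes toward "fake" (0).
    d_loss = (bce(discriminator(real_clips), torch.ones(n, 1)) +
              bce(discriminator(fake_clips.detach()), torch.zeros(n, 1)))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator: make the discriminator call its fakes "real".
    g_loss = bce(discriminator(fake_clips), torch.ones(n, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# e.g. one step on a random stand-in batch of 8 "real" clips
train_step(torch.randn(8, SAMPLE_LEN))
```

Note that the discriminator in this loop is exactly the piece that could, in principle, be replaced or augmented by human "interestingness" ratings.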

Post

the method outlined in the video is, like i posted in another thread, all about "always-before-heard" sounds, not "never-before-heard".


"hey look, here's some more stuff that would seem familiar to you!"
fantastic :/


the audience sees: OMG a computer is making something up!


developers see: hahaha, some clown using an arbitrary selection of methods to achieve an occasionally amusing result. that's what us clowns do, except we sell it to those we can as some kind of profundity. far, far from empirical, or from any authoritative enactment of what the audience believes is being purveyed.

it's great to play around with development and method and procedure and do a bunch of stuff.

but when you suck on it instead of do it, you're helping to continue a cycle of social abuse ;) think of the children, develop yourself. develop, develop, develop, yourself. and soon, those amazing guys off somewhere doing the amazing stuff, will have less of your attention.

and that is worth more than gold.

Post

Most sounds have already been heard before. The individual sounds aren't as important as what you do with them.

Post

Some interesting stuff here:
https://soundcloud.com/musicpostbot909303

Post

Interesting, maybe I should dive into machine learning.

Also, this approach generates images (or sounds) similar to those it was given, not completely new ones. It might be tricky to generate or evaluate a completely new sound based on already existing examples.
Besides, their quality is unsatisfactory at best.

On the other hand, sound processing is much simpler and faster than video (or image) processing. The question is, what's the use of that? Glitchy sample packs maybe?

Post

Not deep and dark enough.

Post

@nonnaci I imagine the data set would include as many real-world instrument sounds as possible, as well as as many VSTi presets as possible. This domain already defines the "interestingness": the recorded instrument would never have been made in the first place if it didn't sound good, and nobody would have saved the preset if it didn't sound good either. My hunch is that what sounds good and interesting to us is something with a lot of patterns occurring in it, and general patterns across all of the data are what an unsupervised algorithm winds up finding. Plenty of these would exist, whether because of the physics behind real-world sounds or, in the case of synthesis, because they imitate real-world sounds in order to be pleasing to us. My guess is that there would wind up being neurons that code for patterns like harmonics in tonal sounds, reverbs and delays, and perhaps even some larger-scale rhythmic patterns. Randomness like static would probably be selected against, which does make me wonder how it would handle some level of desirable distortion or overdrive. If it were included in the training data, I bet it would slip some in here and there in what it generates too.
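To make that hunch concrete, here is a minimal unsupervised sketch in PyTorch: an autoencoder trained only to reconstruct clips, whose hidden units are the "neurons" that could end up coding for recurring patterns like harmonics or reverb tails. The clip length, layer sizes, and MSE loss are all assumptions for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

SAMPLE_LEN = 16384   # assumed clip length in samples

autoencoder = nn.Sequential(
    nn.Linear(SAMPLE_LEN, 256), nn.ReLU(),   # 256 hidden "neurons" = learned features
    nn.Linear(256, SAMPLE_LEN),              # reconstruction of the input clip
)
opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

def train_step(clips: torch.Tensor) -> float:
    """Minimize reconstruction error; features emerge only from regularities
    shared across the training clips -- no labels involved."""
    loss = F.mse_loss(autoencoder(clips), clips)
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# e.g. one step on a random stand-in batch of clips
train_step(torch.randn(8, SAMPLE_LEN))
```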

Post

to_the_sun wrote:@nonnaci I imagine the data set would include as many real-world instrument sounds as possible, as well as as many VSTi presets as possible. This domain already defines the "interestingness": the recorded instrument would never have been made in the first place if it didn't sound good, and nobody would have saved the preset if it didn't sound good either. My hunch is that what sounds good and interesting to us is something with a lot of patterns occurring in it, and general patterns across all of the data are what an unsupervised algorithm winds up finding. Plenty of these would exist, whether because of the physics behind real-world sounds or, in the case of synthesis, because they imitate real-world sounds in order to be pleasing to us. My guess is that there would wind up being neurons that code for patterns like harmonics in tonal sounds, reverbs and delays, and perhaps even some larger-scale rhythmic patterns. Randomness like static would probably be selected against, which does make me wonder how it would handle some level of desirable distortion or overdrive. If it were included in the training data, I bet it would slip some in here and there in what it generates too.
My hunch is that interestingness is closer to a transfer learning problem, where representations from different domains cross-pollinate to produce a result that could be understood in either domain yet whose sum is greater than its parts. E.g. a musical arrangement can become interesting when it conforms to an archetypal narrative structure. The current approaches to adversarial and generative networks do not do this; they train either to imitate (compress and decompress data back to the original, like autoencoder variants) or to fool a classifier designed to determine whether an input originated from the dataset. I.e., you'll get new samples that sound like they were plausibly generated from a VSTi in your dataset. Yes, your network may wind up coding for harmonics, but such a parameter space is small. Compare this to the negative space of inharmonic sounds, which is huge: what is an interesting inharmonic in this case? My vote is that some form of regularization is required, and that it will come from some other domain.

Post

nonnaci wrote:
to_the_sun wrote:@nonnaci I imagine the data set would include as many real-world instrument sounds as possible, as well as as many VSTi presets as possible. This domain already defines the "interestingness": the recorded instrument would never have been made in the first place if it didn't sound good, and nobody would have saved the preset if it didn't sound good either. My hunch is that what sounds good and interesting to us is something with a lot of patterns occurring in it, and general patterns across all of the data are what an unsupervised algorithm winds up finding. Plenty of these would exist, whether because of the physics behind real-world sounds or, in the case of synthesis, because they imitate real-world sounds in order to be pleasing to us. My guess is that there would wind up being neurons that code for patterns like harmonics in tonal sounds, reverbs and delays, and perhaps even some larger-scale rhythmic patterns. Randomness like static would probably be selected against, which does make me wonder how it would handle some level of desirable distortion or overdrive. If it were included in the training data, I bet it would slip some in here and there in what it generates too.
My hunch is that interestingness is closer to a transfer learning problem, where representations from different domains cross-pollinate to produce a result that could be understood in either domain yet whose sum is greater than its parts. E.g. a musical arrangement can become interesting when it conforms to an archetypal narrative structure. The current approaches to adversarial and generative networks do not do this; they train either to imitate (compress and decompress data back to the original, like autoencoder variants) or to fool a classifier designed to determine whether an input originated from the dataset. I.e., you'll get new samples that sound like they were plausibly generated from a VSTi in your dataset. Yes, your network may wind up coding for harmonics, but such a parameter space is small. Compare this to the negative space of inharmonic sounds, which is huge: what is an interesting inharmonic in this case? My vote is that some form of regularization is required, and that it will come from some other domain.
My hunch could be all wrong, yes. I suppose the patterns it winds up finding would just be similarities among the training set, whether those similarities are distortion or nice juicy reverb. It's true that the current approaches, like in the video, are supervised, and I was hypothesizing an unsupervised version; but now that I think about it, a supervised algorithm would probably work just fine. Rather than classifying as bird, dog, etc., you would classify based on pitch (A, B, C, etc.), with perhaps a category for atonal. In any case I would maintain that the question of interestingness is irrelevant. That will be for the machines to decide.
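As a sketch of what that supervised variant might look like: a generator conditioned on a one-hot pitch-class label (the twelve pitches plus an "atonal" bucket, as suggested above). All names and sizes here are hypothetical assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_CLASSES = 13       # A, A#, ..., G# (12 pitch classes) plus an "atonal" bucket
LATENT_DIM = 100     # assumed noise size
SAMPLE_LEN = 16384   # assumed clip length

cond_generator = nn.Sequential(
    nn.Linear(LATENT_DIM + N_CLASSES, 512), nn.ReLU(),
    nn.Linear(512, SAMPLE_LEN), nn.Tanh(),
)

def generate(pitch_class: int, batch: int = 1) -> torch.Tensor:
    """Sample new clips steered by a one-hot pitch-class label."""
    z = torch.randn(batch, LATENT_DIM)
    label = F.one_hot(torch.full((batch,), pitch_class), N_CLASSES).float()
    return cond_generator(torch.cat([z, label], dim=1))

clip = generate(pitch_class=0)   # e.g. request something pitched at "A"
```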

Post

@xoxos With an ideal training set of "always-before-heard" sounds that encompasses every musical sound, generating something similar would be "never-before-heard". Anyway, the point for me would not be so much whether or not the new sounds had already been created, down to the sample, by someone before, but that you would always be surprised when you had it generate a new one. Maybe it'd be a bit of a novelty, I suppose, but I feel like I've already come to know all the presets on the VSTs I use.

Post

deastman wrote:Most sounds have already been heard before. The individual sounds aren't as important as what you do with them.
I totally agree. I'd much rather spend my time actually playing music, which, on the one hand, is why I'd love to have something that could endlessly provide me with new sounds to work with without having to spend time dialing them in, but on the other hand, means I also have more pertinent things to do than make that thing myself. I'm just trying to garner some interest in the idea in the hopes that someone else will pick it up and run with it.

Post

to_the_sun wrote: My hunch could be all wrong, yes. I suppose the patterns it winds up finding would just be similarities among the training set, whether those similarities are distortion or nice juicy reverb. It's true that the current approaches, like in the video, are supervised, and I was hypothesizing an unsupervised version; but now that I think about it, a supervised algorithm would probably work just fine. Rather than classifying as bird, dog, etc., you would classify based on pitch (A, B, C, etc.), with perhaps a category for atonal. In any case I would maintain that the question of interestingness is irrelevant. That will be for the machines to decide.
Even in the unsupervised case, the machine still optimizes for an objective function that we give it, typically some measure of how well it models the distribution of the data presented to it without overfitting (e.g. maximum likelihood, marginal likelihood, cross-entropy, RMSE of reconstructed data).

E.g. I supply a dataset consisting of violin sounds and cello sounds. The deep unsupervised learner does a good job modeling the two and hypothetically manages to learn features that resemble harmonics in a deep layer and code for timbre in a shallower layer; i.e. it has learned two major features common to all string instruments. The question now is: how do I sample this space in a reasonable amount of time so as to produce an interesting effect? And what if the underlying features weren't harmonics and timbre but something else altogether? The regions of low probability are vast!
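One common partial answer to the sampling question is to stay near the data manifold, e.g. by interpolating between the latent codes of two known sounds rather than drawing arbitrary latents. A sketch under assumed shapes, with untrained linear layers standing in for a trained encoder/decoder:

```python
import torch
import torch.nn as nn

SAMPLE_LEN = 16384   # assumed clip length
LATENT_DIM = 32      # assumed latent size

encoder = nn.Linear(SAMPLE_LEN, LATENT_DIM)   # stand-in for a trained encoder
decoder = nn.Linear(LATENT_DIM, SAMPLE_LEN)   # stand-in for a trained decoder

def interpolate(clip_a: torch.Tensor, clip_b: torch.Tensor, steps: int = 8):
    """Decode points on the latent line between two encoded sounds.
    Such points tend to sit in higher-probability regions than latents
    drawn blindly from the vast low-probability space."""
    z_a, z_b = encoder(clip_a), encoder(clip_b)
    return [decoder((1 - t) * z_a + t * z_b)
            for t in torch.linspace(0.0, 1.0, steps)]

# e.g. morph between stand-in "violin" and "cello" clips
morphs = interpolate(torch.randn(SAMPLE_LEN), torch.randn(SAMPLE_LEN))
```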

Post

DJ Warmonger wrote:Interesting, maybe I should dive into machine learning.

Also, this approach generates images (or sounds) similar to those it was given, not completely new ones. It might be tricky to generate or evaluate a completely new sound based on already existing examples.
Besides, their quality is unsatisfactory at best.

On the other hand, sound processing is much simpler and faster than video (or image) processing. The question is, what's the use of that? Glitchy sample packs maybe?
With an ideal training set of every instrumental sound, whether real-world or VST preset, "similar" is all you really need. I'm not asking for it to discover some frequency that's new to science or anything; even with that training set, there is always going to be tons of space in between the examples for experimentation.

The images generated were only low-quality before they were sent through the adversarial algorithm. Then they became crystal clear.

What he did say, though, was that that sort of thing hadn't been done yet with video, and I would say that audio is pretty akin to video processing. While audio only records one sample at a time, it does so tens of thousands of times a second. Video might record thousands of pixels per frame, but you only get dozens of frames per second.
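A back-of-envelope check of that comparison, plugging in rough figures that match the wording above (illustrative, not measured):

```python
# Raw value rates per second, using the post's own rough figures
audio_rate = 44_100        # mono audio: ~44k samples per second
video_rate = 5_000 * 30    # "thousands of pixels" x "dozens of frames"
print(f"audio: {audio_rate:,} values/s")   # audio: 44,100 values/s
print(f"video: {video_rate:,} values/s")   # video: 150,000 values/s
```

On those numbers the raw data rates land within the same order of magnitude, though a full-resolution video frame obviously carries far more than "thousands" of pixels.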

Post

Hi,

I would love to participate in such a project. Did you investigate the subject further?
