Need custom off-line pitch correction app - for pay

DSP, Plugin and Host development discussion.

Post

I'd like to have an app that reads a wave file, then tunes it in the crudest of ways:
The app would determine mathematically how many samples there should be per cycle (in a perfectly tuned note.) Then it takes my wave file and stretches (or compresses) each cycle individually to match that length. (And thus be in tune.)

There would be no need to maintain overall file length, so if the wave file has 587 cycles, the resulting file will have 587 cycles as well. If that makes the audio file longer or shorter, so be it. (I want it this way because AutoTune or Melodyne often introduce artifacts as they add or subtract cycles in an effort to keep the audio file "in sync.")

I also don't want any formant fanciness. (Leave that to Melodyne.) I want the wave file as unmolested (other than pitch) as possible. And obviously, this does not need to have realtime operation. It would be an entirely off-line process.

My guess is that the stretching of each cycle is fairly simple, but the hard part will be determining exactly where each cycle should be designated as "starting." For my purposes, zero-crossings are likely good enough.
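For what it's worth, the zero-crossing detection itself is only a few lines. Here's a minimal Python sketch (the function name is my own, and the test signal is a clean sine, which sidesteps the harmonic-content problems discussed later in the thread):

```python
import math

def upward_zero_crossings(samples):
    """Indices where the signal goes from <= 0 to > 0,
    i.e. one candidate cycle start per period."""
    return [i for i in range(len(samples) - 1)
            if samples[i] <= 0.0 < samples[i + 1]]

# Three periods of a clean 100-sample sine (phase wrapped so each
# period starts exactly at zero):
wave = [math.sin(2 * math.pi * (i % 100) / 100) for i in range(300)]
crossings = upward_zero_crossings(wave)   # -> [0, 100, 200]
```

On a clean sine the crossings land exactly one period apart; the interesting engineering is in what happens when they don't.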

The interface can be cruder than crude, since the only user of this will be me. (I'm a sample developer. I'd like to have this app because I believe this method, albeit crude, would be better suited to sample tuning than Melodyne or AutoTune.) It also doesn't need to be cross-platform, obviously.

Obviously this would be a paid gig. Please PM or email (mike at realitone daht calm) if you're interested. Or if it turns out there's something already out there that tunes in the way I've described, please let me know. Thanks!

Post

I presume you want this for vocals. Finding the difference and the transition between the vocal-tract pulses and the fricative sounds is very difficult, so this may be one cause of the artefacts you're hearing in those packages. Another may be the missing data caused by the cycle length altering.
For mono sounds, the software packages you mentioned don't use formant synthesis (as far as I can hear); they use pitch tracking and cycle lengthening.

Post

Hi Dave. You presume correctly that this would be for vocals. :D

I’d be chopping off the fricative consonants before tuning, though, so that the audio file is entirely smooth and tunable. Otherwise we’d have to write additional code for the app to divide up the audio and figure out which sections can even be tuned.

The artifacts we currently get (with AT or Melodyne) are actually during the sustains, not the consonants. They’re really rare, mind you, but they occur with both Melodyne and AutoTune, although samples that are bad with one will usually be good with the other. (Lucky for us, although it’s a PIA to keep switching.)

The artifacts are generally distortion type sounds, or even clicks. And they’re consistent, in that if a note has glitches, changing settings usually doesn’t help. It’s weird, because we don’t slam the audio levels, so that’s not the issue, and like I said, they occur in the middle of sustained held notes, which should be the easiest things to tune.

My guess as to why AT and Melodyne sometimes glitch is that they delete or add cycles (in order to keep a tuned phrase in sync with its original timing.) That’s why, in my process, I want to keep every cycle, so there would not, in theory, be any problematic splices.

Post

Hey Mike,

This is such a cool idea.

If you're still looking for a developer, I would suggest Zco Corporation. I work there, so I highly recommend them! :wink: Not only are we one of the largest app developers in the world, we make top-notch apps. Have you thought about what devices you'd want this project to work on? We can build native and hybrid apps for iOS, Android, Windows, and BlackBerry.

If it helps, I have a music degree in voice performance and am really excited about this project!

Check us out here: http://www.zco.com/mobile-app-development.aspx

Post

OK, it's been a while since I've played with Melodyne. If you're going to sustain a note longer than the original, there is no way other than to repeat sections. It should NOT distort, though; it should just sound a little robotic, if anything. From my own experience with time-shifting, I suspect that something else is happening in that sustained note, as there is no reason for it to glitch or click.

Post

Dave, you're right that Melodyne shouldn't distort, and most of the time, it doesn't. But . . . there are times that it does. I think it happens when cycles get either skipped or repeated. (And even then, only occasionally.)

It's not very noticeable in a sung line, but when you're doing samples, where a little glitch gets repeated every time you play that sample, it's noticeable.

Post

Based on a question I got by PM, I might not have explained this process clearly enough. Hopefully this example will make things clearer:

Let's suppose we have a wave file of a singer singing an "A" above middle C. This note, if perfectly tuned, should have 440 cycles per second. At a 44.1k sample rate, that would mean each cycle, if perfectly tuned, should be 100.23 samples long. (Obviously interpolation would be involved, since 100.23 isn't a whole number of samples.)

But our singer, what with her being human and all, isn't perfect. So when our app analyzes the wave file, it finds the lengths of the cycles to be: 99, 98, 102, 100, 99, 102, 101 . . .

So our app would then stretch or compress each of these cycles:
That first cycle, which lasts 99 samples, gets stretched to 100.23 samples.
The second cycle, which lasts 98 samples, gets stretched to 100.23 samples.
The third cycle, which lasts 102 samples, gets compressed to 100.23 samples.
. . .
And so on, so that their durations when all finished would be 100.23, 100.23, 100.23, 100.23, 100.23, 100.23, 100.23, 100.23 . . . thus giving us a perfectly tuned "A."

I hope that makes it clearer.
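The per-cycle stretch described above amounts to resampling each period to the target length. A minimal linear-interpolation sketch in Python (the names are mine; a real tool would carry the fractional remainder from one cycle into the next rather than rounding, and would use a better interpolator than linear):

```python
def stretch_cycle(cycle, target_len):
    """Resample one pitch period to roughly target_len samples
    using linear interpolation.  target_len may be fractional
    (e.g. 100.23); this sketch rounds it to a whole sample count."""
    n = len(cycle)
    out_len = round(target_len)
    out = []
    for k in range(out_len):
        pos = k * (n - 1) / (out_len - 1)  # map output index into source
        i = int(pos)
        frac = pos - i
        nxt = cycle[min(i + 1, n - 1)]
        out.append(cycle[i] + frac * (nxt - cycle[i]))
    return out

# Stretch a dummy 99-sample "cycle" toward the ideal 44100/440 samples:
stretched = stretch_cycle(list(range(99)), 44100 / 440)
```

The first and last samples of the cycle are preserved, and everything in between is interpolated, so the waveform shape survives while the period length changes.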

Post

But zero crossings are not indicative of the signal content. Just because a sine wave's zero crossings line up with its period doesn't mean an audio signal's do; overtones and their phases will "modulate" the zero-crossing points. Adjusting pitch based on them is a kind of averaging that may not make sense. Your idea ignores phase, one of the key attributes of a signal, and uses time in a vain attempt to replace it. This is why Melodyne and the like use FFTs, where all the necessary attributes are present.

Post

Mike Greene wrote:Based on a question I got by PM, I might not have explained this process clearly enough. Hopefully this example will make things clearer:

Let's suppose we have a wave file of a singer singing an "A" above middle C. This note, if perfectly tuned, should have 440 cycles per second. At a 44.1k sample rate, that would mean each cycle, if perfectly tuned, should be 100.23 samples long. (Obviously interpolation would be involved, since 100.23 isn't a whole number of samples.)

But our singer, what with her being human and all, isn't perfect. So when our app analyzes the wave file, it finds the lengths of the cycles to be: 99, 98, 102, 100, 99, 102, 101 . . .

So our app would then stretch or compress each of these cycles:
That first cycle, which lasts 99 samples, gets stretched to 100.23 samples.
The second cycle, which lasts 98 samples, gets stretched to 100.23 samples.
The third cycle, which lasts 102 samples, gets compressed to 100.23 samples.
. . .
And so on, so that their durations when all finished would be 100.23, 100.23, 100.23, 100.23, 100.23, 100.23, 100.23, 100.23 . . . thus giving us a perfectly tuned "A."

I hope that makes it clearer.
You'd need a very clean signal with a very dull sound for this to work. Human voice, for instance, typically has multiple zero crossings per period (due to the strong resonances), and the noise content might throw off the zero crossing locations a bit, causing random variations in the pitch...
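This is easy to demonstrate numerically. With a strong enough overtone (the 1.5x third harmonic below is deliberately exaggerated), a single period picks up extra upward zero crossings:

```python
import math

def upward_zero_crossings(samples):
    """Indices where the signal goes from <= 0 to > 0."""
    return [i for i in range(len(samples) - 1)
            if samples[i] <= 0.0 < samples[i + 1]]

period = 100
pure = [math.sin(2 * math.pi * i / period) for i in range(period)]
rich = [math.sin(2 * math.pi * i / period)
        + 1.5 * math.sin(2 * math.pi * 3 * i / period)
        for i in range(period)]

n_pure = len(upward_zero_crossings(pure))   # 1 crossing per period
n_rich = len(upward_zero_crossings(rich))   # 3 crossings per period
```

A naive detector would see the "rich" signal as roughly three times higher in pitch than it really is, which is exactly the pitfall being described.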

Post

Camsr and MadBrain, you guys make very valid points. I’ve certainly stared at enough audio files to see the obstacles you mention.

I still think I can come up with an algorithm to find consistent 0-crossings, though. (Says the guy who's never coded something like this before. :D )

Indulge me for a minute. Let's suppose we're dealing with a pitch where each period (or cycle) would be 500 samples when perfectly tuned. I would start the process by searching for the overall peak value anywhere in the audio file. Then I assign a cycle/period start point to the zero crossing immediately preceding that peak. That's my starting anchor.

Then I would search in a range of 250 to 750 samples after this first peak to find the next peak. (Peaks would be expected to be roughly 500 samples apart, so this range encompasses that.) Once I find that next peak, I go back to the 0-crossing immediately preceding it, and that becomes the start point for the next cycle. And so on.

These successive 0-crossings *should* be roughly 500 samples apart, but as you guys note, they won’t always be. If they’re in a range of 480 to 520 samples apart, then I accept that 0-crossing. If not, then I have to include secondary algorithms. Check an earlier 0-crossing, for instance, to see if that one would fall into the 480 to 520 range.

The error-correcting schemes could get fairly complicated, but I do think that with enough of them, I could always end up with 0-crossings that are 480 to 520 samples apart.

Note that it’s not all that important whether or not these 0-crossings are the true mythical starts of each cycle/period. Even if they drift a bit as the harmonics evolve and mess with the 0-crossings, as long as I stay in that 480 to 520 range for each cycle, then I can’t get into too much trouble as I stretch (or compress) each cycle to 500 samples. The cycle will still always be starting on an upswing and ending on a downswing every 500 samples.
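Under those assumptions, the peak-hopping search sketches out like this in Python (all names are my own, the input is an idealised clean sine, and the secondary error-correction schemes are reduced to a comment):

```python
import math

def upward_zc_before(samples, i):
    """Index of the upward zero crossing nearest before sample i."""
    while i > 0:
        if samples[i - 1] <= 0.0 < samples[i]:
            return i - 1
        i -= 1
    return 0

def segment_cycles(samples, nominal, tol):
    """Peak-anchored cycle segmentation, as described above."""
    # 1. Anchor on the loudest positive peak in the whole file.
    peak = max(range(len(samples)), key=lambda i: samples[i])
    starts = [upward_zc_before(samples, peak)]
    pos = peak
    # 2. Hop forward roughly one period at a time, re-anchoring
    #    on the local peak in a window of 50% to 150% of a period.
    while pos + nominal * 3 // 2 < len(samples):
        lo, hi = pos + nominal // 2, pos + nominal * 3 // 2
        nxt = max(range(lo, hi), key=lambda i: samples[i])
        zc = upward_zc_before(samples, nxt)
        d = zc - starts[-1]
        if nominal - tol <= d <= nominal + tol:
            starts.append(zc)
        # else: secondary error-correction schemes would go here
        pos = nxt
    return starts

# 1000 samples of a clean 100-sample-period sine (phase wrapped so
# each period starts exactly at zero):
wave = [math.sin(2 * math.pi * (i % 100) / 100) for i in range(1000)]
starts = segment_cycles(wave, nominal=100, tol=20)
```

On this idealised input every detected cycle start lands exactly one period apart; the hard part, as noted, is making the fallback logic robust on real vocal material.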

That’s my theory, at least. We’ll see if reality is another story. :D At least if I can find any takers . . .

Post

You have one taker up above. :D
But if you are concerned over artifacts, your idea will have more.
FFT and similar processes are best for pitch reassignment.

Post

Yeah, I'd probably look into autocorrelation if I were you... it's more robust than zero crossings.
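For reference, a bare-bones version of the idea: autocorrelation picks the lag at which the signal best matches a delayed copy of itself, which doesn't care how many zero crossings the waveform has. (The function name is my own; real implementations normalise the score and interpolate around the peak for sub-sample accuracy.)

```python
import math

def period_by_autocorrelation(samples, min_lag, max_lag):
    """Return the lag (in samples) with the highest correlation
    between the signal and a delayed copy of itself."""
    n = len(samples)
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        score = sum(samples[i] * samples[i + lag]
                    for i in range(n - max_lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

wave = [math.sin(2 * math.pi * i / 100) for i in range(1000)]
estimated = period_by_autocorrelation(wave, 80, 120)   # -> 100
```

The search range plays the same role as the 250-to-750-sample window in the peak-hopping scheme: it constrains the answer to plausible periods.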

Post

You guys could very well be right. But I still want to try. :)

Post

Chances are, even if this approach is done correctly, you won't recognise the result as a human voice anymore. It's a recipe for sucking all the life out of it.

Some weeks ago someone wrote that most samples / single-cycle waves when played on keyboard sounded like "accordion" and I'd say he's right!

Post

The complications are immense. And all algorithms have to be tested against thousands of voices of all ages and languages before you can say that it's infallible - which it won't be. :)

Or to put it another way: a large number of people and many years of R&D have gone into these algorithms. Don't you think they've already tried possibilities like this one?

It is interesting that the human brain can see the patterns in waveforms very clearly, but that's the wonder of brains for you!
