Non-linear interpolation methods

DSP, Plugin and Host development discussion.

Post

I'm working on a custom app to automate some of the editing for my vocal sample libraries. ( http://realitone.com ) The Fourier Transform elements are working nicely. (Thank you to Nigel and several other members here! 8) ) But for another section, I need to do some old-fashioned pitch shifting, where the sample simply speeds up or slows down, like a turntable changing speed, or a sampler when you play other pitches. (No Fourier fanciness in this section.)

Obviously I can't just speed up or slow down the sample rate, so I have to use some sort of interpolation method. Linear is easiest, of course, but after some poking around on the interwebs, there are apparently much better methods. Olli Niemitalo's "Elephant" paper is loaded with great information, especially since he actually gives the relevant code for the various methods (starting on page 39).

I have a few questions:

1. He talks about an "over-sampling ratio." If my sample rate is 44.1k, then I assume my oversampling rate is 2, correct?

2. To that end: I've read that people recommend filtering, especially before pitch shifting down, to avoid artifacts. In my particular case, my pitch shifts will always be less than a semitone. So can I ignore that?

3. My particular need is pitch shifting solo vocals, where there is not just the tone of the held note, but also consonants (s or t or k, or whatever). Does anyone have a feeling for which algorithm might be best for me, or whether it even matters much for basic vocal samples? According to the chart on page 60 of Olli's paper, it looks like my best signal-to-noise option is "Optimal 2x" 6-point, 5th-order. It takes more processing power than other methods, but that's not an issue for me, since this is all offline.

4. Are there any gotchas that I should know about?
Last edited by Mike Greene on Thu Jul 09, 2015 4:03 pm, edited 1 time in total.

Post

for the shifts of a few semitones i anticipate your application would perform, linear would be good enough for my standards. if you want to be absurd and go beyond that, i would use the one that says 'hermite' -

// "bicubic" (4-point cubic) interpolation
// x[] = sample buffer, i = integer part of the read position,
// d = fractional part of the read position, 0 <= d < 1
a = (x[i+2] - x[i+1]) - (x[i-1] - x[i]);
b = (x[i-1] - x[i]) - a;
c = x[i+1] - x[i-1];
output = ((a * d + b) * d + c) * d + x[i];

// hermite (catmull-rom) interpolation, same conventions as above
a = (3 * (x[i] - x[i+1]) - x[i-1] + x[i+2]) * 0.5;
b = x[i+1] + x[i+1] + x[i-1] - (5 * x[i] + x[i+2]) * 0.5;
c = (x[i+1] - x[i-1]) * 0.5;
output = ((a * d + b) * d + c) * d + x[i];
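in python, the whole playback loop might look like this (a rough sketch, names are mine):

def hermite_resample(x, ratio):
    # resample buffer x by playback ratio, e.g. ratio = 2 ** (1.0 / 12) shifts up a semitone
    out = []
    pos = 1.0                        # need one sample of history (x[i-1])
    while pos < len(x) - 2:          # and two samples of lookahead (x[i+2])
        i = int(pos)
        d = pos - i                  # fractional part, 0 <= d < 1
        a = (3 * (x[i] - x[i+1]) - x[i-1] + x[i+2]) * 0.5
        b = x[i+1] + x[i+1] + x[i-1] - (5 * x[i] + x[i+2]) * 0.5
        c = (x[i+1] - x[i-1]) * 0.5
        out.append(((a * d + b) * d + c) * d + x[i])
        pos += ratio                 # step through the source at the new rate
    return out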

i'm unsure, and not too concerned, whether the names of these are accurate - they've been around a while, so don't quote me.

the alleged "bicubic" method is a bit cheaper but don't bother with it unless you need the cpu difference - (edit: iirc i somehow picked up on calling this 'bicubic' from somewhere, but afaik it's more properly called cubic, as eg. 2d linear interpolation ("lerp" to nonaudio developers) is called 'bilinear')

my sense of the qualities of these methods comes not from any formal analysis, but from implementing them in waveguides/pitched delays. if you've done this, you know that at 44.1k linear interpolation decays high frequencies fast.. hermite sustains high-frequency content far better in that application, and that's why i advise it for quality.

other methods are well documented if you search for them. if you really want to "impress" people with "critical" standards, use a windowed sinc, but that's beyond stupid.

try the stuff.. you know, eg. if you implement linear, it's only a few key taps away. tappy tappy tappy.
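those key taps, for reference - linear with the same x / i / d conventions is just the one line:

output = x[i] + d * (x[i+1] - x[i])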
Last edited by xoxos on Sun Jul 12, 2015 1:53 am, edited 1 time in total.
you come and go, you come and go. amitabha neither a follower nor a leader be tagore "where roads are made i lose my way" where there is certainty, consideration is absent.

Post

Thanks xoxos. 8)

Post

Hi Mike,

Just some quick thoughts—xoxos already gave you some good comments...

You can think of linear interpolation as a poor lowpass filter. If you're familiar with FIR filters and sinc-based interpolation, note that linear interpolation can be done with an FIR whose coefficients form a triangular distribution: (0,) .5, 1, .5 (,0). So it rolls off gently and is down 3 dB at half the sample rate (with max error halfway between samples).
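To make that concrete, here's a quick numpy sketch of that view for the 2x case (the test signal and names are made up, just for illustration):

import numpy as np

x = np.sin(2 * np.pi * 1000 * np.arange(64) / 44100)  # any test signal

up = np.zeros(2 * len(x))
up[::2] = x                               # zero-stuff up to twice the rate
y = np.convolve(up, [0.5, 1.0, 0.5])      # the triangular FIR

# after the filter's one-sample delay, y alternates between the original
# samples and the averages of neighboring samples - that is, the
# linearly interpolated midpoints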

So, Olli's comment about oversampling ratio means: if you were already oversampled by a factor of 2 (or performed a higher-quality oversampling/interpolation to get up to 2x oversampled), then linear would be pretty darned good, right? (The 3 dB rolloff is curved, so at half the bandwidth it's basically flat.)

A little more detail on what "oversampling ratio" means: it's the amount greater than the minimum needed (which would be "critically sampled"). So, at 44.1 kHz, you have a max bandwidth of just under 22,050 Hz. (Don't think "well, less, due to the LP filter" - true, but the filter isn't strictly needed if the source is already limited to below 22,050 Hz. :wink: ) But if you recorded stuff that had no content (or no content you cared about - say, -100 dB down) above 11 kHz, that audio would be oversampled by 2x at 44.1 kHz.

What I'm getting at is that the results with linear interpolation will depend to some degree on the source material you're interpolating. Linear interpolation on voice *may* sound as good as anything else you might try.

I'm just giving you things to think about, more than championing linear interpolation. Some people on various DSP discussion boards will ridicule you for using linear interpolation, yet almost every sampling keyboard you've ever heard uses it. (Note xoxos' comment about shifting a relatively small amount: samplers have historically relied on having a real sample every few keys on the keyboard and shifting only a small amount, which is fortunately a requirement for keeping the sample sounding natural anyway. E-mu used higher-quality interpolation and could shift well over wide ranges, but that was unimportant in typical uses.)

Modest shifts on typical musical components (as opposed to entire mixes) match up with linear interpolation pretty well. Probably a good match for voice, since shifting by a wide range won't sound right anyway.

Regards,

Nigel
My audio DSP blog: earlevel.com

Post

Thanks Nigel. This is really helpful.

So if I understand correctly, 44.1k is not really oversampled at all, assuming we're aiming for around a 20k (or 22.05k) "listening" range. If I wanted 2x oversampling, I'd need to sample at 88.2k or 96k, right? (Not that this is going to change my methodology; I just want to make sure I have the terminology correct so I don't look like an idiot at NAMM. :D )

This is very interesting about linear interpolation. It hadn't occurred to me that hardware samplers would have used linear. Makes sense.

In fact, linear interpolation was the first method we tried, and it didn't sound bad at all. I would have been fine with that, but processing speed is not a factor, so we tried the other methods, too. (Once the basic processing loop is set up, then plugging in other formulas is certainly easy enough.) I don't hear much difference, but my son (who hasn't subjected his ears to the abuses of a career in rock and roll) says the "Optimal 2x" sounds best.

Funny thing: Optimal 2x is pretty slow, but linear isn't exactly fast either. It's way faster than Optimal 2x, but processing a 4-second sample still takes a couple of seconds on a pretty fast Mac. Maybe it's slow because we're doing all this in Python.

Speed doesn't matter anyway, though. We (my 19 year old son is doing most of the Python coding) are doing Fourier transform stuff at the same time. (Yep, thanks to SMS-Tools, which we've completely tweaked.) There are a lot of steps in the process, so creating just one finished wave file takes around 10 to 20 seconds. So a couple seconds one way or the other doesn't make much difference. It's slow either way.

So we're setting up our master app to batch process entire folders. That way, speed becomes completely unimportant. Just click Go, then come back in an hour.

This is a lot of fun, by the way. There are certainly frustrations along the way, but each time we add a new element to the process and hear the magic that our code did, it's really satisfying. Even hearing simple pitch shifted notes was a "Wow, it sounds like a real sampler!" moment. 8)

Post

Hi Mike—sounds like you're making great progress, and I agree with the approach that if it sounds right, it is right.
Mike Greene wrote:So if I understand correctly, 44.1k is not really oversampled at all, assuming we're aiming for around 20k (or 22.05k) "listening" range. If I wanted 2x oversampling, I'd need to sample at 88.2k or 96k, right?
Some might take the view that 44.1 kHz is never oversampled, because the system is designed to record a bandwidth of ~20 kHz. But technically, it's relative to the bandwidth of the signal you're talking about.

An example of where the latter distinction is important: Let's say I give you an algorithm for an audio effect and tell you, "It's very important that the source be oversampled by at least 2x, otherwise you'll get hideous audio artifacts." Does this mean that you need to upsample your 44.1 kHz source to 88.2 kHz or better? Maybe. "Yes" if you know that the input signal will have a 20k bandwidth, or if you don't know what it will be but it might have significant content over 11k. "No" if you know that your input is speech with no significant harmonic content over 11k (or upright bass, etc.). Oversampled just means that the signal has more sample points than absolutely necessary. An oversampled signal is smoother (less wiggle between samples) because the harmonic content is low relative to the sample rate. And smoother between samples means more accuracy for linear interpolation.

Linear interpolation has a frequency response that's down 3 dB at the top of the passband. That's the reason, for small shifts, that it's said to need 2x oversampling. But note that you can stretch that 11k out further, because the fall-off is a curve (sinc^2). I can't tell you offhand the point at which it's down only 1 dB, for instance, but it's higher than 11k. [edit: 60% of the way, so 13.26k for 44.1k SR]

Now, an entire mix may have a lot of content over 11k with cymbals and all, but it's unlikely that you'll be pitch shifting an entire mix for what you're after. Individual musical instruments and natural sounds drop off a lot quicker than you might guess. Overdriven electric guitar may seem to have a lot of high frequency content, but the fact is that the response of typical guitar cabinets drops like a rock starting around 5k. Even relatively high instruments like violins and trumpets don't have much left at the top of the audio band.

I'm just pointing out reasons why linear interpolation often surprises people by sounding better than they expected, even while you may hear from people who scoff at the idea of even considering it. High-frequency droop and big side-lobe ugliness, yet it's fine in the band of most real instruments, and for relatively small shifts.
My audio DSP blog: earlevel.com

Post

That all makes sense. Thanks for that, Nigel. 8)

Post

In general I'd use a good interpolator wherever possible, and certainly for your case. Linear can sound better in some cases, though. For example, if you pitch something down strongly, a good interpolator is going to remove all the high-frequency content, which can sound very ugly. A poor interpolator will fill that region in with artifacts.

Richard
Synapse Audio Software - www.synapse-audio.com

Post

We tried just about all of them, and at huge ranges, you're right, Richard. There was some definite weirdness in a couple cases.

I'm half tempted to make an effect out of the weirdness. Anyone who had an E-mu SP12 or SP1200 will remember how horrible their pitch shifting sounded. It was a textbook example of aliasing and artifacts. But the hip hop crowd loved it, and that's a lot of "the sound" when you listen to records by groups like Cypress Hill.

Post

Mike Greene wrote:processing speed is not a factor
That's good. Then what you can do is: for each new sample, calculate its position relative to the original sound (a real-numbered position between two original samples) - call it p. Then, for each original sample within a certain range of p (the range should be the scaled span of your windowed sinc function), calculate its scaled distance from p, multiply that original sample by the matching windowed-sinc value, and sum it all up: that's your interpolated sample. This allows you to have a flexible playback rate that can change smoothly with every sample if you'd like, and depending on how wide you window your sinc, you can choose anything from low aliasing in the top range to effectively no audible aliasing at all.
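In Python/numpy terms, that could look something like this - a rough sketch, where half_width, the Hann window, and all the names are my own arbitrary choices:

import numpy as np

def sinc_interp(x, p, half_width=16, stretch=1.0):
    # windowed-sinc interpolation of buffer x at real-valued position p;
    # stretch > 1 widens the sinc (lowers its cutoff) - set it to the
    # playback ratio when speeding up, to keep aliasing down
    i = int(np.floor(p))
    lo = max(i - int(half_width * stretch) + 1, 0)
    hi = min(i + int(half_width * stretch) + 1, len(x))
    t = (np.arange(lo, hi) - p) / stretch            # scaled tap distances from p
    w = 0.5 + 0.5 * np.cos(np.pi * t / half_width)   # Hann window over the span
    return np.sum(x[lo:hi] * np.sinc(t) * w) / stretch

You'd call it once per output sample, with p advancing by your playback ratio each time.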

But if you don't need the playback-rate flexibility, and you only need to resample a whole sound by a single ratio, then you can do something even simpler and faster. Say your sound has 2,000,000 samples and you want it to play at 0.8x speed. (Optionally) add some zeros to the end of the sound, FFT it, then append 25% (1/0.8 - 1 = 0.25) as many samples (complex zeros) to the top end of the FFT - that is, above the old Nyquist frequency, all the way up to the new Nyquist frequency. Since you don't change the nominal sample rate, that's interpreted as the same frequency; you've just added more room at the top, which squeezes everything down in frequency. Then inverse FFT, so you have 25% more samples (at least 2,500,000), and chop off the end to remove the original zero padding so you have exactly 2,500,000 samples. That gives you a perfectly resampled sound that plays at 0.8x speed with absolutely no aliasing whatsoever.
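Here's that recipe in numpy, roughly - a sketch of the slow-down case, with the function name and amplitude scaling being my own choices:

import numpy as np

def fft_stretch(x, speed):
    # resample x so it plays at `speed` times the original rate
    # (speed < 1 slows / pitches down) by zero-padding the spectrum
    n = len(x)
    m = int(round(n / speed))           # e.g. 2,000,000 -> 2,500,000 at 0.8x
    spec = np.fft.rfft(x)
    padded = np.zeros(m // 2 + 1, dtype=complex)
    padded[:len(spec)] = spec           # old band at the bottom, zeros above
    return np.fft.irfft(padded, m) * (m / n)   # rescale for the new length

As described above, in practice you'd zero-pad x at the end first (the FFT is circular, so the start and end of the sound otherwise interact) and trim the correspondingly stretched padding off afterward.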
Developer of Photosounder (a spectral editor/synth), SplineEQ and Spiral

Post

Thanks A_SN. Interestingly, I don’t want to use an FFT process for this because it’s actually the noise or “residual” element that I’m most interested in, and that’s what Fourier transforms struggle with. In other words, I separate the harmonics and the noise, then use FFT on the harmonic elements, since that’s the strength of FFT, and then use my interpolation method for the noise elements.

Post

Mike Greene wrote:Thanks A_SN. Interestingly, I don’t want to use an FFT process for this because it’s actually the noise or “residual” element that I’m most interested in, and that’s what Fourier transforms struggle with. In other words, I separate the harmonics and the noise, then use FFT on the harmonic elements, since that’s the strength of FFT, and then use my interpolation method for the noise elements.
Seems interesting. It reminds me of Soundhack, a sound-mangler program created by Tom Erbe many years ago that is no longer available. Tom is now releasing packages of VST plug-ins with some of the DSP processors that were included in Soundhack.

One of the processes did more or less this: separate transients from sustained partials, and let you save them as separate files. I got some good sounds using this. Are you planning to release this as a sound editor?
Fernando (FMR)

Post

Mike Greene wrote:Thanks A_SN. Interestingly, I don’t want to use an FFT process for this because it’s actually the noise or “residual” element that I’m most interested in, and that’s what Fourier transforms struggle with. In other words, I separate the harmonics and the noise, then use FFT on the harmonic elements, since that’s the strength of FFT, and then use my interpolation method for the noise elements.
Alright, but that's completely irrelevant: you need to resample, and that's a perfectly good resampling technique. What you do before or after the resampling, with or without an FFT, has no connection to the FFT-based resampling itself.

For one thing, for resampling you'd want to FFT the whole sound at once, not just a short chunk as you would for finding the pure tones.
Developer of Photosounder (a spectral editor/synth), SplineEQ and Spiral

Post

fmr wrote:Seems interesting. It reminds me of Soundhack, a sound-mangler program created by Tom Erbe many years ago that is no longer available. Tom is now releasing packages of VST plug-ins with some of the DSP processors that were included in Soundhack.

One of the processes did more or less this: separate transients from sustained partials, and let you save them as separate files. I got some good sounds using this. Are you planning to release this as a sound editor?
Nah, this is just for my own use, tweaking vocal samples for Realivox. I couldn't release this commercially even if I wanted to, since even on my own three work computers, I can only keep the app working reliably on one of them! :D
