Next gen stereo image tool teaser aka vocal removal
- KVRAF
- Topic Starter
- 6478 posts since 16 Dec, 2002
Well, no shit the word "wavelet" gives you Google hits! It's only one of the most important DSP developments in history after the FFT. A wavelet transform is just a way to represent a signal. (You probably knew that, so never mind.)
The algorithm here has nothing to do with blind source separation, by the way. That's an entirely different area of study (AI, neural networks etc.). I'm not picking out any particular instruments or tonal ranges here, you see. I'm selecting a slice (or a window, rather) in the stereo image field and extracting or suppressing that.
Here is the whitepaper my algorithm is loosely based on: "FREQUENCY-DOMAIN SOURCE IDENTIFICATION AND MANIPULATION IN STEREO MIXES FOR ENHANCEMENT, SUPPRESSION AND RE-PANNING APPLICATIONS" - Carlos Avendano.
The method in that paper is FFT-based and sounds like shit. The paper itself is probably not available for free, as it's an AES paper.
-
- KVRist
- 190 posts since 28 Nov, 2003
No need to get defensive - if you didn't intend to talk about what you had done, why did you post in DSP and Plug-in Development? Found the paper, btw:
http://www.ee.columbia.edu/~dpwe/papers ... -unmix.pdf
Haven't read it yet, but I notice it cites a paper on blind source separation...
- KVRAF
- Topic Starter
- 6478 posts since 16 Dec, 2002
Oh cool,
I didn't know that paper was available online (it wasn't when I started getting into it). As the paper's name suggests, it works entirely in the frequency domain. Sure, he might have done background research on time-domain implementations as well.
Anyway, that paper explains the basics. What I did was a kind of pseudo-wavelet version of that, with linear-phase filters and an improved similarity comparison algo. I also ditched the panning index, as it was an unnecessary complication. I replaced it with a simple (MS) stereo rotator to choose the slice for processing.
It's an absolute pig and there's no way to run it in real time.
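A minimal sketch of what a simple MS-style stereo rotator could look like, assuming constant-power panning (left gain cos(theta), right gain sin(theta)). The name `rotate_stereo` and the angle convention are assumptions made for illustration; this is not the poster's Matlab code. Rotating by the pan angle of a source moves that source entirely into the rotated "mid" channel and nulls it in the rotated "side" channel. At 45 degrees it reduces to ordinary mid/side, up to scaling.

```python
import math

def rotate_stereo(left, right, angle_rad):
    """Rotate the (L, R) plane by angle_rad and return (mid, side) lists.

    Hypothetical helper: a source panned with gains (cos a, sin a) ends up
    fully in `mid` and cancels in `side` when angle_rad == a.
    """
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    mid = [c * l + s * r for l, r in zip(left, right)]
    side = [-s * l + c * r for l, r in zip(left, right)]
    return mid, side

# Usage: a source panned 30 degrees from hard left, rotated back onto mid.
theta = math.radians(30)
source = [1.0, -0.5, 0.25]
left = [math.cos(theta) * x for x in source]
right = [math.sin(theta) * x for x in source]
mid, side = rotate_stereo(left, right, theta)
```

Here `mid` recovers the source and `side` is (numerically) silent, which is the classic centre-cancel trick generalised to an arbitrary pan position.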
-
- KVRist
- 143 posts since 3 Apr, 2001 from Mont de Marsan, 40000, France
It sounds great, and it's great to be able to hear the reverb and delays separated from the lead.
These examples go very well with my current bedside book, "The Mixing Engineer's Handbook", and give me quite a few illustrations of pre-delays, reverb use and of course panning.
Great job anyway.
-
- KVRist
- 190 posts since 28 Nov, 2003
Yeah, I could imagine that being quite the CPU cycle burner. Still, it's amazing how much faster you can get some Matlab calculations when you rewrite some of your routines as mex files in C. Good luck with it, BTW. I'd rush, though - I can imagine the road to commercialization being littered with more than just a few patents if you don't get there first.
- KVRAF
- Topic Starter
- 6478 posts since 16 Dec, 2002
autloc wrote: Yeah, I could imagine that being quite the CPU cycle burner. Still, it's amazing how much faster you can get some Matlab calculations when you rewrite some of your routines as mex files in C.
If I only had the skills... I'm aware that SSE2 optimisations would make it a lot faster as well (parallelising it). Matlab's internal 64-bit math is mostly unnecessary here, too. I suspect one might be able to optimise this to take "only" about 70-100% of a modern CPU. The real hit is the fact that I have to compare every single wavelet data "bin" and individual grain separately (for the two channels). There's no possible shortcut unless I want to sacrifice a good deal of quality.
I'm not too fussy about this thing going commercial, though. I'm hoping for the best and expecting the worst; at worst, I'll be posting the .m files here.
-
- KVRist
- 327 posts since 13 Nov, 2002 from Germany, Darmstadt
I haven't read it, but it might be a similar topic: http://www.wavelet.org/phpBB2/viewtopic.php?t=4493
- KVRAF
- Topic Starter
- 6478 posts since 16 Dec, 2002
Hey, thanks! I see I'm not alone in this.
Jesus Christ, what language! He could've explained the point of his paper far more simply if he'd made some effort. That paper doesn't result in the same thing as mine, though. What they're trying to do is separate several instruments or tonal packets from signals in the time domain using adaptive methods. As I mentioned before, what I'm doing isn't defined as blind source separation. "Use of sparsity of sources in some signal dictionary" implies the use of neural networks, as you have to teach the algo to look for certain types of packets. In order to separate something like speech you'd essentially have to teach the algo to speak (in rough terms).
I'm doing none of that. My algo has no adaptive properties whatsoever. It simply brute-forces its way through the selected slice in the stereo image.
By the way, what I'm doing is multiresolution processing too. That's roughly what wavelets are, really: a multiresolution take on the FFT.
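To make "multiresolution" concrete, here is a small sketch using the Haar wavelet, chosen only because it fits in a few lines. It is an assumption for illustration and not the poster's FFT-based pseudo-wavelet filter bank; the function names are made up. Each level splits the current low band into a half-rate approximation and a detail band, so coarse bands cover long time spans and fine bands short ones.

```python
import math

def haar_step(signal):
    """One analysis level: even-length signal -> (approximation, detail)."""
    k = 1.0 / math.sqrt(2.0)
    approx = [k * (signal[i] + signal[i + 1]) for i in range(0, len(signal), 2)]
    detail = [k * (signal[i] - signal[i + 1]) for i in range(0, len(signal), 2)]
    return approx, detail

def haar_decompose(signal, levels):
    """Return [detail_1, ..., detail_n, approx_n], finest band first.

    len(signal) must be divisible by 2 ** levels.
    """
    bands = []
    approx = list(signal)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        bands.append(detail)
    bands.append(approx)
    return bands

def haar_reconstruct(bands):
    """Invert haar_decompose by merging bands from coarsest to finest."""
    k = 1.0 / math.sqrt(2.0)
    approx = bands[-1]
    for detail in reversed(bands[:-1]):
        merged = []
        for a, d in zip(approx, detail):
            merged.append(k * (a + d))
            merged.append(k * (a - d))
        approx = merged
    return approx

# Usage: three levels on eight samples give bands of length 4, 2, 1, 1.
bands = haar_decompose([4.0, 2.0, 5.0, 7.0, 1.0, 1.0, 0.0, 2.0], 3)
restored = haar_reconstruct(bands)
```

Any per-band processing (such as a stereo comparison) would sit between the decompose and reconstruct calls; left untouched, the bands rebuild the input to within rounding error.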
- KVRAF
- Topic Starter
- 6478 posts since 16 Dec, 2002
I also see more room for improvement in my plugin, efficiency-wise. Right now I'm actually doing pseudo-wavelets. I found no existing linear-phase wavelet solutions suited to my purpose, so I had to do it using multiresolution FFT-based tricks. If I had the skill to change this to actual linear-phase filtered wavelets, the processing would be a lot easier on the CPU. I'd speculate the sound quality might suffer slightly in the transform.
-
- KVRist
- 62 posts since 6 May, 2004 from IL USA
I read the paper. I know you probably don't want to give away too much of your algorithm... Are you basically using the same similarity/panning measures, but with wavelets/multiresolution FT instead?
I guess my only question is what happens when you have two things panned differently occupying the same frequency band, like one panned far left and one slightly left.
Just as a nitpicky off-topic note, since your algo is, as you say, not blind source separation: to clarify some things you said about BSS, a great many BSS approaches use no machine learning but rely on signal subspace methods, namely Independent Component Analysis, or a whole slew of array processing techniques when you have a lot of sensors.
ICA is well suited to the cocktail party problem in speech, as its underlying assumption is source independence, and two different speakers won't really be talking in a similar manner, rate etc. ICA isn't that well suited to music, though, because sources tend to be aligned in both time and frequency.
You should use the terms AI or machine learning instead of all the "neural nets" you were throwing around as well, as neural nets are just one of a bazillion AI approaches. To me "dictionaries" implies things other than neural nets, and without the context of that whole paper I'm not convinced "dictionaries" is making any reference to AI at all. It could just be a large collection of vectors or bases or such. But this is just me being a nitpick, as I said.
- KVRAF
- Topic Starter
- 6478 posts since 16 Dec, 2002
Hi,
I can't say I'm awfully familiar with the current state of blind source separation. Neural nets (or nets of adaptive filters, rather) aren't awfully familiar to me either.
I also suspect "dictionary" in that other study was referring to a collection of signal states/shapes, possibly connected to adaptive filter networks. I'm not too familiar with this field, so excuse the use of possibly wrong terms.
About my algo: the basic similarity comparison is nearly the same as in the Avendano paper, but with a better window shape. Like I said earlier, I omitted the panning index. Getting results similar to mine shouldn't be too difficult.
What happens with two signals panned close to each other?
It depends on what window width and depth are used. I can extract extremely narrow slices with increased distortion but with nearly no bleed from nearby panned sources. If a wider window is used, the nearby sources will simply become louder. If I use the widest possible window, the extraction will be identical to summing the two channels linearly.
If I use an extremely deep window and the window edge falls exactly on a loud source, the algorithm will produce the worst possible distortion. It kind of tries to divide something that cannot be divided.
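A rough sketch of the per-bin comparison and window described above. The similarity formula, 2|XL XR*| / (|XL|^2 + |XR|^2), is my reading of the measure in the Avendano paper; the Gaussian window, the `width` and `depth` parameters as defined here, and all function names are assumptions made for illustration. The poster's actual window shape and "edge tension" parameter are not described in the thread and are not modelled.

```python
import math

def similarity(xl, xr):
    """Similarity of one frequency bin across the two channels.

    Returns 1.0 when both channels carry the bin at equal magnitude and
    0.0 when it is present in one channel only (or is silent).
    """
    power = abs(xl) ** 2 + abs(xr) ** 2
    if power == 0.0:
        return 0.0
    return 2.0 * abs(xl * xr.conjugate()) / power

def window_gain(psi, width, depth):
    """Hypothetical window over the similarity axis (width must be > 0).

    Gain is 1.0 at psi == 1 (the selected slice) and falls off as a
    Gaussian; `depth` in 0..1 sets the maximum attenuation outside it.
    """
    distance = 1.0 - psi
    inside = math.exp(-0.5 * (distance / width) ** 2)
    return (1.0 - depth) + depth * inside

def extract_bin(xl, xr, width, depth):
    """Weight the channel sum of one bin by the window gain."""
    return window_gain(similarity(xl, xr), width, depth) * (xl + xr)

# Usage: an equal-level bin passes, a one-sided bin is suppressed.
kept = extract_bin(0.5 + 0.5j, 0.5 + 0.5j, width=0.05, depth=1.0)
dropped = extract_bin(1.0 + 0.0j, 0.0 + 0.0j, width=0.05, depth=1.0)
```

With a very large `width` the gain approaches 1 for every bin, so the output becomes the plain sum of the two channels, matching the widest-window behaviour described in the post. Note that this similarity is symmetric in the two channels, so on its own it cannot tell left from right; that is the job of the panning index in the paper or of a stereo rotation beforehand.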
- KVRAF
- Topic Starter
- 6478 posts since 16 Dec, 2002
By the way, here is another whitepaper on the subject. The similarity comparison (and the author) are the same as in the previous paper.
"FREQUENCY DOMAIN TECHNIQUES FOR STEREO TO MULTICHANNEL UPMIX" - Carlos Avendano and Jean-Marc Jot.
When I'm referring to a window, I'm talking about a horizontal slice in a "Panogram" that is mentioned and pictured in that paper. The parameters are width, depth and edge tension.
I'm tempted to just post the algo here by now so people could play with it. I'll just have to hold on to it a little while longer and wait for a response from several potential customer companies.
Excuse the blatant intent to make money from this.
(But surely anyone who knows a little math would be able to top my algo by reading those papers and this thread.)
-
- KVRist
- 62 posts since 6 May, 2004 from IL USA
Cool, thanks for the heads-up on the other paper. I'm actually doing transcription and BSS work for my thesis and have considered leveraging stereo mixing to help with it. Your fairly awesome results lead me to believe such an approach might be a fairly useful preprocessing stage. Good luck with your algo! If you don't have luck getting industry interested, I'd encourage trying to write a publication, perhaps. I could always use references to cite.
- KVRAF
- Topic Starter
- 6478 posts since 16 Dec, 2002
Ecko wrote: such an approach might be a fairly useful preprocessing stage.. if you don't have luck getting industry interested, I'd encourage trying to write a publication perhaps.. I could always use references to cite
Just remember that the linear-phase wavelet implementation burns CPU like there's no tomorrow. As a "quick" preprocessing stage, it might end up being more of a pig than the actual BSS stage. The FFT version (from those papers) runs at a 10-30% load on current CPUs.
As for writing a paper, I've kind of done that already; it's just not quite ready for public viewing.
When you're finished with your thesis, I'd be interested in the results as well.
