Audio Resynthesis Algorithm

Topic Starter
1 posts since 26 Nov, 2013

Post Mon Nov 25, 2013 5:26 pm

I am a neuroscientist researching auditory processing and am interested in using Photosounder for a research project. However, I have to be able to describe the resynthesis method analytically for grants and papers. Specifically, when a spectrogram is formed, phase information is discarded, so I would like to know what assumptions are made to recover audio information during resynthesis. I would appreciate it if Michel, or someone else who knows, could point me in the direction of some resources that describe the resynthesis method. I have some really cool ideas for projects, but I can't do them unless I know how the songs are being processed. Thanks!

1028 posts since 6 May, 2008 from Poland

Post Sun Dec 22, 2013 4:24 am

Well first of all I don't know of any paper that describes what this does (I didn't base my algorithms on anything I read), although for the analysis part I seem to recall hearing of a 'Q-transform' that seemed similar to what I do. Basically the analysis is done using a bank of bandpass filters with their bands centered on logarithmically equidistant positions, and they're of varying widths so that the time/frequency resolution depends on the frequency. Envelope detection on each band is then used to create each row of the image.
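The analysis described above can be sketched roughly as follows. This is a minimal illustration, not Photosounder's actual code: the post does not specify the filter design, so the sketch assumes each band is a Hann-windowed complex oscillator spanning a fixed number of cycles (a constant-Q style choice), and the names `log_spaced_centers`, `analyze`, `bands_per_octave` and `cycles` are hypothetical.

```python
import cmath
import math

def log_spaced_centers(f_min, f_max, bands_per_octave):
    """Band centers at logarithmically equidistant positions from f_min to f_max."""
    n = int(math.floor(bands_per_octave * math.log2(f_max / f_min) + 1e-9)) + 1
    return [f_min * 2.0 ** (i / bands_per_octave) for i in range(n)]

def analyze(signal, sample_rate, centers, hop, cycles=16):
    """Return one envelope row per band (rows may differ in length)."""
    rows = []
    for fc in centers:
        # Assumption: the window spans a fixed number of cycles, so low bands
        # get long windows and high bands get short ones.
        win_len = max(8, int(round(cycles * sample_rate / fc)))
        window = [0.5 - 0.5 * math.cos(2 * math.pi * k / win_len)
                  for k in range(win_len)]
        osc = [cmath.exp(-2j * math.pi * fc * k / sample_rate)
               for k in range(win_len)]
        norm = sum(window)
        row = []
        for start in range(0, len(signal) - win_len + 1, hop):
            acc = sum(signal[start + k] * window[k] * osc[k]
                      for k in range(win_len))
            # Envelope detection: keep the magnitude, discard the phase.
            row.append(2.0 * abs(acc) / norm)
        rows.append(row)
    return rows
```

Feeding it a unit-amplitude 440 Hz sine and bands from 110 Hz to 1760 Hz should give an envelope close to 1 in the band centered on 440 Hz and much smaller values elsewhere. Note that the phase is thrown away at the `abs` step, which is exactly the information the original question asks about.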

Then resynthesis is simple enough, it takes a white-pink noise (something between a white and a pink noise, where the intensity is proportional to f(w) = 1/sqrt(w), w being the frequency), filters it through the same kind of filter bank and modulates each band with the envelope from each interpolated row from the image.
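A rough sketch of that resynthesis idea, under several assumptions that are not in the post: "intensity" is read as power density, so amplitude scales as f^(-1/4); each filtered noise band is approximated by a handful of random-phase sinusoids near the band center instead of actually filtering noise; and row interpolation is linear. The names `interp_envelope`, `band_noise`, `resynthesize` and `rel_bandwidth` are hypothetical.

```python
import math
import random

def interp_envelope(row, n_samples):
    """Linearly interpolate one image row (per-pixel gains) to per-sample gains."""
    if len(row) == 1:
        return [row[0]] * n_samples
    out = []
    for n in range(n_samples):
        pos = n * (len(row) - 1) / max(1, n_samples - 1)
        i = min(int(pos), len(row) - 2)
        frac = pos - i
        out.append(row[i] * (1 - frac) + row[i + 1] * frac)
    return out

def band_noise(center_hz, rel_bandwidth, n_samples, sample_rate, rng, partials=8):
    """Narrowband noise stand-in: random-phase sinusoids around center_hz."""
    lo = center_hz * (1 - rel_bandwidth / 2)
    hi = center_hz * (1 + rel_bandwidth / 2)
    freqs = [rng.uniform(lo, hi) for _ in range(partials)]
    phases = [rng.uniform(0, 2 * math.pi) for _ in range(partials)]
    # Assumption: power density ~ 1/sqrt(f), hence amplitude ~ f ** -0.25.
    amps = [f ** -0.25 for f in freqs]
    scale = 1.0 / math.sqrt(partials)
    return [scale * sum(a * math.sin(2 * math.pi * f * n / sample_rate + p)
                        for a, f, p in zip(amps, freqs, phases))
            for n in range(n_samples)]

def resynthesize(image_rows, centers, n_samples, sample_rate, seed=0):
    """Sum of noise bands, each modulated by its interpolated image row."""
    rng = random.Random(seed)
    out = [0.0] * n_samples
    for row, fc in zip(image_rows, centers):
        env = interp_envelope(row, n_samples)
        carrier = band_noise(fc, 0.05, n_samples, sample_rate, rng)
        for n in range(n_samples):
            out[n] += env[n] * carrier[n]
    return out
```

The point for a methods section is that the phase of the output comes entirely from the noise source, so the image constrains only the band envelopes.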
Developer of Photosounder (a spectral editor/synth), SplineEQ and Spiral

Rock Brentwood
2 posts since 2 Feb, 2016

Post Fri Feb 05, 2016 1:28 pm

Going by the description, it sounds like the same method used in your ARSS (to the original respondent: it is still freely available on SourceForge and can be studied). Having done a complete once-over on ARSS I can say a few things about it -- and probably also about PhotoSounder.

In ARSS you were using a fixed phase clock on each frequency band, with a randomly chosen initial phase. Once all the frequency bands are "rephased" into complex form, then it's just a matter of applying the sound -> graph analysis transform in reverse to the whole graph. That can be done one band at a time, with the results accumulated band-by-band into the total sound.

The forward transform is just a (more or less) standard filter bank with center frequencies placed on a hybrid linear-logarithmic scale (p = a + b exp(k f) to convert frequency f to bin p; where k -> 0 achieves linearity). Each bin uses a cosine window as its envelope (on the p-scale which may be logarithmic, not the linear f scale!), and the envelopes add up to a partition of unity over the overall frequency band, thus giving you a de facto band pass filter for the total sound.
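The partition-of-unity property mentioned above is easy to check numerically. The sketch below assumes a raised-cosine (Hann-shaped) window of half-width one bin on the p-scale; whether ARSS uses exactly this shape is not stated in the post, and `cosine_window` and `band_weights` are hypothetical names.

```python
import math

def cosine_window(p, center):
    """Raised-cosine weight of bin `center` at position p on the bin scale."""
    d = abs(p - center)
    if d >= 1.0:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * d))

def band_weights(p, n_bins):
    """Weights of all bins at position p; they sum to 1 for 0 <= p <= n_bins - 1."""
    return [cosine_window(p, i) for i in range(n_bins)]
```

Between two neighboring bins at distances x and 1 - x, the weights are 0.5 (1 + cos(pi x)) and 0.5 (1 - cos(pi x)), which add up to exactly 1. That is why summing all bands gives back the input, limited to the covered frequency range.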

The transform is just a variant of the S-transform, except for using cosine-windows in place of Gaussian-windows. Equivalently, both forward transforms are instances of the wavelet transform in which the wavelets are complex exponentials enveloped with the given windows; i.e. psi(x) = g(x) exp(ikx) where psi(x) is the mother wavelet and g(x) is the windowing function. In fact, the whole reason for the S-transform was that the inverse transform for wavelets could not be done with Gaussian-windowed complex exponential wavelets.

Three points about the approach used in ARSS/PhotoSounder: (1) the amplitude spectrum is linearly scaled. You lose a lot of detail by doing this. Audacity, for instance, uses a logarithmic scale for amplitudes in its spectrographic display. It's a trivial fix to make to ARSS and PhotoSounder (and to also add in the ability to process Audacity's color coding). In fact, in (my) modified form of ARSS I allow one to choose between linear and logarithmic (which I think would benefit PhotoSounder greatly to add in); (2) the choice of phase for the inverse transform is poorly-motivated and can't properly handle variable-frequency sounds (i.e. chirps), (3) the design does not properly integrate or account for k (by your own admission) both in ARSS and in PhotoSounder. So the only real values in use are k = 0 (spectrograph on a linear scale) and k = log(2) (scalograph on a logarithmic scale). On account of (1), you're going to lose a lot of fine detail in the graphs and this shows up quite prominently in the graph to sound conversions in both ARSS and PhotoSounder.

On the larger issue: if you're using the same method for the inverse transform you used with ARSS, you may be doing too much. Transforms are actually not needed for the reverse direction! The method I use -- to much better effect -- is to simply find the peaks at each column on the graph and then optimally blend the peaks from each column to the next, which also gives you the proper phase adjustment for free, thus resolving issue (2). The demo software I sent you (BtoW.c for BMP to WAV conversion), incorporates the algorithm. Feel free to add the method in PhotoSounder.
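The peak-based approach can be illustrated with a generic sinusoidal-tracking sketch. This is not BtoW.c, which is not reproduced in the thread, and the "optimal blending" rule is not specified in the post; the sketch assumes strict local maxima as peaks and a linear glide in frequency and amplitude between columns, with phase obtained by accumulating the instantaneous frequency. `column_peaks` and `glide` are hypothetical names.

```python
import math

def column_peaks(column, threshold=0.0):
    """Indices of local maxima above `threshold` in one spectrogram column."""
    return [i for i in range(1, len(column) - 1)
            if column[i] > threshold
            and column[i] > column[i - 1]
            and column[i] >= column[i + 1]]

def glide(f0, f1, a0, a1, n_samples, sample_rate, phase=0.0):
    """One partial moving from (f0, a0) to (f1, a1) over n_samples.

    The phase is the running sum of the instantaneous frequency, so chaining
    segments with the returned phase keeps the waveform continuous.
    """
    out = []
    for n in range(n_samples):
        t = n / n_samples
        freq = f0 + (f1 - f0) * t
        amp = a0 + (a1 - a0) * t
        out.append(amp * math.sin(phase))
        phase += 2 * math.pi * freq / sample_rate
    return out, phase
```

A usage pattern would be to match each peak in one column to the nearest peak in the next, call `glide` for each matched pair, and pass the returned phase into the following segment. Because phase follows from the frequency track, a chirp is rendered as a smooth sweep instead of a staircase of fixed-frequency bands.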

The proper use of phase (going by standard DSP accounts) is to use the stationary phase principle to find the natural frequencies and times of occurrences of the signals in the graph and to use this to move them both in the frequency and time directions to their "natural" locations; resulting in a graph that is concentrated along the lines (centered at their "instantaneous frequencies") corresponding to the sound components. The peak-finding method I mentioned already effects a measure of "frequency re-location" (without the need for phase), but not time re-location. You'll hear this as an accurately rendered sound with a slight studio-reverberation effect.

The simplest way to diagram phase is to color-code it, and display amplitude by brightness. For any of this to be effective, you'll need the graph to be at a large enough resolution to actually *see* the phase (e.g. 5280 pixels/second).
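One possible way to do that color coding, assuming phase maps to hue and amplitude to brightness in HSV space (the post does not say which color mapping was used; `phase_color` and `max_amplitude` are hypothetical names):

```python
import cmath
import colorsys
import math

def phase_color(z, max_amplitude=1.0):
    """Map a complex spectrogram value to an 8-bit RGB pixel.

    Hue encodes phase (one full turn of the color wheel per 2*pi);
    brightness encodes amplitude, clipped at max_amplitude.
    """
    amp = min(abs(z) / max_amplitude, 1.0)
    hue = (cmath.phase(z) / (2 * math.pi)) % 1.0
    r, g, b = colorsys.hsv_to_rgb(hue, 1.0, amp)
    return tuple(int(round(255 * c)) for c in (r, g, b))
```

With this mapping a phase of 0 renders red and a phase of pi renders cyan, so the phase only becomes visible if the time resolution is fine enough for the hue to cycle over several pixels per period.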

Here is a demo showing off both phase-colored spectrum on a logarithmic scale and an amplitude-colored spectrum (on a linear scale) alongside it. The sound comes straight off the latter using the above-mentioned process (though with a little remixing). The manipulations illustrated in the video were done graphically on the graph.

Experiment in Reverse Mixing and Sound Morphing ...

Had the chirp lines in the phase-colored graph been added, the original sound would have been reproduced within the frequency band displayed (27.5 - 1760 Hz). So, there, the inverse transform is as trivial as it is for the Wigner transform: just add up the components!

And, finally, whether or not PhotoSounder uses the same (publicly accessible) routines as ARSS, the modifications I made in my version of ARSS (along with the dozens of other routines I now have) are freely available for use for both PhotoSounder and for researchers; and I may post them on SourceForge in the near future. In the meanwhile, either of you may feel free to drop me a note.

1028 posts since 6 May, 2008 from Poland

Post Fri Feb 05, 2016 1:54 pm

Rock Brentwood wrote:(1) the amplitude spectrum is linearly scaled. You lose a lot of detail by doing this. Audacity, for instance, uses a logarithmic scale for amplitudes in its spectrographic display.
Totally wrong. This is because you, as well as the people who made Audacity, as well as pretty much everybody else, don't actually understand gamma compression. Read this

Spiral displays things totally linearly and it looks perfect, Photosounder/ARSS by default don't because I didn't understand sRGB back then but I had the good sense to use a gamma of 2 on the sRGB which gives you an effective gamma of 1.1, so close to linear. Using a "logarithmic scale" (actually a logarithmic scale plus a strong gamma) is crap, for one thing it gives you a hard threshold whereas there's no threshold when staying linear. But also your perception of light intensity is non-linear in much the same way as your auditory perception is. So your perception of darks is improved just as your perception of quiet sounds is. Compare Spiral's right side visualisation with Audacity's and tell me Spiral's isn't far better.

People's belief that you need a logarithmic scale for that is just plain ignorance, because they tried "linearly" (with a gamma of 2.2) and figured "oh that looks too dark, what can I do about it?" and their first idea was "log scale!" (mine was a gamma of 2 the other way) and that's how it became the norm.
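The two amplitude-to-pixel mappings being argued about can be compared with a small sketch. It assumes amplitudes normalized to 0..1, approximates the display's sRGB response by a pure 2.2 power law (the real sRGB curve is piecewise), and uses a -60 dB floor for the log scale purely as an example value. All function names are hypothetical.

```python
import math

def to_pixel_gamma(amplitude, gamma=2.0):
    """Gamma-compress a linear amplitude in 0..1 to a pixel value in 0..1."""
    a = min(max(amplitude, 0.0), 1.0)
    return a ** (1.0 / gamma)

def to_pixel_log(amplitude, floor_db=-60.0):
    """Map amplitude to 0..1 on a dB scale; anything below floor_db clips to 0."""
    if amplitude <= 0.0:
        return 0.0
    db = 20.0 * math.log10(amplitude)
    return min(max(1.0 - db / floor_db, 0.0), 1.0)

def displayed(amplitude, encode_gamma=2.0, display_gamma=2.2):
    """Light emitted by the display for a gamma-encoded pixel (power-law model)."""
    return to_pixel_gamma(amplitude, encode_gamma) ** display_gamma
```

Under these assumptions an amplitude of 0.0001 (-80 dB) still yields a nonzero pixel of 0.01 with the gamma mapping, while the log mapping clips it to 0: that is the hard threshold mentioned above. Encoding with gamma 2 and displaying at roughly 2.2 gives an overall exponent of about 1.1, matching the "effective gamma of 1.1" figure.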

As for the rest don't take this the wrong way but I don't really care about that, you're talking about work I completed 8 years ago. Since then I made something slightly better (Photosounder's live synthesis, and even that was 5 years ago) and I've got other ideas for what to do next.
