image to sound conversion

DSP, Plugin and Host development discussion.

Post

Righto, in this:

someone asks:
Are you interpreting the colours of a pixel column as the amount of each frequency, where the pixel's y-coordinate represents the frequency itself?
To which the reply is:
Yes, and it's all levels of grey
So, do you reckon this is the principle being referred to:
Image
If we say it takes T seconds to 'scan' the whole picture from left to right, each calculated 'acoustical representation' would therefore last for T/8 seconds?

Does anyone have any other creative suggestions for image to sound conversion, whether an enhancement of the above principle or something different altogether?
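For anyone wanting to try the principle described above, here is a minimal sketch. It assumes a greyscale image given as rows of values from 0.0 to 1.0, a linear spread of frequencies with the top row highest, and plain sine partials. The function name `column_scan` and the parameters `f_low`, `f_high` and `sample_rate` are made up for illustration and are not taken from the video or the quoted thread.

```python
import math

def column_scan(image, total_seconds, sample_rate=8000, f_low=200.0, f_high=2000.0):
    """Scan a greyscale image left to right and return a list of samples.

    image: list of rows (top row first), each a list of grey levels 0.0..1.0.
    Each column lasts total_seconds / number_of_columns.
    """
    n_rows = len(image)
    n_cols = len(image[0])
    samples_per_col = int(total_seconds / n_cols * sample_rate)
    # Assumption: top row = highest frequency, rows spaced linearly in Hz.
    freqs = [
        f_high - (f_high - f_low) * r / max(n_rows - 1, 1)
        for r in range(n_rows)
    ]
    out = []
    for c in range(n_cols):
        for n in range(samples_per_col):
            # Global time keeps every partial phase-continuous across columns.
            t = (c * samples_per_col + n) / sample_rate
            s = sum(
                image[r][c] * math.sin(2 * math.pi * freqs[r] * t)
                for r in range(n_rows)
            )
            out.append(s / n_rows)  # Divide by row count to stay within -1..1.
    return out
```

With an 8-column image and `total_seconds = T`, each column occupies T/8 seconds of output, which matches the reading in the question above.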

Post

Nice illustration!
Off the top of my head, how about RGB images? Say the overall brightness is your frequency volume as usual, but the individual channels could be pan or modulo. Would be quite mad though!


Post

Those images sound really interesting. The frequency function is very interesting. It sounded to me as if it was capable of producing all kinds of timbres from noise to almost harmonic. I guess you could map the RGB values to HSL (or HSV) and do all kinds of fun with the three dimensions.

Post

To convert the image to greyscale, should I calculate a weighted average of each pixel's R, G, and B components?

Post

Or just a straight average. Optional for either?
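Both options are easy to compare side by side. The sketch below assumes 8-bit channel values and uses the commonly quoted 30% / 59% / 11% luma weights for the weighted version; the function names are made up for illustration.

```python
def grey_straight(r, g, b):
    """Unweighted mean of the three channels."""
    return (r + g + b) / 3.0

def grey_weighted(r, g, b):
    """Weighted mean using approximate luma weights (assumed 30/59/11 split)."""
    return 0.30 * r + 0.59 * g + 0.11 * b

# Pure green reads much brighter with the weighted version,
# pure blue much darker.
print(grey_straight(0, 255, 0), grey_weighted(0, 255, 0))
print(grey_straight(0, 0, 255), grey_weighted(0, 0, 255))
```

The weighted version tracks perceived brightness more closely; the straight average treats the three channels as equally loud, so to speak. Either one gives a usable amplitude.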

Post

I had always assumed that pixels were directly converted to samples. From my understanding, pixels are generally 24 bits (RGB) or 32 bits (RGB + Alpha). These could simply be treated as 24- or 32-bit samples and played like any other sound. I have no idea how this would sound. It could turn out to be noise. I don't have the programming chops any more to be able to do this kind of work.

Post

It might be interesting to take the R, G, and B components as different phases at the same frequency.
Image
Don't do it my way.

Post

JJBiener wrote: I had always assumed that pixels were directly converted to samples. From my understanding, pixels are generally 24 bits (RGB) or 32 bits (RGB + Alpha). These could simply be treated as 24- or 32-bit samples and played like any other sound. I have no idea how this would sound.
It would mean that red (or whichever channel you put at the high end of the sample word) would dominate the audible sound, with green at -48dB and blue at -96dB relative to that.
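The -48 dB and -96 dB figures follow from each channel sitting 8 bits lower in the packed word than the one above it, at roughly 6.02 dB per bit. A small sketch, assuming red is packed into the top byte of a 24-bit word (the helper names are made up for illustration):

```python
import math

def pack_rgb(r, g, b):
    """Pack three 8-bit channels into one 24-bit word, red in the top byte."""
    return (r << 16) | (g << 8) | b

def channel_level_db(shift_bits):
    """Level of a channel shifted down by shift_bits, relative to the top byte."""
    return 20 * math.log10(2.0 ** -shift_bits)

print(round(channel_level_db(8), 2))   # green relative to red
print(round(channel_level_db(16), 2))  # blue relative to red

# One step of red outweighs full-scale green and blue combined.
print(pack_rgb(1, 0, 0) > pack_rgb(0, 255, 255))
```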

Post

Borogove wrote:
JJBiener wrote: I had always assumed that pixels were directly converted to samples. From my understanding, pixels are generally 24 bits (RGB) or 32 bits (RGB + Alpha). These could simply be treated as 24- or 32-bit samples and played like any other sound. I have no idea how this would sound.
It would mean that red (or whichever channel you put at the high end of the sample word) would dominate the audible sound, with green at -48dB and blue at -96dB relative to that.
You could be right. I only have a basic understanding of how sampling is done. I don't know how subtle changes from sample to sample affect timbre. I will leave this to people who know what they are doing.

Post

JJBiener wrote: I had always assumed that pixels were directly converted to samples. From my understanding, pixels are generally 24 bits (RGB) or 32 bits (RGB + Alpha). These could simply be treated as 24- or 32-bit samples and played like any other sound. I have no idea how this would sound. It could turn out to be noise. I don't have the programming chops any more to be able to do this kind of work.
The advantage of the approach in the video is that you can correlate what you're hearing with what you're seeing. Also, depending on how you scale the vertical axis (the frequency axis), you can easily produce different tunings. Another advantage is that each step on the horizontal axis can represent an arbitrary time slice, allowing for frequency-independent speeding up and slowing down.
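To illustrate the point about tunings, here are two possible row-to-frequency mappings. This is only a sketch: rows are counted upward from the bottom of the image, and the function names, the 110 Hz base and the step sizes are all assumptions chosen for the example.

```python
def row_to_freq_linear(row, base_hz=110.0, step_hz=55.0):
    """Equal spacing in Hz: consecutive rows are step_hz apart."""
    return base_hz + step_hz * row

def row_to_freq_equal_tempered(row, base_hz=110.0, steps_per_octave=12):
    """Equal spacing in pitch: steps_per_octave rows span one octave."""
    return base_hz * 2.0 ** (row / steps_per_octave)

# Twelve rows up is exactly one octave in 12-tone equal temperament.
print(row_to_freq_equal_tempered(12))
# Changing steps_per_octave gives other equal divisions of the octave.
print(row_to_freq_equal_tempered(19, steps_per_octave=19))
```

Because the time slice per column is chosen separately, swapping one mapping for the other changes the tuning without changing the playback speed.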

Treating each pixel as a sample, on the other hand, would probably result in a less interesting and noisier sound (this is just a hunch :)) and be a lot less flexible. Also, the overall brightness of the picture would dictate the level of DC offset, although that could be filtered out of course.
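A quick sketch of the DC offset point, assuming 8-bit grey pixels mapped one-to-one onto samples in the range 0.0 to 1.0. Subtracting the mean is the simplest offline fix; a high-pass filter would do a similar job on a stream. The function names are made up for illustration.

```python
def pixels_to_samples(grey_pixels):
    """Map grey levels 0..255 to floats 0.0..1.0, one sample per pixel."""
    return [p / 255.0 for p in grey_pixels]

def remove_dc(samples):
    """Subtract the mean so the signal is centred on zero."""
    mean = sum(samples) / len(samples)
    return [s - mean for s in samples]

bright_row = [200, 220, 210, 230]          # a bright patch of the picture
samples = pixels_to_samples(bright_row)
print(sum(samples) / len(samples))          # large offset from a bright image
centred = remove_dc(samples)
print(sum(centred) / len(centred))          # approximately zero
```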

Post

Borogove wrote:
JJBiener wrote: I had always assumed that pixels were directly converted to samples. From my understanding, pixels are generally 24 bits (RGB) or 32 bits (RGB + Alpha). These could simply be treated as 24- or 32-bit samples and played like any other sound. I have no idea how this would sound.
It would mean that red (or whichever channel you put at the high end of the sample word) would dominate the audible sound, with green at -48dB and blue at -96dB relative to that.
Or you could just sum the RGB content to avoid this problem.

Edit: of course this loses precision. I just meant to suggest an easy way to convert pixel to sample, but it was probably too obvious a thing to really be worth suggesting.
Last edited by matt42 on Wed Jan 04, 2012 9:37 am, edited 1 time in total.

Post

I'd convert from RGB to HSV and use V as amplitude and probably H as phase. Don't know what to do with the saturation.

Post

I'd convert from RGB to HSV and use V as amplitude and probably H as phase. Don't know what to do with the saturation
Yes, seems logical. With reference to below model:
Image
Since Hue is basically all the colours arranged in a wheel and expressed in degrees from 0 to 360, it can be cast as the phase of our audio sine.

Since Value describes how light or dark the colour is (overall intensity/strength of the light), it can be cast as the amplitude of our audio sine.

Saturation is the ratio of the dominant wavelength to other wavelengths in the colour (white light at the centre of the model contains an even balance of all wavelengths). Now, since we are determining the frequency of our audio sine by the pixel row, there is no 'obvious' audio parameter we can drive from Saturation. 'Modulo' has been mentioned previously as a possible candidate: what is that?
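The Hue-to-phase and Value-to-amplitude mapping described above can be sketched with Python's standard `colorsys` module. This assumes 8-bit RGB input; the function names are made up for illustration, and Saturation is simply passed through unused since no mapping for it has been settled on here.

```python
import colorsys
import math

def pixel_to_partial(r, g, b):
    """Return (amplitude, phase_radians, saturation) for one 8-bit RGB pixel."""
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    amplitude = v              # Value drives the amplitude of the sine
    phase = 2 * math.pi * h    # Hue (0..1 here, i.e. 0..360 degrees) drives the phase
    return amplitude, phase, s

def partial_sample(amplitude, phase, freq_hz, t):
    """One sample of the sine for this pixel; freq_hz comes from the pixel row."""
    return amplitude * math.sin(2 * math.pi * freq_hz * t + phase)

# Pure red sits at 0 degrees, pure green at 120 degrees.
print(pixel_to_partial(255, 0, 0))
print(math.degrees(pixel_to_partial(0, 255, 0)[1]))
```

Note that phase on a single partial is hard to hear by itself; it mainly matters when partials at the same frequency are summed, as they would be if neighbouring pixels or channels share a row.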

Post

doctornash wrote:
Now, since we are determining the frequency of our audio sine by the pixel row, there is no 'obvious' audio parameter we can drive from Saturation. 'Modulo' has been mentioned previously as a possible candidate: what is that?
I meant volume modulation. I don't know though, there are just so many variables to consider to make any musical or interesting sense of it. I noticed the examples given had very specific bright lines and dark areas, chosen to make it sound more coherent and have smoothly changing frequency blocks.

I suppose red areas could play through faster than blue areas, or something like that?

Post

JJBiener wrote:
Borogove wrote:
JJBiener wrote: I had always assumed that pixels were directly converted to samples. From my understanding, pixels are generally 24 bits (RGB) or 32 bits (RGB + Alpha). These could simply be treated as 24- or 32-bit samples and played like any other sound. I have no idea how this would sound.
It would mean that red (or whichever channel you put at the high end of the sample word) would dominate the audible sound, with green at -48dB and blue at -96dB relative to that.
You could be right. I only have a basic understanding of how sampling is done. I don't know how subtle changes from sample to sample affect timbre. I will leave this to people who know what they are doing.
He is right. Packed color values don't translate directly to any linear scale. They're 3 different amounts that have nothing to do with each other, red, green and blue, packed into one 24 or 32 bit word.

While this stuff is kind of cute, there's really no "correct" way to interpret images into sound. Images are a frozen instant in time, audio is inherently evolving over some length of time. The height and width dimensions in an image represent position in space, not in time. Audio is a one dimensional signal (voltage) over time. A color image is a 3 dimensional signal (RGB) arrayed on a 2D grid. The two models just don't overlap in many meaningful ways.

That said, DSP-wise, images and sound actually share a lot of similarities. For example, a low pass filter in sound equates to a blur filter in an image. A highpass is like a sharpen. Time (samples) in an audio file equates to space (pixels) in an image. (Just a slight problem that there is only one time dimension in sound, but two spatial dimensions in an image.) If you could find a way to traverse pixels of an image in some way that avoided sharp corners, raster scanning, or any other sudden position jump or change in direction, then sure, you could just take R+G+B (or use your favorite weighting, 30%, 59%, 11%, whatever). Maybe something that scanned circles around an image. But then of course, the shape of your scan is entirely arbitrary and has immense effect on what kind of sound you get out of it.
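As a rough sketch of the circular scan idea, the code below walks one circle centred on the image, picks the nearest pixel at each step, and converts it to a sample using the 30% / 59% / 11% weighting. The function name, the nearest-pixel lookup and the choice of the largest circle that fits are all assumptions; as noted above, the scan shape is arbitrary.

```python
import math

def circular_scan(image, n_samples, turns=1.0):
    """Read samples along a circle centred on the image.

    image: rows of (r, g, b) tuples with components 0..255.
    Returns n_samples floats in roughly -1.0..1.0.
    """
    height = len(image)
    width = len(image[0])
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    radius = min(cx, cy)  # largest circle that stays inside the image
    out = []
    for n in range(n_samples):
        angle = 2 * math.pi * turns * n / n_samples
        # Nearest-pixel lookup; interpolation would give a smoother signal.
        x = int(round(cx + radius * math.cos(angle)))
        y = int(round(cy + radius * math.sin(angle)))
        r, g, b = image[y][x]
        grey = (0.30 * r + 0.59 * g + 0.11 * b) / 255.0
        out.append(2.0 * grey - 1.0)  # map 0..1 grey to -1..1 audio range
    return out
```

Because the path closes on itself, looping the result avoids the jump you would get at the end of each raster line, though nearest-pixel lookup still produces steps at pixel boundaries.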

You could even say R is right, G is left, and B is... um...

So yeah, whatever you do, you're mapping apples to oranges.
