Additive Synthesis

DSP, Plugin and Host development discussion.

Post

Hi,

I want to experiment with additive synthesis, so I quickly wrote some code that simply adds up a bunch of sine waves. Nothing else. The problem is that even this is way too slow. I implemented the sine oscillators by simply multiplying complex numbers in the obvious way.

Is this the right direction? Should I just spend time implementing this with SSE and hope it gets fast enough, or is there a better way?

Code: Select all

template<typename S>
class SinOscStack
{
    std::vector<S> stateSin;
    std::vector<S> stateCos;
    std::vector<S> stepSin;
    std::vector<S> stepCos;

    S state1, state2;
    S step1, step2;

    unsigned harmonics;
public:
    SinOscStack(S freq, unsigned harmonics, S samplerate)
        : stateSin(harmonics)
        , stateCos(harmonics)
        , stepSin(harmonics)
        , stepCos(harmonics)
        , harmonics(harmonics)
    {
        static const double pi = 3.1415926535897932384626433832795;

        for(unsigned i=0; i<harmonics; ++i)
        {
            stateSin[i] = 0;
            stateCos[i] = 1;
        
            stepCos[i] = std::cos((i+1) * 2 * pi * freq / samplerate); // per-sample phase step for harmonic i+1
            stepSin[i] = std::sin((i+1) * 2 * pi * freq / samplerate);
        }
    }

    void update(S * buffer1, S * buffer2, std::size_t length)
    {        
        for (std::size_t i=0; i<length; ++i)
        {
            S sumCos = 0;
            S sumSin = 0;

            for (unsigned h=0; h<harmonics; ++h)
            {
                const S oldCos = stateCos[h];
                const S oldSin = stateSin[h];
            
                stateCos[h] = oldCos * stepCos[h] - oldSin * stepSin[h];
                stateSin[h] = oldSin * stepCos[h] + oldCos * stepSin[h];

                sumCos += stateCos[h];
                sumSin += stateSin[h];
            }
            buffer1[i] = sumCos;
            buffer2[i] = sumSin;
        }
    }
};
Last edited by dspQuestions on Thu Jan 16, 2014 12:16 pm, edited 1 time in total.
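
For reference, a minimal usage sketch of the class above - the fundamental, harmonic count, sample rate and block size here are arbitrary placeholders, not from the post:

Code: Select all

#include <cstddef>
#include <vector>

void renderBlock()
{
    const std::size_t blockSize = 512;
    std::vector<float> outCos(blockSize), outSin(blockSize);

    // 64 harmonics of a 110 Hz fundamental at 44.1 kHz
    SinOscStack<float> osc(110.0f, 64, 44100.0f);

    // buffer1 receives the summed cosines, buffer2 the summed sines
    osc.update(outCos.data(), outSin.data(), blockSize);
}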

Post

I'd be interested in how to do this too - I always imagined you'd use an FFT for additive synthesis.

But if you're going down the old-school route, there is a faster method than phasors/complex numbers: you can use a Simple Harmonic Oscillator!

Code: Select all

// p and v are oscillator state kept between calls (e.g. class members)
float OscSin(float dts) // dts = dt*dt*4*pi*pi, dt = freq/samplerate
{
    p += v;       // integrate position
    v -= p * dts; // integrate velocity
    return p;
}
It doesn't look stable, but it is - as long as you don't try to FM it.
You can either set p to amplitude and v to 0, or v to amplitude and p to 0.

p = position
v = velocity

This will produce 6 dB per octave attenuation:

Code: Select all

v=amp;
p=0;
This will produce constant amplitude across frequencies:

Code: Select all

p=amp;
v=0;
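
Putting the pieces of this post together, one possible shape for the oscillator - the struct name, the set()/tick() split and the constants are my own, not from the post:

Code: Select all

// simple harmonic oscillator, state kept across samples
struct SHOsc
{
    float p = 0;   // position -> sin output
    float v = 0;   // velocity -> (scaled) cos output
    float dts = 0; // dt*dt*4*pi*pi, with dt = freq / samplerate

    void set(float dt, float amp)
    {
        dts = dt * dt * 4 * 3.14159265f * 3.14159265f;
        p = amp; // constant amplitude across frequencies
        v = 0;   // (or: v = amp, p = 0 for the 6 dB/oct roll-off variant)
    }

    float tick()
    {
        p += v;       // integrate position
        v -= p * dts; // integrate velocity
        return p;     // p is the sine output
    }
};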

Post

I think the way to go is FFT.

It's also possible to use an MP3-style filter bank (split the spectrum into 32 bands so that you only have to fill 1/32nd of the samples for each harmonic), but that might be complex to implement.
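
To make the FFT idea a bit more concrete: one common variant is to render the harmonic amplitudes into a single-cycle wavetable and play that back at the note's pitch. A rough sketch - it uses a naive inverse DFT purely for illustration, where a real implementation would use an FFT library:

Code: Select all

#include <cmath>
#include <cstddef>
#include <vector>

// Build one cycle of a waveform from harmonic amplitudes.
// amps[0] is the fundamental, amps[1] the 2nd harmonic, and so on.
std::vector<float> buildWavetable(const std::vector<float>& amps, std::size_t tableSize)
{
    const double pi = 3.1415926535897932384626433832795;
    std::vector<float> table(tableSize, 0.0f);

    for (std::size_t n = 0; n < tableSize; ++n)
    {
        double sum = 0;
        for (std::size_t h = 0; h < amps.size(); ++h)
            sum += amps[h] * std::sin(2 * pi * double(h + 1) * double(n) / double(tableSize));
        table[n] = float(sum);
    }
    return table; // play this back with an interpolating table oscillator
}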

Post

Have you tried it this way:

Code: Select all

[..]
    void update(S * buffer1, S * buffer2, std::size_t length)
    {   
        memset(buffer1, 0, sizeof(S)*length);
        memset(buffer2, 0, sizeof(S)*length);

        for (unsigned h=0; h<harmonics; ++h)
        {   
            for (std::size_t i=0; i<length; ++i)
            {
                const S oldCos = stateCos[h];
                const S oldSin = stateSin[h];
            
                stateCos[h] = oldCos * stepCos[h] - oldSin * stepSin[h];
                stateSin[h] = oldSin * stepCos[h] + oldCos * stepSin[h];
    
                buffer1[i] += stateCos[h];
                buffer2[i] += stateSin[h];
            }
        }
    }
};
The idea being: the output summing is as much memory access as the state-buffer updates, but the state values can now be cached in registers (so no change in the amount of read-modify-write memory access), the cos/sin step values can be cached in registers (which saves practically all the lookups), and memory reads are no longer on the critical dependency path. You can now rewrite the inner loop to use only local variables (though your compiler can probably do it too).

Also, there are now 2 separate memory streams (buffer1 + buffer2) in the inner loop rather than 4. I didn't check whether it's faster, but I'd be surprised if it wasn't.

Post

I take everything back. The problem was a simple typo (a minus instead of a plus; I fixed the original post). This made the numbers explode (they all became Inf), which drastically slowed everything down. Now it's decently fast.

@mystran: Your version runs slower than the original version. It takes about 130% of the time. But thanks anyway.

@shabby: I've just quickly tried it, but it seems like it very slowly diverges over time; maybe I did something wrong. I'll look into it at the weekend.

Post

dspQuestions wrote: @mystran: Your version runs slower than the original version. It takes about 130% of the time. But thanks anyway.
Is that for the "as posted" version, or did you try caching the inner-loop data into local variables? I was too lazy to edit it into the version I posted, but if the compiler aliasing-analysis thinks the output buffers might alias with the lookups or states, then one would expect it to run slower.

Post

dspQuestions wrote: @shabby: I've just quickly tried it, but it seems like it very slowly diverges over time; maybe I did something wrong. I'll look into it at the weekend.
The sin/cos rotation formula will drift as well. You need to normalize every once in a while (say once per block) in either case.
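
For the rotation version, that could be a member function added to the SinOscStack class from the first post - a minimal sketch (needs <cmath>), assuming it is called once per block:

Code: Select all

// call once per block: pull each (cos, sin) pair back onto the unit circle
void normalize()
{
    for (unsigned h = 0; h < harmonics; ++h)
    {
        const S mag = std::sqrt(stateCos[h] * stateCos[h] + stateSin[h] * stateSin[h]);
        if (mag > 0)
        {
            stateCos[h] /= mag;
            stateSin[h] /= mag;
        }
    }
}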

Post

mystran wrote: Is that for the "as posted" version, or did you try caching the inner-loop data into local variables? I was too lazy to edit it into the version I posted, but if the compiler aliasing-analysis thinks the output buffers might alias with the lookups or states, then one would expect it to run slower.
It looks like this:

Code: Select all

void update2(S * buffer1, S * buffer2, std::size_t length)
    {   
        std::memset(buffer1, 0, sizeof(S) * length);
        std::memset(buffer2, 0, sizeof(S) * length);

        for (unsigned h=0; h<harmonics; ++h)
        {
            S sCos = stateCos[h];
            S sSin = stateSin[h];
            for (std::size_t i=0; i<length; ++i)
            {
                const S oldCos = sCos;
                
                sCos = oldCos * stepCos[h] - sSin * stepSin[h];
                sSin = sSin * stepCos[h] + oldCos * stepSin[h];
    
                buffer1[i] += sCos;
                buffer2[i] += sSin;
            }
            stateCos[h] = sCos;
            stateSin[h] = sSin;
        }
    }
Using g++ with -march=native makes this version actually run a tad bit faster than my original version. Without -march the old version is faster.

Post

dspQuestions wrote: @shabby: I've just quickly tried it, but it seems like it very slowly diverges over time; maybe I did something wrong. I'll look into it at the weekend.

That's interesting - it shouldn't drift at all, as long as you don't change dt or FM it. If you do change dt it will go all VA on you ;). As pointed out by mystran, one reason to use an SHO over vec2s is that they don't disintegrate over time. Even with

Code: Select all

_control87(PC_24, MCW_PC); // set the FPU to 24-bit precision (MSVC, <float.h>)
set - running for a few mins - no drift :)


The other benefit is that you get a cos sinusoid (90 degrees out of phase) for free. I forgot to put this in above:

Code: Select all

P = sin component
V = cos component, scaled by 1/(dt*2*PI)
I'd be interested to hear/see your final result - never got around to coding up an additive synth - quite exciting stuff!

How many partials/harmonics are you using?

Post

shabby wrote: That's interesting - it shouldn't drift at all?
It was my fault again. It's working now and it's the fastest solution.
shabby wrote: How many partials/harmonics are you using?
At the moment 512, but for no particular reason. For most notes I'll need much less. I'll probably use something like ⌊20000/fundamental frequency⌋.
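
As a sketch, that count could be computed along these lines - the 20 kHz ceiling is from the post above, while the clamp to Nyquist and the function name are my additions:

Code: Select all

#include <algorithm>

// number of harmonics to render for a given note
unsigned harmonicsForNote(double fundamental, double samplerate)
{
    const double limit = std::min(20000.0, samplerate * 0.5); // stay below 20 kHz and Nyquist
    return (unsigned)(limit / fundamental);                   // floor(limit / f0)
}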

Post

dspQuestions wrote: It's working now and it's the fastest solution. ... At the moment 512, but for no particular reason.
Cool stuff - I'd love to hear it.

If you're interested in experimental stuff, you can really abuse SHOs.

I stumbled onto this by accident one night experimenting - gave me a bit of a shock. Great on high MIDI notes; should try and load this into a Speccy one day :P

Code: Select all

function TapeLoaderOsc(t,dt)
(
  odt=dt*dt*4*$pi*$pi;
  p+=v;
  v-=p*odt;
  (abs(v)<odt) ? (p+=1.333333;); // velocity near zero: kick the oscillator
  p;
);

For when SoundCloud isn't faffing about:

https://soundcloud.com/shabtronica/synthr-dtapeloader
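
For anyone who wants to try it in C++, a rough port of the snippet above - the struct wrapper is my own, the 1.333333 kick constant is taken straight from the original:

Code: Select all

#include <cmath>

struct TapeLoaderOsc
{
    float p = 0; // position (self-starts via the kick below)
    float v = 0; // velocity

    float tick(float dt) // dt = freq / samplerate
    {
        const float odt = dt * dt * 4 * 3.14159265f * 3.14159265f;
        p += v;
        v -= p * odt;
        if (std::fabs(v) < odt) // velocity near zero: kick the oscillator
            p += 1.333333f;
        return p;
    }
};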

Post

dspQuestions wrote: It looks like this:
Ok.. so my point about aliasing was:

Code: Select all

void update2(S * buffer1, S * buffer2, std::size_t length)
    {   
        // loop could be faster, I wrote this for clarity
        std::memset(buffer1, 0, sizeof(S) * length);
        std::memset(buffer2, 0, sizeof(S) * length);

        for (unsigned h=0; h<harmonics; ++h)
        {
            S sCos = stateCos[h];
            S sSin = stateSin[h];
            // some compilers might not do this
            // I don't have a recent gcc installed, can't check
            S tCos = stepCos[h];
            S tSin = stepSin[h];
            for (std::size_t i=0; i<length; ++i)
            {
                const S oldCos = sCos;
                
                sCos = oldCos * tCos + sSin * tSin;
                sSin = sSin * tCos - oldCos * tSin;
    
                buffer1[i] += sCos;
                buffer2[i] += sSin;
            }
            stateCos[h] = sCos;
            stateSin[h] = sSin;
        }
    }
Some recent enough compilers might figure that out, and might even do the loop swapping, but in general there's no harm in making things explicit.

I'd imagine -O3 -ffast-math -msse2 -mfpmath=sse should be enough.

That said, in terms of speed or stability it's unlikely to beat the formula that shabby suggested. Sadly, that formula doesn't give proper 90 degrees phase at higher frequencies (that's the trade-off, though it's fine at low frequencies). If you don't need that, then there's little point in using rotations.

Post

mystran wrote: That said, in terms of speed or stability it's unlikely to beat the formula that shabby suggested. Sadly, that formula doesn't give proper 90 degrees phase at higher frequencies (that's the trade-off, though it's fine at low frequencies). If you don't need that, then there's little point in using rotations.
Really? The phase messes up at higher dt's - I didn't notice that :) I'll have a look on the weekend and see what's going on with that; at what point does it go shabby?

Post

shabby wrote: That's interesting - it shouldn't drift at all - as long as you don't change dt or FM it.
Of course it will. This is the same issue as using cumulative matrix multiplications - (in fact that is what it is) - it simply doesn't work without normalization. Using double or long double will give better results than float, but not much better. In many cases using approximations and recomputing the matrix from scratch is actually much faster than a matrix multiply.

Unfortunately that isn't true in this case but fortunately you can use far more complex approximations to do normalization without introducing distortion - so long as normalization is frequent enough. There will be a trade-off where expense is exponentially higher with lower harmonic content and the optimum solution will actually be a quicker approximation than sin().

You just need to set a threshold, like -100 dB.
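
One cheap way to do that kind of normalization, if it runs often enough that the squared magnitude stays close to 1, is a first-order approximation of 1/sqrt(x) instead of a full sqrt and divide - a sketch of my own, not from the post:

Code: Select all

// rescale one (cos, sin) state pair; valid while c*c + s*s stays near 1
inline void renormalize(float& c, float& s)
{
    const float m = c * c + s * s;
    const float scale = 1.5f - 0.5f * m; // first-order approximation of 1/sqrt(m) around m = 1
    c *= scale;
    s *= scale;
}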

Post

aciddose wrote: Of course it will. This is the same issue as using cumulative matrix multiplications - (in fact that is what it is) - it simply doesn't work without normalization.

The complex rotation stuff disintegrates because of the XY multiplication transfer - it loses energy every time because of FPU precision errors. SHOs preserve energy 100% and do not disintegrate or drift in frequency whatsoever - try it: set one up and run it for a few minutes, even in single precision with the FPU precision lowered - it won't disintegrate or drift. The only time it will start to mess up is if you change dt at a non-peak crossing. It is effectively a spring with a non-moving target, whereas the complex version is an axis-transfer system.
