CICDecimator - do I really need 64 bit accumulators?

DSP, Plugin and Host development discussion.

Post

hi,

i'm testing a CICDecimator i found online, which seems to "sound" pretty cool as a decimator in an oversampling routine.

here's the complete code:

Code: Select all

#include <cassert>   // assert
#include <cmath>     // pow
#include <cstdint>   // int64_t

struct CICDecimator {
    // fixed-point scale: input samples are promoted to 32.32 fixed point
    static constexpr int64_t scale = ((int64_t)1) << 32;
    int mStages;
    int mFactor = 0;          // initialised so the comparison in the constructor is well defined
    float mGainCorrection;
    int64_t *pIntegrators;    // mStages + 1 accumulators (index 0 holds the scaled input)
    int64_t *pCombs;          // one delayed value per comb stage

    CICDecimator(int stages = 4, int factor = 8) {
        assert(stages > 0);
        assert(factor > 0);

        mStages = stages;
        pIntegrators = new int64_t[mStages + 1]{};
        pCombs = new int64_t[mStages]{};

        if (mFactor != factor) {
            mFactor = factor;
            // the CIC gain is factor^stages, so divide it back out
            mGainCorrection = 1.0f / (float)(pow(mFactor, mStages));
        }
    }
    ~CICDecimator() {
        delete[] pIntegrators;
        delete[] pCombs;
    }

    // consumes mFactor oversampled samples, returns one decimated sample
    float process(const float *buffer) {
        // integrators run at the oversampled rate
        for (int i = 0; i < mFactor; i++) {
            pIntegrators[0] = buffer[i] * scale;

            for (int j = 1; j <= mStages; j++) {
                pIntegrators[j] += pIntegrators[j - 1];
            }
        }

        // combs run once per output sample, at the decimated rate
        int64_t s = pIntegrators[mStages];
        for (int i = 0; i < mStages; i++) {
            int64_t t = s;
            s -= pCombs[i];
            pCombs[i] = t;
        }

        return mGainCorrection * (s / (float)scale);
    }
};
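for reference, this is roughly how i'm driving it (a minimal sketch; the block layout is just my assumption of how process() is meant to be fed, i.e. mFactor consecutive oversampled samples per call):

Code: Select all

#include <vector>

// minimal driver sketch, assuming the CICDecimator struct above is in scope:
// each call to process() reads mFactor (8 here) consecutive oversampled
// samples and produces one decimated output sample.
int main() {
    CICDecimator decimator(4, 8);                   // 4 stages, 8x oversampling

    std::vector<float> oversampled(8 * 512, 0.0f);  // 8x oversampled input block
    std::vector<float> output;
    output.reserve(512);

    for (size_t i = 0; i < oversampled.size(); i += 8) {
        output.push_back(decimator.process(&oversampled[i]));
    }
    return 0;
}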
my question is: does it really need that int64_t conversion?

i think it converts float * scale to int64_t (and then the comb's s / scale back to float) to improve precision when summing/subtracting the combs/integrators. is that correct?

but why does it need int64_t and not (for example) int32_t? would precision/performance suffer? something like:

Code: Select all

static constexpr int32_t scale = ((int32_t)1) << 16;
...
int32_t *pIntegrators;
int32_t *pCombs;
...
can you help me to understand this?

thanks to all

Post

The precision won't suffer if you use int32_t, but the headroom will. You would have to turn "scale" down to a lower value so your float signal never overflows the int. Ok, then the precision suffers too if you turn it down too far. Maybe a good compromise is to use int32_t for the combs and int64_t for the integrators.

It's using ints so the integrator works perfectly (to avoid losing accuracy when subtracting small floats from a big float value) but for audio applications it could be ok to do it all with floats and use a leaky integrator.
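
Something like this per stage, roughly (just a sketch, the leak value is made up, and it's not a drop-in replacement for the integer integrators above):

Code: Select all

// one float integrator stage with a small leak (0.9995f is arbitrary);
// the leak keeps the accumulator bounded, so the exact cancellation you
// get from integer arithmetic isn't needed.
struct LeakyIntegrator {
    float state = 0.0f;
    float leak = 0.9995f;

    float process(float in) {
        state = in + leak * state;
        return state;
    }
};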

For others here is some good background info: https://www.dsprelated.com/showarticle/1337.php

Post

mda wrote: Fri Apr 16, 2021 7:39 am It's using ints so the integrator works perfectly (to avoid losing accuracy when subtracting small floats from a big float value) but for audio applications it could be ok to do it all with floats and use a leaky integrator.
first question: in the code, it adds/subtracts values that are "integer". why use "int64_t" instead of (for example) double?

it could declare pIntegrators and pCombs as double, do floor(buffer * scale), and then every sum/sub would always be an "integer" value, so there would be no loss of precision (even with double) while computing them. correct?

i.e. 23523542.0 + 121.0 = 23523542 + 121

of course it's using int64_t because it can contain integers above 2^53 (and below -2^53), but i believe that's the main reason, not that "int" sums/subtracts integers any better than double/float...
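
(quick standalone check of the 2^53 point, nothing to do with the decimator itself:)

Code: Select all

#include <cstdio>

// below 2^53 a double holds every integer exactly; at 2^53, adding 1.0 is lost
int main() {
    printf("%.1f\n", 23523542.0 + 121.0);    // 23523663.0, exact

    double big = 9007199254740992.0;         // 2^53
    printf("%s\n", (big + 1.0 == big) ? "2^53 + 1.0 rounds back to 2^53"
                                      : "still exact");
    return 0;
}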

Post

probably i'm deaf, or not able to test the edge/worst case, but switching to this (i.e. int64_t replaced with int32_t, and << 16 instead of << 32):

Code: Select all

struct CICDecimator {
    static constexpr int32_t scale = ((int32_t)1) << 16;
    int mStages;
    int mFactor = 0;
    float mGainCorrection;
    int32_t *pIntegrators;
    int32_t *pCombs;

    CICDecimator(int stages = 4, int factor = 8) {
        assert(stages > 0);
        assert(factor > 0);

        mStages = stages;
        pIntegrators = new int32_t[mStages + 1]{};
        pCombs = new int32_t[mStages]{};

        if (mFactor != factor) {
            mFactor = factor;
            mGainCorrection = 1.0f / (float)(pow(mFactor, mStages));
        }
    }
    ~CICDecimator() {
        delete[] pIntegrators;
        delete[] pCombs;
    }

    float process(const float *buffer) {
        for (int i = 0; i < mFactor; i++) {
            pIntegrators[0] = buffer[i] * scale;

            for (int j = 1; j <= mStages; j++) {
                pIntegrators[j] += pIntegrators[j - 1];
            }
        }

        int32_t s = pIntegrators[mStages];
        for (int i = 0; i < mStages; i++) {
            int32_t t = s;
            s -= pCombs[i];
            pCombs[i] = t;
        }

        return mGainCorrection * (s / (float)scale);
    }
};
it sounds pretty much the same. when should i notice the difference due to the bigger headroom?

Post

markzzz wrote: Fri Apr 16, 2021 5:38 pm it sounds pretty much the same. when should i notice the difference due to the bigger headroom?
If you add together N values, then the worst-case growth in magnitude is log2 N bits, so if you had say a window of 256 samples that you integrate over, then you would need an additional 8 bits to avoid wrap-around if the input is nothing but maximum-amplitude DC. If you then cascade 4 such integrators you'd need 4*8 bits of headroom in the worst case. If you then wanted 16 bits of precision, you'd end up with 48 bits minimum. For shorter windows you need less, for longer windows more, but I hope you get the idea?

So what you normally want to do is take the actual window lengths you are using, then compute the worst-case growth in order to clip the input at a level that doesn't allow it to wrap-around, no matter how ugly the input signal happens to be.

edit: On the other hand, since this kind of filter is typically not great for critical sampling anyway, you might be able to get away with less precision if you work with oversampled signals and shape the quantization noise above the final Nyquist.
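
To put numbers on that rule of thumb (just a back-of-the-envelope sketch, the helper name is made up and the rounding is the pessimistic one):

Code: Select all

#include <cmath>
#include <cstdio>

// bits needed ~= stages * log2(window) of worst-case growth
// on top of the input precision
int requiredBits(int window, int stages, int inputBits) {
    return stages * (int)std::ceil(std::log2((double)window)) + inputBits;
}

int main() {
    printf("%d\n", requiredBits(8, 4, 16));   // this thread's case: 4*3 + 16 = 28 -> fits in int32
    printf("%d\n", requiredBits(256, 4, 16)); // the 256-sample example: 4*8 + 16 = 48 -> needs int64
    return 0;
}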

Post

mystran wrote: Fri Apr 16, 2021 7:53 pm
markzzz wrote: Fri Apr 16, 2021 5:38 pm it sounds pretty much the same. when should i notice the difference due to the bigger headroom?
If you add together N values, then the worst-case growth in magnitude is log2 N bits, so if you had say a window of 256 samples that you integrate over, then you would need an additional 8 bits to avoid wrap-around if the input is nothing but maximum-amplitude DC. If you then cascade 4 such integrators you'd need 4*8 bits of headroom in the worst case. If you then wanted 16 bits of precision, you'd end up with 48 bits minimum. For shorter windows you need less, for longer windows more, but I hope you get the idea?

So what you normally want to do is take the actual window lengths you are using, then compute the worst-case growth in order to clip the input at a level that doesn't allow it to wrap-around, no matter how ugly the input signal happens to be.

edit: On the other hand, since this kind of filter is typically not great for critical sampling anyway, you might be able to get away with less precision if you work with oversampled signals and shape the quantization noise above the final Nyquist.
not sure i totally follow you :)

let's look at this specific case:
given oversample = factor = N = 8: 3 (from log2 N) x 4 (num. stages) + 16 (precision) = 28 bits, which is enough, correct? is that why it also sounds the same with int32_t?

Post

probably i was wrong in my last understanding of mystran's reply, because if i keep "integer" values but use double to store/add/sub them, the same signal becomes very weird:

Code: Select all

struct CICDecimator {
    static constexpr int32_t scale = ((int32_t)1) << 16;
    int mStages;
    int mFactor = 0;
    float mGainCorrection;
    double *pIntegrators;
    double *pCombs;

    CICDecimator(int stages = 4, int factor = 8) {
        assert(stages > 0);
        assert(factor > 0);

        mStages = stages;
        pIntegrators = new double[mStages + 1]{};
        pCombs = new double[mStages]{};

        if (mFactor != factor) {
            mFactor = factor;
            mGainCorrection = 1.0f / (float)(pow(mFactor, mStages));
        }
    }
    ~CICDecimator() {
        delete[] pIntegrators;
        delete[] pCombs;
    }

    float process(const float *buffer) {
        for (int i = 0; i < mFactor; i++) {
            pIntegrators[0] = std::floor(buffer[i] * scale);

            for (int j = 1; j <= mStages; j++) {
                pIntegrators[j] += pIntegrators[j - 1];
            }
        }

        double s = pIntegrators[mStages];
        for (int i = 0; i < mStages; i++) {
            double t = s;
            s -= pCombs[i];
            pCombs[i] = t;
        }

        return mGainCorrection * (s / (float)scale);
    }
};
and it doesn't make sense, since doubles should handle integers up to 2^53 (which is way beyond int32 range) without any loss of precision :O

mmm...

Post

markzzz wrote: Sun Apr 18, 2021 6:53 am probably i was wrong in my last understanding of mystran's reply, because if i keep "integer" values but use double to store/add/sub them, the same signal becomes very weird:

Code: Select all

struct CICDecimator {
    static constexpr int32_t scale = ((int32_t)1) << 16;
    int mStages;
    int mFactor = 0;
    float mGainCorrection;
    double *pIntegrators;
    double *pCombs;

    CICDecimator(int stages = 4, int factor = 8) {
        assert(stages > 0);
        assert(factor > 0);

        mStages = stages;
        pIntegrators = new double[mStages + 1]{};
        pCombs = new double[mStages]{};

        if (mFactor != factor) {
            mFactor = factor;
            mGainCorrection = 1.0f / (float)(pow(mFactor, mStages));
        }
    }
    ~CICDecimator() {
        delete[] pIntegrators;
        delete[] pCombs;
    }

    float process(const float *buffer) {
        for (int i = 0; i < mFactor; i++) {
            pIntegrators[0] = std::floor(buffer[i] * scale);

            for (int j = 1; j <= mStages; j++) {
                pIntegrators[j] += pIntegrators[j - 1];
            }
        }

        double s = pIntegrators[mStages];
        for (int i = 0; i < mStages; i++) {
            double t = s;
            s -= pCombs[i];
            pCombs[i] = t;
        }

        return mGainCorrection * (s / (float)scale);
    }
};
and it doesn't make sense, since doubles should handle integers up to 2^53 (which is way beyond int32 range) without any loss of precision :O

mmm...
ok i think i've got why it goes weird with "double"...

it's because the integrators keep increasing until the saturation of the type used: 2^31 in the case of int32, 2^63 for int64, much more in the case of double (but then they can express all integer values correctly without loss of precision).

correct?

Post

markzzz wrote: Mon Apr 19, 2021 8:51 am it's because the integrators keep increasing until the saturation of the type used: 2^31 in the case of int32, 2^63 for int64, much more in the case of double (but then they can express all integer values correctly without loss of precision).
Integers in C or C++ don't saturate. The unsigned types wrap around and the signed types become undefined behaviour.

Post

mystran wrote: Mon Apr 19, 2021 4:16 pm
markzzz wrote: Mon Apr 19, 2021 8:51 am it's because the integrators keep increasing until the saturation of the type used: 2^31 in the case of int32, 2^63 for int64, much more in the case of double (but then they can express all integer values correctly without loss of precision).
Integers in C or C++ don't saturate. The unsigned types wrap around and the signed types become undefined behaviour.
so i don't get why, refactoring the code to use double (still summing integer values), the values accumulated by the integrators become higher than the ones in the int64 version :O

it can't be saturation via undefined behaviour (since the int64s used are signed), so they should both stop increasing at the same "integer" value.

or what's going on, mystran?

Post

markzzz wrote: Mon Apr 19, 2021 5:12 pm it can't be saturation via undefined behaviour (since the int64s used are signed), so they should both stop increasing at the same "integer" value.
In practice the "undefined behaviour" usually manifests as wrap-around even for signed integers. I don't think you should be hitting that case with int64, but you could put some asserts in there and check that the values stay in reasonable range.

My brain doesn't work sufficiently well right now for me to try to figure out if it's the case in this particular implementation, but in some cases internal wrap-arounds can actually cancel in a way that they don't actually matter.
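
As a tiny illustration of that cancellation idea (unsigned here, because unsigned wrap-around is well defined in C++; this is only the principle, not a claim about the code in this thread):

Code: Select all

#include <cstdint>
#include <cstdio>

int main() {
    uint32_t acc = 0xFFFFFF00u;   // accumulator already near the top of its range
    uint32_t before = acc;

    acc += 500u;                  // wraps past 2^32

    // the comb-style difference is still exactly the 500 that was added,
    // because both values live in the same modulo-2^32 arithmetic
    printf("%u\n", acc - before); // prints 500
    return 0;
}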

Post

mystran wrote: Mon Apr 19, 2021 5:24 pm I don't think you should be hitting that case with int64, but you could put some asserts in there and check that the values stay in reasonable range.
it seems it overflows a lot.
here's an assert i put in the original code (int64_t, << 32):

Code: Select all

for (int i = 0; i < mFactor; i++) {
	pIntegrators[0] = buffer[i] * scale;

	for (int j = 1; j <= mStages; j++) {
		if (pIntegrators[j - 1] > 0 && std::numeric_limits<int64_t>::max() - pIntegrators[j - 1] < pIntegrators[j]) {
			DEBUG("limits: %lld | %lld | %lld", pIntegrators[j], pIntegrators[j - 1], pIntegrators[j] + pIntegrators[j - 1]);
			assert(false);
		}

		pIntegrators[j] += pIntegrators[j - 1];
	}
}
the assert fires almost immediately with this result:

limits: 9208281067789095048 | 74325436013751358 | -9164137569906705210

so it overflows and wraps around to -9164137569906705210 when adding 74325436013751358 to 9208281067789095048 :O
is this intended by the algorithm? i mean: is it correct that it wraps around so easily? (because it does it constantly).
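
(as a sanity check, the printed numbers do line up exactly with mod-2^64 wrap-around; reproduced here with unsigned math so the wrap itself is well defined, then reinterpreted as signed:)

Code: Select all

#include <cstdint>
#include <cstdio>

int main() {
    uint64_t a = 9208281067789095048ull;
    uint64_t b = 74325436013751358ull;
    uint64_t sum = a + b;                       // 9282606503802846406, above INT64_MAX
    printf("%lld\n", (long long)(int64_t)sum);  // -9164137569906705210 on two's-complement targets
    return 0;
}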

Post

markzzz wrote: Tue Apr 20, 2021 7:05 am is this intended by the algorithm? i mean: is it correct that it wraps around so easily? (because it does it constantly).
Well, I don't really properly understand the original code posted. I do understand how CIC filters work in principle and I'm aware that there are some tricks with regards to wrap-around when it comes to fixed-point filtering in general, but if some such trick is used here, I'm not familiar with it, so can't help you there.

Post

mystran wrote: Tue Apr 20, 2021 7:37 am
markzzz wrote: Tue Apr 20, 2021 7:05 am is this intended by the algorithm? i mean: is it correct that it wraps around so easily? (because it does it constantly).
Well, I don't really properly understand the original code posted. I do understand how CIC filters work in principle and I'm aware that there are some tricks with regards to wrap-around when it comes to fixed-point filtering in general, but if some such trick is used here, I'm not familiar with it, so can't help you there.
i see. no problem :)
thanks for the time you spent on this, much appreciated.

i hope the "log2(N) x num. stages + precision" rule you gave me holds up, because it seems to work well with int32_t and << 16...
