KVR Audio

SunkLo · Post by **SunkLo** » Wed Apr 09, 2014 1:00 am

Hello, I wrote this implementation today and figured I'd post it. The issue with the usual circular buffer implementations is, you either have to use a modulus operator to get the Nth element out of it (return (mStart + n) % mLength;), or you have to keep all the elements in order, which requires copying each element to the adjacent index whenever you add a new element. If you're using the buffer as the delay line in an FIR or something similar, you're going to need to perform one write and N reads per sample. With the two previous options, that means doing N modulus operations or N memcpy's.

With the following implementation, there's no bounds checking on reads, and writing involves N pointer decrements, two additions, and a single bound check. I tested it against Boost's circular buffer implementation in a typical usage case. The test involved pushing a double onto the buffer, and then reading each value into a temp variable, similar to the process of adding to a delay line and then reading the past samples for filtering. I tested 1000, 10000, and 100000 iterations with buffer lengths of 128, 512, 1024, and 3072. My implementation was 3-4 times faster in all cases.

It's templated of course, and works with dynamic objects as well. There's a flag in the constructor that sets whether it should call delete on its contents when the buffer is destructed. If you're using dynamically allocated objects, you should call Dyn_Push() instead of Push() to prevent memory leaks. Apart from that, it's pretty simple. The zero indexed element is always the most recent. I didn't add a setter method since that's not really useful in the context of a delay line buffer, but it would be trivial to add.

Feel free to use and modify this to your heart's content. Hopefully it's useful in speeding up your algorithms. Any corrections are welcome. Cheers.

Here's the code:

Code: Select all

#pragma once
template <class T>
class CircularBuffer
{
public:
	CircularBuffer(int size, bool containsDynamicObjects = false);
	~CircularBuffer();
	void Push(T element);
	void Dyn_Push(T element);
	inline const T& operator[](int index);
	inline int Size() { return mSize; };

private:
	T* mBuffer;
	T** mPointerBuffer;
	int mWrapIndex;
	int mSize;
	bool mIsDynamic;
};

// Template Implementation

template <class T>
CircularBuffer<T>::CircularBuffer(int size, bool containsDynamicObjects)
:mBuffer(new T[size]),
mPointerBuffer(new T*[size]),
mWrapIndex(0),
mSize(size),
mIsDynamic(containsDynamicObjects)
{
	for (int i = 0; i < size; ++i)
	{
		mBuffer[i] = static_cast<T>(0.0);
		mPointerBuffer[i] = &mBuffer[i];
	}
}

template <class T>
CircularBuffer<T>::~CircularBuffer()
{
	if (mIsDynamic)
	{
		for (int i = 0; i < mSize; ++i) delete mBuffer[i];
	}
	delete[] mBuffer;
	delete[] mPointerBuffer;
}

template <class T>
const T& CircularBuffer<T>::operator[](int index)
{
	if ((index < 0) || (index >= mSize)) return T();
	return *mPointerBuffer[index];
}

// Leaks memory if used with dynamically allocated objects. Use Dyn_Push instead.
template <class T>
void CircularBuffer<T>::Push(T element)
{
	for (int i = 0; i < mSize; ++i)
	{
		--mPointerBuffer[i];
	}
	mPointerBuffer[mWrapIndex] += mSize;
	++mWrapIndex;
	if (mWrapIndex >= mSize) mWrapIndex -= mSize;

	*mPointerBuffer[0] = element;
}

// For use with dynamically allocated objects. Calls delete on objects pushed off the buffer.
template <class T>
void CircularBuffer<T>::Dyn_Push(T element)
{
	for (int i = 0; i < mSize; ++i)
	{
		--mPointerBuffer[i];
	}
	mPointerBuffer[mWrapIndex] += mSize;
	++mWrapIndex;
	if (mWrapIndex >= mSize) mWrapIndex -= mSize;

	delete mPointerBuffer[0];
	*mPointerBuffer[0] = element;		
}

aciddose · Post by **aciddose** » Wed Apr 09, 2014 1:31 am

This is actually under patent set to expire soon.

There are actually a whole slew of delay-line / buffer related patents that I was surprised to find filed from between 2000 and 2006 (when I did some research.)

Just about every obvious optimization is patented. I found it really funny that I'd invented a large number of them myself without actually putting much effort into it.

For example if you're using an interpolation that requires access +/- 8 samples, you can add a special check for when you are within that range and any updates to the first 8 samples are written twice, once to their actual position and once to the end of the buffer to allow the interpolation without wrapping. Likewise for when writing the last 8, you also copy to the -1 ... -8 positions.

SunkLo · Post by **SunkLo** » Wed Apr 09, 2014 1:34 am

Under patent?!

aciddose · Post by **aciddose** » Wed Apr 09, 2014 1:39 am

I edited the post to add more info since I realize it was a bit sparse at first. Yes, if you do some research on patents with terms like "delay line optimization" you'll be horrified.

aciddose · Post by **aciddose** » Wed Apr 09, 2014 1:52 am

Here is what I can remember about the patented method:

You can avoid checks for wrapping if you know the range of the offset you'll read/write from the current buffer index. By writing a copy of the buffer immediately after the end you can avoid forward wraps, and immediately before you can avoid backward wraps.

In addition to that, if the maximum read/write offset is within the range of the current index you can calculate how many samples can be generated before a wrap will occur, generate those, check/wrap your pointer in one step and recalculate before you continue generating samples without the check.

These two optimizations were patented in two patents, if I remember correctly. Both owned by the same company (was it Yamaha?) and one referring back to the other.

SunkLo · Post by **SunkLo** » Wed Apr 09, 2014 1:54 am

Greasy. Safe to say I'll still be using my own code. Don't know how they'd try to enforce a patent on something as simple as an array of pointers being used to abstract buffer addresses.

Quick google search pulled up something that sounds familiar from Texas Instruments. Might be what you're thinking of?

aciddose · Post by **aciddose** » Wed Apr 09, 2014 2:00 am

I'm not sure, I read about ten different patents and gave up in disgust, it was years ago.

TI does sound familiar though. Like I said in my last post, Yamaha seems like another name I saw come up although I'm not 100% sure.

A lot of the patents reference each other as prior art so if you start to find them you can find others that way.

harikaram · Post by **harikaram** » Wed Apr 09, 2014 7:06 am

Have you seen this one? https://github.com/michaeltyson/TPCircularBuffer

It's very clever though it might be mac/unix only. No modulus ever required. It's uses the kernels virtual memory range mapping capability to create a mirror of the real memory arrange located sequential after itself. Reads crossing the tail boundary continue straight on to the virtual head.

Are the patents approved? I was told, at least in the uk, that you can't patent algorithms or code and that hardware is always required to be part of the technology. Maybe the US is more open in this respect?

aciddose · Post by **aciddose** » Wed Apr 09, 2014 6:02 pm

Yes, software patents apply in the US but have significant limits in the EU.

http://en.wikipedia.org/wiki/Software_p ... patent_law

They have a brief history on Wikipedia which covers most points, but the article doesn't have links to very many of the notable cases.

earlevel · Post by **earlevel** » Thu Apr 10, 2014 10:38 pm

aciddose wrote:Here is what I can remember about the patented method:

You can avoid checks for wrapping if you know the range of the offset you'll read/write from the current buffer index. By writing a copy of the buffer immediately after the end you can avoid forward wraps, and immediately before you can avoid backward wraps.

In addition to that, if the maximum read/write offset is within the range of the current index you can calculate how many samples can be generated before a wrap will occur, generate those, check/wrap your pointer in one step and recalculate before you continue generating samples without the check.

These two optimizations were patented in two patents, if I remember correctly. Both owned by the same company (was it Yamaha?) and one referring back to the other.

Seriously? I used both of these techniques in the '80s, and I'm sure that plenty of people before me figured them out too, of course. For instance, When coding coding MIDI Manager support for HyperMIDI: MM had variable packet sizes, up to 255 bits for (partial) syses messages. By making my MIDI buffers 254 bits longer than my size spec, I didn't have to check for buffer wraps in the middle of any packet.

Anyway, there are a lot of these sort of frivolous patents. It's pretty unlikely that a company like Yamaha would actually try to enforce them. more complicated frivolous patens, maybe, because it's hard to go against that much money. but the super-simple ones are too easy to defend without a high-priced legal team.

lorcan · Post by **lorcan** » Fri Apr 11, 2014 3:49 pm

Thanks for the code.
I'm don't quite understand why this is faster than a buffer using modulo to push / get an element though.

Looking at the code, pushing an element involves decrementing <size> pointers indexes, plus checking whether the wrapped index overflows. From what I understand the purpose of this is to make indexed accessed trivial ?

On a side note, you could probably gain a little performance by templating your container on the size and replace the dynamic flag with a helper policy template (ie using std::is_pointer)

SunkLo · Post by **SunkLo** » Fri Apr 11, 2014 4:53 pm

Yes, access is trivial and pushing an element only requires decrementing N pointers and doing a single overflow check for the counter. Versus N element copies if you're advancing the elements through the buffer, or alternately an overflow/underflow check for every read if you're leaving them in arbitrary order and using a pointer to indicate where the 0 index "write head" is.

lorcan · Post by **lorcan** » Fri Apr 11, 2014 5:55 pm

I imagine your implementation is therefore best employed in a write seldom / read often and many scenario.
In my implementation i simply use an index for the read/write head which gets incremented at each push, with an overflow check. Naturally reading back needs the under/overflow check also, but I haven't found that to be a problem in my case.
Also when reading several locations, you can sometimes do the overflow check only once, if the indexes you're reading are contiguous.

I just profiled the cost of a push in cycles and these are the results I get,
in cycles per sample push for my circular buffer, yours (alt), and a simple array for reference:

Code: Select all

[size]
[   32] push cycles/smp circular: 3.74, alt: 23.33, linear: 0.81
[   33] push cycles/smp circular: 4.60, alt: 25.46, linear: 0.79
[   64] push cycles/smp circular: 3.84, alt: 32.14, linear: 0.79
[   65] push cycles/smp circular: 3.85, alt: 43.38, linear: 0.79
[  128] push cycles/smp circular: 3.64, alt: 83.29, linear: 0.79
[  129] push cycles/smp circular: 3.62, alt: 97.20, linear: 0.79
[  256] push cycles/smp circular: 3.62, alt: 132.01, linear: 0.80 
[  257] push cycles/smp circular: 3.61, alt: 152.18, linear: 0.80 
[  512] push cycles/smp circular: 3.54, alt: 231.15, linear: 0.80 
[  513] push cycles/smp circular: 3.53, alt: 281.43, linear: 0.80 
[ 1024] push cycles/smp circular: 3.91, alt: 436.43, linear: 0.81 
[ 1025] push cycles/smp circular: 3.64, alt: 496.53, linear: 0.81 
[ 2048] push cycles/smp circular: 3.90, alt: 848.74, linear: 0.84 
[ 2049] push cycles/smp circular: 3.76, alt: 996.37, linear: 0.83 
[ 4096] push cycles/smp circular: 3.92, alt: 1681.97, linear: 0.85
[ 4097] push cycles/smp circular: 3.65, alt: 1890.12, linear: 0.88
[ 8192] push cycles/smp circular: 4.20, alt: 3385.06, linear: 0.83
[ 8193] push cycles/smp circular: 3.66, alt: 3727.60, linear: 0.85
[16384] push cycles/smp circular: 4.21, alt: 7631.65, linear: 0.87
[16385] push cycles/smp circular: 3.67, alt: 8517.01, linear: 0.93

SunkLo · Post by **SunkLo** » Fri Apr 11, 2014 6:16 pm

How do they fare for reads? I really just wanted to optimize reads without having to make the sacrifice of copying elements during writes to keep everything in order. In a filtering scenario, you'll write once and read N times per sample, so depending on the buffer length, the read speed is N times more important.

You should do a profile of a standard filtering scenario, pushing once and reading N times per sample. I think as N grows, my approach becomes more valuable. It's probably not the best approach with shorter buffer sizes.

lorcan · Post by **lorcan** » Fri Apr 11, 2014 6:43 pm

Just benchmarked the reads with 10 reads/element, and it looks like performance is equivalent, ie 3-4 cycles per read.

If I remove the index bounds checking in your read operator, a read only costs 1.3 -2 cycles. Btw, you're returning a reference to a temporary there anyway, so it won't work as intended.

When you talk about standard filtering scenario, what type of filter are you referring to, FIR/IIR, decimation .. ? For doing fast FIR, a specific implementation would most probably be faster anyway, especially with large buffer sizes (partitioned convolution tricks etc).
The case where caching the indexes as you do would benefit IMHO would be when one needs totally random access patterns (reverb ?).

Efficient Circular Buffer Implementation for Delay Line, Etc.