KVR Audio

kuniklo · Post by **kuniklo** » Sat Jun 02, 2012 3:44 am

Anybody who's done much DSP programming knows that one of the toughest things about it is the hard realtime requirement of the DSP render callback. This takes a lot of common techniques off the table: locks, malloc, expensive dynamic dispatch etc. This also takes a lot of attractive dynamically allocated data structures off the table like trees, hash tables etc.

But I've been thinking that you can use an STL std::vector for a lot of things provided you use it in a specific way and think ahead. Like so:

1. use a POD (plain old data) type

2. on creation, preallocate space by calling vector.reserve() with a big enough number that you'll never exceed it

3. use std::lower_bound to find insertion points to keep the vector in-order

This should give you log2(n) lookup times, like a tree. Of course, insertion and deletion is going to be o(n) but for a lot of the common uses of this in typical code n will be relatively small, and, more importantly, since you reserve() space up front you shouldn't trigger any allocations. And even if you mess up and exceed the reserved capacity you'll suffer an allocation and a possible glitch but not the crash you'd get if you write past the end of a fixed-size array. Plus you can use all the nice STL algorithms on the data.

This seems a lot simpler than the obvious alternatives like intrusive linked lists that require cycling dynamically allocated objects in and out of the render thread. Has anybody tried something like this? It seems to be working well in a simple engine for me so far.

stratum · Post by **stratum** » Sat Jun 02, 2012 5:00 am

You can use anything whose performance is predictable. Whether it makes sense to do so or is the best option, is another story. Things like mutexes, memory allocation, etc are unpredictable. Dynamic dispatch is not unpredictable, it's just a bit inefficient. Similarly you can use std::vector as long as operations performed on it have predictable running times.

kuniklo · Post by **kuniklo** » Sat Jun 02, 2012 5:10 am

Of course C++ virtual function style dynamic dispatch is no problem. I mostly work in iOS so I was thinking more of Objective-C's message passing, which isn't safe to use on a real-time thread.

And yeah, std::vector should be safe as long as you don't cause it to allocate. I was using a boost::intrusive container instead, but that was a lot more hassle.

stratum · Post by **stratum** » Sat Jun 02, 2012 5:33 am

I havent done any objective-c programming other than just running a typical "hello world" out of curiosity, so I didn't have that in mind.

mystran · Post by **mystran** » Sat Jun 02, 2012 3:53 pm

kuniklo wrote:Anybody who's done much DSP programming knows that one of the toughest things about it is the hard realtime requirement of the DSP render callback.

I kinda have to disagree. In most cases the only tough part is interfacing with the rest of the world. (my world-view could be skewed though)

This takes a lot of common techniques off the table: locks, malloc, expensive dynamic dispatch etc. This also takes a lot of attractive dynamically allocated data structures off the table like trees, hash tables etc.

But doesn't this mostly just simplify things? Anyway...

You can use locks just fine, as long as you don't make real-time code wait for a lower priority code (which is a priority inversion). Any half-way sane locking primitives allow you to attempt a lock with an immediate timeout (ie if the lock cannot succeed immediately, then it fails rather than blocks). This way you can turn a waiting scenario into a polling scenario, which is fine and usually doesn't complicate anything too much.

You can also use dynamic allocation as long as you can make allocation and garbage collection fast (ie O(N) for N samples or something) and non-blocking. Depending on what you want, this might be very easy (eg preallocate a block; then allocation is pointer advance and if after render() callback it's all garbage you can collect by resetting the allocation pointer). Mostly dynamic allocation becomes a problem if you want to pass (ownership of) dynamic objects between threads.

Dynamic dispatch? I'm not familiar with Objective C, so I'm not sure what you want. Dynamic method selection on multiple types (normally called "multiple dispatch" or "multi-methods")? Rewrite to dispatch on one type at the time (C++ virtual style) and you can compile it into O(1) per type to be dispatched. This is super-painful to do manually (maybe boost has a template hack for this?) if your dispatches are complicated, but languages like Common Lisp (or rather object systems like CLOS since it can be implemented in CL itself) supports multiple dispatch and I'm pretty sure many compilers turn it into something that's O(1) anyway.

But I've been thinking that you can use an STL std::vector for a lot of things provided you use it in a specific way and think ahead. Like so:

1. use a POD (plain old data) type

2. on creation, preallocate space by calling vector.reserve() with a big enough number that you'll never exceed it

3. use std::lower_bound to find insertion points to keep the vector in-order

Or use the pre-allocated private heap trick to write a custom allocator for the std::vector.. all the STL collection allows custom allocators out of the box.

I'm not sure what this buys you though. You could just as well use a regular array, possibly wrapped inside a class that keeps track of size and provides the usual iterators if you want to use those (though raw pointers work as iterators anyway).

This should give you log2(n) lookup times, like a tree. Of course, insertion and deletion is going to be o(n) but for a lot of the common uses of this in typical code n will be relatively small, and, more importantly, since you reserve() space up front you shouldn't trigger any allocations.

Wait, WHAT?!? If you want to build a tree.. you can trivially reserve a heap-block for the allocation of the nodes. If you can further make sure all the nodes are the same size (which should be easy for typical search tree) then you can do deallocation into a simple free-list, and just allocate from the free-list if there's nodes there. Alloc and dealloc then are O(1). Search is O(log2(n)) for binary tree. Insertion is search and O(1) for the new node plus maybe another O(1) for tree-rotation if you want to keep it balanced.

Deletion is a bit problematic since removing an intermediate node you might need to rewrite the tree all the way down to the leaves, but it's still only O(log2(n)) anyway since this only happens if the search terminates early.

Or with slight complication to the actual code you can do it all inside an array too.. though then you need O(n) copy if you need a larger array (with natural list you could just allocate more nodes to the free-list, giving up real-time in the worst-case).

And even if you mess up and exceed the reserved capacity you'll suffer an allocation and a possible glitch but not the crash you'd get if you write past the end of a fixed-size array. Plus you can use all the nice STL algorithms on the data.

Oh I see. Well, the second point doesn't apply: the STL algorithms work perfectly well on raw pointers too, since STL iterators are specifically designed such that a raw pointer is a valid iterator.

This seems a lot simpler than the obvious alternatives like intrusive linked lists that require cycling dynamically allocated objects in and out of the render thread. Has anybody tried something like this? It seems to be working well in a simple engine for me so far.

What are you trying to do?

kuniklo · Post by **kuniklo** » Sun Jun 03, 2012 5:28 am

Yeah it's the communication with the outside world that makes working in the render callback tricky. I'm currently using a lock-free ring buffer as a kind of message queue between the UI and DSP threads and that works fine but keeping complex data structures in sync on both sides is tricky.

I'm writing a sequencer with this. The difficult bit is that notes can be created in the UI but also recorded from the arpeggiator into the sequence. So the render loop also needs to be able to create new event objects. I'm sure the approaches you describe would work and also avoid the o(n) insert/delete performance that I suffer with the vector approach. But it seems like you still need to handle the possibility of running out of pre-allocated space and then things get messy. For a while I was running a separate allocator thread to feed new objects into the render loop but I think this just introduces another potential race condition. The advantage of the vector approach is that it's very, very simple to implement and can transparently handle resource exhaustion with a small risk of a glitch.

Also, since playback events far outnumber editing events in a typical sequencer, the memory locality of a contiguous vector should lead to better cache behavior than a pointer-based structure like a hash or tree. That factor is probably lost in the overhead of the actual DSP though.

stratum · Post by **stratum** » Sun Jun 03, 2012 5:47 am

In VST3 "communication with the outside world" seems to be done with a "pull" mechanism, i.e rendering code is looking for parameter changes and other events itself, not the other way around. Perhaps that's the simplest way to do it.
Also, I would try to avoid creating another thread, that's where things can get really messy. Try to do it all inside the processing method.

kuniklo · Post by **kuniklo** » Sun Jun 03, 2012 6:02 am

I agree. I didn't get very far with the third thread approach before I scrapped it as too complex.

I haven't looked too far into VST3 but the complicating factor in the thing I'm trying to build now is that I want the arp to record midi events into the track but this has to happen at very precise times so I can't think of a way around generating events in the render thread.

antto · Post by **antto** » Sun Jun 03, 2012 11:14 am

kuniklo: your sequencer logic and processing is running on the gui thread?

kuniklo · Post by **kuniklo** » Sun Jun 03, 2012 11:40 am

antto wrote:kuniklo: your sequencer logic and processing is running on the gui thread?

No, that would definitely not work. It's all running on the DSP thread. Only new notes are created from the UI thread when the user taps on the screen while recording. In my first implementation of this the note object was allocated on the UI thread and then passed into the DSP thread via a lock-free ring buffer, so the DSP thread never had to allocate any memory.

But I also have an arp, and want to record the output of the arp into the sequence as well. The arp has to run in the DSP thread too for timing reasons but this means that allocations also have to happen in the DSP thread. std::vector gives me an easy way to pre-allocate enough of these note objects that the user is unlikely to hit the limit and a simple, if not 100% correct, way to handle spillover.

cheating the render callback with std::vector