Let's talk DAW/Sequencer design and architecture

DSP, Plugin and Host development discussion.

Post

Hi,

I am building a sequencer/DAW engine that I plan to use as a basis for building audio plugins and apps on top of.

Some of the features or goals of the engine are:
- Session and Arrangement functionality like Ableton Live
- Audio and MIDI tracks containing clips
- Sequencer running in the audio thread for sample-accurate timing
- Modular design, i.e. connecting processing nodes with virtual wires to build the audio processing graph.

I have built a working prototype as a proof of concept, but in this first version I cheated and didn't take proper care of the concurrency or real-time requirements of the RT (audio) thread.

The next step will be a complete rewrite where I take concurrency into account, to get a stable, glitch-free engine with a well-written, clean code base.

So I want to discuss DAW/sequencer architecture and design strategies.

Stuff like this:

- Suitable data structures and memory management: C structs, STL or other libraries, intrusive containers?
- UI/RT thread messaging/synchronisation, lock-free FIFOs etc.
- UI/RT data sharing: the UI could have a local version of the song data and send messages to the RT thread when changes occur, so the RT version stays in sync with the UI version. Other strategies?

I found Ross Bencina's articles on the subject really helpful, but I am still researching and haven't really decided on how to structure the internals.

Can you share your experiences on this subject?
How would you design something like this? Or, if you have done something similar, could you describe your architecture and the design choices you made?

Post

Ambitious project, but fun :) My experience is fairly limited but I'm going to follow the discussion with a lot of interest. IMHO a strict separation between the real-time audio processing part and the non-real-time part is a must, with lock-free message queues as the means of communication between them.

JUCE has a number of YouTube videos from their developer conferences that are worth checking out; one in particular is a walkthrough of how to build a lock-free FIFO.

Ardour's source code is open, and the codebase is well laid out and documented. It's fairly big, but worth taking a look at for reference and inspiration. I don't know of any other big DAWs with public source code, though.

Post

I wrote sequencers starting from "the dawn of MIDI", years before affordable computer audio was possible, and then later with audio support, up until about four years ago when I ran out of gas and retired. :)

Tools were relatively primitive back then, and I'm not current on what fancy tools are available nowadays to make it easier; I haven't studied recent language enhancements. Computers being so much faster than in yesteryear should make it easier regardless.

I relied a lot on lock-free FIFOs and what I called global flags. Nowadays I think flags are decorated with the fancier term semaphore. I don't know whether modern built-in semaphore features offer advantages over global flags, or what such advantages might be. https://en.wikipedia.org/wiki/Semaphore_(programming)
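A minimal sketch of that "global flag" idea in modern C++, assuming a plain std::atomic<bool> rather than an OS semaphore (the names are made up for illustration):

Code:
#include <atomic>

// A "global flag": one thread sets it, another polls it.
// std::atomic<bool> makes the write visible to the reader without any lock,
// which is essentially what the old hand-rolled global flags relied on.
std::atomic<bool> g_transport_running{false};

// UI/main thread:
void StartPlayback() { g_transport_running.store(true, std::memory_order_release); }

// Audio thread, at the top of each callback:
bool TransportRunning() { return g_transport_running.load(std::memory_order_acquire); }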

Historically there were popular sequencers that started MIDI-only and eventually crowbarred audio support on top, and once audio became feasible there were popular sequencers that started audio-only and crowbarred MIDI support on top. The "historical low-level guts" that might not have received very extensive rewrites over the years, i.e. the underlying architecture, might look somewhat different in the ones that started MIDI-only compared to the ones that started audio-only. Or maybe not. Just guessing.

Are you interested in "fully supporting" tight hardware MIDI playback synced with computer audio tracks and plugin synthesis, or more interested in "mostly in the box"?

Maybe it's different nowadays, but my main processes or threads were: the main thread's message loop handling user input, the screen and whatever else is not very time-critical; a timer thread waking up with about 1 ms granularity (for MIDI I/O and other tasks); and of course the audio thread. You might spawn more threads depending on what needs doing, but a modern computer can get a fairly ambitious track count with those three basic threads.

In my experience with modern operating systems, even if you only run those three threads and use a normal number of OS features, the OS will also spawn assorted threads to do the work you asked it to do. Looking at the process window in the debugger, you might notice your app owning a lot more threads than the ones you specifically created, and you might have to do some research to figure out what the heck the mystery threads are doing. :) I usually didn't worry about identifying them unless there was a problem.

MIDI typically runs on a "tempo-related" tick. If you plan to support tempo maps, you need a strong, efficient structure for the tempo map data, and good, efficient ways to convert between sample time, microsecond time, nanosecond time, or whatever you find convenient for "steady time" audio sequencing, and the tempo-dependent musical time.

For example, MidiTickToMicroseconds() and MicrosecondsToMidiTick() types of functions. In my experience, as features are added one eventually ends up writing lots of such functions, each with a slightly different purpose.
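A minimal sketch of such conversion functions, assuming a single constant tempo; a real tempo map would instead walk its (tick, tempo) segments and accumulate time across them:

Code:
#include <cstdint>

// Sketch assuming one constant tempo for the whole song.
constexpr int64_t kTicksPerQuarter = 480;

// Microseconds per quarter note at a given BPM, e.g. 120 BPM -> 500000 us.
inline double UsPerQuarter(double bpm) { return 60000000.0 / bpm; }

inline double MidiTickToMicroseconds(int64_t tick, double bpm)
{
    return static_cast<double>(tick) * UsPerQuarter(bpm) / kTicksPerQuarter;
}

inline int64_t MicrosecondsToMidiTick(double us, double bpm)
{
    // Round to the nearest tick (assumes non-negative times).
    return static_cast<int64_t>(us * kTicksPerQuarter / UsPerQuarter(bpm) + 0.5);
}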

I have seen people design MIDI tracks, tempo tracks and audio event lists as linked lists of small data structures, thinking this would make it "more elegant" to insert/delete/modify tracks. But IMO basing such huge data on giant piles of linked-list nodes causes lots of brain damage when writing efficient code that chases back and forth over track data and runs the various editing and conversion functions. It is also begging for accidental memory leaks.

Perhaps I'm wrong, but I strongly prefer tracks which are big arrays of identical-sized data structures. For instance, if a MIDI event structure needs 20 bytes or whatever to handle all contingencies, then a track with 1000 MIDI events would just be a NewPtr(1000 * sizeof(TMidiEvent)). If events need to be inserted, deleted, quantized or otherwise mangled in any number of ways, the edit functions just have to step sequentially through all the events behind that track pointer. That is simple, and most of the edit function code will look fairly similar because it is mostly traversing the same data structures the same ways, just doing different stuff. So it is easy to write a new edit function based on some older edit function.
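A rough sketch of that layout in C++, with a hypothetical quantize pass stepping sequentially through a contiguous array of fixed-size events (the struct and field names are made up for illustration):

Code:
#include <cstdint>
#include <vector>

// Hypothetical fixed-size event; every event in a track has the same size.
struct TMidiEvent
{
    int32_t tick;      // tempo-related timestamp
    uint8_t status;    // MIDI status byte
    uint8_t data1;
    uint8_t data2;
    uint8_t pad;       // keep the struct a round size
};

// A track is just a contiguous array of events (std::vector standing in
// for the old NewPtr(count * sizeof(TMidiEvent)) allocation).
using TMidiTrack = std::vector<TMidiEvent>;

// A typical edit pass: walk the array front to back and mangle each event.
void QuantizeTrack(TMidiTrack& track, int32_t grid)
{
    for (TMidiEvent& ev : track)
        ev.tick = ((ev.tick + grid / 2) / grid) * grid;  // snap to nearest grid line (non-negative ticks assumed)
}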

Apologies for rambling. I'm just saying that unless you have a giant brain that remembers details forever, it might be good to keep the data structures as simple as possible, rather than trying to make everything "elegant" from a data processing standpoint, but unfortunately so elegant that you have to spend a lot of time reverse-engineering the "elegant, complicated" code you wrote a year ago in order to enhance it.

Another good trick might be to add some extra "dummy future bytes" to important fundamental data structures. This is especially useful if you decide, for instance, to use your internal structure as the file storage format.

For instance, say your event list element is 20 bytes big, and a couple of years later you want to add another property to it. If you have been saving files that are just a header plus dumps of the arrays of event list elements, then you have to write a new set of file load/save functions and also keep supporting the old set, to load earlier versions of your program's data files. Also, of course, reserve a file version variable in the header so you can bump the revision when you improve the file format and your program knows how to load both old and new files.

If you keep writing on your program for enough years, you may have guessed wrong about how many future bytes to reserve, and at some point you run out and have to radically rewrite anyway, but reserving some unused bytes can help postpone that painful day of reckoning. :)
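A small sketch of the idea, with made-up field names: a versioned file header plus an event record carrying a few reserved bytes, so the on-disk layout can grow without immediately breaking old readers:

Code:
#include <cstdint>

// Hypothetical on-disk layouts; field names and sizes are just for
// illustration. Fixed-width types keep the dumped layout predictable.
#pragma pack(push, 1)
struct TFileHeader
{
    char     magic[4];      // e.g. "SEQ1"
    uint32_t fileVersion;   // bump this whenever the layout changes
    uint32_t eventSize;     // sizeof(TStoredEvent) at save time
    uint8_t  reserved[20];  // spare header bytes for future use
};

struct TStoredEvent
{
    int32_t tick;
    int32_t durationTicks;
    uint8_t status, data1, data2;
    uint8_t future[9];      // "dummy future bytes" for properties added later
};
#pragma pack(pop)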

Post

For instance, if a MIDI event structure needs 20 bytes or whatever to handle all contingencies
Is this just to support future file format extensions, or do a set of MIDI events somehow happen to be related and grouped together as a single one? I have never received such a long MIDI event from a keyboard, but I have only played two different devices so far and never seen a large variety of them. Leaving the original data intact and adding the changes made by the user as extra fields makes sense; perhaps that's how that space is being used? (just imagining)
~stratum~

Post

Noizebox, indeed fun fun fun. I will definitely look into the videos and stuff you mentioned.

My plan is to have all data mutable in the main thread, so I can add/remove/change objects and stuff like that there without having to worry about concurrency issues.

Then I will send messages over the lock-free FIFO to the audio thread to create a read-only version of the data there.

I haven't decided exactly how I will do this, but one obvious solution would be to send a message/command to the audio thread for every change in the main thread's data, so it can apply the same changes to its own copy.

Currently my prototype code is using STL containers, so if I mirror my data in the audio thread and replay all the changes there, that might cause memory allocations and other nasty stuff.

I might skip STL containers and use some kind of intrusive containers, or something more low-level like C structs.

My data model is similar to a web browser DOM, and I am planning to write some serializers so I can save/load to JSON/XML or other formats.

One idea I have is to keep the STL containers in the main thread and write a special serializer that writes the data to a simpler binary blob format that the audio thread can use (linked lists get converted to plain arrays and so on).
Then I could serialize the changes to this blob format in the main thread and send the new blobs to the audio thread to replace the old ones.
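A rough sketch of that blob idea, assuming the tree nodes can be flattened into one contiguous array and the resulting pointer is handed to the audio thread over the FIFO (all names are hypothetical):

Code:
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical main-thread clip node (part of the DOM-like tree,
// flattened here from a simple list for brevity).
struct ClipNode { int32_t startTick; int32_t lengthTicks; uint32_t id; };

// Flat, fixed-layout version for the audio thread: one contiguous array,
// built once on the main thread and treated as read-only on the RT side.
struct ClipBlob
{
    std::vector<ClipNode> clips;
};

// Main thread: "compile" the tree into a blob.
std::shared_ptr<const ClipBlob> CompileClips(const std::vector<std::unique_ptr<ClipNode>>& tree)
{
    auto blob = std::make_shared<ClipBlob>();
    blob->clips.reserve(tree.size());
    for (const auto& node : tree)
        blob->clips.push_back(*node);
    return blob;
}

// The new blob pointer would then be posted to the audio thread over the
// lock-free FIFO; the audio thread swaps its current pointer and sends the
// old one back, so memory is freed on the main thread, never in the callback.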

I have to do some more thinking about this.

I still have lots of questions and research to do, so it would be nice to see how others tackle this.

Post

Have a look at https://www.threadingbuildingblocks.org/. It's now available under the Apache license (i.e. free in almost any sense, except that the copyright belongs to Intel). It may contain many things that probably should not be used in an RT audio thread, so use it with care; it's a general-purpose library.
~stratum~

Post

JCJR, thanks for the input. This will be a short answer since it's late here (Sweden) and I am really tired and need to go to bed ASAP.

I agree with your suggestion to keep things simple, and it might be wise to use fixed-size data structures, but then again I have some features in mind that might require more dynamic sizes (hint: I might borrow some ideas from music21, http://web.mit.edu/music21/).

My plan is, as you suggested, three main threads (main event loop/UI, MIDI I/O and audio), but I will also need to create threads to load/save samples from/to storage.

I was thinking about adding a layer to transform the internal data structures to an external format when I serialize them to storage, but that's a problem I will save for later.

Post

stratum wrote:
For instance, if a MIDI event structure needs 20 bytes or whatever to handle all contingencies
Is this just to support future file format extensions, or do a set of MIDI events somehow happen to be related and grouped together as a single one? I have never received such a long MIDI event from a keyboard, but I have only played two different devices so far and never seen a large variety of them. Leaving the original data intact and adding the changes made by the user as extra fields makes sense; perhaps that's how that space is being used? (just imagining)
Hi Stratum

20 bytes was just a number plucked out of the air. Except for sysex, which can be about any arbitrary size, the max length of an individual MIDI message is 3 bytes, though there are also 2-byte and 1-byte message types, and also "running status", which sometimes helps compact the stream a little further.

I typically treated sysex as "something different from other MIDI", so that everything else would fit in the same-sized event container, the event structure.

Possible fields in a MIDI event structure would be MIDI bytes 1, 2 and 3, a TickTimestamp field, possibly a NoteDurationInTicks field, and maybe the release velocity of the paired note-off. (After recording, a process would sort through the data matching note-ons with note-offs, filling in the NoteDuration fields of the note-on events and marking the now-redundant note-off events for pruning and deletion once the newly recorded track is parsed.)

I usually used int32 for the tick timestamp and duration, but if I were doing it over again I might store timestamp and duration as floats or doubles.

Non-note data would typically not use the duration field, so it would probably be set to zero and ignored for non-note events. But maybe that field could serve some other purpose for controller or pitch bend events.

MIDI processing probably takes such a small amount of processing nowadays, compared to audio, that vast inefficiencies can be ignored without hurting anything. However, so far as I know, modern CPUs can read/write 16-, 32- and 64-bit values faster than bytes, so it might be considered an advantage to store byte1, byte2 and byte3 as shorts or longs, if the computer can load those faster than bytes and if that even matters any more on modern fast computers.

It might also be considered an advantage to store the event's MIDI channel in a separate field in the parsed event structure, so that channel-sensitive parsing or editing doesn't have to constantly load the status byte and mask it to find out the channel of each event.

One thing that is useful is a field of flags. We kept coming up with needs for new flags once in a while, so having some pre-allocated empty spares to define later came in handy. We usually used each bit in an int as its own flag, like a bit array.

Maybe one bit for Mute This Event, a bit for Marked For Deletion, a bit for This Event Is Selected (for instance, non-contiguous selections in a piano roll or notation edit window).

It really depends on what one wants to accomplish and what data one wants to put in an event structure.
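Putting those fields together, a fuller (still hypothetical) variant of a fixed-size event struct might look like this, with the flag bits packed into one int and a few spare bytes reserved:

Code:
#include <cstdint>

// Flag bits packed into one int, with spare bits left undefined for later.
enum : uint32_t
{
    kEventMuted    = 1u << 0,  // Mute This Event
    kEventDeleted  = 1u << 1,  // Marked for Deletion
    kEventSelected = 1u << 2,  // selected in piano roll / notation editor
    // bits 3..31 reserved as pre-allocated spares
};

struct TParsedMidiEvent
{
    int32_t  tick;            // tempo-related timestamp
    int32_t  durationTicks;   // note duration; zero/ignored for non-note events
    uint8_t  status;          // MIDI bytes 1..3
    uint8_t  data1;
    uint8_t  data2;
    uint8_t  channel;         // pre-extracted so edits don't have to mask the status byte
    uint8_t  releaseVelocity; // from the paired note-off
    uint8_t  reserved[3];     // dummy future bytes
    uint32_t flags;           // bit array of the flags above
};

static_assert(sizeof(TParsedMidiEvent) == 20, "fixed-size event, 20 bytes as in the running example");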

Post

d__b wrote: Then I will send messages over the lock-free FIFO to the audio thread to create a read-only version of the data there.

I haven't decided exactly how I will do this, but one obvious solution would be to send a message/command to the audio thread for every change in the main thread's data, so it can apply the same changes to its own copy.
...
I might skip STL containers and use some kind of intrusive containers, or something more low-level like C structs.
I might be more inclined to divide the data so that some of it is owned by the RT thread and some by the non-RT thread, instead of duplicating all data on both sides of the barrier. It seems a bit wasteful to me.

You're building this in C++, I assume? I would advise against using too much low-level C data, and in case you have to, build C++ wrappers around it. C++ abstractions will likely be more elegant and easier to use, and correctly written they shouldn't incur any extra overhead. If the aim is, for instance, to have data structures that can be copied like C structs and are of uniform size, that can still be achieved with C++ tools.
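For example (a small sketch, not from the post above): a plain C++ struct can still be memcpy-friendly and of a fixed, checkable size, and the standard type traits let you assert that at compile time:

Code:
#include <cstdint>
#include <type_traits>

struct NoteEvent
{
    int32_t tick;
    uint8_t pitch;
    uint8_t velocity;
    uint8_t channel;
    uint8_t flags;
};

// Compile-time guarantees that this behaves like a C struct:
static_assert(std::is_trivially_copyable<NoteEvent>::value, "safe to memcpy / pass through a FIFO");
static_assert(std::is_standard_layout<NoteEvent>::value,    "predictable field layout");
static_assert(sizeof(NoteEvent) == 8,                       "uniform, known size");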

You could also look into EASTL, https://github.com/electronicarts/EASTL, which was recently open sourced. It has some useful containers that are absent from the STL, like fixed-size lists (with contiguous, pre-allocated storage) and intrusive containers.

Post

noizebox wrote: I might be more inclined to divide the data so that some of it is owned by the RT thread and some by the non-RT thread, instead of duplicating all data on both sides of the barrier. It seems a bit wasteful to me.
Of course I wouldn't duplicate stuff like audio samples, but separate copies of the whole song structure should not consume too much memory, I think.

But I am still undecided, and a full copy of the data might not be what I want in the end; I am still weighing the pros and cons of the different methods.

noizebox wrote: You're building this in C++, I assume? I would advise against using too much low-level C data, and in case you have to, build C++ wrappers around it. C++ abstractions will likely be more elegant and easier to use, and correctly written they shouldn't incur any extra overhead. If the aim is, for instance, to have data structures that can be copied like C structs and are of uniform size, that can still be achieved with C++ tools.
C++ yes, but with some restrictions to avoid going too far down the rabbit hole.

I am entertaining the idea of keeping everything pretty high level in the main thread, where the data structure is a tree of nodes; STL containers could be fine there.

Then I would "compile" the stuff from the main threads data tree into a more compact low level format.
For example a linked list of items would be transformed to one big chunk of memory containing structs representing the items from the list. If the items are of different types and sizes I could build a lookup table which maps the index of the item to an offset in the memory block.

All my objects have a unique integer id, so when I pass messages between different contexts/threads I can use this id to address specific objects in the other context.
I have a lookup table for all these ids, where I store the pointers to the actual objects for each context.

There is nothing that prevents me from having different representations of the data in the different threads. The id links them together, so I could have one type of object or interface on the UI side and another on the audio side.
They might be two different wrappers encapsulating a shared data format, like a struct.
The internal data might also differ between the contexts, depending on what you prefer.
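A rough sketch of that id-based lookup, with one table per context and messages that carry only the id (all names are hypothetical):

Code:
#include <cstdint>
#include <unordered_map>

using ObjectId = uint32_t;

// Hypothetical per-context representations of the "same" object.
struct UiClip    { /* STL-heavy, editable representation */ };
struct AudioClip { /* flat, read-only representation */ };

// Each context keeps its own id -> pointer table; messages only carry the id.
// The RT side might prefer a pre-sized flat table to avoid rehashing, but an
// unordered_map is enough to show the idea.
struct UiContext    { std::unordered_map<ObjectId, UiClip*>    objects; };
struct AudioContext { std::unordered_map<ObjectId, AudioClip*> objects; };

// A command sent over the FIFO addresses the object by id, never by pointer,
// so each side resolves it against its own representation.
struct SetClipGainCommand { ObjectId clip; float gain; };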


noizebox wrote: You could also look into EASTL, https://github.com/electronicarts/EASTL, which was recently open sourced. It has some useful containers that are absent from the STL, like fixed-size lists (with contiguous, pre-allocated storage) and intrusive containers.
Yeah, I have heard of it, but never actually used it.

Post

EASTL is a different compromise than the STL. I don't think you should use lists in general; try to use vectors (and there are fixed-size vectors in the STL now). But EASTL has some nice maps implemented as vectors (not trees) that could be faster for audio processing. You probably want to start with the STL, then profile the app and see if you need to optimize the containers.
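As an illustration of the "map implemented as a vector" idea (a hand-rolled sketch, not EASTL's actual API): keep the key/value pairs sorted in one contiguous vector and look them up with a binary search:

Code:
#include <algorithm>
#include <cstdint>
#include <vector>

// A tiny "flat map": sorted vector of pairs, binary search for lookup.
// Contiguous storage is cache-friendly and, once reserved, needs no
// per-node allocation the way a tree-based std::map does.
struct FlatMap
{
    std::vector<std::pair<uint32_t, float>> entries;  // kept sorted by key

    void insert(uint32_t key, float value)
    {
        auto it = std::lower_bound(entries.begin(), entries.end(), key,
                                   [](const auto& e, uint32_t k) { return e.first < k; });
        if (it != entries.end() && it->first == key)
            it->second = value;               // overwrite existing key
        else
            entries.insert(it, {key, value}); // may allocate: do this off the RT thread
    }

    const float* find(uint32_t key) const
    {
        auto it = std::lower_bound(entries.begin(), entries.end(), key,
                                   [](const auto& e, uint32_t k) { return e.first < k; });
        return (it != entries.end() && it->first == key) ? &it->second : nullptr;
    }
};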

Post

Hi JCJR,

I have also seen something called a "MIDI beat clock". Does it have a serious use? I mean, why do we need a clock when we can already store timing information as a message receive-time timestamp?

Here on Wikipedia, https://en.m.wikipedia.org/wiki/MIDI_beat_clock, the article says:
MIDI beat clock (also known as MIDI timing clock or simply MIDI clock) is a clock signal that is broadcast via MIDI to ensure that several MIDI-enabled devices such as a synthesizer or music sequencer stay in synchronization. It is not MIDI timecode.
That's a bit odd, as I could not see anything that needs to be synchronized. I mean, the serial I/O interface between these devices is supposed to be fast enough, so why are we supposed to synchronize them?

Thanks
~stratum~

Post

stratum wrote: Hi JCJR,

I have also seen something called a "MIDI beat clock". Does it have a serious use? I mean, why do we need a clock when we can already store timing information as a message receive-time timestamp?

Here on Wikipedia, https://en.m.wikipedia.org/wiki/MIDI_beat_clock, the article says:
MIDI beat clock (also known as MIDI timing clock or simply MIDI clock) is a clock signal that is broadcast via MIDI to ensure that several MIDI-enabled devices such as a synthesizer or music sequencer stay in synchronization. It is not MIDI timecode.
That's a bit odd, as I could not see anything that needs to be synchronized. I mean, the serial I/O interface between these devices is supposed to be fast enough, so why are we supposed to synchronize them?
Hi Stratum

It is not complicated, but it can be difficult to explain the clocks clearly, because there are several of them and they need to be consistently named in the explanation.

Many feel that old serial MIDI is "too slow", but it is fast enough for many practical purposes. The MIDI baud rate is such that it takes about 1 ms to transmit a 3-byte message such as a note-on or note-off. When running status can be used, a message only requires 2 bytes, or about 0.67 ms.

However, the "theoretical advantage" of this slow pipe is that (at least for sparse Midi streams) the timing for the start of each message has "infinitely fine resolution". You could theoretically control the exact time when each message is sent to the nanosecond. So far as I recall. Though if a Midi stream gets dense, some messages might get delayed waiting for earlier messages to go out the pipe, causing less-tight timing.

That is because the MIDI pipe does not constantly toggle; it is asynchronous. When idle, it sends nothing. When a MIDI byte is transmitted, it sends a sequence of pulses lasting (as best I recall) 10 MIDI baud clocks, and then returns to the steady idle state. So in theory you could start the bit sequence at any arbitrarily fine time increment, even through the slow pipe. The baud clock stuff lives in the hardware chips and is typically of no concern to the modern programmer.

There may nowadays be additional new MIDI time messages in the spec, but the early standards were MIDI Clock and MIDI Time Code. MIDI Clock messages are "tempo-dependent": a fast song sends the MIDI clocks faster and a slow song sends them slower.

MIDI Time Code is "non-tempo-dependent", steady time. Or at least in an ideal world it would be absolutely steady time. The original intention of MTC was a "MIDI version of SMPTE time code". Back in the days of multitrack audio and video tape, not all machines would free-run at exactly the same speed, so SMPTE was used (among other things) so that a master machine could force other machines to run at the master's conception of "perfect time". For instance, in those days you could stripe a tape track with the SMPTE timing squeal and then send that audio into an MTC-capable interface, which would send the tape location to the computer. So if you rewind the tape, the computer sequencer also rewinds to the same location. If the tape runs faster or slower than the speed the computer expects, the sequencer tries to lock on and run at the same speed as the tape.

In that case, since MIDI Time Code was "steady time", it didn't know anything about tempo. If composing a variable-tempo song to fit a piece of film, or multi-tracking a song with some "real" instruments and some "computer instruments", the musician would put his desired tempo map in the sequencer and tell the sequencer what SMPTE time corresponds to bar 1. So the sequencer would play variable musical tempos locked to the "steady time" coming in from the tape machine.

MIDI Clock is tempo-dependent. It is defined to send 24 MIDI clocks per quarter note, or whatever time division you want 24 MIDI clocks to represent. For instance, at a tempo of 120 beats per minute, 2 beats per second, it would send 48 clocks per second.

MIDI Clock was very popular in the early days and then became less popular as many musicians moved from interconnected synths to using a computer as the "control brain" of all the music tracks. I don't keep up, but I have the impression that MIDI Clock has become quite popular again in modern times for dance music, because many musicians have either abandoned the "central computer" or minimized its importance.

For instance, a musician might program a drum pattern in the drum machine, program repetitive machine note patterns into one or more hardware synthesizers, and perhaps enable the arpeggiator feature on others. One of the devices (the computer, the drum machine, or one of the synths) sends the master MIDI Clock, and all the other devices follow that tempo. The drums play, the repetitive machine note sequences burble and the live chords played by the musician arpeggiate, all in the same tempo. If you twist the tempo knob on the master, all the slave machines follow the master tempo.

As serial MIDI is a slow pipe, they probably designed the 24 PPQN (pulses per quarter note) rate to avoid eating much of the MIDI bandwidth with timing messages. It can represent about any fine tempo variation, but it is rather poor time resolution: at 125 BPM, all notes would be quantized to 20 ms boundaries, useless except for stiff-as-a-cob techno machine music (IMO).

However, this isn't as bad as it would appear, because it is possible to "phase lock" a higher resolution onto the low-res tempo information. For instance, if a sequencer uses 480 PPQN internally (about 1 ms resolution at 125 BPM), it can phase-lock its 480 PPQN playback against the coarse 24 PPQN MIDI Clock and get 1 ms timing resolution for recording and playback.
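A much-simplified sketch of that phase-locking idea: estimate the incoming 24 PPQN clock period, then interpolate a 480 PPQN position between clocks. A real implementation would correct accumulated drift properly (an actual phase-locked loop), handle start/stop/continue and take care of thread safety between the MIDI and audio threads; everything here is hypothetical:

Code:
#include <cstdint>

class MidiClockFollower
{
public:
    // Call when a MIDI Clock byte (0xF8) arrives, with the host time in microseconds.
    void OnMidiClock(double nowUs)
    {
        if (lastClockUs_ >= 0.0)
        {
            double period = nowUs - lastClockUs_;                  // us per 24 PPQN clock
            clockPeriodUs_ = 0.9 * clockPeriodUs_ + 0.1 * period;  // crude smoothing
        }
        lastClockUs_ = nowUs;
        // Snap the internal position to the incoming clock: 1 MIDI clock = 20 internal ticks.
        internalTick_ = static_cast<double>(clockCount_) * (480.0 / 24.0);
        ++clockCount_;
    }

    // Ask for a fine-grained 480 PPQN position between clocks.
    double TickAt(double nowUs) const
    {
        if (lastClockUs_ < 0.0) return 0.0;
        double ticksPerUs = (480.0 / 24.0) / clockPeriodUs_;       // internal ticks per microsecond
        return internalTick_ + (nowUs - lastClockUs_) * ticksPerUs;
    }

private:
    double  lastClockUs_   = -1.0;
    double  clockPeriodUs_ = 20833.0;  // initial guess: 24 PPQN at 120 BPM
    int64_t clockCount_    = 0;
    double  internalTick_  = 0.0;
};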

I'm guessing that the fancier modern hardware boxes with built-in sequencers also do the phase-lock trick and have better time resolution than the MIDI Clock itself, but I don't know. The first-generation hardware drum machines and sequencers tended to be a bit stiff and jerky because most of them actually ran at that low time resolution, as far as I recall.

Post

Thanks JCJR. It looks like there are many things to test if a DAW is supposed to control hardware synths, and it would be possible to "shoot oneself in the foot" during the beta testing period of any such new product.
~stratum~

Post

d__b wrote: JCJR, thanks for the input. This will be a short answer since it's late here (Sweden) and I am really tired and need to go to bed ASAP.

I agree with your suggestion to keep things simple, and it might be wise to use fixed-size data structures, but then again I have some features in mind that might require more dynamic sizes (hint: I might borrow some ideas from music21, http://web.mit.edu/music21/).

My plan is, as you suggested, three main threads (main event loop/UI, MIDI I/O and audio), but I will also need to create threads to load/save samples from/to storage.
I've never used STL types of objects. Maybe nowadays those would be the way to go, I don't know. I was using OO programming for a long time, and the various data structures were defined as objects, but the tracks and such were just arrays of the objects, indexed off an array pointer. For instance, the beginning of a track might be a "header object" of type TMIDITrack or TAudioTrack or whatever, containing stuff related to the entire track, followed by an arbitrary-length array of fixed-size data objects.

Briefly looking at the STL after Miles' mention, it does look like the vector type would at least initially best fit my prejudices, but perhaps my prejudices are wrong. The more labor one has invested in a project, the more "locked in" one becomes by long-ago design decisions, because the bigger the project, the more painful the scope of a large rewrite. Initial planning should be as wise as possible.

OTOH, the scope of a DAW-like program is so vast that it is difficult to keep enough of it in one's head to fully think out all the issues, until after a long time spent getting one's hands dirty writing code and discovering initial misconceptions and seemingly good ideas that turned out not so great in practice. It can be months or years later that one finally understands "the way it should have been done from the beginning". :)

Maybe using fine-tuned generic code in a lib like the STL would result in better-behaved code than writing it all "lower level". OTOH, I mostly knew what I wanted the code to do, and it seemed better to write iterators and such to do exactly what I wanted, rather than try to figure out how to trick generic code into getting the job done. But maybe that is more labor-intensive, and maybe generic code bent to the specific purpose would turn out faster and more efficient, presumably having been written by smarter programmers. Dunno.

If you will be using audio and MIDI tracks, and your program's specs allow all the audio to be loaded into RAM at play time, the programming can be simpler and cleaner. Typical computers nowadays allow quite a few audio tracks in RAM for typical songs no longer than maybe 10 minutes. But if the program must support large track counts and also arbitrarily long program material, like a multitrack project for a half-hour or hour TV show, then fewer tracks can be supported in RAM.

If the audio is not entirely RAM-resident, you need another layer that buffers a few seconds of audio at a time from disk to RAM. Just another set of functions requiring time and labor to create, debug and polish.

The disk-to-RAM buffering might get called periodically from the main thread: the main thread's idle function calls the disk-to-RAM function "several times per second" or whatever, and the disk-to-RAM function might be written to only do a little bit of work "topping off its buffers" each time it is called, so that disk buffering is a fairly steady load rather than occasionally impacting screen or mouse responsiveness.

Alternatively, the disk-to-RAM function might run in its own thread loop.

I personally would not keep two duplicate copies of much data, one dedicated to the main thread and the other dedicated to the real-time threads.

I typically shared data between threads with fairly small FIFO buffers, but maybe there are better strategies. To minimize locked states, I made a set of self-enforced access rules, such as: Thread A is the only one allowed to write Variable A but all threads can read Variable A at any time, and Thread B is the only one allowed to write Variable B but all threads can read Variable B at any time. Sometimes you just HAVE to use a lock or mutex, but a consistent, well-enforced set of private access rules can minimize the need for locks.

For instance, one way to do it: you have MIDI tracks and audio tracks in memory. Each time the repetitive timer thread wakes, it looks at all the MIDI tracks and picks out any notes which need to be played "soon". It stuffs those "ready to go" notes into small FIFO buffers and updates the head pointer.

Next time the audio thread needs to render a time slice, it compares the FIFO buffer's head and tail to discover whether any notes need playing. If head != tail, it strips off the pending MIDI data and sends it to the relevant VST synth plugins right before telling each synth plugin to render the next time slice, and then updates the tail pointer. Nothing ever writes the tail except the audio thread, and nothing ever writes the head except the timer thread.
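A minimal sketch of that single-writer/single-reader FIFO in modern C++, assuming a fixed power-of-two capacity (production code would want more care around overflow handling and the element type):

Code:
#include <array>
#include <atomic>
#include <cstddef>

// Single-producer/single-consumer FIFO: only the timer thread writes head_,
// only the audio thread writes tail_, matching the access rules above.
template <typename T, size_t CapacityPow2>
class SpscFifo
{
public:
    bool push(const T& item)                       // timer thread only
    {
        size_t head = head_.load(std::memory_order_relaxed);
        size_t next = (head + 1) & (CapacityPow2 - 1);
        if (next == tail_.load(std::memory_order_acquire))
            return false;                          // full; caller decides what to do
        buffer_[head] = item;
        head_.store(next, std::memory_order_release);
        return true;
    }

    bool pop(T& out)                               // audio thread only
    {
        size_t tail = tail_.load(std::memory_order_relaxed);
        if (tail == head_.load(std::memory_order_acquire))
            return false;                          // empty
        out = buffer_[tail];
        tail_.store((tail + 1) & (CapacityPow2 - 1), std::memory_order_release);
        return true;
    }

private:
    std::array<T, CapacityPow2> buffer_{};
    std::atomic<size_t> head_{0};
    std::atomic<size_t> tail_{0};
};

// Usage: SpscFifo<PendingNote, 1024> noteQueue; the timer thread push()es notes
// due "soon", the audio callback pop()s them right before rendering.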

You can get a fairly good track and plugin count with a single audio thread which does the rendering and mixing during the audio interrupt. It is your code, but it is called into and runs on a thread owned by the audio driver.

So when the audio callback is invoked, it would render any VST synths into audio buffers the same size as the requested time slice. Then it would read the next X samples of audio track data for each track and apply whatever VST plugins are assigned to each track. Then it would apply track volume and pan, make a mix of all the tracks, return that final mixed small audio slice buffer to the audio driver and exit, returning into the audio driver's code on that thread and releasing your "temporary ownership" of the driver's thread until the next callback.
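In outline, such a callback might look roughly like this (names and structure are hypothetical; a real one would also drain the MIDI FIFO, render synths, apply insert plugins, handle stereo clips, latency and so on):

Code:
#include <cstddef>
#include <vector>

// Hypothetical track object. The scratch buffer is pre-allocated to the
// driver's maximum time slice size before playback starts.
struct Track
{
    float gain = 1.0f;
    float pan  = 0.0f;                 // -1 .. +1
    std::vector<float> scratch;        // per-track render buffer (mono for simplicity)

    // Stand-in for synth rendering / clip playback + insert effects:
    void renderInto(float* dst, size_t numFrames)
    {
        for (size_t i = 0; i < numFrames; ++i) dst[i] = 0.0f;  // silence as a placeholder
    }
};

// Called on the audio driver's thread once per time slice.
void audioCallback(std::vector<Track>& tracks, float* outLeft, float* outRight, size_t numFrames)
{
    for (size_t i = 0; i < numFrames; ++i)
        outLeft[i] = outRight[i] = 0.0f;

    for (Track& t : tracks)
    {
        t.renderInto(t.scratch.data(), numFrames);   // render this track's slice

        // Apply volume and a simple linear pan, then sum into the mix.
        float l = t.gain * (0.5f * (1.0f - t.pan));
        float r = t.gain * (0.5f * (1.0f + t.pan));
        for (size_t i = 0; i < numFrames; ++i)
        {
            outLeft[i]  += l * t.scratch[i];
            outRight[i] += r * t.scratch[i];
        }
    }
}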

I typically had an array of audio buffers, each as big as the audio driver's maximum requested time slice, or a little bigger to account for mistakes, resampling and such. One audio buffer for each track.

You could do the mixing in other thread(s) so that it doesn't have to finish within the audio callback to avoid dropouts, which might increase the track count on well-endowed multi-core systems. But if you do the mixing asynchronously to the playback on another thread, it will be more brain damage to program and might be unavoidably a little less responsive to the user. Doing the rendering and mixing in the audio callback is about as close to real time as you can get.

Starting, stopping and relocating/looping can take some time to get right. During playback everything just marches ahead in time, but a sudden change of location invalidates any data that you "played ahead" into small temporary buffers, and if you are streaming from disk, the entire disk buffer needs to be purged and reloaded for the new song location.
