Brickworks released

DSP, Plugin and Host development discussion.

Post

Hi all!

We are delighted to announce a new project named Brickworks: a music DSP toolkit that supplies you with the fundamental building blocks for creating and enhancing audio engines on any platform.

It is in an alpha development state: we have just released version 0.1.0 and plan to reach 1.0.0 by the end of Q2 2023.

It is fully free/open source software (GPLv3 license), and we also currently offer commercial licenses for versions 0.x and 1.x at a reduced price to support the project.

You can find more information on the official web page: https://www.orastron.com/brickworks.

Post

I use C++, but I just opened one source file and wanted to offer some advice:
https://github.com/sdangelo/brickworks/ ... c/bw_svf.c

I wouldn't mix initialization and allocation. If I wanted to, e.g., program an EQ with 6 SVF filters, I'd want to make sure that all of them sit contiguously in memory, probably as members of a struct, and then call something like "svf_init" on each one of them. "bw_svf_new" disallows that.

So I'd basically keep the struct, have init/destroy functions instead, and drop all the malloc/free stuff. For the objects that need external memory, I'd have the user provide an allocator (a function pointer and a void*) to both "init" and "destroy" instead of having the library call malloc.
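Something like this rough sketch of what I mean (hypothetical names, not actual Brickworks code):

Code:

#include <stddef.h>

/* hypothetical sketch, not actual Brickworks code */
typedef struct {
    void *(*alloc)(size_t size, void *ctx); /* user-supplied malloc-like hook */
    void  (*free)(void *ptr, void *ctx);    /* matching free-like hook */
    void  *ctx;                             /* passed back to both callbacks */
} my_allocator;

typedef struct {
    float coef_k;            /* placeholder coefficients */
    float state_1, state_2;  /* placeholder state */
} my_svf;

void my_svf_init(my_svf *svf, float sample_rate);  /* no allocation at all */
void my_svf_destroy(my_svf *svf);                  /* no deallocation either */

/* an object that really needs external memory (e.g. a delay line) takes the
   allocator explicitly on both init and destroy */
typedef struct {
    float  *buf;
    size_t  len;
} my_delay;

int  my_delay_init(my_delay *d, size_t len, const my_allocator *a);
void my_delay_destroy(my_delay *d, const my_allocator *a);

/* six SVFs sitting contiguously in memory, e.g. inside an EQ struct */
typedef struct {
    my_svf bands[6];
} my_eq;

This way nothing in the basic modules ever touches the heap, and the caller decides where everything lives.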

Post

Hey, first of all thanks for the feedback!

You can actually supply your own memory allocator via bw_config.h by defining BW_MALLOC etc., see https://www.orastron.com/brickworks/bw_config.

This could probably also allow you to have SVF filters contiguous in memory with minimal effort (well... you'd somehow need to know the size of those hidden structs), but I am, however, interested in such a use case - adding a couple of extra functions to each module that give you the struct size and let you init/destroy memory you provide is neither difficult nor unsafe. Still, I wonder whether that sort of thing is actually useful today, and where/why.

Post

I guess that the main thing I see here is that keeping the SVF struct opaque forces you into dynamic allocation. It is probably not the right tradeoff at this level.

As I see it, those basic blocks should be as memory-tight and contiguous as possible. When mixing different DSP elements, you don't want to touch, say, 12 sparse cache lines when you could be touching 7 contiguous ones.

Also, implementing the processing function in its own C file might prevent inlining if the user doesn't set up LTO.

Another thing that might prevent the compiler from generating good code is doing the block iteration inside the processing function. Sometimes what you want is a processing function that handles just one sample, ideally followed by statements that have no data dependency on its result, so that good use is made of CPU pipelining.
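For instance, something like this (a toy one-pole, hypothetical code, not from the library): two independent channels processed in the same loop, so consecutive statements don't depend on each other and the CPU can overlap them.

Code:

/* hypothetical toy example, not library code */
static inline float tiny_lp_process1(float *state, float coef, float x) {
    *state += coef * (x - *state);
    return *state;
}

void process_stereo(const float *in_l, const float *in_r,
                    float *out_l, float *out_r,
                    float *state_l, float *state_r,
                    float coef, int n) {
    for (int i = 0; i < n; i++) {
        out_l[i] = tiny_lp_process1(state_l, coef, in_l[i]); /* independent... */
        out_r[i] = tiny_lp_process1(state_r, coef, in_r[i]); /* ...of this one */
    }
}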

Post

stefano-orastron wrote: Thu Nov 17, 2022 12:05 pm This could probably also allow you to have SVF filters contiguous in memory with minimal effort (well... you'd somehow need to know the size of those hidden structs), but I am, however, interested in such a use case - adding a couple of extra functions to each module that give you the struct size and let you init/destroy memory you provide is neither difficult nor unsafe. Still, I wonder whether that sort of thing is actually useful today, and where/why.
Every heap object is a pointer chase, and every pointer chase is a potential cache miss, so ideally the whole DSP module is a single big allocation that contains everything as one contiguous block of memory. You can't always do that with everything (e.g. buffers you want to resize at runtime), but that should generally be the goal.

Also, I would generally split coefficients and state into separate objects. It is not uncommon that you need a bunch of filters with the same coefficients and separating the two concepts will help keep the cache footprint down and in some cases even let the compiler reuse the values in registers if multiple filters are processed in parallel.
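Roughly something like this (hypothetical names, not library code): one set of coefficients shared by several filter states, all sitting contiguously in a single struct.

Code:

/* hypothetical sketch, not library code */
typedef struct {
    float a1, a2, b0, b1, b2;          /* biquad-style coefficients */
} filter_coeffs;

typedef struct {
    float z1, z2;                      /* per-instance state only */
} filter_state;

#define N_CHANNELS 8

typedef struct {
    filter_coeffs coeffs;              /* one copy, shared by all channels */
    filter_state  state[N_CHANNELS];   /* N small states, back to back */
} multichannel_filter;

/* transposed direct form II; the coefficients can stay in registers while
   several channels are processed in the same loop */
static inline float filter_process1(const filter_coeffs *c, filter_state *s, float x) {
    float y = c->b0 * x + s->z1;
    s->z1 = c->b1 * x - c->a1 * y + s->z2;
    s->z2 = c->b2 * x - c->a2 * y;
    return y;
}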

Post

stefano-orastron wrote: Thu Nov 17, 2022 12:05 pm This could probably also allow you to have SVF filters contiguous in memory with minimal effort (well... you'd somehow need to know the size of those hidden structs), but I am, however, interested in such a use case - adding a couple of extra functions to each module that give you the struct size and let you init/destroy memory you provide is neither difficult nor unsafe. Still, I wonder whether that sort of thing is actually useful today, and where/why.
Just define the structs in the header and make them part of the public API. Problem solved.

There's really nothing to be gained here from having them be opaque.


Post

rafa1981 wrote: Thu Nov 17, 2022 2:09 pm I guess that the main thing I see here is that keeping the SVF struct opaque forces you into dynamic allocation. It is probably not the right tradeoff at this level.

As I see it, those basic blocks should be as memory-tight and contiguous as possible. When mixing different DSP elements, you don't want to touch, say, 12 sparse cache lines when you could be touching 7 contiguous ones.
Well, yes and no. I mean, it depends on your expectations as the API user. However, as I see it, there are pragmatically two pros and one con in having such structs non-opaque: the pros are better caching, as you say, and the fact that the API would actually get simpler (no actual need for destruction); the con is that it's more likely that API users will be tempted to play with the struct directly no matter how you tell them not to (in my experience this stuff happens all the time with some corporate users, and it's a pain in post-sales).

Comparing with what would normally happen in C++, though, and hence considering what the expectations most likely are, exposing the struct is more "natural", so I'm quite tempted to go in that direction.
rafa1981 wrote: Thu Nov 17, 2022 2:09 pm Also, implementing the processing function in its own C file might prevent inlining if the user doesn't set up LTO.
I think processing functions operating on whole buffers are unlikely to ever be inlined except in the simplest cases.
rafa1981 wrote: Thu Nov 17, 2022 2:09 pm Another thing that might prevent the compiler from generating good code is doing the block iteration inside the processing function. Sometimes what you want is a processing function that handles just one sample, ideally followed by statements that have no data dependency on its result, so that good use is made of CPU pipelining.
This could become counterproductive: code is also fetched from memory and pressures caches as well, so you might end up operating on closely arranged data using code that is scattered all over the place - and that would easily be the case with non-trivial synths.

OTOH, you would put these decisions in the hands of users and internal buffering could be more directly managed. Again, whether those points are good or bad depends on what the user wants (more responsibility can also be cumbersome).

There are other downsides though: operating sample by sample would either imply the inability to put as many branches as possible outside of loops (it kind of breaks the control-rate vs. audio-rate paradigm, see https://github.com/sdangelo/brickworks/ ... one_pole.c), thus taking potentially huge performance hits, or populating the API with a lot of different processing functions and perhaps coefficient update functions and whatnot (see https://www.orastron.com/brickworks/bw_inline_one_pole for a simple case), and that would only "work" as long as you know in advance which case you'd fall into.
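Just to make the current block-based layout concrete, here's a rough sketch of the kind of thing I mean (hypothetical code, not the actual library): parameter-dependent branches are resolved once per block and the per-sample loop stays branch-free.

Code:

/* hypothetical sketch, not actual library code */
typedef struct {
    float cutoff;        /* target parameter, set at control rate */
    float coef;          /* derived coefficient */
    float state;
    int   coeffs_dirty;  /* set by the parameter setter */
} one_pole;

static void one_pole_update_coeffs(one_pole *f) {
    f->coef = f->cutoff; /* placeholder parameter-to-coefficient mapping */
}

static float one_pole_process1(one_pole *f, float x) {
    f->state += f->coef * (x - f->state);
    return f->state;
}

void one_pole_process(one_pole *f, const float *in, float *out, int n) {
    if (f->coeffs_dirty) {          /* control-rate branch, once per block */
        one_pole_update_coeffs(f);
        f->coeffs_dirty = 0;
    }
    for (int i = 0; i < n; i++)     /* audio-rate loop, no branches inside */
        out[i] = one_pole_process1(f, in[i]);
}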

But again, there are other uses for sample-by-sample processing that we are considering for the (hopefully not so distant) future...

Post

mystran wrote: Thu Nov 17, 2022 3:36 pm Also, I would generally split coefficients and state into separate objects. It is not uncommon that you need a bunch of filters with the same coefficients and separating the two concepts will help keep the cache footprint down and in some cases even let the compiler reuse the values in registers if multiple filters are processed in parallel.
This would be nice in theory, but in practice I see downsides.

First, you'd need to handle coefficient updates and audio-rate processing separately. This complicates the API slightly.

But more importantly, when smoothing is involved an interesting question arises: are smoothed parameters/coefficients to be considered state or coefficients? In the former case such an arrangement becomes useless. In the latter case you'd either have to store such parameters/coefficients in audio-rate buffers, computing them once and applying them to many signals, or proceed sample by sample, which can also be problematic in languages such as those we use today (see my previous post).
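As a rough sketch of that second option (hypothetical code, not the actual library): one smoother runs once per sample into a scratch buffer, and the smoothed coefficient is then applied to many filter states.

Code:

/* hypothetical sketch, not actual library code */
void process_shared_smoothing(float *smoother_state, float smooth_coef, float target,
                              float *smoothed_g,        /* scratch, >= n samples */
                              float *filter_states, int n_filters,
                              const float *const *in, float *const *out, int n) {
    /* one smoother, run once per sample, its output stored at audio rate */
    for (int i = 0; i < n; i++) {
        *smoother_state += smooth_coef * (target - *smoother_state);
        smoothed_g[i] = *smoother_state;
    }
    /* the same smoothed coefficient is then applied to every filter */
    for (int k = 0; k < n_filters; k++)
        for (int i = 0; i < n; i++) {
            filter_states[k] += smoothed_g[i] * (in[k][i] - filter_states[k]);
            out[k][i] = filter_states[k];
        }
}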

Post

wrl wrote: Fri Nov 18, 2022 3:31 am
stefano-orastron wrote: Thu Nov 17, 2022 12:05 pm This could probably also allow you to have SVF filters contiguous in memory with minimal effort (well... you'd somehow need to know the size of those hidden structs), but I am, however, interested in such a use case - adding a couple of extra functions to each module that give you the struct size and let you init/destroy memory you provide is neither difficult nor unsafe. Still, I wonder whether that sort of thing is actually useful today, and where/why.
Just define the structs in the header and make them part of the public API. Problem solved.

There's really nothing to be gained here from having them be opaque.
No way. Implementation details are implementation details. Struct sizes and contents might change at any time, and it's unreasonable to require a major version bump to, say, introduce some optimization.

Structs can be included in headers though, as long as users don't directly access them (so, not part of the public API).
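In header terms, something like this (a hypothetical sketch, not the actual headers): the struct is visible so the compiler knows its size and it can be placed anywhere, but its members are explicitly off-limits.

Code:

/* my_svf.h - hypothetical sketch, not the actual Brickworks header.
   The struct is exposed only so its size is known and it can be embedded or
   stack-allocated; its members are NOT part of the public API and may change
   between minor versions. */
typedef struct {
    /* private, do not access directly */
    float priv_coef_k;
    float priv_state_1;
    float priv_state_2;
} my_svf;

void  my_svf_init(my_svf *svf, float sample_rate);
void  my_svf_set_cutoff(my_svf *svf, float cutoff);
float my_svf_process1(my_svf *svf, float x);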

Post

stefano-orastron wrote: Fri Nov 18, 2022 5:53 am the con is that it's more likely that API users will be tempted to play with the struct directly no matter how you tell them not to (in my experience this stuff happens all the time with some corporate users, and it's a pain in post-sales).
You really can't prevent people from doing this. No amount of opaque structs or access qualifiers or separate compilation units is going to keep people from doing it if they are determined enough. Even shipping binary-only isn't truly foolproof. The standard practice in C is to decorate the public and private symbols in different ways (or use inner structs called "priv" or "impl" or something for stuff you don't want accessed directly), and if that's not enough for someone then really that's their problem...

...except when it really isn't, because the number one reason for people to mess with the internals of something (especially when they aren't completely clueless) is that your public API didn't expose the functionality that they needed. In a sense, the core problem is often too much encapsulation and not enough separation of concerns at the API level.

My two cents is that library design is hard; the biggest challenge is finding the right balance of flexibility and efficiency vs. ease of use, and for most libraries out there you just can't find a good balance with a single API. Most decent libraries have a layered design, where the "core" consists of primitives that really only do one thing well, which can be composed with other primitives to satisfy almost any use case, at the cost of a bunch of boilerplate. Then on top of that many libraries provide a "simplified API" that implements the most common thing.

For example, an image loading library might have a "core" API where you can first decode headers, then do your own allocation and then feed it additional data (perhaps from the network) to incrementally update a progressive image.. or whatever.. and then on top of that the same library might provide a load_image() function that takes just a full buffer of data and returns a pointer to a struct with the image attributes and a data pointer or even load_image_from_file() that does the same with just a filename. This way the library is about as easy to use as they get, yet if you ever want to load progressively over a slow link, the functionality is there.
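In rough header terms, the layering in that imaginary image library could look like this (hypothetical sketch, not any real API):

Code:

#include <stddef.h>

/* hypothetical sketch of a layered API, not a real library */

/* core API: the caller owns allocation and feeds data incrementally */
typedef struct img_decoder img_decoder;
typedef struct { int width, height, channels; } img_info;

int  img_decode_header(img_decoder *dec, const void *data, size_t len,
                       img_info *info);            /* parse the header only */
int  img_decode_feed(img_decoder *dec, const void *data, size_t len,
                     unsigned char *pixels);       /* incremental/progressive update */

/* simplified API built on top of the core: whole buffer already in memory */
unsigned char *load_image(const void *data, size_t len, img_info *info);

/* even more convenient: just a filename */
unsigned char *load_image_from_file(const char *path, img_info *info);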

Post

stefano-orastron wrote:
I think processing functions operating on whole buffers are unlikely to ever be inlined except in the simplest cases.
Clang can hypothetically fuse loops. I say hypothetically because I haven't studied or read much about which cases it actually works in, so take that for what it's worth :)
stefano-orastron wrote:
This could become counterproductive: code is also fetched from memory and pressures caches as well, so you might end up operating on closely arranged data using code that is scattered all over the place - and that would easily be the case with non-trivial synths.

OTOH, you would put these decisions in the hands of users and internal buffering could be more directly managed. Again, whether those points are good or bad depends on what the user wants (more responsibility can also be cumbersome).

There are other downsides though: operating sample by sample would either imply the inability to put as many branches as possible outside of loops (it kind of breaks the control-rate vs. audio-rate paradigm, see https://github.com/sdangelo/brickworks/ ... one_pole.c), thus taking potentially huge performance hits, or populating the API with a lot of different processing functions and perhaps coefficient update functions and whatnot (see https://www.orastron.com/brickworks/bw_inline_one_pole for a simple case), and that would only "work" as long as you know in advance which case you'd fall into.
Well, it all depends on who your target user is. Preventing users from doing (obviously) dumb things also prevents them from doing some smart things :). You probably want to avoid providing support to the first group. That's fine: the value of the library won't be in one-poles and SVFs, but in providing things that the average DSP Joe (or below, e.g. me) can't build by himself.

What I (and I think many others) do is to have mostly single-sample basic processing blocks (each with a single responsibility: no smoothing, no block processing, no memory management) and group them:

Code:

for sample in block
  coefficient smoothing
  tiny process 1
  tiny process 2
  tiny process 3

for sample in block
  process 4
  process 5

for sample in block
  big process 6
Then, as CPU usage is far from predictable (and changes across machines), experiment with which combination of blocks in which loops seems to reduce CPU usage when running a humongous number of instances.

With this approach there is nothing preventing branches from being done outside the block loops.

Notice that if the branch can happen outside a block, the branch is predictable, so in some cases, if the compiler can see that the variables involved in the decision can't be changed externally, it might move the conditional outside the loop (I think hoisting is the term); it can be helped by copying the variables involved into local-scoped variables (which the optimizer will of course remove). If the compiler doesn't move it, the branch is still predictable and the costs won't be catastrophic unless the code making the decision is expensive to compute.
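For example (hypothetical code): with the members copied into locals, the compiler can see the condition is loop-invariant and move the branch out of the loop.

Code:

/* hypothetical sketch of the "copy into locals" trick */
typedef struct { int bypass; float gain; } fx;

void apply_gain(const fx *f, float *buf, int n) {
    const int   bypass = f->bypass; /* local copies: provably unchanged */
    const float gain   = f->gain;   /* by anything inside the loop      */
    for (int i = 0; i < n; i++) {
        if (!bypass)                /* loop-invariant: a candidate for hoisting/unswitching */
            buf[i] *= gain;
    }
}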

With single-responsibility blocks there is nothing preventing you from creating compositions, either provided by the library or outside of it. The other way around is harder.
stefano-orastron wrote:
But again, there are other uses for sample-by-sample processing that we are considering for the (hopefully not so distant) future...
E.g. ZDF. For ZDF, state snapshots are required; alternatively you implement it as Vadim does in the book, with the G and S abstractions. In this same case the whole allpass cascade might just need 1 set of coefficients and N sets of states - what mystran mentioned about separating states and coefficients.

Even a naive phaser with a unit delay is only possible with a block-processing-only SVF by setting the block size to 1.

In my case, in C++, I have it exactly like that: dummy structs that contain (constexpr) info on how many states and coefficients a "part" requires, and some public static template functions that take either scalars or (GCC/Clang) vectors to do the state reset, coefficient update, and processing.

Then there is a separate set of templates that can merge/cascade the DSP blocks above in flexible ways and take care of the (static) memory. A bit template-heavy and fugly at times, useful at others. So it goes with C++...

There are libraries that go even one step further and use expression templates. I guess that pays off more for memory-bound processes:
https://www.kfrlib.com/

The choice of language (C) also seems like it could leave money on the table if your main audience is desktop/plugins: e.g. Rust won't be able to inline the SVF, and neither will C++, which is the standard language for audio today (or, for that to happen, at least a lot of factors have to align in terms of compilers and flags). If your audience is on tiny embedded devices then those allocations raise even more questions, as presumably your client is an embedded-C coder. Those tend to know about resources, although some have peculiar ways of coding, I concede that.

Post

rafa1981 wrote: Fri Nov 18, 2022 10:42 am Notice that if the branch can happen outside a block, the branch is predictable, so in some cases, if the compiler can see that the variables involved in the decision can't be changed externally, it might move the conditional outside the loop (I think hoisting is the term)
Hoisting usually refers to moving computations higher in the dominator tree (ie. basically earlier in code). This can give you loop invariant code motion when stuff is hoisted out of loops (eg. into a preheader), but it can also be done in other situations (eg. hoist common subexpressions on different sides of a branch to before the branch so you can combine them). This typically doesn't involve rewriting the control flow graph other than to possibly add a preheader (or more generally break what is known as "critical edges") if one is missing.

When a branch condition is loop invariant, the optimization that moves the branch out of the loop and duplicates the loop in each of the conditional branches is known as "loop unswitching", and when the branch condition only differs for the first iteration of the loop (which is kinda common), we can perform what is known as "loop peeling", where the first iteration is unrolled so that the branches can then hopefully be eliminated. In theory a compiler could also do more general "loop splitting" (of which peeling is a special case) if it can reason about some regular pattern (eg. half the loop goes one way, the other half goes the other way), but I don't know how likely this is to happen in practice (although I believe loop splitting is common with auto-vectorization to deal with the excess elements that don't fit the SIMD vector size).

If you have multiple small loops over the same range that are independent of each other (eg. those small "for each sample" loops) then it's also possible to do "loop fusion" (apparently sometimes called "loop jamming") that merges the loop bodies into a single combined loop, while the opposite (splitting one big loop into multiple smaller ones) is known as "loop fission" (sometimes called "loop distribution"), and the tricky part here is profitability. In theory a compiler could first do loop fusion to merge everything, then fission to break it down (eg. to reduce register pressure and increase data locality), but I don't know how aggressively current compilers will try to do this stuff.
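For example, loop unswitching effectively turns code like the commented-out line below into two separate branch-free loops (an illustrative hand-written sketch):

Code:

/* illustrative sketch of loop unswitching done by hand: the loop-invariant
   branch is tested once and the loop body is duplicated for each case */
void scale_or_clear(float *buf, int n, int clear, float gain) {
    /* before: for (i = 0; i < n; i++) { if (clear) buf[i] = 0.f; else buf[i] *= gain; } */
    if (clear) {
        for (int i = 0; i < n; i++)
            buf[i] = 0.f;
    } else {
        for (int i = 0; i < n; i++)
            buf[i] *= gain;
    }
}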

Something to also keep in mind is that one of the biggest optimization hazards in C and C++ is the possibility of pointer aliasing. This is where inlining (or inter-procedural optimization) can really help when dealing with pointers even if the call overhead is negligible, because if the compiler can see whether two pointers (eg. perhaps the input and output buffers) refer to the same or distinct buffers in the outer scope, then it can make more intelligent decisions than if it had to produce code that's correct even if the buffers overlap in arbitrary ways.
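A small sketch of that hazard (hypothetical code): with restrict (or with inlining that proves the buffers are distinct) the compiler is free to keep values in registers and vectorize; without it, it must assume out may overlap a or b and generate defensive loads and stores.

Code:

/* hypothetical sketch: restrict promises the buffers don't overlap */
void mix(float *restrict out, const float *restrict a,
         const float *restrict b, float g, int n) {
    for (int i = 0; i < n; i++)
        out[i] = a[i] + g * b[i];   /* no aliasing: free to vectorize */
}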

Post

mystran wrote: Fri Nov 18, 2022 9:09 am
stefano-orastron wrote: Fri Nov 18, 2022 5:53 am the con is that it's more likely that API users will be tempted to play with the struct directly no matter how you tell them not to (in my experience this stuff happens all the time with some corporate users, and it's a pain in post-sales).
You really can't prevent people from doing this. No amount of opaque structs or access qualifiers or separate compilation units is going to keep people from doing it if they are determined enough. Even shipping binary-only isn't truly foolproof. The standard practice in C is to decorate the public and private symbols in different ways (or use inner structs called "priv" or "impl" or something for stuff you don't want accessed directly), and if that's not enough for someone then really that's their problem...

...except when it really isn't, because the number one reason for people to mess with the internals of something (especially when they aren't completely clueless) is that your public API didn't expose the functionality that they needed. In a sense, the core problem is often too much encapsulation and not enough separation of concerns at the API level.
That's true and I can relate to that. We'll move to exposing the struct and warn users as much as possible not to touch it.
mystran wrote: Fri Nov 18, 2022 9:09 am My two cents is that library design is hard; the biggest challenge is finding the right balance of flexibility and efficiency vs. ease of use, and for most libraries out there you just can't find a good balance with a single API. Most decent libraries have a layered design, where the "core" consists of primitives that really only do one thing well, which can be composed with other primitives to satisfy almost any use case, at the cost of a bunch of boilerplate. Then on top of that many libraries provide a "simplified API" that implements the most common thing.

For example, an image loading library might have a "core" API where you can first decode headers, then do your own allocation and then feed it additional data (perhaps from the network) to incrementally update a progressive image.. or whatever.. and then on top of that the same library might provide a load_image() function that takes just a full buffer of data and returns a pointer to a struct with the image attributes and a data pointer or even load_image_from_file() that does the same with just a filename. This way the library is about as easy to use as they get, yet if you ever want to load progressively over a slow link, the functionality is there.
I agree with that as well, I just don't know what the low-level API could look like at this point, for a variety of reasons.

Post

rafa1981 wrote: Fri Nov 18, 2022 10:42 am
stefano-orastron wrote:
This could become counterproductive: code is also fetched from memory and pressures caches as well, so you might end up operating on closely arranged data using code that is scattered all over the place - and that would easily be the case with non-trivial synths.

OTOH, you would put these decisions in the hands of users and internal buffering could be more directly managed. Again, whether those points are good or bad depends on what the user wants (more responsibility can also be cumbersome).

There are other downsides though: operating sample by sample would either imply the inability to put as many branches as possible outside of loops (it kind of breaks the control-rate vs. audio-rate paradigm, see https://github.com/sdangelo/brickworks/ ... one_pole.c), thus taking potentially huge performance hits, or populating the API with a lot of different processing functions and perhaps coefficient update functions and whatnot (see https://www.orastron.com/brickworks/bw_inline_one_pole for a simple case), and that would only "work" as long as you know in advance which case you'd fall into.
Well, it all depends on who your target user is. Preventing users from doing (obviously) dumb things also prevents them from doing some smart things :). You probably want to avoid providing support to the first group. That's fine: the value of the library won't be in one-poles and SVFs, but in providing things that the average DSP Joe (or below, e.g. me) can't build by himself.
The library is in alpha state, but I guarantee you that especially the current SVF algorithm and the oscillator waveshapers already beat most implementations, free and proprietary, by a wide margin. Don't be fooled by their apparent simplicity. ;-)
rafa1981 wrote: Fri Nov 18, 2022 10:42 am What I (and I think many others) do is to have mostly single-sample basic processing blocks (each with a single responsibility: no smoothing, no block processing, no memory management) and group them:

Code:

for sample in block
  coefficient smoothing
  tiny process 1
  tiny process 2
  tiny process 3

for sample in block
  process 4
  process 5

for sample in block
  big process 6
Then, as CPU usage is far from predictable (and changes across machines), experiment with which combination of blocks in which loops seems to reduce CPU usage when running a humongous number of instances.

With this approach there is nothing preventing branches from being done outside the block loops.
Well, that is true in theory, but in practice I consider smoothing to be an integral part of a DSP algorithm. If not done the proper way it can create all sorts of stability issues and whatnot, and usually you need to know quite a lot about the algorithm internals to make it robust and effective. You can offer parameters to tune it, etc., but there's no sane way to de-encapsulate it in my opinion.
rafa1981 wrote: Fri Nov 18, 2022 10:42 am Notice that if the branch can happen outside a block, the branch is predictable, so in some cases, if the compiler can see that the variables involved in the decision can't be changed externally, it might move the conditional outside the loop (I think hoisting is the term); it can be helped by copying the variables involved into local-scoped variables (which the optimizer will of course remove). If the compiler doesn't move it, the branch is still predictable and the costs won't be catastrophic unless the code making the decision is expensive to compute.
Relying on compiler optimization sounds like a bad idea to me, especially since the library also targets embedded systems which have notoriously bad compilers (cough cough... CCES.. cough).
rafa1981 wrote: Fri Nov 18, 2022 10:42 am
stefano-orastron wrote:
But again, there are other uses for sample-by-sample processing that we are considering for the (hopefully not so distant) future...
E.g. ZDF. For ZDF, state snapshots are required; alternatively you implement it as Vadim does in the book, with the G and S abstractions. In this same case the whole allpass cascade might just need 1 set of coefficients and N sets of states - what mystran mentioned about separating states and coefficients.
Well, again, yes if we can find a workable API model.
rafa1981 wrote: Fri Nov 18, 2022 10:42 am The choice of language (C) also seems like it could leave money on the table if your main audience is desktop/plugins: e.g. Rust won't be able to inline the SVF, and neither will C++, which is the standard language for audio today (or, for that to happen, at least a lot of factors have to align in terms of compilers and flags). If your audience is on tiny embedded devices then those allocations raise even more questions, as presumably your client is an embedded-C coder. Those tend to know about resources, although some have peculiar ways of coding, I concede that.
In all honesty, in my experience I have mostly seen performance issues in plugins caused by bad DSP algorithm design/choice and over-engineered frameworks rather than by lack of inlining, high cache pressure, branch mispredictions, etc. Not that these aren't important - quite the opposite, and that's why I do care about them - but whether Rust or C++ compilers can squeeze the last bits of performance out of a C library is a secondary consideration; plus, there will be C++ wrappers at some point, which will hopefully help adoption.

At the same time, plugin developers and companies waste a ton of time/money reinventing the wheel and porting stuff to new platforms. That is exactly why this library exists in the first place. Perhaps it's not romantic, but it's not about making them tinker with stuff; it's rather about giving them an easy, ready-made solution.

Post

Hi all,

We are glad to announce that we have just released version 0.2.0.

We eventually decided to accept your input and relax encapsulation, also allowing for sample-by-sample processing.

Here's a short list of changes:
  • Refactored API for better flexibility and performance.
  • Added wah, saturation, and pinking filter.
  • Added a new example monophonic synth and two new example effects in VST3 and WebAudio/WebAssembly formats.
  • Added more fast math routines.
  • Bug fixes, improvements, and polish.
More info on the official web page and GitHub repo.
