Post

Guillaume Piolat wrote: Tue Jul 14, 2020 9:08 am Ardour for Windows/macOS can host LV2 built for that OS. LV2 can work outside Linux. They can receive a Cocoa or HWND window handle.
Do you know how to compile LV2 on Windows? I've downloaded it, but nothing is straightforward. I've looked at the docs and the Git repository.
Last edited by S0lo on Fri Jul 17, 2020 11:55 pm, edited 1 time in total.
www.solostuff.net
Advice is heavy. So don’t send it like a mountain.

Post

syntonica wrote: Fri Jul 17, 2020 7:07 pm
mystran wrote: Fri Jul 17, 2020 5:47 pm Wrong. The normalized parameter is not the "value" of the knob, but a fraction of its range in terms of the "perceptibly useful" parameterization. This is precisely what you want, so that you don't need to worry about all the complications with the actual value ranges or mapping curves on the API level. Stepped (or enumerated) parameters are another thing, I agree those are useful, but for continuous parameters anything other than normalized is just silly.
I just meant their useless convenience methods. Of course, the actual values should be stored as 0 to 1, whether float or double. I'm just saying that when it comes to displayed values, how I describe it covers the vast majority of possibilities.
Right.

A pretty nice way to do it internally in modern C++ is to have a generic parameter class with some std::function<float(float)> slots to convert back and forth. This way you can just plug functions or lambdas there as "configuration" without having to have tons of different actual parameter classes, but it's still code so you can do arbitrarily complicated things if you really want to.

Then you can also have a few more std::function slots to select from a set of generic string conversions for UI and the host (eg. pick the number of digits or whether to show a sign) that call the conversion routines to deal with the actual values.

You don't even need to write different conversions for different ranges though. Rather you can just use generic helpers that setup conversions for the most common curves with lambdas that capture the actual range. Something like this:

Code: Select all

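void Param::setupExp(float lo, float hi)
{
   // requires 0 < lo < hi; std::log and std::exp come from <cmath>
   float logRange = std::log(hi/lo);
   // normalized [0,1] -> actual value along an exponential curve
   this->normalizedToValue = [=](float v) -> float { return lo * std::exp(logRange * v); };
   // actual value -> normalized [0,1]
   this->valueToNormalized = [=](float v) -> float { return std::log(v/lo) / logRange; };
}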
Then in your init code you can configure a cutoff parameter for an exponential map from 20Hz to 20kHz like so:

Code: Select all

params.cutoff.setupExp(20.f, 20e3f);
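
To make the idea a bit more concrete, here is a minimal self-contained sketch of such a parameter class. Everything except setupExp, normalizedToValue and valueToNormalized is an assumption made up for illustration (the valueToString slot, setupLinear, setupFormat, value() and display() are hypothetical names), so take it as one possible shape rather than the actual implementation.

```cpp
#include <cassert>
#include <cmath>
#include <cstdio>
#include <functional>
#include <string>

// Hypothetical generic parameter: the stored state is always normalized [0,1];
// the std::function slots carry the per-parameter "configuration".
struct Param
{
    float normalized = 0.f;

    std::function<float(float)> normalizedToValue;    // [0,1] -> actual value
    std::function<float(float)> valueToNormalized;    // actual value -> [0,1]
    std::function<std::string(float)> valueToString;  // actual value -> display text

    // Exponential map between lo and hi (requires 0 < lo < hi).
    void setupExp(float lo, float hi)
    {
        assert(lo > 0.f && hi > lo);
        float logRange = std::log(hi / lo);
        normalizedToValue = [=](float v) { return lo * std::exp(logRange * v); };
        valueToNormalized = [=](float v) { return std::log(v / lo) / logRange; };
    }

    // Linear map between lo and hi.
    void setupLinear(float lo, float hi)
    {
        assert(hi != lo);
        normalizedToValue = [=](float v) { return lo + (hi - lo) * v; };
        valueToNormalized = [=](float v) { return (v - lo) / (hi - lo); };
    }

    // Generic string conversion: fixed number of digits plus a unit suffix.
    void setupFormat(int digits, std::string unit)
    {
        valueToString = [=](float value) {
            char buf[64];
            std::snprintf(buf, sizeof buf, "%.*f %s", digits, value, unit.c_str());
            return std::string(buf);
        };
    }

    // Actual value and display text, both derived from the normalized state.
    float value() const { return normalizedToValue(normalized); }
    std::string display() const { return valueToString(value()); }
};
```

With this, a cutoff parameter would be configured with setupExp(20.f, 20e3f) followed by setupFormat(1, "Hz"), and the UI or host glue only ever calls value() and display() without knowing which curve sits behind them.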

Post

Music Engineer wrote: Fri Jul 17, 2020 1:06 pm
mystran wrote: Fri Jul 17, 2020 1:03 pm What I'd propose for output is that if a plugin wants to send events, it should (pre-)allocate space for the events and space for the array of pointers internally and build the whole event list internally (just like the host is doing) and then just store the pointer and number of events back into the processInfo.
aha! sounds good! :tu: edit: yeah - actually, much better than what i said. if the plugin wants to pass the events through, it would have to explicitly copy the content of the inEvents to the outEvents buffer. that seems to make a lot more sense than what i was babbling about "implicit pass-through unless clearing"
Even if you want some events to pass-thru, you probably don't want them all (eg. pass MIDI but not automation) and you usually want to change the list in one way or another, so I don't think the "pass everything as-is" use-case is that common.

Whether the plugin output event list should be allowed to have pointers to the input events is another thing, which really could be specified either way. I'm personally slightly in favor of forcing a copy, so that the host is free to throw away the input before it has processed the output.
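
A rough sketch of what that could look like, assuming a forced copy. None of these names come from an existing API: Event, EventType, ProcessInfo and its fields, and the FilteringPlugin class are all hypothetical, and the "pass MIDI, drop automation" filter is just the example from above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// All names here are hypothetical; this is not an existing plugin API.
enum class EventType : std::uint8_t { Midi, Automation };

struct Event
{
    EventType     type;
    std::uint32_t frameOffset;  // position inside the current block
    std::uint8_t  data[4];      // payload (eg. raw MIDI bytes)
};

struct ProcessInfo
{
    const Event* const* inEvents;   // pointer array built by the host
    std::size_t         nInEvents;
    const Event* const* outEvents;  // pointer array built by the plugin
    std::size_t         nOutEvents;
};

class FilteringPlugin
{
public:
    // Pre-allocate event storage and the pointer array once, outside process().
    explicit FilteringPlugin(std::size_t maxEvents)
        : storage(maxEvents), pointers(maxEvents) {}

    void process(ProcessInfo& info)
    {
        assert(info.inEvents != nullptr || info.nInEvents == 0);
        std::size_t n = 0;
        for (std::size_t i = 0; i < info.nInEvents && n < storage.size(); ++i)
        {
            const Event& in = *info.inEvents[i];
            if (in.type != EventType::Midi) continue;  // pass MIDI, drop automation
            storage[n]  = in;                          // forced copy, no aliasing of input
            pointers[n] = &storage[n];
            ++n;
        }
        // Store the pointer and the number of events back into the processInfo.
        info.outEvents  = pointers.data();
        info.nOutEvents = n;
    }

private:
    std::vector<Event>        storage;   // event bodies owned by the plugin
    std::vector<const Event*> pointers;  // pointer array handed to the host
};
```

Because the output only ever points into the plugin's own storage, the host can discard or reuse the input list as soon as process() returns.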

Post

Deducing the crazy permutations of my previous idea, I have come to the conclusion it should not be part of the spec.

In replacement, I propose that the host gather the plugin's parameter "existence state" through a method that returns a list of functional parameters. What makes this different from what prior specs already do? The host would be allowed to call this function whenever it sees fit. Perhaps the plugin can inform the host of parameter existence state changes with its own dispatch, plus a flag to indicate that it will never do this. This aligns with karrikuh's idea that parameters should be simple to replace or update in version increments.
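
In code, the proposal might look roughly like the sketch below. All names (ParamId, PluginParams, functionalParams, paramListIsStatic, onParamListChanged) are invented for illustration and nothing here is part of any agreed spec.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical names throughout; no existing plugin API is implied.
using ParamId = std::uint32_t;

class PluginParams
{
public:
    // The flag: true means the plugin promises the list below never changes.
    bool paramListIsStatic = false;

    // The plugin's "own dispatch": the host installs this callback to be told
    // that it should query the list again.
    std::function<void()> onParamListChanged;

    // The host may call this whenever it sees fit.
    const std::vector<ParamId>& functionalParams() const { return active; }

    // Plugin-side change, eg. after a version increment replaced a parameter.
    void setFunctionalParams(std::vector<ParamId> ids)
    {
        assert(!paramListIsStatic);  // a static plugin must never get here
        active = std::move(ids);
        if (onParamListChanged) onParamListChanged();
    }

private:
    std::vector<ParamId> active;  // ids of the parameters that currently exist
};
```

A host that sees paramListIsStatic set can cache the list once; otherwise it re-queries either on notification or at any time of its own choosing.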

Post

this is perhaps a wild idea, but i wonder, what the performance implications of a design with interleaved buffers would be. i can imagine that when starting with a clean slate, it may be worth contemplating such a deviation from mainstream plugin interfaces. i've heard that microsoft's direct x plugins (does anyone remember these? :hihi:) also had that approach. i think, it could lead to better memory access patterns: samples that are needed at the same instant are neighbors in memory whereas with the common one-buffer-per-channel approach, they would feature a stride of at least the buffersize. moreover, in the (special, but (for me) common) case of double-precision stereo buffers, you could just cast the pointer to __m128d and run simd code on it directly.

obviously, the downside would be that you would need additional de/interleaving conversions to interoperate with what has become the standard, which is bad for performance. :shrug: ...just an idea that i wanted to throw in the room to be ripped apart
My website: rs-met.com, My presences on: YouTube, GitHub, Facebook

Post

Music Engineer wrote: Wed Apr 21, 2021 6:53 am this is perhaps a wild idea, but i wonder, what the performance implications of a design with interleaved buffers would be.
I'm not convinced there's any performance advantage, rather I'd expect it to have a negative impact as most of the time dealing with interleaved formats internally is just a pain, so just about anything that isn't a simple EQ would then have to unpack and pack on the API boundary (and then the same thing again at the host side). Not that such packing is particularly expensive, but like.. what's the point?
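
For reference, the unpack and pack steps at the API boundary would be something along these lines. This is a plain scalar sketch with made-up function names (deinterleave, interleave), assuming float samples and frames laid out as L0 R0 L1 R1 and so on.

```cpp
#include <cassert>
#include <cstddef>

// Interleaved frames (L0 R0 L1 R1 ...) -> one buffer per channel.
void deinterleave(const float* in, float* const* out,
                  std::size_t nChannels, std::size_t nFrames)
{
    assert(in != nullptr && out != nullptr);
    for (std::size_t ch = 0; ch < nChannels; ++ch)
        for (std::size_t i = 0; i < nFrames; ++i)
            out[ch][i] = in[i * nChannels + ch];
}

// One buffer per channel -> interleaved frames.
void interleave(const float* const* in, float* out,
                std::size_t nChannels, std::size_t nFrames)
{
    assert(in != nullptr && out != nullptr);
    for (std::size_t ch = 0; ch < nChannels; ++ch)
        for (std::size_t i = 0; i < nFrames; ++i)
            out[i * nChannels + ch] = in[ch][i];
}
```

Each call is one linear pass over the block, so it is cheap, but with an interleaved API it would be paid once on the way in and once on the way out by every plugin that prefers to work on separate channels.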

Post

yes, as said - in a world where the standard is one-buffer-per-channel, the potential advantages may be eaten up by the conversion requirements. i was just imagining an ideal clean-slate situation, where such things are of no concern - say, when you are in a position to program the host and the plugins and can handle it uniformly all throughout. which layout would then be preferable? as said, the point would be (perhaps) a better memory layout (less hopping around for accessing individual sample frames) and more direct simd opportunities. of course, i'm assuming a situation here, where all samples for a particular sample instant n must be handled simultaneously by the algorithm - as in a "processSampleFrame(double* frameIn, double* frameOut)" function, where the "frames" could here be multichannel frames. because this sample-by-sample processing is the way, i typically write my realtime dsp code. it's all based on the general rule that for good performance, data that is needed simultaneously should be adjacent in memory - what is used together should sit together. i don't really know, if that matters here. perhaps one should benchmark both approaches in a realistic scenario

Post

Music Engineer wrote: Wed Apr 21, 2021 9:14 am yes, as said - in a world where the standard is one-buffer-per-channel, the potential advantages may be eaten up by the conversion requirements. i was just imagining an ideal clean-slate situation, where such things are of no concern - say, when you are in a position to program the host and the plugins and can handle it uniformly all throughout.
You're missing the point. I'm trying to argue that in an "ideal clean-slate situation" the ideal design is to have one buffer per channel, because that's usually the more convenient layout to work with. Interleaved formats are only useful for a few things (eg. running multiple channels of IIR filters in parallel using SIMD) and a huge pain for just about everything else.
Music Engineer wrote: Wed Apr 21, 2021 9:14 am it's all based on the general rule that for good performance, data that is needed simultaneously should be adjacent in memory - what is used together should sit together. i don't really know, if that matters here. perhaps one should benchmark both approaches in a realistic scenario
The data layout argument doesn't really fly. Random memory access is bad, sure.. but streaming for two buffers as opposed to just one doesn't make much of a practical difference, since the access pattern is still just as nice, straight through the memory.

Post

Music Engineer wrote: Wed Apr 21, 2021 9:14 am a better memory layout (less hopping around for accessing individual sample frames)
Is there a noticeable difference on modern architectures?
Music Engineer wrote: Wed Apr 21, 2021 9:14 am and more direct simd opportunities
What are those opportunities? Nowadays, typical simd works with 128/256-bit vectors. Stereo 32-bit float is only 64 bits. Opportunities seem to arise only if the processing of a sample does not depend on the previous one, which is really a trivial case.

Post

Vokbuz wrote: Thu Apr 22, 2021 7:17 am
Music Engineer wrote: Wed Apr 21, 2021 9:14 am and more direct simd opportunities
What are those opportunities? Nowadays, typical simd works with 128/256-bit vectors. Stereo 32-bit float is only 64 bits. Opportunities seem to arise only if the processing of a sample does not depend on the previous one, which is really a trivial case.
Interleaved formats really don't scale either. Realistically any API needs to support more than just stereo and if you have something like L2 ambisonics with 9 channels, there's nothing "nice" about having it all interleaved anymore.

Post

mystran wrote: Wed Apr 21, 2021 9:31 pm but streaming for two buffers as opposed to just one doesn't make much of a practical difference, since the access pattern is still just as nice, straight through the memory.
ok, thanks for clarification. i only have a superficial understanding of how caches work and i'd actually totally prefer to be wrong on this. so, that means, the (2 or more) channels just go into different cache lines, right?
Vokbuz wrote: Nowadays, typical simd works with 128/256-bit vectors. Stereo 32-bit float is only 64 bits.
well, yeah. i was specifically thinking about the SSE2 __m128d datatype that seems to be just crying to be used for processing two audio channels simultaneously in double precision - which is indeed what 95% of my dsp code needs. ...but i see that i'm biased here. but that wasn't my main point anyway

Post

Music Engineer wrote: Fri Apr 23, 2021 9:04 am
mystran wrote: Wed Apr 21, 2021 9:31 pm but streaming for two buffers as opposed to just one doesn't make much of a practical difference, since the access pattern is still just as nice, straight through the memory.
ok, thanks for clarification. i only have a superficial understanding of how caches work and i'd actually totally prefer to be wrong on this. so, that means, the (2 or more) channels just go into different cache lines, right?
In the interest of brevity (ie. not writing a whole book) I'm going to drastically simplify things, but here's roughly what every programmer should know about caches:

1. Memory access generally operates by cache lines, which are most commonly 64 bytes (eg. x86), but could be slightly smaller (eg. 32 bytes) or larger (eg. 128 bytes) on some other CPUs. When you read a single byte of memory, you're actually first fetching the whole cache line through the cache hierarchy into L1, then reading the single byte from there. For normal "write-back" memory (ie. basically ignoring MMIO), when you're writing a single byte, you're first reading the whole cache line into L1 (well, technically this can be reordered too; it's really the loads that are worse than writes, except see below on multi-core), then modifying the 1 byte there, then "eventually" writing the cache line back to main memory.

2. CPUs generally try to hide latency by using so-called "re-order buffers" (ROB) where you decode instructions into a queue, then process them as the data becomes available. When you have an operation that needs something from the memory, the load is issued as soon as the address is available, while the rest of the instruction sits in the ROB until the data is available. In the meantime, the CPU can try to complete something else. So if you have an access pattern such as array indexing where all the addresses can be computed without dependency on previous loads, the CPU can usually hide the latency (or at least some of it), whereas if you had something like a linked list where every address depends on the result of the previous load, the loads are now on the "critical path" and no amount of reordering is going to hide the latency.

3. Most modern CPUs also have some special purpose "prefetch logic" where if you fetch several cache lines in a straight sequence (it works in both forward or backward direction), then the CPU is going to assume that you're probably going to want the next one too. In this case, it'll issue the load (into cache) even before the address has been computed. This allows for hiding longer latencies than what you can get out of just the ROB. While I'm not entirely up to date with the detail of how it's implemented these days, you can generally expect the CPU to handle at least a couple of separate streams just fine.

4. When you have multiple cores running multiple threads, there's one additional detail to worry about. Since we're doing our writes into our local cache lines and then writing them back, for correctness we need to make sure that only one CPU core is modifying any given cache line at any given time. The way this works is that we only allow multiple cores to share a given cache line in read only mode. If a given core wants to write, it will tell all the other cores to drop that cache line first (on architectures with hardware cache-coherency protocol such as x86 this happens automatically; on architectures without such a protocol you need a manual fence instruction). It should be fairly obvious that if multiple threads want to constantly write to the same cache line, then there is going to be an awful lot of cache flushing and the performance is going to be poor.

So.. the implications (in terms of writing code that is cache friendly):

1. your memory bandwidth is a function of the number of cache lines (=64 bytes on x86) that you touch, rather than the actual number of bytes you use; if you fetch a cache line, then you should try to use it all

2. address computation that depends on previous loads tends to hurt latency hiding that you otherwise get from the ROB

3. simple access patterns are even better, because you can start prefetching even before you have the address; multiple streams are fine, going forwards or backwards is fine, but jumping around randomly is bad

4. avoid sharing cache lines (that are written to; read-only sharing is fine) between threads like COVID-19, unless you want your multi-threaded code to be (sometimes a LOT) slower than your single-threaded code; make sure you avoid "false sharing" too where the threads access different variables that are close enough to actually sit in the same cache line
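
As a small illustration of point 4, this is one way to keep per-thread data out of each other's cache lines. It is only a sketch under the assumption of 64-byte lines (the x86 figure above); kCacheLine, PackedSlot, PaddedSlot and sameCacheLine are made-up names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as on x86. Other CPUs may differ.
constexpr std::size_t kCacheLine = 64;

// Per-thread state packed tightly: neighbors share a cache line, so two
// threads writing to slots[0] and slots[1] would keep invalidating each
// other's copy of the line ("false sharing").
struct PackedSlot { std::uint64_t counter = 0; };

// Same data, but each slot is aligned (and thereby padded) to a full line.
struct alignas(kCacheLine) PaddedSlot { std::uint64_t counter = 0; };

static_assert(sizeof(PackedSlot) == 8, "eight packed slots fit in one 64-byte line");
static_assert(sizeof(PaddedSlot) == kCacheLine, "one padded slot per line");

// True if the two addresses fall into the same cache line.
inline bool sameCacheLine(const void* a, const void* b)
{
    assert(a != nullptr && b != nullptr);
    auto line = [](const void* p) {
        return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
    };
    return line(a) == line(b);
}
```

The padded version wastes 56 bytes per slot, which is the usual trade: a little memory in exchange for threads that never fight over the same line.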

Post

Finally.. just because you pass a separate buffer for each channel doesn't necessarily mean you need to do a separate heap alloc for each buffer. You can still allocate one buffer of blockSize*nChannels and then you can take nFrames worth of samples (or round it up to SIMD size if you want to) from that buffer for each channel in order. This way for very short blocks you can still put all the channels in the same cache line, even if they come one after another.
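
Something like the following sketch, where ChannelBuffers and prepare() are hypothetical names and the rounding to SIMD size is left out to keep it short:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: one allocation, handed out as one pointer per channel.
class ChannelBuffers
{
public:
    // Allocate blockSize * nChannels samples once, up front.
    ChannelBuffers(std::size_t nChannels, std::size_t blockSize)
        : storage(nChannels * blockSize, 0.f), pointers(nChannels), maxFrames(blockSize) {}

    // Lay the channels out back to back, nFrames samples each, for this block.
    // No allocation happens here, so it is safe to call per process() block.
    float* const* prepare(std::size_t nFrames)
    {
        assert(nFrames <= maxFrames);
        for (std::size_t ch = 0; ch < pointers.size(); ++ch)
            pointers[ch] = storage.data() + ch * nFrames;
        return pointers.data();
    }

private:
    std::vector<float>  storage;    // the single shared buffer
    std::vector<float*> pointers;   // per-channel pointers into storage
    std::size_t         maxFrames;  // largest block the storage can hold
};
```

With nFrames = 8 and stereo float, both channels together occupy 64 bytes, so they can end up in a single cache line if the allocation happens to be aligned that way, while the plugin still just sees one pointer per channel.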

Post

thanks for the explanation. i was only aware of 1 and 3 but had the misconception that it works only for a single stream. ...sooo - nevermind then.

but i tend to think, the timing was appropriate to bump this thread for other reasons :wink:

Post

Music Engineer wrote: Fri Apr 23, 2021 7:13 pm thanks for the explanation. i was only aware of 1 and 3 but had the misconception that it works only for a single stream. ...sooo - nevermind then.
I would actually generally expect the inputs to mostly sit in L1 already by the time process() reaches the plugin and if they do then all of this is moot, because it'll take the same time to access no matter what.

edit: It obviously depends a bit on how the host schedules the plugins, but if it's not at least trying to schedule simple chains of plugins on the same thread in order, so that the buffers stay in cache, then it's really not doing its job right.
