How to measure the performance improvement gained by reducing the memory usage of a plug-in

DSP, Plug-in and Host development discussion.
KVRer
1 posts since 15 Jun, 2021

Post Mon Jun 14, 2021 10:36 pm

At my company, we’re currently experimenting with reducing the memory usage of our plug-ins, e.g. by replacing lookup tables with splines, by recomputing values on the fly rather than reading them from memory, by building with -Os rather than -O3, and so forth.
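To make the lookup-table trade-off concrete, here's a minimal C++ sketch (the names and the table size are made up for illustration): a small interpolated tanh() table versus calling std::tanh() directly. The table version trades cache footprint for cheaper arithmetic; which one wins depends on what else is competing for the cache.

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical example: a 1024-entry interpolated lookup table for tanh()
// (about 4 KB of cache footprint) versus computing tanh() directly
// (no memory traffic, more arithmetic).
struct TanhLUT {
    static constexpr int N = 1024;
    std::array<float, N + 1> table{};   // +1 entry so interpolation works at the top
    TanhLUT() {
        for (int i = 0; i <= N; ++i)
            table[i] = std::tanh(-4.0f + 8.0f * i / N);  // table spans [-4, 4]
    }
    float operator()(float x) const {
        float pos = (x + 4.0f) * (N / 8.0f);
        if (pos <= 0.0f) return table[0];    // clamp below the table range
        if (pos >= N) return table[N];       // clamp above the table range
        int i = (int)pos;
        float frac = pos - i;
        return table[i] + frac * (table[i + 1] - table[i]);  // linear interpolation
    }
};

// The direct alternative the table would replace.
inline float tanhDirect(float x) { return std::tanh(x); }
```

Benchmarking one of these in isolation will flatter the table (it stays hot in cache), which is exactly the measurement problem below.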

However, measuring the performance improvement gained by doing this is non-trivial, as you really should do this in a context where multiple plug-ins are fighting for the cache (compared to just benchmarking a plug-in in isolation).

Does anyone on this forum have any experience with/advice for how to do this properly?

KVRian
662 posts since 21 Feb, 2006 from FI

Post Tue Jun 15, 2021 12:34 am

Could something similar to DAW Bench be used for this kind of testing? Test old versions of your plug-ins against the new, improved ones.

KVRist
60 posts since 5 Jul, 2018 from Cambridge, UK

Post Tue Jun 15, 2021 1:49 am

Could you run an increasing number of instances of your own plugin in parallel? You'd get a plot of "processing rate vs. parallel instances", something like:
parallel-performance-sketch.png
Improvements might be quantified like "Can run 120 instances in parallel before per-instance performance drops below XYZ, compared to 70 before".

I've used something similar informally, as well as having dealt with similar graphs where you can see different bottlenecks appear (particularly memory) as you increase some parameter.

(EDIT: You'd need to somehow make sure the instances are running fully independently, e.g. not all sharing a single wavetable or FFT instance or whatever)
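A rough harness for that kind of scaling test might look like this C++ sketch (the Instance struct here is a stand-in for a real plugin; in practice you'd host actual instances, each with fully private state, per the EDIT above):

```cpp
#include <chrono>
#include <vector>

// Stand-in for one plugin instance. Each instance owns its state, so
// nothing is shared between them and they genuinely compete for cache.
struct Instance {
    std::vector<float> state = std::vector<float>(1 << 14, 0.0f); // 64 KB private state
    float z = 0.0f;
    void process(float* buf, int n) {
        for (int i = 0; i < n; ++i) {
            z = 0.999f * z + 0.001f * buf[i] + state[i & 0x3fff];
            buf[i] = z;
        }
    }
};

// Returns samples-per-second achieved per instance for a given instance
// count; sweeping numInstances gives the "rate vs. instances" plot.
double perInstanceRate(int numInstances, int blockSize = 512, int blocks = 2000) {
    std::vector<Instance> instances(numInstances);
    std::vector<float> buf(blockSize, 0.1f);
    auto t0 = std::chrono::steady_clock::now();
    for (int b = 0; b < blocks; ++b)
        for (auto& inst : instances)
            inst.process(buf.data(), blockSize);
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    return (double)blockSize * blocks / dt.count();
}
```

Sweep numInstances and watch for the knee where the per-instance rate collapses; that knee moving right after an optimization is the improvement you're trying to quantify.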

KVRAF
6402 posts since 12 Feb, 2006 from Helsinki, Finland

Post Tue Jun 15, 2021 1:53 am

nisw wrote:
Mon Jun 14, 2021 10:36 pm
At my company, we’re currently experimenting with reducing the memory usage of our plug-ins, e.g. by replacing lookup tables with splines, by recomputing values on the fly rather than reading them from memory, by building with -Os rather than -O3, and so forth.
This sort of optimization might or might not be profitable (sometimes it's more important to improve the memory layout rather than to reduce the actual total footprint), but be careful with -Os which (depending on compiler) might end up making some incredibly silly trade-offs just to save a few bytes. If code-size is truly an issue for you, then you might still get better results by just using -Ofast and selectively disabling optimizations that give you the most code expansion.
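For example, with GCC you can keep the hot paths at -O3/-Ofast and size-optimize only the offending functions via an attribute, rather than switching the whole build to -Os. A sketch (function names are made up; Clang spells the size hint [[clang::minsize]]):

```cpp
#include <cassert>

// Hot path: gets whatever -O3/-Ofast the translation unit is built with.
inline float hotPath(float x) { return x * 0.5f + 0.25f; }

// Cold path: forced to optimize for size without touching global flags,
// keeping bulky rarely-run code out of the instruction cache.
// GCC syntax shown; on Clang use [[clang::minsize]] instead.
#if defined(__GNUC__) && !defined(__clang__)
__attribute__((optimize("Os")))
#endif
int coldSetupPath(int tableSize) {
    // imagine bulky, rarely executed initialisation code here
    int checksum = 0;
    for (int i = 0; i < tableSize; ++i) checksum += i;
    return checksum;
}
```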

As for measuring though, I'd usually just put the plugins in a regular "realistic" DAW project (with ASIO latency similar to what you expect your users to tolerate) and see if there's any difference on the host meter, since that's ultimately the only thing that matters when it comes to hitting the real-time deadlines. This is obviously a rather inaccurate way to measure, but IMHO that's not a bad thing: tiny 1% differences often don't translate predictably from one system to another anyway, and any speedup worth having should be obvious.

Sampling profilers (i.e. those that let you profile a release build without instrumentation) are useful for identifying where you're spending the most time and where you might be hitting cache bottlenecks (e.g. a large number of samples at a seemingly harmless instruction usually means some sort of pipeline stall nearby), but ultimately you'll need to accept that the optimal code depends on the DAW, what other plugins are loaded, what the latency settings happen to be, and a large number of other random things.
Preferred pronouns would be "it/it" because according to this country, I'm a piece of human trash.

KVRAF
6402 posts since 12 Feb, 2006 from Helsinki, Finland

Post Tue Jun 15, 2021 1:58 am

signalsmith wrote:
Tue Jun 15, 2021 1:49 am
Could you run an increasing number of your own plugin in parallel?
Even this won't be nearly as useful as you'd first think, because having a large number of instances of the same plugin still means that all the shared stuff will mostly stay in cache (i.e. you probably at least want to mix in some other plugins to get a more realistic situation). Obviously such sharing is desirable if you expect your users to run a lot of instances of the same thing, but if you have a speciality plugin that typically runs as a single instance, and the effective cost of the first instance is much higher than that of the rest, then this might not truly help you.

KVRist
60 posts since 5 Jul, 2018 from Cambridge, UK

Post Tue Jun 15, 2021 2:10 am

mystran wrote:
Tue Jun 15, 2021 1:58 am
a large number of instances of the same plugin still means that all the shared stuff will mostly stay in cache
Was just amending my original comment to note the same thing. :tu:

If you can separate them out somehow (e.g. a special build where all your caching is per-instance), I still think there's benefit to this.
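A special build like that could be as simple as a compile-time switch that gives each instance a private copy of normally shared data. A sketch, with a hypothetical STRESS_PER_INSTANCE_TABLES flag and made-up types:

```cpp
#include <array>
#include <cassert>
#include <cmath>

// Hypothetical stress-test flag: when set, each instance carries a private
// copy of a table that a normal build would share between all instances,
// so N parallel instances really do contend for cache.
#ifndef STRESS_PER_INSTANCE_TABLES
#define STRESS_PER_INSTANCE_TABLES 1
#endif

using Wavetable = std::array<float, 2048>;

static Wavetable makeSine() {
    Wavetable t{};
    for (std::size_t i = 0; i < t.size(); ++i)
        t[i] = (float)std::sin(6.283185307179586 * i / t.size());
    return t;
}

struct Oscillator {
#if STRESS_PER_INSTANCE_TABLES
    Wavetable table = makeSine();               // private 8 KB per instance
    const Wavetable& tab() const { return table; }
#else
    static const Wavetable& tab() {             // normal build: one shared copy
        static const Wavetable shared = makeSine();
        return shared;
    }
#endif
    float phase = 0.0f;
    float tick(float inc) {
        phase += inc;
        if (phase >= 1.0f) phase -= 1.0f;
        return tab()[(std::size_t)(phase * 2048) & 2047];
    }
};
```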

While I agree that the acid test is "how does it perform in a DAW with a realistic project", being able to arbitrarily stress-test something and get metrics out can also be very useful.

KVRian
981 posts since 25 Sep, 2014 from Specific Northwest

Post Tue Jun 15, 2021 10:07 am

Unfortunately, speed is generally proportional to RAM use. If I need to speed up a particular algorithm, for which a better one probably doesn't exist, I throw some more RAM at it.

I use multi-instance projects so very minor speed gains add up to measurable differences outside error margins, and I use time profiling to find bottlenecks in my code. I can then balance acceptable losses in speed v sound quality. (To be honest, I don't worry about RAM usage as I use so little compared to other plugins, but it can easily be another variable here.)

As an aside, use -Ofast rather than -Os for the best speed gains. Smaller object code really only translates to that: a smaller final bundle size. That said, I usually find -Os equals or beats -O3 in most cases for speed.

KVRAF
6402 posts since 12 Feb, 2006 from Helsinki, Finland

Post Tue Jun 15, 2021 10:34 am

syntonica wrote:
Tue Jun 15, 2021 10:07 am
Unfortunately, speed is generally proportional to RAM use. If I need to speed up a particular algorithm, for which a better one probably doesn't exist, I throw some more RAM at it.
Right, but this can backfire if it turns out that the additional cache footprint ends up being more costly than just computing things directly. One needs to find a balance here. The actual access pattern can also matter, sometimes a lot.
As an aside, use -Ofast rather than -Os for the best speed gains. Smaller object code really only translates to that: a smaller final bundle size. That said, I usually find -Os equals or beats -O3 in most cases for speed.
-Ofast is generally the same as -O3 -ffast-math, whereas -Os is typically a completely different thing.

The -ffast-math matters a lot, because it's essentially required to do things like auto-vectorization of floating point code, so you might get large gains out of that... but whether smaller or larger code performs better is complicated.
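A classic example is a floating-point reduction: under strict IEEE semantics the compiler can't reorder the additions, so a loop like this only auto-vectorizes with -ffast-math (or -fassociative-math) enabled:

```cpp
#include <cassert>
#include <vector>

// Under strict IEEE-754 semantics the compiler must add these terms in
// source order, which blocks SIMD; -ffast-math permits reassociation,
// so the same loop can be turned into vector adds plus a final reduce.
float sumSamples(const std::vector<float>& v) {
    float acc = 0.0f;
    for (float x : v) acc += x;
    return acc;
}
```

Note the result can differ slightly between the two builds, since reassociated float addition rounds differently; for audio that's usually acceptable, but it's worth knowing.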

KVRian
981 posts since 25 Sep, 2014 from Specific Northwest

Post Tue Jun 15, 2021 11:23 am

mystran wrote:
Tue Jun 15, 2021 10:34 am
... but whether smaller or larger code performs better is complicated.
As I understand it, this can vary from processor to processor as well, and not just with regard to cache size.

Micro-optimization is a huge rabbit hole to get lost in and I'd rather spend my time coding and finding better algorithms than flipping compiler switches. My mantra is code clean and let the compiler do its work. The last few versions of clang/gcc do phenomenal work and, over unoptimized code, I get 33-50% speed gains, overall. It depends on the patch, of course.

Regarding auto-vectorization, I've found it does a pretty good job. When I monitored it, I only found a couple of loops it missed until I put in the #pragma to let it know it was okay; I think the issue was possibly overlapping writes to arrays. The majority of the other skipped loops were ignored due to no discernible speed gain.
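For reference, the overlapping-writes case usually looks like this: without a guarantee that the output buffer doesn't alias the inputs, the compiler may skip vectorization, and a pragma or __restrict supplies that guarantee. A sketch using the portable OpenMP spelling (GCC also has #pragma GCC ivdep, Clang #pragma clang loop vectorize(assume_safety)):

```cpp
#include <cassert>

// Without a no-aliasing guarantee on out vs. a/b, the compiler must assume
// a store to out[i] could change a later a[i] or b[i], which blocks SIMD.
// __restrict (and/or the simd pragma) promises the buffers don't overlap.
void mix(float* __restrict out, const float* a, const float* b, int n) {
    #pragma omp simd
    for (int i = 0; i < n; ++i)
        out[i] = 0.5f * (a[i] + b[i]);
}
```

The pragma is a promise to the compiler; if the buffers really do overlap, the vectorized loop is allowed to produce wrong results, so only assert what you can guarantee.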
