Approaches to optimization

DSP, Plugin and Host development discussion.

Post

Hi,

I have a working synth that I want to put out on iOS.
The thing is, it's running at 75% CPU with 6 voices on an iPad 3 in release mode, after doing some optimization myself.
Obviously this is not ideal.. how should I approach the optimization?
What are some optimizations that you guys do straight off the bat?
Just wondering if you guys can share some tips! Thanks!

Post

Do you have any idea what is taking the time?

Because there are many, many possible optimizations...

The first is of course RAM allocation:
- use reserve() when you can
- re-use arrays when you can instead of creating/deleting them each time
=> in general, pay attention to RAM allocation pitfalls; these are easy optimizations and usually require almost no code changes
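A minimal sketch of the "allocate once, re-use every block" advice above (the `Voice` struct and its placeholder DSP are my own illustration, not from the post):

```cpp
#include <vector>
#include <cstddef>

// Hypothetical voice processor: the scratch buffer is allocated once in
// the constructor and re-used on every audio block, so the real-time
// process() call never touches the allocator.
struct Voice {
    std::vector<float> scratch; // re-used each block, never reallocated

    explicit Voice(std::size_t maxBlockSize) {
        scratch.reserve(maxBlockSize);      // one allocation up front
        scratch.resize(maxBlockSize, 0.0f);
    }

    void process(float* out, std::size_t n) {
        // BAD:  std::vector<float> tmp(n); // would allocate on the audio thread
        // GOOD: re-use the preallocated scratch buffer
        for (std::size_t i = 0; i < n; ++i) {
            scratch[i] = 0.5f;              // ...real DSP work goes here...
            out[i] = scratch[i];
        }
    }
};
```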

Second, I would look at caching big algorithms:
- keep an approximate and/or pre-computed version of some elements, to avoid re-computing everything...
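One common form of this caching, as a sketch (the filter-coefficient example and names are my own, not from the post): only recompute an expensive value when its input actually changes.

```cpp
#include <cmath>

// Recompute a (hypothetically expensive) one-pole filter coefficient only
// when the cutoff frequency changes, instead of on every sample/block.
struct CachedCutoff {
    float lastFreq = -1.0f;  // sentinel: forces recompute on first call
    float coeff = 0.0f;

    float get(float freq, float sampleRate) {
        if (freq != lastFreq) {            // only recompute on change
            coeff = std::exp(-2.0f * 3.14159265f * freq / sampleRate);
            lastFreq = freq;
        }
        return coeff;                      // cached value otherwise
    }
};
```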


Again, it's too "global" a question to give a more precise answer :)

Post

First, analyse your code with the help of a profiler. This tool measures the time spent in each of your functions during runtime and prints reports, so you know where to start with optimization.
hollyhook - adaptive music technology
http://hollyhook.de

Post

Have you enabled all compiler optimizations? Use "Optimization Level: Fastest [-O3]" in the build settings. You might already know this, but don't put a single statement of Objective-C code into the audio render callback. Use C/C++ only.

Yes, I agree with Constanze. Profile your code in Xcode Instruments; you will immediately find what's taking so much of the CPU. Whatever you do, don't go at it by gut feeling. The most important thing is to have some way of measuring performance. At some point you are going to make bigger changes trying to gain performance; that's when you'll start to gain some and lose some. The Xcode profiler isn't the best tool for optimizing tiny fragments of code. Find a better profiler or write your own.

In my experience the processors on iDevices aren't as fast relative to memory speed as they are on PC. So look-up tables may be worth a try, even if they'd be a bit last season in PC DSP. Processing interleaved chunks is also worth a try.
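As a concrete sketch of the look-up table idea (table size, names and the linear interpolation are my own choices, not from the post):

```cpp
#include <cmath>
#include <cstddef>

// Minimal sine look-up table with linear interpolation. One extra entry
// at the end lets the interpolation read table[i + 1] without wrapping.
constexpr std::size_t kTableSize = 4096;
static float gSineTable[kTableSize + 1];

void initSineTable() {
    for (std::size_t i = 0; i <= kTableSize; ++i)
        gSineTable[i] = (float)std::sin(2.0 * M_PI * i / kTableSize);
}

// phase in [0, 1) maps to one full sine cycle
float sineLUT(float phase) {
    float pos = phase * kTableSize;
    std::size_t i = (std::size_t)pos;
    float frac = pos - (float)i;           // fractional part for interpolation
    return gSineTable[i] + frac * (gSineTable[i + 1] - gSineTable[i]);
}
```

Whether this beats `sinf()` depends on the device's cache behaviour, so it is worth measuring rather than assuming.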

Then, most of the code you'll find in books or on the net, e.g. here, isn't optimized (and usually cannot be). It takes an insane amount of work to squeeze the most out of the processor.

Post

Hard to say without code, but my suggestion would be that if you are working exclusively on iOS/OSX -- as in, there are no concerns with portability to other platforms -- 1) use the Time Profiler developer tool in Xcode to find your hotspots and zone in on that code, and 2) be sure to make liberal use of the Accelerate framework... it is built-in and very good.

A simple example: if you are generating a sine with a for loop like this (just pulling some code from memory of a demo vid):

Code:

float buffer[length];
float idx[length];   // input phase values, filled elsewhere

for (int i = 0; i < length; i++)
{
    buffer[i] = sinf(idx[i]);
}
then just using vvsinf from vForce in Accelerate like this:

Code:

int n = length;
vvsinf(buffer, idx, &n);  // note: vForce takes the element count by pointer
can nearly triple the speed of that loop.

Post

That Time Profiler looks the money. How is it for profiling VSTs or Reason Rack Extensions? Is it difficult to set up?

Post

tfrog wrote:Hi,

I have a working synth that I want to put out on iOS.
The thing is, it's running at 75% CPU with 6 voices on an iPad 3 in release mode, after doing some optimization myself.
Obviously this is not ideal.. how should I approach the optimization?
What are some optimizations that you guys do straight off the bat?
Just wondering if you guys can share some tips! Thanks!
- Reduce the number of operations in the "inner loop" (the loop that calculates each sample).
- Make sure the inner loop has no slow math operations. The following operations are slow: / (division by non constant value), %, sqrt, pow, exp, log, sin, cos, tan, asin, acos, atan...
- Also make sure the inner loop doesn't call potentially slower library functions like malloc, new, free, delete, printf, adding/removing elements from stuff like std::map or std::list, call file access functions, do lots of virtual function calls...
- Use -O3 (debug builds don't have this flag and are much slower - the compiler doesn't optimize anything)
- Don't calculate envelopes and LFOs every sample. It's generally fine if you calculate them every 16 samples or so (and ramp volume changes). Same goes for filter settings.
- Sample/oscillator position counters are generally faster as integers than floating point (and I think ARM CPUs don't like [float]->[int memory offset] transfers).
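The control-rate point above can be sketched like this (the toy envelope, the 16-sample period and all names are mine, not from the post): the expensive envelope calculation runs once per 16-sample chunk, and a linear ramp smooths the gain between updates.

```cpp
#include <cstddef>

constexpr std::size_t kControlPeriod = 16;  // envelope updated every 16 samples

// Toy stand-in for an expensive envelope/LFO calculation.
float expensiveEnvelopeValue(std::size_t blockIndex) {
    return 1.0f / (1.0f + 0.01f * (float)blockIndex);
}

void renderBlock(float* out, const float* in, std::size_t n) {
    float gain = expensiveEnvelopeValue(0);
    for (std::size_t i = 0; i < n; i += kControlPeriod) {
        // one expensive call per chunk, not per sample
        float target = expensiveEnvelopeValue(i / kControlPeriod + 1);
        float step = (target - gain) / (float)kControlPeriod; // ramp increment
        std::size_t end = (i + kControlPeriod < n) ? i + kControlPeriod : n;
        for (std::size_t j = i; j < end; ++j) {
            out[j] = in[j] * gain;  // cheap per-sample work only
            gain += step;           // ramp avoids zipper noise
        }
    }
}
```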

Post

Knowing exactly what is slow helps. But some general approaches help:

-Approximations for mathematical functions. When sin()/cos() take 100 cycles and your approximation for them takes less than 15, you need an approximation. For precision, speed and low memory usage I use a lookup table that contains quadratic coefficients fitting small segments of the function, so then you only have to compute (c2*x + c1)*x + c0. Here's a fixed point example. To know how much precision you really need from your approximation, just do -96 dB * sqrt(number of such operations needed to make a single sample). Even integer division can be sped up using LUTs of quadratic coefficients! Know that some built-in functions can be extremely slow, like pow()! Again, an accurate approximation might help, like this one.
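A sketch of that quadratic-segment table for sin() (segment count, the fit-through-three-points construction and all names are my own; the post only gives the (c2*x + c1)*x + c0 evaluation form):

```cpp
#include <cmath>
#include <cstddef>

// Each table entry holds coefficients (c2, c1, c0) of a quadratic fitting
// sin() over one small segment, evaluated as (c2*x + c1)*x + c0 where x is
// the position within the segment in [0, 1).
constexpr std::size_t kSegments = 256;
struct Quad { float c2, c1, c0; };
static Quad gSinSeg[kSegments];

void initSinSegments() {
    for (std::size_t s = 0; s < kSegments; ++s) {
        double a = 2.0 * M_PI * s / kSegments;      // segment start angle
        double w = 2.0 * M_PI / kSegments;          // segment width
        double y0 = std::sin(a);                    // value at start
        double ym = std::sin(a + 0.5 * w);          // value at middle
        double y1 = std::sin(a + w);                // value at end
        // quadratic passing through the start, middle and end points
        gSinSeg[s].c0 = (float)y0;
        gSinSeg[s].c1 = (float)(4.0 * ym - 3.0 * y0 - y1);
        gSinSeg[s].c2 = (float)(2.0 * y0 + 2.0 * y1 - 4.0 * ym);
    }
}

// phase in [0, 1) maps to [0, 2*pi)
float fastSin(float phase) {
    float pos = phase * kSegments;
    std::size_t s = (std::size_t)pos;
    float x = pos - (float)s;                       // position within segment
    const Quad& q = gSinSeg[s];
    return (q.c2 * x + q.c1) * x + q.c0;
}
```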

-Fixed point arithmetic. If you don't need an FFT, consider fixed point arithmetic! If you do it right you can still get more efficiency out of it than floating point. One crucial advantage fixed point arithmetic has is that you can turn a fixed point number into an integer index without the potentially quite costly conversion from float to integer. That being said, you can use the upper bits of the floating point mantissa as an integer index. For instance, if you bring your float number to the range [1.0, 2.0) then the upper N bits of the mantissa turn into an integer in the [0, 2^N - 1] range, which might be useful. You can even combine the upper bits of the mantissa with the lower bits of the exponent.
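The mantissa trick above, as a minimal sketch (function name is mine; this assumes IEEE 754 single precision, where the mantissa occupies the low 23 bits):

```cpp
#include <cstdint>
#include <cstring>

// For a float x in [1.0, 2.0), the top N mantissa bits form an integer in
// [0, 2^N - 1] that can be read out with bit operations alone, with no
// float-to-int conversion instruction.
uint32_t mantissaIndex(float x, unsigned nBits) {
    uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);           // well-defined type pun
    return (bits >> (23 - nBits)) & ((1u << nBits) - 1); // top N of 23 mantissa bits
}
```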

-Look-up tables. Not just for approximations but also sometimes to store all the coefficients and indices you need. LUTs are a good way to avoid computing the exact same thing over and over. You can also have some stored in the filesystem, so instead of initialising something complicated and lengthy at startup you can just load it from a file.

-Reconsidering your whole approach or organisation. Nothing gives bigger boosts than looking at the big picture and rethinking how you do it. Maybe your per-sample approach would work better with time chunks and FFTs. Maybe you can reorder/rearrange your operations so that a slow part gets done less often, for instance like MadBrain said not recalculating your envelopes every sample.

-Being careful with memory usage. It helps to be considerate with memory sizes, cache misses, not copying things pointlessly (for instance using a circular buffer when appropriate), not deallocating/allocating memory all the time and so on. A large amount of your "CPU time" can be taken up by slow memory operations. A couple of cycles to access memory in L1 cache becomes hundreds of cycles when it's in main memory (which is why most of my LUTs are around 3 kB) and even worse when it's swapped.
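The circular-buffer point above, as a minimal delay-line sketch (size, names and the power-of-two wrap trick are my own illustration): instead of shifting the whole buffer every sample, only the write position moves.

```cpp
#include <cstddef>

// Minimal circular-buffer delay line. The size is a power of two so the
// wrap-around is a cheap bitwise AND rather than a modulo or a branch.
constexpr std::size_t kDelaySize = 1024;   // must be a power of two

struct DelayLine {
    float buf[kDelaySize] = {};            // zero-initialised history
    std::size_t writePos = 0;

    float process(float in, std::size_t delaySamples) {
        // read the sample written delaySamples ago (unsigned wrap is fine)
        float out = buf[(writePos - delaySamples) & (kDelaySize - 1)];
        buf[writePos] = in;                // overwrite oldest slot
        writePos = (writePos + 1) & (kDelaySize - 1);
        return out;
    }
};
```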
Developer of Photosounder (a spectral editor/synth), SplineEQ and Spiral
