Optimization

DSP, Plugin and Host development discussion.

Post

I could do with learning a bit more about optimization techniques for C/C++. In particular for AMD or Intel.

Does anyone know of any decent books / resources?

Post

texture,

Don't know of any books. It is a bit of a "gray art" these days and a lot of the older techniques will actually slow down a modern processor. Your best bet may be to check the music-dsp mailing list archives for information.

I'll post some specific examples in a later reply, but just to get things started, here are the basics of AMD/Intel optimization.

Vector Optimization:
SSE lets you load 16 bytes of memory, representing 4 floats, into one of 8 giant 128-bit XMM registers (16 of them in 64-bit mode). You can then do most floating point operations using it with other 128-bit XMM registers. It basically does 4 floating point operations in the time it normally takes to do one. You can then "stream" the register back to another place in memory. Very useful for volume envelopes, FIR filters, and mixing.
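To make the load / operate / store pattern above concrete, here is a minimal sketch of a gain stage written with the SSE intrinsics from `<xmmintrin.h>`. The function name `apply_gain` and the `HAVE_SSE` macro are made up for this example, and the scalar loop doubles as a fallback when SSE is not available, so treat it as an illustration rather than a tuned routine.

```cpp
#include <cassert>
#include <cstddef>

// HAVE_SSE is a hypothetical macro for this sketch, not a standard one.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>  // SSE intrinsics (__m128, _mm_mul_ps, ...)
#define HAVE_SSE 1
#else
#define HAVE_SSE 0
#endif

// Multiply `count` samples by `gain`, 4 floats at a time where possible.
void apply_gain(const float* in, float* out, std::size_t count, float gain)
{
    assert(in != nullptr && out != nullptr);
    std::size_t i = 0;
#if HAVE_SSE
    const __m128 g = _mm_set1_ps(gain);            // same gain in all 4 lanes
    for (; i + 4 <= count; i += 4) {
        const __m128 x = _mm_loadu_ps(in + i);     // load 16 bytes = 4 floats (unaligned is OK)
        _mm_storeu_ps(out + i, _mm_mul_ps(x, g));  // 4 multiplies, then store 4 floats
    }
#endif
    for (; i < count; ++i)                         // leftover samples, or the non-SSE fallback
        out[i] = in[i] * gain;
}
```

If the buffers are known to be 16-byte aligned, `_mm_load_ps` / `_mm_store_ps` can replace the unaligned variants; whether that helps is something to measure on the target CPU.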

Cache Control:
This is the real make-or-break optimization. L1 cache is very fast (a few clock cycles at most). L2 is also fast, but takes a few more clock cycles to access. Anything in RAM must be moved into cache before use, and that can take on the order of 100 clock cycles.
In a multiprocess, multithreaded environment, you can't have any guarantee as to what will be in cache (after a context switch, other code may well have pushed your data out).
In C/C++ you can't control the cache too well, but you can be mindful of how it works. Some of the old optimization techniques, like look-up tables, and decrementing loop variables to cut out one instruction in the loop, often no longer pay off because of cache issues. Depending on the calculation, a look-up table can be slower to access than the actual calculation, and worse, it may bump something important out of cache. Incrementing the loop variable tends to be faster because the processor pulls memory into the cache in chunks at a time, so if you've pulled in MyArray[0], it is likely that MyArray[1] through MyArray[15] also got pulled in. The guarantee doesn't go the other way.
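A small sketch of the "memory arrives in chunks" point: both functions below add up the same buffer, but the first walks it in memory order while the second jumps a whole row between reads. The function names and the row-by-row layout are assumptions for illustration. How big the difference is depends on the buffer size and the CPU, so time it rather than trusting intuition.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum a rows x cols buffer stored row by row, reading memory in order.
double sum_contiguous(const std::vector<float>& data, std::size_t rows, std::size_t cols)
{
    assert(data.size() == rows * cols);
    double total = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += data[r * cols + c];   // next read is the adjacent float
    return total;
}

// Same sum, but each read lands `cols` floats away from the previous one,
// so on a large buffer almost every access touches a different cache line.
double sum_strided(const std::vector<float>& data, std::size_t rows, std::size_t cols)
{
    assert(data.size() == rows * cols);
    double total = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += data[r * cols + c];
    return total;
}
```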

Float-to-int conversions:
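The default way that this is done (a plain C cast, which has to truncate) causes a "stall" on the floating point processor: the compiler has to switch the FPU rounding mode before the conversion and switch it back afterwards, so no other work can get done while this is calculating. There is a short assembly routine (I believe it is available somewhere at www.musicdsp.org) that does it much more quickly.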
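This is not the assembly routine mentioned above, just a portable stand-in for the same idea: `std::lrintf` from `<cmath>` rounds using the current rounding mode (round-to-nearest by default) instead of truncating, so the compiler has no reason to touch the rounding mode. The helper name `float_to_int16` and the 16-bit scaling are assumptions for the example, and whether the call turns into a single conversion instruction depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Convert a sample in roughly [-1, 1] to 16-bit PCM without a truncating cast.
std::int16_t float_to_int16(float sample)
{
    assert(!std::isnan(sample));                 // NaN has no meaningful integer value
    float scaled = sample * 32767.0f;
    if (scaled > 32767.0f)  scaled = 32767.0f;   // clip instead of wrapping around
    if (scaled < -32768.0f) scaled = -32768.0f;
    // lrintf rounds to nearest (ties to even) under the default rounding mode.
    return static_cast<std::int16_t>(std::lrintf(scaled));
}
```

Note that this rounds rather than truncates, so results can differ by one from a plain `(int)` cast; for audio that is usually the better behaviour anyway.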

Denormalization:
When floating point numbers get very small, the CPU goes into a special mode in order to maintain precision. This is a very, very slow mode =) If you've ever hit "stop" in your host, and seen the CPU meter hit 100%, this is the reason: filter and reverb tails decay towards zero and end up in that range. The change-over occurs at around 1e-38 for 32-bit floats (far below -300dB). There are various techniques for overcoming this problem, such as adding in noise or a slight DC offset at something like -300dB (which is waaaay below human hearing).
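Here is a minimal sketch of the DC offset trick on a one-pole feedback filter, assuming an offset of 1e-15 (about -300dB). The `OnePole` struct and the `kAntiDenormal` constant are invented for the example. With silence at the input, the state settles near 1e-13 instead of decaying into the denormal range.

```cpp
#include <cassert>
#include <cmath>

// One-pole feedback filter with a tiny constant added on every sample.
struct OnePole {
    float coeff;          // feedback amount, 0 <= coeff < 1
    float state = 0.0f;   // previous output

    explicit OnePole(float c) : coeff(c) { assert(c >= 0.0f && c < 1.0f); }

    float process(float input)
    {
        constexpr float kAntiDenormal = 1.0e-15f;   // about -300dB, inaudible
        // Without the offset, `state` would shrink by `coeff` per sample
        // during silence until it became a denormal.
        state = coeff * state + input + kAntiDenormal;
        return state;
    }
};
```

A constant offset can be removed by a later DC-blocking stage, in which case alternating its sign or using low-level noise is a common variation.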

Benchmarking:
You should always measure any change you make. Because of cache issues, a speed-up in one section of code may well slow down the section following it. There is an x86 instruction to get the clock cycle count, RDTSC (you run it before and after a block of code and calculate the difference).
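A rough sketch of the before/after measurement, using the `__rdtsc()` compiler intrinsic on x86 and falling back to `std::chrono::steady_clock` elsewhere. The names `read_ticks`, `min_ticks` and `HAVE_RDTSC` are made up for this example. Taking the minimum over several runs is one simple way to filter out interrupts and cold-cache runs; it is an assumption of this sketch, not the only approach.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>
#include <limits>

// HAVE_RDTSC is a hypothetical macro for this sketch.
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>      // __rdtsc on MSVC
#define HAVE_RDTSC 1
#elif defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>   // __rdtsc on GCC / Clang
#define HAVE_RDTSC 1
#else
#define HAVE_RDTSC 0
#endif

// Read the time stamp counter, or a steady clock when not on x86.
inline std::uint64_t read_ticks()
{
#if HAVE_RDTSC
    return __rdtsc();
#else
    return static_cast<std::uint64_t>(
        std::chrono::steady_clock::now().time_since_epoch().count());
#endif
}

// Run `fn` `repeats` times and return the smallest tick count observed.
template <typename Fn>
std::uint64_t min_ticks(Fn&& fn, int repeats)
{
    assert(repeats > 0);
    std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
    for (int i = 0; i < repeats; ++i) {
        const std::uint64_t start = read_ticks();
        fn();                                        // code under test
        const std::uint64_t elapsed = read_ticks() - start;
        if (elapsed < best) best = elapsed;
    }
    return best;
}
```

Compare the numbers for the old and new version of a routine on the same machine; the absolute tick counts mean little on their own.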

-benski

Post

Thanks very much :D


I've been having a look around and have found the following if anyone is interested...


"Programming SIMD computers is no easy task, but here you will find all the information you need to build fast applications that fully exploit the power of MMX and SSE instructions on the current microarchitectures. "

http://www.tommesani.com/Docs.html

Post

Looks like a good site! Thanks
