KVR Audio

Nowhk · Post by **Nowhk** » Tue Nov 20, 2018 11:45 am

Hi guys,

If you've seen my recent topics, you might have got the idea that I'm entering the dark side of intrinsics and vectorized programming (what a pain).

I'm faced with lots of intrinsics and different instruction sets for C++, and I can't decide which one to use.
Right now I'm "developing" for Windows machines, both 32 and 64 bit.

Which instruction sets are best suited in 2018 for developing plugins?
Ideally ones that are also somewhat backward compatible with not-so-recent CPUs.

AVX? SSE2? MMX? Which intrinsics suit that choice best?

I'm about to try to vectorize the sin and exp functions; I've seen mkl.h and vsSin (though I haven't really managed to include it).

Something like this introduces +2% CPU when running 16 voices (measured by keeping or removing the sin code):

Code: Select all

for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
	double value = (sin(phase)) * pParamGain->GetProcessedVoiceValue(voiceIndex, sampleIndex);

	*left++ += value;
	*right++ += value;
	
	// next phase
	phase += BOUNDED(mRadiansPerSample * (bp0 * pParamPitch->GetProcessedVoiceValue(voiceIndex, sampleIndex) + pParamOffset->GetProcessedVoiceValue(voiceIndex, sampleIndex)), 0, PI);
	while (phase >= TWOPI) { phase -= TWOPI; }
}

I believe vectorization can help
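
A minimal sketch of one way to prepare a loop like this for vectorization, under stated assumptions: the phase update is a serial dependency, but the sin and the gain multiply are independent per sample, so the work can be split into two passes. The names `renderVoice`, `inc`, `gain` and `scratch` are hypothetical, and the increments are assumed to be already clamped to [0, PI] as the BOUNDED call above does, which means a single wrap per step is enough.

```cpp
#include <cmath>
#include <cstddef>

constexpr double kTwoPi = 6.283185307179586;

// Hypothetical helper, not taken from the original plugin code.
// inc[i] is assumed to be already clamped to [0, pi].
void renderVoice(double& phase, const double* inc, const double* gain,
                 double* scratch, double* out, std::size_t n) {
    // Pass 1: serial, because each phase depends on the previous one.
    for (std::size_t i = 0; i < n; ++i) {
        scratch[i] = phase;
        phase += inc[i];
        if (phase >= kTwoPi) phase -= kTwoPi;
    }
    // Pass 2: no dependency between iterations, so this loop is a candidate
    // for auto-vectorization or for one call to a vector sin routine.
    for (std::size_t i = 0; i < n; ++i) {
        out[i] += std::sin(scratch[i]) * gain[i];
    }
}
```

Whether the second pass actually gets vectorized depends on the compiler and math library, so it is worth checking the generated code or timing it.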

Any suggestions?
What do you use for your audio plug?
Or do you simply delegate to the smartness of Auto-vectorization?

Thanks

vortico · Post by **vortico** » Tue Nov 20, 2018 1:07 pm

Nowhk wrote: ↑Tue Nov 20, 2018 11:45 am Which instruction sets are more suited in 2018 for developing plugins?

Not the full answer you're looking for, but if you want to vectorize these days, write reference code without vectorization (you'll need it anyway for prototyping, debugging and testing before writing the vector code), plus a version using up to AVX2. Compile the reference code with automatic SSE2 vectorization, and either ship two binaries or use runtime CPU checking. Of course, evaluate whether you need AVX2 at all before hand-porting.
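
A rough sketch of what runtime CPU checking could look like, assuming an x86 target. `cpuHasAvx2`, `mixReference`, `mixAvx2Placeholder` and `selectMix` are hypothetical names; the "AVX2" function here is only a stand-in that calls the reference code, since a real one would live in a separate source file compiled with AVX2 enabled.

```cpp
#include <cstddef>
#if defined(_MSC_VER)
#include <intrin.h>
#include <immintrin.h>
#endif

// Returns true only if both the CPU and the OS support AVX2.
static bool cpuHasAvx2() {
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4];
    __cpuid(regs, 0);
    if (regs[0] < 7) return false;                    // leaf 7 not available
    __cpuid(regs, 1);
    const bool osxsave = (regs[2] & (1 << 27)) != 0;  // OS uses XSAVE
    const bool avx = (regs[2] & (1 << 28)) != 0;
    if (!osxsave || !avx) return false;
    if ((_xgetbv(0) & 0x6) != 0x6) return false;      // OS saves XMM and YMM state
    __cpuidex(regs, 7, 0);
    return (regs[1] & (1 << 5)) != 0;                 // EBX bit 5 = AVX2
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;                                     // unknown target: stay on the reference path
#endif
}

using MixFn = void (*)(float* dst, const float* src, std::size_t n);

// Reference path: plain C++, left to the compiler's SSE2 auto-vectorization.
static void mixReference(float* dst, const float* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) dst[i] += src[i];
}

// Stand-in for a hand-written AVX2 version (assumption: built separately
// with /arch:AVX2 or -mavx2 in a real project).
static void mixAvx2Placeholder(float* dst, const float* src, std::size_t n) {
    mixReference(dst, src, n);
}

// Pick the implementation once, e.g. at plugin load.
static MixFn selectMix() {
    return cpuHasAvx2() ? mixAvx2Placeholder : mixReference;
}
```

The point of going through a function pointer is that the AVX2 instructions are never executed on a CPU that lacks them, so a single binary can serve both groups of users.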

stratum · Post by **stratum** » Tue Nov 20, 2018 5:23 pm

Chances are high you can find two more functions in mkl.h that may help to vectorize:

Code: Select all

	*left++ += value;
	*right++ += value;
	
	// next phase
	phase += BOUNDED(mRadiansPerSample * (bp0 * pParamPitch->GetProcessedVoiceValue(voiceIndex, sampleIndex) + pParamOffset->GetProcessedVoiceValue(voiceIndex, sampleIndex)), 0, PI);

these functions are "vector add" and "vector multiply and add" (I don't recall their names)

edit: looks like these are in ipp https://software.intel.com/en-us/ipp-de ... -functions

mtytel · Post by **mtytel** » Wed Nov 21, 2018 1:20 am

Using an intrinsics wrapper class gives better performance and cleaner code than auto-vectorization in my experience.
It's cleaner because you don't have to write the code for both the left and right channel separately.
It's faster because you can, for example, take advantage of comparison/max/min intrinsics and mask intrinsics to make conditions branchless. I don't think auto-vectorization will translate your code to use these, but I may be wrong.
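
To illustrate the branchless idea, here is a small sketch (not mtytel's wrapper class) that clamps four phase increments with min/max and wraps the phases with a compare mask instead of a `while` loop. It uses single-precision lanes for brevity; `advancePhase4` and its parameters are made-up names, and a scalar fallback is included for targets without SSE2.

```cpp
#include <cstddef>
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#else
#define SKETCH_HAS_SSE2 0
#endif

// Clamp four increments to [lo, hi], add them to four phases, then wrap each
// phase once into [0, limit). One wrap is enough as long as hi <= limit and
// the incoming phases are already below limit.
void advancePhase4(float* phase, const float* inc, float lo, float hi, float limit) {
#if SKETCH_HAS_SSE2
    __m128 p = _mm_loadu_ps(phase);
    __m128 d = _mm_loadu_ps(inc);
    d = _mm_min_ps(_mm_max_ps(d, _mm_set1_ps(lo)), _mm_set1_ps(hi)); // clamp, no branch
    p = _mm_add_ps(p, d);
    const __m128 lim = _mm_set1_ps(limit);
    const __m128 mask = _mm_cmpge_ps(p, lim);   // all ones in lanes where p >= limit
    p = _mm_sub_ps(p, _mm_and_ps(mask, lim));   // subtract limit only in those lanes
    _mm_storeu_ps(phase, p);
#else
    for (std::size_t i = 0; i < 4; ++i) {
        float d = inc[i] < lo ? lo : (inc[i] > hi ? hi : inc[i]);
        float p = phase[i] + d;
        phase[i] = p >= limit ? p - limit : p;
    }
#endif
}
```

With phases {0, 1, 5, 5.5}, increments {0.5, 5, 2, -1}, lo = 0, hi = 3 and limit = 6, the four lanes end up at {0.5, 4, 1, 5.5}.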

I have my own class that wraps both SSE2 and NEON intrinsics but there are other ones like JUCE's SIMDRegister: https://docs.juce.com/master/structdsp_ ... ister.html

I'd avoid AVX because it's not that widely supported yet.

stratum · Post by **stratum** » Wed Nov 21, 2018 8:06 am

for the sin function you can do this:
https://dsp.stackexchange.com/questions ... oscillator

xoxos had posted a simplified version some time ago which is probably lost somewhere below the next 1000 pages of the forum:)
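
The linked answer is not reproduced here. As an assumption about the kind of technique meant, one common way to avoid calling sin per sample at a fixed frequency is a rotation-based recursive oscillator: keep the current sin/cos pair and rotate it by the per-sample angle. `RotationOscillator` is a hypothetical name, and this sketch does not handle per-sample frequency modulation.

```cpp
#include <cmath>

// Recursive sine oscillator based on the angle-addition identities:
//   sin(a + w) = sin(a)cos(w) + cos(a)sin(w)
//   cos(a + w) = cos(a)cos(w) - sin(a)sin(w)
struct RotationOscillator {
    double s, c;    // sin and cos of the current phase
    double ds, dc;  // sin and cos of the per-sample phase increment

    RotationOscillator(double phase, double radiansPerSample)
        : s(std::sin(phase)), c(std::cos(phase)),
          ds(std::sin(radiansPerSample)), dc(std::cos(radiansPerSample)) {}

    // Returns sin of the current phase, then advances by one sample.
    double next() {
        const double out = s;
        const double ns = s * dc + c * ds;
        const double nc = c * dc - s * ds;
        s = ns;
        c = nc;
        return out;
    }
};
```

Rounding errors accumulate slowly in this form, so long-running voices may need the (s, c) pair renormalized or reset from time to time.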

davidguda · Post by **davidguda** » Wed Nov 21, 2018 9:30 am

The first step of vectorizing, if you're using JUCE, is using FloatVectorOperations.
Really clean and nice for the basic operations.

quikquak · Post by **quikquak** » Wed Nov 21, 2018 9:44 am

It's probably best to learn to get used to vector coding. I use SSE(2) intrinsics. I can see many people shaking their heads and ready to spout, "But, but, the compiler does a better job than me." - A nice belief, but the trick is to understand the basics, and what a pipeline is. I've seen some bad C++ disassembly recently, enough to make me think twice about trusting the compiler implicitly.
The JUCE stuff looks good for basic blocks of values, which may be OK for you; I haven't really looked into it enough myself.
Like vortico said, write the code normally first, but while thinking about doing 4 math things at once; it will help in the long run. You have to start with, "Is this the best algorithm I can use for this task?" - as normal!

Who knew programming computers was hard, hey?

Miles1981 · Post by **Miles1981** » Wed Nov 21, 2018 5:47 pm

AVX is already very old, and lots of computers support it...

But yes, two versions of the code may work; there is also CPU dispatch, which works well (used by icc).

And yes, the compiler does a far better job than 99.9999% of the developers.

mtytel · Post by **mtytel** » Wed Nov 21, 2018 6:03 pm

According to Steam's Hardware Survey, it looks like ~86% of users' computers support AVX. https://store.steampowered.com/hwsurvey
That's too low for me; I'd only switch when it reaches near 99%.

juha_p · Post by **juha_p** » Wed Nov 21, 2018 8:20 pm

http://www.nersc.gov/users/computationa ... orization/

mystran · Post by **mystran** » Thu Nov 22, 2018 4:04 am

Auto-vectorisation is one of those things that is pretty hairy as soon as you step outside the realm of simple parallel loops. Sometimes it works great, sometimes it works poorly and sometimes it doesn't work at all. The worst part is that it's more or less impossible to predict which of the three you'll get for any given piece of code. While intrinsics are non-portable, at least they are fairly predictable: you generally get what you ask for, no more and no less.

I'm not really against auto-vectorisation as such and I've even seen cases where I was unable to write intrinsics code to match the performance of ICC auto-vectorisation. That said, always avoiding intrinsics on the basis that "the compiler will do a better job" is also somewhat misguided if you really care about performance, since there are also plenty of situations where it's basically trivial to beat the compilers.

My 2 cents though (and this applies to any type of optimisation really): before you vectorise anything, always profile first (and if you don't know how to use a profiler, then learn that before intrinsics, because it's going to give you a much better return on investment), because there's absolutely no point wasting days on saving a few cycles in code that is taking 2% of your total CPU time. Without a profile, you're wasting your time.

Aleksey Vaneev · Post by **Aleksey Vaneev** » Thu Nov 22, 2018 10:17 am

One funny thing I've spotted on modern processors: I have a fairly simple portable inner loop for alpha-blending.

op[ VOX_PIX_OFS_R ] += (uint8_t) (( r1 - op[ VOX_PIX_OFS_R ]) * alpha >> 23 );
op[ VOX_PIX_OFS_G ] += (uint8_t) (( g1 - op[ VOX_PIX_OFS_G ]) * alpha >> 23 );
op[ VOX_PIX_OFS_B ] += (uint8_t) (( b1 - op[ VOX_PIX_OFS_B ]) * alpha >> 23 );

If I convert it to MMX (which can be done without much effort), the MMX parallel code is 50% less efficient in 64-bit. In fact, alpha-blending in "floats" is now as efficient as this byte alpha-blending. Reality has changed a lot since the introduction of MMX and SSE. Intel C++ manages to reduce the execution time of our older non-parallel code by 30% just by enabling the AVX2 instruction set, which is unbelievable.

Nowhk · Post by **Nowhk** » Thu Nov 22, 2018 10:57 am

mystran wrote: ↑Thu Nov 22, 2018 4:04 am because there's absolutely no point wasting days on saving a few cycles in code that is taking 2% of your total CPU time. Without a profile, you're wasting your time.

I know, but I'm stubborn and curious...
So I gave IPP a try, as suggested by stratum (the first one I've tried in my programming life).

Here's the specs of the test:

- Intel Core(TM) i7-4900 @ 3.60 GHz
- Windows 10 Professional (x64)
- Visual Studio 2017 (15.8.9)
- Release configuration, compiling to a 32-bit program (with the /O2 /Ot optimization flags)
- 16 voices, bufferSize 256, calling the "emulated" plugin's Process function 1024 * 30 times

Here's the code:

Code: Select all

#include <iostream>
#include <chrono>
#include <algorithm>
#include "ipp.h"

constexpr int voiceSize = 16;
constexpr int bufferSize = 256;

class Param
{
public:
	double mValue, mMin, mRange;

	double *pModulationVoicesValues;
	double mProcessedVoicesValues[voiceSize][bufferSize];

	Ipp64f *pModulationVoicesValuesVectorized;
	Ipp64f mProcessedVoicesValuesVectorized[voiceSize][bufferSize];

	Param(double min, double max) : mValue{ 0.5 }, mMin { min }, mRange{ max - min } { }

	inline void AddModulation(int voiceIndex, int blockSize) {
		double *pMod = pModulationVoicesValues + voiceIndex * bufferSize;
		double *pValue = mProcessedVoicesValues[voiceIndex];

		// add modulation
		for (int sampleIndex = 0; sampleIndex < blockSize; sampleIndex++) {
			pValue[sampleIndex] = std::exp(pMod[sampleIndex]);
		}
	}

	inline void AddModulationVectorized(int voiceIndex, int blockSize) {
		Ipp64f *pMod = pModulationVoicesValuesVectorized + voiceIndex * bufferSize;
		Ipp64f *pValue = mProcessedVoicesValuesVectorized[voiceIndex];

		// add modulation
		ippsExp_64f(pMod, pValue, blockSize);
	}
};

class MyPlugin
{
public:
	double gainModValues[voiceSize][bufferSize];
	double offsetModValues[voiceSize][bufferSize];
	double pitchModValues[voiceSize][bufferSize];

	Ipp64f gainModValuesVectorized[voiceSize][bufferSize];
	Ipp64f offsetModValuesVectorized[voiceSize][bufferSize];
	Ipp64f pitchModValuesVectorized[voiceSize][bufferSize];
	
	Param mGain{ 0.0, 1.0 };
	Param mOffset{ -900.0, 900.0 };
	Param mPitch{ -48.0, 48.0 };

	MyPlugin() {
		// link mod arrays to params
		mGain.pModulationVoicesValues = gainModValues[0];
		mOffset.pModulationVoicesValues = offsetModValues[0];
		mPitch.pModulationVoicesValues = pitchModValues[0];

		mGain.pModulationVoicesValuesVectorized = gainModValuesVectorized[0];
		mOffset.pModulationVoicesValuesVectorized = offsetModValuesVectorized[0];
		mPitch.pModulationVoicesValuesVectorized = pitchModValuesVectorized[0];

		// fancy data for mod at audio rate
		for (int voiceIndex = 0; voiceIndex < voiceSize; voiceIndex++) {
			for (int sampleIndex = 0; sampleIndex < bufferSize; sampleIndex++) {
				gainModValues[voiceIndex][sampleIndex] = sampleIndex / (double)bufferSize;
				offsetModValues[voiceIndex][sampleIndex] = sampleIndex / (double)bufferSize;
				pitchModValues[voiceIndex][sampleIndex] = sampleIndex / (double)bufferSize;

				gainModValuesVectorized[voiceIndex][sampleIndex] = sampleIndex / (double)bufferSize;
				offsetModValuesVectorized[voiceIndex][sampleIndex] = sampleIndex / (double)bufferSize;
				pitchModValuesVectorized[voiceIndex][sampleIndex] = sampleIndex / (double)bufferSize;
			}
		}
	}
	~MyPlugin() { }

	void Process(int blockSize) {
		// voices
		for (int voiceIndex = 0; voiceIndex < voiceSize; voiceIndex++) {
			// add modulation
			mGain.AddModulation(voiceIndex, blockSize);
			mOffset.AddModulation(voiceIndex, blockSize);
			mPitch.AddModulation(voiceIndex, blockSize);
		}
	}

	void ProcessVectorized(int blockSize) {
		// voices
		for (int voiceIndex = 0; voiceIndex < voiceSize; voiceIndex++) {
			// add modulation
			mGain.AddModulationVectorized(voiceIndex, blockSize);
			mOffset.AddModulationVectorized(voiceIndex, blockSize);
			mPitch.AddModulationVectorized(voiceIndex, blockSize);
		}
	}
};

int main() {
	std::chrono::high_resolution_clock::time_point pStart;
	std::chrono::high_resolution_clock::time_point pEnd;
	MyPlugin myPlugin;

	// audio host call
	int numProcessing = 1024 * 30;
	int counterProcessing = 0;
	pStart = std::chrono::high_resolution_clock::now();
	while (counterProcessing++ < numProcessing) {
		// variable blockSize - it can vary
		int blockSize = 256;

		// process data
		myPlugin.Process(blockSize);
		//myPlugin.ProcessVectorized(blockSize);
	}
	pEnd = std::chrono::high_resolution_clock::now();
	std::cout << "execution time: " << std::chrono::duration_cast<std::chrono::milliseconds>(pEnd - pStart).count() << " ms" << std::endl;
}

Try to comment/uncomment these (which are the non-vectorized and vectorized versions, respectively):

Code: Select all

myPlugin.Process(blockSize);
//myPlugin.ProcessVectorized(blockSize);

With the non-vectorized one the computation takes about 1400 ms; with the vectorized one about 600 ms, less than half.

Maybe MSVC is really dumb, but vectorizing seems to REALLY improve this simple test.
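
One caveat with micro-benchmarks of this shape: the computed values are never read, so an optimizer is in principle free to drop some of the work. A hedged sketch of a timing helper that keeps the results observable by folding them into a checksum follows; `BenchResult` and `timeExp` are hypothetical names, and `std::exp` stands in for whichever routine is being measured.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

struct BenchResult {
    double checksum;         // depends on the computed values, so they must be produced
    long long milliseconds;  // wall-clock time of the timed loop
};

// Applies std::exp to the whole input `iterations` times and times it.
BenchResult timeExp(const std::vector<double>& in, int iterations) {
    std::vector<double> out(in.size());
    double checksum = 0.0;
    const auto start = std::chrono::steady_clock::now();
    for (int it = 0; it < iterations; ++it) {
        for (std::size_t i = 0; i < in.size(); ++i) {
            out[i] = std::exp(in[i]);
        }
        if (!out.empty()) checksum += out.front() + out.back();
    }
    const auto end = std::chrono::steady_clock::now();
    const auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(end - start);
    return BenchResult{checksum, static_cast<long long>(ms.count())};
}
```

Printing the checksum next to the time makes the result part of the program's output. Since the input is identical on every iteration here, varying it between iterations would make the measurement more robust still.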

Richard_Synapse · Post by **Richard_Synapse** » Thu Nov 22, 2018 11:13 am

mtytel wrote: ↑Wed Nov 21, 2018 6:03 pm Using Steam's Hardware Survey it looks that ~86% of user's computers support AVX. https://store.steampowered.com/hwsurvey
That's too low for me, I'd only switch when it reaches near 99%

I think 90-95% is plenty. Essentially this may boil down to nearly 100% of users, because the rest may not be interested in trying your software in the first place. People on totally outdated hardware are imho the least likely users/buyers.

Anyway we will start using AVX some time in 2019

Richard

Nowhk · Post by **Nowhk** » Thu Nov 22, 2018 2:05 pm

Tried also the MKL library:

Code: Select all

inline void AddModulationVectorizedMKL(int voiceIndex, int blockSize) {
	double *pMod = pModulationVoicesValues + voiceIndex * bufferSize;
	double *pValue = mProcessedVoicesValues[voiceIndex];

	// add modulation
	vdExp(blockSize, pMod, pValue);
}

It seems a bit "slower" than IPP: on the same test I get around ~900 ms, maybe because IPP uses more Intel-oriented intrinsics.

Still better than the Dream about auto-vectorization though

But what happens when I release the DLL?
If a user doesn't have an Intel CPU, will the VST fail?

Or what if the CPU doesn't support the SIMD instructions used by those libraries? Isn't that rather risky?

First steps on Vectorizing Audio Plugins: which Instruction Set do you use in 2018?