Swiss Frank wrote:
Borogove, I'm comparing your int16 naive supersaw and my double bumpup.
Looping 10 billion times? How'd you do that with an int count? (Just a second loop from 1 to 10?)
Since you're saving into a huge huge huge array, the MMU/OS would have to create a bunch of pages for whichever algo ran first.
I ran 10,000 trials of 1,000,000 samples each. I alloc and free a buffer of 1 million floats = 4MB on each trial, so yeah, there's a small amount of OS mem management overhead on the very first trial, but I expect that to be swamped by the computation. I also didn't do much to ensure the rest of the system was quiet during the tests. But yeah, I should have pulled the memory management out of the tests.
(Done; short/float naive is 79 seconds, 312M samples/sec, float bump is 21 seconds, 476M samples/sec. Try changing yours to 10K trials of 1M samples each. That's -O2, gcc 4.2.1, OSX 10.6; I don't see any difference between -O2 and -O9.)
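For the record, the hoisted harness looks roughly like this; a sketch, untested, with gen_naive() standing in for whichever generator is under test:

    #include <stdlib.h>
    #include <stdint.h>

    #define TRIALS   10000
    #define NSAMPLES 1000000

    /* stand-in for whichever generator is under test */
    static void gen_naive(float *out, int n)
    {
        uint16_t phase = 0;
        int i;
        for (i = 0; i < n; i++) {
            out[i] = (int16_t)phase * (1.0f / 32768.0f); /* the store keeps the loop live */
            phase += 777;                                /* ~C5 incr at 44.1kHz */
        }
    }

    int main(void)
    {
        /* allocate once, outside the timed trials, so the page faults all
           land up front instead of being charged to the first algo */
        float *buf = malloc(NSAMPLES * sizeof *buf);
        int t;
        for (t = 0; t < TRIALS; t++)
            gen_naive(buf, NSAMPLES);
        free(buf);
        return 0;
    }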
Just to check: I calculate your incr value for middle C to be 777.59, truncating to 777, which I think is 1.31 cents or 0.4 Hz off, right? And in the vicinity of that frequency, the 16-bit incr has a precision of about 2.23 cents? In the specific case of a fat supersaw, with so many frequencies, that sounds acceptable, but just curious: is that the standard to which a lot of sound software works?
Correct, my incrs are 769, 772, 776, 777, 779, 782, 786. 2-cent resolution is... let's say barely acceptable back when the JP-8000 was new? I wouldn't tolerate it in modern software, but on a modern CPU it would be as fast [or faster!*] to use a 32-bit int as the per-wave phasor and a 64-bit int as the accumulator.
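Roughly like this; just a sketch, untested, with supersaw7, NVOICES, and the 44.1kHz rate as my stand-ins. One LSB of a 32-bit incr is 44100/2^32, about 0.00001 Hz, versus 44100/2^16, about 0.67 Hz (your 2.23 cents near C5), for the 16-bit version:

    #include <stdint.h>

    #define NVOICES 7
    #define SR      44100.0               /* assumed sample rate */

    /* 32-bit phasor per wave, 64-bit accumulator so 7 voices can't overflow */
    void supersaw7(float *out, int n, const double *freqs /* NVOICES detuned Hz */)
    {
        uint32_t phase[NVOICES] = {0};
        uint32_t incr[NVOICES];
        int i, v;

        for (v = 0; v < NVOICES; v++)     /* 1 LSB of incr = SR/2^32 Hz */
            incr[v] = (uint32_t)(freqs[v] * 4294967296.0 / SR);

        for (i = 0; i < n; i++) {
            int64_t acc = 0;
            for (v = 0; v < NVOICES; v++) {
                acc += (int32_t)phase[v]; /* the saw is just the signed view of the phasor */
                phase[v] += incr[v];      /* wraparound is still free at 2^32 */
            }
            out[i] = acc * (1.0f / (NVOICES * 2147483648.0f));
        }
    }

The per-sample work is the same as the 16-bit version, one add per wave, so there's no speed penalty for the extra pitch precision.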
0) I'm positive BumpUp Double is not 3x faster than Naive Int16 as you reported. I see a 10% improvement at C4.
I never saw or reported a 3x difference. Naive short/double was 98-99 seconds, bump-up double was 64-69 seconds, and about 20 seconds of that was just writing the doubles to memory.
(Also, and I'm just being a nitpicky little bitch here: middle C = C4 = ~262Hz; my tests were done at the C above, C5 = 523Hz.)
3) Generally, the naive int16 implementation (under full optimization, at least) is spending most of its time just converting the output to doubles.
Converting them, or storing them to memory? I had to store the results to keep the optimizer from collapsing all the computation into one big no-op, and once I was doing that, it wasn't fair to store 16 bits per sample for the naive saw and 32 or 64 for the bump-up. I don't see a "fair" way to resolve that.
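One imperfect way around it: don't store at all, and instead fold every sample into a running sum that ends up in a volatile, so neither variant pays for any memory traffic but the compiler still can't throw the work away. A sketch, untested, with bench_naive16 and g_sink as made-up names:

    #include <stdint.h>

    volatile float g_sink;  /* the volatile write keeps the whole loop live */

    /* sum the naive saw instead of storing it: zero stores for either
       variant, so neither is penalized for its output width */
    void bench_naive16(long nsamples, uint16_t incr)
    {
        uint16_t phase = 0;
        float    sum   = 0.0f;
        long     i;

        for (i = 0; i < nsamples; i++) {
            sum += (int16_t)phase * (1.0f / 32768.0f);
            phase += incr;
        }
        g_sink = sum;
    }

Still not perfectly level, since the extra add isn't free, but at least it costs both implementations the same.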
The point of benchmarking it, for me, was to establish (a) whether a naive saw via integer wraparound was a credible candidate for the JP-8000 implementation, and (b) whether the bump-up saw was a drastically faster way to emulate it. Answers: (a) yes, modern hardware can do it 2000+ times faster than real time, so cheap 1997 hardware should have been able to handle it; and (b) faster, but not drastically so.
The frequency dependence of the bump algorithm is interesting, too.
* In 32-bit x86 native code, 16-bit operations are distinguished by the 0x66 operand-size prefix, so every 16-bit ALU operation is one byte longer than its 32-bit counterpart (add eax, ebx encodes as 01 D8; add ax, bx is 66 01 D8). Code is bigger, code-cache pressure is higher, etc.