I'd like to share some of the stuff I've worked out along the way. Particularly, generating only ramps with DPW; scaling factor considerations for DPW; and division for PolyBLEP, which basically has me stopped cold. I don't have comparisons for DPW vs. PolyBLEP; at some point I'll work out a Python implementation using (likely) FxpMath, and then translate that approach into hardware using Amaranth.
I'm using offset ramps (ramp(p) - ramp(p+offset)) to generate pulse waves, although I'm not 100% sure DPW will work starting from a phase offset. Generating triangle waves and tilted triangle waves is easy from there: DPW differences an integrated trivial waveform function, for example a fourth-order integral is differenced three times to get the first-order function. To generate triangular waveforms, you integrate a square wave; therefor, you can perform only two differences and skip the third when generating a square wave, and you'll get the corresponding triangle wave. The peak for the triangle will correspond to the falling edge of the pulse, e.g. a 20% duty cycle pulse wave, integrated, will have a peak +1 at 1/5 of its period, and then run back to -1 at the end/beginning of its period.
One question I haven't figured out about DPW: what happens if I have y=0 and move my phase forward 50% or 25%? This works with continuous functions, but moves the waveform down (DC offset); however, continuous functions work differently than descretized functions, and don't bandlimit the signal. If it works when descretized then this is fine, I just need 2 ramp generators to produce a single ramp, pulse, and triangle oscillator.
I'll also need to find out if this works with vibrato. Changing the frequency essentially means changing the step size, which I'm not sure will work the way I want...
Valimaki et al recommend using an order 2 and 4 (DPW2 and DPW4) DPW generator to deal with the high scaling factor. They diagram this as such:
According to Valimäki et al:
Their frequency-dependent scaling factors are pi/4sin(pi/P) and pi^3/(24[2sin(pi/P)]^3) for DPW2 and DPW4. At a frequency of 27.5Hz (A0) and 48kHz sample rate, P=48000/27.5, these are 437 and 27,696,514, rounded up. log2(437)≈8.8, log2(27,696,514)≈24.7; therefor, your options are to either use a crossover centered at 500Hz (from 400Hz to 600Hz) or to add 25 bits to the DPW4 sample width. You will need an extra 9 bits to maintain precision with DPW2; you will need an extra 14 bits to maintain precision with DPW4 down to 400Hz; therefor you are only adding 11 more bits to the DPW4 sample precision. The sample can be cut down to its native sample depth (e.g. 24 bit integer or single-precision float) after scaling, so very few registers and large-width adds and multiplies are necessary.A practical implementation structure for DPW sawtooth synthesis is presented in Fig. 16, where low-frequency tones are produced by a second-order DPW method and high frequencies are produced using a fourth-order method. The crossover frequency may be about 500 Hz. Using a combination of two DPW algorithms it is possible to avoid a very large scale factor at low fundamental frequencies but still maintain a good audio quality at high fundamental frequencies.
Based on all of that, it's cheaper to just use wider data types. For each waveform generated, DPW4 needs 1 multiply and 1 multiply-add—a special case because it's x^4 - 2x^2, so you have x2=x^2, a=(x^2*x^2 + x^2<<1), whereas for a non-power-of-2 coefficient you'd need an extra multiply—plus 3 additions (differencing) and 1 multiply (scaling). That's 3 multipliers made 11 bits wider, and 3 adders made 11 bits wider, and 3 registers made 11 bits wider. DPW2 requires 1 multiply, 1 scaling multiply, and 1 differencing addition, all adding 9 bits of precision on top of the target sample size, e.g. 33 bits wide instead of 24. The savings would be minimal, and then you need an extra register and two additional multipliers (well, a multiplier and a fused multiply-add) to combine the two, plus control circuitry to determine how much to multiply by.
I've partially diagrammed their suggested DPW2+DPW4 system below. Each of the differencers (with z^-1 above them) is subtracting the input from z^-1, not adding.
The +, ×, and z^-1 along the top implement DPW2. What I didn't diagram was the additional calculations to calculate DPW2 scale factor (another multiply), to determine the mix (a subtraction and multiply, with conditional tests), and to complete the mix (two more multiplies and a subtraction). The alternative to all of that is, again, make the registers and operators along the bottom row 11 bits wider, which I think is a better approach.
All of that is fascinating, but I haven't compared it to PolyBLEP.
My issue with PolyBLEP is I can replace dt with (f0 * 1/fs) and hard-code 1/fs to avoid dividing by the sample rate; but whereas I need to divide by fs/f0 to compute the scaling factor in DPW (thus multiply by f0*1/fs), I need to divide by f0/fs (thus multiply by fs*1/f0) to compute PolyBLEP.
Division is hard.
There's no really fast way to approximate 1/x to a low error.
I'm starting to think it's more viable to just generate 20 sines per waveform and use a sum of sines. For the 3 channels in an AY-3 enhanced with 3 waveform types, vibrato, and chorus effect, that'd be only 240 sine generators. That or maybe modal synthesis?