Ok, this makes sense. I was figuring I'd have to keep the scalar functions around to process remainders or I'd have to mask out the unused lanes somehow.
I've just tried some quick tests with this on my scalar vs SSE wavetable testbed. The scalar version still beats the SSE version when there's just one remainder voice, but with two or more remainder voices the SSE version wins.
Currently, I've got an array of active voice indexes that I'm using to gather the data for processing. All I have to do is pad the used part of this array out to a multiple of four, with any padding slots pointing to, say, the last active voice. It's then possible to always just process four at a time without any special logic.
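For anyone following along, the padding idea could be sketched roughly like this. It's plain C with no intrinsics, and the names (pad_active_voices, gather_phases, the phase array) are made up for illustration, not taken from my actual code. It assumes the index array has room for the padded count.

```c
#include <stddef.h>

/* Hypothetical helper: pads active[0..count) up to the next multiple of
 * four by repeating the last active voice index, and returns the padded
 * count. The caller must make sure the array has room for the padding. */
size_t pad_active_voices(int *active, size_t count)
{
    if (count == 0)
        return 0;

    /* Round up to the next multiple of four. */
    size_t padded = (count + 3) & ~(size_t)3;

    /* Fill the padding slots with the last real voice index. */
    for (size_t i = count; i < padded; ++i)
        active[i] = active[count - 1];

    return padded;
}

/* Hypothetical gather: copies one float per listed voice into a
 * contiguous buffer, so each group of four can be loaded into an SSE
 * register. 'padded' is the value returned by pad_active_voices(). */
void gather_phases(const float *phase, const int *active,
                   size_t padded, float *out)
{
    for (size_t i = 0; i < padded; ++i)
        out[i] = phase[active[i]];
}
```

One thing to watch with this: the padding lanes recompute the last voice, so writing their state back is harmless, but if the four lanes get summed straight into the mix, the duplicates need a zero gain (or need skipping at the sum) or that last voice ends up louder.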
Well, except each of my voices can have additional phasors for super-saw style processing, and the super-saw parameters can be modulated, i.e. they can differ per voice. That'll take some figuring out.
Thanks for the pointer.