The intrinsics look quite useful and easy to use. The downside is that, of course, they are different from SSE, causing another code fork. I hate platform-specific code.
I think I'll be able to get away with just autovectorization. I revisited it and learned quite a bit about where and why it works. I found 7 not-very-critical loops I could force to vectorize with a #pragma. The others all failed due to variable loop sizes, method calls, inline conditionals, or, mostly, because the LLVM cost model decided they just weren't worth vectorizing. Maybe if I get bored I'll test those out and see, but for now I'm trusting my compiler.
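For anyone curious what forcing a loop looks like, here is a minimal sketch, assuming Clang and its loop-hint pragma. The function `scale_add` and its parameters are made up for illustration and are not from my code. The `restrict` qualifiers are there because possible aliasing between the two buffers is another common reason the vectorizer backs off.

```c
#include <stddef.h>

/* Hypothetical example: dst[i] += gain * src[i] over n floats.
 * restrict promises the compiler that dst and src do not overlap. */
void scale_add(float *restrict dst, const float *restrict src,
               float gain, size_t n)
{
    /* Asks Clang to vectorize this loop even if its cost model
     * would otherwise decline. Other compilers ignore the pragma. */
#pragma clang loop vectorize(enable)
    for (size_t i = 0; i < n; ++i) {
        dst[i] += gain * src[i];
    }
}
```

Building with `-O2 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize` makes Clang report which loops it vectorized, which it skipped, and why. The pragma is only a hint about profitability: if the loop is not legal to vectorize, it still won't be.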