Yes, that's what it makes me think: why should one implement your own / unwrapped / non-portable / home-made "SIMD oriented" functions when we have such great libraries?
Because you might not find what you are looking for in there.
IPP is really strong if you work at the algorithm level, e.g. try to beat the IPP FFT... good luck.
If you are at the CPU instruction level, you will always move memory in and out of IPP calls, while the compiler might be able to keep values in registers if you code intrinsics.
Example: Extend your for loop with some more math than a single _mm_add_pd. If you don't find that math in IPP, you need to call a lot of IPP functions with blocksize=1 (which will be slower than intrinsics for sure), or you need to re-arrange your code and break it up into multiple loops built from the pieces IPP pre-implements.
So you can't say IPP will be faster than your SSE2 intrinsics just because of its dispatcher.
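To make that concrete, here is a minimal sketch (function names are mine, not from any real codebase): a fused loop computing y[i] = a[i]*b[i] + c. With intrinsics the intermediate product never leaves a register; with IPP you would chain something like ippsMul_64f followed by ippsAddC_64f_I, writing the intermediate result to memory between the two passes.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Fused loop: y[i] = a[i] * b[i] + c.
   The product a[i]*b[i] stays in an XMM register; a two-call IPP
   version would round-trip the intermediate array through memory. */
void mul_add_fused(const double *a, const double *b, double c,
                   double *y, int n)
{
    __m128d vc = _mm_set1_pd(c);   /* broadcast the scalar constant */
    int i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d va = _mm_loadu_pd(a + i);
        __m128d vb = _mm_loadu_pd(b + i);
        /* multiply and add in registers, one store at the end */
        _mm_storeu_pd(y + i, _mm_add_pd(_mm_mul_pd(va, vb), vc));
    }
    for (; i < n; ++i)             /* scalar tail for odd n */
        y[i] = a[i] * b[i] + c;
}
```

The point is not the arithmetic itself but the memory traffic: one loop, one pass over the data, versus one pass per IPP call.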
It will depend on what you do, how you do it, and what your target system is actually capable of. I remember an Intel (or was it AMD??) CPU generation that supported AVX (256-bit) instructions, but on an SSE (128-bit) execution unit, by doing the same op twice. Result: AVX instructions were terribly slow compared to SSE2, so your SSE2->AVX optimization made things worse.
But as you say... it is more important to first understand what's going on under the hood and how to use a CPU effectively. Afterwards, think about implementation details.
Starting with SSE2 intrinsics is a good way in. Once you have understood it, porting to AVX or AVX2 or IPP or... will be no big deal, just typing work.
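As a sketch of how mechanical that port is (my own toy example, guarded so it also builds without AVX support): the same element-wise add in SSE2 and AVX differs mostly in the intrinsic prefix and the loop step.

```c
#include <immintrin.h>  /* pulls in SSE2 and, if enabled, AVX */

/* y[i] = a[i] + b[i], SSE2 version: 2 doubles per iteration */
void add_sse2(const double *a, const double *b, double *y, int n)
{
    int i = 0;
    for (; i + 2 <= n; i += 2)
        _mm_storeu_pd(y + i, _mm_add_pd(_mm_loadu_pd(a + i),
                                        _mm_loadu_pd(b + i)));
    for (; i < n; ++i) y[i] = a[i] + b[i];   /* scalar tail */
}

#ifdef __AVX__
/* Same loop ported to AVX: rename _mm_* to _mm256_* and double
   the step from 2 to 4. Only compiled when AVX is enabled. */
void add_avx(const double *a, const double *b, double *y, int n)
{
    int i = 0;
    for (; i + 4 <= n; i += 4)
        _mm256_storeu_pd(y + i, _mm256_add_pd(_mm256_loadu_pd(a + i),
                                              _mm256_loadu_pd(b + i)));
    for (; i < n; ++i) y[i] = a[i] + b[i];
}
#endif
```

The structure (load, op, store, scalar tail) carries over unchanged; that is why the port is mostly typing.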
Surely straight asm would be faster and more specific to the actual problem (if you know what you are doing), but later you will have "lots" of trouble making it portable. The same goes for intrinsics, I believe.
Same with IPP. Once you leave the C spec, you enter the world of CPU architectures: there is no IPP for ARM/NEON or PPC or any other CPU that does not support the Intel instruction set. IPP != portable ;D