GeForce 8600 GT (256 MB RAM) with a 10-second 44.1 kHz WAV:
CPU hovering between 1 and 2%.
It seemed to work as expected when loading 16-bit WAV files, but I think there may be an issue with 32-bit WAV files (they're often a pain; try libsndfile, it makes it all very easy). You probably don't care much about that at this stage anyway.
Regarding the high latency, you may like to try computing the first 8192 samples on the CPU, doing the rest on the GPU, and just summing the two parts. I only say this because I imagine that when you start pushing CUDA down to small block sizes, the CPU usage might start to jump. (Guesswork.)
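To make the split concrete, here's a rough sketch of the idea in plain C++ (names and the `splitConvolve` helper are mine, direct time-domain convolution stands in for both engines, and the 8192-sample head is shrunk so the numbers stay readable):

```cpp
#include <cstddef>
#include <vector>

// Plain time-domain convolution, standing in for both engines here.
std::vector<float> convolve(const std::vector<float>& x,
                            const std::vector<float>& h) {
    std::vector<float> y(x.size() + h.size() - 1, 0.0f);
    for (std::size_t n = 0; n < x.size(); ++n)
        for (std::size_t k = 0; k < h.size(); ++k)
            y[n + k] += x[n] * h[k];
    return y;
}

// Split the impulse response: the first `head` taps run on the CPU,
// the remainder would run on the GPU. Summing the two convolutions,
// with the tail delayed by `head` samples, reproduces the full result.
std::vector<float> splitConvolve(const std::vector<float>& x,
                                 const std::vector<float>& ir,
                                 std::size_t head) {
    std::vector<float> hHead(ir.begin(), ir.begin() + head);
    std::vector<float> hTail(ir.begin() + head, ir.end());
    std::vector<float> yHead = convolve(x, hHead); // CPU part: low latency
    std::vector<float> yTail = convolve(x, hTail); // GPU part in the real thing
    std::vector<float> y(x.size() + ir.size() - 1, 0.0f);
    for (std::size_t n = 0; n < yHead.size(); ++n) y[n] += yHead[n];
    for (std::size_t n = 0; n < yTail.size(); ++n) y[n + head] += yTail[n];
    return y;
}
```

Because convolution is linear, the head + delayed-tail sum is sample-for-sample identical to convolving with the whole response, so the GPU only ever has to deliver results 8192 samples late.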
I did a little mock-up trial using a zero-latency software convolution reverb for the first part of a file on one bus and your plug-in for the rest on another bus; it seemed to work well. If you pick your block sizes carefully with a uniform-length partitioning algorithm, you should be able to get 'zero' latency by syncing with the VST block lengths (yes, I know you're not guaranteed a fixed, pre-determined number of samples per call, but you almost always get one, and you can handle the exceptions as a special case). If you're only running up to 8192 samples, it shouldn't be too much load on the CPU.
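In case it helps, here's what I mean by syncing a uniform-length partitioning scheme to the host block size, sketched in the time domain for clarity (the struct and names are mine; a real zero-latency engine would do the per-partition products with FFTs, but the block bookkeeping is identical):

```cpp
#include <cstddef>
#include <vector>

// Uniform-length partitioned convolution. B is the host (VST) block size;
// the impulse response is split into ceil(irLen / B) partitions of length B.
struct PartConv {
    std::size_t B;
    std::vector<std::vector<float>> parts;   // IR partitions, zero-padded to B
    std::vector<std::vector<float>> history; // past input blocks, newest first
    std::vector<float> overlap;              // spill-over into the next block

    PartConv(const std::vector<float>& ir, std::size_t blockSize) : B(blockSize) {
        for (std::size_t i = 0; i < ir.size(); i += B) {
            std::vector<float> p(B, 0.0f);
            for (std::size_t j = 0; j < B && i + j < ir.size(); ++j)
                p[j] = ir[i + j];
            parts.push_back(p);
        }
        history.assign(parts.size(), std::vector<float>(B, 0.0f));
        overlap.assign(B, 0.0f);
    }

    // One host callback's worth of audio: exactly B samples in, B out.
    // Partition p pairs with the input block from p calls ago, so every
    // partial product lands on the current output block -- zero latency.
    std::vector<float> process(const std::vector<float>& in) {
        history.pop_back();
        history.insert(history.begin(), in);
        std::vector<float> acc(2 * B - 1, 0.0f);
        for (std::size_t p = 0; p < parts.size(); ++p)
            for (std::size_t n = 0; n < B; ++n)
                for (std::size_t k = 0; k < B; ++k)
                    acc[n + k] += history[p][n] * parts[p][k];
        std::vector<float> out(B);
        for (std::size_t n = 0; n < B; ++n) out[n] = acc[n] + overlap[n];
        for (std::size_t n = 0; n + 1 < B; ++n) overlap[n] = acc[B + n];
        overlap[B - 1] = 0.0f;
        return out;
    }
};
```

The point is that when the partition length equals the host block length, every output sample is ready by the end of the callback that received its input; the latency only appears if the partitions are longer than the blocks the host hands you.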
That said, a good CPU-based solution only uses a few percent with a similar impulse response in zero-latency mode.
So that only leaves you with sorting out the huge GPU usage. I'm sure you're aware this is a common problem with uniform-partition-length algorithms and long responses (when I did it on a CPU it was strikingly inefficient). Are you familiar with non-uniform-length partitioning? If not, take a look at http://www.music.miami.edu/programs/mue/Research/jvandekieft/jvchapter2.htm
(skip to the end) and you should see how it all works. It's not as easy as uniform-length partitioning, but it's still not really that hard, and people (including me) have already done it well on a CPU.
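To give a feel for why it helps, here's a sketch of one common flavour of the idea, roughly Gardner's scheme (the function name and the exact schedule are mine, just for illustration): a couple of partitions at each size, doubling as you go deeper into the response, so the cheap big FFTs cover the tail while the small ones keep the head at zero latency.

```cpp
#include <cstddef>
#include <vector>

// Non-uniform partition schedule: two partitions at each size, doubling as
// we move deeper into the impulse response. A partition of length L only has
// to be ready L samples after its input arrives, so the large tail
// partitions get plenty of scheduling slack.
std::vector<std::size_t> partitionSchedule(std::size_t irLen, std::size_t B) {
    std::vector<std::size_t> parts;
    std::size_t covered = 0, len = B;
    while (covered < irLen) {
        for (int i = 0; i < 2 && covered < irLen; ++i) {
            parts.push_back(len);
            covered += len;
        }
        len *= 2;
    }
    return parts;
}
```

For a 1024-tap response at block size 64 this gives {64, 64, 128, 128, 256, 256, 512}: 7 partitions instead of the 16 a uniform scheme needs, and the saving grows quickly with response length, which is exactly where the huge per-sample cost of the uniform approach comes from.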
It's a bit cheeky of me, but I would love to see some skeleton VST code for getting data on and off the card. When I tried a while back (having got an 8600 for Christmas), I got irritated when I couldn't solve a dumb bug getting the cppIntegration demo working in a VST, and I passed on it (too quickly, to be honest) to focus on CPU-based work until a multi-platform system (like OpenCL) arrived. If I remember correctly, processReplacing couldn't properly access any memory in the cu functions unless I'd allocated it within processReplacing, which is obviously not what I really wanted to be doing. Maybe, seven months on, the SDK is easier to use too.