mystran wrote: Anyway, as suggested before, it probably makes sense to do the head of the impulse with the CPU, and only use GPU for the long tail.
One problem with that approach is that, in a zero-latency solution, getting down to zero latency is the part that uses the most CPU, because doing lots of FFTs on small blocks isn't very efficient. The long tail (say, where blocks reach more than about 10,000 samples, which is usually the vast majority of the IR) is actually extremely efficient in a zero-latency solution, because the FFTs get so big (amongst other reasons) and a modern CPU handles them very well. That reduces the benefit of combining it with a fixed-block-length algorithm on a GPU in the first place.
I'm not saying it's not worth a try, though. It's just that the CPU load might be higher than envisaged, and at that point pushing all the work to the CPU may not actually increase processor usage that much. Working towards lowering the GPU latency using some degree of non-uniform partitioning might be better.
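To put rough numbers on the argument about the long tail, here is a minimal Python sketch. It assumes a simplified Gardner-style schedule (each block size used twice, then doubled) and a crude n*log2(n) FFT cost model. The function names, the 64-sample first block and the 3-second IR at 48 kHz are assumptions for illustration only, not anyone's actual implementation, and the model ignores per-call overhead, which penalizes small FFTs even further in practice.

```python
import math


def partition_schedule(ir_len, first_block=64):
    """Block sizes covering ir_len samples: each size is used twice, then doubled.

    Simplified, hypothetical schedule. The last block may overshoot the IR
    length (in practice it would be zero-padded).
    """
    sizes = []
    covered = 0
    block = first_block
    while covered < ir_len:
        for _ in range(2):
            if covered >= ir_len:
                break
            sizes.append(block)
            covered += block
        block *= 2
    return sizes


def cost_per_output_sample(block):
    """Crude FFT op count per output sample for ONE partition of size `block`.

    Assumes one forward and one inverse FFT of size 2*block for every `block`
    samples, each costing n*log2(n). Spectral multiplies are ignored.
    """
    n = 2 * block
    return 2 * n * math.log2(n) / block


def summarize(ir_len, first_block=64):
    """Per block size: (block, count, ops per output sample, ops per sample per IR tap)."""
    schedule = partition_schedule(ir_len, first_block)
    rows = []
    for block in sorted(set(schedule)):
        count = schedule.count(block)
        per_sample = count * cost_per_output_sample(block)
        per_tap = per_sample / (count * block)
        rows.append((block, count, per_sample, per_tap))
    return rows


if __name__ == "__main__":
    # Assumed example: a 3-second IR at 48 kHz.
    for block, count, per_sample, per_tap in summarize(48000 * 3):
        print(f"{block:>7} x{count}  ops/sample={per_sample:7.1f}  "
              f"ops/sample/tap={per_tap:.5f}")
```

Under this model the cost per output sample of a partition grows only logarithmically with its size, while the number of IR taps it covers grows linearly, so the cost per tap falls sharply as the blocks get bigger. That is the sense in which the big blocks of the tail are cheap relative to how much of the IR they handle.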