There seems to be a minimum length of time you have to hold a note before the phrase will advance to the next syllable when you hit the next note. Not sure if that's some kinda global window (like the poly legato time window), but it seems to depend on the sample. I think it's more problematic if the first syllable ends in a consonant like 's', as the time it takes to close the syllable is fixed, i.e. it can't adapt to your pace.
You've just given me an idea though!
I could bounce the clip at a low BPM, then time-stretch it to make it faster haha. Enough reverb and maybe no one will notice!
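For what it's worth, here's a rough sketch of the arithmetic for that bounce-slow-then-stretch trick. The minimum hold time is a number I made up for illustration (I haven't measured the real one, and it probably varies per sample), and `stretch_plan` is just a hypothetical helper, not anything from the library.

```python
def stretch_plan(target_bpm, note_beats, min_hold_s):
    """Work out how slow to bounce and how much to speed up afterwards.

    target_bpm  -- tempo the final clip should play at
    note_beats  -- length of the shortest note, in beats (0.5 = eighth note in 4/4)
    min_hold_s  -- ASSUMED minimum hold time in seconds before the next
                   syllable will trigger (a guess, not a measured value)

    Returns (bounce_bpm, speedup), where speedup is the factor to apply
    when time-stretching the bounced audio back up to target_bpm.
    """
    # How long the shortest note lasts at the tempo we actually want.
    target_note_s = note_beats * 60.0 / target_bpm

    # If the note is already long enough, no trick is needed.
    if target_note_s >= min_hold_s:
        return target_bpm, 1.0

    # Slowest-needed tempo: the one where the shortest note lasts exactly min_hold_s.
    bounce_bpm = note_beats * 60.0 / min_hold_s

    # Speeding up by this factor brings the bounce back to target_bpm.
    speedup = target_bpm / bounce_bpm
    return bounce_bpm, speedup


# Example: eighth notes at 140 BPM, guessing a 0.4 s minimum hold.
bounce_bpm, speedup = stretch_plan(target_bpm=140, note_beats=0.5, min_hold_s=0.4)
print(f"Bounce at about {bounce_bpm:.1f} BPM, then speed up by about {speedup:.2f}x")
```

The bigger the speedup factor, the more stretching artifacts you'd expect, so that's presumably where the reverb comes in.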

