mystran wrote: ↑Wed May 05, 2021 10:06 am
> While min/max is a no-brainer here, it got me wondering: is AVX the minimum needed to actually do generalized branches?
> Basically, with AVX one can use _mm_cmp_ps (or the wider equivalents) to get a mask for conditionally selecting values, then follow up with PTEST (_mm_test_all_zeros etc.; SSE4.1) to set CF or ZF, so a branch is only computed when at least one lane actually needs it. That makes Turing-equivalent computation possible without ever unpacking the vectors - but is there some trick to do the same with just SSE2 and a sensible amount of shuffling?

_mm_movemask_ps might do the trick.
Random cheap sigmoid
2DaT wrote: ↑Wed May 05, 2021 9:27 pm
> _mm_movemask_ps might do the trick.

Oh, I just realized the SSE versions of the comparisons are _mm_cmpXX_ps, which map to CMPPS, whereas _mm_cmp_ps maps to VCMPPS - basically the same thing, except the intrinsic syntax is different. Isn't it just a joy how consistent Intel is with these?!?

You're right though: _mm_movemask_ps will do the job, at the cost of an extra integer TEST to set the flags.
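A minimal SSE2-only sketch of the idea in C (the predicate and f_expensive here are placeholders, not anything from the thread): a per-lane compare produces the mask, _mm_movemask_ps compresses it into a general-purpose register, and an ordinary integer test on that value decides whether the expensive branch runs at all.

```c
#include <assert.h>
#include <xmmintrin.h>  // SSE float compares, movemask, bitwise ops

// Placeholder for some branch that is expensive enough to be worth skipping.
static __m128 f_expensive(__m128 x) { return _mm_mul_ps(x, x); }

// Bitwise blend: mask ? a : b. Pre-SSE4.1 there is no BLENDVPS, so this
// is the classic and/andnot/or idiom.
static __m128 select_ps(__m128 mask, __m128 a, __m128 b) {
    return _mm_or_ps(_mm_and_ps(mask, a), _mm_andnot_ps(mask, b));
}

static __m128 process(__m128 x) {
    // Per-lane predicate: here, x < 0 (illustrative choice).
    __m128 mask = _mm_cmplt_ps(x, _mm_setzero_ps());
    // movemask puts one bit per lane into an integer; a plain scalar
    // test on it plays the role PTEST would play with SSE4.1.
    if (_mm_movemask_ps(mask) != 0) {
        return select_ps(mask, f_expensive(x), x);
    }
    return x;  // no lane needs the branch: skip it entirely
}
```

The trade-off versus PTEST is exactly the one mentioned above: movemask moves the mask to an integer register first, then the branch condition comes from an ordinary integer test.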
martinvicanek wrote: ↑Fri Mar 22, 2019 12:51 am
> You don't even need to worry about division if you rewrite
> (sqrt(1 + x^2) - sqrt(1 + x1^2))/(x - x1) = (x + x1)/(sqrt(1 + x^2) + sqrt(1 + x1^2))

Excuse my extremely rusty math. What happened here?
rafa1981 wrote: ↑Wed Jun 16, 2021 7:00 am
> Excuse my extremely rusty math. What happened here?

It should actually read "You don't even need to worry about division by zero". You still have a division, but the denominator is always >= 2.

To prove the above equality, multiply both the numerator and the denominator on the left-hand side by (sqrt(1 + x^2) + sqrt(1 + x1^2)), then use the identity x^2 - x1^2 = (x + x1)(x - x1).
martinvicanek wrote: ↑Wed Jun 16, 2021 10:44 am
> To prove the above equality, multiply both the numerator and the denominator on the left-hand side by (sqrt(1 + x^2) + sqrt(1 + x1^2)), then use the identity x^2 - x1^2 = (x + x1)(x - x1).

I see. Clever use of "the basics".
martinvicanek wrote: ↑Wed Jun 16, 2021 10:44 am
> It should actually read "You don't even need to worry about division by zero". You still have a division, but the denominator is always >= 2.

Two square roots and a division make it quite expensive to actually compute, though.
Actually, you can save the previous sqrt result along with the previous sample input (x1), so one sqrt goes away at the expense of one extra state variable.

This is doing antialiasing at very low latency (0.5 samples? 1 sample?) with almost no memory usage, so the definition of "cheap" is relative. I'm very bad at math and DSP, so I might be missing better ways.