Login / Register  0 items | $0.00 NewWhat is KVR? Submit News Advertise

odd integer division optimization

stratum
KVRian
 
1236 posts since 29 May, 2012

Postby stratum; Tue May 09, 2017 9:50 am odd integer division optimization

Hi all

This is from https://github.com/maikmerten/p64/blob/master/transform.c

Code: Select all
/*BFUNC
FastDivide() is a very bizarre little helper to let the compiler construct
fast integer divides for the most common divisors.
EFUNC*/


__inline int FastDivide(int divident, int divisor) {
    switch(divisor) {
        case 1: return divident;
        case 2: return divident / 2;
        case 3: return divident / 3;
        case 4: return divident / 4;
        case 5: return divident / 5;
        case 6: return divident / 6;
        case 7: return divident / 7;
        case 8: return divident / 8;
        case 9: return divident / 9;
        case 10: return divident / 10;
        case 11: return divident / 11;
        case 12: return divident / 12;
        case 13: return divident / 13;
        case 14: return divident / 14;
        case 15: return divident / 15;
        case 16: return divident / 16;
        case 17: return divident / 17;
        case 18: return divident / 18;
        case 19: return divident / 19;
        case 20: return divident / 20;
        case 21: return divident / 21;
        case 22: return divident / 22;
        case 23: return divident / 23;
        case 24: return divident / 24;
        case 25: return divident / 25;
        case 26: return divident / 26;
        case 27: return divident / 27;
        case 28: return divident / 28;
        case 29: return divident / 29;
        case 30: return divident / 30;
        case 31: return divident / 31;
    }
    return divident / divisor;     
}


Is that still relevant for modern cpus' and/or compilers?

Just curious, not that I'll do anything with this ancient H261 implementaiton. :lol:

Thanks
~stratum~
Mayae
KVRian
 
514 posts since 1 Jan, 2013, from Denmark

Postby Mayae; Tue May 09, 2017 2:06 pm Re: odd integer division optimization

Depends, like the duff's device and friends they are relics of the past mostly. When in doubt, measure (release x64 /02 vs 2017, brand new cpu):

Code: Select all
int main()
{
   int divident;

   while (std::cin >> divident)
   {
      int sum = 0;
      LARGE_INTEGER before, after;

      QueryPerformanceCounter(&before);
      for (int i = 1; i < 2000000; ++i)
      {
         sum += FastDivide(divident, i);
      }
      QueryPerformanceCounter(&after);

      std::cout << "sum = " << sum << ": " << (after.QuadPart - before.QuadPart) << std::endl;
      sum = 0;

      QueryPerformanceCounter(&before);
      for (int i = 1; i < 2000000; ++i)
      {
         sum += divident / i;
      }
      QueryPerformanceCounter(&after);

      std::cout << "sum = " << sum << ": " << (after.QuadPart - before.QuadPart) << std::endl;
   }
   return 0;
}


Code: Select all
>> 30
sum = 111: 27733
sum = 111: 28137
>> 30
sum = 111: 28343
sum = 111: 27702
>> 40
sum = 158: 28422
sum = 158: 27699
>> 20
sum = 66: 28512
sum = 66: 27828
>> 10
sum = 27: 28071
sum = 27: 27819
>> 50
sum = 207: 27372
sum = 207: 28697
>> 50
sum = 207: 28490
sum = 207: 27379
>> 10000
sum = 93668: 27492
sum = 93668: 28477
>> 10000
sum = 93668: 28038
sum = 93668: 27379
>> 200000
sum = 2472113: 28658
sum = 2472113: 27436


If you make the branch predictor happy by swapping the divident and divisor, ie. it takes the same path every time, you get different results:

Code: Select all
>>10
sum = 199999000000: 13790
sum = 199999000000: 43432
>>20
sum = 99999000000: 14107
sum = 99999000000: 42418
>>30
sum = 66665666670: 14192
sum = 66665666670: 42653
>>32
sum = 62499000000: 41213
sum = 62499000000: 41344
>>31
sum = 64515129034: 13832
sum = 64515129034: 41986
>>32
sum = 62499000000: 41253
sum = 62499000000: 41696
>>31
sum = 64515129034: 14121
sum = 64515129034: 42937
>>32
sum = 62499000000: 40995
sum = 62499000000: 41509
You would probably get varying results depending on inlining contexts, instruction cache bloating, variance of divisors etc. Rewriting it to only utilize incrementing divisors from 0 to 31 resulted in FastDivide being twice as fast.
stratum
KVRian
 
1236 posts since 29 May, 2012

Postby stratum; Tue May 09, 2017 3:09 pm Re: odd integer division optimization

Hi,

Thanks for taking the time to test it. I had assumed it to be an ancient trick that's probably irrelevant nowadays and asked about it considering that perhaps a retired expert could remember those days, but the git commit comment on that file reveals that the optimization was added in december 2013, so I guess it intentionally optimizes the branch prediction case that you have revealed, for reasonably modern cpus.
~stratum~
camsr
KVRAF
 
6630 posts since 16 Feb, 2005

Postby camsr; Tue May 09, 2017 7:07 pm Re: odd integer division optimization

With out-of-order processing, I doubt this is an optimization.
Image
stratum
KVRian
 
1236 posts since 29 May, 2012

Postby stratum; Wed May 10, 2017 2:01 am Re: odd integer division optimization

With out-of-order processing, I doubt this is an optimization.


DCT result may be "orderly" in some way, who knows. I haven't looked at where it was being used and what the data was.

DCT => Quantize => Runlength & Hufmann encoding. That's one of the algorithms used. The code is very old (circa 1990) and only useful nowadays as an introduction to video compression methods.
~stratum~
User avatar
Urs
u-he
 
21012 posts since 7 Aug, 2002, from Berlin

Postby Urs; Wed May 10, 2017 2:10 am Re: odd integer division optimization

Mayae wrote:for (int i = 1; i < 2000000; ++i)
{
sum += FastDivide(divident, i);
}

Well, the FastDivide only really covers divisors from 1 to 31, so anything from 32 to 1999999 isn't expected to be faster at all...
PurpleSunray
KVRian
 
611 posts since 13 Mar, 2012

Postby PurpleSunray; Wed May 10, 2017 9:11 am Re: odd integer division optimization

I like this kind of bizarre little constructs :D

Did something similar for an int what warps arround 0 at a specific limit. Most of the time this limit is a pow2 so that warp-arround can be a simple AND.

Code: Select all

namespace WarpIntDetail
{
   template<unsigned int nmax> struct warp_func {
      static int wrap(int val) {
         return ((val % nmax) + nmax) % nmax;
      }
   };

   template<> struct warp_func<0x1> { inline static int wrap(int val) { return val & (0x1 - 1); } };
   template<> struct warp_func<0x2> { inline static int wrap(int val) { return val & (0x2 - 1); } };
   template<> struct warp_func<0x4> { inline static int wrap(int val) { return val & (0x4 - 1); } };
   template<> struct warp_func<0x8> { inline static int wrap(int val) { return val & (0x8 - 1); } };
   template<> struct warp_func<0x10> { inline static int wrap(int val) { return val & (0x10 - 1); } };
   template<> struct warp_func<0x20> { inline static int wrap(int val) { return val & (0x20 - 1); } };
   template<> struct warp_func<0x40> { inline static int wrap(int val) { return val & (0x40 - 1); } };
   template<> struct warp_func<0x80> { inline static int wrap(int val) { return val & (0x80 - 1); } };
   template<> struct warp_func<0x100> { inline static int wrap(int val) { return val & (0x100 - 1); } };
   template<> struct warp_func<0x200> { inline static int wrap(int val) { return val & (0x200 - 1); } };
   template<> struct warp_func<0x400> { inline static int wrap(int val) { return val & (0x400 - 1); } };
   template<> struct warp_func<0x800> { inline static int wrap(int val) { return val & (0x800 - 1); } };
   template<> struct warp_func<0x1000> { inline static int wrap(int val) { return val & (0x1000 - 1); } };
   template<> struct warp_func<0x2000> { inline static int wrap(int val) { return val & (0x2000 - 1); } };
   template<> struct warp_func<0x4000> { inline static int wrap(int val) { return val & (0x4000 - 1); } };
   template<> struct warp_func<0x8000> { inline static int wrap(int val) { return val & (0x8000 - 1); } };
   template<> struct warp_func<0x10000> { inline static int wrap(int val) { return val & (0x10000 - 1); } };
   template<> struct warp_func<0x20000> { inline static int wrap(int val) { return val & (0x20000 - 1); } };
   template<> struct warp_func<0x40000> { inline static int wrap(int val) { return val & (0x40000 - 1); } };
   template<> struct warp_func<0x80000> { inline static int wrap(int val) { return val & (0x80000 - 1); } };
   template<> struct warp_func<0x100000> { inline static int wrap(int val) { return val & (0x100000 - 1); } };
   template<> struct warp_func<0x200000> { inline static int wrap(int val) { return val & (0x200000 - 1); } };
   template<> struct warp_func<0x400000> { inline static int wrap(int val) { return val & (0x400000 - 1); } };
   template<> struct warp_func<0x800000> { inline static int wrap(int val) { return val & (0x800000 - 1); } };
   template<> struct warp_func<0x1000000> { inline static int wrap(int val) { return val & (0x1000000 - 1); } };
   template<> struct warp_func<0x2000000> { inline static int wrap(int val) { return val & (0x2000000 - 1); } };
   template<> struct warp_func<0x4000000> { inline static int wrap(int val) { return val & (0x4000000 - 1); } };
   template<> struct warp_func<0x8000000> { inline static int wrap(int val) { return val & (0x8000000 - 1); } };
   template<> struct warp_func<0x10000000> { inline static int wrap(int val) { return val & (0x10000000 - 1); } };
   template<> struct warp_func<0x20000000> { inline static int wrap(int val) { return val & (0x20000000 - 1); } };
   template<> struct warp_func<0x40000000> { inline static int wrap(int val) { return val & (0x40000000 - 1); } };
   template<> struct warp_func<0x80000000> { inline static int wrap(int val) { return val & (0x80000000 - 1); } };

};

template<unsigned int max>
class WrapInt
{
   int m_val;

public:

   WrapInt(int init = 0)
      : m_val(init)
   {
      WarpIntDetail::warp_func<max>::wrap(m_val);
   }

   WrapInt(const WrapInt& r)
      : m_val(r.m_val)
   {
   }

   inline operator int() const {
      return m_val;
   }

   unsigned int GetMax() const {
      return max;
   }

   int GetValue() const {
      return m_val;
   }

   inline WrapInt& operator=(const WrapInt& lh)
   {
      if (&lh == this) {
         return *this;
      }
      m_val = lh.m_val;
      return *this;
   }

   inline WrapInt& operator=(int lh) {
      m_val = WarpIntDetail::warp_func<max>::wrap(lh);
      return *this;
   }
   inline WrapInt& add(int v) {
      m_val = WarpIntDetail::warp_func<max>::wrap(m_val + v);
      return *this;
   }
   inline WrapInt& sub(int v) {
      m_val = WarpIntDetail::warp_func<max>::wrap(m_val - v);
      return *this;
   }
   inline WrapInt& mul(int v) {
      m_val = WarpIntDetail::warp_func<max>::wrap(m_val * v);
      return *this;
   }
   inline WrapInt& div(int v) {
      m_val = WarpIntDetail::warp_func<max>::wrap(m_val / v);
      return *this;
   }
   inline WrapInt& operator + (const int& lh) {
      return add(lh);
   }
   inline WrapInt& operator - (const int& lh) {
      return sub(lh);
   }
   inline WrapInt& operator * (const int& lh) {
      return mul(lh);
   }
   inline WrapInt& operator / (const int& lh) {
      return div(lh);
   }
   inline WrapInt& operator += (const int& lh) {
      return add(lh);
   }
   inline WrapInt& operator -= (const int& lh) {
      return sub(lh);
   }
   inline WrapInt& operator *= (const int& lh) {
      return mul(lh);
   }
   inline WrapInt& operator /= (const int& lh) {
      return div(lh);
   }
   inline WrapInt& operator ++ (int) {
      return add(1);
   }
   inline WrapInt operator -- (int) {
      return sub(1);
   }
};

:clown:
Mayae
KVRian
 
514 posts since 1 Jan, 2013, from Denmark

Postby Mayae; Thu May 11, 2017 9:30 am Re: odd integer division optimization

Urs wrote:
Mayae wrote:for (int i = 1; i < 2000000; ++i)
{
sum += FastDivide(divident, i);
}

Well, the FastDivide only really covers divisors from 1 to 31, so anything from 32 to 1999999 isn't expected to be faster at all...

Yes, I covered the case for divisor < 32 to showcase this as well.
Miles1981
KVRian
 
1251 posts since 26 Apr, 2004, from UK

Postby Miles1981; Thu May 11, 2017 9:35 am Re: odd integer division optimization

The point Urs makes is that the speedup for the 32 small integers is hidden by the 2 other millions. The maximum speedup you can expect is ridiculously small and can't be detected (i.e. the test doesn't prove anything).
Mayae
KVRian
 
514 posts since 1 Jan, 2013, from Denmark

Postby Mayae; Thu May 11, 2017 11:02 am Re: odd integer division optimization

Miles1981 wrote:The point Urs makes is that the speedup for the 32 small integers is hidden by the 2 other millions. The maximum speedup you can expect is ridiculously small and can't be detected (i.e. the test doesn't prove anything).

The test proves lots of things, for instance: is there significant overhead in the fastdivide function for non-ideal cases? It's rarely interesting only to analyse the supposed benefits when applying a general purpose solution (since fastdivide handles all inputs).
Miles1981
KVRian
 
1251 posts since 26 Apr, 2004, from UK

Postby Miles1981; Thu May 11, 2017 11:15 am Re: odd integer division optimization

Your test may prove that there is not much overhead, but it can't say that if your divisions are usually small numbers, the function is more interesting or not.
stratum
KVRian
 
1236 posts since 29 May, 2012

Postby stratum; Thu May 11, 2017 2:36 pm Re: odd integer division optimization

The reason for which the FastDivide function might be optimized for small numbers is seen around the minute 28:50 in the video below. The sensible values for the quantization value (divisor, which controls the compression amount and bitrate) are probably around 1-31. If the function is used the other way around (arguments swapped) somewhere in the algorithm, DCT results are also small numbers (minute 26:00). The video is not about H261, so there might be some differences but what transform.c/CCITTQuantize does looks similar.

https://www.youtube.com/watch?v=rC16fhvXZOo
~stratum~
Mayae
KVRian
 
514 posts since 1 Jan, 2013, from Denmark

Postby Mayae; Thu May 11, 2017 4:46 pm Re: odd integer division optimization

Miles1981 wrote:Your test may prove that there is not much overhead, but it can't say that if your divisions are usually small numbers, the function is more interesting or not.
And that's why I added results for cases where I specifically test the algorithms intended domain for both variying and constant divisors..? Note that obviously the test code changes, I just didn't post the specific permutations for brewity

Moderator: Moderators (Main)

Return to DSP and Plug-in Development