Thursday, June 16, 2011

Fast Inverse Roots

Lately I've been wondering how to quickly approximate the inverse $p$-th root of a number, \[
y = x^{-\frac{1}{p}},
\] which comes up when doing $L_p$ normalization.

When $p = 2$, the inverse square root hack applies, so I modified the derivation on Wikipedia for $p \neq 2$, which gives \[
I_y = -\frac{1}{p} I_x + \frac{p + 1}{p} (B - \sigma) L,
\] leading to the following code,
// WARNING: this code has been updated.  Do not use this version.
// Instead get the latest version from http://code.google.com/p/fastapprox

float
fast_inverse_p (float x,
                float p)
{
  union { float f; uint32_t i; } v = { x };

  unsigned int R = 0x5F375A84U;   // R = (3/2) (B - sigma) L
  unsigned int pR = ((2 * (p + 1)) / (3 * p)) * R;
  unsigned int sub = v.i / p;

  v.i = pR - sub;

  return v.f;
}
This routine does something reasonable, but it lacks the option of a Newton's method polish step that the inverse square root hack has. Since I wanted to avoid calling powf, I applied Newton's method to $f(y) = \log(y) + \frac{1}{p} \log(x)$, which yields the iterate \[
y_{n+1} = y_n \left( 1 - \log(2) \left( \frac{1}{p} \log_2(x) + \log_2(y_n) \right) \right).
\] The idea is to use this with a cheap approximation of $\log_2$ that exploits the floating-point representation. The first approximation I tried was a rational one, which resulted in the following.
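For completeness, the iterate is one Newton step on $f$; a sketch of the algebra:

```latex
% Newton's method on f(y) = log(y) + (1/p) log(x), with f'(y) = 1/y:
\begin{align*}
y_{n+1} &= y_n - \frac{f(y_n)}{f'(y_n)} \\
        &= y_n - y_n \left( \log(y_n) + \frac{1}{p} \log(x) \right) \\
        &= y_n \left( 1 - \log(2) \left( \frac{1}{p} \log_2(x) + \log_2(y_n) \right) \right).
\end{align*}
```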
// WARNING: this code has been updated.  Do not use this version.
// Instead get the latest version from http://code.google.com/p/fastapprox

float
fast_inverse_p_one (float x,
                    float p)
{
  float y = fast_inverse_p (x, p);
  union { float f; uint32_t i; } vx = { x };
  union { float f; uint32_t i; } vy = { y };

  // equivalent to: mx = frexp (x, &ex), ex += 126
  int ex = ((vx.i >> 23) & 0xFF);
  union { uint32_t i; float f; } mx = { (vx.i & 0x007FFFFF) | (0x7e << 23) };
  int ey = ((vy.i >> 23) & 0xFF);
  union { uint32_t i; float f; } my = { (vy.i & 0x007FFFFF) | (0x7e << 23) };

  float log2x = ex - 123.52964741f - 4.29802897f / (0.73958333f + mx.f);
  float log2y = ey - 123.52964741f - 4.29802897f / (0.73958333f + my.f);

  return y * (1.0f - 0.69314718f * (log2x / p + log2y));
}
That's somewhat more accurate, and somewhat more expensive. For even more accuracy, I went with a polynomial approximation and two Newton iterations.
// WARNING: this code has been updated.  Do not use this version.
// Instead get the latest version from http://code.google.com/p/fastapprox

float
fast_inverse_p_two (float x,
                    float p)
{
  float y = fast_inverse_p (x, p);
  union { float f; uint32_t i; } vx = { x };
  union { float f; uint32_t i; } vy = { y };

  int ex = ((vx.i >> 23) & 0xFF);
  union { uint32_t i; float f; } mx = { (vx.i & 0x007FFFFF) | (0x7e << 23) };
  int ey = ((vy.i >> 23) & 0xFF);
  union { uint32_t i; float f; } my = { (vy.i & 0x007FFFFF) | (0x7e << 23) };

  float powmx = mx.f;
  float log2x = ex - 129.78533457081f + 10.07763081376f * powmx;
  powmx *= mx.f; log2x -= 13.90982855008f * powmx;
  powmx *= mx.f; log2x += 12.64852864324f * powmx;
  powmx *= mx.f; log2x -= 6.39532347130f * powmx;
  powmx *= mx.f; log2x += 1.364335673877f * powmx;
  log2x /= p;

  float powmy = my.f;
  float log2y = ey - 129.78533457081f + 10.07763081376f * powmy;
  powmy *= my.f; log2y -= 13.90982855008f * powmy;
  powmy *= my.f; log2y += 12.64852864324f * powmy;
  powmy *= my.f; log2y -= 6.39532347130f * powmy;
  powmy *= my.f; log2y += 1.364335673877f * powmy;

  union { float f; uint32_t i; } vz = { y * (1.0f - 0.69314718f * (log2x + log2y)) };
  int ez = ((vz.i >> 23) & 0xFF);
  union { uint32_t i; float f; } mz = { (vz.i & 0x007FFFFF) | (0x7e << 23) };

  float powmz = mz.f;
  float log2z = ez - 129.78533457081f + 10.07763081376f * powmz;
  powmz *= mz.f; log2z -= 13.90982855008f * powmz;
  powmz *= mz.f; log2z += 12.64852864324f * powmz;
  powmz *= mz.f; log2z -= 6.39532347130f * powmz;
  powmz *= mz.f; log2z += 1.364335673877f * powmz;

  return vz.f * (1.0f - 0.69314718f * (log2x + log2z));
}
This is more accurate still, and more expensive still.

Here are some example values, along with average relative error estimates taken over a fairly exhaustive range of inputs.
x       p       fast0   fast1   fast2   powf
7.000   0.870   0.10889 0.10663 0.10681 0.10681
7.000   2.488   0.45789 0.45704 0.45744 0.45744
7.000   4.106   0.63713 0.62284 0.62256 0.62256
7.000   5.724   0.73334 0.71151 0.71181 0.71181
7.000   7.342   0.78715 0.76660 0.76718 0.76718
7.000   8.960   0.82151 0.80411 0.80480 0.80479
69.000  0.870   0.00749 0.00769 0.00770 0.00770
69.000  2.488   0.18674 0.18229 0.18236 0.18236
69.000  4.106   0.36593 0.35650 0.35659 0.35658
69.000  5.724   0.47131 0.47700 0.47726 0.47726
69.000  7.342   0.56050 0.56189 0.56176 0.56176
69.000  8.960   0.63580 0.62382 0.62341 0.62341
211.000 0.870   0.00217 0.00213 0.00213 0.00213
211.000 2.488   0.11642 0.11628 0.11636 0.11636
211.000 4.106   0.27032 0.27150 0.27161 0.27161
211.000 5.724   0.40273 0.39225 0.39260 0.39260
211.000 7.342   0.47678 0.48220 0.48243 0.48243
211.000 8.960   0.54817 0.55025 0.55031 0.55030
err0 = 0.021138
err1 = 0.000680451
err2 = 7.20003e-06
I did a timing test, compiled with -O3 -finline-functions -ffast-math, on a box running Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6).
fast_inverse_p million calls per second = 206.677
fast_inverse_p_one million calls per second = 77.9861
fast_inverse_p_two million calls per second = 26.531
powf million calls per second = 9.04992
Coincidentally, each increase in accuracy costs roughly a factor of three in throughput.

3 comments:

  1. How does p = 0.5 compare with the fast inverse square root squared in the test harness? (I'm interested in both timing and error.)

  2. Yes, I'll test that. However, my immediate reaction is that if you know you are taking an inverse integer power, you can do the Newton step faster and more accurately by targeting $y^{-p} = x$ directly.
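    (Concretely, here is a sketch of the logarithm-free Newton step, taking $f(y) = y^{-p} - x$ as the objective; for $p = 2$ it reduces to the familiar inverse square root polish step.)

```latex
% Newton's method on f(y) = y^{-p} - x, with f'(y) = -p y^{-p-1}:
\begin{align*}
y_{n+1} &= y_n - \frac{y_n^{-p} - x}{-p\, y_n^{-p-1}} \\
        &= y_n \left( 1 + \frac{1 - x y_n^p}{p} \right).
\end{align*}
```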

  3. The suffix indicates the number of Newton steps taken.

    rsqrtf0 relative accuracy = 0.0236779
    rsqrtf1 relative accuracy = 0.000969781
    rsqrtf2 relative accuracy = 1.86903e-06
    rsqrtf3 relative accuracy = 2.66092e-08
    rsqrtf0 million calls per second = 233.094
    rsqrtf1 million calls per second = 182.67
    rsqrtf2 million calls per second = 154.079
    rsqrtf3 million calls per second = 121.4

    So without any Newton steps the two have similar speed and (in)accuracy. However, the superior form of the Newton step means the square root hack has a better compute-versus-accuracy tradeoff.
