Friday, June 24, 2011

Faster LDA

Now that I'm wielding the evil power of byte reinterpretation between integers and IEEE floating point, I've been looking around for things to speed up. The recently released fast LDA implementation by Smola et al. reminded me to take a look at Vowpal Wabbit's LDA implementation. The LDA source file is full of expensive-looking transcendental function calls that could be replaced with approximate versions, hopefully without breaking the learning algorithm.

Faster Log Gamma and Digamma

Besides the many calls to exponential and logarithm, for which I already have fast approximations, there are also some calls to log gamma and digamma. Fortunately, Stirling's approximation is in terms of a logarithm, so I can leverage the fast logarithm approximation from my previous blog post.
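
For context, here is a minimal sketch of the bit-reinterpretation idea (my own simplified illustration with made-up names, not the actual fastlog from the fastapprox project, which adds correction terms for better accuracy): the bits of a positive IEEE-754 single, read as an integer, are approximately a scaled and shifted base-2 logarithm of the value.
#include <stdint.h>
#include <string.h>

inline float
crudelog2 (float x)   // illustrative only: much less accurate than fastapprox's fastlog2
{
  uint32_t i;
  memcpy (&i, &x, sizeof (i));   // reinterpret the float's bytes as an integer

  // i is (exponent + 127) * 2^23 plus the mantissa bits, so scaling by 2^-23
  // and subtracting an offset yields a rough log2 (x)
  return (float) i * 1.1920928955078125e-7f - 126.94269504f;
}

inline float
crudelog (float x)
{
  return 0.69314718f * crudelog2 (x);   // ln (x) = ln (2) * log2 (x)
}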

I found that, for both log gamma and digamma, combining Stirling's approximation with a few applications of the shift formula gave the best tradeoff between accuracy and speed.
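
Concretely (this is my reading of the constants in the code below, with $\psi$ denoting digamma), the log gamma approximation shifts the argument up by 3 and the digamma approximation shifts it up by 2 before applying the asymptotic expansions:
\[
\log \Gamma (x) \approx \log \Gamma (x + 3) - \log \bigl( x \, (1 + x) \, (2 + x) \bigr), \qquad
\log \Gamma (z) \approx \tfrac{1}{2} \log (2 \pi) - z + \bigl( z - \tfrac{1}{2} \bigr) \log z + \frac{1}{12 z},
\]
\[
\psi (x) \approx \psi (x + 2) - \frac{1}{x} - \frac{1}{1 + x}, \qquad
\psi (z) \approx \log z - \frac{1}{2 z} - \frac{1}{12 z^2}.
\]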
// WARNING: this code has been updated.  Do not use this version.
// Instead get the latest version from http://code.google.com/p/fastapprox

inline float
fastlgamma (float x)
{
  // shift: log Gamma (x) = log Gamma (x + 3) - log (x (1 + x) (2 + x))
  float logterm = fastlog (x * (1.0f + x) * (2.0f + x));
  float xp3 = 3.0f + x;

  // Stirling at x + 3: 0.5 log (2 pi) - (x + 3) + (x + 2.5) log (x + 3) + 1 / (12 (x + 3))
  return
    -2.081061466f - x + 0.0833333f / xp3 - logterm + (2.5f + x) * fastlog (xp3);
}

inline float
fastdigamma (float x)
{
  float twopx = 2.0f + x;
  float logterm = fastlog (twopx);

  // shift: psi (x) = psi (x + 2) - 1 / x - 1 / (1 + x), then asymptotic:
  // psi (x + 2) = log (x + 2) - 1 / (2 (x + 2)) - 1 / (12 (x + 2)^2)
  return - (1.0f + 2.0f * x) / (x * (1.0f + x))
         - (13.0f + 6.0f * x) / (12.0f * twopx * twopx)
         + logterm;
}
Timing and Accuracy
Timing tests were compiled with -O3 -finline-functions -ffast-math, on a box running 64-bit Ubuntu Lucid (so gcc 4:4.4.3-1ubuntu1 and libc 2.11.1-0ubuntu7.6). I also measured average relative accuracy over $x \in [1/100, 10]$, using lgammaf from libc as the gold standard for log gamma and boost::math::digamma as the gold standard for digamma.
fastlgamma relative accuracy = 0.00045967
fastdigamma relative accuracy = 0.000420604
fastlgamma million calls per second = 60.259
lgammaf million calls per second = 21.4703
boost::math::lgamma million calls per second = 1.8951
fastdigamma million calls per second = 66.0639
boost::math::digamma million calls per second = 18.9672
Well, that's roughly a 3x speedup for log gamma (although why is boost::math::lgamma so slow?) and slightly better than 3x for digamma.
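
For what it's worth, here is a sketch of how such an accuracy measurement might be set up (my reconstruction, not the original benchmark; it assumes the fastlgamma and fastdigamma definitions above are in scope, and it skips grid points where the reference value is near zero, since the original handling of those points is not stated):
#include <math.h>
#include <stdio.h>
#include <boost/math/special_functions/digamma.hpp>

// assumes fastlog, fastlgamma, and fastdigamma from above are in scope

int
main (void)
{
  double lgamma_err = 0, digamma_err = 0;
  unsigned int lgamma_n = 0, digamma_n = 0;

  for (float x = 0.01f; x <= 10.0f; x += 0.01f)
    {
      float gold_lgamma = lgammaf (x);
      float gold_digamma = boost::math::digamma (x);

      if (fabsf (gold_lgamma) > 1e-2f)          // skip near-zero references
        {
          lgamma_err += fabsf (fastlgamma (x) - gold_lgamma) / fabsf (gold_lgamma);
          ++lgamma_n;
        }

      if (fabsf (gold_digamma) > 1e-2f)
        {
          digamma_err += fabsf (fastdigamma (x) - gold_digamma) / fabsf (gold_digamma);
          ++digamma_n;
        }
    }

  printf ("fastlgamma mean relative error = %g\n", lgamma_err / lgamma_n);
  printf ("fastdigamma mean relative error = %g\n", digamma_err / digamma_n);

  return 0;
}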

LDA Timing

So I dropped these approximations into the LDA source and used the wiki1K file from the test/train-sets/ directory, except that I concatenated the file with itself 100 times to slow things down.
Original run
% (cd test; time ../vw --lda 100 --lda_alpha 0.01 --lda_rho 0.01 --lda_D 1000 -b 13 --minibatch 128 train-sets/wiki1Kx100.dat)
your learning rate is too high, setting it to 1
using no cache
Reading from train-sets/wiki1Kx100.dat
num sources = 1
Num weight bits = 13
learning rate = 1
initial_t = 1
power_t = 0.5
learning_rate set to 1
average    since       example  example    current  current  current
loss       last        counter   weight      label  predict features
10.296875  10.296875         3      3.0    unknown   0.0000       38
10.437156  10.577436         6      6.0    unknown   0.0000       14
10.347227  10.239314        11     11.0    unknown   0.0000       32
10.498633  10.650038        22     22.0    unknown   0.0000        2
10.495566  10.492500        44     44.0    unknown   0.0000      166
10.469184  10.442189        87     87.0    unknown   0.0000       29
10.068007  9.666831        174    174.0    unknown   0.0000       17
9.477440   8.886873        348    348.0    unknown   0.0000        2
9.020482   8.563524        696    696.0    unknown   0.0000      143
8.482232   7.943982       1392   1392.0    unknown   0.0000       31
8.106654   7.731076       2784   2784.0    unknown   0.0000       21
7.850269   7.593883       5568   5568.0    unknown   0.0000       25
7.707427   7.564560      11135  11135.0    unknown   0.0000       39
7.627417   7.547399      22269  22269.0    unknown   0.0000       61
7.583820   7.540222      44537  44537.0    unknown   0.0000        5
7.558824   7.533827      89073  89073.0    unknown   0.0000      457

finished run
number of examples = 0
weighted example sum = 0
weighted label sum = 0
average loss = -nan
best constant = -nan
total feature number = 0
../vw --lda 100 --lda_alpha 0.01 --lda_rho 0.01 --lda_D 1000 -b 13 --minibatc  69.06s user 13.60s system 101% cpu 1:21.59 total
Approximate run
% (cd test; time ../vw --lda 100 --lda_alpha 0.01 --lda_rho 0.01 --lda_D 1000 -b 13 --minibatch 128 train-sets/wiki1Kx100.dat)
your learning rate is too high, setting it to 1
using no cache
Reading from train-sets/wiki1Kx100.dat
num sources = 1
Num weight bits = 13
learning rate = 1
initial_t = 1
power_t = 0.5
learning_rate set to 1
average    since       example  example    current  current  current
loss       last        counter   weight      label  predict features
10.297077  10.297077         3      3.0    unknown   0.0000       38
10.437259  10.577440         6      6.0    unknown   0.0000       14
10.347348  10.239455        11     11.0    unknown   0.0000       32
10.498796  10.650243        22     22.0    unknown   0.0000        2
10.495748  10.492700        44     44.0    unknown   0.0000      166
10.469374  10.442388        87     87.0    unknown   0.0000       29
10.068179  9.666985        174    174.0    unknown   0.0000       17
9.477574   8.886968        348    348.0    unknown   0.0000        2
9.020435   8.563297        696    696.0    unknown   0.0000      143
8.482178   7.943921       1392   1392.0    unknown   0.0000       31
8.106636   7.731093       2784   2784.0    unknown   0.0000       21
7.849980   7.593324       5568   5568.0    unknown   0.0000       25
7.707124   7.564242      11135  11135.0    unknown   0.0000       39
7.627207   7.547283      22269  22269.0    unknown   0.0000       61
7.583952   7.540696      44537  44537.0    unknown   0.0000        5
7.559260   7.534566      89073  89073.0    unknown   0.0000      457

finished run
number of examples = 0
weighted example sum = 0
weighted label sum = 0
average loss = -nan
best constant = -nan
total feature number = 0
../vw --lda 100 --lda_alpha 0.01 --lda_rho 0.01 --lda_D 1000 -b 13 --minibatc  41.97s user 7.69s system 102% cpu 48.622 total
So that's a drop from 81 seconds to 49 seconds, roughly a 40% speedup. But does it still work? Well, I'll let Matt be the final judge (I've sent him a modified LDA source for testing), but initial indications are that using approximate versions of the expensive functions doesn't break the learning.

Next Step: Vectorization

There's definitely more speed to be had, since the LDA source is full of code like
for (size_t i = 0; i<global.lda; i++)
    {
      sum += gamma[i];
      gamma[i] = mydigamma(gamma[i]);
    }
which just screams for vectorization. Stay tuned!
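
As a rough illustration only (this is my sketch, not vw code), a 4-wide SSE2 version of that loop could look something like the following; vcrudelog reuses the crude bit-reinterpretation logarithm from the earlier sketch (the real fastapprox fastlog is more accurate), vfastdigamma mirrors fastdigamma above, and it assumes gamma is 16-byte aligned with global.lda a multiple of 4.
#include <emmintrin.h>

inline __m128
vcrudelog (__m128 x)
{
  // reinterpret the IEEE-754 bits as integers, then scale and shift
  __m128 y = _mm_cvtepi32_ps (_mm_castps_si128 (x));
  y = _mm_sub_ps (_mm_mul_ps (y, _mm_set1_ps (1.1920928955078125e-7f)),
                  _mm_set1_ps (126.94269504f));
  return _mm_mul_ps (y, _mm_set1_ps (0.69314718f));
}

inline __m128
vfastdigamma (__m128 x)
{
  const __m128 one = _mm_set1_ps (1.0f);
  __m128 onepx = _mm_add_ps (one, x);
  __m128 twopx = _mm_add_ps (_mm_set1_ps (2.0f), x);

  // (1 + 2x) / (x (1 + x))
  __m128 term1 = _mm_div_ps (_mm_add_ps (one, _mm_add_ps (x, x)),
                             _mm_mul_ps (x, onepx));
  // (13 + 6x) / (12 (2 + x)^2)
  __m128 term2 = _mm_div_ps (_mm_add_ps (_mm_set1_ps (13.0f),
                                         _mm_mul_ps (_mm_set1_ps (6.0f), x)),
                             _mm_mul_ps (_mm_set1_ps (12.0f),
                                         _mm_mul_ps (twopx, twopx)));

  return _mm_sub_ps (vcrudelog (twopx), _mm_add_ps (term1, term2));
}

// ... and the loop body becomes:
//
//   __m128 vsum = _mm_setzero_ps ();
//   for (size_t i = 0; i < global.lda; i += 4)
//     {
//       __m128 g = _mm_load_ps (gamma + i);
//       vsum = _mm_add_ps (vsum, g);
//       _mm_store_ps (gamma + i, vfastdigamma (g));
//     }
//   // (horizontal sum of vsum into sum omitted)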
