Thursday, February 21, 2013

One Louder

Nikos and I recently added a neural network to vee-dub via a reduction. It's not a deep learning implementation, it's just a single hidden layer, so you might ask "what's the point?" Our original motivation was to win some Kaggle competitions using only vee-dub. I've been too busy lately to enter any competitions, but the reduction works as anticipated. In particular, my hope is that most problems can continue to be attacked by engineering features favorable to a linear model, with a few hidden units added at the end to squeeze out additional performance. I've talked elsewhere about a tradeoff between computation and data set size, but this is a more explicit example.

The splice site dataset from the 2008 Pascal Large Scale Learning Challenge is a collection of 50 million labeled DNA sequences, each of which is 200 base pairs long. For our purposes it is a binary classification problem on strings over a small alphabet. Here are some examples:
% paste -d' ' <(bzcat dna_train.lab.bz2) <(bzcat dna_train.dat.bz2) | head -3
-1 AGGTTGGAGTGCAGTGGTGCGATCATAGCTCACTGCAGCCTCAAACTCCTGGGCTCAAGTGATCCTCCCATCTCAGCCTCCCAAATAGCTGGGCCTATAGGCATGCACTACCATGCTCAGCTAATTCTTTTGTTGTTGTTGTTGAGACGAAGCCTCGCTCTGTCGCCCAGGCTGGAGTGCAGTGGCACAATCTCGGCTCG
-1 TAAAAAAATGACGGCCGGTCGCAGTGGCTCATGCCTGTAATCCTAGCACTTTGGGAGGCCGAGGCGGGTGAATCACCTGAGGCCAGGAGTTCGAGATCAGCCTGGCCAACATGGAGAAATCCCGTCTCTACTAAAAATACAAAAATTAGCCAGGCATGGTGGCGGGTGCCTGTAATCCCAGCTACTCGGGAGGCTGAGGT
-1 AAAAGAGGTTTAATTGGCTTACAGTTCCGCAGGCTCTACAGGAAGCATAGCGCCAGCATCTCACAATCATGACAGAAGATGAAGAGGGAGCAGGAGCAAGAGAGAGGTGAGGAGGTGCCACACACTTTTAAACAACCAGATCTCACGAAAACTCAGTCACTATTGCAAGAACAGCACCAAGGGGACGGTGTTAGAGCATT
It turns out logistic regression works quite well if you decompose these strings into $n$-grams. Here's a little program that processes a DNA sequence into 4-grams and emits a vee-dub compatible format.
% less quaddna2vw.cpp
#include <iostream>
#include <string>

namespace
{
  using namespace std;

  // map a DNA base to a 2-bit code: A=0, C=1, G=2, anything else=3
  unsigned int
  codec (const string::const_iterator& c)
    {
      return *c == 'A' ? 0 :
             *c == 'C' ? 1 :
             *c == 'G' ? 2 : 3;
    }
}

int
main (void)
{
  using namespace std;

  while (! cin.eof ())
    {
      string line;

      getline (cin, line);

      if (line.length ())
        {
          // four iterators delimiting a sliding 4-base window
          string::const_iterator ppp = line.begin ();
          string::const_iterator pp = ppp + 1;
          string::const_iterator p = pp + 1;
          unsigned int offset = 1;

          cout << " |f";

          for (string::const_iterator c = p + 1;
               c != line.end ();
               ++ppp, ++pp, ++p, ++c)
            {
              // read the 4-gram as a base-4 number; each window
              // position gets its own block of 256 feature ids
              unsigned int val = 64 * codec (ppp) +
                                 16 * codec (pp) +
                                  4 * codec (p) +
                                      codec (c);

              cout << " " << offset + val << ":1";
              offset += 256;
            }

          cout << endl;
        }
    }

  return 0;
}
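To make the encoding concrete, here's the output for a made-up 5-base sequence, assuming the program has been compiled to quaddna2vw as in the Makefile below. The first window ACGT reads off as $64 \cdot 0 + 16 \cdot 1 + 4 \cdot 2 + 3 = 27$ within the first block of 256 ids (so feature 28, given the offset of 1), and the second window CGTA reads off as 108 within the second block (feature $257 + 108 = 365$):
% echo ACGTA | ./quaddna2vw
 |f 28:1 365:1
A 200 base pair line therefore yields 197 of these position-specific features, which together with the constant accounts for the 198 features per example in the vee-dub transcripts below.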
I'll drive the learning pipeline with the following Makefile.
% less Makefile
SHELL=/bin/zsh
CXXFLAGS=-O3

.SECONDARY:

all:

%.check:
        @test -x "$$(which $*)" || {                            \
          echo "ERROR: you need to install $*" 1>&2;            \
          exit 1;                                               \
        }

dna_train.%.bz2: wget.check
        wget ftp://largescale.ml.tu-berlin.de/largescale/dna/dna_train.$*.bz2

quaddna2vw: quaddna2vw.cpp

quaddna.model.nn%: dna_train.lab.bz2 dna_train.dat.bz2 quaddna2vw vw.check
        time paste -d' '                                        \
            <(bzcat $(word 1,$^))                               \
            <(bzcat $(word 2,$^) | ./quaddna2vw) |              \
          tail -n +1000000 |                                    \
         vw -b 24 -l 0.05 --adaptive --invariant                \
          --loss_function logistic -f $@                        \
          $$([ $* -gt 0 ] && echo "--nn $* --inpass")

quaddna.test.%: dna_train.lab.bz2 dna_train.dat.bz2 quaddna.model.% quaddna2vw vw.check
        paste -d' '                                             \
          <(bzcat $(word 1,$^))                                 \
          <(bzcat $(word 2,$^) | ./quaddna2vw) |                \
        head -n +1000000 |                                      \
         vw -t --loss_function logistic -i $(word 3,$^) -p $@

quaddna.perf.%: dna_train.lab.bz2 quaddna.test.% perf.check
        paste -d' '                                             \
          <(bzcat $(word 1,$^))                                 \
          $(word 2,$^) |                                        \
        head -n +1000000 |                                      \
        perf -ROC -APR
Here's the result of one sgd pass over the data with logistic regression.
% make quaddna.perf.nn0
g++ -O3 -I/home/pmineiro/include -I/usr/local/include -L/home/pmineiro/lib -L/usr/local/lib  quaddna2vw.cpp   -o quaddna2vw
time paste -d' '                                        \
            <(bzcat dna_train.lab.bz2)                          \
            <(bzcat dna_train.dat.bz2 | ./quaddna2vw) |         \
          tail -n +1000000 |                                    \
         vw -b 24 -l 0.05 --adaptive --invariant                \
          --loss_function logistic -f quaddna.model.nn0                 \
          $([ 0 -gt 0 ] && echo "--nn 0 --inpass")
final_regressor = quaddna.model.nn0
Num weight bits = 24
learning rate = 0.05
initial_t = 0
power_t = 0.5
using no cache
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.673094   0.673094            3         3.0  -1.0000  -0.0639      198
0.663842   0.654590            6         6.0  -1.0000  -0.0902      198
0.623277   0.574599           11        11.0  -1.0000  -0.3074      198
0.579802   0.536327           22        22.0  -1.0000  -0.3935      198
...
0.011148   0.009709     22802601  22802601.0  -1.0000 -12.1878      198
0.009952   0.008755     45605201  45605201.0  -1.0000 -12.7672      198

finished run
number of examples = 49000001
weighted example sum = 4.9e+07
weighted label sum = -4.872e+07
average loss = 0.009849
best constant = -0.9942
total feature number = 9702000198
paste -d' ' <(bzcat dna_train.lab.bz2)   53.69s user 973.20s system 36% cpu 46:22.36 total
tail -n +1000000  3.87s user 661.57s system 23% cpu 46:22.36 total
vw -b 24 -l 0.05 --adaptive --invariant --loss_function logistic -f    286.54s user 1380.19s system 59% cpu 46:22.43 total
paste -d' '                                             \
          <(bzcat dna_train.lab.bz2)                            \
          <(bzcat dna_train.dat.bz2 | ./quaddna2vw) |           \
        head -n +1000000 |                                      \
         vw -t --loss_function logistic -i quaddna.model.nn0 -p quaddna.test.nn0
only testing
Num weight bits = 24
learning rate = 10
initial_t = 1
power_t = 0.5
predictions = quaddna.test.nn0
using no cache
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.000020   0.000020            3         3.0  -1.0000 -17.4051      198
0.000017   0.000014            6         6.0  -1.0000 -17.3808      198
0.000272   0.000578           11        11.0  -1.0000  -5.8593      198
0.000168   0.000065           22        22.0  -1.0000 -10.5622      198
...
0.008531   0.008113       356291    356291.0  -1.0000 -14.7463      198
0.008372   0.008213       712582    712582.0  -1.0000  -7.1162      198

finished run
number of examples = 1000000
weighted example sum = 1e+06
weighted label sum = -9.942e+05
average loss = 0.008434
best constant = -0.9942
total feature number = 198000000
paste -d' '                                             \
          <(bzcat dna_train.lab.bz2)                            \
          quaddna.test.nn0 |                                    \
        head -n +1000000 |                                      \
        perf -ROC -APR
APR    0.51482
ROC    0.97749
That's a wall clock training time of 47 minutes and a test APR of 0.514. (If you read the above carefully, you'll notice I'm using the first million lines of the file as test data and the remainder as training data.) An APR of circa 0.2 is what the Large Scale Learning Challenge entries achieved with unigram logistic regression, while the best known approaches on this dataset require multiple core-days of computation and achieve an APR of about 0.58.

During the above run, quaddna2vw uses 100% of one cpu and vw uses about 60% of another. In other words, vw is not the bottleneck, and we can spend some extra cpu on learning without any real wall clock impact. So let's turn it up one louder by asking for a small number of hidden units, with a direct connection from the input to the output layer, via --nn 8 --inpass. Everything else stays the same.
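For intuition, here is roughly the functional form being fit with --nn 8 --inpass (a minimal sketch of the reduction as I understand it; the exact parameterization and initialization inside vee-dub may differ): a direct linear term in the input plus 8 tanh hidden units, all trained jointly under logistic loss,
\[
\hat{y}(x) = w_0 + w^\top x + \sum_{j=1}^{8} v_j \tanh\left(b_j + u_j^\top x\right).
\]
With --nn 8 alone the $w^\top x$ term would be absent; --inpass is what keeps the linear model in play and lets the hidden units act as a correction to it.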
% make quaddna.perf.nn8
time paste -d' '                                        \
            <(bzcat dna_train.lab.bz2)                          \
            <(bzcat dna_train.dat.bz2 | ./quaddna2vw) |         \
          tail -n +1000000 |                                    \
         vw -b 24 -l 0.05 --adaptive --invariant                \
          --loss_function logistic -f quaddna.model.nn8                 \
          $([ 8 -gt 0 ] && echo "--nn 8 --inpass")
final_regressor = quaddna.model.nn8
Num weight bits = 24
learning rate = 0.05
initial_t = 0
power_t = 0.5
using input passthrough for neural network training
randomly initializing neural network output weights and hidden bias
using no cache
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.600105   0.600105            3         3.0  -1.0000  -0.2497      198
0.576544   0.552984            6         6.0  -1.0000  -0.3317      198
0.525074   0.463309           11        11.0  -1.0000  -0.6047      198
0.465905   0.406737           22        22.0  -1.0000  -0.7760      198
...
0.010760   0.009331     22802601  22802601.0  -1.0000 -11.5363      198
0.009633   0.008505     45605201  45605201.0  -1.0000 -11.7959      198

finished run
number of examples = 49000001
weighted example sum = 4.9e+07
weighted label sum = -4.872e+07
average loss = 0.009538
best constant = -0.9942
total feature number = 9702000198
paste -d' ' <(bzcat dna_train.lab.bz2)   58.24s user 1017.98s system 38% cpu 46:23.54 total
tail -n +1000000  3.77s user 682.93s system 24% cpu 46:23.54 total
vw -b 24 -l 0.05 --adaptive --invariant --loss_function logistic -f    2341.03s user 573.53s system 104% cpu 46:23.61 total
paste -d' '                                             \
          <(bzcat dna_train.lab.bz2)                            \
          <(bzcat dna_train.dat.bz2 | ./quaddna2vw) |           \
        head -n +1000000 |                                      \
         vw -t --loss_function logistic -i quaddna.model.nn8 -p quaddna.test.nn8
only testing
Num weight bits = 24
learning rate = 10
initial_t = 1
power_t = 0.5
predictions = quaddna.test.nn8
using input passthrough for neural network testing
using no cache
Reading from
num sources = 1
average    since         example     example  current  current  current
loss       last          counter      weight    label  predict features
0.000041   0.000041            3         3.0  -1.0000 -15.2224      198
0.000028   0.000015            6         6.0  -1.0000 -16.5099      198
0.000128   0.000247           11        11.0  -1.0000  -6.7542      198
0.000093   0.000059           22        22.0  -1.0000 -10.7089      198
...
0.008343   0.007864       356291    356291.0  -1.0000 -14.3546      198
0.008138   0.007934       712582    712582.0  -1.0000  -7.0710      198

finished run
number of examples = 1000000
weighted example sum = 1e+06
weighted label sum = -9.942e+05
average loss = 0.008221
best constant = -0.9942
total feature number = 198000000
paste -d' '                                             \
          <(bzcat dna_train.lab.bz2)                            \
          quaddna.test.nn8 |                                    \
        head -n +1000000 |                                      \
        perf -ROC -APR
APR    0.53259
ROC    0.97844
From a wall clock standpoint this is free: total training time increased by 1 second, and vw and quaddna2vw now have roughly equal throughput. Meanwhile test APR goes from 0.515 to 0.532. This illustrates the basic idea: engineer features that work well with your linear model, and then when you run out of ideas, try adding a few hidden units. It's like turning your design matrix up to eleven.

I speculate something akin to gradient boosting is happening, due to the learning rate schedule induced by the adaptive gradient. Specifically, if the direct connections converge more quickly than the hidden units, the hidden units are effectively asked to model the residual of the linear model. This suggests a more explicit form of boosting might yield an even better free lunch.
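To make the residual intuition concrete, here's a minimal sketch under the functional form above: with logistic loss $\log(1 + e^{-y \hat{y}(x)})$, the sgd update for a hidden unit's output weight $v_j$ is proportional to the negative gradient
\[
-\frac{\partial}{\partial v_j} \log\left(1 + e^{-y\, \hat{y}(x)}\right) = y\, \sigma(-y\, \hat{y}(x))\, h_j(x),
\]
where $h_j(x) = \tanh(b_j + u_j^\top x)$ and $\sigma$ is the logistic function. The factor $\sigma(-y\, \hat{y}(x))$ is the probability the current model assigns to the wrong label, so once the direct weights $w$ have mostly converged, the hidden units are being trained against precisely the examples the linear part fails to explain.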