Tuesday, August 26, 2014

More Deep Learning Musings

Yoshua Bengio, one of the luminaries of the deep learning community, gave multiple talks about deep learning at ICML 2014 this year. I like Bengio's focus on the statistical aspects of deep learning. Here are some thoughts I had in response to his presentations.

Regularization via Depth

One of Bengio's themes was that depth is an effective regularizer. The argument goes something like this: by composing multiple layers of (limited capacity) nonlinearities, the overall architecture is able to explore an interesting subset of highly flexible models, relative to shallow models of similar leading-order flexibility. Here "interesting" means models that are flexible enough to capture the target concept, yet constrained enough to be learnable with only modest data requirements. This is really a statement about the target concepts we are trying to model (e.g., in AI tasks). Another way to put it is (paraphrasing) "look for regularizers that are more constraining than smoothness assumptions, but that are still broadly applicable to tasks of interest."

As a purely mathematical statement it is definitely true that composing nonlinearities through bottlenecks leads to a subset of a larger model space. For example, composing order $d$ polynomial units in a deep architecture with $m$ levels results in something whose leading order terms are monomials of order $m d$; but many of the terms in a full order $m d$ polynomial expansion (aka "shallow architecture") are missing. Thus there is leading-order flexibility, but a restricted model space. Does this matter, though?
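Before getting to that, a back-of-the-envelope parameter count makes the "leading order flexibility, restricted model space" trade concrete. This is only a sketch: the layer width w, and the convention of counting a degree-d unit over k inputs as C(k+d, d) coefficients, are my own illustrative assumptions.

# Rough count: full degree-(d*m) polynomial expansion vs. a depth-m stack of degree-d units.
from math import comb

def full_polynomial_params(n_inputs, degree):
    # Number of monomials of total degree <= degree in n_inputs variables ("shallow" expansion).
    return comb(n_inputs + degree, degree)

def deep_polynomial_params(n_inputs, degree, width, depth):
    # Coefficients in a stack of `depth` layers, each with `width` degree-`degree` units.
    total, fan_in = 0, n_inputs
    for _ in range(depth):
        total += width * comb(fan_in + degree, degree)
        fan_in = width
    return total

n, d, m, w = 100, 2, 3, 50                     # illustrative sizes only
print(full_polynomial_params(n, d * m))        # full degree-6 expansion: ~1.7 billion terms
print(deep_polynomial_params(n, d, w, m))      # composed quadratics: ~390 thousand coefficients

Both families produce leading-order terms of degree $d m$, but the composed version spans only a thin slice of the full coefficient space.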

For me the best evidence comes from that old chestnut MNIST. For many years the Gaussian kernel yielded better results than deep learning on MNIST among solutions that did not exploit spatial structure. Since the discovery of dropout this is no longer true, and one can see a gap between the Gaussian kernel (at circa 1.2% test error) and, e.g., maxout networks (at 0.9% test error). The Gaussian kernel essentially works by penalizing all function derivatives, i.e., enforcing smoothness. Now it seems something more powerful is happening with deep architectures and dropout. You might say, "hey, 1.2% versus 0.9%, aren't we splitting hairs?" But I don't think so. I suspect something else is going on here, but that's just a guess, and I certainly don't understand it.
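For concreteness, the kind of permutation-invariant Gaussian kernel baseline I have in mind looks roughly like this. It is a scikit-learn sketch; the hyperparameters are placeholder guesses, and I'm not claiming it reproduces the circa 1.2% figure.

# Sketch: RBF-kernel SVM on raw MNIST pixels, i.e., no use of spatial structure.
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
X_train, X_test, y_train, y_test = train_test_split(
    X / 255.0, y, test_size=10000, random_state=0)

# The RBF kernel's norm penalizes derivatives of every order, so this is
# a pure smoothness regularizer over pixel space.
clf = SVC(kernel="rbf", C=10.0, gamma="scale")   # placeholder hyperparameters
clf.fit(X_train, y_train)
print("test error:", 1.0 - clf.score(X_test, y_test))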

The counterargument is that, to date, the major performance gains in deep learning happen when composition by depth is combined with a decomposition of the feature space (e.g., spatial or temporal). In speech, the Gaussian kernel (in the highly scalable form of random Fourier features) is able to approach the performance of deep learning on TIMIT when the deep net cannot exploit temporal structure, i.e., RFF is competitive with non-convolutional DNNs on this task, but is surpassed by convolutional DNNs. (Of course, from a computational standpoint, a deep network starts to look downright parsimonious compared to hundreds of thousands of random Fourier features, but we're talking statistics here.)
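For reference, here is a minimal sketch of the random Fourier feature construction (Rahimi and Recht) I mean by "highly scalable form"; the bandwidth and feature count below are placeholders, not the settings from the TIMIT experiments.

# Random Fourier features approximating a Gaussian kernel.
import numpy as np

def random_fourier_features(X, n_features=2000, bandwidth=1.0, seed=0):
    # Maps X (n_samples, n_dims) to features whose inner products approximate
    # the Gaussian kernel exp(-||x - x'||^2 / (2 * bandwidth**2)).
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 1.0 / bandwidth, size=(X.shape[1], n_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# Z = random_fourier_features(X_train, n_features=100_000)  # then fit any linear learner on Z

A linear model trained on these features behaves like a Gaussian kernel method, but the cost scales with the number of features rather than quadratically in the number of examples.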

The Dangers of Long-Distance Relationships

So for general problems it's not clear that "regularization via depth" is obviously better than general smoothness regularizers (although I suspect it is). However, for problems in computer vision it is intuitive that deep composition of representations is beneficial. This is because the spatial domain comes with a natural concept of neighborhoods, which can be used to beneficially limit model complexity.
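To put a number on "neighborhoods limit model complexity", here is a toy single-layer parameter count; the layer sizes are my own illustrative choices, not from any particular architecture.

# One layer mapping a 32x32 RGB image to 64 feature maps of the same spatial size.
image_h, image_w, in_channels = 32, 32, 3
out_channels, kernel = 64, 5                   # assumed sizes, purely illustrative

dense_params = (image_h * image_w * in_channels) * (image_h * image_w * out_channels)
conv_params = out_channels * (kernel * kernel * in_channels) + out_channels  # weights + biases

print(f"fully connected: {dense_params:,}")    # 201,326,592
print(f"convolutional:   {conv_params:,}")     # 4,864

Same input, same output shape, but the convolutional layer only lets each output depend on a 5x5 spatial neighborhood, which is exactly the kind of hard constraint a generic smoothness penalty does not express.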

It's a Beautiful Day in the Neighborhood

The outstanding success of hard-wiring hierarchical spatial structure into a deep architecture for computer vision has motivated the search for similar concepts of local neighborhoods for other tasks such as speech recognition and natural language processing. For temporal data, time provides a natural concept of locality, but for text data the situation is more opaque. Lexical distance in a sentence is only a moderate indicator of semantic distance, which is why much of NLP is about uncovering latent structure (e.g., topic modeling, parsing). One line of active research synthesizes NLP techniques with deep architectures that are hierarchically defined given a traditional NLP decomposition of the input.