Tuesday, August 26, 2014

More Deep Learning Musings


Yoshua Bengio, one of the luminaries of the deep learning community, gave multiple talks about deep learning at ICML 2014 this year. I like Bengio's focus on the statistical aspects of deep learning. Here are some thoughts I had in response to his presentations.

Regularization via Depth

One of Bengio's talking points was that depth is an effective regularizer. The argument goes something like this: by composing multiple layers of (limited-capacity) nonlinearities, the overall architecture is able to explore an interesting subset of highly flexible models, relative to shallow models of similar leading-order flexibility. Here "interesting" means the models are flexible enough to capture the target concept, yet constrained enough to be learnable with only modest data requirements. This is really a claim about the target concepts we are trying to model (e.g., in AI tasks). Another way to say it is (paraphrasing) "we are looking for regularizers which are more constraining than smoothness assumptions, yet which are still broadly applicable across tasks of interest."

Is this true?

As a purely mathematical statement it is definitely true that composing nonlinearities through bottlenecks leads to a subset of a larger model space. For example, composing order $d$ polynomial units in a deep architecture with $m$ levels results in something whose leading order terms are monomials of order $m d$; but many of the terms in a full order $m d$ polynomial expansion (aka "shallow architecture") are missing. So: leading-order flexibility, but a restricted model space. But does it matter?
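To make the subset claim concrete, here is a minimal sketch (an illustration of mine, not anything from Bengio's talk) using sympy: composing two levels of order-2 polynomial units reaches order-4 monomials, but only a strict subset of the monomials available to a full order-4 ("shallow") expansion. The particular units below are arbitrary.

```python
import itertools
import sympy as sp

x, y = sp.symbols('x y')

# First layer: two order-2 polynomial units (arbitrary coefficients).
u1 = x**2 + x*y + y
u2 = y**2 + x

# Second layer: another order-2 unit applied to the first layer's outputs.
deep = sp.expand(u1**2 + u1*u2 + u2)

# Monomials that actually appear in the composed ("deep") polynomial.
deep_monoms = {sp.Mul(*(s**e for s, e in zip((x, y), m)))
               for m in sp.Poly(deep, x, y).monoms()}

# All monomials available to a full ("shallow") expansion of the same order.
order = sp.Poly(deep, x, y).total_degree()   # 4 here
full_monoms = {x**i * y**j
               for i, j in itertools.product(range(order + 1), repeat=2)
               if 0 < i + j <= order}

# Non-empty: the deep composition has leading-order flexibility but covers
# only a subset of the full expansion's terms.
print(sorted(full_monoms - deep_monoms, key=sp.default_sort_key))
```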

For me the best evidence comes from that old chestnut MNIST. For many years the Gaussian kernel yielded better results than deep learning on MNIST among solutions that did not exploit spatial structure. Since the discovery of dropout this is no longer true, and one can see a gap between the Gaussian kernel (at circa 1.2% test error) and, e.g., maxout networks (at 0.9% test error). The Gaussian kernel essentially works by penalizing all function derivatives, i.e., enforcing smoothness. Now it seems something more powerful is happening with deep architectures and dropout. You might say, "hey, 1.2% vs. 0.9%, aren't we splitting hairs?" but I don't think so. I suspect something else is going on here, but that's just a guess, and I certainly don't understand it.

The counterargument is that, to date, the major performance gains in deep learning happen when the composition by depth is combined with a decomposition of the feature space (e.g., spatial or temporal). In speech the Gaussian kernel (in the highly scalable form of random Fourier features) is able to approach the performance of deep learning on TIMIT if the deep net cannot exploit temporal structure, i.e., RFF is competitive with non-convolutional DNNs on this task, but is surpassed by convolutional DNNs. (Of course, from a computational standpoint, a deep network starts to look downright parsimonious compared to hundreds of thousands of random Fourier features, but we're talking statistics here.)
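For readers who haven't seen random Fourier features, here is a minimal numpy sketch of the construction (Rahimi and Recht's trick in its simplest form, not the actual TIMIT pipeline): the Gaussian kernel is replaced by an explicit finite-dimensional random feature map, so kernel methods reduce to ordinary linear learning. The dimensions and bandwidth below are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

def rff_features(X, num_features=2000, sigma=1.0, rng=rng):
    """Random Fourier features approximating the Gaussian (RBF) kernel
    k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d = X.shape[1]
    W = rng.normal(scale=1.0 / sigma, size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

# Sanity check: feature dot products approximate the exact kernel.
X = rng.normal(size=(5, 10))
Z = rff_features(X)
approx = Z @ Z.T
exact = np.exp(-0.5 * np.square(X[:, None, :] - X[None, :, :]).sum(-1))
print(np.abs(approx - exact).max())   # small for moderately large num_features
```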

The Dangers of Long-Distance Relationships

So for general problems it's not clear that "regularization via depth" is obviously better than general smoothness regularizers (although I suspect it is). However, for problems in computer vision it is intuitive that deep composition of representations is beneficial. This is because the spatial domain comes with a natural concept of neighborhoods, which can be used to beneficially limit model complexity.

For a task like natural scene understanding, various objects of limited spatial extent are placed in different relative positions on top of a multitude of backgrounds. In this setting some key aspects of discrimination will be determined by local statistics, and others by distal statistics. However, given a training set of 256x256 pixel images, each example in the training set provides only one realization of a pair of pixels offset by 256 pixels down and to the right (i.e., the top-left and bottom-right pixels). In contrast, each example provides $252^2$ realizations of a pair of pixels offset by 4 pixels down and to the right. Although these realizations are not independent, for natural scene images at normal photographic scales there is far more data about local dependencies than about distal dependencies in each training example. Statistically speaking, this suggests it is safer to attempt to estimate highly complex relationships between nearby pixels, but that long-range dependencies must be regularized more severely. Deep hierarchical architectures are one way to achieve these twin objectives.
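A quick back-of-the-envelope check of that counting argument (my own sketch, reusing the 256x256 example): the number of realizations a single H x W image provides of a pixel pair at a fixed offset is just (H - dy)(W - dx).

```python
# Number of realizations a single H x W image provides of a pixel pair
# offset by (dy, dx) down and to the right.
def pair_count(height, width, dy, dx):
    return max(height - dy, 0) * max(width - dx, 0)

H = W = 256
print(pair_count(H, W, 255, 255))  # full-diagonal pair (top-left vs. bottom-right): 1
print(pair_count(H, W, 4, 4))      # pair offset by 4 down and right: 252**2 = 63504
```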

One way to appreciate this prior is to note that it applies to model classes not normally associated with deep learning. On the venerable MNIST dataset, Gaussian kernel least squares achieves 1.2% test error (with no training error). Splitting each example into 4 quadrants, computing a Gaussian kernel on each quadrant, and then computing Gaussian kernel least squares on the resulting 4-vectors achieves 0.96% test error (with no training error). The difference between the Gaussian kernel and the "deep" Gaussian kernel is that the latter is constrained in its ability to model interactions between distal pixels. Although I haven't tried it, I believe decision tree ensembles could be similarly improved by constraining each path from root to leaf to split only on spatially adjacent pixels.
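The paragraph above doesn't pin down the exact composition, so the following is just one plausible reading, sketched in numpy under stated assumptions: flattened 28x28 MNIST images, an inner Gaussian kernel per quadrant, and an outer Gaussian kernel applied to the distance the inner kernels induce. The values of gamma_in, gamma_out, and lam are illustrative placeholders, not tuned settings from the experiment.

```python
import numpy as np

def gaussian_gram(A, B, gamma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (np.square(A).sum(1)[:, None] + np.square(B).sum(1)[None, :]
          - 2.0 * A @ B.T)
    return np.exp(-gamma * sq)

def deep_gaussian_gram(A, B, gamma_in=0.02, gamma_out=1.0, side=28):
    """'Deep' Gaussian kernel sketch: an inner Gaussian kernel per image
    quadrant, then an outer Gaussian kernel on the squared RKHS distance
    induced by the inner kernels (one plausible reading, not a reproduction
    of the original experiment)."""
    A_img = A.reshape(len(A), side, side)
    B_img = B.reshape(len(B), side, side)
    half = side // 2
    quadrant_dist = np.zeros((len(A), len(B)))
    for r in (0, half):
        for c in (0, half):
            Aq = A_img[:, r:r + half, c:c + half].reshape(len(A), -1)
            Bq = B_img[:, r:r + half, c:c + half].reshape(len(B), -1)
            k = gaussian_gram(Aq, Bq, gamma_in)
            # Per-quadrant squared RKHS distance: k(x,x) + k(z,z) - 2 k(x,z),
            # and k(x,x) = k(z,z) = 1 for a Gaussian kernel.
            quadrant_dist += 2.0 * (1.0 - k)
    return np.exp(-gamma_out * quadrant_dist)

def kernel_least_squares_predict(X_train, Y_train, X_test, lam=1e-6):
    """Kernel (ridge) least squares with the composed kernel; X is n x 784,
    Y is n x 10 one-hot labels, lam is a small ridge term for stability."""
    K = deep_gaussian_gram(X_train, X_train)
    alpha = np.linalg.solve(K + lam * np.eye(len(K)), Y_train)
    return deep_gaussian_gram(X_test, X_train) @ alpha
```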

It's a Beautiful Day in the Neighborhood

The outstanding success of hard-wiring hierarchical spatial structure into a deep architecture for computer vision has motivated the search for similar concepts of local neighborhoods for other tasks such as speech recognition and natural language processing. For temporal data, time provides a natural concept of locality, but for text data the situation is more opaque. Lexical distance in a sentence is only a moderate indicator of semantic distance, which is why much of NLP is about uncovering latent structure (e.g., topic modeling, parsing). One line of active research synthesizes NLP techniques with deep architectures hierarchically defined given a traditional NLP decomposition of the input.

Another response to the relative difficulty of articulating neighborhood structure for text is to ask "can I learn the neighborhood structure instead, just using a general deep architecture?" Learning from scratch has a natural appeal, especially where intuition runs out; however, in vision it is currently necessary to hard-wire spatial structure into the model to get anywhere near state-of-the-art performance (given current data and computational resources).

So it is an open question to what extent a good solution to, e.g., machine translation will involve hand-specified prior knowledge versus knowledge derived from data. This sounds like the old "nature vs. nurture" debate from cognitive science, but I suspect more progress will be made on this question, because now the debate is informed by actually attempting to engineer systems that perform the associated tasks.