Yoshua Bengio, one of the luminaries of the deep learning community, gave multiple talks about deep learning at ICML 2014 this year. I like Bengio's focus on the statistical aspects of deep learning. Here are some thoughts I had in response to his presentations.




As a purely mathematical statement, it is definitely true that composing nonlinearities through bottlenecks leads to a subset of the larger model space. For example, composing order $d$ polynomial units in a deep architecture with $m$ levels results in something whose leading order terms are monomials of order $d^m$; but many of the terms in a full order-$d^m$ polynomial expansion (aka “shallow architecture”) are missing. Thus there is flexibility at leading order, but the model space is constrained. However, does this matter?
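One rough way to see how constrained the composed model space is: count coefficients. The sketch below is only an illustration under assumptions that are mine, not Bengio's: `n_inputs` input variables, a hypothetical list `widths` giving the number of units at each level (the bottleneck), and every unit a dense degree-`d` polynomial of the previous level's outputs. The number of free coefficients bounds the dimension of the set of functions the deep architecture can express, and it is compared with the coefficient count of a dense polynomial of the same total degree.

```python
from math import comb
from typing import Sequence


def dense_poly_coeffs(n_vars: int, degree: int) -> int:
    """Number of monomials of total degree <= `degree` in `n_vars` variables."""
    return comb(n_vars + degree, degree)


def deep_poly_coeffs(n_inputs: int, d: int, widths: Sequence[int]) -> int:
    """Free coefficients in a stack of degree-d polynomial units.

    `widths[i]` is the (assumed) number of units at level i; each unit is a
    dense degree-d polynomial in the outputs of the previous level.
    """
    total = 0
    fan_in = n_inputs
    for width in widths:
        total += width * dense_poly_coeffs(fan_in, d)
        fan_in = width  # the next level only sees this level's outputs
    return total


def composed_degree(d: int, n_levels: int) -> int:
    """Total degree in the inputs after composing n_levels degree-d units."""
    return d ** n_levels


# Illustrative numbers only: 10 inputs, quadratic units, a width-3 bottleneck.
n_inputs, d, widths = 10, 2, [3, 1]
deep = deep_poly_coeffs(n_inputs, d, widths)
shallow = dense_poly_coeffs(n_inputs, composed_degree(d, len(widths)))
print(f"deep: {deep} coefficients, dense degree-4 polynomial: {shallow}")
```

With these illustrative numbers the two-level stack has 208 coefficients while the dense degree-4 polynomial has 1001, and the gap widens quickly as levels are added.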

For me the best evidence comes from that old chestnut MNIST. For many years the Gaussian kernel yielded better results than deep learning on MNIST among solutions that did not exploit spatial structure. Since the discovery of dropout this is no longer true and one can see a gap between the Gaussian kernel (at circa 1.2% test error) and, e.g., maxout networks (at 0.9% test error). The Gaussian kernel essentially works by penalizing all function derivatives, i.e., enforcing smoothness. Now it seems something more powerful is happening with deep architectures and dropout. You might say, “Hey, 1.2% vs. 0.9%, aren't we splitting hairs?” but I don't think so. I suspect something extra is going on here, but that's just a guess, and I certainly don't understand it.

The counterargument is that, to date, the major performance gains in deep learning happen when the composition by depth is combined with a decomposition of the feature space (e.g., spatial or temporal). In speech the Gaussian kernel (in the highly scalable form of random Fourier features) is able to approach the performance of deep learning on TIMIT if the deep net cannot exploit temporal structure, i.e., RFF is competitive with non-convolutional DNNs on this task, but is surpassed by convolutional DNNs. (Of course, from a computational standpoint, a deep network starts to look downright parsimonious compared to hundreds of thousands of random Fourier features, but we're talking statistics here.)
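For readers who have not seen random Fourier features, here is a minimal pure-Python sketch of the standard construction: draw random frequencies from a Gaussian and random phases, then take scaled cosines so that the inner product of two feature vectors approximates the Gaussian kernel. The names `make_rff` and `gaussian_kernel`, the bandwidth `sigma`, and the toy inputs are all made up for illustration; this is not the setup used in the TIMIT experiments.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Exact Gaussian kernel exp(-||x - y||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def make_rff(dim, n_features, sigma=1.0, seed=0):
    """Return a feature map z with z(x) . z(y) ~= gaussian_kernel(x, y, sigma)."""
    rng = random.Random(seed)
    # Frequencies ~ N(0, 1/sigma^2) per coordinate, phases ~ Uniform[0, 2 pi).
    weights = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(dim)]
               for _ in range(n_features)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def features(x):
        return [scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
                for w, b in zip(weights, phases)]

    return features


# Toy check on two arbitrary 3-dimensional points.
features = make_rff(dim=3, n_features=5000)
x = [0.2, -0.1, 0.4]
y = [0.0, 0.3, 0.1]
approx = sum(a * b for a, b in zip(features(x), features(y)))
exact = gaussian_kernel(x, y)
print(f"approx={approx:.3f} exact={exact:.3f}")
```

The approximation error shrinks roughly like one over the square root of the number of features, which is why so many features are needed in practice and why the computational comparison above favors the deep network.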


So for general problems it's not clear that “regularization via depth” is obviously better than general smoothness regularizers (although I suspect it is). However, for problems in computer vision it is intuitive that deep composition of representations is beneficial. This is because the spatial domain comes with a natural concept of neighborhoods which can be used to beneficially limit model complexity.

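For tasks such as natural scene understanding, various objects of limited spatial extent will be placed in different relative positions on top of a multitude of backgrounds. In this case some key aspects of discrimination will be determined by local statistics, and others by distal statistics. However, given a training set consisting of 256x256 pixel images, each example in the training set provides exactly one realization of a pair of pixels which are offset by 255 pixels down and to the right (i.e., the top-left and bottom-right pixels). In contrast, each example provides $252^2$ realizations of a pair of pixels which are offset by 4 pixels down and to the right. Although these realizations are not independent, for images of natural scenes at normal photographic scales there is much more data about local dependencies than about distal dependencies per training example. Statistically speaking, this indicates that it is safer to attempt to estimate highly complex relationships between nearby pixels, but that long-range dependencies must be more strongly regularized. Deep hierarchical architectures are one way to achieve these dual objectives.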
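The counting argument is easy to check directly. The sketch below (function names are mine) enumerates, for a square image, every pixel that has a partner a given number of pixels down and to the right, and compares the result with the closed form $(N - k)^2$. Offsets are counted so that the corner-to-corner pair in an $N \times N$ image has offset $N - 1$.

```python
def pair_realizations(size: int, offset: int) -> int:
    """Brute-force count of pixel pairs (p, q) in a size x size image where
    q lies `offset` pixels below and `offset` pixels to the right of p."""
    count = 0
    for row in range(size):
        for col in range(size):
            if row + offset < size and col + offset < size:
                count += 1
    return count


def pair_realizations_closed_form(size: int, offset: int) -> int:
    """Same count in closed form: (size - offset)^2 for 0 <= offset < size."""
    return max(size - offset, 0) ** 2


local = pair_realizations(256, 4)      # nearby pixels
distal = pair_realizations(256, 255)   # opposite corners
print(f"offset 4: {local} pairs per image, offset 255: {distal} pair per image")
```

For a 256x256 image this gives $252^2 = 63504$ realizations at offset 4 versus a single realization at offset 255, which is the imbalance the regularization argument relies on.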



The outstanding success of hard-wiring hierarchical spatial structure into a deep architecture for computer vision has motivated the search for similar concepts of local neighborhoods for other tasks such as speech recognition and natural language processing. For temporal data, time provides a natural concept of locality, but for text data the situation is more opaque. Lexical distance in a sentence is only a moderate indicator of semantic distance, which is why much of NLP is about uncovering latent structure (e.g., topic modeling, parsing). One line of active research synthesizes NLP techniques with deep architectures that are defined hierarchically, given a traditional NLP decomposition of the input.

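Another response to the relative difficulty of articulating a neighborhood for text is to ask “can I learn the neighborhood structure instead, just using a general deep architecture?” There is a natural appeal to learning from scratch, especially when intuition is exhausted; however, in vision it is currently necessary to hard-wire spatial structure into the model to get anywhere near state-of-the-art performance (given current data and computational resources).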

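Therefore it is an open question to what extent a good solution to, e.g., machine translation will involve hand-specified prior knowledge versus knowledge derived from data. This sounds like the old “nature vs. nurture” debates from cognitive science, but I suspect more progress will be made on this question, because the debate is now informed by actual attempts to engineer systems that perform the tasks in question.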