01-2015-Deep Learning: paper translation and close reading

Deep learning

>* Original Deep Learning paper: http://pages.cs.wisc.edu/~dyer/cs540/handouts/deep-learning-nature2015.pdf
>* Deep Learning paper download link (fast mirror)

Abstract

> Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

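  • Deep learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction.
  • Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters.
  • Deep convolutional nets have brought breakthroughs in processing images, video, speech and audio, while recurrent nets have shed light on sequential data such as text and speech.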

> Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users’ interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning. Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input. Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations.

> An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure. Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition [...] We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress.

> doi:10.1038/nature14539

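  • Conventional machine-learning techniques were limited in their ability to process natural data in raw form:
    • Designing a feature extractor required careful engineering and considerable domain expertise.
    • The feature extractor transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector.
  • Deep-learning methods are representation-learning methods with multiple levels of representation:
    • They are obtained by composing simple but non-linear modules.
    • With enough such modules composed, very complex functions can be learned.
  • Layer-by-layer learning in deep-learning image recognition:
    • First layer: detects the presence or absence of edges at particular orientations and locations.
    • Second layer: detects motifs by spotting particular arrangements of edges.
    • Third layer: assembles motifs into larger combinations that correspond to parts of familiar objects; subsequent layers detect objects as combinations of these parts.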

Supervised learning

> The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as ‘knobs’ that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine.

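  • Supervised learning is the most common form of machine learning, deep or not.
  • A typical deep-learning system may have hundreds of millions of adjustable weights, and hundreds of millions of labelled examples with which to train the machine.
  • Image classification as an example:
    • Collect a large data set of images of houses, cars, people and pets, each labelled with its category.
    • During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category.
    • An objective function measures the error between the output scores and the desired pattern of scores. The machine reduces this error by modifying its adjustable parameters (often called weights), real numbers that act as ‘knobs’ defining the machine's input–output function, so that the desired category ends up with the highest score of all categories.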
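
As a rough illustration of the score vector and the objective function, here is a minimal sketch. The paper only speaks of "an objective function"; using softmax with cross-entropy is an assumption made for this example, and the score values are invented.

```python
import math

CATEGORIES = ["house", "car", "person", "pet"]  # categories from the example above

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(scores, target_index):
    """Error between the output scores and the desired category (assumed objective)."""
    return -math.log(softmax(scores)[target_index])

def predict(scores):
    """Return the category with the highest score."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return CATEGORIES[best]

# Hypothetical score vectors for one image of a car.
untrained_scores = [0.3, 0.1, 0.2, 0.0]  # before training: wrong category wins
trained_scores = [0.1, 4.0, 0.2, 0.0]    # after training: "car" wins
target = CATEGORIES.index("car")

error_before = cross_entropy(untrained_scores, target)
error_after = cross_entropy(trained_scores, target)
```

Training is then the process of turning the weight ‘knobs’ so that `error_after` is lower than `error_before` on average over the data set.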

> To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector. The objective function, averaged over all the training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.

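  • How to adjust the weight vector:
    • The learning algorithm computes a gradient vector that, for each weight, indicates by how much the error would increase or decrease if that weight were increased by a tiny amount.
    • The weight vector is then adjusted in the direction opposite to the gradient vector.
  • The objective function, averaged over all training examples, can be seen as a hilly landscape in the high-dimensional space of weight values.
  • The negative gradient vector indicates the direction of steepest descent in this landscape, taking the weights closer to a minimum, where the output error is low on average.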
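
The definition of the gradient above ("increase each weight by a tiny amount and see how the error changes") can be sketched directly with finite differences. The two-weight `objective` below is a made-up error surface chosen for illustration; real systems compute gradients analytically with backpropagation rather than by nudging each weight.

```python
def objective(w):
    """Hypothetical error surface with its minimum at w = (3, -1)."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2

def numerical_gradient(f, w, eps=1e-6):
    """For each weight, estimate how the error changes when it is nudged up by eps."""
    base = f(w)
    grad = []
    for i in range(len(w)):
        bumped = list(w)
        bumped[i] += eps
        grad.append((f(bumped) - base) / eps)
    return grad

def gradient_descent(f, w, learning_rate=0.1, steps=100):
    """Repeatedly move the weights in the direction opposite to the gradient."""
    for _ in range(steps):
        g = numerical_gradient(f, w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w

w_start = [0.0, 0.0]
w_final = gradient_descent(objective, w_start)
```

`learning_rate` and `steps` are arbitrary choices for this toy surface; on a real network they have to be tuned.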

> In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques (ref. 18). After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine — its ability to produce sensible answers on new inputs that it has never seen during training.

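  • In practice, most practitioners use a procedure called stochastic gradient descent (SGD).
  • It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples.
  • After training, performance is measured on a separate test set. This tests the machine's generalization ability: its ability to produce sensible answers on new inputs that it has never seen during training.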
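
The SGD loop and the train/test split can be sketched on a toy problem. Everything here is an illustrative assumption: the model is a one-dimensional linear fit, the data are synthetic (`y = 2x + 1` plus noise), and the batch size, learning rate and epoch count are arbitrary.

```python
import random

def sgd_linear_fit(data, batch_size=4, learning_rate=0.05, epochs=200, seed=0):
    """Fit y ~ w*x + b by stochastic gradient descent on squared error."""
    shuffler = random.Random(seed)
    data = list(data)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        shuffler.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Average gradient over a few examples: a noisy estimate
            # of the gradient over the whole training set.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b

def mean_squared_error(data, w, b):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)

rng = random.Random(42)

def make_examples(n):
    """Synthetic labelled examples drawn from y = 2x + 1 with Gaussian noise."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.1)) for x in xs]

train_set = make_examples(80)
test_set = make_examples(20)   # never used during training

w, b = sgd_linear_fit(train_set)
test_error = mean_squared_error(test_set, w, b)
```

The error on `test_set`, which the fit never saw, is the generalization measure the notes describe.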

> Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.

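  • Many current practical applications of machine learning use linear classifiers on top of hand-engineered features.
  • A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.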
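
A minimal sketch of the two-class linear classifier just described. The weights, threshold and feature values are made up for illustration.

```python
def linear_classify(features, weights, threshold):
    """Return True if the weighted sum of the features exceeds the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, features))
    return weighted_sum > threshold

# Hypothetical hand-engineered features and parameters.
weights = [0.5, 0.5]
threshold = 1.0
in_category = linear_classify([1.0, 2.0], weights, threshold)      # weighted sum 1.5
not_in_category = linear_classify([0.2, 0.4], weights, threshold)  # weighted sum 0.3
```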

> Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane (ref. 19). But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other. A linear classifier, or any other ‘shallow’ classifier operating on raw pixels could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma — one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal. To make classifiers more powerful, one can use generic non-linear features, as with kernel methods (ref. 20), but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples (ref. 21). The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.

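  • Linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane (they cannot even handle the classic XOR problem).
  • Some scenarios need the model function to be insensitive to irrelevant variations of the input:
    • Image recognition: variations in the position, orientation or illumination of an object.
    • Speech recognition: variations in the pitch or accent of speech.
  • At the same time, the function must be very sensitive to particular minute variations:
    • For example, the difference between a white wolf and a Samoyed, a breed of wolf-like white dog.
    • Images of two Samoyeds in different poses and environments may be very different from each other, whereas images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar.
  • A linear classifier, or any other ‘shallow’ classifier operating on raw pixels, cannot tell the Samoyed from the wolf while also putting the two Samoyeds in the same category.
  • This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma.
  • The conventional option is to hand-design good feature extractors, which requires considerable engineering skill and domain expertise.
  • The key advantage of deep learning: good features can be learned automatically using a general-purpose learning procedure.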
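
The XOR remark in the notes above can be checked with a small brute-force search. This is only a sketch: it tries a coarse grid of weights and thresholds (an arbitrary choice) rather than proving the result, but it shows that a single hyperplane classifies AND perfectly while getting at most three of the four XOR points right.

```python
from itertools import product

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

def accuracy(w1, w2, threshold, dataset):
    """Fraction of points a linear threshold classifier labels correctly."""
    correct = 0
    for (x1, x2), label in dataset:
        predicted = 1 if w1 * x1 + w2 * x2 > threshold else 0
        correct += int(predicted == label)
    return correct / len(dataset)

def best_linear_accuracy(dataset):
    """Best accuracy over a grid of weights and thresholds in [-2, 2]."""
    grid = [i / 4 for i in range(-8, 9)]  # -2.0, -1.75, ..., 2.0
    return max(accuracy(w1, w2, t, dataset)
               for w1, w2, t in product(grid, repeat=3))

best_and = best_linear_accuracy(AND)
best_xor = best_linear_accuracy(XOR)
```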

> A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details — distinguishing Samoyeds from white wolves — and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects.

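  • A deep-learning architecture is a multilayer stack of simple modules:
    • All (or most) of the modules are subject to learning, and many of them compute non-linear input–output mappings.
    • With multiple non-linear layers (a depth of 5 to 20), a system can implement extremely intricate functions that are sensitive to minute details.
    • It can distinguish Samoyeds from white wolves while remaining insensitive to large irrelevant variations such as background, pose, lighting and surrounding objects.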

Backpropagation to train multilayer architectures

> From the earliest days of pattern recognition, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s.

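  • From the earliest days of pattern recognition, researchers aimed to replace hand-engineered features with trainable multilayer networks.
  • As long as the modules are relatively smooth functions of their inputs and internal weights, gradients can be computed with the backpropagation procedure.
  • Despite its simplicity, this solution was not widely understood until the mid 1980s.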

> The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.

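  • Backpropagation computes the gradient of the objective function with respect to the weights of the network; it is nothing more than a practical application of the chain rule for derivatives.
  • Starting from the partial derivative of the error at the output, gradients are propagated backwards module by module down to the first layer; the gradients with respect to each module's weights then follow directly.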
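
To make the chain rule concrete, here is a minimal sketch of backpropagation through a two-module stack, `y = w2 * tanh(w1 * x)`, with a squared-error objective. The network shape, the scalar weights and the input values are assumptions chosen to keep the example tiny; the result is checked against a finite-difference estimate.

```python
import math

def forward(x, w1, w2):
    z = w1 * x          # input of the hidden module
    h = math.tanh(z)    # output of the hidden module
    y = w2 * h          # output of the network
    return z, h, y

def loss(x, target, w1, w2):
    _, _, y = forward(x, w1, w2)
    return (y - target) ** 2

def backprop(x, target, w1, w2):
    """Work backwards from the output, applying the chain rule at each module."""
    _, h, y = forward(x, w1, w2)
    dL_dy = 2.0 * (y - target)       # gradient at the network output
    dL_dw2 = dL_dy * h               # weight gradient of the top module
    dL_dh = dL_dy * w2               # gradient w.r.t. the top module's input
    dL_dz = dL_dh * (1.0 - h * h)    # through tanh: d tanh(z)/dz = 1 - tanh(z)^2
    dL_dw1 = dL_dz * x               # weight gradient of the bottom module
    return dL_dw1, dL_dw2

def finite_difference(x, target, w1, w2, eps=1e-6):
    """Numerical check using central differences."""
    d1 = (loss(x, target, w1 + eps, w2) - loss(x, target, w1 - eps, w2)) / (2 * eps)
    d2 = (loss(x, target, w1, w2 + eps) - loss(x, target, w1, w2 - eps)) / (2 * eps)
    return d1, d2

# Hypothetical input, target and weights.
analytic = backprop(0.5, 1.0, 0.8, -1.2)
numeric = finite_difference(0.5, 1.0, 0.8, -1.2)
```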

Fig. 1 image

> Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories). To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier f(z) = max(z, 0). In past decades, neural nets used smoother non-linearities, such as tanh(z) or 1/(1 + exp(−z)), but the ReLU typically learns much faster in networks with many layers, allowing training of a deep supervised network without unsupervised pre-training (ref. 28). Units that are not in the input or output layer are conventionally called hidden units. The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer (Fig. 1).

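  • Fig. 1c shows the forward pass: the network learns to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each category).
  • Rectified linear unit (ReLU):
    • f(z) = max(z, 0), at present the most popular non-linear function.
    • A smoother alternative is tanh(z) = (e^z − e^−z) / (e^z + e^−z).
    • Another is the logistic sigmoid, f(z) = 1 / (1 + exp(−z)).
  • Hidden layers are all layers other than the input and output layers. They neither receive signals directly from the outside world nor send signals directly to it.
  • The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer.

Activation function comparison (image)

  • Why tanh converges faster than sigmoid, and why ReLU is cheap:
    • The derivative of tanh is larger than that of sigmoid (at most 1 versus at most 1/4), so gradients are larger and training tends to converge faster.
    • The derivative of ReLU is cheaper to compute: in code it is a single if-else, whereas sigmoid needs an exponential and a floating-point division.
  • Leaky ReLU:
    • f(x) = a_i·x for x < 0, where a_i is a small fixed constant (for example 0.01; the randomized variant, RReLU, samples it at random during training). The purpose is to give negative inputs a non-zero derivative.
    • f(x) = x for x ≥ 0.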
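
The activation functions compared above, together with their derivatives, can be written out as a short sketch. The leaky slope of 0.01 is a common default and an assumption here, not a value taken from the paper.

```python
import math

def relu(z):
    return max(z, 0.0)

def relu_grad(z):
    return 1.0 if z > 0 else 0.0   # a single branch, no exponentials

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)           # peaks at 0.25 when z = 0

def tanh_grad(z):
    t = math.tanh(z)
    return 1.0 - t * t             # peaks at 1.0 when z = 0

def leaky_relu(z, slope=0.01):
    return z if z >= 0 else slope * z

def leaky_relu_grad(z, slope=0.01):
    return 1.0 if z >= 0 else slope  # non-zero gradient for negative inputs

peak_sigmoid_grad = sigmoid_grad(0.0)
peak_tanh_grad = tanh_grad(0.0)
```

Comparing `peak_tanh_grad` with `peak_sigmoid_grad` gives the factor of four behind the convergence argument in the notes.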

> In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima — weight configurations for which no small change would reduce the average error.

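  • In the late 1990s, neural nets and backpropagation were largely neglected.
  • It was commonly thought that simple gradient descent would get trapped in poor local minima: weight configurations for which no small change would reduce the average error.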

> In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder. The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.

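  • Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general.
  • Saddle points and large plateaus are the more serious problem.
  • Why saddle points matter more than local minima:
    • In a large network with many dimensions, a local minimum requires the surface to curve up in every dimension (otherwise the optimizer can escape along a downward direction). Such points are rare, and those that exist tend to have objective values close to the global minimum.
    • At a saddle point the gradient is zero, but the surface curves up in some dimensions and down in others; this is common in high-dimensional spaces.
    • On a plateau the gradient is close to zero, so progress is slow and it is hard to escape.
    • Further reading: Local minima and saddle points in deep learning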
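
A two-dimensional toy function makes the saddle-point picture concrete. The sketch below uses `f(x, y) = x^2 - y^2`, a textbook saddle chosen for illustration (it is not from the paper): the gradient vanishes at the origin, the surface curves up along x and down along y, and plain gradient descent stalls there only if it starts exactly on the x axis.

```python
def f(x, y):
    return x * x - y * y

def gradient(x, y):
    return (2 * x, -2 * y)

def descend(x, y, learning_rate=0.1, steps=50):
    """Plain gradient descent from the starting point (x, y)."""
    for _ in range(steps):
        gx, gy = gradient(x, y)
        x, y = x - learning_rate * gx, y - learning_rate * gy
    return x, y

grad_at_origin = gradient(0.0, 0.0)   # zero gradient at the saddle
stuck = descend(1.0, 0.0)             # starts exactly on the x axis
escaped = descend(1.0, 1e-3)          # tiny perturbation along y
```

With a small perturbation along the downward-curving direction the iterate leaves the saddle, which is one intuition for why noisy updates such as SGD rarely stay stuck at one.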

> Interest in deep feedforward networks was revived around 2006 (refs 31–34) by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR). The researchers introduced unsupervised learning procedures that could create layers of feature detectors without requiring labelled data. The objective in learning each layer of feature detectors was to be able to reconstruct or model the activities of feature detectors (or raw inputs) in the layer below. By ‘pre-training’ several layers of progressively more complex feature detectors using this reconstruction objective, the weights of a deep network could be initialized to sensible values. A final layer of output units could then be added to the top of the network and the whole deep system could be fine-tuned using standard backpropagation. This worked remarkably well for recognizing handwritten digits or for detecting pedestrians, especially when the amount of labelled data was very limited.

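  • Around 2006, researchers brought together by CIFAR introduced unsupervised learning procedures that create layers of feature detectors without requiring labelled data (‘pre-training’).
  • A final layer of output units is then added on top and the whole deep network is fine-tuned using standard backpropagation.
  • This worked remarkably well for recognizing handwritten digits and for detecting pedestrians, especially when labelled data was very limited.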

> The first major application of this pre-training approach was in speech recognition, and it was made possible by the advent of fast graphics processing units (GPUs) that were convenient to program and allowed researchers to train networks 10 or 20 times faster. In 2009, the approach was used to map short temporal windows of coefficients extracted from a sound wave to a set of probabilities for the various fragments of speech that might be represented by the frame in the centre of the window. It achieved record-breaking results on a standard speech recognition benchmark that used a small vocabulary and was quickly developed to give record-breaking results on a large vocabulary task. By 2012, versions of the deep net from 2009 were being developed by many of the major speech groups and were already being deployed in Android phones. For smaller data sets, unsupervised pre-training helps to prevent overfitting, leading to significantly better generalization when the number of labelled examples is small, or in a transfer setting where we have lots of examples for some ‘source’ tasks but very few for some ‘target’ tasks. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.

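  • The first major application of the pre-training approach was speech recognition, made possible by fast GPUs.
  • By 2012, versions of the 2009 deep net were being developed by many of the major speech groups and were already deployed in Android phones.
  • For smaller data sets, unsupervised pre-training helps to prevent overfitting; it later turned out that the pre-training stage was only needed for small data sets.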

> There was, however, one particular type of deep, feedforward network that was much easier to train and generalized much better than networks with full connectivity between adjacent layers. This was the convolutional neural network (ConvNet) (refs 41, 42). It achieved many practical successes during the period when neural networks were out of favour and it has recently been widely adopted by the computer-vision community.

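  • One particular type of deep feedforward network was much easier to train and generalized much better than networks with full connectivity between adjacent layers: the convolutional neural network (ConvNet).
  • ConvNets have recently been widely adopted by the computer-vision community.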

Convolutional neural networks

> ConvNets are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. Many data modalities are in the form of multiple arrays: 1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images. There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.

  • ConvNets are designed to process data that come in the form of multiple arrays.
  • Four key ideas behind ConvNets:
    • local connections
    • shared weights
    • pooling
    • the use of many layers

Fig. 2 image

> Figure 2 | Inside a convolutional network. The outputs (not the filters) of each layer (horizontally) of a typical convolutional network architecture applied to the image of a Samoyed dog (bottom left; and RGB (red, green, blue) inputs, bottom right). Each rectangular image is a feature map corresponding to the output for one of the learned features, detected at each of the image positions. Information flows bottom up, with lower-level features acting as oriented edge detectors, and a score is computed for each image class in output. ReLU, rectified linear unit.

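  • Figure 2 | Inside a convolutional network.
  • A typical convolutional network applied to the image of a Samoyed:
    • Bottom left is the Samoyed image; bottom right are the RGB inputs.
    • Each rectangular image is a feature map: the output for one learned feature, detected at each image position.
    • Information flows bottom up, with lower-level features acting as oriented edge detectors, and a score is computed for each image class at the output.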

> The architecture of a typical ConvNet (Fig. 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.

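  • A typical ConvNet is structured as a series of stages:
    • The first few stages are composed of convolutional layers and pooling layers.
    • Each unit in a convolutional layer is connected to local patches in the feature maps of the previous layer through a filter bank; the local weighted sum is then passed through a non-linearity such as a ReLU.
    • All units in the same feature map share the same filter bank.
    • Different feature maps use different filter banks.
  • There are two reasons for this architecture:
    1. In array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected.
    2. The local statistics of images and other signals are invariant to location.
      • In other words, if a motif can appear in one part of the image, it could appear anywhere; hence units at different locations share the same weights and detect the same pattern in different parts of the array.
  • Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name convolutional neural network.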
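
Local connections and shared weights can be sketched in one dimension. The filter values and signals below are invented for illustration. Note that, like most ConvNet code, the sketch slides the filter without flipping it (cross-correlation); a strict discrete convolution would reverse the filter first.

```python
def feature_map(signal, filter_bank):
    """Slide one shared filter over the signal, then apply a ReLU."""
    k = len(filter_bank)
    outputs = []
    for start in range(len(signal) - k + 1):
        patch = signal[start:start + k]   # local connection: only k inputs
        weighted_sum = sum(w * x for w, x in zip(filter_bank, patch))
        outputs.append(max(weighted_sum, 0.0))
    return outputs

# Hypothetical filter that responds to a rising edge (0 followed by 1).
edge_filter = [-1.0, 1.0]

# The same motif placed at two different locations.
signal_a = [0, 0, 1, 1, 0, 0, 0, 0]
signal_b = [0, 0, 0, 0, 0, 1, 1, 0]

map_a = feature_map(signal_a, edge_filter)
map_b = feature_map(signal_b, edge_filter)
peak_a = map_a.index(max(map_a))
peak_b = map_b.index(max(map_b))
```

Because every position uses the same two weights, moving the motif by three samples moves the response by three samples, which is the location invariance of local statistics that the notes refer to.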

Image understanding with deep convolutional networks

> Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images. These were all tasks in which labelled data was relatively abundant, such as traffic sign recognition, the segmentation of biological images particularly for connectomics, and the detection of faces, text, pedestrians and human bodies in natural images. A major recent practical success of ConvNets is face recognition.

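  • Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images.
  • These were tasks in which labelled data was relatively abundant: traffic sign recognition, segmentation of biological images (particularly for connectomics), and detection of faces, text, pedestrians and human bodies in natural images.
  • A major recent practical success of ConvNets is face recognition.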

> Importantly, images can be labelled at the pixel level, which will have applications in technology, including autonomous mobile robots and self-driving cars. Companies such as Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision systems for cars. Other applications gaining importance involve natural language understanding and speech recognition.

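  • Pixel-level labelling will have applications in autonomous mobile robots and self-driving cars.
  • Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision systems for cars.
  • Natural language understanding and speech recognition are also gaining importance as applications.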

Fig. 3 image

> Figure 3 | From image to text. Captions generated by a recurrent neural network (RNN) taking, as extra input, the representation extracted by a deep convolution neural network (CNN) from a test image, with the RNN trained to ‘translate’ high-level representations of images into captions (top). Reproduced with permission from ref. 102. When the RNN is given the ability to focus its attention on a different location in the input image (middle and bottom; the lighter patches were given more attention) as it generates each word (bold), we found (ref. 86) that it exploits this to achieve better ‘translation’ of images into captions.

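  • Top: an image-captioning example. The representation extracted by a deep convolutional network is fed as extra input to a recurrent neural network, which ‘translates’ it into a caption.
  • When the RNN is able to focus its attention on a different location in the input image as it generates each word (middle and bottom; lighter patches received more attention; generated word in bold), it exploits this to achieve better ‘translation’ of images into captions.
  • Representation learning: in machine learning, feature learning or representation learning is a set of techniques for learning features, that is, for transforming raw data into a form that machine learning can exploit effectively. It avoids manual feature extraction: the computer learns how to extract features while also learning how to use them.
  • Further reading: Representation learning, and why pre-training is no longer so popular
  • Further reading: How word vectors work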

> Despite these successes, ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout, and techniques to generate more training examples by deforming the existing ones. This success has brought about a revolution in computer vision; ConvNets are now the dominant approach for almost all recognition and detection tasks and approach human performance on some tasks. A recent stunning demonstration combines ConvNets and recurrent net modules for the generation of image captions (Fig. 3).

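  • It was not until the 2012 ImageNet competition that ConvNets were embraced by the mainstream computer-vision and machine-learning communities.
  • Applied to a data set of about a million web images containing 1,000 different classes, deep convolutional networks almost halved the error rates of the best competing approaches.
  • The success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout, and techniques to generate more training examples by deforming existing ones. It brought about a revolution in computer vision.
  • ConvNets are now the dominant approach for almost all recognition and detection tasks and approach human performance on some tasks.
  • A recent stunning demonstration combines ConvNets and recurrent net modules to generate image captions (Fig. 3).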

> Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization have reduced training times to a few hours.

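  • Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units.
  • Two years earlier (the paper dates from 2015), training such large networks could take weeks; progress in hardware, software and algorithm parallelization has reduced training times to a few hours.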

> The performance of ConvNet-based vision systems has caused most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups to initiate research and development projects and to deploy ConvNet-based image understanding products and services.

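  • The performance of ConvNet-based vision systems has led most major technology companies (including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe) and a quickly growing number of start-ups to initiate research and development projects and to deploy ConvNet-based image understanding products and services.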

> ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays. A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.

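  • ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays (FPGAs).
  • Companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.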

Distributed representations and language processing

> Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure. First, learning distributed representations enable generalization to new combinations of the values of learned features beyond those seen during training (for example, 2^n combinations are possible with n binary features). Second, composing layers of representation in a deep net brings the potential for another exponential advantage (exponential in the depth).

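  • Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations.
  • Both advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure.
    • First, learning distributed representations enables generalization to new combinations of the values of learned features beyond those seen during training (for example, n binary features allow 2^n combinations).
    • Second, composing layers of representation in a deep net brings the potential for another exponential advantage (exponential in the depth).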
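
The "2^n combinations from n binary features" count can be verified by enumeration. This small sketch only illustrates the counting argument; the choice of n = 4 is arbitrary.

```python
from itertools import product

def feature_combinations(n):
    """All configurations that n binary features can represent."""
    return list(product([0, 1], repeat=n))

n = 4
combinations = feature_combinations(n)
count = len(combinations)   # 2 ** n distinct configurations from only n features
```

A one-of-N code, by contrast, would need a separate unit for each of those `2 ** n` configurations.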

Fig. 4 image

> Figure 4 | Visualizing the learned word vectors. On the left is an illustration of word representations learned for modelling language, non-linearly projected to 2D for visualization using the t-SNE algorithm (ref. 103). On the right is a 2D representation of phrases learned by an English-to-French encoder–decoder recurrent neural network. One can observe that semantically similar words or sequences of words are mapped to nearby representations. The distributed representations of words are obtained by using backpropagation to jointly learn a representation for each word and a function that predicts a target quantity such as the next word in a sequence (for language modelling) or a whole sequence of translated words (for machine translation).

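  • Figure 4 | Visualizing the learned word vectors.
  • Left: word representations learned for modelling language, non-linearly projected to 2D with the t-SNE algorithm for visualization.
  • Right: a 2D representation of phrases learned by an English-to-French encoder–decoder recurrent neural network.
  • Semantically similar words or sequences of words are mapped to nearby representations.
  • The distributed representations of words are obtained by using backpropagation to jointly learn a representation for each word and a function that predicts a target quantity, such as the next word in a sequence (for language modelling) or a whole sequence of translated words (for machine translation).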

> The hidden layers of a multilayer neural network learn to represent the network’s inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local context of earlier words. Each word in the context is presented to the network as a one-of-N vector, that is, one component has a value of 1 and the rest are 0. In the first layer, each word creates a different pattern of activations, or word vectors (Fig. 4). In a language model, the other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability for any word in the vocabulary to appear as the next word. The network learns word vectors that contain many active components each of which can be interpreted as a separate feature of the word, as was first demonstrated in the context of learning distributed representations for symbols. These semantic features were not explicitly present in the input. They were discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple ‘micro-rules’. Learning word vectors turned out to also work very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable. When trained to predict the next word in a news story, for example, the learned word vectors for Tuesday and Wednesday are very similar, as are the word vectors for Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. These word vectors are composed of learned features that were not determined ahead of time by experts, but automatically discovered by the neural network. 
Vector representations of words learned from text are now very widely used in natural language applications.

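  • The hidden layers of a multilayer neural network learn to represent the network's inputs in a way that makes the target outputs easy to predict.
  • This can be shown by training a multilayer network to predict the next word in a sequence from a local context of earlier words.
  • Each word in the context is presented to the network as a one-of-N vector: one component has a value of 1 and the rest are 0.
  • In the first layer, each word creates a different pattern of activations, its word vector (Fig. 4).
  • In a language model, the other layers learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability of any word in the vocabulary appearing next.
  • The network learns word vectors with many active components, each of which can be interpreted as a separate feature of the word, as was first demonstrated in the context of learning distributed representations for symbols.
  • Learning word vectors also works very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable.
  • For example, when trained to predict the next word in a news story, the learned word vectors for Tuesday and Wednesday are very similar, as are the word vectors for Sweden and Norway.
  • The features that make up these word vectors are not determined ahead of time by experts but discovered automatically by the neural network. Vector representations of words learned from text are now very widely used in natural language applications.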
  • Further reading: distributed representations of word vectors - Distributed Representations of Words
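The one-of-N input and the first-layer word vector described above can be sketched in a few lines. This is only an illustration: the three-word vocabulary and the numbers in `embedding` are made up for the example, whereas in the paper's setting these weights are learned by predicting the next word. Multiplying a one-of-N vector by the first-layer weight matrix simply selects one row, and that row is the word vector.

```python
import math

def one_hot(index, size):
    """Return a one-of-N vector: 1.0 at `index`, 0.0 everywhere else."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def vec_mat(vec, mat):
    """Multiply a row vector of length N by an N x D matrix."""
    cols = len(mat[0])
    return [sum(vec[i] * mat[i][j] for i in range(len(vec))) for j in range(cols)]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical vocabulary and first-layer weights (hand-picked, not learned).
vocab = ["tuesday", "wednesday", "sweden"]
embedding = [
    [0.9, 0.1],  # tuesday
    [0.8, 0.2],  # wednesday
    [0.1, 0.9],  # sweden
]

def word_vector(word):
    """First-layer activation for `word`: one-of-N input times the weights."""
    return vec_mat(one_hot(vocab.index(word), len(vocab)), embedding)

tue, wed, swe = (word_vector(w) for w in vocab)
print(cosine(tue, wed) > cosine(tue, swe))
```

With these hand-picked weights, "tuesday" is closer to "wednesday" than to "sweden", which mirrors the kind of similarity the paper reports for learned vectors.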

> The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast ‘intuitive’ inference that underpins effortless commonsense reasoning.

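  • The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition.
  • Logic-inspired:
    • The only property of a symbol instance is that it is either identical or non-identical to other symbol instances.
    • A symbol has no internal structure that is relevant to its use.
    • To reason with symbols, they must be bound to the variables in judiciously chosen rules of inference.
  • Neural-network-inspired:
    • Uses big activity vectors, big weight matrices and scalar non-linearities to perform the fast 'intuitive' inference that underpins effortless commonsense reasoning.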

> Before the introduction of neural language models (ref. 71), the standard approach to statistical modelling of language did not exploit distributed representations: it was based on counting frequencies of occurrences of short symbol sequences of length up to N (called N-grams). The number of possible N-grams is on the order of V^N, where V is the vocabulary size, so taking into account a context of more than a handful of words would require very large training corpora. N-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences of words, whereas neural language models can because they associate each word with a vector of real valued features, and semantically related words end up close to each other in that vector space (Fig. 4).

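  • Before neural language models were introduced, the standard approach to statistical language modelling did not exploit distributed representations: it counted how often short symbol sequences of length up to N (called N-grams) occurred. The number of possible N-grams is on the order of V^N, where V is the vocabulary size, so taking into account a context of more than a handful of words would require very large training corpora.
  • N-grams treat each word as an atomic unit, so they cannot generalize across semantically related word sequences. Neural language models can, because they associate each word with a vector of real-valued features, and semantically related words end up close to each other in that vector space (Fig. 4).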
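To make the V^N growth concrete, here is a minimal sketch that counts N-grams in a toy sentence and compares the number actually observed with the number that are possible. The helper names `count_ngrams` and `possible_ngrams` and the toy sentence are assumptions made for this illustration, not part of the paper.

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def possible_ngrams(vocab_size, n):
    """Number of distinct N-grams over a vocabulary of size V: V ** N."""
    return vocab_size ** n

tokens = "the cat sat on the mat the cat ran".split()
vocab_size = len(set(tokens))
bigrams = count_ngrams(tokens, 2)

print(bigrams[("the", "cat")])         # how often "the cat" occurs
print(len(bigrams))                    # distinct bigrams actually seen
print(possible_ngrams(vocab_size, 2))  # distinct bigrams that could occur
print(possible_ngrams(10_000, 5))      # a 10,000-word vocabulary with 5-grams
```

Even in this tiny example most possible bigrams never occur, and a count-based model has no way to relate an unseen sequence to a similar one it has seen.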

Recurrent neural networks

> Figure 5 | A recurrent neural network and the unfolding in time of the computation involved in its forward computation. The artificial neurons (for example, hidden units grouped under node s with values s_t at time t) get inputs from other neurons at previous time steps (this is represented with the black square, representing a delay of one time step, on the left). In this way, a recurrent neural network can map an input sequence with elements x_t into an output sequence with elements o_t, with each o_t depending on all the previous x_tʹ (for tʹ ≤ t). The same parameters (matrices U, V, W) are used at each time step. Many other architectures are possible, including a variant in which the network can generate a sequence of outputs (for example, words), each of which is used as inputs for the next time step. The backpropagation algorithm (Fig. 1) can be directly applied to the computational graph of the unfolded network on the right, to compute the derivative of a total error (for example, the log-probability of generating the right sequence of outputs) with respect to all the states s_t and all the parameters.

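  • Figure 5: a recurrent neural network and its forward computation unfolded in time.
  • The hidden units (with values s_t at time t) get inputs from other neurons at previous time steps. The network maps an input sequence x_t to an output sequence o_t, with each o_t depending on all the previous x_tʹ (for tʹ ≤ t).
  • The same parameters (matrices U, V, W) are shared across all time steps.
  • Many other architectures are possible, including a variant in which each output is used as the input for the next time step. The backpropagation algorithm (Fig. 1) can be applied directly to the computational graph of the unfolded network on the right, to compute the derivative of a total error (for example, the log-probability of generating the right sequence of outputs) with respect to all the states s_t and all the parameters.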
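The unfolded computation in Figure 5 can be sketched as a simple loop. The caption only names the shared matrices U, V and W, so the roles assumed here (U maps input to hidden, W maps hidden to hidden, V maps hidden to output), the tanh non-linearity and the tiny hand-picked weights are all assumptions for illustration.

```python
import math

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def rnn_forward(xs, U, W, V, s0):
    """Run the recurrence s_t = tanh(U x_t + W s_{t-1}), o_t = V s_t.

    The same U, W and V are reused at every time step.
    """
    s = s0
    states, outputs = [], []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, s))]
        s = [math.tanh(p) for p in pre]
        states.append(s)
        outputs.append(matvec(V, s))
    return states, outputs

# Hypothetical weights: 1 input unit, 2 hidden units, 1 output unit.
U = [[0.5], [-0.3]]
W = [[0.1, 0.2], [0.0, 0.4]]
V = [[1.0, -1.0]]

xs = [[1.0], [0.0], [0.0]]
states, outputs = rnn_forward(xs, U, W, V, s0=[0.0, 0.0])
print(outputs)
```

Only the first input is non-zero, yet every later output is non-zero too, because the state vector carries information about earlier inputs forward in time.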

> When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Fig. 5). RNNs process an input sequence one element at a time, maintaining in their hidden units a ‘state vector’ that implicitly contains information about the history of all the past elements of the sequence. When we consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network (Fig. 5, right), it becomes clear how we can apply backpropagation to train RNNs.

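  • When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs).
  • For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Fig. 5).
  • RNNs process an input sequence one element at a time, maintaining in their hidden units a 'state vector' that implicitly contains information about the history of all the past elements of the sequence.
  • When the outputs of the hidden units at different discrete time steps are viewed as the outputs of different neurons in a deep multilayer network (Fig. 5, right), it becomes clear how backpropagation can be applied to train RNNs.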

> RNNs are very powerful dynamic systems, but training them has proved to be problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish.

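  • RNNs are very powerful dynamic systems, but training them has proved problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish.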
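A scalar toy case shows why this happens. Assume a linear recurrence s_t = w * s_{t-1} with a single weight w (a simplification for illustration, not the paper's model): the derivative of s_T with respect to s_0 is w multiplied by itself T times.

```python
def gradient_through_time(w, steps):
    """Derivative of s_T with respect to s_0 for s_t = w * s_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one factor of w per time step
    return grad

for w in (0.5, 1.0, 1.5):
    print(w, gradient_through_time(w, 50))
```

A weight slightly below one drives the gradient towards zero over 50 steps, and a weight slightly above one makes it enormous. In a real RNN the scalar w is replaced by a matrix and a non-linearity, but the repeated multiplication is the same.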

> Thanks to advances in their architecture and ways of training them, RNNs have been found to be very good at predicting the next character in the text or the next word in a sequence, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English ‘encoder’ network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of (or as extra input to) a jointly trained French ‘decoder’ network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network it will then output a probability distribution for the second word of the translation and so on until a full stop is chosen. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather naive way of performing machine translation has quickly become competitive with the state-of-the-art, and this raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions that are manipulated by using inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies that each contribute plausibility to a conclusion.

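  • RNNs are very good at predicting the next character in a text or the next word in a sequence, but they can also be used for more complex tasks.
  • For example, after reading an English sentence one word at a time, an English 'encoder' network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence.
  • This raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions that are manipulated by using inference rules.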
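The word-by-word generation step in the quoted paragraph can be sketched as a sampling loop. Everything here is a stand-in: `toy_decoder` is a hypothetical lookup table that plays the role of a trained decoder already conditioned on the encoder's thought vector, and its probabilities are set to 1.0 so the result is deterministic.

```python
import random

def generate(next_word_distribution, rng, stop_token=".", max_len=20):
    """Sample words one at a time until the stop token is chosen."""
    words = []
    while len(words) < max_len:
        # The decoder sees the words chosen so far and returns
        # a dict mapping each candidate next word to its probability.
        dist = next_word_distribution(words)
        candidates = list(dist)
        weights = [dist[w] for w in candidates]
        word = rng.choices(candidates, weights=weights, k=1)[0]
        words.append(word)
        if word == stop_token:
            break
    return words

# Hypothetical decoder: a fixed table instead of a trained network.
TOY_TABLE = {
    (): {"le": 1.0},
    ("le",): {"chat": 1.0},
    ("le", "chat"): {"dort": 1.0},
    ("le", "chat", "dort"): {".": 1.0},
}

def toy_decoder(prefix):
    """Return the next-word distribution for the words generated so far."""
    return TOY_TABLE[tuple(prefix)]

print(generate(toy_decoder, random.Random(0)))
```

Each chosen word is fed back as context for the next step, and generation stops when the full stop is chosen, as the paragraph describes.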

> Instead of translating the meaning of a French sentence into an English sentence, one can learn to ‘translate’ the meaning of an image into an English sentence (Fig. 3). The encoder here is a deep ConvNet that converts the pixels into an activity vector in its last hidden layer. The decoder is an RNN similar to the ones used for machine translation and neural language modelling. There has been a surge of interest in such systems recently (see examples mentioned in ref. 86).

> RNNs, once unfolded in time (Fig. 5), can be seen as very deep feedforward networks in which all the layers share the same weights. Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult to learn to store information for very long.

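  • RNNs, once unfolded in time (Fig. 5), can be seen as very deep feedforward networks in which all the layers share the same weights.
  • Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult to learn to store information for very long.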

> To correct for that, one idea is to augment the network with an explicit memory. The first proposal of this kind is the long short-term memory (LSTM) networks that use special hidden units, the natural behaviour of which is to remember inputs for a long time. A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step that has a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory. LSTM networks have subsequently proved to be more effective than conventional RNNs, especially when they have several layers for each time step, enabling an entire speech recognition system that goes all the way from acoustics to the sequence of characters in the transcription. LSTM networks or related forms of gated units are also currently used for the encoder and decoder networks that perform so well at machine translation.

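  • Long short-term memory (LSTM) networks use special hidden units whose natural behaviour is to remember inputs for a long time.
  • A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step with a weight of one, so it copies its own real-valued state and accumulates the external signal. This self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory.
  • LSTM networks have proved more effective than conventional RNNs, especially when they have several layers for each time step, enabling an entire speech recognition system that goes all the way from acoustics to the sequence of characters in the transcription.
  • LSTM networks or related forms of gated units are also used for the encoder and decoder networks that perform so well at machine translation.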
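The accumulator behaviour of the memory cell can be sketched with one number per time step. This is a deliberately reduced picture, not a full LSTM: the gate values are supplied by hand here, whereas in an LSTM they are produced by another unit that is learned, and the function name `run_memory_cell` is made up for the example.

```python
def run_memory_cell(inputs, keep_gates, initial=0.0):
    """Gated accumulator: state_t = keep_t * state_{t-1} + input_t.

    keep_t = 1.0 copies the previous state through the weight-one
    self-connection; keep_t = 0.0 clears the memory before adding
    the new input.
    """
    state = initial
    history = []
    for x, keep in zip(inputs, keep_gates):
        state = keep * state + x
        history.append(state)
    return history

print(run_memory_cell([1.0, 2.0, 3.0], [1.0, 1.0, 1.0]))  # gate open: accumulates
print(run_memory_cell([1.0, 2.0, 3.0], [1.0, 1.0, 0.0]))  # gate closed at the last step
```

With the gate held at one, the state is carried forward unchanged from step to step, which is what lets the cell remember an input for a long time.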

> Over the past year, several authors have made different proposals to augment RNNs with a memory module. Proposals include the Neural Turing Machine in which the network is augmented by a ‘tape-like’ memory that the RNN can choose to read from or write to, and memory networks, in which a regular network is augmented by a kind of associative memory. Memory networks have yielded excellent performance on standard question-answering benchmarks. The memory is used to remember the story about which the network is later asked to answer questions.

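  • Memory networks have yielded excellent performance on standard question-answering benchmarks.
  • The memory is used to remember the story about which the network is later asked to answer questions.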

> Beyond simple memorization, neural Turing machines and memory networks are being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing machines can be taught ‘algorithms’. Among other things, they can learn to output a sorted list of symbols when their input consists of an unsorted sequence in which each symbol is accompanied by a real value that indicates its priority in the list (ref. 88). Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game and after reading a story, they can answer questions that require complex inference (ref. 90). In one test example, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as “where is Frodo now?” (ref. 89).

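  • Memory networks can be trained so that, after reading a story, they can answer questions that require complex inference.
  • In one test, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as "where is Frodo now?".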

The future of deep learning

> Unsupervised learning had a catalytic effect in reviving interest in deep learning, but has since been overshadowed by the successes of purely supervised learning. Although we have not focused on it in this Review, we expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object. Human vision is an active process that sequentially samples the optic array in an intelligent, task-specific way using a small, high-resolution fovea with a large, low-resolution surround. We expect much of the future progress in vision to come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look. Systems combining deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems at classification tasks and produce impressive results in learning to play many different video games.

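  • Unsupervised learning had a catalytic effect in reviving interest in deep learning, but has since been overshadowed by the successes of purely supervised learning. In the longer term, unsupervised learning is expected to become far more important.
  • Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.
  • Human vision is an active process that sequentially samples the optic array in an intelligent, task-specific way, using a small, high-resolution fovea with a large, low-resolution surround.
  • Systems combining deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems at classification tasks.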

> Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents will become much better when they learn strategies for selectively attending to one part at a time. Ultimately, major progress in artificial intelligence will come about through systems that combine representation learning with complex reasoning. Although deep learning and simple reasoning have been used for speech and handwriting recognition for a long time, new paradigms are needed to replace rule-based manipulation of symbolic expressions by operations on large vectors.

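  • Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years.
  • Systems that use RNNs to understand sentences or whole documents are expected to become much better when they learn strategies for selectively attending to one part at a time.
  • Ultimately, major progress in artificial intelligence will come about through systems that combine representation learning with complex reasoning.
  • Although deep learning and simple reasoning have long been used for speech and handwriting recognition, new paradigms are needed to replace rule-based manipulation of symbolic expressions with operations on large vectors.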