01-2015-Deep Learning Paper: Translation and Close Reading

Deep learning

>* Original Deep Learning paper: http://pages.cs.wisc.edu/~dyer/cs540/handouts/deep-learning-nature2015.pdf
>* Deep Learning paper download link (fast mirror)


> Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.

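  • Deep learning allows computational models composed of multiple processing layers to learn representations of data with multiple levels of abstraction.
  • Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters.
  • Deep convolutional nets have brought breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shed light on sequential data such as text and speech.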

> Machine-learning technology powers many aspects of modern society: from web searches to content filtering on social networks to recommendations on e-commerce websites, and it is increasingly present in consumer products such as cameras and smartphones. Machine-learning systems are used to identify objects in images, transcribe speech into text, match news items, posts or products with users’ interests, and select relevant results of search. Increasingly, these applications make use of a class of techniques called deep learning. Conventional machine-learning techniques were limited in their ability to process natural data in their raw form. For decades, constructing a pattern-recognition or machine-learning system required careful engineering and considerable domain expertise to design a feature extractor that transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector from which the learning subsystem, often a classifier, could detect or classify patterns in the input. Representation learning is a set of methods that allows a machine to be fed with raw data and to automatically discover the representations needed for detection or classification. Deep-learning methods are representation-learning methods with multiple levels of representation, obtained by composing simple but non-linear modules that each transform the representation at one level (starting with the raw input) into a representation at a higher, slightly more abstract level. With the composition of enough such transformations, very complex functions can be learned. For classification tasks, higher layers of representation amplify aspects of the input that are important for discrimination and suppress irrelevant variations.
> An image, for example, comes in the form of an array of pixel values, and the learned features in the first layer of representation typically represent the presence or absence of edges at particular orientations and locations in the image. The second layer typically detects motifs by spotting particular arrangements of edges, regardless of small variations in the edge positions. The third layer may assemble motifs into larger combinations that correspond to parts of familiar objects, and subsequent layers would detect objects as combinations of these parts. The key aspect of deep learning is that these layers of features are not designed by human engineers: they are learned from data using a general-purpose learning procedure. Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years. It has turned out to be very good at discovering intricate structures in high-dimensional data and is therefore applicable to many domains of science, business and government. In addition [...] We think that deep learning will have many more successes in the near future because it requires very little engineering by hand, so it can easily take advantage of increases in the amount of available computation and data. New learning algorithms and architectures that are currently being developed for deep neural networks will only accelerate this progress.

doi:10.1038/nature14539

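  • Conventional machine-learning techniques were limited in their ability to process natural data in raw form:
    • Designing a feature extractor required considerable domain expertise.
    • The feature extractor transformed the raw data (such as the pixel values of an image) into a suitable internal representation or feature vector.
  • Deep-learning methods are representation-learning methods with multiple levels of representation:
    • They are obtained by composing simple but non-linear modules.
    • With enough such modules composed, very complex functions can be learned.
  • How deep learning builds up image recognition layer by layer:
    • First layer: learns edges in the image.
    • Second layer: detects motifs from particular arrangements of edges.
    • Third layer: assembles motifs into larger combinations that correspond to parts of familiar objects; subsequent layers detect objects as combinations of these parts.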

Supervised learning

> The most common form of machine learning, deep or not, is supervised learning. Imagine that we want to build a system that can classify images as containing, say, a house, a car, a person or a pet. We first collect a large data set of images of houses, cars, people and pets, each labelled with its category. During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category. We want the desired category to have the highest score of all categories, but this is unlikely to happen before training. We compute an objective function that measures the error (or distance) between the output scores and the desired pattern of scores. The machine then modifies its internal adjustable parameters to reduce this error. These adjustable parameters, often called weights, are real numbers that can be seen as ‘knobs’ that define the input–output function of the machine. In a typical deep-learning system, there may be hundreds of millions of these adjustable weights, and hundreds of millions of labelled examples with which to train the machine.

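  • Supervised learning is the most common form of machine learning, deep or not.
  • A typical deep-learning system may have hundreds of millions of adjustable weights, and hundreds of millions of labelled examples with which to train the machine.
  • Image classification as an example:
    • Collect a large data set of images of houses, cars, people and pets, each labelled with its category.
    • During training, the machine is shown an image and produces an output in the form of a vector of scores, one for each category.
    • The adjustable parameters (usually called weights) are real numbers that act as "knobs" defining the machine's input-output function; they are adjusted so that the desired category receives the highest score of all categories.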

> To properly adjust the weight vector, the learning algorithm computes a gradient vector that, for each weight, indicates by what amount the error would increase or decrease if the weight were increased by a tiny amount. The weight vector is then adjusted in the opposite direction to the gradient vector. The objective function, averaged over all the training examples, can be seen as a kind of hilly landscape in the high-dimensional space of weight values. The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.

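  • How the weight vector is adjusted:
    • The learning algorithm computes a gradient vector that, for each weight, indicates by how much the error would increase or decrease if that weight were increased by a tiny amount.
    • The weight vector is adjusted in the direction opposite to the gradient vector.
  • The objective function, averaged over all training examples, can be seen as a hilly landscape in the high-dimensional space of weight values.
  • The negative gradient vector indicates the direction of steepest descent in this landscape, taking it closer to a minimum, where the output error is low on average.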
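
The update rule described above can be sketched in a few lines of Python. This is only an illustration, not code from the paper: the bowl-shaped `error` function, the learning rate of 0.1 and the step count are assumptions chosen to keep the example small.

```python
def error(w):
    """Toy objective (assumed): a bowl whose minimum is at w = (3, -1)."""
    return (w[0] - 3.0) ** 2 + (w[1] + 1.0) ** 2


def gradient(w):
    """Analytic gradient of the toy objective with respect to each weight."""
    return [2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)]


def gradient_descent(w, learning_rate=0.1, steps=100):
    """Repeatedly move the weight vector in the direction opposite to the gradient."""
    for _ in range(steps):
        g = gradient(w)
        w = [wi - learning_rate * gi for wi, gi in zip(w, g)]
    return w


w_final = gradient_descent([0.0, 0.0])
```

Each step shrinks the distance to the minimum by a constant factor here because the landscape is a simple bowl; the error surface of a real network is far less regular.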

> In practice, most practitioners use a procedure called stochastic gradient descent (SGD). This consists of showing the input vector for a few examples, computing the outputs and the errors, computing the average gradient for those examples, and adjusting the weights accordingly. The process is repeated for many small sets of examples from the training set until the average of the objective function stops decreasing. It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples. This simple procedure usually finds a good set of weights surprisingly quickly when compared with far more elaborate optimization techniques [18]. After training, the performance of the system is measured on a different set of examples called a test set. This serves to test the generalization ability of the machine — its ability to produce sensible answers on new inputs that it has never seen during training.

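  • In practice, most practitioners use a procedure called stochastic gradient descent (SGD).
  • It is called stochastic because each small set of examples gives a noisy estimate of the average gradient over all examples.
  • After training, performance is measured on a separate test set to test the generalization ability of the machine: its ability to produce sensible answers on new inputs never seen during training.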
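
A minimal sketch of minibatch SGD with a held-out test set follows. The task (fitting `y = 2x + 1` with a little noise), the batch size, the learning rate and the fixed number of epochs are all assumptions made for illustration. The paper describes stopping when the average objective stops decreasing, which this sketch replaces with a fixed epoch count.

```python
import random


def make_data(n, rng):
    """Synthetic examples from y = 2x + 1 plus a little noise (assumed toy task)."""
    data = []
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.05)))
    return data


def mean_squared_error(w, b, data):
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def sgd(train, rng, learning_rate=0.1, batch_size=8, epochs=50):
    """Minibatch SGD: each small batch gives a noisy estimate of the full gradient."""
    w, b = 0.0, 0.0
    for _ in range(epochs):
        rng.shuffle(train)
        for start in range(0, len(train), batch_size):
            batch = train[start:start + batch_size]
            # Average gradient of the squared error over this batch only.
            grad_w = sum(2.0 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2.0 * (w * x + b - y) for x, y in batch) / len(batch)
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


rng = random.Random(0)
train_set = make_data(200, rng)
test_set = make_data(50, rng)   # held-out examples, never used for updates
w, b = sgd(train_set, rng)
test_error = mean_squared_error(w, b, test_set)
```

The test error is computed only on examples that never influenced the weights, which is what makes it a measure of generalization rather than of memorization.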

> Many of the current practical applications of machine learning use linear classifiers on top of hand-engineered features. A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.

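  • Many current practical applications of machine learning use linear classifiers on top of hand-engineered features.
  • A two-class linear classifier computes a weighted sum of the feature vector components. If the weighted sum is above a threshold, the input is classified as belonging to a particular category.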
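
The two-class rule above amounts to one weighted sum and one comparison. In the sketch below the feature values, weights and threshold are hypothetical numbers picked only to show the mechanics.

```python
def linear_classify(features, weights, threshold):
    """Return True when the weighted sum of the features exceeds the threshold."""
    weighted_sum = sum(w * x for w, x in zip(weights, features))
    return weighted_sum > threshold


# Hypothetical hand-engineered features and weights, chosen only for illustration.
weights = [0.8, -0.5, 0.3]
threshold = 0.5

in_category = linear_classify([1.0, 0.2, 0.5], weights, threshold)      # sum is about 0.85
not_in_category = linear_classify([0.1, 0.9, 0.2], weights, threshold)  # sum is about -0.31
```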

> Since the 1960s we have known that linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane [19]. But problems such as image and speech recognition require the input–output function to be insensitive to irrelevant variations of the input, such as variations in position, orientation or illumination of an object, or variations in the pitch or accent of speech, while being very sensitive to particular minute variations (for example, the difference between a white wolf and a breed of wolf-like white dog called a Samoyed). At the pixel level, images of two Samoyeds in different poses and in different environments may be very different from each other, whereas two images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar to each other. A linear classifier, or any other ‘shallow’ classifier operating on raw pixels could not possibly distinguish the latter two, while putting the former two in the same category. This is why shallow classifiers require a good feature extractor that solves the selectivity–invariance dilemma — one that produces representations that are selective to the aspects of the image that are important for discrimination, but that are invariant to irrelevant aspects such as the pose of the animal. To make classifiers more powerful, one can use generic non-linear features, as with kernel methods [20], but generic features such as those arising with the Gaussian kernel do not allow the learner to generalize well far from the training examples [21]. The conventional option is to hand design good feature extractors, which requires a considerable amount of engineering skill and domain expertise. But this can all be avoided if good features can be learned automatically using a general-purpose learning procedure. This is the key advantage of deep learning.

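  • Linear classifiers can only carve their input space into very simple regions, namely half-spaces separated by a hyperplane (they cannot even handle the classic XOR problem).
  • Some tasks need the model function to be insensitive to irrelevant variations of the input:
    • Image recognition: variations in position, orientation or illumination of an object.
    • Speech recognition: variations in the pitch or accent of speech.
  • At the same time the model function must be very sensitive to particular minute variations:
    • The difference between a white wolf and a Samoyed, a breed of wolf-like white dog.
    • Images of two Samoyeds in different poses and environments may be very different from each other, whereas images of a Samoyed and a wolf in the same position and on similar backgrounds may be very similar.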
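
The XOR limitation mentioned in the notes can be illustrated with a small search. The sketch below tries every weight and threshold on a coarse grid (the grid is an assumption, and a grid search is an illustration rather than a proof) and finds a linear rule for AND but none for XOR.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}


def separable_on_grid(table, values):
    """Search a grid of (w1, w2, threshold) for a linear rule matching the table."""
    for w1, w2, t in product(values, repeat=3):
        if all((w1 * x1 + w2 * x2 > t) == bool(label)
               for (x1, x2), label in table.items()):
            return True
    return False


grid = [i / 2 for i in range(-8, 9)]     # -4.0, -3.5, ..., 4.0
and_ok = separable_on_grid(AND, grid)    # a half-space suffices for AND
xor_ok = separable_on_grid(XOR, grid)    # no half-space reproduces XOR
```

XOR is not linearly separable for any choice of weights, so widening the grid would not change the second result.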

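  • A linear classifier, or any other "shallow" classifier operating on raw pixels, cannot distinguish the wolf from the Samoyed in the scenario above while also putting the two Samoyeds in the same category.
  • This is why shallow classifiers need a good feature extractor that solves the selectivity-invariance dilemma.
  • The conventional option is to hand-design good feature extractors, which requires a considerable amount of engineering skill and domain expertise.
  • The key advantage of deep learning: good features can be learned automatically using a general-purpose learning procedure.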

> A deep-learning architecture is a multilayer stack of simple modules, all (or most) of which are subject to learning, and many of which compute non-linear input–output mappings. Each module in the stack transforms its input to increase both the selectivity and the invariance of the representation. With multiple non-linear layers, say a depth of 5 to 20, a system can implement extremely intricate functions of its inputs that are simultaneously sensitive to minute details — distinguishing Samoyeds from white wolves — and insensitive to large irrelevant variations such as the background, pose, lighting and surrounding objects.

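  • A deep-learning architecture is a multilayer stack of simple modules:
    • All (or most) of the modules are subject to learning, and many of them compute non-linear input-output mappings.
    • With multiple non-linear layers (a depth of 5 to 20), a system can implement extremely intricate functions that are sensitive to minute details.
    • It can distinguish Samoyeds from white wolves while remaining insensitive to large irrelevant variations such as background, pose, lighting and surrounding objects.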

Backpropagation to train multilayer architectures

> From the earliest days of pattern recognition, the aim of researchers has been to replace hand-engineered features with trainable multilayer networks, but despite its simplicity, the solution was not widely understood until the mid 1980s. As it turns out, multilayer architectures can be trained by simple stochastic gradient descent. As long as the modules are relatively smooth functions of their inputs and of their internal weights, one can compute gradients using the backpropagation procedure. The idea that this could be done, and that it worked, was discovered independently by several different groups during the 1970s and 1980s.

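  • From the earliest days of pattern recognition, researchers aimed to replace hand-engineered features with trainable multilayer networks.
  • As long as the modules are relatively smooth functions of their inputs and internal weights, gradients can be computed with the backpropagation procedure.
  • Despite its simplicity, this solution was not widely understood until the mid 1980s.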

> The backpropagation procedure to compute the gradient of an objective function with respect to the weights of a multilayer stack of modules is nothing more than a practical application of the chain rule for derivatives. The key insight is that the derivative (or gradient) of the objective with respect to the input of a module can be computed by working backwards from the gradient with respect to the output of that module (or the input of the subsequent module) (Fig. 1). The backpropagation equation can be applied repeatedly to propagate gradients through all modules, starting from the output at the top (where the network produces its prediction) all the way to the bottom (where the external input is fed). Once these gradients have been computed, it is straightforward to compute the gradients with respect to the weights of each module.

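  • Backpropagation computes the gradient of the objective function with respect to the network weights; it is nothing more than an application of the chain rule for derivatives.
  • Starting from the partial derivative of the error at the output, gradients are propagated backwards module by module down to the first layer; the gradients with respect to each module's weights then follow directly.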
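
To make the chain rule concrete, here is a hedged sketch on a deliberately tiny stack: one scalar weight per module, a ReLU in between and a squared-error objective. The network shape and the numbers are assumptions for illustration; the finite-difference check at the end is a common way to confirm that hand-derived gradients are right.

```python
def forward(x, w1, w2):
    """Two stacked modules: z = w1*x, h = ReLU(z), y = w2*h."""
    z = w1 * x
    h = max(z, 0.0)
    y = w2 * h
    return z, h, y


def backward(x, target, w1, w2):
    """Chain rule applied from the output back towards the input."""
    z, h, y = forward(x, w1, w2)
    d_y = y - target                      # dE/dy for E = 0.5 * (y - target)^2
    d_w2 = d_y * h                        # dE/dw2 = dE/dy * dy/dw2
    d_h = d_y * w2                        # gradient w.r.t. the top module's input
    d_z = d_h * (1.0 if z > 0 else 0.0)   # back through the ReLU
    d_w1 = d_z * x                        # dE/dw1 = dE/dz * dz/dw1
    return d_w1, d_w2


def loss(x, target, w1, w2):
    return 0.5 * (forward(x, w1, w2)[2] - target) ** 2


x, target, w1, w2 = 1.5, 2.0, 0.8, -0.6
grad_w1, grad_w2 = backward(x, target, w1, w2)

# Finite-difference check of the analytic gradients.
eps = 1e-6
num_w1 = (loss(x, target, w1 + eps, w2) - loss(x, target, w1 - eps, w2)) / (2 * eps)
num_w2 = (loss(x, target, w1, w2 + eps) - loss(x, target, w1, w2 - eps)) / (2 * eps)
```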

Fig. 1 image

> Many applications of deep learning use feedforward neural network architectures (Fig. 1), which learn to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each of several categories). To go from one layer to the next, a set of units compute a weighted sum of their inputs from the previous layer and pass the result through a non-linear function. At present, the most popular non-linear function is the rectified linear unit (ReLU), which is simply the half-wave rectifier f(z) = max(z, 0). In past decades, neural nets used smoother non-linearities, such as tanh(z) or 1/(1 + exp(−z)), but the ReLU typically learns much faster in networks with many layers, allowing training of a deep supervised network without unsupervised pre-training [28]. Units that are not in the input or output layer are conventionally called hidden units. The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer (Fig. 1).

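  • Fig. 1c shows the forward pass: the network learns to map a fixed-size input (for example, an image) to a fixed-size output (for example, a probability for each category).
  • Rectified linear unit (ReLU):
    • f(z) = max(z, 0), currently the most popular non-linear function.
    • Another such function is tanh(x) = (e^x - e^-x) / (e^x + e^-x).
    • Another is the logistic sigmoid, f(z) = 1/(1 + exp(−z)).
  • Hidden layers: all layers other than the input and output layers are called hidden layers. They neither receive signals directly from the outside world nor send signals directly to it.
  • The hidden layers can be seen as distorting the input in a non-linear way so that categories become linearly separable by the last layer.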

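Activation function comparison (image)

  • Why tanh converges faster than sigmoid:

    • Comparing the derivatives, tanh has a larger derivative than sigmoid (at most 1 versus at most 0.25), so gradients are larger and training tends to converge faster.
    • The derivative of ReLU is cheaper still to compute, since an implementation is just an if-else statement, whereas sigmoid needs floating-point arithmetic including an exponential and a division.
  • Leaky ReLU:

    • f(x) = a_i x for x < 0, where a_i is a fixed constant chosen in advance (something like 1/2). Its purpose is to give inputs below 0 a non-zero derivative.
    • f(x) = x for x > 0.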
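
The activation functions discussed above, written out as a sketch in plain Python. The function names are illustrative, and the leaky slope of 0.5 simply follows the 1/2 mentioned in the notes (an assumption, since the slope is a free choice).

```python
import math


def relu(z):
    return max(z, 0.0)


def leaky_relu(z, slope=0.5):
    """Below zero the output is slope * z, so the derivative there is not zero."""
    return z if z > 0 else slope * z


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def d_tanh(z):
    return 1.0 - math.tanh(z) ** 2


# Both derivatives peak at z = 0: 1.0 for tanh, 0.25 for the sigmoid.
peak_tanh = d_tanh(0.0)
peak_sigmoid = d_sigmoid(0.0)
```

The fourfold gap between the two peak derivatives is the basis for the claim above that tanh passes larger gradients than the sigmoid.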

> In the late 1990s, neural nets and backpropagation were largely forsaken by the machine-learning community and ignored by the computer-vision and speech-recognition communities. It was widely thought that learning useful, multistage, feature extractors with little prior knowledge was infeasible. In particular, it was commonly thought that simple gradient descent would get trapped in poor local minima — weight configurations for which no small change would reduce the average error.

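  • In the late 1990s, neural nets and backpropagation were largely neglected.
  • It was commonly thought that simple gradient descent would get trapped in poor local minima, that is, weight configurations for which no small change would reduce the average error.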

> In practice, poor local minima are rarely a problem with large networks. Regardless of the initial conditions, the system nearly always reaches solutions of very similar quality. Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general. Instead, the landscape is packed with a combinatorially large number of saddle points where the gradient is zero, and the surface curves up in most dimensions and curves down in the remainder. The analysis seems to show that saddle points with only a few downward curving directions are present in very large numbers, but almost all of them have very similar values of the objective function. Hence, it does not much matter which of these saddle points the algorithm gets stuck at.

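  • Recent theoretical and empirical results strongly suggest that local minima are not a serious issue in general.
  • Saddle points and large plateaus are the more serious problems.
  • Why saddle points matter more than local minima:
    • In a large network with many dimensions, a local minimum must be a minimum along every dimension (otherwise the optimizer can escape along another dimension), so such a point is very likely to be as good as the global minimum.
    • At a saddle point the gradient is zero, but the surface curves up in some directions and down in others (this arises easily in high dimensions).
    • On a large plateau the gradient is close to zero, which makes it hard to move away.
    • Further reading: "Local minima and saddle points in deep learning"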
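
A two-dimensional toy example can show what a saddle point looks like to gradient descent. The function `f(x, y) = x^2 - y^2` is an assumption chosen because it curves up along `x` and down along `y`; it is not taken from the paper, and unlike a real objective it is unbounded below.

```python
def grad(x, y):
    """Gradient of f(x, y) = x^2 - y^2, which has a saddle point at the origin."""
    return 2.0 * x, -2.0 * y


def descend(x, y, learning_rate=0.1, steps=200):
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - learning_rate * gx, y - learning_rate * gy
    return x, y


# Starting exactly on the upward-curving axis, descent slides into the saddle.
stuck = descend(1.0, 0.0)
# A tiny offset along the downward-curving direction is amplified at every step.
escaped = descend(1.0, 1e-6)
```

The second run suggests why noise, such as that supplied by SGD's minibatches, helps: any small component along a downward-curving direction grows until the optimizer leaves the saddle.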

> Interest in deep feedforward networks was revived around 2006 (refs 31–34) by a group of researchers brought together by the Canadian Institute for Advanced Research (CIFAR). The researchers introduced unsupervised learning procedures that could create layers of feature detectors without requiring labelled data. The objective in learning each layer of feature detectors was to be able to reconstruct or model the activities of feature detectors (or raw inputs) in the layer below. By ‘pre-training’ several layers of progressively more complex feature detectors using this reconstruction objective, the weights of a deep network could be initialized to sensible values. A final layer of output units could then be added to the top of the network and the whole deep system could be fine-tuned using standard backpropagation. This worked remarkably well for recognizing handwritten digits or for detecting pedestrians, especially when the amount of labelled data was very limited.

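  • Around 2006, researchers brought together by CIFAR introduced unsupervised learning procedures that create layers of feature detectors without labelled data, "pre-training" the network.
  • After an output layer is added, the whole deep network is fine-tuned with standard backpropagation.
  • This worked well for recognizing handwritten digits, especially when labelled data was very limited.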

> The first major application of this pre-training approach was in speech recognition, and it was made possible by the advent of fast graphics processing units (GPUs) that were convenient to program and allowed researchers to train networks 10 or 20 times faster. In 2009, the approach was used to map short temporal windows of coefficients extracted from a sound wave to a set of probabilities for the various fragments of speech that might be represented by the frame in the centre of the window. It achieved record-breaking results on a standard speech recognition benchmark that used a small vocabulary and was quickly developed to give record-breaking results on a large vocabulary task. By 2012, versions of the deep net from 2009 were being developed by many of the major speech groups and were already being deployed in Android phones. For smaller data sets, unsupervised pre-training helps to prevent overfitting, leading to significantly better generalization when the number of labelled examples is small, or in a transfer setting where we have lots of examples for some ‘source’ tasks but very few for some ‘target’ tasks. Once deep learning had been rehabilitated, it turned out that the pre-training stage was only needed for small data sets.

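  • The first major application of the pre-training approach was speech recognition, made possible by fast GPUs.
  • By 2012, versions of the 2009 deep net were being developed by many major speech groups and were already deployed in Android phones.
  • For smaller data sets, unsupervised pre-training helps to prevent overfitting (it turned out that the pre-training stage was only needed for small data sets).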

> There was, however, one particular type of deep, feedforward network that was much easier to train and generalized much better than networks with full connectivity between adjacent layers. This was the convolutional neural network (ConvNet) [41,42]. It achieved many practical successes during the period when neural networks were out of favour and it has recently been widely adopted by the computer-vision community.

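  • There was, however, one particular type of deep feedforward network that was much easier to train and generalized much better than networks with full connectivity between adjacent layers: the convolutional neural network (ConvNet).
  • ConvNets have recently been widely adopted by the computer-vision community.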

Convolutional neural networks

> ConvNets are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing pixel intensities in the three colour channels. Many data modalities are in the form of multiple arrays: 1D for signals and sequences, including language; 2D for images or audio spectrograms; and 3D for video or volumetric images. There are four key ideas behind ConvNets that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.

  • ConvNets are designed to process data that come in the form of multiple arrays.
  • The four key ideas behind ConvNets:
    • local connections
    • shared weights
    • pooling
    • the use of many layers

Fig. 2 image

> Figure 2 | Inside a convolutional network. The outputs (not the filters) of each layer (horizontally) of a typical convolutional network architecture applied to the image of a Samoyed dog (bottom left; and RGB (red, green, blue) inputs, bottom right). Each rectangular image is a feature map corresponding to the output for one of the learned features, detected at each of the image positions. Information flows bottom up, with lower-level features acting as oriented edge detectors, and a score is computed for each image class in output. ReLU, rectified linear unit.

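  • Figure 2 | Inside a convolutional network.
  • A typical convolutional network applied to the image of a Samoyed:
    • Bottom left is the Samoyed image; bottom right are the RGB inputs.
    • Each rectangular image is a feature map: the output for one of the learned features, detected at each image position.
    • Information flows bottom up, with lower-level features acting as oriented edge detectors, and a score is computed for each image class in the output.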

> The architecture of a typical ConvNet (Fig. 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps, within which each unit is connected to local patches in the feature maps of the previous layer through a set of weights called a filter bank. The result of this local weighted sum is then passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank. Different feature maps in a layer use different filter banks. The reason for this architecture is twofold. First, in array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected. Second, the local statistics of images and other signals are invariant to location. In other words, if a motif can appear in one part of the image, it could appear anywhere, hence the idea of units at different locations sharing the same weights and detecting the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.

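  • A typical ConvNet is structured as a series of stages:
    • The first few stages are composed of convolutional layers and pooling layers.
    • Units in a convolutional layer are connected to local patches in the feature maps of the previous layer through a filter bank; the local weighted sum is then passed through a non-linearity such as a ReLU.
    • All units in a feature map share the same filter bank.
    • Different feature maps use different filter banks.
  • The reason for this architecture is twofold:
    1. In array data such as images, local groups of values are often highly correlated, forming distinctive local motifs that are easily detected.
    2. The local statistics of images and other signals are invariant to location.
      • In other words, if a motif can appear in one part of the image, it could appear anywhere; hence units at different locations share the same weights and detect the same pattern in different parts of the array.
  • Mathematically, the filtering operation performed by a feature map is a discrete convolution, hence the name.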
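
The stage described above (shared local filter, ReLU, pooling) can be sketched without any library. The 5x5 image and the 2x2 edge filter are hypothetical values. Note that, as in most deep-learning code, the sliding weighted sum below does not flip the filter, so strictly speaking it is a cross-correlation; it equals a discrete convolution with the flipped filter.

```python
def conv2d_valid(image, kernel):
    """Slide one shared filter over the image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(kernel[a][b] * image[i + a][j + b]
                 for a in range(kh) for b in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]


def relu_map(feature_map):
    return [[max(v, 0.0) for v in row] for row in feature_map]


def max_pool_2x2(feature_map):
    """Keep the largest value in each non-overlapping 2x2 patch."""
    return [[max(feature_map[i][j], feature_map[i][j + 1],
                 feature_map[i + 1][j], feature_map[i + 1][j + 1])
             for j in range(0, len(feature_map[0]) - 1, 2)]
            for i in range(0, len(feature_map) - 1, 2)]


# 5x5 image with a vertical edge between the dark left and bright right part.
image = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
# Hypothetical 2x2 filter that responds to a dark-to-bright vertical edge.
edge_filter = [[-1.0, 1.0],
               [-1.0, 1.0]]

feature_map = relu_map(conv2d_valid(image, edge_filter))
pooled = max_pool_2x2(feature_map)
```

Because the same four weights are reused at every position, the filter fires wherever the edge appears; pooling then keeps the response while discarding its exact position within each patch.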

Image understanding with deep convolutional networks

> Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images. These were all tasks in which labelled data was relatively abundant, such as traffic sign recognition, the segmentation of biological images particularly for connectomics, and the detection of faces, text, pedestrians and human bodies in natural images. A major recent practical success of ConvNets is face recognition.

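  • Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images.
  • These were tasks with relatively abundant labelled data: traffic sign recognition, the segmentation of biological images (particularly for connectomics), and the detection of faces, text, pedestrians and human bodies in natural images.
  • A major recent practical success of ConvNets is face recognition.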

> Importantly, images can be labelled at the pixel level, which will have applications in technology, including autonomous mobile robots and self-driving cars. Companies such as Mobileye and NVIDIA are using such ConvNet-based methods in their upcoming vision systems for cars. Other applications gaining importance involve natural language understanding and speech recognition.

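  • Pixel-level image labelling will have applications in autonomous mobile robots and self-driving cars.
  • Mobileye and NVIDIA are using ConvNet-based methods in their upcoming vision systems for cars.
  • Natural language understanding and speech recognition are also gaining importance.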

Fig. 3 image

> Figure 3 | From image to text. Captions generated by a recurrent neural network (RNN) taking, as extra input, the representation extracted by a deep convolution neural network (CNN) from a test image, with the RNN trained to ‘translate’ high-level representations of images into captions (top). Reproduced with permission from ref. 102. When the RNN is given the ability to focus its attention on a different location in the input image (middle and bottom; the lighter patches were given more attention) as it generates each word (bold), we found [86] that it exploits this to achieve better ‘translation’ of images into captions.

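  • Top: an image-captioning example. The representation extracted by a deep convolutional network is fed, as extra input, to a recurrent neural network that "translates" it into a caption.
  • When the RNN is given the ability to focus its attention on a different location in the input image (middle and bottom; the lighter patches were given more attention) as it generates each word (bold), the authors found that it exploits this to achieve better "translation" of images into captions.
  • Representation learning: in machine learning, feature learning or representation learning is a set of techniques for learning features, transforming raw data into a form that machine learning can exploit effectively. It avoids manual feature extraction and lets the computer learn how to extract features while learning to use them: learning how to learn.
  • Further reading: "Representation learning and why pre-training is no longer so popular"
  • Further reading: "How word vectors work"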

> Despite these successes, ConvNets were largely forsaken by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web that contained 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout, and techniques to generate more training examples by deforming the existing ones. This success has brought about a revolution in computer vision; ConvNets are now the dominant approach for almost all recognition and detection tasks and approach human performance on some tasks. A recent stunning demonstration combines ConvNets and recurrent net modules for the generation of image captions (Fig. 3).

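  • ConvNets were not embraced by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012.
  • Applied to a data set of about a million web images containing 1,000 different classes, deep convolutional networks achieved spectacular results, almost halving the error rates of the best competing approaches.
  • The success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout, and techniques to generate more training examples by deforming existing ones; it brought about a revolution in computer vision.
  • ConvNets are now the dominant approach for almost all recognition and detection tasks and approach human performance on some tasks.
  • A recent stunning demonstration combines ConvNets and recurrent net modules to generate image captions (Fig. 3).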

> Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units. Whereas training such large networks could have taken weeks only two years ago, progress in hardware, software and algorithm parallelization have reduced training times to a few hours.

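  • Recent ConvNet architectures have 10 to 20 layers of ReLUs, hundreds of millions of weights, and billions of connections between units.
  • Two years earlier (the paper dates from 2015), training such large networks could take weeks; progress in hardware, software and algorithm parallelization has reduced training times to a few hours.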

> The performance of ConvNet-based vision systems has caused most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups to initiate research and development projects and to deploy ConvNet-based image understanding products and services.

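  • The performance of ConvNet-based vision systems has led most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a quickly growing number of start-ups, to initiate research and development projects and to deploy ConvNet-based image understanding products and services.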

> ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays. A number of companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.

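  • ConvNets are easily amenable to efficient hardware implementations in chips or field-programmable gate arrays.
  • Companies such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.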

Distributed representations and language processing

> Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations. Both of these advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure. First, learning distributed representations enable generalization to new combinations of the values of learned features beyond those seen during training (for example, 2^n combinations are possible with n binary features). Second, composing layers of representation in a deep net brings the potential for another exponential advantage (exponential in the depth).

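  • Deep-learning theory shows that deep nets have two different exponential advantages over classic learning algorithms that do not use distributed representations.
  • Both advantages arise from the power of composition and depend on the underlying data-generating distribution having an appropriate componential structure.
    • First, learning distributed representations enables generalization to new combinations of learned feature values beyond those seen during training (for example, 2^n combinations are possible with n binary features).
    • Second, composing layers of representation in a deep net brings the potential for another exponential advantage (exponential in the depth).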
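
The "2^n combinations with n binary features" remark can be checked by simple enumeration. The sketch below contrasts a distributed code with a local one-of-N code over the same number of units; `n_features = 4` is an arbitrary choice for the example.

```python
from itertools import product

n_features = 4

# Distributed code: every unit is a binary feature that can be on or off
# independently, so n units can describe 2**n distinct configurations.
distributed_codes = list(product([0, 1], repeat=n_features))

# Local (one-of-N) code: exactly one unit is active, so n units
# can describe only n distinct items.
local_codes = [tuple(1 if i == j else 0 for j in range(n_features))
               for i in range(n_features)]
```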

Fig. 4 image

> Figure 4 | Visualizing the learned word vectors. On the left is an illustration of word representations learned for modelling language, non-linearly projected to 2D for visualization using the t-SNE algorithm [103]. On the right is a 2D representation of phrases learned by an English-to-French encoder–decoder recurrent neural network. One can observe that semantically similar words or sequences of words are mapped to nearby representations. The distributed representations of words are obtained by using backpropagation to jointly learn a representation for each word and a function that predicts a target quantity such as the next word in a sequence (for language modelling) or a whole sequence of translated words (for machine translation).

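  • Figure 4 | Visualizing the learned word vectors.
  • Left: word representations learned for modelling language, non-linearly projected to 2D with the t-SNE algorithm for visualization.
  • Right: a 2D representation of phrases learned by an English-to-French encoder-decoder recurrent neural network.
  • Semantically similar words or sequences of words are mapped to nearby representations.
  • The distributed representations of words are obtained by using backpropagation to jointly learn a representation for each word and a function that predicts a target quantity, such as the next word in a sequence (for language modelling) or a whole sequence of translated words (for machine translation).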

> The hidden layers of a multilayer neural network learn to represent the network’s inputs in a way that makes it easy to predict the target outputs. This is nicely demonstrated by training a multilayer neural network to predict the next word in a sequence from a local context of earlier words. Each word in the context is presented to the network as a one-of-N vector, that is, one component has a value of 1 and the rest are 0. In the first layer, each word creates a different pattern of activations, or word vectors (Fig. 4). In a language model, the other layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability for any word in the vocabulary to appear as the next word. The network learns word vectors that contain many active components each of which can be interpreted as a separate feature of the word, as was first demonstrated in the context of learning distributed representations for symbols. These semantic features were not explicitly present in the input. They were discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple ‘micro-rules’. Learning word vectors turned out to also work very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable. When trained to predict the next word in a news story, for example, the learned word vectors for Tuesday and Wednesday are very similar, as are the word vectors for Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. These word vectors are composed of learned features that were not determined ahead of time by experts, but automatically discovered by the neural network. Vector representations of words learned from text are now very widely used in natural language applications.

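  • The hidden layers of a multilayer neural network learn to represent the network's inputs in a way that makes the target outputs easy to predict.
  • This is demonstrated by training a multilayer network to predict the next word in a sequence from a local context of earlier words.
  • Each word in the context is presented to the network as a one-of-N (one-hot) vector: one component has the value 1 and the rest are 0.
  • In the first layer, each word creates a different pattern of activations, or word vector (Fig. 4).
  • In a language model, the other layers learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability of any word in the vocabulary appearing next.
  • The network learns word vectors with many active components, each of which can be interpreted as a separate feature of the word, as was first demonstrated in the context of learning distributed representations for symbols.
  • Learning word vectors also works very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable.
  • For example, when trained to predict the next word in a news story, the learned word vectors for Tuesday and Wednesday are very similar to each other, as are the word vectors for Sweden and Norway.
  • The features that make up these word vectors are not determined ahead of time by experts but discovered automatically by the neural network. Vector representations of words learned from text are now very widely used in natural language applications.
  • Further reading: paper notes on distributed word-vector representations - Distributed Representations of Words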
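
The path from a one-of-N input to a next-word probability can be sketched in a few lines of Python. This is a minimal illustration, not the model from the paper: the four-word vocabulary, the hand-picked 2-dimensional table `E`, and the helper names (`one_hot`, `embed`, `next_word_probs`, `cosine`) are all assumptions made for the example, and a real network would learn `E` from text.

```python
import math

# Hypothetical toy vocabulary and hand-picked word vectors (one row per word).
# A trained language model would learn these values; they are fixed here.
vocab = ["tuesday", "wednesday", "sweden", "norway"]
E = [
    [0.9, 0.1],  # tuesday
    [0.8, 0.2],  # wednesday
    [0.1, 0.9],  # sweden
    [0.2, 0.8],  # norway
]

def one_hot(index, size):
    """One-of-N vector: 1 at `index`, 0 everywhere else."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def embed(one_hot_vec, table):
    """Multiply the one-hot vector by the table; this selects one row."""
    dim = len(table[0])
    return [sum(x * row[d] for x, row in zip(one_hot_vec, table))
            for d in range(dim)]

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_probs(output_vec, table):
    """Score every vocabulary word against the predicted output vector."""
    scores = [sum(o * w for o, w in zip(output_vec, row)) for row in table]
    return softmax(scores)

def cosine(a, b):
    """Cosine similarity between two word vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

With these assumed values, `embed(one_hot(1, 4), E)` returns the row for "wednesday", and `cosine(E[0], E[1])` is much larger than `cosine(E[0], E[2])`, mirroring the observation that related words end up with similar vectors.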

> The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition. In the logic-inspired paradigm, an instance of a symbol is something for which the only property is that it is either identical or non-identical to other symbol instances. It has no internal structure that is relevant to its use; and to reason with symbols, they must be bound to the variables in judiciously chosen rules of inference. By contrast, neural networks just use big activity vectors, big weight matrices and scalar non-linearities to perform the type of fast ‘intuitive’ inference that underpins effortless commonsense reasoning.

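  • The issue of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired paradigms for cognition.
  • Logic-inspired paradigm:
    • The only property of a symbol instance is that it is either identical or non-identical to other symbol instances
    • A symbol has no internal structure that is relevant to its use
    • To reason with symbols, they must be bound to the variables in judiciously chosen rules of inference
  • Neural-network-inspired paradigm:
    • Uses big activity vectors, big weight matrices and scalar non-linearities to perform the fast 'intuitive' inference that underpins effortless commonsense reasoning.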

> Before the introduction of neural language models (ref. 71), the standard approach to statistical modelling of language did not exploit distributed representations: it was based on counting frequencies of occurrences of short symbol sequences of length up to N (called N-grams). The number of possible N-grams is on the order of V^N, where V is the vocabulary size, so taking into account a context of more than a handful of words would require very large training corpora. N-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences of words, whereas neural language models can because they associate each word with a vector of real valued features, and semantically related words end up close to each other in that vector space (Fig. 4).

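  • Before neural language models were introduced, the standard approach to statistical language modelling did not exploit distributed representations: it counted the frequencies of short symbol sequences of length up to N (called N-grams). The number of possible N-grams is on the order of V^N, where V is the vocabulary size, so taking into account a context of more than a handful of words would require very large training corpora.
  • N-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences of words. Neural language models can, because they associate each word with a vector of real-valued features, and semantically related words end up close to each other in that vector space (Fig. 4).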
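
A short Python sketch makes both limitations concrete: the table of possible N-grams grows as V^N, and a count-based model has nothing to say about a sequence it has never seen. The function names and the toy sentence are assumptions chosen for illustration.

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def possible_ngrams(vocab_size, n):
    """Upper bound on distinct N-grams: V to the power N."""
    return vocab_size ** n

tokens = "the cat sat on the cat".split()
bigrams = ngram_counts(tokens, 2)

# Seen twice in the toy text.
seen = bigrams[("the", "cat")]
# Never seen, so the count is 0 even though "dog" is close in meaning to "cat".
unseen = bigrams[("the", "dog")]

# A 10,000-word vocabulary already allows 10**12 distinct trigrams.
trigram_space = possible_ngrams(10_000, 3)
```

Because each word is an opaque symbol here, evidence about "the cat" transfers nothing to "the dog"; a model built on word vectors can share that evidence when the two vectors are close.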

Recurrent neural networks

> Figure 5 | A recurrent neural network and the unfolding in time of the computation involved in its forward computation. The artificial neurons (for example, hidden units grouped under node s with values s_t at time t) get inputs from other neurons at previous time steps (this is represented with the black square, representing a delay of one time step, on the left). In this way, a recurrent neural network can map an input sequence with elements x_t into an output sequence with elements o_t, with each o_t depending on all the previous x_tʹ (for tʹ ≤ t). The same parameters (matrices U, V, W) are used at each time step. Many other architectures are possible, including a variant in which the network can generate a sequence of outputs (for example, words), each of which is used as inputs for the next time step. The backpropagation algorithm (Fig. 1) can be directly applied to the computational graph of the unfolded network on the right, to compute the derivative of a total error (for example, the log-probability of generating the right sequence of outputs) with respect to all the states s_t and all the parameters.

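  • Figure 5: a recurrent neural network and the unfolding in time of its forward computation.
  • The hidden units (state s_t at time t) receive inputs from neurons at previous time steps, so the network maps an input sequence x_t to an output sequence o_t, with each o_t depending on all previous x_tʹ (tʹ ≤ t).
  • The same parameters (matrices U, V, W) are shared across all time steps.
  • In one variant, the network generates a sequence of outputs, each of which is used as input for the next time step. Backpropagation (Fig. 1) can be applied directly to the unfolded computational graph on the right to compute the derivative of a total error (for example, the log-probability of generating the right sequence of outputs) with respect to all the states s_t and all the parameters.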
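
The unfolded forward computation can be written as a loop that reuses the same three matrices at every step. The sketch below assumes the common reading of the figure (U maps input to state, W maps the previous state to the current state, V maps state to output) and assumes a tanh non-linearity, which the caption does not specify. The function names and the small matrices are illustrative only.

```python
import math

def matvec(M, v):
    """Matrix-vector product for plain Python lists."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_forward(xs, U, W, V, s0):
    """Unroll s_t = tanh(U x_t + W s_{t-1}), o_t = V s_t over the sequence xs."""
    s = s0
    states, outputs = [], []
    for x in xs:
        pre = [a + b for a, b in zip(matvec(U, x), matvec(W, s))]
        s = [math.tanh(p) for p in pre]   # same U and W at every time step
        states.append(s)
        outputs.append(matvec(V, s))      # same V at every time step
    return states, outputs

# Hypothetical parameters: 1-dimensional input, 2-dimensional state, 1 output.
U = [[0.5], [-0.5]]
W = [[0.1, 0.2], [0.3, 0.4]]
V = [[1.0, 0.0]]
s0 = [0.0, 0.0]

# Only the first input is non-zero, yet later outputs are still affected by it.
states, outputs = rnn_forward([[1.0], [0.0], [0.0]], U, W, V, s0)
```

With an all-zero input sequence every output is zero, so the non-zero value of `outputs[2]` above comes entirely from the first input carried forward in the state vector.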

> When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Fig. 5). RNNs process an input sequence one element at a time, maintaining in their hidden units a ‘state vector’ that implicitly contains information about the history of all the past elements of the sequence. When we consider the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network (Fig. 5, right), it becomes clear how we can apply backpropagation to train RNNs.

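  • When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs).
  • For tasks that involve sequential inputs, such as speech and language, it is often better to use RNNs (Fig. 5).
  • RNNs process an input sequence one element at a time, maintaining in their hidden units a 'state vector' that implicitly contains information about the history of all the past elements of the sequence.
  • If the outputs of the hidden units at different discrete time steps are treated as the outputs of different neurons in a deep multilayer network (Fig. 5, right), it becomes clear how backpropagation can be applied to train RNNs.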

> RNNs are very powerful dynamic systems, but training them has proved to be problematic because the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish.

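  • RNNs are very powerful dynamic systems, but training them has proved problematic: the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish.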
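
The effect is easy to reproduce in the simplest possible case. The sketch below assumes a scalar, linear recurrence s_t = w * s_{t-1}, for which the gradient of the final state with respect to the initial state is w multiplied by itself once per step. Real RNNs have matrices and non-linearities, but the same repeated multiplication is at work. The function name is hypothetical.

```python
def gradient_through_time(w, steps):
    """Gradient of s_T with respect to s_0 for the scalar recurrence s_t = w * s_{t-1}."""
    grad = 1.0
    for _ in range(steps):
        grad *= w  # one factor of w per time step
    return grad

vanishing = gradient_through_time(0.9, 100)   # shrinks towards 0
exploding = gradient_through_time(1.1, 100)   # grows without bound
stable = gradient_through_time(1.0, 100)      # stays exactly 1.0
```

Only a factor of exactly one keeps the gradient stable over 100 steps; a weight only 10% smaller or larger drives it towards zero or into the thousands.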

> Thanks to advances in their architecture and ways of training them, RNNs have been found to be very good at predicting the next character in the text or the next word in a sequence, but they can also be used for more complex tasks. For example, after reading an English sentence one word at a time, an English ‘encoder’ network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This thought vector can then be used as the initial hidden state of (or as extra input to) a jointly trained French ‘decoder’ network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and provided as input to the decoder network it will then output a probability distribution for the second word of the translation and so on until a full stop is chosen. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather naive way of performing machine translation has quickly become competitive with the state-of-the-art, and this raises serious doubts about whether understanding a sentence requires anything like the internal symbolic expressions that are manipulated by using inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies that each contribute plausibility to a conclusion.

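  • RNNs are very good at predicting the next character in a text or the next word in a sequence, but they can also be used for more complex tasks.
  • For example, an English 'encoder' network can read a sentence one word at a time and be trained so that the final state vector of its hidden units is a good representation of the thought the sentence expresses. A jointly trained French 'decoder' network then generates the translation one word at a time from this thought vector.
  • This raises serious doubts about whether understanding a sentence requires anything like internal symbolic expressions manipulated by inference rules.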
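
The decoding loop described in the quoted paragraph (choose a word, feed it back in, repeat until a full stop) can be sketched as follows. Everything here is a stand-in: `toy_decoder` and `TOY_TABLE` are hypothetical, the table is hand-written rather than learned, and it ignores the thought vector that a trained decoder would condition on. Picking the most probable word at each step (greedy decoding) is also an assumption; the paper only says that a word is chosen from the distribution.

```python
def greedy_decode(next_word_distribution, thought_vector, max_len=10):
    """Generate words one at a time, feeding each choice back as the next input."""
    words = []
    prev = "<start>"
    for _ in range(max_len):
        dist = next_word_distribution(thought_vector, prev)
        word = max(dist, key=dist.get)  # greedy choice: the most probable word
        words.append(word)
        if word == ".":                 # stop once a full stop is chosen
            break
        prev = word
    return words

# Hand-written stand-in for a trained French decoder (hypothetical values).
TOY_TABLE = {
    "<start>": {"le": 0.7, "chat": 0.2, ".": 0.1},
    "le": {"chat": 0.8, "le": 0.1, ".": 0.1},
    "chat": {"dort": 0.6, ".": 0.4},
    "dort": {".": 0.9, "chat": 0.1},
}

def toy_decoder(thought_vector, prev_word):
    """Return a next-word distribution; this toy version ignores thought_vector."""
    return TOY_TABLE[prev_word]

translation = greedy_decode(toy_decoder, thought_vector=[0.0])
```

Sampling from the distribution instead of taking the maximum would produce different translations on different runs, which matches the paper's description of generating sequences according to a probability distribution.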

> Instead of translating the meaning of a French sentence into an English sentence, one can learn to ‘translate’ the meaning of an image into an English sentence (Fig. 3). The encoder here is a deep ConvNet that converts the pixels into an activity vector in its last hidden layer. The decoder is an RNN similar to the ones used for machine translation and neural language modelling. There has been a surge of interest in such systems recently (see examples mentioned in ref. 86).

> RNNs, once unfolded in time (Fig. 5), can be seen as very deep feedforward networks in which all the layers share the same weights. Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult to learn to store information for very long.

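  • Once unfolded in time (Fig. 5), RNNs can be seen as very deep feedforward networks in which all the layers share the same weights.
  • Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult for them to learn to store information for very long.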

> To correct for that, one idea is to augment the network with an explicit memory. The first proposal of this kind is the long short-term memory (LSTM) networks that use special hidden units, the natural behaviour of which is to remember inputs for a long time. A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step that has a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory. LSTM networks have subsequently proved to be more effective than conventional RNNs, especially when they have several layers for each time step, enabling an entire speech recognition system that goes all the way from acoustics to the sequence of characters in the transcription. LSTM networks or related forms of gated units are also currently used for the encoder and decoder networks that perform so well at machine translation.

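  • Long short-term memory (LSTM) networks use special hidden units whose natural behaviour is to remember inputs for a long time.
  • A special unit called the memory cell acts like an accumulator: it has a connection to itself at the next time step with a weight of one, so it copies its own real-valued state and accumulates the external signal. This self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory.
  • LSTM networks have proved more effective than conventional RNNs, especially when they have several layers for each time step, enabling an entire speech recognition system that goes all the way from acoustics to the sequence of characters in the transcription.
  • LSTM networks or related forms of gated units are also used for the encoder and decoder networks that perform so well at machine translation.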
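
The accumulator behaviour of the memory cell can be shown with a deliberately stripped-down sketch. It keeps only the gated self-connection described above: the cell state is multiplied by a gate value and the external signal is added. In a real LSTM the gate is produced by a learned unit and takes values between 0 and 1, and there are further gates on the input and output; here the gate values are supplied by hand, and the function name is hypothetical.

```python
def memory_cell(inputs, forget_gates, c0=0.0):
    """Gated accumulator: c_t = gate_t * c_{t-1} + x_t."""
    c = c0
    history = []
    for x, gate in zip(inputs, forget_gates):
        c = gate * c + x   # gate = 1 keeps the old state, gate = 0 clears it
        history.append(c)
    return history

# The gate is open (1.0) except at the third step, where the memory is cleared.
trace = memory_cell([1.0, 1.0, 1.0, 1.0], [1.0, 1.0, 0.0, 1.0])
# trace == [1.0, 2.0, 1.0, 2.0]
```

Because the self-connection has weight one while the gate is open, the stored value neither grows nor shrinks on its own, which is what lets the cell hold information across many time steps.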

> Over the past year, several authors have made different proposals to augment RNNs with a memory module. Proposals include the Neural Turing Machine in which the network is augmented by a ‘tape-like’ memory that the RNN can choose to read from or write to, and memory networks, in which a regular network is augmented by a kind of associative memory. Memory networks have yielded excellent performance on standard question-answering benchmarks. The memory is used to remember the story about which the network is later asked to answer questions.

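  • Memory networks have yielded excellent performance on standard question-answering benchmarks.
  • The memory is used to remember the story about which the network is later asked to answer questions.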
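
The paper does not spell out how the associative memory is read. One common realization, assumed here purely for illustration, is content-based addressing: compare a query vector with a key for each stored item, turn the similarities into weights with a softmax, and return the weighted average of the stored values. The function names and the two-slot memory are hypothetical.

```python
import math

def softmax(scores):
    """Turn similarity scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def read_memory(query, keys, values):
    """Soft lookup: weight each stored value by how well its key matches the query."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    read = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, read

# Two stored items; the query matches the first key far better than the second.
keys = [[5.0, 0.0], [0.0, 5.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
weights, read = read_memory([1.0, 0.0], keys, values)
```

Because every step is differentiable, a read of this kind can be trained with backpropagation together with the rest of the network.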

> Beyond simple memorization, neural Turing machines and memory networks are being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing machines can be taught ‘algorithms’. Among other things, they can learn to output a sorted list of symbols when their input consists of an unsorted sequence in which each symbol is accompanied by a real value that indicates its priority in the list (ref. 88). Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game and after reading a story, they can answer questions that require complex inference (ref. 90). In one test example, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as “where is Frodo now?” (ref. 89).

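  • Memory networks can be trained so that, after reading a story, they can answer questions that require complex inference.
  • In one test, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as "where is Frodo now?".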

The future of deep learning

> Unsupervised learning had a catalytic effect in reviving interest in deep learning, but has since been overshadowed by the successes of purely supervised learning. Although we have not focused on it in this Review, we expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object. Human vision is an active process that sequentially samples the optic array in an intelligent, task-specific way using a small, high-resolution fovea with a large, low-resolution surround. We expect much of the future progress in vision to come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look. Systems combining deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems at classification tasks and produce impressive results in learning to play many different video games.

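  • Unsupervised learning had a catalytic effect in reviving interest in deep learning, but has since been overshadowed by the successes of purely supervised learning. In the longer term, unsupervised learning is expected to become far more important.
  • Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.
  • Human vision is an active process that sequentially samples the optic array in an intelligent, task-specific way using a small, high-resolution fovea with a large, low-resolution surround. Much future progress in vision is expected from systems trained end-to-end that combine ConvNets with RNNs and use reinforcement learning to decide where to look.
  • Systems combining deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems at classification tasks and produce impressive results in learning to play many different video games.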

> Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents will become much better when they learn strategies for selectively attending to one part at a time. Ultimately, major progress in artificial intelligence will come about through systems that combine representation learning with complex reasoning. Although deep learning and simple reasoning have been used for speech and handwriting recognition for a long time, new paradigms are needed to replace rule-based manipulation of symbolic expressions by operations on large vectors.

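  • Natural language understanding is another area in which deep learning is poised to make a large impact over the next few years.
  • Systems that use RNNs to understand sentences or whole documents are expected to become much better when they learn strategies for selectively attending to one part at a time.
  • Ultimately, major progress in artificial intelligence will come about through systems that combine representation learning with complex reasoning.
  • Although deep learning and simple reasoning have long been used for speech and handwriting recognition, new paradigms are needed to replace rule-based manipulation of symbolic expressions by operations on large vectors.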