[Paper Translation] GoogLeNet — Going Deeper with Convolutions

[Started] 2018-09-25

[Completed] 2018-09-26

[Paper Link] https://arxiv.org/abs/1409.4842

Title: Going Deeper with Convolutions

 

Abstract

   We propose a deep convolutional neural network architecture codenamed Inception, which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014(ILSVRC14). The main hallmark of this architecture is the improved utilization of the computing resources inside the network. This was achieved by a carefully crafted design that allows for increasing the depth and width of the network while keeping the computational budget constant. To optimize quality, the architectural decisions were based on the Hebbian principle and the intuition of multi-scale processing. One particular incarnation used in our submission for ILSVRC14 is called GoogLeNet, a 22 layers deep network, the quality of which is assessed in the context of classification and detection.

 


 

1 Introduction

 

In the last three years, mainly due to the advances of deep learning, more concretely convolutional networks [10], the quality of image recognition and object detection has been progressing at a dramatic pace. One encouraging piece of news is that most of this progress is not just the result of more powerful hardware, larger datasets and bigger models, but mainly a consequence of new ideas, algorithms and improved network architectures. No new data sources were used, for example, by the top entries in the ILSVRC 2014 competition besides the classification dataset of the same competition for detection purposes. Our GoogLeNet submission to ILSVRC 2014 actually uses 12× fewer parameters than the winning architecture of Krizhevsky et al [9] from two years ago, while being significantly more accurate. The biggest gains in object-detection have not come from the utilization of deep networks alone or bigger models, but from the synergy of deep architectures and classical computer vision, like the R-CNN algorithm by Girshick et al [6].


  

Another notable factor is that with the ongoing traction of mobile and embedded computing, the efficiency of our algorithms – especially their power and memory use – gains importance. It is noteworthy that the considerations leading to the design of the deep architecture presented in this paper included this factor rather than having a sheer fixation on accuracy numbers. For most of the experiments, the models were designed to keep a computational budget of 1.5 billion multiply-adds at inference time, so that they do not end up being a purely academic curiosity, but could be put to real world use, even on large datasets, at a reasonable cost.


    

   In this paper, we will focus on an efficient deep neural network architecture for computer vision, codenamed Inception, which derives its name from the Network in network paper by Lin et al [12] in conjunction with the famous “we need to go deeper” internet meme [1]. In our case, the word “deep” is used in two different meanings: first of all, in the sense that we introduce a new level of organization in the form of the “Inception module” and also in the more direct sense of increased network depth. In general, one can view the Inception model as a logical culmination of [12] while taking inspiration and guidance from the theoretical work by Arora et al [2]. The benefits of the architecture are experimentally verified on the ILSVRC 2014 classification and detection challenges, on which it significantly outperforms the current state of the art.


2 Related Work

Starting with LeNet-5 [10], convolutional neural networks (CNN) have typically had a standard structure – stacked convolutional layers (optionally followed by contrast normalization and max-pooling) are followed by one or more fully-connected layers. Variants of this basic design are prevalent in the image classification literature and have yielded the best results to-date on MNIST, CIFAR and most notably on the ImageNet classification challenge [9, 21]. For larger datasets such as Imagenet, the recent trend has been to increase the number of layers [12] and layer size [21, 14], while using dropout [7] to address the problem of overfitting.


    

Despite concerns that max-pooling layers result in loss of accurate spatial information, the same convolutional network architecture as [9] has also been successfully employed for localization [9, 14], object detection [6, 14, 18, 5] and human pose estimation [19]. Inspired by a neuroscience model of the primate visual cortex, Serre et al. [15] used a series of fixed Gabor filters of different sizes to handle multiple scales. We use a similar strategy here. However, contrary to the fixed 2-layer deep model of [15], all filters in the Inception architecture are learned. Furthermore, Inception layers are repeated many times, leading to a 22-layer deep model in the case of the GoogLeNet model.


 

     Network-in-Network is an approach proposed by Lin et al. [12] in order to increase the representational power of neural networks. In their model, additional 1 × 1 convolutional layers are added to the network, increasing its depth. We use this approach heavily in our architecture. However, in our setting, 1 × 1 convolutions have dual purpose: most critically, they are used mainly as dimension reduction modules to remove computational bottlenecks, that would otherwise limit the size of our networks. This allows for not just increasing the depth, but also the width of our networks without a significant performance penalty. 

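To make this dual use concrete, here is a minimal PyTorch sketch of a 1×1 convolution acting as a dimension-reduction module (the channel counts 256 and 64 are illustrative choices of ours, not values from the paper): it mixes information across channels at every spatial position while leaving the spatial grid untouched.

```python
import torch
import torch.nn as nn

# 1x1 convolution as a dimension-reduction module: it projects each
# spatial position from 256 channels down to 64 without changing the
# spatial resolution. Channel counts here are illustrative only.
reduce = nn.Sequential(
    nn.Conv2d(in_channels=256, out_channels=64, kernel_size=1),
    nn.ReLU(inplace=True),  # the rectified linear activation mentioned above
)

x = torch.randn(1, 256, 28, 28)  # (batch, channels, height, width)
print(reduce(x).shape)           # torch.Size([1, 64, 28, 28])
```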

 

The current state of the art for object detection is the Regions with Convolutional Neural Networks (R-CNN) method by Girshick et al. [6]. R-CNN decomposes the overall detection problem into two subproblems: utilizing low-level cues such as color and texture in order to generate object location proposals in a category-agnostic fashion and using CNN classifiers to identify object categories at those locations. Such a two-stage approach leverages the accuracy of bounding box segmentation with low-level cues, as well as the highly powerful classification power of state-of-the-art CNNs. We adopted a similar pipeline in our detection submissions, but have explored enhancements in both stages, such as multi-box [5] prediction for higher object bounding box recall, and ensemble approaches for better categorization of bounding box proposals.


 

3 Motivation and High Level Considerations

 

The most straightforward way of improving the performance of deep neural networks is by increasing their size. This includes both increasing the depth (the number of network levels) as well as its width: the number of units at each level. This is an easy and safe way of training higher quality models, especially given the availability of a large amount of labeled training data. However, this simple solution comes with two major drawbacks.


 

Bigger size typically means a larger number of parameters, which makes the enlarged network more prone to overfitting, especially if the number of labeled examples in the training set is limited. This is a major bottleneck as strongly labeled datasets are laborious and expensive to obtain, often requiring expert human raters to distinguish between various fine-grained visual categories such as those in ImageNet (even in the 1000-class ILSVRC subset) as shown in Figure 1.


     The other drawback of uniformly increased network size is the dramatically increased use of computational resources. For example, in a deep vision network, if two convolutional layers are chained, any uniform increase in the number of their filters results in a quadratic increase of computation. If the added capacity is used inefficiently (for example, if most weights end up to be close to zero), then much of the computation is wasted. As the computational budget is always finite, an efficient distribution of computing resources is preferred to an indiscriminate increase of size, even when the main objective is to increase the quality of performance. 

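As an illustrative calculation (the numbers are ours, not the paper's): a k×k convolution over an H×W feature map with C_in input and C_out output channels costs roughly H·W·k²·C_in·C_out multiply-adds. If two such layers are chained and the filter counts of both are uniformly doubled, the second layer sees both its C_in and its C_out double, so its cost grows by 4×, which is the quadratic increase referred to above.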

 

A fundamental way of solving both of these issues would be to introduce sparsity and replace the fully connected layers by the sparse ones, even inside the convolutions. Besides mimicking biological systems, this would also have the advantage of firmer theoretical underpinnings due to the groundbreaking work of Arora et al. [2]. Their main result states that if the probability distribution of the dataset is representable by a large, very sparse deep neural network, then the optimal network topology can be constructed layer after layer by analyzing the correlation statistics of the preceding layer activations and clustering neurons with highly correlated outputs. Although the strict mathematical proof requires very strong conditions, the fact that this statement resonates with the well known Hebbian principle (neurons that fire together, wire together) suggests that the underlying idea is applicable even under less strict conditions, in practice.


 

On the downside, today's computing infrastructures are very inefficient when it comes to numerical calculation on non-uniform sparse data structures. Even if the number of arithmetic operations is reduced by 100×, the overhead of lookups and cache misses is so dominant that switching to sparse matrices would not pay off. The gap is widened even further by the use of steadily improving, highly tuned, numerical libraries that allow for extremely fast dense matrix multiplication, exploiting the minute details of the underlying CPU or GPU hardware [16, 9]. Also, non-uniform sparse models require more sophisticated engineering and computing infrastructure. Most current vision-oriented machine learning systems utilize sparsity in the spatial domain just by the virtue of employing convolutions. However, convolutions are implemented as collections of dense connections to the patches in the earlier layer. ConvNets have traditionally used random and sparse connection tables in the feature dimensions since [11] in order to break the symmetry and improve learning; the trend changed back to full connections with [9] in order to better optimize parallel computing. The uniformity of the structure and a large number of filters and greater batch size allow for utilizing efficient dense computation.


 

    This raises the question of whether there is any hope for a next, intermediate step: an architecture that makes use of filter-level sparsity, as suggested by the theory, but exploits our current hardware by utilizing computations on dense matrices. The vast literature on sparse matrix computations (e.g. [3]) suggests that clustering sparse matrices into relatively dense submatrices tends to give competitive performance for sparse matrix multiplication. It does not seem far-fetched to think that similar methods would be utilized for the automated construction of non-uniform deep-learning architectures in the near future.


    

   The Inception architecture started out as a case study of the first author for assessing the hypothetical output of a sophisticated network topology construction algorithm that tries to approximate a sparse structure implied by [2] for vision networks and covering the hypothesized outcome by dense, readily available components. Despite being a highly speculative undertaking, only after two iterations on the exact choice of topology, we could already see modest gains against the reference architecture based on [12]. After further tuning of learning rate, hyperparameters and improved training methodology, we established that the resulting Inception architecture was especially useful in the context of localization and object detection as the base network for [6] and [5]. Interestingly, while most of the original architectural choices have been questioned and tested thoroughly, they turned out to be at least locally optimal.


   

One must be cautious though: although the Inception architecture has become a success for computer vision, it is still questionable whether this can be attributed to the guiding principles that have led to its construction. Making sure would require much more thorough analysis and verification: for example, if automated tools based on the principles described below would find similar, but better topology for the vision networks. The most convincing proof would be if an automated system would create network topologies resulting in similar gains in other domains using the same algorithm but with a very differently looking global architecture. At the very least, the initial success of the Inception architecture yields firm motivation for exciting future work in this direction.


 

4 Architectural Details

 

     The main idea of the Inception architecture is to consider how an optimal local sparse structure of a convolutional vision network can be approximated and covered by readily available dense components. Note that assuming translation invariance means that our network will be built from convolutional building blocks. All we need is to find the optimal local construction and to repeat it spatially. Arora et al. [2] suggests a layer-by-layer construction where one should analyze the correlation statistics of the last layer and cluster them into groups of units with high correlation. These clusters form the units of the next layer and are connected to the units in the previous layer. We assume that each unit from an earlier layer corresponds to some region of the input image and these units are grouped into filter banks. In the lower layers (the ones close to the input) correlated units would concentrate in local regions. Thus, we would end up with a lot of clusters concentrated in a single region and they can be covered by a layer of 1×1 convolutions in the next layer, as suggested in [12]. However, one can also expect that there will be a smaller number of more spatially spread out clusters that can be covered by convolutions over larger patches, and there will be a decreasing number of patches over larger and larger regions. In order to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5; this decision was based more on convenience rather than necessity. It also means that the suggested architecture is a combination of all those layers with their output filter banks concatenated into a single output vector forming the input of the next stage. Additionally, since pooling operations have been essential for the success of current convolutional networks, it suggests that adding an alternative parallel pooling path in each such stage should have additional beneficial effect, too (see Figure 2(a)).

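A minimal PyTorch sketch may make the filter-bank concatenation concrete; this follows the naive module of Figure 2(a), with the per-branch filter counts left as free hyperparameters (the class name is ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveInception(nn.Module):
    """Naive Inception module (Figure 2(a)): parallel 1x1, 3x3 and 5x5
    convolutions plus 3x3 max pooling, concatenated along channels."""
    def __init__(self, in_ch, c1, c3, c5):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.conv3 = nn.Conv2d(in_ch, c3, kernel_size=3, padding=1)
        self.conv5 = nn.Conv2d(in_ch, c5, kernel_size=5, padding=2)

    def forward(self, x):
        # The pooling branch passes all in_ch channels through unchanged,
        # so the concatenated output is always wider than the input.
        pool = F.max_pool2d(x, kernel_size=3, stride=1, padding=1)
        return torch.cat([F.relu(self.conv1(x)),
                          F.relu(self.conv3(x)),
                          F.relu(self.conv5(x)),
                          pool], dim=1)
```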

 

    As these “Inception modules” are stacked on top of each other, their output correlation statistics are bound to vary: as features of higher abstraction are captured by higher layers, their spatial concentration is expected to decrease suggesting that the ratio of 3×3 and 5×5 convolutions should increase as we move to higher layers. 


    

One big problem with the above modules, at least in this naive form, is that even a modest number of 5×5 convolutions can be prohibitively expensive on top of a convolutional layer with a large number of filters. This problem becomes even more pronounced once pooling units are added to the mix: the number of output filters equals the number of filters in the previous stage. The merging of output of the pooling layer with outputs of the convolutional layers would lead to an inevitable increase in the number of outputs from stage to stage. While this architecture might cover the optimal sparse structure, it would do it very inefficiently, leading to a computational blow up within a few stages.

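To put rough numbers on this (ours, for illustration): 32 filters of 5×5 applied directly to a 28×28 input with 192 channels cost about 28·28·5·5·192·32 ≈ 120M multiply-adds. If a 1×1 convolution first reduces the input to 16 channels (28·28·192·16 ≈ 2.4M operations), the subsequent 5×5 convolution costs only 28·28·5·5·16·32 ≈ 10M, roughly a tenth of the direct approach. This is exactly the remedy introduced next.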

 

    This leads to the second idea of the Inception architecture: judiciously reducing dimension wherever the computational requirements would increase too much otherwise. This is based on the success of embeddings: even low dimensional embeddings might contain a lot of information about a relatively large image patch. However, embeddings represent information in a dense, compressed form and compressed information is harder to process. The representation should be kept sparse at most places (as required by the conditions of [2]) and compress the signals only whenever they have to be aggregated en masse. That is, 1×1 convolutions are used to compute reductions before the expensive 3×3 and 5×5 convolutions. Besides being used as reductions, they also include the use of rectified linear activation making them dual-purpose. The final result is depicted in Figure 2(b).

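A sketch of the module with dimension reduction (Figure 2(b)) in PyTorch; the constructor arguments mirror the “#3×3 reduce”, “#5×5 reduce” and “pool proj” columns of Table 1, and the example instantiation uses the Inception (3a) parameters from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Inception(nn.Module):
    """Inception module with dimension reduction (Figure 2(b))."""
    def __init__(self, in_ch, c1, c3r, c3, c5r, c5, pp):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, c1, kernel_size=1)
        self.b3r = nn.Conv2d(in_ch, c3r, kernel_size=1)        # "#3x3 reduce"
        self.b3 = nn.Conv2d(c3r, c3, kernel_size=3, padding=1)
        self.b5r = nn.Conv2d(in_ch, c5r, kernel_size=1)        # "#5x5 reduce"
        self.b5 = nn.Conv2d(c5r, c5, kernel_size=5, padding=2)
        self.pool_proj = nn.Conv2d(in_ch, pp, kernel_size=1)   # "pool proj"

    def forward(self, x):
        pooled = F.max_pool2d(x, kernel_size=3, stride=1, padding=1)
        return torch.cat([F.relu(self.b1(x)),
                          F.relu(self.b3(F.relu(self.b3r(x)))),
                          F.relu(self.b5(F.relu(self.b5r(x)))),
                          F.relu(self.pool_proj(pooled))], dim=1)

# Inception (3a): 192 input channels; 64 / (96->128) / (16->32) branches
# and a 32-channel pool projection give 64+128+32+32 = 256 outputs.
m = Inception(192, 64, 96, 128, 16, 32, 32)
print(m(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])
```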

  

   In general, an Inception network is a network consisting of modules of the above type stacked upon each other, with occasional max-pooling layers with stride 2 to halve the resolution of the grid. For technical reasons (memory efficiency during training), it seemed beneficial to start using Inception modules only at higher layers while keeping the lower layers in traditional convolutional fashion. This is not strictly necessary, simply reflecting some infrastructural inefficiencies in our current implementation. 


 

One of the main beneficial aspects of this architecture is that it allows for increasing the number of units at each stage significantly without an uncontrolled blow-up in computational complexity. The ubiquitous use of dimension reduction allows for shielding the large number of input filters of the last stage to the next layer, first reducing their dimension before convolving over them with a large patch size. Another practically useful aspect of this design is that it aligns with the intuition that visual information should be processed at various scales and then aggregated so that the next stage can abstract features from different scales simultaneously.


 

The improved use of computational resources allows for increasing both the width of each stage as well as the number of stages without getting into computational difficulties. Another way to utilize the Inception architecture is to create slightly inferior, but computationally cheaper versions of it. We have found that all the included knobs and levers allow for a controlled balancing of computational resources that can result in networks that are 2−3× faster than similarly performing networks with non-Inception architecture; however, this requires careful manual design at this point.


 

5 GoogLeNet

 

We chose GoogLeNet as our team-name in the ILSVRC14 competition. This name is an homage to Yann LeCun's pioneering LeNet-5 network [10]. We also use GoogLeNet to refer to the particular incarnation of the Inception architecture used in our submission for the competition. We have also used a deeper and wider Inception network, the quality of which was slightly inferior, but adding it to the ensemble seemed to improve the results marginally. We omit the details of that network, since our experiments have shown that the influence of the exact architectural parameters is relatively minor. Here, the most successful particular instance (named GoogLeNet) is described in Table 1 for demonstrational purposes. The exact same topology (trained with different sampling methods) was used for 6 out of the 7 models in our ensemble.


 

All the convolutions, including those inside the Inception modules, use rectified linear activation. The size of the receptive field in our network is 224×224 in the RGB color space with zero mean. “#3×3 reduce” and “#5×5 reduce” stand for the number of 1×1 filters in the reduction layer used before the 3×3 and 5×5 convolutions. One can see the number of 1×1 filters in the projection layer after the built-in max-pooling in the pool proj column. All these reduction/projection layers use rectified linear activation as well.


    

The network was designed with computational efficiency and practicality in mind, so that inference can be run on individual devices, including even those with limited computational resources, especially with a low memory footprint. The network is 22 layers deep when counting only layers with parameters (or 27 layers if we also count pooling). The overall number of layers (independent building blocks) used for the construction of the network is about 100. However, this number depends on the machine learning infrastructure system used. The use of average pooling before the classifier is based on [12], although our implementation differs in that we use an extra linear layer. This enables adapting and fine-tuning our networks for other label sets easily, but it is mostly convenience and we do not expect it to have a major effect. It was found that a move from fully connected layers to average pooling improved the top-1 accuracy by about 0.6%; however, the use of dropout remained essential even after removing the fully connected layers.

    

Given the relatively large depth of the network, the ability to propagate gradients back through all the layers in an effective manner was a concern. One interesting insight is that the strong performance of relatively shallower networks on this task suggests that the features produced by the layers in the middle of the network should be very discriminative. By adding auxiliary classifiers connected to these intermediate layers, we would expect to encourage discrimination in the lower stages in the classifier, increase the gradient signal that gets propagated back, and provide additional regularization. These classifiers take the form of smaller convolutional networks put on top of the output of the Inception (4a) and (4d) modules. During training, their loss gets added to the total loss of the network with a discount weight (the losses of the auxiliary classifiers were weighted by 0.3). At inference time, these auxiliary networks are discarded.

 

   The exact structure of the extra network on the side, including the auxiliary classifier, is as follows:

  • An average pooling layer with 5×5 filter size and stride 3, resulting in a 4×4×512 output for the (4a), and 4×4×528 for the (4d) stage.

  • A 1×1 convolution with 128 filters for dimension reduction and rectified linear activation.

  • A fully connected layer with 1024 units and rectified linear activation.

  • A dropout layer with 70% ratio of dropped outputs.

  • A linear layer with softmax loss as the classifier (predicting the same 1000 classes as the main classifier, but removed at inference time).

A schematic view of the resulting network is depicted in Figure 3.


Figure 3: GoogLeNet network with all the bells and whistles
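A sketch of one auxiliary head in PyTorch, following the list above (the class name is ours; in_ch is 512 for the (4a) head and 528 for (4d), whose 14×14 inputs yield the stated 4×4 maps):

```python
import torch
import torch.nn as nn

class AuxClassifier(nn.Module):
    """Auxiliary classifier head: 5x5/3 average pooling, 1x1 conv with
    128 filters, a 1024-unit fully connected layer, 70% dropout, and a
    linear classifier over 1000 classes."""
    def __init__(self, in_ch, num_classes=1000):
        super().__init__()
        self.pool = nn.AvgPool2d(kernel_size=5, stride=3)   # 14x14 -> 4x4
        self.conv = nn.Conv2d(in_ch, 128, kernel_size=1)
        self.fc1 = nn.Linear(128 * 4 * 4, 1024)
        self.drop = nn.Dropout(p=0.7)
        self.fc2 = nn.Linear(1024, num_classes)

    def forward(self, x):
        x = torch.relu(self.conv(self.pool(x)))
        x = torch.flatten(x, 1)
        x = self.drop(torch.relu(self.fc1(x)))
        return self.fc2(x)  # logits; the softmax loss is applied in training

# During training: loss = main_loss + 0.3 * (aux1_loss + aux2_loss);
# at inference time the auxiliary heads are simply discarded.
```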

 

6 Training Methodology

Our networks were trained using the DistBelief [4] distributed machine learning system using a modest amount of model- and data-parallelism. Although we used a CPU-based implementation only, a rough estimate suggests that the GoogLeNet network could be trained to convergence using a few high-end GPUs within a week, the main limitation being the memory usage. Our training used asynchronous stochastic gradient descent with 0.9 momentum [17] and a fixed learning rate schedule (decreasing the learning rate by 4% every 8 epochs). Polyak averaging [13] was used to create the final model used at inference time.

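The stated optimization recipe is easy to restate with standard tooling; below is a minimal single-machine sketch (the asynchronous, distributed DistBelief setup is not reproduced, and the model is a placeholder):

```python
import torch

model = torch.nn.Linear(10, 2)  # placeholder for the actual network
opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# Fixed schedule: multiply the learning rate by 0.96 every 8 epochs,
# i.e. decrease it by 4%.
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=8, gamma=0.96)

for epoch in range(80):
    # ... training steps for one epoch would go here ...
    sched.step()

# Polyak averaging of the weights over training (e.g. via
# torch.optim.swa_utils.AveragedModel) would yield the inference model.
```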

 

Our image sampling methods have changed substantially over the months leading to the competition, and already converged models were trained on with other options, sometimes in conjunction with changed hyperparameters, like dropout and learning rate, so it is hard to give a definitive guidance to the most effective single way to train these networks. To complicate matters further, some of the models were mainly trained on smaller relative crops, others on larger ones, inspired by [8]. Still, one prescription that was verified to work very well after the competition includes sampling of various sized patches of the image whose size is distributed evenly between 8% and 100% of the image area and whose aspect ratio is chosen randomly between 3/4 and 4/3. Also, we found that the photometric distortions by Andrew Howard [8] were useful to combat overfitting to some extent. In addition, we started to use random interpolation methods (bilinear, area, nearest neighbor and cubic, with equal probability) for resizing relatively late and in conjunction with other hyperparameter changes, so we could not tell definitely whether the final results were affected positively by their use.

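The crop-sampling prescription maps closely onto standard torchvision components; a plausible rendering (the ColorJitter strengths are our guesses in the spirit of Howard [8], and the random choice among interpolation methods is omitted):

```python
import torchvision.transforms as T

train_transform = T.Compose([
    # Patches covering 8%-100% of the image area, aspect ratio in [3/4, 4/3].
    T.RandomResizedCrop(224, scale=(0.08, 1.0), ratio=(3/4, 4/3)),
    # Photometric distortions to combat overfitting; strengths are guesses.
    T.ColorJitter(brightness=0.4, contrast=0.4, saturation=0.4),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
```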

 

7 ILSVRC 2014 Classification Challenge Setup and Results

 

    The ILSVRC 2014 classification challenge involves the task of classifying the image into one of 1000 leaf-node categories in the Imagenet hierarchy. There are about 1.2 million images for training, 50,000 for validation and 100,000 images for testing. Each image is associated with one ground truth category, and performance is measured based on the highest scoring classifier predictions. Two numbers are usually reported: the top-1 accuracy rate, which compares the ground truth against the first predicted class, and the top-5 error rate, which compares the ground truth against the first 5 predicted classes: an image is deemed correctly classified if the ground truth is among the top-5, regardless of its rank in them. The challenge uses the top-5 error rate for ranking purposes. 

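Both metrics admit a compact implementation; a sketch (the function name is ours):

```python
import torch

def topk_error(logits: torch.Tensor, target: torch.Tensor, k: int) -> float:
    """Fraction of examples whose ground-truth label is NOT among the k
    highest-scoring predictions (top-1 and top-5 error as described above)."""
    topk = logits.topk(k, dim=1).indices            # (N, k) predicted classes
    hit = (topk == target.unsqueeze(1)).any(dim=1)  # correct if truth in top-k
    return 1.0 - hit.float().mean().item()

logits = torch.randn(8, 1000)          # scores for 8 images, 1000 classes
target = torch.randint(0, 1000, (8,))  # ground-truth labels
print(topk_error(logits, target, 1), topk_error(logits, target, 5))
```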

 

We participated in the challenge with no external data used for training. In addition to the training techniques aforementioned in this paper, we adopted a set of techniques during testing to obtain a higher performance, which we describe next.

  1.   We independently trained 7 versions of the same GoogLeNet model (including one wider version), and performed ensemble prediction with them. These models were trained with the same initialization (even with the same initial weights, due to an oversight) and learning rate policies. They differed only in sampling methodologies and the randomized input image order.

  2. During testing, we adopted a more aggressive cropping approach than that of Krizhevsky et al. [9]. Specifically, we resized the image to 4 scales where the shorter dimension (height or width) is 256, 288, 320 and 352 respectively, and take the left, center and right square of these resized images (in the case of portrait images, we take the top, center and bottom squares). For each square, we then take the 4 corners and the center 224×224 crop as well as the square resized to 224×224, and their mirrored versions. This leads to 4×3×6×2 = 144 crops per image (tallied in the sketch after this list). A similar approach was used by Andrew Howard [8] in the previous year’s entry, which we empirically verified to perform slightly worse than the proposed scheme. We note that such aggressive cropping may not be necessary in real applications, as the benefit of more crops becomes marginal after a reasonable number of crops are present (as we will show later on).

  3. The softmax probabilities are averaged over multiple crops and over all the individual classifiers to obtain the final prediction. In our experiments we analyzed alternative approaches on the validation data, such as max pooling over crops and averaging over classifiers, but they led to inferior performance compared to the simple averaging.

 

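The crop count from item 2 can be tallied directly:

```python
scales = [256, 288, 320, 352]  # shorter side resized to each of 4 scales
squares_per_scale = 3          # left/center/right (or top/center/bottom)
crops_per_square = 6           # 4 corners + center 224x224 + whole square
mirrors = 2                    # each crop plus its horizontal mirror
print(len(scales) * squares_per_scale * crops_per_square * mirrors)  # 144
```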

   

     In the remainder of this paper, we analyze the multiple factors that contribute to the overall performance of the final submission.

    Our final submission in the challenge obtains a top-5 error of 6.67% on both the validation and testing data, ranking the first among other participants. This is a 56.5% relative reduction compared to the SuperVision approach in 2012, and about 40% relative reduction compared to the previous year’s best approach (Clarifai), both of which used external data for training the classifiers. The following table shows the statistics of some of the top-performing approaches.


We also analyze and report the performance of multiple testing choices, by varying the number of models and the number of crops used when predicting an image in the following table. When we use one model, we chose the one with the lowest top-1 error rate on the validation data. All numbers are reported on the validation dataset in order to not overfit to the testing data statistics.


 

8 ILSVRC 2014 Detection Challenge Setup and Results

 

The ILSVRC detection task is to produce bounding boxes around objects in images among 200 possible classes. Detected objects count as correct if they match the class of the ground truth and their bounding boxes overlap by at least 50% (using the Jaccard index). Extraneous detections count as false positives and are penalized. Contrary to the classification task, each image may contain many objects or none, and their scale may vary. Results are reported using the mean average precision (mAP).

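The overlap criterion is the Jaccard index (intersection over union) of the predicted and ground-truth boxes; a minimal sketch, assuming boxes given as (x1, y1, x2, y2) corner coordinates:

```python
def jaccard(box_a, box_b):
    """Jaccard index of two boxes; a detection counts as correct when this
    overlap with a same-class ground-truth box is at least 0.5."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(jaccard((0, 0, 10, 10), (5, 5, 15, 15)))  # ~0.143
```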

   

The approach taken by GoogLeNet for detection is similar to the R-CNN by [6], but is augmented with the Inception model as the region classifier. Additionally, the region proposal step is improved by combining the Selective Search [20] approach with multi-box [5] predictions for higher object bounding box recall. In order to cut down the number of false positives, the superpixel size was increased by 2×. This halves the proposals coming from the selective search algorithm. We added back 200 region proposals coming from multi-box [5], resulting, in total, in about 60% of the proposals used by [6], while increasing the coverage from 92% to 93%. The overall effect of cutting the number of proposals with increased coverage is a 1% improvement of the mean average precision for the single model case. Finally, we use an ensemble of 6 ConvNets when classifying each region, which improves results from 40% to 43.9% accuracy. Note that contrary to R-CNN, we did not use bounding box regression due to lack of time.

 

We first report the top detection results and show the progress since the first edition of the detection task. Compared to the 2013 result, the accuracy has almost doubled. The top performing teams all use Convolutional Networks. We report the official scores in Table 4 and common strategies for each team: the use of external data, ensemble models or contextual models. The external data is typically the ILSVRC12 classification data for pre-training a model that is later refined on the detection data. Some teams also mention the use of the localization data. Since a good portion of the localization task bounding boxes are not included in the detection dataset, one can pre-train a general bounding box regressor with this data the same way classification is used for pre-training. The GoogLeNet entry did not use the localization data for pretraining.


 

   In Table 5, we compare results using a single model only. The top performing model is by Deep Insight and surprisingly only improves by 0.3 points with an ensemble of 3 models while the GoogLeNet obtains significantly stronger results with the ensemble.


 

9 Conclusions

   

Our results seem to yield solid evidence that approximating the expected optimal sparse structure by readily available dense building blocks is a viable method for improving neural networks for computer vision. The main advantage of this method is a significant quality gain at a modest increase of computational requirements compared to shallower and less wide networks. Also note that our detection work was competitive despite neither utilizing context nor performing bounding box regression, and this fact provides further evidence of the strength of the Inception architecture. Although it is expected that similar quality of result can be achieved by much more expensive networks of similar depth and width, our approach yields solid evidence that moving to sparser architectures is a feasible and useful idea in general. This suggests promising future work towards creating sparser and more refined structures in automated ways on the basis of [2].


 

10 Acknowledgements

   

We would like to thank Sanjeev Arora and Aditya Bhaskara for fruitful discussions on [2]. Also we are indebted to the DistBelief [4] team for their support, especially to Rajat Monga, Jon Shlens, Alex Krizhevsky, Jeff Dean, Ilya Sutskever and Andrea Frome. We would also like to thank Tom Duerig and Ning Ye for their help on photometric distortions. Also our work would not have been possible without the support of Chuck Rosenberg and Hartwig Adam.


 

References

[1] Know your meme: We need to go deeper. http://knowyourmeme.com/memes/we-need-to-go-deeper. Accessed: 2014-09-15.

[2] S. Arora, A. Bhaskara, R. Ge, and T. Ma. Provable bounds for learning some deep representations. CoRR, abs/1310.6343, 2013.

[3] Ü. V. Çatalyürek, C. Aykanat, and B. Uçar. On two-dimensional sparse matrix partitioning: Models, methods, and a recipe. SIAM J. Sci. Comput., 32(2):656–683, Feb. 2010.

[4] J. Dean, G. Corrado, R. Monga, K. Chen, M. Devin, M. Mao, M. Ranzato, A. Senior, P. Tucker, K. Yang, Q. V. Le, and A. Y. Ng. Large scale distributed deep networks. In P. Bartlett, F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, NIPS, pages 1232–1240. 2012.

[5] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. In CVPR, 2014.

[6] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Computer Vision and Pattern Recognition, 2014. CVPR 2014. IEEE Conference on, 2014.

[7] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.

[8] A. G. Howard. Some improvements on deep convolutional neural network based image classification. CoRR, abs/1312.5402, 2013.

[9] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1106–1114, 2012.

[10] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. Backpropagation applied to handwritten zip code recognition. Neural Comput., 1(4):541–551, Dec. 1989.

[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[12] M. Lin, Q. Chen, and S. Yan. Network in network. CoRR, abs/1312.4400, 2013.

[13] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, July 1992.

[14] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013.

[15] T. Serre, L. Wolf, S. M. Bileschi, M. Riesenhuber, and T. Poggio. Robust object recognition with cortex-like mechanisms. IEEE Trans. Pattern Anal. Mach. Intell., 29(3):411–426, 2007.

[16] F. Song and J. Dongarra. Scaling up matrix computations on shared-memory manycore systems with 1000 CPU cores. In Proceedings of the 28th ACM International Conference on Supercomputing, ICS ’14, pages 333–342, New York, NY, USA, 2014. ACM.

[17] I. Sutskever, J. Martens, G. E. Dahl, and G. E. Hinton. On the importance of initialization and momentum in deep learning. In ICML, volume 28 of JMLR Proceedings, pages 1139–1147. JMLR.org, 2013.

[18] C. Szegedy, A. Toshev, and D. Erhan. Deep neural networks for object detection. In C. J. C. Burges, L. Bottou, Z. Ghahramani, and K. Q. Weinberger, editors, NIPS, pages 2553–2561, 2013.

[19] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. CoRR, abs/1312.4659, 2013.

[20] K. E. A. van de Sande, J. R. R. Uijlings, T. Gevers, and A. W. M. Smeulders. Segmentation as selective search for object recognition. In Proceedings of the 2011 International Conference on Computer Vision, ICCV ’11, pages 1879–1886, Washington, DC, USA, 2011. IEEE Computer Society.

[21] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In D. J. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors, ECCV, volume 8689 of Lecture Notes in Computer Science, pages 818–833. Springer, 2014.

 
