《Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition》翻譯

Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun

Abstract—Existing deep convolutional neuralnetworks (CNNs) require a fixed-size (e.g., 224×224) input image. Thisrequirement is “artificial” and may reduce the recognition accuracy for theimages or sub-images of an arbitrary size/scale. In this work, we equipthe networks with another pooling strategy, “spatial pyramid pooling”, toeliminate the above requirement. The new network structure, calledSPP-net, can generate a fixed-length representation regardless of imagesize/scale. Pyramid pooling is also robust to object deformations. Withthese advantages, SPP-net should in general improve all CNN-basedimage classification methods. On the ImageNet 2012 dataset, we demonstratethat SPP-net boosts the accuracy of a variety of CNN architectures despitetheir different designs. On the Pascal VOC 2007 and Caltech101 datasets,SPP-net achieves state-of-the-art classification results using a singlefull-image representation and no fine-tuning.

The power of SPP-net is also significantin object detection. Using SPP-net, we compute the feature maps from theentire image only once, and then pool features in arbitrary regions(sub-images) to generate fixed-length representations for training thedetectors. This method avoids repeatedly computing the convolutional features.In processing test images, our method is 24-102×faster than the R-CNN method,while achieving better or comparable accuracy on Pascal VOC 2007.

In ImageNet Large Scale VisualRecognition Challenge (ILSVRC) 2014, our methods rank #2 in object detectionand #3 in image classification among all 38 teams. This manuscript alsointroduces the improvement made for this competition.

Index Terms—Convolutional NeuralNetworks, Spatial Pyramid Pooling, Image Classification, Object Detection

摘要

當前深度卷積神經網絡（CNNs）都需要輸入的圖像尺寸固定（比如224×224）。這種人爲的需要導致面對任意尺寸和比例的圖像或子圖像時降低識別的精度。本文中，我們給網絡配上一個叫做“空間金字塔池化”(spatial pyramid pooling)的池化策略以消除上述限制。這個我們稱之爲SPP-net的網絡結構能夠產生固定大小的表示（representation）而不關心輸入圖像的尺寸或比例。金字塔池化對物體的形變十分魯棒。由於諸多優點，SPP-net可以普遍幫助改進各類基於CNN的圖像分類方法。在ImageNet2012數據集上，SPP-net將各種CNN架構的精度都大幅提升，儘管這些架構有着各自不同的設計。在Pascal VOC 2007和Caltech 101數據集上，SPP-net使用單一全圖像表示在沒有調優的情況下都達到了最好成績。

SPP-net在目標檢測上也表現突出。使用SPP-net，只需要從整張圖片計算一次特徵圖（feature map），然後對任意尺寸的區域（子圖像）進行特徵池化以產生一個固定尺寸的表示用於訓練檢測器。這個方法避免了反覆計算卷積特徵。在處理測試圖像時，我們的方法在VOC2007數據集上，達到相同或更好的性能情況下，比R-CNN方法快24-102倍。

在ImageNet大規模視覺識別任務挑戰（ILSVRC）2014上，我們的方法在物體檢測上排名第2，在物體分類上排名第3，參賽的總共有38個組。本文也介紹了爲了這個比賽所作的一些改進。

關鍵詞：Convolutional NeuralNetworks, Spatial Pyramid Pooling, Image Classification, Object Detection

1 INTRODUCTION

We are witnessing a rapid, revolutionary change in ourvision community, mainly caused by deep convolutional neural networks (CNNs)[1] and the availability of large scale training data [2]. Deep-networks-based approacheshave recently been substantially improving upon the state of the art in imageclassification [3], [4], [5], [6], object detection [7], [8], [5],many other recognition tasks [9], [10], [11], [12],and even non-recognition tasks.

1.引言

我們看到計算機視覺領域正在經歷飛速的變化，這一切得益於深度卷積神經網絡（CNNs）[1]和大規模的訓練數據的出現[2]。近來深度網絡對圖像分類 [3][4][5][6]，物體檢測 [7][8][5]和其他識別任務 [9][10][11][12]，甚至很多非識別類任務上都表現出了明顯的性能提升。

However, there is a technical issue in the training andtesting of the CNNs: the prevalent CNNs require a fixed input image size (e.g.,224×224), which limits both the aspect ratio and the scale of the input image.Whenapplied to images of arbitrary sizes, current methods mostly fit the inputimage to the fixed size, either via cropping [3], [4] or via warping [13], [7],asshown in Figure 1 (top). But the cropped region may not contain the entireobject, while the warped content may result in unwanted geometric distortion.Recognitionaccuracy can be compromised due to the content loss or distortion. Besides, apre-defined scale may not be suitable when object scales vary. Fixing inputsizes overlooks the issues involving scales.

然而，這些技術再訓練和測試時都有一個問題，這些流行的CNNs都需要輸入的圖像尺寸是固定的（比如224×224），這限制了輸入圖像的長寬比和縮放尺度。當遇到任意尺寸的圖像時，都是先將圖像適應成固定尺寸，方法包括裁剪[3][4]和變形[13][7]，如圖1（上）所示。但裁剪會導致信息的丟失，變形會導致位置信息的扭曲，就會影響識別的精度。另外，一個預先定義好的尺寸在物體是縮放可變的時候就不適用了。固定輸入大小忽略了涉及比例的問題。

So why do CNNs require a fixed input size? A CNN mainly consists of two parts: convolutional layers,and fully-connected layers that follow. The convolutional layers operate in a sliding-window manner and output feature maps which represent the spatial arrangement of the activations (Figure2). In fact, convolutional layers do not require a fixed image size and cangenerate feature maps of any sizes. On the other hand, the fully-connected layers need to have fixed-size/length input by their definition. Hence, thefixed-size constraint comes only from the fully-connected layers, which existat a deeper stage of the network.

那麼爲什麼CNNs需要一個固定的輸入尺寸呢？CNN主要由兩部分組成，卷積層和其後的全連接層。卷積部分通過滑窗進行計算，並輸出代表激活的空間排布的特徵圖（feature map）（圖2）。事實上，卷積並不需要固定的圖像尺寸，他可以產生任意尺寸的特徵圖。而另一方面，根據定義，全連接層則需要固定的尺寸輸入。因此固定尺寸的問題來源於全連接層，也是網絡的最後階段。

In this paper, we introduce a spatial pyramid pooling(SPP)[14], [15] layer to remove the fixed-size constraint of the network.Specifically, we add an SPP layer on top of the last convolutional layer. The SPP layer pools the features and generates fixed-length outputs, which are then fed into the fully-connected layers (or other classifiers). In other words,we perform some information “aggregation” at a deeper stage of the networkhierarchy (between convolutional layers and fully-connected layers) to avoid theneed for cropping or warping at the beginning.Figure 1 (bottom) shows thechange of the network architecture by introducing the SPP layer. We call the newnetwork structure SPP-net.

本文引入一種空間金字塔池化( spatial pyramid pooling，SPP)層以移除對網絡固定尺寸的限制。尤其是，將SPP層放在最後一個卷積層之後。SPP層對特徵進行池化，併產生固定長度的輸出，這個輸出再餵給全連接層（或其他分類器）。換句話說，在網絡層次的較後階段（也就是卷積層和全連接層之間）進行某種信息“彙總”，可以避免在最開始的時候就進行裁剪或變形。圖1（下）展示了引入SPP層之後的網絡結構變化。我們稱這種新型的網絡結構爲SPP-net。

Spatial pyramid pooling [14], [15] (popularly known as spatial pyramid matching or SPM [15]), as an extension of the Bag-of-Words (BoW) model [16],is one of the most successful methods in computer vision. It partitions the image into divisions from finer to coarser levels, and aggregates local features in them. SPP has long been a key component in the leading and competition-winning systems for classification (e.g., [17], [18], [19]) and detection (e.g., [20])before the recent prevalence of CNNs. Nevertheless, SPP has not been considered in the context of CNNs.We note that SPP has several remarkable properties for deep CNNs: 1) SPP is able to generate a fixed-length output regardless of the input size, while the sliding window pooling used in the previous deep networks [3] cannot; 2) SPP uses multi-level spatial bins, while the sliding window pooling uses only a single window size. Multi-level pooling has been shown to be robust to object deformations [15]; 3) SPP can pool features extracted at variable scales thanks to the flexibility of input scales. Throughexperiments we show that all these factors elevate the recognition accuracy of deep networks.

空間金字塔池化[14][15]（普遍稱謂：空間金字塔匹配spatial pyramid matching, SPM[15]），是一種詞袋(Bag-of-Words, BoW)模型的擴展。詞袋模型是計算機視覺領域最成功的方法之一。它將圖像切分成粗糙到精細各種級別，然後整合其中的局部特徵。在CNN之前，SPP一直是各大分類比賽[17][18][19]和檢測比賽（比如[20]）的冠軍系統中的核心組件。對深度CNNs而言，SPP有幾個突出的優點：1）SPP能在輸入尺寸任意的情況下產生固定大小的輸出，而以前的深度網絡[3]中的滑窗池化(sliding window pooling)則不能；2）SPP使用了多級別的空間箱(bin)，而滑窗池化則只用了一個窗口尺寸。多級池化對於物體的變形十分魯棒[15]；3）由於其對輸入的靈活性，SPP可以池化從各種尺度抽取出來的特徵。通過實驗，我們將展示影響深度網絡最終識別精度的所有這些因素。

SPP-net not only makes it possible to generate representations from arbitrarily sized images/windows for testing, but also allows us to feedimages with varying sizes or scales during training. Training with variable-size images increases scale-invariance and reduces over-fitting. We develop a simple multi-size training method. For a single network to accept variable inputsizes, we approximate it by multiple networks that share all parameters, while each of these networks is trained using a fixed input size. In each epoch we train the network with a given input size, and switch to another input size for the next epoch. Experiments show that this multi-size training converges just as the traditional single-size training,and leads to better testing accuracy.

SPP-net不僅僅讓測試階段允許任意尺寸的輸入能夠產生表示(representations)，也允許訓練階段的圖像可以有各種尺寸和縮放尺度。使用各種尺寸的圖像進行訓練可以提高縮放不變性，以及減少過擬合。我們開發了一個簡單的多尺度訓練方法。爲了實現一個單一的能夠接受各種輸入尺寸的網絡，我們先使用分別訓練固定輸入尺寸的多個網絡，這些網絡之間共享權重（Parameters），然後再一起來代表這個單一網絡（譯者注：具體代表方式沒有說清楚，看後面怎麼說吧）。每個epoch，我們針對一個給定的輸入尺寸進行網絡訓練，然後在下一個epoch再切換到另一個尺寸。實驗表明，這種多尺度訓練和傳統的單一尺度訓練一樣可以瘦臉，並且能達到更好的測試精度。

The advantages of SPP are orthogonal to the specific CNN designs. In a series of controlled experiments on the ImageNet 2012 dataset, we demonstrate that SPP improves four different CNN architectures in existing publications[3], [4], [5] (or their modifications), over the no-SPP counterparts. These architectures have various filter numbers/sizes, strides, depths, or other designs.It is thus reasonable for us to conjecture that SPP should improve more sophisticated(deeper and larger) convolutional architectures. SPP-net also shows state-of-the-art classification results on Caltech101 [21] and Pascal VOC 2007[22] using only a single full-image representation and no fine-tuning.

SPP的優點是與各類CNN設計是正交的。通過在ImageNet2012數據集上進行一系列可控的實驗，我們發現SPP對[3][4][5]這些不同的CNN架構都有提升。這些架構有不同的特徵數量、尺寸、滑動距離（strides）、深度或其他的設計。所以我們有理由推測SPP可以幫助提升更多複雜的（更大、更深）的卷積架構。SPP-net也做到了 Caltech101 [21]和Pascal VOC 2007 [22]上的最好結果，而只使用了一個全圖像表示，且沒有調優。

SPP-net also shows great strength in object detection. In the leading object detection method R-CNN[7],the features from candidate windows are extracted via deep convolutional networks. This method shows remarkable detection accuracy on both the VOC and ImageNet datasets. But the feature computation in R-CNN is time-consuming, because itrepeatedly applies the deep convolutional networks to the raw pixels of thousands of warped regions per image. In this paper, we show that we can runthe convolutional layers only once on the entireimage (regardless of the number of windows), and then extract features by SPP-net on the feature maps. This method yields a speed up of over one hundred times over R-CNN. Note that training/running a detector on the feature maps(rather than image regions) is actually a more popular idea [23], [24], [20],[5]. But SPP-net inherits the power of the deep CNN feature maps and also the flexibility of SPP on arbitrary window sizes, which leads to outstanding accuracy and efficiency. In our experiment, the SPP-net-based system (built upon the R-CNNpipeline) computes features 24-102×faster than R-CNN, while has better or comparable accuracy.With the recent fast proposal method of EdgeBoxes[25], oursystem takes 0.5 seconds processing an image(including all steps). This makes our method practical for real-world applications.

在圖像檢測方面，SPP-net也表現優異。目前領先的方法是R-CNN[7]，候選窗口的特徵是藉助深度神經網絡進行抽取的。此方法在VOC和ImageNet數據集上都表現出了出色的檢測精度。但R-CNN的特徵計算十分耗時，因爲他對每張圖片中的上千個變形後的區域的像素反覆調用CNN。本文中，我們展示了我們只需要在整張圖片上運行一次卷積網絡層（不關心窗口的數量），然後再使用SPP-net在特徵圖上抽取特徵。這個方法縮減了上百倍的耗時。在特徵圖（而不是圖像區域）上訓練和運行檢測器是一個很受歡迎的想法[23][24][20][5]。但SPP-net延續了深度CNN特徵圖的優勢，也結合了SPP兼容任意窗口大小的靈活性，所以做到了出色的精度和效率。我們的實驗中，基於SPP-net的系統（建立在R-CNN流水線上）比R-CNN計算特徵要快24-120倍，而精度卻更高。結合最新的推薦方法EdgeBoxes[25]，我們的系統達到了每張圖片處理0.5s的速度（全部步驟）。這使得我們的方法變得更加實用。

A preliminary version of this manuscript has been published in ECCV 2014. Based on this work, we attended the competition of ILSVRC 2014[26], and ranked #2 in object detection and #3 in image classification (bothare provided-data-only tracks) among all 38 teams. There are a few modifications made for ILSVRC 2014. We show that the SPP-nets can boost various networks that are deeper and larger (Sec. 3.1.2-3.1.4) over the no-SPP counterparts.Further, driven by our detection framework, we find that multi-view testing onfeature maps with flexibly located/sized windows (Sec. 3.1.5) can increase the classification accuracy. This manuscript also provides the details of these modifications.

We have released the code to facilitate future research(http://research.microsoft.com/ en-us/ um/ people/ kahe/).

本論文的一個早先版本發佈在ECCV2014上。基於這個工作，我們參加了ILSVRC 2014 [26]，在38個團隊中，取得了物體檢測第2名和圖像分類第3名的成績。針對ILSVRC 2014我們也做了很多修改。我們將展示SPP-nets可以將更深、更大的網絡的性能顯著提升。進一步，受檢測框架驅動，我們發現藉助靈活尺寸窗口對特徵圖進行多視角測試可以顯著提高分類精度。本文對這些改動做了更加詳細的說明。

另外，我們將代碼放在了以方便大家研究（http://research.microsoft.com/en-us/um/people/kahe/，譯者注：已失效）

2 DEEP NETWORKS WITH SPATIAL PYRA-MID POOLING

2.1 Convolutional Layers and Feature Maps Consider thepopular seven-layer architectures [3], [4].The first five layers are convolutional, some of which are followed by pooling layers. These pooling layers can also be considered as “convolutional”, in the sense that they are using sliding windows. The last two layers are fully connected, with an N-way softmax as the output, where N is the number of categories.

2. 基於空間金字塔池化的深度網絡

2.1 卷積層和特徵圖

在頗受歡迎的七層架構中[3][4]中，前五層是卷積層，其中一些後面跟着池化層。從他們也使用滑窗的角度來看，這些池化層也可以認爲是“卷積的”。最後兩層是全連接的，跟着一個N路softmax輸出，其中N是類別的數量。

The deep network described above needs a fixed imagesize. However, we notice that the requirement of fixed sizes is only due to the fully-connected layers that demand fixed-length vectors as inputs. On the other hand, the convolutional layers accept inputs of arbitrary sizes. The convolutional layers use sliding filters, and their outputs have roughly the same aspect ratio as the inputs. These outputs are known as feature maps [1] -they involve not only the strength of the responses, but also their spatial positions.

上述的深度網絡需要一個固定大小的圖像尺寸。然而，我們注意到，固定尺寸的要求僅僅是因爲全連接層的存在導致的。另一方面，卷積層使用滑動的特徵過濾器，它們的輸出基本保持了原始輸入的比例關係。它們的輸出就是特徵圖[1]-它們不僅涉及響應的強度，還包括空間位置。

In Figure 2, we visualize some feature maps. They are generated by some filters of the conv layer. Figure 2(c) shows the strongest activated images of these filters in the ImageNet dataset. We see a filter can be activated by some semantic content. For example, the 55-th filter (Figure 2,bottom left) is most activated by a circle shape; the 66-th filter (Figure 2,top right) is most activated by a ∧-shape; and the 118-th filter (Figure 2, bottomright) is most activated by a ∨-shape.These shapes in the input images (Figure 2(a))activate the feature maps at the corresponding positions (the arrows in Figure2).

圖2中，我們可視化了一些特徵圖。這些特徵圖來自於conv5層的一些過濾器。圖2（c）顯示了ImageNet數據集中激活最強的若干圖像。可以看到一個過濾器能夠被一些語義內容激活。例如，第55個過濾器（圖2，左下）對圓形十分敏感；第66層（圖2，右上）對^形狀特別敏感；第118個過濾器（圖2，右下）對∨形狀非常敏感。這些輸入圖像中的形狀會激活相應位置的特徵圖（圖2中的箭頭）。

It is worth noticing that we generate the feature maps in Figure 2 without fixing the input size. These feature maps generated by deep convolutional layers are analogous to the feature maps in traditional methods[27], [28]. In those methods, SIFT vectors [29] or image patches [28] aredensely extracted and then encoded,e.g., by vector quantization [16], [15],[30],sparse coding [17], [18], or Fisher kernels [19].These encoded features consist of the feature maps,and are then pooled by Bag-of-Words (BoW) [16] or spatialpyramids [14], [15]. Analogously, the deep convolutional features can be pooled in a similar way.

值得注意的是，圖2中生成的特徵圖並沒有固定輸入尺寸。深度卷積層生成的特徵圖和傳統方法[27][28]中的特徵圖很相似。這些傳統方法中，SIFT向量[29]或圖像碎片[28]被密集地抽取出來，在通過矢量量化[16][15][30]，稀疏化[17][18]或Fisher核函數[19]進行編碼。這些編碼後的特徵構成了特徵圖，然後通過詞袋（BoW）[16]或空間金字塔[14][15]進行池化。類似的深度卷積的特徵也可以這樣做。

2.2 The Spatial Pyramid Pooling Layer

The convolutional layers accept arbitrary input sizes,but they produce outputs of variable sizes. The classifiers (SVM/softmax) or fully-connected layers require fixed-length vectors. Such vectors can be generated by the Bag-of-Words (BoW) approach [16] that pools the features together. Spatial pyramid pooling [14], [15] improves BoW in that it can maintain spatial information by pooling in local spatial bins. These spatialbins have sizes proportional to the image size, so the number of bins is fixed regardless of the image size. This is in contrast to the sliding window pooling of the previous deep networks [3], where the number of sliding windows depends on the input size.

2.2 空間金字塔池化層

卷積層接受任意大小的輸入，所以他們的輸出也是各種大小。而分類器（SVM/softmax）或者全連接層UI需要固定的輸入大小的向量。這種向量可以使用詞袋方法[16]通過池化特徵來生成。空間金字塔池化[14][15]對BoW進行了改進以便在池化過程中保留局部空間塊（local spatial bins）中的空間保留。這些空間塊的尺寸和圖像的尺寸是成比例的，這樣塊的數量就是固定的了。而前述深度網絡的滑窗池化則對依賴於輸入圖像的尺寸。

To adopt the deep network for images of arbitrary sizes, we replace the last pooling layer (e.g.,pool5, after the last convolutional layer) with a spatial pyramid pooling layer.Figure 3 illustrates our method.In each spatial bin, we pool the responses of each filter (throughout this paper we use max pooling).The outputs of the spatial pyramid pooling are kM-dimensional vectors with the number of bins denoted as M(k is the number of filters in the last convolutional layer). The fixed-dimensional vectors are the input to the fully-connected layer.

爲了讓我們的神經網絡適應任意尺寸的圖像輸入，我們用一個空間金字塔池化層替換掉了最優一個池化層（最後一個卷積層之後的pool5）。圖3示例了這種方法。在每個空間塊中，我們池化每個過濾器的響應（本文中採用了最大池化法）。空間金字塔的輸出是一個kM維向量，M代表塊的數量，k代表最後一層卷積層的過濾器的數量。這個固定維度的向量就是全連接層的輸入。

With spatial pyramid pooling, the input image can 4 beof any sizes. This not only allows arbitrary aspect ratios, but also allows arbitrary scales. We can resize the input image to any scale (e.g.,min(w,h)=180,224,...) and apply the same deep network. When the input image is at different scales, the network (with the same filter sizes) will extract features at different scales. The scales play important roles in traditional methods,e.g.,the SIFT vectors are often extracted at multiple scales [29], [27] (determinedby the sizes of the patches and Gaussian filters). We will show that the scales are also important for the accuracy of deep networks.

有了空間金字塔池化，輸入圖像就可以是任意尺寸了。不但允許任意比例關係，而且支持任意縮放尺度。我們也可以將輸入圖像縮放到任意尺度(例如min(w;h)=180,224,…)並且使用同一個深度網絡。當輸入圖像處於不同的空間尺度時，帶有相同大小卷積核的網絡就可以在不同的尺度上抽取特徵。跨多個尺度在傳統方法中十分重要，比如SIFT向量就經常在多個尺度上進行抽取[29][27]（受碎片和高斯過濾器的大小所決定）。我們接下來會說明多尺度在深度網絡精度方面的重要作用。

Interestingly, the coarsest pyramid level has a single bin that covers the entire image. This is in fact a “global pooling” operation,which is also investigated in several concurrent works. In [31], [32] a global average pooling is used to reduce the model size and also reduce overfitting; in [33],a global average pooling is used on the testing stage after all fc layers to improve accuracy; in [34], a global max pooling is used for weakly supervised object recognition. The global pooling operation corresponds to the traditional Bag-of-Words method.

有趣的是，粗糙的金字塔級別只有一個塊，覆蓋了整張圖像。這就是一個全局池化操作，當前有很多正在進行的工作正在研究它。[33]中，一個放在全連接層之後的全局平均池化被用來提高測試階段的精確度；[34]中，一個全局最大池化用於弱監督物體識別。全局池化操作相當於傳統的詞袋方法。

2.3 Training the Network

Theoretically, the above network structure can be trainedwith standard back-propagation [1], regardless of the input image size. But inpractice the GPU implementations (such as cuda-convnet[3] and Caffe[35]) are preferably run on fixed input images. Next we describe our training solution that takes advantage of these GPU implementations while still preserving the spatial pyramid pooling behaviors.

2.3 網絡的訓練

理論上將，上述網絡結構可以用標準的反向傳播進行訓練[1]，與圖像的大小無關。但實踐中，GPU的實現（如cuda-convnet[3]和Caffe[35]）更適合運行在固定輸入圖像上。接下來，我們描述我們的訓練方法能夠在保持空間金字塔池化行爲的同時還能充分利用GPU的優勢。

單一尺寸訓練

如前人的工作一樣，我們首先考慮接收裁剪成224×224圖像的網絡。裁剪的目的是數據增強。對於一個給定尺寸的圖像，我們先計算空間金字塔池化所需要的塊（bins）的大小。試想一個尺寸是axa（也就是13×13）的conv5之後特徵圖。對於nxn塊的金字塔級，我們實現一個滑窗池化過程，窗口大小爲win = 上取整[a/n]，步幅str = 下取整[a/n]. 對於l層金字塔，我們實現l個這樣的層。然後將l個層的輸出進行連接輸出給全連接層。圖4展示了一個cuda卷積網絡風格的3層金字塔的樣例。(3×3, 2×2, 1×1)。

單一尺寸訓練的主要目的是開啓多級別池化行爲。實驗表明這是獲取精度的一個原因。

多尺寸訓練

攜帶SPP的網絡可以應用於任意尺寸，爲了解決不同圖像尺寸的訓練問題，我們考慮一些預設好的尺寸。現在考慮這兩個尺寸：180×180,224×224。我們使用縮放而不是裁剪，將前述的224的區域圖像變成180大小。這樣，不同尺度的區域僅僅是分辨率上的不同，而不是內容和佈局上的不同。對於接受180輸入的網絡，我們實現另一個固定尺寸的網絡。本例中，conv5輸出的特徵圖尺寸是axa=10×10。我們仍然使用win = 上取整[a/n]，str = 下取整[a/n]，實現每個金字塔池化層。這個180網絡的空間金字塔層的輸出的大小就和224網絡的一樣了。

這樣，這個180網絡就和224網絡擁有一樣的參數了。換句話說，訓練過程中，我們通過使用共享參數的兩個固定尺寸的網絡實現了不同輸入尺寸的SPP-net。

爲了降低從一個網絡（比如224）向另一個網絡（比如180）切換的開銷，我們在每個網絡上訓練一個完整的epoch，然後在下一個完成的epoch再切換到另一個網絡（權重保留）。依此往復。實驗中我們發現多尺寸訓練的收斂速度和單尺寸差不多。

多尺寸訓練的主要目的是在保證已經充分利用現在被較好優化的固定尺寸網絡實現的同時，模擬不同的輸入尺寸。除了上述兩個尺度的實現，我們也在每個epoch中測試了不同的sxs輸入，s是從180到224之間均勻選取的。後面將在實驗部分報告這些測試的結果。

注意，上面的單尺寸或多尺寸解析度只用於訓練。在測試階段，是直接對各種尺寸的圖像應用SPP-net的。

3 用於圖像分類的SPP-NET

3.1 ImageNet 2012分類實驗

我們在1000類別的Image2012訓練集上訓練了網絡。我們的訓練算法參照了前人的實踐工作[3][4][36]。圖像會被縮放，以便較小的維度是256，再從中間獲得四個角裁出224×224。圖像會通過水平翻轉和顏色變換[3]進行數據增強。

最後兩層全連接層會使用Dropout[3]。learning rate起始值是0.01，當錯誤率停滯後就除以10。我們的實現基於公開的cuda-convnet源代碼[3]和Caffe[35]。所有網絡都是在單一GeForceGTX TitanGPU（6G內存）耗時二到四周訓練的。

3.1.1 基準網絡架構

SPP的優勢是和使用的卷積神經網絡無關。我研究了四種不同的網絡架構[3][4][5]（或他們的修改版），對所有這些架構，SPP都提升了準確度。基準架構如表1，簡單介紹如下：

– ZF-5：基於Zeiler和Fergus的“快速”模式[4]的網絡架構。數字5代表5層卷積網絡。

– Convnet*-5：基於Krizhevsky等人工作[3]的修改。我們在conv2和conv3（而不是conv1和conv2）之後加入了兩個池化層。這樣，每一層之後的特徵圖就和ZF-5的尺寸一樣了。

– Overfeat-5/7：基於Overfeat論文[5]，使用了[6]的修改。對比ZF-5/Convnet*-5，這個架構在最後一個池化層產生了更大的特徵圖（18×18而不是13×13）。還在conv3和後續的卷基層使用了更多的過濾器（512）。我們也研究了7層卷積網絡，其中conv3和conv7結構一樣。

基準模型中，最後卷積層之後的池化層會產生6×6的特徵圖，然後跟着兩個4096維度的全連接層，和一個1000路的softmax層。這些基準網絡的表現參見表2(a)，我們針對ZF-5進行了70個epoch，而其他的用了90個epoch。ZF-5的表現比[4]中報告的那個要好。增益主要來源於角落裁切來源於整張圖片，[36]中也提到了這點。

3.1.2 多層次池化提升準確度

表2（b）中我們顯示了使用單尺寸訓練的結果。訓練和測試尺寸都是224×224.這些網絡中，卷積網絡都和他們的基準網絡有相同的結構，只是最後卷積層之後的池化層，被替換成了SPP層。表2中的結果我們使用了4層金字塔，f6x6, 3×3, 2×2, 1x1g(總共50個塊)。爲了公平比較，我們仍然使用標準的10-view預測法，每個view都是一個224×224的裁切。表2（b）中的結果顯示了明顯的性能提升。有趣的是，最大的提升（top-1 error，1.65%）來自於精度最高的網絡架構。既然我們一直使用相同10個裁切view。這些提升只能是來自於多層次池化。

值得注意的是多層次池化帶來的提升不只是因爲更多的參數；而是因爲多層次池化對對象的變形和空間佈局更加魯棒[15]。爲了說明這個，我們使用一個不同的4層金字塔（f4×4, 3×3, 2×2, 1×1g，供30個塊）訓練另一個ZF-5網絡。這個網絡有更少的參數，因爲他的全連接層fc6有30×256維輸入而不是36×256維。網絡的top-1/top-5錯誤率分別是35.06/14.04和50塊的金字塔網絡相近，明顯好於非SPP基準網絡（35.99/14.76）。

3.1.3 多尺寸訓練提升準確度

表2（c）展示了多尺寸訓練的結果。訓練尺寸是224和180，測試尺寸是224。我們還使用標準的10-view預測法。所有架構的top-1/top-5錯誤率進一步下降。SPP-net(Overfeat-7)的Top-1 錯誤率降到29.68%，比非SPP網絡低了2.33%，比單尺寸訓練降低了0.68%。

除了使用180和224兩個尺寸，我們還隨機選了[180;224]之間多個尺寸。SPP-net(Overfeat-7)的top1/5錯誤率是30.06%/10.96%。Top-1錯誤率比兩尺寸版本有所下降，可能因爲224這個尺寸（測試時用的尺寸）被更少的訪問到。但結果仍然比但尺寸版本要好。

之前的CNN解決方案[5][36]也處理了不同尺寸問題，但他們主要是基於測試。在Overfeat[5]和Howard的方法[36]中，單一網絡在測試解決被應用於不同的尺度，然後將分支平均。Howard進一步在低/高兩個分辨率圖像區域上訓練了兩個不同的網絡，然後平均分支。據我們所知，我們是第一個對不同尺寸訓練單一網絡的方法。

3.1.4 全圖像表示提升準確度

接下來我們研究全圖像視角的準確度。我們將圖像保持比例不變的情況下縮放到min(w;h)=256。SPP-net應用到一整張圖像上。爲了公平比較，我們也計算中央224×224裁切這單一視圖（上述評估都用過）的準確度。單視圖比較的準確度見表3。驗證了ZF-5/Overfeat-7，top-1錯誤率在全視圖表示中全部下降。這說明保持完整內容的重要性。即使網絡訓練時只使用了正方形圖像，卻也可以很好地適應其他的比例。

對比表2和表3我們發現，結合多種視圖大體上要好於全圖像視圖。然而全視圖圖像的表示仍然有價值。首先，經驗上看，我們發現（下節會討論）即使結合幾十個視圖，額外增加兩個全圖像視角（帶翻轉）仍然可以提高準確度大約0.2%。其次，全圖像視圖從方法論上講與傳統方法[15][17][19]保持了一致，這些方法中對整張圖像進行編碼的SIFT向量被池化在一起。第三，在其他一些應用中，比如圖像恢復[37]，相似度評分需要圖像表示而不是分類得分。一個全圖像的表示就會成爲首選。

3.1.5 特徵圖上的多視圖測試

【略】

3.2 Experiments on VOC 2007 Classification

【略】

3.3 Experiments on Caltech101

【略】

4 SPP-NET用於物體檢測

深度網絡已經被用於物體檢測。我們簡要回顧一下最先進的R-CNN[7]。R-CNN首先使用選擇性搜索[20]從每個圖像中選出2000個候選窗口。然後將每個窗口中的圖像區域變形到固定大小227×227。一個事先訓練好的深度網絡被用於抽取每個窗口的特徵。然後用二分類的SVM分類器在這些特徵上針對檢測進行訓練。R-CNN產生的引人注目的成果。但R-CNN在一張圖像的2000個窗口上反覆應用深度卷積網絡，十分耗時。在測試階段的特徵抽取式主要的耗時瓶頸。

我們將SPP-net應用於物體檢測。只在整張圖像上抽取一次特徵。然後在每個特徵圖的候選窗口上應用空間金字塔池化，形成這個窗口的一個固定長度表示（見圖5）。因爲只應用一次卷積網絡，我們的方法快得多。我們的方法是從特徵圖中直接抽取特徵，而R-CNN則要從圖像區域抽取。之前的一些工作中，可變性部件模型(Deformable Part Model, DPM)從HOG[24]特徵圖的窗口中抽取圖像，選擇性搜索方法[20]從SIFT編碼後的特徵圖的窗口中抽取特徵。Overfeat也是從卷積特徵圖中抽取特徵，但需要預定義的窗口尺寸。作爲對比，我們的特徵抽取可以在任意尺寸的深度卷積特徵圖窗口上。

4.1 檢測算法

我們使用選擇性搜索[20]的“fast”模式對每張圖片產生2000個候選窗口。然後縮放圖像以滿足min(w;h) = s，並且從整張圖像中抽取特徵圖。我們暫時使用ZF-5的SPP-net模型（單一尺寸訓練）。在每個候選窗口，我們使用一個4級空間金字塔（1×1, 2×2, 3×3, 6×6, 總共50塊）。每個窗口將產生一個12800（256×50）維的表示。這些表示傳遞給網絡的全連接層。然後我們針對每個分類訓練一個二分線性SVM分類器。我們的SVN實現追隨了[20][7]。我們使用真實標註的窗口去生成正例。負例是那些與正例窗口重疊不超過30%的窗口（使用IoU比例）。

如果一個負例與另一個負例重疊超過70%就會被移除。我們使用標準的難負例挖掘算法（standard hard negative mining [23]）訓練SVM。這個步驟只迭代一次。對於全部20個分類訓練SVM小於1個小時。測試階段，訓練器用來對候選窗口打分。然後在打分窗口上使用最大值抑制[23]算法（30%的閾值）。

通過多尺度特徵提取，我們的方法可以得到改進。將圖像縮放成min(w;h) = s \belongs S = {480; 576; 688; 864; 1200 }，然後針對每個尺度計算conv5的特徵圖。一個結合這些這些不同尺度特徵的策略是逐個channel的池化。但我們從經驗上發現另一個策略有更好的效果。對於每個候選窗口，我們選擇一個單一尺度s \belongs S，令縮放後的候選窗口的像素數量接近與224×224。然後我們從這個尺度抽取的特徵圖去計算窗口的特徵。如果這個預定義的尺度足夠密集，窗口近似於正方形。我們的方法粗略地等效於將窗口縮放到224×224，然後再從中抽取特徵。但我們的方法在每個尺度只計算一次特徵圖，不管有多少個候選窗口。

我們參照[7]對預訓練的網絡進行了調優。由於對於任意尺寸的窗口，我們都是從conv5的特徵圖中吃畫出特徵來，爲了簡單起見，我們只調優全連接層。

本例中，數據層接受conv5之後的固定長度的池化後的特徵，後面跟着fc_{6,7}和一個新的21路（有一個負例類別）fc8層。fc8的權重使用高斯分佈進行初始化σ=0.01。我們修正所有的learning rate爲1e-4，再將全部三層調整爲1e-5。調優過程中正例是與標註窗口重疊度達到[0.5, 1]的窗口，負例是重疊度爲[0.1, 0.5)的。每個mini-batch，25%是正例。我們使用學習率1e-4訓練了250k個minibatch，然後使用1e-5訓練50k個minibatch。

因爲我們只調優fc層，所以訓練非常的快，在GPU上只需要2個小時，不包括預緩存特徵圖所需要的1小時。另外，遵循[7]，我們使用了約束框迴歸來後處理預測窗口。用於迴歸的特徵也是conv5之後的池化後的特徵。用於迴歸訓練的是那些與標註窗口至少重疊50%的窗口。

4.2 檢測結果

我們在Pascal VOC 2007數據集的檢測任務上，評測了我們的方法。表9展示了我們的不同層的結果，使用了1-scale（s=688）或5-scale。R-CNN的結果見[7]，他們使用了5個卷積層的AlexNet[3]。使用pool5層我們的結果是44.9%，R-CNN的結果是44.2%。但使用未調優的fc6層，我們的結果就不好。可能是我們的fc層針對圖像區域進行了預訓練，在檢測案例中，他們用於特徵圖區域。而特徵圖區域在窗口框附近會有較強的激活，而圖像的區域就不會這樣。這種用法的不同是可以通過調優解決的。使用調優後的fc層，我們的結果就比R-CNN稍勝一籌。經過約束狂迴歸，我們的5-scale結果（59.2%）比R-CNN（58.5%）高0.7%。，而1-scale結果（58.0%）要差0.5%。

表10中，我們進一步使用相同預訓練的SPPnet模型（ZF-5）和R-CNN進行比較。本例中，我們的方法和R-CNN有相當的平均成績。R-CNN的結果是通過預訓練模型進行提升的。這是因爲ZF-5比AlexNet有更好的架構，而且SPPnet是多層次池化（如果使用非SPP的ZF-5，R-CNN的結果就會下降）。表11表明了每個類別的結果。表也包含了其他方法。選擇性搜索（SS）[20]在SIFT特徵圖上應用空間金字塔匹配。DPM[23]和Regionlet[39]都是基於HOG特徵的[24]。Regionlet方法通過結合包含conv5的同步特徵可以提升到46.1%。DetectorNet[40]訓練一個深度網絡，可以輸出像素級的對象遮罩。這個方法僅僅需要對整張圖片應用深度網絡一次，和我們的方法一樣。但他們的方法mAP比較低(30.5%)。

4.3 複雜度和運行時間

【略】

4.4 用於檢測的多模型結合

模型結合對於提升CNN爲基礎的分類準確度有重要的提升作用[3]。我們提出一種簡單的用於檢測的結合方法。

首先在ImageNet上預訓練另一個網絡，使用的結構都相同，只是隨機初始化不同。然後我們重複上述的檢測算法。表12（SPP-net（2））顯示了這個網絡的結果。他的mAP可以和第一名的網絡相媲美（59.1%vs59.2%），並且在11個類別上要好於第一網絡。

給定兩個模型，我們首先使用每個模型對測試圖像的候選框進行打分。然後對並聯的兩個候選框集合上應用最大化抑制。一個方法比較置信的窗口就會壓制另一個方法不太置信的窗口。通過這樣的結合，mAP提升到了60.9%（表12）。結合方法在20類中的17個的表現要好於單個模型。這意味着雙模型是互補的。

我們進一步發現這個互補性主要是因爲卷積層。我們嘗試結合卷積模型完全相同的兩個模型，則沒有任何效果。

4.5 ILSVRC 2014 Detection

【略】

5 結論

SPP對於處理不同的尺度、尺寸和長寬比是十分靈活的解決方案。這些問題在視覺識別時非常重要，但深度網絡中大家卻很少考慮這個問題。我們建議使用空間金字塔池化層來訓練深度網絡。這種SPP-net在分類和檢測任務上都表現出了出色的精度並顯著加速了DNN爲基礎的檢測任務。我們的研究也表明很多CV領域成熟的技術再基於深度網絡的識別中仍然可以發揮重要的作用。

ShaneneD

發佈了18 篇原創文章 · 獲贊 12 · 訪問量 4萬+

私信關注