MultiNet: Real-time Joint Semantic Reasoning for Autonomous Driving (paper translation)

MultiNet paper links

Paper download: original link / mirror link

Paper GitHub repository: KittiSeg

Translation reference: csdn

MultiNet: Real-time Joint Semantic Reasoning for Autonomous Driving

Abstract

While most approaches to semantic reasoning have focused on improving performance, in this paper we argue that computational times are very important in order to enable real-time applications such as autonomous driving. Towards this goal, we present an approach to joint classification, detection and semantic segmentation via a unified architecture where the encoder is shared amongst the three tasks. Our approach is very simple, can be trained end-to-end and performs extremely well on the challenging KITTI dataset, outperforming the state-of-the-art in the road segmentation task. Our approach is also very efficient, taking less than 100 ms to perform all tasks.

Introduction

Current advances in the field of computer vision have made clear that visual perception is going to play a key role in the development of self-driving cars. This is mostly due to the deep learning revolution which began with the introduction of AlexNet in 2012 [23]. Since then, the accuracy of new approaches has been increasing at a vertiginous rate. Causes of this are the existence of more data, increased computation power and algorithmic developments. The current trend is to create deeper networks with as many layers as possible [17].

While performance is extremely high, running times become important when dealing with real-world applications. New hardware accelerators as well as compression, reduced precision and distillation methods have been exploited to speed up current networks.

In this paper we take an alternative approach and design a network architecture that can very efficiently perform classification, detection and semantic segmentation simultaneously. This is done by incorporating all three tasks into a unified encoder-decoder architecture. We name our approach MultiNet. The encoder consists of the convolution and pooling layers of the VGG network [45] and is shared among all tasks. Those features are then utilized by task-specific decoders, which produce their outputs in real time. In particular, the detection decoder combines the fast regression design introduced in Yolo [38] with the size-adjusting ROI pooling of Fast R-CNN [14], achieving a better speed-accuracy ratio.

We demonstrate the effectiveness of our approach on the challenging KITTI benchmark [13] and show state-of-the-art performance in road segmentation. Importantly, our ROI pooling implementation can significantly improve detection performance without requiring an explicit proposal generation network. This gives our decoder a significant speed advantage compared to Faster R-CNN. Our approach is able to benefit from sharing computations, allowing us to perform inference in less than 100 ms for all tasks.

Figure 1: Our goal: solving street classification, vehicle detection and road segmentation in one forward pass.

Related Work

In this section we review current approaches to the tasks that MultiNet tackles, i.e., detection, classification and semantic segmentation. We focus our attention on deep learning based approaches.

Figure 2: MultiNet architecture.

Translator's note (encoder feature-map sizes):
CNN encoder: input 1248x384x3 -> (CONV 64)x2 -> 1248x384x64 -> MAX_POOL -> 624x192x64 -> (CONV 128)x2 -> 624x192x128 -> MAX_POOL -> 312x96x128 -> (CONV 256)x3 -> 312x96x256 -> MAX_POOL -> 156x48x256 -> (CONV 512)x3 -> 156x48x512 -> MAX_POOL -> 78x24x512 -> (CONV 512)x3 -> 78x24x512
Encoded features: -> MAX_POOL -> 39x12x512

Classification

Classification: After the development of AlexNet [23], most modern approaches to image classification utilize deep learning. Residual networks [17] constitute the state-of-the-art, as they allow training very deep networks without problems of vanishing or exploding gradients. In the context of road classification, deep neural networks are also widely employed [31]. Sensor fusion has also been exploited in this context [43]. In this paper we use classification to guide other semantic tasks, i.e., segmentation and detection.

Detection

Detection: Traditional deep learning approaches to object detection follow a two-step process, where region proposals [25, 20, 19] are first generated and then scored using a convolutional network [15, 40]. Additional performance improvements can be gained by using convolutional neural networks (CNNs) for the proposal generation step [8, 40] or by reasoning in 3D [5, 4]. Recently, several methods have proposed to use a single deep network that is trainable end-to-end to directly perform detection [44, 38, 39, 27]. Their main advantage over proposal-based methods is that they are much faster at both training and inference time, and thus more suitable for real-time detection applications. However, so far they lag far behind in performance. In this paper we propose an end-to-end trainable detector which significantly reduces this performance gap. We argue that the main advantage of proposal-based methods is their ability to have size-adjustable features. This inspired our rezoom layer which, as shown in our experiments, results in large improvements in performance.

Segmentation

Segmentation: Inspired by the successes of deep learning, CNN-based classifiers were adapted to the task of semantic segmentation. Early approaches used the inherent efficiency of CNNs to implement implicit sliding windows [16, 26]. Fully Convolutional Networks (FCNs) were proposed to model semantic segmentation using a deep learning pipeline that is trainable end-to-end. Transposed convolutions [50, 6, 21] are utilized to upsample low resolution features. A variety of deeper flavors of FCNs have been proposed since [1, 34, 41, 36]. Very good results are achieved by combining FCNs with conditional random fields (CRFs) [52, 2, 3]. [52, 42] showed that mean-field inference in the CRF can be cast as a recurrent net, allowing end-to-end training. Dilated convolutions were introduced in [48] to augment the receptive field size without losing resolution. The aforementioned techniques in conjunction with residual networks [17] are currently the state-of-the-art.

Joint Reasoning

Joint Reasoning: Multi-task learning techniques aim at learning better representations by exploiting many tasks. Several approaches have been proposed in the context of CNNs [30, 28], but applications have mainly been focused on face recognition tasks [51, 47, 37]. [18] reasons jointly about classification and segmentation using an SVM in combination with dynamic programming. [46] proposed to use a CRF to solve many tasks including detection, segmentation and scene classification. In the context of deep learning, [7] proposed a model which is able to jointly perform pose estimation and object classification. To our knowledge no unified deep architecture has been proposed to solve segmentation, classification and detection.

MultiNet for Joint Semantic Reasoning

In this paper we propose an efficient and effective feed-forward architecture, which we call MultiNet, to jointly reason about semantic segmentation, image classification and object detection. Our approach shares a common encoder over the three tasks and has three branches, each implementing a decoder for a given task. We refer the reader to Fig. 2 for an illustration of our architecture. MultiNet can be trained end-to-end and joint inference over all tasks can be done in less than 100 ms. We start our discussion by introducing our joint encoder, followed by the task-specific decoders.

The task of the encoder is to process the image and extract rich abstract features [49] that contain all necessary information to perform accurate segmentation, detection and image classification. The encoder of MultiNet consists of the first 13 layers of the VGG16 network [45], which are applied in a fully convolutional manner to the image, producing a tensor of size 39 × 12 × 512. This is the output of the 5th pooling layer, which is called pool5 in the VGG implementation [45].
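For concreteness, here is a minimal PyTorch sketch of such a shared encoder (the released KittiSeg code is written in TensorFlow; this re-implementation, including the module and block names, is purely illustrative). It reproduces the 13 VGG16 convolution layers and the feature-map sizes quoted above for a 1248×384 input.

```python
import torch
import torch.nn as nn

def _vgg_block(in_ch, out_ch, n_convs):
    """n_convs 3x3 conv+ReLU layers followed by 2x2 max pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    layers.append(nn.MaxPool2d(2, 2))
    return nn.Sequential(*layers)

class VGG16Encoder(nn.Module):
    """The 13 convolutional layers of VGG16, applied fully convolutionally.

    For a 1248x384x3 input, the pool5 output ("encoded features") has
    shape 512x12x39, matching the 39x12x512 tensor described above.
    """
    def __init__(self):
        super().__init__()
        self.block1 = _vgg_block(3, 64, 2)      # -> 624x192
        self.block2 = _vgg_block(64, 128, 2)    # -> 312x96
        self.block3 = _vgg_block(128, 256, 3)   # -> 156x48
        self.block4 = _vgg_block(256, 512, 3)   # -> 78x24
        self.block5 = _vgg_block(512, 512, 3)   # -> 39x12

    def forward(self, x):
        feats = {}
        x = self.block1(x)
        x = self.block2(x)
        x = self.block3(x); feats["pool3"] = x   # high-res features for skips / rezoom
        x = self.block4(x); feats["pool4"] = x
        x = self.block5(x); feats["pool5"] = x   # shared encoded features
        return feats

if __name__ == "__main__":
    enc = VGG16Encoder()
    out = enc(torch.zeros(1, 3, 384, 1248))      # NCHW: height 384, width 1248
    print({k: tuple(v.shape) for k, v in out.items()})
```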

Classification Decoder

The classification decoder is designed to take advantage of the encoder. Towards this goal, we apply a 1 × 1 convolution followed by a fully connected layer and a softmax layer to output the final class probabilities.
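A sketch of this decoder is given below; the number of 1 × 1 filters and the number of scene classes are not specified in the text, so `hidden_channels=30` and `num_classes=2` are placeholders.

```python
import torch
import torch.nn as nn

class ClassificationDecoder(nn.Module):
    """1x1 convolution -> fully connected layer -> softmax over scene classes."""
    def __init__(self, in_channels=512, grid_h=12, grid_w=39,
                 hidden_channels=30, num_classes=2):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_channels, hidden_channels, kernel_size=1)
        self.fc = nn.Linear(hidden_channels * grid_h * grid_w, num_classes)

    def forward(self, encoded):                 # encoded: (N, 512, 12, 39)
        x = torch.relu(self.conv1x1(encoded))
        x = x.flatten(start_dim=1)
        logits = self.fc(x)
        return torch.softmax(logits, dim=1)     # final class probabilities
```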

Detection Decoder

FastBox, our detection decoder, is designed as a regression-based detection system. We choose such a decoder over a proposal-based one because it can be trained end-to-end, and both training and inference can be done very efficiently. Our approach is inspired by ReInspect [39], Yolo [38] and Overfeat [44]. In addition to the standard regression pipeline, we include an ROI pooling approach, which allows the network to utilize features at a higher resolution, similar to the much slower Faster R-CNN.

The first step of our decoder is to produce a rough estimate of the bounding boxes. Towards this goal, we first pass the encoded features through a 1 × 1 convolutional layer with 500 filters, producing a tensor of shape 39 × 12 × 500, which we call hidden. This tensor is processed with another 1 × 1 convolutional layer which outputs 6 channels at resolution 39 × 12. We call this tensor prediction; the values of the tensor have a semantic meaning. The first two channels of this tensor form a coarse segmentation of the image. Their values represent the confidence that an object of interest is present at that particular location in the 39 × 12 grid. The last four channels represent the coordinates of a bounding box in the area around that cell. Fig. 3 shows an image with its cells.
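The two 1 × 1 convolutions described above can be written down directly; a hedged PyTorch sketch (layer names are mine):

```python
import torch
import torch.nn as nn

class FastBoxHead(nn.Module):
    """First stage of the detection decoder: a coarse per-cell prediction.

    Channels 0-1 of the output are confidence logits (object vs. background)
    for each cell of the 39x12 grid; channels 2-5 are the box coordinates
    (x, y, w, h) regressed for the area around that cell.
    """
    def __init__(self, in_channels=512, hidden_channels=500):
        super().__init__()
        self.hidden = nn.Conv2d(in_channels, hidden_channels, kernel_size=1)
        self.prediction = nn.Conv2d(hidden_channels, 6, kernel_size=1)

    def forward(self, encoded):                     # (N, 512, 12, 39)
        hidden = torch.relu(self.hidden(encoded))   # (N, 500, 12, 39)
        pred = self.prediction(hidden)              # (N, 6, 12, 39)
        confidences = pred[:, :2]                   # coarse object segmentation
        boxes = pred[:, 2:]                         # (x, y, w, h) per cell
        return hidden, confidences, boxes
```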

Figure 3: Visualization of our label encoding. Blue grid: cells. Red cells: cells containing a car. Grey cells: cells in the don't care area. Green boxes: ground truth boxes.

Such a prediction, however, is not very accurate. In this paper we argue that this is due to the fact that resolution has been lost by the time we arrive at the encoder output. To alleviate this problem we introduce a rezoom layer, which predicts a residual on the locations of the bounding boxes by exploiting high resolution features. This is done by concatenating subsets of higher resolution VGG features (156 × 48) with the hidden features (39 × 12) and applying 1 × 1 convolutions on top of this. In order to make this possible, a 39 × 12 grid needs to be generated out of the high resolution VGG features. This is achieved by applying ROI pooling [40] using the rough estimate provided by the tensor prediction. Finally, this is concatenated with the 39 × 12 × 6 features and passed through a 1 × 1 convolution layer to produce the residuals.
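A simplified sketch of the rezoom idea follows. It assumes torchvision's `roi_pool`, a 3 × 3 pooled window per cell, and that the caller converts each cell's rough (x, y, w, h) estimate into (batch_index, x1, y1, x2, y2) boxes in the coordinate frame of the high-resolution feature map; the exact feature subset and pooling size used by the authors are not specified, so these are assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_pool

class RezoomLayer(nn.Module):
    """Refines the coarse boxes with a residual predicted from ROI-pooled
    high-resolution encoder features (pool3 in the encoder sketch above)."""
    def __init__(self, hires_channels=256, hidden_channels=500, pool_size=3):
        super().__init__()
        self.pool_size = pool_size
        in_ch = hidden_channels + 6 + hires_channels * pool_size * pool_size
        self.residual = nn.Conv2d(in_ch, 4, kernel_size=1)   # residual on (x, y, w, h)

    def forward(self, hidden, pred, hires, rough_boxes_xyxy):
        # hidden: (N, 500, 12, 39); pred: (N, 6, 12, 39); hires: (N, 256, 48, 156)
        # rough_boxes_xyxy: float tensor of shape (N*12*39, 5), rows of
        # (batch_idx, x1, y1, x2, y2) in hires coordinates, one box per grid
        # cell, ordered row-major per image.
        n, _, gh, gw = hidden.shape
        pooled = roi_pool(hires, rough_boxes_xyxy,
                          output_size=(self.pool_size, self.pool_size))
        pooled = pooled.reshape(n, gh, gw, -1).permute(0, 3, 1, 2)  # back onto the grid
        fused = torch.cat([hidden, pred, pooled], dim=1)
        return self.residual(fused)                                 # (N, 4, 12, 39)
```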

Segmentation Decoder

The segmentation decoder follows the FCN architecture [29]. Given the encoder, we transform the remaining fully-connected (FC) layers of the VGG architecture into 1 × 1 convolutional layers to produce a low resolution segmentation of size 39 × 12. This is followed by three transposed convolution layers [6, 21] to perform up-sampling. Skip layers are utilized to extract high resolution features from the lower layers. Those features are first processed by a 1 × 1 convolution layer and then added to the partially upsampled results.
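A compact FCN-style sketch of this decoder is shown below. It consumes the pool3/pool4/pool5 feature maps returned by the encoder sketch above; for brevity it scores pool5 directly with a 1 × 1 convolution instead of first converting the VGG fc6/fc7 layers, and the transposed-convolution kernel sizes are my own choice.

```python
import torch.nn as nn

class SegmentationDecoder(nn.Module):
    """FCN-style decoder: 1x1 scoring convolutions, three transposed
    convolutions for upsampling, and additive skip connections."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.score_pool5 = nn.Conv2d(512, num_classes, 1)   # 39x12 coarse scores
        self.score_pool4 = nn.Conv2d(512, num_classes, 1)   # skip from pool4 (78x24)
        self.score_pool3 = nn.Conv2d(256, num_classes, 1)   # skip from pool3 (156x48)
        self.up2a = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up2b = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
        self.up8 = nn.ConvTranspose2d(num_classes, num_classes, 16, stride=8, padding=4)

    def forward(self, feats):
        x = self.score_pool5(feats["pool5"])                  # (N, C, 12, 39)
        x = self.up2a(x) + self.score_pool4(feats["pool4"])   # (N, C, 24, 78)
        x = self.up2b(x) + self.score_pool3(feats["pool3"])   # (N, C, 48, 156)
        return self.up8(x)                                    # (N, C, 384, 1248)
```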

Training Details

In this section we describe the loss functions we employ as well as other details of our training procedure, including initialization.

Label encoding

We use one-hot encoding for classification and segmentation. For detection, we assign a positive confidence to a cell if and only if it intersects with at least one bounding box. We parameterize the bounding box by the x and y coordinates of its center and the width w and height h of the box. Note that this encoding is much simpler than that of Faster R-CNN or ReInspect.
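The cell-level target construction can be illustrated with a small NumPy sketch. The 32-pixel cell size follows from the encoder stride; the choice to store unnormalized (xc, yc, w, h) per cell and to mark every intersected cell is an assumption about details the text leaves open.

```python
import numpy as np

def encode_detection_labels(gt_boxes, grid_w=39, grid_h=12, cell=32):
    """Encode ground-truth boxes onto the 39x12 prediction grid.

    gt_boxes: iterable of (x1, y1, x2, y2) in image pixels (cell = 32 px,
    the encoder's total stride). A cell gets confidence 1 iff it intersects
    at least one box; it also stores that box as (xc, yc, w, h) in pixels.
    """
    confidence = np.zeros((grid_h, grid_w), dtype=np.float32)
    boxes = np.zeros((grid_h, grid_w, 4), dtype=np.float32)
    for x1, y1, x2, y2 in gt_boxes:
        # range of grid cells the box overlaps
        cx0, cx1 = int(x1 // cell), min(int(x2 // cell), grid_w - 1)
        cy0, cy1 = int(y1 // cell), min(int(y2 // cell), grid_h - 1)
        target = [(x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1]
        for cy in range(cy0, cy1 + 1):
            for cx in range(cx0, cx1 + 1):
                confidence[cy, cx] = 1.0
                boxes[cy, cx] = target
    return confidence, boxes

# example: one car spanning pixels (600, 180) to (730, 260)
conf, box = encode_detection_labels([(600, 180, 730, 260)])
print(conf.sum(), box[6, 19])   # number of positive cells and the stored (xc, yc, w, h)
```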

Loss Functions

We define our loss function as the sum of the loss functions for classification, segmentation and detection. We employ cross-entropy as the loss function for the classification and segmentation branches, which is defined as

$$\mathrm{loss}_{\mathrm{class}}(p, q) := -\frac{1}{|I|} \sum_{i \in I} \sum_{c \in C} q_i(c) \log p_i(c) \qquad (1)$$

where p is the prediction, q the ground truth and C the set of classes. We use the sum of two losses for detection: a cross-entropy loss for the confidences and an L1 loss on the bounding box coordinates. Note that the L1 loss is only computed for cells which have been assigned a positive confidence label. Thus

$$\mathrm{loss}_{\mathrm{box}}(p, q) := \frac{1}{|I|} \sum_{i \in I} \delta_{q_i} \cdot \left( |x_{p_i} - x_{q_i}| + |y_{p_i} - y_{q_i}| + |w_{p_i} - w_{q_i}| + |h_{p_i} - h_{q_i}| \right) \qquad (2)$$

where p is the prediction, q the ground truth, C the set of classes and I is the set of examples in the mini batch.
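Written out in code, the combined objective from Eqs. (1)-(2) looks roughly as follows (a PyTorch sketch; the normalization of the box term over all cells and the use of logits rather than probabilities are my reading of the equations, not a statement about the released implementation):

```python
import torch.nn.functional as F

def detection_loss(conf_logits, box_pred, conf_gt, box_gt):
    """Cross-entropy on per-cell confidences plus the masked L1 term of Eq. (2).

    conf_logits: (N, 2, H, W); box_pred / box_gt: (N, 4, H, W);
    conf_gt: (N, H, W) with entries in {0, 1}.
    """
    conf_loss = F.cross_entropy(conf_logits, conf_gt.long())
    mask = conf_gt.unsqueeze(1).float()            # delta_{q_i}: only positive cells count
    box_loss = (mask * (box_pred - box_gt).abs()).sum() / conf_gt.numel()
    return conf_loss + box_loss

def multinet_loss(class_logits, class_gt, seg_logits, seg_gt,
                  conf_logits, box_pred, conf_gt, box_gt):
    """Total loss: sum of the classification, segmentation and detection terms."""
    loss_class = F.cross_entropy(class_logits, class_gt)    # Eq. (1)
    loss_seg = F.cross_entropy(seg_logits, seg_gt)          # per-pixel cross-entropy
    return loss_class + loss_seg + detection_loss(conf_logits, box_pred,
                                                  conf_gt, box_gt)
```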

Combined Training Strategy

Joint training is performed by merging the gradients computed by each loss on independent mini batches. This allows us to train each of the three decoders with its own set of training parameters. During gradient merging all losses are weighted equally. In addition, we observe that the detection network requires more steps to be trained than the other tasks. We thus sample our mini batches such that we alternate an update using all loss functions with two updates that only utilize the detection loss.
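The alternating schedule can be expressed as a short training loop. This is only a schematic: `sample_batch` and the three per-task loss helpers are hypothetical stand-ins for the data pipeline and for loss functions like the ones sketched above.

```python
import itertools

def train(multinet, optimizer, data, num_steps,
          classification_loss, segmentation_loss, detection_loss, sample_batch):
    """One update with all losses followed by two detection-only updates, repeated."""
    schedule = itertools.cycle(["all", "det", "det"])
    for _, mode in zip(range(num_steps), schedule):
        optimizer.zero_grad()
        if mode == "all":
            # independent mini batches per task; losses are weighted equally,
            # so their gradients on the shared encoder simply add up
            loss = (classification_loss(multinet, sample_batch(data, "class"))
                    + segmentation_loss(multinet, sample_batch(data, "seg"))
                    + detection_loss(multinet, sample_batch(data, "det")))
        else:
            loss = detection_loss(multinet, sample_batch(data, "det"))
        loss.backward()
        optimizer.step()
```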

Initialization

The encoder is initialized using VGG weights pretrained on ImageNet. The detection and classification decoder weights are randomly initialized using a uniform distribution in the range (−0.1, 0.1). The convolutional layers of the segmentation decoder are also initialized using VGG weights and the transposed convolution layers are initialized to perform bilinear upsampling. The skip connections, on the other hand, are initialized randomly with very small weights (i.e., a standard deviation of 1e-4). This allows us to perform training in one step (as opposed to the two-step procedure of [29]).
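These initialization rules translate into a few lines of PyTorch. The bilinear kernel construction is the standard one used for FCN-style upsampling; the assumption below is that the segmentation decoder's ordinary 1 × 1 convolutions are its skip-connection scoring layers, and loading the pretrained VGG weights into the encoder is done separately and not shown.

```python
import torch
import torch.nn as nn

def bilinear_kernel(channels, kernel_size):
    """Weights that make a transposed convolution perform bilinear upsampling."""
    factor = (kernel_size + 1) // 2
    center = factor - 1 if kernel_size % 2 == 1 else factor - 0.5
    og = torch.arange(kernel_size, dtype=torch.float32)
    filt = 1 - (og - center).abs() / factor
    kernel = filt[:, None] * filt[None, :]
    weight = torch.zeros(channels, channels, kernel_size, kernel_size)
    for c in range(channels):
        weight[c, c] = kernel                      # each channel upsamples itself
    return weight

def init_decoders(detection_decoder, classification_decoder, seg_decoder):
    # detection / classification decoders: uniform weights in (-0.1, 0.1)
    for m in list(detection_decoder.modules()) + list(classification_decoder.modules()):
        if isinstance(m, (nn.Conv2d, nn.Linear)):
            nn.init.uniform_(m.weight, -0.1, 0.1)
            nn.init.zeros_(m.bias)
    for m in seg_decoder.modules():
        if isinstance(m, nn.ConvTranspose2d):      # bilinear upsampling init
            m.weight.data.copy_(bilinear_kernel(m.out_channels, m.kernel_size[0]))
        elif isinstance(m, nn.Conv2d):             # skip-connection scoring layers
            nn.init.normal_(m.weight, std=1e-4)
            nn.init.zeros_(m.bias)
```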

Optimizer and regularization

We use the Adam optimizer [22] with a learning rate of 1e-5 to train our MultiNet. A weight decay of 5e-4 is applied to all layers and dropout with probability 0.5 is applied to all (inner) 1 × 1 convolutions in the decoders.
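The optimizer settings amount to a one-liner; `model` stands for the full MultiNet module.

```python
import torch
import torch.nn as nn

def make_optimizer(model: nn.Module) -> torch.optim.Optimizer:
    """Adam with the hyper-parameters reported above: lr 1e-5, weight decay 5e-4."""
    return torch.optim.Adam(model.parameters(), lr=1e-5, weight_decay=5e-4)

# Dropout with p = 0.5 on the decoders' inner 1x1 convolutions would be realized
# by placing nn.Dropout2d(0.5) directly after those convolution layers.
```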

Experimental Results

In this section we perform our experimental evaluation on the challenging KITTI dataset.

| Experiment | max steps | eval steps [k] |
| --- | --- | --- |
| Segmentation | 16,000 | 100 |
| Classification | 18,000 | 200 |
| Detection | 180,000 | 1,000 |
| United | 200,000 | 1,000 |

Table 1: Summary of training length.

Dataset

We evaluate MultiNet on the KITTI Vision Benchmark Suite [12]. The benchmark contains images showing a variety of street situations captured from a moving platform driving around the city of Karlsruhe. In addition to the raw data, KITTI comes with a number of labels for different tasks relevant to autonomous driving. We use the road benchmark of [10] to evaluate the performance of our semantic segmentation decoder and the object detection benchmark [13] for the detection decoder. We exploit the automatically generated labels of [31], which provide us with road labels generated by combining GPS information with open-street-map data.

Detection performance is measured using the average precision score [9]. For evaluation, objects are divided into three categories: easy, moderate and hard to detect. Segmentation performance is measured using the MaxF1 score [10]. In addition, the average precision score is given for reference. Classification performance is evaluated by computing accuracy and precision-recall plots.

Performance evaluation

Our evaluation is performed in two steps. First we build three individual models consisting of the VGG encoder and the decoder corresponding to the task. Those models are tuned to achieve the highest possible performance on the given task. In a second step MultiNet is trained using one encoder and three decoders in a single network. We evaluate both settings in our experimental evaluation. We report a set of plots depicting the convergence properties of our networks in Figs. 4, 6 and 8. Evaluation on the validation set is performed every k iterations during training, where k for each task is given in Table 1. To reduce the variance in the plots, the output is smoothed by computing the median over the last 50 evaluations performed.
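The smoothing used for the plots is a running median; a small sketch:

```python
import numpy as np

def smooth_scores(scores, window=50):
    """Median over the last `window` evaluations, as used for Figs. 4, 6 and 8."""
    scores = np.asarray(scores, dtype=np.float64)
    return np.array([np.median(scores[max(0, i - window + 1):i + 1])
                     for i in range(len(scores))])

# toy example with a window of 3
print(smooth_scores([0.90, 0.95, 0.80, 0.97], window=3))  # [0.9, 0.925, 0.9, 0.95]
```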

Segmentation

Figure 4: Convergence behavior of the segmentation decoder.

Our segmentation decoder is trained using the KITTI Road Benchmark [10]. This dataset is very small, providing only 289 training images. Thus the network has to transfer as much knowledge as possible from pre-training. Note that the skip connections are the only layers which are randomly initialized and thus need to be trained from scratch. This transfer learning approach leads to very fast convergence of the network. As shown in Fig. 4, the raw scores already reach values of about 95% after only about 4,000 iterations. Training is conducted for 16,000 iterations to obtain a meaningful median score.

| Metric | Result |
| --- | --- |
| MaxF1 | 95.83% |
| Average Precision | 92.29% |
| Speed (msec) | 94.6 ms |
| Speed (fps) | 10.6 Hz |

Table 2: Validation performance of the segmentation decoder.

| Method | MaxF1 | AP | Place |
| --- | --- | --- | --- |
| FCN LC [32] | 90.79% | 85.83% | 5th |
| FTP [24] | 91.61% | 90.96% | 4th |
| DDN [33] | 93.43% | 89.67% | 3rd |
| Up Conv Poly [35] | 93.83% | 90.47% | 2nd |
| MultiNet | 94.88% | 93.71% | 1st |

Table 3: Summary of the URBAN ROAD scores on the public KITTI Road Detection Leaderboard [11].

Figure 5: Visualization of the segmentation output. Top rows: soft segmentation output as a red-blue plot; the intensity of the plot reflects the confidence. Bottom rows: hard class labels.

Table 2 shows the scores of our segmentation decoder after 16,000 iterations. The scores indicate that our segmentation decoder generalizes very well using only the data given by the KITTI Road Benchmark. No other segmentation dataset was utilized. As shown in Fig. 5, our approach is very effective at segmenting roads. Even difficult areas, corresponding to sidewalks and buildings, are segmented correctly. In the confidence plots shown in the top two rows of Fig. 5, it can be seen that our approach has confidence close to 0.5 at the edges of the street. This is due to the slight variation in the labels of the training set. We have submitted the results of our approach on the test set to the KITTI road leaderboard. As shown in Table 3, our result achieves first place.

Detection

Figure 6: Validation scores of the detection decoder. The performance of FastBox with and without the rezoom layer is shown for comparison.

Our detection decoder is trained and evaluated on the data provided by the KITTI object benchmark [13]. Fig. 6 shows the convergence rate of the validation scores. The detection decoder converges much more slowly than the segmentation and classification decoders. We therefore train the decoder up to iteration 180,000.

| Task: Metric | moderate | easy | hard |
| --- | --- | --- | --- |
| FastBox with rezoom | 83.35% | 92.80% | 67.59% |
| FastBox no rezoom | 77.00% | 86.45% | 60.82% |

Table 4: Detection performance of FastBox.

| | FastBox | FastBox (no rezoom) |
| --- | --- | --- |
| speed [msec] | 37.49 ms | |
| speed [fps] | 26.67 Hz | |
| post-processing | 2.10 ms | |

Table 5: Detection speed of FastBox. Results are measured on a Pascal Titan X.

Figure 7: Visualization of the detection output, with and without non-maximum suppression applied.

FastBox can perform evaluation at very high speed: an inference step takes 37.49 ms per image. This makes FastBox particularly suitable for real-time applications. Our results further indicate that the computational overhead of the rezoom layer is negligible (see Table 5). The performance boost of the rezoom layer, on the other hand, is quite substantial (see Table 4), justifying the use of a rezoom layer in the final model. Qualitative results with and without non-maximum suppression are shown in Fig. 7.
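The non-maximum suppression step used to post-process the per-cell boxes is standard; a plain NumPy version for reference (the 0.5 overlap threshold is a typical default, not a value taken from the paper):

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def non_max_suppression(boxes, scores, threshold=0.5):
    """Keep the highest-scoring box and drop overlapping lower-scoring ones."""
    boxes = np.asarray(boxes, dtype=np.float64)
    order = np.argsort(np.asarray(scores))[::-1]
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < threshold for j in keep):
            keep.append(i)
    return keep
```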

MultiNet

| Task: Metric | separate | 2 losses | 3 losses |
| --- | --- | --- | --- |
| Segmentation: MaxF1 | 95.83% | 94.98% | 95.13% |
| Detection: Moderate | 83.35% | 83.91% | 84.39% |
| Classification: Accuracy | 92.65% | - | 94.38% |

Table 6: MultiNet performance: comparison between united and separate evaluation on the validation set.

Figure 8: MultiNet: comparison of joint and separate training.

We have experimented with two versions of MultiNet. The first version is trained using two decoders (detection and segmentation), while the second version is trained with all three decoders. Training with additional decoders significantly lowers the convergence speed of all decoders. When training with all three decoders, segmentation takes more than 30,000 and detection more than 150,000 iterations to converge, as shown in Fig. 8. Fig. 8 and Table 6 also show that our combined training does not harm performance. On the contrary, the detection and classification tasks benefit slightly when jointly trained. This effect can be explained by transfer learning between tasks: relevant features learned from one task can be utilized in a different task.

| | MultiNet | Segmentation | Detection | Classification |
| --- | --- | --- | --- | --- |
| speed [msec] | 98.10 ms | 94.6 ms | 37.5 ms | 35.94 ms |
| speed [fps] | 10.2 Hz | 10.6 Hz | 27.7 Hz | 27.8 Hz |

Table 7: MultiNet inference speed: comparison between united and separate evaluation.

MultiNet is particularly suited for real-time applications. As shown in Table 7, computational complexity benefits significantly from the shared architecture. Overall, MultiNet is able to solve all three tasks together in real time.

Figure 9: Visualization of the MultiNet output.

Conclusion

In this paper we have developed a unified deep architecture which is able to jointly reason about classification, detection and semantic segmentation. Our approach is very simple, can be trained end-to-end and performs extremely well on the challenging KITTI benchmark, outperforming the state-of-the-art in the road segmentation task. Our approach is also very efficient, taking 98.10 ms to perform all tasks. In the future we plan to exploit compression methods in order to further reduce the computational bottleneck and energy consumption of MultiNet.

Acknowledgements: This work was partially supported by Begabtenstiftung Informatik Karlsruhe, ONR-N00014-14-1-0232, Qualcomm, Samsung, NVIDIA, Google, EPSRC and NSERC. We are thankful to Thomas Roddick for proofreading the paper.

References

[1] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. CoRR, abs/1511.00561, 2015. 2
[2] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Semantic image segmentation with deep convolutional nets and fully connected crfs. CoRR, abs/1412.7062, 2014. 2
[3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. CoRR, abs/1606.00915, 2016. 2
[4] X. Chen, K. Kundu, Z. Zhang, H. Ma, S. Fidler, and R. Urtasun. Monocular 3d object detection for autonomous driving. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2147–2156, 2016. 2
[5] X. Chen, K. Kundu, Y. Zhu, A. G. Berneshawi, H. Ma, S. Fidler, and R. Urtasun. 3d object proposals for accurate object class detection. In Advances in Neural Information Processing Systems, pages 424–432, 2015. 2
[6] V. Dumoulin and F. Visin. A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285, 2016. 2, 3
[7] M. Elhoseiny, T. El-Gaaly, A. Bakry, and A. M. Elgammal. Convolutional models for joint object categorization and pose estimation. CoRR, abs/1511.05175, 2015. 3
[8] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov. Scalable object detection using deep neural networks. CoRR, abs/1312.2249, 2013. 2
[9] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results.
http://www.pascalnetwork.org/challenges/VOC/voc2012/workshop/index.html. 4
[10] J. Fritsch, T. Kuehnl, and A. Geiger. A new performance measure and evaluation benchmark for road detection algorithms. In International Conference on Intelligent Transportation Systems (ITSC), 2013. 4, 5
[11] A. Geiger. Kitti road public benchmark, 2013. 5
[12] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. International Journal of Robotics Research (IJRR), 2013. 4
[13] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012. 1, 4, 5
[14] R. B. Girshick. Fast R-CNN. CoRR, abs/1504.08083, 2015. 1
[15] R. B. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CoRR, abs/1311.2524, 2013. 2
[16] A. Giusti, D. C. Ciresan, J. Masci, L. M. Gambardella, and J. Schmidhuber. Fast image scanning with deep max-pooling convolutional neural networks. CoRR, abs/1302.1700, 2013. 2
[17] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. 1, 2
[18] M. Hoai, Z.-Z. Lan, and F. De la Torre. Joint segmentation and classification of human actions in video. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 3265–3272. IEEE, 2011. 3
[19] J. H. Hosang, R. Benenson, P. Dollár, and B. Schiele. What makes for effective detection proposals? CoRR, abs/1502.05082, 2015. 2
[20] J. H. Hosang, R. Benenson, and B. Schiele. How good are detection proposals, really? CoRR, abs/1406.6962, 2014. 2
[21] D. J. Im, C. D. Kim, H. Jiang, and R. Memisevic. Generating images with recurrent adversarial networks. CoRR, abs/1602.05110, 2016. 2, 3
[22] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. 4
[23] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012. 1, 2
[24] A. Laddha, M. K. Kocamaz, L. E. Navarro-Serment, and M. Hebert. Map-supervised road detection. In 2016 IEEE Intelligent Vehicles Symposium (IV), pages 118–123, June 2016. 5
[25] C. H. Lampert, M. B. Blaschko, and T. Hofmann. Beyond sliding windows: Object localization by efficient subwindow search. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on, pages 1–8. IEEE, 2008. 2
[26] H. Li, R. Zhao, and X. Wang. Highly efficient forward and backward propagation of convolutional neural networks for pixelwise classification. CoRR, abs/1412.4526, 2014. 2
[27] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, and S. E. Reed. SSD: single shot multibox detector. CoRR, abs/1512.02325, 2015. 2
[28] X. Liu, J. Gao, X. He, L. Deng, K. Duh, and Y.-Y. Wang. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In Proc. NAACL, 2015. 3
[29] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. CVPR (to appear), Nov. 2015. 3, 4
[30] M. Long and J. Wang. Learning multiple tasks with deep relationship networks. CoRR, abs/1506.02117, 2015. 3
[31] W.-C. Ma, S. Wang, M. A. Brubaker, S. Fidler, and R. Urtasun. Find your way by observing the sun and other semantic cues. arXiv preprint arXiv:1606.07415, 2016. 2, 4
[32] C. C. T. Mendes, V. Frémont, and D. F. Wolf. Exploiting fully convolutional neural networks for fast road detection. In IEEE Conference on Robotics and Automation (ICRA), May 2016. 5
[33] R. Mohan. Deep deconvolutional networks for scene parsing, 2014. 5
[34] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. 2015. 2
[35] G. Oliveira, W. Burgard, and T. Brox. Efficient deep methods for monocular road segmentation. 2016. 5
[36] G. Papandreou, L. Chen, K. Murphy, and A. L. Yuille. Weakly- and semi-supervised learning of a DCNN for semantic image segmentation. CoRR, abs/1502.02734, 2015. 2
[37] R. Ranjan, V. M. Patel, and R. Chellappa. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. CoRR, abs/1603.01249, 2016. 3
[38] J. Redmon, S. K. Divvala, R. B. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. CoRR, abs/1506.02640, 2015. 1, 2, 3
[39] M. Ren and R. S. Zemel. End-to-end instance segmentation and counting with recurrent attention. CoRR, abs/1605.09410, 2016. 2, 3
[40] S. Ren, K. He, R. B. Girshick, and J. Sun. Faster R-CNN: towards real-time object detection with region proposal networks. CoRR, abs/1506.01497, 2015. 2, 3
[41] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. CoRR, abs/1505.04597, 2015. 2
[42] A. G. Schwing and R. Urtasun. Fully connected deep structured networks. CoRR, abs/1503.02351, 2015. 2
[43] C. Seeger, A. Müller, L. Schwarz, and M. Manz. Towards road type classification with occupancy grids. IVS Workshop, 2016. 2
[44] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229, 2013. 2, 3
[45] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014. 1, 3
[46] J. Yao, S. Fidler, and R. Urtasun. Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pages 702–709. IEEE, 2012. 3
[47] J. Yim, H. Jung, B. Yoo, C. Choi, D. Park, and J. Kim. Rotating your face using multi-task deep neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 676–684, 2015. 3
[48] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2015. 2
[49] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014. 3
[50] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus. Deconvolutional networks. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2528–2535. IEEE, 2010. 2
[51] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In European Conference on Computer Vision, pages 94–108. Springer, 2014. 3
[52] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr. Conditional random fields as recurrent neural networks. CoRR, abs/1502.03240, 2015. 2
