EfficientDet paper: "Scalable and Efficient Object Detection", translation and commentary

Overview: On November 21, 2019, the Google Brain team released the paper EfficientDet: Scalable and Efficient Object Detection.
The paper was posted to arXiv by three well-known AutoML researchers on the Google Brain team, Mingxing Tan, Ruoming Pang, and Quoc V. Le; some readers speculate it was submitted to CVPR 2020. By improving the multi-scale feature fusion structure of FPN and borrowing the model scaling method of EfficientNet, the authors propose EfficientDet, a scalable and efficient object detection algorithm.
This work can be viewed as an extension of EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (an ICML 2019 oral paper), carrying the idea over from image classification to object detection.
As Figure 1 shows, there is a trade-off between a network's FLOPS and its mAP, and the right operating point depends on the deployment scenario: along the curve from EfficientDet-D1 to EfficientDet-D7, the computational cost (FLOPS) grows and inference gets slower, while mAP steadily improves.

Contents

Scalable and Efficient Object Detection: Translation and Commentary

Abstract

1. Introduction

2. Related Work

3. BiFPN

3.1. Problem Formulation

3.2. Cross-Scale Connections

3.3. Weighted Feature Fusion

Scalable and Efficient Object Detection: Translation and Commentary

Paper: https://arxiv.org/pdf/1911.09070.pdf
Authors: Mingxing Tan, Ruoming Pang, Quoc V. Le (Google Research, Brain Team) {tanmingxing, rpang, qvl}@google.com

Abstract

Model efficiency has become increasingly important in computer vision. In this paper, we systematically study various neural network architecture design choices for object detection and propose several key optimizations to improve efficiency. First, we propose a weighted bi-directional feature pyramid network (BiFPN), which allows easy and fast multi-scale feature fusion; Second, we propose a compound scaling method that uniformly scales the resolution, depth, and width for all backbone, feature network, and box/class prediction networks at the same time. Based on these optimizations, we have developed a new family of object detectors, called EfficientDet, which consistently achieve an order-of-magnitude better efficiency than prior art across a wide spectrum of resource constraints. In particular, without bells and whistles, our EfficientDet-D7 achieves state-of-the-art 51.0 mAP on COCO dataset with 52M parameters and 326B FLOPS, being 4x smaller and using 9.3x fewer FLOPS yet still more accurate (+0.3% mAP) than the best previous detector.

1. Introduction

Figure 1: Model FLOPS vs COCO accuracy – All numbers are for single-model single-scale. Our EfficientDet achieves much better accuracy with fewer computations than other detectors. In particular, EfficientDet-D7 achieves new state-of-the-art 51.0% COCO mAP with 4x fewer parameters and 9.3x fewer FLOPS. Details are in Table 2.

Tremendous progress has been made in recent years towards more accurate object detection; meanwhile, state-of-the-art object detectors also become increasingly more expensive. For example, the latest AmoebaNet-based NAS-FPN detector [37] requires 167M parameters and 3045B FLOPS (30x more than RetinaNet [17]) to achieve state-of-the-art accuracy. The large model sizes and expensive computation costs deter their deployment in many real-world applications such as robotics and self-driving cars where model size and latency are highly constrained. Given these real-world resource constraints, model efficiency becomes increasingly important for object detection.
There have been many previous works aiming to develop more efficient detector architectures, such as one-stage [20, 25, 26, 17] and anchor-free detectors [14, 36, 32], or compress existing models [21, 22]. Although these methods tend to achieve better efficiency, they usually sacrifice accuracy. Moreover, most previous works only focus on a specific or a small range of resource requirements, but the variety of real-world applications, from mobile devices to datacenters, often demand different resource constraints.
A natural question is: Is it possible to build a scalable detection architecture with both higher accuracy and better efficiency across a wide spectrum of resource constraints (e.g., from 3B to 300B FLOPS)? This paper aims to tackle this problem by systematically studying various design choices of detector architectures. Based on the one-stage detector paradigm, we examine the design choices for backbone, feature fusion, and class/box network, and identify two main challenges:
  • Challenge 1: efficient multi-scale feature fusion – Since introduced in [16], FPN has been widely used for multi-scale feature fusion. Recently, PANet [19], NAS-FPN [5], and other studies [13, 12, 34] have developed more network structures for cross-scale feature fusion. While fusing different input features, most previous works simply sum them up without distinction; however, since these different input features are at different resolutions, we observe they usually contribute to the fused output feature unequally. To address this issue, we propose a simple yet highly effective weighted bi-directional feature pyramid network (BiFPN), which introduces learnable weights to learn the importance of different input features, while repeatedly applying top-down and bottom-up multi-scale feature fusion.
  • Challenge 2: model scaling – While previous works mainly rely on bigger backbone networks [17, 27, 26, 5] or larger input image sizes [8, 37] for higher accuracy, we observe that scaling up feature network and box/class prediction network is also critical when taking into account both accuracy and efficiency. Inspired by recent works [31], we propose a compound scaling method for object detectors, which jointly scales up the resolution/depth/width for all backbone, feature network, box/class prediction network (see the sketch after this list).
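To make the compound scaling idea in Challenge 2 concrete, here is a minimal Python sketch that derives a jointly scaled configuration from a single coefficient phi. The growth factors follow the heuristic configuration the paper reports later (its Section 4.2); the function name is invented for illustration, and the largest models in the paper deviate slightly from the pure formulas (D7, for example, uses a larger input resolution than the formula gives).

```python
# Minimal sketch of compound scaling: one coefficient phi jointly grows input
# resolution, BiFPN width/depth, and box/class head depth. Growth factors are
# taken from the paper's heuristic (Section 4.2) and are illustrative only.
def efficientdet_config(phi: int) -> dict:
    """Return a jointly scaled configuration for EfficientDet-D{phi}."""
    return {
        "input_resolution": 512 + 128 * phi,                # R_input = 512 + 128 * phi
        "bifpn_channels": int(round(64 * (1.35 ** phi))),   # W_bifpn = 64 * 1.35^phi (paper rounds further)
        "bifpn_layers": 3 + phi,                            # D_bifpn = 3 + phi
        "box_class_layers": 3 + phi // 3,                   # D_box = D_class = 3 + floor(phi / 3)
        "backbone": f"EfficientNet-B{phi}",                 # backbone grows with the same phi (D7 reuses B6 in the paper)
    }

if __name__ == "__main__":
    for phi in range(8):  # D0 .. D7
        print(f"D{phi}:", efficientdet_config(phi))
```

Scaling every part of the detector with one knob is what lets the D0-D7 family cover the wide FLOPS range (roughly 3B to 300B) mentioned above.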

Finally, we also observe that the recently introduced EfficientNets [31] achieve better efficiency than previous commonly used backbones (e.g., ResNets [9], ResNeXt [33], and AmoebaNet [24]). Combining EfficientNet backbones with our proposed BiFPN and compound scaling, we have developed a new family of object detectors, named EfficientDet, which consistently achieve better accuracy with an order-of-magnitude fewer parameters and FLOPS than previous object detectors. Figure 1 and Figure 4 show the performance comparison on COCO dataset [18]. Under similar accuracy constraint, our EfficientDet uses 28x fewer FLOPS than YOLOv3 [26], 30x fewer FLOPS than RetinaNet [17], and 19x fewer FLOPS than the recent NAS-FPN [5]. In particular, with single-model and single test-time scale, our EfficientDet-D7 achieves state-of-the-art 51.0 mAP with 52M parameters and 326B FLOPS, being 4x smaller and using 9.3x fewer FLOPS yet still more accurate (+0.3% mAP) than the best previous models [37]. Our EfficientDet models are also up to 3.2x faster on GPU and 8.1x faster on CPU than previous detectors, as shown in Figure 4 and Table 2.

Our contributions can be summarized as:

• We proposed BiFPN, a weighted bidirectional feature network for easy and fast multi-scale feature fusion.
• We proposed a new compound scaling method, which jointly scales up backbone, feature network, box/class network, and resolution, in a principled way.
• Based on BiFPN and compound scaling, we developed EfficientDet, a new family of detectors with significantly better accuracy and efficiency across a wide spectrum of resource constraints.

2. Related Work

One-Stage Detectors: Existing object detectors are mostly categorized by whether they have a region-of-interest proposal step (two-stage [6, 27, 3, 8]) or not (one-stage [28, 20, 25, 17]). While two-stage detectors tend to be more flexible and more accurate, one-stage detectors are often considered to be simpler and more efficient by leveraging predefined anchors [11]. Recently, one-stage detectors have attracted substantial attention due to their efficiency and simplicity [14, 34, 36]. In this paper, we mainly follow the one-stage detector design, and we show it is possible to achieve both better efficiency and higher accuracy with optimized network architectures.
Multi-Scale Feature Representations: One of the main difficulties in object detection is to effectively represent and process multi-scale features. Earlier detectors often directly perform predictions based on the pyramidal feature hierarchy extracted from backbone networks [2, 20, 28]. As one of the pioneering works, feature pyramid network (FPN) [16] proposes a top-down pathway to combine multi-scale features. Following this idea, PANet [19] adds an extra bottom-up path aggregation network on top of FPN; STDL [35] proposes a scale-transfer module to exploit cross-scale features; M2det [34] proposes a U-shape module to fuse multi-scale features, and G-FRNet [1] introduces gate units for controlling information flow across features. More recently, NAS-FPN [5] leverages neural architecture search to automatically design feature network topology. Although it achieves better performance, NAS-FPN requires thousands of GPU hours during search, and the resulting feature network is irregular and thus difficult to interpret. In this paper, we aim to optimize multi-scale feature fusion with a more intuitive and principled way.  
Model Scaling: In order to obtain better accuracy, it is common to scale up a baseline detector by employing bigger backbone networks (e.g., from mobile-size models [30, 10] and ResNet [9], to ResNeXt [33] and AmoebaNet [24]), or increasing input image size (e.g., from 512x512 [17] to 1536x1536 [37]). Some recent works [5, 37] show that increasing the channel size and repeating feature networks can also lead to higher accuracy. These scaling methods mostly focus on single or limited scaling dimensions. Recently, [31] demonstrates remarkable model efficiency for image classification by jointly scaling up network width, depth, and resolution. Our proposed compound scaling method for object detection is mostly inspired by [31].  


3. BiFPN

In this section, we first formulate the multi-scale feature fusion problem, and then introduce the two main ideas for our proposed BiFPN: efficient bidirectional cross-scale connections and weighted feature fusion.  

Figure 2: Feature network design – (a) FPN [16] introduces a top-down pathway to fuse multi-scale features from level 3 to 7 (P3 - P7); (b) PANet [19] adds an additional bottom-up pathway on top of FPN; (c) NAS-FPN [5] uses neural architecture search to find an irregular feature network topology; (d)-(f) are three alternatives studied in this paper. (d) adds expensive connections from all input features to output features; (e) simplifies PANet by removing nodes if they only have one input edge; (f) is our BiFPN with better accuracy and efficiency trade-offs.

3.1. Problem Formulation

Multi-scale feature fusion aims to aggregate features at different resolutions. Formally, given a list of multi-scale features $\vec{P}^{in} = (P^{in}_{l_1}, P^{in}_{l_2}, \ldots)$, where $P^{in}_{l_i}$ represents the feature at level $l_i$, our goal is to find a transformation $f$ that can effectively aggregate different features and output a list of new features: $\vec{P}^{out} = f(\vec{P}^{in})$. As a concrete example, Figure 2(a) shows the conventional top-down FPN [16]. It takes level 3-7 input features $\vec{P}^{in} = (P^{in}_3, \ldots, P^{in}_7)$, where $P^{in}_i$ represents a feature level with resolution of $1/2^i$ of the input images. For instance, if input resolution is 640x640, then $P^{in}_3$ represents feature level 3 ($640/2^3 = 80$) with resolution 80x80, while $P^{in}_7$ represents feature level 7 with resolution 5x5. The conventional FPN aggregates multi-scale features in a top-down manner:
$$P^{out}_7 = \mathrm{Conv}(P^{in}_7)$$
$$P^{out}_6 = \mathrm{Conv}(P^{in}_6 + \mathrm{Resize}(P^{out}_7))$$
$$\cdots$$
$$P^{out}_3 = \mathrm{Conv}(P^{in}_3 + \mathrm{Resize}(P^{out}_4))$$
where Resize is usually an upsampling or downsampling op for resolution matching, and Conv is usually a convolutional op for feature processing.
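The top-down aggregation above maps directly onto a few lines of code. The NumPy sketch below is only illustrative: the per-pixel matrix multiply stands in for the real 3x3 convolutions, nearest-neighbour repetition/striding stands in for the Resize op, and the helper names are made up.

```python
# Minimal sketch of the conventional top-down FPN fusion defined by the
# formulas above: each output level combines its own input with the resized
# output of the level above, followed by a "convolution".
import numpy as np

def resize(feat, target_hw):
    """Nearest-neighbour resize: repeat to upsample, stride to downsample."""
    h, w, _ = feat.shape
    th, tw = target_hw
    if th >= h:
        return feat.repeat(th // h, axis=0).repeat(tw // w, axis=1)
    return feat[:: h // th, :: w // tw]

def conv(feat, weight):
    """Placeholder 1x1 convolution: a per-pixel linear map over channels."""
    return feat @ weight

def fpn_top_down(p_in, weights):
    """p_in maps level l -> feature map of shape (H / 2^l, W / 2^l, C)."""
    levels = sorted(p_in)                                   # e.g. [3, 4, 5, 6, 7]
    top = levels[-1]
    p_out = {top: conv(p_in[top], weights[top])}            # P7_out = Conv(P7_in)
    for l in reversed(levels[:-1]):                         # 6, 5, 4, 3
        upper = resize(p_out[l + 1], p_in[l].shape[:2])
        p_out[l] = conv(p_in[l] + upper, weights[l])        # P_l_out = Conv(P_l_in + Resize(P_{l+1}_out))
    return p_out

# Usage: a 640x640 input gives level l a resolution of 640 / 2^l (80x80 at P3, 5x5 at P7).
c = 8
p_in = {l: np.random.rand(640 // 2**l, 640 // 2**l, c) for l in range(3, 8)}
w = {l: np.eye(c) for l in range(3, 8)}
print({l: f.shape for l, f in fpn_top_down(p_in, w).items()})
```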


3.2. Cross-Scale Connections

Conventional top-down FPN is inherently limited by the one-way information flow. To address this issue, PANet [19] adds an extra bottom-up path aggregation network, as shown in Figure 2(b). Cross-scale connections are further studied in [13, 12, 34]. Recently, NAS-FPN [5] employs neural architecture search to search for better cross-scale feature network topology, but it requires thousands of GPU hours during search and the found network is irregular and difficult to interpret or modify, as shown in Figure 2(c).  
By studying the performance and efficiency of these three networks (Table 4), we observe that PANet achieves better accuracy than FPN and NAS-FPN, but with the cost of more parameters and computations. To improve model efficiency, this paper proposes several optimizations for cross-scale connections: First, we remove those nodes that only have one input edge. Our intuition is simple: if a node has only one input edge with no feature fusion, then it will have less contribution to feature network that aims at fusing different features. This leads to a simplified PANet as shown in Figure 2(e); Second, we add an extra edge from the original input to output node if they are at the same level, in order to fuse more features without adding much cost, as shown in Figure 2(f); Third, unlike PANet [19] that only has one top-down and one bottom-up path, we treat each bidirectional (top-down & bottom-up) path as one feature network layer, and repeat the same layer multiple times to enable more high-level feature fusion. Section 4.2 will discuss how to determine the number of layers for different resource constraints using a compound scaling method. With these optimizations, we name the new feature network as bidirectional feature pyramid network (BiFPN), as shown in Figure 2(f) and 3.  
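The resulting wiring of one BiFPN layer (Figure 2(f)) can be sketched as follows. To keep the example short, all levels are assumed to share one spatial resolution so that only the cross-scale connections are visible; fuse is a stand-in for the weighted fusion plus convolution, and the function and variable names are illustrative.

```python
# Minimal sketch of the cross-scale connections in one BiFPN layer:
# a top-down pass producing intermediate nodes, then a bottom-up pass whose
# output nodes also take the original same-level input as an extra skip edge.
import numpy as np

def fuse(*feats):
    """Placeholder for weighted fusion followed by a convolution."""
    return sum(feats) / len(feats)

def bifpn_layer(p3, p4, p5, p6, p7):
    # Top-down path: intermediate nodes (P7 and P3 have no intermediate node).
    p6_td = fuse(p6, p7)
    p5_td = fuse(p5, p6_td)
    p4_td = fuse(p4, p5_td)
    # Bottom-up path: output nodes; each mid-level node also fuses the original
    # input at the same level (the cheap extra edge described above).
    p3_out = fuse(p3, p4_td)
    p4_out = fuse(p4, p4_td, p3_out)
    p5_out = fuse(p5, p5_td, p4_out)
    p6_out = fuse(p6, p6_td, p5_out)
    p7_out = fuse(p7, p6_out)
    return p3_out, p4_out, p5_out, p6_out, p7_out

# Repeating the same layer several times gives the stacked BiFPN; the number of
# repeats is one of the dimensions grown by compound scaling.
feats = [np.random.rand(16, 16, 8) for _ in range(5)]
for _ in range(3):
    feats = bifpn_layer(*feats)
print(feats[0].shape)
```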

3.3. Weighted Feature Fusion

When fusing multiple input features with different resolutions, a common way is to first resize them to the same resolution and then sum them up. Pyramid attention network [15] introduces global self-attention upsampling to recover pixel localization, which is further studied in [5].  
Previous feature fusion methods treat all input features equally without distinction. However, we observe that since different input features are at different resolutions, they usually contribute to the output feature unequally. To address this issue, we propose to add an additional weight for each input during feature fusion, and let the network learn the importance of each input feature. Based on this idea, we consider three weighted fusion approaches:
Unbounded fusion: $O = \sum_i w_i \cdot I_i$, where $w_i$ is a learnable weight that can be a scalar (per-feature), a vector (per-channel), or a multi-dimensional tensor (per-pixel). We find a scalar can achieve comparable accuracy to other approaches with minimal computational costs. However, since the scalar weight is unbounded, it could potentially cause training instability. Therefore, we resort to weight normalization to bound the value range of each weight.
Softmax-based fusion: $O = \sum_i \frac{e^{w_i}}{\sum_j e^{w_j}} \cdot I_i$. An intuitive idea is to apply softmax to each weight, such that all weights are normalized to be a probability with value range from 0 to 1, representing the importance of each input. However, as shown in our ablation study in section 6.3, the extra softmax leads to significant slowdown on GPU hardware. To minimize the extra latency cost, we further propose a fast fusion approach.
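The fusion variants reduce to a few lines of NumPy, sketched below under the notation above (inputs I_i already resized to a common shape, one scalar weight w_i per input). The fast normalized variant reflects the low-latency alternative the paper goes on to propose beyond this excerpt; its epsilon constant and the closing comparison are illustrative.

```python
# Minimal sketch of the weighted-fusion variants: unbounded, softmax-based,
# and the fast normalized approximation of softmax.
import numpy as np

def unbounded_fusion(inputs, w):
    # O = sum_i w_i * I_i  -- weights are unbounded, which can destabilize training.
    return sum(wi * xi for wi, xi in zip(w, inputs))

def softmax_fusion(inputs, w):
    # O = sum_i (e^{w_i} / sum_j e^{w_j}) * I_i  -- bounded to [0, 1], but the softmax is slow on GPU.
    e = np.exp(w - np.max(w))                      # subtract the max for numerical stability
    return sum((ei / e.sum()) * xi for ei, xi in zip(e, inputs))

def fast_normalized_fusion(inputs, w, eps=1e-4):
    # O = sum_i (w_i / (eps + sum_j w_j)) * I_i, with w_i kept non-negative (ReLU).
    w = np.maximum(w, 0.0)
    return sum((wi / (eps + w.sum())) * xi for wi, xi in zip(w, inputs))

inputs = [np.random.rand(4, 4, 8) for _ in range(2)]
w = np.array([0.7, 1.3])
# Illustrates that the normalized weights behave much like softmax weights
# while avoiding the exponentials.
print(np.allclose(softmax_fusion(inputs, w), fast_normalized_fusion(inputs, w), atol=0.1))
```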
