MyDLNote-High-Resolution: gOctConv: Highly Efficient Salient Object Detection with 100K Parameters

Highly Efficient Salient Object Detection with 100K Parameters

[ECCV 2020] [Code]

Abstract

Salient object detection models often demand a considerable amount of computation cost to make precise prediction for each pixel, making them hardly applicable on low-power devices.

In this paper, we aim to relieve the contradiction between computation cost and model performance by improving the network efficiency to a higher degree.

We propose a flexible convolutional module, namely generalized OctConv (gOctConv), to efficiently utilize both in-stage and cross-stages multi-scale features, while reducing the representation redundancy by a novel dynamic weight decay scheme. The effective dynamic weight decay scheme stably boosts the sparsity of parameters during training, and supports a learnable number of channels for each scale in gOctConv, allowing 80% of the parameters to be removed with negligible performance drop. Utilizing gOctConv, we build an extremely light-weighted model, namely CSNet, which achieves comparable performance with only ∼0.2% of the parameters (100k) of large models on popular salient object detection benchmarks.

Sentence 1 (research area + problem): salient object detection models often demand considerable computation to make a precise prediction for every pixel, which makes them hard to deploy on low-power devices.

Sentence 2 (goal of this work): the paper aims to relieve the contradiction between computation cost and model performance by improving network efficiency.

Sentences 3-5 (method):

1. A flexible convolutional module, gOctConv, is proposed to efficiently utilize both in-stage and cross-stages multi-scale features, and to reduce representation redundancy via a novel dynamic weight decay scheme.

2. The dynamic weight decay scheme stably boosts parameter sparsity during training and supports a learnable number of channels for each scale in gOctConv, reducing representation redundancy and cutting 80% of the parameters with negligible performance drop.

3. Built on gOctConv, an extremely lightweight model, CSNet, achieves performance comparable to large models on popular salient object detection benchmarks with only ∼0.2% of their parameters (100k).

 

Introduction

Salient object detection (SOD) is an important computer vision task with various applications in image retrieval [17,5], visual tracking [23], photographic composition [15], image quality assessment [69], and weakly supervised semantic segmentation [25]. While convolutional neural networks (CNNs) based SOD methods have made significant progress, most of these methods focus on improving the state-of-the-art (SOTA) performance, by utilizing both fine details and global semantics [64,80,83,76,11], attention [3,2], as well as edge cues [12,68,85,61] etc. Despite the great performance, these models are usually resource-hungry, which are hardly applicable on low-power devices with limited storage/computational capability. How to build an extremely light-weighted SOD model with SOTA performance is an important but less explored area.

This paragraph introduces the research area (its applications and the basic deep models) and the problem to be solved: CNN-based SOD methods have made great progress by exploiting fine details and global semantics, attention, and edge cues, but they mostly chase SOTA performance and are resource-hungry, making them hard to deploy on low-power devices with limited storage/computational capability. Building an extremely lightweight SOD model with SOTA performance is an important but less explored direction.

 

The SOD task requires generating accurate prediction scores for every image pixel, thus requires both large-scale high-level feature representations for correctly locating the salient objects and fine-detailed low-level representations for precise boundary refinement [12,67,24]. There are two major challenges towards building an extremely light-weighted SOD model. Firstly, serious redundancy could appear when the low-frequency nature of high-level features meets the high output resolution of saliency maps. Secondly, SOTA SOD models [44,72,12,84,46,10] usually rely on ImageNet pre-trained backbone architectures [19,13] to extract features, which are by themselves resource-hungry.

This part states the problems an extremely lightweight SOD model must solve.

The SOD task needs to generate an accurate prediction score for every image pixel, so it requires both large-scale high-level feature representations to correctly locate salient objects and fine-detailed low-level representations to precisely refine boundaries. There are two major challenges in building an extremely lightweight SOD model:

First, serious redundancy appears when the low-frequency nature of high-level features meets the high-resolution saliency output.

Second, SOTA SOD models usually rely on ImageNet pre-trained backbone architectures to extract features, which are themselves resource-hungry.

 

Very recently, the spatial redundancy issue of low frequency features has also been noticed by Chen et al. [4] in the context of image and video classification. As a replacement for vanilla convolution, they design an OctConv operation to process feature maps that vary spatially slower at a lower spatial resolution, reducing computational cost. However, directly using OctConv [4] to reduce the redundancy issue in the SOD task still faces two major challenges. 1) Only utilizing two scales, i.e., low and high resolutions as in OctConv, is not sufficient to fully reduce redundancy in the SOD task, which needs much stronger multi-scale representation ability than classification tasks. 2) The number of channels for each scale in OctConv is manually selected, requiring lots of effort to re-adjust for the saliency model since the SOD task requires less category information.

Related work on the problem this paper addresses:

Recently, Chen et al. also noticed the spatial-redundancy problem of low-frequency features in image and video classification, and designed the OctConv operation, which processes the slowly varying feature maps at a lower spatial resolution to reduce computational cost. However, directly using OctConv [4] to reduce redundancy in the SOD task still faces two major challenges:

1. The two scales in OctConv (low and high resolution) are not sufficient to fully remove the redundancy in the SOD task, because SOD needs much stronger multi-scale representation ability than classification.

2. The number of channels for each scale in OctConv is chosen manually, and since the SOD task needs less category information, re-tuning these numbers for a saliency model takes a lot of effort.

[4] Drop an octave: Reducing spatial redundancy in convolutional neural networks with octave convolution. ICCV 2019.

  • OctConv motivation:

(a) The spatial-frequency model of vision suggests that a natural image can be decomposed into a low-spatial-frequency part and a high-spatial-frequency part. (b) The output maps of a convolutional layer can likewise be decomposed and grouped by their spatial frequency. (c) The proposed multi-frequency feature representation stores the smoothly varying low-frequency maps in a low-resolution tensor to reduce spatial redundancy. (d) The proposed octave convolution operates directly on this representation: it updates the information within each group and further enables information exchange between groups.
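To make the two-scale idea concrete, here is a minimal PyTorch sketch of an octave convolution in the spirit of [4] (the class name OctConv2d and the nearest-neighbor upsampling / average-pool downsampling are assumptions for illustration, not the authors' implementation; alpha gives the fraction of low-frequency channels):

import torch
import torch.nn as nn
import torch.nn.functional as F

class OctConv2d(nn.Module):
    """Minimal two-scale octave convolution: alpha_in / alpha_out give the
    fraction of input / output channels kept at half (low) resolution."""
    def __init__(self, in_ch, out_ch, k=3, alpha_in=0.5, alpha_out=0.5):
        super().__init__()
        lo_in, lo_out = int(in_ch * alpha_in), int(out_ch * alpha_out)
        hi_in, hi_out = in_ch - lo_in, out_ch - lo_out
        p = k // 2
        self.hh = nn.Conv2d(hi_in, hi_out, k, padding=p)  # high -> high
        self.hl = nn.Conv2d(hi_in, lo_out, k, padding=p)  # high -> low
        self.lh = nn.Conv2d(lo_in, hi_out, k, padding=p)  # low  -> high
        self.ll = nn.Conv2d(lo_in, lo_out, k, padding=p)  # low  -> low

    def forward(self, x_hi, x_lo):
        # high-frequency output: same-scale path + upsampled low->high path
        y_hi = self.hh(x_hi) + F.interpolate(self.lh(x_lo), scale_factor=2)
        # low-frequency output: same-scale path + downsampled high->low path
        y_lo = self.ll(x_lo) + self.hl(F.avg_pool2d(x_hi, 2))
        return y_hi, y_lo

For example, with in_ch=64 and alpha_in=0.5, x_hi carries 32 channels at full resolution and x_lo carries 32 channels at half resolution, which is exactly the inter-group information exchange pictured in (d).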

 

In this paper, we propose a generalized OctConv (gOctConv) for building an extremely light-weighted SOD model, by extending the OctConv in the following aspects: 1) The flexibility to take inputs from an arbitrary number of scales, from both in-stage features as well as cross-stages features, allows a much larger range of multi-scale representations. 2) We propose a dynamic weight decay scheme to support a learnable number of channels for each scale, allowing 80% of the parameters to be removed with negligible performance drop.

Given these problems and the shortcomings of prior work, the paper does the following:

It proposes a generalized OctConv (gOctConv) to build an extremely lightweight SOD model, extending OctConv in these aspects:

1. Inputs can be taken at an arbitrary number of scales, from both in-stage and cross-stages features, allowing a much larger range of multi-scale representations.

2. A dynamic weight decay scheme supports a learnable number of channels per scale, removing 80% of the parameters with a negligible effect on performance.

 

Benefiting from the flexibility and efficiency of gOctConv, we propose a highly light-weighted model, namely CSNet, that fully explores both in-stage and Cross-Stages multi-scale features. As a bonus to the extremely low number of parameters, our CSNet could be directly trained from scratch without ImageNet pre-training, avoiding the unnecessary feature representations for distinguishing between various categories in the recognition task.

With the basic convolutional module in place, it still has to be applied to saliency detection by building the task network:

Leveraging the flexibility and efficiency of gOctConv, the paper proposes the highly lightweight CSNet, which fully explores both in-stage and cross-stages multi-scale features. Thanks to the extremely small parameter count, CSNet can be trained directly from scratch without ImageNet pre-training, avoiding the feature representations needed only to distinguish categories in recognition tasks.
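Since CSNet is trained from scratch, a standard supervised saliency training step is all that is needed; here is a minimal sketch (binary cross-entropy on the predicted map, no ImageNet initialization; the loss and optimizer choice are assumptions for illustration):

import torch.nn.functional as F

def train_step(model, optimizer, images, gt_masks):
    """One from-scratch training step: predict a saliency map and supervise
    it with binary cross-entropy against the ground-truth mask."""
    optimizer.zero_grad()
    logits = model(images)  # (N, 1, H, W) saliency logits
    loss = F.binary_cross_entropy_with_logits(logits, gt_masks)
    loss.backward()
    optimizer.step()
    return loss.item()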

 

In summary, we make two major contributions in this paper:

– We propose a flexible convolutional module, namely gOctConv, to efficiently utilize both in-stage and cross-stages multi-scale features for the SOD task, while reducing the representation redundancy by a novel dynamic weight decay scheme.

– Utilizing gOctConv, we build an extremely light-weighted SOD model, namely CSNet, which achieves comparable performance with only ∼0.2% of the parameters (100k) of SOTA large models on popular SOD benchmarks.

Contributions of this paper: gOctConv, the dynamic weight decay scheme, and CSNet.

 

Light-weighted Network with Generalized OctConv

Overview of Generalized OctConv

Originally designed as a replacement for the traditional convolution unit, the vanilla OctConv [4] shown in Fig. 2 (a) conducts convolution across high/low scales within a stage. However, only two scales within a stage cannot introduce enough multi-scale information for the SOD task. The number of channels for each scale in vanilla OctConv is manually set, and re-adjusting it for a saliency model takes a lot of effort since the SOD task requires less category information. Therefore, we propose a generalized OctConv (gOctConv) that allows an arbitrary number of input resolutions, from both in-stage and cross-stages conv features, with a learnable number of channels, as shown in Fig. 2 (b). As a generalized version of vanilla OctConv, gOctConv improves on it for the SOD task in the following aspects: 1) An arbitrary number of input and output scales is available to support a larger range of multi-scale representation. 2) Besides in-stage features, gOctConv can also process cross-stages features at arbitrary scales from the feature extractor. 3) gOctConv supports learnable channels for each scale through our proposed dynamic weight decay assisted pruning scheme. 4) Cross-scale feature interaction can be turned off to allow great flexibility in complexity. The flexible gOctConv admits many instances under different design requirements; we give a detailed introduction of different instances of gOctConv in the following light-weighted model design.

The vanilla OctConv [4] shown on the left of Fig. 2 was originally designed to replace the traditional convolution unit and performs convolution across high/low scales within a stage. However, only two scales within one stage cannot introduce enough multi-scale information for the SOD task [the problem]. Moreover, the channels per scale in vanilla OctConv are set manually, and re-tuning them for a saliency model is laborious since the SOD task needs less category information [the cause]. The paper therefore proposes a generalized OctConv (gOctConv) that accepts in-stage and cross-stages conv features at arbitrary resolutions as input, as shown in Fig. 2 (b). gOctConv improves on vanilla OctConv for the SOD task in the following aspects (see the sketch after this list):

1. An arbitrary number of input and output scales supports a larger range of multi-scale representation.

2. Besides in-stage features, gOctConv can also process cross-stages features at arbitrary scales from the feature extractor.

3. gOctConv supports learnable channels for each scale via the proposed dynamic weight decay assisted pruning scheme.

4. Cross-scale feature interaction can be switched off, giving great flexibility in complexity.
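A minimal sketch of the gOctConv idea, generalizing the two-scale OctConv above to an arbitrary list of scales (the class name GOctConv, the 1 × 1 kernels, and the bilinear resampling are illustrative assumptions; in the paper the per-scale channel counts are learned by the pruning scheme described later):

import torch.nn as nn
import torch.nn.functional as F

class GOctConv(nn.Module):
    """Generalized OctConv sketch: arbitrary numbers of input and output
    scales, with cross-scale interaction that can be switched off."""
    def __init__(self, in_chs, out_chs, cross_scale=True):
        # in_chs / out_chs: channel count per scale, index 0 = highest resolution
        super().__init__()
        self.convs = nn.ModuleDict({
            f'{i}->{o}': nn.Conv2d(ci, co, 1)
            for i, ci in enumerate(in_chs)
            for o, co in enumerate(out_chs)
            if cross_scale or i == o})
        self.n_out = len(out_chs)

    def forward(self, xs):
        # xs: list of tensors; xs[i] has spatial size (H / 2**i, W / 2**i)
        outs = []
        for o in range(self.n_out):
            size = (xs[0].shape[-2] >> o, xs[0].shape[-1] >> o)
            y = 0
            for i, x in enumerate(xs):
                key = f'{i}->{o}'
                if key in self.convs:
                    # convolve each input scale, resample to the output
                    # scale, and sum the contributions
                    y = y + F.interpolate(self.convs[key](x), size=size,
                                          mode='bilinear', align_corners=False)
            outs.append(y)
        return outs

With cross_scale=False and matching input/output scale lists, this degenerates to independent per-scale convolutions (property 4); feeding it features taken from different stages gives the cross-stages behavior (property 2).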

 

The next two sections cover two of the main contributions:

1. gOctConvs: Light-weighted Model Composed of gOctConvs

2. dynamic weight decay scheme: Learnable Channels for gOctConv

 

Light-weighted Model Composed of gOctConvs

  • Overview.

As shown in Fig. 3, our proposed light-weighted network, consisting of a feature extractor and a cross-stages fusion part, synchronously processes features at multiple scales. The feature extractor is built by stacking our proposed in-layer multi-scale blocks, namely ILBlocks, and is split into 4 stages according to the resolution of feature maps, where the stages have 3, 4, 6, and 4 ILBlocks, respectively. The cross-stages fusion part, composed of gOctConvs, processes features from stages of the feature extractor to get a high-resolution output.

In short: a feature extractor (stacked ILBlocks, grouped into 4 stages of 3, 4, 6, and 4 blocks by feature-map resolution) followed by a gOctConv-based cross-stages fusion part that produces the high-resolution output.

  • In-layer Multi-scale Block.

ILBlock enhances the multi-scale representation of features within a stage, and gOctConvs are utilized to introduce multi-scale processing within it. Vanilla OctConv requires about 60% of the FLOPs [4] to achieve performance similar to standard convolution, which is not enough for our objective of designing a highly light-weighted model. To save computational cost, interacting features across scales in every layer is unnecessary. Therefore, we apply an instance of gOctConv in which each input channel corresponds to an output channel with the same resolution, eliminating the cross-scale operations. A depthwise operation within each scale is utilized to further save computational cost. This instance of gOctConv only requires about 1/channel of the FLOPs of vanilla OctConv. ILBlock is composed of a vanilla OctConv and two 3 × 3 gOctConvs, as shown in Fig. 3. The vanilla OctConv interacts features across two scales, and the gOctConvs extract features within each scale. Multi-scale features within a block are thus processed separately and interacted alternately. Each convolution is followed by BatchNorm [30] and PReLU [18]. Initially, we roughly double the channels of ILBlocks as the resolution decreases, except for the last two stages, which have the same channel number. Unless otherwise stated, the channels for different scales in ILBlocks are set evenly. Learnable channels of OctConvs are then obtained through the scheme described in Sec. 3.3.

Fig. 3. Illustration of our salient object detection pipeline, which uses gOctConv to extract both in-stage and cross-stages multi-scale features in a highly efficient way.

This paragraph introduces the In-layer Multi-scale Block (ILBlock): a non-interacting, depthwise gOctConv instance keeps FLOPs low (about 1/channel of vanilla OctConv), while one vanilla OctConv per block handles the two-scale interaction. Each convolution is followed by BatchNorm and PReLU; channel counts roughly double as resolution drops, except for the last two stages, which share the same width, and channels are split evenly across scales. A sketch follows.
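A minimal sketch of an ILBlock under these assumptions (two scales with an even channel split; DepthwiseGOct stands for the non-interacting depthwise gOctConv instance described above, and OctConv2d is the two-scale octave conv sketched earlier; names are illustrative):

import torch.nn as nn

class DepthwiseGOct(nn.Module):
    """gOctConv instance with cross-scale interaction turned off: one
    depthwise 3x3 conv per scale, each followed by BatchNorm + PReLU."""
    def __init__(self, chs):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c, c, 3, padding=1, groups=c),
                          nn.BatchNorm2d(c), nn.PReLU(c))
            for c in chs)

    def forward(self, xs):
        return [branch(x) for branch, x in zip(self.branches, xs)]

class ILBlock(nn.Module):
    """One vanilla two-scale OctConv (cross-scale interaction) followed by
    two depthwise gOctConvs (per-scale extraction), per the text."""
    def __init__(self, width):
        super().__init__()
        chs = (width // 2, width // 2)            # even split over two scales
        self.interact = OctConv2d(width, width)   # from the earlier sketch
        self.bn_act = nn.ModuleList(
            nn.Sequential(nn.BatchNorm2d(c), nn.PReLU(c)) for c in chs)
        self.extract1 = DepthwiseGOct(chs)
        self.extract2 = DepthwiseGOct(chs)

    def forward(self, xs):
        xs = [f(x) for f, x in zip(self.bn_act, self.interact(*xs))]
        return self.extract2(self.extract1(xs))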

  • Cross-stages Fusion.

To retain a high output resolution, common methods keep a high feature resolution at the high levels of the feature extractor and construct complex multi-level aggregation modules, which inevitably increases computational redundancy. While the value of multi-level aggregation is widely recognized [16,43], how to achieve it efficiently and concisely remains challenging. Instead, we simply use gOctConvs to fuse multi-scale features from stages of the feature extractor and generate the high-resolution output. As a trade-off between efficiency and performance, features from the last three stages are used. A 1 × 1 gOctConv takes features at different scales from the last conv of each stage as input and conducts a cross-stages convolution to output features at different scales. To extract multi-scale features at a granular level, each scale of features is processed by a group of parallel convolutions with different dilation rates. Features are then sent to another 1 × 1 gOctConv to generate features at the highest resolution. A final standard 1 × 1 conv outputs the predicted saliency map. Learnable channels of the gOctConvs in this part are also obtained.

This paragraph explains the cross-stages multi-level feature aggregation. Instead of keeping high-resolution features at high levels and building a complex aggregation module, gOctConvs directly fuse the multi-scale features from the last three stages; each scale then passes through a group of parallel dilated convolutions, a second 1 × 1 gOctConv brings everything to the highest resolution, and a standard 1 × 1 conv predicts the saliency map. A sketch follows.
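A minimal sketch of this fusion path, reusing the GOctConv sketch above (the dilation rates, the internal width, and the CrossStagesFusion name are assumptions; in the paper the channel counts are learned by pruning):

import torch
import torch.nn as nn

class CrossStagesFusion(nn.Module):
    """gOctConv 1x1 (cross-stages) -> parallel dilated 3x3 convs per scale
    -> gOctConv 1x1 (to the highest resolution) -> 1x1 saliency head."""
    def __init__(self, stage_chs, mid=32, dilations=(1, 2, 4)):
        super().__init__()
        n = len(stage_chs)                    # e.g. the last three stages
        self.fuse_in = GOctConv(stage_chs, [mid] * n)
        self.multi_scale = nn.ModuleList(
            nn.ModuleList(nn.Conv2d(mid, mid, 3, padding=d, dilation=d)
                          for d in dilations)
            for _ in range(n))
        self.fuse_out = GOctConv([mid * len(dilations)] * n, [mid])
        self.head = nn.Conv2d(mid, 1, 1)      # saliency prediction

    def forward(self, feats):
        # feats: features from the last conv of each chosen stage,
        # index 0 = the highest resolution among them
        xs = self.fuse_in(feats)
        xs = [torch.cat([conv(x) for conv in branch], dim=1)
              for branch, x in zip(self.multi_scale, xs)]
        return self.head(self.fuse_out(xs)[0])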

 

Learnable Channels for gOctConv

We propose to get learnable channels for each scale in gOctConv by utilizing our proposed dynamic weight decay assisted pruning during training. Dynamic weight decay maintains a stable weight distribution among channels while introducing sparsity, helping pruning algorithms eliminate redundant channels with negligible performance drop.

During training, the proposed dynamic weight decay assisted pruning yields a learnable number of channels for each scale in gOctConv. Dynamic weight decay keeps the weight distribution across channels stable while introducing sparsity, helping the pruning algorithm remove redundant channels with negligible performance drop.

 

  • Dynamic Weight Decay.

The commonly used regularization trick weight decay [33,77] endows CNNs with better generalization performance. Mehta et al. [53] show that weight decay introduces sparsity into CNNs, which helps to prune unimportant weights. Training with weight decay makes unimportant weights in a CNN have values close to zero. Thus, weight decay has been widely used in pruning algorithms to introduce sparsity [38,50,48,22,47,21]. The common implementation of weight decay is by adding the L2 regularization to the loss function, which can be written as follows:

Loss = L0 + λ Σi ½ ‖wi‖² ,

where L0 is the loss for the specific task, wi is the weight of the i-th layer, and λ is the weight for weight decay. During back propagation, the weight wi is updated as

wi ← wi − η (∇fi(wi) + λ wi) ,

where η is the learning rate, ∇fi(wi) is the gradient to be updated, and λwi is the decay term, which is only associated with the weight itself. Applying a large decay term enhances sparsity, but meanwhile inevitably enlarges the diversity of weights among channels. Fig. 4 (a) shows that diverse weights cause an unstable distribution of outputs among channels. Ruan et al. [8] reveal that channels with diverse outputs are more likely to contain noise, leading to biased representations for subsequent filters. Attention mechanisms have been widely used to re-calibrate the diverse outputs, at the cost of extra blocks and computation [29,8]. We propose to relieve the output diversity among channels with no extra cost during inference. We argue that the diverse outputs are mainly caused by the indiscriminate suppression that decay terms apply to weights. Therefore, we propose to adjust the weight decay based on the specific features of each channel. Specifically, during back propagation, decay terms are dynamically changed according to the features of the corresponding channels. The weight update of the proposed dynamic weight decay is written as

wi ← wi − η (∇fi(wi) + λd S(xi) wi) ,

where λd is the weight of dynamic weight decay, xi denotes the features calculated by wi, and S(xi) is the metric of the feature, which can have multiple definitions depending on the task. In this paper, our goal is to stabilize the weight distribution among channels according to features. Thus, we simply use the global average pooling (GAP) [42] as the metric for a certain channel:

S(xi) = GAP(xi) = (1 / (H × W)) Σh Σw xi(h, w) ,

where H and W are the height and width of the feature map xi. The dynamic weight decay with the GAP metric ensures that weights producing large-valued features are suppressed, giving compact and stable weight and output distributions, as revealed in Fig. 4. The metric can also be defined in other forms to suit particular tasks, which we leave to future work. Please refer to Sec. 4.3 for a more detailed interpretation of dynamic weight decay.

The first part above explains that weight decay is commonly implemented by adding L2 regularization to the loss function, and how it enters the back-propagation update: weight decay, a common regularization trick, gives CNNs better generalization; it introduces sparsity, driving unimportant weights toward zero, and is therefore widely used in pruning algorithms.

The paper's key observation is that the diverse channel outputs mainly come from the indiscriminate suppression the decay term applies to all weights, and that they can be relieved with no extra inference cost. The decay term is therefore adjusted per channel: during back propagation it changes dynamically with the features of that channel, so the update gains an extra factor S(xi), a task-dependent metric of the feature. Since the goal here is to stabilize the weight distribution across channels, the global average pooling (GAP) [42] of a channel is used as its metric, as in the last formula above. With the GAP metric, weights producing large-valued features are suppressed more, yielding the compact, stable weight and output distributions shown in Fig. 4.
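A minimal PyTorch sketch of this update, applying the dynamic decay to the BatchNorm scale factors γ (as the pruning section below does) and taking S(xi) as the GAP of the channel's absolute activations; the hook placement and names are illustrative assumptions:

import torch.nn as nn

def attach_gap_metric(model):
    """Record S(x_i) = GAP(|x_i|) per channel for every BatchNorm layer."""
    metrics = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            def hook(mod, inp, out, key=name):
                # global average pool of the channel's output magnitude
                metrics[key] = out.detach().abs().mean(dim=(0, 2, 3))
            m.register_forward_hook(hook)
    return metrics

def apply_dynamic_decay(model, metrics, lambda_d=1e-4):
    """Add the dynamic decay term lambda_d * S(x_i) * gamma_i to the
    gradient of each BatchNorm scale, i.e. w <- w - lr * (grad + decay)."""
    for name, m in model.named_modules():
        if (isinstance(m, nn.BatchNorm2d) and name in metrics
                and m.weight.grad is not None):
            m.weight.grad.add_(lambda_d * metrics[name] * m.weight.detach())

# usage inside a training step (sketch):
#   metrics = attach_gap_metric(model)    # once, before training
#   ...
#   loss.backward()
#   apply_dynamic_decay(model, metrics)   # replaces the fixed weight decay
#   optimizer.step()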


  • Learnable channels with model compression.

Now, we incorporate dynamic weight decay with pruning algorithms to remove redundant weights, so as to get learnable channels for each scale in gOctConvs. In this paper, we follow [48] and use the weight of the BatchNorm layer as the indicator of channel importance. The BatchNorm operation [30] is written as follows:

y = γ (x − E(x)) / √(Var(x) + ε) + β ,

where x and y are the input and output features, E(x) and Var(x) are the mean and variance, respectively, and ε is a small factor to avoid zero variance. γ and β are learned factors. We apply the dynamic weight decay to γ during training. Fig. 4 (b) reveals that there is a clear gap between important and redundant weights, and unimportant weights are suppressed to nearly zero (wi < 1e−20). Thus, we can easily remove channels whose γ is less than a small threshold. The learnable channels for the features of each resolution in gOctConv are thereby obtained. The procedure for getting the learnable channels of gOctConvs is illustrated in Alg. 1.

The paper combines dynamic weight decay with a pruning algorithm to remove redundant weights and thereby learn the channel count of each scale in gOctConv. Following [48], the weight γ of the BatchNorm layer serves as the channel-importance indicator (the formula above).

γ and β are learned factors; dynamic weight decay is applied to γ during training. Fig. 4 (b) shows a clear gap between important and redundant weights, with unimportant weights suppressed to nearly zero (wi < 1e−20), so channels whose γ falls below a small threshold can simply be removed, yielding the learnable channels for each resolution in gOctConv. Alg. 1 gives the procedure; a sketch follows.
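A minimal sketch of the channel-selection step in the spirit of Alg. 1 (the threshold value and the function name are assumptions; the surviving per-scale counts become the learned gOctConv configuration used to build the compact model):

import torch.nn as nn

def learned_channels(model, threshold=1e-3):
    """After training with dynamic weight decay, count the surviving
    channels of each BatchNorm layer: those with |gamma| above threshold."""
    keep = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            mask = m.weight.detach().abs() > threshold
            keep[name] = int(mask.sum())      # learned channel count
    return keep

# e.g. keep = learned_channels(csnet); the per-scale counts of each gOctConv
# then define the pruned, compact architecture.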

