[ Paper Quick Read ] ECCV 2020: Cross-Modal Weighting Network

Cross-Modal Weighting Network for RGB-D Salient Object Detection

[PAPER]

Targeting the SOD problem, this paper proposes a new network structure that effectively fuses information from the RGB and depth channels, while also mining object localization and details across scales.

Abstract

Depth maps contain geometric clues for assisting Salient Object Detection (SOD). In this paper, we propose a novel Cross-Modal Weighting (CMW) strategy to encourage comprehensive interactions between RGB and depth channels for RGB-D SOD.

Specifically, three RGB-depth interaction modules, named CMW-L, CMW-M and CMW-H, are developed to deal with low-, middle- and high-level cross-modal information fusion, respectively. These modules use Depth-to-RGB Weighting (DW) and RGB-to-RGB Weighting (RW) to allow rich cross-modal and cross-scale interactions among feature layers generated by different network blocks. To effectively train the proposed Cross-Modal Weighting Network (CMWNet), we design a composite loss function that summarizes the errors between intermediate predictions and ground truth over different scales. With all these novel components working together, CMWNet effectively fuses information from RGB and depth channels, and meanwhile explores object localization and details across scales. Thorough evaluations demonstrate CMWNet consistently outperforms 15 state-of-the-art RGB-D SOD methods on seven popular benchmarks.

The abstract opens by stating the research area, SOD, and the paper's core selling point: the Cross-Modal Weighting (CMW) strategy.

It then introduces the new CMW network concretely:

1. Components: CMW-L, CMW-M and CMW-H.

2. Role: low-, middle- and high-level cross-modal information fusion, respectively.

3. Method: Depth-to-RGB Weighting (DW) and RGB-to-RGB Weighting (RW), which allow rich cross-modal and cross-scale interactions among feature layers generated by different network blocks.

4. Training: a composite loss function that sums the errors between the intermediate predictions and the ground truth over different scales (a plausible form is sketched after this list).

5. Takeaway: with all these novel components working together, CMWNet effectively fuses information from the RGB and depth channels, while exploring object localization and details across scales.
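The abstract does not spell the loss out. A plausible form, assuming per-scale weights $\lambda_s$ and a per-scale error term $\ell$ such as binary cross-entropy (both assumptions on my part, not taken from the paper), would be:

```latex
L_{\mathrm{total}} = \sum_{s=1}^{S} \lambda_s \, \ell\left(P_s, G\right)
```

where $P_s$ is the intermediate prediction at scale $s$ and $G$ is the ground-truth saliency map.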

Finally, the experimental conclusions.

 

Related Work

CNN-based RGB-D SOD

In recent years, numerous CNN-based RGB-D SOD methods [4–6, 10, 13, 20, 26, 31, 33, 38, 42, 44] have been proposed.

4. Chen, H., Li, Y.: Progressively complementarity-aware fusion network for RGB-D salient object detection. In: IEEE CVPR (2018)

5. Chen, H., Li, Y.: Three-stream attention-aware network for RGB-D salient object detection. IEEE TIP 28(6), 2825–2835 (2019)

6. Chen, H., Li, Y., Su, D.: Multi-modal fusion network with multi-scale multi-path and cross-modal interactions for RGB-D salient object detection. Pattern Recognition 86, 376–385 (2019)

10. Ding, Y., Liu, Z., Huang, M., Shi, R., Wang, X.: Depth-aware saliency detection using convolutional neural networks. Journal of Visual Communication and Image Representation 61, 1–9 (2019)

13. Fan, D.P., Lin, Z., Zhao, J.X., Liu, Y., Zhang, Z., Hou, Q., Zhu, M., Cheng, M.M.: Rethinking RGB-D salient object detection: Models, datasets, and large-scale benchmarks. arXiv preprint arXiv:1907.06781 (2019)

20. Han, J., Chen, H., Liu, N., Yan, C., Li, X.: CNNs-based RGB-D saliency detection via cross-view transfer and multiview fusion. IEEE TCYB 48(11), 3171–3183 (2018)

26. Liu, Z., Shi, S., Duan, Q., Zhang, W., Zhao, P.: Salient object detection for RGBD image by single stream recurrent convolution neural network. Neurocomputing 363, 46–57 (2019)

31. Qu, L., He, S., Zhang, J., Tian, J., Tang, Y., Yang, Q.: RGBD salient object detection via deep fusion. IEEE TIP 26(5), 2274–2285 (2017)

33. Shigematsu, R., Feng, D., You, S., Barnes, N.: Learning RGB-D salient object detection using background enclosure, depth contrast, and top-down features. In: IEEE ICCVW (2017)

38. Wang, N., Gong, X.: Adaptive fusion for RGB-D salient object detection. IEEE Access 7, 55277–55284 (2019)

42. Zhao, J.X., Cao, Y., Fan, D.P., Cheng, M.M., Li, X.Y., Zhang, L.: Contrast prior and fluid pyramid integration for RGBD salient object detection. In: IEEE CVPR (2019)

44. Zhu, C., Cai, X., Huang, K., Li, T.H., Li, G.: PDNet: Prior-model guided depth-enhanced network for salient object detection. In: IEEE ICME (2019)

 

Proposed Method

Network Overview and Motivation

At a high level:

1. It is a U-Net structure;

2. The encoder is split into two streams, depth and RGB;

3. Before each skip connection from encoder to decoder, the depth and RGB streams are fused;

4. This fusion can be viewed as combining one stream's features at a given level with the other stream's features from the preceding level; the authors further package it into three named modules, which makes the paper easier to write and follow (a skeleton sketch follows this list);

5. For the loss, a prediction is produced at every decoder level and compared against the GT (a loss sketch follows Fig. 1's caption below).
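Read as pseudocode, the overview above suggests roughly the following forward pass. This is a minimal sketch assuming the encoder returns three feature levels per modality; `CMWNetSketch` and all argument names are hypothetical, not the authors' code:

```python
import torch.nn as nn

class CMWNetSketch(nn.Module):
    """Hypothetical skeleton of the Fig. 1 pipeline: a Siamese
    encoder, one CMW module per level, and a U-Net-style decoder."""

    def __init__(self, encoder, cmw_modules, decoder):
        super().__init__()
        self.encoder = encoder                 # shared (Siamese) backbone
        self.cmw = nn.ModuleList(cmw_modules)  # [CMW-L, CMW-M, CMW-H]
        self.decoder = decoder

    def forward(self, rgb, depth):
        # The same encoder weights process both modalities (Siamese).
        rgb_feats = self.encoder(rgb)      # assumed: 3 feature levels
        depth_feats = self.encoder(depth)  # assumed: 3 feature levels
        # One CMW module per level fuses depth into RGB before the skip.
        fused = [m(r, d) for m, r, d in
                 zip(self.cmw, rgb_feats, depth_feats)]
        # The decoder aggregates the fused levels and emits an
        # intermediate prediction at every scale (coarse to fine).
        return self.decoder(fused)
```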

Fig. 1. Illustration of the proposed CMWNet. For both the RGB and depth channels, a Siamese encoder network is employed to extract feature blocks organized in three levels. Three Cross-Modal Weighting (CMW) modules, CMW-L, CMW-M and CMW-H, are proposed to capture the interactions at the corresponding levels and provide inputs for the decoder. The decoder progressively aggregates all the cross-modal cross-scale information for the final prediction. For training, multi-scale pixel-level supervision of intermediate predictions is utilized.
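A minimal sketch of the multi-scale supervision mentioned in point 5 and in the caption above, assuming binary cross-entropy per scale and bilinear upsampling of each intermediate prediction to the GT resolution (the function name and the uniform default weights are my assumptions):

```python
import torch.nn.functional as F

def multi_scale_loss(preds, gt, weights=None):
    """Sum per-scale BCE losses between intermediate predictions
    and the ground-truth saliency map.

    preds   -- list of raw-logit tensors, each (B, 1, h_s, w_s)
    gt      -- ground truth, (B, 1, H, W), values in {0, 1}
    weights -- optional per-scale weights; defaults to all ones
    """
    if weights is None:
        weights = [1.0] * len(preds)
    total = 0.0
    for w, p in zip(weights, preds):
        # Upsample each intermediate prediction to the GT resolution.
        p = F.interpolate(p, size=gt.shape[-2:], mode="bilinear",
                          align_corners=False)
        total = total + w * F.binary_cross_entropy_with_logits(p, gt)
    return total
```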

 

CMW-L, CMW-M and CMW-H

Fig. 2. Details of the three proposed RGB-depth interaction modules: CMW-L, CMW-M and CMW-H. All modules consist of Depth-to-RGB Weighting (DW) and RGB-to-RGB Weighting (RW) as key operations. Notably, the DW in CMW-L and CMW-M is performed in the cross-scale manner between two adjacent blocks, which effectively captures the feature continuity and activates cross-modal cross-scale interactions.

All three modules share an identical structure and differ only in which encoder levels they operate on. Moreover, the first two perform cross-level, cross-modal fusion (crossing low- and high-level features, and depth with RGB), while the last performs only cross-modal fusion.

1. The depth features pass through convolutions with different kernel sizes and are then aggregated by a 2×2 convolution;

2. The RGB features directly serve as the weights for both themselves and the depth output, similar to mixed attention;

3. Finally, the RGB features are also fed straight into the decoder through a residual connection, exactly as in a traditional U-Net (see the module sketch after this list).
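Putting the three steps together, one CMW module might look like the sketch below. The kernel sizes, the sigmoid gating, and the 1×1 aggregation (used instead of the 2×2 convolution mentioned above so that spatial sizes are preserved) are all assumptions for illustration, not the authors' implementation:

```python
import torch
import torch.nn as nn

class CMWBlock(nn.Module):
    """Hypothetical sketch of one Cross-Modal Weighting module."""

    def __init__(self, channels):
        super().__init__()
        # Step 1: convolutions with different kernel sizes over the
        # depth features, then a conv to aggregate the branches
        # (1x1 here, so the spatial size is preserved).
        self.depth_branches = nn.ModuleList([
            nn.Conv2d(channels, channels, k, padding=k // 2)
            for k in (1, 3, 5)
        ])
        self.aggregate = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, rgb, depth):
        # Multi-scale depth cues (Depth-to-RGB Weighting, DW).
        d = torch.cat([b(depth) for b in self.depth_branches], dim=1)
        d = self.aggregate(d)
        # Step 2: the RGB features act as the attention map over both
        # themselves and the depth output (RGB-to-RGB Weighting, RW).
        w = torch.sigmoid(rgb)
        fused = w * rgb + w * d
        # Step 3: residual connection back to RGB before the decoder,
        # as in a traditional U-Net skip.
        return fused + rgb
```

One such block per encoder level would then slot in where Fig. 2 places each module.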

 

Summary:

Novelties:

1. The encoder is split into depth and RGB branches;

2. The cross-level fusion is a subtle but interesting design;

3. The RGB features are used directly as the attention map in a mixed-attention scheme.

 

 
