Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

Original

karen17

2020-06-28 01:18


URL: http://openaccess.thecvf.com/content_cvpr_2018/papers/Hara_Can_Spatiotemporal_3D_CVPR_2018_paper.pdf

Abstract

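Main work: existing studies have looked only at relatively shallow 3D architectures, whereas this paper compares 3D CNN architectures ranging from relatively shallow to very deep across several datasets.

Main conclusions:

1) On UCF-101, HMDB-51, and ActivityNet, ResNet-18 overfits badly; on Kinetics, no overfitting is observed.

2) Kinetics can train very deep 3D CNNs, for example ResNet-152.

3) Even simple 3D architectures pretrained on Kinetics outperform complex 2D architectures.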

Introduction

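In action recognition, even well-organized 3D models still fall short of 2D models that combine stacked optical flow and RGB images.

Reasons: 1) current video datasets are relatively small, while 3D CNNs have many parameters

2) pretraining: 3D CNNs can only be trained on video datasets, whereas 2D CNNs can use ImageNet pretraining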
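To make the "many parameters" point concrete, here is a minimal sketch that counts the weights of a single convolution layer in 2D and in 3D. The helper name `conv_params` and the channel and kernel sizes are illustrative assumptions, not values taken from the paper.

```python
def conv_params(in_ch, out_ch, kernel, dims, bias=True):
    """Learnable parameters of a conv layer with a kernel of size `kernel` in each of `dims` dimensions."""
    weights = in_ch * out_ch * kernel ** dims
    return weights + (out_ch if bias else 0)


# Hypothetical layer: 64 input channels, 64 output channels, kernel size 3.
p2d = conv_params(64, 64, 3, dims=2, bias=False)  # 3x3 kernel
p3d = conv_params(64, 64, 3, dims=3, bias=False)  # 3x3x3 kernel
print(p2d, p3d, p3d / p2d)
```

For the same channel counts, the 3x3x3 kernel carries three times the weights of the 3x3 kernel, which is one reason a 3D network needs more training data than its 2D counterpart.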

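This leads to the authors' central question: can 3D CNNs retrace the history of 2D CNNs and ImageNet? That is, can 3D CNNs trained on Kinetics play the same role for action recognition and other video tasks that ImageNet pretraining plays for images? For the answer to be yes, Kinetics needs two properties: 1) it must be as large-scale as ImageNet, and 2) it must support training very deep architectures.

Main work of the paper:

1) Study how 3D CNN architectures, from relatively shallow to very deep, perform on UCF-101, HMDB-51, ActivityNet, and Kinetics. The architectures are mainly ResNet-based.

2) Compare training from scratch with fine-tuning.

Main contribution: this is the first work to focus on the training of very deep 3D CNNs from scratch for action recognition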

Experimental configuration

Three questions are investigated:

1）determine whether current video datasets have sufficient data for training of deep 3D CNNs

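This asks whether current datasets are large enough to train complex 3D CNNs. ResNet-18 (the smallest ResNet architecture) is trained on each dataset. If ResNet-18 overfits on a dataset, that dataset is too small to train deep 3D CNNs from scratch, since ResNet-18 is already a relatively small architecture.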
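The overfitting check described above can be sketched as a comparison of training and validation loss at the end of training. The function names, the 0.5 threshold, and the loss curves below are assumptions made for illustration; the paper judges overfitting by inspecting the curves rather than by a fixed rule.

```python
def generalization_gap(train_losses, val_losses, last_n=5):
    """Mean validation loss minus mean training loss over the last `last_n` epochs."""
    if len(train_losses) != len(val_losses):
        raise ValueError("loss curves must have the same length")
    if not train_losses:
        raise ValueError("loss curves must not be empty")
    n = min(last_n, len(train_losses))
    train_mean = sum(train_losses[-n:]) / n
    val_mean = sum(val_losses[-n:]) / n
    return val_mean - train_mean


def looks_overfit(train_losses, val_losses, threshold=0.5, last_n=5):
    """Flag a run whose validation loss ends well above its training loss."""
    return generalization_gap(train_losses, val_losses, last_n) > threshold


# Made-up curves, not measurements from the paper.
small_dataset = looks_overfit([2.0, 1.0, 0.5], [2.0, 2.0, 2.0], last_n=2)
large_dataset = looks_overfit([2.0, 1.5, 1.0], [2.0, 1.5, 1.25], last_n=2)
print(small_dataset, large_dataset)
```

In the first run the training loss keeps falling while the validation loss stays flat, which is the pattern the notes describe for the smaller datasets.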

2）conducted a separate experiment to determine whether the Kinetics dataset could train deeper 3D CNNs.

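This part explores how deep a 3D CNN can be trained on Kinetics, with model depth ranging from 18 to 200 layers. If deep ResNets on Kinetics reach a level comparable to deep ResNets on ImageNet, Kinetics can be used to pretrain for other action recognition datasets.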

3）examined the fine-tuning of Kinetics pretrained 3D CNNs on UCF-101 and HMDB-51

This examines how Kinetics-pretrained weights affect the small datasets UCF-101 and HMDB-51. Architectures: ResNet (basic and bottleneck blocks), pre-activation ResNet, wide ResNet (WRN), ResNeXt, and DenseNet

Experiment

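1. First question: comparing ResNet-18 across datasets

1) On UCF-101, HMDB-51, and ActivityNet, the validation loss of ResNet-18 is far higher than its training loss, which means ResNet-18 overfits on these datasets. The inference is that training deep 3D CNNs from scratch on them is not feasible. On Kinetics the result is different: there is no overfitting, so deep 3D CNNs can be trained on Kinetics.

2. Second question: how deep a 3D network can Kinetics train?

Evaluating deeper networks on Kinetics shows that accuracy rises with depth until it saturates at ResNet-152. ResNet-200 performs about the same as ResNet-152, possibly because it has begun to overfit.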
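One way to read "saturates at ResNet-152" is as the first depth after which going deeper no longer adds a meaningful gain. The sketch below encodes that reading; the function name, the `min_gain` threshold, and the accuracy values are all hypothetical placeholders and are not the paper's numbers.

```python
def saturation_depth(acc_by_depth, min_gain=0.5):
    """Return the shallowest depth whose next deeper model improves accuracy by less than `min_gain`."""
    depths = sorted(acc_by_depth)
    if not depths:
        raise ValueError("need at least one depth")
    for shallower, deeper in zip(depths, depths[1:]):
        if acc_by_depth[deeper] - acc_by_depth[shallower] < min_gain:
            return shallower
    # Accuracy kept improving all the way to the deepest model.
    return depths[-1]


# Placeholder accuracies chosen only to show the shape of the curve.
placeholder_acc = {18: 10.0, 34: 20.0, 50: 30.0, 101: 40.0, 152: 50.0, 200: 50.0}
print(saturation_depth(placeholder_acc))
```

With these placeholder values the gain from 152 to 200 layers is zero, so the function reports 152 as the saturation point.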

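3. Fine-tuning versus training from scratch

Kinetics supports training from scratch but the other datasets do not, so Kinetics is used to pretrain for them. The difference in results is quite large.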

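Personal summary

This paper reads somewhat like a survey: it examines how a range of ResNet architectures perform on several common action recognition datasets, and concludes that:

1) Many existing action recognition datasets are too small to train complex 3D architectures from scratch.

2) Kinetics is large enough, and the networks can be made very deep, such as ResNet-152 and ResNet-200.

3) For action recognition, Kinetics can play the role of ImageNet and provide pretraining for other datasets.

The code on GitHub is quite complete, but the experimental results in this paper are actually not that strong. For example, with Kinetics pretraining on UCF-101, ResNet-50 only reaches 89.3. A paper I looked at earlier, TSM (Temporal Shift Module for Efficient Video Understanding), seems to reach about 96. The authors also use many data augmentation tricks; when I tried to reproduce the paper on UCF-101 without these tricks, I could not reach 89.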
