Knowledge Distillation Notes

Inter-Region Affinity Distillation for Road Marking Segmentation (2020.04)

Yuenan Hou1, Zheng Ma2, Chunxiao Liu2, Tak-Wai Hui1, and Chen Change Loy3
1The Chinese University of Hong Kong 2SenseTime Group Limited 3Nanyang Technological University
 

Uses an inter-region affinity graph to describe the structural relationships among road-marking regions.
Each node represents the areas of interest (AOI) of one class (one node per class, or one per instance?); edges represent the affinity between nodes.


Generation of AOI: smooth the label map with an average kernel φ; the AOI map is obtained by binarizing the smoothed map (non-zero responses mark each class's AOI).
AOI-grounded moment pooling: pools the feature map within each AOI into its mean, variance and skewness, respectively.

Inter-region affinity: the similarity (cosine similarity) between the moment vectors of every pair of AOI nodes forms the edges of the affinity graph.

Distillation: the student is trained to reproduce the teacher's inter-region affinity graph via an L2 loss between the two graphs; a minimal sketch follows.
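A minimal sketch of the pipeline above, assuming a 9×9 average kernel, cosine similarity between concatenated moment vectors, and an L2 loss between the teacher's and student's graphs (the paper treats the moments separately; concatenating them here is a simplification):

```python
# Sketch of inter-region affinity distillation (PyTorch); shapes and the kernel size are assumptions.
import torch
import torch.nn.functional as F

def aoi_maps(label_map, num_classes, kernel_size=9):
    """label_map: (B, H, W) long tensor -> binary AOI masks (B, C, H, W)."""
    one_hot = F.one_hot(label_map, num_classes).permute(0, 3, 1, 2).float()
    smoothed = F.avg_pool2d(one_hot, kernel_size, stride=1, padding=kernel_size // 2)
    return (smoothed > 0).float()

def moment_vectors(feat, aoi):
    """AOI-grounded moment pooling: mean, variance, skewness per class region.
    feat: (B, D, H, W), aoi: (B, C, H, W) -> (B, C, D, 3)."""
    aoi = F.interpolate(aoi, size=feat.shape[-2:], mode="nearest")
    area = aoi.sum(dim=(-2, -1)).clamp(min=1.0)                      # (B, C)
    f = feat.unsqueeze(1)                                            # (B, 1, D, H, W)
    m = aoi.unsqueeze(2)                                             # (B, C, 1, H, W)
    mean = (f * m).sum(dim=(-2, -1)) / area.unsqueeze(-1)            # (B, C, D)
    diff = (f - mean[..., None, None]) * m
    var = diff.pow(2).sum(dim=(-2, -1)) / area.unsqueeze(-1)
    skew = diff.pow(3).sum(dim=(-2, -1)) / area.unsqueeze(-1) / var.clamp(min=1e-6).pow(1.5)
    return torch.stack([mean, var, skew], dim=-1)

def affinity_graph(moments):
    """Cosine similarity between the moment vectors of every pair of AOI nodes."""
    v = F.normalize(moments.flatten(2), dim=-1)                      # (B, C, D*3)
    return v @ v.transpose(1, 2)                                     # (B, C, C)

def intra_kd_loss(feat_s, feat_t, label_map, num_classes):
    aoi = aoi_maps(label_map, num_classes)
    g_s = affinity_graph(moment_vectors(feat_s, aoi))
    g_t = affinity_graph(moment_vectors(feat_t, aoi)).detach()       # teacher graph is the target
    return F.mse_loss(g_s, g_t)
```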

     

Experiments

Learning Lightweight Lane Detection CNNs by Self Attention Distillation (2019.08)

Yuenan Hou1, Zheng Ma2, Chunxiao Liu2, and Chen Change Loy3
1The Chinese University of Hong Kong 2SenseTime Group Limited 3Nanyang Technological University

Self attention distillation (SAD) for lane detection.
Backbone: ENet, ResNet-18/34.
The feature map output by each block is converted into an attention map, and the attention map of a later block supervises (guides) the attention map of the preceding block.
Attention map generation: channel-wise aggregation of squared activations (p = 2) -> bilinear upsampling B(·) -> spatial softmax operation Φ(·)

Loss: sum the distillation term L_d between each block's attention map and that of the following block over all adjacent block pairs, where m is the number of blocks and L_d is an L2 loss (see the sketch below).
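A minimal sketch of SAD following the notes above: p = 2 channel aggregation, bilinear upsampling, spatial softmax, and an L2 loss between successive blocks' attention maps (the common target size is an assumption):

```python
# Sketch of Self Attention Distillation (SAD); the earlier block mimics the later block's map.
import torch
import torch.nn.functional as F

def attention_map(feat, out_size):
    """feat: (B, C, H, W) -> (B, out_H * out_W) normalized spatial attention."""
    att = feat.pow(2).sum(dim=1, keepdim=True)          # G^2_sum: sum of squared activations
    att = F.interpolate(att, size=out_size, mode="bilinear", align_corners=False)
    return F.softmax(att.flatten(1), dim=1)             # spatial softmax Phi(.)

def sad_loss(block_feats, out_size=(36, 100)):
    """block_feats: list of feature maps from successive blocks."""
    loss = 0.0
    for prev, nxt in zip(block_feats[:-1], block_feats[1:]):
        loss = loss + F.mse_loss(attention_map(prev, out_size),
                                 attention_map(nxt, out_size).detach())
    return loss
```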

Ablation Study
Distillation paths of SAD: applying SAD to block 1 hurts performance (does it destroy low-level detail information?).
Backward distillation: treating a later block as the student and an earlier block as the teacher does not work.
SAD vs. deep supervision: SAD provides soft targets and a feedback connection between blocks.
When to add SAD: adding SAD at a later training stage is beneficial.

Knowledge Adaptation for Efficient Semantic Segmentation(CVPR 2019)

Tong He1 Chunhua Shen1 Zhi Tian1 Dong Gong1 Changming Sun2 Youliang Yan3
1The University of Adelaide 2Data61, CSIRO 3Noah’s Ark Lab, Huawei Technologies

Motivation: the structural difference between the teacher and the student leads to different abilities to capture context and long-range dependencies, which makes direct distillation difficult. The knowledge should have its redundancy and noise removed before being used for distillation.

Knowledge Translation
Compress the teacher's features with an auto-encoder.

Feature Adaptation (reminiscent of FitNets)
Solves the feature-mismatch problem and reduces the effect of the inherent structural difference between the two networks.

The adapter C_f uses a 3 × 3 kernel with stride 1 and padding 1, followed by a BN layer and ReLU (see the sketch below).
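A minimal sketch of that adapter; the channel counts are placeholders for the student / compressed-teacher feature dimensions:

```python
# Sketch of the feature adapter C_f: 3x3 conv (stride 1, padding 1) + BN + ReLU.
import torch.nn as nn

def feature_adapter(in_channels, out_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )
```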

Affinity Distillation
Cosine distance is used to express the similarity between features (pairwise affinities between spatial positions, sketched below).
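A minimal sketch of an affinity-distillation term based on the note above: cosine similarities between all pairs of spatial positions, matched between teacher and student with an L2 loss (the exact normalization in the paper may differ):

```python
# Sketch of affinity distillation over spatial positions; best applied to a small feature map.
import torch
import torch.nn.functional as F

def spatial_affinity(feat):
    """feat: (B, C, H, W) -> (B, H*W, H*W) cosine-similarity matrix."""
    v = F.normalize(feat.flatten(2), dim=1)        # unit-norm feature vector per position
    return v.transpose(1, 2) @ v                   # pairwise cosine similarities

def affinity_distillation_loss(feat_s, feat_t):
    if feat_s.shape[-2:] != feat_t.shape[-2:]:     # align spatial sizes if they differ
        feat_s = F.interpolate(feat_s, size=feat_t.shape[-2:], mode="bilinear",
                               align_corners=False)
    return F.mse_loss(spatial_affinity(feat_s), spatial_affinity(feat_t).detach())
```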

Backbone: teacher ResNet-50, student MobileNetV2.

Knowledge Distillation via Instance Relationship Graph (CVPR 2019)

Yufan Liu∗a, Jiajiong Cao*b, Bing Li†a, Chunfeng Yuan†a, Weiming Hua, Yangxi Lic and Yunqiang Duanc
aNLPR, Institute of Automation, Chinese Academy of Sciences  bAnt Financial   cNational Computer Network Emergency Response Technical Team/Coordination Center of China

Classification task; instance (sample) relationships are introduced into the distillation. A graph is built with instance features as vertices and an adjacency matrix A representing the relationships. Since A depends on the batch size, the loss weight λ has to be tuned for each batch size.
Backbone: ResNet. Datasets: CIFAR, ImageNet.

The loss is an L2 term on the vertex (instance) features plus an L2 term on the graph edges (see the sketch below).
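A minimal sketch of that loss, assuming edges are built from pairwise Euclidean distances between instance features and that the teacher and student feature dimensions already match (e.g., via a linear adapter):

```python
# Sketch of an instance-relationship-graph loss: L2 on vertices + L2 on the edge matrix.
import torch
import torch.nn.functional as F

def edge_matrix(feats):
    """feats: (B, D) instance features -> (B, B) pairwise Euclidean distances."""
    return torch.cdist(feats, feats, p=2)

def irg_loss(feat_s, feat_t, lam=1.0):
    """lam weights the edge term and depends on the batch size, as noted above."""
    vertex_loss = F.mse_loss(feat_s, feat_t.detach())
    edge_loss = F.mse_loss(edge_matrix(feat_s), edge_matrix(feat_t).detach())
    return vertex_loss + lam * edge_loss
```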

Relational Knowledge Distillation (CVPR2019)

Adds the relations between samples within a batch to the distilled information. Designed for metric learning, but it also brings gains on classification.
Backbone: ResNet.

Embedding vectors work better without L2 normalization (by exploiting a larger embedding space).
Distance-wise distillation loss: the pairwise distances between embeddings (normalized by their batch mean) are matched between teacher and student with a Huber loss.
Angle-wise distillation loss: the angles formed by triplets of embeddings are matched in the same way; a minimal sketch of both terms follows.
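A minimal sketch of the two RKD terms as described above; the loss weights are placeholders:

```python
# Sketch of RKD: distance-wise + angle-wise relational losses with a Huber (smooth L1) penalty.
import torch
import torch.nn.functional as F

def pairwise_distances(e):
    """e: (B, D) embeddings -> (B, B) distances normalized by their batch mean."""
    d = torch.cdist(e, e, p=2)
    return d / d[d > 0].mean().clamp(min=1e-12)

def triplet_angles(e):
    """Cosine of the angle at e_j for every triplet (e_i, e_j, e_k) -> (B, B, B)."""
    diff = e.unsqueeze(0) - e.unsqueeze(1)            # diff[j, i] = e_i - e_j
    diff = F.normalize(diff, dim=-1)
    return torch.einsum("jid,jkd->ijk", diff, diff)

def rkd_loss(emb_s, emb_t, w_dist=1.0, w_angle=2.0):
    loss_d = F.smooth_l1_loss(pairwise_distances(emb_s), pairwise_distances(emb_t).detach())
    loss_a = F.smooth_l1_loss(triplet_angles(emb_s), triplet_angles(emb_t).detach())
    return w_dist * loss_d + w_angle * loss_a
```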
  

A Gift from Knowledge Distillation: Fast Optimization, Network Minimization and Transfer Learning (CVPR 2017)

Junho Yim1 Donggyu Joo1 Jihoon Bae2 Junmo Kim1
1School of Electrical Engineering, KAIST, South Korea   2Electronics and Telecommunications Research Institute

The similarity between features from different layers of the same network is used as the distilled knowledge (a Gramian-style matrix, the FSP matrix).
The knowledge transfer performance is very sensitive to how the distilled knowledge is defined; the authors argue that demonstrating the solution process of a problem provides better generalization than teaching the intermediate result.
The paper demonstrates three benefits of distillation experimentally:
1. Fast optimization
2. Performance improvement for the small DNN
3. Transfer learning

G_ij = the element-wise product of channel i of Feature1 and channel j of Feature2, summed over spatial positions (divided by the spatial size h × w in the paper); a minimal sketch follows.
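A minimal sketch of the FSP matrix and its L2 matching loss, following the definition above; the two features of a pair come from the ends of a network stage and must share the same spatial size:

```python
# Sketch of the FSP (flow of solution procedure) matrix and matching loss.
import torch
import torch.nn.functional as F

def fsp_matrix(feat1, feat2):
    """feat1: (B, C1, H, W), feat2: (B, C2, H, W) -> (B, C1, C2) FSP matrix."""
    b, c1, h, w = feat1.shape
    f1 = feat1.flatten(2)                      # (B, C1, H*W)
    f2 = feat2.flatten(2)                      # (B, C2, H*W)
    return f1 @ f2.transpose(1, 2) / (h * w)   # G_ij = sum over positions / (h*w)

def fsp_loss(pairs_s, pairs_t):
    """pairs_*: lists of (feat1, feat2) tuples from the student / teacher stages."""
    loss = 0.0
    for (s1, s2), (t1, t2) in zip(pairs_s, pairs_t):
        loss = loss + F.mse_loss(fsp_matrix(s1, s2), fsp_matrix(t1, t2).detach())
    return loss
```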

Training procedure: stage-wise — first train the student to mimic the teacher's FSP matrices, then fine-tune on the original task.

FITNETS: HINTS FOR THIN DEEP NETS (2015.03)

Teacher: wide and shallow; student: thin and deep. Perhaps because ResNet had not yet appeared, deep nets were still hard to train at the time.
First distill an intermediate layer (a conv regressor is added on the student's intermediate layer to match the teacher's feature size), then distill the output distribution.



λ (the weight on the soft-target term) is gradually decayed during training; a minimal sketch of the two stages follows.
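A minimal sketch of the two FitNets stages summarized above; layer sizes, the temperature and the λ schedule are placeholders, and the soft-target term is written in the standard KD (KL-divergence) form:

```python
# Sketch of FitNets: stage 1 hint training, stage 2 soft-target distillation with decaying lambda.
import torch
import torch.nn as nn
import torch.nn.functional as F

regressor = nn.Conv2d(64, 256, kernel_size=1)     # maps the student hint to the teacher's guided-layer size

def hint_loss(student_hint, teacher_guided):
    """Stage 1: L2 between the regressed student hint and the teacher's guided layer."""
    return F.mse_loss(regressor(student_hint), teacher_guided.detach())

def kd_loss(student_logits, teacher_logits, labels, lam, T=4.0):
    """Stage 2: cross-entropy + lam * soft-target term; lam decays over training."""
    ce = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * T * T
    return ce + lam * soft
```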

PAYING MORE ATTENTION TO ATTENTION: IMPROVING THE PERFORMANCE OF CONVOLUTIONAL NEURAL NETWORKS VIA ATTENTION TRANSFER (2017.02)

Attention maps are used as the distilled information.



Activation-based attention transfer: aggregate the absolute activations across channels raised to the power p (p = 2), L2-normalize the flattened map, and match it between teacher and student (see the sketch below).
Gradient-based attention transfer: the gradient of the loss with respect to the input serves as the spatial attention map to be matched.
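A minimal sketch of activation-based attention transfer as described above (p = 2 aggregation, L2-normalized flattened maps, L2 matching loss):

```python
# Sketch of activation-based attention transfer between matched teacher/student layers.
import torch
import torch.nn.functional as F

def at_map(feat, p=2):
    """feat: (B, C, H, W) -> (B, H*W) L2-normalized spatial attention vector."""
    q = feat.abs().pow(p).sum(dim=1).flatten(1)   # sum over channels of |A_c|^p
    return F.normalize(q, dim=1)

def at_loss(feats_s, feats_t, p=2):
    """Sum of L2 distances between student and teacher attention maps."""
    loss = 0.0
    for fs, ft in zip(feats_s, feats_t):
        loss = loss + (at_map(fs, p) - at_map(ft, p).detach()).pow(2).mean()
    return loss
```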



Low-resolution Face Recognition in the Wild via Selective Knowledge Distillation (2019.03 TIP2018)

Face recognition task; only the useful part of the teacher's information is selected for distillation.
Teacher: large network, high-resolution input.
Student: small network, low-resolution input.
Initialization of the Two-stream CNNs
Teacher: may be pre-trained on other datasets.
Student: randomly initialized.
Selective Knowledge Distillation from the Teacher Stream
 
A feature centroid u_c is introduced for each class to reduce the number of graph edges: edges between features f_i of the same class are kept, while across classes only edges between f_i and the centroids u_c remain.


Cosine distance d(·) is the distance measure; λ is a negative weight (the first term favors selecting fewer nodes, the second favors selecting more). A sketch of the sparsified relation graph follows.
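A minimal sketch of the sparsified relation graph described above: full edges within each class, and cross-class edges only between features and class centroids. The cosine-distance form and the returned layout are assumptions.

```python
# Sketch of the centroid-based sparse graph: intra-class edges + feature-to-centroid edges.
import torch
import torch.nn.functional as F

def cosine_distance(a, b):
    """1 - cosine similarity between rows of a (N, D) and rows of b (M, D)."""
    return 1.0 - F.normalize(a, dim=1) @ F.normalize(b, dim=1).t()

def sparse_graph_edges(features, labels):
    """features: (N, D) teacher features, labels: (N,) class ids.
    Returns per-class intra-class distance matrices and feature-to-centroid distances."""
    classes = labels.unique()
    centroids = torch.stack([features[labels == c].mean(dim=0) for c in classes])  # u_c
    intra = [cosine_distance(features[labels == c], features[labels == c]) for c in classes]
    cross = cosine_distance(features, centroids)       # every f_i connected to each centroid u_c
    return intra, cross
```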
Teacher-supervised Student Stream Fine-tuning


 
