Tutorial: Knowledge Distillation

Overview

Knowledge distillation (KD) generally refers to using a large teacher network as supervision to help a small student network learn. It is mainly used for model compression.

Methods fall into two main categories:

Output Distillation
Feature Distillation

Output Distillation

Motivation

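Output distillation pulls the final outputs of the teacher and the student closer together. Reference paper: Distilling the Knowledge in a Neural Network
A one-hot label sets the probability of every incorrect class to 0. A model's prediction, by contrast, assigns different probabilities to the incorrect classes, and the relative sizes of those probabilities carry extra information about how the model generalizes and discriminates.
For example, given an image of a sedan, a model's probability vector is likely to give truck a probability comparable to sedan's and give cat a very small one. This says more than the one-hot label does: a sedan resembles a truck and does not resemble a cat.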

Method

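Loss = CrossEntropy(softmax(predict), one hot label) + alpha * T * T * CrossEntropy(softmax(predict/T), soft target)

Here predict is the student's logits and soft target is the teacher's softened output, softmax(teacher logits / T).
T is a temperature hyperparameter. Writing q_i = exp(z_i/T) / sum_j exp(z_j/T) for the softened probability of class i given logits z, a larger T makes q_i softer; as T tends to infinity, q approaches the uniform distribution (1/n, 1/n, ...).
alpha balances the task loss against the output distillation loss. The factor T * T compensates for the gradients of the softened term scaling as 1/T^2, so the relative weight of the two terms stays roughly stable when T changes.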
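
The loss above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the logits for the classes (sedan, truck, cat), and the default values T=4.0 and alpha=0.5 are all assumptions made for the example.

```python
import math

def softmax(logits, T=1.0):
    """Softmax with temperature T; a larger T gives a softer distribution."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(pred_probs, target_probs, eps=1e-12):
    """Cross-entropy -sum_i target_i * log(pred_i)."""
    return -sum(t * math.log(p + eps) for p, t in zip(pred_probs, target_probs))

def kd_loss(student_logits, teacher_logits, one_hot, T=4.0, alpha=0.5):
    """Task loss plus alpha * T^2 * soft-target loss."""
    task = cross_entropy(softmax(student_logits), one_hot)
    soft_target = softmax(teacher_logits, T)  # teacher output softened by T
    distill = cross_entropy(softmax(student_logits, T), soft_target)
    return task + alpha * T * T * distill

# Hypothetical logits for the classes (sedan, truck, cat).
teacher_logits = [5.0, 3.0, -2.0]
student_logits = [2.0, 1.5, 0.5]
one_hot = [1.0, 0.0, 0.0]
loss = kd_loss(student_logits, teacher_logits, one_hot)
```

Calling softmax(teacher_logits, T) with increasing T shows the effect described above: the sedan probability shrinks and the three entries move toward 1/3 each.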

Advantage

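For hard samples, it provides supervision about how difficult the sample is (as in the example above, sedans and trucks are easily confused).
For easy samples, it smooths the label, which acts as regularization and helps avoid overfitting.
It only pulls the outputs closer, so it is not constrained by network architecture.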

Feature Distillation

Motivation

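Pulling only the outputs closer, as output distillation does, has two problems:

The output of a good network often differs little from the ground truth and is close to one-hot. Tuning T alleviates this to some extent (a larger T gives softer labels).
The student and the teacher both have many layers, so constraining only the final output makes them hard to align.
To address these two problems, Adriana Romero, Yoshua Bengio, and colleagues proposed feature distillation in FitNets.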

Method

Loss = L_task + alpha * L_{output_distillation} + beta * L_{feature_distillation}
L_{feature_distillation} = Distance(Transform_t(Feat_t), Transform_s(Feat_s))
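A feature distillation method has four design aspects:

Where the features are taken from
The transform applied to the student feature
The transform applied to the teacher feature
The definition of the distance

Many methods are designed along these four aspects.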
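
One way to read the two formulas above is as a function whose arguments are the design choices. The sketch below uses NumPy and assumes features shaped [b, c, h, w]; the helper names (feature_distillation_loss, total_loss, mean_squared_l2, identity) are hypothetical and chosen only for illustration.

```python
import numpy as np

def feature_distillation_loss(feat_t, feat_s, transform_t, transform_s, distance):
    """Distance(Transform_t(Feat_t), Transform_s(Feat_s))."""
    return distance(transform_t(feat_t), transform_s(feat_s))

def total_loss(l_task, l_output, l_feature, alpha=1.0, beta=1.0):
    """L_task + alpha * L_output_distillation + beta * L_feature_distillation."""
    return l_task + alpha * l_output + beta * l_feature

def mean_squared_l2(a, b):
    """Mean squared difference, one possible choice of distance."""
    return float(np.mean((a - b) ** 2))

def identity(x):
    """No transform."""
    return x

# Random stand-ins for a teacher and a student feature map of the same shape.
rng = np.random.default_rng(0)
feat_t = rng.normal(size=(2, 8, 4, 4))
feat_s = rng.normal(size=(2, 8, 4, 4))
l_feat = feature_distillation_loss(feat_t, feat_s, identity, identity, mean_squared_l2)
```

Swapping in different transform and distance functions gives different points in this design space.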

1. FitNets
Reference paper: FitNets: Hints for Thin Deep Nets

Feature position: varies by experiment; the paper aligns only a single layer of features
Student feature transform: 1x1 conv
Teacher feature transform: none
Distance: L2
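
A minimal NumPy sketch of these choices, assuming features shaped [b, c, h, w] and modelling the 1x1 conv as a per-pixel linear map over channels. The weight matrix is random here purely for illustration; in practice it would be learned together with the student. The 0.5 factor and the batch averaging are assumptions of this sketch.

```python
import numpy as np

def conv1x1(feat, weight):
    """1x1 convolution as a per-pixel linear map over channels.

    feat:   [b, c_in, h, w]
    weight: [c_out, c_in]
    """
    return np.einsum("oc,bchw->bohw", weight, feat)

def hint_loss(feat_t, feat_s, weight):
    """Squared L2 distance between the teacher feature and the transformed student feature."""
    diff = feat_t - conv1x1(feat_s, weight)
    return 0.5 * float(np.mean(np.sum(diff ** 2, axis=(1, 2, 3))))

rng = np.random.default_rng(0)
feat_t = rng.normal(size=(2, 16, 4, 4))  # teacher feature, 16 channels
feat_s = rng.normal(size=(2, 8, 4, 4))   # student feature, 8 channels
weight = rng.normal(size=(16, 8)) * 0.1  # hypothetical 1x1 conv weights
loss = hint_loss(feat_t, feat_s, weight)
```

The 1x1 conv exists only to bring the student's channel count up to the teacher's so that the L2 distance is defined.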
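
2. Attention Transfer
Reference paper: Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer

Feature position: the last convolutional layer of each ResNet stage
Student feature transform: sum the squared activations over the channel dimension, flatten, and L2-normalize, mapping an h x w x c feature to a vector of length h*w
Teacher feature transform: sum the squared activations over the channel dimension, flatten, and L2-normalize, mapping an h x w x c feature to a vector of length h*w
Distance: L2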
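
The transform and distance listed above might look as follows in NumPy, assuming features shaped [b, c, h, w]. Averaging the per-sample L2 distance over the batch is an assumption of this sketch; implementations differ in how they reduce.

```python
import numpy as np

def attention_map(feat, eps=1e-12):
    """[b, c, h, w] -> [b, h*w]: sum of squares over channels, flattened, L2-normalized."""
    b = feat.shape[0]
    att = np.sum(feat ** 2, axis=1).reshape(b, -1)
    norm = np.linalg.norm(att, axis=1, keepdims=True)
    return att / (norm + eps)

def attention_transfer_loss(feat_t, feat_s):
    """L2 distance between normalized attention maps, averaged over the batch."""
    diff = attention_map(feat_t) - attention_map(feat_s)
    return float(np.mean(np.linalg.norm(diff, axis=1)))

rng = np.random.default_rng(0)
feat_t = rng.normal(size=(2, 16, 4, 4))  # teacher feature, 16 channels
feat_s = rng.normal(size=(2, 8, 4, 4))   # student feature, 8 channels
loss = attention_transfer_loss(feat_t, feat_s)
```

Because the channel dimension is summed out, the teacher and the student may have different channel counts; only the spatial size has to match.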
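
3. Similarity Preserving
Reference paper: Similarity-Preserving Knowledge Distillation

Feature position: the last convolutional layer of each ResNet stage
Student feature transform: compute the pairwise feature similarity between the samples in a batch, i.e. [b, hwc] * [hwc, b] = [b, b]
Teacher feature transform: compute the pairwise feature similarity between the samples in a batch, i.e. [b, hwc] * [hwc, b] = [b, b]
Distance: L2
A brief note on the motivation: for both the student and the teacher, the features predicted for two samples of the same class should be highly similar, and the features of samples from different classes should have low similarity.
The benefit is that the student does not have to mimic the teacher's feature space. It only needs to make same-class objects similar and different-class objects dissimilar within its own feature space. This matters because a student with small capacity often struggles to mimic the teacher's feature space.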
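
A NumPy sketch of the batch similarity transform, assuming features shaped [b, c, h, w]. The row-wise L2 normalization of the [b, b] matrix and the division by b*b are not stated in the text above; they are assumptions of this sketch.

```python
import numpy as np

def similarity_matrix(feat, eps=1e-12):
    """[b, c, h, w] -> [b, b] pairwise similarity, with each row L2-normalized."""
    b = feat.shape[0]
    flat = feat.reshape(b, -1)  # [b, c*h*w]
    gram = flat @ flat.T        # [b, b]
    norm = np.linalg.norm(gram, axis=1, keepdims=True)
    return gram / (norm + eps)

def sp_loss(feat_t, feat_s):
    """Squared L2 distance between the two similarity matrices, scaled by 1 / b^2."""
    b = feat_t.shape[0]
    diff = similarity_matrix(feat_t) - similarity_matrix(feat_s)
    return float(np.sum(diff ** 2)) / (b * b)

rng = np.random.default_rng(0)
feat_t = rng.normal(size=(4, 16, 4, 4))  # teacher feature
feat_s = rng.normal(size=(4, 8, 2, 2))   # student feature with a different shape
loss = sp_loss(feat_t, feat_s)
```

The two features only need to share the batch size, since both are reduced to [b, b] before being compared. This is what frees the student from matching the teacher's feature space.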
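
4. Overhaul Distillation
Reference paper: A Comprehensive Overhaul of Feature Distillation

Feature position: the last convolutional layer of each ResNet stage; note that the feature is taken before the ReLU
Student feature transform: 1x1 conv
Teacher feature transform: margin ReLU, which keeps positive values unchanged and clips negative values to a negative margin
Distance: partial L2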
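
The teacher transform and the distance can be sketched in NumPy as below. This is a simplified illustration: it uses a single scalar margin chosen by hand and leaves out the student's 1x1 conv. The partial L2 here skips positions where the teacher target is non-positive and the student value is already at or below it.

```python
import numpy as np

def margin_relu(x, margin):
    """max(x, margin) with margin < 0: positives pass through, negatives are clipped at margin."""
    return np.maximum(x, margin)

def partial_l2(target, student):
    """Squared error, skipped where the student is already at or below a non-positive target."""
    skip = (student <= target) & (target <= 0)
    return float(np.sum(np.where(skip, 0.0, (target - student) ** 2)))

# Hypothetical pre-ReLU activations at four positions.
teacher = np.array([2.0, -3.0, -0.5, 1.0])
student = np.array([1.0, -4.0, 0.5, 1.0])
target = margin_relu(teacher, margin=-1.0)  # [2.0, -1.0, -0.5, 1.0]
loss = partial_l2(target, student)          # 1.0 + 0.0 + 1.0 + 0.0 = 2.0
```

The second position contributes nothing: the target is negative and the student is even more negative, so both would be zeroed by the following ReLU anyway.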

Extensions

Knowledge distillation was originally proposed mainly for classification. Later work extended it to detection and segmentation; two representative papers are listed below.

Detection

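Reference paper: CVPR 2019, Distilling Object Detectors with Fine-grained Feature Imitation
Motivation:
In classification, an image usually contains only the main object, which occupies a large share of the image, so feature distillation is applied to the whole feature map. In detection, an image contains many objects, each occupying a small share of the image, and there is a large amount of background.
Distilling the whole-image feature is too noisy and hard to learn from, so the paper applies feature distillation only in regions where objects are likely to appear (the paper derives these regions from the overlap between anchor boxes and ground-truth boxes).
Method:
L_{feature_distillation} = Mask * Distance(Transform_t(Feat_t), Transform_s(Feat_s)), i.e. feature distillation is applied only inside the region selected by the mask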
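
A NumPy sketch of the masked loss, assuming features shaped [b, c, h, w] and a binary mask shaped [b, h, w] that is already given. How the mask is computed is left out, and the normalization by twice the number of selected locations is an assumption of this sketch.

```python
import numpy as np

def masked_feature_loss(feat_t, feat_s, mask):
    """Squared L2 distance summed over channels, counted only where mask == 1."""
    sq = np.sum((feat_t - feat_s) ** 2, axis=1)  # [b, h, w]
    n = max(float(mask.sum()), 1.0)              # number of selected locations
    return float(np.sum(mask * sq)) / (2.0 * n)

# Toy example: one image, 2 channels, a 2x2 feature map, one selected location.
feat_t = np.zeros((1, 2, 2, 2))
feat_s = np.ones((1, 2, 2, 2))
mask = np.array([[[1.0, 0.0],
                  [0.0, 0.0]]])
loss = masked_feature_loss(feat_t, feat_s, mask)
```

Changing the student feature at any location where the mask is 0 leaves the loss unchanged, which is how background noise is kept out of the distillation signal.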

Segmentation

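Reference paper: CVPR 2019, Structured Knowledge Distillation for Semantic Segmentation
Segmentation is a structured prediction task, and the paper proposes distillation schemes that exploit the structural relationships in it:

Pair-wise distillation: compute the pixel-to-pixel similarity matrix of the features separately for the student and the teacher, then reduce the L2 distance between the two similarity matrices. The goal is for the student to learn, in its own feature space, a structure consistent with the teacher's, without having to mimic the teacher's feature space.
Holistic distillation: inspired by WGAN, reduce the Wasserstein distance between the score maps of the student and the teacher.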
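
The pair-wise term can be sketched in NumPy for a single image with features shaped [c, h, w]. Using cosine similarity between pixels and averaging the squared differences are assumptions of this sketch. The holistic term needs a trained discriminator and is not covered here.

```python
import numpy as np

def pixel_similarity(feat, eps=1e-12):
    """[c, h, w] -> [h*w, h*w] cosine similarity between every pair of pixels."""
    c = feat.shape[0]
    pix = feat.reshape(c, -1).T  # [h*w, c], one row per pixel
    pix = pix / (np.linalg.norm(pix, axis=1, keepdims=True) + eps)
    return pix @ pix.T

def pairwise_loss(feat_t, feat_s):
    """Mean squared difference between teacher and student pixel similarity matrices."""
    diff = pixel_similarity(feat_t) - pixel_similarity(feat_s)
    return float(np.mean(diff ** 2))

rng = np.random.default_rng(0)
feat_t = rng.normal(size=(16, 4, 4))  # teacher feature, 16 channels
feat_s = rng.normal(size=(8, 4, 4))   # student feature, 8 channels
loss = pairwise_loss(feat_t, feat_s)
```

As with the batch-level similarity in Similarity Preserving, the channel counts may differ because only the [h*w, h*w] matrices are compared.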
