Knowledge Distillation via Route Constrained Optimization

Motivation

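Existing KD methods all improve performance under one assumption: the teacher model can provide a form of weak supervision that a small student network is able to learn. In practice, the representation space of a converged large network is hard for the student to learn, which leads to a high congruence loss.
This paper therefore proposes a strategy called route constrained optimization: teacher parameters are selected along the route the teacher takes through parameter space, and they guide the student step by step.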

Method

Review

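MobileNet is the student (S) and ResNet-50 is the teacher (T). Using ResNet checkpoints taken at 10, 40, 120, and 240 epochs in turn as the teacher for MobileNet, we find that the better the T, the larger the loss of S. In other words, a better T is harder for S to learn from, although the performance of S does keep improving.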

RCO(Route Constrained Optimization)

The intermediate training states of the network are called anchor points.

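Algorithm:

Train T and save it at different training states: T1, T2, …, Tn
Randomly initialize S
Guide S with T1, and after training for a while switch to guiding S with T2
Continue until Tn has finished guiding S, which gives the final S
Note that when to switch T is still open at this point and is discussed below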
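
The steps above can be sketched as a small loop. This is only an illustration under assumptions: `rco_train`, `distill_one_epoch`, and the fixed `epochs_per_anchor` schedule are hypothetical names and choices that do not come from the paper, and the toy student and teachers are plain numbers so the example runs without a deep learning framework.

```python
from typing import Callable, Sequence, TypeVar

S = TypeVar("S")  # student state (hypothetical placeholder type)
T = TypeVar("T")  # teacher checkpoint (hypothetical placeholder type)


def rco_train(
    student: S,
    anchors: Sequence[T],
    distill_one_epoch: Callable[[S, T], S],
    epochs_per_anchor: int,
) -> S:
    """Guide the student with T1, then T2, ..., then Tn, in order.

    Assumption: every anchor is used for the same number of epochs.
    The notes leave the switching rule open at this point.
    """
    for teacher in anchors:
        for _ in range(epochs_per_anchor):
            student = distill_one_epoch(student, teacher)
    return student


# Toy stand-in for one epoch of distillation: the "student" is a number
# that moves halfway toward the current "teacher" value.
def halfway(student: float, teacher: float) -> float:
    return student + 0.5 * (teacher - student)


final_student = rco_train(0.0, [1.0, 2.0, 4.0], halfway, epochs_per_anchor=1)
```

In a real setup `distill_one_epoch` would run one epoch of distillation against the frozen teacher checkpoint, and the fixed schedule would be replaced by one of the anchor selection strategies described next.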

Strategy for Selecting Anchor Points

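Equal Epoch Interval Strategy: each T is trained for 4 epochs, so the anchor points are evenly spaced. This is fairly crude because it ignores how hard each T is to learn.
Greedy Search Strategy: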

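30, 100, and 180 denote the students S obtained by training under the guidance of T taken at those moments. 10,000 images are sampled at random and fed to each of these students to get their outputs, and the same images are fed to the teacher at different epochs (the learning rate is reduced at epochs 60 and 120) to get its outputs.

Plotting the resulting KL divergences shows that in the early stage of the teacher, the S guided by 30 learns quite well, but later on it can no longer keep up, especially each time the learning rate is reduced.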
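
The measurement described above can be sketched as follows. This is a minimal illustration under assumptions: the direction KL(teacher || student), the softmax over raw logits, and the function names `softmax`, `kl_divergence`, and `mean_kl` are choices made here and are not stated in the notes.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one image's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def mean_kl(
    teacher_logits: Sequence[Sequence[float]],
    student_logits: Sequence[Sequence[float]],
) -> float:
    """Average KL(teacher || student) over a batch of sampled images."""
    values = [
        kl_divergence(softmax(t), softmax(s))
        for t, s in zip(teacher_logits, student_logits)
    ]
    return sum(values) / len(values)


# Toy logits for two images and two classes (made-up numbers).
teacher_out = [[2.0, 0.0], [0.0, 1.0]]
student_out = [[0.0, 0.0], [0.0, 1.0]]
gap = mean_kl(teacher_out, student_out)
```

Computing `mean_kl` for each student against the teacher at every epoch would give one curve per student, which is the kind of plot the observation above is read from.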

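Based on this finding, the paper proposes the following strategy:

Compute the KL distance between S and the current T, and between S and the T at the next anchor point (the outputs are computed on a random sample of validation images)
Once the distance exceeds a certain threshold, switch T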
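
One possible reading of this rule is sketched below. It is an interpretation and not the paper's exact procedure: the function name `next_anchor`, the dictionary of precomputed distances keyed by teacher epoch, and the numbers in the example are all hypothetical.

```python
from typing import Dict, Optional


def next_anchor(
    kl_to_student: Dict[int, float],
    current_epoch: int,
    threshold: float,
) -> Optional[int]:
    """Pick the next teacher checkpoint to switch to.

    kl_to_student maps a teacher epoch to the KL distance between that
    teacher checkpoint and the current student, measured on a random
    sample of validation images. The first later checkpoint whose
    distance exceeds the threshold becomes the next anchor point.
    Returns None when no later checkpoint exceeds the threshold.
    """
    for epoch in sorted(kl_to_student):
        if epoch > current_epoch and kl_to_student[epoch] > threshold:
            return epoch
    return None


# Made-up distances for a student trained under the epoch-10 teacher.
distances = {10: 0.05, 20: 0.08, 30: 0.20, 40: 0.50}
chosen = next_anchor(distances, current_epoch=10, threshold=0.15)
```

With these toy numbers the student still tracks the epoch-20 teacher closely, so the switch happens at epoch 30, the first checkpoint it can no longer follow within the threshold.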

Experiment
