Attention Transfer

Paying More Attention to Attention: Improving the Performance of Convolutional Neural Networks via Attention Transfer

Motivation

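Many papers have shown that attention plays a major role in both CV and NLP, so this paper uses attention for knowledge distillation (KD): the student is trained to reproduce the teacher's attention maps.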

Activation-based attention transfer

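Attention is defined here as a spatial attention map, computed from a layer's activation tensor in one of three ways:

Sum over channels of the absolute values at each spatial position
Sum over channels of the p-th power of the absolute values at each spatial position: compared with the first mapping, this puts more weight on high-response locations
Maximum over channels of the p-th power of the absolute values at each spatial position

The three mappings emphasize different things; the latter two focus more on the positions with the most prominent responses.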
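The three mappings can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code: nested lists stand in for a C x H x W activation tensor, and the function name `spatial_attention` and its `mode` argument are made up for this example.

```python
def spatial_attention(activations, p=2, mode="sum"):
    """Collapse a C x H x W activation tensor into an H x W attention map.

    mode="sum" adds |a|**p over channels; mode="max" takes the channel-wise
    maximum of |a|**p. With p=1 and mode="sum" this is the plain sum of
    absolute values.
    """
    if mode not in ("sum", "max"):
        raise ValueError("mode must be 'sum' or 'max'")
    reduce = sum if mode == "sum" else max
    height, width = len(activations[0]), len(activations[0][0])
    return [
        [reduce(abs(channel[i][j]) ** p for channel in activations)
         for j in range(width)]
        for i in range(height)
    ]


# Toy tensor (assumed values): 2 channels, 1 x 2 spatial positions.
acts = [
    [[1.0, -3.0]],   # channel 0
    [[1.0, 0.5]],    # channel 1
]
print(spatial_attention(acts, p=1))              # sum of absolute values
print(spatial_attention(acts, p=2))              # sum of squares
print(spatial_attention(acts, p=2, mode="max"))  # max of squares
```

On this toy input the second position has one strong response (-3.0). Raising to the power p = 2 widens the gap between the two positions compared with the plain absolute sum, which is the "more weight on high-response locations" effect described above.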

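The final loss (Eq. 2 in the paper) adds an attention transfer term to the student's standard task loss:

$$\mathcal{L}_{AT} = \mathcal{L}(W_S, x) + \frac{\beta}{2} \sum_{j \in \mathcal{I}} \left\| \frac{Q_S^j}{\|Q_S^j\|_2} - \frac{Q_T^j}{\|Q_T^j\|_2} \right\|_p$$

Here Q_S^j and Q_T^j are the vectorized attention maps of the j-th student/teacher pair, each L2-normalized before the difference is taken.

beta is set to 1000. Because the second term is averaged over all positions, its effective weight ends up at around 0.1.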
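A rough sketch of how such a loss could be computed is shown below. It is written under stated assumptions rather than taken from the paper: attention maps are plain H x W nested lists, the helper names are hypothetical, and the "average over all positions" is implemented as the mean of squared differences between the L2-normalized maps.

```python
import math


def normalized_map(attention_map):
    """Flatten an H x W map and scale it to unit L2 norm."""
    flat = [v for row in attention_map for v in row]
    norm = math.sqrt(sum(v * v for v in flat))
    return [v / norm for v in flat] if norm > 0 else flat


def at_term(student_map, teacher_map):
    """Mean squared difference between the two normalized maps (assumed form)."""
    q_s = normalized_map(student_map)
    q_t = normalized_map(teacher_map)
    return sum((s - t) ** 2 for s, t in zip(q_s, q_t)) / len(q_s)


def at_loss(task_loss, student_maps, teacher_maps, beta=1000.0):
    """Task loss plus beta/2 times the summed per-pair attention terms."""
    pairs = zip(student_maps, teacher_maps)
    return task_loss + beta / 2 * sum(at_term(s, t) for s, t in pairs)


# Maps that attend to opposite positions are maximally penalized.
print(at_term([[1.0, 0.0]], [[0.0, 1.0]]))  # 1.0
# Maps that differ only by a scale factor are (numerically) not penalized.
print(at_term([[1.0, 2.0], [3.0, 4.0]], [[2.0, 4.0], [6.0, 8.0]]))
```

The normalization step is what makes the term compare the shape of the two maps rather than their magnitudes, so a student with smaller activations is not penalized as long as it looks at the same places.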

Gradient-based attention transfer

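Gradient-based attention measures how sensitive the network is to the input at each position. For example, perturb the pixels at some position and observe how the network output changes: if the output changes a lot, the network is paying more attention to that position.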
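The perturb-and-observe idea can be illustrated with finite differences on a toy function. This is only a sketch of the intuition: `toy_net` is a hypothetical stand-in for a real network, and in practice such sensitivities are obtained with backpropagation rather than by perturbing one pixel at a time.

```python
def sensitivity_map(net, image, eps=1e-4):
    """Estimate |d net / d pixel| for every pixel with central differences."""
    attention = []
    for i, row in enumerate(image):
        attention_row = []
        for j, value in enumerate(row):
            # Copy the image twice and nudge a single pixel up and down.
            bumped_up = [r[:] for r in image]
            bumped_down = [r[:] for r in image]
            bumped_up[i][j] = value + eps
            bumped_down[i][j] = value - eps
            slope = (net(bumped_up) - net(bumped_down)) / (2 * eps)
            attention_row.append(abs(slope))
        attention.append(attention_row)
    return attention


def toy_net(image):
    """Hypothetical scalar 'network' that mostly looks at the top-left pixel."""
    return 5.0 * image[0][0] + 0.1 * image[1][1]


image = [[0.2, 0.7], [0.4, 0.9]]
for row in sensitivity_map(toy_net, image):
    print([round(v, 3) for v in row])
```

The resulting map is largest at the top-left pixel, small at the bottom-right one, and zero at the two pixels the toy function ignores, which matches the reading of sensitivity as attention.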

Experiments

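Methods compared: activation-based AT and F-ActT (full-activation transfer; similar to FitNets, a 1x1 convolution performs feature adaptation and an L2 loss is then applied to the full feature maps)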
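For contrast with attention transfer, the full-activation baseline can be sketched as follows. The details are assumptions made for illustration: the 1x1 convolution is applied to the student features, it has no bias, and the L2 loss is taken as a mean over all elements. The function names are hypothetical.

```python
def conv1x1(features, weights):
    """Apply a 1x1 convolution: mix channels independently at each position.

    features: C_in x H x W nested lists; weights: C_out x C_in.
    """
    height, width = len(features[0]), len(features[0][0])
    return [
        [[sum(w * features[c][i][j] for c, w in enumerate(w_row))
          for j in range(width)]
         for i in range(height)]
        for w_row in weights
    ]


def l2_feature_loss(student_features, teacher_features, weights):
    """Mean squared error between adapted student features and teacher features."""
    adapted = conv1x1(student_features, weights)
    diffs = [
        (a - t) ** 2
        for a_ch, t_ch in zip(adapted, teacher_features)
        for a_row, t_row in zip(a_ch, t_ch)
        for a, t in zip(a_row, t_row)
    ]
    return sum(diffs) / len(diffs)


# Toy shapes (assumed): the student has 1 channel, the teacher has 2.
student = [[[1.0, 2.0]]]
teacher = [[[2.0, 4.0]], [[-1.0, -2.0]]]
print(l2_feature_loss(student, teacher, [[2.0], [-1.0]]))  # 0.0
print(l2_feature_loss(student, teacher, [[1.0], [1.0]]))   # 6.25
```

Unlike the attention maps, this loss matches every channel of the feature tensor, which is why an adaptation layer is needed whenever the student and teacher have different channel counts.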

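The sum of squares (p = 2) mapping works best.

Activation-based AT outperforms gradient-based AT.

In addition, on the Scenes dataset AT does much better than conventional KD. The authors "speculate [this] is due to importance of intermediate attention for fine-grained recognition".

This looks like a slip by the authors, though: of the datasets used here, CUB is the fine-grained one, not Scenes.

Important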

KD struggles to work if teacher and student have different architecture/depth (we observe the same on CIFAR), so we tried using the same architecture and depth for attention transfer.

We also could not find applications of FitNets, KD or similar methods on ImageNet in the literature. Given that, we can assume that proposed activation-based AT is the first knowledge transfer method to be successfully applied on ImageNet.
