ProxylessNAS: Direct Neural Architecture Search on Target Task and Hardware

Related Work

Proxy Tasks:

1. Train on a smaller dataset.

2. Learn only a few small blocks.

3. Train for only a few epochs.

Motivation

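1. NAS methods based on proxy tasks do not account for hardware metrics such as latency.

2. Existing NAS methods search for a single block and stack it repeatedly to build the final network, whereas in practice every block could be different.

3. DARTS creatively couples architecture and weights so that both are optimized jointly with SGD, but it still learns blocks on proxy tasks and stacks them, and it suffers from high GPU memory consumption.
$m_O^{\text{DARTS}}(x)=\sum_{i=1}^{N}p_i o_i(x)=\sum_{i=1}^{N}\frac{\exp(\alpha_i)}{\sum_j\exp(\alpha_j)}o_i(x)$

4. Drop Path is a method that can find compact and efficient network architectures.
$m_O^{\text{One-Shot}}(x)=\sum_{i=1}^{N}o_i(x)$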
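The two mixed-operation formulas above can be compared with a small NumPy sketch. The candidate operations here are hypothetical stand-ins (a scaling, an identity, and a zero op), not the convolution and pooling layers used in the paper. Both variants evaluate all $N$ candidates, which is where the memory cost comes from.

```python
import numpy as np

def softmax(alpha):
    """Numerically stable softmax over architecture parameters."""
    z = np.exp(alpha - np.max(alpha))
    return z / z.sum()

# Hypothetical candidate operations on a 1-D feature vector.
candidate_ops = [
    lambda x: 2.0 * x,           # stand-in for a learned op
    lambda x: x,                 # identity
    lambda x: np.zeros_like(x),  # zero
]

def mixed_op_darts(x, alpha):
    """Weighted sum of all N candidate outputs, with weights softmax(alpha)."""
    p = softmax(alpha)
    return sum(p_i * op(x) for p_i, op in zip(p, candidate_ops))

def mixed_op_one_shot(x):
    """Plain sum of all N candidate outputs."""
    return sum(op(x) for op in candidate_ops)

x = np.array([1.0, -2.0])
alpha = np.zeros(3)  # uniform path weights, p_i = 1/3
darts_out = mixed_op_darts(x, alpha)
one_shot_out = mixed_op_one_shot(x)
```

With uniform weights the DARTS output is the average of the three candidate outputs, while the One-Shot output is their sum.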

Overview

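This paper proposes a path-level pruning approach.

**Step 1:** Directly train an over-parameterized network that contains all candidate paths.

**Step 2:** During training, introduce architecture parameters to learn which paths are redundant.

**Step 3:** To obtain the final compact architecture, prune the redundant paths.

Open problems:

As the number of candidate operations grows, the over-parameterized network gets larger and GPU memory usage increases.
- Solution: following the idea of "BinaryConnect", binarize the architecture parameters so that only one path is active at run time.
Hardware objectives (e.g. latency) are not differentiable.
- Model network latency as a continuous function and optimize it together with the regular loss.
- Alternatively, use a REINFORCE-based algorithm.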

Contributions

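1. Trains and searches directly on large-scale datasets, without relying on any proxy task.

2. Allows searching over a large candidate set.

3. Removes the constraint that a network must be built from repeated, stacked blocks.

4. Proposes a pruning perspective on NAS and shows the close relationship between NAS and model compression. Through path pruning, the method saves roughly one order of magnitude of computation.

5. Proposes a gradient-based approach built on a latency regularization loss to handle hardware objectives.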

Method

Construction of Over-Parameterized Network

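Question: To make the search possible, the whole over-parameterized network must be defined first.

Solution: Every edge is a mixed operation over candidate paths, and the whole network is learned rather than a single block.

The network is written as $N(e_1=m^1_O,\dots,e_n=m^n_O)$, where:

$e_i$ denotes a particular edge of a directed acyclic graph (DAG).
$O=\{o_i\}$ is a set of $N$ candidate primitive operations (convolution, pooling, identity, zero).
$m_O$ is a mixed operation that places $N$ parallel paths on each edge.

Therefore, for an input $x$, the output of the mixed operation $m_O$ is determined by the outputs of its $N$ paths.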

Learning Binarized Path

**Question:** As the space of candidate paths grows, the cost of computing over all of them keeps increasing.

**Solution:** Binarize the paths.

During training, ProxylessNAS activates only one of the many paths, so all paths are binarized:
$g=\text{binarize}(p_1,\dots,p_N)=\begin{cases}[1,0,\dots,0] & \text{with probability } p_1\\ \dots\\ [0,0,\dots,1] & \text{with probability } p_N\end{cases}$

For a given binary gate $g$:
$m_O^{\text{Binary}}(x)=\sum_{i=1}^{N}g_i o_i(x)=\begin{cases}o_1(x) & \text{with probability } p_1\\ \dots\\ o_N(x) & \text{with probability } p_N\end{cases}$

[Note]: The architecture parameters do not appear directly in the computation of $m_O^{\text{Binary}}$, so they cannot be optimized directly through this formula. Below, the approach from DARTS is borrowed to convert the problem for the architecture parameters.
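As an illustration of the binarize step and the binary mixed operation, here is a minimal NumPy sketch. The function names `binarize` and `mixed_op_binary` and the toy candidate operations are assumptions made for this example, not the paper's code.

```python
import numpy as np

def softmax(alpha):
    """Path weights p_i computed from architecture parameters alpha_i."""
    z = np.exp(alpha - np.max(alpha))
    return z / z.sum()

def binarize(p, rng):
    """Sample a one-hot gate g, where g[i] = 1 with probability p[i]."""
    g = np.zeros_like(p)
    g[rng.choice(len(p), p=p)] = 1.0
    return g

def mixed_op_binary(x, g, ops):
    """Evaluate only the active path, so only one output is held in memory."""
    active = int(np.argmax(g))
    return ops[active](x)

rng = np.random.default_rng(0)
# Hypothetical candidate operations: scaling, identity, zero.
ops = [lambda x: 2.0 * x, lambda x: x, lambda x: np.zeros_like(x)]
alpha = np.array([1.0, 0.0, -1.0])
p = softmax(alpha)
g = binarize(p, rng)
y = mixed_op_binary(np.array([1.0, -2.0]), g, ops)
```

Note that `alpha` only influences which gate is sampled. It does not appear in the forward computation of `y`, which is the point made in the note above.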

Training Binarized Architecture Parameters

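**Question:** With the network defined as above, we must decide, as in earlier NAS work, how to update the architecture parameters and the weights.

**Solution:** A BinaryConnect-style scheme that alternates between weight updates and architecture updates.

Training the $weight$ parameters:
- Freeze the architecture parameters.
- Sample the binary gates at random according to Eq. (3), the binarize formula.
- Update the weights of the paths activated by the binary gates with SGD on the training set.
Training the architecture parameters:
- Freeze the $weight$ parameters.
- Reset the binary gates.
- Update the architecture parameters on the validation set.

[Note]: The two update steps above must be performed alternately.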
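The alternating procedure can be sketched on a toy problem with a single edge. Everything here is an assumption made for illustration: the two candidates $w_0 x$ and $w_1 x^2$, the target $y=3x$, the learning rates, and the use of one toy data generator for both "training" and "validation" batches. The architecture step uses the softmax-based gradient given later in this section.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(alpha):
    z = np.exp(alpha - np.max(alpha))
    return z / z.sum()

# Toy candidates on one edge: o_0(x) = w0 * x and o_1(x) = w1 * x**2.
features = [lambda x: x, lambda x: x ** 2]
w = np.array([0.1, 0.1])   # one weight per candidate path
alpha = np.zeros(2)        # architecture parameters
lr_w, lr_alpha = 0.05, 0.05

def make_batch(n=32):
    """Toy data generator, used for both training and validation batches."""
    x = rng.uniform(-1.0, 1.0, size=n)
    return x, 3.0 * x      # toy target: y = 3x

for step in range(200):
    # Weight step: freeze alpha, sample a gate, update only the active path.
    p = softmax(alpha)
    k = rng.choice(2, p=p)
    x, y = make_batch()
    err = w[k] * features[k](x) - y            # residual for L = 0.5 * mean(err^2)
    w[k] -= lr_w * np.mean(err * features[k](x))

    # Architecture step: freeze w, resample the gate, update alpha.
    p = softmax(alpha)
    k = rng.choice(2, p=p)
    x, y = make_batch()
    err = w[k] * features[k](x) - y
    # dL/dg_j = mean(err * o_j(x)), evaluated for every candidate j.
    dL_dg = np.array([np.mean(err * w[j] * features[j](x)) for j in range(2)])
    # dL/dalpha_i = sum_j dL/dg_j * p_j * (delta_ij - p_i)
    grad_alpha = p * (dL_dg - np.dot(p, dL_dg))
    alpha -= lr_alpha * grad_alpha

final_p = softmax(alpha)
```

The architecture step in this sketch still evaluates every candidate to obtain `dL_dg`, which is exactly the memory issue that the two-path sampling described later is meant to remove.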

Once the architecture parameters are trained, the compact architecture is obtained by pruning the redundant paths [this uses a Drop Path style operation]. For simplicity, the paper keeps the path with the highest path weight as the final architecture.

The architecture parameters are updated with the following gradient, where $\frac{\partial L}{\partial g_j}$ stands in for $\frac{\partial L}{\partial p_j}$:
$\frac{\partial L}{\partial \alpha_i}\approx\sum_{j=1}^{N}\frac{\partial L}{\partial g_j}\frac{\partial p_j}{\partial \alpha_i}=\sum_{j=1}^{N}\frac{\partial L}{\partial g_j}\frac{\partial \left(\frac{\exp(\alpha_j)}{\sum_k\exp(\alpha_k)}\right)}{\partial\alpha_i}=\sum_{j=1}^{N}\frac{\partial L}{\partial g_j}p_j(\delta_{ij}-p_i)$
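The last equality in the gradient formula is just the softmax Jacobian, $\frac{\partial p_j}{\partial \alpha_i}=p_j(\delta_{ij}-p_i)$. The sketch below checks it numerically with finite differences. The values of `alpha` and `dL_dg` are made up for the example, with `dL_dg` standing in for gate gradients that backpropagation would supply.

```python
import numpy as np

def softmax(alpha):
    z = np.exp(alpha - np.max(alpha))
    return z / z.sum()

def arch_grad(alpha, dL_dg):
    """dL/dalpha_i = sum_j dL/dg_j * p_j * (delta_ij - p_i)."""
    p = softmax(alpha)
    jac = np.diag(p) - np.outer(p, p)  # jac[j, i] = dp_j/dalpha_i
    return jac.T @ dL_dg

alpha = np.array([0.5, -0.3, 1.2])
dL_dg = np.array([0.2, -1.0, 0.7])     # hypothetical gate gradients

grad = arch_grad(alpha, dL_dg)

# Central finite differences of the surrogate sum_j dL/dg_j * p_j(alpha).
eps = 1e-6
numeric = np.array([
    (softmax(alpha + eps * e) - softmax(alpha - eps * e)) @ dL_dg / (2 * eps)
    for e in np.eye(3)
])
```

The components of `grad` sum to zero, because adding the same constant to every $\alpha_i$ leaves the softmax unchanged.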

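**Question:** Note that $\frac{\partial L}{\partial g_j}$ depends on $o_j(x)$. It can be computed by backpropagation, but every $o_j(x)$ must then be computed and kept in memory.

**Solution:** The paper factorizes the task of choosing one path out of $N$ candidates into multiple binary selection tasks.

Step 1: Start an architecture parameter update.
Step 2: Sample two paths from the multinomial distribution $(p_1,p_2,\dots,p_N)$ and mask all the others [a mask is used here].

These two steps reduce the number of candidate operations from $N$ to 2. The path weights and binary gates are reset accordingly.
Step 3: Update the architecture parameters of the two sampled paths using Eq. (5), the gradient formula.
Step 4: Because path weights are computed by applying softmax to the architecture parameters, rescale the two updated architecture parameters by a ratio factor so that the path weights of the unsampled paths stay unchanged.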

[Effect]:

Sampled Path-A: path weight increases
Sampled Path-B: path weight decreases
Other paths: path weights unchanged
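The two-path update and the rescaling step can be sketched as follows. This is a minimal illustration under stated assumptions: `two_path_update` is a hypothetical helper, the gate gradients are made-up numbers rather than results of backpropagation, and the rescaling is implemented as a shift of the two parameters, which multiplies their exponentials by a common ratio.

```python
import math
import numpy as np

def softmax(alpha):
    z = np.exp(alpha - np.max(alpha))
    return z / z.sum()

def two_path_update(alpha, dL_dg_pair, pair, lr):
    """Update only the two sampled architecture parameters, then rescale them.

    alpha      : all N architecture parameters
    dL_dg_pair : gate gradients of the two sampled paths (hypothetical values)
    pair       : indices of the two sampled paths
    """
    alpha = alpha.copy()
    idx = np.array(pair)
    old = alpha[idx].copy()
    q = softmax(old)                                 # path weights within the pair
    grad = q * (dL_dg_pair - np.dot(q, dL_dg_pair))  # gradient formula with N = 2
    alpha[idx] -= lr * grad
    # Rescale: shift both updated parameters so that exp(a_i) + exp(a_j) is
    # unchanged, which keeps the softmax weights of unsampled paths fixed.
    offset = math.log(np.exp(alpha[idx]).sum() / np.exp(old).sum())
    alpha[idx] -= offset
    return alpha

rng = np.random.default_rng(0)
alpha = np.array([0.2, -0.5, 1.0, 0.3])
p_before = softmax(alpha)
# Sample two distinct paths according to the path weights.
pair = rng.choice(len(alpha), size=2, replace=False, p=p_before)
new_alpha = two_path_update(alpha, np.array([-1.0, 1.0]), pair, lr=0.1)
p_after = softmax(new_alpha)
```

With these made-up gradients, the first sampled path gains weight, the second loses weight, and every other path keeps its weight, matching the effect listed above.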

Handling Non-Differentiable hardware Metrics

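Question: Latency is not differentiable.

Solution: Build a differentiable function of latency so that the network can be optimized with SGD through backpropagation.

Step 1: Make latency differentiable.

Assume:

$\{o_j\}$: the set of candidate operations of a mixed operation

$p_j$: the path weight associated with $o_j$, i.e. the probability of choosing $o_j$

$\mathbb{E}[\text{latency}_i]$: the expected latency of the $i$-th block

$F(\cdot)$: the latency prediction model

$\mathbb{E}[\text{latency}_i]=\sum_j p^i_j\times F(o^i_j)$
$\frac{\partial \mathbb{E}[\text{latency}_i]}{\partial p^i_j}=F(o^i_j)$

For the whole network:
$\mathbb{E}[\text{latency}]=\sum_i\mathbb{E}[\text{latency}_i]$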
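A small worked example of the expected latency. The latency table below is invented for illustration. In practice $F(\cdot)$ would be a latency prediction model for the target hardware.

```python
import numpy as np

def softmax(alpha):
    """Row-wise softmax: one distribution over candidate ops per block."""
    z = np.exp(alpha - alpha.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

# Hypothetical predicted latencies F(o_j^i) in ms:
# one row per block i, one column per candidate op j.
latency_table = np.array([
    [4.0, 2.0, 0.0],
    [6.0, 3.0, 0.0],
])
alphas = np.zeros((2, 3))  # uniform path weights in both blocks

def expected_latency(alphas, latency_table):
    """Return E[latency] and dE/dp, which equals F(o_j^i) entry by entry."""
    p = softmax(alphas)
    per_block = (p * latency_table).sum(axis=1)  # E[latency_i]
    return per_block.sum(), latency_table

total, dE_dp = expected_latency(alphas, latency_table)
```

With uniform path weights the two blocks contribute $(4+2+0)/3=2$ ms and $(6+3+0)/3=3$ ms, so `total` is 5 ms. Because the expectation is linear in $p^i_j$, its gradient with respect to each path weight is simply that operation's predicted latency.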

Network Loss

$Loss=Loss_{CE}+\lambda_1\|w\|^2_2+\lambda_2\mathbb{E}[\text{latency}]$
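The loss above can be evaluated term by term, as in the sketch below. The logits, weights, expected latency, and the values of $\lambda_1$ and $\lambda_2$ are all made-up numbers chosen so the arithmetic is easy to follow.

```python
import numpy as np

def cross_entropy(logits, label):
    """Cross-entropy of one sample from raw logits (log-sum-exp form)."""
    shifted = logits - logits.max()
    return float(np.log(np.exp(shifted).sum()) - shifted[label])

def total_loss(logits, label, weights, exp_latency, lam1, lam2):
    """Loss = Loss_CE + lam1 * ||w||_2^2 + lam2 * E[latency]."""
    weight_decay = lam1 * float(np.sum(weights ** 2))
    latency_term = lam2 * exp_latency
    return cross_entropy(logits, label) + weight_decay + latency_term

logits = np.array([0.0, 0.0])   # two-class toy logits, so Loss_CE = ln 2
weights = np.array([1.0, 2.0])  # toy network weights, ||w||^2 = 5
loss = total_loss(logits, 0, weights, exp_latency=5.0, lam1=0.1, lam2=0.2)
```

Here the three terms are $\ln 2$, $0.1\times 5=0.5$, and $0.2\times 5=1.0$. A larger $\lambda_2$ pushes the search toward lower expected latency at the expense of the cross-entropy term.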

Reinforce-Based Approach

Question: An alternative way to optimize the binary gates.

**Solution:** REINFORCE can be used to train the binarized parameters.

See the paper for the detailed derivation.
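For orientation, here is a generic REINFORCE (score-function) sketch for a categorical choice over paths. It is not the paper's derivation. The per-path rewards are invented, and a real reward would come from evaluating the sampled network, for example accuracy combined with a latency penalty.

```python
import numpy as np

def softmax(alpha):
    z = np.exp(alpha - np.max(alpha))
    return z / z.sum()

def reinforce_grad(alpha, reward_fn, num_samples, rng):
    """Monte Carlo estimate of grad_alpha E[R(g)] = E[R(g) * grad_alpha log p(g)]."""
    p = softmax(alpha)
    grad = np.zeros_like(alpha)
    for _ in range(num_samples):
        k = rng.choice(len(alpha), p=p)  # sample one active path
        score = -p.copy()
        score[k] += 1.0                  # grad_alpha log p_k = onehot(k) - p
        grad += reward_fn(k) * score
    return grad / num_samples

rng = np.random.default_rng(0)
rewards = np.array([1.0, 0.2, -0.5])     # hypothetical reward of each path
alpha = np.zeros(3)
for _ in range(300):
    # Gradient ascent on the expected reward.
    alpha += 0.1 * reinforce_grad(alpha, lambda k: rewards[k], 16, rng)
p_final = softmax(alpha)
```

Because the update only needs a scalar reward per sampled gate, the reward itself does not have to be differentiable, which is why this route also works for hardware metrics.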

Experiments

Conclusion

GPUs prefer shallow and wide models with early pooling.
CPUs prefer deep and narrow models with late pooling.
Pooling layers prefer large and wide kernels.
Early layers prefer small kernels.
Late layers prefer large kernels.

[Figures omitted: GPU, CPU, Mobile]
