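These are partial reading notes on the book "Deep Learning from Scratch: Python-Based Theory and Implementation" (《深度學習入門：基於Python的理論與實現》), supplemented with material from Andrew Ng's deep learning video lectures.
The code and figures are adapted from the book.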

Regularization

With regularization, training a bigger network almost never hurts.

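Overfitting

Overfitting (high variance) has two main causes:

The model has a large number of parameters and high expressive power
The training data is scarce

Ways to address overfitting:

Get more data
Regularization
Find a more appropriate neural network architecture

Model diagnosis (diagnose high bias/variance)

Judge from the prediction errors on the training set and the test set
Fix the high bias problem first (bigger network, train longer…), then tackle overfitting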
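
The diagnosis rule above can be written down as a small helper. This is only a sketch: the function name `diagnose`, the `target_error` baseline, and the `tolerance` threshold of 0.02 are assumptions made for illustration, not values from the book or the lectures.

```python
def diagnose(train_error, heldout_error, target_error=0.0, tolerance=0.02):
    """Label a model as high bias and/or high variance from its error rates."""
    problems = []
    # Training error far above the achievable (target) error: underfitting.
    if train_error - target_error > tolerance:
        problems.append("high bias")
    # Held-out error far above training error: overfitting.
    if heldout_error - train_error > tolerance:
        problems.append("high variance")
    return problems or ["ok"]


print(diagnose(0.01, 0.11))  # low training error, large gap
print(diagnose(0.15, 0.16))  # high training error, small gap
print(diagnose(0.15, 0.30))  # both problems at once
```

Following the order suggested above, a model reported as both would first get a bigger network or longer training, and only then regularization or more data.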

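To provoke overfitting deliberately, only 300 samples are used as training data and the network is given 6 hidden layers; it is then trained without any regularization:

As can be seen, the network overfits very clearly.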

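Weight decay (L2 Regularization)

Weight decay suppresses overfitting by penalizing large weights during training. Much overfitting happens precisely because the weight parameters take large values.

The goal of training a neural network is to reduce the value of the loss function. Adding the squared L2 norm of the weights to the loss function therefore discourages the weights from growing. Concretely, $\frac {1}{2} \lambda W^2$ is added to the loss function, where $W^2$ denotes the sum of the squared elements of $W$. Here $\lambda$ is a hyperparameter that controls the strength of the regularization: the larger $\lambda$ is, the heavier the penalty on large weights. The leading $\frac {1}{2}$ is a constant chosen so that the derivative of $\frac {1}{2} \lambda W^2$ becomes $\lambda W$.

Notes: with weight decay, the derivative of the regularization term, $\lambda W$, must be added to the weight gradient obtained by backpropagation
Usually only $W$ is decayed, not $b$
The squared L2 norm is the sum of the squares of the elements. For weights $W = (w_1, w_2, \dots, w_n)$, the L2 norm is $\sqrt {w_1^2 + w_2^2 + \dots + w_n^2}$, so the penalty term uses its square, $w_1^2 + w_2^2 + \dots + w_n^2$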
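
As a quick sanity check on the notes above, the claim that the derivative of $\frac {1}{2} \lambda W^2$ is $\lambda W$ can be verified numerically. The sketch below uses a central-difference gradient; the helper names `l2_penalty` and `numerical_gradient` are made up for this example.

```python
import numpy as np


def l2_penalty(W, lam):
    """Penalty term: 0.5 * lam * (sum of squared weights)."""
    return 0.5 * lam * np.sum(W ** 2)


def numerical_gradient(f, W, h=1e-4):
    """Central-difference gradient of a scalar function f with respect to W."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(W.shape):
        original = W[idx]
        W[idx] = original + h
        f_plus = f(W)
        W[idx] = original - h
        f_minus = f(W)
        W[idx] = original  # restore the entry before moving on
        grad[idx] = (f_plus - f_minus) / (2 * h)
    return grad


rng = np.random.default_rng(0)
W = rng.normal(size=(3, 2))
lam = 0.1

numeric = numerical_gradient(lambda w: l2_penalty(w, lam), W)
analytic = lam * W  # the derivative claimed in the text
max_diff = np.max(np.abs(numeric - analytic))
print(max_diff)  # should be tiny; the two gradients agree
```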

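Why does weight decay suppress overfitting?

Intuition: when $\lambda$ is large enough, $W$ becomes very small, which largely removes the influence of many neurons and turns high variance into high bias; somewhere in between there is a suitable $\lambda$ that balances variance and bias
A little analysis: when $\lambda$ is large enough, $W$ is very small, so the pre-activation $WX + b$ is also small. With a sigmoid activation, the activations then stay mostly in its near-linear region, the whole network behaves almost like a linear network, and overfitting is suppressed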

Pseudocode:

# lam is the regularization strength λ (lambda itself is a reserved word in Python)
# forward
for each layer:
    loss += 0.5 * lam * np.sum(w**2)

# backward
for each layer:
    dw += lam * w

Partial implementation (meant to illustrate the flow of weight decay; the complete code will be given in a later post):

# forward
weight_decay = 0
for idx in range(1, self.hidden_layer_num):
    W = self.params['W' + str(idx)]
    weight_decay += 0.5 * self.weight_decay_lambda * np.sum(W**2)

return self.last_layer.forward(y, t) + weight_decay

# backward
grads = {}
for idx in range(1, self.hidden_layer_num):
    grads['W' + str(idx)] = self.layers['Affine' + str(idx)].dW + self.weight_decay_lambda * self.params['W' + str(idx)]
    grads['b' + str(idx)] = self.layers['Affine' + str(idx)].db

With weight decay, the overfitting is clearly suppressed.
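
The forward and backward steps above can also be tried on a model small enough to run on its own. The sketch below is not the network from the book: it assumes a plain linear model trained by gradient descent on synthetic data, and `train_linear` is a hypothetical helper. It only illustrates that adding $\lambda W$ to the gradient keeps the weights smaller.

```python
import numpy as np


def train_linear(X, t, lam, lr=0.05, steps=500):
    """Gradient descent on 0.5 * mean squared error + 0.5 * lam * ||w||^2."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        y = X @ w
        # backward: data gradient plus the derivative of the penalty, lam * w
        dw = X.T @ (y - t) / len(t) + lam * w
        w -= lr * dw
    return w


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))             # few samples, many parameters
t = X[:, 0] + 0.5 * rng.normal(size=20)   # only the first feature carries signal

w_plain = train_linear(X, t, lam=0.0)
w_decay = train_linear(X, t, lam=1.0)
print(np.linalg.norm(w_plain), np.linalg.norm(w_decay))
```

The second norm comes out smaller than the first: the penalty pulls every weight toward zero at each step, which is where the name weight decay comes from.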

Dropout

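Reference: https://arxiv.org/abs/1207.0580

When the network model becomes very complex, weight decay alone is no longer enough to deal with overfitting. In that situation Dropout is often used.

Of course, if there is no overfitting, do not use Dropout (it costs accuracy).

Dropout is a method that randomly removes neurons during training, and it is generally applied to fully connected layers. During training, hidden-layer neurons are selected at random and removed; a removed neuron no longer passes on any signal. A new random set of neurons is selected each time data passes through the network. At test time, the signals of all neurons are passed on, but the output of each neuron is multiplied by the proportion of neurons that were kept during training (1 minus the dropout ratio), which keeps the expected output of the layer unchanged.

Why does Dropout work?

It stops the network from relying on any single neuron and forces the weights to be spread across neurons, which is similar to reducing $||W||^2$
It can be seen as averaging many models, and it reduces the dependencies between neurons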

This prevents complex co-adaptations in which a feature detector is only helpful in the context of several other specific feature detectors. Instead, each neuron learns to detect a feature that is generally helpful for producing the correct answer given the combinatorially large variety of internal contexts in which it must operate.

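Drawback of Dropout:

With Dropout, the loss function no longer decreases at every iteration, so it loses its clear meaning as a progress signal. When checking that the network is implemented correctly, remove the Dropout layers first.

Choosing dropout_ratio:

0.5 is the usual choice
Different layers can also use different ratios; for example, a layer with a larger weight matrix may warrant a larger dropout_ratio (a larger weight matrix overfits more easily)

Implementation: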

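import numpy as np

class Dropout:
    def __init__(self, dropout_ratio=0.5):  # dropout_ratio is usually set to 0.5
        self.dropout_ratio = dropout_ratio
        self.mask = None

    def forward(self, x, train_flg=True):
        if train_flg:
            # Keep each unit with probability 1 - dropout_ratio (mask entries are 0 or 1)
            retain_prob = 1 - self.dropout_ratio
            self.mask = np.random.binomial(1, p=retain_prob, size=x.shape)
            # Alternative with a uniform draw: self.mask = np.random.rand(*x.shape) > self.dropout_ratio
            return x * self.mask
        else:
            # At test time, scale by the retain probability to keep the expected output unchanged
            return x * (1 - self.dropout_ratio)

    def backward(self, dout):
        # Gradients flow only through the units that were kept in the forward pass
        return dout * self.mask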

With Dropout, the overfitting is also suppressed.
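
The class above rescales the output at test time. A common variant, inverted dropout (the form taught in Andrew Ng's course), rescales during training instead, so the test-time forward pass does nothing. The sketch below is that variant, not the book's implementation; the class name `InvertedDropout` and the `seed` argument are assumptions added for this example.

```python
import numpy as np


class InvertedDropout:
    def __init__(self, dropout_ratio=0.5, seed=None):
        self.dropout_ratio = dropout_ratio
        self.rng = np.random.default_rng(seed)
        self.mask = None

    def forward(self, x, train_flg=True):
        if not train_flg:
            return x  # nothing to rescale at test time
        keep_prob = 1.0 - self.dropout_ratio
        # Kept units are scaled by 1 / keep_prob so the expected output equals x.
        self.mask = (self.rng.random(x.shape) < keep_prob) / keep_prob
        return x * self.mask

    def backward(self, dout):
        # Same mask (including the scaling) is applied to the gradient.
        return dout * self.mask


layer = InvertedDropout(dropout_ratio=0.5, seed=0)
x = np.ones((1000, 100))
train_out = layer.forward(x, train_flg=True)
test_out = layer.forward(x, train_flg=False)
print(train_out.mean(), test_out.mean())  # both close to 1
```

Either form keeps the expected value of the layer output the same in training and testing; the inverted form just moves the scaling so that inference code needs no knowledge of the dropout ratio.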

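Other regularization methods

Data Augmentation

Flip the images horizontally (make sure the left-right orientation of the image does not matter for the task), take random crops, rotate, shift vertically or horizontally, adjust brightness, zoom in or out…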
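
A few of these transformations are easy to sketch with plain NumPy. The helpers below (`random_flip`, `random_crop`, `random_shift`) are hypothetical names written for this note, and they assume an image stored as an (H, W) or (H, W, C) array.

```python
import numpy as np


def random_flip(img, rng, p=0.5):
    """Flip the image left-right with probability p."""
    return img[:, ::-1] if rng.random() < p else img


def random_crop(img, rng, size):
    """Cut a random size x size window out of the image."""
    h, w = img.shape[:2]
    top = rng.integers(0, h - size + 1)
    left = rng.integers(0, w - size + 1)
    return img[top:top + size, left:left + size]


def random_shift(img, rng, max_shift):
    """Translate the image by up to max_shift pixels, padding with zeros."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    # Part of the source image that stays inside the frame after the shift.
    src = img[max(0, -dy):h - max(0, dy), max(0, -dx):w - max(0, dx)]
    out[max(0, dy):max(0, dy) + src.shape[0],
        max(0, dx):max(0, dx) + src.shape[1]] = src
    return out


rng = np.random.default_rng(0)
img = np.arange(28 * 28, dtype=float).reshape(28, 28)  # stand-in for one MNIST digit
augmented = random_shift(random_crop(random_flip(img, rng), rng, size=24), rng, max_shift=2)
print(augmented.shape)
```

For handwritten digits the flip would normally be left out, since a mirrored digit is not a valid digit; this is the orientation caveat mentioned above.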

Early stopping

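Drawback:

Training a network can normally be viewed as two separate tasks: reducing the value of the loss function, and avoiding overfitting. The two are handled one at a time with separate tools rather than simultaneously, an approach called orthogonalization. Early stopping, however, couples the two together.

Advantage:

There are fewer hyperparameters to choose, for example the $\lambda$ used in weight decay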
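
The notes do not spell out the procedure itself. One common form, assumed here, is to watch the validation loss and stop once it has not improved for a fixed number of epochs, keeping the weights from the best epoch. The function name and the `patience` parameter in this sketch are illustrative choices.

```python
def early_stopping_epoch(val_losses, patience=3):
    """Return the index of the epoch whose weights would be kept.

    Training counts as stopped once the validation loss has failed to
    improve for `patience` consecutive epochs.
    """
    best_epoch, best_loss, bad_epochs = 0, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, bad_epochs = epoch, loss, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # stop training here
    return best_epoch


losses = [0.90, 0.60, 0.45, 0.40, 0.42, 0.41, 0.43, 0.39]
print(early_stopping_epoch(losses))  # 3
```

In this made-up sequence training stops after epoch 6, so the lower loss at epoch 7 is never reached. This shows the coupling described above: the same knob decides both how far the loss is optimized and how much overfitting is tolerated.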

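Validating hyperparameters

Splitting the dataset

So far the dataset has been split into training data and test data: the training data is used for learning and the test data for evaluating generalization. This makes it possible to check whether the model fits only the training data too closely (whether overfitting has occurred) and how well it generalizes. Next, the hyperparameters will be set to a variety of values and validated. The point to note is that the test data must not be used to evaluate hyperparameters. This is very important and easily overlooked. Why not? Because if the hyperparameters are tuned on the test data, their values overfit the test data. Tuning therefore requires data reserved for the hyperparameters, generally called validation data.

relatively small set $\rightarrow$ train/dev/test: 60%/20%/20%
much larger set $\rightarrow$ the dev and test proportions can be much lower, because even a small fraction is enough to judge how good the model is

caution: dev, test, train set must come from the same distribution!

Before splitting the dataset, shuffle the data and labels first, because the data may be arranged in a biased order (for example, sorted from "0" to "10").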

shuffle_data=True
(x_train, t_train), (x_test, t_test) = load_mnist(shuffle_data=True)
# Split off the validation data
validation_rate = 0.20
validation_num = int(x_train.shape[0] * validation_rate)
x_val = x_train[:validation_num]
t_val = t_train[:validation_num]
x_train = x_train[validation_num:]
t_train = t_train[validation_num:]
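
The snippet above relies on the `shuffle_data` option of this blog's own `load_mnist`. For data held in plain arrays, the same idea can be sketched with NumPy alone; `shuffle_and_split` is a hypothetical helper, and the 60%/20%/20% proportions follow the rule of thumb for small datasets given earlier.

```python
import numpy as np


def shuffle_and_split(x, t, dev_rate=0.2, test_rate=0.2, seed=0):
    """Shuffle data and labels with one shared permutation, then split."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(x))
    x, t = x[perm], t[perm]  # same order for both, so pairs stay aligned
    n_dev = int(len(x) * dev_rate)
    n_test = int(len(x) * test_rate)
    dev = (x[:n_dev], t[:n_dev])
    test = (x[n_dev:n_dev + n_test], t[n_dev:n_dev + n_test])
    train = (x[n_dev + n_test:], t[n_dev + n_test:])
    return train, dev, test


x = np.arange(100).reshape(50, 2)  # 50 samples with 2 features; row i is [2i, 2i + 1]
t = np.arange(50)                  # label i belongs to sample i
(x_tr, t_tr), (x_dev, t_dev), (x_te, t_te) = shuffle_and_split(x, t)
print(len(x_tr), len(x_dev), len(x_te))  # 30 10 10
```

Using a single permutation for both arrays is the important detail: shuffling data and labels independently would destroy the pairing.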

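Optimizing hyperparameters

When optimizing hyperparameters, it is important to narrow down gradually the range in which "good values" lie. Narrowing down gradually means first setting a rough range, sampling a hyperparameter value at random from it, and evaluating the recognition accuracy with that value; then repeating this many times, looking at the resulting accuracies, and shrinking the range of "good values" accordingly. Repeating the procedure gradually pins down a suitable range.

It has been reported that, for neural network hyperparameters, random sampling works better than regular searches such as grid search. The reason is that different hyperparameters affect the final recognition accuracy to very different degrees: random sampling tries a new value of the important hyperparameter in every trial, whereas a grid revisits the same few values.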

Hyperparameters:

Most important: $\alpha$ (learning rate)
Second important: $\beta$ (momentum), #hidden units, mini-batch size
Third important: #layers, $\lambda$ (learning rate decay)

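The range of a hyperparameter only needs to be specified roughly. "Roughly" means specifying the range on a scale of powers of 10, such as from 0.001 ($10^{-3}$) to 1000 ($10^{3}$) (also described as specifying it on a log scale).

Two examples show why a linear scale should not be used:

Searching $\alpha$ over 0.001 ~ 1 on a linear scale puts 90% of the trials in 0.1 ~ 1, which is clearly unreasonable.
Searching $\beta$ over 0.9 ~ 0.999 on a linear scale: within 0.9000 ~ 0.9005, $\frac {1}{1 - \beta} \approx 10$ , while within 0.9990 ~ 0.9995, $\frac {1}{1 - \beta} \approx 1000$ ~ $2000$ . A linear search is clearly unreasonable here, because what really needs to be chosen is a suitable value of $\frac {1}{1 - \beta}$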
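
Both examples can be checked by sampling. The sketch below draws the exponent uniformly, which is the same idea as the `10 ** np.random.uniform(...)` lines used later in this post; the helper name `sample_log_uniform` is an assumption for this example.

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_log_uniform(low, high, size, rng):
    """Sample values whose base-10 logarithm is uniform between log10(low) and log10(high)."""
    return 10 ** rng.uniform(np.log10(low), np.log10(high), size)


# Learning rate alpha in [0.001, 1]
alpha_log = sample_log_uniform(1e-3, 1.0, 10000, rng)
alpha_linear = rng.uniform(1e-3, 1.0, 10000)

# Momentum beta in [0.9, 0.999]: sample 1 - beta on a log scale over [0.001, 0.1]
beta = 1.0 - sample_log_uniform(1e-3, 1e-1, 10000, rng)

frac_linear = (alpha_linear >= 0.1).mean()  # roughly 0.9 on the linear scale
frac_log = (alpha_log >= 0.1).mean()        # roughly one third on the log scale
print(frac_linear, frac_log)
```

On the log scale each decade (0.001 ~ 0.01, 0.01 ~ 0.1, 0.1 ~ 1) receives about the same number of trials, which is the behaviour the text asks for.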

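Keep in mind that deep learning takes a long time (days or even weeks, for example). In a hyperparameter search, unpromising hyperparameters therefore need to be abandoned as early as possible. A good way to do this is to reduce the number of training epochs so that each evaluation takes less time.

The hyperparameter optimization method described here is a practical one. If a more refined method is needed, Bayesian optimization can be used. Bayesian optimization applies mathematical theory centered on Bayes' theorem to carry out the optimization more rigorously and efficiently. For details, see the paper "Practical Bayesian Optimization of Machine Learning Algorithms", among others.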

import sys
file_path = __file__.replace('\\', '/')
dir_path = file_path[: file_path.rfind('/')] # path of the current folder
pardir_path = dir_path[: dir_path.rfind('/')]
sys.path.append(pardir_path) # add the parent directory to the Python module search path

import numpy as np
from dataset.mnist import load_mnist
from layer.multi_layer_net import MultiLayerNet
from trainer.trainer import Trainer
import matplotlib.pyplot as plt 

(x_train, t_train),  (x_test, t_test) = load_mnist(normalize=True, flatten=False, one_hot_label=True, shuffle_data=True)

# Split off the validation data
validation_rate = 0.20
validation_num = int(x_train.shape[0] * validation_rate)
x_val = x_train[:validation_num]
t_val = t_train[:validation_num]
x_train = x_train[validation_num:]
t_train = t_train[validation_num:]

pkl_file_name = dir_path + '/hyperparameter_optimization.pkl'
fig_name = dir_path + '/hyperparameter_optimization.png'

def __train(lr, weight_decay, epocs=2):
    net = MultiLayerNet(784, [100, 100, 100, 100, 100, 100], 10,
                activation='relu', weight_init_std='relu', weight_decay_lambda=weight_decay, 
                use_dropout=False, dropout_ration=0.5, use_batchnorm=True, 
                pretrain_flag=False, pkl_file_name=pkl_file_name)
    trainer = Trainer(net, x_train, t_train, x_val, t_val,  # evaluate on the validation split, not on the test data
                    epochs=epocs, mini_batch_size=100,
                    optimizer='SGD', optimizer_param={'lr':lr}, 
                    save_model_flag=True, pkl_file_name=pkl_file_name, plot_flag=False, fig_name=fig_name,
                    evaluate_sample_num_per_epoch=None, verbose=True)
    trainer.train()

    return trainer.test_acc_list, trainer.train_acc_list

# Random search over the hyperparameters
optimization_trial = 100
results_val = {}
results_train = {}
for _ in range(optimization_trial):
    # Specify the search ranges of the hyperparameters===============
    weight_decay = 10 ** np.random.uniform(-8, -4)
    lr = 10 ** np.random.uniform(-6, -2)
    # ================================================

    val_acc_list, train_acc_list = __train(lr, weight_decay)
    print("val acc:" + str(val_acc_list[-1]) + " | lr:" + str(lr) + ", weight decay:" + str(weight_decay))
    key = "lr:" + str(lr) + ", weight decay:" + str(weight_decay)
    results_val[key] = val_acc_list
    results_train[key] = train_acc_list

# Plot the results========================================================
print("=========== Hyper-Parameter Optimization Result ===========")
graph_draw_num = 20
col_num = 5
row_num = int(np.ceil(graph_draw_num / col_num))
i = 0

for key, val_acc_list in sorted(results_val.items(), key=lambda x:x[1][-1], reverse=True):
    print("Best-" + str(i+1) + "(val acc:" + str(val_acc_list[-1]) + ") | " + key)

    plt.subplot(row_num, col_num, i+1)
    plt.title("Best-" + str(i+1))
    plt.ylim(0.0, 1.0)
    if i % 5: plt.yticks([])
    plt.xticks([])
    x = np.arange(len(val_acc_list))
    plt.plot(x, val_acc_list)
    plt.plot(x, results_train[key], "--")
    i += 1

    if i >= graph_draw_num:
        break

plt.savefig(fig_name)
plt.show()

=========== Hyper-Parameter Optimization Result ===========
Best-1(val acc:0.9178) | lr:0.009005557486889807, weight decay:1.7477510887585676e-07
Best-2(val acc:0.9147) | lr:0.008230729959088037, weight decay:4.0741675278124096e-05
Best-3(val acc:0.9137) | lr:0.007172321239239096, weight decay:3.279148370122038e-05
Best-4(val acc:0.9106) | lr:0.007784860782344955, weight decay:2.6963471504299467e-05
Best-5(val acc:0.9103) | lr:0.00639415111314913, weight decay:2.5875411574553633e-07
Best-6(val acc:0.9096) | lr:0.0067713810843289994, weight decay:1.4721636892262977e-08
Best-7(val acc:0.9009) | lr:0.006278021397985206, weight decay:1.347044302843336e-08
Best-8(val acc:0.886) | lr:0.004757015189713183, weight decay:3.469706246467051e-05
Best-9(val acc:0.8856) | lr:0.003900241211695265, weight decay:3.3527006586550624e-08
Best-10(val acc:0.8786) | lr:0.004096508041494537, weight decay:7.797544480165376e-06
Best-11(val acc:0.8736) | lr:0.0035426486270627756, weight decay:2.6716950639718e-06
Best-12(val acc:0.843) | lr:0.0029865433537836217, weight decay:1.168246937591811e-07
Best-13(val acc:0.8405) | lr:0.002651462378045705, weight decay:9.754264464435194e-07
Best-14(val acc:0.8206) | lr:0.0022564421780149114, weight decay:1.7280502179151947e-06
Best-15(val acc:0.7975) | lr:0.002067204045559343, weight decay:6.48318010753941e-08
Best-16(val acc:0.7957) | lr:0.0019221274810548542, weight decay:1.2329400973635888e-06
Best-17(val acc:0.7923) | lr:0.0017003621475225061, weight decay:6.848766914357895e-07
Best-18(val acc:0.7912) | lr:0.002028435943575853, weight decay:5.611188782800319e-05
Best-19(val acc:0.7882) | lr:0.0017122199198167435, weight decay:3.1379461173155134e-06
Best-20(val acc:0.7762) | lr:0.002026173247734754, weight decay:4.745680808901564e-07

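These results show that learning proceeds well when the learning rate is between 0.001 and 0.01. The weight decay coefficients of the top results are spread over almost the whole searched range ($10^{-8}$ to $10^{-4}$), so within this range the weight decay coefficient matters far less than the learning rate.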

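Transfer Learning

Keep the parameters of the earlier layers, retrain only the last layer or the last few layers (the more data there is, the more layers are retrained), and build a new output layer. This amounts to training only a shallow network. (Most frameworks allow the parameters of selected layers to be excluded from training.)

It is suitable when:

The source task (the one being transferred from) has plenty of data, while the target task (the one being transferred to) has little
Both tasks have the same kind of input $X$
The low level features learned on the source task help the target task

For example, in image classification the first two or three layers of convolutional networks play very similar roles: they extract edge information from the image. To keep training more stable, the first two or three convolutional layers of the pretrained network are therefore usually frozen, and their parameters are not updated.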
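
Freezing layers comes down to skipping their parameters in the update step. The sketch below shows this with a bare dictionary of NumPy weights rather than any particular framework; `sgd_step`, the `frozen` set, and the layer names are assumptions made for illustration.

```python
import numpy as np


def sgd_step(params, grads, frozen, lr=0.1):
    """Update every parameter except those whose name is listed in `frozen`."""
    for name in params:
        if name in frozen:
            continue  # pretrained layer: keep its weights fixed
        params[name] -= lr * grads[name]


rng = np.random.default_rng(0)
# Hypothetical weights: W1 and W2 come from a pretrained network,
# W3 is a freshly initialised output layer for the new task.
params = {
    "W1": rng.normal(size=(4, 4)),
    "W2": rng.normal(size=(4, 4)),
    "W3": rng.normal(size=(4, 2)),
}
grads = {name: np.ones_like(w) for name, w in params.items()}  # placeholder gradients
before = {name: w.copy() for name, w in params.items()}

sgd_step(params, grads, frozen={"W1", "W2"})
print(np.array_equal(params["W1"], before["W1"]))  # frozen layer is unchanged
```

With more target-task data, the `frozen` set would shrink so that more of the later layers are retrained, as described above.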

End-to-end deep learning

The traditional way – small data set
Audio $\rightarrow$ Extract features $\rightarrow$ Phonemes $\rightarrow$ Words $\rightarrow$ Transcript
The hybrid way – medium data set
Audio $\rightarrow$ Phonemes $\rightarrow$ Words $\rightarrow$ Transcript
The end-to-end deep learning way - large data set
Audio $\rightarrow$ Transcript

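When there is not enough data to support end-to-end learning, the task can be split into several smaller tasks (each of which has enough data).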

Introduction to Deep Learning (7): Regularization, Validating Hyperparameters, Transfer Learning, End-to-end Deep Learning
