Lecture 7: Training Neural Networks, Part 2

1, Fancier optimization

Problems with SGD:

  • If the loss changes quickly in one direction and slowly in another, SGD makes very slow progress along the shallow dimension and jitters back and forth along the steep direction.
  • SGD can also get stuck at local minima or saddle points, where the gradient is zero (or nearly zero).
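
To make the failure modes concrete, here is a small self-contained toy loop (not from the lecture; the quadratic loss and step size are invented for illustration) running plain SGD on an ill-conditioned objective:

import numpy as np

# Toy example: L(x) = 0.5 * (a * x[0]**2 + b * x[1]**2) with a << b, so the loss
# changes slowly along x[0] (shallow dimension) and quickly along x[1] (steep dimension).
a, b = 1.0, 50.0
x = np.array([1.0, 1.0])
learning_rate = 0.035

for step in range(50):
    dx = np.array([a * x[0], b * x[1]])   # gradient of L at x
    x -= learning_rate * dx               # plain SGD update

print(x)   # x[1] overshoots and oscillates every step, while x[0] only creeps toward 0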

Several gradient descent variants:

A, SGD + Momentum: tries to fix the saddle-point and local-minimum problems of SGD

vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx + dx                  # build up a "velocity"; rho (e.g. 0.9) acts as friction
    x -= learning_rate * vx
# Or (the variant used in the cs231n assignments):
vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx - learning_rate * dx
    x += vx

B, AdaGrad: tries to fix the zigzag problem of SGD

import numpy as np

grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared += dx * dx                                      # running sum of squared gradients, per dimension
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)     # steep dimensions get divided by a large number

However, after many iterations grad_squared keeps growing, the effective step size shrinks toward zero, and x almost stops changing. This motivates the improved version, RMSProp.

C, RMSProp: uses a decay rate to keep grad_squared from growing without bound

grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx   # leaky running average of squared gradients
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)

D, Adam: combines momentum and RMSProp

first_moment = 0
second_moment = 0
for t in range(1, num_iterations + 1):                                 # t starts at 1 so the bias correction never divides by zero
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx              # Momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx       # RMSProp

    # Bias correction for the fact that the first and second moment estimates start at zero
    first_unbias = first_moment / (1 - beta1 ** t)
    second_unbias = second_moment / (1 - beta2 ** t)

    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)  # Momentum numerator, RMSProp denominator

Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
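
As a quick sanity check (a toy demo, not part of the lecture), the same loop can be run on the simple loss L(x) = 0.5 * ||x||^2, whose gradient is just x, using the recommended hyperparameters; x should end up close to zero:

import numpy as np

beta1, beta2, learning_rate = 0.9, 0.999, 1e-3     # suggested starting values
x = np.array([1.0, -2.0])
first_moment = np.zeros_like(x)
second_moment = np.zeros_like(x)

for t in range(1, 5001):                           # t starts at 1 for the bias correction
    dx = x                                         # gradient of the toy loss at x
    first_moment = beta1 * first_moment + (1 - beta1) * dx
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
    first_unbias = first_moment / (1 - beta1 ** t)
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)

print(x)   # both entries should be near 0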

2, Regularization

2.1, Dropout

Each neuron is forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
The neurons that are "dropped out" in this way do not contribute to the forward pass and do not participate in back-propagation. At test time, we use all the neurons but multiply their outputs by the keep probability p (0.5 in the original paper). (Krizhevsky et al., NIPS 2012)
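
A minimal sketch of this (vanilla) dropout scheme for a single layer of activations, assuming a keep probability p = 0.5; the binary mask is resampled on every training forward pass, and at test time all neurons are kept but scaled by p:

import numpy as np

p = 0.5  # probability of keeping a neuron active

def dropout_forward_train(h):
    """Training-time forward pass: randomly zero out neurons."""
    mask = (np.random.rand(*h.shape) < p)   # resampled on every forward pass
    return h * mask                          # dropped neurons contribute nothing

def dropout_forward_test(h):
    """Test-time forward pass: keep all neurons, scale their outputs by p."""
    return h * p

h = np.maximum(0, np.random.randn(4, 10))   # some ReLU activations
print(dropout_forward_train(h))
print(dropout_forward_test(h))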

2.2, Data augmentation

1, Horizontal flips
2, Random crops and scales

  • Training: sample random crops / scales (a code sketch follows this list)
    ResNet:
    1. Pick random L in range [256, 480]
    2. Resize training image, short side = L
    3. Sample random 224 x 224 patch
  • Testing: average a fixed set of crops
    ResNet:
    1. Resize image at 5 scales: {224, 256, 384, 480, 640}
    2. For each size, use 10 224 x 224 crops: 4 corners + center, + flips
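
A rough sketch of the training-time recipe above (pick a random L, resize so the short side equals L, sample a 224 x 224 patch); the nearest-neighbour resize and the dummy image are simplifications, not the actual ResNet pipeline:

import numpy as np

def random_resized_crop(img, crop=224, scale_range=(256, 480)):
    """Training-time crop/scale augmentation sketch."""
    L = np.random.randint(scale_range[0], scale_range[1] + 1)   # 1. pick random L in [256, 480]
    h, w = img.shape[:2]
    ratio = L / min(h, w)                                        # 2. resize so the short side = L
    new_h, new_w = int(round(h * ratio)), int(round(w * ratio))
    rows = (np.arange(new_h) * h / new_h).astype(int)            # nearest-neighbour resize for simplicity
    cols = (np.arange(new_w) * w / new_w).astype(int)
    resized = img[rows][:, cols]
    top = np.random.randint(0, new_h - crop + 1)                 # 3. sample a random 224 x 224 patch
    left = np.random.randint(0, new_w - crop + 1)
    return resized[top:top + crop, left:left + crop]

img = np.random.randint(0, 256, size=(300, 400, 3), dtype=np.uint8)  # dummy image
print(random_resized_crop(img).shape)   # (224, 224, 3)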

3, PCA jittering

Derivation of PCA:
Given a data matrix X with samples x_1, x_2, ..., x_N, the goal is to find a vector u that maximizes the variance of the data after projection onto u.

$$x_{\mathrm{mean}} = \frac{1}{N}\sum_{k=1}^{N} x_k$$

That is, find the value of $u$ that maximizes

$$\frac{1}{N}\sum_{k=1}^{N}\left(x_k^{T}u - x_{\mathrm{mean}}^{T}u\right)^{2} = u^{T}\left\{\frac{1}{N}\sum_{k=1}^{N}\left(x_k - x_{\mathrm{mean}}\right)\left(x_k - x_{\mathrm{mean}}\right)^{T}\right\}u$$

Let $S = \frac{1}{N}\sum_{k=1}^{N}(x_k - x_{\mathrm{mean}})(x_k - x_{\mathrm{mean}})^{T}$ and require $u$ to be a unit vector ($u^{T}u = 1$). Maximizing $u^{T}Su$ with a Lagrange multiplier $\lambda$ for the unit-norm constraint and differentiating with respect to $u$ gives $Su - \lambda u = 0$,
so $u$ is an eigenvector of $S$ (its dimensionality equals that of the data), and the variance-maximizing direction is the eigenvector with the largest eigenvalue.
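
Putting the derivation to use for PCA jittering: compute the eigenvectors and eigenvalues of the 3 x 3 RGB covariance matrix and add a small random multiple of each principal component to every pixel. The sketch below assumes the AlexNet-style recipe (Gaussian coefficients with standard deviation 0.1) and, for simplicity, computes the covariance per image rather than over the whole training set:

import numpy as np

def pca_jitter(img, alpha_std=0.1):
    """PCA color jittering sketch: perturb each pixel along the principal
    components of the RGB covariance."""
    pixels = img.reshape(-1, 3).astype(np.float64) / 255.0
    centered = pixels - pixels.mean(axis=0)                  # x_k - x_mean
    S = np.cov(centered, rowvar=False)                        # 3 x 3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)                      # u: eigenvectors of S
    alphas = np.random.normal(0, alpha_std, size=3)           # random jitter strengths
    offset = eigvecs @ (alphas * eigvals)                     # sum_i alpha_i * lambda_i * u_i
    jittered = pixels + offset                                 # same offset added to every pixel
    return np.clip(jittered * 255.0, 0, 255).astype(np.uint8).reshape(img.shape)

img = np.random.randint(0, 256, size=(224, 224, 3), dtype=np.uint8)  # dummy image
print(pca_jitter(img).shape)   # (224, 224, 3)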