1,Fancier optimization
Problems with SGD:
- Very slow progress along the shallow dimension, jitter along the steep direction, when the loss changes quickly in one direction and slowly in another.
- It can get stuck at local minima or saddle points.
Several gradient descent variants
A, SGD+Momentum: tries to fix SGD's problems with saddle points and local minima
vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx + dx
    x -= learning_rate * vx
## Or (the variant used in the cs231n assignments):
vx = 0
while True:
    dx = compute_gradient(x)
    vx = rho * vx - learning_rate * dx
    x += vx
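As a quick sanity check, the second formulation above can be run on a toy quadratic $f(x) = 0.5x^2$, whose gradient is just $x$; the hyperparameters and step count here are illustrative choices, not values from the notes.

```python
# SGD+momentum (the second formulation above) on f(x) = 0.5 * x^2,
# whose gradient is simply x. Hyperparameters are illustrative.
def sgd_momentum(x0, learning_rate=0.1, rho=0.9, steps=300):
    x, vx = x0, 0.0
    for _ in range(steps):
        dx = x                              # gradient of 0.5 * x^2
        vx = rho * vx - learning_rate * dx  # velocity update
        x += vx
    return x

print(abs(sgd_momentum(10.0)) < 1e-3)  # converges to the minimum at 0
```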
B, AdaGrad: tries to fix SGD's zigzag problem
grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared += dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
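The shrinking-step behavior described next can be observed directly by tracking the effective per-step learning rate on the same toy quadratic; the function name and constants here are illustrative.

```python
import numpy as np

# AdaGrad on f(x) = 0.5 * x^2, recording the effective step size
# learning_rate / (sqrt(grad_squared) + 1e-7): since grad_squared only
# accumulates, this step size can only shrink over time.
def adagrad(x0, learning_rate=1.0, steps=100):
    x, grad_squared = x0, 0.0
    step_sizes = []
    for _ in range(steps):
        dx = x                    # gradient of 0.5 * x^2
        grad_squared += dx * dx
        step = learning_rate / (np.sqrt(grad_squared) + 1e-7)
        step_sizes.append(step)
        x -= step * dx
    return x, step_sizes

x_final, step_sizes = adagrad(10.0)
print(step_sizes[0] > step_sizes[-1])  # the effective step size decayed
```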
But after many iterations grad_squared grows very large and x can barely change, which motivates the improved version, RMSProp.
C, RMSProp: uses a decay rate to keep grad_squared from growing without bound
grad_squared = 0
while True:
    dx = compute_gradient(x)
    grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
    x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
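Run on the same toy quadratic, the leaky average keeps grad_squared at the scale of recent gradients, so progress does not stall the way AdaGrad's can; decay_rate = 0.9 and the other values are illustrative choices.

```python
import numpy as np

# RMSProp on f(x) = 0.5 * x^2. The leaky average bounds grad_squared
# by the recent gradient scale, so the update keeps making progress.
def rmsprop(x0, learning_rate=0.1, decay_rate=0.9, steps=300):
    x, grad_squared = x0, 0.0
    for _ in range(steps):
        dx = x                    # gradient of 0.5 * x^2
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx
        x -= learning_rate * dx / (np.sqrt(grad_squared) + 1e-7)
    return x

print(abs(rmsprop(10.0)) < 1.0)  # gets close to the minimum at 0
```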
D, Adam: combines Momentum and RMSProp
first_moment = 0
second_moment = 0
for t in range(1, num_iterations + 1):
    dx = compute_gradient(x)
    first_moment = beta1 * first_moment + (1 - beta1) * dx         ## Momentum
    second_moment = beta2 * second_moment + (1 - beta2) * dx * dx  ## RMSProp
    ## Bias correction for the fact that the first and second moment
    ## estimates start at zero (t starts at 1 to avoid dividing by zero)
    first_unbias = first_moment / (1 - beta1 ** t)
    second_unbias = second_moment / (1 - beta2 ** t)
    x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)  ## Momentum + RMSProp
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
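The loop above runs as-is once wrapped in a function; on the toy quadratic $f(x) = 0.5x^2$ it drives $x$ toward the minimum, and starting $t$ at 1 is what keeps the bias correction from dividing by zero. The step count is an illustrative choice.

```python
import numpy as np

# Adam exactly as in the loop above, on f(x) = 0.5 * x^2, using the
# suggested defaults beta1 = 0.9, beta2 = 0.999, learning_rate = 1e-3.
def adam(x0, learning_rate=1e-3, beta1=0.9, beta2=0.999, steps=3000):
    x, first_moment, second_moment = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        dx = x                    # gradient of 0.5 * x^2
        first_moment = beta1 * first_moment + (1 - beta1) * dx
        second_moment = beta2 * second_moment + (1 - beta2) * dx * dx
        first_unbias = first_moment / (1 - beta1 ** t)
        second_unbias = second_moment / (1 - beta2 ** t)
        x -= learning_rate * first_unbias / (np.sqrt(second_unbias) + 1e-7)
    return x

print(abs(adam(1.0)) < 0.1)  # approaches the minimum at 0
```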
2,Regularization
2.1,Dropout:
Each neuron is forced to learn more robust features that are useful in conjunction with many different random subsets of the other neurons.
The neurons that are "dropped out" in this way do not contribute to the forward pass and do not participate in back-propagation. At test time, we use all the neurons but multiply their outputs by 0.5, the dropout probability. (Krizhevsky et al., NIPS 2012)
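A common alternative to the test-time rescaling described above is "inverted dropout", which rescales by the keep probability during training so the test-time pass uses the activations unchanged; the function name, probability, and shapes below are illustrative.

```python
import numpy as np

# Inverted dropout: drop each neuron with probability 1 - p at training
# time and rescale the survivors by 1/p, so test time needs no scaling.
def dropout_forward(x, p=0.5, train=True, rng=None):
    if not train:
        return x                  # test time: all neurons, unchanged
    if rng is None:
        rng = np.random.default_rng(0)
    mask = (rng.random(x.shape) < p) / p  # keep with probability p, rescale
    return x * mask

x = np.ones((4, 8))
out_train = dropout_forward(x, p=0.5)
out_test = dropout_forward(x, train=False)
print(np.array_equal(out_test, x))  # test time leaves activations unchanged
```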
2.2,Data augmentation
1,Horizontal flips
2,Random crops and scales
- Training: sample random crops / scales
ResNet:
- Pick random L in range [256, 480]
- Resize training image, short side = L
- Sample random 224 x 224 patch
- Testing: average a fixed set of crops
ResNet:
- Resize image at 5 scales: {224, 256, 384, 480, 640}
- For each size, use 10 224 x 224 crops: 4 corners + center, + flips
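The training-time recipe above (flip, then sample a random patch) can be sketched in a few lines of NumPy; the small 32x48 image and 24x24 patch stand in for the resized training image and the 224x224 ResNet crop.

```python
import numpy as np

# Random horizontal flip followed by a random crop, as in the
# training-time pipeline above. Sizes are illustrative stand-ins.
def augment(img, crop_h, crop_w, rng):
    if rng.random() < 0.5:
        img = img[:, ::-1]        # horizontal flip (width axis)
    h, w = img.shape[:2]
    top = rng.integers(0, h - crop_h + 1)
    left = rng.integers(0, w - crop_w + 1)
    return img[top:top + crop_h, left:left + crop_w]

rng = np.random.default_rng(0)
img = rng.random((32, 48, 3))
patch = augment(img, 24, 24, rng)
print(patch.shape)  # (24, 24, 3)
```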
3,PCA jittering
Derivation of PCA:
Given a data matrix $X$ with samples $x_1, x_2, \dots, x_N$, the goal is to find a unit vector $u$ that maximizes the variance of the data projected onto it. Let $x_{\text{mean}} = \frac{1}{N}\sum_{k=1}^{N} x_k$. The projected variance is
$\frac{1}{N}\sum_{k=1}^{N}\left(x_k^T u - x_{\text{mean}}^T u\right)^2 = u^T\left\{\frac{1}{N}\sum_{k=1}^{N}(x_k - x_{\text{mean}})(x_k - x_{\text{mean}})^T\right\} u = u^T S u$,
where $S = \frac{1}{N}\sum_{k=1}^{N}(x_k - x_{\text{mean}})(x_k - x_{\text{mean}})^T$. Maximizing $u^T S u$ subject to the unit-norm constraint $u^T u = 1$ via a Lagrange multiplier $\lambda$, and setting the derivative with respect to $u$ to zero, gives $Su - \lambda u = 0$,
so $u$ is an eigenvector of $S$, with the same dimensionality as the data.
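The conclusion can be checked numerically: form the covariance matrix $S$ of some synthetic data, take the top eigenvector $u$, and verify that $Su = \lambda u$ and that $u^T S u$ equals the largest eigenvalue (the maximal projected variance). The data shape and per-axis scales below are illustrative.

```python
import numpy as np

# Numerical check of the derivation: the maximizing direction u is an
# eigenvector of the covariance matrix S. Synthetic anisotropic data.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.3])  # per-axis scales
x_mean = X.mean(axis=0)
Xc = X - x_mean
S = Xc.T @ Xc / X.shape[0]            # covariance matrix S from the text

eigvals, eigvecs = np.linalg.eigh(S)  # eigenvalues in ascending order
u = eigvecs[:, -1]                    # top eigenvector (unit norm)

print(np.allclose(S @ u, eigvals[-1] * u))  # S u = lambda u
print(np.isclose(u @ S @ u, eigvals[-1]))   # maximal projected variance
```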