Notes: Some basic questions about caffe and deep learning

Rectified Linear Units
Excerpted from http://www.douban.com/note/348196265/
Sigmoid and tanh are already familiar as activation functions for neural networks; today I took a look at ReLU, a piecewise-linear activation function. Clearly, such a linear activation greatly reduces the computational cost, and a lot of work shows that ReLU helps improve results [1].
sigmoid:
g(x) = 1 / (1 + exp(-x)). g'(x) = (1 - g(x)) g(x).
tanh:
g(x) = sinh(x)/cosh(x) = ( exp(x)- exp(-x) ) / ( exp(x) + exp(-x) )
Rectifier (ReL):
- hard ReLU: g(x)=max(0,x)
- Noisy ReLU: g(x) = max(0, x + N(0, σ(x))).
softplus:
g(x) = log(1 + exp(x)); its derivative is the logistic function.
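
A minimal NumPy sketch of these activations, just to make the formulas above concrete (the function names are my own, not from caffe or any library; the noisy ReLU uses a constant sigma for simplicity):

```python
import numpy as np

def sigmoid(x):
    # g(x) = 1 / (1 + exp(-x)); g'(x) = g(x) * (1 - g(x))
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    g = sigmoid(x)
    return g * (1.0 - g)

def tanh(x):
    # g(x) = sinh(x) / cosh(x)
    return np.tanh(x)

def hard_relu(x):
    # g(x) = max(0, x)
    return np.maximum(0.0, x)

def noisy_relu(x, sigma=0.1):
    # g(x) = max(0, x + N(0, sigma)); the original formula uses sigma(x),
    # here a constant sigma is assumed for illustration
    return np.maximum(0.0, x + np.random.normal(0.0, sigma, size=np.shape(x)))

def softplus(x):
    # g(x) = log(1 + exp(x)); its derivative is the logistic (sigmoid) function
    return np.log1p(np.exp(x))
```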

[The following is excerpted from Quora]
The major differences between the sigmoid and ReL function are:
The sigmoid function has range (0, 1), whereas the ReL function has range [0, ∞). Hence the sigmoid can be used to model a probability, whereas the ReL can be used to model a positive real number.
The gradient of the sigmoid function vanishes as we increase or decrease x (this vanishing-gradient problem is bad for training neural networks). However, the gradient of the ReL function doesn't vanish as we increase x. In fact, for the max function, the gradient is defined as

g'(x) = 0 if x < 0, and g'(x) = 1 if x > 0.
[End of Quora excerpt]
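
A quick numerical illustration of the point above: the sigmoid gradient (1 - g(x)) g(x) shrinks toward zero as |x| grows, while the ReLU gradient stays 1 for x > 0 (the printed values in the comments are approximate):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-10.0, -5.0, 0.0, 5.0, 10.0])

sig_grad = sigmoid(xs) * (1.0 - sigmoid(xs))   # vanishes at both ends
relu_grad = (xs > 0).astype(float)             # 0 for x <= 0, 1 for x > 0 (gradient at 0 taken as 0 here)

print(sig_grad)   # roughly [4.5e-05, 6.6e-03, 0.25, 6.6e-03, 4.5e-05]
print(relu_grad)  # [0., 0., 0., 1., 1.]
```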

The advantages of ReLU mentioned in [1]:
Hard ReLU naturally enforces sparsity.
The derivative of ReLU is constant.

A possible problem with ReLU is the dead zone (units whose output is stuck at 0). The derivative of hard ReLU is constant over the two ranges x < 0 and x >= 0: for x > 0, g' = 1, and for x < 0, g' = 0, so a unit can get "stuck". Some small tricks can mitigate this, for example initializing the bias to a positive value, although some papers [2] report that this has little effect. One can also switch to other activation functions, such as maxout or softplus.
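
A toy sketch of the positive-bias trick mentioned above (the layer sizes, weight scale, and the 0.1 bias value are my own choices for illustration, not from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy fully connected layer followed by a hard ReLU. With a zero bias, roughly half
# of the pre-activations are negative at initialization, so half of the ReLU outputs
# are zero; a small positive bias keeps more units active at the start of training.
X = rng.standard_normal((1000, 100))        # 1000 inputs, 100 features
W = rng.standard_normal((100, 256)) * 0.01  # small random weights

for b0 in (0.0, 0.1):
    b = np.full(256, b0)
    act = np.maximum(0.0, X @ W + b)        # hard ReLU
    frac_zero = np.mean(act == 0.0)
    print(f"bias init {b0}: fraction of zero activations = {frac_zero:.2f}")
```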

[1] Rectifier Nonlinearities Improve Neural Network Acoustic Models
[2] Deep Sparse Rectifier Neural Networks
[3] http://www.quora.com/Deep-Learning/What-is-special-about-rectifier-neural-units-used-in-NN-learning
A description of the origin of ReLU:
Geoff Hinton gave a lecture in the summer of 2013 that I found very helpful in understanding ReLUs and the rest. Essentially he claimed that the original activation function was chosen arbitrarily and that ReLUs work "better", but aren't the be-all-end-all. Also interesting was that the ReLU is an approximation to the summation of an infinite number of sigmoids with varying offsets - I wrote a blog post showing this is the case. They arrive at this function from experimenting with a deep network where they varied the offsets of the activation functions at random until it "just worked" without pretraining. Based on this observation Hinton decided to try a network that essentially tried all the offsets at once - hence the ReLU.
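
A quick numerical check of the claim that a ReLU approximates a sum of many sigmoids with shifted offsets: the stacked sigmoids track softplus, which in turn approximates the ReLU. The offsets (0.5, 1.5, ...) and the number of terms are my own choices for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

xs = np.linspace(-5.0, 5.0, 11)

# Sum of sigmoids with offsets 0.5, 1.5, 2.5, ... approximates softplus,
# which is itself a smooth approximation to the hard ReLU.
stacked = sum(sigmoid(xs - (i + 0.5)) for i in range(100))
softplus = np.log1p(np.exp(xs))
relu = np.maximum(0.0, xs)

print(np.max(np.abs(stacked - softplus)))  # small residual from the discrete sum
print(np.max(np.abs(softplus - relu)))     # largest near x = 0 (about log 2)
```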
