Activation Functions in Neural Networks

This article is inspired by this post and this post.

- The main purpose of an activation function is to introduce non-linearity into a neural network. Biologically, it mimics whether a neuron fires or not.

A neural network without activation functions would simply be a linear regression model. Neural networks are considered universal function approximators, which means they can compute and learn virtually any function. Almost any process we can think of can be represented as a functional computation in a neural network.
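
As a minimal illustration of this point, here is a NumPy sketch (the shapes and variable names are made up for the example) showing that two stacked linear layers with no activation in between collapse into a single linear map, i.e. the "deep" network is still just linear regression:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # batch of 4 inputs, 3 features
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)

h = x @ W1 + b1                      # "hidden layer" with no activation
y_deep = h @ W2 + b2                 # two-layer network output

# The same mapping expressed as one linear layer:
W = W1 @ W2
b = b1 @ W2 + b2
y_flat = x @ W + b

print(np.allclose(y_deep, y_flat))   # True: depth adds nothing without non-linearity
```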

- Activation functions must be differentiable so that gradient descent (back-propagation) can be applied.

- Some common activation functions (a code sketch of these follows after the Maxout discussion below):

  1. Sigmoid (deprecated). Output in [0, 1], which is not ideal for optimization. Suffers from the vanishing gradient problem (gradients get smaller and smaller during back-propagation, so early layers cannot learn).
  2. Tanh (deprecated). Output in [-1, 1]. Dominates sigmoid but has the same vanishing gradient problem.
  3. ReLU (Rectified Linear Unit). f(x) = max(0, x). Solves the vanishing gradient problem very simply and effectively. However, it should not be used in the last layer (use a linear unit or softmax there). It can also lead to dead neurons (when x < 0, the neuron never activates); Leaky ReLU or Maxout can address this problem.
    1. Leaky ReLU: introduces a small slope for negative inputs to keep the updates alive.
    2. Benefits of ReLU: cheap to compute, converges faster, and capable of outputting a true zero value, allowing representational sparsity.
  4. Maxout: a generalization of the ReLU and Leaky ReLU functions.

It is a learnable activation function.

It is a piecewise linear function that returns the maximum of its inputs, designed to be used in conjunction with the dropout regularization technique (in which units are randomly set to zero with some probability).

Both ReLU and leaky ReLU are special cases of Maxout. The Maxout neuron, therefore, enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) and does not have its drawbacks (dying ReLU).

However, it doubles the number of parameters for each neuron, so a higher total number of parameters needs to be trained.
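
To make the list above concrete, here is a small NumPy sketch of these activations and their derivatives (back-propagation needs the derivatives, which is why differentiability matters). The function names, the 0.01 leaky slope, and the Maxout shapes are illustrative choices for this sketch, not part of the original article:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)              # never exceeds 0.25, so gradients shrink layer by layer

def tanh_grad(x):
    return 1.0 - np.tanh(x) ** 2      # also saturates for large |x|

def relu(x):
    return np.maximum(0.0, x)         # f(x) = max(0, x); gradient is 0 or 1

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small negative slope keeps "dead" units updating

def maxout(x, W, b):
    # Maximum over k learned linear pieces; W: (in, k, out), b: (k, out).
    z = np.einsum('ni,iko->nko', x, W) + b
    return z.max(axis=1)

x = np.linspace(-6.0, 6.0, 5)
print(sigmoid_grad(x))                # tiny at the ends: the vanishing gradient problem
print(relu(x))
print(leaky_relu(x))

rng = np.random.default_rng(0)
inputs = rng.normal(size=(4, 3))      # batch of 4, 3 features
W = rng.normal(size=(3, 2, 5))        # k = 2 pieces, 5 output units (twice the parameters)
b = np.zeros((2, 5))
print(maxout(inputs, W, b).shape)     # (4, 5)
```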

Conclusion

  • Use ReLU for hidden-layer activations, but be careful with the learning rate and monitor the fraction of dead units.
  • If ReLU is giving problems, try Leaky ReLU, PReLU, or Maxout. Do not use sigmoid.
  • Normalize the data in order to achieve higher validation accuracy, and standardize it if you need results faster (a sketch of both transforms follows after this list).
  • The sigmoid and hyperbolic tangent activation functions cannot be used in networks with many layers due to the vanishing gradient problem.
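
A minimal sketch of the two preprocessing options mentioned above, applied per feature (column); the array names and values are illustrative:

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Min-max normalization: rescale each feature to [0, 1].
X_norm = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization: zero mean, unit variance per feature.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_norm)
print(X_std)
```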