Activation Functions in Neural Networks

This article is inspired by here and here.

- The main purpose of an activation function is to add non-linearity to a neural network. Biologically, it mimics whether a neuron fires or not.

A neural network without activation functions would simply be a linear regression model. Neural networks are considered universal function approximators, meaning they can compute and learn (approximate) essentially any function. Almost any process we can think of can be represented as a functional computation in a neural network.
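To make this concrete, here is a minimal NumPy sketch (the layer sizes and random weights are just for illustration): without an activation in between, two stacked linear layers collapse into a single linear map, which is why such a network is no more expressive than linear regression.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # batch of 4 samples, 3 features
W1 = rng.normal(size=(3, 5)); b1 = rng.normal(size=5)
W2 = rng.normal(size=(5, 2)); b2 = rng.normal(size=2)

# Two "layers" with no activation in between ...
two_layers = (x @ W1 + b1) @ W2 + b2

# ... are exactly one linear layer with merged weights.
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b
print(np.allclose(two_layers, one_layer))  # True

# With a non-linearity (e.g. ReLU) in between, the collapse no longer holds.
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(np.allclose(with_relu, one_layer))   # False (in general)
```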

- Activation functions must be differentiable for gradient descent (back-propagation) to work.

- Some common activation functions (a NumPy sketch of these appears after the list):

  1. Sigmoid (deprecated). Output in [0, 1], which is not zero-centered and therefore unfavorable for optimization. Suffers from the vanishing gradient problem (gradients get smaller and smaller during back-propagation, so early layers cannot learn).
  2. Tanh (deprecated). Output in [-1, 1]. Dominates sigmoid (its output is zero-centered) but has the same vanishing gradient problem.
  3. ReLU (Rectified Linear Unit). f(x) = max(0, x). A very simple and effective fix for the vanishing gradient problem. However, it should not be used in the output layer (use a linear unit or softmax there). It can also lead to dead neurons (when x < 0, the neuron never activates); Leaky ReLU or Maxout can solve this problem.
    1. Leaky ReLU: introduces a small slope for negative inputs to keep the updates alive.
    2. Benefits of ReLU: cheap to compute, converges faster, and capable of outputting a true zero, which allows representational sparsity.
  4. Maxout. A generalization of ReLU and Leaky ReLU, and a learnable activation function.
     It is a piecewise linear function that returns the maximum of several learned linear inputs, designed to be used together with the dropout regularization technique (units are randomly set to 0 during training).
     Both ReLU and Leaky ReLU are special cases of Maxout. A Maxout neuron therefore enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) without its drawback (dying ReLU).
     However, it doubles the number of parameters per neuron, so a higher total number of parameters needs to be trained. (See the sketches below.)
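As a reference for the list above, here is a minimal NumPy sketch of these activations and their derivatives (the function names and the Leaky ReLU slope alpha=0.01 are my own choices, not from the sources). It also illustrates why sigmoid gradients vanish with depth while ReLU gradients do not on the active path.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25, shrinks fast for large |x|

def tanh(x):
    return np.tanh(x)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2  # peaks at 1.0, but still saturates

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return (x > 0).astype(float)  # exactly 0 for x < 0 -> possible dead neurons

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def d_leaky_relu(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)  # small slope keeps gradients alive

# Vanishing gradient in a nutshell: back-propagation multiplies one such
# derivative per layer, so with sigmoid (<= 0.25 each) the product shrinks
# geometrically as depth grows.
depth = 10
print("sigmoid, 10 layers:", 0.25 ** depth)  # ~9.5e-07
print("relu,    10 layers:", 1.0 ** depth)   # 1.0 along the active path
```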
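And here is a rough sketch of a single Maxout unit with k = 2 learned linear pieces (the shapes and k are assumptions for illustration): fixing one piece to zero recovers ReLU, which is the special-case relationship described above.

```python
import numpy as np

def maxout(x, W, b):
    """x: (batch, d_in); W: (k, d_in, d_out); b: (k, d_out)."""
    # Compute all k linear pieces, then take the element-wise maximum.
    z = np.einsum('bi,kio->bko', x, W) + b   # (batch, k, d_out)
    return z.max(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))

# ReLU as a special case: piece 1 is an ordinary linear map, piece 2 is zero.
W_lin = rng.normal(size=(3, 2))
W = np.stack([W_lin, np.zeros_like(W_lin)])  # (2, 3, 2)
b = np.zeros((2, 2))

print(np.allclose(maxout(x, W, b), np.maximum(0, x @ W_lin)))  # True
```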

Conclusion

  • Use ReLU for hidden-layer activations, but be careful with the learning rate and monitor the fraction of dead units (a sketch of this follows the list).
  • If ReLU gives you problems, try Leaky ReLU, PReLU, or Maxout. Do not use sigmoid.
  • Normalize the data to achieve higher validation accuracy, and standardize it if you need results faster.
  • The sigmoid and hyperbolic tangent activation functions cannot be used in networks with many layers due to the vanishing gradient problem.
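As a rough illustration of the "monitor the fraction of dead units" advice, the sketch below (the helper name and the batch-based criterion are my own assumptions, not from the sources) counts a ReLU unit as dead when its pre-activation is non-positive for every sample in a batch, so it receives no gradient from that batch.

```python
import numpy as np

def dead_unit_fraction(pre_activations):
    """pre_activations: (batch, n_units) values fed into ReLU."""
    never_active = np.all(pre_activations <= 0, axis=0)
    return never_active.mean()

rng = np.random.default_rng(0)
z = rng.normal(loc=-1.0, size=(128, 64))  # strongly negative shift -> many dead units
print(f"dead units: {dead_unit_fraction(z):.1%}")
```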