Activation Functions in Neural Networks

This article is inspired by here and here.

- The main purpose of an activation function is to add non-linearity to a neural network. Biologically, it mimics whether a neuron fires or not.

A neural network without activation functions would simply be a linear regression model. Neural networks are considered universal function approximators, meaning they can compute and learn (approximate) essentially any function. Almost any process we can think of can be represented as a functional computation in a neural network.
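To make this concrete, here is a minimal NumPy sketch (the layer sizes and random weights are just for illustration): without an activation in between, two stacked linear layers collapse into a single linear map, which is why such a network is no more expressive than linear regression.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))          # batch of 4 samples, 3 features
W1 = rng.normal(size=(3, 5)); b1 = rng.normal(size=5)
W2 = rng.normal(size=(5, 2)); b2 = rng.normal(size=2)

# Two "layers" with no activation in between ...
two_layers = (x @ W1 + b1) @ W2 + b2

# ... are exactly one linear layer with merged weights.
W = W1 @ W2
b = b1 @ W2 + b2
one_layer = x @ W + b
print(np.allclose(two_layers, one_layer))  # True

# With a non-linearity (e.g. ReLU) in between, the collapse no longer holds.
with_relu = np.maximum(0, x @ W1 + b1) @ W2 + b2
print(np.allclose(with_relu, one_layer))   # False (in general)
```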

- Activation functions must be differentiable for gradient descent (back-propagation) to work.

- Some common activation functions (a NumPy sketch of these appears after the list):

  1. Sigmoid (deprecated). Output in [0, 1], which is not zero-centered and therefore unfavorable for optimization. Suffers from the vanishing gradient problem (gradients get smaller and smaller during back-propagation, so early layers cannot learn).
  2. Tanh (deprecated). Output in [-1, 1]. Dominates sigmoid (its output is zero-centered) but has the same vanishing gradient problem.
  3. ReLU (Rectified Linear Unit). f(x) = max(0, x). A very simple and effective fix for the vanishing gradient problem. However, it should not be used in the output layer (use a linear unit or softmax there). It can also lead to dead neurons (when x < 0, the neuron never activates); Leaky ReLU or Maxout can solve this problem.
    1. Leaky ReLU: introduces a small slope for negative inputs to keep the updates alive.
    2. Benefits of ReLU: cheap to compute, converges faster, and capable of outputting a true zero, which allows representational sparsity.
  4. Maxout. A generalization of ReLU and Leaky ReLU, and a learnable activation function.
     It is a piecewise linear function that returns the maximum of several learned linear inputs, designed to be used together with the dropout regularization technique (units are randomly set to 0 during training).
     Both ReLU and Leaky ReLU are special cases of Maxout. A Maxout neuron therefore enjoys all the benefits of a ReLU unit (linear regime of operation, no saturation) without its drawback (dying ReLU).
     However, it doubles the number of parameters per neuron, so a higher total number of parameters needs to be trained. (See the sketches below.)
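As a reference for the list above, here is a minimal NumPy sketch of these activations and their derivatives (the function names and the Leaky ReLU slope alpha=0.01 are my own choices, not from the sources). It also illustrates why sigmoid gradients vanish with depth while ReLU gradients do not on the active path.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)          # peaks at 0.25, shrinks fast for large |x|

def tanh(x):
    return np.tanh(x)

def d_tanh(x):
    return 1.0 - np.tanh(x) ** 2  # peaks at 1.0, but still saturates

def relu(x):
    return np.maximum(0.0, x)

def d_relu(x):
    return (x > 0).astype(float)  # exactly 0 for x < 0 -> possible dead neurons

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def d_leaky_relu(x, alpha=0.01):
    return np.where(x > 0, 1.0, alpha)  # small slope keeps gradients alive

# Vanishing gradient in a nutshell: back-propagation multiplies one such
# derivative per layer, so with sigmoid (<= 0.25 each) the product shrinks
# geometrically as depth grows.
depth = 10
print("sigmoid, 10 layers:", 0.25 ** depth)  # ~9.5e-07
print("relu,    10 layers:", 1.0 ** depth)   # 1.0 along the active path
```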
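And here is a rough sketch of a single Maxout unit with k = 2 learned linear pieces (the shapes and k are assumptions for illustration): fixing one piece to zero recovers ReLU, which is the special-case relationship described above.

```python
import numpy as np

def maxout(x, W, b):
    """x: (batch, d_in); W: (k, d_in, d_out); b: (k, d_out)."""
    # Compute all k linear pieces, then take the element-wise maximum.
    z = np.einsum('bi,kio->bko', x, W) + b   # (batch, k, d_out)
    return z.max(axis=1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))

# ReLU as a special case: piece 1 is an ordinary linear map, piece 2 is zero.
W_lin = rng.normal(size=(3, 2))
W = np.stack([W_lin, np.zeros_like(W_lin)])  # (2, 3, 2)
b = np.zeros((2, 2))

print(np.allclose(maxout(x, W, b), np.maximum(0, x @ W_lin)))  # True
```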

Conclusion

  • Use ReLU for hidden-layer activations, but be careful with the learning rate and monitor the fraction of dead units (a sketch of this follows the list).
  • If ReLU gives you problems, try Leaky ReLU, PReLU, or Maxout. Do not use sigmoid.
  • Normalize the data to achieve higher validation accuracy, and standardize it if you need results faster.
  • The sigmoid and hyperbolic tangent activation functions cannot be used in networks with many layers due to the vanishing gradient problem.
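As a rough illustration of the "monitor the fraction of dead units" advice, the sketch below (the helper name and the batch-based criterion are my own assumptions, not from the sources) counts a ReLU unit as dead when its pre-activation is non-positive for every sample in a batch, so it receives no gradient from that batch.

```python
import numpy as np

def dead_unit_fraction(pre_activations):
    """pre_activations: (batch, n_units) values fed into ReLU."""
    never_active = np.all(pre_activations <= 0, axis=0)
    return never_active.mean()

rng = np.random.default_rng(0)
z = rng.normal(loc=-1.0, size=(128, 64))  # strongly negative shift -> many dead units
print(f"dead units: {dead_unit_fraction(z):.1%}")
```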