Hard sigmoid activation function

The standard sigmoid is slow to compute because it requires evaluating the exp() function, which is done via fairly complex code (with some hardware assist if the CPU architecture provides it). In many cases the high-precision exp() results aren't needed, and an approximation will suffice. Such is the case in many forms of gradient-descent/optimization neural networks: the exact values aren't as important as the "ballpark" values, insofar as the results are comparable with small error.
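For reference, here is a minimal NumPy sketch of the standard sigmoid; the exp() call is the part the approximations below try to avoid:

```python
import numpy as np

def sigmoid(x):
    # Standard logistic sigmoid: 1 / (1 + exp(-x)).
    # The exp() evaluation dominates the cost.
    return 1.0 / (1.0 + np.exp(-x))
```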

Here’s a plot of the sigmoid, the “ultra-fast” sigmoid and the “hard” sigmoid:
[Figure: sigmoid (blue), ultra-fast sigmoid (green), hard sigmoid (red)]
Note how the sigmoid (blue) is smooth, while the ultra-fast (green) and hard (red) sigmoids are piecewise linear. In fact, these approximations are computed as linear interpolations between pairs of cut points. Note how the green plot touches the blue one at a few points, forming a set of line segments. Computing the results of this approximation is significantly faster than calling a routine that implements the sigmoid via exp() and division: all it requires is determining in which linear segment x lies and doing a simple interpolation. The approximation is just that: approximate, but the errors are low enough that many ANN algorithms run fine with it. A sketch of this kind of piecewise-linear interpolation follows below.
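Here is a minimal sketch of such a piecewise-linear approximation; the cut points chosen here are illustrative, not the exact set used by any particular library:

```python
import numpy as np

# Illustrative cut points (x, sigmoid(x)); real implementations choose
# their own set to trade speed against accuracy.
_CUT_X = np.array([-4.0, -2.0, 0.0, 2.0, 4.0])
_CUT_Y = 1.0 / (1.0 + np.exp(-_CUT_X))

def piecewise_sigmoid(x):
    # np.interp finds the segment containing x and interpolates linearly;
    # outside the outermost cut points it saturates at the endpoint values.
    return np.interp(x, _CUT_X, _CUT_Y)
```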

For the hard sigmoid there are fewer cut points: in fact there are only two, so only two comparisons are required to ascertain in which segment the result lies, and only one interpolation is required, for the central segment, since the other two segments are constant 0 and constant 1. In other words: it’s very fast. The error is larger than for the ultra-fast sigmoid, but depending on your particular case it might not significantly change the numerical results. In fact, for classification problems it rarely if ever causes errors (and when it does, some more training tends to correct it; extra training you can afford to run because your training cycles run so much faster than with the standard sigmoid).
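One common form of the hard sigmoid (other variants use a different slope) clips a straight line of slope 0.2 through (0, 0.5) to the range [0, 1]; a minimal NumPy sketch:

```python
import numpy as np

def hard_sigmoid(x):
    # Constant 0 below x = -2.5, constant 1 above x = 2.5, linear between:
    # only the clip (two comparisons) and one multiply-add per element.
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)
```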

As an added detail, you can view the piecewise interpolation used to compute the sigmoid as a form of regularization, which in the right circumstances helps a lot with the creation of useful feature detectors. Just don’t use the more extreme approximations (like the hard sigmoid) when your problem is function approximation: your error will decrease slowly, and might plateau before reaching your goal. But, again, if you’re doing classification it’s usually quite OK.
