An overview of gradient descent optimization algorithms

Gradient descent is the preferred way to optimize neural networks and many other machine learning algorithms but is often used as a black box. This post explores how many of the most popular gradient-based optimization algorithms such as Momentum, Adagrad, and Adam actually work.

Sebastian Ruder

Read more posts by this author.

SEBASTIAN RUDER

19 JAN 2016 • 28 MIN READ

This post explores how many of the most popular gradient-based optimization algorithms actually work.

Note: If you are looking for a review paper, this blog post is also available as an article on arXiv.

Update 20.03.2020: Added a note on recent optimizers.

Update 09.02.2018: Added AMSGrad.

Update 24.11.2017: Most of the content in this article is now also available as slides.

Update 15.06.2017: Added derivations of AdaMax and Nadam.

Update 21.06.16: This post was posted to Hacker News. The discussion provides some interesting pointers to related work and other techniques.

Table of contents:

Gradient descent is one of the most popular algorithms to perform optimization and by far the most common way to optimize neural networks. At the same time, every state-of-the-art Deep Learning library contains implementations of various algorithms to optimize gradient descent (e.g. lasagne's, caffe's, and keras' documentation). These algorithms, however, are often used as black-box optimizers, as practical explanations of their strengths and weaknesses are hard to come by.

This blog post aims at providing you with intuitions towards the behaviour of different algorithms for optimizing gradient descent that will help you put them to use. We are first going to look at the different variants of gradient descent. We will then briefly summarize challenges during training. Subsequently, we will introduce the most common optimization algorithms by showing their motivation to resolve these challenges and how this leads to the derivation of their update rules. We will also take a short look at algorithms and architectures to optimize gradient descent in a parallel and distributed setting. Finally, we will consider additional strategies that are helpful for optimizing gradient descent.

Gradient descent is a way to minimize an objective function J(θ)J(θ) parameterized by a model's parameters θ∈Rdθ∈Rd by updating the parameters in the opposite direction of the gradient of the objective function ∇θJ(θ)∇θJ(θ) w.r.t. to the parameters. The learning rate ηη determines the size of the steps we take to reach a (local) minimum. In other words, we follow the direction of the slope of the surface created by the objective function downhill until we reach a valley. If you are unfamiliar with gradient descent, you can find a good introduction on optimizing neural networks here.

請訪問原文鏈接：Overview of gradient descent algorithms

發表評論

所有評論

還沒有人評論，想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.

【轉載】Overview of gradient descent algorithms

An overview of gradient descent optimization algorithms

Sebastian Ruder

請訪問原文鏈接：Overview of gradient descent algorithms

自學編程兩個月，現在我月入 4 萬元

Google Chrome驅動程序 124.0.6367.62（正式版本）去哪下載？

[NOTE in progress] Simulation Optimization

A Road Map for Deep Learning

Stochastic Optimization: Casual Notes

Graph Neural Network: A First Glance

Git 項目管理流程與協作方式

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結