On Stochastic Gradient Descent and Optimization in Machine Learning

Original

Trasper1

2018-12-07 21:48

Given a problem and its data (each $z \in Z$ is a sample pair $(x, y)$), solving it with machine learning takes two steps:

1. Model selection: choose a family of functions $F$. This $F$ can be SVMs, linear regression, boosting, neural networks (a neural network is just a function approximator), and so on.

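2. Parameter estimation: choosing a model means working with a parameterized $f_{w}(x)\in F$. What remains is to find the best set of weights $w$ by minimizing a loss $Q(z,w)=l(f_{w}(x), y)$, which gives the final model $f$.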
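As a minimal sketch of the notation above, the snippet below spells out $f_w$, $l$ and $Q$ for one toy case. The scalar linear model and the squared loss are assumptions chosen for illustration; the post does not commit to a particular model or loss.

```python
# Hypothetical model family F: scalar linear models f_w(x) = w * x.
def f(w, x):
    return w * x

# Hypothetical loss l(y_hat, y): squared error, chosen only for illustration.
def l(y_hat, y):
    return 0.5 * (y_hat - y) ** 2

# Q(z, w) = l(f_w(x), y) for a sample pair z = (x, y).
def Q(z, w):
    x, y = z
    return l(f(w, x), y)

z = (2.0, 3.0)      # one made-up sample pair
print(Q(z, 1.0))    # loss of the model with w = 1.0 on that sample
```

Parameter estimation then amounts to searching for the $w$ that makes the average of `Q` over the data as small as possible.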

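When the training set is large (large-scale training), averaging the loss and gradient over all $z \in Z$ for every update costs too much time and too many resources, so each update is instead computed on a sampled batch. This brings in the notions of expected risk $E(f)$ and empirical risk $E_{n}(f)$: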

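Empirical risk $E_{n}(f)$: the average loss over $n$ samples $(x_i, y_i)$, for example the training set or a batch drawn from it. It measures the model's performance on the training data.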

$E_{n}(f)=\frac{1}{n}\sum_{i=1}^{n}l(f(x_{i}), y_{i})$

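Expected risk $E(f)$: the expected loss under the ground-truth distribution $P(z)$ that generates the samples, not just over the samples we happen to hold. It measures the model's generalization ability.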

$E(f)=\int l(f(x),y)dP(z)$
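The difference between the two risks can be sketched numerically. The toy distribution below ($x$ uniform on $[-1, 1]$, $y = 2x$ plus Gaussian noise), the scalar linear model and the squared loss are all assumptions made up for this example; they are chosen so that the expected risk has a closed form to compare against.

```python
import random

TRUE_SLOPE = 2.0   # assumed ground truth: y = 2x + noise
NOISE_STD = 0.1    # assumed noise level

def sample_z(rng):
    """Draw one sample pair z = (x, y) from the toy distribution P(z)."""
    x = rng.uniform(-1.0, 1.0)
    y = TRUE_SLOPE * x + rng.gauss(0.0, NOISE_STD)
    return x, y

def Q(z, w):
    """Squared loss of the scalar linear model f_w(x) = w * x."""
    x, y = z
    return 0.5 * (w * x - y) ** 2

def empirical_risk(samples, w):
    """E_n(f): average loss over the n samples we actually hold."""
    return sum(Q(z, w) for z in samples) / len(samples)

def expected_risk(w):
    """E(f) in closed form for this toy P(z); E[x^2] = 1/3 for Uniform(-1, 1)."""
    return 0.5 * ((w - TRUE_SLOPE) ** 2 / 3.0 + NOISE_STD ** 2)

rng = random.Random(0)
w = 1.0
for n in (10, 1000, 100000):
    samples = [sample_z(rng) for _ in range(n)]
    print(n, empirical_risk(samples, w), expected_risk(w))
```

With few samples the empirical risk can sit noticeably away from the expected risk; as $n$ grows the two get close, which is the intuition behind using one in place of the other.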

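Vapnik and Chervonenkis's statistical learning theory justifies minimizing the empirical risk in place of the expected risk when the model family is chosen sensibly (in particular, when $F$ is sufficiently restrictive). Here "risk" can be read as average loss.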

Model optimization --- mini-batch GD: for non-convex optimization over large-scale data, Rumelhart et al. proposed minimizing the empirical risk with gradient descent (GD; here this means mini-batch GD, MBGD).

Optimizing the empirical risk with GD can in turn be split into first-order GD and second-order GD:

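On the order of an optimization algorithm: a first-order method uses only the first derivatives of the loss function, while a second-order method also uses the second derivatives (the Hessian matrix). As the number of parameters grows, so does the cost of computing derivatives with respect to all of them, and the higher the order, the faster that cost grows. High-order methods are therefore a poor fit for training models with very many parameters, such as neural networks. Newton's method is a second-order method (it uses the Hessian), while GD is a first-order method. GD is a relatively efficient optimization method, since computing the first-order partial derivatives w.r.t. all the parameters has the same order of computational complexity as evaluating the function itself.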
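To make the cost gap concrete, here is an illustrative count, assuming a model with $d$ parameters and a dense Hessian (the value $d = 10^6$ is made up for the example). The gradient has $d$ entries while the Hessian has $d^2$:

$d=10^{6}\;\Rightarrow\;\text{gradient: }10^{6}\text{ entries},\qquad \text{Hessian: }d^{2}=10^{12}\text{ entries}$

On top of storing it, solving a linear system with a dense Hessian by direct methods takes on the order of $d^{3}$ operations.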

First-order GD:

$w_{t+1}=w_{t}-\gamma\frac{1}{n}\sum_{i=1}^{n}\nabla_{w}Q(z_{i},w_{t})$

$\gamma$ is the learning rate, a real number. When (1) sufficient regularity assumptions hold, (2) the initial estimate $w_0$ is close enough to the optimum, and (3) the learning rate $\gamma$ is sufficiently small, first-order GD achieves linear convergence: the residual error $\rho$ satisfies $-\log\rho \sim t$.
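Linear convergence can be seen on a small example. The sketch below runs the first-order update on three made-up sample pairs with a scalar linear model and squared loss (all assumptions for illustration), and takes the residual $\rho$ to be the distance $|w_t - w^*|$ to the closed-form minimizer.

```python
# Made-up sample pairs z = (x, y); model f_w(x) = w * x with squared loss.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]
n = len(data)

def grad(w):
    # (1/n) * sum_i dQ(z_i, w)/dw, with Q(z, w) = 0.5 * (w * x - y) ** 2
    return sum((w * x - y) * x for x, y in data) / n

# Closed-form minimizer of the empirical risk for this quadratic problem.
w_star = sum(x * y for x, y in data) / sum(x * x for x, y in data)

gamma = 0.1          # assumed learning rate, small enough to be stable here
w = 0.0              # initial estimate w_0
residuals = []
for t in range(50):
    residuals.append(abs(w - w_star))   # residual before the t-th update
    w = w - gamma * grad(w)             # first-order GD update

# The residual shrinks by the same factor at every step.
for t in (0, 1, 2, 3):
    print(t, residuals[t], residuals[t + 1] / residuals[t])
```

A constant shrink factor per step means $\log\rho$ falls in a straight line with $t$, which is what $-\log\rho \sim t$ expresses.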

Second-order GD: a variant of the well-known Newton algorithm.

$w_{t+1}=w_{t}-\Gamma_{t}\frac{1}{n}\sum_{i=1}^{n}\nabla_{w}Q(z_{i},w_{t})$

$\Gamma_{t}$ is a scaling matrix: a positive definite matrix that approaches the inverse of the Hessian of the loss at the optimum. When (1) sufficient regularity assumptions hold and (2) the initial estimate $w_0$ is close enough to the optimum, second-order GD achieves quadratic convergence: the residual error $\rho$ satisfies $-\log\log\rho \sim t$.
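A one-dimensional sketch of the second-order update follows. It is plain Newton's method, so $\Gamma_t$ is taken to be the inverse Hessian at the current iterate $w_t$ (which tends to the inverse Hessian at the optimum as $w_t$ converges). The loss $Q(z,w)=e^{wx}-ywx$ (a Poisson-regression style loss) and the three samples are assumptions picked so that the optimum, $w^*=\ln 2$, is known exactly.

```python
import math

# Made-up samples z = (x, y); every x is 1 so the optimum has a closed form.
data = [(1.0, 1.0), (1.0, 2.0), (1.0, 3.0)]
n = len(data)

def grad(w):
    # First derivative of the average of Q(z, w) = exp(w * x) - y * w * x.
    return sum((math.exp(w * x) - y) * x for x, y in data) / n

def hess(w):
    # Second derivative of the same average (a 1x1 "Hessian"), always positive.
    return sum(math.exp(w * x) * x * x for x, y in data) / n

w_star = math.log(2.0)   # exp(w) = mean(y) = 2 at the optimum

w = 0.0                  # initial estimate w_0
residuals = []
for t in range(6):
    residuals.append(abs(w - w_star))   # residual before the t-th update
    Gamma_t = 1.0 / hess(w)             # scaling "matrix": inverse Hessian at w_t
    w = w - Gamma_t * grad(w)           # second-order (Newton) update

print(residuals)
```

Once the iterate is near the optimum, each residual is roughly proportional to the square of the previous one, so the number of correct digits about doubles per step; that is the behaviour $-\log\log\rho \sim t$ describes.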

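Model optimization --- Stochastic GD: SGD is a drastic simplification of MBGD. Each update uses the gradient of a single randomly picked sample instead of an average, in the "hope" that it still matches the performance of batch GD (BGD) or MBGD despite the noise introduced by stochastic sampling.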

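$w_{t+1}=w_{t}-\gamma_{t}\nabla_{w}Q(z_{t},w_{t})$

Here $z_t$ is the sample picked at random at step $t$ and $\gamma_t$ is the learning rate at that step. Because of this randomness, if every $z_t$ is drawn from the ground-truth distribution, SGD in effect optimizes the expected risk directly rather than the empirical risk.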
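The SGD update can be sketched on a stream of fresh samples. Everything specific here is an assumption for illustration: the toy distribution ($x$ uniform on $[-1,1]$, $y=2x$ plus Gaussian noise), the scalar linear model with squared loss, and the decreasing step size $\gamma_t = 3/(t+10)$.

```python
import random

TRUE_SLOPE = 2.0   # assumed ground truth: y = 2x + noise
NOISE_STD = 0.1    # assumed noise level

def sample_z(rng):
    """Draw a fresh sample pair z = (x, y) from the toy distribution."""
    x = rng.uniform(-1.0, 1.0)
    y = TRUE_SLOPE * x + rng.gauss(0.0, NOISE_STD)
    return x, y

def grad_Q(z, w):
    """dQ/dw for Q(z, w) = 0.5 * (w * x - y) ** 2 on a single sample."""
    x, y = z
    return (w * x - y) * x

def sgd(num_steps, seed=0):
    rng = random.Random(seed)
    w = 0.0                              # initial estimate w_0
    for t in range(1, num_steps + 1):
        gamma_t = 3.0 / (t + 10.0)       # assumed schedule, decays like 1/t
        z_t = sample_z(rng)              # one fresh draw per update
        w = w - gamma_t * grad_Q(z_t, w) # SGD update on that single sample
    return w

print(sgd(20000))   # ends up close to TRUE_SLOPE
```

No sample is ever reused, so there is no fixed training set whose empirical risk is being minimized; the iterates head toward the minimizer of the expected risk, which for this toy setup is `TRUE_SLOPE`.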

For SGD to converge, the learning rate must decrease and satisfy the following two conditions:

$\sum_{t}\gamma_{t}=\infty$
$\sum_{t}\gamma_{t}^{2}<\infty$
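As a quick check (a standard worked example, not taken from the original post), the schedule $\gamma_t = 1/t$ satisfies both conditions: the harmonic series diverges, while the sum of its squares stays finite.

$\sum_{t=1}^{\infty}\frac{1}{t}=\infty,\qquad \sum_{t=1}^{\infty}\frac{1}{t^{2}}=\frac{\pi^{2}}{6}<\infty$

A constant rate $\gamma_t=\gamma>0$ meets the first condition but not the second, since $\sum_t \gamma^2$ diverges.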

SGD reaches its best convergence speed when the learning rate decays as $\gamma_{t} \sim t^{-1}$.
