Given a problem and the corresponding data (each data point is a sample pair (x, y)), solving it with machine learning takes two steps:
1. Model selection: choose a family of functions F. This F can be SVM, linear regression, boosting, neural networks (a neural network is just a function approximator), and so on.
2. Parameter estimation: once the family F is fixed, the remaining task is to solve for the optimal weights w by minimizing a loss function, which yields the final model f. A minimal sketch of both steps follows below.
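As a concrete illustration of the two steps, here is a minimal NumPy sketch. It assumes, purely for the example, a linear function family for F, squared loss, and synthetic data; all names are invented:

```python
import numpy as np

# Step 1, model selection: fix a family of functions F. Here F is the
# linear family f_w(x) = w . x, standing in for SVM, boosting, neural
# networks, etc.
def f(w, x):
    return x @ w

# Step 2, parameter estimation: given sample pairs (x_i, y_i), solve
# for the w that minimizes the average squared loss; for this family
# and loss the minimizer is the least-squares solution.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))              # 100 inputs x_i
y = X @ np.array([1.0, -2.0, 0.5])         # matching targets y_i
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f(w_hat, X[:3]))                     # the learned model f = f_{w_hat}
```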
When the training set is very large (large-scale training), having every optimization update average the loss and gradient over all z in Z costs too much time and compute, so each update is instead performed on a sampled batch. This leads to the notions of expected risk E(f) and empirical risk En(f):
- Empirical risk En(f): the average loss over the n training samples (xi, yi). It measures the model's performance on the training set.
- Expected risk E(f): the expectation of the loss over the ground-truth distribution from which samples are drawn. It measures the model's generalization ability.
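The distinction can be made concrete with a small Monte-Carlo sketch (the distribution, model, and names below are all invented for illustration): the empirical risk averages the loss over the n samples we actually have, while the expected risk is an expectation over the generating distribution, here approximated with a large number of fresh draws.

```python
import numpy as np

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0])                  # a fixed model f_w(x) = w . x

def loss(w, X, y):                         # squared loss l(f_w(x), y)
    return (X @ w - y) ** 2

def sample(m):                             # draw z = (x, y) from the ground-truth distribution
    X = rng.normal(size=(m, 2))
    y = X @ np.array([1.2, -1.8]) + 0.1 * rng.normal(size=m)
    return X, y

X_train, y_train = sample(100)             # the n training samples
E_n = loss(w, X_train, y_train).mean()     # empirical risk En(f)
X_big, y_big = sample(1_000_000)           # fresh draws approximate the expectation
E = loss(w, X_big, y_big).mean()           # Monte-Carlo estimate of expected risk E(f)
print(E_n, E)
```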
Vapnik & Chervonenkis's statistical learning theory shows that, when the model family is chosen reasonably, minimizing the empirical risk is a valid substitute for minimizing the expected risk. Throughout, "risk" can simply be read as average loss.
Model optimization --- mini-batch GD: for non-convex optimization over large-scale data, Rumelhart et al. established the soundness of using gradient descent (GD; here, mini-batch GD (MBGD)) to optimize the empirical risk. A sketch of one such update loop follows.
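A minimal MBGD sketch under the same illustrative assumptions as above (linear model, squared loss, synthetic data): each update averages the gradient over a small sampled batch rather than over all n examples.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 10_000, 50, 64
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

w = np.zeros(d)
gamma = 0.05                                   # fixed learning rate
for t in range(2_000):
    idx = rng.integers(0, n, size=batch)       # sample a mini-batch of the training set
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ w - yb) / batch    # average gradient over the batch only,
    w -= gamma * grad                          # not over all n samples
```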
Optimizing the empirical risk with GD further splits into first-order and second-order GD:
On the order of an optimization algorithm: a first-order method uses only the first derivatives of the loss function, while a second-order method also uses the second derivatives (the Hessian matrix). As the parameter count explodes, the cost of computing all these derivatives grows as well, and it grows faster the higher the order. High-order methods are therefore unsuitable for training models with as many parameters as neural networks. Newton's method is a second-order method (it uses the Hessian), while GD is a first-order one; GD is a relatively efficient optimization method, since computing the first-order partial derivatives w.r.t. all the parameters has the same computational complexity as just evaluating the function.
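The cost argument can be seen directly in a least-squares sketch (illustrative setup): the gradient is d numbers obtained in one O(nd) pass, while the Hessian is a d-by-d matrix costing O(nd^2) to form and O(d^3) to invert.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1_000, 200
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = np.zeros(d)

# First-order information: gradient of the mean squared loss.
# One pass over the data: O(n*d) time, O(d) storage.
grad = 2 * X.T @ (X @ w - y) / n     # shape (d,)

# Second-order information: the Hessian. O(n*d^2) time to form,
# O(d^2) storage, O(d^3) to invert; already a 200x200 matrix here,
# and hopeless at neural-network parameter counts.
H = 2 * X.T @ X / n                  # shape (d, d)
```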
- First-order GD: each iteration steps against the gradient of the empirical risk,
  $$w_{t+1} = w_t - \gamma \, \frac{1}{n} \sum_{i=1}^{n} \nabla_w Q(z_i, w_t),$$
  where $\gamma$ is the learning rate, a real scalar. When (1) sufficient regularity assumptions hold, (2) the initial estimate $w_0$ is close enough to the optimum, and (3) the learning rate is sufficiently small, first-order GD achieves linear convergence: the residual error $\rho$ satisfies $-\log \rho \sim t$ (a numerical sketch follows after this list).
- Second-order GD: a variant of the well-known Newton algorithm,
  $$w_{t+1} = w_t - \Gamma_t \, \frac{1}{n} \sum_{i=1}^{n} \nabla_w Q(z_i, w_t),$$
  where $\Gamma_t$ is a scaling matrix: a positive-definite matrix approximating the inverse of the Hessian of the loss at the optimum. When (1) sufficient regularity assumptions hold and (2) the initial estimate $w_0$ is close enough to the optimum, second-order GD achieves quadratic convergence: the residual error $\rho$ satisfies $-\log \log \rho \sim t$ (see the second sketch below).
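Both convergence behaviors can be observed on a small least-squares problem; the setup below is invented for illustration and uses realizable data, so the optimal residual error is 0. First, first-order GD: with a sufficiently small learning rate, -log(rho) grows roughly linearly in t.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)                   # realizable: optimal residual error is 0

w = np.zeros(d)
H = 2 * X.T @ X / n                          # Hessian of the empirical risk
gamma = 1.0 / np.linalg.eigvalsh(H).max()    # a sufficiently small learning rate

for t in range(1, 301):
    grad = 2 * X.T @ (X @ w - y) / n         # full gradient of the empirical risk
    w -= gamma * grad                        # first-order GD update
    if t % 100 == 0:
        rho = np.mean((X @ w - y) ** 2)      # residual error
        print(t, -np.log(rho))               # grows roughly linearly in t
```

Second, a Newton-style step with $\Gamma_t = H^{-1}$: on a quadratic loss a single second-order step lands on the optimum exactly, the extreme case of the doubly exponential shrinkage of $\rho$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d)

w = np.zeros(d)
H = 2 * X.T @ X / n                          # Hessian of the empirical risk
grad = 2 * X.T @ (X @ w - y) / n
w -= np.linalg.solve(H, grad)                # Gamma_t = H^{-1} applied to the gradient
print(np.mean((X @ w - y) ** 2))             # ~0 after a single Newton step
```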
Model optimization --- stochastic GD: SGD is a drastic simplification of MBGD, with the "hope" that it can still achieve the performance of BGD/MBGD despite the stochastic sampling noise. Each iteration estimates the gradient from a single randomly picked example $z_t$:
$$w_{t+1} = w_t - \gamma_t \nabla_w Q(z_t, w_t).$$
Because of SGD's stochasticity, i.e., examples are drawn at random according to the ground-truth distribution, SGD effectively optimizes the expected risk directly rather than the empirical risk.
Convergence of SGD requires a decreasing learning rate $\gamma_t$ satisfying the following two conditions: $\sum_t \gamma_t^2 < \infty$ and $\sum_t \gamma_t = \infty$.
SGD attains its best convergence speed when the learning rate satisfies $\gamma_t \sim t^{-1}$; a sketch with this schedule follows.
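A minimal SGD sketch under the same illustrative assumptions: each step consumes one fresh sample drawn from the generating distribution (so the expected risk is targeted directly, as noted above) and uses the decreasing schedule gamma_t = gamma_0 / t, which satisfies both conditions.

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([1.2, -1.8])               # ground truth behind the distribution
w = np.zeros(2)
gamma0 = 0.5

for t in range(1, 100_001):
    x = rng.normal(size=2)                   # one fresh z_t = (x_t, y_t) per update,
    y = x @ w_true + 0.1 * rng.normal()      # drawn from the ground-truth distribution
    grad = 2 * (x @ w - y) * x               # gradient of the loss on this single sample
    w -= (gamma0 / t) * grad                 # gamma_t ~ 1/t: sum gamma_t diverges,
                                             # sum gamma_t^2 converges
print(w)                                     # approaches w_true
```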