Automatic training with optimizer.minimize() in TensorFlow, and how to choose which variables get trained

 

This post walks through the details of TensorFlow's automatic training and ties it back to the underlying formulas. Corrections and suggestions are welcome.

Why I wrote this: some tutorials are rather vague and never show the intent, the characteristics, or the use cases.

Intended audience: beginners who know a bit of code but have been left half-understanding by the vaguer tutorials.

(Links at the bottom cover more related topics, such as breaking down and hand-implementing the optimization algorithms, and the usage of EMA, Batch Normalization, and so on.)

 

 

Main text

TensorFlow provides a variety of optimizers: classic gradient descent (GradientDescent) plus variants such as Adagrad, Momentum, Nesterov, and Adam.

The typical learning step is gradient descent (GradientDescent). An optimizer can carry this process out automatically: the loss you specify ties all the related variables together into a computation graph, and optimizer(learning_rate).minimize(loss) then performs gradient descent automatically. minimize() is itself a combination of two operations, which is broken down later in this post.

The computation-graph concept: a variable can only be trained if it is in the computation graph, or more plainly, if it appears in the formula (or chain of formulas). If a variable has no direct or indirect relationship to the loss, it will not be trained.
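A minimal sketch of this point (TF1.x style; the names are made up for illustration): the variable unused never appears in the loss, so minimize() leaves it untouched.

import tensorflow as tf

w = tf.Variable(3.0)        # connected to the loss, so it will be trained
unused = tf.Variable(5.0)   # no path to the loss, so it will never change
loss = tf.square(w - 1.0)
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(50):
        sess.run(train_step)
    print(sess.run([w, unused]))  # w is close to 1.0, unused is still 5.0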

 

 

Source code

The training process is really just the process of modifying the tf.Variable objects in the computation graph; you can think of all of these variables as weights. For simplicity, the example below introduces no placeholder and no x, so there is no distinction between x and w, but the variable prediction_to_train = 3 is effectively equivalent to:

prediction_to_train (y) = w * x, with initial value w = 3 and a hidden, fixed x = 1 (i.e. a single fixed training sample).

The loss here is the squared difference and the label is 1, so training means fitting the data point x = 1, y = 1: starting from the initial w = 3, train w until it becomes 1.

import tensorflow as tf

#define variable and error
label = tf.constant(1,dtype = tf.float32)
prediction_to_train = tf.Variable(3,dtype=tf.float32)

#define losses and train
manual_compute_loss = tf.square(prediction_to_train - label)
optimizer = tf.train.GradientDescentOptimizer(0.01)
train_step = optimizer.minimize(manual_compute_loss)

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for _ in range(100):
        print('variable is ', sess.run(prediction_to_train), ' and the loss is ',sess.run(manual_compute_loss))
        sess.run(train_step)

Output

variable is  3.0  and the loss is  4.0
variable is  2.96  and the loss is  3.8416002
variable is  2.9208  and the loss is  3.6894724
variable is  2.882384  and the loss is  3.5433698
variable is  2.8447363  and the loss is  3.403052
variable is  2.8078415  and the loss is  3.268291

...
variable is  2.0062745  and the loss is  1.0125883
variable is  1.986149  and the loss is  0.9724898
variable is  1.966426  and the loss is  0.9339792

...

variable is  1.0000029  and the loss is  8.185452e-12
variable is  1.0000029  and the loss is  8.185452e-12
variable is  1.0000029  and the loss is  8.185452e-12
variable is  1.0000029  and the loss is  8.185452e-12
variable is  1.0000029  and the loss is  8.185452e-12

 

How to restrict which Variables get trained:

Training modifies the tf.Variable objects in the computation graph (by default all of them; the set can be restricted with var_list). Given that fact, you can use tf.constant or plain Python values to keep a quantity from being trained; this is also one of the tricks used in transfer learning.

Below is a small worked training example:

y = w1*x + w2*x + w3*x

Since y = 1 and x = 1:

1 = w1 + w2 + w3

and with w3 = 4:

w1 + w2 = -3

#demo2
#define variable and error
label = tf.constant(1,dtype = tf.float32)
x = tf.placeholder(dtype = tf.float32)
w1 = tf.Variable(4,dtype=tf.float32)
w2 = tf.Variable(4,dtype=tf.float32)
w3 = tf.constant(4,dtype=tf.float32)

y_predict = w1*x+w2*x+w3*x

#define losses and train
make_up_loss = tf.square(y_predict - label)
optimizer = tf.train.GradientDescentOptimizer(0.01)
train_step = optimizer.minimize(make_up_loss)

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for _ in range(100):
        w1_,w2_,w3_,loss_ = sess.run([w1,w2,w3,make_up_loss],feed_dict={x:1})
        print('variable is w1:',w1_,' w2:',w2_,' w3:',w3_, ' and the loss is ',loss_)
        sess.run(train_step,{x:1})

Because w3 is a constant, it is successfully kept out of training; only w1 and w2 get trained.

The result matches the expected w1 + w2 = -3:

variable is w1: -1.4999986  w2: -1.4999986  w3: 4.0  and the loss is  8.185452e-12
variable is w1: -1.4999986  w2: -1.4999986  w3: 4.0  and the loss is  8.185452e-12
variable is w1: -1.4999986  w2: -1.4999986  w3: 4.0  and the loss is  8.185452e-12
variable is w1: -1.4999986  w2: -1.4999986  w3: 4.0  and the loss is  8.185452e-12

Below, var_list is used so that only w2 is trained. Because both tf.Variable weights are initialized to 4 and x = 1, w2 converging to roughly -7 is the correct answer (1 = 4 + w2 + 4).

#demo2.1  restrict training with var_list
#define variable and error
label = tf.constant(1,dtype = tf.float32)
x = tf.placeholder(dtype = tf.float32)
w1 = tf.Variable(4,dtype=tf.float32)
w2 = tf.Variable(4,dtype=tf.float32)
w3 = tf.constant(4,dtype=tf.float32)

y_predict = w1*x+w2*x+w3*x

#define losses and train
make_up_loss = tf.square(y_predict - label)
optimizer = tf.train.GradientDescentOptimizer(0.01)
train_step = optimizer.minimize(make_up_loss,var_list = w2)
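# note: var_list is given a single Variable here; passing a list such as [w2]
# also works and is the more explicit form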

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for _ in range(500):
        w1_,w2_,w3_,loss_ = sess.run([w1,w2,w3,make_up_loss],feed_dict={x:1})
        print('variable is w1:',w1_,' w2:',w2_,' w3:',w3_, ' and the loss is ',loss_)
        sess.run(train_step,{x:1})
Output:

variable is w1: 4.0  w2: -6.99948  w3: 4.0  and the loss is  2.7063857e-07
variable is w1: 4.0  w2: -6.9994903  w3: 4.0  and the loss is  2.5983377e-07
variable is w1: 4.0  w2: -6.9995003  w3: 4.0  and the loss is  2.4972542e-07
variable is w1: 4.0  w2: -6.9995103  w3: 4.0  and the loss is  2.398176e-07
variable is w1: 4.0  w2: -6.9995203  w3: 4.0  and the loss is  2.3011035e-07
variable is w1: 4.0  w2: -6.99953  w3: 4.0  and the loss is  2.2105178e-07
variable is w1: 4.0  w2: -6.9995394  w3: 4.0  and the loss is  2.1217511e-07

What if w1, w2, and w3 are all tf.constant? Unsurprisingly, there is nothing left to train, and TensorFlow lets you know about it.

There are exactly two cases:

If var_list is left at its default (it automatically collects all trainable variables), you get an error telling you there is nothing to train:

ValueError: No variables to optimize.

If you explicitly pass a constant via var_list, you hit a not-implemented path:

NotImplementedError: ('Trying to update a Tensor ', <tf.Tensor 'Const_1:0' shape=() dtype=float32>)
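A minimal sketch (the same toy setup with all three weights turned into tf.constant) that reproduces the first error:

import tensorflow as tf

label = tf.constant(1, dtype=tf.float32)
x = tf.placeholder(dtype=tf.float32)
# all three "weights" are constants, so the graph holds no trainable variables
w1 = tf.constant(4, dtype=tf.float32)
w2 = tf.constant(4, dtype=tf.float32)
w3 = tf.constant(4, dtype=tf.float32)

make_up_loss = tf.square(w1*x + w2*x + w3*x - label)
optimizer = tf.train.GradientDescentOptimizer(0.01)

try:
    train_step = optimizer.minimize(make_up_loss)  # default var_list: all trainable variables
except ValueError as e:
    print(e)  # -> No variables to optimize.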

 

 

Another way to obtain a var_list — tf.get_collection

Approaches based on tf.get_collection and the various get_* functions are often more practical, because you don't always have a convenient Python reference to the tensor at hand.

#demo2.2  another way to collect var_list

label = tf.constant(1,dtype = tf.float32)
x = tf.placeholder(dtype = tf.float32)
w1 = tf.Variable(4,dtype=tf.float32)
with tf.name_scope(name='selected_variable_to_trian'):
    w2 = tf.Variable(4,dtype=tf.float32)
w3 = tf.constant(4,dtype=tf.float32)

y_predict = w1*x+w2*x+w3*x

#define losses and train
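# note: a cubed difference rather than a squared error; with x = 1 its gradient
# 3*(y_predict - label)**2 never goes negative, so w2 still slides toward -7,
# only more slowly -- hence the 3000 iterations below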
make_up_loss = (y_predict - label)**3
optimizer = tf.train.GradientDescentOptimizer(0.01)

output_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope='selected_variable_to_trian')
train_step = optimizer.minimize(make_up_loss,var_list = output_vars)

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for _ in range(3000):
        w1_,w2_,w3_,loss_ = sess.run([w1,w2,w3,make_up_loss],feed_dict={x:1})
        print('variable is w1:',w1_,' w2:',w2_,' w3:',w3_, ' and the loss is ',loss_)
        sess.run(train_step,{x:1})
Output:

variable is w1: 4.0  w2: -6.988893  w3: 4.0  and the loss is  1.3702081e-06
variable is w1: 4.0  w2: -6.988897  w3: 4.0  and the loss is  1.3687968e-06
variable is w1: 4.0  w2: -6.9889007  w3: 4.0  and the loss is  1.3673865e-06
variable is w1: 4.0  w2: -6.9889045  w3: 4.0  and the loss is  1.3659771e-06
variable is w1: 4.0  w2: -6.9889083  w3: 4.0  and the loss is  1.3645688e-06
variable is w1: 4.0  w2: -6.988912  w3: 4.0  and the loss is  1.3631613e-06
variable is w1: 4.0  w2: -6.988916  w3: 4.0  and the loss is  1.3617548e-06
variable is w1: 4.0  w2: -6.9889197  w3: 4.0  and the loss is  1.3603493e-06

trainable=False

This is another way to keep a variable from being trained. The principle is similar to the previous method — both revolve around tf.GraphKeys.TRAINABLE_VARIABLES — except that the previous one picks a chosen scope out of that collection, while this one decides at definition time that the variable is never added to the collection in the first place.

Non-trainable is still different from constant: a non-trainable variable can still be modified by hand, which is exactly what applications like moving averages rely on. "Non-trainable" is better thought of as a convention aimed specifically at the optimizer.
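A minimal sketch of that distinction (the names are made up for illustration): a trainable=False variable is invisible to the optimizer, but you can still write to it with tf.assign — which is exactly how moving averages maintain their shadow values.

import tensorflow as tf

shadow = tf.Variable(0.0, trainable=False)  # the optimizer will never touch this
w = tf.Variable(3.0)                        # an ordinary trainable weight
decay = 0.9

# manual moving-average style update of the non-trainable variable
update_shadow = tf.assign(shadow, decay * shadow + (1 - decay) * w)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print([v.name for v in tf.trainable_variables()])  # only w shows up here
    print(sess.run(update_shadow))                      # shadow moved toward w: about 0.3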

 

#demo2.3  another way to keep a variable from being trained

label = tf.constant(1,dtype = tf.float32)
x = tf.placeholder(dtype = tf.float32)
w1 = tf.Variable(4,dtype=tf.float32,trainable=False)
w2 = tf.Variable(4,dtype=tf.float32)
w3 = tf.constant(4,dtype=tf.float32)

y_predict = w1*x+w2*x+w3*x

#define losses and train
make_up_loss = (y_predict - label)**3
optimizer = tf.train.GradientDescentOptimizer(0.01)

output_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES)
train_step = optimizer.minimize(make_up_loss,var_list = output_vars)

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for _ in range(3000):
        w1_,w2_,w3_,loss_ = sess.run([w1,w2,w3,make_up_loss],feed_dict={x:1})
        print('variable is w1:',w1_,' w2:',w2_,' w3:',w3_, ' and the loss is ',loss_)
        sess.run(train_step,{x:1})

Fetching all trainable variables and passing them as var_list is the same as not specifying var_list at all — that is the default:

      var_list: Optional list or tuple of `Variable` objects to update to
        minimize `loss`.  Defaults to the list of variables collected in
        the graph under the key `GraphKeys.TRAINABLE_VARIABLES`.

#demo2.4  another way to keep a variable from being trained (default var_list)

label = tf.constant(1,dtype = tf.float32)
x = tf.placeholder(dtype = tf.float32)
#w1 = tf.Variable(4,dtype=tf.float32)
w1 = tf.Variable(4,dtype=tf.float32,trainable=False)
with tf.name_scope(name='selected_variable_to_trian'):
    w2 = tf.Variable(4,dtype=tf.float32)
w3 = tf.constant(4,dtype=tf.float32)

y_predict = w1*x+w2*x+w3*x

#define losses and train
make_up_loss = (y_predict - label)**3
optimizer = tf.train.GradientDescentOptimizer(0.01)

train_step = optimizer.minimize(make_up_loss)

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for _ in range(3000):
        w1_,w2_,w3_,loss_ = sess.run([w1,w2,w3,make_up_loss],feed_dict={x:1})
        print('variable is w1:',w1_,' w2:',w2_,' w3:',w3_, ' and the loss is ',loss_)
        sess.run(train_step,{x:1})

The actual result is the same as above, so it is omitted.

 

Decomposing minimize()

In fact, minimize() is just the combination of compute_gradients() and apply_gradients().

compute_gradients() computes the gradients, and apply_gradients() uses them to update the parameters. With several optimizers you can run several learning processes with different learning rates, each computing gradients and updating parameters for its own var_list — handy for transfer learning, or for coping with mismatched gradient updates across the layers of a deep network. I won't dwell on it here; a minimal sketch of the multi-optimizer pattern follows the demo below.

#demo3  combining compute_gradients() and apply_gradients()

label = tf.constant(1,dtype = tf.float32)
x = tf.placeholder(dtype = tf.float32)
w1 = tf.Variable(4,dtype=tf.float32,trainable=False)
w2 = tf.Variable(4,dtype=tf.float32)
w3 = tf.Variable(4,dtype=tf.float32)

y_predict = w1*x+w2*x+w3*x

#define losses and train
make_up_loss = (y_predict - label)**3
optimizer = tf.train.GradientDescentOptimizer(0.01)

w2_gradient = optimizer.compute_gradients(loss = make_up_loss, var_list = w2)
train_step = optimizer.apply_gradients(grads_and_vars = (w2_gradient))

init = tf.global_variables_initializer()
with tf.Session() as sess:
    sess.run(init)
    for _ in range(300):
        w1_,w2_,w3_,loss_,w2_gradient_ = sess.run([w1,w2,w3,make_up_loss,w2_gradient],feed_dict={x:1})
        print('variable is w1:',w1_,' w2:',w2_,' w3:',w3_, ' and the loss is ',loss_)
        print('gradient:',w2_gradient_)
        sess.run(train_step,{x:1})
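Building on that decomposition, here is a minimal sketch (the same kind of toy graph; every name is illustrative) of the multi-learning-rate idea mentioned above: two optimizers, each applying its own gradients to its own var_list.

import tensorflow as tf

label = tf.constant(1, dtype=tf.float32)
x = tf.placeholder(dtype=tf.float32)
w1 = tf.Variable(4, dtype=tf.float32)
w2 = tf.Variable(4, dtype=tf.float32)

y_predict = w1*x + w2*x
loss = tf.square(y_predict - label)

# two optimizers with different learning rates for different variable groups
slow_opt = tf.train.GradientDescentOptimizer(0.001)  # e.g. pretrained layers
fast_opt = tf.train.GradientDescentOptimizer(0.01)   # e.g. newly added layers

slow_step = slow_opt.apply_gradients(slow_opt.compute_gradients(loss, var_list=[w1]))
fast_step = fast_opt.apply_gradients(fast_opt.compute_gradients(loss, var_list=[w2]))
train_step = tf.group(slow_step, fast_step)  # run both updates as one step

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for _ in range(100):
        sess.run(train_step, {x: 1})
    print(sess.run([w1, w2]))  # w1 barely moves, w2 does most of the adjusting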

 

Learning rate, steps, the underlying formulas, and a manual gradient-descent implementation:

During prediction, x is the input that y varies with; but during training, w is the variable the loss L varies with, and x cannot change at all. Now you know why the weights are called Variables (a deliberately loose interpretation).

Below, gradient descent is implemented by hand on top of TensorFlow primitives:

To keep the formulas short, the code below renames everything to initials: l for loss, p for prediction, g for gradient, w for weight, plus y and x; η is the learning rate, and w0, w1, w2, ... denote the value of w at successive iterations, not separate variables.

loss = (y - p)^2 = (y - w*x)^2 = y^2 - 2*y*w*x + w^2*x^2

dL/dw = 2*w*x^2 - 2*y*x

Plugging this into the gradient-descent update rule w_{k+1} = w_k - η * dL/dw |_{w=w_k}:

w1 = w0 - η * dL/dw |_{w=w0}

w2 = w1 - η * dL/dw |_{w=w1}

w3 = w2 - η * dL/dw |_{w=w2}

 

Initial values: y = 3, x = 1, w = 2, so l = 1, dL/dw = -2, and η = 1.

Update: w = 4

Update: w = 2

Update: w = 4

So in this example, with x = 1 and y = 3, dL/dw conveniently equals 2w - 2y, i.e. twice the gap between prediction and label. A learning rate of 1 makes w hop back and forth around the correct value and never converge; it is written this way only to keep the arithmetic easy to follow. Shrink the learning rate and increase the number of iterations and it converges.
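As a quick sanity check of the arithmetic above, the same iteration in plain Python, using the closed-form derivative dL/dw = 2*w*x^2 - 2*y*x:

y, x, w, lr = 3.0, 1.0, 2.0, 1.0  # same initial values as above

for _ in range(4):
    loss = (y - w * x) ** 2
    grad = 2 * w * x ** 2 - 2 * y * x
    print('w =', w, 'loss =', loss, 'dL/dw =', grad)
    w = w - lr * grad  # with lr = 1 the weight oscillates 2 -> 4 -> 2 -> 4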

#demo4:manual gradient descent in tensorflow
#y label
y = tf.constant(3,dtype = tf.float32)
x = tf.placeholder(dtype = tf.float32)
w = tf.Variable(2,dtype=tf.float32)
#prediction
p = w*x

#define losses
l = tf.square(p - y)
g = tf.gradients(l, w)
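# tf.gradients always returns a list (one entry per tensor in its second argument),
# hence the g[0] in the update below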
learning_rate = tf.constant(1,dtype=tf.float32)
#learning_rate = tf.constant(0.11,dtype=tf.float32)
init = tf.global_variables_initializer()

#update
update = tf.assign(w, w - learning_rate * g[0])

with tf.Session() as sess:
    sess.run(init)
    print(sess.run([g,p,w], {x: 1}))
    for _ in range(5):
        w_,g_,l_ = sess.run([w,g,l],feed_dict={x:1})
        print('variable is w:',w_, ' g is ',g_,'  and the loss is ',l_)

        _ = sess.run(update,feed_dict={x:1})

Result:

learning rate = 1

[[-2.0], 2.0, 2.0]
variable is w: 2.0  g is  [-2.0]   and the loss is  1.0
variable is w: 4.0  g is  [2.0]   and the loss is  1.0
variable is w: 2.0  g is  [-2.0]   and the loss is  1.0
variable is w: 4.0  g is  [2.0]   and the loss is  1.0
variable is w: 2.0  g is  [-2.0]   and the loss is  1.0

The effect looks like the usual picture of gradient descent overshooting back and forth across the minimum (the figure that originally illustrated this is omitted here).

With a smaller learning rate:

variable is w: 2.9964619  g is  [-0.007575512]   and the loss is  1.4347095e-05
variable is w: 2.996695  g is  [-0.0070762634]   and the loss is  1.2518376e-05
variable is w: 2.996913  g is  [-0.0066099167]   and the loss is  1.0922749e-05
variable is w: 2.9971166  g is  [-0.0061740875]   and the loss is  9.529839e-06
variable is w: 2.9973066  g is  [-0.0057668686]   and the loss is  8.314193e-06
variable is w: 2.9974842  g is  [-0.0053868294]   and the loss is  7.2544826e-06
variable is w: 2.9976501  g is  [-0.0050315857]   and the loss is  6.3292136e-06
variable is w: 2.997805  g is  [-0.004699707]   and the loss is  5.5218115e-06
variable is w: 2.9979498  g is  [-0.004389763]   and the loss is  4.8175043e-06
variable is w: 2.998085  g is  [-0.0041003227]   and the loss is  4.2031616e-06
variable is w: 2.9982114  g is  [-0.003829956]   and the loss is  3.6671408e-06
variable is w: 2.9983294  g is  [-0.0035772324]   and the loss is  3.1991478e-06

 

Further reading: automatic and manual implementations of Momentum and Adagrad are split into separate posts, since this one is long enough already.

 

Source code

 

Additional notes from practice:

Real projects often use a global_step variable as the basis for learning-rate decay, EMA, and Batch Normalization. When you train on all trainable variables — and especially when an EMA is applied to all trainable variables — global_step is easily affected (it is supposed to increase by exactly 1 each step, but suddenly gets momentum or a decay coefficient applied to it). So global_step must be created with trainable=False, and operations such as EMA should pick their target variables carefully.
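A minimal sketch (illustrative names) of the usual global_step setup: created with trainable=False, driving a decayed learning rate, and incremented by minimize() itself.

import tensorflow as tf

# global_step only counts steps; trainable=False keeps the optimizer (and any EMA
# over trainable variables) away from it
global_step = tf.Variable(0, trainable=False, name='global_step')

learning_rate = tf.train.exponential_decay(
    learning_rate=0.1, global_step=global_step,
    decay_steps=100, decay_rate=0.96, staircase=True)

w = tf.Variable(3.0)
loss = tf.square(w - 1.0)

# passing global_step here makes minimize() increment it by exactly 1 per training step
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    loss, global_step=global_step)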

Strictly speaking, EMA and trainable=False have no hard relationship, but in practice they usually do: an EMA is typically applied to all trainable variables, so creating global_step with trainable=False keeps it out of the EMA's var_list. This is a classic case of "you never quite knew why — you were just lucky nothing broke"!
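A minimal sketch of that lucky interaction (illustrative apart from the tf.train.ExponentialMovingAverage API itself): because global_step is non-trainable, it never shows up in tf.trainable_variables() and therefore never gets a shadow average.

import tensorflow as tf

global_step = tf.Variable(0, trainable=False, name='global_step')
w = tf.Variable(3.0, name='w')

ema = tf.train.ExponentialMovingAverage(decay=0.99, num_updates=global_step)
# the common idiom: shadow every trainable variable -> only w here, not global_step
maintain_averages = ema.apply(tf.trainable_variables())

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(maintain_averages)
    print(sess.run(ema.average(w)))  # the shadow value for w exists
    print(ema.average(global_step))  # None: global_step has no shadow average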

By the same logic, Batch Normalization's moving mean and moving variance must be created with trainable=False; they are maintained separately by the layer itself rather than by the optimizer.
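A minimal sketch of how that shows up with tf.layers.batch_normalization (the input shape is illustrative): the moving statistics are created trainable=False by the layer, so the optimizer never sees them; they are refreshed through the UPDATE_OPS collection instead.

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 8])
h = tf.layers.batch_normalization(x, training=True)
loss = tf.reduce_mean(tf.square(h))

# moving_mean / moving_variance live outside the trainable set; their update ops
# are collected in UPDATE_OPS and must be run alongside the training step
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
    train_step = tf.train.GradientDescentOptimizer(0.01).minimize(loss)

print([v.name for v in tf.trainable_variables()])  # gamma and beta only
print([v.name for v in tf.global_variables()])     # ... plus moving_mean, moving_variance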

 

 

 

 

 