Deep Reinforcement Learning Series (15): TRPO Algorithm Principles and TensorFlow Implementation

Preface: Policy gradient methods are powerful, but their Achilles' heel is the update step size $\alpha$, which is hard to choose. When the step size is inappropriate, the updated parameters correspond to a worse policy; sampling with that worse policy and updating again makes the parameters worse still, so learning easily deteriorates and may even collapse entirely. A suitable step size is therefore critical in reinforcement learning, yet most current RL algorithms cannot guarantee monotonic improvement. TRPO confronts the step-size problem directly and derives, with mathematical guarantees, a monotonic policy improvement method.

Because the mathematical derivation behind TRPO is particularly involved, and because PPO (Proximal Policy Optimization, covered separately later) streamlines the procedure, I never felt I had learned it thoroughly. This post therefore draws on many excellent technical blogs to work through TRPO completely, so that the algorithm is properly understood. My admiration and thanks go to their authors!

Trust Region Policy Optimization: TRPO

TRPO (Trust Region Policy Optimization) was proposed in February 2015 by John Schulman, now a scientist at OpenAI, and published at ICML 2015. It is a distinctive and novel policy optimization algorithm, and Schulman devoted a separate chapter of his 2016 PhD thesis to it (Chapter 3, pages 18-44). Consider the following question first: when solving with gradient descent, a step that is too large easily overshoots an optimum, while a step that is too small converges very slowly, as illustrated below.
[Figure: gradient descent with a step size that is too large versus one that is too small]
In gradient ascent we find the steepest direction and then move along it. But if the learning rate (step size) is too large, such a move may take us away from the real reward function, and can even be disastrous. With a trust region, we use a variable $\delta$ to limit the search region (the trust region), chosen such that, until a local or global optimum is reached, the policy optimized within the region is guaranteed to be better than the current policy, as illustrated below.
[Figure: policy search constrained to a trust region]

1. Ideas and Principles of the Paper

TRPO is built on an MM (Minorize-Maximization) optimization procedure; in this post we derive TRPO's objective function in several steps.

The MM (Minorize-Maximization) procedure

The MM algorithm is an iterative optimization scheme that exploits the convexity of a surrogate function to find a maximum or minimum, depending on whether the desired optimization is a maximization or a minimization. MM is not a single algorithm but a framework describing how to construct optimization algorithms. Its historical roots go back at least to 1970, when Ortega and Rheinboldt were working on problems related to line-search methods; the same idea has since reappeared in different forms across many fields. In 2000, Hunter and Lange proposed "MM" as a general framework, and recent work has applied it to a wide range of disciplines such as mathematics, statistics, machine learning, and engineering. The MM procedure is illustrated below:
[Figure: the MM procedure — repeatedly constructing and maximizing a minorizing surrogate function]
For more details, see the Wikipedia article on the MM algorithm (reference [5]).
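
To make the framework concrete, here is a compact restatement (a sketch of the standard MM conditions, not a quote from any particular source): at the current iterate $\theta_k$ we build a surrogate $M(\cdot\mid\theta_k)$ that minorizes the true objective $\eta$ and touches it at $\theta_k$, then maximize the surrogate:

$$M(\theta \mid \theta_{k}) \leq \eta(\theta)\ \ \forall \theta, \qquad M(\theta_{k} \mid \theta_{k}) = \eta(\theta_{k}), \qquad \theta_{k+1} = \arg\max_{\theta} M(\theta \mid \theta_{k})$$

Since $\eta(\theta_{k+1}) \geq M(\theta_{k+1}\mid\theta_{k}) \geq M(\theta_{k}\mid\theta_{k}) = \eta(\theta_{k})$, each iteration can only improve $\eta$ — exactly the monotonic improvement TRPO is after.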

1.1 The surrogate function

The goal of RL is to maximize the expected discounted reward. From the policy gradient method, the parameter update rule is:
$$\theta_{new}=\theta_{old}+\alpha\nabla_{\theta}J$$
The weakness of policy gradient algorithms lies in the update step size $\alpha$: when it is inappropriate, the updated parameters correspond to a worse policy, sampling with that worse policy makes the next update even worse, and learning easily deteriorates until it collapses. A suitable step size is therefore essential. What counts as suitable? A step after which the value of the return function does not decrease. How do we choose such a step — in other words, how do we find a new policy whose return increases monotonically, or at least does not decrease? That is exactly the problem TRPO solves. In the figure below, the red curve is the expected discounted return $\eta$.
[Figure: the expected discounted return $\eta$ (red) and the surrogate lower bound M (blue)]

Here $\eta$ is defined as:
$$\eta(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\right]$$
i.e., the expectation of the cumulative discounted reward. In the figure, C is a constant and the KL term inside the blue curve M denotes a KL divergence (explained later). At each iteration, the surrogate function M (the blue curve) has the following properties:

  • it is a lower bound on $\eta$
  • it can be used to approximate the discounted return $\eta$ of the current policy
  • it is easy to optimize (we will approximate the surrogate by a quadratic function)

At each iteration we find the maximizer of M and take it as the new current policy.
We then re-evaluate the lower bound at the new policy and repeat. As this process continues, the policy keeps improving. Because the set of achievable policies is limited, the procedure eventually converges to a locally or globally optimal policy.


1.2 The objective function

In reinforcement learning, the state-value and action-value functions are the foundation of learning, and the advantage function is a key quantity in algorithms such as A3C and TRPO. We first define the Q function, the state-value function, and the advantage function:

$$Q_{\pi}(s_{t},a_{t}) = \mathbb{E}_{s_{t+1},a_{t+1},\dots}\left[ \sum_{k=0}^{\infty} \gamma^{k}r(s_{t+k})\right], \qquad V_{\pi}(s_{t}) = \mathbb{E}_{a_{t},s_{t+1},\dots}\left[ \sum_{k=0}^{\infty} \gamma^{k}r(s_{t+k})\right]$$
$$A_{\pi}(s,a) = Q_{\pi}(s,a)-V_{\pi}(s), \qquad \text{where}\ a_{t} \sim \pi(a_{t}|s_{t}),\ s_{t+1}\sim P(s_{t+1}|s_{t},a_{t})\ \text{for}\ t \geq 0$$

The advantage function is widely used to measure how much return an action brings relative to the average behavior in a state.

The expected discounted reward is written as:

$$\eta(\pi) = \mathbb{E}_{s_{0},a_{0},\dots}\left[ \sum_{t=0}^{\infty}\gamma^{t}r(s_{t}) \right], \qquad \text{where}\ s_{0} \sim \rho_{0}(s_{0}),\ a_{t} \sim \pi(a_{t}|s_{t}),\ s_{t+1}\sim P(s_{t+1}|s_{t},a_{t})$$

The return of one policy can be expressed in terms of another policy:

$$\eta(\hat{\pi}) = \eta(\pi) +\mathbb{E}_{s_{0},a_{0},\dots \sim \hat{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}A_{\pi}(s_{t},a_{t})\right]$$

$$A_{\pi}(s,a) = Q_{\pi}(s,a)-V_{\pi}(s) =\mathbb{E}_{s'\sim P(s'|s,a)}\left[r(s)+\gamma V_{\pi}(s')-V_{\pi}(s)\right]$$

Here $\mathbb{E}_{s_{0},a_{0},\dots \sim \hat{\pi}}[\dots]$ means that actions are sampled from the new policy, $a_{t} \sim \hat{\pi}(\cdot|s_{t})$; $\eta(\pi)$ is the return of the old policy and $\eta(\hat{\pi})$ the return of the new one, so $\mathbb{E}_{s_{0},a_{0},\dots \sim \hat{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}A_{\pi}(s_{t},a_{t})\right]$ is the return difference between the new and old policies — exactly the "advantage" the advantage function captures. We also define the discounted visitation frequency $\rho_{\pi}$:

$$\rho_{\pi}(s) = P(s_{0}=s)+\gamma P(s_{1}=s)+\gamma^{2}P(s_{2}=s)+\dots$$

The identity for the return difference between the new and old policies is proved as follows:
$$\begin{aligned} &\underset{\tau \sim \pi^{\prime}}{\mathrm{E}}\left[\sum_{t=0}^{\infty} \gamma^{t} A^{\pi}\left(s_{t}, a_{t}\right)\right] \\ &=\underset{\tau \sim \pi^{\prime}}{\mathrm{E}}\left[\sum_{t=0}^{\infty} \gamma^{t}\left(R\left(s_{t}, a_{t}, s_{t+1}\right)+\gamma V^{\pi}\left(s_{t+1}\right)-V^{\pi}\left(s_{t}\right)\right)\right] \\ &=\eta\left(\pi^{\prime}\right)+\underset{\tau \sim \pi^{\prime}}{\mathrm{E}}\left[\sum_{t=0}^{\infty} \gamma^{t+1} V^{\pi}\left(s_{t+1}\right)-\sum_{t=0}^{\infty} \gamma^{t} V^{\pi}\left(s_{t}\right)\right] \\ &=\eta\left(\pi^{\prime}\right)+\underset{\tau \sim \pi^{\prime}}{\mathrm{E}}\left[\sum_{t=1}^{\infty} \gamma^{t} V^{\pi}\left(s_{t}\right)-\sum_{t=0}^{\infty} \gamma^{t} V^{\pi}\left(s_{t}\right)\right] \\ &=\eta\left(\pi^{\prime}\right)-\underset{\tau \sim \pi^{\prime}}{\mathrm{E}}\left[V^{\pi}\left(s_{0}\right)\right] \\ &=\eta\left(\pi^{\prime}\right)-\eta(\pi) \end{aligned}$$

Expanding this result over states and actions gives:

$$\begin{aligned} \eta(\hat{\pi}) &= \eta(\pi)+\sum_{t=0}^{\infty}\sum_{s}P(s_{t}=s|\hat{\pi})\sum_{a}\hat{\pi}(a|s)\gamma^{t}A_{\pi}(s,a) \\ &= \eta(\pi)+\sum_{s}\rho_{\hat{\pi}}(s)\sum_{a}\hat{\pi}(a|s)A_{\pi}(s,a) \end{aligned}$$
where $\rho_{\hat{\pi}}(s) = \sum_{t=0}^{\infty}\gamma^{t}P(s_{t} = s|\hat{\pi})$.


Interpreting the advantage function (see the tree diagram below):

The value function $V(s)$ can be understood as the sum, over all possible actions, of the action-value function weighted by the probability of taking that action; more plainly, $V(s)$ is the average of the action values under the action distribution. The action-value function $Q(s,a)$ is the value of a single action, so $Q_{\pi}(s,a)-V_{\pi}(s)$ measures how that action's value compares with the average. The "advantage" therefore refers to the advantage of an action's value over the state's value: a positive advantage means the action is better than average, a negative one means it is worse than average.
[Figure: tree diagram relating the state value $V(s)$ to the action values $Q(s,a)$]
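
As a minimal illustrative sketch (not part of the original post), the one-step estimate of the identity above, $A_{\pi}(s,a) \approx r + \gamma V(s') - V(s)$, can be computed as follows; the value estimates passed in are assumed to come from some learned value function.

```python
def one_step_advantage(reward, value_s, value_s_next, gamma=0.99, done=False):
    """TD-error style advantage estimate: A(s, a) ~= r + gamma * V(s') - V(s).

    `value_s` and `value_s_next` are assumed to come from a learned value function;
    when the episode terminates, the bootstrap term gamma * V(s') is dropped.
    """
    bootstrap = 0.0 if done else gamma * value_s_next
    return reward + bootstrap - value_s

# An action whose outcome beats the state's average value has a positive advantage.
print(one_step_advantage(reward=1.0, value_s=0.5, value_s_next=0.0, done=True))  # 0.5
```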


1.3 First-order approximation: the L function

1.3.1 Trick 1: a first-order approximation

TRPO's first trick deals with the state distribution: we ignore changes in the state distribution and keep using the state distribution induced by the old policy. This is a first-order approximation of the original objective. When the new and old parameters are close, replacing the new state distribution with the old one is reasonable, and the MM machinery then looks for a function that locally lower-bounds $\eta$ around the current policy. The original objective is thus approximated by the function $L_{\pi}$:

$$L_{\pi}(\hat{\pi}) = \eta(\pi) +\sum_{s}\rho_{\pi}(s)\sum_{a}\hat{\pi}(a|s)A_{\pi}(s,a)$$

In the second term, however, the actions are generated by the new policy $\hat{\pi}$, whose parameters $\theta$ are still unknown, so it cannot yet be used to produce actions. This is where TRPO's second trick comes in: importance sampling.

1.3.2 Importance sampling

TRPO's second trick handles the action distribution with importance sampling. As we will see, the difference between the new and old policies is kept very small, so importance sampling works very well here:
$$\sum_{a}\hat{\pi}(a|s)A_{\pi}(s,a) = \mathbb{E}_{a \sim q(a|s)}\left[ \frac{\hat{\pi}(a|s)}{q(a|s)}A_{\pi}(s,a)\right]$$
Writing the sum over $s$ as an expectation as well, the function becomes:

$$L_{\pi}(\hat{\pi}) = \eta(\pi)+\mathbb{E}_{s \sim \rho_{\theta_{old}},\,a \sim \pi_{\theta_{old}}(a|s)}\left[\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}A_{\pi_{\theta_{old}}}(s,a) \right]$$

Importance sampling
Importance sampling is a statistical technique for estimating properties of one distribution using samples drawn from a different distribution; the estimate for the original distribution is recovered by re-weighting. It is related to umbrella sampling in computational physics.
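
As a small self-contained illustration (not from the original post; the two Gaussian distributions are arbitrary choices for the demo), the following sketch estimates $\mathbb{E}_{x\sim p}[f(x)]$ by sampling from $q$ and re-weighting by $p(x)/q(x)$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Target distribution p = N(0, 1); sampling distribution q = N(1, 2).
def p_pdf(x): return np.exp(-0.5 * x**2) / np.sqrt(2 * np.pi)
def q_pdf(x): return np.exp(-0.5 * ((x - 1.0) / 2.0)**2) / (2.0 * np.sqrt(2 * np.pi))

f = lambda x: x**2                                    # E_p[f(x)] = 1 for p = N(0, 1)

x_q = rng.normal(loc=1.0, scale=2.0, size=100_000)    # samples drawn from q, not p
weights = p_pdf(x_q) / q_pdf(x_q)                     # importance weights p(x) / q(x)
print(np.mean(weights * f(x_q)))                      # importance-sampling estimate, close to 1.0
```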

1.3.3 Choosing the step size

Compare the following two expressions:

$$\eta(\hat{\pi}) = \eta(\pi)+\sum_{t=0}^{\infty}\sum_{s}P(s_{t}=s|\hat{\pi})\sum_{a}\hat{\pi}(a|s)\gamma^{t}A_{\pi}(s,a)$$
$$L_{\pi}(\hat{\pi}) = \eta(\pi)+\mathbb{E}_{s \sim \rho_{\theta_{old}},\,a \sim \pi_{\theta_{old}}(a|s)}\left[\frac{\pi_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}A_{\pi_{\theta_{old}}}(s,a) \right]$$
We see that $\eta(\hat{\pi})$ and $L_{\pi}(\hat{\pi})$ differ only in the state distribution they use. Both are functions of $\hat{\pi}$, and, as shown below, they agree in value and in first derivative at $\pi_{\theta_{old}}$.

$L$ is one component of the lower-bound function M (the term underlined in red in the figure):
[Figure: the surrogate lower bound $M = L - C \cdot KL$, with the $L$ term underlined in red]

The second term of M is a KL divergence.


KL divergence:
The Kullback-Leibler (KL) divergence was introduced by Solomon Kullback and Richard Leibler in 1951 as a directed measure of the difference between two distributions. For discrete probability distributions P and Q defined on the same probability space, the KL divergence from Q to P is defined as
$$D_{KL}(P\parallel Q)= - \sum_{x \in \mathcal{X}} P(x) \log {\frac{Q(x)}{P(x)}}$$
In other words, it is the expectation, taken with respect to P, of the logarithmic difference between the probabilities P and Q. The KL divergence is defined only when, for all x, Q(x) = 0 implies P(x) = 0 (absolute continuity). For continuous random variables, the KL divergence is defined as the integral:
$$D_{KL}(P\parallel Q)= \int_{-\infty }^{\infty } p(x) \log \left(\frac{p(x)}{q(x)}\right)dx$$
For more of the underlying mathematics, see the Wikipedia entry on the KL divergence (reference [10]).
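
As a quick numerical illustration (not from the original post; the two small distributions are arbitrary examples), the discrete definition above can be computed directly:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """Discrete KL divergence D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))."""
    p, q = np.asarray(p, dtype=np.float64), np.asarray(q, dtype=np.float64)
    return np.sum(p * np.log((p + eps) / (q + eps)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
print(kl_divergence(p, q))   # > 0
print(kl_divergence(q, p))   # a different value: the KL divergence is not symmetric
```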


For the current policy, $KL(\theta_{i}, \theta_{i})=0$, so the $C \cdot KL$ term can be viewed as an upper bound on the approximation error of $L$. At the policy $\theta_{i}$ one can show that $L$ and $\eta$ have the same value and the same first derivative:

$$L_{\pi_{\theta_{i}}}(\pi_{\theta_{i}}) = \eta(\pi_{\theta_{i}}), \qquad \nabla_{\theta}L_{\pi_{\theta_{i}}}(\pi_{\theta})\big|_{\theta = \theta_{i}} = \nabla_{\theta}\eta(\pi_{\theta})\big|_{\theta = \theta_{i}}$$

By these identities, within a small neighborhood of $\theta_{i}$ the two objectives, and their gradients, are nearly identical, so we may use $L_{\pi_{\theta_{old}}}$ in place of the original objective and take a limited-length step along its gradient direction to improve the policy.

But how large a step still guarantees monotone non-decrease? Since the approximation only holds in a small region around the current policy, we add a constraint that limits the difference between the new and old policies. Treating a policy as a probability distribution, this difference can be measured with the KL divergence $D_{KL}^{max}(\pi, \hat{\pi}) = \max_{s}D_{KL}(\pi(\cdot |s)\,\|\,\hat{\pi}(\cdot |s))$. The original paper proves that when $\hat{\pi} = \arg\max_{\pi^{\prime}}L_{\pi}(\pi^{\prime})$, the following lower bound holds:

$$\eta(\hat{\pi}) \geq L_{\pi}(\hat{\pi})-C \, D_{KL}^{max}(\pi, \hat{\pi}), \qquad \text{where}\ C = \frac{4\epsilon \gamma}{(1-\gamma)^{2}}$$

Here $D_{KL}^{max}(\pi, \tilde{\pi})$ is the maximum (over states) KL divergence, which limits how far the policy may move in one update. Define:

$$M_{i}(\pi) = L_{\pi_{i}}(\pi) -C\, D_{KL}^{max}(\pi_{i},\pi)$$

With this definition, we can prove that policy improvement is monotone:
$$\eta(\pi_{i+1}) \geq M_{i}(\pi_{i+1}), \qquad \eta(\pi_{i}) = M_{i}(\pi_{i})$$

Subtracting the second relation from the first gives:

$$\eta(\pi_{i+1})-\eta(\pi_{i}) \geq M_{i}(\pi_{i+1})-M_{i}(\pi_{i})$$

Next we look at the details of the lower-bound function M. A two-page proof in Appendix A of the TRPO paper shows that $\eta$ admits the following lower bound:

$$\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi})-\frac{4 \epsilon \gamma}{(1-\gamma)^{2}} \alpha^{2}, \qquad \text{where}\ \alpha=D_{TV}^{\max }(\pi, \tilde{\pi}),\ \ \epsilon=\max _{s, a}\left|A_{\pi}(s, a)\right|$$

Here $D_{TV}$ is the total variation divergence. This form is not used directly, because it can immediately be replaced by the KL divergence via the bound:

$$D_{TV}(p \|q)^{2} \leq D_{KL}(p\|q)$$

so the lower bound can be restated as:
$$\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi})-C D_{\mathrm{KL}}^{\max }(\pi, \tilde{\pi}), \quad \text { where } C=\frac{4 \epsilon \gamma}{(1-\gamma)^{2}}, \qquad D_{\mathrm{KL}}^{\max }(\pi, \tilde{\pi})=\max _{s} D_{\mathrm{KL}}(\pi(\cdot | s) \| \tilde{\pi}(\cdot | s))$$

For brevity we will also write:

$$\eta(\theta):=\eta(\pi_{\theta}), \qquad L_{\theta}(\hat{\theta}):= L_{\pi_{\theta}}(\pi_{\hat{\theta}})$$

1.4 Proof of monotonic improvement

The key property of this scheme is that it guarantees monotone improvement — it is the "money-back guarantee" member of the policy gradient family, so to speak: at least in theory, every policy update is no worse than the previous one. What we need to show is that the new policy obtained by optimizing M is guaranteed to perform at least as well as the previous policy in terms of $\eta$, the actual expected return. Since the set of achievable policies is limited, repeatedly updating in this way eventually reaches a local or global optimum. Here is the proof:

Any improvement of $M_{i}(\pi_{i+1})$ over $M_{i}(\pi_{i})$ yields an improvement of $\eta(\pi_{i+1})$ over $\eta(\pi_{i})$.
$$M_{i}(\pi)=L_{\pi_{i}}(\pi)-C D_{\mathrm{KL}}^{\max }\left(\pi_{i}, \pi\right), \ \text{then}$$
$$\eta\left(\pi_{i+1}\right) \geq M_{i}\left(\pi_{i+1}\right), \qquad \eta\left(\pi_{i}\right)=M_{i}\left(\pi_{i}\right), \ \text{therefore}$$
$$\eta\left(\pi_{i+1}\right)-\eta\left(\pi_{i}\right) \geq M_{i}\left(\pi_{i+1}\right)-M_{i}\left(\pi_{i}\right)$$

Proof that the return increases monotonically under policy iteration

This gives an iterative algorithm in which the new policy is guaranteed to perform at least as well as the current one.

[Figure: policy iteration algorithm with guaranteed improvement — maximize $L_{\pi_{i}}(\pi) - C D_{KL}^{max}(\pi_{i},\pi)$ at every iteration]
However, the maximum KL divergence over all states is hard to compute, so the requirement is relaxed to the mean KL divergence instead.
[Figure: the relaxed algorithm using the mean KL divergence $\overline{D}_{KL}$]

where the $L$ function is:
$$L_{\pi}(\hat{\pi}) = \eta(\pi)+\sum_{s}\rho_{\pi}(s)\sum_{a}\hat{\pi}(a|s)A_{\pi}(s,a)$$

We can use importance sampling, drawing actions from a sampling policy q, to estimate the second term above, which yields the constrained problem:
$$\underset{\theta}{\operatorname{maximize}}\ \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}, a \sim q}\left[\frac{\pi_{\theta}(a | s)}{q(a | s)} \hat{A}_{\theta_{\text{old}}}(s, a)\right] \qquad \text{subject to}\ \ \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{\text{old}}}(\cdot | s) \,\|\, \pi_{\theta}(\cdot | s)\right)\right] \leq \delta$$


Importance sampling
Importance sampling computes the expected value of $f(x)$ where $x$ follows the data distribution $p$, i.e., $\mathbb{E}_{x \sim p}[f(x)]$. Importance sampling gives us the option of not drawing the samples from $p$: instead we sample from $q$ and re-weight the result by the probability ratio between $p$ and $q$, computing $\mathbb{E}_{x \sim q}\left[\frac{f(x)p(x)}{q(x)}\right]$. In policy gradient methods we use the current policy to compute the policy gradient:
$$\nabla_{\theta} J(\theta) = \mathbb{E}_{a \sim \pi_{\theta}}\left[\nabla_{\theta} \log \pi_{\theta}(a|s)\, \hat{A}(s,a)\right]$$

Consequently, whenever the policy changes we must collect new samples; old samples cannot be reused, which is why vanilla PG has poor sample efficiency. With importance sampling, the objective can be rewritten so that samples from the old policy can be used to compute the policy gradient.



Continuing from above: by Lagrangian duality, the constrained objective can be converted into an unconstrained objective with a Lagrange multiplier; the two forms are mathematically equivalent:

$$\underset{\theta}{\operatorname{maximize}} \quad \hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}\left(a_{t} | s_{t}\right)}{\pi_{\theta_{\mathrm{old}}}\left(a_{t} | s_{t}\right)} \hat{A}_{t}\right] \qquad \text{subject to}\ \ \hat{\mathbb{E}}_{t}\left[\mathrm{KL}\left[\pi_{\theta_{\text {old }}}\left(\cdot | s_{t}\right), \pi_{\theta}\left(\cdot | s_{t}\right)\right]\right] \leq \delta$$
or
$$\underset{\theta}{\operatorname{maximize}} \quad \hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}\left(a_{t} | s_{t}\right)}{\pi_{\theta_{\text {old }}}\left(a_{t} | s_{t}\right)} \hat{A}_{t}\right]-\beta \hat{\mathbb{E}}_{t}\left[\mathrm{KL}\left[\pi_{\theta_{\text {old }}}\left(\cdot | s_{t}\right), \pi_{\theta}\left(\cdot | s_{t}\right)\right]\right]$$
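
As an illustrative sketch (the TensorFlow 1.x placeholders and the fixed penalty coefficient below are assumptions for the demo, not part of the SpinningUp code later in this post), the penalized objective above can be written symbolically as:

```python
import tensorflow as tf

# Hypothetical placeholders for one batch of collected data (names are illustrative only).
logp = tf.placeholder(tf.float32, shape=(None,))      # log pi_theta(a_t | s_t)
logp_old = tf.placeholder(tf.float32, shape=(None,))  # log pi_theta_old(a_t | s_t)
adv = tf.placeholder(tf.float32, shape=(None,))       # advantage estimates A_hat_t
mean_kl = tf.placeholder(tf.float32, shape=())        # mean KL(pi_theta_old || pi_theta)
beta = 1.0                                            # fixed penalty coefficient (assumption)

ratio = tf.exp(logp - logp_old)                       # pi_theta(a|s) / pi_theta_old(a|s)
surrogate = tf.reduce_mean(ratio * adv)               # importance-sampled surrogate objective
penalized_objective = surrogate - beta * mean_kl      # the KL-penalized form shown above
```

TRPO itself enforces the hard KL constraint rather than the penalty; the penalty form is the one that PPO later builds on.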

1.5 Optimizing the objective function

As mentioned earlier, the lower-bound function M should be easy to optimize. In practice we approximate the L function and the mean KL divergence with Taylor expansions: L is expanded to first order and the KL divergence to second order:
$$L_{\theta_{k}}(\theta) \approx g^{T}\left(\theta-\theta_{k}\right), \qquad g \doteq \nabla_{\theta} L_{\theta_{k}}(\theta)\big|_{\theta_{k}}$$

$$\overline{D}_{K L}\left(\theta \| \theta_{k}\right) \approx \frac{1}{2}\left(\theta-\theta_{k}\right)^{T} H\left(\theta-\theta_{k}\right), \qquad H \doteq \nabla_{\theta}^{2} \overline{D}_{K L}\left(\theta \| \theta_{k}\right)\big|_{\theta_{k}}$$

Here g is the policy gradient and H is the Fisher information matrix (FIM), written in the form of a Hessian matrix:

$$H=\nabla^{2} f=\begin{pmatrix}\frac{\partial^{2} f}{\partial x_{1}^{2}} & \frac{\partial^{2} f}{\partial x_{1} \partial x_{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{1} \partial x_{n}} \\ \frac{\partial^{2} f}{\partial x_{2} \partial x_{1}} & \frac{\partial^{2} f}{\partial x_{2}^{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{2} \partial x_{n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^{2} f}{\partial x_{n} \partial x_{1}} & \frac{\partial^{2} f}{\partial x_{n} \partial x_{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{n}^{2}}\end{pmatrix}$$
The optimization problem then becomes:

$$\theta_{k+1}=\arg \max _{\theta} g^{T}\left(\theta-\theta_{k}\right) \qquad \text{subject to}\ \ \frac{1}{2}\left(\theta-\theta_{k}\right)^{T} H\left(\theta-\theta_{k}\right) \leq \delta$$

This quadratic problem has the closed-form solution:
$$\theta_{k+1}=\theta_{k}+\sqrt{\frac{2 \delta}{g^{T} H^{-1} g}} H^{-1} g$$
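
As a small numerical sketch (not from the original post; the gradient and Fisher matrix below are made-up stand-ins), the closed-form step can be evaluated directly:

```python
import numpy as np

def natural_gradient_step(theta, g, H, delta=0.01):
    """theta_{k+1} = theta_k + sqrt(2 * delta / (g^T H^{-1} g)) * H^{-1} g."""
    Hinv_g = np.linalg.solve(H, g)                 # H^{-1} g without explicitly inverting H
    step_size = np.sqrt(2.0 * delta / (g @ Hinv_g))
    return theta + step_size * Hinv_g

theta = np.zeros(3)
g = np.array([0.5, -0.2, 0.1])                     # illustrative policy gradient
H = np.array([[2.0, 0.3, 0.0],                     # illustrative symmetric positive-definite FIM
              [0.3, 1.5, 0.1],
              [0.0, 0.1, 1.0]])
print(natural_gradient_step(theta, g, H))
```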

This update is known as the natural policy gradient. Below is the algorithm obtained by combining the MM procedure with the natural gradient step:

[Figure: natural policy gradient algorithm built on the MM procedure]

TRPO
However, computing the FIM $H$ and especially its inverse is very expensive. TRPO instead estimates $H^{-1} g$ by solving the following linear system for $x$:
$$x_{k} \approx \hat{H}_{k}^{-1} \hat{g}_{k} \quad\Longleftrightarrow\quad \hat{H}_{k} x_{k} \approx \hat{g}_{k}$$

This system can be solved with the conjugate gradient method, which is much cheaper than computing the inverse of H. Conjugate gradient is similar to gradient descent, but it finds the solution in at most N iterations, where N is the number of parameters in the model.
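
As a quick standalone illustration (not from the original post; the matrix A is an arbitrary symmetric positive-definite example), conjugate gradient solves $Ax = b$ using only matrix-vector products, which is exactly how the `cg` function in the SpinningUp code below avoids ever forming $H^{-1}$:

```python
import numpy as np

EPS = 1e-8

def conjugate_gradient(Avp, b, iters=10):
    """Solve A x = b given only a matrix-vector product function Avp(v) = A @ v."""
    x = np.zeros_like(b)
    r = b.copy()                        # residual b - A x, with x initialized to zero
    p = r.copy()
    r_dot_old = r @ r
    for _ in range(iters):
        z = Avp(p)
        alpha = r_dot_old / (p @ z + EPS)
        x += alpha * p
        r -= alpha * z
        r_dot_new = r @ r
        p = r + (r_dot_new / r_dot_old) * p
        r_dot_old = r_dot_new
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive-definite example
b = np.array([1.0, 2.0])
x = conjugate_gradient(lambda v: A @ v, b)
print(x, np.allclose(A @ x, b))           # agrees with np.linalg.solve(A, b)
```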

[Figure: the full TRPO algorithm, using conjugate gradient and a backtracking line search]

Finally, the paper's results on the Atari games:
[Figure: TRPO results on Atari games (from the original paper)]

TensorFlow Implementation

The code in this section comes from OpenAI's official Spinning Up implementation.

import numpy as np
import tensorflow as tf
import gym
import time
import spinup.algos.trpo.core as core
from spinup.utils.logx import EpochLogger
from spinup.utils.mpi_tf import MpiAdamOptimizer, sync_all_params
from spinup.utils.mpi_tools import mpi_fork, mpi_avg, proc_id, mpi_statistics_scalar, num_procs


EPS = 1e-8

class GAEBuffer:
    """
    A buffer for storing trajectories experienced by a TRPO agent interacting
    with the environment, and using Generalized Advantage Estimation (GAE-Lambda)
    for calculating the advantages of state-action pairs.
    """
    # Initialization: allocate buffers for obs, act, advantages, rewards, returns, values, log-probs, etc.
    def __init__(self, obs_dim, act_dim, size, info_shapes, gamma=0.99, lam=0.95):
        self.obs_buf = np.zeros(core.combined_shape(size, obs_dim), dtype=np.float32)
        self.act_buf = np.zeros(core.combined_shape(size, act_dim), dtype=np.float32)
        self.adv_buf = np.zeros(size, dtype=np.float32)
        self.rew_buf = np.zeros(size, dtype=np.float32)
        self.ret_buf = np.zeros(size, dtype=np.float32)
        self.val_buf = np.zeros(size, dtype=np.float32)
        self.logp_buf = np.zeros(size, dtype=np.float32)
        self.info_bufs = {k: np.zeros([size] + list(v), dtype=np.float32) for k,v in info_shapes.items()}
        self.sorted_info_keys = core.keys_as_sorted_list(self.info_bufs)
        self.gamma, self.lam = gamma, lam
        self.ptr, self.path_start_idx, self.max_size = 0, 0, size
    # Storage is factored out into its own method, much like the experience replay buffer in DDPG
    def store(self, obs, act, rew, val, logp, info):
        """
        Append one timestep of agent-environment interaction to the buffer.
        """
        assert self.ptr < self.max_size     # buffer has to have room so you can store
        self.obs_buf[self.ptr] = obs
        self.act_buf[self.ptr] = act
        self.rew_buf[self.ptr] = rew
        self.val_buf[self.ptr] = val
        # log-prob of the action under the policy that generated it (needed later for the new/old policy ratio)
        self.logp_buf[self.ptr] = logp
        for i, k in enumerate(self.sorted_info_keys):
            self.info_bufs[k][self.ptr] = info[i]
        self.ptr += 1

    def finish_path(self, last_val=0):
        """
        Call this at the end of a trajectory, or when one gets cut off
        by an epoch ending. This looks back in the buffer to where the
        trajectory started, and uses rewards and value estimates from
        the whole trajectory to compute advantage estimates with GAE-Lambda,
        as well as compute the rewards-to-go for each state, to use as
        the targets for the value function.

        The "last_val" argument should be 0 if the trajectory ended
        because the agent reached a terminal state (died), and otherwise
        should be V(s_T), the value function estimated for the last state.
        This allows us to bootstrap the reward-to-go calculation to account
        for timesteps beyond the arbitrary episode horizon (or epoch cutoff).
        """

        path_slice = slice(self.path_start_idx, self.ptr)
        rews = np.append(self.rew_buf[path_slice], last_val)
        vals = np.append(self.val_buf[path_slice], last_val)

        # the next two lines implement GAE-Lambda advantage calculation
        deltas = rews[:-1] + self.gamma * vals[1:] - vals[:-1]
        self.adv_buf[path_slice] = core.discount_cumsum(deltas, self.gamma * self.lam)

        # the next line computes rewards-to-go, to be targets for the value function
        self.ret_buf[path_slice] = core.discount_cumsum(rews, self.gamma)[:-1]

        self.path_start_idx = self.ptr

    def get(self):
        """
        Call this at the end of an epoch to get all of the data from
        the buffer, with advantages appropriately normalized (shifted to have
        mean zero and std one). Also, resets some pointers in the buffer.
        """
        assert self.ptr == self.max_size    # buffer has to be full before you can get
        self.ptr, self.path_start_idx = 0, 0
        # the next two lines implement the advantage normalization trick
        adv_mean, adv_std = mpi_statistics_scalar(self.adv_buf)
        self.adv_buf = (self.adv_buf - adv_mean) / adv_std
        return [self.obs_buf, self.act_buf, self.adv_buf, self.ret_buf,
                self.logp_buf] + core.values_as_sorted_list(self.info_bufs)

"""

Trust Region Policy Optimization

(with support for Natural Policy Gradient)

"""
def trpo(env_fn, actor_critic=core.mlp_actor_critic, ac_kwargs=dict(), seed=0,
         steps_per_epoch=4000, epochs=50, gamma=0.99, delta=0.01, vf_lr=1e-3,
         train_v_iters=80, damping_coeff=0.1, cg_iters=10, backtrack_iters=10,
         backtrack_coeff=0.8, lam=0.97, max_ep_len=1000, logger_kwargs=dict(),
         save_freq=10, algo='trpo'):
    """

    Args:
        env_fn : A function which creates a copy of the environment.
            The environment must satisfy the OpenAI Gym API.

        actor_critic: A function which takes in placeholder symbols
            for state, ``x_ph``, and action, ``a_ph``, and returns the main
            outputs from the agent's Tensorflow computation graph:

            ============  ================  ========================================
            Symbol        Shape             Description
            ============  ================  ========================================
            ``pi``        (batch, act_dim)  | Samples actions from policy given
                                            | states.
            ``logp``      (batch,)          | Gives log probability, according to
                                            | the policy, of taking actions ``a_ph``
                                            | in states ``x_ph``.
            ``logp_pi``   (batch,)          | Gives log probability, according to
                                            | the policy, of the action sampled by
                                            | ``pi``.
            ``info``      N/A               | A dict of any intermediate quantities
                                            | (from calculating the policy or log
                                            | probabilities) which are needed for
                                            | analytically computing KL divergence.
                                            | (eg sufficient statistics of the
                                            | distributions)
            ``info_phs``  N/A               | A dict of placeholders for old values
                                            | of the entries in ``info``.
            ``d_kl``      ()                | A symbol for computing the mean KL
                                            | divergence between the current policy
                                            | (``pi``) and the old policy (as
                                            | specified by the inputs to
                                            | ``info_phs``) over the batch of
                                            | states given in ``x_ph``.
            ``v``         (batch,)          | Gives the value estimate for states
                                            | in ``x_ph``. (Critical: make sure
                                            | to flatten this!)
            ============  ================  ========================================

        ac_kwargs (dict): Any kwargs appropriate for the actor_critic
            function you provided to TRPO.

        seed (int): Seed for random number generators.

        steps_per_epoch (int): Number of steps of interaction (state-action pairs)
            for the agent and the environment in each epoch.

        epochs (int): Number of epochs of interaction (equivalent to
            number of policy updates) to perform.

        gamma (float): Discount factor. (Always between 0 and 1.)

        delta (float): KL-divergence limit for TRPO / NPG update.
            (Should be small for stability. Values like 0.01, 0.05.)

        vf_lr (float): Learning rate for value function optimizer.

        train_v_iters (int): Number of gradient descent steps to take on
            value function per epoch.

        damping_coeff (float): Artifact for numerical stability, should be
            smallish. Adjusts Hessian-vector product calculation:

            .. math:: Hv \\rightarrow (\\alpha I + H)v

            where :math:`\\alpha` is the damping coefficient.
            Probably don't play with this hyperparameter.

        cg_iters (int): Number of iterations of conjugate gradient to perform.
            Increasing this will lead to a more accurate approximation
            to :math:`H^{-1} g`, and possibly slightly-improved performance,
            but at the cost of slowing things down.

            Also probably don't play with this hyperparameter.

        backtrack_iters (int): Maximum number of steps allowed in the
            backtracking line search. Since the line search usually doesn't
            backtrack, and usually only steps back once when it does, this
            hyperparameter doesn't often matter.

        backtrack_coeff (float): How far back to step during backtracking line
            search. (Always between 0 and 1, usually above 0.5.)

        lam (float): Lambda for GAE-Lambda. (Always between 0 and 1,
            close to 1.)

        max_ep_len (int): Maximum length of trajectory / episode / rollout.

        logger_kwargs (dict): Keyword args for EpochLogger.

        save_freq (int): How often (in terms of gap between epochs) to save
            the current policy and value function.

        algo: Either 'trpo' or 'npg': this code supports both, since they are
            almost the same.

    """

    logger = EpochLogger(**logger_kwargs)
    logger.save_config(locals())

    seed += 10000 * proc_id()
    tf.set_random_seed(seed)
    np.random.seed(seed)

    env = env_fn()
    obs_dim = env.observation_space.shape
    act_dim = env.action_space.shape

    # Share information about action space with policy architecture
    ac_kwargs['action_space'] = env.action_space

    # Inputs to computation graph
    x_ph, a_ph = core.placeholders_from_spaces(env.observation_space, env.action_space)
    adv_ph, ret_ph, logp_old_ph = core.placeholders(None, None, None)

    # Main outputs from computation graph, plus placeholders for old pdist (for KL)
    pi, logp, logp_pi, info, info_phs, d_kl, v = actor_critic(x_ph, a_ph, **ac_kwargs)

    # Need all placeholders in *this* order later (to zip with data from buffer)
    all_phs = [x_ph, a_ph, adv_ph, ret_ph, logp_old_ph] + core.values_as_sorted_list(info_phs)

    # Every step, get: action, value, logprob, & info for pdist (for computing kl div)
    get_action_ops = [pi, v, logp_pi] + core.values_as_sorted_list(info)

    # Experience buffer
    local_steps_per_epoch = int(steps_per_epoch / num_procs())
    info_shapes = {k: v.shape.as_list()[1:] for k,v in info_phs.items()}
    buf = GAEBuffer(obs_dim, act_dim, local_steps_per_epoch, info_shapes, gamma, lam)

    # Count variables
    var_counts = tuple(core.count_vars(scope) for scope in ['pi', 'v'])
    logger.log('\nNumber of parameters: \t pi: %d, \t v: %d\n'%var_counts)

    # TRPO losses
    ratio = tf.exp(logp - logp_old_ph)          # pi(a|s) / pi_old(a|s)
    pi_loss = -tf.reduce_mean(ratio * adv_ph)
    v_loss = tf.reduce_mean((ret_ph - v)**2)

    # Optimizer for value function
    train_vf = MpiAdamOptimizer(learning_rate=vf_lr).minimize(v_loss)

    # Symbols needed for CG solver
    pi_params = core.get_vars('pi')
    gradient = core.flat_grad(pi_loss, pi_params)
    v_ph, hvp = core.hessian_vector_product(d_kl, pi_params)
    if damping_coeff > 0:
        hvp += damping_coeff * v_ph

    # Symbols for getting and setting params
    get_pi_params = core.flat_concat(pi_params)
    set_pi_params = core.assign_params_from_flat(v_ph, pi_params)

    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    # Sync params across processes
    sess.run(sync_all_params())

    # Setup model saving
    logger.setup_tf_saver(sess, inputs={'x': x_ph}, outputs={'pi': pi, 'v': v})

    def cg(Ax, b):
        """
        Conjugate gradient algorithm
        (see https://en.wikipedia.org/wiki/Conjugate_gradient_method)
        """
        x = np.zeros_like(b)
        r = b.copy() # Note: should be 'b - Ax(x)', but for x=0, Ax(x)=0. Change if doing warm start.
        p = r.copy()
        r_dot_old = np.dot(r,r)
        for _ in range(cg_iters):
            z = Ax(p)
            alpha = r_dot_old / (np.dot(p, z) + EPS)
            x += alpha * p
            r -= alpha * z
            r_dot_new = np.dot(r,r)
            p = r + (r_dot_new / r_dot_old) * p
            r_dot_old = r_dot_new
        return x

    def update():
        # Prepare hessian func, gradient eval
        inputs = {k:v for k,v in zip(all_phs, buf.get())}
        Hx = lambda x : mpi_avg(sess.run(hvp, feed_dict={**inputs, v_ph: x}))
        g, pi_l_old, v_l_old = sess.run([gradient, pi_loss, v_loss], feed_dict=inputs)
        g, pi_l_old = mpi_avg(g), mpi_avg(pi_l_old)

        # Core calculations for TRPO or NPG
        x = cg(Hx, g)
        alpha = np.sqrt(2*delta/(np.dot(x, Hx(x))+EPS))
        old_params = sess.run(get_pi_params)

        def set_and_eval(step):
            sess.run(set_pi_params, feed_dict={v_ph: old_params - alpha * x * step})
            return mpi_avg(sess.run([d_kl, pi_loss], feed_dict=inputs))

        if algo=='npg':
            # npg has no backtracking or hard kl constraint enforcement
            kl, pi_l_new = set_and_eval(step=1.)

        elif algo=='trpo':
            # trpo augments npg with backtracking line search, hard kl
            for j in range(backtrack_iters):
                kl, pi_l_new = set_and_eval(step=backtrack_coeff**j)
                if kl <= delta and pi_l_new <= pi_l_old:
                    logger.log('Accepting new params at step %d of line search.'%j)
                    logger.store(BacktrackIters=j)
                    break

                if j==backtrack_iters-1:
                    logger.log('Line search failed! Keeping old params.')
                    logger.store(BacktrackIters=j)
                    kl, pi_l_new = set_and_eval(step=0.)

        # Value function updates
        for _ in range(train_v_iters):
            sess.run(train_vf, feed_dict=inputs)
        v_l_new = sess.run(v_loss, feed_dict=inputs)

        # Log changes from update
        logger.store(LossPi=pi_l_old, LossV=v_l_old, KL=kl,
                     DeltaLossPi=(pi_l_new - pi_l_old),
                     DeltaLossV=(v_l_new - v_l_old))

    start_time = time.time()
    o, r, d, ep_ret, ep_len = env.reset(), 0, False, 0, 0

    # Main loop: collect experience in env and update/log each epoch
    for epoch in range(epochs):
        for t in range(local_steps_per_epoch):
            agent_outs = sess.run(get_action_ops, feed_dict={x_ph: o.reshape(1,-1)})
            a, v_t, logp_t, info_t = agent_outs[0][0], agent_outs[1], agent_outs[2], agent_outs[3:]

            # save and log
            buf.store(o, a, r, v_t, logp_t, info_t)
            logger.store(VVals=v_t)

            o, r, d, _ = env.step(a)
            ep_ret += r
            ep_len += 1

            terminal = d or (ep_len == max_ep_len)
            if terminal or (t==local_steps_per_epoch-1):
                if not(terminal):
                    print('Warning: trajectory cut off by epoch at %d steps.'%ep_len)
                # if trajectory didn't reach terminal state, bootstrap value target
                last_val = r if d else sess.run(v, feed_dict={x_ph: o.reshape(1,-1)})
                buf.finish_path(last_val)
                if terminal:
                    # only save EpRet / EpLen if trajectory finished
                    logger.store(EpRet=ep_ret, EpLen=ep_len)
                o, r, d, ep_ret, ep_len = env.reset(), 0, False, 0, 0

        # Save model
        if (epoch % save_freq == 0) or (epoch == epochs-1):
            logger.save_state({'env': env}, None)

        # Perform TRPO or NPG update!
        update()

        # Log info about epoch
        logger.log_tabular('Epoch', epoch)
        logger.log_tabular('EpRet', with_min_and_max=True)
        logger.log_tabular('EpLen', average_only=True)
        logger.log_tabular('VVals', with_min_and_max=True)
        logger.log_tabular('TotalEnvInteracts', (epoch+1)*steps_per_epoch)
        logger.log_tabular('LossPi', average_only=True)
        logger.log_tabular('LossV', average_only=True)
        logger.log_tabular('DeltaLossPi', average_only=True)
        logger.log_tabular('DeltaLossV', average_only=True)
        logger.log_tabular('KL', average_only=True)
        if algo=='trpo':
            logger.log_tabular('BacktrackIters', average_only=True)
        logger.log_tabular('Time', time.time()-start_time)
        logger.dump_tabular()


if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--env', type=str, default='HalfCheetah-v2')
    parser.add_argument('--hid', type=int, default=64)
    parser.add_argument('--l', type=int, default=2)
    parser.add_argument('--gamma', type=float, default=0.99)
    parser.add_argument('--seed', '-s', type=int, default=0)
    parser.add_argument('--cpu', type=int, default=4)
    parser.add_argument('--steps', type=int, default=4000)
    parser.add_argument('--epochs', type=int, default=50)
    parser.add_argument('--exp_name', type=str, default='trpo')
    args = parser.parse_args()

    mpi_fork(args.cpu)  # run parallel code with mpi

    from spinup.utils.run_utils import setup_logger_kwargs
    logger_kwargs = setup_logger_kwargs(args.exp_name, args.seed)

    trpo(lambda : gym.make(args.env), actor_critic=core.mlp_actor_critic,
         ac_kwargs=dict(hidden_sizes=[args.hid]*args.l), gamma=args.gamma,
         seed=args.seed, steps_per_epoch=args.steps, epochs=args.epochs,
         logger_kwargs=logger_kwargs)

The core.py module referenced above is as follows:

import numpy as np
import tensorflow as tf
import scipy.signal
from gym.spaces import Box, Discrete

EPS = 1e-8

def combined_shape(length, shape=None):
    if shape is None:
        return (length,)
    return (length, shape) if np.isscalar(shape) else (length, *shape)

def keys_as_sorted_list(dict):
    return sorted(list(dict.keys()))

def values_as_sorted_list(dict):
    return [dict[k] for k in keys_as_sorted_list(dict)]

def placeholder(dim=None):
    return tf.placeholder(dtype=tf.float32, shape=combined_shape(None,dim))

def placeholders(*args):
    return [placeholder(dim) for dim in args]

def placeholder_from_space(space):
    if isinstance(space, Box):
        return placeholder(space.shape)
    elif isinstance(space, Discrete):
        return tf.placeholder(dtype=tf.int32, shape=(None,))
    raise NotImplementedError

def placeholders_from_spaces(*args):
    return [placeholder_from_space(space) for space in args]

def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
    for h in hidden_sizes[:-1]:
        x = tf.layers.dense(x, units=h, activation=activation)
    return tf.layers.dense(x, units=hidden_sizes[-1], activation=output_activation)

def get_vars(scope=''):
    return [x for x in tf.trainable_variables() if scope in x.name]

def count_vars(scope=''):
    v = get_vars(scope)
    return sum([np.prod(var.shape.as_list()) for var in v])
def gaussian_likelihood(x, mu, log_std):
    pre_sum = -0.5 * (((x-mu)/(tf.exp(log_std)+EPS))**2 + 2*log_std + np.log(2*np.pi))
    return tf.reduce_sum(pre_sum, axis=1)

def diagonal_gaussian_kl(mu0, log_std0, mu1, log_std1):
    """
    tf symbol for mean KL divergence between two batches of diagonal gaussian distributions,
    where distributions are specified by means and log stds.
    (https://en.wikipedia.org/wiki/Kullback-Leibler_divergence#Multivariate_normal_distributions)
    """
    var0, var1 = tf.exp(2 * log_std0), tf.exp(2 * log_std1)
    pre_sum = 0.5*(((mu1- mu0)**2 + var0)/(var1 + EPS) - 1) +  log_std1 - log_std0
    all_kls = tf.reduce_sum(pre_sum, axis=1)
    return tf.reduce_mean(all_kls)

def categorical_kl(logp0, logp1):
    """
    tf symbol for mean KL divergence between two batches of categorical probability distributions,
    where the distributions are input as log probs.
    """
    all_kls = tf.reduce_sum(tf.exp(logp1) * (logp1 - logp0), axis=1)
    return tf.reduce_mean(all_kls)

def flat_concat(xs):
    return tf.concat([tf.reshape(x,(-1,)) for x in xs], axis=0)

def flat_grad(f, params):
    return flat_concat(tf.gradients(xs=params, ys=f))

def hessian_vector_product(f, params):
    # for H = grad**2 f, compute Hx
    g = flat_grad(f, params)
    x = tf.placeholder(tf.float32, shape=g.shape)
    return x, flat_grad(tf.reduce_sum(g*x), params)

def assign_params_from_flat(x, params):
    flat_size = lambda p : int(np.prod(p.shape.as_list())) # the 'int' is important for scalars
    splits = tf.split(x, [flat_size(p) for p in params])
    new_params = [tf.reshape(p_new, p.shape) for p, p_new in zip(params, splits)]
    return tf.group([tf.assign(p, p_new) for p, p_new in zip(params, new_params)])

def discount_cumsum(x, discount):
    """
    magic from rllab for computing discounted cumulative sums of vectors.

    input:
        vector x,
        [x0,
         x1,
         x2]

    output:
        [x0 + discount * x1 + discount^2 * x2,
         x1 + discount * x2,
         x2]
    """
    return scipy.signal.lfilter([1], [1, float(-discount)], x[::-1], axis=0)[::-1]

"""
Policies
"""
def mlp_categorical_policy(x, a, hidden_sizes, activation, output_activation, action_space):
    act_dim = action_space.n
    logits = mlp(x, list(hidden_sizes)+[act_dim], activation, None)
    logp_all = tf.nn.log_softmax(logits)
    pi = tf.squeeze(tf.multinomial(logits,1), axis=1)
    logp = tf.reduce_sum(tf.one_hot(a, depth=act_dim) * logp_all, axis=1)
    logp_pi = tf.reduce_sum(tf.one_hot(pi, depth=act_dim) * logp_all, axis=1)

    old_logp_all = placeholder(act_dim)
    d_kl = categorical_kl(logp_all, old_logp_all)

    info = {'logp_all': logp_all}
    info_phs = {'logp_all': old_logp_all}

    return pi, logp, logp_pi, info, info_phs, d_kl


def mlp_gaussian_policy(x, a, hidden_sizes, activation, output_activation, action_space):
    act_dim = a.shape.as_list()[-1]
    mu = mlp(x, list(hidden_sizes)+[act_dim], activation, output_activation)
    log_std = tf.get_variable(name='log_std', initializer=-0.5*np.ones(act_dim, dtype=np.float32))
    std = tf.exp(log_std)
    pi = mu + tf.random_normal(tf.shape(mu)) * std
    logp = gaussian_likelihood(a, mu, log_std)
    logp_pi = gaussian_likelihood(pi, mu, log_std)

    old_mu_ph, old_log_std_ph = placeholders(act_dim, act_dim)
    d_kl = diagonal_gaussian_kl(mu, log_std, old_mu_ph, old_log_std_ph)

    info = {'mu': mu, 'log_std': log_std}
    info_phs = {'mu': old_mu_ph, 'log_std': old_log_std_ph}

    return pi, logp, logp_pi, info, info_phs, d_kl


"""
Actor-Critics
"""
def mlp_actor_critic(x, a, hidden_sizes=(64,64), activation=tf.tanh,
                     output_activation=None, policy=None, action_space=None):

    # default policy builder depends on action space
    if policy is None and isinstance(action_space, Box):
        policy = mlp_gaussian_policy
    elif policy is None and isinstance(action_space, Discrete):
        policy = mlp_categorical_policy

    with tf.variable_scope('pi'):
        policy_outs = policy(x, a, hidden_sizes, activation, output_activation, action_space)
        pi, logp, logp_pi, info, info_phs, d_kl = policy_outs
    with tf.variable_scope('v'):
        v = tf.squeeze(mlp(x, list(hidden_sizes)+[1], activation, None), axis=1)
    return pi, logp, logp_pi, info, info_phs, d_kl, v

References:

[1]. "Trust Region Policy Optimization", https://arxiv.org/pdf/1502.05477.pdf

[2]. https://medium.com/@jonathan_hui/rl-the-math-behind-trpo-ppo-d12f6c745f33

[3]. https://ai.yanxishe.com/page/TextTranslation/1419

[4]. https://blog.csdn.net/weixin_37895339/article/details/83044731

[5]. https://en.wikipedia.org/wiki/MM_algorithm

[6]. https://www.youtube.com/watch?v=wSjQ4LVUgoU

[7]. https://zhuanlan.zhihu.com/p/26308073

[8]. https://medium.com/@jonathan_hui/rl-trust-region-policy-optimization-trpo-explained-a6ee04eeeee9

[9]. https://medium.com/@jonathan_hui/rl-natural-policy-gradient-actor-critic-using-kronecker-factored-trust-region-acktr-58f3798a4a93

[10]. KL divergence: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

[11]. https://spinningup.readthedocs.io/zh_CN/latest/_modules/spinup/algos/trpo/trpo.html#trpo
