【5-Minute Paper】(TD3) Addressing Function Approximation Error in Actor-Critic Methods

  • Paper title: Addressing Function Approximation Error in Actor-Critic Methods

Title and author information

What problem does it solve?

  In value-based reinforcement learning, function approximation tends to overestimate the value function (as in DQN). The authors bring the idea behind Double Q-Learning for handling overestimation into the actor-critic setting. (The harm of overestimation is that accumulated errors can make the values of some poor states look very high, and insufficient exploration fails to correct this.) A large part of the paper is also devoted to the excessive variance that appears once the overestimation has been corrected.

  The authors carry the overestimation analysis over to continuous action spaces. The difficulty there is that the policy changes very slowly, so the current and target value estimates stay close together, too similar to avoid maximization bias.

Background

  The standard remedy for overestimation is the Double Q-Learning family of methods. They do reduce bias, but they introduce high variance: when choosing the action at the next state s', the extra uncertainty is precisely what keeps DQN's max step from being a hard max, and the price is larger variance, which still harms policy optimization. The authors address this with Clipped Double Q-Learning.

What method is used?

  The authors use several components to reduce variance:

  1. Target networks, as in DQN, for variance reduction by reducing the accumulation of errors (without target networks the updates oscillate or diverge).
  2. To address the coupling of value and policy, delayed policy updates: the policy is updated only after the value estimate has (approximately) converged.
  3. A novel SARSA-style regularization of the update target (target policy smoothing), which reduces variance by averaging over value estimates. The idea follows Nachum et al. (2018), where smoothing the value function is shown to reduce variance.
  • Nachum, O., Norouzi, M., Tucker, G., and Schuurmans, D. Smoothed action value functions for learning gaussian policies. arXiv preprint arXiv:1803.02348, 2018.

  Multi-step returns can of course also trade off bias against variance; a few more such techniques are listed under Further Reading at the end.

  Applying these corrections to Deep Deterministic Policy Gradient yields the Twin Delayed Deep Deterministic policy gradient (TD3) algorithm, which explicitly accounts for the effect that function approximation errors in both the policy and the value function have on the actor-critic framework.

Review of prior algorithms

  First, recall the DPG update:

$$\nabla_{\phi} J(\phi)=\mathbb{E}_{s \sim p_{\pi}}\left[\left.\nabla_{a} Q^{\pi}(s, a)\right|_{a=\pi(s)} \nabla_{\phi} \pi_{\phi}(s)\right]$$

  where $Q^{\pi}(s,a) = r+\gamma \mathbb{E}_{s^{\prime},a^{\prime}}[Q^{\pi}(s^{\prime},a^{\prime})]$. $Q^{\pi}(s,a)$ can be approximated with parameters $\theta$, and as in DQN a frozen target network $Q_{\theta^{\prime}}(s,a)$ is used; the update target is:

$$y=r+\gamma Q_{\theta^{\prime}}\left(s^{\prime}, a^{\prime}\right), \quad a^{\prime} \sim \pi_{\phi^{\prime}}\left(s^{\prime}\right)$$
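  As a rough illustration of how the DPG actor update and this frozen-target critic target are usually written in code, here is a minimal PyTorch sketch (toy networks, dimensions, and the target-network stand-ins are assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

# Toy dimensions and networks; a real implementation keeps separately
# initialized, slowly updated copies for the target networks.
state_dim, action_dim, gamma = 3, 1, 0.99
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                      nn.Linear(32, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(),
                       nn.Linear(32, 1))
actor_target, critic_target = actor, critic  # stand-ins for frozen copies

s = torch.randn(8, state_dim)
r, s_next = torch.randn(8, 1), torch.randn(8, state_dim)

# Deterministic policy gradient: ascend Q_theta(s, pi_phi(s)),
# implemented as minimizing its negative mean.
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()

# Frozen-target TD target: y = r + gamma * Q_theta'(s', pi_phi'(s')).
with torch.no_grad():
    y = r + gamma * critic_target(torch.cat([s_next, actor_target(s_next)], dim=1))
```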

  If the value estimate is corrupted by an error $\varepsilon$, then:

$$\mathbb{E}_{\varepsilon}\left[\max_{a^{\prime}}\left(Q(s^{\prime},a^{\prime})+\varepsilon\right)\right] \geq \max_{a^{\prime}}Q(s^{\prime},a^{\prime})$$
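  A quick numerical check of this inequality (a toy example with made-up Q-values and Gaussian noise, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up action values for 5 discrete actions; the noise-free max is 1.0.
q_true = np.array([0.2, 0.5, 1.0, 0.7, 0.1])

# E_eps[max_a (Q(a) + eps)] estimated over many draws of eps ~ N(0, 0.5).
noisy_max = np.mean([np.max(q_true + rng.normal(0.0, 0.5, size=q_true.size))
                     for _ in range(10_000)])

print(noisy_max, ">=", q_true.max())  # the noisy maximum is biased upward
```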

  In the actor-critic setting, let $\phi_{approx}$ denote the actor parameters obtained from the approximate value function $Q_{\theta}(s,a)$ (the policy parameters induced by $Q_{\theta}$), and $\phi_{true}$ the parameters the actor would obtain from the true value $Q^{\pi}(s,a)$ (which is not known during learning):

$$\begin{aligned} \phi_{\text{approx}} &=\phi+\frac{\alpha}{Z_{1}} \mathbb{E}_{s \sim p_{\pi}}\left[\nabla_{\phi} \pi_{\phi}(s)\left.\nabla_{a} Q_{\theta}(s, a)\right|_{a=\pi_{\phi}(s)}\right] \\ \phi_{\text{true}} &=\phi+\frac{\alpha}{Z_{2}} \mathbb{E}_{s \sim p_{\pi}}\left[\nabla_{\phi} \pi_{\phi}(s)\left.\nabla_{a} Q^{\pi}(s, a)\right|_{a=\pi_{\phi}(s)}\right] \end{aligned}$$

  where $Z_{1}$, $Z_{2}$ normalize the gradients, i.e. $Z^{-1}\lVert\mathbb{E}[\cdot]\rVert = 1$. The normalization simply makes the argument easier; without normalized gradients, overestimation bias is still guaranteed to occur under slightly stricter conditions.

  Since the gradient points in a direction of local maximization, there exists a sufficiently small $\varepsilon_{1}$ such that for $\alpha \leq \varepsilon_{1}$ the approximate value of $\pi_{approx}$ is bounded below by the approximate value of $\pi_{true}$ (this is the overestimation expressed by the following inequality):

$$\mathbb{E}\left[Q_{\theta}(s,\pi_{approx}(s))\right] \geq \mathbb{E}\left[Q_{\theta}(s,\pi_{true}(s))\right]$$

  Conversely, there exists a sufficiently small $\varepsilon_{2}$ such that for $\alpha \leq \varepsilon_{2}$ the true value of $\pi_{approx}$ is bounded above by the true value of $\pi_{true}$ (the actions chosen by the approximate policy cannot do better than $\pi_{true}$ under the true action-value function):

$$\mathbb{E}\left[Q^{\pi}(s,\pi_{true}(s))\right] \geq \mathbb{E}\left[Q^{\pi}(s,\pi_{approx}(s))\right]$$

  If, in addition, the value estimate is at least as large as the true value, $\mathbb{E}[Q_{\theta}(s,\pi_{true}(s))] \geq \mathbb{E}[Q^{\pi}(s,\pi_{true}(s))]$, then combining the three inequalities gives:

$$\mathbb{E}\left[Q_{\theta}(s,\pi_{approx}(s))\right] \geq \mathbb{E}\left[Q^{\pi}(s,\pi_{approx}(s))\right]$$
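  Written out, the chain behind this conclusion is:

$$\mathbb{E}\left[Q_{\theta}(s,\pi_{approx}(s))\right] \geq \mathbb{E}\left[Q_{\theta}(s,\pi_{true}(s))\right] \geq \mathbb{E}\left[Q^{\pi}(s,\pi_{true}(s))\right] \geq \mathbb{E}\left[Q^{\pi}(s,\pi_{approx}(s))\right]$$

  i.e. the critic's estimate of the learned policy's value overestimates that policy's true value.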

Clipped Double Q-Learning

  The target in Double DQN:

$$y = r + \gamma Q_{\theta^{\prime}}\left(s^{\prime},\pi_{\phi}(s^{\prime})\right)$$

  The targets in Double Q-learning:

$$\begin{aligned} y_{1}&=r+\gamma Q_{\theta_{2}^{\prime}}\left(s^{\prime}, \pi_{\phi_{1}}\left(s^{\prime}\right)\right) \\ y_{2}&=r+\gamma Q_{\theta_{1}^{\prime}}\left(s^{\prime}, \pi_{\phi_{2}}\left(s^{\prime}\right)\right) \end{aligned}$$

  The target in Clipped Double Q-learning:

$$y_{1} = r + \gamma \min_{i=1,2}Q_{\theta_{i}^{\prime}}\left(s^{\prime},\pi_{\phi_{1}}(s^{\prime})\right)$$

  Here $\pi_{\phi_{1}}$ is, in practice, the target actor (see the pseudocode: only a single actor is used). Taking the minimum can introduce an underestimation bias, and because of that bias the method needs sufficient exploration, otherwise the algorithm becomes quite inefficient.

  If $Q_{\theta_{2}} > Q_{\theta_{1}}$, the minimum picks $Q_{\theta_{1}}$, the auxiliary critic $Q_{\theta_{2}}$ is effectively unused, and no additional bias is introduced; if $Q_{\theta_{1}} > Q_{\theta_{2}}$, the minimum picks $Q_{\theta_{2}}$ instead. A convergence proof is given in the appendix of the original paper.
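  A minimal sketch of the clipped target computation, where `actor_target`, `q1_target` and `q2_target` are assumed stand-ins for the target networks and a standard done mask is included:

```python
import numpy as np

def clipped_double_q_target(r, s_next, actor_target, q1_target, q2_target,
                            gamma=0.99, done=0.0):
    """y = r + gamma * min_i Q_theta_i'(s', pi(s')): the Clipped Double
    Q-learning target, used as the regression target for both critics."""
    a_next = actor_target(s_next)
    q_min = np.minimum(q1_target(s_next, a_next), q2_target(s_next, a_next))
    return r + gamma * (1.0 - done) * q_min

# Toy usage with hand-written stand-ins for the networks.
y = clipped_double_q_target(
    r=1.0, s_next=np.array([0.1, -0.3]),
    actor_target=lambda s: np.tanh(s.sum(keepdims=True)),
    q1_target=lambda s, a: s.sum() + a.sum(),
    q2_target=lambda s, a: s.sum() - a.sum())
```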

Addressing Variance

  Target networks are used to reduce the variance introduced by policy updates; without them, the approximate state-value estimates easily oscillate or diverge.

  The policy is updated less often than the value function (Delayed Policy Updates): the TD error is driven down first, so that each policy update is less affected by value-estimation error and therefore has lower variance.
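  A sketch of this update schedule, with hypothetical placeholder functions standing in for the actual critic and actor updates:

```python
policy_delay = 2   # "d" in the paper: one actor/target update per d critic updates
total_steps = 6

def update_critics(step):
    # Placeholder: in a real implementation this minimizes the TD error
    # of both critics on a sampled mini-batch.
    print(f"step {step}: critic update")

def update_actor_and_targets(step):
    # Placeholder: DPG step on the actor plus soft (Polyak) target updates.
    print(f"step {step}: delayed actor / target update")

for step in range(1, total_steps + 1):
    update_critics(step)
    if step % policy_delay == 0:
        update_actor_and_targets(step)
```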

Target Policy Smoothing Regularization

  The authors argue that similar actions should have similar values, so adding a small amount of noise around the target action regularizes the value estimate and improves generalization:

$$\begin{aligned} y &=r+\gamma Q_{\theta^{\prime}}\left(s^{\prime}, \pi_{\phi^{\prime}}\left(s^{\prime}\right)+\epsilon\right) \\ \epsilon & \sim \operatorname{clip}(\mathcal{N}(0, \sigma),-c, c) \end{aligned}$$
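  The noise term on its own, as a small NumPy sketch (`actor_target`, the noise scale and the action bound are assumptions for illustration):

```python
import numpy as np

def smoothed_target_action(actor_target, s_next, sigma=0.2, c=0.5,
                           max_action=1.0, rng=np.random.default_rng()):
    """pi_phi'(s') + clip(N(0, sigma), -c, c), clipped back to the valid range."""
    a_next = actor_target(s_next)
    noise = np.clip(rng.normal(0.0, sigma, size=np.shape(a_next)), -c, c)
    return np.clip(a_next + noise, -max_action, max_action)

# Toy usage with a hand-written actor stand-in.
a = smoothed_target_action(lambda s: np.tanh(s), np.array([0.4, -1.2]))
```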

  A similar idea appears in Nachum et al. (2018), except that the smoothing there is applied to $Q_{\theta}$ rather than $Q_{\theta^{\prime}}$.

  • Nachum, O., Norouzi, M., Tucker, G., and Schuurmans, D. Smoothed action value functions for learning gaussian policies. arXiv preprint arXiv:1803.02348, 2018.

Algorithm pseudocode:

[Figure: TD3 algorithm pseudocode]
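  Since only the image of the pseudocode survives here, the following is a minimal single-batch sketch of how the three pieces fit together; the toy PyTorch networks, dimensions and hyperparameters are assumptions, not the authors' reference implementation (see the repository linked under Further Reading for that):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim, max_action = 3, 1, 1.0
gamma, tau, sigma, c, policy_delay = 0.99, 0.005, 0.2, 0.5, 2

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

actor, actor_t = mlp(state_dim, action_dim), mlp(state_dim, action_dim)
q1, q2 = mlp(state_dim + action_dim, 1), mlp(state_dim + action_dim, 1)
q1_t, q2_t = mlp(state_dim + action_dim, 1), mlp(state_dim + action_dim, 1)
for tgt, src in [(actor_t, actor), (q1_t, q1), (q2_t, q2)]:
    tgt.load_state_dict(src.state_dict())

actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=3e-4)

def td3_update(step, s, a, r, s_next, done):
    with torch.no_grad():
        # (1) Target policy smoothing: clipped Gaussian noise on the target action.
        noise = (torch.randn_like(a) * sigma).clamp(-c, c)
        a_next = (torch.tanh(actor_t(s_next)) * max_action + noise).clamp(-max_action, max_action)
        # (2) Clipped Double Q-learning: take the minimum of the two target critics.
        sa_next = torch.cat([s_next, a_next], dim=1)
        y = r + gamma * (1 - done) * torch.min(q1_t(sa_next), q2_t(sa_next))

    sa = torch.cat([s, a], dim=1)
    critic_loss = F.mse_loss(q1(sa), y) + F.mse_loss(q2(sa), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # (3) Delayed policy and target updates.
    if step % policy_delay == 0:
        pi = torch.tanh(actor(s)) * max_action
        actor_loss = -q1(torch.cat([s, pi], dim=1)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        for tgt, src in [(actor_t, actor), (q1_t, q1), (q2_t, q2)]:
            for p_t, p in zip(tgt.parameters(), src.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)

# Toy usage on a single random batch (a real agent samples from a replay buffer).
B = 32
s, s_next = torch.randn(B, state_dim), torch.randn(B, state_dim)
a = torch.rand(B, action_dim) * 2 - 1
r, done = torch.randn(B, 1), torch.zeros(B, 1)
for step in range(1, 5):
    td3_update(step, s, a, r, s_next, done)
```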

What results were obtained?

  The authors compare against current state-of-the-art algorithms; the results are as follows:

[Figure: overestimation experiments]

  The authors also examine the effect of the target network on convergence:

[Figure: convergence with and without target networks]

  Final experiments:

[Figure: experimental results]

Publication and author information

  Published at ICML 2018. Scott Fujimoto is a PhD student at McGill University and Mila. He is the author of TD3 as well as some of the recent developments in batch deep reinforcement learning.

  Two of his other papers are also worth a look: Off-Policy Deep Reinforcement Learning without Exploration and Benchmarking Batch Deep Reinforcement Learning Algorithms.

Further reading

  • Code: https://github.com/sfujim/TD3

  To support reproducibility, the authors follow Henderson et al. (2017) and run their experiments over many random seeds.

  • Reference: Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep Reinforcement Learning that Matters. arXiv preprint arXiv:1709.06560, 2017.

  Some other ways of balancing bias and variance include:

  1. importance sampling
  • Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal-difference learning with function approximation. In International Conference on Machine Learning, pp. 417–424, 2001.
  • Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054–1062, 2016.
  2. distributed methods
  • Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
  • Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
  3. approximate bounds
  • He, F. S., Liu, Y., Schwing, A. G., and Peng, J. Learning to play in a day: Faster deep reinforcement learning by optimality tightening. arXiv preprint arXiv:1611.01606, 2016.
  4. reducing the discount factor to reduce the contribution of each error
  • Petrik, M. and Scherrer, B. Biasing approximate dynamic programming with a lower discount factor. In Advances in Neural Information Processing Systems, pp. 1265–1272, 2009.