【5-Minute Paper】(TD3) Addressing Function Approximation Error in Actor-Critic Methods

  • Paper title: Addressing Function Approximation Error in Actor-Critic Methods

Title and author information

What problem does it solve?

  In value-based reinforcement learning, function approximation tends to overestimate the value function (as in DQN). The authors bring the idea behind Double Q-Learning for handling overestimation into the actor-critic setting. (The harm of overestimation is that accumulated errors can make the values of some poor states look very high, and insufficient exploration fails to correct this.) A large part of the paper is also devoted to the excessive variance that appears once the overestimation has been corrected.

  The authors carry the overestimation analysis over to continuous action spaces. The difficulty there is that the policy changes very slowly, so the current and target value estimates stay close together, too similar to avoid maximization bias.

Background

  The standard remedy for overestimation is the Double Q-Learning family of methods. They do reduce bias, but they introduce high variance: when choosing the action at the next state s', the extra uncertainty is precisely what keeps DQN's max step from being a hard max, and the price is larger variance, which still harms policy optimization. The authors address this with Clipped Double Q-Learning.

What method is used?

  The authors use several components to reduce variance:

  1. Target networks, as in DQN, for variance reduction by reducing the accumulation of errors (without target networks the updates oscillate or diverge).
  2. To address the coupling of value and policy, delayed policy updates: the policy is updated only after the value estimate has (approximately) converged.
  3. A novel SARSA-style regularization of the update target (target policy smoothing), which reduces variance by averaging over value estimates. The idea follows Nachum et al. (2018), where smoothing the value function is shown to reduce variance.
  • Nachum, O., Norouzi, M., Tucker, G., and Schuurmans, D. Smoothed action value functions for learning gaussian policies. arXiv preprint arXiv:1803.02348, 2018.

  Multi-step returns can of course also trade off bias against variance; a few more such techniques are listed under Further Reading at the end.

  Applying these corrections to Deep Deterministic Policy Gradient yields the Twin Delayed Deep Deterministic policy gradient (TD3) algorithm, which explicitly accounts for the effect that function approximation errors in both the policy and the value function have on the actor-critic framework.

Review of prior algorithms

  First, recall the DPG update:

$$\nabla_{\phi} J(\phi)=\mathbb{E}_{s \sim p_{\pi}}\left[\left.\nabla_{a} Q^{\pi}(s, a)\right|_{a=\pi(s)} \nabla_{\phi} \pi_{\phi}(s)\right]$$

  where $Q^{\pi}(s,a) = r+\gamma \mathbb{E}_{s^{\prime},a^{\prime}}[Q^{\pi}(s^{\prime},a^{\prime})]$. $Q^{\pi}(s,a)$ can be approximated with parameters $\theta$, and as in DQN a frozen target network $Q_{\theta^{\prime}}(s,a)$ is used; the update target is:

$$y=r+\gamma Q_{\theta^{\prime}}\left(s^{\prime}, a^{\prime}\right), \quad a^{\prime} \sim \pi_{\phi^{\prime}}\left(s^{\prime}\right)$$
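  As a rough illustration of how the DPG actor update and this frozen-target critic target are usually written in code, here is a minimal PyTorch sketch (toy networks, dimensions, and the target-network stand-ins are assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

# Toy dimensions and networks; a real implementation keeps separately
# initialized, slowly updated copies for the target networks.
state_dim, action_dim, gamma = 3, 1, 0.99
actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                      nn.Linear(32, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(),
                       nn.Linear(32, 1))
actor_target, critic_target = actor, critic  # stand-ins for frozen copies

s = torch.randn(8, state_dim)
r, s_next = torch.randn(8, 1), torch.randn(8, state_dim)

# Deterministic policy gradient: ascend Q_theta(s, pi_phi(s)),
# implemented as minimizing its negative mean.
actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()

# Frozen-target TD target: y = r + gamma * Q_theta'(s', pi_phi'(s')).
with torch.no_grad():
    y = r + gamma * critic_target(torch.cat([s_next, actor_target(s_next)], dim=1))
```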

  If the value estimate is corrupted by an error $\varepsilon$, then:

$$\mathbb{E}_{\varepsilon}\left[\max_{a^{\prime}}\left(Q(s^{\prime},a^{\prime})+\varepsilon\right)\right] \geq \max_{a^{\prime}}Q(s^{\prime},a^{\prime})$$
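  A quick numerical check of this inequality (a toy example with made-up Q-values and Gaussian noise, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up action values for 5 discrete actions; the noise-free max is 1.0.
q_true = np.array([0.2, 0.5, 1.0, 0.7, 0.1])

# E_eps[max_a (Q(a) + eps)] estimated over many draws of eps ~ N(0, 0.5).
noisy_max = np.mean([np.max(q_true + rng.normal(0.0, 0.5, size=q_true.size))
                     for _ in range(10_000)])

print(noisy_max, ">=", q_true.max())  # the noisy maximum is biased upward
```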

  In the actor-critic setting, let $\phi_{approx}$ denote the actor parameters obtained from the approximate value function $Q_{\theta}(s,a)$ (the policy parameters induced by $Q_{\theta}$), and $\phi_{true}$ the parameters the actor would obtain from the true value $Q^{\pi}(s,a)$ (which is not known during learning):

$$\begin{aligned} \phi_{\text{approx}} &=\phi+\frac{\alpha}{Z_{1}} \mathbb{E}_{s \sim p_{\pi}}\left[\nabla_{\phi} \pi_{\phi}(s)\left.\nabla_{a} Q_{\theta}(s, a)\right|_{a=\pi_{\phi}(s)}\right] \\ \phi_{\text{true}} &=\phi+\frac{\alpha}{Z_{2}} \mathbb{E}_{s \sim p_{\pi}}\left[\nabla_{\phi} \pi_{\phi}(s)\left.\nabla_{a} Q^{\pi}(s, a)\right|_{a=\pi_{\phi}(s)}\right] \end{aligned}$$

  where $Z_{1}$, $Z_{2}$ normalize the gradients, i.e. $Z^{-1}\lVert\mathbb{E}[\cdot]\rVert = 1$. The normalization simply makes the argument easier; without normalized gradients, overestimation bias is still guaranteed to occur under slightly stricter conditions.

  Since the gradient points in a direction of local maximization, there exists a sufficiently small $\varepsilon_{1}$ such that for $\alpha \leq \varepsilon_{1}$ the approximate value of $\pi_{approx}$ is bounded below by the approximate value of $\pi_{true}$ (this is the overestimation expressed by the following inequality):

$$\mathbb{E}\left[Q_{\theta}(s,\pi_{approx}(s))\right] \geq \mathbb{E}\left[Q_{\theta}(s,\pi_{true}(s))\right]$$

  Conversely, there exists a sufficiently small $\varepsilon_{2}$ such that for $\alpha \leq \varepsilon_{2}$ the true value of $\pi_{approx}$ is bounded above by the true value of $\pi_{true}$ (the actions chosen by the approximate policy cannot do better than $\pi_{true}$ under the true action-value function):

$$\mathbb{E}\left[Q^{\pi}(s,\pi_{true}(s))\right] \geq \mathbb{E}\left[Q^{\pi}(s,\pi_{approx}(s))\right]$$

  If, in addition, the value estimate is at least as large as the true value, $\mathbb{E}[Q_{\theta}(s,\pi_{true}(s))] \geq \mathbb{E}[Q^{\pi}(s,\pi_{true}(s))]$, then combining the three inequalities gives:

$$\mathbb{E}\left[Q_{\theta}(s,\pi_{approx}(s))\right] \geq \mathbb{E}\left[Q^{\pi}(s,\pi_{approx}(s))\right]$$
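  Written out, the chain behind this conclusion is:

$$\mathbb{E}\left[Q_{\theta}(s,\pi_{approx}(s))\right] \geq \mathbb{E}\left[Q_{\theta}(s,\pi_{true}(s))\right] \geq \mathbb{E}\left[Q^{\pi}(s,\pi_{true}(s))\right] \geq \mathbb{E}\left[Q^{\pi}(s,\pi_{approx}(s))\right]$$

  i.e. the critic's estimate of the learned policy's value overestimates that policy's true value.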

Clipped Double Q-Learning

  The target in Double DQN:

$$y = r + \gamma Q_{\theta^{\prime}}\left(s^{\prime},\pi_{\phi}(s^{\prime})\right)$$

  The targets in Double Q-learning:

$$\begin{aligned} y_{1}&=r+\gamma Q_{\theta_{2}^{\prime}}\left(s^{\prime}, \pi_{\phi_{1}}\left(s^{\prime}\right)\right) \\ y_{2}&=r+\gamma Q_{\theta_{1}^{\prime}}\left(s^{\prime}, \pi_{\phi_{2}}\left(s^{\prime}\right)\right) \end{aligned}$$

  The target in Clipped Double Q-learning:

$$y_{1} = r + \gamma \min_{i=1,2}Q_{\theta_{i}^{\prime}}\left(s^{\prime},\pi_{\phi_{1}}(s^{\prime})\right)$$

  Here $\pi_{\phi_{1}}$ is, in practice, the target actor (see the pseudocode: only a single actor is used). Taking the minimum can introduce an underestimation bias, and because of that bias the method needs sufficient exploration, otherwise the algorithm becomes quite inefficient.

  If $Q_{\theta_{2}} > Q_{\theta_{1}}$, the minimum picks $Q_{\theta_{1}}$, the auxiliary critic $Q_{\theta_{2}}$ is effectively unused, and no additional bias is introduced; if $Q_{\theta_{1}} > Q_{\theta_{2}}$, the minimum picks $Q_{\theta_{2}}$ instead. A convergence proof is given in the appendix of the original paper.
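  A minimal sketch of the clipped target computation, where `actor_target`, `q1_target` and `q2_target` are assumed stand-ins for the target networks and a standard done mask is included:

```python
import numpy as np

def clipped_double_q_target(r, s_next, actor_target, q1_target, q2_target,
                            gamma=0.99, done=0.0):
    """y = r + gamma * min_i Q_theta_i'(s', pi(s')): the Clipped Double
    Q-learning target, used as the regression target for both critics."""
    a_next = actor_target(s_next)
    q_min = np.minimum(q1_target(s_next, a_next), q2_target(s_next, a_next))
    return r + gamma * (1.0 - done) * q_min

# Toy usage with hand-written stand-ins for the networks.
y = clipped_double_q_target(
    r=1.0, s_next=np.array([0.1, -0.3]),
    actor_target=lambda s: np.tanh(s.sum(keepdims=True)),
    q1_target=lambda s, a: s.sum() + a.sum(),
    q2_target=lambda s, a: s.sum() - a.sum())
```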

Addressing Variance

  Target networks are used to reduce the variance introduced by policy updates; without them, the approximate state-value estimates easily oscillate or diverge.

  The policy is updated less often than the value function (Delayed Policy Updates): the TD error is driven down first, so that each policy update is less affected by value-estimation error and therefore has lower variance.
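  A sketch of this update schedule, with hypothetical placeholder functions standing in for the actual critic and actor updates:

```python
policy_delay = 2   # "d" in the paper: one actor/target update per d critic updates
total_steps = 6

def update_critics(step):
    # Placeholder: in a real implementation this minimizes the TD error
    # of both critics on a sampled mini-batch.
    print(f"step {step}: critic update")

def update_actor_and_targets(step):
    # Placeholder: DPG step on the actor plus soft (Polyak) target updates.
    print(f"step {step}: delayed actor / target update")

for step in range(1, total_steps + 1):
    update_critics(step)
    if step % policy_delay == 0:
        update_actor_and_targets(step)
```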

Target Policy Smoothing Regularization

  The authors argue that similar actions should have similar values, so adding a small amount of noise around the target action regularizes the value estimate and improves generalization:

$$\begin{aligned} y &=r+\gamma Q_{\theta^{\prime}}\left(s^{\prime}, \pi_{\phi^{\prime}}\left(s^{\prime}\right)+\epsilon\right) \\ \epsilon & \sim \operatorname{clip}(\mathcal{N}(0, \sigma),-c, c) \end{aligned}$$
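  The noise term on its own, as a small NumPy sketch (`actor_target`, the noise scale and the action bound are assumptions for illustration):

```python
import numpy as np

def smoothed_target_action(actor_target, s_next, sigma=0.2, c=0.5,
                           max_action=1.0, rng=np.random.default_rng()):
    """pi_phi'(s') + clip(N(0, sigma), -c, c), clipped back to the valid range."""
    a_next = actor_target(s_next)
    noise = np.clip(rng.normal(0.0, sigma, size=np.shape(a_next)), -c, c)
    return np.clip(a_next + noise, -max_action, max_action)

# Toy usage with a hand-written actor stand-in.
a = smoothed_target_action(lambda s: np.tanh(s), np.array([0.4, -1.2]))
```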

  A similar idea appears in Nachum et al. (2018), except that the smoothing there is applied to $Q_{\theta}$ rather than $Q_{\theta^{\prime}}$.

  • Nachum, O., Norouzi, M., Tucker, G., and Schuurmans, D. Smoothed action value functions for learning gaussian policies. arXiv preprint arXiv:1803.02348, 2018.

Algorithm pseudocode:

[Figure: TD3 algorithm pseudocode]
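  Since only the image of the pseudocode survives here, the following is a minimal single-batch sketch of how the three pieces fit together; the toy PyTorch networks, dimensions and hyperparameters are assumptions, not the authors' reference implementation (see the repository linked under Further Reading for that):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

state_dim, action_dim, max_action = 3, 1, 1.0
gamma, tau, sigma, c, policy_delay = 0.99, 0.005, 0.2, 0.5, 2

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

actor, actor_t = mlp(state_dim, action_dim), mlp(state_dim, action_dim)
q1, q2 = mlp(state_dim + action_dim, 1), mlp(state_dim + action_dim, 1)
q1_t, q2_t = mlp(state_dim + action_dim, 1), mlp(state_dim + action_dim, 1)
for tgt, src in [(actor_t, actor), (q1_t, q1), (q2_t, q2)]:
    tgt.load_state_dict(src.state_dict())

actor_opt = torch.optim.Adam(actor.parameters(), lr=3e-4)
critic_opt = torch.optim.Adam(list(q1.parameters()) + list(q2.parameters()), lr=3e-4)

def td3_update(step, s, a, r, s_next, done):
    with torch.no_grad():
        # (1) Target policy smoothing: clipped Gaussian noise on the target action.
        noise = (torch.randn_like(a) * sigma).clamp(-c, c)
        a_next = (torch.tanh(actor_t(s_next)) * max_action + noise).clamp(-max_action, max_action)
        # (2) Clipped Double Q-learning: take the minimum of the two target critics.
        sa_next = torch.cat([s_next, a_next], dim=1)
        y = r + gamma * (1 - done) * torch.min(q1_t(sa_next), q2_t(sa_next))

    sa = torch.cat([s, a], dim=1)
    critic_loss = F.mse_loss(q1(sa), y) + F.mse_loss(q2(sa), y)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # (3) Delayed policy and target updates.
    if step % policy_delay == 0:
        pi = torch.tanh(actor(s)) * max_action
        actor_loss = -q1(torch.cat([s, pi], dim=1)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
        for tgt, src in [(actor_t, actor), (q1_t, q1), (q2_t, q2)]:
            for p_t, p in zip(tgt.parameters(), src.parameters()):
                p_t.data.mul_(1 - tau).add_(tau * p.data)

# Toy usage on a single random batch (a real agent samples from a replay buffer).
B = 32
s, s_next = torch.randn(B, state_dim), torch.randn(B, state_dim)
a = torch.rand(B, action_dim) * 2 - 1
r, done = torch.randn(B, 1), torch.zeros(B, 1)
for step in range(1, 5):
    td3_update(step, s, a, r, s_next, done)
```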

What results were obtained?

  The authors compare against current state-of-the-art algorithms; the results are as follows:

[Figure: overestimation experiments]

  The authors also examine the effect of the target network on convergence:

[Figure: convergence with and without target networks]

  Final experiments:

[Figure: experimental results]

Publication and author information

  Published at ICML 2018. Scott Fujimoto is a PhD student at McGill University and Mila. He is the author of TD3 as well as some of the recent developments in batch deep reinforcement learning.

  Two of his other papers are also worth a look: Off-Policy Deep Reinforcement Learning without Exploration and Benchmarking Batch Deep Reinforcement Learning Algorithms.

Further reading

  • Code: https://github.com/sfujim/TD3

  To support reproducibility, the authors follow Henderson et al. (2017) and run their experiments over many random seeds.

  • Reference: Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. Deep Reinforcement Learning that Matters. arXiv preprint arXiv:1709.06560, 2017.

  Some other ways of balancing bias and variance include:

  1. importance sampling
  • Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal-difference learning with function approximation. In International Conference on Machine Learning, pp. 417–424, 2001.
  • Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pp. 1054–1062, 2016.
  2. distributed methods
  • Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pp. 1928–1937, 2016.
  • Espeholt, L., Soyer, H., Munos, R., Simonyan, K., Mnih, V., Ward, T., Doron, Y., Firoiu, V., Harley, T., Dunning, I., et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. arXiv preprint arXiv:1802.01561, 2018.
  3. approximate bounds
  • He, F. S., Liu, Y., Schwing, A. G., and Peng, J. Learning to play in a day: Faster deep reinforcement learning by optimality tightening. arXiv preprint arXiv:1611.01606, 2016.
  4. reducing the discount factor to reduce the contribution of each error
  • Petrik, M. and Scherrer, B. Biasing approximate dynamic programming with a lower discount factor. In Advances in Neural Information Processing Systems, pp. 1265–1272, 2009.