Some Code-Level Performance Optimization Tricks for PPO

Intro

This blog post is a summary I wrote after reading "Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO" by Logan Engstrom et al.

reward clipping

  • clip the rewards within a preset range (usually [-5, 5] or [-10, 10])
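
A minimal sketch of what this looks like inside a rollout loop; the default range here just mirrors the [-10, 10] example above:

```python
import numpy as np

def clip_reward(reward, low=-10.0, high=10.0):
    """Clip the raw environment reward into a fixed range before it is
    used to compute returns and advantages."""
    return float(np.clip(reward, low, high))

# usage inside a rollout loop (env is a Gym-style environment):
# obs, reward, done, info = env.step(action)
# reward = clip_reward(reward)
```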

observation clipping

  • The states are first normalized to zero-mean, unit-variance vectors and then clipped to a fixed range (typically [-10, 10])
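
A sketch of a running observation filter; the online update rule and the clip value of 10 are assumptions modeled on common PPO implementations (e.g. the VecNormalize wrapper in OpenAI baselines):

```python
import numpy as np

class RunningObsFilter:
    """Keeps running estimates of the observation mean and variance and
    returns a zero-mean, unit-variance, clipped version of each observation."""

    def __init__(self, shape, clip=10.0, eps=1e-8):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = eps
        self.clip = clip
        self.eps = eps

    def __call__(self, obs):
        obs = np.asarray(obs, dtype=np.float64)
        # incremental (Welford-style) update of mean and variance
        self.count += 1
        delta = obs - self.mean
        self.mean += delta / self.count
        self.var += (delta * (obs - self.mean) - self.var) / self.count
        normalized = (obs - self.mean) / np.sqrt(self.var + self.eps)
        return np.clip(normalized, -self.clip, self.clip)
```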

value function clipping

Instead of fitting the value network by plain regression to target values, \(L^{V} = (V_{\theta_t} - V_{targ})^{2}\), the standard implementation uses a PPO-like clipped objective: \(L^{V} = \max\big[ (V_{\theta_t} - V_{targ})^{2},\ \big(\mathrm{clip}(V_{\theta_t},\, V_{\theta_{t-1}}-\epsilon,\, V_{\theta_{t-1}}+\epsilon) - V_{targ}\big)^{2} \big]\), i.e. the pessimistic maximum of the unclipped and clipped squared errors, mirroring the clipping in the policy objective.
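
A sketch of this loss in PyTorch; the tensor names and the default \(\epsilon\) are assumptions, with `v_old` standing for the value prediction recorded when the data was collected:

```python
import torch

def clipped_value_loss(v_pred, v_old, v_targ, eps=0.2):
    """Pessimistic maximum of the unclipped and clipped squared errors."""
    # clip(V, V_old - eps, V_old + eps) == V_old + clamp(V - V_old, -eps, eps)
    v_clipped = v_old + torch.clamp(v_pred - v_old, -eps, eps)
    loss_unclipped = (v_pred - v_targ) ** 2
    loss_clipped = (v_clipped - v_targ) ** 2
    return torch.max(loss_unclipped, loss_clipped).mean()
```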

orthogonal initialization and layer scaling

  • use orthogonal initialization for the weights, with a scaling (gain) that varies from layer to layer
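
A sketch of such an initializer in PyTorch; the particular gains shown in the usage comments (\(\sqrt{2}\) for hidden layers, 0.01 for the policy output, 1.0 for the value output) are commonly used values and are an assumption here, not something stated above:

```python
import torch.nn as nn

def ortho_init(layer, gain):
    """Orthogonal weight initialization with a layer-specific gain, zero bias."""
    nn.init.orthogonal_(layer.weight, gain=gain)
    nn.init.constant_(layer.bias, 0.0)
    return layer

# hidden = ortho_init(nn.Linear(64, 64), gain=2 ** 0.5)  # hidden layer
# policy_out = ortho_init(nn.Linear(64, act_dim), gain=0.01)  # policy head
# value_out = ortho_init(nn.Linear(64, 1), gain=1.0)  # value head
```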

adam learning rate annealing

  • anneal the learning rate of the Adam optimizer over the course of training (e.g., linearly down to zero)
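
A sketch of a linear annealing schedule; the linear-to-zero shape and the base rate of 3e-4 are assumptions, since the bullet above only says the rate is annealed:

```python
import torch

def set_linear_lr(optimizer, base_lr, update, total_updates):
    """Linearly decay the learning rate from base_lr to 0 over all updates."""
    frac = 1.0 - update / total_updates
    for group in optimizer.param_groups:
        group["lr"] = base_lr * frac

# usage, assuming `model` is the combined policy/value network:
# optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
# for update in range(total_updates):
#     set_linear_lr(optimizer, 3e-4, update, total_updates)
#     ...one PPO update...
```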

hyperbolic tan activations

  • use hyperbolic tangent (tanh) activations when constructing the policy network and value network
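
A sketch of a small MLP built with tanh activations, usable for both the policy and the value network; the two hidden layers of 64 units are an assumption, and `obs_dim` / `act_dim` are hypothetical placeholders:

```python
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    """Two hidden layers with tanh activations."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )

# policy_net = mlp(obs_dim, act_dim)
# value_net = mlp(obs_dim, 1)
```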

global gradient clipping

  • clip the gradients so that the 'global L2 norm' (the L2 norm of all parameter gradients taken together) does not exceed 0.5
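
In PyTorch this can be done with `torch.nn.utils.clip_grad_norm_`, which rescales all gradients jointly when their combined L2 norm exceeds the threshold; the surrounding update step below is a sketch:

```python
import torch
import torch.nn as nn

def update_step(model, optimizer, loss, max_norm=0.5):
    """One gradient step with global L2-norm gradient clipping."""
    optimizer.zero_grad()
    loss.backward()
    # rescales all gradients together if their joint L2 norm exceeds max_norm
    nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
```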

reward scaling

  • divide the rewards by the standard deviation of a rolling discounted sum of the rewards, rather than feeding the raw rewards into the advantage estimator
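
A sketch of such a reward scaler; the discount factor and the online variance estimator are assumptions:

```python
import numpy as np

class RewardScaler:
    """Maintains a discounted running return and divides each reward by the
    standard deviation of that return."""

    def __init__(self, gamma=0.99, eps=1e-8):
        self.gamma = gamma
        self.eps = eps
        self.ret = 0.0   # rolling discounted sum of rewards
        self.count = 0
        self.mean = 0.0
        self.var = 1.0

    def __call__(self, reward):
        self.ret = self.gamma * self.ret + reward
        # incremental update of the return's mean and variance
        self.count += 1
        delta = self.ret - self.mean
        self.mean += delta / self.count
        self.var += (delta * (self.ret - self.mean) - self.var) / self.count
        return reward / (np.sqrt(self.var) + self.eps)
```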
