Intro
This blog post is my summary after reading Engstrom et al., "Implementation Matters in Deep Policy Gradients: A Case Study on PPO and TRPO".
reward clipping
- clip the rewards to a preset range (usually [-5, 5] or [-10, 10])
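A minimal sketch of this trick in numpy; the clipping range [-5, 5] is one of the common defaults mentioned above:

```python
import numpy as np

def clip_reward(r, lo=-5.0, hi=5.0):
    """Clip a raw environment reward into a preset range before it
    is used to compute returns/advantages."""
    return float(np.clip(r, lo, hi))
```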
observation clipping
- the states are first normalized to mean-zero, unit-variance vectors, then clipped to a fixed range
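A sketch of the normalize-then-clip filter, assuming running statistics are updated online (the Welford-style update and the clip value of 10 are illustrative choices, not taken from the paper's text):

```python
import numpy as np

class RunningObsFilter:
    """Normalize observations with running mean/variance, then clip."""
    def __init__(self, shape, clip=10.0, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = 0
        self.clip = clip
        self.eps = eps

    def __call__(self, obs):
        # update running statistics (Welford-style incremental update)
        self.count += 1
        delta = obs - self.mean
        self.mean = self.mean + delta / self.count
        self.var = self.var + (delta * (obs - self.mean) - self.var) / self.count
        # normalize to roughly zero mean / unit variance, then clip
        norm = (obs - self.mean) / np.sqrt(self.var + self.eps)
        return np.clip(norm, -self.clip, self.clip)
```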
value function clipping
Replace \(L^{V} = (V_{\theta_t} - V_{targ})^{2}\) with \(L^{V} = \min\left[(V_{\theta_t} - V_{targ})^{2},\ \big(\operatorname{clip}(V_{\theta_t},\, V_{\theta_{t-1}}-\epsilon,\, V_{\theta_{t-1}}+\epsilon) - V_{targ}\big)^{2}\right]\)
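A numpy sketch of this loss as written above (taking the min of the two squared errors); note that some widely used implementations, e.g. OpenAI baselines, take the max of the two terms instead:

```python
import numpy as np

def clipped_value_loss(v_new, v_old, v_targ, eps=0.2):
    """Value loss with the clipped term: min of the unclipped squared
    error and the squared error of the value clipped around the old
    prediction (eps=0.2 is an illustrative clip range)."""
    unclipped = (v_new - v_targ) ** 2
    clipped = (np.clip(v_new, v_old - eps, v_old + eps) - v_targ) ** 2
    return float(np.mean(np.minimum(unclipped, clipped)))
```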
orthogonal initialization and layer scaling
use orthogonal initialization with a scaling factor that varies from layer to layer
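A QR-based sketch of orthogonal initialization with a per-layer gain. The specific gains noted in the docstring (sqrt(2) for hidden layers, 0.01 for the policy head, 1.0 for the value head) are the values used in common PPO code bases, not quoted from the paper:

```python
import numpy as np

def orthogonal_init(shape, gain=1.0, rng=None):
    """Orthogonal weight matrix via QR decomposition, scaled by `gain`.
    Common PPO implementations use gain=sqrt(2) for hidden layers,
    0.01 for the policy output layer, and 1.0 for the value output."""
    rng = np.random.default_rng() if rng is None else rng
    a = rng.normal(size=(max(shape), min(shape)))
    q, r = np.linalg.qr(a)
    q = q * np.sign(np.diag(r))          # fix signs for determinism
    q = q.T if shape[0] < shape[1] else q
    return gain * q[:shape[0], :shape[1]]
```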
adam learning rate annealing
anneal the learning rate of Adam over the course of training
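A sketch of a linear-to-zero schedule, which is what common PPO implementations use for this annealing; the base rate 3e-4 is a typical default, not a value from the paper:

```python
def linear_anneal_lr(update, total_updates, lr0=3e-4):
    """Linearly anneal the Adam step size from lr0 to 0 over training."""
    frac = 1.0 - update / total_updates
    return lr0 * frac
```

At each policy update you would recompute this value and write it into the optimizer's learning-rate field.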
hyperbolic tan activations
use hyperbolic tangent (tanh) activations when constructing the policy network and value network
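A minimal numpy forward pass showing tanh hidden activations with a linear output layer; the layer structure here is illustrative, not the exact architecture from the paper:

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """Forward pass of an MLP with tanh hidden activations, as
    typically used for both the policy and the value network."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)       # bounded hidden activations
    return h @ weights[-1] + biases[-1]  # linear output layer
```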
global gradient clipping
clip the gradients such that the global L2 norm (over all parameters jointly) doesn't exceed 0.5
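A numpy sketch of global-norm clipping (the same operation as `torch.nn.utils.clip_grad_norm_`): compute one L2 norm over all gradients together, and rescale every gradient by the same factor if that norm exceeds the threshold:

```python
import numpy as np

def clip_global_norm(grads, max_norm=0.5):
    """Jointly rescale a list of gradient arrays so their global L2
    norm does not exceed max_norm; returns (grads, original norm)."""
    total = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total
```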