Sample-Efficient Deep Reinforcement Learning via Episodic Backward Update


Published: 2019 (NeurIPS 2019)
Key points: This paper proposes the Episodic Backward Update (EBU) algorithm: sample an entire episode from the replay buffer, then update its transitions one by one from the last step backward. This works well in environments with sparse and delayed rewards, since it "allows sparse and delayed rewards to propagate directly through all transitions of the sampled episode."
The authors' observations are:
(1) We have a low chance of sampling a transition with a reward for its sparsity.
(2) There is no point in updating values of one-step transitions with zero rewards if the values of future transitions with nonzero rewards have not been updated yet.
The authors' solutions are:
(1) by sampling transitions in an episodic manner.
(2) by updating the values of transitions in a backward manner.
To break the correlation in the data and mitigate overestimation, the authors introduce a diffusion factor \(\beta\) as a trade-off: it takes "a weighted sum of the new backpropagated value and the pre-existing value estimate."
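Concretely, my reading of the backward sweep is roughly the following (notation and indexing are mine and may not match the paper exactly): for a sampled episode of length \(T\), initialize a temporary table \(\tilde{Q}\) from the target network on the episode's next states, set \(y_T = r_T\), and then for \(k = T-1, \dots, 1\)

\[
\tilde{Q}[A_{k+1}, k] \leftarrow \beta\, y_{k+1} + (1-\beta)\, \tilde{Q}[A_{k+1}, k],
\qquad
y_k = r_k + \gamma \max_a \tilde{Q}[a, k].
\]

With \(\beta = 1\) this becomes a pure backward sweep of the episode; a smaller \(\beta\) keeps more of the pre-existing estimate, which is what is supposed to temper overestimation.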
The full pseudocode of the algorithm is given in the paper.
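For reference, here is a minimal NumPy sketch of how I understand the backward target computation. The function name ebu_targets, the array layout, and the exact indexing are my own assumptions, not the paper's pseudocode:

```python
import numpy as np

def ebu_targets(q_next, actions, rewards, gamma=0.99, beta=0.5):
    """Backward-update targets for one sampled episode (illustrative sketch).

    q_next  : (T, num_actions) target-network Q-values of next states S_2..S_{T+1}
    actions : length-T sequence of actions A_1..A_T taken in the episode
    rewards : length-T sequence of rewards R_1..R_T
    returns : (T,) regression targets y_1..y_T for Q(S_k, A_k)
    """
    T = len(rewards)
    q_tilde = np.array(q_next, dtype=float)    # temporary target table, edited in place
    y = np.zeros(T)
    y[T - 1] = rewards[T - 1]                  # last transition: target is the final reward
    for k in range(T - 2, -1, -1):             # sweep the episode backward
        a_next = actions[k + 1]
        # diffusion: mix the freshly backpropagated target into the stored estimate
        q_tilde[k, a_next] = beta * y[k + 1] + (1.0 - beta) * q_tilde[k, a_next]
        y[k] = rewards[k] + gamma * q_tilde[k].max()
    return y

# toy check: 3-step episode, 2 actions, reward only at the end
print(ebu_targets(np.zeros((3, 2)), actions=[0, 1, 0], rewards=[0.0, 0.0, 1.0], beta=1.0))
# with beta = 1 the final reward reaches every earlier target: [0.9801, 0.99, 1.0]
```

The returned targets would then be regressed by the online network with the usual DQN loss \((y_k - Q(S_k, A_k; \theta))^2\).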

Finally, the authors train multiple learners with different diffusion factors and select one to output actions: "We generate K learner networks with different diffusion factors, and a single actor to output a policy. For each episode, the single actor selects one of the learner networks in a regular sequence." The learners' parameters are synchronized periodically.
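If it helps, a tiny sketch of how such a single actor could rotate through the K learners; RoundRobinActor and the q_values interface are placeholders of my own, not names from the paper or any library:

```python
import itertools
import numpy as np

class RoundRobinActor:
    """Single actor over K learner networks (illustrative sketch only)."""

    def __init__(self, learners):
        # learners: K networks, each trained with its own diffusion factor
        self._cycle = itertools.cycle(learners)   # the "regular sequence"
        self.current = next(self._cycle)

    def start_episode(self):
        self.current = next(self._cycle)          # switch learner once per episode

    def act(self, state):
        # greedy action from the currently selected learner
        return int(np.argmax(self.current.q_values(state)))
```

The periodic parameter synchronization between learners mentioned above is left out of this sketch.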
The results look reasonably effective overall.

Summary: I suspect this sequential backward updating comes with quite a few issues, and the method may rely on one too many tricks. The authors also emphasize that EBU "achieves the same mean and median human normalized performance of DQN by using only 5% and 10% of samples," which feels like a stretch: at the same number of training steps, the improvement on many games is clearly not large.

Question: The diffusion factor does not really seem to break the correlation between the data either; I wonder whether that causes problems.