[ICLR 2019] An Algorithmic Framework for Model-Based Deep Reinforcement Learning with Theoretical Guarantees

Original

2020-04-15 01:30

Paper title: Algorithmic framework for model-based deep reinforcement learning with theoretical guarantees

What problem does it solve?

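The paper proposes an algorithmic framework for model-based reinforcement learning (MBRL) with theoretical guarantees. It designs a meta-algorithm that is guaranteed, in theory, to improve monotonically to a local maximum of the expected reward. Instantiating this framework for MBRL yields the Stochastic Lower Bounds Optimization (SLBO) algorithm. (It likewise assumes that the reward function is known.)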

Background

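Model-free reinforcement learning algorithms have achieved great success, but their sample cost is high. Model-based methods plan and learn on a learned model, and have achieved great success in sample efficiency.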

Our meta-algorithm (Algorithm 1) extends the optimism-in-face-of-uncertainty principle to non-linear dynamical models in a way that requires no explicit uncertainty quantification of the dynamical models.

What method is used?

The model is learned with a multi-step prediction loss using the $\ell_{2}$ norm. The loss is defined as follows:

$\mathcal{L}_{\phi}^{(H)}\left(\left(s_{t: t+h}, a_{t: t+h}\right) ; \phi\right)=\frac{1}{H} \sum_{i=1}^{H}\left\|\left(\hat{s}_{t+i}-\hat{s}_{t+i-1}\right)-\left(s_{t+i}-s_{t+i-1}\right)\right\|_{2}$
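
The loss above can be illustrated with a minimal sketch in plain Python. It assumes that the predictions $\hat{s}$ are produced by rolling the learned model forward from the real state $s_{t}$, feeding each prediction back into the model; this post does not define $\hat{s}$, so that rollout rule is an assumption. The names `multi_step_loss`, `true_step`, and `biased_model` are hypothetical and are not taken from the paper's code.

```python
import math


def l2_norm(vec):
    """Euclidean (l2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vec))


def multi_step_loss(model, states, actions):
    """Average l2 error between predicted and real state differences.

    states:  real states s_t .. s_{t+H}, a list of H + 1 vectors
    actions: actions a_t .. a_{t+H-1}, a list of H vectors
    model:   callable (state, action) -> predicted next state
    """
    horizon = len(actions)
    s_hat = list(states[0])  # assumption: the rollout starts at the real state
    total = 0.0
    for i in range(1, horizon + 1):
        # The model is fed its own previous prediction, not the real state.
        s_hat_next = model(s_hat, actions[i - 1])
        pred_delta = [n - c for n, c in zip(s_hat_next, s_hat)]
        true_delta = [n - c for n, c in zip(states[i], states[i - 1])]
        total += l2_norm([p - q for p, q in zip(pred_delta, true_delta)])
        s_hat = s_hat_next
    return total / horizon


# Toy dynamics for illustration only: the next state is state + action.
def true_step(state, action):
    return [x + u for x, u in zip(state, action)]


# A deliberately wrong model that adds a constant bias of 0.1 per dimension.
def biased_model(state, action):
    return [x + u + 0.1 for x, u in zip(state, action)]


actions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
states = [[0.0, 0.0]]
for a in actions:
    states.append(true_step(states[-1], a))

exact_loss = multi_step_loss(true_step, states, actions)
biased_loss = multi_step_loss(biased_model, states, actions)
```

In this toy setup a perfect model gives a loss of zero, while the biased model's loss equals the norm of its per-step bias (about 0.141), because the loss compares state differences rather than absolute states.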

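Introducing the policy parameters $\theta$, the overall objective (Eq. 6.2 in the paper) is defined as follows, where $\operatorname{sg}(\cdot)$ denotes the stop-gradient operation: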

$\max _{\phi, \theta} V^{\pi_{\theta}, \operatorname{sg}\left(\widehat{M}_{\phi}\right)}-\lambda \underset{\left(s_{t: t+h}, a_{t: t+h}\right) \sim \pi_{k}, M^{\star}}{\mathbb{E}}\left[\mathcal{L}_{\phi}^{(H)}\left(\left(s_{t: t+h}, a_{t: t+h}\right) ; \phi\right)\right]$

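The original paper also contains extensive theoretical derivations. I will revisit them when my research calls for it; interested readers can take a look.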

What results were achieved?

Publication and author information?

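A paper from ICLR 2019. The author is a third-year PhD student in the Department of Computer Science at Princeton University, advised by Sanjeev Arora, and previously studied in the Yao Class at Tsinghua University. The author's research focuses on machine learning, especially reinforcement learning algorithms.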

Reference links

Sanjeev Arora works mainly on theoretical convergence analysis in machine learning.

Sanjeev Arora's homepage: https://www.cs.princeton.edu/~arora/
Code: https://github.com/roosephu/slbo

Further reading

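Let $V^{\pi}$ be the value function under the real environment and $\widehat{V}^{\pi}$ the value function under the estimated model. The paper constructs a provable upper bound $D^{\pi,\widehat{M}}$ on the value-estimation error between the estimated and the real dynamical model. Relative to the true value function, $D^{\pi,\widehat{M}}$ leads to the lower bound: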

$V^{\pi} \geq \widehat{V}^{\pi}-D^{\pi, \widehat{M}}$

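The algorithm first collects data by interacting with the environment, builds the lower bound above, and then maximizes it over both the dynamical model $\widehat{M}$ and the policy $\pi$. The lower bound can be optimized with any RL algorithm, because it is optimized using sample trajectories from a fixed reference policy rather than through an interactive policy-iteration process.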
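
Why maximizing the lower bound gives monotone improvement can be sketched in one chain of inequalities. The sketch rests on assumptions that go beyond what this post states: the discrepancy bound is built from data of the reference policy $\pi_{k}$ (written $D_{\pi_{k}}^{\pi, \widehat{M}}$ here, a notation introduced only for this sketch), the lower bound holds for the policies being compared, the true dynamics $M^{\star}$ lies in the model class, and the bound vanishes when the model equals $M^{\star}$. If $\left(\pi_{k+1}, M_{k+1}\right)$ maximizes $V^{\pi, \widehat{M}}-D_{\pi_{k}}^{\pi, \widehat{M}}$, then:

$V^{\pi_{k+1}, M^{\star}} \geq V^{\pi_{k+1}, M_{k+1}}-D_{\pi_{k}}^{\pi_{k+1}, M_{k+1}} \geq V^{\pi_{k}, M^{\star}}-D_{\pi_{k}}^{\pi_{k}, M^{\star}}=V^{\pi_{k}, M^{\star}}$

The first inequality is the lower bound itself. The second holds because $\left(\pi_{k+1}, M_{k+1}\right)$ is a maximizer while $\left(\pi_{k}, M^{\star}\right)$ is one feasible candidate. The final equality uses the assumption that the discrepancy is zero for the true model.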

The value function is defined as follows:

$V^{\pi, M}(s)=\underset{\forall t \geq 0, A_{t} \sim \pi\left(\cdot | S_{t}\right) ,S_{t+1} \sim M(\cdot|S_{t},A_{t})}{\mathbb{E}}\left[\sum_{t=0}^{\infty} \gamma^{t} R\left(S_{t}, A_{t}\right) | S_{0}=s\right]$
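
The definition above can be approximated by Monte Carlo rollouts. The following is a minimal sketch in plain Python; it truncates the infinite sum at a finite horizon, which is an approximation not discussed in this post. The helper names (`discounted_return`, `estimate_value`) and the toy policy, model, and reward are hypothetical and chosen only for illustration.

```python
import random


def discounted_return(rewards, gamma):
    """Sum of gamma^t * r_t, accumulated backwards over the reward list."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_value(policy, model, reward_fn, s0, gamma, horizon, n_rollouts, rng):
    """Monte Carlo estimate of V^{pi, M}(s0) with the sum truncated at `horizon`."""
    returns = []
    for _ in range(n_rollouts):
        s = s0
        rewards = []
        for _ in range(horizon):
            a = policy(s, rng)             # A_t ~ pi(. | S_t)
            rewards.append(reward_fn(s, a))
            s = model(s, a, rng)           # S_{t+1} ~ M(. | S_t, A_t)
        returns.append(discounted_return(rewards, gamma))
    return sum(returns) / n_rollouts


# Toy problem for illustration only: a random walk with a constant reward of 1.
def toy_policy(state, rng):
    return rng.choice([-1, 1])


def toy_model(state, action, rng):
    return state + action


def toy_reward(state, action):
    return 1.0


rng = random.Random(0)
v = estimate_value(toy_policy, toy_model, toy_reward,
                   s0=0, gamma=0.9, horizon=50, n_rollouts=20, rng=rng)
```

With a constant reward of 1, every rollout has the same return, so the estimate matches the closed form $\left(1-\gamma^{T}\right) /(1-\gamma)$ for horizon $T$, which makes the toy case easy to check by hand.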

To be continued...
