Comparison of DRL Policies to Formal Methods for Moving Obstacle Avoidance

Abstract

Deep Reinforcement Learning (RL) has recently emerged as a solution for moving obstacle avoidance. Deep RL learns to simultaneously predict obstacle motions and corresponding avoidance actions directly from robot sensors, even for obstacles with different dynamics models.
However, deep RL methods typically cannot guarantee policy convergence, i.e., they cannot provide probabilistic collision avoidance guarantees. In contrast, stochastic reachability (SR), a computationally expensive formal method that employs a known obstacle dynamics model, identifies the optimal avoidance policy and provides strict convergence guarantees. The availability of the optimal solution for versions of the moving obstacle problem provides a baseline against which trained deep RL policies can be compared. In this paper, we compare the expected cumulative reward and actions of these policies to SR, and find the following. 1) The state-value function approximates the optimal collision probability well, which explains the high empirical performance. 2) RL policies deviate from the optimal policy significantly, thus negatively impacting collision avoidance in some cases. 3) Evidence suggests that the deviation is caused, at least partially, by the actor net failing to approximate the action corresponding to the highest state-action value.

Intuitive takeaway (the critic gives the right evaluation, but the actor does not execute it correctly): for the obstacle avoidance problem, this paper compares DRL with SR to examine how the DRL algorithm operates throughout the task. The conclusion is that in actor-critic (AC) based DRL algorithms, the critic net characterizes the collision probability fairly accurately, and the observed performance gap mainly comes from the actor net failing to approximate the action corresponding to the highest state-action value.

Main contributions

(1) We present a comprehensive comparison between a deep RL algorithm and a formal method (SR) for dynamic obstacle avoidance.
(2) We also identify potential points of failure of RL policies, which provides insight into where additional safety policies might be required.

Results

  1. End-to-end deep RL obstacle avoidance policies achieve up to 15% higher success rates than a state-of-the-art multi-obstacle collision avoidance method, APF-SR.
  2. We observe evolving changes in the behavior of RL policies during training, and this was consistent across environments with deterministic and stochastic obstacle motions.
  3. The state-value function stored in the critic net approximates the optimal collision probability reasonably well. This explains why RL policies perform well empirically compared to traditional methods (the critic's evaluation is close to ideal).
  4. However, the RL policy stored in the actor net deviates from the optimal policy significantly and thus negatively impacts the true policy collision probability (the actor does not fully act on what the critic has learned).
  5. Lastly, strong evidence suggests that the deviation from the optimal policy is caused by the actor net failing to approximate the action corresponding to the highest state-action value (supplementary comparison experiments were run to verify point 4).

Preliminaries

  1. Robot and obstacle dynamics
  2. SR analysis
    Collision probability computation and avoidance controller design based on a traditional, explicitly modeled mathematical formulation of the dynamics (see the reachability recursion sketched after this list).
  3. Deep RL
    Gives the basic definitions of deep reinforcement learning: the goal is to find an optimal policy $\pi^{*}$ that maps observations to actions so as to maximize the expected discounted cumulative reward. The mainstream DRL framework used here, A3C, employs two neural networks, an actor and a critic.
    The actor net learns the policy via policy gradients, i.e., its parameters are updated along the direction computed from the critic net's value function (to maximize the expected discounted cumulative reward). Meanwhile, the critic net updates its parameters according to the Bellman equation, similarly to Q-learning. Compared with plain actor-critic (AC), A3C speeds up learning by collecting experience asynchronously (a minimal update sketch also follows this list).
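
As context for the SR analysis above, the display below is a minimal sketch of a standard finite-horizon stochastic-reachability (safety) recursion; the symbols (safe set $A$, horizon $N$, control set $\mathcal{U}$, transition kernel $\tau$) are generic placeholders, and the paper's exact formulation may differ.

$$
V_N^{*}(x) = \mathbf{1}_A(x), \qquad
V_k^{*}(x) = \mathbf{1}_A(x)\,\max_{u \in \mathcal{U}} \int_A V_{k+1}^{*}(y)\,\tau(\mathrm{d}y \mid x, u), \quad k = N-1, \dots, 0.
$$

Under this formulation, $V_0^{*}(x)$ is the maximal probability of remaining collision-free over the horizon starting from state $x$, so $1 - V_0^{*}(x)$ is the optimal collision probability, and the maximizing $u$ at each step defines the optimal avoidance policy.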
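The snippet below is a minimal sketch of the advantage actor-critic update at the core of A3C, assuming a discrete action space; the network sizes, hyperparameters, and toy transition data are illustrative placeholders rather than the paper's setup, and A3C's asynchronous workers, n-step returns, and entropy bonus are omitted.

```python
# Minimal advantage actor-critic update (the synchronous core of A3C).
import torch
import torch.nn as nn

OBS_DIM, N_ACTIONS, GAMMA = 8, 5, 0.99  # placeholder dimensions

actor = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
critic = nn.Sequential(nn.Linear(OBS_DIM, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)

def update(obs, action, reward, next_obs, done):
    """Critic regresses the Bellman target; actor ascends the policy gradient
    weighted by the TD advantage derived from the critic."""
    v = critic(obs).squeeze(-1)
    with torch.no_grad():
        target = reward + GAMMA * (1.0 - done) * critic(next_obs).squeeze(-1)
    advantage = target - v.detach()

    critic_loss = (target - v).pow(2).mean()                      # Bellman / TD error
    log_prob = torch.distributions.Categorical(logits=actor(obs)).log_prob(action)
    actor_loss = -(log_prob * advantage).mean()                   # policy gradient

    opt.zero_grad()
    (actor_loss + critic_loss).backward()
    opt.step()

# Toy batch of transitions, just to show the call signature.
obs, next_obs = torch.randn(4, OBS_DIM), torch.randn(4, OBS_DIM)
update(obs, torch.randint(N_ACTIONS, (4,)), torch.randn(4), next_obs, torch.zeros(4))
```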

Evaluation

  1. Policy selection and evaluation
    The dashed line is the APF-SR baseline and the solid lines are the DRL results. Under the same obstacle motion model, the DRL method consistently outperforms APF-SR.
  2. Critic comparison
    Panels a-d show how the critic evolves during DRL training while learning to avoid an obstacle moving from left to right. In panel a the algorithm has just been initialized, so $V_{RL}$ is essentially random. In panel b the robot is learning to approach the obstacle, which yields a high collision probability and a large mean-squared error. Panel b serves as a stepping stone for learning to avoid the obstacle (panel c), but at that stage the policy still ignores the obstacle's motion. Finally, at convergence (panel d), the robot has learned to account for the obstacle's motion and achieves a small mean-squared error (a sketch of this mean-squared-error comparison appears after this list).
  3. Actor comparison
    After comparing the collision probabilities given by the critic net and by SR, we compare the DRL policy (actor net) against the optimal policy given by SR.
    The figure shows that the DRL policy suggests some actions that cut through the obstacle and others that move toward it. Regarding the deviation between the two sets of action choices, we answer the following two questions: (1) how do these deviations affect obstacle avoidance performance; and (2) what causes the deviations.
  4. Collision probability comparison
  5. Causes for sub-optimality
    The authors first state a hypothesis: the actor net failed to approximate the action corresponding to the highest state-action value inferred from the critic net.
    To verify this hypothesis, the authors bypass the actor net and directly use a new RL policy derived from the critic net (a sketch of such a critic-derived policy appears after this list).
    The third plot shows the results given by the critic policy, which are more similar to the SR results in the first plot. The explanation for this can be summarized as follows: actor-critic algorithms such as A3C use the critic's value function instead of empirical returns to estimate the cumulative rewards, which helps lower the variance of the gradient estimates but at the cost of introducing bias.
    Panels (a, c) and (b, d) compare the RL actions given by the actor net and by the critic net, respectively; the latter handles obstacle avoidance noticeably better. This also confirms the earlier conjecture that the actor net does not fully execute the actions that maximize the state value according to the critic net, i.e., the actor net's policy $\pi$ is not trained particularly well.
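
As a companion to the critic comparison above (item 2), the snippet below is a hypothetical sketch of the kind of grid-based comparison implied by the mean-squared-error figures: it assumes the critic's state values have already been mapped to estimated collision probabilities on the same state grid as the SR solution, and both arrays here are random placeholders.

```python
# Hypothetical comparison of a critic-derived collision-probability surface
# against the SR-computed optimal collision probability on a shared state grid.
import numpy as np

def value_mse(critic_collision_prob: np.ndarray, sr_collision_prob: np.ndarray) -> float:
    """Mean-squared error between the two probability surfaces (same grid shape)."""
    assert critic_collision_prob.shape == sr_collision_prob.shape
    return float(np.mean((critic_collision_prob - sr_collision_prob) ** 2))

# Placeholder 50x50 grids over robot positions for one fixed obstacle state.
rng = np.random.default_rng(0)
critic_grid = rng.uniform(0.0, 1.0, size=(50, 50))
sr_grid = rng.uniform(0.0, 1.0, size=(50, 50))
print(f"MSE between critic and SR collision probabilities: {value_mse(critic_grid, sr_grid):.4f}")
```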
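The snippet below is a hedged sketch of a critic-derived policy of the kind described in item 5: it bypasses the actor net by rolling each candidate action forward one step with an assumed transition model and picking the action with the highest one-step lookahead value under the critic. `transition`, `reward_fn`, the candidate action set, and the toy critic are all illustrative assumptions, not the paper's actual construction.

```python
# Hypothetical "critic policy": choose the action maximizing r(s, a) + gamma * V(s').
import torch

GAMMA = 0.99

def critic_greedy_action(critic, obs, candidate_actions, transition, reward_fn):
    """Return the candidate action with the highest one-step lookahead value."""
    best_action, best_value = None, -float("inf")
    for a in candidate_actions:
        next_obs = transition(obs, a)                 # assumed one-step dynamics model
        with torch.no_grad():
            value = reward_fn(obs, a) + GAMMA * critic(next_obs).item()
        if value > best_value:
            best_action, best_value = a, value
    return best_action

# Toy usage: a random critic over a 2-D position and four unit-direction actions.
critic = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
obs = torch.tensor([0.5, -0.2])
actions = [torch.tensor([1.0, 0.0]), torch.tensor([0.0, 1.0]),
           torch.tensor([-1.0, 0.0]), torch.tensor([0.0, -1.0])]
best = critic_greedy_action(critic, obs, actions,
                            transition=lambda s, a: s + 0.1 * a,   # placeholder dynamics
                            reward_fn=lambda s, a: 0.0)            # placeholder reward
print(best)
```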