Virtual-to-real DRL: Continuous Control of Mobile Robots for Mapless Navigation

Abstract

We present a learning-based mapless motion planner that takes the sparse 10-dimensional range findings and the target position, expressed in the mobile robot's coordinate frame, as input and outputs continuous steering commands.
We show that, through an asynchronous deep reinforcement learning method, a mapless motion planner can be trained end-to-end without any manually designed features or prior demonstrations.

Introduction

  1. Deep reinforcement learning in mobile robots: Applications of deep reinforcement learning in robotics have mostly been limited to manipulation, where the workspace is fully observable and stable. For mobile robots, complicated environments enlarge the sample space enormously, so deep-RL methods usually sample actions from a discrete space to simplify the problem. This paper therefore focuses on the navigation problem of nonholonomic mobile robots with continuous deep-RL control, an essential capability for the most widely used class of robots.
  2. Mapless Navigation:
    For mobile nonholonomic ground robots, traditional methods such as simultaneous localization and mapping (SLAM) handle this problem through a prior obstacle map of the navigation environment built from dense laser range findings, estimating the robot's own pose and constructing an obstacle map for path planning. Two issues are rarely addressed in this pipeline: (1) the time-consuming building and updating of the obstacle map, and (2) the heavy dependence on a precise, dense laser sensor for mapping and for local costmap prediction.
  3. From virtual to real: The huge gap between the structured simulation environment and the highly complicated real-world environment is the central challenge in transferring a trained model directly to a real robot. In this paper only 10-dimensional sparse range findings are used as the observation input. This highly abstracted observation is sampled from specific angles of the raw laser scan according to a trivial distribution, which brings two advantages: it narrows the gap between the virtual and real environments, and it opens the way to low-cost range sensors that provide distance information from only 10 directions (a minimal downsampling sketch follows this list).
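
The abstraction of the raw scan into 10 readings is easy to reproduce. Below is a minimal NumPy sketch that downsamples a dense front-facing scan into 10 evenly spaced beams; the evenly spaced sampling, the 3.5 m clipping range, and the normalisation to [0, 1] are illustrative assumptions, not values taken from the paper.

```python
import numpy as np

def sparse_scan(ranges, n_beams=10, max_range=3.5):
    """Downsample a dense laser scan to n_beams readings.

    ranges    : 1-D array of raw range readings covering the front field
                of view (assumed here to span [-pi/2, pi/2]).
    max_range : clipping/normalisation range; 3.5 m is an assumption.
    """
    ranges = np.nan_to_num(np.asarray(ranges, dtype=np.float32),
                           nan=max_range, posinf=max_range)
    # pick n_beams evenly spaced indices across the scan
    idx = np.linspace(0, len(ranges) - 1, n_beams).astype(int)
    sparse = np.clip(ranges[idx], 0.0, max_range)
    return sparse / max_range  # normalise to [0, 1] for the network input

# usage: x_t = sparse_scan(raw_scan), with raw_scan from the simulator or sensor
```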

Related Work

  1. Deep-learning-based navigation: For learning-based obstacle avoidance, deep neural networks have been successfully applied to monocular images and depth images. Chen et al. used semantic information extracted from images by deep neural networks to decide the behavior of an autonomous vehicle. However, their control commands are simple discrete actions such as turn left and turn right, which may lead to rough navigation behavior. Regarding learning from demonstrations, Pfeiffer et al. [14] used a deep learning model to map the laser range findings and the target position to moving commands, and Kretzschmar et al. [15] used inverse reinforcement learning to make robots interact with humans in a socially compliant way. Such trained models are highly dependent on the demonstration data, and a time-consuming data collection procedure is inevitable.
  2. Deep reinforcement learning: DQN is the representative work of deep reinforcement learning, and many robot navigation tasks have been built on it, but the original DQN only applies to discrete action spaces. To extend it to continuous control, Lillicrap et al. [2] proposed deep deterministic policy gradients (DDPG), which applies deep neural networks to the actor-critic reinforcement learning framework, representing both the policy and the value function with hierarchical networks. The actor selects actions according to the currently learned policy, which suits continuous control; the critic evaluates the policy with a value function, and the evaluation is used to improve the actor's policy.
    Regarding the training efficiency of deep-RL: asynchronous deep-RL with multiple sample-collection threads working in parallel can significantly improve the training efficiency of a specific policy. Mnih et al. optimized deep-RL with asynchronous gradient descent from parallel on-policy actor-learners (A3C is a typical application of asynchronous deep-RL).
    Thus, we choose DDPG as our training algorithm: compared with NAF, DDPG needs fewer training parameters. We further extend DDPG to an asynchronous version to improve the sampling efficiency (a minimal sketch of the DDPG update step follows this list).
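
As a reference for the DDPG procedure described above, the following is a minimal single-threaded sketch of one update step in PyTorch; the discount factor, soft-update rate tau, and optimizer handling are illustrative assumptions, not the paper's settings. The asynchronous version would run several such sample-collection/update loops in parallel threads that share the networks.

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_targ, critic_targ,
                actor_opt, critic_opt, gamma=0.99, tau=0.001):
    """One DDPG update on a mini-batch (s, a, r, s2, done) of transitions."""
    s, a, r, s2, done = batch

    # Critic: regress Q(s, a) towards the one-step TD target computed
    # with the target networks.
    with torch.no_grad():
        q_next = critic_targ(s2, actor_targ(s2))
        y = r + gamma * (1.0 - done) * q_next
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend the critic's estimate of Q(s, pi(s)).
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft-update the target networks towards the online networks.
    with torch.no_grad():
        for net, targ in ((actor, actor_targ), (critic, critic_targ)):
            for p, p_targ in zip(net.parameters(), targ.parameters()):
                p_targ.mul_(1.0 - tau).add_(tau * p)
```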

Motion planner implementation

  1. Asynchronous DRL
    Comparison of the training efficiency with and without the asynchronous sampling scheme.
  2. Problem Definition
    This paper aims to provide a mapless motion planner for mobile ground robots, $v_t = f(x_t, p_t, v_{t-1})$, where $x_t$ is the observation from the raw sensor information, $p_t$ is the relative position of the target, and $v_{t-1}$ is the velocity of the mobile robot in the last time step. Together they can be regarded as the instantaneous state $s_t$ of the mobile robot. The model directly maps the state to the action, which is the velocity command for the next time step, $v_t$.
  3. Network Structure
    The problem can be naturally transferred to a reinforcement learning problem. In this paper, we use the extended asynchronous DDPG described above.
    [Figure: structure of the actor and critic networks]
    The state input is a 14-dimensional vector: the 10 sparse laser readings sampled over $[-\frac{\pi}{2}, \frac{\pi}{2}]$, together with the relative target position and the previous velocity command. In the actor network, after 3 fully-connected layers with 512 nodes each, the input vector is transformed into the linear and angular velocity commands of the mobile robot.
    For the critic network, the Q-value of the state-action pair is predicted. 3 fully-connected layers with 512 nodes are again used to process the input state, and the action is merged in at the second fully-connected layer. The Q-value is finally produced through a linear activation, $y = kx + b$, where $x$ is the input of the last layer, $y$ is the predicted Q-value, and $k$ and $b$ are the trained weights and bias of that layer. A minimal sketch of both networks is given after this list.
  4. Reward function definition
    [Figure: reward function definition (original figure not reproduced); a hedged sketch of a typical reward of this form is given below.]
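
Based on the structure described in item 3 (a 14-dimensional state, three 512-unit fully-connected layers, the action merged into the critic's second layer, and a linear Q output), a minimal PyTorch sketch of the two networks might look as follows. The hidden-layer activations, the sigmoid/tanh output scaling, and the velocity bounds v_max and w_max are assumptions for illustration, not the paper's exact choices.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the 14-D state to (linear velocity, angular velocity)."""
    def __init__(self, state_dim=14, hidden=512, v_max=0.5, w_max=1.0):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.v_head = nn.Linear(hidden, 1)   # forward velocity, non-negative
        self.w_head = nn.Linear(hidden, 1)   # angular velocity, symmetric
        self.v_max, self.w_max = v_max, w_max

    def forward(self, s):
        h = self.body(s)
        v = torch.sigmoid(self.v_head(h)) * self.v_max
        w = torch.tanh(self.w_head(h)) * self.w_max
        return torch.cat([v, w], dim=-1)

class Critic(nn.Module):
    """Predicts Q(s, a); the action is merged in at the second layer."""
    def __init__(self, state_dim=14, action_dim=2, hidden=512):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden)
        self.fc2 = nn.Linear(hidden + action_dim, hidden)
        self.fc3 = nn.Linear(hidden, hidden)
        self.q = nn.Linear(hidden, 1)        # linear activation: y = kx + b

    def forward(self, s, a):
        h = torch.relu(self.fc1(s))
        h = torch.relu(self.fc2(torch.cat([h, a], dim=-1)))
        h = torch.relu(self.fc3(h))
        return self.q(h)
```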
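The reward itself is defined only in the figure above, which is not reproduced here. The sketch below shows one common shaping scheme for this kind of goal-reaching task: a positive terminal reward on arrival, a negative terminal reward on collision, and a dense term proportional to the progress made towards the goal. The thresholds and coefficients are illustrative assumptions, not the paper's values.

```python
def reward(dist_to_goal, prev_dist_to_goal, collided,
           goal_radius=0.2, r_arrive=10.0, r_collide=-10.0, c=2.0):
    """Hypothetical reward for the mapless-navigation task.

    dist_to_goal / prev_dist_to_goal : distance to the target at the current
        and previous time step; `collided` is True on contact with an obstacle.
    All constants are illustrative, not taken from the paper.
    """
    if dist_to_goal < goal_radius:        # reached the target
        return r_arrive
    if collided:                          # hit an obstacle
        return r_collide
    # dense shaping: reward the progress made towards the goal this step
    return c * (prev_dist_to_goal - dist_to_goal)
```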

Experiments

The baseline is a SLAM-based navigation method, implemented with the ROS move_base package: it performs self-localization and obstacle-map building from the laser scan, with A* as the global path planner and the dynamic window approach (DWA) as the local planner.
[Figure: navigation results of move_base with dense and sparse laser input, and of the DRL planner in two environments]
For a fair comparison, the traditional method is also fed the same sparse 10-dimensional laser scan as the DRL method. Panel (a) shows move_base with the full laser scan, i.e. the conventional traditional setup; panel (b) shows the traditional method with the sparse scan, where black marks the positions at which navigation failed; panels (c) and (d) show the DRL results in two different environments.
The results show that, with sparse input, the traditional method has a lower success rate than the DRL method. (The experiment does not consider the generalization ability of DRL; based on my own experience with drones, traditional methods generalize better when there are more kinds of obstacles or the environment changes, and a DRL policy in particular struggles to make good decisions in scenarios it has never been trained on.)
The experiments also analyze the max control frequency, which reflects the query efficiency of the motion planner. In the results, although the DRL method travels a longer distance than move_base, the travel time is almost the same, which indirectly shows that the DRL planner produces commands faster during planning.

Discussion

  1. In the real-world tests, the policy trained in Env2 performed better than the one trained in Env1, because the obstacles in Env2 are more complex and densely placed.
  2. The motion trajectories generated by the proposed DRL method are more tortuous than those of the traditional move_base baseline. One possible explanation is that the trained network has no long-term prediction ability; introducing recurrent structures such as RNNs or LSTMs could address this.
  3. The goal of this paper is not to replace traditional methods: in large-scale, highly complex environments, a map of the environment can provide more information for navigation. The aim is instead a low-cost indoor navigation solution for robots equipped with cheap, low-precision sensors.