Self-supervised DRL with Generalized Computation Graphs for Robot Navigation

Abstract

Learning-based methods improve as the robot acts in the environment, but they are difficult to deploy in the real world due to their high sample complexity. To address the need to learn complex policies with few samples, we propose a generalized computation graph that subsumes value-based model-free methods and model-based methods, with specific instantiations interpolating between the two.
We then instantiate this graph to form a navigation model that learns from raw images and is sample efficient. Our simulated car experiments explore the design decisions of our navigation model, and show that our approach outperforms single-step and N-step double Q-learning.

Introduction

Not only can learning-based systems lift some of the assumptions of geometric reconstruction methods, but they offer two major advantages that are not present in analytic approaches: (1) learning-based methods adapt to the statistics of the environments in which they are trained, and (2) learning-based systems can learn from their mistakes. The first advantage means that a learning-based navigation system may be able to act more intelligently even under partial observation by exploiting its knowledge of statistical patterns. The second advantage means that, when a learning-based system does make a mistake that results in a failure, the resulting data can be used to improve the system and prevent such a failure from occurring in the future. This second advantage, which is the principal focus of this work, is closely associated with reinforcement learning: algorithms that learn from trial-and-error experience.
Reinforcement learning methods are typically classified as either model-free or model-based. Value-based model-free approaches learn a function that takes as input a state and action, and outputs the value (i.e., the expected sum of future rewards).
Policy extraction is then performed by selecting the action that maximizes the value function. Model-based approaches learn a predictive function that takes as input a state and a sequence of actions, and outputs the predicted future states (a model-based method can predict future states directly, e.g., from known or learned dynamics). Policy extraction is then performed by selecting the action sequence that maximizes the future rewards computed from the predicted future states. In general, model-free algorithms can learn complex tasks but are usually sample-inefficient, while model-based algorithms are typically sample-efficient but have difficulty scaling to complex, high-dimensional tasks.
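To make the two policy-extraction schemes concrete, here is a minimal sketch assuming a small discrete action set; `q_function`, `dynamics_model`, and `reward_fn` are hypothetical callables for illustration, not part of the paper's code.

```python
# Minimal sketch contrasting value-based model-free and model-based
# policy extraction.  All callables are illustrative placeholders.
import itertools
import numpy as np

def model_free_action(q_function, state, actions):
    """Value-based model-free: pick the action with the highest Q-value."""
    return max(actions, key=lambda a: q_function(state, a))

def model_based_action(dynamics_model, reward_fn, state, actions, horizon=5):
    """Model-based: roll out the learned model and pick the best action sequence."""
    best_seq, best_return = None, -np.inf
    for seq in itertools.product(actions, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            total += reward_fn(s, a)
            s = dynamics_model(s, a)   # predicted next state
        if total > best_return:
            best_seq, best_return = seq, total
    return best_seq[0]                 # execute the first action (MPC-style)
```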
We explore the intersection between value-based model-free algorithms and model-based algorithms in the context of learning robot navigation policies.
Our three contributions are:

  1. A generalized computation graph for reinforcement learning that subsumes value-based model-free methods and model-based methods.
  2. Instantiations of the generalized computation graph for the task of robot navigation, resulting in a suite of hybrid model-free/model-based algorithms.
  3. An extensive empirical evaluation of the proposed methods.

Related work

Traditional navigation approaches rely on SLAM, but they remain limited by factors such as vehicle size and onboard computational resources; this motivates learning-based (deep learning) methods.
Learning-based methods have attempted to address these limitations by learning from data. These supervised learning methods include learning: drivable routes and then using a planner [13], near-to-far obstacle detectors [14], reactive controllers on top of a map-based planner [15], driving affordances [16], and end-to-end driving from demonstrations [17], [18]. However, the capabilities of powerful and expressive models like deep neural networks are often constrained in large part by the available data, and methods based on human supervision are inherently limited by the amount of human data available.
The deep reinforcement learning approach proposed here is self-supervised and learns directly from the real environment; in principle, such a system can keep learning and improving autonomously over its entire lifetime.
While these methods have been used to learn robot navigation policies, they often require simulation experience [19], [20]. In contrast, our approach learns from scratch to navigate using monocular images, entirely in the real world.

Preliminaries

Our goal is to learn collision avoidance policies for mobile robots. We formalize this task as a reinforcement learning problem, where the robot is rewarded for collision-free navigation.
In reinforcement learning, the goal is to learn a policy that chooses actions $a_t \in A$ at each time step $t$ in response to the current state $s_t \in S$, such that the total expected sum of discounted rewards is maximized over all time.
At each time step, the system transitions from $s_t$ to $s_{t+1}$ in response to the chosen action $a_t$ (the action is given by the policy; in a value-based method it is the action that maximizes the learned Q-values) and the transition probability $T(s_{t+1} \mid s_t, a_t)$, collecting a reward $r_t$ according to the reward function $R(s_t, a_t)$.
The expected sum of discounted rewards is then defined as $E_{\pi,T}\left[\sum_{t'=t}^{\infty} \gamma^{t'-t} r_{t'} \mid s_t, a_t\right]$, where $\gamma \in [0,1]$ is a discount factor that prioritizes near-term rewards over distant rewards, and the expectation is taken under the transition function $T$ and a policy $\pi$.
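As a small worked illustration of the discounted-return objective above, the following sketch computes $\sum_{t'} \gamma^{t'-t} r_{t'}$ for a finite reward sequence (the function name and example values are mine, not from the paper).

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Compute sum_{k} gamma^k * r_{t+k} for a finite reward sequence."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Example: a constant reward of 1 for 10 steps with gamma = 0.9
print(discounted_return(np.ones(10), gamma=0.9))  # ~6.51
```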

Value-based model-free reinforcement learning

Value-based model-free algorithms learn a value function in order to select which actions to take.
The standard parametric Q-function, $Q_\theta(s,a)$, is a function of the current state and a single action, and outputs the expected discounted sum of future rewards that will be received by the optimal policy after taking action $a$ in state $s$, where $\theta$ denotes the function parameters.
A standard method for learning the Q-function is to minimize the Bellman error, given by
$$\varepsilon_t(\theta) = \frac{1}{2}\, E_{s,a}\left[ \left\| r_t + \gamma V_{t+1} - Q_\theta(s_t, a_t) \right\|^2 \right],$$
where the actions are sampled from $\pi(\cdot \mid s)$ and the $V_{t+1}$ term is known as the bootstrap.
Defining the N-step value as $V_t^{(N)} = \sum_{n=0}^{N-1} \gamma^n r_{t+n} + \gamma^N V_{t+N}$, we augment the standard Bellman error minimization objective by considering a weighted combination of Bellman errors from horizon length $1$ to $N$:
$$\varepsilon_t(\theta) = \frac{1}{2}\, E_{s,a}\left[ \left\| \sum_{N'=1}^{N} \omega_{N'} V_t^{(N')} - Q_\theta(s_t, a_t) \right\|^2 \right], \qquad \text{subject to } \sum_{N'=1}^{N} \omega_{N'} = 1.$$
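The following sketch computes the weighted multi-horizon target $\sum_{N'} \omega_{N'} V_t^{(N')}$ from the equation above; the Bellman error then compares this target against $Q_\theta(s_t, a_t)$. The function name and array conventions are my own illustration.

```python
# Weighted multi-horizon Bellman target, assuming `rewards` holds
# r_t ... r_{t+N-1} and `values[n]` approximates V_{t+n} (indices 1..N used).
import numpy as np

def weighted_nstep_target(rewards, values, weights, gamma=0.99):
    """Compute sum_{N'=1}^{N} w_{N'} * V_t^{(N')} with sum(weights) == 1."""
    N = len(rewards)
    assert len(weights) == N and np.isclose(sum(weights), 1.0)
    target = 0.0
    for n_prime in range(1, N + 1):
        n_step = sum(gamma**n * rewards[n] for n in range(n_prime))
        n_step += gamma**n_prime * values[n_prime]   # bootstrap V_{t+N'}
        target += weights[n_prime - 1] * n_step
    return target
```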

Comparing model-free and model-based methods

We compare model-free and model-based methods along three axes: sample efficiency, stability, and final performance.
Model-free techniques are often sample inefficient. Specifically, for (N-step) Q-learning, bias from bootstrapping and high-variance multi-step returns can lead to slow convergence. Furthermore, Q-learning often requires experience replay buffers and target networks for stable learning, which further decreases sample efficiency.
In contrast, model-based methods can be very sample efficient and stable, since learning the transition model reduces to supervised learning of dense time-series data [8].
However, final performance can be poor, because maximizing the accuracy of the transition model is only a surrogate objective and does not guarantee that the resulting policy will perform well.

A generalized computation graph for RL

[Figure: the generalized computation graph $G_\theta(s_t, A_t^H)$]
The computation graph $G_\theta(s_t, A_t^H)$, parameterized by a vector $\theta$, takes as input the current state $s_t$ and a sequence of $H$ actions $A_t^H = (a_t, \dots, a_{t+H-1})$, and produces $H$ sequential predicted outputs $\hat{Y}_t^H = (\hat{y}_t, \dots, \hat{y}_{t+H-1})$ together with a predicted terminal output $\hat{b}_{t+H}$. These predicted outputs $\hat{Y}_t^H$ and $\hat{b}_{t+H}$ are combined and compared with the labels $Y_t^H$ and $b_{t+H}$ to form an error signal $\varepsilon_t(\theta)$ that is minimized using an optimizer.
We first instantiate the computation graph for N-step Q-learning by letting $y$ be the reward and $b$ be the future value estimate; setting the model horizon $H=1$ and using N-step returns; and letting the error function be the Bellman error: $$\varepsilon_t(\theta) = \left\| \left( \hat{y}_t + \gamma \hat{b}_{t+1} \right) - \left( \sum_{n=0}^{N-1} \gamma^n y_{t+n} + \gamma^N b_{t+N} \right) \right\|_2^2.$$
We define $J(s_t, A_t^H)$ to be the generalized policy evaluation function, a scalar function such that $\pi(A^H \mid s_t) = \arg\max_{A^H} J(s_t, A^H)$. For N-step Q-learning, $J(s_t, A_t^H) = \hat{y}_t + \gamma \hat{b}_{t+1}$ is the estimated future value.
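A minimal sketch of this interface and its N-step Q-learning instantiation follows; the class and function names are illustrative, and `predict_fn` stands in for the neural network forward pass (not the paper's actual code).

```python
# Generalized computation graph: G_theta(s_t, A_t^H) -> (Y_hat_t^H, b_hat_{t+H}),
# plus the N-step Q-learning policy evaluation function (H = 1).
class ComputationGraph:
    def __init__(self, predict_fn, horizon):
        self.predict_fn = predict_fn   # maps (state, actions) to (y_hats, b_hat)
        self.H = horizon

    def outputs(self, state, action_seq):
        y_hats, b_hat = self.predict_fn(state, action_seq[:self.H])
        return y_hats, b_hat

def policy_evaluation_nstep_q(graph, state, action_seq, gamma=0.99):
    """For N-step Q-learning: J(s_t, A_t^H) = y_hat_t + gamma * b_hat_{t+1}."""
    y_hats, b_hat = graph.outputs(state, action_seq)
    return y_hats[0] + gamma * b_hat
```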

Learning navigation policies with self-supervision

Model parameterization

While many function approximators could be used to instantiate our generalized computation graph, the function approximator needs to be able to cope with high-dimensional state inputs, such as images, and accurately model sequential data due to the nature of robot navigation. We therefore parameterize the computation graph as a deep recurrent neural network (RNN), depicted in Fig. 3.
The recurrent portion of the network is implemented with LSTM units.
[Fig. 3: the deep recurrent neural network parameterization of the computation graph]

Model outputs

We consider two output quantities. The first is the standard approach in the reinforcement learning literature: $\hat{Y}_t^H$ represents rewards and $\hat{b}_{t+H}$ represents the future value-to-go. For the task of collision-free navigation, we define the reward as the robot's speed, which is typically known from onboard sensors; the value is therefore approximately the distance the robot travels before experiencing a collision. For the second quantity, $\hat{Y}_t^H$ represents the probability of collision at or before each timestep, that is, $\hat{y}_{t+h}$ is the probability that the robot will collide between time $t$ and $t+h$, and $\hat{b}_{t+H}$ represents the best-case future likelihood of collision.

Policy evaluation function

If the model output quantities are values, which in our case represent the expected distance to travel, then the policy evaluation function is simply the value $$J(s_t, A_t^H) = \sum_{h=0}^{H-1} \gamma^h \hat{y}_{t+h} + \gamma^H \hat{b}_{t+H}.$$
That is, when the outputs are values, $J$ is the discounted return over the horizon plus the bootstrapped value at the final step, and can be understood as a Q-value function over the action sequence.
If the model output quantities are collision probabilities, then the policy evaluation function needs to somehow encourage the robot to move through the environment. We assume that the robot will be travelling at some fixed speed, and therefore the policy evaluation function needs to evaluate which actions are least likely to result in collisions: $$J(s_t, A_t^H) = -\sum_{h=0}^{H-1} \hat{y}_{t+h} - \hat{b}_{t+H}.$$
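The two policy evaluation functions above translate directly into code; this is a minimal sketch in which `y_hats` and `b_hat` stand for the computation graph's predictions (function names are my own).

```python
def J_value(y_hats, b_hat, gamma=0.99):
    """Value outputs: discounted predicted rewards plus the bootstrapped value-to-go."""
    H = len(y_hats)
    return sum(gamma**h * y for h, y in enumerate(y_hats)) + gamma**H * b_hat

def J_collision(y_hats, b_hat):
    """Collision-probability outputs: prefer the least collision-prone plan."""
    return -sum(y_hats) - b_hat
```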

Policy evaluation

Using the policy evaluation function, action selection is performed by solving the finite-horizon planning problem $\arg\max_{A^H} J(s_t, A^H)$.
That is, the action sequence $A^H$ that maximizes the policy evaluation function $J(s_t, A_t^H)$ is selected greedily.
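The summary does not specify how the $\arg\max$ over action sequences is solved, so the sketch below uses simple random shooting as an illustrative assumption: sample candidate sequences, score each with $J$, and execute the first action of the best one.

```python
# Approximate argmax_{A^H} J(s_t, A^H) by random shooting (illustrative only).
import numpy as np

def select_action(graph, J_fn, state, action_low, action_high,
                  horizon=16, num_candidates=1024, rng=np.random):
    candidates = rng.uniform(action_low, action_high,
                             size=(num_candidates, horizon))
    best_seq, best_score = None, -np.inf
    for seq in candidates:
        y_hats, b_hat = graph.outputs(state, seq)
        score = J_fn(y_hats, b_hat)
        if score > best_score:
            best_seq, best_score = seq, score
    return best_seq[0]   # execute the first action, then re-plan
```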

Model horizon

With $H=1$, the approach reduces to a fully model-free method; with $H=\infty$, it reduces to a fully model-based method. For intermediate values of $H$, the model is a hybrid of model-free and model-based methods. We empirically evaluate different horizon values in our experiments.

Label horizon

Increasing the label horizon can speed up learning, but it turns N-step Q-learning into an on-policy algorithm, which is undesirable for robot navigation because we want to train on all types of data, including data collected by older policies and by exploration policies.

Bootstrapping

We set the label horizon equal to the model horizon. Besides increasing the model horizon, bootstrapping is another way to let the model account for outcomes further in the future. Although increasing the model horizon makes the approach more model-based, the search space during policy evaluation grows exponentially with the horizon. Bootstrapping can sidestep this planning problem without increasing the model horizon, but it introduces bias and instability into learning.

Training the model

The model is trained on a dataset $D$ by defining a loss function between the model outputs and the labels.
For samples $(s_t^H, A_t^H, y_t^H) \in D$ from the dataset, if the model outputs and labels are values, the loss is the standard Bellman error $$\varepsilon_t(\theta) = \left\| \sum_{h=0}^{H-1} \gamma^h y_{t+h} + \gamma^H b_{t+H} - J(s_t, A_t^H) \right\|_2^2,$$ where $b_{t+H} = \max_{A^H} J(s_{t+H}, A^H)$.
If the model outputs are collision probabilities, the loss is a cross-entropy loss (shown as an equation image in the original post).
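Since the original equation image is not available here, a plausible standard form, assuming binary collision labels $y_{t+h} \in \{0,1\}$ and one cross-entropy term per predicted step, would be:

$$\varepsilon_t(\theta) = -\sum_{h=0}^{H-1} \Big[ y_{t+h} \log \hat{y}_{t+h} + (1 - y_{t+h}) \log\left(1 - \hat{y}_{t+h}\right) \Big].$$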

Experiments

The experiments address three questions:

  1. How do the different design choices for the navigation computation graph affect performance?
  2. Given the best design choices, how does our approach compare with prior methods?
  3. Can our approach successfully learn navigation policies for a real robot in a complex environment?

The robot state $S \in \mathbb{R}^{2304}$ is a $64 \times 36$ grayscale image taken from an onboard forward-facing camera. The action space is $A \subseteq \mathbb{R}^{1}$.

Model outputs and loss function

[Figure: comparison of model outputs and loss functions]
"value"對應的輸出表示了未來回報的期望和,"collision"對應的輸出表示了碰撞的概率.迴歸問題使用的是平均方差損失函數,分類問題使用的是交叉熵損失函數.實驗結果表明,collision方法優於value方法,兩者的主要差別是:
The value model loss is a single loss on the sum of the outputs, while the collision model loss is the sum of $H$ separate losses on each of the outputs.
The collision model therefore receives additional supervision about exactly when the collision labels occur in time. Moreover, training with the cross-entropy loss is substantially better than training with the mean squared error loss in terms of both sample efficiency and final performance. This comparison suggests that predicting discrete future events can lead to faster and more stable learning than predicting continuous discounted rewards. Although we demonstrate this finding only in the context of robot navigation, the insight could lead to a new class of sample-efficient, stable, and high-performing reinforcement learning algorithms.
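The structural difference can be sketched as follows; the loss functions are simplified illustrations (discounting and bootstrapping omitted), not the paper's exact implementation.

```python
# Value model: one MSE loss on the *sum* of the H outputs.
# Collision model: a sum of H separate per-step cross-entropy losses.
import numpy as np

def value_model_loss(y_hats, target_return):
    """Single regression loss on the summed outputs."""
    return (np.sum(y_hats) - target_return) ** 2

def collision_model_loss(p_hats, collision_labels, eps=1e-8):
    """Sum of H per-step binary cross-entropy losses."""
    p = np.clip(p_hats, eps, 1 - eps)
    y = np.asarray(collision_labels, dtype=float)
    return float(np.sum(-(y * np.log(p) + (1 - y) * np.log(1 - p))))
```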

Model horizon

Next, we examine the effect of different model horizons $H$: with $H=1$ the robot effectively looks $0.5\,\mathrm{m}$ ahead, while with $H=16$ it looks $8\,\mathrm{m}$ ahead.
[Figure: comparison of different model horizons $H$]
For models that output values, training with a longer horizon is more stable and leads to a better-performing final policy. The longer-horizon models perform better because the longer horizon reduces the bias introduced by bootstrapping. However, for models that output collision probabilities, we did not notice any change in performance when comparing short-horizon and long-horizon models. This is likely because probabilities are necessarily bounded between 0 and 1, which minimizes the bias of the bootstrap.

Bootstrapping

[Figure: comparison with and without bootstrapping]
For the bootstrapping comparison we use $H=16$. Without bootstrapping, the model that outputs values fails to learn, whereas the model that outputs collision probabilities is highly sample efficient and stable. With bootstrapping, the value model performs worse than the collision model, although the value model does benefit from bootstrapping. In contrast, the collision prediction model is unaffected by whether or not bootstrapping is used. Taken together, these results indicate that it is advantageous to look $H$ steps ahead (a long horizon) and not use bootstrapping.

Comparisons with prior work

[Figure: comparison with prior methods]
Compared with prior methods, including double Q-learning and N-step double Q-learning, our algorithm is more sample efficient, more stable, and achieves higher final performance.

Real-world results

We therefore made the system fully asynchronous: the car continuously runs the reinforcement learning algorithm and sends data to the laptop, while the laptop continuously trains the model and periodically sends updated model parameters to the car.
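A minimal sketch of this asynchronous setup is shown below: one thread stands in for the car (collecting data), another for the laptop (training and periodically publishing updated parameters). All names and the placeholder rollout/training steps are illustrative, not from the paper's code.

```python
import queue
import threading
import time

data_queue = queue.Queue()    # car -> laptop: collected experience
param_queue = queue.Queue()   # laptop -> car: updated model parameters

def car_loop(steps=100):
    params = None
    for t in range(steps):
        try:
            params = param_queue.get_nowait()   # use the newest params if available
        except queue.Empty:
            pass
        experience = {"step": t, "params_version": params}   # placeholder rollout
        data_queue.put(experience)
        time.sleep(0.01)

def laptop_loop(updates=10):
    for version in range(updates):
        batch = [data_queue.get() for _ in range(5)]   # gather a training batch
        # ... run a gradient step on `batch` here ...
        param_queue.put(version)                        # publish new parameters

threads = [threading.Thread(target=car_loop), threading.Thread(target=laptop_loop)]
for th in threads:
    th.start()
for th in threads:
    th.join()
```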
In evaluating our approach, we chose the best design decisions from our simulation experiments: the model outputs are collision probabilities trained using classification, a large model horizon ($H = 12$, corresponding to 3.6 m lookahead), and no bootstrapping. All other settings were exactly the same as in the simulation experiments.
The real-world experiments compare our approach against double Q-learning and demonstrate the advantages of our algorithm.
