FollowNet: Robot Navigation by Following Natural Language Directions with DRL

Abstract

We present FollowNet, an end-to-end differentiable neural architecture for learning multi-modal navigation policies. FollowNet maps natural language instructions, together with visual and depth inputs, to locomotion primitives.
FollowNet processes the instruction with an attention mechanism conditioned on its visual and depth input, so that it focuses on the relevant parts of the command while performing the navigation task. Deep reinforcement learning (RL) with a sparse reward learns the state representation, the attention function, and the control policy simultaneously.

Introduction

The novel aspect of the FollowNet architecture is a language instruction attention mechanism that is conditioned on the agent’s sensory observations. This allows the agent to do two things.

  1. First, it keeps track of the instruction and focuses on different parts of the command as it explores the environment.
  2. Second, it associates motion primitives, sensory observations, and sections of the instruction with the reward received, which enables the agent to generalize to new instructions.

Related Work

In this work, we provide natural language instructions instead of an explicit goal, and the agent must learn to interpret the instructions to complete the task. (Most end-to-end DRL navigation methods assume an explicitly specified goal position; here the agent has to infer the implicit goal from the instruction itself.)

Methods

Problem formulation

We assume the robot to be a point-mass with a 3-DOF pose $(x, y, \theta)$, navigating in a 2D grid overlaid on a 3D indoor house environment. To train a DQN agent, we formulate the task as a POMDP (partially observable Markov decision process): a tuple $(O, A, D, R)$ with observations $o = [o_{NL}, o_{V}] \in O$, where $o_{NL} = [\omega_{1}, \omega_{2}, \cdots, \omega_{i}]$ is a natural language instruction sampled from a set of user-provided directions for reaching a goal, and $o_{V}$ is the visual input available to the agent, i.e., the image the robot sees at time step $i$. The set of actions is $A = (\text{turn } \frac{\pi}{2},\ \text{go straight},\ \text{turn } \frac{3\pi}{2})$. The system dynamics $D: O \times A \rightarrow O$ are deterministic and apply the chosen action to the robot. The reward $R: O \rightarrow \mathbb{R}$ rewards the agent for reaching a landmark (waypoint) mentioned in the instruction.
Fig. 2 provides an example task, where the robot starts at the position and orientation specified by the blue triangle, and must reach the goal location specified by the red circle.
[Fig. 2: example navigation task]
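To make the POMDP tuple concrete, here is a minimal, runnable Python sketch. The class name, action names, and grid dynamics are illustrative assumptions (they are not taken from the paper), and the camera image is stubbed out with a zero array.

```python
import math
import numpy as np

ACTIONS = ("turn_left_90", "go_straight", "turn_right_90")  # action set A


class GridNavPOMDP:
    """Hypothetical point-mass robot with pose (x, y, theta) on a 2D grid."""

    def __init__(self, instruction_tokens, landmarks):
        self.instruction = instruction_tokens      # o_NL = [w_1, ..., w_i]
        self.landmarks = set(landmarks)            # waypoints mentioned in o_NL
        self.x, self.y, self.theta = 0, 0, 0.0     # start pose

    def _observe(self):
        o_V = np.zeros((64, 64, 3))                # stand-in for the camera image
        return {"o_NL": self.instruction, "o_V": o_V}

    def step(self, action):
        """Deterministic dynamics D: O x A -> O, followed by the sparse reward R."""
        assert action in ACTIONS
        if action == "turn_left_90":
            self.theta = (self.theta + math.pi / 2) % (2 * math.pi)
        elif action == "turn_right_90":
            self.theta = (self.theta - math.pi / 2) % (2 * math.pi)
        else:                                      # move one grid cell forward
            self.x += round(math.cos(self.theta))
            self.y += round(math.sin(self.theta))
        reached = (self.x, self.y) in self.landmarks
        if reached:
            self.landmarks.discard((self.x, self.y))   # reward each landmark once
        return self._observe(), (1.0 if reached else 0.0)


# Example: a landmark two cells ahead of the start pose.
env = GridNavPOMDP(["go", "straight", "to", "the", "kitchen"], landmarks=[(2, 0)])
for a in ("go_straight", "go_straight"):
    obs, r = env.step(a)
    print(a, (env.x, env.y), r)
```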

FollowNet

We present FollowNet, a neural architecture for approximating the action-value function directly from the language and visual inputs. (Approximating the action-value function directly from the inputs indicates the method operates within the DQN framework.)

  1. To simplify the image processing task, we assume a separate preprocessing step parses the visual input $o_{V} \in \mathbb{R}^{n \times m}$ into a semantic segmentation $o_{S}$, which assigns a one-hot semantic class id to each pixel, and a depth map $o_{D}$, which assigns to each pixel a real number corresponding to its distance from the robot. (These inputs are then handled by a series of convolutional (CNN) and fully connected (FC) layers.)
  2. We use a single-layer bidirectional GRU network to encode the natural language instruction. To enable the agent to focus on different parts of the instruction depending on the context, we add a feed-forward attention layer $FF_{A}$ conditioned on $v_{C}$, the concatenated embeddings of the visual and language inputs, to obtain unnormalized scores $e_{i}$ for each token $\omega_{i}$. The $e_{i}$ are normalized with the softmax function to obtain the attention scores $\alpha_{i}$, which correspond to the relative importance of each token of the instruction at the current time step. We take the attention-weighted mean of the GRU output vectors $o_{i}$ and pass it through another feed-forward layer to obtain $v_{L} \in \mathbb{R}^{d_{L}}$, the final encoding of the natural language instruction (a sketch of this forward pass, including the attention layer, follows the list).
    [Figure: FollowNet network architecture]
  3. The Q-function is then estimated from the concatenation $[v_{S}; v_{D}; v_{L}]$ passed through a final feed-forward layer. During training, we sample actions from the Q-function with an $\epsilon$-greedy policy to collect experience, and update the Q-network to minimize the Bellman error over batches of transitions using gradient descent (a minimal sketch of this update is given after the list). After the Q-function is trained, we use the greedy policy $\pi(o): O \rightarrow A$ with respect to the learned $\hat{Q}$, $\pi(o) = \pi^{Q}(o) = \arg\max_{a \in A} \hat{Q}(o, a)$, to take the robot to the goal described in the instruction $o_{NL}$.
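The three steps above can be put together in a short PyTorch sketch of the forward pass. This is an illustrative reconstruction, not the authors' code: the layer sizes, the small CNN/FC encoders for $o_{S}$ and $o_{D}$, and the exact form of the attention score $e_{i} = FF_{A}([o_{i}; v_{C}])$ (with $v_{C}$ assumed to concatenate $v_{S}$, $v_{D}$, and the GRU's final hidden state) are assumptions consistent with the description, not details given in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FollowNetSketch(nn.Module):
    def __init__(self, vocab_size, n_classes, emb=32, hid=64, vis=64, n_actions=3):
        super().__init__()
        # 1. Visual branch: small CNN + FC encoders for segmentation and depth.
        self.seg_cnn = nn.Sequential(
            nn.Conv2d(n_classes, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 16, 5, stride=2), nn.ReLU(), nn.Flatten())
        self.depth_cnn = nn.Sequential(
            nn.Conv2d(1, 16, 5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 16, 5, stride=2), nn.ReLU(), nn.Flatten())
        self.seg_fc = nn.LazyLinear(vis)      # -> v_S
        self.depth_fc = nn.LazyLinear(vis)    # -> v_D
        # 2. Language branch: embedding + single-layer bidirectional GRU.
        self.embed = nn.Embedding(vocab_size, emb)
        self.gru = nn.GRU(emb, hid, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hid + 2 * vis + 2 * hid, 1)  # FF_A -> e_i
        self.lang_fc = nn.Linear(2 * hid, hid)                 # -> v_L
        # 3. Q head over the concatenated embeddings [v_S; v_D; v_L].
        self.q_head = nn.Linear(2 * vis + hid, n_actions)

    def forward(self, o_S, o_D, tokens):
        v_S = F.relu(self.seg_fc(self.seg_cnn(o_S)))
        v_D = F.relu(self.depth_fc(self.depth_cnn(o_D)))
        out, h_n = self.gru(self.embed(tokens))        # out: (B, T, 2*hid)
        h_last = torch.cat([h_n[0], h_n[1]], dim=-1)   # final hidden state, both directions
        v_C = torch.cat([v_S, v_D, h_last], dim=-1)    # assumed conditioning vector
        # Feed-forward attention: score each token output o_i against v_C.
        ctx = v_C.unsqueeze(1).expand(-1, out.size(1), -1)
        e = self.attn(torch.cat([out, ctx], dim=-1))   # unnormalized scores e_i
        alpha = torch.softmax(e, dim=1)                # attention weights alpha_i
        v_L = self.lang_fc((alpha * out).sum(dim=1))   # attention-weighted mean -> v_L
        return self.q_head(torch.cat([v_S, v_D, v_L], dim=-1))  # Q(o, a) for each action


# Example: one 64x64 observation pair and a 5-token instruction.
net = FollowNetSketch(vocab_size=100, n_classes=8)
q_values = net(torch.zeros(1, 8, 64, 64), torch.zeros(1, 1, 64, 64),
               torch.randint(0, 100, (1, 5)))
print(q_values.shape)  # torch.Size([1, 3])
```

Because the attention weights $\alpha_{i}$ sum to one over the tokens, the weighted sum of the GRU outputs is exactly the attention-weighted mean used to form $v_{L}$.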
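Training then follows standard DQN, as stated in step 3. The sketch below shows $\epsilon$-greedy action selection for collecting experience and one gradient-descent step on the Bellman error; the target network, replay buffer, and hyperparameters are standard DQN machinery assumed here rather than details taken from the text above.

```python
import random
import torch
import torch.nn.functional as F


def select_action(q_net, obs, epsilon, n_actions=3):
    """epsilon-greedy policy used to collect experience (obs is the tuple of network inputs)."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    with torch.no_grad():
        return int(q_net(*obs).argmax(dim=-1).item())


def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One gradient step minimizing the Bellman error over a batch of transitions."""
    obs, actions, rewards, next_obs, done = batch
    q = q_net(*obs).gather(1, actions.unsqueeze(1)).squeeze(1)   # Q(o, a) taken
    with torch.no_grad():
        target = rewards + gamma * (1 - done) * target_net(*next_obs).max(dim=1).values
    loss = F.smooth_l1_loss(q, target)                           # Bellman error
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```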