FeUdal Networks for Hierarchical Reinforcement Learning (Reading Notes)

FeUdal Networks for Hierarchical Reinforcement Learning

Tags (space-separated): paper notes, reinforcement learning algorithms


Abstract

This paper is an improvement on, and an application of, feudal reinforcement learning.
First, the general form of feudal reinforcement learning:
1. It is split into two parts, a Manager model and a Worker model;
2. The Manager's role is to decide which task the system should carry out; in the paper, the authors encode each task as an embedding (similar in spirit to word vectors in NLP);
3. The Worker interacts with the environment (takes actions) for a given task;
4. Hence, as the paper notes, the Manager has a low temporal resolution while the Worker has a high temporal resolution;
5. The authors also mention sub-policies; my understanding is that each task has its own policy;
6. Encoding tasks as embeddings lets the system take up a new task quickly.

We introduce FeUdal Networks (FuNs): a novel architecture for hierarchical reinforcement learning. Our approach is inspired by the feudal reinforcement learning proposal of Dayan and Hinton, and gains power and efficacy by decoupling end-to-end learning across multiple levels, allowing it to utilise different resolutions of time. Our framework employs a Manager module and a Worker module. The Manager operates at a lower temporal resolution and sets abstract goals which are conveyed to and enacted by the Worker. The Worker generates primitive actions at every tick of the environment. The decoupled structure of FuN conveys several benefits: in addition to facilitating very long timescale credit assignment, it also encourages the emergence of sub-policies associated with different goals set by the Manager. These properties allow FuN to dramatically outperform a strong baseline agent on tasks that involve long-term credit assignment or memorisation. We demonstrate the performance of our proposed system on a range of tasks from the ATARI suite and also from a 3D DeepMind Lab environment.

Introduction

The authors point out several difficulties in current applications of reinforcement learning:
1. Reinforcement learning has always struggled with long-term credit assignment; this is usually handled through the Bellman equation, and recently some work has each selected action executed as four consecutive actions;
2. The second difficulty is that reward feedback is sparse.

To address these two problems, the authors build on earlier work and propose their own network architecture and training strategy:
1. A top-level, low-temporal-resolution Manager model and a low-level, high-temporal-resolution Worker model;
2. The Manager learns a latent state (my understanding: it hints at which goal the current state should move towards), and the Worker receives the Manager's signal and selects actions;
3. The Manager's learning signal is not provided by the Worker; it comes only from the external environment. In other words, the environment's reward is what the Manager learns from;
4. The Worker's learning signal is an intrinsic reward produced inside the system;
5. No gradients are propagated between the Manager and the Worker (see the sketch after this list).
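
To picture point 5, here is a minimal PyTorch-style sketch (my own illustration, not code from the paper) of how the goal can be detached so that the Worker's loss never sends a gradient back into the Manager; all names, shapes and the dummy loss are assumptions:

```python
import torch

# Hypothetical tensors standing in for one time step of FuN.
batch, k, num_actions = 8, 16, 6
goal = torch.randn(batch, k, requires_grad=True)                     # produced by the Manager
worker_out = torch.randn(batch, num_actions, k, requires_grad=True)  # U_t from the Worker

# Detach the goal before the Worker uses it: the Worker's policy loss
# then updates only Worker-side tensors, never the Manager's goal.
w = goal.detach()                                    # goal embedding as seen by the Worker
logits = torch.einsum("bak,bk->ba", worker_out, w)   # U_t w_t
policy = torch.softmax(logits, dim=-1)

worker_loss = -torch.log(policy[:, 0]).mean()        # dummy policy-gradient-style loss
worker_loss.backward()
print(goal.grad)                                     # None: no gradient reached the Manager's goal
print(worker_out.grad is not None)                   # True: the Worker still learns
```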

The architecture explored in this work is a fully-differentiable neural network with two levels of hierarchy (though there are obvious generalisations to deeper hierarchies). The top level, the Manager, sets goals at a lower temporal resolution in a latent state-space that is itself learnt by the Manager. The lower level, the Worker, operates at a higher temporal resolution and produces primitive actions, conditioned on the goals it receives from the Manager. The Worker is motivated to follow the goals by an intrinsic reward. However, significantly, no gradients are propagated between Worker and Manager; the Manager receives its learning signal from the environment alone. In other words, the Manager learns to select latent goals that maximise extrinsic reward.

At the end, the authors summarise the contributions of the paper:
1. They generalise feudal reinforcement learning so that it can be applied to many systems;
2. They propose a new method for training the Manager model (the transition policy gradient), which gives the goals some semantic meaning (my feeling is that the goals are essentially embeddings);
3. Conventionally the learning signal depends entirely on the external environment, but in this paper the external learning signal (reward) is used to train the Manager model, while the Worker model is trained with an internally generated signal;
4. They also use a new type of LSTM, the dilated LSTM, for the Manager model, because the Manager needs to keep state in memory over long stretches of time while running at a low temporal resolution.

The authors compare their method with the policy-over-options approach that others proposed in 2017.

A key difference between our approach and the options framework is that in our proposal the top level produces a meaningful and explicit goal for the bottom level to achieve. Sub-goals emerge as directions in the latent state-space and are naturally diverse.
My reading:
1. The Manager sits at the top of the whole model and produces a guiding signal for the lower-level network (the Worker);
2. A second reading is that every large task contains many small sub-tasks, and the reward at different stages of the task may differ, so the authors argue that the many sub-tasks under a big task are what make the goal embeddings diverse, which is somewhat similar in spirit to [1].

Model

The model schematic and the corresponding equations are shown below:
[Figure: the FuN model]
[Equations (1)-(6), originally shown as an image]
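
Since the equation image is missing, here is my reconstruction of equations (1)-(6) from the paper (treat it as my transcription rather than a verbatim copy):

```latex
\begin{align}
z_t &= f^{\mathrm{percept}}(x_t) \tag{1}\\
s_t &= f^{\mathrm{Mspace}}(z_t) \tag{2}\\
h^M_t,\ \hat{g}_t &= f^{\mathrm{Mrnn}}(s_t, h^M_{t-1}); \quad g_t = \hat{g}_t / \lVert \hat{g}_t \rVert \tag{3}\\
w_t &= \phi\Big(\textstyle\sum_{i=t-c}^{t} g_i\Big) \tag{4}\\
h^W_t,\ U_t &= f^{\mathrm{Wrnn}}(z_t, h^W_{t-1}) \tag{5}\\
\pi_t &= \mathrm{SoftMax}(U_t\, w_t) \tag{6}
\end{align}
```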

Here h^M and h^W correspond to the internal states of the Manager and the Worker respectively. A linear transform φ maps a goal g_t into an embedding vector w_t ∈ R^k, which is then combined via product with matrix U_t (the Worker's output) to produce policy π – a vector of probabilities over primitive actions.

Notes:
1. f^{percept} is a feature-extraction layer;
2. f^{Mspace} does not change the dimensionality; it could be either an L2 normalisation or a fully connected layer;
3. φ is a fully connected layer with no bias;
4. w_t in the figure is the goal embedding;
5. From equation (6), the Worker model's final output is a probability for each primitive action (see the sketch after this list).
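
To make the shapes concrete, here is a small numpy sketch of one forward step through equations (1)-(6); the network functions are replaced by random stand-ins, and all dimensions except k = 16 are made up for illustration:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, k, c, num_actions = 256, 16, 10, 6        # feature dim, goal-embedding dim, horizon, |actions|

z_t = rng.standard_normal(d)                 # eq. (1): z_t = f_percept(x_t)
s_t = z_t.copy()                             # eq. (2): s_t = f_Mspace(z_t), dimensionality kept here

goals = rng.standard_normal((c + 1, d))      # eq. (3): goals g_{t-c}..g_t from f_Mrnn
goals /= np.linalg.norm(goals, axis=1, keepdims=True)   # unit-norm goals

Phi = rng.standard_normal((k, d))            # the bias-free linear map phi
w_t = Phi @ goals.sum(axis=0)                # eq. (4): goal embedding w_t in R^k

U_t = rng.standard_normal((num_actions, k))  # eq. (5): Worker output U_t
pi_t = softmax(U_t @ w_t)                    # eq. (6): probabilities over primitive actions
print(pi_t.round(3), pi_t.sum())             # sums to 1
```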

Learning

In this section the authors describe how the system's weights are updated:
1. The convolutional feature-extraction layers are updated through two paths: a policy gradient and TD-learning, corresponding to the Worker model and the Manager model respectively;
2. The authors briefly note that if gradients were propagated between the Worker and the Manager during training, some of the semantic information inside the Manager could be lost, and g_t would end up being treated as just an internal latent variable of the system;
3. The Manager is trained with a value-based gradient, while the Worker is trained with a policy-based gradient;
4. As for the learning signals (rewards): the Manager's learning signal is the sparse reward from the environment, while the Worker's learning signal is produced by the Manager.

[Equations (7)-(9) and the advantage terms A^M_t and A^D_t, originally shown as images: the Manager's learning signal, the Manager's TD-error, the Worker's intrinsic reward, the Worker's loss, and the Worker's TD-error]
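
Since the equation images are gone, here is my reconstruction of equations (7)-(9) from the paper (again a transcription from memory, so treat it with care); d_cos denotes cosine similarity and c is the Manager's horizon:

```latex
\begin{align}
\nabla g_t &= A^M_t \,\nabla_\theta\, d_{\cos}\!\big(s_{t+c} - s_t,\; g_t(\theta)\big),
  \qquad A^M_t = R_t - V^M_t(x_t, \theta) \tag{7}\\
r^I_t &= \tfrac{1}{c}\textstyle\sum_{i=1}^{c} d_{\cos}\!\big(s_t - s_{t-i},\; g_{t-i}\big) \tag{8}\\
\nabla \pi_t &= A^D_t \,\nabla_\theta \log \pi(a_t \mid x_t; \theta),
  \qquad A^D_t = R_t + \alpha R^I_t - V^D_t(x_t; \theta) \tag{9}
\end{align}
```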

  1. Equation (7): the Manager's gradient (its loss function);
  2. A^M_t: the Manager's TD-error;
  3. Equation (8): the Worker's intrinsic reward, computed internally;
  4. Equation (9): the Worker's loss function;
  5. A^D_t: the Worker's TD-error.

Corresponding sentences from the paper:

  1. The conventional wisdom would be to train the whole architecture monolithically through gradient descent on either the policy directly or via TD-learning
  2. The outputs g of the Manager would be trained by gradients coming from the Worker. This, however would deprive Manager’s goals g of any semantic meaning, making them just internal latent variables of the model
  3. Manager to predict advantageous directions (transitions) in state space and to intrinsically reward the Worker to follow these directions.
  4. The intrinsic reward that encourages the Worker to follow the goals

The paper also gives some reasons for this design:
1. The intrinsic reward gives the Worker a training target and a direction in which the state should change (see the sketch after this list);
2. As mentioned earlier, a large task can be split into several sub-tasks, each with its own sub-policy, so each sub-goal can correspond to a sub-policy;
3. The intrinsic reward is one of the authors' contributions.
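
To make equation (8) concrete, here is a small numpy sketch of the intrinsic reward computation (my own illustration; the array names, dimensions and horizon are assumptions):

```python
import numpy as np

def d_cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def intrinsic_reward(states, goals, t, c):
    """Equation (8): average cosine similarity between the state change
    s_t - s_{t-i} and the goal g_{t-i} set i steps earlier."""
    return sum(d_cos(states[t] - states[t - i], goals[t - i]) for i in range(1, c + 1)) / c

# Toy rollout: 20 steps of 16-dimensional latent states and unit-norm goals.
rng = np.random.default_rng(0)
states = rng.standard_normal((20, 16))
goals = rng.standard_normal((20, 16))
goals /= np.linalg.norm(goals, axis=1, keepdims=True)

print(intrinsic_reward(states, goals, t=15, c=10))   # scalar reward in [-1, 1]
```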

Transition Policy Gradients

Building on the above, the authors propose a new way to update the Manager model.
Before that, they lay some groundwork:

  1. Define a function o_t = μ(s_t, θ) that represents a high-level function selecting a sub-policy;
  2. Assume that the selected sub-policy stays fixed over the duration of the sub-task;
  3. This yields a transition distribution p(s_{t+c} | s_t, o_t); composing it with the sub-policy selection in (2) gives the transition policy
     π^{TP}(s_{t+c} | s_t) = p(s_{t+c} | s_t, μ(s_t, θ)),
     which describes the distribution over the final state s_{t+c} at the end of the sub-task, given the initial state s_t and the chosen sub-policy μ(s_t, θ).

The authors then give the gradient used to update the Manager:
[Equation (10): the transition policy gradient, originally shown as an image]

∇_θ log p(s_{t+c} | s_t, μ(s_t, θ)) is the transition policy gradient.

The authors then give the form of p(s_{t+c} | s_t, μ(s_t, θ)):
[Equation, originally shown as an image] (I did not follow how the authors arrive at this.)
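
For my own reference, here is my best guess at what the two missing images contain, reconstructed from the paper: equation (10) is the policy gradient of the transition policy, and, as far as I understand, the missing density models the direction of the state change with a von Mises-Fisher distribution, which is what lets the cosine-similarity update in equation (7) approximate the exact transition policy gradient:

```latex
\begin{align}
\nabla_\theta\, \pi^{TP}_t &= \mathbb{E}\Big[\big(R_t - V(s_t)\big)\,
  \nabla_\theta \log p\big(s_{t+c} \mid s_t, \mu(s_t,\theta)\big)\Big] \tag{10}\\
p\big(s_{t+c} \mid s_t\big) &\propto e^{\,d_{\cos}\left(s_{t+c} - s_t,\; g_t(\theta)\right)}
\end{align}
```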

The gradient of equation (10) is then used in place of the gradient of equation (7) to update the Manager.

Architecture details

f^{percept} is a feature-extraction network with the same convolutional structure as DQN.
f^{Mspace} is a fully connected layer that projects the features into a 16-dimensional embedding.
f^{Wrnn} is a standard LSTM.
f^{Mrnn} is the dilated LSTM structure proposed by the authors.

The reason is that the Manager's temporal resolution is very low, while the Worker's temporal resolution is high.

Dilated LSTM (not read yet)
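
I have not read this part in detail yet, but as a rough sketch of the idea as I understand it from skimming: the Manager keeps r separate LSTM cores, updates only core (t mod r) at step t, and pools the cores' outputs, so each core effectively ticks at 1/r of the environment's rate. Everything in this sketch (class name, pooling by summation, parameter values) is my assumption, not code from the paper:

```python
import torch
import torch.nn as nn

class DilatedLSTMSketch(nn.Module):
    """Rough sketch of a dilated LSTM: r independent LSTMCell cores,
    only core (t mod r) is updated at step t, outputs pooled by summing."""

    def __init__(self, input_size, hidden_size, r=10):
        super().__init__()
        self.r = r
        self.cores = nn.ModuleList([nn.LSTMCell(input_size, hidden_size) for _ in range(r)])

    def forward(self, inputs):                  # inputs: (T, batch, input_size)
        T, batch, _ = inputs.shape
        hidden = self.cores[0].hidden_size
        h = [torch.zeros(batch, hidden) for _ in range(self.r)]
        c = [torch.zeros(batch, hidden) for _ in range(self.r)]
        outputs = []
        for t in range(T):
            i = t % self.r                      # only this core sees the current input
            h[i], c[i] = self.cores[i](inputs[t], (h[i], c[i]))
            outputs.append(torch.stack(h).sum(dim=0))   # pool over all cores
        return torch.stack(outputs)             # (T, batch, hidden_size)

# Toy usage: 32 steps, batch of 4, 256-dimensional inputs and hidden states.
dlstm = DilatedLSTMSketch(input_size=256, hidden_size=256, r=10)
out = dlstm(torch.randn(32, 4, 256))
print(out.shape)                                # torch.Size([32, 4, 256])
```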

[1]: Unsupervised Perceptual Rewards for Imitation Learning
