DARLA: Improving Zero-Shot Transfer in Reinforcement Learning (Reading Notes)

Tags (space-separated): paper-notes reinforcement-learning


This paper is about transferring an RL agent across different input data distributions without any retraining (zero-shot); it does not modify the underlying RL algorithms themselves.

Purpose and significance

The authors' motivation: an RL agent will often be deployed on data distributions that differ from the one it was trained on, yet learning online in the target environment is hard, and collecting target-domain data is slow and expensive.
The most common domain adaptation settings are:
(1) simulation to reality; (2) between different real-world environments.
The authors therefore propose a multi-stage RL agent, DARLA (DisentAngled Representation Learning Agent).
It first extracts features with an unsupervised neural network (a disentangled representation of the observed environment) and only then learns a policy on top of those features.

We propose a new multi-stage RL agent, DARLA (DisentAngled Representation Learning Agent), which learns to see before learning to act.
This paper focuses on one of these outstanding issues: the ability of RL agents to deal with changes to the input distribution, a form of transfer learning known as domain adaptation.
We aim to develop an agent that can learn a robust policy using observations and rewards obtained exclusively within the source domain.
A policy is considered robust if it generalises to the target domain with minimal drop in performance and without extra fine-tuning.

The authors then explain what goes wrong without this kind of transfer:
(1) acquiring target-domain data is too costly;
(2) policies trained only on the source domain easily overfit to it.

  1. In many scenarios, such as robotics, this reliance on target domain information can be problematic, as the data may be expensive or difficult to obtain (Finn et al., 2017; Rusu et al., 2016). Furthermore, the target domain may simply not be known in advance.
  2. On the other hand, policies learnt exclusively on the source domain using existing deep RL approaches that have few constraints on the nature of the learnt representations often overfit to the source input distribution, resulting in poor domain adaptation performance.

The authors want a representation scheme that captures the underlying low-dimensional factors of the environment, factors that do not change with the task or the data distribution.

  1. We propose tackling both of these issues by focusing instead on learning representations which capture an underlying low-dimensional factorised representation of the world and are therefore not task or domain specific.
  2. We demonstrate how disentangled representations can improve the robustness of RL algorithms in domain adaptation scenarios by introducing DARLA, a new RL agent capable of learning a robust policy on the source domain that achieves significantly better out-of-the-box performance in domain adaptation scenarios compared to various baselines.
  3. DARLA relies on learning a latent state representation that is shared between the source and target domains, by learning a disentangled representation of the environment’s generative factors.

DARLA consists of three stages: (1) learning the representation; (2) learning the policy; (3) transfer.

DARLA does not require target domain data to form its representations. Our approach utilises a three stage pipeline: 1) learning to see, 2) learning to act, 3) transfer.

Training and deployment domains (source domain and target domain)

This transfer setting has two defining characteristics:
(1) the source (training) and target (test) data distributions differ substantially;
(2) once training on the source domain is finished, no further learning is done on the target domain.

The data in the source and target domains differ as follows:
(1) the action space is shared;
(2) the transition and reward functions are similar;
(3) the state (observation) spaces differ substantially.

Algorithm details

The algorithm first projects the high-dimensional observations $S^o_i$ onto low-dimensional latent states $S^z_i$, and does so with unsupervised learning (a minimal sketch of the resulting composition follows the excerpts below).

  1. In the process of doing so, the agent implicitly learns a function $F: S^o_i \to S^z_i$ that maps the typically high-dimensional raw observations $S^o_i$ to typically low-dimensional latent states $S^z_i$; followed by a policy function $\pi_i: S^z_i \to A_i$ that maps the latent states $S^z_i$ to actions $a_i$.
  2. Such a source policy $\pi_S$ is likely to be based on an entangled latent state space $S^z_S$.
  3. Hence, DARLA is based on the idea that a good quality $F$ learnt exclusively on the source domain $D^M_S$ will zero-shot generalise to all target domains $D^M_i$, and therefore the source policy $\pi(a|S^z_S;\theta)$ will also generalise to all target domains $D^M_i$ out of the box.
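As a minimal, self-contained illustration of this composition (my own sketch, not the authors' code), the function below assumes a frozen encoder module playing the role of $F$ and a source-trained policy network playing the role of $\pi$, and shows how the agent acts zero-shot in the target domain:

```python
import torch

def act_zero_shot(obs, encoder, policy):
    """Select an action in the target domain with no further learning.

    `encoder` plays the role of F (frozen, trained only on the source domain)
    and `policy` plays the role of pi (also trained only on the source domain).
    Both are hypothetical torch.nn.Module instances.
    """
    with torch.no_grad():
        z = encoder(obs)                    # F: high-dim observation -> low-dim latent state
        action = policy(z).argmax(dim=-1)   # pi: latent state -> action (discrete case)
    return action
```

Nothing here is updated at transfer time; the zero-shot claim is precisely that this composition keeps working when `obs` comes from a target domain.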

The algorithm has three stages (a sketch of the whole pipeline follows below):
(1) learning the representation, which is the key contribution of the paper and uses unsupervised learning;
(2) feeding that representation into a standard RL algorithm (DQN, DDPG, A3C);
(3) transferring from the source domain to the target domain.
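To make the three stages concrete, here is a hypothetical outline of the pipeline in Python. All names in it (`train_beta_vae`, `DQNAgent`, `run_episode`, `source_env`, `target_env`) are placeholders introduced for illustration, not the paper's implementation; the point is only the ordering of the stages and the fact that the encoder is frozen before any policy learning happens.

```python
# Hypothetical outline of the three-stage DARLA pipeline (placeholder names).

# Stage 1: learning to see.
# Train a beta-VAE on observations collected in the source domain
# (unsupervised, no rewards needed) and keep its encoder F.
encoder = train_beta_vae(source_observations)   # placeholder helper
encoder.requires_grad_(False)                   # F stays frozen from here on

# Stage 2: learning to act.
# Train a standard RL agent (e.g. DQN) whose input is the latent state
# z = F(s^o) produced by the frozen encoder, never the raw pixels.
agent = DQNAgent(input_dim=encoder.latent_dim,
                 n_actions=source_env.action_space.n)
for episode in range(num_episodes):
    run_episode(source_env, lambda obs: agent.act(encoder(obs)), train=True)

# Stage 3: transfer.
# Deploy the frozen encoder and the source-trained policy in the target
# domain with no additional learning (zero-shot evaluation).
score = run_episode(target_env, lambda obs: agent.act(encoder(obs)), train=False)
```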

Step one is therefore the crux of this paper; the remainder of these notes looks at how it is implemented.

The feature-extraction network $F$ uses $\beta$-VAE, an algorithm that automatically discovers a factorised representation from raw images in a purely unsupervised way.

DARLA utilises βVAE , a state-of-the-art unsupervised model for automated discovery of factorised latent representations from raw image data.

First, the loss function. The objective is the standard $\beta$-VAE objective:

$$\mathcal{L}(\theta, \phi; x, z, \beta) = \mathbb{E}_{q_\phi(z|x)}\big[\log p_\theta(x|z)\big] - \beta \, D_{KL}\big(q_\phi(z|x) \,\|\, p(z)\big)$$

Here $\phi$ and $\theta$ are the encoder and decoder weights, $\beta > 1$ is a hyperparameter (larger values push the model towards disentangled latents), and $x$ and $z$ are the raw observation and its latent code. In DARLA the reconstruction term is not measured in pixel space: the reconstruction $\hat{x}$ and the input $x$ are both passed through a pre-trained denoising autoencoder and compared in its feature space (a perceptual similarity loss). Once this objective is clear, the rest of the paper is straightforward.
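For reference, a minimal PyTorch sketch of this objective (my own illustration, not the paper's code). It uses the plain pixel-space Bernoulli reconstruction term; DARLA instead computes the reconstruction error in the feature space of a pre-trained denoising autoencoder.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Negative beta-VAE objective (a quantity to minimise).

    x         : input batch with values in [0, 1]
    x_recon   : decoder output logits, same shape as x
    mu, logvar: parameters of the diagonal Gaussian posterior q_phi(z|x)
    beta      : KL weight; beta > 1 encourages disentangled latents
    """
    # Reconstruction term, -E_q[log p_theta(x|z)] for a Bernoulli decoder.
    recon = F.binary_cross_entropy_with_logits(x_recon, x, reduction='sum') / x.size(0)
    # KL(q_phi(z|x) || N(0, I)) in closed form for a diagonal Gaussian posterior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return recon + beta * kl
```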

After that, the latent code $z$ is simply fed into the RL algorithm.
The code below illustrates how the $\beta$-VAE can be trained.
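Below is one way such a training loop could look: a tiny convolutional $\beta$-VAE for 64x64 RGB frames, the reparameterisation trick, and the `beta_vae_loss` defined above. The architecture, `num_epochs`, and `source_observation_loader` are assumptions made for illustration, not the paper's setup.

```python
import torch
from torch import nn

class BetaVAE(nn.Module):
    """Tiny convolutional beta-VAE for 64x64 RGB frames (illustrative only)."""

    def __init__(self, latent_dim=32):
        super().__init__()
        self.latent_dim = latent_dim
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64x64 -> 32x32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32x32 -> 16x16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, 2 * latent_dim),               # -> [mu, logvar]
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 32
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1),              # 32 -> 64, logits
        )

    def encode(self, x):
        mu, logvar = self.encoder(x).chunk(2, dim=1)
        return mu, logvar

    def forward(self, x):
        mu, logvar = self.encode(x)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterisation trick
        return self.decoder(z), mu, logvar


model = BetaVAE()
optimiser = torch.optim.Adam(model.parameters(), lr=1e-4)

for epoch in range(num_epochs):              # num_epochs: assumed to be defined
    for x in source_observation_loader:      # batches of source-domain frames in [0, 1]
        x_recon, mu, logvar = model(x)
        loss = beta_vae_loss(x, x_recon, mu, logvar, beta=4.0)
        optimiser.zero_grad()
        loss.backward()
        optimiser.step()
```

After training, only the encoder is kept and frozen; it supplies the latent states consumed by the RL agent in stage two.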
