Key Concepts in RL

Notes for review; updated from time to time.

In a nutshell, RL is the study of agents and how they learn by trial and error. It formalizes the idea that rewarding or punishing an agent for its behavior makes it more likely to repeat or forego that behavior in the future.

[Figure: agent-environment interaction loop (rl_diagram_transparent_bg.png)]

The environment is the world that the agent lives in and interacts with. At every step of interaction, the agent sees a (possibly partial) observation of the state of the world, and then decides on an action to take. The environment changes when the agent acts on it, but may also change on its own.

The agent also perceives a reward signal from the environment, a number that tells it how good or bad the current world state is. The goal of the agent is to maximize its cumulative reward, called return. Reinforcement learning methods are ways that the agent can learn behaviors to achieve its goal.
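
One common form of return is the discounted sum of rewards. A minimal sketch of computing it; the reward list and discount factor below are made-up example values, not from the text:

```python
# Minimal sketch: discounted return of a finite reward sequence.
# The rewards and gamma are made-up example values.
def discounted_return(rewards, gamma=0.99):
    ret = 0.0
    for t, r in enumerate(rewards):
        ret += (gamma ** t) * r   # sum_t gamma^t * r_t
    return ret

print(discounted_return([1.0, 0.0, 0.5, 2.0]))
```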

More Terminology

DQN (Deep Q-Network): uses a neural network to produce Q-values.
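
A minimal sketch of such a Q-network in PyTorch; the layer sizes, observation dimension, and action count are illustrative assumptions, not from the text:

```python
import torch
import torch.nn as nn

# A Q-network maps an observation to one Q-value per action.
obs_dim, n_actions = 4, 2          # assumed sizes for illustration
q_net = nn.Sequential(
    nn.Linear(obs_dim, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)

obs = torch.randn(1, obs_dim)           # a fake observation
q_values = q_net(obs)                   # shape (1, n_actions)
greedy_action = q_values.argmax(dim=1)  # action with the largest Q-value
```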

MDP: Markov Decision Process.

Planning: decision-making done before any action is executed.

In reinforcement learning, however, the agent does not know all the elements of the MDP so easily. For example, it may not know how the environment will change after it executes an action (the state-transition probability function $T$), nor what immediate reward it will receive for executing that action (the reward function $R$). What the agent can do is this: following its current policy $\pi$, choose an action $a$ that it believes is good in the current state $s$, execute that action in the environment, observe the feedback $r$ and the next state $s'$ returned by the environment, and use the feedback $r$ to adjust and update its policy $\pi$. Iterating this process over and over, it eventually finds an optimal policy $\pi'$ that maximizes the positive feedback it receives.
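
A minimal sketch of this act-observe-update loop; `env` and `agent` are hypothetical objects whose interfaces are assumed only for illustration:

```python
# Sketch of the interaction loop: act, observe (r, s'), update the policy.
# `env` and `agent` are hypothetical objects with the methods shown here.
def run_episode(env, agent):
    s = env.reset()                    # initial state / observation
    done = False
    total_reward = 0.0
    while not done:
        a = agent.act(s)               # pick a using the current policy pi
        s_next, r, done = env.step(a)  # environment returns feedback r and s'
        agent.update(s, a, r, s_next)  # adjust pi using the feedback r
        total_reward += r
        s = s_next
    return total_reward
```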

So, when the agent does not know the transition probability function $T$ or the reward function $R$, how does it find a good policy? There are, of course, many approaches:

Model-based RL

One approach is the model-based approach: the agent learns a model that describes, from its own observations, how the environment works, and then uses this model to plan its actions. Concretely, when the agent is in state $s_1$, executes action $a_1$, and then observes the environment transition from $s_1$ to $s_2$ together with the reward $r$ it receives, this information can be used to improve the accuracy of its estimates of $T(s_2 \mid s_1, a_1)$ and $R(s_1, a_1)$. Once the learned model closely matches the environment, the agent can find the optimal policy directly with planning algorithms. Concretely: if the reward for executing any action in any state, $R(s_t, a_t)$, is known, and the next state can be computed from $T(s_{t+1} \mid s_t, a_t)$, the problem is readily solved with dynamic programming. In particular, when $T(s_{t+1} \mid s_t, a_t) = 1$ (deterministic transitions), a greedy algorithm suffices: at each step, simply choose the action that maximizes the reward function in the current state $s_t$, i.e. $\max_a R(s, a \mid s = s_t)$. Reinforcement learning methods that model the environment in this way are called model-based methods.
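
A minimal sketch of planning with a known model via value iteration; the tiny MDP below (its sizes, dynamics, and rewards) is made up purely for illustration:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
T = np.zeros((n_states, n_actions, n_states))  # T[s, a, s'] = P(s' | s, a)
R = np.zeros((n_states, n_actions))            # R[s, a] = immediate reward

# Made-up deterministic dynamics: action 0 stays put, action 1 moves on.
for s in range(n_states):
    T[s, 0, s] = 1.0
    T[s, 1, (s + 1) % n_states] = 1.0
R[2, 1] = 1.0  # reward for leaving state 2 via action 1

V = np.zeros(n_states)
for _ in range(100):
    # Q(s, a) = R(s, a) + gamma * sum_{s'} T(s, a, s') * V(s')
    Q = R + gamma * T.dot(V)
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)  # greedy policy w.r.t. the planned values
print(V, policy)
```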

Model-free RL

However, it turns out that we do not always need to model the environment to find an optimal policy. A classic example is Q-learning. Q-learning directly estimates the future return $Q(s, a)$: $Q(s_k, a_k)$ is an estimate of the expected sum of future rewards $E\left[\sum_{t=k}^{n}\gamma^{t-k}R_t\right]$ obtained after executing action $a_k$ in state $s_k$. The more accurate this Q-value estimate is, the more confidently we can choose the action in the current state $s_t$: simply pick the $a_t$ that maximizes $Q(s_t, a_t)$. The update target for the Q-values is defined by the Bellman equation, and the update itself can be done with methods such as TD (temporal-difference) learning. This is a value-iteration style of method; there are also policy-iteration-based methods, as well as actor-critic methods that combine value iteration and policy iteration. Basic policy-iteration methods typically update once per episode (Monte Carlo updates). Because none of these methods build a model of the environment, they are all model-free methods.
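
A minimal sketch of the tabular Q-learning (TD) update; the table sizes, step size, and sample transition are made-up values for illustration:

```python
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.99            # step size and discount factor (assumed)
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next, done):
    # Bellman target: r + gamma * max_a' Q(s', a'); no bootstrap at episode end.
    target = r + (0.0 if done else gamma * Q[s_next].max())
    # TD update: move Q(s, a) toward the target by step size alpha.
    Q[s, a] += alpha * (target - Q[s, a])

# One update with a made-up transition (s=0, a=1, r=1.0, s'=3).
q_learning_update(s=0, a=1, r=1.0, s_next=3, done=False)
```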

So, if you want to tell whether a reinforcement learning algorithm is model-based or model-free, ask yourself this question: before the agent executes an action, can it predict the next state and the reward? If it can, the method is model-based; if it cannot, it is model-free.

https://www.quora.com/What-is-the-difference-between-model-based-and-model-free-reinforcement-learning

States and Observations

A state $s$ is a complete description of the state of the world. There is no information about the world which is hidden from the state. An observation $o$ is a partial description of a state, which may omit information.

When the agent is able to observe the complete state of the environment, we say that the environment is fully observed. When the agent can only see a partial observation, we say that the environment is partially observed.

Action Spaces

action space: The set of all valid actions in a given environment.

discrete action space: only a finite number of moves are available to the agent.

continuous action space: actions are real-valued vectors, e.g. when the agent controls a robot in the physical world.
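
A small example of the two kinds of action spaces, assuming the Gym library's space classes (`gym.spaces.Discrete` and `gym.spaces.Box`):

```python
import numpy as np
import gym

# Discrete action space: a finite set of moves, e.g. {0, 1}.
discrete_space = gym.spaces.Discrete(2)
print(discrete_space.sample())      # an integer action

# Continuous action space: real-valued vectors, e.g. 3 torques in [-1, 1].
continuous_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
print(continuous_space.sample())    # a float vector action
```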

Policies

A policy is a rule used by an agent to decide which actions to take; the agent learns a policy that maximizes return.

Deterministic Policies: $a_t = \mu(s_t)$

Stochastic Policies: $a_t \sim \pi(\cdot \mid s_t)$

two kinds of stochastic policies:

  1. categorical policies: used in discrete action spaces.
  2. diagonal Gaussian policies: used in continuous action spaces.

two key computations (see the sketch after this list):

  1. sampling actions from the policy
  2. computing log-likelihoods of particular actions, $\log \pi_\theta(a \mid s)$
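
A minimal sketch of both computations for the two policy types, using PyTorch's distribution classes; the logits, mean, and std values are made-up placeholders for a policy network's outputs:

```python
import torch
from torch.distributions import Categorical, Normal

# Categorical policy (discrete actions): logits would come from a policy network.
logits = torch.tensor([0.1, 0.5, -0.2])     # made-up values
cat_policy = Categorical(logits=logits)
a = cat_policy.sample()                     # 1. sample an action
log_p = cat_policy.log_prob(a)              # 2. log pi_theta(a | s)

# Diagonal Gaussian policy (continuous actions): per-dimension mean and std.
mean, std = torch.zeros(3), torch.ones(3)
gauss_policy = Normal(mean, std)
a_cont = gauss_policy.sample()
log_p_cont = gauss_policy.log_prob(a_cont).sum(-1)  # sum over action dims
```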