Key Concepts in RL

Notes for review; updated from time to time.

In a nutshell, RL is the study of agents and how they learn by trial and error. It formalizes the idea that rewarding or punishing an agent for its behavior makes it more likely to repeat or forego that behavior in the future.

[Figure: agent-environment interaction loop (rl_diagram_transparent_bg.png)]

The environment is the world that the agent lives in and interacts with. At every step of interaction, the agent sees a (possibly partial) observation of the state of the world, and then decides on an action to take. The environment changes when the agent acts on it, but may also change on its own.

The agent also perceives a reward signal from the environment, a number that tells it how good or bad the current world state is. The goal of the agent is to maximize its cumulative reward, called return. Reinforcement learning methods are ways that the agent can learn behaviors to achieve its goal.
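
One common form of return is the discounted sum of rewards. A minimal sketch of computing it; the reward list and discount factor below are made-up example values, not from the text:

```python
# Minimal sketch: discounted return of a finite reward sequence.
# The rewards and gamma are made-up example values.
def discounted_return(rewards, gamma=0.99):
    ret = 0.0
    for t, r in enumerate(rewards):
        ret += (gamma ** t) * r   # sum_t gamma^t * r_t
    return ret

print(discounted_return([1.0, 0.0, 0.5, 2.0]))
```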

More Terminology

DQN (Deep Q-Network): uses a neural network to produce Q-values.
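
A minimal sketch of such a Q-network in PyTorch; the layer sizes, observation dimension, and action count are illustrative assumptions, not from the text:

```python
import torch
import torch.nn as nn

# A Q-network maps an observation to one Q-value per action.
obs_dim, n_actions = 4, 2          # assumed sizes for illustration
q_net = nn.Sequential(
    nn.Linear(obs_dim, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)

obs = torch.randn(1, obs_dim)           # a fake observation
q_values = q_net(obs)                   # shape (1, n_actions)
greedy_action = q_values.argmax(dim=1)  # action with the largest Q-value
```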

MDP: Markov Decision Process.

Planning: decision-making done before any action is executed.

In reinforcement learning, however, the agent does not know all the elements of the MDP so easily. For example, it may not know how the environment will change after it executes an action (the state-transition probability function $T$), nor what immediate reward it will receive for executing that action (the reward function $R$). What the agent can do is this: following its current policy $\pi$, choose an action $a$ that it believes is good in the current state $s$, execute that action in the environment, observe the feedback $r$ and the next state $s'$ returned by the environment, and use the feedback $r$ to adjust and update its policy $\pi$. Iterating this process over and over, it eventually finds an optimal policy $\pi'$ that maximizes the positive feedback it receives.
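
A minimal sketch of this act-observe-update loop; `env` and `agent` are hypothetical objects whose interfaces are assumed only for illustration:

```python
# Sketch of the interaction loop: act, observe (r, s'), update the policy.
# `env` and `agent` are hypothetical objects with the methods shown here.
def run_episode(env, agent):
    s = env.reset()                    # initial state / observation
    done = False
    total_reward = 0.0
    while not done:
        a = agent.act(s)               # pick a using the current policy pi
        s_next, r, done = env.step(a)  # environment returns feedback r and s'
        agent.update(s, a, r, s_next)  # adjust pi using the feedback r
        total_reward += r
        s = s_next
    return total_reward
```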

So, when the agent does not know the transition probability function $T$ or the reward function $R$, how does it find a good policy? There are, of course, many approaches:

Model-based RL

One approach is the model-based approach: the agent learns a model that describes, from its own observations, how the environment works, and then uses this model to plan its actions. Concretely, when the agent is in state $s_1$, executes action $a_1$, and then observes the environment transition from $s_1$ to $s_2$ together with the reward $r$ it receives, this information can be used to improve the accuracy of its estimates of $T(s_2 \mid s_1, a_1)$ and $R(s_1, a_1)$. Once the learned model closely matches the environment, the agent can find the optimal policy directly with planning algorithms. Concretely: if the reward for executing any action in any state, $R(s_t, a_t)$, is known, and the next state can be computed from $T(s_{t+1} \mid s_t, a_t)$, the problem is readily solved with dynamic programming. In particular, when $T(s_{t+1} \mid s_t, a_t) = 1$ (deterministic transitions), a greedy algorithm suffices: at each step, simply choose the action that maximizes the reward function in the current state $s_t$, i.e. $\max_a R(s, a \mid s = s_t)$. Reinforcement learning methods that model the environment in this way are called model-based methods.
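
A minimal sketch of planning with a known model via value iteration; the tiny MDP below (its sizes, dynamics, and rewards) is made up purely for illustration:

```python
import numpy as np

n_states, n_actions, gamma = 3, 2, 0.9
T = np.zeros((n_states, n_actions, n_states))  # T[s, a, s'] = P(s' | s, a)
R = np.zeros((n_states, n_actions))            # R[s, a] = immediate reward

# Made-up deterministic dynamics: action 0 stays put, action 1 moves on.
for s in range(n_states):
    T[s, 0, s] = 1.0
    T[s, 1, (s + 1) % n_states] = 1.0
R[2, 1] = 1.0  # reward for leaving state 2 via action 1

V = np.zeros(n_states)
for _ in range(100):
    # Q(s, a) = R(s, a) + gamma * sum_{s'} T(s, a, s') * V(s')
    Q = R + gamma * T.dot(V)
    V = Q.max(axis=1)

policy = Q.argmax(axis=1)  # greedy policy w.r.t. the planned values
print(V, policy)
```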

Model-free RL

However, it turns out that we do not always need to model the environment to find an optimal policy. A classic example is Q-learning. Q-learning directly estimates the future return $Q(s, a)$: $Q(s_k, a_k)$ is an estimate of the expected sum of future rewards $E\left[\sum_{t=k}^{n}\gamma^{t-k}R_t\right]$ obtained after executing action $a_k$ in state $s_k$. The more accurate this Q-value estimate is, the more confidently we can choose the action in the current state $s_t$: simply pick the $a_t$ that maximizes $Q(s_t, a_t)$. The update target for the Q-values is defined by the Bellman equation, and the update itself can be done with methods such as TD (temporal-difference) learning. This is a value-iteration style of method; there are also policy-iteration-based methods, as well as actor-critic methods that combine value iteration and policy iteration. Basic policy-iteration methods typically update once per episode (Monte Carlo updates). Because none of these methods build a model of the environment, they are all model-free methods.
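
A minimal sketch of the tabular Q-learning (TD) update; the table sizes, step size, and sample transition are made-up values for illustration:

```python
import numpy as np

n_states, n_actions = 5, 2
alpha, gamma = 0.1, 0.99            # step size and discount factor (assumed)
Q = np.zeros((n_states, n_actions))

def q_learning_update(s, a, r, s_next, done):
    # Bellman target: r + gamma * max_a' Q(s', a'); no bootstrap at episode end.
    target = r + (0.0 if done else gamma * Q[s_next].max())
    # TD update: move Q(s, a) toward the target by step size alpha.
    Q[s, a] += alpha * (target - Q[s, a])

# One update with a made-up transition (s=0, a=1, r=1.0, s'=3).
q_learning_update(s=0, a=1, r=1.0, s_next=3, done=False)
```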

So, if you want to tell whether a reinforcement learning algorithm is model-based or model-free, ask yourself this question: before the agent executes an action, can it predict the next state and the reward? If it can, the method is model-based; if it cannot, it is model-free.

https://www.quora.com/What-is-the-difference-between-model-based-and-model-free-reinforcement-learning

States and Observations

A state $s$ is a complete description of the state of the world. There is no information about the world which is hidden from the state. An observation $o$ is a partial description of a state, which may omit information.

When the agent is able to observe the complete state of the environment, we say that the environment is fully observed. When the agent can only see a partial observation, we say that the environment is partially observed.

Action Spaces

action space: The set of all valid actions in a given environment.

discrete action space: only a finite number of moves are available to the agent.

continuous action space: actions are real-valued vectors, e.g. when the agent controls a robot in the physical world.
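
A small example of the two kinds of action spaces, assuming the Gym library's space classes (`gym.spaces.Discrete` and `gym.spaces.Box`):

```python
import numpy as np
import gym

# Discrete action space: a finite set of moves, e.g. {0, 1}.
discrete_space = gym.spaces.Discrete(2)
print(discrete_space.sample())      # an integer action

# Continuous action space: real-valued vectors, e.g. 3 torques in [-1, 1].
continuous_space = gym.spaces.Box(low=-1.0, high=1.0, shape=(3,), dtype=np.float32)
print(continuous_space.sample())    # a float vector action
```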

Policies

A policy is a rule used by an agent to decide which actions to take; the agent learns a policy that maximizes return.

Deterministic Policies: $a_t = \mu(s_t)$

Stochastic Policies: $a_t \sim \pi(\cdot \mid s_t)$

two kinds of stochastic policies:

  1. categorical policies: used in discrete action spaces.
  2. diagonal Gaussian policies: used in continuous action spaces.

two key computations (see the sketch after this list):

  1. sampling actions from the policy
  2. computing log-likelihoods of particular actions, $\log \pi_\theta(a \mid s)$
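
A minimal sketch of both computations for the two policy types, using PyTorch's distribution classes; the logits, mean, and std values are made-up placeholders for a policy network's outputs:

```python
import torch
from torch.distributions import Categorical, Normal

# Categorical policy (discrete actions): logits would come from a policy network.
logits = torch.tensor([0.1, 0.5, -0.2])     # made-up values
cat_policy = Categorical(logits=logits)
a = cat_policy.sample()                     # 1. sample an action
log_p = cat_policy.log_prob(a)              # 2. log pi_theta(a | s)

# Diagonal Gaussian policy (continuous actions): per-dimension mean and std.
mean, std = torch.zeros(3), torch.ones(3)
gauss_policy = Normal(mean, std)
a_cont = gauss_policy.sample()
log_p_cont = gauss_policy.log_prob(a_cont).sum(-1)  # sum over action dims
```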