An Overview of Reinforcement Learning


This overview is largely based on this article: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc.

 

Model-Based vs Model-Free

  • Model: a world model, i.e., structured information about the environment. This structure can be exploited for search and planning.
  • Model-free methods see the environment as a black box that only provides states and rewards as numbers; no extra information can be exploited.

On-Policy vs Off-Policy

[source: https://www.quora.com/Why-is-Q-Learning-deemed-to-be-off-policy-learning]

The main criterion: when updating Q, is the a' in Q(s',a') produced by the current actor given s', or is it an approximation like the max function in Q-learning? When a replay buffer is used, or when a' is generated by a target actor, the method is called off-policy; otherwise it is on-policy. This was my understanding after reading a lot of material. [update 0413] See below; I now have a better understanding: remember that the Q-function is trained with TD, so consecutive actions have a temporal order. In other words, to train Q we need to know which Q(s',a') should differ from the current Q(s,a) by exactly r. If the a' in that target matches the action the current policy would output, we are training Q to follow the current policy, so the method is on-policy. Otherwise, as in Q-learning, the current policy makes an epsilon-greedy choice but Q is trained assuming the next step is totally greedy; Q then does not match the current policy, so the method is off-policy.

When a replay buffer is involved, the a' it provides comes from some historical policy and generally differs from the action the current actor would return, so Q is not being trained to be the evaluation function of the current policy; hence the method is off-policy.

Summary: whether a method is on-policy or off-policy depends on whether, when training Q, the a' in Q(s',a') is the action the current actor would suggest; in other words, whether we are training Q to be the evaluation function of the current policy.
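Written out as the standard tabular update rules, the only difference between the two is which a' enters the target term:

```latex
% SARSA (on-policy): a' is the action the current policy actually takes at s'.
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma\, Q(s',a') - Q(s,a) \right]

% Q-learning (off-policy): a' is replaced by the greedy action at s'.
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]
```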

 

So the next question is: how should we choose between the two? The answer below is a useful reference, especially for its interpretation of 'take action'. In short, Q-learning (off-policy) learns the optimal policy directly but can be unstable and hard to converge; SARSA is more conservative, so it is worth considering when the cost of training (i.e., of actually taking actions) is high.

https://stats.stackexchange.com/questions/326788/when-to-choose-sarsa-vs-q-learning

One last question: today I realized that TD is inherently order-dependent: Q(s,a) for the sequence a -> a' -> a'' and Q(s,a) for a -> a'' -> a' may simply not have the same value. The idea behind TD needs further thought.

On-policy vs. Off-policy

An on-policy agent learns the value based on its current action a derived from the current policy, whereas its off-policy counterpart learns it based on the action a* obtained from another policy.

The reason that Q-learning is off-policy is that it updates its Q-values using the Q-value of the next state s′ and the greedy action a′. In other words, it estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy were followed despite the fact that it's not following a greedy policy.

The reason that SARSA is on-policy is that it updates its Q-values using the Q-value of the next state s′ and the current policy's action a′′. It estimates the return for state-action pairs assuming the current policy continues to be followed.

The distinction disappears if the current policy is a greedy policy. However, such an agent would not be good since it never explores.
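To make the distinction concrete, here is a minimal tabular sketch of the two TD targets; the state/action sizes, epsilon, and the single hand-written transition are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
Q = np.zeros((n_states, n_actions))
gamma, alpha, eps = 0.9, 0.1, 0.1

def epsilon_greedy(s):
    """Behavior policy actually used to act in the environment."""
    if rng.random() < eps:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

# One observed transition (s, a, r, s') collected while following epsilon_greedy:
s, a, r, s_next = 0, epsilon_greedy(0), 1.0, 1

# SARSA (on-policy): a' is whatever the CURRENT behavior policy picks at s'.
a_next = epsilon_greedy(s_next)
sarsa_target = r + gamma * Q[s_next, a_next]

# Q-learning (off-policy): a' is the greedy action, regardless of what will actually be taken.
q_learning_target = r + gamma * np.max(Q[s_next])

Q[s, a] += alpha * (sarsa_target - Q[s, a])   # swap in q_learning_target for Q-learning
```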

Policy Optimization (Policy Iteration)

f(state) -> action.

What action to take now?

  1. Policy Gradient. Think of playing a game 10,000 times and estimating the expected reward of f(s); then you can do gradient ascent on it (a minimal sketch follows this list).
  2. Trust Region Policy Optimization (TRPO). An on-policy method that does PG with a big step, but not too big: the step must stay within a trust region. This is ensured by a constraint on the difference between the current behavior (not the parameters) and the behavior of the target network. Both continuous and discrete action spaces are supported.
  3. Proximal Policy Optimization (PPO). Like TRPO it is an on-policy method, but it treats the constraint as a penalty (regularization). Popularized by OpenAI. It does PG while handling the gradients carefully to avoid performance issues and instability. It can be implemented and solved more easily than TRPO with similar performance, so it is preferred over TRPO.
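A minimal policy-gradient sketch on a toy 3-armed bandit; the bandit, softmax policy, learning rate, and running baseline are illustrative assumptions rather than anything from the article:

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.2, 0.5, 0.8])   # hidden mean reward of each arm (made up)
theta = np.zeros(3)                      # softmax logits = policy parameters
alpha, baseline = 0.1, 0.0

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(2000):
    probs = softmax(theta)
    a = rng.choice(3, p=probs)                     # "play the game" once
    r = rng.normal(true_means[a], 0.1)             # observed reward
    baseline = 0.99 * baseline + 0.01 * r          # running average of rewards
    grad_log_pi = -probs                           # d/dtheta of log softmax(theta)[a]
    grad_log_pi[a] += 1.0
    theta += alpha * (r - baseline) * grad_log_pi  # gradient ASCENT on expected reward

print(softmax(theta))  # most probability mass should end up on the best arm (index 2)
```

Subtracting the running baseline is the reward-centering point made in remark 2 below; in an episodic task the weight would also only count future rewards, as in remark 3.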

Remarks

  1. Why we need TRPO/PPO: each time the policy is updated, all previously collected samples become outdated, and it is too costly to regenerate all samples on every policy update. PPO allows old experiences to be reused, moving from on-policy towards off-policy.
  2. Rewards should be centered around 0. Since PG is based on sampling, if all rewards are positive, every sampled action gets its probability pushed up, so actions that happen not to be sampled see their probability shrink regardless of how good they are.
  3. For a given (s,a), only the discounted rewards collected afterwards should be counted (a small helper implementing remarks 2 and 3 is sketched after this list).
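A small helper that makes remarks 2 and 3 concrete: it computes the discounted reward-to-go for each step of one episode and then centers the result. The function name and the gamma value are illustrative assumptions:

```python
import numpy as np

def rewards_to_go(rewards, gamma=0.99):
    """Per-step discounted return g[t] = sum_{k>=t} gamma**(k-t) * rewards[k],
    shifted to zero mean so that both positive and negative weights occur."""
    g = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        g[t] = running
    return g - g.mean()

print(rewards_to_go([0.0, 0.0, 1.0], gamma=0.9))
# roughly [0.81, 0.90, 1.00] minus their mean
```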

Q-Learning (Value Iteration)

f(state, action) -> expected action value

Action-value function: how good is it if a particular action is taken?

  1. DQN. It seems to me to be applicable only to discrete action spaces. Variations:
    1. Double DQN is DQN where the online network selects the next action and the target network evaluates it, reducing overestimation.
    2. Dueling DQN is DQN with separate output heads for V(s) and A(s,a); then let Q(s,a) = V(s) + A(s,a). A sketch of this aggregation follows this list.
      1. Advantage: an update of V(s) influences Q(s,a) for every action, even actions that are not sampled. In practice,
        1. some normalization should be done,
        2. and constraints should be added on A, so that the network will not simply set V(s) to 0.
    3. Tricks: Prioritized Replay, which prefers samples with a large TD error.
  2. C51. Uses the distributional Bellman equation instead of considering only the EXPECTATION of future rewards.
  3. Distributional Reinforcement Learning with Quantile Regression (QR-DQN). Instead of returning the expected value of an action, it returns a distribution over returns; quantiles can then be used to identify the 'best' action.
  4. Hindsight Experience Replay. DQN with goals added to the input. Especially useful for sparse-reward settings, like some 1/0 games. Can also be combined with DDPG.
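A minimal sketch of the dueling aggregation mentioned above, using mean-subtraction as one common way to implement the constraint on A; the layer sizes and action count are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling DQN output head: Q(s,a) = V(s) + (A(s,a) - mean_a A(s,a)).
    Subtracting the mean of A is one common normalization that keeps the
    network from hiding the state value inside A (or setting V(s) to 0)."""
    def __init__(self, feature_dim=128, n_actions=4):
        super().__init__()
        self.value = nn.Linear(feature_dim, 1)               # V(s)
        self.advantage = nn.Linear(feature_dim, n_actions)   # A(s, a)

    def forward(self, features):
        v = self.value(features)                       # (batch, 1)
        a = self.advantage(features)                   # (batch, n_actions)
        return v + (a - a.mean(dim=1, keepdim=True))   # (batch, n_actions)

q_values = DuelingHead()(torch.randn(2, 128))   # dummy feature batch
print(q_values.shape)                           # torch.Size([2, 4])
```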

Hybrid

  1. DDPG
  2. A3C. Asynchronous: several agents are trained in parallel. Actor-Critic: policy gradient and Q-learning are combined (see the sketch after this list). Also check Soft Actor-Critic.
  3. TD3
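The actor-critic idea shared by these hybrids can be sketched in a few lines: a critic learns a value function by TD, and the actor takes a policy-gradient step weighted by the critic's advantage estimate. The networks, dimensions, and the single dummy transition below are illustrative assumptions, not any specific algorithm's implementation:

```python
import torch
import torch.nn as nn

# Illustrative networks: a discrete-action actor and a state-value critic.
actor = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 2))   # logits over 2 actions
critic = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 1))  # V(s)

s, s_next = torch.randn(1, 4), torch.randn(1, 4)   # dummy transition
r, gamma = 1.0, 0.99

dist = torch.distributions.Categorical(logits=actor(s))
a = dist.sample()

# Critic side (value-iteration flavour): the TD error doubles as an advantage estimate.
advantage = r + gamma * critic(s_next).detach() - critic(s)
critic_loss = advantage.pow(2).mean()

# Actor side (policy-gradient flavour): push up log-prob of actions with positive advantage.
actor_loss = -(dist.log_prob(a) * advantage.detach()).mean()

(critic_loss + actor_loss).backward()   # in A3C, several workers do this in parallel
```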

Model-Based

Can be used in control theory. The environment is modeled with assumptions and approximations.

  1. Learn the model, by supervised learning for instance: play the game, then train the world model on the collected experience (a minimal sketch follows this list).
    1. World Models: one of my favorite approaches, in which the agent can learn from its own "dreams" thanks to a Variational Autoencoder. See the paper and code.
    2. Imagination-Augmented Agents (I2A): learns to interpret predictions from a learned environment model to construct implicit plans in arbitrary ways, by using the predictions as additional context in deep policy networks. Basically it is a hybrid learning method because it combines model-based and model-free methods. Paper and implementation.
    3. Model-Based Priors for Model-Free Reinforcement Learning (MBMF): aims to bridge the gap between model-free and model-based reinforcement learning. See paper and code.
    4. Model-Based Value Expansion (MBVE): the authors of the paper state that this method controls for uncertainty in the model by only allowing imagination to a fixed depth. By enabling wider use of learned dynamics models within a model-free reinforcement learning algorithm, it improves value estimation, which in turn reduces the sample complexity of learning.
  2. Learn given the model
    1. Check AlphaGo Zero.
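A minimal sketch of "play the game, then train the world model" as a supervised regression problem; the network shape and the random tensors standing in for real gameplay logs are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Dynamics model: predict the next state from (state, action) by plain supervised regression.
state_dim, action_dim = 4, 2
model = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
                      nn.Linear(64, state_dim))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

# Pretend these transitions (s, a, s') were logged while playing the game.
s = torch.randn(256, state_dim)
a = torch.randn(256, action_dim)
s_next = torch.randn(256, state_dim)

for _ in range(100):
    pred = model(torch.cat([s, a], dim=1))
    loss = nn.functional.mse_loss(pred, s_next)   # ordinary supervised loss
    opt.zero_grad()
    loss.backward()
    opt.step()
# The learned model can then be rolled out ("dreamed" in) to plan or to train a policy.
```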

 
