An Overview of Reinforcement Learning


This overview is largely based on this article: https://medium.com/@SmartLabAI/reinforcement-learning-algorithms-an-intuitive-overview-904e2dff5bbc.

 

Model-based vs Model-free

  • Model: a world model, i.e. structured information about the environment; this structure can be exploited for search and planning.
  • Model-free methods see the environment as a black box that only provides states and rewards as numbers; no extra information can be exploited.

On-Policy vs Off-Policy

[source: https://www.quora.com/Why-is-Q-Learning-deemed-to-be-off-policy-learning]

The main criterion: when updating Q, is the a' in Q(s', a') produced by the current actor from s', or is it an approximation such as the max function in Q-learning? When a replay buffer is used, or when a' is generated by a target actor, the method is called off-policy; otherwise it is on-policy. This is my current understanding after reading a lot of material. [update 0413] See the quoted explanation below; I now understand it differently: remember that the Q-function is trained with TD, so consecutive actions come in a fixed order. In other words, to train Q we need to know which Q(?) differs from the current Q by exactly r. If that ? matches the action the current policy would output, then we are training Q to follow the current policy, so the method is on-policy. Otherwise, as in Q-learning, the current policy gives an epsilon-greedy choice but training Q assumes the next step is totally greedy; Q then disagrees with the current policy, so the method is off-policy.

When a replay buffer is involved, the a' it provides comes from some historical policy and may differ from the action the current actor would return, so Q is not being trained into an evaluation function of the current policy; hence the method is off-policy.

Summary: whether a method is on- or off-policy depends on whether, when training Q, the a' in Q(s', a') is the same action the current actor would suggest. In other words, are we training Q to be the evaluation function of the current policy?

 

The next question: how should one choose between the two? The answer below is a useful reference, especially for its reading of 'take action'. In short, Q-learning, being off-policy, learns the optimal policy directly but may be unstable and hard to converge; SARSA is more conservative, so consider it when mistakes during training are costly.

https://stats.stackexchange.com/questions/326788/when-to-choose-sarsa-vs-q-learning

One remaining doubt: today I realized that TD inherently carries a temporal order: Q(s,a) for the trajectory a -> a' -> a'' and Q(s,a) for a -> a'' -> a' may simply not agree. The idea behind TD needs further thought.

On-policy vs. Off-policy

An on-policy agent learns the value based on its current action a derived from the current policy, whereas its off-policy counterpart learns it based on the action a* obtained from another policy.

The reason that Q-learning is off-policy is that it updates its Q-values using the Q-value of the next state s′ and the greedy action a′. In other words, it estimates the return (total discounted future reward) for state-action pairs assuming a greedy policy were followed despite the fact that it's not following a greedy policy.

The reason that SARSA is on-policy is that it updates its Q-values using the Q-value of the next state s′ and the current policy's action a′′. It estimates the return for state-action pairs assuming the current policy continues to be followed.

The distinction disappears if the current policy is a greedy policy. However, such an agent would not be good since it never explores.
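To make the distinction concrete, here is a minimal tabular sketch; the dict-based Q table, the epsilon_greedy helper, and the hyperparameters are illustrative assumptions, not taken from the sources above. The only difference between the two updates is which a' enters the TD target.

```python
import numpy as np
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps=0.1):
    # Behavior policy used by BOTH algorithms to act in the environment.
    if np.random.rand() < eps:
        return np.random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99, eps=0.1):
    # On-policy: a' is what the current (epsilon-greedy) policy actually does in s'.
    a_next = epsilon_greedy(Q, s_next, actions, eps)
    td_target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])
    return a_next  # SARSA reuses this action as the next step's action

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Off-policy: a' is the greedy argmax, regardless of what the behavior policy will do.
    td_target = r + gamma * max(Q[(s_next, a_)] for a_ in actions)
    Q[(s, a)] += alpha * (td_target - Q[(s, a)])

Q = defaultdict(float)  # Q[(state, action)] -> value, initialized to 0
```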

Policy Optimization (Policy Iteration)

f(state) -> action.

What action to take now?

  1. Policy Gradient. Think of playing a game 10,000 times to estimate the expected reward under f(s); then you can do gradient ascent on it.
  2. Trust Region Policy Optimization (TRPO). An on-policy method that does PG with a big step, but not too big: the update stays within the trust region. This is ensured by a constraint on the difference between the current behavior (not the parameters) and the behavior of the target network. Both continuous and discrete action spaces are supported.
  3. Proximal Policy Optimization (PPO). Like TRPO an on-policy method, but it treats that constraint as a penalty (regularization). Popularized by OpenAI. It does PG while adjusting the updates smartly to avoid performance issues and instability. It can be implemented and solved more easily than TRPO with similar performance, so it is usually preferred over TRPO; a minimal sketch of the clipped objective follows this list.
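As a rough illustration of how PPO turns TRPO's hard constraint into something cheap to optimize, here is a minimal sketch of the clipped surrogate loss in PyTorch; the tensor names and the clip range of 0.2 are illustrative assumptions, not taken from the article above.

```python
import torch

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    # Probability ratio between the current policy and the policy that collected the data.
    ratio = torch.exp(log_probs_new - log_probs_old)
    # Clipping keeps the update 'proximal': beyond 1 +/- clip_eps the objective is flat,
    # which plays the role of TRPO's trust region without a constrained solver.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()  # minimize the negative surrogate
```

Because the loss depends only on the ratio to the old policy, the same batch of old experiences can be reused for several gradient steps, which is the point of remark 1 below.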

Remarks

  1. Why TRPO/PPO were invented: each time the policy is updated, all previously collected samples become outdated, and it is too costly to regenerate them on every update. PPO allows old experiences to be reused, moving from a purely on-policy setting toward an off-policy one.
  2. Rewards should be centered around 0 (e.g. by subtracting a baseline). Since PG is based on sampling, if all rewards are positive, actions that happen not to be sampled have their probabilities pushed down relative to the sampled ones.
  3. For a given (s, a), only the discounted reward collected afterwards should be counted (the reward-to-go); see the sketch below.
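Remarks 2 and 3 can be written down in a few lines; this is a minimal sketch assuming per-episode reward lists and a simple mean baseline, both of which are illustrative choices.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    # Remark 3: for each step t, only rewards from t onward count (the reward-to-go).
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        out.append(G)
    return np.array(out[::-1])

def centered_weights(rewards, gamma=0.99):
    # Remark 2: subtract a baseline so the weights are centered around 0; otherwise every
    # sampled action is pushed up and actions that were never sampled only lose probability.
    returns = discounted_returns(rewards, gamma)
    return returns - returns.mean()
```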

Q-Learning (Value Iteration)

f(state, action) -> expected action value

Action-value function: how good is it if a particular action is taken?

  1. DQN. Seems to me only applicable to discrete action spaces. Variations:
    1. Double DQN: uses the online network to select the next action and the target network to evaluate it, reducing over-estimation.
    2. Dueling DQN: DQN with separate outputs for V(s) and A(s,a), then Q(s,a) = V(s) + A(s,a).
      1. Advantage: an update of V(s) influences Q(s,a) for every action, even actions that were not sampled. In practice (see the sketch after this list),
        1. some normalization should be done;
        2. also add a constraint on A (e.g. subtract its mean) so that the network will not simply set V(s) to 0.
    3. Tricks: Prioritized Replay, which prefers samples with a large TD error.
  2. C51. Uses the distributional Bellman equation instead of considering only the EXPECTATION of future rewards.
  3. Distributional Reinforcement Learning with Quantile Regression (QR-DQN). Instead of returning the expected value of an action, it returns a distribution; quantiles of that distribution can then be used to identify the 'best' action.
  4. Hindsight Experience Replay. DQN with goals added to the input. Especially useful for sparse-reward settings, like some 1/0 games. Can also be combined with DDPG.
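As referenced in the Dueling DQN item above, here is a minimal PyTorch sketch of a dueling head; the layer sizes and the mean-subtraction form of the constraint on A are common choices and should be read as assumptions, not as the one canonical implementation.

```python
import torch
import torch.nn as nn

class DuelingHead(nn.Module):
    # Shared trunk, then separate streams for V(s) and A(s, a).
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)              # V(s)
        self.advantage = nn.Linear(hidden, n_actions)  # A(s, a)

    def forward(self, obs):
        h = self.trunk(obs)
        v = self.value(h)
        a = self.advantage(h)
        # Subtracting the mean advantage is the constraint mentioned above: it keeps V and A
        # identifiable so the network cannot push everything into A and set V(s) to 0.
        return v + a - a.mean(dim=-1, keepdim=True)
```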

Hybrid

  1. DDPG (Deep Deterministic Policy Gradient): an actor-critic method for continuous action spaces.
  2. A3C. Asynchronous: several agents are trained in parallel. Actor-Critic: policy gradient and Q-learning are combined. Also check Soft Actor-Critic.
  3. TD3 (Twin Delayed DDPG): an improved successor of DDPG.

Model-based

Can be used in control theory settings, where the environment is described through assumptions and approximations.

  1. Learn the model, e.g. by supervised learning: play the game, then train the world model on the collected transitions (a minimal sketch follows this list).
    1. World Models: one of my favorite approaches, in which the agent can learn from its own "dreams" thanks to a Variational Auto-encoder. See paper and code.
    2. Imagination-Augmented Agents (I2A): learns to interpret predictions from a learned environment model to construct implicit plans in arbitrary ways, using the predictions as additional context in deep policy networks. Basically it is a hybrid method because it combines model-based and model-free approaches. Paper and implementation.
    3. Model-Based Priors for Model-Free Reinforcement Learning (MBMF): aims to bridge the gap between model-free and model-based reinforcement learning. See paper and code.
    4. Model-Based Value Expansion (MBVE): the authors state that this method controls for uncertainty in the model by only allowing imagination to a fixed depth. By enabling wider use of learned dynamics models within a model-free reinforcement learning algorithm, value estimation improves, which in turn reduces the sample complexity of learning.
  2. Learn given the model
    1. Check AlphaGo Zero.
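As mentioned in item 1, the world model can be fitted by plain supervised learning on logged transitions. Below is a minimal sketch with made-up random data standing in for real gameplay logs; the regressor and its sizes are arbitrary illustrative choices.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Hypothetical logged transitions (s, a, s') collected by playing the game.
states = np.random.randn(1000, 4)
actions = np.random.randn(1000, 1)
next_states = states + 0.1 * actions  # stand-in for real environment dynamics

# Supervised learning of the world model: predict s' from (s, a).
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
model.fit(np.hstack([states, actions]), next_states)

# The learned model can then be rolled out for planning or "dreaming".
predicted_next = model.predict(np.hstack([states[:5], actions[:5]]))
```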

 
