RL by Tsitsiklis [notes to be completed]

The lecture of John Tsitsiklis: Reinforcement learning
https://www.youtube.com/watch?v=fbmAsxbLal0

The value function indicates the value of the current state:
$$V^*(s_t) = \min_\pi \mathbb{E}\Big[\sum_{i=t}^{T}\gamma^{\,i-t}\,\mathrm{cost}(\pi, s_i)\Big]$$
$$= \min_a \Big(\mathrm{cost}(a,s) + \gamma \sum_{s'} P(s'\mid s,a)\,V^*(s')\Big) \quad \text{(Bellman's equation)}$$
This is a minimization over policies (functions); in the end, what we want is just the best policy.
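As a concrete illustration of the Bellman backup, here is a minimal tabular value-iteration sketch; the toy MDP, its sizes, and the stopping tolerance are made-up assumptions for illustration, not anything from the lecture:

```python
import numpy as np

# Assumed toy MDP: P[a, s, s'] is the transition probability, cost[s, a] the immediate cost.
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # shape (a, s, s')
cost = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

V = np.zeros(n_states)
for _ in range(500):
    # Bellman backup: Q[s, a] = cost(s, a) + gamma * sum_s' P(s'|s,a) V(s')
    Q = cost + gamma * np.einsum("asn,n->sa", P, V)
    V_new = Q.min(axis=1)          # minimize over actions (cost formulation)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmin(axis=1)          # greedy policy w.r.t. the converged V*
```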

  • Some instances can be solved in polynomial time by reformulating them as an LP (see the multi-armed bandit problem, Gittins 1979)
  • The curse of dimensionality
  • Intractable -> APPROXIMATE it (the value function), linearly or not (neural network).
    3 approaches:
  • Policy network
  • Value network
    – From value to policy: $\min_a \big(c(a,s) + \mathbb{E}[V(s')]\big)$ (see the sketch after this list)
    – Look ahead: Monte-Carlo Tree Search
  • Actor-Critic methods: given $\pi$, learn $V$, then use $V$ to help improve $\pi$
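
One way to read the "from value to policy" step: given an approximate value function, pick the action minimizing $c(a,s) + \gamma\,\mathbb{E}[V(s')]$, estimating the expectation by sampling. The helpers below (`cost_fn`, `sample_next_states`, `V_hat`) are hypothetical interfaces introduced only for this sketch:

```python
import numpy as np

def greedy_action(s, actions, cost_fn, sample_next_states, V_hat, gamma=0.9, n_samples=32):
    """From value to policy: argmin_a [ cost_fn(a, s) + gamma * E[V_hat(s')] ].

    cost_fn(a, s), sample_next_states(s, a, n) and V_hat(s') are assumed,
    hypothetical interfaces; the expectation is estimated by Monte-Carlo sampling.
    """
    best_a, best_q = None, float("inf")
    for a in actions:
        successors = sample_next_states(s, a, n_samples)
        q = cost_fn(a, s) + gamma * np.mean([V_hat(sp) for sp in successors])
        if q < best_q:
            best_a, best_q = a, q
    return best_a
```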

Approximate Policy Iteration

  • given $\pi_0$, run simulations to get the global reward
  • from those simulated returns, train the value function $V$ (with a NN or whatever)
  • now, from $V$, update $\pi$ (see the sketch below)
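A minimal self-contained sketch of this loop on a random toy MDP; the MDP, the Monte-Carlo value estimate, and all the sizes here are assumptions for illustration (the lecture leaves the choice of approximator open):

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nA, nS))   # P[a, s, s']
cost = rng.uniform(size=(nS, nA))

def rollout(pi, s0, horizon=50):
    """Simulate policy pi from state s0 and return the discounted total cost."""
    total, s = 0.0, s0
    for t in range(horizon):
        a = pi[s]
        total += (gamma ** t) * cost[s, a]
        s = rng.choice(nS, p=P[a, s])
    return total

pi = rng.integers(nA, size=nS)                  # pi_0: an arbitrary initial policy
for _ in range(20):
    # 1. Simulate pi to estimate the cost-to-go from every state (the "global reward").
    V_hat = np.array([np.mean([rollout(pi, s) for _ in range(30)]) for s in range(nS)])
    #    (In a large state space, this is where a NN would be fit to (state, return) pairs.)
    # 2. Update pi greedily with respect to the estimated value function.
    Q = cost + gamma * np.einsum("asn,n->sa", P, V_hat)
    pi = Q.argmin(axis=1)
```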

Discrete actions -> oscillations:
incremental methods: update $V$ little by little.
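
One standard reading of "update $V$ little by little" (a damped update; the notes do not spell it out further) is to blend the Bellman backup into the current estimate with a small step size $\alpha$:

$$V_{k+1}(s) = (1-\alpha)\,V_k(s) + \alpha \min_a\Big(\mathrm{cost}(a,s) + \gamma\sum_{s'}P(s'\mid s,a)\,V_k(s')\Big)$$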
