RL by Tsitsiklis [notes to be completed]

The lecture of John Tsitsiklis: Reinforcement learning
https://www.youtube.com/watch?v=fbmAsxbLal0

The value function indicates the value of the current state:
$$V^*(s_t) = \min_\pi \mathbb{E}\Big[\sum_{i=t}^{T}\gamma^{\,i-t}\,\mathrm{cost}(\pi, s_i)\Big]$$
$$= \min_a \Big(\mathrm{cost}(a,s) + \gamma \sum_{s'} P(s'\mid s,a)\,V^*(s')\Big) \quad \text{(Bellman's equation)}$$
This is a minimization over policies (functions); in the end, what we want is just the best policy.
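As a concrete illustration of the Bellman backup, here is a minimal tabular value-iteration sketch; the toy MDP, its sizes, and the stopping tolerance are made-up assumptions for illustration, not anything from the lecture:

```python
import numpy as np

# Assumed toy MDP: P[a, s, s'] is the transition probability, cost[s, a] the immediate cost.
n_states, n_actions, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_actions, n_states))  # shape (a, s, s')
cost = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

V = np.zeros(n_states)
for _ in range(500):
    # Bellman backup: Q[s, a] = cost(s, a) + gamma * sum_s' P(s'|s,a) V(s')
    Q = cost + gamma * np.einsum("asn,n->sa", P, V)
    V_new = Q.min(axis=1)          # minimize over actions (cost formulation)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new

policy = Q.argmin(axis=1)          # greedy policy w.r.t. the converged V*
```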

  • Some instances can be solved in polynomial time by reformulating them as an LP (see the multi-armed bandit problem, Gittins 1979)
  • The curse of dimensionality
  • Intractable -> APPROXIMATE it (the value function), linearly or not (neural network).
    3 approaches:
  • Policy network
  • Value network
    – From value to policy: $\min_a \big(c(a,s) + \mathbb{E}[V(s')]\big)$ (see the sketch after this list)
    – Look ahead: Monte-Carlo Tree Search
  • Actor-Critic methods: given $\pi$, learn $V$, then use $V$ to help improve $\pi$
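
One way to read the "from value to policy" step: given an approximate value function, pick the action minimizing $c(a,s) + \gamma\,\mathbb{E}[V(s')]$, estimating the expectation by sampling. The helpers below (`cost_fn`, `sample_next_states`, `V_hat`) are hypothetical interfaces introduced only for this sketch:

```python
import numpy as np

def greedy_action(s, actions, cost_fn, sample_next_states, V_hat, gamma=0.9, n_samples=32):
    """From value to policy: argmin_a [ cost_fn(a, s) + gamma * E[V_hat(s')] ].

    cost_fn(a, s), sample_next_states(s, a, n) and V_hat(s') are assumed,
    hypothetical interfaces; the expectation is estimated by Monte-Carlo sampling.
    """
    best_a, best_q = None, float("inf")
    for a in actions:
        successors = sample_next_states(s, a, n_samples)
        q = cost_fn(a, s) + gamma * np.mean([V_hat(sp) for sp in successors])
        if q < best_q:
            best_a, best_q = a, q
    return best_a
```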

Approximate Policy Iteration

  • given $\pi_0$, run simulations to get the global reward
  • from those simulated returns, train the value function $V$ (with a NN or whatever)
  • now, from $V$, update $\pi$ (see the sketch below)
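A minimal self-contained sketch of this loop on a random toy MDP; the MDP, the Monte-Carlo value estimate, and all the sizes here are assumptions for illustration (the lecture leaves the choice of approximator open):

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nA, nS))   # P[a, s, s']
cost = rng.uniform(size=(nS, nA))

def rollout(pi, s0, horizon=50):
    """Simulate policy pi from state s0 and return the discounted total cost."""
    total, s = 0.0, s0
    for t in range(horizon):
        a = pi[s]
        total += (gamma ** t) * cost[s, a]
        s = rng.choice(nS, p=P[a, s])
    return total

pi = rng.integers(nA, size=nS)                  # pi_0: an arbitrary initial policy
for _ in range(20):
    # 1. Simulate pi to estimate the cost-to-go from every state (the "global reward").
    V_hat = np.array([np.mean([rollout(pi, s) for _ in range(30)]) for s in range(nS)])
    #    (In a large state space, this is where a NN would be fit to (state, return) pairs.)
    # 2. Update pi greedily with respect to the estimated value function.
    Q = cost + gamma * np.einsum("asn,n->sa", P, V_hat)
    pi = Q.argmin(axis=1)
```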

Discrete actions -> oscillations:
incremental methods: update $V$ little by little.
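
One standard reading of "update $V$ little by little" (a damped update; the notes do not spell it out further) is to blend the Bellman backup into the current estimate with a small step size $\alpha$:

$$V_{k+1}(s) = (1-\alpha)\,V_k(s) + \alpha \min_a\Big(\mathrm{cost}(a,s) + \gamma\sum_{s'}P(s'\mid s,a)\,V_k(s')\Big)$$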
