Lecture by John Tsitsiklis: Reinforcement Learning
https://www.youtube.com/watch?v=fbmAsxbLal0
The value function gives the value (optimal cost-to-go) of the current state. Bellman's equation:
    V*(s) = min_a [ g(s,a) + α Σ_{s'} p(s'|s,a) V*(s') ]
It is a minimization over policies (functions). And what we want in the end is just the best policy.
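A minimal sketch of solving Bellman's equation by value iteration on an invented toy MDP (cost minimization, matching the lecture's convention); the transition probabilities and costs are random placeholders, not from the lecture:

```python
import numpy as np

# Toy MDP: P[a][s][s'] transition probabilities, g[s][a] per-step costs
# (both invented here), alpha the discount factor.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)          # each row sums to 1
g = rng.random((n_states, n_actions))      # immediate costs
alpha = 0.9

V = np.zeros(n_states)
for _ in range(500):
    # Q(s,a) = g(s,a) + alpha * sum_s' p(s'|s,a) V(s')
    Q = g + alpha * np.einsum('asj,j->sa', P, V)
    V_new = Q.min(axis=1)                  # minimize over actions
    if np.max(np.abs(V_new - V)) < 1e-10:  # converged to the fixed point
        break
    V = V_new

policy = Q.argmin(axis=1)                  # greedy (best) policy
```

Value iteration converges here because the Bellman operator is a contraction with modulus alpha.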
- Some problems can be solved in poly time by reformulating them as an LP (check the Gittins index for MABs, Gittins 1979)
- The curse of dimensionality
- Intractable -> APPROXIMATE the value function, linearly or not (e.g. with a neural network)
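A sketch of the linear case: represent V(s) ≈ θᵀφ(s) and fit the weights by least squares. The features and the "true" target values below are invented for illustration:

```python
import numpy as np

# Linear value-function approximation: V(s) ~ theta . phi(s).
rng = np.random.default_rng(1)
states = rng.random((100, 2))                            # 100 sampled states
phi = np.hstack([states, states**2, np.ones((100, 1))])  # hand-picked features
true_V = np.sin(states[:, 0]) + states[:, 1]             # invented targets

# Least-squares fit of the weight vector theta
theta, *_ = np.linalg.lstsq(phi, true_V, rcond=None)
V_hat = phi @ theta                                      # approximate values
```

Swapping phi for a neural network gives the nonlinear version; the fitting step then becomes gradient descent on the same squared error.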
3 approaches:
- Policy network
- Value network
- From value to policy:
– Look-ahead: Monte-Carlo Tree Search
– Actor-Critic methods: given a policy, learn its value function, then use it to improve the policy
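A minimal actor-critic sketch on an invented one-state, two-action problem (the smallest case that shows the interaction): the critic learns a value baseline V, and the actor adjusts softmax action preferences using the critic's TD error. All numbers (costs, learning rates) are made up:

```python
import numpy as np

rng = np.random.default_rng(4)
cost_mean = np.array([0.8, 0.2])       # invented expected cost per action
prefs = np.zeros(2)                    # actor: action preferences (logits)
V = 0.0                                # critic: value estimate
lr_actor, lr_critic = 0.05, 0.1

for _ in range(3000):
    probs = np.exp(prefs - prefs.max()); probs /= probs.sum()
    a = rng.choice(2, p=probs)
    reward = -(cost_mean[a] + 0.1 * rng.standard_normal())  # reward = -cost
    td_error = reward - V                  # advantage w.r.t. the baseline
    V += lr_critic * td_error              # critic: move V toward the target
    grad = -probs.copy(); grad[a] += 1.0   # d log pi(a) / d prefs (softmax)
    prefs += lr_actor * td_error * grad    # actor: policy-gradient step
```

The actor ends up preferring the cheaper action; the critic's baseline is what keeps the gradient signal low-variance.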
Approximate Policy Iteration:
- given a policy π, do simulations to get the global reward
- from the loss, train the value function (by NN or whatever)
- now, from the value function, update π
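The loop above can be sketched on an invented toy MDP (cost minimization); here "training the value function" is reduced to averaging Monte-Carlo returns per state, and the improvement step is a greedy one-step lookahead:

```python
import numpy as np

# Toy MDP, all numbers invented for illustration.
rng = np.random.default_rng(2)
nS, nA, alpha = 3, 2, 0.9
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
g = rng.random((nS, nA))

def rollout(policy, s, horizon=60):
    """Simulate the policy; return the discounted cost from state s."""
    total, disc = 0.0, 1.0
    for _ in range(horizon):
        a = policy[s]
        total += disc * g[s, a]
        s = rng.choice(nS, p=P[a, s])
        disc *= alpha
    return total

policy = np.zeros(nS, dtype=int)
for _ in range(10):                       # a few policy-iteration rounds
    # 1) evaluate: simulate to estimate V_pi(s) for every state
    V = np.array([np.mean([rollout(policy, s) for _ in range(200)])
                  for s in range(nS)])
    # 2) improve: greedy one-step lookahead w.r.t. the estimated V
    Q = g + alpha * np.einsum('asj,j->sa', P, V)
    policy = Q.argmin(axis=1)
```

With a large state space, step 1 would fit a function approximator to the sampled returns instead of averaging per state.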
Discrete actions -> oscillations:
incremental methods: update V little by little.
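One incremental scheme is a TD(0)-style update, nudging V by a small step after every transition instead of refitting it wholesale; the chain, costs, and step size below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
nS, alpha, lr = 5, 0.9, 0.1
V = np.zeros(nS)
s = 0
for _ in range(5000):
    cost = rng.random()                  # invented per-step cost in [0, 1]
    s_next = rng.integers(nS)            # invented transition
    td_error = cost + alpha * V[s_next] - V[s]
    V[s] += lr * td_error                # small incremental step on V
    s = s_next
```

The small step size lr is what damps the oscillations: each sample moves V only a fraction of the way toward its current target.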