1. MDPs (Markov Decision Processes)
finite MDP: finite state space and finite action space
transition probabilities: $p(s' \mid s, a) = \Pr\{S_{t+1} = s' \mid S_t = s, A_t = a\}$
expected reward: $r(s, a, s') = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s']$
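For concreteness, a finite MDP can be stored as plain arrays. The sketch below is an illustration (not from the original notes): it assumes numpy arrays P[s, a, s'] for $p(s' \mid s, a)$ and R[s, a, s'] for $r(s, a, s')$, and builds a tiny 2-state, 2-action MDP that the later code sketches can reuse.

import numpy as np

# Hypothetical 2-state, 2-action MDP used for illustration only.
# P[s, a, s'] = p(s' | s, a), R[s, a, s'] = r(s, a, s').
n_states, n_actions = 2, 2
P = np.zeros((n_states, n_actions, n_states))
R = np.zeros((n_states, n_actions, n_states))

# action 0 = stay in the current state, action 1 = move to the other state
P[0, 0, 0] = 1.0
P[0, 1, 1] = 1.0
P[1, 0, 1] = 1.0
P[1, 1, 0] = 1.0

# reward +1 whenever the next state is state 1, reward 0 otherwise
R[:, :, 1] = 1.0

assert np.allclose(P.sum(axis=2), 1.0)  # each p(. | s, a) must sum to 1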
2. Value Functions
state-value function $v_\pi(s)$: the expected return when starting in state $s$ and following policy $\pi$ thereafter
action-value function $q_\pi(s, a)$: the expected return when starting in $s$, taking action $a$, and following $\pi$ thereafter
$G_t$: return (cumulative discounted reward) following time $t$
$R_t$: reward at time $t$, dependent, like $S_t$, on $A_{t-1}$ and $S_{t-1}$
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$, where $\gamma$ is the discount rate, $0 \le \gamma \le 1$
$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ and $q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$
Bellman equation for $v_\pi$:
$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\,[\,r(s, a, s') + \gamma v_\pi(s')\,]$
The Bellman equation is the basis for computing (learning) $v_\pi$.
Bellman equation for $q_\pi$:
$q_\pi(s, a) = \sum_{s'} p(s' \mid s, a)\,[\,r(s, a, s') + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\,]$
3. Policy Evaluation
policy evaluation: compute $v_\pi$ for a given policy $\pi$
iterative policy evaluation:
1. For each state $s$, the initial value $v_0(s)$ is chosen arbitrarily (the terminal state, if any, must be given value 0).
2. Successive approximations are obtained by using the Bellman equation for $v_\pi$ as an update rule:
$v_{k+1}(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_k(S_{t+1}) \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\,[\,r(s, a, s') + \gamma v_k(s')\,]$
code:
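A minimal sketch of iterative policy evaluation, assuming the array representation introduced above (P[s, a, s'], R[s, a, s']) and a stochastic policy given as pi[s, a] = $\pi(a \mid s)$; all names are illustrative.

import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Sweep the Bellman expectation backup until v changes by less than theta."""
    n_states, n_actions, _ = P.shape
    v = np.zeros(n_states)               # arbitrary v0; terminal states stay at 0
    while True:
        delta = 0.0
        for s in range(n_states):
            # v_{k+1}(s) = sum_a pi(a|s) sum_s' p(s'|s,a) [r(s,a,s') + gamma v_k(s')]
            v_new = sum(pi[s, a] * np.sum(P[s, a] * (R[s, a] + gamma * v))
                        for a in range(n_actions))
            delta = max(delta, abs(v_new - v[s]))
            v[s] = v_new                  # in-place update (sweep style)
        if delta < theta:
            return v

For example, the uniformly random policy on the toy MDP above can be evaluated with policy_evaluation(P, R, np.full((2, 2), 0.5)).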
4. Policy Improvement
policy improvement: use the value function of a policy to find a better policy
greedy policy:
The greedy policy takes the action that looks best in the short term, after one step of lookahead, according to $v_\pi$:
$\pi'(s) = \arg\max_a \sum_{s'} p(s' \mid s, a)\,[\,r(s, a, s') + \gamma v_\pi(s')\,]$
5. Policy Iteration & Value Iteration
policy iteration: alternate policy evaluation and policy improvement until the policy stops changing
code:
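A sketch of policy iteration with the same assumed array representation. For brevity this version evaluates each deterministic policy exactly by solving the linear Bellman system instead of sweeping, then improves greedily; function and variable names are illustrative.

import numpy as np

def policy_iteration(P, R, gamma=0.9):
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)                 # arbitrary initial deterministic policy
    while True:
        # policy evaluation: solve v = r_pi + gamma * P_pi v exactly
        P_pi = P[np.arange(n_states), policy]              # transition matrix under the policy
        r_pi = np.sum(P_pi * R[np.arange(n_states), policy], axis=1)
        v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # policy improvement: greedy one-step lookahead on v
        q = np.sum(P * (R + gamma * v), axis=2)            # q[s, a]
        new_policy = q.argmax(axis=1)
        if np.array_equal(new_policy, policy):             # policy stable, so it is optimal
            return policy, v
        policy = new_policy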
an example:
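As a purely illustrative usage of the sketch above, running policy_iteration on the toy 2-state MDP from section 1 yields the policy that moves to state 1 and then stays there:

policy, v = policy_iteration(P, R, gamma=0.9)
print(policy)   # [1 0]: from state 0 move to state 1, then keep staying in state 1
print(v)        # [10. 10.]: reward 1 per step forever, 1 / (1 - 0.9) = 10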
value iteration
Value iteration stops policy evaluation after a single sweep. It can be written as a particularly simple backup operation that combines the policy improvement step with one step of truncated policy evaluation, and it is often more efficient than full policy iteration:
$v_{k+1}(s) = \max_a \sum_{s'} p(s' \mid s, a)\,[\,r(s, a, s') + \gamma v_k(s')\,]$
code:
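A sketch of value iteration under the same assumed representation; the max over actions replaces the separate policy improvement step. Names are illustrative.

import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    n_states = P.shape[0]
    v = np.zeros(n_states)
    while True:
        # v_{k+1}(s) = max_a sum_s' p(s'|s,a) [r(s,a,s') + gamma v_k(s')]
        q = np.sum(P * (R + gamma * v), axis=2)   # q[s, a]
        v_new = q.max(axis=1)
        delta = np.max(np.abs(v_new - v))
        v = v_new
        if delta < theta:
            break
    # recover the greedy policy from the (near-)optimal v
    q = np.sum(P * (R + gamma * v), axis=2)
    policy = q.argmax(axis=1)
    return policy, v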
6. Q-Learning
Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        Choose a from s using policy derived from Q (e.g., ε-greedy)
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
        s ← s'
    until s is terminal
Evaluation:
The Q-table is updated at every step of every episode: Q(s, a) is the current entry in the table, and max_a' Q(s', a') is the current estimate of the best value obtainable from the next state s'.
decision policy:
ε-greedy: here ε = 0.9 is used as the probability of exploiting, so the action with the highest Q-value is chosen 90% of the time and a random action 10% of the time (in the more common convention ε denotes the exploration probability, which would correspond to ε = 0.1).
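A minimal tabular Q-learning sketch that follows the pseudocode above. It assumes a hypothetical environment object exposing reset() -> s and step(a) -> (s_next, reward, done); that interface and all names are assumptions, not from the original notes. Here epsilon is the exploration probability, so epsilon = 0.1 matches the "greedy 90% of the time" behaviour described above.

import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))      # initialize Q(s, a) arbitrarily (zeros here)
    rng = np.random.default_rng(0)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:                      # repeat for each step of the episode
            # choose a from s using the epsilon-greedy policy derived from Q
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)    # take action a, observe r, s'
            # Q(s,a) <- Q(s,a) + alpha [ r + gamma max_a' Q(s',a') - Q(s,a) ]
            target = r + gamma * np.max(Q[s_next]) * (not done)   # no bootstrap at terminal
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q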