Reinforcement Learning - An Introduction memo

1.MDP (Markov Decision Process)

finite MDP: finite state space & finite action space

transition probabilities: p(s′ | s, a) = Pr{St+1 = s′ | St = s, At = a}

expected reward: r(s, a, s′) = E[Rt+1 | St = s, At = a, St+1 = s′]
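
As a concrete illustration (the encoding below is my own, not from the book), a small finite MDP can be stored as numpy arrays indexed by (s, a, s′):

import numpy as np

# A made-up 2-state, 2-action MDP, used only to illustrate the data layout.
n_states, n_actions = 2, 2

# P[s, a, s'] = p(s' | s, a): transition probabilities; each P[s, a] sums to 1.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])

# R[s, a, s'] = r(s, a, s'): expected reward for the transition from s to s' under a.
R = np.array([[[ 1.0, 0.0], [0.0,  2.0]],
              [[-1.0, 0.0], [0.0, 10.0]]])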

2.Value Functions

state-value function: vπ(s) = Eπ[Gt | St = s] = Eπ[ Σ_{k=0}^{∞} γ^k Rt+k+1 | St = s ]

action-value function: qπ(s, a) = Eπ[Gt | St = s, At = a] = Eπ[ Σ_{k=0}^{∞} γ^k Rt+k+1 | St = s, At = a ]
Gt: return (cumulative discounted reward) following t
Rt: reward at t, dependent, like St, on At−1 and St−1
Gt = Σ_{k=0}^{∞} γ^k Rt+k+1
γ: discount factor, 0 ≤ γ ≤ 1

relation between vπ and qπ:
vπ(s) = Σ_a π(a|s) qπ(s, a)
qπ(s, a) = Σ_{s′} p(s′|s, a) [ r(s, a, s′) + γ vπ(s′) ]
π(a|s): probability of taking action a when in state s

Bellman Equation for vπ: vπ(s) = Σ_a π(a|s) Σ_{s′} p(s′|s, a) [ r(s, a, s′) + γ vπ(s′) ]
The Bellman equation is the basis for computing vπ: iterative methods use it as an update rule to learn vπ.

Bellman Equation for qπ: qπ(s, a) = Σ_{s′} p(s′|s, a) [ r(s, a, s′) + γ Σ_{a′} π(a′|s′) qπ(s′, a′) ]

3.Policy Evaluation

policy evaluation: compute the state-value function vπ for a given policy π
Iterative policy evaluation:
1. The initial approximation v0 is chosen arbitrarily (the terminal state, if any, must have value 0).
2. Each successive approximation is obtained by using the Bellman equation for vπ as an update rule:
vk+1(s) = Eπ[ Rt+1 + γ vk(St+1) | St = s ] = Σ_a π(a|s) Σ_{s′} p(s′|s, a) [ r(s, a, s′) + γ vk(s′) ]
code:
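
A minimal sketch of iterative policy evaluation, assuming the tabular arrays P[s, a, s′] = p(s′|s, a) and R[s, a, s′] = r(s, a, s′) from the illustration in section 1, plus a stochastic policy array pi[s, a] = π(a|s); all names here are my own, not the book's:

import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    # Iteratively apply the Bellman equation for v_pi until the value
    # function changes by less than theta in one full sweep.
    n_states, n_actions = P.shape[0], P.shape[1]
    v = np.zeros(n_states)                 # v0 chosen arbitrarily (all zeros here)
    while True:
        delta = 0.0
        for s in range(n_states):
            # v_{k+1}(s) = sum_a pi(a|s) sum_{s'} p(s'|s,a) [r(s,a,s') + gamma v_k(s')]
            new_v = sum(pi[s, a] * np.sum(P[s, a] * (R[s, a] + gamma * v))
                        for a in range(n_actions))
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v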

4.Policy Improvement

policy improvement: use the value function vπ of the current policy to construct a better policy, typically by making it greedy with respect to vπ

greedy policy π′: π′(s) = argmax_a qπ(s, a) = argmax_a Σ_{s′} p(s′|s, a) [ r(s, a, s′) + γ vπ(s′) ]
The greedy policy takes the action that looks best in the short term—after one step of lookahead—according to vπ .
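
As a sketch, the greedy policy π′ can be extracted from vπ with exactly this one-step lookahead over the same tabular P and R arrays (naming is my own):

import numpy as np

def greedy_policy(P, R, v, gamma=0.9):
    # pi'(s) = argmax_a sum_{s'} p(s'|s,a) [ r(s,a,s') + gamma * v(s') ]
    n_states, n_actions = P.shape[0], P.shape[1]
    q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            q[s, a] = np.sum(P[s, a] * (R[s, a] + gamma * v))
    return np.argmax(q, axis=1)            # deterministic greedy policy pi'(s)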

5.Policy Iteration & Value Iteration

policy iteration: alternate full policy evaluation and greedy policy improvement until the policy no longer changes; the sequence π0 → vπ0 → π1 → vπ1 → … converges to an optimal policy π* and the optimal value function v*

code:
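
One possible sketch of policy iteration over the tabular P and R arrays, alternating evaluation sweeps with greedy improvement until the policy is stable (again my own naming, not the book's code):

import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    # Alternate policy evaluation and greedy policy improvement until stable.
    n_states, n_actions = P.shape[0], P.shape[1]
    policy = np.zeros(n_states, dtype=int)   # start from an arbitrary deterministic policy
    v = np.zeros(n_states)
    while True:
        # Policy evaluation: sweep until v has converged for the current policy.
        while True:
            delta = 0.0
            for s in range(n_states):
                a = policy[s]
                new_v = np.sum(P[s, a] * (R[s, a] + gamma * v))
                delta = max(delta, abs(new_v - v[s]))
                v[s] = new_v
            if delta < theta:
                break
        # Policy improvement: act greedily (one-step lookahead) w.r.t. v.
        stable = True
        for s in range(n_states):
            q = np.array([np.sum(P[s, a] * (R[s, a] + gamma * v))
                          for a in range(n_actions)])
            best = int(np.argmax(q))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, v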

an example:
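
For instance, reusing the policy_iteration sketch above on the made-up 2-state, 2-action MDP from section 1 (numbers chosen purely for illustration):

import numpy as np

P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[ 1.0, 0.0], [0.0,  2.0]],
              [[-1.0, 0.0], [0.0, 10.0]]])

policy, v = policy_iteration(P, R, gamma=0.9)
print("optimal policy:", policy)       # action index chosen in each state
print("optimal values:", v)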

value iteration

Value iteration truncates policy evaluation to a single sweep and combines it with policy improvement in one simple backup operation, which is often more efficient than full policy iteration.

vk+1(s) = max_a E[ Rt+1 + γ vk(St+1) | St = s, At = a ] = max_a Σ_{s′} p(s′|s, a) [ r(s, a, s′) + γ vk(s′) ]

code:
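
A minimal value-iteration sketch over the same tabular P and R arrays (my own naming):

import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    # Repeatedly apply v_{k+1}(s) = max_a sum_{s'} p(s'|s,a) [ r(s,a,s') + gamma v_k(s') ].
    n_states, n_actions = P.shape[0], P.shape[1]
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            q = np.array([np.sum(P[s, a] * (R[s, a] + gamma * v))
                          for a in range(n_actions)])
            new_v = q.max()
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            break
    # Extract a deterministic greedy policy from the converged values.
    policy = np.array([np.argmax([np.sum(P[s, a] * (R[s, a] + gamma * v))
                                  for a in range(n_actions)])
                       for s in range(n_states)])
    return policy, v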

6.Q-Learning

Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        Choose a from s using policy derived from Q (e.g., ε-greedy)
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
        s ← s'
    until s is terminal
How the update works:

Each step updates one entry of the Q-table: Q(s, a) is the current value in the table, and max_a' Q(s', a') is the current best estimate of the value obtainable from the next state s'; the update moves Q(s, a) toward the target r + γ max_a' Q(s', a').

decision policy:

ε-greedy: choose the action with the highest Q-value most of the time and a random action otherwise; with a greedy fraction of 0.9 (i.e. ε = 0.1 in the usual convention, where ε is the exploration probability), the best action is chosen 90% of the time and a random action 10% of the time.
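
A tabular Q-learning sketch with ε-greedy action selection; here epsilon is the exploration probability (so 0.1 corresponds to the 90%-greedy setting above), and the environment interface env.reset() -> s and env.step(a) -> (s', r, done) is an assumption of this sketch, not something defined in the memo:

import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    # Tabular Q-learning; epsilon is the probability of taking a random action.
    Q = np.zeros((n_states, n_actions))      # initialize Q(s, a) arbitrarily (zeros here)
    for _ in range(n_episodes):
        s = env.reset()                      # assumed to return a state index
        done = False
        while not done:                      # loop over the steps of one episode
            # epsilon-greedy: explore with probability epsilon, otherwise act greedily.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)    # assumed environment interface
            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q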
