Reinforcement Learning - An Introduction memo

1.MDP (Markov Decision Process)

finite MDP: finite state space & finite action space

transition probabilities: p(s′ | s, a) = Pr{St+1 = s′ | St = s, At = a}

expected reward: r(s, a, s′) = E[Rt+1 | St = s, At = a, St+1 = s′]
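
As a concrete illustration (the encoding below is my own, not from the book), a small finite MDP can be stored as numpy arrays indexed by (s, a, s′):

import numpy as np

# A made-up 2-state, 2-action MDP, used only to illustrate the data layout.
n_states, n_actions = 2, 2

# P[s, a, s'] = p(s' | s, a): transition probabilities; each P[s, a] sums to 1.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])

# R[s, a, s'] = r(s, a, s'): expected reward for the transition from s to s' under a.
R = np.array([[[ 1.0, 0.0], [0.0,  2.0]],
              [[-1.0, 0.0], [0.0, 10.0]]])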

2.Value Functions

state-value function: vπ(s) = Eπ[Gt | St = s] = Eπ[ Σ_{k=0}^{∞} γ^k Rt+k+1 | St = s ]

action-value function: qπ(s, a) = Eπ[Gt | St = s, At = a] = Eπ[ Σ_{k=0}^{∞} γ^k Rt+k+1 | St = s, At = a ]
Gt: return (cumulative discounted reward) following t
Rt: reward at t, dependent, like St, on At−1 and St−1
Gt = Σ_{k=0}^{∞} γ^k Rt+k+1
γ: discount factor, 0 ≤ γ ≤ 1

relation between vπ and qπ:
vπ(s) = Σ_a π(a|s) qπ(s, a)
qπ(s, a) = Σ_{s′} p(s′|s, a) [ r(s, a, s′) + γ vπ(s′) ]
π(a|s): probability of taking action a when in state s

Bellman Equation for vπ: vπ(s) = Σ_a π(a|s) Σ_{s′} p(s′|s, a) [ r(s, a, s′) + γ vπ(s′) ]
The Bellman equation is the basis for computing vπ: iterative methods use it as an update rule to learn vπ.

Bellman Equation for qπ: qπ(s, a) = Σ_{s′} p(s′|s, a) [ r(s, a, s′) + γ Σ_{a′} π(a′|s′) qπ(s′, a′) ]

3.Policy Evaluation

policy evaluation: compute the state-value function vπ for a given policy π
Iterative policy evaluation:
1. The initial approximation v0 is chosen arbitrarily (the terminal state, if any, must have value 0).
2. Each successive approximation is obtained by using the Bellman equation for vπ as an update rule:
vk+1(s) = Eπ[ Rt+1 + γ vk(St+1) | St = s ] = Σ_a π(a|s) Σ_{s′} p(s′|s, a) [ r(s, a, s′) + γ vk(s′) ]
code:
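
A minimal sketch of iterative policy evaluation, assuming the tabular arrays P[s, a, s′] = p(s′|s, a) and R[s, a, s′] = r(s, a, s′) from the illustration in section 1, plus a stochastic policy array pi[s, a] = π(a|s); all names here are my own, not the book's:

import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    # Iteratively apply the Bellman equation for v_pi until the value
    # function changes by less than theta in one full sweep.
    n_states, n_actions = P.shape[0], P.shape[1]
    v = np.zeros(n_states)                 # v0 chosen arbitrarily (all zeros here)
    while True:
        delta = 0.0
        for s in range(n_states):
            # v_{k+1}(s) = sum_a pi(a|s) sum_{s'} p(s'|s,a) [r(s,a,s') + gamma v_k(s')]
            new_v = sum(pi[s, a] * np.sum(P[s, a] * (R[s, a] + gamma * v))
                        for a in range(n_actions))
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v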

4.Policy Improvement

policy improvement: use the value function vπ of the current policy to construct a better policy, typically by making it greedy with respect to vπ

greedy policy π′: π′(s) = argmax_a qπ(s, a) = argmax_a Σ_{s′} p(s′|s, a) [ r(s, a, s′) + γ vπ(s′) ]
The greedy policy takes the action that looks best in the short term—after one step of lookahead—according to vπ .
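
As a sketch, the greedy policy π′ can be extracted from vπ with exactly this one-step lookahead over the same tabular P and R arrays (naming is my own):

import numpy as np

def greedy_policy(P, R, v, gamma=0.9):
    # pi'(s) = argmax_a sum_{s'} p(s'|s,a) [ r(s,a,s') + gamma * v(s') ]
    n_states, n_actions = P.shape[0], P.shape[1]
    q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            q[s, a] = np.sum(P[s, a] * (R[s, a] + gamma * v))
    return np.argmax(q, axis=1)            # deterministic greedy policy pi'(s)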

5.Policy Iteration & Value Iteration

policy iteration: alternate full policy evaluation and greedy policy improvement until the policy no longer changes; the sequence π0 → vπ0 → π1 → vπ1 → … converges to an optimal policy π* and the optimal value function v*

code:
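
One possible sketch of policy iteration over the tabular P and R arrays, alternating evaluation sweeps with greedy improvement until the policy is stable (again my own naming, not the book's code):

import numpy as np

def policy_iteration(P, R, gamma=0.9, theta=1e-8):
    # Alternate policy evaluation and greedy policy improvement until stable.
    n_states, n_actions = P.shape[0], P.shape[1]
    policy = np.zeros(n_states, dtype=int)   # start from an arbitrary deterministic policy
    v = np.zeros(n_states)
    while True:
        # Policy evaluation: sweep until v has converged for the current policy.
        while True:
            delta = 0.0
            for s in range(n_states):
                a = policy[s]
                new_v = np.sum(P[s, a] * (R[s, a] + gamma * v))
                delta = max(delta, abs(new_v - v[s]))
                v[s] = new_v
            if delta < theta:
                break
        # Policy improvement: act greedily (one-step lookahead) w.r.t. v.
        stable = True
        for s in range(n_states):
            q = np.array([np.sum(P[s, a] * (R[s, a] + gamma * v))
                          for a in range(n_actions)])
            best = int(np.argmax(q))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, v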

an example:
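
For instance, reusing the policy_iteration sketch above on the made-up 2-state, 2-action MDP from section 1 (numbers chosen purely for illustration):

import numpy as np

P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[ 1.0, 0.0], [0.0,  2.0]],
              [[-1.0, 0.0], [0.0, 10.0]]])

policy, v = policy_iteration(P, R, gamma=0.9)
print("optimal policy:", policy)       # action index chosen in each state
print("optimal values:", v)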

value iteration

Value iteration truncates policy evaluation to a single sweep and combines it with policy improvement in one simple backup operation, which is often more efficient than full policy iteration.

vk+1(s) = max_a E[ Rt+1 + γ vk(St+1) | St = s, At = a ] = max_a Σ_{s′} p(s′|s, a) [ r(s, a, s′) + γ vk(s′) ]

code:
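
A minimal value-iteration sketch over the same tabular P and R arrays (my own naming):

import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    # Repeatedly apply v_{k+1}(s) = max_a sum_{s'} p(s'|s,a) [ r(s,a,s') + gamma v_k(s') ].
    n_states, n_actions = P.shape[0], P.shape[1]
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            q = np.array([np.sum(P[s, a] * (R[s, a] + gamma * v))
                          for a in range(n_actions)])
            new_v = q.max()
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            break
    # Extract a deterministic greedy policy from the converged values.
    policy = np.array([np.argmax([np.sum(P[s, a] * (R[s, a] + gamma * v))
                                  for a in range(n_actions)])
                       for s in range(n_states)])
    return policy, v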

6.Q-Learning

Initialize Q(s,a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        Choose a from s using policy derived from Q (e.g., ε-greedy)
        Take action a, observe r, s'
        Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
        s ← s'
    until s is terminal
How the update works:

Each step updates one entry of the Q-table: Q(s, a) is the current value in the table, and max_a' Q(s', a') is the current best estimate of the value obtainable from the next state s'; the update moves Q(s, a) toward the target r + γ max_a' Q(s', a').

decision policy:

ε-greedy: choose the action with the highest Q-value most of the time and a random action otherwise; with a greedy fraction of 0.9 (i.e. ε = 0.1 in the usual convention, where ε is the exploration probability), the best action is chosen 90% of the time and a random action 10% of the time.
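
A tabular Q-learning sketch with ε-greedy action selection; here epsilon is the exploration probability (so 0.1 corresponds to the 90%-greedy setting above), and the environment interface env.reset() -> s and env.step(a) -> (s', r, done) is an assumption of this sketch, not something defined in the memo:

import numpy as np

def q_learning(env, n_states, n_actions, n_episodes=500,
               alpha=0.1, gamma=0.9, epsilon=0.1):
    # Tabular Q-learning; epsilon is the probability of taking a random action.
    Q = np.zeros((n_states, n_actions))      # initialize Q(s, a) arbitrarily (zeros here)
    for _ in range(n_episodes):
        s = env.reset()                      # assumed to return a state index
        done = False
        while not done:                      # loop over the steps of one episode
            # epsilon-greedy: explore with probability epsilon, otherwise act greedily.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)    # assumed environment interface
            # Q-learning update: move Q(s, a) toward r + gamma * max_a' Q(s', a').
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            s = s_next
    return Q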
