Stanford RL CS234: Assignment 1 Summary

Original

牛蛙爹爹

2020-06-20 04:38

1 Optimal Policy for Simple MDP

(a) Sum the geometric series and take the limit.

(b) The official solution is already clear:

solution: If γ > 0, value of γ does not change the ordering of states, so the optimal policy is the same; however, the value of the value function depends on γ. If γ = 0 then, policy ∀s : π(s) = a0 is still an optimal policy; however, this is not the only optimal policy.

(c) Adding a constant does not affect the optimal policy, but it does change the value function.

solution: No effect on the optimal policy. Adding a constant c to all the rewards only changes the value of each state by a constant vc for any policy π:
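To see where such a constant can come from, here is a short derivation. It is only a sketch, assuming an infinite horizon with no terminating state and γ < 1; the symbol V_c^π for the value under the shifted rewards is my own notation, not the official solution's.

```latex
\begin{aligned}
V_c^{\pi}(s)
  &= \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\,(r_t + c) \,\Big|\, s_0 = s\Big] \\
  &= V^{\pi}(s) + c \sum_{t=0}^{\infty} \gamma^{t} \\
  &= V^{\pi}(s) + \frac{c}{1-\gamma},
  \qquad\text{so } v_c = \frac{c}{1-\gamma}.
\end{aligned}
```

The shift is the same for every state and every policy, so the ranking of policies is unchanged.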

(d) If a > 0, the optimal policy is unchanged;

if a = 0, every policy is optimal;

if a < 0, the optimal policy is to stay away from G.
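The three cases can be read off one formula. This is a sketch under my own assumptions: that part (d) replaces each reward r with a(r + c), that the horizon is infinite with no termination, and that γ < 1. V_new^π is my notation for the value under the transformed rewards.

```latex
\begin{aligned}
V_{\text{new}}^{\pi}(s)
  &= \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, a\,(r_t + c) \,\Big|\, s_0 = s\Big]
   = a\,V^{\pi}(s) + \frac{a\,c}{1-\gamma}, \\[4pt]
a > 0 &: \ \text{the ordering of policies by } V^{\pi} \text{ is preserved}, \\
a = 0 &: \ V_{\text{new}}^{\pi}(s) = 0 \text{ for every } \pi, \\
a < 0 &: \ \text{the ordering is reversed, so the best policy minimizes } V^{\pi}(s).
\end{aligned}
```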

2 Running Time of Value Iteration

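(a) Geometric series.

(b) A constant.

(c) The official answer seems to have some problems here.

1. Why is the subscript n+1? Also, the expression below it is clearly the expansion of Qn+2.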

3 Approximating the Optimal Value Function

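I think the most important thing is to understand the purpose of this problem, which the problem statement spells out clearly: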

This shows that if we compute an approximately optimal state-action value function and then extract the greedy policy for that approximate state-action value function, the resulting policy still does well in the real MDP.

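In other words, we prove that the greedy policy also performs well.

The expression Q∗(s, π(s)) means: take the action chosen by π at the current step, then follow the optimal policy from the next step onward.

(a) The proof is straightforward: rearrange the terms so that ||Q̃ − Q∗||∞ ≤ ε can be applied twice.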
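The two applications of the bound can be written out as follows. This is my own sketch, assuming π is the greedy policy with respect to Q̃ and π∗ denotes an optimal policy (π∗ is notation I introduce here).

```latex
\begin{aligned}
V^{*}(s) - Q^{*}(s, \pi(s))
  &= Q^{*}(s, \pi^{*}(s)) - Q^{*}(s, \pi(s)) \\
  &\le \big[\tilde{Q}(s, \pi^{*}(s)) + \varepsilon\big]
     - \big[\tilde{Q}(s, \pi(s)) - \varepsilon\big] \\
  &= \tilde{Q}(s, \pi^{*}(s)) - \tilde{Q}(s, \pi(s)) + 2\varepsilon \\
  &\le 2\varepsilon .
\end{aligned}
```

The last step uses greediness: π(s) maximizes Q̃(s, ·), so Q̃(s, π∗(s)) − Q̃(s, π(s)) ≤ 0.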

(b)

V ∗(s) − Vπ(s) = V ∗(s) − Q∗(s, π(s)) + Q∗(s, π(s)) − Vπ(s)

≤ 2ε + Q∗(s, π(s)) − Qπ(s, π(s))

= 2ε + γEs′ [V ∗(s′) − Vπ(s′)]

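The key step is this: the inequality holds for every state, so taking the expectation over s′ on both sides and solving for it gives Es′ [V ∗(s′) − Vπ(s′)] ≤ 2ε/(1−γ).

Substituting this back into the original expression gives the result.

The whole calculation is quite simple.

(d) My understanding of this part is that we need to show a policy can be found for which the bound holds with equality.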
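One way to make this step precise is to bound the worst-case gap over all states. This is a sketch assuming γ < 1 and bounded rewards, so that the gap is finite; Δ is notation I introduce here.

```latex
\begin{aligned}
\Delta &:= \max_{s}\,\big[V^{*}(s) - V^{\pi}(s)\big], \\
V^{*}(s) - V^{\pi}(s)
  &\le 2\varepsilon + \gamma\,\mathbb{E}_{s'}\big[V^{*}(s') - V^{\pi}(s')\big]
   \le 2\varepsilon + \gamma\,\Delta
   \quad\text{for all } s, \\
\Delta &\le 2\varepsilon + \gamma\,\Delta
  \;\Longrightarrow\;
  \Delta \le \frac{2\varepsilon}{1-\gamma}.
\end{aligned}
```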

4 Frozen Lake MDP

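This coding problem is a fairly interesting one; working through it carefully deepens your understanding of MDPs.

(a) policy_evaluation: given a policy, repeatedly update value_function until the difference between two consecutive value_function estimates falls below the threshold.

policy_improvement: in each state, greedily pick the action with the best one-step lookahead under the current value_function to get an improved policy.

policy_iteration: first compute value_function with policy_evaluation, then compute a new policy with policy_improvement, and repeat until the policy stops changing.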
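The three functions described above can be sketched as follows. This is not my assignment code: it is a minimal pure-Python version on a hypothetical 3-state chain, and it assumes the transition table has the layout P[s][a] = list of (probability, next_state, reward, terminal) tuples. The helper q_value and the parameter names gamma and tol are my own choices.

```python
# Hypothetical chain: state 0 -> state 1 -> goal 2; actions 0 = left, 1 = right.
# Assumed layout: P[s][a] = [(probability, next_state, reward, terminal), ...]
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}
nS, nA = 3, 2


def q_value(P, V, s, a, gamma):
    """One-step lookahead: expected reward plus discounted next-state value."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r, _done in P[s][a])


def policy_evaluation(P, nS, policy, gamma=0.9, tol=1e-3):
    """Sweep all states until two consecutive value estimates differ by < tol."""
    V = [0.0] * nS
    while True:
        V_new = [q_value(P, V, s, policy[s], gamma) for s in range(nS)]
        if max(abs(x - y) for x, y in zip(V_new, V)) < tol:
            return V_new
        V = V_new


def policy_improvement(P, nS, nA, V, gamma=0.9):
    """Pick the greedy action in every state (ties go to the lowest index)."""
    return [
        max(range(nA), key=lambda a: q_value(P, V, s, a, gamma))
        for s in range(nS)
    ]


def policy_iteration(P, nS, nA, gamma=0.9, tol=1e-3):
    """Alternate evaluation and improvement until the policy stops changing."""
    policy = [0] * nS
    while True:
        V = policy_evaluation(P, nS, policy, gamma, tol)
        new_policy = policy_improvement(P, nS, nA, V, gamma)
        if new_policy == policy:
            return V, policy
        policy = new_policy


V, policy = policy_iteration(P, nS, nA)
print(V, policy)
```

On this toy chain the loop settles on moving right in states 0 and 1, with values 0.9 and 1.0; the action reported for the terminal state is arbitrary.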

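(b) value_iteration: starting from V0, apply the greedy (max over actions) backup to compute V1, V2, ..., Vn, until the change between successive iterates is below the threshold.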
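Here is a minimal sketch of that loop, again not my assignment code. It uses a hypothetical 3-state chain and assumes the layout P[s][a] = list of (probability, next_state, reward, terminal) tuples; q_value, gamma and tol are names I picked for illustration.

```python
# Hypothetical chain: state 0 -> state 1 -> goal 2; actions 0 = left, 1 = right.
# Assumed layout: P[s][a] = [(probability, next_state, reward, terminal), ...]
P = {
    0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 2, 1.0, True)]},
    2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
}
nS, nA = 3, 2


def q_value(P, V, s, a, gamma):
    """One-step lookahead: expected reward plus discounted next-state value."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r, _done in P[s][a])


def value_iteration(P, nS, nA, gamma=0.9, tol=1e-3):
    """Repeat the max-over-actions backup, then read off the greedy policy."""
    V = [0.0] * nS  # V0
    while True:
        V_new = [
            max(q_value(P, V, s, a, gamma) for a in range(nA))
            for s in range(nS)
        ]
        delta = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if delta < tol:
            break
    policy = [
        max(range(nA), key=lambda a: q_value(P, V, s, a, gamma))
        for s in range(nS)
    ]
    return V, policy


V, policy = value_iteration(P, nS, nA)
print(V, policy)
```

Unlike policy iteration, no policy is kept during the loop; the policy is extracted once at the end from the converged values.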

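(c) For the deterministic FrozenLake, when the computed policy is executed in the environment the trajectory is the same every time, and the agent reaches G (goal).

For the stochastic FrozenLake, the computed policy differs from the deterministic one, and each run unfolds differently: the agent may reach G, or it may end up in H (hole).

I will post the related code on git later.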
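The difference in rollouts can be illustrated without the real environment. The sketch below is a toy stand-in, not FrozenLake itself: both transition tables are hypothetical, the layout P[s][a] = list of (probability, next_state, reward, terminal) tuples is an assumption, and the same fixed policy is used in both cases just to isolate the effect of randomness.

```python
import random

GOAL, HOLE = 2, 3

# Hypothetical deterministic dynamics: 0 -> 1 -> GOAL, one action per state.
P_det = {
    0: {0: [(1.0, 1, 0.0, False)]},
    1: {0: [(1.0, GOAL, 1.0, True)]},
}
# Hypothetical slippery dynamics: each move slips into the hole half the time.
P_sto = {
    0: {0: [(0.5, 1, 0.0, False), (0.5, HOLE, 0.0, True)]},
    1: {0: [(0.5, GOAL, 1.0, True), (0.5, HOLE, 0.0, True)]},
}
policy = {0: 0, 1: 0}


def run_episode(P, policy, rng, start=0, max_steps=100):
    """Follow the policy, sampling each transition, and return visited states."""
    s, path = start, [start]
    for _ in range(max_steps):
        outcomes = P[s][policy[s]]
        weights = [p for p, _, _, _ in outcomes]
        _, s, _, done = rng.choices(outcomes, weights=weights, k=1)[0]
        path.append(s)
        if done:
            break
    return path


rng = random.Random(0)
det_paths = [run_episode(P_det, policy, rng) for _ in range(5)]
sto_ends = {run_episode(P_sto, policy, rng)[-1] for _ in range(200)}
print(det_paths)
print(sto_ends)
```

Every deterministic rollout follows the same path to the goal, while the stochastic rollouts end in the goal on some runs and in the hole on others.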
