My Roadmap in Reinforcement Learning

I. Preface

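A while ago, on my advisor's suggestion, I studied some reinforcement learning and GANs: reinforcement learning in the first week, GANs in the second and third. Reinforcement learning (RL) is a fascinating field and has long been one of my favorite branches of AI. It is often called the jewel in the crown of AI, because RL shows "intelligence" in a very direct way. I learned a lot in that first week and meant to write a blog post about it, but I kept putting it off...
It is now week four. My plan for this week was to get started with object tracking, but after a few days the entry barrier feels fairly high. Reading the papers has been a struggle and I feel a bit lost. On top of that, my advisor called me in to push me about my paper, which left me too unsettled to study. So I might as well use this lost time to summarize what I learned about RL. Enough chatter; let's get into RL.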

================================================================

II. Starting from Q-Learning

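The first algorithm to learn when getting into RL is Q-Learning. Given a grander name, Q-Learning is a value-iteration-based method for solving problems modeled as Markov chains (more precisely, Markov decision processes). Keeping this name in mind helps in grasping the mathematics behind it, so that it can be applied to other models built on Markov chains (at least that is how I see it).
Because Q-Learning is a value-based method (the opposite of the policy-based methods mentioned later, such as policy gradient), it boils down to playing with a Q-table: a table that stores Q(state, action), with one row per state and one column per action.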

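The goal of Q-Learning is to keep iteratively refining this Q-table until it converges toward the ideal Q-values (the truly ideal Q-values cannot be obtained, only approximated). After convergence, the decision in each state is simply the action with the largest Q-value in that state's row of the Q-table: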

$\pi(s) = \arg\max_a Q(s, a)$

Q-Learning's way of optimizing the Q-table rests on the following formulas.

First, the discounted future reward is defined as

$R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \dots + \gamma^{n-t} r_n$ (where $\gamma$ is the discount factor)
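
As a small worked example of the discounted future reward, the sketch below computes R_t for every t of a made-up reward sequence. The function name, the reward values, and gamma = 0.9 are assumptions chosen purely for illustration.

```python
def discounted_returns(rewards, gamma):
    """Return [R_0, ..., R_n], using the recursion R_t = r_t + gamma * R_{t+1}."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so each R_t reuses the already computed R_{t+1}.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Hypothetical rewards of a 4-step episode.
example_returns = discounted_returns([1.0, 0.0, 0.0, 2.0], gamma=0.9)
```

With these numbers R_3 = 2, R_2 = 1.8, R_1 = 1.62 and R_0 = 1 + 0.9 * 1.62 = 2.458, so the reward at the end still contributes to R_0, only scaled by gamma cubed.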

Second, the Q-value (or Q function) is defined as:

the maximum discounted future reward when we perform action a in state s, and continue optimally from that point on
or
the best possible score at the end of the game after performing action a in state s

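By this definition the Q-value is a highly idealized quantity, and there is no way to obtain the true ideal Q-value. We can, however, approximate it, so the whole Q-Learning algorithm is really one continuous effort to approach the ideal Q-value.

Written mathematically, the Q-value (or Q function) is: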

$Q(s_t, a_t) = \max R_t$
(Note: the Intel blog post writes $R_{t+1}$ here, which I think is a mistake.)

Third, thanks to the first two definitions, we arrive naturally at the famous Bellman equation:

$Q(s, a) = r + \gamma \max_{a'} Q(s', a')$

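Here r is the immediate reward (the reward received right after executing the action), and s′ is the next state reached after executing a. Both r and s′ are observed from the simulator (i.e. the game emulator).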

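With the Bellman equation in hand, we can happily update the Q-values in the Q-table iteratively. The whole algorithm flow is as follows:

[Figure: Q-Learning algorithm flow chart]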

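In the step "select and carry out an action a" of the algorithm, different strategies are possible. The action can be chosen purely at random, or chosen at random with probability ϵ and, with probability 1−ϵ, set to the action with the highest current Q-value. The latter is called ϵ-greedy exploration.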
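
To make the loop concrete, here is a minimal sketch of tabular Q-Learning with ϵ-greedy exploration. Everything specific in it is an assumption for illustration: the toy 5-state corridor, the step function standing in for the simulator, and the learning rate alpha, which is the usual incremental way of moving Q(s,a) toward the Bellman target and does not appear in the equation above.

```python
import random

N_STATES = 5          # hypothetical corridor: states 0..4, state 4 is the goal
ACTIONS = [0, 1]      # 0 = move left, 1 = move right


def step(state, action):
    """Toy simulator: returns (next_state, reward, done)."""
    next_state = max(0, state - 1) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def epsilon_greedy(q_table, state, epsilon, rng):
    """With probability epsilon act randomly, otherwise take an argmax action."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q_table[state])
    # Break ties randomly so the untrained agent does not always pick action 0.
    return rng.choice([a for a in ACTIONS if q_table[state][a] == best])


def train(episodes=200, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q_table = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = epsilon_greedy(q_table, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            # Bellman target: r + gamma * max_a' Q(s', a'), or just r at the end.
            target = reward if done else reward + gamma * max(q_table[next_state])
            q_table[state][action] += alpha * (target - q_table[state][action])
            state = next_state
    return q_table


q_table = train()
# Read the greedy policy off the learned table for every non-terminal state.
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

In this toy problem the greedy policy read from the table ends up moving right in every state, and the Q-values shrink by roughly a factor of gamma for each extra step away from the goal.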

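Although the Q-values early in training may be far from the ideal ones, it has been proven that with enough iterations (and provided every state-action pair keeps being visited and the learning rate is decayed appropriately) the Q-table converges and represents the ideal Q-values, which is exactly what we want.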

————————————————————————————————————-

Sarsa and Sarsa-Lambda

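Look at the Q-Learning flow chart above. Each update of Q(s,a) uses the Q(s′,a′) of the maximizing action a′, but in the next iteration the action actually taken is not necessarily a′. Seen this way, classic Q-Learning looks a bit "irresponsible": it insists on updating Q(s,a) with the largest available value without actually executing a′. This leads to a variant of Q-Learning called SARSA. The odd name is just the initials of state-action-reward-state-action. As the name suggests, SARSA "does what it says": it first picks the next action a′ with its behavior policy (for example ϵ-greedy), updates Q(s,a) with that Q(s′,a′) rather than with the maximum, and then really executes a′ in the next iteration. Q-Learning is comparatively bolder, because its target always assumes the best action will follow and ignores the cost of exploratory moves, whereas SARSA is more conservative and pragmatic, taking one solid step at a time.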

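The two algorithms are compared in the figure below. Looking at the update of Q(s,a), Sarsa does what it says. Hence Q-Learning is an off-policy algorithm, while Sarsa is on-policy:

[Figure: Q-Learning vs. Sarsa algorithm comparison]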
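
The difference between the two updates fits in a few lines. The sketch below is only an illustration: the function names, the dictionary-based Q-table, and all numbers are hypothetical.

```python
def q_learning_target(q_table, reward, next_state, gamma):
    """Off-policy: bootstrap from the best action in s', whether or not it is taken."""
    return reward + gamma * max(q_table[next_state].values())


def sarsa_target(q_table, reward, next_state, next_action, gamma):
    """On-policy: bootstrap from the action a' that will actually be executed in s'."""
    return reward + gamma * q_table[next_state][next_action]


# Hypothetical Q-values for a single next state with two actions.
q_table = {"s1": {"left": 0.2, "right": 1.0}}

# Suppose exploration picked "left" in s1 even though "right" looks better.
off_policy = q_learning_target(q_table, reward=0.5, next_state="s1", gamma=0.9)
on_policy = sarsa_target(q_table, reward=0.5, next_state="s1", next_action="left", gamma=0.9)
```

Here the Q-Learning target is 0.5 + 0.9 * 1.0 = 1.4, while the Sarsa target is 0.5 + 0.9 * 0.2 = 0.68, because Sarsa accounts for the exploratory move it is really about to make.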

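There is another variant called Sarsa-Lambda. The idea is this: Sarsa is a single-step update algorithm. Taking a treasure hunt as an example, Sarsa only rewards the step right before the treasure is found and ignores the contribution of the many steps before it. Sarsa-Lambda therefore introduces episode-level updates, so that the earlier steps get credit as well.

When lambda = 1, all earlier steps are treated equally (up to the usual discount gamma). To introduce a decay over time, lambda is usually set between 0 and 1 (lambda = 0 is plain Sarsa).

To see the idea and implementation of Sarsa-lambda more concretely, consider the following code: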

class SarsaLambdaTable(RL): # inherits from the RL base class
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9, trace_decay=0.9):
        ...
    def check_state_exist(self, state):
        ...
    def learn(self, s, a, r, s_, a_):
        # This part is the same as Sarsa
        self.check_state_exist(s_)
        q_predict = self.q_table.loc[s, a]  # .loc replaces the removed pandas .ix indexer
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.loc[s_, a_]
        else:
            q_target = r
        error = q_target - q_predict

        # From here on it differs:
        # add 1 to the visited state-action pair, marking it as an indispensable link on the way to the reward
        self.eligibility_trace.loc[s, a] += 1

        # Q-table update
        self.q_table += self.lr * error * self.eligibility_trace

        # Decay the eligibility trace over time: the further a step is from the reward, the less "indispensable" it is
        self.eligibility_trace *= self.gamma*self.lambda_

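The code shows that, with eligibility_trace acting as a bridge, each update step can touch the entire Q-table (that is, all previously visited steps), and that eligibility_trace is multiplied by gamma * lambda_ after every step, so credit fades both with the discount gamma and with the trace decay lambda_.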
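
Since the excerpt above leaves out the constructor, here is a self-contained sketch of the same idea using plain dictionaries instead of a pandas table. The class name, the done flag, and the choice to clear the traces inside learn are my own assumptions for illustration, not the tutorial's implementation.

```python
from collections import defaultdict


class SarsaLambda:
    """Dictionary-based Sarsa(lambda) with accumulating eligibility traces."""

    def __init__(self, lr=0.1, gamma=0.9, lambda_=0.9):
        self.lr, self.gamma, self.lambda_ = lr, gamma, lambda_
        self.q = defaultdict(float)       # Q[(state, action)]
        self.trace = defaultdict(float)   # eligibility[(state, action)]

    def learn(self, s, a, r, s_next, a_next, done):
        target = r if done else r + self.gamma * self.q[(s_next, a_next)]
        error = target - self.q[(s, a)]
        self.trace[(s, a)] += 1.0         # mark the pair that was just visited
        for key in list(self.trace):
            # Every previously visited pair shares the TD error, weighted by its trace.
            self.q[key] += self.lr * error * self.trace[key]
            self.trace[key] *= self.gamma * self.lambda_
        if done:
            self.trace.clear()            # assumption: traces do not carry over between episodes


agent = SarsaLambda()
# Hypothetical 2-step episode: no reward on the first step, reward 1 at the end.
agent.learn("s0", "right", 0.0, "s1", "right", done=False)
agent.learn("s1", "right", 1.0, "terminal", None, done=True)
```

After this single episode the last pair ("s1", "right") moves to 0.1, and the earlier pair ("s0", "right") also receives 0.1 * 0.81 = 0.081, which plain one-step Sarsa would not have given it yet.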

————————————————————————————————————-
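These are the three classic algorithms to master when starting out in RL: Q-Learning (off-policy), Sarsa (on-policy), and Sarsa-Lambda (on-policy). For Sarsa and Sarsa-Lambda I mainly followed Morvan Zhou's (莫煩) reinforcement learning tutorial. In my view they should be better than Q-Learning; in particular, Sarsa-Lambda's episode-level update assigns reward more thoughtfully. Still, the main idea behind Sarsa and Sarsa-Lambda is inherited from Q-Learning: all of them are value-based methods for solving problems modeled as Markov chains!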

================================================================

III. Deep Q Network

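The Q-Learning family above has one weakness: when the number of states is very large, for example in video games (such as Atari games) where a state is a combination of pixels, the state count is astronomical and Q-Learning falls short. Neural networks, on the other hand, are a powerful tool for modeling high-dimensional structured data, so letting a network play the role of the high-dimensional Q-table and learn a Q function is a natural idea.

There are two possible architectures for approximating Q(s,a) with a neural network: one takes the state and an action as input and outputs a single Q-value; the other takes only the state and outputs one Q-value per action.

The second architecture, which takes the state as input and outputs the Q-value of every action, produces a whole row of the Q-table in a single forward pass, so it is more convenient and easier to model.

It is worth mentioning that when the input state is an image or video, pooling layers are usually not used: pooling makes the network insensitive to translation, whereas in video games the positions of objects must be preserved.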

Following our understanding of the Q function, given a transition <s,a,r,s′>, the DQN loss function is defined as:

$L = \frac{1}{2}\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]^2$

where r + γ max_{a′} Q(s′,a′) is the target, Q(s,a) is the prediction, and s′ is the state that follows s.

The earlier Q-table update procedure thus becomes:
————————————————————————-
1. Do a feedforward pass for the current state s to get the predicted Q-values for all actions.
2. Do a feedforward pass for the next state s′ and calculate the maximum over all network outputs, max_{a′} Q(s′,a′).
3. Set the Q-value target for action a to r + γ max_{a′} Q(s′,a′) (use the max calculated in step 2). For all other actions, set the Q-value target to the value originally returned in step 1, making the error 0 for those outputs.
4. Update the weights using backpropagation.
————————————————————————-
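
The four steps above can be sketched with a deliberately tiny stand-in for the network. Everything here is an assumption for illustration: a linear model (one weight row per action) replaces the ConvNet, plain gradient descent replaces the real optimizer, and the names q_values and dqn_update as well as all numbers are hypothetical.

```python
def q_values(weights, state):
    """Linear stand-in for the network: one output (Q-value) per action."""
    return [sum(w * x for w, x in zip(row, state)) for row in weights]


def dqn_update(weights, s, a, r, s_next, done, gamma=0.9, lr=0.1):
    """One gradient step on L = 1/2 * (target - Q(s, a))^2; returns the loss before the step."""
    prediction = q_values(weights, s)                            # step 1
    max_next = 0.0 if done else max(q_values(weights, s_next))   # step 2
    targets = list(prediction)                                   # step 3: zero error for other actions
    targets[a] = r + gamma * max_next
    for action, row in enumerate(weights):                       # step 4
        error = targets[action] - prediction[action]
        for i, x in enumerate(s):
            # The target is treated as a constant, so dL/dw = -error * x.
            row[i] += lr * error * x
    return 0.5 * (targets[a] - prediction[a]) ** 2


# Hypothetical problem: 2 actions, 3 state features, all weights start at zero.
weights = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
transition = ([1.0, 0.0, 0.0], 1, 1.0, [0.0, 1.0, 0.0], False)
loss_before = dqn_update(weights, *transition)
loss_after = dqn_update(weights, *transition)
```

Applying the same transition twice lowers the loss from 0.5 to 0.405, and the weight row of the action that was not taken stays untouched, which is exactly what step 3 is meant to achieve.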

Experience Replay
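The description above shows that DQN is essentially the neural-network version of Q-Learning. It adopts neither Sarsa's nor Sarsa-Lambda's ideas and remains an off-policy algorithm. The benefit of being off-policy is that Experience Replay becomes possible: all experiences <s,a,r,s′> can be stored in a replay memory, from which random minibatches are drawn to train the DQN. Experience Replay is a very important trick for training DQN.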
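
A replay memory can be as simple as a bounded queue plus uniform sampling. The sketch below is one possible minimal version; the class name, the capacity, the extra done flag in each transition, and the integer-labelled states are assumptions for illustration.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size store of (s, a, r, s_next, done) transitions."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the correlation between consecutive frames.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3, seed=0)
for t in range(5):                             # hypothetical integer-labelled states
    memory.push(t, 0, 0.0, t + 1, False)
minibatch = memory.sample(batch_size=2)
```

With a capacity of 3 and 5 pushes, only the transitions starting in states 2, 3 and 4 remain, and the minibatch is drawn from those.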

Fixed Q target
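Another important trick for making DQN training more stable is the fixed Q target. When training DQN, updating the network that produces the targets at every step makes learning unstable. To avoid this, a second Q network with the same architecture is kept on the side and used to compute the target in the algorithm above. This Q network is fixed: it is not updated at every step; instead, every few steps the parameters of the non-fixed Q network are copied into it.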

See: https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/4-2-DQN2/
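
The bookkeeping behind the fixed Q target is small. In the sketch below a dictionary of weights stands in for a real network, and the class name, method name, and sync interval are hypothetical.

```python
import copy


class FixedTargetPair:
    """Keeps an online parameter set and a periodically refreshed frozen copy."""

    def __init__(self, params, sync_every):
        self.online = params
        self.target = copy.deepcopy(params)
        self.sync_every = sync_every
        self.steps = 0

    def after_gradient_step(self):
        """Call once per update of the online parameters."""
        self.steps += 1
        if self.steps % self.sync_every == 0:
            self.target = copy.deepcopy(self.online)


pair = FixedTargetPair(params={"w": [0.0]}, sync_every=3)
history = []
for _ in range(4):
    pair.online["w"][0] += 1.0          # stand-in for a real gradient update
    pair.after_gradient_step()
    history.append(pair.target["w"][0])
```

The recorded target values are 0.0, 0.0, 3.0, 3.0: the frozen copy ignores the first two updates and only catches up when the third step triggers a sync.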

Pseudocode of the DQN algorithm with Experience Replay, ϵ-greedy exploration, and fixed Q target:

[Figure: DQN pseudocode]

The pseudocode shows that the experiences in Experience Replay come from Q(ϕ(s_t), a; θ), not from the fixed Q(ϕ(s_t), a; θ⁻).

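DeepMind in fact used many more tricks to make DQN finally work, including error clipping and reward clipping, which I will not go into here. Google has filed a patent on the Deep Q-Learning algorithm.

Reference: Intel official blog, Guest Post (Part I): Demystifying Deep Reinforcement Learning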

===========================================

IV. Policy Gradient

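By using a neural network to model high-dimensional structured data and play the role of the Q-table, DQN makes RL problems with high-dimensional states, such as video games, tractable. But if the problem also has a great many actions, or even continuous rather than discrete actions, DQN falls short as well. This is where policy gradient (PG) comes in: PG can handle RL problems with continuous actions well.

First, a few sentences quoted from Andrej Karpathy's blog: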

Similar to what happened in Computer Vision, the progress in RL is not driven as much as you might reasonably assume by new amazing ideas. In Computer Vision, the 2012 AlexNet was mostly a scaled up (deeper and wider) version of 1990’s ConvNets. Similarly, the ATARI Deep Q Learning paper from 2013 is an implementation of a standard algorithm (Q Learning with function approximation, which you can find in the standard RL book of Sutton 1998), where the function approximator happened to be a ConvNet. AlphaGo uses policy gradients with Monte Carlo Tree Search (MCTS) - these are also standard components.

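The DQN algorithm discussed earlier merely swaps the function approximator for a ConvNet. This part covers the policy gradient algorithm, which Karpathy argues is a better algorithm than Q-Learning: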

It turns out that Q-Learning is not a great algorithm (you could say that DQN is so 2013 (okay I’m 50% joking)). In fact most people prefer to use Policy Gradients, including the authors of the original DQN paper who have shown Policy Gradients to work better than Q Learning when tuned well. PG is preferred because it is end-to-end: there’s an explicit policy and a principled approach that directly optimizes the expected reward.

Policy Gradient in one sentence:

Policy Gradients: Run a policy for a while. See what actions led to high rewards. Increase their probability.

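I did not fully understand the following explanation (from Andrej's blog). Taken literally, it still does not explain why frame 50 and frame 150 are not treated differently, when both bounces could be judged as a "correct" or "wrong" action from the immediate reward. Shouldn't it be easy to increase the probability of the action at frame 50 and decrease that of the action at frame 150?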

If you think through this process you’ll start to find a few funny properties. For example what if we made a good action in frame 50 (bouncing the ball back correctly), but then missed the ball in frame 150? If every single action is now labeled as bad (because we lost), wouldn’t that discourage the correct bounce on frame 50? You’re right - it would. However, when you consider the process over thousands/millions of games, then doing the first bounce correctly makes you slightly more likely to win down the road, so on average you’ll see more positive than negative updates for the correct bounce and your policy will end up doing the right thing.

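To write this Policy Gradient part I reread Andrej Karpathy's blog post, because I had understood it before and then forgotten it.

At bottom, Policy Gradient directly optimizes a policy network (written mathematically as p(x|θ), where θ denotes the network parameters and x is the policy network's output, a model of p(a|i), with i the input image or video frame and a the action). This is unlike the value-based Q-Learning and Sarsa, which find the best policy indirectly by approximating the Q function (or Q-value). Seen this way, PG tackles reinforcement learning more directly and cleanly.

Another advantage of PG is that it supports a probabilistic policy output, whereas Q-Learning yields a deterministic policy. A probabilistic output is more intuitive, because in this world there is often more than one solution: to fill your stomach you can eat rice, or you can eat noodles.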
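
The contrast between a probabilistic and a deterministic policy can be seen in a few lines. The scores below are made up (two equally good "meals" and one bad one), and the helper names are hypothetical; a softmax is used here as one common way to turn scores into probabilities.

```python
import math
import random


def softmax(logits):
    """Turn arbitrary scores into a probability distribution."""
    shifted = [x - max(logits) for x in logits]   # subtract the max for numerical stability
    exps = [math.exp(x) for x in shifted]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Policy-gradient style: draw an action according to its probability."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def greedy_action(q_values):
    """Value-based style: always pick a single highest-valued action."""
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
scores = [2.0, 2.0, 0.0]                 # hypothetical: rice and noodles are equally good
probs = softmax(scores)
sampled = {sample_action(probs, rng) for _ in range(200)}
```

The greedy rule always returns action 0 even though action 1 is just as good, while sampling from the probabilities uses both of the good options.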

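Andrej wrote a PG implementation in a little over 130 lines. I excerpted the more important fragments below to aid understanding (line numbers refer to the excerpt):
Line 3: the action is chosen according to a probability (reflecting PG's probabilistic output). The values 2 and 3 are used presumably because the two paddle moves in this Gym game happen to be actions 2 and 3.
Line 9: computes the gradient that encourages the current action (used later in backpropagation).
Line 33: modulates the gradients from line 9 (all of which encourage the action taken) with discounted_epr. This is where the PG magic happens. discounted_epr is simply all the rewards of one episode (a list) after time discounting (a new list).
Line 34: sends the modulated gradient back through the network for backpropagation, to update the parameters.

The code also shows that PG is an episode-level update algorithm: gradients are backpropagated only once the done signal arrives, and done becomes True only when an episode ends. In Pong-v0, an episode ends when either the computer or the player has used up 21 lives, i.e. when the other side is first to reach 21 points. The rewards and gradients are then collected, the gradients are modulated with the time-discounted rewards, and the result is backpropagated.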

 # forward the policy network and sample an action from the returned probability
 aprob, h = policy_forward(x)
 action = 2 if np.random.uniform() < aprob else 3 # roll the dice!

 # record various intermediates (needed later for backprop)
 xs.append(x) # observation
 hs.append(h) # hidden state
 y = 1 if action == 2 else 0 # a "fake label"
 dlogps.append(y - aprob) # grad that encourages the action that was taken to be taken (see http://cs231n.github.io/neural-networks-2/#losses if confused)

 # step the environment and get new measurements
 observation, reward, done, info = env.step(action)
 reward_sum += reward

 drs.append(reward) # record reward (has to be done after we call step() to get reward for previous action)

 if done: # an episode finished
   episode_number += 1

   # stack together all inputs, hidden states, action gradients, and rewards for this episode
   epx = np.vstack(xs)
   eph = np.vstack(hs)
   epdlogp = np.vstack(dlogps)
   epr = np.vstack(drs)
   xs,hs,dlogps,drs = [],[],[],[] # reset array memory

   # compute the discounted reward backwards through time
   discounted_epr = discount_rewards(epr)
   # standardize the rewards to be unit normal (helps control the gradient estimator variance)
   discounted_epr -= np.mean(discounted_epr)
   discounted_epr /= np.std(discounted_epr)

   epdlogp *= discounted_epr # modulate the gradient with advantage (PG magic happens right here.)
   grad = policy_backward(eph, epdlogp)
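
To see the "modulate the gradient" step in isolation, the sketch below uses a one-layer logistic policy instead of Karpathy's two-layer network. The function names, the two toy observations, and the advantages are assumptions for illustration; only the y - aprob term is taken from the excerpt above.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def policy_gradient(weights, observations, labels, advantages):
    """Sum over time of advantage_t * d log p(action_t | x_t) / d weights."""
    grad = [0.0] * len(weights)
    for x, y, adv in zip(observations, labels, advantages):
        p_up = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
        dlogp = y - p_up                  # same "y - aprob" term as in the excerpt
        for i, xi in enumerate(x):
            grad[i] += adv * dlogp * xi   # the advantage scales and may flip the gradient
    return grad


weights = [0.0, 0.0]
observations = [[1.0, 0.0], [0.0, 1.0]]   # two hypothetical frames
labels = [1, 1]                           # y = 1: the first action ("UP") was sampled both times
advantages = [1.0, -1.0]                  # the first move paid off, the second did not
grad = policy_gradient(weights, observations, labels, advantages)
```

The result is [0.5, -0.5]: a gradient-ascent step makes the sampled action more likely in the first situation and less likely in the second, although the same action was taken in both.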

================================================================

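I'll stop here for now. A few open questions remain:
First, how to understand the "lump everything together" treatment of frame 50 and frame 150 in the passage from Andrej's blog quoted earlier.
Second, I did not quite understand the discount_rewards(epr) function (see below). [Solved]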

def discount_rewards(r):
  """ take 1D float array of rewards and compute discounted reward """
  discounted_r = np.zeros_like(r)
  running_add = 0
  for t in reversed(range(0, r.size)):  # range replaces Python 2's xrange
    if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (pong specific!)
    running_add = running_add * gamma + r[t]
    discounted_r[t] = running_add
  return discounted_r

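The discount_rewards function takes a one-dimensional array of rewards and returns the time-discounted rewards. Running the program shows that in Pong-v0 the reward is computed as follows:

A reward is produced at every frame
When nobody dies, the reward is 0
When the opponent dies once, the reward is +1
When we die once, the reward is -1
Whoever dies 21 times ends the game, i.e. the episode

Because a gradient update is performed only once per episode, using the collected "encourage" gradients and rewards, the rewards within one episode form a one-dimensional vector with several thousand elements, whose length varies with the length of each episode.
Written out, the reward might look like this: reward = [0,0,1,0,0,…,0,0,-1,0,0,…,0,-1]
In other words, the reward is 0 most of the time and becomes 1 or -1 only when a death occurs, which is why the discount_rewards function above contains the line: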

if r[t] != 0: running_add = 0 # reset the sum, since this was a game boundary (pong specific!)

to indicate that whenever a death occurs, the reward discounting restarts from that death! [Done]
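
A short worked example makes the reset visible. The version below is a plain-Python restatement of discount_rewards that takes gamma as an argument; the 5-frame reward list and gamma = 0.5 are made up so that the numbers are easy to check by hand.

```python
def discount_rewards(rewards, gamma):
    """Discount rewards backwards in time, restarting at every non-zero reward."""
    discounted = [0.0] * len(rewards)
    running_add = 0.0
    for t in reversed(range(len(rewards))):
        if rewards[t] != 0:
            running_add = 0.0      # a point was scored: start a fresh discount chain
        running_add = running_add * gamma + rewards[t]
        discounted[t] = running_add
    return discounted


# Hypothetical 5-frame episode: we score at frame 2 and lose a point at frame 4.
result = discount_rewards([0, 0, 1, 0, -1], gamma=0.5)
```

The result is [0.25, 0.5, 1.0, -0.5, -1.0]: frames 0 to 2 are credited only with the +1 at frame 2, and the -1 at frame 4 does not leak back past that boundary.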

Third, a deeper understanding of why discounted_epr is normalized (subtracting the mean and dividing by the standard deviation).
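
One way to build intuition for this question is to look at what the normalization does to a concrete list of returns. The sketch below is only an illustration with made-up numbers; it uses the population standard deviation, which is what np.std computes by default.

```python
import statistics


def standardize(values):
    """Subtract the mean and divide by the population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


# Hypothetical discounted returns from an episode that mostly went well.
advantages = standardize([1.0, 1.0, 1.0, -1.0])
```

After standardizing, the values are centered on zero with unit spread, so within every batch some actions are encouraged and others discouraged no matter how lopsided the raw returns were. This is consistent with the code comment saying the step helps control the variance of the gradient estimator, though it is not a full answer to the question.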

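The above conveys the idea of Policy Gradient intuitively, from Karpathy's point of view. For a rigorous mathematical derivation, see lecture 4 of CS294, "Policy Gradient Introduction". I excerpted three slides covering the PG derivation; the notation should be easy to guess, with τ denoting a trajectory of the Markov chain.

[Figures: three CS294 lecture slides on the policy gradient derivation]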
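
For readers without the slides at hand, here is a hedged sketch of the standard likelihood-ratio derivation of the policy gradient. It is the usual textbook form, not a transcription of the slides, and the symbols J(θ), p_θ(τ), r(τ), π_θ, T and N are notation assumed here.

```latex
\begin{aligned}
J(\theta) &= \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[r(\tau)\big]
           = \int p_\theta(\tau)\, r(\tau)\, d\tau \\
\nabla_\theta J(\theta) &= \int \nabla_\theta p_\theta(\tau)\, r(\tau)\, d\tau
  = \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, r(\tau)\, d\tau
  = \mathbb{E}_{\tau \sim p_\theta(\tau)}\big[\nabla_\theta \log p_\theta(\tau)\, r(\tau)\big] \\
\log p_\theta(\tau) &= \log p(s_1)
  + \sum_{t=1}^{T} \big[\log \pi_\theta(a_t \mid s_t) + \log p(s_{t+1} \mid s_t, a_t)\big] \\
\nabla_\theta J(\theta) &= \mathbb{E}_{\tau \sim p_\theta(\tau)}\Big[
  \Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big)
  \Big(\sum_{t=1}^{T} r(s_t, a_t)\Big)\Big] \\
&\approx \frac{1}{N} \sum_{i=1}^{N}
  \Big(\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t})\Big)
  \Big(\sum_{t=1}^{T} r(s_{i,t}, a_{i,t})\Big)
\end{aligned}
```

The second line uses the identity ∇p = p ∇log p. In the third line only the policy terms depend on θ, so the initial-state and transition probabilities drop out of the gradient, which is why the environment dynamics never need to be known.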