Applying Deep Reinforcement Learning (DQN, Deep Q Network) to Flappy Bird

This article is the author's original work. Please cite the source when reposting: https://www.cnblogs.com/further-further-further/p/10811587.html

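1. Goal

2. Approach

2.1. Reinforcement learning (RL, Reinforcement Learning)

2.2. Deep learning (convolutional neural network, CNN)

3. Pitfalls

4. Code implementation (python3.5)

5. Results and analysis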

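1. Goal

Game scene: obstacles move to the left at a fixed speed, and the bird flaps its wings to fly up or down to avoid them. If the bird hits an obstacle, the game is over (GAME OVER).

Goal: after training, the bird recognizes obstacles on its own and takes the right action (fly up or fly down).

2. Approach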

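The hard part of flying the bird is deciding the next action (up or down) correctly, and this is exactly the problem reinforcement learning addresses. In the grid example of the previous article the number of states was small (16), so we could enumerate every state, compute its return, and build a Q-Table (a record of the value of every state). The present scenario is different: the state is the pixels of an image. If the image is 84 * 84, 4 frames are stacked (batch = 4), and each pixel takes a value in [0, 255] (256 possibilities), then the number of distinct states the Q-Table would have to cover is

256^(84 * 84 * 4)

That is far too much to compute. So here we combine reinforcement learning with deep learning (a convolutional neural network), that is, DQN (Deep Q Network).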
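To get a feel for that number, the short sketch below counts its decimal digits with logarithms instead of building the integer itself. The helper name state_count_digits is made up for this illustration; the sizes are the ones assumed in the paragraph above (84 * 84 pixels, 4 frames, 256 levels per pixel).

```python
import math

def state_count_digits(width, height, frames, levels):
    """Number of decimal digits in levels ** (width * height * frames)."""
    exponent = width * height * frames
    # digits(n) = floor(log10(n)) + 1, and log10(levels ** e) = e * log10(levels)
    return math.floor(exponent * math.log10(levels)) + 1

digits = state_count_digits(84, 84, 4, 256)
print("256^(84*84*4) has about", digits, "decimal digits")
```

A table with that many rows cannot be stored or visited, which is why the Q-values are approximated by a network instead.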

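The convolutional neural network predicts the return (Q-value) of every action in the current state. This is the predicted value, and the network is the part whose parameters get updated.

Reinforcement learning uses the Markov decision process and the Bellman value function to compute the return of every action in the current state. This is the target ("true") value.

A whole image is treated as one state (the bird does not care whether the input is pixels or a picture; it only cares about the direction of its next move), so 4 images are 4 states, and these 4 images are consecutive in time. The stacked states (States: 80*80*4) are the input of the convolutional neural network, the actions (Actions: 1*2) enter when the loss is computed, and the network outputs the value of every action in the current state (1*2). The structure is shown in the figure below.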

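2.1 Reinforcement learning

The Bellman optimality equation is as follows (value of an action in the current state = immediate reward + discounted best action value in the next state):

Q(s, a) = r + GAMMA * max_a' Q(s', a')

Code implementation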

readout_j1_batch = sess.run(readout, feed_dict = {s : s_j1_batch})
for i in range(0, len(minibatch)):
    terminal = minibatch[i][4]
    # if terminal, only equals reward
    if terminal:  # hit an obstacle, the episode ends
        y_batch.append(r_batch[i])
    else:  # immediate reward + discounted value of the next state
        y_batch.append(r_batch[i] + GAMMA * np.max(readout_j1_batch[i]))

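minibatch holds one batch (32) of transitions: the current state (s_j_batch), the action taken (a_batch), the immediate reward of the current state (r_batch), the state at the next time step (s_j1_batch), and the terminal flag.

Feeding the next state (s_j1_batch) into the network gives the values of all actions in the next state (readout_j1_batch). The Bellman optimality equation then yields the Q-value of the current state.

You may wonder: why is the value of the current state derived from the value of the next state, when a result is usually derived from the previous state?

The Bellman optimality equation embodies the core idea of trial and error: the value of the next state is computed in order to update the value of the current state, and in this way the best action for each state is found.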
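The target computation can also be written without the Python loop. The following is a minimal NumPy sketch under the same rule as the snippet above (target = reward for terminal transitions, otherwise reward + GAMMA * best next value). The function name bellman_targets and the toy numbers are assumptions made for illustration only.

```python
import numpy as np

GAMMA = 0.99  # same discount factor as in the article's code

def bellman_targets(rewards, next_q, terminals, gamma=GAMMA):
    """Return y = r for terminal transitions, else r + gamma * max_a' Q(s', a')."""
    rewards = np.asarray(rewards, dtype=np.float64)
    next_q = np.asarray(next_q, dtype=np.float64)
    terminals = np.asarray(terminals, dtype=bool)
    best_next = next_q.max(axis=1)  # best action value in each next state
    # ~terminals is 0 for terminal transitions, so the bootstrap term drops out
    return rewards + gamma * best_next * (~terminals)

# Toy batch of 3 transitions (made-up numbers).
r_batch = [0.1, 1.0, -1.0]
readout_j1_batch = [[0.5, 0.2], [0.0, 2.0], [3.0, 4.0]]
terminal_batch = [False, False, True]
y_batch = bellman_targets(r_batch, readout_j1_batch, terminal_batch)
print(y_batch)
```

The third transition is terminal, so its target is just the reward, no matter how large the predicted next values are.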

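2.2 Deep learning

Before the input data enters the network, the images are preprocessed to reduce the amount of computation.

This requires the opencv library: pip install opencv-python. If the download is slow, a mirror inside China can be used instead:

pip install opencv-python -i http://pypi.douban.com/simple --trusted-host pypi.douban.com

Grayscale: resize the image to 80 * 80 and convert the color image to grayscale.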

 x_t = cv2.cvtColor(cv2.resize(x_t, (80, 80)), cv2.COLOR_BGR2GRAY)

Binarization: with a pixel threshold of 1, pixels greater than 1 are set to 255 (white) and all others to 0 (black).

 ret, x_t = cv2.threshold(x_t,1,255,cv2.THRESH_BINARY)
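For readers without OpenCV at hand, the effect of this threshold call can be reproduced in plain NumPy. The sketch below only illustrates the rule (a value greater than 1 becomes 255, everything else becomes 0); binarize is a hypothetical helper and not part of the project.

```python
import numpy as np

def binarize(gray, thresh=1, maxval=255):
    """Pixels strictly greater than thresh become maxval, all others 0."""
    gray = np.asarray(gray, dtype=np.uint8)
    return np.where(gray > thresh, maxval, 0).astype(np.uint8)

# A tiny made-up grayscale patch.
frame = np.array([[0, 1, 2],
                  [50, 200, 255]], dtype=np.uint8)
print(binarize(frame))
```

With such a low threshold almost everything that is not pure background turns white, which leaves the network with simple black-and-white shapes.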

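Getting 4 consecutive frames: copy the first frame 4 times and stack the copies into a 4-frame state. After that, each new frame is placed at the front of the stack and the oldest frame (the 4th) is dropped, so the stacked state always holds consecutive frames.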

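s_t = np.stack((x_t, x_t, x_t, x_t), axis=2)
# the new frame needs a channel axis before it can be appended along axis 2
x_t1 = np.reshape(x_t1, (80, 80, 1))
s_t1 = np.append(x_t1, s_t[:, :, :3], axis=2)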
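The stacking logic is easy to check with synthetic frames. The sketch below assumes each frame is an 80 x 80 array filled with its own index, so the channel order of the state can be read off directly. The helper names initial_state and next_state are made up for this illustration.

```python
import numpy as np

def initial_state(frame):
    """Stack the first 80x80 frame four times -> shape (80, 80, 4)."""
    return np.stack((frame, frame, frame, frame), axis=2)

def next_state(state, new_frame):
    """Put the new frame at channel 0 and keep the 3 most recent old frames."""
    new_frame = np.reshape(new_frame, (80, 80, 1))
    return np.append(new_frame, state[:, :, :3], axis=2)

# Synthetic frames: frame k is filled with the constant k.
frames = [np.full((80, 80), k, dtype=np.uint8) for k in range(6)]
state = initial_state(frames[0])
for f in frames[1:]:
    state = next_state(state, f)

# Read one pixel from each channel to see which frame sits where.
channels = [int(state[0, 0, c]) for c in range(4)]
print(state.shape, channels)
```

The newest frame always sits in channel 0 and the oldest surviving frame in channel 3.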

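Convolutional neural network model

The model uses 3 convolutional layers (8*8*4*32, 4*4*32*64, 3*3*64*64), 3 pooling layers, 4 ReLU activations and 2 fully connected layers, as shown in the figure below. Note that in train.py the second pooling op (h_pool2) is defined but the third convolution reads h_conv2 directly, so only two of the poolings lie on the path to the output.

(It helps to read the code alongside the figure and watch how the data shape changes.)

Note: pay attention to the stride of each convolutional layer. Because padding = "SAME", the output width and height after a convolution = input width and height / stride (rounded up).

For example, after the input passes through the first convolutional layer (8*8*4*32, stride 4), the width and height are (80, 80) / 4 = (20, 20). The other layers follow the same rule.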
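The size rule can be traced through the whole network in a few lines. This is a sketch that follows the layers in the order train.py actually connects them (conv1, pool1, conv2, conv3, pool3; conv3 takes h_conv2 as its input). The names same_out and trace_shapes are made up for this illustration.

```python
import math

def same_out(size, stride):
    """Output width/height of a conv or pool layer with padding='SAME'."""
    return math.ceil(size / stride)

def trace_shapes(size=80):
    """Follow width/height and channel count through the layers."""
    layers = [
        ("conv1", 4, 32),
        ("pool1", 2, 32),
        ("conv2", 2, 64),
        ("conv3", 1, 64),
        ("pool3", 2, 64),
    ]
    shapes = []
    for name, stride, channels in layers:
        size = same_out(size, stride)
        shapes.append((name, size, size, channels))
    return shapes

shapes = trace_shapes()
for name, h, w, c in shapes:
    print(name, h, "x", w, "x", c)

_, h, w, c = shapes[-1]
flat_size = h * w * c
print("flattened size:", flat_size)
```

The flattened size is the first dimension of W_fc1 in the listing, so any change to a stride or to the input size has to be reflected there.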

TensorBoard graph visualization (for the steps to generate it, see "Deep Learning: Convolutional Neural Networks (CNN) Explained with Code (Part 2)")


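3. Pitfalls

1. Be clear about the inputs and outputs of deep reinforcement learning.

The core idea of reinforcement learning is trial and error; the core idea of deep learning is training. By repeatedly computing the residual between the predicted and target values and updating the model parameters, the residual shrinks until it converges to a stable value, which gives the trained model.

Here the predicted value comes from deep learning and the target value comes from reinforcement learning, hence the concept of deep reinforcement learning (DQN, Deep Q Network).

CNN forward-pass input: 4 consecutive frames, stacked as the state.

CNN forward-pass output: readout (the values of the 2 actions).

CNN backward-pass inputs (the loss function gives the loss, gradients are computed, parameters are updated):

i. y_batch (32,): the target values obtained from reinforcement learning, one scalar per sample [32 is the number of samples processed per training step].

ii. a_batch (32, 2): the one-hot action taken in each sample. During play each a_t is built as follows: initialize to all zeros -> the network outputs readout_t (1, 2) -> find the index of the largest value (or a random index when exploring) -> set that index of a_t to 1.

iii. s_j_batch (32, 80, 80, 4): the current states. Each state is a stack of 4 consecutive frames, and 32 states are processed per batch.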
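How these three inputs meet in the loss can be sketched in NumPy. This is an illustration rather than the project's TensorFlow code: it masks the network output with the one-hot action, sums over the action axis to pick out Q(s, a), and takes the mean squared error against y. The name dqn_loss and the numbers are assumptions.

```python
import numpy as np

def dqn_loss(readout, a_batch, y_batch):
    """Mean squared error between y and Q(s, a) selected by the one-hot action."""
    readout = np.asarray(readout, dtype=np.float64)   # shape (batch, 2)
    a_batch = np.asarray(a_batch, dtype=np.float64)   # shape (batch, 2), one-hot
    y_batch = np.asarray(y_batch, dtype=np.float64)   # shape (batch,)
    q_taken = np.sum(readout * a_batch, axis=1)       # shape (batch,)
    return np.mean(np.square(y_batch - q_taken))

# Made-up batch of 2 samples.
readout = [[1.0, 3.0], [2.0, 0.5]]
a_batch = [[0, 1], [1, 0]]
y_batch = [2.0, 2.0]
loss = dqn_loss(readout, a_batch, y_batch)
print(loss)
```

Only the value of the action that was actually taken contributes to the loss; the other output is left untouched by that sample.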

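2. Do not get stuck in conventional thinking.

The conventional pattern is A + B => C, where we can work out C in our heads before computing or designing anything. Deep learning breaks this pattern: it learns by itself through training and picks up the underlying knowledge or rules.

In this example, a few questions keep coming to mind:

1. Why is the output of deep learning the value of each action, and not something else?

Answer: this is determined by the target value. The final output of the network must be comparable with the target value. The loss function computes the loss, the parameters of every layer are updated, and the loop repeats so that the output approaches the target. Once it is stable, we have the model.

2. In the previous article on reinforcement learning, the mapping of directions was specified by hand (0=up, 1=right, 2=down, 3=left). Why does deep reinforcement learning not need this, and how does it work it out by itself?

Answer: consecutive groups of frames are continuous in time, and so are the bird's actions. The trained model has effectively learned the rules of the game and knows which way to move next in the current state, so no manual mapping is needed. This is the remarkable part of deep learning.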

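4. Code implementation (python3.5)

The entry point is main at the bottom of the code. The run goes through three phases: observe, explore, and train, set by OBSERVE and EXPLORE.

This matches intuition: first observe the environment, then work out how to fly. The observation count is therefore usually small. Training is already happening during exploration, so why separate the two?

The separation covers the more general case and makes the model more accurate. For example, if flying up and flying down have the same value in some state and we have always used the value of flying up, then during exploration we also try flying down and update the Q-values accordingly. This exploration becomes less frequent as the model stabilizes.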
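The exploration schedule can be sketched on its own. The code below is a closed-form approximation of the step-by-step decrement used in train.py (epsilon stays at its initial value during observation, then falls linearly to its final value over EXPLORE steps), together with an epsilon-greedy choice. The constants are copied from the article; the function names epsilon_at and choose_action are made up for this illustration.

```python
import random

import numpy as np

OBSERVE = 1000.
EXPLORE = 3000000.
INITIAL_EPSILON = 0.1
FINAL_EPSILON = 0.0001

def epsilon_at(t):
    """Approximate epsilon after t steps of linear annealing."""
    steps = min(max(t - OBSERVE, 0.0), EXPLORE)
    return INITIAL_EPSILON - (INITIAL_EPSILON - FINAL_EPSILON) * steps / EXPLORE

def choose_action(q_values, epsilon, rng=random):
    """Epsilon-greedy: random index with probability epsilon, else argmax."""
    if rng.random() <= epsilon:
        return rng.randrange(len(q_values))
    return int(np.argmax(q_values))

for t in (0, 1000, 1501000, 3001000, 5000000):
    print(t, epsilon_at(t))
```

Early on roughly one action in ten is random; by the end of the explore phase random actions are rare and the bird almost always follows the network.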

Project structure diagram (the whole project can be downloaded from Baidu Netdisk: https://pan.baidu.com/s/1faj-BHeYt14g3bNtrzsqXA extraction code: vxeb)

train.py

#!/usr/bin/env python
# Written against the TensorFlow 1.x API (tf.placeholder, tf.InteractiveSession).
from __future__ import print_function

import random
import sys
from collections import deque

import cv2
import numpy as np
import tensorflow as tf

sys.path.append("game/")
try:
    from . import wrapped_flappy_bird as game
except Exception:
    import wrapped_flappy_bird as game

'''
Observe for a while first (OBSERVE = 1000, should not be too large),
get the state (4 consecutive frames) => enter the training phase (no upper limit) => action
'''
GAME = 'bird' # the name of the game being played for log files
ACTIONS = 2 # number of valid actions: up, down
GAMMA = 0.99 # decay rate of past observations
OBSERVE = 1000. # timesteps to observe before training
EXPLORE = 3000000. # frames over which to anneal epsilon
FINAL_EPSILON = 0.0001 # final value of epsilon (exploration)
INITIAL_EPSILON = 0.1 # starting value of epsilon
REPLAY_MEMORY = 50000 # number of previous transitions to remember
BATCH = 32 # size of minibatch
FRAME_PER_ACTION = 1

# Alternative settings, kept commented out:
# GAME = 'bird' # the name of the game being played for log files
# ACTIONS = 2 # number of valid actions
# GAMMA = 0.99 # decay rate of past observations
# OBSERVE = 100000. # timesteps to observe before training
# EXPLORE = 2000000. # frames over which to anneal epsilon
# FINAL_EPSILON = 0.0001 # final value of epsilon
# INITIAL_EPSILON = 0.0001 # starting value of epsilon
# REPLAY_MEMORY = 50000 # number of previous transitions to remember
# BATCH = 32 # size of minibatch
# FRAME_PER_ACTION = 1

def weight_variable(shape):
    initial = tf.truncated_normal(shape, stddev = 0.01)
    return tf.Variable(initial)

def bias_variable(shape):
    initial = tf.constant(0.01, shape = shape)
    return tf.Variable(initial)

# padding = 'SAME'  => new_height = new_width = W / S (rounded up)
# padding = 'VALID' => new_height = new_width = (W - F + 1) / S (rounded up)
def conv2d(x, W, stride):
    return tf.nn.conv2d(x, W, strides = [1, stride, stride, 1], padding = "SAME")

def max_pool_2x2(x):
    return tf.nn.max_pool(x, ksize = [1, 2, 2, 1], strides = [1, 2, 2, 1], padding = "SAME")

"""
 Data flow as wired in createNetwork(): 80 * 80 * 4
 conv1(8 * 8 * 4 * 32, Stride = 4) -> 20 * 20 * 32, pool(Stride = 2) -> 10 * 10 * 32
 conv2(4 * 4 * 32 * 64, Stride = 2) -> 5 * 5 * 64 (h_pool2 is defined but not used downstream)
 conv3(3 * 3 * 64 * 64, Stride = 1) -> 5 * 5 * 64, pool(Stride = 2) -> 3 * 3 * 64 = 576
 576 is the size needed when flattening h_pool3 for the fully connected (FC) layer
"""

def createNetwork():
    # network weights
    W_conv1 = weight_variable([8, 8, 4, 32])
    b_conv1 = bias_variable([32])

    W_conv2 = weight_variable([4, 4, 32, 64])
    b_conv2 = bias_variable([64])

    W_conv3 = weight_variable([3, 3, 64, 64])
    b_conv3 = bias_variable([64])

    W_fc1 = weight_variable([576, 512])
    b_fc1 = bias_variable([512])
    # W_fc1 = weight_variable([1600, 512])
    # b_fc1 = bias_variable([512])

    W_fc2 = weight_variable([512, ACTIONS])
    b_fc2 = bias_variable([ACTIONS])

    # input layer
    s = tf.placeholder("float", [None, 80, 80, 4])

    # hidden layers
    h_conv1 = tf.nn.relu(conv2d(s, W_conv1, 4) + b_conv1)
    h_pool1 = max_pool_2x2(h_conv1)

    h_conv2 = tf.nn.relu(conv2d(h_pool1, W_conv2, 2) + b_conv2)
    h_pool2 = max_pool_2x2(h_conv2)  # not used below; conv3 reads h_conv2

    h_conv3 = tf.nn.relu(conv2d(h_conv2, W_conv3, 1) + b_conv3)
    h_pool3 = max_pool_2x2(h_conv3)

    h_pool3_flat = tf.reshape(h_pool3, [-1, 576])
    #h_conv3_flat = tf.reshape(h_conv3, [-1, 1600])

    h_fc1 = tf.nn.relu(tf.matmul(h_pool3_flat, W_fc1) + b_fc1)
    #h_fc1 = tf.nn.relu(tf.matmul(h_conv3_flat, W_fc1) + b_fc1)

    # readout layer
    readout = tf.matmul(h_fc1, W_fc2) + b_fc2

    return s, readout, h_fc1

def trainNetwork(s, readout, h_fc1, sess):
    # define the cost function
    a = tf.placeholder("float", [None, ACTIONS])
    y = tf.placeholder("float", [None])
    # y holds one scalar target per sample, while readout has shape [batch, 2].
    # Multiplying readout by the one-hot action a and summing over axis 1
    # picks out Q(s, a) for the action that was taken. Use reduce_sum here:
    # reduce_mean would scale the selected value by 1 / ACTIONS.
    readout_action = tf.reduce_sum(tf.multiply(readout, a), axis=1)
    cost = tf.reduce_mean(tf.square(y - readout_action))
    train_step = tf.train.AdamOptimizer(1e-6).minimize(cost)

    # open up a game state to communicate with emulator
    game_state = game.GameState()
    # queue that stores the transitions
    # store the previous observations in replay memory
    D = deque()

    # printing (the logs_<GAME> directory must already exist)
    a_file = open("logs_" + GAME + "/readout.txt", 'w')
    h_file = open("logs_" + GAME + "/hidden.txt", 'w')

    # get the first state by doing nothing and preprocess the image to 80x80x4
    do_nothing = np.zeros(ACTIONS)
    do_nothing[0] = 1
    x_t, r_0, terminal = game_state.frame_step(do_nothing)
    #cv2.imwrite('x_t.jpg',x_t)
    x_t = cv2.cvtColor(cv2.resize(x_t, (80, 80)), cv2.COLOR_BGR2GRAY)
    ret, x_t = cv2.threshold(x_t,1,255,cv2.THRESH_BINARY)
    s_t = np.stack((x_t, x_t, x_t, x_t), axis=2)

    # saving and loading networks
    tf.summary.FileWriter("tensorboard/", sess.graph)
    saver = tf.train.Saver()
    # tf.initialize_all_variables() is deprecated in TensorFlow 1.x
    sess.run(tf.global_variables_initializer())
    checkpoint = tf.train.get_checkpoint_state("saved_networks")
    """
    if checkpoint and checkpoint.model_checkpoint_path:
        saver.restore(sess, checkpoint.model_checkpoint_path)
        print("Successfully loaded:", checkpoint.model_checkpoint_path)
    else:
        print("Could not find old network weights")
    """
    # start training
    epsilon = INITIAL_EPSILON
    t = 0
    while True:
        # choose an action epsilon greedily
        # prediction: value of each action in the current state (only two actions: up, down)
        readout_t = readout.eval(feed_dict={s : [s_t]})[0]
        a_t = np.zeros([ACTIONS])
        action_index = 0
        if t % FRAME_PER_ACTION == 0:
            # Add some exploration, e.g. try other actions with the same value,
            # which helps the model generalize. epsilon decays as the model
            # stabilizes, so exploration becomes rarer over time.
            if random.random() <= epsilon:
                # pick a random action within ACTIONS for the current state
                print("----------Random Action----------")
                action_index = random.randrange(ACTIONS)
                a_t[action_index] = 1
            else:
                # the action with the largest predicted value is the next move
                action_index = np.argmax(readout_t)
                a_t[action_index] = 1
        else:
            a_t[0] = 1 # do nothing

        # scale down epsilon: the model is more stable, so explore less
        if epsilon > FINAL_EPSILON and t > OBSERVE:
            epsilon -= (INITIAL_EPSILON - FINAL_EPSILON) / EXPLORE

        # run the selected action and observe next state and reward
        x_t1_colored, r_t, terminal = game_state.frame_step(a_t)
        # resize to 80 * 80 first, then convert to grayscale
        x_t1 = cv2.cvtColor(cv2.resize(x_t1_colored, (80, 80)), cv2.COLOR_BGR2GRAY)
        # binarize the new frame x_t1 with threshold 1
        ret, x_t1 = cv2.threshold(x_t1, 1, 255, cv2.THRESH_BINARY)
        x_t1 = np.reshape(x_t1, (80, 80, 1))
        #s_t1 = np.append(x_t1, s_t[:,:,1:], axis = 2)
        # new frame + the first 3 frames of the previous state,
        # so every input is a stack of 4 images
        s_t1 = np.append(x_t1, s_t[:, :, :3], axis=2)

        # store the transition in D
        # s_t: current state (80 * 80 * 4)
        # a_t: action taken (1 * 2)
        # r_t: immediate reward
        # s_t1: next state
        # terminal: result of the action (hit an obstacle: True => yes, False => no)
        # D works as a queue: once over the limit, drop the leftmost (oldest) element
        D.append((s_t, a_t, r_t, s_t1, terminal))
        if len(D) > REPLAY_MEMORY:
            D.popleft()

        # only train if done observing
        if t > OBSERVE:
            # sample batch = 32 stored transitions
            minibatch = random.sample(D, BATCH)
            # get the batch variables
            # batch(32) states at time j
            s_j_batch = [d[0] for d in minibatch]
            # batch(32) actions
            a_batch = [d[1] for d in minibatch]
            # batch(32) rewards
            r_batch = [d[2] for d in minibatch]
            # batch(32) states at time j + 1
            s_j1_batch = [d[3] for d in minibatch]
            # readout_j1_batch => (32, 2)
            y_batch = []
            readout_j1_batch = sess.run(readout, feed_dict = {s : s_j1_batch})
            for i in range(0, len(minibatch)):
                terminal = minibatch[i][4]
                # if terminal, only equals reward
                if terminal:  # hit an obstacle, the episode ends
                    y_batch.append(r_batch[i])
                else:  # immediate reward + discounted value of the next state
                    y_batch.append(r_batch[i] + GAMMA * np.max(readout_j1_batch[i]))
            # cost -> gradients -> backpropagation -> parameter update
            # perform gradient step
            # y, a and s are placeholders without values of their own,
            # so all three must be fed when running train_step
            train_step.run(feed_dict = {
                y : y_batch,
                a : a_batch,
                s : s_j_batch}
            )

        # update the old values
        s_t = s_t1  # advance the state
        t += 1

        # save progress every 10000 iterations
        if t % 10000 == 0:
            saver.save(sess, 'saved_networks/' + GAME + '-dqn', global_step = t)

        # print info
        state = ""
        if t <= OBSERVE:
            state = "observe"
        elif t > OBSERVE and t <= OBSERVE + EXPLORE:
            state = "explore"
        else:
            state = "train"

        print("terminal", terminal, \
              "TIMESTEP", t, "/ STATE", state, \
            "/ EPSILON", epsilon, "/ ACTION", action_index, "/ REWARD", r_t, \
            "/ Q_MAX %e" % np.max(readout_t))
        # write info to files
        '''
        if t % 10000 <= 100:
            a_file.write(",".join([str(x) for x in readout_t]) + '\n')
            h_file.write(",".join([str(x) for x in h_fc1.eval(feed_dict={s:[s_t]})[0]]) + '\n')
            cv2.imwrite("logs_tetris/frame" + str(t) + ".png", x_t1)
        '''

def playGame():
    sess = tf.InteractiveSession()
    s, readout, h_fc1 = createNetwork()
    trainNetwork(s, readout, h_fc1, sess)

def main():
    playGame()

if __name__ == "__main__":
    main()


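5. Results and analysis

Video cannot be uploaded, so only a few representative screenshots are shown. I trained for 2920000 steps; using the resulting model for prediction, the bird recognizes the obstacles by itself and does not collide. The settings for training and prediction are:

Training: OBSERVE = 1000, EXPLORE = 3000000

Prediction: OBSERVE = 100000, EXPLORE = 3000000 (prediction loads the saved model, so no training is needed and OBSERVE should be as large as possible)

For prediction, uncomment the following model-loading block in train.py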

 """
    if checkpoint and checkpoint.model_checkpoint_path:
        saver.restore(sess, checkpoint.model_checkpoint_path)
        print("Successfully loaded:", checkpoint.model_checkpoint_path)
    else:
        print("Could not find old network weights")
"""

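Screenshots of the bird in flight

In prediction mode, running the code makes the bird fly by itself, and the following values are printed:

Printed values

terminal: whether the bird hit an obstacle (True: yes, False: no);

TIMESTEP: the number of steps run so far;

STATE: the current phase of the run (observe, explore, train);

EPSILON: the probability of taking a random (exploratory) action, which decreases gradually;

ACTION: the index of the action taken (the highest-valued action unless a random one was chosen);

REWARD: the immediate reward;

Q_MAX: the largest value among the action outputs;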
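The STATE label follows directly from the step counter. The sketch below mirrors the thresholds used in the print logic of train.py, with the training settings from above. The function name phase is an assumption made for this illustration.

```python
OBSERVE = 1000.
EXPLORE = 3000000.

def phase(t):
    """Return the phase label printed for timestep t."""
    if t <= OBSERVE:
        return "observe"
    if t <= OBSERVE + EXPLORE:
        return "explore"
    return "train"

for t in (1, 1000, 1001, 3001000, 3001001):
    print(t, phase(t))
```

With these settings the label only switches to train after a little over three million steps, even though parameter updates start as soon as the observe phase ends.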

