Playing Flappy Bird with Deep Q Learning (DQN) in TensorFlow

Preface

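In 2013 DeepMind published the paper Playing Atari with Deep Reinforcement Learning at NIPS. It proposed the DQN (Deep Q Network) algorithm, which learns to play Atari games end to end: the only input is pixels, so the agent plays by looking at the screen. On the strength of this application, DeepMind was acquired by Google for 600 million US dollars. Since DQN was open-sourced, many different DQN implementations have appeared on GitHub, but most of them reproduce the Atari games, with large code bases that are hard to follow.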

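Flappy Bird is an extremely simple yet difficult game that was wildly popular for a while. Quite some time ago, someone already used the Q-Learning algorithm to play Flappy Bird: http://sarvagyavaish.github.io/FlappyBirdRL/
That implementation, however, works by reading the bird's exact position.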

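Whether DQN can learn to play Flappy Bird from the screen alone is an interesting challenge. (A friend and I also considered this idea at the end of last year, but at the time we did not know how to capture the game screen, so we could only learn from the exact positions. That did in fact work.)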

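Recently, someone released code on GitHub that uses DQN to play Flappy Bird: https://github.com/yenchenlin1994/DeepLearningFlappyBird【1】
That repo realizes the idea by combining earlier repos. It analyzes the whole implementation in some detail, but because its DQN code is largely taken from someone else's repo, the code is rather messy and hard to follow.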

For this reason I rewrote a version: https://github.com/songrotek/DRL-FlappyBird

The DQN code has been rewritten. In essence it is now wrapped in a class, which makes the code more general and easy to port to other applications.

Of course, the purpose of this article is to use this Flappy Bird DQN code to analyze the DQN algorithm and its use in detail.

DQN pseudocode

This is the pseudocode of the NIPS13 version:

Initialize replay memory D to size N
Initialize action-value function Q with random weights
for episode = 1, M do
    Initialize state s_1
    for t = 1, T do
        With probability ϵ select random action a_t
        otherwise select a_t = argmax_a Q(s_t, a; θ)
        Execute action a_t in emulator and observe r_t and s_(t+1)
        Store transition (s_t, a_t, r_t, s_(t+1)) in D
        Sample a minibatch of transitions (s_j, a_j, r_j, s_(j+1)) from D
        Set y_j :=
            r_j for terminal s_(j+1)
            r_j + γ * max_a' Q(s_(j+1), a'; θ) for non-terminal s_(j+1)
        Perform a gradient step on (y_j - Q(s_j, a_j; θ))^2 with respect to θ
    end for
end for
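
The action-selection step of the pseudocode can be sketched in a few lines of plain Python. This is only an illustration: the function name `select_action` and the list-based Q values are assumptions made for the example, not part of the original code.

```python
import random

def select_action(q_values, epsilon, rng=random):
    """Return an action index: random with probability epsilon, greedy otherwise."""
    if rng.random() < epsilon:
        # explore: pick any action uniformly at random
        return rng.randrange(len(q_values))
    # exploit: index of the largest Q value (argmax over a plain list)
    return max(range(len(q_values)), key=lambda a: q_values[a])

# With epsilon = 0 the greedy action is always chosen.
print(select_action([0.1, 0.7], epsilon=0.0))  # 1
```

With epsilon = 1 every choice is random, so an annealed epsilon moves the agent gradually from exploring to exploiting.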

For the basic analysis, see Paper Reading 1 - Playing Atari with Deep Reinforcement Learning.
For the background, see Deep Reinforcement Learning Fundamentals (DQN).

This article looks mainly at the implementation side: how to write the Flappy Bird DQN code.

Writing FlappyBirdDQN.py

First, the Flappy Bird game itself is already written and ready to use. It offers a very simple interface:

nextObservation,reward,terminal = game.frame_step(action)

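That is, you pass in an action and get back the screenshot taken after the action is executed, the resulting reward, and whether the game has ended.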
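
To make the contract of this interface concrete, here is a minimal stand-in engine. It is a sketch under stated assumptions: `DummyGame`, its episode length, the reward values, and the meaning of `[0, 1]` as "flap" are all hypothetical and only mimic the shape of `frame_step`; it is not the real `wrapped_flappy_bird` module.

```python
class DummyGame:
    """Hypothetical stand-in for game.GameState (not the real Flappy Bird engine)."""

    def __init__(self, episode_length=3, size=4):
        self.episode_length = episode_length
        self.size = size
        self.t = 0

    def frame_step(self, action):
        # action is one-hot: [1, 0] = do nothing, [0, 1] = flap (assumed meaning)
        if sum(action) != 1:
            raise ValueError("exactly one action must be selected")
        self.t += 1
        terminal = self.t >= self.episode_length
        # assumed rewards: small bonus for surviving, penalty when the episode ends
        reward = -1.0 if terminal else 0.1
        # a blank "screenshot": size x size pixels, each an (R, G, B) tuple
        observation = [[(0, 0, 0)] * self.size for _ in range(self.size)]
        if terminal:
            self.t = 0  # the game restarts after a crash
        return observation, reward, terminal


game = DummyGame(episode_length=2)
print(game.frame_step([1, 0])[1:])  # (0.1, False)
print(game.frame_step([0, 1])[1:])  # (-1.0, True)
```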

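Now think of DQN as a brain, which we represent with a BrainDQN class. This class only needs to receive the perception, namely the observation (screenshot) mentioned above, the reward, and the terminal flag, and then output an action.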

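This is what clean encapsulation should look like: how DQN stores its data and how it trains are of no concern to the outside.
As a result, our FlappyBirdDQN code is only this short: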

# -------------------------
# Project: Deep Q-Learning on Flappy Bird
# Author: Flood Sung
# Date: 2016.3.21
# -------------------------

import cv2
import sys
sys.path.append("game/")
import wrapped_flappy_bird as game
from BrainDQN import BrainDQN
import numpy as np

# preprocess raw image to 80*80 gray image
def preprocess(observation):
    observation = cv2.cvtColor(cv2.resize(observation, (80, 80)), cv2.COLOR_BGR2GRAY)
    ret, observation = cv2.threshold(observation,1,255,cv2.THRESH_BINARY)
    return np.reshape(observation,(80,80,1))

def playFlappyBird():
    # Step 1: init BrainDQN
    brain = BrainDQN()
    # Step 2: init Flappy Bird Game
    flappyBird = game.GameState()
    # Step 3: play game
    # Step 3.1: obtain init state
    action0 = np.array([1,0])  # do nothing
    observation0, reward0, terminal = flappyBird.frame_step(action0)
    # keep the first frame 2-D (80, 80): setInitState stacks four copies along axis 2
    observation0 = cv2.cvtColor(cv2.resize(observation0, (80, 80)), cv2.COLOR_BGR2GRAY)
    ret, observation0 = cv2.threshold(observation0,1,255,cv2.THRESH_BINARY)
    brain.setInitState(observation0)

    # Step 3.2: run the game
    while True:
        action = brain.getAction()
        nextObservation,reward,terminal = flappyBird.frame_step(action)
        nextObservation = preprocess(nextObservation)
        brain.setPerception(nextObservation,action,reward,terminal)

def main():
    playFlappyBird()

if __name__ == '__main__':
    main()

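The core is the while loop. Because the image has to be converted into an 80x80 grayscale image (which is then thresholded to black and white), a preprocess function was added.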
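
The thresholding step inside preprocess is easy to picture without OpenCV. The sketch below mirrors what `cv2.threshold(..., 1, 255, cv2.THRESH_BINARY)` does to each pixel, using nested lists instead of arrays; the helper name `binarize` is made up for this illustration.

```python
def binarize(gray, thresh=1, maxval=255):
    """Map every pixel strictly above `thresh` to `maxval` and all others to 0."""
    return [[maxval if pixel > thresh else 0 for pixel in row] for row in gray]

frame = [[0, 1, 2],
         [40, 0, 255]]
print(binarize(frame))  # [[0, 0, 255], [255, 0, 255]]
```

With a threshold of 1, anything that is not (almost) black becomes white, so the network sees only the silhouettes of the bird and the pipes.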

Clearly, as long as a game engine is available, another game can be plugged in with the same code, which is very convenient.
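
The following sketch illustrates that point: the driving loop depends only on `frame_step` on the game side and on the three BrainDQN methods on the agent side. `RandomBrain`, `CoinGame`, and `run` are hypothetical names invented for this example, and the loop runs for a fixed number of steps instead of forever.

```python
import random

class RandomBrain:
    """Hypothetical agent with the same three methods as BrainDQN; it acts at random."""

    def __init__(self, n_actions=2):
        self.n_actions = n_actions
        self.steps = 0

    def setInitState(self, observation):
        self.current = observation

    def getAction(self):
        action = [0] * self.n_actions
        action[random.randrange(self.n_actions)] = 1  # one-hot action
        return action

    def setPerception(self, nextObservation, action, reward, terminal):
        self.current = nextObservation
        self.steps += 1


class CoinGame:
    """Hypothetical engine exposing frame_step(action) like the Flappy Bird wrapper."""

    def frame_step(self, action):
        reward = 1.0 if action[1] == 1 else 0.0
        return [[0]], reward, False


def run(brain, env, n_steps):
    """Drive any engine with any brain for a fixed number of steps."""
    observation, _, _ = env.frame_step([1, 0])
    brain.setInitState(observation)
    total = 0.0
    for _ in range(n_steps):
        action = brain.getAction()
        observation, reward, terminal = env.frame_step(action)
        brain.setPerception(observation, action, reward, terminal)
        total += reward
    return total


print(run(RandomBrain(), CoinGame(), 10))
```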

Next we write BrainDQN.py, our game brain.

Writing BrainDQN

Basic structure:

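from collections import deque

class BrainDQN:
    def __init__(self):
        # init replay memory
        self.replayMemory = deque()
        # init Q network
        self.createQNetwork()

    def createQNetwork(self):
        pass  # build the CNN that outputs the Q values

    def trainQNetwork(self):
        pass  # sample a minibatch and take one gradient step

    def setPerception(self,nextObservation,action,reward,terminal):
        pass  # store the transition and train when ready

    def getAction(self):
        pass  # choose an action (epsilon-greedy)

    def setInitState(self,observation):
        pass  # build the first state from the first frame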

These few functions are all the basic structure needs; anything else is superfluous. Next we write the code for each part.
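
Before going into the network, the replay memory initialised in `__init__` can be illustrated on its own. This is a minimal sketch, not the author's class: `ReplayMemory` is a hypothetical wrapper, and it uses `deque(maxlen=...)` to drop old transitions automatically, whereas the code in this article calls `popleft()` by hand.

```python
import random
from collections import deque

class ReplayMemory:
    """Hypothetical fixed-size store of (s, a, r, s', terminal) transitions."""

    def __init__(self, capacity):
        # a deque with maxlen discards the oldest item once it is full
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, terminal):
        self.buffer.append((state, action, reward, next_state, terminal))

    def sample(self, batch_size):
        # uniform random minibatch, as in the DQN pseudocode
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=3)
for i in range(5):
    memory.add(i, 0, 0.0, i + 1, False)
print(len(memory))                    # 3
print([t[0] for t in memory.buffer])  # [2, 3, 4]
```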

CNN code

This is the createQNetwork part. It uses the structure shown in the figure below (taken from 【1】):

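The whole pipeline is not explained here. The main task is to design the convolutional and fully connected layers around the specific input type and output.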
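
One detail worth checking when designing these layers is where the flattened size 1600 used by the first fully connected layer comes from. With "SAME" padding, each convolution or pooling layer produces ceil(input / stride) outputs per spatial dimension. The small calculation below walks through the strides used in createQNetwork; the helper name `same_out` is made up for this sketch.

```python
import math

def same_out(size, stride):
    """Spatial output size of a conv or pool layer with SAME padding."""
    return math.ceil(size / stride)

size = 80                 # input is 80 x 80 x 4
size = same_out(size, 4)  # conv1, stride 4        -> 20
size = same_out(size, 2)  # 2x2 max pool, stride 2 -> 10
size = same_out(size, 2)  # conv2, stride 2        -> 5
size = same_out(size, 1)  # conv3, stride 1        -> 5
flat = size * size * 64   # conv3 has 64 output channels
print(flat)  # 1600
```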

The code is as follows:

    def createQNetwork(self):
        # network weights
        W_conv1 = self.weight_variable([8,8,4,32])
        b_conv1 = self.bias_variable([32])

        W_conv2 = self.weight_variable([4,4,32,64])
        b_conv2 = self.bias_variable([64])

        W_conv3 = self.weight_variable([3,3,64,64])
        b_conv3 = self.bias_variable([64])

        W_fc1 = self.weight_variable([1600,512])
        b_fc1 = self.bias_variable([512])

        W_fc2 = self.weight_variable([512,self.ACTION])
        b_fc2 = self.bias_variable([self.ACTION])

        # input layer

        self.stateInput = tf.placeholder("float",[None,80,80,4])

        # hidden layers
        h_conv1 = tf.nn.relu(self.conv2d(self.stateInput,W_conv1,4) + b_conv1)
        h_pool1 = self.max_pool_2x2(h_conv1)

        h_conv2 = tf.nn.relu(self.conv2d(h_pool1,W_conv2,2) + b_conv2)

        h_conv3 = tf.nn.relu(self.conv2d(h_conv2,W_conv3,1) + b_conv3)

        h_conv3_flat = tf.reshape(h_conv3,[-1,1600])
        h_fc1 = tf.nn.relu(tf.matmul(h_conv3_flat,W_fc1) + b_fc1)

        # Q Value layer
        self.QValue = tf.matmul(h_fc1,W_fc2) + b_fc2

        self.actionInput = tf.placeholder("float",[None,self.ACTION])
        self.yInput = tf.placeholder("float", [None]) 
        Q_action = tf.reduce_sum(tf.mul(self.QValue, self.actionInput), reduction_indices = 1)
        self.cost = tf.reduce_mean(tf.square(self.yInput - Q_action))
        self.trainStep = tf.train.AdamOptimizer(1e-6).minimize(self.cost)

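Remember that the output is the Q values. The key is computing the cost, and the key part of that is Q_action, the Q value for the given state and action. Because actionInput is a one-hot vector, multiplying it elementwise with the Q values via tf.mul(self.QValue, self.actionInput) and summing along axis 1 leaves exactly the Q value of that action.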
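
The same masking trick can be checked by hand with plain lists. This is only a toy version of the multiply-and-sum; the function name `q_for_action` and the numbers are invented for the example.

```python
def q_for_action(q_values, action_one_hot):
    """Elementwise product with a one-hot vector, then sum: picks out one Q value."""
    return sum(q * a for q, a in zip(q_values, action_one_hot))

batch_q = [[0.2, 1.5], [0.9, -0.3]]   # Q values for two states
batch_a = [[0, 1], [1, 0]]            # actions taken, one-hot
print([q_for_action(q, a) for q, a in zip(batch_q, batch_a)])  # [1.5, 0.9]
```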

The training part

This is the key part of the code. The main task is computing y, the target Q value.

    def trainQNetwork(self):
        # Step 1: obtain random minibatch from replay memory
        minibatch = random.sample(self.replayMemory,self.BATCH_SIZE)
        state_batch = [data[0] for data in minibatch]
        action_batch = [data[1] for data in minibatch]
        reward_batch = [data[2] for data in minibatch]
        nextState_batch = [data[3] for data in minibatch]

        # Step 2: calculate y 
        y_batch = []
        QValue_batch = self.QValue.eval(feed_dict={self.stateInput:nextState_batch})
        for i in range(0,self.BATCH_SIZE):
            terminal = minibatch[i][4]
            if terminal:
                y_batch.append(reward_batch[i])
            else:
                y_batch.append(reward_batch[i] + self.GAMMA * np.max(QValue_batch[i]))

        self.trainStep.run(feed_dict={
            self.yInput : y_batch,
            self.actionInput : action_batch,
            self.stateInput : state_batch
            })
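
The target computation in Step 2 can be tried out in isolation. The sketch below uses plain lists in place of the network output; `td_targets` and the sample numbers are assumptions made for illustration only.

```python
def td_targets(rewards, next_q_values, terminals, gamma=0.99):
    """y = r for terminal transitions, otherwise r + gamma * max_a' Q(s', a')."""
    targets = []
    for reward, q_next, terminal in zip(rewards, next_q_values, terminals):
        if terminal:
            targets.append(reward)
        else:
            targets.append(reward + gamma * max(q_next))
    return targets

rewards = [1.0, -1.0]
next_q = [[0.5, 2.0], [3.0, 4.0]]   # Q(s', .) for each sampled transition
terminals = [False, True]
print(td_targets(rewards, next_q, terminals, gamma=0.5))  # [2.0, -1.0]
```

Note that the second transition ignores its next-state Q values entirely, because nothing follows a terminal state.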

The remaining parts

The remaining parts are fairly easy, so here is the complete code:

# -----------------------------
# File: Deep Q-Learning Algorithm
# Author: Flood Sung
# Date: 2016.3.21
# -----------------------------

import tensorflow as tf 
import numpy as np 
import random
from collections import deque 

class BrainDQN:

    # Hyper Parameters:
    ACTION = 2
    FRAME_PER_ACTION = 1
    GAMMA = 0.99 # discount factor for future rewards
    OBSERVE = 100000. # timesteps to observe before training
    EXPLORE = 150000. # frames over which to anneal epsilon
    FINAL_EPSILON = 0.0 # final value of epsilon
    INITIAL_EPSILON = 0.0 # starting value of epsilon
    REPLAY_MEMORY = 50000 # number of previous transitions to remember
    BATCH_SIZE = 32 # size of minibatch

    def __init__(self):
        # init replay memory
        self.replayMemory = deque()
        # init Q network
        self.createQNetwork()
        # init some parameters
        self.timeStep = 0
        self.epsilon = self.INITIAL_EPSILON

    def createQNetwork(self):
        # network weights
        W_conv1 = self.weight_variable([8,8,4,32])
        b_conv1 = self.bias_variable([32])

        W_conv2 = self.weight_variable([4,4,32,64])
        b_conv2 = self.bias_variable([64])

        W_conv3 = self.weight_variable([3,3,64,64])
        b_conv3 = self.bias_variable([64])

        W_fc1 = self.weight_variable([1600,512])
        b_fc1 = self.bias_variable([512])

        W_fc2 = self.weight_variable([512,self.ACTION])
        b_fc2 = self.bias_variable([self.ACTION])

        # input layer

        self.stateInput = tf.placeholder("float",[None,80,80,4])

        # hidden layers
        h_conv1 = tf.nn.relu(self.conv2d(self.stateInput,W_conv1,4) + b_conv1)
        h_pool1 = self.max_pool_2x2(h_conv1)

        h_conv2 = tf.nn.relu(self.conv2d(h_pool1,W_conv2,2) + b_conv2)

        h_conv3 = tf.nn.relu(self.conv2d(h_conv2,W_conv3,1) + b_conv3)

        h_conv3_flat = tf.reshape(h_conv3,[-1,1600])
        h_fc1 = tf.nn.relu(tf.matmul(h_conv3_flat,W_fc1) + b_fc1)

        # Q Value layer
        self.QValue = tf.matmul(h_fc1,W_fc2) + b_fc2

        self.actionInput = tf.placeholder("float",[None,self.ACTION])
        self.yInput = tf.placeholder("float", [None]) 
        Q_action = tf.reduce_sum(tf.mul(self.QValue, self.actionInput), reduction_indices = 1)
        self.cost = tf.reduce_mean(tf.square(self.yInput - Q_action))
        self.trainStep = tf.train.AdamOptimizer(1e-6).minimize(self.cost)

        # saving and loading networks
        self.saver = tf.train.Saver()  # kept on self so trainQNetwork can save checkpoints
        self.session = tf.InteractiveSession()
        self.session.run(tf.initialize_all_variables())
        checkpoint = tf.train.get_checkpoint_state("saved_networks")
        if checkpoint and checkpoint.model_checkpoint_path:
            self.saver.restore(self.session, checkpoint.model_checkpoint_path)
            print("Successfully loaded: " + checkpoint.model_checkpoint_path)
        else:
            print("Could not find old network weights")

    def trainQNetwork(self):
        # Step 1: obtain random minibatch from replay memory
        minibatch = random.sample(self.replayMemory,self.BATCH_SIZE)
        state_batch = [data[0] for data in minibatch]
        action_batch = [data[1] for data in minibatch]
        reward_batch = [data[2] for data in minibatch]
        nextState_batch = [data[3] for data in minibatch]

        # Step 2: calculate y
        y_batch = []
        QValue_batch = self.QValue.eval(feed_dict={self.stateInput:nextState_batch})
        for i in range(0,self.BATCH_SIZE):
            terminal = minibatch[i][4]
            if terminal:
                y_batch.append(reward_batch[i])
            else:
                y_batch.append(reward_batch[i] + self.GAMMA * np.max(QValue_batch[i]))

        self.trainStep.run(feed_dict={
            self.yInput : y_batch,
            self.actionInput : action_batch,
            self.stateInput : state_batch
            })

        # save network every 10000 iterations
        if self.timeStep % 10000 == 0:
            self.saver.save(self.session, 'saved_networks/' + 'network' + '-dqn', global_step = self.timeStep)


    def setPerception(self,nextObservation,action,reward,terminal):
        # newest frame goes to channel 0; keep the three most recent old frames and drop the oldest
        newState = np.append(nextObservation,self.currentState[:,:,:3],axis = 2)
        self.replayMemory.append((self.currentState,action,reward,newState,terminal))
        if len(self.replayMemory) > self.REPLAY_MEMORY:
            self.replayMemory.popleft()
        if self.timeStep > self.OBSERVE:
            # Train the network
            self.trainQNetwork()

        self.currentState = newState
        self.timeStep += 1

    def getAction(self):
        QValue = self.QValue.eval(feed_dict= {self.stateInput:[self.currentState]})[0]
        action = np.zeros(self.ACTION)
        action_index = 0
        if self.timeStep % self.FRAME_PER_ACTION == 0:
            if random.random() <= self.epsilon:
                action_index = random.randrange(self.ACTION)
                action[action_index] = 1
            else:
                action_index = np.argmax(QValue)
                action[action_index] = 1
        else:
            action[0] = 1 # do nothing

        # anneal epsilon
        if self.epsilon > self.FINAL_EPSILON and self.timeStep > self.OBSERVE:
            self.epsilon -= (self.INITIAL_EPSILON - self.FINAL_EPSILON)/self.EXPLORE

        return action

    def setInitState(self,observation):
        self.currentState = np.stack((observation, observation, observation, observation), axis = 2)

    def weight_variable(self,shape):
        initial = tf.truncated_normal(shape, stddev = 0.01)
        return tf.Variable(initial)

    def bias_variable(self,shape):
        initial = tf.constant(0.01, shape = shape)
        return tf.Variable(initial)

    def conv2d(self,x, W, stride):
        return tf.nn.conv2d(x, W, strides = [1, stride, stride, 1], padding = "SAME")

    def max_pool_2x2(self,x):
        return tf.nn.max_pool(x, ksize = [1, 2, 2, 1], strides = [1, 2, 2, 1], padding = "SAME")
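
The state fed to the network is a stack of the four most recent frames, so it helps to see how such a stack should roll forward over time. The sketch below is a hypothetical illustration using frame labels instead of images; `FrameStack` is not part of BrainDQN, and it keeps the newest frame first, matching the order in which setPerception prepends the new observation.

```python
from collections import deque

class FrameStack:
    """Hypothetical rolling window over the last n frames, newest first."""

    def __init__(self, first_frame, n=4):
        # start with n copies of the first frame, like setInitState
        self.frames = deque([first_frame] * n, maxlen=n)

    def push(self, frame):
        # appendleft on a full deque drops the oldest frame from the right end
        self.frames.appendleft(frame)
        return list(self.frames)


stack = FrameStack("f0")
print(stack.push("f1"))  # ['f1', 'f0', 'f0', 'f0']
stack.push("f2")
stack.push("f3")
print(stack.push("f4"))  # ['f4', 'f3', 'f2', 'f1']
```

After four pushes no copy of the initial frame is left, which is what lets the network infer the bird's motion from consecutive frames.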

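In total that is only about 160 lines of code.
If this task did not use deep learning, and the bird instead had to be located in the image by hand, its trajectory computed, and the right key press derived from that, the code would likely run to several thousand lines. Deep learning greatly reduces the coding work.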

Summary

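This article has analyzed DQN from the code perspective. For applications of DQN, readers can build on this code and try all kinds of experiments.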
