Implementing an AI Flappy-Bird with the DQN Algorithm in PARL

A Flappy-Bird example built on PARL's official DQN algorithm has actually existed since April 2019, so strictly speaking this is a reproduction.

However, running it on the current PARL 1.3.1 hits some version-compatibility problems, and it cannot be run directly on AI Studio either. So I'm taking this opportunity to polish it up and, at the same time, consolidate what I have learned recently.

I have published this project on AI Studio:
https://aistudio.baidu.com/aistudio/projectdetail/580622

First, try to get the project running locally

Visualizing Flappy-Bird locally

Preparing the game environment

The GitHub repository is here:
https://github.com/kosoraYintai/PARL-Sample/tree/master/flappy_bird

The repository appears to contain two versions of Flappy-Bird: a plain one and a rainBow one. The game itself is the same and I couldn't figure out what the difference was, so to avoid confusion I removed the rainBow version here.

Besides the Flappy-Bird game environment, the repository also contains the corresponding reinforcement-learning code; it just runs into a few small problems under PARL 1.3.1:
[screenshot: version-compatibility error]
Worse, fixing one problem tends to surface another, which makes debugging fairly tedious. Here I have already worked through all of them, so the code below runs as-is.

When training locally, the game environment needs no changes:
[screenshot: flappy_bird directory structure]
There are two folders under flappy_bird:

  • assets holds the static files the game needs, mainly images
  • game holds the code that implements the game itself, written with pygame.
    Within it, flappy_bird_utils.py mainly loads the static files, while the actual game rules are implemented in BirdEnv.py

What the AI Flappy-Bird looks like

In the original code, modules had to be imported from inside PARL; here I pulled everything into a single folder:
[screenshot: flattened file layout]
This is one of the changes; without it you get file-not-found errors, so I simply put everything together.

Local training renders the game, which makes it fairly slow, so I stopped after only a few hundred episodes; locally I mainly tested the trained model:

[video: AI Flappy-Bird]

If you'd rather not watch the video, here is a GIF:

[GIF: the trained bird playing]

Training Flappy-Bird on the server

Because local training is very slow, I no longer train locally; the local machine is only used for visualization, and training is moved to the server.

AI Studio project:
https://aistudio.baidu.com/aistudio/projectdetail/580622

The current version only gets the project running end to end; I will tune the hyperparameters later to get better results.

Before starting: since the server cannot render a display, add the following code to the game-environment file flappy_bird/game/BirdEnv.py:

import os 
os.environ["SDL_VIDEODRIVER"] = "dummy"

Otherwise it will fail with ==> pygame.error: No available video device
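
If you want the same BirdEnv.py to run both locally (with a window) and on a headless server, one option (my own workaround, not part of the original project) is to set the dummy driver only when no display is available:

import os
# Use the dummy video driver only when there is no display, e.g. on a headless Linux server;
# on a local desktop this leaves normal rendering untouched.
if os.name != 'nt' and not os.environ.get('DISPLAY'):
    os.environ["SDL_VIDEODRIVER"] = "dummy"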

The project offers two ways to train

Run the file directly to start training

The first way is very simple: after installing the dependencies, run it directly from the command line:
[screenshot: starting training from the command line]
Before the real training starts, the agent keeps playing the game with random actions to warm up the replay memory, which takes a while. If you want to see the progress, uncomment the progress-bar code.

I trained for 3000+ episodes, but the results are still not ideal.

Step through and run the code

The result here is the same as above, provided you don't change the code.

When tuning hyperparameters, I recommend this approach: run the notebook one cell at a time:
[screenshot: running the notebook cell by cell]

Model

# Model
import numpy as np
from parl import layers
from parl import Model

#the classic Atari-game convolutional network
class BirdModel(Model):
    def __init__(self, act_dim):
        self.act_dim = act_dim
        #padding mode is 'valid'
        p_valid=0
        self.conv1 = layers.conv2d(
            num_filters=32, filter_size=8, stride=4, padding=p_valid, act='relu')
        self.conv2 = layers.conv2d(
            num_filters=64, filter_size=4, stride=2, padding=p_valid, act='relu')
        self.conv3 = layers.conv2d(
            num_filters=64, filter_size=3, stride=1, padding=p_valid, act='relu')
        self.fc0 = layers.fc(size=512)
        self.fc1 = layers.fc(size=act_dim)
        
    def value(self, obs):
        #normalize the input
        obs = obs / 255.0
        out = self.conv1(obs)
        out = self.conv2(out)
        out = self.conv3(out)
        out = layers.flatten(out, axis=1)
        out = self.fc0(out)
        out = self.fc1(out)
        return out
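
As a quick sanity check on the shapes (my own note, not in the original code): with an 84x84 input, the three valid-padded convolutions of (8, stride 4), (4, stride 2) and (3, stride 1) give feature maps of 20x20, 9x9 and 7x7, so the flatten layer feeds 64*7*7 = 3136 values into fc0.

# size of a feature map after a 'valid' convolution
def conv_out(size, kernel, stride):
    return (size - kernel) // stride + 1

s = 84
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    s = conv_out(s, kernel, stride)
    print(s)           # prints 20, 9, 7
print(64 * s * s)      # 3136 inputs to the fully connected layer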

Agent

# Agent
import numpy as np
import paddle.fluid as fluid
from parl import layers
from parl import Agent

IMAGE_SIZE = (84, 84)
CONTEXT_LEN = 4

#the agent
class BirdAgent(Agent):
    def __init__(self, algorithm, action_dim):
        super(BirdAgent, self).__init__(algorithm)
        
        self.action_dim = action_dim
        self.global_step = 0
        
        #update the target network every this many training steps (tunable hyperparameter)
        self.update_target_steps = 5000
        #initial exploration probability epsilon (tunable)
        self.exploration = 0.8
        #exploration decay per step (tunable)
        self.exploration_dacay=1e-6
        #minimum exploration probability (tunable)
        self.min_exploration=0.05

    def build_program(self):
        self.learn_programs = []
        self.predict_programs=[]
        self.pred_program = fluid.Program()
        self.learn_program = fluid.Program()

        with fluid.program_guard(self.pred_program):
            obs = layers.data(
                name='obs',
                shape=[CONTEXT_LEN, IMAGE_SIZE[0], IMAGE_SIZE[1]],
                dtype='float32')
            self.value = self.alg.predict(obs)

        with fluid.program_guard(self.learn_program):
            obs = layers.data(
                name='obs',
                shape=[CONTEXT_LEN, IMAGE_SIZE[0], IMAGE_SIZE[1]],
                dtype='float32')
            action = layers.data(name='act', shape=[1], dtype='int32')
            reward = layers.data(name='reward', shape=[], dtype='float32')
            next_obs = layers.data(
                name='next_obs',
                shape=[CONTEXT_LEN, IMAGE_SIZE[0], IMAGE_SIZE[1]],
                dtype='float32')
            terminal = layers.data(name='terminal', shape=[], dtype='bool')
            self.cost = self.alg.learn(obs, action, reward, next_obs,
                                              terminal)
        self.learn_programs.append(self.learn_program)
        self.predict_programs.append(self.pred_program)
    
    #ε-greedy
    def sample(self, obs):
        sample = np.random.random()
        if sample < self.exploration:
            act = np.random.randint(self.action_dim)
        else:
            obs = np.expand_dims(obs, axis=0)
            pred_Q = self.fluid_executor.run(
                    self.pred_program,
                    feed={'obs': obs.astype('float32')},
                    fetch_list=[self.value])[0]
            pred_Q = np.squeeze(pred_Q, axis=0)
            act = np.argmax(pred_Q)
        self.exploration = max(self.min_exploration, self.exploration - self.exploration_dacay)
        return act
    
    #greedy prediction (no exploration)
    def predict(self, obs):
        obs = np.expand_dims(obs, axis=0)
        pred_Q = self.fluid_executor.run(
            self.pred_program,
            feed={'obs': obs.astype('float32')},
            fetch_list=[self.value])[0]
        pred_Q = np.squeeze(pred_Q, axis=0)
        act = np.argmax(pred_Q)
        return act
    
    #one learning step
    def learn(self, obs, act, reward, next_obs, terminal):
        #sync the target network every update_target_steps training steps
        if self.global_step % self.update_target_steps == 0:
            self.alg.sync_target()
        self.global_step += 1

        act = np.expand_dims(act, -1)
        reward = np.clip(reward, -1, 1)
        feed = {
            'obs': obs.astype('float32'),
            'act': act.astype('int32'),
            'reward': reward,
            'next_obs': next_obs.astype('float32'),
            'terminal': terminal
        }
        cost = self.fluid_executor.run(
            self.learn_program, feed=feed, fetch_list=[self.cost])[0]
        return cost
    
    #save model parameters
    def save_params(self, learnDir,predictDir):
        fluid.io.save_params(
                executor=self.fluid_executor,
                dirname=learnDir,
                main_program=self.learn_programs[0])   
        fluid.io.save_params(
                executor=self.fluid_executor,
                dirname=predictDir,
                main_program=self.predict_programs[0])     
       
    #load model parameters
    def load_params(self, learnDir,predictDir): 
        fluid.io.load_params(
                    executor=self.fluid_executor,
                    dirname=learnDir,
                    main_program=self.learn_programs[0])  
        fluid.io.load_params(
                    executor=self.fluid_executor,
                    dirname=predictDir,
                    main_program=self.predict_programs[0])    
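
A side note on the exploration schedule (my own arithmetic, not from the original post): epsilon starts at 0.8 and loses 1e-6 per sampled step, so it only reaches the 0.05 floor after (0.8 - 0.05) / 1e-6 = 750,000 steps, i.e. well beyond a short training run.

# how many sampled steps it takes for epsilon to decay from 0.8 to 0.05 at 1e-6 per step
start_eps, min_eps, decay = 0.8, 0.05, 1e-6
steps_to_floor = (start_eps - min_eps) / decay
print(steps_to_floor)   # 750000.0 steps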

Algorithm

The Algorithm can simply be imported:

from parl.algorithms import DQN

Of course, you can also write it yourself:

# imports needed if you paste this cell on its own
import copy
import parl
import paddle.fluid as fluid
from parl import layers

class DQN(parl.Algorithm):
    def __init__(self, model, act_dim=None, gamma=None, lr=None):
        """ DQN algorithm
        
        Args:
            model (parl.Model): forward network defining the Q function
            act_dim (int): dimension of the action space, i.e. the number of actions
            gamma (float): discount factor for the reward
            lr (float): learning rate
        """
        self.model = model
        self.target_model = copy.deepcopy(model)

        assert isinstance(act_dim, int)
        assert isinstance(gamma, float)
        assert isinstance(lr, float)
        self.act_dim = act_dim
        self.gamma = gamma
        self.lr = lr

    def predict(self, obs):
        """ Use self.model's value network to get [Q(s,a1), Q(s,a2), ...]
        """
        return self.model.value(obs)

    def learn(self, obs, action, reward, next_obs, terminal):
        """ Update self.model's value network with the DQN algorithm
        """
        # get max Q' from target_model, used to compute target_Q
        next_pred_value = self.target_model.value(next_obs)
        best_v = layers.reduce_max(next_pred_value, dim=1)
        best_v.stop_gradient = True  # block gradient flow through the target
        terminal = layers.cast(terminal, dtype='float32')
        target = reward + (1.0 - terminal) * self.gamma * best_v

        pred_value = self.model.value(obs)  # predicted Q values
        # convert action to a one-hot vector, e.g. 3 => [0,0,0,1,0]
        action_onehot = layers.one_hot(action, self.act_dim)
        action_onehot = layers.cast(action_onehot, dtype='float32')
        # element-wise multiplication picks out Q(s,a) for the taken action
        # e.g. pred_value = [[2.3, 5.7, 1.2, 3.9, 1.4]], action_onehot = [[0,0,0,1,0]]
        #  ==> pred_action_value = [[3.9]]
        pred_action_value = layers.reduce_sum(
            layers.elementwise_mul(action_onehot, pred_value), dim=1)

        # the loss is the mean squared error between Q(s,a) and target_Q
        cost = layers.square_error_cost(pred_action_value, target)
        cost = layers.reduce_mean(cost)
        optimizer = fluid.optimizer.Adam(learning_rate=self.lr)  # Adam optimizer
        optimizer.minimize(cost)
        return cost

    def sync_target(self):
        """ Copy the parameters of self.model into self.target_model
        """
        self.model.sync_weights_to(self.target_model)
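
To make the update concrete, here is a tiny NumPy mock-up of the same target computation (illustration only; the real code above runs inside a fluid program):

import numpy as np

gamma = 0.99
reward = np.array([0.1, 1.0, -1.0], dtype='float32')    # batch of rewards
terminal = np.array([0.0, 0.0, 1.0], dtype='float32')   # 1.0 marks the end of an episode
next_q = np.array([[0.5, 0.8],                          # Q_target(s', a) for the 2 actions
                   [0.2, 0.3],
                   [0.0, 0.0]], dtype='float32')

best_v = next_q.max(axis=1)                              # max_a' Q_target(s', a')
target = reward + (1.0 - terminal) * gamma * best_v      # no bootstrapping on terminal steps
print(target)    # [ 0.892  1.297 -1.   ]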

ReplayMemory

The replay buffer stores past transitions and implements experience replay.

#experience replay unit
import numpy as np
from collections import deque, namedtuple

Experience = namedtuple('Experience', ['state', 'action', 'reward', 'isOver'])

class ReplayMemory(object):
    def __init__(self, max_size, state_shape, context_len):
        self.max_size = int(max_size)
        self.state_shape = state_shape
        self.context_len = int(context_len)
        self.state = np.zeros((self.max_size, ) + state_shape, dtype='int32')
        self.action = np.zeros((self.max_size, ), dtype='int32')
        self.reward = np.zeros((self.max_size, ), dtype='float32')
        self.isOver = np.zeros((self.max_size, ), dtype='bool')

        self._curr_size = 0
        self._curr_pos = 0
        #_context is a sliding window that always holds the last context_len - 1 (here 3) frames
        self._context = deque(maxlen=context_len - 1)
        print('Replay buffer initialized successfully!')

    def append(self, exp):
        """append a new experience into replay memory
        """
        if self._curr_size < self.max_size:
            self._assign(self._curr_pos, exp)
            self._curr_size += 1
        else:
            self._assign(self._curr_pos, exp)
        self._curr_pos = (self._curr_pos + 1) % self.max_size
        if exp.isOver:
            self._context.clear()
        else:
            self._context.append(exp)
    
        
    def appendForTest(self,exp):
        """
        used during the test/evaluation phase
        """
        if exp.isOver:
            self._context.clear()
        else:
            self._context.append(exp)
    
    def recent_state(self):
        """ maintain recent state for training"""
        lst = list(self._context)
        states = [np.zeros(self.state_shape, dtype='uint8')] * \
                    (self._context.maxlen - len(lst))
        states.extend([k.state for k in lst])
        return states

    def sample(self, idx):
        """ return state, action, reward, isOver,
            note that some frames in state may be generated from last episode,
            they should be removed from state
            """
        state = np.zeros(
            (self.context_len + 1, ) + self.state_shape, dtype=np.uint8)
        state_idx = np.arange(idx,
                              idx + self.context_len + 1) % self._curr_size

        # confirm that no frame was generated from last episode
        has_last_episode = False
        for k in range(self.context_len - 2, -1, -1):
            to_check_idx = state_idx[k]
            if self.isOver[to_check_idx]:
                has_last_episode = True
                state_idx = state_idx[k + 1:]
                #the frames at positions state[0:k+1] stay all-zero
                state[k + 1:] = self.state[state_idx]
                break

        if not has_last_episode:
            state = self.state[state_idx]

        real_idx = (idx + self.context_len - 1) % self._curr_size
        action = self.action[real_idx]
        reward = self.reward[real_idx]
        isOver = self.isOver[real_idx]
        return state, reward, action, isOver

    def __len__(self):
        return self._curr_size

    def size(self):
        return self._curr_size

    def _assign(self, pos, exp):
        self.state[pos] = exp.state
        self.reward[pos] = exp.reward
        self.action[pos] = exp.action
        self.isOver[pos] = exp.isOver

    def sample_batch(self, batch_size):
        """sample a batch from replay memory for training
        """
        batch_idx = np.random.randint(
            self._curr_size - self.context_len - 1, size=batch_size)
        batch_idx = (self._curr_pos + batch_idx) % self._curr_size
        batch_exp = [self.sample(i) for i in batch_idx]
        return self._process_batch(batch_exp)

    def _process_batch(self, batch_exp):
        state = np.asarray([e[0] for e in batch_exp], dtype='uint8')
        reward = np.asarray([e[1] for e in batch_exp], dtype='float32')
        action = np.asarray([e[2] for e in batch_exp], dtype='int8')
        isOver = np.asarray([e[3] for e in batch_exp], dtype='bool')
        return [state, action, reward, isOver]
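
A minimal usage sketch of the buffer (my own example with random frames, just to show the expected shapes; not part of the original project):

import numpy as np

rpm = ReplayMemory(max_size=1000, state_shape=(84, 84), context_len=4)

# push a few fake transitions, ending an episode every 50 steps
for t in range(200):
    frame = np.random.randint(0, 255, size=(84, 84)).astype('uint8')
    rpm.append(Experience(state=frame, action=np.random.randint(2),
                          reward=0.1, isOver=(t % 50 == 49)))

state, action, reward, isOver = rpm.sample_batch(batch_size=32)
print(state.shape)                               # (32, 5, 84, 84): context_len + 1 stacked frames
print(action.shape, reward.shape, isOver.shape)  # (32,) each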

The game environment

#wrap the flappy-bird game into a standard gym environment
#original version: https://github.com/yenchenlin/DeepLearningFlappyBird

import random
import pygame
import gym
from pygame.locals import *
from itertools import cycle
import os 

os.environ["SDL_VIDEODRIVER"] = "dummy"

FPS = 30
SCREENWIDTH  = 288
SCREENHEIGHT = 512
PIPEGAPSIZE = 100 # gap between upper and lower part of pipe
BASEY = SCREENHEIGHT * 0.79

PLAYER_INDEX_GEN = cycle([0, 1, 2, 1])

class BirdEnv(gym.Env):
    
    def beforeInit(self):
        pygame.init()
        self.FPSCLOCK = pygame.time.Clock()
        self.SCREEN = pygame.display.set_mode((SCREENWIDTH, SCREENHEIGHT))
        pygame.display.set_caption('AI Flappy-Bird')
        IMAGES, SOUNDS, HITMASKS = load()
        self.IMAGES=IMAGES
        self.HITMASKS=HITMASKS
        self.SOUNDS=SOUNDS
        PLAYER_WIDTH = IMAGES['player'][0].get_width()
        self.PLAYER_WIDTH=PLAYER_WIDTH
        PLAYER_HEIGHT = IMAGES['player'][0].get_height()
        self.PLAYER_HEIGHT=PLAYER_HEIGHT
        PIPE_WIDTH = IMAGES['pipe'][0].get_width()
        self.PIPE_WIDTH=PIPE_WIDTH
        PIPE_HEIGHT = IMAGES['pipe'][0].get_height()
        self.PIPE_HEIGHT=PIPE_HEIGHT
        BACKGROUND_WIDTH = IMAGES['background'].get_width()
        self.BACKGROUND_WIDTH=BACKGROUND_WIDTH
    
    def __init__(self):
        if not hasattr(self,'IMAGES'):
            print('InitGame!')
            self.beforeInit()
        
        self.score = self.playerIndex = self.loopIter = 0
        self.playerx = int(SCREENWIDTH * 0.3)
        self.playery = int((SCREENHEIGHT - self.PLAYER_HEIGHT) / 2.25)
        self.basex = 0
        self.baseShift = self.IMAGES['base'].get_width() - self.BACKGROUND_WIDTH

        newPipe1 = getRandomPipe(self.PIPE_HEIGHT)
        newPipe2 = getRandomPipe(self.PIPE_HEIGHT)
        self.upperPipes = [
            {'x': SCREENWIDTH, 'y': newPipe1[0]['y']},
            {'x': SCREENWIDTH + (SCREENWIDTH / 2), 'y': newPipe2[0]['y']},
        ]
        self.lowerPipes = [
            {'x': SCREENWIDTH, 'y': newPipe1[1]['y']},
            {'x': SCREENWIDTH + (SCREENWIDTH / 2), 'y': newPipe2[1]['y']},
        ]

        # player velocity, max velocity, downward acceleration, acceleration on flap
        self.pipeVelX = -4
        self.playerVelY    =  0    # player's velocity along Y, default same as playerFlapped
        self.playerMaxVelY =  10   # max vel along Y, max descend speed
        self.playerMinVelY =  -8   # min vel along Y, max ascend speed
        self.playerAccY    =   1.1  # player's downward acceleration
        self.playerFlapAcc =  -1.2   # player's speed on flapping
        self.playerFlapped = False # True when player flaps
        
    def reset(self,mode='train'):
        self.__init__()
        self.mode=mode
        action0 = 1
        observation, reward, isOver,_ = self.step(action0)
        return observation,reward,isOver
            
    def render(self):
        image_data = pygame.surfarray.array3d(pygame.display.get_surface())
        pygame.display.update()
        self.FPSCLOCK.tick(FPS)
        return image_data
    
    def step(self, input_action=0):
        pygame.event.pump()
        #reward +0.1 for flying a bit further
        reward = 0.1
        terminal = False
        if input_action == 1:
            if self.playery > -2 * self.PLAYER_HEIGHT:
                self.playerVelY = self.playerFlapAcc
                self.playerFlapped = True
#                if self.mode=='test':
#                    self.SOUNDS['wing'].play()

        # check for score
        playerMidPos = self.playerx + self.PLAYER_WIDTH / 2
        for pipe in self.upperPipes:
            #passing a pipe gives reward +1
            pipeMidPos = pipe['x'] + self.PIPE_WIDTH / 2
            if pipeMidPos <= playerMidPos < pipeMidPos + 4:
#                if self.mode=='test':
#                    self.SOUNDS['point'].play()                
                self.score += 1
                reward = self.reward(1)

        # playerIndex basex change
        if (self.loopIter + 1) % 3 == 0:
            self.playerIndex = next(PLAYER_INDEX_GEN)
        self.loopIter = (self.loopIter + 1) % 30
        self.basex = -((-self.basex + 100) % self.baseShift)

        # player's movement
        if self.playerVelY < self.playerMaxVelY and not self.playerFlapped:
            self.playerVelY += self.playerAccY
        if self.playerFlapped:
            self.playerFlapped = False
        self.playery += min(self.playerVelY, BASEY - self.playery - self.PLAYER_HEIGHT)
        if self.playery < 0:
            self.playery = 0

        # move pipes to left
        for uPipe, lPipe in zip(self.upperPipes, self.lowerPipes):
            uPipe['x'] += self.pipeVelX
            lPipe['x'] += self.pipeVelX

        # add new pipe when first pipe is about to touch left of screen
        if 0 < self.upperPipes[0]['x'] < 5:
            newPipe = getRandomPipe(self.PIPE_HEIGHT)
            self.upperPipes.append(newPipe[0])
            self.lowerPipes.append(newPipe[1])

        # remove first pipe if it's out of the screen
        if self.upperPipes[0]['x'] < -self.PIPE_WIDTH:
            self.upperPipes.pop(0)
            self.lowerPipes.pop(0)

        # check if crash here
        isCrash= checkCrash({'x': self.playerx, 'y': self.playery,
                             'index': self.playerIndex},
        self.upperPipes, self.lowerPipes,self.IMAGES,self.PIPE_WIDTH,self.PIPE_HEIGHT,self.HITMASKS)
        if isCrash:
            #hitting the ground or a pipe ends the episode with reward -1
            terminal = True
            reward = self.reward(-1)
#            if self.mode=='test':
#                self.SOUNDS['hit'].play()
#                self.SOUNDS['die'].play()
        # draw sprites
        self.SCREEN.blit(self.IMAGES['background'], (0,0))

        for uPipe, lPipe in zip(self.upperPipes, self.lowerPipes):
            self.SCREEN.blit(self.IMAGES['pipe'][0], (uPipe['x'], uPipe['y']))
            self.SCREEN.blit(self.IMAGES['pipe'][1], (lPipe['x'], lPipe['y']))

        self.SCREEN.blit(self.IMAGES['base'], (self.basex, BASEY))
        # print score so player overlaps the score
        showScore(self.score,self)
        self.SCREEN.blit(self.IMAGES['player'][self.playerIndex],
                    (self.playerx, self.playery))

        image_data=self.render()
        return image_data, reward, terminal,{}
    
    def reward(self,r):
        return r
    
def getRandomPipe(PIPE_HEIGHT):
    """returns a randomly generated pipe"""
    # y of gap between upper and lower pipe
    gapYs = [20, 30, 40, 50, 60, 70, 80, 90]
    index = random.randint(0, len(gapYs)-1)
    gapY = gapYs[index]

    gapY += int(BASEY * 0.2)
    pipeX = SCREENWIDTH + 10

    return [
        {'x': pipeX, 'y': gapY - PIPE_HEIGHT},  # upper pipe
        {'x': pipeX, 'y': gapY + PIPEGAPSIZE},  # lower pipe
    ]


def showScore(score,obj):
    """displays score in center of screen"""
    scoreDigits = [int(x) for x in list(str(score))]
    totalWidth = 0 # total width of all numbers to be printed

    for digit in scoreDigits:
        totalWidth += obj.IMAGES['numbers'][digit].get_width()

    Xoffset = (SCREENWIDTH - totalWidth) / 2

    for digit in scoreDigits:
        obj.SCREEN.blit(obj.IMAGES['numbers'][digit], (Xoffset, SCREENHEIGHT * 0.1))
        Xoffset += obj.IMAGES['numbers'][digit].get_width()


def checkCrash(player, upperPipes, lowerPipes,IMAGES,PIPE_WIDTH,PIPE_HEIGHT,HITMASKS):
    """returns True if player collders with base or pipes."""
    pi = player['index']
    player['w'] = IMAGES['player'][0].get_width()
    player['h'] = IMAGES['player'][0].get_height()

    # if player crashes into ground
    if player['y'] + player['h'] >= BASEY - 1:
        return True
    else:

        playerRect = pygame.Rect(player['x'], player['y'],
                      player['w'], player['h'])

        for uPipe, lPipe in zip(upperPipes, lowerPipes):
            # upper and lower pipe rects
            uPipeRect = pygame.Rect(uPipe['x'], uPipe['y'], PIPE_WIDTH, PIPE_HEIGHT)
            lPipeRect = pygame.Rect(lPipe['x'], lPipe['y'], PIPE_WIDTH, PIPE_HEIGHT)

            # player and upper/lower pipe hitmasks
            pHitMask = HITMASKS['player'][pi]
            uHitmask = HITMASKS['pipe'][0]
            lHitmask = HITMASKS['pipe'][1]

            # if bird collided with upipe or lpipe
            uCollide = pixelCollision(playerRect, uPipeRect, pHitMask, uHitmask)
            lCollide = pixelCollision(playerRect, lPipeRect, pHitMask, lHitmask)

            if uCollide or lCollide:
                return True

    return False

def pixelCollision(rect1, rect2, hitmask1, hitmask2):
    """Checks if two objects collide and not just their rects"""
    rect = rect1.clip(rect2)

    if rect.width == 0 or rect.height == 0:
        return False

    x1, y1 = rect.x - rect1.x, rect.y - rect1.y
    x2, y2 = rect.x - rect2.x, rect.y - rect2.y

    for x in range(rect.width):
        for y in range(rect.height):
            if hitmask1[x1+x][y1+y] and hitmask2[x2+x][y2+y]:
                return True
    return False
# load the game assets (images, sounds, hitmasks)
import pygame
import sys

def load():
    # path of player with different states
    PLAYER_PATH = (
            r'flappy_bird/assets/sprites/redbird-upflap.png',
            r'flappy_bird/assets/sprites/redbird-midflap.png',
            r'flappy_bird/assets/sprites/redbird-downflap.png'
    )

    # path of background
    BACKGROUND_PATH = r'flappy_bird/assets/sprites/background-black.png'

    # path of pipe
    PIPE_PATH = r'flappy_bird/assets/sprites/pipe-green.png'

    IMAGES, SOUNDS, HITMASKS = {}, {}, {}

    # numbers sprites for score display
    IMAGES['numbers'] = (
        pygame.image.load(r'flappy_bird/assets/sprites/0.png').convert_alpha(),
        pygame.image.load(r'flappy_bird/assets/sprites/1.png').convert_alpha(),
        pygame.image.load(r'flappy_bird/assets/sprites/2.png').convert_alpha(),
        pygame.image.load(r'flappy_bird/assets/sprites/3.png').convert_alpha(),
        pygame.image.load(r'flappy_bird/assets/sprites/4.png').convert_alpha(),
        pygame.image.load(r'flappy_bird/assets/sprites/5.png').convert_alpha(),
        pygame.image.load(r'flappy_bird/assets/sprites/6.png').convert_alpha(),
        pygame.image.load(r'flappy_bird/assets/sprites/7.png').convert_alpha(),
        pygame.image.load(r'flappy_bird/assets/sprites/8.png').convert_alpha(),
        pygame.image.load(r'flappy_bird/assets/sprites/9.png').convert_alpha()
    )

    # base (ground) sprite
    IMAGES['base'] = pygame.image.load(r'flappy_bird/assets/sprites/base.png').convert_alpha()

    # sounds
    if sys.platform.startswith('win'):  # 'win' in sys.platform would also match 'darwin'
        soundExt = '.wav'
    else:
        soundExt = '.ogg'

    # SOUNDS['die']    = pygame.mixer.Sound('assets/audio/die' + soundExt)
    # SOUNDS['hit']    = pygame.mixer.Sound('assets/audio/hit' + soundExt)
    # SOUNDS['point']  = pygame.mixer.Sound('assets/audio/point' + soundExt)
    # SOUNDS['swoosh'] = pygame.mixer.Sound('assets/audio/swoosh' + soundExt)
    # SOUNDS['wing']   = pygame.mixer.Sound('assets/audio/wing' + soundExt)

    # select random background sprites
    IMAGES['background'] = pygame.image.load(BACKGROUND_PATH).convert()

    # select random player sprites
    IMAGES['player'] = (
        pygame.image.load(PLAYER_PATH[0]).convert_alpha(),
        pygame.image.load(PLAYER_PATH[1]).convert_alpha(),
        pygame.image.load(PLAYER_PATH[2]).convert_alpha(),
    )

    # select random pipe sprites
    IMAGES['pipe'] = (
        pygame.transform.rotate(
            pygame.image.load(PIPE_PATH).convert_alpha(), 180),
        pygame.image.load(PIPE_PATH).convert_alpha(),
    )

    # hitmasks for pipes
    HITMASKS['pipe'] = (
        getHitmask(IMAGES['pipe'][0]),
        getHitmask(IMAGES['pipe'][1]),
    )

    # hitmask for player
    HITMASKS['player'] = (
        getHitmask(IMAGES['player'][0]),
        getHitmask(IMAGES['player'][1]),
        getHitmask(IMAGES['player'][2]),
    )

    return IMAGES, SOUNDS, HITMASKS

def getHitmask(image):
    """returns a hitmask using an image's alpha."""
    mask = []
    for x in range(image.get_width()):
        mask.append([])
        for y in range(image.get_height()):
            mask[x].append(bool(image.get_at((x,y))[3]))
    return mask
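
Before plugging the environment into DQN, a quick random-action rollout is a handy smoke test (my own sketch; it assumes the BirdEnv and load() defined above are available in the same session and the asset files are in place):

# roll out one episode with random actions to check observation and reward shapes
import numpy as np

env = BirdEnv()
obs, reward, isOver = env.reset()
total_reward, steps = reward, 0
while not isOver and steps < 200:
    action = np.random.randint(2)            # 0 = do nothing, 1 = flap
    obs, reward, isOver, _ = env.step(action)
    total_reward += reward
    steps += 1
print(obs.shape, total_reward, steps)         # obs is a (288, 512, 3) RGB screen capture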

Training && Test

from contextlib import contextmanager
import cv2
import time

#capture the current screen, resize it to 84x84 and binarize it into a grayscale image
def resizeBirdrToAtari(observation):
    observation = cv2.cvtColor(cv2.resize(observation, (84, 84)), cv2.COLOR_BGR2GRAY)
    _, observation = cv2.threshold(observation,1,255,cv2.THRESH_BINARY)
    return observation

@contextmanager
def trainTimer(name):
    start = time.time()
    yield
    end = time.time()
    print('{} COST_Time:{}'.format(name, end - start))
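
A small self-contained check of the preprocessing (my own example with a random frame, just to show the output shape; it assumes opencv-python and numpy are installed):

import numpy as np

# a fake 288x512 RGB screen capture, like the one pygame returns
fake_frame = np.random.randint(0, 255, size=(288, 512, 3)).astype('uint8')
gray = resizeBirdrToAtari(fake_frame)
print(gray.shape)        # (84, 84)
print(np.unique(gray))   # only 0 and/or 255 after thresholding

with trainTimer('preprocess'):
    for _ in range(1000):
        resizeBirdrToAtari(fake_frame)
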
# start training
from tqdm import tqdm
from parl.algorithms import DQN
import time
from collections import deque
from parl.utils import logger
import matplotlib.pyplot as plt
import os
import sys
sys.path.append("game/")

#========= tunable hyperparameters: start =========

#input image size (if changed, the network's height/width must change as well)
IMAGE_SIZE = (84, 84)

#keep the most recent 4 frames (if changed, the network's channel count must change as well)
CONTEXT_LEN = 4

#size of the replay memory
MEMORY_SIZE = int(8e4)

#fill the replay memory up to the warm-up size before training starts
MEMORY_WARMUP_SIZE = MEMORY_SIZE//20

#no frame skipping by default
FRAME_SKIP = None

#how often (in steps) the network learns
UPDATE_FREQ = 2

#discount factor
GAMMA = 0.99

#learning rate
LEARNING_RATE = 1e-3 * 0.5

#total number of environment steps
TOTAL=1e6

#batch-size
batchSize=32

#evaluate once every this many episodes
eval_freq=100

#maximum steps per episode, otherwise the bird could fly forever; equivalent to _max_episode_steps in gym.Env
MAX_Step_Limit=int(1<<12)

#thresholds: stop training once the minimum test reward exceeds threshold_min and the mean exceeds threshold_avg, to avoid overfitting
threshold_min=256
threshold_avg=400

#========= tunable hyperparameters: end =========

#less important parameters below
#logging frequency
log_freq=10
#running mean reward
meanReward=0
#number of training episodes so far
trainEpisode=0
#learning-curve array
learning_curve=[]
#only save the model once the mean eval reward exceeds eval_mean_save
eval_mean_save=32

#run one training episode
def run_train_episode(env, agent, rpm):
    global trainEpisode
    global meanReward
    total_reward = 0
    all_cost = []
    #reset the environment
    state,_, __ = env.reset()
    step = 0
    #loop over steps
    while True:
        context = rpm.recent_state()
        context.append(resizeBirdrToAtari(state))
        context = np.stack(context, axis=0)
        #pick an action epsilon-greedily
        action = agent.sample(context)
        #execute the action
        next_state, reward, isOver,_ = env.step(action)
        step += 1
        #store the transition into the replay buffer
        rpm.append(Experience(resizeBirdrToAtari(state), action, reward, isOver))
        if rpm.size() > MEMORY_WARMUP_SIZE:
            if step % UPDATE_FREQ == 0:
                #sample a random batch from the replay buffer
                batch_all_state, batch_action, batch_reward, batch_isOver = rpm.sample_batch(
                    batchSize)
                batch_state = batch_all_state[:, :CONTEXT_LEN, :, :]
                batch_next_state = batch_all_state[:, 1:, :, :]
                #run one SGD step to train the parameters θ
                cost = agent.learn(batch_state, batch_action, batch_reward,
                                   batch_next_state, batch_isOver)
                all_cost.append(float(cost))
        total_reward += reward
        state = next_state
        if isOver or step>=MAX_Step_Limit:
            break
    if all_cost:
        trainEpisode+=1
        #print the mean reward as a running average
        meanReward=meanReward+(total_reward-meanReward)/trainEpisode
        print('\n trainEpisode:{},total_reward:{:.2f}, meanReward:{:.2f} mean_cost:{:.3f}'\
              .format(trainEpisode,total_reward, meanReward,np.mean(all_cost)))
    return total_reward, step

def run_evaluate_episode(env, agent,rpm):
    state, _, __ = env.reset('test')
    total_reward = 0
    step=0
    while True:
        context = rpm.recent_state()
        context.append(resizeBirdrToAtari(state))
        context = np.stack(context, axis=0)
        action = agent.predict(context)
        next_state, reward, isOver,_ = env.step(action)
        step+=1
        rpm.appendForTest(Experience(resizeBirdrToAtari(state), action, reward, isOver))
        total_reward += reward
        state=next_state
        if isOver or step>=MAX_Step_Limit:
            time.sleep(2)
            break
    return total_reward

#save model parameters
def save(agent):
    learnDir = os.path.join(logger.get_dir(),'learn')
    predictDir = os.path.join(logger.get_dir(),'predict')
    agent.save_params(learnDir,predictDir)

#restore the model
def restore(agent):
    print(logger.get_dir())
    # forward slashes work on both Linux (AI Studio) and Windows
    learnDir = r"flappy_bird/log_dir/Train_Test_Working_Flow/learn"
    predictDir = r"flappy_bird/log_dir/Train_Test_Working_Flow/predict"
    print('restore model from {}'.format(learnDir))
    agent.load_params(learnDir,predictDir)

#initialize the environment, model, algorithm and agent
def init_environment():
    env = BirdEnv()
    action_dim = 2
    hyperparas = {
        'action_dim': action_dim,
        'gamma': GAMMA,
        'lr': LEARNING_RATE
    }
    model = BirdModel(action_dim)
    algorithm = DQN(model, action_dim, GAMMA, LEARNING_RATE)
    agent = BirdAgent(algorithm, action_dim)
    return env,agent

#training
def train():
    env,agent=init_environment()
    rpm = ReplayMemory(MEMORY_SIZE, IMAGE_SIZE, CONTEXT_LEN)
    rpmForTest = ReplayMemory(128, IMAGE_SIZE, CONTEXT_LEN)
    with tqdm(total=MEMORY_WARMUP_SIZE) as pbar:
        while rpm.size() < MEMORY_WARMUP_SIZE:
            ep_reward, step = run_train_episode(env, agent, rpm)
            pbar.update(step)

    # train
    print('TrainStart!')
    # pbar = tqdm(total=TOTAL)
    #a deque holding the rewards of the last 16 episodes (used as a moving average)
    avgQueue=deque(maxlen=16)
    total_step = 0
    max_reward = 0
    global learning_curve 
    while True:
        ep_reward, step = run_train_episode(env, agent, rpm)
        total_step += step
        avgQueue.append(ep_reward)
        if ep_reward>max_reward:
            max_reward=ep_reward
        # pbar.set_description('exploration:{:.4f},max_reward:{:.2f}'.format(agent.exploration,max_reward))
        # pbar.update(step)
        if trainEpisode%log_freq==0:
            learning_curve.append(np.mean(avgQueue))
        if trainEpisode%eval_freq==0:
            global eval_mean_save
            eval_rewards=[]
            for _ in range(16):
                eval_reward = run_evaluate_episode(env, agent, rpmForTest)
                eval_rewards.append(eval_reward)
                print('TestReward:{:.2f}'.format(eval_reward))
            print('TestMeanReward:{:.2f}'.format(np.mean(eval_rewards)))
            if np.mean(eval_rewards)>eval_mean_save:
                eval_mean_save=np.mean(eval_rewards)
                save(agent)
                print('ModelSaved!')
            if np.min(eval_rewards) >= threshold_min and np.mean(eval_rewards) >= threshold_avg:
                print("########## Solved with {} episode!###########".format(trainEpisode))
                save(agent)
                break
        if total_step >= TOTAL:
            break
    # pbar.close()
    
    #plot the learning curve
    X=np.arange(0,len(learning_curve))
    X*=log_freq
    plt.title('LearningCurve')
    plt.xlabel('TrainEpisode')
    plt.ylabel('AvgReward')
    plt.plot(X,learning_curve)
    plt.show()

#testing
def test():
    env,agent=init_environment()
    rpmForTest=ReplayMemory(128, IMAGE_SIZE, CONTEXT_LEN)
    restore(agent) 
    pbar = tqdm(total=TOTAL)
    pbar.write("testing:")
    eval_rewards = []
    for _ in tqdm(range(64), desc='eval agent'):
        eval_reward = run_evaluate_episode(env, agent, rpmForTest)
        eval_rewards.append(eval_reward)
        print('TestReward:{:.2f}'.format(eval_reward))
    print("eval_mean_reward:{:.2f},eval_min_reward:{:.2f}".format(np.mean(eval_rewards),np.min(eval_rewards)))
    pbar.close()
    
if __name__ == '__main__':
    train()
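
To evaluate a previously saved model instead of training (using the save()/restore() helpers above), the entry point can simply be switched, e.g.:

if __name__ == '__main__':
    # train()   # run this first to produce a saved model
    test()      # then switch to evaluating the restored model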
    