The example of playing Flappy-Bird with PARL's official DQN algorithm has actually been around since April 2019, so strictly speaking this is a reproduction.
However, running it on the current PARL 1.3.1 framework hits a few version-compatibility problems, and it cannot be run directly on AI Studio either, so I am taking this opportunity to polish it and to consolidate what I have learned recently.
I have published the project on AI Studio:
https://aistudio.baidu.com/aistudio/projectdetail/580622
First, try to get the project running locally
Run Flappy-Bird with visualization on the local machine
Preparing the game environment
The GitHub repository is here:
https://github.com/kosoraYintai/PARL-Sample/tree/master/flappy_bird
The repository seems to contain two versions of Flappy-Bird: a plain one and a Rainbow one. The game itself is the same and I could not figure out what the difference is, so to avoid confusion I removed the Rainbow version here.
Besides the Flappy-Bird game environment, the repository also contains the corresponding reinforcement-learning code; it just runs into a few small problems under PARL 1.3.1:
Worse, fixing one problem tends to surface the next, which makes debugging rather tedious. I have already worked through them, so the code provided here runs directly.
When training locally, the game environment itself needs no changes:
The flappy_bird directory contains two folders:
- assets holds the static files the game needs, mainly images
- game holds the game implementation, written with pygame
Among them, flappy_bird_utils.py mainly loads the static assets, while the concrete game rules are implemented in BirdEnv.py; the layout is roughly sketched below.
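Putting that description together, the directory looks roughly like this (reconstructed from the folder description above; the repository may contain additional files):

flappy_bird/
    assets/                   # static files, mainly images (sprites)
    game/
        flappy_bird_utils.py  # loads the static assets
        BirdEnv.py            # game rules, wrapped as a gym-style environment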
AI Flappy-Bird demo
In the original code, some modules had to be imported from inside PARL; here I moved everything into a single folder:
This is one of the changes; without it you get a file-not-found error, so I simply put the files together.
Training locally comes with a visual display, which also makes it slow, so I stopped after only a few hundred episodes; locally I mainly checked how the trained model behaves:
AI flappy bird
If you would rather not watch the video, here is a GIF instead:
Training Flappy-Bird on the server
Because local training is very slow, I no longer train locally; the local machine is only used to visualize results, and training runs on the server.
AI Studio project:
https://aistudio.baidu.com/aistudio/projectdetail/580622
The current version only gets the project running end to end; I will tune the hyperparameters later for better results.
Before starting, since the server cannot render a display, add the following to the game environment file flappy_bird/game/BirdEnv.py:
import os
os.environ["SDL_VIDEODRIVER"] = "dummy"
Otherwise it fails with ==> pygame.error: No available video device
The project provides two ways to run training
Run the file directly to start training
The first way is very simple: after installing the dependencies, run it directly from the command line:
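The exact command is not reproduced here, so the following is only a sketch of what it typically looks like in an AI Studio notebook (the script name train_test.py and the dependency list are my assumptions; use whatever the project actually ships):

# assumed dependency list and script name; adjust to the project's own instructions
!pip install parl==1.3.1 pygame opencv-python tqdm
!python train_test.py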
Before the actual training starts, the agent keeps playing the game randomly to fill the replay buffer. This warm-up takes a while; if you want to see its progress, uncomment the progress-bar lines.
I trained for a bit more than 3000 episodes, but the results are not very good yet.
Inspect and run the code step by step
If you do not change the code, the result here is exactly the same as above.
When tuning hyperparameters, I recommend this approach: run the notebook cell by cell:
Model
# Model
import numpy as np
from parl import layers
from parl import Model

# Classic convolutional network used for Atari games
class BirdModel(Model):
    def __init__(self, act_dim):
        self.act_dim = act_dim
        # 'valid' padding
        p_valid = 0
        self.conv1 = layers.conv2d(
            num_filters=32, filter_size=8, stride=4, padding=p_valid, act='relu')
        self.conv2 = layers.conv2d(
            num_filters=64, filter_size=4, stride=2, padding=p_valid, act='relu')
        self.conv3 = layers.conv2d(
            num_filters=64, filter_size=3, stride=1, padding=p_valid, act='relu')
        self.fc0 = layers.fc(size=512)
        self.fc1 = layers.fc(size=act_dim)

    def value(self, obs):
        # normalize pixel values to [0, 1]
        obs = obs / 255.0
        out = self.conv1(obs)
        out = self.conv2(out)
        out = self.conv3(out)
        out = layers.flatten(out, axis=1)
        out = self.fc0(out)
        out = self.fc1(out)
        return out
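As a quick sanity check on this architecture (my own arithmetic, assuming the 84x84 input with 4 stacked frames configured later), the 'valid'-padding convolutions shrink the feature map as follows:

# output size per dimension with 'valid' padding: (in - kernel) // stride + 1
# conv1: (84 - 8) // 4 + 1 = 20  -> 32 x 20 x 20
# conv2: (20 - 4) // 2 + 1 = 9   -> 64 x 9 x 9
# conv3: (9 - 3) // 1 + 1 = 7    -> 64 x 7 x 7
# flatten: 64 * 7 * 7 = 3136     -> fc0: 512 -> fc1: act_dim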
Agent
# Agent
import numpy as np
import paddle.fluid as fluid
from parl import layers
from parl import Agent

IMAGE_SIZE = (84, 84)
CONTEXT_LEN = 4

# The agent
class BirdAgent(Agent):
    def __init__(self, algorithm, action_dim):
        super(BirdAgent, self).__init__(algorithm)
        self.action_dim = action_dim
        self.global_step = 0
        # how many training steps between target-network updates; tunable hyperparameter
        self.update_target_steps = 5000
        # initial exploration probability ε; tunable hyperparameter
        self.exploration = 0.8
        # per-step decay of ε; tunable hyperparameter
        self.exploration_decay = 1e-6
        # minimum exploration probability; tunable hyperparameter
        self.min_exploration = 0.05

    def build_program(self):
        self.learn_programs = []
        self.predict_programs = []
        self.pred_program = fluid.Program()
        self.learn_program = fluid.Program()
        with fluid.program_guard(self.pred_program):
            obs = layers.data(
                name='obs',
                shape=[CONTEXT_LEN, IMAGE_SIZE[0], IMAGE_SIZE[1]],
                dtype='float32')
            self.value = self.alg.predict(obs)
        with fluid.program_guard(self.learn_program):
            obs = layers.data(
                name='obs',
                shape=[CONTEXT_LEN, IMAGE_SIZE[0], IMAGE_SIZE[1]],
                dtype='float32')
            action = layers.data(name='act', shape=[1], dtype='int32')
            reward = layers.data(name='reward', shape=[], dtype='float32')
            next_obs = layers.data(
                name='next_obs',
                shape=[CONTEXT_LEN, IMAGE_SIZE[0], IMAGE_SIZE[1]],
                dtype='float32')
            terminal = layers.data(name='terminal', shape=[], dtype='bool')
            self.cost = self.alg.learn(obs, action, reward, next_obs,
                                       terminal)
        self.learn_programs.append(self.learn_program)
        self.predict_programs.append(self.pred_program)

    # ε-greedy action selection
    def sample(self, obs):
        sample = np.random.random()
        if sample < self.exploration:
            act = np.random.randint(self.action_dim)
        else:
            obs = np.expand_dims(obs, axis=0)
            pred_Q = self.fluid_executor.run(
                self.pred_program,
                feed={'obs': obs.astype('float32')},
                fetch_list=[self.value])[0]
            pred_Q = np.squeeze(pred_Q, axis=0)
            act = np.argmax(pred_Q)
        self.exploration = max(self.min_exploration, self.exploration - self.exploration_decay)
        return act

    # greedy prediction
    def predict(self, obs):
        obs = np.expand_dims(obs, axis=0)
        pred_Q = self.fluid_executor.run(
            self.pred_program,
            feed={'obs': obs.astype('float32')},
            fetch_list=[self.value])[0]
        pred_Q = np.squeeze(pred_Q, axis=0)
        act = np.argmax(pred_Q)
        return act

    # one training step
    def learn(self, obs, act, reward, next_obs, terminal):
        # only sync the target network every update_target_steps training steps
        if self.global_step % self.update_target_steps == 0:
            self.alg.sync_target()
        self.global_step += 1
        act = np.expand_dims(act, -1)
        reward = np.clip(reward, -1, 1)
        feed = {
            'obs': obs.astype('float32'),
            'act': act.astype('int32'),
            'reward': reward,
            'next_obs': next_obs.astype('float32'),
            'terminal': terminal
        }
        cost = self.fluid_executor.run(
            self.learn_program, feed=feed, fetch_list=[self.cost])[0]
        return cost

    # save the model
    def save_params(self, learnDir, predictDir):
        fluid.io.save_params(
            executor=self.fluid_executor,
            dirname=learnDir,
            main_program=self.learn_programs[0])
        fluid.io.save_params(
            executor=self.fluid_executor,
            dirname=predictDir,
            main_program=self.predict_programs[0])

    # load the model
    def load_params(self, learnDir, predictDir):
        fluid.io.load_params(
            executor=self.fluid_executor,
            dirname=learnDir,
            main_program=self.learn_programs[0])
        fluid.io.load_params(
            executor=self.fluid_executor,
            dirname=predictDir,
            main_program=self.predict_programs[0])
Algorithm
The Algorithm can simply be imported:
from parl.algorithms import DQN
Of course, you can also write it yourself:
# Hand-written DQN, equivalent to parl.algorithms.DQN
import copy
import parl
import paddle.fluid as fluid
from parl import layers

class DQN(parl.Algorithm):
    def __init__(self, model, act_dim=None, gamma=None, lr=None):
        """ DQN algorithm

        Args:
            model (parl.Model): forward network that defines the Q function
            act_dim (int): dimension of the action space, i.e. the number of actions
            gamma (float): discount factor for rewards
            lr (float): learning rate
        """
        self.model = model
        self.target_model = copy.deepcopy(model)
        assert isinstance(act_dim, int)
        assert isinstance(gamma, float)
        assert isinstance(lr, float)
        self.act_dim = act_dim
        self.gamma = gamma
        self.lr = lr

    def predict(self, obs):
        """ Use self.model's value network to get [Q(s,a1), Q(s,a2), ...]
        """
        return self.model.value(obs)

    def learn(self, obs, action, reward, next_obs, terminal):
        """ Update self.model's value network with the DQN algorithm
        """
        # get max Q' from target_model to build target_Q
        next_pred_value = self.target_model.value(next_obs)
        best_v = layers.reduce_max(next_pred_value, dim=1)
        best_v.stop_gradient = True  # do not backpropagate through the target
        terminal = layers.cast(terminal, dtype='float32')
        target = reward + (1.0 - terminal) * self.gamma * best_v

        pred_value = self.model.value(obs)  # predicted Q values
        # turn action into a one-hot vector, e.g. 3 => [0,0,0,1,0]
        action_onehot = layers.one_hot(action, self.act_dim)
        action_onehot = layers.cast(action_onehot, dtype='float32')
        # element-wise multiply to pick out Q(s,a) of the taken action
        # e.g. pred_value = [[2.3, 5.7, 1.2, 3.9, 1.4]], action_onehot = [[0,0,0,1,0]]
        # ==> pred_action_value = [[3.9]]
        pred_action_value = layers.reduce_sum(
            layers.elementwise_mul(action_onehot, pred_value), dim=1)

        # mean squared error between Q(s,a) and target_Q as the loss
        cost = layers.square_error_cost(pred_action_value, target)
        cost = layers.reduce_mean(cost)
        optimizer = fluid.optimizer.Adam(learning_rate=self.lr)  # Adam optimizer
        optimizer.minimize(cost)
        return cost

    def sync_target(self):
        """ Copy the parameters of self.model into self.target_model
        """
        self.model.sync_weights_to(self.target_model)
ReplayMemory
Replay buffer: stores past experiences to implement experience replay.
# Experience replay buffer
import numpy as np
from collections import deque, namedtuple

Experience = namedtuple('Experience', ['state', 'action', 'reward', 'isOver'])

class ReplayMemory(object):
    def __init__(self, max_size, state_shape, context_len):
        self.max_size = int(max_size)
        self.state_shape = state_shape
        self.context_len = int(context_len)
        self.state = np.zeros((self.max_size, ) + state_shape, dtype='int32')
        self.action = np.zeros((self.max_size, ), dtype='int32')
        self.reward = np.zeros((self.max_size, ), dtype='float32')
        self.isOver = np.zeros((self.max_size, ), dtype='bool')
        self._curr_size = 0
        self._curr_pos = 0
        # _context is a sliding window whose length stays at context_len - 1 (3 here)
        self._context = deque(maxlen=context_len - 1)
        print('Replay-buffer initial success!')

    def append(self, exp):
        """append a new experience into replay memory
        """
        if self._curr_size < self.max_size:
            self._assign(self._curr_pos, exp)
            self._curr_size += 1
        else:
            self._assign(self._curr_pos, exp)
        self._curr_pos = (self._curr_pos + 1) % self.max_size
        if exp.isOver:
            self._context.clear()
        else:
            self._context.append(exp)

    def appendForTest(self, exp):
        """only maintain the recent-state window; used during evaluation
        """
        if exp.isOver:
            self._context.clear()
        else:
            self._context.append(exp)

    def recent_state(self):
        """ maintain recent state for training"""
        lst = list(self._context)
        states = [np.zeros(self.state_shape, dtype='uint8')] * \
            (self._context.maxlen - len(lst))
        states.extend([k.state for k in lst])
        return states

    def sample(self, idx):
        """ return state, reward, action, isOver,
        note that some frames in state may be generated from last episode,
        they should be removed from state
        """
        state = np.zeros(
            (self.context_len + 1, ) + self.state_shape, dtype=np.uint8)
        state_idx = np.arange(idx,
                              idx + self.context_len + 1) % self._curr_size

        # confirm that no frame was generated from last episode
        has_last_episode = False
        for k in range(self.context_len - 2, -1, -1):
            to_check_idx = state_idx[k]
            if self.isOver[to_check_idx]:
                has_last_episode = True
                state_idx = state_idx[k + 1:]
                # the pixels at positions state[0:k+1] stay 0
                state[k + 1:] = self.state[state_idx]
                break

        if not has_last_episode:
            state = self.state[state_idx]

        real_idx = (idx + self.context_len - 1) % self._curr_size
        action = self.action[real_idx]
        reward = self.reward[real_idx]
        isOver = self.isOver[real_idx]
        return state, reward, action, isOver

    def __len__(self):
        return self._curr_size

    def size(self):
        return self._curr_size

    def _assign(self, pos, exp):
        self.state[pos] = exp.state
        self.reward[pos] = exp.reward
        self.action[pos] = exp.action
        self.isOver[pos] = exp.isOver

    def sample_batch(self, batch_size):
        """sample a batch from replay memory for training
        """
        batch_idx = np.random.randint(
            self._curr_size - self.context_len - 1, size=batch_size)
        batch_idx = (self._curr_pos + batch_idx) % self._curr_size
        batch_exp = [self.sample(i) for i in batch_idx]
        return self._process_batch(batch_exp)

    def _process_batch(self, batch_exp):
        state = np.asarray([e[0] for e in batch_exp], dtype='uint8')
        reward = np.asarray([e[1] for e in batch_exp], dtype='float32')
        action = np.asarray([e[2] for e in batch_exp], dtype='int8')
        isOver = np.asarray([e[3] for e in batch_exp], dtype='bool')
        return [state, action, reward, isOver]
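To make the shapes concrete, here is a minimal usage sketch of this buffer (my own toy example, not part of the project; it assumes 84x84 frames and context_len=4, as configured later):

# toy example: fill the buffer with dummy frames and sample one batch
rpm = ReplayMemory(max_size=1000, state_shape=(84, 84), context_len=4)
dummy_frame = np.zeros((84, 84), dtype='uint8')   # stands in for a preprocessed screen
for t in range(200):
    # mark every 50th step as the end of an episode
    rpm.append(Experience(state=dummy_frame, action=0, reward=0.1, isOver=(t % 50 == 49)))
states, actions, rewards, is_over = rpm.sample_batch(batch_size=8)
print(states.shape)   # (8, 5, 84, 84): 4 context frames plus the next frame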
Game environment
# Wrap the Flappy-Bird game as a standard gym environment
# Original version: https://github.com/yenchenlin/DeepLearningFlappyBird
import random
import pygame
import gym
from pygame.locals import *
from itertools import cycle
import os
os.environ["SDL_VIDEODRIVER"] = "dummy"
FPS = 30
SCREENWIDTH = 288
SCREENHEIGHT = 512
PIPEGAPSIZE = 100 # gap between upper and lower part of pipe
BASEY = SCREENHEIGHT * 0.79
PLAYER_INDEX_GEN = cycle([0, 1, 2, 1])
class BirdEnv(gym.Env):
    def beforeInit(self):
        pygame.init()
        self.FPSCLOCK = pygame.time.Clock()
        self.SCREEN = pygame.display.set_mode((SCREENWIDTH, SCREENHEIGHT))
        pygame.display.set_caption('AI Flappy-Bird')
        IMAGES, SOUNDS, HITMASKS = load()
        self.IMAGES = IMAGES
        self.HITMASKS = HITMASKS
        self.SOUNDS = SOUNDS
        PLAYER_WIDTH = IMAGES['player'][0].get_width()
        self.PLAYER_WIDTH = PLAYER_WIDTH
        PLAYER_HEIGHT = IMAGES['player'][0].get_height()
        self.PLAYER_HEIGHT = PLAYER_HEIGHT
        PIPE_WIDTH = IMAGES['pipe'][0].get_width()
        self.PIPE_WIDTH = PIPE_WIDTH
        PIPE_HEIGHT = IMAGES['pipe'][0].get_height()
        self.PIPE_HEIGHT = PIPE_HEIGHT
        BACKGROUND_WIDTH = IMAGES['background'].get_width()
        self.BACKGROUND_WIDTH = BACKGROUND_WIDTH

    def __init__(self):
        if not hasattr(self, 'IMAGES'):
            print('InitGame!')
            self.beforeInit()
        self.score = self.playerIndex = self.loopIter = 0
        self.playerx = int(SCREENWIDTH * 0.3)
        self.playery = int((SCREENHEIGHT - self.PLAYER_HEIGHT) / 2.25)
        self.basex = 0
        self.baseShift = self.IMAGES['base'].get_width() - self.BACKGROUND_WIDTH
        newPipe1 = getRandomPipe(self.PIPE_HEIGHT)
        newPipe2 = getRandomPipe(self.PIPE_HEIGHT)
        self.upperPipes = [
            {'x': SCREENWIDTH, 'y': newPipe1[0]['y']},
            {'x': SCREENWIDTH + (SCREENWIDTH / 2), 'y': newPipe2[0]['y']},
        ]
        self.lowerPipes = [
            {'x': SCREENWIDTH, 'y': newPipe1[1]['y']},
            {'x': SCREENWIDTH + (SCREENWIDTH / 2), 'y': newPipe2[1]['y']},
        ]
        # player velocity, max velocity, downward acceleration, acceleration on flap
        self.pipeVelX = -4
        self.playerVelY = 0         # player's velocity along Y, default same as playerFlapped
        self.playerMaxVelY = 10     # max vel along Y, max descend speed
        self.playerMinVelY = -8     # min vel along Y, max ascend speed
        self.playerAccY = 1.1       # player's downward acceleration
        self.playerFlapAcc = -1.2   # player's speed on flapping
        self.playerFlapped = False  # True when player flaps

    def reset(self, mode='train'):
        self.__init__()
        self.mode = mode
        action0 = 1
        observation, reward, isOver, _ = self.step(action0)
        return observation, reward, isOver

    def render(self):
        image_data = pygame.surfarray.array3d(pygame.display.get_surface())
        pygame.display.update()
        self.FPSCLOCK.tick(FPS)
        return image_data

    def step(self, input_action=0):
        pygame.event.pump()
        # flying one step without crashing gives reward +0.1
        reward = 0.1
        terminal = False
        if input_action == 1:
            if self.playery > -2 * self.PLAYER_HEIGHT:
                self.playerVelY = self.playerFlapAcc
                self.playerFlapped = True
                # if self.mode=='test':
                #     self.SOUNDS['wing'].play()
        # check for score
        playerMidPos = self.playerx + self.PLAYER_WIDTH / 2
        for pipe in self.upperPipes:
            # passing a pipe gives reward +1
            pipeMidPos = pipe['x'] + self.PIPE_WIDTH / 2
            if pipeMidPos <= playerMidPos < pipeMidPos + 4:
                # if self.mode=='test':
                #     self.SOUNDS['point'].play()
                self.score += 1
                reward = self.reward(1)
        # playerIndex basex change
        if (self.loopIter + 1) % 3 == 0:
            self.playerIndex = next(PLAYER_INDEX_GEN)
        self.loopIter = (self.loopIter + 1) % 30
        self.basex = -((-self.basex + 100) % self.baseShift)
        # player's movement
        if self.playerVelY < self.playerMaxVelY and not self.playerFlapped:
            self.playerVelY += self.playerAccY
        if self.playerFlapped:
            self.playerFlapped = False
        self.playery += min(self.playerVelY, BASEY - self.playery - self.PLAYER_HEIGHT)
        if self.playery < 0:
            self.playery = 0
        # move pipes to left
        for uPipe, lPipe in zip(self.upperPipes, self.lowerPipes):
            uPipe['x'] += self.pipeVelX
            lPipe['x'] += self.pipeVelX
        # add new pipe when first pipe is about to touch left of screen
        if 0 < self.upperPipes[0]['x'] < 5:
            newPipe = getRandomPipe(self.PIPE_HEIGHT)
            self.upperPipes.append(newPipe[0])
            self.lowerPipes.append(newPipe[1])
        # remove first pipe if it's out of the screen
        if self.upperPipes[0]['x'] < -self.PIPE_WIDTH:
            self.upperPipes.pop(0)
            self.lowerPipes.pop(0)
        # check if crash here
        isCrash = checkCrash({'x': self.playerx, 'y': self.playery,
                              'index': self.playerIndex},
                             self.upperPipes, self.lowerPipes, self.IMAGES,
                             self.PIPE_WIDTH, self.PIPE_HEIGHT, self.HITMASKS)
        if isCrash:
            # hitting the ground or a pipe ends the episode with reward -1
            terminal = True
            reward = self.reward(-1)
            # if self.mode=='test':
            #     self.SOUNDS['hit'].play()
            #     self.SOUNDS['die'].play()
        # draw sprites
        self.SCREEN.blit(self.IMAGES['background'], (0, 0))
        for uPipe, lPipe in zip(self.upperPipes, self.lowerPipes):
            self.SCREEN.blit(self.IMAGES['pipe'][0], (uPipe['x'], uPipe['y']))
            self.SCREEN.blit(self.IMAGES['pipe'][1], (lPipe['x'], lPipe['y']))
        self.SCREEN.blit(self.IMAGES['base'], (self.basex, BASEY))
        # print score so player overlaps the score
        showScore(self.score, self)
        self.SCREEN.blit(self.IMAGES['player'][self.playerIndex],
                         (self.playerx, self.playery))
        image_data = self.render()
        return image_data, reward, terminal, {}

    def reward(self, r):
        return r
def getRandomPipe(PIPE_HEIGHT):
    """returns a randomly generated pipe"""
    # y of gap between upper and lower pipe
    gapYs = [20, 30, 40, 50, 60, 70, 80, 90]
    index = random.randint(0, len(gapYs) - 1)
    gapY = gapYs[index]
    gapY += int(BASEY * 0.2)
    pipeX = SCREENWIDTH + 10
    return [
        {'x': pipeX, 'y': gapY - PIPE_HEIGHT},  # upper pipe
        {'x': pipeX, 'y': gapY + PIPEGAPSIZE},  # lower pipe
    ]
def showScore(score, obj):
    """displays score in center of screen"""
    scoreDigits = [int(x) for x in list(str(score))]
    totalWidth = 0  # total width of all numbers to be printed
    for digit in scoreDigits:
        totalWidth += obj.IMAGES['numbers'][digit].get_width()
    Xoffset = (SCREENWIDTH - totalWidth) / 2
    for digit in scoreDigits:
        obj.SCREEN.blit(obj.IMAGES['numbers'][digit], (Xoffset, SCREENHEIGHT * 0.1))
        Xoffset += obj.IMAGES['numbers'][digit].get_width()
def checkCrash(player, upperPipes, lowerPipes, IMAGES, PIPE_WIDTH, PIPE_HEIGHT, HITMASKS):
    """returns True if player collides with base or pipes."""
    pi = player['index']
    player['w'] = IMAGES['player'][0].get_width()
    player['h'] = IMAGES['player'][0].get_height()
    # if player crashes into ground
    if player['y'] + player['h'] >= BASEY - 1:
        return True
    else:
        playerRect = pygame.Rect(player['x'], player['y'],
                                 player['w'], player['h'])
        for uPipe, lPipe in zip(upperPipes, lowerPipes):
            # upper and lower pipe rects
            uPipeRect = pygame.Rect(uPipe['x'], uPipe['y'], PIPE_WIDTH, PIPE_HEIGHT)
            lPipeRect = pygame.Rect(lPipe['x'], lPipe['y'], PIPE_WIDTH, PIPE_HEIGHT)
            # player and upper/lower pipe hitmasks
            pHitMask = HITMASKS['player'][pi]
            uHitmask = HITMASKS['pipe'][0]
            lHitmask = HITMASKS['pipe'][1]
            # if bird collided with upipe or lpipe
            uCollide = pixelCollision(playerRect, uPipeRect, pHitMask, uHitmask)
            lCollide = pixelCollision(playerRect, lPipeRect, pHitMask, lHitmask)
            if uCollide or lCollide:
                return True
    return False
def pixelCollision(rect1, rect2, hitmask1, hitmask2):
    """Checks if two objects collide and not just their rects"""
    rect = rect1.clip(rect2)
    if rect.width == 0 or rect.height == 0:
        return False
    x1, y1 = rect.x - rect1.x, rect.y - rect1.y
    x2, y2 = rect.x - rect2.x, rect.y - rect2.y
    for x in range(rect.width):
        for y in range(rect.height):
            if hitmask1[x1 + x][y1 + y] and hitmask2[x2 + x][y2 + y]:
                return True
    return False
# Load the game assets (images, sounds, hitmasks)
import pygame
import sys

def load():
    # path of player with different states
    PLAYER_PATH = (
        r'flappy_bird/assets/sprites/redbird-upflap.png',
        r'flappy_bird/assets/sprites/redbird-midflap.png',
        r'flappy_bird/assets/sprites/redbird-downflap.png'
    )
    # path of background
    BACKGROUND_PATH = r'flappy_bird/assets/sprites/background-black.png'
    # path of pipe
    PIPE_PATH = r'flappy_bird/assets/sprites/pipe-green.png'
    IMAGES, SOUNDS, HITMASKS = {}, {}, {}

    # numbers sprites for score display
    IMAGES['numbers'] = (
        pygame.image.load(r'flappy_bird/assets/sprites/0.png').convert_alpha(),
        pygame.image.load(r'flappy_bird/assets/sprites/1.png').convert_alpha(),
        pygame.image.load(r'flappy_bird/assets/sprites/2.png').convert_alpha(),
        pygame.image.load(r'flappy_bird/assets/sprites/3.png').convert_alpha(),
        pygame.image.load(r'flappy_bird/assets/sprites/4.png').convert_alpha(),
        pygame.image.load(r'flappy_bird/assets/sprites/5.png').convert_alpha(),
        pygame.image.load(r'flappy_bird/assets/sprites/6.png').convert_alpha(),
        pygame.image.load(r'flappy_bird/assets/sprites/7.png').convert_alpha(),
        pygame.image.load(r'flappy_bird/assets/sprites/8.png').convert_alpha(),
        pygame.image.load(r'flappy_bird/assets/sprites/9.png').convert_alpha()
    )

    # base (ground) sprite
    IMAGES['base'] = pygame.image.load(r'flappy_bird/assets/sprites/base.png').convert_alpha()

    # sounds
    if 'win' in sys.platform:
        soundExt = '.wav'
    else:
        soundExt = '.ogg'
    # SOUNDS['die'] = pygame.mixer.Sound('assets/audio/die' + soundExt)
    # SOUNDS['hit'] = pygame.mixer.Sound('assets/audio/hit' + soundExt)
    # SOUNDS['point'] = pygame.mixer.Sound('assets/audio/point' + soundExt)
    # SOUNDS['swoosh'] = pygame.mixer.Sound('assets/audio/swoosh' + soundExt)
    # SOUNDS['wing'] = pygame.mixer.Sound('assets/audio/wing' + soundExt)

    # background sprite
    IMAGES['background'] = pygame.image.load(BACKGROUND_PATH).convert()
    # player sprites
    IMAGES['player'] = (
        pygame.image.load(PLAYER_PATH[0]).convert_alpha(),
        pygame.image.load(PLAYER_PATH[1]).convert_alpha(),
        pygame.image.load(PLAYER_PATH[2]).convert_alpha(),
    )
    # pipe sprites (upper pipe is the lower one rotated by 180 degrees)
    IMAGES['pipe'] = (
        pygame.transform.rotate(
            pygame.image.load(PIPE_PATH).convert_alpha(), 180),
        pygame.image.load(PIPE_PATH).convert_alpha(),
    )
    # hitmask for pipes
    HITMASKS['pipe'] = (
        getHitmask(IMAGES['pipe'][0]),
        getHitmask(IMAGES['pipe'][1]),
    )
    # hitmask for player
    HITMASKS['player'] = (
        getHitmask(IMAGES['player'][0]),
        getHitmask(IMAGES['player'][1]),
        getHitmask(IMAGES['player'][2]),
    )
    return IMAGES, SOUNDS, HITMASKS

def getHitmask(image):
    """returns a hitmask using an image's alpha."""
    mask = []
    for x in range(image.get_width()):
        mask.append([])
        for y in range(image.get_height()):
            mask[x].append(bool(image.get_at((x, y))[3]))
    return mask
Training && Test
from contextlib import contextmanager
import cv2
import time

# Capture the current screen, resize it to 84x84, convert to grayscale and binarize
def resizeBirdrToAtari(observation):
    observation = cv2.cvtColor(cv2.resize(observation, (84, 84)), cv2.COLOR_BGR2GRAY)
    _, observation = cv2.threshold(observation, 1, 255, cv2.THRESH_BINARY)
    return observation

@contextmanager
def trainTimer(name):
    start = time.time()
    yield
    end = time.time()
    print('{} COST_Time:{}'.format(name, end - start))
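For what it's worth, here is a minimal usage sketch of trainTimer (my own example, not part of the project); it simply prints the wall-clock time of whatever block it wraps:

# wrap any stage (e.g. the warm-up phase) to print how long it took
with trainTimer('demo'):
    time.sleep(1.5)   # stand-in for real work such as filling the replay memory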
# Start training
from tqdm import tqdm
from parl.algorithms import DQN
import time
from collections import deque
from parl.utils import logger
import matplotlib.pyplot as plt
import os
import sys
sys.path.append("game/")

# ========= tunable hyperparameters: start =========
# input image size (if changed, the network's height/width must change too)
IMAGE_SIZE = (84, 84)
# keep the most recent 4 frames (if changed, the network's channel count must change too)
CONTEXT_LEN = 4
# size of the replay memory
MEMORY_SIZE = int(8e4)
# training starts only after the replay memory is filled up to the warm-up size
MEMORY_WARMUP_SIZE = MEMORY_SIZE // 20
# no frame skipping by default
FRAME_SKIP = None
# how often the network learns (every UPDATE_FREQ environment steps)
UPDATE_FREQ = 2
# discount factor
GAMMA = 0.99
# learning rate
LEARNING_RATE = 1e-3 * 0.5
# total number of environment steps
TOTAL = 1e6
# batch size
batchSize = 32
# evaluate every eval_freq episodes
eval_freq = 100
# maximum steps per episode, otherwise the bird could fly forever;
# equivalent to the _max_episode_steps attribute of gym.env
MAX_Step_Limit = int(1 << 12)
# stop training once the minimum test reward exceeds threshold_min and the mean
# exceeds threshold_avg, to keep the network from overfitting
threshold_min = 256
threshold_avg = 400
# ========= tunable hyperparameters: end =========

# the following are less important parameters
# logging frequency
log_freq = 10
# initial mean reward
meanReward = 0
# number of training episodes so far
trainEpisode = 0
# learning-curve data
learning_curve = []
# only save the model once the mean evaluation reward exceeds eval_mean_save
eval_mean_save = 32
# Train for one episode
def run_train_episode(env, agent, rpm):
    global trainEpisode
    global meanReward
    total_reward = 0
    all_cost = []
    # reset the environment
    state, _, __ = env.reset()
    step = 0
    # loop over steps
    while True:
        context = rpm.recent_state()
        context.append(resizeBirdrToAtari(state))
        context = np.stack(context, axis=0)
        # pick an action with the ε-greedy policy
        action = agent.sample(context)
        # execute the action
        next_state, reward, isOver, _ = env.step(action)
        step += 1
        # store the transition into the replay buffer
        rpm.append(Experience(resizeBirdrToAtari(state), action, reward, isOver))
        if rpm.size() > MEMORY_WARMUP_SIZE:
            if step % UPDATE_FREQ == 0:
                # sample a random batch from the replay buffer
                batch_all_state, batch_action, batch_reward, batch_isOver = rpm.sample_batch(
                    batchSize)
                batch_state = batch_all_state[:, :CONTEXT_LEN, :, :]
                batch_next_state = batch_all_state[:, 1:, :, :]
                # run SGD to train the parameters θ
                cost = agent.learn(batch_state, batch_action, batch_reward,
                                   batch_next_state, batch_isOver)
                all_cost.append(float(cost))
        total_reward += reward
        state = next_state
        if isOver or step >= MAX_Step_Limit:
            break
    if all_cost:
        trainEpisode += 1
        # print the mean reward as a running average
        meanReward = meanReward + (total_reward - meanReward) / trainEpisode
        print('\n trainEpisode:{},total_reward:{:.2f}, meanReward:{:.2f} mean_cost:{:.3f}'
              .format(trainEpisode, total_reward, meanReward, np.mean(all_cost)))
    return total_reward, step
def run_evaluate_episode(env, agent, rpm):
    state, _, __ = env.reset('test')
    total_reward = 0
    step = 0
    while True:
        context = rpm.recent_state()
        context.append(resizeBirdrToAtari(state))
        context = np.stack(context, axis=0)
        action = agent.predict(context)
        next_state, reward, isOver, _ = env.step(action)
        step += 1
        rpm.appendForTest(Experience(resizeBirdrToAtari(state), action, reward, isOver))
        total_reward += reward
        state = next_state
        if isOver or step >= MAX_Step_Limit:
            time.sleep(2)
            break
    return total_reward
# Save the model parameters
def save(agent):
    learnDir = os.path.join(logger.get_dir(), 'learn')
    predictDir = os.path.join(logger.get_dir(), 'predict')
    agent.save_params(learnDir, predictDir)

# Restore the model (forward slashes work on both Windows and Linux)
def restore(agent):
    print(logger.get_dir())
    learnDir = r"flappy_bird/log_dir/Train_Test_Working_Flow/learn"
    predictDir = r"flappy_bird/log_dir/Train_Test_Working_Flow/predict"
    print('restore model from {}'.format(learnDir))
    agent.load_params(learnDir, predictDir)
# Initialize the environment, model, algorithm and agent
def init_environment():
    env = BirdEnv()
    action_dim = 2
    model = BirdModel(action_dim)
    algorithm = DQN(model, act_dim=action_dim, gamma=GAMMA, lr=LEARNING_RATE)
    agent = BirdAgent(algorithm, action_dim)
    return env, agent
# Training loop
def train():
    env, agent = init_environment()
    rpm = ReplayMemory(MEMORY_SIZE, IMAGE_SIZE, CONTEXT_LEN)
    rpmForTest = ReplayMemory(128, IMAGE_SIZE, CONTEXT_LEN)
    # warm-up: fill the replay memory with random play before training starts
    with tqdm(total=MEMORY_WARMUP_SIZE) as pbar:
        while rpm.size() < MEMORY_WARMUP_SIZE:
            ep_reward, step = run_train_episode(env, agent, rpm)
            pbar.update(step)
    # train
    print('TrainStart!')
    # pbar = tqdm(total=TOTAL)
    # a deque holding the rewards of the most recent 16 episodes
    avgQueue = deque(maxlen=16)
    total_step = 0
    max_reward = 0
    global learning_curve
    while True:
        ep_reward, step = run_train_episode(env, agent, rpm)
        total_step += step
        avgQueue.append(ep_reward)
        if ep_reward > max_reward:
            max_reward = ep_reward
        # pbar.set_description('exploration:{:.4f},max_reward:{:.2f}'.format(agent.exploration, max_reward))
        # pbar.update(step)
        if trainEpisode % log_freq == 0:
            learning_curve.append(np.mean(avgQueue))
        if trainEpisode % eval_freq == 0:
            global eval_mean_save
            eval_rewards = []
            for _ in range(16):
                eval_reward = run_evaluate_episode(env, agent, rpmForTest)
                eval_rewards.append(eval_reward)
                print('TestReward:{:.2f}'.format(eval_reward))
            print('TestMeanReward:{:.2f}'.format(np.mean(eval_rewards)))
            if np.mean(eval_rewards) > eval_mean_save:
                eval_mean_save = np.mean(eval_rewards)
                save(agent)
                print('ModelSaved!')
            if np.min(eval_rewards) >= threshold_min and np.mean(eval_rewards) >= threshold_avg:
                print("########## Solved with {} episode!###########".format(trainEpisode))
                save(agent)
                break
        if total_step >= TOTAL:
            break
    # pbar.close()
    # plot the learning curve
    X = np.arange(0, len(learning_curve))
    X *= log_freq
    plt.title('LearningCurve')
    plt.xlabel('TrainEpisode')
    plt.ylabel('AvgReward')
    plt.plot(X, learning_curve)
    plt.show()
# Evaluate a saved model
def test():
    env, agent = init_environment()
    rpmForTest = ReplayMemory(128, IMAGE_SIZE, CONTEXT_LEN)
    restore(agent)
    pbar = tqdm(total=TOTAL)
    pbar.write("testing:")
    eval_rewards = []
    for _ in tqdm(range(64), desc='eval agent'):
        eval_reward = run_evaluate_episode(env, agent, rpmForTest)
        eval_rewards.append(eval_reward)
        print('TestReward:{:.2f}'.format(eval_reward))
    print("eval_mean_reward:{:.2f},eval_min_reward:{:.2f}".format(np.mean(eval_rewards), np.min(eval_rewards)))
    pbar.close()
if __name__ == '__main__':
    train()