I previously wrote a two-player version of the "走四棋兒" game with no AI. This version is the human-vs-machine edition, which uses the UCT algorithm to generate the CPU's moves. UCT is Monte Carlo Tree Search (MCTS) combined with the UCB formula; if you are not familiar with it, a quick search will turn up plenty of material. Of everything I found online, two resources were the most useful to me.

① A website: http://mcts.ai/index.html — it introduces Monte Carlo Tree Search, entirely in English, with example code in Python and Java (my own code ended up quite different from it). Many of the MCTS articles you find online are translations of parts of this site, so it may help to read a few introductions in your own language before tackling it.

② An image: the figure below is from an article whose original I can no longer trace (I will remove it on request). It lays the UCT algorithm out clearly and was a great help when I wrote the code.
Let me use the figure above to explain the general idea of UCT. It contains a lot of my own understanding, so some of it may be off; corrections are welcome.
Many sources describe MCTS as four phases; in this figure UCT is divided into three main procedures (I translate them directly): the tree policy, the default policy, and backpropagation. The tree policy in turn has two parts: expansion and best-child selection. The basic unit of the tree is the node. A node carries a lot of information, most importantly its children, its parent, the move that took the previous game state to this one, the game state itself, its visit count, its accumulated reward, and so on; for details see the example code on the website listed as resource ① above. Every node holds a game state (which contains all the information about the current game), and its children represent the states reachable by one more move. The state is thus "mounted" on the node, and the tree of nodes records many game states together with how they evolve.

UCT begins by creating a node v0 from the current game state s0. In Python, note that s0 must be a copy of the game state. The game state is usually an object, and we cannot simply pass the original object in as s0 when creating v0: what gets passed is effectively a reference, not an independent copy, so once s0 becomes part of v0, any change to s0 would also change the real game state, and everything would fall apart. Instead, copy the state and pass the copy as s0 when creating v0; then even when s0 changes inside v0, the real game state is untouched. To copy the object properly I had to write a cloning method by hand (modelled on the example code in resource ①); Python's copy.deepcopy did not work well for me. The "within computational budget" in the figure is just the number of iterations: the more, the smarter the play.

Then the tree policy (TreePolicy) runs. If the node is not a leaf (a leaf's game state is a finished game, with the winner decided) and the node is not fully expanded — its children do not yet cover every possible next state — then a child for one of the missing states is added, the algorithm moves to that new child, and DefaultPolicy is run from it. If the node is fully expanded, the UCB formula selects the best child and the algorithm moves there, repeating the selection until it reaches a node that is not fully expanded; that node is then expanded and DefaultPolicy runs from the freshly added child.

DefaultPolicy is a fast playout method (the "rollout" many sources mention): the pieces move randomly until the game ends and a result is obtained. The state passed to DefaultPolicy is the child's game state, and it too should be a copy, because the random playout must not change the state stored in the child.

Once the result is known, Backup begins. Backup feeds the result back to every node from the one where the simulation started up to v0, updating visit counts and rewards. The important point is that the reward differs between levels. Say the playout started from node A, whose state is the position just after player 1 moved; A's parent B is then the position just after player 2 moved (player 2's move produced B, then player 1's move turned B into its child A). If the playout from A ends with player 1 winning, the visit count of every node from v0 down to A is incremented, and A's reward should be positive (A is the position player 1 reached by playing the move A represents, and that position led to player 1's win). But for A's parent B, the move that produced B was played by player 2, and that move ultimately lost, so at B the visit count still goes up while the reward should be zero or negative; B's parent gets a positive reward again, B's grandparent zero or negative, and so on up the tree.

The result is that when UCB selects the BestChild, it always picks the move most favourable to the side whose turn it is. It is as if you were playing against someone of unknown strength and assumed every move they make is the best one for them, while every move you consider is the one that blocks their win (the one best for you). This is exactly the idea of the minimax algorithm; with enough simulations the result gets very close to optimal, which I will not pursue further here.

When Backup finishes, one iteration is complete, and the next one proceeds the same way, again starting from v0: select, expand, play out, back up. With enough iterations the tree becomes large and deep, and more and more children are expanded all the way to terminal states. After the full budget of iterations, what is returned is the move of v_best, the best of v0's children (for "best", some use win rate and some use visit count; I have not studied this in depth, but the two usually agree). That move is the one most favourable to the side to play, and making it leads to v_best's game state. That is my understanding of UCT.
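The copy-versus-reference pitfall above can be shown with a minimal, hypothetical GameState class (illustrative only; it is not the SiqiGame class used in the programs below):

```python
class GameState:
    """Minimal stand-in for a game state: just a list of piece positions."""
    def __init__(self, positions):
        self.positions = positions

    def clone(self):
        # Hand-written copy, as the text suggests: duplicate every mutable field.
        # Note: slicing a Python list makes a real copy (unlike a numpy slice).
        return GameState(self.positions[:])


real = GameState([0, 1, 2, 3])

aliased = real                # only a reference: both names point to one object
aliased.positions[0] = 9
print(real.positions[0])      # 9 -- the "real" game was silently corrupted

real.positions[0] = 0         # undo the damage
copied = real.clone()         # an independent object
copied.positions[0] = 9
print(real.positions[0])      # 0 -- the real game is untouched
```

This is why the search code below always calls Clone() on a state before mutating it.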
Below are my two programs. The first is the more complete machine self-play version; it differs quite a bit from the two-player program in my earlier post. The second is the human-vs-machine version I adapted from the self-play one. I am a programming novice, so the code is rather long-winded; focus on the UCT part, and I hope it gives beginners a little inspiration. To run it you will need to download the game assets: https://download.csdn.net/download/zugexiaodui/10805748
The program freezes while it is thinking…
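One likely reason for the freeze is that the long UCT search never returns control to pygame's event loop, so the OS marks the window unresponsive. Below is a minimal sketch of the workaround, with pygame abstracted behind a `pump` parameter so the pattern can run standalone (`run_in_slices` and `pump_every` are my own illustrative names, not part of the programs that follow):

```python
def run_in_slices(steps, pump, pump_every=200):
    """Run a long computation as many short steps, calling pump()
    periodically so the GUI event queue is serviced. In the real
    program, pump would be pygame.event.pump."""
    done = 0
    for i, step in enumerate(steps):
        if i % pump_every == 0:
            pump()            # e.g. pg.event.pump(): tells the OS we are alive
        step()
        done += 1
    return done
```

Applied to UCTSearch below, this would mean calling pg.event.pump() every few hundred iterations of the `for i in range(itermax)` loop; the search still takes just as long, but the window no longer appears dead.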
import pygame as pg
from pygame.locals import *
import sys
import time
import numpy as np
import random
import os
from math import sqrt, log
'''
Machine self-play version.
Uses the Monte Carlo method, with decent results.
Note: the rewards for the two sides must have opposite signs, and the case where one side's
pieces are surrounded and cannot move has to be handled.
player_move_flag would be clearer named player_just_moved_flag or cpu_just_moved_flag.
Added handling for an immobilized side (it loses) and for draws (too many moves with no winner).
Directions for improvement: speed up the simulations; bias the policy toward immediate gains;
build a human-vs-machine version.
Key parameters: the reward (a capture part and a win part, affecting exploration vs exploitation)
must be set sensibly; Cp (the exploration/exploitation trade-off); how UCT picks the final move
(most visits vs highest win rate). In my tests UCT beats my own MC_Method (both at 800
simulations per move); the number of simulations matters a lot.
'''
pg.init()
size = width, height = 600, 400
screen = pg.display.set_mode(size)
f_clock = pg.time.Clock()
fps = 30
pg.display.set_caption("走四棋兒")
pg.display.set_icon(pg.image.load("icon.png").convert_alpha())
background = pg.image.load("background.png").convert_alpha()
pixel_pos = [(90, 40), (190, 40), (290, 40), (390, 40),
(90, 140), (190, 140), (290, 140), (390, 140),
(90, 240), (190, 240), (290, 240), (390, 240),
(90, 340), (190, 340), (290, 340), (390, 340)]
random.seed(8)  # fixed random seed: the same seed reproduces the same run
file_name = 'Chess_score.txt'
f = open(file_name, 'w')
ITER_MAX = 8000
PLAYER_WINS, CPU_WINS, DRAWS = 0, 0, 0
class PlayerPiece():
def __init__(self):
self.id = 2
self.img = pg.image.load("spade.png").convert_alpha()
self.rect = self.img.get_rect()
self.pos = -1
self.alive_state = True
self.selected = False
# self.legal_moves = []
def piece_update(self):
if self.alive_state == True:
self.rect[0] = pixel_pos[self.pos][0]
self.rect[1] = pixel_pos[self.pos][1]
screen.blit(self.img, self.rect)
class CPUPiece():
def __init__(self):
self.id = 3
self.img = pg.image.load("heart.png").convert_alpha()
self.rect = self.img.get_rect()
self.pos = -1
self.alive_state = True
# self.legal_moves = []
def piece_update(self):
if self.alive_state == True:
self.rect[0] = pixel_pos[self.pos][0]
self.rect[1] = pixel_pos[self.pos][1]
screen.blit(self.img, self.rect)
class SiqiGame():
game_num = 0
def __init__(self):
self.game_num += 1
self.cpu_piece = [None] * 4
self.player_piece = [None] * 4
for i in range(4):
self.cpu_piece[i] = CPUPiece()
self.player_piece[i] = PlayerPiece()
self.cpu_piece[i].pos = i
self.player_piece[i].pos = 12 + i
self.player_move_flag = random.choice([True, False])
# self.is_player_selecting_piece = False
# self.player_select = None
# self.cpu_select = None
self.piece_just_moved = [-1, -1]
self.round = 0
self.glb_pos = np.array([3, 3, 3, 3,
0, 0, 0, 0,
0, 0, 0, 0,
                                2, 2, 2, 2], dtype=np.uint8)  # the board must be initialized this way
# # print("New game " + str(self.game_num) + " starts!")
def Clone(self):
s = SiqiGame()
s.player_move_flag = self.player_move_flag
s.piece_just_moved = self.piece_just_moved[:]
s.round = self.round
        s.glb_pos = self.glb_pos.copy()  # a numpy slice is only a view, so copy explicitly
for i in range(4):
s.cpu_piece[i].pos = self.cpu_piece[i].pos
s.cpu_piece[i].alive_state = self.cpu_piece[i].alive_state
s.player_piece[i].pos = self.player_piece[i].pos
s.player_piece[i].alive_state = self.player_piece[i].alive_state
return s
def check_action_legal(self, selected_piece, action):
pos_legal = False
if 0 <= (selected_piece.pos + action) <= 15 and self.glb_pos[selected_piece.pos + action] == 0:
if action == 1:
if selected_piece.pos % 4 < 3:
pos_legal = True
elif action == -1:
if selected_piece.pos % 4 > 0:
pos_legal = True
elif action == 4:
if selected_piece.pos // 4 < 3:
pos_legal = True
elif action == -4:
if selected_piece.pos // 4 > 0:
pos_legal = True
return pos_legal
def get_pieces_legal_moves(self):
self.get_situation()
untried_moves = []
cpu_player_pieces = None
if self.player_move_flag == True:
cpu_player_pieces = self.player_piece
else:
cpu_player_pieces = self.cpu_piece
for i in range(4):
# cpu_player_pieces[i].legal_moves = []
if cpu_player_pieces[i].alive_state == True:
for action in [-1, 1, -4, 4]:
if self.check_action_legal(cpu_player_pieces[i], action) == True:
# cpu_player_pieces[i].legal_moves.append(action)
untried_moves.append((i, action))
return untried_moves
def DoMove(self, piece_move):
(i_piece, move) = piece_move
if self.player_move_flag == True:
self.player_piece[i_piece].pos += move
self.piece_just_moved = [i_piece, move]
self.check_situation()
self.player_move_flag = False
else:
self.cpu_piece[i_piece].pos += move
self.piece_just_moved = [i_piece, move]
self.check_situation()
self.player_move_flag = True
self.get_situation(True)
def get_situation(self, round_plus=False):
if self.player_move_flag and round_plus:
self.round += 1
# print("Round" + str(self.round), "player moves.")
elif (not self.player_move_flag) and round_plus:
self.round += 1
# print("Round" + str(self.round), "CPU moves.")
self.glb_pos = np.zeros([16], dtype=np.uint8)
for i in range(4):
if self.player_piece[i].alive_state == True:
self.glb_pos[self.player_piece[i].pos] = self.player_piece[i].id
if self.cpu_piece[i].alive_state == True:
self.glb_pos[self.cpu_piece[i].pos] = self.cpu_piece[i].id
# for i in range(4):
# print(self.glb_pos[i * 4:(i + 1) * 4])
# print("=" * 9)
def check_situation(self):
self.get_situation()
enemy_die = False
piece_just_moved = None
reward = 0
if self.player_move_flag:
piece_just_moved = self.player_piece[self.piece_just_moved[0]]
else:
piece_just_moved = self.cpu_piece[self.piece_just_moved[0]]
if piece_just_moved != None:
pos_row, pos_col = piece_just_moved.pos // 4, piece_just_moved.pos % 4
pos_row_all = self.glb_pos[pos_row * 4:(pos_row + 1) * 4]
pos_col_all = self.glb_pos[pos_col::4]
if piece_just_moved.id == self.player_piece[0].id:
if np.sum(pos_row_all) == 2 + 2 + 3:
if (pos_row_all == np.array([0, 2, 2, 3])).all():
for cpu_piece_i in self.cpu_piece:
if cpu_piece_i.pos == pos_row * 4 + 3 and cpu_piece_i.alive_state:
cpu_piece_i.alive_state = False
enemy_die = True
elif (pos_row_all == np.array([2, 2, 3, 0])).all():
for cpu_piece_i in self.cpu_piece:
if cpu_piece_i.pos == pos_row * 4 + 2 and cpu_piece_i.alive_state:
cpu_piece_i.alive_state = False
enemy_die = True
elif (pos_row_all == np.array([3, 2, 2, 0])).all():
for cpu_piece_i in self.cpu_piece:
if cpu_piece_i.pos == pos_row * 4 + 0 and cpu_piece_i.alive_state:
cpu_piece_i.alive_state = False
enemy_die = True
elif (pos_row_all == np.array([0, 3, 2, 2])).all():
for cpu_piece_i in self.cpu_piece:
if cpu_piece_i.pos == pos_row * 4 + 1 and cpu_piece_i.alive_state:
cpu_piece_i.alive_state = False
enemy_die = True
if np.sum(pos_col_all) == 2 + 2 + 3:
if (pos_col_all == np.array([0, 2, 2, 3])).all():
for cpu_piece_i in self.cpu_piece:
if cpu_piece_i.pos == 4 * 3 + pos_col and cpu_piece_i.alive_state:
cpu_piece_i.alive_state = False
enemy_die = True
elif (pos_col_all == np.array([2, 2, 3, 0])).all():
for cpu_piece_i in self.cpu_piece:
if cpu_piece_i.pos == 4 * 2 + pos_col and cpu_piece_i.alive_state:
cpu_piece_i.alive_state = False
enemy_die = True
elif (pos_col_all == np.array([3, 2, 2, 0])).all():
for cpu_piece_i in self.cpu_piece:
if cpu_piece_i.pos == 4 * 0 + pos_col and cpu_piece_i.alive_state:
cpu_piece_i.alive_state = False
enemy_die = True
elif (pos_col_all == np.array([0, 3, 2, 2])).all():
for cpu_piece_i in self.cpu_piece:
if cpu_piece_i.pos == 4 * 1 + pos_col and cpu_piece_i.alive_state:
cpu_piece_i.alive_state = False
enemy_die = True
if enemy_die == True:
                    reward = 0.8  # todo: the reward could depend on round, favouring earlier captures
elif piece_just_moved.id == self.cpu_piece[0].id:
if np.sum(pos_row_all) == 3 + 3 + 2:
if (pos_row_all == np.array([0, 3, 3, 2])).all():
for player_piece_i in self.player_piece:
if player_piece_i.pos == pos_row * 4 + 3:
player_piece_i.alive_state = False
enemy_die = True
elif (pos_row_all == np.array([3, 3, 2, 0])).all():
for player_piece_i in self.player_piece:
if player_piece_i.pos == pos_row * 4 + 2:
player_piece_i.alive_state = False
enemy_die = True
elif (pos_row_all == np.array([0, 2, 3, 3])).all():
for player_piece_i in self.player_piece:
if player_piece_i.pos == pos_row * 4 + 1:
player_piece_i.alive_state = False
enemy_die = True
elif (pos_row_all == np.array([2, 3, 3, 0])).all():
for player_piece_i in self.player_piece:
if player_piece_i.pos == pos_row * 4 + 0:
player_piece_i.alive_state = False
enemy_die = True
if np.sum(pos_col_all) == 3 + 3 + 2:
if (pos_col_all == np.array([0, 3, 3, 2])).all():
for player_piece_i in self.player_piece:
if player_piece_i.pos == 4 * 3 + pos_col:
player_piece_i.alive_state = False
enemy_die = True
elif (pos_col_all == np.array([3, 3, 2, 0])).all():
for player_piece_i in self.player_piece:
if player_piece_i.pos == 4 * 2 + pos_col:
player_piece_i.alive_state = False
enemy_die = True
elif (pos_col_all == np.array([2, 3, 3, 0])).all():
for player_piece_i in self.player_piece:
if player_piece_i.pos == 4 * 0 + pos_col:
player_piece_i.alive_state = False
enemy_die = True
elif (pos_col_all == np.array([0, 2, 3, 3])).all():
for player_piece_i in self.player_piece:
if player_piece_i.pos == 4 * 1 + pos_col:
player_piece_i.alive_state = False
enemy_die = True
if enemy_die == True:
                    reward = -0.8  # todo: the reward could depend on round, favouring earlier captures
return reward
def check_game_over(self):
player_pieces_alive_num, cpu_pieces_alive_num = 0, 0
game_over = False
reward = 0
        if not self.player_move_flag:  # note: the flag has already been flipped by the move
for cpu_piece_i in self.cpu_piece:
if cpu_piece_i.alive_state:
cpu_pieces_alive_num += 1
if cpu_pieces_alive_num <= 1:
game_over = True
reward = 1
else:
if self.get_pieces_legal_moves() == []:
game_over = True
reward = 1
else:
for player_piece_i in self.player_piece:
if player_piece_i.alive_state:
player_pieces_alive_num += 1
if player_pieces_alive_num <= 1:
game_over = True
reward = -1
else:
if self.get_pieces_legal_moves() == []:
game_over = True
reward = -1
return game_over, reward
def restart(self):
del self.cpu_piece
del self.player_piece
self.__init__()
def update_screen(self):
screen.blit(background, (0, 0))
for i in range(4):
self.player_piece[i].piece_update()
self.cpu_piece[i].piece_update()
pg.display.update()
def game_main(self):
if self.player_move_flag:
policy = UCTSearch((self), ITER_MAX)
# policy = MC_Method((self),ITER_MAX)
self.DoMove(policy)
else:
# if self.get_pieces_legal_moves() != []:
# self.DoMove(random.choice(self.get_pieces_legal_moves()))
policy = UCTSearch((self), ITER_MAX // 2)
# policy = MC_Method((self),ITER_MAX)
self.DoMove(policy)
print("+++ Round " + str(self.round) + " +++")
f.write("+++ Round " + str(self.round) + " +++\n")
for i in range(4):
print(self.glb_pos[i * 4:(i + 1) * 4])
f.write(str(self.glb_pos[i * 4:(i + 1) * 4]) + '\n')
print("=" * 9)
self.update_screen()
f.write("=" * 9 + '\n')
if self.check_game_over()[0]:
if self.check_game_over()[1] > 0:
global PLAYER_WINS
PLAYER_WINS += 1
else:
global CPU_WINS
CPU_WINS += 1
print("*" * 5 + " PLAYER_WINS:" + str(PLAYER_WINS), end='')
print("---CPU_WINS:" + str(CPU_WINS) + ' ' + "*" * 5 + '\n')
f.write(
"*" * 5 + " PLAYER_WINS:" + str(PLAYER_WINS) + "---CPU_WINS:" + str(CPU_WINS) + ' ' + '*' * 5 + '\n\n')
self.update_screen()
time.sleep(2)
if PLAYER_WINS + CPU_WINS >= 7:
print("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS))
f.write("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS) + '\n')
f.close()
sys.exit()
self.restart()
else:
            if self.round >= 150:
                print("*" * 5 + " Draw! No one wins! " + "*" * 5)
                global DRAWS
                DRAWS += 1
                time.sleep(2)
                if PLAYER_WINS + CPU_WINS >= 50:
                    print("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS) + " DRAWS:" + str(DRAWS))
                    f.write("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS) + " DRAWS:" + str(
                        DRAWS) + '\n')
f.close()
sys.exit()
self.restart()
self.update_screen()
class Node():
def __init__(self, parent=None, game_state=None, move=None):
self.parentNode = parent
self.game_state = game_state
self.move = move
self.childNodes = []
self.wins = 0
self.visits = 0
self.untriedMoves = self.game_state.get_pieces_legal_moves()
def AddChild(self, game_state, move):
n = Node(parent=self, game_state=game_state, move=move)
self.childNodes.append(n)
self.untriedMoves.remove(move)
return n
def UCTSelect(self, Cp=2):
try:
s = sorted(self.childNodes, key=lambda c: c.wins / c.visits + sqrt(Cp * log(self.visits) / c.visits))[-1]
except:
print(self.childNodes, self.game_state.player_move_flag)
print(self.game_state.glb_pos.reshape([4, 4]))
assert False, "UCT_SELECT"
return s
def Update(self, result):
self.visits += 1
if self.game_state.player_move_flag:
            # note: Update runs after DoMove, so player_move_flag == True means this node's move was made by the CPU
            self.wins -= result  # rewards are positive for the player and negative for the CPU, so flip the sign here
else:
self.wins += result
def CheckFullyExpanded(self):
fully_expanded = False
        if self.untriedMoves == []:  # open question: what if a side has not lost but has no legal moves?
# if self.childNodes != []:
fully_expanded = True
return fully_expanded
def UCTSearch(root_state, itermax):
    '''
    Worth trying both final-move criteria: the highest-win-rate child and the most-visited child.
    :param root_state: current game state (cloned inside)
    :param itermax: number of simulations
    :return: the move chosen for the side to play
    '''
state = root_state.Clone()
root_node = Node(game_state=state)
for i in range(itermax):
# state = root_state.Clone()
node_l = TreePolicy(root_node)
delta = DefaultPolicy(node_l.game_state.Clone())
BackUp(node_l, delta)
for n in root_node.childNodes:
        print(n.move, n.wins, '/', n.visits, '=', n.wins / n.visits)  # print each candidate move's statistics
    # return BestChild(root_node, 0).move  # alternative: pick the child with the highest win rate
    return sorted(root_node.childNodes, key=lambda c: c.visits)[-1].move  # pick the most-visited child
def TreePolicy(node):
while not node.game_state.check_game_over()[0]:
if not node.CheckFullyExpanded():
c_node = Expand(node)
return c_node
else:
node = BestChild(node)
return node
def Expand(node):
if node.untriedMoves != []:
a = random.choice(node.untriedMoves)
state = node.game_state.Clone()
state.DoMove(a)
n = node.AddChild(game_state=state, move=a)
return n
else:
print(node.untriedMoves)
        print(node.game_state.glb_pos)
assert False, "node.untriedMoves == []"
def BestChild(node, Cp=2):
    '''
    Cp is important: it balances exploration against exploitation.
    :param node: node whose children are compared
    :param Cp: exploration constant in the UCB formula
    :return: the best child according to UCB
    '''
return node.UCTSelect(Cp)
def DefaultPolicy(game_state):
    '''
    Random playout returning a reward with a capture component and a win component.
    :param game_state: a cloned game state, safe to mutate during the playout
    :return: accumulated reward
    '''
reward = 0
reward += game_state.check_situation()
over, r = game_state.check_game_over()
reward += r
while not over and game_state.get_pieces_legal_moves() != []:
game_state.DoMove(random.choice(game_state.get_pieces_legal_moves()))
over, r = game_state.check_game_over()
reward += r
return reward
def BackUp(node, delta):
while node != None:
node.Update(delta)
node = node.parentNode
def MC_Method(root_state, itermax):
    '''
    Expand every child first, then simulate each child the same number of times.
    Works worse than UCT.
    :param root_state: current game state (cloned inside)
    :param itermax: total number of simulations
    :return: best_child.move
    '''
state = root_state.Clone()
root_node = Node(game_state=state)
for m in root_node.untriedMoves[:]:
s = state.Clone()
s.DoMove(m)
root_node.AddChild(s, m)
assert root_node.untriedMoves == [], "NOT ALL!"
iter_n = 0
    while iter_n < itermax:  # keep the total number of simulations equal to itermax, for a fair comparison with UCT
for child_node in root_node.childNodes:
r = DefaultPolicy(child_node.game_state.Clone())
child_node.Update(r)
iter_n += 1
if iter_n >= itermax:
break
for n in root_node.childNodes:
print(n.move, n.wins, '/', n.visits, '=', n.wins / n.visits)
best_child = sorted(root_node.childNodes, key=lambda c: c.wins / c.visits)[-1]
return best_child.move
game = SiqiGame()
while True:
for event in pg.event.get():
if event.type == pg.QUIT:
print("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS))
f.write("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS) + '\n')
f.close()
sys.exit()
game.game_main()
f_clock.tick(fps)
Below is the human-vs-machine version; the program stays frozen while the machine is thinking.
import pygame as pg
from pygame.locals import *
import sys
import time
import numpy as np
import random
import os
from math import sqrt, log
'''
Human-vs-machine version, to test how UCT performs.
At 8000 simulations per move it already plays quite well.
'''
pg.init()
size = width, height = 600, 400
screen = pg.display.set_mode(size)
f_clock = pg.time.Clock()
fps = 30
pg.display.set_caption("走四棋兒")
pg.display.set_icon(pg.image.load("icon.png").convert_alpha())
background = pg.image.load("background.png").convert_alpha()
pixel_pos = [(90, 40), (190, 40), (290, 40), (390, 40),
(90, 140), (190, 140), (290, 140), (390, 140),
(90, 240), (190, 240), (290, 240), (390, 240),
(90, 340), (190, 340), (290, 340), (390, 340)]
random.seed(8)  # fixed random seed: the same seed reproduces the same run
file_name = 'Chess_score5.txt'
f = open(file_name, 'w')
ITER_MAX = 8000
PLAYER_WINS, CPU_WINS = 0, 0
class PlayerPiece():
def __init__(self):
self.id = 2
self.img = pg.image.load("spade.png").convert_alpha()
self.rect = self.img.get_rect()
self.pos = -1
self.alive_state = True
self.selected = False
# self.legal_moves = []
def piece_update(self):
if self.alive_state == True:
self.rect[0] = pixel_pos[self.pos][0]
self.rect[1] = pixel_pos[self.pos][1]
screen.blit(self.img, self.rect)
class CPUPiece():
def __init__(self):
self.id = 3
self.img = pg.image.load("heart.png").convert_alpha()
self.rect = self.img.get_rect()
self.pos = -1
self.alive_state = True
# self.legal_moves = []
def piece_update(self):
if self.alive_state == True:
self.rect[0] = pixel_pos[self.pos][0]
self.rect[1] = pixel_pos[self.pos][1]
screen.blit(self.img, self.rect)
class SiqiGame():
game_num = 0
def __init__(self):
self.game_num += 1
self.cpu_piece = [None] * 4
self.player_piece = [None] * 4
for i in range(4):
self.cpu_piece[i] = CPUPiece()
self.player_piece[i] = PlayerPiece()
self.cpu_piece[i].pos = i
self.player_piece[i].pos = 12 + i
self.player_move_flag = random.choice([True, False])
self.__is_player_selecting_piece = False
self.__player_select = None
# self.cpu_select = None
self.piece_just_moved = [-1, -1]
self.round = 0
self.glb_pos = np.array([3, 3, 3, 3,
0, 0, 0, 0,
0, 0, 0, 0,
                                2, 2, 2, 2], dtype=np.uint8)  # the board must be initialized this way
# # print("New game " + str(self.game_num) + " starts!")
def Clone(self):
s = SiqiGame()
s.player_move_flag = self.player_move_flag
s.piece_just_moved = self.piece_just_moved[:]
s.round = self.round
        s.glb_pos = self.glb_pos.copy()  # a numpy slice is only a view, so copy explicitly
for i in range(4):
s.cpu_piece[i].pos = self.cpu_piece[i].pos
s.cpu_piece[i].alive_state = self.cpu_piece[i].alive_state
s.player_piece[i].pos = self.player_piece[i].pos
s.player_piece[i].alive_state = self.player_piece[i].alive_state
return s
def check_action_legal(self, selected_piece, action):
pos_legal = False
if 0 <= (selected_piece.pos + action) <= 15 and self.glb_pos[selected_piece.pos + action] == 0:
if action == 1:
if selected_piece.pos % 4 < 3:
pos_legal = True
elif action == -1:
if selected_piece.pos % 4 > 0:
pos_legal = True
elif action == 4:
if selected_piece.pos // 4 < 3:
pos_legal = True
elif action == -4:
if selected_piece.pos // 4 > 0:
pos_legal = True
return pos_legal
def get_pieces_legal_moves(self):
self.get_situation()
untried_moves = []
cpu_player_pieces = None
if self.player_move_flag == True:
cpu_player_pieces = self.player_piece
else:
cpu_player_pieces = self.cpu_piece
for i in range(4):
# cpu_player_pieces[i].legal_moves = []
if cpu_player_pieces[i].alive_state == True:
for action in [-1, 1, -4, 4]:
if self.check_action_legal(cpu_player_pieces[i], action) == True:
# cpu_player_pieces[i].legal_moves.append(action)
untried_moves.append((i, action))
return untried_moves
def __check_click(self, c_x, c_y):
self.__player_select = None
for i in range(4):
if self.player_piece[i].alive_state and self.player_piece[i].rect.collidepoint(c_x, c_y):
# self.player_piece[i].selected = True
self.__player_select = self.player_piece[i]
def __player_piece_moves(self, c_x, c_y):
move_successfully = False
piece_exist = False
action = 0
if self.__player_select != None:
for cpu_piece_i in self.cpu_piece:
if cpu_piece_i.rect.collidepoint(c_x, c_y) and cpu_piece_i.alive_state:
piece_exist = True
self.__is_player_selecting_piece = False
for i in range(4):
if self.player_piece[i].rect.collidepoint(c_x, c_y) and self.player_piece[i].alive_state:
self.__player_select = self.player_piece[i]
if piece_exist == False:
delta_y, delta_x = c_y - self.__player_select.rect[1], c_x - self.__player_select.rect[0]
if 80 <= abs(delta_x) <= 120 and abs(delta_y) <= 20:
if delta_x < 0:
if self.__player_select.pos % 4 > 0:
self.__player_select.pos -= 1
action = -1
move_successfully = True
else:
if self.__player_select.pos % 4 < 3:
self.__player_select.pos += 1
action = 1
move_successfully = True
if 80 <= abs(delta_y) <= 120 and abs(delta_x) <= 20:
if delta_y < 0:
if self.__player_select.pos > 3:
self.__player_select.pos -= 4
action = -4
move_successfully = True
else:
if self.__player_select.pos < 12:
self.__player_select.pos += 4
action = 4
move_successfully = True
if move_successfully:
return action
else:
return 0
def __PlayerMove(self):
if event.type == pg.MOUSEBUTTONDOWN:
c_x, c_y = pg.mouse.get_pos()
if self.__is_player_selecting_piece == False:
self.__check_click(c_x, c_y)
if self.__player_select != None:
self.__is_player_selecting_piece = True
else:
move = self.__player_piece_moves(c_x, c_y)
                if move != 0:
                    for i in range(4):
                        if self.__player_select == self.player_piece[i]:
                            i_piece = i
                    piece_move = (i_piece, move)
                    self.piece_just_moved = piece_move
                    self.check_situation()
                    self.player_move_flag = False
                    self.get_situation(True, True)
def DoMove(self, piece_move):
(i_piece, move) = piece_move
if self.player_move_flag:
self.player_piece[i_piece].pos += move
self.piece_just_moved = [i_piece, move]
self.check_situation()
self.player_move_flag = False
else:
self.cpu_piece[i_piece].pos += move
self.piece_just_moved = [i_piece, move]
self.check_situation()
self.player_move_flag = True
self.get_situation(True)
def __CPUMove(self,piece_move):
(i_piece, move) = piece_move
# if not self.player_move_flag:
self.cpu_piece[i_piece].pos += move
self.piece_just_moved = [i_piece, move]
self.check_situation()
self.player_move_flag = True
self.get_situation(True,True)
def get_situation(self, round_plus=False,print_situation = False):
if self.player_move_flag and round_plus:
self.round += 1
# print("Round" + str(self.round), "player moves.")
elif (not self.player_move_flag) and round_plus:
self.round += 1
# print("Round" + str(self.round), "CPU moves.")
self.glb_pos = np.zeros([16], dtype=np.uint8)
for i in range(4):
if self.player_piece[i].alive_state == True:
self.glb_pos[self.player_piece[i].pos] = self.player_piece[i].id
if self.cpu_piece[i].alive_state == True:
self.glb_pos[self.cpu_piece[i].pos] = self.cpu_piece[i].id
if print_situation:
for i in range(4):
print(self.glb_pos[i * 4:(i + 1) * 4])
print("=" * 9)
def check_situation(self):
self.get_situation()
enemy_die = False
piece_just_moved = None
reward = 0
if self.player_move_flag:
piece_just_moved = self.player_piece[self.piece_just_moved[0]]
else:
piece_just_moved = self.cpu_piece[self.piece_just_moved[0]]
if piece_just_moved != None:
pos_row, pos_col = piece_just_moved.pos // 4, piece_just_moved.pos % 4
pos_row_all = self.glb_pos[pos_row * 4:(pos_row + 1) * 4]
pos_col_all = self.glb_pos[pos_col::4]
if piece_just_moved.id == self.player_piece[0].id:
if np.sum(pos_row_all) == 2 + 2 + 3:
if (pos_row_all == np.array([0, 2, 2, 3])).all():
for cpu_piece_i in self.cpu_piece:
if cpu_piece_i.pos == pos_row * 4 + 3 and cpu_piece_i.alive_state:
cpu_piece_i.alive_state = False
enemy_die = True
elif (pos_row_all == np.array([2, 2, 3, 0])).all():
for cpu_piece_i in self.cpu_piece:
if cpu_piece_i.pos == pos_row * 4 + 2 and cpu_piece_i.alive_state:
cpu_piece_i.alive_state = False
enemy_die = True
elif (pos_row_all == np.array([3, 2, 2, 0])).all():
for cpu_piece_i in self.cpu_piece:
if cpu_piece_i.pos == pos_row * 4 + 0 and cpu_piece_i.alive_state:
cpu_piece_i.alive_state = False
enemy_die = True
elif (pos_row_all == np.array([0, 3, 2, 2])).all():
for cpu_piece_i in self.cpu_piece:
if cpu_piece_i.pos == pos_row * 4 + 1 and cpu_piece_i.alive_state:
cpu_piece_i.alive_state = False
enemy_die = True
if np.sum(pos_col_all) == 2 + 2 + 3:
if (pos_col_all == np.array([0, 2, 2, 3])).all():
for cpu_piece_i in self.cpu_piece:
if cpu_piece_i.pos == 4 * 3 + pos_col and cpu_piece_i.alive_state:
cpu_piece_i.alive_state = False
enemy_die = True
elif (pos_col_all == np.array([2, 2, 3, 0])).all():
for cpu_piece_i in self.cpu_piece:
if cpu_piece_i.pos == 4 * 2 + pos_col and cpu_piece_i.alive_state:
cpu_piece_i.alive_state = False
enemy_die = True
elif (pos_col_all == np.array([3, 2, 2, 0])).all():
for cpu_piece_i in self.cpu_piece:
if cpu_piece_i.pos == 4 * 0 + pos_col and cpu_piece_i.alive_state:
cpu_piece_i.alive_state = False
enemy_die = True
elif (pos_col_all == np.array([0, 3, 2, 2])).all():
for cpu_piece_i in self.cpu_piece:
if cpu_piece_i.pos == 4 * 1 + pos_col and cpu_piece_i.alive_state:
cpu_piece_i.alive_state = False
enemy_die = True
if enemy_die == True:
                    reward = 0.8  # todo: the reward could depend on round, favouring earlier captures
elif piece_just_moved.id == self.cpu_piece[0].id:
if np.sum(pos_row_all) == 3 + 3 + 2:
if (pos_row_all == np.array([0, 3, 3, 2])).all():
for player_piece_i in self.player_piece:
if player_piece_i.pos == pos_row * 4 + 3:
player_piece_i.alive_state = False
enemy_die = True
elif (pos_row_all == np.array([3, 3, 2, 0])).all():
for player_piece_i in self.player_piece:
if player_piece_i.pos == pos_row * 4 + 2:
player_piece_i.alive_state = False
enemy_die = True
elif (pos_row_all == np.array([0, 2, 3, 3])).all():
for player_piece_i in self.player_piece:
if player_piece_i.pos == pos_row * 4 + 1:
player_piece_i.alive_state = False
enemy_die = True
elif (pos_row_all == np.array([2, 3, 3, 0])).all():
for player_piece_i in self.player_piece:
if player_piece_i.pos == pos_row * 4 + 0:
player_piece_i.alive_state = False
enemy_die = True
if np.sum(pos_col_all) == 3 + 3 + 2:
if (pos_col_all == np.array([0, 3, 3, 2])).all():
for player_piece_i in self.player_piece:
if player_piece_i.pos == 4 * 3 + pos_col:
player_piece_i.alive_state = False
enemy_die = True
elif (pos_col_all == np.array([3, 3, 2, 0])).all():
for player_piece_i in self.player_piece:
if player_piece_i.pos == 4 * 2 + pos_col:
player_piece_i.alive_state = False
enemy_die = True
elif (pos_col_all == np.array([2, 3, 3, 0])).all():
for player_piece_i in self.player_piece:
if player_piece_i.pos == 4 * 0 + pos_col:
player_piece_i.alive_state = False
enemy_die = True
elif (pos_col_all == np.array([0, 2, 3, 3])).all():
for player_piece_i in self.player_piece:
if player_piece_i.pos == 4 * 1 + pos_col:
player_piece_i.alive_state = False
enemy_die = True
if enemy_die == True:
                    reward = -0.8  # todo: the reward could depend on round, favouring earlier captures
return reward
def check_game_over(self):
player_pieces_alive_num, cpu_pieces_alive_num = 0, 0
game_over = False
reward = 0
        if not self.player_move_flag:  # this flag has already been flipped by DoMove
for cpu_piece_i in self.cpu_piece:
if cpu_piece_i.alive_state:
cpu_pieces_alive_num += 1
if cpu_pieces_alive_num <= 1:
game_over = True
reward = 1
else:
if self.get_pieces_legal_moves()==[]:
game_over = True
reward = 1
else:
for player_piece_i in self.player_piece:
if player_piece_i.alive_state:
player_pieces_alive_num += 1
if player_pieces_alive_num <= 1:
game_over = True
reward = -1
else:
if self.get_pieces_legal_moves()==[]:
game_over = True
reward = -1
return game_over, reward
def restart(self):
del self.cpu_piece
del self.player_piece
self.__init__()
def update_screen(self):
screen.blit(background, (0, 0))
for i in range(4):
self.player_piece[i].piece_update()
self.cpu_piece[i].piece_update()
pg.display.update()
def game_main(self):
if self.player_move_flag:
self.__PlayerMove()
else:
policy = UCTSearch((self), ITER_MAX)
# policy = MC_Method((self),ITER_MAX)
self.__CPUMove(policy)
# print("+++ Round " + str(self.round) + " +++")
# f.write("+++ Round " + str(self.round) + " +++\n")
# for i in range(4):
# print(self.glb_pos[i * 4:(i + 1) * 4])
# f.write(str(self.glb_pos[i * 4:(i + 1) * 4]) + '\n')
# print("=" * 9)
self.update_screen()
# f.write("=" * 9 + '\n')
if self.check_game_over()[0] == True:
if self.check_game_over()[1] > 0:
global PLAYER_WINS
PLAYER_WINS += 1
else:
global CPU_WINS
CPU_WINS += 1
print("*" * 5 + " PLAYER_WINS:" + str(PLAYER_WINS), end='')
print("---CPU_WINS:" + str(CPU_WINS) + ' ' + "*" * 5 + '\n')
# f.write("*" * 5 + " PLAYER_WINS:" + str(PLAYER_WINS) + "---CPU_WINS:" + str(CPU_WINS) + ' ' + '*' * 5 + '\n\n')
self.update_screen()
time.sleep(2)
if PLAYER_WINS + CPU_WINS >= 50:
print("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS))
# f.write("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS) + '\n')
f.close()
sys.exit()
self.restart()
self.update_screen()
class Node():
def __init__(self, parent=None, game_state=None, move=None):
self.parentNode = parent
self.game_state = game_state
self.move = move
self.childNodes = []
self.wins = 0
self.visits = 0
self.untriedMoves = self.game_state.get_pieces_legal_moves()
def AddChild(self, game_state, move):
n = Node(parent=self, game_state=game_state, move=move)
self.childNodes.append(n)
self.untriedMoves.remove(move)
return n
def UCTSelect(self, Cp=2):
try:
s = sorted(self.childNodes, key=lambda c: c.wins / c.visits + sqrt(Cp * log(self.visits) / c.visits))[-1]
except:
print(self.childNodes,self.game_state.player_move_flag)
print(self.game_state.glb_pos.reshape([4,4]))
assert False,"UCT_SELECT"
return s
def Update(self, result):
self.visits += 1
if self.game_state.player_move_flag:
            # note: Update runs after DoMove, so player_move_flag == True means this node's move was made by the CPU
            self.wins -= result  # rewards are positive for the player and negative for the CPU, so flip the sign here
else:
self.wins += result
def CheckFullyExpanded(self):
fully_expanded = False
        if self.untriedMoves == []:  # open question: what if a side has not lost but has no legal moves?
# if self.childNodes != []:
fully_expanded = True
return fully_expanded
def UCTSearch(root_state, itermax):
    '''
    Worth trying both final-move criteria: the highest-win-rate child and the most-visited child.
    :param root_state: current game state (cloned inside)
    :param itermax: number of simulations
    :return: the move chosen for the side to play
    '''
state = root_state.Clone()
root_node = Node(game_state=state)
for i in range(itermax):
# state = root_state.Clone()
node_l = TreePolicy(root_node)
delta = DefaultPolicy(node_l.game_state.Clone())
BackUp(node_l, delta)
# for n in root_node.childNodes:
    #     print(n.move, n.wins, '/', n.visits, '=', n.wins / n.visits)  # print each candidate move's statistics
    # return BestChild(root_node, 0).move  # alternative: pick the child with the highest win rate
    return sorted(root_node.childNodes, key=lambda c: c.visits)[-1].move  # pick the most-visited child
def TreePolicy(node):
while not node.game_state.check_game_over()[0]:
if not node.CheckFullyExpanded():
c_node = Expand(node)
return c_node
else:
node = BestChild(node)
return node
def Expand(node):
if node.untriedMoves != []:
a = random.choice(node.untriedMoves)
state = node.game_state.Clone()
state.DoMove(a)
n = node.AddChild(game_state=state, move=a)
return n
else:
print(node.untriedMoves)
        print(node.game_state.glb_pos)
assert False, "node.untriedMoves == []"
def BestChild(node, Cp=2):
    '''
    Cp is important: it balances exploration against exploitation.
    :param node: node whose children are compared
    :param Cp: exploration constant in the UCB formula
    :return: the best child according to UCB
    '''
return node.UCTSelect(Cp)
def DefaultPolicy(game_state):
    '''
    Random playout returning a reward with a capture component and a win component.
    :param game_state: a cloned game state, safe to mutate during the playout
    :return: accumulated reward
    '''
reward = 0
reward += game_state.check_situation()
over, r = game_state.check_game_over()
reward += r
while not over and game_state.get_pieces_legal_moves() != []:
game_state.DoMove(random.choice(game_state.get_pieces_legal_moves()))
over, r = game_state.check_game_over()
reward += r
return reward
def BackUp(node, delta):
while node != None:
node.Update(delta)
node = node.parentNode
def MC_Method(root_state, itermax):
    '''
    Expand every child first, then simulate each child the same number of times.
    Works worse than UCT.
    :param root_state: current game state (cloned inside)
    :param itermax: total number of simulations
    :return: best_child.move
    '''
state = root_state.Clone()
root_node = Node(game_state=state)
for m in root_node.untriedMoves[:]:
s = state.Clone()
s.DoMove(m)
root_node.AddChild(s, m)
assert root_node.untriedMoves == [], "NOT ALL!"
iter_n = 0
    while iter_n < itermax:  # keep the total number of simulations equal to itermax, for a fair comparison with UCT
for child_node in root_node.childNodes:
r = DefaultPolicy(child_node.game_state.Clone())
child_node.Update(r)
iter_n += 1
if iter_n >= itermax:
break
for n in root_node.childNodes:
print(n.move, n.wins, '/', n.visits, '=', n.wins / n.visits)
best_child = sorted(root_node.childNodes, key=lambda c: c.wins / c.visits)[-1]
return best_child.move
game = SiqiGame()
while True:
for event in pg.event.get():
if event.type == pg.QUIT:
print("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS))
f.write("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS) + '\n')
f.close()
sys.exit()
game.game_main()
f_clock.tick(fps)
With 8000 simulations per move I can no longer hold my own against the machine, but its thinking time is very long. My feeling is that UCT only really pays off once the tree is expanded deep enough, so multithreading or multiprocessing may not help much; three cobblers do not add up to one mastermind. If you have an optimization, please leave a comment.
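That said, one commonly discussed speed-up that keeps full-depth trees is "root parallelization": run several independent UCT searches on clones of the root state and merge the per-move visit counts before choosing. A sketch of just the merging step follows; the worker searches themselves are abstracted away, and each dict below is a hypothetical stand-in for one worker's {move: visits} statistics:

```python
from collections import Counter

def merge_root_statistics(searches):
    """Combine per-move visit counts from independent root-parallel
    searches and return the move with the highest combined count."""
    totals = Counter()
    for visit_counts in searches:   # one {move: visits} dict per worker
        totals.update(visit_counts)
    move, _ = totals.most_common(1)[0]
    return move

# Three hypothetical workers: individually they disagree, but the
# combined statistics favour the consistently well-visited move.
workers = [
    {(0, 1): 300, (1, -4): 500},
    {(0, 1): 450, (1, -4): 350},
    {(0, 1): 400, (2, 4): 250},
]
print(merge_root_statistics(workers))  # (0, 1)
```

Each worker would run UCTSearch on root_state.Clone() in its own process (e.g. via multiprocessing.Pool), so the simulations genuinely run in parallel; whether several shallower trees match one deep tree is exactly the trade-off questioned above.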
I ran machine self-play for a whole night plus a morning, with 8000 simulations per move on one side and 4000 on the other; the final score was 16:12. If UCT can be made fast enough, the next step could be to imitate AlphaZero, train a neural network, and take on all comers. Admittedly not many people play this "走四棋兒" game, but the same approach can be extended to other board games.