A human-vs-machine "走四棋儿" (Zou Si Qi'er) game built with Pygame and Monte Carlo Tree Search (with a programming walkthrough and code)

I previously wrote a two-player "走四棋儿" game with no AI; this is the human-vs-machine version, which uses the UCT algorithm to generate the CPU's moves. UCT is Monte Carlo Tree Search (MCTS) combined with the UCB formula; if you are not familiar with it, there is plenty of material online. Of everything I found, two resources helped me the most. (1) A website: http://mcts.ai/index.html introduces Monte Carlo Tree Search; it is in English and comes with Python and Java example code, although my own code ended up quite different from those examples. Many of the MCTS articles you find online are essentially translations of parts of this site, so it is worth reading a few introductions in your own language before tackling the English site. (2) A figure: the image below comes from an article whose original source I can no longer find (I will take it down on request). It explains the UCT algorithm very clearly and was a great help when I wrote the code.

[Figure: UCT algorithm flow — the tree policy (expansion and best-child selection), the default policy (rollout), and backpropagation]

Let me use the figure above to explain the rough idea of the UCT algorithm. A lot of this is my own understanding, so there may be inaccuracies; corrections are welcome.

Many sources describe MCTS as four phases, while the figure splits UCT into three main procedures (I translate the terms directly): the tree policy, the default policy, and backpropagation. The tree policy is itself split into two parts, expansion and selecting the best child.

The basic unit of the tree is the node. A node carries a lot of information, the most important being its children, its parent, the action that led from the previous game state to this one, the game state at this node, the number of times the node has been visited, its score, and so on; see the example code on the first website above for more detail. Every node holds a game state (which contains all the information about the current game), and its children represent the states reached after the next move. The states are effectively "mounted" on the nodes, so the tree of nodes contains many game states and the transitions between them.

UCT starts by creating a node v0 from the current game state s0. One thing to watch out for in Python: s0 must be a copy of the game state. The game state is usually an object, and you cannot pass the original object directly when creating v0 (what actually gets passed is a reference, not a fresh copy of the object), because s0 becomes part of v0 and any change to s0 would also change the original game state, which would break everything. What you have to do is copy the object and pass that copy as s0 to create v0; then even if s0 changes inside v0, the original game state is unaffected. To get a proper copy you basically have to write a clone method for the class yourself (see the example code at resource (1)); Python's copy.deepcopy did not work well for me. The "within computational budget" in the figure is just the number of iterations: the more, the smarter the play.

Then the tree policy (TreePolicy) runs. If the current node is not a terminal node (a terminal node's state is the final state of the game, with the winner decided): if the node is not fully expanded, i.e. its children do not yet cover all possible next states, add a child for one of the states not yet covered, move to that newly added child, and run DefaultPolicy from it; if the node is fully expanded, use the UCB formula to select the best child and move to it, repeating the selection until reaching a node that is not fully expanded, then expand that node and run DefaultPolicy on the child just added.

DefaultPolicy is a fast playout (the "rollout" mentioned in many places): keep making random moves until the game ends and a winner is known. DefaultPolicy is given the child node's game state, and this state should again be a copy, because the rollout is only a fast random simulation and should not change the state stored in the child.

Once the result is known, Backup feeds it back from the node where the simulation started all the way up to v0. The feedback includes the visit count and the reward, and the key point is that the reward differs between levels of the tree. Say node A, where the rollout started, holds the state right after player 1 has moved; its parent B then holds the state right after player 2 moved (player 2's move produced state B, then player 1's move turned B into one of its children, A). If the rollout from A ends with player 1 winning, then every node from v0 down to A gets its visit count incremented by 1, and A's reward should be positive (A is the position reached by the move player 1 made, and that position led to player 1 winning). For A's parent B, though, B represents the move player 2 made to turn B's parent's state into B, and that move lost the game for player 2, so at B the visit count still increases by 1 but the reward should be zero or negative; B's parent should get a positive reward again, B's grandparent zero or negative, and so on.

The result is that when BestChild is picked with the UCB formula, the move chosen is always the one most favourable to the side whose turn it is. It is like playing against someone whose strength you do not know: you assume every move they make is the best one for them, and at each of your own turns you choose the move that best blocks them from winning (the one best for you). This is really the minimax idea, and with enough simulations it comes close to the optimal move; I will not go deeper into that here.

When Backup finishes, one iteration is done, and the next one works the same way, again starting from v0: select, expand, roll out, back up. With enough iterations the tree grows large and deep, and more and more children reach states where the game is over. After all the iterations, what is returned is the move of v_best, the best of v0's children (for "best", some implementations use the win rate and some use the visit count; I have not studied this in depth, but in most cases they agree). That move is the one that most helps the side to move to win, and playing it leads to v_best's game state. That is my understanding of UCT.
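To make the loop above concrete, here is a minimal, game-agnostic sketch of one UCT iteration (selection, expansion, rollout, backpropagation), using the same UCB1 score that appears in the programs below: wins/visits + sqrt(Cp * ln(parent visits) / visits). The GameState methods it relies on (Clone, DoMove, get_moves, result_for, player_just_moved) are hypothetical stand-ins for the SiqiGame interface further down; this only illustrates the structure, it is not the code actually used in the two programs.

import random
from math import sqrt, log

class UCTNode:
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children = []
        self.untried = state.get_moves()      # moves not yet expanded into children
        self.wins, self.visits = 0.0, 0

    def ucb_child(self, Cp=2):
        # UCB1: exploitation (win rate) plus an exploration bonus for rarely visited children
        return max(self.children,
                   key=lambda c: c.wins / c.visits + sqrt(Cp * log(self.visits) / c.visits))

def uct_search(root_state, itermax, Cp=2):
    root = UCTNode(root_state.Clone())        # v0 is built from a *copy* of the state
    for _ in range(itermax):
        node = root
        # 1. Selection: walk down through fully expanded, non-terminal nodes
        while not node.untried and node.children:
            node = node.ucb_child(Cp)
        # 2. Expansion: add one child for a move that has not been tried yet
        if node.untried:
            move = random.choice(node.untried)
            state = node.state.Clone()
            state.DoMove(move)
            child = UCTNode(state, parent=node, move=move)
            node.untried.remove(move)
            node.children.append(child)
            node = child
        # 3. Rollout (default policy): random play on yet another copy of the state
        rollout = node.state.Clone()
        while rollout.get_moves():
            rollout.DoMove(random.choice(rollout.get_moves()))
        # 4. Backpropagation: score the result from the viewpoint of the side that
        #    just moved into `node`, flipping the sign at every level on the way up
        result = rollout.result_for(node.state.player_just_moved)
        while node is not None:
            node.visits += 1
            node.wins += result
            result = -result
            node = node.parent
    return max(root.children, key=lambda c: c.visits).move   # most-visited child

Whether the final move is taken from the most-visited child or from the child with the highest win rate is exactly the choice discussed above; the programs below use the visit count.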

Below is the code for my two programs. The first is the more polished machine self-play program; it differs quite a bit from the two-player "走四棋儿" game in my earlier post. The second is the human-vs-machine program adapted from the self-play one. I am a programming novice and the code is fairly long-winded; the UCT part is the main thing to look at, and I hope it gives beginners a little inspiration. To run it you need to download the game assets: https://download.csdn.net/download/zugexiaodui/10805748

The program freezes while it is thinking…

import pygame as pg
from pygame.locals import *
import sys
import time
import numpy as np
import random
import os
from math import sqrt, log

'''
Machine self-play version.
Uses the Monte Carlo method; it works reasonably well.
Note: the reward must be set with opposite signs for the two sides, and you have to consider what happens
when one side's pieces are boxed in and cannot move.
player_move_flag would be easier to read as player_just_moved_flag or cpu_just_moved_flag.
Added handling for the case where a side cannot move (that side loses) and for draws (more than a certain
number of moves without a winner).
Directions for improvement: speed up the simulations; make the selection care more about immediate gains;
build a human-vs-machine version.
Key parameters: reward (a capture part and a win part, affecting exploration vs exploitation) must be set
sensibly; Cp (balances exploration and exploitation); how UCT picks the best node (most visits or highest
win rate); in my tests UCT still beats my own MC_Method (both with 800 simulations); the number of
simulations matters a lot.
'''
pg.init()
size = width, height = 600, 400
screen = pg.display.set_mode(size)
f_clock = pg.time.Clock()
fps = 30

pg.display.set_caption("走四棋儿")
pg.display.set_icon(pg.image.load("icon.png").convert_alpha())
background = pg.image.load("background.png").convert_alpha()
pixel_pos = [(90, 40), (190, 40), (290, 40), (390, 40),
             (90, 140), (190, 140), (290, 140), (390, 140),
             (90, 240), (190, 240), (290, 240), (390, 240),
             (90, 340), (190, 340), (290, 340), (390, 340)]
random.seed(8)  # fix the random seed: with the same seed, every run of the program follows the same sequence
file_name = 'Chess_score.txt'
f = open(file_name, 'w')
ITER_MAX = 8000
PLAYER_WINS, CPU_WINS, DRAWS = 0, 0, 0

class PlayerPiece():
    def __init__(self):
        self.id = 2
        self.img = pg.image.load("spade.png").convert_alpha()
        self.rect = self.img.get_rect()
        self.pos = -1
        self.alive_state = True
        self.selected = False
        # self.legal_moves = []

    def piece_update(self):
        if self.alive_state == True:
            self.rect[0] = pixel_pos[self.pos][0]
            self.rect[1] = pixel_pos[self.pos][1]
            screen.blit(self.img, self.rect)


class CPUPiece():
    def __init__(self):
        self.id = 3
        self.img = pg.image.load("heart.png").convert_alpha()
        self.rect = self.img.get_rect()
        self.pos = -1
        self.alive_state = True
        # self.legal_moves = []

    def piece_update(self):
        if self.alive_state == True:
            self.rect[0] = pixel_pos[self.pos][0]
            self.rect[1] = pixel_pos[self.pos][1]
            screen.blit(self.img, self.rect)


class SiqiGame():
    game_num = 0

    def __init__(self):
        self.game_num += 1
        self.cpu_piece = [None] * 4
        self.player_piece = [None] * 4
        for i in range(4):
            self.cpu_piece[i] = CPUPiece()
            self.player_piece[i] = PlayerPiece()
            self.cpu_piece[i].pos = i
            self.player_piece[i].pos = 12 + i
        self.player_move_flag = random.choice([True, False])
        # self.is_player_selecting_piece = False
        # self.player_select = None
        # self.cpu_select = None
        self.piece_just_moved = [-1, -1]
        self.round = 0
        self.glb_pos = np.array([3, 3, 3, 3,
                                 0, 0, 0, 0,
                                 0, 0, 0, 0,
                                 2, 2, 2, 2], dtype=np.uint8)  # the board must be initialized exactly like this
        # # print("New game " + str(self.game_num) + " starts!")

    def Clone(self):
        s = SiqiGame()
        s.player_move_flag = self.player_move_flag
        s.piece_just_moved = self.piece_just_moved[:]
        s.round = self.round
        s.glb_pos = self.glb_pos.copy()  # real copy: slicing a NumPy array would only give a view of the same board
        for i in range(4):
            s.cpu_piece[i].pos = self.cpu_piece[i].pos
            s.cpu_piece[i].alive_state = self.cpu_piece[i].alive_state
            s.player_piece[i].pos = self.player_piece[i].pos
            s.player_piece[i].alive_state = self.player_piece[i].alive_state
        return s

    def check_action_legal(self, selected_piece, action):
        pos_legal = False
        if 0 <= (selected_piece.pos + action) <= 15 and self.glb_pos[selected_piece.pos + action] == 0:
            if action == 1:
                if selected_piece.pos % 4 < 3:
                    pos_legal = True
            elif action == -1:
                if selected_piece.pos % 4 > 0:
                    pos_legal = True
            elif action == 4:
                if selected_piece.pos // 4 < 3:
                    pos_legal = True
            elif action == -4:
                if selected_piece.pos // 4 > 0:
                    pos_legal = True
        return pos_legal

    def get_pieces_legal_moves(self):
        self.get_situation()
        untried_moves = []
        cpu_player_pieces = None
        if self.player_move_flag == True:
            cpu_player_pieces = self.player_piece
        else:
            cpu_player_pieces = self.cpu_piece
        for i in range(4):
            # cpu_player_pieces[i].legal_moves = []
            if cpu_player_pieces[i].alive_state == True:
                for action in [-1, 1, -4, 4]:
                    if self.check_action_legal(cpu_player_pieces[i], action) == True:
                        # cpu_player_pieces[i].legal_moves.append(action)
                        untried_moves.append((i, action))
        return untried_moves

    def DoMove(self, piece_move):
        (i_piece, move) = piece_move
        if self.player_move_flag == True:
            self.player_piece[i_piece].pos += move
            self.piece_just_moved = [i_piece, move]
            self.check_situation()
            self.player_move_flag = False
        else:
            self.cpu_piece[i_piece].pos += move
            self.piece_just_moved = [i_piece, move]
            self.check_situation()
            self.player_move_flag = True
        self.get_situation(True)

    def get_situation(self, round_plus=False):
        if self.player_move_flag and round_plus:
            self.round += 1
            # print("Round" + str(self.round), "player moves.")
        elif (not self.player_move_flag) and round_plus:
            self.round += 1
            # print("Round" + str(self.round), "CPU moves.")
        self.glb_pos = np.zeros([16], dtype=np.uint8)
        for i in range(4):
            if self.player_piece[i].alive_state == True:
                self.glb_pos[self.player_piece[i].pos] = self.player_piece[i].id
            if self.cpu_piece[i].alive_state == True:
                self.glb_pos[self.cpu_piece[i].pos] = self.cpu_piece[i].id
        # for i in range(4):
        #     print(self.glb_pos[i * 4:(i + 1) * 4])
        # print("=" * 9)

    def check_situation(self):
        self.get_situation()
        enemy_die = False
        piece_just_moved = None
        reward = 0
        if self.player_move_flag:
            piece_just_moved = self.player_piece[self.piece_just_moved[0]]
        else:
            piece_just_moved = self.cpu_piece[self.piece_just_moved[0]]
        if piece_just_moved != None:
            pos_row, pos_col = piece_just_moved.pos // 4, piece_just_moved.pos % 4
            pos_row_all = self.glb_pos[pos_row * 4:(pos_row + 1) * 4]
            pos_col_all = self.glb_pos[pos_col::4]
            if piece_just_moved.id == self.player_piece[0].id:
                if np.sum(pos_row_all) == 2 + 2 + 3:
                    if (pos_row_all == np.array([0, 2, 2, 3])).all():
                        for cpu_piece_i in self.cpu_piece:
                            if cpu_piece_i.pos == pos_row * 4 + 3 and cpu_piece_i.alive_state:
                                cpu_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_row_all == np.array([2, 2, 3, 0])).all():
                        for cpu_piece_i in self.cpu_piece:
                            if cpu_piece_i.pos == pos_row * 4 + 2 and cpu_piece_i.alive_state:
                                cpu_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_row_all == np.array([3, 2, 2, 0])).all():
                        for cpu_piece_i in self.cpu_piece:
                            if cpu_piece_i.pos == pos_row * 4 + 0 and cpu_piece_i.alive_state:
                                cpu_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_row_all == np.array([0, 3, 2, 2])).all():
                        for cpu_piece_i in self.cpu_piece:
                            if cpu_piece_i.pos == pos_row * 4 + 1 and cpu_piece_i.alive_state:
                                cpu_piece_i.alive_state = False
                                enemy_die = True
                if np.sum(pos_col_all) == 2 + 2 + 3:
                    if (pos_col_all == np.array([0, 2, 2, 3])).all():
                        for cpu_piece_i in self.cpu_piece:
                            if cpu_piece_i.pos == 4 * 3 + pos_col and cpu_piece_i.alive_state:
                                cpu_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_col_all == np.array([2, 2, 3, 0])).all():
                        for cpu_piece_i in self.cpu_piece:
                            if cpu_piece_i.pos == 4 * 2 + pos_col and cpu_piece_i.alive_state:
                                cpu_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_col_all == np.array([3, 2, 2, 0])).all():
                        for cpu_piece_i in self.cpu_piece:
                            if cpu_piece_i.pos == 4 * 0 + pos_col and cpu_piece_i.alive_state:
                                cpu_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_col_all == np.array([0, 3, 2, 2])).all():
                        for cpu_piece_i in self.cpu_piece:
                            if cpu_piece_i.pos == 4 * 1 + pos_col and cpu_piece_i.alive_state:
                                cpu_piece_i.alive_state = False
                                enemy_die = True
                if enemy_die == True:
                    reward = 0.8  # todo: the reward could be tied to round so the policy favours earlier captures
            elif piece_just_moved.id == self.cpu_piece[0].id:
                if np.sum(pos_row_all) == 3 + 3 + 2:
                    if (pos_row_all == np.array([0, 3, 3, 2])).all():
                        for player_piece_i in self.player_piece:
                            if player_piece_i.pos == pos_row * 4 + 3:
                                player_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_row_all == np.array([3, 3, 2, 0])).all():
                        for player_piece_i in self.player_piece:
                            if player_piece_i.pos == pos_row * 4 + 2:
                                player_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_row_all == np.array([0, 2, 3, 3])).all():
                        for player_piece_i in self.player_piece:
                            if player_piece_i.pos == pos_row * 4 + 1:
                                player_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_row_all == np.array([2, 3, 3, 0])).all():
                        for player_piece_i in self.player_piece:
                            if player_piece_i.pos == pos_row * 4 + 0:
                                player_piece_i.alive_state = False
                                enemy_die = True
                if np.sum(pos_col_all) == 3 + 3 + 2:
                    if (pos_col_all == np.array([0, 3, 3, 2])).all():
                        for player_piece_i in self.player_piece:
                            if player_piece_i.pos == 4 * 3 + pos_col:
                                player_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_col_all == np.array([3, 3, 2, 0])).all():
                        for player_piece_i in self.player_piece:
                            if player_piece_i.pos == 4 * 2 + pos_col:
                                player_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_col_all == np.array([2, 3, 3, 0])).all():
                        for player_piece_i in self.player_piece:
                            if player_piece_i.pos == 4 * 0 + pos_col:
                                player_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_col_all == np.array([0, 2, 3, 3])).all():
                        for player_piece_i in self.player_piece:
                            if player_piece_i.pos == 4 * 1 + pos_col:
                                player_piece_i.alive_state = False
                                enemy_die = True
                if enemy_die == True:
                    reward = -0.8  # todo: the reward could be tied to round so the policy favours earlier captures
        return reward

    def check_game_over(self):
        player_pieces_alive_num, cpu_pieces_alive_num = 0, 0
        game_over = False
        reward = 0
        if not self.player_move_flag:  # todo: note that the flag has already been flipped by the move
            for cpu_piece_i in self.cpu_piece:
                if cpu_piece_i.alive_state:
                    cpu_pieces_alive_num += 1
            if cpu_pieces_alive_num <= 1:
                game_over = True
                reward = 1
            else:
                if self.get_pieces_legal_moves() == []:
                    game_over = True
                    reward = 1
        else:
            for player_piece_i in self.player_piece:
                if player_piece_i.alive_state:
                    player_pieces_alive_num += 1
            if player_pieces_alive_num <= 1:
                game_over = True
                reward = -1
            else:
                if self.get_pieces_legal_moves() == []:
                    game_over = True
                    reward = -1
        return game_over, reward

    def restart(self):
        del self.cpu_piece
        del self.player_piece
        self.__init__()

    def update_screen(self):
        screen.blit(background, (0, 0))
        for i in range(4):
            self.player_piece[i].piece_update()
            self.cpu_piece[i].piece_update()
        pg.display.update()

    def game_main(self):
        if self.player_move_flag:
            policy = UCTSearch((self), ITER_MAX)
            # policy = MC_Method((self),ITER_MAX)
            self.DoMove(policy)
        else:
            # if self.get_pieces_legal_moves() != []:
            #     self.DoMove(random.choice(self.get_pieces_legal_moves()))
            policy = UCTSearch((self), ITER_MAX // 2)
            # policy = MC_Method((self),ITER_MAX)
            self.DoMove(policy)
        print("+++ Round " + str(self.round) + " +++")
        f.write("+++ Round " + str(self.round) + " +++\n")
        for i in range(4):
            print(self.glb_pos[i * 4:(i + 1) * 4])
            f.write(str(self.glb_pos[i * 4:(i + 1) * 4]) + '\n')
        print("=" * 9)
        self.update_screen()
        f.write("=" * 9 + '\n')

        if self.check_game_over()[0]:
            if self.check_game_over()[1] > 0:
                global PLAYER_WINS
                PLAYER_WINS += 1
            else:
                global CPU_WINS
                CPU_WINS += 1
            print("*" * 5 + " PLAYER_WINS:" + str(PLAYER_WINS), end='')
            print("---CPU_WINS:" + str(CPU_WINS) + ' ' + "*" * 5 + '\n')
            f.write(
                "*" * 5 + " PLAYER_WINS:" + str(PLAYER_WINS) + "---CPU_WINS:" + str(CPU_WINS) + ' ' + '*' * 5 + '\n\n')
            self.update_screen()
            time.sleep(2)

            if PLAYER_WINS + CPU_WINS >= 7:
                print("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS))
                f.write("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS) + '\n')
                f.close()
                sys.exit()
            self.restart()

        else:
            if self.round >= 150:
                print("*" * 5 + " Draw! No one wins! " + "*" * 5)
                global DRAWS
                DRAWS += 1
                time.sleep(2)
                if PLAYER_WINS + CPU_WINS >= 50:
                    print("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS) + " DRAWS:" + str(DRAWS))
                    f.write("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS) + " DRAWS:" + str(DRAWS) + '\n')
                    f.close()
                    sys.exit()
                self.restart()
        self.update_screen()


class Node():
    def __init__(self, parent=None, game_state=None, move=None):
        self.parentNode = parent
        self.game_state = game_state
        self.move = move
        self.childNodes = []
        self.wins = 0
        self.visits = 0
        self.untriedMoves = self.game_state.get_pieces_legal_moves()

    def AddChild(self, game_state, move):
        n = Node(parent=self, game_state=game_state, move=move)
        self.childNodes.append(n)
        self.untriedMoves.remove(move)
        return n

    def UCTSelect(self, Cp=2):
        try:
            s = sorted(self.childNodes, key=lambda c: c.wins / c.visits + sqrt(Cp * log(self.visits) / c.visits))[-1]
        except:
            print(self.childNodes, self.game_state.player_move_flag)
            print(self.game_state.glb_pos.reshape([4, 4]))
            assert False, "UCT_SELECT"
        return s

    def Update(self, result):
        self.visits += 1
        if self.game_state.player_move_flag:
            # todo: note that Update runs after DoMove, so player_move_flag == True means this node's move was made by the CPU
            self.wins -= result  # rewards are positive for the player and negative for the CPU, so the CPU's result is negated
        else:
            self.wins += result

    def CheckFullyExpanded(self):
        fully_expanded = False
        if self.untriedMoves == []:  # what if a side has not lost yet but has no piece that can move?
            # if self.childNodes != []:
            fully_expanded = True
        return fully_expanded


def UCTSearch(root_state, itermax):
    '''
    Try both: returning the child with the highest win rate, or the one with the most visits.
    :param root_state:
    :param itermax:
    :return:
    '''
    state = root_state.Clone()
    root_node = Node(game_state=state)
    for i in range(itermax):
        # state = root_state.Clone()
        node_l = TreePolicy(root_node)
        delta = DefaultPolicy(node_l.game_state.Clone())
        BackUp(node_l, delta)
    for n in root_node.childNodes:
        print(n.move, n.wins, '/', n.visits, '=', n.wins / n.visits)  # print the statistics of each candidate move
    # return BestChild(root_node, 0).move # todo: pick the child with the highest win rate
    return sorted(root_node.childNodes, key=lambda c: c.visits)[-1].move  # todo: pick the most-visited child


def TreePolicy(node):
    while not node.game_state.check_game_over()[0]:
        if not node.CheckFullyExpanded():
            c_node = Expand(node)
            return c_node
        else:
            node = BestChild(node)
    return node


def Expand(node):
    if node.untriedMoves != []:
        a = random.choice(node.untriedMoves)
        state = node.game_state.Clone()
        state.DoMove(a)
        n = node.AddChild(game_state=state, move=a)
        return n
    else:
        print(node.untriedMoves)
        print(node.game_state.glb_pos)
        assert False, "node.untriedMoves == []"


def BestChild(node, Cp=2):
    '''
    Cp matters a lot: it balances exploration against exploitation.
    :param node:
    :param Cp:
    :return:
    '''
    return node.UCTSelect(Cp)


def DefaultPolicy(game_state):
    '''
    Run the simulation and compute the reward; the reward should have a capture part and a win/loss part.
    :param game_state:
    :return:
    '''
    reward = 0
    reward += game_state.check_situation()
    over, r = game_state.check_game_over()
    reward += r
    while not over and game_state.get_pieces_legal_moves() != []:
        game_state.DoMove(random.choice(game_state.get_pieces_legal_moves()))
        over, r = game_state.check_game_over()
        reward += r
    return reward


def BackUp(node, delta):
    while node != None:
        node.Update(delta)
        node = node.parentNode


def MC_Method(root_state, itermax):
    '''
    First generate all the children of the root, then simulate each child an equal number of times.
    Works worse than UCT.
    :param root_state:
    :param itermax:
    :return: best_child.move
    '''
    state = root_state.Clone()
    root_node = Node(game_state=state)
    for m in root_node.untriedMoves[:]:
        s = state.Clone()
        s.DoMove(m)
        root_node.AddChild(s, m)
    assert root_node.untriedMoves == [], "NOT ALL!"
    iter_n = 0
    while iter_n < itermax:  # todo: keep the total number of simulations equal to itermax, the same as the UCT method
        for child_node in root_node.childNodes:
            r = DefaultPolicy(child_node.game_state.Clone())
            child_node.Update(r)
            iter_n += 1
            if iter_n >= itermax:
                break
    for n in root_node.childNodes:
        print(n.move, n.wins, '/', n.visits, '=', n.wins / n.visits)
    best_child = sorted(root_node.childNodes, key=lambda c: c.wins / c.visits)[-1]
    return best_child.move


game = SiqiGame()

while True:
    for event in pg.event.get():
        if event.type == pg.QUIT:
            print("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS))
            f.write("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS) + '\n')
            f.close()
            sys.exit()
    game.game_main()
    f_clock.tick(fps)

Below is the human-vs-machine version; the program is frozen while the machine is thinking.

import pygame as pg
from pygame.locals import *
import sys
import time
import numpy as np
import random
import os
from math import sqrt, log
'''
A human-vs-machine version to test how well UCT plays.
With 8000 simulations per move its playing strength is already quite good.
'''
pg.init()
size = width, height = 600, 400
screen = pg.display.set_mode(size)
f_clock = pg.time.Clock()
fps = 30

pg.display.set_caption("走四棋儿")
pg.display.set_icon(pg.image.load("icon.png").convert_alpha())
background = pg.image.load("background.png").convert_alpha()
pixel_pos = [(90, 40), (190, 40), (290, 40), (390, 40),
             (90, 140), (190, 140), (290, 140), (390, 140),
             (90, 240), (190, 240), (290, 240), (390, 240),
             (90, 340), (190, 340), (290, 340), (390, 340)]
random.seed(8)  # fix the random seed: with the same seed, every run of the program follows the same sequence
file_name = 'Chess_score5.txt'
f = open(file_name, 'w')
ITER_MAX = 8000
PLAYER_WINS, CPU_WINS = 0, 0

class PlayerPiece():
    def __init__(self):
        self.id = 2
        self.img = pg.image.load("spade.png").convert_alpha()
        self.rect = self.img.get_rect()
        self.pos = -1
        self.alive_state = True
        self.selected = False
        # self.legal_moves = []

    def piece_update(self):
        if self.alive_state == True:
            self.rect[0] = pixel_pos[self.pos][0]
            self.rect[1] = pixel_pos[self.pos][1]
            screen.blit(self.img, self.rect)


class CPUPiece():
    def __init__(self):
        self.id = 3
        self.img = pg.image.load("heart.png").convert_alpha()
        self.rect = self.img.get_rect()
        self.pos = -1
        self.alive_state = True
        # self.legal_moves = []

    def piece_update(self):
        if self.alive_state == True:
            self.rect[0] = pixel_pos[self.pos][0]
            self.rect[1] = pixel_pos[self.pos][1]
            screen.blit(self.img, self.rect)


class SiqiGame():
    game_num = 0

    def __init__(self):
        self.game_num += 1
        self.cpu_piece = [None] * 4
        self.player_piece = [None] * 4
        for i in range(4):
            self.cpu_piece[i] = CPUPiece()
            self.player_piece[i] = PlayerPiece()
            self.cpu_piece[i].pos = i
            self.player_piece[i].pos = 12 + i
        self.player_move_flag = random.choice([True, False])
        self.__is_player_selecting_piece = False
        self.__player_select = None
        # self.cpu_select = None
        self.piece_just_moved = [-1, -1]
        self.round = 0
        self.glb_pos = np.array([3, 3, 3, 3,
                                 0, 0, 0, 0,
                                 0, 0, 0, 0,
                                 2, 2, 2, 2], dtype=np.uint8)  # the board must be initialized exactly like this
        # # print("New game " + str(self.game_num) + " starts!")

    def Clone(self):
        s = SiqiGame()
        s.player_move_flag = self.player_move_flag
        s.piece_just_moved = self.piece_just_moved[:]
        s.round = self.round
        s.glb_pos = self.glb_pos.copy()  # real copy: slicing a NumPy array would only give a view of the same board
        for i in range(4):
            s.cpu_piece[i].pos = self.cpu_piece[i].pos
            s.cpu_piece[i].alive_state = self.cpu_piece[i].alive_state
            s.player_piece[i].pos = self.player_piece[i].pos
            s.player_piece[i].alive_state = self.player_piece[i].alive_state
        return s

    def check_action_legal(self, selected_piece, action):
        pos_legal = False
        if 0 <= (selected_piece.pos + action) <= 15 and self.glb_pos[selected_piece.pos + action] == 0:
            if action == 1:
                if selected_piece.pos % 4 < 3:
                    pos_legal = True
            elif action == -1:
                if selected_piece.pos % 4 > 0:
                    pos_legal = True
            elif action == 4:
                if selected_piece.pos // 4 < 3:
                    pos_legal = True
            elif action == -4:
                if selected_piece.pos // 4 > 0:
                    pos_legal = True
        return pos_legal

    def get_pieces_legal_moves(self):
        self.get_situation()
        untried_moves = []
        cpu_player_pieces = None
        if self.player_move_flag == True:
            cpu_player_pieces = self.player_piece
        else:
            cpu_player_pieces = self.cpu_piece
        for i in range(4):
            # cpu_player_pieces[i].legal_moves = []
            if cpu_player_pieces[i].alive_state == True:
                for action in [-1, 1, -4, 4]:
                    if self.check_action_legal(cpu_player_pieces[i], action) == True:
                        # cpu_player_pieces[i].legal_moves.append(action)
                        untried_moves.append((i, action))
        return untried_moves

    def __check_click(self, c_x, c_y):
        self.__player_select = None
        for i in range(4):
            if self.player_piece[i].alive_state and self.player_piece[i].rect.collidepoint(c_x, c_y):
                # self.player_piece[i].selected = True
                self.__player_select = self.player_piece[i]

    def __player_piece_moves(self, c_x, c_y):
        move_successfully = False
        piece_exist = False
        action = 0
        if self.__player_select != None:
            for cpu_piece_i in self.cpu_piece:
                if cpu_piece_i.rect.collidepoint(c_x, c_y) and cpu_piece_i.alive_state:
                    piece_exist = True
                    self.__is_player_selecting_piece = False
            for i in range(4):
                if self.player_piece[i].rect.collidepoint(c_x, c_y) and self.player_piece[i].alive_state:
                    self.__player_select = self.player_piece[i]
            if piece_exist == False:
                delta_y, delta_x = c_y - self.__player_select.rect[1], c_x - self.__player_select.rect[0]
                if 80 <= abs(delta_x) <= 120 and abs(delta_y) <= 20:
                    if delta_x < 0:
                        if self.__player_select.pos % 4 > 0:
                            self.__player_select.pos -= 1
                            action = -1
                            move_successfully = True
                    else:
                        if self.__player_select.pos % 4 < 3:
                            self.__player_select.pos += 1
                            action = 1
                            move_successfully = True
                if 80 <= abs(delta_y) <= 120 and abs(delta_x) <= 20:
                    if delta_y < 0:
                        if self.__player_select.pos > 3:
                            self.__player_select.pos -= 4
                            action = -4
                            move_successfully = True
                    else:
                        if self.__player_select.pos < 12:
                            self.__player_select.pos += 4
                            action = 4
                            move_successfully = True
            if move_successfully:
                return action
        return 0  # fall through: no move was made (also covers the case where nothing is selected)

    def __PlayerMove(self):
        if event.type == pg.MOUSEBUTTONDOWN:
            c_x, c_y = pg.mouse.get_pos()
            if self.__is_player_selecting_piece == False:
                self.__check_click(c_x, c_y)
                if self.__player_select != None:
                    self.__is_player_selecting_piece = True
            else:
                move = self.__player_piece_moves(c_x, c_y)
                if move!=0:
                    for i in range(4):
                        if self.__player_select == self.player_piece[i]:
                            i_piece = i
                    piece_move = (i_piece,move)
                    self.piece_just_moved = piece_move
                    self.check_situation()
                    self.player_move_flag = False
                    self.get_situation(True,True)

    def DoMove(self, piece_move):
        (i_piece, move) = piece_move
        if self.player_move_flag:
            self.player_piece[i_piece].pos += move
            self.piece_just_moved = [i_piece, move]
            self.check_situation()
            self.player_move_flag = False
        else:
            self.cpu_piece[i_piece].pos += move
            self.piece_just_moved = [i_piece, move]
            self.check_situation()
            self.player_move_flag = True
        self.get_situation(True)

    def __CPUMove(self,piece_move):
        (i_piece, move) = piece_move
        # if not self.player_move_flag:
        self.cpu_piece[i_piece].pos += move
        self.piece_just_moved = [i_piece, move]
        self.check_situation()
        self.player_move_flag = True
        self.get_situation(True,True)


    def get_situation(self, round_plus=False,print_situation = False):
        if self.player_move_flag and round_plus:
            self.round += 1
            # print("Round" + str(self.round), "player moves.")
        elif (not self.player_move_flag) and round_plus:
            self.round += 1
            # print("Round" + str(self.round), "CPU moves.")
        self.glb_pos = np.zeros([16], dtype=np.uint8)
        for i in range(4):
            if self.player_piece[i].alive_state == True:
                self.glb_pos[self.player_piece[i].pos] = self.player_piece[i].id
            if self.cpu_piece[i].alive_state == True:
                self.glb_pos[self.cpu_piece[i].pos] = self.cpu_piece[i].id
        if print_situation:
            for i in range(4):
                print(self.glb_pos[i * 4:(i + 1) * 4])
            print("=" * 9)

    def check_situation(self):
        self.get_situation()
        enemy_die = False
        piece_just_moved = None
        reward = 0
        if self.player_move_flag:
            piece_just_moved = self.player_piece[self.piece_just_moved[0]]
        else:
            piece_just_moved = self.cpu_piece[self.piece_just_moved[0]]
        if piece_just_moved != None:
            pos_row, pos_col = piece_just_moved.pos // 4, piece_just_moved.pos % 4
            pos_row_all = self.glb_pos[pos_row * 4:(pos_row + 1) * 4]
            pos_col_all = self.glb_pos[pos_col::4]
            if piece_just_moved.id == self.player_piece[0].id:
                if np.sum(pos_row_all) == 2 + 2 + 3:
                    if (pos_row_all == np.array([0, 2, 2, 3])).all():
                        for cpu_piece_i in self.cpu_piece:
                            if cpu_piece_i.pos == pos_row * 4 + 3 and cpu_piece_i.alive_state:
                                cpu_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_row_all == np.array([2, 2, 3, 0])).all():
                        for cpu_piece_i in self.cpu_piece:
                            if cpu_piece_i.pos == pos_row * 4 + 2 and cpu_piece_i.alive_state:
                                cpu_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_row_all == np.array([3, 2, 2, 0])).all():
                        for cpu_piece_i in self.cpu_piece:
                            if cpu_piece_i.pos == pos_row * 4 + 0 and cpu_piece_i.alive_state:
                                cpu_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_row_all == np.array([0, 3, 2, 2])).all():
                        for cpu_piece_i in self.cpu_piece:
                            if cpu_piece_i.pos == pos_row * 4 + 1 and cpu_piece_i.alive_state:
                                cpu_piece_i.alive_state = False
                                enemy_die = True
                if np.sum(pos_col_all) == 2 + 2 + 3:
                    if (pos_col_all == np.array([0, 2, 2, 3])).all():
                        for cpu_piece_i in self.cpu_piece:
                            if cpu_piece_i.pos == 4 * 3 + pos_col and cpu_piece_i.alive_state:
                                cpu_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_col_all == np.array([2, 2, 3, 0])).all():
                        for cpu_piece_i in self.cpu_piece:
                            if cpu_piece_i.pos == 4 * 2 + pos_col and cpu_piece_i.alive_state:
                                cpu_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_col_all == np.array([3, 2, 2, 0])).all():
                        for cpu_piece_i in self.cpu_piece:
                            if cpu_piece_i.pos == 4 * 0 + pos_col and cpu_piece_i.alive_state:
                                cpu_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_col_all == np.array([0, 3, 2, 2])).all():
                        for cpu_piece_i in self.cpu_piece:
                            if cpu_piece_i.pos == 4 * 1 + pos_col and cpu_piece_i.alive_state:
                                cpu_piece_i.alive_state = False
                                enemy_die = True
                if enemy_die == True:
                    reward = 0.8  # todo: the reward could be tied to round so the policy favours earlier captures
            elif piece_just_moved.id == self.cpu_piece[0].id:
                if np.sum(pos_row_all) == 3 + 3 + 2:
                    if (pos_row_all == np.array([0, 3, 3, 2])).all():
                        for player_piece_i in self.player_piece:
                            if player_piece_i.pos == pos_row * 4 + 3:
                                player_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_row_all == np.array([3, 3, 2, 0])).all():
                        for player_piece_i in self.player_piece:
                            if player_piece_i.pos == pos_row * 4 + 2:
                                player_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_row_all == np.array([0, 2, 3, 3])).all():
                        for player_piece_i in self.player_piece:
                            if player_piece_i.pos == pos_row * 4 + 1:
                                player_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_row_all == np.array([2, 3, 3, 0])).all():
                        for player_piece_i in self.player_piece:
                            if player_piece_i.pos == pos_row * 4 + 0:
                                player_piece_i.alive_state = False
                                enemy_die = True
                if np.sum(pos_col_all) == 3 + 3 + 2:
                    if (pos_col_all == np.array([0, 3, 3, 2])).all():
                        for player_piece_i in self.player_piece:
                            if player_piece_i.pos == 4 * 3 + pos_col:
                                player_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_col_all == np.array([3, 3, 2, 0])).all():
                        for player_piece_i in self.player_piece:
                            if player_piece_i.pos == 4 * 2 + pos_col:
                                player_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_col_all == np.array([2, 3, 3, 0])).all():
                        for player_piece_i in self.player_piece:
                            if player_piece_i.pos == 4 * 0 + pos_col:
                                player_piece_i.alive_state = False
                                enemy_die = True
                    elif (pos_col_all == np.array([0, 2, 3, 3])).all():
                        for player_piece_i in self.player_piece:
                            if player_piece_i.pos == 4 * 1 + pos_col:
                                player_piece_i.alive_state = False
                                enemy_die = True
                if enemy_die == True:
                    reward = -0.8  # todo: the reward could be tied to round so the policy favours earlier captures
        return reward

    def check_game_over(self):
        player_pieces_alive_num, cpu_pieces_alive_num = 0, 0
        game_over = False
        reward = 0
        if not self.player_move_flag:  # the flag has already been flipped by DoMove
            for cpu_piece_i in self.cpu_piece:
                if cpu_piece_i.alive_state:
                    cpu_pieces_alive_num += 1
            if cpu_pieces_alive_num <= 1:
                game_over = True
                reward = 1
            else:
                if self.get_pieces_legal_moves()==[]:
                    game_over = True
                    reward = 1
        else:
            for player_piece_i in self.player_piece:
                if player_piece_i.alive_state:
                    player_pieces_alive_num += 1
            if player_pieces_alive_num <= 1:
                game_over = True
                reward = -1
            else:
                if self.get_pieces_legal_moves()==[]:
                    game_over = True
                    reward = -1
        return game_over, reward

    def restart(self):
        del self.cpu_piece
        del self.player_piece
        self.__init__()

    def update_screen(self):
        screen.blit(background, (0, 0))
        for i in range(4):
            self.player_piece[i].piece_update()
            self.cpu_piece[i].piece_update()
        pg.display.update()

    def game_main(self):
        if self.player_move_flag:
            self.__PlayerMove()
        else:
            policy = UCTSearch((self), ITER_MAX)
            # policy = MC_Method((self),ITER_MAX)
            self.__CPUMove(policy)

        # print("+++ Round " + str(self.round) + " +++")
        # f.write("+++ Round " + str(self.round) + " +++\n")
        # for i in range(4):
        #     print(self.glb_pos[i * 4:(i + 1) * 4])
        #     f.write(str(self.glb_pos[i * 4:(i + 1) * 4]) + '\n')
        # print("=" * 9)
        self.update_screen()
        # f.write("=" * 9 + '\n')

        if self.check_game_over()[0] == True:
            if self.check_game_over()[1] > 0:
                global PLAYER_WINS
                PLAYER_WINS += 1
            else:
                global CPU_WINS
                CPU_WINS += 1

            print("*" * 5 + " PLAYER_WINS:" + str(PLAYER_WINS), end='')
            print("---CPU_WINS:" + str(CPU_WINS) + ' ' + "*" * 5 + '\n')
            # f.write("*" * 5 + " PLAYER_WINS:" + str(PLAYER_WINS) + "---CPU_WINS:" + str(CPU_WINS) + ' ' + '*' * 5 + '\n\n')
            self.update_screen()
            time.sleep(2)
            if PLAYER_WINS + CPU_WINS >= 50:
                print("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS))
                # f.write("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS) + '\n')
                f.close()
                sys.exit()
            self.restart()
        self.update_screen()


class Node():
    def __init__(self, parent=None, game_state=None, move=None):
        self.parentNode = parent
        self.game_state = game_state
        self.move = move
        self.childNodes = []
        self.wins = 0
        self.visits = 0
        self.untriedMoves = self.game_state.get_pieces_legal_moves()

    def AddChild(self, game_state, move):
        n = Node(parent=self, game_state=game_state, move=move)
        self.childNodes.append(n)
        self.untriedMoves.remove(move)
        return n

    def UCTSelect(self, Cp=2):
        try:
            s = sorted(self.childNodes, key=lambda c: c.wins / c.visits + sqrt(Cp * log(self.visits) / c.visits))[-1]
        except:
            print(self.childNodes,self.game_state.player_move_flag)
            print(self.game_state.glb_pos.reshape([4,4]))
            assert False,"UCT_SELECT"
        return s

    def Update(self, result):
        self.visits += 1
        if self.game_state.player_move_flag:
            # todo: note that Update runs after DoMove, so player_move_flag == True means this node's move was made by the CPU
            self.wins -= result  # rewards are positive for the player and negative for the CPU, so the CPU's result is negated
        else:
            self.wins += result

    def CheckFullyExpanded(self):
        fully_expanded = False
        if self.untriedMoves == []:  # what if a side has not lost yet but has no piece that can move?
            # if self.childNodes != []:
            fully_expanded = True
        return fully_expanded


def UCTSearch(root_state, itermax):
    '''
    Try both: returning the child with the highest win rate, or the one with the most visits.
    :param root_state:
    :param itermax:
    :return:
    '''
    state = root_state.Clone()
    root_node = Node(game_state=state)
    for i in range(itermax):
        # state = root_state.Clone()
        node_l = TreePolicy(root_node)
        delta = DefaultPolicy(node_l.game_state.Clone())
        BackUp(node_l, delta)
    # for n in root_node.childNodes:
    #     print(n.move, n.wins, '/', n.visits, '=', n.wins / n.visits)  # print the statistics of each candidate move
    # return BestChild(root_node, 0).move # todo: pick the child with the highest win rate
    return sorted(root_node.childNodes, key=lambda c: c.visits)[-1].move  # todo: pick the most-visited child


def TreePolicy(node):
    while not node.game_state.check_game_over()[0]:
        if not node.CheckFullyExpanded():
            c_node = Expand(node)
            return c_node
        else:
            node = BestChild(node)
    return node


def Expand(node):
    if node.untriedMoves != []:
        a = random.choice(node.untriedMoves)
        state = node.game_state.Clone()
        state.DoMove(a)
        n = node.AddChild(game_state=state, move=a)
        return n
    else:
        print(node.untriedMoves)
        print(node.game_state.glb_pos)
        assert False, "node.untriedMoves == []"


def BestChild(node, Cp=2):
    '''
    Cp matters a lot: it balances exploration against exploitation.
    :param node:
    :param Cp:
    :return:
    '''
    return node.UCTSelect(Cp)


def DefaultPolicy(game_state):
    '''
    Run the simulation and compute the reward; the reward should have a capture part and a win/loss part.
    :param game_state:
    :return:
    '''
    reward = 0
    reward += game_state.check_situation()
    over, r = game_state.check_game_over()
    reward += r
    while not over and game_state.get_pieces_legal_moves() != []:
        game_state.DoMove(random.choice(game_state.get_pieces_legal_moves()))
        over, r = game_state.check_game_over()
        reward += r
    return reward


def BackUp(node, delta):
    while node != None:
        node.Update(delta)
        node = node.parentNode


def MC_Method(root_state, itermax):
    '''
    First generate all the children of the root, then simulate each child an equal number of times.
    Works worse than UCT.
    :param root_state:
    :param itermax:
    :return: best_child.move
    '''
    state = root_state.Clone()
    root_node = Node(game_state=state)
    for m in root_node.untriedMoves[:]:
        s = state.Clone()
        s.DoMove(m)
        root_node.AddChild(s, m)
    assert root_node.untriedMoves == [], "NOT ALL!"
    iter_n = 0
    while iter_n < itermax:  # todo: keep the total number of simulations equal to itermax, the same as the UCT method
        for child_node in root_node.childNodes:
            r = DefaultPolicy(child_node.game_state.Clone())
            child_node.Update(r)
            iter_n += 1
            if iter_n >= itermax:
                break
    for n in root_node.childNodes:
        print(n.move, n.wins, '/', n.visits, '=', n.wins / n.visits)
    best_child = sorted(root_node.childNodes, key=lambda c: c.wins / c.visits)[-1]
    return best_child.move


game = SiqiGame()

while True:
    for event in pg.event.get():
        if event.type == pg.QUIT:
            print("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS))
            f.write("PLAYER_WINS:" + str(PLAYER_WINS) + " CPU_WINS:" + str(CPU_WINS) + '\n')
            f.close()
            sys.exit()
    game.game_main()
    f_clock.tick(fps)

With 8000 simulations per move I can no longer hold my own against the machine, but its thinking time is very long. My feeling is that UCT only really pays off when the tree it expands gets deep enough, so multithreading or multiprocessing would probably not help much: many shallow searches are no substitute for one deep search. If you know a good way to optimize this, please leave a comment.
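One practical workaround for the freezing (not a speed-up: CPython's GIL means a thread does not make the search itself any faster) is to run UCTSearch in a background thread so the pygame event loop keeps running while the CPU thinks. Below is a small sketch of that idea; AsyncUCT is a hypothetical helper I am adding only for illustration, built on nothing but the UCTSearch and SiqiGame.Clone defined above.

import threading

class AsyncUCT:
    '''Run UCTSearch in a worker thread; poll() returns the chosen move once it is ready.'''
    def __init__(self):
        self._result = None
        self._thread = None

    def start(self, game_state, itermax):
        # UCTSearch already clones the state internally, but cloning here as well
        # keeps the worker completely independent of the live game object.
        self._result = None
        self._thread = threading.Thread(
            target=self._run, args=(game_state.Clone(), itermax), daemon=True)
        self._thread.start()

    def _run(self, state, itermax):
        self._result = UCTSearch(state, itermax)

    def poll(self):
        if self._thread is not None and not self._thread.is_alive():
            return self._result  # the (piece_index, move) tuple chosen by UCT
        return None              # still thinking

In game_main, the CPU branch would then call start(self, ITER_MAX) once, keep pumping pg.event.get() and calling update_screen() every frame, and only perform the CPU move once poll() returns something other than None. A genuine speed-up would need real parallel search (for example several independent searches in separate processes whose root statistics are merged), which is exactly the "many shallow searches" approach the paragraph above is sceptical about.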

I ran machine self-play for a whole night plus a morning, 8000 simulations per move against 4000, and the score was 16:12. If UCT can be made fast enough, the next step could be to follow AlphaZero's approach and train a neural network to take on all comers. Admittedly not many people play this kind of "走四棋儿", but the code can be adapted to other board games.
