RL 101 - Tic Tac Toe

Tic-tac-toe can be thought of as a simplified version of Gomoku (five-in-a-row):

Two players, one marking (◯) and the other (✗), take turns placing their symbol on a 3×3 grid; the first to connect three in a row horizontally, vertically, or diagonally wins. If both sides play without mistakes, the game ends in a draw.

Tic-tac-toe is a classic example in reinforcement learning and can be classified as a two-player zero-sum game. A tabular RL solution is implemented in reinforcement-learning-an-introduction/tic_tac_toe.py:

def train(epochs, print_every_n=500):
    player1 = Player(epsilon=0.01)
    player2 = Player(epsilon=0.01)
    judger = Judger(player1, player2)
    player1_win = 0.0
    player2_win = 0.0
    for i in range(1, epochs + 1):
        # play one full game
        winner = judger.play(print_state=False)
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        if i % print_every_n == 0:
            print('Epoch %d, player 1 winrate: %.02f, player 2 winrate: %.02f' % (i, player1_win / i, player2_win / i))
        # the game is over, so the reward is known; back up and update the value of every visited state
        player1.backup()
        player2.backup()
        # reset the board and both players for the next game
        judger.reset()

    # save value(state); it is loaded later to select actions
    player1.save_policy()
    player2.save_policy()



# The game is a zero sum game. If both players are playing with an optimal strategy, every game will end in a tie.
# So we test whether the AI can guarantee at least a tie if it goes second.
def play():
    while True:
        player1 = HumanPlayer()
        player2 = Player(epsilon=0)
        judger = Judger(player1, player2)
        player2.load_policy()
        # play one game against the human
        winner = judger.play()
        if winner == player2.symbol:
            print("You lose!")
        elif winner == player1.symbol:
            print("You win!")
        else:
            print("It is a tie!")

Similarly, tic-tac-toe can be wrapped as an environment for openai/gym and hill-a/stable-baselines, but the reward and related logic have to be adapted to the fact that there are two players.
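
As one illustration of that adaptation, the opponent can be folded into the environment, e.g. as a random player, so the usual single-agent reset/step interface and a +1/0/-1 terminal reward can be reused. The class below is a hypothetical sketch along those lines, not an existing gym or stable-baselines environment; it omits observation/action space declarations and handles illegal moves only with an assert.

import random
import numpy as np

class TicTacToeEnv:
    """Gym-style tic-tac-toe sketch: the agent plays X, a random opponent plays O."""

    def reset(self):
        self.board = np.zeros(9, dtype=np.int8)  # 0 empty, 1 agent (X), -1 opponent (O)
        return self.board.copy()

    def step(self, action):
        assert self.board[action] == 0, "illegal move"
        self.board[action] = 1
        if self._is_winner(1):
            return self.board.copy(), 1.0, True, {}
        if not (self.board == 0).any():          # board full -> draw
            return self.board.copy(), 0.0, True, {}
        # opponent replies with a random legal move
        opponent_move = random.choice(np.flatnonzero(self.board == 0).tolist())
        self.board[opponent_move] = -1
        if self._is_winner(-1):
            return self.board.copy(), -1.0, True, {}
        done = not (self.board == 0).any()
        return self.board.copy(), 0.0, done, {}

    def _is_winner(self, player):
        b = self.board.reshape(3, 3)
        lines = list(b) + list(b.T) + [b.diagonal(), np.fliplr(b).diagonal()]
        return any((line == player).all() for line in lines)

A random opponent only teaches the agent to beat random play; genuine self-play has to alternate which player acts and receives the reward, which is what the open_spiel examples below take care of.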

Besides tabular methods, a neural network can be introduced to approximate the state-value function, with the board state as input (a minimal sketch of such a value network follows the game transcript below). For example, the AlphaZero implementation in deepmind/open_spiel can be used to play tic-tac-toe:

# https://github.com/deepmind/open_spiel/blob/master/docs/alpha_zero.md
# installation: https://github.com/deepmind/open_spiel/blob/master/docs/install.md
# train the model & play against it
$ az_path=exp/tic_tac_toe_alpha_zero
$ python3 open_spiel/python/examples/tic_tac_toe_alpha_zero.py --path ${az_path}
$ python3 open_spiel/python/examples/mcts.py --game=tic_tac_toe --player1=human --player2=az --az_path=${az_path}/checkpoint-25

2020-12-26 21:26:57.202819: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2020-12-26 21:26:57.221343: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fcfa55182a0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-12-26 21:26:57.221356: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
INFO:tensorflow:Restoring parameters from exp/tic_tac_toe_alpha_zero/checkpoint-25
I1226 21:26:57.446192 4433972672 saver.py:1293] Restoring parameters from exp/tic_tac_toe_alpha_zero/checkpoint-25
Initial state:
...
...
...
Choose an action (empty to print legal actions): 4
Player 0 sampled action: x(1,1)
Next state:
...
.x.
...
Player 1 sampled action: o(2,2)
Next state:
...
.x.
..o
Choose an action (empty to print legal actions): 7
Player 0 sampled action: x(2,1)
Next state:
...
.x.
.xo
Player 1 sampled action: o(0,1)
Next state:
.o.
.x.
.xo
Choose an action (empty to print legal actions): 3
Player 0 sampled action: x(1,0)
Next state:
.o.
xx.
.xo
Player 1 sampled action: o(1,2)
Next state:
.o.
xxo
.xo
Choose an action (empty to print legal actions): 2
Player 0 sampled action: x(0,2)
Next state:
.ox
xxo
.xo
Player 1 sampled action: o(2,0)
Next state:
.ox
xxo
oxo
Choose an action (empty to print legal actions): 0
Player 0 sampled action: x(0,0)
Next state:
xox
xxo
oxo
Returns: 0.0 0.0 , Game actions: x(1,1) o(2,2) x(2,1) o(0,1) x(1,0) o(1,2) x(0,2) o(2,0) x(0,0)
Number of games played: 1
Number of distinct games played: 1
Players: human az
Overall wins [0, 0]
Overall returns [0.0, 0.0]
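
As a minimal illustration of the value-network idea mentioned before the transcript, a small network can map the 9 board cells to a single scalar value. The encoding (+1 for x, -1 for o, 0 for empty) and the architecture below are assumptions made for this sketch; open_spiel's actual AlphaZero model also has a policy head and is trained on MCTS self-play targets.

# Sketch only: a tiny value network that maps a 3x3 board to a value in [-1, 1].
import numpy as np
import tensorflow as tf

value_net = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(9,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(1, activation="tanh"),  # value from the current player's perspective
])

# the position after x(1,1) o(2,2) from the transcript above, flattened row by row
board = np.array([[0, 0, 0,
                   0, 1, 0,
                   0, 0, -1]], dtype=np.float32)
print(value_net(board).numpy())  # untrained output, just to show the shapes line up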

In addition, deepmind/open_spiel also ships an example that pits a DQN agent against tabular Q-learning agents:

# DQN agent vs Tabular Q-Learning agents trained on Tic Tac Toe.
$ python3 open_spiel/python/examples/tic_tac_toe_dqn_vs_tabular.py
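
The core of that script is a self-play loop over open_spiel's rl_environment. The sketch below keeps only the loop and trains two tabular Q-learning agents against each other (the bundled example replaces one of them with a DQN agent, which additionally needs a TensorFlow session and network hyperparameters), so treat it as an outline of the pattern rather than the exact script.

from open_spiel.python import rl_environment
from open_spiel.python.algorithms import tabular_qlearner

env = rl_environment.Environment("tic_tac_toe")
num_actions = env.action_spec()["num_actions"]
# one learning agent per player
agents = [tabular_qlearner.QLearner(player_id=idx, num_actions=num_actions)
          for idx in range(2)]

for episode in range(25000):
    time_step = env.reset()
    while not time_step.last():
        player_id = time_step.observations["current_player"]
        agent_output = agents[player_id].step(time_step)
        time_step = env.step([agent_output.action])
    # let both agents observe the terminal step so they update on the final reward
    for agent in agents:
        agent.step(time_step)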

AlphaZero likewise applies to two-player games other than Go.

Cover image taken from Welcome to Spinning Up in Deep RL!
