Two players, one playing circles (◯) and one playing crosses (✗), take turns placing their symbol on a 3×3 grid; the first to connect three in a line horizontally, vertically, or diagonally wins. If both sides play perfectly, the game ends in a draw.
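The win condition above (three in a line across a row, column, or diagonal) can be checked directly; a minimal sketch, with an illustrative function name not taken from any referenced code:

```python
# Board: 3x3 list of lists; cells hold 'o', 'x', or None (empty).
def winner(board):
    """Return 'o' or 'x' if that player has three in a line, else None."""
    lines = []
    lines.extend(board)                                                # 3 rows
    lines.extend([[board[r][c] for r in range(3)] for c in range(3)])  # 3 columns
    lines.append([board[i][i] for i in range(3)])                      # main diagonal
    lines.append([board[i][2 - i] for i in range(3)])                  # anti-diagonal
    for line in lines:
        if line[0] is not None and line.count(line[0]) == 3:
            return line[0]
    return None
```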
Tic-tac-toe is a classic introductory example for reinforcement learning and can be classified as a two-player zero-sum game. A tabular RL solution is implemented in reinforcement-learning-an-introduction/tic_tac_toe.py:
def train(epochs, print_every_n=500):
    player1 = Player(epsilon=0.01)
    player2 = Player(epsilon=0.01)
    judger = Judger(player1, player2)
    player1_win = 0.0
    player2_win = 0.0
    for i in range(1, epochs + 1):
        # play one full game
        winner = judger.play(print_state=False)
        if winner == 1:
            player1_win += 1
        if winner == -1:
            player2_win += 1
        if i % print_every_n == 0:
            print('Epoch %d, player 1 winrate: %.02f, player 2 winrate: %.02f' % (i, player1_win / i, player2_win / i))
        # once a game ends, the reward is known, so the values of all
        # visited states can be updated iteratively
        player1.backup()
        player2.backup()
        # reset for the next game
        judger.reset()
    # save value(state), used later for action selection
    player1.save_policy()
    player2.save_policy()
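The backup() call above propagates the end-of-game reward backward through the visited states with a temporal-difference style update, V(s_t) ← V(s_t) + α·(V(s_{t+1}) − V(s_t)). A minimal sketch of that idea, assuming a value table keyed by state (the function name and signature here are illustrative, not the repository's exact code):

```python
def td_backup(estimations, states, step_size=0.1):
    """Update value estimates backward along one game's state sequence:
    V(s_t) += step_size * (V(s_{t+1}) - V(s_t)).
    `estimations` maps state -> value; `states` is the visited sequence,
    with the terminal state's value already set from the game outcome."""
    for t in reversed(range(len(states) - 1)):
        s, s_next = states[t], states[t + 1]
        estimations[s] += step_size * (estimations[s_next] - estimations[s])
```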
# The game is a zero-sum game. If both players are playing with an optimal
# strategy, every game will end in a tie. So we test whether the AI can
# guarantee at least a tie if it goes second.
def play():
    while True:
        player1 = HumanPlayer()
        player2 = Player(epsilon=0)
        judger = Judger(player1, player2)
        player2.load_policy()
        # play one game against the human
        winner = judger.play()
        if winner == player2.symbol:
            print("You lose!")
        elif winner == player1.symbol:
            print("You win!")
        else:
            print("It is a tie!")
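Note that Player(epsilon=0.01) explores occasionally during training, while Player(epsilon=0) plays purely greedily at test time. The underlying epsilon-greedy action selection can be sketched as follows (an illustrative standalone function, not the repository's exact implementation):

```python
import random

def epsilon_greedy(values, actions, epsilon):
    """With probability epsilon pick a uniformly random action; otherwise
    pick the action leading to the highest estimated successor-state value.
    `values` maps action -> estimated value of the resulting state."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: values[a])
```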
Similarly, openai/gym and hill-a/stable-baselines can be used to model tic-tac-toe, but the reward handling and related details must be adapted to account for the fact that the game has two players.
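One common way to fit a two-player game into the single-agent gym interface is to let the environment play the opponent's move inside step() and report rewards from the learning agent's perspective. A minimal self-contained sketch of that idea (the class and its details are illustrative assumptions, not from openai/gym or stable-baselines):

```python
import random

class TicTacToeEnv:
    """Single-agent view of tic-tac-toe: the agent plays 'x'; the environment
    answers with a random 'o' move inside step(), so the familiar
    (observation, reward, done) interface still applies."""

    LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),   # rows
             (0, 3, 6), (1, 4, 7), (2, 5, 8),   # columns
             (0, 4, 8), (2, 4, 6)]              # diagonals

    def reset(self):
        self.board = [None] * 9
        return tuple(self.board)

    def _winner(self):
        for a, b, c in self.LINES:
            if self.board[a] is not None and self.board[a] == self.board[b] == self.board[c]:
                return self.board[a]
        return None

    def step(self, action):
        assert self.board[action] is None, "illegal move"
        self.board[action] = 'x'
        if self._winner() == 'x':
            return tuple(self.board), 1.0, True
        empty = [i for i, v in enumerate(self.board) if v is None]
        if not empty:                                # board full: draw
            return tuple(self.board), 0.0, True
        self.board[random.choice(empty)] = 'o'       # environment's reply
        if self._winner() == 'o':
            return tuple(self.board), -1.0, True
        done = all(v is not None for v in self.board)
        return tuple(self.board), 0.0, done
```

A random opponent is the simplest choice for a sketch; training against a fixed or self-play opponent would be handled the same way, by swapping the reply policy inside step().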
Beyond tabular methods, a neural network can be introduced to approximate the state-value function, taking the board state as input. For example, deepmind/open_spiel's AlphaZero can be used to play tic-tac-toe:
# https://github.com/deepmind/open_spiel/blob/master/docs/alpha_zero.md
# Installation: https://github.com/deepmind/open_spiel/blob/master/docs/install.md
# Train the model & play against it
$ az_path=exp/tic_tac_toe_alpha_zero
$ python3 open_spiel/python/examples/tic_tac_toe_alpha_zero.py --path ${az_path}
$ python3 open_spiel/python/examples/mcts.py --game=tic_tac_toe --player1=human --player2=az --az_path=${az_path}/checkpoint-25
INFO:tensorflow:Restoring parameters from exp/tic_tac_toe_alpha_zero/checkpoint-25
Initial state:
...
...
...
Choose an action (empty to print legal actions): 4
Player 0 sampled action: x(1,1)
Next state:
...
.x.
...
Player 1 sampled action: o(2,2)
Next state:
...
.x.
..o
Choose an action (empty to print legal actions): 7
Player 0 sampled action: x(2,1)
Next state:
...
.x.
.xo
Player 1 sampled action: o(0,1)
Next state:
.o.
.x.
.xo
Choose an action (empty to print legal actions): 3
Player 0 sampled action: x(1,0)
Next state:
.o.
xx.
.xo
Player 1 sampled action: o(1,2)
Next state:
.o.
xxo
.xo
Choose an action (empty to print legal actions): 2
Player 0 sampled action: x(0,2)
Next state:
.ox
xxo
.xo
Player 1 sampled action: o(2,0)
Next state:
.ox
xxo
oxo
Choose an action (empty to print legal actions): 0
Player 0 sampled action: x(0,0)
Next state:
xox
xxo
oxo
Returns: 0.0 0.0 , Game actions: x(1,1) o(2,2) x(2,1) o(0,1) x(1,0) o(1,2) x(0,2) o(2,0) x(0,0)
Number of games played: 1
Number of distinct games played: 1
Players: human az
Overall wins [0, 0]
Overall returns [0.0, 0.0]
In addition, deepmind/open_spiel also provides a matchup between a DQN agent and tabular methods:
# DQN agent vs Tabular Q-Learning agents trained on Tic Tac Toe.
$ python3 open_spiel/python/examples/tic_tac_toe_dqn_vs_tabular.py
AlphaZero likewise applies to two-player games other than Go.