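Solving a Maze Game with the Sarsa Reinforcement Learning Algorithm: A Line-by-Line Code Walkthrough

These notes are compiled from Baidu's 7-day introductory reinforcement learning course.
Thanks to Li Kejiao (李科澆) of the Baidu PARL team for teaching the course.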

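1. Install the dependencies

Install Gym, the environment library used for reinforcement learning. The code in this article uses the older Gym API (the FrozenLake-v0 environment, and env.step() returning four values), so a newer Gym release may need small adjustments.

pip install gym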

2. Import the dependencies

import gym
import numpy as np
import time # used to pause between frames so the rendering is easy to follow

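3. The agent's algorithm: Sarsa

  • The agent is the entity that interacts with the environment
    • It observes the current state
    • It chooses an action based on the current state
    • It updates the Q table based on the outcome of that choice
  • predict() method: takes an observation (that is, the state) and returns the “predicted” action (the greedy, currently best action)
    • Look up the Q values of all actions available in the current state
    • Find the maximum and collect every action that attains it into a list
    • This list holds the candidate best actions
    • Pick one action from that list at random, so ties are broken randomly
  • sample() method: builds on predict() with ε-greedy exploration and returns the action “actually” taken
    • Uses the epsilon-greedy strategy
    • With probability 1 - ε (90% here), take the best action from predict()
    • With probability ε (10% here), take a uniformly random action
  • learn() method: takes the data from one step and performs one update of the Q table
    • The entry updated is the Q value of the action taken in the previous state obs
    • If the episode has ended, the target Q value is just reward
    • If the episode has not ended, the target Q value is reward plus gamma times the Q value of the next state-action pair
    • The entry moves toward the target by a fraction set by the learning rate lr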
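
The sample() bullets above can be illustrated with a small standalone sketch. It uses plain Python lists and the standard random module rather than NumPy, and the names greedy_action and epsilon_greedy_action are made up for this illustration. One detail worth noticing: the random branch can also land on the best action, so with 4 actions, ε = 0.1 and a single best action, the best action is taken with probability 1 - ε + ε/4 = 0.925 rather than exactly 0.9.

```python
import random

def greedy_action(q_row, rng=random):
    """Return an action with the largest Q value, breaking ties at random."""
    best = max(q_row)
    candidates = [a for a, q in enumerate(q_row) if q == best]
    return rng.choice(candidates)

def epsilon_greedy_action(q_row, epsilon=0.1, rng=random):
    """With probability 1 - epsilon act greedily, otherwise pick any action uniformly."""
    if rng.random() < 1 - epsilon:
        return greedy_action(q_row, rng)
    return rng.randrange(len(q_row))

if __name__ == "__main__":
    rng = random.Random(0)        # fixed seed so the run is repeatable
    q_row = [0.2, 0.5, 0.1, 0.0]  # hypothetical Q values for one state
    n = 100_000
    hits = sum(epsilon_greedy_action(q_row, 0.1, rng) == 1 for _ in range(n))
    print("fraction of picks that were action 1:", hits / n)  # should be close to 0.925
```
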
class SarsaAgent(object):
    def __init__(self, obs_n, act_n, learning_rate=0.01, gamma=0.9, e_greed=0.1):
        self.act_n = act_n      # number of actions to choose from
        self.lr = learning_rate # learning rate
        self.gamma = gamma      # discount factor: how much later Q values affect earlier ones
        self.epsilon = e_greed  # probability of choosing a random action
        self.Q = np.zeros((obs_n, act_n)) # Q table: one row per state, one column per action

    # Sample an action for the given observation (with 10% exploration by default)
    def sample(self, obs):
        if (np.random.uniform(0, 1) < 1 - self.epsilon): # probability 1 - epsilon (90% here)
            action = self.predict(obs) # take the best action
        else: # probability epsilon (10% here)
            action = np.random.choice(self.act_n) # take a random action
        return action

    # Predict the best action for the given observation
    def predict(self, obs):
        Q_list = self.Q[obs, :] # Q values of every action in the current state
        maxQ = np.max(Q_list) # largest Q value in that row
        action_list = np.where(Q_list == maxQ)[0] # every action that attains the largest Q value
        action = np.random.choice(action_list) # pick one of the best actions at random
        return action

    # Learning step, i.e. the Q-table update
    def learn(self, obs, action, reward, next_obs, next_action, done):
        """ on-policy
            obs: observation before the step, s_t
            action: action chosen for this step, a_t
            reward: reward received for this action, r
            next_obs: observation after the step, s_t+1
            next_action: action the current Q table would choose for next_obs, a_t+1
            done: whether the episode has ended
        """
        predict_Q = self.Q[obs,action] # current Q value of the action taken in the state before the step
        if (done): # the episode has ended
            target_Q = reward # the target is just the reward
        else: # the episode continues
            target_Q = reward + self.gamma * self.Q[next_obs, next_action]
            # the target combines the reward with the discounted Q value of the next state-action pair
        self.Q[obs,action] += self.lr * (target_Q - predict_Q) # move toward the target by a fraction lr

    # Save the Q table to a file
    def save(self):
        npy_file = './q_table.npy'
        np.save(npy_file, self.Q)
        print(npy_file + ' saved.')

    # Load the Q table from a file
    def restore(self, npy_file='./q_table.npy'):
        self.Q = np.load(npy_file)
        print(npy_file + ' loaded.')
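
To make the learn() update concrete, here is a small worked example with hand-picked numbers. It follows the same rule as the class above, Q[s, a] += lr * (target - Q[s, a]), but stores Q in a plain dict instead of a NumPy array. The helper name sarsa_update and the Q values are chosen only for illustration.

```python
def sarsa_update(Q, obs, action, reward, next_obs, next_action, done, lr=0.1, gamma=0.9):
    """Apply one Sarsa update in place and return the new Q value."""
    predict_q = Q[(obs, action)]
    if done:
        target_q = reward  # no future value after a terminal step
    else:
        target_q = reward + gamma * Q[(next_obs, next_action)]
    Q[(obs, action)] = predict_q + lr * (target_q - predict_q)
    return Q[(obs, action)]

if __name__ == "__main__":
    # Hypothetical Q values for two state-action pairs of the 4x4 grid.
    Q = {(13, 2): 0.5, (14, 2): 1.0}

    # Non-terminal step: from cell 13 move right to cell 14, reward 0.
    # target = 0 + 0.9 * 1.0 = 0.9, new value = 0.5 + 0.1 * (0.9 - 0.5) = 0.54
    print(round(sarsa_update(Q, 13, 2, 0.0, 14, 2, done=False), 4))

    # Terminal step: from cell 14 move right into the goal, reward 1.
    # target = 1.0 and the stored value is already 1.0, so it stays at 1.0.
    print(round(sarsa_update(Q, 14, 2, 1.0, 15, 0, done=True), 4))
```

The second call shows why an entry of exactly 1 can appear in a trained table: once the Q value of the final move reaches the terminal reward, further updates leave it unchanged.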

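4. Training and testing functions

For each episode, record the number of steps total_steps and the total reward total_reward.

The Q table is updated at every step.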

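def run_episode(env, agent, render=False):
    total_steps = 0 # number of steps taken in this episode
    total_reward = 0 # total reward collected in this episode

    obs = env.reset() # reset the environment and start a new episode
    action = agent.sample(obs) # choose the first action with the agent's policy

    while True:
        next_obs, reward, done, _ = env.step(action) # interact with the environment by executing the action
        next_action = agent.sample(next_obs) # choose the next action with the same policy
        # Train the Sarsa agent
        agent.learn(obs, action, reward, next_obs, next_action, done)
        # obs and action (the state before the step and the action taken) select the Q value being updated
        # reward, next_obs and next_action (the reward, the state after the step and the next action) form the target for that Q value
        # done tells learn() whether the episode has ended

        action = next_action # carry the chosen action into the next iteration
        obs = next_obs  # the new state becomes the current state
        total_reward += reward # accumulate the reward
        total_steps += 1 # count the step
        if render: # check whether rendering is requested
            env.render() # render a new frame
        if done: # the episode has ended
            break # leave the loop, ending this episode
    return total_reward, total_steps # return the total reward and the number of steps

def test_episode(env, agent):
    total_reward = 0 # total reward collected
    obs = env.reset() # reset the environment; obs is the initial observation, i.e. the initial state
    while True:
        action = agent.predict(obs) # greedy: always take the best action
        next_obs, reward, done, _ = env.step(action) # after the step, get the new state, the reward and whether the episode ended
        total_reward += reward # accumulate the reward
        obs = next_obs # move to the new state
        time.sleep(0.5) # pause so the rendered frames are easy to follow
        env.render() # render the frame
        if done: # the episode has ended
            break # leave the loop
    return total_reward # return the accumulated reward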
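
The two functions above assume the older Gym API, where env.reset() returns only the observation and env.step() returns four values. Newer Gym releases (0.26 and later) and Gymnasium return (obs, info) from reset() and five values (obs, reward, terminated, truncated, info) from step(). The helpers below are a hedged sketch of how both shapes could be normalised. unpack_reset and unpack_step are hypothetical names, not part of Gym, and the check in unpack_reset relies on FrozenLake observations being plain integers rather than tuples.

```python
def unpack_reset(result):
    """Return just the observation from either reset() return shape."""
    if isinstance(result, tuple):  # newer API: (obs, info)
        return result[0]
    return result                  # older API: obs only

def unpack_step(result):
    """Return (obs, reward, done, info) from either step() return shape."""
    if len(result) == 5:           # newer API: terminated and truncated are separate flags
        obs, reward, terminated, truncated, info = result
        return obs, reward, terminated or truncated, info
    obs, reward, done, info = result  # older API
    return obs, reward, done, info

if __name__ == "__main__":
    # Hand-written tuples standing in for real environment return values.
    print(unpack_reset(0))
    print(unpack_reset((0, {})))
    print(unpack_step((4, 0.0, False, {})))
    print(unpack_step((15, 1.0, True, False, {})))
```

With these, obs = unpack_reset(env.reset()) and next_obs, reward, done, _ = unpack_step(env.step(action)) would read the same under either API.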

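5. Create the environment, instantiate the agent, and run training and testing

Use Gym to create the environment we need.

Instantiate the SarsaAgent class to create an agent object, setting the hyperparameters at the same time.

Train for 500 episodes and print the result of each one.

Run a test once training is finished.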

# Create the maze environment with gym; is_slippery=False makes the environment easier
env = gym.make("FrozenLake-v0", is_slippery=False)  # 0 left, 1 down, 2 right, 3 up
# gym.make creates the requested environment

# Create an agent instance and pass in the hyperparameters
agent = SarsaAgent(
        obs_n=env.observation_space.n, # 16 states, one per cell of the 4*4 grid
        act_n=env.action_space.n, # 4 actions: 0 left, 1 down, 2 right, 3 up
        learning_rate=0.1, # learning rate
        gamma=0.9, # discount factor applied to the next step's Q value
        e_greed=0.1) # probability of choosing a random action


# Train for 500 episodes and print the score of each one
for episode in range(500):
    ep_reward, ep_steps = run_episode(env, agent, False)
    print('Episode %s: steps = %s , reward = %.1f' % (episode, ep_steps, ep_reward))

# Training is finished; check how well the algorithm performs
test_reward = test_episode(env, agent)
print('test reward = %.1f' % (test_reward))

Output:

Episode 0: steps = 6 , reward = 0.0
Episode 1: steps = 17 , reward = 0.0
Episode 2: steps = 9 , reward = 0.0
Episode 3: steps = 2 , reward = 0.0
Episode 4: steps = 8 , reward = 0.0
Episode 5: steps = 8 , reward = 0.0
Episode 6: steps = 14 , reward = 0.0
Episode 7: steps = 7 , reward = 0.0
Episode 8: steps = 7 , reward = 0.0
Episode 9: steps = 2 , reward = 0.0
Episode 10: steps = 3 , reward = 0.0
Episode 11: steps = 8 , reward = 0.0
Episode 12: steps = 3 , reward = 0.0
Episode 13: steps = 8 , reward = 0.0
Episode 14: steps = 6 , reward = 0.0
Episode 15: steps = 5 , reward = 0.0
Episode 16: steps = 5 , reward = 0.0
Episode 17: steps = 7 , reward = 0.0
Episode 18: steps = 2 , reward = 0.0
Episode 19: steps = 7 , reward = 0.0
Episode 20: steps = 2 , reward = 0.0
Episode 21: steps = 7 , reward = 0.0
Episode 22: steps = 6 , reward = 0.0
Episode 23: steps = 3 , reward = 0.0
Episode 24: steps = 4 , reward = 0.0
Episode 25: steps = 4 , reward = 0.0
Episode 26: steps = 17 , reward = 0.0
Episode 27: steps = 11 , reward = 0.0
Episode 28: steps = 4 , reward = 0.0
Episode 29: steps = 9 , reward = 0.0
Episode 30: steps = 3 , reward = 0.0
Episode 31: steps = 11 , reward = 0.0
Episode 32: steps = 7 , reward = 0.0
Episode 33: steps = 3 , reward = 0.0
Episode 34: steps = 16 , reward = 0.0
Episode 35: steps = 10 , reward = 0.0
Episode 36: steps = 2 , reward = 0.0
Episode 37: steps = 9 , reward = 0.0
Episode 38: steps = 9 , reward = 0.0
Episode 39: steps = 19 , reward = 1.0
Episode 40: steps = 6 , reward = 0.0
Episode 41: steps = 6 , reward = 0.0
Episode 42: steps = 7 , reward = 0.0
Episode 43: steps = 4 , reward = 0.0
Episode 44: steps = 4 , reward = 0.0
Episode 45: steps = 5 , reward = 0.0
Episode 46: steps = 4 , reward = 0.0
Episode 47: steps = 22 , reward = 1.0
Episode 48: steps = 2 , reward = 0.0
Episode 49: steps = 2 , reward = 0.0
Episode 50: steps = 2 , reward = 0.0
Episode 51: steps = 17 , reward = 0.0
Episode 52: steps = 14 , reward = 0.0
Episode 53: steps = 6 , reward = 0.0
Episode 54: steps = 8 , reward = 0.0
Episode 55: steps = 18 , reward = 0.0
Episode 56: steps = 5 , reward = 0.0
Episode 57: steps = 2 , reward = 0.0
Episode 58: steps = 8 , reward = 0.0
Episode 59: steps = 4 , reward = 0.0
Episode 60: steps = 10 , reward = 0.0
Episode 61: steps = 2 , reward = 0.0
Episode 62: steps = 11 , reward = 0.0
Episode 63: steps = 21 , reward = 0.0
Episode 64: steps = 4 , reward = 0.0
Episode 65: steps = 2 , reward = 0.0
Episode 66: steps = 3 , reward = 0.0
Episode 67: steps = 3 , reward = 0.0
Episode 68: steps = 18 , reward = 1.0
Episode 69: steps = 6 , reward = 0.0
Episode 70: steps = 8 , reward = 0.0
Episode 71: steps = 8 , reward = 0.0
Episode 72: steps = 4 , reward = 0.0
Episode 73: steps = 13 , reward = 0.0
Episode 74: steps = 3 , reward = 0.0
Episode 75: steps = 7 , reward = 0.0
Episode 76: steps = 8 , reward = 0.0
Episode 77: steps = 3 , reward = 0.0
Episode 78: steps = 7 , reward = 0.0
Episode 79: steps = 8 , reward = 0.0
Episode 80: steps = 7 , reward = 0.0
Episode 81: steps = 10 , reward = 1.0
Episode 82: steps = 6 , reward = 1.0
Episode 83: steps = 9 , reward = 1.0
Episode 84: steps = 6 , reward = 0.0
Episode 85: steps = 6 , reward = 1.0
Episode 86: steps = 3 , reward = 0.0
Episode 87: steps = 7 , reward = 1.0
Episode 88: steps = 6 , reward = 1.0
Episode 89: steps = 7 , reward = 1.0
Episode 90: steps = 6 , reward = 1.0
Episode 91: steps = 6 , reward = 1.0
Episode 92: steps = 10 , reward = 1.0
Episode 93: steps = 6 , reward = 1.0
Episode 94: steps = 8 , reward = 1.0
Episode 95: steps = 6 , reward = 1.0
Episode 96: steps = 7 , reward = 1.0
Episode 97: steps = 6 , reward = 1.0
Episode 98: steps = 6 , reward = 1.0
Episode 99: steps = 8 , reward = 1.0
Episode 100: steps = 6 , reward = 1.0
Episode 101: steps = 8 , reward = 1.0
Episode 102: steps = 6 , reward = 1.0
Episode 103: steps = 6 , reward = 1.0
Episode 104: steps = 6 , reward = 1.0
Episode 105: steps = 8 , reward = 1.0
Episode 106: steps = 6 , reward = 1.0
Episode 107: steps = 6 , reward = 1.0
Episode 108: steps = 6 , reward = 1.0
Episode 109: steps = 6 , reward = 1.0
Episode 110: steps = 4 , reward = 0.0
Episode 111: steps = 6 , reward = 1.0
Episode 112: steps = 6 , reward = 1.0
Episode 113: steps = 6 , reward = 1.0
Episode 114: steps = 6 , reward = 1.0
Episode 115: steps = 7 , reward = 1.0
Episode 116: steps = 7 , reward = 1.0
Episode 117: steps = 10 , reward = 1.0
Episode 118: steps = 5 , reward = 0.0
Episode 119: steps = 6 , reward = 1.0
Episode 120: steps = 3 , reward = 0.0
Episode 121: steps = 6 , reward = 1.0
Episode 122: steps = 6 , reward = 1.0
Episode 123: steps = 9 , reward = 1.0
Episode 124: steps = 6 , reward = 1.0
Episode 125: steps = 5 , reward = 0.0
Episode 126: steps = 6 , reward = 1.0
Episode 127: steps = 6 , reward = 1.0
Episode 128: steps = 8 , reward = 1.0
Episode 129: steps = 6 , reward = 1.0
Episode 130: steps = 6 , reward = 1.0
Episode 131: steps = 8 , reward = 1.0
Episode 132: steps = 8 , reward = 1.0
Episode 133: steps = 6 , reward = 1.0
Episode 134: steps = 6 , reward = 1.0
Episode 135: steps = 6 , reward = 1.0
Episode 136: steps = 6 , reward = 1.0
Episode 137: steps = 6 , reward = 1.0
Episode 138: steps = 6 , reward = 1.0
Episode 139: steps = 4 , reward = 0.0
Episode 140: steps = 6 , reward = 1.0
Episode 141: steps = 6 , reward = 1.0
Episode 142: steps = 6 , reward = 1.0
Episode 143: steps = 9 , reward = 1.0
Episode 144: steps = 6 , reward = 1.0
Episode 145: steps = 6 , reward = 1.0
Episode 146: steps = 6 , reward = 1.0
Episode 147: steps = 7 , reward = 1.0
Episode 148: steps = 7 , reward = 1.0
Episode 149: steps = 6 , reward = 1.0
Episode 150: steps = 6 , reward = 1.0
Episode 151: steps = 6 , reward = 1.0
Episode 152: steps = 7 , reward = 1.0
Episode 153: steps = 6 , reward = 1.0
Episode 154: steps = 6 , reward = 1.0
Episode 155: steps = 7 , reward = 1.0
Episode 156: steps = 7 , reward = 1.0
Episode 157: steps = 7 , reward = 1.0
Episode 158: steps = 6 , reward = 1.0
Episode 159: steps = 6 , reward = 1.0
Episode 160: steps = 6 , reward = 1.0
Episode 161: steps = 4 , reward = 0.0
Episode 162: steps = 6 , reward = 1.0
Episode 163: steps = 5 , reward = 0.0
Episode 164: steps = 6 , reward = 1.0
Episode 165: steps = 6 , reward = 1.0
Episode 166: steps = 6 , reward = 1.0
Episode 167: steps = 6 , reward = 1.0
Episode 168: steps = 9 , reward = 1.0
Episode 169: steps = 6 , reward = 1.0
Episode 170: steps = 8 , reward = 1.0
Episode 171: steps = 6 , reward = 1.0
Episode 172: steps = 6 , reward = 1.0
Episode 173: steps = 6 , reward = 1.0
Episode 174: steps = 6 , reward = 1.0
Episode 175: steps = 6 , reward = 1.0
Episode 176: steps = 6 , reward = 1.0
Episode 177: steps = 6 , reward = 1.0
Episode 178: steps = 8 , reward = 1.0
Episode 179: steps = 6 , reward = 1.0
Episode 180: steps = 6 , reward = 1.0
Episode 181: steps = 3 , reward = 0.0
Episode 182: steps = 6 , reward = 1.0
Episode 183: steps = 6 , reward = 1.0
Episode 184: steps = 6 , reward = 1.0
Episode 185: steps = 8 , reward = 1.0
Episode 186: steps = 10 , reward = 1.0
Episode 187: steps = 8 , reward = 1.0
Episode 188: steps = 6 , reward = 1.0
Episode 189: steps = 6 , reward = 1.0
Episode 190: steps = 6 , reward = 1.0
Episode 191: steps = 6 , reward = 1.0
Episode 192: steps = 7 , reward = 1.0
Episode 193: steps = 6 , reward = 1.0
Episode 194: steps = 6 , reward = 1.0
Episode 195: steps = 8 , reward = 1.0
Episode 196: steps = 6 , reward = 1.0
Episode 197: steps = 4 , reward = 0.0
Episode 198: steps = 5 , reward = 0.0
Episode 199: steps = 6 , reward = 1.0
Episode 200: steps = 6 , reward = 1.0
Episode 201: steps = 6 , reward = 1.0
Episode 202: steps = 4 , reward = 0.0
Episode 203: steps = 8 , reward = 1.0
Episode 204: steps = 8 , reward = 1.0
Episode 205: steps = 7 , reward = 1.0
Episode 206: steps = 6 , reward = 1.0
Episode 207: steps = 6 , reward = 1.0
Episode 208: steps = 6 , reward = 1.0
Episode 209: steps = 8 , reward = 1.0
Episode 210: steps = 7 , reward = 1.0
Episode 211: steps = 6 , reward = 1.0
Episode 212: steps = 6 , reward = 1.0
Episode 213: steps = 10 , reward = 1.0
Episode 214: steps = 6 , reward = 1.0
Episode 215: steps = 6 , reward = 1.0
Episode 216: steps = 6 , reward = 1.0
Episode 217: steps = 6 , reward = 1.0
Episode 218: steps = 6 , reward = 1.0
Episode 219: steps = 6 , reward = 1.0
Episode 220: steps = 6 , reward = 1.0
Episode 221: steps = 7 , reward = 1.0
Episode 222: steps = 6 , reward = 1.0
Episode 223: steps = 6 , reward = 1.0
Episode 224: steps = 6 , reward = 1.0
Episode 225: steps = 6 , reward = 1.0
Episode 226: steps = 6 , reward = 1.0
Episode 227: steps = 6 , reward = 1.0
Episode 228: steps = 7 , reward = 1.0
Episode 229: steps = 6 , reward = 1.0
Episode 230: steps = 6 , reward = 1.0
Episode 231: steps = 10 , reward = 1.0
Episode 232: steps = 6 , reward = 1.0
Episode 233: steps = 6 , reward = 1.0
Episode 234: steps = 6 , reward = 1.0
Episode 235: steps = 8 , reward = 1.0
Episode 236: steps = 6 , reward = 1.0
Episode 237: steps = 6 , reward = 1.0
Episode 238: steps = 6 , reward = 1.0
Episode 239: steps = 8 , reward = 1.0
Episode 240: steps = 6 , reward = 1.0
Episode 241: steps = 6 , reward = 1.0
Episode 242: steps = 8 , reward = 1.0
Episode 243: steps = 2 , reward = 0.0
Episode 244: steps = 6 , reward = 1.0
Episode 245: steps = 6 , reward = 1.0
Episode 246: steps = 6 , reward = 1.0
Episode 247: steps = 6 , reward = 1.0
Episode 248: steps = 6 , reward = 1.0
Episode 249: steps = 6 , reward = 1.0
Episode 250: steps = 7 , reward = 1.0
Episode 251: steps = 6 , reward = 1.0
Episode 252: steps = 2 , reward = 0.0
Episode 253: steps = 6 , reward = 1.0
Episode 254: steps = 6 , reward = 1.0
Episode 255: steps = 6 , reward = 1.0
Episode 256: steps = 8 , reward = 1.0
Episode 257: steps = 6 , reward = 1.0
Episode 258: steps = 6 , reward = 1.0
Episode 259: steps = 7 , reward = 1.0
Episode 260: steps = 6 , reward = 1.0
Episode 261: steps = 6 , reward = 1.0
Episode 262: steps = 7 , reward = 1.0
Episode 263: steps = 6 , reward = 1.0
Episode 264: steps = 6 , reward = 1.0
Episode 265: steps = 6 , reward = 1.0
Episode 266: steps = 6 , reward = 1.0
Episode 267: steps = 7 , reward = 1.0
Episode 268: steps = 6 , reward = 1.0
Episode 269: steps = 6 , reward = 1.0
Episode 270: steps = 6 , reward = 1.0
Episode 271: steps = 6 , reward = 1.0
Episode 272: steps = 6 , reward = 1.0
Episode 273: steps = 7 , reward = 1.0
Episode 274: steps = 3 , reward = 0.0
Episode 275: steps = 8 , reward = 1.0
Episode 276: steps = 7 , reward = 1.0
Episode 277: steps = 4 , reward = 0.0
Episode 278: steps = 6 , reward = 1.0
Episode 279: steps = 4 , reward = 0.0
Episode 280: steps = 7 , reward = 1.0
Episode 281: steps = 6 , reward = 1.0
Episode 282: steps = 6 , reward = 1.0
Episode 283: steps = 6 , reward = 1.0
Episode 284: steps = 6 , reward = 1.0
Episode 285: steps = 7 , reward = 1.0
Episode 286: steps = 8 , reward = 1.0
Episode 287: steps = 6 , reward = 1.0
Episode 288: steps = 5 , reward = 0.0
Episode 289: steps = 8 , reward = 1.0
Episode 290: steps = 7 , reward = 1.0
Episode 291: steps = 8 , reward = 1.0
Episode 292: steps = 4 , reward = 0.0
Episode 293: steps = 6 , reward = 1.0
Episode 294: steps = 9 , reward = 1.0
Episode 295: steps = 6 , reward = 1.0
Episode 296: steps = 6 , reward = 1.0
Episode 297: steps = 6 , reward = 0.0
Episode 298: steps = 6 , reward = 1.0
Episode 299: steps = 6 , reward = 1.0
Episode 300: steps = 6 , reward = 1.0
Episode 301: steps = 5 , reward = 0.0
Episode 302: steps = 6 , reward = 1.0
Episode 303: steps = 7 , reward = 1.0
Episode 304: steps = 6 , reward = 1.0
Episode 305: steps = 8 , reward = 1.0
Episode 306: steps = 6 , reward = 1.0
Episode 307: steps = 6 , reward = 1.0
Episode 308: steps = 6 , reward = 1.0
Episode 309: steps = 6 , reward = 1.0
Episode 310: steps = 4 , reward = 0.0
Episode 311: steps = 7 , reward = 1.0
Episode 312: steps = 8 , reward = 1.0
Episode 313: steps = 7 , reward = 1.0
Episode 314: steps = 6 , reward = 1.0
Episode 315: steps = 6 , reward = 1.0
Episode 316: steps = 7 , reward = 1.0
Episode 317: steps = 6 , reward = 1.0
Episode 318: steps = 6 , reward = 1.0
Episode 319: steps = 6 , reward = 1.0
Episode 320: steps = 6 , reward = 1.0
Episode 321: steps = 6 , reward = 1.0
Episode 322: steps = 7 , reward = 1.0
Episode 323: steps = 6 , reward = 1.0
Episode 324: steps = 6 , reward = 1.0
Episode 325: steps = 6 , reward = 1.0
Episode 326: steps = 6 , reward = 1.0
Episode 327: steps = 6 , reward = 1.0
Episode 328: steps = 6 , reward = 1.0
Episode 329: steps = 6 , reward = 1.0
Episode 330: steps = 6 , reward = 1.0
Episode 331: steps = 6 , reward = 1.0
Episode 332: steps = 6 , reward = 1.0
Episode 333: steps = 6 , reward = 1.0
Episode 334: steps = 3 , reward = 0.0
Episode 335: steps = 6 , reward = 1.0
Episode 336: steps = 6 , reward = 1.0
Episode 337: steps = 4 , reward = 0.0
Episode 338: steps = 6 , reward = 1.0
Episode 339: steps = 8 , reward = 1.0
Episode 340: steps = 6 , reward = 1.0
Episode 341: steps = 6 , reward = 1.0
Episode 342: steps = 6 , reward = 1.0
Episode 343: steps = 6 , reward = 1.0
Episode 344: steps = 6 , reward = 1.0
Episode 345: steps = 6 , reward = 1.0
Episode 346: steps = 6 , reward = 1.0
Episode 347: steps = 6 , reward = 1.0
Episode 348: steps = 6 , reward = 1.0
Episode 349: steps = 6 , reward = 1.0
Episode 350: steps = 6 , reward = 1.0
Episode 351: steps = 7 , reward = 1.0
Episode 352: steps = 6 , reward = 1.0
Episode 353: steps = 10 , reward = 1.0
Episode 354: steps = 3 , reward = 0.0
Episode 355: steps = 7 , reward = 1.0
Episode 356: steps = 7 , reward = 1.0
Episode 357: steps = 6 , reward = 1.0
Episode 358: steps = 2 , reward = 0.0
Episode 359: steps = 6 , reward = 1.0
Episode 360: steps = 6 , reward = 1.0
Episode 361: steps = 6 , reward = 1.0
Episode 362: steps = 7 , reward = 1.0
Episode 363: steps = 8 , reward = 1.0
Episode 364: steps = 6 , reward = 1.0
Episode 365: steps = 2 , reward = 0.0
Episode 366: steps = 6 , reward = 1.0
Episode 367: steps = 5 , reward = 0.0
Episode 368: steps = 6 , reward = 1.0
Episode 369: steps = 6 , reward = 1.0
Episode 370: steps = 6 , reward = 1.0
Episode 371: steps = 6 , reward = 1.0
Episode 372: steps = 6 , reward = 1.0
Episode 373: steps = 6 , reward = 1.0
Episode 374: steps = 8 , reward = 1.0
Episode 375: steps = 9 , reward = 1.0
Episode 376: steps = 6 , reward = 0.0
Episode 377: steps = 6 , reward = 1.0
Episode 378: steps = 6 , reward = 1.0
Episode 379: steps = 8 , reward = 1.0
Episode 380: steps = 6 , reward = 1.0
Episode 381: steps = 6 , reward = 1.0
Episode 382: steps = 6 , reward = 1.0
Episode 383: steps = 6 , reward = 1.0
Episode 384: steps = 6 , reward = 1.0
Episode 385: steps = 6 , reward = 1.0
Episode 386: steps = 8 , reward = 1.0
Episode 387: steps = 6 , reward = 1.0
Episode 388: steps = 6 , reward = 1.0
Episode 389: steps = 2 , reward = 0.0
Episode 390: steps = 6 , reward = 1.0
Episode 391: steps = 6 , reward = 1.0
Episode 392: steps = 6 , reward = 1.0
Episode 393: steps = 6 , reward = 1.0
Episode 394: steps = 7 , reward = 1.0
Episode 395: steps = 6 , reward = 1.0
Episode 396: steps = 6 , reward = 1.0
Episode 397: steps = 6 , reward = 1.0
Episode 398: steps = 6 , reward = 1.0
Episode 399: steps = 7 , reward = 1.0
Episode 400: steps = 6 , reward = 1.0
Episode 401: steps = 6 , reward = 1.0
Episode 402: steps = 6 , reward = 1.0
Episode 403: steps = 6 , reward = 1.0
Episode 404: steps = 8 , reward = 1.0
Episode 405: steps = 6 , reward = 1.0
Episode 406: steps = 6 , reward = 1.0
Episode 407: steps = 6 , reward = 1.0
Episode 408: steps = 6 , reward = 1.0
Episode 409: steps = 6 , reward = 1.0
Episode 410: steps = 6 , reward = 1.0
Episode 411: steps = 6 , reward = 1.0
Episode 412: steps = 6 , reward = 1.0
Episode 413: steps = 6 , reward = 1.0
Episode 414: steps = 6 , reward = 1.0
Episode 415: steps = 9 , reward = 1.0
Episode 416: steps = 6 , reward = 1.0
Episode 417: steps = 4 , reward = 0.0
Episode 418: steps = 6 , reward = 1.0
Episode 419: steps = 6 , reward = 1.0
Episode 420: steps = 7 , reward = 1.0
Episode 421: steps = 6 , reward = 1.0
Episode 422: steps = 6 , reward = 1.0
Episode 423: steps = 10 , reward = 1.0
Episode 424: steps = 6 , reward = 1.0
Episode 425: steps = 6 , reward = 1.0
Episode 426: steps = 8 , reward = 1.0
Episode 427: steps = 6 , reward = 1.0
Episode 428: steps = 9 , reward = 1.0
Episode 429: steps = 6 , reward = 1.0
Episode 430: steps = 4 , reward = 0.0
Episode 431: steps = 6 , reward = 1.0
Episode 432: steps = 6 , reward = 1.0
Episode 433: steps = 6 , reward = 1.0
Episode 434: steps = 6 , reward = 1.0
Episode 435: steps = 8 , reward = 1.0
Episode 436: steps = 6 , reward = 1.0
Episode 437: steps = 6 , reward = 1.0
Episode 438: steps = 6 , reward = 1.0
Episode 439: steps = 8 , reward = 1.0
Episode 440: steps = 2 , reward = 0.0
Episode 441: steps = 6 , reward = 1.0
Episode 442: steps = 10 , reward = 1.0
Episode 443: steps = 6 , reward = 1.0
Episode 444: steps = 6 , reward = 1.0
Episode 445: steps = 8 , reward = 1.0
Episode 446: steps = 6 , reward = 1.0
Episode 447: steps = 6 , reward = 1.0
Episode 448: steps = 5 , reward = 0.0
Episode 449: steps = 6 , reward = 1.0
Episode 450: steps = 8 , reward = 1.0
Episode 451: steps = 6 , reward = 1.0
Episode 452: steps = 8 , reward = 1.0
Episode 453: steps = 8 , reward = 1.0
Episode 454: steps = 7 , reward = 1.0
Episode 455: steps = 5 , reward = 0.0
Episode 456: steps = 6 , reward = 1.0
Episode 457: steps = 6 , reward = 1.0
Episode 458: steps = 8 , reward = 1.0
Episode 459: steps = 8 , reward = 1.0
Episode 460: steps = 10 , reward = 1.0
Episode 461: steps = 8 , reward = 1.0
Episode 462: steps = 7 , reward = 1.0
Episode 463: steps = 7 , reward = 1.0
Episode 464: steps = 6 , reward = 1.0
Episode 465: steps = 6 , reward = 1.0
Episode 466: steps = 6 , reward = 1.0
Episode 467: steps = 6 , reward = 1.0
Episode 468: steps = 6 , reward = 1.0
Episode 469: steps = 6 , reward = 1.0
Episode 470: steps = 3 , reward = 0.0
Episode 471: steps = 7 , reward = 1.0
Episode 472: steps = 6 , reward = 1.0
Episode 473: steps = 6 , reward = 1.0
Episode 474: steps = 7 , reward = 1.0
Episode 475: steps = 6 , reward = 1.0
Episode 476: steps = 8 , reward = 1.0
Episode 477: steps = 6 , reward = 1.0
Episode 478: steps = 6 , reward = 1.0
Episode 479: steps = 6 , reward = 1.0
Episode 480: steps = 6 , reward = 1.0
Episode 481: steps = 6 , reward = 1.0
Episode 482: steps = 6 , reward = 1.0
Episode 483: steps = 6 , reward = 1.0
Episode 484: steps = 5 , reward = 0.0
Episode 485: steps = 6 , reward = 1.0
Episode 486: steps = 9 , reward = 1.0
Episode 487: steps = 7 , reward = 1.0
Episode 488: steps = 6 , reward = 1.0
Episode 489: steps = 6 , reward = 1.0
Episode 490: steps = 6 , reward = 1.0
Episode 491: steps = 6 , reward = 1.0
Episode 492: steps = 9 , reward = 1.0
Episode 493: steps = 6 , reward = 1.0
Episode 494: steps = 6 , reward = 1.0
Episode 495: steps = 9 , reward = 1.0
Episode 496: steps = 6 , reward = 1.0
Episode 497: steps = 6 , reward = 1.0
Episode 498: steps = 6 , reward = 1.0
Episode 499: steps = 7 , reward = 1.0
  (Down)
SFFF
FHFH
FFFH
HFFG
  (Down)
SFFF
FHFH
FFFH
HFFG
  (Right)
SFFF
FHFH
FFFH
HFFG
  (Down)
SFFF
FHFH
FFFH
HFFG
  (Right)
SFFF
FHFH
FFFH
HFFG
  (Right)
SFFF
FHFH
FFFH
HFFG
test reward = 1.0
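
Reading 500 lines of log output is tedious, so a summary per block of episodes can help. The sketch below is only an illustration under stated assumptions: it expects the per-episode rewards to have been collected into a list (the training loop above only prints them), and success_rate_by_block is a hypothetical helper name. The sample rewards are made up and are not taken from the run above.

```python
def success_rate_by_block(rewards, block_size=100):
    """Return the fraction of episodes with reward > 0 for each consecutive block."""
    rates = []
    for start in range(0, len(rewards), block_size):
        block = rewards[start:start + block_size]
        rates.append(sum(1 for r in block if r > 0) / len(block))
    return rates

if __name__ == "__main__":
    # Made-up rewards: mostly failures in the first block, mostly successes in the second.
    rewards = [0.0] * 8 + [1.0] * 2 + [1.0] * 9 + [0.0] * 1
    for i, rate in enumerate(success_rate_by_block(rewards, block_size=10)):
        print("episodes %d-%d: success rate %.2f" % (i * 10, i * 10 + 9, rate))
```

In the training loop this would mean appending ep_reward to a list on every iteration and calling the helper once training ends.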

6. Analysis of the results

We can inspect the Q table once training is complete:

print(agent.Q)

Output:

[[0.27140285 0.4364344  0.09145568 0.15201279]
 [0.26813138 0.         0.         0.00945424]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.26636559 0.51632351 0.         0.13684245]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.33346755 0.         0.68004322 0.31572772]
 [0.26970648 0.77477987 0.35436455 0.        ]
 [0.04662094 0.73217092 0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.         0.         0.        ]
 [0.         0.39939922 0.88159607 0.11581402]
 [0.4472322  0.72976712 1.         0.40947544]
 [0.         0.         0.         0.        ]]

The layout of the 16 cells:

SFFF
FHFH
FFFH
HFFG

Here S is the start, F is safe ground, H is a hole (falling in ends the game), and G is the goal (reaching it wins).

The index of each cell:

0  1  2  3
4  5  6  7
8  9  10 11
12 13 14 15
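
The numbering above is row-major, so a state index can be converted to a (row, column) pair with integer division. A minimal sketch, with state_to_cell as a hypothetical helper name and the map copied from the layout shown above:

```python
GRID = ["SFFF",
        "FHFH",
        "FFFH",
        "HFFG"]

def state_to_cell(state, n_cols=4):
    """Convert a state index into (row, col, tile) for the 4x4 map."""
    row, col = divmod(state, n_cols)
    return row, col, GRID[row][col]

if __name__ == "__main__":
    for s in (0, 5, 9, 15):
        print(s, state_to_cell(s))
```

This also explains the all-zero rows in the Q table: the holes (states 5, 7, 11, 12) and the goal (state 15) end the episode, so no action is ever taken from them and their entries are never updated.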

Once the test starts, the agent is in cell 0, where the Q values of the 4 actions are:

[0.27140285 0.4364344  0.09145568 0.15201279]

These 4 Q values correspond to 0 left, 1 down, 2 right, 3 up.

The maximum, 0.4364344, belongs to action 1, so the agent moves down one cell.

It is now in cell 4:

[0.26636559 0.51632351 0.         0.13684245]

Choose 1 (down) and move to cell 8:

[0.33346755 0.         0.68004322 0.31572772]

Choose 2 (right) and move to cell 9:

[0.26970648 0.77477987 0.35436455 0.        ]

Choose 1 (down) and move to cell 13:

[0.         0.39939922 0.88159607 0.11581402]

Choose 2 (right) and move to cell 14:

[0.4472322  0.72976712 1.         0.40947544]

Choose 2 (right) and move to cell 15: the goal!
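
The walk through the table can also be checked mechanically. The sketch below copies the printed Q table into a nested list and follows the largest Q value from cell 0. It is an illustration under a few assumptions: the moves are deterministic (matching is_slippery=False), holes are not modelled because this path never enters one, and ties are resolved by taking the first maximum rather than a random one (no ties occur along this path). The names move and greedy_path are made up for this example.

```python
# Q table as printed above, one row per state; columns are left, down, right, up.
Q = [
    [0.27140285, 0.4364344, 0.09145568, 0.15201279],
    [0.26813138, 0.0, 0.0, 0.00945424],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.26636559, 0.51632351, 0.0, 0.13684245],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.33346755, 0.0, 0.68004322, 0.31572772],
    [0.26970648, 0.77477987, 0.35436455, 0.0],
    [0.04662094, 0.73217092, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.39939922, 0.88159607, 0.11581402],
    [0.4472322, 0.72976712, 1.0, 0.40947544],
    [0.0, 0.0, 0.0, 0.0],
]
ACTIONS = ["left", "down", "right", "up"]
GOAL, SIZE = 15, 4

def move(state, action):
    """Deterministic move on the 4x4 grid; moving into a wall keeps the agent in place."""
    row, col = divmod(state, SIZE)
    if action == 0:
        col = max(col - 1, 0)
    elif action == 1:
        row = min(row + 1, SIZE - 1)
    elif action == 2:
        col = min(col + 1, SIZE - 1)
    else:
        row = max(row - 1, 0)
    return row * SIZE + col

def greedy_path(q_table, start=0, max_steps=20):
    """Follow the largest Q value from start until the goal or max_steps is reached."""
    state, path = start, []
    for _ in range(max_steps):
        if state == GOAL:
            break
        action = q_table[state].index(max(q_table[state]))
        path.append((state, ACTIONS[action]))
        state = move(state, action)
    return path, state

if __name__ == "__main__":
    path, final_state = greedy_path(Q)
    for state, action in path:
        print("cell %2d -> %s" % (state, action))
    print("final cell:", final_state)
```

The printed sequence of actions (down, down, right, down, right, right) matches the frames rendered during the test episode.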
