Analysis of the Program Recommended in Teacher Zhou's Course
1. Key Points
1) About eta
eta is the learning rate. It starts at initial_lr and is decayed by a factor of 0.85 every 100 episodes, with min_lr as a lower bound (see the eta line in the main loop below).
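As a quick sanity check, this is how that decay schedule behaves, a minimal sketch using the script's own constants:

    initial_lr, min_lr = 1.0, 0.003
    for i in [0, 100, 500, 1000, 3000, 4900]:
        eta = max(min_lr, initial_lr * (0.85 ** (i // 100)))
        print(i, round(eta, 4))
    # prints roughly: 1.0, 0.85, 0.4437, 0.1969, 0.0076, 0.003 (the min_lr floor)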
2) About discretization
The continuous observation is quantized into 40 discrete values per dimension; since the state is two-dimensional (position, velocity), this yields a 40 x 40 grid.
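For concreteness, here is a minimal sketch of that quantization applied to MountainCar-v0's documented observation bounds (low = [-1.2, -0.07], high = [0.6, 0.07]); the example observation is made up for illustration:

    import numpy as np

    n_states = 40
    env_low = np.array([-1.2, -0.07])         # MountainCar-v0 lower bounds
    env_high = np.array([0.6, 0.07])          # MountainCar-v0 upper bounds
    env_dx = (env_high - env_low) / n_states  # cell widths: [0.045, 0.0035]

    obs = np.array([-0.5, 0.01])              # example (position, velocity)
    a = int((obs[0] - env_low[0]) / env_dx[0])  # position bin
    b = int((obs[1] - env_low[1]) / env_dx[1])  # velocity bin
    print(a, b)  # 15 22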
3) About _
An underscore as a variable name signals that the value is temporary or deliberately ignored.
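Two typical uses, both of which appear in the code below:

    # a loop variable whose value is never read:
    for _ in range(100):
        pass
    # discarding one field when unpacking, as in
    # obs, reward, done, _ = env.step(action)
    # where the fourth return value (the info dict) is ignored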
4) About list comprehensions
The line below (from the main program) scores the learned policy over 100 evaluation episodes in a single expression:
solution_policy_scores = [run_episode(env, solution_policy, False) for _ in range(100)]
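It is equivalent to this explicit loop:

    solution_policy_scores = []
    for _ in range(100):
        solution_policy_scores.append(run_episode(env, solution_policy, False))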
2. Code Block
import numpy as np
import gym
from gym import wrappers

off_policy = True  # if True, use the off-policy Q-learning update; if False, use the on-policy SARSA update
n_states = 40      # number of discrete states per observation dimension
iter_max = 5000    # number of training episodes
initial_lr = 1.0   # initial learning rate
min_lr = 0.003     # lower bound on the learning rate
gamma = 1.0        # discount factor
t_max = 10000      # maximum steps per episode
eps = 0.1          # epsilon for the epsilon-greedy policy
Policy evaluation function
def run_episode(env, policy=None, render=False):
    obs = env.reset()  # reset the environment
    total_reward = 0
    step_idx = 0
    for _ in range(t_max):  # an episode runs at most t_max steps
        if render:
            env.render()  # draw the current frame
        if policy is None:
            action = env.action_space.sample()  # no policy given: act randomly
        else:
            a, b = obs_to_state(env, obs)  # obs_to_state is defined below
            action = policy[a][b]
        obs, reward, done, _ = env.step(action)
        total_reward += gamma ** step_idx * reward  # accumulate the discounted reward
        step_idx += 1
        if done:
            break
    return total_reward
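A quick usage sketch (assuming the imports, constants, and function above are in scope): with policy=None the function scores a random policy.

    env = gym.make('MountainCar-v0')
    env.seed(0)
    score = run_episode(env)  # random actions, no rendering
    print('random-policy reward:', score)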
State discretization function
def obs_to_state(env, obs):
    """ Maps an observation to a discrete state """
    # quantize the continuous observation space into an n_states x n_states grid
    env_low = env.observation_space.low
    env_high = env.observation_space.high
    env_dx = (env_high - env_low) / n_states  # width of one grid cell per dimension
    a = int((obs[0] - env_low[0]) / env_dx[0])  # int() truncates to the cell index
    b = int((obs[1] - env_low[1]) / env_dx[1])
    return a, b
Main program
if __name__ == '__main__':
    env_name = 'MountainCar-v0'  # the environment id registered with gym
    env = gym.make(env_name)     # create the environment
    env.seed(0)                  # seed the environment for reproducible results
    np.random.seed(0)            # seed numpy for reproducible results
    if off_policy:               # choose the update rule
        print('----- using Q Learning -----')
    else:
        print('----- using SARSA Learning -----')
    q_table = np.zeros((n_states, n_states, 3))  # 40 x 40 discretized states, 3 actions
    for i in range(iter_max):  # 5000 training episodes
        obs = env.reset()      # reset the environment
        total_reward = 0
        # eta: the learning rate is decayed every 100 episodes
        eta = max(min_lr, initial_lr * (0.85 ** (i // 100)))
        for j in range(t_max):  # at most 10000 steps per episode
            a, b = obs_to_state(env, obs)  # discretize the current state
            if np.random.uniform(0, 1) < eps:
                action = np.random.choice(env.action_space.n)  # explore: 0, 1, or 2
            else:
                action = np.argmax(q_table[a][b])  # exploit: greedy action
            obs, reward, done, _ = env.step(action)
            total_reward += reward
            # update q table
            a_, b_ = obs_to_state(env, obs)  # discretize the next state
            if off_policy:
                # Q-learning update (off-policy): bootstrap from the best next action
                q_table[a][b][action] = q_table[a][b][action] + eta * (reward + gamma * np.max(q_table[a_][b_]) - q_table[a][b][action])
            else:
                # SARSA update (on-policy): bootstrap from the next action
                # chosen epsilon-greedily from Q
                if np.random.uniform(0, 1) < eps:
                    action_ = np.random.choice(env.action_space.n)
                else:
                    action_ = np.argmax(q_table[a_][b_])
                q_table[a][b][action] = q_table[a][b][action] + eta * (reward + gamma * q_table[a_][b_][action_] - q_table[a][b][action])
            if done:
                break
        if i % 200 == 0:
            print('Iteration #%d -- Total reward = %d.' % (i + 1, total_reward))
    solution_policy = np.argmax(q_table, axis=2)  # greedy policy from the learned Q table
    solution_policy_scores = [run_episode(env, solution_policy, False) for _ in range(100)]
    print("Average score of solution = ", np.mean(solution_policy_scores))
    # Animate it
    for _ in range(2):
        run_episode(env, solution_policy, True)
    env.close()
The flow of the main program is:
First create the environment, choose Q-Learning (or SARSA), and initialize the Q table.
Loop over 5000 episodes, each running at most 10000 steps.
In each step, choose an action and receive the next observation, the reward, and the done flag.
Discretize the next state and, for Q-learning, update the Q table using the maximum Q value of that state.
Every 200 episodes, print the total reward of the current episode.
After the 5000 episodes, extract the greedy policy and render it twice.
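To make the update step in the walkthrough above explicit, here is the tabular rule pulled out into standalone helpers. This is a sketch; the helper names are mine, and the state s is the discretized pair (a, b):

    import numpy as np

    def q_learning_update(q, s, action, reward, s_next, eta, gamma):
        # off-policy: bootstrap from the best next action
        target = reward + gamma * np.max(q[s_next])
        q[s][action] += eta * (target - q[s][action])

    def sarsa_update(q, s, action, reward, s_next, action_next, eta, gamma):
        # on-policy: bootstrap from the action the policy actually picks next
        target = reward + gamma * q[s_next][action_next]
        q[s][action] += eta * (target - q[s][action])

    # e.g. q_learning_update(q_table, (a, b), action, reward, (a_, b_), eta, gamma)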