Tensorlayer Deep Reinforcement Learning: Introduction to FrozenLake and a Tabular Q-Learning Solution


Tensorlayer Deep Reinforcement Learning series:

1. Tensorlayer Deep Reinforcement Learning: Installing Tensorlayer

2.4 The gym Reinforcement Learning Environments

  This part mainly describes what the environments provided by gym look like.

2.4.1 Installation

  For installing Gym and related notes, see this article (https://mp.weixin.qq.com/s?__biz=MzU1OTkwNzk4NQ==&mid=2247484108&idx=1&sn=0c9ff7488185c6287fbe56a3fa24a286&chksm=fc115732cb66de24dab450f458cc39effea9ffe4441010d5d3e00078badcdf132a54eb5388ba&token=366879770&lang=zh_CN#rd); it will not be repeated here.

2.4.2 FrozenLake-v0

2.4.2.1 Description

  FrozenLake-v0 is a 4×4 grid of tiles, where each tile is the start tile, the goal tile, a frozen tile, or a hole. The objective is for the agent to learn to move from the start tile to the goal tile without stepping into a hole. The agent can choose to move up, down, left, or right, but a gust of wind may occasionally blow it onto a tile it did not choose. Under these conditions a policy that is perfect at every step is impossible, but learning to avoid the holes and reach the goal is certainly feasible.
  To put it more informally: winter is here, and while you and your friends were tossing a frisbee around in the park, you threw it out into the middle of the lake. The water is mostly frozen, but the ice has melted in a few spots, leaving holes. If you step into one of these holes, you fall into the freezing water. Since there is no spare frisbee, you have to make your way across the lake and retrieve the disc. However, the ice is slippery, so you will not always move in the direction you intend.

  The ice surface can be described by the following grid:

SFFF (S: starting point, safe)
FHFH (F: frozen surface, safe)
FFFH (H: hole, fall to your doom)
HFFG (G: goal, where the frisbee is located)

  The episode ends when you reach the goal or fall into a hole. Reaching the goal yields a reward of 1; otherwise the reward is 0.
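  Before looking at the environment's source code, it helps to see how it is driven through the standard gym interface. Below is a minimal sketch with a random policy (the variable names are illustrative; the observation is simply the tile index 0–15, and the old 4-tuple step API is assumed, matching the code used in this article):

import gym

env = gym.make('FrozenLake-v0')        # 4x4 slippery map by default
s = env.reset()                        # start state: index of the 'S' tile (0)
done, total_reward = False, 0.0
while not done:
    a = env.action_space.sample()      # random action: 0=Left, 1=Down, 2=Right, 3=Up
    s, r, done, info = env.step(a)     # old gym API: (obs, reward, done, info)
    total_reward += r
env.render()                           # print the board with the current tile highlighted
print('episode return:', total_reward) # 1.0 only if the goal tile was reached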

2.4.2.2 Code

import sys
from contextlib import closing

import numpy as np
from six import StringIO, b

from gym import utils
from gym.envs.toy_text import discrete

LEFT = 0
DOWN = 1
RIGHT = 2
UP = 3

MAPS = {
    "4x4": [
        "SFFF",
        "FHFH",
        "FFFH",
        "HFFG"
    ],
    "8x8": [
        "SFFFFFFF",
        "FFFFFFFF",
        "FFFHFFFF",
        "FFFFFHFF",
        "FFFHFFFF",
        "FHHFFFHF",
        "FHFFHFHF",
        "FFFHFFFG"
    ],
}


def generate_random_map(size=8, p=0.8):
    """Generates a random valid map (one that has a path from start to goal)
    :param size: size of each side of the grid
    :param p: probability that a tile is frozen
    """
    valid = False

    # DFS to check that it's a valid path.
    def is_valid(res):
        frontier, discovered = [], set()
        frontier.append((0,0))
        while frontier:
            r, c = frontier.pop()
            if not (r,c) in discovered:
                discovered.add((r,c))
                directions = [(1, 0), (0, 1), (-1, 0), (0, -1)]
                for x, y in directions:
                    r_new = r + x
                    c_new = c + y
                    if r_new < 0 or r_new >= size or c_new < 0 or c_new >= size:
                        continue
                    if res[r_new][c_new] == 'G':
                        return True
                    if (res[r_new][c_new] not in '#H'):
                        frontier.append((r_new, c_new))
        return False

    while not valid:
        p = min(1, p)
        res = np.random.choice(['F', 'H'], (size, size), p=[p, 1-p])
        res[0][0] = 'S'
        res[-1][-1] = 'G'
        valid = is_valid(res)
    return ["".join(x) for x in res]


class FrozenLakeEnv(discrete.DiscreteEnv):
    """
    Winter is here. You and your friends were tossing around a frisbee at the park
    when you made a wild throw that left the frisbee out in the middle of the lake.
    The water is mostly frozen, but there are a few holes where the ice has melted.
    If you step into one of those holes, you'll fall into the freezing water.
    At this time, there's an international frisbee shortage, so it's absolutely imperative that
    you navigate across the lake and retrieve the disc.
    However, the ice is slippery, so you won't always move in the direction you intend.
    The surface is described using a grid like the following
        SFFF
        FHFH
        FFFH
        HFFG
    S : starting point, safe
    F : frozen surface, safe
    H : hole, fall to your doom
    G : goal, where the frisbee is located
    The episode ends when you reach the goal or fall in a hole.
    You receive a reward of 1 if you reach the goal, and zero otherwise.
    """

    metadata = {'render.modes': ['human', 'ansi']}

    def __init__(self, desc=None, map_name="4x4",is_slippery=True):
        if desc is None and map_name is None:
            desc = generate_random_map()
        elif desc is None:
            desc = MAPS[map_name]
        self.desc = desc = np.asarray(desc,dtype='c')
        self.nrow, self.ncol = nrow, ncol = desc.shape
        self.reward_range = (0, 1)

        nA = 4
        nS = nrow * ncol

        isd = np.array(desc == b'S').astype('float64').ravel()
        isd /= isd.sum()

        P = {s : {a : [] for a in range(nA)} for s in range(nS)}

        def to_s(row, col):
            return row*ncol + col

        def inc(row, col, a):
            if a == LEFT:
                col = max(col-1,0)
            elif a == DOWN:
                row = min(row+1,nrow-1)
            elif a == RIGHT:
                col = min(col+1,ncol-1)
            elif a == UP:
                row = max(row-1,0)
            return (row, col)

        for row in range(nrow):
            for col in range(ncol):
                s = to_s(row, col)
                for a in range(4):
                    li = P[s][a]
                    letter = desc[row, col]
                    if letter in b'GH':
                        li.append((1.0, s, 0, True))
                    else:
                        if is_slippery:
                            for b in [(a-1)%4, a, (a+1)%4]:
                                newrow, newcol = inc(row, col, b)
                                newstate = to_s(newrow, newcol)
                                newletter = desc[newrow, newcol]
                                done = bytes(newletter) in b'GH'
                                rew = float(newletter == b'G')
                                li.append((1.0/3.0, newstate, rew, done))
                        else:
                            newrow, newcol = inc(row, col, a)
                            newstate = to_s(newrow, newcol)
                            newletter = desc[newrow, newcol]
                            done = bytes(newletter) in b'GH'
                            rew = float(newletter == b'G')
                            li.append((1.0, newstate, rew, done))

        super(FrozenLakeEnv, self).__init__(nS, nA, P, isd)

    def render(self, mode='human'):
        outfile = StringIO() if mode == 'ansi' else sys.stdout

        row, col = self.s // self.ncol, self.s % self.ncol
        desc = self.desc.tolist()
        desc = [[c.decode('utf-8') for c in line] for line in desc]
        desc[row][col] = utils.colorize(desc[row][col], "red", highlight=True)
        if self.lastaction is not None:
            outfile.write("  ({})\n".format(["Left","Down","Right","Up"][self.lastaction]))
        else:
            outfile.write("\n")
        outfile.write("\n".join(''.join(line) for line in desc)+"\n")

        if mode != 'human':
            with closing(outfile):
                return outfile.getvalue()
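  Because FrozenLakeEnv precomputes the full transition table P, the dynamics can be inspected directly. A small sketch (env.unwrapped is used to get past gym's TimeLimit wrapper; exact keyword support in gym.make depends on the gym version):

import gym

env = gym.make('FrozenLake-v0')
P = env.unwrapped.P    # P[state][action] -> list of (prob, next_state, reward, done)

# Transitions for taking RIGHT (action 2) from the start tile (state 0):
# with is_slippery=True the intended move splits into three equally likely outcomes.
for prob, next_state, reward, done in P[0][2]:
    print(prob, next_state, reward, done)

# A deterministic variant can usually be created with gym.make('FrozenLake-v0', is_slippery=False).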

2.5 Reinforcement Learning Algorithms

2.5.1 Tabular Q-Learning

2.5.1.1 Code

  The principle behind tabular Q-learning is not repeated here; see the earlier article (Chapter 5: Model-Free Prediction and Control with Temporal Difference and Q-Learning, Part 1, https://mp.weixin.qq.com/s?__biz=MzU1OTkwNzk4NQ==&mid=2247484656&idx=1&sn=a0804ea632ff65b4f629dca5d4d23574&chksm=fc11510ecb66d818d8b91b7043254d5fe807fe123be4271caaca27282425f6b52952790f661e&token=366879770&lang=zh_CN#rd). The full training and testing script is listed below.
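  For reference, the update applied at every step of the script is the one-step TD rule Q(S, A) <- Q(S, A) + alpha * (R + lambda * max_a Q(S', a) - Q(S, A)). A minimal sketch of that single update (the helper name q_update and its defaults are illustrative, not part of the original script; the parameter names mirror lr and lambd used below):

import numpy as np

def q_update(Q, s, a, r, s1, lr=0.85, lambd=0.99):
    # One tabular Q-learning update for the transition (s, a, r, s1).
    td_target = r + lambd * np.max(Q[s1, :])   # bootstrap from the greedy value of the next state
    Q[s, a] += lr * (td_target - Q[s, a])      # move Q(s, a) toward the TD target
    return Q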

"""Q-Table learning algorithm.
Non deep learning - TD Learning, Off-Policy, e-Greedy Exploration
Q(S, A) <- Q(S, A) + alpha * (R + lambda * Q(newS, newA) - Q(S, A))
See David Silver RL Tutorial Lecture 5 - Q-Learning for more details.
For Q-Network, see tutorial_frozenlake_q_network.py
EN: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0#.5m3361vlw
CN: https://zhuanlan.zhihu.com/p/25710327
tensorflow==2.0.0a0
tensorlayer==2.0.0
"""

import argparse
import os
import time

import gym
import matplotlib.pyplot as plt
import numpy as np

parser = argparse.ArgumentParser()
parser.add_argument('--train', dest='train', action='store_true', default=True)
parser.add_argument('--test', dest='test', action='store_true', default=True)

parser.add_argument(
    '--save_path', default=None, help='folder to save if mode == train else model path,'
    'qnet will be saved once target net update'
)
parser.add_argument('--seed', help='random seed', type=int, default=0)
parser.add_argument('--env_id', default='FrozenLake-v0')
args = parser.parse_args()

## Load the environment

alg_name = 'Qlearning'
env_id = args.env_id
env = gym.make(env_id)
render = True # display the game environment

##================= Implement Q-Table learning algorithm =====================##

## Initialize table with all zeros

Q = np.zeros([env.observation_space.n, env.action_space.n])

## Set learning parameters

lr = .85 # alpha, if use value function approximation, we can ignore it
lambd = .99 # decay factor
num_episodes = 10000
t0 = time.time()

if args.train:
    all_episode_reward = []
    for i in range(num_episodes):
        ## Reset environment and get first new observation
        s = env.reset()
        rAll = 0
        ## The Q-Table learning algorithm
        for j in range(99):
            if render: env.render()
            ## Choose an action by greedily (with noise) picking from Q table
            a = np.argmax(Q[s, :] + np.random.randn(1, env.action_space.n) * (1. / (i + 1)))
            ## Get new state and reward from environment
            s1, r, d, _ = env.step(a)
            ## Update Q-Table with new knowledge
            Q[s, a] = Q[s, a] + lr * (r + lambd * np.max(Q[s1, :]) - Q[s, a])
            rAll += r
            s = s1
            if d is True:
                break
        print(
            'Training | Episode: {}/{} | Episode Reward: {:.4f} | Running Time: {:.4f}'.format(
                i + 1, num_episodes, rAll,
                time.time() - t0
            )
        )
        if i == 0:
            all_episode_reward.append(rAll)
        else:
            all_episode_reward.append(all_episode_reward[-1] * 0.9 + rAll * 0.1)

    # save
    path = os.path.join('model', '_'.join([alg_name, env_id]))
    if not os.path.exists(path):
        os.makedirs(path)
    np.save(os.path.join(path, 'Q_table.npy'), Q)

    plt.plot(all_episode_reward)
    if not os.path.exists('image'):
        os.makedirs('image')
    plt.savefig(os.path.join('image', '_'.join([alg_name, env_id])))

    # print("Final Q-Table Values:\n %s" % Q)

if args.test:
    path = os.path.join('model', '_'.join([alg_name, env_id]))
    Q = np.load(os.path.join(path, 'Q_table.npy'))
    for i in range(num_episodes):
        ## Reset environment and get first new observation
        s = env.reset()
        rAll = 0
        ## The Q-Table learning algorithm
        for j in range(99):
            ## Choose an action greedily from the Q table
            a = np.argmax(Q[s, :])
            ## Get new state and reward from environment
            s1, r, d, _ = env.step(a)
            rAll += r
            s = s1
            if d is True:
                break
        print(
            'Testing | Episode: {}/{} | Episode Reward: {:.4f} | Running Time: {:.4f}'.format(
                i + 1, num_episodes, rAll,
                time.time() - t0
            )
        )
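  The test loop above prints per-episode returns. Since each episode pays at most 1, the success rate over a batch of greedy test episodes can be aggregated into a single number, as in the short sketch below (n_test is an arbitrary choice, not part of the original script):

n_test = 100
successes = 0
for _ in range(n_test):
    s = env.reset()
    for _ in range(99):
        s, r, d, _ = env.step(np.argmax(Q[s, :]))   # always act greedily w.r.t. the learned Q-table
        if d:
            successes += r                          # terminal reward is 1 only when the goal is reached
            break
print('success rate: {:.2f}'.format(successes / n_test))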

2.5.1.2 Experimental Results

  The final learned Q-table is as follows:

[[6.20622965e-01 8.84762425e-03 3.09373823e-03 6.55067399e-03]
 [6.49198039e-04 3.04069914e-04 8.78667903e-04 5.91638052e-01]
 [1.92065690e-03 4.33985167e-01 3.49151873e-03 1.97126703e-03]
 [2.70187111e-03 0.00000000e+00 0.00000000e+00 4.35444853e-01]
 [6.34931610e-01 1.09286085e-04 1.86982907e-03 2.76783612e-04]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [6.48093009e-07 1.13896350e-04 1.65719637e-01 1.90614063e-05]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 3.84251979e-03 1.48921362e-03 7.46942896e-01]
 [0.00000000e+00 8.03386378e-01 6.92688383e-04 0.00000000e+00]
 [8.40889312e-01 9.86082253e-06 1.25967676e-04 6.83892296e-05]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 9.61587991e-01 6.98637543e-03]
 [0.00000000e+00 9.99905944e-01 0.00000000e+00 0.00000000e+00]
 [0.00000000e+00 0.00000000e+00 0.00000000e+00 0.00000000e+00]]

  The cumulative reward per episode during training is shown below:

Figure 4: Cumulative reward per episode

  The success rates of three test runs are as follows:

Figure 5: Test success rate