Reinforcement Learning - MDPs - Dynamic Programming (Part 1)

  1. MDP

MDP - Markov Decision Process:
A Markov Decision Process is defined by a five-tuple M = <S, A, P, R, γ>

S: the set of states; s ∈ S, and s_i denotes the state at step i.
A: the set of actions; a ∈ A, and a_i denotes the action at step i.
P_sa: the state transition probabilities. P_sa describes the distribution over next states when action a ∈ A is taken in the current state s ∈ S. For example, the probability of moving to s' after taking action a in state s is written p(s'|s, a).
R: S×A ⟼ R is the reward function. Some reward functions depend only on the state, in which case they simplify to R: S ⟼ R. If the pair (s, a) leads to the next state s', the reward can be written r(s'|s, a); if the next state s' of (s, a) is unique, it can also be written r(s, a).
γ: the discount factor, with a value between 0 and 1. It controls how much future rewards contribute to the value of the current state: γ = 1 means future rewards count as much as immediate ones, and γ = 0 means only the immediate reward matters.
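
To make the five-tuple concrete, here is a minimal sketch of a toy two-state MDP written as plain Python data structures; the names states, actions, P and expected_reward are illustrative only, but the P[s][a] = [(prob, next_state, reward, done)] layout is the same one the gridworld environment later in this post uses.

# A toy 2-state MDP written as plain Python data structures (illustrative only).
states = [0, 1]          # S
actions = [0, 1]         # A: e.g. 0 = "stay", 1 = "move"
gamma = 0.9              # discount factor

# P[s][a] -> list of (prob, next_state, reward, done), same layout as env.P below
P = {
    0: {0: [(1.0, 0, 0.0, False)],
        1: [(0.8, 1, 1.0, True), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, True)],
        1: [(1.0, 1, 0.0, True)]},
}

# Expected immediate reward r(s, a), recovered from P
def expected_reward(s, a):
    return sum(prob * reward for prob, _, reward, _ in P[s][a])

print(expected_reward(0, 1))  # 0.8 * 1.0 + 0.2 * 0.0 = 0.8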

  2. Policy Evaluation

Iterative Policy Evaluation:
Problem: evaluate a given policy π.
Solution: apply the Bellman expectation equation as an iterative backup.
Concretely: in each iteration, the values v_k(s') obtained for all states in iteration k are used to compute the value v_{k+1}(s) in iteration k+1. Repeating this update, the values eventually converge to the true value function v_π(s) of the policy. The iteration formula is as follows:
v_{k+1}(s) = Σ_{a∈A} π(a|s) Σ_{s'∈S} p(s'|s,a) [ r(s'|s,a) + γ · v_k(s') ]
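
As a quick illustration of this update (kept separate from the gridworld code below), here is a minimal vectorized sketch. It assumes the dynamics are packed into numpy arrays P of shape (S, A, S) and R of shape (S, A) holding expected immediate rewards, and the policy pi into an (S, A) matrix; these names and the array layout are assumptions for this sketch only.

import numpy as np

def iterative_policy_evaluation(pi, P, R, gamma=1.0, epsilon=1e-5):
    # pi: (S, A) policy, P: (S, A, S) transition probs, R: (S, A) expected rewards
    S = P.shape[0]
    v = np.zeros(S)
    while True:
        q = R + gamma * (P @ v)         # q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] * v_k[s']
        v_new = np.sum(pi * q, axis=1)  # v_{k+1}(s) = sum_a pi(a|s) * q(s, a)
        if np.max(np.abs(v_new - v)) < epsilon:
            return v_new
        v = v_new

Fed with the 4x4 gridworld dynamics below, a uniform policy and γ = 1, this should reproduce the same values as the loop-based policy_eval in policyEvaluation.py.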

  3. Policy Evaluation Python sample

A simple example:
[Figure: a 4x4 gridworld; the two gray cells in the top-left and bottom-right corners are the terminal states]

a. Immediate reward: the figure shows a grid whose top-left and bottom-right corners are the terminal states with reward 0; every other state gives a reward of -1.
b. State space: except for the two gray cells, every cell is a non-terminal state.
c. Action space: in every state there are four actions: up, down, left and right.
d. Transition probabilities: any action that would move the agent off the grid leaves the state unchanged (the agent stays in place); otherwise the agent moves directly to the neighbouring state. The transitions are therefore deterministic.
e. Discount factor: γ = 1.
f. Random policy: in every state the agent follows a uniform random policy, i.e. its action is chosen uniformly at random (in code this is just a uniform matrix, as shown in the snippet after the formula below):
π(a|s) = 0.25 for every state s and each of the four actions a
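
In code this uniform random policy is nothing more than an (nS, nA) matrix filled with 0.25, exactly as it is built at the bottom of policyEvaluation.py:

import numpy as np

nS, nA = 16, 4                          # 4x4 grid, four actions
random_policy = np.ones([nS, nA]) / nA  # pi(a|s) = 0.25 for every state and action
print(random_policy[1])                 # [0.25 0.25 0.25 0.25]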

We want to compute the value v(s) of every state under this random policy:

Steps with code:
a. Build the grid with <S, A, P, R>
The data structure is shown below; the two gray cells correspond to states 0 and 15,
and the four directions are numbered 0, 1, 2, 3 for up, right, down, left.
For state 1, for example, the entry for moving down is 1: {2: [(1.0, 5, -1.0, False)], ...}, where each tuple means (prob, next_state, reward, done):
moving down leads to state 5 with probability 1.0, the reward is -1, and state 5 is not a terminal state (not done).
<class 'dict'>: {0: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 0, 0.0, True)], 2: [(1.0, 0, 0.0, True)], 3: [(1.0, 0, 0.0, True)]}, 1: {0: [(1.0, 1, -1.0, False)], 1: [(1.0, 2, -1.0, False)], 2: [(1.0, 5, -1.0, False)], 3: [(1.0, 0, -1.0, True)]}…
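
You can inspect this structure yourself by constructing the environment and printing the transition table for a single state (this simply reads the P dict built in gridworld.py below):

from gridworld import GridworldEnv

env = GridworldEnv()
print(env.P[1])
# prints: {0: [(1.0, 1, -1.0, False)], 1: [(1.0, 2, -1.0, False)], 2: [(1.0, 5, -1.0, False)], 3: [(1.0, 0, -1.0, True)]}
# up (0): top edge, stay in state 1; right (1): state 2; down (2): state 5; left (3): terminal state 0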

b. For each state, compute the expected return contributed by the actions in every direction and update the state value:
Initially all V(s) are 0.
Let's work through V(1) in the first sweep:
點1的相關數據: 1: {0: [(1.0, 1, -1.0, False)], 1: [(1.0, 2, -1.0, False)], 2: [(1.0, 5, -1.0, False)], 3: [(1.0, 0, -1.0, True)]}

Up: probability 25%, but state 1 is on the top edge, so the agent stays where it is; the reward is -1, so the contribution is 0.25 * (-1 + 1 * V[1]) = -0.25.
Each of the other three directions contributes -0.25 in the same way.
Adding the four contributions gives -1.
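
A quick sanity check of this arithmetic, using the transition data for state 1 shown above:

# First sweep for state 1: every V starts at 0, gamma = 1, each action has probability 0.25
transitions = [(1.0, 1, -1.0), (1.0, 2, -1.0), (1.0, 5, -1.0), (1.0, 0, -1.0)]  # (prob, next_state, reward)
V = [0.0] * 16
v1 = sum(0.25 * prob * (reward + 1.0 * V[next_state])
         for prob, next_state, reward in transitions)
print(v1)  # -1.0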
Code:

            for a in range(env.nA):
                # get transitions
                [(prob, next_state, reward, done)] = env.P[s][a]
                # apply bellman expectation eqn
                Q += action_probs[a] * (reward + discount_factor * V[next_state])

From here the same state-value computation is repeated for every state (0 through 15), and the whole sweep is looped until the values converge. Here are the state values after each sweep; with the convergence threshold set to epsilon=0.00001, it takes 215 sweeps to converge:
Iteration 1: [ 0. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. -1. 0.]
Iteration 2: [ 0. -1.75 -2. -2. -1.75 -2. -2. -2. -2. -2. -2. -1.75
-2. -2. -1.75 0. ]
Iteration 3: [ 0. -2.4375 -2.9375 -3. -2.4375 -2.875 -3. -2.9375 -2.9375
-3. -2.875 -2.4375 -3. -2.9375 -2.4375 0. ]
Iteration 4: [ 0. -3.0625 -3.84375 -3.96875 -3.0625 -3.71875 -3.90625 -3.84375
-3.84375 -3.90625 -3.71875 -3.0625 -3.96875 -3.84375 -3.0625 0. ]
Iteration 5: [ 0. -3.65625 -4.6953125 -4.90625 -3.65625 -4.484375
-4.78125 -4.6953125 -4.6953125 -4.78125 -4.484375 -3.65625
-4.90625 -4.6953125 -3.65625 0. ]
Iteration 6: [ 0. -4.20898438 -5.50976562 -5.80078125 -4.20898438 -5.21875
-5.58984375 -5.50976562 -5.50976562 -5.58984375 -5.21875 -4.20898438
-5.80078125 -5.50976562 -4.20898438 0. ]
...
Iteration 214: [ 0. -13.99988715 -19.99983277 -21.99981286 -13.99988715
-17.99985268 -19.99983389 -19.99983277 -19.99983277 -19.99983389
-17.99985268 -13.99988715 -21.99981286 -19.99983277 -13.99988715
0. ]
Iteration 215: [ 0. -13.99989315 -19.99984167 -21.99982282 -13.99989315
-17.99986052 -19.99984273 -19.99984167 -19.99984167 -19.99984273
-17.99986052 -13.99989315 -21.99982282 -19.99984167 -13.99989315
0. ]
Final value function returned by policy_eval (printed by print(v)):
[ 0. -13.99989315 -19.99984167 -21.99982282 -13.99989315
-17.99986052 -19.99984273 -19.99984167 -19.99984167 -19.99984273
-17.99986052 -13.99989315 -21.99982282 -19.99984167 -13.99989315
0. ]
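
One way to convince yourself that this has really converged is to plug the final values back into the Bellman expectation equation and check that one more backup barely changes anything. A small sketch, assuming env, random_policy and the converged vector v from the script below are already in scope:

# Apply one more Bellman backup to the converged v and measure the largest change
residual = 0.0
for s in range(env.nS):
    q = sum(random_policy[s][a] * prob * (reward + 1.0 * v[next_state])
            for a in range(env.nA)
            for prob, next_state, reward, done in env.P[s][a])
    residual = max(residual, abs(q - v[s]))
print(residual)  # tiny: on the order of the epsilon used in policy_eval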

The corresponding Python code:
gridworld.py:

import numpy as np
import sys
from io import StringIO
from gym.envs.toy_text import discrete

UP = 0
RIGHT = 1
DOWN = 2
LEFT = 3

class GridworldEnv(discrete.DiscreteEnv):
    """
    Grid World environment from Sutton's Reinforcement Learning book chapter 4.
    You are an agent on an MxN grid and your goal is to reach the terminal
    state at the top left or the bottom right corner.
    For example, a 4x4 grid looks as follows:
    T  o  o  o
    o  x  o  o
    o  o  o  o
    o  o  o  T
    x is your position and T are the two terminal states.
    You can take actions in each direction (UP=0, RIGHT=1, DOWN=2, LEFT=3).
    Actions going off the edge leave you in your current state.
    You receive a reward of -1 at each step until you reach a terminal state.
    """

    metadata = {'render.modes': ['human', 'ansi']}

    def __init__(self, shape=[4,4]):
        if not isinstance(shape, (list, tuple)) or not len(shape) == 2:
            raise ValueError('shape argument must be a list/tuple of length 2')

        self.shape = shape

        nS = np.prod(shape)
        nA = 4

        MAX_Y = shape[0]
        MAX_X = shape[1]

        P = {}
        # Form the grid 4 * 4
        grid = np.arange(nS).reshape(shape)
        it = np.nditer(grid, flags=['multi_index'])

        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index # (y, x) coordinates of the current cell

            P[s] = {a : [] for a in range(nA)}

            is_done = lambda s: s == 0 or s == (nS - 1) # top-left and bottom-right corners are terminal
            reward = 0.0 if is_done(s) else -1.0

            # We're stuck in a terminal state
            if is_done(s):
                P[s][UP] = [(1.0, s, reward, True)]
                P[s][RIGHT] = [(1.0, s, reward, True)]
                P[s][DOWN] = [(1.0, s, reward, True)]
                P[s][LEFT] = [(1.0, s, reward, True)]
            # Not a terminal state
            else:
                ns_up = s if y == 0 else s - MAX_X #at the top edge, moving up stays at s; otherwise s - MAX_X
                ns_right = s if x == (MAX_X - 1) else s + 1 #at the right edge, moving right stays at s; otherwise s + 1
                ns_down = s if y == (MAX_Y - 1) else s + MAX_X #at the bottom edge, moving down stays at s; otherwise s + MAX_X
                ns_left = s if x == 0 else s - 1 #at the left edge, moving left stays at s; otherwise s - 1
                P[s][UP] = [(1.0, ns_up, reward, is_done(ns_up))]
                P[s][RIGHT] = [(1.0, ns_right, reward, is_done(ns_right))]
                P[s][DOWN] = [(1.0, ns_down, reward, is_done(ns_down))]
                P[s][LEFT] = [(1.0, ns_left, reward, is_done(ns_left))]

            it.iternext()

        # Initial state distribution is uniform
        isd = np.ones(nS) / nS

        # We expose the model of the environment for educational purposes
        # This should not be used in any model-free learning algorithm
        self.P = P

        super(GridworldEnv, self).__init__(nS, nA, P, isd)

        print(isd)

    def _render(self, mode='human', close=False):
        if close:
            return

        outfile = StringIO() if mode == 'ansi' else sys.stdout

        grid = np.arange(self.nS).reshape(self.shape)
        it = np.nditer(grid, flags=['multi_index'])
        while not it.finished:
            s = it.iterindex
            y, x = it.multi_index

            if self.s == s:
                output = " x "
            elif s == 0 or s == self.nS - 1:
                output = " T "
            else:
                output = " o "

            if x == 0:
                output = output.lstrip()
            if x == self.shape[1] - 1:
                output = output.rstrip()

            outfile.write(output)

            if x == self.shape[1] - 1:
                outfile.write("\n")

            it.iternext()

policyEvaluation.py

import numpy as np
from gridworld import GridworldEnv

env = GridworldEnv()


def policy_eval(policy, env, discount_factor=1.0, epsilon=0.00001):
    """
    Evaluate a policy given an environment and a full description of the environment's dynamics.

    Args:
        policy: [S, A] shaped matrix representing the policy.
        env: OpenAI env. env.P represents the transition probabilities of the environment.
            env.P[s][a] is a list of transition tuples (prob, next_state, reward, done).
            env.nS is a number of states in the environment.
            env.nA is a number of actions in the environment.
        discount_factor: Gamma discount factor.
        epsilon: We stop evaluation once the value function changes by less than epsilon for all states.

    Returns:
        Vector of length env.nS representing the value function.
    """
    # Start with an all-zero value function
    V = np.zeros(env.nS)
    iterCount = 0

    while True:

        # new value function computed in this sweep
        V_new = np.zeros(env.nS)
        # stopping condition
        delta = 0
        iterCount += 1

        # loop over state space
        for s in range(env.nS):

            # To accumulate the Bellman expectation eqn
            Q = 0
            # get probability distribution over actions
            action_probs = policy[s]

            # loop over possible actions
            for a in range(env.nA):
                # get transitions
                [(prob, next_state, reward, done)] = env.P[s][a]
                # apply bellman expectation eqn
                Q += action_probs[a] * (reward + discount_factor * V[next_state])

            # get the biggest difference over state space
            delta = max(delta, abs(Q - V[s]))

            # update state-value
            V_new[s] = Q

        # adopt the new value function
        V = V_new

        print("第" + str(iterCount) + "次:" + str(V))

        # stop once the value function has converged
        if (delta < epsilon):
            break

    return np.array(V)

random_policy = np.ones([env.nS, env.nA]) / env.nA
v = policy_eval(random_policy, env)

expected_v = np.array([0, -14, -20, -22, -14, -18, -20, -20, -20, -20, -18, -14, -22, -20, -14, 0])
np.testing.assert_array_almost_equal(v, expected_v, decimal=2)

print(v)
print(expected_v)
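
As a small follow-up, reshaping the value vector back into the 4x4 grid layout makes the result easier to read; this could be appended to policyEvaluation.py:

# Rounded values arranged on the grid
print(np.round(v).reshape(env.shape))
# [[  0. -14. -20. -22.]
#  [-14. -18. -20. -20.]
#  [-20. -20. -18. -14.]
#  [-22. -20. -14.   0.]]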

