Deep Reinforcement Learning Series (4): Q-Learning Principles and Implementation

Original

J.Q.Wang2011

2020-07-03 21:00

Paper: http://www.gatsby.ucl.ac.uk/~dayan/papers/cjch.pdf

Q-Learning, introduced in 1989, is a classic value-based, model-free, off-policy algorithm. More recent algorithms such as DQN build directly on it, replacing the table with a neural network.

1. Introduction

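In tabular reinforcement learning, the value estimates learned from the sequence of sampled transitions are stored in a table. The agent learns by looking up entries in this table and acting greedily with respect to them, so as to maximize the Q-value function.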

2. Principle and Derivation

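In Q-Learning, Q(s, a) is the expected return of taking action a in state s at a given time step; the environment responds to the agent's action with a reward. The core idea is to arrange states and actions into a Q_table that stores the Q-values, and then to select the action with the largest Q-value, as shown in the table: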

Q-Table	$a_{1}$	$a_{2}$
$s_{1}$	$Q(s_{1},a_{1})$	$Q(s_{1},a_{2})$
$s_{2}$	$Q(s_{2},a_{1})$	$Q(s_{2},a_{2})$
$s_{3}$	$Q(s_{3},a_{1})$	$Q(s_{3},a_{2})$
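
To make the lookup concrete, here is a minimal sketch of such a table as a nested dictionary. The numeric Q-values are made up purely for illustration, and `greedy_action` is a hypothetical helper name, not part of the original code.

```python
# Hypothetical Q-values for 3 states x 2 actions, laid out like the table above.
q_table = {
    "s1": {"a1": 0.0, "a2": 0.5},
    "s2": {"a1": 0.2, "a2": 0.1},
    "s3": {"a1": 0.3, "a2": 0.3},
}


def greedy_action(state):
    """Return the action with the largest Q-value in the given state."""
    row = q_table[state]
    # max() keeps the first maximal key, so ties resolve to the first action listed.
    return max(row, key=row.get)


for s in q_table:
    print(s, "->", greedy_action(s))
```

Note that this sketch breaks ties deterministically; the implementation later in the article breaks ties at random instead.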

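The main strength of Q-learning is that it uses temporal-difference (TD) learning, which combines ideas from Monte Carlo methods and dynamic programming, and that it can learn off-policy. The Bellman equation is used to solve for the optimal policy of the Markov decision process.

[Figure: Q-learning algorithm pseudocode]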

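As the pseudocode shows, within each episode the action is selected with an ε-greedy policy (mostly greedy, with occasional random actions for exploration), and learning proceeds by updating the $Q$ value; the $\max \limits_{a}$ operation in this update is the key step. The implementation in the next section follows the order of the pseudocode closely.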
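
For reference, the update that contains the $\max$ operation is conventionally written as follows. The symbols $\alpha$ (learning rate), $\gamma$ (discount factor), $s'$ (next state) and $a'$ (next action) are the standard textbook notation and are assumed here; they correspond to `learning_rate`, `reward_decay`, `s_` and the row-wise maximum in the code of Section 3.

$$
Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right]
$$

When $s'$ is terminal, the bracketed target reduces to $r - Q(s, a)$, since there is no future value to bootstrap from.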

3. Code Reproduction

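This article follows Morvan's code and uses Q-learning to solve a maze: a red block (the robot) moves up, down, left, or right until it finds the yellow circle (the treasure), while black blocks are obstacles.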

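Analysis: the robot has four possible actions (up, down, left, right), and choose_action picks one of them. They are stored in self.actions below, which is essentially a list.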
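
As a rough illustration of what "a list of actions" can look like, the sketch below maps action indices to grid moves. The index order, the `MOVES` table, the `move` helper and the 4 x 4 grid size are all assumptions made for this example; they are not taken from the Maze environment.

```python
# Hypothetical mapping from action index to a (dx, dy) move on the grid.
# The index order is an assumption for illustration only.
MOVES = {0: (0, -1), 1: (0, 1), 2: (1, 0), 3: (-1, 0)}  # up, down, right, left
actions = list(MOVES)  # [0, 1, 2, 3], the kind of list passed as `actions`


def move(position, action, width=4, height=4):
    """Apply an action to an (x, y) cell, staying inside a width x height grid."""
    dx, dy = MOVES[action]
    x = min(max(position[0] + dx, 0), width - 1)
    y = min(max(position[1] + dy, 0), height - 1)
    return (x, y)


print(move((1, 1), 3))  # one step to the left of (1, 1)
```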

Step 1: Build the Q-table, action selection, and Q-value update

    
import numpy as np
import pandas as pd


class QLearningTable:
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        self.actions = actions  # a list
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon = e_greedy
        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)
        # Create an empty table with one column per action in self.actions; rows (states) are added on demand

    # Select an action for the given observation (epsilon-greedy)
    def choose_action(self, observation):
        self.check_state_exist(observation)
        # Draw a sample from a uniform distribution on [0, 1) (np.random.uniform)
        if np.random.uniform() < self.epsilon:
            # Exploit with probability epsilon: read this state's row of Q-values via .loc
            state_action = self.q_table.loc[observation, :]
            # Several actions may share the maximum value; choose randomly among them
            action = np.random.choice(state_action[state_action == np.max(state_action)].index)
        else:
            # Explore: choose a random action
            action = np.random.choice(self.actions)
        return action

    def learn(self, s, a, r, s_):
        self.check_state_exist(s_)
        q_predict = self.q_table.loc[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.loc[s_, :].max()  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)  # update

    def check_state_exist(self, state):
        if state not in self.q_table.index:
            # Add a zero-initialised row for the new state.
            # DataFrame.append was removed in pandas 2.0, so enlarge the table via .loc instead.
            self.q_table.loc[state] = [0.0] * len(self.actions)
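
To see what a single call to `learn` does numerically, the following sketch isolates the same arithmetic in a standalone function. `td_update` is a hypothetical helper written for this example; it uses the default `learning_rate=0.01` and `reward_decay=0.9` from the class above, and the rewards and next-state values are made up.

```python
def td_update(q_sa, reward, next_q_values, lr=0.01, gamma=0.9, terminal=False):
    """Return the updated Q(s, a) after one Q-learning step."""
    # The target bootstraps from the best next-state value unless s_ is terminal.
    q_target = reward if terminal else reward + gamma * max(next_q_values)
    return q_sa + lr * (q_target - q_sa)


# Reaching the terminal state with reward 1 from a zero-initialised entry:
# 0.0 + 0.01 * (1.0 - 0.0)
q1 = td_update(0.0, 1.0, [], terminal=True)

# A non-terminal step with reward 0 whose next state has Q-values [0.0, 0.5]:
# 0.0 + 0.01 * (0.0 + 0.9 * 0.5 - 0.0)
q2 = td_update(0.0, 0.0, [0.0, 0.5])

print(q1, q2)
```

With such a small learning rate each update moves the entry only 1% of the way toward its target, which is why many episodes are needed before the table becomes informative.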

Step 2: Write the body of the episode loop

def update():
    for episode in range(100):
        # initial observation
        observation = env.reset()
        # Loop over the steps of one episode
        while True:
            # refresh the environment display
            env.render()

            # RL choose action based on observation
            action = RL.choose_action(str(observation))

            # RL take action and get next observation and reward
            observation_, reward, done = env.step(action)

            # RL learn from this transition
            RL.learn(str(observation), action, reward, str(observation_))

            # swap observation
            observation = observation_

            # break while loop when end of this episode
            if done:
                break

    # end of game
    print('game over')
    env.destroy()
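
Because the Maze environment is not shown in this article, the loop above cannot be run on its own. The sketch below exercises the same loop structure (choose, step, learn, swap, break on done) against a tiny stand-in environment. `ChainEnv`, `train`, the dictionary-based table, and the hyperparameters `lr=0.1` and 200 episodes are all assumptions made for this example; only the `reset`/`step` interface and the `'terminal'` marker mirror the code above.

```python
import random


class ChainEnv:
    """Hypothetical stand-in environment: cells 0..4 on a line, goal at cell 4."""
    n_actions = 2  # 0 = left, 1 = right

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # Moving left from cell 0 leaves the agent in place.
        self.pos = max(0, self.pos - 1) if action == 0 else self.pos + 1
        done = self.pos == 4
        reward = 1.0 if done else 0.0
        return ('terminal' if done else self.pos), reward, done


def train(episodes=200, lr=0.1, gamma=0.9, epsilon=0.9, seed=0):
    rng = random.Random(seed)
    env = ChainEnv()
    q = {}  # state (str) -> list of Q-values, one per action

    def row(state):
        # Create a zero-initialised row the first time a state is seen.
        return q.setdefault(state, [0.0] * env.n_actions)

    for _ in range(episodes):
        s = str(env.reset())
        while True:
            values = row(s)
            if rng.random() < epsilon:
                # Exploit: pick randomly among the actions with the maximum value.
                best = max(values)
                a = rng.choice([i for i, v in enumerate(values) if v == best])
            else:
                # Explore: pick any action uniformly at random.
                a = rng.randrange(env.n_actions)
            s_, r, done = env.step(a)
            s_ = str(s_)
            # Same target as in learn(): no bootstrapping from the terminal state.
            target = r if s_ == 'terminal' else r + gamma * max(row(s_))
            values[a] += lr * (target - values[a])
            s = s_  # swap observation
            if done:
                break
    return q


q = train()
for state in sorted(q):
    print(state, [round(v, 3) for v in q[state]])
```

After training, the value of moving right should exceed the value of moving left in every visited state, which is the behaviour the maze agent is expected to learn on its larger grid.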

Step 3: Write the main entry point


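if __name__ == "__main__":
    # Maze is the environment class from the referenced source code (omitted in this article)
    env = Maze()
    RL = QLearningTable(actions=list(range(env.n_actions)))
    # Schedule update() to run shortly after the environment's event loop starts
    env.after(100, update)
    env.mainloop()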

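Note: the code for the Maze environment is omitted here. In most experiments we simply use a gym environment or another existing environment; for this environment, see the full code in the reference.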

References:

MorvanZhou, GitHub (2017; full source code).
