Deep Reinforcement Learning Series (6): DQN Principles and Implementation

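Approximating value functions with a neural network can be written as:
$\hat{V}(s, w) \approx V_{\pi}(s) \\ \hat{q}(s, a, w) \approx q_{\pi}(s, a)$
So how does this work in practice, and how is it trained end to end? This post explains the Deep Q-Network (DQN), which DeepMind introduced in two papers published in 2013 and 2015: "Playing Atari with Deep Reinforcement Learning" and "Human-level Control through Deep Reinforcement Learning" (Nature).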

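In the first paper DeepMind introduced the name Deep Reinforcement Learning (DRL) and proposed the DQN algorithm, in which an agent learns to play Atari games purely from raw video frames. DeepMind then published an improved version of DQN in Nature (Human-level …), which combines deep learning with RL into a new end-to-end learning algorithm that goes from perception to action. Put simply, it works the way a human does: perceptual input, such as what the eyes see, passes through the brain (the deep neural network), which directly produces the corresponding behaviour (the output action). DeepMind later proposed AlphaZero, which combines DRL with Monte Carlo Tree Search and surpassed human-level play. The rest of this post describes DQN in detail: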

1. The DQN Algorithm

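DQN is a method that approximates the Q-learning value function with a neural network. It achieved better-than-human scores on a number of Atari 2600 games. The sections below build it up step by step: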

1.1 The Q-Learning Algorithm

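Q-learning is a model-free reinforcement learning technique proposed by Watkins in 1989. It can compare the expected utility of the available actions in a given state without needing a model of the environment, and it handles stochastic transitions and rewards without any adaptation. It has been proven that, for any finite MDP, Q-learning eventually finds an optimal policy (given that every state-action pair keeps being visited and the learning rate decays appropriately), that is, a policy that maximises the expected total return over all successive steps starting from the current state. Before learning starts, Q is initialised to an arbitrary fixed value chosen by the programmer. Then at each time step $t$ the agent selects an action $a_{t}$, receives a reward $R_t$, enters a new state $S_{t+1}$, and updates the Q value. The core is the value-function iteration: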

$Q(s_{t},a_{t}) \leftarrow Q(s_{t},a_{t})+\alpha \cdot[r_{t}+\gamma \max\limits_{a}Q(s_{t+1},a)-Q(s_{t},a_{t})] \tag{2}$

where $\alpha$ is the learning rate and $\gamma$ is the discount factor. The concrete procedure is given by the pseudocode in the figure below:

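First initialise the value-function table and start an episode. Pick an initial state; the agent then selects an action according to its $\epsilon$-greedy policy, applies it to obtain a reward $R$ and the next state $S^{'}$, updates the value function, and continues with the next iteration.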
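
The loop described above can be sketched in a few lines of Python. This is a minimal illustration rather than the pseudocode from the paper: the five-state chain environment, the `step` function and all hyperparameter values are assumptions made up for the example.

```python
import random

N_STATES = 5          # hypothetical chain: states 0..4, state 4 is terminal
ACTIONS = (0, 1)      # 0 = move left, 1 = move right


def step(state, action):
    """Toy deterministic environment: reward 1 only when the goal is reached."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=300, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    rng = random.Random(seed)
    q = [[0.0 for _ in ACTIONS] for _ in range(N_STATES)]  # Q-table initialised to 0
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behaviour policy: epsilon-greedy with random tie-breaking.
            if epsilon > rng.random():
                action = rng.choice(ACTIONS)
            else:
                best = max(q[state])
                action = rng.choice([a for a in ACTIONS if q[state][a] == best])
            next_state, reward, done = step(state, action)
            # Target policy: greedy, via the max over the next state's actions.
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
greedy_policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(greedy_policy)
```

On this toy chain the learned greedy policy should settle on moving right in every non-terminal state, since that is the only way to reach the reward.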

1.1.1 Two characteristics of Q-learning: off-policy and temporal difference

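Off-policy: the behaviour policy and the evaluation (target) policy are not the same policy. The behaviour policy is $\epsilon$-greedy (line 5 of the pseudocode), while the evaluation policy is the greedy policy $\max\limits_{a}Q(s, a)$ (line 7).
Temporal difference: the temporal-difference structure is visible in the value-function iteration formula (2), where the TD target is $r_{t}+\gamma\max\limits_{a}Q(s_{t+1}, a)$.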
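
To make the temporal-difference structure explicit, the Q-learning update can be split into a TD target and a TD error. The symbol $\delta_{t}$ is introduced here only for illustration and does not appear in the original formulas; the numbers in the example are arbitrary.

$\delta_{t} = \underbrace{r_{t}+\gamma \max\limits_{a}Q(s_{t+1},a)}_{\text{TD target}}-Q(s_{t},a_{t}), \qquad Q(s_{t},a_{t}) \leftarrow Q(s_{t},a_{t})+\alpha\,\delta_{t}$

For example, with $Q(s_{t},a_{t})=0.5$, $r_{t}=1$, $\gamma=0.9$, $\max_{a}Q(s_{t+1},a)=2$ and $\alpha=0.1$, the TD target is $1+0.9\times 2=2.8$, the TD error is $2.8-0.5=2.3$, and the updated value is $0.5+0.1\times 2.3=0.73$.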

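To obtain the optimal policy during learning, we usually estimate how much each available choice is worth in every state. The value $Q(s_{t},a_{t})$ at each time step depends on the reward just received and on the Q value of the next time step, $Q(s_{t+1},a_{t+1})$. Through repeated learning, Q-learning ends up with a table indexed by state and action, represented as: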

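Following the pseudocode: first initialise the table to all zeros. The first action is chosen at random under the $\epsilon$-greedy policy. Suppose that in state 1 we choose action 2, land in state 2, and receive a reward of 1. With $\alpha=1$ and $\gamma=1$ the update gives $Q(1,2)=1$:
$Q(s_{t},a_{t}) \leftarrow r_{t}+\max\limits_{a}Q(s_{t+1},a)=1+0=1$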
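The same procedure repeats until all values are optimal, which yields a policy. For small problems this works very well. But for problems such as helicopter flight (a continuous process with some very large number N of states) or Go, whose state space is on the order of $10^{170}$, it runs into the curse of dimensionality and the table becomes impractical to store and search. This is where DQN comes in, using a neural network to approximate the value function.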
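
A rough back-of-the-envelope sketch shows why a table stops being an option. It assumes the 84×84×4 grayscale input described later in this post with 256 intensity levels per pixel; the function names and the 4×4 grid world are made up for the example.

```python
import math


def table_entries(num_states, num_actions):
    """Number of cells a tabular Q-function needs."""
    return num_states * num_actions


def decimal_digits_of_state_count(values_per_pixel, height, width, frames):
    """Digits in values_per_pixel ** (height * width * frames), via log10."""
    return math.floor(height * width * frames * math.log10(values_per_pixel)) + 1


small = table_entries(num_states=16, num_actions=4)      # e.g. a 4x4 grid world
atari_digits = decimal_digits_of_state_count(256, 84, 84, 4)
print(small, atari_digits)
```

The grid world needs a few dozen cells, whereas the number of distinct image inputs has tens of thousands of decimal digits, so it cannot be enumerated, let alone stored.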

1.2 DQN algorithm

This post uses the Atari games from the two papers as the running example.

1.2.1 Introduction

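Atari 2600 games are classics, and this post uses Breakout as the example. The task has a high-dimensional state input (raw image pixels) and a low-dimensional action output (discrete actions such as up, down, left, right, or fire), as in the Breakout screenshot: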

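To process the game on a computer, the image must first be read in, so a convolutional neural network (CNN) is used on the image input (CNNs are not covered in detail here), as shown: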

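Input: from each RGB frame of the Atari game, extract the Y channel, which represents luminance, and resize it to 84×84. The m most recent such frames are stacked as the input. The paper uses m=4, so four consecutive frames form the input to the network.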
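
The preprocessing just described can be sketched in pure Python. This is an illustration under assumptions, not the papers' code: the luma weights are the common ITU-R BT.601 ones, the nearest-neighbour resize stands in for a proper image-library resize, and the `FrameStack` class is a hypothetical helper.

```python
from collections import deque


def luminance(rgb_frame):
    """Convert an RGB frame (rows of (r, g, b) tuples) to the Y (luma) channel."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in rgb_frame]


def resize_nearest(gray, out_h, out_w):
    """Nearest-neighbour resize; a stand-in for a proper image-library resize."""
    in_h, in_w = len(gray), len(gray[0])
    return [[gray[y * in_h // out_h][x * in_w // out_w] for x in range(out_w)]
            for y in range(out_h)]


class FrameStack:
    """Keeps the last m preprocessed frames; together they form one network input."""

    def __init__(self, m=4):
        self.frames = deque(maxlen=m)

    def reset(self, frame):
        # At episode start, fill the history with copies of the first frame.
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)

    def add(self, frame):
        self.frames.append(frame)   # the oldest frame is dropped automatically

    def get(self):
        return list(self.frames)


# Usage with a dummy 210x160 white frame (the native Atari 2600 resolution).
raw = [[(255, 255, 255)] * 160 for _ in range(210)]
frame = resize_nearest(luminance(raw), 84, 84)
stack = FrameStack(m=4)
stack.reset(frame)
state = stack.get()
print(len(state), len(state[0]), len(state[0][0]))
```

Stacking several frames is what lets the network see motion, such as the direction the ball is travelling, which a single frame cannot show.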

After the convolutional layers we obtain n state features, and the network finally outputs values for K discrete actions, which can be drawn as:

Combining these two ideas gives the figure from the paper:

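The K discrete outputs mentioned above are in fact Q values. Here Q is not a single number but a vector with one entry per action. With network weights $\theta$, the value function is written $Q(s,a,\theta)$, and the $\theta$ obtained once the network has converged defines the value function.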

1.2.2 Reward Setup

Definition of the reward r
The reward $r_{t}$ is determined by the change in game score after taking action a in state s at time t, that is:

$r_{t} =\left\{ \begin{aligned} 1 & & \text{score increases} \\ 0 & & \text{no change} \\ -1 & & \text{score decreases} \end{aligned} \right.$
The long-term cumulative discounted reward is then defined as:
$R_{t} = \sum\limits_{k=0}^{T} \gamma^{k}r_{t+k+1}$
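
As a small worked example of the two definitions above, the sketch below clips raw score changes to +1, 0 or -1 and then accumulates a discounted return. The score changes and the value of gamma are invented for illustration.

```python
def clip_reward(score_change):
    """Map a raw score change to +1, 0 or -1."""
    return (score_change > 0) - (0 > score_change)


def discounted_return(rewards, gamma):
    """R_t = sum_k gamma^k * r_{t+k+1}, for the list of rewards that follow time t."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total   # Horner-style accumulation from the end
    return total


raw_changes = [0, 200, 0, -50, 10]       # hypothetical score changes per step
clipped = [clip_reward(c) for c in raw_changes]
print(clipped)                            # [0, 1, 0, -1, 1]
print(discounted_return(clipped, gamma=0.5))   # 0.5 - 0.125 + 0.0625 = 0.4375
```

Clipping keeps the scale of the learning signal comparable across games whose raw scores differ by orders of magnitude.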

1.2.3 Approximation and Algorithm Design

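The core problem therefore becomes how to determine $\theta$ so that the network approximates the value function. The classic approach is gradient descent on a loss function, adjusting the network weights $\theta$ step by step. The loss function is defined as: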
$L_{i}(\theta_{i}) = E_{(s,a,r,s^{'})\sim U(D)}[(r+\gamma \max\limits_{a^{'}}Q(s^{'},a^{'};\theta_{i}^{-})-Q(s,a;\theta_{i}))^{2}] \tag{3}$
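where $\theta_{i}^{-}$ are the parameters of the target network at iteration i and $\theta_{i}$ are the parameters of the Q-network (why two networks are used is explained later). The next step is to differentiate with respect to $\theta$. Up to a constant factor (the exact derivative of the squared error carries a factor of $-2$), this gives: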

$\frac{\partial L_{i}(\theta_{i})}{\partial\theta_{i}} = E_{(s,a,r,s^{'})\sim U(D)}[(r+\gamma \max\limits_{a^{'}}Q(s^{'},a^{'};\theta_{i}^{-})-Q(s,a;\theta_{i}))\nabla_{\theta_{i}}Q(s,a;\theta_{i})] \tag{4}$
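In addition, during learning the transition tuples $(s,a,r,s^{'})$ are stored in a replay memory $D$, and the network is trained on mini-batches read from it (the benefits are discussed below).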
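
The loss in formula (3) can be checked by hand on a tiny mini-batch. In the sketch below the two "networks" are replaced by hypothetical lookup tables, and each transition carries an extra `done` flag (an assumption added here) so that terminal transitions do not bootstrap from the next state.

```python
def td_targets(batch, q_target, gamma):
    """y = r for terminal transitions, else r + gamma * max_a' Q(s', a'; theta^-)."""
    targets = []
    for (_state, _action, reward, next_state, done) in batch:
        bootstrap = 0.0 if done else gamma * max(q_target(next_state))
        targets.append(reward + bootstrap)
    return targets


def squared_td_loss(batch, q_online, q_target, gamma):
    """Mean of (y - Q(s, a; theta))^2 over the mini-batch."""
    targets = td_targets(batch, q_target, gamma)
    errors = [y - q_online(s)[a] for y, (s, a, _r, _s2, _d) in zip(targets, batch)]
    return sum(e * e for e in errors) / len(errors)


# Hypothetical stand-ins for the two networks: tables from state to Q-vector.
online_table = {"s0": [0.0, 1.0], "s1": [2.0, 0.5]}
target_table = {"s0": [0.0, 0.5], "s1": [1.0, 3.0]}
q_online = online_table.__getitem__
q_target = target_table.__getitem__

batch = [("s0", 1, 1.0, "s1", False),    # y = 1 + 0.5 * 3 = 2.5, error = 1.5
         ("s1", 0, -1.0, "s0", True)]    # y = -1 (terminal),     error = -3.0
print(td_targets(batch, q_target, gamma=0.5))
print(squared_td_loss(batch, q_online, q_target, gamma=0.5))   # (2.25 + 9) / 2
```

Note that the targets are computed only from `q_target`, so a gradient step on the loss changes the online parameters without moving the targets.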

The full pseudocode is:

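Two very important ideas: experience replay and the target network
(1) Experience replay stores the data the agent gathers while exploring the environment, then updates the deep network's parameters on randomly sampled transitions.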

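Why experience replay is needed:

1. As a supervised learning model, a deep neural network expects its training data to be independent and identically distributed.
2. Consecutive samples produced by Q-learning are correlated. Experience replay breaks this correlation by storing transitions and then sampling from the store.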

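In this problem experience replay is added because the samples come from consecutive game frames, which are far more correlated than in simple RL tasks such as a maze. Without experience replay, the algorithm would follow gradients in roughly the same direction for long stretches, and with the same step size such direct gradient updates may fail to converge. Experience replay therefore draws experiences at random from a memory pool before computing the gradient, which avoids this problem.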

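Advantages of experience replay:

1. Data efficiency is high, because each sample is used many times.
2. Correlation between consecutive samples makes the variance of the parameter updates large; this mechanism reduces that correlation. Note that sampling is uniformly random.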
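
A replay memory with uniform sampling can be sketched with the standard library alone. The class name, the capacity and the `done` field are assumptions for illustration; this is not the storage layout used by the repository discussed later.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are dropped first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks up runs of consecutive frames.
        return self.rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=5, seed=0)
for t in range(8):                      # push 8 transitions into a buffer of size 5
    memory.add(t, 0, 0.0, t + 1, False)
batch = memory.sample(3)
print(len(memory), sorted(tr[0] for tr in batch))
```

Because the buffer is bounded, only the five most recent transitions (states 3 to 7 here) remain available for sampling, and each of them can be drawn in many different mini-batches.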

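(2) Target network: with a target network, the target Q values stay fixed for a period of time, which reduces the correlation between the current Q values and the target Q values to some extent and makes the algorithm more stable. A separate TargetNet produces the target Q values. Specifically, $Q(s,a;\theta_{i})$ denotes the output of the current network MainNet and is used to evaluate the value of the current state-action pair, while $Q(s,a;\theta_{i}^{-})$ denotes the output of TargetNet and is substituted into the target formula above to obtain the target Q value. MainNet's parameters are updated using the loss function above, and every N iterations MainNet's parameters are copied to TargetNet.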
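
The periodic copy from MainNet to TargetNet can be sketched as follows. Parameters are plain dictionaries and the "gradient step" is a stand-in, so the class and its fields are hypothetical; only the copy-every-N-updates logic is the point.

```python
import copy


class TwoNetworks:
    """Holds MainNet and TargetNet parameters as plain dicts (hypothetical layout)."""

    def __init__(self, params, sync_every):
        self.main = params
        self.target = copy.deepcopy(params)   # TargetNet starts as a copy of MainNet
        self.sync_every = sync_every
        self.updates = 0

    def apply_update(self, new_main_params):
        """Record one update of MainNet and sync TargetNet every sync_every updates."""
        self.main = new_main_params
        self.updates += 1
        if self.updates % self.sync_every == 0:
            self.target = copy.deepcopy(self.main)


nets = TwoNetworks({"w": [0.0]}, sync_every=3)
history = []
for step in range(1, 7):
    nets.apply_update({"w": [float(step)]})   # stand-in for an optimiser step
    history.append(nets.target["w"][0])
print(history)   # [0.0, 0.0, 3.0, 3.0, 3.0, 6.0]
```

The target parameters change only at updates 3 and 6, so the regression targets stay fixed in between while MainNet keeps moving.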

Following the pseudocode, the overall run-time structure is shown in the figure. The code is briefly analysed below:

1.2.4 Training

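DeepMind's own implementation of DQN (2015) is written in Lua on top of Torch ( https://github.com/deepmind/dqn ). This post analyses a TensorFlow version instead and walks through the DQN code in the devsisters repository, with thanks at the end of the post.
Note: the source code uses the gym environment. To install it: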

git clone https://github.com/openai/gym
cd gym
sudo pip install -e .[all]
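# Test the installation (Python, classic gym API in which step() returns 4 values)
import gym
env = gym.make('CartPole-v0')
env.reset()  # reset the environment
for _ in range(1000):  # run 1000 frames
    env.render()
    _, _, done, _ = env.step(env.action_space.sample())  # take a random action
    if done:
        env.reset()  # start a new episode once the current one ends
env.close()
# Result: a window opens showing a cart with a swinging pole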

Taking a random action:

def _random_step(self):
    # Sample a uniformly random action from the environment's action space.
    action = self.env.action_space.sample()
    self._step(action)

The reward obtained after taking an action:

def _step(self, action):
    # Classic gym API: step() returns (observation, reward, done, info).
    self._screen, self.reward, self.terminal, _ = self.env.step(action)

Stepping the environment to the next state $s_{t+1}$ after an action:

class SimpleGymEnvironment(Environment):
  def __init__(self, config):
    super(SimpleGymEnvironment, self).__init__(config)

  def act(self, action, is_training=True):
    self._step(action)

    self.after_act(action)
    return self.state

The convolution layer:

def conv2d(x,
           output_dim,
           kernel_size,
           stride,
           initializer=tf.contrib.layers.xavier_initializer(),
           activation_fn=tf.nn.relu,
           data_format='NHWC',
           padding='VALID',
           name='conv2d'):
  with tf.variable_scope(name):
    if data_format == 'NCHW':
      stride = [1, 1, stride[0], stride[1]]
      kernel_shape = [kernel_size[0], kernel_size[1], x.get_shape()[1], output_dim]
    elif data_format == 'NHWC':
      stride = [1, stride[0], stride[1], 1]
      kernel_shape = [kernel_size[0], kernel_size[1], x.get_shape()[-1], output_dim]

    w = tf.get_variable('w', kernel_shape, tf.float32, initializer=initializer)
    conv = tf.nn.conv2d(x, w, stride, padding, data_format=data_format)

    b = tf.get_variable('biases', [output_dim], initializer=tf.constant_initializer(0.0))
    out = tf.nn.bias_add(conv, b, data_format)

  if activation_fn is not None:
    out = activation_fn(out)

  return out, w, b
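
Since `conv2d` above uses `padding='VALID'`, the spatial size after each layer follows a simple formula. The sketch below applies it to an 84×84 input; the layer settings are those reported in the 2015 Nature paper and are an assumption here, because the excerpt does not show how `conv2d` is called.

```python
def valid_conv_output(size, kernel, stride):
    """Spatial output size of a convolution with 'VALID' (no) padding."""
    return (size - kernel) // stride + 1


# (filters, kernel, stride) per layer, as reported in the 2015 Nature paper.
# Assumption: the repository may be configured differently.
layers = [(32, 8, 4), (64, 4, 2), (64, 3, 1)]

size, sizes = 84, []
for filters, kernel, stride in layers:
    size = valid_conv_output(size, kernel, stride)
    sizes.append((size, size, filters))
flattened = sizes[-1][0] * sizes[-1][1] * sizes[-1][2]
print(sizes, flattened)   # [(20, 20, 32), (9, 9, 64), (7, 7, 64)] 3136
```

The flattened size is the number of features that would feed the fully connected layers producing the K action values.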

Network optimisation is set up as follows:

    # optimizer
    with tf.variable_scope('optimizer'):
      self.target_q_t = tf.placeholder('float32', [None], name='target_q_t')
      self.action = tf.placeholder('int64', [None], name='action')

      action_one_hot = tf.one_hot(self.action, self.env.action_size, 1.0, 0.0, name='action_one_hot')
      q_acted = tf.reduce_sum(self.q * action_one_hot, reduction_indices=1, name='q_acted')

      self.delta = self.target_q_t - q_acted

      self.global_step = tf.Variable(0, trainable=False)

      self.loss = tf.reduce_mean(clipped_error(self.delta), name='loss')
      self.learning_rate_step = tf.placeholder('int64', None, name='learning_rate_step')
      self.learning_rate_op = tf.maximum(self.learning_rate_minimum,
          tf.train.exponential_decay(
              self.learning_rate,
              self.learning_rate_step,
              self.learning_rate_decay_step,
              self.learning_rate_decay,
              staircase=True))
      self.optim = tf.train.RMSPropOptimizer(
          self.learning_rate_op, momentum=0.95, epsilon=0.01).minimize(self.loss)
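
Two pieces of the optimizer block can be mirrored in plain Python to see what they compute. The definition of `clipped_error` is not shown in the excerpt, so the Huber-style form below is an assumption; the decay function follows the documented staircase behaviour of `tf.train.exponential_decay` combined with the `tf.maximum` lower bound. All numeric values are invented.

```python
def clipped_error(delta):
    """Huber-style loss: quadratic for |delta| below 1, linear beyond that (assumed form)."""
    if abs(delta) >= 1.0:
        return abs(delta) - 0.5
    return 0.5 * delta * delta


def decayed_learning_rate(base_lr, step, decay_step, decay_rate, minimum):
    """Staircase exponential decay with a lower bound, mirroring learning_rate_op."""
    return max(minimum, base_lr * decay_rate ** (step // decay_step))


print(clipped_error(0.5), clipped_error(3.0))   # 0.125 2.5
print(decayed_learning_rate(0.01, step=250, decay_step=100, decay_rate=0.5, minimum=0.001))
print(decayed_learning_rate(0.01, step=900, decay_step=100, decay_rate=0.5, minimum=0.001))
```

Clipping the error this way bounds the gradient magnitude for large TD errors, which is the usual motivation for using it instead of a pure squared loss.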

The training function:

  def train(self):
    start_step = self.step_op.eval()
    start_time = time.time()

    num_game, self.update_count, ep_reward = 0, 0, 0.
    total_reward, self.total_loss, self.total_q = 0., 0., 0.
    max_avg_ep_reward = 0
    ep_rewards, actions = [], []

    screen, reward, action, terminal = self.env.new_random_game()

    for _ in range(self.history_length):
      self.history.add(screen)

    for self.step in tqdm(range(start_step, self.max_step), ncols=70, initial=start_step):
      if self.step == self.learn_start:
        num_game, self.update_count, ep_reward = 0, 0, 0.
        total_reward, self.total_loss, self.total_q = 0., 0., 0.
        ep_rewards, actions = [], []

      # 1. predict
      action = self.predict(self.history.get())
      # 2. act
      screen, reward, terminal = self.env.act(action, is_training=True)
      # 3. observe
      self.observe(screen, reward, action, terminal)

      if terminal:
        screen, reward, action, terminal = self.env.new_random_game()

        num_game += 1
        ep_rewards.append(ep_reward)
        ep_reward = 0.
      else:
        ep_reward += reward

      actions.append(action)
      total_reward += reward

      if self.step >= self.learn_start:
        if self.step % self.test_step == self.test_step - 1:
          avg_reward = total_reward / self.test_step
          avg_loss = self.total_loss / self.update_count
          avg_q = self.total_q / self.update_count

          try:
            max_ep_reward = np.max(ep_rewards)
            min_ep_reward = np.min(ep_rewards)
            avg_ep_reward = np.mean(ep_rewards)
          except:
            max_ep_reward, min_ep_reward, avg_ep_reward = 0, 0, 0

          print('\navg_r: %.4f, avg_l: %.6f, avg_q: %3.6f, avg_ep_r: %.4f, max_ep_r: %.4f, min_ep_r: %.4f, # game: %d' \
              % (avg_reward, avg_loss, avg_q, avg_ep_reward, max_ep_reward, min_ep_reward, num_game))

          if max_avg_ep_reward * 0.9 <= avg_ep_reward:
            self.step_assign_op.eval({self.step_input: self.step + 1})
            self.save_model(self.step + 1)

            max_avg_ep_reward = max(max_avg_ep_reward, avg_ep_reward)

          if self.step > 180:
            self.inject_summary({
                'average.reward': avg_reward,
                'average.loss': avg_loss,
                'average.q': avg_q,
                'episode.max reward': max_ep_reward,
                'episode.min reward': min_ep_reward,
                'episode.avg reward': avg_ep_reward,
                'episode.num of game': num_game,
                'episode.rewards': ep_rewards,
                'episode.actions': actions,
                'training.learning_rate': self.learning_rate_op.eval({self.learning_rate_step: self.step}),
              }, self.step)

          num_game = 0
          total_reward = 0.
          self.total_loss = 0.
          self.total_q = 0.
          self.update_count = 0
          ep_reward = 0.
          ep_rewards = []
          actions = []
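
The training loop calls `self.predict`, whose body is not shown in the excerpt. One common choice for the exploration rate inside such a function, described in the 2015 paper, is to anneal epsilon linearly from 1.0 to 0.1 over the first million frames and hold it there. The sketch below illustrates that schedule only; the function name and parameters are hypothetical and are not taken from the repository.

```python
def epsilon_at(step, eps_start=1.0, eps_end=0.1, anneal_steps=1_000_000, learn_start=0):
    """Linearly anneal epsilon from eps_start to eps_end, then hold it fixed."""
    progress = min(1.0, max(0, step - learn_start) / anneal_steps)
    return eps_end + (1.0 - progress) * (eps_start - eps_end)


for step in (0, 500_000, 1_000_000, 2_000_000):
    print(step, round(epsilon_at(step), 3))
```

Early in training almost every action is random, which fills the replay memory with varied experience; later the agent mostly follows its own Q estimates while still exploring a tenth of the time.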

The complete code is in the repository ( https://github.com/devsisters/DQN-tensorflow ). Sincere thanks to devsisters.

Part 3:

Later posts will cover the DQN variants individually:

Distribute DQN
Double DQN
Dueling DQN
Prioritized Experience Replay DQN
Rainbow
and others

