Deep Reinforcement Learning Series (10): NoisyNet-DQN Principles and Implementation


Paper: https://arxiv.org/pdf/1706.10295v1.pdf
This paper was published by DeepMind at ICLR 2018. The first author is Meire Fortunato, and the author list also includes familiar names such as Mnih. As usual, we go through the paper in the usual reading order.

This paper tackles the problem of efficient exploration in reinforcement learning. The authors add noise parameters to the training network's weights, and these noise parameters are updated by gradient descent together with the other network parameters. The results show that, at a small additional computational cost, this yields better results than the traditional exploration heuristics when applied to A3C, DQN, Dueling DQN, and related algorithms.

1. Background and Problem

We know that, for the exploration-exploitation trade-off, two approaches are commonly used at present (a minimal sketch of both follows this list):

  • ε-greedy (with ε given as a hyperparameter): with some probability the agent takes a random step instead of acting according to its learned policy. The usual practice is to start training with ε = 1 and then slowly decay it to a small value such as 0.1 or 0.02.

  • Entropy regularization: used in policy-gradient methods; the entropy of the policy is added to the loss function to penalize the model for being overly certain about its actions.
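
A minimal NumPy sketch of both heuristics (the function names and schedule constants are illustrative, not taken from the paper):

import numpy as np

def epsilon_by_step(step, eps_start=1.0, eps_end=0.1, decay_steps=100000):
    # Linearly anneal epsilon from eps_start to eps_end over decay_steps steps
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def epsilon_greedy_action(q_values, step, rng=np.random):
    # With probability epsilon act randomly, otherwise act greedily w.r.t. Q
    if rng.rand() < epsilon_by_step(step):
        return rng.randint(len(q_values))
    return int(np.argmax(q_values))

def entropy_bonus(policy_probs):
    # Policy entropy, added to the policy-gradient loss to discourage
    # premature convergence to a near-deterministic policy
    return -np.sum(policy_probs * np.log(policy_probs + 1e-8))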

The principle behind common exploration heuristics in reinforcement learning is "optimism in the face of uncertainty". This optimism-under-uncertainty property means these heuristics rely on theoretical guarantees about the agent's performance, and their drawback is that they only work reasonably well on problems with small state and action spaces or with linear function approximation; for complex problems that require nonlinear function approximation they do not solve exploration well.

To address this, the authors propose NoisyNet. The core of the method is to add Gaussian noise to the last (fully connected) layers of the network. The parameters of the noise are adjusted by the model during training, which lets the agent decide for itself when, and in what proportion, to inject uncertainty into its weights.

2. Principle and Mathematical Formulation

NoisyNet is a neural network whose weights and biases are perturbed by noise.

In general, write a NoisyNet as $y = f_{\theta}(x)$, where $x$ is the input, $y$ the output, and $\theta$ the noisy parameters, which the authors define as

$$\theta \stackrel{\text{def}}{=} \mu + \Sigma \odot \varepsilon$$

where $\zeta \stackrel{\text{def}}{=} (\mu, \Sigma)$ is a set of learnable parameter vectors, $\varepsilon$ is a vector of zero-mean noise with fixed statistics, and $\odot$ denotes element-wise multiplication. The loss of the noisy network is then defined as an expectation over the noise:

$$\bar{L}(\zeta) \stackrel{\text{def}}{=} \mathbb{E}_{\varepsilon}[L(\theta)]$$

The remaining task is to optimize over $\zeta$. How do we do that?

Consider the following question first.

For a linear layer with $p$ inputs and $q$ outputs, $y = wx + b$ with $w \in \mathbb{R}^{q \times p}$, $x \in \mathbb{R}^{p}$ and $b \in \mathbb{R}^{q}$, everything is straightforward. What happens if we add noise to the parameters (that is, to the layer itself)? The authors define the corresponding noisy linear layer as

$$y \stackrel{\text{def}}{=} \left(\mu^{w} + \sigma^{w} \odot \varepsilon^{w}\right) x + \mu^{b} + \sigma^{b} \odot \varepsilon^{b}$$

This looks complicated at first glance, but $w$ is simply replaced by $\mu^{w} + \sigma^{w} \odot \varepsilon^{w}$ and $b$ by $\mu^{b} + \sigma^{b} \odot \varepsilon^{b}$. The dimensions of the parameters are listed below:

| $\mu$ | $\sigma$ | $\varepsilon$ |
| --- | --- | --- |
| $\mu^{w} \in \mathbb{R}^{q \times p}$ | $\sigma^{w} \in \mathbb{R}^{q \times p}$ | $\varepsilon^{w} \in \mathbb{R}^{q \times p}$ |
| $\mu^{b} \in \mathbb{R}^{q}$ | $\sigma^{b} \in \mathbb{R}^{q}$ | $\varepsilon^{b} \in \mathbb{R}^{q}$ |

where $\varepsilon^{w}$ and $\varepsilon^{b}$ are the random noise variables. The figure below gives a graphical representation of this noisy linear layer:

(Figure: graphical illustration of the noisy linear layer, from the paper)
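
As a quick check on the shapes, here is a minimal NumPy sketch of the forward pass of such a layer with independent Gaussian noise (the names are illustrative; the small constant used for sigma mirrors the initialization the paper suggests for independent noise):

import numpy as np

rng = np.random.default_rng(0)
p, q = 4, 3                          # p inputs, q outputs

# Learnable parameters zeta = (mu, sigma) for the weights and the bias
mu_w, sigma_w = rng.normal(size=(q, p)), np.full((q, p), 0.017)
mu_b, sigma_b = rng.normal(size=q), np.full(q, 0.017)

def noisy_linear(x):
    # Resample zero-mean noise at every forward pass
    eps_w = rng.standard_normal((q, p))
    eps_b = rng.standard_normal(q)
    w = mu_w + sigma_w * eps_w       # effective weight, shape (q, p)
    b = mu_b + sigma_b * eps_b       # effective bias, shape (q,)
    return w @ x + b

x = rng.standard_normal(p)
print(noisy_linear(x).shape)         # (3,) == (q,)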

The above describes how the noise is introduced. In the paper, the authors try two distributions for the noise:

  • Independent Gaussian noise: every weight and bias of the noisy layer gets its own independent noise entry, each with a $\mu$ and $\sigma$ learned by the model. That is, every $\varepsilon^{w}_{i,j}$ (and every $\varepsilon^{b}_{j}$) is drawn from its own unit Gaussian, for a total of $pq + q$ noise variables per layer.

  • Factorised Gaussian noise: two noise vectors are used, a Gaussian vector $\varepsilon_{i}$ with $p$ entries for the inputs and one with $q$ entries for the outputs, for a total of $p + q$ noise variables. They are combined as follows (see the small sketch after this list):
    $$\begin{aligned} \varepsilon_{i, j}^{w} &= f\left(\varepsilon_{i}\right) f\left(\varepsilon_{j}\right) \\ \varepsilon_{j}^{b} &= f\left(\varepsilon_{j}\right) \end{aligned}$$
    where $f$ is the real-valued function $f(x) = \operatorname{sgn}(x) \sqrt{|x|}$.
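
A minimal NumPy sketch of how the factorised noise matrix is assembled from the two noise vectors (names are illustrative):

import numpy as np

def f(x):
    # f(x) = sgn(x) * sqrt(|x|), applied element-wise
    return np.sign(x) * np.sqrt(np.abs(x))

rng = np.random.default_rng(0)
p, q = 4, 3
eps_in = rng.standard_normal(p)            # p noise variables for the inputs
eps_out = rng.standard_normal(q)           # q noise variables for the outputs

eps_w = np.outer(f(eps_out), f(eps_in))    # (q, p) weight noise built from only p + q samples
eps_b = f(eps_out)                         # (q,)  bias noise
print(eps_w.shape, eps_b.shape)            # (3, 4) (3,)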

Since the loss is an expectation over the noise, its gradient with respect to $\zeta = (\mu, \Sigma)$ is

$$\nabla \bar{L}(\zeta) = \nabla \mathbb{E}[L(\theta)] = \mathbb{E}\left[\nabla_{\mu, \Sigma} L(\mu + \Sigma \odot \varepsilon)\right]$$

Using a single-sample Monte Carlo approximation of this gradient, each optimization step draws one noise sample $\xi$ and computes

$$\nabla \bar{L}(\zeta) \approx \nabla_{\mu, \Sigma} L(\mu + \Sigma \odot \xi)$$
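
This is exactly the reparameterisation trick: once the noise sample is fixed, $\theta$ is a deterministic function of $(\mu, \Sigma)$, so ordinary backpropagation yields the Monte Carlo gradient. A tiny PyTorch illustration with a stand-in loss (not the RL loss):

import torch

mu = torch.zeros(5, requires_grad=True)
sigma = torch.full((5,), 0.5, requires_grad=True)

eps = torch.randn(5)          # one noise sample, held fixed for this step
theta = mu + sigma * eps      # reparameterised noisy parameters
loss = (theta ** 2).sum()     # stand-in for L(theta)
loss.backward()

# Single-sample Monte Carlo estimates of the gradients w.r.t. mu and sigma
print(mu.grad)                # equals 2 * theta
print(sigma.grad)             # equals 2 * theta * eps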

3. NoisyNet in Deep RL Algorithms and Its Initialization

Note: in this paper the noise is applied to the value function (the action-value function), not to the action output by the policy.

3.1 NoisyNet update rules for the different algorithms

The update rules themselves are simple: redefine the optimization objective as $\bar{L}(\zeta)$ and extend the parameters accordingly (that is, add the noise parameters to the original value function). A minimal sketch for the DQN case follows the list below.

  • NoisyNet-DQN
    $$\bar{L}(\zeta) = \mathbb{E}\left[\mathbb{E}_{(x, a, r, y) \sim D}\left[r + \gamma \max_{b \in A} Q\left(y, b, \varepsilon^{\prime}; \zeta^{-}\right) - Q(x, a, \varepsilon; \zeta)\right]^{2}\right]$$

  • NoisyNet-Dueling DQN
    $$\begin{aligned} \bar{L}(\zeta) &= \mathbb{E}\left[\mathbb{E}_{(x, a, r, y) \sim D}\left[r + \gamma Q\left(y, b^{*}(y), \varepsilon^{\prime}; \zeta^{-}\right) - Q(x, a, \varepsilon; \zeta)\right]^{2}\right] \\ \text{s.t.}\quad b^{*}(y) &= \arg\max_{b \in \mathcal{A}} Q\left(y, b(y), \varepsilon^{\prime\prime}; \zeta\right) \end{aligned}$$

  • NoisyNet-A3C
    $$\hat{Q}_{i} = \sum_{j=i}^{k-1} \gamma^{j-i} r_{t+j} + \gamma^{k-i} V\left(x_{t+k}; \zeta, \varepsilon_{i}\right)$$
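
As an illustration of the NoisyNet-DQN objective, here is a minimal PyTorch sketch of how the loss can be computed on a sampled batch. `q_net` and `target_net` are assumed to be Q-networks built from noisy linear layers, and `sample_noise()` is a hypothetical method that redraws ε inside every noisy layer:

import torch
import torch.nn.functional as F

def noisy_dqn_loss(q_net, target_net, batch, gamma=0.99):
    obs, actions, rewards, next_obs, dones = batch
    # Independent noise draws: eps for the online network, eps' for the target network
    q_net.sample_noise()
    target_net.sample_noise()
    q_sa = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        max_next_q = target_net(next_obs).max(dim=1).values
        target = rewards + gamma * (1.0 - dones) * max_next_q
    # Squared TD error averaged over the batch
    return F.mse_loss(q_sa, target)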

3.2 Initialization of the noise parameters

    1. For unfactorised (independent) Gaussian noise, each element $\mu_{i,j}$ is sampled from the uniform distribution $\mathcal{U}[-\sqrt{\frac{3}{p}}, +\sqrt{\frac{3}{p}}]$, where $p$ is the number of inputs to the layer.
    2. For factorised Gaussian noise, the samples come from the uniform distribution $\mathcal{U}[-\frac{1}{\sqrt{p}}, +\frac{1}{\sqrt{p}}]$.

See the code below:

# Added by Andrew Liao
# for NoisyNet-DQN (using Factorised Gaussian noise)
# modified from the original ```dense``` function
import numpy as np
import tensorflow as tf

def sample_noise(shape):
    # Helper assumed by this snippet: draw a standard Gaussian noise tensor
    # of the given shape (resampled on every session run)
    return tf.random_normal(shape)

def noisy_dense(x, size, name, bias=True, activation_fn=tf.identity):

    # f(x) = sgn(x) * sqrt(|x|), used in eq. 7/8 of the paper
    def f(x):
        return tf.multiply(tf.sign(x), tf.pow(tf.abs(x), 0.5))
    # Initializer of \mu and \sigma 
    mu_init = tf.random_uniform_initializer(minval=-1*1/np.power(x.get_shape().as_list()[1], 0.5),     
                                                maxval=1*1/np.power(x.get_shape().as_list()[1], 0.5))
    sigma_init = tf.constant_initializer(0.4/np.power(x.get_shape().as_list()[1], 0.5))
    # Sample noise from gaussian
    p = sample_noise([x.get_shape().as_list()[1], 1])
    q = sample_noise([1, size])
    f_p = f(p); f_q = f(q)
    w_epsilon = f_p*f_q; b_epsilon = tf.squeeze(f_q)

    # w = w_mu + w_sigma*w_epsilon
    w_mu = tf.get_variable(name + "/w_mu", [x.get_shape()[1], size], initializer=mu_init)
    w_sigma = tf.get_variable(name + "/w_sigma", [x.get_shape()[1], size], initializer=sigma_init)
    w = w_mu + tf.multiply(w_sigma, w_epsilon)
    ret = tf.matmul(x, w)
    if bias:
        # b = b_mu + b_sigma*b_epsilon
        b_mu = tf.get_variable(name + "/b_mu", [size], initializer=mu_init)
        b_sigma = tf.get_variable(name + "/b_sigma", [size], initializer=sigma_init)
        b = b_mu + tf.multiply(b_sigma, b_epsilon)
        return activation_fn(ret + b)
    else:
        return activation_fn(ret)
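
In the Q-network definition itself (model.py in the repository, not reproduced here), noisy_dense is presumably used in place of the final dense layers of the head, roughly as follows; `conv_features` and `num_actions` are illustrative names, not taken from the repository:

# Hypothetical Q-network head built on noisy_dense: `conv_features` is the
# flattened convolutional feature tensor, `num_actions` the size of the action space
hidden = noisy_dense(conv_features, size=512, name="noisy_fc1", activation_fn=tf.nn.relu)
q_values = noisy_dense(hidden, size=num_actions, name="noisy_q")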

4. Algorithm Pseudocode

(Figures: NoisyNet-DQN and NoisyNet-A3C pseudocode, from the paper)

5. Experimental Results

(Figures: experimental results on Atari games, from the paper)

6. Implementation (only a subset of Atari games is used)

This section contains code for two algorithms: NoisyNet-DQN and NoisyNet-A3C.

(1)NoisyNet-DQN

# code source: https://github.com/wenh123/NoisyNet-DQN/blob/master/train.py
import argparse
import gym
import numpy as np
import os
import tensorflow as tf
import tempfile
import time

import baselines.common.tf_util as U

from baselines import logger
from baselines import deepq
from baselines.deepq.replay_buffer import ReplayBuffer, PrioritizedReplayBuffer
from baselines.common.misc_util import (
    boolean_flag,
    pickle_load,
    pretty_eta,
    relatively_safe_pickle_dump,
    set_global_seeds,
    RunningAvg,
    SimpleMonitor
)
from baselines.common.schedules import LinearSchedule, PiecewiseSchedule
# when updating this to non-deperecated ones, it is important to
# copy over LazyFrames
from baselines.common.atari_wrappers_deprecated import wrap_dqn
from baselines.common.azure_utils import Container
from model import model, dueling_model
from statistics import statistics

def parse_args():
    parser = argparse.ArgumentParser("DQN experiments for Atari games")
    # Environment
    parser.add_argument("--env", type=str, default="Pong", help="name of the game")
    parser.add_argument("--seed", type=int, default=42, help="which seed to use")
    # Core DQN parameters
    parser.add_argument("--replay-buffer-size", type=int, default=int(1e6), help="replay buffer size")
    parser.add_argument("--lr", type=float, default=1e-4, help="learning rate for Adam optimizer")
    parser.add_argument("--num-steps", type=int, default=int(2e8), help="total number of steps to run the environment for")
    parser.add_argument("--batch-size", type=int, default=32, help="number of transitions to optimize at the same time")
    parser.add_argument("--learning-freq", type=int, default=4, help="number of iterations between every optimization step")
    parser.add_argument("--target-update-freq", type=int, default=40000, help="number of iterations between every target network update")
    # Bells and whistles
    boolean_flag(parser, "noisy", default=False, help="whether or not to NoisyNetwork")
    boolean_flag(parser, "double-q", default=True, help="whether or not to use double q learning")
    boolean_flag(parser, "dueling", default=False, help="whether or not to use dueling model")
    boolean_flag(parser, "prioritized", default=False, help="whether or not to use prioritized replay buffer")
    parser.add_argument("--prioritized-alpha", type=float, default=0.6, help="alpha parameter for prioritized replay buffer")
    parser.add_argument("--prioritized-beta0", type=float, default=0.4, help="initial value of beta parameters for prioritized replay")
    parser.add_argument("--prioritized-eps", type=float, default=1e-6, help="eps parameter for prioritized replay buffer")
    # Checkpointing
    parser.add_argument("--save-dir", type=str, default=None, required=True, help="directory in which training state and model should be saved.")
    parser.add_argument("--save-azure-container", type=str, default=None,
                        help="It present data will saved/loaded from Azure. Should be in format ACCOUNT_NAME:ACCOUNT_KEY:CONTAINER")
    parser.add_argument("--save-freq", type=int, default=1e6, help="save model once every time this many iterations are completed")
    boolean_flag(parser, "load-on-start", default=True, help="if true and model was previously saved then training will be resumed")
    return parser.parse_args()


def make_env(game_name):
    env = gym.make(game_name + "NoFrameskip-v4")
    monitored_env = SimpleMonitor(env)  # puts rewards and number of steps in info, before environment is wrapped
    env = wrap_dqn(monitored_env)  # applies a bunch of modification to simplify the observation space (downsample, make b/w)
    return env, monitored_env


def maybe_save_model(savedir, container, state):
    """This function checkpoints the model and state of the training algorithm."""
    if savedir is None:
        return
    start_time = time.time()
    model_dir = "model-{}".format(state["num_iters"])
    U.save_state(os.path.join(savedir, model_dir, "saved"))
    if container is not None:
        container.put(os.path.join(savedir, model_dir), model_dir)
    relatively_safe_pickle_dump(state, os.path.join(savedir, 'training_state.pkl.zip'), compression=True)
    if container is not None:
        container.put(os.path.join(savedir, 'training_state.pkl.zip'), 'training_state.pkl.zip')
    relatively_safe_pickle_dump(state["monitor_state"], os.path.join(savedir, 'monitor_state.pkl'))
    if container is not None:
        container.put(os.path.join(savedir, 'monitor_state.pkl'), 'monitor_state.pkl')
    logger.log("Saved model in {} seconds\n".format(time.time() - start_time))


def maybe_load_model(savedir, container):
    """Load model if present at the specified path."""
    if savedir is None:
        return

    state_path = os.path.join(os.path.join(savedir, 'training_state.pkl.zip'))
    if container is not None:
        logger.log("Attempting to download model from Azure")
        found_model = container.get(savedir, 'training_state.pkl.zip')
    else:
        found_model = os.path.exists(state_path)
    if found_model:
        state = pickle_load(state_path, compression=True)
        model_dir = "model-{}".format(state["num_iters"])
        if container is not None:
            container.get(savedir, model_dir)
        U.load_state(os.path.join(savedir, model_dir, "saved"))
        logger.log("Loaded models checkpoint at {} iterations".format(state["num_iters"]))
        return state


if __name__ == '__main__':
    args = parse_args()
    # Parse savedir and azure container.
    savedir = args.save_dir
    if args.save_azure_container is not None:
        account_name, account_key, container_name = args.save_azure_container.split(":")
        container = Container(account_name=account_name,
                              account_key=account_key,
                              container_name=container_name,
                              maybe_create=True)
        if savedir is None:
            # Careful! This will not get cleaned up. Docker spoils the developers.
            savedir = tempfile.TemporaryDirectory().name
    else:
        container = None
    # Create and seed the env.
    env, monitored_env = make_env(args.env)
    if args.seed > 0:
        set_global_seeds(args.seed)
        env.unwrapped.seed(args.seed)

    with U.make_session(4) as sess:
        # Create training graph and replay buffer
        act, train, update_target, debug = deepq.build_train(
            make_obs_ph=lambda name: U.Uint8Input(env.observation_space.shape, name=name),
            q_func=dueling_model if args.dueling else model,
            num_actions=env.action_space.n,
            optimizer=tf.train.AdamOptimizer(learning_rate=args.lr, epsilon=1e-4),
            gamma=0.99,
            grad_norm_clipping=10,
            double_q=args.double_q,
            noisy=args.noisy,
        )
        approximate_num_iters = args.num_steps / 4
        exploration = PiecewiseSchedule([
            (0, 1.0),
            (approximate_num_iters / 50, 0.1),
            (approximate_num_iters / 5, 0.01)
        ], outside_value=0.01)

        if args.prioritized:
            replay_buffer = PrioritizedReplayBuffer(args.replay_buffer_size, args.prioritized_alpha)
            beta_schedule = LinearSchedule(approximate_num_iters, initial_p=args.prioritized_beta0, final_p=1.0)
        else:
            replay_buffer = ReplayBuffer(args.replay_buffer_size)

        U.initialize()
        update_target()
        num_iters = 0

        # Load the model
        state = maybe_load_model(savedir, container)
        if state is not None:
            num_iters, replay_buffer = state["num_iters"], state["replay_buffer"],
            monitored_env.set_state(state["monitor_state"])

        start_time, start_steps = None, None
        steps_per_iter = RunningAvg(0.999)
        iteration_time_est = RunningAvg(0.999)
        obs = env.reset()
        # Record the mean of the \sigma
        sigma_name_list = []
        sigma_list = []
        for param in tf.trainable_variables():
            # only record the \sigma in the action network
            if 'sigma' in param.name and 'deepq/q_func/action_value' in param.name:
                summary_name = param.name.replace('deepq/q_func/action_value/', '').replace('/', '.').split(':')[0]
                sigma_name_list.append(summary_name)
                sigma_list.append(tf.reduce_mean(tf.abs(param)))
        f_mean_sigma = U.function(inputs=[], outputs=sigma_list)
        # Statistics
        writer = tf.summary.FileWriter(savedir, sess.graph)
        im_stats = statistics(scalar_keys=['action', 'im_reward', 'td_errors', 'huber_loss']+sigma_name_list)
        ep_stats = statistics(scalar_keys=['ep_reward', 'ep_length'])  
        # Main training loop
        ep_length = 0
        while True:
            num_iters += 1
            ep_length += 1
            # Take action and store transition in the replay buffer.
            if args.noisy:
                # greedily choose
                action = act(np.array(obs)[None], stochastic=False)[0]
            else:
                # epsilon greedy
                action = act(np.array(obs)[None], update_eps=exploration.value(num_iters))[0]
            new_obs, rew, done, info = env.step(action)
            replay_buffer.add(obs, action, rew, new_obs, float(done))
            obs = new_obs
            if done:
                obs = env.reset()

            if (num_iters > max(5 * args.batch_size, args.replay_buffer_size // 20) and
                    num_iters % args.learning_freq == 0):
                # Sample a bunch of transitions from replay buffer
                if args.prioritized:
                    experience = replay_buffer.sample(args.batch_size, beta=beta_schedule.value(num_iters))
                    (obses_t, actions, rewards, obses_tp1, dones, weights, batch_idxes) = experience
                else:
                    obses_t, actions, rewards, obses_tp1, dones = replay_buffer.sample(args.batch_size)
                    weights = np.ones_like(rewards)
                # Minimize the error in Bellman's equation and compute TD-error
                td_errors, huber_loss = train(obses_t, actions, rewards, obses_tp1, dones, weights)
                # Update the priorities in the replay buffer
                if args.prioritized:
                    new_priorities = np.abs(td_errors) + args.prioritized_eps
                    replay_buffer.update_priorities(batch_idxes, new_priorities)
                # Write summary
                mean_sigma = f_mean_sigma()
                im_stats.add_all_summary(writer, [action, rew, np.mean(td_errors), np.mean(huber_loss)]+mean_sigma, num_iters)

            # Update target network.
            if num_iters % args.target_update_freq == 0:
                update_target()

            if start_time is not None:
                steps_per_iter.update(info['steps'] - start_steps)
                iteration_time_est.update(time.time() - start_time)
            start_time, start_steps = time.time(), info["steps"]

            # Save the model and training state.
            if num_iters > 0 and (num_iters % args.save_freq == 0 or info["steps"] > args.num_steps):
                maybe_save_model(savedir, container, {
                    'replay_buffer': replay_buffer,
                    'num_iters': num_iters,
                    'monitor_state': monitored_env.get_state()
                })

            if info["steps"] > args.num_steps:
                break

            if done:
                steps_left = args.num_steps - info["steps"]
                completion = np.round(info["steps"] / args.num_steps, 1)
                mean_ep_reward = np.mean(info["rewards"][-100:])
                logger.record_tabular("% completion", completion)
                logger.record_tabular("steps", info["steps"])
                logger.record_tabular("iters", num_iters)
                logger.record_tabular("episodes", len(info["rewards"]))
                logger.record_tabular("reward (100 epi mean)", np.mean(info["rewards"][-100:]))
                if not args.noisy:
                    logger.record_tabular("exploration", exploration.value(num_iters))
                if args.prioritized:
                    logger.record_tabular("max priority", replay_buffer._max_priority)
                fps_estimate = (float(steps_per_iter) / (float(iteration_time_est) + 1e-6)
                                if steps_per_iter._value is not None else "calculating...")
                logger.dump_tabular()
                logger.log()
                logger.log("ETA: " + pretty_eta(int(steps_left / fps_estimate)))
                logger.log()
                # add summary for one episode
                ep_stats.add_all_summary(writer, [mean_ep_reward, ep_length], num_iters)
                ep_length = 0
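
Based on the argument parser above, a NoisyNet run would presumably be launched with something like `python train.py --env Pong --save-dir ./checkpoints --noisy` (adding `--dueling` or `--prioritized` as desired). When `--noisy` is set, the training loop bypasses the epsilon-greedy schedule and always acts greedily with respect to the noisy Q-network, so all exploration comes from the learned noise.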

(2)NoisyNet-A3C

# using Pytorch
# code source: https://github.com/Kaixhin/NoisyNet-A3C
import gym
import torch
from torch import nn
from torch.autograd import Variable

from model import ActorCritic
from utils import state_to_tensor


# Transfers gradients from thread-specific model to shared model
def _transfer_grads_to_shared_model(model, shared_model):
  for param, shared_param in zip(model.parameters(), shared_model.parameters()):
    if shared_param.grad is not None:
      return
    shared_param._grad = param.grad


# Adjusts learning rate
def _adjust_learning_rate(optimiser, lr):
  for param_group in optimiser.param_groups:
    param_group['lr'] = lr


def train(rank, args, T, shared_model, optimiser):
  torch.manual_seed(args.seed + rank)

  env = gym.make(args.env)
  env.seed(args.seed + rank)
  model = ActorCritic(env.observation_space, env.action_space, args.hidden_size, args.sigma_init, args.no_noise)
  model.train()

  t = 1  # Thread step counter
  done = True  # Start new episode

  while T.value() <= args.T_max:
    # Sync with shared model at least every t_max steps
    model.load_state_dict(shared_model.state_dict())
    # Get starting timestep
    t_start = t

    # Reset or pass on hidden state
    if done:
      hx = Variable(torch.zeros(1, args.hidden_size))
      cx = Variable(torch.zeros(1, args.hidden_size))
      # Reset environment and done flag
      state = state_to_tensor(env.reset())
      done, episode_length = False, 0
    else:
      # Perform truncated backpropagation-through-time (allows freeing buffers after backwards call)
      hx = hx.detach()
      cx = cx.detach()
    model.sample_noise()  # Pick a new noise vector (until next optimisation step)

    # Lists of outputs for training
    values, log_probs, rewards, entropies = [], [], [], []

    while not done and t - t_start < args.t_max:
      # Calculate policy and value
      policy, value, (hx, cx) = model(Variable(state), (hx, cx))
      log_policy = policy.log()
      entropy = -(log_policy * policy).sum(1)

      # Sample action
      action = policy.multinomial()
      log_prob = log_policy.gather(1, action.detach())  # Graph broken as loss for stochastic action calculated manually
      action = action.data[0, 0]

      # Step
      state, reward, done, _ = env.step(action)
      state = state_to_tensor(state)
      reward = args.reward_clip and min(max(reward, -1), 1) or reward  # Optionally clamp rewards
      done = done or episode_length >= args.max_episode_length

      # Save outputs for training
      [arr.append(el) for arr, el in zip((values, log_probs, rewards, entropies), (value, log_prob, reward, entropy))]

      # Increment counters
      t += 1
      T.increment()

    # Return R = 0 for terminal s or V(s_i; θ) for non-terminal s
    if done:
      R = Variable(torch.zeros(1, 1))
    else:
      _, R, _ = model(Variable(state), (hx, cx))
      R = R.detach()

    # Calculate n-step returns in forward view, stepping backwards from the last state
    trajectory_length = len(rewards)
    values, log_probs, entropies = torch.cat(values), torch.cat(log_probs), torch.cat(entropies)
    returns = Variable(torch.Tensor(trajectory_length + 1, 1))
    returns[-1] = R
    for i in reversed(range(trajectory_length)):
      # R ← r_i + γR
      returns[i] = rewards[i] + args.discount * returns[i + 1]
    # Advantage A = R - V(s_i; θ)
    A = returns[:-1] - values
    # dθ ← dθ - ∂A^2/∂θ
    value_loss = 0.5 * A ** 2  # Least squares error

    # dθ ← dθ + ∇θ∙log(π(a_i|s_i; θ))∙A
    policy_loss = -log_probs * A.detach()  # Policy gradient loss (detached from critic)
    # dθ ← dθ + β∙∇θH(π(s_i; θ))
    policy_loss -= args.entropy_weight * entropies.unsqueeze(1)  # Entropy maximisation loss
    # Zero shared and local grads
    optimiser.zero_grad()
    # Note that losses were defined as negatives of normal update rules for gradient descent
    (policy_loss + value_loss).sum().backward()
    # Gradient L2 normalisation
    nn.utils.clip_grad_norm(model.parameters(), args.max_gradient_norm, 2)

    # Transfer gradients to shared model and update
    _transfer_grads_to_shared_model(model, shared_model)
    optimiser.step()
    if not args.no_lr_decay:
      # Linearly decay learning rate
      _adjust_learning_rate(optimiser, max(args.lr * (args.T_max - T.value()) / args.T_max, 1e-32))

  env.close()

References

  1. Fortunato et al., Noisy Networks for Exploration: https://arxiv.org/pdf/1706.10295v1.pdf
  2. Mnih et al., Asynchronous Methods for Deep Reinforcement Learning: https://arxiv.org/abs/1602.01783
  3. https://github.com/openai/baselines
  4. https://github.com/Kaixhin/NoisyNet-A3C
  5. https://github.com/wenh123/NoisyNet-DQN/