A Guide to Deep Reinforcement Learning with TensorFlow 2.0

Summary: Showcasing the powerful features of TensorFlow 2.0 with deep reinforcement learning!

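In this tutorial I will showcase the upcoming TensorFlow 2.0 features through deep reinforcement learning (DRL) by implementing an Advantage Actor-Critic (A2C) agent that solves the classic CartPole-v0 environment. While the goal is to showcase TensorFlow 2.0, I will do my best to make the DRL side approachable as well, including a brief overview of the field.

In fact, since the focus of the 2.0 release is making developers' lives easier, I think now is a great time to get into DRL with TensorFlow: the source code for the example in this article is under 150 lines! The code is available here or here.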

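Setup

As TensorFlow 2.0 is still experimental, I recommend installing it in a separate virtual environment. I personally prefer Anaconda, so I'll use it to illustrate the installation: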

> conda create -n tf2 python=3.6
> source activate tf2
> pip install tf-nightly-2.0-preview # tf-nightly-gpu-2.0-preview for GPU version

Let's quickly verify that everything works as expected:

>>> import tensorflow as tf
>>> print(tf.__version__)
1.13.0-dev20190117
>>> print(tf.executing_eagerly())
True

Don't worry about the 1.13.x version; it just means this is an early preview. What matters here is that we are in eager mode by default!

>>> print(tf.reduce_sum([1, 2, 3, 4, 5]))
tf.Tensor(15, shape=(), dtype=int32)

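If you're not yet familiar with eager mode, it essentially means that computations are executed at runtime rather than through a pre-compiled graph. You can find a good overview in the TensorFlow documentation.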

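Deep Reinforcement Learning

Generally speaking, reinforcement learning (RL) is a high-level framework for solving sequential decision-making problems. An RL agent navigates an environment by taking actions based on some observations, and receives rewards as a result. Most RL algorithms work by maximizing the sum of rewards the agent collects during one round of the game (an episode).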
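
The interaction loop described above can be sketched in a few lines of plain Python. Everything here is a hypothetical illustration rather than part of the article's code: CoinFlipEnv is a made-up toy environment, and its reset/step methods only loosely imitate the gym interface used later on.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment: guess a coin flip for a fixed number of steps."""

    def __init__(self, episode_length=10, seed=0):
        self.episode_length = episode_length
        self.rng = random.Random(seed)
        self.steps = 0

    def reset(self):
        self.steps = 0
        return 0  # a single dummy observation

    def step(self, action):
        coin = self.rng.randint(0, 1)
        reward = 1.0 if action == coin else 0.0  # reward for a correct guess
        self.steps += 1
        done = self.steps >= self.episode_length
        return 0, reward, done


def run_episode(env, policy):
    """Run one episode and return the sum of collected rewards."""
    obs, done, total = env.reset(), False, 0.0
    while not done:
        action = policy(obs)
        obs, reward, done = env.step(action)
        total += reward
    return total


env = CoinFlipEnv(episode_length=10)
episode_return = run_episode(env, policy=lambda obs: 1)  # always guess "1"
print(episode_return)
```

The number printed at the end is the quantity an RL algorithm tries to make as large as possible by changing the policy.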

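The output of an RL-based algorithm is usually a policy: a function that maps states to actions. A valid policy can be as simple as a hard-coded no-op action. A stochastic policy is represented as a conditional probability distribution over actions, given some state.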
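
As a rough illustration of a stochastic policy, the sketch below turns unnormalized scores (logits) for one state into action probabilities and samples from them. The logits and the helper names softmax and sample_action are assumptions made for this example only.

```python
import math
import random


def softmax(logits):
    """Convert unnormalized scores into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(probs, rng):
    """Draw an action index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


# hypothetical logits for one state with two actions (e.g. left / right)
logits = [2.0, 0.0]
probs = softmax(logits)
rng = random.Random(0)
actions = [sample_action(probs, rng) for _ in range(1000)]
print(probs)                             # roughly [0.88, 0.12]
print(actions.count(0) / len(actions))   # empirical frequency of action 0
```

A deterministic policy is the special case where one action has probability 1.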

Actor-Critic Methods

RL algorithms are often grouped by the objective function they optimize. Value-based methods, such as DQN, work by reducing the error of the expected state-action values.
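
To make "reducing the error of the expected state-action values" concrete, here is a minimal tabular Q-learning update. This is a simplification chosen for illustration: DQN replaces the table with a neural network, and the function q_update, the two-state problem, and the step size alpha are all assumptions of this sketch.

```python
def q_update(q, state, action, reward, next_state, done, alpha=0.5, gamma=0.99):
    """One tabular Q-learning step: move Q(s, a) toward the bootstrapped target."""
    best_next = 0.0 if done else max(q[next_state])
    target = reward + gamma * best_next
    td_error = target - q[state][action]  # the error being reduced
    q[state][action] += alpha * td_error
    return td_error


# hypothetical problem with 2 states and 2 actions, all values start at zero
q = [[0.0, 0.0], [0.0, 0.0]]
errors = [q_update(q, state=0, action=1, reward=1.0, next_state=1, done=True)
          for _ in range(5)]
print(errors)   # [1.0, 0.5, 0.25, 0.125, 0.0625]
print(q[0][1])  # 0.96875
```

Each repeated update halves the remaining error, so the estimate approaches the target value of 1.0.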

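Policy gradient methods optimize the policy itself directly by adjusting its parameters, typically via gradient descent. Computing the gradients exactly is usually intractable, so they are often estimated with Monte Carlo methods instead.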
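
For reference, the standard score-function (REINFORCE) form of such a Monte Carlo estimate is shown below. The notation is introduced here and is not taken from the article: \pi_\theta is the policy with parameters \theta, \tau^{(i)} is the i-th of N sampled trajectories of length T, and R(\tau) is the total reward of a trajectory.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]
  \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T-1}
      \nabla_\theta \log \pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}\big)\, R\big(\tau^{(i)}\big)
```

The expectation over all possible trajectories is what makes the exact gradient intractable; averaging over sampled trajectories gives an unbiased but noisy estimate.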

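The most popular approach is a hybrid of the two: actor-critic methods, where the agent's policy is optimized through policy gradients, while a value-based method is used to bootstrap the expected value estimates.

Deep Actor-Critic Methods

While much of the fundamental RL theory was developed for the tabular case, modern RL is almost exclusively done with function approximators, such as artificial neural networks. Specifically, an RL algorithm is considered "deep" if the policy and value functions are approximated with deep neural networks.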

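(Asynchronous) Advantage Actor-Critic

Over the years, a number of improvements have been introduced to address the sample efficiency and stability of the learning process.

First, gradients are weighted with returns: discounted future rewards. This somewhat alleviates the credit assignment problem and resolves theoretical issues with infinite time steps.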
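
A small sketch of how discounted returns can be computed from a list of rewards follows. The function name discounted_returns and the sample numbers are illustrative assumptions.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # walk backwards so each step reuses the return of the step after it
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]

# with a reward of 1 at every step the return stays bounded by 1 / (1 - gamma)
long_run = discounted_returns([1.0] * 2000, gamma=0.99)
print(long_run[0])  # close to 100
```

The second call hints at why discounting helps with infinite horizons: even over a very long run of rewards, the return converges instead of growing without bound.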

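Second, an advantage function is used instead of raw returns. The advantage is formed as the difference between the returns and some baseline, and can be viewed as a measure of how good a given action is compared with some average.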
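
In code, the idea reduces to a subtraction. The numbers below are made up, and using a value estimate as the baseline is one common choice rather than the only one.

```python
def advantages(returns, baselines):
    """Advantage = observed return minus the baseline estimate for that step."""
    return [g - b for g, b in zip(returns, baselines)]


# hypothetical returns observed in a batch and baseline estimates for each step
returns = [2.0, 1.0, 3.0]
values = [1.5, 1.5, 1.5]
print(advantages(returns, values))  # [0.5, -0.5, 1.5]
```

A positive advantage means the action turned out better than the baseline expected, so its probability should be increased; a negative one means the opposite.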

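Third, an additional entropy maximization term is used in the objective function to ensure that the agent sufficiently explores various policies. In essence, entropy measures how random a probability distribution is, and it is maximized by the uniform distribution.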
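
The claim about the uniform distribution is easy to check numerically. The sketch below computes the Shannon entropy (in nats) of three hypothetical action distributions.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)


uniform = [0.25, 0.25, 0.25, 0.25]
peaked = [0.97, 0.01, 0.01, 0.01]
deterministic = [1.0, 0.0, 0.0, 0.0]
print(entropy(uniform))        # log(4), about 1.386
print(entropy(peaked))         # much smaller than the uniform case
print(entropy(deterministic))  # 0.0
```

Rewarding higher entropy therefore pushes the policy away from committing to a single action too early.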

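Finally, multiple workers are used in parallel to speed up sample collection, while also helping to decorrelate the samples during training.

Combining all of these changes with deep neural networks, we arrive at two of the most popular modern algorithms: (asynchronous) advantage actor-critic, or A3C and A2C for short. The difference between the two is more technical than theoretical: as the name suggests, it boils down to how the parallel workers estimate their gradients and propagate them to the model.

With this I will wrap up our tour of DRL methods, as the focus of this blog post is more on the TensorFlow 2.0 features. Don't worry if you're still unsure about the subject; things should become clearer with the code examples. If you want to learn more, one good resource to get started with is Spinning Up in Deep RL.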

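Advantage Actor-Critic with TensorFlow 2.0

Let's see what it takes to implement the basis of many modern DRL algorithms: an actor-critic agent, as described in the previous section. For simplicity, we won't implement parallel workers, though most of the code will support them. An interested reader could use this as an exercise opportunity.

As a testbed we will use the CartPole-v0 environment. Although it is somewhat simplistic, it's still a great option to get started with. I always rely on it as a sanity check when implementing RL algorithms.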

Policy and Value via the Keras Model API

First, let's create the policy and value estimate NNs under a single model class:

import numpy as np
import tensorflow as tf
import tensorflow.keras.layers as kl

class ProbabilityDistribution(tf.keras.Model):
    def call(self, logits):
        # sample a random categorical action from given logits
        return tf.squeeze(tf.random.categorical(logits, 1), axis=-1)

class Model(tf.keras.Model):
    def __init__(self, num_actions):
        super().__init__(name='mlp_policy')
        # no tf.get_variable(), just simple Keras API
        self.hidden1 = kl.Dense(128, activation='relu')
        self.hidden2 = kl.Dense(128, activation='relu')
        self.value = kl.Dense(1, name='value')
        # logits are unnormalized log probabilities
        self.logits = kl.Dense(num_actions, name='policy_logits')
        self.dist = ProbabilityDistribution()

    def call(self, inputs):
        # inputs is a numpy array, convert to Tensor
        x = tf.convert_to_tensor(inputs, dtype=tf.float32)
        # separate hidden layers from the same input tensor
        hidden_logs = self.hidden1(x)
        hidden_vals = self.hidden2(x)
        return self.logits(hidden_logs), self.value(hidden_vals)

    def action_value(self, obs):
        # executes call() under the hood
        logits, value = self.predict(obs)
        action = self.dist.predict(logits)
        # a simpler option, will become clear later why we don't use it
        # action = tf.random.categorical(logits, 1)
        return np.squeeze(action, axis=-1), np.squeeze(value, axis=-1)

Let's verify that the model works as expected:

import gym
env = gym.make('CartPole-v0')
model = Model(num_actions=env.action_space.n)
obs = env.reset()
# no feed_dict or tf.Session() needed at all
action, value = model.action_value(obs[None, :])
print(action, value) # [1] [-0.00145713]

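Things to note here:

Model layers and the execution path are defined separately;
There is no "input" layer; the model accepts raw numpy arrays;
Two computation paths can be defined in one model via the functional API;
A model can contain helper methods, such as action sampling;
In eager mode, everything works from raw numpy arrays.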

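Random Agent

Now we can move on to the fun stuff: the A2CAgent class. First, let's add a test method that runs through a full episode and returns the sum of rewards.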

class A2CAgent:
    def __init__(self, model):
        self.model = model

    def test(self, env, render=True):
        obs, done, ep_reward = env.reset(), False, 0
        while not done:
            action, _ = self.model.action_value(obs[None, :])
            obs, reward, done, _ = env.step(action)
            ep_reward += reward
            if render:
                env.render()
        return ep_reward

Let's see how our model scores with randomly initialized weights:

agent = A2CAgent(model)
rewards_sum = agent.test(env)
print("%d out of 200" % rewards_sum) # 18 out of 200

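Not even close to optimal. On to the training part!

Loss / Objective Function

As I described in the DRL overview section, an agent improves its policy through gradient descent based on some loss (objective) function. In actor-critic we train on three objectives: improving the policy with advantage-weighted gradients, maximizing the entropy, and minimizing the value estimate errors.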

import tensorflow.keras.losses as kls
import tensorflow.keras.optimizers as ko

class A2CAgent:
    def __init__(self, model):
        # hyperparameters for loss terms
        self.params = {'value': 0.5, 'entropy': 0.0001}
        self.model = model
        self.model.compile(
            optimizer=ko.RMSprop(learning_rate=0.0007),
            # define separate losses for policy logits and value estimate
            loss=[self._logits_loss, self._value_loss]
        )

    def test(self, env, render=True):
        # unchanged from previous section
        ...

    def _value_loss(self, returns, value):
        # value loss is typically MSE between value estimates and returns
        return self.params['value']*kls.mean_squared_error(returns, value)

    def _logits_loss(self, acts_and_advs, logits):
        # a trick to input actions and advantages through same API
        actions, advantages = tf.split(acts_and_advs, 2, axis=-1)
        # sparse CE loss takes integer action indices and supports sample weights
        # from_logits argument ensures transformation into normalized probabilities
        weighted_sparse_ce = kls.SparseCategoricalCrossentropy(from_logits=True)
        # policy loss is defined by policy gradients, weighted by advantages
        # note: we only calculate the loss on the actions we've actually taken
        actions = tf.cast(actions, tf.int32)
        policy_loss = weighted_sparse_ce(actions, logits, sample_weight=advantages)
        # entropy can be calculated as CE of the action probabilities with themselves
        probs = tf.nn.softmax(logits)
        entropy_loss = kls.categorical_crossentropy(probs, probs)
        # here signs are flipped because optimizer minimizes
        return policy_loss - self.params['entropy']*entropy_loss

And we're done with the objective functions! Note how compact the code is: there are almost more comment lines than code.
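
To see what the three terms amount to numerically, here is a framework-free sketch for a single sample. It is an illustration under stated assumptions, not the article's Keras implementation: the helper a2c_losses is hypothetical, the coefficients are copied from the params dictionary above, and Keras would additionally average each loss over the batch.

```python
import math


def softmax(logits):
    """Convert unnormalized scores into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def a2c_losses(logits, action, advantage, value, ret,
               value_coef=0.5, entropy_coef=0.0001):
    """Per-sample A2C loss terms for one observation."""
    probs = softmax(logits)
    # policy term: cross-entropy of the taken action, weighted by the advantage
    policy_loss = -advantage * math.log(probs[action])
    # entropy of the action distribution (higher means more exploration)
    entropy = sum(p * math.log(1.0 / p) for p in probs if p > 0)
    # value term: scaled squared error between the critic's estimate and the return
    value_loss = value_coef * (ret - value) ** 2
    # the entropy is subtracted because the optimizer minimizes the total
    total = policy_loss - entropy_coef * entropy + value_loss
    return policy_loss, entropy, value_loss, total


policy_loss, entropy, value_loss, total = a2c_losses(
    logits=[0.0, 0.0], action=1, advantage=2.0, value=1.0, ret=3.0)
print(policy_loss, entropy, value_loss, total)
```

With equal logits both actions have probability 0.5, so the policy term is 2 * log(2), the entropy is log(2), and the value term is 0.5 * (3 - 1)^2 = 2.0.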

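Agent Training Loop

Finally, there's the training loop itself. It's relatively long, but fairly straightforward: collect samples, calculate returns and advantages, and train the model on them.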

class A2CAgent:
    def __init__(self, model):
        # hyperparameters for loss terms
        self.params = {'value': 0.5, 'entropy': 0.0001, 'gamma': 0.99}
        # unchanged from previous section
        ...

    def train(self, env, batch_sz=32, updates=1000):
        # storage helpers for a single batch of data
        actions = np.empty((batch_sz,), dtype=np.int32)
        rewards, dones, values = np.empty((3, batch_sz))
        observations = np.empty((batch_sz,) + env.observation_space.shape)
        # training loop: collect samples, send to optimizer, repeat updates times
        ep_rews = [0.0]
        next_obs = env.reset()
        for update in range(updates):
            for step in range(batch_sz):
                observations[step] = next_obs.copy()
                actions[step], values[step] = self.model.action_value(next_obs[None, :])
                next_obs, rewards[step], dones[step], _ = env.step(actions[step])

                ep_rews[-1] += rewards[step]
                if dones[step]:
                    ep_rews.append(0.0)
                    next_obs = env.reset()

            _, next_value = self.model.action_value(next_obs[None, :])
            returns, advs = self._returns_advantages(rewards, dones, values, next_value)
            # a trick to input actions and advantages through same API
            acts_and_advs = np.concatenate([actions[:, None], advs[:, None]], axis=-1)
            # performs a full training step on the collected batch
            # note: no need to mess around with gradients, Keras API handles it
            losses = self.model.train_on_batch(observations, [acts_and_advs, returns])
        return ep_rews

    def _returns_advantages(self, rewards, dones, values, next_value):
        # next_value is the bootstrap value estimate of a future state (the critic)
        returns = np.append(np.zeros_like(rewards), next_value, axis=-1)
        # returns are calculated as discounted sum of future rewards
        for t in reversed(range(rewards.shape[0])):
            returns[t] = rewards[t] + self.params['gamma'] * returns[t+1] * (1-dones[t])
        returns = returns[:-1]
        # advantages are returns - baseline, value estimates in our case
        advantages = returns - values
        return returns, advantages

    def test(self, env, render=True):
        # unchanged from previous section
        ...

    def _value_loss(self, returns, value):
        # unchanged from previous section
        ...

    def _logits_loss(self, acts_and_advs, logits):
        # unchanged from previous section
        ...
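
To make the bookkeeping in _returns_advantages easier to follow, here is a list-based sketch of the same recursion on a hypothetical four-step batch. The function name, the sample numbers, and the choice of gamma=0.5 (picked so the arithmetic is easy to check by hand) are assumptions of this example.

```python
def returns_and_advantages(rewards, dones, values, next_value, gamma=0.99):
    """List-based version of the bootstrapped return computation."""
    # the extra last slot holds the critic's estimate for the state after the batch
    returns = [0.0] * len(rewards) + [next_value]
    for t in reversed(range(len(rewards))):
        # (1 - done) cuts the bootstrap at episode boundaries
        returns[t] = rewards[t] + gamma * returns[t + 1] * (1 - dones[t])
    returns = returns[:-1]
    advantages = [g - v for g, v in zip(returns, values)]
    return returns, advantages


# hypothetical batch: an episode ends at step 1, a new one starts at step 2
rewards = [1.0, 1.0, 1.0, 1.0]
dones = [0, 1, 0, 0]
values = [0.5, 0.5, 0.5, 0.5]
returns, advs = returns_and_advantages(rewards, dones, values,
                                       next_value=2.0, gamma=0.5)
print(returns)  # [1.5, 1.0, 2.0, 2.0]
print(advs)     # [1.0, 0.5, 1.5, 1.5]
```

Note how the return at step 1 is just its own reward: the done flag stops rewards from the next episode leaking back into the finished one.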

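Training and Results

We're now ready to train our single-worker A2C agent on CartPole-v0! Training shouldn't take longer than a couple of minutes. Once it completes, you should see the agent achieve the target score of 200 out of 200.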

rewards_history = agent.train(env)
print("Finished training, testing...")
print("%d out of 200" % agent.test(env)) # 200 out of 200

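In the source code I include some additional helpers that print out the running episode rewards and losses, along with a basic plotter for rewards_history.

Static Computational Graph

With all of this eager mode excitement, you might wonder whether static graph execution is still possible. Of course it is! Moreover, it takes just one additional line of code to enable it!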

with tf.Graph().as_default():
    print(tf.executing_eagerly()) # False
    model = Model(num_actions=env.action_space.n)
    agent = A2CAgent(model)
    rewards_history = agent.train(env)
    print("Finished training, testing...")
    print("%d out of 200" % agent.test(env)) # 200 out of 200

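One caveat: during static graph execution we can't just have Tensors lying around, which is why we needed the trick with ProbabilityDistribution in the model definition. In fact, while I was looking for a way to execute in static mode, I discovered an interesting low-level detail about models built through the Keras API.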

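One More Thing…

Remember when I said that TensorFlow runs in eager mode by default, and even proved it with a code snippet? Well, I was wrong.

If you use the Keras API to build and manage your models, it will attempt to compile them as static graphs. So what you end up with is the performance of static computational graphs combined with the flexibility of eager execution.

You can check the status of your model via the model.run_eagerly flag. You can also force eager mode by setting this flag to True, though most of the time you probably won't need to: if Keras detects that there is no way around eager mode, it will fall back to it on its own.

To illustrate that the model really does run as a static graph, here is a simple benchmark: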

# create a 100000 samples batch
env = gym.make('CartPole-v0')
obs = np.repeat(env.reset()[None, :], 100000, axis=0)

Eager Benchmark

%%time
model = Model(env.action_space.n)
model.run_eagerly = True
print("Eager Execution:  ", tf.executing_eagerly())
print("Eager Keras Model:", model.run_eagerly)
_ = model(obs)
######## Results #######
Eager Execution:   True
Eager Keras Model: True
CPU times: user 639 ms, sys: 736 ms, total: 1.38 s

Static Benchmark

%%time
with tf.Graph().as_default():
    model = Model(env.action_space.n)
    print("Eager Execution:  ", tf.executing_eagerly())
    print("Eager Keras Model:", model.run_eagerly)
    _ = model.predict(obs)
######## Results #######
Eager Execution:   False
Eager Keras Model: False
CPU times: user 793 ms, sys: 79.7 ms, total: 873 ms

Default Benchmark

%%time
model = Model(env.action_space.n)
print("Eager Execution:  ", tf.executing_eagerly())
print("Eager Keras Model:", model.run_eagerly)
_ = model.predict(obs)
######## Results #######
Eager Execution:   True
Eager Keras Model: False
CPU times: user 994 ms, sys: 23.1 ms, total: 1.02 s

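As you can see, eager mode lags behind static mode (1.38 s versus 873 ms of total CPU time in this run), and by default our model does indeed execute statically (1.02 s).

Conclusion

I hope this article has helped you understand both DRL and TensorFlow 2.0. Note that TensorFlow 2.0 is still only a preview, not even a release candidate, and everything may change. If there's something about TensorFlow you especially dislike, let its developers know!

One lingering question may be: is TensorFlow better than PyTorch? Maybe. Maybe not. Both are great libraries, so it's hard to say which one is better. If you're familiar with PyTorch, you have probably noticed that TensorFlow 2.0 has not only caught up with it, but also avoided some of the PyTorch API pitfalls.

In either case, this competition has brought positive results for developers on both sides, and I'm excited to see what the frameworks will become in the future.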

Author: 【方向】

This article is original content of the Yunqi Community (雲棲社區) and may not be reproduced without permission.