[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (7) --- DistributedOptimizer

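0x00 Abstract

Horovod is an easy-to-use, high-performance distributed training framework released by Uber in 2017, and it is widely used in industry.

This series walks through Horovod's source code. This article, the seventh in the series, looks at how Horovod integrates with TensorFlow.

Links to the previous articles: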

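[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (1) --- Basics

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (2) --- From the User's Perspective

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (3) --- What Happens Behind horovodrun

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (4) --- Networking Basics & Driver

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (5) --- Framework Integration

[Source Code Analysis] Deep Learning Distributed Training Framework Horovod (6) --- Background Thread Architecture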

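We need a few questions, or design points, to guide the analysis. Because readers may not have seen the other articles in this series, some of these questions overlap with earlier articles:

The first technical challenge: how does Horovod obtain the gradients from TF's execution flow so that it can process them?
- In TensorFlow 1.x, a deep learning computation is expressed as a computation graph that the TensorFlow runtime interprets and executes. To obtain the gradients computed by each process and AllReduce them, Horovod therefore has to hook into TF's graph execution in a somewhat hacky way.
The second technical challenge: Horovod can define its own AllReduce operation, but how can that operation be embedded into TF's processing flow?
- Horovod's custom HVD operations are independent of TF ops, so they cannot be inserted directly into a TF graph and executed. A mechanism is needed to register the HVD ops as TF ops.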

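0x01 Background Concepts

Let us first review some background concepts.

1.1 Deep Learning Frameworks

The core problem of deep learning training is fitting a function f() through backward gradient computation, whose purpose is to compute gradients and update parameters. Gradients are computed mainly via the chain rule, and one application of the chain rule is simply the result of one forward pass and one backward pass. The key steps of model training are therefore forward propagation and backpropagation.

Take a simple deep neural network as an example. To optimize the loss, we split the data into batches and repeatedly feed them into the network, iterating as follows until the network converges:

A batch of data is fed into the network for forward propagation, which is a series of matrix operations combined with activation functions and similar computations;
The predictions produced by forward propagation are compared with the ground-truth labels, and a loss function computes the loss for this iteration;
The loss is backpropagated through each earlier layer of the network to compute the gradients, which are then used to update each layer's weight matrix and bias.

One of the core problems a deep learning framework solves for us is computing and applying gradients during backpropagation. Without a framework, we would have to write the complex gradient computation and update logic ourselves.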
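
To make the iteration above concrete, here is a minimal hand-written sketch of what a framework automates. It is a toy, not Horovod or TensorFlow code: the one-feature linear model, the data, and the names `train_step`, `w`, and `b` are all assumptions chosen for illustration.

```python
# Toy data drawn from y = 2x + 1, split into two batches.
data = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)]
batches = [data[:2], data[2:]]


def train_step(batch, w, b, learning_rate):
    """Run one iteration: forward pass, loss, backward pass, parameter update."""
    n = len(batch)
    # 1. Forward propagation: compute predictions for the batch.
    preds = [w * x + b for x, _ in batch]
    # 2. Loss: mean squared error between predictions and labels.
    loss = sum((p - y) ** 2 for p, (_, y) in zip(preds, batch)) / n
    # 3. Backpropagation: gradients of the loss w.r.t. w and b (chain rule).
    grad_w = sum(2.0 * (p - y) * x for p, (x, y) in zip(preds, batch)) / n
    grad_b = sum(2.0 * (p - y) for p, (_, y) in zip(preds, batch)) / n
    # 4. Update the parameters with plain gradient descent.
    return w - learning_rate * grad_w, b - learning_rate * grad_b, loss


w, b = 0.0, 0.0
for _ in range(1000):
    for batch in batches:
        w, b, loss = train_step(batch, w, b, learning_rate=0.05)
```

Even for two parameters the gradient formulas have to be derived by hand; a framework derives them automatically for arbitrary graphs.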

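1.2 TensorFlow Optimizer

At its lowest level, TensorFlow is structured as a computation graph of tensors. The computation graph is the underlying programming system: every computation is a node in the graph, and dependencies between computations are expressed as edges between nodes. The computation graph is the structural basis for forward propagation and backpropagation.

Given a computation graph, TensorFlow uses automatic differentiation (backpropagation) to compute gradients. tf.train.Optimizer lets us update weights automatically through minimize(), which does two things:

Compute the gradients. It calls compute_gradients(loss, var_list ...) to compute the gradient of loss with respect to the variables in var_list, and returns a list of tuples, list(zip(grads, var_list)).
Apply the computed gradients to the corresponding weights. It calls apply_gradients(grads_and_vars, global_step=global_step, name=None), taking the return value of compute_gradients(loss, var_list ...) as input to update the weight variables.

minimize() is split into two steps so that the gradients can be adjusted in between when needed, for example to mitigate vanishing or exploding gradients.

TensorFlow also lets users handle the gradients themselves: after the user post-processes them, the gradients are applied to the weights. The procedure then breaks down into three steps:

Compute the gradients with tf.train.Optimizer.compute_gradients;
Apply custom processing to the gradients. This is exactly where Horovod can step in;
Update the weights with tf.train.Optimizer.apply_gradients, using the processed gradients.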
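
The three-step pattern can be sketched without TensorFlow. The `ToySGD` class and `clip_gradients` function below are hypothetical stand-ins that only mimic the shape of `compute_gradients` / `apply_gradients` (a list of (gradient, variable) pairs); they are not the TensorFlow API, and the loss `sum(v ** 2)` is an assumption made so the gradient is trivial.

```python
class ToySGD:
    """Hypothetical optimizer for the loss sum(v ** 2) over named variables."""

    def __init__(self, learning_rate):
        self.learning_rate = learning_rate

    def compute_gradients(self, variables):
        # d(sum(v ** 2)) / dv = 2 * v; return (gradient, variable name) pairs.
        return [(2.0 * value, name) for name, value in variables.items()]

    def apply_gradients(self, grads_and_vars, variables):
        for grad, name in grads_and_vars:
            variables[name] -= self.learning_rate * grad


def clip_gradients(grads_and_vars, limit):
    """Custom processing step: clamp every gradient to [-limit, limit]."""
    return [(max(-limit, min(limit, grad)), name) for grad, name in grads_and_vars]


variables = {"w": 10.0, "b": -0.5}
opt = ToySGD(learning_rate=0.1)

grads_and_vars = opt.compute_gradients(variables)        # step 1: compute
grads_and_vars = clip_gradients(grads_and_vars, 5.0)     # step 2: custom processing
opt.apply_gradients(grads_and_vars, variables)           # step 3: apply
```

Horovod uses the same slot as `clip_gradients` here, except that its processing step is an AllReduce across processes rather than a local clamp.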

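0x02 Overall Architecture

2.1 Overall Approach

Every process in a Horovod job calls single-machine TensorFlow for its local computation, then collects the gradients, aggregates them through AllReduce, and updates the model in every process.

To do so, Horovod needs to intercept the gradients from TensorFlow.

TensorFlow 1.x
- In TensorFlow 1.x, a deep learning computation is a computation graph that the TensorFlow runtime interprets and executes.
- To obtain the gradients computed by each process and AllReduce them, Horovod has to get inside the graph execution. It does this by wrapping the user's optimizer (composition): Horovod requires developers to use its own hvd.DistributedOptimizer in place of the official TensorFlow optimizer, so that it can obtain the gradients during the optimization phase.
TensorFlow 2.0
- The eager execution mode of TensorFlow 2.0 computes in a completely different way. During the forward pass, calls to the basic computation units (operators) are recorded in a data structure called a tape. The backward pass then walks back through the tape and calls the gradient operator that corresponds to each recorded operator. The tape provides an operation that lets the user obtain the gradient of every parameter.
- Horovod can therefore obtain the gradients directly through the TensorFlow 2.0 API, and it performs the AllReduce call by wrapping the tape.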
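
The tape idea and the wrapping idea can both be illustrated with a deliberately tiny sketch. Everything here is hypothetical: `ToyTape` only supports a loss that is a sum of scaled inputs, and `ToyDistributedTape` averages with a hard-coded peer gradient instead of performing a real AllReduce. It is not how `tf.GradientTape` or Horovod's tape wrapper is implemented.

```python
class ToyTape:
    """Hypothetical tape for a loss of the form sum(factor_i * input_i)."""

    def __init__(self):
        self.records = []  # one (input_name, local_derivative) entry per op

    def scale(self, name, value, factor):
        # Forward op y = factor * value; record dy/dvalue = factor on the tape.
        self.records.append((name, factor))
        return factor * value

    def gradient(self):
        # Walk the tape backwards and accumulate the gradient for each input.
        grads = {}
        for name, local_derivative in reversed(self.records):
            grads[name] = grads.get(name, 0.0) + local_derivative
        return grads


class ToyDistributedTape:
    """Wraps a tape and averages its gradients with those of one simulated peer."""

    def __init__(self, tape, peer_grads):
        self._tape = tape
        self._peer_grads = peer_grads  # stand-in for what an AllReduce would fetch

    def gradient(self):
        local = self._tape.gradient()
        return {name: (g + self._peer_grads[name]) / 2.0 for name, g in local.items()}


tape = ToyTape()
# Forward pass: loss = 2*w + 4*b + 1*w with w = 3 and b = 1.
loss = tape.scale("w", 3.0, 2.0) + tape.scale("b", 1.0, 4.0) + tape.scale("w", 3.0, 1.0)
local_grads = tape.gradient()
averaged = ToyDistributedTape(tape, {"w": 5.0, "b": 0.0}).gradient()
```

The caller keeps using `gradient()` as before; only the object it is called on changes, which is the essence of the wrapping approach.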

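2.2 Overall Call Relationship

We first give the overall call relationship: hvd.DistributedOptimizer inherits from the Keras Optimizer, and in its overridden get_gradients it passes the gradients it obtains to hvd.allreduce(gradients, ...), thereby reducing the gradients collectively across the whole Horovod cluster.

The logic for computing gradients is:

TF calls the compute_gradients method of hvd.DistributedOptimizer:
- hvd.DistributedOptimizer first uses the official TF optimizer.compute_gradients to compute the local gradients;
- It then uses AllReduce to obtain the gradients averaged across all processes;
- compute_gradients returns a list of (gradient, variable) pairs, which apply_gradients consumes.
TF calls the apply_gradients method of hvd.DistributedOptimizer:
- It calls the official TF optimizer.apply_gradients on the arguments passed in and returns an op that updates the weights. TF can use this return value for subsequent processing.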
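
The call relationship above can be simulated in a single process. This is a sketch under stated assumptions: `ToySGD`, `ToyDistributedOptimizer`, and `make_allreduce` are hypothetical names, the model is the one-weight function y = w * x, and the "AllReduce" simply recomputes the peers' local gradients in-process instead of communicating.

```python
class ToySGD:
    """Hypothetical local optimizer for the one-weight model y = w * x."""

    def __init__(self, w, learning_rate):
        self.w = w
        self.learning_rate = learning_rate

    def compute_gradients(self, batch):
        grad = sum(2.0 * (self.w * x - y) * x for x, y in batch) / len(batch)
        return [(grad, "w")]

    def apply_gradients(self, grads_and_vars):
        for grad, _ in grads_and_vars:
            self.w -= self.learning_rate * grad


class ToyDistributedOptimizer:
    """Wraps a local optimizer and averages gradients before they are applied."""

    def __init__(self, optimizer, allreduce_fn):
        self._optimizer = optimizer
        self._allreduce_grads = allreduce_fn

    def compute_gradients(self, batch):
        gradients = self._optimizer.compute_gradients(batch)  # local gradients
        grads, variables = zip(*gradients)
        avg_grads = self._allreduce_grads(grads)               # cross-worker average
        return list(zip(avg_grads, variables))

    def apply_gradients(self, grads_and_vars):
        return self._optimizer.apply_gradients(grads_and_vars)


# Two simulated workers, each holding a different shard of data from y = 2x.
shards = [[(1.0, 2.0)], [(2.0, 4.0)]]
inner = [ToySGD(w=0.0, learning_rate=0.1) for _ in shards]


def make_allreduce(rank):
    def allreduce(local_grads):
        gathered = []
        for r, (opt, shard) in enumerate(zip(inner, shards)):
            if r == rank:
                gathered.append(list(local_grads))
            else:  # simulation only: recompute the peer's local gradients
                gathered.append([g for g, _ in opt.compute_gradients(shard)])
        return [sum(column) / len(gathered) for column in zip(*gathered)]
    return allreduce


workers = [ToyDistributedOptimizer(opt, make_allreduce(rank))
           for rank, opt in enumerate(inner)]
averaged = [worker.compute_gradients(shard) for worker, shard in zip(workers, shards)]
for worker, grads_and_vars in zip(workers, averaged):
    worker.apply_gradients(grads_and_vars)
```

Because every worker applies the same averaged gradient, the replicas stay identical even though each one saw different data.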

Because of the differences between TF versions, we analyze 1.x and 2.x separately.

0x04 TensorFlow 1.x

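As mentioned earlier, Horovod requires developers to use its own hvd.DistributedOptimizer in place of the official TensorFlow optimizer so that it can obtain the gradients during the optimization phase. We therefore start the analysis from _DistributedOptimizer.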

4.1 _DistributedOptimizer

We take horovod/tensorflow/__init__.py as an example.

try:
    # TensorFlow 2.x
    _LegacyOptimizer = tf.compat.v1.train.Optimizer
except AttributeError:
    try:
        # TensorFlow 1.x
        _LegacyOptimizer = tf.train.Optimizer
    except AttributeError:
        # Future TensorFlow versions
        _LegacyOptimizer = None

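As shown, for the TensorFlow 1.x style API everything that follows builds on _LegacyOptimizer.

_DistributedOptimizer inherits from _LegacyOptimizer. It wraps another tf.Optimizer and uses an allreduce operation to combine the gradient values and average them before the gradients are applied to the model. The wrapped tf.Optimizer is the official TF optimizer that the user specifies.

Recall how a user uses it: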

# Official TF optimizer
opt = tf.optimizers.Adam(scaled_lr)

# Wrap the regular TensorFlow optimizer with Horovod so that ring-allreduce is used to obtain averaged gradients
opt = hvd.DistributedOptimizer(
    opt, backward_passes_per_step=1, average_aggregated_gradients=True)

# In the end the model uses hvd.DistributedOptimizer
mnist_model.compile(loss=tf.losses.SparseCategoricalCrossentropy(),
                    optimizer=opt, metrics=['accuracy'],
                    experimental_run_tf_function=False)

opt is passed as the optimizer argument of DistributedOptimizer and is assigned to self._optimizer in the constructor __init__.

if _LegacyOptimizer is not None:
    class _DistributedOptimizer(_LegacyOptimizer):
        """An optimizer that wraps another tf.Optimizer, using an allreduce to
        combine gradient values before applying gradients to model weights."""

        def __init__(self, optimizer, name=None, use_locking=False, device_dense='',
                    device_sparse='', compression=Compression.none,
                    sparse_as_dense=False, op=Average, gradient_predivide_factor=1.0,
                    backward_passes_per_step=1, average_aggregated_gradients=False,
                    groups=None):

            self._optimizer = optimizer # the wrapped optimizer is stored in self._optimizer
            self._allreduce_grads = _make_allreduce_grads_fn( # set up the reduction function
                name, device_dense, device_sparse, compression, sparse_as_dense, op,
                gradient_predivide_factor, groups)

            self._agg_helper = None
            if backward_passes_per_step > 1:
                # accumulate gradients locally first, then merge across processes
                self._agg_helper = LocalGradientAggregationHelper(
                    backward_passes_per_step=backward_passes_per_step,
                    allreduce_func=self._allreduce_grads,
                    sparse_as_dense=sparse_as_dense,
                    average_aggregated_gradients=average_aggregated_gradients,
                    rank=rank(),
                    optimizer_type=LocalGradientAggregationHelper._OPTIMIZER_TYPE_LEGACY,
                )

4.2 compute_gradients

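The first step of computing gradients is to call compute_gradients, which computes the gradient of loss with respect to the variables in var_list and returns a list of tuples, list(zip(grads, var_list)).

The model replica on every worker calls compute_gradients. For each replica,

gradients = self._optimizer.compute_gradients(*args, **kwargs) is the gradient computed locally by that replica.

DistributedOptimizer overrides the compute_gradients() method of the Optimizer class.

When _DistributedOptimizer is initialized, it sets self._allreduce_grads = _make_allreduce_grads_fn(...). This is important.
compute_gradients() first calls compute_gradients() of the originally configured official TF optimizer. The return value is a list of tuples, each of the form (gradient, variable), where gradient is the gradient for that variable;
If _agg_helper (a LocalGradientAggregationHelper) is set, it is called to accumulate gradients locally (a cross-process merge still happens after local accumulation). Otherwise _allreduce_grads is called, which merges across processes directly (an allreduce over the distributed gradients, for example via MPI).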

        def compute_gradients(self, *args, **kwargs):
            """Compute gradients of all trainable variables.

            See Optimizer.compute_gradients() for more info.

            In DistributedOptimizer, compute_gradients() is overriden to also
            allreduce the gradients before returning them.
            """
            
            # _optimizer is the originally configured official optimizer; call its
            # compute_gradients method first to compute the gradients of all trainable variables.
            # It returns a list of (gradient, variable) tuples, which is assigned to gradients.
            gradients = self._optimizer.compute_gradients(*args, **kwargs)
            grads, vars = zip(*gradients)
            
            if self._agg_helper: # accumulate locally first?
                avg_grads = self._agg_helper.compute_gradients(grads, vars)
            else:
                avg_grads = self._allreduce_grads(grads, vars)
            return list(zip(avg_grads, vars))

The logic is as follows:

+-----------------------------+
|_DistributedOptimizer        |
|                             |
|                             |       +---------------+
| self._optimizer  +----------------> | tf.Optimizer  |
|                             |       |               |
|                             |       +---------------+
|                             |
|                             |       +-------------------------+
| _allreduce_grads +----------------> |_make_allreduce_grads_fn |
|                             |       +-------------------------+
|                             |
|                             |
|                             |
|                             |
|                             |       +-------------------------------------------------+
| compute_gradients  +------------->  |compute_gradients                                |
|                             |       |                                                 |
+-----------------------------+       |                                                 |
                                      |      _optimizer.compute_gradients               |
                                      |                +                                |
                                      |                |                                |
                                      |                |                                |
                                      |                v                                |
                                      |      _agg_helper.compute_gradients(grads, vars) |
                                      |                                                 |
                                      |      _allreduce_grads(grads, vars)              |
                                      |                +                                |
                                      |                |                                |
                                      |                |                                |
                                      |                v                                |
                                      |       list(zip(avg_grads, vars))                |
                                      |                                                 |
                                      +-------------------------------------------------+

4.3 LocalGradientAggregationHelper

As mentioned above, if _agg_helper (a LocalGradientAggregationHelper) is set, it is called to accumulate gradients locally (a cross-process merge still happens after local accumulation). So let us look at LocalGradientAggregationHelper.

LocalGradientAggregationHelper accumulates gradients locally. At initialization, however, its member self._allreduce_grads is set to allreduce_func, which is the cross-process allreduce function, so LocalGradientAggregationHelper also performs a cross-process allreduce: once every backward_passes_per_step calls.

Note that allreduce_func=self._allreduce_grads, so when LocalGradientAggregationHelper calls self._allreduce_grads internally, it ends up in the function produced by _make_allreduce_grads_fn.

LocalGradientAggregationHelper(
                        backward_passes_per_step=backward_passes_per_step,
                        allreduce_func=self._allreduce_grads, # the function produced by _make_allreduce_grads_fn
                        sparse_as_dense=sparse_as_dense,
                        average_aggregated_gradients=average_aggregated_gradients,
                        rank=rank(),
                        optimizer_type=LocalGradientAggregationHelper._OPTIMIZER_TYPE_KERAS,
                    )

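The work is done by LocalGradientAggregationHelper.compute_gradients, in which:

_init_aggregation_vars iterates over the local gradients and, for each one, creates a variable in locally_aggregated_grads in which that gradient is accumulated.
allreduce_grads iterates over the tensors and applies the reduction to each of them; the cross-process merge is triggered through the _allreduce_grads_helper function.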

4.3.1 _init_aggregation_vars

_init_aggregation_vars iterates over the local gradients and creates the accumulation variables stored in locally_aggregated_grads.

def _init_aggregation_vars(self, grads):
    """
    Initializes the counter that is used when to communicate and aggregate gradients
    and the tensorflow variables that store the locally aggregated gradients.
    """
    variable_scope_name = "aggregation_variables_" + str(self.rank)
    with tf.compat.v1.variable_scope(variable_scope_name, reuse=tf.compat.v1.AUTO_REUSE):
        self.counter = tf.compat.v1.get_variable(
            "aggregation_counter", shape=(), dtype=tf.int32,
            trainable=False, initializer=tf.compat.v1.zeros_initializer(),
            collections=[tf.compat.v1.GraphKeys.LOCAL_VARIABLES],
        )
        # Iterate over the local gradients
        for idx, grad in enumerate(grads):
            # Handle IndexedSlices: convert to a dense tensor when sparse_as_dense is set.
            if self.sparse_as_dense and isinstance(grad, tf.IndexedSlices):
                grad = tf.convert_to_tensor(grad)
            elif isinstance(grad, tf.IndexedSlices):
                raise ValueError(
                    "IndexedSlices are not supported when "
                    "`backward_passes_per_step` > 1 and "
                    "`sparse_as_dense` is False."
                )

            # Handle grads that are None: count them and skip.
            if grad is None:
                self.num_none_grad_updates += 1
                continue
            self.not_none_indexes[idx] = len(self.locally_aggregated_grads)

            # Create shadow variable.
            grad_aggregation_variable_name = str(idx)
            zero_grad = tf.zeros(shape=grad.get_shape().as_list(), dtype=grad.dtype)
            grad_aggregation_variable = tf.compat.v1.get_variable(
                grad_aggregation_variable_name,
                trainable=False,
                initializer=zero_grad,
                collections=[
                    tf.compat.v1.GraphKeys.LOCAL_VARIABLES,
                    "aggregating_collection"],
            )
            # Append to the local accumulation variables, locally_aggregated_grads
            self.locally_aggregated_grads.append(grad_aggregation_variable)
        assert len(self.locally_aggregated_grads) + \
            self.num_none_grad_updates == len(grads)

    # We expect to get a `sess` when we need to manually do a `sess.run(...)`
    # for the variables to be initialized. This is the `tf.keras`
    # optimizers.
    # Initialize the counter and the variables in locally_aggregated_grads when needed
    if self.optimizer_type == self._OPTIMIZER_TYPE_KERAS:
        session = tf.compat.v1.keras.backend.get_session(op_input_list=())
        vars_init_op = tf.compat.v1.variables_initializer(
            [self.counter, *get_not_none_from_list(self.locally_aggregated_grads)]
        )
        session.run(vars_init_op)
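
The bookkeeping around None gradients (`not_none_indexes` and `num_none_grad_updates`) is easier to see in isolation. The sketch below mirrors only that index mapping with plain floats; the function names `build_not_none_indexes` and `reinsert_none` are hypothetical, and the doubling step merely stands in for whatever processing the real helper performs.

```python
def build_not_none_indexes(grads):
    """Map each original position with a real gradient to its compact position."""
    not_none_indexes = {}
    compact = []
    for idx, grad in enumerate(grads):
        if grad is None:
            continue  # variables without a gradient get no accumulation slot
        not_none_indexes[idx] = len(compact)
        compact.append(grad)
    return not_none_indexes, compact


def reinsert_none(processed, not_none_indexes, total):
    """Rebuild a list of the original length, with None at the skipped positions."""
    return [processed[not_none_indexes[idx]] if idx in not_none_indexes else None
            for idx in range(total)]


grads = [0.5, None, 1.5, None, 2.5]
index_map, compact = build_not_none_indexes(grads)
processed = [g * 2 for g in compact]  # stand-in for accumulation / allreduce
restored = reinsert_none(processed, index_map, len(grads))
```

Keeping the compact list separate lets the accumulation variables be created only for gradients that exist, while callers still receive a list aligned with the original variables.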

4.3.2 compute_gradients

The compute_gradients method is as follows:

    def compute_gradients(self, grads, vars):
        """
        Applies the new gradient updates the locally aggregated gradients, and
        performs cross-machine communication every backward_passes_per_step
        times it is called.
        """
        # Create the counter and the local accumulation variables (locally_aggregated_grads)
        self._init_aggregation_vars(grads)

        # Clear the locally aggregated gradients when the counter is at zero.
        clear_op = tf.cond(
            pred=tf.equal(self.counter, 0),
            true_fn=lambda: self._clear_grads(),
            false_fn=tf.no_op
        )

        # Add new gradients to the locally aggregated gradients.
        with tf.control_dependencies([clear_op]):
            aggregation_ops_list = self._aggregate_grads(grads)

        # Increment the counter once new gradients have been applied.
        aggregation_ops = tf.group(*aggregation_ops_list)
        with tf.control_dependencies([aggregation_ops]):
            update_counter = self.counter.assign_add(tf.constant(1))

        # Allreduce the accumulated gradients when due
        with tf.control_dependencies([update_counter]):
            grads = get_not_none_from_list(grads)
            assert len(grads) == len(self.locally_aggregated_grads)

            # Allreduce locally aggregated gradients when the counter is equivalent to
            # `backward_passes_per_step`. If the condition is true, it also resets
            # the counter back to 0.
            allreduced_grads = tf.cond(
                tf.equal(self.counter, self.backward_passes_per_step),
                lambda: self._allreduce_grads_helper(grads, vars),
                lambda: grads,
            )

            # Handle case where there is only one variable.
            if not isinstance(allreduced_grads, (list, tuple)):
                allreduced_grads = (allreduced_grads,)

            # Insert gradients that are None back in.
            # Rebuild a list of the original length, with None at the positions of the skipped gradients
            allreduced_grads = [
                allreduced_grads[self.not_none_indexes[idx]] if idx in self.not_none_indexes else None
                for idx in range(len(self.locally_aggregated_grads) + self.num_none_grad_updates)
            ]

        # If gradients have not been allreduced this batch, we return the gradients
        # that were submitted as the updates (the input).
        return allreduced_grads # merged gradients, or the input gradients if no allreduce ran this call
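
Stripped of graph-mode control dependencies, the counter logic above amounts to the following plain-Python sketch. `ToyLocalAggregator` and `fake_allreduce` are hypothetical names, gradients are plain floats, and the point at which averaging by `backward_passes_per_step` happens is an assumption, since `_allreduce_grads_helper` is not shown in this article.

```python
class ToyLocalAggregator:
    """Accumulates gradients locally and communicates every N-th call."""

    def __init__(self, backward_passes_per_step, allreduce_func,
                 average_aggregated_gradients=False):
        self.backward_passes_per_step = backward_passes_per_step
        self.allreduce_func = allreduce_func
        self.average_aggregated_gradients = average_aggregated_gradients
        self.counter = 0
        self.locally_aggregated_grads = []

    def compute_gradients(self, grads):
        # Clear the accumulators when the counter is at zero.
        if self.counter == 0:
            self.locally_aggregated_grads = [0.0] * len(grads)
        # Add the new gradients to the local accumulators, then bump the counter.
        self.locally_aggregated_grads = [
            acc + g for acc, g in zip(self.locally_aggregated_grads, grads)]
        self.counter += 1
        # Not yet time to communicate: hand back the input gradients unchanged.
        if self.counter < self.backward_passes_per_step:
            return list(grads)
        # Time to communicate: reset the counter and allreduce the accumulated values.
        self.counter = 0
        aggregated = self.locally_aggregated_grads
        if self.average_aggregated_gradients:  # assumed placement of the division
            aggregated = [g / self.backward_passes_per_step for g in aggregated]
        return self.allreduce_func(aggregated)


calls = []


def fake_allreduce(grads):
    """Stand-in for the cross-process allreduce: records the call, returns its input."""
    calls.append(list(grads))
    return list(grads)


helper = ToyLocalAggregator(2, fake_allreduce, average_aggregated_gradients=True)
first = helper.compute_gradients([1.0, 2.0])   # accumulated only
second = helper.compute_gradients([3.0, 4.0])  # accumulated, averaged, allreduced
```

With `backward_passes_per_step=2`, only every second call touches the network, which is the purpose of local aggregation.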

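The extended logic is shown below. Note that exactly one of _agg_helper and _allreduce_grads is executed:

If _agg_helper (a LocalGradientAggregationHelper) is set, _agg_helper is called to compute the gradients (a cross-process merge still happens after local accumulation);
Otherwise _allreduce_grads, the function produced by _make_allreduce_grads_fn, is called to merge across processes (an allreduce over the distributed gradients, for example via MPI).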

 +-----------------------------+
 |_DistributedOptimizer        |                                                                   +-----------------------------------------------------+
 |                             |                                                                   | LocalGradientAggregationHelper                      |
 |                             |       +---------------+                                           |                                                     |
 | self._optimizer  +----------------> | tf.Optimizer  |                                           |    +---------------------------------------------+  |
 |                             |       |               |                                           |    | compute_gradients                           |  |
 |                             |       +---------------+                                           |    |                                             |  |
 |                             |                                                                   |    |                                             |  |
 |                             |       +------------------------------------------------------+    |    |         _init_aggregation_vars              |  |
 | compute_gradients  +------------->  |compute_gradients                                     |    |    |                    +                        |  |
 |                             |       |                                                      |    |    |                    |                        |  |
 |                             |       |                                                      |    |    |                    |                        |  |
 |                             |       |      _optimizer.compute_gradients                    |    |    |                    v                        |  |
 | _allreduce_grads            |       |                +                                     |    |    |                                             |  |
 |      +                      |       |                |                                     |    |    |        _allreduce_grads_helper              |  |
 |      |                      |       |                |                                     |    |    |                    +                        |  |
 +-----------------------------+       |                v                                     |    |    |                    |                        |  |
        |                              |      _agg_helper.compute_gradients(grads, vars) +------------> |                    |                        |  |
        |                              |                                                      |    |    |                    v                        |  |
        |                   +--------------+  _allreduce_grads(grads, vars)                   |    |    |             allreduced_grads                |  |
        |                   |          |                +                                     |    |    |                                             |  |
        |                   |          |                |                                     |    |    +---------------------------------------------+  |
        |                   |          |                |                                     |    |                                                     |
        |                   |          |                v                                     |    |     allreduce_func                                  |
        |                   |          |       list(zip(avg_grads, vars))                     |    |            +                                        |
        |                   |          |                                                      |    |            |                                        |
        |                   |          +------------------------------------------------------+    +-----------------------------------------------------+
        |                   |                                                                                   |
        |                   |                                                                                   |
        v                   v                                                                                   |
+-------+-------------------+--------+                                                                          |
|_make_allreduce_grads_fn            |                                                                          |
|                                    |  <-----------------------------------------------------------------------+
|                _allreduce_cond     |
|                                    |
|                                    |
|                                    |
+------------------------------------+

The details are as follows:

4.4 _make_allreduce_grads_fn

_make_allreduce_grads_fn simply delegates to _make_cached_allreduce_grads_fn.

def _make_allreduce_grads_fn(name, device_dense, device_sparse,
                             compression, sparse_as_dense, op,
                             gradient_predivide_factor, groups):
    groups = vars_to_refs(groups) if isinstance(groups, list) else groups
    return _make_cached_allreduce_grads_fn(name, device_dense, device_sparse,
                                           compression, sparse_as_dense, op,
                                           gradient_predivide_factor, groups)

_make_cached_allreduce_grads_fn does the following:

Takes all the grads;
Iterates over them and, for every grad, uses _allreduce_cond to synchronize it with the other workers;
Finally returns the list of synchronized gradients.

@_cache
def _make_cached_allreduce_grads_fn(name, device_dense, device_sparse,
                                    compression, sparse_as_dense, op,
                                    gradient_predivide_factor, groups):
    groups = refs_to_vars(groups) if isinstance(groups, tuple) else groups
    ......
    def allreduce_grads(grads, vars=None):
        with tf.name_scope(name + "_Allreduce"): # set the name scope
            ......
            # Process every grad.
            # grads may contain many None entries, so only the entries that are not None are allreduced.
            return [_allreduce_cond(grad,
                                    device_dense=device_dense,
                                    device_sparse=device_sparse,
                                    compression=compression,
                                    op=op,
                                    prescale_factor=prescale_factor,
                                    postscale_factor=postscale_factor)
                    if grad is not None else grad
                    for grad in grads]

    if _executing_eagerly():
        return _make_subgraph(allreduce_grads)
    else:
        return allreduce_grads
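
The factory pattern used here (a cached function that returns a closure over the configuration) can be sketched as follows. This is an illustration under assumptions: `functools.lru_cache` stands in for Horovod's `_cache` decorator, `world_size` is passed explicitly instead of being queried, and the "reduction" pretends that every peer holds the same value as the local gradient.

```python
import functools


@functools.lru_cache(maxsize=None)
def make_allreduce_grads_fn(name, world_size, op):
    """Build (and cache) a function that reduces a list of gradients."""

    def allreduce_one(grad):
        # Counterpart of _allreduce_cond: skip communication for a single process.
        if world_size <= 1:
            return grad
        # Simulation only: assume every peer contributes the same value.
        total = grad * world_size
        return total / world_size if op == "average" else total

    def allreduce_grads(grads):
        # None entries (variables without gradients) pass through untouched.
        return [allreduce_one(g) if g is not None else g for g in grads]

    return allreduce_grads


sum_fn = make_allreduce_grads_fn("DistributedAdam", 4, "sum")
avg_fn = make_allreduce_grads_fn("DistributedAdam", 4, "average")
single_fn = make_allreduce_grads_fn("DistributedAdam", 1, "sum")

summed = sum_fn([1.0, None, 2.0])
averaged = avg_fn([1.0, None, 2.0])
unchanged = single_fn([1.0, None, 2.0])
```

Caching requires hashable arguments, which is presumably why the real code converts `groups` between variables and references (`vars_to_refs` / `refs_to_vars`) around the cached call.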

_allreduce_cond is where allreduce is called to perform the collective communication.

def _allreduce_cond(tensor, *args, **kwargs):
    def allreduce_fn():
        return allreduce(tensor, *args, **kwargs)

    def id_fn():
        return tensor

    return tf.cond((size_op() > 1) if int(os.environ.get("HOROVOD_ELASTIC", 0)) else tf.convert_to_tensor(size() > 1),
                   allreduce_fn, id_fn)

4.5 allreduce

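Inside allreduce(), the tensor is handled differently depending on whether it is a Tensor or an IndexedSlices.

If the tensor is an IndexedSlices, only allgather operations are needed; whether anything further is required depends on the configured op.
- In the simple case, the values and indices of the IndexedSlices held by different workers do not overlap.
- Suppose worker 1 holds indices [1, 3, 5, 7, 9] and worker 2 holds indices [2, 4, 6, 8, 10]. Gathering them with allgather yields a result that covers indices 1 through 10, so no cross-worker summation is needed.
- Further processing is required only when the op calls for it, for example Average, which divides the gathered values by the Horovod size.
If it is a Tensor, the _allreduce method is called: the tensors are summed first and then averaged.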

def allreduce(tensor, average=None, device_dense='', device_sparse='',
              compression=Compression.none, op=None,
              prescale_factor=1.0, postscale_factor=1.0,
              name=None):
    """Perform an allreduce on a tf.Tensor or tf.IndexedSlices.
    """
    op = handle_average_backwards_compatibility(op, average)

    if isinstance(tensor, tf.IndexedSlices): # IndexedSlices (sparse) case
        # TODO: Need to fix this to actually call Adasum
        if op == Adasum:
            ...  # the Horovod source raises NotImplementedError here (elided in this excerpt)
        with tf.device(device_sparse):
            # For IndexedSlices, do two allgathers instead of an allreduce.
            horovod_size = tf.cast(size_op() if int(os.environ.get("HOROVOD_ELASTIC", 0)) else size(),
                                   dtype=tensor.values.dtype)
            values = allgather(tensor.values) # one allgather for the values
            indices = allgather(tensor.indices) # one allgather for the indices

            # To make this operation into an average, divide allgathered values by
            # the Horovod size. For any other op the values are left unchanged.
            new_values = (values / horovod_size) if op == Average else values
        return tf.IndexedSlices(new_values, indices,
                                dense_shape=tensor.dense_shape)
    else: # Tensor (dense) case
        average_in_framework = False
        if rocm_built():
            # For ROCm, perform averaging at framework level
            average_in_framework = op == Average or op == Adasum
            op = Sum if op == Average else op

        with tf.device(device_dense):
            # First, cast the result of size_op() (or size()) to the tensor's dtype
            horovod_size = tf.cast(size_op() if int(os.environ.get("HOROVOD_ELASTIC", 0)) else size(),
                                   dtype=tensor.dtype)
            tensor_compressed, ctx = compression.compress(tensor)
            # Reduce the compressed tensor with the same-named tensors on all other Horovod processes
            summed_tensor_compressed = _allreduce(tensor_compressed, op=op,
                                                  prescale_factor=prescale_factor,
                                                  postscale_factor=postscale_factor,
                                                  name=name)
            summed_tensor = compression.decompress(summed_tensor_compressed, ctx)
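            if op == Adasum: # additional handling for Adasum
                if 'CPU' not in tensor.device and gpu_available('tensorflow'):
                    if nccl_built():
                        if not is_homogeneous():
                            ...  # error handling elided in this excerpt
                        elif not check_num_rank_power_of_2(int(size() / local_size())):
                            ...  # error handling elided in this excerpt
                        if rocm_built():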
                            horovod_local_size = tf.cast(local_size_op() if int(os.environ.get("HOROVOD_ELASTIC", 0)) else local_size(),
                                                         dtype=tensor.dtype)
                            new_tensor = summed_tensor / horovod_local_size
                        else:
                            new_tensor = summed_tensor
                    else:
                        new_tensor = summed_tensor
                else:
                    new_tensor = summed_tensor
            else:
                if rocm_built():
                    new_tensor = (summed_tensor / horovod_size) if average_in_framework else summed_tensor
                else:
                    new_tensor = summed_tensor
        return new_tensor
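
The difference between the dense and the sparse path can be simulated with plain lists. In this sketch each "worker" is just an entry in a Python list, `allreduce_dense` and `allreduce_sparse` are hypothetical names, and an IndexedSlices is modeled as an (indices, values) pair; no communication takes place.

```python
def allreduce_dense(per_worker_tensors, average=True):
    """Dense path: element-wise sum across workers, optionally divided by the size."""
    size = len(per_worker_tensors)
    summed = [sum(column) for column in zip(*per_worker_tensors)]
    return [value / size for value in summed] if average else summed


def allreduce_sparse(per_worker_slices, average=True):
    """Sparse path: two allgathers (concatenations), one for indices, one for values."""
    size = len(per_worker_slices)
    indices = [i for worker_indices, _ in per_worker_slices for i in worker_indices]
    values = [v for _, worker_values in per_worker_slices for v in worker_values]
    if average:
        values = [v / size for v in values]
    return indices, values


dense_result = allreduce_dense([[1.0, 2.0], [3.0, 4.0]])
sparse_indices, sparse_values = allreduce_sparse([
    ([1, 3], [10.0, 30.0]),   # worker 1
    ([2, 4], [20.0, 40.0]),   # worker 2
])
```

Note that the gathered indices come back in worker order rather than sorted, which is fine because each value stays paired with its index.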

4.6 _allreduce

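The _allreduce and allgather functions are defined in horovod/tensorflow/mpi_ops.py.

HorovodAllreduceOp and HorovodAllgatherOp are the custom TensorFlow ops defined by HVD; _allreduce and allgather correspond to them respectively.

_allreduce is bound to HorovodAllreduceOp through the op name "HorovodAllreduce", with MPI_LIB.horovod_allreduce acting as the intermediate wrapper;
allgather is bound to HorovodAllgatherOp through the op name "HorovodAllgather", with MPI_LIB.horovod_allgather acting as the intermediate wrapper.

Combined with the name scope configured in _make_cached_allreduce_grads_fn above, a tensor name looks roughly like DistributedAdam_Allreduce/cond_14/HorovodAllreduce_grads_5_0.

This is how the corresponding MPI operation is reached.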

def _allreduce(tensor, name=None, op=Sum, prescale_factor=1.0, postscale_factor=1.0,
               ignore_name_scope=False):
    """An op which reduces an input tensor over all the Horovod processes. The
    default reduction is a sum.

    The reduction operation is keyed by the name of the op. The tensor type and
    shape must be the same on all Horovod processes for a given name. The reduction
    will not start until all processes are ready to send and receive the tensor.

    Returns:
      A tensor of the same shape and type as `tensor`, summed across all
      processes.
    """
    if name is None and not _executing_eagerly():
        name = 'HorovodAllreduce_%s' % _normalize_name(tensor.name)
    return MPI_LIB.horovod_allreduce(tensor, name=name, reduce_op=op,
                                     prescale_factor=prescale_factor,
                                     postscale_factor=postscale_factor,
                                     ignore_name_scope=ignore_name_scope)
  
def allgather(tensor, name=None, ignore_name_scope=False):
    """An op which concatenates the input tensor with the same input tensor on
    all other Horovod processes.

    The concatenation is done on the first dimension, so the input tensors on the
    different processes must have the same rank and shape, except for the first
    dimension, which is allowed to be different.

    Returns:
      A tensor of the same type as `tensor`, concatenated on dimension zero
      across all processes. The shape is identical to the input shape, except for
      the first dimension, which may be greater and is the sum of all first
      dimensions of the tensors in different Horovod processes.
    """
    if name is None and not _executing_eagerly():
        name = 'HorovodAllgather_%s' % _normalize_name(tensor.name)
    return MPI_LIB.horovod_allgather(tensor, name=name,
                                     ignore_name_scope=ignore_name_scope)
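
Since the reduction is keyed by op name, every process must derive the same name for the same tensor. The sketch below shows how a name such as HorovodAllreduce_grads_5_0 could arise. The body of `normalize_name` is an assumption (it replaces every character outside letters, digits, and underscore with an underscore); the article does not show Horovod's `_normalize_name`, and `allreduce_op_name` is a hypothetical helper.

```python
import re


def normalize_name(name):
    """Assumed behavior: replace characters that are not valid in an op name."""
    return re.sub(r"[^a-zA-Z0-9_]", "_", name)


def allreduce_op_name(tensor_name):
    # Same format string as in _allreduce above.
    return "HorovodAllreduce_%s" % normalize_name(tensor_name)


op_name = allreduce_op_name("grads_5:0")
```

A graph-mode tensor name such as `grads_5:0` contains a colon, which is why some normalization is needed before the name can be reused as an op name.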

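4.7 Operation Mapping

On the Python side, several arguments are passed when _allreduce is called, such as tensor and name. The most important one is op=Sum, which the C++ side uses to decide which reduction to perform. Let us trace it step by step:

4.7.1 C++ Definition

In C++ we have: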

enum ReduceOp {
    AVERAGE = 0, // This value should never appear past framework code, as
                 // averaging is taken care of there.
    SUM = 1,
    ADASUM = 2
};

int horovod_reduce_op_sum() {
  return ReduceOp::SUM;
}

4.7.2 Reading the Configuration in Python

The Python initialization code contains:

class HorovodBasics(object):
    """Wrapper class for the basic Horovod API."""

    def __init__(self, pkg_path, *args):
        full_path = util.get_extension_full_path(pkg_path, *args)
        self.MPI_LIB_CTYPES = ctypes.CDLL(full_path, mode=ctypes.RTLD_GLOBAL)

        self.Average = self.MPI_LIB_CTYPES.horovod_reduce_op_average()
        self.Sum = self.MPI_LIB_CTYPES.horovod_reduce_op_sum() # this is where the two sides are linked
        self.Adasum = self.MPI_LIB_CTYPES.horovod_reduce_op_adasum()

In this way, the default argument op=Sum of _allreduce corresponds to ReduceOp::SUM in C++.
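
The mapping can be mirrored in pure Python. In this sketch an `IntEnum` plays the role of the C++ enum and ordinary functions play the role of the exported C functions; `ToyBasics` is a hypothetical stand-in for HorovodBasics, which in reality obtains these integers through ctypes.

```python
from enum import IntEnum


class ReduceOp(IntEnum):
    """Python mirror of the C++ enum shown above."""
    AVERAGE = 0
    SUM = 1
    ADASUM = 2


# Stand-ins for the exported C functions; each returns a plain integer.
def horovod_reduce_op_average():
    return int(ReduceOp.AVERAGE)


def horovod_reduce_op_sum():
    return int(ReduceOp.SUM)


def horovod_reduce_op_adasum():
    return int(ReduceOp.ADASUM)


class ToyBasics:
    """Hypothetical counterpart of HorovodBasics without the ctypes loading."""

    def __init__(self):
        self.Average = horovod_reduce_op_average()
        self.Sum = horovod_reduce_op_sum()
        self.Adasum = horovod_reduce_op_adasum()


basics = ToyBasics()
# The receiving side turns the integer back into an enum member,
# analogous to the static_cast on the C++ side.
decoded = ReduceOp(basics.Sum)
```

Only plain integers cross the language boundary, so both sides must agree on the numeric values of the enum.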

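4.7.3 Establishing the Connection

_allreduce then calls:

MPI_LIB.horovod_allreduce(tensor, name=name, reduce_op=op, ...)

MPI_LIB.horovod_allreduce is translated into the C++ code below:

First, the OP_REQUIRES_OK statement in the constructor obtains reduce_op_;
Second, ComputeAsync uses reduce_op_ to determine which operation to invoke.

This connects the Python side and the C++ side one step further.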

class HorovodAllreduceOp : public AsyncOpKernel {
public:
  explicit HorovodAllreduceOp(OpKernelConstruction* context)
      : AsyncOpKernel(context) {
    // Read the reduce_op attribute from context and assign it to reduce_op_
    OP_REQUIRES_OK(context, context->GetAttr("reduce_op", &reduce_op_));
    // irrelevant code omitted
  }

  void ComputeAsync(OpKernelContext* context, DoneCallback done) override {
    OP_REQUIRES_OK_ASYNC(context, ConvertStatus(common::CheckInitialized()),
                         done);
    // Unrelated code omitted
    // reduce_op_ determines which reduction C++ performs internally
    horovod::common::ReduceOp reduce_op = static_cast<horovod::common::ReduceOp>(reduce_op_);
    // Unrelated code omitted
  }
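The attribute flow in the kernel above can be modeled in a few lines of Python. This is only a sketch under stated assumptions: `ToyAllreduceKernel` is a hypothetical class, the `attrs` dictionary plays the role of the op attributes that TF hands to the constructor, and only the Sum reduction is modeled.

```python
from enum import IntEnum


class ReduceOp(IntEnum):
    """Python mirror of the C++ ReduceOp enum from section 4.7.1."""
    AVERAGE = 0
    SUM = 1
    ADASUM = 2


class ToyAllreduceKernel:
    """Hypothetical model of an op kernel: attributes are read once at construction."""

    def __init__(self, attrs):
        # Plays the role of context->GetAttr("reduce_op", &reduce_op_).
        self.reduce_op_ = int(attrs["reduce_op"])

    def compute(self, per_rank_values):
        # Plays the role of the static_cast to horovod::common::ReduceOp.
        reduce_op = ReduceOp(self.reduce_op_)
        if reduce_op is ReduceOp.SUM:
            return sum(per_rank_values)
        # This toy models Sum only; other reductions are out of scope here.
        raise NotImplementedError(reduce_op.name)


kernel = ToyAllreduceKernel({"reduce_op": ReduceOp.SUM})
result = kernel.compute([1.0, 2.0, 3.0])
print(result)  # prints 6.0
```

The integer travels as a plain attribute and is turned back into an enum only at the point of use, which is the same division of labor as in the C++ kernel.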

4.8 Extended flow

The flow diagram so far extends to the following:

 +-----------------------------+
 |_DistributedOptimizer        |                                                                   +-----------------------------------------------------+
 |                             |                                                                   | LocalGradientAggregationHelper                      |
 |                             |       +---------------+                                           |                                                     |
 | self._optimizer  +----------------> | tf.Optimizer  |                                           |    +---------------------------------------------+  |
 |                             |       |               |                                           |    | compute_gradients                           |  |
 |                             |       +---------------+                                           |    |                                             |  |
 |                             |                                                                   |    |                                             |  |
 |                             |       +------------------------------------------------------+    |    |         _init_aggregation_vars              |  |
 | compute_gradients  +------------->  |compute_gradients                                     |    |    |                    +                        |  |
 |                             |       |                                                      |    |    |                    |                        |  |
 |                             |       |                                                      |    |    |                    |                        |  |
 |                             |       |      _optimizer.compute_gradients                    |    |    |                    v                        |  |
 | _allreduce_grads            |       |                +                                     |    |    |                                             |  |
 |      +                      |       |                |                                     |    |    |        _allreduce_grads_helper              |  |
 |      |                      |       |                |                                     |    |    |                    +                        |  |
 +-----------------------------+       |                v                                     |    |    |                    |                        |  |
        |                              |      _agg_helper.compute_gradients(grads, vars) +------------> |                    |                        |  |
        |                              |                                                      |    |    |                    v                        |  |
        |                   +--------------+  _allreduce_grads(grads, vars)                   |    |    |             allreduced_grads                |  |
        |                   |          |                +                                     |    |    |                                             |  |
        |                   |          |                |                                     |    |    +---------------------------------------------+  |
        |                   |          |                |                                     |    |                                                     |
        |                   |          |                v                                     |    |     allreduce_func                                  |
        |                   |          |       list(zip(avg_grads, vars))                     |    |            +                                        |
        |                   |          |                                                      |    |            |                                        |
        |                   |          +------------------------------------------------------+    +-----------------------------------------------------+
        |                   |                                                                                   |
        |                   |                                                                                   |
        v                   v                                                                                   |
+-------+-------------------+--------+                                                                          |
|_make_allreduce_grads_fn            |                                                                          |
|                                    |  <-----------------------------------------------------------------------+
|                                    |
|                                    |                  +-----------------+               +----------------+             +----------------------------+
|             _allreduce_cond  +------------------->    | allreduce       |               | _allreduce     |             |  MPI_LIB.horovod_allreduce |
|                                    |                  |              +----------------> |           +--------------->  |                            |
+------------------------------------+                  |                 |               |                |             |                            |
                                                        |                 |               |                |             |                            |
                                                        +-----------------+               +----------------+             +----------------------------+

0x05 Tensorflow 2.x

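5.1 Horovod implementation

In TF 2.x, code runs eagerly line by line: there is no need to build a graph first, and explicit control dependencies are no longer required. Horovod can therefore obtain gradients directly through the TensorFlow 2.0 API, so its gradient-update logic is not built on the computation graph but on hvd.DistributedGradientTape.

During training, each worker does the following:

Wrap TF's own tape in DistributedGradientTape, which sets up the allreduce function.
Read a batch of training data.
Run the forward pass on the local model to compute the loss.
Given the loss, use the GradientTape mechanism of TensorFlow eager execution, calling the base-class method to obtain the gradients.
Call Allreduce so that all workers synchronize their gradients.
Update the local model with the synchronized gradients.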
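The worker steps listed above can be simulated in plain Python. This is a toy sketch, not Horovod code: the two workers, the one-parameter model y = w * x, the data shards, and the helper names `local_gradient` and `allreduce_average` are all assumptions made for illustration.

```python
def local_gradient(w, shard):
    """Gradient of the mean squared error of y = w * x over one worker's shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)


def allreduce_average(values):
    """Stand-in for Allreduce: every worker receives the same average."""
    avg = sum(values) / len(values)
    return [avg] * len(values)


# Two hypothetical workers, each holding its own shard of data drawn from y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
weights = [0.0, 0.0]  # identical initial weights on every worker
learning_rate = 0.01

for _ in range(200):
    # Forward and backward pass on each worker's local data.
    grads = [local_gradient(w, s) for w, s in zip(weights, shards)]
    # Synchronize: every worker ends up with the same averaged gradient.
    synced = allreduce_average(grads)
    # Each worker applies the same update, so the replicas stay identical.
    weights = [w - learning_rate * g for w, g in zip(weights, synced)]

print(weights)
```

Because every replica starts from the same weights and applies the same averaged gradient, the replicas never diverge, which is the property the Allreduce step exists to guarantee.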

5.2 Example code

First, here is the example code. Some non-essential parts are omitted; see the comments for details:

import tensorflow as tf
import horovod.tensorflow as hvd

# Horovod: initialize Horovod.
hvd.init()

# Horovod: pin GPU to be used to process local rank (one GPU per process)
# Configure the GPUs
gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
if gpus:
    tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

# Load the data
(mnist_images, mnist_labels), _ = \
    tf.keras.datasets.mnist.load_data(path='mnist-%d.npz' % hvd.rank())

# Build a dataset of (image, label) slices
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.cast(mnist_images[..., tf.newaxis] / 255.0, tf.float32),
             tf.cast(mnist_labels, tf.int64))
)
# Repeat, shuffle, and batch the data
dataset = dataset.repeat().shuffle(10000).batch(128)

mnist_model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, [3, 3], activation='relu'),
    tf.keras.layers.Conv2D(64, [3, 3], activation='relu'),
    tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
    tf.keras.layers.Dropout(0.25),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(10, activation='softmax')
])
# Loss function
loss = tf.losses.SparseCategoricalCrossentropy()

# Horovod: adjust learning rate based on number of GPUs.
opt = tf.optimizers.Adam(0.001 * hvd.size())

@tf.function
def training_step(images, labels, first_batch):
    with tf.GradientTape() as tape:
        probs = mnist_model(images, training=True)
        loss_value = loss(labels, probs)

    # Horovod: add Horovod Distributed GradientTape.
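    # Wrap the tape in DistributedGradientTape, which sets up the allreduce function
    tape = hvd.DistributedGradientTape(tape)

    # Request the gradients explicitly; internally this goes through several steps
    # and ends in Horovod's allreduce, ultimately MPI_LIB.horovod_allreduce
    grads = tape.gradient(loss_value, mnist_model.trainable_variables)
    # Apply the gradients to update the weights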
    opt.apply_gradients(zip(grads, mnist_model.trainable_variables))

    # Horovod: broadcast initial variable states from rank 0 to all other processes.
    # This is necessary to ensure consistent initialization of all workers when
    # training is started with random weights or restored from a checkpoint.
    #
    # Note: broadcast should be done after the first gradient step to ensure optimizer
    # initialization.
    # Broadcast the variables
    if first_batch:
        hvd.broadcast_variables(mnist_model.variables, root_rank=0)
        hvd.broadcast_variables(opt.variables(), root_rank=0)

    return loss_value


# Horovod: adjust number of steps based on number of GPUs.
for batch, (images, labels) in enumerate(dataset.take(10000 // hvd.size())):
    loss_value = training_step(images, labels, batch == 0)
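The example scales two quantities by the number of workers: the learning rate is multiplied by hvd.size() and the step count is divided by it. Since each worker processes its own batch of 128 samples per step, one step covers size × 128 samples in total. The arithmetic can be checked with a small sketch in which `scaled_hyperparameters` is a hypothetical helper and `size` stands in for the value hvd.size() would return.

```python
def scaled_hyperparameters(base_lr, total_steps, size):
    """Return (learning rate, steps per worker) for `size` workers."""
    return base_lr * size, total_steps // size


for size in (1, 2, 4):
    lr, steps = scaled_hyperparameters(0.001, 10000, size)
    samples_per_step = 128 * size  # every worker contributes one batch of 128
    print(size, lr, steps, samples_per_step)
```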

5.3 _DistributedGradientTape

The key class _DistributedGradientTape is defined as follows:

class _DistributedGradientTape(tf.GradientTape):
    def __init__(self, tape, device_dense, device_sparse, compression, sparse_as_dense, op,
                 gradient_predivide_factor, groups, persistent=False,
                 watch_accessed_variables=True):
        if hasattr(tape, '_watch_accessed_variables'):
            super(self.__class__, self).__init__(persistent, watch_accessed_variables)
        else:
            super(self.__class__, self).__init__(persistent)

        # Keep a reference to the original TF tape
        self._tape = tape
        # Set up the allreduce function
        self._allreduce_grads = _make_allreduce_grads_fn(
            'DistributedGradientTape', device_dense, device_sparse, compression,
            sparse_as_dense, op, gradient_predivide_factor, groups)

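    # The user calls this method explicitly; it post-processes the gradients
    # with the function built by _make_allreduce_grads_fn
    def gradient(self, target, sources, output_gradients=None):
        # Call the base-class method to obtain the gradients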
        gradients = super(self.__class__, self).gradient(target, sources, output_gradients)
        return self._allreduce_grads(gradients, sources)
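The class above follows a simple wrapper pattern: compute the gradients through the base class, then pass them through a reduction function. The sketch below shows that pattern without TensorFlow. `BaseTape`, `DistributedTape`, and `fake_allreduce` are hypothetical stand-ins, and the sketch uses plain `super()`, which, unlike `super(self.__class__, self)`, stays correct if the class is subclassed further.

```python
class BaseTape:
    """Hypothetical stand-in for tf.GradientTape: returns fixed local gradients."""

    def __init__(self, local_grads):
        self._local_grads = local_grads

    def gradient(self, target, sources):
        return list(self._local_grads)


class DistributedTape(BaseTape):
    """Wraps gradient(): compute locally via the base class, then post-process."""

    def __init__(self, local_grads, allreduce_grads):
        super().__init__(local_grads)
        self._allreduce_grads = allreduce_grads

    def gradient(self, target, sources):
        gradients = super().gradient(target, sources)
        return self._allreduce_grads(gradients)


def fake_allreduce(grads):
    """Pretend a second worker contributed gradients of 1.0, then average."""
    return [None if g is None else (g + 1.0) / 2.0 for g in grads]


tape = DistributedTape([3.0, None, 5.0], fake_allreduce)
reduced = tape.gradient(target=None, sources=["w0", "w1", "w2"])
print(reduced)  # prints [2.0, None, 3.0]
```

Callers keep using the familiar `gradient()` method, so switching a training loop to distributed mode only requires wrapping the tape.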

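_make_allreduce_grads_fn triggers a chain of calls that ends in MPI_LIB.horovod_allreduce. Specifically, it does the following:

Changes the name scope by appending the suffix _Allreduce, so the DistributedGradientTape scope becomes DistributedGradientTape_Allreduce;
If sparse_as_dense is set, converts IndexedSlices gradients to dense tensors, and passes the configured compression on to allreduce;
Depending on whether more than one process is running (size_op() > 1), calls allreduce or returns the tensor unchanged;
In graph mode, names the op with the HorovodAllreduce_ prefix;
Calls MPI_LIB.horovod_allreduce.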

@_cache
def _make_allreduce_grads_fn(name, device_dense, device_sparse,
                             compression, sparse_as_dense, op):
    def allreduce_grads(grads):
        with tf.name_scope(name + "_Allreduce"):  # append the suffix to the name scope
            if sparse_as_dense:
                grads = [tf.convert_to_tensor(grad)  # densify sparse IndexedSlices gradients
                         if grad is not None and isinstance(grad, tf.IndexedSlices)
                         else grad for grad in grads]

            return [_allreduce_cond(grad,
                                    device_dense=device_dense,
                                    device_sparse=device_sparse,
                                    compression=compression,
                                    op=op)
                    if grad is not None else grad
                    for grad in grads]

def _allreduce_cond(tensor, *args, **kwargs):
    def allreduce_fn():
        return allreduce(tensor, *args, **kwargs)

    def id_fn():
        return tensor

    return tf.cond(size_op() > 1, allreduce_fn, id_fn)  # allreduce only when there is more than one process

def _allreduce(tensor, name=None, op=Sum):
    """An op which reduces an input tensor over all the Horovod processes. The
    default reduction is a sum.

    The reduction operation is keyed by the name of the op. The tensor type and
    shape must be the same on all Horovod processes for a given name. The reduction
    will not start until all processes are ready to send and receive the tensor.

    Returns:
      A tensor of the same shape and type as `tensor`, summed across all
      processes.
    """
    if name is None and not _executing_eagerly():
        name = 'HorovodAllreduce_%s' % _normalize_name(tensor.name)
    # Call HorovodAllreduceOp
    return MPI_LIB.horovod_allreduce(tensor, name=name, reduce_op=op)

The logic is as follows:

+-------------------------------+
| _DistributedGradientTape      |             +------------------------------------+
|                               |             |_make_allreduce_grads_fn            |
|                               |             |                                    |
|         _allreduce_grads +--------------->  |                                    |
|                               |             |                                    |
|                               |             |             _allreduce_cond  +---------+
|                               |             |                                    |   |
+-------------------------------+             +------------------------------------+   |
                                                                                       |
                                                                                       |
            +--------------------------------------------------------------------------+
            |
            |
            |
            |
            |          +----------------+             +----------------------------+
            |          | _allreduce     |             |  MPI_LIB.horovod_allreduce |
            +------->  |           +--------------->  |                            |
                       |                |             |                            |
                       |                |             |                            |
                       +----------------+             +----------------------------+
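The call chain in the diagram can be modeled in plain Python. This is a sketch under assumptions, not Horovod's implementation: `make_allreduce_grads_fn` takes the world size and a reduction function as ordinary arguments, and `recording_allreduce` is a hypothetical reduction that records its inputs and doubles them, as a Sum over two workers holding equal values would.

```python
def make_allreduce_grads_fn(size, allreduce):
    """Build a function that reduces a list of gradients, skipping None entries."""

    def allreduce_cond(grad):
        # Mirrors tf.cond(size_op() > 1, allreduce_fn, id_fn).
        return allreduce(grad) if size > 1 else grad

    def allreduce_grads(grads):
        return [allreduce_cond(g) if g is not None else g for g in grads]

    return allreduce_grads


calls = []


def recording_allreduce(grad):
    """Hypothetical reduction: record the input, then double it."""
    calls.append(grad)
    return grad * 2


single = make_allreduce_grads_fn(size=1, allreduce=recording_allreduce)
multi = make_allreduce_grads_fn(size=2, allreduce=recording_allreduce)

single_out = single([1.0, None])  # one process: gradients pass through untouched
calls_after_single = list(calls)
multi_out = multi([1.0, None])    # two processes: the reduction is invoked

print(single_out, multi_out, calls)
```

With a single process the reduction function is never invoked, which is the purpose of the conditional: a one-process run pays no communication cost.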

0x06 HorovodAllreduceOp

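MPI_LIB.horovod_allreduce invokes HorovodAllreduceOp. MPI_LIB.horovod_allreduce is a Python function and HorovodAllreduceOp is C++ code; TF provides the adaptation layer between them, so a Python call reaches the C++ implementation directly.

HorovodAllreduceOp inherits from AsyncOpKernel, making it a TF asynchronous op, and it is registered with TF through REGISTER_KERNEL_BUILDER, so it can be embedded in TF's execution flow.

TF calls the ComputeAsync method that HorovodAllreduceOp overrides. Inside ComputeAsync, the tensor's Allreduce operation is added to the Horovod background queue, which ties the TF op to the Horovod op.

In summary, HorovodAllreduceOp inherits from TF's AsyncOpKernel, so it can be embedded in TF's flow, and it connects to the Horovod background thread by composition.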

class HorovodAllreduceOp : public AsyncOpKernel { // derived from AsyncOpKernel, so it can be embedded in TF's flow
public:
  explicit HorovodAllreduceOp(OpKernelConstruction* context)
      : AsyncOpKernel(context) {
    OP_REQUIRES_OK(context, context->GetAttr("reduce_op", &reduce_op_));
    OP_REQUIRES_OK(context, context->GetAttr("prescale_factor", &prescale_factor_));
    OP_REQUIRES_OK(context, context->GetAttr("postscale_factor", &postscale_factor_));
    OP_REQUIRES_OK(context, context->GetAttr("ignore_name_scope", &ignore_name_scope_));
  }

  void ComputeAsync(OpKernelContext* context, DoneCallback done) override {
    OP_REQUIRES_OK_ASYNC(context, ConvertStatus(common::CheckInitialized()),
                         done);
    ... // some variable validation and initialization code omitted

    // Enqueue the tensor's Allreduce operation
    auto enqueue_result = EnqueueTensorAllreduce(
        hvd_context, hvd_tensor, hvd_output, ready_event, node_name, device,
        [context, done](const common::Status& status) {
          context->SetStatus(ConvertStatus(status));
          done();
        }, reduce_op, (double) prescale_factor_, (double) postscale_factor_);
    OP_REQUIRES_OK_ASYNC(context, ConvertStatus(enqueue_result), done);
  }

private:
  int reduce_op_;
  // Using float since TF does not support double OP attributes
  float prescale_factor_;
  float postscale_factor_;
  bool ignore_name_scope_;
};
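The enqueue-and-callback pattern used by ComputeAsync can be illustrated with Python's standard library. This is a minimal sketch with hypothetical names (`BackgroundLoop`, `compute_async`, `done`). A single thread and a `queue.Queue` stand in for Horovod's background thread and tensor queue, and the "reduction" is just a local sum.

```python
import queue
import threading


class BackgroundLoop:
    """Hypothetical stand-in for Horovod's background thread and its queue."""

    def __init__(self):
        self._requests = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def enqueue_allreduce(self, values, callback):
        # Returns immediately; the reduction happens on the background thread.
        self._requests.put((values, callback))

    def _run(self):
        while True:
            item = self._requests.get()
            if item is None:  # shutdown sentinel
                break
            values, callback = item
            callback(sum(values))  # reduce, then notify the caller

    def shutdown(self):
        self._requests.put(None)
        self._thread.join()


def compute_async(loop, values, done):
    """Models ComputeAsync: enqueue the work and hand over a completion callback."""
    loop.enqueue_allreduce(values, done)


results = []
finished = threading.Event()


def done(output):
    results.append(output)
    finished.set()


loop = BackgroundLoop()
compute_async(loop, [1, 2, 3], done)
finished.wait(timeout=5)
loop.shutdown()
print(results)  # prints [6]
```

The caller's thread is never blocked by the reduction itself. It only supplies a callback, which is why an asynchronous kernel suits an operation that has to wait for other processes.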

Starting with the next article, we will look at Horovod on Spark.