[Source Code Analysis] Kuaishou Bagua --- A New Approach to Distributed Machine Learning Training (2)

0x00 Abstract

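"Bagua" is a distributed training framework jointly developed by Kuaishou and ETH Zürich. It designs optimization algorithms specifically for distributed scenarios and co-optimizes at the algorithm and system levels, aiming to push the efficiency of distributed training as far as possible. Its features are:

Significantly improved parallel performance;
More robust to network conditions;
"One-click" usage;
Easily extensible distributed communication algorithms;
Ready for large-scale use in industrial scenarios;
Secure, with faults that are easy to troubleshoot.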

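This article is based on the following material:

The article from Kuaishou's official public account, "Kuaishou Bagua! An open-source distributed training framework that breaks the parallel bottlenecks of TensorFlow and PyTorch is here!"
The "bagua" paper: https://arxiv.org/pdf/2107.01499.pdf
The "bagua" official website: https://tutorials.baguasys.com/
The "bagua" demo slides
The project's GitHub repository: https://github.com/BaguaSys/bagua

It covers two of Bagua's optimizations: the Fused Optimizer and hierarchical communication.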

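The previous article in this series is:

[Source Code Analysis] Kuaishou Bagua --- A New Approach to Distributed Machine Learning Training (1)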

0x01 Optimizations

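Other existing frameworks each optimize for one specific algorithm or scenario. The figure below shows the communication pattern of DP-SG and how Horovod, BytePS, and PyTorch-DDP optimize for that pattern.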

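Bagua aims for an optimization approach that works for all communication algorithms. The core of BAGUA is its execution optimizer. Given a neural network as input, a training algorithm (for example QSGD) is carried out by invoking a series of communication primitives during the computation of each layer. The goal of BAGUA's execution optimizer is to schedule and optimize these computations and communications automatically. BAGUA explores the following techniques.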

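1.1 Overlapping communication and computation

The goal of this optimization is to hide communication time inside computation time.

Overlapping communication with computation is a core technique for speeding up distributed DP-SG. BAGUA is not limited to DP-SG: it can overlap communication primitives with the computation of other algorithms in a flexible and automatic way, so that part of the communication time is hidden inside the computation time and the communication overhead drops.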

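Specifically, during the backward pass, gradients that have already been computed can be communicated while the remaining gradients are still being computed. With this pipelining, part of the communication time is effectively "hidden" inside the backward computation, which reduces the communication overhead introduced by data parallelism. BAGUA automatically analyzes the computation graph, including in-place tensor operations and tensor communication primitives. Although this graph could be built by static analysis, BAGUA uses dynamic profiling: the invocation dependencies between tensor operations and communication primitives are collected during the first iteration.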

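Compared with existing systems, BAGUA considers more complex scheduling. In vanilla DP-SG, the optimization can only hide the Allreduce communication inside the backward computation. BAGUA, by contrast, can also schedule additional elements, such as low-precision compression/decompression and the model update computation specified by the optimization algorithm.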
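The benefit of this pipelining can be estimated with a small timing model. The sketch below is not Bagua code: the function names and the per-layer timings are made up for illustration, and it assumes a single communication channel that handles one transfer at a time.

```python
from typing import List


def sequential_time(compute: List[float], comm: List[float]) -> float:
    """Finish the whole backward pass first, then communicate every gradient."""
    return sum(compute) + sum(comm)


def overlapped_time(compute: List[float], comm: List[float]) -> float:
    """Start communicating a layer's gradient as soon as it is ready.

    compute[i] and comm[i] are the (hypothetical) backward time and
    communication time of the i-th layer, listed in backward order.
    """
    grad_ready = 0.0    # time at which the current layer's gradient is ready
    channel_free = 0.0  # time at which the communication channel is idle again
    for t_compute, t_comm in zip(compute, comm):
        grad_ready += t_compute
        start = max(grad_ready, channel_free)
        channel_free = start + t_comm
    return max(grad_ready, channel_free)


# Hypothetical timings in milliseconds for four layers, in backward order.
compute_ms = [4.0, 3.0, 5.0, 2.0]
comm_ms = [2.0, 2.0, 2.0, 2.0]

print(sequential_time(compute_ms, comm_ms))  # 22.0
print(overlapped_time(compute_ms, comm_ms))  # 16.0
```

In this toy schedule only the communication of the last layer remains exposed; the other three transfers finish while later gradients are still being computed.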

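1.2 Bucketed communication and flattening

Frequently transferring fragmented data lowers communication efficiency and makes poor use of the network bandwidth. To overlap communication and computation effectively, partitioning the model parameters of the layers into several buckets is a necessary step. The bucket then becomes the unit of communication, which makes more efficient use of the communication model.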

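For this reason, both Horovod and PyTorch-DDP use bucketing. However, their bucketing schemes simply hard-code the Allreduce communication, rely on a heuristic to reduce cost, and determine the buckets from the reverse order of the layers in the neural network. By contrast, BAGUA supports more kinds of communication, which are specified by the optimization algorithm and built from BAGUA's communication primitives, so its bucketing is determined from the dependency information collected in the profiling phase.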
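The reverse-order heuristic can be sketched as follows. This is an illustrative simplification, not the code of Horovod, PyTorch-DDP, or Bagua: the parameter names, the sizes, and the bucket_cap threshold are hypothetical.

```python
from typing import List, Tuple

Param = Tuple[str, int]  # (name, size in bytes); both hypothetical


def bucket_in_reverse_order(params: List[Param], bucket_cap: int) -> List[List[str]]:
    """Greedily pack parameters into buckets, walking the layers in reverse.

    Gradients become available in roughly reverse layer order during
    backpropagation, so the last layers land in the first bucket.
    """
    buckets: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in reversed(params):
        # Close the current bucket if adding this parameter would exceed the cap.
        if current and current_size + size > bucket_cap:
            buckets.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        buckets.append(current)
    return buckets


layers = [("conv1.weight", 400), ("conv2.weight", 600), ("fc1.weight", 900), ("fc2.weight", 100)]
print(bucket_in_reverse_order(layers, bucket_cap=1000))
# [['fc2.weight', 'fc1.weight'], ['conv2.weight', 'conv1.weight']]
```

A profiling-based scheme such as the one described for BAGUA would replace the fixed reverse order with the order and dependencies actually observed in the first iteration.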

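Once the computation graph has been split into buckets, BAGUA performs fusion on these buckets, which makes a more efficient pipeline possible. After the bucket partition has been determined in the first run of backpropagation, BAGUA carefully aligns the parameters inside each bucket (such as model parameters, gradients, and optimizer state) into a contiguous memory space. This flattened view of the parameters is then used in all subsequent pipeline executions.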

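In addition, because compression algorithms are supported, the compression and decompression functions also operate on buckets as their basic unit, which lowers the overhead of these operations. For example, the low-precision compression/decompression lambda is applied directly to the flattened view of a bucket rather than to individual parameters, and the SG-based optimizer used for the model update also works at the bucket level (NVIDIA's Apex uses a similar optimization). Note that this flattened view makes better use of the parallelism offered by the compute units.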

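1.3 Hierarchical communication

Industrial-scale distributed training usually needs multiple machines with multiple GPUs each, and the latency and bandwidth of the different physical interconnects vary considerably. An effective abstraction of communication is therefore crucial for performance.

BAGUA's communication can be performed hierarchically. This is especially useful with heterogeneous network connections, for example when the bandwidth between GPUs inside a server is much higher than the bandwidth between servers. Bagua abstracts multi-machine communication into "intra-node" and "inter-node" communication, and optimizes the implementation of the communication primitives on top of this abstraction.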

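For example, for compressed transmission, hierarchical communication interprets the algorithm as full precision "intra-node" and compressed "inter-node", so that each physical link gets the most suitable communication algorithm. The centralized low-precision primitive (CLPS) can be optimized as follows: first aggregate the tensors over the local workers inside each node without compression, then perform compressed inter-node aggregation among the leader workers elected on each node, and finally let each leader worker broadcast the aggregated data inside its node. Note that this optimization may change the semantics of the communication primitive. For decentralized primitives, the workers inside a node are always switched to centralized Allreduce.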
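The three steps can be simulated in plain Python. This is a toy model, not Bagua's implementation: each worker holds a single number, a node is a list of workers, and the compress function (rounding to one decimal place) is only a stand-in for a real low-precision codec.

```python
from typing import Callable, List


def hierarchical_sum(
    nodes: List[List[float]],
    compress: Callable[[float], float] = lambda x: x,
) -> List[List[float]]:
    """Simulate a hierarchical all-reduce (sum) over workers grouped by node.

    nodes[i][j] is the value held by worker j on node i.
    """
    # Step 1: full-precision aggregation among the workers of each node.
    node_sums = [sum(workers) for workers in nodes]
    # Step 2: the node leaders exchange compressed node sums across nodes.
    global_sum = sum(compress(s) for s in node_sums)
    # Step 3: each leader broadcasts the result to the workers on its node.
    return [[global_sum for _ in workers] for workers in nodes]


cluster = [[0.11, 0.12], [0.21, 0.22]]  # 2 nodes x 2 workers, made-up values

exact = hierarchical_sum(cluster)
lossy = hierarchical_sum(cluster, compress=lambda x: round(x, 1))
print(exact[0][0], lossy[0][0])
```

Because compression is applied to the node-level sums rather than to each worker's tensor, the result can differ from a flat compressed aggregation, which is one way the semantics of the primitive can change.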

Next, let's look at two of these optimizations: fusion and hierarchical communication.

0x02 Generic Fused Optimizer

Bagua provides a generic fused optimizer, which improves optimizer performance by fusing the optimizer .step() operation on multiple layers. It can be applied to any PyTorch optimizer. The code is in bagua/torch_api/contrib/fused_optimizer.py.

2.1 Background

We start with some background.

2.1.1 Tensor

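We usually picture a Tensor as a multi-dimensional array of values.

In fact, a tensor is split into a metadata area (Tensor) and a storage area (Storage). The metadata area holds the tensor's shape (size), stride, data type (type), and similar information, while the actual data is kept in the Storage as a contiguous array.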

+------------------+        +-----------------+
| Tensor           |        | Storage         |
|                  |        |                 |
|                  |        |                 |
|    stride        |        |      data       |
|                  |        |                 |
|    size          |        |      size       |
|                  |        |                 |
|    type          |        |                 |
|                  |        |                 |
|    shape         |        |                 |
|                  |        |                 |
|    dimension     |        |                 |
|                  |        |                 |
|    storage  +-----------> |                 |
|                  |        |                 |
|                  |        |                 |
+------------------+        +-----------------+

2.1.2 Storage

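Put another way, a Storage is a contiguous block of memory, and a Tensor is a view that maps this one-dimensional region of memory onto an n-dimensional space.

This involves several concepts.

Size is the shape of the tensor, that is, its extent along each dimension.
Storage offset is the index in the storage of the tensor's first element, that is, the offset between the tensor's first element and the storage's first element.
Stride is the number of storage elements to skip in order to move from one element to the next along a given dimension.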
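How these three quantities combine can be sketched without PyTorch. The helper below is a hypothetical illustration (storage_index is not a PyTorch function): it maps an n-dimensional index to a position in the flat storage as storage_offset plus the sum of index[k] * stride[k].

```python
from typing import Sequence


def storage_index(index: Sequence[int], stride: Sequence[int], storage_offset: int = 0) -> int:
    """Map an n-dimensional index to a position in the flat storage."""
    if len(index) != len(stride):
        raise ValueError("index and stride must have the same number of dimensions")
    return storage_offset + sum(i * s for i, s in zip(index, stride))


storage = [0, 1, 2, 3, 4, 5]  # a flat buffer with six elements

# A 2x3 view over the buffer has stride (3, 1): element [1, 2] is at 1*3 + 2*1 = 5.
print(storage[storage_index((1, 2), stride=(3, 1))])  # 5

# A 3x2 view has stride (2, 1): element [2, 0] is at 2*2 + 0*1 = 4.
print(storage[storage_index((2, 0), stride=(2, 1))])  # 4

# A 1-D view starting at storage offset 3: element [1] is at 3 + 1*1 = 4.
print(storage[storage_index((1,), stride=(1,), storage_offset=3)])  # 4
```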

For example:

import torch

a = torch.arange(6)
print("Tensor a : ", a)
print("a storage : " , a.storage())
print("a size : " , a.size())
print("a stride : " , a.stride())
print("a.data.storage().data_ptr() : " , a.data.storage().data_ptr())

b = a.view(2,3) # a different view of the same data
print("Tensor b : ", b)
print("b storage : " , b.storage())
print("b size : " , b.size())
print("b stride : " , b.stride())
print("b.data.storage().data_ptr() : " , b.data.storage().data_ptr())

c = a.view(3,2) # yet another view
print("Tensor c : ", c)
print("c storage : " , c.storage())
print("c size : " , c.size())
print("c stride : " , c.stride())
print("c.data.storage().data_ptr() : " , c.data.storage().data_ptr())

The output shows that the storage is the same, but each view is a different tensor:

# Tensor a
Tensor a :  tensor([0, 1, 2, 3, 4, 5])
a storage :   
 0
 1
 2
 3
 4
 5
[torch.LongStorage of size 6]
a size :  torch.Size([6])
a stride :  (1,)
a.data.storage().data_ptr() :  140266160612352
  
# Tensor b
Tensor b :  tensor([[0, 1, 2],
        [3, 4, 5]])
b storage :   
 0
 1
 2
 3
 4
 5
[torch.LongStorage of size 6]
b size :  torch.Size([2, 3])
b stride :  (3, 1)
b.data.storage().data_ptr() :  140266160612352
  
# Tensor c
Tensor c :  tensor([[0, 1],
        [2, 3],
        [4, 5]])
c storage :   
 0
 1
 2
 3
 4
 5
[torch.LongStorage of size 6]
c size :  torch.Size([3, 2])
c stride :  (2, 1)
c.data.storage().data_ptr() :  140266160612352

Let's look at the offset on its own:

d = a[3:]
print(d.storage())
print(a.storage_offset())
print(b.storage_offset())
print(c.storage_offset())
print(d.storage_offset())

The output is as follows. The storage of d is unchanged, but the storage_offset of d is 3:

# storage of d
 0
 1
 2
 3
 4
 5
[torch.LongStorage of size 6]
0 # a.storage_offset()
0 # b.storage_offset()
0 # c.storage_offset()
3 # d.storage_offset() ---- changed

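In addition, in CPython the id of an object can be regarded as its address in memory, for example id(b.storage()). Note that this is the address of the Python wrapper object rather than of the underlying buffer; to check whether two tensors share memory, compare storage().data_ptr() as in the example above.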

2.1.3 Internal implementation

Next, let's look at the internal implementation.

TensorImpl is the internal implementation of Tensor.

struct C10_API TensorImpl : public c10::intrusive_ptr_target {
  c10::impl::SizesAndStrides sizes_and_strides_;

  int64_t storage_offset_ = 0;
  caffe2::TypeMeta data_type_;  
  Storage storage_;

StorageImpl is the internal implementation of storage. As the excerpt shows, storage is an interface wrapped around DataPtr.

struct C10_API StorageImpl final : public c10::intrusive_ptr_target {

  DataPtr data_ptr_;
  size_t size_bytes_;
  bool resizable_;
  // Identifies that Storage was received from another process and doesn't have
  // local to process cuda memory allocation
  bool received_cuda_;
  Allocator* allocator_;

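2.2 Definition

FusedOptimizer flattens the parameter tensors into one or more contiguous buckets, so that the parameter update kernels of multiple modules can be fused into one or a few. The key step is calling flatten_module_params separately on the 16-bit and the 32-bit parameters to flatten them.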

class FusedOptimizer(torch.optim.Optimizer):
    """Convert any optimizer into a fused optimizer.

    This fused optimizer fuses multiple module parameter update kernel launches
    into one or a few, by flattening parameter tensors into one or more
    contiguous buckets.

    It can be used in conjunction with :meth:`~bagua.torch_api.distributed.BaguaModule.with_bagua` method. In this case,
    Bagua will do the fusions automatically, otherwise, you need to explicitly
    set :attr:`do_flatten=True`.

    Args:
        optimizer (torch.optim.Optimizer): Any PyTorch optimizer.
        do_flatten (bool): Whether to flatten the parameters. Default: ``False``.

    Returns:
        Fused optimizer.


    Example::
        To use in conjunction with :meth:`~bagua.torch_api.distributed.BaguaModule.with_bagua` method:

        >>> optimizer = torch.optim.Adadelta(model.parameters(), ....)
        >>> optimizer = bagua.torch_api.contrib.FusedOptimizer(optimizer)
        >>> model = model.with_bagua([optimizer], GradientAllReduceAlgorithm())

        To use alone or with `torch.nn.parallel.DistributedDataParallel <https://pytorch.org/docs/stable/generated/torch.nn.parallel.DistributedDataParallel.html?highlight=distributeddataparallel#torch.nn.parallel.DistributedDataParallel>`_,
        set :attr:`do_flatten=True`:

        >>> optimizer = torch.optim.Adadelta(model.parameters(), ....)
        >>> optimizer = bagua.torch_api.contrib.FusedOptimizer(optimizer, do_flatten=True)
    """

    def __init__(self, optimizer: torch.optim.Optimizer, do_flatten: bool = False):
        self.optimizer = copy.copy(optimizer)
        super(FusedOptimizer, self).__init__(optimizer.param_groups, optimizer.defaults)

        if do_flatten:
            f32_params = [  # collect the 32-bit parameters held by the optimizer
                param
                for group in self.optimizer.param_groups
                for param in group["params"]
                if param.type() == "torch.cuda.FloatTensor"
            ]
            f16_params = [  # collect the 16-bit parameters held by the optimizer
                param
                for group in self.optimizer.param_groups
                for param in group["params"]
                if param.type() == "torch.cuda.HalfTensor"
            ]

            # then flatten each set separately
            flatten_module_params(f32_params, align_bytes=1)
            flatten_module_params(f16_params, align_bytes=1)

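2.3 Flattening

All 16-bit "params" are copied into one place and all 32-bit "params" into another. The logic is:

Initialize the flattened weight tensor flatten_weights_tensor on the same device as the original parameters.
Initialize the flattened gradient tensor flatten_grads_tensor on the same device as the original parameters.
Get the storage of the flattened tensors.
Iterate over the parameter list:
- Copy the weights into the flat tensor; p.numel() is the number of elements and reshape(-1) flattens them.
- Copy the gradients into the flat tensor in the same way.
- Set the underlying storage, size, and strides, that is, the metadata, so that the parameter now points into the flat storage.
Return the aggregated, flattened parameters.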

def flatten_module_params(params_list, align_bytes: int):
    if len(params_list) == 0:
        return
    if not isinstance(params_list[0], list):
        params_list = [params_list]

    total_size = 0
    for params in params_list:  # compute the total size of the parameters
        total_size += _get_params_flattened_aligned_size(params, align_bytes)

    # Initialize the flattened weight tensor on the same device as the parameters
    flatten_weights_tensor = torch.zeros(total_size, dtype=params_list[0][0].dtype).to(
        params_list[0][0].device
    )
    # Initialize the flattened gradient tensor on the same device as the parameters
    flatten_grads_tensor = torch.zeros(total_size, dtype=params_list[0][0].dtype).to(
        params_list[0][0].device
    )

    # Get the storage of each flattened tensor
    flatten_weights_storage = flatten_weights_tensor.storage()
    flatten_grads_storage = flatten_grads_tensor.storage()

    # Set the underlying storage, size, and strides, i.e. set the metadata
    def set_storage(param, weight_storage, grad_storage, storage_offset):
        with torch.no_grad():
            z = torch.zeros_like(param.data)
            z.set_(weight_storage, storage_offset, param.shape)
            param.data = z

            t = torch.zeros_like(param.data)
            t.set_(grad_storage, storage_offset, param.shape)
            param.grad = t

    offset = 0
    for params in params_list:  # iterate over the parameter lists
        for p in params:
            # copy data
            # Copy the weights into the flat tensor: p.numel() is the number of elements, reshape(-1) flattens
            flatten_weights_tensor[offset : offset + p.numel()] = p.data.reshape(-1)

            # Copy the gradients into the flat tensor
            if p.grad is not None:
                flatten_grads_tensor[offset : offset + p.numel()] = p.grad.data.reshape(
                    -1
                )

            # flatten
            # Set the underlying storage, size, and strides, i.e. set the metadata
            set_storage(p, flatten_weights_storage, flatten_grads_storage, offset)
            offset += p.allocated_size

    # check
    for params in params_list:
        weight_tensors = [p.data for p in params]
        grad_tensors = [p.grad.data for p in params]

        assert check_contiguous(weight_tensors)
        assert check_contiguous(grad_tensors)

    # Return the aggregated, flattened parameters
    return new_param(flatten_weights_tensor, flatten_grads_tensor)

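The result is shown below. We assume here that all tensors are 32-bit, so they are all collected into f32_params. The flatten_module_params boxes show the state after processing, that is, the flattened tensors. Note that the two weights of group_1, param_wg11 and param_wg12, are laid out next to each other.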

 +--------------------------+   +--------------------------+   +---------------------------+
 | group_1["params"]        |   | group_2["params"]        |   | group_3["params"]         |
 |                          |   |                          |   |                           |
 |  param_wg11 , param_gg11 |   |  param_wg21 , param_gg21 |   |  param_wg31 , param_gg31  |
 |  param_wg12 , param_gg12 |   |  param_wg22 , param_gg22 |   |  param_wg32 , param_gg32  |
 |                          |   |                          |   |                           |
 +-------+------------------+   +----------+---------------+   +----------------+----------+
         |                                 |                                    |
         |                                 |                                    |
         +---------------+-----------------+-------------------+----------------+
                         |                                     |
                         | f32_params                          | f16_params
                         |                                     |
                         v                                     v
+------------------------+---------------+     +---------------+---------------------------+
| flatten_module_params                  |     | flatten_module_params                     |
|                                        |     |                                           |
| +------------------------------------+ |     |  +-------------------------------------+  |
| |flatten_weights_tensor              | |     |  |flatten_weights_tensor               |  |
| |                                    | |     |  |                                     |  |
| | param_wg11, param_wg12, param_wg21 | |     |  |               ......                |  |
| |                                    | |     |  |                                     |  |
| | param_wg22, param_wg31, param_wg32 | |     |  |                                     |  |
| +------------------------------------+ |     |  +-------------------------------------+  |
| +------------------------------------+ |     |  +-------------------------------------+  |
| |flatten_grads_tensor                | |     |  |  flatten_grads_tensor               |  |
| |                                    | |     |  |                                     |  |
| | param_gg11, param_gg12, param_gg21 | |     |  |                ......               |  |
| |                                    | |     |  |                                     |  |
| | param_gg22, param_gg31, param_gg32 | |     |  |                                     |  |
| +------------------------------------+ |     |  +-------------------------------------+  |
+----------------------------------------+     +-------------------------------------------+
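The aliasing idea behind flattening can be reproduced with the standard library alone. The sketch below is an analogy and does not use PyTorch or Bagua: array stands in for a storage, memoryview slices stand in for parameters that have been re-pointed at the flat storage, and flatten_params is a hypothetical helper name.

```python
from array import array
from typing import List, Tuple


def flatten_params(params: List[array]) -> Tuple[array, List[memoryview]]:
    """Copy every parameter into one flat buffer and return views into it.

    views[i] aliases the slice of the flat buffer that holds params[i].
    """
    flat = array("d")
    for p in params:
        flat.extend(p)          # copy the data into one contiguous buffer
    whole = memoryview(flat)    # slices of this view share memory with flat
    views: List[memoryview] = []
    offset = 0
    for p in params:
        views.append(whole[offset:offset + len(p)])
        offset += len(p)
    return flat, views


w1 = array("d", [1.0, 2.0])
w2 = array("d", [3.0, 4.0, 5.0])
flat, (v1, v2) = flatten_params([w1, w2])

# One pass over the flat buffer updates every parameter (the "fused" update).
for i in range(len(flat)):
    flat[i] *= 0.5

print(v1.tolist())  # [0.5, 1.0]
print(v2.tolist())  # [1.5, 2.0, 2.5]
```

The per-parameter views see the update without any copy back, which is the property that lets a single kernel over the flat buffer replace one kernel per parameter.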

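2.4 Optimization

The optimization code is shown below. It iterates over the parameters group by group, and for each group of parameters it:

Groups the parameters by storage.
Reorders them.
Assigns the fused parameters back.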

def step(self, closure=None):
    r"""Performs a single optimization step (parameter update).

    Args:
        closure (Callable): A closure that reevaluates the model and
            returns the loss. Optional for most optimizers.

    .. note::
        Unless otherwise specified, this function should not modify the
        ``.grad`` field of the parameters.
    """
    for group in self.optimizer.param_groups:  # iterate over the param groups
        params = group["params"]
        grouped_params = group_params_by_storage(params)  # group the parameters by storage

        fused_params = []

        for _, group_p in grouped_params.items():
            fused_params.extend(reorder_params(group_p))  # reorder

        group["params"] = fused_params  # assign the fused parameters back

    return self.optimizer.step(closure)

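2.4.1 Grouping by storage

In effect there are four combinations: weights and grads, each in 32-bit and 16-bit.

For example, for group_1 we get the 32-bit weights param_wg11 and param_wg12. Because their p.data.storage().data_ptr() is identical, that value is used as the key, and these weights are stored under the same key.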

def group_params_by_storage(params):
    grouped_params = {}
    for p in params:
        weight_storage = p.data.storage().data_ptr()  # the key
        param_list = grouped_params.get(weight_storage, [])
        param_list.append(p)
        grouped_params[weight_storage] = param_list  # store as the value

    return grouped_params

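2.4.2 Reordering

Parameters with the same key are sorted by storage offset, and each run of contiguous parameters is then combined through collocate_params.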

def reorder_params(params):
    """Input params share same storage, reorder them by their storage offset"""

    sorted_params = sorted(params, key=lambda x: x.storage_offset())

    grouped = []
    tmp_params = []

    for p in sorted_params:
        if len(tmp_params) > 0 and not is_contiguous_param(p, tmp_params[-1]):
            grouped.append(collocate_params(tmp_params))
            tmp_params = []

        tmp_params.append(p)

    if len(tmp_params) > 0:
        grouped.append(collocate_params(tmp_params))  # FIXME: potential OOM

    return grouped

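The whole optimization looks roughly like this:

At the start, group['params'] = list(param_wg11, param_wg12): a list of two items, which takes two CUDA operations.

At the end, group['params'] = list(param_wg11 + param_wg12): a list with one item. The parameters have been fused, so the update shrinks to a single CUDA operation.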

+----------------------------------------+     +-------------------------------------------+
| flatten_module_params                  |     | flatten_module_params                     |
|                                        |     |                                           |
| +------------------------------------+ |     |  +-------------------------------------+  |
| |flatten_weights_tensor              | |     |  |flatten_weights_tensor               |  |
| |                                    | |     |  |                                     |  |
| | param_wg11, param_wg12, param_wg21 | |     |  |               ......                |  |
| |                                    | |     |  |                                     |  |
| | param_wg22, param_wg31, param_wg32 | |     |  |                                     |  |
| +------------------------------------+ |     |  +-------------------------------------+  |
| +------------------------------------+ |     |  +-------------------------------------+  |
| |flatten_grads_tensor                | |     |  |  flatten_grads_tensor               |  |
| |                                    | |     |  |                                     |  |
| | param_gg11, param_gg12, param_gg21 | |     |  |                ......               |  |
| |                                    | |     |  |                                     |  |
| | param_gg22, param_gg31, param_gg32 | |     |  |                                     |  |
| +------------------------------------+ |     |  +-------------------------------------+  |
+----------------------------------------+     +-------------------------------------------+

+-------------------------------------------+----------------------------------------------+
                                            |
                                            |
                                            v
           +------------------------------------------------------------------------+
           | step()                         |                                       |
           |                                |                                       |
           |                                |                                       |
           |                                v            2 items list               |
           |                                                                        |
           |                   group['params'] = list(param_wg11, param_wg12)       |
           |                                +                                       |
           |                                |                                       |
           |                                |                                       |
           |                                v                                       |
           |                     group_params_by_storage / reorder_params           |
           |                                +                                       |
           |                                |                                       |
           |                                |                                       |
           |                                v                                       |
           |          grouped_params = {140266160612352 : param_wg11, param_wg12}   |
           |                                +                                       |
           |                                |                                       |
           |                                |            1 item list                |
           |                                v                                       |
           |                group['params'] = list(param_wg11 + param_wg12)         |
           |                                                                        |
           |                                +                                       |
           |                                |                                       |
           |                                v                                       |
           |                    self.optimizer.step(closure)                        |
           |                                                                        |
           |                                                                        |
           +------------------------------------------------------------------------+
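The group, sort, and merge logic of step() can be sketched without PyTorch. In the sketch below, ParamView, its fields, and the merging rule are hypothetical stand-ins: the bodies of is_contiguous_param and collocate_params are not shown in this article, so "contiguous" is assumed to mean that one view ends exactly where the next one starts.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ParamView:
    """A hypothetical stand-in for a parameter: a slice of some storage."""
    name: str
    storage_ptr: int  # plays the role of p.data.storage().data_ptr()
    offset: int       # plays the role of p.storage_offset()
    numel: int


def group_by_storage(params: List[ParamView]) -> Dict[int, List[ParamView]]:
    grouped: Dict[int, List[ParamView]] = {}
    for p in params:
        grouped.setdefault(p.storage_ptr, []).append(p)
    return grouped


def fuse_contiguous(params: List[ParamView]) -> List[ParamView]:
    """Sort views of one storage by offset and merge back-to-back neighbours."""
    fused: List[ParamView] = []
    for p in sorted(params, key=lambda v: v.offset):
        last = fused[-1] if fused else None
        if last is not None and last.offset + last.numel == p.offset:
            fused[-1] = ParamView(last.name + "+" + p.name, last.storage_ptr,
                                  last.offset, last.numel + p.numel)
        else:
            fused.append(p)
    return fused


params = [
    ParamView("wg12", storage_ptr=1000, offset=6, numel=4),
    ParamView("wg11", storage_ptr=1000, offset=0, numel=6),
    ParamView("other", storage_ptr=2000, offset=0, numel=3),
]
fused = [v for group in group_by_storage(params).values() for v in fuse_contiguous(group)]
print([(v.name, v.numel) for v in fused])  # [('wg11+wg12', 10), ('other', 3)]
```

Two views that share a storage and sit back to back collapse into one entry, so the wrapped optimizer sees one larger parameter instead of two small ones.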

0x03 Hierarchical communication --- process groups

3.1 Design

Bagua's design is as follows:

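Hierarchical communication: industrial-scale distributed training usually needs multiple machines with multiple GPUs each, and the latency and bandwidth of the different physical interconnects vary considerably, so an effective abstraction of communication is crucial for performance. Bagua abstracts multi-machine communication into "intra-node" and "inter-node" communication and optimizes each abstraction accordingly. For example, for compressed transmission, hierarchical communication interprets the algorithm as full precision "intra-node" and compressed "inter-node", so that each physical link gets the most suitable communication algorithm.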

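It is worth stressing that these system-level optimizations apply broadly to all kinds of algorithm combinations and are not tied to one particular algorithm setting. All of the system optimizations can therefore be reused flexibly in different algorithm implementations, which secures "end-to-end" performance gains and also provides a good platform for developing new distributed algorithms.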

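Next, let's see how hierarchical communication is implemented with process groups. The analysis follows these questions:

Does hierarchical communication use several corresponding process groups?
How are the ranks for intra-node communication obtained?
How are the ranks for inter-node communication obtained?
Does each process group have its own independent means of communication?
How is hierarchical communication carried out at communication time?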

3.2 Creating a process group

A test file in the source code shows how to create a new process group.

all_ranks = list(range(nprocs))
odd_ranks = list(filter(lambda r: r % 2 == 1, all_ranks))
g = bagua.communication.new_group(ranks=odd_ranks)

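new_group requires that all processes in the default group (that is, all processes that are part of the distributed job) call this function, even if they will not be members of the new group. Its parameters are:

ranks: the list of ranks of the group members.
stream: the CUDA stream on which NCCL operations are executed.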

def new_group(
    ranks: Optional[List[int]] = None, stream: Optional[torch.cuda.Stream] = None
):
    """
    Creates a new process group.

    This function requires that all processes in the default group (i.e. all
    processes that are part of the distributed job) enter this function, even
    if they are not going to be members of the group. Additionally, groups
    should be created in the same order in all processes.

    Each process group will create three communicators on request, a global communicator,
    a inter-node communicator and a intra-node communicator. Users can access them through
    ``group.get_global_communicator()``, ``group.get_inter_node_communicator()``
    and ``group.get_intra_node_communicator()`` respectively.

    Args:
        ranks: List of ranks of group members. If ``None``, will be
            set to all ranks. Default is ``None``.
        stream: A CUDA stream used to execute NCCL operations. If ``None``,
            CUDA stream of the default group will be used. See
            `CUDA semantics <https://pytorch.org/docs/stable/notes/cuda.html?highlight=stream>`_
            for details.

    Returns:
        A handle of process group that can be given to collective calls.

    .. note::
        The global communicator is used for global communications involving all ranks in the process group.
        The inter-node communicator and the intra-node communicator is used for hierarchical communications
        in this process group.

    .. note::
        For a specific communicator ``comm``, ``comm.rank()`` returns the rank of current process and
        ``comm.nranks()`` returns the size of the communicator.
    """
    global _group_count
    global _pg_group_ranks
    global _pg_map

    _group_count += 1

    if ranks is None:
        ranks = list(range(get_world_size()))
    else:
        ranks = sorted(ranks)  # sort the ranks

    if stream is None:
        _check_default_pg()
        stream = _get_default_group().stream

    group_name = str(_group_count)
    pg = BaguaProcessGroup(ranks, stream, str(_group_count))  # create the process group
    
    # Create the global rank to group rank mapping
    _pg_group_ranks[pg] = {
        global_rank: group_rank for group_rank, global_rank in enumerate(ranks)
    }
    _pg_map[group_name] = pg

    return pg
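The global-rank to group-rank mapping built in new_group can be tried out in isolation. The sketch below copies only that dictionary comprehension; group_rank_map and the world size of 8 are hypothetical, and no process group or CUDA stream is created.

```python
from typing import Dict, List, Optional


def group_rank_map(world_size: int, ranks: Optional[List[int]] = None) -> Dict[int, int]:
    """Map each global rank in the group to its rank inside the group."""
    members = list(range(world_size)) if ranks is None else sorted(ranks)
    return {global_rank: group_rank for group_rank, global_rank in enumerate(members)}


nprocs = 8  # hypothetical world size
odd_ranks = [r for r in range(nprocs) if r % 2 == 1]
print(group_rank_map(nprocs, odd_ranks))  # {1: 0, 3: 1, 5: 2, 7: 3}
print(group_rank_map(nprocs, [6, 2]))     # {2: 0, 6: 1}
```

Because the ranks are sorted first, every process derives the same group rank for a given global rank regardless of the order in which the list was passed in.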

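3.3 Ranks

Next, let's see how two rank lists are computed: the intra-node ranks and the inter-node ranks. In the code shown in 3.4 they are attributes of BaguaProcessGroup.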

intra_ranks = list(
    filter(
        lambda rank: rank // get_local_size() == get_rank() // get_local_size(),
        ranks,
    )
)
inter_ranks = list(
    filter(
        lambda rank: rank % get_local_size() == ranks[0] % get_local_size(),
        ranks,
    )
)

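The Python operators involved are:

//	Floor division: returns the integer part of the quotient (rounded down)	9 // 2 is 4, -9 // 2 is -5
%	Modulo: returns the remainder of the division	9 % 2 is 1, -9 % 2 is 1

Let's try it out: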

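def get_rank() -> int:
    return 5  # pretend the current process is rank 5

def get_local_size() -> int:
    return 3  # pretend each node runs 3 processes

nprocs = 10  # 10 processes
ranks = list(range(nprocs))  # ranks 0..9
intra_ranks = [r for r in ranks if r // get_local_size() == get_rank() // get_local_size()]
inter_ranks = [r for r in ranks if r % get_local_size() == ranks[0] % get_local_size()]
print(intra_ranks)  # the intra_ranks that rank 5 belongs to
print(inter_ranks)  # all inter_ranks: the first rank of each block of local_size ranks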

Output:
[3, 4, 5] # intra_ranks
[0, 3, 6, 9] # inter_ranks: the first rank of each node (local size 3)
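The same two filters also work for a group that contains only some of the ranks. The sketch below wraps them in a hypothetical helper, split_ranks, and applies it to a made-up layout of 2 nodes with 4 processes each, where the group consists of the odd ranks.

```python
from typing import List, Tuple


def split_ranks(ranks: List[int], rank: int, local_size: int) -> Tuple[List[int], List[int]]:
    """Return (intra_ranks, inter_ranks) as seen by the process `rank`."""
    # Same node: same quotient when dividing by the number of processes per node.
    intra = [r for r in ranks if r // local_size == rank // local_size]
    # One representative per node: same position within the node as ranks[0].
    inter = [r for r in ranks if r % local_size == ranks[0] % local_size]
    return intra, inter


odd_ranks = [1, 3, 5, 7]
for rank in odd_ranks:
    print(rank, split_ranks(odd_ranks, rank, local_size=4))
# 1 ([1, 3], [1, 5])
# 3 ([1, 3], [1, 5])
# 5 ([5, 7], [1, 5])
# 7 ([5, 7], [1, 5])
```

The inter-node list is the same for every process, while the intra-node list depends on which node the process is on. The representative of each node is the member whose position inside the node matches that of ranks[0].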

The functions used are as follows:

def get_rank() -> int:
    """
    Get the rank of current process group.

    Rank is a unique identifier assigned to each process within a distributed
    process group. They are always consecutive integers ranging from 0 to
    ``world_size``.

    Returns:
        The rank of the process group.
    """
    return int(os.environ.get("RANK", 0))


def get_local_rank() -> int:
    """
    Get the rank of current node.

    Local rank is a unique identifier assigned to each process within a node.
    They are always consecutive integers ranging from 0 to ``local_size``.

    Returns:
        The local rank of the node.
    """
    return int(os.environ.get("LOCAL_RANK", 0))
  
  
def get_local_size() -> int:
    """
    Get the number of processes in the node.

    Returns:
        The local size of the node.
    """
    return int(os.environ.get("LOCAL_WORLD_SIZE", 1))

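Now we know how the ranks inside the different process groups are obtained.

3.4 Definition of BaguaProcessGroup

Next, let's look at how BaguaProcessGroup is defined. According to the definition, each process group creates three communicators: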

A global communicator, obtained with group.get_global_communicator().
An inter-node communicator, obtained with group.get_inter_node_communicator().
An intra-node communicator, obtained with group.get_intra_node_communicator().

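The global communicator is used for global communication among all ranks in the process group. The inter-node communicator and the intra-node communicator are used for hierarchical communication within this process group.

When hierarchical communication is enabled, the GPUs on the same machine communicate with each other first, and the machines then communicate between nodes. This can improve performance when inter-node communication is expensive.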

class BaguaProcessGroup:
    def __init__(self, ranks, stream, group_name):
        self.ranks = ranks
        self.stream = stream
        self.group_name = group_name

        self.intra_ranks = list(
            filter(
                lambda rank: rank // get_local_size() == get_rank() // get_local_size(),
                ranks,
            )
        )
        self.inter_ranks = list(
            filter(
                lambda rank: rank % get_local_size() == ranks[0] % get_local_size(),
                ranks,
            )
        )

    def get_global_communicator(self):
        return get_communicator(self.group_name, "global")

    def get_inter_node_communicator(self):
        return get_communicator(self.group_name, "inter")

    def get_intra_node_communicator(self):
        return get_communicator(self.group_name, "intra")

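3.5 Creating the communicator

Concretely, this creates a BaguaSingleCommunicatorPy. lru_cache is used to make sure that each communicator is created only once. BaguaSingleCommunicatorPy is defined in rust/bagua-core/bagua-core-py/src/lib.rs. rust/bagua-core/bagua-core-internal/src/communicators/mod.rs also contains implementations such as BaguaHierarchicalCommunicator and HierarchicalCommunicator; they are not our focus here, and interested readers can dig deeper.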

@lru_cache(maxsize=None)
def get_communicator(group_name: str, comm_name: str):
    global _pg_map

    pg = _pg_map[group_name]
    if comm_name == "global":
        ranks = pg.ranks
    elif comm_name == "inter":
        ranks = pg.inter_ranks
    elif comm_name == "intra":
        ranks = pg.intra_ranks
    else:
        raise ValueError("comm_name should be one of ['global', 'inter', 'intra']")

    comm_key = "{}_{}_{}".format(group_name, comm_name, ",".join(map(str, ranks)))

    nccl_unique_id = broadcast_nccl_unique_id(comm_key, root=ranks[0])

    if get_rank() not in ranks:
        return CommMember.NON_COMM_MEMBER

    rank = ranks.index(get_rank())
    nranks = len(ranks)

    comm = B.BaguaSingleCommunicatorPy(
        rank=rank,
        nranks=nranks,
        device_id=get_local_rank(),
        stream_ptr=pg.stream.cuda_stream,
        nccl_unique_id_str=nccl_unique_id,
    )

    comm.cuda_stream = pg.stream
    return comm

The structure is as follows:

+-----------------------------------+
| BaguaProcessGroup                 |
|                                   |       +---------------------------+
|                                   |       | BaguaSingleCommunicatorPy |
|                                   |       |                           |
|    get_global_communicator  +-----------> |      ranks                |
|                                   |       |                           |
|                                   |       +---------------------------+
|                                   |
|                                   |       +---------------------------+
|                                   |       | BaguaSingleCommunicatorPy |
|    get_inter_node_communicator +--------> |                           |
|                                   |       |      inter_ranks          |
|                                   |       |                           |
|                                   |       +---------------------------+
|                                   |
|                                   |       +---------------------------+
|    get_intra_node_communicator +--------> | BaguaSingleCommunicatorPy |
|                                   |       |                           |
|                                   |       |      intra_ranks          |
|    ranks                          |       |                           |
|                                   |       +---------------------------+
|    inter_ranks                    |
|                                   |
|    intra_ranks                    |
|                                   |
|                                   |
+-----------------------------------+
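The role of lru_cache can be seen in a stripped-down version of get_communicator. The sketch below does not touch NCCL or Bagua: the function returns a plain dictionary as a fake communicator, and construction_count is a hypothetical counter added only to make the caching visible.

```python
from functools import lru_cache

construction_count = 0  # counts how many times a communicator is really built


@lru_cache(maxsize=None)
def get_communicator(group_name: str, comm_name: str) -> dict:
    """Build a (fake) communicator once per (group_name, comm_name) pair."""
    global construction_count
    if comm_name not in ("global", "inter", "intra"):
        raise ValueError("comm_name should be one of ['global', 'inter', 'intra']")
    construction_count += 1
    return {"group": group_name, "kind": comm_name}


first = get_communicator("1", "intra")
second = get_communicator("1", "intra")  # served from the cache
other = get_communicator("1", "inter")   # different key, built separately

print(first is second)     # True
print(construction_count)  # 2
```

Caching matters here because building a real communicator involves a collective setup step (broadcasting the NCCL unique id in the code above), which should happen once per group and communicator kind rather than on every call.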

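3.6 Usage

The code is in rust/bagua-core/bagua-core-internal/src/communicators/mod.rs

As the code shows, if hierarchical is not set, communication proceeds normally. If hierarchical is set, the intra and inter communicators are combined: intra-node communication first, then inter-node communication.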

impl BaguaCommunicator {
    pub fn new(
        communicator_internode: Option<&BaguaSingleCommunicator>,
        communicator_intranode: Option<&BaguaSingleCommunicator>,
        hierarchical: bool,
    ) -> Result<Self, BaguaCoreError> {
        match hierarchical {
            false => Ok(BaguaCommunicator::SingleCommunicator( // not hierarchical: communicate normally
                communicator_internode
                    .expect("inter node communicator must be given in non-hierarchical mode")
                    .clone(),
            )),
            true => { // hierarchical: combine intra and inter, using intra first
                let intranode_rank = communicator_intranode.as_ref().unwrap().rank();
                if intranode_rank == 0 {
                    let intra = communicator_intranode.expect("intra node communicator must be given in worker GPU in hierarchical mode").clone();
                    let inter = communicator_internode.unwrap().clone();
                    {
                        if intra.inner.stream_ptr != inter.inner.stream_ptr {
                            return Err(BaguaCoreError::CommunicatorError("intra node communicator should use the same stream as the inter node communicator".into()));
                        }
                    }
                    Ok(BaguaCommunicator::HierarchicalCommunicator(
                        BaguaHierarchicalCommunicator::Leader(
                            BaguaHierarchicalCommunicatorLeader::new(inter, intra),
                        ),
                    ))
                } else {
                    Ok(BaguaCommunicator::HierarchicalCommunicator(BaguaHierarchicalCommunicator::Worker(BaguaHierarchicalCommunicatorWorker {
                        intranode: communicator_intranode.expect("intra node communicator must be given in worker GPU in hierarchical mode").clone()
                    })))
                }
            }
        }
    }

    pub fn execute_communication(
        &self,
        tensor: &mut BaguaCommunicationTensor,
        intranode_average: bool,
        hierarchical_pre: bool,
        hierarchical_post: bool,
        communication_hook: &mut dyn FnMut(
            &BaguaCommunicatorInner,
            &mut BaguaCommunicationTensor,
        ) -> (),
    ) {
        match &self {
            BaguaCommunicator::SingleCommunicator(communicator) => {
                let communicator = communicator.inner.clone();
                communication_hook(&communicator, tensor);
            }
            BaguaCommunicator::HierarchicalCommunicator(communicator) => match communicator {
                BaguaHierarchicalCommunicator::Leader(communicator) => {
                    let internode_communicator = communicator.internode.inner.clone();
                    if hierarchical_pre { // intra-node step first
                        communicator.hierarchical_pre(tensor, intranode_average);
                    }
                    communication_hook(&internode_communicator, tensor); // then the inter-node step (leader only)
                    if hierarchical_post {
                        communicator.hierarchical_post(tensor);
                    }
                }
                BaguaHierarchicalCommunicator::Worker(communicator) => {
                    if hierarchical_pre {
                        communicator.hierarchical_worker_pre(tensor, intranode_average);
                    }
                    if hierarchical_post {
                        communicator.hierarchical_worker_post(tensor);
                    }
                }
            },
        }
    }
}
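
The control flow above can be mimicked with a small single-process simulation. This is only a sketch under stated assumptions, not Bagua's implementation: it assumes that the "pre" step reduces every GPU's tensor onto the node leader (intra-node rank 0), that the hook runs between leaders only, and that the "post" step broadcasts the leader's result back inside the node. The names `choose_role`, `sum_tensors` and `hierarchical_allreduce` are hypothetical, and plain Python lists stand in for GPU tensors.

```python
from typing import Callable, List

Tensor = List[float]


def choose_role(intranode_rank: int, hierarchical: bool) -> str:
    """Mirror the branching in BaguaCommunicator::new (role names are illustrative)."""
    if not hierarchical:
        return "single"
    # In hierarchical mode, intra-node rank 0 becomes the leader.
    return "leader" if intranode_rank == 0 else "worker"


def sum_tensors(tensors: List[Tensor]) -> Tensor:
    """Element-wise sum of equally sized tensors."""
    return [sum(values) for values in zip(*tensors)]


def average_hook(leader_tensors: List[Tensor]) -> Tensor:
    """Stand-in for communication_hook: average across the node leaders."""
    total = sum_tensors(leader_tensors)
    return [v / len(leader_tensors) for v in total]


def hierarchical_allreduce(
    nodes: List[List[Tensor]],
    intranode_average: bool = True,
    hook: Callable[[List[Tensor]], Tensor] = average_hook,
) -> List[List[Tensor]]:
    """nodes[i][j] is the tensor held by GPU j on node i."""
    # Step 1 (assumed "pre"): reduce inside each node onto its leader.
    leader_tensors = []
    for local_tensors in nodes:
        reduced = sum_tensors(local_tensors)
        if intranode_average:
            reduced = [v / len(local_tensors) for v in reduced]
        leader_tensors.append(reduced)

    # Step 2 (hook): only the leaders communicate across nodes.
    result = hook(leader_tensors)

    # Step 3 (assumed "post"): each leader broadcasts the result inside its node.
    return [[list(result) for _ in local_tensors] for local_tensors in nodes]


if __name__ == "__main__":
    nodes = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0], [7.0, 8.0]]]
    print(hierarchical_allreduce(nodes))
```

With the same number of GPUs on every node, averaging inside each node and then across leaders gives the global average, while only one tensor per node crosses the slower inter-node link.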

0xFF References

PyTorch internals

Kuaishou Bagua! An open-source distributed training framework that breaks through the parallelism bottlenecks of TensorFlow and PyTorch

https://arxiv.org/pdf/2107.01499.pdf

[1] Dean, Jeffrey, Greg S. Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao et al. “Large scale distributed deep networks.” (2012).

[2] Zhengyuan Zhou, Panayotis Mertikopoulos, Nicholas Bambos, Peter Glynn, Yinyu Ye, Li-Jia Li, and Li Fei-Fei. 2018. Distributed asynchronous optimization with unbounded delays: How slow can you go?. In International Conference on Machine Learning. PMLR, 5970–5979.

[3] DanAlistarh, DemjanGrubic, JerryLi, RyotaTomioka, and MilanVojnovic. 2016. QSGD: Communication-efficient SGD via gradient quantization and encoding. arXiv preprint arXiv:1610.02132 (2016).

[4] Dan Alistarh, Torsten Hoefler, Mikael Johansson, Sarit Khirirat, Nikola Konstanti- nov, and Cédric Renggli. 2018. The convergence of sparsified gradient methods. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 5977–5987.

[5] Anastasia Koloskova, Sebastian Stich, and Martin Jaggi. 2019. Decentralized stochastic optimization and gossip algorithms with compressed communication. In International Conference on Machine Learning. PMLR, 3478–3487.

[6] Xiangru Lian, Ce Zhang, Huan Zhang, Cho-Jui Hsieh, Wei Zhang, and Ji Liu. 2017. Can decentralized algorithms outperform centralized algorithms? a case study for decentralized parallel stochastic gradient descent. In Proceedings of the 31st International Conference on Neural Information Processing Systems. 5336–5346.

[7] Christopher De Sa, Matthew Feldman, Christopher Ré, and Kunle Olukotun. 2017. Understanding and optimizing asynchronous low-precision stochastic gradient descent. In Proceedings of the 44th Annual International Symposium on Computer Architecture. 561–574.

[8] Xiangru Lian, Wei Zhang, Ce Zhang, and Ji Liu. 2018. Asynchronous decentral- ized parallel stochastic gradient descent. In International Conference on Machine Learning. PMLR, 3043–3052.

[9] Hanlin Tang, Shaoduo Gan, Ce Zhang, Tong Zhang, and Ji Liu. 2018. Com- munication compression for decentralized training. In Proceedings of the 32nd International Conference on Neural Information Processing Systems. 7663–7673.

[10] Ji Liu, Ce Zhang, et al. 2020. Distributed Learning Systems with First-Order Methods. Foundations and Trends® in Databases 9, 1 (2020), 1–100.
