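The code structure is shown in the figure below:
While initializing Megatron, initialize_megatron sets up data parallelism, pipeline parallelism, and tensor parallelism. A brief introduction to each and its implementation follows: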
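Initializing the distributed environment for the model:
Take two servers with 8 GPUs each (16 GPUs in total) as an example, training a model with 12 transformer layers.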
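Figure 2
In this example the model is cut vertically into 4 parts of 3 layers each to implement pipeline parallelism, and cut horizontally to implement tensor parallelism, which splits each block such as "layers 1, 2, 3" in Figure 2 into two parts.
Figure 3
The figure above uses model1 as an example to show how one model is cut into eight parts, each placed on its own GPU.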
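What one complete model, model1, consists of:
Three vertical cuts split the 12 transformer layers into four parts of 3 layers each, in order to implement pipeline parallelism. [Requires pipeline_model_parallel_size=4]
One horizontal cut represents tensor parallelism: every block of three layers, from (1, 2, 3) through (10, 11, 12), is split into an upper and a lower part. [Requires tensor_model_parallel_size=2]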
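The arithmetic behind these two settings can be sketched in plain Python. This is a minimal illustration rather than Megatron code: the helper names data_parallel_size_for and layers_for_stage are hypothetical, and the even split of layers across pipeline stages is an assumption that matches this 12-layer example.

```python
def data_parallel_size_for(world_size: int, tp: int, pp: int) -> int:
    """Number of model replicas that fit on world_size GPUs (hypothetical helper)."""
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    return world_size // (tp * pp)


def layers_for_stage(stage: int, num_layers: int, pp: int) -> list[int]:
    """1-based layer numbers held by a 0-based pipeline stage, assuming an even split."""
    if num_layers % pp != 0:
        raise ValueError("num_layers must be divisible by pp in this sketch")
    per_stage = num_layers // pp
    first = stage * per_stage + 1
    return list(range(first, first + per_stage))


# Values from the example: 2 servers x 8 GPUs, 12 layers, tp=2, pp=4.
replicas = data_parallel_size_for(world_size=16, tp=2, pp=4)
print(replicas)  # 2, i.e. model1 and model2

for stage in range(4):
    # stage 0 holds layers [1, 2, 3], stage 3 holds layers [10, 11, 12]
    print(stage, layers_for_stage(stage, num_layers=12, pp=4))
```

Each replica therefore occupies tp * pp = 8 GPUs, which is why the 16 GPUs hold exactly two copies of the model.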
tensor model-parallel groups: the groups of GPUs that together hold one tensor-parallel split. From the figure:
model1: [0, 1; 4, 5; 8, 9; 12, 13]
model2: [2, 3; 6, 7; 10, 11; 14, 15]
These correspond to the following in the code example:
8 tensor model-parallel groups:
[g0, g1], [g2, g3], [g4, g5], [g6, g7], [g8, g9], [g10, g11], [g12, g13], [g14, g15]
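The eight groups listed above can be reproduced without any GPUs. The following is a minimal sketch that only computes rank lists and does not create torch.distributed process groups; the function name tensor_model_parallel_groups and the constant MODEL1_RANKS are introduced here for illustration, with the rank set copied from the model1 listing above.

```python
def tensor_model_parallel_groups(world_size: int, tp: int) -> list[list[int]]:
    """Consecutive blocks of tp ranks form one tensor-parallel group."""
    if world_size % tp != 0:
        raise ValueError("world_size must be divisible by tp")
    return [list(range(i * tp, (i + 1) * tp)) for i in range(world_size // tp)]


groups = tensor_model_parallel_groups(world_size=16, tp=2)
print(groups)
# [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9], [10, 11], [12, 13], [14, 15]]

# GPUs that hold model1 according to the listing above (assumption: read off the text).
MODEL1_RANKS = {0, 1, 4, 5, 8, 9, 12, 13}
model1_groups = [g for g in groups if set(g) <= MODEL1_RANKS]
model2_groups = [g for g in groups if not set(g) <= MODEL1_RANKS]
print(model1_groups)  # [[0, 1], [4, 5], [8, 9], [12, 13]]
print(model2_groups)  # [[2, 3], [6, 7], [10, 11], [14, 15]]
```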
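pipeline model-parallel groups: the groups of GPUs that together form one pipeline. From the figure:
model1 is first cut vertically into 4 stages that run as a pipeline and is then split horizontally, so it has two groups: the first is [0, 4, 8, 12] and the second is [1, 5, 9, 13].
The same applies to model2, which gives [2, 6, 10, 14] and [3, 7, 11, 15].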
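As a quick check of the pipeline groups, here is a minimal sketch in plain Python that mirrors the strided range used in the implementation. The names pipeline_model_parallel_groups and stage_of_rank are hypothetical helpers, and no process groups are created.

```python
def pipeline_model_parallel_groups(world_size: int, pp: int) -> list[list[int]]:
    """Ranks that are world_size // pp apart belong to the same pipeline."""
    if world_size % pp != 0:
        raise ValueError("world_size must be divisible by pp")
    num_groups = world_size // pp
    return [list(range(i, world_size, num_groups)) for i in range(num_groups)]


def stage_of_rank(rank: int, groups: list[list[int]]) -> int:
    """0-based pipeline stage of a rank, i.e. its position inside its group."""
    for group in groups:
        if rank in group:
            return group.index(rank)
    raise ValueError(f"rank {rank} is not in any pipeline group")


pp_groups = pipeline_model_parallel_groups(world_size=16, pp=4)
print(pp_groups)
# [[0, 4, 8, 12], [1, 5, 9, 13], [2, 6, 10, 14], [3, 7, 11, 15]]

print(stage_of_rank(9, pp_groups))  # 2, the third stage of the pipeline [1, 5, 9, 13]
```

The first two groups belong to model1 and the last two to model2, which matches the description above.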
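data_parallel groups: data parallelism runs between model sub-blocks that hold the same parameters. The figure shows the model layout on the two servers: GPUs 0 and 2 hold the same sub-block, as do 1 and 3, and 4 and 6. These correspond to the following in the code example:
8 data_parallel groups:
[g0, g2], [g1, g3], [g4, g6], [g5, g7], [g8, g10], [g9, g11], [g12, g14], [g13, g15]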
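The data-parallel groups follow from the same two sizes. This is a minimal sketch of the loop structure in plain Python, assuming the rank layout used in this example; data_parallel_groups is a hypothetical name and only rank lists are computed.

```python
def data_parallel_groups(world_size: int, tp: int, pp: int) -> list[list[int]]:
    """Within each pipeline stage, ranks with the same tensor-parallel index replicate each other."""
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    ranks_per_stage = world_size // pp  # GPUs that hold the same pipeline stage
    groups = []
    for stage in range(pp):
        start = stage * ranks_per_stage
        end = (stage + 1) * ranks_per_stage
        for tp_index in range(tp):
            # Step by tp so that only ranks holding the same tensor shard are grouped.
            groups.append(list(range(start + tp_index, end, tp)))
    return groups


dp_groups = data_parallel_groups(world_size=16, tp=2, pp=4)
print(dp_groups)
# [[0, 2], [1, 3], [4, 6], [5, 7], [8, 10], [9, 11], [12, 14], [13, 15]]

# Every group has one member per model replica: 16 // (2 * 4) = 2.
print(all(len(g) == 2 for g in dp_groups))  # True
```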
Code implementation:
from typing import Optional

import torch

def initialize_model_parallel(
    tensor_model_parallel_size: int = 1,
    pipeline_model_parallel_size: int = 1,
    virtual_pipeline_model_parallel_size: Optional[int] = None,
    pipeline_model_parallel_split_rank: Optional[int] = None,
    use_fp8: bool = False,
):
    # Values used for the 16-GPU example above.
    tensor_model_parallel_size = 2
    pipeline_model_parallel_size = 4
    world_size = 16
    rank = torch.distributed.get_rank()  # global rank of the current process
    data_parallel_size: int = world_size // (tensor_model_parallel_size * pipeline_model_parallel_size)  # = 2
    num_tensor_model_parallel_groups: int = world_size // tensor_model_parallel_size  # = 8
    num_pipeline_model_parallel_groups: int = world_size // pipeline_model_parallel_size  # = 4

    # Build the data-parallel groups.
    global _DATA_PARALLEL_GROUP, _DATA_PARALLEL_GROUP_GLOO, _DATA_PARALLEL_GLOBAL_RANKS
    all_data_parallel_group_ranks = []
    for i in range(pipeline_model_parallel_size):
        start_rank = i * num_pipeline_model_parallel_groups
        end_rank = (i + 1) * num_pipeline_model_parallel_groups
        for j in range(tensor_model_parallel_size):
            ranks = range(start_rank + j, end_rank, tensor_model_parallel_size)
            all_data_parallel_group_ranks.append(list(ranks))
            group = torch.distributed.new_group(ranks)
            group_gloo = torch.distributed.new_group(ranks, backend="gloo")
            if rank in ranks:
                _DATA_PARALLEL_GROUP = group
                _DATA_PARALLEL_GROUP_GLOO = group_gloo
                _DATA_PARALLEL_GLOBAL_RANKS = ranks
    print(all_data_parallel_group_ranks)
    # Output:
    # [[0, 2], [1, 3], [4, 6], [5, 7], [8, 10], [9, 11], [12, 14], [13, 15]]

    # Build the model-parallel groups, i.e. which GPUs each model replica occupies.
    global _MODEL_PARALLEL_GROUP
    assert _MODEL_PARALLEL_GROUP is None, 'model parallel group is already initialized'
    for i in range(data_parallel_size):
        ranks = [data_parallel_group_ranks[i]
                 for data_parallel_group_ranks in all_data_parallel_group_ranks]
        group = torch.distributed.new_group(ranks)
        print(ranks)
        if rank in ranks:
            _MODEL_PARALLEL_GROUP = group
    # Output:
    # [0, 1, 4, 5, 8, 9, 12, 13]
    # [2, 3, 6, 7, 10, 11, 14, 15]

    # Build the tensor model-parallel groups.
    global _TENSOR_MODEL_PARALLEL_GROUP
    assert _TENSOR_MODEL_PARALLEL_GROUP is None, 'tensor model parallel group is already initialized'
    for i in range(num_tensor_model_parallel_groups):
        ranks = range(i * tensor_model_parallel_size, (i + 1) * tensor_model_parallel_size)
        group = torch.distributed.new_group(ranks)
        print(list(ranks))  # list() so the ranks are printed rather than "range(0, 2)"
        if rank in ranks:
            _TENSOR_MODEL_PARALLEL_GROUP = group
    # Output:
    # [0, 1]
    # [2, 3]
    # [4, 5]
    # [6, 7]
    # [8, 9]
    # [10, 11]
    # [12, 13]
    # [14, 15]

    # Build the pipeline model-parallel groups and the embedding groups.
    global _PIPELINE_MODEL_PARALLEL_GROUP, _PIPELINE_GLOBAL_RANKS
    global _EMBEDDING_GROUP, _EMBEDDING_GLOBAL_RANKS
    global _POSITION_EMBEDDING_GROUP, _POSITION_EMBEDDING_GLOBAL_RANKS
    for i in range(num_pipeline_model_parallel_groups):
        ranks = range(i, world_size, num_pipeline_model_parallel_groups)
        print(list(ranks))
        group = torch.distributed.new_group(ranks)
        if rank in ranks:
            _PIPELINE_MODEL_PARALLEL_GROUP = group
            _PIPELINE_GLOBAL_RANKS = ranks
        # Set up the embedding group (to exchange gradients between
        # the first and last stages).
        if len(ranks) > 1:
            embedding_ranks = [ranks[0], ranks[-1]]
            position_embedding_ranks = [ranks[0]]
            print(embedding_ranks)
            print(position_embedding_ranks)
            if pipeline_model_parallel_split_rank is not None:
                if ranks[pipeline_model_parallel_split_rank] not in embedding_ranks:
                    embedding_ranks = [ranks[0], ranks[pipeline_model_parallel_split_rank], ranks[-1]]
                if ranks[pipeline_model_parallel_split_rank] not in position_embedding_ranks:
                    position_embedding_ranks = [ranks[0], ranks[pipeline_model_parallel_split_rank]]
        else:
            embedding_ranks = ranks
            position_embedding_ranks = ranks

        group = torch.distributed.new_group(embedding_ranks)
        if rank in embedding_ranks:
            _EMBEDDING_GROUP = group
        if rank in ranks:
            _EMBEDDING_GLOBAL_RANKS = embedding_ranks

        group = torch.distributed.new_group(position_embedding_ranks)
        if rank in position_embedding_ranks:
            _POSITION_EMBEDDING_GROUP = group
        if rank in ranks:
            _POSITION_EMBEDDING_GLOBAL_RANKS = position_embedding_ranks
    # Output (pipeline ranks, embedding ranks, position-embedding ranks per group):
    # [0, 4, 8, 12]
    # [0, 12]
    # [0]
    # [1, 5, 9, 13]
    # [1, 13]
    # [1]
    # [2, 6, 10, 14]
    # [2, 14]
    # [2]
    # [3, 7, 11, 15]
    # [3, 15]
    # [3]
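Putting the groups together, each global rank can be described by three coordinates: its position in its tensor-parallel group, in its data-parallel group, and in its pipeline. The sketch below derives them arithmetically. It is an illustration under the assumption that ranks are laid out exactly as in the groups printed in this walkthrough; other Megatron versions or settings may order ranks differently. ParallelCoords and coords_for_rank are hypothetical names, not Megatron APIs.

```python
from typing import NamedTuple


class ParallelCoords(NamedTuple):
    tensor_rank: int    # index inside the tensor-parallel group
    data_rank: int      # index inside the data-parallel group (which replica)
    pipeline_rank: int  # pipeline stage


def coords_for_rank(rank: int, world_size: int, tp: int, pp: int) -> ParallelCoords:
    """Map a global rank to its coordinates, assuming the layout of this example."""
    if world_size % (tp * pp) != 0:
        raise ValueError("world_size must be divisible by tp * pp")
    if not 0 <= rank < world_size:
        raise ValueError("rank out of range")
    ranks_per_stage = world_size // pp  # equals tp * dp
    pipeline_rank = rank // ranks_per_stage
    within_stage = rank % ranks_per_stage
    return ParallelCoords(
        tensor_rank=within_stage % tp,
        data_rank=within_stage // tp,
        pipeline_rank=pipeline_rank,
    )


# Rank 6 is first in tensor group [6, 7], second in data group [4, 6],
# and second in pipeline [2, 6, 10, 14].
print(coords_for_rank(6, world_size=16, tp=2, pp=4))
# ParallelCoords(tensor_rank=0, data_rank=1, pipeline_rank=1)

# Rank 13 is second in tensor group [12, 13], first in data group [13, 15],
# and last in pipeline [1, 5, 9, 13].
print(coords_for_rank(13, world_size=16, tp=2, pp=4))
# ParallelCoords(tensor_rank=1, data_rank=0, pipeline_rank=3)
```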
Reference:
https://zhuanlan.zhihu.com/p/470279673