Training a 60-Million-Class Visual Classification Model on a Single Machine: PaddlePaddle's Large-Scale Classification Library PLSC Makes It Possible

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":" 大规模分类任务 ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在介绍大规模分类任务之前,我们先简短回顾一下通常的分类任务。大家熟知的视觉分类任务中,网络模型由特征提取器(Backbone)和分类器(Classifier)组成。分类的类别数有2类(如,前景/背景分类)、10类(如,MNIST 数据分类)、80类(如,COCO 数据分类)和1000类(如,ImageNet 数据分类)等等。比较主流的特征提取器有 ResNet、MobileNet 等网络结构,分类器则通常采用线性分类层(全连接层,FC)。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大规模分类任务的『大规模』指模型参数规模非常大,包括以下3种情况:","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"特征提取器参数规模大、分类器参数规模大,以及两者参数规模都大。","attrs":{}},{"type":"text","text":"万物互联时代,随着人工智能、5G 和 IoT 等技术的发展,分类模型分类的类别数不断增加,类别数可以达到上千万甚至更多。在这种背景下,分类网络模型的 FC 层的类别数增加,参数规模爆炸式增长。基于度量学习的分类模型,通常在训练阶段使用闭集数据集学习特征提取器和分类器,在推理阶段,仅使用特征提取器提取输入图像的特征,并与预提取的特征进行相似性对比得出是否属于同一类,如图1所示。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/c5/c53a460e8b6f1d58221540188aba62b9.webp","alt":"图片","title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"▲ 图1:大规模分类网络模型训练和部署示意图","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":" 大规模分类模型训练难点 ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本文聚焦于解决大规模分类模型训练问题。有小伙伴可能会问:大规模分类不就是一个普通的图像分类吗,除了分类类别数较多导致的 FC 层参数量大以外,还有什么难题?图像分类领域每年有大量的论文和工作在 ImageNet 数据集上 取得新的 SOTA,随便从 Github 上找个图像分类库来训练是不是就可以了?","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"然而,FC 层参数规模的急剧增长在训练时会带来以下两方面挑战:","attrs":{}}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"首先是存储问题。假设分类类别数为4000万,在训练阶段特征向量的维度为512,并且以32比特浮点数存储模型参数,那么仅 FC 
To tackle these two challenges, academia and industry keep optimizing the memory consumption and speed of training. The PaddlePaddle team likewise keeps polishing and upgrading its large-scale classification library PLSC (Paddle Large Scale Classification), which provides hybrid data-parallel & model-parallel training, class center sampling, sparse-gradient parameter updates, and FP16 training.

Solutions in Detail

Next, we walk through the solutions PLSC provides: hybrid data-parallel & model-parallel training, model-parallel loss computation, class center sampling, sparse-gradient parameter updates, and FP16 training.

● Hybrid parallel training

To improve training efficiency, multiple GPUs are usually used for data-parallel training. For large-scale classification, however, the huge number of classes makes single-card training impossible: with 40 million classes, the FC layer alone takes 76.29 GB, far beyond the memory capacity of mainstream GPUs.

Moreover, under pure data parallelism the gradient communication volume of the FC layer is also enormous, making the training speed unacceptable. Faced with the FC layer's storage and gradient communication problems, a natural question is whether the parameters can be spread over multiple GPUs. The answer is yes: we can adopt a model-parallel strategy and shard the FC parameters across cards. As shown in Figure 2, the Backbone uses data parallelism while the classification FC layer uses model parallelism, combining the training efficiency of data parallelism with the storage and gradient communication requirements of the FC layer. With 40 million classes on a single machine with 8 GPUs, each card only needs to hold 5 million class centers, i.e., 76.29 GB / 8 = 9.54 GB of parameters.

With data parallelism and model parallelism combined on a single machine with 8 GPUs, the forward pass proceeds as follows (a code sketch is given after Figure 2):

1. Each card receives one batch of data; assume the per-card batch size is 64.
2. Each card runs its input through the Backbone in data-parallel fashion, producing 512-dimensional feature vectors of shape 64x512.
3. An allgather is performed on the features and labels of every card, so each card now holds the full features of shape 512x512 and the full labels of shape 512x1.
4. The full features (512x512) are multiplied with the local shard of FC parameters (512x5,000,000), producing logits of shape 512x5,000,000.
5. The loss is computed with the model-parallel SoftmaxWithCrossEntropy loss function.

[Image: https://static001.geekbang.org/infoq/f9/f9ae3e5b820e633e457e8629d6d20f0b.webp]
▲ Figure 2: Data-parallel Backbone & model-parallel Classifier
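The five steps above can be sketched with PaddlePaddle's collective communication API. This is a minimal illustration, not PLSC's actual implementation: names such as hybrid_parallel_forward, backbone, and local_fc_weight are placeholders, paddle.distributed is assumed to be initialized with 8 ranks, feature/weight normalization is omitted, and the exact arguments of margin_cross_entropy (in particular the process group and scale) should be checked against the Paddle documentation.

```python
import paddle
import paddle.distributed as dist
import paddle.nn.functional as F

def hybrid_parallel_forward(backbone, local_fc_weight, images, labels):
    # Steps 1-2: data-parallel Backbone, local batch of 64 -> features [64, 512]
    local_feat = backbone(images)

    # Step 3: allgather features and labels from all 8 cards -> [512, 512], [512]
    feat_list, label_list = [], []
    dist.all_gather(feat_list, local_feat)
    dist.all_gather(label_list, labels)
    total_feat = paddle.concat(feat_list, axis=0)
    total_label = paddle.concat(label_list, axis=0)

    # Step 4: model-parallel FC, full features x local shard -> logits [512, 5,000,000]
    local_logits = paddle.matmul(total_feat, local_fc_weight)

    # Step 5: model-parallel softmax cross entropy over the sharded logits
    # (margin1=1.0, margin2=0.0, margin3=0.0 disables the angular margins).
    loss = F.margin_cross_entropy(local_logits, total_label,
                                  margin1=1.0, margin2=0.0, margin3=0.0)
    return loss
```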
● Model-parallel loss computation: MarginLoss at the API level

In the field of metric learning, the ArcFace paper [2] expresses the ArcFace, CosFace [3], and SphereFace [4] loss functions with the following unified formula, which we call MarginLoss:

[Image (MarginLoss formula): https://static001.geekbang.org/infoq/40/40d55a654ca4356918d0f5649fa30c48.webp]

MarginLoss adds margins on top of the logits; underneath it is still the SoftmaxWithCrossEntropy loss. Under model parallelism, the most obvious way to compute the loss is to use communication operations to fetch the full logits from the other cards. However, this approach not only requires a huge amount of communication, it also needs temporary storage for the other cards' logits, incurring a huge memory overhead.

The PaddlePaddle framework provides native support for MarginLoss under the model-parallel strategy through paddle.nn.functional.margin_cross_entropy. This interface needs little communication and has a small memory overhead. Figure 3 shows how SoftmaxWithCrossEntropy is computed under model parallelism. First, the row-wise maximum of the logits is computed on each card, and an allreduce obtains the global row-wise maximum. For numerical stability, each row subtracts its global maximum. Next, the row-wise denominator sum is computed locally, and another allreduce obtains the global sum. Finally, the loss is computed row by row, and an allreduce obtains the global loss. In the figure, the common sub-expressions of the loss and the softmax probability are factored out and computed once.

[Image: https://static001.geekbang.org/infoq/59/5912968bd72a831cd0ff274e978dd20c.webp]
▲ Figure 3: Model-parallel SoftmaxWithCrossEntropy computation
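The procedure of Figure 3 can be written down directly with allreduce operations. The following is an illustrative sketch of model-parallel softmax cross entropy, not Paddle's internal implementation (in practice paddle.nn.functional.margin_cross_entropy already encapsulates these steps). It assumes each rank holds a logits shard of shape [N, C/k], knows the first global class id it owns (class_start), and receives the global labels.

```python
import paddle
import paddle.distributed as dist

def model_parallel_softmax_ce(local_logits, label, class_start):
    """Illustrative model-parallel SoftmaxWithCrossEntropy.

    local_logits: [N, c_local] shard of the full [N, C] logits on this rank.
    label:        [N] global class ids.
    class_start:  first global class id owned by this rank.
    """
    n, c_local = local_logits.shape

    # 1. Row-wise local max, then allreduce(max) for the global row max.
    row_max = paddle.max(local_logits, axis=1, keepdim=True)  # [N, 1]
    dist.all_reduce(row_max, op=dist.ReduceOp.MAX)

    # 2. Subtract the global max for numerical stability, exponentiate,
    #    then allreduce(sum) for the global softmax denominator.
    shifted = local_logits - row_max
    denom = paddle.sum(paddle.exp(shifted), axis=1, keepdim=True)  # [N, 1]
    dist.all_reduce(denom, op=dist.ReduceOp.SUM)

    # 3. Pick the shifted target logit if this rank owns the sample's class,
    #    contribute 0 otherwise, and allreduce(sum) to assemble it globally.
    local_label = label - class_start
    owns = (local_label >= 0) & (local_label < c_local)
    idx = paddle.clip(local_label, 0, c_local - 1)
    target = paddle.take_along_axis(shifted, idx.unsqueeze(1), axis=1)  # [N, 1]
    target = paddle.where(owns.unsqueeze(1), target, paddle.zeros_like(target))
    dist.all_reduce(target, op=dist.ReduceOp.SUM)

    # 4. loss_i = log(sum_j exp(z_ij - max_i)) - (z_iy - max_i), averaged over rows.
    loss = paddle.log(denom) - target
    return paddle.mean(loss)
```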
● Class center sampling: PartialFC supported at the API level

Combining data parallelism and model parallelism solves the storage of the FC parameters. However, the logits produced in the forward pass also have a large storage footprint: under the assumptions of the hybrid parallel training section it is 512 * 5,000,000 * 4 bytes / 1024 / 1024 / 1024 = 9.54 GB. Taking into account the variables involved in the forward pass, the backward pass, and the parameter update, and using Momentum as the optimizer, the storage required by the FC layer is:

[Image (FC-layer memory formulas): https://static001.geekbang.org/infoq/13/13810b58e8d7161ff822c4f73b38efd8.webp]

where d is the feature dimension, c the total number of classes, k the number of GPUs, N the batch size, Mem_w the parameter storage, Mem_logits the logits storage, and Mem_FC the total FC-layer storage. As the number of classes grows, we can shard the FC parameters over more cards so that the parameter storage per card stays constant. The size of the logits, however, grows linearly with the number of cards.

Therefore, increasing the number of cards by a factor of k also increases Mem_logits by a factor of k. During training, the total FC storage Mem_FC equals 3 x Mem_w (weight, gradient, and velocity) plus 2 x Mem_logits (activation and gradient). To address the storage of the logits and their gradients, PartialFC [5] proposes a sampling-based FC layer: only a subset of the class centers of the full FC layer is sampled to take part in the current iteration. As shown in Figure 4, the class centers of the positive samples are always kept, while the negative class centers are randomly sampled at a given ratio.

With a sampling ratio of 1/10, the logits have shape 512x500,000 and take 0.1 * 9.54 GB = 0.954 GB. Before this optimization the logits and their gradients took 2 * 9.54 GB = 19.08 GB; with PartialFC they take 2 * 0.954 GB = 1.908 GB, so PartialFC greatly reduces the memory overhead.

[Image: https://static001.geekbang.org/infoq/90/90f786a7ce3a5062958d05ed6575bf81.webp]
▲ Figure 4: The PartialFC sampling process

PaddlePaddle provides an API that natively supports this sampling process: paddle.nn.functional.class_center_sample.
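Below is a minimal sketch of how the sampling API might be wired into the model-parallel FC layer. It is illustrative rather than PLSC's actual code: the variable names are placeholders, normalization is omitted, and the exact signature and return values of paddle.nn.functional.class_center_sample (which returns the remapped labels together with the indices of the sampled class centers) should be verified against the Paddle documentation.

```python
import paddle
import paddle.nn.functional as F

# Illustrative per-rank setup: 5,000,000 local classes, 10% sampling ratio.
num_local_classes = 5_000_000
num_samples = num_local_classes // 10  # PartialFC sample ratio 0.1

def partial_fc_forward(total_feat, total_label, local_fc_weight):
    # Sample class centers: positive classes are always kept, negatives are
    # sampled randomly; labels are remapped into the sampled index space.
    remapped_label, sampled_index = F.class_center_sample(
        total_label, num_local_classes, num_samples)

    # Only the sampled columns of the local FC shard take part in this step,
    # so the logits shrink from [512, 5,000,000] to [512, 500,000].
    sampled_weight = paddle.gather(local_fc_weight, sampled_index, axis=1)
    logits = paddle.matmul(total_feat, sampled_weight)

    # ArcFace-style margins (margin2=0.5) on the sampled, sharded logits.
    loss = F.margin_cross_entropy(logits, remapped_label,
                                  margin1=1.0, margin2=0.5, margin3=0.0)
    return loss, sampled_index
```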
● Sparse-gradient parameter updates: SparseMomentum

Sparse-gradient parameter updates are one of PLSC's highlights. Although PartialFC uses sampling to relieve the memory pressure of the logits and their gradients, the analysis above shows that the FC layer still needs 3 x Mem_w, for the FC parameters, their gradients, and the optimizer state. Can memory be reduced further? The answer is yes.

As shown on the left of Figure 5, during the forward pass the sampled class centers Wsub are obtained by sampling the parameter W; during the backward pass the sparse gradient Wsub@grad is computed first and then expanded into the full parameter gradient W@grad; in the update phase, the momentum operator uses the passed-in parameter Wsub, the parameter gradient W@grad, and the optimizer state W@velocity to update the parameter W and the optimizer state W@velocity. Our analysis shows that the full parameter gradient W@grad is redundant, since it can be recovered from the sparse gradient Wsub@grad. We therefore designed and implemented a sparse_momentum interface. Compared with momentum, it takes an additional argument index, the indices of the sampled FC parameters; the computation is shown on the right of Figure 5. With this interface the gradient storage is reduced substantially, which makes it possible to train models with even more parameters. Whereas momentum needs 3 * 9.54 GB + 2 * 0.954 GB = 30.528 GB of storage, sparse_momentum needs only 2 * 9.54 GB + 2 * 0.954 GB = 20.988 GB for the FC layer, a 31.25% reduction in memory.

[Image: https://static001.geekbang.org/infoq/55/55f93863463e41c5393af02f54cc3a8f.webp]
▲ Figure 5: Comparison of the Momentum and SparseMomentum update processes
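The idea behind sparse_momentum can be illustrated with a few lines of tensor code: only the sampled rows of the parameter and of the velocity are touched, so the dense gradient W@grad is never materialized. This is a conceptual sketch under an illustrative [num_classes, feat_dim] layout, not the signature or implementation of PLSC's actual sparse_momentum operator.

```python
import paddle

def sparse_momentum_update(weight, velocity, index, grad_sub,
                           lr=0.1, momentum=0.9):
    """Conceptual SparseMomentum step.

    weight:   full FC parameter, laid out here as [num_classes, feat_dim]
    velocity: optimizer state with the same shape as weight
    index:    indices of the class centers sampled in this iteration
    grad_sub: sparse gradient, one row per sampled class center
    """
    # Gather only the rows that were sampled in this iteration.
    v_sub = paddle.gather(velocity, index)
    w_sub = paddle.gather(weight, index)

    # Standard momentum update, restricted to the sampled rows.
    v_sub = momentum * v_sub + grad_sub
    w_sub = w_sub - lr * v_sub

    # Scatter the updated rows back; all other rows stay untouched,
    # so no dense gradient or dense update is ever formed.
    velocity = paddle.scatter(velocity, index, v_sub, overwrite=True)
    weight = paddle.scatter(weight, index, w_sub, overwrite=True)
    return weight, velocity
```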
● FP16 training: 50% memory savings

Another highlight of PLSC is FP16 training: throughout training, the parameters, activations, gradients, and optimizer states are all kept in FP16, saving 50% of memory compared with FP32. Compared with both FP32 and AMP [6], FP16 substantially reduces memory and substantially increases training speed.

Figure 6 shows the computation processes of FP32, AMP, and FP16 training. In FP32 training, all model parameters, gradients, activations, and optimizer states are FP32. In AMP training, the model parameters and optimizer states are FP32; during computation the parameters are cast to FP16, so the activations and gradients are FP16 as well; in the optimization phase the parameter gradients are cast back to FP32. Compared with FP32, AMP thus saves memory by storing the activations and their gradients in FP16. PLSC uses true FP16: the model parameters, activations, gradients, and optimizer states are all FP16, cutting memory by 50% compared with FP32; moreover, because the cast operations are eliminated, FP16 training is even faster than AMP.

[Image: https://static001.geekbang.org/infoq/b8/b8b4611abd19ed518f59d0ae08c6064a.webp]
▲ Figure 6: FP32, AMP, and FP16 computation processes
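To see what the 50% saving means for the running example, here is a small, purely illustrative calculation. The 20.988 GB FP32 figure comes from the SparseMomentum section above; the FP16 number simply halves it, since every tensor switches from 4-byte to 2-byte storage.

```python
# FC-layer memory in the running example (8 GPUs, 40M classes, PartialFC 0.1,
# sparse_momentum): 2 * Mem_w + 2 * Mem_logits per card.
mem_w_gb = 9.54        # FP32 parameter shard per card
mem_logits_gb = 0.954  # FP32 sampled logits per card

fp32_total = 2 * mem_w_gb + 2 * mem_logits_gb  # 20.988 GB with FP32
fp16_total = fp32_total / 2                    # ~10.49 GB with pure FP16
print(fp32_total, fp16_total)
```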
Experimental Results

The previous section introduced the solutions for training large-scale classification models; all of them are implemented and open-sourced in PLSC. In addition, PLSC has been contributed to the face recognition community InsightFace [1].

PLSC repository:
https://github.com/PaddlePaddle/PLSC

PLSC has the following highlights:

- High throughput, low memory footprint, and easy to use;
- Single-machine and multi-machine distributed training, with API-level support for model parallelism, PartialFC, and MarginLoss;
- FP16 training;
- Support for both static graph and dynamic graph modes.

Next we evaluate PLSC along three dimensions: training accuracy, memory consumption, and training speed.

● MS1MV3 accuracy

The table below compares accuracy on the MS1MV3 dataset.

Table 1: Accuracy on MS1MV3 of repositories implemented in different frameworks
[Image: https://static001.geekbang.org/infoq/43/4314a17ead15725ef801569fc373c430.webp]

Table 1 shows that although PLSC uses FP16, its accuracy on the main datasets still matches the other framework implementations.

● Maximum number of supported classes

Experiment configuration:

- GPUs: 8x NVIDIA Tesla V100 32 GB;
- BatchSize: 64/512 (per-card batch size 64, global batch size 512);
- SampleRatio: 0.1 (PartialFC sampling ratio 0.1).

Table 2: Maximum number of classes supported by implementations in different frameworks
[Image: https://static001.geekbang.org/infoq/50/504404adc607c9417b921c9d1a10e09f.webp]

The numbers show that PLSC has a clear advantage in memory optimization over the other implementations: static graph mode supports classification with up to 60 million classes, and dynamic graph mode with up to 67 million classes.

● Throughput comparison

Throughput is the number of samples trained per second. We test on the public MS1MV3 dataset. To obtain stable and fair results, each configuration is run 5 times, each run for 200 steps; we average the throughput over the 100 steps between step 50 and step 150, and finally report the median of the 5 runs' average throughput (a small sketch of this bookkeeping is given at the end of this section). The experiment configuration is:

- Tesla V100 (32 GB): Driver Version 450.80.02, CUDA Version 11.0;
- Tesla A100 (40 GB): Driver Version 460.32.03, CUDA Version 11.2;
- Datasets: MS1MV3 (93,431 classes);
- SampleRatio: 0.1 (PartialFC with sampling ratio 0.1);
- BatchSize: 128 (128 samples per card).

[Image: https://static001.geekbang.org/infoq/e2/e29486c035d2e11d270358d24195bef2.webp]
▲ Figure 7: Throughput comparison of repositories implemented in different frameworks

Figure 7 shows that in static graph mode PLSC outperforms all other framework implementations; in particular, with A100 GPUs, ResNet50, FP16, and 8 cards, PLSC reaches a throughput of 9,500 imgs/s.
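As a reference for how such numbers are typically aggregated, here is a minimal, self-contained sketch of the averaging described above. It only mirrors the stated methodology and is not taken from PLSC's benchmarking scripts.

```python
import statistics

def reported_throughput(runs):
    """runs: list of 5 runs, each a list of 200 per-step throughputs (imgs/s)."""
    # Average steps 50..150 of each run, then report the median over runs.
    per_run_avg = [sum(r[50:150]) / 100 for r in runs]
    return statistics.median(per_run_avg)
```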
Project addresses:

GitHub:
https://github.com/PaddlePaddle/PLSC

GitHub:
https://github.com/deepinsight/insightface

References:

[1] https://github.com/deepinsight/insightface.git
[2] Deng, J., Guo, J., Xue, N. and Zafeiriou, S., 2019. ArcFace: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4690-4699).
[3] Wang, H., Wang, Y., Zhou, Z., Ji, X., Gong, D., Zhou, J., Li, Z. and Liu, W., 2018. CosFace: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 5265-5274).
[4] Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B. and Song, L., 2017. SphereFace: Deep hypersphere embedding for face recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 212-220).
[5] An, X., Zhu, X., Gao, Y., Xiao, Y., Zhao, Y., Feng, Z., Wu, L., Qin, B., Zhang, M., Zhang, D. and Fu, Y., 2021. Partial FC: Training 10 million identities on a single machine. In Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1445-1449).
[6] Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G. and Wu, H., 2017. Mixed precision training. arXiv preprint arXiv:1710.03740.

More technical information: https://developer.baidu.com/?from=111201