Elastic Deep Learning with Horovod on Ray

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#000000","name":"user"}},{"type":"strong"}],"text":"本文最初发布于优步工程网站,经授权由InfoQ中文站翻译并分享。"}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"引言"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"2017年我们推出了"},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/horovod\/","title":null,"type":null},"content":[{"type":"text","text":"Horovod"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":",这是一个开源框架,可将深度学习训练任务并行扩展到数百个GPU上。当时,优步的大多数深度学习用例都和自动驾驶汽车的研发有关,而在"},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/michelangelo-machine-learning-platform\/","title":null,"type":null},"content":[{"type":"text","text":"米开朗基罗"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"平台上,绝大多数生产级机器学习模型都是基于"},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/productionizing-distributed-xgboost\/","title":null,"type":null},"content":[{"type":"text","text":"XGBoost"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"的树模型。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"如今已经是2021年,深度学习领域出现了很多变化。对于表格数据问题,深度学习模型相对于基于树的模型的性能优势越来越大,并且越来越多的深度学习模型在从研究转移到生产级别。因此,我们不得不重新审视现有的深度学习平台,以适应不断增长的需求和新的要求:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"自动缩放和容错"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"超参数搜索"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"统一计算基础架构"}]}]}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"自动缩放和容错"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"我们之前的深度学习平台分配了一个固定大小的Horovod群集来训练单个模型。在本地运行时,用户经常会发现自己由于资源不足而无法开始大型训练作业。我们提供了在云数据中心进行训练的选项,但由于平台缺乏自动缩放能力,意味着只能在专用实例上运行作业,其成本通常是在可抢占或“竞价(spot)”实例上运行成本的3-5倍。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"就算资源容量不成问题,容错能力也常常存在瓶颈。随着作业训练的epoch增多,动用的机器越来越多,失败的机率也会增加。密集的检查点可以帮助缓解这种情况,但是需要大量额外的定制平台工具链来提供支持。"}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Elastic Horovod"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"在Horovod v0.20中我们引入了"},{"type":"link","attrs":{"href":"https:\/\/github.com\/horovod\/horovod\/releases\/tag\/v0.20.0","title":null,"type":null},"content":[{"type":"text","text":"Elastic Horovod"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"。它提供了分布式训练的能力,可以在整个训练过程中动态地扩展worker的数量。只需在现有的Horovod训练脚本中添加几行代码,当运行作业的机器加入或退出时作业都可以继续训练,几乎不会中断。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"从API的抽象级别来看,Elastic Horovod解决了自动缩放训练过程的问题,但并没有对其操作化。我们将Elastic Horovod编写为适用于任何特定编排系统或云提供商的通用平台,这意味着以下组件要留作待实现的接口:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"发现"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"可以添加到训练作业(或应从训练作业中删除)的主机"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"当额外worker可用且作业可以利用它们时"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"请求资源"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"我们一开始的假设是,我们要在每个主流云提供商(例如AWS、Azure、GCP)或编排系统(例如Kubernetes、Peloton)各自的组件中实现这些接口,以支持内部和开源用户。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"结果我们发现,已经有一个开源解决方案可以解决多云分布式计算的问题了:它就是"},{"type":"link","attrs":{"href":"https:\/\/ray.io\/","title":null,"type":null},"content":[{"type":"text","text":"Ray"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"。"}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"Elastic Horovod on Ray"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Ray是一个用于并行和分布式编程的分布式执行引擎。Ray由加州大学伯克利分校开发,最初的目的是使用一个简单的基于类\/函数的Python API来扩展机器学习负载和实验。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"自诞生以来,Ray生态系统已发展为可用于在云上训练ML模型的一系列功能和工具组合,包括用于分布式超参数调优的"},{"type":"link","attrs":{"href":"https:\/\/docs.ray.io\/en\/master\/tune\/index.html","title":null,"type":null},"content":[{"type":"text","text":"Ray Tune"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"、用于群集配置的"},{"type":"link","attrs":{"href":"https:\/\/docs.ray.io\/en\/master\/cluster\/index.html","title":null,"type":null},"content":[{"type":"text","text":"Ray Cluster 
Launcher"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":",和基于负载的自动缩放("},{"type":"link","attrs":{"href":"https:\/\/docs.ray.io\/en\/master\/cluster\/quickstart.html","title":null,"type":null},"content":[{"type":"text","text":"load-based autoscaling"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":")。现在,Ray还集成了多种机器学习库,如RLLib、XGBoost和PyTorch。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"借助新的"},{"type":"link","attrs":{"href":"https:\/\/horovod.readthedocs.io\/en\/stable\/ray_include.html#elastic-ray-executor","title":null,"type":null},"content":[{"type":"text","text":"ElasticRayExecutor"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" API,Horovod能够利用Ray来简化底层主机的发现和编排工作。要使用Ray搭配Horovod来实现弹性训练,你首先需要启动一个包含多个节点(Ray群集)的Ray运行时。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"Ray运行时\/群集"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":":Ray程序能够利用一个底层的Ray运行时完成并行化和分发工作。可以在一个或多个节点上启动这个"},{"type":"link","attrs":{"href":"https:\/\/docs.ray.io\/en\/master\/starting-ray.html","title":null,"type":null},"content":[{"type":"text","text":"Ray运行时"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":",从而形成一个Ray群集。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Ray打包了一个"},{"type":"link","attrs":{"href":"https:\/\/docs.ray.io\/en\/master\/cluster\/cloud.html#launching-cloud-clusters","title":null,"type":null},"content":[{"type":"text","text":"轻量级群集启动器"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":",该启动器可简化任何云(AWS、Azure、GCP甚至是Kubernetes和YARN等群集管理器)上的群集配置。这个群集启动器根据给定的群集配置来配置群集,如以下示例所示:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ab\/abde8643e8768bedff63d68b709301d8.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"要在AWS上启动一个Ray群集,你只需将以上配置另存为`cfg.yaml`,然后调用`rayupcfg.yaml`即可。在上面的示例中,请注意头节点是一个纯CPU节点,而worker是GPU可抢占实例。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"此外,可以将Ray群集配置为“自动缩放”,这意味着Ray可以在不同的负载要求下透明地配置新节点。例如,如果一个Ray程序在应用程序运行中需要一个GPU,则Ray可以配置一个便宜的GPU实例,在该实例上运行Ray任务或actor,然后在完成后终止该实例。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"你可以查看Ray文档中关于在AWS\/GCP上创建一个自动缩放Ray程序的"},{"type":"link","attrs":{"href":"https:\/\/docs.ray.io\/en\/master\/cluster\/quickstart.html","title":null,"type":null},"content":[{"type":"text","text":"示例"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"ElasticRayExecutor API"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":":ElasticRayExecutor可以利用自动缩放的Ray群集来简化主机的发现过程,并在适当情况下请求更多资源。要使用这个API,需要定义一个训练函数,其中包括一个使用"},{"type":"link","attrs":{"href":"https:\/\/horovod.readthedocs.io\/en\/stable\/elastic_include.html#modifying-the-training-script-with-state-synchronization","title":null,"type":null},"content":[{"type":"text","text":"Horovod 
Additionally, a Ray cluster can be configured to "autoscale," meaning Ray can transparently provision new nodes as load requirements change. For example, if a Ray program needs a GPU partway through an application run, Ray can provision a cheap GPU instance, run the Ray task or actor on it, and then terminate the instance when finished.

The Ray documentation includes [examples](https://docs.ray.io/en/master/cluster/quickstart.html) of creating an autoscaling Ray program on AWS/GCP.

**ElasticRayExecutor API**: The ElasticRayExecutor leverages an autoscaling Ray cluster to simplify the discovery of hosts and to request more resources when appropriate. To use this API, define a training function that wraps a higher-level function decorated for [Horovod Elastic state synchronization](https://horovod.readthedocs.io/en/stable/elastic_include.html#modifying-the-training-script-with-state-synchronization):

![Elastic training function with state synchronization](https://static001.geekbang.org/infoq/61/61704f4bdfd65b3a53491459c5318de7.png)

You can then attach to the underlying Ray cluster and execute the training function:

![Attaching to the Ray cluster and running the executor](https://static001.geekbang.org/infoq/3d/3d9602c8ec79bc3b238e277ca53226be.jpeg)
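Because both snippets above are screenshots, here is a runnable sketch of the same flow, loosely following the ElasticRayExecutor examples in the Horovod documentation; the toy model, learning rate, and epoch count are placeholders of ours:

```python
import ray
import torch
import horovod.torch as hvd
from horovod.ray import ElasticRayExecutor


def training_fn():
    hvd.init()
    torch.cuda.set_device(hvd.local_rank())

    # Toy model and optimizer; substitute your own.
    model = torch.nn.Linear(10, 1).cuda()
    opt = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
    opt = hvd.DistributedOptimizer(opt, named_parameters=model.named_parameters())

    @hvd.elastic.run
    def train(state):
        # Resumes from state.epoch after any resize event.
        for state.epoch in range(state.epoch, 10):
            x = torch.randn(32, 10).cuda()
            loss = model(x).pow(2).mean()
            opt.zero_grad()
            loss.backward()
            opt.step()
            state.commit()  # synchronize state across the current workers

    state = hvd.elastic.TorchState(model, opt, epoch=0)
    train(state)


ray.init(address="auto")  # attach to the running Ray cluster

settings = ElasticRayExecutor.create_settings(verbose=True)
executor = ElasticRayExecutor(settings, use_gpu=True, cpus_per_slot=1)
executor.start()
executor.run(training_fn)
```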
In this example, the ElasticRayExecutor creates a set of GPU workers, each participating in a data-parallel training routine. The number of workers is determined automatically by the number of GPU worker nodes in the cluster, which means training does not stop even when nodes are removed arbitrarily due to preemption.

As you can see, the integration of Ray and Horovod yields a solution that simplifies running elastic Horovod training jobs across the major cloud providers.

## Benchmarks and Experiments

When considering whether to adopt elastic training, one of the biggest unknowns for us was convergence. Specifically, we were unsure how often and by how much a job could be resized during training, and whether we would need to make any adjustments during training to offset the effects of adding and removing workers at runtime.

To measure the effect of dynamically resizing the number of workers on model convergence, we ran three jobs training a ResNet50 model on the Cifar10 dataset for a fixed 90 epochs, using 8 V100 GPUs (p3.2xlarge instances) on AWS. The three experimental configurations varied how frequently workers were added or removed, and how many were added or removed at a time:

1. 8 GPU workers fixed, with no autoscaling (**8gpu_fixed**).
2. 8 GPU workers initially, adding or removing 1 every 60 seconds (**8gpu_60_1**).
3. 8 GPU workers initially, adding or removing 3 every 180 seconds (**8gpu_180_3**).

For each resize event, we would always decrease the worker count if it was already at the maximum (8), and always increase it if it was at the minimum (2); otherwise, we would randomly increase or decrease the count with equal probability.
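As a concrete illustration of this resize policy, here is a small sketch (the function name and structure are ours, not from the original experiments):

```python
import random

MIN_WORKERS, MAX_WORKERS = 2, 8

def next_worker_count(current, step=1):
    """Bounded random walk implementing the resize policy described above."""
    if current >= MAX_WORKERS:
        return current - step  # at the maximum: always remove workers
    if current <= MIN_WORKERS:
        return current + step  # at the minimum: always add workers
    return current + random.choice([-step, step])  # otherwise: fair coin flip
```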
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"弹性训练的另一个好处是,当调整事件的时机可以通过训练过程控制时,它甚至可以用来减少训练过程中的总体差异。如Smith等人在《"},{"type":"link","attrs":{"href":"https:\/\/arxiv.org\/abs\/1711.00489","title":null,"type":null},"content":[{"type":"text","text":"不要降低学习速度,而是要增加batch大小"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"》中所述,我们可以利用以下事实:增加batch大小会导致模拟退火效果,从而使模型训练从较早的探索阶段过渡到最终的利用阶段。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"在实践中,与保持训练中worker数量不变的方法相比,通过增加worker数量来按比例扩大batch大小的这一过程让模型能够更快收敛,往往还有更高的泛化精度。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"为了证明这种效果,我们进行了另一组实验:重复上面的训练过程,但有两个新配置:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"固定2个GPU worker("},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"2gpu_fixed"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":")"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"动态worker数量,每30个epoch加倍,从2开始,以8个worker结束("},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"8gpu_exp"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":")。"}]}]}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/fd\/fde47d9ce32150d3b02c0be721e531ec.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"图2:验证精度对epoch数的函数(左)和相对挂钟时间(从实验开始算起,以秒为单位)的函数(右)。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"正如预期的那样,将worker数从8减少到2改善了整体模型的收敛。这引出了数据并行分布式训练中,当模型针对较小的batch大小做优化时存在的一个的常见陷阱。实践中,建议使用学习率预热\/衰减、超参数搜索(见下文)和"},{"type":"link","attrs":{"href":"https:\/\/horovod.readthedocs.io\/en\/latest\/adasum_user_guide_include.html","title":null,"type":null},"content":[{"type":"text","text":"Adasum"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"等技术来抵消这些影响。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"上面展示的第三个实验说明了通过大量并行来实现良好收敛的另一种解决方案:随着时间的推移逐步扩展并行度。这种方法不仅获得了比2 GPU基线更少的总挂钟时间,而且还以更高的验证精度完成了这项工作!"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"正如Smith等人在上述论文中所解释的那样,其背后的原理是,训练过程将从最初的“探索”阶段过渡到最后的“利用”阶段,进而受益。增加并行度会增加batch大小,这有助于平滑训练示例之间的差异,从而减少训练后期的探索。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"可以使用Horovod on Ray的Callback API将此功能添加到Elastic Horovod on Ray中:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/81\/8184cd7514fdae175e8680d778148616.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"这些回调还可以用来简化训练的其他一些部分,比如将消息从worker转发到驱动程序来简化日志记录和指标跟踪。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"与超参数搜索结合使用时,这种方法可以提供最多的改善。与模拟退火对比(用于缩放batch大小)相似,许多现代超参数搜索算法也遵循“探索\/利用”范式,该范式可与弹性训练结合使用,以实现最佳模型性能和最佳资源利用率。"}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"超参数搜索"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"在开发深度学习模型的过程中,用户在进行大规模训练时经常需要重新调整超参数,因为许多超参数在较大规模上会表现出不同的行为。在自动缩放群集中进行训练时,这一点更加重要,因为要考虑其他一些可能会影响训练吞吐量、收敛性和成本的超参数,其中包括:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"多长时间增加一次最大worker数\/有效batch大小"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"多久提交一次共享worker状态,以在最短的时间内获得最多的epoch"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"我们允许作业一次输入\/删除多少worker"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"使用Horovod大规模调整超参数通常会有些棘手。执行并行超参数搜索需要一个单独的更高级别的系统,以在可用资源上协调和调度多节点Horovod训练作业。米开朗基罗内部支持的"},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/scaling-michelangelo\/","title":null,"type":null},"content":[{"type":"text","text":"AutoTune"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"服务解决了编排长期运行的超参数搜索作业的一些基本挑战,但不支持基于群体的训练和提早停止策略,这些方法本应能让我们重新分配GPU资源以加速性能最佳的试验。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Ray是基于对嵌套并行性的支持而构建的——这意味着它能够轻松处理“启动天然的分布式任务”的分布式程序。利用这一优势,我们开发了一个Horovod+RayTune"},{"type":"link","attrs":{"href":"https:\/\/horovod.readthedocs.io\/en\/stable\/hyperparameter_search_include.html","title":null,"type":null},"content":[{"type":"text","text":"集成"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":",以实现分布式训练的并行超参数调整。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/2c\/2cb4ff84b0aca430740ee7013eec555c.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"使用Horovod+RayTune进行带有嵌套超参数并行调整的分布式训练的示意图。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"RayTune是打包进Ray的流行超参数调整库。RayTune包含一些最新的超参数搜索算法(例如基于群体的训练、贝叶斯优化和超频带),还支持故障处理,因此用户可以更好地利用模型性能与云成本之间的取舍来做超参数调整。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/ec\/ec43c48840f52cb7b649412e7755562a.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"在时间限制下使用动态资源分配来改善模型训练性能的图形示意。像HyperSched这样的算法通过动态分配更多并行资源来逐渐将探索减少到零,从而更深入地利用更少的试验。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"正如Liaw等人在《"},{"type":"link","attrs":{"href":"https:\/\/arxiv.org\/abs\/2001.02338","title":null,"type":null},"content":[{"type":"text","text":"HyperSched:在最后期限前完成模型开发的动态资源重新分配"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"》中所展示的那样,通过这种方式将超参数搜索与分布式训练相结合,可以让我们做到充分优化,在固定的时间内和计算资源内找到最佳模型。这对于像优步这样的组织来说特别有用,在该优步大多数训练都是在固定大小的本地GPU群集上进行的。"}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"下一步:统一用于机器学习和深度学习的计算基础架构"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"我们将Elastic Horovod with Ray和RayTune集成的早期结果表明,Ray这种将复杂的分布式计算系统脚本化为统一工作流程的方式,具有很好的灵活性和易用性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"除了我们先前讨论的挑战之外,机器学习平台通常还需要集成几种不同的技术,例如SparkML、XGBoost、SKLearn和RLlib等。这些技术运行在不同的计算平台上。例如在优步,ML负载可以在通用的容器化计算基础设施(Peloton、Kubernetes)、Spark(YARN、Peloton)和Ray(Peloton、Kubernetes)等平台上运行。结果,生产级ML管道经常被分解成许多不同的任务,并由诸如Airflow或Kubeflow Pipelines等管道编排器进行编排。这增加了平台的软件和运维复杂性。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"过去,我们投资创建了很多定制系统来配置和运行深度学习工作流,但与Horovod on Ray相比它们存在许多缺点:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"numberedlist","attrs":{"start":null,"normalizeStart":1},"content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":1,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"可扩展性"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":":因为内部系统是为特定应用而非通用计算而设计的,所以要增加对Elastic Horovod之类框架的支持,需要对代码库进行重新配置,并自行实现Ray中提供的那种自动缩放器。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":2,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"灵活性"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":":为运行分布式深度学习而构建的系统无法轻松适应其他负载,例如数据处理(Dask、Modin)、超参数搜索(RayTune)或强化学习(RLlib)。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":3,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}},{"type":"strong"}],"text":"维护"},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":":与Ray这样的开源系统不同,我们的内部深度学习基础架构必须由专门的工程师团队维护,而他们的时间和精力可以用来解决更高级别的问题。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"通过在Ray上整合更多的深度学习栈,我们可以进一步优化深度学习工作流程中的更多端到端过程。例如,当前我们在特征工程(ApacheSpark)和分布式训练(Horovod)之间存在清晰的界限。对于工作流的每个阶段,我们必须提供需要独立运行的单独计算基础架构,并将Spark流程的输出要素化到磁盘上,以便Horovod流程使用。当我们希望为工作流的不同阶段寻找替代框架时发现,这种架构不仅很难维护,而且替换起来同样困难。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"能够切换不同的分布式框架是Ray的核心优势之一。由于Ray是通用的分布式计算平台,因此Ray的用户可以"},{"type":"link","attrs":{"href":"https:\/\/www.anyscale.com\/blog\/data-processing-support-in-ray","title":null,"type":null},"content":[{"type":"text","text":"自由选择"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"越来越多的分布式数据处理框架(包括Spark),这些框架在Ray为深度学习工作流提供的相同资源上运行。这简化了我们的计算基础架构,因为Spark和Horovod负载之间不再存在上述的严格界限。实际上,根据数据量、计算强度和可用群集资源,针对不同的负载有不同的最优框架。此外,某些框架比其他框架更容易集成到现有项目中。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"我们利用这些能力增强的一个项目案例是"},{"type":"link","attrs":{"href":"https:\/\/eng.uber.com\/introducing-ludwig\/","title":null,"type":null},"content":[{"type":"text","text":"Ludwig"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":",这是优步开发的开源深度学习AutoML框架。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"过去,由于Ludwig依赖于Pandas框架进行数据处理,因此仅限于处理适合单台计算机内存的数据集。现在,在即将到来的Ludwig 0.4版本中,我们将在"},{"type":"link","attrs":{"href":"https:\/\/docs.ray.io\/en\/master\/dask-on-ray.html","title":null,"type":null},"content":[{"type":"text","text":"Dask on Ray"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"上进行分布式大内存数据预处理,在Horovod on Ray上进行分布式训练,并用RayTune进行超参数优化。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/01\/011a832c11b3ac4680368f7e119c325f.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Ludwig在本地模式下运行(v0.4之前的版本):所有数据都需要容纳在一台计算机上的内存中。"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/a4\/a43f56ca9763f295c98aa7aebaca123d.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Ludwig在Ray群集上运行(版本v0.4):Ray扩展了预处理和分布式训练任务,可以处理大型数据集,而无需在Ludwig中编写任何基础架构代码。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" 
"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"利用Dask可以扩展Ludwig现有的Pandas预处理,从而以最小的代码更改量实现大型数据集的处理;利用Ray,我们可以将预处理、分布式训练和超参数搜索结合在一起,放进单个作业中运行的单个训练脚本。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"对于Ludwig用户来说,这些功能不需要代码更改或其他基础设施配置更改,只需在命令行中添加“raySubmit”和“-backendray”即可:"}]},{"type":"image","attrs":{"src":"https:\/\/static001.geekbang.org\/infoq\/76\/7696a9bb127446c557e623347387e374.jpeg","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"我们相信,在优步和整个行业内,Ray在为生产级机器学习生态系统带来行业亟需的通用基础架构和标准化方面将继续发挥越来越重要的作用。在接下来的几个月中,我们将分享更多关于将Ray的功能引入优步深度学习平台的相关工作。请查看"},{"type":"link","attrs":{"href":"https:\/\/horovod.readthedocs.io\/en\/stable\/ray_include.html#elastic-ray-executor","title":null,"type":null},"content":[{"type":"text","text":"Elastic Horovod on Ray"}],"marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}]},{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":",并随时提出任何问题、意见或建议。"}]},{"type":"heading","attrs":{"align":null,"level":1},"content":[{"type":"text","text":"致谢"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"我们想感谢以下人员的工作:"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":" "}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Richard Liaw和Anyscale团队将Horovod与Ray集成在一起的努力,包括Horovod in RayTune,以及他们对在优步集成Ray的持续支持。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Fardin Abdi和Qiyan Zhang将Elastic Horovod与Peloton集成所做的工作。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"来自G-Research的Enrico Minack,他在Elastic Horovod上的工作及其与Horovod on Spark的集成工作。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"color","attrs":{"color":"#494949","name":"user"}}],"text":"Yi Wang,他在ElasticDL集成和Elastic 
We believe that, both at Uber and across the industry, Ray will continue to play a growing role in bringing much-needed common infrastructure and standardization to the production machine learning ecosystem. In the coming months, we will share more about our work bringing Ray's capabilities to Uber's deep learning platform. Check out [Elastic Horovod on Ray](https://horovod.readthedocs.io/en/stable/ray_include.html#elastic-ray-executor), and feel free to reach out with any questions, comments, or suggestions.

## Acknowledgements

We would like to thank the following people for their work:

- Richard Liaw and the Anyscale team for their efforts integrating Horovod with Ray, including Horovod in Ray Tune, and for their continued support of Ray adoption at Uber.
- Fardin Abdi and Qiyan Zhang for their work integrating Elastic Horovod with Peloton.
- Enrico Minack from G-Research for his work on Elastic Horovod and its integration with Horovod on Spark.
- Yi Wang for his work on the ElasticDL integration and the Elastic Horovod benchmarks.
- The authors of [https://github.com/kuangliu/pytorch-cifar](https://github.com/kuangliu/pytorch-cifar) for providing the PyTorch model definitions used to train the models.

We would also like to thank all the Horovod contributors, without whom this work would not have been possible; the Linux Foundation for its continued support of the Horovod project; and AWS for supporting our continuous integration systems.

## About the Authors

Travis Addair is a software engineer at Uber AI, leading the deep learning training team for the Michelangelo AI platform. He leads the Horovod open source project and chairs its Technical Steering Committee within the Linux Foundation.

Xu Ning is an engineering manager in Uber's Seattle engineering office, currently leading several development teams for Uber's Michelangelo machine learning platform. He previously led Uber's Cherami distributed task queue, Hadoop observability, and data security teams.

Richard Liaw is a software engineer at Anyscale, where he currently leads development of distributed machine learning libraries on Ray. He is one of the maintainers of the open source Ray project and is pursuing a PhD at UC Berkeley.

**Original article:** [https://eng.uber.com/horovod-ray/](https://eng.uber.com/horovod-ray/)