Improving efficiency and reducing runtime using S3 read optimization

# Overview

This article introduces a new approach to increasing S3 read throughput that we have used to improve the efficiency of our production jobs. The results were very encouraging: a standalone benchmark showed a 12x improvement in S3 read throughput (from 21 MB/s to 269 MB/s). The higher throughput shortens the runtime of production jobs; as a result, our vcore-hours dropped by 22%, memory-hours by 23%, and the runtime of typical production jobs fell by a similar amount.

Although we are pleased with these results, we will continue to explore further improvements; a brief note on that appears at the end of this article.

# Motivation

We process petabytes of data stored on Amazon S3 every day. A look at the metrics of our MapReduce/Cascading/Scalding jobs reveals an obvious problem: mapper speed is far below expectations. In most cases we observe mapper speeds of roughly 5-7 MB/s. That is orders of magnitude slower than commands such as `aws s3 cp`, which commonly reach 200+ MB/s (observed on an EC2 c5.4xlarge instance). If we could make jobs read data faster, they would finish sooner, saving us considerable processing time and money. Given how expensive our processing is, those savings quickly add up to a substantial amount.

# S3 read optimization

## Problem: the S3A throughput bottleneck

Looking at the implementation of S3AInputStream, it is easy to spot the following areas for improvement:

1. **Single-threaded reads**: data is read synchronously on a single thread, so jobs spend a large share of their time reading data over the network.
2. **Multiple unnecessary reopens**: the S3 input stream is not seekable. A "split" has to be reopened every time a seek is performed or a read error occurs; the larger the split, the more likely this becomes. Each reopen further reduces overall throughput.

## Solution: increasing read throughput

**Architecture**

![](https://static001.infoq.cn/resource/image/3d/2d/3d2165f34e6d524af34134cba3d0532d.png)

*Figure 1: Prefetch + cache components of the S3 reader*

To address the issues above, we took the following approach:

1. We view a split as a sequence of fixed-size blocks. The default size is 8 MB, but it is configurable.
2. Each block is read asynchronously into memory before the caller can access it. The size of the prefetch cache (in number of blocks) is configurable.
3. Callers can only read blocks that have already been prefetched into memory. This insulates clients from network flakiness and gives us an extra layer of retries for added overall resilience.
4. Whenever a seek outside the current block occurs, we cache the prefetched blocks on the local file system.

We further enhanced this implementation so that producer-consumer interaction is almost lock-free. According to a standalone benchmark (details in Figure 2), this enhancement raised read throughput from 20 MB/s to 269 MB/s.

**Sequential reads**

Any consumer that processes data sequentially (such as a mapper) benefits greatly from this approach. While the mapper processes the data it has already retrieved, the data next in sequence is being prefetched asynchronously. In most cases the prefetch has completed by the time the mapper is ready for the next block, so the mapper spends more time doing useful work and less time waiting, and CPU utilization rises.

**More efficient Parquet reads**

Parquet files require non-sequential reads, which is dictated by their on-disk format. Our initial implementation did not use a local cache, so whenever a seek outside the current block occurred, we had to discard the prefetched data. For Parquet files this performed even worse than the stock reader.

After introducing a local cache for prefetched data, we saw a marked improvement in Parquet read throughput. Our implementation currently reads Parquet files 5x faster than the stock reader.

## Improving production jobs

The increased read throughput benefits production jobs in several ways.

**Reduced job runtime**

Overall job runtime drops because mappers spend less time waiting for data and therefore finish faster.

**Fewer mappers**

If mapper time shrinks substantially, we can reduce the number of mappers by increasing the split size. Fewer mappers means less CPU wasted on fixed per-mapper overhead; more importantly, this can be done without increasing job runtime.

**Higher CPU utilization**

Because mappers take less time to do the same amount of work, overall CPU utilization improves.

# Results

Our implementation (S3E) currently lives in a separate repository, which lets us iterate on improvements quickly. We will eventually merge it into S3A and contribute it back to the community.

## Standalone benchmark

![](https://static001.infoq.cn/resource/image/5d/5d/5d0478a6fa285586a9e1a55d7f2fa05d.png)

*Figure 2: Throughput comparison of S3A and S3E*

In each case we read a 3.5 GB S3 file sequentially and wrote it to a temporary local file; the write half simulates the IO overlap that occurs during the mapper phase. The benchmark ran on an EC2 c5.9xlarge instance. We measured the total time taken to read the file and computed the effective throughput of each method.

## Production runs

We tested the S3E implementation on many large production jobs, each of which typically uses tens of thousands of vcores per run. Figure 3 compares the metrics obtained with and without S3E enabled.

**Measuring resource savings**

We measured the resource savings from this optimization using the following method.

![](https://static001.infoq.cn/resource/image/39/8a/39b7a25bc76bb774ec5e7d3c069daf8a.png)

**Observed results**

![](https://static001.infoq.cn/resource/image/5c/10/5cf63283e7yya8cdbf91e7d1351c1310.png)

*Figure 3: Resource consumption comparison for MapReduce jobs*

Although different production workloads have different characteristics, we saw vcore reductions of 6% to 45% across the majority of our 30 most expensive jobs.

One appealing aspect of our approach is that it can be enabled for a job without any change to the job's code.

# Looking ahead

For now, this enhanced implementation lives in a separate Git repository. In the future we may upgrade the existing S3A implementation and contribute it back to the community.

We are rolling this optimization out across several of our clusters and will publish the results in a future post.

Because the core S3E input-stream implementation does not depend on any Hadoop code, it can be used in any system that reads large amounts of S3 data. We currently apply the optimization to MapReduce, Cascading, and Scalding jobs, and preliminary evaluation on Spark and Spark SQL has also produced very encouraging results.

The current implementation could be tuned further for efficiency. It is also worth exploring whether data from past runs could be used to optimize the block size and prefetch cache size for each job.

Read the original English article: [Improving efficiency and reducing runtime using S3 read optimization](https://medium.com/pinterest-engineering/improving-efficiency-and-reducing-runtime-using-s3-read-optimization-b31da4b60fa0)
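The prefetch-and-cache scheme described in this article (a split viewed as fixed-size blocks, a producer filling a bounded in-memory cache asynchronously, and a consumer that only ever reads completed blocks) can be illustrated with a minimal sketch. The real implementation is Java inside the Hadoop S3A stack; this Python version is purely illustrative, all names are hypothetical, `fetch_range` stands in for a ranged S3 GET, and the local-disk spill for out-of-block seeks (used for Parquet) is omitted for brevity:

```python
# Minimal sketch of block prefetching, assuming a fetch_range(offset, length)
# callable that stands in for a ranged S3 GET. Illustrative only.
from concurrent.futures import ThreadPoolExecutor


class PrefetchingReader:
    def __init__(self, fetch_range, total_size,
                 block_size=8 * 1024 * 1024, prefetch_blocks=4):
        self.fetch_range = fetch_range
        self.total_size = total_size
        self.block_size = block_size          # default 8 MB, configurable
        self.prefetch_blocks = prefetch_blocks  # cache size, configurable
        self.num_blocks = -(-total_size // block_size)  # ceiling division
        self.pool = ThreadPoolExecutor(max_workers=prefetch_blocks)
        self.futures = {}  # block index -> Future: the in-memory block cache
        self.pos = 0

    def _fetch_block(self, idx):
        offset = idx * self.block_size
        length = min(self.block_size, self.total_size - offset)
        return self.fetch_range(offset, length)

    def _ensure_prefetched(self, first_idx):
        # Producer side: schedule async reads up to prefetch_blocks ahead.
        last = min(first_idx + self.prefetch_blocks, self.num_blocks)
        for idx in range(first_idx, last):
            if idx not in self.futures:
                self.futures[idx] = self.pool.submit(self._fetch_block, idx)

    def read(self, n):
        # Consumer side: serve bytes only from blocks whose prefetch is done
        # (result() waits only when a block has not finished downloading).
        out = bytearray()
        while n > 0 and self.pos < self.total_size:
            idx = self.pos // self.block_size
            self._ensure_prefetched(idx)
            block = self.futures[idx].result()
            start = self.pos - idx * self.block_size
            chunk = block[start:start + n]
            out += chunk
            self.pos += len(chunk)
            n -= len(chunk)
            if self.pos % self.block_size == 0:  # block fully consumed
                del self.futures[idx]            # evict from the cache
        return bytes(out)
```

For a sequential consumer such as a mapper, the call to `result()` almost always returns immediately because the next blocks were scheduled while the previous ones were being processed, which is exactly where the throughput gain comes from.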