Accelerating Spark 3.0 with NVIDIA GPUs and RAPIDS

> Today I'd like to share how to accelerate Apache Spark 3.0 with NVIDIA GPUs and RAPIDS. I will first introduce the RAPIDS Accelerator for Apache Spark and how it works, then cover our improvements to Shuffle, and finally walk through the new features in the 0.2 and 0.3 releases of the accelerator.

## The RAPIDS Accelerator for Apache Spark

![figure](https://static001.geekbang.org/infoq/69/6960e91244449f31f5ced1b82db2424d.png)

This picture will remind everyone of Hadoop's classic elephant logo. In the big-data era, new hardware and new software architectures keep evolving to cope with massive data: from Google's early papers such as MapReduce and GFS, to the open-source ecosystems that followed — the Hadoop family with its file system and compute frameworks such as HDFS, Hive and Spark. These are now used heavily not only at large internet companies but across industry at large. Traditional big-data frameworks are all built on CPUs, while GPUs, after many years of development, have already proven themselves in AI and deep learning. A natural question follows: can GPUs make big-data processing — in particular traditional ETL pipelines — run meaningfully faster? Intuitively the fit is good, because big-data workloads have two defining properties: very large data volumes, and very high parallelism in their processing tasks. Both properties are exactly what GPUs are built for.

![figure](https://static001.geekbang.org/infoq/87/8795b4eff65a9ccba49c6b2953833734.png)

Our answer is yes. The chart above shows results on a TPCx-BB-like dataset: the four queries shown took 25, 6, 7 and 3 minutes in their original CPU versions, while the GPU versions all finished in roughly one minute, the last in just 0.14 minutes. The dataset was 10 TB — a fairly representative size — and the hardware was a two-node DGX-2 cluster, where each DGX-2 is an AI server equipped with 16 NVIDIA V100 GPUs.

![figure](https://static001.geekbang.org/infoq/26/26381a96b733923a708fc56c66a045c4.png)

Can we also get good speedups in recommendation scenarios? Many internet companies run search, ads and recommendation workloads — short-video recommendation, e-commerce recommendation, text recommendation — and they face a common problem: user bases keep growing while content explodes with massive UGC. The growth on both axes puts enormous pressure on the entire ETL-plus-training pipeline. The slide shows the classic DLRM recommendation model on the Criteo dataset, in four configurations. In the original setup, before distributed data-processing frameworks existed, ETL on a single machine with one or a few cores took about 144 hours, and multi-core CPU training took 45 hours. The first improvement is to run ETL on a modern distributed framework such as Spark, which cuts ETL to 12 hours. The next is to switch training from multi-core CPU to GPU — here a single V100 — which cuts training from 45 hours to 0.7 hours. The final step, today's highlighted topic, is to move ETL to GPUs as well: with 8 V100s, ETL drops to 0.5 hours. Overall that is roughly a 160x speedup over the original setup; compared with an up-to-date CPU pipeline it is still a 48x speedup at only 4% of the cost; and compared with the currently mainstream combination of CPU ETL plus GPU training it is still a 10x speedup at 1/6 of the cost.

![figure](https://static001.geekbang.org/infoq/e6/e60084bf89e430e2168c775e71eaae12.png)

At GTC 2020 Jensen Huang delivered his now-classic line: "The more you buy, the more you save." There is truth in it: by exploiting new hardware capabilities and new processing paradigms, existing big-data pipelines can reach better price/performance at lower total cost — and the answer we give is that they do.

![figure](https://static001.geekbang.org/infoq/87/878fb41f6b40d83b06080b73810bc1a7.png)

How much code has to change to accelerate Spark with GPUs? This is what data engineers and analysts care about most, and our answer is: for Spark SQL and DataFrame code, none at all. Your business code stays the same; you only set `spark.rapids.sql.enabled` to `true` in the configuration (the first line in the example above) to activate spark-rapids, and all the business code that follows is unchanged. The adoption cost is very low.

![figure](https://static001.geekbang.org/infoq/5b/5bd95ab8cd9a999b06b150a695bcc07f.png)

Spark SQL and DataFrame expose a very large number of operators, and we adapt them one by one. As the chart shows, the set of supported operators is already quite large. For operators we do not yet support, we very much welcome feedback — please file feature requests on GitHub. We are eager to know which operators are used most frequently in industry but are not yet covered; that information is very important for improving the product.

![figure](https://static001.geekbang.org/infoq/c5/c5bb64d35d3991f3ff88a3b0229a31b5.png)

Does running Spark on GPUs solve every problem that exists on CPUs? Not necessarily; the left side of the slide gives an honest analysis of scenarios where GPUs may not help much. First, very small data: if the dataset is small and the partition count is set relatively high, each partition may hold only a few hundred megabytes, which is not a good fit for a GPU. Second, operations that demand high cache coherence; if such operations make up a large share of your queries, GPUs may not be ideal. Third, workloads dominated by data movement: queries with a great deal of shuffle may be bound on IO — network or disk — and current UDF implementations often fall back to the CPU, causing constant data transfer between CPU and GPU; with that much data movement, GPUs are not necessarily a win. Finally, limited GPU memory: capacity depends on the model, and even the latest A100 tops out at 80 GB, which for some workloads may still not be enough. In such extreme cases the job may not run at all, or the speedup may not be significant. The figure on the right clearly shows the throughput of each level of the hierarchy: reading from the left, frequent disk writes, a modest network setup, and heavy data movement are where jobs typically become bound.

![figure](https://static001.geekbang.org/infoq/db/db05712d8e1dbab7473cb0bc2d6663c3.png)

Still, many Spark workloads remain an excellent fit for GPUs. Some concrete examples:

### 1. High-cardinality data operations

Joins, aggregates and sorts — all shuffle-based operations — benefit when cardinality is high. High cardinality means that the number of distinct values in a column divided by the total number of values is large, or simply that the column has many distinct values. Such workloads are well suited to running Spark on GPUs.

### 2. Many window operations

Workloads with many window operations, especially with large window sizes, are also a good fit for GPUs.

### 3. Complex computation

Complex computation, such as a particularly involved UDF, is another good fit.

### 4. Data encoding

Finally, data encoding — serialization and deserialization on reads and writes, for example writing Parquet or reading CSV. We have IO-specific optimizations on the GPU, so the acceleration here is quite good.

## How the RAPIDS Accelerator works

![figure](https://static001.geekbang.org/infoq/1c/1cc3424919db8b11f482886fe2543b64.png)

### 1. The ETL stack

Let me briefly introduce how the RAPIDS Accelerator works. The overall ETL stack is shown below:

![figure](https://static001.geekbang.org/infoq/76/76e3f95e3056a21435e7b7fa3f0f7b30.png)

The left side is the Python-centric stack. Traditionally — in Kaggle competitions or everyday data analysis — people use Pandas to manipulate DataFrame data. We provide a corresponding GPU, Pandas-like implementation called cuDF, and on top of cuDF a distributed DataFrame, Dask cuDF. These base libraries are built on Python and Cython. The right side shows the corresponding optimizations on the Spark stack: acceleration for both Spark DataFrame and Spark SQL, with corresponding optimized implementations for Scala and PySpark. The top layer there is Java, which calls down into the cuDF C++ API, communicating through JNI. cuDF in turn relies on the Arrow columnar memory format; for this columnar storage we provide GPU support on top of the existing CPU implementation. At the very bottom sits CUDA, NVIDIA's GPU compute platform, together with the various low-level libraries built on top of it.

### 2. RAPIDS Accelerator for Apache Spark

![figure](https://static001.geekbang.org/infoq/53/533683861526eb8947ac02aeff3ff4e7.png)

What does the overall architecture of the RAPIDS Accelerator look like? Start at the top of the diagram: the Spark jobs written by algorithm engineers or data analysts. The middle layer is Spark core. On the left is what we currently accelerate: the Spark SQL and DataFrame APIs. As mentioned earlier, no business code needs to change. For every operation the code describes, the RAPIDS Accelerator automatically recognizes whether that operation and its data types can be accelerated on the GPU via RAPIDS; if so, it calls into RAPIDS, and if not, it executes the standard CPU operation. The entire scheduling decision is transparent to the user actually writing the Spark application. On the right is shuffle support. Shuffle is one of Spark's key performance bottlenecks, and for GPU plus RDMA/RoCE network architectures we implemented a new shuffle, built on UCX underneath, to achieve better acceleration.

### 3. Spark SQL & DataFrame compilation flow

![figure](https://static001.geekbang.org/infoq/7b/7be0f6b43c001a258a3359d2efb5fbba.png)

The compilation flow for Spark SQL and DataFrame is shown above. At the top, the DataFrame is unchanged at the logical-plan level. After Catalyst optimization produces a physical plan, the GPU version generates a GPU physical plan whose output data is an RDD of ColumnarBatch. If the data needs to move back to CPU processing, the RDD is converted back to an RDD of InternalRow.

![figure](https://static001.geekbang.org/infoq/5d/5dbaa65fa5a216b9ab630d6ba75e1caf.png)

The slide above gives a fairly detailed picture of how a concrete Spark SQL / DataFrame execution plan changes once the GPU plugin is adopted.

![figure](https://static001.geekbang.org/infoq/4e/4e55c637d3f3125e31f6dbfcc37fa80b.png)

If you are interested, you can test this yourself: each CPU operation that can be optimized maps one-to-one to a GPU version. To measure what speedup the GPU version of Spark achieves on your own workloads, Databricks provides a fairly standard Spark SQL data-generation tool, which is also what we relied on for our benchmarks. The main parameters for reference: a scale factor of 3 TB, with decimals and null types enabled; 400 input partitions and 200 shuffle partitions; all output written to S3.

### 4. Results

![figure](https://static001.geekbang.org/infoq/1b/1bc19507b35c42c0f6f999ed4febce0f.png)

So what is the overall speedup on the TPC-DS dataset? Take Query 38. Both the CPU and GPU hardware configurations are standard AWS instance types, with fully transparent pricing. In query time, the GPU version is about 3x faster than the CPU version. As the bottom row shows, the GPU setup uses a CPU driver with eight single-GPU workers, so its hourly cost is somewhat higher; but with the query running 3x faster, the net result is roughly a 55% cost saving.

![figure](https://static001.geekbang.org/infoq/f0/f0d2c4097a682d178aa13aea6ebf7aa4.png)

Query 5 is another classic query, and it is special: its dominant portion is not GPU-accelerated — only a small part is — because its Decimal type is not yet supported on the GPU. Even so, the GPU version achieved a good price/performance result: query time improved 1.8x and costs still dropped 23%. For anyone considering migrating a Spark 3.0 cluster from the current CPU architecture to GPUs, this is a useful reference point. The RAPIDS Accelerator is iterating rapidly, and in the current release, even when not every query can be GPU-accelerated, the price/performance is already quite good.

## Accelerating Shuffle

![figure](https://static001.geekbang.org/infoq/80/804538c158f63d8f8772c083789ddc53.png)

Let me focus on shuffle: what the RAPIDS Accelerator does there, and why shuffle is worth accelerating. If you know Spark well, this needs little elaboration. Certain operations — join, sort, aggregate — require exchanging data between nodes, or between executors, over the network. Between one stage and the next, the earlier stage prepares its data and writes it to disk; the following stage then pulls that data over the network, with more disk IO along the way, and reassembles it.

![figure](https://static001.geekbang.org/infoq/bc/bcac8111a099433ba1975242d530e7b0.png)

The traditional shuffle flow on today's CPU-centric hardware is shown above. Without any optimization, even data that already resides in GPU memory must cross PCIe before it can reach the network or local storage. There are clearly unnecessary steps and extra overhead here — crossing PCIe is not inherently required.

![figure](https://static001.geekbang.org/infoq/c9/c998335bb42dce2f3aee5901c560c2f9.png)

What does data movement look like after our shuffle optimization? The first diagram above covers the case where GPU memory is sufficient. Within the same GPU, the data does not need to move at all. Within the same node, if the node has NVLink, the data transfers directly over NVLink, without touching PCIe or the CPU. If the data lives in local NVMe storage, it can be read directly via GPUDirect Storage, and remote data can be fetched directly via RDMA — in both cases bypassing the CPU and PCIe.

![figure](https://static001.geekbang.org/infoq/35/35c34abfee9cc1fe464ec00d60cb1999.png)

What if the shuffle data exceeds GPU memory? Everyone is familiar with Spark's shuffle spill mechanism — can the RAPIDS shuffle still provide some optimization there? The answer is yes. When GPU memory overflows, part of the data is written to host memory; when host memory also fills up, the data is written to local storage, much like the previous CPU scheme. But even the spilled portion can still be accelerated via RDMA.

![figure](https://static001.geekbang.org/infoq/d5/d546ce7968f6f294aa683afd848efd4a.png)

The underlying component we rely on is the UCX library, which provides an abstract communication interface. Given the concrete network environment and resources — TCP or RDMA, shared memory or not, GPU present or not — UCX selects the optimal transport, transparently to the user, who never needs to care about the network details. The payoff is that when the fastest option, RDMA, is available, we achieve zero-copy transfer of GPU data. Note that RDMA does require specific hardware support.

![figure](https://static001.geekbang.org/infoq/67/67b8ab692685453b680aaaa92f6a8264.png)

What performance gain does adopting UCX over an RDMA-capable network deliver? Here is a concrete inventory-pricing query: the CPU execution time is 228 seconds, and the GPU alone already yields about a 5x improvement. With further network optimization, the time drops to 8.4 seconds — roughly a 30x improvement overall, which is very significant. You can also see from this that the computation is mostly bound on the network.

![figure](https://static001.geekbang.org/infoq/9c/9c28e48cf6b436f87d891f6ceda9be2b.png)

For the ETL pipeline of a traditional logistic regression model the gain is similarly clear: the original version takes 1,556 seconds, while the fully optimized version completes the entire ETL in 76 seconds.

## Highlights of Release 0.2

![figure](https://static001.geekbang.org/infoq/3b/3bc0be30c0fab2dd3900f8c0e82fe79b.png)

Let's walk through the two recent RAPIDS releases, 0.2 and 0.3, and the main new features each includes.

### 1. Multi-version Spark support

Starting from 0.2, in addition to the Apache community releases, Databricks 7.0 ML and Google Dataproc 2.0 are also supported.

![figure](https://static001.geekbang.org/infoq/34/3480a296f03fc9de1f6170969a824aed.png)

### 2. Optimized small-file reads (Parquet)

![figure](https://static001.geekbang.org/infoq/43/432b5b4d6bbd84790d66d0827121d91b.png)

The second feature is a performance optimization for reading smaller Parquet files. In short, CPU threads parallelize the Parquet reads so that CPU and GPU work overlap: while the GPU performs the actual computation, the CPU is simultaneously loading data. This can deliver up to a 6x speedup.

### 3. Initial Scala UDF support

![figure](https://static001.geekbang.org/infoq/81/81a4c20d0efef392d50a7d54cc41de22.png)

There is now initial support for Scala UDFs. The set of supported operators is still limited, but you can run concrete examples to check whether a given UDF can already be compiled to a GPU version; when it can, the performance gain should be quite good.

### 4. Accelerated Pandas UDFs

![figure](https://static001.geekbang.org/infoq/03/038ba6a22e268b7ef587563332b48ecd.png)

Pandas users frequently rely on Pandas UDFs, and the RAPIDS Accelerator implements acceleration for them as well. Concretely, the implementation lets the JVM process and the Python processes share a single GPU, which in turn makes it possible to optimize the data exchange between the JVM and Python processes.

![figure](https://static001.geekbang.org/infoq/98/98295e63d0d50e8570e22722d4108efa.png)

The implementation details are shown above. In the current implementation, a single GPU can run one JVM process alongside multiple Python processes; both the number of Python processes and the total GPU memory the Python processes may use are configurable.

![figure](https://static001.geekbang.org/infoq/ed/edeca2b3aa8c27c333084e6c3420c3a2.png)

Compared with the traditional CPU-side optimization, the GPU version is a more natural fit, because both sides use columnar storage and there is no row-to-column conversion overhead. The resulting speedup is quite noticeable.

![figure](https://static001.geekbang.org/infoq/a4/a43ba6509a404d65e41f604ff51f87ca.png)

## Release 0.3

![figure](https://static001.geekbang.org/infoq/0b/0b1ee19081bb359cfb06475709e0e9b1.png)

Release 0.3 implements a per-thread default stream, which further improves GPU utilization. AQE support is further optimized: on Spark 3.0, if a portion of the plan can run on the GPU it is automatically moved there, and AQE on the GPU is now generally usable overall. UCX is upgraded to the latest 1.9.0. Among new features, Parquet reading now supports lists and structs; the window operators gain lead and lag; and the ordinary operators gain greatest and least.

![figure](https://static001.geekbang.org/infoq/cf/cf622a5893bc3677ceab5fc846f3c11b.png)

To learn more about the RAPIDS Accelerator for Apache Spark, go directly to the NVIDIA website, where you can contact the NVIDIA Spark team; the whole project is open-sourced on GitHub. For the newest and most complete information about the accelerator, you can download the Spark e-book, currently available in a Chinese edition.

**Speaker:**

**Zhao Yuanqing (趙元青)**

NVIDIA | Deep Learning Architect

Zhao Yuanqing graduated from Beijing Information Science and Technology University in 2013 and previously worked at Alibaba Digital Media & Entertainment (Youku), Zhihu and OPPO. At Zhihu he was responsible for the personalized push algorithms and the home-feed ranking/re-ranking modules, and has deep experience in recommender systems. He is now responsible for deep learning solution architecture for recommender systems at NVIDIA.

Reprinted from: DataFunTalk (ID: dataFunTalk)

Original link: [基於NVIDIA GPU和RAPIDS加速Spark 3.0](https://mp.weixin.qq.com/s/0uVFZLg1CHmg0bWQ4bQQUg)
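The talk's "no code change" claim reduces to how the Spark session is launched: the plugin jars go on the classpath and a handful of configuration flags turn the accelerator on. Below is a minimal launch sketch. The exact jar file names and versions are illustrative assumptions for the 0.3-era releases (match them to the artifacts you actually download), and the GPU resource amounts are example values you would tune for your cluster:

```shell
# Launch a Spark 3.0 shell with the RAPIDS Accelerator enabled.
# Jar names/versions below are assumptions; substitute the files you downloaded.
spark-shell \
  --jars rapids-4-spark_2.12-0.3.0.jar,cudf-0.17-cuda10-1.jar \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.25
```

With this in place, existing Spark SQL and DataFrame business code runs unmodified; operators the plugin recognizes execute on the GPU, and everything else falls back to the standard CPU path, as described above.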