Optimizing and Tuning Spark Applications (Part 7)

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"寫在前面:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大家好,我是強哥,一個熱愛分享的技術狂。目前已有 12 年大數據與AI相關項目經驗, 10 年推薦系統研究及實踐經驗。平時喜歡讀書、暴走和寫作。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"業餘時間專注於輸出大數據、AI等相關文章,目前已經輸出了40萬字的推薦系統系列精品文章,今年 6 月底會出版「構建企業級推薦系統:算法、工程實現與案例分析」一書。如果這些文章能夠幫助你快速入門,實現職場升職加薪,我將不勝歡喜。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"想要獲得更多免費學習資料或內推信息,一定要看到文章最後喔。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"內推信息","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你正在看相關的招聘信息,請加我微信:liuq4360,我這裏有很多內推資源等着你,歡迎投遞簡歷。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"免費學習資料","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你想獲得更多免費的學習資料,請關注同名公衆號【數據與智能】,輸入“資料”即可!","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"學習交流羣","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"如果你想找到組織,和大家一起學習成長,交流經驗,也可以加入我們的學習成長羣。羣裏有老司機帶你飛,另有小哥哥、小姐姐等你來勾搭!加小姐姐微信:epsila,她會帶你入羣。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在上一章中,我們詳細介紹瞭如何在Java和Scala中使用數據集。我們探討了Spark如何管理內存以將數據集結構作爲其統一的高級API的一部分,並考慮了與使用數據集相關的成本以及如何降低這些成本。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了降低成本外,我們還想考慮如何優化和調整Spark。在本章中,我們將討論一組啓用優化的Spark配置,查看Spark的一系列join策略,並檢查Spark用戶界面,尋找有關不良行爲的線索。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"優化和調整Spark效率","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"雖然Spark有許多可供調優的配置,但這本書將只涵蓋少數最重要和通常被調優的配置。要獲得按功能主題分組的完整列表,可以閱讀官網文檔。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"查看和設置Apache Spark配置","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"你可以通過三種方式獲取和設置Spark屬性。首先是通過一組配置文件。在部署$SPARK_HOME目錄(安裝Spark的位置)中,有許多配置文件:conf / spark-defaults.conf.template,conf / log4j.properties.template和conf / 
spark-env.sh.template。更改這些文件中的默認值,保存爲不帶.template後綴的配置文件。這樣一來spark會自動加載修改後的配置文件,使得修改後的值生效。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"conf / spark-defaults.conf文件中修改後的配置適用於Spark集羣以及所有提交給集羣的Spark應用程序。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第二種方法是使用spark-submit命令,在提交應用程序的時候使用--conf 標識符直接在Spark應用程序中或在命令中指定Spark配置,如下所示:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark-submit --conf spark.sql.shuffle.partitions = 5","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"--conf “ spark.executor.memory = 2g”","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"--class main.scala.chapter7.SparkConfig_7_1 jars / main-","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"scala-chapter7_2.12-1.0.jar","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"下面是在Spark應用程序本身中調整配置的方法:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"import org.apache.spark.sql.SparkSession","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"def printConfigs(session: SparkSession) = {","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"   // Get conf","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"   val mconf = session.conf.getAll","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"   // Print them","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"   for (k ${mconf(k)}\\n\") }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"def main(args: Array[String]) {","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" // Create a session","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" val spark = SparkSession.builder","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"   
.config(\"spark.sql.shuffle.partitions\", 5)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"   .config(\"spark.executor.memory\", \"2g\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"   .master(\"local[*]\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"   .appName(\"SparkConfig\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"   .getOrCreate()","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" printConfigs(spark)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" spark.conf.set(\"spark.sql.shuffle.partitions\",","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"   spark.sparkContext.defaultParallelism)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" println(\" ****** Setting Shuffle Partitions to Default Parallelism\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" printConfigs(spark)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"}","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.driver.host -> 10.8.154.34","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.driver.port -> 55243","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.app.name -> SparkConfig","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.executor.id -> driver","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.master -> local[*]","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.executor.memory -> 2g","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.app.id -> local-1580162894307","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.sql.shuffle.partitions -> 5","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第三個選項是通過Spark shell的編程接口實現的。與Spark中的所有其他內容一樣,API是交互的主要方法。通過SparkSession對象,你可以訪問大多數Spark配置。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在Spark 
REPL中,例如,這個Scala代碼顯示在本地主機上Spark以本地模式啓動的Spark配置(詳情上可用的不同的模式,請參閱第一章中的“部署模式”):","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// mconf is a Map[String, String]","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"scala> val mconf = spark.conf.getAll","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"...","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"scala> for (k ${mconf(k)}\\n\") }","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.driver.host -> 10.13.200.101","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.driver.port -> 65204","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.repl.class.uri -> spark://10.13.200.101:65204/classes","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.jars ->","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.repl.class.outputDir -> /private/var/folders/jz/qg062ynx5v39wwmfxmph5nn...","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.app.name -> Spark shell","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.submit.pyFiles ->","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.ui.showConsoleProgress -> true","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.executor.id -> driver","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.submit.deployMode -> client","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.master -> local[*]","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.home -> /Users/julesdamji/spark/spark-3.0.0-preview2-bin-hadoop2.7","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.sql.catalogImplementation -> hive","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.app.id -> local-1580144503745","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"你還可以僅查看特定於Spark SQL的Spark配置:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In 
Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.sql(\"SET -v\").select(\"key\", \"value\").show(5, false)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# In Python","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.sql(\"SET -v\").select(\"key\", \"value\").show(n=5, truncate=False)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+------------------------------------------------------------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|key |value |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+------------------------------------------------------------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|spark.sql.adaptive.enabled |false |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin |0.2 |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled|true |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|spark.sql.adaptive.shuffle.localShuffleReader.enabled |true |","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|spark.sql.adaptive.shuffle.maxNumPostShufflePartitions ||","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+------------------------------------------------------------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"only showing top 5 rows","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"另外,你可以通過Spark UI的“Environment”選項卡訪問Spark的當前配置,只不過是作爲只讀值,我們將在本章稍後討論該選項卡,如圖7-1所示。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/92/92f6e28b9e76a9cef4b05d861e4da663.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"要以編程方式設置或修改現有配置,請首先檢查該屬性是否可修改。spark.conf.isModifiable(“”)將返回true或false。所有可修改的配置都可以使用API設置爲新的值。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"scala> spark.conf.get(\"spark.sql.shuffle.partitions\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"res26: String = 
200","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"scala> spark.conf.set(\"spark.sql.shuffle.partitions\", 5)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"scala> spark.conf.get(\"spark.sql.shuffle.partitions\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"res28: String = 5","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"# In Python","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":">>> spark.conf.get(\"spark.sql.shuffle.partitions\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"'200'","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":">>> spark.conf.set(\"spark.sql.shuffle.partitions\", 5)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":">>> spark.conf.get(\"spark.sql.shuffle.partitions\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"'5'","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在設置Spark屬性的所有方式中,優先級順序決定採用哪些值。優先級最低的是spark-defaults.conf中定義的配置項的值或標誌,其次讀取spark-submit命令行中傳遞的配置項的值或標誌,最後讀取SparkSession在Spark應用程序中通過SparkConf設置的值或標誌。總結下來優先級的高低爲:配置文件 < spark-submi命令 < 程序配置。最終所有這些屬性都會被合併,並且優先在Spark應用程序中重置的所有重複屬性。同樣,命令行上提供的配置項的值將替換配置文件中對應配置項的設置,前提是這些值不會被應用程序中的相同配置覆蓋。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"調整或設置正確的配置有助於提高性能,正如你將在下一節中看到的那樣。這裏的建議來自社區中從業人員的經驗,着重於如何最大程度地利用Spark的集羣資源以適應大規模工作負載。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"擴展Spark以應對高負載","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大型Spark工作負載通常是批處理工作,有些工作是每晚執行的,有些則是每天定期執行的。無論哪種情況,這些作業都可能處理數十個TB字節甚至更多的數據。爲了避免由於資源匱乏或性能逐漸下降而導致作業失敗,可以啓用或更改一些Spark配置。這些配置影響三個Spark組件:Spark 驅動程序,Executor和在Executor上運行的shuffle服務。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark驅動程序的職責是與集羣管理器協調,從而在集羣中啓動Executor,並在其上調度Spark任務。對於大型工作負載,你可能有數百個任務。本節說明了一些可以調整或啓用的配置,以優化資源利用率,並行化任務從而避免大量任務的瓶頸。一些優化想法和見解來自諸如Facebook之類的大數據公司,這些公司以TB的數據規模使用Spark,並在Spark + AI 
Among all the ways to set Spark properties, an order of precedence determines which values take effect. Lowest precedence goes to the values or flags defined in spark-defaults.conf; next come those passed on the spark-submit command line; and highest are those set through SparkSession (SparkConf) in the Spark application itself. In short: configuration file < spark-submit command < application settings. All of these properties are merged, with any duplicate properties reset in the Spark application taking precedence. Likewise, values supplied on the command line replace the corresponding settings in the configuration files, provided they are not overridden by the same configuration set in the application.

Tuning or setting the right configurations helps performance, as you will see in the next section. The recommendations here come from practitioners' experience in the community and focus on how to make the most of Spark's cluster resources for large-scale workloads.

**Scaling Spark for Large Workloads**

Large Spark workloads are often batch jobs, some running nightly and some scheduled periodically during the day. In either case, these jobs may process tens of terabytes of data or more. To avoid job failures from resource starvation or gradual performance degradation, there are a number of Spark configurations you can enable or change. These configurations affect three Spark components: the Spark driver, the executors, and the shuffle service running on the executors.

The Spark driver's job is to coordinate with the cluster manager to launch executors in the cluster and schedule Spark tasks on them. With large workloads you may have hundreds of tasks. This section describes a few configurations you can adjust or enable to optimize resource utilization and parallelize tasks, so that a large number of tasks does not become a bottleneck. Some of these optimization ideas and insights come from big data companies such as Facebook, which use Spark at terabyte scale and have shared them with the Spark community at the Spark + AI Summit.

**Static versus dynamic resource allocation**

When you specify compute resources as spark-submit command-line arguments, as we did earlier, you are effectively hard-coding them. This means that if more resources are needed later, because the workload is larger than expected and tasks are queueing up in the driver, Spark cannot accommodate or allocate the extra resources.

If instead you use Spark's dynamic resource allocation configuration, the Spark driver can request more or fewer compute resources as the demand of a large workload rises and falls. In scenarios where your workloads are dynamic (that is, their demand for compute capacity varies), dynamic allocation helps absorb sudden spikes.

One useful use case is streaming, where the volume of data may be uneven. Another is on-demand data analytics, where you may have a flood of SQL queries during peak hours. Enabling dynamic resource allocation allows Spark to make better use of resources, releasing executors when they are not in use and acquiring new ones when needed.

As well as when handling large or variable workloads, dynamic allocation is also useful in multi-tenant environments, where Spark is deployed alongside other applications or services on YARN, Mesos, or Kubernetes. Be aware, though, that Spark's shifting resource demands can in turn affect other applications that need resources at the same time.

To enable and configure dynamic allocation, you can use settings like the following. Note that the numbers here are arbitrary; the appropriate settings depend on the nature of your workload and should be adjusted accordingly. Some of these configurations cannot be set from the Spark REPL, so you will have to set them programmatically:

```
spark.dynamicAllocation.enabled true
spark.dynamicAllocation.minExecutors 2
spark.dynamicAllocation.schedulerBacklogTimeout 1m
spark.dynamicAllocation.maxExecutors 20
spark.dynamicAllocation.executorIdleTimeout 2min
```

By default spark.dynamicAllocation.enabled is set to false. With the settings shown above enabled, the Spark driver asks the cluster manager to create at least two executors to start with (spark.dynamicAllocation.minExecutors). As the task queue backlog grows, new executors are requested every time the backlog timeout (spark.dynamicAllocation.schedulerBacklogTimeout) is exceeded. In this case, whenever there are pending tasks that have not been scheduled for more than 1 minute, the driver requests new executors to schedule the backlogged tasks, up to a maximum of 20 (spark.dynamicAllocation.maxExecutors). Conversely, if an executor finishes a task and sits idle for 2 minutes (spark.dynamicAllocation.executorIdleTimeout), the Spark driver terminates that executor.
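As a sketch of how you might set these programmatically (the values are illustrative, and this assumes a cluster manager where executors can be released safely, which normally also requires the external shuffle service, or shuffle tracking in Spark 3.x, to be enabled):

```scala
// A sketch only: dynamic allocation must be configured before the
// SparkSession is created; it cannot be switched on afterwards from the REPL.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("DynamicAllocationExample")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "20")
  .config("spark.dynamicAllocation.schedulerBacklogTimeout", "1m")
  .config("spark.dynamicAllocation.executorIdleTimeout", "2min")
  .config("spark.shuffle.service.enabled", "true") // assumption: an external shuffle service is available
  .getOrCreate()
```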
**Configuring Spark executors' memory and the shuffle service**

Simply enabling dynamic resource allocation is not enough. You also have to understand how executor memory is laid out and used by Spark, so that executors are not starved of memory or troubled by JVM garbage collection.

The amount of memory available to each executor is controlled by spark.executor.memory. As Figure 7-2 shows, it is divided into three sections: execution memory, storage memory, and reserved memory. After 300 MB is set aside as reserved memory, to safeguard against OOM errors, the default split is 60% for execution memory and 40% for storage memory. The Spark documentation advises that this works for most cases, but you can adjust the split to the ratio you prefer via spark.memory.fraction. When storage memory is not being used, Spark can acquire it for execution purposes, and vice versa.

![Figure 7-2](https://static001.geekbang.org/infoq/85/85ca1f2515299cbee5e34e97fe976919.png)

Execution memory is used for Spark shuffles, joins, sorts, and aggregations. Since different queries may require different amounts of memory, the fraction of the available memory to dedicate to it (spark.memory.fraction defaults to 0.6) can be tricky to tune to a suitable value, but it is easy to adjust. By contrast, storage memory is primarily used for caching user data structures and partitions derived from DataFrames.

During map and shuffle operations, Spark writes to and reads from the local disk's shuffle files, so there is heavy I/O activity. This can become a bottleneck, because the default configurations are suboptimal for large-scale Spark jobs. Knowing which configurations to adjust can mitigate this risk during these stages of a Spark job.

In Table 7-1 we capture a few recommended configurations to adjust so that the map, spill, and merge processes during these operations are not encumbered by inefficient I/O, and so that these operations can employ buffer memory before writing the final shuffle partitions to disk. Tuning the shuffle service running on each executor can also aid overall performance for large Spark workloads.

![Table 7-1 (part 1)](https://static001.geekbang.org/infoq/2c/2c6ca1f427f97cafff61c258e6b19fe8.png)

![Table 7-1 (part 2)](https://static001.geekbang.org/infoq/84/8412009567f77b54e52c353329b345e4.png)

![Table 7-1 (part 3)](https://static001.geekbang.org/infoq/24/248ba3956856fed7368ed82ae96f534e.png)

The recommendations in this table will not work for all situations, but they should give you an idea of how to adjust these configurations based on your workload. As with everything else in performance tuning, you have to experiment until you find the right balance.
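As a hedged sketch of the kind of adjustment Table 7-1 is about (the property names below come from the standard Spark configuration reference rather than from the table images, and the values are purely illustrative), shuffle I/O settings of this kind are passed at application launch like any other configuration:

```scala
// Sketch: larger shuffle buffers mean fewer disk seeks and spills during
// shuffle writes and reads; the right values depend on your workload.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder
  .appName("ShuffleTuningExample")
  .config("spark.shuffle.file.buffer", "1m")      // default is 32k
  .config("spark.reducer.maxSizeInFlight", "96m") // default is 48m
  .config("spark.shuffle.service.enabled", "true")
  .getOrCreate()
```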
**Maximizing Spark parallelism**

Much of Spark's efficiency comes from its ability to run multiple tasks in parallel at scale. To understand how to maximize parallelism (that is, read and process as much data in parallel as possible), you have to look at how Spark reads data from storage into memory and what partitions mean to Spark.

In data management parlance, a partition is a way of arranging data into configurable, readable chunks or contiguous blocks of data on disk. These subsets of data can be read or processed independently and, if necessary, in parallel by multiple threads in a process. This independence matters because it allows massive parallelism in data processing.

Spark is highly efficient at processing its tasks in parallel. As you learned in Chapter 2, for large-scale workloads a Spark job will have many stages, and within each stage there will be many tasks. Spark will at best schedule one thread per task per core, and each task works on a distinct partition. To optimize resource utilization and maximize parallelism, the ideal is at least as many partitions as there are cores on the executors, as shown in Figure 7-3. If there are more partitions than there are cores on each executor, all the cores are kept busy. You can think of a partition as the atomic unit of parallelism: a single thread running on a single core can work on a single partition.

![Figure 7-3](https://static001.geekbang.org/infoq/3a/3ab655f96ebf05f242874543160e357e.png)

**How partitions are created**

As mentioned previously, Spark's tasks read data from disk into memory. Data on disk is laid out in chunks or contiguous file blocks. By default, file blocks on data stores range in size from 64 MB to 128 MB. On HDFS and S3, for example, the default size is 128 MB (this is configurable). A contiguous collection of these blocks constitutes a partition.

The size of a partition in Spark is dictated by spark.sql.files.maxPartitionBytes, which defaults to 128 MB. You can decrease the size, but that may result in what is known as the "small files problem": many small partition files introducing an inordinate amount of disk I/O and performance degradation due to file system operations such as opening, closing, and listing directories, which on a distributed file system can be slow.

Partitions are also created when you explicitly use certain methods of the DataFrame API. For example, while creating a large DataFrame or reading a large file from disk, you can explicitly instruct Spark to create a certain number of partitions:

```scala
// In Scala
val ds = spark.read.textFile("../README.md").repartition(16)
ds: org.apache.spark.sql.Dataset[String] = [value: string]

ds.rdd.getNumPartitions
res5: Int = 16

val numDF = spark.range(1000L * 1000 * 1000).repartition(16)
numDF.rdd.getNumPartitions

numDF: org.apache.spark.sql.Dataset[Long] = [id: bigint]
res12: Int = 16
```
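Here is a small, hedged example of adjusting the read-side partition size at runtime (the value and path are illustrative; the setting only affects file-based sources scanned after the change):

```scala
// Sketch: raise the target partition size so that scanning many small files
// produces fewer, larger partitions.
spark.conf.get("spark.sql.files.maxPartitionBytes")            // default: 128 MB
spark.conf.set("spark.sql.files.maxPartitionBytes", 256L * 1024 * 1024)
val logs = spark.read.textFile("/path/to/data")                // hypothetical path
logs.rdd.getNumPartitions
```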
Finally, shuffle partitions are created during the shuffle stage. By default, the number of shuffle partitions, spark.sql.shuffle.partitions, is set to 200. You can adjust this number depending on the size of the data set you have, to reduce the number of small partitions being sent across the network to executors' tasks.

The default value of spark.sql.shuffle.partitions is too high for smaller workloads or streaming workloads; you may want to reduce it to a lower value, such as the number of cores on the executors or less.
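A quick way to see and change this in a small job, as a sketch (note that with adaptive query execution enabled, Spark 3.x may coalesce the shuffle partitions further at runtime):

```scala
// Sketch: after a wide transformation, the result has at most
// spark.sql.shuffle.partitions partitions.
import org.apache.spark.sql.functions.col

spark.conf.set("spark.sql.shuffle.partitions", 8)  // illustrative value
val counts = spark.range(0, 1000000).groupBy((col("id") % 100).as("bucket")).count()
counts.rdd.getNumPartitions                        // 8 when AQE is disabled
```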
seconds","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"df.count() // Now get it from the cache","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"res4: Long = 10000000","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Command took 0.44 seconds","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"第一個count()實例化了緩存,而第二個count()訪問了緩存,從而使該數據集的訪問時間快了近12倍。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當你使用cache()或時persist(),直到你調用遍歷每條記錄的操作(例如count()),DataFrame纔會被完全緩存。如果你使用類似的操作take(1),則只有一個分區將被緩存,因爲Catalyst意識到你不必爲了檢索一條記錄而計算所有分區。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"觀察DataFrame如何跨本地主機上的一個executor存儲,如圖7-4所示,我們可以看到它們都完全放在了內存中(記住較低級別的DataFrame由RDD支持)。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/46/467c766e75b931429f93df440aca0bd9.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"DataFrame.persist()","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"persist(StorageLevel.LEVEL)具有細微差別,可通過StorageLevel來控制如何緩存數據的級別。表7-2總結了不同的存儲級別。磁盤上的數據總是使用Java或Kryo序列化進行序列化。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/f4/f45b5436135179686b98f662db0e6390.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/14/1474e3201965683c6b799811772d4b94.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"每個StorageLevel(除外OFF_HEAP)都有一個等價的LEVEL_NAME_2,這意味着在兩個不同的Spark Executor上重複兩次:MEMORY_ONLY_2,MEMORY_AND_DISK_SER_2等。雖然此選項使用成本很高,但它允許在兩個地方進行數據局部化,從而提供了容錯能力,並讓Spark可以選擇將任務調度到數據副本的本地執行。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"讓我們看與上一節相同的示例,但是使用persist()方法:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"import org.apache.spark.storage.StorageLevel","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 
","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// Create a DataFrame with 10M records","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"val df = spark.range(1 * 10000000).toDF(\"id\").withColumn(\"square\", $\"id\" * $\"id\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"df.persist(StorageLevel.DISK_ONLY) // Serialize the data and cache it on disk","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"df.count() // Materialize the cache","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"res2: Long = 10000000","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Command took 2.08 seconds","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"df.count() // Now get it from the cache","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"res3: Long = 10000000","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Command took 0.38 seconds","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"從圖7-5中可以看到,數據保留在磁盤上,而不是內存中。要取消持久化緩存的數據,只需調用DataFrame.unpersist()。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/2e/2e075d970b4e3db1457024eda17de7df.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"最後,不僅可以緩存DataFrame,還可以緩存從DataFrame派生的表或視圖。使它們在Spark UI中更具可讀性。例如:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"// In Scala","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"df.createOrReplaceTempView(\"dfTable\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.sql(\"CACHE TABLE dfTable\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark.sql(\"SELECT count(*) FROM dfTable\").show()","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 
","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+--------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|count(1)|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+--------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"|10000000|","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"+--------+","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" ","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Command took 0.56 seconds","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"何時緩存和持久化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"緩存的常見應用場景是重複訪問大數據集以進行查詢或轉換。一些示例包括:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DataFrames常用於迭代機器學習訓練中","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DataFrames頻繁被用於ETL期間進行頻繁轉換或建立數據管道","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"什麼時候不緩存和持久化","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"並非所有用例都規定了需要緩存,有一些場景是不需要訪問DataFrame的,比如下面的例子:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"DataFrame太大而內存無法滿足需求","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在DataFrame上進行廉價不頻繁的轉換,而無需考慮它的大小","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通常,應謹慎使用內存緩存,因爲它可能會導致序列化和反序列化從而導致資源消耗,這取決於所使用的StorageLevel。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"接下來,我們將重點轉移到討論幾個常見的Spark連接操作上,這些操作會觸發高代價的數據移動,要求集羣提供計算和網絡資源,以及如何通過組織數據來減輕這種移動。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Spark 連接策略","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"連接操作是大數據分析中一種常見的轉換類型,其中兩個以表或DataFrames形式的數據集通過一個公共的配對鍵合併。與關係型數據庫的表關聯類似,Spark DataFrame和Dataset API以及Spark SQL提供了一系列連接轉換:內部連接,外部連接,左連接,右連接等。所有的這些操作都會觸發Spark 
Next we shift our focus to a couple of common Spark join operations that trigger expensive movement of data and demand compute and network resources from the cluster, and to how organizing data can mitigate that movement.

**A Family of Spark Join Strategies**

Join operations are a common type of transformation in big data analytics, in which two data sets in the form of tables or DataFrames are merged over a common matching key. Similar to joins between tables in a relational database, the Spark DataFrame and Dataset APIs and Spark SQL offer a series of join transformations: inner joins, outer joins, left joins, right joins, and so on. All of these operations trigger movement of data across Spark executors.

At the heart of these transformations is how Spark computes what data to produce, what keys and associated data to write to disk, and how to transfer those keys and data to nodes as part of operations like groupBy(), join(), agg(), sortBy(), and reduceByKey(). These operations are commonly referred to as shuffle operations, the familiar "shuffle".

Spark has five distinct join strategies by which it exchanges, moves, sorts, groups, and merges data across executors:

- Shuffle Hash Join (SHJ)
- Broadcast Hash Join (BHJ)
- Sort Merge Join (SMJ)
- Cartesian Join (CJ)
- Broadcast Nested Loop Join (BNLJ)

Here we focus on only two of these strategies (BHJ and SMJ), because they are the two join strategies you will encounter most often in development.

**Broadcast Hash Join (BHJ)**

Also known as a map-side-only join, the broadcast hash join is employed when two data sets need to be joined, one small enough to fit in the driver's and executors' memory and the other large enough that movement of it should ideally be avoided. Using a Spark broadcast variable, the smaller data set is broadcast by the driver to all Spark executors, as shown in Figure 7-6, and subsequently joined with the larger data set on each executor. This strategy avoids a large exchange of data.

![Figure 7-6](https://static001.geekbang.org/infoq/1e/1ebe70c539e08908f0ef882774a9b62c.png)

By default, Spark will use a broadcast join if the smaller data set is less than 10 MB. This is set via spark.sql.autoBroadcastJoinThreshold; you can increase or decrease the value depending on how much memory you have on each executor and in the driver. If you are confident that you have enough memory, you can use a broadcast join with DataFrames larger than 10 MB (even up to 100 MB).
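For example, here is a hedged sketch of raising the threshold for a session where you know the driver and executors have memory to spare (the value is illustrative):

```scala
// Sketch: allow automatic broadcast of tables up to roughly 100 MB instead of 10 MB.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 100L * 1024 * 1024)
```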
key2\")","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在此代碼中,我們強制Spark進行廣播連接,但是默認情況下,但如果較小的數據集的大小小於spark.sql.autoBroadcastJoinThreshold,那麼會默認使用這種連接策略。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"BHJ是Spark提供的最簡單,最快的連接策略,因爲它不涉及任何數據集。經過spark廣播之後,所有數據都可以在本地供 Executor使用。你只需要確保Spark驅動程序和Executor都具有足夠的內存,就可以將較小的數據集保存在內存中。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在執行該操作之後的任何時間,你都可以通過執行以下命令查看物理計劃中執行了哪些連接操作:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"joinDF.explain(mode)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在Spark 3.0中,你可以使用joinedDF.explain('mode') 顯示一個可讀的和易於理解的輸出,該模式包括了'simple', 'extended', 'codegen', 'cost'和'formatted'這幾種類型。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"何時使用廣播哈希連接","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在以下條件下使用這種類型的連接以獲得最大利益:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 當較小和較大數據集中的每個鍵被Spark散列到同一分區時","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 當一個數據集比另一個數據集小得多時(並且在默認配置10 MB內;如果有足夠的內存,則更多;如果不超過10 MB,則默認配置爲10 MB)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 當你只想執行等值連接時,根據匹配的未排序key關聯兩個數據集","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 當你不必擔心使用過多的網絡帶寬資源或者OOM錯誤時,因爲較小的數據集將廣播給所有Spark Executor","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在Spark中指定spark.sql.autoBroadcastJoinThreshold 的值爲-1,則會導致Spark一直採用shuffle排序合併連接策略(SMJ),我們將在下一節中討論。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Shuffle排序合併連接(SMJ)","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"排序合併算法是基於某個相同的key合併兩個大的數據集的有效方法,該key是可排序,唯一的、且可以分配給或存儲在同一個分區上,也就是說,兩個數據集的公共哈希key最終會落在同一分區上。從Spark的角度來看,這意味着每個數據集中具有相同key的所有行都將散列在同一Executor的同一分區上。顯然,這意味着數據必須在Executor之間進行協調或交換。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"顧名思義,此連接方案有兩個階段:排序階段和合並階段。排序階段根據連接的key對每個數據集進行排序;合併階段則是從每個數據集中迭代行中的每個key,如果兩個key匹配,則合併這些行。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"默認情況下,通過spark.sql.join.preferSortMergeJoin啓用SortMergeJoin。以下是本書的GitHub repo中可用於獨立應用程序notebook中的代碼段。主要思想是提取兩個具有一百萬條記錄的大型DataFrame,並將它們通過公共的key進行連接,即uid == 
users_id。Although the data is synthetic, it illustrates the point:

// In Scala
import scala.util.Random

// Show preference over other joins for large data sets
// Disable broadcast join
// Generate data
...
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Generate some sample data for two data sets
var states = scala.collection.mutable.Map[Int, String]()
var items = scala.collection.mutable.Map[Int, String]()
val rnd = new scala.util.Random(42)

// Initialize states and items purchased
states += (0 -> "AZ", 1 -> "CO", 2 -> "CA", 3 -> "TX", 4 -> "NY", 5 -> "MI")
items += (0 -> "SKU-0", 1 -> "SKU-1", 2 -> "SKU-2", 3 -> "SKU-3", 4 -> "SKU-4", 5 -> "SKU-5")

// Create DataFrames
val usersDF = (0 to 1000000).map(id => (id, s"user_${id}",
    s"user_${id}@databricks.com", states(rnd.nextInt(5))))
    .toDF("uid", "login", "email", "user_state")

val ordersDF = (0 to 1000000)
    .map(r => (r, r, rnd.nextInt(10000), 10 * r * 0.2d,
    states(rnd.nextInt(5)), items(rnd.nextInt(5))))
    .toDF("transaction_id", "quantity", "users_id", "amount", "state", "items")

// Do the join
val usersOrdersDF = ordersDF.join(usersDF, $"users_id" === $"uid")

// Show the joined results
usersOrdersDF.show(false)

+--------------+--------+--------+--------+-----+-----+---+---+-------
|transaction_id|quantity|users_id|amount  |state|items|uid|...|user_state|
+--------------+--------+--------+--------+-----+-----+---+---+-------
|3916          |3916    |148     |7832.0  |CA   |SKU-1|148|...|CO        |
|36384         |36384   |148     |72768.0 |NY   |SKU-2|148|...|CO        |
|41839         |41839   |148     |83678.0 |CA   |SKU-3|148|...|CO        |
|48212         |48212   |148     |96424.0 |CA   |SKU-4|148|...|CO        |
|48484         |48484   |148     |96968.0 |TX   |SKU-3|148|...|CO        |
|50514         |50514   |148     |101028.0|CO   |SKU-0|148|...|CO        |
|65694         |65694   |148     |131388.0|TX   |SKU-4|148|...|CO        |
|65723         |65723   |148     |131446.0|CA   |SKU-1|148|...|CO        |
|93125         |93125   |148     |186250.0|NY   |SKU-3|148|...|CO        |
|107097        |107097  |148     |214194.0|TX   |SKU-2|148|...|CO        |
|111297        |111297  |148     |222594.0|AZ   |SKU-3|148|...|CO        |
|117195        |117195  |148     |234390.0|TX   |SKU-4|148|...|CO        |
|253407        |253407  |148     |506814.0|NY   |SKU-4|148|...|CO        |
|267180        |267180  |148     |534360.0|AZ   |SKU-0|148|...|CO        |
|283187        |283187  |148     |566374.0|AZ   |SKU-3|148|...|CO        |
|289245        |289245  |148     |578490.0|AZ   |SKU-0|148|...|CO        |
|314077        |314077  |148     |628154.0|CO   |SKU-3|148|...|CO        |
|322170        |322170  |148     |644340.0|TX   |SKU-3|148|...|CO        |
|344627        |344627  |148     |689254.0|NY   |SKU-3|148|...|CO        |
|345611        |345611  |148     |691222.0|TX   |SKU-3|148|...|CO        |
+--------------+--------+--------+--------+-----+-----+---+---+-----
only showing top 20 rows
Examining our final execution plan, we notice that Spark used a SortMergeJoin to join the two DataFrames. The Exchange operation is a reshuffle of the results of the map operation on each executor:

usersOrdersDF.explain()

== Physical Plan ==
InMemoryTableScan [transaction_id#40, quantity#41, users_id#42, amount#43,
state#44, items#45, uid#13, login#14, email#15, user_state#16]
   +- InMemoryRelation [transaction_id#40, quantity#41, users_id#42, amount#43,
      state#44, items#45, uid#13, login#14, email#15, user_state#16],
      StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *(3) SortMergeJoin [users_id#42], [uid#13], Inner
            :- *(1) Sort [users_id#42 ASC NULLS FIRST], false, 0
            :  +- Exchange hashpartitioning(users_id#42, 16), true, [id=#56]
            :     +- LocalTableScan [transaction_id#40, quantity#41, users_id#42,
            :        amount#43, state#44, items#45]
            +- *(2) Sort [uid#13 ASC NULLS FIRST], false, 0
               +- Exchange hashpartitioning(uid#13, 16), true, [id=#57]
                  +- LocalTableScan [uid#13, login#14, email#15, user_state#16]
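For comparison, if one side of the join were small enough to fit in each executor's memory, a broadcast hash join would avoid this Exchange altogether. The following sketch is not from the chapter's notebook; it only illustrates the broadcast() hint, which makes sense here solely when the hinted side is genuinely small:

// In Scala
import org.apache.spark.sql.functions.broadcast
import spark.implicits._   // for the $"..." column syntax

// Sketch, for comparison only: hint that usersDF should be broadcast.
// With a small dimension table, explain() would then show a
// BroadcastHashJoin instead of the SortMergeJoin above.
val broadcastJoinDF = ordersDF.join(broadcast(usersDF), $"users_id" === $"uid")
broadcastJoinDF.explain()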
Furthermore, the Spark UI (which we will discuss in the next section) shows three stages for the entire job: the Exchange and Sort operations happen in the final stage, followed by merging of the results, as shown in Figures 7-7 and 7-8. The Exchange is expensive and requires partitions to be shuffled across the network between executors.

[Figure 7-7: https://static001.geekbang.org/infoq/6d/6d2257c367148a49ba70e5858e21f369.png]

[Figure 7-8: https://static001.geekbang.org/infoq/bf/bf1fb5b7c7d114692c6f95b72ac2882d.png]

Optimizing the shuffle sort merge join

We can eliminate the Exchange step from this scheme if we create partitioned buckets for the common sorted keys or columns. That is, we can create many buckets to store specific sorted columns (one key per bucket). Presorting and reorganizing the data in this way boosts performance, as it allows us to skip the expensive Exchange operation and go straight to WholeStageCodegen.

In the following code snippet from this chapter's notebook (available in the book's GitHub repo), we sort and bucket by the users_id and uid columns on which we will join, and save the buckets as Spark managed tables in Parquet format:

// In Scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.SaveMode

// Save as managed tables by bucketing them in Parquet format
usersDF.orderBy(asc("uid"))
  .write.format("parquet")
  .bucketBy(8, "uid")
  .mode(SaveMode.Overwrite)
  .saveAsTable("UsersTbl")

ordersDF.orderBy(asc("users_id"))
  .write.format("parquet")
  .bucketBy(8, "users_id")
  .mode(SaveMode.Overwrite)
  .saveAsTable("OrdersTbl")

// Cache the tables
spark.sql("CACHE TABLE UsersTbl")
spark.sql("CACHE TABLE OrdersTbl")

// Read them back in
val usersBucketDF = spark.table("UsersTbl")
val ordersBucketDF = spark.table("OrdersTbl")

// Do the join and show the results
val joinUsersOrdersBucketDF = ordersBucketDF
    .join(usersBucketDF, $"users_id" === $"uid")

joinUsersOrdersBucketDF.show(false)

+--------------+--------+--------+---------+-----+-----+---+---+------
|transaction_id|quantity|users_id|amount   |state|items|uid|...|user_state|
+--------------+--------+--------+---------+-----+-----+---+---+------
|144179        |144179  |22      |288358.0 |TX   |SKU-4|22 |...|CO        |
|145352        |145352  |22      |290704.0 |NY   |SKU-0|22 |...|CO        |
|168648        |168648  |22      |337296.0 |TX   |SKU-2|22 |...|CO        |
|173682        |173682  |22      |347364.0 |NY   |SKU-2|22 |...|CO        |
|397577        |397577  |22      |795154.0 |CA   |SKU-3|22 |...|CO        |
|403974        |403974  |22      |807948.0 |CO   |SKU-2|22 |...|CO        |
|405438        |405438  |22      |810876.0 |NY   |SKU-1|22 |...|CO        |
|417886        |417886  |22      |835772.0 |CA   |SKU-3|22 |...|CO        |
|420809        |420809  |22      |841618.0 |NY   |SKU-4|22 |...|CO        |
|659905        |659905  |22      |1319810.0|AZ   |SKU-1|22 |...|CO        |
|899422        |899422  |22      |1798844.0|TX   |SKU-4|22 |...|CO        |
|906616        |906616  |22      |1813232.0|CO   |SKU-2|22 |...|CO        |
|916292        |916292  |22      |1832584.0|TX   |SKU-0|22 |...|CO        |
|916827        |916827  |22      |1833654.0|TX   |SKU-1|22 |...|CO        |
|919106        |919106  |22      |1838212.0|TX   |SKU-1|22 |...|CO        |
|921921        |921921  |22      |1843842.0|AZ   |SKU-4|22 |...|CO        |
|926777        |926777  |22      |1853554.0|CO   |SKU-2|22 |...|CO        |
|124630        |124630  |22      |249260.0 |CO   |SKU-0|22 |...|CO        |
|129823        |129823  |22      |259646.0 |NY   |SKU-4|22 |...|CO        |
|132756        |132756  |22      |265512.0 |AZ   |SKU-2|22 |...|CO        |
+--------------+--------+--------+---------+-----+-----+---+---+-----
only showing top 20 rows

The joined output is sorted by uid and users_id, because we saved the tables sorted in ascending order. So there is no need to sort during the SortMergeJoin. Looking at the Spark UI (Figure 7-9), we can see that we skipped the Exchange and went straight to WholeStageCodegen. The physical plan also shows that, compared with the physical plan before bucketing, no Exchange is performed:

joinUsersOrdersBucketDF.explain()

== Physical Plan ==
*(3) SortMergeJoin [users_id#165], [uid#62], Inner
:- *(1) Sort [users_id#165 ASC NULLS FIRST], false, 0
:  +- *(1) Filter isnotnull(users_id#165)
:     +- Scan In-memory table `OrdersTbl` [transaction_id#163, quantity#164,
:        users_id#165, amount#166, state#167, items#168], [isnotnull(users_id#165)]
:           +- InMemoryRelation [transaction_id#163, quantity#164, users_id#165,
:              amount#166, state#167, items#168], StorageLevel(disk, memory,
:              deserialized, 1 replicas)
:                 +- *(1) ColumnarToRow
:                    +- FileScan parquet
...

[Figure 7-9: https://static001.geekbang.org/infoq/65/6560f1f542df7c508888e7ef0bc10be0.png]

[Figure: https://static001.geekbang.org/infoq/bd/bd121b67de380c8548b285b060fa21ad.png]
","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/65/6560f1f542df7c508888e7ef0bc10be0.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/bd/bd121b67de380c8548b285b060fa21ad.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"何時使用shuffle排序合併連接","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在以下條件下使用這種類型的連接以獲得最大利益:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 當兩個大型數據集中的每個鍵可以排序並通過Spark散列到同一分區時。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 當你只想執行等值連接,基於匹配的排序鍵組合兩個數據集時。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":" 當你要防止Exchange和Sort操作,以誇網絡節省大量的shuffle操作時。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"到目前爲止,我們已經介紹了與調整和優化Spark有關的操作方面,以及Spark如何在兩次常見的連接操作期間交換數據。我們還演示瞭如何通過使用桶來避免大量的數據交換從而提高shuffle排序合併連接操作的性能。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"正如你在前面的圖中所看到的,Spark UI是可以對這些操作進行可視化分析的有效渠道。它顯示了收集到的指標和程序狀態,揭示了有關可能的性能瓶頸的大量信息以及線索。在本章的最後部分,我們討論在Spark UI中可以查看哪些內容。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"查看Spark UI","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark提供了精心設計的Web UI,使得我們能夠檢查應用程序的各個組件。它提供了有關內存使用情況、作業、階段和任務的詳細信息,以及事件時間表,日誌以及各種指標和統計信息,可讓你深入瞭解Spark應用程序中在Spark驅動程序級別和單個Executor中發生的情況。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"spark-submit 作業同時會啓動Spark UI,你可以在本地主機上(在本地模式下)或通過默認端口4040上的Spark驅動程序(在其他模式下)進行訪問。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"學習Spark UI選項卡","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Spark UI有六個選項卡,如圖7-10所示,每個選項卡都給我們提供了探索的機會。讓我們看一下每個選項卡向我們展示的內容。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/a0/a0d7c7eb7a6696b02bf5c65f494b8ba4.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"本討論適用於Spark 2.x和Spark 3.0。雖然Spark 3.0中的大部分UI相同,但它還添加了第七個選項卡,即“Structured 
Learning the Spark UI tabs

The Spark UI has six tabs, as shown in Figure 7-10, each providing opportunities for exploration. Let's take a look at what each tab shows us.

[Figure 7-10: https://static001.geekbang.org/infoq/a0/a0d7c7eb7a6696b02bf5c65f494b8ba4.png]

This discussion applies to Spark 2.x and Spark 3.0. While much of the UI is the same in Spark 3.0, it also adds a seventh tab, Structured Streaming, which we will preview in Chapter 12.

Jobs and Stages

As you learned in Chapter 2, Spark breaks an application down into jobs, stages, and tasks. The Jobs and Stages tabs allow you to navigate through these and drill down to a granular level to examine the details of individual tasks. You can view their completion status and review metrics related to I/O, memory consumption, duration of execution, and so on.

Figure 7-11 shows the Jobs tab with the event timeline expanded, displaying when executors were added to or removed from the cluster. It also provides a tabular list of all completed jobs in the cluster. The Duration column indicates the time it took each job (identified by the Job Id in the first column) to finish. If this time is long, it is a good indication that you should investigate the stages of that job to see which tasks may be causing delays. From this summary page you can also access a details page for each job, including a DAG visualization and a list of completed stages.

[Figure 7-11: https://static001.geekbang.org/infoq/b6/b61e257cece5b476b3e2288889b74f6c.png]

The Stages tab provides a summary of the current state of all stages of all jobs in the application. You can also access a details page for each stage, providing a DAG and metrics on its tasks (Figure 7-12). Besides some other optional statistics, you can see the average duration of each task, the time spent in garbage collection (GC), and the number of shuffle bytes/records read. If shuffle data is being read from remote executors, a high Shuffle Read Blocked Time signals I/O issues. A high GC time indicates too many objects on the heap (your executors may be memory-starved). If a stage's maximum task time is much larger than the median, you probably have data skew caused by uneven data distribution in your partitions. Look out for these telltale signs.

[Figure 7-12: https://static001.geekbang.org/infoq/54/5426369a530aa0f55404de331acbced0.png]

You can also see aggregated metrics for each executor and a breakdown of individual tasks on this page.
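One habit that makes the Jobs tab easier to navigate is labeling the actions you trigger, so each job row carries a recognizable description. A minimal sketch, assuming the usersOrdersDF DataFrame from the join example above is still available; the group ID and description strings are illustrative placeholders:

// In Scala
// Sketch: label the jobs triggered by an action so they are easy to spot in
// the Jobs tab. The group ID and description are placeholders.
val sc = spark.sparkContext

sc.setJobGroup("orders-users-join", "Count rows of ordersDF joined with usersDF")
usersOrdersDF.count()   // this job appears in the UI under the description above
sc.clearJobGroup()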
","attrs":{}},{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Executors","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"“Executors”選項卡提供爲應用程序創建的Executor的有關信息。正如你在圖7-13看到的,你可以深入瞭解有關資源使用情況(磁盤,內存,內核)、在GC上花費的時間以及shuffle過程中寫入和讀取的數據量等詳細信息。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/df/df9c79884e79ccf2de10f6b4c732f43f.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"除了彙總統計數據,你還可以查看每個executor如何使用內存以及用於什麼目的。這還有助於當你在DataFrame或託管表上使用cache()或persist()方法時查看資源使用情況,我們將在下面討論這些問題。","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"Storage","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在“shuffle排序合併連接”中的Spark代碼中,在關聯後緩存了兩個託管表。如圖7-14所示,“Storage”選項卡提供了有關應用程序使用cache()或persist()方法而緩存的任何表或DataFrame的信息。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/fe/feebead878bfd7d10dbaab3d5d51c740.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"單擊圖7-14中的“ In-memory table`UsersTbl`”鏈接,可以進一步瞭解該表是如何在1個Executor和8個分區上的內存和磁盤上緩存的,這個數字對應於我們爲該表創建的桶的數量(請參見圖7-15)。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/49/49943e70220ba8d96ef3011851931042.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong","attrs":{}}],"text":"SQL","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過SQL選項卡可以跟蹤和查看作爲Spark應用程序的一部分而執行的Spark SQL查詢的效果。你可以查看執行查詢的時間,執行了哪些作業及其持續時間。例如,在SortMergeJoin示例中,我們執行了一些查詢;所有這些查詢都顯示在圖7-16中,其鏈接可以進一步鑽取。","attrs":{}}]},{"type":"image","attrs":{"src":"https://static001.geekbang.org/infoq/6d/6d6887fcd51a3d659465e3c9e16130b9.png","alt":null,"title":null,"style":[{"key":"width","value":"75%"},{"key":"bordertype","value":"none"}],"href":null,"fromPaste":true,"pastePass":true}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"單擊查詢描述將顯示所有物理操作的執行計劃的詳細信息,如圖7-17所示。根據該計劃,在這裏,每個物理運算符Scan In-memory table、HashAggregate和Exchange都是 
SQL

The effects of Spark SQL queries that are executed as part of your Spark application are traceable and viewable through the SQL tab. You can see when a query was executed, which jobs were run for it, and their duration. For example, in the SortMergeJoin example we executed a few queries; all of them are shown in Figure 7-16, with links to drill down further.

[Figure 7-16: https://static001.geekbang.org/infoq/6d/6d6887fcd51a3d659465e3c9e16130b9.png]

Clicking on a query's description displays the details of the execution plan with all its physical operators, as shown in Figure 7-17. Here, each physical operator of the plan (Scan In-memory table, HashAggregate, and Exchange) comes with its SQL metrics. These metrics are useful when we want to inspect the details of a physical operator and discover what happened: how many rows were scanned, how many shuffle bytes were written, and so on.

[Figure 7-17: https://static001.geekbang.org/infoq/7c/7cf308f332394ac39f7bd6bd308a8792.png]

Environment

The Environment tab, shown in Figure 7-18, is just as important as the others. Knowing about the environment in which your Spark application is running reveals many clues that are useful for troubleshooting. In fact, it is essential to know which environment variables are set, which jars are included, which Spark properties are set (and their respective values, especially if you tweaked some of the configurations mentioned in "Optimizing and tuning Spark for efficiency"), which system properties are set, which runtime environment (such as the JVM or Java version) is used, and so on. All these read-only details are very important information that you can use as a basis for investigation and tuning if you notice any abnormal behavior in your Spark application.

[Figure 7-18: https://static001.geekbang.org/infoq/59/5919cf16d7f85f5e0b23a76599b16cbb.png]

Debugging Spark applications

In this section, we have walked through the various tabs in the Spark UI. As you have seen, the UI provides a wealth of information that you can use for debugging and troubleshooting issues with your Spark applications. Beyond what we have covered here, it also provides access to the driver's and executors' stdout/stderr logs, where you may have logged some debugging information.

Debugging through the UI is a different process than stepping through an application in your favorite IDE; it is more like sleuthing, following trails of clues. That said, if you prefer that approach, you can also debug a Spark application locally in an IDE such as IntelliJ IDEA.

The Spark 3.0 UI tabs display valuable clues about what happened, along with access to the driver and executor stdout/stderr logs, where you may have recorded some debugging output.

Initially, this wealth of information can be overwhelming to a newcomer. But over time you will learn what to look for in each tab, and you will be able to detect and diagnose anomalies more quickly. The debugging patterns will become clear, and by visiting these tabs often and getting familiar with them after running some Spark examples, you will get accustomed to tuning and inspecting your Spark applications through the UI.

Summary

In this chapter, we discussed a number of techniques for optimizing your Spark applications. As you saw, by adjusting some of the default Spark configurations, you can improve scalability for large workloads, enhance parallelism, and minimize memory starvation among Spark executors. You also learned how caching and persistence strategies with the appropriate levels can speed up access to frequently used data sets, and we examined the two commonly used joins Spark employs for complex aggregations, demonstrating how DataFrames bucketed by sorted keys can skip expensive shuffle operations.

Finally, to get a visual perspective on performance, the Spark UI provides a graphical view. As rich and detailed as the UI is, it is not equivalent to step-by-step debugging in an IDE; but we showed how you can inspect and collect metrics and statistics, including data on compute and memory usage as well as SQL query execution traces, through its six tabs.

In the next chapter, we will dive into Structured Streaming and show you how the Structured APIs you learned about in earlier chapters let you write both streaming and batch applications in a continuous manner, enabling you to build reliable data lakes and pipelines.