# Performance Tuning Techniques of Hive Big Data Table

{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","marks":[{"type":"strong"}],"text":"本文要點"}]},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"大數據應用程序開發人員在從Hadoop文件系統或Hive表讀取數據時遇到了挑戰。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"合併作業(一種用於將小文件合併爲大文件的技術)有助於提高讀取Hadoop數據的性能。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"通過合併,文件的數量顯著減少,讀取數據的查詢時間更短。"}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"當通過map-reduce作業讀取Hive表數據時,Hive調優參數也可以幫助提高性能。"}]}]}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"link","attrs":{"href":"https:\/\/cwiki.apache.org\/confluence\/display\/Hive\/Tutorial","title":"","type":null},"content":[{"type":"text","text":"Hive"}]},{"type":"text","text":"表是一種依賴於結構化數據的大數據表。數據默認存儲在Hive數據倉庫中。爲了將它存儲在特定的位置,開發人員可以在創建表時使用location標記設置位置。Hive遵循同樣的SQL概念,如行、列和模式。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"在讀取Hadoop文件系統數據或Hive表數據時,大數據應用程序開發人員遇到了一個普遍的問題。數據是通過"},{"type":"link","attrs":{"href":"https:\/\/spark.apache.org\/","title":"","type":null},"content":[{"type":"text","text":"spark streaming"}]},{"type":"text","text":"、"},{"type":"link","attrs":{"href":"https:\/\/nifi.apache.org\/","title":"","type":null},"content":[{"type":"text","text":"Nifi streaming"}]},{"type":"text","text":"作業、其他任何流或攝入程序寫入Hadoop集羣的。攝入作業將大量的小數據文件寫入Hadoop集羣。這些文件也稱爲part文件。"}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"這些part文件是跨不同數據節點寫入的,如果當目錄中的文件數量增加時,其他應用程序或用戶試圖讀取這些數據,就會遇到性能瓶頸,速度緩慢。其中一個原因是數據分佈在各個節點上。考慮一下駐留在多個分佈式節點中的數據。數據越分散,讀取數據的時間就越長,讀取數據大約需要“N 
For starters, a Hadoop cluster has multiple [name nodes](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html), and each name node has multiple data nodes. Ingestion/streaming jobs write data across multiple data nodes, and reading that data back poses performance challenges. For the jobs that read the data, developers spend considerable time tracking down problems related to query response times. The problem mostly affects users whose data volumes run to billions of records per day. For smaller datasets these performance techniques may not be necessary, but it is always good to do some extra tuning for the long run.

In this article, I discuss how to solve these problems and the performance tuning techniques that improve data access from Hive tables. Like other big data technologies such as Cassandra and Spark, Hive is a very powerful solution, but it requires tuning by data developers and operations teams to get optimal performance out of the queries executed against Hive data.

Let's first look at some use cases for Hive data.

### Use cases

Hive data is mostly used in the following applications:

- Big data analytics, running analytical reports on transaction behavior, activity, volume, and more;
- Tracking fraudulent activity and generating reports on that activity;
- Creating dashboards based on the data;
- Auditing and storing historical data;
- Feeding data for machine learning and building intelligence around it.

## Optimization techniques

There are several ways to ingest data into Hive tables. Ingestion can be done through an Apache Spark streaming job, Nifi, or any other streaming technology or application. The ingested data is raw data, and it is very important to consider all tuning factors before the ingestion process starts.
### Organizing Hadoop data

The first step is to organize the Hadoop data. We start with the ingestion/streaming jobs. First, the data needs to be partitioned. The most basic way to partition data is by day or by hour, and you can even have both day and hour partitions. In some cases, within a day partition, you can further partition by country, region, or other dimensions that fit your data and use case. Think of a library shelf where books are arranged by genre, and each genre has a children's or adults' section.

![Figure 1: Organized data](https://static001.geekbang.org/infoq/a8/a8aebb755e3dc116bf3dece1defb111d.png)

**Figure 1: Organized data**

With that example, we write data into Hadoop directories like this:

```
hdfs://cluster-uri/app-path/category=children/genre=fairytale OR
hdfs://cluster-uri/app-path/category=adult/genre=thrillers
```

This way, your data is much better organized. Most of the time, absent special requirements, data is partitioned by day or hour:

```
hdfs://cluster-uri/app-path/day=20191212/hr=12
```

Or just by day, as needed:

```
hdfs://cluster-uri/app-path/day=20191212
```

![Figure 2: Partition folder ingestion flow](https://static001.geekbang.org/infoq/0f/0f5aa3110185fd74d0095bed6544c040.png)

**Figure 2: Partition folder ingestion flow**
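In Hive, that directory layout falls straight out of the table's partition columns. As a minimal sketch (the `books` table and its columns are illustrative, not from the original article):

```sql
-- Partition columns become directory levels under the table location:
--   <table-location>/category=children/genre=fairytale/...
CREATE TABLE books (
  title  STRING,
  author STRING
)
PARTITIONED BY (category STRING, genre STRING);
```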
## Hadoop data format

When creating a Hive table, it is good practice to provide table compression properties, such as zlib, and a storage format, such as ORC. During ingestion, the data is then written in those formats. If your application writes to the plain Hadoop file system, providing a format is recommended there as well. Most ingestion frameworks, such as Spark or Nifi, have a way to specify the format. Specifying the data format helps keep the data organized in a compressed form, which saves space on the cluster.
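As a sketch of what those properties can look like in HiveQL (the `events` table is illustrative; `orc.compress` is the standard ORC table property for choosing the compression codec):

```sql
-- Day/hour-partitioned table stored as ORC, compressed with zlib.
CREATE TABLE events (
  event_id STRING,
  payload  STRING
)
PARTITIONED BY (day STRING, hr STRING)
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'ZLIB');
```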
## Consolidation job

Consolidation jobs play a crucial role in improving the overall read performance of Hadoop data. There are multiple parts associated with the consolidation technique. By default, the files written into HDFS directories are small part files, and when there are too many of them, there are performance problems reading the data. Consolidation is not a feature specific to Hive; it is a technique for merging smaller files into larger files. The consolidation technique doesn't involve any online component either, which makes it all the more important, especially when batch applications read the data.

### What is a consolidation job?

By default, ingestion/streaming jobs writing into Hive write small part files into the target directories, and for a high-volume application the number of files for a single day can exceed 100,000. The real problem comes when we try to read the data: it eventually takes a long time, sometimes several hours, to return results, or the job may simply fail. For example, suppose you have a day partition directory holding around one million small files. If you run a count, the output looks like this:

```
#Before:
hdfs dfs -count -v /cluster-uri/app-path/day=20191212/*
Output = 1Million
```

Now, after running the consolidation job, the number of files is significantly reduced: all the smaller part files have been merged into large files.

```
#After:
hdfs dfs -count -v /cluster-uri/app-path/day=20191212/*
Output = 1000
```

Note: cluster-uri varies from organization to organization; it is the Hadoop cluster URI used to connect to your specific cluster.

### How does a consolidation job help?

Consolidating files is not just good for performance; it is also good for cluster health. Per Hadoop platform guidelines, nodes should not hold that many files. Too many files means reading from too many nodes, which in turn leads to high latency. Remember that when Hive data is read, it is scanned across all the data nodes: the more files you have, the longer the read time. It is therefore essential to merge all the small files into larger ones. It is also worth running purge routines if the data is no longer needed after a certain number of days.

### How consolidation works

There are several ways to consolidate files, depending mostly on where the data is being written. Below I discuss two common use cases:

- Writing data into a Hive table under a day-partitioned directory, using Spark or Nifi
- Writing data into the Hadoop file system (HDFS), using Spark or Nifi

In these cases, the consolidated large files are written back into the day folder, and the developer needs to follow one of the options below (a HiveQL sketch of both options appears after the list).

![Figure 3: Consolidation logic](https://static001.geekbang.org/infoq/62/620b13413ddc6486f605b7102abce860.png)

**Figure 3: Consolidation logic**

1. Write a script that performs the consolidation. The script takes a parameter such as the day, runs a Hive select query over that same partition's data, and does an insert overwrite into the same partition. When Hive rewrites the data in the same partition, it runs a map-reduce job and reduces the number of files.
2. Overwriting the same data in the same command can sometimes lead to unexpected data loss if the command fails. In that case, select the data from the day partition and write it into a temporary partition. If that succeeds, move the temporary partition's data into the actual partition using a load command. The steps are illustrated in Figure 3.

Of these two options, the second (option B) is better: it fits all use cases and is the most efficient. It is robust because no data is lost if any step fails. Developers can write a Control-M job and schedule it to run around midnight the next day, when no active users are reading the data.
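Concretely, a rough HiveQL sketch of both options might look as follows, assuming a hypothetical table `sales` partitioned by `day` and a staging table `sales_tmp` with the same schema and format (the `hive.merge.*` settings are standard Hive merge parameters):

```sql
-- Option A: rewrite the partition onto itself. Hive runs a
-- map-reduce job and merges the small files while overwriting.
set hive.merge.mapfiles = true;     -- merge small files after map-only jobs
set hive.merge.mapredfiles = true;  -- merge small files after map-reduce jobs

INSERT OVERWRITE TABLE sales PARTITION (day = '20191212')
SELECT item, amount FROM sales WHERE day = '20191212';

-- Option B: stage into a temporary partition first, so a failed
-- rewrite cannot corrupt the original data.
INSERT OVERWRITE TABLE sales_tmp PARTITION (day = '20191212')
SELECT item, amount FROM sales WHERE day = '20191212';

-- Only after the staging step succeeds, move the consolidated
-- files back into the real partition (the path is illustrative).
LOAD DATA INPATH '/apps/hive/warehouse/sales_tmp/day=20191212'
OVERWRITE INTO TABLE sales PARTITION (day = '20191212');
```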
There is also one case where the developer need not write a Hive query at all: submit a spark job that selects the same partition and overwrites the data. This is recommended only when the number of files in the partition folder is not huge and spark can still read the data without over-specifying resources. This option suits low-volume use cases, and the extra step can boost the performance of reading the data.

## How does the whole flow work?

Let's review all the pieces above through an example scenario.

Say you own an e-commerce application and you track daily customer volume across different purchase categories. Your application volume is high, and you need intelligent data analytics based on users' buying habits and history.

From the presentation layer to the middle tier, you want to publish these messages with [Kafka](https://kafka.apache.org/) or IBM [MQ](https://www.ibm.com/products/mq). The next step is to have a streaming application that consumes the Kafka/MQ data and ingests it into a Hadoop Hive table, which can be done with Nifi or Spark. Before that, the Hive table needs to be designed and created. During table creation, you have to decide what your partition columns look like, whether sorting is required, and which compression algorithm to use, such as [Snappy](https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/td-p/97110) or [Zlib](https://community.cloudera.com/t5/Support-Questions/Snappy-vs-Zlib-Pros-and-Cons-for-each-compression-in-Hive/td-p/97110).

The design of the Hive table is a key factor in overall performance, and you must think about how the data will be queried when designing it. If you want to query how many customers bought items in particular categories, such as toys or furniture, in a given day, it is advisable to have at most two partitions, such as one day partition and one category partition, and to have the streaming application ingest the data accordingly.

Knowing all the usability aspects in advance gives you a better shot at designing a table that suits your needs. Hence, for the example above, once the data is ingested into this table, it should be partitioned by day and category.

It is only the ingested data that forms the small files in the Hive location, so, as discussed above, it becomes crucial to consolidate those files.

As the next step, you can set up a scheduler or use Control-M to run the consolidation job every night, for example around 1 a.m.; it will invoke the consolidation scripts, and those scripts will consolidate the data for you. Finally, in those Hive locations, you should see the number of files come down.

When the real intelligent data analytics run against the previous day's data, querying becomes easy and performance improves.

## Hive parameter settings

There are some handy tuning parameters available when Hive table data is read through a map-reduce job. To learn more about them, check out the [Hive tuning parameters](https://docs.cloudera.com/documentation/enterprise/5-9-x/topics/admin_hive_tuning.html) documentation.

```
set hive.exec.parallel = true;
set hive.vectorized.execution.enabled = true;
set hive.vectorized.execution.reduce.enabled = true;
set hive.cbo.enable = true;
set hive.compute.query.using.stats = true;
set hive.stats.fetch.column.stats = true;
set hive.stats.fetch.partition.stats = true;
set mapred.compress.map.output = true;
set mapred.output.compress = true;
set hive.execution.engine = tez;
```

To understand each of these properties in more depth, you can refer to this [tutorial](https://www.hdfstutorial.com/blog/hive-performance-tuning/).
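These are session-scoped settings, so a common pattern is to place them at the top of the script or session that runs the read query. A small illustrative example (the table and query are hypothetical):

```sql
-- Tuning knobs first, then the query they should apply to.
set hive.execution.engine = tez;
set hive.exec.parallel = true;
set hive.cbo.enable = true;

SELECT item, COUNT(*) AS purchases
FROM sales
WHERE day = '20191212'
GROUP BY item;
```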
## Technical implementation

Now let's walk through a sample scenario step by step. Here I am considering ingesting customer event data into a Hive table. My downstream systems or teams will use this data to run further analytics (for example: in a given day, what did customers buy, and from which city?). The data will be used to analyze the demographics of my product's users, allowing me to troubleshoot or expand business use cases. It also gives further insight into where the active customers come from and what more I can do to grow the business.

### Step 1: Create a sample Hive table with the following code:

![Create table statement](https://static001.geekbang.org/infoq/82/82cf37e3385908830ce82ee6e4f70b45.png)
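The original table definition is shown as an image; as a stand-in, here is a sketch of what such a table might look like. The column names are hypothetical and chosen only to match the scenario (a day partition, ORC with zlib, and category/city columns for the analytics described); the location matches the paths used in the next steps:

```sql
-- Hypothetical customer-events table for this scenario.
CREATE TABLE customevents (
  event_id    STRING,
  customer_id STRING,
  item        STRING,
  category    STRING,     -- e.g. toys, furniture
  city        STRING,     -- where the purchase was made
  event_time  TIMESTAMP
)
PARTITIONED BY (day STRING)
STORED AS ORC
LOCATION '/data/customevents'
TBLPROPERTIES ('orc.compress' = 'ZLIB');
```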
### Step 2: Set up a streaming job to ingest data into the Hive table

The streaming job can spark a stream from Kafka's real-time data, then transform it and ingest it into the Hive table.

![Figure 4: Hive data flow](https://static001.geekbang.org/infoq/70/7070665ca2f6b1ec0a0a653446efce1f.png)

**Figure 4: Hive data flow**

Thus, as the real-time data is ingested, it is written into day partitions. Let's assume today's date is 20200101.

```
hdfs dfs -ls /data/customevents/day=20200101/
/data/customevents/day=20200101/part00000-djwhu28391
/data/customevents/day=20200101/part00001-gjwhu28e92
/data/customevents/day=20200101/part00002-hjwhu28342
/data/customevents/day=20200101/part00003-dewhu28392
/data/customevents/day=20200101/part00004-dfdhu24342
/data/customevents/day=20200101/part00005-djwhu28fdf
/data/customevents/day=20200101/part00006-djwffd8392
/data/customevents/day=20200101/part00007-ddfdggg292
```

By the end of the day, this count could be anywhere between 10K and 1M, depending on the application's traffic; for a large company, traffic will be high. Let's assume the total number of files is 141K.

### Step 3: Run the consolidation job

On January 2nd, 2020, i.e., the next day, around 1 a.m., we run the consolidation job. The sample code is uploaded to git; the file is named consolidate.sh.

Below is the command to run on the edge node/box:

```
./consolidate.sh 20200101
```

The script now consolidates the previous day's data. Once it finishes, you can rerun the count:

```
hdfs dfs -count -v /data/customevents/day=20200101/*
count = 800
```

Before, the count was 141K; after consolidation, it is 800. This gives you a significant performance benefit. See the consolidation logic code [here](https://github.com/skoloth/Hive-Consolidation).
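The real script lives in the repository linked above. Purely as an illustration of the shape such a wrapper can take, here is a sketch following option B from earlier (the staging table and paths are assumptions, not the author's code):

```sh
#!/bin/bash
# Illustrative consolidate.sh sketch -- not the author's actual script.
# Usage: ./consolidate.sh 20200101
DAY="$1"

# Stage the day's partition into a temporary table, then load the
# consolidated files back, so a failure cannot corrupt the original.
hive -e "
  set hive.merge.mapfiles = true;
  set hive.merge.mapredfiles = true;

  INSERT OVERWRITE TABLE customevents_tmp PARTITION (day = '${DAY}')
  SELECT event_id, customer_id, item, category, city, event_time
  FROM customevents WHERE day = '${DAY}';

  LOAD DATA INPATH '/data/customevents_tmp/day=${DAY}'
  OVERWRITE INTO TABLE customevents PARTITION (day = '${DAY}');
" || exit 1
```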
## Statistics

Without any tuning technique, the query time to read Hive table data ranges from five minutes to several hours, depending on the volume of data.

![Figure 5: Statistics](https://static001.geekbang.org/infoq/d2/d2a1541d91029d214a4c50432181ea44.png)

**Figure 5: Statistics**

After consolidation, query times drop significantly and we get results faster: the number of files is sharply reduced, and so is the query time to read the data. Without consolidation, queries run over a great many small files spread across the name nodes, driving response times up.

## References

- Koloth, K. S. (October 15, 2020). [The importance of big data on artificial intelligence](https://londondailypost.com/sudhish-koloth-the-importance-of-big-data-on-artificial-intelligence/).
- Apache Software Foundation. (n.d.). [Apache Hive](https://hive.apache.org/).
- Gauthier, G. L. (July 25, 2019). [Running Apache Hive 3, new features and tips and tricks](https://www.adaltas.com/en/2019/07/25/hive-3-features-tips-tricks/).
**About the author**

[Sudhish Koloth](https://timebusinessnews.com/sudhish-koloth-played-a-key-role-2020/) is a lead developer at a banking and financial services company. He has worked in information technology for 13 years, with technologies ranging across full-stack, big data, automation, and Android development. He also played a significant role in delivering impactful [projects](https://play.google.com/store/apps/details?id=com.feedom.uandus) during the COVID-19 pandemic. Sudhish uses his expertise to solve common problems facing humanity, volunteers to help non-profit organizations with their applications, and mentors other professionals and colleagues with his technical know-how. Mr. Koloth is also an active evangelist and motivator on the importance of STEM education for school-aged children and young college graduates. His work has been recognized both within and outside his professional network.

Original article: [Performance Tuning Techniques of Hive Big Data Table](https://www.infoq.com/articles/hive-performance-tuning-techniques/)