Beike's Spark-Based HiveToHBase Practice

This article describes in detail how to write data from Hive into HBase quickly and reliably. Because the data volume is large, we use the bulkload mechanism provided by HBase itself, which avoids the heavy write pressure that the HBase put API would suffer. In its early days the team used MapReduce to do the computation and generate HFiles, and then ran bulkload to import the data.

Because of structural factors, the overall performance was not ideal and was unacceptable for some business teams. The most important factor was that the pre-split plan chosen when the HBase tables were created was unreasonable, so many jobs ran far too long; quite a few needed 4 to 5 hours to succeed.

When we re-examined and re-planned the pipeline, Spark, which performs better at the computation layer, was the natural candidate. It takes over from MapReduce the core work of converting the data format and generating the HFiles.

## A Full Walkthrough of HiveToHBase

The production work involves the interaction of two data systems. To understand the overall flow, how to optimize it, and why the ETL pipeline contains steps that do not look necessary at first glance, we first need a basic understanding of the HBase architecture.

### 1. A brief look at the HBase architecture

Apache HBase is an open-source, non-relational, distributed database that runs on top of HDFS. Its ability to serve real-time reads and writes on top of HDFS comes from its architecture and underlying data structures, namely LSM-Tree (Log-Structured Merge-Tree) + HTable (region partitioning) + cache:

- The LSM tree is one of the most popular immutable on-disk storage structures today; it achieves sequential writes using only a cache plus append-only files. Its key building block is the Sorted String Table. We will not go into the details here; what this structure implies for bulkload is that the data must be sorted, and in terms of HBase's storage format that means the KeyValues must be sorted!
- The data of an HTable must be spread evenly across its regions. When a record is accessed, HBase first consults the system table to find which region the record belongs to, then determines which RegionServer hosts that region, and finally reads the data of that region on that server (the sketch below shows this lookup from the client side).
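The lookup path described in the second bullet can be observed from the client side. The following is only an illustrative sketch, assuming an HBase 1.x/2.x client on the classpath; the table name "demo_table" and the rowkey are placeholders, and this class is not part of the original pipeline.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HRegionLocation;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.util.Bytes;

public class LocateRowDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             // "demo_table" is a placeholder table name
             RegionLocator locator = conn.getRegionLocator(TableName.valueOf("demo_table"))) {
            // Ask the meta table which region (and which RegionServer) holds this rowkey
            HRegionLocation loc = locator.getRegionLocation(Bytes.toBytes("some-rowkey"));
            System.out.println("region   = " + loc.getRegionInfo().getRegionNameAsString());
            System.out.println("server   = " + loc.getServerName());
            System.out.println("startKey = " + Bytes.toStringBinary(loc.getRegionInfo().getStartKey()));
        }
    }
}
```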
The final bulkload step is the same in both approaches; the difference lies only in the step that generates the HFiles, and that is what the rest of this article focuses on.

### 2. The data flow path

The flow of data from Hive to HBase looks roughly like this:

![Figure: data flow from Hive to HBase](https://static001.geekbang.org/infoq/23/230313c5a6a543388040dcf2f88b815e.png)

The part of the flow we really have to cover is the ETL (Extract, Transform, Load) part. HBase's underlying HFile is a column-oriented file in which every column is stored as a KeyValue.

![Figure: KeyValue layout inside an HFile](https://static001.geekbang.org/infoq/77/772ddfa8e84484752523ce16e9a4bdc0.png)

Logically, the work we need to do is quite simple (for brevity the timestamp/version dimension is omitted). The logical view of a single HBase row can be simplified as follows:

![Figure: simplified logical view of one HBase row](https://static001.geekbang.org/infoq/0f/0fb006a6ac54d4f707be36278e70c17a.png)

If you have followed this far, congratulations: you already understand how this article is organized. Next comes the code and the principles behind it.
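Before moving on to the MapReduce flow, here is a minimal sketch of that logical mapping: one Hive row fans out into several HBase cells that share one rowkey and one column family. The column family name cf1 mirrors the article's later code; the class name and the buildCells helper are assumptions made purely for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class RowToCellsSketch {
    /** Expand one Hive row (rowkey + column name/value pairs) into HBase KeyValues. */
    static List<KeyValue> buildCells(String rowKey, String[] columns, String[] values) {
        List<KeyValue> cells = new ArrayList<>();
        byte[] row = Bytes.toBytes(rowKey);
        byte[] family = Bytes.toBytes("cf1");   // single column family, as in the article's code
        for (int i = 0; i < columns.length; i++) {
            // one KeyValue per (rowkey, column family, qualifier)
            cells.add(new KeyValue(row, family, Bytes.toBytes(columns[i]), Bytes.toBytes(values[i])));
        }
        return cells;
    }

    public static void main(String[] args) {
        // A Hive row (uid=1001, city=Beijing) becomes two cells under rowkey "1001"
        buildCells("1001", new String[]{"uid", "city"}, new String[]{"1001", "Beijing"})
                .forEach(System.out::println);
    }
}
```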
## The MapReduce Workflow

The Map/Reduce framework operates on key-value pairs: it views a job's input as a set of key-value pairs and produces a set of key-value pairs as the output. In our scenario it looks like this:

### 1. mapper: data format conversion

The goal of the mapper is to turn one row of data into the form rowkey:column family:qualifier:value. The key ETL code converts the value obtained by map into an <ImmutableBytesWritable, Put> pair and hands it to the reducer for further processing.

```java
protected void map(LongWritable key, Text value, Mapper.Context context)
        throws IOException, InterruptedException {
    // split the line into the value of each column
    String[] values = value.toString().split("\\x01", -1);
    String rowKeyStr = generateRowKey();
    ImmutableBytesWritable hKey = new ImmutableBytesWritable(Bytes.toBytes(rowKeyStr));

    Put hPut = new Put(Bytes.toBytes(rowKeyStr));
    for (int i = 0; i < columns.length; i++) {
        String columnStr = columns[i];
        String cfNameStr = "cf1";
        String cellValueStr = values[i].trim();

        byte[] columnByte = Bytes.toBytes(columnStr);
        byte[] cfNameByte = Bytes.toBytes(cfNameStr);
        byte[] cellValueByte = Bytes.toBytes(cellValueStr);

        hPut.addColumn(cfNameByte, columnByte, cellValueByte);
    }
    context.write(hKey, hPut);
}
```

With the mapper written, it looks as if the format conversion is already done. Do we still need a reducer? The official references say nothing about one either, but on second thought it cannot be that simple. Reading through the main job-submission code, we find that, apart from an output format setting that differs from other MR programs:

```java
job.setOutputFormatClass(HFileOutputFormat2.class);
```

there is one more piece that other programs do not have:

```java
HFileOutputFormat2.configureIncrementalLoad(job, htable);
```

As the name suggests, it configures the job for HFile output. HFileOutputFormat2 comes from the HBase MapReduce utility package; let's see what it actually does.

### 2. Job configuration

The most relevant parts of the configuration are:

- It automatically sets the reducer according to the mapper's output type, which means our MR program only needs a mapper and no hand-written reducer.
- It fetches the startKey of every region of the target HBase table, sets the number of reduces to the number of regions, and configures a partitioner so that the data produced by the mapper is routed to the corresponding partition (reduce).

**Automatic reducer selection**

```java
// Based on the configured map output class, set the correct reducer to properly
// sort the incoming values.
// TODO it would be nice to pick one or the other of these formats.
if (KeyValue.class.equals(job.getMapOutputValueClass())) {
    job.setReducerClass(KeyValueSortReducer.class);
} else if (Put.class.equals(job.getMapOutputValueClass())) {
    job.setReducerClass(PutSortReducer.class);
} else if (Text.class.equals(job.getMapOutputValueClass())) {
    job.setReducerClass(TextSortReducer.class);
} else {
    LOG.warn("Unknown map output value type:" + job.getMapOutputValueClass());
}
```

All three reducers ultimately convert the data into KeyValue form and then sort it. Note that KeyValue ordering is a total order, not just an order on the rowkey: the entire object is compared.

Looking into KeyValueSortReducer, the underlying comparator is KeyValue.COMPARATOR, which uses Bytes.compareTo(byte[] buffer1, int offset1, int length1, byte[] buffer2, int offset2, int length2) to compare two KeyValue objects byte by byte from the beginning. This is the total ordering mentioned above.

Our output format is HFileOutputFormat2, and it also checks on every write that the data arrives in the order defined by KeyValue.COMPARATOR; if not, it aborts with the error "Added a key not lexically larger than previous".
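As a tiny, self-contained illustration of what "lexically larger" means here (assuming the HBase 1.x-style KeyValue.COMPARATOR referenced in the snippet above; newer clients expose the same ordering through CellComparator):

```java
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.util.Bytes;

public class KeyValueOrderSketch {
    public static void main(String[] args) {
        KeyValue a = new KeyValue(Bytes.toBytes("row-10"), Bytes.toBytes("cf1"),
                Bytes.toBytes("q"), Bytes.toBytes("v"));
        KeyValue b = new KeyValue(Bytes.toBytes("row-2"), Bytes.toBytes("cf1"),
                Bytes.toBytes("q"), Bytes.toBytes("v"));
        // Byte-by-byte comparison: '1' (0x31) < '2' (0x32), so "row-10" sorts BEFORE "row-2".
        // Writing b before a into HFileOutputFormat2 would trigger
        // "Added a key not lexically larger than previous".
        System.out.println(KeyValue.COMPARATOR.compare(a, b));   // negative number
    }
}
```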
**Setting the number of reduces and the partitioner**

Why do we set the number of reduces and the partitioner according to the regions of the HBase table? As mentioned in the section above, regions are central to how HBase serves queries. Every region of an htable owns its own [startKey, endKey] range, and the regions are distributed across different RegionServers.

![Figure: example region ranges of a table](https://static001.geekbang.org/infoq/a7/a7f780e06bef1a1e37dfc180ade95497.png)

These key ranges are matched against the rowkey. Taking the table above as an example, the logic of data entering regions looks like this:

![Figure: rows routed to regions by rowkey range](https://static001.geekbang.org/infoq/79/79a7545c8e6737cb0b324c88bb46531c.png)

It is precisely because of this management structure that rowkey design and region pre-splitting (that is, the number of regions and their [startKey, endKey] settings) matter so much in day-to-day production, and they naturally also have to be considered when importing data in bulk.

HBase stores its files in HDFS under paths of the form:

```text
/hbase/data/<namespace>/<table name>/<region name>/<column family>/<hfile>
```

HBase speeds up queries by processing regions in parallel, so the data volume of each region should be as balanced as possible; otherwise a large number of requests pile up on one region, causing hotspots and putting heavy pressure on that RegionServer.

How to avoid hotspots with good pre-splitting and rowkey design is not our topic here, but it explains why the ETL process partitions the reduces by the regions' startKeys: everything is done to fit HBase's own design, so that the subsequent bulkload can simply and quickly move the generated HFiles straight into the regions. A simplified sketch of this routing idea follows.
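To make the routing above concrete, here is a simplified, self-contained sketch of the idea: send each rowkey to the reduce whose region range contains it, by binary search over the sorted region start keys. It is only an illustration; the partitioner that configureIncrementalLoad actually installs is, to the best of our knowledge, Hadoop's TotalOrderPartitioner driven by a partitions file built from these start keys.

```java
import java.util.Arrays;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;

/** Simplified illustration: one partition per region, chosen by the region start keys. */
public class RegionStartKeyPartitionerSketch {

    private final byte[][] sortedStartKeys;   // start key of region i (region 0 starts at an empty byte[])

    public RegionStartKeyPartitionerSketch(byte[][] sortedStartKeys) {
        this.sortedStartKeys = sortedStartKeys;
    }

    /** Returns the index of the region (== reduce partition) whose [startKey, endKey) holds the rowkey. */
    public int getPartition(ImmutableBytesWritable rowKey) {
        int idx = Arrays.binarySearch(sortedStartKeys, rowKey.get(), Bytes.BYTES_COMPARATOR);
        // exact hit -> that region; otherwise binarySearch returns (-(insertionPoint) - 1),
        // and the rowkey belongs to the region just before the insertion point
        return idx >= 0 ? idx : -idx - 2;
    }

    public static void main(String[] args) {
        byte[][] startKeys = {Bytes.toBytes(""), Bytes.toBytes("g"), Bytes.toBytes("p")}; // 3 regions
        RegionStartKeyPartitionerSketch p = new RegionStartKeyPartitionerSketch(startKeys);
        System.out.println(p.getPartition(new ImmutableBytesWritable(Bytes.toBytes("house_001")))); // -> 1
    }
}
```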
This is also where we optimize later: it lets HiveToHBase break free, as much as possible, from interference by upstream steps (creating the htable) and focus on the ETL itself. In fact, bulkload does not strictly require that all records in one HFile belong to the same region!

### 3. Running bulkload: the finishing touch

With the core part we set out to cover now explained, all that remains is the final job submission and the bulkload, which close out the whole flow.

```java
Job job = Job.getInstance(conf, "HFile Generator ... ");
job.setJarByClass(MRMain.class);
job.setMapperClass(MRMapper.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(Put.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(HFileOutputFormat2.class);

HFileOutputFormat2.configureIncrementalLoad(job, htable);
// wait for the MR job to finish
job.waitForCompletion(true);

LoadIncrementalHFiles loader = new LoadIncrementalHFiles(loadHBaseConf);
loader.doBulkLoad(new Path(targetHtablePath), htable);
```

### 4. Where we stood

That covers all the details of the MapReduce-based HiveToHBase flow. Let's look at a real production job: 248,903,451 rows in total, 60 GB of compressed ORC files.

**Pain points**

Because pre-splits were missing or badly chosen when the historical HBase tables were created, many tables ended up with only a handful of regions. The performance bottleneck of these historical jobs was therefore almost always the reduce phase that generates the HFiles: in this job, the 747 map tasks finished in about 4 minutes, while the remaining 2 reduce tasks ran for 2 hours 22 minutes.

![Figure: map vs. reduce running time of the example job](https://static001.geekbang.org/infoq/66/6651209c7acade54839ad794d72370e5.png)

The distribution of region counts across all HiveToHBase target tables on the platform looks like this:

![Figure: distribution of region counts across HiveToHBase tables](https://static001.geekbang.org/infoq/67/6725214f7b47de4f18bd466be87808ba.png)

Most of the HBase tables have only a few regions. If we kept partitioning the old way under these conditions, the expected overall improvement would not be large.

Moreover, for historical reasons HiveToHBase lets users write SQL to process the Hive data before it is imported into HBase. MR does not support SQL, so previously we first used the Tez engine with insert overwrite directory + SQL to produce temporary files, and then ran the processing described above on those files.

**Solution**

After weighing the options, we decided to rewrite the HiveToHBase flow with Spark. There are now official tool packages and sample Scala code (Apache HBase ™ Reference Guide; Chinese version: HBase and Spark - HBase中文參考指南 3.0) that let us, just as with MR, write only the mapper-like part without worrying about partitioning or sorting.
But that alone would not solve our pain point, so we decided not to rely on the official toolbox. This is exactly why we analyzed the MR job configuration above: it lets us customize the development to our own needs.

Remember the earlier remark that bulkload does not strictly require all records in one HFile to belong to the same region? It means we do not have to split partitions by region count at all and can free ourselves from the htable's regions. During bulkload, HBase inspects the previously generated HFiles and works out which region each piece of data should go to.

If, as before, the data with the same rowkey prefix has been grouped together in advance, the bulkload itself is very fast; if instead data from different regions is mixed within one HFile, bulkload performance and load are affected to some degree. This trade-off needs to be evaluated case by case.

Why Spark:

- It supports SQL directly against Hive, which removes the temporary-file step mentioned above and makes the overall flow simpler.
- Architecturally, Spark runs much faster than MR.

The expected outcome, using the earlier example as an illustration, looks like this:

![Figure: expected task layout after moving to Spark](https://static001.geekbang.org/infoq/79/797e0548830150cabbf545d690384257.png)

## The Spark Workflow

The core flow is similar to MR, except that it uses Spark's API for writing an RDD out as files on disk, and we have to sort the data ourselves.
### 1. Sorting

We build a KeyFamilyQualifier class and implement Comparable, following the same idea as the total ordering above. In practice, sorting by rowkey:family:qualifier turns out to be sufficient.

```java
public class KeyFamilyQualifier implements Comparable<KeyFamilyQualifier>, Serializable {

    private byte[] rowKey;
    private byte[] family;
    private byte[] qualifier;

    public KeyFamilyQualifier(byte[] rowKey, byte[] family, byte[] qualifier) {
        this.rowKey = rowKey;
        this.family = family;
        this.qualifier = qualifier;
    }

    public byte[] getRowKey() { return rowKey; }
    public byte[] getFamily() { return family; }
    public byte[] getQualifier() { return qualifier; }

    @Override
    public int compareTo(KeyFamilyQualifier o) {
        // compare rowkey first, then column family, then qualifier
        int result = Bytes.compareTo(rowKey, o.getRowKey());
        if (result == 0) {
            result = Bytes.compareTo(family, o.getFamily());
            if (result == 0) {
                result = Bytes.compareTo(qualifier, o.getQualifier());
            }
        }
        return result;
    }
}
```

### 2. The core processing flow

Spark has no automatically configured reducer, so we have to do more of the work ourselves. The flow is:

- Convert the Spark dataset into <KeyFamilyQualifier, KeyValue> pairs; this is where our ETL work happens.
- Sort the pairs by KeyFamilyQualifier to satisfy HBase's on-disk requirement. A simple sortByKey(true) (ascending) is enough, because the key is the KeyFamilyQualifier above.
- Convert the sorted data into <ImmutableBytesWritable, KeyValue>, the input format that HFile writing accepts.
- Save the resulting RDD as files in HFile format.

```java
SparkSession spark = SparkSession.builder().config(sparkConf).enableHiveSupport().getOrCreate();
Dataset<Row> rows = spark.sql(hql);

JavaPairRDD<ImmutableBytesWritable, KeyValue> javaPairRDD = rows.javaRDD()
        .flatMapToPair(row -> rowToKeyFamilyQualifierPairRdd(row).iterator())
        .sortByKey(true)
        .mapToPair(combineKey ->
                new Tuple2<>(new ImmutableBytesWritable(combineKey._1().getRowKey()), combineKey._2()));

Job job = Job.getInstance(conf, HBaseConf.getName());
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(KeyValue.class);
// use the job's conf rather than the job itself; this also fills in the follow-up
// compression, bloomType, blockSize and DataBlockSize settings
HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator);
javaPairRDD.saveAsNewAPIHadoopFile(outputPath, ImmutableBytesWritable.class, KeyValue.class,
        HFileOutputFormat2.class, job.getConfiguration());
```
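The Spark job stops at writing the HFiles; the bulkload step itself is unchanged from the MapReduce version. For completeness, here is a minimal sketch of that final step, assuming an HBase 1.x-style client; outputPath and the table name are whatever the surrounding job used.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class SparkBulkLoadStep {
    /** Load the HFiles that saveAsNewAPIHadoopFile wrote under outputPath into the target table. */
    public static void bulkLoad(Configuration conf, String outputPath, String tableNameStr) throws Exception {
        TableName tableName = TableName.valueOf(tableNameStr);
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin();
             Table table = conn.getTable(tableName);
             RegionLocator regionLocator = conn.getRegionLocator(tableName)) {
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            // same API as in the MapReduce section, using the overload that avoids the deprecated HTable
            loader.doBulkLoad(new Path(outputPath), admin, table, regionLocator);
        }
    }
}
```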
### 3. Spark: data format conversion

The part `row -> rowToKeyFamilyQualifierPairRdd(row).iterator()` is what converts a row into <KeyFamilyQualifier, KeyValue>:

```java
// tuples of (column name, data type) for every field of the dataset
Tuple2<String, String>[] dTypes = dataset.dtypes();
return dataset.javaRDD().flatMapToPair(row -> {
    List<Tuple2<KeyFamilyQualifier, KeyValue>> kvs = new ArrayList<>();
    byte[] rowKey = generateRowKey();
    // skip the row if the rowkey is null
    if (rowKey != null) {
        for (Tuple2<String, String> dType : dTypes) {
            Object obj = row.getAs(dType._1);
            if (obj != null) {
                kvs.add(new Tuple2<>(
                        new KeyFamilyQualifier(rowKey, "cf1".getBytes(), Bytes.toBytes(dType._1)),
                        getKV(param-x)));   // getKV(param-x): placeholder from the original post, builds the KeyValue for this cell
            }
        }
    } else {
        LOGGER.error("row key is null, row = {}", row.toString());
    }
    return kvs.iterator();
});
```

That completes the Spark version of HiveToHBase. Partition control is kept as a separate, parameterized setting so it is easy to adjust:

```java
// if the task was given an explicit partition count
if (partitionNum > 0) {
    hiveData = hiveData.repartition(partitionNum);
}
```

We deliberately separate partitioning from sorting. Because repartition also requires a shuffle and therefore has a performance cost, it is off by default; the normal Spark read strategy of roughly one partition per HDFS block is used. Tasks that need special handling, for example a higher degree of parallelism, can still be tuned through this parameter.

## Comparing the Two

Running the example task above with the new approach took 33 minutes. Going from 146 minutes to 33 minutes, performance improved by more than 4x. Because task migration and upgrades still require a lot of preparatory work, platform-wide numbers were not available when this article was written, so for now we use a single task as a comparative experiment. (Task runtimes are closely tied to cluster resources, so the numbers should be treated as a reference only.)

You can also see that the change of strategy has almost no effect on bulkload performance, which confirms that the approach is viable:

![Figure: bulkload time before and after the change](https://static001.geekbang.org/infoq/9e/9e0fc8cf297f729369b0d712ffd29f7b.png)
Another task that took 5.29 hours with the old MR approach needs only 11 minutes 45 seconds after migrating to Spark and tuning it (increasing the number of partitions). What matters most about this approach is that it can be tuned by hand and maintained flexibly. The runtime of offline jobs is subject to many factors, so although the experiment is not highly conclusive on its own, it still shows a very large performance gain.

For reasons of space there are many points we could not cover in detail, such as salting to spread data evenly across regions, automatic calculation of the partition count, and the OOM problems that can occur while Spark generates HFiles. We hope you got something out of this despite the rough writing.

Finally, thanks to Yusong (雨松) and Feng Liang (馮亮) for all their help during development and testing, and to the rest of the team for their strong support.

**References:**

1. 20張圖帶你到HBase的世界遨遊【轉】 - sunsky303 - 博客園: https://www.cnblogs.com/sunsky303/p/14312350.html
2. HBase原理-數據讀取流程解析: http://HBasefly.com/2016/12/21/HBase-getorscan/?aixuds=6h5ds3
3. Hive、Spark SQL任務參數調優: https://www.jianshu.com/p/2964bf816efc
4. Spark On HBase的官方jar包編譯與使用: https://juejin.cn/post/6844903961242124295
5. Apache HBase ™ Reference Guide: https://hbase.apache.org/book.html#_bulk_load
6. HBase and Spark - HBase中文參考指南 3.0: https://www.cntofu.com/book/173/docs/17.md

This article is reposted from DataFunTalk (ID: dataFunTalk).

Original link: [貝殼基於Spark的HiveToHBase實踐](https://mp.weixin.qq.com/s/pfeg25F_E3UrZJXJRXsfug)