How to Design an Efficient HBase Data Model
{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"These notes collect the HBase fundamentals a user needs to know, drawn from my experience learning and using HBase.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Background","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"1.1 Data Model","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The hardest part of learning HBase/BigTable is understanding its data model, in other words, how it is actually meant to be used. The BigTable paper defines it explicitly:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"A Bigtable is a sparse, distributed, persistent multidimensional sorted map.\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The paper then elaborates:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Two articles explain this very clearly: ","attrs":{}},{"type":"link","attrs":{"href":"https://dzone.com/articles/understanding-hbase-and-bigtab","title":"","type":null},"content":[{"type":"text","text":"understanding-hbase-and-bigtab","attrs":{}}]},{"type":"text","text":" and ","attrs":{}},{"type":"link","attrs":{"href":"http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf","title":"","type":null},"content":[{"type":"text","text":"Introduction to HBase Schema Design","attrs":{}}]},{"type":"text","text":". Below is a more concrete example:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"{\n // ...\n \"aaaaa\" : { // Row Key\n \"A\" : { // Column Families\n \"foo\" : { // Column Qualifiers\n 15 : \"y\", // 15: Timestamp / Version number, \"y\": Cell Value\n 4 : \"m\"\n },\n \"bar\" : {\n 15 : \"d\"\n }\n },\n \"B\" : {\n \"\" : {\n 6 : \"w\",\n 3 : \"o\",\n 1 : \"w\"\n }\n }\n },\n // ...\n}\n\nBreakdown:\n map : stores key-value data.\n persistent : data is stored as files, on HDFS (or S3) for HBase and on GFS for BigTable.\n distributed : both HBase and BigTable are built on a distributed file system, with compute and storage separated; a cluster consists of a Master and a set of storage servers, and data is stored in partitions, as in a typical distributed system.\n sorted : row keys are sorted in lexicographic order.\n multidimensional: as the example above shows, this is a nested map, in other words a multidimensional map.\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Physically, the data is stored in flat files, with a storage model like the following:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"aaaaa, A:bar, 15 ---> d\naaaaa, A:foo, 15 ---> y\naaaaa, A:foo, 4 ---> m\naaaaa, B:, 6 ---> w\naaaaa, B:, 3 ---> o\naaaaa, B:, 1 ---> w\n\nThe logical structure of a key is {RowKey:ColumnFamily:Qualifier:TimeStamp}. Keys are sorted in ascending lexicographic order by RowKey, ColumnFamily, and Qualifier, and in descending order by TimeStamp, so the newest version of a cell comes first.\n","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"1.2 
Reads, Writes, and Compaction","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HBase/BigTable storage is implemented as an LSM tree. Plenty has been written about this elsewhere, so only the main flows a user needs to know are outlined here:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"(1) Write path: a write is first appended to the log file (WAL) and then applied to the MemStore (the memtable in BigTable); when the MemStore fills up, its contents are flushed to a file (an HFile).\n(2) Read path: because both the MemStore and the HFiles hold data, a read works much like a multi-way merge. The newest data is in the MemStore, slightly older data is in recently written HFiles, and the oldest data is in earlier HFiles; the key being looked up is searched for in that order.\n(3) Compaction: as files accumulate, they need to be compacted to reclaim invalid data (reducing storage usage) and to reduce the number of files (improving read efficiency).\nCompaction merges the MemStore and HFiles, drops invalid keys (those overwritten by a newer version or deleted), writes new files, and reclaims the old ones.\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As this shows, HBase/BigTable is well suited to write-heavy applications. Sequential reads also perform well, which suits batch processing (for example MapReduce). Two of the use cases described in the BigTable paper fit this profile:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"(1) Storing crawled web pages and processing them with MapReduce;\n(2) Google Analytics, which collects user click data and analyzes it periodically with MapReduce to generate site traffic reports.\n","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"1.3 Storage Partitioning","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HBase partitions data automatically, combining horizontal and vertical partitioning:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Horizontal partitioning: key-value data is divided into Regions by RowKey, and each Region is assigned to one Region Server, which scales well. When a Region grows too large, it is split into two Regions. When a Region Server is overloaded, some of its Regions are migrated to another Region Server.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Vertical partitioning: within a Region, the data of each ColumnFamily is stored separately, which gives each family better access locality.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"According to ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"http://hbase.apache.org/book.html#trouble.namenode.hbase.objects","attrs":{}}],"attrs":{}},{"type":"text","text":", HBase lays out its data on HDFS in the following directory structure:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"/hbase\n /data\n / (Namespaces in the cluster)\n / (Tables in the cluster)\n /