How to Design an Efficient HBase Data Model
{"type":"doc","content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"These notes collect the HBase fundamentals that, based on my experience learning and using HBase, a user needs to know.","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":3},"content":[{"type":"text","text":"1. Background","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"1.1 Data Model","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The hardest part of learning HBase/BigTable is understanding its data model: in other words, how is it actually meant to be used? The BigTable paper states it plainly:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"A Bigtable is a sparse, distributed, persistent multidimensional sorted map.\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"The paper goes on to explain:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Two articles explain this very clearly: ","attrs":{}},{"type":"link","attrs":{"href":"https://dzone.com/articles/understanding-hbase-and-bigtab","title":"","type":null},"content":[{"type":"text","text":"understanding-hbase-and-bigtab","attrs":{}}]},{"type":"text","text":" and 
","attrs":{}},{"type":"link","attrs":{"href":"http://0b4af6cdc2f0c5998459-c0245c5c937c5dedcca3f1764ecc9b2f.r43.cf2.rackcdn.com/9353-login1210_khurana.pdf","title":"","type":null},"content":[{"type":"text","text":"Introduction to HBase Schema Design","attrs":{}}]},{"type":"text","text":". Here is a more concrete example:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"{\n  // ...\n  \"aaaaa\" : { // Row Key\n    \"A\" : { // Column Family\n      \"foo\" : { // Column Qualifier\n        15 : \"y\", // 15: Timestamp / Version number, \"y\": Cell Value\n        4 : \"m\"\n      },\n      \"bar\" : {\n        15 : \"d\"\n      }\n    },\n    \"B\" : {\n      \"\" : {\n        6 : \"w\",\n        3 : \"o\",\n        1 : \"w\"\n      }\n    }\n  },\n  // ...\n}\n\nBreakdown:\n  map: stores key-value data.\n  persistent: data is stored as files, on HDFS (or S3) in HBase and on GFS in BigTable.\n  distributed: both HBase and BigTable are built on a distributed file system, with compute separated from storage; each consists of a Master and a set of storage servers, with data partitioned across them, as in a typical distributed system.\n  sorted: row keys are sorted in lexicographic order.\n  multidimensional: as the example above shows, this is a nested map, in other words a multidimensional map.\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Physically, the data is stored in flat files, in a layout similar to the following:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"aaaaa, A:bar, 15 ---> d\naaaaa, A:foo, 15 ---> y\naaaaa, A:foo, 4 ---> m\naaaaa, B:, 6 ---> w\naaaaa, B:, 3 ---> o\naaaaa, B:, 1 ---> w\n\nThe logical structure of a key is {RowKey:ColumnFamily:Qualifier:TimeStamp}. Keys are sorted in ascending lexicographic order, with timestamps in descending order.\n","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"1.2 
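Read/Write/Compaction","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HBase/BigTable storage is implemented with an LSM (log-structured merge) approach. Plenty has been written about it elsewhere, so only the main flows a user needs to know are outlined here:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"(1) Write path: append to the log file (WAL) first, then write to the MemStore (the memtable in BigTable); when the MemStore fills up, its data is flushed to a file.\n(2) Read path: because both the MemStore and the HFiles hold data, a read resembles a multi-way merge. The newest data is in the MemStore,\nsomewhat older data is in recently generated HFiles, and the oldest data is in earlier HFiles; the key being looked up is searched in that order.\n(3) Compaction: as files accumulate, they must be compacted to reclaim obsolete data (reducing storage usage) and to reduce the number of files (improving read efficiency).\nCompaction merges the MemStore and the HFiles, drops obsolete key-values (overwritten by a newer version, or deleted), writes new files, and reclaims the old ones.\n","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"As this shows, HBase/BigTable is well suited to write-heavy applications. Sequential read performance is also good, which makes it a fit for batch processing (for example MapReduce). Two of the use cases described in the BigTable paper match this profile:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"(1) Storing crawled web pages and processing them with MapReduce;\n(2) The Google Analytics project, which collects user click data and analyzes it periodically with MapReduce to generate website traffic reports.\n","attrs":{}}]},{"type":"heading","attrs":{"align":null,"level":4},"content":[{"type":"text","text":"1.3 Storage Partitioning","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"HBase partitions data automatically, combining horizontal and vertical partitioning:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"bulletedlist","content":[{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Horizontal partitioning: key-value data is divided into Regions by RowKey, and each Region is assigned to one Region Server, which scales well. When a Region grows too large, it is split into two Regions. When a Region Server is overloaded, some of its Regions are migrated to other Region 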
Servers.","attrs":{}}]}]},{"type":"listitem","attrs":{"listStyle":null},"content":[{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"Vertical partitioning: within a Region, the data of each ColumnFamily is stored separately, which gives each family better access locality.","attrs":{}}]}]}],"attrs":{}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null},"content":[{"type":"text","text":"According to ","attrs":{}},{"type":"codeinline","content":[{"type":"text","text":"http://hbase.apache.org/book.html#trouble.namenode.hbase.objects","attrs":{}}],"attrs":{}},{"type":"text","text":", HBase stores its data in HDFS with the following directory structure:","attrs":{}}]},{"type":"paragraph","attrs":{"indent":0,"number":0,"align":null,"origin":null}},{"type":"codeblock","attrs":{"lang":"text"},"content":[{"type":"text","text":"/hbase\n /data\n / (Namespaces in the cluster)\n / (Tables in the cluster)\n /