HBase: On the number of column families

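1.Note

This post is mainly for my own study and review. It is a translation of, and my notes on, the "On the number of column families" section of the official HBase documentation.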

2.On the number of column families

HBase currently does not do well with anything above two or three column families so keep the number of column families in your schema low. Currently, flushing is done on a per Region basis so if one column family is carrying the bulk of the data bringing on flushes, the adjacent families will also be flushed even though the amount of data they carry is small. When many column families exist the flushing interaction can make for a bunch of needless i/o (To be addressed by changing flushing to work on a per column family basis). In addition, compactions triggered at table/region level will happen per store too.

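In my own words: HBase currently handles more than two or three column families poorly, so keep the number of column families in your schema small. Flushing is performed per region, so when one column family holds most of the data and triggers a flush, the neighboring families are flushed too, even though they hold very little data. With many column families, this interaction produces a lot of needless I/O (the planned fix is to flush per column family). In addition, a compaction triggered at the table or region level runs on every store, that is, once for each column family.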

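1.Keep a table to no more than two or three column families; fewer is better

2.A flush caused by one column family also flushes the neighboring families in the same region, which wastes I/O

3.A compaction triggered at the table/region level runs on every store, so its cost also grows with the number of column families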
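The flush interaction described above can be illustrated with a toy model. This is only a sketch and not HBase code: the `simulate` function, the 128 MB threshold, and the `big`/`small` family names are all assumptions made up for illustration. The only behavior taken from the documentation is that the flush decision is made for the whole region, so every family is written out together.

```python
from collections import defaultdict

FLUSH_THRESHOLD = 128  # hypothetical per-region memstore limit, in MB


def simulate(writes, threshold=FLUSH_THRESHOLD):
    """Toy model of per-region flushing (not real HBase behavior in detail).

    writes: iterable of (family, size_mb) pairs in arrival order.
    Returns {family: [size of each flushed file in MB, ...]}.
    """
    memstore = defaultdict(int)   # in-memory data held per column family
    files = defaultdict(list)     # sizes of the files flushed per family
    for family, size_mb in writes:
        memstore[family] += size_mb
        # The flush decision looks at the region total, not at one family.
        if sum(memstore.values()) >= threshold:
            for fam, held in memstore.items():
                if held > 0:
                    files[fam].append(held)
            memstore.clear()
    return dict(files)


# 'big' receives 63 MB for every 1 MB that 'small' receives.
writes = [("big", 63), ("small", 1)] * 20
files = simulate(writes)
print({fam: (len(sizes), sum(sizes)) for fam, sizes in files.items()})
```

In this run both families end up with ten files each, but the files for `small` hold only 2 MB apiece. Those tiny files are the needless I/O the documentation warns about, and each of them later has to be picked up by a compaction.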

Try to make do with one column family if you can in your schemas. Only introduce a second and third column family in the case where data access is usually column scoped; i.e. you query one column family or the other but usually not both at the one time.

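Use a single column family if your schema allows it. Add a second or third column family only when data access is usually scoped to a column family, that is, when a query reads one family or the other but rarely both at the same time.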
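Why column-scoped access is the deciding factor can be shown with a deliberately crude cost model. The sketch below assumes that a full scan reads all the stored data of every family it touches and skips families it does not need; real HBase reads are more selective than that, so treat the numbers as an upper bound. The `bytes_touched` helper, the `meta`/`body` column groups, and the sizes are hypothetical.

```python
def bytes_touched(layout, wanted_groups):
    """Rough cost of one full scan, in bytes.

    layout: {family_name: {column_group: bytes_stored}}
    wanted_groups: set of column groups the query needs.
    Assumption: a scan reads everything stored in each family it touches
    and skips families that hold none of the wanted groups.
    """
    total = 0
    for groups in layout.values():
        if groups.keys() & wanted_groups:
            total += sum(groups.values())
    return total


GB = 10**9
# Same data, two hypothetical schemas.
one_family = {"d": {"meta": 1 * GB, "body": 50 * GB}}
two_families = {"m": {"meta": 1 * GB}, "b": {"body": 50 * GB}}

for name, layout in [("one family", one_family), ("two families", two_families)]:
    meta_only = bytes_touched(layout, {"meta"})
    both = bytes_touched(layout, {"meta", "body"})
    print(f"{name}: meta-only scan {meta_only // GB} GB, full scan {both // GB} GB")
```

Under these assumptions the second family pays off only for the `meta`-only query (1 GB instead of 51 GB). A query that needs both groups touches the same 51 GB in either layout, so if most queries look like that, the extra family adds flush and compaction overhead without saving any reads.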

3.Cardinality of ColumnFamilies

Where multiple ColumnFamilies exist in a single table, be aware of the cardinality (i.e., number of rows). If ColumnFamilyA has 1 million rows and ColumnFamilyB has 1 billion rows, ColumnFamilyA’s data will likely be spread across many, many regions (and RegionServers). This makes mass scans for ColumnFamilyA less efficient.

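When a single table has multiple column families, pay attention to the cardinality (the number of rows) of each. If ColumnFamilyA has 1 million rows and ColumnFamilyB has 1 billion rows, ColumnFamilyA's data will probably be spread across many regions (and RegionServers), because regions are split by row-key range and every family in the table shares the same region boundaries. This makes large scans of ColumnFamilyA less efficient.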

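In short, the more column families a table has, the slower and less efficient scans tend to become, so use as few column families as possible, for example just one.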
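The cardinality effect can be put into rough numbers. The following is a back-of-the-envelope sketch, not a description of HBase's actual split policy: it assumes regions split purely on total stored size, that ColumnFamilyA's rows are spread evenly over the row-key space, and it uses made-up row sizes and a made-up 10 GiB split size. `region_estimate` is a hypothetical helper.

```python
import math


def region_estimate(rows_a, rows_b, bytes_per_row_a, bytes_per_row_b, max_region_bytes):
    """Estimate the region count for a table with two column families.

    Assumptions: regions split purely on total stored size, and
    ColumnFamilyA's rows are spread evenly over the row-key space.
    Returns (number of regions, ColumnFamilyA rows per region).
    """
    total = rows_a * bytes_per_row_a + rows_b * bytes_per_row_b
    regions = math.ceil(total / max_region_bytes)
    return regions, rows_a / regions


GIB = 1024**3
SPLIT_SIZE = 10 * GIB  # hypothetical maximum region size

# ColumnFamilyA: 1 million rows; ColumnFamilyB: 1 billion rows.
regions, a_rows_per_region = region_estimate(
    rows_a=1_000_000, rows_b=1_000_000_000,
    bytes_per_row_a=100, bytes_per_row_b=1000,  # hypothetical row sizes
    max_region_bytes=SPLIT_SIZE,
)
# The same ColumnFamilyA data in a table of its own.
alone, _ = region_estimate(1_000_000, 0, 100, 0, SPLIT_SIZE)

print(regions, round(a_rows_per_region), alone)
```

With these numbers the table needs 94 regions, so a full scan of ColumnFamilyA has to visit all 94 of them to collect roughly ten thousand rows from each, whereas the same data stored alone would fit in a single region.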

4.Summary

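1.The number of column families in a table directly affects the efficiency of queries, updates, and inserts

2.A table should have no more than two or three column families, ideally one

3.When one column family triggers a flush, the other families in the same region are flushed as well, which in turn leads to extra compaction work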

The above is purely my own understanding. If you spot a problem, please contact me!
