HBase application tuning: how to adjust the parameters

Original

2020-02-24 20:41

HBase reads data with Scan. The settings that speed up reads are:

Scan scan = new Scan();
scan.setCaching(500);        // 1 is the default in Scan, which will be bad for MapReduce jobs
scan.setCacheBlocks(false);  // don't set to true for MR jobs

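where:

public Scan setCacheBlocks(boolean cacheBlocks) // Set whether blocks should be cached for this Scan

    The default is true. Data can be served from memory, from the block cache, or from disk, and a read normally goes memory -> cache -> disk. setCacheBlocks(true) is a poor fit for MapReduce jobs.
    An MR job scans non-hot data that does not need caching. The BlockCache is LRU, i.e. it evicts the least recently used blocks, so the blocks that one request (say, one map task's read) pulls into the BlockCache are of no use to the next request (another map task's read) and all have to be evicted. The RegionServer then keeps moving blocks into and out of the BlockCache for nothing, which adds unnecessary I/O and can push out blocks that really are hot. For ordinary reads that touch a small range, or that look up the hottest data, block caching does improve performance.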
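To make the eviction argument concrete, here is a toy model of an LRU cache built on java.util.LinkedHashMap. It is only a sketch: it is not HBase's BlockCache implementation, and the class name, block names, capacity, and block counts are all made up for illustration. It counts how many "hot" blocks are still cached after a one-pass scan, with and without caching the scanned blocks.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class LruScanDemo {
    /** Toy LRU cache: access-ordered map that drops the eldest entry once it is over capacity. */
    static final class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int capacity;

        LruCache(int capacity) {
            super(16, 0.75f, true); // true = iterate in access order (least recently used first)
            this.capacity = capacity;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > capacity;
        }
    }

    /** Returns how many hot blocks remain cached after a one-pass scan over scannedBlocks blocks. */
    static int hotBlocksSurviving(int capacity, int hotBlocks, int scannedBlocks, boolean cacheScanBlocks) {
        LruCache<String, byte[]> cache = new LruCache<>(capacity);
        for (int i = 0; i < hotBlocks; i++) {
            cache.put("hot-" + i, new byte[0]);
        }
        if (cacheScanBlocks) {
            // Models setCacheBlocks(true): every block the scan touches is inserted once and never reused.
            for (int i = 0; i < scannedBlocks; i++) {
                cache.put("scan-" + i, new byte[0]);
            }
        }
        int surviving = 0;
        for (int i = 0; i < hotBlocks; i++) {
            if (cache.containsKey("hot-" + i)) {
                surviving++;
            }
        }
        return surviving;
    }

    public static void main(String[] args) {
        System.out.println("scan cached:     " + hotBlocksSurviving(100, 50, 1000, true));
        System.out.println("scan not cached: " + hotBlocksSurviving(100, 50, 1000, false));
    }
}
```

With these made-up numbers the cached scan leaves none of the 50 hot blocks in place, while the uncached scan leaves all 50, which is the effect setCacheBlocks(false) is meant to achieve for full-table MR scans.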

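public Scan setCaching(int caching)

// Sets how many rows the scanner fetches per RPC (one RPC call returns multiple rows). A larger value speeds up scanners but uses more memory; set too high, it causes slow responses, timeouts, or OOM,
so look for a compromise between the amount of data moved per RPC and the memory it takes. If unset, the value comes from the configuration setting HConstants.HBASE_CLIENT_SCANNER_CACHING, which is 1 in the older HBase releases this post targets (later releases ship a larger default).

public void setBatch(int batch) // Sets the maximum number of columns returned per Result. The default is unlimited, i.e. all columns of a row are returned. In effect it controls how many columns one next() call transfers: with setBatch(5), each Result instance carries at most 5 columns. (http://hbase.apache.org/apidocs/org/apache/hadoop/hbase/client/Scan.html)
setBatch is meant for the case where you rely on the client-side scanner cache to fetch in bulk for performance, but very large rows may not fit in client memory. Batching through the HBase client API then splits such a row across several Results.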

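Adjusting setCaching and setBatch together changes the number of RPC round trips, and therefore the elapsed time:

A simple example: a table with two column families, 10 columns in each, and 10 rows of data. The combinations compared:
    Caching: 1, Batch: 1, Results: 200, RPCs: 201
    Caching: 200, Batch: 1, Results: 200, RPCs: 2
    Caching: 5, Batch: 100, Results: 10, RPCs: 3
    Caching: 5, Batch: 20, Results: 10, RPCs: 3
    Caching: 10, Batch: 10, Results: 20, RPCs: 3
To estimate the number of RPCs: multiply the number of rows by the number of columns per row, divide by the smaller of the batch size and the columns per row, then divide by the caching value. Each RPC figure above is one more than this formula gives, which is consistent with one extra call being needed to find out that the scan is finished.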
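The estimate can be written down as a small calculator. This is a sketch under two assumptions that are not stated in the post: partial groups are rounded up, and one extra RPC is added at the end, which is what makes the numbers match the five combinations listed above. The class and method names are invented for the example, and nothing here calls HBase.

```java
final class ScanRpcEstimate {
    /** Results returned: each row is split into ceil(cols / min(cols, batch)) Result objects. */
    static long results(long rows, long colsPerRow, long batch) {
        long colsPerResult = Math.min(colsPerRow, batch);
        long resultsPerRow = (colsPerRow + colsPerResult - 1) / colsPerResult; // ceiling division
        return rows * resultsPerRow;
    }

    /** RPCs: ceil(results / caching), plus one assumed extra call that detects the end of the scan. */
    static long rpcs(long rows, long colsPerRow, long batch, long caching) {
        long r = results(rows, colsPerRow, batch);
        return (r + caching - 1) / caching + 1;
    }

    public static void main(String[] args) {
        long rows = 10;
        long colsPerRow = 20; // 2 column families x 10 columns
        long[][] combos = { {1, 1}, {200, 1}, {5, 100}, {5, 20}, {10, 10} }; // {caching, batch}
        for (long[] c : combos) {
            System.out.println("Caching: " + c[0] + ", Batch: " + c[1]
                    + ", Results: " + results(rows, colsPerRow, c[1])
                    + ", RPCs: " + rpcs(rows, colsPerRow, c[1], c[0]));
        }
    }
}
```

Running main reproduces the Results and RPCs columns of the example table, so the helper can be used to try other caching and batch values before touching a real cluster.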

===================================================

Other tuning parameters:

Reference: http://joshuasabrina.iteye.com/blog/1798239

1. Write Buffer Size

HTable htable = new HTable(config, tablename);
htable.setWriteBufferSize(10 * 1024 * 1024);  // client-side write buffer: 10 MB
htable.setAutoFlush(false);                   // buffer puts on the client instead of sending one RPC per put
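Why a larger write buffer helps can be seen with a toy model of client-side buffering. The sketch below does not use the HBase API; it assumes that puts accumulate until the buffered bytes exceed the limit and are then sent in one flush, and the class name, put size, and put count are invented for illustration.

```java
final class WriteBufferModel {
    private final long limitBytes;
    private long bufferedBytes = 0;
    private int flushes = 0;

    WriteBufferModel(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    /** Buffers one put and flushes as soon as the buffered size exceeds the limit. */
    void put(long putSizeBytes) {
        bufferedBytes += putSizeBytes;
        if (bufferedBytes > limitBytes) {
            flush();
        }
    }

    /** Sends whatever is buffered; a no-op when the buffer is empty. */
    void flush() {
        if (bufferedBytes > 0) {
            flushes++;
            bufferedBytes = 0;
        }
    }

    /** Number of flushes needed to write `puts` puts of `putSizeBytes` each. */
    static int flushesFor(int puts, long putSizeBytes, long limitBytes) {
        WriteBufferModel model = new WriteBufferModel(limitBytes);
        for (int i = 0; i < puts; i++) {
            model.put(putSizeBytes);
        }
        model.flush(); // final explicit flush so the tail of the buffer is not lost
        return model.flushes;
    }

    public static void main(String[] args) {
        System.out.println("10 MB buffer: " + flushesFor(100_000, 1024, 10L * 1024 * 1024) + " flushes");
        System.out.println("no buffering: " + flushesFor(100_000, 1024, 0) + " flushes");
    }
}
```

In this model 100,000 puts of 1 KB each need 10 flushes with a 10 MB buffer and 100,000 flushes with no buffering. The final flush matters in real code too: with auto-flush off, anything still in the buffer has to be flushed explicitly (the old HTable API has flushCommits(), and close() also flushes) or it never reaches the server.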

2. Compressing intermediate map output

References (
http://blog.csdn.net/yangbutao/article/details/8474731
http://m.blog.csdn.net/blog/u012875880/21874293
http://developer.51cto.com/art/201204/331337.htm
)

In older versions, use
    conf.setCompressMapOutput(true);
    conf.setMapOutputCompressorClass(GzipCodec.class);

The newer way to set it is
    conf.setBoolean("mapred.compress.map.output", true);
    //conf.setClass("mapred.map.output.compression.codec", GzipCodec.class, CompressionCodec.class);

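    If no codec is set, Hadoop falls back to DefaultCodec (zlib/DEFLATE), not GzipCodec; you can specify GzipCodec.class, SnappyCodec.class, or LzopCodec.class explicitly.
LZO and Snappy only work if their native libraries are installed on the operating system.
Files in TextFile, SequenceFile, or any user-defined format can all be compressed, and different situations call for different algorithms. bzip2 and GZIP are the most CPU-intensive and give the highest compression ratios, but a GZIP file cannot be split into blocks for parallel processing.
Snappy and LZO are roughly on par, with Snappy slightly ahead; both use less CPU than GZIP.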

Comparison between compression algorithms
Algorithm   % remaining   Encoding    Decoding
GZIP        13.4%         21 MB/s     118 MB/s
LZO         20.5%         135 MB/s    410 MB/s
Snappy      22.2%         172 MB/s    409 MB/s
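The "% remaining" column is the compressed size as a percentage of the original size. The sketch below measures that figure for GZIP with the JDK's own java.util.zip classes. The sample data is synthetic and the class name is invented, so the percentage it prints will not match the table; Snappy and LZO are left out because the JDK standard library does not include them.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class GzipRatioDemo {
    /** Compresses the input with GZIP and returns the compressed bytes. */
    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Decompresses GZIP bytes back to the original content. */
    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Synthetic, repetitive key/value lines standing in for map output (made up for this example). */
    static byte[] sampleData() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 2000; i++) {
            sb.append("row-").append(i).append("\tcf:col\tvalue-").append(i % 17).append('\n');
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Compressed size as a percentage of the original size. */
    static double percentRemaining(byte[] original, byte[] compressed) {
        return 100.0 * compressed.length / original.length;
    }

    public static void main(String[] args) {
        byte[] data = sampleData();
        byte[] packed = gzip(data);
        System.out.printf("original=%d bytes, gzip=%d bytes, remaining=%.1f%%%n",
                data.length, packed.length, percentRemaining(data, packed));
    }
}
```

Timing the gzip and gunzip calls on a representative sample of real map output is a reasonable way to check whether the CPU cost is worth the saved shuffle I/O for a given job.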

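Handling runtime errors

1. Error received by the client

ScannerTimeoutException: org.apache.hadoop.hbase.client.ScannerTimeoutException
    This is thrown when the time spent transferring data from the server to the client, or the time the client spends processing the data, exceeds the scanner timeout. One might try to raise the timeout in client code:
    Configuration conf = HBaseConfiguration.create();
    conf.setLong(HConstants.HBASE_REGIONSERVER_LEASE_PERIOD_KEY, 120000);
    However, the scan timeout is configured on the Region Server, so setting it in the client has no effect. To change the value, edit hbase-site.xml on the Region Servers and restart the cluster.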

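2. Error org.apache.hadoop.hbase.regionserver.LeaseException: org.apache.hadoop.hbase.regionserver.LeaseException: lease '' does not exist

    Some background first. A lease that has expired or does not exist refers to this: each time an HBase client interacts with a regionserver, the server creates a lease whose lifetime is set by the parameter hbase.regionserver.lease.period.
    The mechanism is as follows:
    When a client fetches data from a regionserver, and the table is large and spread over many regions, the requested region may not be in memory or in the cache and has to be loaded from disk. If loading takes longer than hbase.regionserver.lease.period and the client has not told the regionserver that it is still alive, the regionserver considers the lease expired and removes it from the LeaseQueue. When loading finishes and the already deleted lease is used to fetch the data, the error above appears.
    The usual fix is to lengthen the lease by raising hbase.regionserver.lease.period (default 60000, i.e. one minute); as section 1 notes, the value has to be changed on the Region Server to take effect:
        conf.setLong(HConstants.HBASE_REGIONSERVER_LEASE_PERIOD_KEY, 120000);
    If the error persists, hbase.rpc.timeout may be the cause. When you raise hbase.regionserver.lease.period, raise hbase.rpc.timeout as well, and keep hbase.rpc.timeout equal to or greater than hbase.regionserver.lease.period:
        conf.setLong("hbase.rpc.timeout", 1200000);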
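The relationship between the two timeouts can be checked mechanically before a job is launched. The helper below is a hypothetical sketch that works on a plain Map of settings rather than on an HBase Configuration. The 60000 ms default for the lease period comes from the text above; using the same 60000 ms default for hbase.rpc.timeout is an assumption of this example.

```java
import java.util.HashMap;
import java.util.Map;

final class TimeoutCheck {
    static final String LEASE_KEY = "hbase.regionserver.lease.period";
    static final String RPC_KEY = "hbase.rpc.timeout";
    static final long DEFAULT_MS = 60000L; // lease default from the text; assumed for the RPC timeout too

    /** Reads a millisecond setting, falling back to the default when the key is absent. */
    static long millis(Map<String, String> settings, String key) {
        return Long.parseLong(settings.getOrDefault(key, String.valueOf(DEFAULT_MS)));
    }

    /** True when hbase.rpc.timeout is equal to or greater than hbase.regionserver.lease.period. */
    static boolean rpcTimeoutCoversLease(Map<String, String> settings) {
        return millis(settings, RPC_KEY) >= millis(settings, LEASE_KEY);
    }

    public static void main(String[] args) {
        Map<String, String> settings = new HashMap<>();
        settings.put(LEASE_KEY, "120000");
        System.out.println("lease raised only: " + rpcTimeoutCoversLease(settings));
        settings.put(RPC_KEY, "1200000");
        System.out.println("both raised:       " + rpcTimeoutCoversLease(settings));
    }
}
```

Raising only the lease period fails the check because the RPC timeout is still at its default, which is the situation the second fix above addresses.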
