If you need to run MapReduce jobs on data stored in HBase, remember to turn on compression.
I/O speed is almost always a performance bottleneck, so focusing on I/O is generally good practice when tuning performance.
Data compression is one way to improve I/O performance.
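As a rough illustration of why compression helps I/O, here is a minimal sketch using zlib from the Python standard library as a stand-in (HBase would use LZO or Snappy, which are not in the standard library). The sample rows are made up for the example and are far more repetitive than real data is likely to be.

```python
import zlib

# Hypothetical, highly repetitive rows standing in for table data.
rows = b"".join(b"row-%06d,status=OK,region=us-east\n" % i for i in range(10000))

# zlib is only a stand-in here; it is not the codec HBase would use.
compressed = zlib.compress(rows)

# Fewer bytes to write to and read from disk means less I/O per job.
print(len(rows), "bytes raw ->", len(compressed), "bytes compressed")

# Compression is lossless: decompressing returns the original bytes.
restored = zlib.decompress(compressed)
```

The trade-off is extra CPU time to compress and decompress, which is usually worthwhile when the job is I/O-bound.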
The table below shows our case: HBase with LZO compression compared with no compression.
Compression algorithm | Record count | HDFS space usage (GB) | MapReduce job time |
NONE | 400,000 | 190 | 19 min 24 sec |
LZO | 400,000 | 46 | 9 min 34 sec |
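With LZO the job ran about twice as fast (roughly a 100% increase in throughput) and the data took about a quarter of the HDFS space. Impressive.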
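The arithmetic behind that comparison can be checked with a short sketch. The figures are copied from the table above; the variable names are only for illustration.

```python
# Figures copied from the table above.
none_gb, lzo_gb = 190, 46
none_secs = 19 * 60 + 24   # 19 min 24 sec without compression
lzo_secs = 9 * 60 + 34     # 9 min 34 sec with LZO

speedup = none_secs / lzo_secs          # how many times faster the job ran
time_saved = 1 - lzo_secs / none_secs   # fraction of job time removed
space_saved = 1 - lzo_gb / none_gb      # fraction of HDFS space removed

print(f"speedup: {speedup:.2f}x")             # speedup: 2.03x
print(f"job time saved: {time_saved:.0%}")    # job time saved: 51%
print(f"HDFS space saved: {space_saved:.0%}") # HDFS space saved: 76%
```

So the job time was cut roughly in half, which is the same thing as about double the throughput.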
As for the compression algorithm, Snappy is another option, and it appears to be even faster than LZO.
See http://blog.cloudera.com/blog/2011/09/snappy-and-hadoop/ and http://blog.erdemagaoglu.com/post/4605524309/lzo-vs-snappy-vs-lzf-vs-zlib-a-comparison-of