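I. This approach has several advantages:

1. Loading a huge volume of data into HBase in one go through the normal write path is slow and also ties up a lot of Region resources. A more efficient and convenient approach is "Bulk Loading", using the HFileOutputFormat class that HBase provides.

2. It relies on the fact that HBase stores its data on HDFS in a specific file format (HFile). We generate files in that on-disk format directly and then move them to the right location, which completes a fast load of a very large dataset. The files are produced with MapReduce, which is efficient and convenient, and the load neither ties up Region resources nor adds write load.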

II. This approach also has significant limitations:

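1. It is best suited to an initial import, when the table is empty, or to cases where the table holds no data before each load. (LoadIncrementalHFiles can also load into a table that already contains data, so treat this as a recommendation rather than a hard requirement.)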

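2. The HBase cluster and the Hadoop cluster must be the same cluster: the MapReduce job that generates the HFiles has to write to the same HDFS that HBase runs on.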

III. Next, a demo that briefly walks through the whole process.

1. Generating the HFiles

package zl.hbase.mr;
 
import java.io.IOException;
 
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.mapreduce.KeyValueSortReducer;
import org.apache.hadoop.hbase.mapreduce.SimpleTotalOrderPartitioner;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;
 
import zl.hbase.util.ConnectionUtil;
 
public class HFileGenerator {
 
    public static class HFileMapper extends
            Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] items = line.split(",", -1);
            // Expected input line: rowkey,family,qualifier,value
            byte[] row = Bytes.toBytes(items[0]);
            ImmutableBytesWritable rowkey = new ImmutableBytesWritable(row);
 
            KeyValue kv = new KeyValue(row,
                    Bytes.toBytes(items[1]), Bytes.toBytes(items[2]),
                    System.currentTimeMillis(), Bytes.toBytes(items[3]));
            // A freshly constructed KeyValue is never null, so no check is needed
            context.write(rowkey, kv);
        }
    }
 
    public static void main(String[] args) throws IOException,
            InterruptedException, ClassNotFoundException {
        Configuration conf = new Configuration();
        String[] dfsArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
 
        Job job = new Job(conf, "HFile bulk load test");
        job.setJarByClass(HFileGenerator.class);
 
        job.setMapperClass(HFileMapper.class);
        job.setReducerClass(KeyValueSortReducer.class);
 
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);
        // The mapper emits KeyValue, so the declared map output value class must match
        job.setMapOutputValueClass(KeyValue.class);
 
        job.setPartitionerClass(SimpleTotalOrderPartitioner.class);
 
        FileInputFormat.addInputPath(job, new Path(dfsArgs[0]));
        FileOutputFormat.setOutputPath(job, new Path(dfsArgs[1]));
 
        HFileOutputFormat.configureIncrementalLoad(job,
                ConnectionUtil.getTable());
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
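
The mapper above assumes that every input line has the form rowkey,family,qualifier,value and would fail on a shorter line. As a minimal, HBase-free sketch of that parsing step, the helper below splits a line into the four fields and rejects lines that are too short. The class name CsvCellParser and the decision to return null for malformed lines are assumptions made for illustration; they are not part of the original program.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of HBase: splits one CSV line into the four
// fields the mapper expects (row key, column family, qualifier, value).
class CsvCellParser {

    // Returns the four fields as UTF-8 byte arrays, or null if the line
    // has fewer than four comma-separated fields.
    static byte[][] parse(String line) {
        // A limit of -1 keeps trailing empty fields, as in the mapper above.
        String[] items = line.split(",", -1);
        if (items.length < 4) {
            return null;
        }
        byte[][] cell = new byte[4][];
        for (int i = 0; i < 4; i++) {
            cell[i] = items[i].getBytes(StandardCharsets.UTF_8);
        }
        return cell;
    }

    public static void main(String[] args) {
        byte[][] cell = parse("row1,cf,name,alice");
        System.out.println(new String(cell[0], StandardCharsets.UTF_8));
        System.out.println(parse("row1,cf") == null);
    }
}
```

In a real mapper, a rejected line would more likely be counted with a job counter than silently dropped.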

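Notes on the HFile generation program:

①. Whether the job's final output comes from the map phase or the reduce phase, its key and value types must be <ImmutableBytesWritable, KeyValue> or <ImmutableBytesWritable, Put>.

②. In that final output the value type is KeyValue or Put, and the corresponding sorting reducer is KeyValueSortReducer or PutSortReducer respectively.

③. The MR example's output format is HFileOutputFormat (job.setOutputFormatClass(HFileOutputFormat.class)). In the HBase version used here, HFileOutputFormat can only organize a single column family into HFiles per run.

④. In the MR example, HFileOutputFormat.configureIncrementalLoad(job, table) configures the job automatically. The total-order partitioner (SimpleTotalOrderPartitioner in the example) requires the keys to be globally ordered and then assigned to reducers so that the minimum-to-maximum key ranges of any two reducers never overlap. This is necessary because, once the data is loaded into HBase, the keys within a Region as a whole must be strictly ordered.

⑤. The HFiles produced by the MR example are stored on HDFS, and the subdirectories under the output path correspond to the column families. Loading the HFiles into HBase amounts to moving them into HBase's Regions, so afterwards the column family subdirectories no longer contain the files.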
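
Note ④ can be illustrated without any HBase dependency. The sketch below assigns each key to a partition from a sorted list of split points, comparing keys as unsigned bytes in lexicographic order, so that no two partitions share a key range. The names RangePartitionSketch, compare and partitionFor are hypothetical, and this is not the code HBase or Hadoop actually runs.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of total-order partitioning; not HBase's own code.
class RangePartitionSketch {

    // Compares byte arrays lexicographically, treating bytes as unsigned,
    // which is the order HBase uses for row keys.
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xff) - (b[i] & 0xff);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }

    // splitPoints must be sorted ascending. Partition i receives the keys in
    // [splitPoints[i-1], splitPoints[i]); partition 0 has no lower bound and
    // the last partition has no upper bound.
    static int partitionFor(byte[] key, byte[][] splitPoints) {
        int p = 0;
        while (p < splitPoints.length && compare(key, splitPoints[p]) >= 0) {
            p++;
        }
        return p;
    }

    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[][] splits = { bytes("g"), bytes("p") }; // two split points, three partitions
        for (String k : new String[] { "apple", "g", "kiwi", "zebra" }) {
            System.out.println(k + " -> " + partitionFor(bytes(k), splits));
        }
    }
}
```

When the split points are the start keys of the table's Regions, each reducer's output falls entirely inside one Region, which is why the generated HFiles can be moved into place without being rewritten.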

2. Loading the HFiles into HBase

package zl.hbase.bulkload;
 
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.util.GenericOptionsParser;
 
import zl.hbase.util.ConnectionUtil;
 
public class HFileLoader {
 
    public static void main(String[] args) throws Exception {
        String[] dfsArgs = new GenericOptionsParser(
                ConnectionUtil.getConfiguration(), args).getRemainingArgs();
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(
                ConnectionUtil.getConfiguration());
        loader.doBulkLoad(new Path(dfsArgs[0]), ConnectionUtil.getTable());
    }
 
}

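The generated HFiles are loaded into the table with the doBulkLoad method of HBase's LoadIncrementalHFiles class.

HBase supports writes, and usually we perform them through the HBase shell client or the Java API.

With large volumes of data, both of these are very time-consuming. Once you understand how HBase stores data underneath, there is an alternative: HBase stores its data in the format defined by HFile, so files in that format can be written directly.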
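
One property of the HFile format matters for anyone writing these files directly: the cells must arrive in sorted order, which is what KeyValueSortReducer takes care of in the bulk-load job. As a minimal sketch of that order (row, then column family, then qualifier ascending, then newest timestamp first), the example below sorts a few hypothetical cells. The Cell class here is a simplified stand-in for KeyValue, and plain String comparison is used on the assumption that all keys are ASCII.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical stand-in for HBase cells; the real class is KeyValue.
class CellOrderSketch {

    static final class Cell {
        final String row;
        final String family;
        final String qualifier;
        final long timestamp;

        Cell(String row, String family, String qualifier, long timestamp) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }

        @Override
        public String toString() {
            return row + "/" + family + ":" + qualifier + "/" + timestamp;
        }
    }

    // Row, family and qualifier ascending, then newest timestamp first.
    // String comparison matches byte order only for ASCII keys.
    static final Comparator<Cell> ORDER = (x, y) -> {
        int d = x.row.compareTo(y.row);
        if (d == 0) d = x.family.compareTo(y.family);
        if (d == 0) d = x.qualifier.compareTo(y.qualifier);
        if (d == 0) d = Long.compare(y.timestamp, x.timestamp);
        return d;
    };

    // Returns a sorted copy and leaves the input list untouched.
    static List<Cell> sorted(List<Cell> cells) {
        List<Cell> out = new ArrayList<>(cells);
        out.sort(ORDER);
        return out;
    }

    public static void main(String[] args) {
        List<Cell> cells = new ArrayList<>();
        cells.add(new Cell("r2", "cf", "a", 1L));
        cells.add(new Cell("r1", "cf", "b", 5L));
        cells.add(new Cell("r1", "cf", "a", 1L));
        cells.add(new Cell("r1", "cf", "a", 9L));
        for (Cell c : sorted(cells)) {
            System.out.println(c);
        }
    }
}
```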

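Bulk import into HBase consists of two main steps:

1. Use MapReduce to generate, under the output directory OutputDir, a set of HFiles laid out the same way as a Store's on-disk structure.
2. Use LoadIncrementalHFiles.doBulkLoad to load the data in OutputDir into the HBase table.

Advantages

HBase provides classes for writing HFiles directly and then loading them into a table, much like LOAD in a traditional database, so rows no longer have to be inserted one at a time through the shell client or the Java API. These interfaces are simple, fast and flexible.
The application does not need to stay connected to the HBase cluster issuing multi-put RPCs, which makes the MapReduce job more efficient.
The HBase cluster handles fewer unnecessary connections and is free to do other work, which improves efficiency and lowers the risk that heavy concurrent writes pose to the cluster.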

1. Bulk import from HDFS

In the MapReduce job, write the desired output as HFileOutputFormat files, then load them with LoadIncrementalHFiles.doBulkLoad. For example:

Java code

Configuration conf = getConf();
conf.set("hbase.table.name", args[2]);
// Load hbase-site.xml
HBaseConfiguration.addHbaseResources(conf);
Job job = new Job(conf, "HBase Bulk Import Example");
job.setJarByClass(Mapper2.class);
job.setMapperClass(Mapper2.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(KeyValue.class);
job.setInputFormatClass(TextInputFormat.class);
// Auto configure partitioner and reducer
HTable hTable = new HTable(conf, args[2]);
HFileOutputFormat.configureIncrementalLoad(job, hTable);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
// Load the generated HFiles into the table only if the job succeeded
if (job.waitForCompletion(true)) {
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    loader.doBulkLoad(new Path(args[1]), hTable);
}

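2. Bulk import from MySQL

This week I imported some MySQL tables into production HBase tables. The MySQL table is sharded into 100 parts. Before the HBase cluster was opened to business traffic, importing them into HBase with Sqoop took about 32 hours in total (run serially). When the HBase cluster was busy, a single table had still not finished importing after 10 hours. There is a reason for this: Sqoop did not implement bulk loading and generally writes to HBase as it reads.

I later wrote my own application to bulk import from MySQL into HBase, and each table then took only about 8 minutes on average to import.

The core code is as follows:

Java code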

HBaseConfiguration.addHbaseResources(conf);
Job job = new Job(conf, "Load_MySQL_" + table + "_to_HBase_" + hbaseTable);
// Mapper that reads the rows from MySQL
job.setJarByClass(MysqlMapper.class);
job.setMapperClass(MysqlMapper.class);
job.setMapOutputKeyClass(ImmutableBytesWritable.class);
job.setMapOutputValueClass(KeyValue.class);
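// Configure the database connection
DBConfiguration.configureDB(job.getConfiguration(), driver, connect, username, password);
// Two alternative ways to describe the input (free-form query or table plus
// conditions); normally only one of these two calls is needed
DataDrivenDBInputFormat.setInput(job, dbWritableClass, query, boundaryQuery);
DataDrivenDBInputFormat.setInput(job, dbWritableClass, table, conditions, splitBy, columns);
// Set the output path for the generated HFiles
FileOutputFormat.setOutputPath(job, new Path(tmpTargetDir));
// Automatically configure the partitioner and reducer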
HTable hTable = new HTable(conf, hbaseTable);
HFileOutputFormat.configureIncrementalLoad(job, hTable);
// Once the job above has succeeded, bulk load the generated HFiles into HBase
if (job.waitForCompletion(true)) {
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    loader.doBulkLoad(new Path(tmpTargetDir), hTable);
}
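
The splitBy column and boundaryQuery arguments above are what let several mappers read the MySQL table in parallel: the range of the split column is cut into contiguous pieces, and each mapper queries one piece. As a rough sketch of that idea only, assuming an integer split column, the helper below turns a minimum and maximum value into one WHERE condition per split. The class SplitRangeSketch and its method are hypothetical; Hadoop's DataDrivenDBInputFormat uses its own splitter classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; Hadoop's DataDrivenDBInputFormat has its own splitters.
class SplitRangeSketch {

    // Cuts [min, max] into at most numSplits contiguous ranges and returns
    // one WHERE condition per range. The last range includes max.
    // Assumes numSplits >= 1 and that max - min does not overflow a long.
    static List<String> conditions(String column, long min, long max, int numSplits) {
        List<String> out = new ArrayList<>();
        long span = max - min + 1;
        long step = Math.max(1, (span + numSplits - 1) / numSplits); // ceiling division
        for (long lo = min; lo <= max; lo += step) {
            long hi = Math.min(lo + step, max + 1);
            out.add(column + " >= " + lo + " AND " + column + " < " + hi);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String c : conditions("id", 1, 10, 3)) {
            System.out.println(c);
        }
    }
}
```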
