HBase implementation classes corresponding to the MapReduce framework:
1) InputFormat class: HBase implements TableInputFormatBase, which provides most of the operations on table data; its subclass TableInputFormat supplies the complete implementation and is used to read table data and generate key/value pairs. TableInputFormat splits the table into one split per Region, i.e. there are as many splits as there are Regions. Each Region is then turned into <key, value> pairs by row: the key is the row key, and the value is the data contained in that row.
2) Mapper and Reducer classes: HBase implements TableMapper and TableReducer. TableMapper adds no concrete functionality; it only fixes the input <key, value> types to ImmutableBytesWritable (the row key) and Result (the row's cells). IdentityTableMapper and IdentityTableReducer are concrete implementations of these two classes; like the plain Mapper and Reducer, they simply pass <key, value> pairs through to the next stage.
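A minimal TableMapper sketch may make the type restriction concrete (0.94-era API; the class name RowCountMapper and the output types are illustrative assumptions): the input types are already pinned by TableMapper, so a subclass only declares its output types.

```java
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;

// Input key/value types are fixed by TableMapper:
// key = ImmutableBytesWritable (row key), value = Result (the row's cells).
// Only the output types <Text, IntWritable> are chosen by the subclass.
public class RowCountMapper extends TableMapper<Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    public void map(ImmutableBytesWritable rowKey, Result row, Context context)
            throws IOException, InterruptedException {
        // Emit <rowKey, 1> for every row scanned from the table.
        context.write(new Text(Bytes.toString(rowKey.get())), ONE);
    }
}
```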
3) OutputFormat class: HBase's TableOutputFormat writes the output <key, value> pairs to the specified HBase table. The class itself offers no setting for the WAL (Write-Ahead Log), so if writes bypass the WAL there is a risk of data loss when a server fails. MultiTableOutputFormat addresses this: it lets you configure whether writes go through the WAL.
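In the 0.94-era client API the WAL can also be toggled per mutation. A minimal sketch (the column family/qualifier and helper name here are illustrative; skipping the WAL trades durability for write speed):

```java
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class WalSketch {
    // Build a Put that bypasses the write-ahead log (HBase 0.94 API).
    public static Put unloggedPut(String rowKey, int count) {
        Put put = new Put(Bytes.toBytes(rowKey));
        put.add(Bytes.toBytes("content"), Bytes.toBytes("count"),
                Bytes.toBytes(String.valueOf(count)));
        // Faster, but this row can be lost if the region server
        // crashes before the memstore is flushed to disk.
        put.setWriteToWAL(false);
        return put;
    }
}
```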
Code:
- import java.io.IOException;
- import java.util.Iterator;
- import java.util.StringTokenizer;
- import org.apache.hadoop.conf.Configuration;
- import org.apache.hadoop.fs.Path;
- import org.apache.hadoop.hbase.HBaseConfiguration;
- import org.apache.hadoop.hbase.HColumnDescriptor;
- import org.apache.hadoop.hbase.HTableDescriptor;
- import org.apache.hadoop.hbase.client.HBaseAdmin;
- import org.apache.hadoop.hbase.client.Put;
- import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
- import org.apache.hadoop.hbase.mapreduce.TableReducer;
- import org.apache.hadoop.hbase.util.Bytes;
- import org.apache.hadoop.io.IntWritable;
- import org.apache.hadoop.io.LongWritable;
- import org.apache.hadoop.io.Text;
- import org.apache.hadoop.io.NullWritable;
- import org.apache.hadoop.mapreduce.Job;
- import org.apache.hadoop.mapreduce.Mapper;
- import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
- import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
- public class WordCountHBase {
- // The Map class extends the plain Mapper rather than TableMapper, so it reads its input from HDFS
- public static class Map extends
- Mapper<LongWritable, Text, Text, IntWritable> {
- private final static IntWritable one = new IntWritable(1);
- private Text word = new Text();
- public void map(LongWritable key, Text value, Context context)
- throws IOException, InterruptedException {
- StringTokenizer itr = new StringTokenizer(value.toString());
- while (itr.hasMoreTokens()) {
- word.set(itr.nextToken());
- context.write(word, one);
- }
- }
- }
- // The Reduce class extends TableReducer, so it writes its output to HBase
- public static class Reduce extends
- TableReducer<Text, IntWritable, NullWritable> {
- public void reduce(Text key, Iterable<IntWritable> values,
- Context context) throws IOException, InterruptedException {
- int sum = 0;
- Iterator<IntWritable> iterator = values.iterator();
- while (iterator.hasNext()) {
- sum += iterator.next().get();
- }
- // Instantiate a Put: one row per word
- Put put = new Put(Bytes.toBytes(key.toString()));
- // Column family "content", column "count", value is the count
- put.add(Bytes.toBytes("content"), Bytes.toBytes("count"),
- Bytes.toBytes(String.valueOf(sum)));
- context.write(NullWritable.get(), put);
- }
- }
- // Create the HBase table
- public static void createHBaseTable(String tableName)
- throws IOException {
- // Create the table descriptor
- HTableDescriptor htd = new HTableDescriptor(tableName);
- // Create the column family descriptor
- HColumnDescriptor col = new HColumnDescriptor("content");
- htd.addFamily(col);
- // Configure HBase
- Configuration conf = HBaseConfiguration.create();
- conf.set("hbase.zookeeper.quorum","master");
- conf.set("hbase.zookeeper.property.clientPort", "2181");
- HBaseAdmin hAdmin = new HBaseAdmin(conf);
- if (hAdmin.tableExists(tableName)) {
- System.out.println("Table already exists; dropping and recreating it.");
- hAdmin.disableTable(tableName);
- hAdmin.deleteTable(tableName);
- }
- System.out.println("Creating table: " + tableName);
- hAdmin.createTable(htd);
- }
- public static void main(String[] args) throws Exception {
- String tableName = "wordcount";
- // Step 1: create the HBase table
- WordCountHBase.createHBaseTable(tableName);
- // Step 2: run the MapReduce job
- // Configure MapReduce
- Configuration conf = new Configuration();
- // These settings are critical
- conf.set("mapred.job.tracker", "master:9001");
- conf.set("hbase.zookeeper.quorum","master");
- conf.set("hbase.zookeeper.property.clientPort", "2181");
- conf.set(TableOutputFormat.OUTPUT_TABLE, tableName);
- Job job = new Job(conf, "New Word Count");
- job.setJarByClass(WordCountHBase.class);
- // Set the Map and Reduce classes
- job.setMapperClass(Map.class);
- job.setReducerClass(Reduce.class);
- // Set the map output key/value types
- job.setMapOutputKeyClass(Text.class);
- job.setMapOutputValueClass(IntWritable.class);
- // Set the input and output formats
- job.setInputFormatClass(TextInputFormat.class);
- job.setOutputFormatClass(TableOutputFormat.class);
- // Set the input directory
- FileInputFormat.addInputPath(job, new Path("hdfs://master:9000/in/"));
- System.exit(job.waitForCompletion(true) ? 0 : 1);
- }
- }
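The core logic of the job — tokenize each line, then sum the counts per word — can be sketched and unit-tested without a Hadoop cluster at all (the class and method names here are illustrative, not part of the job above):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class WordCountLogic {
    // Tokenizes text the same way as the Mapper above (whitespace-delimited)
    // and sums the counts per word as the Reducer does.
    public static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            String word = itr.nextToken();
            Integer prev = counts.get(word);
            counts.put(word, prev == null ? 1 : prev + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // e.g. {the=2, quick=1, fox=1} (iteration order unspecified)
        System.out.println(count("the quick fox the"));
    }
}
```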
Common errors and solutions:
1. java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.TableOutputFormat
Error output excerpt:
- 13/09/10 21:14:01 INFO mapred.JobClient: Running job: job_201308101437_0016
- 13/09/10 21:14:02 INFO mapred.JobClient: map 0% reduce 0%
- 13/09/10 21:14:16 INFO mapred.JobClient: Task Id : attempt_201308101437_0016_m_000007_0, Status : FAILED
- java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.TableOutputFormat
- at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:849)
- at org.apache.hadoop.mapreduce.JobContext.getOutputFormatClass(JobContext.java:235)
- at org.apache.hadoop.mapred.Task.initialize(Task.java:513)
- at org.apache.hadoop.mapred.MapTask.run(MapTask.java:353)
- at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
- at java.security.AccessController.doPrivileged(Native Method)
- at javax.security.auth.Subject.doAs(Subject.java:396)
- at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1149)
- at org.apache.hadoop.mapred.Child.main(Child.java:249)
- Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.mapreduce.TableOutputFormat
- at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
- at java.security.AccessController.doPrivileged(Native Method)
- at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
- at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
- at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
- at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
- at java.lang.Class.forName0(Native Method)
- at java.lang.Class.forName(Class.java:249)
- at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:802)
- at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:847)
- ... 8 more
Cause:
The required HBase class files are not on the classpath of the Hadoop cluster nodes.
Fix:
Step 1. Stop HBase, then stop Hadoop:
- [hadoop@master bin]$ stop-hbase.sh
- stopping hbase............
- master: stopping zookeeper.
- [hadoop@master bin]$ jps
- 16186 Jps
- 26186 DataNode
- 26443 TaskTracker
- 26331 JobTracker
- 26063 NameNode
- [hadoop@master bin]$ stop-all.sh
- Warning: $HADOOP_HOME is deprecated.
- stopping jobtracker
- master: Warning: $HADOOP_HOME is deprecated.
- master:
- master: stopping tasktracker
- node1: Warning: $HADOOP_HOME is deprecated.
- node1:
- node1: stopping tasktracker
- stopping namenode
- master: Warning: $HADOOP_HOME is deprecated.
- master:
- master: stopping datanode
- node1: Warning: $HADOOP_HOME is deprecated.
- node1: stopping datanode
- node1:
- node1: Warning: $HADOOP_HOME is deprecated.
- node1:
- node1: stopping secondarynamenode
- [hadoop@master bin]$ jps
- 16531 Jps
Step 2. On every machine in the Hadoop cluster, edit the hadoop-env.sh file in the conf subdirectory of the Hadoop installation and add the following:
- # set hbase environment
- export HBASE_HOME=/opt/modules/hadoop/hbase/hbase-0.94.11-security
- export HADOOP_CLASSPATH=$HBASE_HOME/hbase-0.94.11-security.jar:$HBASE_HOME/hbase-0.94.11-security-tests.jar:$HBASE_HOME/conf:$HBASE_HOME/lib/zookeeper-3.4.5.jar
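Once hadoop-env.sh has been updated on every node, restart the cluster and confirm the HBase jars actually appear on Hadoop's classpath (a sketch; the `hadoop classpath` command is assumed available in this Hadoop 1.x setup):

```shell
start-all.sh
start-hbase.sh
# The HBase jars added above should now show up here:
hadoop classpath | tr ':' '\n' | grep -i hbase
```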
2. Error: java.lang.ClassNotFoundException: com.google.protobuf.Message
Error output excerpt:
- 2013-09-12 12:38:57,833 INFO mapred.JobClient (JobClient.java:monitorAndPrintJob(1363)) - map 0% reduce 0%
- 2013-09-12 12:39:12,490 INFO mapred.JobClient (JobClient.java:monitorAndPrintJob(1392)) - Task Id : attempt_201309121232_0001_m_000007_0, Status : FAILED
- Error: java.lang.ClassNotFoundException: com.google.protobuf.Message
- at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
- at java.security.AccessController.doPrivileged(Native Method)
- at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
- at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
- at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
Clearly, the protobuf-java-2.4.0a.jar package was not found; add that jar's path to hadoop-env.sh as well.
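Concretely, the fix mirrors step 2 above: append the protobuf jar to HADOOP_CLASSPATH in hadoop-env.sh on every node, then restart Hadoop (path and version assumed; use whatever ships in your HBase lib/ directory):

```shell
# In conf/hadoop-env.sh on every node (path/version assumed):
export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$HBASE_HOME/lib/protobuf-java-2.4.0a.jar
```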