Generating HBase HFiles with Spark and Loading Them into the Target Table with BulkLoad

Let's start with a problem:

java.io.IOException: Added a key not lexically larger than previous. Current cell = M00000006/info:age/1563723718005/Put/vlen=4/seqid=0, lastCell = M00000006/info:name/1563723718005/Put/vlen=2/seqid=0
	at org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.checkKey(HFileWriterImpl.java:245)
	at org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.append(HFileWriterImpl.java:731)
	at org.apache.hadoop.hbase.regionserver.StoreFileWriter.append(StoreFileWriter.java:234)
	at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:337)
	at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:230)
	at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.write(SparkHadoopWriter.scala:356)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:130)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
	at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
	at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:121)
	at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
19/07/21 23:41:58 ERROR Utils: Aborting task

The exception is thrown because the Cells were not sorted when we generated the HFile. When data is written through HBase's Put API, the Cells first go into the MemStore, which keeps them ordered; when the MemStore fills up or the flush interval elapses, the sorted KeyValues are flushed out as an HFile (HBase is an LSM-tree style store). In other words, the HFiles HBase generates itself already contain KeyValues in sorted order, so the HFiles we generate ourselves must keep their KeyValues sorted as well. In the stack trace above, the name cell of row M00000006 was written before the age cell, but "age" sorts lexically before "name", hence the error.
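To make the expected ordering concrete, here is a minimal sketch (the values are invented; only the row key, family and qualifiers mirror the error message above) that compares the two offending cells with HBase's CellComparator:

import org.apache.hadoop.hbase.{CellComparator, KeyValue}
import org.apache.hadoop.hbase.util.Bytes

object CellOrderCheck {
  def main(args: Array[String]): Unit = {
    // row key, family and qualifiers taken from the error message; the values are made up
    val row = Bytes.toBytes("M00000006")
    val fam = Bytes.toBytes("info")
    val age  = new KeyValue(row, fam, Bytes.toBytes("age"), Bytes.toBytes(30))
    val name = new KeyValue(row, fam, Bytes.toBytes("name"), Bytes.toBytes("Li"))

    // CellComparator orders cells by row, then family, then qualifier (then timestamp and type),
    // so within the same row the age cell must be written before the name cell.
    println(CellComparator.getInstance().compare(age, name)) // prints a negative number
  }
}

A negative result means the age cell sorts before the name cell, which is exactly the order the HFile writer expects to receive cells in.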

So how do we guarantee that our KeyValues are in order?

We can borrow from the CellSortReducer class provided by HBase. It lives in the hbase-mapreduce module and is used when generating HFiles with HBase's own MapReduce API:

/**
 * Emits sorted Cells.
 * Reads in all Cells from passed Iterator, sorts them, then emits
 * Cells in sorted order.  If lots of columns per row, it will use lots of
 * memory sorting.
 * @see HFileOutputFormat2
 */
@InterfaceAudience.Public
public class CellSortReducer
    extends Reducer<ImmutableBytesWritable, Cell, ImmutableBytesWritable, Cell> {
  protected void reduce(ImmutableBytesWritable row, Iterable<Cell> kvs,
      Reducer<ImmutableBytesWritable, Cell, ImmutableBytesWritable, Cell>.Context context)
  throws java.io.IOException, InterruptedException {
    TreeSet<Cell> map = new TreeSet<>(CellComparator.getInstance());
    for (Cell kv : kvs) {
      try {
        map.add(PrivateCellUtil.deepClone(kv));
      } catch (CloneNotSupportedException e) {
        throw new IOException(e);
      }
    }
    context.setStatus("Read " + map.getClass());
    int index = 0;
    for (Cell kv: map) {
      context.write(row, new MapReduceExtendedCell(kv));
      if (++index % 100 == 0) context.setStatus("Wrote " + index);
    }
  }
}

From the source of CellSortReducer we can see that, for KeyValues sharing the same row key, HBase sorts them with a TreeSet backed by the CellComparator (CellComparatorImpl) comparator before emitting them.
So when generating HFiles with Spark, we can borrow the same approach.

Generating HFiles with Spark and loading the data via BulkLoad

The complete code is as follows:

package com.ljy.spark

import java.io.Closeable
import java.util

import com.ljy.common.ConfigurationFactory
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Table}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles
import org.apache.hadoop.hbase.{Cell, CellComparator, CellUtil, KeyValue, TableName}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

import scala.collection.JavaConversions

object SparkGenHFile {

  import org.apache.hadoop.hbase.util.Bytes

  private val FAMILY = Bytes.toBytes("info")

  private val COL_NAME = Bytes.toBytes("name")
  private val COL_AGE = Bytes.toBytes("age")
  private val COL_GENDER = Bytes.toBytes("gender")
  private val COL_ADDRESS = Bytes.toBytes("address")
  private val COL_INCOME = Bytes.toBytes("income")
  private val COL_JOB = Bytes.toBytes("job")
  private val COL_JOINEDYM = Bytes.toBytes("joined")

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      .setAppName("spark-gen-hfile")
      .setMaster("local[*]")
    sparkConf.registerKryoClasses(Array(
      classOf[ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.KeyValue],
      classOf[Array[org.apache.hadoop.hbase.io.ImmutableBytesWritable]],
      Class.forName("org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage"),
      Class.forName("scala.reflect.ClassTag$$anon$1")
    ))

    val spark = SparkSession.builder()
      .config(sparkConf)
      .getOrCreate()

    val dataPath = "hdfs://vhb1:8020/user/hbase/bulkload/user"
    val rdd = spark.sparkContext.textFile(dataPath)
      .flatMap(line => {
        val fields = line.split("\t")
        val key = new ImmutableBytesWritable(Bytes.toBytes(fields(0)))
        val cells = buildKeyValueCells(fields)
        cells.map((key, _))
      })
    val tableName = "sparkhfile"
    val hbaseConf = ConfigurationFactory.getHBaseConf
    hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
    val job = Job.getInstance(hbaseConf)

    var conn: Connection = null
    var table: Table = null
    var fs: FileSystem = null
    fs = FileSystem.get(hbaseConf)
    val hfileDir = new Path(fs.getWorkingDirectory, "hfile-dir")
    val hfile = new Path(hfileDir, System.currentTimeMillis() + "")

    try {
      conn = ConnectionFactory.createConnection(hbaseConf)
      table = conn.getTable(TableName.valueOf(tableName))
      HFileOutputFormat2.configureIncrementalLoadMap(job, table.getDescriptor)
      // generate the HFiles
      rdd.sortByKey() // sort by row key
        .saveAsNewAPIHadoopFile(hfile.toString, classOf[ImmutableBytesWritable], classOf[Cell], classOf[HFileOutputFormat2], job.getConfiguration)
      spark.stop()
      // load the generated HFiles into the HBase table via bulk load
      new LoadIncrementalHFiles(hbaseConf).run(Array(hfile.toString/* path of the generated HFiles */, tableName/* target table name; the table must be created beforehand */))
      println("hfile: " + hfile.toString)
    } finally {
      fs.delete(hfileDir, true)
      close(conn, table, fs)
    }

  }


  def buildKeyValueCells(fields: Array[String]): List[Cell] = {
    val rowKey = Bytes.toBytes(fields(0))
    val name = new KeyValue(rowKey, FAMILY, COL_NAME, Bytes.toBytes(fields(1)))
    val age = new KeyValue(rowKey, FAMILY, COL_AGE, Bytes.toBytes(fields(2).toInt))
    val gender = new KeyValue(rowKey, FAMILY, COL_GENDER, Bytes.toBytes(fields(3)))
    val address = new KeyValue(rowKey, FAMILY, COL_ADDRESS, Bytes.toBytes(fields(4)))
    val income = new KeyValue(rowKey, FAMILY, COL_INCOME, Bytes.toBytes(fields(5).toDouble))
    val job = new KeyValue(rowKey, FAMILY, COL_JOB, Bytes.toBytes(fields(6)))
    val joined = new KeyValue(rowKey, FAMILY, COL_JOINEDYM, Bytes.toBytes(fields(7)))
    // modeled on CellSortReducer from hbase-mapreduce
    val set = new util.TreeSet[KeyValue](CellComparator.getInstance)
    util.Collections.addAll(set, name, age, gender, address, income, job, joined)
    // convert the Java TreeSet into a Scala List
    JavaConversions.asScalaSet(set).toList
  }

  /**
    * Sorting could also be implemented the Scala way, but to avoid reinventing the wheel
    * we simply use the comparator HBase provides, so this hand-written Ordering is no longer used.
    */
  @Deprecated
  class KeyValueOrder extends Ordering[KeyValue] {
    override def compare(x: KeyValue, y: KeyValue): Int = {
      val xRow = CellUtil.cloneRow(x)
      val yRow = CellUtil.cloneRow(y)
      var com = Bytes.compareTo(xRow, yRow)
      if (com != 0) return com

      val xf = CellUtil.cloneFamily(x)
      val yf = CellUtil.cloneFamily(y)
      com = Bytes.compareTo(xf, yf)
      if (com != 0) return com

      val xq = CellUtil.cloneQualifier(x)
      val yq = CellUtil.cloneQualifier(y)
      com = Bytes.compareTo(xq, yq)
      if (com != 0) return com

      val xv = CellUtil.cloneValue(x)
      val yv = CellUtil.cloneValue(y)
      Bytes.compareTo(xv, yv)
    }
  }

  def close(closes: Closeable*): Unit = {
    for (elem <- closes) {
      if (elem != null) {
        elem.close()
      }
    }
  }
}
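A quick note on running the example (the sample values below are my own, for illustration): the code reads hdfs://vhb1:8020/user/hbase/bulkload/user and expects each line to contain eight tab-separated fields in the order rowkey, name, age, gender, address, income, job, joined, for example:

M00000001	Tom	25	male	Beijing	8000.0	engineer	201907

The target table must also exist before the bulk load, with the info column family, e.g. created in the HBase shell:

create 'sparkhfile', 'info'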

For convenience, I extracted the Configuration used above into its own factory class.
The code is as follows:

package com.ljy.common;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ConfigurationFactory {
    public static Configuration getHBaseConf() {
        final Configuration conf = getConf();
        conf.set("hbase.rootdir", "hdfs://vhb1:8020/hbase2");
        conf.set("hbase.zookeeper.quorum", "vhb1,vhb2,vhb3");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        conf.set("zookeeper.znode.parent", "/hbase");
        return HBaseConfiguration.create(conf);
    }

    public static Configuration getConf() {
        final Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://vhb1:8020");
        return conf;
    }
}

The pom dependencies are as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.ljy</groupId>
    <artifactId>spark-hbase</artifactId>
    <version>1.0</version>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.11.8</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.3</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-mapreduce</artifactId>
            <version>2.1.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>2.1.0</version>
        </dependency>

    </dependencies>
</project>

Finally, if you have any questions, feel free to leave a comment and discuss.
