Let's start with a problem
java.io.IOException: Added a key not lexically larger than previous. Current cell = M00000006/info:age/1563723718005/Put/vlen=4/seqid=0, lastCell = M00000006/info:name/1563723718005/Put/vlen=2/seqid=0
at org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.checkKey(HFileWriterImpl.java:245)
at org.apache.hadoop.hbase.io.hfile.HFileWriterImpl.append(HFileWriterImpl.java:731)
at org.apache.hadoop.hbase.regionserver.StoreFileWriter.append(StoreFileWriter.java:234)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:337)
at org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2$1.write(HFileOutputFormat2.java:230)
at org.apache.spark.internal.io.HadoopMapReduceWriteConfigUtil.write(SparkHadoopWriter.scala:356)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:130)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:121)
at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
19/07/21 23:41:58 ERROR Utils: Aborting task
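The exception is thrown because the Cells were not sorted when we generated the HFile. In the trace above, info:age arrives after info:name for the same row, yet "age" sorts before "name" in byte order, so the writer rejects it. Recall that when we Put through the HBase API, the Cell first goes into HBase's MemStore. Once the MemStore is full or the flush interval is reached, its KeyValues are written out in sorted order as an HFile, following HBase's LSM-tree design. In other words, the KeyValues in an HFile that HBase generates itself are already ordered, so an HFile we generate ourselves must keep its KeyValues ordered as well.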
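To make the ordering rule concrete, here is a minimal sketch of an unsigned, byte-by-byte comparison applied to the two qualifiers from the stack trace. The class name `QualifierOrder` and the helper `compareBytes` are hypothetical and written only for illustration; they are not HBase APIs, although HBase compares byte arrays in the same lexicographic spirit.

```java
import java.nio.charset.StandardCharsets;

public class QualifierOrder {
    // Hypothetical helper: unsigned, byte-by-byte lexicographic comparison of two arrays.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff; // treat each byte as unsigned
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        // All shared bytes are equal: the shorter array sorts first.
        return a.length - b.length;
    }

    public static void main(String[] args) {
        byte[] name = "name".getBytes(StandardCharsets.UTF_8);
        byte[] age = "age".getBytes(StandardCharsets.UTF_8);
        // 'a' (97) is smaller than 'n' (110), so "age" must be written before "name".
        System.out.println(compareBytes(age, name) < 0); // prints "true"
    }
}
```

Writing info:name first and info:age second therefore hands the HFile writer a key that is smaller than the previous one, which is what the exception message reports.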
So how do we guarantee the order of our KeyValues?
We can borrow from the CellSortReducer class that HBase provides. It lives in hbase-mapreduce and is used when generating HFiles through HBase's own API.
/**
 * Emits sorted Cells.
 * Reads in all Cells from passed Iterator, sorts them, then emits
 * Cells in sorted order. If lots of columns per row, it will use lots of
 * memory sorting.
 * @see HFileOutputFormat2
 */
@InterfaceAudience.Public
public class CellSortReducer
    extends Reducer<ImmutableBytesWritable, Cell, ImmutableBytesWritable, Cell> {
  protected void reduce(ImmutableBytesWritable row, Iterable<Cell> kvs,
      Reducer<ImmutableBytesWritable, Cell, ImmutableBytesWritable, Cell>.Context context)
      throws java.io.IOException, InterruptedException {
    TreeSet<Cell> map = new TreeSet<>(CellComparator.getInstance());
    for (Cell kv : kvs) {
      try {
        map.add(PrivateCellUtil.deepClone(kv));
      } catch (CloneNotSupportedException e) {
        throw new IOException(e);
      }
    }
    context.setStatus("Read " + map.getClass());
    int index = 0;
    for (Cell kv: map) {
      context.write(row, new MapReduceExtendedCell(kv));
      if (++index % 100 == 0) context.setStatus("Wrote " + index);
    }
  }
}
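From the source of CellSortReducer we can see that HBase sorts the KeyValues that share a RowKey with a TreeSet plus the comparator returned by CellComparator.getInstance() (a CellComparatorImpl).
So when we generate HFiles with Spark, we can borrow the same approach.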
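The TreeSet-plus-comparator idea can be sketched with the standard library alone. In the sketch below, `SimpleCell` is a hypothetical stand-in for an HBase Cell and `ORDER` is a simplified comparator (row, family, qualifier ascending, then newer timestamps first). It assumes ASCII keys so that String comparison matches byte order, and it leaves out the cell type that HBase's real comparator also considers.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class RowCellSort {
    // Hypothetical stand-in for an HBase Cell: only the fields that drive ordering.
    static final class SimpleCell {
        final String row;
        final String family;
        final String qualifier;
        final long timestamp;

        SimpleCell(String row, String family, String qualifier, long timestamp) {
            this.row = row;
            this.family = family;
            this.qualifier = qualifier;
            this.timestamp = timestamp;
        }
    }

    // Simplified ordering: row, family, qualifier ascending; newer timestamps first.
    static final Comparator<SimpleCell> ORDER = (a, b) -> {
        int c = a.row.compareTo(b.row);
        if (c != 0) return c;
        c = a.family.compareTo(b.family);
        if (c != 0) return c;
        c = a.qualifier.compareTo(b.qualifier);
        if (c != 0) return c;
        return Long.compare(b.timestamp, a.timestamp);
    };

    // Put all cells of one row into a TreeSet and read the qualifiers back in order.
    static List<String> sortedQualifiers(List<SimpleCell> cells) {
        TreeSet<SimpleCell> sorted = new TreeSet<>(ORDER);
        sorted.addAll(cells);
        List<String> out = new ArrayList<>();
        for (SimpleCell cell : sorted) {
            out.add(cell.qualifier);
        }
        return out;
    }

    public static void main(String[] args) {
        long ts = 1563723718005L;
        List<SimpleCell> cells = new ArrayList<>();
        cells.add(new SimpleCell("M00000006", "info", "name", ts));
        cells.add(new SimpleCell("M00000006", "info", "age", ts));
        cells.add(new SimpleCell("M00000006", "info", "gender", ts));
        System.out.println(sortedQualifiers(cells)); // prints "[age, gender, name]"
    }
}
```

A TreeSet also drops elements that the comparator considers equal, so two cells with the same row, family, qualifier and timestamp would collapse into one in this sketch.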
Generating HFiles with Spark and loading the data with BulkLoad
The whole process is as follows:
package com.ljy.spark
import java.io.Closeable
import java.util
import com.ljy.common.ConfigurationFactory
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Table}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapred.TableOutputFormat
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2
import org.apache.hadoop.hbase.tool.LoadIncrementalHFiles
import org.apache.hadoop.hbase.{Cell, CellComparator, CellUtil, KeyValue, TableName}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import scala.collection.JavaConversions
object SparkGenHFile {
  import org.apache.hadoop.hbase.util.Bytes

  private val FAMILY = Bytes.toBytes("info")
  private val COL_NAME = Bytes.toBytes("name")
  private val COL_AGE = Bytes.toBytes("age")
  private val COL_GENDER = Bytes.toBytes("gender")
  private val COL_ADDRESS = Bytes.toBytes("address")
  private val COL_INCOME = Bytes.toBytes("income")
  private val COL_JOB = Bytes.toBytes("job")
  private val COL_JOINEDYM = Bytes.toBytes("joined")

  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .set("spark.kryo.registrationRequired", "true")
      .setAppName("spark-gen-hfile")
      .setMaster("local[*]")
    sparkConf.registerKryoClasses(Array(
      classOf[ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.KeyValue],
      classOf[Array[org.apache.hadoop.hbase.io.ImmutableBytesWritable]],
      Class.forName("org.apache.spark.internal.io.FileCommitProtocol$TaskCommitMessage"),
      Class.forName("scala.reflect.ClassTag$$anon$1")
    ))
    val spark = SparkSession.builder()
      .config(sparkConf)
      .getOrCreate()
    val dataPath = "hdfs://vhb1:8020/user/hbase/bulkload/user"
    // Each input line is tab-separated; the first field is the row key.
    val rdd = spark.sparkContext.textFile(dataPath)
      .flatMap(line => {
        val fields = line.split("\t")
        val key = new ImmutableBytesWritable(Bytes.toBytes(fields(0)))
        // The cells of one row come back already sorted.
        val cells = buildKeyValueCells(fields)
        cells.map((key, _))
      })
    val tableName = "sparkhfile"
    val hbaseConf = ConfigurationFactory.getHBaseConf
    hbaseConf.set(TableOutputFormat.OUTPUT_TABLE, tableName)
    val job = Job.getInstance(hbaseConf)
    var conn: Connection = null
    var table: Table = null
    var fs: FileSystem = null
    fs = FileSystem.get(hbaseConf)
    val hfileDir = new Path(fs.getWorkingDirectory, "hfile-dir")
    val hfile = new Path(hfileDir, System.currentTimeMillis() + "")
    try {
      conn = ConnectionFactory.createConnection(hbaseConf)
      table = conn.getTable(TableName.valueOf(tableName))
      HFileOutputFormat2.configureIncrementalLoadMap(job, table.getDescriptor)
      // Generate the HFile
      rdd.sortByKey() // sort by row
        .saveAsNewAPIHadoopFile(hfile.toString, classOf[ImmutableBytesWritable], classOf[Cell], classOf[HFileOutputFormat2], job.getConfiguration)
      spark.stop()
      // Load the HFile into the HBase table via bulk load
      new LoadIncrementalHFiles(hbaseConf).run(Array(
        hfile.toString /* path of the generated HFile */,
        tableName /* target table; it must be created beforehand */))
      println("hfile: " + hfile.toString)
    } finally {
      fs.delete(hfileDir, true)
      close(conn, table, fs)
    }
  }
  def buildKeyValueCells(fields: Array[String]): List[Cell] = {
    val rowKey = Bytes.toBytes(fields(0))
    val name = new KeyValue(rowKey, FAMILY, COL_NAME, Bytes.toBytes(fields(1)))
    val age = new KeyValue(rowKey, FAMILY, COL_AGE, Bytes.toBytes(fields(2).toInt))
    val gender = new KeyValue(rowKey, FAMILY, COL_GENDER, Bytes.toBytes(fields(3)))
    val address = new KeyValue(rowKey, FAMILY, COL_ADDRESS, Bytes.toBytes(fields(4)))
    val income = new KeyValue(rowKey, FAMILY, COL_INCOME, Bytes.toBytes(fields(5).toDouble))
    val job = new KeyValue(rowKey, FAMILY, COL_JOB, Bytes.toBytes(fields(6)))
    val joined = new KeyValue(rowKey, FAMILY, COL_JOINEDYM, Bytes.toBytes(fields(7)))
    // Modeled on CellSortReducer in hbase-mapreduce
    val set = new util.TreeSet[KeyValue](CellComparator.getInstance)
    util.Collections.addAll(set, name, age, gender, address, income, job, joined)
    // Convert the Java set into a Scala collection (iteration keeps the sorted order)
    JavaConversions.asScalaSet(set).toList
  }
  /**
   * The sorting could also be implemented in Scala, but to avoid reinventing the wheel
   * we simply use what HBase provides.
   * So we no longer use our own comparator here.
   */
  @Deprecated
  class KeyValueOrder extends Ordering[KeyValue] {
    override def compare(x: KeyValue, y: KeyValue): Int = {
      val xRow = CellUtil.cloneRow(x)
      val yRow = CellUtil.cloneRow(y)
      var com = Bytes.compareTo(xRow, yRow)
      if (com != 0) return com
      val xf = CellUtil.cloneFamily(x)
      val yf = CellUtil.cloneFamily(y)
      com = Bytes.compareTo(xf, yf)
      if (com != 0) return com
      val xq = CellUtil.cloneQualifier(x)
      val yq = CellUtil.cloneQualifier(y)
      com = Bytes.compareTo(xq, yq)
      if (com != 0) return com
      // Note: HBase's own comparator continues with timestamp and type here, not the value.
      val xv = CellUtil.cloneValue(x)
      val yv = CellUtil.cloneValue(y) // fixed: this previously cloned x again
      Bytes.compareTo(xv, yv)
    }
  }
  def close(closes: Closeable*): Unit = {
    for (elem <- closes) {
      if (elem != null) {
        elem.close()
      }
    }
  }
}
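One detail in buildKeyValueCells is worth illustrating: age is stored through Bytes.toBytes on an Int, not as text, which is why the trace at the top reports vlen=4 for info:age. The sketch below imitates that layout with java.nio.ByteBuffer. The class `ValueEncoding` and its two helpers are hypothetical names, and the sketch assumes the big-endian 4-byte layout that HBase's Bytes utility uses for ints.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class ValueEncoding {
    // Hypothetical helper: big-endian 4-byte encoding of an int (ByteBuffer's default order).
    static byte[] encodeInt(int value) {
        return ByteBuffer.allocate(4).putInt(value).array();
    }

    // Hypothetical helper: decode the 4 bytes back into an int.
    static int decodeInt(byte[] bytes) {
        return ByteBuffer.wrap(bytes).getInt();
    }

    public static void main(String[] args) {
        byte[] binaryAge = encodeInt(25);
        byte[] textAge = "25".getBytes(StandardCharsets.UTF_8);
        System.out.println(binaryAge.length);      // prints "4"
        System.out.println(textAge.length);        // prints "2"
        System.out.println(decodeInt(binaryAge));  // prints "25"
    }
}
```

The practical consequence is that a reader of the table has to decode age and income as binary numbers; treating those bytes as a string would give unreadable output.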
For convenience, I extracted the Conf used above into its own class.
The code is as follows:
package com.ljy.common;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class ConfigurationFactory {
    public static Configuration getHBaseConf() {
        final Configuration conf = getConf();
        conf.set("hbase.rootdir", "hdfs://vhb1:8020/hbase2");
        conf.set("hbase.zookeeper.quorum", "vhb1,vhb2,vhb3");
        conf.set("hbase.zookeeper.property.clientPort", "2181");
        conf.set("zookeeper.znode.parent", "/hbase");
        return HBaseConfiguration.create(conf);
    }

    public static Configuration getConf() {
        final Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://vhb1:8020");
        return conf;
    }
}
The pom dependencies are as follows:
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.ljy</groupId>
    <artifactId>spark-hbase</artifactId>
    <version>1.0</version>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
    <dependencies>
        <!-- https://mvnrepository.com/artifact/org.scala-lang/scala-library -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.11.8</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>2.4.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>2.4.3</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-mapreduce</artifactId>
            <version>2.1.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>2.1.0</version>
        </dependency>
    </dependencies>
</project>