Importing MySQL Data into HBase with Spark

Background: storing large-scale enterprise data in HBase

The project needs to initialize data into a table frequently and quickly, so how to load the data and how to improve the loading speed are the problems to solve.

Generally speaking, a data storage system has two parts: retrieval and storage.
Retrieval exposes query interfaces to the outside world.
Storage, on the one hand, writes data to a storage medium (disk, memory, etc.) according to fixed rules; on the other hand, it also needs to expose a bulk-loading tool. DB2's db2load utility, for example, writes noticeably faster when logging is turned off.

HBase provides a way to bulk-import data into a table:

1. HBase's LoadIncrementalHFiles supports loading HFile files into HBase.
2. The HFiles to be loaded must be sorted (by rowkey, column family, column qualifier).
3. The key point is to bypass the HBase RegionServer write path and write HBase files (HFiles) directly.
4. Spark's RDD method repartitionAndSortWithinPartitions can partition and sort the data efficiently (see the sketch after this list).
5. java.util.TreeMap is a red-black tree implementation and satisfies the sorting requirement nicely (the code below uses Scala's immutable TreeMap, which is likewise sorted by key).
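
A minimal sketch of point 4, assuming String rowkeys and a plain HashPartitioner (both are assumptions made only for illustration; the full example below actually uses a global sortByKey instead):

import org.apache.spark.HashPartitioner
import org.apache.spark.rdd.RDD

// Sketch: shuffle into numPartitions partitions and sort every partition by key
// in a single pass, instead of a repartition followed by a separate sort stage.
def partitionAndSort(rdd: RDD[(String, String)], numPartitions: Int): RDD[(String, String)] = {
	rdd.repartitionAndSortWithinPartitions(new HashPartitioner(numPartitions))
}

Note that a HashPartitioner does not align partitions with HBase region boundaries; LoadIncrementalHFiles can still load the resulting HFiles, but it may have to split files that span regions.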

The application development flow is as follows:

1. Put the data to be written into a util.TreeMap keyed by field name, so that it matches HBase's sorted layout.
2. Convert the data into an RDD and use the repartitionAndSortWithinPartitions operator to partition and sort it by key.
3. Call the RDD's saveAsNewAPIHadoopFile operator to generate the HFile files.
4. Call HBase's LoadIncrementalHFiles to load the HFile files into the HBase table.

1. Encapsulate the field information of the different HBase tables in an object:

package demo01


import scala.collection.immutable.TreeMap

/**
  * Field names of each HBase table, stored in TreeMaps
  */
object TableFieldNames{

	// TODO: use TreeMap so the qualifiers are sorted in dictionary (lexicographic) order

	// a. Fields of the behavior log table
	val LOG_FIELD_NAMES: TreeMap[String, Int] = TreeMap(
		("id", 0),
		("log_id", 1),
		("remote_ip", 2),
		("site_global_ticket", 3),
		("site_global_session", 4),
		("global_user_id", 5),
		("cookie_text", 6),
		("user_agent", 7),
		("ref_url", 8),
		("loc_url", 9),
		("log_time", 10)
	)

	// b. Fields of the goods information table
	val GOODS_FIELD_NAMES: TreeMap[String, Int] = TreeMap(
		("id", 0),
		("siteid", 1),
		("istest", 2),
		("hasread", 3),
		("supportonedaylimit", 4),
		("orderid", 5),
		("cordersn", 6),
		("isbook", 7),
		("cpaymentstatus", 8),
		("cpaytime", 9),
		("producttype", 10),
		("productid", 11),
		("productname", 12),
		("sku", 13),
		("price", 14),
		("number", 15),
		("lockednumber", 16),
		("unlockednumber", 17),
		("productamount", 18),
		("balanceamount", 19),
		("couponamount", 20),
		("esamount", 21),
		("giftcardnumberid", 22),
		("usedgiftcardamount", 23),
		("couponlogid", 24),
		("activityprice", 25),
		("activityid", 26),
		("cateid", 27),
		("brandid", 28),
		("netpointid", 29),
		("shippingfee", 30),
		("settlementstatus", 31),
		("receiptorrejecttime", 32),
		("iswmssku", 33),
		("scode", 34),
		("tscode", 35),
		("tsshippingtime", 36),
		("status", 37),
		("productsn", 38),
		("invoicenumber", 39),
		("expressname", 40),
		("invoiceexpressnumber", 41),
		("postman", 42),
		("postmanphone", 43),
		("isnotice", 44),
		("noticetype", 45),
		("noticeremark", 46),
		("noticetime", 47),
		("shippingtime", 48),
		("lessordersn", 49),
		("waitgetlesshippinginfo", 50),
		("getlesshippingcount", 51),
		("outping", 52),
		("lessshiptime", 53),
		("closetime", 54),
		("isreceipt", 55),
		("ismakereceipt", 56),
		("receiptnum", 57),
		("receiptaddtime", 58),
		("makereceipttype", 59),
		("shippingmode", 60),
		("lasttimeforshippingmode", 61),
		("lasteditorforshippingmode", 62),
		("systemremark", 63),
		("tongshuaiworkid", 64),
		("orderpromotionid", 65),
		("orderpromotionamount", 66),
		("externalsalesettingid", 67),
		("recommendationid", 68),
		("hassendalertnum", 69),
		("isnolimitstockproduct", 70),
		("hpregisterdate", 71),
		("hpfaildate", 72),
		("hpfinishdate", 73),
		("hpreservationdate", 74),
		("shippingopporunity", 75),
		("istimeoutfree", 76),
		("itemshareamount", 77),
		("lessshiptintime", 78),
		("lessshiptouttime", 79),
		("cbsseccode", 80),
		("points", 81),
		("modified", 82),
		("splitflag", 83),
		("splitrelatecordersn", 84),
		("channelid", 85),
		("activityid2", 86),
		("pdorderstatus", 87),
		("omsordersn", 88),
		("couponcode", 89),
		("couponcodevalue", 90),
		("storeid", 91),
		("storetype", 92),
		("stocktype", 93),
		("o2otype", 94),
		("brokeragetype", 95),
		("ogcolor", 96)
	)


	// c. Fields of the user information table
	val USER_FIELD_NAMES: TreeMap[String, Int] = TreeMap(
		("id", 0),
		("siteid", 1),
		("avatarimagefileid", 2),
		("email", 3),
		("username", 4),
		("password", 5),
		("salt", 6),
		("registertime", 7),
		("lastlogintime", 8),
		("lastloginip", 9),
		("memberrankid", 10),
		("bigcustomerid", 11),
		("lastaddressid", 12),
		("lastpaymentcode", 13),
		("gender", 14),
		("birthday", 15),
		("qq", 16),
		("job", 17),
		("mobile", 18),
		("politicalface", 19),
		("nationality", 20),
		("validatecode", 21),
		("pwderrcount", 22),
		("source", 23),
		("marriage", 24),
		("money", 25),
		("moneypwd", 26),
		("isemailverify", 27),
		("issmsverify", 28),
		("smsverifycode", 29),
		("emailverifycode", 30),
		("verifysendcoupon", 31),
		("canreceiveemail", 32),
		("modified", 33),
		("channelid", 34),
		("grade_id", 35),
		("nick_name", 36),
		("is_blacklist", 37)
	)

	// d. Fields of the order table
	val ORDER_FIELD_NAMES: TreeMap[String, Int] = TreeMap(
		("id", 0),
		("siteid", 1),
		("istest", 2),
		("hassync", 3),
		("isbackend", 4),
		("isbook", 5),
		("iscod", 6),
		("notautoconfirm", 7),
		("ispackage", 8),
		("packageid", 9),
		("ordersn", 10),
		("relationordersn", 11),
		("memberid", 12),
		("predictid", 13),
		("memberemail", 14),
		("addtime", 15),
		("synctime", 16),
		("orderstatus", 17),
		("paytime", 18),
		("paymentstatus", 19),
		("receiptconsignee", 20),
		("receiptaddress", 21),
		("receiptzipcode", 22),
		("receiptmobile", 23),
		("productamount", 24),
		("orderamount", 25),
		("paidbalance", 26),
		("giftcardamount", 27),
		("paidamount", 28),
		("shippingamount", 29),
		("totalesamount", 30),
		("usedcustomerbalanceamount", 31),
		("customerid", 32),
		("bestshippingtime", 33),
		("paymentcode", 34),
		("paybankcode", 35),
		("paymentname", 36),
		("consignee", 37),
		("originregionname", 38),
		("originaddress", 39),
		("province", 40),
		("city", 41),
		("region", 42),
		("street", 43),
		("markbuilding", 44),
		("poiid", 45),
		("poiname", 46),
		("regionname", 47),
		("address", 48),
		("zipcode", 49),
		("mobile", 50),
		("phone", 51),
		("receiptinfo", 52),
		("delayshiptime", 53),
		("remark", 54),
		("bankcode", 55),
		("agent", 56),
		("confirmtime", 57),
		("firstconfirmtime", 58),
		("firstconfirmperson", 59),
		("finishtime", 60),
		("tradesn", 61),
		("signcode", 62),
		("source", 63),
		("sourceordersn", 64),
		("onedaylimit", 65),
		("logisticsmanner", 66),
		("aftersalemanner", 67),
		("personmanner", 68),
		("visitremark", 69),
		("visittime", 70),
		("visitperson", 71),
		("sellpeople", 72),
		("sellpeoplemanner", 73),
		("ordertype", 74),
		("hasreadtaobaoordercomment", 75),
		("memberinvoiceid", 76),
		("taobaogroupid", 77),
		("tradetype", 78),
		("steptradestatus", 79),
		("steppaidfee", 80),
		("depositamount", 81),
		("balanceamount", 82),
		("autocanceldays", 83),
		("isnolimitstockorder", 84),
		("ccborderreceivedlogid", 85),
		("ip", 86),
		("isgiftcardorder", 87),
		("giftcarddownloadpassword", 88),
		("giftcardfindmobile", 89),
		("autoconfirmnum", 90),
		("codconfirmperson", 91),
		("codconfirmtime", 92),
		("codconfirmremark", 93),
		("codconfirmstate", 94),
		("paymentnoticeurl", 95),
		("addresslon", 96),
		("addresslat", 97),
		("smconfirmstatus", 98),
		("smconfirmtime", 99),
		("smmanualtime", 100),
		("smmanualremark", 101),
		("istogether", 102),
		("isnotconfirm", 103),
		("tailpaytime", 104),
		("points", 105),
		("modified", 106),
		("channelid", 107),
		("isproducedaily", 108),
		("couponcode", 109),
		("couponcodevalue", 110),
		("ckcode", 111)
	)

}
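
Because a TreeMap keeps its keys sorted, iterating over one of these maps yields the qualifiers in dictionary order, which is exactly the order the cells of a row must appear in inside an HFile. A quick sanity check (sketch):

// Prints the log-table qualifiers in dictionary order:
// cookie_text, global_user_id, id, loc_url, log_id, log_time, ...
TableFieldNames.LOG_FIELD_NAMES.keys.foreach(println)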

2. Bulk-load the data with HBaseBulkLoader; the code is as follows:

package demo01



import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Table}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

import scala.collection.immutable.TreeMap

/**
 * Converts text data files into HFiles and loads them into an HBase table
 */
object HBaseBulkLoader {
	
	def main(args: Array[String]): Unit = {
		// The application takes 5 arguments: data type, HBase table name, column family, input path and output path
		/*
		args = Array("1", "tbl_tag_logs", "detail", "/user/hive/warehouse/tags_dat2.db/tbl_logs", "/datas/output_hfile/tbl_tag_logs")
		args = Array("2", "tbl_tag_goods", "detail", "/user/hive/warehouse/tags_dat2.db/tbl_goods", "/datas/output_hfile/tbl_tag_goods")
		args = Array("3", "tbl_tag_users", "detail", "/user/hive/warehouse/tags_dat2.db/tbl_users", "/datas/output_hfile/tbl_tag_users")
		args = Array("4", "tbl_tag_orders", "detail", "/user/hive/warehouse/tags_dat2.db/tbl_orders", "/datas/output_hfile/tbl_tag_orders")
		*/
		if(args.length != 5){
			println("Usage: required params: <DataType> <HBaseTable> <Family> <InputDir> <OutputDir>")
			System.exit(-1)
		}
		// Assign the arguments to variables; data type: 1 = Log, 2 = Goods, 3 = User, 4 = Order
		val Array(dataType, tableName, family, inputDir, outputDir) = args
		
		// Pick the schema of the data being processed according to the data type
		val fieldNames = dataType.toInt match {
			case 1 => TableFieldNames.LOG_FIELD_NAMES
			case 2 => TableFieldNames.GOODS_FIELD_NAMES
			case 3 => TableFieldNames.USER_FIELD_NAMES
			case 4 => TableFieldNames.ORDER_FIELD_NAMES
			case _ => throw new IllegalArgumentException(s"Unknown data type: $dataType")
		}
		
		// 1. Build the SparkContext instance
		val sc: SparkContext = {
			// a. Create SparkConf and set the application configuration
			val sparkConf = new SparkConf()
				.setMaster("local[2]")
				.setAppName(this.getClass.getSimpleName.stripSuffix("$"))
				.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
			// b. Get or create the SparkContext
			SparkContext.getOrCreate(sparkConf)
		}
		
		// 2. Read the text file data and convert its format
		val keyValuesRDD: RDD[(ImmutableBytesWritable, KeyValue)] = sc
			.textFile(inputDir)
			// Filter out null lines
			.filter(line => null != line)
			// Extract the fields and build (RowKey, KeyValue) pairs
			/*
				Key: rowkey + cf + column + version(timestamp)
				Value: ColumnValue
			 */
			.flatMap{line => getLineToData(line, family, fieldNames)}
			// TODO: sort the data lexicographically by key
			.sortByKey()
		
		// TODO: build the Job and set the relevant configuration, mainly the output format
		// a. Read the configuration
		val conf: Configuration = HBaseConfiguration.create()
		// b. Delete the output directory if it already exists
		val dfs = FileSystem.get(conf)
		val outputPath: Path = new Path(outputDir)
		if(dfs.exists(outputPath)){
			dfs.delete(outputPath, true)
		}
		dfs.close()
		
		// TODO: c. Configure the HFileOutputFormat2 output
		val conn = ConnectionFactory.createConnection(conf)
		val htableName = TableName.valueOf(tableName)
		val table: Table = conn.getTable(htableName)
		val job = Job.getInstance(conf)
		HFileOutputFormat2.configureIncrementalLoad(
			job, //
			table, //
			conn.getRegionLocator(htableName)//
		)
		
		// TODO: 3. Save the data as HFile files
		keyValuesRDD.saveAsNewAPIHadoopFile(
			outputDir, //
			classOf[ImmutableBytesWritable], //
			classOf[KeyValue], //
			classOf[HFileOutputFormat2], //
			job.getConfiguration // use the configuration populated by configureIncrementalLoad
		)
		
		// TODO: 4. Load the generated HFiles into the HBase table
		val load = new LoadIncrementalHFiles(conf)
		load.doBulkLoad(outputPath, conn.getAdmin, table, conn.getRegionLocator(htableName))
		
		// The application is done; close the resources
		conn.close()
		sc.stop()
	}
	
	/**
	 * Extracts the fields from one line of a table's data file and wraps them into KeyValue objects
	 */
	def getLineToData(line: String, family: String,
	                  fieldNames: TreeMap[String, Int]): List[(ImmutableBytesWritable, KeyValue)] = {
		val length = fieldNames.size
		// Split the line into fields
		val fieldValues: Array[String] = line.split("\\t", -1)
		if(null == fieldValues || fieldValues.length != length) return Nil
		
		// Use the id field to build the RowKey
		val id: String = fieldValues(0)
		val rowKey = Bytes.toBytes(id)
		val ibw: ImmutableBytesWritable = new ImmutableBytesWritable(rowKey)
		
		// Column family
		val columnFamily: Array[Byte] = Bytes.toBytes(family)
		
		// Build a KeyValue object for every field
		fieldNames.toList.map{ case (fieldName, fieldIndex) =>
			// KeyValue instance
			val keyValue = new KeyValue(
				rowKey, //
				columnFamily, //
				Bytes.toBytes(fieldName), //
				Bytes.toBytes(fieldValues(fieldIndex)) //
			)
			// Return the (rowkey, KeyValue) pair
			(ibw, keyValue)
		}
	}
	
}

When running the application, pass the following parameters (data type, HBase table name, column family, input path and output path):


// Program arguments when testing in the IDE (IDEA)
// a. Behavior logs
1 tbl_tag_logs detail /user/hive/warehouse/tags_dat2.db/tbl_logs /datas/output_hfile/tbl_logs

// b. Goods
2 tbl_tag_goods detail /user/hive/warehouse/tags_dat2.db/tbl_goods /datas/output_hfile/tbl_goods

// c. Users
3 tbl_tag_users detail /user/hive/warehouse/tags_dat2.db/tbl_users /datas/output_hfile/tbl_users

// d. Orders
4 tbl_tag_orders detail /user/hive/warehouse/tags_dat2.db/tbl_orders /datas/output_hfile/tbl_orders
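
Outside the IDE, the job can be submitted with spark-submit; the jar name below is only a placeholder, and the hard-coded setMaster("local[2]") in the example would have to be removed or made configurable for a real cluster run:

# bulkload-demo.jar is a placeholder name; adjust --master and the arguments as needed
spark-submit \
  --class demo01.HBaseBulkLoader \
  --master yarn \
  bulkload-demo.jar \
  1 tbl_tag_logs detail /user/hive/warehouse/tags_dat2.db/tbl_logs /datas/output_hfile/tbl_logs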

Common Spark bulk-load errors and how to resolve them:

1. Calling saveAsNewAPIHadoopFile throws "Added a key not lexically larger than previous": this is a sorting problem; the cells must be ordered lexicographically by rowkey, column family and column qualifier.

2. An exception that ImmutableBytesWritable cannot be serialized: register it with Kryo using the Java code below (the same can be done in Scala).

sparkConf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer");
sparkConf.registerKryoClasses(new Class[]{org.apache.hadoop.hbase.io.ImmutableBytesWritable.class});

3. An exception that the comparator cannot be serialized: have the comparator implement the Serializable interface.

4. An exception that an object initialized in the driver cannot be accessed inside an RDD action or transformation: broadcast the object (see the sketch after this list).

5. Overall job efficiency can also be improved by dynamically adjusting the number of executors.
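
A minimal sketch for point 4, reusing sc, inputDir, family and getLineToData from the example above: broadcast the driver-side TreeMap once instead of shipping it inside every task closure.

// Broadcast the schema built in the driver, then read it inside the transformation
val fieldNamesBc = sc.broadcast(TableFieldNames.LOG_FIELD_NAMES)
val cells = sc.textFile(inputDir)
	.flatMap(line => getLineToData(line, family, fieldNamesBc.value))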
