Reading HBase data with Spark (the newAPIHadoopRDD approach)

There are many ways to read HBase data with Spark. This post walks through a simple demo built on Spark's built-in newAPIHadoopRDD; the code is short and straightforward.

For writing to HBase from Spark, see the two earlier posts: https://blog.csdn.net/xianpanjia4616/article/details/85301998 and https://blog.csdn.net/xianpanjia4616/article/details/80738961

package hbase

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.log4j.{Level, Logger}
import util.PropertiesScalaUtils
import org.apache.spark.sql.SparkSession

/**
  * Read HBase data with Spark.
  */
object ReadHbase {
  def main(args: Array[String]): Unit = {
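    // quiet the verbose Spark / Hadoop / Jetty logging so the printed rows are easy to read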
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
    val spark = SparkSession
      .builder
      .appName("read hbase")
      .master("local[4]")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate
    val sc = spark.sparkContext
    val mode = "local"
    val zk_hbase = PropertiesScalaUtils.loadProperties("zk_hbase",mode)
    val zk_port = PropertiesScalaUtils.loadProperties("zk_port",mode)
    val hbase_master = PropertiesScalaUtils.loadProperties("hbase_master",mode)
    val hbase_rootdir = PropertiesScalaUtils.loadProperties("hbase_rootdir",mode)
    val zookeeper_znode_parent = PropertiesScalaUtils.loadProperties("zookeeper_znode_parent",mode)
    val hbase_table = PropertiesScalaUtils.loadProperties("hbase_table",mode)

    val conf = HBaseConfiguration.create()
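    // ZooKeeper and HBase connection settings, taken from the properties loaded above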
    conf.set("hbase.zookeeper.quorum", zk_hbase)
    conf.set("hbase.zookeeper.property.clientPort", zk_port)
    conf.set("hbase.master", hbase_master)
    conf.set("hbase.defaults.for.version.skip", "true")
    conf.set("hbase.rootdir", hbase_rootdir)
    conf.set("zookeeper.znode.parent", zookeeper_znode_parent)
    conf.set(TableInputFormat.INPUT_TABLE, hbase_table)  // e.g. "cbd:prod_base"
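    // newAPIHadoopRDD yields an RDD of (ImmutableBytesWritable, Result) pairs, one per HBase row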
    val hbaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])
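    // sample roughly 10% of the rows and print a few columns from each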
    hbaseRDD.sample(withReplacement = false, fraction = 0.1).foreachPartition(fp => {
      fp.foreach(f=>{
        val rowkey = Bytes.toString(f._2.getRow)
        val InsertTime = Bytes.toString(f._2.getValue("cf1".getBytes,"InsertTime".getBytes))
        val VipPrice = Bytes.toString(f._2.getValue("cf1".getBytes,"VipPrice".getBytes))
        println(s"Row key:$rowkey InsertTime:$InsertTime VipPrice:$VipPrice")
      })
    })
    println("Number of elements: " + hbaseRDD.count())
    sc.stop()
  }
}
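
The demo above scans the whole table and then samples it. If you only need one column family or a bounded row-key range, the scan that TableInputFormat builds can be narrowed through the same Configuration before calling newAPIHadoopRDD. A minimal sketch; the row keys, family name and caching value below are placeholder assumptions, not values from my table:

// Hypothetical scan restrictions -- adjust to your own table layout.
conf.set(TableInputFormat.SCAN_ROW_START, "1c|")      // inclusive start row key (placeholder)
conf.set(TableInputFormat.SCAN_ROW_STOP, "1d|")       // exclusive stop row key (placeholder)
conf.set(TableInputFormat.SCAN_COLUMN_FAMILY, "cf1")  // only read this column family
conf.set(TableInputFormat.SCAN_CACHEDROWS, "500")     // rows fetched per scanner RPC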

Reading HBase through TableInputFormat has some drawbacks: each task can only run a single Scan against HBase, and TableInputFormat has no support for bulk Get operations, among other things. Interested readers can dig into the various other ways of reading HBase; one workaround for point lookups is sketched below.
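
For point lookups, one option is to skip TableInputFormat and issue batched Gets from inside mapPartitions with the plain HBase client API. This is a rough sketch, assuming a rowKeyRDD of row-key strings exists, that hbase-site.xml is on the executor classpath, and the same cbd:prod_base table and cf1 family as above:

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Get}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

// rowKeyRDD: RDD[String] of the row keys to fetch (assumed to exist)
val priceRDD = rowKeyRDD.mapPartitions { keys =>
  // one connection per partition; configuration comes from hbase-site.xml on the classpath
  val connection = ConnectionFactory.createConnection(HBaseConfiguration.create())
  val table = connection.getTable(TableName.valueOf("cbd:prod_base"))
  val gets = keys.map(k => new Get(Bytes.toBytes(k))).toList.asJava
  // one batched round trip per partition instead of a full Scan per task
  val rows = table.get(gets).map { r =>
    Bytes.toString(r.getRow) -> Bytes.toString(r.getValue("cf1".getBytes, "VipPrice".getBytes))
  }
  table.close()
  connection.close()
  rows.iterator
}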

The output printed by the run is as follows:

Row key:1c|1cc063f9 InsertTime:2019-03-28 10:53:21.780 VipPrice:0.0000
Row key:44|442be99e InsertTime:2019-03-28 10:53:21.945 VipPrice:0.0000
Row key:44|44334a63 InsertTime:2019-03-28 10:53:04.346 VipPrice:0.0000
Row key:4a|4a4a3eb9 InsertTime:2019-03-28 10:53:21.845 VipPrice:0.0000
Row key:6c|6c7b57a0 InsertTime:2019-03-28 10:53:21.258 VipPrice:0.0000
Row key:80|809a1f10 InsertTime:2019-03-28 10:53:08.014 VipPrice:0.0000
Row key:a3|a3f97943 InsertTime:2019-03-28 10:53:03.909 VipPrice:0.0000
Row key:a8|a83215ec InsertTime:2019-03-28 10:53:05.094 VipPrice:0.0000
Row key:d5|d561106e InsertTime:2019-03-28 10:53:04.792 VipPrice:0.0000
Row key:e3|e34daa5e InsertTime:2019-03-28 10:53:21.712 VipPrice:0.0000
Row key:e5|e5d0270e InsertTime:2019-03-28 10:53:07.076 VipPrice:0.0000
Row key:f1|f17b91b7 InsertTime:2019-03-28 10:53:21.995 VipPrice:0.0000
Row key:f5|f5ba844d InsertTime:2019-03-28 10:53:07.811 VipPrice:0.0000
Row key:fe|fe4a30f2 InsertTime:2019-03-28 10:53:04.719 VipPrice:0.0000
Number of elements: 107

If anything here is wrong, corrections are welcome. If you have questions, you can join QQ group 340297350; more Flink and Spark material is shared in the Knowledge Planet group.
