Reading HBase data with Spark (the newAPIHadoopRDD approach)

There are many ways to read HBase data with Spark. This post uses Spark's built-in newAPIHadoopRDD method to build a simple demo; the code is short, so only brief inline comments are included.

For writing to HBase from Spark, see these two earlier posts: https://blog.csdn.net/xianpanjia4616/article/details/85301998 and https://blog.csdn.net/xianpanjia4616/article/details/80738961
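
For reference, here is a minimal sketch of the write path (not necessarily the exact code from those posts), assuming an existing SparkContext `sc`, a target table that already exists with column family cf1, and the TableOutputFormat / saveAsNewAPIHadoopDataset combination; the table name and sample values are only illustrative:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.Put
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.mapreduce.Job

// Assumes HBase connection settings (e.g. hbase.zookeeper.quorum) come from
// hbase-site.xml on the classpath, or are set on writeConf as in the read demo below.
val writeConf = HBaseConfiguration.create()
writeConf.set(TableOutputFormat.OUTPUT_TABLE, "cbd:prod_base")  // illustrative table name
val job = Job.getInstance(writeConf)
job.setOutputFormatClass(classOf[TableOutputFormat[ImmutableBytesWritable]])

// Turn each record into a Put keyed by its rowkey, then write the pairs out.
val puts = sc.parallelize(Seq(("rowkey-001", "2019-03-28 10:53:21.780")))
  .map { case (rowkey, insertTime) =>
    val put = new Put(Bytes.toBytes(rowkey))
    put.addColumn(Bytes.toBytes("cf1"), Bytes.toBytes("InsertTime"), Bytes.toBytes(insertTime))
    (new ImmutableBytesWritable, put)
  }
puts.saveAsNewAPIHadoopDataset(job.getConfiguration)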

package hbase

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.TableInputFormat
import org.apache.hadoop.hbase.util.Bytes
import org.apache.log4j.{Level, Logger}
import util.PropertiesScalaUtils
import org.apache.spark.sql.SparkSession

/**
  * Read data from HBase with Spark
  */
object ReadHbase {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org.apache.spark").setLevel(Level.WARN)
    Logger.getLogger("org.apache.hadoop").setLevel(Level.WARN)
    Logger.getLogger("org.eclipse.jetty.server").setLevel(Level.OFF)
    val spark = SparkSession
      .builder
      .appName("read hbase")
      .master("local[4]")
      .config("spark.some.config.option", "config-value")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .getOrCreate
    val sc = spark.sparkContext
    val mode = "local"
    val zk_hbase = PropertiesScalaUtils.loadProperties("zk_hbase",mode)
    val zk_port = PropertiesScalaUtils.loadProperties("zk_port",mode)
    val hbase_master = PropertiesScalaUtils.loadProperties("hbase_master",mode)
    val hbase_rootdir = PropertiesScalaUtils.loadProperties("hbase_rootdir",mode)
    val zookeeper_znode_parent = PropertiesScalaUtils.loadProperties("zookeeper_znode_parent",mode)
    val hbase_table = PropertiesScalaUtils.loadProperties("hbase_table",mode)

    // HBase / ZooKeeper connection settings
    val conf = HBaseConfiguration.create()
    conf.set("hbase.zookeeper.quorum", zk_hbase)
    conf.set("hbase.zookeeper.property.clientPort", zk_port)
    conf.set("hbase.master", hbase_master)
    conf.set("hbase.defaults.for.version.skip", "true")
    conf.set("hbase.rootdir", hbase_rootdir)
    conf.set("zookeeper.znode.parent", zookeeper_znode_parent)
    conf.set(TableInputFormat.INPUT_TABLE, hbase_table)
    // Create an RDD of (ImmutableBytesWritable, Result) pairs via TableInputFormat
    val hbaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
      classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
      classOf[org.apache.hadoop.hbase.client.Result])
    // Sample roughly 10% of the rows and print a few columns from family cf1
    hbaseRDD.sample(false, 0.1).foreachPartition(fp => {
      fp.foreach(f=>{
        val rowkey = Bytes.toString(f._2.getRow)
        val InsertTime = Bytes.toString(f._2.getValue("cf1".getBytes,"InsertTime".getBytes))
        val VipPrice = Bytes.toString(f._2.getValue("cf1".getBytes,"VipPrice".getBytes))
        println(s"Row key:$rowkey InsertTime:$InsertTime VipPrice:$VipPrice")
      })
    })
    println("元素的个数:"+hbaseRDD.count())
    sc.stop()
  }
}

Reading HBase through TableInputFormat has some drawbacks: each task can only run a single Scan against HBase, and TableInputFormat does not support BulkGet operations. Readers who are interested can dig into the various other ways of reading HBase.
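
While each task is still limited to one Scan, you can at least narrow what that Scan reads. A minimal sketch, assuming the same `conf` object and cf1 column family as the demo above; the row-key range is purely illustrative, and the property names are TableInputFormat's public constants:

// Restrict the Scan that TableInputFormat will run (set before calling newAPIHadoopRDD).
conf.set(TableInputFormat.SCAN_ROW_START, "44|")                        // illustrative start row
conf.set(TableInputFormat.SCAN_ROW_STOP, "45|")                         // illustrative stop row (exclusive)
conf.set(TableInputFormat.SCAN_COLUMNS, "cf1:InsertTime cf1:VipPrice")  // fetch only the columns we read
conf.set(TableInputFormat.SCAN_CACHEDROWS, "500")                       // rows fetched per RPC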

The printed output from running the job is as follows:

Row key:1c|1cc063f9 InsertTime:2019-03-28 10:53:21.780 VipPrice:0.0000
Row key:44|442be99e InsertTime:2019-03-28 10:53:21.945 VipPrice:0.0000
Row key:44|44334a63 InsertTime:2019-03-28 10:53:04.346 VipPrice:0.0000
Row key:4a|4a4a3eb9 InsertTime:2019-03-28 10:53:21.845 VipPrice:0.0000
Row key:6c|6c7b57a0 InsertTime:2019-03-28 10:53:21.258 VipPrice:0.0000
Row key:80|809a1f10 InsertTime:2019-03-28 10:53:08.014 VipPrice:0.0000
Row key:a3|a3f97943 InsertTime:2019-03-28 10:53:03.909 VipPrice:0.0000
Row key:a8|a83215ec InsertTime:2019-03-28 10:53:05.094 VipPrice:0.0000
Row key:d5|d561106e InsertTime:2019-03-28 10:53:04.792 VipPrice:0.0000
Row key:e3|e34daa5e InsertTime:2019-03-28 10:53:21.712 VipPrice:0.0000
Row key:e5|e5d0270e InsertTime:2019-03-28 10:53:07.076 VipPrice:0.0000
Row key:f1|f17b91b7 InsertTime:2019-03-28 10:53:21.995 VipPrice:0.0000
Row key:f5|f5ba844d InsertTime:2019-03-28 10:53:07.811 VipPrice:0.0000
Row key:fe|fe4a30f2 InsertTime:2019-03-28 10:53:04.719 VipPrice:0.0000
Number of elements:107

If anything in this post is incorrect, corrections are welcome. If you have questions, you can join QQ group 340297350; more Flink and Spark material is shared in the author's Knowledge Planet (星球).
