mysql(oracle)-shareplex-kafka-flink-hbase data synchronization

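The databases most commonly run in enterprise operations are mysql (oracle). They share one weakness: once a table reaches tens of millions of rows, operations against it become very slow.
If the data then has to be displayed in real time, this is a disaster for mysql, which at the same moment also has to serve multiple developers and users.
So, after some investigation, we decided to synchronize the mysql data to hbase in real time. The architectures tried first were:
Mysql -> logstash -> kafka -> sparkStreaming -> hbase -> web
Mysql -> sqoop -> hbase -> web
Whichever of logstash or sqoop is used, though, one awkward problem cannot be avoided:
both have to run queries against mysql while importing the data, which adds load to a database that is already struggling.
This post does not go into mysql(oracle)-shareplex; plenty of demos covering the path from the database to the sync tool are available. It focuses on kafka-flink-hbase.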

KAFKA

Download and install Kafka
[root@QRHEL64KFK root]# wget https://www.apache.org/dyn/closer.cgi?path=/kafka/0.10.1.1/kafka_2.10-0.10.1.1.tgz
[root@QRHEL64KFK root]# ls
kafka_2.10-0.10.1.1.tgz
[root@QRHEL64KFK root]# tar zxvf kafka_2.10-0.10.1.1.tgz
[root@QRHEL64KFK root]# cd kafka_2.10-0.10.1.1
[root@QRHEL64KFK kafka_2.10-0.10.1.1]# pwd
/root/kafka_2.10-0.10.1.1

Start Kafka and run it in the background

[root@QRHEL64KFK kafka_2.10-0.10.1.1]# bin/zookeeper-server-start.sh config/zookeeper.properties &
[root@QRHEL64KFK kafka_2.10-0.10.1.1]# bin/kafka-server-start.sh config/server.properties &
[root@QRHEL64KFK kafka_2.10-0.10.1.1]# jobs
[1]- Running bin/zookeeper-server-start.sh config/zookeeper.properties &
[2]+ Running bin/kafka-server-start.sh config/server.properties &

Create a Kafka topic named test

[root@QRHEL64KFK kafka_2.10-0.10.1.1]# bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test

List the Kafka topics

[root@QRHEL64KFK kafka_2.10-0.10.1.1]# bin/kafka-topics.sh --list --zookeeper localhost:2181
test

Write test messages to the topic with a producer

[root@QRHEL64KFK kafka_2.10-0.10.1.1]# bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message

Read the messages written by the producer with a consumer

[root@QRHEL64KFK kafka_2.10-0.10.1.1]# bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
This is a message
This is another message
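
The messages above are free-form strings. The Flink job later in this post instead expects each Kafka record to hold seven fields joined by `#CS#`, in the order file name, file offset, database, table, event type, columns, row number. The following is a minimal sketch of how such a test line could be built for pasting into the console producer; `SampleRecord`, `Col` and `render` are hypothetical names that are not part of the original code, and the values come from the sample comments in the job.

```scala
// Hypothetical helper: renders one test record in the "#CS#" format
// that the Flink job in this post splits on.
object SampleRecord {
  // One source column: name, value, and whether the statement changed it.
  final case class Col(name: String, value: String, changed: Boolean)

  val Sep = "#CS#"

  def render(binlogFile: String, offset: Long, db: String, table: String,
             eventType: String, cols: Seq[Col], rowNum: Int): String = {
    // Columns are written as [[name, value, flag], [name, value, flag], ...]
    val columns = cols
      .map(c => s"[${c.name}, ${c.value}, ${c.changed}]")
      .mkString("[", ", ", "]")
    Seq(binlogFile, offset.toString, db, table, eventType, columns, rowNum.toString)
      .mkString(Sep)
  }

  def main(args: Array[String]): Unit = {
    val cols = Seq(
      Col("uid", "18", changed = true),
      Col("uname", "spark", changed = true),
      Col("upassword", "1111", changed = true))
    println(render("mysql-bin.000001", 7470L, "test", "users", "INSERT", cols, 1))
  }
}
```

Note that the job reads the topic `canal`, not `test`, so that topic would have to be created and written to in the same way as shown above.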

Use flink to parse the kafka data into HBase DML operations and store them in hbase in real time

import java.util
import java.util.Properties

import org.apache.commons.lang3.StringUtils
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.flink.api.scala._
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Delete, Put}
import org.apache.hadoop.hbase.util.Bytes


/**
  * Created by angel；
  */

object DataExtraction {
  // 1. Connection settings
  val zkCluster = "hadoop01,hadoop02,hadoop03"
  val kafkaCluster = "hadoop01:9092,hadoop02:9092,hadoop03:9092"
  val kafkaTopicName = "canal"
  val hbasePort = "2181"
  val tableName:TableName = TableName.valueOf("canal")
  val columnFamily = "info"


  def main(args: Array[String]): Unit = {
    // 2. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStateBackend(new FsStateBackend("hdfs://hadoop01:9000/flink-checkpoint/checkpoint/"))
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.getConfig.setAutoWatermarkInterval(2000) // emit watermarks periodically, every 2000 ms
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env.getCheckpointConfig.setCheckpointInterval(6000) // checkpoint every 6000 ms
    System.setProperty("hadoop.home.dir", "/")
    // 3. Configure the Kafka source
    val properties = new Properties()
    properties.setProperty("bootstrap.servers", kafkaCluster)
    properties.setProperty("zookeeper.connect", zkCluster)
    properties.setProperty("group.id", kafkaTopicName)

    val kafka09 = new FlinkKafkaConsumer09[String](kafkaTopicName, new SimpleStringSchema(), properties)
    // 4. Add the Kafka consumer as the data source
    val text = env.addSource(kafka09).setParallelism(1)
    // 5. Parse each Kafka record and wrap it in a Canal object
    val values = text.map{
      line =>
        val values = line.split("#CS#")
        val valuesLength = values.length
        // Fields missing at the end of a record default to the empty string
        val fileName = if(valuesLength > 0) values(0) else ""
        val fileOffset = if(valuesLength > 1) values(1) else ""
        val dbName = if(valuesLength > 2) values(2) else ""
        val tableName = if(valuesLength > 3) values(3) else ""
        val eventType = if(valuesLength > 4) values(4) else ""
        val columns = if(valuesLength > 5) values(5) else ""
        val rowNum = if(valuesLength > 6) values(6) else ""
        // e.g. (mysql-bin.000001,7470,test,users,[uid, 18, true, uname, spark, true, upassword, 1111, true],null,1)
        Canal(fileName , fileOffset , dbName , tableName ,eventType, columns  , rowNum)
    }


    // 6. Write the data to HBase
    val list_columns_ = values.map{
      line =>
        // Raw columns string
        val strColumns = line.columns
        // e.g. [[uid, 22, true], [uname, spark, true], [upassword, 1111, true]]
        val array_columns = packaging_str_list(strColumns)
        // Primary key value (taken from the first column)
        val primaryKey = getPrimaryKey(array_columns)
        // Build the rowkey: dbName_tableName_primaryKey
        val rowkey = line.dbName+"_"+line.tableName+"_"+primaryKey
        // Operation type: INSERT, UPDATE or DELETE
        val eventType = line.eventType
        // Columns touched by an INSERT or UPDATE
        val triggerFields: util.ArrayList[UpdateFields] = getTriggerColumns(array_columns , eventType)
        // Related source tables are bound to share column names, so the intended HBase table is
        // line.dbName + line.tableName. Note: this value is not used yet; the operators below
        // write to the fixed table defined by tableName ("canal").
        val hbase_table = line.dbName + line.tableName
        // DELETE removes the row by rowkey
        if(eventType.equals("DELETE")){
          operatorDeleteHbase(rowkey , eventType)
        }else{
          if(triggerFields.size() > 0){
            operatorHbase(rowkey , eventType , triggerFields)
          }
        }
    }
    env.execute()

  }



  // Strip the outer brackets from the columns string
  def packaging_str_list(str_list:String):String ={
    val substring = str_list.substring(1 , str_list.length-1)
    substring
  }


  // Get the primary key value of a row; the first column is taken to be the primary key
  def getPrimaryKey(columns :String):String = {
    // e.g. [uid, 1, false], [uname, abc, false], [upassword, uabc, false]
    val arrays: Array[String] = StringUtils.substringsBetween(columns , "[" , "]")
    val primaryStr: String = arrays(0) // e.g. "uid, 13, true"
    primaryStr.split(",")(1).trim
  }

  // Get the columns affected by the operation (index 0, the primary key, is skipped)
  def getTriggerColumns(columns :String , eventType:String): util.ArrayList[UpdateFields] ={
    val arrays: Array[String] = StringUtils.substringsBetween(columns , "[" , "]")
    val list = new util.ArrayList[UpdateFields]()
    eventType match {
      case "UPDATE" =>
        // Only the columns flagged as changed
        for(index <- 1 until arrays.length){
          val split: Array[String] = arrays(index).split(",")
          if(split(2).trim.toBoolean){
            // trim removes the space left behind by the ", " separator
            list.add(UpdateFields(split(0).trim , split(1).trim))
          }
        }
        list
      case "INSERT" =>
        // Every non-key column
        for(index <- 1 until arrays.length){
          val split: Array[String] = arrays(index).split(",")
          list.add(UpdateFields(split(0).trim , split(1).trim))
        }
        list
      case _ =>
        list

    }
  }
  // Insert / update
  def operatorHbase(rowkey:String , eventType:String , triggerFields:util.ArrayList[UpdateFields]): Unit ={
    val config = HBaseConfiguration.create()
    config.set("hbase.zookeeper.quorum", zkCluster)
    config.set("hbase.master", "hadoop01:60000")
    config.set("hbase.zookeeper.property.clientPort", hbasePort)
    config.setInt("hbase.rpc.timeout", 20000)
    config.setInt("hbase.client.operation.timeout", 30000)
    config.setInt("hbase.client.scanner.timeout.period", 200000)
    val connect = ConnectionFactory.createConnection(config)
    try {
      val admin = connect.getAdmin
      // Table descriptor
      val hTableDescriptor = new HTableDescriptor(tableName)
      // Column family descriptor
      val hColumnDescriptor = new HColumnDescriptor(columnFamily)
      hTableDescriptor.addFamily(hColumnDescriptor)
      // Create the table if it does not exist yet
      if(!admin.tableExists(tableName)){
        admin.createTable(hTableDescriptor)
      }
      // The table exists at this point, so write the row
      val table = connect.getTable(tableName)
      val put = new Put(Bytes.toBytes(rowkey))
      // Columns to write, e.g. [UpdateFields(uname, spark), UpdateFields(upassword, 1111)]
      for(index <- 0 until triggerFields.size()){
        val fields = triggerFields.get(index)
        put.addColumn(Bytes.toBytes(columnFamily) , Bytes.toBytes(fields.key) , Bytes.toBytes(fields.value))
      }
      table.put(put)
      table.close()
    } finally {
      // A connection is opened for every record, so close it to avoid leaking it
      connect.close()
    }
  }
  // Delete
  def operatorDeleteHbase(rowkey:String , eventType:String): Unit ={
    val config = HBaseConfiguration.create()
    config.set("hbase.zookeeper.quorum", zkCluster)
    config.set("hbase.zookeeper.property.clientPort", hbasePort)
    config.setInt("hbase.rpc.timeout", 20000)
    config.setInt("hbase.client.operation.timeout", 30000)
    config.setInt("hbase.client.scanner.timeout.period", 200000)
    val connect = ConnectionFactory.createConnection(config)
    try {
      val admin = connect.getAdmin
      // Delete the row only if the table exists
      if(admin.tableExists(tableName)){
        val table = connect.getTable(tableName)
        val delete = new Delete(Bytes.toBytes(rowkey))
        table.delete(delete)
        table.close()
      }
    } finally {
      // A connection is opened for every record, so close it to avoid leaking it
      connect.close()
    }
  }


}
//[uname, spark, true], [upassword, 11122221, true]
case class UpdateFields(key:String , value:String)


//(fileName , fileOffset , dbName , tableName ,eventType, columns  , rowNum)
case class Canal(fileName:String ,
                 fileOffset:String,
                 dbName:String ,
                 tableName:String ,
                 eventType:String ,
                 columns:String ,
                 rowNum:String
                )
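
To make the record format and the derived HBase operation easier to follow, here is a minimal, self-contained sketch of the parsing logic with no Flink, HBase or commons-lang dependency. It is not the job above: `RecordParsingSketch`, `Column`, `Dml`, `parseColumns` and `toDml` are hypothetical names, and a regular expression stands in for `StringUtils.substringsBetween`. Like the job, it assumes the first column is the primary key and that values contain no commas or brackets.

```scala
// Standalone sketch of how one "#CS#" record maps to a rowkey and a set of writes.
object RecordParsingSketch {
  final case class Column(name: String, value: String, changed: Boolean)
  final case class Dml(rowkey: String, eventType: String, writes: List[Column])

  // Matches each innermost "[name, value, flag]" group of the columns field.
  private val ColumnGroup = """\[([^\[\]]*)\]""".r

  // Splits the columns field into Column entries; malformed groups are skipped.
  def parseColumns(columnsField: String): List[Column] =
    ColumnGroup.findAllMatchIn(columnsField).toList.flatMap { m =>
      m.group(1).split(",").map(_.trim) match {
        case Array(name, value, flag) =>
          Some(Column(name, value, flag.equalsIgnoreCase("true")))
        case _ => None
      }
    }

  // Returns None when the record has too few fields or no columns at all.
  def toDml(record: String): Option[Dml] = {
    val fields = record.split("#CS#")
    if (fields.length < 6) None
    else {
      val columns = parseColumns(fields(5))
      columns.headOption.map { primaryKey =>
        val nonKey = columns.tail
        val writes = fields(4) match {
          case "UPDATE" => nonKey.filter(_.changed) // only changed columns
          case "INSERT" => nonKey                   // every non-key column
          case _        => Nil                      // DELETE needs only the rowkey
        }
        // rowkey = dbName_tableName_primaryKey
        Dml(s"${fields(2)}_${fields(3)}_${primaryKey.value}", fields(4), writes)
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val record = "mysql-bin.000001#CS#7470#CS#test#CS#users#CS#UPDATE#CS#" +
      "[[uid, 18, true], [uname, spark, true], [upassword, 1111, false]]#CS#1"
    println(toDml(record))
  }
}
```

For the sample record this yields the rowkey `test_users_18` and a single write for `uname`; `upassword` is skipped because its changed flag is false.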

Packaging and deployment

Add the maven build plugins:
1: Packaging a java program

<sourceDirectory>src/main/java</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-compiler-plugin</artifactId>
  <version>2.5.1</version>
  <configuration>
    <source>1.7</source>
    <target>1.7</target>
    <!--<encoding>${project.build.sourceEncoding}</encoding>-->
  </configuration>
</plugin>

<plugin>
  <groupId>net.alchim31.maven</groupId>
  <artifactId>scala-maven-plugin</artifactId>
  <version>3.2.0</version>
  <executions>
    <execution>
      <goals>
        <goal>compile</goal>
        <goal>testCompile</goal>
      </goals>
      <configuration>
        <args>
          <!--<arg>-make:transitive</arg>-->
          <arg>-dependencyfile</arg>
          <arg>${project.build.directory}/.scala_dependencies</arg>
        </args>

      </configuration>
    </execution>
  </executions>
</plugin>
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <version>2.18.1</version>
  <configuration>
    <useFile>false</useFile>
    <disableXmlReport>true</disableXmlReport>
    <includes>
      <include>**/*Test.*</include>
      <include>**/*Suite.*</include>
    </includes>
  </configuration>
</plugin>

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <filters>
          <filter>
            <artifact>*:*</artifact>
            <excludes>
              <!--
              zip -d learn_spark.jar META-INF/*.RSA META-INF/*.DSA META-INF/*.SF
              -->
              <exclude>META-INF/*.SF</exclude>
              <exclude>META-INF/*.DSA</exclude>
              <exclude>META-INF/*.RSA</exclude>
            </excludes>
          </filter>
        </filters>
        <transformers>
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>canal.CanalClient</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>

2: Packaging a scala program
Change the maven configuration above to:

<sourceDirectory>src/main/scala</sourceDirectory>
<mainClass>fully qualified name of the scala driver class</mainClass>

maven packaging steps:
