The most common database in enterprise operations is MySQL (or Oracle), but both share a weakness: once a table reaches tens of millions of rows, operations against MySQL (or Oracle) become very slow. If at that point there is a requirement to display the data in real time, that is a disaster for MySQL, all the more so because the same instance must serve multiple developers and users at the same time.
After some research, we decided to replicate the MySQL data into HBase in real time. The architectures we tried first were:
MySQL -> Logstash -> Kafka -> Spark Streaming -> HBase -> Web
MySQL -> Sqoop -> HBase -> Web
But whether we used Logstash or Sqoop, one awkward problem was unavoidable: both tools have to run queries against MySQL while exporting the data.
This article will not dwell on the MySQL (Oracle) -> SharePlex leg; there are plenty of demos of database-to-replication-tool setups available. The focus here is Kafka -> Flink -> HBase.
KAFKA
Download and install Kafka
[root@QRHEL64KFK root]# wget https://www.apache.org/dyn/closer.cgi?path=/kafka/0.10.1.1/kafka_2.10-0.10.1.1.tgz
[root@QRHEL64KFK root]# ls
kafka_2.10-0.10.1.1.tgz
[root@QRHEL64KFK root]# tar zxvf kafka_2.10-0.10.1.1.tgz
[root@QRHEL64KFK root]# cd kafka_2.10-0.10.1.1
[root@QRHEL64KFK kafka_2.10-0.10.1.1]# pwd
/root/kafka_2.10-0.10.1.1
Start Kafka and run it in the background
[root@QRHEL64KFK kafka_2.10-0.10.1.1]# bin/zookeeper-server-start.sh config/zookeeper.properties &
[root@QRHEL64KFK kafka_2.10-0.10.1.1]# bin/kafka-server-start.sh config/server.properties &
[root@QRHEL64KFK kafka_2.10-0.10.1.1]# jobs
[1]- Running bin/zookeeper-server-start.sh config/zookeeper.properties &
[2]+ Running bin/kafka-server-start.sh config/server.properties &
Create a Kafka topic named test
[root@QRHEL64KFK kafka_2.10-0.10.1.1]# bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
List the Kafka topics
[root@QRHEL64KFK kafka_2.10-0.10.1.1]# bin/kafka-topics.sh --list --zookeeper localhost:2181
test
Test writing messages to the topic through the console producer
[root@QRHEL64KFK kafka_2.10-0.10.1.1]# bin/kafka-console-producer.sh --broker-list localhost:9092 --topic test
This is a message
This is another message
Read the producer's messages back with the console consumer
[root@QRHEL64KFK kafka_2.10-0.10.1.1]# bin/kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic test --from-beginning
This is a message
This is another message
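Before wiring up Flink, it helps to see what a real message on the canal topic looks like. In this pipeline each record is a Canal-parsed binlog event serialized as #CS#-separated fields (fileName, fileOffset, dbName, tableName, eventType, columns, rowNum; see the Flink job below). The following Scala sketch publishes one hand-built record so the job can be tested without a full MySQL-plus-Canal setup; the broker addresses and field values are assumptions that merely mirror the example record used later in this article.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

//Sketch only: publish one hand-built Canal-style record to the "canal" topic.
//The broker list and the record contents are illustrative assumptions.
object CanalTestProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.setProperty("bootstrap.servers", "hadoop01:9092,hadoop02:9092,hadoop03:9092")
    props.setProperty("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.setProperty("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)
    //fileName#CS#fileOffset#CS#dbName#CS#tableName#CS#eventType#CS#columns#CS#rowNum
    val record = List(
      "mysql-bin.000001", "7470", "test", "users", "INSERT",
      "[[uid, 18, true], [uname, spark, true], [upassword, 1111, true]]", "1"
    ).mkString("#CS#")
    producer.send(new ProducerRecord[String, String]("canal", record))
    producer.flush()
    producer.close()
  }
}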
Use Flink to parse the Kafka data into HBase DML operations and write them to HBase in real time
import java.util
import java.util.Properties
import org.apache.commons.lang3.StringUtils
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09
import org.apache.flink.streaming.util.serialization.SimpleStringSchema
import org.apache.flink.api.scala._
import org.apache.flink.runtime.state.filesystem.FsStateBackend
import org.apache.flink.streaming.api.{CheckpointingMode, TimeCharacteristic}
import org.apache.hadoop.hbase.{HBaseConfiguration, HColumnDescriptor, HTableDescriptor, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Delete, Put}
import org.apache.hadoop.hbase.util.Bytes
/**
  * Created by angel;
  */
object DataExtraction {
  //1. Cluster, topic, and table configuration
  val zkCluster = "hadoop01,hadoop02,hadoop03"
  val kafkaCluster = "hadoop01:9092,hadoop02:9092,hadoop03:9092"
  val kafkaTopicName = "canal"
  val hbasePort = "2181"
  val tableName:TableName = TableName.valueOf("canal")
  val columnFamily = "info"

  def main(args: Array[String]): Unit = {
    //2. Create the stream execution environment with exactly-once checkpointing
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStateBackend(new FsStateBackend("hdfs://hadoop01:9000/flink-checkpoint/checkpoint/"))
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    env.getConfig.setAutoWatermarkInterval(2000) //emit watermarks periodically
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    env.getCheckpointConfig.setCheckpointInterval(6000)
    System.setProperty("hadoop.home.dir", "/")
    //3. Create the Kafka consumer
    val properties = new Properties()
    properties.setProperty("bootstrap.servers", kafkaCluster)
    properties.setProperty("zookeeper.connect", zkCluster)
    properties.setProperty("group.id", kafkaTopicName)
    val kafka09 = new FlinkKafkaConsumer09[String](kafkaTopicName, new SimpleStringSchema(), properties)
    //4. Add the data source: addSource(kafka09)
    val text = env.addSource(kafka09).setParallelism(1)
    //5. Parse the Kafka stream and wrap each record in a Canal object
    val values = text.map{
      line =>
        val values = line.split("#CS#")
        val valuesLength = values.length
        //guard against short records: missing fields default to ""
        val fileName = if(valuesLength > 0) values(0) else ""
        val fileOffset = if(valuesLength > 1) values(1) else ""
        val dbName = if(valuesLength > 2) values(2) else ""
        val tableName = if(valuesLength > 3) values(3) else ""
        val eventType = if(valuesLength > 4) values(4) else ""
        val columns = if(valuesLength > 5) values(5) else ""
        val rowNum = if(valuesLength > 6) values(6) else ""
        //(mysql-bin.000001,7470,test,users,[uid, 18, true, uname, spark, true, upassword, 1111, true],null,1)
        Canal(fileName , fileOffset , dbName , tableName ,eventType, columns , rowNum)
    }
    //6. Land the data in HBase
    val list_columns_ = values.map{
      line =>
        //strip the outer brackets from the columns string
        val strColumns = line.columns
        //[[uid, 22, true], [uname, spark, true], [upassword, 1111, true]]
        val array_columns = packaging_str_list(strColumns)
        //extract the primary key
        val primaryKey = getPrimaryKey(array_columns)
        //build the rowkey: dbName + tableName + primaryKey
        val rowkey = line.dbName+"_"+line.tableName+"_"+primaryKey
        //operation type: INSERT / UPDATE / DELETE
        val eventType = line.eventType
        //columns touched by the event (INSERT / UPDATE)
        val triggerFileds: util.ArrayList[UpdateFields] = getTriggerColumns(array_columns , eventType)
        //different tables are related and can share column names, so the HBase table name is line.dbName + line.tableName
        val hbase_table = line.dbName + line.tableName
        //a DELETE removes the row by rowkey; anything else upserts the changed columns
        if(eventType.equals("DELETE")){
          operatorDeleteHbase(rowkey , eventType)
        }else{
          if(triggerFileds.size() > 0){
            operatorHbase(rowkey , eventType , triggerFileds)
          }
        }
    }
    env.execute()
  }
  //strip the outer brackets from the column list string
  def packaging_str_list(str_list:String):String ={
    val substring = str_list.substring(1 , str_list.length-1)
    substring
  }

  //extract each table's primary key
  def getPrimaryKey(columns :String):String = {
    // [uid, 1, false], [uname, abc, false], [upassword, uabc, false]
    val arrays: Array[String] = StringUtils.substringsBetween(columns , "[" , "]")
    val primaryStr: String = arrays(0) //uid, 13, true
    primaryStr.split(",")(1).trim
  }
  //extract the columns changed by the event
  def getTriggerColumns(columns :String , eventType:String): util.ArrayList[UpdateFields] ={
    val arrays: Array[String] = StringUtils.substringsBetween(columns , "[" , "]")
    val list = new util.ArrayList[UpdateFields]()
    eventType match {
      case "UPDATE" =>
        //index 0 is the primary key; keep only the columns flagged as changed
        for(index <- 1 to arrays.length-1){
          val split: Array[String] = arrays(index).split(",")
          if(split(2).trim.toBoolean){
            list.add(UpdateFields(split(0) , split(1)))
          }
        }
        list
      case "INSERT" =>
        for(index <- 1 to arrays.length-1){
          val split: Array[String] = arrays(index).split(",")
          list.add(UpdateFields(split(0) , split(1)))
        }
        list
      case _ =>
        list
    }
  }
  //insert/update operation
  def operatorHbase(rowkey:String , eventType:String , triggerFileds:util.ArrayList[UpdateFields]): Unit ={
    val config = HBaseConfiguration.create()
    config.set("hbase.zookeeper.quorum", zkCluster)
    config.set("hbase.master", "hadoop01:60000")
    config.set("hbase.zookeeper.property.clientPort", hbasePort)
    config.setInt("hbase.rpc.timeout", 20000)
    config.setInt("hbase.client.operation.timeout", 30000)
    config.setInt("hbase.client.scanner.timeout.period", 200000)
    val connect = ConnectionFactory.createConnection(config)
    val admin = connect.getAdmin
    //build the table descriptor
    val hTableDescriptor = new HTableDescriptor(tableName)
    //build the column family descriptor
    val hColumnDescriptor = new HColumnDescriptor(columnFamily)
    hTableDescriptor.addFamily(hColumnDescriptor)
    if(!admin.tableExists(tableName)){
      admin.createTable(hTableDescriptor)
    }
    //the table now exists, so insert the data
    val table = connect.getTable(tableName)
    val put = new Put(Bytes.toBytes(rowkey))
    //write each changed column, e.g. [UpdateFields(uname, spark), UpdateFields(upassword, 1111)]
    for(index <- 0 to triggerFileds.size()-1){
      val fields = triggerFileds.get(index)
      val key = fields.key
      val value = fields.value
      put.addColumn(Bytes.toBytes(columnFamily) , Bytes.toBytes(key) , Bytes.toBytes(value))
    }
    table.put(put)
    //release resources so connections do not leak
    table.close()
    admin.close()
    connect.close()
  }
  //delete operation
  def operatorDeleteHbase(rowkey:String , eventType:String): Unit ={
    val config = HBaseConfiguration.create()
    config.set("hbase.zookeeper.quorum", zkCluster)
    config.set("hbase.zookeeper.property.clientPort", hbasePort)
    config.setInt("hbase.rpc.timeout", 20000)
    config.setInt("hbase.client.operation.timeout", 30000)
    config.setInt("hbase.client.scanner.timeout.period", 200000)
    val connect = ConnectionFactory.createConnection(config)
    val admin = connect.getAdmin
    if(admin.tableExists(tableName)){
      val table = connect.getTable(tableName)
      val delete = new Delete(Bytes.toBytes(rowkey))
      table.delete(delete)
      table.close()
    }
    //release resources so connections do not leak
    admin.close()
    connect.close()
  }
}
//[uname, spark, true], [upassword, 11122221, true]
case class UpdateFields(key:String , value:String)

//(fileName , fileOffset , dbName , tableName ,eventType, columns , rowNum)
case class Canal(fileName:String ,
                 fileOffset:String,
                 dbName:String ,
                 tableName:String ,
                 eventType:String ,
                 columns:String ,
                 rowNum:String
                )
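One caveat about the job above: operatorHbase and operatorDeleteHbase open a fresh HBase connection for every single record, and the HBase writes run inside a map that is used purely for its side effects. At any real throughput a common refinement is to move the writes into a RichSinkFunction that opens one connection per parallel task in open() and releases it in close(). The sketch below is a minimal illustration of that pattern, not the original implementation; it reuses the Canal case class and the parsing helpers from DataExtraction, assumes the same cluster addresses as above, and assumes the target table already exists.

import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction
import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{Connection, ConnectionFactory, Delete, Put}
import org.apache.hadoop.hbase.util.Bytes

//Sketch: one HBase connection per parallel task instead of one per record.
class HBaseCanalSink extends RichSinkFunction[Canal] {
  private var connection: Connection = _

  override def open(parameters: Configuration): Unit = {
    val config = HBaseConfiguration.create()
    config.set("hbase.zookeeper.quorum", "hadoop01,hadoop02,hadoop03")
    config.set("hbase.zookeeper.property.clientPort", "2181")
    connection = ConnectionFactory.createConnection(config)
  }

  override def invoke(line: Canal): Unit = {
    val columns = DataExtraction.packaging_str_list(line.columns)
    val rowkey = line.dbName + "_" + line.tableName + "_" + DataExtraction.getPrimaryKey(columns)
    val table = connection.getTable(TableName.valueOf("canal"))
    try {
      if (line.eventType == "DELETE") {
        table.delete(new Delete(Bytes.toBytes(rowkey)))
      } else {
        val fields = DataExtraction.getTriggerColumns(columns, line.eventType)
        if (fields.size() > 0) {
          val put = new Put(Bytes.toBytes(rowkey))
          for (i <- 0 until fields.size()) {
            val f = fields.get(i)
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes(f.key), Bytes.toBytes(f.value))
          }
          table.put(put)
        }
      }
    } finally {
      table.close()
    }
  }

  override def close(): Unit = {
    if (connection != null) connection.close()
  }
}

With this in place, the side-effecting map in step 6 would be replaced by values.addSink(new HBaseCanalSink), and the per-record connection setup disappears.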
Packaging and deployment
Add the Maven packaging plugins:
1: Packaging the Java program
<sourceDirectory>src/main/java</sourceDirectory>
<testSourceDirectory>src/test/scala</testSourceDirectory>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.5.1</version>
<configuration>
<source>1.7</source>
<target>1.7</target>
<!--<encoding>${project.build.sourceEncoding}</encoding>-->
</configuration>
</plugin>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<version>3.2.0</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
<goal>testCompile</goal>
</goals>
<configuration>
<args>
<!--<arg>-make:transitive</arg>-->
<arg>-dependencyfile</arg>
<arg>${project.build.directory}/.scala_dependencies</arg>
</args>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-surefire-plugin</artifactId>
<version>2.18.1</version>
<configuration>
<useFile>false</useFile>
<disableXmlReport>true</disableXmlReport>
<includes>
<include>**/*Test.*</include>
<include>**/*Suite.*</include>
</includes>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>2.3</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<filters>
<filter>
<artifact>*:*</artifact>
<excludes>
<!--
zip -d learn_spark.jar META-INF/*.RSA META-INF/*.DSA META-INF/*.SF
-->
<exclude>META-INF/*.SF</exclude>
<exclude>META-INF/*.DSA</exclude>
<exclude>META-INF/*.RSA</exclude>
</excludes>
</filter>
</filters>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
<mainClass>canal.CanalClient</mainClass>
</transformer>
</transformers>
</configuration>
</execution>
</executions>
</plugin>
2: Packaging the Scala program
Change the Maven configuration above to:
<sourceDirectory>src/main/scala</sourceDirectory>
<mainClass>your Scala driver class</mainClass>
Maven packaging steps: