1 Introduction to Avro
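In the early days of the internet, many projects ran on a monolithic architecture, and Java's native serialization mechanism met most needs. As business volume and traffic grew, project architectures gradually migrated to microservices. Each microservice may be written in a different language, and some services have strict requirements on communication performance, so Java native serialization no longer suffices. It has two main problems: 1) it does not work across languages, and 2) the serialized output is relatively large. Adopting a third-party serialization protocol therefore becomes necessary.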
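To get a feel for the size overhead mentioned above, the following minimal sketch serializes a small object with Java's built-in ObjectOutputStream and compares the result with the raw size of its field data. The PlainUser class and the helper method names are hypothetical and exist only for this illustration. The exact byte count of the serialized form depends on the class and package names, so none is claimed here.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class NativeSerializationSize {

    // Hypothetical POJO used only for this illustration.
    static class PlainUser implements Serializable {
        private static final long serialVersionUID = 1L;
        final String name;
        final int id;
        final String phonenum;

        PlainUser(String name, int id, String phonenum) {
            this.name = name;
            this.id = id;
            this.phonenum = phonenum;
        }
    }

    /** Serializes the object with java.io.ObjectOutputStream and returns the bytes. */
    static byte[] javaSerialize(Serializable obj) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(out)) {
            oos.writeObject(obj);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Size of the field data alone: two UTF-8 strings plus a 4-byte int. */
    static int rawPayloadSize(PlainUser u) {
        return u.name.getBytes(StandardCharsets.UTF_8).length
                + Integer.BYTES
                + u.phonenum.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        PlainUser user = new PlainUser("Jay", 1024, "18814123346");
        System.out.println("Java native serialization: " + javaSerialize(user).length + " bytes");
        System.out.println("Raw field data:            " + rawPayloadSize(user) + " bytes");
    }
}
```

The gap between the two numbers comes from the stream header and the class descriptor (class name, field names, and field types) that Java writes alongside the values.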
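Apache Avro was created by Doug Cutting, the father of Hadoop. It works across languages and produces compact serialized output, and it is widely used in the big data field. Avro schemas are usually written in JSON, and the data itself is encoded in a binary format. The supported primitive types are null, boolean, int, long, float, double, bytes, and string.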
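Complex types (record, enum, array, map, union, and fixed) are also supported; see the official site https://avro.apache.org/ for details.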
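Part of the reason the binary format stays compact is that, according to the Avro specification, int and long values are written with variable-length zig-zag coding. The sketch below is a hand-written illustration of that coding using only the JDK; it is not the Avro library's own code, and the class and method names are made up for this example.

```java
import java.io.ByteArrayOutputStream;

class AvroZigZag {

    /** Encodes a value the way Avro encodes int and long: zig-zag, then base-128 varint. */
    static byte[] encodeLong(long n) {
        // Zig-zag maps 0, -1, 1, -2, 2, ... to 0, 1, 2, 3, 4, ... so small magnitudes stay small.
        long z = (n << 1) ^ (n >> 63);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // Emit 7 bits per byte, least significant group first; the high bit means "more bytes follow".
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
        return out.toByteArray();
    }

    /** Reverses encodeLong for a single encoded value. */
    static long decodeLong(byte[] bytes) {
        long z = 0;
        int shift = 0;
        for (byte b : bytes) {
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
            if ((b & 0x80) == 0) {
                break;
            }
        }
        // Undo the zig-zag mapping.
        return (z >>> 1) ^ -(z & 1);
    }

    /** Formats bytes as space-separated lowercase hex. */
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        for (long n : new long[] {0, -1, 1, 64, 1024}) {
            System.out.println(n + " -> " + hex(encodeLong(n)));
        }
    }
}
```

With this coding the value 1024 takes two bytes (80 10) instead of the fixed four bytes of a Java int, and values between -64 and 63 take a single byte.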
2 Using Avro in place of Java native serialization
2.1 Defining the Avro schema
Create the Avro schema file user.avsc with the following content:
{
  "namespace": "com.dhhy.avro",
  "type": "record",
  "name": "User",
  "fields": [
    {"name": "name", "type": "string"},
    {"name": "id", "type": "int"},
    {"name": "phonenum", "type": ["string", "null"]}
  ]
}
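Here the field name is a string and id is an int, and both are non-null. The field phonenum is a union of string and null, so it is a string that may be null.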
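To make the schema concrete, here is a minimal sketch of how one User value is laid out in Avro's binary encoding, following the rules in the Avro specification: record fields are written in schema order without names, a string is a zig-zag varint length followed by UTF-8 bytes, and a union is the zero-based index of the chosen branch followed by that branch's value. The encoder is hand-written with the JDK only and is an illustration, not the code that the Avro library runs; the class and method names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

class UserBinaryLayout {

    /** Writes a zig-zag varint, the coding Avro uses for int and long. */
    static void writeLong(ByteArrayOutputStream out, long n) {
        long z = (n << 1) ^ (n >> 63);
        while ((z & ~0x7FL) != 0) {
            out.write((int) ((z & 0x7F) | 0x80));
            z >>>= 7;
        }
        out.write((int) z);
    }

    /** Writes a string as its UTF-8 byte length followed by the bytes. */
    static void writeString(ByteArrayOutputStream out, String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        writeLong(out, utf8.length);
        out.write(utf8, 0, utf8.length);
    }

    /** Encodes the three fields of user.avsc in schema order. */
    static byte[] encodeUser(String name, int id, String phonenum) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        writeString(out, name);          // field "name": string
        writeLong(out, id);              // field "id": int
        if (phonenum != null) {
            writeLong(out, 0);           // union branch 0 = "string"
            writeString(out, phonenum);
        } else {
            writeLong(out, 1);           // union branch 1 = "null", which has no body
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        System.out.println(encodeUser("Jay", 1024, "18814123346").length + " bytes with a phone number");
        System.out.println(encodeUser("Jay", 1024, null).length + " bytes with phonenum = null");
    }
}
```

For the sample user this gives 19 bytes with a phone number and 7 bytes without one. Because no field names or types are written, a reader needs the same schema to interpret the bytes.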
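2.2 Compiling the schema
Download the dependency avro-tools-1.9.2.jar and run the command java -jar avro-tools-1.9.2.jar compile schema user.avsc ./ to generate the Java class User.java from the schema.
2.3 Creating the Maven project
The project is named KafkaStudy; its overall directory structure is as follows:
1) The pom file is as follows: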
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.dhhy</groupId>
    <artifactId>kafkaStudy</artifactId>
    <version>1.0-SNAPSHOT</version>
    <dependencies>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>2.1.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>1.9.2</version>
        </dependency>
        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.2.4</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>
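2) Copy the folder generated by compiling the schema, which contains User.java, into the KafkaStudy project.
3) Create a Java class AvroApp.java with the following content: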
package com.dhhy.avro;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;

import java.io.File;
import java.io.IOException;

/**
 * Created by JayLai on 2020-02-24 16:44:18
 */
public class AvroApp {

    /**
     * Creates a User object.
     * @return the new User
     */
    public User createUser(){
        User user = User.newBuilder()
                .setName("Jay")
                .setId(1024)
                .setPhonenum("18814123346")
                .build();
        return user;
    }

    /**
     * Serializes the User object to an Avro container file on disk.
     */
    public void serializeUser(User user, File file){
        DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
        // try-with-resources closes the writer even if create() or append() throws
        try (DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter)) {
            dataFileWriter.create(user.getSchema(), file);
            dataFileWriter.append(user);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    /**
     * Deserializes User objects from disk back into memory.
     */
    public void deserializeUser(File file){
        DatumReader<User> userDatumReader = new SpecificDatumReader<User>(User.class);
        // try-with-resources closes the reader exactly once, even on failure
        try (DataFileReader<User> dataFileReader = new DataFileReader<User>(file, userDatumReader)) {
            User user = null;
            while (dataFileReader.hasNext()) {
                // Passing the previous object lets Avro reuse it instead of allocating a new one
                user = dataFileReader.next(user);
                System.out.println(user);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        AvroApp app = new AvroApp();
        File file = new File("users.avro");
        app.serializeUser(app.createUser(), file);
        app.deserializeUser(file);
    }
}
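2.4 Testing
Run AvroApp.java. The console prints {"name": "Jay", "id": 1024, "phonenum": "18814123346"}, and a file named users.avro appears on the local disk. This shows that the run succeeded: the in-memory User object is serialized to the local disk through Avro, and it is deserialized back into memory through Avro as well.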
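3 Encoding and decoding with Avro in Kafka
Because the Kafka message queue offers high throughput among other advantages, most real-time data exchange between upstream and downstream systems in the big data field goes through Kafka. Upstream producers may be written in different languages, such as C++ or Java; they serialize data with Avro and write it to the Kafka brokers. Downstream clients are likewise written in a variety of languages and use Avro to deserialize the data, as shown in the figure.
Starting from the KafkaStudy project, delete the two unneeded files AvroApp.java and users.avro. The overall file structure is as follows:
3.1 Creating the topic
Log in to the server with Xshell, go to the KAFKA_HOME directory, and create a topic named avro_topic: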
hadoop@ubuntu:~/app/kafka$ ls
bin config kafka_log libs LICENSE logs NOTICE site-docs start.sh
hadoop@ubuntu:~/app/kafka$ bin/kafka-topics.sh --create --zookeeper 192.168.0.131:2181 --partitions 2 --replication-factor 1 --topic avro_topic
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic "avro_topic".
3.2 Record serializer class on the producer side
package com.dhhy.avro;

import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Serializer;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Map;

/**
 * Serializer class
 * Created by JayLai on 2020-03-24 22:17:40
 */
public class AvroSerializer implements Serializer<User> {

    @Override
    public void configure(Map<String, ?> map, boolean b) {
    }

    @Override
    public byte[] serialize(String topic, User data) {
        if(data == null) {
            return null;
        }
        DatumWriter<User> writer = new SpecificDatumWriter<>(data.getSchema());
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        // The direct encoder writes straight to the stream, so no flush is needed
        BinaryEncoder encoder = EncoderFactory.get().directBinaryEncoder(out, null);
        try {
            writer.write(data, encoder);
        }catch (IOException e) {
            // Keep the original exception as the cause
            throw new SerializationException(e.getMessage(), e);
        }
        return out.toByteArray();
    }

    @Override
    public void close() {
    }
}
3.3 Producer
package com.dhhy.avro;

import com.google.gson.Gson;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;

import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/**
 * Created by JayLai on 2020-02-24 19:55:29
 */
public class AvroProducer {
    public static final String brokerList = "192.168.0.131:9092,192.168.0.132:9092,192.168.0.133:9092";
    public static final String topic = "avro_topic";
    static int count = 0;

    /**
     * Creates a User object with an auto-incremented id.
     * @return the new User
     */
    public static User createUser(){
        User user = User.newBuilder()
                .setName("Jay")
                .setId(++count)
                .setPhonenum("18814123456")
                .build();
        return user;
    }

    public static void main(String[] args) {
        User[] users = new User[10];
        for (int i = 0; i < 10; i++){
            users[i] = createUser();
        }

        Properties properties = new Properties();
        properties.put("bootstrap.servers", brokerList);
        properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("value.serializer", "com.dhhy.avro.AvroSerializer");

        // Configure the producer client parameters and create a KafkaProducer instance
        KafkaProducer<String, User> producer = new KafkaProducer<>(properties);

        // Send the messages
        Gson gson = new Gson();
        try {
            for (User user : users) {
                ProducerRecord<String, User> record = new ProducerRecord<>(topic, user);
                producer.send(record, new Callback() {
                    @Override
                    public void onCompletion(RecordMetadata metadata, Exception exception) {
                        if (exception != null){
                            exception.printStackTrace();
                        }else{
                            // Build a fresh map per record so callbacks do not share mutable state
                            Map<String, Object> map = new HashMap<>();
                            map.put("topic", metadata.topic());
                            map.put("partition", metadata.partition());
                            map.put("offset", metadata.offset());
                            map.put("user", user);
                            System.out.println(gson.toJson(map));
                        }
                    }
                });
            }
        }catch (Exception e){
            e.printStackTrace();
        }finally {
            // Close the producer client instance; close() waits for pending sends to complete
            producer.close();
        }
    }
}
3.4 Record deserializer class on the consumer side
package com.dhhy.avro;

import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Deserializer;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Map;

/**
 * Deserializer class
 * Created by JayLai on 2020-03-24 22:30:03
 */
public class AvroDeserializer implements Deserializer<User>{

    @Override
    public void configure(Map<String, ?> map, boolean b) {
    }

    @Override
    public void close() {
    }

    @Override
    public User deserialize(String topic, byte[] data) {
        if(data == null) {
            return null;
        }
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        DatumReader<User> userDatumReader = new SpecificDatumReader<>(User.getClassSchema());
        BinaryDecoder decoder = DecoderFactory.get().directBinaryDecoder(in, null);
        try {
            return userDatumReader.read(null, decoder);
        } catch (IOException e) {
            // Surface decoding failures instead of silently returning an empty User
            throw new SerializationException("Failed to deserialize Avro User from topic " + topic, e);
        }
    }
}
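The serializer and deserializer above exchange only the raw Avro datum, without the schema that a container file carries, so both sides must agree on user.avsc. The following sketch makes that dependency visible by decoding such a payload by hand with the JDK only, reading the fields in schema order as the Avro specification describes. It is an illustration under that assumption, not the Avro library's decoder, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class UserBinaryReader {
    private final byte[] buf;
    private int pos = 0;

    UserBinaryReader(byte[] buf) {
        this.buf = buf;
    }

    /** Reads one zig-zag varint, the coding Avro uses for int, long, and lengths. */
    long readLong() {
        long z = 0;
        int shift = 0;
        int b;
        do {
            b = buf[pos++] & 0xFF;
            z |= (long) (b & 0x7F) << shift;
            shift += 7;
        } while ((b & 0x80) != 0);
        return (z >>> 1) ^ -(z & 1);
    }

    /** Reads a length-prefixed UTF-8 string. */
    String readString() {
        int len = (int) readLong();
        String s = new String(buf, pos, len, StandardCharsets.UTF_8);
        pos += len;
        return s;
    }

    /** Decodes the fields in the order user.avsc declares them. */
    static String decodeUser(byte[] data) {
        UserBinaryReader r = new UserBinaryReader(data);
        String name = r.readString();                            // "name": string
        int id = (int) r.readLong();                             // "id": int
        long branch = r.readLong();                              // union branch index
        String phonenum = (branch == 0) ? r.readString() : null; // 0 = string, 1 = null
        return "{name=" + name + ", id=" + id + ", phonenum=" + phonenum + "}";
    }

    public static void main(String[] args) {
        byte[] payload = {0x06, 'J', 'a', 'y', (byte) 0x80, 0x10, 0x02};
        System.out.println(decodeUser(payload)); // prints {name=Jay, id=1024, phonenum=null}
    }
}
```

Nothing in the payload says which bytes belong to which field. If the producer and consumer were built from different versions of the schema, a decoder like this would misread the data, which is why the same User.java is shared between the projects in this article.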
3.5 Consumer
package com.dhhy.avro;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Created by JayLai on 2020-03-24 23:34:17
 */
public class AvroConsumer {
    public static final String brokerList = "192.168.0.131:9092,192.168.0.132:9092,192.168.0.133:9092";
    public static final String topic = "avro_topic";
    public static final String groupId = "avro_group_001";
    public static final AtomicBoolean isRunning = new AtomicBoolean(true);

    public static void main(String[] args) {
        Properties properties = new Properties();
        properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put("value.deserializer", "com.dhhy.avro.AvroDeserializer");
        properties.put("bootstrap.servers", brokerList);
        properties.put("group.id", groupId);
        properties.put("auto.offset.reset", "earliest");

        // Create a consumer client instance
        KafkaConsumer<String, User> consumer = new KafkaConsumer<>(properties);
        // Subscribe to the topic
        consumer.subscribe(Collections.singletonList(topic));
        consumer.partitionsFor(topic);

        try {
            // Consume messages in a loop
            while (isRunning.get()) {
                // poll(long) is deprecated in kafka-clients 2.x; use the Duration overload
                ConsumerRecords<String, User> records = consumer.poll(Duration.ofMillis(5000));
                System.out.println(records.count());
                for (ConsumerRecord<String, User> record : records) {
                    System.out.println("{topic:" + record.topic() + " ,partition:" + record.partition()
                            + " ,offset:" + record.offset()
                            + " ,key:" + record.key() + " ,value:" + record.value().toString() + "}");
                }
            }
        } finally {
            // Release the consumer's network resources and leave the group cleanly
            consumer.close();
        }
    }
}
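In the consumer above, nothing ever sets the isRunning flag to false, so the loop runs until the process is killed. The sketch below shows, with JDK classes only, how such a flag-controlled loop can be stopped from another thread while still running its cleanup step. The runLoop helper and the stand-in poll and cleanup actions are hypothetical; in the real consumer they would correspond to consumer.poll(...) and consumer.close().

```java
import java.util.concurrent.atomic.AtomicBoolean;

class StoppablePollLoop {
    static final AtomicBoolean isRunning = new AtomicBoolean(true);

    /** Calls pollOnce until isRunning is cleared, then always runs cleanup. Returns the iteration count. */
    static int runLoop(Runnable pollOnce, Runnable cleanup) {
        int iterations = 0;
        try {
            while (isRunning.get()) {
                pollOnce.run();
                iterations++;
            }
        } finally {
            cleanup.run();
        }
        return iterations;
    }

    /** Sleeps without forcing callers to handle InterruptedException. */
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // A second thread clears the flag after a short delay, standing in for a stop request.
        Thread stopper = new Thread(() -> {
            sleepQuietly(300);
            isRunning.set(false);
        });
        stopper.start();

        int n = runLoop(
                () -> sleepQuietly(50), // stand-in for consumer.poll(...) plus record handling
                () -> System.out.println("cleanup: this is where consumer.close() would go"));
        System.out.println("loop ended after " + n + " iterations");
    }
}
```

The flag is only checked between polls, so a stop request takes effect after the current poll returns. With the 5-second poll timeout used above, that can mean a delay of up to 5 seconds.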
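3.6 Testing
Start the producer class AvroProducer first, then run the consumer class AvroConsumer. The consumer's console prints the consumed records.
4 Consuming Avro-encoded Kafka messages with Spark
Kafka's downstream consumers can also be various big data components, such as Spark Streaming or Flink. Spark Streaming is used as the example here.
4.1 Creating the Maven project
In practice, Spark applications are mostly developed in Scala, so a new Scala project named sparkstudy needs to be created.
The content of the pom file is as follows: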
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.dhhy</groupId>
    <artifactId>sparkstudy</artifactId>
    <version>1.0-SNAPSHOT</version>
    <properties>
        <encoding>UTF-8</encoding>
        <scala.version>2.11.8</scala.version>
        <spark.version>2.3.0</spark.version>
    </properties>
    <dependencies>
        <!-- Scala -->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <!-- Spark SQL -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.spark-project.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>1.2.1.spark2</version>
        </dependency>
        <!-- Spark ML -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- Spark Streaming -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- Spark Streaming integration with Kafka -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>0.10.2.1</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.38</version>
        </dependency>
        <!-- Avro serialization -->
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>1.9.2</version>
        </dependency>
    </dependencies>
    <build>
        <plugins>
            <!-- mixed scala/java compile -->
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <executions>
                    <execution>
                        <id>compile</id>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                        <phase>compile</phase>
                    </execution>
                    <execution>
                        <id>test-compile</id>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                        <phase>test-compile</phase>
                    </execution>
                    <execution>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <!-- for jar -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.4</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>assemble-all</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <mainClass>com.dhhy.spark.streaming.kafkaConsumerApp</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
        </plugins>
    </build>
    <repositories>
        <repository>
            <id>aliyunmaven</id>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        </repository>
    </repositories>
</project>
Also copy the two classes User.java and AvroDeserializer.java into sparkstudy.
4.2 Consumer class of the Spark application
package com.dhhy.avro

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.{Seconds, StreamingContext}

object AvroSparkConsumer {

  def main(args: Array[String]): Unit = {
    // Configuration
    val conf = new SparkConf().setMaster("local[2]").setAppName("AvroSparkConsumer")
    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "192.168.0.131:9092,192.168.0.132:9092,192.168.0.133:9092", // Kafka cluster addresses
      "key.deserializer" -> classOf[StringDeserializer], // key deserializer: strings
      "value.deserializer" -> classOf[AvroDeserializer], // value deserializer: Avro
      "group.id" -> "AvroConsumeGroup2", // Kafka consumer group
      "auto.offset.reset" -> "earliest" // start from the earliest offset when the group has no committed offset
    )

    // Kafka topics
    val topics = Array("avro_topic")
    val stream: InputDStream[ConsumerRecord[String, User]] = KafkaUtils.createDirectStream[String, User](
      ssc,
      PreferConsistent,
      Subscribe[String, User](topics, kafkaParams)
    )

    stream.foreachRDD(rdd =>
      rdd.foreachPartition(partition =>
        partition.foreach(record =>
          println("{" +
            "topic:" + record.topic() +
            " ,partition:" + record.partition() +
            " ,offset:" + record.offset() +
            " ,key:" + record.key() +
            " ,value:" + record.value().toString() +
            "}")
        )
      )
    )

    ssc.start()
    ssc.awaitTermination()
  }
}
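4.3 Project directory structure
4.4 Testing
Run AvroSparkConsumer.scala; the consumed results are printed to the console:
5 References
[1] Hadoop: The Definitive Guide (Hadoop權威指南), 4th edition. Tsinghua University Press, 2017.
[2] Avro official site https://avro.apache.org/
[3] 深入理解Kafka:核心設計與實踐原理 (Understanding Kafka in Depth: Core Design and Practice), 1st edition. Publishing House of Electronics Industry, 2019.
[4] Spark official site http://spark.apache.org/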