Getting Started with Apache Avro and Its Application in Big Data

1 Introduction to Avro

In the early days of the internet, many projects ran as monoliths, and Java's built-in serialization mechanism covered most needs. As business and traffic grew, architectures gradually moved to microservices. Individual microservices may be written in different languages, and some services have strict requirements on communication performance, so Java's native serialization no longer fits: (1) it does not work across languages, and (2) the serialized output is relatively large. Adopting a third-party serialization protocol therefore becomes necessary.
Apache Avro was created by Doug Cutting, the father of Hadoop. It works across languages and produces compact serialized output, so it is widely used in the big data field. Avro schemas are usually written in JSON, while the data itself is encoded in a binary format. The primitive types it supports are null, boolean, int, long, float, double, bytes, and string. Complex types (record, enum, array, map, union, and fixed) are also supported; see the official site https://avro.apache.org/ for details.
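
Complex types can be nested freely inside a record. As a quick illustration (a made-up schema, not used later in this article), the following combines an enum, an array, and a nullable union:

{
    "namespace": "com.dhhy.avro",
    "type": "record",
    "name": "CallRecord",
    "fields": [
        {"name": "caller", "type": "string"},
        {"name": "callType", "type": {"type": "enum", "name": "CallType", "symbols": ["LOCAL", "LONG_DISTANCE"]}},
        {"name": "callees", "type": {"type": "array", "items": "string"}},
        {"name": "duration", "type": ["null", "long"], "default": null}
    ]
}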

2 Replacing Java's Native Serialization with Avro

2.1 Defining the Avro Schema

Create the Avro schema file user.avsc with the following content:

{
    "namespace": "com.dhhy.avro",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "id", "type": "int"},
        {"name": "phonenum", "type": ["string", "null"]}
    ]
}

Here name is a string and id is an int, and both are required; phonenum is a string that may also be null.
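
As a side note, the schema can also be used without code generation through Avro's generic API. A minimal sketch (class name and values are mine, assuming user.avsc sits in the working directory); the rest of this article uses the generated specific class instead, which gives compile-time type safety:

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import java.io.File;

public class GenericUserDemo {
    public static void main(String[] args) throws Exception {
        // Parse the schema at runtime instead of generating a User class
        Schema schema = new Schema.Parser().parse(new File("user.avsc"));
        GenericRecord user = new GenericData.Record(schema);
        user.put("name", "Jay");
        user.put("id", 1024);
        user.put("phonenum", "18814123346");
        System.out.println(user);
    }
}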

2.2 Compiling the Schema

Download the avro-tools-1.9.2.jar dependency and run java -jar avro-tools-1.9.2.jar compile schema user.avsc ./ , which generates the Java source for the schema (com/dhhy/avro/User.java) in the current directory.
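
Alternatively, the classes can be generated during the Maven build with the avro-maven-plugin instead of running avro-tools by hand. A sketch of the plugin configuration, assuming the .avsc files are placed under src/main/avro (this article keeps the manual avro-tools approach):

<plugin>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro-maven-plugin</artifactId>
    <version>1.9.2</version>
    <executions>
        <execution>
            <phase>generate-sources</phase>
            <goals>
                <goal>schema</goal>
            </goals>
            <configuration>
                <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
                <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
            </configuration>
        </execution>
    </executions>
</plugin>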

2.3 Creating the Maven Project

The project is named KafkaStudy.

1) The pom.xml is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>com.dhhy</groupId>
    <artifactId>kafkaStudy</artifactId>
    <version>1.0-SNAPSHOT</version>

    <dependencies>
        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>2.1.1</version>
        </dependency>

        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>1.9.2</version>
        </dependency>

        <dependency>
            <groupId>com.google.code.gson</groupId>
            <artifactId>gson</artifactId>
            <version>2.2.4</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>8</source>
                    <target>8</target>
                </configuration>
            </plugin>
        </plugins>
    </build>
</project>

2) Copy the folder generated by compiling the schema, i.e. the User.java file, into the KafkaStudy project.
3) Create a Java class AvroApp.java with the following content:

package com.dhhy.avro;

import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.avro.specific.SpecificDatumWriter;
import java.io.File;
import java.io.IOException;

/**
 * Created by JayLai on 2020-02-24 16:44:18
 */
public class AvroApp {

    /**
     * Create a User object
     * @return
     */
    public User createUser(){
        User user = User.newBuilder()
                .setName("Jay")
                .setId(1024)
                .setPhonenum("18814123346")
                .build();

        return user;
    }

    /**
     * Serialize the User object to disk
     */
    public void serializeUser(User user, File file){
        DatumWriter<User> userDatumWriter = new SpecificDatumWriter<User>(User.class);
        DataFileWriter<User> dataFileWriter = new DataFileWriter<User>(userDatumWriter);

        try {
            dataFileWriter.create(user.getSchema(), file);
            dataFileWriter.append(user);
        } catch (IOException e) {
            e.printStackTrace();
        } finally{
            try{
                dataFileWriter.close();
            }catch(IOException e){
                e.printStackTrace();
            }
        }
    }



    /**
     * Deserialize the User object from disk back into memory
     */
    public void deserializeUser(File file){
        DatumReader<User> userDatumReader = new SpecificDatumReader<User>(User.class);
        DataFileReader<User> dataFileReader = null;


        try {
            dataFileReader = new DataFileReader<User>(file, userDatumReader);
            User user = null;
            while (dataFileReader.hasNext()) {
                user = dataFileReader.next(user);
                System.out.println(user);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally{
            try{
                if(dataFileReader != null){
                    dataFileReader.close();
                }
            }catch(IOException e){
                e.printStackTrace();
            }
        }
    }


    public static void main(String[] args) {
        AvroApp app = new AvroApp();
        File file = new File("users.avro");
        app.serializeUser(app.createUser(), file);
        app.deserializeUser(file);
    }


}

2.4 Testing

Run AvroApp.java. The console prints {"name": "Jay", "id": 1024, "phonenum": "18814123346"} and a users.avro file appears on the local disk, which shows that the in-memory User object can be serialized to local disk with Avro and deserialized back into memory.
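
The generated file can also be inspected with the same avro-tools jar downloaded in section 2.2; its tojson command dumps each record as a line of JSON:

java -jar avro-tools-1.9.2.jar tojson users.avro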

3 Encoding and Decoding Kafka Messages with Avro

Because the Kafka message queue offers high throughput, most real-time data exchange between upstream and downstream systems in the big data world goes through Kafka. Upstream producers may be written in different languages such as C++ or Java; they serialize data with Avro and write it to the Kafka brokers. Downstream clients, also in a variety of languages, then use Avro to deserialize it.


Building on the KafkaStudy project, delete the two files that are no longer needed, AvroApp.java and the users.avro test output, and add the classes described below.

3.1 Creating the Topic

Log in to the Kafka host with Xshell, go to KAFKA_HOME, and create a topic named avro_topic:

hadoop@ubuntu:~/app/kafka$ ls
bin  config  kafka_log  libs  LICENSE  logs  NOTICE  site-docs  start.sh
hadoop@ubuntu:~/app/kafka$ bin/kafka-topics.sh --create --zookeeper 192.168.0.131:2181 --partitions 2 --replication-factor 1 --topic avro_topic
WARNING: Due to limitations in metric names, topics with a period ('.') or underscore ('_') could collide. To avoid issues it is best to use either, but not both.
Created topic "avro_topic".
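
The topic and its partition assignment can be verified with the --describe option:

bin/kafka-topics.sh --describe --zookeeper 192.168.0.131:2181 --topic avro_topic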

3.2 The Serializer Class on the Producer Side

package com.dhhy.avro;

import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.DatumWriter;
import org.apache.avro.io.EncoderFactory;
import org.apache.avro.specific.SpecificDatumWriter;
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.common.serialization.Serializer;

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Map;

/**
 * Serializer class
 * Created by JayLai on 2020-03-24 22:17:40
 */
public class AvroSerializer implements Serializer<User> {
    @Override
    public void configure(Map<String, ?> map, boolean b) {

    }

    @Override
    public byte[] serialize(String topic, User data) {
        if(data == null) {
            return null;
        }
        DatumWriter<User> writer = new SpecificDatumWriter<>(data.getSchema());
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().directBinaryEncoder(out, null);
        try {
            writer.write(data, encoder);
        }catch (IOException e) {
            throw new SerializationException(e.getMessage());
        }
        return out.toByteArray();
    }

    @Override
    public void close() {

    }
}

3.3 The Producer

package com.dhhy.avro;

import com.google.gson.Gson;
import org.apache.kafka.clients.producer.Callback;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.clients.producer.RecordMetadata;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

/**
 * Created by JayLai on 2020-02-24 19:55:29
 */
public class AvroProducer {
    public static final String brokerList = "192.168.0.131:9092,192.168.0.132:9092,192.168.0.133:9092";
    public static final String topic = "avro_topic";
    static int count = 0;


    /**
     * Create a User object
     * @return
     */
    public static User createUser(){
        User user = User.newBuilder()
                .setName("Jay")
                .setId(++count)
                .setPhonenum("18814123456")
                .build();

        return user;
    }

    public static void main(String[] args) {
        User[] users = new User[10];
        for (int i = 0; i < 10; i++){
            users[i] = createUser();
        }


        Properties properties =  new Properties();
        properties.put("bootstrap.servers", brokerList);
        properties.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        properties.put("value.serializer", "com.dhhy.avro.AvroSerializer");


        //Configure the producer client parameters and create a KafkaProducer instance
        KafkaProducer<String, User> producer = new KafkaProducer<>(properties);
        //Send the messages
        Map<String, Object> map = new HashMap<>();
        Gson gson= new Gson();
        try {
            for (User user : users) {
                ProducerRecord<String, User> record = new ProducerRecord<>(topic, user);
                producer.send(record, new Callback() {
                    @Override
                    public void onCompletion(RecordMetadata metadata, Exception exception) {
                        if (exception != null){
                            exception.printStackTrace();
                        }else{
                            map.put("topic", metadata.topic());
                            map.put("partition", metadata.partition());
                            map.put("offset", metadata.offset());
                            map.put("user", user);
                            System.out.println(gson.toJson(map));
                        }
                    }
                });
            }
        }catch (Exception e){
            e.printStackTrace();
        }finally {
            //Close the producer client instance
            if(producer != null){
                producer.close();
            }
        }

    }
}

3.4 The Deserializer Class on the Consumer Side

package com.dhhy.avro;

import org.apache.avro.io.BinaryDecoder;
import org.apache.avro.io.DatumReader;
import org.apache.avro.io.DecoderFactory;
import org.apache.avro.specific.SpecificDatumReader;
import org.apache.kafka.common.serialization.Deserializer;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.util.Map;

/**
 * Deserializer class
 * Created by JayLai on 2020-03-24 22:30:03
 */
public class AvroDeserializer implements Deserializer<User>{
    @Override
    public void configure(Map<String, ?> map, boolean b) {

    }


    @Override
    public void close() {

    }

    @Override
    public User deserialize(String topic, byte[] data) {
        if(data == null) {
            return null;
        }
        User user = new User();
        ByteArrayInputStream in = new ByteArrayInputStream(data);
        DatumReader<User> userDatumReader = new SpecificDatumReader<>(User.getClassSchema());
        BinaryDecoder decoder = DecoderFactory.get().directBinaryDecoder(in, null);
        try {
            user = userDatumReader.read(null, decoder);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return user;
    }
}
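
Before wiring these two classes into Kafka, it is worth checking locally that they are symmetric. A minimal round-trip sketch (the test class name is mine):

package com.dhhy.avro;

/**
 * Local round-trip check for AvroSerializer / AvroDeserializer, no broker needed.
 */
public class AvroCodecTest {
    public static void main(String[] args) {
        User original = User.newBuilder()
                .setName("Jay")
                .setId(1)
                .setPhonenum("18814123456")
                .build();

        // The topic argument is not used by either implementation
        byte[] bytes = new AvroSerializer().serialize("avro_topic", original);
        User restored = new AvroDeserializer().deserialize("avro_topic", bytes);

        System.out.println(restored);
        System.out.println("round trip ok: " + original.equals(restored));
    }
}
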
3.5 The Consumer

package com.dhhy.avro;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.atomic.AtomicBoolean;

/**
 * Created by JayLai on 2020-03-24 23:34:17
 */
public class AvroConsumer {
    public static final String brokerList = "192.168.0.131:9092,192.168.0.132:9092,192.168.0.133:9092";
    public static final String topic = "avro_topic";
    public static final String groupId = "avro_group_001";
    public static final AtomicBoolean isRunning =new AtomicBoolean(true);

    public static void main(String[] args) {
        Properties properties =  new Properties();
        properties.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        properties.put("value.deserializer", "com.dhhy.avro.AvroDeserializer");
        properties.put("bootstrap.servers", brokerList);
        properties.put("group.id", groupId);
        properties.put("auto.offset.reset", "earliest");


        //Create a consumer client instance
        KafkaConsumer<String, User> consumer = new KafkaConsumer<>(properties);
        //Subscribe to the topic
        consumer.subscribe(Collections.singletonList(topic));
        consumer.partitionsFor(topic);

        //Poll for and consume messages in a loop
        while (isRunning.get()) {
            ConsumerRecords<String, User> records =
                    consumer.poll(Duration.ofMillis(5000));

            System.out.println(records.count());

            for (ConsumerRecord<String, User> record : records) {
                System.out.println("{topic:" + record.topic() + " ,partition:" + record.partition()
                        + " ,offset:" + record.offset() +
                        " ,key:" + record.key() + " ,value:" + record.value().toString() + "}");
            }

        }

    }
}
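
Note that the loop above never terminates: isRunning is declared but nothing ever sets it to false. A common pattern (sketched below, not part of the original code) is to flip the flag and call consumer.wakeup() from a shutdown hook, then close the consumer in a finally block:

// Hypothetical shutdown handling for the polling loop in main()
Runtime.getRuntime().addShutdownHook(new Thread(() -> {
    isRunning.set(false);
    consumer.wakeup();   // makes a blocked poll() throw WakeupException
}));

try {
    while (isRunning.get()) {
        ConsumerRecords<String, User> records = consumer.poll(Duration.ofMillis(5000));
        // ... print the records as above ...
    }
} catch (org.apache.kafka.common.errors.WakeupException e) {
    // expected during shutdown, nothing to do
} finally {
    consumer.close();
}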

3.6 Testing

Start the producer class AvroProducer first, then run the consumer class AvroConsumer; the consumer's console prints the consumed records.

4 Consuming Avro-Encoded Kafka Messages with Spark

Kafka's downstream consumers can also be different big data frameworks, such as Spark Streaming or Flink; Spark Streaming is used as the example here.

4.1 Creating the Maven Project

In real projects, Spark applications are mostly developed in Scala, so a new Scala project named sparkstudy is created.
Its pom.xml is as follows:


<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>


  <groupId>com.dhhy</groupId>
  <artifactId>sparkstudy</artifactId>
  <version>1.0-SNAPSHOT</version>

    <properties>
        <encoding>UTF-8</encoding>
        <scala.version>2.11.8</scala.version>
        <spark.version>2.3.0</spark.version>
    </properties>

    <dependencies>

        <!--scala-->
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>

        <!--SparkSQL-->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.spark-project.hive</groupId>
            <artifactId>hive-jdbc</artifactId>
            <version>1.2.1.spark2</version>
        </dependency>


        <!-- Spark ML-->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>


        <!-- Spark Streaming-->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>


        <!--Spark Streaming整合kafka-->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.kafka</groupId>
            <artifactId>kafka-clients</artifactId>
            <version>0.10.2.1</version>
        </dependency>


        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.38</version>
        </dependency>


        <!--avro序列化 -->
        <dependency>
            <groupId>org.apache.avro</groupId>
            <artifactId>avro</artifactId>
            <version>1.9.2</version>
        </dependency>



    </dependencies>


    <build>
        <plugins>
            <!-- mixed scala/java compile -->
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <executions>
                    <execution>
                        <id>compile</id>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                        <phase>compile</phase>
                    </execution>
                    <execution>
                        <id>test-compile</id>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                        <phase>test-compile</phase>
                    </execution>
                    <execution>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <!-- for jar -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-assembly-plugin</artifactId>
                <version>2.4</version>
                <configuration>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>assemble-all</id>
                        <phase>package</phase>
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-jar-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <addClasspath>true</addClasspath>
                            <mainClass>com.dhhy.spark.streaming.kafkaConsumerApp</mainClass>
                        </manifest>
                    </archive>
                </configuration>
            </plugin>
        </plugins>
    </build>
    <repositories>
        <repository>
            <id>aliyunmaven</id>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        </repository>

    </repositories>
</project>

Then copy the two classes User.java and AvroDeserializer.java into sparkstudy.

4.2 The Spark Application's Consumer Class

package com.dhhy.avro

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.{Seconds, StreamingContext}

object AvroSparkConsumer {

  def main(args: Array[String]): Unit = {

    //Spark configuration
    val conf = new SparkConf().setMaster("local[2]").setAppName("AvroSparkConsumer")

    val ssc = new StreamingContext(conf, Seconds(5))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "192.168.0.131:9092,192.168.0.132:9092,192.168.0.133:9092", // Kafka cluster addresses
      "key.deserializer" -> classOf[StringDeserializer], // key deserializer: String
      "value.deserializer" -> classOf[AvroDeserializer], // value deserializer: Avro
      "group.id" -> "AvroConsumeGroup2", // Kafka consumer group
      "auto.offset.reset" -> "earliest" // start from the earliest available offset
    )


    //Kafka topics to subscribe to
    val topics = Array("avro_topic")
    val stream: InputDStream[ConsumerRecord[String, User]] = KafkaUtils.createDirectStream[String, User](
      ssc,
      PreferConsistent,
      Subscribe[String, User](topics, kafkaParams)
    )

    stream.foreachRDD(rdd =>
      rdd.foreachPartition(partition =>
        partition.foreach(record =>
          println("{" +
            "topic:" + record.topic() +
            " ,partition:" + record.partition() +
            " ,offset:" + record.offset() +
            " ,key:" + record.key() +
            " ,value:" + record.value().toString() +
            "}")
        )
      )

    )


    ssc.start()
    ssc.awaitTermination()
  }

}

4.3 Project Directory Structure

The sparkstudy project now contains the generated User.java, the AvroDeserializer.java copied over from KafkaStudy, and the AvroSparkConsumer.scala class shown above.

4.4 Testing

Run AvroSparkConsumer.scala; the consumed records are printed to the console.

5 References

[1] Hadoop: The Definitive Guide, 4th ed. Tsinghua University Press, 2017.
[2] Avro official website: https://avro.apache.org/
[3] 深入理解Kafka:核心設計與實踐原理 (Understanding Kafka: Core Design and Practice), 1st ed. Publishing House of Electronics Industry, 2019.
[4] Spark official website: http://spark.apache.org/
