Spark Streaming + Kafka + Spring Boot + Elasticsearch real-time project (canal)

In this exercise we build a real-time computing system with Spark, Elasticsearch, Kafka, and related frameworks.

The overall flow is shown in the figure below.

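Users access the service, and an nginx server load-balances the requests across the service instances on the individual hosts. Each request produces a user operation log, which the service sends to the Kafka cluster. (Alternatively, the service can write the logs to a file, and Flume can collect the log files and forward them to Kafka.)
Once the data is buffered in Kafka, Spark Streaming consumes it at a fixed batch interval (micro-batches), deduplicates it according to the business rules, computes and transforms it, and then writes the results to a MySQL database or to ES (for search and analysis).
After the data is stored in ES, a Spring Boot program reads it from ES, processes it according to the business logic, and returns the required data as JSON. In this exercise the Spring Boot program only publishes the API, and a separate front-end program calls that API to get the data. Both could live in a single web project, but this example deliberately stays basic.
The separate web project calls the business API, parses the JSON response, draws the charts with ECharts, and polls at a fixed interval so that the charts update in real time.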

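Note: canal can also monitor the business data, capture the rows that change, and send them to Kafka. It relies on MySQL master-slave replication: canal poses as a MySQL slave and requests the data (the binlog) from the master.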

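I. Environment Setup

For cluster setup, refer to

Three virtual machines, hadoop1, hadoop2, and hadoop3, running CentOS 6.8 in this example.

Allocated resources (allocate more memory if you have it):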

Host	Memory	Processors
hadoop1	4G	2
hadoop2	2G	1
hadoop3	2G	1

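Hadoop cluster (optional, convenient for viewing job logs): hadoop-2.7.2
ZooKeeper cluster: zookeeper-3.4.10
Kafka cluster: kafka_2.11-0.11.0.2
Spark cluster (optional): spark-2.1.1-bin-hadoop2.7. Set one up if you want to deploy the project to a cluster; for testing, running in IDEA is enough.
Elasticsearch cluster: elasticsearch-6.6.0. You can also install Kibana as a visualization front end for ES: kibana-6.6.0-linux-x86_64
Redis, standalone or clustered: redis-5.0.6
nginx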

II. Project Setup

The figure below shows the module and directory structure of the project.

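The canal module uses the canal API to send rows changed in the MySQL database to the Kafka cluster.
The common module holds the shared dependencies and utility classes.
The dw-chart module is a web project that requests data from the API, draws the charts, and serves the front end.
The export2ES module (can be ignored) imports Hive data into ES.
The logger module is the Spring Boot service that users call; it sends the user operation logs to Kafka.
The mock module simulates user operation logs by sending requests to the logger module.
The publisher module is a Spring Boot project that publishes the API from which dw-chart requests its data.
The realtime module holds the Spark Streaming jobs; it consumes the Kafka data and saves the results to ES.
The sql folder contains the stored procedure that generates mock order_info data, plus some sample data, used for canal monitoring and for the sales statistics.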

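III. Analysis

The Kafka cluster has three topics: GMALL_STARTUP (for daily active users), GMALL_EVENT (not used yet), and GMALL_ORDER (for sales totals).
The ES cluster has three indices: gmall_dau (results of the daily-active computation), gmall_order (the computed sales data), and gmall_sale_detail (data imported from Hive into ES).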
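
To make the naming concrete, here is a small Java sketch of how the logger side might choose a topic from a log record's type field. KAFKA_TOPIC_STARTUP is the constant name the streaming code uses; its value, the other two constant names, and the routing rule itself (startup records go to GMALL_STARTUP, everything else to GMALL_EVENT) are my assumptions for illustration, not necessarily what the project does.

```java
/** Topic names from this article; the routing rule below is an assumption. */
final class GmallTopics {
    static final String KAFKA_TOPIC_STARTUP = "GMALL_STARTUP";
    static final String KAFKA_TOPIC_EVENT = "GMALL_EVENT";   // hypothetical constant name
    static final String KAFKA_TOPIC_ORDER = "GMALL_ORDER";   // hypothetical constant name

    /** Picks the topic for a user log record based on its "type" field. */
    static String topicForLogType(String type) {
        return "startup".equals(type) ? KAFKA_TOPIC_STARTUP : KAFKA_TOPIC_EVENT;
    }

    public static void main(String[] args) {
        System.out.println(topicForLogType("startup"));
        System.out.println(topicForLogType("event"));
    }
}
```

Order data does not go through this path at all: it reaches GMALL_ORDER via canal.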

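The log format is shown below. Each JSON record represents one user action. A record whose type is startup is a login, and those records are used to count the app's daily active users.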

{
    "area": "guangdong",   // region
    "uid": "186",          
    "itemid": 17,          // item (topic) id
    "npgid": 14,
    "evid": "addCart",     // event id
    "os": "andriod",       // user's operating system
    "pgid": 43,
    "appid": "gmall_hcx",    // app id
    "mid": "mid_74",         // unique user id
    "type": "event",         // operation type
    "ts": 1574325528404      // timestamp in milliseconds
}
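
The ts field is what the statistics are bucketed by: the streaming job derives logDate, logHour, and logHourMinute from it. As a minimal Java sketch of that derivation (the class name and the explicit Asia/Shanghai zone are my assumptions; the Scala job uses SimpleDateFormat with the JVM's default zone):

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;
import java.time.format.DateTimeFormatter;

/** Derives the date fields used for per-day and per-hour statistics from "ts". */
final class LogTimeFields {
    private static final DateTimeFormatter DATE = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    private static final DateTimeFormatter HOUR = DateTimeFormatter.ofPattern("HH");
    private static final DateTimeFormatter HOUR_MINUTE = DateTimeFormatter.ofPattern("HH:mm");

    /** Returns {logDate, logHour, logHourMinute} for an epoch-millisecond timestamp. */
    static String[] fromTs(long tsMillis, ZoneId zone) {
        ZonedDateTime time = Instant.ofEpochMilli(tsMillis).atZone(zone);
        return new String[] { DATE.format(time), HOUR.format(time), HOUR_MINUTE.format(time) };
    }

    public static void main(String[] args) {
        // ts taken from the sample record above
        String[] fields = fromTs(1574325528404L, ZoneId.of("Asia/Shanghai"));
        System.out.println(String.join(" | ", fields)); // 2019-11-21 | 16 | 16:38
    }
}
```

Passing the zone explicitly keeps the result the same no matter which machine the code runs on.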

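The data in the MySQL order_info table is shown below. It holds the business data generated when users place orders. canal monitors changes to this table and writes them to the Kafka cluster so that sales totals can be computed later.

The Spark Streaming code below computes daily active users. It first reads from Kafka into inputDstream and then maps that into a stream typed with a case class. Redis is used for deduplication: for daily active users, a user who logs in several times should contribute only one valid login record. Redis alone is not enough, though. If a single batch contains several records for the same user, none of them is in Redis yet, so they all pass the filter. The filtered data therefore needs a second pass: group the records with the same mid (that is, all login records of one user), keep one record per group, and drop the rest.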

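// Imports needed by this excerpt. JSON is assumed to be fastjson's JSON class;
// MyKafkaUtil, RedisUtil, MyEsUtil, GmallConstant and Startuplog come from the project's own modules.
import java.text.SimpleDateFormat
import java.util
import java.util.Date
import com.alibaba.fastjson.JSON
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.spark.SparkConf
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import redis.clients.jedis.Jedis

    val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("dau_app")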
    val ssc = new StreamingContext(sparkConf,Seconds(5))

    val inputDstream: InputDStream[ConsumerRecord[String, String]] = MyKafkaUtil.getKafkaStream(GmallConstant.KAFKA_TOPIC_STARTUP,ssc)

    // Parse each Kafka record into a Startuplog and derive its date fields
    val startuplogStream: DStream[Startuplog] = inputDstream.map {
      record =>
        val jsonStr: String = record.value()
        val startuplog: Startuplog = JSON.parseObject(jsonStr, classOf[Startuplog])
        val date = new Date(startuplog.ts)
        val DateStr: String = new SimpleDateFormat("yyyy-MM-dd HH:mm").format(date)
        val splits: Array[String] = DateStr.split(" ")
        startuplog.logDate = splits(0)
        startuplog.logHour = splits(1).split(":")(0)
        startuplog.logHourMinute = splits(1)
        startuplog
    }
    // Filter out the mids that Redis already holds for today
    val filteredDstream: DStream[Startuplog] = startuplogStream.transform {
      rdd =>
        // Runs on the driver, once per batch
        val curdate: String = new SimpleDateFormat("yyyy-MM-dd").format(new Date())
        val jedis: Jedis = RedisUtil.getJedisClient
        val key = "dau:" + curdate
        val dauSet: util.Set[String] = jedis.smembers(key)
        // Release the connection once the set has been read
        jedis.close()
        val dauBC: Broadcast[util.Set[String]] = ssc.sparkContext.broadcast(dauSet)
        val filteredRDD: RDD[Startuplog] = rdd.filter {
          startuplog =>
            // Runs on the executors
            val dauSet: util.Set[String] = dauBC.value
            !dauSet.contains(startuplog.mid)
        }
        filteredRDD
    }
    val groupbyMidDstream: DStream[(String, Iterable[Startuplog])] = filteredDstream.map {
      startuplog => (startuplog.mid, startuplog)
    }.groupByKey()
    // Dedup within the batch: group the records by mid and keep one per group
    val distinctDstream: DStream[Startuplog] = groupbyMidDstream.flatMap {
      case (mid, startuplogItr) =>
        startuplogItr.take(1)
    }
    // Save the new mids to Redis and the records to ES
    distinctDstream.foreachRDD{rdd=>
      // driver
      // Redis type: set
      // key: dau:2019-06-03  value: mids
      rdd.foreachPartition{startuplogItr =>
        // executor
        val jedis: Jedis = RedisUtil.getJedisClient
        val list: List[Startuplog] = startuplogItr.toList
        for (startuplog<- list){
          val key = "dau:" + startuplog.logDate
          val value = startuplog.mid
          jedis.sadd(key,value)
          println(startuplog)
        }
        MyEsUtil.indexBulk(GmallConstant.ES_INDEX_DAU,list)
        jedis.close()
      }
    }
    ssc.start()
    ssc.awaitTermination()
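
The two deduplication stages are easier to see without Spark and Redis around them. The plain-Java sketch below models them on a single batch, assuming an in-memory Set stands in for the Redis set dau:<date>; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Models the two dedup stages of the DAU job on one batch of mids. */
final class DauDedup {
    /**
     * Stage 1 drops mids already recorded today (the Redis set in the job);
     * stage 2 keeps only the first occurrence of each mid inside the batch.
     */
    static List<String> newMids(List<String> batchMids, Set<String> seenToday) {
        Set<String> firstInBatch = new LinkedHashSet<>();
        for (String mid : batchMids) {
            if (seenToday.contains(mid)) {
                continue;              // stage 1: already counted today
            }
            firstInBatch.add(mid);     // stage 2: a repeated mid is added only once
        }
        return new ArrayList<>(firstInBatch);
    }

    public static void main(String[] args) {
        Set<String> seenToday = new HashSet<>(Arrays.asList("mid_1"));
        List<String> batch = Arrays.asList("mid_1", "mid_74", "mid_74", "mid_9");
        List<String> fresh = newMids(batch, seenToday);
        System.out.println(fresh);     // [mid_74, mid_9]
        seenToday.addAll(fresh);       // the job does this with jedis.sadd
    }
}
```

Without stage 2, mid_74 would be written to ES twice, because both of its records pass the Redis check in the same batch.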

Below is part of the canal API code. It listens for changes to the order_info table in MySQL and sends the changed data to the Kafka cluster.

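// Imports needed by this excerpt
import com.alibaba.otter.canal.client.CanalConnector;
import com.alibaba.otter.canal.client.CanalConnectors;
import com.alibaba.otter.canal.protocol.CanalEntry;
import com.alibaba.otter.canal.protocol.Message;
import com.google.protobuf.ByteString;
import com.google.protobuf.InvalidProtocolBufferException;
import java.net.InetSocketAddress;
import java.util.List;

CanalConnector canalConnector = CanalConnectors.newSingleConnector(new InetSocketAddress("hadoop1", 11111), "example", "", "");
        // Connect and subscribe to the table once, before polling
        canalConnector.connect();
        canalConnector.subscribe("gmall.order_info");
        while (true){
            // Fetch up to 100 entries
            Message message = canalConnector.get(100);
            int size = message.getEntries().size();
            if (size == 0){
                try {
                    System.out.println("no Data...");
                    Thread.sleep(5000);
                } catch (InterruptedException e) {
                    e.printStackTrace();
                }
            }else {
                for (CanalEntry.Entry entry : message.getEntries()) {

                    // Check the entry type; only row changes are handled
                    if (entry.getEntryType().equals(CanalEntry.EntryType.ROWDATA)){
                        // Deserialize the stored value
                        ByteString storeValue = entry.getStoreValue();
                        CanalEntry.RowChange rowChange;
                        try {
                             rowChange = CanalEntry.RowChange.parseFrom(storeValue);
                        } catch (InvalidProtocolBufferException e) {
                            e.printStackTrace();
                            // Skip entries that cannot be parsed
                            continue;
                        }
                        // Changed rows
                        List<CanalEntry.RowData> rowDatasList = rowChange.getRowDatasList();
                        // Operation type (insert, update, delete, ...)
                        CanalEntry.EventType eventType = rowChange.getEventType();
                        // Table name
                        String tableName = entry.getHeader().getTableName();
                        CanalHandler.handle(tableName,eventType,rowDatasList);
                    }
                }
            }
        }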
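
The body of CanalHandler.handle is not shown here. As a rough sketch of what such a handler has to decide, the plain-Java code below filters by table name and event type and turns one changed row into a JSON string for Kafka. Everything in it is hypothetical: the class, the rule that only INSERT events on order_info are forwarded, the column names, and the use of plain strings and a Map in place of canal's CanalEntry types.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Stand-in for the filtering and serialization a canal handler performs. */
final class OrderChangeFilter {
    /** Assumed rule: forward only new rows of order_info. */
    static boolean shouldForward(String tableName, String eventType) {
        return "order_info".equals(tableName) && "INSERT".equals(eventType);
    }

    /** Serializes column name/value pairs as a flat JSON object (minimal escaping only). */
    static String rowToJson(Map<String, String> columns) {
        StringBuilder json = new StringBuilder("{");
        for (Map.Entry<String, String> column : columns.entrySet()) {
            if (json.length() > 1) {
                json.append(',');
            }
            json.append('"').append(escape(column.getKey())).append("\":\"")
                .append(escape(column.getValue())).append('"');
        }
        return json.append('}').toString();
    }

    private static String escape(String text) {
        return text.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    public static void main(String[] args) {
        Map<String, String> row = new LinkedHashMap<>();
        row.put("id", "1");                 // hypothetical column
        row.put("total_amount", "99.5");    // hypothetical column
        if (shouldForward("order_info", "INSERT")) {
            System.out.println(rowToJson(row)); // this string would be sent to GMALL_ORDER
        }
    }
}
```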

IV. Running the Project

1. First start the ZooKeeper cluster, the Kafka cluster, and nginx.

The nginx configuration file is as follows:

#user  nobody;
worker_processes  1;
#error_log  logs/error.log;
#error_log  logs/error.log  notice;
#error_log  logs/error.log  info;
#pid        logs/nginx.pid;
events {
    worker_connections  1024;
}
http {
    upstream logserver{
        server   hadoop1:8080  weight=1;
        server   hadoop2:8080  weight=1;
        server   hadoop3:8080  weight=1;
    }
    include       mime.types;
    default_type  application/octet-stream;
    sendfile        on;
    #tcp_nopush     on;
    #keepalive_timeout  0;
    keepalive_timeout  65;
    #gzip  on;

    server {
        listen       80;
        server_name  logserver;
        #charset koi8-r;
        #access_log  logs/host.access.log  main;
        location / {
            root   html;
            index  index.html index.htm;
            proxy_pass http://logserver;
            proxy_connect_timeout 10;
        }
        #error_page  404              /404.html;
        # redirect server error pages to the static page /50x.html
        #
        error_page   500 502 503 504  /50x.html;
        location = /50x.html {
            root   html;
        }
    }
}
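
With three upstream servers of equal weight, requests end up spread evenly over hadoop1, hadoop2, and hadoop3. The Java sketch below is a simplified model of that round-robin behaviour, not nginx's actual implementation (nginx uses a weighted variant); the class name is hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified model of round-robin selection over equally weighted upstream servers. */
final class RoundRobinUpstream {
    private final List<String> servers;
    private final AtomicInteger counter = new AtomicInteger();

    RoundRobinUpstream(List<String> servers) {
        this.servers = servers;
    }

    /** Returns the server for the next request, cycling through the list in order. */
    String pick() {
        int index = Math.floorMod(counter.getAndIncrement(), servers.size());
        return servers.get(index);
    }

    public static void main(String[] args) {
        RoundRobinUpstream logserver = new RoundRobinUpstream(
                Arrays.asList("hadoop1:8080", "hadoop2:8080", "hadoop3:8080"));
        for (int request = 1; request <= 6; request++) {
            System.out.println("request " + request + " -> " + logserver.pick());
        }
    }
}
```

Six requests land twice on each host, which is why the logger service has to run on all three machines.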

The ZooKeeper configuration file is as follows:

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial 
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between 
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just 
# example sakes.

server.1=hadoop1:2888:3888
server.2=hadoop2:2888:3888
server.3=hadoop3:2888:3888
dataDir=/home/hadoop/zookeeper-3.4.10/zkData
# the port at which the clients will connect
clientPort=2181
# the maximum number of client connections.
# increase this if you need to handle more clients
#maxClientCnxns=60
#
# Be sure to read the maintenance section of the 
# administrator guide before turning on autopurge.
#
# http://zookeeper.apache.org/doc/current/zookeeperAdmin.html#sc_maintenance
#
# The number of snapshots to retain in dataDir
#autopurge.snapRetainCount=3
# Purge task interval in hours
# Set to "0" to disable auto purge feature
#autopurge.purgeInterval=1

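The Kafka configuration on hadoop1 is as follows. (The other nodes need their own configuration files as well, with at least broker.id and the listener host changed; see online references for details.)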

broker.id=0
zookeeper.connect=hadoop1:2181,hadoop2:2181,hadoop3:2181
listeners=PLAINTEXT://hadoop1:9092
advertised.listeners=PLAINTEXT://hadoop1:9092
# allow topics to be deleted
delete.topic.enable=true
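
One thing to watch in this file: it is in Java .properties format, where # starts a comment only at the beginning of a line. Anything after the value on the same line becomes part of the value, so comments belong on their own line. A minimal Java sketch of that parsing rule, using java.util.Properties (the class name is hypothetical):

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Shows how java.util.Properties treats a '#' that follows a value. */
final class PropertiesInlineComment {
    static Properties parse(String text) {
        Properties properties = new Properties();
        try {
            properties.load(new StringReader(text));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return properties;
    }

    public static void main(String[] args) {
        // Trailing text is kept as part of the value
        Properties inline = parse("delete.topic.enable=true   # allow topic deletion\n");
        System.out.println("[" + inline.getProperty("delete.topic.enable") + "]");

        // A comment on its own line is ignored
        Properties separate = parse("# allow topic deletion\ndelete.topic.enable=true\n");
        System.out.println("[" + separate.getProperty("delete.topic.enable") + "]");
    }
}
```

The first value is no longer a plain true, so a boolean setting written that way is likely to be rejected or misread.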

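2. Package the logger module, upload it to the three virtual machines, and start it on each of them.

You can write a script on hadoop1 that starts the service on all three hosts. The script is shown below; adjust the Java path and the jar path to your environment.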

#!/bin/bash
JAVA_BIN=/home/hadoop/jdk1.8/bin/java
PROJECT=gmall
APPNAME=logger-0.0.1-SNAPSHOT.jar
SERVER_PORT=8080

case "$1" in
"start")
{
  for i in hadoop1 hadoop2 hadoop3
  do
    echo "======= Starting log service: $i"
    ssh $i "$JAVA_BIN -Xms32m -Xmx64m -jar $PROJECT/$APPNAME --server.port=$SERVER_PORT >/home/hadoop/$PROJECT/boot.log 2>&1 &"
  done
};;

"stop")
{
  for i in hadoop1 hadoop2 hadoop3
  do
    echo "========= Stopping log service: $i =========="
    ssh $i "ps -ef | grep $APPNAME | grep -v grep | awk '{print \$2}' | xargs kill" >/dev/null 2>&1 &
  done
};;

esac

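3. Start the JsonMocker program. It sends requests to the nginx server, which forwards them to the services on the three hosts, and the logs are saved to the Kafka cluster. (You can start it directly in IDEA. It works if the console prints a 200 response and the corresponding Kafka topic receives data.)

4. Start the Spark Streaming program DauApp. It reads from Kafka, processes the data, and saves the results to ES. (You can start it directly in IDEA and check with es-head or Kibana. It works if the query returns data.)

5. Start the Spring Boot program that publishes the API. It reads the data from ES, applies the business logic, and returns JSON. (You can start it in IDEA or package it and deploy it to the cluster. It works if opening the API URL in a browser returns JSON.)

6. Start the front-end web project. It calls the API, parses the returned JSON, and draws the charts with ECharts. (You can start it in IDEA or package it and deploy it to the cluster. It works if the browser shows the charts and they refresh at the configured interval.)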

Result: (the finished daily-active chart, comparing yesterday with today)
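
The comparison chart needs per-hour counts for two days. As a rough Java sketch of the kind of aggregation and JSON the publisher could return for it, assuming the records' logHour values have already been fetched from ES: the class name and the response shape (the yesterday and today keys) are hypothetical, not the project's actual API.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Counts active users per hour and renders a two-day comparison as JSON. */
final class HourlyDau {
    /** Counts how many deduplicated login records fall into each hour ("00".."23"). */
    static Map<String, Long> countByHour(List<String> logHours) {
        Map<String, Long> counts = new TreeMap<>();
        for (String hour : logHours) {
            counts.merge(hour, 1L, Long::sum);
        }
        return counts;
    }

    static String toJson(Map<String, Long> yesterday, Map<String, Long> today) {
        return "{\"yesterday\":" + mapToJson(yesterday) + ",\"today\":" + mapToJson(today) + "}";
    }

    private static String mapToJson(Map<String, Long> counts) {
        StringBuilder json = new StringBuilder("{");
        for (Map.Entry<String, Long> entry : counts.entrySet()) {
            if (json.length() > 1) {
                json.append(',');
            }
            json.append('"').append(entry.getKey()).append("\":").append(entry.getValue());
        }
        return json.append('}').toString();
    }

    public static void main(String[] args) {
        Map<String, Long> yesterday = countByHour(Arrays.asList("09", "10", "09"));
        Map<String, Long> today = countByHour(Arrays.asList("09"));
        System.out.println(toJson(yesterday, today));
        // {"yesterday":{"09":2,"10":1},"today":{"09":1}}
    }
}
```

The front end can then plot the two maps as two series over the same hour axis.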

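7. For the sales statistics, first configure canal to watch the MySQL instance. The canal configuration file is shown below. Start canal with bin/startup.sh.

In conf/example/instance.properties, the main settings are slaveId, the MySQL address, and the canal username and password. That user has to be created in MySQL so that canal can access the tables.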

#################################################
## mysql serverId , v1.0.26+ will autoGen 
canal.instance.mysql.slaveId=3

# enable gtid use true/false
canal.instance.gtidon=false

# position info
canal.instance.master.address=hadoop1:3306
canal.instance.master.journal.name=
canal.instance.master.position=
canal.instance.master.timestamp=
canal.instance.master.gtid=

# rds oss binlog
canal.instance.rds.accesskey=
canal.instance.rds.secretkey=
canal.instance.rds.instanceId=

# table meta tsdb info
canal.instance.tsdb.enable=true
#canal.instance.tsdb.url=jdbc:mysql://127.0.0.1:3306/canal_tsdb
#canal.instance.tsdb.dbUsername=canal
#canal.instance.tsdb.dbPassword=canal

#canal.instance.standby.address =
#canal.instance.standby.journal.name =
#canal.instance.standby.position =
#canal.instance.standby.timestamp =
#canal.instance.standby.gtid=

# username/password
canal.instance.dbUsername=canal
canal.instance.dbPassword=canal
canal.instance.connectionCharset = UTF-8
canal.instance.defaultDatabaseName =test
# enable druid Decrypt database password
canal.instance.enableDruid=false
#canal.instance.pwdPublicKey=MFwwDQYJKoZIhvcNAQEBBQADSwAwSAJBALK4BUxdDltRRE5/zXpVEVPUgunvscYFtEip3pmLlhrWpacX7y7GCMo2/JM6LeHmiiNdH1FWgGCpUfircSwlWKUCAwEAAQ==

# table regex
canal.instance.filter.regex=.*\\..*
# table black regex
canal.instance.filter.black.regex=

# mq config
canal.mq.topic=example
canal.mq.partition=0
# hash partition config
#canal.mq.partitionsNum=3
#canal.mq.partitionHash=mytest.person:id,mytest.role:id
#################################################

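8. Start the canal API program, which forwards changes in the MySQL business table to the corresponding Kafka topic. Once it is running, use the SQL scripts in the sql folder to create the stored procedure and the tables in the MySQL database, then call the stored procedure to modify order_info. canal detects the change, reads the binlog, and sends the data to the Kafka cluster.

Call the stored procedure below to modify the table data; see the procedure body for what each parameter means.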

-- Parameters: do_date_string (varchar), order_incr_num (int), user_incr_num (int), if_truncate (tinyint)
call init_data('2019-11-22', 10, 5, false);

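9. Start the Spark Streaming program orderApp. It reads the Kafka data, processes it, and saves the results to the corresponding ES index. (You can run it directly in IDEA. It works if new documents show up in the index.)

10. Start the publisher and dw-chart modules and open the page URL to see the result shown below. You can also draw the same kind of chart with Kibana's visualization tools, as in the second figure: pick the index and the fields, and you get the chart you need.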

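V. Summary

This example focuses on the basics and walks through a complete flow, from data simulation and collection through transport and computation to displaying the results. A simple real-time system like this still has plenty to improve, and better options exist for many of its parts; those can come later. The example records my learning process, and I hope it helps others who want to learn big data.

Full project on GitHub: https://github.com/HeCCXX/gmall-parent.git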
