Big Data Spark Streaming Real-Time Project: Logs + Flume + Kafka + Spark Streaming + HBase + ECharts

Requirement: display URL click counts in real time

Write a log-generation script, write the Flume configuration (the Flume source is the log file and the Flume sink is Kafka), write a Spark Streaming program that integrates with Kafka and cleans the data, write the aggregated results to HBase, and finally visualize the data.

------ Real-time logs -> Flume ------
1 Write a Python script that simulates user search log data (generate_log.py)
2 Write a shell script that runs the Python log generator (log_generator.sh)
3 Use crontab -e to run the script on a schedule and keep generating data (add a cron job)
4 Write the log file -> Flume configuration (file source -> console sink) (exec source -> memory channel -> logger sink) (streaming_project.conf)
5 Start the Flume agent and check that log records are printed (Flume startup command)

------ Flume -> Kafka ------
6 Start ZooKeeper (ZooKeeper startup command)
7 Start Kafka (Kafka startup command)
8 Write the Flume -> Kafka configuration file (streaming_project2.conf)
9 Start a Kafka console consumer (consumer command)
10 Start Flume (Flume startup command)

------ Kafka -> Spark Streaming ------
11 Write the Spark Streaming program (receiver-based approach)
12 Start the Spark Streaming program (pass in arguments: zookeeper quorum, consumer group, topics, number of threads)

------ Data cleaning ------
13 Write the date-conversion utility class
14 Write the data-cleaning code
15 Write the case class for the cleaned data

------ Write the HBase access code ------

16 Start Hadoop (Kafka and ZooKeeper are already running before Hadoop is started)
17 Start HBase
18 Create the HBase table
19 Design the rowkey (day_courseid)
20 Access HBase from Scala
   1 Write the entity class (case class CourseClickCount)
   2 Write the DAO layer for the entity class (object CourseClickCountDAO)
     1 Increment a value by rowkey, column family and column
     2 Query a value by rowkey
   3 Write the HBase utility class (HBaseUtils)
     1 Configure ZooKeeper and HDFS
     2 Get a table handle
     3 Insert data

------ Count course visits from the start of today until now ------
21 Write the Spark code that computes the visit counts (ImoocStatStreamingApp)
22 Start Flume (streaming_project2.conf) (Kafka must be started before Flume)
23 In the hbase shell, check whether the HBase table data is being updated

------ Real-time logs -> Flume ------


1 Write a Python script that simulates user search log data (generate_log.py)

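A minimal sketch of such a generator, assuming the tab-separated record layout that the cleaning code in step 21 expects (ip, time, request line, status code, referer) and the log path tailed by the Flume exec source below; the IPs, URLs, search keywords and record counts are illustrative placeholders only.

# generate_log.py -- sketch only; appends tab-separated records
# (ip \t time \t request \t status_code \t referer) to the file
# that the Flume exec source tails.
import random
import time

URL_PATHS = ["class/112.html", "class/128.html", "class/145.html",
             "class/146.html", "learn/821", "course/list"]
IPS = ["132.168.30.87", "143.215.89.10", "10.45.22.9", "55.38.54.21"]
REFERERS = ["https://www.baidu.com/s?wd={query}",
            "https://www.sogou.com/web?query={query}",
            "-"]
KEYWORDS = ["Spark Streaming", "Hadoop", "Storm", "Spark SQL", "Flink"]
STATUS_CODES = ["200", "404", "500"]
LOG_FILE = "/home/hadoop/data/project/logs/access_smj.log"

def generate_log(count=100):
    now = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
    # Append, so that tail -F keeps picking up the new lines.
    with open(LOG_FILE, "a") as f:
        for _ in range(count):
            referer = random.choice(REFERERS).format(query=random.choice(KEYWORDS))
            f.write("{ip}\t{t}\tGET /{url} HTTP/1.1\t{code}\t{ref}\n".format(
                ip=random.choice(IPS), t=now, url=random.choice(URL_PATHS),
                code=random.choice(STATUS_CODES), ref=referer))

if __name__ == "__main__":
    generate_log()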


2 Write the shell script that runs the Python log generator (log_generator.sh)

python /home/hadoop/data/project/generate_log_smj.py


3 Use crontab -e to run the script on a schedule and keep generating data (add a cron job)

   Run crontab -e in a terminal and add the following line (it runs once per minute)

*/1 * * * * /home/hadoop/data/project/log_generator.sh

Online crontab tool: https://tool.lu/crontab


4 Write the log file -> Flume configuration (file source -> console sink) (exec source -> memory channel -> logger sink) (streaming_project.conf)

exec-memory-logger.sources = exec-source
exec-memory-logger.sinks = logger-sink
exec-memory-logger.channels = memory-channel

exec-memory-logger.sources.exec-source.type = exec
exec-memory-logger.sources.exec-source.command = tail -F /home/hadoop/data/project/logs/access_smj.log
exec-memory-logger.sources.exec-source.shell = /bin/sh -c

exec-memory-logger.channels.memory-channel.type = memory

exec-memory-logger.sinks.logger-sink.type = logger

exec-memory-logger.sources.exec-source.channels = memory-channel
exec-memory-logger.sinks.logger-sink.channel = memory-channel


5 Start the Flume agent and check that log records are printed (Flume startup command)

flume-ng agent \
--name exec-memory-logger \
--conf $FLUME_HOME/conf \
--conf-file /home/hadoop/data/project/streaming_project.conf \
-Dflume.root.logger=INFO,console

------ Flume -> Kafka ------
6 Start ZooKeeper (ZooKeeper startup command)

zkServer.sh start


7 Start Kafka (Kafka startup command)

$KAFKA_HOME/bin/kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties
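If topic auto-creation is disabled on the broker, the streamingtopic topic used by the Flume sink and consumer below has to be created first. A sketch using the ZooKeeper-based CLI that matches the consumer command in step 9 (a single partition and replica are assumptions):

$KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper hadoop000:2181 --replication-factor 1 --partitions 1 --topic streamingtopic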


8 Write the Flume -> Kafka configuration file (streaming_project2.conf)

exec-memory-kafka.sources = exec-source
exec-memory-kafka.sinks = kafka-sink
exec-memory-kafka.channels = memory-channel

exec-memory-kafka.sources.exec-source.type = exec
exec-memory-kafka.sources.exec-source.command = tail -F /home/hadoop/data/project/logs/access_smj.log
exec-memory-kafka.sources.exec-source.shell = /bin/sh -c

exec-memory-kafka.channels.memory-channel.type = memory

exec-memory-kafka.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
exec-memory-kafka.sinks.kafka-sink.topic = streamingtopic
exec-memory-kafka.sinks.kafka-sink.brokerList = hadoop000:9092
exec-memory-kafka.sinks.kafka-sink.requiredAcks = 1
exec-memory-kafka.sinks.kafka-sink.batchSize = 5

exec-memory-kafka.sources.exec-source.channels = memory-channel
exec-memory-kafka.sinks.kafka-sink.channel = memory-channel


9 Start a Kafka console consumer (consumer command)

kafka-console-consumer.sh --zookeeper hadoop000:2181 --topic streamingtopic


10 Start Flume (Flume startup command)

flume-ng agent \
--name exec-memory-kafka \
--conf $FLUME_HOME/conf \
--conf-file /home/hadoop/data/project/streaming_project2.conf \
-Dflume.root.logger=INFO,console

------ Kafka -> Spark Streaming ------
11 Write the Spark Streaming program (receiver-based approach) (ImoocStatStreamingApp)

package com.imooc.spark.project.spark

import com.imooc.spark.project.domain.ClickLog
import com.imooc.spark.project.utils.DateUtils
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ImoocStatStreamingApp {

  def main(args: Array[String]): Unit = {

    if(args.length != 4) {
      println("Usage: ImoocStatStreamingApp <zkQuorum> <groupId> <topics> <numThreads>")
      System.exit(1)
    }

    val Array(zkQuorum, groupId, topics, numThreads) = args

    val sparkConf = new SparkConf().setAppName("ImoocStatStreamingApp").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(60))

    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap

    val messages = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)

    // Test step 1: verify that data is being received
    messages.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }

}


12 Start the Spark Streaming program (pass in arguments at launch: zookeeper quorum, consumer group, topics, number of threads) (ImoocStatStreamingApp)
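The four arguments map to <zkQuorum> <groupId> <topics> <numThreads>. Since the app hard-codes setMaster("local[2]"), it can be run straight from the IDE with program arguments such as the following, where the group id test is an arbitrary placeholder and the ZooKeeper address and topic match the configuration above:

hadoop000:2181 test streamingtopic 1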

------ Data cleaning ------
13 Write the date-conversion utility class (DateUtils)

package com.imooc.spark.project.utils

import java.util.Date

import org.apache.commons.lang3.time.FastDateFormat

/* Date/time utility class */
object DateUtils {

  val YYYYMMDDHHMMSS_FORMAT = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss")
  val TARGET_FORMAT = FastDateFormat.getInstance("yyyyMMddHHmmss")

  def getTime(time: String) = {
    YYYYMMDDHHMMSS_FORMAT.parse(time).getTime
  }

  def parseToMinute(time: String) = {
    TARGET_FORMAT.format(new Date(getTime(time)))
  }

  def main(args: Array[String]): Unit = {
    println(parseToMinute("2020-04-04 21:47:01"))
  }

}


14 Write the data-cleaning code (ImoocStatStreamingApp)

The cleaning logic appears as test step 2 inside the full ImoocStatStreamingApp listing in step 21 below.


15 Write the case class for the cleaned data (ClickLog)

package com.imooc.spark.project.domain

/*
A cleaned click-log record
ip         IP address of the access
time       time of the access
courseId   id of the course that was accessed
statusCode HTTP status code of the access
referer    referer of the access
*/
case class ClickLog(ip: String, time: String, courseId: Int, statusCode: Int, referer: String)

------ Write the HBase access code ------

16 Start Hadoop (Kafka and ZooKeeper are already running before Hadoop is started)

./start-dfs.sh


17 Start HBase

cd to HBase's bin directory and run this command
./start-hbase.sh


18 Create the HBase table

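A minimal hbase shell sketch for this step, assuming the table name and column family that the DAO and HBaseUtils code below rely on (imooc_course_clickcount_smj with a single column family info):

create 'imooc_course_clickcount_smj', 'info'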

     Additional notes:

Command to start the hbase shell:
cd to the hbase bin directory and run
./hbase shell

List the tables
list

Describe the table
desc 'imooc_course_clickcount_smj'

Scan the table contents
scan 'imooc_course_clickcount_smj'

While running the list command I once could not connect to ZooKeeper.
My first guess was that the ZooKeeper instance in the VM was unstable, but stopping and restarting ZooKeeper did not help.
In the end I killed the Hadoop, ZooKeeper and HBase processes and started them again in the order
ZooKeeper, Hadoop, HBase, and HBase went back to normal.
A highly available Hadoop cluster depends on ZooKeeper (a non-HA cluster does not),
while an HBase cluster depends on ZooKeeper and HDFS, so start ZooKeeper first, then Hadoop, and finally HBase.
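Putting the commands from steps 6, 16 and 17 together, the working startup sequence looks like this (start-dfs.sh sits in Hadoop's sbin directory, start-hbase.sh in HBase's bin directory):

zkServer.sh start
./start-dfs.sh
./start-hbase.sh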

19 Design the rowkey (day_courseid)

     Here the rowkey is set to date_courseId.

     Both the date and the course id can cause data skew; that is not handled here.


20 Access HBase from Scala
   1 Write the entity class (case class CourseClickCount)

package com.imooc.spark.project.domain

/**
 * Entity class for course click counts
 * day_course maps to the HBase rowkey, e.g. 20200413_1
 * click_count is the total number of visits for 20200413_1
 */
case class CourseClickCount(day_course: String, click_count: Long)


   2 Write the DAO layer for the entity class (object CourseClickCountDAO)
     1 Increment a value by rowkey, column family and column
     2 Query a value by rowkey

package com.imooc.spark.project.dao

import com.imooc.spark.project.domain.CourseClickCount
import com.imooc.spark.project.utils.HBaseUtils
import org.apache.hadoop.hbase.client.Get
import org.apache.hadoop.hbase.util.Bytes

import scala.collection.mutable.ListBuffer

/* Data access layer for course click counts */
object CourseClickCountDAO {

  val tableName = "imooc_course_clickcount_smj"
  val cf = "info"
  val qualifier = "click_count"

  /**
   * Save data to HBase.
   * list is a collection of CourseClickCount records
   */
  def save(list: ListBuffer[CourseClickCount]): Unit = {

    val table = HBaseUtils.getInstance().getTable(tableName)

    for(ele <- list) {
      // incrementColumnValue(rowkey, column family, column, amount) adds the amount
      // to the value already stored under the same rowkey, column family and column
      table.incrementColumnValue(Bytes.toBytes(ele.day_course),
        Bytes.toBytes(cf),
        Bytes.toBytes(qualifier),
        ele.click_count)
    }

  }

  /**
   * Query the value for a given rowkey.
   */
  def count(day_course: String): Long = {
    val table = HBaseUtils.getInstance().getTable(tableName)

    // Get takes the rowkey
    val get = new Get(Bytes.toBytes(day_course))
    // getValue takes the column family and the column qualifier
    val value = table.get(get).getValue(cf.getBytes, qualifier.getBytes)

    if(value == null) {
      0L
    } else {
      Bytes.toLong(value)
    }
  }

  def main(args: Array[String]): Unit = {

    val list = new ListBuffer[CourseClickCount]
    list.append(CourseClickCount("20200507_8", 8))
    list.append(CourseClickCount("20200507_9", 9))
    list.append(CourseClickCount("20200507_1", 100))

    save(list)

    println(count("20200507_8") + ":" + count("20200507_9") + ":" + count("20200507_1"))
  }

}


   3 Write the HBase utility class (HBaseUtils)
     1 Configure ZooKeeper and HDFS
     2 Get a table handle
     3 Insert data

package com.imooc.spark.project.utils;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

/* HBase utility class: Java utility classes are best implemented as singletons */
public class HBaseUtils {

    HBaseAdmin admin = null;

    Configuration configuration = null;

    // The constructor of a singleton is private
    private HBaseUtils() {
        // HBase needs the ZooKeeper quorum and the HDFS root directory
        configuration = new Configuration();
        configuration.set("hbase.zookeeper.quorum", "hadoop000:2181");
        configuration.set("hbase.rootdir", "hdfs://hadoop000:8020/hbase");

        try {
            admin = new HBaseAdmin(configuration);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static HBaseUtils instance = null;

    public static HBaseUtils getInstance() {
        if (null == instance) {
            instance = new HBaseUtils();
        }
        return instance;
    }

    /**
     * Get an HTable instance by table name.
     * @param tableName
     * @return org.apache.hadoop.hbase.client.HTable
     * @author songminjian
     * @date 2020/5/6 6:22 PM
     */
    public HTable getTable(String tableName) {

        HTable table = null;

        try {
            table = new HTable(configuration, tableName);
        } catch (IOException e) {
            e.printStackTrace();
        }

        return table;
    }

    /**
     * Add one record to an HBase table.
     * @param tableName HBase table name
     * @param rowkey rowkey of the row
     * @param cf column family
     * @param column column qualifier
     * @param value value to write
     * @return void
     * @author songminjian
     * @date 2020/5/6 6:31 PM
     */
    public void put(String tableName, String rowkey, String cf, String column, String value) {

        HTable table = getTable(tableName);

        Put put = new Put(Bytes.toBytes(rowkey));
        put.add(Bytes.toBytes(cf), Bytes.toBytes(column), Bytes.toBytes(value));

        try {
            table.put(put);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {

        // HTable table = HBaseUtils.getInstance().getTable("imooc_course_clickcount_smj");
        // System.out.println(table.getName().getNameAsString());

        String tableName = "imooc_course_clickcount_smj";
        String rowkey = "20200506_88";
        String cf = "info";
        String column = "click_count";
        String value = "2";

        HBaseUtils.getInstance().put(tableName, rowkey, cf, column, value);

    }

}

------ Count course visits from the start of today until now ------
21 Write the Spark code that computes the visit counts (ImoocStatStreamingApp)

Only test step 3 is new here; everything else is code that has already appeared above.

package com.imooc.spark.project.spark

import com.imooc.spark.project.dao.CourseClickCountDAO
import com.imooc.spark.project.domain.{ClickLog, CourseClickCount}
import com.imooc.spark.project.utils.DateUtils
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable.ListBuffer

object ImoocStatStreamingApp {

  def main(args: Array[String]): Unit = {

    if(args.length != 4) {
      println("Usage: ImoocStatStreamingApp <zkQuorum> <groupId> <topics> <numThreads>")
      System.exit(1)
    }

    val Array(zkQuorum, groupId, topics, numThreads) = args

    val sparkConf = new SparkConf().setAppName("ImoocStatStreamingApp").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(60))

    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap

    val messages = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)

    // Test step 1: verify that data is being received
    // messages.map(_._2).count().print

    // Test step 2: data cleaning
    val logs = messages.map(_._2)
    val cleanData = logs.map(line => {
      val infos = line.split("\t")
      // infos(2) = "GET /class/128.html HTTP/1.1"
      // url = /class/128.html
      val url = infos(2).split(" ")(1)
      var courseId = 0

      // Extract the course id from the course page URL
      if(url.startsWith("/class")) {
        val courseIdHTML = url.split("/")(2)
        courseId = courseIdHTML.substring(0, courseIdHTML.lastIndexOf(".")).toInt
      }

      ClickLog(infos(0), DateUtils.parseToMinute(infos(1)), courseId, infos(3).toInt, infos(4))
    }).filter(clicklog => clicklog.courseId != 0)

    // cleanData.print()

    // Test step 3: count course visits from the start of today until now
    cleanData.map(x => {
      // HBase rowkey design: 20200507_88 (date_courseId)
      (x.time.substring(0, 8) + "_" + x.courseId, 1)
    }).reduceByKey(_ + _).foreachRDD(rdd => {
      rdd.foreachPartition(partitionRecords => {
        val list = new ListBuffer[CourseClickCount]
        partitionRecords.foreach(pair => {
          list.append(CourseClickCount(pair._1, pair._2))
        })

        CourseClickCountDAO.save(list)
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }

}

22 Start Flume (streaming_project2.conf) (since the Flume sink is Kafka, Kafka must be started before Flume; also make sure new data is still being appended to the Flume source file)

flume-ng agent \
--name exec-memory-kafka \
--conf $FLUME_HOME/conf \
--conf-file /home/hadoop/data/project/streaming_project2.conf \
-Dflume.root.logger=INFO,console


23 In the hbase shell, check whether the data in the HBase table is being updated

scan 'imooc_course_clickcount_smj'
