Requirement: display URL click counts in real time.
Plan: write a log-generation script; write a Flume configuration whose source is the log file and whose sink is Kafka; write a Spark Streaming program that integrates with Kafka and cleans the data; write the aggregated counts to HBase; finally, visualize the results.
------ Real-time logs -> Flume ------
1 Write a Python script that simulates user search logs (generate_log.py)
2 Write a shell script that runs the Python log generator (log_generator.sh)
3 Use crontab -e to run the script on a schedule (add a cron job)
4 Write the log file -> Flume configuration (file source -> console sink) (exec source -> memory channel -> logger sink) (streaming_project.conf)
5 Start the Flume agent and check that log lines get printed (Flume startup command)
------ Flume -> Kafka ------
6 Start ZooKeeper (ZooKeeper startup command)
7 Start Kafka (Kafka startup command)
8 Write the Flume -> Kafka configuration (streaming_project2.conf)
9 Start a Kafka console consumer (consumer startup command)
10 Start Flume (Flume startup command)
------ Kafka -> Spark Streaming ------
11 Write the Spark Streaming program (Receiver-based approach)
12 Start the Spark Streaming program (pass zookeeper, group, topics, and thread count as arguments)
------ Data cleaning ------
13 Write the date-conversion utility class
14 Write the data-cleaning code
15 Write the case class for the cleaned records
------ HBase access code ------
16 Start Hadoop (Kafka and ZooKeeper are already running at this point)
17 Start HBase
18 Create the HBase table
19 Design the rowkey (day_courseid)
20 Operate HBase from Scala
1 Write the entity class (case class CourseClickCount)
2 Write the DAO layer for the entity class (object CourseClickCountDAO)
1 Increment a value by rowkey, column family, and column
2 Query a value by rowkey
3 Write the HBase utility class (HBaseUtils)
1 Configure ZooKeeper and HDFS
2 Get a table handle
3 Insert data
------ Course click counts so far today ------
21 Write the Spark code that computes the click counts (ImoocStatStreamingApp)
22 Start Flume (streaming_project2.conf) (Kafka must be started before Flume)
23 In hbase shell, check whether the table data is being updated
------ Real-time logs -> Flume ------
1 Write a Python script that simulates user search logs (generate_log.py)
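The snippet here was lost to a code widget on the source page. Below is a minimal sketch, assuming the tab-separated record layout that the cleaning code in step 21 parses (ip, time, request, status code, referer); all sample IPs, URLs, and referers are made-up illustration values, and the output path matches the file tailed by the Flume configs below.

# generate_log.py -- minimal sketch; the original snippet was lost to a page widget.
# Appends tab-separated records (ip, time, request, status code, referer) to the
# log file that the Flume exec source tails. All sample values are illustrative.
import random
import time

URL_PATHS = ["class/112.html", "class/128.html", "class/145.html",
             "learn/821", "course/list"]
IP_SLICES = [132, 156, 124, 10, 29, 167, 143, 187, 30, 46, 55, 63, 72, 98, 168]
HTTP_REFERERS = ["-", "http://www.baidu.com/s?wd={query}",
                 "http://cn.bing.com/search?q={query}"]
SEARCH_KEYWORDS = ["Spark SQL", "Hadoop", "Storm", "Spark Streaming"]
STATUS_CODES = ["200", "404", "500"]

def sample_ip():
    # Build a fake IPv4 address from four random slices
    return ".".join(str(item) for item in random.sample(IP_SLICES, 4))

def sample_referer():
    referer = random.choice(HTTP_REFERERS)
    if referer == "-":
        return referer
    return referer.format(query=random.choice(SEARCH_KEYWORDS))

def generate_log(count=10):
    # Timestamp format must match DateUtils (yyyy-MM-dd HH:mm:ss)
    time_str = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime())
    with open("/home/hadoop/data/project/logs/access_smj.log", "a") as f:
        for _ in range(count):
            f.write("{ip}\t{t}\tGET /{url} HTTP/1.1\t{code}\t{ref}\n".format(
                ip=sample_ip(), t=time_str, url=random.choice(URL_PATHS),
                code=random.choice(STATUS_CODES), ref=sample_referer()))

if __name__ == "__main__":
    generate_log()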
2 Write a shell script that runs the Python log generator (log_generator.sh)
python /home/hadoop/data/project/generate_log_smj.py
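For cron to run this script it also needs to be executable; assuming the path above:
chmod u+x /home/hadoop/data/project/log_generator.sh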
3 Use crontab -e to run the script on a schedule (add a cron job)
Run crontab -e in a terminal and add the following line (it fires once a minute):
*/1 * * * * /home/hadoop/data/project/log_generator.sh
Online crontab tool: https://tool.lu/crontab
4 Write the log file -> Flume configuration (file source -> console sink) (exec source -> memory channel -> logger sink) (streaming_project.conf)
exec-memory-logger.sources = exec-source
exec-memory-logger.sinks = logger-sink
exec-memory-logger.channels = memory-channel
exec-memory-logger.sources.exec-source.type = exec
exec-memory-logger.sources.exec-source.command = tail -F /home/hadoop/data/project/logs/access_smj.log
exec-memory-logger.sources.exec-source.shell = /bin/sh -c
exec-memory-logger.channels.memory-channel.type = memory
exec-memory-logger.sinks.logger-sink.type = logger
exec-memory-logger.sources.exec-source.channels = memory-channel
exec-memory-logger.sinks.logger-sink.channel = memory-channel
5 Start the Flume agent and check that log lines get printed (Flume startup command)
flume-ng agent \
--name exec-memory-logger \
--conf $FLUME_HOME/conf \
--conf-file /home/hadoop/data/project/streaming_project.conf \
-Dflume.root.logger=INFO,console
------ Flume -> Kafka ------
6 Start ZooKeeper (ZooKeeper startup command)
zkServer.sh start
7 Start Kafka (Kafka startup command)
$KAFKA_HOME/bin/kafka-server-start.sh -daemon $KAFKA_HOME/config/server.properties
8 Write the Flume -> Kafka configuration (streaming_project2.conf)
exec-memory-kafka.sources = exec-source
exec-memory-kafka.sinks = kafka-sink
exec-memory-kafka.channels = memory-channel
exec-memory-kafka.sources.exec-source.type = exec
exec-memory-kafka.sources.exec-source.command = tail -F /home/hadoop/data/project/logs/access_smj.log
exec-memory-kafka.sources.exec-source.shell = /bin/sh -c
exec-memory-kafka.channels.memory-channel.type = memory
exec-memory-kafka.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
exec-memory-kafka.sinks.kafka-sink.topic = streamingtopic
exec-memory-kafka.sinks.kafka-sink.brokerList = hadoop000:9092
exec-memory-kafka.sinks.kafka-sink.requiredAcks = 1
exec-memory-kafka.sinks.kafka-sink.batchSize = 5
exec-memory-kafka.sources.exec-source.channels = memory-channel
exec-memory-kafka.sinks.kafka-sink.channel = memory-channel
9 Start a Kafka console consumer (consumer startup command)
kafka-console-consumer.sh --zookeeper hadoop000:2181 --topic streamingtopic
10 Start Flume (Flume startup command)
flume-ng agent \
--name exec-memory-kafka \
--conf $FLUME_HOME/conf \
--conf-file /home/hadoop/data/project/streaming_project2.conf \
-Dflume.root.logger=INFO,console
------ Kafka -> Spark Streaming ------
11 Write the Spark Streaming program (Receiver-based approach) (ImoocStatStreamingApp)
package com.imooc.spark.project.spark

import com.imooc.spark.project.domain.ClickLog
import com.imooc.spark.project.utils.DateUtils
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ImoocStatStreamingApp {

  def main(args: Array[String]): Unit = {
    if (args.length != 4) {
      println("Usage: ImoocStatStreamingApp <zkQuorum> <groupId> <topics> <numThreads>")
      System.exit(1)
    }

    val Array(zkQuorum, groupId, topics, numThreads) = args

    val sparkConf = new SparkConf().setAppName("ImoocStatStreamingApp").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(60))

    // Each topic is consumed with numThreads receiver threads
    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val messages = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)

    // Test step 1: verify that data is being received
    messages.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
12 Start the Spark Streaming program (pass zookeeper, group, topics, and thread count as arguments) (ImoocStatStreamingApp)
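The program hard-codes setMaster("local[2]"), so it can be launched straight from the IDE. The four program arguments below are a plausible set that reuses the ZooKeeper address and topic from the configs above; the group id test is just a placeholder, and the last argument is the per-topic consumer thread count:

hadoop000:2181 test streamingtopic 1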
------ Data cleaning ------
13 Write the date-conversion utility class (DateUtils)
package com.imooc.spark.project.utils

import java.util.Date
import org.apache.commons.lang3.time.FastDateFormat

/* Date/time utility class */
object DateUtils {

  val YYYYMMDDHHMMSS_FORMAT = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss")
  val TARGET_FORMAT = FastDateFormat.getInstance("yyyyMMddHHmmss")

  // Parse a source timestamp (yyyy-MM-dd HH:mm:ss) into epoch milliseconds
  def getTime(time: String) = {
    YYYYMMDDHHMMSS_FORMAT.parse(time).getTime
  }

  // Reformat a source timestamp as yyyyMMddHHmmss
  // (despite the name, the precision is to the second)
  def parseToMinute(time: String) = {
    TARGET_FORMAT.format(new Date(getTime(time)))
  }

  def main(args: Array[String]): Unit = {
    println(parseToMinute("2020-04-04 21:47:01"))
  }
}
14 Write the data-cleaning code (ImoocStatStreamingApp)
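This snippet was also swallowed by a code widget; the cleaning logic survives inside the full program in step 21, and the relevant excerpt is:

val logs = messages.map(_._2)
val cleanData = logs.map(line => {
  val infos = line.split("\t")
  // infos(2) = "GET /class/128.html HTTP/1.1"
  // url = /class/128.html
  val url = infos(2).split(" ")(1)
  var courseId = 0
  // Extract the course ID from course-page URLs
  if (url.startsWith("/class")) {
    val courseIdHTML = url.split("/")(2)
    courseId = courseIdHTML.substring(0, courseIdHTML.lastIndexOf(".")).toInt
  }
  ClickLog(infos(0), DateUtils.parseToMinute(infos(1)), courseId, infos(3).toInt, infos(4))
}).filter(clicklog => clicklog.courseId != 0)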
15 Write the case class for the cleaned records (ClickLog)
package com.imooc.spark.project.domain

/*
 * A cleaned log record
 * ip         IP address the request came from
 * time       time of the request
 * courseId   ID of the course page that was hit
 * statusCode HTTP status code of the request
 * referer    referer of the request
 */
case class ClickLog(ip: String, time: String, courseId: Int, statusCode: Int, referer: String)
------ HBase access code ------
16 Start Hadoop (Kafka and ZooKeeper are already running before Hadoop is started)
./start-dfs.sh
17 Start HBase
cd into HBase's bin directory and run:
./start-hbase.sh
18 Create the HBase table
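The create statement was lost to a widget too, but the table name and column family are recoverable from the DAO code in step 20 (imooc_course_clickcount_smj with column family info), so it was presumably:

create 'imooc_course_clickcount_smj', 'info'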
Extras:
To launch the hbase shell, cd into HBase's bin directory and run:
./hbase shell
List the tables:
list
Show a table's details:
desc 'imooc_course_clickcount'
Show a table's contents:
scan 'imooc_course_clickcount'
While running list I once could not connect to ZooKeeper. My first guess was that the VM's ZooKeeper was unstable, but stopping and restarting ZooKeeper did not help. What finally fixed it was killing the hadoop, zookeeper, and hbase processes and bringing them back up in the order zookeeper, hadoop, hbase.
A highly available Hadoop cluster depends on ZooKeeper (a non-HA one does not), and an HBase cluster depends on ZooKeeper and HDFS, so start ZooKeeper first, then Hadoop, and HBase last.
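For reference, a stop/start sequence that matches this order might look like the following (assuming the stock control scripts that ship with each component are on the PATH):

stop-hbase.sh
stop-dfs.sh
zkServer.sh stop
zkServer.sh start
start-dfs.sh
start-hbase.sh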
19 Design the rowkey (day_courseid)
Here I set the rowkey to date_courseId.
Both the date and the course ID can introduce data skew; that is left unaddressed for now.
20 Operate HBase from Scala
1 Write the entity class (case class CourseClickCount)
package com.imooc.spark.project.domain

/**
 * Course click-count entity class
 * day_course  maps to the HBase rowkey, e.g. 20200413_1
 * click_count total number of visits for 20200413_1
 */
case class CourseClickCount(day_course: String, click_count: Long)
2 Write the DAO layer for the entity class (object CourseClickCountDAO)
1 Increment a value by rowkey, column family, and column
2 Query a value by rowkey
package com.imooc.spark.project.dao

import com.imooc.spark.project.domain.CourseClickCount
import com.imooc.spark.project.utils.HBaseUtils
import org.apache.hadoop.hbase.client.Get
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.mutable.ListBuffer

/* Data access layer for course click counts */
object CourseClickCountDAO {

  val tableName = "imooc_course_clickcount_smj"
  val cf = "info"
  val qualifier = "click_count"

  /**
   * Save a batch of CourseClickCount records to HBase
   */
  def save(list: ListBuffer[CourseClickCount]): Unit = {
    val table = HBaseUtils.getInstance().getTable(tableName)
    for (ele <- list) {
      // incrementColumnValue(rowkey, column family, column, amount) adds the amount
      // to whatever is already stored under the same rowkey / column family / column
      table.incrementColumnValue(Bytes.toBytes(ele.day_course),
        Bytes.toBytes(cf),
        Bytes.toBytes(qualifier),
        ele.click_count)
    }
  }

  /**
   * Query a value by rowkey
   */
  def count(day_course: String): Long = {
    val table = HBaseUtils.getInstance().getTable(tableName)
    // A Get takes the rowkey
    val get = new Get(Bytes.toBytes(day_course))
    // Reading the value takes the Get plus the column family and column
    val value = table.get(get).getValue(cf.getBytes, qualifier.getBytes)
    if (value == null) {
      0L
    } else {
      Bytes.toLong(value)
    }
  }

  def main(args: Array[String]): Unit = {
    val list = new ListBuffer[CourseClickCount]
    list.append(CourseClickCount("20200507_8", 8))
    list.append(CourseClickCount("20200507_9", 9))
    list.append(CourseClickCount("20200507_1", 100))
    save(list)
    println(count("20200507_8") + ":" + count("20200507_9") + ":" + count("20200507_1"))
  }
}
3 Write the HBase utility class (HBaseUtils)
1 Configure ZooKeeper and HDFS
2 Get a table handle
3 Insert data
package com.imooc.spark.project.utils;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

import java.io.IOException;

/* HBase utility class: Java utility classes are best wrapped as singletons.
 * (This targets the pre-1.0 HBase client API: HBaseAdmin/HTable constructors.) */
public class HBaseUtils {

    HBaseAdmin admin = null;
    Configuration configuration = null;

    // A singleton's constructor is private
    private HBaseUtils() {
        // The HBase client needs the ZooKeeper quorum and the HDFS root directory
        configuration = new Configuration();
        configuration.set("hbase.zookeeper.quorum", "hadoop000:2181");
        configuration.set("hbase.rootdir", "hdfs://hadoop000:8020/hbase");

        try {
            admin = new HBaseAdmin(configuration);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    private static HBaseUtils instance = null;

    // synchronized so concurrent Spark tasks cannot race to create the instance
    public static synchronized HBaseUtils getInstance() {
        if (null == instance) {
            instance = new HBaseUtils();
        }
        return instance;
    }

    /**
     * Get an HTable instance by table name
     * @param tableName
     * @return org.apache.hadoop.hbase.client.HTable
     * @author songminjian
     * @date 2020/5/6 18:22
     */
    public HTable getTable(String tableName) {
        HTable table = null;
        try {
            table = new HTable(configuration, tableName);
        } catch (IOException e) {
            e.printStackTrace();
        }
        return table;
    }

    /**
     * Add one record to an HBase table
     * @param tableName HBase table name
     * @param rowkey rowkey of the record
     * @param cf column family
     * @param column column name
     * @param value value to write
     * @return void
     * @author songminjian
     * @date 2020/5/6 18:31
     */
    public void put(String tableName, String rowkey, String cf, String column, String value) {
        HTable table = getTable(tableName);

        Put put = new Put(Bytes.toBytes(rowkey));
        put.add(Bytes.toBytes(cf), Bytes.toBytes(column), Bytes.toBytes(value));

        try {
            table.put(put);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }

    public static void main(String[] args) {
        // HTable table = HBaseUtils.getInstance().getTable("imooc_course_clickcount_smj");
        // System.out.println(table.getName().getNameAsString());
        String tableName = "imooc_course_clickcount_smj";
        String rowkey = "20200506_88";
        String cf = "info";
        String column = "click_count";
        String value = "2";
        HBaseUtils.getInstance().put(tableName, rowkey, cf, column, value);
    }
}
------ Course click counts so far today ------
21 Write the Spark code that computes the click counts (ImoocStatStreamingApp)
Only test step 3 is new here; the rest of the code already appeared above.
package com.imooc.spark.project.spark

import com.imooc.spark.project.dao.CourseClickCountDAO
import com.imooc.spark.project.domain.{ClickLog, CourseClickCount}
import com.imooc.spark.project.utils.DateUtils
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable.ListBuffer

object ImoocStatStreamingApp {

  def main(args: Array[String]): Unit = {
    if (args.length != 4) {
      println("Usage: ImoocStatStreamingApp <zkQuorum> <groupId> <topics> <numThreads>")
      System.exit(1)
    }

    val Array(zkQuorum, groupId, topics, numThreads) = args

    val sparkConf = new SparkConf().setAppName("ImoocStatStreamingApp").setMaster("local[2]")
    val ssc = new StreamingContext(sparkConf, Seconds(60))

    val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
    val messages = KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)

    // Test step 1: verify that data is being received
    // messages.map(_._2).count().print()

    // Test step 2: data cleaning
    val logs = messages.map(_._2)
    val cleanData = logs.map(line => {
      val infos = line.split("\t")
      // infos(2) = "GET /class/128.html HTTP/1.1"
      // url = /class/128.html
      val url = infos(2).split(" ")(1)
      var courseId = 0

      // Extract the course ID from course-page URLs
      if (url.startsWith("/class")) {
        val courseIdHTML = url.split("/")(2)
        courseId = courseIdHTML.substring(0, courseIdHTML.lastIndexOf(".")).toInt
      }

      ClickLog(infos(0), DateUtils.parseToMinute(infos(1)), courseId, infos(3).toInt, infos(4))
    }).filter(clicklog => clicklog.courseId != 0)
    // cleanData.print()

    // Test step 3: count course visits so far today
    cleanData.map(x => {
      // HBase rowkey layout: 20200507_88 (date_courseId)
      (x.time.substring(0, 8) + "_" + x.courseId, 1)
    }).reduceByKey(_ + _).foreachRDD(rdd => {
      rdd.foreachPartition(partitionRecords => {
        val list = new ListBuffer[CourseClickCount]
        partitionRecords.foreach(pair => {
          list.append(CourseClickCount(pair._1, pair._2))
        })
        CourseClickCountDAO.save(list)
      })
    })

    ssc.start()
    ssc.awaitTermination()
  }
}
22 Start Flume (streaming_project2.conf) (since the Flume sink is Kafka, Kafka must be running before Flume starts; also make sure the Flume source is still producing new data)
flume-ng agent \
--name exec-memory-kafka \
--conf $FLUME_HOME/conf \
--conf-file /home/hadoop/data/project/streaming_project2.conf \
-Dflume.root.logger=INFO,console
23 In hbase shell, check whether the table data has been updated
scan 'imooc_course_clickcount_smj'