Project Overview
This project analyzes log data from a CDN (Content Delivery Network). Each log record contains fields such as:
- `aliyun` (CDN vendor)
- `CN` (region)
- `E` (log level)
- `[17/Jul/2018:17:07:50 +0800]` (access time)
- `223.104.18.110` (IP)
- `v2.go2yd.com` (domain)
- `17168` (traffic)
The data we ingest is exactly this kind of log.

Offline: Flume ==> HDFS
Real-time: Kafka ==> stream processing engine ==> ES ==> Kibana
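The bracketed field above is the access time in the usual access-log format. A minimal sketch of parsing it in Java (the class name is illustrative):

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

public class LogTimeParse {
    public static void main(String[] args) throws Exception {
        // Unquoted brackets are treated as literal characters by SimpleDateFormat.
        SimpleDateFormat fmt = new SimpleDateFormat("[dd/MMM/yyyy:HH:mm:ss Z]", Locale.ENGLISH);
        Date time = fmt.parse("[17/Jul/2018:17:07:50 +0800]");
        System.out.println(time.getTime()); // epoch millis: 1531818470000
    }
}
```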
Data Query
| Interface | Description |
|---|---|
| Summary statistics query | Peak bandwidth, total traffic, total request count |
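As a rough illustration of what such a summary query computes, the sketch below derives all three figures from hypothetical per-minute aggregates; the numbers and the bytes-per-minute to bits-per-second conversion are assumptions for illustration only:

```java
public class SummaryStats {
    public static void main(String[] args) {
        // Hypothetical per-minute aggregates: {traffic in bytes, request count}.
        long[][] perMinute = {{15474, 12}, {9210, 7}, {190, 1}, {12787, 9}, {14250, 11}};

        long totalTraffic = 0, totalRequests = 0, busiestMinute = 0;
        for (long[] m : perMinute) {
            totalTraffic += m[0];
            totalRequests += m[1];
            busiestMinute = Math.max(busiestMinute, m[0]); // busiest minute by traffic
        }
        // Peak bandwidth: the busiest minute's bytes expressed as bits per second.
        double peakBandwidth = busiestMinute * 8.0 / 60;

        System.out.printf("totalTraffic=%d B, totalRequests=%d, peakBandwidth=%.1f bit/s%n",
                totalTraffic, totalRequests, peakBandwidth);
    }
}
```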
Project Features
- Compute the traffic each domain generates within one minute: Flink consumes the Kafka data and aggregates it.
- Compute the traffic each user generates within one minute: domains map to users, so Flink consumes the Kafka data and joins it with the domain-to-user configuration stored in MySQL (a minimal sketch of this lookup follows the list).
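One hedged way to bring in the MySQL mapping is a `RichMapFunction` that loads the domain-to-user table once in `open()`. The JDBC URL, table, and column names here are assumptions for illustration, not from the source:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.util.HashMap;
import java.util.Map;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.configuration.Configuration;

// Enriches (time, domain, traffic) with the owning user, loaded from MySQL.
public class DomainUserEnricher
        extends RichMapFunction<Tuple3<Long, String, Long>, Tuple4<Long, String, String, Long>> {

    private transient Map<String, String> domainToUser;

    @Override
    public void open(Configuration parameters) throws Exception {
        domainToUser = new HashMap<>();
        // Connection string, table, and column names are illustrative only.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/cdn", "user", "password");
             PreparedStatement stmt = conn.prepareStatement("SELECT domain, user_id FROM domain_user");
             ResultSet rs = stmt.executeQuery()) {
            while (rs.next()) {
                domainToUser.put(rs.getString("domain"), rs.getString("user_id"));
            }
        }
    }

    @Override
    public Tuple4<Long, String, String, Long> map(Tuple3<Long, String, Long> value) {
        String user = domainToUser.getOrDefault(value.f1, "unknown");
        return new Tuple4<>(value.f0, value.f1, user, value.f2);
    }
}
```

Keying the enriched stream by the user field and applying the same one-minute window as below would then give per-user traffic.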
Project Architecture
Mock Data
```java
@Component
@Slf4j
public class KafkaProducer {

    private static final String TOPIC = "pktest";

    @Autowired
    private KafkaTemplate<String, String> kafkaTemplate;

    @SuppressWarnings("unchecked")
    public void produce(String message) {
        try {
            ListenableFuture future = kafkaTemplate.send(TOPIC, message);
            SuccessCallback<SendResult<String, String>> successCallback = new SuccessCallback<SendResult<String, String>>() {
                @Override
                public void onSuccess(@Nullable SendResult<String, String> result) {
                    log.info("Message sent successfully");
                }
            };
            FailureCallback failureCallback = new FailureCallback() {
                @Override
                public void onFailure(Throwable ex) {
                    log.error("Failed to send message", ex);
                    produce(message);
                }
            };
            future.addCallback(successCallback, failureCallback);
        } catch (Exception e) {
            log.error("Exception while sending message", e);
        }
    }

    @Scheduled(fixedRate = 1000 * 2)
    public void send() {
        StringBuilder builder = new StringBuilder();
        builder.append("aliyun").append("\t")
                .append("CN").append("\t")
                .append(getLevels()).append("\t")
                .append(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
                        .format(new Date())).append("\t")
                .append(getIps()).append("\t")
                .append(getDomains()).append("\t")
                .append(getTraffic()).append("\t");
        log.info(builder.toString());
        produce(builder.toString());
    }

    /**
     * Generate a random log level.
     */
    private String getLevels() {
        List<String> levels = Arrays.asList("M", "E");
        return levels.get(new Random().nextInt(levels.size()));
    }

    /**
     * Generate a random IP.
     */
    private String getIps() {
        List<String> ips = Arrays.asList("222.104.18.111",
                "223.101.75.185",
                "27.17.127.133",
                "183.225.121.16",
                "112.1.65.32",
                "175.147.222.190",
                "183.227.43.68",
                "59.88.168.87",
                "117.28.44.29",
                "117.59.34.167");
        return ips.get(new Random().nextInt(ips.size()));
    }

    /**
     * Generate a random domain.
     */
    private String getDomains() {
        List<String> domains = Arrays.asList("v1.go2yd.com",
                "v2.go2vd.com",
                "v3.go2yd.com",
                "v4.go2yd.com",
                "vmi.go2yd.com");
        return domains.get(new Random().nextInt(domains.size()));
    }

    /**
     * Generate a random traffic value.
     */
    private int getTraffic() {
        return new Random().nextInt(10000);
    }
}
```
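One easy thing to miss: `@Scheduled(fixedRate = ...)` only fires when scheduling is enabled on the application. A minimal entry point, assuming a standard Spring Boot setup (the class name is illustrative):

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

@SpringBootApplication
@EnableScheduling // without this, the @Scheduled send() method never runs
public class MockDataApplication {
    public static void main(String[] args) {
        SpringApplication.run(MockDataApplication.class, args);
    }
}
```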
For the rest of the Spring Boot Kafka configuration, see the companion article on integrating Spring Boot 2 with Kafka.
Open a console consumer on the Kafka server, e.g. `kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic pktest`, and the mock records show up, confirming that the data is reaching Kafka.
Flink Consumer
```java
public class LogAnalysis {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        String topic = "pktest";
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "<public-ip>:9092");
        properties.setProperty("group.id", "test");
        DataStreamSource<String> data = env.addSource(new FlinkKafkaConsumer<>(topic,
                new SimpleStringSchema(), properties));
        data.print().setParallelism(1);
        env.execute("LogAnalysis");
    }
}
```
Messages received:
```
aliyun CN M 2021-01-31 23:43:07 222.104.18.111 v1.go2yd.com 4603
aliyun CN E 2021-01-31 23:43:09 222.104.18.111 v4.go2yd.com 6313
aliyun CN E 2021-01-31 23:43:11 222.104.18.111 v2.go2vd.com 4233
aliyun CN E 2021-01-31 23:43:13 222.104.18.111 v4.go2yd.com 2691
aliyun CN E 2021-01-31 23:43:15 183.225.121.16 v1.go2yd.com 212
aliyun CN E 2021-01-31 23:43:17 183.225.121.16 v4.go2yd.com 7744
aliyun CN M 2021-01-31 23:43:19 175.147.222.190 vmi.go2yd.com 1318
```
Data Cleansing
Data cleansing means applying our business rules to the raw input so that the resulting data meets the needs of the business.
```java
@Slf4j
public class LogAnalysis {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        String topic = "pktest";
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "<public-ip>:9092");
        properties.setProperty("group.id", "test");
        DataStreamSource<String> data = env.addSource(new FlinkKafkaConsumer<>(topic,
                new SimpleStringSchema(), properties));
        data.map(new MapFunction<String, Tuple4<String, Long, String, String>>() {
            @Override
            public Tuple4<String, Long, String, String> map(String value) throws Exception {
                String[] splits = value.split("\t");
                String level = splits[2];
                String timeStr = splits[3];
                Long time = 0L;
                try {
                    time = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(timeStr).getTime();
                } catch (ParseException e) {
                    log.error("time parse error: " + timeStr + ", " + e.getMessage());
                }
                String domain = splits[5];
                String traffic = splits[6];
                return new Tuple4<>(level, time, domain, traffic);
            }
        }).filter(x -> (Long) x.getField(1) != 0)
                // We only need records whose level is E
                .filter(x -> x.getField(0).equals("E"))
                // Drop the level field
                .map(new MapFunction<Tuple4<String, Long, String, String>, Tuple3<Long, String, Long>>() {
                    @Override
                    public Tuple3<Long, String, Long> map(Tuple4<String, Long, String, String> value) throws Exception {
                        return new Tuple3<>(value.getField(1), value.getField(2),
                                Long.parseLong(value.getField(3)));
                    }
                })
                .print().setParallelism(1);
        env.execute("LogAnalysis");
    }
}
```
Run results:
```
(1612130315000,v1.go2yd.com,533)
(1612130319000,v4.go2yd.com,8657)
(1612130321000,vmi.go2yd.com,4353)
(1612130327000,v1.go2yd.com,9566)
(1612130329000,v2.go2vd.com,1460)
(1612130331000,vmi.go2yd.com,1444)
(1612130333000,v3.go2yd.com,6955)
(1612130337000,v1.go2yd.com,9612)
(1612130341000,vmi.go2yd.com,1732)
(1612130345000,v3.go2yd.com,694)
```
Scala code:
```scala
import java.text.SimpleDateFormat
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.slf4j.LoggerFactory

object LogAnalysis {

  val log = LoggerFactory.getLogger(LogAnalysis.getClass)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val topic = "pktest"
    val properties = new Properties
    properties.setProperty("bootstrap.servers", "<public-ip>:9092")
    properties.setProperty("group.id", "test")
    val data = env.addSource(new FlinkKafkaConsumer[String](topic, new SimpleStringSchema, properties))
    data.map(x => {
      val splits = x.split("\t")
      val level = splits(2)
      val timeStr = splits(3)
      var time: Long = 0L
      try {
        time = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(timeStr).getTime
      } catch {
        case e: Exception =>
          log.error(s"time parse error: $timeStr", e)
      }
      val domain = splits(5)
      val traffic = splits(6)
      (level, time, domain, traffic)
    }).filter(_._2 != 0)
      .filter(_._1 == "E")
      .map(x => (x._2, x._3, x._4.toLong))
      .print().setParallelism(1)

    env.execute("LogAnalysis")
  }
}
```
Data Analysis
What we want now is each domain's traffic over one-minute windows. Since we aggregate by the event's own timestamp, the job switches to event time and assigns periodic watermarks that tolerate events arriving up to 10 seconds out of order.
```java
@Slf4j
public class LogAnalysis {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        String topic = "pktest";
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "<public-ip>:9092");
        properties.setProperty("group.id", "test");
        DataStreamSource<String> data = env.addSource(new FlinkKafkaConsumer<>(topic,
                new SimpleStringSchema(), properties));
        data.map(new MapFunction<String, Tuple4<String, Long, String, String>>() {
            @Override
            public Tuple4<String, Long, String, String> map(String value) throws Exception {
                String[] splits = value.split("\t");
                String level = splits[2];
                String timeStr = splits[3];
                Long time = 0L;
                try {
                    time = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(timeStr).getTime();
                } catch (ParseException e) {
                    log.error("time parse error: " + timeStr + ", " + e.getMessage());
                }
                String domain = splits[5];
                String traffic = splits[6];
                return new Tuple4<>(level, time, domain, traffic);
            }
        }).filter(x -> (Long) x.getField(1) != 0)
                // We only need records whose level is E
                .filter(x -> x.getField(0).equals("E"))
                // Drop the level field
                .map(new MapFunction<Tuple4<String, Long, String, String>, Tuple3<Long, String, Long>>() {
                    @Override
                    public Tuple3<Long, String, Long> map(Tuple4<String, Long, String, String> value) throws Exception {
                        return new Tuple3<>(value.getField(1), value.getField(2),
                                Long.parseLong(value.getField(3)));
                    }
                })
                .setParallelism(1)
                .assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Tuple3<Long, String, Long>>() {

                    private Long maxOutOfOrderness = 10000L;
                    private Long currentMaxTimestamp = 0L;

                    @Nullable
                    @Override
                    public Watermark getCurrentWatermark() {
                        // The watermark trails the largest timestamp seen by 10 seconds.
                        return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
                    }

                    @Override
                    public long extractTimestamp(Tuple3<Long, String, Long> element, long previousElementTimestamp) {
                        Long timestamp = element.getField(0);
                        currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp);
                        return timestamp;
                    }
                })
                .keyBy(x -> (String) x.getField(1))
                .timeWindow(Time.minutes(1))
                // Output format: the one-minute window, the domain, and that domain's total traffic in the window
                .apply(new WindowFunction<Tuple3<Long, String, Long>, Tuple3<String, String, Long>, String, TimeWindow>() {
                    @Override
                    public void apply(String s, TimeWindow window, Iterable<Tuple3<Long, String, Long>> input,
                                      Collector<Tuple3<String, String, Long>> out) throws Exception {
                        // Sum the traffic of every element that fell into this window.
                        long sum = 0L;
                        for (Tuple3<Long, String, Long> in : input) {
                            sum += (Long) in.getField(2);
                        }
                        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                        out.collect(new Tuple3<>(format.format(window.getStart()) + " - "
                                + format.format(window.getEnd()), s, sum));
                    }
                })
                .print().setParallelism(1);
        env.execute("LogAnalysis");
    }
}
```
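A quick sanity check on the watermark logic: with `maxOutOfOrderness` of 10 seconds, the watermark always trails the largest event time seen so far by 10 seconds, so a window such as 07:15:00 - 07:16:00 fires only after some event with a timestamp of 07:16:10 or later has arrived. Events that show up at most 10 seconds late are still counted in their proper window.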
Run results:
```
(2021-02-01 07:14:00 - 2021-02-01 07:15:00,vmi.go2yd.com,6307)
(2021-02-01 07:15:00 - 2021-02-01 07:16:00,v4.go2yd.com,15474)
(2021-02-01 07:15:00 - 2021-02-01 07:16:00,v2.go2vd.com,9210)
(2021-02-01 07:15:00 - 2021-02-01 07:16:00,v3.go2yd.com,190)
(2021-02-01 07:15:00 - 2021-02-01 07:16:00,v1.go2yd.com,12787)
(2021-02-01 07:15:00 - 2021-02-01 07:16:00,vmi.go2yd.com,14250)
(2021-02-01 07:16:00 - 2021-02-01 07:17:00,v4.go2yd.com,33298)
(2021-02-01 07:16:00 - 2021-02-01 07:17:00,v1.go2yd.com,37140)
```
Scala code:
```scala
import java.text.SimpleDateFormat
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.util.Collector
import org.slf4j.LoggerFactory

object LogAnalysis {

  val log = LoggerFactory.getLogger(LogAnalysis.getClass)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    val topic = "pktest"
    val properties = new Properties
    properties.setProperty("bootstrap.servers", "<public-ip>:9092")
    properties.setProperty("group.id", "test")
    val data = env.addSource(new FlinkKafkaConsumer[String](topic, new SimpleStringSchema, properties))
    data.map(x => {
      val splits = x.split("\t")
      val level = splits(2)
      val timeStr = splits(3)
      var time: Long = 0L
      try {
        time = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(timeStr).getTime
      } catch {
        case e: Exception =>
          log.error(s"time parse error: $timeStr", e)
      }
      val domain = splits(5)
      val traffic = splits(6)
      (level, time, domain, traffic)
    }).filter(_._2 != 0)
      .filter(_._1 == "E")
      .map(x => (x._2, x._3, x._4.toLong))
      .setParallelism(1)
      .assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[(Long, String, Long)] {

        var maxOutOfOrderness: Long = 10000L
        var currentMaxTimestamp: Long = _

        override def getCurrentWatermark: Watermark = {
          // The watermark trails the largest timestamp seen by 10 seconds.
          new Watermark(currentMaxTimestamp - maxOutOfOrderness)
        }

        override def extractTimestamp(element: (Long, String, Long), previousElementTimestamp: Long): Long = {
          val timestamp = element._1
          currentMaxTimestamp = Math.max(timestamp, currentMaxTimestamp)
          timestamp
        }
      })
      .keyBy(_._2)
      .timeWindow(Time.minutes(1))
      .apply(new WindowFunction[(Long, String, Long), (String, String, Long), String, TimeWindow] {
        override def apply(key: String, window: TimeWindow, input: Iterable[(Long, String, Long)],
                           out: Collector[(String, String, Long)]): Unit = {
          // Sum the traffic of every element that fell into this window.
          val sum = input.map(_._3).sum
          val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
          out.collect((format.format(window.getStart) + " - " + format.format(window.getEnd), key, sum))
        }
      })
      .print().setParallelism(1)

    env.execute("LogAnalysis")
  }
}
```
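The architecture above forwards the aggregates to Elasticsearch for Kibana. As a hedged sketch of that last leg, assuming the flink-connector-elasticsearch7 dependency (the host and the `cdn-traffic` index name are illustrative, not from the source):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.flink.api.common.functions.RuntimeContext;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.connectors.elasticsearch.ElasticsearchSinkFunction;
import org.apache.flink.streaming.connectors.elasticsearch.RequestIndexer;
import org.apache.flink.streaming.connectors.elasticsearch7.ElasticsearchSink;
import org.apache.http.HttpHost;
import org.elasticsearch.action.index.IndexRequest;
import org.elasticsearch.client.Requests;

public class EsSinkFactory {

    // Builds a sink that indexes each (window, domain, traffic) tuple as one document.
    public static ElasticsearchSink<Tuple3<String, String, Long>> create() {
        List<HttpHost> hosts = new ArrayList<>();
        hosts.add(new HttpHost("localhost", 9200, "http")); // illustrative host

        ElasticsearchSink.Builder<Tuple3<String, String, Long>> builder =
                new ElasticsearchSink.Builder<>(hosts,
                        new ElasticsearchSinkFunction<Tuple3<String, String, Long>>() {
                            @Override
                            public void process(Tuple3<String, String, Long> element,
                                                RuntimeContext ctx, RequestIndexer indexer) {
                                Map<String, Object> doc = new HashMap<>();
                                doc.put("window", element.f0);
                                doc.put("domain", element.f1);
                                doc.put("traffic", element.f2);
                                IndexRequest request = Requests.indexRequest()
                                        .index("cdn-traffic") // illustrative index name
                                        .source(doc);
                                indexer.add(request);
                            }
                        });
        builder.setBulkFlushMaxActions(1); // flush every record; fine for a demo
        return builder.build();
    }
}
```

Replacing the final `.print()` with `.addSink(EsSinkFactory.create())` would complete the Kafka ==> Flink ==> ES leg of the pipeline.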