End-to-End Flow of a Simple Flink Project

Project Overview

Log analysis for a CDN (content delivery network). Each log record contains the following fields (sample values below; the field meanings can be inferred from the mock data generator further down):

aliyun (CDN vendor)
CN (region)
E (log level)
[17/Jul/2018:17:07:50 +0800] (access time)
223.104.18.110 (client IP)
v2.go2yd.com (domain)
17168 (traffic)

The ingested data is these log records.

Offline: Flume ==> HDFS

Real-time: Kafka ==> stream processing engine ==> Elasticsearch ==> Kibana

Data Queries

Interface name: Aggregate statistics query

Function description: peak bandwidth, total traffic, total request count

Project Features

  1. Compute the traffic generated by each domain within one minute. Flink consumes the data from Kafka and processes it.
  2. Compute the traffic generated by each user within one minute. Domains and users have a mapping relationship, so Flink consumes the Kafka data and joins it with the domain-to-user configuration data stored in MySQL (see the sketch after this list).
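
For the user-level statistics in item 2, the domain-to-user mapping lives in MySQL. This section only implements the per-domain statistics, but as a preview, a minimal sketch of how such a mapping could be read into Flink as a stream is shown below. The JDBC URL, credentials, the table user_domain_config and its columns are hypothetical placeholders, not the project's actual schema.

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.source.RichSourceFunction;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class MySQLSource extends RichSourceFunction<Tuple2<String, String>> {
    private volatile boolean running = true;
    private Connection connection;
    private PreparedStatement statement;

    @Override
    public void open(Configuration parameters) throws Exception {
        // placeholder connection settings and table name
        connection = DriverManager.getConnection("jdbc:mysql://localhost:3306/flink", "root", "password");
        statement = connection.prepareStatement("select user_id, domain from user_domain_config");
    }

    @Override
    public void run(SourceContext<Tuple2<String, String>> ctx) throws Exception {
        while (running) {
            ResultSet rs = statement.executeQuery();
            while (rs.next()) {
                // emit (domain, user_id) so the log stream can later be joined by domain
                ctx.collect(Tuple2.of(rs.getString("domain"), rs.getString("user_id")));
            }
            rs.close();
            Thread.sleep(60 * 1000); // reload the mapping once a minute
        }
    }

    @Override
    public void cancel() {
        running = false;
    }

    @Override
    public void close() throws Exception {
        if (statement != null) statement.close();
        if (connection != null) connection.close();
    }
}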

Project Architecture

Mock Data

import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.support.SendResult;
import org.springframework.lang.Nullable;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;
import org.springframework.util.concurrent.FailureCallback;
import org.springframework.util.concurrent.ListenableFuture;
import org.springframework.util.concurrent.SuccessCallback;

import java.text.SimpleDateFormat;
import java.util.Arrays;
import java.util.Date;
import java.util.List;
import java.util.Random;

@Component
@Slf4j
public class KafkaProducer {
    private static final String TOPIC = "pktest";
    @Autowired
    private KafkaTemplate<String,String> kafkaTemplate;

    public void produce(String message) {
        try {
            ListenableFuture<SendResult<String, String>> future = kafkaTemplate.send(TOPIC, message);
            SuccessCallback<SendResult<String,String>> successCallback = new SuccessCallback<SendResult<String, String>>() {
                @Override
                public void onSuccess(@Nullable SendResult<String, String> result) {
                    log.info("發送消息成功");
                }
            };
            FailureCallback failureCallback = new FailureCallback() {
                @Override
                public void onFailure(Throwable ex) {
                    log.error("發送消息失敗",ex);
                    produce(message);
                }
            };
            future.addCallback(successCallback,failureCallback);
        } catch (Exception e) {
            log.error("發送消息異常",e);
        }
    }

    @Scheduled(fixedRate = 1000 * 2)
    public void send() {
        StringBuilder builder = new StringBuilder();
        builder.append("aliyun").append("\t")
                .append("CN").append("\t")
                .append(getLevels()).append("\t")
                .append(new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
                        .format(new Date())).append("\t")
                .append(getIps()).append("\t")
                .append(getDomains()).append("\t")
                .append(getTraffic()).append("\t");
        log.info(builder.toString());
        produce(builder.toString());
    }

    /**
     * Generate a random log level.
     * @return "M" or "E"
     */
    private String getLevels() {
        List<String> levels = Arrays.asList("M","E");
        return levels.get(new Random().nextInt(levels.size()));
    }

    /**
     * Generate a random client IP.
     * @return one of a fixed set of sample IPs
     */
    private String getIps() {
        List<String> ips = Arrays.asList("222.104.18.111",
                "223.101.75.185",
                "27.17.127.133",
                "183.225.121.16",
                "112.1.65.32",
                "175.147.222.190",
                "183.227.43.68",
                "59.88.168.87",
                "117.28.44.29",
                "117.59.34.167");
        return ips.get(new Random().nextInt(ips.size()));
    }

    /**
     * Generate a random domain.
     * @return one of a fixed set of sample domains
     */
    private String getDomains() {
        List<String> domains = Arrays.asList("v1.go2yd.com",
                "v2.go2vd.com",
                "v3.go2yd.com",
                "v4.go2yd.com",
                "vmi.go2yd.com");
        return domains.get(new Random().nextInt(domains.size()));
    }

    /**
     * Generate a random traffic value.
     * @return a random value in [0, 10000)
     */
    private int getTraffic() {
        return new Random().nextInt(10000);
    }
}
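
Note that the @Scheduled send() method only fires if scheduling is enabled in the Spring Boot application. A minimal sketch of the bootstrap class (the class name is a placeholder, not from the original project):

import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.scheduling.annotation.EnableScheduling;

@SpringBootApplication
@EnableScheduling // required for the @Scheduled mock-data task to run
public class MockDataApplication {
    public static void main(String[] args) {
        SpringApplication.run(MockDataApplication.class, args);
    }
}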

For the rest of the Spring Boot Kafka setup, see the article "Spring Boot 2 integration with Kafka".
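
For reference, a minimal producer configuration in application.properties could look like the following (the broker address is a placeholder; the serializers match the String/String KafkaTemplate used above):

spring.kafka.bootstrap-servers=<public ip>:9092
spring.kafka.producer.key-serializer=org.apache.kafka.common.serialization.StringSerializer
spring.kafka.producer.value-serializer=org.apache.kafka.common.serialization.StringSerializer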

Start a console consumer on the Kafka server and you can see the mock messages arriving, which shows that the data is being sent to Kafka successfully.
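
For example, a console consumer can be attached to the topic with the standard Kafka CLI (the broker address is a placeholder):

kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic pktest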

Flink Consumer

import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.util.Properties;

public class LogAnalysis {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        String topic = "pktest";
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers","外網ip:9092");
        properties.setProperty("group.id","test");
        DataStreamSource<String> data = env.addSource(new FlinkKafkaConsumer<>(topic,
                new SimpleStringSchema(), properties));
        data.print().setParallelism(1);
        env.execute("LogAnalysis");
    }
}
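
Running this consumer requires the Flink Kafka connector on the classpath. A sketch of the Maven dependency, assuming a Flink build that still uses Scala-suffixed artifacts (adjust the artifact name and version to your Flink and Scala versions):

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_2.11</artifactId>
    <version>${flink.version}</version>
</dependency>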

Messages received:

aliyun	CN	M	2021-01-31 23:43:07	222.104.18.111	v1.go2yd.com	4603	
aliyun	CN	E	2021-01-31 23:43:09	222.104.18.111	v4.go2yd.com	6313	
aliyun	CN	E	2021-01-31 23:43:11	222.104.18.111	v2.go2vd.com	4233	
aliyun	CN	E	2021-01-31 23:43:13	222.104.18.111	v4.go2yd.com	2691	
aliyun	CN	E	2021-01-31 23:43:15	183.225.121.16	v1.go2yd.com	212	
aliyun	CN	E	2021-01-31 23:43:17	183.225.121.16	v4.go2yd.com	7744	
aliyun	CN	M	2021-01-31 23:43:19	175.147.222.190	vmi.go2yd.com	1318

Data Cleansing

Data cleansing means applying our business rules to the raw input so that the data meets our business requirements.

import lombok.extern.slf4j.Slf4j;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Properties;

@Slf4j
public class LogAnalysis {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        String topic = "pktest";
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers","外網ip:9092");
        properties.setProperty("group.id","test");
        DataStreamSource<String> data = env.addSource(new FlinkKafkaConsumer<>(topic,
                new SimpleStringSchema(), properties));
        data.map(new MapFunction<String, Tuple4<String, Long, String, String>>() {
            @Override
            public Tuple4<String, Long, String, String> map(String value) throws Exception {
                String[] splits = value.split("\t");
                String level = splits[2];
                String timeStr = splits[3];
                Long time = 0L;
                try {
                    time = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(timeStr).getTime();
                } catch (ParseException e) {
                    log.error("time轉換錯誤:" + timeStr + "," + e.getMessage());
                }
                String domain = splits[5];
                String traffic = splits[6];
                return new Tuple4<>(level,time,domain,traffic);
            }
        }).filter(x -> (Long) x.getField(1) != 0)
          // we only need records whose level is E
          .filter(x -> x.getField(0).equals("E"))
          // drop the level field
          .map(new MapFunction<Tuple4<String,Long,String,String>, Tuple3<Long,String,Long>>() {
              @Override
              public Tuple3<Long, String, Long> map(Tuple4<String, Long, String, String> value) throws Exception {
                  return new Tuple3<>(value.getField(1),value.getField(2),Long.parseLong(value.getField(3)));
              }
          })
          .print().setParallelism(1);
        env.execute("LogAnalysis");
    }
}

Run results:

(1612130315000,v1.go2yd.com,533)
(1612130319000,v4.go2yd.com,8657)
(1612130321000,vmi.go2yd.com,4353)
(1612130327000,v1.go2yd.com,9566)
(1612130329000,v2.go2vd.com,1460)
(1612130331000,vmi.go2yd.com,1444)
(1612130333000,v3.go2yd.com,6955)
(1612130337000,v1.go2yd.com,9612)
(1612130341000,vmi.go2yd.com,1732)
(1612130345000,v3.go2yd.com,694)

Scala code

import java.text.SimpleDateFormat
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.slf4j.LoggerFactory
import org.apache.flink.api.scala._

object LogAnalysis {
  val log = LoggerFactory.getLogger(LogAnalysis.getClass)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val topic = "pktest"
    val properties = new Properties
    properties.setProperty("bootstrap.servers", "外網ip:9092")
    properties.setProperty("group.id","test")
    val data = env.addSource(new FlinkKafkaConsumer[String](topic, new SimpleStringSchema, properties))
    data.map(x => {
      val splits = x.split("\t")
      val level = splits(2)
      val timeStr = splits(3)
      var time: Long = 0L
      try {
        time = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(timeStr).getTime
      }catch {
        case e: Exception => {
          log.error(s"time轉換錯誤: $timeStr",e.getMessage)
        }
      }
      val domain = splits(5)
      val traffic = splits(6)
      (level,time,domain,traffic)
    }).filter(_._2 != 0)
      .filter(_._1 == "E")
      .map(x => (x._2,x._3,x._4.toLong))
      .print().setParallelism(1)
    env.execute("LogAnalysis")
  }
}

Data Analysis

Now we want to compute, per domain, the traffic generated within each one-minute window.

import lombok.extern.slf4j.Slf4j;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple4;
import org.apache.flink.streaming.api.TimeCharacteristic;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks;
import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
import org.apache.flink.streaming.api.watermark.Watermark;
import org.apache.flink.streaming.api.windowing.time.Time;
import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;

import javax.annotation.Nullable;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Properties;

@Slf4j
public class LogAnalysis {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);
        String topic = "pktest";
        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers","外網ip:9092");
        properties.setProperty("group.id","test");
        DataStreamSource<String> data = env.addSource(new FlinkKafkaConsumer<>(topic,
                new SimpleStringSchema(), properties));
        data.map(new MapFunction<String, Tuple4<String, Long, String, String>>() {
            @Override
            public Tuple4<String, Long, String, String> map(String value) throws Exception {
                String[] splits = value.split("\t");
                String level = splits[2];
                String timeStr = splits[3];
                Long time = 0L;
                try {
                    time = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(timeStr).getTime();
                } catch (ParseException e) {
                    log.error("time轉換錯誤:" + timeStr + "," + e.getMessage());
                }
                String domain = splits[5];
                String traffic = splits[6];
                return new Tuple4<>(level,time,domain,traffic);
            }
        }).filter(x -> (Long) x.getField(1) != 0)
          // we only need records whose level is E
          .filter(x -> x.getField(0).equals("E"))
          // drop the level field
          .map(new MapFunction<Tuple4<String,Long,String,String>, Tuple3<Long,String,Long>>() {
              @Override
              public Tuple3<Long, String, Long> map(Tuple4<String, Long, String, String> value) throws Exception {
                  return new Tuple3<>(value.getField(1),value.getField(2),Long.parseLong(value.getField(3)));
              }
          })
          .setParallelism(1).assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks<Tuple3<Long, String, Long>>() {
            private Long maxOutOfOrderness = 10000L;
            private Long currentMaxTimestamp = 0L;

            @Nullable
            @Override
            public Watermark getCurrentWatermark() {
                return new Watermark(currentMaxTimestamp - maxOutOfOrderness);
            }

            @Override
            public long extractTimestamp(Tuple3<Long, String, Long> element, long previousElementTimestamp) {
                Long timestamp = element.getField(0);
                currentMaxTimestamp = Math.max(timestamp,currentMaxTimestamp);
                return timestamp;
            }
        }).keyBy(x -> (String) x.getField(1))
          .timeWindow(Time.minutes(1))
          // output format: one-minute window range, domain, total traffic for that domain in the window
          .apply(new WindowFunction<Tuple3<Long,String,Long>, Tuple3<String,String,Long>, String, TimeWindow>() {
              @Override
              public void apply(String s, TimeWindow window, Iterable<Tuple3<Long, String, Long>> input, Collector<Tuple3<String, String, Long>> out) throws Exception {
                  List<Tuple3<Long,String,Long>> list = (List) input;
                  Long sum = list.stream().map(x -> (Long) x.getField(2)).reduce((x, y) -> x + y).get();
                  SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
                  out.collect(new Tuple3<>(format.format(window.getStart()) + " - " + format.format(window.getEnd()),s,sum));
              }
          })
          .print().setParallelism(1);
        env.execute("LogAnalysis");
    }
}

Run results:

(2021-02-01 07:14:00 - 2021-02-01 07:15:00,vmi.go2yd.com,6307)
(2021-02-01 07:15:00 - 2021-02-01 07:16:00,v4.go2yd.com,15474)
(2021-02-01 07:15:00 - 2021-02-01 07:16:00,v2.go2vd.com,9210)
(2021-02-01 07:15:00 - 2021-02-01 07:16:00,v3.go2yd.com,190)
(2021-02-01 07:15:00 - 2021-02-01 07:16:00,v1.go2yd.com,12787)
(2021-02-01 07:15:00 - 2021-02-01 07:16:00,vmi.go2yd.com,14250)
(2021-02-01 07:16:00 - 2021-02-01 07:17:00,v4.go2yd.com,33298)
(2021-02-01 07:16:00 - 2021-02-01 07:17:00,v1.go2yd.com,37140)

Scala code

import java.text.SimpleDateFormat
import java.util.Properties

import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.slf4j.LoggerFactory
import org.apache.flink.api.scala._
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.AssignerWithPeriodicWatermarks
import org.apache.flink.streaming.api.scala.function.WindowFunction
import org.apache.flink.streaming.api.watermark.Watermark
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector

object LogAnalysis {
  val log = LoggerFactory.getLogger(LogAnalysis.getClass)

  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
    val topic = "pktest"
    val properties = new Properties
    properties.setProperty("bootstrap.servers", "外網ip:9092")
    properties.setProperty("group.id","test")
    val data = env.addSource(new FlinkKafkaConsumer[String](topic, new SimpleStringSchema, properties))
    data.map(x => {
      val splits = x.split("\t")
      val level = splits(2)
      val timeStr = splits(3)
      var time: Long = 0L
      try {
        time = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(timeStr).getTime
      }catch {
        case e: Exception => {
          log.error(s"time轉換錯誤: $timeStr",e.getMessage)
        }
      }
      val domain = splits(5)
      val traffic = splits(6)
      (level,time,domain,traffic)
    }).filter(_._2 != 0)
      .filter(_._1 == "E")
      .map(x => (x._2,x._3,x._4.toLong))
      .setParallelism(1).assignTimestampsAndWatermarks(new AssignerWithPeriodicWatermarks[(Long, String, Long)] {
      var maxOutOfOrderness: Long = 10000L
      var currentMaxTimestamp: Long = _

      override def getCurrentWatermark: Watermark = {
        new Watermark(currentMaxTimestamp - maxOutOfOrderness)
      }

      override def extractTimestamp(element: (Long, String, Long), previousElementTimestamp: Long): Long = {
        val timestamp = element._1
        currentMaxTimestamp = Math.max(timestamp,currentMaxTimestamp)
        timestamp
      }
    }).keyBy(_._2)
      .timeWindow(Time.minutes(1))
      .apply(new WindowFunction[(Long,String,Long),(String,String,Long),String,TimeWindow] {
          override def apply(key: String, window: TimeWindow, input: Iterable[(Long, String, Long)], out: Collector[(String, String, Long)]): Unit = {
            val list = input.toList
            val sum = list.map(_._3).sum
            val format = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
            out.collect((format.format(window.getStart) + " - " + format.format(window.getEnd),key,sum))
          }
      })
      .print().setParallelism(1)
    env.execute("LogAnalysis")
  }
}
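
As a side note, on newer Flink releases (1.11 and later) AssignerWithPeriodicWatermarks is deprecated; the same bounded-out-of-orderness watermark can be expressed with WatermarkStrategy. A minimal sketch for the Java Tuple3 stream above (not part of the original code):

import java.time.Duration;

import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.java.tuple.Tuple3;

// bounded out-of-orderness of 10 seconds, event time taken from the first tuple field
WatermarkStrategy<Tuple3<Long, String, Long>> strategy =
        WatermarkStrategy
                .<Tuple3<Long, String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(10))
                .withTimestampAssigner((element, recordTimestamp) -> element.f0);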