Flink: Side Outputs and Stream Splitting

1. Preparing the Data

First, build a DataStream.

case class Sensor(id: String, timestamp: Long, temperature: Double)

The sensor case class carries a sensor id, a timestamp, and a temperature.

Build the stream:

import org.apache.flink.streaming.api.scala._

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
// flatMap could do the work of map and filter combined, but map and filter have
// clearer semantics: transformation and filtering read more directly.
val dataFromFile: DataStream[String] = env.readTextFile("E:\\workspace\\flink-scala\\src\\main\\resources\\sensors.txt")
val dataStream: DataStream[Sensor] = dataFromFile.map(data => {
  val array = data.split(",")
  Sensor(array(0).trim, array(1).trim.toLong, array(2).trim.toDouble)
})
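
For reference, sensors.txt is a plain CSV file of id, timestamp, temperature. The exact contents are not shown in the original post; inferring from the output in section 3, its lines would look something like:

1,7282164761,0.1
1,7282164762,0.2
1,7282164763,0.3
1,7282164774,1.14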

2. Splitting with split and select

This approach has been marked as deprecated in newer versions of the API.

// split tags each element with one or more labels; select then materializes a new
// DataStream for a given label, which is what actually separates the streams.
// Deprecated: see SideOutPutTest below, which implements side outputs with a ProcessFunction.
val splitStream = dataStream.split(data => {
  if (data.temperature >= 30) {
    Seq("high")
  } else if (data.temperature >= 20) {
    Seq("mid")
  } else {
    Seq("low")
  }
})
val high = splitStream.select("high")
val mid = splitStream.select("mid")
val low = splitStream.select("low")
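
For a simple case like this, you can also split a stream with plain filter transformations, at the cost of evaluating a predicate once per output stream. A minimal sketch (my own variant, not from the original post):

// Each filter produces an independent stream; the predicates mirror the split above.
val highF: DataStream[Sensor] = dataStream.filter(_.temperature >= 30)
val midF: DataStream[Sensor] = dataStream.filter(s => s.temperature >= 20 && s.temperature < 30)
val lowF: DataStream[Sensor] = dataStream.filter(_.temperature < 20)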

3. Splitting with OutputTag and ProcessFunction

package com.hk.processFunctionTest

import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

case class Sensor(id: String, timestamp: Long, temperature: Double)

/**
  * Description: uses a ProcessFunction to implement stream splitting; abnormal
  * temperatures go to a separate stream, similar to what split used to do.
  *
  * @author heroking
  * @version 1.0.0
  */
object SideOutPutTest {
  def main(args: Array[String]) {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    val dataFromFile: DataStream[String] = env.readTextFile("E:\\workspace\\flink-scala\\src\\main\\resources\\sensors.txt")
    val dataStream: DataStream[Sensor] = dataFromFile.map(data => {
      val array = data.split(",")
      Sensor(array(0).trim, array(1).trim.toLong, array(2).trim.toDouble)
    })
    // The type parameter is the type of data the side output will carry.
    val tag: OutputTag[Sensor] = new OutputTag[Sensor]("hot")
    val result = dataStream
      .process(new HotAlarm(tag))

    // Retrieve the side output stream.
    val sideOutPut: DataStream[Sensor] = result.getSideOutput(tag)
    sideOutPut.print("side output:")

    result.print("out:")
    env.execute("TransformTest")
  }
}

/**
  * Emits an alarm record to the side output when the temperature is too high.
  * The second type parameter of ProcessFunction is the main output's element type.
  */
class HotAlarm(alarmOutPutStream: OutputTag[Sensor]) extends ProcessFunction[Sensor, Sensor] {
  override def processElement(sensor: Sensor, context: ProcessFunction[Sensor, Sensor]#Context, collector: Collector[Sensor]): Unit = {
    if (sensor.temperature > 0.5) {
      context.output(alarmOutPutStream, sensor)
    } else {
      collector.collect(sensor)
    }
  }
}

Output:
out:> Sensor(1,7282164761,0.1)
out:> Sensor(1,7282164765,0.5)
side output:> Sensor(1,7282164774,1.14)
out:> Sensor(1,7282164762,0.2)
out:> Sensor(1,7282164763,0.3)
out:> Sensor(1,7282164764,0.4)
side output:> Sensor(1,7282164766,0.6)
side output:> Sensor(1,7282164767,0.7)
side output:> Sensor(1,7282164768,0.8)
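
One advantage over split worth noting: a side output's element type is set by its OutputTag and is independent of the main stream's type. As a sketch (this String-typed variant is my own illustration, not part of the original code), the alarm could carry a formatted message instead of the Sensor itself:

// Hypothetical variant of HotAlarm whose side output carries Strings.
class HotAlarmMsg(alarmTag: OutputTag[String]) extends ProcessFunction[Sensor, Sensor] {
  override def processElement(sensor: Sensor,
                              context: ProcessFunction[Sensor, Sensor]#Context,
                              collector: Collector[Sensor]): Unit = {
    if (sensor.temperature > 0.5) {
      // Route a human-readable alarm to the String-typed side output.
      context.output(alarmTag, s"hot alarm: sensor ${sensor.id} reads ${sensor.temperature}")
    } else {
      collector.collect(sensor) // normal readings stay on the main output
    }
  }
}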

4. Late Window Data

Elements that arrive even after the window's watermark and the allowedLateness period are likewise emitted through the side-output mechanism, via .sideOutputLateData(outputTag) on the window and result.getSideOutput(outputTag) afterwards. Once you have this stream, you can handle the late records however you like. Compared with Spark's watermark and late-data mechanism, Flink's is more complete and easier to use.

val outputTag = new OutputTag[(String, Double)]("side")
val result = waterMarkDataStream
  .map(data => (data.id, data.temperature))
  .keyBy(_._1)
  .timeWindow(Time.seconds(2))
  .allowedLateness(Time.seconds(2))
  .sideOutputLateData(outputTag)
  .minBy(1)
// Alternative: .reduce((data1, data2) => (data1._1, data1._2.min(data2._2)))

dataStream.print("in")
val sideOutPutStream = result.getSideOutput(outputTag)
sideOutPutStream.print("late after watermark and allowedLateness:")
result.print("out")

env.execute("TransformTest")
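
The snippet above assumes waterMarkDataStream already has event-time timestamps and watermarks assigned. A minimal sketch of how it might be built with the same old-style API used in this post (the one-second out-of-orderness bound and the seconds-to-milliseconds conversion are assumptions):

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.time.Time

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

// Assumed: Sensor.timestamp is in seconds, so multiply to get epoch milliseconds.
val waterMarkDataStream: DataStream[Sensor] = dataStream
  .assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor[Sensor](Time.seconds(1)) {
      override def extractTimestamp(element: Sensor): Long = element.timestamp * 1000
    })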