Flink: Side-Output Streams and Stream Splitting

1. Preparing the Data

First, build a DataStream:
case class Sensor(id: String, timestamp: Long, temperature: Double)
The Sensor case class holds a sensor id, a timestamp, and a temperature.
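For reference, the sensors.txt file read below holds comma-separated records; judging from the sample output later in this article, its lines look like:

1,7282164761,0.1
1,7282164762,0.2
1,7282164766,0.6
1,7282164774,1.14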
Build the stream:

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
// flatMap could do the work of both map and filter, but map and filter have clearer semantics: transform and filter are more explicit
val dataFromFile: DataStream[String] = env.readTextFile("E:\\workspace\\flink-scala\\src\\main\\resources\\sensors.txt")
val dataStream: DataStream[Sensor] = dataFromFile.map(data => {
  val array = data.split(",")
  Sensor(array(0).trim, array(1).trim.toLong, array(2).trim.toDouble)
})

2. Using split and select

This approach is marked as deprecated in newer versions of the API.

// split tags each element with one or more labels; select then pulls out a real DataStream per label
// Deprecated: see SideOutPutTest below, which uses a ProcessFunction and a side output instead
val splitStream = dataStream.split(data => {
  if (data.temperature >= 30) {
    Seq("high")
  } else if (data.temperature >= 20 && data.temperature < 30) {
    Seq("mid")
  } else {
    Seq("low")
  }
})
val high = splitStream.select("high")
val mid = splitStream.select("mid")
val low = splitStream.select("low")
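On versions where split has been removed outright, the lightest-weight replacement, short of the side-output approach shown in the next section, is one filter per branch; a minimal sketch against the dataStream built above:

// Each element is checked against every predicate, so the stream is logically
// scanned three times; a ProcessFunction with side outputs routes each element
// exactly once, which is why it is the recommended replacement.
val highStream = dataStream.filter(_.temperature >= 30)
val midStream  = dataStream.filter(d => d.temperature >= 20 && d.temperature < 30)
val lowStream  = dataStream.filter(_.temperature < 20)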

3. Using OutputTag and ProcessFunction

package com.hk.processFunctionTest

import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector

case class Sensor(id: String, timestamp: Long, temperature: Double)
/**
  * Description: uses a ProcessFunction to split the stream, routing abnormal
  * temperatures to a separate side-output stream, similar to what split does.
  *
  * @author heroking
  * @version 1.0.0
  */
object SideOutPutTest {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    // flatMap could do the work of both map and filter, but map and filter have clearer semantics: transform and filter are more explicit
    val dataFromFile: DataStream[String] = env.readTextFile("E:\\workspace\\flink-scala\\src\\main\\resources\\sensors.txt")
    val dataStream: DataStream[Sensor] = dataFromFile.map(data => {
      val array = data.split(",")
      Sensor(array(0).trim, array(1).trim.toLong, array(2).trim.toDouble)
    })
    // the type parameter is the element type the side output will carry
    val tag: OutputTag[Sensor] = OutputTag[Sensor]("hot")
    val result = dataStream
      .process(new HotAlarm(tag))
    
    // retrieve the side-output stream
    val sideOutPut: DataStream[Sensor] = result.getSideOutput(tag)
    sideOutPut.print("side output:")

    result.print("out:")
    env.execute("SideOutPutTest")
  }
}

/**
  * If the temperature is too high, emits the alarm record to the side-output stream.
  * The second type parameter of ProcessFunction is the element type of the main output.
  */
class HotAlarm(alarmOutPutStream: OutputTag[Sensor]) extends ProcessFunction[Sensor, Sensor] {
  override def processElement(sensor: Sensor, context: ProcessFunction[Sensor, Sensor]#Context, collector: Collector[Sensor]): Unit = {
    if (sensor.temperature > 0.5) {
      context.output(alarmOutPutStream, sensor)
    } else {
      collector.collect(sensor)
    }
  }
}

Output:
out:> Sensor(1,7282164761,0.1)
out:> Sensor(1,7282164765,0.5)
side output:> Sensor(1,7282164774,1.14)
out:> Sensor(1,7282164762,0.2)
out:> Sensor(1,7282164763,0.3)
out:> Sensor(1,7282164764,0.4)
side output:> Sensor(1,7282164766,0.6)
side output:> Sensor(1,7282164767,0.7)
side output:> Sensor(1,7282164768,0.8)
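
A single ProcessFunction is not limited to one tag. As a sketch, a hypothetical RangeAlarm variant (the name and the 0.2 lower threshold are made up for illustration) could divert very low readings to a second "cold" side output:

class RangeAlarm(hotTag: OutputTag[Sensor], coldTag: OutputTag[Sensor]) extends ProcessFunction[Sensor, Sensor] {
  override def processElement(sensor: Sensor, context: ProcessFunction[Sensor, Sensor]#Context, collector: Collector[Sensor]): Unit = {
    if (sensor.temperature > 0.5)
      context.output(hotTag, sensor)   // too hot: route to the "hot" side output
    else if (sensor.temperature < 0.2)
      context.output(coldTag, sensor)  // too cold: route to the "cold" side output
    else
      collector.collect(sensor)        // normal range: main output
  }
}

Each side stream is then retrieved separately, e.g. result.getSideOutput(hotTag) and result.getSideOutput(coldTag).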

4. Late Window Data

Stream data that is still late after the window's watermark and allowedLateness have both passed is likewise emitted through a side output, via .sideOutputLateData(outputTag) and result.getSideOutput(outputTag). With allowedLateness, a window first fires when the watermark passes its end and then fires an updated result for each late element arriving within the lateness bound; only elements later than that land in the side output, where you can handle them however you like. Compared with Spark's watermark and late-data mechanism, Flink's is more complete and easier to use.
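The snippet below uses waterMarkDataStream, which is not shown in the original listing; presumably it is the dataStream from section 1 with event time enabled and timestamps/watermarks assigned, roughly like this (a sketch using the legacy BoundedOutOfOrdernessTimestampExtractor from the same API era; the one-second out-of-orderness bound is an assumption):

import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.time.Time

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// accept events up to one second out of order (assumed bound)
val waterMarkDataStream: DataStream[Sensor] = dataStream
  .assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor[Sensor](Time.seconds(1)) {
      // Flink expects timestamps in milliseconds; the sample data looks like seconds
      override def extractTimestamp(sensor: Sensor): Long = sensor.timestamp * 1000
    })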

val outputTag = OutputTag[(String, Double)]("side")
val result = waterMarkDataStream
  .map(data => (data.id, data.temperature))
  .keyBy(_._1)
  .timeWindow(Time.seconds(2))
  .allowedLateness(Time.seconds(2))
  .sideOutputLateData(outputTag)
  .minBy(1)
//.reduce((data1, data2) => (data1._1, data1._2.min(data2._2)))

dataStream.print("in")
val sideOutPutStream = result.getSideOutput(outputTag)
sideOutPutStream.print("late after watermark and allowedLateness:")
result.print("out")

env.execute("TransformTest")