1. Preparing the data
First, build a DataStream.
case class Sensor(id: String, timestamp: Long, temperature: Double)
The Sensor case class holds the sensor id, timestamp, and temperature.
Build the stream:
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//flatMap could do the work of both map and filter, but map and filter have clearer semantics: transformation and filtering read more directly
val dataFromFile: DataStream[String] = env.readTextFile("E:\\workspace\\flink-scala\\src\\main\\resources\\sensors.txt")
val dataStream: DataStream[Sensor] = dataFromFile.map(data => {
val array = data.split(",")
Sensor(array(0).trim, array(1).trim.toLong, array(2).trim.toDouble)
})
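The parsing above can be exercised without running a Flink job. A minimal sketch, assuming input lines follow the `id, timestamp, temperature` format of sensors.txt (the sample values here are made up):

```scala
// Same shape as the case class used in the Flink job
case class Sensor(id: String, timestamp: Long, temperature: Double)

// Same parsing logic as the map function above
def parse(line: String): Sensor = {
  val array = line.split(",")
  Sensor(array(0).trim, array(1).trim.toLong, array(2).trim.toDouble)
}

// Hypothetical input line
println(parse("sensor_1, 7282164761, 35.5")) // Sensor(sensor_1,7282164761,35.5)
```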
2. Using split and select
This approach is marked as deprecated in newer versions of the API.
//split tags records with different labels; select then actually separates them out into new DataStreams
//Deprecated; see SideOutPutTest below, which implements side outputs with a ProcessFunction instead
val splitStream = dataStream.split(data => {
  if (data.temperature >= 30) Seq("high")
  else if (data.temperature >= 20) Seq("mid")
  else Seq("low")
})
val high = splitStream.select("high")
val mid = splitStream.select("mid")
val low = splitStream.select("low")
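The banding logic can be sanity-checked with plain Scala collections before wiring it into a stream; a sketch with made-up readings:

```scala
case class Sensor(id: String, timestamp: Long, temperature: Double)

// Same banding as the split function above
def band(s: Sensor): String =
  if (s.temperature >= 30) "high"
  else if (s.temperature >= 20) "mid"
  else "low"

// Hypothetical readings
val readings = Seq(Sensor("1", 1L, 35.0), Sensor("2", 2L, 25.0), Sensor("3", 3L, 10.0))
val grouped = readings.groupBy(band)
println(grouped("high").map(_.id)) // List(1)
println(grouped("mid").map(_.id))  // List(2)
```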
3. Using OutputTag and ProcessFunction
package com.hk.processFunctionTest
import org.apache.flink.streaming.api.functions.ProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.util.Collector
case class Sensor(id: String, timestamp: Long, temperature: Double)
/**
 * Description: splits a stream with a ProcessFunction; abnormal temperatures go to a separate stream, similar to split
 *
 * @author heroking
 * @version 1.0.0
 */
object SideOutPutTest {
def main(args: Array[String]) {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//flatMap could do the work of both map and filter, but map and filter have clearer semantics: transformation and filtering read more directly
val dataFromFile: DataStream[String] = env.readTextFile("E:\\workspace\\flink-scala\\src\\main\\resources\\sensors.txt")
val dataStream: DataStream[Sensor] = dataFromFile.map(data => {
val array = data.split(",")
new Sensor(array(0).trim, array(1).trim.toLong, array(2).trim.toDouble)
})
//the type parameter is the type of records emitted to the side output stream
val tag: OutputTag[Sensor] = new OutputTag[Sensor]("hot")
val result = dataStream
.process(new HotAlarm(tag))
//retrieve the side output stream
val sideOutPut: DataStream[Sensor] = result.getSideOutput(tag)
sideOutPut.print("side output:")
result.print("out:")
env.execute("TransformTest")
}
}
//the second type parameter is the type of records emitted to the main output stream
/**
 * If the temperature is too high, emit the record to the side output stream as an alert
 */
class HotAlarm(alarmOutPutStream: OutputTag[Sensor]) extends ProcessFunction[Sensor, Sensor] {
override def processElement(sensor: Sensor, context: ProcessFunction[Sensor, Sensor]#Context, collector: Collector[Sensor]): Unit = {
if (sensor.temperature > 0.5) {
context.output(alarmOutPutStream, sensor)
} else {
collector.collect(sensor)
}
}
}
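The routing decision in HotAlarm (threshold > 0.5) can be mimicked with partition to check which records land in each stream; the readings below are taken from the sample output:

```scala
case class Sensor(id: String, timestamp: Long, temperature: Double)

val readings = Seq(
  Sensor("1", 7282164761L, 0.1),
  Sensor("1", 7282164774L, 1.14),
  Sensor("1", 7282164766L, 0.6))

// Same threshold as HotAlarm: > 0.5 is routed to the side output
val (side, main) = readings.partition(_.temperature > 0.5)
println(main.map(_.temperature)) // List(0.1)
println(side.map(_.temperature)) // List(1.14, 0.6)
```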
Output:
out:> Sensor(1,7282164761,0.1)
out:> Sensor(1,7282164765,0.5)
side output:> Sensor(1,7282164774,1.14)
out:> Sensor(1,7282164762,0.2)
out:> Sensor(1,7282164763,0.3)
out:> Sensor(1,7282164764,0.4)
side output:> Sensor(1,7282164766,0.6)
side output:> Sensor(1,7282164767,0.7)
side output:> Sensor(1,7282164768,0.8)
4. Late window data
Stream records that are still late after the window's watermark and allowedLateness have passed are likewise emitted through the side-output mechanism, via .sideOutputLateData(outputTag) and result.getSideOutput(outputTag). Once users get hold of this data they can process it themselves; compared with Spark's watermark and late-data handling, Flink's mechanism is more complete and easier to use.
val outputTag = new OutputTag[(String, Double)]("side")
val result = waterMarkDataStream
.map(data => (data.id, data.temperature))
.keyBy(_._1)
.timeWindow(Time.seconds(2))
.allowedLateness(Time.seconds(2))
.sideOutputLateData(outputTag)
.minBy(1)
//.reduce((data1, data2) => (data1._1, data1._2.min(data2._2)))
dataStream.print("in")
val sideOutPutStream = result.getSideOutput(outputTag)
sideOutPutStream.print("still late after watermark and allowedLateness:")
result.print("out")
env.execute("TransformTest")
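Simplified, the rule is that a window ending at windowEnd still accepts updates while the watermark has not passed windowEnd + allowedLateness; after that, a late record is only reachable via the side output. A minimal sketch of that condition, with made-up timestamps:

```scala
// Simplified model: a record for a window ending at windowEnd is routed to the
// side output once the watermark has passed windowEnd + allowedLateness
def goesToSideOutput(windowEnd: Long, allowedLateness: Long, watermark: Long): Boolean =
  watermark >= windowEnd + allowedLateness

println(goesToSideOutput(2000L, 2000L, 3500L)) // false: the window can still fire an update
println(goesToSideOutput(2000L, 2000L, 4000L)) // true: only reachable via getSideOutput
```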