Common Flink Operators and Windows

Flink Window Types and Common Operators

Flink provides the following kinds of windows:

Tumbling Windows

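A tumbling window has a fixed length and its slide interval equals the window length, so windows never overlap and each element belongs to exactly one window.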

// tumbling event-time windows
input
    .keyBy(<key selector>)
    .window(TumblingEventTimeWindows.of(Time.seconds(5)))
    .<windowed transformation>(<window function>)
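
The way a timestamp maps to a tumbling window can be sketched in plain Scala, without any Flink dependency. This is only an illustration of the alignment arithmetic, assuming windows aligned to the epoch with zero offset and non-negative timestamps; the object and method names are hypothetical and not part of Flink.

```scala
object TumblingWindowSketch {
  // Start of the tumbling window that contains timestampMs.
  // Assumption: windows are aligned to the epoch (offset 0), timestamps are >= 0.
  def windowStart(timestampMs: Long, sizeMs: Long): Long =
    timestampMs - (timestampMs % sizeMs)

  // The window as a half-open interval [start, end).
  def window(timestampMs: Long, sizeMs: Long): (Long, Long) = {
    val start = windowStart(timestampMs, sizeMs)
    (start, start + sizeMs)
  }
}
```

With a 5-second window, an event at 12 000 ms falls into the window [10 000, 15 000) and into no other window.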

Sliding Windows

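A sliding window has a fixed length. When the window length is greater than the slide interval, consecutive windows overlap, so one element can belong to several windows.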

// sliding event-time windows
input
    .keyBy(<key selector>)
    .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
    .<windowed transformation>(<window function>)
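
The overlap can be made concrete with a small plain-Scala sketch that lists every sliding window a timestamp falls into. It assumes epoch-aligned windows with zero offset and non-negative timestamps; the names are hypothetical and the code does not use Flink.

```scala
object SlidingWindowSketch {
  // All sliding windows [start, start + size) that contain ts, newest first.
  // Assumption: windows are aligned to the epoch (offset 0), ts is >= 0.
  def windows(ts: Long, size: Long, slide: Long): List[(Long, Long)] = {
    val lastStart = ts - (ts % slide) // latest window start not after ts
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(start => start > ts - size) // window must still cover ts
      .map(start => (start, start + size))
      .toList
  }
}
```

With a 10-second window sliding every 5 seconds, an event at 12 000 ms belongs to two windows: [10 000, 20 000) and [5 000, 15 000).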

Global Windows

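A global window assigns all elements with the same key to one single window. By default this window never closes and never fires, because its default trigger (NeverTrigger) never fires; you therefore have to supply a custom Trigger.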

input
    .keyBy(<key selector>)
    .window(GlobalWindows.create())
    .<windowed transformation>(<window function>)

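Session Windows (merging windows)

Session windows are built from the time gap between elements: if the gap to the previous element is smaller than the session gap, the element is merged into the same window; if it is larger, the current window closes and later elements belong to a new window. Unlike tumbling and sliding windows, session windows have no fixed size; under the hood they are implemented by merging windows.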

// event-time session windows with static gap
input
    .keyBy(<key selector>)
    .window(EventTimeSessionWindows.withGap(Time.minutes(10)))
    .<windowed transformation>(<window function>)
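
The grouping rule described above can be sketched in plain Scala. This is a simplified illustration, not Flink's implementation: it works on a finished list of timestamps rather than a stream, the names are hypothetical, and it treats a gap exactly equal to the session gap as starting a new session, which is an assumption because the text only covers the strictly smaller and strictly larger cases.

```scala
object SessionWindowSketch {
  // Groups timestamps into sessions: a timestamp joins the current session
  // when its distance to the previous timestamp is smaller than `gap`.
  // Assumption: a distance equal to `gap` starts a new session.
  def sessions(timestamps: Seq[Long], gap: Long): List[List[Long]] =
    timestamps.sorted.foldLeft(List.empty[List[Long]]) {
      case (current :: done, ts) if ts - current.head < gap =>
        (ts :: current) :: done // extend the open session
      case (acc, ts) =>
        List(ts) :: acc         // open a new session
    }.map(_.reverse).reverse
}
```

For the timestamps 1, 2, 10, 11, 30 and a gap of 5, the sketch produces three sessions: (1, 2), (10, 11) and (30).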

Event Time

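When computing windows, Flink supports the following notions of time: processing time, event time, and ingestion time.

Processing time: windows are computed from the system clock of the node that processes the element. This is the default in the older Flink releases this article targets; since Flink 1.12 the default is event time.

Event time: windows are computed from the time at which each event was produced, which gives the most accurate results.

Ingestion time: the time at which data enters Flink, usually assigned by the SourceFunction.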

Common Operators

map

map is a one-to-one mapping: it applies a transformation to each element and emits exactly one element in its place.

import org.apache.flink.streaming.api.scala._

object MapOperator {
  def main(args: Array[String]): Unit = {
    // get the execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // build the input stream from a fixed set of elements
    env.fromElements(Tuple1("flink"), Tuple1("spark"), Tuple1("hadoop"))
      // prefix every element with "hello " (note the trailing space)
      .map("hello " + _._1)
      .print()
    env.execute("flink map operator")
  }
}

Output:
3> hello hadoop
2> hello spark
1> hello flink

flatMap

flatMap flattens elements: each input element can turn into zero, one, or more output elements.

 env.fromElements(Tuple1.apply("flink jobmanger taskmanager")
      ,Tuple1.apply("spark streaming")
      ,Tuple1.apply("hadoop hdfs"))
      .flatMap(_._1.split(" "))
      .print()
Output:
hadoop
hdfs
spark
streaming
flink
jobmanger
taskmanager

filter

filter keeps only the elements for which the given predicate returns true.

env.fromElements(Tuple1.apply("flink jobmanger taskmanager")
      ,Tuple1.apply("spark streaming")
      ,Tuple1.apply("hadoop hdfs"))
      .flatMap(_._1.split(" "))
      .filter(_.equals("flink"))
      .print()
Output:
flink

keyBy

keyBy logically partitions the stream by the specified key; the partition is chosen from the hash of the key.

Note: keyed state can only be used after keyBy().
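
The idea of hash partitioning can be sketched in plain Scala. This is a deliberately simplified model: it uses the key's hashCode modulo the parallelism, whereas Flink's actual assignment involves additional steps that are not shown here. The names are hypothetical.

```scala
object KeyPartitionSketch {
  // Simplified assumption: partition = hashCode mod parallelism.
  // floorMod keeps the result in [0, parallelism) even for negative hash codes.
  def partitionFor(key: Any, parallelism: Int): Int =
    java.lang.Math.floorMod(key.hashCode, parallelism)
}
```

The property that matters is that equal keys always map to the same partition, which is what allows per-key state and per-key aggregation to be kept on a single parallel instance.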

env.fromElements(Tuple1.apply("flink jobmanger taskmanager")
      , Tuple1.apply("flink streaming")
      , Tuple1.apply("hadoop jobmanger"))
      .flatMap(_._1.split(" "))
      .map(data => {
        (data, 1)
      })
      .keyBy(0)
      .reduce((t1,t2)=>{(t1._1,t1._2+t2._2)})
      .print().setParallelism(1)
Output:
(hadoop,1)
(flink,1)
(flink,2)
(jobmanger,1)
(jobmanger,2)
(taskmanager,1)
(streaming,1)

Aggregations

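The DataStream API supports several aggregations. These functions are applied to a KeyedStream and produce a rolling aggregate per key.

Commonly used methods are:

min, minBy, max, maxBy, sum
The difference between max and maxBy is that max returns the maximum value of the chosen field, whereas maxBy returns the whole element that holds that maximum value. The same applies to min and minBy.

Input: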
zs 001 1200
zs 001 1500
   env.socketTextStream("localhost",8888)
      .map(_.split("\\s+"))
      .map(ts=>(ts(0),ts(1),ts(2).toDouble))
      .keyBy(1)
      .min(2)
      .print()
Output:
1> (zs,001,1200.0)
1> (zs,001,1200.0)
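
The difference between max and maxBy is easier to see on records whose other fields differ. The following plain-Scala sketch models both as rolling aggregations over the elements of one key. It does not use Flink, the names are hypothetical, and keeping the non-aggregated fields of the first element in the max case is an assumption of this sketch.

```scala
object MaxVsMaxBySketch {
  type Rec = (String, String, Double)

  // max-like: only the aggregated field is updated; the other fields stay
  // those of the first element (assumption of this sketch). Needs non-empty input.
  def rollingMax(rs: Seq[Rec]): Seq[Rec] =
    rs.tail.scanLeft(rs.head)((acc, r) => acc.copy(_3 = math.max(acc._3, r._3)))

  // maxBy-like: the whole element holding the largest value is returned.
  def rollingMaxBy(rs: Seq[Rec]): Seq[Rec] =
    rs.tail.scanLeft(rs.head)((acc, r) => if (r._3 > acc._3) r else acc)
}
```

For the two records (zs, 001, 1200.0) and (ls, 001, 1500.0) keyed by the second field, the max-like version ends with (zs, 001, 1500.0), while the maxBy-like version ends with (ls, 001, 1500.0).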

reduce

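reduce is a rolling combine operation that turns a KeyedStream into a DataStream. Within each key group the elements are combined one after another: the first with the second, that result with the third, and so on.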

env.fromElements(Tuple1.apply("flink jobmanger taskmanager")
      , Tuple1.apply("flink streaming")
      , Tuple1.apply("hadoop jobmanger"))
      .flatMap(_._1.split(" "))
      .map(data => {
        (data, 1)
      })
      .keyBy(0)
      .reduce((t1,t2)=>{(t1._1,t1._2+t2._2)})
      .print().setParallelism(1)
Output:
(hadoop,1)
(flink,1)
(flink,2)
(jobmanger,1)
(jobmanger,2)
(taskmanager,1)
(streaming,1)
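
Why every intermediate count shows up in the output can be illustrated with a plain-Scala model of a rolling, per-key reduce. This is a sketch only: it keeps the per-key state in an ordinary mutable map and processes a finished list instead of a stream, and the names are hypothetical.

```scala
object RollingReduceSketch {
  // For each incoming (key, value), combine it with the value stored for
  // that key, store the result, and emit it. The first value per key is
  // emitted unchanged.
  def rollingReduce[K, V](in: Seq[(K, V)])(f: (V, V) => V): List[(K, V)] = {
    val state = scala.collection.mutable.Map.empty[K, V]
    in.toList.map { case (k, v) =>
      val next = state.get(k).map(prev => f(prev, v)).getOrElse(v)
      state(k) = next
      (k, next)
    }
  }
}
```

Feeding it (flink, 1), (jobmanger, 1), (flink, 1) with addition as the reduce function yields (flink, 1), (jobmanger, 1), (flink, 2): one output per input, each carrying the running total for its key.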

fold

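fold starts from a given initial value and combines the elements into it one by one. It turns a KeyedStream into a DataStream: within each key group the initial value is combined with the first element, that result with the second, and so on.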

    env.fromElements(Tuple1.apply("flink jobmanger taskmanager")
      , Tuple1.apply("flink streaming")
      , Tuple1.apply("hadoop jobmanger"))
      .flatMap(_._1.split(" "))
      .map(data => {
        (data, 1)
      })
      .keyBy(0)
      // the seed count is 10, so the first result for each key is 11
      .fold(("",10))((t1,t2)=>{(t2._1,t1._2+t2._2)})
      .print().setParallelism(1)
Output:
(jobmanger,11)
(jobmanger,12)
(taskmanager,11)
(streaming,11)
(hadoop,11)
(flink,11)
(flink,12)

union

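union merges several streams into one so that the combined stream can be processed uniformly. It simply concatenates the streams; the elements themselves are not changed.

All streams taking part in the union must have the same type.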

dataStream1.union(dataStream2, dataStream3, ...)

join

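join associates two streams on a specified key.

// join the two streams; this is an inner join, so only matching elements are kept
DataStream<String> result = stream1.join(stream2)
        // join condition
        .where(t1 -> t1.getField(0)).equalTo(t2 -> t2.getField(0))
        // one tumbling window every 5 seconds
        .window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
        // output produced for each matched pair
        .apply((t1, t2) -> t1.getField(1) + "|" + t2.getField(1));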

split

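split divides one stream into several streams. Note that split is deprecated in newer Flink versions in favour of side outputs.

val lines = env.socketTextStream("localhost", 8888)
val splitStream: SplitStream[String] = lines.split(line => {
  if (line.contains("zs")) {
    List("zs") // branch name
  } else {
    List("ls") // branch name
  }
})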

Select

To pick a particular stream out of a split stream, use the select operator together with split.

Input:
zs 001 1200
zs 001 1500
ls 002 1000
   oo2 2000
    val lines = env.socketTextStream("localhost",8888)
    val splitStream: SplitStream[String] = lines.split(line => {
      if (line.contains("zs")) {
        List("zs") // branch name
      } else {
        List("ls") // branch name
      }
    })
    // select each branch by name and print it with a matching prefix
    splitStream.select("zs").print("zs")
    splitStream.select("ls").print("ls")
Output:
zs:2> zs 001 1200
zs:3> zs 001 1500
ls:4> ls 002 1000
ls:1>    oo2 2000

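Side Output

A side output lets a ProcessFunction emit selected elements to a separate, tagged stream in addition to its main output.

Input:
zs 001 1200
ls 002 1000
    val lines = env.socketTextStream("localhost", 8888)
    // tag that identifies the side output
    val outTag: OutputTag[String] = new OutputTag[String]("zs")
    // processFunction: ProcessFunction[T, R]
    val result: DataStream[String] = lines.process(new ProcessFunction[String, String] {
      override def processElement(value: String, ctx: ProcessFunction[String, String]#Context, out: Collector[String]): Unit = {
        if (value.contains("zs")) {
          // lines containing "zs" go to the side output
          ctx.output(outTag, value)
        } else {
          // everything else goes to the main output
          out.collect(value)
        }
      }
    })
    result.print("main output")
    // read the data from the side output stream
    result.getSideOutput(outTag).print("side output")
Output:
main output:2> ls 002 1000
side output:3> zs 001 1200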

ValueState

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.socketTextStream("localhost", 8888)
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .keyBy(0)
  .map(new RichMapFunction[(String, Int), (String, Int)] {
    // running count for the current key, kept as keyed state
    var vs: ValueState[Int] = _

    override def open(parameters: Configuration): Unit = {
      // register the state under the name "valueCount"
      val vsd = new ValueStateDescriptor[Int]("valueCount", createTypeInformation[Int])
      vs = getRuntimeContext.getState[Int](vsd)
    }

    override def map(value: (String, Int)): (String, Int) = {
      // previous count for this key
      val historyCount = vs.value()
      val currentCount = historyCount + value._2
      vs.update(currentCount)
      (value._1, currentCount)
    }
  }).print()

KafkaSource

val prop = new Properties()
// zookeeper.connect is only needed by the legacy Kafka 0.8 consumer
prop.setProperty("zookeeper.connect", ZOOKEEPER_HOST)
// the property name is "bootstrap.servers" (plural)
prop.setProperty("bootstrap.servers", KAFKA_BROKER)
prop.setProperty("group.id", TRANSACTION_GROUP)
prop.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
prop.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
prop.setProperty("auto.offset.reset", "latest")
// pass the Properties object defined above (prop)
val lines = env.addSource(new FlinkKafkaConsumer[String]("topic", new SimpleStringSchema(), prop))
lines.flatMap(_.split("\\s+"))
  .map((_, 1))
  .keyBy(t => t._1)
  .sum(1)
  .print()

With SimpleStringSchema only the record value is available. To get more information, such as key, value, partition, or offset, implement the KafkaDeserializationSchema interface and define your own deserialized type.

import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.flink.streaming.api.scala._

// renamed so the class no longer clashes with the interface it implements
class KeyValueDeserializationSchema extends KafkaDeserializationSchema[(String, String)] {
  // the stream is unbounded, so no element marks its end
  override def isEndOfStream(nextElement: (String, String)): Boolean = {
    false
  }

  override def deserialize(record: ConsumerRecord[Array[Byte], Array[Byte]]): (String, String) = {
    // the key may be missing; fall back to an empty string
    var key = ""
    if (record.key() != null && record.key().size != 0) {
      key = new String(record.key())
    }
    val value = new String(record.value())
    (key, value)
  }

  // type information of the produced tuple
  override def getProducedType: TypeInformation[(String, String)] = {
    createTypeInformation[(String, String)]
  }
}

If the data stored in Kafka is JSON, you can use one of the built-in schemas that support JSON:

JsonNodeDeserializationSchema: requires the value to be a JSON-formatted string.
