Flink DataStream API

參考: https://ci.apache.org/projects/flink/flink-docs-release-1.9/dev/datastream_api.html

Data Sources

Sources are where the program reads its input. A source is attached to the program with fsEnv.addSource(sourceFunction). Flink ships with a number of pre-implemented SourceFunctions. You can also write your own custom sources by implementing SourceFunction (for non-parallel sources), or by implementing the ParallelSourceFunction interface or extending RichParallelSourceFunction (for parallel sources).

File-based

readTextFile(path): reads a text file line by line (using the TextInputFormat specification under the hood) and returns each line as a String.

import org.apache.flink.streaming.api.scala._ // required for the implicit TypeInformation used by the Scala API

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

val lines:DataStream[String]=fsEnv.readTextFile("file:///E:\\demo\\words")

lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(t=>t._1)
.sum(1)
.print()

fsEnv.execute("wordcount")

readFile(fileInputFormat, path): reads the files under the path according to the specified file input format (reads them only once, similar to batch processing).

 //1. Create the stream-processing environment - remote deployment | local execution
    val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

    val inputFormat = new TextInputFormat(null)
    //2. Read data from the external system
    val lines:DataStream[String]=fsEnv.readFile(inputFormat,"file:///D:/demo/words")
    lines.flatMap(_.split("\\s+"))
      .map((_,1))
      .keyBy(t=>t._1)
      .sum(1)
      .print()
    // print(fsEnv.getExecutionPlan)
    //3. Execute the streaming computation
    fsEnv.execute("wordcount")

readFile(fileInputFormat, path, watchType, interval, pathFilter): this is the method the two calls above use internally. It reads the files under the path according to the given FileInputFormat. Depending on watchType, the path can be monitored periodically: watchType is either FileProcessingMode.PROCESS_CONTINUOUSLY or FileProcessingMode.PROCESS_ONCE, and the monitoring interval is given by the interval parameter. The pathFilter parameter can be used to exclude files under the path. Note that with PROCESS_CONTINUOUSLY, once a file's content changes, the entire file is reprocessed.

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

val inputFormat=new TextInputFormat(null)
val lines:DataStream[String]=fsEnv.readFile(
    inputFormat,"file:///E:\\demo\\words",
    FileProcessingMode.PROCESS_CONTINUOUSLY,
    5000,new FilePathFilter {
        override def filterPath(filePath: Path): Boolean = {
            // returning true excludes the file: here files ending in ".txt" are filtered out
            filePath.getPath.endsWith(".txt")
        }
    })
lines.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(t=>t._1)
.sum(1)
.print()
fsEnv.execute("wordcount")

Socket-based

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val lines:DataStream[String]=fsEnv.socketTextStream("CentOS",9999)
lines.flatMap(_.split("\\s+"))
    .map((_,1))
    .keyBy(t=>t._1)
    .sum(1)
    .print()
fsEnv.execute("wordcount")

Collection-based (for testing)

   //1. Create the stream-processing environment - remote deployment | local execution
    val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

   val lines:DataStream[String] = fsEnv.fromCollection(List("this is a demo","where are you from"))

    lines.flatMap(_.split("\\s+"))
      .map((_,1))
      .keyBy(t=>t._1)
      .sum(1)
      .print()
    // print(fsEnv.getExecutionPlan)
    //3. Execute the streaming computation
    fsEnv.execute("wordcount")

Custom Source

A user-defined data source:

package com.hw.demo02

import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}

import scala.util.Random

/**
  * @author fql
  * @date 2019/10/15 19:23
  */
class CustomSourceFunction extends ParallelSourceFunction[String]{

  @volatile
  var isRunning:Boolean=true
  val lines:Array[String] = Array("this is a demo","hello word","are you ok")

  override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
    while (isRunning){
      Thread.sleep(1000)
      ctx.collect(lines(new Random().nextInt(lines.length))) // emit the data downstream

    }
  }


  override def cancel(): Unit = {
     isRunning=false
  }
}
 //1. Create the stream-processing environment - remote deployment | local execution
    val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

   val lines:DataStream[String] = fsEnv.addSource[String](new CustomSourceFunction)

    lines.flatMap(_.split("\\s+"))
      .map((_,1))
      .keyBy(t=>t._1)
      .sum(1)
      .print()
    // print(fsEnv.getExecutionPlan)
    //3. Execute the streaming computation
    fsEnv.execute("wordcount")
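
The class above implements ParallelSourceFunction, so every parallel subtask of the source emits records. For the non-parallel case mentioned at the start of this chapter, SourceFunction is implemented directly and Flink runs the source with parallelism 1. A minimal sketch, using the same imports as above (illustrative only):

class NonParallelSourceFunction extends SourceFunction[String]{

  @volatile
  var isRunning:Boolean = true
  val lines:Array[String] = Array("this is a demo","hello word","are you ok")

  override def run(ctx: SourceFunction.SourceContext[String]): Unit = {
    while (isRunning){
      Thread.sleep(1000)
      ctx.collect(lines(new Random().nextInt(lines.length))) // emit one random line per second
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}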

√ FlinkKafkaConsumer

  • Add the required dependency
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-connector-kafka_${scala.version}</artifactId>
    <version>${flink.version}</version>
</dependency>
 //1. Create the stream-processing environment - remote deployment | local execution
    val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
    val props = new Properties()
    props.setProperty("bootstrap.servers", "CentOS:9092")
    props.setProperty("group.id", "g1")

    val lines=fsEnv.addSource(new FlinkKafkaConsumer("topic01",new SimpleStringSchema(),props))

    lines.flatMap(_.split("\\s+"))
      .map((_,1))
      .keyBy(t=>t._1)
      .sum(1)
      .print()
    // print(fsEnv.getExecutionPlan)
    //3. Execute the streaming computation
    fsEnv.execute("wordcount")

If the data stored in Kafka consists of JSON strings, you can use the JSON-aware schemas that ship with the system. Recommended:

  • JsonNodeDeserializationSchema: requires the value to be a JSON string (see the sketch after the example below)
  • JSONKeyValueDeserializationSchema(meta): requires both key and value to be JSON, and can also carry metadata (partition, offset, etc.)
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val props = new Properties()
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")

val jsonData:DataStream[ObjectNode]=fsEnv.addSource(new FlinkKafkaConsumer("topic01",new JSONKeyValueDeserializationSchema(true),props))
jsonData.map(on=> (on.get("value").get("id").asInt(),on.get("value").get("name")))
.print()
fsEnv.execute("wordcount")
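
The example above uses JSONKeyValueDeserializationSchema. For the value-only case, a minimal sketch with JsonNodeDeserializationSchema; this assumes the flink-json module is on the classpath, and the class's package can differ between Flink versions:

import java.util.Properties

import org.apache.flink.formats.json.JsonNodeDeserializationSchema // assumed location (flink-json module)
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val props = new Properties()
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")

// each record value is parsed directly into a Jackson ObjectNode (no key/metadata wrapper)
val jsonData = fsEnv.addSource(new FlinkKafkaConsumer("topic01", new JsonNodeDeserializationSchema(), props))
jsonData.map(node => (node.get("id").asInt(), node.get("name").asText()))
    .print()
fsEnv.execute("wordcount")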

Data Sinks

Data Sinks consume DataStream data and forward it to files, sockets, external systems, or print it. Flink comes with a number of predefined output sinks.

File-based

write*: writeAsText / writeAsCsv(…) / writeUsingOutputFormat. Note that the write*() methods on DataStream are intended mainly for debugging purposes.

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val props = new Properties()
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")

fsEnv.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),props))
    .flatMap(_.split("\\s+"))
    .map((_,1))
    .keyBy(0)
    .sum(1)
    .writeAsText("file:///E:/results/text",WriteMode.OVERWRITE)

fsEnv.execute("wordcount")

The approach above only provides an at-least-once guarantee. In a production environment it is recommended to use flink-connector-filesystem to write data to the external system, which can guarantee exactly-once.

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val props = new Properties()
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")

val bucketingSink = new BucketingSink[(String,Int)]("hdfs://CentOS:9000/BucketingSink")
bucketingSink.setBucketer(new DateTimeBucketer("yyyyMMddHH")) // bucket directory format
bucketingSink.setBatchSize(1024)
fsEnv.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),props))
    .flatMap(_.split("\\s+"))
    .map((_,1))
    .keyBy(0)
    .sum(1)
    .addSink(bucketingSink)
    .setParallelism(6)

fsEnv.execute("wordcount")

print() / printToErr()

  //1. Create the stream-processing environment - remote deployment | local execution
    val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
    //2. Read data from the external system
    val props = new Properties()

    props.setProperty("bootstrap.servers","CentOS:9092")
    props.setProperty("group.id","g2")

    val lines = fsEnv.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),props))
    lines.flatMap(_.split("\\s+"))
      .map((_,1))
      .keyBy(t=>t._1)
      .sum(1)
      .print("測試") //輸出前綴,可以區分當前有多個輸出到控制檯的流 可以添加 前綴
        .setParallelism(2)
    // print(fsEnv.getExecutionPlan)
    //3. Execute the streaming computation
    fsEnv.execute("wordcount")

Custom Sink

class CustomSinkFunction extends RichSinkFunction[(String,Int)]{
  override def open(parameters: Configuration): Unit = {
    println("open connection") // e.g. initialize a connection to the external system
  }
  override def invoke(value: (String, Int), context: SinkFunction.Context[_]): Unit = {
    println(value) // write the record to the external system here
  }

  override def close(): Unit = {
    println("close connection")
  }
}
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val props = new Properties()
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")

fsEnv.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),props))
    .flatMap(_.split("\\s+"))
    .map((_,1))
    .keyBy(0)
    .sum(1)
    .addSink(new CustomSinkFunction)

fsEnv.execute("wordcount")

√ RedisSink

  • Add the dependency
<dependency>
    <groupId>org.apache.bahir</groupId>
    <artifactId>flink-connector-redis_2.11</artifactId>
    <version>1.0</version>
</dependency>
class UserRedisMapper extends RedisMapper[(String,Int)]{
  // set the Redis data type (command) used for writes
  override def getCommandDescription: RedisCommandDescription = {
    new RedisCommandDescription(RedisCommand.HSET,"wordcount")
  }

  override def getKeyFromData(data: (String, Int)): String = {
    data._1
  }

  override def getValueFromData(data: (String, Int)): String = {
    data._2.toString
  }
}
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment
val props = new Properties()
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")

val jedisConfig=new FlinkJedisPoolConfig.Builder()
    .setHost("CentOS")
    .setPort(6379)
    .build()

fsEnv.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),props))
    .flatMap(_.split("\\s+"))
    .map((_,1))
    .keyBy(0)
    .sum(1)
    .addSink(new RedisSink[(String, Int)](jedisConfig,new UserRedisMapper))

fsEnv.execute("wordcount")

√ FlinkKafkaProducer

class UserKeyedSerializationSchema extends KeyedSerializationSchema[(String,Int)]{

  override def serializeKey(element: (String, Int)): Array[Byte] = {
    element._1.getBytes()
  }

  override def serializeValue(element: (String, Int)): Array[Byte] = {
    element._2.toString.getBytes()
  }

  // can override the default topic; returning null writes the record to the default topic
  override def getTargetTopic(element: (String, Int)): String = {
    null
  }
}
val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

val props1 = new Properties()
props1.setProperty(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092")
props1.setProperty(ConsumerConfig.GROUP_ID_CONFIG, "g1")

val props2 = new Properties()
props2.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "CentOS:9092")
props2.setProperty(ProducerConfig.BATCH_SIZE_CONFIG,"100")
props2.setProperty(ProducerConfig.LINGER_MS_CONFIG,"500")
props2.setProperty(ProducerConfig.ACKS_CONFIG,"all")
props2.setProperty(ProducerConfig.RETRIES_CONFIG,"2")


fsEnv.addSource(new FlinkKafkaConsumer[String]("topic01",new SimpleStringSchema(),props1))
.flatMap(_.split("\\s+"))
.map((_,1))
.keyBy(0)
.sum(1)
.addSink(new FlinkKafkaProducer[(String, Int)]("topic02",new UserKeyedSerializationSchema,props2))
fsEnv.execute("wordcount")

DataStream Transformations

Map

Takes one element and produces one element.

dataStream.map { x => x * 2 }

FlatMap

Takes one element and produces zero, one, or more elements.

dataStream.flatMap { str => str.split(" ") }

Filter

Evaluates a boolean function for each element and retains those for which the function returns true.

dataStream.filter { _ != 0 }

Union

Union of two or more data streams creating a new stream containing all the elements from all the streams.

dataStream.union(otherStream1, otherStream2, ...)
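
For comparison with connect below, a small union word-count sketch in the style of the other examples in this chapter, assuming two socket sources on the same host:

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

val stream1 = fsEnv.socketTextStream("CentOS",9999)
val stream2 = fsEnv.socketTextStream("CentOS",8888)

// union requires both streams to have the same element type and yields a single DataStream[String]
stream1.union(stream2)
    .flatMap(_.split("\\s+"))
    .map((_,1))
    .keyBy(t=>t._1)
    .sum(1)
    .print()

fsEnv.execute("wordcount")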

Connect

“Connects” two data streams retaining their types, allowing for shared state between the two streams.

  // in practice, define this case class at the top level of the enclosing object
  case class Word(word: String, count: Int)

  val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

    val stream1 = fsEnv.socketTextStream("CentOS",9999)

    val stream2 = fsEnv.socketTextStream("CentOS",8888)


    stream1.connect(stream2).flatMap(line=>line.split("\\s+"),line=>line.split("\\s+"))
      .map(Word(_,1))
      .keyBy("word")
      .sum("count")
      .print()
    fsEnv.execute("wordcount")

Split

Split the stream into two or more streams according to some criterion.

val split = someDataStream.split(
  (num: Int) =>
    (num % 2) match {
      case 0 => List("even")
      case 1 => List("odd")
    }
)    

Select

Select one or more streams from a split stream.

val even = split select "even"
val odd = split select "odd"
val all = split.select("even","odd")

val lines = fsEnv.socketTextStream("CentOS",9999)
val splitStream: SplitStream[String] = lines.split(line => {
    if (line.contains("error")) {
        List("error") //分支名稱
    } else {
        List("info") //分支名稱
    }
})
splitStream.select("error").print("error")
splitStream.select("info").print("info")

Side Output

val fsEnv = StreamExecutionEnvironment.getExecutionEnvironment

    val lines = fsEnv.socketTextStream("CentOS",9999)

    val outTag = new OutputTag[String]("error")

    val result = lines.process(new ProcessFunction[String, String] {
      override def processElement(value: String, ctx: ProcessFunction[String, String]#Context, out: Collector[String]): Unit = {
        if (value.contains("error")) {
          ctx.output(outTag, value)
        } else {
          out.collect(value)
        }
      }
    })
    result.print("normal output")

    // get the side output
    result.getSideOutput(outTag).print("error output")
    fsEnv.execute("wordcount")

KeyBy

Logically partitions a stream into disjoint partitions, each partition containing elements of the same key. Internally, this is implemented with hash partitioning.

dataStream.keyBy("someKey") // Key by field "someKey"
dataStream.keyBy(0) // Key by the first element of a Tuple
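The word-count examples earlier in this document key by a selector function instead, which avoids string and index field references:

dataStream.keyBy(value => value._1) // key by a KeySelector function (assuming a stream of tuples)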

Reduce

A “rolling” reduce on a keyed data stream. Combines the current element with the last reduced value and emits the new value.

fsEnv.socketTextStream("CentOS",9999)
        .flatMap(_.split("\\s+"))
        .map((_,1))
        .keyBy(0)
        .reduce((t1,t2)=>(t1._1,t1._2+t2._2))
        .print()

Fold

A “rolling” fold on a keyed data stream with an initial value. Combines the current element with the last folded value and emits the new value.

fsEnv.socketTextStream("CentOS",9999)
    .flatMap(_.split("\\s+"))
    .map((_,1))
    .keyBy(0)
    .fold(("",0))((t1,t2)=>(t2._1,t1._2+t2._2))
    .print()

Aggregations

Rolling aggregations on a keyed data stream. The difference between min and minBy is that min returns the minimum value, whereas minBy returns the element that has the minimum value in this field (same for max and maxBy).

Sample input sent to the socket:

zhangsan 001 1000
wangw 001 1500
zhaol 001 800

fsEnv.socketTextStream("CentOS",9999)
    .map(_.split("\\s+"))
    .map(ts=>(ts(0),ts(1),ts(2).toDouble))
    .keyBy(1)
    .minBy(2) // emits the record that contains the minimum value
    .print()

Output:

1> (zhangsan,001,1000.0)
1> (zhangsan,001,1000.0)
1> (zhaol,001,800.0)

fsEnv.socketTextStream("CentOS",9999)
    .map(_.split("\\s+"))
    .map(ts=>(ts(0),ts(1),ts(2).toDouble))
    .keyBy(1)
    .min(2) // emits the minimum value; the other fields are not guaranteed to come from the same record
    .print()

Output:

1> (zhangsan,001,1000.0)
1> (zhangsan,001,1000.0)
1> (zhangsan,001,800.0)