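Streaming (DataStream API)
DataSource
A data source is where a program reads its input. You attach one with env.addSource(SourceFunction). Flink ships with many built-in SourceFunction implementations, and you can also write your own by implementing the SourceFunction interface (non-parallel) or the ParallelSourceFunction interface (parallel). If the source also needs state management, extend RichParallelSourceFunction instead.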
File-based
readTextFile(path)
- Reads text files once, i.e. files that respect the TextInputFormat specification, line by line and returns them as Strings.
import org.apache.flink.streaming.api.scala._
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream: read a text file from HDFS
val text: DataStream[String] = env.readTextFile("hdfs://CentOS:9000/demo/words")
//3. Apply the DataStream transformations
val counts = text.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(0)
  .sum(1)
//4. Print the results to the console
counts.print()
//5. Run the streaming job
env.execute("Window Stream WordCount")
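To make the output of the job above easier to read: keyBy(0).sum(1) keeps a running total per word and emits an updated (word, count) pair for every incoming word, rather than a single final count. The sketch below imitates that behaviour in plain Scala, without Flink. RunningWordCount and updates are hypothetical names used only for illustration, and a real job with parallelism greater than 1 may interleave the updates of different words in another order.

```scala
object RunningWordCount {
  // Returns every (word, count) update in emission order,
  // the way a keyed running sum reports them.
  def updates(lines: Seq[String]): Seq[(String, Int)] = {
    val totals = scala.collection.mutable.Map.empty[String, Int]
    for {
      line <- lines
      word <- line.split("\\s+").toSeq if word.nonEmpty
    } yield {
      val next = totals.getOrElse(word, 0) + 1
      totals(word) = next
      (word, next)
    }
  }

  def main(args: Array[String]): Unit =
    updates(Seq("this is a demo", "this is")).foreach(println)
}
```

The second occurrence of "this" is reported as (this,2), so downstream consumers see each intermediate count, not only the last one.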
readFile(fileInputFormat, path)
- Reads files once, as dictated by the specified file input format.
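import org.apache.flink.api.common.io.FileInputFormat
import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.streaming.api.scala._
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
// Create the file input format (the path is supplied to readFile below)
val inputFormat: FileInputFormat[String] = new TextInputFormat(null)
//2. Create the DataStream
val text: DataStream[String] =
  env.readFile(inputFormat, "hdfs://CentOS:9000/demo/words")
//3. Apply the DataStream transformations
val counts = text.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(0) // key by tuple field 0
  .sum(1)
//4. Print the results to the console
counts.print()
//5. Run the streaming job
env.execute("Window Stream WordCount")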
readFile(fileInputFormat, path, watchType, interval, pathFilter, typeInfo)
- This is the method called internally by the two previous ones. It reads files in the path based on the given fileInputFormat. Depending on the provided watchType, this source may periodically monitor the path for new data every interval ms (FileProcessingMode.PROCESS_CONTINUOUSLY), or process the data currently in the path once and exit (FileProcessingMode.PROCESS_ONCE). Using the pathFilter, the user can further exclude files from being processed.
Note:
In PROCESS_CONTINUOUSLY mode the source checks the files in the monitored directory, and a file that has changed is read again in full, which can cause its contents to be counted more than once. As a rule, avoid modifying existing files; upload new files instead.
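import org.apache.flink.api.common.io.FileInputFormat
import org.apache.flink.api.java.io.TextInputFormat
import org.apache.flink.streaming.api.functions.source.FileProcessingMode
import org.apache.flink.streaming.api.scala._
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream: re-scan the path every 1000 ms
val inputFormat: FileInputFormat[String] = new TextInputFormat(null)
val text: DataStream[String] = env.readFile(inputFormat,
  "hdfs://CentOS:9000/demo/words", FileProcessingMode.PROCESS_CONTINUOUSLY, 1000)
//3. Apply the DataStream transformations
val counts = text.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(0)
  .sum(1)
//4. Print the results to the console
counts.print()
//5. Run the streaming job
env.execute("Window Stream WordCount")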
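The duplicate-counting risk of continuous monitoring can be illustrated without Flink. The sketch below is a plain Scala model, not Flink code: it assumes a modified file is read again from the start, and ReprocessingDemo and count are hypothetical names.

```scala
object ReprocessingDemo {
  // Counts words across every full read of the file that the source performs.
  def count(reads: Seq[Seq[String]]): Map[String, Int] =
    reads.flatten
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit = {
    val original = Seq("hello world")
    val appended = original :+ "hello flink" // the same file after one line is appended
    // The modified file is read again in full, so "hello world" is counted twice.
    println(count(Seq(original, appended)))
  }
}
```

Only one new line was appended, yet "world" ends up with a count of 2, which is why uploading a new file is the safer habit.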
Socket-based
socketTextStream
- Reads from a socket. Elements can be separated by a delimiter.
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream: read from a socket, '\n' as delimiter, maxRetry = 3
val text = env.socketTextStream("CentOS", 9999, '\n', 3)
//3. Apply the DataStream transformations
val counts = text.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(0)
  .sum(1)
//4. Print the results to the console
counts.print()
//5. Run the streaming job
env.execute("Window Stream WordCount")
Collection-based
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream from an in-memory collection
val text = env.fromCollection(List("this is a demo", "hello world"))
//3. Apply the DataStream transformations
val counts = text.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(0)
  .sum(1)
//4. Print the results to the console
counts.print()
//5. Run the streaming job
env.execute("Window Stream WordCount")
UserDefinedSource
SourceFunction
import org.apache.flink.streaming.api.functions.source.SourceFunction
import scala.util.Random
class UserDefinedNonParallelSourceFunction extends SourceFunction[String] {
  @volatile // makes the write in cancel() visible to the thread executing run()
  var isRunning: Boolean = true
  val lines: Array[String] = Array("this is a demo", "hello world", "ni hao ma")
  // Runs the source loop; records are emitted through sourceContext.collect
  override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
    while (isRunning) {
      Thread.sleep(100)
      // Send a random line downstream
      sourceContext.collect(lines(new Random().nextInt(lines.size)))
    }
  }
  // Stop the loop and release resources
  override def cancel(): Unit = {
    isRunning = false
  }
}
ParallelSourceFunction
import org.apache.flink.streaming.api.functions.source.{ParallelSourceFunction, SourceFunction}
import scala.util.Random
class UserDefinedParallelSourceFunction extends ParallelSourceFunction[String] {
  @volatile // makes the write in cancel() visible to the thread executing run()
  var isRunning: Boolean = true
  val lines: Array[String] = Array("this is a demo", "hello world", "ni hao ma")
  // Runs the source loop in every parallel instance; records are emitted through sourceContext.collect
  override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
    while (isRunning) {
      Thread.sleep(100)
      // Send a random line downstream
      sourceContext.collect(lines(new Random().nextInt(lines.size)))
    }
  }
  // Stop the loop and release resources
  override def cancel(): Unit = {
    isRunning = false
  }
}
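The introduction also mentions RichParallelSourceFunction, which has no example of its own. A minimal sketch follows, assuming the same random-line source as above; the class name UserDefinedRichParallelSourceFunction is hypothetical. The rich variant adds the open() lifecycle hook and access to the runtime context.

```scala
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.source.{RichParallelSourceFunction, SourceFunction}
import scala.util.Random

// Hypothetical example class, modelled on UserDefinedParallelSourceFunction.
class UserDefinedRichParallelSourceFunction extends RichParallelSourceFunction[String] {
  @volatile private var isRunning = true
  private val lines = Array("this is a demo", "hello world", "ni hao ma")

  // Called once per parallel instance before run(); the runtime context is available here.
  override def open(parameters: Configuration): Unit = {
    val ctx = getRuntimeContext
    println(s"Opening subtask ${ctx.getIndexOfThisSubtask} of ${ctx.getNumberOfParallelSubtasks}")
  }

  override def run(sourceContext: SourceFunction.SourceContext[String]): Unit = {
    val random = new Random()
    while (isRunning) {
      Thread.sleep(100)
      sourceContext.collect(lines(random.nextInt(lines.length)))
    }
  }

  override def cancel(): Unit = {
    isRunning = false
  }
}
```

It can be passed to env.addSource in the same way as the other two user-defined sources.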
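Consuming the source downstream
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)
//2. Create the DataStream from a user-defined SourceFunction (either class above works)
val text = env.addSource[String](new UserDefinedParallelSourceFunction)
//3. Apply the DataStream transformations
val counts = text.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(0)
  .sum(1)
//4. Print the results to the console
counts.print()
println(env.getExecutionPlan) // print the execution plan
//5. Run the streaming job
env.execute("Window Stream WordCount")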
Kafka Integration
Add the Maven dependency:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_2.11</artifactId>
<version>1.10.0</version>
</dependency>
SimpleStringSchema
SimpleStringSchema deserializes only the value of each Kafka record.
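import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
val props = new Properties() // Kafka connection properties
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")
// Create the Kafka consumer that feeds Flink
val text = env.addSource(new FlinkKafkaConsumer[String]("topic01",
  new SimpleStringSchema(), props))
//3. Apply the DataStream transformations
val counts = text.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(0)
  .sum(1)
//4. Print the results to the console
counts.print()
//5. Run the streaming job
env.execute("Window Stream WordCount")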
KafkaDeserializationSchema
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.connectors.kafka.KafkaDeserializationSchema
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.flink.api.scala._
class UserDefinedKafkaDeserializationSchema extends KafkaDeserializationSchema[(String, String, Int, Long)] {
  // Whether the stream has ended; a streaming job runs continuously, so this is always false
  override def isEndOfStream(t: (String, String, Int, Long)): Boolean = false
  // Deserialize a record into (key, value, partition, offset)
  override def deserialize(consumerRecord: ConsumerRecord[Array[Byte], Array[Byte]]): (String, String, Int, Long) = {
    if (consumerRecord.key() != null) { // key present: return key, value, partition and offset
      (new String(consumerRecord.key()), new String(consumerRecord.value()),
        consumerRecord.partition(), consumerRecord.offset())
    } else { // key absent: return null for the key; value, partition and offset as usual
      (null, new String(consumerRecord.value()),
        consumerRecord.partition(), consumerRecord.offset())
    }
  }
  // Type information of the produced records
  override def getProducedType: TypeInformation[(String, String, Int, Long)] = {
    // Note: createTypeInformation requires import org.apache.flink.api.scala._
    createTypeInformation[(String, String, Int, Long)]
  }
}
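Two details of the byte handling in the schema above are easy to get wrong: new String(bytes) decodes with the JVM's default charset, and calling toString on a byte array does not decode it at all. A small helper can centralize the null check and the charset. This is a sketch that assumes the payloads are UTF-8; KafkaBytes and decode are hypothetical names.

```scala
import java.nio.charset.StandardCharsets

object KafkaBytes {
  // Decodes a Kafka key or value as UTF-8; a missing (null) payload stays null.
  def decode(bytes: Array[Byte]): String =
    Option(bytes).map(new String(_, StandardCharsets.UTF_8)).orNull
}
```

With such a helper, deserialize could build its tuple as (KafkaBytes.decode(record.key()), KafkaBytes.decode(record.value()), record.partition(), record.offset()) without an if/else branch.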
Reading the data and printing it
def main(args: Array[String]): Unit = {
  //1. Create the stream execution environment
  val env = StreamExecutionEnvironment.getExecutionEnvironment
  //2. Create the DataStream
  val props = new Properties()
  props.setProperty("bootstrap.servers", "SparkTwo:9092")
  props.setProperty("group.id", "g1")
  val text = env.addSource(new FlinkKafkaConsumer[(String, String, Int, Long)] // element type produced by the source
    ("topic01", new UserDefinedKafkaDeserializationSchema(), props))
  //3. Apply the DataStream transformations (t._2 is the record value)
  val counts = text.flatMap(t => t._2.split("\\s+"))
    .map(word => (word, 1))
    .keyBy(0)
    .sum(1)
  //4. Print the results to the console
  counts.print()
  //5. Run the streaming job
  env.execute("Window Stream WordCount")
}
Supplement
Methods for reading the individual fields of a Kafka ConsumerRecord (Java):
// Extract each piece of information from the record
private static void shum(ConsumerRecord<String, String> next){
    String topic = next.topic();         // topic
    int partition = next.partition();    // partition number
    long offset = next.offset();         // offset
    String key = next.key();             // key
    String value = next.value();         // value
    long timestamp = next.timestamp();   // timestamp
    System.out.println("topic " + topic + " partition " + partition + " offset " + offset
            + " key " + key + " value " + value + " timestamp " + timestamp);
}
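JSONKeyValueDeserializationSchema
Requires both the key and the value of the records in the Kafka topic to be JSON. When constructing the schema you can also choose whether to include metadata (topic, partition, offset).
import java.util.Properties
import org.apache.flink.shaded.jackson2.com.fasterxml.jackson.databind.node.ObjectNode
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer
import org.apache.flink.streaming.util.serialization.JSONKeyValueDeserializationSchema
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
val props = new Properties()
props.setProperty("bootstrap.servers", "CentOS:9092")
props.setProperty("group.id", "g1")
// Example record value: {"id":1,"name":"zhangsan"}
val text = env.addSource(new FlinkKafkaConsumer[ObjectNode]("topic01",
  new JSONKeyValueDeserializationSchema(true), props))
// Each element t looks like: {"value":{"id":1,"name":"zhangsan"},"metadata":{"offset":0,"topic":"topic01","partition":13}}
text.map(t => (t.get("value").get("id").asInt(), t.get("value").get("name").asText()))
  .print()
//5. Run the streaming job
env.execute("Window Stream WordCount")
Reference documentation for the Kafka and Flink integration:
https://ci.apache.org/projects/flink/flink-docs-release-1.10/dev/connectors/kafka.html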
Data Sinks
Data sinks consume DataStreams and forward them to files, sockets, or external systems, or print them. Flink comes with a variety of built-in output formats that are encapsulated behind operations on the DataStreams.
File-based
writeAsText() / TextOutputFormat - Writes elements line-wise as Strings. The Strings are obtained by calling the toString() method of each element.
writeAsCsv(...) / CsvOutputFormat - Writes tuples as comma-separated value files. Row and field delimiters are configurable. The value for each field comes from the toString() method of the objects.
writeUsingOutputFormat() / FileOutputFormat - Method and base class for custom file outputs. Supports custom object-to-bytes conversion.
Note: the write*() methods on DataStream are mainly intended for debugging purposes.
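import org.apache.flink.api.java.io.TextOutputFormat
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.scala._
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
val text = env.socketTextStream("CentOS", 9999)
//3. Apply the DataStream transformations
val counts = text.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(0)
  .sum(1)
//4. Write the results to the local file system
counts.writeUsingOutputFormat(new TextOutputFormat[(String, Int)](
  new Path("file:///Users/admin/Desktop/flink-results")))
//5. Run the streaming job
env.execute("Window Stream WordCount")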
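The example above only exercises writeUsingOutputFormat. For comparison, here is a minimal sketch of the other two methods in the list, writeAsText and writeAsCsv. The object name, the /tmp output paths, and the inline sample data are assumptions made for illustration.

```scala
import org.apache.flink.core.fs.FileSystem.WriteMode
import org.apache.flink.streaming.api.scala._

object WriteAsTextAndCsv {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    val counts = env.fromElements("this is a demo", "hello world")
      .flatMap(_.split("\\s+"))
      .map((_, 1))
      .keyBy(0)
      .sum(1)
    // One line per element, produced by toString() of the tuple
    counts.writeAsText("file:///tmp/flink-text-results", WriteMode.OVERWRITE)
    // One row per tuple, fields separated by the default comma delimiter
    counts.writeAsCsv("file:///tmp/flink-csv-results", WriteMode.OVERWRITE)
    env.execute("Write As Text And Csv")
  }
}
```

Like the other write*() methods, both are best kept for debugging rather than production output.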
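Note:
If you switch the output to HDFS, you need to generate a large amount of data before any results become visible, because the write buffer of the HDFS file system is relatively large. The file sinks above also do not take part in Flink's checkpointing, so in production flink-connector-filesystem is normally used to write to external systems.
Start by adding the dependency: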
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-filesystem_2.11</artifactId>
<version>1.10.0</version>
</dependency>
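import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.DateTimeBucketAssigner
import org.apache.flink.streaming.api.scala._
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
//2. Create the DataStream
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
val bucketingSink = StreamingFileSink.forRowFormat(
    new Path("hdfs://CentOS:9000/bucket-results"),
    new SimpleStringEncoder[(String, Int)]("UTF-8"))
  .withBucketAssigner(new DateTimeBucketAssigner[(String, Int)]("yyyy-MM-dd")) // derives the output sub-directory from the date
  .build()
//3. Apply the DataStream transformations
val counts = text.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(0)
  .sum(1)
//4. Write the results through the file sink
counts.addSink(bucketingSink)
//5. Run the streaming job
env.execute("Window Stream WordCount")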
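As far as the Flink documentation describes it, StreamingFileSink finalizes its part files only on successful checkpoints, so a job without checkpointing tends to leave its output in the in-progress state. A minimal sketch of enabling checkpointing follows; the 5-second interval, the object name, and the local output path are assumptions, and the socket source is reused from the earlier examples so that the job keeps running long enough for checkpoints to complete.

```scala
import org.apache.flink.api.common.serialization.SimpleStringEncoder
import org.apache.flink.core.fs.Path
import org.apache.flink.streaming.api.functions.sink.filesystem.StreamingFileSink
import org.apache.flink.streaming.api.scala._

object CheckpointedFileSinkJob {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Assumed interval: trigger a checkpoint every 5000 ms
    env.enableCheckpointing(5000)
    val sink = StreamingFileSink
      .forRowFormat(new Path("file:///tmp/bucket-results"), new SimpleStringEncoder[String]("UTF-8"))
      .build()
    // Unbounded source, as in the socket examples above
    env.socketTextStream("CentOS", 9999).addSink(sink)
    env.execute("Checkpointed File Sink")
  }
}
```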
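Legacy approach (BucketingSink)
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.fs.bucketing.{BucketingSink, DateTimeBucketer}
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)
//2. Create the DataStream
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
val bucketingSink = new BucketingSink[(String, Int)]("hdfs://CentOS:9000/bucketresults")
bucketingSink.setBucketer(new DateTimeBucketer[(String, Int)]("yyyy-MM-dd"))
bucketingSink.setBatchSize(1024) // maximum part file size in bytes
//3. Apply the DataStream transformations
val counts = text.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(0)
  .sum(1)
//4. Write the results through the bucketing sink
counts.addSink(bucketingSink)
//5. Run the streaming job
env.execute("Window Stream WordCount")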
UserDefinedSinkFunction
A user-defined sink function.
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.sink.{RichSinkFunction, SinkFunction}
class UserDefinedSinkFunction extends RichSinkFunction[(String, Int)] {
  // Called once before any record arrives; open connections here
  override def open(parameters: Configuration): Unit = {
    println("Opening connection...")
  }
  // Called for every record
  override def invoke(value: (String, Int), context: SinkFunction.Context[_]): Unit = {
    println("Output: " + value)
  }
  // Called on shutdown; release resources here
  override def close(): Unit = {
    println("Releasing connection")
  }
}
//1. Create the stream execution environment
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//2. Create the DataStream
val text = env.readTextFile("hdfs://CentOS:9000/demo/words")
//3. Apply the DataStream transformations
val counts = text.flatMap(line => line.split("\\s+"))
  .map(word => (word, 1))
  .keyBy(0)
  .sum(1)
//4. Send the results to the user-defined sink
counts.addSink(new UserDefinedSinkFunction)
//5. Run the streaming job
env.execute("Window Stream WordCount")
RedisSink
Reference documentation: https://bahir.apache.org/docs/flink/current/flink-streaming-redis/
Again, start by adding the dependency:
<dependency>
<groupId>org.apache.bahir</groupId>
<artifactId>flink-connector-redis_2.11</artifactId>
<version>1.0</version>
</dependency>
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.redis.RedisSink
import org.apache.flink.streaming.connectors.redis.common.config.FlinkJedisPoolConfig
import org.apache.flink.streaming.connectors.redis.common.mapper.{RedisCommand, RedisCommandDescription, RedisMapper}
object FlinkOne {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)
    //2. Create the DataStream
    val text = env.readTextFile("hdfs://SparkTwo:9000/demo/words")
    // Redis connection settings
    val flinkJedisConf = new FlinkJedisPoolConfig.Builder()
      .setHost("SparkTwo")
      .setPort(6379)
      .build()
    //3. Apply the DataStream transformations
    val counts = text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)
    //4. Write the results to Redis
    counts.addSink(new RedisSink(flinkJedisConf, new UserDefinedRedisMapper()))
    //5. Run the streaming job
    env.execute("Window Stream WordCount")
  }
}
class UserDefinedRedisMapper extends RedisMapper[(String, Int)] {
  // Describes the Redis command to issue
  override def getCommandDescription: RedisCommandDescription = {
    // HSET stores into a hash; the additional key "wordcounts" is the name of that hash
    new RedisCommandDescription(RedisCommand.HSET, "wordcounts")
  }
  // The hash field: the word
  override def getKeyFromData(t: (String, Int)): String = {
    t._1
  }
  // The hash value: the count
  override def getValueFromData(t: (String, Int)): String = {
    t._2.toString
  }
}
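With the mapper above, every (word, count) update becomes an HSET on the wordcounts hash, with the word as field and the count as value. Since the stream carries running totals, later writes overwrite earlier ones and the hash ends up holding the latest count per word. The sketch below models that overwrite behaviour with an immutable Map instead of Redis; HsetModel and applyUpdates are hypothetical names.

```scala
object HsetModel {
  // Applies HSET-style writes to one hash: field -> value, later writes win.
  def applyUpdates(updates: Seq[(String, Int)]): Map[String, String] =
    updates.foldLeft(Map.empty[String, String]) { case (hash, (word, count)) =>
      hash + (word -> count.toString)
    }

  def main(args: Array[String]): Unit =
    println(applyUpdates(Seq(("this", 1), ("is", 1), ("this", 2))))
}
```

On the Redis side, HGETALL wordcounts would be the usual way to inspect the stored result.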
Kafka Integration
Again, add the dependency first:
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka_2.11</artifactId>
<version>1.10.0</version>
</dependency>
Approach 1:
package com.baizhi.flinkKafka
import java.lang
import java.util.Properties
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.Semantic
import org.apache.flink.streaming.connectors.kafka.{FlinkKafkaConsumer, FlinkKafkaProducer, KafkaDeserializationSchema, KafkaSerializationSchema}
import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.clients.producer.ProducerRecord
object KafkaAndFlink {
  def main(args: Array[String]): Unit = {
    // Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Kafka connection properties
    val properties = new Properties()
    properties.setProperty("bootstrap.servers", "SparkTwo:9092") // Kafka broker address
    properties.setProperty("group.id", "g1") // consumer group
    val text = env.addSource(new FlinkKafkaConsumer[(String, String, Int, Long)]("topics01", new userde(),
      properties))
    // Kafka producer used as the sink
    val value = new FlinkKafkaProducer[(String, Int)]("defult_topic",
      new UserDefinedKafkaSerializationSchema(), properties, Semantic.AT_LEAST_ONCE)
    val counts = text.flatMap(t => t._2.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)
    //counts.print()
    counts.addSink(value)
    env.execute("Window Stream WordCount")
  }
}
// Decides which topic each record is written to and what its key and value are
class UserDefinedKafkaSerializationSchema extends KafkaSerializationSchema[(String, Int)] {
  override def serialize(t: (String, Int), aLong: lang.Long): ProducerRecord[Array[Byte], Array[Byte]] = {
    new ProducerRecord("topic01", t._1.getBytes(), t._2.toString.getBytes())
  }
}
// Kafka deserialization schema; see the Kafka source section earlier
class userde extends KafkaDeserializationSchema[(String, String, Int, Long)] {
  override def isEndOfStream(t: (String, String, Int, Long)): Boolean = false
  override def deserialize(consumerRecord: ConsumerRecord[Array[Byte], Array[Byte]]): (String, String, Int, Long) = {
    // Decode the bytes with new String(...); toString on a byte array does not return its content
    if (consumerRecord.key() != null) {
      (new String(consumerRecord.key()), new String(consumerRecord.value()), consumerRecord.partition(), consumerRecord.offset())
    } else {
      (null, new String(consumerRecord.value()), consumerRecord.partition(), consumerRecord.offset())
    }
  }
  override def getProducedType: TypeInformation[(String, String, Int, Long)] = {
    createTypeInformation[(String, String, Int, Long)]
  }
}
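Note:
The "defult_topic" passed to the FlinkKafkaProducer constructor above is only the default topic. It has no effect here, because every ProducerRecord built in serialize() names its own target topic, topic01.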
Approach 2:
import java.util.Properties
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer.Semantic
import org.apache.flink.streaming.util.serialization.KeyedSerializationSchema
import org.apache.kafka.clients.producer.ProducerConfig
object KafkaAndFlinkTwo {
  def main(args: Array[String]): Unit = {
    //1. Create the stream execution environment
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    //2. Create the DataStream
    val text = env.readTextFile("hdfs://SparkTwo:9000/demo/words")
    val props = new Properties()
    props.setProperty(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "SparkTwo:9092")
    props.setProperty(ProducerConfig.BATCH_SIZE_CONFIG, "100") // maximum size of a producer batch, in bytes
    props.setProperty(ProducerConfig.LINGER_MS_CONFIG, "500") // how long the producer waits to fill a batch, in ms
    // Semantic.EXACTLY_ONCE: writes through Kafka transactions that are committed on checkpoints
    // Semantic.AT_LEAST_ONCE: flushes pending records on checkpoints and relies on Kafka producer retries
    val kafkaSink = new FlinkKafkaProducer[(String, Int)]("defult_topic",
      new UserDefinedKeyedSerializationSchema, props, Semantic.AT_LEAST_ONCE)
    //3. Apply the DataStream transformations
    val counts = text.flatMap(line => line.split("\\s+"))
      .map(word => (word, 1))
      .keyBy(0)
      .sum(1)
    //4. Write the results to Kafka
    counts.addSink(kafkaSink)
    // counts.print()
    //5. Run the streaming job
    env.execute("Window Stream WordCount")
  }
}
class UserDefinedKeyedSerializationSchema extends KeyedSerializationSchema[(String, Int)] {
  override def serializeKey(t: (String, Int)): Array[Byte] = { t._1.getBytes() }
  override def serializeValue(t: (String, Int)): Array[Byte] = { t._2.toString.getBytes() }
  // Overrides the producer's default topic; returning null sends the record to the default topic
  override def getTargetTopic(t: (String, Int)): String = {
    "topic01"
  }
}
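The topic fallback described in the getTargetTopic comment can be written down as a one-line rule. The sketch below is a plain Scala model of that rule, not the connector's actual code; TopicResolution and resolve are hypothetical names.

```scala
object TopicResolution {
  // A null target topic means "use the default topic given to the producer".
  def resolve(targetTopic: String, defaultTopic: String): String =
    Option(targetTopic).getOrElse(defaultTopic)

  def main(args: Array[String]): Unit = {
    println(resolve("topic01", "defult_topic")) // schema names a topic
    println(resolve(null, "defult_topic"))      // schema returns null
  }
}
```

In the job above getTargetTopic always returns "topic01", so the default topic is never used.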