【Flink】(10) Table API and SQL

A note before we begin: I'm 「雲祁」, a big-data developer who loves technology and also writes poetry. The nickname comes from a line in a poem by Wang Anshi, 「雲之祁祁,或雨於淵」, which I am very fond of.


I write this blog partly to summarize and record my own learning, and partly in the hope of helping more people who are interested in big data. If you are also interested in data middle platforms, data modeling, data analytics, or Flink/Spark/Hadoop/data warehouse development, you can follow me at https://blog.csdn.net/BeiisBei and we can mine the value of data together~


Make a little progress every day. Life is not about surpassing others, but about surpassing yourself! (ง •_•)ง

The Table API is a relational API shared by stream and batch processing: a Table API query can run on streaming or batch input without any modification. The Table API is a superset of the SQL language and is designed specifically for Apache Flink. It is a language-integrated API for Scala and Java: unlike regular SQL, where queries are specified as strings, Table API queries are defined in an embedded style in Java or Scala, with IDE support such as auto-completion and syntax checking.

1. Required pom dependency

<dependency>
 <groupId>org.apache.flink</groupId>
 <artifactId>flink-table_2.11</artifactId>
 <version>1.7.2</version>
</dependency>
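
For Flink 1.7.x, flink-table alone is usually not enough: a streaming Table program written in Scala also needs Flink's Scala APIs on the classpath. The extra entries would look like the following, assuming the same Scala 2.11 build as above:

<dependency>
 <groupId>org.apache.flink</groupId>
 <artifactId>flink-scala_2.11</artifactId>
 <version>1.7.2</version>
</dependency>
<dependency>
 <groupId>org.apache.flink</groupId>
 <artifactId>flink-streaming-scala_2.11</artifactId>
 <version>1.7.2</version>
</dependency>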

2. A first look at the Table API

import com.alibaba.fastjson.JSON
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011
import org.apache.flink.table.api.{Table, TableEnvironment}
import org.apache.flink.table.api.scala._

def main(args: Array[String]): Unit = {
  val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

  val myKafkaConsumer: FlinkKafkaConsumer011[String] = MyKafkaUtil.getConsumer("ECOMMERCE")
  val dstream: DataStream[String] = env.addSource(myKafkaConsumer)

  val tableEnv: StreamTableEnvironment = TableEnvironment.getTableEnvironment(env)

  // Parse each JSON string into an EcommerceLog
  val ecommerceLogDstream: DataStream[EcommerceLog] =
    dstream.map { jsonString => JSON.parseObject(jsonString, classOf[EcommerceLog]) }

  // Create a dynamic table from the stream; field names come from the case class
  val ecommerceLogTable: Table = tableEnv.fromDataStream(ecommerceLogDstream)

  // Project the mid and ch fields, keeping only the "appstore" channel
  val table: Table = ecommerceLogTable.select("mid,ch").filter("ch='appstore'")

  // Convert the result table back into an append stream and print it
  val midchDataStream: DataStream[(String, String)] =
    table.toAppendStream[(String, String)]

  midchDataStream.print()
  env.execute()
}
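
MyKafkaUtil and EcommerceLog are project-specific classes that this post never shows; they are not part of Flink. A minimal sketch of what they might look like, with the broker address and consumer group assumed, and the case class fields inferred from the fromDataStream calls later in this post:

import java.util.Properties
import org.apache.flink.api.common.serialization.SimpleStringSchema
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011

// Hypothetical helper: builds a Kafka consumer for the given topic
object MyKafkaUtil {
  def getConsumer(topic: String): FlinkKafkaConsumer011[String] = {
    val prop = new Properties()
    prop.setProperty("bootstrap.servers", "localhost:9092") // assumed broker address
    prop.setProperty("group.id", "ecommerce-group")         // assumed consumer group
    new FlinkKafkaConsumer011[String](topic, new SimpleStringSchema(), prop)
  }
}

// Hypothetical reconstruction: fields inferred from the fromDataStream calls below
case class EcommerceLog(mid: String, uid: String, appid: String, area: String,
                        os: String, ch: String, logType: String, vs: String,
                        logDate: String, logHour: String, logHourMinute: String,
                        ts: Long)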

2.1 Dynamic tables

If the data type in the stream is a case class, a table can be generated directly from the structure of the case class:

tableEnv.fromDataStream(ecommerceLogDstream)

Or the fields can be named individually, in field order:

tableEnv.fromDataStream(ecommerceLogDstream, 'mid, 'uid, ...)

Finally, the dynamic table can be converted into a stream for output:

table.toAppendStream[(String,String)]

2.2 Fields

A single quote placed in front of a name marks it as a field, e.g. 'name, 'mid, 'amount.
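
These field references are ordinary expressions, so they can also be renamed in a select. A small sketch (the alias names machineId and channel are made up for illustration):

// Select fields and rename them with 'as'
val renamed: Table = ecommerceLogTable.select('mid as 'machineId, 'ch as 'channel)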

3. Window aggregation with the Table API

3.1 Understanding the Table API through an example

// Count events per channel every 10 seconds
// (in addition to the imports from the previous example:)
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api._

def main(args: Array[String]): Unit = {

  // Execution environment (the Flink counterpart of a SparkContext)
  val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

  // Switch the time characteristic to event time
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

  val myKafkaConsumer: FlinkKafkaConsumer011[String] = MyKafkaUtil.getConsumer("ECOMMERCE")
  val dstream: DataStream[String] = env.addSource(myKafkaConsumer)

  val ecommerceLogDstream: DataStream[EcommerceLog] =
    dstream.map { jsonString => JSON.parseObject(jsonString, classOf[EcommerceLog]) }

  // Tell Flink how to extract the event time and generate watermarks
  val ecommerceLogWithEventTimeDStream: DataStream[EcommerceLog] =
    ecommerceLogDstream.assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[EcommerceLog](Time.seconds(0L)) {
        override def extractTimestamp(element: EcommerceLog): Long = element.ts
      }).setParallelism(1)

  val tableEnv: StreamTableEnvironment = TableEnvironment.getTableEnvironment(env)

  // Convert the stream into a Table, declaring ts as the event-time (rowtime) attribute
  val ecommerceTable: Table = tableEnv.fromDataStream(ecommerceLogWithEventTimeDStream,
    'mid, 'uid, 'appid, 'area, 'os, 'ch, 'logType, 'vs, 'logDate, 'logHour, 'logHourMinute, 'ts.rowtime)

  // Query with the Table API:
  // 1. groupBy; 2. a window is required; 3. event time determines the window boundaries
  val resultTable: Table = ecommerceTable
    .window(Tumble over 10000.millis on 'ts as 'tt)
    .groupBy('ch, 'tt)
    .select('ch, 'ch.count)

  // Convert the Table back into a (retract) stream
  val resultDstream: DataStream[(Boolean, (String, Long))] =
    resultTable.toRetractStream[(String, Long)]

  resultDstream.filter(_._1).print()

  env.execute()
}

3.2 Notes on groupBy

  1. If groupBy is used, the table can only be converted to a stream with toRetractStream:

val rDstream: DataStream[(Boolean, (String, Long))] = table
  .toRetractStream[(String, Long)]

  2. In the stream produced by toRetractStream, the first Boolean field flags each record: true marks the latest data (an insert), false marks expired old data (a delete):

val rDstream: DataStream[(Boolean, (String, Long))] = table
  .toRetractStream[(String, Long)]
rDstream.filter(_._1).print()

  3. If the query uses a time window, the window field must appear in the groupBy:

val table: Table = ecommerceLogTable
  .filter("ch = 'appstore'")
  .window(Tumble over 10000.millis on 'ts as 'tt)
  .groupBy('ch, 'tt)
  .select("ch, ch.count")

3.3 Notes on time windows

  1. To use a time window, the time field must be declared in advance. For processing time, simply append it when creating the dynamic table:

val ecommerceLogTable: Table = tableEnv
  .fromDataStream(ecommerceLogWithEtDstream,
    'mid, 'uid, 'appid, 'area, 'os, 'ch, 'logType, 'vs, 'logDate, 'logHour, 'logHourMinute, 'ps.proctime)

  2. For event time, the field must likewise be declared when the dynamic table is created:

val ecommerceLogTable: Table = tableEnv
  .fromDataStream(ecommerceLogWithEtDstream,
    'mid, 'uid, 'appid, 'area, 'os, 'ch, 'logType, 'vs, 'logDate, 'logHour, 'logHourMinute, 'ts.rowtime)

  3. A tumbling window is expressed as Tumble over 10000.millis on 'ts as 'tt (a sliding-window variant is sketched after this list):

val table: Table = ecommerceLogTable.filter("ch = 'appstore'")
  .window(Tumble over 10000.millis on 'ts as 'tt)
  .groupBy('ch, 'tt)
  .select("ch, ch.count")

4. How to write SQL

def main(args: Array[String]): Unit = {
  // Execution environment (the Flink counterpart of a SparkContext)
  val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment

  // Switch the time characteristic to event time
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)

  val myKafkaConsumer: FlinkKafkaConsumer011[String] = MyKafkaUtil.getConsumer("ECOMMERCE")
  val dstream: DataStream[String] = env.addSource(myKafkaConsumer)

  val ecommerceLogDstream: DataStream[EcommerceLog] =
    dstream.map { jsonString => JSON.parseObject(jsonString, classOf[EcommerceLog]) }

  // Tell Flink how to extract the event time and generate watermarks
  val ecommerceLogWithEventTimeDStream: DataStream[EcommerceLog] =
    ecommerceLogDstream.assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[EcommerceLog](Time.seconds(0L)) {
        override def extractTimestamp(element: EcommerceLog): Long = element.ts
      }).setParallelism(1)

  // Table environment (the Flink counterpart of a SparkSession)
  val tableEnv: StreamTableEnvironment = TableEnvironment.getTableEnvironment(env)

  // Convert the stream into a Table, declaring ts as the event-time (rowtime) attribute
  val ecommerceTable: Table = tableEnv.fromDataStream(ecommerceLogWithEventTimeDStream,
    'mid, 'uid, 'appid, 'area, 'os, 'ch, 'logType, 'vs, 'logDate, 'logHour, 'logHourMinute, 'ts.rowtime)

  // The query via the Table API:
  // count per channel every 10 seconds (groupBy + window + event time)
  val resultTable: Table = ecommerceTable
    .window(Tumble over 10000.millis on 'ts as 'tt)
    .groupBy('ch, 'tt)
    .select('ch, 'ch.count)

  // The same query via SQL
  val resultSQLTable: Table = tableEnv.sqlQuery(
    "select ch, count(ch) from " + ecommerceTable + " group by ch, TUMBLE(ts, INTERVAL '10' SECOND)")

  // Convert the Table back into a data stream
  //val appstoreDStream: DataStream[(String, String, Long)] = appstoreTable.toAppendStream[(String, String, Long)]

  val resultDstream: DataStream[(Boolean, (String, Long))] =
    resultSQLTable.toRetractStream[(String, Long)]

  resultDstream.filter(_._1).print()

  env.execute()
}
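
Concatenating the Table object into the SQL string works in Flink 1.7 because the table is registered under a generated name, but it is often clearer to register it explicitly and refer to it by name. A sketch of the equivalent query (the table name "ecommerce" is chosen here purely for illustration):

// Register the Table under an explicit name, then reference that name in SQL
tableEnv.registerTable("ecommerce", ecommerceTable)
val resultByName: Table = tableEnv.sqlQuery(
  "SELECT ch, COUNT(ch) FROM ecommerce GROUP BY ch, TUMBLE(ts, INTERVAL '10' SECOND)")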