A few words up front: I'm 「雲祁」 (Yun Qi), a big-data developer who loves technology and writes the occasional poem. The pen name comes from a line in a poem by Wang Anshi, 「雲之祁祁,或雨於淵」, of which I am very fond.
I blog partly to summarize and record my own learning, and partly in the hope of helping more people who are interested in big data. If you are also interested in data platforms, data modeling, data analysis, or Flink/Spark/Hadoop/data-warehouse development, you can follow me at https://blog.csdn.net/BeiisBei and we can mine the value of data together~
Make a little progress every day. Life is not about surpassing others, but about surpassing yourself! (ง •_•)ง
The Table API is a relational API shared by stream and batch processing: a Table API query can run on streaming or batch input without any modification. The Table API is a superset of the SQL language, designed specifically for Apache Flink, and it is a language-integrated API for Scala and Java. Unlike regular SQL, where a query is specified as a string, Table API queries are defined in an embedded style in Java or Scala, with IDE support such as auto-completion and syntax checking.
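As a minimal sketch of that difference in style (assuming a tableEnv and a registered table orders with fields userId and amount; these names are illustrative, not from the original):

// assumes: import org.apache.flink.table.api.scala._ (for the ' field syntax)
// Table API: the query is plain Scala, so the IDE can type-check and auto-complete it
val byUser: Table = orders
  .filter('amount > 100)
  .groupBy('userId)
  .select('userId, 'amount.sum as 'total)

// SQL: the equivalent query is an opaque string, parsed only at runtime
val byUserSql: Table = tableEnv.sqlQuery(
  "SELECT userId, SUM(amount) AS total FROM orders WHERE amount > 100 GROUP BY userId")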
1. The Required pom Dependency
<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-table_2.11</artifactId>
    <version>1.7.2</version>
</dependency>
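For the Scala examples below, the streaming Scala API also has to be on the classpath; in Flink 1.7 that typically means adding a dependency along these lines as well (artifact and version assumed to match the one above):

<dependency>
    <groupId>org.apache.flink</groupId>
    <artifactId>flink-streaming-scala_2.11</artifactId>
    <version>1.7.2</version>
</dependency>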
2. A First Look at the Table API
import com.alibaba.fastjson.JSON
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer011
import org.apache.flink.table.api.{Table, TableEnvironment}
import org.apache.flink.table.api.scala._

def main(args: Array[String]): Unit = {
  val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
  // Kafka source for the ECOMMERCE topic
  val myKafkaConsumer: FlinkKafkaConsumer011[String] = MyKafkaUtil.getConsumer("ECOMMERCE")
  val dstream: DataStream[String] = env.addSource(myKafkaConsumer)
  val tableEnv: StreamTableEnvironment = TableEnvironment.getTableEnvironment(env)
  // parse each JSON record into an EcommerceLog case class
  val ecommerceLogDstream: DataStream[EcommerceLog] =
    dstream.map { jsonString => JSON.parseObject(jsonString, classOf[EcommerceLog]) }
  // turn the stream into a dynamic table
  val ecommerceLogTable: Table = tableEnv.fromDataStream(ecommerceLogDstream)
  // select two fields and keep only the appstore channel
  val table: Table = ecommerceLogTable.select("mid,ch").filter("ch = 'appstore'")
  // convert the table back into an append stream and print it
  val midchDataStream: DataStream[(String, String)] = table.toAppendStream[(String, String)]
  midchDataStream.print()
  env.execute()
}
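The post never shows the EcommerceLog definition itself. A plausible reconstruction, with the field names taken from the fromDataStream(...) calls later in the post, would look like this (the field types are assumptions):

// hypothetical reconstruction: names come from the examples, types are guesses
case class EcommerceLog(mid: String, uid: String, appid: String, area: String,
                        os: String, ch: String, logType: String, vs: String,
                        logDate: String, logHour: String, logHourMinute: String,
                        ts: Long)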
2.1 Dynamic Tables
If the data type in the stream is a case class, a table can be generated directly from the structure of the case class:
tableEnv.fromDataStream(ecommerceLogDstream)
Alternatively, the fields can be named one by one, in field order:
tableEnv.fromDataStream(ecommerceLogDstream, 'mid, 'uid .......)
Finally, the dynamic table can be converted into a stream for output:
table.toAppendStream[(String,String)]
2.2 Fields
A single quote placed in front of a field marks it as a field name, e.g. 'name, 'mid, 'amount, and so on.
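These symbols work anywhere an expression is expected, including renaming with as (a minimal sketch; deviceId and channel are hypothetical aliases for the table from the example above):

// 'mid and 'ch are Scala symbols that the Table API resolves against the table's columns
val renamed: Table = ecommerceLogTable.select('mid as 'deviceId, 'ch as 'channel)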
3. Window Aggregation with the Table API
3.1 Understanding the Table API Through an Example
// every 10 seconds, count the number of records per channel
def main(args: Array[String]): Unit = {
  // streaming execution environment (the Flink counterpart of a SparkContext)
  val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
  // switch the time characteristic to event time
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
  val myKafkaConsumer: FlinkKafkaConsumer011[String] = MyKafkaUtil.getConsumer("ECOMMERCE")
  val dstream: DataStream[String] = env.addSource(myKafkaConsumer)
  val ecommerceLogDstream: DataStream[EcommerceLog] =
    dstream.map { jsonString => JSON.parseObject(jsonString, classOf[EcommerceLog]) }
  // tell Flink how to extract the event time and generate watermarks
  val ecommerceLogWithEventTimeDStream: DataStream[EcommerceLog] = ecommerceLogDstream
    .assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[EcommerceLog](Time.seconds(0L)) {
        override def extractTimestamp(element: EcommerceLog): Long = element.ts
      }).setParallelism(1)
  val tableEnv: StreamTableEnvironment = TableEnvironment.getTableEnvironment(env)
  // turn the stream into a Table, declaring 'ts as the event-time (rowtime) field
  val ecommerceTable: Table = tableEnv.fromDataStream(ecommerceLogWithEventTimeDStream,
    'mid, 'uid, 'appid, 'area, 'os, 'ch, 'logType, 'vs, 'logDate, 'logHour, 'logHourMinute, 'ts.rowtime)
  // Table API query: 1) group by channel, 2) use a window, 3) open windows on event time
  val resultTable: Table = ecommerceTable
    .window(Tumble over 10000.millis on 'ts as 'tt)
    .groupBy('ch, 'tt)
    .select('ch, 'ch.count)
  // convert the Table back into a retract stream and keep only the inserts
  val resultDstream: DataStream[(Boolean, (String, Long))] = resultTable.toRetractStream[(String, Long)]
  resultDstream.filter(_._1).print()
  env.execute()
}
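Note that Time.seconds(0L) in the extractor above leaves no tolerance for out-of-order events: the watermark always equals the highest timestamp seen so far. If the stream can be mildly out of order, a bounded delay can be passed instead (a sketch with a hypothetical 5-second bound):

// allow events to arrive up to 5 seconds out of order before the watermark passes them by
val withBoundedDelay: DataStream[EcommerceLog] = ecommerceLogDstream
  .assignTimestampsAndWatermarks(
    new BoundedOutOfOrdernessTimestampExtractor[EcommerceLog](Time.seconds(5)) {
      override def extractTimestamp(element: EcommerceLog): Long = element.ts
    })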
3.2 About groupBy
- If groupBy is used, the table can only be converted to a stream with toRetractStream:
val rDstream: DataStream[(Boolean, (String, Long))] = table
.toRetractStream[(String,Long)]
- In the stream produced by toRetractStream, the leading Boolean field flags each record: true marks the latest data (an Insert), false marks expired old data (a Delete):
val rDstream: DataStream[(Boolean, (String, Long))] = table
.toRetractStream[(String,Long)]
rDstream.filter(_._1).print()
- If the query includes a time window, the window alias must appear in the groupBy:
val table: Table = ecommerceLogTable
  .filter("ch = 'appstore'")
  .window(Tumble over 10000.millis on 'ts as 'tt)
  .groupBy('ch, 'tt)
  .select("ch, ch.count")
3.3 About Time Windows
- To use a time window, the time field must be declared ahead of time. For processing time, it can simply be appended when the dynamic table is created:
val ecommerceLogTable: Table = tableEnv
  .fromDataStream(ecommerceLogWithEtDstream,
    'mid, 'uid, 'appid, 'area, 'os, 'ch, 'logType, 'vs, 'logDate, 'logHour, 'logHourMinute, 'ps.proctime)
- For event time, it has to be declared when the dynamic table is created:
val ecommerceLogTable: Table = tableEnv
  .fromDataStream(ecommerceLogWithEtDstream,
    'mid, 'uid, 'appid, 'area, 'os, 'ch, 'logType, 'vs, 'logDate, 'logHour, 'logHourMinute, 'ts.rowtime)
- A tumbling window is expressed with Tumble over 10000.millis on 'ts:
val table: Table = ecommerceLogTable
  .filter("ch = 'appstore'")
  .window(Tumble over 10000.millis on 'ts as 'tt)
  .groupBy('ch, 'tt)
  .select("ch, ch.count")
4. How to Write It in SQL
def main(args: Array[String]): Unit = {
  // streaming execution environment (the Flink counterpart of a SparkContext)
  val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
  // switch the time characteristic to event time
  env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
  val myKafkaConsumer: FlinkKafkaConsumer011[String] = MyKafkaUtil.getConsumer("ECOMMERCE")
  val dstream: DataStream[String] = env.addSource(myKafkaConsumer)
  val ecommerceLogDstream: DataStream[EcommerceLog] =
    dstream.map { jsonString => JSON.parseObject(jsonString, classOf[EcommerceLog]) }
  // tell Flink how to extract the event time and generate watermarks
  val ecommerceLogWithEventTimeDStream: DataStream[EcommerceLog] = ecommerceLogDstream
    .assignTimestampsAndWatermarks(
      new BoundedOutOfOrdernessTimestampExtractor[EcommerceLog](Time.seconds(0L)) {
        override def extractTimestamp(element: EcommerceLog): Long = element.ts
      }).setParallelism(1)
  // table environment (the Flink counterpart of a SparkSession)
  val tableEnv: StreamTableEnvironment = TableEnvironment.getTableEnvironment(env)
  // turn the stream into a Table, declaring 'ts as the event-time (rowtime) field
  val ecommerceTable: Table = tableEnv.fromDataStream(ecommerceLogWithEventTimeDStream,
    'mid, 'uid, 'appid, 'area, 'os, 'ch, 'logType, 'vs, 'logDate, 'logHour, 'logHourMinute, 'ts.rowtime)
  // the Table API version from section 3, kept here for comparison
  val resultTable: Table = ecommerceTable
    .window(Tumble over 10000.millis on 'ts as 'tt)
    .groupBy('ch, 'tt)
    .select('ch, 'ch.count)
  // the same 10-second per-channel count, this time expressed in SQL;
  // concatenating the Table object into the string registers it under a generated name
  val resultSQLTable: Table = tableEnv.sqlQuery(
    "select ch, count(ch) from " + ecommerceTable + " group by ch, Tumble(ts, interval '10' SECOND)")
  // convert the Table back into a retract stream and keep only the inserts
  val resultDstream: DataStream[(Boolean, (String, Long))] = resultSQLTable.toRetractStream[(String, Long)]
  resultDstream.filter(_._1).print()
  env.execute()
}
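Alternatively, instead of splicing the Table object into the query string, the table can be registered under an explicit name; that also makes it easy to expose the window bounds with the standard TUMBLE_START/TUMBLE_END helpers. A sketch under the same assumptions as the example above:

// register the table under an explicit name, then query it by that name
tableEnv.registerTable("ecommerceLog", ecommerceTable)
val windowedSql: Table = tableEnv.sqlQuery(
  """SELECT ch,
    |       COUNT(ch) AS cnt,
    |       TUMBLE_START(ts, INTERVAL '10' SECOND) AS windowStart
    |FROM ecommerceLog
    |GROUP BY ch, TUMBLE(ts, INTERVAL '10' SECOND)""".stripMargin)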