GitHub
https://github.com/SmallScorpion/flink-tutorial.git
Functions
User-Defined Functions (UDF)
- User-defined functions (UDFs) are an important feature: they significantly extend the expressive power of queries. When a requirement cannot be met by the built-in system functions, we can implement it ourselves as a UDF.
- In most cases, a user-defined function must be registered before it can be used in a query.
- Functions are registered in the TableEnvironment by calling its registerFunction() method. When a user-defined function is registered, it is inserted into the TableEnvironment's function catalog, so that the Table API or SQL parser can recognize it and interpret it correctly.
Scalar Functions
- A user-defined scalar function maps zero, one, or more scalar values to a new scalar value.
- To define a scalar function, extend the base class ScalarFunction in org.apache.flink.table.functions and implement one or more evaluation methods.
- The behavior of a scalar function is determined by its evaluation methods, which must be declared public and named eval.
import com.atguigu.bean.SensorReading
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.ScalarFunction
import org.apache.flink.types.Row
object ScalarFunctionTest {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
// use event-time semantics
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// create the table environment
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env)
val inputDStream: DataStream[String] = env.readTextFile("D:\\MyWork\\WorkSpaceIDEA\\flink-tutorial\\src\\main\\resources\\SensorReading.txt")
val dataDStream: DataStream[SensorReading] = inputDStream.map(
data => {
val dataArray: Array[String] = data.split(",")
SensorReading(dataArray(0), dataArray(1).toLong, dataArray(2).toDouble)
})
.assignTimestampsAndWatermarks( new BoundedOutOfOrdernessTimestampExtractor[SensorReading]
( Time.seconds(1) ) {
override def extractTimestamp(element: SensorReading): Long = element.timestamp * 1000L
} )
// define the event-time attribute with rowtime
val dataTable: Table = tableEnv
.fromDataStream(dataDStream, 'id, 'temperature, 'timestamp.rowtime as 'ts)
// create an instance of the custom hash function
val myHashCode = MyHashCode(1.23)
// query via the Table API
val resultTable: Table = dataTable
.select('id, 'ts, myHashCode('id)) // select id, ts and the hash of id
// SQL style: first register the table
tableEnv.createTemporaryView("dataTable", dataTable)
// register the function
tableEnv.registerFunction("myHashCode", myHashCode)
val resultSqlTable: Table = tableEnv.sqlQuery(
"""
|select id, ts, myHashCode(id)
|from dataTable
|""".stripMargin)
// print the results
resultTable.toAppendStream[ Row ].print( "scalar" )
resultSqlTable.toAppendStream[ Row ].print( "scalar_sql" )
// print the table schema
dataTable.printSchema()
env.execute("scalar function test job")
}
}
// a custom scalar function that computes a (scaled) hash code
case class MyHashCode(factor: Double) extends ScalarFunction{
def eval( value: String ): Int ={
(value.hashCode * factor).toInt
}
}
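The eval logic of MyHashCode can be checked without a Flink runtime. The following plain-Scala sketch (the 1.23 factor matches the example above; the class here is a standalone copy, not Flink's ScalarFunction) shows the deterministic string-to-Int mapping:

```scala
// Plain-Scala sketch of the MyHashCode eval logic: hash the string,
// scale by a factor, truncate to Int. No Flink dependencies needed.
case class MyHashCode(factor: Double) {
  def eval(value: String): Int = (value.hashCode * factor).toInt
}

val hasher = MyHashCode(1.23)
// the same input always yields the same hash value
println(hasher.eval("sensor_1"))
```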
Table Functions
- A user-defined table function also takes zero, one, or more scalar values as input; unlike a scalar function, however, it can return any number of rows as output instead of a single value.
- To define a table function, extend the base class TableFunction in org.apache.flink.table.functions and implement one or more evaluation methods.
- The behavior of a table function is determined by its evaluation methods, which must be public and named eval.
import com.atguigu.bean.SensorReading
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.TableFunction
import org.apache.flink.types.Row
object TableFunctionTest {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
// use event-time semantics
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// create the table environment
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env)
val inputDStream: DataStream[String] = env.readTextFile("D:\\MyWork\\WorkSpaceIDEA\\flink-tutorial\\src\\main\\resources\\SensorReading.txt")
val dataDStream: DataStream[SensorReading] = inputDStream.map(
data => {
val dataArray: Array[String] = data.split(",")
SensorReading(dataArray(0), dataArray(1).toLong, dataArray(2).toDouble)
})
.assignTimestampsAndWatermarks( new BoundedOutOfOrdernessTimestampExtractor[SensorReading]
( Time.seconds(1) ) {
override def extractTimestamp(element: SensorReading): Long = element.timestamp * 1000L
} )
// define the event-time attribute with rowtime
val dataTable: Table = tableEnv
.fromDataStream(dataDStream, 'id, 'temperature, 'timestamp.rowtime as 'ts)
// first create a UDF instance
val mySplit = MySplit("_")
// query via the Table API
val resultTable: Table = dataTable
// split id into (word, length) fields
.joinLateral( mySplit('id) as ('word, 'length) )
.select('id, 'ts, 'word, 'length)
// SQL style: first register the table
tableEnv.createTemporaryView("dataTable", dataTable)
// register the function
tableEnv.registerFunction("mySplit", mySplit)
val resultSqlTable: Table = tableEnv.sqlQuery(
"""
|select id, ts, word, length
|from dataTable, lateral table(mySplit(id)) as splitid(word, length)
|""".stripMargin)
// print the results
resultTable.toAppendStream[ Row ].print( "table" )
resultSqlTable.toAppendStream[ Row ].print( "table_sql" )
// print the table schema
dataTable.printSchema()
env.execute("table function test job")
}
}
// a custom TableFunction that splits a string and emits (word, length) pairs
case class MySplit( separator: String ) extends TableFunction[ (String, Int) ] {
def eval( str: String ): Unit ={
str.split( separator ).foreach(
word => collector.collect( (word, word.length) )
)
}
}
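The row expansion that MySplit performs can be previewed in plain Scala: each input string becomes one (word, length) row per separator-delimited token. This sketch mirrors what collector.collect emits, collected into a List instead:

```scala
// Plain-Scala sketch of MySplit's eval: split on the separator and
// produce one (word, length) pair per token.
def splitToRows(str: String, separator: String): List[(String, Int)] =
  str.split(separator).toList.map(word => (word, word.length))

// "sensor_1" expands into two rows
println(splitToRows("sensor_1", "_")) // List((sensor,6), (1,1))
```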
Aggregate Functions
- A user-defined aggregate function (UDAGG) aggregates the data of a table into a single scalar value.
- User-defined aggregate functions are implemented by extending the AggregateFunction abstract class.
- AggregateFunction requires the following methods to be implemented: createAccumulator(), accumulate(), and getValue().
- How AggregateFunction works: first, it needs an accumulator, the data structure that holds the intermediate result of the aggregation; an empty accumulator is created by calling createAccumulator(). Then, the function's accumulate() method is called for each input row to update the accumulator. Once all rows have been processed, the function's getValue() method is called to compute and return the final result.
import com.atguigu.bean.SensorReading
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.AggregateFunction
import org.apache.flink.types.Row
object AggregateFunctionTest {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
// use event-time semantics
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// create the table environment
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env)
val inputDStream: DataStream[String] = env.readTextFile("D:\\MyWork\\WorkSpaceIDEA\\flink-tutorial\\src\\main\\resources\\SensorReading.txt")
val dataDStream: DataStream[SensorReading] = inputDStream.map(
data => {
val dataArray: Array[String] = data.split(",")
SensorReading(dataArray(0), dataArray(1).toLong, dataArray(2).toDouble)
})
.assignTimestampsAndWatermarks( new BoundedOutOfOrdernessTimestampExtractor[SensorReading]
( Time.seconds(1) ) {
override def extractTimestamp(element: SensorReading): Long = element.timestamp * 1000L
} )
// define the event-time attribute with rowtime
val dataTable: Table = tableEnv
.fromDataStream(dataDStream, 'id, 'temperature, 'timestamp.rowtime as 'ts)
// first create a UDF instance
val avgTemp = AvgTempFunc()
// query via the Table API
val resultTable: Table = dataTable
.groupBy('id)
.aggregate( avgTemp('temperature) as 'avgTemp )
.select('id, 'avgTemp)
// SQL style: first register the table
tableEnv.createTemporaryView("dataTable", dataTable)
// register the function
tableEnv.registerFunction("avgTemp", avgTemp)
val resultSqlTable: Table = tableEnv.sqlQuery(
"""
|select id, avgTemp(temperature)
|from dataTable
|group by id
|""".stripMargin)
// print the results
resultTable.toRetractStream[ Row ].print( "agg" )
resultSqlTable.toRetractStream[ Row ].print( "agg_sql" )
// print the table schema
dataTable.printSchema()
env.execute("aggregate function test job")
}
}
// a dedicated state class that holds the aggregation state (sum, count)
case class AvgTempAcc() {
var sum: Double = 0.0
var count: Int = 0
}
// a custom aggregate function that computes the average temperature per sensor id
case class AvgTempFunc() extends AggregateFunction[Double, AvgTempAcc]{
override def getValue(accumulator: AvgTempAcc): Double = accumulator.sum/accumulator.count
override def createAccumulator(): AvgTempAcc = AvgTempAcc()
def accumulate(acc: AvgTempAcc, temp: Double): Unit ={
acc.sum += temp
acc.count += 1
}
}
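The createAccumulator/accumulate/getValue lifecycle described above can be exercised directly, without a Flink job. This standalone sketch (not extending Flink's AggregateFunction) folds three sample temperatures through the same accumulator logic:

```scala
// Plain-Scala sketch of the AggregateFunction lifecycle:
// create an empty accumulator, update it once per row, then read the result.
case class AvgAcc(var sum: Double = 0.0, var count: Int = 0)

def createAccumulator(): AvgAcc = AvgAcc()

def accumulate(acc: AvgAcc, temp: Double): Unit = {
  acc.sum += temp
  acc.count += 1
}

def getValue(acc: AvgAcc): Double = acc.sum / acc.count

val acc = createAccumulator()
Seq(35.8, 37.2, 33.0).foreach(t => accumulate(acc, t))
println(getValue(acc)) // average of the three sample readings
```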
Table Aggregate Functions
- A user-defined table aggregate function (UDTAGG) aggregates the data of a table into a result table with multiple rows and columns.
- User-defined table aggregate functions are implemented by extending the TableAggregateFunction abstract class.
- TableAggregateFunction requires the following methods to be implemented: createAccumulator(), accumulate(), and emitValue().
- How TableAggregateFunction works: first, it likewise needs an accumulator, the data structure that holds the intermediate result of the aggregation. An empty accumulator is created by calling createAccumulator(). Then, the function's accumulate() method is called for each input row to update the accumulator. Once all rows have been processed, the function's emitValue() method is called to compute and return the final results.
import com.atguigu.bean.SensorReading
import org.apache.flink.streaming.api.TimeCharacteristic
import org.apache.flink.streaming.api.functions.timestamps.BoundedOutOfOrdernessTimestampExtractor
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.table.api.Table
import org.apache.flink.table.api.scala._
import org.apache.flink.table.functions.TableAggregateFunction
import org.apache.flink.types.Row
import org.apache.flink.util.Collector
object TableAggregateFunctionTest {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
// use event-time semantics
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
// create the table environment
val tableEnv: StreamTableEnvironment = StreamTableEnvironment.create(env)
val inputDStream: DataStream[String] = env.readTextFile("D:\\MyWork\\WorkSpaceIDEA\\flink-tutorial\\src\\main\\resources\\SensorReading.txt")
val dataDStream: DataStream[SensorReading] = inputDStream.map(
data => {
val dataArray: Array[String] = data.split(",")
SensorReading(dataArray(0), dataArray(1).toLong, dataArray(2).toDouble)
})
.assignTimestampsAndWatermarks( new BoundedOutOfOrdernessTimestampExtractor[SensorReading]
( Time.seconds(1) ) {
override def extractTimestamp(element: SensorReading): Long = element.timestamp * 1000L
} )
// define the event-time attribute with rowtime
val dataTable: Table = tableEnv
.fromDataStream(dataDStream, 'id, 'temperature, 'timestamp.rowtime as 'ts)
// first create a UDF instance
val myAggTabTemp = MyAggTabTemp()
// query via the Table API
val resultTable: Table = dataTable
.groupBy('id)
.flatAggregate( myAggTabTemp('temperature) as ('temp, 'rank) )
.select('id, 'temp, 'rank)
// SQL style: first register the table
tableEnv.createTemporaryView("dataTable", dataTable)
// register the function
tableEnv.registerFunction("myAggTabTemp", myAggTabTemp)
// table aggregate functions cannot be invoked from SQL in this Flink version, so the SQL query stays commented out
/*
val resultSqlTable: Table = tableEnv.sqlQuery(
"""
|select id, temp, `rank`
|from dataTable, lateral table(myAggTabTemp(temperature)) as aggtab(temp, `rank`)
|group by id
|""".stripMargin)
*/
// print the results
resultTable.toRetractStream[ Row ].print( "tableAgg" )
//resultSqlTable.toAppendStream[ Row ].print( "tableAgg_sql" )
// print the table schema
dataTable.printSchema()
env.execute("table aggregate function test job")
}
}
// custom accumulator (state) class
case class AggTabTempAcc() {
var highestTemp: Double = Double.MinValue
var secondHighestTemp: Double = Double.MinValue
}
case class MyAggTabTemp() extends TableAggregateFunction[(Double, Int), AggTabTempAcc]{
// initialize the accumulator
override def createAccumulator(): AggTabTempAcc = new AggTabTempAcc()
// aggregation logic applied for each incoming element
def accumulate( acc: AggTabTempAcc, temp: Double ): Unit ={
// compare the current temperature with the stored highest and second-highest values, replacing them if it is larger
if( temp > acc.highestTemp ){
// higher than the current maximum: it takes first place, and the others shift down
acc.secondHighestTemp = acc.highestTemp
acc.highestTemp = temp
} else if( temp > acc.secondHighestTemp ){
acc.secondHighestTemp = temp
}
}
// emit the result rows into the output table
def emitValue( acc: AggTabTempAcc, out: Collector[(Double, Int)] ): Unit ={
out.collect((acc.highestTemp, 1))
out.collect((acc.secondHighestTemp, 2))
}
}
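The top-2 ranking logic in MyAggTabTemp can also be verified standalone. This sketch (not extending Flink's TableAggregateFunction; emitValue returns a Seq instead of using a Collector) runs the same accumulate comparisons and emits the two ranked rows:

```scala
// Plain-Scala sketch of the table aggregate's accumulate/emitValue logic:
// keep the two highest temperatures seen so far, then emit them with ranks.
class Top2Acc {
  var highestTemp: Double = Double.MinValue
  var secondHighestTemp: Double = Double.MinValue
}

def accumulate(acc: Top2Acc, temp: Double): Unit = {
  if (temp > acc.highestTemp) {
    // new maximum: the previous maximum shifts down to rank 2
    acc.secondHighestTemp = acc.highestTemp
    acc.highestTemp = temp
  } else if (temp > acc.secondHighestTemp) {
    acc.secondHighestTemp = temp
  }
}

def emitValue(acc: Top2Acc): Seq[(Double, Int)] =
  Seq((acc.highestTemp, 1), (acc.secondHighestTemp, 2))

val acc = new Top2Acc
Seq(35.8, 37.2, 33.0, 36.9).foreach(t => accumulate(acc, t))
println(emitValue(acc)) // List((37.2,1), (36.9,2))
```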