Flink UDF Auto-Registration in Practice

https://www.bilibili.com/video/AV36166554/

Recently, while updating some of the UDF-related features in our system, I ran into a few small problems that everyone is likely to hit sooner or later. I'm writing up the pitfalls I stepped into so that you can avoid them.

1. Registering a UDF

1.1 Registration methods

The UDF used here is a scalar function: it extends ScalarFunction, which in turn extends UserDefinedFunction. The function body is written by the user; the registration is something we implement ourselves. A function is registered through Flink's tableEnv context object, using the overloaded registerFunction method of TableEnvironment. This variant involves no extra parameters or generics:

/**
 * Registers a [[ScalarFunction]] under a unique name. Replaces already existing
 * user-defined functions under this name.
 */
def registerFunction(name: String, function: ScalarFunction): Unit = {
  // check if class could be instantiated
  checkForInstantiation(function.getClass)

  // register in Table API
  functionCatalog.registerFunction(name, function.getClass)

  // register in SQL API
  functionCatalog.registerSqlFunction(
    createScalarSqlFunction(name, name, function, typeFactory)
  )
}

As the method shows, after checking that the class can be instantiated, the function is registered twice: once for the Table API and once for the SQL API.

Here is a small registration example of our own:

The everyday way:

 tableEnv.registerFunction("hashCode",new HashCode())                             

myTable.select("item,item.hashCode(),hashCode(item)")      

val hcTest = tableEnv.sqlQuery("select item,hashCode(item) from myTable")

The holiday way (dynamic registration via reflection):

tableEnv.registerFunction(m("name").toString,
  ReflectHelper.newInstanceByClsName[ScalarFunction](
    m("className").toString, this.getClass.getClassLoader))
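ReflectHelper is our own utility class, not part of Flink. A minimal sketch of what its newInstanceByClsName might look like (hypothetical, two-argument form):

// Hypothetical sketch of the reflection helper: load a user-uploaded class
// by name with the given class loader and instantiate it via its no-arg constructor.
object ReflectHelper {
  def newInstanceByClsName[T](clsName: String, loader: ClassLoader): T =
    Class.forName(clsName, true, loader)
      .getDeclaredConstructor()
      .newInstance()
      .asInstanceOf[T]
}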

1.2 Example function

/**
 * Created by LX on 2018/11/15.
 */
class HashCode extends ScalarFunction {

  var hashcode_factor = 12

  override def open(context: FunctionContext): Unit = {
    // access "hashcode_factor" parameter
    // "12" would be the default value if parameter does not exist
    hashcode_factor = context.getJobParameter("hashcode_factor", "12").toInt
  }

  def eval(s: String): Int = {
    s.hashCode() + hashcode_factor
  }
}
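As an aside, the "hashcode_factor" parameter read in open() comes from the job's global parameters. A minimal sketch of supplying it, assuming the usual ParameterTool route:

import org.apache.flink.api.java.utils.ParameterTool
import org.apache.flink.streaming.api.scala.StreamExecutionEnvironment

val env = StreamExecutionEnvironment.getExecutionEnvironment

// expose "hashcode_factor" to FunctionContext.getJobParameter in open()
val params = ParameterTool.fromMap(
  java.util.Collections.singletonMap("hashcode_factor", "31"))
env.getConfig.setGlobalJobParameters(params)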

2. Registering a UDTF

2.1 Registration methods

With UDTFs and UDAFs, the function being registered comes with type constraints. Take the UDTF we want to register here: the Split class extends TableFunction[(String, Int)]. The compiler checks that generic at build time; later, at runtime, when Flink executes the UDTF, the values it serializes and deserializes are compared against the declared generic. If the data schema, or the data itself, does not match, or if the generic we declared passes compilation but, after type erasure, the expected field information turns out not to exist at runtime, the job fails just the same. So we need to be clear about where this problem really comes from; enough talk, on to the code.

The registerFunction method we need here comes from StreamTableEnvironment; please distinguish it from the TableEnvironment method shown earlier. Note that this class is used again later for UDAFs, because both variants add generic constraints, which introduces an intermediate checking step. From there it delegates to registerTableFunctionInternal in TableEnvironment. Both methods are given below.

StreamTableEnvironment

/**
 * Registers a [[TableFunction]] under a unique name in the TableEnvironment's catalog.
 * Registered functions can be referenced in SQL queries.
 *
 * @param name The name under which the function is registered.
 * @param tf The TableFunction to register
 */
def registerFunction[T: TypeInformation](name: String, tf: TableFunction[T]): Unit = {
  registerTableFunctionInternal(name, tf)
}

TableEnvironment

/**
 * Registers a [[TableFunction]] under a unique name. Replaces already existing
 * user-defined functions under this name.
 */
private[flink] def registerTableFunctionInternal[T: TypeInformation](
    name: String, function: TableFunction[T]): Unit = {
  // check if class not Scala object
  checkNotSingleton(function.getClass)
  // check if class could be instantiated
  checkForInstantiation(function.getClass)

  val typeInfo: TypeInformation[_] = if (function.getResultType != null) {
    function.getResultType
  } else {
    implicitly[TypeInformation[T]]
  }

  // register in Table API
  functionCatalog.registerFunction(name, function.getClass)

  // register in SQL API
  val sqlFunction = createTableSqlFunction(name, name, function, typeInfo, typeFactory)
  functionCatalog.registerSqlFunction(sqlFunction)
}

See it? The type parameter T pins down the type of the function we are registering, so be careful with it when registering dynamically: make sure the function's return type agrees with the generic declared at registration, so that the registration compiles and the function also runs correctly.
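One practical consequence of the code above: registerTableFunctionInternal prefers the function's own getResultType whenever it is non-null, so a UDTF that will be loaded dynamically can declare its result type itself instead of relying on an implicit at the call site. A minimal sketch (SplitWithType is a hypothetical variant, not from the article):

import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.api.scala.createTypeInformation
import org.apache.flink.table.functions.TableFunction

// Hypothetical variant: declare the result type explicitly so registration
// does not depend on the implicit TypeInformation[T] at the call site.
class SplitWithType(separator: String) extends TableFunction[(String, Int)] {
  def eval(str: String): Unit =
    str.split(separator).foreach(x => collect((x, x.length)))

  override def getResultType: TypeInformation[(String, Int)] =
    createTypeInformation[(String, Int)]
}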

2.2 Example function

/**
 * Created by lx on 2018/11/15.
 */
class Split(separator: String) extends TableFunction[(String, Int)] {

  def eval(str: String): Unit = {
    str.split(separator).foreach(x => collect((x, x.length)))
  }
}

The return type here is (String, Int). Since the class's generic type was already picked up at registration, all we have to do is bring the implicit conversions into scope before registering.
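In the Scala API used throughout this article, those implicits typically come from the standard imports:

import org.apache.flink.api.scala._        // implicit TypeInformation derivation for (String, Int)
import org.apache.flink.table.api.scala._  // expression DSL ('a, 'word, 'length) and Table conversions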

2.3 Registration

// register table, schema: [a: String]
tableEnv.registerDataStream("mySplit", textFiles, 'a)
val mySplit: Table = tableEnv.sqlQuery("select * from mySplit")
mySplit.printSchema()

// register the UDTF
val split = new Split(",")
val dslTable = mySplit.join(split('a) as ('word, 'length)).select('a, 'word, 'length)
val dslLeftTable = mySplit.leftOuterJoin(split('a) as ('word, 'length)).select('a, 'word, 'length)
tableEnv.registerFunction("split", split)
val sqlJoin = tableEnv.sqlQuery(
  "select a, item, counts from mySplit, LATERAL TABLE(split(a)) as T(item, counts)")
val sqlLeftJoin = tableEnv.sqlQuery(
  "select a, item, counts from mySplit LEFT JOIN LATERAL TABLE(split(a)) as T(item, counts) ON TRUE")

3. Registering a UDAF

3.1 Registration methods

Having read the two sections above: a plain UDF registers directly, and a UDTF needs its generic type pinned down at registration. A UDAF needs more than that. Take it easy, though; the pitfalls I stepped into are laid out below. The headline, up front: one extra class is involved, the accumulator, and that class is the one to be careful with.

The concrete approach is: first define a class with a few member fields to act as the accumulator helper for the AggregateFunction; the UDAF then uses that class and its members. Let's switch perspective and look at the classes to be registered first:

WeightedAvgAccum

/**
 * Created by lx on 2018/11/15.
 */
import java.lang.{Long => JLong, Integer => JInteger}
import org.apache.flink.api.java.tuple.{Tuple2 => JTuple2}

class WeightedAvgAccum extends JTuple2[JLong, JInteger] {
  var sum = 0L
  var count = 0
}

WeightedAvg

/**
 * Weighted Average user-defined aggregate function.
 */
class WeightedAvg extends AggregateFunction[JLong, WeightedAvgAccum] {

  override def createAccumulator(): WeightedAvgAccum = {
    new WeightedAvgAccum
  }

  override def getValue(acc: WeightedAvgAccum): JLong = {
    if (acc.count == 0) {
      null
    } else {
      acc.sum / acc.count
    }
  }

  def accumulate(acc: WeightedAvgAccum, iValue: JLong, iWeight: JInteger): Unit = {
    acc.sum += iValue * iWeight
    acc.count += iWeight
  }

  def retract(acc: WeightedAvgAccum, iValue: JLong, iWeight: JInteger): Unit = {
    acc.sum -= iValue * iWeight
    acc.count -= iWeight
  }

  def merge(acc: WeightedAvgAccum, it: java.lang.Iterable[WeightedAvgAccum]): Unit = {
    val iter = it.iterator()
    while (iter.hasNext) {
      val a = iter.next()
      acc.count += a.count
      acc.sum += a.sum
    }
  }

  def resetAccumulator(acc: WeightedAvgAccum): Unit = {
    acc.count = 0
    acc.sum = 0L
  }

  override def getAccumulatorType: TypeInformation[WeightedAvgAccum] = {
    new TupleTypeInfo(classOf[WeightedAvgAccum], Types.LONG, Types.INT)
  }

  override def getResultType: TypeInformation[JLong] = Types.LONG
}

3.2 Registration and the problems, explained

Thanks for bearing with those two classes; let's keep going. This is essentially the official example, and when we register it by hand with Flink there is no problem at all. So why the warning? Because in our scenario the user uploads the class to our system, and we obtain an instance via reflection and then register it. The manual path works, but when we let Flink recognize and register the class automatically, it cannot. To see why, first compare everyday usage with our automatic registration:

The everyday way:

 tableEnv.registerFunction("wAvg",new WeightedAvg())         

val  weightAvgTable = tableEnv.sqlQuery("select item,wAvg(points,counts) AS avgPoints FROM myTable GROUP BY item")

The holiday way:

implicit val infoTypes = TypeInformation.of(classOf[Object])
tableEnv.registerFunction[Object, Object](m("name").toString,
  ReflectHelper.newInstanceByClsName(m("className").toString,
    this.getClass.getClassLoader, sm.dac))

See the problem? The program is not you: it cannot infer the concrete types on its own; we have to narrow them down, or rather regularize the whole flow, ourselves. Even so, introducing an implicit conversion for Object only gets us past the compiler; at runtime it still blows up. Don't believe it? Look:

2018-11-22 19:49:40,872 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
    - groupBy: (item), window: (TumblingGroupWindow('w$, 'proctime, 5000.millis)),
      select: (item, udafWeightAvg(counts, points) AS c) -> select: (c, item) -> job_1542187919994
      -> Sink: Print to Std. Out (2/2) (3a9098c7ddb0a115349f6d89aba606ff) switched from RUNNING to FAILED.
org.apache.flink.types.NullFieldException: Field 0 is null, but expected to hold a value.
    at org.apache.flink.api.java.typeutils.runtime.TupleSerializer.serialize(TupleSerializer.java:127)
    at org.apache.flink.api.java.typeutils.runtime.TupleSerializer.serialize(TupleSerializer.java:30)
    at org.apache.flink.api.java.typeutils.runtime.RowSerializer.serialize(RowSerializer.java:160)
    at org.apache.flink.api.java.typeutils.runtime.RowSerializer.serialize(RowSerializer.java:46)
    at org.apache.flink.contrib.streaming.state.AbstractRocksDBState.getValueBytes(AbstractRocksDBState.java:171)
    at org.apache.flink.contrib.streaming.state.AbstractRocksDBAppendingState.updateInternal(AbstractRocksDBAppendingState.java:80)
    at org.apache.flink.contrib.streaming.state.RocksDBAggregatingState.add(RocksDBAggregatingState.java:105)
    at org.apache.flink.streaming.runtime.operators.windowing.WindowOperator.processElement(WindowOperator.java:391)
    at org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:202)
    at org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:105)
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:300)
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:711)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NullPointerException
    at org.apache.flink.api.common.typeutils.base.LongSerializer.serialize(LongSerializer.java:63)
    at org.apache.flink.api.common.typeutils.base.LongSerializer.serialize(LongSerializer.java:27)
    at org.apache.flink.api.java.typeutils.runtime.TupleSerializer.serialize(TupleSerializer.java:125)
    ... 12 more
2018-11-22 19:49:40,873 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph
    - Job csv_test csv_test_udaf_acc (ec1e56648123905f7ffd85ba884e89ca) switched from state RUNNING to FAILING.
org.apache.flink.types.NullFieldException: Field 0 is null, but expected to hold a value.

The error is plain enough: as a veteran in the Flink community translated for me, the first field of the tuple is null, and serializing that null Long fails. I foolishly spent half a day convinced my own program was fine, wondering where the problem could possibly be. Sometimes it really isn't our fault; we just put too much faith in others. After a long search, and a pointer from our ever-helpful team lead, it turned out the official example itself hides a catch.

Next, as before, look at the registration methods in StreamTableEnvironment and TableEnvironment.

StreamTableEnvironment

/**
 * Registers an [[AggregateFunction]] under a unique name in the TableEnvironment's catalog.
 * Registered functions can be referenced in Table API and SQL queries.
 *
 * @param name The name under which the function is registered.
 * @param f The AggregateFunction to register.
 * @tparam T The type of the output value.
 * @tparam ACC The type of aggregate accumulator.
 */
def registerFunction[T: TypeInformation, ACC: TypeInformation](
    name: String,
    f: AggregateFunction[T, ACC]): Unit = {
  registerAggregateFunctionInternal[T, ACC](name, f)
}

TableEnvironment

/**
 * Registers an [[AggregateFunction]] under a unique name. Replaces already existing
 * user-defined functions under this name.
 */
private[flink] def registerAggregateFunctionInternal[T: TypeInformation, ACC: TypeInformation](
    name: String, function: AggregateFunction[T, ACC]): Unit = {
  // check if class not Scala object
  checkNotSingleton(function.getClass)
  // check if class could be instantiated
  checkForInstantiation(function.getClass)

  val resultTypeInfo: TypeInformation[_] = getResultTypeOfAggregateFunction(
    function,
    implicitly[TypeInformation[T]])

  val accTypeInfo: TypeInformation[_] = getAccumulatorTypeOfAggregateFunction(
    function,
    implicitly[TypeInformation[ACC]])

  // register in Table API
  functionCatalog.registerFunction(name, function.getClass)

  // register in SQL API
  val sqlFunctions = createAggregateSqlFunction(
    name,
    name,
    function,
    resultTypeInfo,
    accTypeInfo,
    typeFactory)
  functionCatalog.registerSqlFunction(sqlFunctions)
}
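To make the mismatch visible, here is a rough sketch of what the implicit machinery resolves to under the holiday-style call; the GenericTypeInfo outcome is my reading of TypeExtractor's fallback for an opaque class like Object:

import org.apache.flink.api.common.typeinfo.TypeInformation

// Holiday-style registration binds both T and ACC to Object:
implicit val infoTypes: TypeInformation[Object] = TypeInformation.of(classOf[Object])

// Inside registerAggregateFunctionInternal[T = Object, ACC = Object]:
//   implicitly[TypeInformation[T]]   ==> GenericTypeInfo[Object]
//   implicitly[TypeInformation[ACC]] ==> GenericTypeInfo[Object]
// while WeightedAvg itself declares:
//   getResultType      ==> Types.LONG
//   getAccumulatorType ==> new TupleTypeInfo(classOf[WeightedAvgAccum], Types.LONG, Types.INT)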

Concretely, two things go wrong:

First, we register with the generic [Object, Object]. That places no constraint on the JLong and JInteger inside the accumulator class, so the corresponding types cannot be found when the function is resolved. In terms of TableEnvironment's type parameters: T can be made to line up, but ACC cannot.

Second, our WeightedAvg class reports new TupleTypeInfo(classOf[WeightedAvgAccum], Types.LONG, Types.INT) as the accumulator type and Types.LONG as the result type. The result side is a single value, but the accumulator side describes WeightedAvgAccum as a tuple with typed fields, while our registration offered only a single Object; the two descriptions do not correspond at all. (Note also that WeightedAvgAccum extends JTuple2 yet stores its values in sum and count rather than in the tuple fields f0 and f1, so the tuple serializer finds field 0 unset; that is precisely the NullFieldException in the log above.) So both classes have to change. The updated code:

WeightedAvgAccum

/**
 * Created by lx on 2018/11/22.
 */
class WeightedAvgAccum {
  var sum = 1L
  var count = 2
}

WeightedAvg

import java.lang.{Integer => JInteger, Long => JLong}
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.table.api.Types
import org.apache.flink.table.functions.AggregateFunction

/**
 * Created by lx on 2018/11/14.
 */
class WeightedAvg extends AggregateFunction[JLong, WeightedAvgAccum] {

  override def createAccumulator(): WeightedAvgAccum = {
    new WeightedAvgAccum
  }

  override def getValue(acc: WeightedAvgAccum): JLong = {
    if (acc.count == 0) {
      null
    } else {
      acc.sum / acc.count
    }
  }

  def accumulate(acc: WeightedAvgAccum, iValue: JLong, iWeight: JInteger): Unit = {
    acc.sum += iValue * iWeight
    acc.count += iWeight
  }

  def retract(acc: WeightedAvgAccum, iValue: JLong, iWeight: JInteger): Unit = {
    acc.sum -= iValue * iWeight
    acc.count -= iWeight
  }

  def merge(acc: WeightedAvgAccum, it: java.lang.Iterable[WeightedAvgAccum]): Unit = {
    val iter = it.iterator()
    while (iter.hasNext) {
      val a = iter.next()
      acc.count += a.count
      acc.sum += a.sum
    }
  }

  def resetAccumulator(acc: WeightedAvgAccum): Unit = {
    acc.count = 0
    acc.sum = 0L
  }

  override def getAccumulatorType: TypeInformation[WeightedAvgAccum] =
    TypeInformation.of(classOf[WeightedAvgAccum])

  override def getResultType: TypeInformation[JLong] = Types.LONG
}

And how does it run now? Like this:

[Figure: flink_running]

4. Summary

That was a lot of words, and none of it may be news to the veterans, but I believe it will be of some help to those touching this for the first time. When you write code, think about it from more than one angle, and keep building a deeper understanding of the layers underneath; that is what lets one lesson carry over to the next. That's it for today. I'm Linxi, and I'll keep sharing my own observations and understanding here. Sayonara!
