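Overview

This blog is updated every Friday. A user-defined function (UDF) is a Flink extension mechanism that lets you implement custom logic and call it from within a query. UDFs can be written in a JVM language (such as Java or Scala) or in Python; Java or Scala is recommended.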

Personal Java utility library: https://gitee.com/wangzonghui/object-tool
- Includes utilities for JSON, strings, collections, Excel, zip compression, PDF, bytes, HTTP, and more. Feel free to use it.

References

Official Flink 1.14 UDF documentation (Chinese)
Blog posts

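Types

By function, UDFs fall into four broad categories (or three, if aggregate functions and table aggregate functions are counted as one), as shown in the table below.

Name	Description
Scalar function	Maps zero, one, or more scalar values to a single scalar value
Table function	Maps zero, one, or more scalar values to multiple rows
Aggregate function	Aggregates one or more rows into a single value
Table aggregate function	Aggregates one or more rows into multiple rows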

Scalar functions

Description

A scalar function must extend org.apache.flink.table.functions.ScalarFunction and implement an eval method. A Java example follows:

Example

//-------------- Implement the scalar function ----------------
import org.apache.flink.table.annotation.DataTypeHint;
import org.apache.flink.table.annotation.InputGroup;
import org.apache.flink.table.api.*;
import org.apache.flink.table.functions.ScalarFunction;
import static org.apache.flink.table.api.Expressions.*;

public static class HashFunction extends ScalarFunction {

  // Accepts input of any type and returns an INT
  public int eval(@DataTypeHint(inputGroup = InputGroup.ANY) Object o) {
    return o.hashCode();
  }
}


// Call the user-defined function
TableEnvironment env = TableEnvironment.create(...);

//-------------- Option 1: call without registering ----------------
// Call the function "inline" in the Table API, without registering it
env.from("MyTable").select(call(HashFunction.class, $("myField")));

//-------------- Option 2: register, then call ----------------
// Register the function
env.createTemporarySystemFunction("HashFunction", HashFunction.class);

// Call the registered function in the Table API
env.from("MyTable").select(call("HashFunction", $("myField")));

// Call the registered function in SQL
env.sqlQuery("SELECT HashFunction(myField) FROM MyTable");
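
One thing the example above glosses over: o.hashCode() throws a NullPointerException when the field is NULL. The sketch below shows only the evaluation logic as a plain static method, with no Flink dependency; the class and method names are hypothetical. Inside a real ScalarFunction the same body would presumably go into eval, with the return type changed from int to Integer so that a NULL input can produce a NULL result.

```java
final class NullSafeHashSketch {
    // Mirrors the body of HashFunction.eval, but returns a boxed Integer so that
    // a null input maps to a null result instead of throwing.
    static Integer hashOrNull(Object o) {
        return o == null ? null : o.hashCode();
    }

    public static void main(String[] args) {
        System.out.println(hashOrNull("flink")); // hash of a non-null value
        System.out.println(hashOrNull(null));    // null in, null out
    }
}
```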

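Table functions

Description

A table function must extend org.apache.flink.table.functions.TableFunction and implement one or more methods named eval; defining several eval methods overloads the evaluation method. Unlike a scalar function, eval returns nothing and emits each output row by calling collect(...).

Example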

//-------------- Implement the table function ----------------
import org.apache.flink.table.annotation.DataTypeHint;
import org.apache.flink.table.annotation.FunctionHint;
import org.apache.flink.table.api.*;
import org.apache.flink.table.functions.TableFunction;
import org.apache.flink.types.Row;
import static org.apache.flink.table.api.Expressions.*;

@FunctionHint(output = @DataTypeHint("ROW<word STRING, length INT>"))
public static class SplitFunction extends TableFunction<Row> {

  public void eval(String str) {
    for (String s : str.split(" ")) {
      // use collect(...) to emit a row
      collect(Row.of(s, s.length()));
    }
  }
}

//-------------- Use the table function ----------------
TableEnvironment env = TableEnvironment.create(...);

//-------------- Option 1: call without registering ----------------
// Call the function "inline" in the Table API, without registering it
env
  .from("MyTable")
  .joinLateral(call(SplitFunction.class, $("myField")))
  .select($("myField"), $("word"), $("length"));
env
  .from("MyTable")
  .leftOuterJoinLateral(call(SplitFunction.class, $("myField")))
  .select($("myField"), $("word"), $("length"));

// Rename the function's output fields in the Table API
env
  .from("MyTable")
  .leftOuterJoinLateral(call(SplitFunction.class, $("myField")).as("newWord", "newLength"))
  .select($("myField"), $("newWord"), $("newLength"));

//-------------- Option 2: register, then call ----------------
// Register the function
env.createTemporarySystemFunction("SplitFunction", SplitFunction.class);

// Call the registered function in the Table API
env
  .from("MyTable")
  .joinLateral(call("SplitFunction", $("myField")))
  .select($("myField"), $("word"), $("length"));
env
  .from("MyTable")
  .leftOuterJoinLateral(call("SplitFunction", $("myField")))
  .select($("myField"), $("word"), $("length"));

// Call the registered function in SQL
env.sqlQuery(
  "SELECT myField, word, length " +
  "FROM MyTable, LATERAL TABLE(SplitFunction(myField))");
env.sqlQuery(
  "SELECT myField, word, length " +
  "FROM MyTable " +
  "LEFT JOIN LATERAL TABLE(SplitFunction(myField)) ON TRUE");

// Rename the function's output fields in SQL
env.sqlQuery(
  "SELECT myField, newWord, newLength " +
  "FROM MyTable " +
  "LEFT JOIN LATERAL TABLE(SplitFunction(myField)) AS T(newWord, newLength) ON TRUE");

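The two join flavours used above differ only in what happens when the function emits no rows for an input: joinLateral drops that input row, while leftOuterJoinLateral keeps it and pads the function's columns with NULL. The following is a plain-Java sketch of that behaviour with no Flink dependency. The class name, the "|"-separated row format, and the rule that a null or empty input emits nothing are assumptions made for illustration; the SplitFunction above does not special-case those inputs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class LateralJoinSketch {
    // Hypothetical variant of SplitFunction.eval: one (word, length) row per word,
    // and no rows at all for a null or empty input.
    static List<String[]> split(String str) {
        List<String[]> rows = new ArrayList<>();
        if (str == null || str.isEmpty()) {
            return rows;
        }
        for (String s : str.split(" ")) {
            rows.add(new String[] {s, String.valueOf(s.length())});
        }
        return rows;
    }

    // Joins every input value with the rows the function emits for it.
    // leftOuter = false mimics joinLateral, leftOuter = true mimics leftOuterJoinLateral.
    static List<String> join(List<String> inputs, boolean leftOuter) {
        List<String> result = new ArrayList<>();
        for (String in : inputs) {
            List<String[]> rows = split(in);
            if (rows.isEmpty() && leftOuter) {
                result.add(in + "|null|null"); // keep the input row, pad with nulls
            }
            for (String[] r : rows) {
                result.add(in + "|" + r[0] + "|" + r[1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> inputs = Arrays.asList("a bc", "");
        System.out.println(join(inputs, false)); // the empty input disappears
        System.out.println(join(inputs, true));  // the empty input is kept with nulls
    }
}
```
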
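Aggregate functions

Description

A user-defined aggregate function (UDAGG) aggregates a table (one or more rows, each with one or more columns) into a single scalar value.
Consider a beverage table with three columns (id, name, price) and 5 rows. To find the price of the most expensive beverage, you run a max() aggregation: it has to scan all 5 rows, and the result is a single value.
A user-defined aggregate function is implemented by extending AggregateFunction. An AggregateFunction needs an accumulator, a data structure that stores the intermediate result of the aggregation. An empty accumulator is created by calling createAccumulator(). For each input row, accumulate() is called to update the accumulator. Once all rows have been processed, getValue() is called to compute and return the final result.
An AggregateFunction therefore must implement createAccumulator(), accumulate(), and getValue().
Some scenarios require additional methods:
- retract() is required for aggregations on bounded OVER windows.
- merge() is required for many batch aggregations and for session and hop (sliding) window aggregations. It also helps with optimization: two-phase aggregation, for example, requires every AggregateFunction to implement merge.
- resetAccumulator() is required for many batch aggregations.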
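
The createAccumulator / accumulate / getValue lifecycle described above can be sketched in plain Java, without Flink on the classpath. The aggregate method plays the role of the runtime, and the five prices are made-up values standing in for the beverage table; all class and method names here are hypothetical.

```java
final class MaxPriceLifecycleSketch {
    // Accumulator: holds the intermediate result of the aggregation.
    static final class MaxAccum {
        Integer max; // stays null until the first row arrives
    }

    static MaxAccum createAccumulator() {
        return new MaxAccum();
    }

    // Called once per input row to fold the row into the accumulator.
    static void accumulate(MaxAccum acc, int price) {
        if (acc.max == null || price > acc.max) {
            acc.max = price;
        }
    }

    // Called after all rows are processed to produce the final value.
    static Integer getValue(MaxAccum acc) {
        return acc.max;
    }

    // Stand-in for the runtime: one accumulator, one accumulate call per row.
    static Integer aggregate(int[] prices) {
        MaxAccum acc = createAccumulator();
        for (int p : prices) {
            accumulate(acc, p);
        }
        return getValue(acc);
    }

    public static void main(String[] args) {
        int[] prices = {6, 3, 5, 4, 2}; // hypothetical prices for the 5 beverage rows
        System.out.println(aggregate(prices));
    }
}
```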

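Code example

//---------------- Define the accumulator ----------------
import java.util.Iterator;
import org.apache.flink.table.functions.AggregateFunction;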
/**
 * Accumulator for WeightedAvg.
 */
public static class WeightedAvgAccum {
    public long sum = 0;
    public int count = 0;
}

//-------------- Define the aggregate function ----------------

/**
 * Weighted Average user-defined aggregate function.
 */
public static class WeightedAvg extends AggregateFunction<Long, WeightedAvgAccum> {

    @Override
    public WeightedAvgAccum createAccumulator() {
        return new WeightedAvgAccum();
    }

    @Override
    public Long getValue(WeightedAvgAccum acc) {
        if (acc.count == 0) {
            return null;
        } else {
            return acc.sum / acc.count;
        }
    }

    public void accumulate(WeightedAvgAccum acc, long iValue, int iWeight) {
        acc.sum += iValue * iWeight;
        acc.count += iWeight;
    }

    public void retract(WeightedAvgAccum acc, long iValue, int iWeight) {
        acc.sum -= iValue * iWeight;
        acc.count -= iWeight;
    }

    public void merge(WeightedAvgAccum acc, Iterable<WeightedAvgAccum> it) {
        Iterator<WeightedAvgAccum> iter = it.iterator();
        while (iter.hasNext()) {
            WeightedAvgAccum a = iter.next();
            acc.count += a.count;
            acc.sum += a.sum;
        }
    }

    public void resetAccumulator(WeightedAvgAccum acc) {
        acc.count = 0;
        acc.sum = 0L;
    }
}

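//-------------- Use the aggregate function ----------------
// Register the function
// (registerFunction is deprecated in Flink 1.14; createTemporarySystemFunction is its replacement)
StreamTableEnvironment tEnv = ...
tEnv.registerFunction("wAvg", new WeightedAvg());

// Call the function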
tEnv.sqlQuery("SELECT user, wAvg(points, level) AS avgPoints FROM userScores GROUP BY user");
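
To make the arithmetic in WeightedAvg concrete, here is a plain-Java restatement of the same accumulator logic with no Flink dependency (the class name and the sample points/level pairs are hypothetical). It shows that retract undoes an earlier accumulate, that merging partial accumulators gives the same result as feeding every row into one accumulator, and that the long division in getValue truncates the average.

```java
import java.util.Arrays;

final class WeightedAvgSketch {
    static final class Accum {
        long sum = 0;
        int count = 0;
    }

    static void accumulate(Accum acc, long value, int weight) {
        acc.sum += value * weight;
        acc.count += weight;
    }

    // Exact inverse of accumulate for the same (value, weight) pair.
    static void retract(Accum acc, long value, int weight) {
        acc.sum -= value * weight;
        acc.count -= weight;
    }

    // Folds partial accumulators (for example from a pre-aggregation phase) into acc.
    static void merge(Accum acc, Iterable<Accum> others) {
        for (Accum other : others) {
            acc.sum += other.sum;
            acc.count += other.count;
        }
    }

    static Long getValue(Accum acc) {
        return acc.count == 0 ? null : acc.sum / acc.count;
    }

    public static void main(String[] args) {
        Accum all = new Accum();
        accumulate(all, 80, 1);
        accumulate(all, 90, 3);
        System.out.println(getValue(all)); // (80*1 + 90*3) / 4 = 350 / 4, truncated to 87

        Accum a = new Accum();
        accumulate(a, 80, 1);
        Accum b = new Accum();
        accumulate(b, 90, 3);
        Accum merged = new Accum();
        merge(merged, Arrays.asList(a, b));
        System.out.println(getValue(merged)); // same 87 as the single accumulator

        retract(all, 80, 1);
        System.out.println(getValue(all)); // 270 / 3 = 90
    }
}
```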

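Table aggregate functions

Description

A user-defined table aggregate function (UDTAGG) aggregates a table (one or more rows, each with one or more columns) into another table; the result can have multiple rows and columns.
Consider again a table with three columns (id, name, price) and 5 rows. To find the two most expensive beverages, you need something like a top2() table aggregate function: it has to scan all 5 rows, and the result is a table with 2 rows.
A user-defined table aggregate function is implemented by extending TableAggregateFunction, and it runs as follows. First, an accumulator is needed to store the intermediate result of the aggregation; an empty one is created by calling createAccumulator(). For each input row, accumulate() is called to update the accumulator. Once all rows have been processed, emitValue() is called to compute and return the final result.
A TableAggregateFunction must implement createAccumulator() and accumulate().
Some scenarios require additional methods:
- retract() is required for aggregations on bounded OVER windows.
- merge() is required for many batch aggregations and for streaming session and sliding window aggregations.
- resetAccumulator() is required for many batch aggregations.
- emitValue() is required for batch aggregations and window aggregations.
- emitUpdateWithRetract() can improve job efficiency in retract mode; it is responsible for emitting only the values that have been updated.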

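Code example

The following defines a TableAggregateFunction that computes the two largest values of a given column, registers it in a TableEnvironment, and uses it in a Table API query (TableAggregateFunction is currently supported only in the Table API).

//---------------- Define the accumulator ----------------
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.table.functions.TableAggregateFunction;
import org.apache.flink.util.Collector;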
/**
 * Accumulator for Top2.
 */
public class Top2Accum {
    public Integer first;
    public Integer second;
}

//-------------- Define the table aggregate function ----------------
/**
 * The top2 user-defined table aggregate function.
 */
public static class Top2 extends TableAggregateFunction<Tuple2<Integer, Integer>, Top2Accum> {

    @Override
    public Top2Accum createAccumulator() {
        Top2Accum acc = new Top2Accum();
        acc.first = Integer.MIN_VALUE;
        acc.second = Integer.MIN_VALUE;
        return acc;
    }


    public void accumulate(Top2Accum acc, Integer v) {
        if (v > acc.first) {
            acc.second = acc.first;
            acc.first = v;
        } else if (v > acc.second) {
            acc.second = v;
        }
    }

    public void merge(Top2Accum acc, java.lang.Iterable<Top2Accum> iterable) {
        for (Top2Accum otherAcc : iterable) {
            accumulate(acc, otherAcc.first);
            accumulate(acc, otherAcc.second);
        }
    }

    public void emitValue(Top2Accum acc, Collector<Tuple2<Integer, Integer>> out) {
        // emit the value and rank
        if (acc.first != Integer.MIN_VALUE) {
            out.collect(Tuple2.of(acc.first, 1));
        }
        if (acc.second != Integer.MIN_VALUE) {
            out.collect(Tuple2.of(acc.second, 2));
        }
    }
}

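//-------------- Use the table aggregate function ----------------
// Register the function
// (registerFunction is deprecated in Flink 1.14; createTemporarySystemFunction is its replacement)
StreamTableEnvironment tEnv = ...
tEnv.registerFunction("top2", new Top2());

// Initialize the table
Table tab = ...;

// Call the function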
tab.groupBy("key")
    .flatAggregate("top2(a) as (v, rank)")
    .select("key, v, rank");
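
As a rough illustration of what Top2 produces for one group, here is a plain-Java sketch of the same accumulate and emitValue logic with no Flink dependency. The class name, the int[] {value, rank} pairs that stand in for Tuple2 and the Collector, and the five input values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class Top2Sketch {
    static final class Accum {
        int first = Integer.MIN_VALUE;  // MIN_VALUE marks "no value seen yet"
        int second = Integer.MIN_VALUE;
    }

    // Same update rule as Top2.accumulate.
    static void accumulate(Accum acc, int v) {
        if (v > acc.first) {
            acc.second = acc.first;
            acc.first = v;
        } else if (v > acc.second) {
            acc.second = v;
        }
    }

    // Stand-in for emitValue: returns {value, rank} pairs instead of writing to a Collector.
    static List<int[]> emitValue(Accum acc) {
        List<int[]> out = new ArrayList<>();
        if (acc.first != Integer.MIN_VALUE) {
            out.add(new int[] {acc.first, 1});
        }
        if (acc.second != Integer.MIN_VALUE) {
            out.add(new int[] {acc.second, 2});
        }
        return out;
    }

    // Stand-in for the runtime: accumulate every row of one group, then emit once.
    static List<int[]> top2(int[] values) {
        Accum acc = new Accum();
        for (int v : values) {
            accumulate(acc, v);
        }
        return emitValue(acc);
    }

    public static void main(String[] args) {
        for (int[] row : top2(new int[] {4, 3, 7, 5, 6})) {
            System.out.println("value=" + row[0] + " rank=" + row[1]);
        }
    }
}
```

With a single input row only the rank 1 row is emitted, because second still holds the MIN_VALUE sentinel.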

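The next example uses emitUpdateWithRetract to emit only the updated rows. To do that, the accumulator keeps both the previous top 2 values and the current top 2 values.
Note: if n in a TopN is very large, keeping both the previous and the current result is inefficient. One way around this is to store the input rows directly in the accumulator and do the computation when emitUpdateWithRetract is called.

//---------------- Define the accumulator ----------------
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.table.functions.TableAggregateFunction;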
/**
 * Accumulator for Top2.
 */
public class Top2Accum {
    public Integer first;
    public Integer second;
    public Integer oldFirst;
    public Integer oldSecond;
}

//-------------- Define the table aggregate function ----------------
/**
 * The top2 user-defined table aggregate function.
 */
public static class Top2 extends TableAggregateFunction<Tuple2<Integer, Integer>, Top2Accum> {

    @Override
    public Top2Accum createAccumulator() {
        Top2Accum acc = new Top2Accum();
        acc.first = Integer.MIN_VALUE;
        acc.second = Integer.MIN_VALUE;
        acc.oldFirst = Integer.MIN_VALUE;
        acc.oldSecond = Integer.MIN_VALUE;
        return acc;
    }

    public void accumulate(Top2Accum acc, Integer v) {
        if (v > acc.first) {
            acc.second = acc.first;
            acc.first = v;
        } else if (v > acc.second) {
            acc.second = v;
        }
    }

    public void emitUpdateWithRetract(Top2Accum acc, RetractableCollector<Tuple2<Integer, Integer>> out) {
        if (!acc.first.equals(acc.oldFirst)) {
            // if there is an update, retract old value then emit new value.
            if (acc.oldFirst != Integer.MIN_VALUE) {
                out.retract(Tuple2.of(acc.oldFirst, 1));
            }
            out.collect(Tuple2.of(acc.first, 1));
            acc.oldFirst = acc.first;
        }

        if (!acc.second.equals(acc.oldSecond)) {
            // if there is an update, retract old value then emit new value.
            if (acc.oldSecond != Integer.MIN_VALUE) {
                out.retract(Tuple2.of(acc.oldSecond, 2));
            }
            out.collect(Tuple2.of(acc.second, 2));
            acc.oldSecond = acc.second;
        }
    }
}

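//-------------- Use the table aggregate function ----------------
// Register the function
// (registerFunction is deprecated in Flink 1.14; createTemporarySystemFunction is its replacement)
StreamTableEnvironment tEnv = ...
tEnv.registerFunction("top2", new Top2());

// Initialize the table
Table tab = ...;

// Call the function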
tab.groupBy("key")
    .flatAggregate("top2(a) as (v, rank)")
    .select("key, v, rank");
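
To see which messages the emitUpdateWithRetract version sends downstream, the sketch below replays the same logic in plain Java with no Flink dependency. It assumes the runtime calls emitUpdateWithRetract after every input row, and it records retractions as "-(value,rank)" and insertions as "+(value,rank)" strings; that encoding, the class name, and the input values are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class Top2RetractSketch {
    static final int NONE = Integer.MIN_VALUE; // sentinel for "no value yet"

    static final class Accum {
        int first = NONE;
        int second = NONE;
        int oldFirst = NONE;  // rank 1 value as last emitted
        int oldSecond = NONE; // rank 2 value as last emitted
    }

    static void accumulate(Accum acc, int v) {
        if (v > acc.first) {
            acc.second = acc.first;
            acc.first = v;
        } else if (v > acc.second) {
            acc.second = v;
        }
    }

    // Emits messages only for the ranks whose value changed since the last call.
    static List<String> emitUpdateWithRetract(Accum acc) {
        List<String> out = new ArrayList<>();
        if (acc.first != acc.oldFirst) {
            if (acc.oldFirst != NONE) {
                out.add("-(" + acc.oldFirst + ",1)"); // retract the old rank 1 row
            }
            out.add("+(" + acc.first + ",1)");
            acc.oldFirst = acc.first;
        }
        if (acc.second != acc.oldSecond) {
            if (acc.oldSecond != NONE) {
                out.add("-(" + acc.oldSecond + ",2)"); // retract the old rank 2 row
            }
            out.add("+(" + acc.second + ",2)");
            acc.oldSecond = acc.second;
        }
        return out;
    }

    // Feeds the values one by one and collects the messages emitted after each row.
    static List<List<String>> run(int[] values) {
        Accum acc = new Accum();
        List<List<String>> perRow = new ArrayList<>();
        for (int v : values) {
            accumulate(acc, v);
            perRow.add(emitUpdateWithRetract(acc));
        }
        return perRow;
    }

    public static void main(String[] args) {
        System.out.println(run(new int[] {4, 7, 3, 5}));
    }
}
```

A row that does not change the top 2 (the value 3 in this input) produces no messages at all, which is the saving compared with re-emitting both rows through emitValue on every update.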

Summary

In my view, a UDF is essentially an implementation of an abstract class that extends what Flink can compute.

