Flink SQL
Overview
Why Flink SQL
- Flexible, rich syntax that covers most computation scenarios.
- Flink SQL uses the Apache Calcite framework underneath: standard SQL statements are parsed and translated into the underlying operator logic, with rule-based optimizations applied during the translation. This hides low-level technical details and makes it more convenient and efficient to build Flink applications with SQL statements.
- Flink SQL sits on top of the `Table API` and covers most of the `Table API`'s features. SQL and the `Table API` can be mixed, and Flink ultimately merges the code into one unified code path.
- A single set of SQL code can be applied to both streaming and batch computation over the same data structure, without the user adjusting the SQL statements at all, achieving unified stream/batch processing.
Flink SQL Example
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
// ingest a DataStream from an external source
DataStream<Tuple3<Long, String, Integer>> ds = env.addSource(...);
// query an unregistered Table by inlining its string-converted reference
Table table = tableEnv.fromDataStream(ds, "user, product, amount");
Table result = tableEnv.sqlQuery(
    "SELECT SUM(amount) FROM " + table + " WHERE product LIKE '%Rubber%'");
// register the DataStream as table "Orders" and query it by name
tableEnv.registerDataStream("Orders", ds, "user, product, amount");
Table result2 = tableEnv.sqlQuery(
    "SELECT product, amount FROM Orders WHERE product LIKE '%Rubber%'");
// register a TableSink and emit query results into it
TableSink csvSink = new CsvTableSink("/path/to/file", ...);
String[] fieldNames = {"product", "amount"};
TypeInformation[] fieldTypes = {Types.STRING, Types.INT};
tableEnv.registerTableSink("RubberOrders", fieldNames, fieldTypes, csvSink);
tableEnv.sqlUpdate(
    "INSERT INTO RubberOrders SELECT product, amount FROM Orders WHERE product LIKE '%Rubber%'");
Operations
Show and Use
Operation | Description |
---|---|
Show | SHOW CATALOGS; SHOW DATABASES; SHOW TABLES |
Use | USE CATALOG mycatalog; USE mydatabase |
Scan, Projection and Filter
Operation | Description |
---|---|
Scan / Select / As | SELECT * FROM Orders SELECT a, c AS d FROM Orders |
Where / Filter | SELECT * FROM Orders WHERE b = 'red' SELECT * FROM Orders WHERE a % 2 = 0 |
User-defined Scalar Functions (Scalar UDF) | SELECT PRETTY_PRINT(user) FROM Orders |
Aggregations
Operation | Description |
---|---|
GroupBy Aggregation | SELECT a, SUM(b) as d FROM Orders GROUP BY a |
GroupBy Window Aggregation | SELECT user, SUM(amount) FROM Orders GROUP BY TUMBLE(rowtime, INTERVAL '1' DAY), user |
Over Window aggregation | SELECT COUNT(amount) OVER ( PARTITION BY user ORDER BY proctime ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) FROM Orders SELECT COUNT(amount) OVER w, SUM(amount) OVER w FROM Orders WINDOW w AS ( PARTITION BY user ORDER BY proctime ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) |
Distinct | SELECT DISTINCT users FROM Orders |
Grouping sets, Rollup, Cube | SELECT SUM(amount) FROM Orders GROUP BY GROUPING SETS ((user), (product)) |
User-defined Aggregate Functions (UDAGG) | SELECT MyAggregate(amount) FROM Orders GROUP BY users |
Having | SELECT SUM(amount) FROM Orders GROUP BY users HAVING SUM(amount) > 50 |
Joins
Operation | Description |
---|---|
Inner Equi-join | SELECT * FROM Orders INNER JOIN Product ON Orders.productId = Product.id |
Outer Equi-join | SELECT * FROM Orders LEFT JOIN Product ON Orders.productId = Product.id SELECT * FROM Orders RIGHT JOIN Product ON Orders.productId = Product.id SELECT * FROM Orders FULL OUTER JOIN Product ON Orders.productId = Product.id |
Time-windowed Join | SELECT * FROM Orders o, Shipments s WHERE o.id = s.orderId AND o.ordertime BETWEEN s.shiptime - INTERVAL '4' HOUR AND s.shiptime |
Expanding arrays into a relation | SELECT users, tag FROM Orders CROSS JOIN UNNEST(tags) AS t (tag) |
Join with Table Function (UDTF) | Inner join: a row of the left (outer) table is dropped if its table function call returns an empty result. SELECT users, tag FROM Orders, LATERAL TABLE(unnest_udtf(tags)) t AS tag Left outer join: if a table function call returns an empty result, the corresponding outer row is preserved and the result is padded with null values. SELECT users, tag FROM Orders LEFT JOIN LATERAL TABLE(unnest_udtf(tags)) t AS tag ON TRUE |
Join with Temporal Table Function | A temporal table is a table that tracks changes over time. Assuming Rates is a temporal table function, the join can be expressed in SQL as: SELECT o_amount, r_rate FROM Orders, LATERAL TABLE (Rates(o_proctime)) WHERE r_currency = o_currency |
Join with Temporal Table | A temporal table is a table that tracks changes over time and provides access to its versions at specific points in time. Only inner and left joins with processing-time temporal tables are supported. SELECT o.amount, o.currency, r.rate, o.amount * r.rate FROM Orders AS o JOIN LatestRates FOR SYSTEM_TIME AS OF o.proctime AS r ON r.currency = o.currency |
Set Operations
Operation | Description |
---|---|
Union | SELECT * FROM ( (SELECT user FROM Orders WHERE a % 2 = 0) UNION (SELECT user FROM Orders WHERE b = 0) ) |
UnionAll | SELECT * FROM ( (SELECT user FROM Orders WHERE a % 2 = 0) UNION ALL (SELECT user FROM Orders WHERE b = 0) ) |
Intersect / Except | SELECT * FROM ( (SELECT user FROM Orders WHERE a % 2 = 0) INTERSECT (SELECT user FROM Orders WHERE b = 0) ) SELECT * FROM ( (SELECT user FROM Orders WHERE a % 2 = 0) EXCEPT (SELECT user FROM Orders WHERE b = 0) ) |
In | SELECT user, amount FROM Orders WHERE product IN ( SELECT product FROM NewProducts ) |
Exists | SELECT user, amount FROM Orders WHERE product EXISTS ( SELECT product FROM NewProducts ) |
OrderBy & Limit
Operation | Description |
---|---|
Order By | SELECT * FROM Orders ORDER BY orderTime |
Limit | SELECT * FROM Orders ORDER BY orderTime LIMIT 3 |
Top-N
SELECT [column_list]
FROM (
SELECT [column_list],
ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]]
ORDER BY col1 [asc|desc][, col2 [asc|desc]...]) AS rownum
FROM table_name)
WHERE rownum <= N [AND conditions]
e.g.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env);
// ingest a DataStream from an external source
DataStream<Tuple4<String, String, String, Long>> ds = env.addSource(...);
// register the DataStream as table "ShopSales"
tableEnv.registerDataStream("ShopSales", ds, "product_id, category, product_name, sales");
// select the top-5 products per category by sales
Table result1 = tableEnv.sqlQuery(
"SELECT * " +
"FROM (" +
"  SELECT *," +
"    ROW_NUMBER() OVER (PARTITION BY category ORDER BY sales DESC) as row_num" +
"  FROM ShopSales) " +
"WHERE row_num <= 5");
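Outside of Flink, the effect of the `ROW_NUMBER`-based Top-N pattern can be sketched with plain Java collections. The class and data below are hypothetical illustrations, not Flink APIs:

```java
import java.util.*;

public class TopNDemo {
    // Hypothetical row type standing in for (category, product, sales).
    static class Sale {
        final String category, product; final long sales;
        Sale(String category, String product, long sales) {
            this.category = category; this.product = product; this.sales = sales;
        }
    }

    // Keep the n highest-sales products per category -- the same rows the
    // ROW_NUMBER() ... WHERE row_num <= n query selects.
    static Map<String, List<String>> topN(List<Sale> rows, int n) {
        Map<String, PriorityQueue<Sale>> heaps = new HashMap<>();
        for (Sale s : rows) {
            // One min-heap of size n per partition key (category).
            PriorityQueue<Sale> heap = heaps.computeIfAbsent(s.category,
                k -> new PriorityQueue<>(Comparator.comparingLong((Sale x) -> x.sales)));
            heap.add(s);
            if (heap.size() > n) heap.poll(); // evict the current smallest
        }
        Map<String, List<String>> result = new HashMap<>();
        heaps.forEach((category, heap) -> {
            List<String> names = new ArrayList<>();
            for (Sale s : heap) names.add(s.product);
            Collections.sort(names);
            result.put(category, names);
        });
        return result;
    }

    public static void main(String[] args) {
        List<Sale> rows = Arrays.asList(
            new Sale("toys", "car", 30), new Sale("toys", "doll", 50),
            new Sale("toys", "ball", 10), new Sale("food", "tea", 70));
        System.out.println(topN(rows, 2).get("toys")); // [car, doll]
    }
}
```

Flink maintains comparable per-partition state incrementally over the stream rather than materializing the whole input.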
Deduplication
SELECT [column_list]
FROM (
SELECT [column_list],
ROW_NUMBER() OVER ([PARTITION BY col1[, col2...]]
ORDER BY time_attr [asc|desc]) AS rownum
FROM table_name)
WHERE rownum = 1
e.g.
tableEnv.registerDataStream("Orders", ds, "order_id, user, product, number, proctime.proctime");
Table result1 = tableEnv.sqlQuery(
"SELECT order_id, user, product, number " +
"FROM (" +
" SELECT *," +
" ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY proctime ASC) as row_num" +
" FROM Orders)" +
" WHERE row_num = 1");
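The first-row-per-key semantics of this query can be illustrated with a stdlib-only sketch (the data below is hypothetical):

```java
import java.util.*;

public class DedupDemo {
    // Keep only the first row per order_id, in arrival (proctime) order --
    // what ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY proctime ASC) = 1 selects.
    static Map<String, String> firstPerKey(String[][] rows) {
        Map<String, String> first = new LinkedHashMap<>();
        for (String[] row : rows) {
            first.putIfAbsent(row[0], row[1]); // later duplicates are ignored
        }
        return first;
    }

    public static void main(String[] args) {
        String[][] orders = { // hypothetical {order_id, product} pairs
            {"o1", "pen"}, {"o2", "ink"}, {"o1", "pen-dup"}, {"o3", "pad"}};
        System.out.println(firstPerKey(orders)); // {o1=pen, o2=ink, o3=pad}
    }
}
```

Ordering by the time attribute descending instead would keep the last row per key.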
Insert
Operation | Description |
---|---|
Insert Into | INSERT INTO statements can only be used with the sqlUpdate method; they emit the data of a Table to a registered output table. INSERT INTO OutputTable SELECT users, tag FROM Orders |
User-Defined Functions
Scalar Function
public class HashCode extends ScalarFunction {
private int factor = 12;
public HashCode(int factor) {
this.factor = factor;
}
public int eval(String s) {
return s.hashCode() * factor;
}
}
BatchTableEnvironment tableEnv = BatchTableEnvironment.create(env);
// 對方法在環境中進行註冊
tableEnv.registerFunction("hashCode", new HashCode(10));
// Table 使用自定義函數
myTable.select("string, string.hashCode(), hashCode(string)");
// SQL 使用自定義函數
tableEnv.sqlQuery("SELECT string, hashCode(string) FROM MyTable");
**Note:** When defining a custom scalar function, the function's return value must be a scalar value.
public static class TimestampModifier extends ScalarFunction {
public long eval(long t) {
return t % 1000;
}
public TypeInformation<?> getResultType(Class<?>[] signature) {
return Types.SQL_TIMESTAMP;
}
}
**Note:** For result types that are not supported out of the box, you can implement the `getResultType` method of `ScalarFunction` to control the conversion of the output data type.
Table Function

Unlike a Scalar Function, a Table Function takes one or more scalar fields as input and, after computation, returns an arbitrary number of rows rather than a single scalar value; the result may contain one or more columns, so in form it resembles a Table.

To define a table function, extend the TableFunction class from the org.apache.flink.table.functions package and implement an evaluation method; all of the custom function's computation logic is defined there. Note that the method must be declared public and must be named eval, and a single TableFunction may overload eval multiple times. Before using a TableFunction, register it in the TableEnvironment, then use it together with the LATERAL TABLE keywords; appending ON TRUE at the end of the statement distinguishes a leftOuterJoin from a plain join.
public class Split extends TableFunction<Tuple2<String, Integer>> {
private String separator = " ";
public Split(String separator) {
this.separator = separator;
}
public void eval(String str) {
for (String s : str.split(separator)) {
// use collect(...) to emit a row
collect(new Tuple2<String, Integer>(s, s.length()));
}
}
}
BatchTableEnvironment tableEnv = BatchTableEnvironment.create(env);
Table myTable = ... // table schema: [a: String]
// Register the function.
tableEnv.registerFunction("split", new Split("#"));
// Use the table function in the Java Table API. "as" specifies the field names of the table.
myTable.joinLateral("split(a) as (word, length)")
.select("a, word, length");
myTable.leftOuterJoinLateral("split(a) as (word, length)")
.select("a, word, length");
// Use the table function in SQL with LATERAL and TABLE keywords.
// CROSS JOIN a table function (equivalent to "join" in Table API).
tableEnv.sqlQuery("SELECT a, word, length FROM MyTable, LATERAL TABLE(split(a)) as T(word, length)");
// LEFT JOIN a table function (equivalent to "leftOuterJoin" in Table API).
tableEnv.sqlQuery("SELECT a, word, length FROM MyTable LEFT JOIN LATERAL TABLE(split(a)) as T(word, length) ON TRUE");
public class CustomTypeSplit extends TableFunction<Row> {
public void eval(String str) {
for (String s : str.split(" ")) {
Row row = new Row(2);
row.setField(0, s);
row.setField(1, s.length());
collect(row);
}
}
@Override
public TypeInformation<Row> getResultType() {
return Types.ROW(Types.STRING(), Types.INT());
}
}
Aggregation Function
The `Flink Table API` provides `User-Defined Aggregate Functions (UDAGGs)`, whose main purpose is to aggregate one or more rows into a single scalar value, for example computing the maximum or minimum `Value` for a given `Key` in a dataset. Defining one of these is quite a hassle......
public static class WeightedAvgAccum {
public long sum = 0;
public int count = 0;
}
/**
* Weighted Average user-defined aggregate function.
*/
public static class WeightedAvg extends AggregateFunction<Long, WeightedAvgAccum> {
@Override
public WeightedAvgAccum createAccumulator() {
return new WeightedAvgAccum();
}
@Override
public Long getValue(WeightedAvgAccum acc) {
if (acc.count == 0) {
return null;
} else {
return acc.sum / acc.count;
}
}
public void accumulate(WeightedAvgAccum acc, long iValue, int iWeight) {
acc.sum += iValue * iWeight;
acc.count += iWeight;
}
public void retract(WeightedAvgAccum acc, long iValue, int iWeight) {
acc.sum -= iValue * iWeight;
acc.count -= iWeight;
}
public void merge(WeightedAvgAccum acc, Iterable<WeightedAvgAccum> it) {
Iterator<WeightedAvgAccum> iter = it.iterator();
while (iter.hasNext()) {
WeightedAvgAccum a = iter.next();
acc.count += a.count;
acc.sum += a.sum;
}
}
public void resetAccumulator(WeightedAvgAccum acc) {
acc.count = 0;
acc.sum = 0L;
}
}
// register function
StreamTableEnvironment tEnv = ...
tEnv.registerFunction("wAvg", new WeightedAvg());
// use function
tEnv.sqlQuery("SELECT user, wAvg(points, level) AS avgPoints FROM userScores GROUP BY user");
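To make the accumulator lifecycle concrete, here is a stdlib-only mirror of the WeightedAvgAccum logic above (no Flink dependency; in a real job, Flink itself invokes these methods once per group):

```java
public class WeightedAvgDemo {
    // Mirror of the WeightedAvgAccum accumulator.
    static class Acc { long sum = 0; int count = 0; }

    static Acc createAccumulator() { return new Acc(); }
    static void accumulate(Acc acc, long value, int weight) {
        acc.sum += value * weight; acc.count += weight;
    }
    static void retract(Acc acc, long value, int weight) {
        acc.sum -= value * weight; acc.count -= weight;
    }
    static Long getValue(Acc acc) {
        return acc.count == 0 ? null : acc.sum / acc.count;
    }

    public static void main(String[] args) {
        Acc acc = createAccumulator();
        accumulate(acc, 10, 2); // two units of value 10
        accumulate(acc, 40, 1); // one unit of value 40
        System.out.println(getValue(acc)); // (10*2 + 40*1) / 3 = 20
        retract(acc, 40, 1);    // e.g. an update/retraction arriving on the stream
        System.out.println(getValue(acc)); // back to 10
    }
}
```

retract and merge are optional in general but become required for over-window and session-window aggregations, respectively.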