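This article is based on the talk 【實時數倉篇】基於 Flink 的典型 ETL 場景實現 (Real-Time Data Warehouse series: Implementing Typical ETL Scenarios with Flink). It implements a demo for each of the four dimension-table join approaches explained in the video.
There are four common ways to join against a dimension table:
- Preloading the dimension table
- Hot-storage dimension table
- Broadcast dimension table
- Temporal table function join
The sections below use each of these approaches to implement the same join requirement. The main stream carries user records with two fields: user name and city ID. The dimension table holds city data with two fields: city ID and city name. The user stream must be joined with the city table to output user name, city ID and city name.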
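To make the requirement concrete, here is a minimal plain-Java sketch of the expected result, with no Flink involved. The class name `JoinRequirementSketch`, the `join` helper and the sample user names are illustrative assumptions; the city IDs match the sample data used in the demos.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration of the join requirement; not part of the original demos.
class JoinRequirementSketch {

    // Joins one (userName, cityId) record against the city dimension.
    // An unknown city ID yields an empty city name, as in the demos.
    static String join(String userName, int cityId, Map<Integer, String> cityDim) {
        String cityName = cityDim.getOrDefault(cityId, "");
        return userName + "," + cityId + "," + cityName;
    }

    public static void main(String[] args) {
        Map<Integer, String> cityDim = new HashMap<>();
        cityDim.put(1001, "beijing");
        cityDim.put(1002, "shanghai");
        System.out.println(join("user1", 1001, cityDim)); // user1,1001,beijing
        System.out.println(join("user2", 9999, cityDim)); // user2,9999,
    }
}
```

Each of the four approaches produces this same output shape; they differ only in where the city map lives and how it is kept up to date.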
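1. Preloading the dimension table
Define a class that extends RichMapFunction, load the dimension data into memory in open(), and join each record of the probe stream against that data in map().
Loading the dimension data into memory in the open() method of a RichMapFunction has the following characteristics:
Pros: simple to implement.
Cons: because the data lives in memory, this only suits small dimension tables that change infrequently. A timer can be started in open() to reload the dimension table periodically, but updates can still reach the job late.
Here is an example: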
package join;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.HashMap;
import java.util.Map;

/**
 * Create By 鳴宇淳 on 2020/6/1
 * Reads the main stream (user name, city ID) from a socket and joins it with
 * the city dimension table (city ID, city name) to produce
 * (user name, city ID, city name).
 * The dimension data is loaded into memory in the open() method of a RichMapFunction.
 **/
public class JoinDemo1 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<String, Integer>> textStream = env.socketTextStream("localhost", 9000, "\n")
                .map(p -> {
                    // Input format: user,1000 (user name, city ID)
                    String[] list = p.split(",");
                    return new Tuple2<String, Integer>(list[0], Integer.valueOf(list[1]));
                })
                .returns(new TypeHint<Tuple2<String, Integer>>() {
                });
        DataStream<Tuple3<String, Integer, String>> result = textStream.map(new MapJoinDemo1());
        result.print();
        env.execute("joinDemo1");
    }

    static class MapJoinDemo1 extends RichMapFunction<Tuple2<String, Integer>, Tuple3<String, Integer, String>> {
        // Holds the dimension data in memory
        Map<Integer, String> dim;

        @Override
        public void open(Configuration parameters) throws Exception {
            // Load the dimension data in open(); it could come from a database, a file, an API, etc.
            // Hard-coded here for the demo.
            dim = new HashMap<>();
            dim.put(1001, "beijing");
            dim.put(1002, "shanghai");
            dim.put(1003, "wuhan");
            dim.put(1004, "changsha");
        }

        @Override
        public Tuple3<String, Integer, String> map(Tuple2<String, Integer> value) throws Exception {
            // Join the main stream with the dimension table; unknown city IDs get an empty name
            String cityName = "";
            if (dim.containsKey(value.f1)) {
                cityName = dim.get(value.f1);
            }
            return new Tuple3<>(value.f0, value.f1, cityName);
        }
    }
}
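The periodic refresh mentioned in the drawbacks above can be sketched with a JDK scheduled executor. This is only a sketch under assumptions: `RefreshingDim`, its `loader` supplier and the refresh period are hypothetical names, and the class is plain Java rather than a Flink function.

```java
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical helper: holds an in-memory dimension table and reloads it on a schedule.
class RefreshingDim implements AutoCloseable {
    private final Supplier<Map<Integer, String>> loader;
    // volatile so the lookup thread always sees the most recently loaded snapshot
    private volatile Map<Integer, String> snapshot;
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor(r -> {
        Thread t = new Thread(r, "dim-refresh");
        t.setDaemon(true);
        return t;
    });

    RefreshingDim(Supplier<Map<Integer, String>> loader, long period, TimeUnit unit) {
        this.loader = loader;
        this.snapshot = loader.get(); // initial load, like open() in JoinDemo1
        timer.scheduleAtFixedRate(this::reload, period, period, unit);
    }

    // Swaps in a freshly loaded snapshot; keeps the old one if loading fails.
    void reload() {
        try {
            snapshot = loader.get();
        } catch (RuntimeException e) {
            System.err.println("dimension reload failed, keeping old snapshot: " + e);
        }
    }

    String lookup(int cityId) {
        return snapshot.getOrDefault(cityId, "");
    }

    @Override
    public void close() {
        timer.shutdownNow();
    }
}
```

In a RichMapFunction, such a helper would presumably be created in open() and closed in close(). Lookups between two reloads still see the old snapshot, so a change can be visible up to one refresh period late.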
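2. Hot-storage dimension table
This approach keeps the dimension data in an external store such as Redis, HBase or MySQL, and the stream queries that store at the moment each record is joined. Its characteristics are:
Pros: the dimension data is not limited by memory, so it can be very large.
Cons: because the dimension data sits in an external store, lookup speed is bounded by that store's read performance; in addition, synchronizing dimension changes into the store introduces latency.
(1) Using a cache to reduce lookup pressure
A cache can hold the most frequently accessed dimension records to reduce the number of calls to the external system, for example a Guava Cache.
Here is an example: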
package join;

import com.google.common.cache.*;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.TimeUnit;

/**
 * Create By 鳴宇淳 on 2020/6/1
 **/
public class JoinDemo2 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<String, Integer>> textStream = env.socketTextStream("localhost", 9000, "\n")
                .map(p -> {
                    // Input format: user,1000 (user name, city ID)
                    String[] list = p.split(",");
                    return new Tuple2<String, Integer>(list[0], Integer.valueOf(list[1]));
                })
                .returns(new TypeHint<Tuple2<String, Integer>>() {
                });
        DataStream<Tuple3<String, Integer, String>> result = textStream.map(new MapJoinDemo1());
        result.print();
        env.execute("joinDemo2");
    }

    static class MapJoinDemo1 extends RichMapFunction<Tuple2<String, Integer>, Tuple3<String, Integer, String>> {
        // Created in open(), so it is not part of the serialized function
        transient LoadingCache<Integer, String> dim;

        @Override
        public void open(Configuration parameters) throws Exception {
            // Use a Guava LoadingCache in front of the external store
            dim = CacheBuilder.newBuilder()
                    // Maximum number of entries; beyond this, entries that were not used recently are evicted
                    .maximumSize(1000)
                    // Expire an entry a fixed time after it was written
                    .expireAfterWrite(10, TimeUnit.MINUTES)
                    // Get notified when an entry is removed
                    .removalListener(new RemovalListener<Integer, String>() {
                        @Override
                        public void onRemoval(RemovalNotification<Integer, String> removalNotification) {
                            System.out.println(removalNotification.getKey() + " was removed, value: " + removalNotification.getValue());
                        }
                    })
                    .build(
                            // How to load a value on a cache miss
                            new CacheLoader<Integer, String>() {
                                @Override
                                public String load(Integer cityId) throws Exception {
                                    String cityName = readFromHbase(cityId);
                                    return cityName;
                                }
                            }
                    );
        }

        private String readFromHbase(Integer cityId) {
            // Read from HBase.
            // Hard-coded here to simulate an HBase lookup.
            Map<Integer, String> temp = new HashMap<>();
            temp.put(1001, "beijing");
            temp.put(1002, "shanghai");
            temp.put(1003, "wuhan");
            temp.put(1004, "changsha");
            // Return "" rather than null for unknown IDs: a CacheLoader must not return null
            String cityName = "";
            if (temp.containsKey(cityId)) {
                cityName = temp.get(cityId);
            }
            return cityName;
        }

        @Override
        public Tuple3<String, Integer, String> map(Tuple2<String, Integer> value) throws Exception {
            // Join the main stream with the dimension table.
            // A single get() is enough: the loader above never returns null.
            String cityName = dim.get(value.f1);
            return new Tuple3<>(value.f0, value.f1, cityName);
        }
    }
}
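The effect of the cache in the demo above can be checked without Guava or HBase. The following JDK-only sketch swaps in a `LinkedHashMap` in access order as a small LRU cache and counts how often the external lookup is actually called. `LruLookupCache`, `externalLookup` and `externalCalls` are hypothetical names, and unlike the Guava cache this version has no expiry and is not thread-safe.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical JDK-only stand-in for the Guava cache (no expiry, not thread-safe).
class LruLookupCache {
    private final Function<Integer, String> externalLookup;
    private final Map<Integer, String> cache;
    int externalCalls = 0; // how often the external store was really queried

    LruLookupCache(int maxSize, Function<Integer, String> externalLookup) {
        this.externalLookup = externalLookup;
        // accessOrder = true keeps entries ordered from least to most recently used
        this.cache = new LinkedHashMap<Integer, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<Integer, String> eldest) {
                return size() > maxSize;
            }
        };
    }

    // Returns the cached city name, querying the external store only on a miss.
    // The external lookup is assumed to return "" (not null) for unknown IDs.
    String get(int cityId) {
        String cached = cache.get(cityId);
        if (cached != null) {
            return cached;
        }
        externalCalls++;
        String loaded = externalLookup.apply(cityId);
        cache.put(cityId, loaded);
        return loaded;
    }
}
```

With a hot set of city IDs smaller than `maxSize`, repeated lookups hit memory and `externalCalls` stays close to the number of distinct IDs, which is the pressure reduction this section describes.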
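(2) Using async I/O to increase lookup throughput
Flink can read from and write to an external store synchronously: it sends a request, waits for the response, and only then sends the next request. Throughput is low in this mode. Raising the parallelism increases throughput, but the extra parallel tasks consume considerably more resources.
Flink can instead use async I/O to talk to the external system. This requires a client that supports asynchronous requests, which many systems now provide. Asynchronous access raises three issues:
Timeout: a request that times out is considered failed and must be handled as a failure.
Concurrency: if too many requests are in flight, Flink's backpressure mechanism kicks in and throttles the upstream.
Out-of-order results: how to deal with responses that come back out of order depends on the use case. Flink supports two modes: allowing out-of-order output and preserving order.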
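The difference between the two ordering modes, and one way to handle a timeout, can be illustrated with plain `CompletableFuture`s. This is a simulation under assumptions, not Flink's implementation: `AsyncOrderSketch`, `unordered` and `ordered` are hypothetical names, and the timeout here is applied while waiting for each result in turn rather than per request.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical simulation of result ordering; it does not use Flink's AsyncDataStream.
class AsyncOrderSketch {

    // "Allow out-of-order": each result is emitted as soon as its lookup completes,
    // so the returned list fills up in completion order.
    static List<String> unordered(List<CompletableFuture<String>> lookups) {
        List<String> out = Collections.synchronizedList(new ArrayList<>());
        for (CompletableFuture<String> lookup : lookups) {
            lookup.thenAccept(out::add);
        }
        return out;
    }

    // "Preserve order": results are emitted in input order. A lookup that does not
    // answer within the timeout is replaced by a fallback value.
    static List<String> ordered(List<CompletableFuture<String>> lookups, long timeoutMs, String fallback)
            throws Exception {
        List<String> out = new ArrayList<>();
        for (CompletableFuture<String> lookup : lookups) {
            try {
                out.add(lookup.get(timeoutMs, TimeUnit.MILLISECONDS));
            } catch (TimeoutException e) {
                out.add(fallback);
            }
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        CompletableFuture<String> first = new CompletableFuture<>();
        CompletableFuture<String> second = new CompletableFuture<>();
        List<String> emitted = unordered(Arrays.asList(first, second));
        second.complete("b"); // the second request answers before the first
        first.complete("a");
        System.out.println(emitted);                                      // [b, a]
        System.out.println(ordered(Arrays.asList(first, second), 100, "")); // [a, b]
    }
}
```

Preserving order means a slow lookup holds back results that are already available behind it, which is the usual trade-off against the out-of-order mode.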
The following example uses async I/O to look up the dimension table:
package join;

import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * Create By 鳴宇淳 on 2020/6/1
 **/
public class JoinDemo3 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Tuple2<String, Integer>> textStream = env.socketTextStream("localhost", 9000, "\n")
                .map(p -> {
                    // Input format: user,1000 (user name, city ID)
                    String[] list = p.split(",");
                    return new Tuple2<String, Integer>(list[0], Integer.valueOf(list[1]));
                })
                .returns(new TypeHint<Tuple2<String, Integer>>() {
                });
        DataStream<Tuple3<String, Integer, String>> orderedResult = AsyncDataStream
                // Preserve order: results keep the input order; timeout 1 second, capacity 2,
                // and backpressure once the capacity is exceeded
                .orderedWait(textStream, new JoinDemo3AyncFunction(), 1000L, TimeUnit.MILLISECONDS, 2)
                .setParallelism(1);
        DataStream<Tuple3<String, Integer, String>> unorderedResult = AsyncDataStream
                // Allow out-of-order: results may be emitted out of order; timeout 1 second, capacity 2,
                // and backpressure once the capacity is exceeded
                .unorderedWait(textStream, new JoinDemo3AyncFunction(), 1000L, TimeUnit.MILLISECONDS, 2)
                .setParallelism(1);
        orderedResult.print();
        unorderedResult.print();
        env.execute("joinDemo");
    }

    // Extends RichAsyncFunction to look up the dimension table stored in MySQL.
    // Input: (user name, city ID); output: Tuple3<user name, city ID, city name>
    static class JoinDemo3AyncFunction extends RichAsyncFunction<Tuple2<String, Integer>, Tuple3<String, Integer, String>> {
        // Connection settings
        private static final String jdbcUrl = "jdbc:mysql://192.168.145.1:3306?useSSL=false";
        private static final String username = "root";
        private static final String password = "123";
        private static final String driverName = "com.mysql.jdbc.Driver";
        private transient Connection conn;
        private transient PreparedStatement ps;
        // JDBC calls block, so they run on a separate thread instead of inside asyncInvoke().
        // A single thread keeps the shared PreparedStatement safe; a truly asynchronous
        // client or a connection pool would allow concurrent lookups.
        private transient ExecutorService executor;

        @Override
        public void open(Configuration parameters) throws Exception {
            super.open(parameters);
            Class.forName(driverName);
            conn = DriverManager.getConnection(jdbcUrl, username, password);
            ps = conn.prepareStatement("select city_name from tmp.city_info where id = ?");
            executor = Executors.newSingleThreadExecutor();
        }

        @Override
        public void close() throws Exception {
            super.close();
            if (executor != null) {
                executor.shutdown();
            }
            if (ps != null) {
                ps.close();
            }
            if (conn != null) {
                conn.close();
            }
        }

        // Asynchronous lookup: hand the query to the executor and return immediately
        @Override
        public void asyncInvoke(Tuple2<String, Integer> input, ResultFuture<Tuple3<String, Integer, String>> resultFuture) throws Exception {
            CompletableFuture
                    .supplyAsync(() -> queryCityName(input.f1), executor)
                    .whenComplete((cityName, error) -> {
                        if (error != null) {
                            resultFuture.completeExceptionally(error);
                        } else {
                            resultFuture.complete(Collections.singleton(new Tuple3<>(input.f0, input.f1, cityName)));
                        }
                    });
        }

        // Query by city ID; returns "" when no row matches, because Flink tuples cannot hold null fields
        private String queryCityName(int cityId) {
            try {
                ps.setInt(1, cityId);
                try (ResultSet rs = ps.executeQuery()) {
                    String cityName = rs.next() ? rs.getString(1) : null;
                    return cityName != null ? cityName : "";
                }
            } catch (SQLException e) {
                throw new RuntimeException(e);
            }
        }

        // Timeout handling: emit the record with an empty city name
        @Override
        public void timeout(Tuple2<String, Integer> input, ResultFuture<Tuple3<String, Integer, String>> resultFuture) throws Exception {
            resultFuture.complete(Collections.singleton(new Tuple3<>(input.f0, input.f1, "")));
        }
    }
}
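3. Broadcast dimension table
This approach uses Flink's Broadcast State to broadcast the dimension stream to the downstream operator, which performs the join. Its characteristics are:
Pros: changes to the dimension data are reflected in the results immediately.
Cons: the data is held in memory, so only a fairly small dimension table is supported.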
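The semantics behind these trade-offs can be modelled in a few lines of plain Java: every parallel subtask keeps its own full copy of the dimension data, and each dimension record is delivered to all of them. `BroadcastSketch` and its methods are hypothetical names for illustration only and do not use Flink's Broadcast State API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical simulation of broadcast semantics; not Flink's Broadcast State API.
class BroadcastSketch {
    // One private copy of the dimension table per parallel subtask
    private final List<Map<Integer, String>> subtaskState = new ArrayList<>();

    BroadcastSketch(int parallelism) {
        for (int i = 0; i < parallelism; i++) {
            subtaskState.add(new HashMap<>());
        }
    }

    // A dimension record is delivered to every subtask.
    void broadcast(int cityId, String cityName) {
        for (Map<Integer, String> state : subtaskState) {
            state.put(cityId, cityName);
        }
    }

    // A main-stream record is processed by exactly one subtask, using its local copy.
    String process(int subtask, String userName, int cityId) {
        String cityName = subtaskState.get(subtask).getOrDefault(cityId, "");
        return userName + "," + cityId + "," + cityName;
    }
}
```

Because every subtask holds the whole table, memory use grows with both table size and parallelism. In this model, a user record processed before its city record has been broadcast is joined with an empty city name.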
Here is an example:
package join;

import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ReadOnlyBroadcastState;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

/**
 * Create By 鳴宇淳 on 2020/6/1
 * Reads the main stream (user name, city ID) from a socket and joins it with
 * the city dimension table (city ID, city name) to produce
 * (user name, city ID, city name).
 * The dimension data is supplied as a Flink broadcast stream.
 **/
public class JoinDemo4 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // Main stream
        DataStream<Tuple2<String, Integer>> textStream = env.socketTextStream("localhost", 9000, "\n")
                .map(p -> {
                    // Input format: user,1000 (user name, city ID)
                    String[] list = p.split(",");
                    return new Tuple2<String, Integer>(list[0], Integer.valueOf(list[1]));
                })
                .returns(new TypeHint<Tuple2<String, Integer>>() {
                });
        // City stream
        DataStream<Tuple2<Integer, String>> cityStream = env.socketTextStream("localhost", 9001, "\n")
                .map(p -> {
                    // Input format: city ID,city name
                    String[] list = p.split(",");
                    return new Tuple2<Integer, String>(Integer.valueOf(list[0]), list[1]);
                })
                .returns(new TypeHint<Tuple2<Integer, String>>() {
                });
        // Turn the city stream into a broadcast stream
        final MapStateDescriptor<Integer, String> broadcastDesc = new MapStateDescriptor<>("broad1", Integer.class, String.class);
        BroadcastStream<Tuple2<Integer, String>> broadcastStream = cityStream.broadcast(broadcastDesc);
        DataStream<Tuple3<String, Integer, String>> result = textStream.connect(broadcastStream)
                .process(new BroadcastProcessFunction<Tuple2<String, Integer>, Tuple2<Integer, String>, Tuple3<String, Integer, String>>() {
                    // Non-broadcast side: join each user record with the dimension data
                    @Override
                    public void processElement(Tuple2<String, Integer> value, ReadOnlyContext ctx, Collector<Tuple3<String, Integer, String>> out) throws Exception {
                        ReadOnlyBroadcastState<Integer, String> state = ctx.getBroadcastState(broadcastDesc);
                        String cityName = "";
                        if (state.contains(value.f1)) {
                            cityName = state.get(value.f1);
                        }
                        out.collect(new Tuple3<>(value.f0, value.f1, cityName));
                    }

                    // Broadcast side: store or update the city record in broadcast state
                    @Override
                    public void processBroadcastElement(Tuple2<Integer, String> value, Context ctx, Collector<Tuple3<String, Integer, String>> out) throws Exception {
                        System.out.println("Received broadcast record: " + value);
                        ctx.getBroadcastState(broadcastDesc).put(value.f0, value.f1);
                    }
                });
        result.print();
        env.execute("joinDemo");
    }
}
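4. Temporal table function join
A temporal table is a view of a continuously changing table at a particular point in time. A temporal table function is a table function that takes a time argument and returns the view of the temporal table at that time.
The dimension stream can be mapped to a temporal table and the main stream joined against it, so each record is matched with a specific version of the dimension data (its state at some moment in the past).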
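The idea of joining against "the version valid at a given time" can be modelled with a sorted map per key. The sketch below is a conceptual model only, not how Flink implements temporal tables; `VersionedDimSketch`, `update` and `lookup` are hypothetical names, and timestamps are plain `long` values.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a versioned dimension table; not Flink's TemporalTableFunction.
class VersionedDimSketch {
    // cityId -> (timestamp from which a name is valid -> city name)
    private final Map<Integer, TreeMap<Long, String>> versions = new HashMap<>();

    // Records that cityId is called cityName from time validFrom onwards.
    void update(int cityId, long validFrom, String cityName) {
        versions.computeIfAbsent(cityId, k -> new TreeMap<>()).put(validFrom, cityName);
    }

    // Returns the name that was valid at the given time, or "" if none existed yet.
    String lookup(int cityId, long time) {
        TreeMap<Long, String> history = versions.get(cityId);
        if (history == null) {
            return "";
        }
        Map.Entry<Long, String> version = history.floorEntry(time);
        return version == null ? "" : version.getValue();
    }
}
```

For example, after `update(1001, 10, "peking")` and `update(1001, 20, "beijing")`, a user record with timestamp 15 is joined with "peking" and one with timestamp 20 with "beijing".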
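The characteristics of a temporal table function join are:
Pros: the dimension table can be large, updates take effect promptly, no external store is required, and different versions of the dimension data can be joined.
Cons: it is only available through the Flink Table API and SQL.
Here is an example: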
package join;

import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.Table;
import org.apache.flink.table.api.java.StreamTableEnvironment;
import org.apache.flink.table.functions.TemporalTableFunction;
import org.apache.flink.types.Row;

/**
 * Create By 鳴宇淳 on 2020/6/1
 * Note: the imports and string-based field expressions used here match the Flink
 * releases current when this was written (around 1.10); later versions moved or
 * deprecated some of these APIs.
 **/
public class JoinDemo5 {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        EnvironmentSettings bsSettings = EnvironmentSettings.newInstance().useBlinkPlanner().inStreamingMode().build();
        StreamTableEnvironment tableEnv = StreamTableEnvironment.create(env, bsSettings);
        // Main stream
        DataStream<Tuple2<String, Integer>> textStream = env.socketTextStream("localhost", 9000, "\n")
                .map(p -> {
                    // Input format: user,1000 (user name, city ID)
                    String[] list = p.split(",");
                    return new Tuple2<String, Integer>(list[0], Integer.valueOf(list[1]));
                })
                .returns(new TypeHint<Tuple2<String, Integer>>() {
                });
        // City stream
        DataStream<Tuple2<Integer, String>> cityStream = env.socketTextStream("localhost", 9001, "\n")
                .map(p -> {
                    // Input format: city ID,city name
                    String[] list = p.split(",");
                    return new Tuple2<Integer, String>(Integer.valueOf(list[0]), list[1]);
                })
                .returns(new TypeHint<Tuple2<Integer, String>>() {
                });
        // Convert both streams to tables; "ps" is a processing-time attribute
        Table userTable = tableEnv.fromDataStream(textStream, "user_name,city_id,ps.proctime");
        Table cityTable = tableEnv.fromDataStream(cityStream, "city_id,city_name,ps.proctime");
        // Define a TemporalTableFunction with time attribute "ps" and key "city_id"
        TemporalTableFunction dimCity = cityTable.createTemporalTableFunction("ps", "city_id");
        // Register the table function
        tableEnv.registerFunction("dimCity", dimCity);
        // Join query
        Table result = tableEnv
                .sqlQuery("select u.user_name,u.city_id,d.city_name from " + userTable + " as u " +
                        ", Lateral table (dimCity(u.ps)) d " +
                        "where u.city_id=d.city_id");
        // Print the result
        DataStream<Row> resultDs = tableEnv.toAppendStream(result, Row.class);
        resultDs.print();
        env.execute("joinDemo");
    }
}
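5. Comparison of the four dimension-table join approaches
| | Preloaded into memory | Hot-storage lookup | Broadcast dimension table | Temporal table function join |
|---|---|---|---|---|
| Implementation complexity | Low | Medium | Low | Low |
| Dimension data volume | Low | High | Low | High |
| Dimension update frequency | Low | Medium | High | High |
| Timeliness of dimension updates | Low | Medium | High | High |
| Form of the dimension table | | Hot storage | Real-time stream | Real-time stream |
| Depends on external storage | Low | Yes | No | No |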