Flink SQL Hands-On Examples
1. Environment Setup
Download the code and set up the environment. The prerequisite is a working Docker environment.
git clone [email protected]:ververica/sql-training.git
cd sql-training
docker-compose up -d
This first pulls the dependency images, which can take a while, so be patient.
Next, enter the SQL client:
docker-compose exec sql-client ./sql-client.sh
2. Hands-On Demos
Table definition: the Rides table is a source table with append update mode. Note that the rideTime column is declared as an event-time (rowtime) attribute derived from the JSON field eventTime, with periodic watermarks bounded by a 60000 ms (1 minute) delay.
tables:
  - name: Rides                        # table name
    type: source                       # table type
    update-mode: append                # update mode
    schema:
      - name: rideId                   # ride ID
        type: LONG
      - name: taxiId                   # taxi ID
        type: LONG
      - name: isStart                  # whether this is the start event of the ride
        type: BOOLEAN
      - name: lon                      # longitude
        type: FLOAT
      - name: lat                      # latitude
        type: FLOAT
      - name: rideTime                 # event time
        type: TIMESTAMP
        rowtime:
          timestamps:
            type: "from-field"
            from: "eventTime"
          watermarks:
            type: "periodic-bounded"
            delay: "60000"
      - name: psgCnt                   # passenger count
        type: INT
    connector:
      property-version: 1
      type: kafka
      version: 0.11
      topic: Rides
      startup-mode: earliest-offset    # Kafka offset
      properties:
        - key: zookeeper.connect
          value: ${ZOOKEEPER}:2181
        - key: bootstrap.servers
          value: ${KAFKA}:9092
        - key: group.id
          value: testGroup
    format:
      property-version: 1
      type: json
      schema: "ROW(rideId LONG, isStart BOOLEAN, eventTime TIMESTAMP, lon FLOAT, lat FLOAT, psgCnt INT, taxiId LONG)"
2.1 Requirement 1 (ride records within New York City)
The UDF that checks whether a longitude/latitude pair lies within New York City looks as follows (the geo check itself is delegated to the training project's GeoUtils helper, analogous to the ToAreaId UDF used later):
import org.apache.flink.table.functions.ScalarFunction;

public class IsInNYC extends ScalarFunction {
    public boolean eval(float lon, float lat) {
        // Delegate to the training project's GeoUtils helper,
        // which checks the coordinates against the NYC bounding box.
        return GeoUtils.isInNYC(lon, lat);
    }
}
The UDF must be registered in the configuration file that the SQL client loads at startup:
functions:
  - name: isInNYC
    from: class
    class: com.dataartisans.udfs.IsInNYC
The SQL to query ride records that appear within New York City:
select * from Rides where isInNYC(lon, lat);
Since we will keep using NYC ride records in later queries, we can create a view for convenience:
CREATE VIEW nyc_view as select * from Rides where isInNYC(lon, lat);
Running show tables confirms the view exists:
Flink SQL> show tables;
nyc_view
Rides
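The view now behaves like any other table in later queries, for example (the psgCnt filter is purely illustrative):
select * from nyc_view where psgCnt > 2;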
2.2 Requirement 2 (count ride records per passenger count)
The SQL query is as follows:
select psgCnt, count(*) from Rides group by psgCnt;
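Because this aggregation runs over an unbounded stream without a window, the counts keep updating as new rides arrive. To watch the individual updates instead of the continuously materialized table, the legacy SQL client's result mode can be switched to changelog before re-running the query (a client setting, not part of the query; assuming the client in the training image supports it):
Flink SQL> SET execution.result-mode=changelog;
Flink SQL> select psgCnt, count(*) from Rides group by psgCnt;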
2.3 Requirement 3 (count vehicles entering each NYC area every 5 minutes)
First we need to know which area a given longitude/latitude pair belongs to, which calls for another UDF:
import org.apache.flink.table.functions.ScalarFunction;

public class ToAreaId extends ScalarFunction {
    public int eval(float lon, float lat) {
        // Map the coordinates to a grid-cell (area) ID via the GeoUtils helper.
        return GeoUtils.mapToGridCell(lon, lat);
    }
}
It is registered in the configuration file in the same way:
functions:
  - name: toAreaId
    from: class
    class: com.dataartisans.udfs.ToAreaId
The SQL implementation is as follows:
select
  toAreaId(lon, lat) as area,
  isStart,
  TUMBLE_END(rideTime, INTERVAL '5' MINUTE) as window_end,
  count(*) as cnt
from Rides
where isInNYC(lon, lat)
group by
  toAreaId(lon, lat), isStart, TUMBLE(rideTime, INTERVAL '5' MINUTE);
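A TUMBLE window emits one result per area and isStart flag every full 5 minutes. If you would rather have a 5-minute count that refreshes every minute, the same query can use a sliding (HOP) window; a minimal sketch, where the 1-minute slide is an illustrative choice:
select
  toAreaId(lon, lat) as area,
  isStart,
  HOP_END(rideTime, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE) as window_end,
  count(*) as cnt
from Rides
where isInNYC(lon, lat)
group by
  toAreaId(lon, lat), isStart, HOP(rideTime, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE);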
2.4 Requirement 4 (write the total number of passengers picked up every 10 minutes to Kafka)
First we define a Kafka sink table:
  - name: Sink_TenMinPsgCnts           # table name
    type: sink
    update-mode: append                # append mode
    schema:
      - name: cntStart                 # window start time
        type: TIMESTAMP
      - name: cntEnd                   # window end time
        type: TIMESTAMP
      - name: cnt                      # total passenger count
        type: LONG
    connector:
      property-version: 1
      type: kafka
      version: 0.11
      topic: TenMinPsgCnts
      startup-mode: earliest-offset
      properties:                      # Kafka configuration
        - key: zookeeper.connect
          value: zookeeper:2181
        - key: bootstrap.servers
          value: kafka:9092
        - key: group.id
          value: trainingGroup
    format:
      property-version: 1
      type: json
      schema: "ROW(cntStart TIMESTAMP, cntEnd TIMESTAMP, cnt LONG)"
The SQL statement is below. Since a tumbling-window aggregation emits each window's result exactly once when the window fires, the sink's append update mode is sufficient:
INSERT INTO Sink_TenMinPsgCnts
select
  TUMBLE_START(rideTime, INTERVAL '10' MINUTE) as cntStart,
  TUMBLE_END(rideTime, INTERVAL '10' MINUTE) as cntEnd,
  cast(sum(psgCnt) as bigint) as cnt
from Rides
group by TUMBLE(rideTime, INTERVAL '10' MINUTE);
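To inspect how the planner translates the windowed aggregation, the legacy SQL client also offers an EXPLAIN command for queries (assuming the client version in the training image supports it):
Flink SQL> EXPLAIN select cast(sum(psgCnt) as bigint) as cnt from Rides group by TUMBLE(rideTime, INTERVAL '10' MINUTE);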
To verify that records are flowing into the topic, you can also open a Kafka console consumer:
docker-compose exec sql-client /opt/kafka-client/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic TenMinPsgCnts --from-beginning
2.5 Requirement 5 (write the number of rides starting in each area to Elasticsearch)
First we define an Elasticsearch sink table. Because the aggregation below has no window, its per-area counts keep updating as new rides arrive, so the sink uses upsert mode:
  - name: Sink_AreaCnts
    type: sink
    update-mode: upsert
    schema:
      - name: areaId
        type: INT
      - name: cnt
        type: LONG
    connector:
      type: elasticsearch
      version: 6
      hosts:
        - hostname: "elasticsearch"
          port: 9200
          protocol: "http"
      index: "area-cnts"
      document-type: "areacnt"
      key-delimiter: "$"
    format:
      property-version: 1
      type: json
      schema: "ROW(areaId INT, cnt LONG)"
The SQL statement to execute is:
insert into Sink_AreaCnts
select
  toAreaId(lon, lat) as areaId,
  count(*) as cnt
from Rides
where isStart
group by toAreaId(lon, lat);
You can visit http://localhost:9200/_search?pretty=true to check how the data is being inserted.