Flink SQL Hands-On Examples
1. Environment Setup
Download the code and set up the environment. The prerequisite is a working Docker environment.
git clone [email protected]:ververica/sql-training.git
cd sql-training
docker-compose up -d
This first pulls the dependency images, which can take a while, so be patient.
Next, enter the SQL client:
docker-compose exec sql-client ./sql-client.sh
2. Hands-On Demos
Table definition: the Rides table is a source table with append update mode. Note that the rideTime column is declared as an event-time (rowtime) attribute derived from the JSON field eventTime, with periodic watermarks bounded by a 60000 ms (1 minute) delay.
tables:
  - name: Rides                        # table name
    type: source                       # table type
    update-mode: append                # update mode
    schema:
      - name: rideId                   # ride ID
        type: LONG
      - name: taxiId                   # taxi ID
        type: LONG
      - name: isStart                  # whether this is the start event of the ride
        type: BOOLEAN
      - name: lon                      # longitude
        type: FLOAT
      - name: lat                      # latitude
        type: FLOAT
      - name: rideTime                 # event time
        type: TIMESTAMP
        rowtime:
          timestamps:
            type: "from-field"
            from: "eventTime"
          watermarks:
            type: "periodic-bounded"
            delay: "60000"
      - name: psgCnt                   # passenger count
        type: INT
    connector:
      property-version: 1
      type: kafka
      version: 0.11
      topic: Rides
      startup-mode: earliest-offset    # Kafka offset
      properties:
        - key: zookeeper.connect
          value: ${ZOOKEEPER}:2181
        - key: bootstrap.servers
          value: ${KAFKA}:9092
        - key: group.id
          value: testGroup
    format:
      property-version: 1
      type: json
      schema: "ROW(rideId LONG, isStart BOOLEAN, eventTime TIMESTAMP, lon FLOAT, lat FLOAT, psgCnt INT, taxiId LONG)"
2.1 Requirement 1 (ride records within New York City)
The UDF that checks whether a longitude/latitude pair lies within New York City looks as follows (the geo check itself is delegated to the training project's GeoUtils helper, analogous to the ToAreaId UDF used later):
import org.apache.flink.table.functions.ScalarFunction;

public class IsInNYC extends ScalarFunction {
    public boolean eval(float lon, float lat) {
        // Delegate to the training project's GeoUtils helper,
        // which checks the coordinates against the NYC bounding box.
        return GeoUtils.isInNYC(lon, lat);
    }
}
The UDF must be registered in the configuration file that the SQL client loads at startup:
functions:
  - name: isInNYC
    from: class
    class: com.dataartisans.udfs.IsInNYC
The SQL to query ride records that appear within New York City:
select * from Rides where isInNYC(lon, lat);
Since we will keep using NYC ride records in later queries, we can create a view for convenience:
CREATE VIEW nyc_view as select * from Rides where isInNYC(lon, lat);
Running show tables confirms the view exists:
Flink SQL> show tables;
nyc_view
Rides
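The view now behaves like any other table in later queries, for example (the psgCnt filter is purely illustrative):
select * from nyc_view where psgCnt > 2;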
2.2 Requirement 2 (count ride records per passenger count)
The SQL query is as follows:
select psgCnt, count(*) from Rides group by psgCnt;
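Because this aggregation runs over an unbounded stream without a window, the counts keep updating as new rides arrive. To watch the individual updates instead of the continuously materialized table, the legacy SQL client's result mode can be switched to changelog before re-running the query (a client setting, not part of the query; assuming the client in the training image supports it):
Flink SQL> SET execution.result-mode=changelog;
Flink SQL> select psgCnt, count(*) from Rides group by psgCnt;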
2.3 Requirement 3 (count vehicles entering each NYC area every 5 minutes)
First we need to know which area a given longitude/latitude pair belongs to, which calls for another UDF:
import org.apache.flink.table.functions.ScalarFunction;

public class ToAreaId extends ScalarFunction {
    public int eval(float lon, float lat) {
        // Map the coordinates to a grid-cell (area) ID via the GeoUtils helper.
        return GeoUtils.mapToGridCell(lon, lat);
    }
}
It is registered in the configuration file in the same way:
functions:
  - name: toAreaId
    from: class
    class: com.dataartisans.udfs.ToAreaId
The SQL implementation is as follows:
select
  toAreaId(lon, lat) as area,
  isStart,
  TUMBLE_END(rideTime, INTERVAL '5' MINUTE) as window_end,
  count(*) as cnt
from Rides
where isInNYC(lon, lat)
group by
  toAreaId(lon, lat), isStart, TUMBLE(rideTime, INTERVAL '5' MINUTE);
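A TUMBLE window emits one result per area and isStart flag every full 5 minutes. If you would rather have a 5-minute count that refreshes every minute, the same query can use a sliding (HOP) window; a minimal sketch, where the 1-minute slide is an illustrative choice:
select
  toAreaId(lon, lat) as area,
  isStart,
  HOP_END(rideTime, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE) as window_end,
  count(*) as cnt
from Rides
where isInNYC(lon, lat)
group by
  toAreaId(lon, lat), isStart, HOP(rideTime, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE);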
2.4 Requirement 4 (write the total number of passengers picked up every 10 minutes to Kafka)
First we define a Kafka sink table:
  - name: Sink_TenMinPsgCnts           # table name
    type: sink
    update-mode: append                # append mode
    schema:
      - name: cntStart                 # window start time
        type: TIMESTAMP
      - name: cntEnd                   # window end time
        type: TIMESTAMP
      - name: cnt                      # total passenger count
        type: LONG
    connector:
      property-version: 1
      type: kafka
      version: 0.11
      topic: TenMinPsgCnts
      startup-mode: earliest-offset
      properties:                      # Kafka configuration
        - key: zookeeper.connect
          value: zookeeper:2181
        - key: bootstrap.servers
          value: kafka:9092
        - key: group.id
          value: trainingGroup
    format:
      property-version: 1
      type: json
      schema: "ROW(cntStart TIMESTAMP, cntEnd TIMESTAMP, cnt LONG)"
The SQL statement is below. Since a tumbling-window aggregation emits each window's result exactly once when the window fires, the sink's append update mode is sufficient:
INSERT INTO Sink_TenMinPsgCnts
select
  TUMBLE_START(rideTime, INTERVAL '10' MINUTE) as cntStart,
  TUMBLE_END(rideTime, INTERVAL '10' MINUTE) as cntEnd,
  cast(sum(psgCnt) as bigint) as cnt
from Rides
group by TUMBLE(rideTime, INTERVAL '10' MINUTE);
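To inspect how the planner translates the windowed aggregation, the legacy SQL client also offers an EXPLAIN command for queries (assuming the client version in the training image supports it):
Flink SQL> EXPLAIN select cast(sum(psgCnt) as bigint) as cnt from Rides group by TUMBLE(rideTime, INTERVAL '10' MINUTE);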
To verify that records are flowing into the topic, you can also open a Kafka console consumer:
docker-compose exec sql-client /opt/kafka-client/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic TenMinPsgCnts --from-beginning
2.5 Requirement 5 (write the number of rides starting in each area to Elasticsearch)
First we define an Elasticsearch sink table. Because the aggregation below has no window, its per-area counts keep updating as new rides arrive, so the sink uses upsert mode:
  - name: Sink_AreaCnts
    type: sink
    update-mode: upsert
    schema:
      - name: areaId
        type: INT
      - name: cnt
        type: LONG
    connector:
      type: elasticsearch
      version: 6
      hosts:
        - hostname: "elasticsearch"
          port: 9200
          protocol: "http"
      index: "area-cnts"
      document-type: "areacnt"
      key-delimiter: "$"
    format:
      property-version: 1
      type: json
      schema: "ROW(areaId INT, cnt LONG)"
The SQL statement to execute is:
insert into Sink_AreaCnts
select
  toAreaId(lon, lat) as areaId,
  count(*) as cnt
from Rides
where isStart
group by toAreaId(lon, lat);
You can visit http://localhost:9200/_search?pretty=true to check how the data is being inserted.