Hive Metastore作爲一個元數據管理的標準在Hadoop生態系統中已經成爲公認的事實,因此Flink也採用HiveCatalog作爲表元數據持久化的介質。對於同時部署了Hive和Flink的公司來說,可以方便管理元數據,而對於只部署了Flink的公司來說,HiveCatalog也是Flink唯一支持的元數據持久化的介質。不將元數據持久化的時候,開發過程中的每個地方都需要使用DDL重新將Kafka等數據源的數據註冊到臨時的Catalog中,浪費了很多精力和時間。
1、使用Hive作爲元數據存儲介質,需要制定hive-site.xml配置文件的位置,裏面制定了存儲hive-metadata的MySQL或其他關係型數據庫的連接信息。
2、sql-client配置訪問HiveCatalog,修改flink安裝目錄下conf/sql-client-defaults.yaml文件,將catalogs模塊的內容改爲:
#==============================================================================
# Catalogs
#==============================================================================
# Define catalogs here.
catalogs:
# A typical catalog definition looks like:
- name: myhive
type: hive
hive-conf-dir: /apps/hive/conf
hive-version: 2.3.5
default-database: gmall
executions模塊也需要修改:
execution:
# select the implementation responsible for planning table programs
# possible values are 'blink' (used by default) or 'old'
planner: blink
# 'batch' or 'streaming' execution
type: streaming
# allow 'event-time' or only 'processing-time' in sources
time-characteristic: event-time
# interval in ms for emitting periodic watermarks
periodic-watermarks-interval: 200
# 'changelog' or 'table' presentation of results
result-mode: table
# maximum number of maintained rows in 'table' presentation of results
max-table-result-rows: 1000
# parallelism of the program
parallelism: 4
# maximum parallelism
max-parallelism: 128
# minimum idle state retention in ms
min-idle-state-retention: 0
# maximum idle state retention in ms
max-idle-state-retention: 0
# current catalog ('default_catalog' by default)
current-catalog: hive
# current database of the current catalog (default database of the catalog by default)
current-database: flink
# controls how table programs are restarted in case of a failures
restart-strategy:
# strategy type
# possible values are "fixed-delay", "failure-rate", "none", or "fallback" (default)
type: fallback
hive-conf-dir目錄需要根據實際情況填寫。
此時,通過sql-client.sh embedded啓動sql-client:
Flink SQL> use catalog myhive; # 選擇Catalog
Flink SQL> show databases;
default
gmall
test
Flink SQL> create database flink;
[INFO] Database has been created.
Flink SQL> use flink;
Flink SQL> CREATE TABLE log_user_cart (name STRING, logtime BIGINT, etime STRING, prod_url STRING) WITH (
> 'connector.type' = 'kafka',
> 'connector.version' = 'universal',
> 'connector.topic' = 'test_log_user_cart',
> 'connector.properties.zookeeper.connect' = 'm.hadoop.com:2181,s1.hadoop.com:2181,s2.hadoop.com:2181',
> 'connector.properties.bootstrap.servers' = 'm.hadoop.com:9092,s1.hadoop.com:9092,s2.hadoop.com:9092',
> 'format.type' = 'csv'
> );
[INFO] Table has been created.
Flink SQL>
切換到別的窗口打開hive的終端,即可看到該數據庫:
hive (flink)> show databases;
OK
database_name
default
flink
gmall
test
Time taken: 0.019 seconds, Fetched: 4 row(s)
hive (flink)> use flink;
OK
Time taken: 0.025 seconds
hive (flink)> show tables;
OK
tab_name
log_user_cart
Time taken: 0.026 seconds, Fetched: 1 row(s)
hive (flink)> select * from log_user_cart;
FAILED: SemanticException Line 0:-1 Invalid column reference 'TOK_ALLCOLREF'
hive (flink)>
在Hive中可以看到sql-client中創建的數據庫和表,但是無法查詢該數據表。
在sql-client中查詢該數據表:
3、代碼中訪問HiveCatalog:
val settings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build()
val tableEnv = TableEnvironment.create(settings)
val name = "myhive"
val defaultDatabase = "gmall"
val hiveConfDir = "/apps/hive/conf" // a local path
val version = "2.3.5"
val hive = new HiveCatalog(name, defaultDatabase, hiveConfDir, version)
tableEnv.registerCatalog("myhive", hive)
注意:使用時,需要從hive安裝目錄的lib下拷貝jar包到flink的lib目錄,需要拷貝的jar包包括:
167761 Mar 19 20:50 antlr-runtime-3.5.2.jar
366748 Mar 19 20:48 datanucleus-api-jdo-4.2.4.jar
2016766 Mar 19 20:48 datanucleus-core-4.1.17.jar
1908681 Mar 19 20:48 datanucleus-rdbms-4.1.19.jar
74416 Mar 20 20:57 flink-avro-1.10.0.jar
357555 Mar 19 20:33 flink-connector-hive_2.11-1.10.0.jar
81579 Mar 20 20:54 flink-connector-kafka_2.11-1.10.0.jar
107027 Mar 20 21:01 flink-connector-kafka-base_2.11-1.10.0.jar
37311 Mar 20 20:55 flink-csv-1.10.0.jar
110055308 Feb 8 02:54 flink-dist_2.11-1.10.0.jar
37966 Mar 20 20:56 flink-hadoop-fs-1.10.0.jar
43007 Mar 18 23:45 flink-json-1.10.0.jar
43317025 Mar 18 23:45 flink-shaded-hadoop-2-uber-2.8.3-10.0.jar
19301237 Feb 8 02:54 flink-table_2.11-1.10.0.jar
55339 Mar 19 20:34 flink-table-api-scala-bridge_2.11-1.10.0.jar
22520058 Feb 8 02:54 flink-table-blink_2.11-1.10.0.jar
241622 Mar 19 00:28 gson-2.8.5.jar
34271938 Mar 19 20:32 hive-exec-2.3.5.jar
348625 Mar 19 22:27 jackson-core-2.10.1.jar
249790 Mar 19 20:50 javax.jdo-3.2.0-m3.jar
2736313 Mar 18 23:45 kafka-clients-2.3.0.jar
489884 Sep 2 2019 log4j-1.2.17.jar
1007502 Mar 19 20:49 mysql-connector-java-5.1.47.jar
9931 Sep 2 2019 slf4j-log4j12-1.7.15.jar
有些jar包需要手動去maven官網下載,如gson-2.8.5.jar,只需在百度輸入maven gson搜索即可