Flink HiveCatalog

Hive Metastore has become the de facto standard for metadata management in the Hadoop ecosystem, so Flink also uses HiveCatalog as its medium for persisting table metadata. For companies that run both Hive and Flink, this makes it convenient to manage metadata in one place; for companies that run only Flink, HiveCatalog is currently the only persistent metadata store that Flink ships with. Without persisting metadata, every development session has to re-register Kafka and other sources into a temporary catalog with DDL, which wastes a great deal of time and effort.

1. To use Hive as the metadata store, you need to specify the location of the hive-site.xml configuration file, which contains the connection information for the MySQL (or other relational) database that stores the Hive metadata.
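
As an illustration, here is a minimal hive-site.xml sketch for a MySQL-backed metastore; the host, database name, user, and password below are placeholders, not values taken from this setup:

<configuration>
  <!-- JDBC connection to the relational database that stores the Hive metadata -->
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://m.hadoop.com:3306/hive_metadata?useSSL=false</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
  </property>
</configuration>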

2. Configure sql-client to access the HiveCatalog. Edit the conf/sql-client-defaults.yaml file under the Flink installation directory and change the catalogs section to:

#==============================================================================
# Catalogs
#==============================================================================

# Define catalogs here.

catalogs:
# A typical catalog definition looks like:
 - name: myhive
   type: hive
   hive-conf-dir: /apps/hive/conf
   hive-version: 2.3.5
   default-database: gmall

The execution section also needs to be modified:

execution:
  # select the implementation responsible for planning table programs
  # possible values are 'blink' (used by default) or 'old'
  planner: blink
  # 'batch' or 'streaming' execution
  type: streaming
  # allow 'event-time' or only 'processing-time' in sources
  time-characteristic: event-time
  # interval in ms for emitting periodic watermarks
  periodic-watermarks-interval: 200
  # 'changelog' or 'table' presentation of results
  result-mode: table
  # maximum number of maintained rows in 'table' presentation of results
  max-table-result-rows: 1000
  # parallelism of the program
  parallelism: 4
  # maximum parallelism
  max-parallelism: 128
  # minimum idle state retention in ms
  min-idle-state-retention: 0
  # maximum idle state retention in ms
  max-idle-state-retention: 0
  # current catalog ('default_catalog' by default)
  current-catalog: myhive
  # current database of the current catalog (default database of the catalog by default)
  current-database: flink
  # controls how table programs are restarted in case of failures
  restart-strategy:
    # strategy type
    # possible values are "fixed-delay", "failure-rate", "none", or "fallback" (default)
    type: fallback

Set hive-conf-dir according to your actual environment.

At this point, start sql-client with sql-client.sh embedded:
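
Run from the Flink installation directory (the exact path depends on your setup):

$ bin/sql-client.sh embedded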


Flink SQL> use catalog myhive;    -- switch to the myhive catalog

Flink SQL> show databases;
default
gmall
test

Flink SQL> create database flink;
[INFO] Database has been created.

Flink SQL> use flink;

Flink SQL> CREATE TABLE log_user_cart (name STRING, logtime BIGINT, etime STRING, prod_url STRING) WITH (
>    'connector.type' = 'kafka',
>    'connector.version' = 'universal',
>    'connector.topic' = 'test_log_user_cart',
>    'connector.properties.zookeeper.connect' = 'm.hadoop.com:2181,s1.hadoop.com:2181,s2.hadoop.com:2181',
>    'connector.properties.bootstrap.servers' = 'm.hadoop.com:9092,s1.hadoop.com:9092,s2.hadoop.com:9092',
>    'format.type' = 'csv'
> );
[INFO] Table has been created.

Flink SQL> 

Switch to another window and open the Hive CLI; the new database is visible there:

hive (flink)> show databases;
OK
database_name
default
flink
gmall
test
Time taken: 0.019 seconds, Fetched: 4 row(s)
hive (flink)> use flink;
OK
Time taken: 0.025 seconds
hive (flink)> show tables;
OK
tab_name
log_user_cart
Time taken: 0.026 seconds, Fetched: 1 row(s)
hive (flink)> select * from log_user_cart;
FAILED: SemanticException Line 0:-1 Invalid column reference 'TOK_ALLCOLREF'
hive (flink)> 

The database and table created from sql-client are visible in Hive, but the table itself cannot be queried from Hive, because it is a generic Flink connector table whose Kafka properties are stored as metadata that Hive does not know how to interpret.

In sql-client, by contrast, the table can still be queried.
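
As a sketch (assuming the Kafka topic test_log_user_cart is receiving data in the CSV format declared above), the query would simply be:

Flink SQL> select * from log_user_cart;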

3. Accessing the HiveCatalog from code:

import org.apache.flink.table.api.{EnvironmentSettings, TableEnvironment}
import org.apache.flink.table.catalog.hive.HiveCatalog

// Blink planner in batch mode, as in the Flink 1.10 documentation
val settings = EnvironmentSettings.newInstance().useBlinkPlanner().inBatchMode().build()
val tableEnv = TableEnvironment.create(settings)

val name            = "myhive"
val defaultDatabase = "gmall"
val hiveConfDir     = "/apps/hive/conf" // a local path containing hive-site.xml
val version         = "2.3.5"

// Register the metastore-backed catalog with the table environment
val hive = new HiveCatalog(name, defaultDatabase, hiveConfDir, version)
tableEnv.registerCatalog("myhive", hive)
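
With the catalog registered, a minimal sketch of how the rest of the program might use it; the catalog, database, and table names follow the walkthrough above and are assumptions about your environment:

// Make the Hive-backed catalog and the flink database the session defaults
tableEnv.useCatalog("myhive")
tableEnv.useDatabase("flink")

// Tables persisted through the catalog (e.g. log_user_cart created earlier)
// are now resolvable by their plain names
tableEnv.listTables().foreach(println)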

 

Note: for this to work, the required dependency jars must be placed in Flink's lib directory; some of them (for example hive-exec and the datanucleus jars) can be copied from the lib directory of the Hive installation. The jars used here are:

    167761 Mar 19 20:50 antlr-runtime-3.5.2.jar
    366748 Mar 19 20:48 datanucleus-api-jdo-4.2.4.jar
   2016766 Mar 19 20:48 datanucleus-core-4.1.17.jar
   1908681 Mar 19 20:48 datanucleus-rdbms-4.1.19.jar
     74416 Mar 20 20:57 flink-avro-1.10.0.jar
    357555 Mar 19 20:33 flink-connector-hive_2.11-1.10.0.jar
     81579 Mar 20 20:54 flink-connector-kafka_2.11-1.10.0.jar
    107027 Mar 20 21:01 flink-connector-kafka-base_2.11-1.10.0.jar
     37311 Mar 20 20:55 flink-csv-1.10.0.jar
 110055308 Feb  8 02:54 flink-dist_2.11-1.10.0.jar
     37966 Mar 20 20:56 flink-hadoop-fs-1.10.0.jar
     43007 Mar 18 23:45 flink-json-1.10.0.jar
  43317025 Mar 18 23:45 flink-shaded-hadoop-2-uber-2.8.3-10.0.jar
  19301237 Feb  8 02:54 flink-table_2.11-1.10.0.jar
     55339 Mar 19 20:34 flink-table-api-scala-bridge_2.11-1.10.0.jar
  22520058 Feb  8 02:54 flink-table-blink_2.11-1.10.0.jar
    241622 Mar 19 00:28 gson-2.8.5.jar
  34271938 Mar 19 20:32 hive-exec-2.3.5.jar
    348625 Mar 19 22:27 jackson-core-2.10.1.jar
    249790 Mar 19 20:50 javax.jdo-3.2.0-m3.jar
   2736313 Mar 18 23:45 kafka-clients-2.3.0.jar
    489884 Sep  2  2019 log4j-1.2.17.jar
   1007502 Mar 19 20:49 mysql-connector-java-5.1.47.jar
      9931 Sep  2  2019 slf4j-log4j12-1.7.15.jar

Some jars have to be downloaded manually from the Maven repository, for example gson-2.8.5.jar; just search for "maven gson" on Baidu to find it.
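
For reference, gson is published on Maven Central under the following coordinates (matching the version listed above):

<dependency>
    <groupId>com.google.code.gson</groupId>
    <artifactId>gson</artifactId>
    <version>2.8.5</version>
</dependency>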
