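1. Overview of Hudi Integration with Hive

A Hudi source table corresponds to a set of data on HDFS. Using Spark, Flink, or the Hudi CLI, the data of a Hudi table can be mapped to a Hive external table. Against that external table, Hive can run real-time view, read-optimized view, and incremental queries.

2. Steps to Integrate Hudi with Hive

The steps below use Hive 3.1.2 and Hudi 0.12.0 as an example.

2.1 Copy the jars

2.1.1 Copy the compiled Hudi jars

Copy hudi-hive-sync-bundle-0.12.0.jar and hudi-hadoop-mr-bundle-0.12.0.jar into the lib directory on the Hive node.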

cd /home/hudi-0.12.0/packaging/hudi-hive-sync-bundle/target
cp ./hudi-hive-sync-bundle-0.12.0.jar /home/apache-hive-3.1.2-bin/lib/
cd /home/hudi-0.12.0/packaging/hudi-hadoop-mr-bundle/target
cp ./hudi-hadoop-mr-bundle-0.12.0.jar /home/apache-hive-3.1.2-bin/lib/
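
Before restarting Hive, it may be worth confirming that both bundles actually landed in the lib directory. The following is a minimal sketch; the helper name `check_jars` is hypothetical (it is not part of Hudi or Hive), and the example path is the Hive install path assumed in this article.

```shell
#!/usr/bin/env bash
# check_jars DIR JAR...  (hypothetical helper, not part of Hudi or Hive)
# Prints each jar that is missing from DIR and returns non-zero if any is absent.
check_jars() {
  dir="$1"
  shift
  missing=0
  for jar in "$@"; do
    if [ ! -f "$dir/$jar" ]; then
      echo "missing: $dir/$jar"
      missing=1
    fi
  done
  return "$missing"
}

# Example call, assuming the Hive install path used in this article:
# check_jars /home/apache-hive-3.1.2-bin/lib \
#   hudi-hive-sync-bundle-0.12.0.jar hudi-hadoop-mr-bundle-0.12.0.jar
```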

2.1.2 Copy the Hive jars to the Flink lib directory

Copy the following jars from Hive's lib directory into Flink's lib directory:

cd $HIVE_HOME/lib
cp ./hive-exec-3.1.2.jar $FLINK_HOME/lib/
cp ./libfb303-0.9.3.jar $FLINK_HOME/lib/

https://nightlies.apache.org/flink/flink-docs-release-1.15/zh/docs/connectors/table/hive/overview/

2.1.3 Jars for connecting Flink and Flink SQL to Hive

wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-connector-hive_2.12/1.14.5/flink-connector-hive_2.12-1.14.5.jar
wget https://repo.maven.apache.org/maven2/org/apache/flink/flink-sql-connector-hive-3.1.2_2.12/1.14.5/flink-sql-connector-hive-3.1.2_2.12-1.14.5.jar
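
Both URLs follow the Maven Central layout of artifact/version/artifact-version.jar, so they can be derived from a few version variables. That makes it easier to keep the connector version in line with the installed Flink version. This is only a sketch: `maven_jar_url` is a hypothetical helper, and `$FLINK_HOME/lib` as the download target is an assumption, since the target directory is not stated above.

```shell
#!/usr/bin/env bash
# Versions used in this article; adjust them to match the installed Flink and Hive.
FLINK_VERSION="1.14.5"
SCALA_VERSION="2.12"
HIVE_VERSION="3.1.2"

# maven_jar_url ARTIFACT VERSION  (hypothetical helper)
# Builds the Maven Central URL of an org.apache.flink artifact.
maven_jar_url() {
  artifact="$1"
  version="$2"
  echo "https://repo.maven.apache.org/maven2/org/apache/flink/${artifact}/${version}/${artifact}-${version}.jar"
}

# Print the two URLs used above.
maven_jar_url "flink-connector-hive_${SCALA_VERSION}" "$FLINK_VERSION"
maven_jar_url "flink-sql-connector-hive-${HIVE_VERSION}_${SCALA_VERSION}" "$FLINK_VERSION"

# To download into Flink's lib directory (assumed target):
#   wget -P "$FLINK_HOME/lib" "$(maven_jar_url "flink-connector-hive_${SCALA_VERSION}" "$FLINK_VERSION")"
```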

2.2 Restart Hive

After copying the jars, restart Hive.

2.3 Accessing Hive tables from Flink

2.3.1 Start the Flink SQL Client

# Start a yarn session (as a non-root account)
/home/flink-1.14.5/bin/yarn-session.sh -d 2>&1 &

# Start Flink SQL in yarn session mode
/home/flink-1.14.5/bin/sql-client.sh embedded -s yarn-session

2.3.2 Create the Hive catalog

CREATE CATALOG hive_catalog WITH (
    'type' = 'hive',
    'default-database' = 'default',
    'hive-conf-dir' = '/home/apache-hive-3.1.2-bin/conf'
);

2.3.3 Switch catalog

use catalog hive_catalog;

2.3.4 Query a Hive table

use test;
show tables;
-- Flink can read the Hive table directly
select * from t1;

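2.4 Syncing to Hive from Flink

Flink Hive sync currently supports two sync modes: hms and jdbc. The hms mode only needs the metastore URIs to be configured, whereas the jdbc mode needs both the JDBC properties and the metastore URIs.

Configuration template: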
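
As a rough sketch, the two modes might be configured as follows. The heredoc writes two template DDL statements to a file so they can be reviewed and then pasted into the SQL client. Everything in angle brackets is a placeholder; the table names, the output file name, and the HiveServer2 port 10000 are assumptions, while ports 8020 and 9083 are the ones used elsewhere in this article.

```shell
#!/usr/bin/env bash
# Write two template DDL statements to a SQL file (the file name is arbitrary).
# Values in angle brackets are placeholders to fill in before use.
TEMPLATE_SQL="${TMPDIR:-/tmp}/hudi_hive_sync_template.sql"

cat > "$TEMPLATE_SQL" <<'EOF'
-- hms mode: only the metastore URIs are needed
CREATE TABLE t_hms (
  id  INT PRIMARY KEY NOT ENFORCED,
  num INT,
  ts  INT
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://<namenode>:8020/<path>/t_hms',
  'table.type' = 'COPY_ON_WRITE',
  'hive_sync.enable' = 'true',
  'hive_sync.mode' = 'hms',
  'hive_sync.metastore.uris' = 'thrift://<metastore-host>:9083'
);

-- jdbc mode: JDBC properties plus the metastore URIs
CREATE TABLE t_jdbc (
  id  INT PRIMARY KEY NOT ENFORCED,
  num INT,
  ts  INT
) WITH (
  'connector' = 'hudi',
  'path' = 'hdfs://<namenode>:8020/<path>/t_jdbc',
  'table.type' = 'COPY_ON_WRITE',
  'hive_sync.enable' = 'true',
  'hive_sync.mode' = 'jdbc',
  'hive_sync.metastore.uris' = 'thrift://<metastore-host>:9083',
  'hive_sync.jdbc_url' = 'jdbc:hive2://<hiveserver2-host>:10000',
  'hive_sync.db' = '<db>',
  'hive_sync.table' = 't_jdbc',
  'hive_sync.username' = '<user>',
  'hive_sync.password' = '<password>'
);
EOF

echo "wrote $TEMPLATE_SQL"
```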

3. Hands-on Examples (COW)

3.1 Create a Hudi table in memory (without a catalog)

Code:

-- Create the table
create table t_cow1 (
  id     int primary key not enforced,
  num    int,
  ts     int
)
partitioned by (num)
with (
  'connector' = 'hudi',
  'path' = 'hdfs://hp5:8020/tmp/hudi/t_cow1',
  'table.type' = 'COPY_ON_WRITE',
  'hive_sync.enable' = 'true',
  'hive_sync.table' = 't_cow1',
  'hive_sync.db' = 'test',
  'hive_sync.mode' = 'hms',
  'hive_sync.metastore.uris' = 'thrift://hp5:9083',
  'hive_sync.conf.dir'='/home/apache-hive-3.1.2-bin/conf'
);

-- Hive sync is triggered only when data is written
insert into t_cow1 values (1,1,1);

Test record:
Flink SQL run log:

A new table, t_cow1, appears in the test database in Hive.

Querying the data from Hive:

3.2 Create a Hudi table in a catalog

3.2.1 Table path outside the Hive warehouse directory

Code:

-- Create the catalog
CREATE CATALOG hive_catalog WITH (
    'type' = 'hive',
    'default-database' = 'default',
    'hive-conf-dir' = '/home/apache-hive-3.1.2-bin/conf'
);

-- Switch to the catalog
USE CATALOG hive_catalog;

use test;

-- Create the table
create table t_catalog_cow1 (
  id     int primary key not enforced,
  num    int,
  ts     int
)
)
partitioned by (num)
with (
  'connector' = 'hudi',
  'path' = 'hdfs://hp5:8020/tmp/hudi/t_catalog_cow1',
  'table.type' = 'COPY_ON_WRITE',
  'hive_sync.enable' = 'true',
  'hive_sync.table' = 't_catalog_cow1',
  'hive_sync.db' = 'test',
  'hive_sync.mode' = 'hms',
  'hive_sync.metastore.uris' = 'thrift://hp5:9083',
  'hive_sync.conf.dir'='/home/apache-hive-3.1.2-bin/conf'
);


insert into t_catalog_cow1 values (1,1,1);

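Test record:
Flink SQL can see the table.

Querying the data from Flink SQL also works.

Hive can see the table, but querying it returns no data:

Checking the CREATE TABLE statement on the Hive side:

Problem found:
After the COW table is synced from Hudi, the partition field is missing from the Hive table.
In other words, when a Hive catalog is used, automatically syncing a Hudi table created through Flink to Hive does not work reliably.

3.2.2 Table path inside the Hive warehouse directory

Code: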

-- Create the catalog
CREATE CATALOG hive_catalog WITH (
    'type' = 'hive',
    'default-database' = 'default',
    'hive-conf-dir' = '/home/apache-hive-3.1.2-bin/conf'
);

-- Switch to the catalog
USE CATALOG hive_catalog;

use test;

-- Create the table
create table t_catalog_cow2 (
  id     int primary key not enforced,
  num    int,
  ts     int
)
)
partitioned by (num)
with (
  'connector' = 'hudi',
  'path' = 'hdfs://hp5:8020/user/hive/warehouse/test.db/t_catalog_cow2',
  'table.type' = 'COPY_ON_WRITE',
  'hive_sync.enable' = 'true',
  'hive_sync.table' = 't_catalog_cow2',
  'hive_sync.db' = 'test',
  'hive_sync.mode' = 'hms',
  'hive_sync.metastore.uris' = 'thrift://hp5:9083',
  'hive_sync.conf.dir'='/home/apache-hive-3.1.2-bin/conf'
);


insert into t_catalog_cow2 values (1,1,1);

Test record:
The problem persists.

3.2.3 Specify the Hudi table partition through parameters

Code:

create table t_catalog_cow4 (
  id     int primary key not enforced,
  num    int,
  ts     int
)
partitioned by (num)
with (
  'connector' = 'hudi',
  'path' = 'hdfs://hp5:8020/tmp/hudi/t_catalog_cow4',
  'table.type' = 'COPY_ON_WRITE',
  'hive_sync.enable' = 'true',
  'hive_sync.table' = 't_catalog_cow4',
  'hive_sync.db' = 'test',
  'hoodie.datasource.write.keygenerator.class' = 'org.apache.hudi.keygen.ComplexAvroKeyGenerator',
  'hoodie.datasource.write.recordkey.field' = 'id',
  'hoodie.datasource.write.hive_style_partitioning' = 'true',
  'hive_sync.mode' = 'hms',
  'hive_sync.metastore.uris' = 'thrift://hp5:9083',
  'hive_sync.conf.dir'='/home/apache-hive-3.1.2-bin/conf',
  'hive_sync.partition_fields' = 'num',
  'hive_sync.partition_extractor_class' = 'org.apache.hudi.hive.HiveStylePartitionValueExtractor'
);


insert into t_catalog_cow4 values (1,1,1);

Test record:

4. Hands-on Examples (MOR)

4.1 Create a Hudi table in memory (without a catalog)

Code:

-- Create the table
create table t_mor1 (
  id     int primary key not enforced,
  num    int,
  ts     int
)
partitioned by (num)
with (
  'connector' = 'hudi',
  'path' = 'hdfs://hp5:8020/tmp/hudi/t_mor1',
  'table.type' = 'MERGE_ON_READ',
  'hive_sync.enable' = 'true',
  'hive_sync.table' = 't_mor1',
  'hive_sync.db' = 'test',
  'hive_sync.mode' = 'hms',
  'hive_sync.metastore.uris' = 'thrift://hp5:9083',
  'hive_sync.conf.dir'='/home/apache-hive-3.1.2-bin/conf'
);

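-- Hive sync is triggered only when data is written
-- Hive can only read Parquet data. A MOR table does not produce Parquet files right away, so insert a few more rows, or insert more rows with Spark SQL
insert into t_mor1 values (1,1,1);

Test record:
HDFS:
Only log files are present, no Parquet files.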

insert into t_mor1 values (2,1,2);
insert into t_mor1 values (3,1,3);
insert into t_mor1 values (4,1,4);
insert into t_mor1 values (5,1,5);

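Flink Web UI:

Several new tables appear:
t_mor1 is the Hudi table; it can be read and written through Flink.

t_mor1_ro and t_mor1_rt are Hive tables; they can be operated on through Hive and Spark.

Viewing the data from Hive:
Because there are no Parquet files, no data is returned.

Even after adding a lot of test data, there are still only log files and no Parquet files.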

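After exiting and logging in again:
The Flink SQL client no longer shows the previously created tables, because they were created in Flink's default in-memory catalog, which lasts only for the session.

On the Hive side, the tables still exist after exiting and logging in again.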

FAQ:

FAQ1: NoClassDefFoundError ParquetInputFormat

Problem description:
Querying a COW table from the Flink SQL client fails with the following error:

[ERROR] Could not execute SQL statement. Reason:
java.lang.NoClassDefFoundError: org/apache/parquet/hadoop/ParquetInputFormat

Solution:
Find the Parquet jar that was used when Hudi was compiled and copy it into Flink's lib directory.
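
The missing class, org.apache.parquet.hadoop.ParquetInputFormat, belongs to the parquet-hadoop module, so one way to locate a candidate jar is to search the directory tree where the build dependencies were downloaded. A minimal sketch, assuming Hudi was built with Maven and the default `~/.m2` repository location; `find_parquet_jars` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# find_parquet_jars DIR  (hypothetical helper)
# Lists parquet-hadoop jars found under DIR, for example a local Maven repository.
find_parquet_jars() {
  find "$1" -type f -name 'parquet-hadoop-*.jar' | sort
}

# Example, assuming the default local Maven repository:
# find_parquet_jars ~/.m2/repository/org/apache/parquet
# Then copy the jar whose version matches the Hudi build:
# cp <chosen jar> "$FLINK_HOME/lib/"
```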
