Hive Partition Archiving

This was my first use of partition archiving, which consolidates the many small files under a partition:

Steps:

1. Run: set hive.archive.enabled=true;

2. Run: ALTER TABLE test.users_part ARCHIVE PARTITION(ds='20200401');

The second statement failed. The error shown in the SQL CLI:

hive (test)> ALTER TABLE test.users_part ARCHIVE PARTITION(ds='20200401');
intermediate.archived is hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401_INTERMEDIATE_ARCHIVED
intermediate.original is hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401_INTERMEDIATE_ORIGINAL
Creating data.har for hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401
in hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401/.hive-staging_hive_2020-05-13_18-46-48_858_3143515583523997534-1/-ext-10000/partlevel
Please wait... (this may take a while)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/tools/HadoopArchives
hive (test)>

Hive's detailed log4j log reported the error below (the log directory is set in the log4j config file under hive/conf and defaults to /tmp/$user):
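To confirm where the detailed log lands on your machine, you can grep the log4j2 config directly. The conf path here is an assumption taken from the CLI banner later in this post (`file:/apps/hive/conf/hive-log4j2.properties`); adjust it to your install:

```shell
# Assumed conf dir; the CLI banner in this post shows /apps/hive/conf.
HIVE_CONF="${HIVE_CONF:-/apps/hive/conf}"
# property.hive.log.dir controls the directory; by default it expands to
# ${sys:java.io.tmpdir}/${sys:user.name}, i.e. /tmp/$USER on Linux.
grep 'hive.log' "$HIVE_CONF/hive-log4j2.properties" 2>/dev/null \
  || echo "hive-log4j2.properties not found under $HIVE_CONF"
```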

2020-05-13T18:46:48,825  INFO [main] conf.HiveConf: Using the default value passed in for log id: 06ab61b1-0eab-44f7-a818-6cf2cf8f0b08
2020-05-13T18:46:48,825  INFO [main] session.SessionState: Updating thread name to 06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main
2020-05-13T18:46:48,828  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Compiling command(queryId=ytc_20200513184648_9ab2659a-ef3b-4e37-a3f0-1ff344b2721b): ALTER TABLE test.users_part ARCHIVE PARTITION(ds='20200401')
2020-05-13T18:46:48,859  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Concurrency mode is disabled, not creating a lock manager
2020-05-13T18:46:48,859  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] metastore.HiveMetaStore: 0: get_table : tbl=hive.test.users_part
2020-05-13T18:46:48,859  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] HiveMetaStore.audit: ugi=ytc	ip=unknown-ip-addr	cmd=get_table : tbl=hive.test.users_part	
2020-05-13T18:46:48,875  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] metastore.HiveMetaStore: 0: get_partitions_ps_with_auth : tbl=hive.test.users_part[15]
2020-05-13T18:46:48,875  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] HiveMetaStore.audit: ugi=ytc	ip=unknown-ip-addr	cmd=get_partitions_ps_with_auth : tbl=hive.test.users_part[15]	
2020-05-13T18:46:48,916  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Semantic Analysis Completed (retrial = false)
2020-05-13T18:46:48,916  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Returning Hive schema: Schema(fieldSchemas:null, properties:null)
2020-05-13T18:46:48,916  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Completed compiling command(queryId=ytc_20200513184648_9ab2659a-ef3b-4e37-a3f0-1ff344b2721b); Time taken: 0.088 seconds
2020-05-13T18:46:48,916  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] reexec.ReExecDriver: Execution #1 of query
2020-05-13T18:46:48,916  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Concurrency mode is disabled, not creating a lock manager
2020-05-13T18:46:48,916  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Executing command(queryId=ytc_20200513184648_9ab2659a-ef3b-4e37-a3f0-1ff344b2721b): ALTER TABLE test.users_part ARCHIVE PARTITION(ds='20200401')
2020-05-13T18:46:48,916  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Starting task [Stage-0:DDL] in serial mode
2020-05-13T18:46:48,917  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] metastore.HiveMetaStore: 0: get_table : tbl=hive.test.users_part
2020-05-13T18:46:48,917  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] HiveMetaStore.audit: ugi=ytc	ip=unknown-ip-addr	cmd=get_table : tbl=hive.test.users_part	
2020-05-13T18:46:48,932  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] metastore.HiveMetaStore: 0: get_partitions_ps_with_auth : tbl=hive.test.users_part[15]
2020-05-13T18:46:48,932  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] HiveMetaStore.audit: ugi=ytc	ip=unknown-ip-addr	cmd=get_partitions_ps_with_auth : tbl=hive.test.users_part[15]	
2020-05-13T18:46:48,975  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] exec.Task: intermediate.archived is hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds='20200401'_INTERMEDIATE_ARCHIVED
2020-05-13T18:46:48,975  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] exec.Task: intermediate.original is hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds='20200401'_INTERMEDIATE_ORIGINAL
2020-05-13T18:46:48,978  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] common.FileUtils: Creating directory if it doesn't exist: hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds='20200401'/.hive-staging_hive_2020-05-13_18-46-48_858_3143515583523997534-1
2020-05-13T18:46:48,980  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] exec.Task: Creating data.har for hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds='20200401'
2020-05-13T18:46:48,980  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] exec.Task: in hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds='20200401'/.hive-staging_hive_2020-05-13_18-46-48_858_3143515583523997534-1/-ext-10000/partlevel
2020-05-13T18:46:48,980  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] exec.Task: Please wait... (this may take a while)
2020-05-13T18:46:48,980 ERROR [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] exec.DDLTask: Failed
java.lang.NoClassDefFoundError: org/apache/hadoop/tools/HadoopArchives
	at org.apache.hadoop.hive.ql.exec.DDLTask.archive(DDLTask.java:1756) ~[hive-exec-3.1.2.jar:3.1.2]
	at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:479) ~[hive-exec-3.1.2.jar:3.1.2]
	at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:205) ~[hive-exec-3.1.2.jar:3.1.2]
	at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:97) ~[hive-exec-3.1.2.jar:3.1.2]
	at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:2664) ~[hive-exec-3.1.2.jar:3.1.2]
	at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:2335) ~[hive-exec-3.1.2.jar:3.1.2]
	at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:2011) ~[hive-exec-3.1.2.jar:3.1.2]
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1709) ~[hive-exec-3.1.2.jar:3.1.2]
	at org.apache.hadoop.hive.ql.Driver.run(Driver.java:1703) ~[hive-exec-3.1.2.jar:3.1.2]
	at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:157) ~[hive-exec-3.1.2.jar:3.1.2]
	at org.apache.hadoop.hive.ql.reexec.ReExecDriver.run(ReExecDriver.java:218) ~[hive-exec-3.1.2.jar:3.1.2]
	at org.apache.hadoop.hive.cli.CliDriver.processLocalCmd(CliDriver.java:239) ~[hive-cli-3.1.2.jar:3.1.2]
	at org.apache.hadoop.hive.cli.CliDriver.processCmd(CliDriver.java:188) ~[hive-cli-3.1.2.jar:3.1.2]
	at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:402) ~[hive-cli-3.1.2.jar:3.1.2]
	at org.apache.hadoop.hive.cli.CliDriver.executeDriver(CliDriver.java:821) ~[hive-cli-3.1.2.jar:3.1.2]
	at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:759) ~[hive-cli-3.1.2.jar:3.1.2]
	at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:683) ~[hive-cli-3.1.2.jar:3.1.2]
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_211]
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_211]
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_211]
	at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_211]
	at org.apache.hadoop.util.RunJar.run(RunJar.java:318) ~[hadoop-common-3.1.3.jar:?]
	at org.apache.hadoop.util.RunJar.main(RunJar.java:232) ~[hadoop-common-3.1.3.jar:?]
2020-05-13T18:46:48,980  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] reexec.ReOptimizePlugin: ReOptimization: retryPossible: false
2020-05-13T18:46:48,980 ERROR [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. org/apache/hadoop/tools/HadoopArchives
2020-05-13T18:46:48,980  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Completed executing command(queryId=ytc_20200513184648_9ab2659a-ef3b-4e37-a3f0-1ff344b2721b); Time taken: 0.064 seconds
2020-05-13T18:46:48,980  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] ql.Driver: Concurrency mode is disabled, not creating a lock manager
2020-05-13T18:46:48,981  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] conf.HiveConf: Using the default value passed in for log id: 06ab61b1-0eab-44f7-a818-6cf2cf8f0b08
2020-05-13T18:46:48,982  INFO [06ab61b1-0eab-44f7-a818-6cf2cf8f0b08 main] session.SessionState: Resetting thread name to  main

The key line is: java.lang.NoClassDefFoundError: org/apache/hadoop/tools/HadoopArchives

This class is not loaded when Hive runs; the jar that contains it has to be copied into the lib directories of the execution environment. I run on YARN with the MapReduce engine, so I copied

$HADOOP_HOME/share/hadoop/tools/lib/hadoop-archives-3.1.3.jar

into the following two directories:

$HADOOP_HOME/share/hadoop/yarn/lib/

$HADOOP_HOME/share/hadoop/mapreduce/lib
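The copy step can be scripted as below. The /apps/hadoop fallback is only an illustrative guess based on the paths in the CLI transcript later in this post; set HADOOP_HOME for your install, and adjust the jar version suffix if yours differs from 3.1.3:

```shell
#!/bin/sh
# Copy hadoop-archives into the lib dirs scanned by the MR-on-YARN runtime.
# /apps/hadoop is a fallback guess; export HADOOP_HOME for your install.
HADOOP_HOME="${HADOOP_HOME:-/apps/hadoop}"
JAR="$HADOOP_HOME/share/hadoop/tools/lib/hadoop-archives-3.1.3.jar"

for dir in "$HADOOP_HOME/share/hadoop/yarn/lib" \
           "$HADOOP_HOME/share/hadoop/mapreduce/lib"; do
  if [ -f "$JAR" ]; then
    cp "$JAR" "$dir/" && echo "copied to $dir"
  else
    echo "jar not found: $JAR (check HADOOP_HOME and the version suffix)"
  fi
done
```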

3. Re-run the hive command to enter the SQL CLI and retry:

ytc@ytc:/apps/hive/log$ hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/apps/hive/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/apps/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = ad93aab7-55df-4181-a52c-245a2220978b

Logging initialized using configuration in file:/apps/hive/conf/hive-log4j2.properties Async: true
Hive Session ID = 52f38cb6-e313-455d-a6cd-8b333fcab3ce
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive (default)> 
              > 
              > 
              > use test;
OK
Time taken: 0.811 seconds
hive (test)> set hive.archive.enabled=true;
hive (test)> 
           > 
           > ALTER TABLE test.users_part ARCHIVE PARTITION(ds=20200401);
intermediate.archived is hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401_INTERMEDIATE_ARCHIVED
intermediate.original is hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401_INTERMEDIATE_ORIGINAL
Creating data.har for hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401
in hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401/.hive-staging_hive_2020-05-13_18-47-23_023_5539564871141800018-1/-ext-10000/partlevel
Please wait... (this may take a while)
Moving hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401/.hive-staging_hive_2020-05-13_18-47-23_023_5539564871141800018-1/-ext-10000/partlevel to hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401_INTERMEDIATE_ARCHIVED
Moving hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401 to hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401_INTERMEDIATE_ORIGINAL
Moving hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401_INTERMEDIATE_ARCHIVED to hdfs://m.local.com:9000/dbs/hive/test.db/users_part/ds=20200401
OK
Time taken: 18.526 seconds
hive (test)> 

With that, partition archiving works as expected.
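For completeness: an archived partition can be restored to ordinary files later with UNARCHIVE, which also requires the hive.archive.enabled flag. The table and partition here are the ones used throughout this post:

```sql
-- Reverse the archive: unpack data.har back into plain partition files.
set hive.archive.enabled=true;
ALTER TABLE test.users_part UNARCHIVE PARTITION(ds='20200401');
```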
