Preface
- OS: CentOS 7
- Hive: 2.3.0
- Hadoop: 2.7.7
- MySQL Server: 5.7.10
- Hive official manual: LanguageManual LZO
- Before configuring LZO compression in Hive, make sure the lzo native library is correctly installed across the Hadoop cluster and that the hadoop-lzo dependency is correctly configured; see: Configuring LZO compression in Hadoop
- Tip: when packaging custom Hive components, do not bundle their dependencies into the jar; add the extra dependencies to the classpath instead, to avoid version conflicts
Configuration steps
1. Configure the Hadoop cluster
- In core-site.xml, add the codec classes for the LZO and LZOP compression formats to the io.compression.codecs parameter, and set the io.compression.codec.lzo.class parameter, as shown below:
<!-- Declare the available compression codecs -->
<property>
<name>io.compression.codecs</name>
<value>
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
org.apache.hadoop.io.compress.DeflateCodec,
org.apache.hadoop.io.compress.BZip2Codec,
org.apache.hadoop.io.compress.SnappyCodec,
org.apache.hadoop.io.compress.Lz4Codec,
com.hadoop.compression.lzo.LzoCodec,
com.hadoop.compression.lzo.LzopCodec
</value>
<description>
A comma-separated list of the compression codec classes that can
be used for compression/decompression. In addition to any classes specified
with this property (which take precedence), codec classes on the classpath
are discovered using a Java ServiceLoader.
</description>
</property>
<!-- Configure the LZO codec class -->
<property>
<name>io.compression.codec.lzo.class</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
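The multi-line codec list above is easy to mistype (a wrong class name only surfaces later as a ClassNotFound error). A small local sanity check can catch that early; this is an illustrative sketch, not part of the official setup, and `configured_codecs` is an invented helper:

```python
# Sanity-check sketch: parse core-site.xml and list the configured codec
# classes, trimming the whitespace that a multi-line <value> introduces.
import xml.etree.ElementTree as ET

def configured_codecs(core_site_path):
    """Return the trimmed codec class names from io.compression.codecs."""
    root = ET.parse(core_site_path).getroot()
    for prop in root.iter("property"):
        if prop.findtext("name") == "io.compression.codecs":
            raw = prop.findtext("value", default="")
            return [c.strip() for c in raw.split(",") if c.strip()]
    return []
```

Run it against your core-site.xml and confirm that both com.hadoop.compression.lzo.LzoCodec and com.hadoop.compression.lzo.LzopCodec appear in the output.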
- In mapred-site.xml, set the following parameters to choose the compression used when MR jobs run:
<!-- Whether to compress map output -->
<!-- Default: false -->
<property>
<name>mapreduce.map.output.compress</name>
<value>true</value>
<description>
Should the outputs of the maps be compressed before being
sent across the network. Uses SequenceFile compression.
</description>
</property>
<!-- Codec used to compress map output; LzoCodec here produces files with the .lzo_deflate suffix -->
<!-- Default: org.apache.hadoop.io.compress.DefaultCodec -->
<property>
<name>mapreduce.map.output.compress.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
<description>
If the map outputs are compressed, how should they be compressed?
</description>
</property>
<!-- Whether to compress the final output of MR jobs -->
<!-- Default: false -->
<property>
<name>mapreduce.output.fileoutputformat.compress</name>
<value>true</value>
<description>Should the job outputs be compressed?
</description>
</property>
<!-- Codec used to compress the final MR job output; LzoCodec here produces files with the .lzo_deflate suffix -->
<!-- Default: org.apache.hadoop.io.compress.DefaultCodec -->
<property>
<name>mapreduce.output.fileoutputformat.compress.codec</name>
<value>com.hadoop.compression.lzo.LzoCodec</value>
<description>If the job outputs are compressed, how should they be compressed?
</description>
</property>
<!-- Compression type for SequenceFile output -->
<!-- Default: RECORD -->
<property>
<name>mapreduce.output.fileoutputformat.compress.type</name>
<value>BLOCK</value>
<description>If the job outputs are to compressed as SequenceFiles, how should
they be compressed? Should be one of NONE, RECORD or BLOCK.
</description>
</property>
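The comments above mention two different output suffixes, which matter later when building indexes. As a quick reference, here is a sketch of the codec-to-suffix mapping (the suffixes are each codec's default file extension; treat this table as informational, not an API):

```python
# Illustrative mapping of codec class to the file suffix it writes.
# DeprecatedLzoTextInputFormat (used later) only recognizes .lzo files.
CODEC_SUFFIX = {
    "com.hadoop.compression.lzo.LzoCodec": ".lzo_deflate",
    "com.hadoop.compression.lzo.LzopCodec": ".lzo",
    "org.apache.hadoop.io.compress.GzipCodec": ".gz",
    "org.apache.hadoop.io.compress.DefaultCodec": ".deflate",
}
```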
2. Configure Hive
Set the following parameters in $HIVE_HOME/conf/hive-site.xml so that Hive compresses query output. The codec used defaults to the one configured in Hadoop, though corresponding Hive parameters can override it:
<!-- Whether to compress the output of Hive queries; the codec and format depend on the corresponding Hadoop parameters -->
<!-- Default: false -->
<property>
<name>hive.exec.compress.output</name>
<value>true</value>
<description>
This controls whether the final outputs of a query (to a local/HDFS file or a Hive table)
is compressed.
The compression codec and other options are determined from Hadoop config variables
mapred.output.compress*
</description>
</property>
<!-- Whether to compress the intermediate files passed between multiple MR jobs; the codec and format depend on the corresponding Hadoop parameters -->
<!-- Default: false -->
<property>
<name>hive.exec.compress.intermediate</name>
<value>true</value>
<description>
This controls whether intermediate files produced by Hive between multiple map-reduce jobs are compressed.
The compression codec and other options are determined from Hadoop config variables mapred.output.compress*
</description>
</property>
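When experimenting, the same switches can also be flipped per session from the Hive CLI or Beeline instead of editing hive-site.xml:

```sql
-- Per-session overrides, equivalent to the hive-site.xml settings above
SET hive.exec.compress.output=true;
SET hive.exec.compress.intermediate=true;
```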
3. Create a Hive table that reads and writes LZO-compressed data
- Example:
CREATE TABLE tmp LIKE emp
STORED AS
INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat";
- Note 1: this version of Hive (2.3.0) only supports the old MapReduce API (org.apache.hadoop.mapred); the InputFormat used must implement the org.apache.hadoop.mapred.InputFormat interface
- Note 2: to use LZO compression in Hive 2.3.0, the table's InputFormat must be set to com.hadoop.mapred.DeprecatedLzoTextInputFormat, which ships in the hadoop-lzo jar (e.g. hadoop-lzo-0.4.20.jar). Only an LzoTextInputFormat keeps the lzo index files from being read as data files, and since Hive only supports the old API, the DeprecatedLzoTextInputFormat shown in the example is required to get LZO split support
- Note 3: DeprecatedLzoTextInputFormat only recognizes LZO files with the .lzo suffix; it cannot handle files with the .lzo_deflate suffix. The former are produced by the LzopCodec codec, the latter by LzoCodec. An index can be built for .lzo files but not for .lzo_deflate files, and LZO splitting is only possible once that index exists
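Taken together, the notes above amount to a simple file-handling policy. The sketch below (illustrative Python, not hadoop-lzo's actual code; the function name is invented) shows which files in a table directory count as data and which can be split:

```python
# Policy sketch: index files are skipped, and a .lzo data file is
# splittable only when its sibling .lzo.index file exists.
def plan_inputs(paths):
    """Given file names in a table directory, return (data_files, splittable)."""
    names = set(paths)
    data, splittable = [], {}
    for p in paths:
        if p.endswith(".lzo.index"):
            continue  # index files are not data
        data.append(p)
        # splittable only for .lzo files that have an index alongside them
        splittable[p] = p.endswith(".lzo") and (p + ".index") in names
    return data, splittable
```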
PS: a table's InputFormat/OutputFormat/SerDe can be changed with the following statement
-- Change how table data is read/written/serialized and deserialized
ALTER TABLE tmp
SET FILEFORMAT
INPUTFORMAT "com.hadoop.mapred.DeprecatedLzoTextInputFormat"
OUTPUTFORMAT "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat" -- Hive's default OutputFormat
SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'; -- Hive's default SerDe
4. Load the LZO-compressed files and build LZO indexes for the .lzo files
- Load the table data file (a .lzo compressed file)
LOAD DATA INPATH '/tmp/data/output/emp/000000_0.lzo' OVERWRITE INTO TABLE tmp;
- Build an LZO index for the .lzo file so it can be split at read time; without an index the whole file becomes a single split. The example command below uses the indexer tool from the hadoop-lzo dependency; the index file is written next to the compressed file, with the suffix .lzo.index
hadoop jar \
/opt/module/hadoop-2.7.7/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar \
com.hadoop.compression.lzo.DistributedLzoIndexer \
/user/hive/warehouse/test.db/tmp/000000_0.lzo
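For reference, the .lzo.index file has a very simple layout: to my understanding of hadoop-lzo's LzoIndex format, it is just a sequence of 8-byte big-endian longs, one per compressed block, each holding that block's byte offset in the .lzo file. A minimal reader sketch (`read_lzo_index` is an invented helper):

```python
# Sketch of a .lzo.index reader, assuming the file is a sequence of
# 8-byte big-endian block offsets (hadoop-lzo's LzoIndex layout).
import struct

def read_lzo_index(path):
    """Return the list of compressed-block offsets stored in a .lzo.index file."""
    with open(path, "rb") as f:
        raw = f.read()
    n = len(raw) - (len(raw) % 8)  # ignore any truncated trailing bytes
    return [struct.unpack(">q", raw[i:i + 8])[0] for i in range(0, n, 8)]
```

Under this reading, an 8-byte index file like the one in the listing below would decode to a single offset, i.e. the whole .lzo file is one compressed block.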
- Check that the index was created successfully
[tomandersen@hadoop101 libs]$ hadoop fs -ls /user/hive/warehouse/test.db/tmp/;
Found 2 items
-rwxr-xr-x 1 tomandersen supergroup 515 2020-06-19 17:43 /user/hive/warehouse/test.db/tmp/000000_0.lzo
-rw-r--r-- 1 tomandersen supergroup 8 2020-06-21 21:53 /user/hive/warehouse/test.db/tmp/000000_0.lzo.index
5. Query the table to verify that the data is read correctly
- With the InputFormat set to DeprecatedLzoTextInputFormat, the lzo index file is no longer treated as a data file and the query result is correct
0: jdbc:hive2://hadoop101:10000/default (test)> select * from test.tmp;
+------------+------------+----------+------------+----------+---------------+----------+-----------+-------------+
| tmp.empno | tmp.ename | tmp.sex | tmp.job | tmp.mgr | tmp.hiredate | tmp.sal | tmp.comm | tmp.deptno |
+------------+------------+----------+------------+----------+---------------+----------+-----------+-------------+
| 7369 | SMITH | male | CLERK | 7902 | 1980-12-17 | 800.0 | NULL | 20 |
| 7499 | ALLEN | male | SALESMAN | 7698 | 1981-2-20 | 1600.0 | 300.0 | 30 |
| 7521 | WARD | female | SALESMAN | 7698 | 1981-2-22 | 1250.0 | 500.0 | 30 |
| 7566 | JONES | male | MANAGER | 7839 | 1981-4-2 | 2975.0 | NULL | 20 |
| 7654 | MARTIN | female | SALESMAN | 7698 | 1981-9-28 | 1250.0 | 1400.0 | 30 |
| 7698 | BLAKE | male | MANAGER | 7839 | 1981-5-1 | 2850.0 | NULL | 30 |
| 7782 | CLARK | male | MANAGER | 7839 | 1981-6-9 | 2450.0 | NULL | 10 |
| 7788 | SCOTT | male | ANALYST | 7566 | 1987-4-19 | 3000.0 | NULL | 20 |
| 7839 | KING | female | PRESIDENT | NULL | 1981-11-17 | 5000.0 | NULL | 10 |
| 7844 | TURNER | female | SALESMAN | 7698 | 1981-9-8 | 1500.0 | 0.0 | 30 |
| 7876 | ADAMS | male | CLERK | 7788 | 1987-5-23 | 1100.0 | NULL | 20 |
| 7900 | JAMES | male | CLERK | 7698 | 1981-12-3 | 950.0 | NULL | 30 |
| 7902 | FORD | male | ANALYST | 7566 | 1981-12-3 | 3000.0 | NULL | 20 |
| 7934 | MILLER | female | CLERK | 7782 | 1982-1-23 | 1300.0 | NULL | 10 |
+------------+------------+----------+------------+----------+---------------+----------+-----------+-------------+
14 rows selected (0.244 seconds)
- With the InputFormat not set to DeprecatedLzoTextInputFormat, the lzo index file is read as a data file and the query result contains an extra all-NULL row
0: jdbc:hive2://hadoop101:10000/default (test)> select * from test.tmp;
+------------+------------+----------+------------+----------+---------------+----------+-----------+-------------+
| tmp.empno | tmp.ename | tmp.sex | tmp.job | tmp.mgr | tmp.hiredate | tmp.sal | tmp.comm | tmp.deptno |
+------------+------------+----------+------------+----------+---------------+----------+-----------+-------------+
| 7369 | SMITH | male | CLERK | 7902 | 1980-12-17 | 800.0 | NULL | 20 |
| 7499 | ALLEN | male | SALESMAN | 7698 | 1981-2-20 | 1600.0 | 300.0 | 30 |
| 7521 | WARD | female | SALESMAN | 7698 | 1981-2-22 | 1250.0 | 500.0 | 30 |
| 7566 | JONES | male | MANAGER | 7839 | 1981-4-2 | 2975.0 | NULL | 20 |
| 7654 | MARTIN | female | SALESMAN | 7698 | 1981-9-28 | 1250.0 | 1400.0 | 30 |
| 7698 | BLAKE | male | MANAGER | 7839 | 1981-5-1 | 2850.0 | NULL | 30 |
| 7782 | CLARK | male | MANAGER | 7839 | 1981-6-9 | 2450.0 | NULL | 10 |
| 7788 | SCOTT | male | ANALYST | 7566 | 1987-4-19 | 3000.0 | NULL | 20 |
| 7839 | KING | female | PRESIDENT | NULL | 1981-11-17 | 5000.0 | NULL | 10 |
| 7844 | TURNER | female | SALESMAN | 7698 | 1981-9-8 | 1500.0 | 0.0 | 30 |
| 7876 | ADAMS | male | CLERK | 7788 | 1987-5-23 | 1100.0 | NULL | 20 |
| 7900 | JAMES | male | CLERK | 7698 | 1981-12-3 | 950.0 | NULL | 30 |
| 7902 | FORD | male | ANALYST | 7566 | 1981-12-3 | 3000.0 | NULL | 20 |
| 7934 | MILLER | female | CLERK | 7782 | 1982-1-23 | 1300.0 | NULL | 10 |
| NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL | NULL |
+------------+------------+----------+------------+----------+---------------+----------+-----------+-------------+
15 rows selected (0.229 seconds)
Errors and solutions
- Error 1: Compression codec com.hadoop.compression.lzo.LzoCodec not found
Analysis: this is simply a class-not-found error. LzoCodec is a class in the hadoop-lzo dependency, so the fix is to add the corresponding jar to the classpath; there are several ways to do this
Example solution: set the HIVE_AUX_JARS_PATH variable in hive-env.sh and put the hadoop-lzo jar into the directory it points to (or point the variable directly at the hadoop-lzo jar)
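A sketch of that hive-env.sh setting (the jar path below is the one used elsewhere in this article; adjust it to your own install):

```shell
# In $HIVE_HOME/conf/hive-env.sh: put the hadoop-lzo jar on Hive's aux classpath
export HIVE_AUX_JARS_PATH=/opt/module/hadoop-2.7.7/share/hadoop/common/hadoop-lzo-0.4.21-SNAPSHOT.jar
```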
- Error 2: Error: java.io.IOException: java.io.EOFException: Premature EOF from inputStream
Analysis: according to reports found online, this error can have several causes; the most common is a conflict between the LZO-related configuration in Hadoop and the table settings in Hive (the exact root cause is unclear)
Example solution: set the mapreduce.output.fileoutputformat.compress.codec parameter in mapred-site.xml to any value other than com.hadoop.compression.lzo.LzopCodec, such as com.hadoop.compression.lzo.LzoCodec, i.e. use a codec other than LzopCodec for the final MR output