Why use SequenceFile:
a) Compression. b) The format is splittable, so multiple mappers can read it in parallel.
Here is a passage from Programming Hive:
Compressing files results in space savings but one of the downsides of storing raw compressed files in Hadoop is that often these files are not splittable. Splittable files can be broken up and processed in parts by multiple mappers in parallel. Most compressed files are not splittable because you can only start reading from the beginning. The sequence file format supported by Hadoop breaks a file into blocks and then optionally compresses the blocks in a splittable way.
Now let's put it to use.
1. Set three parameters:
hive.exec.compress.output declares that the output of Hive queries should be compressed; we also specify Snappy as the codec. SequenceFile additionally honors mapred.output.compression.type, which already defaults to BLOCK in CDH4.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
SET mapred.output.compression.type=BLOCK;
hive (sequence)> SET hive.exec.compress.output;
hive.exec.compress.output=false
hive (sequence)> SET hive.exec.compress.output=true;
hive (sequence)> SET mapred.output.compression.codec;
mapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec
hive (sequence)> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
hive (sequence)> SET mapred.output.compression.type;
mapred.output.compression.type=BLOCK
hive (sequence)> set io.seqfile.compress.blocksize;
io.seqfile.compress.blocksize=1000000
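As a side note, io.seqfile.compress.blocksize is the amount of uncompressed data (in bytes) buffered before each compressed block is flushed, so we can roughly estimate how many compressed blocks the 400+ MB file from step 7 will produce (a back-of-the-envelope sketch, not an exact figure):

```python
# io.seqfile.compress.blocksize: uncompressed bytes buffered per
# block-compressed SequenceFile block (1 MB in this session).
block_size = 1_000_000
text_bytes = 438_474_539        # size of file.txt from the HDFS listing

# Rough total number of compressed blocks the writers will emit:
# 438 full blocks plus one partial block at the end.
blocks = text_bytes // block_size + 1
print(blocks)  # 439
```

The real count can differ slightly, since the writer flushes at record boundaries.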
2. Create a TEXTFILE table to supply the source data.
CREATE TABLE info_text(
......)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;' LINES TERMINATED BY '\n' STORED AS TEXTFILE;
3. Create the target SequenceFile table.
CREATE TABLE info_sequ(
...... )
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\;' LINES TERMINATED BY '\n' STORED AS SEQUENCEFILE;
4. Load data into the text table (a 400+ MB data file):
load data local inpath 'file.txt' OVERWRITE into table info_text;
5. Populate the target SequenceFile table:
insert into table info_sequ select * from info_text;
6. Check that both tables return the same results:
select * from info_sequ limit 20;
select * from info_text limit 20;
7. Check the file sizes on HDFS:
drwxrwxrwx - hive hadoop 0 2013-08-30 01:33 /user/hive/warehouse/sequence.db/info_sequ
-rw-r--r-- 3 root hadoop 124330943 2013-08-30 01:33 /user/hive/warehouse/sequence.db/info_sequ/000000_0
-rw-r--r-- 3 root hadoop 77730350 2013-08-30 01:33 /user/hive/warehouse/sequence.db/info_sequ/000001_0
drwxr-xr-x - root hadoop 0 2013-08-30 01:33 /user/hive/warehouse/sequence.db/info_text
-rw-r--r-- 3 root hadoop 438474539 2013-08-30 01:33 /user/hive/warehouse/sequence.db/info_text/file.txt
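From the listing, the Snappy-compressed SequenceFile takes less than half the space of the raw text; a quick calculation:

```python
# Sizes in bytes, copied from the HDFS listing above.
text_size = 438_474_539                   # info_text/file.txt
sequ_sizes = [124_330_943, 77_730_350]    # info_sequ/000000_0, 000001_0

compressed = sum(sequ_sizes)
ratio = compressed / text_size
print(f"{compressed} bytes, {ratio:.0%} of the original")
# 202061293 bytes, 46% of the original
```

Snappy trades compression ratio for speed, so saving roughly half the space is a plausible result on text data; gzip or bzip2 would compress tighter but cost more CPU.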
Beginning of file 000000_0:
SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^A)org.apache.hadoop.io.compress.SnappyCodec^@^@^@^@<84>çé<9f>³ÅZ<97>/.¹*¿I²6ÿÿÿÿ<84>çé<9f>³ÅZ<97>/.¹*¿I²6<8e>2^M<8e>^Bg^@^@2
Beginning of file 000001_0:
SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^A)org.apache.hadoop.io.compress.SnappyCodec^@^@^@^@[Ü^^Ü<9c>rx1µå'^HçÕcöÿÿÿÿ[Ü^^Ü<9c>rx1µå'^HçÕcö<8e>2Y<8e>^Bj^@^@2Y^@^@^BbÙ
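Those leading bytes are the SequenceFile header: the magic "SEQ" plus a version byte (^F = 6), two length-prefixed class names for key and value (the quote character is 34 and ^Y is 25, the lengths of the two names), two boolean flags (^A^A: compressed and block-compressed), the length-prefixed codec class name, then a metadata count of zero (^@^@^@^@) and a 16-byte sync marker. A minimal parser sketch, assuming class-name lengths fit in a single VInt byte (true for names under 128 characters):

```python
import io

def parse_seq_header(data: bytes) -> dict:
    """Parse the header fields visible in the dumps above. Class-name
    lengths are Hadoop VInts; values under 128 occupy a single byte,
    which is all this sketch handles."""
    f = io.BytesIO(data)
    assert f.read(3) == b'SEQ', 'not a SequenceFile'
    version = f.read(1)[0]                      # ^F -> 6
    key_cls = f.read(f.read(1)[0]).decode()     # '"' -> 34-byte name
    val_cls = f.read(f.read(1)[0]).decode()     # ^Y  -> 25-byte name
    compressed = f.read(1)[0] == 1              # ^A^A: both flags set
    block_compressed = f.read(1)[0] == 1
    codec = f.read(f.read(1)[0]).decode() if compressed else None
    return {'version': version, 'key': key_cls, 'value': val_cls,
            'block_compressed': block_compressed, 'codec': codec}

# A synthetic header assembled from the byte values shown above.
demo = (b'SEQ\x06'
        + bytes([34]) + b'org.apache.hadoop.io.BytesWritable'
        + bytes([25]) + b'org.apache.hadoop.io.Text'
        + b'\x01\x01'
        + bytes([41]) + b'org.apache.hadoop.io.compress.SnappyCodec')
print(parse_seq_header(demo)['codec'])
# org.apache.hadoop.io.compress.SnappyCodec
```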
Hive has two virtual columns:
INPUT__FILE__NAME: the input file name of the map task.
BLOCK__OFFSET__INSIDE__FILE: the current position within the file. As I understand it, it works much like the position of an NIO Buffer.
hive> describe formatted locallzo;
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: com.hadoop.mapred.DeprecatedLzoTextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
hive> describe formatted raw;
# Storage Information
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
We can compare BLOCK__OFFSET__INSIDE__FILE between the two tables:
With the data stored as a plain text file:
hive> select INPUT__FILE__NAME, unum, BLOCK__OFFSET__INSIDE__FILE from raw limit 10;
hdfs://indigo:8020/user/hive/warehouse/raw/state=26/rawdata.txt 3246065 0
hdfs://indigo:8020/user/hive/warehouse/raw/state=26/rawdata.txt 2037265 73
hdfs://indigo:8020/user/hive/warehouse/raw/state=26/rawdata.txt 2287465 149
hdfs://indigo:8020/user/hive/warehouse/raw/state=26/rawdata.txt 6581865 225
hdfs://indigo:8020/user/hive/warehouse/raw/state=26/rawdata.txt 6581865 298
hdfs://indigo:8020/user/hive/warehouse/raw/state=26/rawdata.txt 6581865 371
hdfs://indigo:8020/user/hive/warehouse/raw/state=26/rawdata.txt 1629568 447
hdfs://indigo:8020/user/hive/warehouse/raw/state=26/rawdata.txt 2185765 523
With the data stored as an LZO-compressed file:
hive> select INPUT__FILE__NAME, office_no, BLOCK__OFFSET__INSIDE__FILE from locallzo limit 20;
hdfs://indigo:8020/tmp/table/tb/rawdata.txt.lzo 3246065 0
hdfs://indigo:8020/tmp/table/tb/rawdata.txt.lzo 2037265 28778
hdfs://indigo:8020/tmp/table/tb/rawdata.txt.lzo 2287465 28778
hdfs://indigo:8020/tmp/table/tb/rawdata.txt.lzo 6581865 28778
hdfs://indigo:8020/tmp/table/tb/rawdata.txt.lzo 6581865 28778
hdfs://indigo:8020/tmp/table/tb/rawdata.txt.lzo 6581865 28778
hdfs://indigo:8020/tmp/table/tb/rawdata.txt.lzo 1629568 28778
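The contrast is the point: for a text file the offset is the byte position where each row starts, so consecutive offsets differ by exactly one row's length; for the LZO file the column reports where the current compressed block begins, and since all of these rows come out of the same block they all show 28778. A quick check on the text offsets above:

```python
# Offsets reported for the text-backed table above.
text_offsets = [0, 73, 149, 225, 298, 371, 447, 523]

# The gap between consecutive offsets is each row's length in bytes.
row_lengths = [b - a for a, b in zip(text_offsets, text_offsets[1:])]
print(row_lengths)  # [73, 76, 76, 73, 73, 76, 76]
```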