hive> create table lpx_partition_test(global_id int, company_name string)partitioned by (stat_date string, province string) row format delimited fields terminated by ',';
OK
hive> desc extended lpx_partition_test;
OK
global_id int
company_name string
stat_date string
province string
Detailed Table Information Table(tableName:lpx_partition_test, dbName:default, owner:root, createTime:1312186275, lastAccessTime:0, retention:0, sd:StorageDescriptor(cols:[FieldSchema(name:global_id, type:int, comment:null), FieldSchema(name:company_name, type:string, comment:null), FieldSchema(name:stat_date, type:string, comment:null), FieldSchema(name:province, type:string, comment:null)], location:hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test, inputFormat:org.apache.hadoop.mapred.TextInputFormat, outputFormat:org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, compressed:false, numBuckets:-1, serdeInfo:SerDeInfo(name:null, serializationLib:org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, parameters:{serialization.format=,, field.delim=,}), bucketCols:[], sortCols:[], parameters:{}), partitionKeys:[FieldSchema(name:stat_date, type:string, comment:null), FieldSchema(name:province, type:string, comment:null)], parameters:{transient_lastDdlTime=1312186275}, viewOriginalText:null, viewExpandedText:null, tableType:MANAGED_TABLE)
Time taken: 0.111 seconds
In this example, stat_date and province are defined as partition columns. As in Oracle, a partition must be created before data can be inserted into it. Once a partition has been created, a corresponding directory is created on HDFS.
hive> alter table lpx_partition_test add PARTITION(stat_date='2011-06-08', province='ZheJiang');
OK
Time taken: 0.464 seconds
hive> alter table lpx_partition_test add PARTITION(stat_date='2011-06-08', province='GuangDong');
OK
Time taken: 7.746 seconds
hive> alter table lpx_partition_test add PARTITION(stat_date='2011-06-09', province='ZheJiang');
OK
Time taken: 0.235 seconds
root@hadoop1:/opt/hadoop# bin/hadoop dfs -ls /user/hive/warehouse/lpx_partition_test
Found 2 items
drwxr-xr-x - root supergroup 0 2011-08-01 16:42 /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08
drwxr-xr-x - root supergroup 0 2011-08-01 16:42 /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-09
root@hadoop1:/opt/hadoop# bin/hadoop dfs -ls /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08
Found 2 items
drwxr-xr-x - root supergroup 0 2011-08-01 16:42 /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=GuangDong
drwxr-xr-x - root supergroup 0 2011-08-01 16:37 /user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=ZheJiang
As the listings show, each partition corresponds to a directory of its own, with stat_date at the parent level and province at the child level.
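One practical consequence of this directory layout is partition pruning: a query that filters on the partition columns reads only the matching directories instead of the whole table. A minimal sketch against the table above (using the partitions just created):

```sql
-- Reads only .../stat_date=2011-06-08/province=ZheJiang;
-- the 2011-06-09 and GuangDong directories are never scanned.
SELECT global_id, company_name
FROM lpx_partition_test
WHERE stat_date = '2011-06-08'
  AND province = 'ZheJiang';
```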
Inserting data into a partition:
hive> drop table lpx_partition_test_in;
OK
Time taken: 6.971 seconds
hive> create table lpx_partition_test_in(global_id int, company_name string, province string)row format delimited fields terminated by ' ';
OK
Time taken: 0.275 seconds
hive> LOAD DATA LOCAL INPATH '/opt/hadoop/mytest/lpx_partition_test.txt' OVERWRITE INTO TABLE lpx_partition_test_in;
Copying data from file:/opt/hadoop/mytest/lpx_partition_test.txt
Copying file: file:/opt/hadoop/mytest/lpx_partition_test.txt
Loading data to table default.lpx_partition_test_in
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test_in
OK
Time taken: 0.428 seconds
hive> insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-08', province='ZheJiang') select global_id, company_name from lpx_partition_test_in where province='ZheJiang';
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Execution log at: /tmp/root/root_20110801172929_4b36ae2a-9d00-4432-8746-7b4d62aa8378.log
Job running in-process (local Hadoop)
2011-08-01 17:29:35,384 null map = 100%, reduce = 0%
Ended Job = job_local_0001
Ended Job = -1620577194, job is filtered out (removed at runtime).
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2011-08-01_17-29-30_013_2844131263666576737/-ext-10000
Loading data to table default.lpx_partition_test partition (stat_date=2011-06-08, province=ZheJiang)
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=ZheJiang
Partition default.lpx_partition_test{stat_date=2011-06-08, province=ZheJiang} stats: [num_files: 1, num_rows: 3, total_size: 60]
Table default.lpx_partition_test stats: [num_partitions: 1, num_files: 1, num_rows: 3, total_size: 60]
OK
Time taken: 6.524 seconds
hive> select * from lpx_partition_test;
OK
99001 xxxcompany_name1 2011-06-08 ZheJiang
99002 xxxcompany_name1 2011-06-08 ZheJiang
99003 xxxcom2 2011-06-08 ZheJiang
Time taken: 0.559 seconds
hive> from lpx_partition_test_in
> insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-08', province='ZheJiang') select global_id, company_name where province='ZheJiang'
> insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-08', province='GuangDong') select global_id, company_name where province='GuangDong'
> insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-09', province='ZheJiang') select global_id, company_name where province='ZheJiang'
> insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-09', province='GuangDong') select global_id, company_name where province='GuangDong';
Total MapReduce jobs = 5
Launching Job 1 out of 5
Number of reduce tasks is set to 0 since there's no reduce operator
Execution log at: /tmp/root/root_20110801180606_1dc94690-8e64-41cc-a4d7-30e927408f30.log
Job running in-process (local Hadoop)
2011-08-01 18:06:22,147 null map = 0%, reduce = 0%
2011-08-01 18:06:23,149 null map = 100%, reduce = 0%
Ended Job = job_local_0001
Ended Job = 1501179483, job is filtered out (removed at runtime).
Ended Job = -24922011, job is filtered out (removed at runtime).
Ended Job = -2114178998, job is filtered out (removed at runtime).
Ended Job = 1437573638, job is filtered out (removed at runtime).
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2011-08-01_18-06-16_672_4382965127366007981/-ext-10000
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2011-08-01_18-06-16_672_4382965127366007981/-ext-10002
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2011-08-01_18-06-16_672_4382965127366007981/-ext-10004
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2011-08-01_18-06-16_672_4382965127366007981/-ext-10006
Loading data to table default.lpx_partition_test partition (stat_date=2011-06-08, province=ZheJiang)
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=ZheJiang
Partition default.lpx_partition_test{stat_date=2011-06-08, province=ZheJiang} stats: [num_files: 1, num_rows: 3, total_size: 60]
Table default.lpx_partition_test stats: [num_partitions: 1, num_files: 1, num_rows: 3, total_size: 60]
Loading data to table default.lpx_partition_test partition (stat_date=2011-06-09, province=ZheJiang)
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-09/province=ZheJiang
Partition default.lpx_partition_test{stat_date=2011-06-09, province=ZheJiang} stats: [num_files: 1, num_rows: 3, total_size: 60]
Table default.lpx_partition_test stats: [num_partitions: 2, num_files: 2, num_rows: 6, total_size: 120]
Loading data to table default.lpx_partition_test partition (stat_date=2011-06-08, province=GuangDong)
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=GuangDong
Loading data to table default.lpx_partition_test partition (stat_date=2011-06-09, province=GuangDong)
Partition default.lpx_partition_test{stat_date=2011-06-09, province=GuangDong} stats: [num_files: 1, num_rows: 1, total_size: 23]
Table default.lpx_partition_test stats: [num_partitions: 3, num_files: 3, num_rows: 7, total_size: 143]
Partition default.lpx_partition_test{stat_date=2011-06-08, province=GuangDong} stats: [num_files: 1, num_rows: 1, total_size: 23]
Table default.lpx_partition_test stats: [num_partitions: 4, num_files: 4, num_rows: 8, total_size: 166]
OK
Time taken: 8.778 seconds
hive> select * from lpx_partition_test;
OK
99001 xxxcompany_name1 2011-06-08 GuangDong
99001 xxxcompany_name1 2011-06-08 ZheJiang
99002 xxxcompany_name1 2011-06-08 ZheJiang
99003 xxxcom2 2011-06-08 ZheJiang
99001 xxxcompany_name1 2011-06-09 GuangDong
99001 xxxcompany_name1 2011-06-09 ZheJiang
99002 xxxcompany_name1 2011-06-09 ZheJiang
99003 xxxcom2 2011-06-09 ZheJiang
Time taken: 0.356 seconds
-- Dynamic partitions
If a large amount of data must be inserted into many different partitions, a separate INSERT statement has to be written for each partition, which quickly adds up to a large number of INSERT statements.
For example, to load one day's data for every province, one INSERT per province is needed, which is very inconvenient. To load another day's data, both the DML and the DDL must be modified, and each INSERT statement is also turned into its own MapReduce job. Dynamic partitioning solves this problem: based on the input data, Hive decides at run time which partitions to create and into which partition each row should go. The feature has been available since Hive 0.6.0. During a dynamic insert, the partition-column values are evaluated per row to determine the target partition, and any partition that does not yet exist is created automatically. With this feature, all the necessary partitions can be created and populated with a single INSERT statement, and because there is only one statement, only one MapReduce job is launched. This can greatly improve performance and reduce the load on the Hadoop cluster.
Dynamic partition parameters:
hive.exec.max.dynamic.partitions.pernode: the maximum number of dynamic partitions each mapper or reducer may create (default 100).
hive.exec.max.dynamic.partitions: the maximum number of dynamic partitions a single DML statement may create (default 1000).
hive.exec.max.created.files: the maximum total number of files all mappers and reducers may create (default 100000).
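When a statement is expected to create many partitions, these limits are usually raised in the session together with the dynamic partition switches used below. A sketch of such a session setup (the numbers are illustrative, not recommended values):

```sql
SET hive.exec.dynamic.partition=true;           -- enable dynamic partitions
SET hive.exec.dynamic.partition.mode=nonstrict; -- allow all partition columns to be dynamic
SET hive.exec.max.dynamic.partitions.pernode=500;
SET hive.exec.max.dynamic.partitions=2000;
SET hive.exec.max.created.files=150000;
```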
hive> set hive.exec.dynamic.partition;
hive.exec.dynamic.partition=false
hive> set hive.exec.dynamic.partition = true;
hive> set hive.exec.dynamic.partition;
hive.exec.dynamic.partition=true
hive> from lpx_partition_test_in
> insert overwrite table lpx_partition_test PARTITION(stat_date='2011-06-08', province) select global_id, company_name,province;
Total MapReduce jobs = 2
Launching Job 1 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Execution log at: /tmp/root/root_20110801183737_64ce8cf1-a068-4fbf-9d8e-561118569b2c.log
Job running in-process (local Hadoop)
2011-08-01 18:37:57,566 null map = 100%, reduce = 0%
Ended Job = job_local_0001
Ended Job = -1141443727, job is filtered out (removed at runtime).
Moving data to: hdfs://hadoop1:9000/tmp/hive-root/hive_2011-08-01_18-37-51_921_8609501383674778354/-ext-10000
Loading data to table default.lpx_partition_test partition (stat_date=2011-06-08, province=null)
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=GuangDong
Deleted hdfs://hadoop1:9000/user/hive/warehouse/lpx_partition_test/stat_date=2011-06-08/province=ZheJiang
Loading partition {stat_date=2011-06-08, province=GuangDong}
Loading partition {stat_date=2011-06-08, province=ZheJiang}
Partition default.lpx_partition_test{stat_date=2011-06-08, province=GuangDong} stats: [num_files: 1, num_rows: 1, total_size: 23]
Partition default.lpx_partition_test{stat_date=2011-06-08, province=ZheJiang} stats: [num_files: 1, num_rows: 3, total_size: 60]
Table default.lpx_partition_test stats: [num_partitions: 4, num_files: 4, num_rows: 8, total_size: 166]
OK
Time taken: 6.683 seconds
hive> from lpx_partition_test_in
> insert overwrite table lpx_partition_test PARTITION(stat_date, province) select global_id, company_name,stat_date,province;
FAILED: Error in semantic analysis: Dynamic partition strict mode requires at least one static partition column. To turn this off set hive.exec.dynamic.partition.mode=nonstrict
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> from lpx_partition_test_in t
> insert overwrite table lpx_partition_test PARTITION(stat_date, province) select t.global_id, t.company_name, t.stat_date, t.province DISTRIBUTE BY t.stat_date, t.province;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapred.reduce.tasks=<number>
Execution log at: /tmp/root/root_20110802131616_02744950-1c88-4073-8aae-07c964073c1a.log
Job running in-process (local Hadoop)
2011-08-02 13:16:30,765 null map = 0%, reduce = 0%
2011-08-02 13:16:37,776 null map = 100%, reduce = 0%
2011-08-02 13:16:40,915 null map = 100%, reduce = 100%
Ended Job = job_local_0001
Loading data to table default.lpx_partition_test partition (stat_date=null, province=null)
Loading partition {stat_date=20110608, province=GuangDong}
Loading partition {stat_date=20110608, province=ZheJiang}
Loading partition {stat_date=20110609, province=ZheJiang}
Partition default.lpx_partition_test{stat_date=20110608, province=GuangDong} stats: [num_files: 1, num_rows: 1, total_size: 23]
Partition default.lpx_partition_test{stat_date=20110608, province=ZheJiang} stats: [num_files: 1, num_rows: 1, total_size: 23]
Partition default.lpx_partition_test{stat_date=20110609, province=ZheJiang} stats: [num_files: 1, num_rows: 2, total_size: 37]
Table default.lpx_partition_test stats: [num_partitions: 7, num_files: 7, num_rows: 12, total_size: 249]
OK
Time taken: 26.672 seconds
hive> select * from lpx_partition_test;
OK
99001 xxxcompany_name1 2011-06-08 GuangDong
99001 xxxcompany_name1 2011-06-08 ZheJiang
99002 xxxcompany_name1 2011-06-08 ZheJiang
99003 xxxcom2 2011-06-08 ZheJiang
99001 xxxcompany_name1 2011-06-09 GuangDong
99001 xxxcompany_name1 2011-06-09 ZheJiang
99002 xxxcompany_name1 2011-06-09 ZheJiang
99003 xxxcom2 2011-06-09 ZheJiang
99001 xxxcompany_name1 20110608 GuangDong
99001 xxxcompany_name1 20110608 ZheJiang
99002 xxxcompany_name1 20110609 ZheJiang
99003 xxxcom2 20110609 ZheJiang
Time taken: 1.179 seconds
To keep rows that share the same partition-column values in the same MapReduce task, so that each task creates as few new directories as possible, you can use DISTRIBUTE BY on the partition columns, as in the INSERT above, to group such rows together.
This also shows that a partition column in Hive SQL is not a column that physically exists in the data files; it is effectively one or more pseudo-columns.
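Concretely, the data files under a partition directory store only the non-partition columns; Hive fills in stat_date and province from the directory names when a query reads them. A sketch of what this means for the table above:

```sql
-- A file under stat_date=2011-06-08/province=ZheJiang contains rows such as
--   99001,xxxcompany_name1
-- yet the query below returns four columns: the last two are reconstructed
-- from the partition directory path, not read from the file.
SELECT global_id, company_name, stat_date, province
FROM lpx_partition_test;
```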