1. List the directories where the data is stored in HDFS. There are three partition levels: month_id, day_id, and prov_id, and the source data lives under /user/hive/data.
2. Create the corresponding table, specifying these three partition columns. The partition clause is:
partitioned by (month_id int, day_id int, prov_id int)
Hive stores table data under /user/hive/warehouse/table_name.
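The full DDL implied by step 2 might look like the sketch below. Only the PARTITIONED BY clause appears in the text; the table name `pts` is taken from step 4, while the data column (`user_id string`) and the delimited text format are assumptions for illustration:

```shell
# Sketch of the CREATE TABLE statement for step 2.
# Only "partitioned by" is given in the text; the data column and
# row format here are assumptions.
tbl_name=pts
ddl="create table if not exists ${tbl_name} (user_id string)
     partitioned by (month_id int, day_id int, prov_id int)
     row format delimited fields terminated by '\t'"
echo "$ddl"
# In practice this would be executed as: hive -e "$ddl"
```

Partition columns are declared only in the PARTITIONED BY clause, never in the regular column list; Hive derives their values from the directory names.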
3. Write the script
#!/bin/sh
# Usage: ./hdfs_to_hive.sh <target_table>
tbl_name=$1

# Walk the three partition levels under /user/hive/data
# (month_id=*/day_id=*/prov_id=*) and load each leaf directory
# into the matching partition of the target table.
for p1 in `hadoop fs -ls /user/hive/data | awk '{print $8}'`
do
    month_id=`echo $p1 | awk -F '=' '{print $2}'`
    for p2 in `hadoop fs -ls $p1 | awk '{print $8}'`
    do
        day_id=`echo $p2 | awk -F '=' '{print $3}'`
        for p3 in `hadoop fs -ls $p2 | awk '{print $8}'`
        do
            prov_id=`echo $p3 | awk -F '=' '{print $4}'`
            # LOAD DATA INPATH accepts a directory and moves every file
            # inside it, so the leaf directory is passed directly.
            hive -e "load data inpath '/user/hive/data/month_id=$month_id/day_id=$day_id/prov_id=$prov_id' overwrite into table $tbl_name partition(month_id=$month_id,day_id=$day_id,prov_id=$prov_id)"
        done
    done
done
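The field extraction in the loops above can be traced with sample paths. The partition values here (202301, 01, 871) are hypothetical; note that each value is pulled from the path at its own nesting depth, which is why the awk field index increases from $2 to $4:

```shell
# Worked example of the awk -F '=' extraction used in the script.
# Sample partition values are hypothetical.
p1=/user/hive/data/month_id=202301
p2=$p1/day_id=01
p3=$p2/prov_id=871

# Splitting on '=' turns p1 into "/user/hive/data/month_id" and "202301",
# so the value is always the last field at each depth.
month_id=`echo $p1 | awk -F '=' '{print $2}'`   # -> 202301
day_id=`echo $p2 | awk -F '=' '{print $3}'`     # -> 01
prov_id=`echo $p3 | awk -F '=' '{print $4}'`    # -> 871
echo "$month_id $day_id $prov_id"
```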
tbl_name=$1: captures the script's first argument, the name of the target table.
`hadoop fs -ls /user/hive/data | awk '{print $8}'`: lists the first-level source directories (column 8 of the ls output is the path).
for p1 in xxxx: the for loop iterates over those source directories.
month_id=`echo $p1 | awk -F '=' '{print $2}'`: extracts the partition value from the directory name.
hive -e "load data inpath '/user/hive/data/month_id=$month_id/day_id=$day_id/prov_id=$prov_id' overwrite into table $tbl_name partition(month_id=$month_id,day_id=$day_id,prov_id=$prov_id)": concatenates the source directory path from the three partition values and loads it into the matching partition of the target table.
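For one concrete partition, the concatenated statement can be previewed by substituting sample values (hypothetical here) and echoing the string the script would hand to `hive -e`:

```shell
# Build the LOAD DATA statement for one sample partition.
# The table name and partition values are hypothetical.
tbl_name=pts
month_id=202301
day_id=01
prov_id=871
src=/user/hive/data/month_id=$month_id/day_id=$day_id/prov_id=$prov_id
sql="load data inpath '$src' overwrite into table $tbl_name partition(month_id=$month_id,day_id=$day_id,prov_id=$prov_id)"
echo "$sql"
# The script executes this as: hive -e "$sql"
```

Because the statement uses OVERWRITE, rerunning the script replaces each partition's contents rather than appending to them.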
4. Invoke the script
./hdfs_to_hive.sh pts