DDL (Data Definition Language)
Hive Data Storage Structure
1. Database: Hive can contain multiple databases. The default database is default, which corresponds to the HDFS directory /user/hive/warehouse; this location is configured with the hive.metastore.warehouse.dir parameter (set in hive-site.xml).
2. Table: Hive tables are either managed (internal) tables or external tables. Each Hive table corresponds to a directory on HDFS: /user/hive/warehouse/[databasename.db]/tables.
3. Partition: a table can be partitioned by a column, which makes queries more convenient and efficient. Each partition has a corresponding directory on HDFS:
/user/hive/warehouse/[databasename.db]/table/partitions
4. Bucket: within a partition, data can be further divided into buckets for a finer-grained organization of the table's data. Bucket directories on HDFS:
/user/hive/warehouse/[databasename.db]/table/partitions/buckets
Databases
1. Creating a database
CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
[COMMENT database_comment]
[LOCATION hdfs_path]
[WITH DBPROPERTIES (property_name=property_value, …)];

IF NOT EXISTS: create the database only if it does not already exist; if it exists, nothing is created.
COMMENT: a description of the database.
LOCATION: the HDFS path where the database is created; if omitted, it defaults to /user/hive/warehouse/.
WITH DBPROPERTIES: custom properties of the database.

hive> CREATE DATABASE IF NOT EXISTS hive2
    > COMMENT "it is my database"
    > WITH DBPROPERTIES ("creator"="zhangsan", "date"="2018-08-08")
    > ;
OK
The metadata of the created database is recorded in the DBS table of the basic01 database in MySQL (the database name is configured in hive-site.xml).
It can be viewed with: select * from dbs \G;

mysql> select * from dbs \G;
*************************** 1. row ***************************
          DB_ID: 1
           DESC: Default Hive database
DB_LOCATION_URI: hdfs://192.168.137.200:9000/user/hive/warehouse
           NAME: default
     OWNER_NAME: public
     OWNER_TYPE: ROLE
The detailed metadata is recorded in the database_params table under basic01:

mysql> select * from database_params \G;
*************************** 1. row ***************************
      DB_ID: 21
  PARAM_KEY: creator
PARAM_VALUE: zhangsan
*************************** 2. row ***************************
      DB_ID: 21
  PARAM_KEY: date
PARAM_VALUE: 2018-08-08
2 rows in set (0.00 sec)
2. Listing databases
SHOW (DATABASES|SCHEMAS) [LIKE 'identifier_with_wildcards']
LIKE is followed by a pattern on the database name and supports wildcards for fuzzy matching.

hive> show databases;
OK
default
word
wordcount
3. Describing a database
DESCRIBE DATABASE [EXTENDED] db_name;
DESCRIBE DATABASE db_name: shows the database's description and its directory location;
EXTENDED: also shows the database's detailed properties.

hive> describe database hive2;
OK
hive2  it is my database  hdfs://192.168.137.200:9000/user/hive/warehouse/hive2.db  hadoop  USER
Time taken: 0.119 seconds, Fetched: 1 row(s)
hive> describe database extended hive2;
OK
hive2  it is my database  hdfs://192.168.137.200:9000/user/hive/warehouse/hive2.db  hadoop  USER  {date=2018-08-08, creator=zhangsan}
Time taken: 0.135 seconds, Fetched: 1 row(s)
4. Modifying a database
ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, …);
ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;
ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path;
hive> alter database hive2 set dbproperties ("update"="lisi");
OK
hive> alter database hive2 set owner user wjx;
OK
Check the database information.
Before the change:

hive> describe database extended hive2;
OK
hive2  it is my database  hdfs://192.168.137.200:9000/user/hive/warehouse/hive2.db  hadoop  USER  {date=2018-08-08, creator=zhangsan}
Time taken: 0.135 seconds, Fetched: 1 row(s)
After the change:

hive> describe database extended hive2;
OK
hive2  it is my database  hdfs://192.168.137.200:9000/user/hive/warehouse/hive2.db  wjx  USER  {date=2018-08-08, creator=zhangsan, update=lisi}
Time taken: 0.235 seconds, Fetched: 1 row(s)
5. Dropping a database
DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];
RESTRICT: the default; the drop fails with an error if the database still contains tables.
CASCADE: cascade delete; if the database still contains tables, they are dropped first and then the database itself is dropped.

hive> drop database hive2;
OK
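For example, with CASCADE a database that still contains tables can be removed in one statement (hive3 here is a hypothetical database name used for illustration):

```sql
-- Fails with an error if hive3 still contains tables:
DROP DATABASE IF EXISTS hive3 RESTRICT;

-- Drops the tables in hive3 first, then the database itself:
DROP DATABASE IF EXISTS hive3 CASCADE;
```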
Tables
Table data types
int,
bigint,
float/double,
string,
boolean

Delimiters
Default row delimiter: "\n"
Specified column delimiter: "\t"
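A minimal sketch tying the types and delimiters together (the table and column names are made up for illustration):

```sql
CREATE TABLE demo_types (
  id     int,
  amount double,
  name   string,
  active boolean
)
-- Columns are split on tabs; rows end at the default '\n':
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';
```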
1. Creating a table
CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
  -- (Note: TEMPORARY available in Hive 0.14.0 and later)
[(col_name data_type [COMMENT col_comment], … [constraint_specification])]
[COMMENT table_comment]
[PARTITIONED BY (col_name data_type [COMMENT col_comment], …)]
[CLUSTERED BY (col_name, col_name, …) [SORTED BY (col_name [ASC|DESC], …)] INTO num_buckets BUCKETS]
[SKEWED BY (col_name, col_name, …)  -- (Note: Available in Hive 0.10.0 and later)
  ON ((col_value, col_value, …), (col_value, col_value, …), …)
  [STORED AS DIRECTORIES]]
[
  [ROW FORMAT row_format]
  [STORED AS file_format]
  | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (…)]
    -- (Note: Available in Hive 0.6.0 and later)
]
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, …)]
  -- (Note: Available in Hive 0.6.0 and later)
[AS select_statement];
  -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)
1.1 TEMPORARY (temporary tables)
Hive 0.14.0 and later can create temporary tables. A temporary table is visible only to the current session and is dropped automatically when the session exits.
CREATE TEMPORARY TABLE ...
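A minimal sketch (tmp_user is a made-up table name):

```sql
-- Visible only to the current session; dropped automatically on session exit.
CREATE TEMPORARY TABLE tmp_user (
  id   int,
  name string
);
```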
1.2 EXTERNAL (external tables)
Hive has two kinds of tables: Managed Tables (the default) and External Tables (created with the EXTERNAL keyword). The main difference shows when a table is dropped: dropping a Managed Table deletes both the data (stored on HDFS) and the metadata (stored in MySQL), while dropping an External Table deletes only the metadata.
hive> create table managed_table(
    > id int,
    > name string
    > );

Create an external table:

hive> create external table external_table(
    > id int,
    > name string
    > );
Check the data on HDFS:

[hadoop@zydatahadoop001 ~]$ hdfs dfs -ls /user/hive/warehouse
Found 7 items
drwxr-xr-x - hadoop supergroup 0 2017-12-23 00:21 /user/hive/warehouse/external_table
drwxr-xr-x - hadoop supergroup 0 2017-12-22 14:42 /user/hive/warehouse/helloword
drwxr-xr-x - hadoop supergroup 0 2017-12-22 17:28 /user/hive/warehouse/hive1.db
drwxr-xr-x - hadoop supergroup 0 2017-12-22 17:40 /user/hive/warehouse/hive2.db
drwxr-xr-x - hadoop supergroup 0 2017-12-23 00:06 /user/hive/warehouse/managed_table
drwxr-xr-x - hadoop supergroup 0 2017-12-22 14:58 /user/hive/warehouse/word.db
drwxr-xr-x - hadoop supergroup 0 2017-12-22 15:34 /user/hive/warehouse/wordcount.db
Table metadata:

mysql> select * from tbls \G;
*************************** 4. row ***************************
            TBL_ID: 11
       CREATE_TIME: 1513958794
             DB_ID: 1
  LAST_ACCESS_TIME: 0
             OWNER: hadoop
         RETENTION: 0
             SD_ID: 11
          TBL_NAME: managed_table
          TBL_TYPE: MANAGED_TABLE
VIEW_EXPANDED_TEXT: NULL
VIEW_ORIGINAL_TEXT: NULL
*************************** 5. row ***************************
            TBL_ID: 13
       CREATE_TIME: 1513959668
             DB_ID: 1
  LAST_ACCESS_TIME: 0
             OWNER: hadoop
         RETENTION: 0
             SD_ID: 13
          TBL_NAME: external_table
          TBL_TYPE: EXTERNAL_TABLE
VIEW_EXPANDED_TEXT: NULL
VIEW_ORIGINAL_TEXT: NULL
Query both tables' information after dropping them:

hive> drop table managed_table;
OK
Time taken: 0.807 seconds
hive> drop table external_table;
OK

[hadoop@zydatahadoop001 ~]$ hdfs dfs -ls /user/hive/warehouse
Found 6 items
drwxr-xr-x - hadoop supergroup 0 2017-12-23 00:21 /user/hive/warehouse/external_table
drwxr-xr-x - hadoop supergroup 0 2017-12-22 14:42 /user/hive/warehouse/helloword
drwxr-xr-x - hadoop supergroup 0 2017-12-22 17:28 /user/hive/warehouse/hive1.db
drwxr-xr-x - hadoop supergroup 0 2017-12-22 17:40 /user/hive/warehouse/hive2.db
drwxr-xr-x - hadoop supergroup 0 2017-12-22 14:58 /user/hive/warehouse/word.db
drwxr-xr-x - hadoop supergroup 0 2017-12-22 15:34 /user/hive/warehouse/wordcount.db
After dropping, the managed table's data and metadata are both deleted, while the external table loses only its metadata (its directory external_table is still on HDFS, as shown above).
1.3 [(col_name data_type [COMMENT col_comment], … [constraint_specification])] [COMMENT table_comment]
col_name: the column name;
data_type: the column type;
COMMENT col_comment: a comment on the column;
[COMMENT table_comment]: a comment on the table.

hive> create table student(
    > id int comment 'student number',
    > name string comment 'surname'
    > )
    > comment 'this is student information table';
OK
1.4 PARTITIONED BY (partitioned tables)
- Partitioning a table by a column speeds up queries over large data sets. Hive supports static and dynamic partitions; in general, dynamic partitioning is more convenient.
Static partitions
A partitioned table is created with PARTITIONED BY; a table can have one or more partitions, and each partition exists as its own subdirectory under the table's directory.
A partition appears as a column in the table schema: describe table shows the column, but it stores no actual data content; it only labels the partition.
Partitioned tables come in two forms: single-partition tables, where the table directory contains a single level of subdirectories, and multi-partition tables, where the subdirectories are nested.
Creating a single-partition table:

hive> CREATE TABLE order_partition (
    > order_number string,
    > event_time string
    > )
    > PARTITIONED BY (event_month string);
OK
Load the data in the order.txt file into the order_partition table:

hive> load data local inpath '/home/hadoop/order.txt' overwrite into table order_partition partition (event_month='2014-05');
hive> select * from order_partition;
10703007267488 2014-05-01 06:01:12.334+01 2014-05
10101043505096 2014-05-01 07:28:12.342+01 2014-05
10103043509747 2014-05-01 07:50:12.33+01 2014-05
10103043501575 2014-05-01 09:27:12.33+01 2014-05
10104043514061 2014-05-01 09:03:12.324+01 2014-05
Create a partition directory with hadoop commands and put the data into it:

[hadoop@zydatahadoop001 ~]$ hadoop fs -mkdir -p /user/hive/warehouse/order_partition/event_month=2014-06
[hadoop@zydatahadoop001 ~]$ hadoop fs -put /home/hadoop/order.txt /user/hive/warehouse/order_partition/event_month=2014-06

After the upload, query the order_partition table:

hive> select * from order_partition;
OK
10703007267488 2014-05-01 06:01:12.334+01 2014-05
10101043505096 2014-05-01 07:28:12.342+01 2014-05
10103043509747 2014-05-01 07:50:12.33+01 2014-05
10103043501575 2014-05-01 09:27:12.33+01 2014-05
10104043514061 2014-05-01 09:03:12.324+01 2014-05
Time taken: 2.034 seconds, Fetched: 5 row(s)
Notice that the data added with put does not show up in the query. The data really does exist on HDFS, but the metadata in MySQL has no record of the new partition. This can be fixed with:
msck repair table order_partition;
Query again:

hive> select * from order_partition;
OK
10703007267488 2014-05-01 06:01:12.334+01 2014-05
10101043505096 2014-05-01 07:28:12.342+01 2014-05
10103043509747 2014-05-01 07:50:12.33+01 2014-05
10103043501575 2014-05-01 09:27:12.33+01 2014-05
10104043514061 2014-05-01 09:03:12.324+01 2014-05
10703007267488 2014-05-01 06:01:12.334+01 2014-06
10101043505096 2014-05-01 07:28:12.342+01 2014-06
10103043509747 2014-05-01 07:50:12.33+01 2014-06
10103043501575 2014-05-01 09:27:12.33+01 2014-06
10104043514061 2014-05-01 09:03:12.324+01 2014-06
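msck repair scans the whole table directory; when only one partition was added by hand, an alternative is to register just that partition in the metastore (this assumes the same table and directory layout as above):

```sql
-- Register the hand-created partition without a full repair scan:
ALTER TABLE order_partition ADD IF NOT EXISTS
  PARTITION (event_month='2014-06')
  LOCATION '/user/hive/warehouse/order_partition/event_month=2014-06';
```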
Multiple partitions

hive> CREATE TABLE order_multi_partition (
    > order_number string,
    > event_time string
    > )
    > PARTITIONED BY (event_month string, step string);
OK
Load data:

hive> load data local inpath '/home/hadoop/order.txt' overwrite into table order_multi_partition partition (event_month='2014-05',step=1);
Query:

hive> select * from order_multi_partition;
OK
10703007267488 2014-05-01 06:01:12.334+01 2014-05 1
10101043505096 2014-05-01 07:28:12.342+01 2014-05 1
10103043509747 2014-05-01 07:50:12.33+01 2014-05 1
10103043501575 2014-05-01 09:27:12.33+01 2014-05 1
10104043514061 2014-05-01 09:03:12.324+01 2014-05 1
Check the corresponding HDFS path:

[hadoop@zydatahadoop001 ~]$ hdfs dfs -ls /user/hive/warehouse/order_multi_partition/event_month=2014-05
Found 1 items
drwxr-xr-x - hadoop supergroup 0 2018-01-09 22:52 /user/hive/warehouse/order_multi_partition/event_month=2014-05/step=1

As you can see, a multi-partition table maps to nested directory levels.
Dynamic partitions
Definition: dynamic partition (DP) columns are specified the same way as static partition (SP) columns, in the partition clause (after the PARTITION keyword); the only difference is that a DP column carries no value, while an SP column does (after the PARTITION keyword, a DP column is a key with no value).
In an INSERT … SELECT … query, the dynamic partition columns must come last among the SELECT columns, in the same order as they appear in the PARTITION() clause.
Using only DP columns is allowed in nonstrict mode only; in strict mode it raises an error.
When dynamic and static partitions are used together, the static partition columns must come before the dynamic ones. Hive defaults to static partitioning; to use dynamic partitioning, the following settings are needed:
set hive.exec.dynamic.partition=true;  (enables dynamic partitioning)
set hive.exec.dynamic.partition.mode=nonstrict;  (sets the dynamic partition mode; the default, strict, requires at least one partition column to be static; nonstrict allows all partition columns to be dynamic)
To enable dynamic partitioning permanently, add these properties to hive-site.xml.
Creating a dynamically partitioned table:

CREATE TABLE emp_dynamic_partition (
empno int,
ename string,
job string,
mgr int,
hiredate string,
salary double,
comm double
)
PARTITIONED BY (deptno int);
Using static partitions

CREATE TABLE emp_partition (
empno int,
ename string,
job string,
mgr int,
hiredate string,
salary double,
comm double
)
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY "\t";

insert into table emp_partition partition(deptno=10)
select empno,ename,job,mgr,hiredate,salary,comm from emp where deptno=10;

insert into table emp_partition partition(deptno=20)
select empno,ename,job,mgr,hiredate,salary,comm from emp where deptno=20;

insert into table emp_partition partition(deptno=30)
select empno,ename,job,mgr,hiredate,salary,comm from emp where deptno=30;
Query result:

hive> select * from emp_partition;
OK
7782 CLARK MANAGER 7839 1981/6/9 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981/11/17 5000.0 NULL 10
7934 MILLER CLERK 7782 1982/1/23 1300.0 NULL 10
7369 SMITH CLERK 7902 1980/12/17 800.0 NULL 20
7566 JONES MANAGER 7839 1981/4/2 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1987/4/19 3000.0 NULL 20
7876 ADAMS CLERK 7788 1987/5/23 1100.0 NULL 20
7902 FORD ANALYST 7566 1981/12/3 3000.0 NULL 20
7499 ALLEN SALESMAN 7698 1981/2/20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981/2/22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981/9/28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981/5/1 2850.0 NULL 30
7844 TURNER SALESMAN 7698 1981/9/8 1500.0 0.0 30
7900 JAMES CLERK 7698 1981/12/3 950.0 NULL 30
Using dynamic partitioning

insert into table emp_dynamic_partition partition(deptno)
select empno,ename,job,mgr,hiredate,salary,comm,deptno from emp;

hive> select * from emp_dynamic_partition;
OK
7782 CLARK MANAGER 7839 1981/6/9 2450.0 NULL 10
7839 KING PRESIDENT NULL 1981/11/17 5000.0 NULL 10
7934 MILLER CLERK 7782 1982/1/23 1300.0 NULL 10
7369 SMITH CLERK 7902 1980/12/17 800.0 NULL 20
7566 JONES MANAGER 7839 1981/4/2 2975.0 NULL 20
7788 SCOTT ANALYST 7566 1987/4/19 3000.0 NULL 20
7876 ADAMS CLERK 7788 1987/5/23 1100.0 NULL 20
7902 FORD ANALYST 7566 1981/12/3 3000.0 NULL 20
7499 ALLEN SALESMAN 7698 1981/2/20 1600.0 300.0 30
7521 WARD SALESMAN 7698 1981/2/22 1250.0 500.0 30
7654 MARTIN SALESMAN 7698 1981/9/28 1250.0 1400.0 30
7698 BLAKE MANAGER 7839 1981/5/1 2850.0 NULL 30
7844 TURNER SALESMAN 7698 1981/9/8 1500.0 0.0 30
7900 JAMES CLERK 7698 1981/12/3 950.0 NULL 30
As you can see, dynamic partitioning partitions the whole table with a single statement, which is very convenient.
1.5 ROW FORMAT
- Specifies the column delimiter:
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
1.6 STORED AS (storage format)
- The default storage format is TEXTFILE.
2. Create Table As Select (CTAS)
- Creates a table from the result of a SELECT statement; the new table takes its schema and data from the query result, and populating it runs a MapReduce job.
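A minimal CTAS sketch, reusing the emp table from the partitioning examples (emp_dept20 is a made-up name):

```sql
-- Schema and rows both come from the SELECT; populating the table runs an MR job.
CREATE TABLE emp_dept20
AS
SELECT empno, ename, salary FROM emp WHERE deptno = 20;
```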
3. LIKE
- Creating a table with LIKE copies only the table structure; no data is copied.
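For example (emp_empty is a made-up name, again based on the emp table used above):

```sql
-- Copies emp's structure only; the new table starts with no rows.
CREATE TABLE emp_empty LIKE emp;
```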
4. desc formatted table_name
- Displays the table's information in a formatted layout.
5. Listing tables
- show tables;
- show create table table_name;
6. Dropping a table
- DROP TABLE [IF EXISTS] table_name [PURGE]; -- (Note: PURGE available in Hive 0.14.0 and later)
With PURGE specified, the data is not moved to the trash; it is deleted immediately.
DROP TABLE deletes the table's metadata and data. If the trash is configured (and PURGE is not specified), the data is actually moved to the .Trash/Current directory, while the metadata is lost for good.
When an EXTERNAL table is dropped, the table's data is not removed from the file system.
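For example (logs_tmp is a made-up table name):

```sql
-- Bypass the HDFS trash and delete the data immediately:
DROP TABLE IF EXISTS logs_tmp PURGE;
```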
7. Altering a table (Alter Table)
- Renaming a table:
ALTER TABLE table_name RENAME TO new_table_name;
- Adding partitions:
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location'][, PARTITION partition_spec [LOCATION 'location'], …];
partition_spec:
  (partition_column = partition_col_value, partition_column = partition_col_value, …)
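Two sketches against the order_partition table used earlier (the new name order_partition_bak and the partition value 2014-07 are made up):

```sql
-- Rename the table:
ALTER TABLE order_partition RENAME TO order_partition_bak;

-- Add an empty partition to the renamed table:
ALTER TABLE order_partition_bak ADD IF NOT EXISTS PARTITION (event_month='2014-07');
```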
From @若澤大數據