Hive DDL & DML & DQL

Hive DDL Operations

Creating a Database

Syntax

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
  [COMMENT database_comment]
  [LOCATION hdfs_path]
  [WITH DBPROPERTIES (property_name=property_value, ...)];

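When you create a database, its default storage path on HDFS is /user/hive/warehouse/*.db (a directory named after the database, with a .db suffix).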

To avoid an error when the database already exists, add an IF NOT EXISTS check (the standard way to write it).

hive (default)> create database if not exists db_hive1;

Hive lets you choose where a database is stored on HDFS; just supply the LOCATION clause.

hive (default)> create database db_hive2 location '/tmp';
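The optional COMMENT and WITH DBPROPERTIES clauses from the syntax above can be combined with LOCATION. The following is a minimal sketch; the database name db_hive3, the path, and the property keys and values are hypothetical and chosen only for illustration.

```sql
-- db_hive3, the location, and the properties below are illustrative assumptions
CREATE DATABASE IF NOT EXISTS db_hive3
  COMMENT 'demo database'
  LOCATION '/user/hive/warehouse/db_hive3.db'
  WITH DBPROPERTIES ('creator'='hadoop', 'purpose'='demo');

-- Show the comment, location, and properties that were stored
DESCRIBE DATABASE EXTENDED db_hive3;
```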

Database Operations

Listing databases

hive (default)> show databases;

OK
database_name
db_hive1
default
Time taken: 0.015 seconds, Fetched: 2 row(s)

Filtering the database list

hive (default)> show databases like 'db_*';

OK
database_name
db_hive1
Time taken: 0.021 seconds, Fetched: 1 row(s)

Switching databases

hive (default)> use db_hive1;
OK
Time taken: 0.015 seconds

Altering a database

ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);   -- (Note: SCHEMA added in Hive 0.14.0)

ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;   -- (Note: Hive 0.13.0 and later; SCHEMA added in Hive 0.14.0)

ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path; -- (Note: Hive 2.2.1, 2.4.0 and later)
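As a rough illustration of the first form above, the sketch below sets a property on db_hive1 and then reads it back. The property key and value are made up for the example.

```sql
-- 'edited-by' is a hypothetical property key used only for illustration
ALTER DATABASE db_hive1 SET DBPROPERTIES ('edited-by'='hadoop');

-- The new property should appear in the extended description
DESCRIBE DATABASE EXTENDED db_hive1;
```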

Dropping a database

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];

If the database to be dropped might not exist, use IF EXISTS to avoid an error.

hive (db_hive1)> drop database if exists db_test;
OK
Time taken: 0.2 seconds

A database that still contains tables cannot be dropped; use CASCADE to force a cascading drop.

hive (db_hive1)> drop database if exists db_test cascade;
OK
Time taken: 0.2 seconds

Table Operations

Creating a table
Syntax

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name 
[(col_name data_type [COMMENT col_comment], ...)] 
[COMMENT table_comment] 
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] 
[CLUSTERED BY (col_name, col_name, ...) 
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] 
[ROW FORMAT row_format] 
[STORED AS file_format] 
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement]

Parameter reference

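EXTERNAL: creates an external table. Dropping an external table deletes only the metadata, not the data; dropping a managed (internal) table deletes both the metadata and the data.
COMMENT: adds a comment to the table or to a column
PARTITIONED BY: creates a partitioned table
CLUSTERED BY: creates a bucketed table
ROW FORMAT: specifies the delimiter between columns; for example, to split on tabs: ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'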
```
  DELIMITED [FIELDS TERMINATED BY char]
```
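STORED AS: specifies the storage file format. Commonly used formats:

SEQUENCEFILE (binary sequence file), TEXTFILE (plain text), and RCFILE (columnar storage format). If the data is plain text, use STORED AS TEXTFILE. If the data needs to be compressed, use STORED AS SEQUENCEFILE.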

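LOCATION: the table's storage path on HDFS
AS: followed by a query; creates the table from the query result
LIKE: creates a table from an existing table's definition, without copying the data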
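Several of the clauses described above can appear in one statement. The sketch below combines column and table comments, a partition column, a row format, and a storage format; the table name emp_demo and its comments are hypothetical.

```sql
-- emp_demo is a hypothetical table used only to show the clauses together
CREATE TABLE IF NOT EXISTS emp_demo (
  empno INT    COMMENT 'employee number',
  ename STRING COMMENT 'employee name',
  sal   DOUBLE COMMENT 'salary'
)
COMMENT 'demo table combining several CREATE TABLE clauses'
PARTITIONED BY (deptno INT)                      -- partition column, not repeated in the column list
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'   -- tab-separated fields
STORED AS TEXTFILE;                              -- plain-text storage
```

Note that the partition column is declared only in PARTITIONED BY; listing it again among the regular columns is an error.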

Create a regular table with '\t' as the column delimiter and an explicit storage path

CREATE TABLE emp_test1(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/bigdata/';

Listing tables

hive (db_hive1)> show tables;
OK
tab_name
emp_test1
Time taken: 0.082 seconds, Fetched: 1 row(s)

Listing the tables in a specific database

hive (default)> show tables in db_hive1;

Listing tables with a wildcard pattern

hive (db_hive1)> show tables like 'emp*';

Viewing a table's columns

hive (db_hive1)> desc emp_test1;
OK
col_name        data_type       comment
empno                   int                                         
ename                   string                                      
job                     string                                      
mgr                     int                                         
hiredate                string                                      
sal                     double                                      
comm                    double                                      
deptno                  int                                         
Time taken: 0.132 seconds, Fetched: 8 row(s)

Viewing detailed table information

hive (db_hive1)> desc formatted emp_test1;
OK
col_name        data_type       comment
# col_name              data_type               comment             
                 
empno                   int                                         
ename                   string                                      
job                     string                                      
mgr                     int                                         
hiredate                string                                      
sal                     double                                      
comm                    double                                      
deptno                  int                                         
                 
# Detailed Table Information             
Database:               db_hive1                 
OwnerType:              USER                     
Owner:                  hadoop                   
CreateTime:             Wed Dec 18 16:36:37 CST 2019     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://JD:9000/bigdata   
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   true                
        numFiles                0                   
        numRows                 0                   
        rawDataSize             0                   
        totalSize               0                   
        transient_lastDdlTime   1576658516          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             \t                  
        serialization.format    \t                  
Time taken: 0.085 seconds, Fetched: 40 row(s)

Loading data

With LOCAL, the data is read from the local file system; without LOCAL, it is read from HDFS.

hive (db_hive1)> load data local inpath '/home/hadoop/data/emp.txt' into table emp_test1;
Loading data to table db_hive1.emp_test1
Table db_hive1.emp_test1 stats: [numFiles=0, numRows=0, totalSize=0, rawDataSize=0]
OK
Time taken: 0.427 seconds

Create a table from a query result (both the table and its data are created)

hive (db_hive1)> create table emp_test2 as select * from emp_test1;
... # runs a MapReduce job

Create a table with LIKE (copies only the table structure)

hive (db_hive1)> create table emp_test3 like emp_test1;
OK
Time taken: 0.082 seconds

Altering a table

Renaming a table
hive (db_hive1)> alter table 
               > emp_test2 rename to emp_test3;

Dropping a table
hive (db_hive1)> drop table emp_test;

Truncating a table
hive (db_hive1)> truncate table emp_test1;

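External and Managed Tables

External tables
Because the table is external, Hive does not consider itself the full owner of the data. Dropping the table does not delete the data, although the metadata describing the table is deleted.
When to use external tables
Website data collected every day needs a large amount of statistical analysis, so the source (raw) data is stored in external tables, which makes it easy to share. The intermediate and result tables produced during analysis use managed tables, because that data does not need to be shared.

Table definition
Create an external table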

  CREATE EXTERNAL TABLE emp_test4(
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  sal double,
  comm double,
  deptno int
  ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/bigdata/';

Dropping the table
Dropping an external table deletes only its metadata; the data is not deleted.
```
  hive (db_hive1)> drop table emp_test4;
```

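Converting between external and managed tables

('EXTERNAL'='TRUE') and ('EXTERNAL'='FALSE') are fixed forms and must be written in uppercase exactly as shown; otherwise the statement does not work as intended.

Check the table type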

  hive (db_hive1)> desc formatted emp_test3;
  Table Type:             MANAGED_TABLE # managed table

Change emp_test3 to an external table

  hive (db_hive1)> alter table emp_test3 set tblproperties('EXTERNAL'='TRUE');
  OK
  Time taken: 0.082 seconds

Check the table type after the change

  hive (db_hive1)> desc formatted emp_test3;
  Table Type:             EXTERNAL_TABLE

Change emp_test3 back to a managed table

  hive (db_hive1)> alter table emp_test3 set tblproperties('EXTERNAL'='FALSE');

Check the table type after the change

  hive (db_hive1)> desc formatted emp_test3;
  Table Type:            MANAGED_TABLE

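Partitioned Tables

Concept

Each partition of a partitioned table corresponds to a separate directory on HDFS that holds all the data files of that partition. In Hive, partitioning means splitting data into directories: a large data set is divided into smaller ones according to business needs. At query time, an expression in the WHERE clause selects only the partitions that are needed, which makes the query much more efficient.

Advantages

Saves disk I/O and network I/O: a query reads its data from disk, and a MapReduce job transfers data over the network, so reading only the needed partitions reduces both.

Table definition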

create table order_partition(
order_no string,
event_time string
)
PARTITIONED BY (event_month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

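Loading data into a partitioned table
You must specify the partition column when loading data. If data is written directly to HDFS rather than through Hive, a Hive query will not return it, because the metastore has no record of that partition.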

hive (db_hive1)> load data local inpath '/home/hadoop/data/order_created07.txt' into table order_partition partition(event_month='2014-05');

Adding a partition

hive (db_hive1)> alter table order_partition add if not exists partition(event_month='2014-07'); 
OK
Time taken: 0.095 seconds

Querying a partition

hive (db_hive1)> select * from order_partition where event_month='2014-05';
OK
order_partition.order_no        order_partition.event_time      order_partition.event_month
10703007267488  2014-05-01 06:01:12.334+01      2014-05
10101043505096  2014-05-01 07:28:12.342+01      2014-05
10103043509747  2014-05-01 07:50:12.33+01       2014-05
10103043501575  2014-05-01 09:27:12.33+01       2014-05
10104043514061  2014-05-01 09:03:12.324+01      2014-05
Time taken: 0.07 seconds, Fetched: 5 row(s)

Querying across multiple partitions

hive (db_hive1)> select * from order_partition where event_month='2014-07' or event_month='2014-05';

Listing a table's partitions

hive (db_hive1)> show partitions order_partition;
OK
partition
event_month=2014-05
event_month=2014-07
Time taken: 0.153 seconds, Fetched: 2 row(s)

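Multi-level partitions
Single-level partition: one directory level
Three-level partition: three directory levels

Create a two-level partitioned table
create table order_mulit_partition(
order_no string,
event_time string
)
partitioned by (event_month string, step string)
row format delimited fields terminated by '\t';
Load data into the two-level partitioned table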

hive (db_hive1)> load data local inpath '/home/hadoop/data/order_created.txt' into table order_mulit_partition partition(event_month='2014-05',step='1'); 

Loading data to table db_hive1.order_mulit_partition partition (event_month=2014-05, step=1)
Partition db_hive1.order_mulit_partition{event_month=2014-05, step=1} stats: [numFiles=1, numRows=0, totalSize=213, rawDataSize=0]
OK
Time taken: 0.969 seconds

Querying the two-level partitioned table

hive (db_hive1)> select * from order_mulit_partition where event_month='2014-05' and step='1';

Load the emp data into a partitioned emp table, with deptno as the partition column

Table definition
CREATE TABLE emp_partition(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double
) 
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load the data
INSERT OVERWRITE TABLE emp_partition PARTITION (deptno=10) select empno,ename,job,mgr,hiredate,sal,comm   from emp where deptno=10;

Dynamic Partitioning

Create the table
CREATE TABLE emp_dynamic_partition(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double
) 
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

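Load the data dynamically. deptno is placed last in the SELECT list and has the same name as the deptno partition column (dynamic partition columns are matched by position, so they must come last).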
INSERT OVERWRITE TABLE emp_dynamic_partition PARTITION (deptno) select empno,ename,job,mgr,hiredate,sal,comm,deptno from emp;
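Depending on the Hive version and configuration, an insert in which every partition column is dynamic may be rejected unless dynamic partitioning is enabled in non-strict mode. The sketch below assumes that is the case in your environment and sets the two relevant properties for the session before running the same insert.

```sql
-- Session-level settings; whether they are needed depends on your Hive defaults
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Same insert as above: the last SELECT column feeds the deptno partition
INSERT OVERWRITE TABLE emp_dynamic_partition PARTITION (deptno)
SELECT empno, ename, job, mgr, hiredate, sal, comm, deptno
FROM emp;

-- One partition should exist per distinct deptno value in emp
SHOW PARTITIONS emp_dynamic_partition;
```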

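Three ways to associate data uploaded directly to a partition directory with the partitioned table

Upload the data, then repair (do not use msck in production: on a table with a large amount of data it can overwhelm the server)

1. Create the directory and upload the data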

[hadoop@JD ~]$ hadoop fs -mkdir -p /user/hive/warehouse/db_hive1.db/order_mulit_partition/event_month=2014-06/step=2

19/12/19 15:15:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@JD ~]$ hadoop fs -put '/home/hadoop/data/order_created.txt' '/user/hive/warehouse/db_hive1.db/order_mulit_partition/event_month=2014-06/step=2'   

19/12/19 15:16:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2. Query the data (the uploaded data is not found)

hive (db_hive1)> select * from order_mulit_partition where event_month='2014-06' and step='2';

OK
order_mulit_partition.order_no  order_mulit_partition.event_time        order_mulit_partition.event_month       order_mulit_partition.step
Time taken: 0.058 seconds

3. Run the repair command

hive (db_hive1)> msck repair table order_mulit_partition;

OK
Partitions not in metastore:    order_mulit_partition:event_month=2014-06/step=2
Repair: Added partition to metastore order_mulit_partition:event_month=2014-06/step=2
Time taken: 0.147 seconds, Fetched: 2 row(s)

4. Query again

hive (db_hive1)> select * from order_mulit_partition where event_month='2014-06' and step='2';

OK
order_mulit_partition.order_no  order_mulit_partition.event_time        order_mulit_partition.event_month       order_mulit_partition.step
10703007267488  2014-05-01 06:01:12.334+01      2014-06 2
10101043505096  2014-05-01 07:28:12.342+01      2014-06 2
10103043509747  2014-05-01 07:50:12.33+01       2014-06 2
10103043501575  2014-05-01 09:27:12.33+01       2014-06 2
10104043514061  2014-05-01 09:03:12.324+01      2014-06 2
Time taken: 0.056 seconds, Fetched: 5 row(s)

Upload the data, then add the partition (recommended)
1. Upload the data

[hadoop@JD ~]$ hadoop fs -mkdir -p /user/hive/warehouse/db_hive1.db/order_mulit_partition/event_month=2014-07/step=3

19/12/19 15:23:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@JD ~]$ 
[hadoop@JD ~]$ hadoop fs -put '/home/hadoop/data/order_created.txt' '/user/hive/warehouse/db_hive1.db/order_mulit_partition/event_month=2014-07/step=3'

19/12/19 15:23:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2. Add the partition (registers the partition in the metastore)

hive (db_hive1)> alter table order_mulit_partition  add partition(event_month='2014-07',step='3');
OK
Time taken: 0.083 seconds

3. Query

hive (db_hive1)> select * from order_mulit_partition where event_month='2014-07' and step='3';

OK
order_mulit_partition.order_no  order_mulit_partition.event_time        order_mulit_partition.event_month       order_mulit_partition.step
10703007267488  2014-05-01 06:01:12.334+01      2014-07 3
10101043505096  2014-05-01 07:28:12.342+01      2014-07 3
10103043509747  2014-05-01 07:50:12.33+01       2014-07 3
10103043501575  2014-05-01 09:27:12.33+01       2014-07 3
10104043514061  2014-05-01 09:03:12.324+01      2014-07 3
Time taken: 0.053 seconds, Fetched: 5 row(s)

Create the directory, then load data into the partition
1. Create the directory

[hadoop@JD ~]$ hadoop fs -mkdir -p /user/hive/warehouse/db_hive1.db/order_mulit_partition/event_month=2014-08/step=4

19/12/19 15:23:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2. Load the data

hive (db_hive1)> load data local inpath '/home/hadoop/data/order_created.txt' into table
 order_mulit_partition partition(event_month='2014-08',step='4');

3. Query the data

hive (db_hive1)> select * from order_mulit_partition where event_month='2014-08' and step='4';

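DML Operations

LOAD DATA: loads data
LOCAL: read from the local file system; without it, read from HDFS
INPATH: the path to load from
OVERWRITE: overwrite the existing data; without it, the data is appended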

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
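To show how the optional keywords combine, here is a small sketch against the order_partition table created earlier. The HDFS path in the second statement is a hypothetical example path.

```sql
-- LOCAL + OVERWRITE: copy a local file in and replace whatever the partition held
LOAD DATA LOCAL INPATH '/home/hadoop/data/order_created.txt'
OVERWRITE INTO TABLE order_partition PARTITION (event_month='2014-05');

-- No LOCAL, no OVERWRITE: append a file that already sits on HDFS
-- ('/tmp/order_created.txt' is an assumed path for illustration)
LOAD DATA INPATH '/tmp/order_created.txt'
INTO TABLE order_partition PARTITION (event_month='2014-05');
```

Keep in mind that loading from HDFS moves the source file into the table's directory rather than copying it, so the file no longer exists at its original path afterwards.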

Loading data into a table

Load a local file

hive (db_hive1)> load data local inpath '/home/hadoop/data/emp.txt' into table emp_test2;

Load a file from HDFS

hive (db_hive1)> load data inpath '/hive/hadoop/emp.txt' into table emp_test2;

Overwrite the data already in the table

hive (db_hive1)> load data local inpath '/home/hadoop/data/emp.txt' overwrite into table emp_test2;

Inserting data with INSERT statements

Basic insert

1. Create a partitioned table

create table order_partition1(
order_no string,
event_time string
)
PARTITIONED BY (event_month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

2. Insert data

hive (db_hive1)> insert into table order_partition1 partition(event_month='2014-05') values('1','1111'),('2','2222');

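Inserting from a query result

insert into: appends data to the table or partition; existing data is not deleted
insert overwrite: replaces the data that already exists in the table or partition
Note: insert does not support inserting into only a subset of the columns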

hive (db_hive1)> insert overwrite table order_partition1 partition(event_month='2014-06') select order_no,event_time from order_partition1 where event_month='2014-05';

Multi-table (multi-partition) insert mode (several inserts fed by one scan of the source)

hive (db_hive1)> from order_partition1
              insert overwrite table order_partition1 partition(event_month='2014-07')
              select order_no, event_time where event_month='2014-05'
              insert overwrite table order_partition1 partition(event_month='2014-08')
              select order_no, event_time where event_month='2014-05';

CTAS: create table ... as select ...

Creates a table from a query result and loads the data into it (commonly used at work); the table must not already exist

create table if not exists student3 as select id, name from student;

insert: the table must already exist

Insert data into a table from a query result
insert overwrite table emp4 select * from emp;

Alternatively, there is a rather unintuitive syntax
from emp insert into table emp4 select *;

Specifying the data path with LOCATION when creating a table

1. Upload the data to HDFS first

[hadoop@JD ~]$ hadoop fs -mkdir /emp
[hadoop@JD ~]$ hadoop fs -put /home/hadoop/data/emp.txt /emp

2. Create the table and specify the path

hive (default)> create external table if not exists emp(
              id int,
              name string
              )
              row format delimited fields terminated by '\t'
              location '/emp';

3. Query

hive (default)> select * from emp;

Exporting Data

Export to a local directory with a specified delimiter
INSERT OVERWRITE local DIRECTORY '/hivetmp' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' SELECT empno,ename FROM emp;	 

Export to an HDFS directory with a specified delimiter
INSERT OVERWRITE DIRECTORY '/hivetmp' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' SELECT empno,ename FROM emp;	

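Or fetch the files directly with the hadoop fs -get command

Or run the query with hive -e and redirect the output to a local file
hive -e 'select empno,ename from emp' > 1.txt

To filter the query result, pipe it through grep
hive -e 'select empno,ename from emp' | grep xxx > 1.txt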

Hive DQL

Syntax:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [ORDER BY col_list]
  [CLUSTER BY col_list
    | [DISTRIBUTE BY col_list] [SORT BY col_list]
  ]
 [LIMIT [offset,] rows]
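The less familiar clauses in this syntax can be sketched as follows. The queries assume an emp table with the columns of emp_test1 above (empno, ename, sal, deptno); the offset form of LIMIT is only available in newer Hive releases, so treat that last query as version-dependent.

```sql
-- DISTRIBUTE BY sends rows with the same deptno to the same reducer;
-- SORT BY then orders rows within each reducer (not globally)
SELECT empno, ename, deptno, sal
FROM emp
DISTRIBUTE BY deptno
SORT BY sal DESC;

-- CLUSTER BY deptno is shorthand for DISTRIBUTE BY deptno SORT BY deptno
SELECT empno, ename, deptno
FROM emp
CLUSTER BY deptno;

-- ORDER BY gives a single global ordering; LIMIT 2, 3 skips 2 rows and returns the next 3
SELECT empno, ename
FROM emp
ORDER BY empno
LIMIT 2, 3;
```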

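WHERE

Filters rows
Use BETWEEN to query a range
Use IN to check whether a value is in a given set
Use IS NULL to find fields that are null
Queries on partitioned tables should always include a partition filter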
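The points above map to queries like the following sketch. It assumes an emp table with the columns of emp_test1 (sal, job, comm) and reuses order_partition from earlier; the literal job values are hypothetical.

```sql
-- Range filter with BETWEEN (inclusive on both ends)
SELECT * FROM emp WHERE sal BETWEEN 1000 AND 3000;

-- Set membership with IN ('CLERK' and 'MANAGER' are assumed sample values)
SELECT * FROM emp WHERE job IN ('CLERK', 'MANAGER');

-- Rows whose comm field is null
SELECT * FROM emp WHERE comm IS NULL;

-- Partitioned table: filter on the partition column so only that partition is read
SELECT * FROM order_partition WHERE event_month = '2014-05';
```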

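ALL and DISTINCT

The ALL and DISTINCT options specify whether duplicate rows are returned. The default is ALL; with DISTINCT, duplicate rows are removed from the result. Note that Hive supports SELECT DISTINCT * starting with Hive 1.1.0. ALL and DISTINCT can also be used in a UNION clause; see Union Syntax for more information.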

hive> SELECT col1, col2 FROM t1
    1 3
    1 3
    1 4
    2 5
hive> SELECT DISTINCT col1, col2 FROM t1
    1 3
    1 4
    2 5
hive> SELECT DISTINCT col1 FROM t1
    1
    2

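Partition-Based Queries

If a SELECT does not specify a partition, it scans the whole table. With a partition condition, it scans only the data in the matching partitions. For tables with a large number of partitions, the difference is substantial. Without a partition filter, the query may consume all available resources and delay business workloads.

-- Restrict partitions in a plain query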
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'

-- Restrict partitions in a join
SELECT page_views.*
FROM page_views JOIN dim_users
  ON (page_views.user_id = dim_users.id AND page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31')

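HAVING

Differences between WHERE and HAVING:

Where they can be used: WHERE can be used in select, update, delete, and insert into values(select * from table where …) statements. HAVING can only be used in select statements.
Execution order: the WHERE condition is applied before rows are grouped; the HAVING condition is applied after grouping. When both are used together, WHERE runs first and HAVING runs afterwards.
Clause contents: any condition allowed in WHERE can also follow HAVING, but some expressions allowed in HAVING cannot follow WHERE. HAVING can use aggregate functions (sum, count, avg, max, and min); WHERE cannot.
Summary:
1. The WHERE clause filters the rows produced by the operations in the FROM clause.
2. The GROUP BY clause groups the output of the WHERE clause.
3. The HAVING clause filters rows from the grouped result.
The following two statements are equivalent:

SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10;
SELECT col1 FROM (SELECT col1, SUM(col2) AS col2sum FROM t1 GROUP BY col1) t2 WHERE t2.col2sum > 10;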
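To make the execution order concrete, the sketch below uses WHERE and HAVING in the same query. It assumes an emp table with the columns of emp_test1 (job, sal, deptno); the threshold and the job value are arbitrary.

```sql
SELECT deptno, AVG(sal) AS avg_sal
FROM emp
WHERE job <> 'MANAGER'      -- 1. row-level filter, applied before grouping
GROUP BY deptno             -- 2. group the remaining rows by department
HAVING AVG(sal) > 2000;     -- 3. group-level filter on the aggregate
```

Moving the AVG(sal) condition into WHERE would fail, because the aggregate does not exist until after grouping.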

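Aggregate Functions

Aggregation means many rows in, one row out. Any column in the SELECT list that does not appear in GROUP BY must appear inside an aggregate function.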
max
min
avg
count
sum
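A minimal sketch of these functions with GROUP BY, again assuming an emp table with the columns of emp_test1: deptno is the only non-aggregated column in the SELECT list, so it is the only one that has to appear in GROUP BY.

```sql
SELECT deptno,
       COUNT(*) AS emp_count,   -- number of rows in the group
       MAX(sal) AS max_sal,
       MIN(sal) AS min_sal,
       AVG(sal) AS avg_sal,
       SUM(sal) AS total_sal
FROM emp
GROUP BY deptno;
```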

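Hive SQL: when does a query run MapReduce, and when does it not?

One parameter to know: hive.fetch.task.conversion, whose default value is more.
Take the following statements as examples: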

SQL1 : select * from live_info;
SQL2 : select * from live_info where traffic in ("1", "3");
SQL3 : select count(1) from live_info;
SQL4 : select user, count(1) as user_count from live_info group by user;
SQL5 : select user, count(1) as user_count from live_info group by user order by user desc;

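fetch

Parameter: hive.fetch.task.conversion

| Value | Official description | Notes |
| --- | --- | --- |
| none | Disable hive.fetch.task.conversion | Every SELECT runs MR; only statements such as desc do not |
| minimal | SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only | Only SQL1 skips MR |
| more | SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns) | Only SQL1 and SQL2 skip MR; a filter on a column's value, a LIMIT, or a partition condition does not run MR |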
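One way to observe the effect is to change the parameter for the current session and rerun the sample statements. This is only a sketch; it reuses the live_info table from the examples above, and the comments describe the behavior the table above predicts rather than captured output.

```sql
-- With none, even a plain SELECT * is expected to launch a job
SET hive.fetch.task.conversion=none;
SELECT * FROM live_info;

-- With more, simple filters are expected to be served by a fetch task instead
SET hive.fetch.task.conversion=more;
SELECT * FROM live_info WHERE traffic IN ('1', '3');

-- SET with no value prints the current setting
SET hive.fetch.task.conversion;
```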