Hive DDL&DML&DQL - 台部落

Hive DDL operations

Creating a database

Syntax

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
  [COMMENT database_comment]
  [LOCATION hdfs_path]
  [WITH DBPROPERTIES (property_name=property_value, ...)];

By default, a new database is stored on HDFS under /user/hive/warehouse/<database_name>.db.

To avoid an error when the database already exists, add IF NOT EXISTS (the standard form).

hive (default)> create database if not exists db_hive1;

You can also choose where a database is stored on HDFS by giving the LOCATION clause.

hive (default)> create database db_hive2 location '/tmp';

Database operations

Listing databases

hive (default)> show databases;

OK
database_name
db_hive1
default
Time taken: 0.015 seconds, Fetched: 2 row(s)

Filtering the database list

hive (default)> show databases like 'db_*';

OK
database_name
db_hive1
Time taken: 0.021 seconds, Fetched: 1 row(s)

Switching databases

hive (default)> use db_hive1;
OK
Time taken: 0.015 seconds

Altering a database

ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);   -- (Note: SCHEMA added in Hive 0.14.0)

ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;   -- (Note: Hive 0.13.0 and later; SCHEMA added in Hive 0.14.0)

ALTER (DATABASE|SCHEMA) database_name SET LOCATION hdfs_path; -- (Note: Hive 2.2.1, 2.4.0 and later)

Dropping a database

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];

If the database to drop might not exist, add IF EXISTS to avoid an error.

hive (db_hive1)> drop database if exists db_test;
OK
Time taken: 0.2 seconds

A database that still contains tables cannot be dropped with a plain DROP; add CASCADE to drop the database together with its tables.

hive (db_hive1)> drop database if exists db_test cascade;
OK
Time taken: 0.2 seconds

Table operations

Creating a table
Syntax

CREATE [EXTERNAL] TABLE [IF NOT EXISTS] table_name 
[(col_name data_type [COMMENT col_comment], ...)] 
[COMMENT table_comment] 
[PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)] 
[CLUSTERED BY (col_name, col_name, ...) 
[SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS] 
[ROW FORMAT row_format] 
[STORED AS file_format] 
[LOCATION hdfs_path]
[TBLPROPERTIES (property_name=property_value, ...)]
[AS select_statement]

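Parameter notes

EXTERNAL: creates an external table. Dropping an external table removes only its metadata and leaves the data in place; dropping a managed (internal) table removes both the metadata and the data.
COMMENT: adds a comment to the table or to a column.
PARTITIONED BY: creates a partitioned table.
CLUSTERED BY: creates a bucketed table.
ROW FORMAT: specifies the delimiter between columns. For example, to split on tabs: ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'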
```
  DELIMITED [FIELDS TERMINATED BY char]
```
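STORED AS: specifies the storage file format. Commonly used formats:

SEQUENCEFILE (binary sequence file), TEXTFILE (plain text), and RCFILE (columnar storage format). Use STORED AS TEXTFILE for plain-text data, and STORED AS SEQUENCEFILE when the data needs to be compressed.

LOCATION: the table's storage path on HDFS.
AS: followed by a query; creates the table from the query result.
LIKE: creates a table with the same structure as an existing table, without copying any data.

Create a regular table with '\t' as the column delimiter and an explicit storage path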

CREATE TABLE emp_test1(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double,
deptno int
) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/bigdata/';

Listing tables

hive (db_hive1)> show tables;
OK
tab_name
emp_test1
Time taken: 0.082 seconds, Fetched: 1 row(s)

Listing the tables of a specific database

hive (default)> show tables in db_hive1;

Listing tables that match a pattern

hive (db_hive1)> show tables like 'emp*';

Viewing a table's columns

hive (db_hive1)> desc emp_test1;
OK
col_name        data_type       comment
empno                   int                                         
ename                   string                                      
job                     string                                      
mgr                     int                                         
hiredate                string                                      
sal                     double                                      
comm                    double                                      
deptno                  int                                         
Time taken: 0.132 seconds, Fetched: 8 row(s)

Viewing detailed table information

hive (db_hive1)> desc formatted emp_test1;
OK
col_name        data_type       comment
# col_name              data_type               comment             
                 
empno                   int                                         
ename                   string                                      
job                     string                                      
mgr                     int                                         
hiredate                string                                      
sal                     double                                      
comm                    double                                      
deptno                  int                                         
                 
# Detailed Table Information             
Database:               db_hive1                 
OwnerType:              USER                     
Owner:                  hadoop                   
CreateTime:             Wed Dec 18 16:36:37 CST 2019     
LastAccessTime:         UNKNOWN                  
Protect Mode:           None                     
Retention:              0                        
Location:               hdfs://JD:9000/bigdata   
Table Type:             MANAGED_TABLE            
Table Parameters:                
        COLUMN_STATS_ACCURATE   true                
        numFiles                0                   
        numRows                 0                   
        rawDataSize             0                   
        totalSize               0                   
        transient_lastDdlTime   1576658516          
                 
# Storage Information            
SerDe Library:          org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       
InputFormat:            org.apache.hadoop.mapred.TextInputFormat         
OutputFormat:           org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat       
Compressed:             No                       
Num Buckets:            -1                       
Bucket Columns:         []                       
Sort Columns:           []                       
Storage Desc Params:             
        field.delim             \t                  
        serialization.format    \t                  
Time taken: 0.085 seconds, Fetched: 40 row(s)

Loading data

With LOCAL, the file is read from the local file system; without LOCAL, it is read from HDFS.

hive (db_hive1)> load data local inpath '/home/hadoop/data/emp.txt' into table emp_test1;
Loading data to table db_hive1.emp_test1
Table db_hive1.emp_test1 stats: [numFiles=0, numRows=0, totalSize=0, rawDataSize=0]
OK
Time taken: 0.427 seconds

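Create a table from a query result (both the table structure and the data are created)

hive (db_hive1)> create table emp_test2 as select * from emp_test1;
... # runs a MapReduce job

Create a table with LIKE (copies only the table structure)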

hive (db_hive1)> create table emp_test3 like emp_test1;
OK
Time taken: 0.082 seconds

Altering a table

Renaming a table (the new name must not already be in use)
hive (db_hive1)> alter table 
               > emp_test2 rename to emp_test3;

Dropping a table

hive (db_hive1)> drop table emp_test;

Truncating a table (removes the data but keeps the table definition)

hive (db_hive1)> truncate table emp_test1;

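External and managed tables

External tables
Because the table is external, Hive does not assume that it fully owns the data. Dropping the table does not delete the data; only the metadata describing the table is removed.
When to use external tables
Website data is collected every day and feeds a large amount of statistical analysis, so the source (raw) data is stored in external tables, which makes it easy to share. The intermediate and result tables produced during analysis use managed tables, since that data does not need to be shared.

Table definition
Creating an external table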

  CREATE EXTERNAL TABLE emp_test4(
  empno int,
  ename string,
  job string,
  mgr int,
  hiredate string,
  sal double,
  comm double,
  deptno int
  ) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION '/bigdata/';

Dropping the table
Dropping an external table removes only the metadata; the data is not deleted.
```
  hive (db_hive1)> drop table emp_test4;
```
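
The difference between dropping a managed table and dropping an external table can be mimicked locally. The following is a minimal Python sketch, not Hive code: a dict stands in for the metastore, temporary directories stand in for HDFS locations, and `drop_table` and the table names are hypothetical.

```python
# Toy model of DROP TABLE: metadata is always removed, but the data
# directory is deleted only for managed tables. Nothing here calls Hive.
import os
import shutil
import tempfile

def drop_table(metastore, name):
    """Remove the metadata entry; delete the data only for managed tables."""
    info = metastore.pop(name)
    if info["type"] == "MANAGED_TABLE":
        shutil.rmtree(info["location"])

root = tempfile.mkdtemp()  # stands in for the HDFS warehouse
metastore = {}
for name, kind in [("emp_managed", "MANAGED_TABLE"), ("emp_external", "EXTERNAL_TABLE")]:
    location = os.path.join(root, name)
    os.makedirs(location)
    metastore[name] = {"type": kind, "location": location}

drop_table(metastore, "emp_managed")
drop_table(metastore, "emp_external")

managed_left = os.path.exists(os.path.join(root, "emp_managed"))
external_left = os.path.exists(os.path.join(root, "emp_external"))
print("managed data still present:", managed_left)
print("external data still present:", external_left)
shutil.rmtree(root)  # clean up the temporary directory
```

After both drops the metadata dict is empty, yet only the external table's directory is still on disk.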

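Converting between external and managed tables

('EXTERNAL'='TRUE') and ('EXTERNAL'='FALSE') are fixed forms and must be written in upper case; otherwise the conversion will not work.

Check the table type

  hive (db_hive1)> desc formatted emp_test3;
  Table Type:             MANAGED_TABLE # managed table

Change emp_test3 to an external table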

  hive (db_hive1)> alter table emp_test3 set tblproperties('EXTERNAL'='TRUE');
  OK
  Time taken: 0.082 seconds

Check the table type after the change

  hive (db_hive1)> desc formatted emp_test3;
  Table Type:             EXTERNAL_TABLE

Change emp_test3 back to a managed table

  hive (db_hive1)> alter table emp_test3 set tblproperties('EXTERNAL'='FALSE');

Check the table type after the change

  hive (db_hive1)> desc formatted emp_test3;
  Table Type:            MANAGED_TABLE

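Partitioned tables

Concept

A partition corresponds to a separate directory on HDFS that holds all the data files of that partition. In other words, partitioning in Hive means splitting by directory: a large data set is divided into smaller ones according to business needs. A query selects the partitions it needs through expressions in the WHERE clause, which makes it considerably more efficient.

Advantages

Partitioning saves disk I/O and network I/O: the query reads only the selected partitions from disk, and the MapReduce job moves less data across the network.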
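
The directory-per-partition layout can be modelled in a few lines of Python. This is only a sketch and does not call Hive: `partition_dir` and `prune` are made-up helper names, and the table path assumes the default warehouse location described earlier.

```python
# Toy model of Hive's partition layout: one directory per partition value.
# partition_dir() and prune() are hypothetical helpers, not Hive APIs.

def partition_dir(table_dir, spec):
    """Build the HDFS-style directory for a partition spec (ordered pairs)."""
    parts = [f"{col}={val}" for col, val in spec]
    return "/".join([table_dir.rstrip("/")] + parts)

def prune(partitions, column, wanted):
    """Keep only the partitions whose value for `column` is in `wanted`."""
    return [p for p in partitions if dict(p).get(column) in wanted]

# Assumed default location of the table used in this section.
table_dir = "/user/hive/warehouse/db_hive1.db/order_partition"
partitions = [
    [("event_month", "2014-05")],
    [("event_month", "2014-06")],
    [("event_month", "2014-07")],
]

# WHERE event_month = '2014-05' only needs to read one directory.
selected = prune(partitions, "event_month", {"2014-05"})
dirs = [partition_dir(table_dir, p) for p in selected]
print(dirs)
```

Of the three partition directories only one is read, which is where the I/O saving comes from.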

Table definition

create table order_partition(
order_no string,
event_time string
)
PARTITIONED BY (event_month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

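Loading data into a partitioned table
You must specify the partition when loading. If data is written straight to HDFS rather than through Hive, Hive queries will not return it, because the metastore has no record of that partition.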

hive (db_hive1)> load data local inpath '/home/hadoop/data/order_created07.txt' into table order_partition partition(event_month='2014-05');

Adding a partition

hive (db_hive1)> alter table order_partition add if not exists partition(event_month='2014-07'); 
OK
Time taken: 0.095 seconds

Querying a partition

hive (db_hive1)> select * from order_partition where event_month='2014-05';
OK
order_partition.order_no        order_partition.event_time      order_partition.event_month
10703007267488  2014-05-01 06:01:12.334+01      2014-05
10101043505096  2014-05-01 07:28:12.342+01      2014-05
10103043509747  2014-05-01 07:50:12.33+01       2014-05
10103043501575  2014-05-01 09:27:12.33+01       2014-05
10104043514061  2014-05-01 09:03:12.324+01      2014-05
Time taken: 0.07 seconds, Fetched: 5 row(s)

Querying several partitions at once

hive (db_hive1)> select * from order_partition where event_month='2014-07' or event_month='2014-05';

Listing a table's partitions

hive (db_hive1)> show partitions order_partition;
OK
partition
event_month=2014-05
event_month=2014-07
Time taken: 0.153 seconds, Fetched: 2 row(s)

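Multi-level partitions
Single-level partition: one level of directories
Three-level partition: three levels of directories

Create a two-level partitioned table
create table order_mulit_partition(
order_no string,
event_time string
)
partitioned by (event_month string, step string)
row format delimited fields terminated by '\t';

Load data into the two-level partitioned table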

hive (db_hive1)> load data local inpath '/home/hadoop/data/order_created.txt' into table order_mulit_partition partition(event_month='2014-05',step='1'); 

Loading data to table db_hive1.order_mulit_partition partition (event_month=2014-05, step=1)
Partition db_hive1.order_mulit_partition{event_month=2014-05, step=1} stats: [numFiles=1, numRows=0, totalSize=213, rawDataSize=0]
OK
Time taken: 0.969 seconds

Query the two-level partitioned table

hive (db_hive1)> select * from order_mulit_partition where event_month='2014-05' and step='1';

Load the emp data into a partitioned emp table, using deptno as the partition column

Table definition
CREATE TABLE emp_partition(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double
) 
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

Load the data
INSERT OVERWRITE TABLE emp_partition PARTITION (deptno=10) select empno,ename,job,mgr,hiredate,sal,comm   from emp where deptno=10;

Dynamic partitions

Create the table
CREATE TABLE emp_dynamic_partition(
empno int,
ename string,
job string,
mgr int,
hiredate string,
sal double,
comm double
) 
PARTITIONED BY (deptno int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

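Load the data dynamically. Dynamic partition columns are matched by position, so deptno must come last in the SELECT list (here it also has the same name as the partition column). Because every partition column in this statement is dynamic, switch to nonstrict mode first if your Hive runs in strict mode.
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE emp_dynamic_partition PARTITION (deptno) select empno,ename,job,mgr,hiredate,sal,comm,deptno from emp;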
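
How a dynamic-partition insert routes rows can be sketched in Python. This is a toy model rather than Hive code: `dynamic_partition_insert` is a hypothetical helper, the rows are invented sample data, and only three columns are kept for brevity.

```python
# Toy model of a dynamic-partition insert: the trailing SELECT column(s)
# decide which partition each row lands in. Not Hive code.
from collections import defaultdict

def dynamic_partition_insert(rows, num_partition_cols=1):
    """Group rows by their trailing partition values (matched by position)."""
    partitions = defaultdict(list)
    for row in rows:
        data, key = row[:-num_partition_cols], row[-num_partition_cols:]
        partitions[key].append(data)
    return dict(partitions)

# (empno, ename, deptno): deptno comes last, as in the INSERT above.
# These rows are invented for illustration.
rows = [
    (7369, "SMITH", 20),
    (7499, "ALLEN", 30),
    (7782, "CLARK", 10),
    (7788, "SCOTT", 20),
]

result = dynamic_partition_insert(rows)
for key, data in sorted(result.items()):
    print(f"deptno={key[0]}: {data}")
```

Each distinct deptno value becomes its own partition, and the partition column is stripped from the stored rows, just as the partition value lives in the directory name rather than in the data files.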

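Three ways to link data uploaded directly into a partition directory with the partitioned table

Method 1: upload the data, then repair (avoid MSCK in production: on a table with a large amount of data it can overwhelm the server)

1. Create the directory and upload the data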

[hadoop@JD ~]$ hadoop fs -mkdir -p /user/hive/warehouse/db_hive1.db/order_mulit_partition/event_month=2014-06/step=2

19/12/19 15:15:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@JD ~]$ hadoop fs -put '/home/hadoop/data/order_created.txt' '/user/hive/warehouse/db_hive1.db/order_mulit_partition/event_month=2014-06/step=2'   

19/12/19 15:16:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2. Query the data (the uploaded rows are not returned)

hive (db_hive1)> select * from order_mulit_partition where event_month='2014-06' and step='2';

OK
order_mulit_partition.order_no  order_mulit_partition.event_time        order_mulit_partition.event_month       order_mulit_partition.step
Time taken: 0.058 seconds

3. Run the repair command

hive (db_hive1)> msck repair table order_mulit_partition;

OK
Partitions not in metastore:    order_mulit_partition:event_month=2014-06/step=2
Repair: Added partition to metastore order_mulit_partition:event_month=2014-06/step=2
Time taken: 0.147 seconds, Fetched: 2 row(s)

4. Query again

hive (db_hive1)> select * from order_mulit_partition where event_month='2014-06' and step='2';

OK
order_mulit_partition.order_no  order_mulit_partition.event_time        order_mulit_partition.event_month       order_mulit_partition.step
10703007267488  2014-05-01 06:01:12.334+01      2014-06 2
10101043505096  2014-05-01 07:28:12.342+01      2014-06 2
10103043509747  2014-05-01 07:50:12.33+01       2014-06 2
10103043501575  2014-05-01 09:27:12.33+01       2014-06 2
10104043514061  2014-05-01 09:03:12.324+01      2014-06 2
Time taken: 0.056 seconds, Fetched: 5 row(s)
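
What the repair step reconciles can be pictured as a set difference. The sketch below is a simplified Python model, not how Hive implements MSCK: plain sets stand in for the HDFS directory listing and the metastore, and `msck_repair` is a hypothetical helper.

```python
# Toy model of what MSCK REPAIR TABLE reconciles: partition directories
# found on the file system versus partitions recorded in the metastore.
# Plain Python sets stand in for both; nothing here talks to Hive or HDFS.

def msck_repair(fs_partitions, metastore_partitions):
    """Return the updated metastore set and the partitions that were added."""
    missing = sorted(set(fs_partitions) - set(metastore_partitions))
    return set(metastore_partitions) | set(missing), missing

fs_partitions = {
    "event_month=2014-05/step=1",  # loaded through Hive
    "event_month=2014-06/step=2",  # uploaded with hadoop fs -put
}
metastore = {"event_month=2014-05/step=1"}

metastore, added = msck_repair(fs_partitions, metastore)
for p in added:
    print("Repair: Added partition to metastore", p)
```

The model also hints at the cost: the repair has to list every partition directory of the table, which is why it gets expensive on tables with many partitions.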

Method 2: upload the data, then add the partition (recommended)
1. Upload the data

[hadoop@JD ~]$ hadoop fs -mkdir -p /user/hive/warehouse/db_hive1.db/order_mulit_partition/event_month=2014-07/step=3

19/12/19 15:23:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
[hadoop@JD ~]$ 
[hadoop@JD ~]$ hadoop fs -put '/home/hadoop/data/order_created.txt' '/user/hive/warehouse/db_hive1.db/order_mulit_partition/event_month=2014-07/step=3'

19/12/19 15:23:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2. Add the partition (registers the partition in the metastore)

hive (db_hive1)> alter table order_mulit_partition  add partition(event_month='2014-07',step='3');
OK
Time taken: 0.083 seconds

3. Query

hive (db_hive1)> select * from order_mulit_partition where event_month='2014-07' and step='3';

OK
order_mulit_partition.order_no  order_mulit_partition.event_time        order_mulit_partition.event_month       order_mulit_partition.step
10703007267488  2014-05-01 06:01:12.334+01      2014-07 3
10101043505096  2014-05-01 07:28:12.342+01      2014-07 3
10103043509747  2014-05-01 07:50:12.33+01       2014-07 3
10103043501575  2014-05-01 09:27:12.33+01       2014-07 3
10104043514061  2014-05-01 09:03:12.324+01      2014-07 3
Time taken: 0.053 seconds, Fetched: 5 row(s)

Method 3: create the directory, then LOAD data into the partition
1. Create the directory

[hadoop@JD ~]$ hadoop fs -mkdir -p /user/hive/warehouse/db_hive1.db/order_mulit_partition/event_month=2014-08/step=4

19/12/19 15:23:43 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2. Load the data

hive (db_hive1)> load data local inpath '/home/hadoop/data/order_created.txt' into table
 order_mulit_partition partition(event_month='2014-08',step='4');

3. Query the data

hive (db_hive1)> select * from order_mulit_partition where event_month='2014-08' and step='4';

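DML operations

LOAD DATA: loads data into a table
LOCAL: read from the local file system; without it, the path is on HDFS
INPATH: the path of the data to load
OVERWRITE: replaces the existing data; without it, the data is appended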

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

Loading data into a table

Load a local file

hive (db_hive1)> load data local inpath '/home/hadoop/data/emp.txt' into table emp_test2;

Load a file from HDFS

hive (db_hive1)> load data inpath '/hive/hadoop/emp.txt' into table emp_test2;

Overwrite the data already in the table

hive (db_hive1)> load data local inpath '/home/hadoop/data/emp.txt' overwrite into table emp_test2;

Inserting data with INSERT

Basic insert

1. Create a partitioned table

create table order_partition1(
order_no string,
event_time string
)
PARTITIONED BY (event_month string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

2. Insert data

hive (db_hive1)> insert into table order_partition1 partition(event_month='2014-05') values('1','1111'),('2','2222');

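Inserting from a query result

insert into: appends the rows to the table or partition; existing data is kept
insert overwrite: replaces the data already in the table or partition
Note: INSERT as used here cannot target only a subset of the columns (a column list is accepted from Hive 1.2.0 on)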

hive (db_hive1)> insert overwrite table order_partition1 partition(event_month='2014-06') select order_no,event_time from order_partition1 where event_month='2014-05';
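
The append-versus-replace behaviour can be illustrated with a small Python model. It is deliberately simplistic and does not run Hive: a dict maps each partition value to its list of rows, and `insert_into` and `insert_overwrite` are hypothetical helpers.

```python
# Toy model: a table is a dict from partition value to a list of rows.
# insert_into appends; insert_overwrite replaces the partition's contents.

def insert_into(table, partition, rows):
    """Append rows to the partition, creating it if necessary."""
    table.setdefault(partition, []).extend(rows)

def insert_overwrite(table, partition, rows):
    """Replace whatever the partition currently holds."""
    table[partition] = list(rows)

table = {}
insert_into(table, "2014-05", [("1", "1111"), ("2", "2222")])
insert_into(table, "2014-05", [("3", "3333")])        # appended to existing rows
insert_overwrite(table, "2014-06", table["2014-05"])  # copy partition 05 into 06
insert_overwrite(table, "2014-06", [("9", "9999")])   # previous rows are replaced
print(table)
```

Partition 2014-05 ends up with three rows because both inserts appended, while partition 2014-06 holds only the rows from the last overwrite.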

Multi-insert (multiple tables or partitions) from a single scan of the source

hive (db_hive1)> from order_partition1
              insert overwrite table order_partition1 partition(event_month='2014-07')
              select order_no, event_time where event_month='2014-05'
              insert overwrite table order_partition1 partition(event_month='2014-08')
              select order_no, event_time where event_month='2014-05';

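CTAS (create table ... as select ...)

Creates a table from a query result and loads the data in one step (common in day-to-day work). The table must not exist beforehand.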

create table if not exists student3 as select id, name from student;

insert: the table must already exist

Insert a query result into a table
insert overwrite table emp4 select * from emp;

There is also a less intuitive FROM-first form
from emp insert into table emp4 select *;

Pointing a table at its data with LOCATION at creation time

1. Upload the data to HDFS first

[hadoop@JD ~]$ hadoop fs -mkdir /emp
[hadoop@JD ~]$ hadoop fs -put /home/hadoop/data/emp.txt /emp

2. Create the table and specify the location

hive (default)> create external table if not exists emp(
              id int,
              name string
              )
              row format delimited fields terminated by '\t'
              location '/emp';

3. Query

hive (default)> select * from emp;

Exporting data

Export to a local directory with a chosen delimiter
INSERT OVERWRITE LOCAL DIRECTORY '/hivetmp' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' SELECT empno,ename FROM emp;

Export to an HDFS directory with a chosen delimiter
INSERT OVERWRITE DIRECTORY '/hivetmp' ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' SELECT empno,ename FROM emp;

Or fetch the table's files directly with hadoop fs -get

Or run the query from the shell with hive -e and redirect the output to a local file
hive -e 'select empno,ename from emp' > 1.txt

To filter the result, pipe it through grep
hive -e 'select empno,ename from emp' | grep xxx > 1.txt

Hive DQL

Syntax:

SELECT [ALL | DISTINCT] select_expr, select_expr, ...
  FROM table_reference
  [WHERE where_condition]
  [GROUP BY col_list]
  [ORDER BY col_list]
  [CLUSTER BY col_list
    | [DISTRIBUTE BY col_list] [SORT BY col_list]
  ]
 [LIMIT [offset,] rows]

WHERE

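Filters rows
Use BETWEEN for range conditions
Use IN to test whether a value is in a given list
Use IS NULL to find empty (NULL) fields
Queries on a partitioned table should always include a partition filter

ALL and DISTINCT

The ALL and DISTINCT options specify whether duplicate rows are returned. The default is ALL; with DISTINCT, duplicate rows are removed from the result. Note that Hive supports SELECT DISTINCT * starting with release 1.1.0. ALL and DISTINCT can also be used in a UNION clause; see Union Syntax for more information.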

hive> SELECT col1, col2 FROM t1
    1 3
    1 3
    1 4
    2 5
hive> SELECT DISTINCT col1, col2 FROM t1
    1 3
    1 4
    2 5
hive> SELECT DISTINCT col1 FROM t1
    1
    2
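
The same example can be replayed locally with Python's built-in sqlite3 module. SQLite is only a stand-in for Hive here, on the assumption that ALL and DISTINCT behave the same way for this simple case; the ORDER BY clauses are added purely to make the output order deterministic.

```python
# Reproduce the t1 example with Python's built-in sqlite3 module.
# SQLite stands in for Hive; this does not exercise Hive itself.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (col1 INTEGER, col2 INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?, ?)", [(1, 3), (1, 3), (1, 4), (2, 5)])

# ALL (the default) keeps duplicates; DISTINCT removes them.
all_rows = conn.execute(
    "SELECT ALL col1, col2 FROM t1 ORDER BY col1, col2").fetchall()
distinct_pairs = conn.execute(
    "SELECT DISTINCT col1, col2 FROM t1 ORDER BY col1, col2").fetchall()
distinct_col1 = conn.execute(
    "SELECT DISTINCT col1 FROM t1 ORDER BY col1").fetchall()

print(all_rows)
print(distinct_pairs)
print(distinct_col1)
conn.close()
```

Note that DISTINCT applies to the whole select list: (1, 3) and (1, 4) are different rows even though they share col1.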

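Partition-based queries

A SELECT that does not specify a partition scans the whole table. With a partition condition, only the matching partitions are scanned. For a table with a large number of partitions the difference is substantial: without a partition filter the query may consume all available resources and delay other workloads.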

-- Partition filter in a plain query
SELECT page_views.*
FROM page_views
WHERE page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31'

-- Partition filter in a join
SELECT page_views.*
FROM page_views JOIN dim_users
  ON (page_views.user_id = dim_users.id AND page_views.date >= '2008-03-01' AND page_views.date <= '2008-03-31')

HAVING

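Differences between WHERE and HAVING:

Where they can be used: WHERE can appear in SELECT, UPDATE, DELETE, and INSERT INTO VALUES (SELECT * FROM table WHERE …) statements; HAVING can appear only in SELECT statements.
Order of evaluation: the WHERE condition is applied before grouping, and the HAVING condition is applied after grouping. When both are present, WHERE runs first and HAVING afterwards.
Allowed expressions: any condition that is valid in WHERE can also be used in HAVING, but some HAVING expressions are not allowed in WHERE. HAVING may use aggregate functions (sum, count, avg, max, and min); WHERE may not.
Summary:
1. The WHERE clause filters the rows produced by the operations in the FROM clause.
2. The GROUP BY clause groups the output of the WHERE clause.
3. The HAVING clause filters rows from the grouped result.
The following two queries are equivalent:

SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10;
SELECT col1 FROM (SELECT col1, SUM(col2) AS col2sum FROM t1 GROUP BY col1) t2 WHERE t2.col2sum > 10;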
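
The equivalence of the two queries can be checked locally with Python's sqlite3 module. SQLite again stands in for Hive, the sample rows are invented, and an ORDER BY is appended to both queries only so that the results compare deterministically.

```python
# Check that HAVING on an aggregate matches filtering a grouped subquery.
# SQLite stands in for Hive; the rows below are made up for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (col1 TEXT, col2 INTEGER)")
# Group sums: 'a' -> 12, 'b' -> 7, 'c' -> 11.
conn.executemany("INSERT INTO t1 VALUES (?, ?)",
                 [("a", 5), ("a", 7), ("b", 7), ("c", 11)])

with_having = conn.execute(
    "SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10 ORDER BY col1"
).fetchall()
with_subquery = conn.execute(
    "SELECT col1 FROM (SELECT col1, SUM(col2) AS col2sum FROM t1 GROUP BY col1) t2 "
    "WHERE t2.col2sum > 10 ORDER BY col1"
).fetchall()

print(with_having)
print(with_subquery)
conn.close()
```

Both forms return the groups whose sum exceeds 10. The subquery version makes the evaluation order explicit: group first, then filter on the aggregated value.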

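Aggregate functions

Aggregation takes many rows in and produces one row out. Any column in the SELECT list that does not appear in GROUP BY must appear inside an aggregate function.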
max
min
avg
count
sum

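Hive SQL: when does a query run MapReduce, and when does it not?

One parameter to know: hive.fetch.task.conversion, whose default value is more.
Take the following statements as examples: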

SQL1 : select * from live_info;
SQL2 : select * from live_info where traffic in ("1", "3");
SQL3 : select count(1) from live_info;
SQL4 : select user, count(1) as user_count from live_info group by user;
SQL5 : select user, count(1) as user_count from live_info group by user order by user desc;

fetch

Parameter: hive.fetch.task.conversion

Value	Official description	Notes
none	Disable hive.fetch.task.conversion	All SELECT statements run MR; only commands such as DESC do not
minimal	SELECT *, FILTER on partition columns (WHERE and HAVING clauses), LIMIT only	Only SQL1 avoids MR
more	SELECT, FILTER, LIMIT only (including TABLESAMPLE, virtual columns)	Only SQL1 and SQL2 avoid MR; filters on a column value, LIMIT, and partition conditions do not run MR
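
The table can be turned into a small decision function to check the five sample statements against each setting. This is a toy Python model of the notes above, not Hive's optimizer: `uses_fetch_task` and its parameters are hypothetical, the traffic column is assumed not to be a partition column, and real Hive checks further conditions that are ignored here.

```python
# Toy decision table for hive.fetch.task.conversion, following the notes above.
# uses_fetch_task() is a hypothetical helper; Hive's real optimizer checks
# more conditions than this simplified model.

def uses_fetch_task(mode, select_star=True, filter_on="none",
                    aggregates=False, group_or_order_by=False):
    """filter_on is 'none', 'partition' (partition columns only) or 'any'."""
    if mode == "none" or aggregates or group_or_order_by:
        return False
    if mode == "minimal":
        return select_star and filter_on in ("none", "partition")
    if mode == "more":
        return True
    raise ValueError(f"unknown mode: {mode}")

queries = {
    "SQL1": dict(),                                    # select * from live_info
    "SQL2": dict(filter_on="any"),                     # where traffic in (...)
    "SQL3": dict(select_star=False, aggregates=True),  # count(1)
    "SQL4": dict(select_star=False, aggregates=True, group_or_order_by=True),
    "SQL5": dict(select_star=False, aggregates=True, group_or_order_by=True),
}

for mode in ("none", "minimal", "more"):
    skipped = [name for name, q in queries.items() if uses_fetch_task(mode, **q)]
    print(mode, "-> no MapReduce for:", skipped)
```

Under this model, none sends every statement through MapReduce, minimal spares only SQL1, and more spares SQL1 and SQL2, matching the Notes column.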