Installing and Configuring Hive 3.1.2 on Hadoop 3.1.2 in Pseudo-Distributed Mode

Hive is a data warehouse tool built on top of Hadoop for data extraction, transformation, and loading; it provides a mechanism for storing, querying, and analyzing large-scale data kept in Hadoop. Hive maps structured data files onto database tables and offers SQL-style querying, translating SQL statements into MapReduce jobs for execution. Its main advantage is a low learning curve: SQL-like statements give you quick MapReduce-based statistics, making MapReduce much simpler to use without writing dedicated MapReduce programs.

Below we walk through installing and using Hive on top of an existing Hadoop environment.

Contents

(1) Preparing the software environment

(2) Downloading and installing MySQL

(3) Hive installation and deployment

(4) Testing Hive


(1) Preparing the software environment

Hadoop runtime environment: a machine on which Hadoop is already running. See my previous post, a detailed walkthrough of installing and configuring Hadoop 3.1.2 in standalone, pseudo-distributed, and fully distributed modes: https://blog.csdn.net/caojianhua2018/article/details/99174958

MySQL installation packages: download them directly from the official MySQL website, or fetch the yum repository package with wget and then install with yum.

Hive installation package: it can be downloaded from http://mirror.bit.edu.cn/apache/hive/
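
For example, it can be fetched with wget; the exact path below assumes the mirror's usual release layout, so adjust it to whatever you see when browsing the mirror:

[hadoop@master ~]$ wget http://mirror.bit.edu.cn/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz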

(2) Downloading and installing MySQL

1. Download the MySQL yum repository package:

[root@master ~]# wget http://repo.mysql.com/mysql57-community-release-el7-10.noarch.rpm

If wget is not installed, install it first with yum install -y wget.

2. Install the MySQL yum repository:

[root@master ~]# rpm -Uvh mysql57-community-release-el7-10.noarch.rpm

3. Install the MySQL server:

[root@master ~]# yum install -y mysql-community-server

4. Start MySQL:

[root@master ~]# systemctl start mysqld.service

To have it start automatically at boot, run:

[root@master ~]# systemctl enable mysqld
[root@master ~]# systemctl daemon-reload

5. Check the MySQL service status:

[root@master ~]# systemctl status mysqld.service

6. Change the password of the MySQL root account:

The MySQL configuration file lives at /etc/my.cnf by default and can be edited there. The log file is /var/log/mysqld.log, a plain text file you can view directly with more.

First retrieve the randomly generated temporary password (it is recorded in the log file mentioned above):

[root@master ~]# grep "password" /var/log/mysqld.log
2020-02-05T12:08:30.659856Z 1 [Note] A temporary password is generated for root@localhost: JQ;PG0%UrrF0

The string at the end of that line is the randomly generated password; we will replace it with one of our own that is easier to remember.

Now change the password. Log in to MySQL using the temporary password:

[root@master ~]# mysql -uroot -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.7.29

Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> 

Once logged in, run the following at the mysql> prompt to change the root password:

mysql> ALTER USER 'root'@'localhost' IDENTIFIED BY 'new password';

Here 'new password' is a placeholder for the new password string. MySQL rejects passwords that are too simple, so include upper and lower case letters, digits, and special characters. For example, we change it to Root-123. Remember that every statement in the MySQL command line must end with a semicolon.

mysql> alter user 'root'@'localhost' identified by 'Root-123';
Query OK, 0 rows affected (0.19 sec)

The password is now changed; you can exit and log in again to verify.
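
If MySQL rejects your new password, you can inspect (and, if necessary, relax) the password policy. A quick check, assuming the validate_password plugin that ships with MySQL 5.7 is active:

mysql> SHOW VARIABLES LIKE 'validate_password%';
mysql> SET GLOBAL validate_password_policy=LOW;
mysql> SET GLOBAL validate_password_length=8;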

To allow remote connections to this MySQL instance, grant the privileges as follows:

mysql> grant all privileges on *.* to 'root'@'%' identified by 'Root-123' with grant option;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.06 sec)

To prepare for the Hive tests later on, create a database named mydb in MySQL now; it will be referenced directly in the hive-site.xml configuration below.

mysql> create database mydb;

(3) Hive installation and deployment

1. Unpack the installation archive and rename the extracted folder to hive-3.1.2:

[hadoop@master ~]$ tar -zxvf apache-hive-3.1.2-bin.tar.gz
[hadoop@master ~]$ mv apache-hive-3.1.2-bin hive-3.1.2

2. Download the MySQL JDBC driver and copy it into Hive's lib directory:

[hadoop@master ~]$ cp mysql-connector-java-8.0.16.jar hive-3.1.2/lib

3. Configure the Hive environment variables. To set them for all users, log in as root and edit /etc/profile; to set them only for the current user, edit ~/.bash_profile in that user's home directory. Either way, run source afterwards so the settings take effect.

#setting for hive
export HIVE_HOME=/home/hadoop/hive-3.1.2
export PATH=$PATH:$HIVE_HOME/bin

Apply the settings with source:

[root@master ~]# source /etc/profile

Hive can now be started; by default it uses the bundled Derby database:

[hadoop@master ~]$ hive -version
which: no hbase in (/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/hadoop/jdk1.8.0_11/bin:/home/hadoop/hadoop-3.1.2/bin:/home/hadoop/.local/bin:/home/hadoop/bin:/home/hadoop/jdk1.8.0_11/bin:/home/hadoop/hadoop-3.1.2/bin:/home/hadoop/hive-3.1.2/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hive-3.1.2/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-3.1.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = e25762a6-6e9c-4361-b12b-4de0fe10818b

Logging initialized using configuration in jar:file:/home/hadoop/hive-3.1.2/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> 

4. Edit the Hive configuration files. Go to the conf folder under the Hive installation directory, which holds the configuration templates. Copy and rename the templates that need changes, namely hive-env.sh and hive-site.xml:

[hadoop@master hive-3.1.2]$ cd conf
[hadoop@master conf]$ ll
total 332
-rw-r--r--. 1 hadoop hadoop   1596 Aug 23 05:44 beeline-log4j2.properties.template
-rw-r--r--. 1 hadoop hadoop 300482 Aug 23 06:01 hive-default.xml.template
-rw-r--r--. 1 hadoop hadoop   2365 Aug 23 05:44 hive-env.sh.template
-rw-r--r--. 1 hadoop hadoop   2274 Aug 23 05:45 hive-exec-log4j2.properties.template
-rw-r--r--. 1 hadoop hadoop   3086 Aug 23 05:44 hive-log4j2.properties.template
-rw-r--r--. 1 hadoop hadoop   2060 Aug 23 05:44 ivysettings.xml
-rw-r--r--. 1 hadoop hadoop   3558 Aug 23 05:44 llap-cli-log4j2.properties.template
-rw-r--r--. 1 hadoop hadoop   7163 Aug 23 05:44 llap-daemon-log4j2.properties.template
-rw-r--r--. 1 hadoop hadoop   2662 Aug 23 05:44 parquet-logging.properties
[hadoop@master conf]$ cp hive-env.sh.template hive-env.sh
[hadoop@master conf]$ cp hive-default.xml.template hive-site.xml

Then edit hive-env.sh and add the following:

# Set HADOOP_HOME to point to a specific hadoop install directory 
export HADOOP_HOME=/home/hadoop/hadoop-3.1.2

# Hive Configuration Directory can be controlled by: 
export HIVE_CONF_DIR=/home/hadoop/hive-3.1.2/conf
export JAVA_HOME=/home/hadoop/jdk1.8.0_11
# Folder containing extra libraries required for hive compilation/execution can be controlled by:
export HIVE_AUX_JARS_PATH=/home/hadoop/hive-3.1.2/lib

Hive reads hive-env.sh automatically at startup, but you can also source it in the current shell:

[hadoop@master conf]$ source hive-env.sh

Next, edit hive-site.xml. The file is very long, and it references the HDFS paths where Hive stores its data, so create those directories first and then copy the paths into the xml file:

[hadoop@master hive-3.1.2]$ hdfs dfs -mkdir -p /user/hive/warehouse
[hadoop@master hive-3.1.2]$ hdfs dfs -mkdir -p /tmp
[hadoop@master hive-3.1.2]$ hdfs dfs -chmod g+w /tmp
[hadoop@master hive-3.1.2]$ hdfs dfs -chmod g+w /user
[hadoop@master hive-3.1.2]$ hdfs dfs -chmod g+w /user/hive
[hadoop@master hive-3.1.2]$ hdfs dfs -chmod g+w /user/hive/warehouse
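
Before moving on, you can quickly verify the directories and their group-write permission:

[hadoop@master hive-3.1.2]$ hdfs dfs -ls /user/hive
[hadoop@master hive-3.1.2]$ hdfs dfs -ls /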

Once the directories exist, edit hive-site.xml as follows (the comments are explanations only and can be removed):

<configuration>
<!-- Use JDBC to connect to MySQL -->
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
</property>
<!-- JDBC URL of the MySQL database that stores the metastore (the mydb database created above) -->
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://master:3306/mydb?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore.</description>
</property>
<!-- MySQL user name; root is used here -->
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>username to use against metastore database</description>
 </property>
<!-- MySQL password -->
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>Root-123</value>
    <description>password to use against metastore database</description>
</property>
<!-- HDFS path of the Hive warehouse, created above -->
<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
</property>
</configuration>

5. Initialize the metastore database:

[hadoop@master hive-3.1.2]$ schematool -dbType mysql -initSchema

Once initialization completes, Hive is ready to use.
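
To confirm that the schema was actually written to MySQL, you can look inside the mydb database; the metastore tables (DBS, TBLS, VERSION, and so on) should now be present:

mysql> use mydb;
mysql> show tables;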

6. Start Hive and verify it works.

Type hive at the command line; when the hive> prompt appears, Hive has started successfully.

[hadoop@master hive-3.1.2]$ hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hive-3.1.2/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-3.1.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
which: no hbase in (/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/hadoop/jdk1.8.0_11/bin:/home/hadoop/hadoop-3.1.2/bin:/home/hadoop/.local/bin:/home/hadoop/bin:/home/hadoop/jdk1.8.0_11/bin:/home/hadoop/hadoop-3.1.2/bin:/home/hadoop/hive-3.1.2/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hive-3.1.2/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-3.1.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 467045bd-1840-424f-a66b-5d4d7e3ebf25

Logging initialized using configuration in jar:file:/home/hadoop/hive-3.1.2/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Hive Session ID = ae782d20-647b-4565-a694-e612b5f7539c
hive> 

7. Start HiveServer2: in the bin directory, run ./hiveserver2 to start the Hive service.

HiveServer2's web UI listens on port 10002 by default; open ip:10002 in a browser to see the monitoring page.
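
Once HiveServer2 is up, JDBC clients can connect through beeline. A minimal sketch, assuming the default Thrift port 10000 and that the hadoop user may connect (impersonation may additionally require hadoop.proxyuser.* settings in Hadoop's core-site.xml):

[hadoop@master ~]$ nohup hiveserver2 > hiveserver2.log 2>&1 &
[hadoop@master ~]$ beeline -u jdbc:hive2://localhost:10000 -n hadoop
0: jdbc:hive2://localhost:10000> show databases;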

(4) Testing Hive

1. First create a database named myhive with the create database command:

hive> create database myhive;
OK
Time taken: 1.753 seconds

The syntax is similar to SQL, so if you already know SQL these statements will feel familiar. For example, show databases lists the existing databases:

hive> show databases;
OK
default
myhive
Time taken: 0.732 seconds, Fetched: 2 row(s)

2. Switch to the myhive database:

hive> use myhive;
OK
Time taken: 0.372 seconds

3. Create a table in the myhive database, again with create table. Note that Hive reserved words cannot be used as identifiers; for example, user cannot be used as a table name.

hive> create table user(stuid int,name string,age int) row format delimited fields terminated by ',' stored as textfile;
NoViableAltException(334@[212:1: tableName : (db= identifier DOT tab= identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME $tab) );])
        at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)

The statement above fails precisely because user was used as the table name; renaming the table to stuinfo fixes it:

hive> create table stuinfo(stuid int,name string,age int) row format delimited fields terminated by ',' stored as textfile;
OK
Time taken: 2.673 seconds

Use desc tablename to inspect the table structure:

hive> desc stuinfo;
OK
stuid                   int                                         
name                    string                                      
age                     int                                         
Time taken: 5.245 seconds, Fetched: 3 row(s)

There are several other ways to create tables, for example partitioned and bucketed tables (a partition-loading example follows these statements):

hive> create table t1(fdate string,name string,age int) row format delimited fields terminated by ',';
OK
Time taken: 3.817 seconds
hive> create table t2(fdate string,name string,age int) row format delimited fields terminated by ',';
OK
Time taken: 0.37 seconds
hive> create table t3(fdate string,name string,age int) partitioned by (dt string)  row format delimited fields terminated by ',';
OK
Time taken: 0.103 seconds
hive> create table t4(fdate string,name string,age int) clustered by (fdate) into 3 buckets;
OK
Time taken: 0.146 seconds
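
Loading into a partitioned table such as t3 requires a partition spec. An illustrative example (the file /home/hadoop/t3_part.txt is hypothetical; its columns should match fdate, name, age):

hive> load data local inpath '/home/hadoop/t3_part.txt' into table t3 partition (dt='2020-02-07');
hive> show partitions t3;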

4. Inserting records. Older Hive versions did not support inserting rows one at a time with insert, nor update; newer versions do, but since Hive is built for large-scale data processing it is best suited to bulk imports. Data is loaded into an existing table with load, and once imported it is not modified in place. The insert command is mainly used to write query results out of Hive, to HDFS or to the local filesystem, with the exported data determined by the select statement you write.
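
For example, query results can be exported with insert overwrite directory. A sketch, using /tmp/stuinfo_export purely as an example target path:

hive> insert overwrite local directory '/tmp/stuinfo_export'
    > row format delimited fields terminated by ','
    > select * from stuinfo;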

So here we use load to import records from a txt file. For a quick test, create a txt file in the local home directory with the following content (the fields are comma separated to match the ',' delimiter declared for stuinfo):

1,caojianhua,40
2,topher,7
3,sophie,13
4,baby,2

Save it as test.txt, then load it into Hive:

hive> load data local inpath '/home/hadoop/test.txt' into table stuinfo;
Loading data to table myhive.stuinfo
OK
Time taken: 3.894 seconds

local inpath here refers to a local file. To import from HDFS instead, drop local and adjust the path after inpath.

First create a text file testh.txt in the hadoop user's home directory, with content similar to test.txt above, and upload it to HDFS with hdfs dfs -put:

[hadoop@master ~]$ hdfs dfs -put testh.txt /data
[hadoop@master ~]$ hdfs dfs -ls /data
Found 2 items
-rw-r--r--   1 hadoop supergroup         41 2020-02-05 17:24 /data/a.txt
-rw-r--r--   1 hadoop supergroup         45 2020-02-06 10:22 /data/testh.txt

Then run the import in Hive:

hive> load data inpath "/data/testh.txt" into table stuinfo;
Loading data to table myhive.stuinfo
OK
Time taken: 1.8 seconds

Note that load does not check the data format; format problems only show up later when the data is queried. Also, when loading from HDFS (without local), Hive moves the file into the table's warehouse directory rather than copying it. Continuing, we load another HDFS file into table t1 and query it with select, exactly as in SQL:

hive> load data inpath '/data/t1.txt' into table t1;
Loading data to table myhive.t1
OK
Time taken: 2.324 seconds
hive> select * from t1;
OK
2020-02-07      caojianhua      40
2020-02-07      caozhifeng      7
2020-02-08      caoyiling       13
2020-02-09      caolina 40
Time taken: 2.786 seconds, Fetched: 4 row(s)

Counting the rows with select count(*) from t1 launches a MapReduce job, which is fairly memory hungry (check with free -m). On a virtual machine with only 1 GB of RAM the job stalled, so I killed it; with about 2 GB it finishes, although slowly: counting the rows took over 200 seconds, almost 4 minutes. (A note on speeding up small queries follows the job output below.)

hive> select count(*) from t1;
Query ID = hadoop_20200208231125_d893f7bc-eaa6-410b-bdba-3b9f7cc193fb
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1581167611828_0002, Tracking URL = http://master:8088/proxy/application_1581167611828_0002/
Kill Command = /home/hadoop/hadoop-3.1.2/bin/mapred job  -kill job_1581167611828_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-02-08 23:12:38,383 Stage-1 map = 0%,  reduce = 0%
2020-02-08 23:13:56,078 Stage-1 map = 0%,  reduce = 0%
2020-02-08 23:14:14,089 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.47 sec
2020-02-08 23:14:48,117 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.38 sec
MapReduce Total cumulative CPU time: 4 seconds 380 msec
Ended Job = job_1581167611828_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.38 sec   HDFS Read: 12298 HDFS Write: 101 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 380 msec
OK
4
Time taken: 215.677 seconds, Fetched: 1 row(s)
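
For tiny data sets like this one, you can let Hive run qualifying queries as a local job instead of submitting them to YARN, which avoids most of the MapReduce startup overhead. A hedged suggestion using a standard Hive setting:

hive> set hive.exec.mode.local.auto=true;
hive> select count(*) from t1;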

5. Querying. Use select statements. Having loaded records from both the local filesystem and HDFS, the table now holds 8 rows:

hive> select * from stuinfo;
OK
1       caojianhua      40
2       topher  7
3       sophie  13
4       baby    2
1       caojia  40
2       zhifeng 7
3       yiling  13
4       lina    2
Time taken: 2.79 seconds, Fetched: 8 row(s)
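
Other SQL-style filters and aggregations work the same way. A couple of illustrative queries against the stuinfo table (the aggregation will again launch a MapReduce job):

hive> select name, age from stuinfo where age > 10;
hive> select age, count(*) from stuinfo group by age;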

 
