Installing and Configuring Hive 3.1.2 on Hadoop 3.1.2 in Pseudo-Distributed Mode

Hive is a data warehouse tool built on top of Hadoop for data extraction, transformation, and loading; it provides a mechanism for storing, querying, and analyzing large-scale data kept in Hadoop. Hive maps structured data files onto database tables and offers SQL-style querying, translating SQL statements into MapReduce jobs for execution. Its main advantage is a low learning curve: SQL-like statements give you quick MapReduce-based statistics, making MapReduce much simpler to use without writing dedicated MapReduce programs.

Below we walk through installing and using Hive on top of an existing Hadoop environment.

Contents

(1) Preparing the software environment

(2) Downloading and installing MySQL

(3) Hive installation and deployment

(4) Testing Hive


(1) Preparing the software environment

Hadoop runtime environment: a machine on which Hadoop is already running. See my previous post, a detailed walkthrough of installing and configuring Hadoop 3.1.2 in standalone, pseudo-distributed, and fully distributed modes: https://blog.csdn.net/caojianhua2018/article/details/99174958

MySQL installation packages: download them directly from the official MySQL website, or fetch the yum repository package with wget and then install with yum.

Hive installation package: it can be downloaded from http://mirror.bit.edu.cn/apache/hive/
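
For example, it can be fetched with wget; the exact path below assumes the mirror's usual release layout, so adjust it to whatever you see when browsing the mirror:

[hadoop@master ~]$ wget http://mirror.bit.edu.cn/apache/hive/hive-3.1.2/apache-hive-3.1.2-bin.tar.gz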

(2) Downloading and installing MySQL

1. Download the MySQL yum repository package:

[root@master ~]# wget http://repo.mysql.com/mysql57-community-release-el7-10.noarch.rpm

If wget is not installed, install it first with yum install -y wget.

2. Install the MySQL yum repository:

[root@master ~]# rpm -Uvh mysql57-community-release-el7-10.noarch.rpm

3. Install the MySQL server:

[root@master ~]# yum install -y mysql-community-server

4. Start MySQL:

[root@master ~]# systemctl start mysqld.service

To have it start automatically at boot, run:

[root@master ~]# systemctl enable mysqld
[root@master ~]# systemctl daemon-reload

5. Check the MySQL service status:

[root@master ~]# systemctl status mysqld.service

6. Change the password of the MySQL root account:

The MySQL configuration file lives at /etc/my.cnf by default and can be edited there. The log file is /var/log/mysqld.log, a plain text file you can view directly with more.

First retrieve the randomly generated temporary password (it is recorded in the log file mentioned above):

[root@master ~]# grep "password" /var/log/mysqld.log
2020-02-05T12:08:30.659856Z 1 [Note] A temporary password is generated for root@localhost: JQ;PG0%UrrF0

The string at the end of that line is the randomly generated password; we will replace it with one of our own that is easier to remember.

Now change the password. Log in to MySQL using the temporary password:

[root@master ~]# mysql -uroot -p
Enter password: 
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.7.29

Copyright (c) 2000, 2020, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> 

Once logged in, run the following at the mysql> prompt to change the root password:

mysql> ALTER USER 'root'@'localhost' IDENTIFIED BY 'new password';

Here 'new password' is a placeholder for the new password string. MySQL rejects passwords that are too simple, so include upper and lower case letters, digits, and special characters. For example, we change it to Root-123. Remember that every statement in the MySQL command line must end with a semicolon.

mysql> alter user 'root'@'localhost' identified by 'Root-123';
Query OK, 0 rows affected (0.19 sec)

The password is now changed; you can exit and log in again to verify.
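
If MySQL rejects your new password, you can inspect (and, if necessary, relax) the password policy. A quick check, assuming the validate_password plugin that ships with MySQL 5.7 is active:

mysql> SHOW VARIABLES LIKE 'validate_password%';
mysql> SET GLOBAL validate_password_policy=LOW;
mysql> SET GLOBAL validate_password_length=8;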

To allow remote connections to this MySQL instance, grant the privileges as follows:

mysql> grant all privileges on *.* to 'root'@'%' identified by 'Root-123' with grant option;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> flush privileges;
Query OK, 0 rows affected (0.06 sec)

To prepare for the Hive tests later on, create a database named mydb in MySQL now; it will be referenced directly in the hive-site.xml configuration below.

mysql> create database mydb;

(3) Hive installation and deployment

1. Unpack the installation archive and rename the extracted folder to hive-3.1.2:

[hadoop@master ~]$ tar -zxvf apache-hive-3.1.2-bin.tar.gz
[hadoop@master ~]$ mv apache-hive-3.1.2-bin hive-3.1.2

2. Download the MySQL JDBC driver and copy it into Hive's lib directory:

[hadoop@master ~]$ cp mysql-connector-java-8.0.16.jar hive-3.1.2/lib

3. Configure the Hive environment variables. To set them for all users, log in as root and edit /etc/profile; to set them only for the current user, edit ~/.bash_profile in that user's home directory. Either way, run source afterwards so the settings take effect.

#setting for hive
export HIVE_HOME=/home/hadoop/hive-3.1.2
export PATH=$PATH:$HIVE_HOME/bin

Apply the settings with source:

[root@master ~]# source /etc/profile

Hive can now be started; by default it uses the bundled Derby database:

[hadoop@master ~]$ hive -version
which: no hbase in (/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/hadoop/jdk1.8.0_11/bin:/home/hadoop/hadoop-3.1.2/bin:/home/hadoop/.local/bin:/home/hadoop/bin:/home/hadoop/jdk1.8.0_11/bin:/home/hadoop/hadoop-3.1.2/bin:/home/hadoop/hive-3.1.2/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hive-3.1.2/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-3.1.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = e25762a6-6e9c-4361-b12b-4de0fe10818b

Logging initialized using configuration in jar:file:/home/hadoop/hive-3.1.2/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
hive> 

4. Edit the Hive configuration files. Go to the conf folder under the Hive installation directory, which holds the configuration templates. Copy and rename the templates that need changes, namely hive-env.sh and hive-site.xml:

[hadoop@master hive-3.1.2]$ cd conf
[hadoop@master conf]$ ll
total 332
-rw-r--r--. 1 hadoop hadoop   1596 Aug 23 05:44 beeline-log4j2.properties.template
-rw-r--r--. 1 hadoop hadoop 300482 Aug 23 06:01 hive-default.xml.template
-rw-r--r--. 1 hadoop hadoop   2365 Aug 23 05:44 hive-env.sh.template
-rw-r--r--. 1 hadoop hadoop   2274 Aug 23 05:45 hive-exec-log4j2.properties.template
-rw-r--r--. 1 hadoop hadoop   3086 Aug 23 05:44 hive-log4j2.properties.template
-rw-r--r--. 1 hadoop hadoop   2060 Aug 23 05:44 ivysettings.xml
-rw-r--r--. 1 hadoop hadoop   3558 Aug 23 05:44 llap-cli-log4j2.properties.template
-rw-r--r--. 1 hadoop hadoop   7163 Aug 23 05:44 llap-daemon-log4j2.properties.template
-rw-r--r--. 1 hadoop hadoop   2662 Aug 23 05:44 parquet-logging.properties
[hadoop@master conf]$ cp hive-env.sh.template hive-env.sh
[hadoop@master conf]$ cp hive-default.xml.template hive-site.xml

Then edit hive-env.sh and add the following:

# Set HADOOP_HOME to point to a specific hadoop install directory 
export HADOOP_HOME=/home/hadoop/hadoop-3.1.2

# Hive Configuration Directory can be controlled by: 
export HIVE_CONF_DIR=/home/hadoop/hive-3.1.2/conf
export JAVA_HOME=/home/hadoop/jdk1.8.0_11
# Folder containing extra libraries required for hive compilation/execution can be controlled by:
export HIVE_AUX_JARS_PATH=/home/hadoop/hive-3.1.2/lib

Hive reads hive-env.sh automatically at startup, but you can also source it in the current shell:

[hadoop@master conf]$ source hive-env.sh

Next, edit hive-site.xml. The file is very long, and it references the HDFS paths where Hive stores its data, so create those directories first and then copy the paths into the xml file:

[hadoop@master hive-3.1.2]$ hdfs dfs -mkdir -p /user/hive/warehouse
[hadoop@master hive-3.1.2]$ hdfs dfs -mkdir -p /tmp
[hadoop@master hive-3.1.2]$ hdfs dfs -chmod g+w /tmp
[hadoop@master hive-3.1.2]$ hdfs dfs -chmod g+w /user
[hadoop@master hive-3.1.2]$ hdfs dfs -chmod g+w /user/hive
[hadoop@master hive-3.1.2]$ hdfs dfs -chmod g+w /user/hive/warehouse
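
Before moving on, you can quickly verify the directories and their group-write permission:

[hadoop@master hive-3.1.2]$ hdfs dfs -ls /user/hive
[hadoop@master hive-3.1.2]$ hdfs dfs -ls /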

Once the directories exist, edit hive-site.xml as follows (the comments are explanations only and can be removed):

<configuration>
<!-- Use JDBC to connect to MySQL -->
<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
</property>
<!-- JDBC URL of the MySQL database that stores the metastore (the mydb database created above) -->
<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://master:3306/mydb?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore.</description>
</property>
<!-- MySQL user name; root is used here -->
<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>username to use against metastore database</description>
 </property>
<!-- MySQL password -->
<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>Root-123</value>
    <description>password to use against metastore database</description>
</property>
<!-- HDFS path of the Hive warehouse, created above -->
<property>
    <name>hive.metastore.warehouse.dir</name>
    <value>/user/hive/warehouse</value>
    <description>location of default database for the warehouse</description>
</property>
</configuration>

5. Initialize the metastore database:

[hadoop@master hive-3.1.2]$ schematool -dbType mysql -initSchema

Once initialization completes, Hive is ready to use.
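
To confirm that the schema was actually written to MySQL, you can look inside the mydb database; the metastore tables (DBS, TBLS, VERSION, and so on) should now be present:

mysql> use mydb;
mysql> show tables;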

6. Start Hive and verify it works.

Type hive at the command line; when the hive> prompt appears, Hive has started successfully.

[hadoop@master hive-3.1.2]$ hive
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hive-3.1.2/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-3.1.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
which: no hbase in (/usr/local/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/home/hadoop/jdk1.8.0_11/bin:/home/hadoop/hadoop-3.1.2/bin:/home/hadoop/.local/bin:/home/hadoop/bin:/home/hadoop/jdk1.8.0_11/bin:/home/hadoop/hadoop-3.1.2/bin:/home/hadoop/hive-3.1.2/bin)
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hadoop/hive-3.1.2/lib/log4j-slf4j-impl-2.10.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hadoop/hadoop-3.1.2/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Hive Session ID = 467045bd-1840-424f-a66b-5d4d7e3ebf25

Logging initialized using configuration in jar:file:/home/hadoop/hive-3.1.2/lib/hive-common-3.1.2.jar!/hive-log4j2.properties Async: true
Loading class `com.mysql.jdbc.Driver'. This is deprecated. The new driver class is `com.mysql.cj.jdbc.Driver'. The driver is automatically registered via the SPI and manual loading of the driver class is generally unnecessary.
Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. spark, tez) or using Hive 1.X releases.
Hive Session ID = ae782d20-647b-4565-a694-e612b5f7539c
hive> 

7. Start HiveServer2: in the bin directory, run ./hiveserver2 to start the Hive service.

HiveServer2's web UI listens on port 10002 by default; open ip:10002 in a browser to see the monitoring page.
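
Once HiveServer2 is up, JDBC clients can connect through beeline. A minimal sketch, assuming the default Thrift port 10000 and that the hadoop user may connect (impersonation may additionally require hadoop.proxyuser.* settings in Hadoop's core-site.xml):

[hadoop@master ~]$ nohup hiveserver2 > hiveserver2.log 2>&1 &
[hadoop@master ~]$ beeline -u jdbc:hive2://localhost:10000 -n hadoop
0: jdbc:hive2://localhost:10000> show databases;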

(4) Testing Hive

1. First create a database named myhive with the create database command:

hive> create database myhive;
OK
Time taken: 1.753 seconds

The syntax is similar to SQL, so if you already know SQL these statements will feel familiar. For example, show databases lists the existing databases:

hive> show databases;
OK
default
myhive
Time taken: 0.732 seconds, Fetched: 2 row(s)

2. Switch to the myhive database:

hive> use myhive;
OK
Time taken: 0.372 seconds

3. Create a table in the myhive database, again with create table. Note that Hive reserved words cannot be used as identifiers; for example, user cannot be used as a table name.

hive> create table user(stuid int,name string,age int) row format delimited fields terminated by ',' stored as textfile;
NoViableAltException(334@[212:1: tableName : (db= identifier DOT tab= identifier -> ^( TOK_TABNAME $db $tab) |tab= identifier -> ^( TOK_TABNAME $tab) );])
        at org.antlr.runtime.DFA.noViableAlt(DFA.java:158)

The statement above fails precisely because user was used as the table name; renaming the table to stuinfo fixes it:

hive> create table stuinfo(stuid int,name string,age int) row format delimited fields terminated by ',' stored as textfile;
OK
Time taken: 2.673 seconds

Use desc tablename to inspect the table structure:

hive> desc stuinfo;
OK
stuid                   int                                         
name                    string                                      
age                     int                                         
Time taken: 5.245 seconds, Fetched: 3 row(s)

There are several other ways to create tables, for example partitioned and bucketed tables (a partition-loading example follows these statements):

hive> create table t1(fdate string,name string,age int) row format delimited fields terminated by ',';
OK
Time taken: 3.817 seconds
hive> create table t2(fdate string,name string,age int) row format delimited fields terminated by ',';
OK
Time taken: 0.37 seconds
hive> create table t3(fdate string,name string,age int) partitioned by (dt string)  row format delimited fields terminated by ',';
OK
Time taken: 0.103 seconds
hive> create table t4(fdate string,name string,age int) clustered by (fdate) into 3 buckets;
OK
Time taken: 0.146 seconds
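
Loading into a partitioned table such as t3 requires a partition spec. An illustrative example (the file /home/hadoop/t3_part.txt is hypothetical; its columns should match fdate, name, age):

hive> load data local inpath '/home/hadoop/t3_part.txt' into table t3 partition (dt='2020-02-07');
hive> show partitions t3;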

4. Inserting records. Older Hive versions did not support inserting rows one at a time with insert, nor update; newer versions do, but since Hive is built for large-scale data processing it is best suited to bulk imports. Data is loaded into an existing table with load, and once imported it is not modified in place. The insert command is mainly used to write query results out of Hive, to HDFS or to the local filesystem, with the exported data determined by the select statement you write.
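
For example, query results can be exported with insert overwrite directory. A sketch, using /tmp/stuinfo_export purely as an example target path:

hive> insert overwrite local directory '/tmp/stuinfo_export'
    > row format delimited fields terminated by ','
    > select * from stuinfo;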

So here we use load to import records from a txt file. For a quick test, create a txt file in the local home directory with the following content (the fields are comma separated to match the ',' delimiter declared for stuinfo):

1,caojianhua,40
2,topher,7
3,sophie,13
4,baby,2

Save it as test.txt, then load it into Hive:

hive> load data local inpath '/home/hadoop/test.txt' into table stuinfo;
Loading data to table myhive.stuinfo
OK
Time taken: 3.894 seconds

local inpath here refers to a local file. To import from HDFS instead, drop local and adjust the path after inpath.

First create a text file testh.txt in the hadoop user's home directory, with content similar to test.txt above, and upload it to HDFS with hdfs dfs -put:

[hadoop@master ~]$ hdfs dfs -put testh.txt /data
[hadoop@master ~]$ hdfs dfs -ls /data
Found 2 items
-rw-r--r--   1 hadoop supergroup         41 2020-02-05 17:24 /data/a.txt
-rw-r--r--   1 hadoop supergroup         45 2020-02-06 10:22 /data/testh.txt

Then run the import in Hive:

hive> load data inpath "/data/testh.txt" into table stuinfo;
Loading data to table myhive.stuinfo
OK
Time taken: 1.8 seconds

Note that load does not check the data format; format problems only show up later when the data is queried. Also, when loading from HDFS (without local), Hive moves the file into the table's warehouse directory rather than copying it. Continuing, we load another HDFS file into table t1 and query it with select, exactly as in SQL:

hive> load data inpath '/data/t1.txt' into table t1;
Loading data to table myhive.t1
OK
Time taken: 2.324 seconds
hive> select * from t1;
OK
2020-02-07      caojianhua      40
2020-02-07      caozhifeng      7
2020-02-08      caoyiling       13
2020-02-09      caolina 40
Time taken: 2.786 seconds, Fetched: 4 row(s)

Counting the rows with select count(*) from t1 launches a MapReduce job, which is fairly memory hungry (check with free -m). On a virtual machine with only 1 GB of RAM the job stalled, so I killed it; with about 2 GB it finishes, although slowly: counting the rows took over 200 seconds, almost 4 minutes. (A note on speeding up small queries follows the job output below.)

hive> select count(*) from t1;
Query ID = hadoop_20200208231125_d893f7bc-eaa6-410b-bdba-3b9f7cc193fb
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1581167611828_0002, Tracking URL = http://master:8088/proxy/application_1581167611828_0002/
Kill Command = /home/hadoop/hadoop-3.1.2/bin/mapred job  -kill job_1581167611828_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2020-02-08 23:12:38,383 Stage-1 map = 0%,  reduce = 0%
2020-02-08 23:13:56,078 Stage-1 map = 0%,  reduce = 0%
2020-02-08 23:14:14,089 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.47 sec
2020-02-08 23:14:48,117 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.38 sec
MapReduce Total cumulative CPU time: 4 seconds 380 msec
Ended Job = job_1581167611828_0002
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 1  Reduce: 1   Cumulative CPU: 4.38 sec   HDFS Read: 12298 HDFS Write: 101 SUCCESS
Total MapReduce CPU Time Spent: 4 seconds 380 msec
OK
4
Time taken: 215.677 seconds, Fetched: 1 row(s)
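
For tiny data sets like this one, you can let Hive run qualifying queries as a local job instead of submitting them to YARN, which avoids most of the MapReduce startup overhead. A hedged suggestion using a standard Hive setting:

hive> set hive.exec.mode.local.auto=true;
hive> select count(*) from t1;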

5. Querying. Use select statements. Having loaded records from both the local filesystem and HDFS, the table now holds 8 rows:

hive> select * from stuinfo;
OK
1       caojianhua      40
2       topher  7
3       sophie  13
4       baby    2
1       caojia  40
2       zhifeng 7
3       yiling  13
4       lina    2
Time taken: 2.79 seconds, Fetched: 8 row(s)
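
Other SQL-style filters and aggregations work the same way. A couple of illustrative queries against the stuinfo table (the aggregation will again launch a MapReduce job):

hive> select name, age from stuinfo where age > 10;
hive> select age, count(*) from stuinfo group by age;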

 
