Hive

Hive 是建立在 Hadoop 上的數據倉庫基礎構架。它提供了一系列的工具，可以用來進行數據提取轉化加載（ETL ），這是一種可以存儲、查詢和分析存儲在 Hadoop 中的大規模數據的機制。Hive 定義了簡單的類 SQL 查詢語言，稱爲 QL ，它允許熟悉 SQL 的用戶查詢數據。同時，這個語言也允許熟悉 MapReduce 開發者的開發自定義的 mapper 和 reducer 來處理內建的 mapper 和 reducer 無法完成的複雜的分析工作。
Hive是SQL解析引擎，它將SQL語句轉譯成M/R Job然後在Hadoop執行。Hive的表其實就是HDFS的目錄/文件，按表名把文件夾分開。如果是分區表，則分區值是子文件夾，可以直接在M/R Job裏使用這些數據。

Hive 體系

用戶接口主要有三個：CLI，JDBC/ODBC和 WebUI

CLI，即Shell命令行
JDBC/ODBC 是 Hive 的Java，與使用傳統數據庫JDBC的方式類似
WebGUI是通過瀏覽器訪問 Hive

Hive 將元數據存儲在數據庫中(metastore)，目前只支持 mysql、derby。Hive 中的元數據包括表的名字，表的列和分區及其屬性，表的屬性（是否爲外部表等），表的數據所在目錄等
解釋器、編譯器、優化器完成 HQL 查詢語句從詞法分析、語法分析、編譯、優化以及查詢計劃（plan）的生成。生成的查詢計劃存儲在 HDFS 中，並在隨後有 MapReduce 調用執行
Hive 的數據存儲在 HDFS 中，大部分的查詢由 MapReduce 完成（包含 * 的查詢，比如 select * from table 不會生成 MapRedcue 任務）

Hive 執行原理

安裝模式

metastore是hive元數據的集中存放地

相關文檔資料： http://hive.apache.org/
官網下載： http://hive.apache.org/downloads.html

前置說明

安裝並啓動Hadoop 集羣：使用hadoop-2.9.0
hive只需要在 NameNode 節點安裝即可，可以不在datanode節點安裝
以下在本地獨立模式的基礎上安裝: 建議選擇遠程模式安裝mysql，步驟一樣，修改url爲其他節點名稱、賦權語句指定%接收任何主機連接

下載解壓配置環境變量

     https://mirrors.tuna.tsinghua.edu.cn/apache/hive/    
     選擇穩定版：apache-hive-2.3.4-bin.tar.gz
     
       主namenode節點hdp-01: mkdir -p /opt/hive
      通過FileZilla上傳tgz安裝包至hdp-01的hive目錄       
      解壓：cd /opt/hive && tar -zxvf apache-hive-2.3.4-bin.tar.gz
      
      vim ~/.bash_profile
     #末尾添加環境變量
	 export HIVE_HOME=/opt/hive/apache-hive-2.3.4-bin
	 export PATH=$HIVE_HOME/bin:$PATH
     #立即生效
     source ~/.bash_profile

Hive配置Hadoop HDFS

cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml

vim hive-site.xml

#修改 屬性：hive.metastore.warehouse.dir 爲 /data/hive/warehouse
#修改 屬性：hive.exec.scratchdir 爲 /data/hive/tmp

創建目錄配置權限

#創建warehouse目錄
hdfs dfs -mkdir -p /data/hive/warehouse
hdfs dfs -chmod 777 /data/hive/warehouse
#創建tmp目錄
hdfs dfs -mkdir -p /data/hive/tmp
hdfs dfs -chmod 777 /data/hive/tmp
#驗證
hdfs dfs -ls /data/hive/

修改 hive-site.xml中的臨時目錄

將文件中的所有 ${system:java.io.tmpdir}替換成/opt/hive/apache-hive-2.3.4-bin/tmp

sed -i 's/\${system:java.io.tmpdir}/\/opt\/hive\/apache-hive-2.3.4-bin\/tmp/g' hive-site.xml

將文件中所有的${system:user.name}替換爲hadoop

sed -i 's/\${system:user.name}/hadoop/g' hive-site.xml

mkdir -p /opt/hive/apache-hive-2.3.4-bin/tmp

安裝配置mysql

MySQL 安裝

sudo yum install -y mysql
當然也可以自己手動下載包安裝

上傳mysql驅動包並移動到$HIVE_HOME/lib

下載一個mysql-connector-java驅動包 mysql-connector-java-5.1.44.tar.gz，並上傳解壓

tar -zxvf mysql-connector-java-5.1.44.tar.gz 
cp mysql-connector-java-5.1.44-bin.jar $HIVE_HOME/lib

修改hive-site.xml數據庫相關配置

修改屬性值：
javax.jdo.option.ConnectionURL
javax.jdo.option.ConnectionDriverName
javax.jdo.option.ConnectionUserName
javax.jdo.option.ConnectionPassword
hive.metastore.schema.verification

<property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hdp-01:3306/hive?createDatabaseIfNotExist=true</value>
    <description>
      JDBC connect string for a JDBC metastore.
      To use SSL to encrypt/authenticate the connection, provide database-specific SSL flag in the connection URL.
      For example, jdbc:postgresql://myhost/db?ssl=true for postgres database.
    </description>
  </property>

<property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>


<property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>hive</value>
    <description>Username to use against metastore database</description>
  </property>

<property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>hive</value>
    <description>password to use against metastore database</description>
  </property>

<property>
    <name>hive.metastore.schema.verification</name>
    <value>false</value>
    <description>
      Enforce metastore schema version consistency.
      True: Verify that version information stored in is compatible with one from Hive jars.  Also disable automatic
            schema migration attempt. Users are required to manually migrate schema after Hive upgrade which ensures
            proper metastore schema migration. (Default)
      False: Warn if the version information stored in metastore doesn't match with one from in Hive jars.
    </description>
  </property>

複製hive-env.sh模板並配置

cp $HIVE_HOME/conf/hive-env.sh.template $HIVE_HOME/conf/hive-env.sh 
export HADOOP_HOME=/opt/hadoop/hadoop-2.9.0
export HIVE_CONF_DIR=/opt/hive/apache-hive-2.3.4-bin/conf
export HIVE_AUX_JARS_PATH=/opt/hive/apache-hive-2.3.4-bin/lib

啓動MySQL

sudo service mysqld start
sudo chkconfig mysqld on
sudo chkconfig --list | grep mysql

創建MySQL賬號並授權

mysql
create user hive@'hdp-01' identified by 'hive';
grant all privileges on *.* to 'hive'@'hdp-01' with grant option;
flush privileges;
此處使用本地獨立模式安裝的mysql，若使用遠程模式授權語句可改爲：grant all privileges on . to ‘hive’@‘hdp-01’ with grant option;

初始化hive 數據庫

schematool -initSchema -dbType mysql

啓動hive

nohup hive --service metastore &
nohup hive --service hiveserver2 &
hive

Hive的運行模式即任務的執行環境

分爲本地與集羣兩種
我們可以通過mapred.job.tracker 來指明
設置方式：
hive > SET mapred.job.tracker=local

Hive啓動方式

hive 命令行模式，直接輸入#/hive/bin/hive的執行程序，或者輸入 #hive --service cli
hive web界面的 (端口號9999) 啓動方式

    #hive --service hwi &
    用於通過瀏覽器來訪問hive
    http://hadoop0:9999/hwi/

hive 遠程服務 (端口號10000) 啓動方式
#hive --service hiveserver &

Hive跟傳統數據庫

PS：Hive 中沒有定義專門的數據格式，數據格式可以由用戶指定，用戶定義數據格式需要指定三個屬性：列分隔符（通常爲空格、”\t”、”\x001″）、行分隔符（”\n”）以及讀取文件數據的方法（Hive 中默認有三個文件格式 TextFile，SequenceFile 以及 RCFile）。由於在加載數據的過程中，不需要從用戶數據格式到 Hive 定義的數據格式的轉換，因此，Hive 在加載的過程中不會對數據本身進行任何修改，而只是將數據內容複製或者移動到相應的 HDFS 目錄中。而在數據庫中，不同的數據庫有不同的存儲引擎，定義了自己的數據格式。所有數據都會按照一定的組織存儲，因此，數據庫加載數據的過程會比較耗時。

Hive 數據類型

基本數據類型
tinyint/smallint/int/bigint
float/double
boolean
string
複雜數據類型
Array/Map/Struct
沒有date/datetime

Hive存儲格式

Hive的數據存儲基於Hadoop HDFS
Hive沒有專門的數據存儲格式
存儲結構主要包括：數據庫、文件、表、視圖
Hive默認可以直接加載文本文件（TextFile），還支持sequence file 、RC file
創建表時，指定Hive數據的列分隔符與行分隔符，Hive即可解析數據

DataBase

類似傳統數據庫的DataBase默認數據庫"default",使用#hive命令後，不使用hive>use <數據庫名>，系統默認的數據庫。可以顯式使用hive> use default;

表的格式

Table 內部表

Partition 分區表

Bucket Table 桶表

External Table 外部表

創建表格

CREATE TABLE page_view(viewTime INT, userid BIGINT,
page_url STRING, referrer_url STRING,
ip STRING COMMENT ‘IP Address of the User’)
COMMENT ‘This is the page view table’
PARTITIONED BY(dt STRING, country STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ‘\001’
STORED AS SEQUENCEFILE; TEXTFILE

指定數據格式

create table tab_ip_seq(id int,name string,ip string,country string)
row format delimited
fields terminated by ‘,’
stored as sequencefile;
insert overwrite table tab_ip_seq select * from tab_ext;

創建表格導入本地文件到數據庫

create table tab_ip(id int,name string,ip string,country string)
row format delimited
fields terminated by ‘,’
stored as textfile;
load data local inpath ‘/home/hadoop/ip.txt’ into table tab_ext;

創建external 表

一般創建內部表有HDFS統一管理數據的存儲路徑，但有時候業務數據在別的HDFS路徑中則需要指定數據源(也叫外部表)
CREATE EXTERNAL TABLE tab_ip_ext(id int, name string,
ip STRING,
country STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ‘,’
STORED AS TEXTFILE
LOCATION ‘/external/hive’;

###CTAS 用於創建一些臨時表存儲中間結果
CREATE TABLE tab_ip_ctas
AS
SELECT id new_id, name new_name, ip new_ip,country new_country
FROM tab_ip_ext
SORT BY new_id;

//insert from select 用於向臨時表中追加中間結果數據
create table tab_ip_like like tab_ip;
insert overwrite table tab_ip_like select * from tab_ip;

//分區
create table tab_ip_part(id int,name string,ip string,country string)
partitioned by (part_flag string)
row format delimited fields terminated by ‘,’;
// 導入數據指定分區
load data local inpath ‘/home/hadoop/ip.txt’ overwrite into table tab_ip_part
partition(part_flag=‘part1’);
// 導入數據指定分區
load data local inpath ‘/home/hadoop/ip_part2.txt’ overwrite into table tab_ip_part
partition(part_flag=‘part2’);

select * from tab_ip_part;

select * from tab_ip_part where part_flag=‘part2’;
select count(*) from tab_ip_part where part_flag=‘part2’;

alter table tab_ip change id id_alter string;
ALTER TABLE tab_cts ADD PARTITION (partCol = ‘dt’) location ‘/external/hive/dt’;

show partitions tab_ip_part;

//write to hdfs
insert overwrite local directory ‘/home/hadoop/hivetemp/test.txt’ select * from tab_ip_part where part_flag=‘part1’;
insert overwrite directory ‘/hiveout.txt’ select * from tab_ip_part where part_flag=‘part1’;

//array
create table tab_array(a array,b array)
row format delimited
fields terminated by ‘\t’
collection items terminated by ‘,’;

示例數據
tobenbrone,laihama,woshishui 13866987898,13287654321
abc,iloveyou,itcast 13866987898,13287654321

select a[0] from tab_array;
select * from tab_array where array_contains(b,‘word’);
insert into table tab_array select array(0),array(name,ip) from tab_ext t;

//map
create table tab_map(name string,info map<string,string>)
row format delimited
fields terminated by ‘\t’
collection items terminated by ‘;’
map keys terminated by ‘:’;

示例數據：
fengjie age:18;size:36A;addr:usa
furong age:28;size:39C;addr:beijing;weight:180KG

load data local inpath ‘/home/hadoop/hivetemp/tab_map.txt’ overwrite into table tab_map;
insert into table tab_map select name,map(‘name’,name,‘ip’,ip) from tab_ext;

//struct
create table tab_struct(name string,info structage:int,tel:string,addr:string)
row format delimited
fields terminated by ‘\t’
collection items terminated by ‘,’

load data local inpath ‘/home/hadoop/hivetemp/tab_st.txt’ overwrite into table tab_struct;
insert into table tab_struct select name,named_struct(‘age’,id,‘tel’,name,‘addr’,country) from tab_ext;

//CLUSTER 比分區更細分桶
create table tab_ip_cluster(id int,name string,ip string,country string)
clustered by(id) into 3 buckets;

load data local inpath ‘/home/hadoop/ip.txt’ overwrite into table tab_ip_cluster;
set hive.enforce.bucketing=true;
insert into table tab_ip_cluster select * from tab_ip;

select * from tab_ip_cluster tablesample(bucket 2 out of 3 on id);

//cli shell
hive -S -e ‘select country,count(*) from tab_ext’ > /home/hadoop/hivetemp/e.txt
有了這種執行機制，就使得我們可以利用腳本語言（bash shell,python）進行hql語句的批量執行

select * from tab_ext sort by id desc limit 5;

select a.ip,b.book from tab_ext a join tab_ip_book b on(a.name=b.name);

//UDF
select if(id=1,first,no-first),name from tab_ext;

hive>add jar /home/hadoop/myudf.jar;
hive>CREATE TEMPORARY FUNCTION my_lower AS ‘org.dht.Lower’;
select my_upper(name) from tab_ext;

## 自定義函數 UDF
![在這裏插入圖片描述](https://img-blog.csdnimg.cn/20191214163709675.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9zb3doYXQuYmxvZy5jc2RuLm5ldA==,size_16,color_FFFFFF,t_70)

顯示所有函數：
hive> show functions;
查看函數用法：
hive> describe function substr;

```java
package mybigdata;

import java.util.HashMap;

import org.apache.hadoop.hive.ql.exec.UDF;

//  導入hive Jar 包 然後 繼承UDF  自定義實現 evaluate 函數即可 可以自定義輸入跟輸出 數據的個數跟類型
public class PhoneNbrToArea extends UDF{

	private static HashMap<String, String> areaMap = new HashMap<>();
	static {
		areaMap.put("1388", "beijing");
		areaMap.put("1399", "tianjin");
		areaMap.put("1366", "nanjing");
	}
	
	//一定要用public修飾才能被hive調用 
	public String evaluate(String pnb) {
		
		String result  = areaMap.get(pnb.substring(0,4))==null? (pnb+"    huoxing"):(pnb+"  "+areaMap.get(pnb.substring(0,4)));
		
		return result;
	}
}

將文件打包到HDFS中或者本地
hive語句中導入該jar包就可用這個函數了

add jar hdfs://cluster/user/kg/hive_udf/kg_graphx-1.0-SNAPSHOT.jar;
create temporary function jsonutil as 'com.credithc.rc.kg.udf.JsonUdf';
create temporary function jsonlist as 'com.credithc.rc.kg.udf.JsonUdaf';
====
hive> add jar /home/hadoop/hiveareaudf/jar 
hive> create temporary function getarea as 'mybigdata.PhoneNbrToArea';
hive> select getarea(phoneNB),upflow,downflow from t_flow;

參考

Hadoop集羣安裝Hive

SoWhat1412

發佈了369 篇原創文章 · 獲贊 1728 · 訪問量 150萬+

他的留言板關注

【Hadoop】第六天 Hive