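Hive: From Installation to Practice

1. Introduction to Hive

From Baidu Baike:

Hive is a data warehouse tool built on Hadoop. It maps structured data files to database tables and provides simple SQL querying by translating SQL statements into MapReduce jobs. Its main advantage is the low learning cost: simple MapReduce statistics can be written quickly in SQL-like statements without developing a dedicated MapReduce application, which makes Hive well suited to statistical analysis in a data warehouse.

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a set of tools for extract-transform-load (ETL) work and a mechanism for storing, querying, and analyzing large-scale data kept in Hadoop. Hive defines a simple SQL-like query language called HQL, which lets users familiar with SQL query the data. The language also lets developers familiar with MapReduce plug in custom mappers and reducers for complex analysis that the built-in mappers and reducers cannot handle.

Hive has no dedicated data format of its own. It works well on top of Thrift, lets users control the delimiters, and also lets them specify the data format.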

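2. Installing Hive

2.1 Download and install

1. Download Hive from: http://mirror.bit.edu.cn/apache/hive/

2. Extract the archive and rename the directory:

tar -zxvf apache-hive-3.1.2-bin.tar.gz -C /usr/local/

mv /usr/local/apache-hive-3.1.2-bin /usr/local/hive

3. Add the environment variables to /etc/profile:

export HIVE_HOME=/usr/local/hive

export PATH=$PATH:$HIVE_HOME/bin

4. Run source /etc/profile to apply the changes, then verify the installation:

hive --version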

2.2 Configuring Hive

1. Create hive-site.xml from the template (in $HIVE_HOME/conf) and edit it:

cp hive-default.xml.template hive-site.xml


<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://hadoop01:3306/hive?createDatabaseIfNotExist=true</value>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>user</value>
    <description>Username to use against metastore database</description>
  </property>

  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>password</value>
    <description>Password to use against metastore database</description>
  </property>

  <property>
    <name>hive.cli.print.current.db</name>
    <value>true</value>
    <description>Show the current database in the CLI prompt</description>
  </property>

  <property>
    <name>hive.cli.print.header</name>
    <value>true</value>
    <description>Print column headers in query output</description>
  </property>

</configuration>

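2.3 Setting parameters

Precedence, from lowest to highest: configuration file < startup option (hive -hiveconf param=value) < in-session command (set param=value)

e.g. set mapreduce.job.reduces=3;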
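
The precedence order can be modeled as a layered lookup. The following is a minimal Python sketch rather than Hive's actual implementation; the dictionaries and their values are hypothetical, and only the ordering of the three layers comes from the text above.

```python
from collections import ChainMap

# Hypothetical values for illustration; only the precedence order comes from the text.
site_file = {"mapreduce.job.reduces": "-1", "hive.cli.print.header": "true"}  # config file
startup_opts = {"mapreduce.job.reduces": "2"}   # given when the CLI is started
session_sets = {"mapreduce.job.reduces": "3"}   # set ... inside the session

# ChainMap searches its maps left to right, so the highest-priority layer goes first.
effective = ChainMap(session_sets, startup_opts, site_file)

print(effective["mapreduce.job.reduces"])   # taken from the session-level set
print(effective["hive.cli.print.header"])   # falls through to the config file
```

A key that is set in several layers resolves to the most specific one, while a key set only in the configuration file is still visible in the session.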

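3. Hive data types

3.1 Primitive data types

   Hive type -> Java type -> description -> example
   tinyint -> byte -> 1-byte signed integer -> 20
   smallint -> short -> 2-byte signed integer -> 20
   int -> int -> 4-byte signed integer -> 20
   bigint -> long -> 8-byte signed integer -> 20
   boolean -> boolean -> true/false -> true
   float -> float -> single-precision floating point -> 3.14159
   double -> double -> double-precision floating point -> 3.14159
   string -> string -> character sequence; a character set can be specified; single or double quotes may be used -> 'name' "name"
   timestamp -> -> time type
   binary -> -> byte array
  Note: the most commonly used types are int, bigint, double, and string.
  Hive's string type corresponds to varchar in a relational database. It is a variable-length string, but you cannot declare a maximum length; in theory it can hold up to 2 GB of characters.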

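3.2 Collection data types

1. struct -> a structure, similar to a struct in C; its fields are accessed with dot notation. e.g. for a column of type struct<first:string,last:string>, the first field is referenced as column.first. -- select addr.city from personInfo;
2. map -> a collection of key-value pairs, similar to a Map in Java.
e.g. for "first" -> "john", "last" -> "Doe", the second element is accessed through its key "last". -- select children['xiaohua1'] from personInfo;
3. array -> an ordered collection indexed from 0. e.g. for the array ["a","b"], arr[1] is "b". -- select friends[0] from personInfo;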

3.3 Example

1. Sample log records
xiaoming,xiaohong_xiaolan,xiaohua1:17_xiaohua2:18,shanxi_hanzhong
zhangsan,lisi_wangwu,zhangsi:17_zhangwu:18,zhejiang_hangzhou
2. Structure of the first record:
       name: xiaoming
       friends: xiaohong, xiaolan
       children: xiaohua1 (age 17), xiaohua2 (age 18)
       addr: province: shanxi; city: hanzhong
3. Create the table:
  create table personInfo(
    name string,
    friends array<string>,
    children map<string,int>,
    addr struct<province:string,city:string>
  )
  row format delimited fields terminated by ','
  collection items terminated by '_'
  map keys terminated by ':'
  lines terminated by '\n';
  
  Clause reference:
   row format delimited fields terminated by ',' -- column (field) delimiter
   collection items terminated by '_' -- delimiter between the elements of a map, struct, or array
   map keys terminated by ':' -- delimiter between a map key and its value
   lines terminated by '\n' -- row delimiter
      
4. Save the data to a file: vim person.txt
xiaoming,xiaohong_xiaolan,xiaohua1:17_xiaohua2:18,shanxi_hanzhong
zhangsan,lisi_wangwu,zhangsi:17_zhangwu:18,zhejiang_hangzhou

5. Load the data into the table
load data local inpath '/xx/person.txt' into table personInfo;

6. Query the data: select * from personInfo;
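
To make the three delimiters concrete, here is a small Python sketch that splits one sample row the way the table definition describes. It is an illustration only, not how Hive's SerDe is implemented; the function name parse_person and the returned dictionary layout are hypothetical.

```python
def parse_person(line, field_sep=",", item_sep="_", key_sep=":"):
    """Split one text row using the delimiters declared for personInfo."""
    name, friends_raw, children_raw, addr_raw = line.strip().split(field_sep)
    friends = friends_raw.split(item_sep)              # array<string>
    children = {}
    for pair in children_raw.split(item_sep):          # map<string,int>
        key, value = pair.split(key_sep)
        children[key] = int(value)
    province, city = addr_raw.split(item_sep)          # struct<province,city>
    return {
        "name": name,
        "friends": friends,
        "children": children,
        "addr": {"province": province, "city": city},
    }

row = parse_person("xiaoming,xiaohong_xiaolan,xiaohua1:17_xiaohua2:18,shanxi_hanzhong")
print(row["friends"][0])             # comparable to: select friends[0] ...
print(row["children"]["xiaohua1"])   # comparable to: select children['xiaohua1'] ...
print(row["addr"]["city"])           # comparable to: select addr.city ...
```

The same collection delimiter separates array elements, map entries, and struct fields, which is why a single collection items clause covers all three column types.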

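3.4 Data type conversion

Any integer type can be implicitly converted to a type with a wider range, e.g. tinyint -> int.
All integer types, float, and string values can be implicitly converted to double.
tinyint, smallint, and int can all be converted to float.
Use cast to convert explicitly, e.g. cast('1' as int) converts the string to an integer. If the cast fails, as in cast('s' as int), the result is NULL.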
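
The rules in this section can be written down as a small lookup. This Python sketch encodes only the conversions stated in the text (plus float to double) and is not a complete model of Hive's conversion table; the function names are hypothetical, and None stands in for NULL.

```python
# Widening order among the integer types named in the text.
INTEGER_TYPES = ["tinyint", "smallint", "int", "bigint"]

def can_convert_implicitly(src, dst):
    """Return True if the rules stated in this section allow src -> dst without a cast."""
    if src == dst:
        return True
    if src in INTEGER_TYPES and dst in INTEGER_TYPES:
        return INTEGER_TYPES.index(src) < INTEGER_TYPES.index(dst)
    if dst == "double":
        return src in INTEGER_TYPES or src in ("float", "string")
    if dst == "float":
        return src in ("tinyint", "smallint", "int")
    return False

def cast_as_int(text):
    """Mimic the two cast examples in the text: a failed conversion yields None (NULL)."""
    try:
        return int(text)
    except ValueError:
        return None

print(can_convert_implicitly("tinyint", "int"))
print(can_convert_implicitly("int", "tinyint"))
print(cast_as_int("1"), cast_as_int("s"))
```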

4. Basic Hive operations

4.1 DDL (data definition)

4.1.1 Databases

1. Create a database
   (1) Create a database; by default it is stored on HDFS under /user/hive/warehouse/*.db.
     create database if not exists hive_2;
   (2) Create a database at a specified HDFS location, here /hive3.db.
     create database hive_3 location '/hive3.db';
2. Query databases
   (1) show databases; -- list databases
   (2) show databases like 'hive*'; -- pattern match
   (3) desc database hive_3; -- show database information
   (4) desc database extended hive_3; -- show extended metadata
3. Alter a database
   alter database can set key-value pairs in a database's dbproperties to describe it. Other database metadata, including the database name and its storage location, cannot be changed.
   (1) alter database hive_3 set dbproperties("createTime"="2019-09-27");
   (2) alter database hive_3 set dbproperties("createTime"="2019-09-28","createUser"="wql");
4. Drop a database
   (1) drop database hive_3; -- works only on an empty database
   (2) drop database hive_3 cascade; -- force-drop the database together with its tables


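4.1.2 Tables

Creating a table

create [external] table [if not exists] table_name
[(col_name data_type [comment col_comment], ...)]
[comment table_comment]
[partitioned by (col_name data_type [comment col_comment], ...)]
[clustered by (col_name, ...) [sorted by (col_name [ASC|DESC], ...)] into num_buckets buckets]
[row format row_format]
[stored as file_format]
[location hdfs_path] -- for an external table, location points to the actual data path

Show the full create statement of a table: show create table table_name;

With a managed (internal) table, Hive moves the data into the data warehouse path. If the table is created with the external keyword and points at the actual data files through location, Hive only records where the data lives and does not move it. When a table is dropped, a managed table loses both its metadata and its data, whereas for an external table only the metadata is deleted and the data files are left in place.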

Converting between managed and external tables

1. Check the table type
   desc formatted personInfo;
   Table Type: MANAGED_TABLE  (managed table)
2. Change personInfo to an external table
   alter table personInfo set tblproperties('EXTERNAL' = 'TRUE');
3. Check the table type again
   desc formatted personInfo;
   Table Type: EXTERNAL_TABLE  (external table)
4. Change the external table personInfo back to a managed table
   alter table personInfo set tblproperties('EXTERNAL' = 'FALSE');
 Note: the property name 'EXTERNAL' is a fixed spelling and is case-sensitive; the value true/false is not case-sensitive.


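Partitioned tables

Each partition of a partitioned table corresponds to a separate directory on HDFS that holds all the data files of that partition. In Hive, partitioning simply means splitting into directories: a large data set is divided into smaller ones according to business needs. A query can then select only the partitions it needs through an expression in the where clause, which makes it much more efficient.

3.1 Basic partitioned-table operations

1. Motivation for partitioning (managing files by date)
 /user/hive/warehouse/hive_1.db/order_partition/month=201909/20190927.txt
 /user/hive/warehouse/hive_1.db/order_partition/month=201910/20191028.txt
 
2. Create a partitioned table
create table order_partition(oid int,price double, desc string) 
partitioned by (month string) row format delimited fields terminated by '\t';

3. Load data into the partitioned table
load data local inpath '/home/qiulin/soft/hive/data/20190927.txt' into table order_partition partition(month='201909');

4. Query data in a partition
select * from order_partition where month = '201909';

5. Add partitions
   (1) Add a single partition
     alter table order_partition add partition(month='201910');
   (2) Add several partitions
     alter table order_partition add partition(month='201911') partition(month='201912');

6. Drop partitions
   (1) Drop a single partition
      alter table order_partition drop partition(month='201910');
   (2) Drop several partitions
      alter table order_partition drop partition(month='201911'),partition(month='201912');
  
 7. List the partitions of a table
  show partitions order_partition;
  
 8. Show the structure of a partitioned table
   desc formatted order_partition;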
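
The link between a partition and its directory, and the way a where clause narrows the set of directories to read, can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not Hive's planner; partition_dir and prune are hypothetical helper names.

```python
import posixpath

def partition_dir(table_location, spec):
    """Build the directory for a partition spec such as {"month": "201909"}."""
    parts = [f"{key}={value}" for key, value in spec.items()]
    return posixpath.join(table_location, *parts)

def prune(partitions, **wanted):
    """Keep only the partitions that match the equality filters of a where clause."""
    return [p for p in partitions if all(p.get(k) == v for k, v in wanted.items())]

location = "/user/hive/warehouse/hive_1.db/order_partition"
partitions = [{"month": "201909"}, {"month": "201910"}]

# where month = '201909' only needs to read one directory.
for spec in prune(partitions, month="201909"):
    print(partition_dir(location, spec))
```

Because the filter is resolved against partition metadata, the files under the other month directories are never opened.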

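3.2 Common partitioned-table operations

The examples in this section use a two-level partition (month, day), so they assume the table was created with partitioned by (month string, day string).

1. Upload the data, then repair the table with msck repair table table_name (useful when there is a large amount of historical data spread over many files)

 (1) Upload the data to HDFS
    1. Create the partition directory: dfs -mkdir -p /user/hive/warehouse/hive_1.db/order_partition/month=201911/day=01;
    2. Put the file into the directory:
     dfs -put /home/qiulin/soft/hive/data/20190927.txt  /user/hive/warehouse/hive_1.db/order_partition/month=201911/day=01;
     
 (2) Query the data: nothing is returned, because the metastore holds no partition metadata linking the new directory to the table
 (3) Run the repair command
       msck repair table order_partition;
 (4) Query the data again
       select * from order_partition where month = '201911' and day = '01';

 2. Upload the data, then add the partition

 (1) Upload the data
       dfs -put /home/qiulin/soft/hive/data/20190927.txt  /user/hive/warehouse/hive_1.db/order_partition/month=201911/day=01;
 (2) Add the partition
       alter table order_partition add partition(month='201911',day='01');

 3. Create the directory, then load data into the partition (when the partitioned table already exists)

 (1) Create the directory
    dfs -mkdir -p /user/hive/warehouse/hive_1.db/order_partition/month=201911/day=01;
 (2) Load the data
    load data local inpath '/home/qiulin/soft/hive/data/20190927.txt' into table order_partition partition(month='201911',day='01');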
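
The repair approach in item 1 boils down to comparing the directories on disk with the partitions the metastore knows about. The Python sketch below models that idea on a temporary local directory; it is an assumption-laden illustration, not the real repair logic, and discover_partitions and repair are hypothetical names.

```python
import os
import tempfile

def discover_partitions(table_dir, partition_cols):
    """Return partition values for directories shaped like col1=v1/col2=v2 under table_dir."""
    found = []
    for root, _dirs, _files in os.walk(table_dir):
        rel = os.path.relpath(root, table_dir)
        parts = [] if rel == "." else rel.split(os.sep)
        if len(parts) != len(partition_cols):
            continue
        pairs = [p.split("=", 1) for p in parts]
        if all(len(pair) == 2 and pair[0] == col for pair, col in zip(pairs, partition_cols)):
            found.append(tuple(pair[1] for pair in pairs))
    return sorted(found)

def repair(table_dir, partition_cols, registered):
    """List partitions that exist on disk but are missing from the registered set."""
    return [p for p in discover_partitions(table_dir, partition_cols) if p not in registered]

with tempfile.TemporaryDirectory() as table_dir:
    part = os.path.join(table_dir, "month=201911", "day=01")
    os.makedirs(part)
    open(os.path.join(part, "20190927.txt"), "w").close()
    registered = set()   # the metastore knows no partitions yet
    missing = repair(table_dir, ["month", "day"], registered)
    print(missing)
```

Until the missing partitions are registered, a query filtered on them finds nothing even though the files are already in place.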

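4. Altering tables

Adding, changing, and replacing columns

1. Rename a table
 alter table table_name rename to new_table;
2. Change a column
 alter table table_name change column old_column_name new_column_name column_type;
3. Add columns
 alter table table_name add columns (column_name column_type, ...);
4. Replace columns: the table's entire column list is replaced by the columns given. The data stays in the files and is not lost; if a column's type does not match the type in the file, NULL is returned.
 alter table table_name replace columns (column_name column_type, ...);
Note: change is followed by column, whereas add and replace are followed by columns.
5. Truncate a table (managed tables only)
  truncate table table_name;
6. Drop a table
 drop table table_name;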

4.2 DML (data manipulation)

1. Adding data to a table

1. Import data with load

(1) Syntax
     load data [local] inpath 'file_url' [overwrite] into table table_name [partition(part_column=xxx,...)];
     local: load the data from the local file system into the Hive table; without it, the data is loaded from HDFS.
     overwrite: overwrite the data already in the table; without it, the data is appended.

2. Insert data through a query (insert)

(1) Syntax
     a. Create a partitioned table:
         create table person(id int,name string) partitioned by (month string) row format delimited fields terminated by '\t';
     b. Insert a row:
         insert into table person partition(month="201909") values(1,"wql");
     c. Insert the result of a query (overwrite replaces the existing data):
     from person insert overwrite table person partition(month='201907') select id, name where month = "201909";

3. Create a table from a query result and load the data (as select). The columns of the new table are named after the queried columns.

 (1) Syntax
     create table if not exists table_name as select column_name, ... from source_table;
     e.g.
     create table if not exists person2 as select id,name from person where month='201909';

4. Specify the data path with location

 (1) Syntax
     a. Create the table:
     create table if not exists table_name(column_name column_type,...) row format delimited fields terminated by 'field_delimiter' location 'hdfs_path_of_the_table';
     e.g.
     create table if not exists person3(id int , name string) row format delimited fields terminated by '\t' location '/user/hive/warehouse/hive_4.db/person3';
     b. Upload the data to HDFS
     dfs -put /home/qiulin/soft/hive/data/person1.txt  /user/hive/warehouse/hive_4.db/person3;
     c. Query the data (with several files in the directory, their rows are combined)
     select * from person3;

 5. Exporting query results with insert

 (1) Export data. Add row format delimited fields terminated by '\t' to choose the delimiter; without it the columns are separated only by the default non-printing \001 character.
      a. Export a query result to the local file system (local)
      insert overwrite local directory '/home/qiulin/soft/hive/data/export' row format delimited fields terminated by '\t' select * from person3;
      b. Export a query result to HDFS
      insert overwrite directory '/user/hive/warehouse/hive_4.db/person3' row format delimited fields terminated by '\t' select * from person3;


6. Export to HDFS with export, then bring the data back with import (the export includes the table metadata)

 (1) Export the data to HDFS
     export table personInfo to '/user/hive/warehouse/hive_4.db/person3';
 (2) Import the data (typically into a new table)
      import table person partition(month='201905') from '/user/hive/warehouse/hive_4.db/person3';

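5. Querying table data

5.1 Basic queries

1. Join statements
   Hive supports the usual SQL JOIN statements. Hive versions before 2.2.0 support only equi-joins and reject non-equality conditions (!=, >, <, ...) in the on clause.
   e.g.
     select p2.id, p2.name from person2 p2 join person3 p3 on p2.id = p3.id;
2. In those versions the join predicate does not support or
   e.g.
     select p2.id, p2.name from person2 p2 join person3 p3 on p2.id = p3.id or p2.name = p3.name; (error) => rewrite with a subquery
     select a.id, a.p2Name from
     (select p2.id, p2.name p2Name, p3.name p3Name from person2 p2 join person3 p3 on p2.id = p3.id) a where a.p2Name = a.p3Name;
   Note that this rewrite keeps only rows where both conditions hold. For true or semantics, union the results of two equi-joins, one per condition.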

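5.2 Sorting

1. Global sort (order by): whenever order by appears, only one reducer is used.
2. Sort within each reducer (sort by): each reducer's output is ordered, but the overall result is not.
   e.g.
      select * from person2 sort by id;
3. Partitioned sort (distribute by)
   distribute by works like the partitioner in MapReduce: it decides which reducer each row goes to, and is used together with sort by.
   Note: Hive requires distribute by to be written before sort by, and the job must run with multiple reducers, otherwise the effect of distribute by is not visible.
   e.g.
    set mapreduce.job.reduces=3;
    insert overwrite local directory '/home/qiulin/soft/hive/data/result' select * from person3 distribute by id sort by name asc;
4. When distribute by and sort by use the same column, cluster by can be used instead.
    cluster by combines the behavior of distribute by and sort by, but it sorts in ascending order only; asc|desc cannot be specified.
  e.g. the following two statements are equivalent:
  select * from person2 cluster by id;
  select * from person2 distribute by id sort by id; -- partitioned by id: rows with the same id always land in the same file, but one file may contain several different ids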
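
The difference between a global order and a per-reducer order is easy to see in a small simulation. This Python sketch routes rows to reducers and sorts each reducer's rows separately; it assumes, purely for illustration, that an integer key is assigned to a reducer by key modulo the reducer count, and the sample rows are made up.

```python
from collections import defaultdict

def distribute_and_sort(rows, num_reducers, distribute_key, sort_key):
    """Route rows to reducers by key, then sort each reducer's rows independently."""
    reducers = defaultdict(list)
    for row in rows:
        # Simplifying assumption: reducer = integer key modulo the number of reducers.
        reducers[row[distribute_key] % num_reducers].append(row)
    return {r: sorted(part, key=lambda row: row[sort_key])
            for r, part in sorted(reducers.items())}

rows = [
    {"id": 1, "name": "d"}, {"id": 4, "name": "a"},
    {"id": 2, "name": "c"}, {"id": 5, "name": "b"},
    {"id": 3, "name": "e"},
]
result = distribute_and_sort(rows, 3, "id", "name")
for reducer, part in result.items():
    print(reducer, [(r["id"], r["name"]) for r in part])
```

Each reducer's output is sorted by name, yet concatenating the three outputs does not give a globally sorted list, which is the behavior described for sort by combined with distribute by.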

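5.3 Bucketed tables

Partitioning applies to the storage path of the data; bucketing applies to the data files themselves.

1. Create a bucketed table
  create table per_buck(id int, name string) clustered by(id) into 3 buckets row format delimited fields terminated by "\t";
2. Insert data into the bucketed table (use insert into table ... select only: the MapReduce job writes the rows into the separate bucket files, whereas load and import cannot split the data files)
  (1) Set the properties
  set hive.enforce.bucketing=true;
  set mapreduce.job.reduces=-1;
  (2) Insert the data
  insert into table per_buck select * from person3;
 3. Bucket sampling: querying a sample of the bucketed data
   For very large data sets, querying a sample is often enough, and a bucketed table suits this best.
   select * from per_buck tablesample(bucket 1 out of 3 on id);
   Note: tablesample is the sampling clause. Syntax: tablesample(bucket x out of y)
   x: the bucket to start sampling from; must be less than or equal to y.
   y: must be greater than 0 and a multiple of the number of buckets. With 3 buckets in total, y=3 samples 1 bucket and y=6 samples 1/2 of a bucket.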
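
The x-out-of-y arithmetic can be checked with a simplified model. The Python sketch below assumes that an int column hashes to its own value and that a row's slot is that hash modulo the divisor; the actual hash function depends on the Hive version, so treat this only as an illustration of the proportions. The names bucket_of and tablesample are hypothetical.

```python
def bucket_of(id_value, num_buckets):
    """Simplified bucket assignment: assumes an int column hashes to its own value."""
    return (id_value & 0x7FFFFFFF) % num_buckets

def tablesample(ids, x, y):
    """Model of tablesample(bucket x out of y on id): keep ids falling in slot x of y."""
    return [i for i in ids if (i & 0x7FFFFFFF) % y == x - 1]

ids = list(range(1, 13))
buckets = {b: [i for i in ids if bucket_of(i, 3) == b] for b in range(3)}

print(buckets)                  # 12 ids spread over 3 buckets
print(tablesample(ids, 1, 3))   # y = 3: one whole bucket
print(tablesample(ids, 1, 6))   # y = 6: half of that bucket
```

With 3 buckets, y=3 returns every id of one bucket, while y=6 returns every second id of it, matching the 1 bucket and 1/2 bucket cases described above.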

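6. Functions

List the built-in functions: show functions;

Show how a function is used: desc function extended function_name;

User-defined functions

UDF:

import org.apache.hadoop.hive.ql.exec.UDF;

Extend this class and implement an evaluate method; evaluate can be overloaded.

Creating a function in Hive

(1) Add the jar

add jar linux_jar_path;

(2) Create the function

create [temporary] function [dbName.]function_name as class_name;

(3) Drop the function

drop [temporary] function if exists [dbName.]function_name;

A UDF must have a return type. It may return null, but the return type cannot be void.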

7. Compression

1. Check which compression types Hadoop supports

hadoop checknative

2. Enable compression for the reduce output stage

1. Enable compression of Hive's final output
set hive.exec.compress.output=true;
2. Enable compression of the final MapReduce output
set mapreduce.output.fileoutputformat.compress=true;
3. Set the codec for the final MapReduce output
set mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec;
4. Use block compression for the final MapReduce output
set mapreduce.output.fileoutputformat.compress.type=BLOCK;

Test:
 insert overwrite local directory '/usr/local/hive/data/emp-snapy' select * from bussess distribute by costdate sort by cost desc; -- distribute by costdate, sort by cost

3. File storage formats

    1. Specify the storage format when creating the table (stored as textfile|orc|parquet).
    2. Check file sizes: dfs -du -h /user/hive/warehouse/tableName/
