Hive Development and Usage

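Applicable scenarios

1. Storing and processing massive data sets
2. Data mining
3. Offline analysis of massive data sets
3.1 HiveServer2
The Thrift server for Hive is now usually HiveServer2, an improved version of HiveServer. It provides a new Thrift API for JDBC and ODBC clients, supports Kerberos authentication, and handles multiple concurrent clients.
3.2 BeeLine
HiveServer2 also ships a new CLI, BeeLine. Introduced in Hive 0.11, it is an interactive CLI based on SQLLine and acts as a Hive JDBC client for accessing HiveServer2.
Connecting to Hive through BeeLine:
<Hive install dir>/bin/beeline -u jdbc:hive2://<HiveServer2 host IP>:<port> -n <user name>
Example: ./beeline -u jdbc:hive2://127.0.0.1:10000 -n root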
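
To make the shape of the connection string explicit, here is a minimal Python sketch that assembles the same BeeLine invocation from its parts. The helper name beeline_command is hypothetical; only the -u and -n flags and the jdbc:hive2:// URL form are taken from the example above.

```python
import shlex

def beeline_command(host, port, user, beeline="./beeline"):
    """Build the argv list for a BeeLine connection (hypothetical helper)."""
    url = f"jdbc:hive2://{host}:{port}"  # HiveServer2 JDBC URL: host and port
    return [beeline, "-u", url, "-n", user]

cmd = beeline_command("127.0.0.1", 10000, "root")
# shlex.join renders the argv list as a single shell-safe command line
print(shlex.join(cmd))
```

Keeping the command as a list rather than a string makes it safe to pass to subprocess.run without shell quoting problems.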

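Hive databases

A Hive database is similar to a database in a traditional RDBMS. In the metastore it is recorded as a row in the DBS table, and on HDFS it corresponds to a folder under the data warehouse directory. The warehouse directory path is set by the hive.metastore.warehouse.dir parameter in hive-site.xml.
Creating a database: create database <database name>;
Listing databases in the metastore: select * from dbs;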

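Managed (internal) tables

   A managed table is conceptually similar to a table in a relational database. Each table has its own directory in Hive for storing data, and all table data (External Tables excluded) is kept in that directory. Dropping the table deletes both the metadata and the data.
   Querying the table list in the metastore:
   ![Table list in the metastore](https://img-blog.csdnimg.cn/20181216210745688.png?x-oss-process=image/watermark,type_ZmFuZ3poZW5naGVpdGk,shadow_10,text_aHR0cHM6Ly9ibG9nLmNzZG4ubmV0L2NodXhpbmdidWJpYW4=,size_16,color_FFFFFF,t_70)
   Corresponding storage directory on HDFS:
   ![Storage directory on HDFS](https://img-blog.csdnimg.cn/20181216210802205.png)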

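External tables

An external table points to data that already exists in HDFS, and it can have partitions. Its metadata is organized the same way as for a managed table, but the actual data is stored quite differently. For a managed table, creating the table and loading data are two steps that can be done separately or in a single statement; during loading, the data is moved into the warehouse directory, and all later access reads from there. Dropping a managed table deletes both its data and its metadata. An external table involves a single step: the table is created and bound to its data at once (CREATE EXTERNAL TABLE ... LOCATION). The data stays in the HDFS path given after LOCATION and is never moved into the warehouse directory. Dropping an External Table removes only that link (the metadata); the data files are left in place.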

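How to choose between managed and external tables?

If all processing is done by Hive, use a managed table.
If Hive and other external tools need to work on the same data set, use an external table.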

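Partitions: a table's partitions live in subdirectories of the table directory

A partition plays the role of a dense index on the partition column in a relational database, but Hive organizes partitions very differently. In Hive, each partition of a table corresponds to a directory under the table directory, and all of that partition's data is stored there. For example, if the pvs table has two partition columns, ds and city, then

the HDFS subdirectory for ds = 20090801, city = jinan is /wh/pvs/ds=20090801/city=jinan.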
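
The mapping from partition values to a directory can be sketched in a few lines of Python. This is only an illustration of the naming scheme described above; partition_dir is a hypothetical helper, not a Hive API.

```python
def partition_dir(table_root, table, partition_spec):
    """Return the directory for one partition.

    partition_spec is a list of (column, value) pairs in the order the
    partition columns were declared, because each column adds one
    directory level named column=value.
    """
    levels = [f"{col}={val}" for col, val in partition_spec]
    return "/".join([table_root.rstrip("/"), table] + levels)

path = partition_dir("/wh", "pvs", [("ds", "20090801"), ("city", "jinan")])
print(path)
```

Because the partition values are encoded in the path, a filter on ds or city lets Hive skip whole directories instead of scanning every file.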

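Buckets

Bucketing uses a hash of a table column to split the data further into separate files. Hive computes the hash of the chosen column and divides the rows by hash value; the goal is parallelism, and each bucket corresponds to one file. Partitions are a coarse-grained division and buckets a fine-grained one, so that queries can run over a small slice of the data and finish faster. Bucketing suits table joins and sampling.

For example, to spread the user column across 32 buckets, Hive first hashes the user value; then
the HDFS file for hash value 0 is /wh/pvs/ds=20090801/ctry=US/part-00000;
the HDFS file for hash value 20 is /wh/pvs/ds=20090801/ctry=US/part-00020.
This is a good choice when you want many map tasks to work in parallel.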
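
The bucket-to-file mapping above can be sketched as follows. This is a simplified model: the hash value is passed in directly, since the real hash function depends on the column type, and the helper names bucket_index and bucket_file are hypothetical.

```python
def bucket_index(hash_value, num_buckets):
    # Clear the sign bit so a negative hash still maps to a valid bucket,
    # then take the remainder by the number of buckets.
    return (hash_value & 0x7FFFFFFF) % num_buckets

def bucket_file(partition_dir, hash_value, num_buckets):
    # One file per bucket, numbered with five digits as in the example above.
    return f"{partition_dir}/part-{bucket_index(hash_value, num_buckets):05d}"

part = "/wh/pvs/ds=20090801/ctry=US"
print(bucket_file(part, 0, 32))
print(bucket_file(part, 20, 32))
print(bucket_file(part, 52, 32))  # 52 % 32 == 20, so it shares the file of hash 20
```

Rows whose hashes differ by a multiple of the bucket count land in the same file, which is why the bucket count bounds the parallelism.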

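Hive views

A Hive view is similar to a view in a traditional database. Views are read-only. If rows are added to the underlying base table, the view keeps working; if the base table is dropped, queries on the view fail. If the view's columns are not specified, they are derived from the select statement.

A simple view example:
Create the view: create view test_view as select * from test;
Query it: select * from test_view;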

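CREATE TABLE creates a table with the given name. If a table with that name already exists, an exception is thrown; use the IF NOT EXISTS option to suppress it.
The EXTERNAL keyword creates an external table; the path to the actual data is given at creation time with LOCATION.
A partitioned table is declared with a PARTITIONED BY clause at creation time. A table can have one or more partitions, and each partition lives in its own directory.
Tables and partitions can be bucketed with CLUSTERED BY on one or more columns, which hashes the rows into a fixed number of buckets.
SORTED BY keeps the data within each bucket sorted, which can improve performance for particular workloads.
The default field delimiter is the ASCII control character \001 (^A).
The tab delimiter is written \t. Only single-character delimiters are supported.
If the data files are plain text, use STORED AS TEXTFILE. If the data needs to be compressed, use STORED AS SEQUENCEFILE.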

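Hive Development and Usage: Hive data loading commands

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

A load is a plain copy/move operation (a copy for LOCAL files, a move for files already in HDFS) that puts the data files into the location backing the Hive table. If the table is partitioned, the partition must be specified.
To load local data, add the LOCAL keyword; partition information can be given at the same time.
With LOCAL, the load command looks for filepath on the local file system. A relative path is resolved against the user's current working directory. You can also give a full URI for a local file, for example file:///user/hive/project/data1.
Example: load local data and specify the partition:
hive> LOAD DATA LOCAL INPATH 'file:///examples/files/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
Loading DFS data and specifying the partition:
Without LOCAL, filepath can be a relative path or a full URI. For a relative path, Hive builds the full path using the NameNode URI set by fs.defaultFS in the Hadoop configuration.
Example: load data that is already in HDFS and specify the partition:
hive> LOAD DATA INPATH '/user/myname/kv2.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-15');
OVERWRITE
With OVERWRITE, any existing content of the target table (or partition) is deleted first, and then the content of the file or directory that filepath points to is added to the table or partition. If the target table (or partition) already has a file whose name collides with a file name in filepath, the existing file is replaced by the new one.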

Managed table

Table creation example:
Create a person information table person_inside whose columns are separated by commas ",".

create table person_inside (id string,name string,sex string,age int)
row format delimited fields terminated by ',' stored as textfile;
Load data; the local data file is /tmp/person.txt:
load data local inpath 'file:///tmp/person.txt' into table person_inside;

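External table
Create a person information table person_ext whose columns are separated by commas ",".
Path backing the external table: hdfs://mycluster/hivedb/person.txt
Table creation example:
create external table person_ext
(id string,name string,sex string,age int)
row format delimited fields terminated by ','
stored as textfile
location '/hivedb';
(Note: location takes a directory, not a file. Hive resolves it against the default configured HDFS path and exposes every file in that directory through the table.)
No directory for the external table is created under Hive's default warehouse path.
View table information: desc formatted person_ext; and check where location points.
Query data: select * from person_ext;
Drop the table: drop table person_ext;
Only the logical table is dropped; the data files are not deleted and remain in place.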
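
The difference in drop behaviour between managed and external tables can be modelled with a toy Python class. This is a deliberately simplified sketch, not how Hive is implemented: ToyWarehouse, its metastore dict, and its data_dirs set are all hypothetical stand-ins for the metastore and HDFS.

```python
class ToyWarehouse:
    """Toy model: a metastore dict plus a set of data directories."""

    def __init__(self):
        self.metastore = {}     # table name -> {"location": str, "external": bool}
        self.data_dirs = set()  # directories that currently hold data

    def create_table(self, name, location, external=False):
        self.metastore[name] = {"location": location, "external": external}
        self.data_dirs.add(location)

    def drop_table(self, name):
        info = self.metastore.pop(name)  # metadata is always removed
        if not info["external"]:
            # Only a managed table takes its data directory with it
            self.data_dirs.discard(info["location"])

wh = ToyWarehouse()
wh.create_table("person_inside", "/user/hive/warehouse/person_inside")
wh.create_table("person_ext", "/hivedb", external=True)
wh.drop_table("person_inside")
wh.drop_table("person_ext")
print(wh.metastore, wh.data_dirs)
```

After both drops the metastore is empty, but /hivedb is still present, which is the reason to prefer external tables for data shared with other tools.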

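Partitioned table

Create a person information table person_part whose columns are separated by commas ",", partitioned by city.
Table creation example:
create table person_part
(id string,name string,sex string,age int)
partitioned by (city string)
row format delimited fields terminated by ','
stored as textfile;
Load data; the local data file is /tmp/person.txt:
load data local inpath 'file:///tmp/person.txt' into table
person_part partition(city='jinan');
The data is stored under a directory named for the partition, city=jinan.
Query by partition: Hive automatically checks whether the where clause references a partition column, and comparison operators such as greater-than and less-than can be used.
select * from person_part where city='jinan';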

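Bucketed table

Create a person information table person_bucket whose columns are separated by commas ",", with 5 buckets on the age column.
Table creation example:
create table person_bucket
(id string,name string,sex string,age int) partitioned by (city string)
clustered by (age) sorted by(name) into 5 buckets
row format delimited fields terminated by ','
stored as textfile;
Enable bucketing enforcement: set hive.enforce.bucketing = true;
Load data: insert into table person_bucket partition (city='jinan') select * from person_inside;

When rows are loaded into a bucketed table, Hive hashes the bucketing column and takes the result modulo the number of buckets, then writes each row to the corresponding file.

Sampling query: read the 2nd of the 5 buckets, i.e. the file 000001_0
select * from person_bucket tablesample(bucket 2 out of 5 on age);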
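
The hash-then-modulo rule and the sampling query can be traced by hand with a short Python sketch. It assumes that the hash of an integer column is the integer itself, which is a simplification; the sample rows and the helper names are made up for illustration.

```python
NUM_BUCKETS = 5

def age_bucket(age, num_buckets=NUM_BUCKETS):
    # Assumption: the hash of an int column is the int value itself
    return (age & 0x7FFFFFFF) % num_buckets

def bucket_file_name(index):
    # Bucket files are numbered from zero: 000000_0, 000001_0, ...
    return f"{index:06d}_0"

def tablesample(rows, x, y):
    # "bucket x out of y" is 1-based, bucket indexes are 0-based
    return [r for r in rows if age_bucket(r["age"], y) == x - 1]

people = [  # hypothetical sample rows
    {"name": "a", "age": 21}, {"name": "b", "age": 25},
    {"name": "c", "age": 26}, {"name": "d", "age": 30},
    {"name": "e", "age": 31}, {"name": "f", "age": 18},
]

sampled = tablesample(people, 2, 5)
print(bucket_file_name(1), [p["age"] for p in sampled])
```

Sampling on the bucketing column only has to read one file, whereas sampling an unbucketed table has to scan every row to evaluate the same condition.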

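Bucketed tables:

Notes:
Bucket files are only produced when the data comes from another table through insert into or insert overwrite; LOAD DATA does not produce bucketed data.
Buckets can be defined on integer or string columns.
A table without buckets can still be sampled randomly, but that requires a full table scan and is slow.
You must first set hive.enforce.bucketing = true for data to be written into buckets correctly. (Hive 2.0 removed this setting and always enforces bucketing.)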

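Exporting to the local file system
insert overwrite local directory '/tmp/exporttest/' select * from person_inside;
Note: the export path is a directory; no file name is needed. After the statement runs, a result file named 000000_0 is created under the local directory /tmp/exporttest/.
By default the exported columns are separated by ^A (ASCII \001).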
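
Since ^A is invisible in most editors, reading the exported file usually means splitting on it explicitly. A minimal Python sketch, assuming one record per line; parse_export_line and the sample record are hypothetical.

```python
def parse_export_line(line, delimiter="\x01"):
    """Split one exported line into its column values (\\x01 is ^A)."""
    return line.rstrip("\n").split(delimiter)

# A made-up line in the shape of person_inside: id, name, sex, age
sample = "1\x01Tom\x01M\x0130\n"
fields = parse_export_line(sample)
print(fields)
```

All values come back as strings, so numeric columns such as age still need an explicit int() conversion.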

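Exporting to HDFS
insert overwrite directory '/hivedb' select * from person_inside;
Note: the export path is a directory; no file name is needed. After the statement runs, a result file named 000000_0 is created under the HDFS directory /hivedb.

Exporting to another Hive table
insert into table person_part partition (city='jinan') select * from person_inside;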

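Partition-based queries
Example: the partition column is city
SELECT * FROM person_part WHERE city='jinan';
Limiting the number of rows with LIMIT
LIMIT caps the number of rows a query returns. Which rows come back is arbitrary. The following query returns 5 arbitrary rows from the person_inside table:
SELECT * FROM person_inside LIMIT 5;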
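Top N queries
The following statements return the 5 oldest people.
set mapred.reduce.tasks = 2; (sets the number of reduce tasks to 2)
select * from person_inside order by age desc limit 5;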
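Joining multiple tables in Hive with join ... on
Hive versions before 2.2.0 support only equi-joins: the ON clause must compare with the equals sign, and non-equi joins are not supported.
If the statement also has a WHERE clause, the JOIN is evaluated first and the WHERE clause afterwards.
More than two tables can be joined.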

Creating the employee table
Create the table:
create table employee(employee_id string,name string)
row format delimited fields terminated by ',' stored as textfile;
Load data; the local data file is /tmp/employee.txt:
load data local inpath 'file:///tmp/employee.txt' into table employee;
Creating the job table
Create the table:
create table job (job_id string,job string,employee_id string)
row format delimited fields terminated by ',' stored as textfile;
Load data; the local data file is /tmp/job.txt:
load data local inpath 'file:///tmp/job.txt' into table job;

Inner join

Returns only the rows that satisfy the join condition on both sides.
Query:
select * from employee join job on employee.employee_id=job.employee_id;

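Left outer join

If a row on the left has no match on the right, it is still returned, with the right-side columns set to NULL.
Query:
select * from employee left outer join job on employee.employee_id=job.employee_id;
Note: old Hive versions reject left join and require left outer join; current versions treat the OUTER keyword as optional.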

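Right outer join

If a row on the right has no match on the left, it is still returned, with the left-side columns set to NULL.
Query:
select * from employee right outer join job on employee.employee_id=job.employee_id;
Note: old Hive versions reject right join and require right outer join; current versions treat the OUTER keyword as optional.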

Full outer join

Returns the union of the left outer join and right outer join results.
Query:
select * from employee full outer join job on employee.employee_id=job.employee_id;
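
To see exactly which rows each join type keeps, here is a small pure-Python model of the four joins on employee_id. The rows are made-up sample data and the join function is a hypothetical helper written for clarity, not performance.

```python
employee = [("1", "Ann"), ("2", "Bob"), ("3", "Cid")]              # (employee_id, name)
job = [("j1", "dev", "1"), ("j2", "ops", "1"), ("j3", "qa", "9")]  # (job_id, job, employee_id)

def join(left, right, how="inner"):
    rows, matched_right = [], set()
    for emp in left:
        hits = [i for i, j in enumerate(right) if j[2] == emp[0]]
        for i in hits:
            rows.append(emp + right[i])
            matched_right.add(i)
        if not hits and how in ("left", "full"):
            rows.append(emp + (None, None, None))   # unmatched left row, right side NULL
    if how in ("right", "full"):
        for i, j in enumerate(right):
            if i not in matched_right:
                rows.append((None, None) + j)       # unmatched right row, left side NULL
    return rows

for how in ("inner", "left", "right", "full"):
    print(how, len(join(employee, job, how)))
```

With this data the inner join yields 2 rows, the left outer join 4, the right outer join 3, and the full outer join 5, which is the inner result plus the unmatched rows from both sides.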

Left semi join

A left semi join has the same effect as an in or exists subquery.
Query:
select * from employee left semi join job on employee.employee_id=job.employee_id;
The statement above is equivalent to:
select * from employee where employee_id in (select employee_id from job);

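Hive 0.9.0 and later support in, not in, like, and not like

in

Keeps the rows of the left table whose key appears in the right table. Same effect as left semi join. (IN and NOT IN with a subquery, as used here, need Hive 0.13 or later.)
select * from employee where employee_id in (select employee_id from job);

not in

Keeps the rows of the left table whose key does not appear in the right table.
select * from employee where employee_id not in (select employee_id from job);

like

Returns all rows that match the pattern; here, names that start with 張.
select * from employee where name like '張%';

not like

Returns all rows that do not match the pattern.
select * from employee where name not like '張%';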
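
The four predicates can be tried without a Hive cluster by running the same queries against SQLite from Python. SQLite is only a stand-in here and differs from Hive in many details; the table contents are made up, and the names helper is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table employee (employee_id text, name text);
create table job (job_id text, job text, employee_id text);
insert into employee values ('1', '張三'), ('2', '李四'), ('3', '張偉');
insert into job values ('j1', 'dev', '1'), ('j2', 'ops', '2');
""")

def names(sql):
    # Run a query and return the first column of every row
    return [row[0] for row in conn.execute(sql)]

in_names = names("select name from employee where employee_id in "
                 "(select employee_id from job) order by employee_id")
not_in_names = names("select name from employee where employee_id not in "
                     "(select employee_id from job) order by employee_id")
like_names = names("select name from employee where name like '張%' order by employee_id")
not_like_names = names("select name from employee where name not like '張%' order by employee_id")
print(in_names, not_in_names, like_names, not_like_names)
```

One thing to keep in mind with not in: if the subquery returns a NULL, standard SQL semantics make the predicate unknown for every row, so the result is empty.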

List databases: show databases;
Search tables by pattern: show tables like 'name';
Drop a database: drop database dbname;
Drop a table: drop table tablename;
View table structure: desc table_name;
View detailed table structure: desc formatted table_name;
View partition information: show partitions table_name;
List files on HDFS: hadoop fs -ls /user/hive/warehouse/
View the content of a file on HDFS: hadoop fs -cat /user/hive/warehouse/file.txt

Three files: the user file users.dat, the movie file movies.dat, and the rating file ratings.dat

Analysis code for a million-row movie rating data set

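-- Create the movie table (movie information; movie_leixing is the genre)
create table movie
(movie_id int,movie_name string,movie_leixing string)
row format delimited fields terminated by '^'
stored as textfile;
-- Load data from the local file system into the managed table
load data local inpath '/home/cloudera/Desktop/movie.dat' into table movie;

The other two tables are created the same way.
-- Create the user table (user information; user_zhiye is the occupation, user_youbian the zip code)
create table user
(user_id int,user_sex string,user_age int,user_zhiye string,user_youbian string)
row format delimited fields terminated by '^'
stored as textfile;
-- Load data from the local file system into the managed table
load data local inpath '/home/cloudera/Desktop/user.dat' into table user;

-- Create the rating table (rating information; rating_pingfen is the score, rating_shijian the timestamp)
create table rating
(user_id int,movie_id int,rating_pingfen int,rating_shijian string)
row format delimited fields terminated by '^'
stored as textfile;
-- Load data from the local file system into the managed table
load data local inpath '/home/cloudera/Desktop/rating.dat' into table rating;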

-- Merge the data with inner joins
Step 1: merge the rating table with the user table.
1. Create-then-fill approach: first create the merged table rating_user
create table rating_user
(user_id int,movie_id int,rating_pingfen int,rating_shijian string,user_sex string,user_age int,user_zhiye string,user_youbian string)
row format delimited fields terminated by '^'
stored as textfile;
-- Insert the query result into this table
insert into table rating_user select rating.user_id,movie_id,rating_pingfen,rating_shijian,user_sex,user_age,user_zhiye,user_youbian
from rating join user on rating.user_id=user.user_id;
2. Create-table-as-select approach (an alternative to approach 1 that produces the same rating_user table):
create table rating_user
row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
stored as rcfile
as
select rating.user_id,movie_id,rating_pingfen,rating_shijian,user_sex,user_age,user_zhiye,user_youbian
from rating join user on rating.user_id=user.user_id;

3. No intermediate table: use the result of the first join as the input of the second join and fetch the full merged data set in one query
select rating_user.user_id,rating_user.movie_id,rating_pingfen,rating_shijian,user_sex,user_age,user_zhiye,user_youbian,movie_name,movie_leixing from
(select rating.user_id,movie_id,rating_pingfen,rating_shijian,user_sex,user_age,user_zhiye,user_youbian from rating join user on rating.user_id=user.user_id)
rating_user join movie on rating_user.movie_id=movie.movie_id;

4. One-step create-table-as-select approach:
create table rating_user_movie
row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
stored as rcfile
as
select rating_user.user_id,rating_user.movie_id,rating_pingfen,rating_shijian,user_sex,user_age,user_zhiye,user_youbian,movie_name,movie_leixing from
(select rating.user_id,movie_id,rating_pingfen,rating_shijian,user_sex,user_age,user_zhiye,user_youbian from rating join user on rating.user_id=user.user_id)
rating_user join movie on rating_user.movie_id=movie.movie_id;

-- Count the rows of the merged table
select count(*) from rating_user_movie;

-- Aggregate queries on the merged data

select count(movie_id) from rating_user_movie;
-- 1. Group by movie id, count the rows of each group as a new column, sort in descending order, and take the top 20
select movie_id,count(movie_id) as m1
from rating_user_movie
group by movie_id
order by m1 desc
limit 20;
-- 2. Group by movie id and movie name, count the rows of each group, keep groups with at least 20 ratings, sort in descending order, and take the top 20
select movie_name,count(movie_id) as m1
from rating_user_movie
group by movie_id,movie_name
having m1 >= 20
order by m1 desc
limit 20;
-- 3. Group by movie id and movie name, compute the count and the average rating of each group,
-- keep movies with at least 100 ratings, and sort by the average
select movie_name,count(movie_id) as m1,avg(rating_pingfen) as m2
from rating_user_movie
group by movie_id,movie_name
having m1 >= 100
order by m2 desc
limit 10;
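
The clause order in query 3 (group by, then having, then order by, then limit) can be checked on a tiny data set. The sketch below uses SQLite from Python as a stand-in for Hive, with made-up ratings and a lower threshold of 2 instead of 100; the having clause repeats the count expression rather than using the alias, purely for portability.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table rating_user_movie "
             "(movie_id int, movie_name text, rating_pingfen int)")
conn.executemany(
    "insert into rating_user_movie values (?, ?, ?)",
    [(1, "A", 5), (1, "A", 4), (1, "A", 3),   # movie A: 3 ratings, average 4.0
     (2, "B", 5), (2, "B", 5),                # movie B: 2 ratings, average 5.0
     (3, "C", 1)],                            # movie C: 1 rating, filtered out
)

top = conn.execute("""
    select movie_name, count(movie_id) as m1, avg(rating_pingfen) as m2
    from rating_user_movie
    group by movie_id, movie_name
    having count(movie_id) >= 2
    order by m2 desc
    limit 10
""").fetchall()
print(top)
```

The having filter runs on whole groups before the sort, so a movie with a single very high rating cannot reach the top of the list.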

-- Check that the merged row counts are correct
select count(*) from rating_user_movie;
select count(1) from rating_user_movie group by movie_id;
-- 4. Group by movie id, movie name, genre, and user sex, and compute the average rating
-- of each group as a new column (female users only)
select movie_id,movie_name,movie_leixing,user_sex,avg(rating_pingfen) as f
from rating_user_movie where user_sex like 'F'
group by movie_id,movie_name,movie_leixing,user_sex
limit 5;

-- Average rating per movie and user sex
select movie_id,movie_name,movie_leixing,user_sex,avg(rating_pingfen) as f
from rating_user_movie
group by movie_id,movie_name,movie_leixing,user_sex
limit 5;
-- 4 (continued). After averaging, split the data by sex first and then join the two halves back together
select m1.movie_id,m1.movie_name,m1.movie_leixing,f,m from
(select movie_id,movie_name,movie_leixing,user_sex,avg(rating_pingfen) as f
from rating_user_movie where user_sex like 'F'
group by movie_id,movie_name,movie_leixing,user_sex) f1 join
(select movie_id,movie_name,movie_leixing,user_sex,avg(rating_pingfen) as m
from rating_user_movie where user_sex like 'M'
group by movie_id,movie_name,movie_leixing,user_sex) m1 on f1.movie_id=m1.movie_id;

-- The 50 movies with the most ratings
select movie_id,movie_name,count(movie_id) as m1
from rating_user_movie
group by movie_id,movie_name
order by m1 desc
limit 50;

-- Average rating per movie and age group
select movie_id,user_age,avg(rating_pingfen) as f1
from rating_user_movie
group by movie_id,movie_name,user_age;

-- 5.1 Two-stage selection: take the 50 movies with the most ratings,
-- then join them with each movie's average rating per age group.
-- With seven age groups this gives 350 rows, which are stored as table data_age.
create table data_age
row format serde 'org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe'
stored as rcfile
as
select t1.movie_name,t2.user_age,pingjunfen from
(select movie_id,movie_name,count(movie_id) as m1
from rating_user_movie
group by movie_id,movie_name
order by m1 desc
limit 50) t1 join
(select movie_id,user_age,avg(rating_pingfen) as pingjunfen
from rating_user_movie
group by movie_id,movie_name,user_age) t2 on
t1.movie_id=t2.movie_id;

-- 5.2 Rows to columns
-- Version by age range (the age_* column aliases are illustrative names)
select movie_name,
case when user_age > 0 and user_age < 10 then pingjunfen else 0 end as age_1_9,
case when user_age > 9 and user_age < 19 then pingjunfen else 0 end as age_10_18,
case when user_age > 19 and user_age < 29 then pingjunfen else 0 end as age_20_28,
case when user_age > 29 and user_age < 39 then pingjunfen else 0 end as age_30_38,
case when user_age > 39 and user_age < 49 then pingjunfen else 0 end as age_40_48,
case when user_age > 49 and user_age < 59 then pingjunfen else 0 end as age_50_58
from data_age;

-- Version by age code (the age_* column aliases are illustrative names)
select movie_name,
case user_age when 1 then pingjunfen else 0 end as age_1,
case user_age when 18 then pingjunfen else 0 end as age_18,
case user_age when 25 then pingjunfen else 0 end as age_25,
case user_age when 35 then pingjunfen else 0 end as age_35,
case user_age when 45 then pingjunfen else 0 end as age_45,
case user_age when 50 then pingjunfen else 0 end as age_50,
case user_age when 56 then pingjunfen else 0 end as age_56
from data_age;
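
A case expression per age code still returns one row per (movie, age) pair, with a single non-zero column in each row. To collapse these into one row per movie, a common follow-up is to wrap each case in max() and group by movie_name. The sketch below tries that on SQLite from Python as a stand-in for Hive; the rows and the age_1 and age_18 aliases are made up, and only two of the seven age codes are shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table data_age (movie_name text, user_age int, pingjunfen real)")
conn.executemany(
    "insert into data_age values (?, ?, ?)",
    [("A", 1, 3.0), ("A", 18, 4.0), ("B", 18, 2.5)],  # hypothetical averages
)

pivot = conn.execute("""
    select movie_name,
           max(case user_age when 1  then pingjunfen else 0 end) as age_1,
           max(case user_age when 18 then pingjunfen else 0 end) as age_18
    from data_age
    group by movie_name
    order by movie_name
""").fetchall()
print(pivot)
```

Movie B has no rating for age code 1, so its age_1 column stays at the else value 0; using null instead of 0 would make missing groups distinguishable from a genuine zero.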
