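1. Data Types and File Formats
Data types
When creating a table you must specify a data type for every column. Besides the primitive types, Hive supports the collection types STRUCT, MAP, and ARRAY: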
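- STRUCT: STRUCT<first:INT, second:STRING>
  struct(5, 'jack')
  Access the fields as column_name.first and column_name.second
- MAP: MAP<STRING, FLOAT>
  map('first', 5.2)
  Look up a value with column_name['first']
- ARRAY: ARRAY<STRING>
  array('jack', 'rose')
  Access an element by zero-based index, e.g. column_name[0]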
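As a quick way to try the three access forms without creating a table, the following sketch builds one value of each collection type inline and reads it back. It assumes a Hive version that accepts SELECT without a FROM clause (0.13 or later). The aliases s, m, a, and t are illustrative, and named_struct is used in place of struct so that the fields carry the names first and second.

```sql
-- Build one STRUCT, one MAP, and one ARRAY inline, then access each of them.
SELECT
  s.first    AS struct_first,   -- STRUCT field access by name
  s.second   AS struct_second,
  m['first'] AS map_value,      -- MAP lookup by key
  a[0]       AS array_first     -- ARRAY access by zero-based index
FROM (
  SELECT
    named_struct('first', 5, 'second', 'jack') AS s,
    map('first', 5.2)                          AS m,
    array('jack', 'rose')                      AS a
) t;
```

The same expressions work unchanged on real columns declared with the corresponding collection types.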
Table creation example:
CREATE TABLE employees(
name STRING,
salary FLOAT,
subordinate ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, STATE:STRING>
);
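File formats
Delimiters:
- \n: line terminator, separates records
- ^A: separates fields (columns); in CREATE TABLE it can be written as the octal escape \001
- ^B: separates the elements of an ARRAY or STRUCT, or the key-value pairs of a MAP; octal \002
- ^C: separates a key from its value inside a MAP; octal \003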
CREATE TABLE employees(
name STRING,
salary FLOAT,
subordinate ARRAY<STRING>,
deductions MAP<STRING, FLOAT>,
address STRUCT<street:STRING, city:STRING, STATE:STRING>
)
-- ROW FORMAT DELIMITED must come before the other delimiter clauses
ROW FORMAT DELIMITED
-- use ^A as the field (column) separator
FIELDS TERMINATED BY '\001'
-- use ^B as the separator between collection elements
COLLECTION ITEMS TERMINATED BY '\002'
-- use ^C as the separator between a map key and its value
MAP KEYS TERMINATED BY '\003'
-- line terminator; currently only \n is supported
LINES TERMINATED BY '\n'
-- storage format
STORED AS TEXTFILE;
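The control characters above are hard to see in a text editor, so the same clauses are often easier to follow with printable delimiters. The sketch below is an assumption-laden illustration: the table name employees_csv and the choice of ',', '|' and ':' are hypothetical, and the sample record in the comment is made up.

```sql
-- Same clause order as above, but with printable delimiters.
CREATE TABLE IF NOT EXISTS employees_csv (
  name        STRING,
  salary      FLOAT,
  subordinate ARRAY<STRING>,
  deductions  MAP<STRING, FLOAT>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','             -- between columns
COLLECTION ITEMS TERMINATED BY '|'   -- between array elements / map entries
MAP KEYS TERMINATED BY ':'           -- between a map key and its value
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

-- A line of input that this layout would parse into four columns:
--   jack,5000.0,rose|tom,tax:0.2|insurance:0.05
```

Printable delimiters are convenient for small test files, but they break as soon as a field value contains the delimiter itself, which is why the control characters are the default.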
2. Database and Table Operations
2.1 Common database operations
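- SHOW DATABASES: lists all databases; to filter the list, append LIKE 'H.*' with a regular expression
- CREATE DATABASE financials: creates a database
- DESCRIBE DATABASE financials: shows where the database is stored
- USE: makes the named database the current working database
- DROP DATABASE IF EXISTS financials: drops the database
- SHOW PARTITIONS employees: lists all partitions that exist in a table
- DESCRIBE EXTENDED employees: shows detailed table information, including the partition keys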
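The commands above can be strung together into a short session. This is a minimal sketch that reuses the financials database name from the list; IF NOT EXISTS and CASCADE are additions for safety and are not part of the list above.

```sql
-- Create the database only if it is not already there.
CREATE DATABASE IF NOT EXISTS financials;

-- Confirm it exists and see where its directory lives.
SHOW DATABASES;
DESCRIBE DATABASE financials;

-- Make it the working database, then list its tables.
USE financials;
SHOW TABLES;

-- Dropping a database that still contains tables fails unless CASCADE
-- is given, in which case the tables are dropped first.
DROP DATABASE IF EXISTS financials CASCADE;
```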
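2.2 Creating tables
Managed tables, external tables, and partitioned tables
Differences between managed and external tables:
When data is loaded into a managed table, Hive moves it into the directory under the data warehouse path. For an external table, Hive only records the path where the data lives and does not change its location; the data stays in the directory you specify. When a table is dropped, a managed table's metadata and data are deleted together, whereas an external table loses only its metadata and its data is left in place.
Use cases for managed, external, and partitioned tables:
① External tables: suppose a company keeps its raw log data in one directory and several departments analyze it. Creating external tables is the sensible choice, because the raw data will not be deleted when a table is dropped;
② Managed tables: for building tables that store raw data or relatively important intermediate data;
③ Partitioned tables: store each hour's or each day's log files in its own partition, so that analysis of a specific time range does not have to scan all of the data;
Creation statements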
-- create a managed table
CREATE TABLE IF NOT EXISTS table_name(...);
-- create an external table
CREATE EXTERNAL TABLE IF NOT EXISTS table_name(...);
-- create an external partitioned table
CREATE EXTERNAL TABLE IF NOT EXISTS table_name(...)
PARTITIONED BY (country STRING, state STRING);
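To make the managed/external distinction concrete, here is a sketch of an external partitioned table over a log directory. The table name raw_logs, its columns, the tab delimiter, and the path /data/raw_logs are all hypothetical.

```sql
-- External table: Hive records the LOCATION but does not take ownership
-- of the files stored there.
CREATE EXTERNAL TABLE IF NOT EXISTS raw_logs (
  ts      STRING,
  message STRING
)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw_logs';

-- The "Table Type" field in the output distinguishes
-- EXTERNAL_TABLE from MANAGED_TABLE.
DESCRIBE FORMATTED raw_logs;

-- Removes only the metadata; the files under /data/raw_logs remain.
DROP TABLE IF EXISTS raw_logs;
```

Because the drop leaves the files untouched, another team can define its own external table over the same directory without coordinating deletes.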
2.3 Altering tables
Use ALTER TABLE to modify a table.
Renaming a table
ALTER TABLE previous_name RENAME TO new_name;
Adding, modifying, and dropping partitions
Add a partition:
ALTER TABLE table_name ADD IF NOT EXISTS
PARTITION (year=2011, month=1, day=1) LOCATION '/logs/2011/01/01';
Change the location a partition points to (this updates the metadata only; existing data is not moved):
ALTER TABLE table_name PARTITION (year=2011, month=1, day=1)
SET LOCATION '/new_logs/2011/01/01';
Drop a partition:
ALTER TABLE table_name DROP IF EXISTS PARTITION (year=2011, month=1, day=1);
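Modifying columns
Add columns:
ALTER TABLE table_name ADD COLUMNS(
app_name STRING COMMENT 'Application name',
session_id BIGINT COMMENT 'the current session id'
);
Delete or replace columns (REPLACE COLUMNS drops all existing columns and defines the listed ones in their place):
ALTER TABLE table_name REPLACE COLUMNS(
hms INT COMMENT 'hour, minute, seconds'
…
);
Dropping a table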
DROP TABLE IF EXISTS employees;
2.4 Inserting data
2.4.1 Loading local data
LOAD DATA LOCAL INPATH '/data_path'
OVERWRITE INTO TABLE employees
PARTITION (country='US', state='CA');
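The OVERWRITE keyword deletes the existing data first and then loads the new data. To append instead, omit OVERWRITE and write only INTO TABLE. If the target partition already exists the data is added to it directly; if not, the partition is created first.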
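A sketch of the two variations on the statement above: the append form, and the form without LOCAL. The path /staging/employees_us_ca is hypothetical.

```sql
-- Append: without OVERWRITE, files already in the partition are kept.
LOAD DATA LOCAL INPATH '/data_path'
INTO TABLE employees
PARTITION (country='US', state='CA');

-- Without LOCAL, the path refers to the cluster filesystem (e.g. HDFS),
-- and the files are moved, not copied, into the table's directory.
LOAD DATA INPATH '/staging/employees_us_ca'
OVERWRITE INTO TABLE employees
PARTITION (country='US', state='CA');
```

LOAD DATA does not parse or validate the files; they only need to match the table's declared file format.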
2.4.2 Inserting data from a query
INSERT OVERWRITE TABLE employees
PARTITION (country='US', state='OR')
SELECT * FROM staged_employees se
WHERE se.cnty = 'US' AND se.st = 'OR';
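2.4.3 Dynamic partition inserts
With dynamic partitioning you do not name the target partitions; Hive derives them from the values returned by the query. Two settings have to be changed before using it: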
set hive.exec.dynamic.partition=true;
Enables dynamic partitioning. When inserting data you no longer have to give the partition values; Hive picks them automatically.
set hive.exec.dynamic.partition.mode=nonstrict;
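Sets the dynamic partition mode. The default is strict, which requires at least one partition column to be static; nonstrict allows every partition column to be dynamic.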
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT ..., se.cnty, se.st
FROM staged_employees se;
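Static and dynamic partition columns can also be mixed, which is the case strict mode is designed for. The sketch below assumes a hypothetical target table employees_by_state and a staging table staged_employees with columns name, salary, cnty, and st; none of these definitions come from the text above.

```sql
-- Hypothetical target table with two partition columns.
CREATE TABLE IF NOT EXISTS employees_by_state (
  name   STRING,
  salary FLOAT
)
PARTITIONED BY (country STRING, state STRING);

set hive.exec.dynamic.partition=true;
-- strict is enough here because country is given as a static value
set hive.exec.dynamic.partition.mode=strict;

-- country is static, state is dynamic. Static partition columns must be
-- listed before dynamic ones, and the dynamic values come from the last
-- column(s) of the SELECT list, matched by position.
INSERT OVERWRITE TABLE employees_by_state
PARTITION (country='US', state)
SELECT se.name, se.salary, se.st
FROM staged_employees se
WHERE se.cnty = 'US';
```

Because the mapping is positional, the name of the last selected column is irrelevant; only its position and value matter.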
3. Querying Tables
3.1 SELECT … FROM
3.1.1 Optional operations on columns:
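- Regular expressions: SELECT `price.*` FROM employees selects every column whose name matches the pattern. The pattern goes in backticks, not quotes; on Hive 0.13 and later this also requires set hive.support.quoted.identifiers=none
- Arithmetic operators: + - * / % (modulo)
- Math functions, commonly: round(DOUBLE d, INT n) rounds to n decimal places; sqrt, abs, exp, ln
- Aggregate functions, commonly:
  - count(*): counts all rows, including rows that contain NULL
  - count(col): counts the rows in which col is not NULL
  - sum, avg, min, max, variance
  - covar_pop(col1, col2): returns the population covariance
  - corr(col1, col2): returns the correlation coefficient
- Other built-in functions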
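The difference between count(*) and count(col) is easiest to see on a tiny inline data set. This sketch assumes a Hive version that accepts SELECT without FROM (0.13 or later); the column x and the three rows are made up for illustration.

```sql
-- Three rows: 1, 3, and NULL.
SELECT
  count(*) AS all_rows,       -- 3: the NULL row is counted
  count(x) AS non_null_rows,  -- 2: the NULL row is skipped
  sum(x)   AS total,          -- 4: NULL is ignored
  avg(x)   AS mean            -- 2.0: 4 divided by the 2 non-NULL rows
FROM (
  SELECT 1 AS x
  UNION ALL
  SELECT 3 AS x
  UNION ALL
  SELECT CAST(NULL AS INT) AS x
) t;
```

Note that avg divides by the number of non-NULL values, not by the total row count.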
3.1.2 Other optional clauses
SELECT e.col1 as col1_name FROM employees e LIMIT 10;
AS: column alias; LIMIT: caps the number of rows returned
3.1.3 Nested SELECT statements
SELECT e.name, e.salary FROM
(SELECT person_id as name, salary FROM employees) e
WHERE condition_1 AND condition_2;
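3.2 WHERE clauses
- As in the example above, WHERE restricts the rows a SELECT returns; combine conditions with AND or OR.
- Rows can also be filtered by pattern: LIKE uses SQL wildcards, and RLIKE takes a regular expression.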
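A short sketch of both pattern operators against the employees table defined earlier; the patterns and salary thresholds are illustrative values only.

```sql
-- LIKE: % matches any sequence of characters, _ matches exactly one.
SELECT name, salary
FROM employees
WHERE name LIKE 'j%' AND salary > 1000;

-- RLIKE: the pattern is a Java regular expression and is satisfied if it
-- matches any substring, so ^ and $ are needed for a whole-string match.
SELECT name, salary
FROM employees
WHERE name RLIKE '^(jack|rose)$' OR salary > 100000;
```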
3.3 GROUP BY and HAVING clauses
WHERE filters the rows of the source table, GROUP BY groups the rows that remain, and HAVING filters the groups produced by GROUP BY. The order is WHERE -> GROUP BY -> HAVING.
SELECT year(ymd), avg(price_close) FROM stocks
WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
GROUP BY year(ymd)
HAVING avg(price_close) > 50.0;
3.4 JOIN statements
JOIN … ON: JOIN combines two tables, and ON gives the condition on which they are joined.
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';
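INNER JOIN: a plain JOIN is an INNER JOIN by default; a row is returned only when a match exists on both sides
LEFT OUTER JOIN: every row of the left table is returned; where the right table has no match, its columns are filled with NULL
RIGHT OUTER JOIN: the same with the roles reversed; every row of the right table is returned and unmatched left-side columns are NULL
FULL OUTER JOIN: every row of both tables is returned, with NULL on whichever side has no match
JOIN without an ON condition (CROSS JOIN): Cartesian product
When a query has both WHERE and JOIN, the join is performed first and WHERE is applied afterwards. With an outer join, many rows carry NULL in the columns of the unmatched side, so a WHERE condition on those columns may filter them out.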
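The interaction between WHERE and an outer join can be sketched with the stocks and dividends tables from the example above. The threshold 0.1 is an arbitrary illustrative value.

```sql
-- Filter applied after the join: stock rows with no matching dividend have
-- d.dividend = NULL, the comparison NULL > 0.1 is not true, and those rows
-- are dropped. The result behaves like an inner join.
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s
LEFT OUTER JOIN dividends d
  ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL' AND d.dividend > 0.1;

-- Filter the right-hand table in a subquery before joining: every AAPL row
-- from stocks is kept, with NULL where no qualifying dividend exists.
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s
LEFT OUTER JOIN (
  SELECT ymd, symbol, dividend
  FROM dividends
  WHERE dividend > 0.1
) d
  ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';
```

The condition on s.symbol can stay in the outer WHERE in both queries because it refers to the preserved (left) side, whose columns are never NULL-filled by the join.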
4. An Example
Official API documentation:
https://cwiki.apache.org/confluence/display/Hive/Home#Home-UserDocumentation
Tool: beeline
Common commands:
- 1. !connect url: connect to a different HiveServer2 instance
- 2. !exit: exit the shell
- 3. !help: show the full list of commands
Code
beeline -e "
# enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
# use database1
use database1;
# create an external table to make the data easier to analyze
CREATE EXTERNAL TABLE IF NOT EXISTS person_feature
(
imei STRING COMMENT 'device ID',
shop_app_usage_duration DOUBLE COMMENT 'usage duration',
pay_app_usage_duration DOUBLE COMMENT 'usage duration'
)
COMMENT 'user information table'
PARTITIONED BY (pt_d STRING COMMENT 'daily partition')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION '/AppData/data1'
TBLPROPERTIES('orc.compress'='ZLIB')
;
INSERT OVERWRITE TABLE ads_persona_datamining_consumption_level_features_dev_ds
PARTITION(pt_d='$date')
SELECT
t1.imei AS imei
,t1.shop_app_usage_duration AS shop_app_usage_duration
,t2.pay_app_usage_duration AS pay_app_usage_duration
,t3.finance_app_usage_duration AS finance_app_usage_duration
,t4.bank_app_usage_duration AS bank_app_usage_duration
,t5.travel_app_usage_duration AS travel_app_usage_duration
,t6.carUsage_app_usage_duration AS carUsage_app_usage_duration
,t7.house_app_usage_duration AS house_app_usage_duration
,t8.takeout_app_usage_duration AS takeout_app_usage_duration
,t9.carInfo_app_usage_duration AS carInfo_app_usage_duration
,t10.stock_app_usage_duration AS stock_app_usage_duration
,t11.carRaising_app_usage_duration AS carRaising_app_usage_duration
,t12.price_dev AS price_dev
,t13.city AS city
,t14.tvlr_hobby_dev AS tvlr_hobby_dev
,t14.bsn_tvlr_dev AS bsn_tvlr_dev
,t15.didi_use_duration AS didi_use_duration
,t16.hw_pay_30dy_pay_amt_dev AS hw_pay_30dy_pay_amt_dev
FROM
(
SELECT
imei
,duration_30d as shop_app_usage_duration
FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
WHERE category='商城'
)t1
LEFT JOIN
(
SELECT
imei
,duration_30d as pay_app_usage_duration
FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
WHERE category='支付'
)t2
ON (t1.imei=t2.imei)
LEFT JOIN
(
SELECT
imei
,duration_30d as finance_app_usage_duration
FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
WHERE category='理財'
)t3
ON (t1.imei=t3.imei)
LEFT JOIN
(
SELECT
imei
,duration_30d as bank_app_usage_duration
FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
WHERE category='銀行'
)t4
ON (t1.imei=t4.imei)
LEFT JOIN
(
SELECT
imei
,duration_30d as travel_app_usage_duration
FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
WHERE category='旅遊'
)t5
ON (t1.imei=t5.imei)
LEFT JOIN
(
SELECT
imei
,duration_30d as carUsage_app_usage_duration
FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
WHERE category='用車'
)t6
ON (t1.imei=t6.imei)
LEFT JOIN
(
SELECT
imei
,duration_30d as house_app_usage_duration
FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
WHERE category='租房買房'
)t7
ON (t1.imei=t7.imei)
LEFT JOIN
(
SELECT
imei
,duration_30d as takeout_app_usage_duration
FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
WHERE category='外賣'
)t8
ON (t1.imei=t8.imei)
LEFT JOIN
(
SELECT
imei
,duration_30d as carInfo_app_usage_duration
FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
WHERE category='汽車資訊'
)t9
ON (t1.imei=t9.imei)
LEFT JOIN
(
SELECT
imei
,duration_30d as stock_app_usage_duration
FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
WHERE category='股票基金'
)t10
ON (t1.imei=t10.imei)
LEFT JOIN
(
SELECT
imei
,duration_30d as carRaising_app_usage_duration
FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
WHERE category='養車'
)t11
ON (t1.imei=t11.imei)
LEFT JOIN
(
SELECT
imei
,price_dev
FROM ads_persona_label_0level_v3_price_dev_ds
WHERE pt_d='$date'
)t12
ON (t1.imei=t12.imei)
LEFT JOIN
(
SELECT
id
,city
FROM ads_persona_label_0level_v3_residence_new_ds
WHERE pt_d='$date' AND pt_id_type='dev'
)t13
ON (t1.imei=t13.id)
LEFT JOIN
(
SELECT
imei
,tvlr_hobby_dev
,bsn_tvlr_dev
FROM ads_persona_label_0level_v3_app_pref_tvlr_hobby_dev_ds
WHERE pt_d=concat('$month','01')
)t14
ON (t1.imei=t14.imei)
LEFT JOIN
(
SELECT
imei
,use_duration AS didi_use_duration
FROM ads_persona_label_1level_app_usage_30d_dev_dm
WHERE pt_d='$date' AND package_name='com.sdu.didi.psnger'
)t15
ON (t1.imei=t15.imei)
LEFT JOIN
(
SELECT
imei
,hw_pay_30dy_pay_amt_dev
FROM ads_persona_label_0level_v3_hw_pay_30dy_fee_dev_ds
WHERE pt_d='$date'
)t16
ON (t1.imei=t16.imei)
;
# @DESC keep only the most recent 30 days of data
ALTER TABLE ads_persona_datamining_consumption_level_features_dev_ds DROP IF EXISTS PARTITION (pt_d ='${start_time,-30,yyyyMMdd}');
"