Hive Notes (Queries and Table Creation)

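1. Data Types and File Formats

Data Types

Every column needs a data type when a table is created. Besides the primitive types, Hive supports the collection types STRUCT, MAP, and ARRAY: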

  • STRUCT: STRUCT<first:INT, second:STRING>
    struct(5, 'jack')
    Access fields as column_name.first and column_name.second
  • MAP: MAP<STRING, FLOAT>
    map('first', 5.2)
    Look up a value as column_name['first']
  • ARRAY: ARRAY<STRING>
    array('jack', 'rose')
    Access elements by zero-based index, e.g. column_name[0]
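As a quick check of the access syntax above, the following sketch builds one value of each collection type inline and reads it back. It assumes a Hive version that accepts SELECT without a FROM clause (0.13 or later); the aliases t, s, m, and a are illustrative only. named_struct is used rather than struct so that the fields carry the names first and second.

```sql
-- Build one STRUCT, one MAP, and one ARRAY inline, then read them back.
SELECT
    t.s.first    AS struct_first,   -- STRUCT field by name
    t.s.second   AS struct_second,
    t.m['first'] AS map_value,      -- MAP value by key
    t.a[0]       AS array_first,    -- ARRAY element by zero-based index
    size(t.a)    AS array_length    -- number of elements in the ARRAY
FROM (
    SELECT
        named_struct('first', 5, 'second', 'jack') AS s,
        map('first', 5.2)                          AS m,
        array('jack', 'rose')                      AS a
) t;
```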

Table creation example:

CREATE TABLE employees(
name        STRING,
salary      FLOAT,
subordinate ARRAY<STRING>,
deductions  MAP<STRING, FLOAT>,
address     STRUCT<street:STRING, city:STRING, state:STRING>
);

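File Formats

Delimiters:

  • \n: line terminator (separates rows)
  • ^A: separates fields (columns); in CREATE TABLE it can be written as the octal escape \001
  • ^B: separates the elements of an ARRAY or STRUCT, or the key-value pairs of a MAP; octal \002
  • ^C: separates the key from the value inside a MAP entry; octal \003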

CREATE TABLE employees(
name        STRING,
salary      FLOAT,
subordinate ARRAY<STRING>,
deductions  MAP<STRING, FLOAT>,
address     STRUCT<street:STRING, city:STRING, state:STRING>
)
-- ROW FORMAT DELIMITED must come before the delimiter clauses below
ROW FORMAT DELIMITED
-- use ^A as the field (column) delimiter
FIELDS TERMINATED BY '\001'
-- use ^B as the delimiter between collection elements
COLLECTION ITEMS TERMINATED BY '\002'
-- use ^C as the delimiter between a map key and its value
MAP KEYS TERMINATED BY '\003'
-- line terminator; currently only '\n' is supported
LINES TERMINATED BY '\n'
-- storage format
STORED AS TEXTFILE;
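The control characters above are Hive's defaults for text files, so they rarely need to be spelled out. For hand-written test data, printable delimiters are often easier to work with. A minimal sketch, assuming a hypothetical table employees_csv and a comma/pipe/colon layout that is not part of the original notes:

```sql
-- Hypothetical table using printable delimiters instead of ^A, ^B, ^C.
-- A matching line in the data file would look like:
--   jack,8000.0,rose|tom,tax:0.2|insurance:0.05,1 Main St|Chicago|IL
CREATE TABLE IF NOT EXISTS employees_csv (
    name        STRING,
    salary      FLOAT,
    subordinate ARRAY<STRING>,
    deductions  MAP<STRING, FLOAT>,
    address     STRUCT<street:STRING, city:STRING, state:STRING>
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','            -- between columns
COLLECTION ITEMS TERMINATED BY '|'  -- between array/struct elements and map entries
MAP KEYS TERMINATED BY ':'          -- between a map key and its value
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
```

The trade-off is that the data itself must never contain a comma, pipe, or colon, which is the reason the default delimiters are non-printing characters.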

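2. Database and Table Operations

2.1 Common Database Operations

  • SHOW DATABASES: list all databases; add LIKE 'h.*' to filter the names by a pattern
  • CREATE DATABASE financials: create a database
  • DESCRIBE DATABASE financials: show where the database is stored
  • USE financials: make a database the current working database
  • DROP DATABASE IF EXISTS financials: drop a database
  • SHOW PARTITIONS employees: list all partitions of a table (the argument is a table name, not a database name)
  • DESCRIBE EXTENDED employees: show detailed table metadata, including the partition keys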
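Put together, the database commands above might be used in a session like the following sketch. The comment text and the 'fin*' pattern are assumptions for illustration; the Hive documentation describes `*` and `|` as the wildcards accepted by SHOW DATABASES LIKE.

```sql
-- Create the database only if it is not already there.
CREATE DATABASE IF NOT EXISTS financials
COMMENT 'Holds all financial tables';

-- List databases whose names start with "fin".
SHOW DATABASES LIKE 'fin*';

-- Show the comment and the storage location of the database.
DESCRIBE DATABASE financials;

-- Make it the current database, then list its tables.
USE financials;
SHOW TABLES;

-- Switch away before dropping. CASCADE also drops any tables inside;
-- without it, dropping a non-empty database fails.
USE default;
DROP DATABASE IF EXISTS financials CASCADE;
```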

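2.2 Creating Tables

Managed, External, and Partitioned Tables

Difference between managed and external tables:
A managed table moves its data into the directory that the warehouse points to. An external table only records the path where the data lives and does not move it. When a table is dropped, Hive deletes both the metadata and the data of a managed table, but only the metadata of an external table; the data files stay in place.
Typical uses:
① External tables: suppose a company keeps its raw logs in one directory and several departments analyze them. An external table is the sensible choice, because dropping the table never deletes the raw data.
② Managed tables: for storing raw data or important intermediate data as tables.
③ Partitioned tables: store each hour's or each day's log files in a separate partition, so that an analysis of one time range does not have to scan all the data.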

Creation statements

-- managed table
CREATE TABLE IF NOT EXISTS table_name(...);
-- external table
CREATE EXTERNAL TABLE IF NOT EXISTS table_name(...);
-- external partitioned table
CREATE EXTERNAL TABLE IF NOT EXISTS table_name(...)
PARTITIONED BY (country STRING, state STRING);
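To make the drop behaviour of external tables concrete, here is a sketch of an external partitioned table over files that already exist. The table name raw_logs, its columns, and the path /data/raw_logs are all hypothetical.

```sql
-- External table over files that already exist under a hypothetical path.
CREATE EXTERNAL TABLE IF NOT EXISTS raw_logs (
    ts      STRING COMMENT 'event timestamp as text',
    message STRING COMMENT 'raw log line'
)
PARTITIONED BY (dt STRING COMMENT 'day partition, e.g. 20110101')
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/raw_logs';

-- Register one day's directory as a partition.
ALTER TABLE raw_logs ADD IF NOT EXISTS
PARTITION (dt = '20110101') LOCATION '/data/raw_logs/20110101';

-- "Table Type" in the output reads EXTERNAL_TABLE for an external table
-- and MANAGED_TABLE for a managed one.
DESCRIBE FORMATTED raw_logs;

-- Removes only the metadata; the files under /data/raw_logs stay in place.
DROP TABLE IF EXISTS raw_logs;
```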

2.3 Altering Tables

Tables are modified with the ALTER keyword.

Renaming a table

ALTER TABLE previous_name RENAME TO new_name;

Adding, modifying, and dropping partitions

Add a partition:
ALTER TABLE table_name ADD IF NOT EXISTS
PARTITION (year=2011, month=1, day=1) LOCATION '/logs/2011/01/01';
Modify a partition by pointing it at a new location (this does not move the existing data):
ALTER TABLE table_name PARTITION (year=2011, month=1, day=1)
SET LOCATION '/new_logs/2011/01/01';
Drop a partition:
ALTER TABLE table_name DROP IF EXISTS PARTITION (year=2011, month=1, day=1);

Modifying columns

Add columns:
ALTER TABLE table_name ADD COLUMNS(
app_name STRING COMMENT 'Application name',
session_id BIGINT COMMENT 'the current session id'
);
Remove or replace columns (REPLACE COLUMNS swaps the whole column list for the one given):
ALTER TABLE table_name REPLACE COLUMNS(
hms INT COMMENT 'hour, minute, seconds'
);
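ADD COLUMNS and REPLACE COLUMNS work on the column list as a whole. To rename a single column or change its comment, ALTER TABLE ... CHANGE COLUMN is the usual tool. A small sketch, assuming a hypothetical table log_messages created only for this illustration:

```sql
-- Hypothetical table used only for this illustration.
CREATE TABLE IF NOT EXISTS log_messages (
    hms      INT    COMMENT 'hour, minute, seconds',
    severity STRING COMMENT 'log level',
    message  STRING COMMENT 'log text'
);

-- Rename hms and update its comment while keeping the type.
-- Only the table metadata changes; the data files are not rewritten.
ALTER TABLE log_messages
CHANGE COLUMN hms hours_minutes_seconds INT
COMMENT 'hour, minute, seconds of the timestamp';

-- Inspect the resulting column list.
DESCRIBE log_messages;
```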

Dropping a table

DROP TABLE IF EXISTS employees;

2.4 Inserting Data

2.4.1 Loading local data

LOAD DATA LOCAL INPATH '/data_path'
OVERWRITE INTO TABLE employees
PARTITION (country='US', state='CA');

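The OVERWRITE keyword deletes the existing data first and then writes the new data. To append instead, drop OVERWRITE and keep INTO TABLE. If the target partition already exists, the data is added to it; if not, the partition is created first.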
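The two variants side by side, as a sketch. It assumes employees is partitioned by country and state as in the statement above; the path /staging/employees_ca is hypothetical.

```sql
-- Append: without OVERWRITE, the new file is added alongside existing data.
LOAD DATA LOCAL INPATH '/data_path'
INTO TABLE employees
PARTITION (country='US', state='CA');

-- Without LOCAL, the path refers to the cluster file system (e.g. HDFS) and the
-- files are moved, not copied, into the partition's directory.
LOAD DATA INPATH '/staging/employees_ca'
OVERWRITE INTO TABLE employees
PARTITION (country='US', state='CA');
```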

2.4.2 Inserting data from a query

INSERT OVERWRITE TABLE employees
PARTITION (country='US', state='OR')
SELECT * FROM staged_employees se
WHERE se.cnty = 'US' AND se.st = 'OR';

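2.4.3 Dynamic partition inserts

With dynamic partitioning you do not name each target partition; Hive derives the partition values from the query result. Two settings need to be changed first:
set hive.exec.dynamic.partition=true;
Enables dynamic partitioning, so an insert does not have to give explicit partition values.
set hive.exec.dynamic.partition.mode=nonstrict;
Sets the dynamic partition mode. The default, strict, requires at least one partition column to be static; nonstrict allows every partition column to be dynamic.

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- the dynamic partition columns must come last in the SELECT list,
-- in the same order as in the PARTITION clause
INSERT OVERWRITE TABLE employees
PARTITION (country, state)
SELECT ..., se.cnty, se.st
FROM staged_employees se;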
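Static and dynamic partition columns can also be mixed, which satisfies strict mode because one partition column is fixed. The sketch below uses a hypothetical target table employees_by_region with only two data columns, and assumes staged_employees has the columns name, salary, cnty, and st.

```sql
-- Hypothetical target table, partitioned by country and state.
CREATE TABLE IF NOT EXISTS employees_by_region (
    name   STRING,
    salary FLOAT
)
PARTITIONED BY (country STRING, state STRING);

set hive.exec.dynamic.partition=true;

-- country is static and state is dynamic, so strict mode is satisfied.
-- Static partition columns must precede dynamic ones in the PARTITION clause.
INSERT OVERWRITE TABLE employees_by_region
PARTITION (country = 'US', state)
SELECT
    se.name,
    se.salary,
    se.st          -- last column feeds the dynamic partition column "state"
FROM staged_employees se
WHERE se.cnty = 'US';

-- Check which partitions were created.
SHOW PARTITIONS employees_by_region;
```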

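3. Querying Tables

3.1 SELECT … FROM

3.1.1 Column expressions

  • Regular expressions: SELECT `price.*` FROM stocks selects every column whose name matches the pattern. The pattern goes in backticks, not quotes; on Hive 0.13 and later it also requires set hive.support.quoted.identifiers=none
  • Arithmetic operators: + - * / % (modulo)
  • Math functions: round(DOUBLE d, INT n) rounds to n decimal places; also sqrt, abs, exp, ln
  • Aggregate functions:
    • count(*): total number of rows, including rows that contain NULL
    • count(col): number of rows where col is not NULL
    • sum, avg, min, max, variance
    • covar_pop(col1, col2): population covariance
    • corr(col1, col2): Pearson correlation coefficient
  • Built-in functions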
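Several of the functions listed above combined in one query, as a sketch. It assumes a stocks table with the columns symbol and price_close used elsewhere in these notes, plus a price_open column that is an assumption here.

```sql
-- Per-symbol summary over the stocks table.
SELECT
    symbol,
    count(*)                                AS row_count,       -- all rows in the group
    count(price_close)                      AS priced_rows,     -- rows with a non-NULL close
    round(avg(price_close), 2)              AS avg_close,       -- mean, 2 decimal places
    round(sqrt(variance(price_close)), 2)   AS stddev_close,    -- population std. deviation
    round(corr(price_open, price_close), 4) AS open_close_corr  -- Pearson correlation
FROM stocks
GROUP BY symbol
LIMIT 10;
```

The gap between row_count and priced_rows is a quick way to see how many rows in each group have a NULL closing price.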

3.1.2 Other optional clauses

SELECT e.col1 AS col1_name FROM employees e LIMIT 10;
AS assigns a column alias; LIMIT caps the number of rows returned.

3.1.3 Nested SELECT statements

SELECT e.name, e.salary FROM
(SELECT person_id AS name, salary FROM employees) e
WHERE <condition> AND <condition>;

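3.2 WHERE Clause

  • As in the example above, WHERE restricts the rows a SELECT returns; combine conditions with AND or OR.
  • Patterns can be used for filtering: LIKE takes SQL wildcards (% and _), RLIKE takes a Java regular expression.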
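The difference between the two pattern operators is easiest to see side by side. A short sketch against the employees table, using only its name and salary columns; the literal values are made up for illustration.

```sql
-- LIKE: % matches any run of characters, _ matches exactly one character.
SELECT name, salary
FROM employees
WHERE name LIKE 'ja%'
  AND salary > 5000;

-- RLIKE: the pattern is a Java regular expression and may match any
-- substring, so anchors are needed for a whole-string match.
SELECT name, salary
FROM employees
WHERE name RLIKE '^(jack|rose)$'
   OR name RLIKE 'son$';
```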

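3.3 GROUP BY and HAVING Clauses

WHERE filters rows of the source table, GROUP BY groups the remaining rows, and HAVING filters the groups produced by GROUP BY. The order of evaluation is WHERE -> GROUP BY -> HAVING.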

SELECT year(ymd), avg(price_close) FROM stocks
WHERE exchange = 'NASDAQ' AND symbol = 'AAPL'
GROUP BY year(ymd)
HAVING avg(price_close) > 50.0;

3.4 JOIN

JOIN … ON: JOIN combines two tables, and ON gives the join condition.

SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL';

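INNER JOIN: the default for a plain JOIN; a row appears only when both sides have a match
LEFT OUTER JOIN: every row of the left table appears; right-side columns are NULL where there is no match
RIGHT OUTER JOIN: the mirror image; every row of the right table appears, with NULLs for the left side where there is no match
FULL OUTER JOIN: every row of both tables appears, with NULLs on whichever side has no match
CROSS JOIN (or JOIN without an ON condition): Cartesian product

When a query has both a JOIN and a WHERE, the join is evaluated first and the WHERE filter second. With an outer join, the unmatched side is filled with NULLs, so a WHERE condition on those columns can filter those rows out again.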
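The interaction between outer joins and WHERE is easy to get wrong, so here is a sketch of the problem and one common workaround. It assumes the stocks and dividends tables and the columns used in the join example of this section.

```sql
-- Problem: the WHERE test on d.symbol discards rows where the dividends side
-- is NULL, so this LEFT OUTER JOIN returns the same rows as an inner join.
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM stocks s
LEFT OUTER JOIN dividends d
    ON s.ymd = d.ymd AND s.symbol = d.symbol
WHERE s.symbol = 'AAPL'
  AND d.symbol = 'AAPL';

-- One way around it: filter each table in a subquery before joining, so
-- unmatched stock rows survive with d.dividend set to NULL.
SELECT s.ymd, s.symbol, s.price_close, d.dividend
FROM (SELECT ymd, symbol, price_close FROM stocks WHERE symbol = 'AAPL') s
LEFT OUTER JOIN
     (SELECT ymd, symbol, dividend FROM dividends WHERE symbol = 'AAPL') d
    ON s.ymd = d.ymd AND s.symbol = d.symbol;
```

Filtering in subqueries keeps the intent explicit: each side is narrowed first, and the outer join then decides which rows are padded with NULLs.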

4. A Worked Example

Official documentation:

https://cwiki.apache.org/confluence/display/Hive/Home#Home-UserDocumentation

Tool used: beeline

Common commands:

  • 1. !connect url: connect to a different HiveServer2 instance
  • 2. !exit: leave the shell
  • 3. !help: list all commands

Code

beeline -e "
# enable dynamic partitioning
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
# switch to database1
use database1;
# create an external table to make the data easier to analyze
CREATE EXTERNAL TABLE IF NOT EXISTS person_feature
(
    imei                STRING COMMENT 'device ID',
    shop_app_usage_duration          DOUBLE COMMENT 'usage duration',
    pay_app_usage_duration          DOUBLE COMMENT 'usage duration'
)
COMMENT 'user information table'
PARTITIONED BY (pt_d STRING COMMENT 'day partition')
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\001'
LINES TERMINATED BY '\n'
STORED AS ORC
LOCATION '/AppData/data1'
TBLPROPERTIES('orc.compress'='ZLIB')
;
INSERT OVERWRITE TABLE ads_persona_datamining_consumption_level_features_dev_ds
PARTITION(pt_d='$date')
SELECT
    t1.imei                 AS imei
    ,t1.shop_app_usage_duration  AS shop_app_usage_duration
    ,t2.pay_app_usage_duration   AS pay_app_usage_duration
    ,t3.finance_app_usage_duration AS finance_app_usage_duration
    ,t4.bank_app_usage_duration  AS bank_app_usage_duration
    ,t5.travel_app_usage_duration  AS travel_app_usage_duration
    ,t6.carUsage_app_usage_duration  AS carUsage_app_usage_duration
    ,t7.house_app_usage_duration  AS house_app_usage_duration
    ,t8.takeout_app_usage_duration  AS takeout_app_usage_duration
    ,t9.carInfo_app_usage_duration  AS carInfo_app_usage_duration
    ,t10.stock_app_usage_duration  AS stock_app_usage_duration
    ,t11.carRaising_app_usage_duration  AS carRaising_app_usage_duration
    ,t12.price_dev          AS price_dev
    ,t13.city               AS city
    ,t14.tvlr_hobby_dev         AS tvlr_hobby_dev
    ,t14.bsn_tvlr_dev       AS bsn_tvlr_dev
    ,t15.didi_use_duration  AS didi_use_duration
    ,t16.hw_pay_30dy_pay_amt_dev  AS hw_pay_30dy_pay_amt_dev
FROM
    (
        SELECT
            imei
            ,duration_30d as shop_app_usage_duration
        FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
        WHERE category='商城'
    )t1
    LEFT JOIN
    (
        SELECT
            imei
            ,duration_30d as pay_app_usage_duration
        FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
        WHERE category='支付'
    )t2
    ON (t1.imei=t2.imei)
    LEFT JOIN
    (
        SELECT
            imei
            ,duration_30d as finance_app_usage_duration
        FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
        WHERE category='理財'
    )t3
    ON (t1.imei=t3.imei)
    LEFT JOIN
    (
        SELECT
            imei
            ,duration_30d as bank_app_usage_duration
        FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
        WHERE category='銀行'
    )t4
    ON (t1.imei=t4.imei)
    LEFT JOIN
    (
        SELECT
            imei
            ,duration_30d as travel_app_usage_duration
        FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
        WHERE category='旅遊'
    )t5
    ON (t1.imei=t5.imei)
    LEFT JOIN
    (
        SELECT
            imei
            ,duration_30d as carUsage_app_usage_duration
        FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
        WHERE category='用車'
    )t6
    ON (t1.imei=t6.imei)
    LEFT JOIN
    (
        SELECT
            imei
            ,duration_30d as house_app_usage_duration
        FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
        WHERE category='租房買房'
    )t7
    ON (t1.imei=t7.imei)
    LEFT JOIN
    (
        SELECT
            imei
            ,duration_30d as takeout_app_usage_duration
        FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
        WHERE category='外賣'
    )t8
    ON (t1.imei=t8.imei)
    LEFT JOIN
    (
        SELECT
            imei
            ,duration_30d as carInfo_app_usage_duration
        FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
        WHERE category='汽車資訊'
    )t9
    ON (t1.imei=t9.imei)
    LEFT JOIN
    (
        SELECT
            imei
            ,duration_30d as stock_app_usage_duration
        FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
        WHERE category='股票基金'
    )t10
    ON (t1.imei=t10.imei)
    LEFT JOIN
    (
        SELECT
            imei
            ,duration_30d as carRaising_app_usage_duration
        FROM ads_persona_label_1level_app_category_usage_30d_dev_ds
        WHERE category='養車'
    )t11
    ON (t1.imei=t11.imei)
    LEFT JOIN
    (
        SELECT
            imei
            ,price_dev
        FROM ads_persona_label_0level_v3_price_dev_ds
        WHERE pt_d='$date'
    )t12
    ON (t1.imei=t12.imei)
    LEFT JOIN 
    (
        SELECT
            id
            ,city
        FROM ads_persona_label_0level_v3_residence_new_ds
        WHERE pt_d='$date' AND pt_id_type='dev'
    )t13
    ON (t1.imei=t13.id)
    LEFT JOIN
    (
        SELECT
            imei
            ,tvlr_hobby_dev
            ,bsn_tvlr_dev
        FROM ads_persona_label_0level_v3_app_pref_tvlr_hobby_dev_ds
        WHERE pt_d=concat('$month','01')
    )t14
    ON (t1.imei=t14.imei)
    LEFT JOIN
    (
        SELECT
            imei
            ,use_duration AS didi_use_duration
        FROM ads_persona_label_1level_app_usage_30d_dev_dm
        WHERE pt_d='$date' AND package_name='com.sdu.didi.psnger'
    )t15
    ON (t1.imei=t15.imei)
    LEFT JOIN
    (
        SELECT
            imei
            ,hw_pay_30dy_pay_amt_dev
        FROM ads_persona_label_0level_v3_hw_pay_30dy_fee_dev_ds
        WHERE pt_d='$date'
    )t16
    ON (t1.imei=t16.imei)
;
# @DESC keep only the most recent 30 days of data
ALTER TABLE ads_persona_datamining_consumption_level_features_dev_ds DROP IF EXISTS PARTITION (pt_d ='${start_time,-30,yyyyMMdd}');
"