Hive Engineering Practice

Original

2020-06-01 20:20

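I have recently been working on a toB project in which data has to be computed offline and pushed to the online business database, with Hive used for the offline analysis. This post summarizes the common problems I ran into and what I learned.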

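I. Engineering scope: the data statistics work breaks down roughly into four steps: metric computation, batch scripts, data format, and exception handling.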

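step1. Metric computation: create one table per metric to store its values, for example a Hive table loan_apply_rate that stores the application approval rate. The difficulty is that there are many metrics and their definitions may be ambiguous.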

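step2. Batch scripts: combine the tables created in step1 into perl scripts that run as a batch. The difficulty is that a long run time affects the business users, so test until you find a reasonable script size. A large script can be split vertically (for example, one script for application metrics and one for withdrawal metrics) or horizontally (for example, the vintage metric depends on a lot of intermediate logic, so part of that logic can be moved into an intermediate table that the final vintage metric then reads).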
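The horizontal split described above can be sketched roughly as follows. This is only an illustration: the table and column names (tmp_loan_repay_base, loan_detail, repay_detail, loan_date, overdue_days) are hypothetical, and the "overdue more than 30 days" ratio is a stand-in for whatever the real vintage definition is.

```sql
-- Script 1 (hypothetical names): materialize the shared intermediate logic once.
create table if not exists tmp_loan_repay_base (
  loan_no      string,
  loan_month   string,
  loan_amount  double,
  overdue_days int
) partitioned by (dt string);

insert overwrite table tmp_loan_repay_base partition (dt = '2018-12-19')
select l.loan_no,
       substr(l.loan_date, 1, 7) as loan_month,
       l.loan_amount,
       r.overdue_days
from loan_detail l
join repay_detail r
  on l.loan_no = r.loan_no
where l.dt = '2018-12-19'
  and r.dt = '2018-12-19';

-- Script 2: the final metric reads only the intermediate table,
-- so it stays short and can be rerun without repeating the join.
select loan_month,
       sum(case when overdue_days > 30 then loan_amount else 0 end)
         / sum(loan_amount) as overdue_30_rate
from tmp_loan_repay_base
where dt = '2018-12-19'
group by loan_month;
```

Keeping the join in its own script means a failure in the final metric does not force the expensive part to run again.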

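step3. Data format: create one summary table that stores all metric values, and convert the tables produced in step2 into the data format the business side expects. The same step2 metrics can be converted into several formats for different business consumers, so metrics are reused.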

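step4. Exception handling: this covers wrong execution order between parent and child tasks in the batch scripts, rolling back or recomputing when today's statistics are wrong, data deduplication, and data backup.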
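One common way to make reruns and rollbacks safe is to write every daily result into its own partition with insert overwrite. The sketch below assumes that loan_apply_rate is partitioned by dt and has a single apply_rate column, and that the source table loan_apply has an apply_status column; the backup table loan_apply_rate_bak is also hypothetical.

```sql
-- Backup (assumed table loan_apply_rate_bak with the same layout):
-- copy the current result before recomputing it.
insert overwrite table loan_apply_rate_bak partition (dt = '2018-12-19')
select apply_rate
from loan_apply_rate
where dt = '2018-12-19';

-- Rerun: overwriting the partition replaces the earlier result,
-- so running the script twice does not create duplicate rows.
insert overwrite table loan_apply_rate partition (dt = '2018-12-19')
select count(case when apply_status = 'PASS' then 1 end) / count(1) as apply_rate
from loan_apply
where dt = '2018-12-19';

-- Rollback: drop the bad partition if the day has to be withdrawn entirely.
alter table loan_apply_rate drop if exists partition (dt = '2018-12-19');
```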

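II. Hive scope

1. Common optimizations

1.1 Rule: if you only ever use rn=1, that is, you only need the maximum or minimum, there is no need for row_number. Example: find the order with the largest credit amount in the application table.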

case1: select * from a where dt='2018-12-19' order by loan_amount desc limit 1; (map 70s, reduce 400s) (common but inefficient)

case2: select * from (select *, max(loan_amount) over() la from a where dt='2018-12-19') a where la=loan_amount; (map 70s, reduce 1300s) (common but inefficient)

case3: select * from (select *, row_number() over(sort by loan_amount desc) rn from a where dt='2018-12-19') a where rn=1; (map 70s, reduce 9000s, timeout)

case4: select * from (select * from a where dt='2018-12-19') a join (select max(loan_amount) la from a where dt='2018-12-19') b on a.loan_amount=b.la; (map 70s, map 70s, reduce 2s)

case5: select * from (select max(struct(loan_amount, apply_no)) la from a where dt='2018-12-19') b; (map 70s, reduce 2s) Structs are compared field by field, so loan_amount must be the first field; la.col1 is then the largest amount and la.col2 the apply_no of that order.

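1.2 Rule: replace distinct

case1: select count(distinct(user_jrid)) from user where dt='2018-12-19'; (run time: 800s) (the deduplication sorts in O(n log n) and runs on a single reducer) (common but inefficient)

case2: select 1,count(1) from (select user_jrid from a where dt='2018-12-31' group by user_jrid) a; (deduplicates in parallel through group by, run time: 80s) (still O(n log n), but spread across multiple reducers running in parallel)

1.3 Complexity of each stage: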

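2. UDF

Setting a date to the end of the month:

2.1 case when cast(split(statistics_date,'-')[1] as int) in (1,3,5,7,8,10,12) then concat(statistics_date,'-31')
when cast(split(statistics_date,'-')[1] as int) in (4,6,9,11) then concat(statistics_date,'-30')
when cast(split(statistics_date,'-')[0] as int)%4=0 and cast(split(statistics_date,'-')[1] as int)=2 then concat(statistics_date,'-29')
when cast(split(statistics_date,'-')[0] as int)%4!=0 and cast(split(statistics_date,'-')[1] as int)=2 then concat(statistics_date,'-28') end as new_statistics_date
Here statistics_date is a year-month string. The month is cast to int so that both '1' and '01' match. The %4 leap-year test is only correct for years from 1901 to 2099.

2.2 date_sub(concat(substr(created_date, 1, 7), '-01'), 1)
This takes the first day of the month of created_date and subtracts one day, which gives the last day of the previous month.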
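As an alternative to the hand-written expressions above, newer Hive versions ship built-in date functions that cover the same cases. The sketch below assumes Hive 1.1.0 or later (where last_day and trunc are available), a zero-padded 'yyyy-MM' value in statistics_date, and a hypothetical table t that holds both columns.

```sql
-- Month-end of the month in statistics_date (assumed format 'yyyy-MM').
-- last_day handles month lengths and leap years itself.
select last_day(concat(statistics_date, '-01')) as new_statistics_date
from t;  -- t is a hypothetical table

-- Last day of the month before created_date:
-- truncate to the first day of the month, then step back one day.
select date_sub(trunc(created_date, 'MM'), 1) as prev_month_end
from t;
```

If the cluster runs an older Hive release, the expressions in 2.1 and 2.2 remain the portable option.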

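3. Common functions

3.1 Rows to a single column: collect_set/collect_list (the result is of type array<string>). concat_ws can join the result of collect_set into one string, as in concat_ws(',', collect_set(col)).

case1: list products in their default order and aggregate them into one row.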
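A minimal sketch of this case, assuming a hypothetical table user_product(user_id string, product_name string). collect_set and collect_list do not guarantee any element order, so sort_array is used here to get a deterministic (alphabetical) order; a business-specific "default order" would need its own sort key.

```sql
-- One output row per user, with that user's products joined into one string.
select user_id,
       concat_ws(',', sort_array(collect_set(product_name))) as products
from user_product  -- hypothetical table
group by user_id;
```

Use collect_list instead of collect_set if duplicate product names should be kept.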

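3.2 Column to rows: lateral view explode/posexplode

case1: select v from (select split('1 2 3 4 5 6 7 8 9 0',' ') v1 ) t1 lateral view explode(v1) t2 as v;

case2: select date_sub(from_unixtime(unix_timestamp(),'yyyy-MM-dd'), t.pos + 1) as biz_date from (select posexplode(split(space(29),' ')) as (pos, val)) t; This produces one row per day for the past 30 days (splitting 29 spaces yields 30 elements), which is used to compute the daily application and withdrawal metrics for each of those days. (If group by were used instead, every selected field would have to be wrapped in something like collect_set; with as many selected fields as this statement has, that becomes tedious.)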
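A rough sketch of how such a generated date list can be joined to the data so that days without any record still appear with a count of zero. It reuses the cash_apply table from this post, but the created_date column on it is an assumption, and current_date requires Hive 1.2 or later.

```sql
select d.biz_date,
       coalesce(s.apply_cnt, 0) as apply_cnt
from (
  -- date list: one row per day for the past 30 days
  select cast(date_sub(current_date, t.pos + 1) as string) as biz_date
  from (select posexplode(split(space(29), ' ')) as (pos, val)) t
) d
left join (
  -- daily counts; created_date is an assumed column name
  select cast(to_date(created_date) as string) as biz_date,
         count(1) as apply_cnt
  from cash_apply
  where to_date(created_date) >= date_sub(current_date, 30)
  group by cast(to_date(created_date) as string)
) s
  on d.biz_date = s.biz_date;
```

Both sides are cast to string because the return type of date_sub and to_date differs between Hive versions.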

3.3 select * from (select *,row_number() over(partition by cash_id order by modified_date desc) as rn from cash_apply) a where rn=1; The withdrawal table is an incremental table, and the statement above returns the latest record for each withdrawal.

3.4 Others: instr; months_between;
order by, sort by, distribute by, cluster by: see https://blog.csdn.net/zhanglh046/article/details/78572939
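For quick reference, the four clauses can be contrasted as below. The queries reuse table a and the columns loan_amount and apply_no from section 1.1 purely as an illustration.

```sql
-- order by: one global order; all rows pass through a single reducer.
select * from a where dt = '2018-12-19' order by loan_amount desc;

-- sort by: rows are sorted within each reducer only; no global order.
select * from a where dt = '2018-12-19' sort by loan_amount desc;

-- distribute by: rows with the same apply_no go to the same reducer;
-- combined with sort by, each reducer's output is sorted as well.
select * from a where dt = '2018-12-19'
distribute by apply_no sort by apply_no, loan_amount desc;

-- cluster by x: shorthand for distribute by x sort by x (ascending only).
select * from a where dt = '2018-12-19' cluster by apply_no;
```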
