Using Hive to Analyze Offline Log Data


Continuing from the previous article, "New-Visitor Count in MapReduce: Writing MR Output to MySQL".

Hive 1.2.1 can directly map an HBase table that already exists.

If you want to create a table in Hive and have it mapped even when the corresponding HBase table does not exist yet, use the recompiled build hive-1.2.1-hbase.

 

1. Create an external table in Hive mapped to HBase

CREATE EXTERNAL TABLE event_log_20180728(
key string,
pl string,
ver string,
s_time string,
u_ud string,
u_sd string,
en string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:pl,info:ver,info:s_time,info:u_ud,info:u_sd,info:en")
TBLPROPERTIES("hbase.table.name" = "event_log_20180728");


Count the number of new users:

select count(*) from event_log_20180728 where en="e_l";


2. Extract the data, do an initial filtering pass, and save the result to a temporary table

 

Create the temporary table:

CREATE TABLE stats_hourly_tmp01(
pl string,
ver string,
s_time string,
u_ud string,
u_sd string,
en string,
`date` string,
hour int
);


Load the raw data into the temporary table:

INSERT OVERWRITE TABLE stats_hourly_tmp01
SELECT pl,ver,s_time,u_ud,u_sd,en,
from_unixtime(cast(s_time/1000 as int),'yyyy-MM-dd'), hour(from_unixtime(cast(s_time/1000 as int),'yyyy-MM-dd HH:mm:ss'))
FROM event_log_20180728
WHERE en="e_l" or en="e_pv";
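The date/hour extraction above can be sketched in Python (a minimal sketch; the timestamp value is a made-up example, and UTC stands in for Hive's session timezone):

```python
from datetime import datetime, timezone

def split_event_time(s_time_ms: str):
    """Mimic from_unixtime(cast(s_time/1000 as int), 'yyyy-MM-dd') and
    hour(...): turn a millisecond epoch string into (date, hour)."""
    seconds = int(s_time_ms) // 1000  # milliseconds -> seconds
    # Hive's from_unixtime uses the session timezone; UTC is used here
    # so the example is deterministic.
    dt = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return dt.strftime('%Y-%m-%d'), dt.hour

# Hypothetical timestamp: 2018-07-28 10:00:00 UTC, in milliseconds
print(split_event_time('1532772000000'))  # ('2018-07-28', 10)
```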


SELECT from_unixtime(cast(s_time/1000 as int),'yyyy-MM-dd'),from_unixtime(cast(s_time/1000 as int),'yyyy-MM-dd HH:mm:ss') FROM event_log_20180728;

 

View the result:


3. Analysis of specific KPIs

Create a temporary table to hold the results:

CREATE TABLE stats_hourly_tmp02(
pl string,
ver string,
`date` string,
kpi string,
hour int,
value int
);


Count users: there are as many users as there are distinct u_ud values.

The platform dimension is (name, version):

INSERT OVERWRITE TABLE stats_hourly_tmp02
SELECT pl,ver,`date`,'hourly_new_install_users' as kpi,hour,COUNT(distinct u_ud) as v
FROM stats_hourly_tmp01
WHERE en="e_l"
GROUP BY pl,ver,`date`,hour;
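The COUNT(DISTINCT u_ud) grouping can be sketched in plain Python to show why a user who appears twice is counted once (the sample rows are hypothetical):

```python
from collections import defaultdict

# Hypothetical rows (pl, ver, date, hour, u_ud) from stats_hourly_tmp01, en = 'e_l'
rows = [
    ('website', '1.0', '2018-07-28', 10, 'u1'),
    ('website', '1.0', '2018-07-28', 10, 'u1'),  # same user twice -> counted once
    ('website', '1.0', '2018-07-28', 10, 'u2'),
]

# COUNT(DISTINCT u_ud) ... GROUP BY pl, ver, date, hour
users = defaultdict(set)
for pl, ver, date, hour, u_ud in rows:
    users[(pl, ver, date, hour)].add(u_ud)

counts = {k: len(v) for k, v in users.items()}
print(counts)  # {('website', '1.0', '2018-07-28', 10): 2}
```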


View the result:


Compute the session-length KPI

Session length = timestamp of the last record in a session - timestamp of the first record = maxtime - mintime

Steps:

1. Compute each session's length (group by u_sd)

2. Sum the session lengths within each time bucket

 

The platform dimension is (name, version):

INSERT INTO TABLE stats_hourly_tmp02

SELECT pl,ver,`date`,'hourly_session_length' as kpi,hour, sum(s_length)/1000 as v
FROM (
SELECT pl,ver,`date`,hour,u_sd,(max(s_time) - min(s_time)) as s_length
FROM stats_hourly_tmp01
GROUP BY pl,ver,`date`,hour,u_sd
) tmp
GROUP BY pl,ver,`date`,hour;
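The two-level query can be sketched in Python: the inner step takes max - min per session, the outer step sums the lengths per hour bucket and converts milliseconds to seconds (sample rows are hypothetical):

```python
from collections import defaultdict

# Hypothetical rows (pl, ver, date, hour, u_sd, s_time_ms) from stats_hourly_tmp01
rows = [
    ('website', '1.0', '2018-07-28', 10, 'sd1', 1532772000000),
    ('website', '1.0', '2018-07-28', 10, 'sd1', 1532772008000),  # same session, 8 s later
    ('website', '1.0', '2018-07-28', 10, 'sd2', 1532772004000),  # single-record session
]

# Inner query: per-session length = max(s_time) - min(s_time), grouped by session id
sessions = defaultdict(list)
for pl, ver, date, hour, u_sd, s_time in rows:
    sessions[(pl, ver, date, hour, u_sd)].append(s_time)
lengths = {k: max(v) - min(v) for k, v in sessions.items()}

# Outer query: sum of session lengths per (pl, ver, date, hour), ms -> seconds
totals = defaultdict(int)
for (pl, ver, date, hour, u_sd), ms in lengths.items():
    totals[(pl, ver, date, hour)] += ms
totals = {k: v / 1000 for k, v in totals.items()}
print(totals)  # {('website', '1.0', '2018-07-28', 10): 8.0}
```

Note that a session with a single record contributes a length of 0, which matches the SQL.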


View the result:


Convert the tmp02 data into the same structure as the MySQL table.

Narrow table to wide table => save the converted result in a temporary table:

CREATE TABLE stats_hourly_tmp03(
pl string, ver string, `date` string, kpi string,
hour00 int, hour01 int, hour02 int, hour03 int,
hour04 int, hour05 int, hour06 int, hour07 int,
hour08 int, hour09 int, hour10 int, hour11 int,
hour12 int, hour13 int, hour14 int, hour15 int,
hour16 int, hour17 int, hour18 int, hour19 int,
hour20 int, hour21 int, hour22 int, hour23 int
);


INSERT OVERWRITE TABLE stats_hourly_tmp03
SELECT pl,ver,`date`,kpi,
max(case when hour=0 then value else 0 end) as h0,
max(case when hour=1 then value else 0 end) as h1,
max(case when hour=2 then value else 0 end) as h2,
max(case when hour=3 then value else 0 end) as h3,
max(case when hour=4 then value else 0 end) as h4,
max(case when hour=5 then value else 0 end) as h5,
max(case when hour=6 then value else 0 end) as h6,
max(case when hour=7 then value else 0 end) as h7,
max(case when hour=8 then value else 0 end) as h8,
max(case when hour=9 then value else 0 end) as h9,
max(case when hour=10 then value else 0 end) as h10,
max(case when hour=11 then value else 0 end) as h11,
max(case when hour=12 then value else 0 end) as h12,
max(case when hour=13 then value else 0 end) as h13,
max(case when hour=14 then value else 0 end) as h14,
max(case when hour=15 then value else 0 end) as h15,
max(case when hour=16 then value else 0 end) as h16,
max(case when hour=17 then value else 0 end) as h17,
max(case when hour=18 then value else 0 end) as h18,
max(case when hour=19 then value else 0 end) as h19,
max(case when hour=20 then value else 0 end) as h20,
max(case when hour=21 then value else 0 end) as h21,
max(case when hour=22 then value else 0 end) as h22,
max(case when hour=23 then value else 0 end) as h23
FROM stats_hourly_tmp02
GROUP BY pl,ver,`date`,kpi;
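The narrow-to-wide pivot done by the 24 max(case when hour=N then value else 0 end) columns can be sketched in Python (sample rows are hypothetical):

```python
from collections import defaultdict

# Hypothetical narrow rows (pl, ver, date, kpi, hour, value) from stats_hourly_tmp02
narrow = [
    ('website', '1.0', '2018-07-28', 'hourly_new_install_users', 10, 2),
    ('website', '1.0', '2018-07-28', 'hourly_new_install_users', 11, 5),
]

# Each key gets 24 hour columns, defaulting to 0 (the CASE ... ELSE 0 branch);
# max() keeps the row's value for its own hour slot.
wide = defaultdict(lambda: [0] * 24)
for pl, ver, date, kpi, hour, value in narrow:
    key = (pl, ver, date, kpi)
    wide[key][hour] = max(wide[key][hour], value)

row = wide[('website', '1.0', '2018-07-28', 'hourly_new_install_users')]
print(row[10], row[11], row[0])  # 2 5 0
```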


select hour14,hour15,hour16 from stats_hourly_tmp03;

Result:


Convert the dimension attribute values to IDs using UDFs:

1. Add all of the custom Hive UDFs from the udf folder to the project

2. Package the project by running maven install

3. Upload the resulting jar file to the /jar directory on HDFS

4. Create the custom functions in Hive with the following commands:

 

create function dateconverter as 'com.xlgl.wzy.hive.udf.DateDimensionConverterUDF' using jar 'hdfs://master:9000/jar/transformer-0.0.1.jar';


create function kpiconverter as 'com.xlgl.wzy.hive.udf.KpiDimensionConverterUDF' using jar 'hdfs://master:9000/jar/transformer-0.0.1.jar';


create function platformconverter as 'com.xlgl.wzy.hive.udf.PlatformDimensionConverterUDF' using jar 'hdfs://master:9000/jar/transformer-0.0.1.jar';


Create the final Hive table that matches the MySQL table structure:

CREATE TABLE stats_hourly(
platform_dimension_id int,
date_dimension_id int,
kpi_dimension_id int,
hour00 int, hour01 int, hour02 int, hour03 int,
hour04 int, hour05 int, hour06 int, hour07 int,
hour08 int, hour09 int, hour10 int, hour11 int,
hour12 int, hour13 int, hour14 int, hour15 int,
hour16 int, hour17 int, hour18 int, hour19 int,
hour20 int, hour21 int, hour22 int, hour23 int
);


INSERT OVERWRITE TABLE stats_hourly
SELECT
platformconverter(pl,ver), dateconverter(`date`,'day'),kpiconverter(kpi),
hour00 , hour01 , hour02 , hour03 ,
hour04 , hour05 , hour06 , hour07 ,
hour08 , hour09 , hour10 , hour11 ,
hour12 , hour13 , hour14 , hour15 ,
hour16 , hour17 , hour18 , hour19 ,
hour20 , hour21 , hour22 , hour23
FROM stats_hourly_tmp03;


Export to MySQL with Sqoop:

bin/sqoop export \
--connect jdbc:mysql://master:3306/test \
--username root \
--password 123456 \
--table stats_hourly \
--export-dir /user/hive/warehouse/log_lx.db/stats_hourly \
-m 1 \
--input-fields-terminated-by '\001'
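The --input-fields-terminated-by '\001' flag is needed because Hive's default field delimiter is Ctrl-A (octal 001), so the files under the warehouse directory use it between columns. A small Python sketch of how such a line splits (the line here is constructed for illustration, not read from HDFS):

```python
# A stats_hourly row as Hive writes it: three dimension ids followed by
# 24 hour columns, separated by the Ctrl-A character ('\x01' == '\001').
line = '\x01'.join(['1', '2', '3'] + [str(h) for h in range(24)])

fields = line.split('\x01')
platform_id, date_id, kpi_id = fields[:3]
hours = [int(v) for v in fields[3:]]
print(platform_id, len(hours), hours[23])  # 1 24 23
```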


Query MySQL to verify:
