Big Data Phase Project: Project Implementation


Table of Contents


I. Start the Hadoop cluster (pseudo-distributed)

II. Create a directory to store the data

III. Collect the files into HDFS

1. Create zebra.conf under Flume's data directory

2. Use Flume to collect the data and land it in HDFS

3. Run the command to store the data in HDFS

4. Check in Eclipse whether the files exist

IV. Start Hive

V. Work with Hive

1. Create the zebra database

2. Create an external table pointing at the data to be processed (external + partitioned table, partitioned by time)

3. Repair the partitions

4. View the data

5. Set the partitions manually

6. View the data again

VI. Clean the data and extract the useful fields (18 fields)

1. Table-creation statement

2. Insert the data

3. View the data

VII. Organize the cleaned data and build the fact table

1. Table-creation statement:

2. Insert the data:

VIII. Query the information of interest, using the app-popularity table as an example:

1. Table-creation statement:

2. Insert the data:

3. Query the top 5 most popular apps:

IX. Sqoop workflow:

1. Create the corresponding table in MySQL

2. Export the d_h_http_apptype table with Sqoop:

3. View the data

4. Visualization page


I. Start the Hadoop cluster (pseudo-distributed)
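For a pseudo-distributed setup this is typically just the standard start scripts; the sketch below assumes a configured Hadoop installation with its sbin directory on PATH:

```shell
start-dfs.sh    # starts NameNode, DataNode, SecondaryNameNode
start-yarn.sh   # starts ResourceManager, NodeManager
jps             # confirm that all daemons are up
```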

II. Create a directory to store the data
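The directory is the one that the Flume spooling-directory source watches in the configuration below; the raw log files get copied into it:

```shell
# Create the local staging directory watched by Flume's spoolDir source
mkdir -p /home/zebra
ls -ld /home/zebra
```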

III. Collect the files into HDFS

Hive operates on data stored in HDFS, so the files must first be moved into HDFS before they can be processed.

1. Create zebra.conf under Flume's data directory

2. Use Flume to collect the data and land it in HDFS

Flume collects the logs one day at a time: the timestamp interceptor stamps each event, so the HDFS output path is partitioned by day.

# zebra.conf: spooling-directory source -> memory channel -> HDFS sink
a1.sources=r1
a1.channels=c1
a1.sinks=s1

# Watch /home/zebra for new files; the timestamp interceptor lets
# %Y-%m-%d in the sink path resolve to each event's date
a1.sources.r1.type=spooldir
a1.sources.r1.spoolDir=/home/zebra
a1.sources.r1.interceptors=i1
a1.sources.r1.interceptors.i1.type=timestamp

# Write plain text to HDFS, one directory per day (reportTime=YYYY-MM-DD),
# rolling files every 30 seconds regardless of size or event count
a1.sinks.s1.type=hdfs
a1.sinks.s1.hdfs.path=hdfs://192.168.150.137:9000/zebra/reportTime=%Y-%m-%d
a1.sinks.s1.hdfs.fileType=DataStream
a1.sinks.s1.hdfs.rollInterval=30
a1.sinks.s1.hdfs.rollSize=0
a1.sinks.s1.hdfs.rollCount=0

# Wire the source and sink to the in-memory channel
a1.channels.c1.type=memory
a1.sources.r1.channels=c1
a1.sinks.s1.channel=c1

3. Run the command to store the data in HDFS
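The agent can be started with the standard flume-ng launcher; the install path below is a placeholder, and the conf-file path points at the zebra.conf created in step 1:

```shell
cd /path/to/flume/bin   # hypothetical Flume install path
# Run agent a1 in the foreground with console logging
sh flume-ng agent -n a1 -c ../conf -f ../data/zebra.conf -Dflume.root.logger=INFO,console
```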

4. Check in Eclipse (HDFS view) whether the files exist
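The same check can be done from the shell instead of the Eclipse HDFS plugin:

```shell
# List the day directories Flume wrote under /zebra;
# the reportTime=... names come from the sink path in zebra.conf
hdfs dfs -ls /zebra
```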

IV. Start Hive
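With HDFS up and the data in place, Hive is started from its CLI (assuming HIVE_HOME/bin is on PATH and the metastore has been initialized):

```shell
# launches the interactive Hive shell
hive
```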

V. Work with Hive

1. Create the zebra database

  • Run: create database zebra;
  • Run: use zebra;

2. Create an external table pointing at the data to be processed (external + partitioned table, partitioned by time)

		Table-creation statement: create EXTERNAL table zebra (a1 string,a2 string,a3 string,a4 string,a5 string,a6 string,a7 string,a8 string,a9 string,a10 string,a11 string,a12 string,a13 string,a14 string,a15 string,a16 string,a17 string,a18 string,a19 string,a20 string,a21 string,a22 string,a23 string,a24 string,a25 string,a26 string,a27 string,a28 string,a29 string,a30 string,a31 string,a32 string,a33 string,a34 string,a35 string,a36 string,a37 string,a38 string,a39 string,a40 string,a41 string,a42 string,a43 string,a44 string,a45 string,a46 string,a47 string,a48 string,a49 string,a50 string,a51 string,a52 string,a53 string,a54 string,a55 string,a56 string,a57 string,a58 string,a59 string,a60 string,a61 string,a62 string,a63 string,a64 string,a65 string,a66 string,a67 string,a68 string,a69 string,a70 string,a71 string,a72 string,a73 string,a74 string,a75 string,a76 string,a77 string) partitioned by (reporttime string) row format delimited fields terminated by '|' stored as textfile location '/zebra';
		Add the partitions: msck repair table zebra;
		Check that the import succeeded with the sampling syntax: select * from zebra TABLESAMPLE (1 ROWS);

3. Repair the partitions

The partition directories were created by hand (outside of Hive), so Hive does not recognize them; the partition metadata has to be repaired.

4. View the data

No rows are returned — in other words, the data was not picked up.

msck repair adds the partitions; if it does not work, they have to be added manually.

5. Set the partitions manually

PS: the date must match the current date (the day the data was uploaded)
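A manual add looks like the following; the date is a placeholder and must match the actual reportTime directory written by Flume:

```sql
-- placeholder date: use the directory name actually present under /zebra
alter table zebra add partition (reporttime='2024-01-01')
location '/zebra/reportTime=2024-01-01';
```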

6. View the data again

VI. Clean the data and extract the useful fields (18 fields)

1. Table-creation statement

  • create table dataclear(reporttime string,appType bigint,appSubtype bigint,userIp string,userPort bigint,appServerIP string,appServerPort bigint,host string,cellid string,appTypeCode bigint,interruptType String,transStatus bigint,trafficUL bigint,trafficDL bigint,retranUL bigint,retranDL bigint,procdureStartTime bigint,procdureEndTime bigint)row format delimited fields terminated by '|';

2. Insert the data

  • insert overwrite table dataclear select concat(reporttime,' ','00:00:00'),a23,a24,a27,a29,a31,a33,a59,a17,a19,a68,a55,a34,a35,a40,a41,a20,a21 from zebra;

3. View the data

Timing note: the run took about 56 s — extremely slow (a later run took about 20 s).

VII. Organize the cleaned data and build the fact table

1. Table-creation statement:

  • create table f_http_app_host (reporttime string,appType bigint,appSubtype bigint,userIp string,userPort bigint,appServerIP string,appServerPort bigint,host string,cellid string,attempts bigint,accepts bigint,trafficUL bigint,trafficDL bigint,retranUL bigint,retranDL bigint,failCount bigint,transDelay bigint)row format delimited fields terminated by '|';

2. Insert the data:

  • insert overwrite table f_http_app_host select reporttime,appType,appSubtype,userIp,userPort,appServerIP,appServerPort,host, if(cellid == '',"000000000",cellid),if(appTypeCode == 103,1,0),if(appTypeCode == 103 and find_in_set(transStatus,"10,11,12,13,14,15,32,33,34,35,36,37,38,48,49,50,51,52,53,54,55,199,200,201,202,203,204,205,206,302,304,306")!=0 and interruptType == 0,1,0),if(apptypeCode == 103,trafficUL,0), if(apptypeCode == 103,trafficDL,0), if(apptypeCode == 103,retranUL,0), if(apptypeCode == 103,retranDL,0), if(appTypeCode == 103 and transStatus == 1 and interruptType == 0,1,0),if(appTypeCode == 103, procdureEndTime - procdureStartTime,0) from dataclear;

 

VIII. Query the information of interest, using the app-popularity table as an example:

1. Table-creation statement:

  • create table D_H_HTTP_APPTYPE(hourid string,appType int,appSubtype int,attempts bigint,accepts bigint,succRatio double,trafficUL bigint,trafficDL bigint,totalTraffic bigint,retranUL bigint,retranDL bigint,retranTraffic bigint,failCount bigint,transDelay bigint) row format delimited fields terminated by '|';

Based on the summary (fact) table f_http_app_host, aggregate by the grouping conditions and sum up the metric fields.

2. Insert the data:

  • insert overwrite table D_H_HTTP_APPTYPE select reporttime,apptype,appsubtype,sum(attempts),sum(accepts),round(sum(accepts)/sum(attempts),2),sum(trafficUL),sum(trafficDL),sum(trafficUL)+sum(trafficDL),sum(retranUL),sum(retranDL),sum(retranUL)+sum(retranDL),sum(failCount),sum(transDelay)from f_http_app_host group by reporttime,apptype,appsubtype;

 

3. Query the top 5 most popular apps:

  • select hourid,apptype,sum(totalTraffic) as tt from D_H_HTTP_APPTYPE group by hourid,apptype sort by tt desc limit 5;

IX. Sqoop workflow:

1. Create the corresponding table in MySQL
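A possible MySQL counterpart of the Hive table; the column types below are my assumptions mapped from the Hive schema, so adjust them as needed:

```sql
-- assumed column types; mirrors the Hive D_H_HTTP_APPTYPE schema
create table D_H_HTTP_APPTYPE (
  hourid varchar(64),
  appType int, appSubtype int,
  attempts bigint, accepts bigint,
  succRatio double,
  trafficUL bigint, trafficDL bigint, totalTraffic bigint,
  retranUL bigint, retranDL bigint, retranTraffic bigint,
  failCount bigint, transDelay bigint
);
```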

2. Export the d_h_http_apptype table with Sqoop:

  • sh sqoop export --connect jdbc:mysql://hadoop03:3306/zebra --username root --password root --export-dir '/user/hive/warehouse/zebra.db/d_h_http_apptype/000000_0' --table D_H_HTTP_APPTYPE -m 1 --fields-terminated-by '|'

3. View the data
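A quick check from the MySQL client that the export landed (database and table names as used in the Sqoop command above):

```sql
use zebra;
select count(*) from D_H_HTTP_APPTYPE;
select * from D_H_HTTP_APPTYPE limit 5;
```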

4. Visualization page