Background:
We define a girl proactively contacting a boy as the girl "greeting" him. The app supports two ways for a girl to greet: initiating a text chat and initiating an audio/video chat. These events are collected through tracking points added to the application and end up as log files stored on the server. The log contents look like this:
Text chat, log file social_talklist_im_2020-06-23.log, sample content:

```
2020-06-23 23:59:44,10.3.1.32,[8988487,9050759]
2020-06-23 23:59:47,10.3.1.32,[9016540,8946882]
2020-06-23 23:59:47,10.3.1.32,[9011059,9050680]
```

The bracketed pair holds the uid of the girl who initiated the text chat and the uid of the boy who received the message.
Audio/video chat, log file social_talklist_video_2020-06-23.log, sample content:

```
2020-06-23 23:59:33,10.3.1.34,["8935739",8808419]
2020-06-23 23:59:55,10.3.1.20,["9037381",9050732]
```
The requirement: taking a specified guild as an example, compute for a single day the number of girls in that guild who greeted via text, the average number of greetings per girl, the number of boys who were greeted, and similar figures.
Analysis:
Hive's SQL-on-Hadoop layer first needs a structured, delimited file (called CSV here, although the parsed files are in fact tab-separated, matching the '\t' delimiter used when the Hive tables are created below). We can load such a file into a Hive table, and SQL statements against it are then automatically compiled into Hadoop MapReduce jobs.
Where does that file come from? We first upload the log files to HDFS, then write a map-only MapReduce job that parses the logs into CSV.
In addition, the guild table and guild-member table are stored in MySQL; we need to import them into Hive so that we can join against them.
Implementation:
1. Upload the log files to HDFS
Add the Hadoop client dependency to the Gradle build file: `implementation "org.apache.hadoop:hadoop-client:$hadoopVersion"`

```shell
gradle installDist               # package the project as an executable distribution
cd build/install/LogAnalysic     # enter the install directory
# upload the log files
bin/LogAnalysic com.mustafa.bigdata.Upload2Hdfs /home/mustafa/Documents/bigdata/jiazu/input/social_talklist_im_2020-06-23.log bigdata/jiazu/input/im/2020-06-23.log
bin/LogAnalysic com.mustafa.bigdata.Upload2Hdfs /home/mustafa/Documents/bigdata/jiazu/input/social_talklist_im_2020-06-24.log bigdata/jiazu/input/im/2020-06-24.log
bin/LogAnalysic com.mustafa.bigdata.Upload2Hdfs /home/mustafa/Documents/bigdata/jiazu/input/social_talklist_video_2020-06-23.log bigdata/jiazu/input/video/2020-06-23.log
bin/LogAnalysic com.mustafa.bigdata.Upload2Hdfs /home/mustafa/Documents/bigdata/jiazu/input/social_talklist_video_2020-06-24.log bigdata/jiazu/input/video/2020-06-24.log
```
The code is available at the following link: uploading files to HDFS
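The upload step can be sketched with the Hadoop `FileSystem` API. This is a minimal sketch, not the linked implementation: it assumes the class name `Upload2Hdfs` from the commands above, an HDFS reachable through the default `core-site.xml` on the classpath, and no error handling; it will only run against a live cluster.

```java
package com.mustafa.bigdata;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: copy one local log file into HDFS.
// args[0] = local source path, args[1] = HDFS destination path,
// matching the command-line usage shown above.
public class Upload2Hdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // reads core-site.xml from the classpath
        try (FileSystem fs = FileSystem.get(conf)) {
            fs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));
        }
    }
}
```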
2. Parse the log files into CSV files
We need to package the project as a jar and run the MapReduce job on YARN.
Additional Gradle dependencies: `implementation 'com.google.code.gson:gson:2.6.2'` and `implementation 'commons-lang:commons-lang:2.6'`

```shell
gradle jar
HADOOP_HOME=/data/home/software/hadoop-2.6.0-cdh5.16.2
$HADOOP_HOME/bin/yarn jar \
  build/libs/LogAnalysic-1.0-SNAPSHOT.jar \
  com.mustafa.bigdata.ParseLog2Csv bigdata/jiazu/input/im bigdata/jiazu/output/im
$HADOOP_HOME/bin/yarn jar \
  build/libs/LogAnalysic-1.0-SNAPSHOT.jar \
  com.mustafa.bigdata.ParseLog2Csv bigdata/jiazu/input/video bigdata/jiazu/output/video
```
The code is available at the following link: parsing logs into CSV
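The core of that map-only job is turning one raw log line into a tab-separated record. Here is a standard-library sketch of just that transformation (the actual project pulls in Gson to parse the uid array, since the uids are sometimes quoted and sometimes not, as the video log sample shows; the record id is assumed here to be supplied by the caller, e.g. a running counter in the mapper):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LogLineParser {
    // Matches: <date time>,<ip>,[<fuid>,<tuid>] where either uid may be quoted.
    private static final Pattern LINE = Pattern.compile(
            "^(\\S+ \\S+),\\S+,\\[\"?(\\d+)\"?,\"?(\\d+)\"?\\]$");

    // Convert one raw log line into a tab-separated record: id, fuid, tuid, time,
    // matching the Hive table schema used in step 4 (id int, fuid int, tuid int, time timestamp).
    public static String toTsv(long id, String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("unparseable log line: " + line);
        }
        return id + "\t" + m.group(2) + "\t" + m.group(3) + "\t" + m.group(1);
    }

    public static void main(String[] args) {
        // unquoted uids, as in the text-chat log
        System.out.println(toTsv(1, "2020-06-23 23:59:44,10.3.1.32,[8988487,9050759]"));
        // quoted fuid, as in the video log
        System.out.println(toTsv(2, "2020-06-23 23:59:33,10.3.1.34,[\"8935739\",8808419]"));
    }
}
```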
3. Import the guild table and guild-member table from MySQL into Hive
```shell
bin/sqoop import \
  --connect jdbc:mysql://mustafa-PC:3306/jiazu \
  --username root \
  --password 123456 \
  --columns jzid,jzname \
  --table groups \
  --num-mappers 1 \
  --mapreduce-job-name jiazu_groups_to_hive \
  --fields-terminated-by '\t' \
  --target-dir /user/mustafa/hive/tables/jiazu/groups \
  --delete-target-dir \
  --hive-import \
  --create-hive-table \
  --hive-database jiazu \
  --hive-overwrite \
  --hive-table groups
bin/sqoop import \
  --connect jdbc:mysql://mustafa-PC:3306/jiazu \
  --username root \
  --password 123456 \
  --columns jzid,uid \
  --table member \
  --num-mappers 1 \
  --mapreduce-job-name jiazu_member_to_hive \
  --fields-terminated-by '\t' \
  --target-dir /user/mustafa/hive/tables/jiazu/member \
  --delete-target-dir \
  --hive-import \
  --create-hive-table \
  --hive-database jiazu \
  --hive-overwrite \
  --hive-table member
```
4. Use Hive to analyze the logs offline
The analysis boils down to the following two SQL statements (the second substitutes the result of the first for `$sql1`):

```sql
select m.uid from groups g left join member m on g.jzid = m.jzid where g.jzname = 'aaa'
select count(fuid) as greet_times, count(distinct fuid) as greet_users, count(distinct tuid) as disturb_users from im where fuid in ($sql1) and time >= '2020-06-23 00:00:00' and time < '2020-06-24 00:00:00'
```
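To make the three counters concrete: `greet_times` is the total number of matching rows, `greet_users` the number of distinct greeting girls, and `disturb_users` the number of distinct greeted boys. A small standard-library illustration of the same aggregation over in-memory (fuid, tuid) pairs, with sample data made up for illustration:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class GreetStats {
    // Each long[] is one chat row: {fuid, tuid}.
    // Returns {greet_times, greet_users, disturb_users}.
    public static int[] stats(List<long[]> rows) {
        Set<Long> fuids = new HashSet<>();   // distinct greeting girls
        Set<Long> tuids = new HashSet<>();   // distinct greeted boys
        for (long[] r : rows) {
            fuids.add(r[0]);
            tuids.add(r[1]);
        }
        return new int[]{rows.size(), fuids.size(), tuids.size()};
    }

    public static void main(String[] args) {
        // girl 9016540 greets twice, girl 9011059 once; three different boys
        List<long[]> rows = List.of(
                new long[]{9016540, 8946882},
                new long[]{9016540, 9050680},
                new long[]{9011059, 8808419});
        int[] s = stats(rows);
        System.out.println(s[0] + " " + s[1] + " " + s[2]);  // prints "3 2 3"
    }
}
```

The per-girl average asked for in the requirement is then simply `greet_times / greet_users`.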
Add the dependency in Gradle: `implementation "org.apache.hive:hive-jdbc:$hiveVersion"`
Code:
```java
import java.net.InetAddress;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;

public class LoadCsv2Table {
    public static void main(String[] args) throws Exception {
        // Connect to HiveServer2 on the local host, database jiazu.
        String hostname = InetAddress.getLocalHost().getHostName();
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        String jdbcUrl = "jdbc:hive2://" + hostname + ":10000/jiazu";
        try (Connection conn = DriverManager.getConnection(jdbcUrl, "mustafa", null);
             Statement st = conn.createStatement()) {
            // Create the two chat tables and load the tab-separated files produced in step 2.
            st.execute("create table if not exists im (\n" +
                    "    id int,\n" +
                    "    fuid int,\n" +
                    "    tuid int,\n" +
                    "    time timestamp\n" +
                    ")\n" +
                    "row format delimited fields terminated by '\\t'");
            st.execute("load data inpath '/user/mustafa/bigdata/jiazu/output/im' into table im");
            st.execute("create table if not exists video (\n" +
                    "    id int,\n" +
                    "    fuid int,\n" +
                    "    tuid int,\n" +
                    "    time timestamp\n" +
                    ")\n" +
                    "row format delimited fields terminated by '\\t'");
            st.execute("load data inpath '/user/mustafa/bigdata/jiazu/output/video' into table video");
            // Collect the uids of every member of the guild named 'SP'.
            List<String> uidList = new ArrayList<>();
            try (ResultSet rs = st.executeQuery(
                    "select m.uid from groups g left join member m on g.jzid = m.jzid where g.jzname = 'SP'")) {
                while (rs.next()) {
                    uidList.add(rs.getString("uid"));
                }
            }
            String uids = String.join(",", uidList);
            // Run one aggregation per log type and per day; each result is written
            // as a comma-separated file to a local directory.
            st.execute("insert overwrite local directory '/home/mustafa/Desktop/im/im-2020-06-23' row format delimited fields terminated by ',' select count(fuid) as greet_times, count(distinct fuid) as greet_users, count(distinct tuid) as disturb_users from im where fuid in (" + uids + ") and time >= '2020-06-23 00:00:00' and time < '2020-06-24 00:00:00'");
            st.execute("insert overwrite local directory '/home/mustafa/Desktop/im/im-2020-06-24' row format delimited fields terminated by ',' select count(fuid) as greet_times, count(distinct fuid) as greet_users, count(distinct tuid) as disturb_users from im where fuid in (" + uids + ") and time >= '2020-06-24 00:00:00' and time < '2020-06-25 00:00:00'");
            st.execute("insert overwrite local directory '/home/mustafa/Desktop/im/video-2020-06-23' row format delimited fields terminated by ',' select count(fuid) as greet_times, count(distinct fuid) as greet_users, count(distinct tuid) as disturb_users from video where fuid in (" + uids + ") and time >= '2020-06-23 00:00:00' and time < '2020-06-24 00:00:00'");
            st.execute("insert overwrite local directory '/home/mustafa/Desktop/im/video-2020-06-24' row format delimited fields terminated by ',' select count(fuid) as greet_times, count(distinct fuid) as greet_users, count(distinct tuid) as disturb_users from video where fuid in (" + uids + ") and time >= '2020-06-24 00:00:00' and time < '2020-06-25 00:00:00'");
        }
    }
}
```
The analysis results are saved locally under `/home/mustafa/Desktop/im`.
For the full code, see the following link: