What does it do?
Compute, for each bucket, the share of sessions by visit length and by step length.
- Visit length: the difference between the earliest and latest timestamp within a session.
- Step length: the number of actions within a session.
That is: among the sessions matching the filter conditions, compute the share whose visit length falls in each of the buckets 1s-3s, 4s-6s, 7s-9s, 10s-30s, 30s-60s, 1m-3m, 3m-10m, 10m-30m, over 30m, and whose step length falls in 1-3, 4-6, … and the remaining buckets.
Requirements analysis:
The corresponding case class:
case class UserVisitAction(date: String,
user_id: Long,
session_id: String,
page_id: Long,
action_time: String,
search_keyword: String,
click_category_id: Long,
click_product_id: Long,
order_category_ids: String,
order_product_ids: String,
pay_category_ids: String,
pay_product_ids: String,
city_id: Long
)
Each action record represents a single user behavior (a click, an order, a payment, and so on), so the full data set contains many actions per user.
- Based on this, groupByKey naturally comes to mind: aggregate all of a user's actions together, then filter to check whether they satisfy the constraints.
- The catch is that a user has many actions, and we cannot check them one by one. So before filtering we need an aggregation step that collapses all of a user's actions into a single record, computing the visit length and step length along the way. The result looks like this:
UserId: searchKeywords=Lamer,小龍蝦,機器學習,吸塵器,蘋果,洗面奶,保溫杯,華爲手機|clickCategoryIds=|visitLength=3471|stepLength=48|startTime=2020-05-23 10:00:26
- Next, join this record with UserInfo on userId to form a new, more complete user record:
(e8ef831e7fd4475990a80e425af946ad,sessionid=e8ef831e7fd4475990a80e425af946ad|searchKeywords=Lamer,小龍蝦,機器學習,吸塵器,蘋果,洗面奶,保溫杯,華爲手機|clickCategoryIds=|visitLength=3471|stepLength=48|startTime=2020-05-23 10:00:26|age=45|professional=professional43|sex=male|city=city2)
- Now we can filter the data, and while filtering update the accumulator (a custom accumulator that counts sessions per visit-length and step-length bucket).
- Finally, divide each bucket's count by the total session count to get the ratios.
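The aggregated record above is just a `|`-delimited string of `key=value` fields. The project's `StringUtil.getFieldFromConcatString` helper, used throughout the code below, presumably extracts one field from such a string; `getField` here is a hypothetical minimal sketch of that behavior, not the project's actual implementation:

```scala
object ConcatStringDemo {
  // Hypothetical sketch of StringUtil.getFieldFromConcatString: pull the
  // value of `field` out of a "k1=v1|k2=v2|..." string. The delimiter is
  // treated as a regex, which is why call sites pass "\\|".
  def getField(concat: String, delimiter: String, field: String): String =
    concat.split(delimiter)
      .map(_.split("=", 2))
      .collectFirst { case Array(k, v) if k == field => v }
      .getOrElse("")

  def main(args: Array[String]): Unit = {
    val info = "sessionid=abc|visitLength=3471|stepLength=48"
    println(getField(info, "\\|", "visitLength")) // prints 3471
  }
}
```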
Flow chart: (omitted)
Step-by-step breakdown:
- Read the data from Hadoop and return the basic action records:
def basicActions(session: SparkSession, task: JSONObject) = {
import session.implicits._
val ds = session.read.parquet("hdfs://hadoop1:9000/data/user_visit_action").as[UserVisitAction]
//Keep only the actions inside the task's date range. Note that Dataset
//transformations return a new Dataset: the result of filter must be kept,
//otherwise the unfiltered data would be returned.
val filtered = ds.filter(item => {
val date = item.action_time
val start = task.getString(Constants.PARAM_START_DATE)
val end = task.getString(Constants.PARAM_END_DATE)
date >= start && date <= end
})
filtered.rdd
}
- Map the actions to (sessionId, action) pairs and aggregate each session's actions with the groupByKey operator:
val basicActionMap=basicActions.map(item=>{
val sessionId=item.session_id;
(sessionId,item);
})
val groupBasicActions=basicActionMap.groupByKey();
- Read the basic user profiles from Hadoop:
def getUserInfo(session: SparkSession) = {
import session.implicits._;
val ds=session.read.parquet("hdfs://hadoop1:9000/data/user_Info").as[UserInfo].map(item=>(item.user_id,item));
ds.rdd;
}
- Join the user profiles with the aggregated session info to build the new, complete record:
def AggInfoAndActions(aggUserActions: RDD[(Long, String)], userInfo: RDD[(Long, UserInfo)])={
//Match records on user_id ===> use the join operator
userInfo.join(aggUserActions).map{
case (userId,(userInfo: UserInfo,aggrInfo))=>{
val age = userInfo.age
val professional = userInfo.professional
val sex = userInfo.sex
val city = userInfo.city
val fullInfo = aggrInfo + "|" +
Constants.FIELD_AGE + "=" + age + "|" +
Constants.FIELD_PROFESSIONAL + "=" + professional + "|" +
Constants.FIELD_SEX + "=" + sex + "|" +
Constants.FIELD_CITY + "=" + city
val sessionId = StringUtil.getFieldFromConcatString(aggrInfo, "\\|", Constants.FIELD_SESSION_ID)
(sessionId, fullInfo)
}
}
- Filter the data and update the accumulator.
Definition of the accumulator:
package scala
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable
class sessionAccumulator extends AccumulatorV2[String,mutable.HashMap[String,Int]] {
val countMap=new mutable.HashMap[String,Int]();
override def isZero: Boolean = {
countMap.isEmpty;
}
override def copy(): AccumulatorV2[String, mutable.HashMap[String, Int]] = {
val acc=new sessionAccumulator;
acc.countMap++=this.countMap;
acc
}
override def reset(): Unit = {
countMap.clear;
}
override def add(v: String): Unit = {
if (!countMap.contains(v)){
countMap+=(v->0);
}
countMap.update(v,countMap(v)+1);
}
override def merge(other: AccumulatorV2[String, mutable.HashMap[String, Int]]): Unit = {
other match{
case acc:sessionAccumulator=>acc.countMap.foldLeft(this.countMap){
case(map,(k,v))=>map+=(k->(map.getOrElse(k,0)+v));
}
}
}
override def value: mutable.HashMap[String, Int] = {
this.countMap;
}
}
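Why the fold in merge? When tasks finish, Spark merges each task's accumulator copy into the driver's, so merge must sum the two count maps key by key. The map-combining logic, isolated from Spark as a pure-Scala sketch (`MergeDemo` and its sample keys are made up for illustration):

```scala
import scala.collection.mutable

object MergeDemo {
  // The same fold as sessionAccumulator.merge: add `from`'s counts into
  // `into`, summing the values of keys that appear in both maps.
  def mergeCounts(into: mutable.HashMap[String, Int],
                  from: mutable.HashMap[String, Int]): mutable.HashMap[String, Int] =
    from.foldLeft(into) { case (map, (k, v)) =>
      map += (k -> (map.getOrElse(k, 0) + v))
    }

  def main(args: Array[String]): Unit = {
    val a = mutable.HashMap("1s_3s" -> 2, "session_count" -> 5)
    val b = mutable.HashMap("1s_3s" -> 1, "4s_6s" -> 3)
    println(mergeCounts(a, b)) // "1s_3s" summed to 3; the other keys are kept
  }
}
```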
The filter function:
def filterInfo(finalInfo: RDD[(String, String)],task:JSONObject,accumulator:sessionAccumulator) = {
//1. Get the filter conditions
//Read the basic constraint values from the task config
val startAge = task.get(Constants.PARAM_START_AGE);
val endAge = task.get( Constants.PARAM_END_AGE);
val professionals = task.get(Constants.PARAM_PROFESSIONALS)
val cities = task.get(Constants.PARAM_CITIES)
val sex = task.get(Constants.PARAM_SEX)
val keywords =task.get(Constants.PARAM_KEYWORDS)
val categoryIds = task.get(Constants.PARAM_CATEGORY_IDS)
//Concatenate the conditions into a single string
var filterInfo = (if(startAge != null) Constants.PARAM_START_AGE + "=" + startAge + "|" else "") +
(if (endAge != null) Constants.PARAM_END_AGE + "=" + endAge + "|" else "") +
(if (professionals != null) Constants.PARAM_PROFESSIONALS + "=" + professionals + "|" else "") +
(if (cities != null) Constants.PARAM_CITIES + "=" + cities + "|" else "") +
(if (sex != null) Constants.PARAM_SEX + "=" + sex + "|" else "") +
(if (keywords != null) Constants.PARAM_KEYWORDS + "=" + keywords + "|" else "") +
(if (categoryIds != null) Constants.PARAM_CATEGORY_IDS + "=" + categoryIds else "")
if(filterInfo.endsWith("|")) //endsWith takes a plain string, not a regex
filterInfo = filterInfo.substring(0, filterInfo.length - 1)
finalInfo.filter{
case (sessionId,fullInfo)=>{
var success=true;
if(!ValidUtils.between(fullInfo, Constants.FIELD_AGE, filterInfo, Constants.PARAM_START_AGE, Constants.PARAM_END_AGE)){
success = false
}else if(!ValidUtils.in(fullInfo, Constants.FIELD_PROFESSIONAL, filterInfo, Constants.PARAM_PROFESSIONALS)){
success = false
}else if(!ValidUtils.in(fullInfo, Constants.FIELD_CITY, filterInfo, Constants.PARAM_CITIES)){
success = false
}else if(!ValidUtils.equal(fullInfo, Constants.FIELD_SEX, filterInfo, Constants.PARAM_SEX)){
success = false
}else if(!ValidUtils.in(fullInfo, Constants.FIELD_SEARCH_KEYWORDS, filterInfo, Constants.PARAM_KEYWORDS)){
success = false
}else if(!ValidUtils.in(fullInfo, Constants.FIELD_CLICK_CATEGORY_IDS, filterInfo, Constants.PARAM_CATEGORY_IDS)){
success = false
}
//Update the accumulator
if (success){
//First increment the total session count
accumulator.add(Constants.SESSION_COUNT);
val visitLength=StringUtil.getFieldFromConcatString(fullInfo,"\\|",Constants.FIELD_VISIT_LENGTH).toLong;
val stepLength=StringUtil.getFieldFromConcatString(fullInfo,"\\|",Constants.FIELD_STEP_LENGTH).toLong;
calculateVisitLength(visitLength,accumulator);
calculateStepLength(stepLength,accumulator);
}
success;
}
}
}
calculateVisitLength:
def calculateVisitLength(visitLength: Long, sessionStatisticAccumulator: sessionAccumulator) = {
if(visitLength >= 1 && visitLength <= 3){
sessionStatisticAccumulator.add(Constants.TIME_PERIOD_1s_3s)
}else if(visitLength >=4 && visitLength <= 6){
sessionStatisticAccumulator.add(Constants.TIME_PERIOD_4s_6s)
}else if (visitLength >= 7 && visitLength <= 9) {
sessionStatisticAccumulator.add(Constants.TIME_PERIOD_7s_9s)
} else if (visitLength >= 10 && visitLength <= 30) {
sessionStatisticAccumulator.add(Constants.TIME_PERIOD_10s_30s)
} else if (visitLength > 30 && visitLength <= 60) {
sessionStatisticAccumulator.add(Constants.TIME_PERIOD_30s_60s)
} else if (visitLength > 60 && visitLength <= 180) {
sessionStatisticAccumulator.add(Constants.TIME_PERIOD_1m_3m)
} else if (visitLength > 180 && visitLength <= 600) {
sessionStatisticAccumulator.add(Constants.TIME_PERIOD_3m_10m)
} else if (visitLength > 600 && visitLength <= 1800) {
sessionStatisticAccumulator.add(Constants.TIME_PERIOD_10m_30m)
} else if (visitLength > 1800) {
sessionStatisticAccumulator.add(Constants.TIME_PERIOD_30m)
}
}
- Compute each bucket's ratio and write the results to the database:
def getSessionRatio(sparkSession: SparkSession,taskUUID:String, FilterInfo: RDD[(String, String)], value:mutable.HashMap[String,Int]) = {
val session_count = value.getOrElse(Constants.SESSION_COUNT, 1).toDouble
val visit_length_1s_3s = value.getOrElse(Constants.TIME_PERIOD_1s_3s, 0)
val visit_length_4s_6s = value.getOrElse(Constants.TIME_PERIOD_4s_6s, 0)
val visit_length_7s_9s = value.getOrElse(Constants.TIME_PERIOD_7s_9s, 0)
val visit_length_10s_30s = value.getOrElse(Constants.TIME_PERIOD_10s_30s, 0)
val visit_length_30s_60s = value.getOrElse(Constants.TIME_PERIOD_30s_60s, 0)
val visit_length_1m_3m = value.getOrElse(Constants.TIME_PERIOD_1m_3m, 0)
val visit_length_3m_10m = value.getOrElse(Constants.TIME_PERIOD_3m_10m, 0)
val visit_length_10m_30m = value.getOrElse(Constants.TIME_PERIOD_10m_30m, 0)
val visit_length_30m = value.getOrElse(Constants.TIME_PERIOD_30m, 0)
val step_length_1_3 = value.getOrElse(Constants.STEP_PERIOD_1_3, 0)
val step_length_4_6 = value.getOrElse(Constants.STEP_PERIOD_4_6, 0)
val step_length_7_9 = value.getOrElse(Constants.STEP_PERIOD_7_9, 0)
val step_length_10_30 = value.getOrElse(Constants.STEP_PERIOD_10_30, 0)
val step_length_30_60 = value.getOrElse(Constants.STEP_PERIOD_30_60, 0)
val step_length_60 = value.getOrElse(Constants.STEP_PERIOD_60, 0)
val visit_length_1s_3s_ratio = NumberUtils.formatDouble(visit_length_1s_3s / session_count, 2)
val visit_length_4s_6s_ratio = NumberUtils.formatDouble(visit_length_4s_6s / session_count, 2)
val visit_length_7s_9s_ratio = NumberUtils.formatDouble(visit_length_7s_9s / session_count, 2)
val visit_length_10s_30s_ratio = NumberUtils.formatDouble(visit_length_10s_30s / session_count, 2)
val visit_length_30s_60s_ratio = NumberUtils.formatDouble(visit_length_30s_60s / session_count, 2)
val visit_length_1m_3m_ratio = NumberUtils.formatDouble(visit_length_1m_3m / session_count, 2)
val visit_length_3m_10m_ratio = NumberUtils.formatDouble(visit_length_3m_10m / session_count, 2)
val visit_length_10m_30m_ratio = NumberUtils.formatDouble(visit_length_10m_30m / session_count, 2)
val visit_length_30m_ratio = NumberUtils.formatDouble(visit_length_30m / session_count, 2)
val step_length_1_3_ratio = NumberUtils.formatDouble(step_length_1_3 / session_count, 2)
val step_length_4_6_ratio = NumberUtils.formatDouble(step_length_4_6 / session_count, 2)
val step_length_7_9_ratio = NumberUtils.formatDouble(step_length_7_9 / session_count, 2)
val step_length_10_30_ratio = NumberUtils.formatDouble(step_length_10_30 / session_count, 2)
val step_length_30_60_ratio = NumberUtils.formatDouble(step_length_30_60 / session_count, 2)
val step_length_60_ratio = NumberUtils.formatDouble(step_length_60 / session_count, 2)
//Package the results
val stat = SessionAggrStat(taskUUID, session_count.toInt, visit_length_1s_3s_ratio, visit_length_4s_6s_ratio, visit_length_7s_9s_ratio,
visit_length_10s_30s_ratio, visit_length_30s_60s_ratio, visit_length_1m_3m_ratio,
visit_length_3m_10m_ratio, visit_length_10m_30m_ratio, visit_length_30m_ratio,
step_length_1_3_ratio, step_length_4_6_ratio, step_length_7_9_ratio,
step_length_10_30_ratio, step_length_30_60_ratio, step_length_60_ratio)
val sessionRatioRDD = sparkSession.sparkContext.makeRDD(Array(stat))
//Write to the database
import sparkSession.implicits._
sessionRatioRDD.toDF().write
.format("jdbc")
.option("url", ConfigurationManager.config.getString(Constants.JDBC_URL))
.option("user", ConfigurationManager.config.getString(Constants.JDBC_USER))
.option("password", ConfigurationManager.config.getString(Constants.JDBC_PASSWORD))
.option("dbtable", "session_stat_ratio_0416")
.mode(SaveMode.Append)
.save()
sessionRatioRDD;
}
Complete code:
Main class:
package scala
import java.util.UUID
import com.alibaba.fastjson.{JSON, JSONObject}
import commons.conf.ConfigurationManager
import commons.constant.Constants
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import server.serverOne
object sessionStat {
def main(args: Array[String]): Unit = {
//server
val oneServer=new serverOne;
//sparksession
val conf=new SparkConf().setAppName("session").setMaster("local[*]");
val session=SparkSession.builder().config(conf).getOrCreate();
session.sparkContext.setLogLevel("ERROR");
//Load the task configuration
val str=ConfigurationManager.config.getString(Constants.TASK_PARAMS);
val task:JSONObject=JSON.parseObject(str);
//Primary key for this run
val taskUUID=UUID.randomUUID().toString;
val filterInfo=getFilterFullResult(oneServer,session,task,taskUUID);
}
def getFilterFullResult(oneServer: serverOne, session: SparkSession, task: JSONObject,taskUUID:String) ={
//1. Load the basic action records
val basicActions=oneServer.basicActions(session,task);
//2. Map to (sessionId, action) and group each session's actions
val basicActionMap=basicActions.map(item=>{
val sessionId=item.session_id;
(sessionId,item);
})
val groupBasicActions=basicActionMap.groupByKey();
//3. Aggregate each session's actions into a single info string
val aggUserActions=oneServer.AggActionGroup(groupBasicActions);
//4. Read the basic user profiles from Hadoop
val userInfo=oneServer.getUserInfo(session);
//5. Join userInfo into aggUserActions on user_id to build the complete record
val finalInfo=oneServer.AggInfoAndActions(aggUserActions,userInfo);
finalInfo.cache();
//6. Filter by the constraints in the common module and update the accumulator
val accumulator=new sessionAccumulator;
session.sparkContext.register(accumulator);
val FilterInfo=oneServer.filterInfo(finalInfo,task,accumulator);
FilterInfo.foreach(println);
/*
At this point we have every session that passes the filter, plus the
per-bucket session counts (in the accumulator).
*/
//7. Compute each bucket's session ratio
val sessionRatioCount= oneServer.getSessionRatio(session,taskUUID,FilterInfo,accumulator.value);
sessionRatioCount.foreach(println);
}
}
Service class:
package server
import java.util.Date
import com.alibaba.fastjson.JSONObject
import commons.conf.ConfigurationManager
import commons.constant.Constants
import commons.model.{SessionAggrStat, UserInfo, UserVisitAction}
import commons.utils.{DateUtils, NumberUtils, StringUtil, ValidUtils}
import org.apache.commons.lang.StringUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SaveMode, SparkSession}
import scala.collection.mutable
class serverOne extends Serializable {
def getSessionRatio(sparkSession: SparkSession,taskUUID:String, FilterInfo: RDD[(String, String)], value:mutable.HashMap[String,Int]) = {
val session_count = value.getOrElse(Constants.SESSION_COUNT, 1).toDouble
val visit_length_1s_3s = value.getOrElse(Constants.TIME_PERIOD_1s_3s, 0)
val visit_length_4s_6s = value.getOrElse(Constants.TIME_PERIOD_4s_6s, 0)
val visit_length_7s_9s = value.getOrElse(Constants.TIME_PERIOD_7s_9s, 0)
val visit_length_10s_30s = value.getOrElse(Constants.TIME_PERIOD_10s_30s, 0)
val visit_length_30s_60s = value.getOrElse(Constants.TIME_PERIOD_30s_60s, 0)
val visit_length_1m_3m = value.getOrElse(Constants.TIME_PERIOD_1m_3m, 0)
val visit_length_3m_10m = value.getOrElse(Constants.TIME_PERIOD_3m_10m, 0)
val visit_length_10m_30m = value.getOrElse(Constants.TIME_PERIOD_10m_30m, 0)
val visit_length_30m = value.getOrElse(Constants.TIME_PERIOD_30m, 0)
val step_length_1_3 = value.getOrElse(Constants.STEP_PERIOD_1_3, 0)
val step_length_4_6 = value.getOrElse(Constants.STEP_PERIOD_4_6, 0)
val step_length_7_9 = value.getOrElse(Constants.STEP_PERIOD_7_9, 0)
val step_length_10_30 = value.getOrElse(Constants.STEP_PERIOD_10_30, 0)
val step_length_30_60 = value.getOrElse(Constants.STEP_PERIOD_30_60, 0)
val step_length_60 = value.getOrElse(Constants.STEP_PERIOD_60, 0)
val visit_length_1s_3s_ratio = NumberUtils.formatDouble(visit_length_1s_3s / session_count, 2)
val visit_length_4s_6s_ratio = NumberUtils.formatDouble(visit_length_4s_6s / session_count, 2)
val visit_length_7s_9s_ratio = NumberUtils.formatDouble(visit_length_7s_9s / session_count, 2)
val visit_length_10s_30s_ratio = NumberUtils.formatDouble(visit_length_10s_30s / session_count, 2)
val visit_length_30s_60s_ratio = NumberUtils.formatDouble(visit_length_30s_60s / session_count, 2)
val visit_length_1m_3m_ratio = NumberUtils.formatDouble(visit_length_1m_3m / session_count, 2)
val visit_length_3m_10m_ratio = NumberUtils.formatDouble(visit_length_3m_10m / session_count, 2)
val visit_length_10m_30m_ratio = NumberUtils.formatDouble(visit_length_10m_30m / session_count, 2)
val visit_length_30m_ratio = NumberUtils.formatDouble(visit_length_30m / session_count, 2)
val step_length_1_3_ratio = NumberUtils.formatDouble(step_length_1_3 / session_count, 2)
val step_length_4_6_ratio = NumberUtils.formatDouble(step_length_4_6 / session_count, 2)
val step_length_7_9_ratio = NumberUtils.formatDouble(step_length_7_9 / session_count, 2)
val step_length_10_30_ratio = NumberUtils.formatDouble(step_length_10_30 / session_count, 2)
val step_length_30_60_ratio = NumberUtils.formatDouble(step_length_30_60 / session_count, 2)
val step_length_60_ratio = NumberUtils.formatDouble(step_length_60 / session_count, 2)
//Package the results
val stat = SessionAggrStat(taskUUID, session_count.toInt, visit_length_1s_3s_ratio, visit_length_4s_6s_ratio, visit_length_7s_9s_ratio,
visit_length_10s_30s_ratio, visit_length_30s_60s_ratio, visit_length_1m_3m_ratio,
visit_length_3m_10m_ratio, visit_length_10m_30m_ratio, visit_length_30m_ratio,
step_length_1_3_ratio, step_length_4_6_ratio, step_length_7_9_ratio,
step_length_10_30_ratio, step_length_30_60_ratio, step_length_60_ratio)
val sessionRatioRDD = sparkSession.sparkContext.makeRDD(Array(stat))
//Write to the database
import sparkSession.implicits._
sessionRatioRDD.toDF().write
.format("jdbc")
.option("url", ConfigurationManager.config.getString(Constants.JDBC_URL))
.option("user", ConfigurationManager.config.getString(Constants.JDBC_USER))
.option("password", ConfigurationManager.config.getString(Constants.JDBC_PASSWORD))
.option("dbtable", "session_stat_ratio_0416")
.mode(SaveMode.Append)
.save()
sessionRatioRDD;
}
def filterInfo(finalInfo: RDD[(String, String)],task:JSONObject,accumulator:sessionAccumulator) = {
//1. Get the filter conditions
//Read the basic constraint values from the task config
val startAge = task.get(Constants.PARAM_START_AGE);
val endAge = task.get( Constants.PARAM_END_AGE);
val professionals = task.get(Constants.PARAM_PROFESSIONALS)
val cities = task.get(Constants.PARAM_CITIES)
val sex = task.get(Constants.PARAM_SEX)
val keywords =task.get(Constants.PARAM_KEYWORDS)
val categoryIds = task.get(Constants.PARAM_CATEGORY_IDS)
//Concatenate the conditions into a single string
var filterInfo = (if(startAge != null) Constants.PARAM_START_AGE + "=" + startAge + "|" else "") +
(if (endAge != null) Constants.PARAM_END_AGE + "=" + endAge + "|" else "") +
(if (professionals != null) Constants.PARAM_PROFESSIONALS + "=" + professionals + "|" else "") +
(if (cities != null) Constants.PARAM_CITIES + "=" + cities + "|" else "") +
(if (sex != null) Constants.PARAM_SEX + "=" + sex + "|" else "") +
(if (keywords != null) Constants.PARAM_KEYWORDS + "=" + keywords + "|" else "") +
(if (categoryIds != null) Constants.PARAM_CATEGORY_IDS + "=" + categoryIds else "")
if(filterInfo.endsWith("|")) //endsWith takes a plain string, not a regex
filterInfo = filterInfo.substring(0, filterInfo.length - 1)
finalInfo.filter{
case (sessionId,fullInfo)=>{
var success=true;
if(!ValidUtils.between(fullInfo, Constants.FIELD_AGE, filterInfo, Constants.PARAM_START_AGE, Constants.PARAM_END_AGE)){
success = false
}else if(!ValidUtils.in(fullInfo, Constants.FIELD_PROFESSIONAL, filterInfo, Constants.PARAM_PROFESSIONALS)){
success = false
}else if(!ValidUtils.in(fullInfo, Constants.FIELD_CITY, filterInfo, Constants.PARAM_CITIES)){
success = false
}else if(!ValidUtils.equal(fullInfo, Constants.FIELD_SEX, filterInfo, Constants.PARAM_SEX)){
success = false
}else if(!ValidUtils.in(fullInfo, Constants.FIELD_SEARCH_KEYWORDS, filterInfo, Constants.PARAM_KEYWORDS)){
success = false
}else if(!ValidUtils.in(fullInfo, Constants.FIELD_CLICK_CATEGORY_IDS, filterInfo, Constants.PARAM_CATEGORY_IDS)){
success = false
}
//Update the accumulator
if (success){
//First increment the total session count
accumulator.add(Constants.SESSION_COUNT);
val visitLength=StringUtil.getFieldFromConcatString(fullInfo,"\\|",Constants.FIELD_VISIT_LENGTH).toLong;
val stepLength=StringUtil.getFieldFromConcatString(fullInfo,"\\|",Constants.FIELD_STEP_LENGTH).toLong;
calculateVisitLength(visitLength,accumulator);
calculateStepLength(stepLength,accumulator);
}
success;
}
}
}
def calculateVisitLength(visitLength: Long, sessionStatisticAccumulator: sessionAccumulator) = {
if(visitLength >= 1 && visitLength <= 3){
sessionStatisticAccumulator.add(Constants.TIME_PERIOD_1s_3s)
}else if(visitLength >=4 && visitLength <= 6){
sessionStatisticAccumulator.add(Constants.TIME_PERIOD_4s_6s)
}else if (visitLength >= 7 && visitLength <= 9) {
sessionStatisticAccumulator.add(Constants.TIME_PERIOD_7s_9s)
} else if (visitLength >= 10 && visitLength <= 30) {
sessionStatisticAccumulator.add(Constants.TIME_PERIOD_10s_30s)
} else if (visitLength > 30 && visitLength <= 60) {
sessionStatisticAccumulator.add(Constants.TIME_PERIOD_30s_60s)
} else if (visitLength > 60 && visitLength <= 180) {
sessionStatisticAccumulator.add(Constants.TIME_PERIOD_1m_3m)
} else if (visitLength > 180 && visitLength <= 600) {
sessionStatisticAccumulator.add(Constants.TIME_PERIOD_3m_10m)
} else if (visitLength > 600 && visitLength <= 1800) {
sessionStatisticAccumulator.add(Constants.TIME_PERIOD_10m_30m)
} else if (visitLength > 1800) {
sessionStatisticAccumulator.add(Constants.TIME_PERIOD_30m)
}
}
def calculateStepLength(stepLength: Long, sessionStatisticAccumulator: sessionAccumulator) = {
if(stepLength >=1 && stepLength <=3){
sessionStatisticAccumulator.add(Constants.STEP_PERIOD_1_3)
}else if (stepLength >= 4 && stepLength <= 6) {
sessionStatisticAccumulator.add(Constants.STEP_PERIOD_4_6)
} else if (stepLength >= 7 && stepLength <= 9) {
sessionStatisticAccumulator.add(Constants.STEP_PERIOD_7_9)
} else if (stepLength >= 10 && stepLength <= 30) {
sessionStatisticAccumulator.add(Constants.STEP_PERIOD_10_30)
} else if (stepLength > 30 && stepLength <= 60) {
sessionStatisticAccumulator.add(Constants.STEP_PERIOD_30_60)
} else if (stepLength > 60) {
sessionStatisticAccumulator.add(Constants.STEP_PERIOD_60)
}
}
def getUserInfo(session: SparkSession) = {
import session.implicits._;
val ds=session.read.parquet("hdfs://hadoop1:9000/data/user_Info").as[UserInfo].map(item=>(item.user_id,item));
ds.rdd;
}
def basicActions(session: SparkSession, task: JSONObject) = {
import session.implicits._
val ds = session.read.parquet("hdfs://hadoop1:9000/data/user_visit_action").as[UserVisitAction]
//Keep only the actions inside the task's date range. Note that Dataset
//transformations return a new Dataset: the result of filter must be kept,
//otherwise the unfiltered data would be returned.
val filtered = ds.filter(item => {
val date = item.action_time
val start = task.getString(Constants.PARAM_START_DATE)
val end = task.getString(Constants.PARAM_END_DATE)
date >= start && date <= end
})
filtered.rdd
}
def AggActionGroup(groupBasicActions: RDD[(String, Iterable[UserVisitAction])])={
groupBasicActions.map{
case (sessionId,actions)=>{
var userId = -1L
var startTime:Date = null
var endTime:Date = null
var stepLength = 0
val searchKeywords = new StringBuffer("")
val clickCategories = new StringBuffer("")
//Iterate over the actions, updating the aggregates
for (action<-actions){
if(userId == -1L){
userId=action.user_id;
}
val time=DateUtils.parseTime(action.action_time);
if (startTime==null||startTime.after(time))startTime=time;
if (endTime==null||endTime.before(time))endTime=time;
val key=action.search_keyword;
if (!StringUtils.isEmpty(key) && !searchKeywords.toString.contains(key))searchKeywords.append(key+",");
val click=action.click_category_id;
if (click != -1L && !clickCategories.toString.contains(click.toString)) clickCategories.append(click + ",");
stepLength+=1;
}
val searchKw = StringUtil.trimComma(searchKeywords.toString)
val clickCg = StringUtil.trimComma(clickCategories.toString)
val visitLength = (endTime.getTime - startTime.getTime) / 1000
val aggrInfo = Constants.FIELD_SESSION_ID + "=" + sessionId + "|" +
Constants.FIELD_SEARCH_KEYWORDS + "=" + searchKw + "|" +
Constants.FIELD_CLICK_CATEGORY_IDS + "=" + clickCg + "|" +
Constants.FIELD_VISIT_LENGTH + "=" + visitLength + "|" +
Constants.FIELD_STEP_LENGTH + "=" + stepLength + "|" +
Constants.FIELD_START_TIME + "=" + DateUtils.formatTime(startTime)
(userId, aggrInfo)
}
}
}
def AggInfoAndActions(aggUserActions: RDD[(Long, String)], userInfo: RDD[(Long, UserInfo)])={
//Match records on user_id ===> use the join operator
userInfo.join(aggUserActions).map{
case (userId,(userInfo: UserInfo,aggrInfo))=>{
val age = userInfo.age
val professional = userInfo.professional
val sex = userInfo.sex
val city = userInfo.city
val fullInfo = aggrInfo + "|" +
Constants.FIELD_AGE + "=" + age + "|" +
Constants.FIELD_PROFESSIONAL + "=" + professional + "|" +
Constants.FIELD_SEX + "=" + sex + "|" +
Constants.FIELD_CITY + "=" + city
val sessionId = StringUtil.getFieldFromConcatString(aggrInfo, "\\|", Constants.FIELD_SESSION_ID)
(sessionId, fullInfo)
}
}
}
}
Summary
- One user corresponds to many records, so first aggregate them into a single consolidated record.
- For k-v RDDs that share the same keys but carry different value types, join combines them.
- A custom accumulator counts the per-bucket sessions in a single pass.
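The join-then-concatenate pattern used in AggInfoAndActions can be illustrated with plain Scala collections, no Spark required; the maps and field names below are made up for the example:

```scala
object JoinDemo {
  // Analogue of userInfo.join(aggUserActions): both sides are keyed by
  // user_id but carry different value types. An inner join keeps only ids
  // present in both maps, then appends the profile fields to the
  // aggregated info string.
  def join(aggInfo: Map[Long, String],
           userInfo: Map[Long, (String, Int)]): Map[Long, String] =
    aggInfo.flatMap { case (id, agg) =>
      userInfo.get(id).map { case (sex, age) =>
        id -> (agg + s"|sex=$sex|age=$age")
      }
    }

  def main(args: Array[String]): Unit = {
    val aggInfo  = Map(1L -> "sessionid=s1|visitLength=3471|stepLength=48")
    val userInfo = Map(1L -> ("male", 45))
    join(aggInfo, userInfo).foreach(println) // full record keyed by user id
  }
}
```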