Requirement 1: Session step-length and visit-duration distribution by range

What are we building?

For each predefined range, compute the proportion of qualifying sessions whose visit duration and step length fall into that range.

  • Visit duration: the time between a session's earliest and latest action (e.g. a first action at 10:00:26 and a last action at 10:58:17 give a visit duration of 3471s).

  • Visit step length: the number of actions in the session.

That is: among the sessions matching the filter conditions, compute the proportion whose visit duration falls into 1s–3s, 4s–6s, 7s–9s, 10s–30s, 30s–60s, 1m–3m, 3m–10m, 10m–30m, or over 30m, and whose step length falls into 1–3, 4–6, 7–9, 10–30, 30–60, or over 60.
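Each of these ranges is identified by a string key in the project's Constants object; the custom accumulator introduced below counts sessions under these keys. The constant names all appear in the code, but their string values are never shown in this post, so the values below are assumptions, a minimal sketch:

object Constants {
  val SESSION_COUNT = "session_count"
  //visit-duration range keys (values assumed)
  val TIME_PERIOD_1s_3s = "1s_3s";     val TIME_PERIOD_4s_6s = "4s_6s"
  val TIME_PERIOD_7s_9s = "7s_9s";     val TIME_PERIOD_10s_30s = "10s_30s"
  val TIME_PERIOD_30s_60s = "30s_60s"; val TIME_PERIOD_1m_3m = "1m_3m"
  val TIME_PERIOD_3m_10m = "3m_10m";   val TIME_PERIOD_10m_30m = "10m_30m"
  val TIME_PERIOD_30m = "30m"
  //step-length range keys (values assumed)
  val STEP_PERIOD_1_3 = "1_3";     val STEP_PERIOD_4_6 = "4_6"
  val STEP_PERIOD_7_9 = "7_9";     val STEP_PERIOD_10_30 = "10_30"
  val STEP_PERIOD_30_60 = "30_60"; val STEP_PERIOD_60 = "60"
}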

Requirement analysis:

The corresponding case class:

case class UserVisitAction(date: String,
                           user_id: Long,
                           session_id: String,
                           page_id: Long,
                           action_time: String,
                           search_keyword: String,
                           click_category_id: Long,
                           click_product_id: Long,
                           order_category_ids: String,
                           order_product_ids: String,
                           pay_category_ids: String,
                           pay_product_ids: String,
                           city_id: Long
                          )

Each action record represents one user behavior (a click, an order, a payment, and so on), so the full dataset contains many actions for the same user and session.

  • Based on this, the natural idea is to use groupByKey to gather all of a session's actions together, then filter each session against the task's constraints.
  • The catch is that a session contains many actions, and we cannot test them one record at a time. So before filtering we collapse each session's actions into a single summary record, computing the visit duration and step length in the same pass. Keyed by userId for the join in the next step, the result looks like:
(UserId, searchKeywords=Lamer,小龍蝦,機器學習,吸塵器,蘋果,洗面奶,保溫杯,華爲手機|clickCategoryIds=|visitLength=3471|stepLength=48|startTime=2020-05-23 10:00:26)
  • Next, we join this record with UserInfo by userId, forming a new, more complete record in the same "|"-delimited key=value format (a sketch of a parser for this format follows this list):
(e8ef831e7fd4475990a80e425af946ad,sessionid=e8ef831e7fd4475990a80e425af946ad|searchKeywords=Lamer,小龍蝦,機器學習,吸塵器,蘋果,洗面奶,保溫杯,華爲手機|clickCategoryIds=|visitLength=3471|stepLength=48|startTime=2020-05-23 10:00:26|age=45|professional=professional43|sex=male|city=city2)
  • Now we can filter these records, updating the custom accumulator (which counts sessions per visit-duration and step-length range) as each one passes.
  • Finally, dividing each range's count by the total session count gives the proportions.
    The overall flow: read actions → map to (sessionId, action) and group → aggregate each session into one record → join with user info by userId → filter while updating the accumulator → compute ratios and write to the database.
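Throughout the code below, single fields are pulled back out of these "|"-delimited key=value records with StringUtil.getFieldFromConcatString(str, "\\|", field). That helper lives in the project's commons.utils module and its source is not shown here; a minimal sketch of what it presumably does (signature inferred from the call sites, implementation assumed):

object StringUtilSketch {
  //Return the value stored under `field` in a "k1=v1|k2=v2|..." string.
  //`delimiter` is a regex, hence "\\|" at the call sites.
  def getFieldFromConcatString(str: String, delimiter: String, field: String): String = {
    str.split(delimiter)
      .map(_.split("=", 2))
      .collectFirst { case Array(k, v) if k == field => v }
      .orNull
  }
}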

Step-by-step walkthrough:

  1. Read the action data from Hadoop and return the basic action records, filtered to the task's date range:
  def basicActions(session:SparkSession,task:JSONObject)={
    import session.implicits._;
    val ds=session.read.parquet("hdfs://hadoop1:9000/data/user_visit_action").as[UserVisitAction];
    val start=task.getString(Constants.PARAM_START_DATE);
    val end=task.getString(Constants.PARAM_END_DATE);
    //Keep only actions inside the task's date range (string comparison is safe
    //because action_time is formatted yyyy-MM-dd HH:mm:ss) and return that RDD
    ds.filter(item=>item.action_time>=start&&item.action_time<=end).rdd;
  }
  2. Map each action to (sessionId, action), then groupByKey to gather all actions of the same session:
    val basicActionMap=basicActions.map(item=>{
       val sessionId=item.session_id;
      (sessionId,item);
    })
     val groupBasicActions=basicActionMap.groupByKey();
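groupBasicActions now has type RDD[(String, Iterable[UserVisitAction])]: one entry per session, carrying every action that session performed.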
  3. Read the user table from Hadoop to get each user's basic info, keyed by user_id:
  def getUserInfo(session: SparkSession) = {
    import session.implicits._;
    val ds=session.read.parquet("hdfs://hadoop1:9000/data/user_Info").as[UserInfo].map(item=>(item.user_id,item));
    ds.rdd;
  }
  4. Join the user info with the aggregated session info to form the new, complete record:
  def AggInfoAndActions(aggUserActions: RDD[(Long, String)], userInfo: RDD[(Long, UserInfo)])={
    //Join on user_id to combine the two RDDs
    userInfo.join(aggUserActions).map{
      case (userId,(userInfo: UserInfo,aggrInfo))=>{
        val age = userInfo.age
        val professional = userInfo.professional
        val sex = userInfo.sex
        val city = userInfo.city
        val fullInfo = aggrInfo + "|" +
          Constants.FIELD_AGE + "=" + age + "|" +
          Constants.FIELD_PROFESSIONAL + "=" + professional + "|" +
          Constants.FIELD_SEX + "=" + sex + "|" +
          Constants.FIELD_CITY + "=" + city

        val sessionId = StringUtil.getFieldFromConcatString(aggrInfo, "\\|", Constants.FIELD_SESSION_ID)

        (sessionId, fullInfo)
      }
    }
  }
  5. Filter the data, updating the accumulator in the same pass.
    The accumulator definition:
package scala

import org.apache.spark.util.AccumulatorV2

import scala.collection.mutable

class sessionAccumulator extends AccumulatorV2[String,mutable.HashMap[String,Int]] {
  val countMap=new mutable.HashMap[String,Int]();
  override def isZero: Boolean = {
    countMap.isEmpty;
  }

  override def copy(): AccumulatorV2[String, mutable.HashMap[String, Int]] = {
    val acc=new sessionAccumulator;
    acc.countMap++=this.countMap;
    acc
  }

  override def reset(): Unit = {
    countMap.clear;
  }

  override def add(v: String): Unit = {
    //Increment the count for range key v, starting from 0 on first sight
    countMap += (v -> (countMap.getOrElse(v, 0) + 1));
  }

  override def merge(other: AccumulatorV2[String, mutable.HashMap[String, Int]]): Unit = {
    other match{
      case acc:sessionAccumulator=>
        //Fold the other accumulator's counts into this one, summing per key
        acc.countMap.foldLeft(this.countMap){
          case(map,(k,v))=>map+=(k->(map.getOrElse(k,0)+v));
        }
    }
  }

  override def value: mutable.HashMap[String, Int] = {
    this.countMap;
  }
}
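A usage sketch, mirroring the main function further below: the accumulator must be registered with the SparkContext on the driver before the job runs; add is called from executor-side code, and value is read back on the driver after an action has executed:

val accumulator = new sessionAccumulator
session.sparkContext.register(accumulator, "sessionStat")
//executor side, inside a transformation such as filter:
//  accumulator.add(Constants.TIME_PERIOD_1s_3s)
//driver side, once an action has run:
//  accumulator.value   //mutable.HashMap(range key -> session count)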

The filter function:

 def filterInfo(finalInfo: RDD[(String, String)],task:JSONObject,accumulator:sessionAccumulator) = {
    //1. Read the constraint parameters from the task config
    val startAge = task.get(Constants.PARAM_START_AGE);
    val endAge = task.get( Constants.PARAM_END_AGE);
    val professionals = task.get(Constants.PARAM_PROFESSIONALS)
    val cities = task.get(Constants.PARAM_CITIES)
    val sex = task.get(Constants.PARAM_SEX)
    val keywords =task.get(Constants.PARAM_KEYWORDS)
    val categoryIds = task.get(Constants.PARAM_CATEGORY_IDS)

    //Concatenate the non-null constraints into a single filter string
    var filterInfo = (if(startAge != null) Constants.PARAM_START_AGE + "=" + startAge + "|" else "") +
      (if (endAge != null) Constants.PARAM_END_AGE + "=" + endAge + "|" else "") +
      (if (professionals != null) Constants.PARAM_PROFESSIONALS + "=" + professionals + "|" else "") +
      (if (cities != null) Constants.PARAM_CITIES + "=" + cities + "|" else "") +
      (if (sex != null) Constants.PARAM_SEX + "=" + sex + "|" else "") +
      (if (keywords != null) Constants.PARAM_KEYWORDS + "=" + keywords + "|" else "") +
      (if (categoryIds != null) Constants.PARAM_CATEGORY_IDS + "=" + categoryIds else "")

    //Trim the trailing "|", if any (endsWith takes a literal string, not a regex)
    if(filterInfo.endsWith("|"))
      filterInfo = filterInfo.substring(0, filterInfo.length - 1)

    finalInfo.filter{
          case (sessionId,fullInfo)=>{
            var success=true;
            if(!ValidUtils.between(fullInfo, Constants.FIELD_AGE, filterInfo, Constants.PARAM_START_AGE, Constants.PARAM_END_AGE)){
              success = false
            }else if(!ValidUtils.in(fullInfo, Constants.FIELD_PROFESSIONAL, filterInfo, Constants.PARAM_PROFESSIONALS)){
              success = false
            }else if(!ValidUtils.in(fullInfo, Constants.FIELD_CITY, filterInfo, Constants.PARAM_CITIES)){
              success = false
            }else if(!ValidUtils.equal(fullInfo, Constants.FIELD_SEX, filterInfo, Constants.PARAM_SEX)){
              success = false
            }else if(!ValidUtils.in(fullInfo, Constants.FIELD_SEARCH_KEYWORDS, filterInfo, Constants.PARAM_KEYWORDS)){
              success = false
            }else if(!ValidUtils.in(fullInfo, Constants.FIELD_CLICK_CATEGORY_IDS, filterInfo, Constants.PARAM_CATEGORY_IDS)){
              success = false
            }
            //Update the accumulator only for sessions that pass the filter
            if (success){
              //First increment the total session count
              accumulator.add(Constants.SESSION_COUNT);
              val visitLength=StringUtil.getFieldFromConcatString(fullInfo,"\\|",Constants.FIELD_VISIT_LENGTH).toLong;
              val stepLength=StringUtil.getFieldFromConcatString(fullInfo,"\\|",Constants.FIELD_STEP_LENGTH).toLong;

              calculateVisitLength(visitLength,accumulator);
              calculateStepLength(stepLength,accumulator);
            }
            success;
          }
        }
  }
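ValidUtils is another helper from commons.utils whose source is not shown. Judging from the call sites, in(data, dataField, parameter, paramField) presumably passes when the parameter is absent from the filter string, and otherwise checks the session's field against the comma-separated allowed list; between and equal would be the numeric-range and equality analogues. A hedged sketch of in under those assumptions, reusing the parser sketch from earlier:

object ValidUtilsSketch {
  def in(data: String, dataField: String, parameter: String, paramField: String): Boolean = {
    val paramValue = StringUtilSketch.getFieldFromConcatString(parameter, "\\|", paramField)
    if (paramValue == null) return true   //no such constraint: pass
    val dataValue = StringUtilSketch.getFieldFromConcatString(data, "\\|", dataField)
    if (dataValue == null) return false
    //pass if any of the session's comma-separated values is in the allowed list
    val allowed = paramValue.split(",")
    dataValue.split(",").exists(allowed.contains)
  }
}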

calculateVisitLength:

  def calculateVisitLength(visitLength: Long, sessionStatisticAccumulator: sessionAccumulator) = {
    if(visitLength >= 1 && visitLength <= 3){
      sessionStatisticAccumulator.add(Constants.TIME_PERIOD_1s_3s)
    }else if(visitLength >=4 && visitLength  <= 6){
      sessionStatisticAccumulator.add(Constants.TIME_PERIOD_4s_6s)
    }else if (visitLength >= 7 && visitLength <= 9) {
      sessionStatisticAccumulator.add(Constants.TIME_PERIOD_7s_9s)
    } else if (visitLength >= 10 && visitLength <= 30) {
      sessionStatisticAccumulator.add(Constants.TIME_PERIOD_10s_30s)
    } else if (visitLength > 30 && visitLength <= 60) {
      sessionStatisticAccumulator.add(Constants.TIME_PERIOD_30s_60s)
    } else if (visitLength > 60 && visitLength <= 180) {
      sessionStatisticAccumulator.add(Constants.TIME_PERIOD_1m_3m)
    } else if (visitLength > 180 && visitLength <= 600) {
      sessionStatisticAccumulator.add(Constants.TIME_PERIOD_3m_10m)
    } else if (visitLength > 600 && visitLength <= 1800) {
      sessionStatisticAccumulator.add(Constants.TIME_PERIOD_10m_30m)
    } else if (visitLength > 1800) {
      sessionStatisticAccumulator.add(Constants.TIME_PERIOD_30m)
    }
  }
  6. Compute each range's proportion and write the result to the database. (calculateStepLength, shown in the complete code below, buckets step lengths the same way calculateVisitLength buckets visit durations.)
  def getSessionRatio(sparkSession: SparkSession,taskUUID:String, FilterInfo: RDD[(String, String)], value:mutable.HashMap[String,Int]) = {
    //Default to 1 so an empty result does not divide by zero
    val session_count = value.getOrElse(Constants.SESSION_COUNT, 1).toDouble

    val visit_length_1s_3s = value.getOrElse(Constants.TIME_PERIOD_1s_3s, 0)
    val visit_length_4s_6s = value.getOrElse(Constants.TIME_PERIOD_4s_6s, 0)
    val visit_length_7s_9s = value.getOrElse(Constants.TIME_PERIOD_7s_9s, 0)
    val visit_length_10s_30s = value.getOrElse(Constants.TIME_PERIOD_10s_30s, 0)
    val visit_length_30s_60s = value.getOrElse(Constants.TIME_PERIOD_30s_60s, 0)
    val visit_length_1m_3m = value.getOrElse(Constants.TIME_PERIOD_1m_3m, 0)
    val visit_length_3m_10m = value.getOrElse(Constants.TIME_PERIOD_3m_10m, 0)
    val visit_length_10m_30m = value.getOrElse(Constants.TIME_PERIOD_10m_30m, 0)
    val visit_length_30m = value.getOrElse(Constants.TIME_PERIOD_30m, 0)

    val step_length_1_3 = value.getOrElse(Constants.STEP_PERIOD_1_3, 0)
    val step_length_4_6 = value.getOrElse(Constants.STEP_PERIOD_4_6, 0)
    val step_length_7_9 = value.getOrElse(Constants.STEP_PERIOD_7_9, 0)
    val step_length_10_30 = value.getOrElse(Constants.STEP_PERIOD_10_30, 0)
    val step_length_30_60 = value.getOrElse(Constants.STEP_PERIOD_30_60, 0)
    val step_length_60 = value.getOrElse(Constants.STEP_PERIOD_60, 0)

    val visit_length_1s_3s_ratio = NumberUtils.formatDouble(visit_length_1s_3s / session_count, 2)
    val visit_length_4s_6s_ratio = NumberUtils.formatDouble(visit_length_4s_6s / session_count, 2)
    val visit_length_7s_9s_ratio = NumberUtils.formatDouble(visit_length_7s_9s / session_count, 2)
    val visit_length_10s_30s_ratio = NumberUtils.formatDouble(visit_length_10s_30s / session_count, 2)
    val visit_length_30s_60s_ratio = NumberUtils.formatDouble(visit_length_30s_60s / session_count, 2)
    val visit_length_1m_3m_ratio = NumberUtils.formatDouble(visit_length_1m_3m / session_count, 2)
    val visit_length_3m_10m_ratio = NumberUtils.formatDouble(visit_length_3m_10m / session_count, 2)
    val visit_length_10m_30m_ratio = NumberUtils.formatDouble(visit_length_10m_30m / session_count, 2)
    val visit_length_30m_ratio = NumberUtils.formatDouble(visit_length_30m / session_count, 2)

    val step_length_1_3_ratio = NumberUtils.formatDouble(step_length_1_3 / session_count, 2)
    val step_length_4_6_ratio = NumberUtils.formatDouble(step_length_4_6 / session_count, 2)
    val step_length_7_9_ratio = NumberUtils.formatDouble(step_length_7_9 / session_count, 2)
    val step_length_10_30_ratio = NumberUtils.formatDouble(step_length_10_30 / session_count, 2)
    val step_length_30_60_ratio = NumberUtils.formatDouble(step_length_30_60 / session_count, 2)
    val step_length_60_ratio = NumberUtils.formatDouble(step_length_60 / session_count, 2)

    //Package the results into SessionAggrStat
    val stat = SessionAggrStat(taskUUID, session_count.toInt, visit_length_1s_3s_ratio, visit_length_4s_6s_ratio, visit_length_7s_9s_ratio,
      visit_length_10s_30s_ratio, visit_length_30s_60s_ratio, visit_length_1m_3m_ratio,
      visit_length_3m_10m_ratio, visit_length_10m_30m_ratio, visit_length_30m_ratio,
      step_length_1_3_ratio, step_length_4_6_ratio, step_length_7_9_ratio,
      step_length_10_30_ratio, step_length_30_60_ratio, step_length_60_ratio)

    val sessionRatioRDD = sparkSession.sparkContext.makeRDD(Array(stat))

    //Write the result to the database over JDBC
    import sparkSession.implicits._
    sessionRatioRDD.toDF().write
      .format("jdbc")
      .option("url", ConfigurationManager.config.getString(Constants.JDBC_URL))
      .option("user", ConfigurationManager.config.getString(Constants.JDBC_USER))
      .option("password", ConfigurationManager.config.getString(Constants.JDBC_PASSWORD))
      .option("dbtable", "session_stat_ratio_0416")
      .mode(SaveMode.Append)
      .save()
    sessionRatioRDD;
  }
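SessionAggrStat comes from the project's commons.model package and is not reproduced in this post. Judging from the constructor call above, it presumably mirrors the columns of the output table, something like (field names assumed):

case class SessionAggrStat(taskid: String,
                           session_count: Int,
                           visit_length_1s_3s_ratio: Double,
                           visit_length_4s_6s_ratio: Double,
                           visit_length_7s_9s_ratio: Double,
                           visit_length_10s_30s_ratio: Double,
                           visit_length_30s_60s_ratio: Double,
                           visit_length_1m_3m_ratio: Double,
                           visit_length_3m_10m_ratio: Double,
                           visit_length_10m_30m_ratio: Double,
                           visit_length_30m_ratio: Double,
                           step_length_1_3_ratio: Double,
                           step_length_4_6_ratio: Double,
                           step_length_7_9_ratio: Double,
                           step_length_10_30_ratio: Double,
                           step_length_30_60_ratio: Double,
                           step_length_60_ratio: Double)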

Complete code:

Main function:

package scala

import java.util.UUID

import com.alibaba.fastjson.{JSON, JSONObject}
import commons.conf.ConfigurationManager
import commons.constant.Constants
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import server.serverOne

object sessionStat {


  def main(args: Array[String]): Unit = {
    //service object
    val oneServer=new serverOne;
    //sparksession
    val conf=new SparkConf().setAppName("session").setMaster("local[*]");
    val session=SparkSession.builder().config(conf).getOrCreate();
    session.sparkContext.setLogLevel("ERROR");
    //read the task parameters from config
    val str=ConfigurationManager.config.getString(Constants.TASK_PARAMS);
    val task:JSONObject=JSON.parseObject(str);
    //primary key for this task run
    val taskUUID=UUID.randomUUID().toString;

    val filterInfo=getFilterFullResult(oneServer,session,task,taskUUID);
  }
  def getFilterFullResult(oneServer: serverOne, session: SparkSession, task: JSONObject,taskUUID:String) ={
    //1. Read the basic action records
    val basicActions=oneServer.basicActions(session,task);
    //2. Group the actions by sessionId
    val basicActionMap=basicActions.map(item=>{
      val sessionId=item.session_id;
      (sessionId,item);
    })
    val groupBasicActions=basicActionMap.groupByKey();
    //3. Aggregate each session's actions into a single summary string
    val aggUserActions=oneServer.AggActionGroup(groupBasicActions);
    //4. Read the user info from Hadoop
    val userInfo=oneServer.getUserInfo(session);
    //5. Join userInfo into aggUserActions by user_id to form the complete record
    val finalInfo=oneServer.AggInfoAndActions(aggUserActions,userInfo);
    finalInfo.cache();
    //6. Filter by the constraints from the common module, updating the accumulator
    val accumulator=new sessionAccumulator;
    session.sparkContext.register(accumulator);
    val FilterInfo=oneServer.filterInfo(finalInfo,task,accumulator);
    FilterInfo.foreach(println);
    /*
    At this point we have every record that passed the filter, plus the
    per-range session counts (held by the accumulator).
    */
     //7. Compute each range's session proportion
    val sessionRatioCount= oneServer.getSessionRatio(session,taskUUID,FilterInfo,accumulator.value);
    sessionRatioCount.foreach(println);
  }
}
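For context, the task parameters that main parses arrive as a single JSON string from the project's configuration. A hypothetical example (the literal key names behind Constants.PARAM_START_DATE and friends are assumptions):

val str = """{"startDate":"2020-05-20","endDate":"2020-05-25","startAge":"20","endAge":"50","sex":"male"}"""
val task: JSONObject = JSON.parseObject(str)
task.getString("startDate")   //"2020-05-20"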

Service class:

package server

import java.util.Date

import com.alibaba.fastjson.JSONObject
import commons.conf.ConfigurationManager
import commons.constant.Constants
import commons.model.{SessionAggrStat, UserInfo, UserVisitAction}
import commons.utils.{DateUtils, NumberUtils, StringUtil, ValidUtils}
import org.apache.commons.lang.StringUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SaveMode, SparkSession}

import scala.collection.mutable

class serverOne extends Serializable {
  def getSessionRatio(sparkSession: SparkSession,taskUUID:String, FilterInfo: RDD[(String, String)], value:mutable.HashMap[String,Int]) = {
    //Default to 1 so an empty result does not divide by zero
    val session_count = value.getOrElse(Constants.SESSION_COUNT, 1).toDouble

    val visit_length_1s_3s = value.getOrElse(Constants.TIME_PERIOD_1s_3s, 0)
    val visit_length_4s_6s = value.getOrElse(Constants.TIME_PERIOD_4s_6s, 0)
    val visit_length_7s_9s = value.getOrElse(Constants.TIME_PERIOD_7s_9s, 0)
    val visit_length_10s_30s = value.getOrElse(Constants.TIME_PERIOD_10s_30s, 0)
    val visit_length_30s_60s = value.getOrElse(Constants.TIME_PERIOD_30s_60s, 0)
    val visit_length_1m_3m = value.getOrElse(Constants.TIME_PERIOD_1m_3m, 0)
    val visit_length_3m_10m = value.getOrElse(Constants.TIME_PERIOD_3m_10m, 0)
    val visit_length_10m_30m = value.getOrElse(Constants.TIME_PERIOD_10m_30m, 0)
    val visit_length_30m = value.getOrElse(Constants.TIME_PERIOD_30m, 0)

    val step_length_1_3 = value.getOrElse(Constants.STEP_PERIOD_1_3, 0)
    val step_length_4_6 = value.getOrElse(Constants.STEP_PERIOD_4_6, 0)
    val step_length_7_9 = value.getOrElse(Constants.STEP_PERIOD_7_9, 0)
    val step_length_10_30 = value.getOrElse(Constants.STEP_PERIOD_10_30, 0)
    val step_length_30_60 = value.getOrElse(Constants.STEP_PERIOD_30_60, 0)
    val step_length_60 = value.getOrElse(Constants.STEP_PERIOD_60, 0)

    val visit_length_1s_3s_ratio = NumberUtils.formatDouble(visit_length_1s_3s / session_count, 2)
    val visit_length_4s_6s_ratio = NumberUtils.formatDouble(visit_length_4s_6s / session_count, 2)
    val visit_length_7s_9s_ratio = NumberUtils.formatDouble(visit_length_7s_9s / session_count, 2)
    val visit_length_10s_30s_ratio = NumberUtils.formatDouble(visit_length_10s_30s / session_count, 2)
    val visit_length_30s_60s_ratio = NumberUtils.formatDouble(visit_length_30s_60s / session_count, 2)
    val visit_length_1m_3m_ratio = NumberUtils.formatDouble(visit_length_1m_3m / session_count, 2)
    val visit_length_3m_10m_ratio = NumberUtils.formatDouble(visit_length_3m_10m / session_count, 2)
    val visit_length_10m_30m_ratio = NumberUtils.formatDouble(visit_length_10m_30m / session_count, 2)
    val visit_length_30m_ratio = NumberUtils.formatDouble(visit_length_30m / session_count, 2)

    val step_length_1_3_ratio = NumberUtils.formatDouble(step_length_1_3 / session_count, 2)
    val step_length_4_6_ratio = NumberUtils.formatDouble(step_length_4_6 / session_count, 2)
    val step_length_7_9_ratio = NumberUtils.formatDouble(step_length_7_9 / session_count, 2)
    val step_length_10_30_ratio = NumberUtils.formatDouble(step_length_10_30 / session_count, 2)
    val step_length_30_60_ratio = NumberUtils.formatDouble(step_length_30_60 / session_count, 2)
    val step_length_60_ratio = NumberUtils.formatDouble(step_length_60 / session_count, 2)

    //Package the results into SessionAggrStat
    val stat = SessionAggrStat(taskUUID, session_count.toInt, visit_length_1s_3s_ratio, visit_length_4s_6s_ratio, visit_length_7s_9s_ratio,
      visit_length_10s_30s_ratio, visit_length_30s_60s_ratio, visit_length_1m_3m_ratio,
      visit_length_3m_10m_ratio, visit_length_10m_30m_ratio, visit_length_30m_ratio,
      step_length_1_3_ratio, step_length_4_6_ratio, step_length_7_9_ratio,
      step_length_10_30_ratio, step_length_30_60_ratio, step_length_60_ratio)

    val sessionRatioRDD = sparkSession.sparkContext.makeRDD(Array(stat))

    //Write the result to the database over JDBC
    import sparkSession.implicits._
    sessionRatioRDD.toDF().write
      .format("jdbc")
      .option("url", ConfigurationManager.config.getString(Constants.JDBC_URL))
      .option("user", ConfigurationManager.config.getString(Constants.JDBC_USER))
      .option("password", ConfigurationManager.config.getString(Constants.JDBC_PASSWORD))
      .option("dbtable", "session_stat_ratio_0416")
      .mode(SaveMode.Append)
      .save()
    sessionRatioRDD;
  }

  def filterInfo(finalInfo: RDD[(String, String)],task:JSONObject,accumulator:sessionAccumulator) = {
    //1. Read the constraint parameters from the task config
    val startAge = task.get(Constants.PARAM_START_AGE);
    val endAge = task.get( Constants.PARAM_END_AGE);
    val professionals = task.get(Constants.PARAM_PROFESSIONALS)
    val cities = task.get(Constants.PARAM_CITIES)
    val sex = task.get(Constants.PARAM_SEX)
    val keywords =task.get(Constants.PARAM_KEYWORDS)
    val categoryIds = task.get(Constants.PARAM_CATEGORY_IDS)

    //Concatenate the non-null constraints into a single filter string
    var filterInfo = (if(startAge != null) Constants.PARAM_START_AGE + "=" + startAge + "|" else "") +
      (if (endAge != null) Constants.PARAM_END_AGE + "=" + endAge + "|" else "") +
      (if (professionals != null) Constants.PARAM_PROFESSIONALS + "=" + professionals + "|" else "") +
      (if (cities != null) Constants.PARAM_CITIES + "=" + cities + "|" else "") +
      (if (sex != null) Constants.PARAM_SEX + "=" + sex + "|" else "") +
      (if (keywords != null) Constants.PARAM_KEYWORDS + "=" + keywords + "|" else "") +
      (if (categoryIds != null) Constants.PARAM_CATEGORY_IDS + "=" + categoryIds else "")

    //Trim the trailing "|", if any (endsWith takes a literal string, not a regex)
    if(filterInfo.endsWith("|"))
      filterInfo = filterInfo.substring(0, filterInfo.length - 1)

    finalInfo.filter{
          case (sessionId,fullInfo)=>{
            var success=true;
            if(!ValidUtils.between(fullInfo, Constants.FIELD_AGE, filterInfo, Constants.PARAM_START_AGE, Constants.PARAM_END_AGE)){
              success = false
            }else if(!ValidUtils.in(fullInfo, Constants.FIELD_PROFESSIONAL, filterInfo, Constants.PARAM_PROFESSIONALS)){
              success = false
            }else if(!ValidUtils.in(fullInfo, Constants.FIELD_CITY, filterInfo, Constants.PARAM_CITIES)){
              success = false
            }else if(!ValidUtils.equal(fullInfo, Constants.FIELD_SEX, filterInfo, Constants.PARAM_SEX)){
              success = false
            }else if(!ValidUtils.in(fullInfo, Constants.FIELD_SEARCH_KEYWORDS, filterInfo, Constants.PARAM_KEYWORDS)){
              success = false
            }else if(!ValidUtils.in(fullInfo, Constants.FIELD_CLICK_CATEGORY_IDS, filterInfo, Constants.PARAM_CATEGORY_IDS)){
              success = false
            }
            //Update the accumulator only for sessions that pass the filter
            if (success){
              //First increment the total session count
              accumulator.add(Constants.SESSION_COUNT);
              val visitLength=StringUtil.getFieldFromConcatString(fullInfo,"\\|",Constants.FIELD_VISIT_LENGTH).toLong;
              val stepLength=StringUtil.getFieldFromConcatString(fullInfo,"\\|",Constants.FIELD_STEP_LENGTH).toLong;

              calculateVisitLength(visitLength,accumulator);
              calculateStepLength(stepLength,accumulator);
            }
            success;
          }
        }
  }
  def calculateVisitLength(visitLength: Long, sessionStatisticAccumulator: sessionAccumulator) = {
    if(visitLength >= 1 && visitLength <= 3){
      sessionStatisticAccumulator.add(Constants.TIME_PERIOD_1s_3s)
    }else if(visitLength >=4 && visitLength  <= 6){
      sessionStatisticAccumulator.add(Constants.TIME_PERIOD_4s_6s)
    }else if (visitLength >= 7 && visitLength <= 9) {
      sessionStatisticAccumulator.add(Constants.TIME_PERIOD_7s_9s)
    } else if (visitLength >= 10 && visitLength <= 30) {
      sessionStatisticAccumulator.add(Constants.TIME_PERIOD_10s_30s)
    } else if (visitLength > 30 && visitLength <= 60) {
      sessionStatisticAccumulator.add(Constants.TIME_PERIOD_30s_60s)
    } else if (visitLength > 60 && visitLength <= 180) {
      sessionStatisticAccumulator.add(Constants.TIME_PERIOD_1m_3m)
    } else if (visitLength > 180 && visitLength <= 600) {
      sessionStatisticAccumulator.add(Constants.TIME_PERIOD_3m_10m)
    } else if (visitLength > 600 && visitLength <= 1800) {
      sessionStatisticAccumulator.add(Constants.TIME_PERIOD_10m_30m)
    } else if (visitLength > 1800) {
      sessionStatisticAccumulator.add(Constants.TIME_PERIOD_30m)
    }
  }

  def calculateStepLength(stepLength: Long, sessionStatisticAccumulator: sessionAccumulator) = {
    if(stepLength >=1 && stepLength <=3){
      sessionStatisticAccumulator.add(Constants.STEP_PERIOD_1_3)
    }else if (stepLength >= 4 && stepLength <= 6) {
      sessionStatisticAccumulator.add(Constants.STEP_PERIOD_4_6)
    } else if (stepLength >= 7 && stepLength <= 9) {
      sessionStatisticAccumulator.add(Constants.STEP_PERIOD_7_9)
    } else if (stepLength >= 10 && stepLength <= 30) {
      sessionStatisticAccumulator.add(Constants.STEP_PERIOD_10_30)
    } else if (stepLength > 30 && stepLength <= 60) {
      sessionStatisticAccumulator.add(Constants.STEP_PERIOD_30_60)
    } else if (stepLength > 60) {
      sessionStatisticAccumulator.add(Constants.STEP_PERIOD_60)
    }
  }
  def getUserInfo(session: SparkSession) = {
    import session.implicits._;
    val ds=session.read.parquet("hdfs://hadoop1:9000/data/user_Info").as[UserInfo].map(item=>(item.user_id,item));
    ds.rdd;
  }

  def basicActions(session:SparkSession,task:JSONObject)={
    import session.implicits._;
    val ds=session.read.parquet("hdfs://hadoop1:9000/data/user_visit_action").as[UserVisitAction];
    val start=task.getString(Constants.PARAM_START_DATE);
    val end=task.getString(Constants.PARAM_END_DATE);
    //Keep only actions inside the task's date range (string comparison is safe
    //because action_time is formatted yyyy-MM-dd HH:mm:ss) and return that RDD
    ds.filter(item=>item.action_time>=start&&item.action_time<=end).rdd;
  }

  def AggActionGroup(groupBasicActions: RDD[(String, Iterable[UserVisitAction])])={
       groupBasicActions.map{
         case (sessionId,actions)=>{
           var userId = -1L

           var startTime:Date = null
           var endTime:Date = null

           var stepLength = 0

           val searchKeywords = new StringBuffer("")
           val clickCategories = new StringBuffer("")

           //Walk through the session's actions, updating the aggregates
           for (action<-actions){
             if(userId == -1L){
               userId=action.user_id;
             }
             val time=DateUtils.parseTime(action.action_time);
             if (startTime==null||startTime.after(time))startTime=time;
             if (endTime==null||endTime.before(time))endTime=time;

             val key=action.search_keyword;
             //Collect each distinct search keyword
             if (!StringUtils.isEmpty(key) && !searchKeywords.toString.contains(key)) searchKeywords.append(key+",");

             val click=action.click_category_id;
             //Collect each distinct click category id
             if (click != -1L && !clickCategories.toString.contains(click.toString)) clickCategories.append(click.toString+",");

             stepLength+=1;

           }
           //Drop the trailing comma
           val searchKw = StringUtil.trimComma(searchKeywords.toString)
           val clickCg = StringUtil.trimComma(clickCategories.toString)

           val visitLength = (endTime.getTime - startTime.getTime) / 1000

           val aggrInfo = Constants.FIELD_SESSION_ID + "=" + sessionId + "|" +
             Constants.FIELD_SEARCH_KEYWORDS + "=" + searchKw + "|" +
             Constants.FIELD_CLICK_CATEGORY_IDS + "=" + clickCg + "|" +
             Constants.FIELD_VISIT_LENGTH + "=" + visitLength + "|" +
             Constants.FIELD_STEP_LENGTH + "=" + stepLength + "|" +
             Constants.FIELD_START_TIME + "=" + DateUtils.formatTime(startTime)

           (userId, aggrInfo)

         }
       }
  }

  def AggInfoAndActions(aggUserActions: RDD[(Long, String)], userInfo: RDD[(Long, UserInfo)])={
    //Join on user_id to combine the two RDDs
    userInfo.join(aggUserActions).map{
      case (userId,(userInfo: UserInfo,aggrInfo))=>{
        val age = userInfo.age
        val professional = userInfo.professional
        val sex = userInfo.sex
        val city = userInfo.city
        val fullInfo = aggrInfo + "|" +
          Constants.FIELD_AGE + "=" + age + "|" +
          Constants.FIELD_PROFESSIONAL + "=" + professional + "|" +
          Constants.FIELD_SEX + "=" + sex + "|" +
          Constants.FIELD_CITY + "=" + city

        val sessionId = StringUtil.getFieldFromConcatString(aggrInfo, "\\|", Constants.FIELD_SESSION_ID)

        (sessionId, fullInfo)
      }
    }
  }

}

Summary

  • One user corresponds to many raw records; aggregate them into a single summary record before any per-session logic.
  • Two key-value RDDs that share keys but have differently-structured values can be combined with join.
  • A custom accumulator can collect all the per-range counts in a single pass over the data.