Data Warehouse Project Notes 8

Path Analysis: The Conversion Rate Concept

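Business background: a company runs many kinds of business, and each one can usually be broken into several operational steps. Users act at each step and move toward the business goal (for example placing an order, completing registration, completing a top-up, or reaching the top-up page). The chain of operational steps for a business is called that business's conversion path.
Conversion rate and the funnel model: along a path, the number of events or users differs at every step. Usually the earlier steps have more users and the later steps fewer. This leads to the concept of the conversion rate: the share of users at one step who go on to reach a later step.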
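To make the conversion rate concrete, here is a minimal sketch in plain Scala. The step names and user counts are made up for illustration, and `FunnelRates` is a hypothetical helper, not part of the project. It computes each step's rate relative to the previous step and relative to the first step.

```scala
// Hypothetical funnel: step name -> number of users who reached it (made-up numbers).
object FunnelRates {
  // Returns (step, users, rate vs. previous step, rate vs. first step).
  def rates(funnel: Seq[(String, Long)]): Seq[(String, Long, Double, Double)] = {
    val counts = funnel.map(_._2)
    val first = counts.headOption.getOrElse(0L)
    // Pair every step with the count of the step before it (the first step pairs with itself).
    val previous = counts.headOption.toSeq ++ counts.dropRight(1)
    funnel.zip(previous).map { case ((name, users), prev) =>
      val vsPrev  = if (prev == 0) 0.0 else users.toDouble / prev
      val vsFirst = if (first == 0) 0.0 else users.toDouble / first
      (name, users, vsPrev, vsFirst)
    }
  }

  def main(args: Array[String]): Unit = {
    val funnel = Seq("view product" -> 1000L, "add to cart" -> 400L, "pay" -> 100L)
    rates(funnel).foreach { case (name, users, vsPrev, vsFirst) =>
      println(f"$name%-14s users=$users%5d  vs previous=${vsPrev * 100}%.1f%%  vs first=${vsFirst * 100}%.1f%%")
    }
  }
}
```

With these made-up counts, "pay" converts at 25% relative to "add to cart" and at 10% relative to the first step.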
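[Figure: the conversion funnel of a single business]
[Figure: user behavior trajectories] These trajectories are the basis for step conversion across businesses: joining them against each business flow definition yields that business's step-by-step conversion (the funnel conversion rate).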

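From the report page prototype, work out the data elements that need to be computed

The page visited
The step at which it was visited
The previous page
Which user
Which session
Dimensions…

uid	sessionid	step number	page	previous page

ods -> dwd: select uid, sid, page, and page type
Use lag() over to get the previous page and row_number() to get the step number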

-- create table:
drop table if exists  dws_acc_route;
create table dws_acc_route(
uid string,
sessionid string,
url string,
sno int,
pre_url string
)
partitioned by (dt string)
stored as parquet
;


-- ETL computation
insert into table dws_acc_route partition(dt='2019-06-16')
select
imei as uid,
sessionid,
event['url'] as url,
row_number() over(partition by imei,sessionid order by commit_time) as sno,
lag(event['url']) over(partition by imei,sessionid order by commit_time) as pre_url

from ods_traffic_log where dt='2019-06-16' and eventtype='pg_view'
;

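Small requirement: remove page refreshes

/*
   Access path analysis: dws_user_acc_route
   @src  demo_dwd_event_dtl  event detail table
   
   Logic details:
		1. Decide whether repeated refreshes of the same page should be counted more than once (here the company wants them deduplicated).
				Shift each user's records by one row with lead() over; if a row and the row after it carry the same (url, reference) pair, e.g. B,A followed by B,A, the row can be dropped.
		2. The step number of a record within the user's session comes from applying row_number() to the deduplicated data.
		3. See the diagram below for details:
*/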
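The same logic can be sketched outside SQL with plain Scala collections. This is an illustration only, under the assumption that the caller has already grouped records by (uid, sessionid); `PageView` and `RefreshDedup` are hypothetical names. A record is dropped when the record that follows it has the same (url, reference) pair, and the survivors are numbered from 1.

```scala
// One page-view record; field names mirror the demo table columns.
case class PageView(uid: String, sessionid: String, url: String, reference: String, commitTime: Long)

object RefreshDedup {
  // Works on the records of ONE session: sort by time, drop a record when the next
  // record has the same (url, reference) pair, then number the survivors from 1.
  def dedupSession(views: Seq[PageView]): Seq[(PageView, Int)] = {
    val sorted = views.sortBy(_.commitTime)
    // The record that follows each record; None for the last one (where lead() yields NULL).
    val next: Seq[Option[PageView]] = sorted.drop(1).map(v => Option(v)) :+ None
    val kept = sorted.zip(next).collect {
      case (cur, nxt) if !nxt.exists(n => n.url == cur.url && n.reference == cur.reference) => cur
    }
    kept.zipWithIndex.map { case (v, i) => (v, i + 1) }
  }

  def main(args: Array[String]): Unit = {
    // The u01/s01 rows of the demo data: (url, reference, commit_time).
    val rows = Seq(("A", "X", 1L), ("B", "A", 2L), ("B", "A", 3L), ("B", "A", 4L),
                   ("C", "B", 5L), ("D", "C", 6L), ("A", "D", 7L), ("B", "A", 8L))
      .map { case (url, ref, t) => PageView("u01", "s01", url, ref, t) }
    dedupSession(rows).foreach { case (v, step) =>
      println(s"step=$step url=${v.url} ref=${v.reference} time=${v.commitTime}")
    }
  }
}
```

Like the lead() approach, this keeps the last record of each run of refreshes, and the final record of the session is always kept because nothing follows it.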

-- create demo test data (same path as the load statement below)
vi /root/data/demo_dwd_event_dtl.dat
u01,s01,pg_view,A,X,1
u01,s01,pg_view,B,A,2
u01,s01,pg_view,B,A,3
u01,s01,pg_view,B,A,4
u01,s01,pg_view,C,B,5
u01,s01,pg_view,D,C,6
u01,s01,pg_view,A,D,7
u01,s01,pg_view,B,A,8
u02,s21,pg_view,A,X,1
u02,s21,pg_view,B,A,2
u02,s21,pg_view,D,B,3
u02,s21,pg_view,B,D,4
u02,s21,pg_view,E,B,5
u02,s21,pg_view,F,E,6
u02,s21,pg_view,D,F,7
u02,s21,pg_view,B,D,8

-- create the demo test table
drop table if exists demo_dwd_event_dtl;
create table demo_dwd_event_dtl(
uid string,
sessionid string,
event_type string,
url string,
reference string,
commit_time bigint
)
row format delimited fields terminated by ',';


load data local inpath '/root/data/demo_dwd_event_dtl.dat' into table demo_dwd_event_dtl;

-- logic diagram
uid,sessionid,event_type,url,reference ,commit_time      lead() over()
u01, s01,   pg_view, A,    X           ,  time1          B,    A
u01, s01,   pg_view, B,    A           ,  time2          B,    A
u01, s01,   pg_view, B,    A           ,  time3          B,    A
u01, s01,   pg_view, B,    A           ,  time4          C,    B
u01, s01,   pg_view, C,    B           ,  time5          D,    C
u01, s01,   pg_view, D,    C           ,  time6          A,    D
u01, s01,   pg_view, A,    D           ,  time7          B,    A
u01, s01,   pg_view, B,    A           ,  time8          


--  @dst create the target table
create table ads_user_acc_route(
uid string,   -- user id
sessionid string,  -- session id
step int,   -- step number of the visit
url string,  -- page visited
ref string   -- previous page (the page the user came from)
)
partitioned by (dt string)
stored as parquet
;
-- ETL computation

-- first filter out the repeated-refresh records

with tmp as
(
select
uid,
sessionid,
event_type,
url,
reference,
commit_time,
concat_ws('-',url,reference) as tuple,
lead(concat_ws('-',url,reference),1) over(partition by uid,sessionid order by commit_time) as tuple2
from
demo_dwd_event_dtl
)
select * from tmp;

-- result
/*
+------+------------+-------------+------+------------+--------------+--------+---------+
| uid  | sessionid  | event_type  | url  | reference  | commit_time  | tuple  | tuple2  |
+------+------------+-------------+------+------------+--------------+--------+---------+
| u01  | s01        | pg_view     | A    | X          | 1            | A-X    | B-A     |
| u01  | s01        | pg_view     | B    | A          | 2            | B-A    | B-A     |
| u01  | s01        | pg_view     | B    | A          | 3            | B-A    | B-A     |
| u01  | s01        | pg_view     | B    | A          | 4            | B-A    | C-B     |
| u01  | s01        | pg_view     | C    | B          | 5            | C-B    | D-C     |
| u01  | s01        | pg_view     | D    | C          | 6            | D-C    | A-D     |
| u01  | s01        | pg_view     | A    | D          | 7            | A-D    | B-A     |
| u01  | s01        | pg_view     | B    | A          | 8            | B-A    | NULL    |
| u02  | s21        | pg_view     | A    | X          | 1            | A-X    | B-A     |
| u02  | s21        | pg_view     | B    | A          | 2            | B-A    | D-B     |
| u02  | s21        | pg_view     | D    | B          | 3            | D-B    | B-D     |
| u02  | s21        | pg_view     | B    | D          | 4            | B-D    | E-B     |
| u02  | s21        | pg_view     | E    | B          | 5            | E-B    | F-E     |
| u02  | s21        | pg_view     | F    | E          | 6            | F-E    | D-F     |
| u02  | s21        | pg_view     | D    | F          | 7            | D-F    | B-D     |
| u02  | s21        | pg_view     | B    | D          | 8            | B-D    | NULL    |
+------+------------+-------------+------+------------+--------------+--------+---------+
*/

-- remove the rows where tuple = tuple2 from the intermediate result above

select 
uid,
sessionid,
event_type,
url,
reference,
commit_time
from tmp
where tuple != tuple2 or tuple2 is null

-- then number the rows in time order within each user's session, by slightly adapting the SQL from the previous step:
select 
uid,
sessionid,
event_type,
url,
reference,
commit_time,
row_number() over(partition by uid,sessionid order by commit_time) as step
from tmp
where tuple != tuple2 or tuple2 is null

-- the final complete statement

with tmp as
(
select
uid,
sessionid,
event_type,
url,
reference,
commit_time,
concat_ws('-',url,reference) as tuple,
lead(concat_ws('-',url,reference),1) over(partition by uid,sessionid order by commit_time) as tuple2
from 
demo_dwd_event_dtl
)

select 
uid,
sessionid,
event_type,
url,
reference,
commit_time,
row_number() over(partition by uid,sessionid order by commit_time) as step
from tmp
where tuple != tuple2 or tuple2 is null  -- tuple2 is null on the last row of a session; keep that row
;
-- filtering repeated refreshes from the access path detail, implemented with row_number()
with tmp as (
select 
uid,
sessionid,
event_type,
url,
reference,
commit_time,
row_number() over(partition by uid,sessionid order by commit_time) - row_number() over(partition by uid,sessionid,url,reference order by commit_time) 
 as rn
from
demo_dwd_event_dtl
)

select * 
,row_number() over(partition by uid,sessionid order by commit_time) as step
from (
select
uid,
sessionid,
event_type,
url,
reference,
max(commit_time) commit_time

from
tmp
group by uid,sessionid,url,reference,event_type,rn) o

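Conversion funnel requirement: the number of users who completed each step of each business flow (joining user behavior trajectories against the business flow definitions yields each business's step conversion)

Required fields

uid	tid	step
user id	business id	completed step

Flow breakdown

The steps of each business flow are maintained by business staff through a web console
The analysis joins against the business flow table to compute each business (new business definitions are picked up in the nightly run)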

/*
   Demo data 1: the user access path table, in the Hive warehouse: dws_user_acc_route
*/
uid,sid,step,url,ref
u01,s01,1,X
u01,s01,2,Y,X
u01,s01,3,A,Y
u01,s01,4,B,A
u01,s01,5,C,B
u01,s01,6,B,C
u01,s01,7,o,B
u02,s02,1,A
u02,s02,2,C,A
u02,s02,3,A,C
u02,s02,4,B,A
u02,s02,5,D
u02,s02,6,B,D
u02,s02,7,C,B


/*
    Demo data 2: the business conversion path definition table, kept in metadata management: transaction_route
       T101	1	step 1	A	null
       T101	2	step 2	B	A
       T101	3	step 3	C	B
       T102	1	step 1	D	null
       T102	2	step 2	B	D
       T102	3	step 3	C	B
*/


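/*
	Computation steps:
		1. Load the transaction_route table from the metadata database and tidy its format
		2. Read the dws_user_acc_route table from the warehouse
		3. Group the access path data by session, so that all records of one user's session are gathered together
		        u01,s01,1,X
                u01,s01,2,Y,X
                u01,s01,3,A,Y
                u01,s01,4,B,A
                u01,s01,5,C,B
                u01,s01,6,B,C
                u01,s01,7,o,B
		
		4. Then, following the rule the company defines, decide which steps of which business path the user completed, and output rows like the following. Completing the same step more than once is counted once. (The rows show the output format only; the values are illustrative, not computed from the demo data.)
				u01   t101   step_1  completed step 1
				u01   t101   step_2  completed 2 steps
				u01   t102   step_1  completed only one step
				u02   t101   step_1
				u02   t102   step_1
				u02   t102   step_2		
				......
				
		5. Group the rows above by business id and step number and count the users, which gives the result (again format only):
				t101  step_1   2
				t101  step_2   1
				t101  step_3   1
				t102  step_1   3
				t102  step_2   2
				......
*/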
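Step 5 can be sketched with plain Scala collections as below. `FunnelCount` is a hypothetical helper. The input rows in `main` are what criterion 2 (in-order, not necessarily adjacent) yields for the two demo users above, assuming T101 = A, B, C and T102 = D, B, C as in the demo definition table.

```scala
object FunnelCount {
  // Input: (uid, tid, step) rows, possibly repeated when a user has several sessions.
  // Output: ((tid, step), number of distinct users), sorted by business id and step.
  def countUsers(done: Seq[(String, String, Int)]): Seq[((String, Int), Int)] =
    done.distinct
      .groupBy { case (_, tid, step) => (tid, step) }
      .map { case (key, rows) => key -> rows.map(_._1).distinct.size }
      .toSeq
      .sortBy(_._1)

  def main(args: Array[String]): Unit = {
    val done = Seq(
      ("u01", "T101", 1), ("u01", "T101", 2), ("u01", "T101", 3),
      ("u02", "T101", 1), ("u02", "T101", 2), ("u02", "T101", 3),
      ("u02", "T102", 1), ("u02", "T102", 2), ("u02", "T102", 3)
    )
    countUsers(done).foreach { case ((tid, step), users) => println(s"$tid step_$step $users") }
  }
}
```

In Spark the same aggregation would be a group by on (tid, step) with a distinct count of uid; the collection version only shows the shape of the computation.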
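/**
      * The next task is to decide, for each user, whether their behavior satisfies a given step of a given business, and if so to output:
      * u_id,t_id,t_step
      * How to do it is the easy part; the real question is what the criterion is.
      * Whether a user's behavior satisfies a business step definition could be judged by any of the following criteria.
      * Suppose the business defines the step events as: A B C D
      * and the users' behavior records are:
      * Zhang San: A  A  B  A  B  C
      * Li Si:     C  D  A  B  C  E  D
      * Wang Wu:   A  B  B  C  E  A  D
      * Zhao Liu:  B  C  E  E  D
      * Which of the defined steps does each of them satisfy?
      * Criterion 1: step C counts only if the two immediately preceding events are A B
      * Criterion 2: step C counts if B occurred before C and A occurred before that B; adjacency is not required
Algorithm implementation (criterion 2)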
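import scala.collection.mutable.ListBuffer

object TransactionRouteMatch {
  // Returns the 1-based numbers of the business steps the user completed, in order.
  // Each step is searched for strictly after the position where the previous step matched.
  def routeMatch(userActions: List[String], transSteps: List[String]): List[Int] = {
    val buffer = new ListBuffer[Int]

    var index = -1    // position of the previous match in userActions
    var flag = true   // becomes false at the first step that cannot be found
    for (i <- 0 until transSteps.size if flag) {
      index = userActions.indexOf(transSteps(i), index + 1)
      if (index != -1) {
        buffer += i + 1
      } else {
        flag = false
      }
    }
    buffer.toList
  }
}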
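      * Criterion 3: step C counts if C occurred and B occurred before it
      * Criterion 4: step C counts as soon as event C occurs
      *
      * Which criterion should the code follow? That depends on the requirement the company agrees on.
      * The code below uses criterion 2 as the example.
      *
      * Approach:
      * Collect all of one user's actions in time order: A  A  B  A  B  C
      * Turn the business path definitions into a broadcast variable:
      * Map(
      * "t101" -> list(A,B,C,D)
      * "t102" -> list(D,B,C)
      * )
      * Then compare the user's actions against the business definition according to criterion 2.
      * Specifically:
      * Search the user's action sequence for step 1 of the business definition; if it is found, continue.
      * Search for step 2 in the part of the sequence after the step 1 event; if it is found, continue.
      * And so on.
      *
      *
      **/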
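To see how the choice of criterion changes the outcome, the sketch below runs criteria 1 and 2 over the four example users from the comment above. `RouteCriteria` and both function names are hypothetical. `matchInOrder` restates criterion 2, and `matchAdjacent` is one possible reading of criterion 1, in which step k counts only if steps 1..k occur as a contiguous run.

```scala
object RouteCriteria {
  // Criterion 2: each step must occur after the previous step's match; gaps are allowed.
  def matchInOrder(actions: List[String], steps: List[String]): List[Int] = {
    var from = 0                 // where to start searching for the next step; -1 stops the loop
    var done = List.empty[Int]
    var i = 0
    while (i < steps.size && from >= 0) {
      val idx = actions.indexOf(steps(i), from)
      if (idx >= 0) { done = done :+ (i + 1); from = idx + 1 } else from = -1
      i += 1
    }
    done
  }

  // Criterion 1 (assumed reading): step k counts only if steps 1..k appear back to back.
  def matchAdjacent(actions: List[String], steps: List[String]): List[Int] =
    (1 to steps.size).takeWhile(k => actions.containsSlice(steps.take(k))).toList

  def main(args: Array[String]): Unit = {
    val steps = List("A", "B", "C", "D")
    val users = List(
      "Zhang San" -> List("A", "A", "B", "A", "B", "C"),
      "Li Si"     -> List("C", "D", "A", "B", "C", "E", "D"),
      "Wang Wu"   -> List("A", "B", "B", "C", "E", "A", "D"),
      "Zhao Liu"  -> List("B", "C", "E", "E", "D")
    )
    users.foreach { case (name, actions) =>
      println(s"$name  criterion 1: ${matchAdjacent(actions, steps)}  criterion 2: ${matchInOrder(actions, steps)}")
    }
  }
}
```

Under criterion 2, Li Si and Wang Wu complete all four steps, Zhang San completes three, and Zhao Liu none because A never occurs. Under the adjacent reading, Li Si drops to three steps and Wang Wu to two, which shows why the criterion has to be agreed on before any code is written.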

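import java.util.Properties

import org.apache.log4j.{Level, Logger}
import org.apache.spark.broadcast.Broadcast

import scala.collection.mutable.ListBuffer

/**
*@Description: number of users completing each step of each business flow | user id | business id | completed step |
*@Author: dyc
*@date: 2019/9/7
*/
object FunnelAnalysis {

  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.WARN)

    // SparkUtil is the project's own helper (not shown in these notes)
    val spark = SparkUtil.getSparkSession()

    // load the business path definitions (metadata)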
    val props = new Properties()
    props.setProperty("user","root")
    props.setProperty("password","123456")

    val routeDefine = spark.read.jdbc("jdbc:mysql://localhost:3306/test","transaction_route",props)
    // collect the business flow steps defined by the business users, sorted by step number
    val routeMap: collection.Map[String, List[(Int, String)]] = routeDefine.rdd.map(row => {
      val t_id = row.getAs[String]("tid")
      val t_event = row.getAs[String]("event_id")
      val t_step = row.getAs[Int]("step")

      (t_id, (t_step, t_event))
    }).groupByKey().mapValues(it => {
      val tuples: List[(Int, String)] = it.toList.sortBy(_._1)
      tuples
    }).collectAsMap()
    val bc: Broadcast[collection.Map[String, List[(Int, String)]]] = spark.sparkContext.broadcast(routeMap)

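    // load the user access path data
    val userRoute = spark.read.option("header","true").csv("data_ware/data/dws_user_acc_route/part-00000.csv")
    import spark.implicits._
    //val ds: RelationalGroupedDataset = userRoute.selectExpr("uid","sid", "collect_set(concat_ws('-',step,url,ref)").groupBy('uid,'sid)
    userRoute.createTempView("t")
    // step is read from the CSV as a string, so cast it before sorting.
    // Caveat: Spark does not guarantee that collect_list keeps the order of a preceding
    // "order by" once "group by" shuffles the rows; check the order on real data.
    val df = spark.sql(
      """
        |with tmp as (select * from t order by cast(step as int))
        |select uid,sid,
        |collect_list(url) as url_order
        |from
        |tmp group by uid, sid
        |
      """.stripMargin)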
    // get the ordered page sequence of each user session
    val usStep = df.map(row => {
      val uid: String = row.getAs[String]("uid")
      val sid: String = row.getAs[String]("sid")

      // keep the pages as a sequence; joining them into one string and splitting it
      // into characters would only work if every page id were a single character
      val urls: Seq[String] = row.getAs[Seq[String]]("url_order")
      ((uid, sid), urls)
    })

//    usStep.show(10, false)
    // match each session's page sequence against the business flows
    val ustidstep = usStep.flatMap(t => {
      val uid = t._1._1
      val userlist = t._2.toList
      val routeMap: collection.Map[String, List[(Int, String)]] = bc.value
      val resList = new ListBuffer[(String, String, Int)]
      // match the user's steps against every business flow
      for ((k, v) <- routeMap) {
        val routeList: List[String] = v.map(s => s._2)
        val steps: List[Int] = TransactionRouteMatch.routeMatch(userlist, routeList)
        val tuples: List[(String, String, Int)] = steps.map(i => {
          (uid, k, i)
        })
        resList ++= tuples
      }
      resList
    }).toDF("uid", "tid", "step")
      .distinct() // a step completed in several sessions counts once per user
    ustidstep.show(10,false)

    spark.close()

  }

}

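Why list.remove must not be called inside a foreach loop
- Doing so throws ConcurrentModificationException. The iterator's next method calls checkForComodification, which checks whether modCount (the number of times the collection has been structurally modified) equals expectedModCount (the modification count the iterator expects, set to modCount when the iterator is created). modCount is a field of ArrayList (inherited from its parent class AbstractList), while expectedModCount is a field of ArrayList's inner class Itr. Calling ArrayList's remove method changes only modCount, not expectedModCount, so the next call to Itr's next method throws. Itr's own remove method reassigns expectedModCount so that the two stay equal.
- Even when the deletion happens at the last element, Itr still calls next. Itr's hasNext method checks whether cursor (a field of Itr) equals ArrayList's size and returns true if they differ. Removing through ArrayList's remove method decrements size, but cursor (the index of the current element + 1) does not change, so the two differ, hasNext returns true, and next is called anyway.
- Removing through the iterator does not fail because next sets cursor to the current index + 1 and lastRet to that index, while remove sets cursor back to lastRet and size drops by 1. The loop therefore leaves hasNext() { return cursor != size; } only after the whole collection has been traversed.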
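The behavior described above can be reproduced from Scala against java.util.ArrayList. This is a minimal sketch; `RemoveWhileIterating` and its method names are hypothetical. It also shows a corollary of the hasNext check: removing the second-to-last element through the list leaves cursor equal to the new size, so the loop ends quietly and the last element is never visited.

```scala
import java.util.{ArrayList, ConcurrentModificationException}

object RemoveWhileIterating {
  private def sample(): ArrayList[String] = {
    val list = new ArrayList[String]()
    list.add("a"); list.add("b"); list.add("c")
    list
  }

  // Removes through the list while an iterator is active.
  // Returns true if the iterator's next() throws ConcurrentModificationException.
  def removeViaList(target: String): Boolean = {
    val list = sample()
    val it = list.iterator()
    try {
      while (it.hasNext) {
        val s = it.next()
        if (s == target) list.remove(s) // changes modCount and size, not the iterator's state
      }
      false
    } catch {
      case _: ConcurrentModificationException => true
    }
  }

  // Removes through the iterator, which keeps expectedModCount and cursor in sync.
  def removeViaIterator(target: String): ArrayList[String] = {
    val list = sample()
    val it = list.iterator()
    while (it.hasNext) if (it.next() == target) it.remove()
    list
  }

  def main(args: Array[String]): Unit = {
    println(removeViaList("a"))     // first element
    println(removeViaList("c"))     // last element
    println(removeViaList("b"))     // second-to-last element
    println(removeViaIterator("c"))
  }
}
```

Removing "a" or "c" through the list makes the next call to next() throw, while removing "b" does not throw but silently skips "c". For filtering a list in place, Iterator.remove (or ArrayList.removeIf) is the safer route.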
