Big Data [Enterprise-Grade 360° User Profiling]: Developing a Mining-Type Tag Based on the RFM Model

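A note up front: I am a second-year student majoring in big data application development in a software engineering department. My handle combines Alice from "Alice's Adventures in Wonderland" with my own nickname. As a newcomer to the industry, I blog partly to record what I learn and partly to help other beginners who are at the same stage. My experience is limited, so these posts will inevitably contain mistakes, and corrections are very welcome. Personal site: http://alices.ibilibili.xyz/ , blog home: https://alice.blog.csdn.net/
I may not yet be at the level of more experienced readers, but I want to keep improving, because one day lived is a miniature of a whole life. I hope to be my best self in my best years!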

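In the previous posts I covered the development workflow for matching-type and statistical-type tags, along with some machine learning fundamentals, including the KMeans algorithm. In this post we develop a mining-type tag based on the RFM model. If you are not familiar with RFM, see 👉 Big Data [Enterprise-Grade 360° User Profiling]: The RFM Model and the KMeans Clustering Algorithm.

The tag we are developing this time is user value. As the name suggests, a tag this abstract cannot be produced by simple matching or counting; it has to be derived with a mining-type algorithm.

Let's walk through the steps needed to develop such a tag.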

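Adding the tag

First, on the web page of the user profiling project, add the level-4 tag (the tag name) and the level-5 tags (the tag values) that this requirement needs.

Once they are added, the data is visible in the backend database.

Development

With the tag and its values in place on the page, what remains is writing the code.

Preparing the pom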

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <parent>
        <artifactId>userprofile29</artifactId>
        <groupId>cn.itcast.up</groupId>
        <version>1.0-SNAPSHOT</version>
    </parent>
    <modelVersion>4.0.0</modelVersion>

    <artifactId>Job</artifactId>

    <properties>
        <scala.version>2.11.8</scala.version>
        <spark.version>2.2.0</spark.version>
        <hbase.version>1.2.0-cdh5.14.0</hbase.version>
        <solr.version>4.10.3-cdh5.14.0</solr.version>
        <mysql.version>8.0.17</mysql.version>
        <slf4j.version>1.7.21</slf4j.version>

        <maven-compiler-plugin.version>3.1</maven-compiler-plugin.version>
        <build-helper-plugin.version>3.0.0</build-helper-plugin.version>
        <scala-compiler-plugin.version>3.2.0</scala-compiler-plugin.version>
        <maven-shade-plugin.version>3.2.1</maven-shade-plugin.version>
    </properties>

    <dependencies>
        <!-- Spark -->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-mllib_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.scalanlp</groupId>
            <artifactId>breeze_2.11</artifactId>
            <version>0.13</version>
        </dependency>

        <!-- HBase -->
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>${hbase.version}</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-common</artifactId>
            <version>${hbase.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>${hbase.version}</version>
        </dependency>

        <!-- Solr -->
        <dependency>
            <groupId>org.apache.solr</groupId>
            <artifactId>solr-core</artifactId>
            <version>${solr.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.solr</groupId>
            <artifactId>solr-solrj</artifactId>
            <version>${solr.version}</version>
        </dependency>

        <!-- MySQL -->
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>${mysql.version}</version>
        </dependency>

        <!-- Logging -->
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-api</artifactId>
            <version>${slf4j.version}</version>
        </dependency>
        <dependency>
            <groupId>org.slf4j</groupId>
            <artifactId>slf4j-simple</artifactId>
            <version>${slf4j.version}</version>
        </dependency>

        <dependency>
            <groupId>cn.itcast.up29</groupId>
            <artifactId>common</artifactId>
            <version>1.0-SNAPSHOT</version>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>build-helper-maven-plugin</artifactId>
                <version>${build-helper-plugin.version}</version>
                <executions>
                    <execution>
                        <phase>generate-sources</phase>
                        <goals>
                            <goal>add-source</goal>
                        </goals>
                        <configuration>
                            <sources>
                                <source>src/main/java</source>
                                <source>src/main/scala</source>
                            </sources>
                        </configuration>
                    </execution>
                </executions>
            </plugin>

            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>${maven-compiler-plugin.version}</version>
                <configuration>
                    <encoding>UTF-8</encoding>
                    <source>1.8</source>
                    <target>1.8</target>
                    <verbose>true</verbose>
                    <fork>true</fork>
                </configuration>
            </plugin>

            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>${scala-compiler-plugin.version}</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <artifactId>maven-assembly-plugin</artifactId>
                <configuration>
                    <archive>
                        <manifest>
                            <!-- Replace this with the class that contains the jar's main method -->
                            <mainClass>cn.itcast.up29.TestTag</mainClass>
                        </manifest>
                        <manifestEntries>
                            <Class-Path>.</Class-Path>
                        </manifestEntries>
                    </archive>
                    <descriptorRefs>
                        <descriptorRef>jar-with-dependencies</descriptorRef>
                    </descriptorRefs>
                </configuration>
                <executions>
                    <execution>
                        <id>make-assembly</id> <!-- this is used for inheritance merges -->
                        <phase>package</phase> <!-- run the jar merge during the package phase -->
                        <goals>
                            <goal>single</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
    <repositories>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
    </repositories>

</project>

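Code development

One thing to mention first: the earlier post introducing 👉 the RFM model and the KMeans clustering algorithm ended with a small demo that used KMeans to classify iris flowers. Every step was commented and the final result was shown in the code, but that post did not walk through the code flow in detail. Since this post is another classic mining-type case, the sections below go through the development workflow of a mining-type tag step by step.

1. Extend BaseModel, set the job name, set the ID of your own tag, call exec, and override getNewTag, which implements the creation of the new tag

If you are not sure what the BaseModel class is, see my earlier post 👉 Extracting Common Tag Development Code. Developing different kinds of tags involves a lot of duplicated code, so that post showed how to extract the shared tag-processing logic into a trait named BaseModel. To develop further tags on this project, we only need to create a class that implements this trait and write a small amount of core code, which is very convenient.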

object TestModel  extends BaseModel {


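  // Set the job name
  override def setAppName: String = "RFMModel"

  // Set the level-4 tag ID for user value
  override def setFourTagId: String = "168"

  override def getNewTag(spark: SparkSession, fiveTagDF: DataFrame, hbaseDF: DataFrame): DataFrame = {
    // Placeholder so that the skeleton compiles; steps 2 to 10 fill in this body
    ???
  }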
  
}

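2. From the HBase DataFrame passed in, derive the three RFM values

We are computing user value, which fits the RFM model discussed earlier, so we need to compute the data for each of the three dimensions separately.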

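   // Column names for R, F and M
    val recencyStr: String = "recency"
    val frequencyStr: String = "frequency"
    val monetaryStr: String = "monetary"

    // Column names for the feature vector and the prediction
    val featureStr: String = "feature"
    val predictStr: String = "predict"

    // Business metrics to compute:
    // R (time elapsed from the last transaction to now)
    // F (number of transactions [half a year / one year / all time])
    // M (total transaction amount [half a year / one year / all time])

    // Import implicit conversions
    import spark.implicits._
    // Import Java <-> Scala collection conversions
    import scala.collection.JavaConverters._
    // Import the Spark SQL built-in functions
    import org.apache.spark.sql.functions._

    // Compute R: days between now and the user's most recent order (the max finishTime).
    // The fixed 300-day offset presumably compensates for the age of the sample data
    val getRecency: Column = functions.datediff(current_timestamp(),from_unixtime(max("finishTime")))-300 as recencyStr

    // Compute F: the number of orders
    val getFrequency: Column = functions.count("orderSn") as frequencyStr

    // Compute M: the sum of the order amounts
    val getMonetary: Column = functions.sum("orderAmount") as monetaryStr


    // Each user has multiple orders, so group by user ID to compute one RFM row per user
    val getRFMDF: DataFrame = hbaseDF.groupBy("memberId")
      .agg(getRecency, getFrequency, getMonetary)

    getRFMDF.show(false)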
    /*
    +---------+-------+---------+------------------+
    |memberId |recency|frequency|monetary          |
    +---------+-------+---------+------------------+
    |13822725 |10     |116      |179298.34         |
    |13823083 |10     |132      |233524.17         |
    |138230919|10     |125      |240061.56999999998|
     */

For convenience, the output is included above as a comment for reference.
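To see what the groupBy and agg calls compute without starting Spark, here is a minimal plain-Scala sketch of the same aggregation. The names `Order`, `Rfm` and `aggregate` are made up for illustration and are not part of the project. Recency is approximated as whole days elapsed in seconds, which is a simplification of `datediff` (that function counts calendar dates), and the 300-day offset is left out.

```scala
object RfmAggregationSketch {
  // Hypothetical order record mirroring the HBase columns used in this step.
  final case class Order(memberId: String, finishTime: Long, orderSn: String, orderAmount: Double)

  // One aggregated row per user.
  final case class Rfm(recencyDays: Long, frequency: Int, monetary: Double)

  private val SecondsPerDay: Long = 24L * 60L * 60L

  // nowSeconds is passed in explicitly so that the result is reproducible.
  def aggregate(orders: Seq[Order], nowSeconds: Long): Map[String, Rfm] =
    orders.groupBy(_.memberId).map { case (memberId, userOrders) =>
      // R: whole days since the most recent order
      val lastFinish = userOrders.map(_.finishTime).max
      val recencyDays = (nowSeconds - lastFinish) / SecondsPerDay
      // F: number of orders, M: total amount
      memberId -> Rfm(recencyDays, userOrders.size, userOrders.map(_.orderAmount).sum)
    }
}
```

Calling `RfmAggregationSketch.aggregate(orders, System.currentTimeMillis() / 1000)` on a small hand-written list is a quick way to sanity-check the three definitions before running the Spark job.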

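3. Normalization (scoring)

Why normalize? The three values have different dimensions (units), so they cannot be compared or combined directly and must first be brought onto a common scale.

Here we normalize with a custom scoring scheme, which differs from the iris example, where we called MinMaxScaler directly.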

    // The RFM values have different units, so normalize them by assigning scores
    // R: 1-3 days = 5, 4-6 days = 4, 7-9 days = 3, 10-15 days = 2, 16 days or more = 1
    // F: >=200 = 5, 150-199 = 4, 100-149 = 3, 50-99 = 2, 1-49 = 1
    // M: >=200k = 5, 100k-199k = 4, 50k-99k = 3, 10k-49k = 2, <10k = 1

    // Score R; when() branches are checked in order, and values below 1 also score 5 so that no row ends up null
    val getRecencyScore: Column =functions.when(col(recencyStr)<=3,5)
      .when(col(recencyStr)<=6,4)
      .when(col(recencyStr)<=9,3)
      .when(col(recencyStr)<=15,2)
      .otherwise(1)
      .as(recencyStr)

    // Score F
    val getFrequencyScore: Column =functions.when(col(frequencyStr) >= 200, 5)
      .when((col(frequencyStr) >= 150) && (col(frequencyStr) <= 199), 4)
      .when((col(frequencyStr) >= 100) && (col(frequencyStr) <= 149), 3)
      .when((col(frequencyStr) >= 50) && (col(frequencyStr) <= 99), 2)
      .when((col(frequencyStr) >= 1) && (col(frequencyStr) <= 49), 1)
      .as(frequencyStr)

    // Score M; open-ended lower bounds leave no gap for fractional amounts such as 199999.5
    val getMonetaryScore: Column =functions.when(col(monetaryStr) >= 200000, 5)
      .when(col(monetaryStr) >= 100000, 4)
      .when(col(monetaryStr) >= 50000, 3)
      .when(col(monetaryStr) >= 10000, 2)
      .otherwise(1)
      .as(monetaryStr)

    // Compute the RFM scores
    val getRFMScoreDF: DataFrame = getRFMDF.select('memberId ,getRecencyScore,getFrequencyScore,getMonetaryScore)

    println("--------------------------------------------------")
    //getRFMScoreDF.show()

/* +---------+-------+---------+--------+
| memberId|recency|frequency|monetary|
+---------+-------+---------+--------+
| 13822725|      2|        3|       4|
| 13823083|      2|        3|       5|
|138230919|      2|        3|       5|
| 13823681|      2|        3|       4|
*/
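The banding rules in the comments can also be written as ordinary functions, which makes boundary values easy to check by hand. The sketch below is only an illustration in plain Scala, with hypothetical function names. It uses open-ended bands checked from the top down, so fractional amounts such as 199999.5 still land in a band, and it assumes that anything below the lowest band scores 1 (or 5 for recency values under 1 day).

```scala
object RfmScoreSketch {
  // R: fewer days since the last order means a higher score.
  def scoreRecency(days: Long): Int =
    if (days <= 3) 5
    else if (days <= 6) 4
    else if (days <= 9) 3
    else if (days <= 15) 2
    else 1

  // F: more orders means a higher score.
  def scoreFrequency(orderCount: Long): Int =
    if (orderCount >= 200) 5
    else if (orderCount >= 150) 4
    else if (orderCount >= 100) 3
    else if (orderCount >= 50) 2
    else 1

  // M: a larger total amount means a higher score.
  def scoreMonetary(amount: Double): Int =
    if (amount >= 200000) 5
    else if (amount >= 100000) 4
    else if (amount >= 50000) 3
    else if (amount >= 10000) 2
    else 1
}
```

With the first sample row from step 2 (recency 10, frequency 116, monetary 179298.34) these functions give 2, 3 and 4, matching the first row of the scored output above.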

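4. Vectorize the RFM scores

Next we run KMeans clustering on the RFM data. KMeans expects its input as a single vector column, so one more step is needed: assemble the normalized scores from above into a feature vector.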

    val RFMFeature: DataFrame = new VectorAssembler()
      .setInputCols(Array(recencyStr, frequencyStr, monetaryStr))
      .setOutputCol(featureStr)
      .transform(getRFMScoreDF)

    RFMFeature.show()
/* +---------+-------+---------+--------+-------------+
| memberId|recency|frequency|monetary|      feature|
+---------+-------+---------+--------+-------------+
| 13822725|      2|        3|       4|[2.0,3.0,4.0]|
| 13823083|      2|        3|       5|[2.0,3.0,5.0]|
|138230919|      2|        3|       5|[2.0,3.0,5.0]|
| 13823681|      2|        3|       4|[2.0,3.0,4.0]|
|  4033473|      2|        3|       5|[2.0,3.0,5.0]| */

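5. Clustering the data

Here we finally call the KMeans clustering algorithm to group the data.

    val model: KMeansModel = new KMeans()
      .setK(7) // 7 clusters
      .setMaxIter(5) // at most 5 iterations
      .setFeaturesCol(featureStr) // input feature column
      .setPredictionCol("featureOut") // output column holding the predicted cluster index
      .fit(RFMFeature)

    // Apply the model to get a DataFrame with the predicted cluster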
    val modelDF: DataFrame = model.transform(RFMFeature)

    modelDF.show()
/*+---------+-------+---------+--------+-------------+----------+
| memberId|recency|frequency|monetary|      feature|featureOut|
+---------+-------+---------+--------+-------------+----------+
| 13822725|      2|        3|       4|[2.0,3.0,4.0]|         1|
| 13823083|      2|        3|       5|[2.0,3.0,5.0]|         0|
|138230919|      2|        3|       5|[2.0,3.0,5.0]|         0|
| 13823681|      2|        3|       4|[2.0,3.0,4.0]|         1|*/
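To make the featureOut column less of a black box: once the centers have been fitted, assigning a row to a cluster amounts to picking the nearest center. Below is a minimal plain-Scala sketch of that assignment step, assuming squared Euclidean distance. The object and function names are hypothetical, and the centers in the usage note are invented for illustration, not taken from the fitted model.

```scala
object NearestCenterSketch {
  // Squared Euclidean distance between two vectors of equal length.
  def squaredDistance(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the center closest to the point; on a tie the lowest index wins.
  def predict(centers: Array[Array[Double]], point: Array[Double]): Int =
    centers.indices.minBy(i => squaredDistance(centers(i), point))
}
```

For example, with centers `Array(Array(2.0, 3.0, 5.0), Array(2.0, 3.0, 4.0))`, the feature vector `[2.0,3.0,4.0]` is assigned to index 1 and `[2.0,3.0,5.0]` to index 0. The index only identifies a cluster; it says nothing yet about how valuable that cluster is.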

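6. Compute the value of each cluster and sort by value in descending order

The value of a cluster here means the sum of the coordinates of its center (centroid), that is, the sum of the cluster's average R, F and M scores.

Why sort in descending order? The rule values of the user-value tags in the database start from 0. Once the clusters are sorted from highest to lowest value, their positions in the sorted list also start from 0, with 0 being the highest value, which makes the later join straightforward.

    // 6. Rank the clusters: iterate over all clusters (0-6)
    // The value of a cluster is the sum of the coordinates of its center
    // model.clusterCenters.indices   indices of the cluster centers
    // model.clusterCenters(i)  the center of one specific cluster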

    val clusterCentersSum: immutable.IndexedSeq[(Int, Double)] = for(i <- model.clusterCenters.indices) yield (i,model.clusterCenters(i).toArray.sum)
    val clusterCentersSumSort: immutable.IndexedSeq[(Int, Double)] = clusterCentersSum.sortBy(_._2).reverse


    clusterCentersSumSort.foreach(println)
 /*
(4,11.038461538461538)
(0,10.0)
(1,9.0)
(3,8.0)
(6,6.0)
(5,4.4)
(2,3.0)
*/
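The same ranking can be tried on plain arrays, without a fitted model. This is a small sketch with hypothetical names, in which each center is just an `Array[Double]` of average R, F and M scores; it sorts by the negated sum instead of sorting ascending and reversing.

```scala
object ClusterValueSketch {
  // Pairs each cluster index with the sum of its center's coordinates,
  // ordered from the highest sum to the lowest.
  def rankByCenterSum(centers: IndexedSeq[Array[Double]]): IndexedSeq[(Int, Double)] =
    centers.zipWithIndex
      .map { case (center, clusterId) => (clusterId, center.sum) }
      .sortBy { case (_, value) => -value }
}
```

A center such as `[2.0, 3.0, 5.0]` sums to 10.0, so it ranks above a center `[2.0, 3.0, 4.0]`, which sums to 9.0. This treats R, F and M scores as equally important, which is a modelling choice rather than a requirement of KMeans.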

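7. Get the rank index of each sorted cluster

As mentioned in step 6, we now get the index of each cluster in the sorted order, which the later join relies on.

   // Pair each cluster with its index in the sorted order
    val clusterCenterIndex: immutable.IndexedSeq[(Int, Int)] = for(a <- clusterCentersSumSort.indices) yield (clusterCentersSumSort(a)._1,a)
    clusterCenterIndex.foreach(println)
    /*
    Clusters ordered from highest to lowest value
    Indices run from 0 to 6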
    (4,0)
    (0,1)
    (1,2)
    (3,3)
    (6,4)
    (5,5)
    (2,6)
     */
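Pairing each element with its position is exactly what `zipWithIndex` does, so the for-comprehension can also be written as below. This is a sketch with a hypothetical helper name; it takes the cluster indices already sorted from highest to lowest value and returns a map from cluster index to rank.

```scala
object ClusterRankSketch {
  // Maps each cluster index to its position in the value-sorted order (0 = highest value).
  def rankOf(sortedClusterIds: Seq[Int]): Map[Int, Int] =
    sortedClusterIds.zipWithIndex.toMap
}
```

With the order from the output above, `rankOf(Seq(4, 0, 1, 3, 6, 5, 2))` maps cluster 4 to rank 0 and cluster 2 to rank 6.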

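8. Join the sorted data with the level-5 tag data from the tag system

With the sorted data in hand, we join it with the level-5 tag data from the tag system. To make later lookups easy, we collect the joined data into a List.

 val clusterCenterIndexDF: DataFrame = clusterCenterIndex.toDF("type","index")

    // Perform the join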
    val JoinDF: DataFrame = fiveTagDF.join(clusterCenterIndexDF,fiveTagDF.col("rule") ===  clusterCenterIndexDF.col("index"))

    println("- - - - - - - -")
    JoinDF.show()
/*+---+----+----+-----+
| id|rule|type|index|
+---+----+----+-----+
|169|   0|   4|    0|
|170|   1|   0|    1|
|171|   2|   1|    2|
|172|   3|   3|    3|
|173|   4|   6|    4|
|174|   5|   5|    5|
|175|   6|   2|    6|
+---+----+----+-----+*/
    val fiveTageList: List[TagRule] = JoinDF.map(row => {

      val id: String = row.getAs("id").toString
      val types: String = row.getAs("type").toString

      TagRule(id.toInt, types)
    }).collectAsList() // Collects the Dataset into a util.List[TagRule]; id and rule cannot be read while iterating over this Java list, so it is converted to a Scala List
      .asScala.toList

    println("- - - - - - - -")
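The join simply lines up each tag's rule with a cluster's rank. The plain-Scala sketch below shows the same matching on small in-memory collections. `FiveTag` and `tagIdByCluster` are hypothetical names for illustration, and the rule is modelled as an Int here although the real column type in the tag table may differ.

```scala
object TagJoinSketch {
  // Hypothetical stand-in for one row of fiveTagDF: tag ID plus its rule (0 = highest value).
  final case class FiveTag(id: Int, rule: Int)

  // Inner join on rule == rank, returned as a map from cluster index to tag ID.
  def tagIdByCluster(fiveTags: Seq[FiveTag], rankByCluster: Map[Int, Int]): Map[Int, Int] = {
    val tagIdByRule: Map[Int, Int] = fiveTags.map(t => t.rule -> t.id).toMap
    rankByCluster.collect {
      case (clusterId, rank) if tagIdByRule.contains(rank) => clusterId -> tagIdByRule(rank)
    }
  }
}
```

Using the tag IDs 169 to 175 and the ranks from step 7, cluster 4 (rank 0) maps to tag 169 and cluster 2 (rank 6) maps to tag 175, the same pairs as in the join output above. Clusters whose rank has no matching rule are dropped, as in an inner join.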

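9. Write a UDF to compute the tag

At this point we can write a UDF that matches its input against the List built in step 8. When we then query the DataFrame produced by KMeans, calling the UDF maps each user ID to the tag ID of its user-value cluster.

// Define a custom UDF
    val getRFMTags: UserDefinedFunction = udf((featureOut: String) => {
      // Default tag value
      var tagId: Int = 0
      // Iterate over the rule of every level-5 tag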
      for (tagRule <- fiveTageList) {
        if (tagRule.rule == featureOut) {
          tagId = tagRule.id
        }
      }
      tagId
    })

    val CustomerValueTag: DataFrame = modelDF.select('memberId .as("userId"),getRFMTags('featureOut).as("tagsId"))



    CustomerValueTag.show(false)
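The loop inside the UDF is a linear search with a default value. As an alternative sketch, not the project's actual code, the list can be turned into a Map once and each call answered with `getOrElse`. The `TagRule` case class here is a local stand-in with the two fields the post uses (id and rule); the real class lives in the project's shared module.

```scala
object TagLookupSketch {
  // Local stand-in for the project's TagRule: tag ID plus the cluster index stored as a string.
  final case class TagRule(id: Int, rule: String)

  // Builds a lookup function once; unknown clusters fall back to defaultTagId.
  def tagIdFor(rules: Seq[TagRule], defaultTagId: Int = 0): String => Int = {
    val idByRule: Map[String, Int] = rules.map(r => r.rule -> r.id).toMap
    cluster => idByRule.getOrElse(cluster, defaultTagId)
  }
}
```

The returned function could be wrapped with `udf(...)` in the same way as the loop version. The result is the same for unique rules; the Map only avoids scanning the list for every row.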

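10. Return the newly computed tag

The last step is simple: we just return the result from step 9.

    CustomerValueTag

For easier reading, here is the complete source code.

If you have any questions about the code, feel free to leave a comment or send me a private message.

Complete source code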

import com.czxy.base.BaseModel
import com.czxy.bean.TagRule
import org.apache.spark.ml.clustering.{KMeans, KMeansModel}
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.{Column, DataFrame, SparkSession, functions}

import scala.collection.immutable

/*
 * @Author: Alice菌
 * @Date: 2020/6/22 09:18
 * @Description: 

    Computes the user-value model for the user profile

 */
object RFMModel extends BaseModel{

  // Set the job name
  override def setAppName: String = "RFMModel"

  // Set the level-4 tag ID for user value
  override def setFourTagId: String = "168"

  override def getNewTag(spark: SparkSession, fiveTagDF: DataFrame, hbaseDF: DataFrame): DataFrame = {

    //fiveTagDF.show()
    /*
    +---+----+
    | id|rule|
    +---+----+
    |169|   0|
    |170|   1|
    |171|   2|
    |172|   3|
    |173|   4|
    |174|   5|
    |175|   6|
+---+----+
     */
    //hbaseDF.show()
    /*
    +---------+----------+--------------------+-----------+
    | memberId|finishTime|             orderSn|orderAmount|
    +---------+----------+--------------------+-----------+
    | 13823431|1564415022|gome_792756751164275|    2479.45|
    |  4035167|1565687310|jd_14090106121770839|    2449.00|
    |  4035291|1564681801|jd_14090112394810659|    1099.42|
    |  4035041|1565799378|amazon_7877495617...|    1999.00|
     */

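    // Column names for R, F and M
    val recencyStr: String = "recency"
    val frequencyStr: String = "frequency"
    val monetaryStr: String = "monetary"

    // Column names for the feature vector and the prediction
    val featureStr: String = "feature"
    val predictStr: String = "predict"

    // Business metrics to compute:
    // R (time elapsed from the last transaction to now)
    // F (number of transactions [half a year / one year / all time])
    // M (total transaction amount [half a year / one year / all time])

    // Import implicit conversions
    import spark.implicits._
    // Import Java <-> Scala collection conversions
    import scala.collection.JavaConverters._
    // Import the Spark SQL built-in functions
    import org.apache.spark.sql.functions._

    // Compute R: days between now and the user's most recent order (the max finishTime).
    // The fixed 300-day offset presumably compensates for the age of the sample data
    val getRecency: Column = functions.datediff(current_timestamp(),from_unixtime(max("finishTime")))-300 as recencyStr

    // Compute F: the number of orders
    val getFrequency: Column = functions.count("orderSn") as frequencyStr

    // Compute M: the sum of the order amounts
    val getMonetary: Column = functions.sum("orderAmount") as monetaryStr


    // Each user has multiple orders, so group by user ID to compute one RFM row per user
    val getRFMDF: DataFrame = hbaseDF.groupBy("memberId")
      .agg(getRecency, getFrequency, getMonetary)

    getRFMDF.show(false)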
    /*
    +---------+-------+---------+------------------+
    |memberId |recency|frequency|monetary          |
    +---------+-------+---------+------------------+
    |13822725 |10     |116      |179298.34         |
    |13823083 |10     |132      |233524.17         |
    |138230919|10     |125      |240061.56999999998|
     */

    // The RFM values have different units, so normalize them by assigning scores
    // R: 1-3 days = 5, 4-6 days = 4, 7-9 days = 3, 10-15 days = 2, 16 days or more = 1
    // F: >=200 = 5, 150-199 = 4, 100-149 = 3, 50-99 = 2, 1-49 = 1
    // M: >=200k = 5, 100k-199k = 4, 50k-99k = 3, 10k-49k = 2, <10k = 1

    // Score R; when() branches are checked in order, and values below 1 also score 5 so that no row ends up null
    val getRecencyScore: Column =functions.when(col(recencyStr)<=3,5)
      .when(col(recencyStr)<=6,4)
      .when(col(recencyStr)<=9,3)
      .when(col(recencyStr)<=15,2)
      .otherwise(1)
      .as(recencyStr)

    // Score F
    val getFrequencyScore: Column =functions.when(col(frequencyStr) >= 200, 5)
      .when((col(frequencyStr) >= 150) && (col(frequencyStr) <= 199), 4)
      .when((col(frequencyStr) >= 100) && (col(frequencyStr) <= 149), 3)
      .when((col(frequencyStr) >= 50) && (col(frequencyStr) <= 99), 2)
      .when((col(frequencyStr) >= 1) && (col(frequencyStr) <= 49), 1)
      .as(frequencyStr)

    // Score M; open-ended lower bounds leave no gap for fractional amounts such as 199999.5
    val getMonetaryScore: Column =functions.when(col(monetaryStr) >= 200000, 5)
      .when(col(monetaryStr) >= 100000, 4)
      .when(col(monetaryStr) >= 50000, 3)
      .when(col(monetaryStr) >= 10000, 2)
      .otherwise(1)
      .as(monetaryStr)

    // 2. Compute the RFM scores
    val getRFMScoreDF: DataFrame = getRFMDF.select('memberId ,getRecencyScore,getFrequencyScore,getMonetaryScore)

    println("--------------------------------------------------")
    //getRFMScoreDF.show()

/* +---------+-------+---------+--------+
| memberId|recency|frequency|monetary|
+---------+-------+---------+--------+
| 13822725|      2|        3|       4|
| 13823083|      2|        3|       5|
|138230919|      2|        3|       5|
| 13823681|      2|        3|       4|
*/
    // 3. Assemble the scores into a feature vector

    val RFMFeature: DataFrame = new VectorAssembler()
      .setInputCols(Array(recencyStr, frequencyStr, monetaryStr))
      .setOutputCol(featureStr)
      .transform(getRFMScoreDF)

    RFMFeature.show()
/* +---------+-------+---------+--------+-------------+
| memberId|recency|frequency|monetary|      feature|
+---------+-------+---------+--------+-------------+
| 13822725|      2|        3|       4|[2.0,3.0,4.0]|
| 13823083|      2|        3|       5|[2.0,3.0,5.0]|
|138230919|      2|        3|       5|[2.0,3.0,5.0]|
| 13823681|      2|        3|       4|[2.0,3.0,4.0]|
|  4033473|      2|        3|       5|[2.0,3.0,5.0]| */

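    // 4. Cluster the data
    val model: KMeansModel = new KMeans()
      .setK(7) // 7 clusters
      .setMaxIter(5) // at most 5 iterations
      .setFeaturesCol(featureStr) // input feature column
      .setPredictionCol("featureOut") // output column holding the predicted cluster index
      .fit(RFMFeature)

    // Apply the model to get a DataFrame with the predicted cluster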
    val modelDF: DataFrame = model.transform(RFMFeature)

    modelDF.show()
/*+---------+-------+---------+--------+-------------+----------+
| memberId|recency|frequency|monetary|      feature|featureOut|
+---------+-------+---------+--------+-------------+----------+
| 13822725|      2|        3|       4|[2.0,3.0,4.0]|         1|
| 13823083|      2|        3|       5|[2.0,3.0,5.0]|         0|
|138230919|      2|        3|       5|[2.0,3.0,5.0]|         0|
| 13823681|      2|        3|       4|[2.0,3.0,4.0]|         1|

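At this point the users have been clustered, so every user has a cluster index.
What is still missing is the mapping between clusters and tag IDs.
After clustering, the values 0-6 in featureOut only identify 7 different clusters; they are not the 0-6 levels of the tags.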
*/
    modelDF.groupBy("featureOut")
        .agg(max(col("recency")+col("frequency")+col("monetary")) as "max",
          min(col("recency")+col("frequency")+col("monetary")) as "min").show()

/*
+----------+---+---+
|featureOut|max|min|
+----------+---+---+
|         1|  9|  9|
|         6|  6|  6|
|         3|  9|  7|
|         5|  5|  4|
|         4| 12| 11|
|         2|  3|  3|
|         0| 10| 10|
+----------+---+---+
*/

    println("===========================================")

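    // 5. Rank the clusters: iterate over all clusters (0-6)
    // The value of a cluster is the sum of the coordinates of its center
    // model.clusterCenters.indices   indices of the cluster centers
    // model.clusterCenters(i)  the center of one specific cluster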

    val clusterCentersSum: immutable.IndexedSeq[(Int, Double)] = for(i <- model.clusterCenters.indices) yield (i,model.clusterCenters(i).toArray.sum)
    val clusterCentersSumSort: immutable.IndexedSeq[(Int, Double)] = clusterCentersSum.sortBy(_._2).reverse


    clusterCentersSumSort.foreach(println)
 /*
(4,11.038461538461538)
(0,10.0)
(1,9.0)
(3,8.0)
(6,6.0)
(5,4.4)
(2,3.0)
*/

    // Pair each cluster with its index in the sorted order
    val clusterCenterIndex: immutable.IndexedSeq[(Int, Int)] = for(a <- clusterCentersSumSort.indices) yield (clusterCentersSumSort(a)._1,a)
    clusterCenterIndex.foreach(println)
    /*
    Clusters ordered from highest to lowest value
    Indices run from 0 to 6
    (4,0)
    (0,1)
    (1,2)
    (3,3)
    (6,4)
    (5,5)
    (2,6)
     */

    // 6. Join the cluster data with the tag data
    // Convert the cluster ranking to a DataFrame
    val clusterCenterIndexDF: DataFrame = clusterCenterIndex.toDF("type","index")

    // Perform the join
    val JoinDF: DataFrame = fiveTagDF.join(clusterCenterIndexDF,fiveTagDF.col("rule") ===  clusterCenterIndexDF.col("index"))

    println("- - - - - - - -")
    JoinDF.show()
/*+---+----+----+-----+
| id|rule|type|index|
+---+----+----+-----+
|169|   0|   4|    0|
|170|   1|   0|    1|
|171|   2|   1|    2|
|172|   3|   3|    3|
|173|   4|   6|    4|
|174|   5|   5|    5|
|175|   6|   2|    6|
+---+----+----+-----+*/
    val fiveTageList: List[TagRule] = JoinDF.map(row => {

      val id: String = row.getAs("id").toString
      val types: String = row.getAs("type").toString

      TagRule(id.toInt, types)
    }).collectAsList() // Collects the Dataset into a util.List[TagRule]; id and rule cannot be read while iterating over this Java list, so it is converted to a Scala List
      .asScala.toList

    println("- - - - - - - -")

    // 7. Derive the tag for each row (UDF)
    // Define a custom UDF
    val getRFMTags: UserDefinedFunction = udf((featureOut: String) => {
      // Default tag value
      var tagId: Int = 0
      // Iterate over the rule of every level-5 tag
      for (tagRule <- fiveTageList) {
        if (tagRule.rule == featureOut) {
          tagId = tagRule.id
        }
      }
      tagId
    })

    val CustomerValueTag: DataFrame = modelDF.select('memberId .as("userId"),getRFMTags('featureOut).as("tagsId"))

    println("*****************************************")

    CustomerValueTag.show(false)

    println("*****************************************")


    // 8. Return the result, which is then written to HBase
    CustomerValueTag
  }


  def main(args: Array[String]): Unit = {

    exec()

  }
}

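If the program finishes without errors, we can check in HBase whether the tags were written to the test table.

scan "test",{LIMIT => 10}

If users now carry a user-value tag value, the tag development is complete.

Conclusion

This post briefly walked through the development workflow of a mining-type tag in the user profiling project. I hope it leaves you more curious about machine learning algorithms. Coming up next are machine learning interview questions, so stay tuned 😎

If anything above is wrong or missing, corrections are very welcome 😅

If this helped you, or if you are interested in big data, a like and a follow are much appreciated 🙏

May we all go further and further on the road of learning 😉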
