Course_clustering_model

|Field|Meaning|Type|
|:--|:--|:--|
|interested_travel|Travel preference|Binary|
|computer_owner|Owns a home computer|Binary|
|age|Estimated age|Continuous|
|home_value|Home value|Continuous|
|loan_ratio|Loan ratio|Continuous|
|risk_score|Risk score|Continuous|
|marital|Estimated marital status|Continuous|
|interested_sport|Sports preference|Continuous|
|HH_grandparent|Estimate of whether the household head's grandparents are living|Continuous|
|HH_dieting|Household head's dieting preference|Continuous|
|HH_head_age|Household head's age|Continuous|
|auto_member|Estimated auto-club membership|Continuous|
|interested_golf|Golf preference|Binary|
|interested_gambling|Gambling preference|Binary|
|HH_has_children|Whether the household head has children|Binary|
|HH_adults_num|Number of adults in the household|Continuous|
|interested_reading|Reading preference|Ordinal|

1. The dataset contains many variables; putting all of them into the model would make it hard to interpret. Therefore, on one hand we reduce dimensionality over correlated variables to cut the variable count; on the other hand, based on business understanding, we group the variables in advance so that each group explains one aspect of the business as far as possible. In this example the variables are split into two groups, household basics and user hobbies. Clustering each group separately yields a profile for each aspect, and combining the two cluster labels gives a fairly complete user portrait.

2. The data types here are mixed, including continuous, nominal, and ordinal variables. Since K-means clusters only continuous variables, the data needs preprocessing. An ordinal variable with many levels can be treated as continuous; otherwise it is handled like a nominal variable before entering the model. When there are only a few nominal variables, they can enter the model via dummy coding. In this example there are many binary variables, concentrated on user hobbies, so we binarize the ordinal variable interested_reading, aggregate it with the other binary variables into a "hobby breadth" measure, and then cluster that measure together with the other continuous hobby variables.

3. A discrete variable such as HH_has_children generally does not take part in clustering, since it can itself be viewed as a cluster label. However, to simplify interpretation later, when there are few discrete variables they can also enter the model after a dummy-variable transformation.
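For the dummy-coding route just mentioned, pandas' `get_dummies` does the transformation directly. A minimal sketch on a hypothetical two-level column (the data frame and values here are illustrative, not from the travel dataset):

```python
import pandas as pd

# Hypothetical stand-in for a discrete column like HH_has_children
df = pd.DataFrame({'HH_has_children': ['Y', 'N', 'N', 'Y']})

# drop_first avoids the redundant (perfectly collinear) dummy column
dummies = pd.get_dummies(df['HH_has_children'], prefix='children',
                         drop_first=True).astype(int)
print(dummies['children_Y'].tolist())  # [1, 0, 0, 1]
```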

Reading the data

%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
travel = pd.read_csv('data_travel.csv',skipinitialspace=True)
travel.head()
interested_travel computer_owner age home_value loan_ratio risk_score marital interested_sport HH_grandparent HH_dieting HH_head_age auto_member interested_golf interested_gambling HH_has_children HH_adults_num interested_reading
0 NaN NaN 64 124035 73 932 3 312 420 149 96 626 0 0 NaN NaN 0
1 0.0 1.0 69 138574 73 1000 7 241 711 263 68 658 0 0 N 5.0 3
2 0.0 0.0 57 148136 77 688 1 367 240 240 56 354 0 1 N 2.0 1
3 1.0 1.0 80 162532 74 932 7 291 832 197 86 462 1 1 Y 2.0 3
4 1.0 1.0 48 133580 77 987 10 137 121 209 42 423 0 1 Y 3.0 3
travel.describe(include='all')
interested_travel computer_owner age home_value loan_ratio risk_score marital interested_sport HH_grandparent HH_dieting HH_head_age auto_member interested_golf interested_gambling HH_has_children HH_adults_num interested_reading
count 149788.000000 149788.000000 167177.000000 167177.000000 167177.000000 167177.000000 167177.000000 167177.000000 167177.000000 167177.000000 167177.000000 167177.000000 167177.000000 167177.000000 159899 145906.000000 167177
unique NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 2 NaN 5
top NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN N NaN 3
freq NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN 111462 NaN 65096
mean 0.427745 0.856571 59.507079 207621.314798 66.762707 817.031751 6.884015 259.431776 377.072498 204.593341 59.368023 486.861273 0.373012 0.357842 NaN 2.770832 NaN
std 0.494753 0.350511 14.311733 107822.501900 9.751835 165.490295 2.610552 78.867456 248.045395 78.971038 16.712912 151.167457 0.483607 0.479367 NaN 1.285417 NaN
min 0.000000 0.000000 18.000000 48910.000000 0.000000 1.000000 1.000000 60.000000 0.000000 47.000000 18.000000 49.000000 0.000000 0.000000 NaN 0.000000 NaN
25% 0.000000 1.000000 49.000000 135595.000000 63.000000 748.000000 5.000000 204.000000 182.000000 144.000000 48.000000 377.000000 0.000000 0.000000 NaN 2.000000 NaN
50% 0.000000 1.000000 59.000000 182106.000000 69.000000 844.000000 7.000000 251.000000 351.000000 185.000000 60.000000 492.000000 0.000000 0.000000 NaN 2.000000 NaN
75% 1.000000 1.000000 70.000000 248277.000000 73.000000 945.000000 9.000000 306.000000 528.000000 252.000000 71.000000 600.000000 1.000000 1.000000 NaN 4.000000 NaN
max 1.000000 1.000000 99.000000 1000000.000000 102.000000 1000.000000 10.000000 920.000000 980.000000 633.000000 99.000000 878.000000 1.000000 1.000000 NaN 7.000000 NaN

Data preprocessing

Filling missing values

The variables with missing values are all categorical, and the missing proportion is not high, so they are filled with the mode

fill_cols = ['interested_travel', 'computer_owner', 'HH_adults_num']
fill_values = {col: travel[col].mode()[0] for col in fill_cols}

travel = travel.fillna(fill_values)

Correcting erroneous values

The levels of HH_has_children are stored as strings and need to be converted to integers; its missing values should mean there are no children, so they are replaced with 0.
interested_reading contains the erroneous value ".", which we replace with 0, meaning the user has no interest in reading.

travel['interested_reading'].value_counts(dropna=False)
3    65096
1    43832
0    32919
2    24488
.      842
Name: interested_reading, dtype: int64
travel['HH_has_children'] = travel['HH_has_children']\
    .replace({'N': 0, 'Y': 1, np.nan: 0})
    
travel['interested_reading'] = travel['interested_reading']\
    .replace({'.':'0'}).astype('int')

Handling discrete variables

  • With k-means clustering, discrete variables are generally not analyzed directly, but they can be transformed based on business understanding

Analyzing the correlation among the discrete variables

_cols = [
    'interested_travel',
    'computer_owner',
    'marital',
    'interested_golf', 
    'interested_gambling', 
    'HH_has_children',
    'interested_reading'
]

sample = travel[_cols].sample(3000, random_state=12345)
from itertools import combinations
from scipy import stats

for colA, colB in combinations(_cols, 2):
    crosstab = pd.crosstab(sample[colA], sample[colB])
    pval = stats.chi2_contingency(crosstab)[1]
    if pval > 0.05:
        print('p-value = %0.3f between "%s" and "%s"' %(pval, colA, colB))
p-value = 0.710 between "interested_travel" and "HH_has_children"
p-value = 0.495 between "computer_owner" and "HH_has_children"
p-value = 0.272 between "interested_golf" and "HH_has_children"

For user hobbies, the categorical variables for travel, computer, golf, gambling, and reading can be combined into a single hobby-breadth indicator representing leisure and entertainment interests, while the continuous interested_sport and HH_dieting are health-related hobbies and auto_member is a luxury-type hobby. The user's hobbies can therefore be analyzed from several angles.

  • First binarize interested_reading
from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=1.5)
travel['interested_reading'] = binarizer.fit_transform(
    travel[['interested_reading']])
  • Compute the positive rate across the binary preference variables
interest =[
    'interested_travel',
    'computer_owner',
    'interested_golf', 
    'interested_gambling',
    'interested_reading'
]
n_ = len(interest)

travel = travel.drop(interest, axis=1)\
               .assign(interest=travel[interest].sum(axis=1) / n_)

Normalization and standardization

  • Different variable types get different treatment: continuous, ordinal, and nominal variables are each handled differently, so we first group the variables by type and apply a different strategy to each group
  • If a continuous variable takes only a few values, e.g. marital (10 levels), interest (5 levels), or HH_adults_num (8 levels), transforming its distribution like an ordinary continuous variable can create outliers (for instance, applying scikit-learn's normal transformation to marital puts the points at 1 and 10 about 5 standard deviations from the mean). In this example these variables are therefore treated like ordinal variables: no distribution transformation, standardization only.
continuous_cols = ['age', 'home_value', 'risk_score', 'interested_sport', 
                   'HH_dieting', 'auto_member', 'HH_grandparent',
                   'HH_head_age', 'loan_ratio']

categorical_cols = ['marital', 'interest', 'HH_adults_num']

discreate_cols = ['HH_has_children']

So that the resulting clusters are of comparable size, heavily skewed continuous variables should have their distributions transformed to be close to normal or uniform

  • Normalize the continuous variables
travel[continuous_cols].hist(bins=25)
plt.show()

(Figure output_27_0.png: histograms of the continuous variables before transformation)

from sklearn.preprocessing import QuantileTransformer

qt = QuantileTransformer(n_quantiles=100, output_distribution='normal')
qt_data = qt.fit_transform(travel[continuous_cols])

pd.DataFrame(qt_data, columns=continuous_cols).hist(bins=25)
plt.show()

(Figure output_28_0.png: histograms of the continuous variables after quantile transformation)

  • Standardize the ordinal variables with multiple levels

As noted above, although HH_adults_num, marital, and interest are continuous variables, each has fewer than 10 levels, so like ordinal variables they are only standardized

from sklearn.preprocessing import scale

scale_data = scale(travel[categorical_cols])

Leave the binary variables as-is, and concatenate the variables of all types

data = np.hstack([qt_data, scale_data, travel[discreate_cols]])
data = pd.DataFrame(
    data, columns=continuous_cols + categorical_cols + discreate_cols)
data.head()
age home_value risk_score interested_sport HH_dieting auto_member HH_grandparent HH_head_age loan_ratio marital interest HH_adults_num HH_has_children
0 0.321971 -0.892802 0.574460 0.742289 -0.515705 0.848859 0.265318 2.508596 0.698526 -1.487818 -1.035281 -0.547836 0.0
1 0.619855 -0.624538 5.199338 -0.139710 0.822532 1.073988 1.182609 0.458679 0.698526 0.044429 -0.354828 1.895118 0.0
2 -0.126937 -0.464935 -0.935569 1.324958 0.525369 -0.810751 -0.446657 -0.229884 1.549706 -2.253942 -1.035281 -0.547836 0.0
3 1.335178 -0.253449 0.574460 0.506088 0.114185 -0.175603 1.413272 1.639976 0.927754 0.044429 1.686529 -0.547836 1.0
4 -0.747859 -0.711465 1.144237 -1.798170 0.225553 -0.403108 -0.937532 -0.987837 1.549706 1.193615 1.006077 0.266482 1.0

Dimension analysis

Based on business requirements, the variables are considered along two broad dimensions

  • The first is household attributes (household basics and financial situation); the second is personal preferences (degree of interest in sports, dieting, etc.)
household = ['age', 'marital', 'HH_adults_num', 'home_value', 
             'risk_score', 'HH_grandparent', 'HH_head_age', 'loan_ratio']
hobby = ['HH_dieting', 'auto_member','interest', 'interested_sport']
#  'HH_has_children', 
data[hobby].corr()
HH_dieting auto_member interest interested_sport
HH_dieting 1.000000 0.134293 0.246711 0.510775
auto_member 0.134293 1.000000 0.316871 0.458742
interest 0.246711 0.316871 1.000000 0.219868
interested_sport 0.510775 0.458742 0.219868 1.000000

We run factor analysis on each dimension separately. Since the factor analysis in scikit-learn does not provide factor rotation, we use a different package, fa-kit, which must first be installed with pip install fa-kit; see https://github.com/bmcmenamin/fa_kit . Once installed, factor analysis can be run on the household attributes and preference attributes separately
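If installing fa-kit is not an option, a varimax rotation is short enough to write directly in numpy. The sketch below follows the common textbook iterative-SVD formulation (gamma=1 gives varimax); it is an alternative sketch, not fa-kit's implementation:

```python
import numpy as np

def varimax(loadings, gamma=1.0, max_iter=100, tol=1e-6):
    """Textbook varimax rotation of a (variables x factors) loading matrix."""
    p, k = loadings.shape
    R = np.eye(k)
    last_obj = 0.0
    for _ in range(max_iter):
        L = loadings @ R
        # SVD of the gradient of the (ortho)max criterion
        u, s, vt = np.linalg.svd(
            loadings.T @ (L ** 3 - (gamma / p) * L @ np.diag((L ** 2).sum(axis=0))))
        R = u @ vt
        obj = s.sum()
        if obj - last_obj < tol:
            break
        last_obj = obj
    return loadings @ R

# Rotation is orthogonal, so each variable's communality
# (row sum of squared loadings) is unchanged
A = np.array([[0.8, 0.3], [0.7, 0.4], [0.2, 0.9], [0.1, 0.8]])
A_rot = varimax(A)
print(np.allclose((A ** 2).sum(axis=1), (A_rot ** 2).sum(axis=1)))  # True
```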

First use principal component analysis to determine how many components it is appropriate to retain:

from sklearn.decomposition import PCA

pca_hh = PCA().fit(data[household])
pca_hh.explained_variance_ratio_.cumsum()
array([0.35366862, 0.56542894, 0.7311212 , 0.83103641, 0.88473556,
       0.92921371, 0.96733029, 1.        ])

Retaining 4 principal components is appropriate
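The cutoff can also be read off programmatically. A small sketch using the cumulative ratios printed above (the 80% threshold is an assumption for illustration, not a rule from the text):

```python
import numpy as np

# Cumulative explained-variance ratios from pca_hh above (rounded)
cum = np.array([0.3537, 0.5654, 0.7311, 0.8310, 0.8847,
                0.9292, 0.9673, 1.0])

# Smallest component count whose cumulative ratio reaches 80%
n_keep = int(np.searchsorted(cum, 0.80)) + 1
print(n_keep)  # 4
```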

from fa_kit import FactorAnalysis
from fa_kit import plotting as fa_plotting
fa_hh = FactorAnalysis.load_data_samples(
    data[household],
    preproc_demean=True,
    preproc_scale=True
)
fa_hh.extract_components()

Set the component-extraction method. The default is the "broken_stick" method; "top_n" is recommended

fa_hh.find_comps_to_retain(method='top_n',num_keep=4)
array([0, 1, 2, 3], dtype=int64)

Rotate the factors using the varimax method

fa_hh.rotate_components(method='varimax')
fa_plotting.graph_summary(fa_hh)

(Figures output_45_0.png and output_45_1.png: fa-kit component-retention and rotation summary plots)

The factors:

pd.DataFrame(fa_hh.comps['rot'].T, columns=household)
age marital HH_adults_num home_value risk_score HH_grandparent HH_head_age loan_ratio
0 0.545378 -0.106102 0.070033 -0.192464 0.004744 0.565773 0.546836 -0.173800
1 -0.093753 -0.128673 0.091397 -0.666906 0.011150 0.109866 -0.065798 0.710642
2 0.076980 0.643012 0.758064 -0.028565 -0.001814 0.009484 -0.070762 -0.005709
3 0.112652 0.011355 -0.010758 0.059238 0.981828 -0.108131 0.040533 0.078959

We can see that:

  • The first factor weights age, HH_grandparent, and HH_head_age heavily; in business terms, the combination of these three variables can be taken as the user's life stage;
  • The second factor weights home_value and loan_ratio heavily, mainly representing the user's financial situation;
  • The third factor weights marital and HH_adults_num heavily, representing household size;
  • The fourth factor weights only risk_score heavily, so it represents the user's risk

Compute the factor scores:

data_hh = pd.DataFrame(
    np.dot(data[household], fa_hh.comps['rot']), 
    columns=['life_circle','finance', 'HH_size', 'risk']
)

Likewise, run factor analysis on the user's preference attributes (retaining 3 factors):

fa_hb = FactorAnalysis.load_data_samples(
    data[hobby], 
    preproc_demean=True,
    preproc_scale=True
)

fa_hb.extract_components()

fa_hb.find_comps_to_retain(method='top_n', num_keep=3)
fa_hb.rotate_components(method='varimax')
pd.DataFrame(fa_hb.comps['rot'].T, columns=hobby)
HH_dieting auto_member interest interested_sport
0 -0.175674 0.868191 0.033353 0.462893
1 0.832396 -0.120642 0.027784 0.540176
2 0.088207 0.071094 0.978842 -0.170394
  • The first factor weights auto_member and interested_sport heavily, measuring the user's sports preference
  • The second factor weights HH_dieting and interested_sport heavily, measuring a healthy lifestyle
  • The third factor weights only interest heavily, measuring leisure and entertainment preference

Compute the factor scores

data_hb = pd.DataFrame(
    np.dot(data[hobby], fa_hb.comps['rot']), 
    columns=['sports', 'health', 'leisure']
)

Choosing K for K-means clustering

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def cluster_plot(data, k_range=range(2, 12), n_init=5, sample_size=2000):
    # note: the n_jobs parameter was removed from KMeans in scikit-learn 1.0
    scores = []
    models = {}
    for k in k_range:
        kmeans = KMeans(n_clusters=k, n_init=n_init)
        kmeans.fit(data)
        models[k] = kmeans
        sil = silhouette_score(data, kmeans.labels_, 
                               sample_size=sample_size)
        scores.append([k, kmeans.inertia_, sil])

    scores_df = pd.DataFrame(scores, columns=['k','sum_square_dist', 'sil'])
    plt.figure(figsize=[9, 2])
    plt.subplot(121, ylabel='sum_square')
    plt.plot(scores_df.k, scores_df.sum_square_dist)
    plt.subplot(122, ylabel='silhouette_score')
    plt.plot(scores_df.k, scores_df.sil)
    plt.show()
    return models
scale_data_hh = scale(data_hh)
models_hh = cluster_plot(scale_data_hh)

(Figure output_58_0.png: inertia and silhouette score vs. K for the household clustering)

scale_data_hb = scale(data_hb)
models_hb = cluster_plot(scale_data_hb)

(Figure output_59_0.png: inertia and silhouette score vs. K for the hobby clustering)
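Instead of reading K off the plots, the K that maximizes the silhouette score can also be picked programmatically. A self-contained sketch on synthetic blobs (toy data, not the travel dataset):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with three well-separated clusters
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5,
                  random_state=42)

scores = {k: silhouette_score(X, KMeans(n_clusters=k, n_init=10,
                                        random_state=42).fit_predict(X))
          for k in range(2, 7)}
best_k = max(scores, key=scores.get)
print(best_k)  # 3
```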

Choose an appropriate K for each clustering, and join the resulting labels back to the original dataset

hh_labels = pd.DataFrame(models_hh[3].labels_, columns=['hh'])
hb_labels = pd.DataFrame(models_hb[2].labels_, columns=['hb'])
clusters = travel.join(hh_labels).join(hb_labels)
clusters.head()
age home_value loan_ratio risk_score marital interested_sport HH_grandparent HH_dieting HH_head_age auto_member HH_has_children HH_adults_num interest hh hb
0 64 124035 73 932 3 312 420 149 96 626 0 2.0 0.2 1 0
1 69 138574 73 1000 7 241 711 263 68 658 0 5.0 0.4 2 1
2 57 148136 77 688 1 367 240 240 56 354 0 2.0 0.2 1 0
3 80 162532 74 932 7 291 832 197 86 462 1 2.0 1.0 1 1
4 48 133580 77 987 10 137 121 209 42 423 1 3.0 0.8 1 0

Describing the characteristics of each cluster (using the original data)

from sklearn.tree import DecisionTreeClassifier

clf_hh = DecisionTreeClassifier()
clf_hb = DecisionTreeClassifier()

clf_hh.fit(clusters[household], clusters['hh'])
clf_hb.fit(clusters[hobby], clusters['hb'])
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')
import pydotplus
from IPython.display import Image
import sklearn.tree as tree

dot_hh = tree.export_graphviz(
    clf_hh,
    out_file=None, 
    feature_names=household, 
    class_names=['0','1', '2'],
    max_depth=2, 
    filled=True
) 

graph_hh = pydotplus.graph_from_dot_data(dot_hh)  
Image(graph_hh.create_png()) 

(Figure output_64_0.png: decision tree for the household clusters)

With several attributes available for characterizing users, a decision tree computes each attribute's importance for the class label, and can therefore be used to find the attributes (features) that stand out

  • Users with label hh=0 are characteristically married with low risk
  • Users with label hh=1 are characteristically unmarried with low risk
  • Users with label hh=2 are characteristically high risk
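The importance ranking the tree provides can also be printed directly from `feature_importances_`. A self-contained toy sketch (synthetic data and column names, not the clusters above) where the label depends on only one feature:

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Synthetic frame: the label is driven entirely by 'risk'
rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(200, 3), columns=['risk', 'age', 'size'])
y = (X['risk'] > 0.5).astype(int)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)
imp = pd.Series(clf.feature_importances_,
                index=X.columns).sort_values(ascending=False)
print(imp.idxmax())  # risk
```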
dot_hb = tree.export_graphviz(
    clf_hb,
    out_file=None, 
    feature_names=hobby, 
    class_names=['0','1'],
    max_depth=2, 
    filled=True
) 

graph_hb = pydotplus.graph_from_dot_data(dot_hb)  
Image(graph_hb.create_png()) 

(Figure output_66_0.png: decision tree for the hobby clusters)

  • Users with label hb=0 are characteristically low-interest and not into sports
  • Users with label hb=1 characteristically like sports and auto-club activities

A multi-dimensional summary analysis can also be performed

ana = pd.pivot_table(clusters, index='hh', columns='hb', aggfunc='mean').T
ana.swaplevel('hb', 0).sort_index(level=0)  # sortlevel was removed from pandas
hh 0 1 2
hb
0 HH_adults_num 3.269823 2.021071 2.423398
HH_dieting 166.936340 173.674181 158.180365
HH_grandparent 269.372191 355.571933 399.976093
HH_has_children 0.459865 0.238048 0.148394
HH_head_age 54.562345 57.899566 68.807469
age 55.791545 54.738482 66.056585
auto_member 453.438970 366.509789 447.716367
home_value 237195.115295 155775.222623 175275.060263
interest 0.351432 0.296582 0.294950
interested_sport 227.167794 209.014134 237.206394
loan_ratio 65.255430 72.004900 68.444051
marital 8.685870 5.187358 5.895035
risk_score 813.785306 737.985482 999.991654
1 HH_adults_num 3.344590 2.189454 2.770638
HH_dieting 229.577694 257.790373 227.790940
HH_grandparent 331.204838 536.099752 514.176211
HH_has_children 0.370444 0.169496 0.165677
HH_head_age 57.733696 63.795255 67.524516
age 60.438336 64.279766 70.390155
auto_member 593.903952 526.606734 596.637713
home_value 277657.102649 164260.575577 210227.962938
interest 0.678365 0.718116 0.691400
interested_sport 302.135937 293.386877 319.072496
loan_ratio 60.209289 70.427761 63.409021
marital 8.615938 5.607380 6.862574
risk_score 845.519363 799.222944 999.984007

Example key user-segment profiles:

|Label|Characteristics|
|:--|:--|
|hh=0, hb=0|Young married middle class with children; moderate interest in sports and leisure|
|hh=0, hb=1|Young, small household, higher loan burden and lower risk; little interest in sports or leisure|
|hh=1, hb=0|Married, high home value, low loan ratio, large household|