|Field|Meaning|Type|
|:--|:--|:--|
|interested_travel|travel interest|binary|
|computer_owner|owns a home computer|binary|
|age|estimated age|continuous|
|home_value|home value|continuous|
|loan_ratio|loan ratio|continuous|
|risk_score|risk score|continuous|
|marital|estimated marital status|continuous|
|interested_sport|sports interest|continuous|
|HH_grandparent|estimate of whether the household's grandparents are living|continuous|
|HH_dieting|household dieting interest|continuous|
|HH_head_age|age of the household head|continuous|
|auto_member|auto-club membership estimate|continuous|
|interested_golf|golf interest|binary|
|interested_gambling|gambling interest|binary|
|HH_has_children|whether the household has children|binary|
|HH_adults_num|number of adults in the household|continuous|
|interested_reading|reading interest|ordinal|
1. The dataset contains many variables, and putting all of them into the model would make it hard to interpret. We therefore reduce dimensionality among correlated variables to cut down their number, and, guided by business understanding, pre-group the variables so that each group explains one aspect of the business. In this example the variables are split into two groups, basic household attributes and personal preferences; clustering each group separately yields a profile along that dimension, and combining the two cluster labels gives a fuller user portrait.
2. The data types in this example are mixed, covering continuous, nominal, and ordinal variables. Because K-means clusters only continuous variables, preprocessing is required. An ordinal variable with many levels can be treated as continuous; otherwise it is handled like a nominal variable before entering the model. A nominal variable with few levels can enter the model through dummy encoding. Since this example has many binary variables, all concentrated on personal preferences, we binarize the ordinal variable interested_reading and aggregate it with the other binary flags into a "hobby breadth" score, then cluster that score together with the remaining continuous preference variables.
3. A discrete variable such as HH_has_children generally does not take part in clustering, since it can itself serve as a cluster label; to simplify later interpretation, when there are only a few discrete variables, they can also be dummy-encoded and entered into the model.
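As a minimal sketch of the dummy encoding mentioned in point 3, pandas can expand a nominal column into 0/1 indicator columns (the column name and levels below are hypothetical, not from this dataset):

```python
import pandas as pd

# hypothetical nominal column with three levels
df = pd.DataFrame({'status': ['single', 'married', 'divorced', 'married']})
dummies = pd.get_dummies(df['status'], prefix='status')
print(dummies.columns.tolist())
# ['status_divorced', 'status_married', 'status_single']
```

Each level becomes its own indicator column; with k levels, passing `drop_first=True` drops one column to avoid perfect collinearity if that matters downstream.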
Reading the data
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
travel = pd.read_csv('data_travel.csv',skipinitialspace=True)
travel.head()
| | interested_travel | computer_owner | age | home_value | loan_ratio | risk_score | marital | interested_sport | HH_grandparent | HH_dieting | HH_head_age | auto_member | interested_golf | interested_gambling | HH_has_children | HH_adults_num | interested_reading |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | NaN | NaN | 64 | 124035 | 73 | 932 | 3 | 312 | 420 | 149 | 96 | 626 | 0 | 0 | NaN | NaN | 0 |
| 1 | 0.0 | 1.0 | 69 | 138574 | 73 | 1000 | 7 | 241 | 711 | 263 | 68 | 658 | 0 | 0 | N | 5.0 | 3 |
| 2 | 0.0 | 0.0 | 57 | 148136 | 77 | 688 | 1 | 367 | 240 | 240 | 56 | 354 | 0 | 1 | N | 2.0 | 1 |
| 3 | 1.0 | 1.0 | 80 | 162532 | 74 | 932 | 7 | 291 | 832 | 197 | 86 | 462 | 1 | 1 | Y | 2.0 | 3 |
| 4 | 1.0 | 1.0 | 48 | 133580 | 77 | 987 | 10 | 137 | 121 | 209 | 42 | 423 | 0 | 1 | Y | 3.0 | 3 |
travel.describe(include='all')
| | interested_travel | computer_owner | age | home_value | loan_ratio | risk_score | marital | interested_sport | HH_grandparent | HH_dieting | HH_head_age | auto_member | interested_golf | interested_gambling | HH_has_children | HH_adults_num | interested_reading |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 149788.000000 | 149788.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 167177.000000 | 159899 | 145906.000000 | 167177 |
| unique | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2 | NaN | 5 |
| top | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | N | NaN | 3 |
| freq | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 111462 | NaN | 65096 |
| mean | 0.427745 | 0.856571 | 59.507079 | 207621.314798 | 66.762707 | 817.031751 | 6.884015 | 259.431776 | 377.072498 | 204.593341 | 59.368023 | 486.861273 | 0.373012 | 0.357842 | NaN | 2.770832 | NaN |
| std | 0.494753 | 0.350511 | 14.311733 | 107822.501900 | 9.751835 | 165.490295 | 2.610552 | 78.867456 | 248.045395 | 78.971038 | 16.712912 | 151.167457 | 0.483607 | 0.479367 | NaN | 1.285417 | NaN |
| min | 0.000000 | 0.000000 | 18.000000 | 48910.000000 | 0.000000 | 1.000000 | 1.000000 | 60.000000 | 0.000000 | 47.000000 | 18.000000 | 49.000000 | 0.000000 | 0.000000 | NaN | 0.000000 | NaN |
| 25% | 0.000000 | 1.000000 | 49.000000 | 135595.000000 | 63.000000 | 748.000000 | 5.000000 | 204.000000 | 182.000000 | 144.000000 | 48.000000 | 377.000000 | 0.000000 | 0.000000 | NaN | 2.000000 | NaN |
| 50% | 0.000000 | 1.000000 | 59.000000 | 182106.000000 | 69.000000 | 844.000000 | 7.000000 | 251.000000 | 351.000000 | 185.000000 | 60.000000 | 492.000000 | 0.000000 | 0.000000 | NaN | 2.000000 | NaN |
| 75% | 1.000000 | 1.000000 | 70.000000 | 248277.000000 | 73.000000 | 945.000000 | 9.000000 | 306.000000 | 528.000000 | 252.000000 | 71.000000 | 600.000000 | 1.000000 | 1.000000 | NaN | 4.000000 | NaN |
| max | 1.000000 | 1.000000 | 99.000000 | 1000000.000000 | 102.000000 | 1000.000000 | 10.000000 | 920.000000 | 980.000000 | 633.000000 | 99.000000 | 878.000000 | 1.000000 | 1.000000 | NaN | 7.000000 | NaN |
Data preprocessing
Filling in missing values
The variables with missing values are all categorical, and the proportion missing is low, so we fill them with the mode
fill_cols = ['interested_travel', 'computer_owner', 'HH_adults_num']
fill_values = {col: travel[col].mode()[0] for col in fill_cols}
travel = travel.fillna(fill_values)
修正錯誤值
HH_has_children的分類水平以字符形式表示,需要轉換爲整型,同時其中的缺失值應當表示沒有小孩,因此替換爲0
閱讀愛好interested_reading中包含錯誤值“.”,將其以0進行替換,代表該用戶對閱讀沒有興趣。
travel['interested_reading'].value_counts(dropna=False)
3 65096
1 43832
0 32919
2 24488
. 842
Name: interested_reading, dtype: int64
travel['HH_has_children'] = travel['HH_has_children']\
    .replace({'N': 0, 'Y': 1, np.nan: 0})
travel['interested_reading'] = travel['interested_reading']\
    .replace({'.': '0'}).astype('int')
Handling the discrete variables
- K-means clustering generally does not analyze discrete variables, but guided by business understanding, discrete variables can be transformed
Analyzing the correlations among the discrete variables
_cols = [
    'interested_travel',
    'computer_owner',
    'marital',
    'interested_golf',
    'interested_gambling',
    'HH_has_children',
    'interested_reading'
]
sample = travel[_cols].sample(3000, random_state=12345)
from itertools import combinations
from scipy import stats
for colA, colB in combinations(_cols, 2):
    crosstab = pd.crosstab(sample[colA], sample[colB])
    pval = stats.chi2_contingency(crosstab)[1]
    if pval > 0.05:
        print('p-value = %0.3f between "%s" and "%s"' % (pval, colA, colB))
p-value = 0.710 between "interested_travel" and "HH_has_children"
p-value = 0.495 between "computer_owner" and "HH_has_children"
p-value = 0.272 between "interested_golf" and "HH_has_children"
For personal preferences, the categorical variables for travel, computers, golf, gambling, and reading can be combined into a single hobby-breadth indicator that represents the user's leisure interests, while the continuous interested_sport and HH_dieting are health-related interests and auto_member is a luxury-type interest; the user's hobbies can thus be analyzed from several angles.
- First binarize interested_reading
from sklearn.preprocessing import Binarizer
binarizer = Binarizer(threshold=1.5)
travel['interested_reading'] = binarizer.fit_transform(
    travel[['interested_reading']])
- Compute the positive share of the binary preference flags
interest = [
    'interested_travel',
    'computer_owner',
    'interested_golf',
    'interested_gambling',
    'interested_reading'
]
n_ = len(interest)
travel = travel.drop(interest, axis=1)\
    .assign(interest=travel[interest].sum(axis=1) / n_)
Normalization and standardization
- Different variable types call for different treatment: continuous, ordinal, and nominal variables are each processed differently, so we first group the variables by type and apply a different strategy to each group
- If a continuous variable has only a few possible values, such as marital (10 levels), interest (6 possible values), or HH_adults_num (8 levels), transforming its distribution as if it were an ordinary continuous variable can create outliers (for example, applying scikit-learn's normal quantile transform to marital places the levels 1 and 10 about five standard deviations from the mean). In this example we therefore treat these continuous variables like ordinal variables: no distribution transform, only standardization.
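The outlier effect described above can be reproduced on a hypothetical 10-level variable (a sketch, not the marital column itself): the normal quantile transform maps the lowest and highest levels several standard deviations away from the mean.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

# hypothetical variable with only 10 distinct levels, 100 rows per level
x = np.repeat(np.arange(1, 11), 100).astype(float).reshape(-1, 1)
qt = QuantileTransformer(n_quantiles=10, output_distribution='normal')
z = qt.fit_transform(x)
print(z.min(), z.max())  # the extreme levels land about 5 standard deviations out
```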
continuous_cols = ['age', 'home_value', 'risk_score', 'interested_sport',
                   'HH_dieting', 'auto_member', 'HH_grandparent',
                   'HH_head_age', 'loan_ratio']
categorical_cols = ['marital', 'interest', 'HH_adults_num']
discreate_cols = ['HH_has_children']
So that the resulting clusters end up with comparable sizes, heavily skewed continuous variables should have their distributions transformed to be close to normal or uniform
- Normalize the continuous variables
travel[continuous_cols].hist(bins=25)
plt.show()
![Histograms of the continuous variables](output_27_0.png)
from sklearn.preprocessing import QuantileTransformer
qt = QuantileTransformer(n_quantiles=100, output_distribution='normal')
qt_data = qt.fit_transform(travel[continuous_cols])
pd.DataFrame(qt_data, columns=continuous_cols).hist(bins=25)
plt.show()
![Histograms after the quantile normal transform](output_28_0.png)
- Standardize the ordinal variables with multiple levels
As noted above, although HH_adults_num, marital, and interest are continuous, each has roughly ten levels at most, so like ordinal variables they are only standardized
from sklearn.preprocessing import scale
scale_data = scale(travel[categorical_cols])
Leave the binary variable untouched and concatenate the variable groups
data = np.hstack([qt_data, scale_data, travel[discreate_cols]])
data = pd.DataFrame(
    data, columns=continuous_cols + categorical_cols + discreate_cols)
data.head()
| | age | home_value | risk_score | interested_sport | HH_dieting | auto_member | HH_grandparent | HH_head_age | loan_ratio | marital | interest | HH_adults_num | HH_has_children |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.321971 | -0.892802 | 0.574460 | 0.742289 | -0.515705 | 0.848859 | 0.265318 | 2.508596 | 0.698526 | -1.487818 | -1.035281 | -0.547836 | 0.0 |
| 1 | 0.619855 | -0.624538 | 5.199338 | -0.139710 | 0.822532 | 1.073988 | 1.182609 | 0.458679 | 0.698526 | 0.044429 | -0.354828 | 1.895118 | 0.0 |
| 2 | -0.126937 | -0.464935 | -0.935569 | 1.324958 | 0.525369 | -0.810751 | -0.446657 | -0.229884 | 1.549706 | -2.253942 | -1.035281 | -0.547836 | 0.0 |
| 3 | 1.335178 | -0.253449 | 0.574460 | 0.506088 | 0.114185 | -0.175603 | 1.413272 | 1.639976 | 0.927754 | 0.044429 | 1.686529 | -0.547836 | 1.0 |
| 4 | -0.747859 | -0.711465 | 1.144237 | -1.798170 | 0.225553 | -0.403108 | -0.937532 | -0.987837 | 1.549706 | 1.193615 | 1.006077 | 0.266482 | 1.0 |
Dimension analysis
Based on business needs, the variables are considered along two broad dimensions
- the first is the user's household attributes (basic family situation and finances); the second is the user's personal preferences (degree of interest in sports, dieting, and so on)
household = ['age', 'marital', 'HH_adults_num', 'home_value',
             'risk_score', 'HH_grandparent', 'HH_head_age', 'loan_ratio']
hobby = ['HH_dieting', 'auto_member','interest', 'interested_sport']
# 'HH_has_children',
data[hobby].corr()
| | HH_dieting | auto_member | interest | interested_sport |
|---|---|---|---|---|
| HH_dieting | 1.000000 | 0.134293 | 0.246711 | 0.510775 |
| auto_member | 0.134293 | 1.000000 | 0.316871 | 0.458742 |
| interest | 0.246711 | 0.316871 | 1.000000 | 0.219868 |
| interested_sport | 0.510775 | 0.458742 | 0.219868 | 1.000000 |
We run a factor analysis on each dimension separately. Because the factor analysis in scikit-learn does not provide factor rotation, we use a separate package, fa-kit, which must first be installed with pip install fa-kit; see https://github.com/bmcmenamin/fa_kit . Once installed, we can run factor analysis on the user's household attributes and preference attributes separately
First use principal component analysis to decide how many components to retain:
from sklearn.decomposition import PCA
pca_hh = PCA().fit(data[household])
pca_hh.explained_variance_ratio_.cumsum()
array([0.35366862, 0.56542894, 0.7311212 , 0.83103641, 0.88473556,
0.92921371, 0.96733029, 1. ])
Retaining 4 components is appropriate (the cumulative explained variance reaches about 83%)
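The choice can also be automated by thresholding the cumulative ratios printed above, for example keeping just enough components to explain 80% of the variance (the 80% threshold is an assumption, not part of the original analysis):

```python
import numpy as np

# cumulative explained-variance ratios from pca_hh above
cum_ratio = np.array([0.3537, 0.5654, 0.7311, 0.8310,
                      0.8847, 0.9292, 0.9673, 1.0])
n_keep = int(np.argmax(cum_ratio >= 0.8)) + 1  # first index reaching 80%
print(n_keep)  # 4
```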
from fa_kit import FactorAnalysis
from fa_kit import plotting as fa_plotting
fa_hh = FactorAnalysis.load_data_samples(
    data[household],
    preproc_demean=True,
    preproc_scale=True
)
fa_hh.extract_components()
Set how components are extracted. The default is the "broken_stick" method; the "top_n" method is recommended here
fa_hh.find_comps_to_retain(method='top_n',num_keep=4)
array([0, 1, 2, 3], dtype=int64)
Rotate the factors with the varimax method
fa_hh.rotate_components(method='varimax')
fa_plotting.graph_summary(fa_hh)
![fa_kit summary plot](output_45_0.png)
![fa_kit summary plot (continued)](output_45_1.png)
The rotated factors:
pd.DataFrame(fa_hh.comps['rot'].T, columns=household)
| | age | marital | HH_adults_num | home_value | risk_score | HH_grandparent | HH_head_age | loan_ratio |
|---|---|---|---|---|---|---|---|---|
| 0 | 0.545378 | -0.106102 | 0.070033 | -0.192464 | 0.004744 | 0.565773 | 0.546836 | -0.173800 |
| 1 | -0.093753 | -0.128673 | 0.091397 | -0.666906 | 0.011150 | 0.109866 | -0.065798 | 0.710642 |
| 2 | 0.076980 | 0.643012 | 0.758064 | -0.028565 | -0.001814 | 0.009484 | -0.070762 | -0.005709 |
| 3 | 0.112652 | 0.011355 | -0.010758 | 0.059238 | 0.981828 | -0.108131 | 0.040533 | 0.078959 |
We can see that:
- the first factor puts notably high weight on age, HH_grandparent, and HH_head_age; in business terms, these three together can be read as the user's life stage;
- the second factor puts notably high weight on home_value and loan_ratio, mainly reflecting the user's financial situation;
- the third factor puts notably high weight on marital and HH_adults_num, representing household size;
- the fourth factor puts high weight only on risk_score, so it represents the user's risk
Compute the factor scores:
data_hh = pd.DataFrame(
    np.dot(data[household], fa_hh.comps['rot']),
    columns=['life_circle', 'finance', 'HH_size', 'risk']
)
Similarly, run a factor analysis on the user's preference attributes (retaining 3 factors):
fa_hb = FactorAnalysis.load_data_samples(
    data[hobby],
    preproc_demean=True,
    preproc_scale=True
)
fa_hb.extract_components()
fa_hb.find_comps_to_retain(method='top_n', num_keep=3)
fa_hb.rotate_components(method='varimax')
pd.DataFrame(fa_hb.comps['rot'].T, columns=hobby)
| | HH_dieting | auto_member | interest | interested_sport |
|---|---|---|---|---|
| 0 | -0.175674 | 0.868191 | 0.033353 | 0.462893 |
| 1 | 0.832396 | -0.120642 | 0.027784 | 0.540176 |
| 2 | 0.088207 | 0.071094 | 0.978842 | -0.170394 |
- The first factor puts high weight on auto_member and interested_sport; it measures the user's sports preference
- The second factor puts high weight on HH_dieting and interested_sport; it measures how health-oriented the user's lifestyle is
- The third factor puts high weight only on interest; it measures the user's leisure preference
Compute the factor scores
data_hb = pd.DataFrame(
    np.dot(data[hobby], fa_hb.comps['rot']),
    columns=['sports', 'health', 'leisure']
)
Choosing K for the K-means clustering
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
def cluster_plot(data, k_range=range(2, 12), n_init=5, sample_size=2000,
                 n_jobs=-1):
    scores = []
    models = {}
    for k in k_range:
        kmeans = KMeans(n_clusters=k, n_init=n_init, n_jobs=n_jobs)
        kmeans.fit(data)
        models[k] = kmeans
        sil = silhouette_score(data, kmeans.labels_,
                               sample_size=sample_size)
        scores.append([k, kmeans.inertia_, sil])
    scores_df = pd.DataFrame(scores, columns=['k', 'sum_square_dist', 'sil'])
    plt.figure(figsize=[9, 2])
    plt.subplot(121, ylabel='sum_square')
    plt.plot(scores_df.k, scores_df.sum_square_dist)
    plt.subplot(122, ylabel='silhouette_score')
    plt.plot(scores_df.k, scores_df.sil)
    plt.show()
    return models
scale_data_hh = scale(data_hh)
models_hh = cluster_plot(scale_data_hh)
![Elbow and silhouette curves for the household dimension](output_58_0.png)
scale_data_hb = scale(data_hb)
models_hb = cluster_plot(scale_data_hb)
![Elbow and silhouette curves for the preference dimension](output_59_0.png)
Choose an appropriate K for each dimension, run the clustering, and join the resulting labels onto the original dataset
hh_labels = pd.DataFrame(models_hh[3].labels_, columns=['hh'])
hb_labels = pd.DataFrame(models_hb[2].labels_, columns=['hb'])
clusters = travel.join(hh_labels).join(hb_labels)
clusters.head()
| | age | home_value | loan_ratio | risk_score | marital | interested_sport | HH_grandparent | HH_dieting | HH_head_age | auto_member | HH_has_children | HH_adults_num | interest | hh | hb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 64 | 124035 | 73 | 932 | 3 | 312 | 420 | 149 | 96 | 626 | 0 | 2.0 | 0.2 | 1 | 0 |
| 1 | 69 | 138574 | 73 | 1000 | 7 | 241 | 711 | 263 | 68 | 658 | 0 | 5.0 | 0.4 | 2 | 1 |
| 2 | 57 | 148136 | 77 | 688 | 1 | 367 | 240 | 240 | 56 | 354 | 0 | 2.0 | 0.2 | 1 | 0 |
| 3 | 80 | 162532 | 74 | 932 | 7 | 291 | 832 | 197 | 86 | 462 | 1 | 2.0 | 1.0 | 1 | 1 |
| 4 | 48 | 133580 | 77 | 987 | 10 | 137 | 121 | 209 | 42 | 423 | 1 | 3.0 | 0.8 | 1 | 0 |
Describing the characteristics of each cluster (using the original data)
from sklearn.tree import DecisionTreeClassifier
clf_hh = DecisionTreeClassifier()
clf_hb = DecisionTreeClassifier()
clf_hh.fit(clusters[household], clusters['hh'])
clf_hb.fit(clusters[hobby], clusters['hb'])
DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')
import pydotplus
from IPython.display import Image
import sklearn.tree as tree
dot_hh = tree.export_graphviz(
    clf_hh,
    out_file=None,
    feature_names=household,
    class_names=['0', '1', '2'],
    max_depth=2,
    filled=True
)
graph_hh = pydotplus.graph_from_dot_data(dot_hh)
Image(graph_hh.create_png())
![Decision tree over the household clusters (top two levels)](output_64_0.png)
Many attributes are available for profiling the users; a decision tree computes each attribute's importance for the class label, so it can be used to surface the most salient attributes (features)
- Users labeled hh=0 stand out as married and low-risk
- Users labeled hh=1 stand out as unmarried and low-risk
- Users labeled hh=2 stand out as high-risk
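The importance calculation behind this reading can be illustrated on synthetic data (a sketch, not the fitted clf_hh above): when only the first feature determines the class, the tree assigns it essentially all of the importance.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = (X[:, 0] > 0).astype(int)  # the class depends only on feature 0
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(clf.feature_importances_)  # feature 0 receives nearly all the importance
```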
dot_hb = tree.export_graphviz(
    clf_hb,
    out_file=None,
    feature_names=hobby,
    class_names=['0', '1'],
    max_depth=2,
    filled=True
)
graph_hb = pydotplus.graph_from_dot_data(dot_hb)
Image(graph_hb.create_png())
![Decision tree over the preference clusters (top two levels)](output_66_0.png)
- Users labeled hb=0 stand out as having low overall interest and little liking for sports
- Users labeled hb=1 stand out as liking sports and auto-club activities
A multi-dimensional summary can then be produced
ana = pd.pivot_table(clusters, index='hh', columns='hb', aggfunc='mean').T
ana.swaplevel('hb', 0).sort_index(level=0)
| hb | variable | hh=0 | hh=1 | hh=2 |
|---|---|---|---|---|
| 0 | HH_adults_num | 3.269823 | 2.021071 | 2.423398 |
| 0 | HH_dieting | 166.936340 | 173.674181 | 158.180365 |
| 0 | HH_grandparent | 269.372191 | 355.571933 | 399.976093 |
| 0 | HH_has_children | 0.459865 | 0.238048 | 0.148394 |
| 0 | HH_head_age | 54.562345 | 57.899566 | 68.807469 |
| 0 | age | 55.791545 | 54.738482 | 66.056585 |
| 0 | auto_member | 453.438970 | 366.509789 | 447.716367 |
| 0 | home_value | 237195.115295 | 155775.222623 | 175275.060263 |
| 0 | interest | 0.351432 | 0.296582 | 0.294950 |
| 0 | interested_sport | 227.167794 | 209.014134 | 237.206394 |
| 0 | loan_ratio | 65.255430 | 72.004900 | 68.444051 |
| 0 | marital | 8.685870 | 5.187358 | 5.895035 |
| 0 | risk_score | 813.785306 | 737.985482 | 999.991654 |
| 1 | HH_adults_num | 3.344590 | 2.189454 | 2.770638 |
| 1 | HH_dieting | 229.577694 | 257.790373 | 227.790940 |
| 1 | HH_grandparent | 331.204838 | 536.099752 | 514.176211 |
| 1 | HH_has_children | 0.370444 | 0.169496 | 0.165677 |
| 1 | HH_head_age | 57.733696 | 63.795255 | 67.524516 |
| 1 | age | 60.438336 | 64.279766 | 70.390155 |
| 1 | auto_member | 593.903952 | 526.606734 | 596.637713 |
| 1 | home_value | 277657.102649 | 164260.575577 | 210227.962938 |
| 1 | interest | 0.678365 | 0.718116 | 0.691400 |
| 1 | interested_sport | 302.135937 | 293.386877 | 319.072496 |
| 1 | loan_ratio | 60.209289 | 70.427761 | 63.409021 |
| 1 | marital | 8.615938 | 5.607380 | 6.862574 |
| 1 | risk_score | 845.519363 | 799.222944 | 999.984007 |
Example user-group characteristics:
- Key user-group examples:

| Label | Characteristics |
|---|---|
| hh=0, hb=0 | Young, married, middle-class with children; moderate preference for sports and leisure |
| hh=0, hb=1 | Young, few household members, relatively high loan burden and low risk; low preference for sports and leisure activities |
| hh=1, hb=0 | Married, high home value, low loan ratio, many household members |