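Overall approach

Clustering-based precision recommendation: overview

Step 1, clustering:
- Segment the users into groups and tag every customer with a cluster label.
Step 2, generating the recommendation rule:
- Among the products a user has not bought yet, the products with the highest total (or average) purchase count among customers in the same cluster are taken to be the products this kind of customer likes best.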
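
The rule above can be sketched on a toy purchase log. This is a minimal illustration, not the project's own code: the `purchases` frame, the `cluster_of` labels and the `recommend` helper are all hypothetical names introduced here.

```python
import pandas as pd

# Hypothetical toy purchase log: one row per (user, item) purchase event.
purchases = pd.DataFrame({
    "user": ["u1", "u1", "u2", "u2", "u2", "u3"],
    "item": ["A", "B", "A", "C", "C", "C"],
})
# Hypothetical cluster labels, e.g. as produced by KMeans.
cluster_of = pd.Series({"u1": 0, "u2": 0, "u3": 1}, name="cluster")

def recommend(user, top_n=2):
    """Rank items the user has not bought by purchase count within the user's cluster."""
    peers = cluster_of[cluster_of == cluster_of[user]].index
    peer_counts = purchases[purchases["user"].isin(peers)]["item"].value_counts()
    already_bought = set(purchases.loc[purchases["user"] == user, "item"])
    candidates = peer_counts[~peer_counts.index.isin(already_bought)]
    return candidates.head(top_n).index.tolist()

print(recommend("u1"))
```

User `u1` shares a cluster with `u2`, has already bought A and B, and therefore only receives C.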

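Data cleaning

Columns to drop first:
- Missing rate of 90% or more
- The whole column holds a single value
- The column carries almost no useful information
Columns to encode:
- Dummy-variable (one-hot) encoding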
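
These cleaning rules can be sketched as a small helper. The function name `drop_uninformative_columns`, the `max_missing` parameter and the toy frame are assumptions made for illustration; the notebook itself applies the rules inline.

```python
import numpy as np
import pandas as pd

def drop_uninformative_columns(df, max_missing=0.9):
    """Drop columns whose missing rate is >= max_missing or that hold one distinct value."""
    missing_rate = df.isna().mean()
    df = df[missing_rate[missing_rate < max_missing].index]
    # nunique() ignores NaN, so a column with one value plus NaN also counts as constant.
    informative = [c for c in df.columns if df[c].nunique() > 1]
    return df[informative]

# Hypothetical toy table with one column of each kind.
toy = pd.DataFrame({
    "mostly_missing": [np.nan] * 9 + [1.0],
    "constant": ["x"] * 10,
    "status": ["ok", "refund"] * 5,
})
kept = drop_uninformative_columns(toy)
# Dummy-variable encoding of the surviving categorical column.
cleaned = pd.get_dummies(kept["status"], prefix="status")
print(cleaned.columns.tolist())
```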

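Data integration

The goal is to produce a single table that holds all of each user's purchase behavior:
- Join and merge the three tables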

# Import packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import KMeans
plt.style.use('seaborn')  # on matplotlib 3.6 and later this style is named 'seaborn-v0_8'
plt.rcParams['font.sans-serif']=['Simhei']  # use a font with Chinese glyphs so Chinese labels render in plots
plt.rcParams['axes.unicode_minus']=False    # keep the minus sign on the axes rendering correctly with that font

Data cleaning

Cleaning the order table

Initial data exploration

order = pd.read_csv(r"...\order.csv",index_col=0)
order.head(1)

	订单编号	买家会员名	买家应付货款	买家应付邮费	买家支付积分	总金额	返点积分	买家实际支付金额	买家实际支付积分	订单状态	...	是否代付	定金排名	修改后的sku	修改后的收货地址	异常信息	天猫卡券抵扣	集分宝抵扣	是否是O2O交易	退款金额	预约门店
0	21407300627014900	1425	58.51	0.0	0	58.51	0	58.51	0	交易成功	...	否	NaN	NaN	NaN	NaN	NaN	NaN	NaN	0.0	NaN

1 rows × 45 columns

order.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 3989 entries, 0 to 3988
Data columns (total 45 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   订单编号      3989 non-null   int64  
 1   买家会员名     3989 non-null   int64  
 2   买家应付货款    3989 non-null   float64
 3   买家应付邮费    3989 non-null   float64
 4   买家支付积分    3989 non-null   int64  
 5   总金额       3989 non-null   float64
 6   返点积分      3989 non-null   int64  
 7   买家实际支付金额  3989 non-null   float64
 8   买家实际支付积分  3989 non-null   int64  
 9   订单状态      3989 non-null   object 
 10  买家留言      384 non-null    object 
 11  收货人姓名     3989 non-null   int64  
 12  收货地址      3989 non-null   object 
 13  运送方式      3989 non-null   object 
 14  联系电话      142 non-null    object 
 15  联系手机      3986 non-null   object 
 16  订单创建时间    3989 non-null   object 
 17  订单付款时间    3989 non-null   object 
 18  宝贝标题      3989 non-null   object 
 19  宝贝种类      3989 non-null   int64  
 20  物流单号      3988 non-null   object 
 21  物流公司      3988 non-null   object 
 22  订单备注      460 non-null    object 
 23  宝贝总数量     3989 non-null   int64  
 24  店铺Id      3989 non-null   int64  
 25  店铺名称      3989 non-null   int64  
 26  订单关闭原因    3989 non-null   object 
 27  卖家服务费     3989 non-null   int64  
 28  买家服务费     3989 non-null   object 
 29  发票擡头      0 non-null      float64
 30  是否手机订单    3728 non-null   object 
 31  分阶段订单信息   0 non-null      float64
 32  特权订金订单id  0 non-null      float64
 33  是否上传合同照片  3989 non-null   object 
 34  是否上传小票    3989 non-null   object 
 35  是否代付      3989 non-null   object 
 36  定金排名      0 non-null      float64
 37  修改后的sku   0 non-null      float64
 38  修改后的收货地址  61 non-null     object 
 39  异常信息      0 non-null      float64
 40  天猫卡券抵扣    0 non-null      float64
 41  集分宝抵扣     12 non-null     float64
 42  是否是O2O交易  0 non-null      float64
 43  退款金额      3989 non-null   float64
 44  预约门店      0 non-null      float64
dtypes: float64(15), int64(11), object(19)
memory usage: 1.6+ MB

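Removing useless information

# Keep only columns with at least 20% non-null values (drops columns that are more than 80% missing):
order=order.dropna(axis=1,thresh=order.shape[0]*0.2)

# Drop columns that hold a single distinct value:
for i in order.columns:
    if order[i].nunique()==1:
        del order[i]

# Manually select the fields related to the user's purchases
order=order[["订单编号","买家会员名","买家实际支付金额","收货地址","宝贝标题 ","宝贝种类","宝贝总数量","退款金额"]]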

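Data encoding

# Binarize the refund amount: 1 if any refund occurred, 0 otherwise
order.退款金额=np.where(order.退款金额>0,1,0)

# One-hot encode the shipping address, using its first three characters as the region
address=order.收货地址.str[:3].str.strip()
address=pd.get_dummies(address,prefix="地址")


# One-hot encode the 宝贝种类 field
kinds=pd.get_dummies(order.宝贝种类,prefix="宝贝种类")

# Drop the original fields that are now encoded and append the encoded columns to the order table
order=order.drop(["收货地址","宝贝种类"],axis=1)
order=pd.concat([order,address,kinds],axis=1)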
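
A small sketch of the address encoding on made-up rows may help. The three addresses and their space-separated layout are assumptions, not values from the data set. Recent pandas versions return boolean columns from `get_dummies` by default, so `dtype=int` is passed here to get the 0/1 columns shown in the output below.

```python
import pandas as pd

# Hypothetical addresses; the real table stores them in the 收货地址 column.
toy = pd.DataFrame({"收货地址": ["上海 上海市 浦东新区", "云南省 昆明市", "北京 北京市 海淀区"]})

# The first three characters cover two-character and three-character region
# names; strip() removes the trailing space left by the shorter ones.
prefix = toy["收货地址"].str[:3].str.strip()
address = pd.get_dummies(prefix, prefix="地址", dtype=int)
print(address)
```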

order.head(1)

	订单编号	买家会员名	买家实际支付金额	宝贝标题	宝贝总数量	退款金额	地址_上海	地址_云南省	地址_内蒙古	地址_北京	...	宝贝种类_39	宝贝种类_40	宝贝种类_41	宝贝种类_43	宝贝种类_45	宝贝种类_46	宝贝种类_47	宝贝种类_48	宝贝种类_49	宝贝种类_50
0	21407300627014900	1425	58.51	...	59	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

1 rows × 83 columns

Cleaning the order detail table

Initial data exploration

order_detail=pd.read_csv(r"...\Items_order.csv")
order_detail.head(1)

	订单编号	标题	价格	购买数量	外部系统编号	商品属性	套餐信息	备注	订单状态	商家编码
0	21407300627014900	...	0.58	12	WY013-2SZD0426	颜色分类：小号	NaN	NaN	交易成功	WY013-2SZD0426

order_detail.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21897 entries, 0 to 21896
Data columns (total 10 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   订单编号    21897 non-null  int64  
 1   标题      21897 non-null  object 
 2   价格      21897 non-null  float64
 3   购买数量    21897 non-null  int64  
 4   外部系统编号  21897 non-null  object 
 5   商品属性    12636 non-null  object 
 6   套餐信息    0 non-null      float64
 7   备注      130 non-null    object 
 8   订单状态    21897 non-null  object 
 9   商家编码    21897 non-null  object 
dtypes: float64(2), int64(2), object(6)
memory usage: 1.7+ MB

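Removing useless information

# Keep only columns with at least 20% non-null values
order_detail=order_detail.dropna(axis=1,thresh=order_detail.shape[0]*0.2)

# Drop columns that hold a single distinct value
for i in order_detail.columns:
    if order_detail[i].nunique()==1:
        del order_detail[i]

order_detail=order_detail[["订单编号","标题","价格","购买数量","订单状态"]]

# Keep only the records of successful transactions
order_detail=order_detail[order_detail.订单状态=="交易成功"]
# Reset the index and drop the last column (订单状态), which is constant after the filter
order_detail=order_detail.reset_index(drop=True).iloc[:,:-1]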

order_detail.head(1)

	订单编号	标题	价格	购买数量
0	21407300627014900	...	0.58	12

Item attribute table

Initial data exploration

items_detail=pd.read_csv(r"...\Items_attribute.csv")
items_detail.head(1)

	宝贝ID	标题	价格	玩具类型	适用年龄	品牌
0	537396783238	...	8.9	塑胶玩具	3岁,4岁,5岁,6岁	3

items_detail.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288 entries, 0 to 287
Data columns (total 6 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   宝贝ID    288 non-null    int64  
 1   标题      288 non-null    object 
 2   价格      288 non-null    float64
 3   玩具类型    252 non-null    object 
 4   适用年龄    284 non-null    object 
 5   品牌      288 non-null    int64  
dtypes: float64(1), int64(2), object(3)
memory usage: 13.6+ KB

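Data encoding

# Fill missing values with the most frequent value (the mode):
items_detail.适用年龄=items_detail.适用年龄.fillna(items_detail.适用年龄.mode()[0])

# Collect every distinct age label that occurs in the column
a=[]
for i in items_detail.适用年龄.value_counts().index:
    a.extend(i.split(","))
a=list(set(a))
# Manually group the age labels into four age bands:
baby=["3个月","6个月","12个月"]
youer=['18个月','2岁','3岁']
xueqian=['4岁','5岁','6岁']
stu=['7岁','8岁','9岁','10岁','11岁','12岁','13岁','14岁','14岁以上']
def change(x):
    # Map a comma-separated age string to a "|"-separated list of age bands,
    # adding each band at most once
    a=x.split(",")
    st=""
    for i in a:
        if i in baby:
            if st.find("婴儿")!=-1:
                continue
            st=st+"婴儿|"
        elif i in youer:
            if st.find("幼儿")!=-1:
                continue
            st=st+"幼儿|"
        elif i in xueqian:
            if st.find("学前")!=-1:
                continue
            st=st+"学前|"
        else:
            # Any label outside the three lists above falls into the student band
            if st.find("学生")!=-1:
                continue
            st=st+"学生|"
    return st

# One-hot encode the age bands
age=items_detail.适用年龄.apply(change)
age=age.str.get_dummies("|")
age.columns="年龄_"+age.columns
age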
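
The same grouping can be expressed more compactly with a lookup table. The sketch below is an alternative formulation, not the notebook's code: `GROUPS`, `LABEL_TO_GROUP` and `age_groups` are names introduced here, and the three input strings are made up. Labels outside the three explicit lists fall into 学生, matching the `else` branch of `change`.

```python
import pandas as pd

# Manual grouping as a lookup table (age band -> raw labels).
GROUPS = {
    "婴儿": ["3个月", "6个月", "12个月"],
    "幼儿": ["18个月", "2岁", "3岁"],
    "学前": ["4岁", "5岁", "6岁"],
}
LABEL_TO_GROUP = {label: group for group, labels in GROUPS.items() for label in labels}

def age_groups(value):
    """Map a comma-separated age string to a '|'-joined set of age bands."""
    groups = {LABEL_TO_GROUP.get(label, "学生") for label in value.split(",")}
    return "|".join(sorted(groups))

# Hypothetical sample values in the format of the 适用年龄 column.
ages = pd.Series(["3岁,4岁,5岁,6岁", "6个月,12个月,18个月", "7岁,8岁"])
age = ages.apply(age_groups).str.get_dummies("|").add_prefix("年龄_")
print(age)
```

Using a set removes the need for the repeated `st.find(...)` checks, because each band can appear only once.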

	年龄_婴儿	年龄_学前	年龄_学生	年龄_幼儿
0	0	1	0	1
1	0	1	0	1
2	0	1	1	1
3	0	1	1	1
4	0	1	0	1
...	...	...	...	...
283	0	1	1	1
284	0	1	1	1
285	1	0	0	1
286	0	1	1	1
287	0	1	1	1

288 rows × 4 columns

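# Brand field: check for missing values, then one-hot encode
items_detail.品牌.isnull().sum()
brand=pd.get_dummies(items_detail.品牌,prefix="品牌")

# Rebuild the item table from its first three columns (宝贝ID, 标题, 价格) plus the encoded age and brand columns
items_detail=pd.concat([items_detail.iloc[:,:3],age,brand],axis=1)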

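Merging the tables

Merging the three tables into one

# Purpose of the merge: keep as much of each customer's purchase behavior as possible across the three tables

# First merge: attach item attributes to each order line by matching on the title
table_01 = pd.merge(order_detail, items_detail, how="inner", on="标题")

# Second merge: attach the order-level information

table_02 = pd.merge(table_01, order, how="left", on="订单编号")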
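
Because the first merge is an inner join on the title, order lines whose title has no match in the item table are silently dropped. A quick check along the following lines can show how many rows are affected. The `lines` and `items` frames are made-up stand-ins for `order_detail` and `items_detail`; `indicator` and `validate` are standard `pd.merge` parameters.

```python
import pandas as pd

# Hypothetical stand-ins for order_detail and items_detail.
lines = pd.DataFrame({"订单编号": [1, 1, 2], "标题": ["car", "ball", "kite"]})
items = pd.DataFrame({"标题": ["car", "ball"], "品牌": [3, 5]})

# indicator=True adds a _merge column that marks rows without a match.
check = pd.merge(lines, items, how="left", on="标题", indicator=True)
unmatched = check[check["_merge"] == "left_only"]

# validate raises if a title appears more than once in the item table,
# which would otherwise duplicate order lines.
matched = pd.merge(lines, items, how="inner", on="标题", validate="many_to_one")
print(len(lines), len(matched), unmatched["标题"].tolist())
```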

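First build the customer table from the fields that can be summed directly

# Turn table_02 into a table with one row per user that covers all of that user's purchase behavior.

# Drop duplicate rows
table_02 = table_02.drop_duplicates()

table_03 = table_02.drop(["订单编号", "标题", "宝贝ID", "宝贝标题 ", "价格_x", "价格_y"],
                         axis=1)

table_04 = table_03.drop(["买家实际支付金额", "宝贝总数量"], axis=1)

order_tag_01 = table_04.groupby("买家会员名").sum()

Then compute the fields that cannot be summed directly from the order table

# Order-level amounts repeat on every order line after the merge, so summing them
# would count them several times; they are averaged per user from the order table instead.
table_05 = table_02[["订单编号", "买家会员名", "买家实际支付金额", "宝贝总数量"]]

order_tag_02 = order.groupby("买家会员名")[["买家实际支付金额", "宝贝总数量"]].mean()

Building the user purchase-behavior table

order_tag_all = pd.merge(order_tag_01, order_tag_02, how="inner", on="买家会员名")

# Check for missing values
order_tag_all.isnull().sum()[order_tag_all.isnull().sum() != 0]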

Series([], dtype: int64)
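
The reason for handling the two groups of fields separately can be seen on a toy example. The numbers below are made up: after the merge, an order's paid amount is repeated on each of its lines, so summing it per user inflates the total, while line-level quantities sum correctly.

```python
import pandas as pd

# Hypothetical data: one user (10) with two orders; order 1 has three lines.
orders = pd.DataFrame({"订单编号": [1, 2], "买家会员名": [10, 10],
                       "买家实际支付金额": [30.0, 50.0]})
lines = pd.DataFrame({"订单编号": [1, 1, 1, 2], "购买数量": [1, 2, 1, 4]})

merged = lines.merge(orders, how="left", on="订单编号")

# Summing an order-level field on the merged table counts order 1 three times.
inflated = merged.groupby("买家会员名")["买家实际支付金额"].sum()
# Aggregating on the order table itself gives the per-order average.
per_order_mean = orders.groupby("买家会员名")["买家实际支付金额"].mean()
# Line-level fields can be summed on the merged table without distortion.
quantity = merged.groupby("买家会员名")["购买数量"].sum()
print(inflated.loc[10], per_order_mean.loc[10], quantity.loc[10])
```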

Modeling

First model

# Min-max normalization

mms=MinMaxScaler()
data_norm=mms.fit_transform(order_tag_all.values)

# Tune k with the elbow method

sse=[]
for k in range(1,25):
    km=KMeans(n_clusters=k)
    km.fit(data_norm)
    sse.append(km.inertia_)
# Plot SSE against k
plt.plot(range(1,25),sse,marker="o")

[Figure: SSE against k for k = 1 to 24, first model]
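
KMeans starts from random centroids, so the SSE curve can change between runs. The sketch below repeats the elbow loop on synthetic data with `random_state` and `n_init` fixed so that it is reproducible. The three-blob data set is an assumption standing in for `order_tag_all`, and the smaller range of k is only there to keep the example fast.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for order_tag_all: three well-separated 2-D blobs.
centers = np.array([[0, 0], [5, 5], [0, 5]])
X = np.vstack([c + rng.normal(scale=0.3, size=(50, 2)) for c in centers])
X_norm = MinMaxScaler().fit_transform(X)

ks = range(1, 7)
sse = []
for k in ks:
    # Fixed seed and explicit n_init make the curve identical on every run.
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_norm)
    sse.append(km.inertia_)

for k, value in zip(ks, sse):
    print(k, round(value, 3))
```

On data like this the SSE drops sharply up to k = 3 and flattens afterwards, which is the "elbow" the method looks for.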

Second model after removing ineffective fields

# Drop the one-hot fields in bulk: every column whose name contains "宝贝种类", "地址" or "品牌"
a = []
for i in order_tag_all.columns:
    if i.find("宝贝种类")!=-1 or i.find("地址")!=-1 or i.find("品牌")!=-1 :
        a.append(i)
order_tag_all.drop(a,axis=1,inplace=True)

# Normalize again

mms=MinMaxScaler()
data_norm=mms.fit_transform(order_tag_all.values)

# Fit the models again for the elbow curve

sse=[]
for k in range(1,25):
    km=KMeans(n_clusters=k)
    km.fit(data_norm)
    sse.append(km.inertia_)


plt.plot(range(1,25),sse,marker="o")

[Figure: SSE against k for k = 1 to 24, second model]

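# Choose k with the silhouette coefficient
from sklearn.metrics import silhouette_score
score=[]
for k in range(2,25):
    km=KMeans(n_clusters=k)
    res_km=km.fit(data_norm)
    score.append(silhouette_score(data_norm,res_km.labels_))
# Plot the silhouette coefficient against k
plt.plot(range(2,25),score,marker="o")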

[Figure: silhouette coefficient against k for k = 2 to 24]

# At k = 5 the model balances a relatively small SSE with a relatively large silhouette coefficient.
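
Reading k off the plot can be complemented by picking the k with the highest silhouette coefficient programmatically. The sketch below does this on synthetic blobs generated with `make_blobs`. The data and the `scores` / `best_k` names are illustrative assumptions, and on the real data the choice still has to be weighed against the SSE curve as described above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic data with three well-separated clusters.
X, _ = make_blobs(n_samples=150, centers=[[0, 0], [5, 5], [0, 5]],
                  cluster_std=0.3, random_state=0)
X_norm = MinMaxScaler().fit_transform(X)

scores = {}
for k in range(2, 7):  # the silhouette coefficient needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_norm)
    scores[k] = silhouette_score(X_norm, labels)

best_k = max(scores, key=scores.get)
print(best_k)
```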

Final model with k=5 and cluster labels

# Refit the model

km=KMeans(n_clusters=5)
km.fit(data_norm)
clusters=km.labels_
pd.Series(clusters).value_counts()


# Match each member to its cluster label:
order_tag_all["类别"]=clusters
result=order_tag_all["类别"]
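
Once every member carries a label, a per-cluster summary helps to interpret what each cluster stands for. The following sketch uses a made-up six-member table with two obvious groups. The `features`, `labelled`, `sizes` and `profile` names are introduced here and are not part of the notebook.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import MinMaxScaler

# Hypothetical per-member features: three light buyers and three heavy buyers.
features = pd.DataFrame(
    {"购买数量": [1, 2, 1, 40, 38, 42],
     "买家实际支付金额": [10.0, 12.0, 9.0, 300.0, 280.0, 310.0]},
    index=pd.Index([101, 102, 103, 104, 105, 106], name="买家会员名"),
)

X = MinMaxScaler().fit_transform(features.values)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Attach the label to each member, then summarize each cluster.
labelled = features.assign(类别=km.labels_)
sizes = labelled["类别"].value_counts()
profile = labelled.groupby("类别").mean()
print(profile)
```

The cluster numbers themselves are arbitrary; only the grouping and the per-cluster averages carry meaning.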

	标题	购买次数	推荐指数
0	..	217.0	❤❤❤❤❤
1	..	209.0	❤❤❤❤
2	..	199.0	❤❤❤
3	...	187.0	❤❤
4	...	141.0	❤
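
The table above lists the five most purchased items with a heart rating that decreases by rank. How it was produced is not shown here, so the sketch below is only one plausible way to build such a table. The item titles are placeholders, while the purchase counts reuse the figures from the table plus one extra low-count item.

```python
import pandas as pd

# Placeholder titles; counts taken from the table above plus one extra item.
counts = pd.DataFrame({
    "标题": ["item_a", "item_b", "item_c", "item_d", "item_e", "item_f"],
    "购买次数": [141.0, 217.0, 199.0, 12.0, 209.0, 187.0],
})

# Keep the five most purchased items, best first.
top = counts.sort_values("购买次数", ascending=False).head(5).reset_index(drop=True)
# Rank 0 gets five hearts, rank 4 gets one.
top["推荐指数"] = ["❤" * (len(top) - rank) for rank in range(len(top))]
print(top)
```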

Clustering-based precision marketing project for e-commerce data

Recommender system
- User-item-purchase count table
- User-unpurchased item table
- User-unpurchased item-cluster table
- Cluster-item-purchase count table
- Final recommendation table
