Dataset download: https://www.kaggle.com/sulmansarwar/transactions-from-a-bakery?select=BreadBasket_DMS.csv
Setting the matplotlib plotting style: https://blog.csdn.net/weixin_42968458/article/details/82889736
A few styles I find good-looking:
fivethirtyeight, seaborn-colorblind and seaborn-paper look much the same; seaborn-white has a white background.
EDA
import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('C:/Users/admin/Desktop/Apriori/BreadBasket_DMS.csv')
df['Item']=df['Item'].str.lower()
x=df['Item']== 'none'
print(x.value_counts())
False 20507
True 786
Name: Item, dtype: int64
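df=df.drop(df[df.Item == 'none'].index) # these are the string 'none', not nulls, so df.dropna(axis=0) would not remove them
Drop the records where nothing was bought. A 'none' row is like someone who walked around the shop and left without buying anything.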
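To see why dropna() does not help here, the following is a minimal sketch on a made-up four-row frame (the toy values are assumptions, not rows from BreadBasket_DMS.csv). It compares dropna() with a boolean mask, which is an equivalent and slightly shorter alternative to the drop-by-index call used above.

```python
import pandas as pd

# Toy data standing in for the real CSV (values are made up).
toy = pd.DataFrame({
    "Transaction": [1, 2, 2, 3],
    "Item": ["bread", "none", "coffee", "none"],
})

# dropna() leaves the frame unchanged: 'none' is an ordinary string, not NaN.
after_dropna = toy.dropna(axis=0)

# A boolean mask keeps only the rows whose Item is not the placeholder.
cleaned = toy[toy["Item"] != "none"]

print(len(after_dropna), len(cleaned))
```

The mask version avoids building an index of rows to drop, but both approaches should give the same result on this dataset.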
len(df['Item'].unique())
94
There are 94 distinct items in total.
Analyzing the best-selling items
df_for_top10_Items=df['Item'].value_counts().head(10)
Item_array= np.arange(len(df_for_top10_Items))
import matplotlib.pyplot as plt
plt.style.use("fivethirtyeight")
plt.figure(figsize=(12,5))
Items_name=['coffee','bread','tea','cake','pastry','sandwich','medialuna','hot chocolate','cookies','brownie']
plt.bar(Item_array,df_for_top10_Items.iloc[:])
plt.xticks(Item_array,Items_name) # set the x-axis tick labels
plt.title('Top 10 best-selling items',fontsize=18)
# add data labels
for x, y in zip(Item_array,df_for_top10_Items.iloc[:]):
    plt.text(x+0.05,y+0.15,'%.0f' %y,ha='center',va='bottom')
plt.show()
df_for_top10_Items.iloc[:]
coffee 5471
bread 3325
tea 1435
cake 1025
pastry 856
sandwich 771
medialuna 616
hot chocolate 590
cookies 540
brownie 379
Name: Item, dtype: int64
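The hand-typed Items_name list only works as long as it matches the order of value_counts(). A possible alternative, sketched here on a made-up series (the sample values and the names labels and heights are assumptions for illustration), is to read the tick labels straight from the index of the result.

```python
import pandas as pd

# Made-up item column for illustration only.
items = pd.Series(["coffee", "bread", "coffee", "tea", "bread", "coffee"])

top = items.value_counts().head(2)   # sorted by count, descending
labels = list(top.index)             # tick labels taken from the data
heights = top.tolist()               # bar heights in the same order

print(labels, heights)
```

With the real data, labels and heights could be passed to plt.xticks and plt.bar in place of Items_name and df_for_top10_Items.iloc[:].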
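Transaction frequency by day of the week (number of items sold)
df['Date'] = pd.to_datetime(df['Date']) # convert the column to datetime
df['Time'] = pd.to_datetime(df['Time'],format= '%H:%M:%S' ).dt.hour # keep only the hour as an integer
df['day_of_week'] = df['Date'].dt.weekday
# day of the week for each date: 0 is Monday, 6 is Sunday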
d=df.loc[:,'Date']
df['day_of_week'].value_counts()
5 4605
4 3124
6 3095
3 2646
1 2392
0 2324
2 2321
Name: day_of_week, dtype: int64
weekday_names=[ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
Weekday_number=[0,1,2,3,4,5,6]
week_df = d.groupby(d.dt.weekday).count().reindex(Weekday_number)
Item_array_week= np.arange(len(week_df))
plt.style.use("seaborn-paper") # on matplotlib 3.6 and later this style is named "seaborn-v0_8-paper"
plt.figure(figsize=(9,5))
plt.bar(Item_array_week,week_df)
plt.xticks(Item_array_week,weekday_names)
plt.title('Number of Transactions made based on Weekdays',fontsize=18)
# add data labels
for x, y in zip(Item_array_week,week_df):
    plt.text(x+0.05,y+0.15,'%.0f' %y,ha='center',va='bottom')
plt.show()
As expected, sales pick up on Friday and stay high through Sunday, with Saturday selling the most items.
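The weekday numbering used above (0 for Monday through 6 for Sunday) is the same convention as the standard library's date.weekday(). A small sketch with an arbitrary example date (the date and the helper weekday_label are illustrative assumptions, not taken from the dataset):

```python
from datetime import date

weekday_names = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

def weekday_label(d: date) -> str:
    """Map date.weekday() (0 = Monday ... 6 = Sunday) to its English name."""
    return weekday_names[d.weekday()]

# Arbitrary example date, chosen only for illustration.
example = date(2016, 10, 30)
print(example.weekday(), weekday_label(example))  # prints: 6 Sunday
```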
Number of items bought in each hour of the day (note that this is not the number of customers)
dt=df.loc[:,'Time']
Hour_names=[ 7, 8, 9,10,11,12,13,14,15,16,17,18,19,20,21,22,23]
time_df=dt.groupby(dt).count().reindex(Hour_names)
Item_array_hour= np.arange(len(time_df))
plt.figure(figsize=(13,5))
plt.bar(Item_array_hour,time_df)
plt.xticks(Item_array_hour,Hour_names)
plt.title('Number of Transactions made based on Hours',fontsize=18)
# add data labels
for x, y in zip(Item_array_hour,time_df):
    plt.text(x+0.05,y+0.15,'%.0f' %y,ha='center',va='bottom')
plt.show()
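One thing to watch with reindex(Hour_names): any hour without sales becomes NaN, and its data label would then print as "nan". A minimal sketch on made-up hours (the sample values and the name wanted_hours are assumptions) shows how fill_value=0 could avoid that.

```python
import pandas as pd

# Made-up hour-of-day values, one per sold item.
hours = pd.Series([9, 9, 10, 12, 12, 12])

wanted_hours = [9, 10, 11, 12]
counts = hours.groupby(hours).count()

# Plain reindex leaves NaN for hours with no sales (11 here) ...
with_nan = counts.reindex(wanted_hours)
# ... while fill_value=0 gives a zero-height bar and a clean "0" label.
with_zero = counts.reindex(wanted_hours, fill_value=0)

print(with_zero.tolist())
```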
Apriori
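Rows that share the same Transaction are treated as one customer's shopping basket, so the Items need to be grouped by Transaction and collected into a list. Apriori also accepts tuples, but it is better not to build tuples here: a single-element tuple is written with a trailing comma, e.g. ('bread',), which can get in the way of filtering the frequent itemsets.
Reference for this step: https://cloud.tencent.com/developer/ask/175315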
df.groupby(['Transaction'])['Item'].apply(list).head()
Transaction
1 [bread]
2 [scandinavian, scandinavian]
3 [hot chocolate, jam, cookies]
4 [muffin]
5 [coffee, pastry, bread]
Name: Item, dtype: object
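For readers who want to see the grouping step without pandas, here is a rough equivalent using only the standard library. The rows are made up and the variable names are assumptions; the last lines show the trailing comma that Python prints for a one-element tuple.

```python
from collections import defaultdict

# Made-up (transaction id, item) rows in the same shape as the dataset.
rows = [(1, "bread"), (2, "scone"), (2, "coffee"), (3, "muffin")]

# Collect the items of each transaction into one list (one basket).
baskets = defaultdict(list)
for transaction_id, item in rows:
    baskets[transaction_id].append(item)

data = list(baskets.values())
print(data)

# A one-element tuple needs, and is printed with, a trailing comma.
single = ("bread",)
print(repr(single))
```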
from efficient_apriori import apriori
data = list(df.groupby(['Transaction'])['Item'].apply(list))
itemsets, rules = apriori(data, min_support=0.05, min_confidence=0.2)
itemsets
{1: {('bread',): 3097,
('cake',): 983,
('coffee',): 4528,
('cookies',): 515,
('hot chocolate',): 552,
('medialuna',): 585,
('pastry',): 815,
('sandwich',): 680,
('tea',): 1350},
2: {('bread', 'coffee'): 852, ('cake', 'coffee'): 518}}
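The itemsets reveal some of the customers' buying habits, such as items that are bought often and combinations that are often bought together. Association rules of varying strength are hidden in these habits.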
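The counts in the itemsets output are the number of baskets containing each itemset. The sketch below reproduces that idea on made-up baskets by brute-force counting. It is not how efficient_apriori works internally (real Apriori prunes candidates level by level), and count_itemsets is a hypothetical helper written for this illustration.

```python
from collections import Counter
from itertools import combinations

# Made-up baskets; each inner list is one transaction.
baskets = [
    ["bread", "coffee"],
    ["coffee", "cake"],
    ["bread"],
    ["coffee", "cake", "bread"],
]

def count_itemsets(baskets, size):
    """Count how many baskets contain each itemset of the given size."""
    counts = Counter()
    for basket in baskets:
        unique_items = sorted(set(basket))  # ignore repeats within a basket
        counts.update(combinations(unique_items, size))
    return counts

singles = count_itemsets(baskets, 1)
pairs = count_itemsets(baskets, 2)

# Keep the pairs whose support (share of baskets) reaches the threshold.
min_support = 0.5
frequent_pairs = {
    itemset: n for itemset, n in pairs.items()
    if n / len(baskets) >= min_support
}
print(frequent_pairs)
```

Here support is the count divided by the number of baskets, which is the quantity that min_support is compared against.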
rules
[{bread} -> {coffee}, {cake} -> {coffee}]
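min_support and min_confidence need to start at fairly small values, otherwise the output is empty. After repeated experiments, and going by the counts above, the confidence of {bread} -> {coffee} is roughly 0.28 (852/3097) and that of {cake} -> {coffee} is roughly 0.53 (518/983).
In other words, about 28% of the baskets that contain bread also contain coffee, and about 53% of the baskets that contain cake also contain coffee.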
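The confidence of a rule can be recomputed by hand from the counts in the itemsets output: confidence(A -> B) is the count of baskets containing both A and B divided by the count of baskets containing A. A short check, where the function name confidence is just an illustrative helper:

```python
# Counts copied from the itemsets output above.
bread = 3097
cake = 983
bread_coffee = 852
cake_coffee = 518

def confidence(pair_count: int, antecedent_count: int) -> float:
    """confidence(A -> B) = count(A and B) / count(A)."""
    return pair_count / antecedent_count

print(round(confidence(bread_coffee, bread), 3))  # prints 0.275
print(round(confidence(cake_coffee, cake), 3))    # prints 0.527
```

Both values sit above the min_confidence=0.2 used in the apriori call, which is consistent with these two rules being returned.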