Dimensionality Reduction Case Study (Instacart Market Basket Analysis)
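1. The problem
Segment users by their preferences across product categories, then reduce the dimensionality of those features.
2. Dataset description
Source: https://www.kaggle.com/c/instacart-market-basket-analysis/data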
aisles.csv
aisle_id,aisle
1,prepared soups salads
2,specialty cheeses
3,energy granola bars
…
departments.csv
department_id,department
1,frozen
2,other
3,bakery
…
order_products__*.csv
These files specify which products were purchased in each order. order_products__prior.csv contains previous order contents for all customers. ‘reordered’ indicates that the customer has a previous order that contains the product. Note that some orders will have no reordered items. You may predict an explicit ‘None’ value for orders with no reordered items. See the evaluation page for full details.
order_id,product_id,add_to_cart_order,reordered
1,49302,1,1
1,11109,2,1
1,10246,3,0
…
orders.csv
This file tells to which set (prior, train, test) an order belongs. You are predicting reordered items only for the test set orders. ‘order_dow’ is the day of week.
order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2539329,1,prior,1,2,08,
2398795,1,prior,2,3,07,15.0
473747,1,prior,3,3,12,21.0
…
products.csv
product_id,product_name,aisle_id,department_id
1,Chocolate Sandwich Cookies,61,19
2,All-Seasons Salt,104,13
3,Robust Golden Unsweetened Oolong Tea,94,7
…
sample_submission.csv
order_id,products
17,39276
34,39276
137,39276
…
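3. Problem analysis
We want the relationship between users and the product categories they buy. To solve this with machine learning, the data has to be in the usual machine-learning form: samples (one row per user) and feature values (one column per product category).
Product category | 1 | 2 | 3 | 4 |
---|---|---|---|---|
User 1 | ** | ** | ** | ** |
User 2 | ** | ** | ** | ** |
User 3 | ** | ** | ** | ** |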
Fields available in each file:
products.csv
Product information: product_id, product_name, aisle_id, department_id
order_products__prior.csv
Order and product information: order_id, product_id, add_to_cart_order, reordered
orders.csv
Each user's order information: order_id, user_id, eval_set, order_number, order_dow, order_hour_of_day, days_since_prior_order
aisles.csv
The specific category (aisle) each product belongs to: aisle_id, aisle
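The four files connect through shared key columns: order_products__prior.csv links to products.csv on product_id and to orders.csv on order_id, and products.csv links to aisles.csv on aisle_id. A minimal sketch of that join path on a few made-up rows (the values below are invented for illustration and do not come from the Kaggle files):

```python
import pandas as pd

# Toy rows, invented for illustration only (not taken from the Kaggle files).
prior = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [10, 11, 10]})
products = pd.DataFrame({"product_id": [10, 11], "aisle_id": [100, 101]})
orders = pd.DataFrame({"order_id": [1, 2], "user_id": [7, 8]})
aisles = pd.DataFrame({"aisle_id": [100, 101], "aisle": ["eggs", "tea"]})

# Follow the keys: order line -> product -> order (user) -> aisle name.
linked = (
    prior.merge(products, on="product_id")
         .merge(orders, on="order_id")
         .merge(aisles, on="aisle_id")
)

# The two columns needed for the user x category table.
user_aisle = linked[["user_id", "aisle"]]
print(user_aisle)
```

Every purchased item ends up as one row carrying both the user and the aisle name, which is the raw material for the user-by-category table.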
4. Merging the data
4.1 Merge all the tables into one
>>> import pandas as pd
>>> from sklearn.decomposition import PCA
>>> # Raw strings (r"...") stop the backslashes in Windows paths from being read as escape sequences
>>> prior=pd.read_csv(r"E:\PycharmProjects\ML_code\order_products__prior.csv")
>>> products=pd.read_csv(r"E:\PycharmProjects\ML_code\products.csv")
>>> orders=pd.read_csv(r"E:\PycharmProjects\ML_code\orders.csv")
>>> aisles=pd.read_csv(r"E:\PycharmProjects\ML_code\Aisles.csv")
>>> # Join order lines -> products -> orders -> aisles on their shared key columns
>>> _mg=pd.merge(prior,products,on='product_id')
>>> _mg=pd.merge(_mg,orders,on='order_id')
>>> mt=pd.merge(_mg,aisles,on='aisle_id')
>>> mt.head(10)
order_id product_id add_to_cart_order reordered ... order_dow order_hour_of_day days_since_prior_order aisle
0 2 33120 1 1 ... 5 9 8.0 eggs
1 26 33120 5 0 ... 0 16 7.0 eggs
2 120 33120 13 0 ... 6 8 10.0 eggs
3 327 33120 5 1 ... 6 9 8.0 eggs
4 390 33120 28 1 ... 0 12 9.0 eggs
5 537 33120 2 1 ... 2 8 3.0 eggs
6 582 33120 7 1 ... 2 19 10.0 eggs
7 608 33120 5 1 ... 3 21 12.0 eggs
8 623 33120 1 1 ... 3 12 3.0 eggs
9 689 33120 4 1 ... 1 13 3.0 eggs
[10 rows x 14 columns]
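4.2 Build a cross-tabulation
A cross-tabulation (pd.crosstab) is a special kind of grouping: it counts how many times each combination of values from two columns occurs. Here that is how many items each user bought from each aisle.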
>>> cross=pd.crosstab(mt['user_id'][:22000000],mt['aisle'])
>>> cross.head(10)
aisle baking ingredients canned jarred vegetables cereal crackers ... spreads tea water seltzer sparkling water yogurt
user_id ...
1 0 0 3 0 ... 1 0 0 1
2 2 0 0 11 ... 3 1 2 42
3 0 0 0 6 ... 4 1 2 0
4 0 0 0 0 ... 0 0 1 0
5 0 0 0 0 ... 0 0 0 3
6 0 2 0 0 ... 0 0 0 0
7 2 0 0 6 ... 0 0 0 5
8 1 1 0 0 ... 0 0 0 0
9 2 0 2 1 ... 0 0 2 19
10 0 0 0 0 ... 0 0 0 2
[10 rows x 39 columns]
With the full data there are actually [10 rows x 134 columns].
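To make the counting concrete, here is a small sketch on invented rows (not the Instacart data). It shows that pd.crosstab returns one row per user, one column per aisle, and the number of purchases in each cell, and that the same table can be built with an ordinary groupby:

```python
import pandas as pd

# Invented purchases: one row per item bought.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "aisle": ["eggs", "eggs", "tea", "tea", "yogurt"],
})

# Rows = users, columns = aisles, cells = number of items bought.
cross_toy = pd.crosstab(toy["user_id"], toy["aisle"])

# The same counts via groupby: count each (user, aisle) pair, then pivot
# the aisle level into columns and fill missing pairs with 0.
via_groupby = toy.groupby(["user_id", "aisle"]).size().unstack(fill_value=0)

print(cross_toy)
```

User 1 bought two items from the eggs aisle, so that cell holds 2; combinations that never occur are filled with 0.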
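4.3 MemoryError
I ran into a MemoryError during the computation, because my computer has too little memory: only 4 GB, which no longer keeps up with the times. Even so, the learning can't stop. I ordered a new memory stick, but it won't arrive right away, so I had to work out how to get the data loaded with what I have. I started in PyCharm, where the merge already ran out of memory. I looked things up and couldn't follow what I found... One blog post said the Anaconda interpreter could be used instead; I tried it and it did work (I also tried running from cmd, which failed).
Later, though, building the cross-tabulation still ran out of memory, so I took only the first 22,000,000 rows to see how it would turn out. Sigh, shedding the tears of poverty.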
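One possible way to lower the memory footprint is sketched below. It is not a guaranteed fix, and it has not been checked against the full files on a 4 GB machine. The idea is to read only the key columns with compact integer dtypes, look up user_id and aisle_id per purchased item with Series.map instead of building the wide merged table, and attach the aisle names only at the end. The helper name user_aisle_counts, the chosen dtypes, and the inline toy CSVs are assumptions made for illustration.

```python
import io
import pandas as pd

def user_aisle_counts(prior_src, products_src, orders_src, aisles_src):
    """Hypothetical helper: build the user x aisle count table from key columns only."""
    # Load only the columns needed for the joins, with small integer dtypes
    # (assumed to be wide enough for the ids in the files).
    prior = pd.read_csv(prior_src, usecols=["order_id", "product_id"],
                        dtype={"order_id": "int32", "product_id": "int32"})
    products = pd.read_csv(products_src, usecols=["product_id", "aisle_id"],
                           dtype={"product_id": "int32", "aisle_id": "int16"})
    orders = pd.read_csv(orders_src, usecols=["order_id", "user_id"],
                         dtype={"order_id": "int32", "user_id": "int32"})
    aisles = pd.read_csv(aisles_src)

    # Per purchased item, look up the user and the aisle id.
    user = prior["order_id"].map(orders.set_index("order_id")["user_id"])
    aisle_id = prior["product_id"].map(products.set_index("product_id")["aisle_id"])

    # Count items per (user, aisle id).
    counts = pd.crosstab(user, aisle_id, rownames=["user_id"], colnames=["aisle"])

    # Swap aisle ids for aisle names on the small result table.
    return counts.rename(columns=aisles.set_index("aisle_id")["aisle"].to_dict())

# Tiny in-memory stand-ins for the four CSV files (invented rows).
prior_csv = io.StringIO(
    "order_id,product_id,add_to_cart_order,reordered\n1,10,1,0\n1,11,2,0\n2,10,1,1\n")
products_csv = io.StringIO(
    "product_id,product_name,aisle_id,department_id\n10,Toy Eggs,100,1\n11,Toy Tea,101,2\n")
orders_csv = io.StringIO(
    "order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,"
    "days_since_prior_order\n1,7,prior,1,2,08,\n2,8,prior,1,3,07,15.0\n")
aisles_csv = io.StringIO("aisle_id,aisle\n100,eggs\n101,tea\n")

table = user_aisle_counts(prior_csv, products_csv, orders_csv, aisles_csv)
print(table)
```

With the real data, the four arguments would be the CSV file paths. Whether this fits in memory depends on the machine; if it does not, pd.read_csv also accepts a chunksize argument for reading a file in pieces.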
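4.4 Principal component analysis
>>> # A float n_components between 0 and 1 keeps just enough components to explain that share of the variance (here 90%)
>>> pca=PCA(n_components=0.9)
>>> data=pca.fit_transform(cross)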
>>> data
array([[-23.71768331, 2.17718661, -2.29373653, ..., -0.24994954,
6.86790629, -2.58754702],
[ 5.29780045, 36.83290254, 10.02744956, ..., 2.34996671,
4.00967108, -4.67927275],
[ -6.78286443, 3.66243888, -10.50058495, ..., -0.88282879,
6.81652741, 7.57907716],
...,
[ 7.76249421, 7.55456705, 7.58321022, ..., -5.59256767,
-3.47590013, -0.81346423],
[ 75.02309345, 13.63570317, 5.39483326, ..., -6.25728837,
-12.38297505, 19.29131442],
[-14.29193925, 6.79998731, -5.37130876, ..., -0.48382496,
5.78280479, 5.86108088]])
>>> data.shape
(205263, 9)
There were 39 features before; after PCA there are 9.
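To check what n_components=0.9 actually kept, the fitted PCA object exposes n_components_ (how many components were retained) and explained_variance_ratio_ (the share of variance each one explains). A sketch on synthetic data, used here as a random stand-in rather than the cross table above:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 samples, 10 features driven by 3 hidden factors
# plus a little noise, so a few components carry most of the variance.
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

# Keep the smallest number of components explaining at least 90% of the variance.
pca = PCA(n_components=0.9)
reduced = pca.fit_transform(X)

kept = pca.n_components_
explained = pca.explained_variance_ratio_.sum()
print("components kept:", kept)
print("variance explained:", round(float(explained), 3))
```

Applied to the fitted pca object from the session above, the same two attributes would report the 9 retained components and the total share of variance they cover.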