Dimensionality Reduction Case Study (Instacart Market Basket Analysis)
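1. The question
Segment users by their preferences across product categories, and reduce the dimensionality of that preference data.
2. Dataset description
Source: https://www.kaggle.com/c/instacart-market-basket-analysis/data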
aisles.csv
aisle_id,aisle
1,prepared soups salads
2,specialty cheeses
3,energy granola bars
…
departments.csv
department_id,department
1,frozen
2,other
3,bakery
…
order_products__*.csv
These files specify which products were purchased in each order. order_products__prior.csv contains previous order contents for all customers. ‘reordered’ indicates that the customer has a previous order that contains the product. Note that some orders will have no reordered items. You may predict an explicit ‘None’ value for orders with no reordered items. See the evaluation page for full details.
order_id,product_id,add_to_cart_order,reordered
1,49302,1,1
1,11109,2,1
1,10246,3,0
…
orders.csv
This file tells to which set (prior, train, test) an order belongs. You are predicting reordered items only for the test set orders. ‘order_dow’ is the day of week.
order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2539329,1,prior,1,2,08,
2398795,1,prior,2,3,07,15.0
473747,1,prior,3,3,12,21.0
…
products.csv
product_id,product_name,aisle_id,department_id
1,Chocolate Sandwich Cookies,61,19
2,All-Seasons Salt,104,13
3,Robust Golden Unsweetened Oolong Tea,94,7
…
sample_submission.csv
order_id,products
17,39276
34,39276
137,39276
…
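3. Problem analysis
We want the relationship between users and the categories of products they buy. To approach this with machine learning, the data has to be in the usual machine-learning form: samples (one row per user) and features (one column per product category).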
Product category | 1 | 2 | 3 | 4 |
---|---|---|---|---|
User 1 | ** | ** | ** | ** |
User 2 | ** | ** | ** | ** |
User 3 | ** | ** | ** | ** |
Columns in each table:
products.csv
Product information: product_id, product_name, aisle_id, department_id
order_products__prior.csv
Orders and the products they contain: order_id, product_id, add_to_cart_order, reordered
orders.csv
Order information per user: order_id, user_id, eval_set, order_number, order_dow, order_hour_of_day, days_since_prior_order
aisles.csv
The specific category (aisle) each product belongs to: aisle_id, aisle
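The key columns listed above chain the tables together: order_products__prior.csv links to products.csv through product_id and to orders.csv through order_id, and products.csv links to aisles.csv through aisle_id. A minimal sketch on made-up toy rows (the values are illustrative, not taken from the dataset) shows how following those keys attaches a user_id and an aisle name to every purchased item. The `validate="many_to_one"` argument is an optional pandas check that each key is unique in the right-hand table.

```python
import pandas as pd

# Toy stand-ins for the four tables; all values are invented for illustration.
prior = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [10, 11, 10]})
products = pd.DataFrame({"product_id": [10, 11], "aisle_id": [100, 101]})
orders = pd.DataFrame({"order_id": [1, 2], "user_id": [7, 8]})
aisles = pd.DataFrame({"aisle_id": [100, 101], "aisle": ["eggs", "tea"]})

# Follow the keys: product_id -> aisle_id -> aisle, and order_id -> user_id.
linked = (
    prior
    .merge(products, on="product_id", validate="many_to_one")
    .merge(orders, on="order_id", validate="many_to_one")
    .merge(aisles, on="aisle_id", validate="many_to_one")
)

print(linked[["user_id", "aisle"]])
```

If a key were duplicated on the right-hand side, the merge would raise an error instead of silently multiplying rows, which is a cheap sanity check before working with tens of millions of records.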
4. Merging the data
4.1 Merge all the tables into one
>>> import pandas as pd
>>> from sklearn.decomposition import PCA
>>> # Raw strings keep the backslashes in Windows paths from being read as escapes.
>>> prior = pd.read_csv(r"E:\PycharmProjects\ML_code\order_products__prior.csv")
>>> products = pd.read_csv(r"E:\PycharmProjects\ML_code\products.csv")
>>> orders = pd.read_csv(r"E:\PycharmProjects\ML_code\orders.csv")
>>> aisles = pd.read_csv(r"E:\PycharmProjects\ML_code\Aisles.csv")
>>> # Join on the shared key columns: product_id, then order_id, then aisle_id.
>>> _mg = pd.merge(prior, products, on='product_id')
>>> _mg = pd.merge(_mg, orders, on='order_id')
>>> mt = pd.merge(_mg, aisles, on='aisle_id')
>>> mt.head(10)
order_id product_id add_to_cart_order reordered ... order_dow order_hour_of_day days_since_prior_order aisle
0 2 33120 1 1 ... 5 9 8.0 eggs
1 26 33120 5 0 ... 0 16 7.0 eggs
2 120 33120 13 0 ... 6 8 10.0 eggs
3 327 33120 5 1 ... 6 9 8.0 eggs
4 390 33120 28 1 ... 0 12 9.0 eggs
5 537 33120 2 1 ... 2 8 3.0 eggs
6 582 33120 7 1 ... 2 19 10.0 eggs
7 608 33120 5 1 ... 3 21 12.0 eggs
8 623 33120 1 1 ... 3 12 3.0 eggs
9 689 33120 4 1 ... 1 13 3.0 eggs
[10 rows x 14 columns]
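4.2 Build a cross-tabulation
A cross-tabulation (pd.crosstab) is a special kind of grouping: it counts how often each combination of values from two columns occurs. Here the rows are users and the columns are aisles.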
>>> cross=pd.crosstab(mt['user_id'][:22000000],mt['aisle'])
>>> cross.head(10)
aisle baking ingredients canned jarred vegetables cereal crackers ... spreads tea water seltzer sparkling water yogurt
user_id ...
1 0 0 3 0 ... 1 0 0 1
2 2 0 0 11 ... 3 1 2 42
3 0 0 0 6 ... 4 1 2 0
4 0 0 0 0 ... 0 0 1 0
5 0 0 0 0 ... 0 0 0 3
6 0 2 0 0 ... 0 0 0 0
7 2 0 0 6 ... 0 0 0 5
8 1 1 0 0 ... 0 0 0 0
9 2 0 2 1 ... 0 0 2 19
10 0 0 0 0 ... 0 0 0 2
[10 rows x 39 columns]
With the full dataset the table actually has [10 rows x 134 columns].
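To make the cross-tabulation concrete, here is a small sketch on made-up data (the rows are illustrative, not from the dataset). Each cell counts how many times a user bought from an aisle, and the same table can be obtained by grouping on both columns and unstacking.

```python
import pandas as pd

# Invented purchase records: one row per purchased item.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "aisle": ["eggs", "eggs", "tea", "tea", "yogurt"],
})

# crosstab counts co-occurrences of user_id and aisle.
cross_toy = pd.crosstab(toy["user_id"], toy["aisle"])

# Equivalent formulation: group, count, and pivot the aisle level into columns.
by_group = toy.groupby(["user_id", "aisle"]).size().unstack(fill_value=0)

print(cross_toy)
```

User 1 bought from the eggs aisle twice and the tea aisle once, so that row holds 2, 1, 0. Aisles a user never bought from are filled with 0.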
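4.3 MemoryError
The computation ran into a MemoryError because my computer has too little memory: only 4 GB, which no longer keeps up with the times. Still, the learning can't stop. I ordered more RAM, but it won't arrive right away, so I had to work out how to load the data with what I have. I started out in PyCharm, where the merge alone ran out of memory. I looked things up and couldn't follow what I found... One blog post said the Anaconda interpreter could be used instead; I tried it and it did work (I also tried running from cmd, which failed).
Later, though, building the cross-tabulation still ran out of memory, so I took only the first 22,000,000 rows to see how it would turn out. Sigh, tears of poverty.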
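One way to reduce memory pressure on a small machine, sketched here as a suggestion rather than what was done above, is to load only the columns the user-by-aisle table needs and to give them compact integer dtypes. `usecols` and `dtype` are standard `pd.read_csv` parameters. The in-memory CSV strings below are tiny invented stand-ins for the real files, and the dtype choices assume the IDs fit in 32 bits (16 bits for aisle_id).

```python
import io
import pandas as pd

# Tiny in-memory stand-ins for the real CSV files (values are invented).
prior_csv = io.StringIO(
    "order_id,product_id,add_to_cart_order,reordered\n1,10,1,1\n1,11,2,0\n2,10,1,1\n"
)
products_csv = io.StringIO(
    "product_id,product_name,aisle_id,department_id\n10,Eggs,100,1\n11,Green Tea,101,2\n"
)
orders_csv = io.StringIO(
    "order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order\n"
    "1,7,prior,1,2,08,\n2,8,prior,1,3,07,\n"
)

# Read only the key columns, with compact integer dtypes.
prior = pd.read_csv(prior_csv, usecols=["order_id", "product_id"], dtype="int32")
products = pd.read_csv(products_csv, usecols=["product_id", "aisle_id"],
                       dtype={"product_id": "int32", "aisle_id": "int16"})
orders = pd.read_csv(orders_csv, usecols=["order_id", "user_id"], dtype="int32")

# Merge the slim tables, then count purchases per (user, aisle) pair.
slim = prior.merge(products, on="product_id").merge(orders, on="order_id")
counts = slim.groupby(["user_id", "aisle_id"]).size().unstack(fill_value=0)

print(counts)
```

The aisle names can be attached afterwards from the small aisles.csv table, so the text column never has to be repeated across millions of merged rows. Whether this is enough to fit the full data into 4 GB is not something this sketch verifies.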
4.4 Principal component analysis
>>> pca=PCA(n_components=0.9)
>>> data=pca.fit_transform(cross)
>>> data
array([[-23.71768331, 2.17718661, -2.29373653, ..., -0.24994954,
6.86790629, -2.58754702],
[ 5.29780045, 36.83290254, 10.02744956, ..., 2.34996671,
4.00967108, -4.67927275],
[ -6.78286443, 3.66243888, -10.50058495, ..., -0.88282879,
6.81652741, 7.57907716],
...,
[ 7.76249421, 7.55456705, 7.58321022, ..., -5.59256767,
-3.47590013, -0.81346423],
[ 75.02309345, 13.63570317, 5.39483326, ..., -6.25728837,
-12.38297505, 19.29131442],
[-14.29193925, 6.79998731, -5.37130876, ..., -0.48382496,
5.78280479, 5.86108088]])
>>> data.shape
(205263, 9)
There were 39 features before; now there are 9.
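Passing a float between 0 and 1 as `n_components` asks scikit-learn's PCA to keep the smallest number of components whose cumulative explained variance exceeds that fraction. A small sketch on synthetic data (random numbers, not the Instacart table) shows how to check this through `explained_variance_ratio_` and `n_components_`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 200 samples, 10 features driven mostly by 3 hidden factors.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = factors @ mixing + 0.1 * rng.normal(size=(200, 10))

# Keep as many components as needed to explain at least 90% of the variance.
pca = PCA(n_components=0.9)
reduced = pca.fit_transform(X)

print(reduced.shape)                        # (200, number of kept components)
print(pca.n_components_)                    # how many components were kept
print(pca.explained_variance_ratio_.sum())  # cumulative share of variance kept
```

Running the same two attribute checks on the `pca` object fitted to `cross` would show how much of the variance the 9 retained components account for.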