ML Algorithm Basics: Feature Engineering (Dimensionality Reduction Case Study)

Dimensionality reduction case study (Instacart Market Basket Analysis)

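1. The question

Segment users by their preferences across product categories (aisles), then reduce the dimensionality of the resulting feature matrix.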

2. Dataset description

Source: https://www.kaggle.com/c/instacart-market-basket-analysis/data

aisles.csv

aisle_id,aisle
1,prepared soups salads
2,specialty cheeses
3,energy granola bars

departments.csv

department_id,department
1,frozen
2,other
3,bakery

order_products__*.csv

These files specify which products were purchased in each order. order_products__prior.csv contains previous order contents for all customers. ‘reordered’ indicates that the customer has a previous order that contains the product. Note that some orders will have no reordered items. You may predict an explicit ‘None’ value for orders with no reordered items. See the evaluation page for full details.

order_id,product_id,add_to_cart_order,reordered
1,49302,1,1
1,11109,2,1
1,10246,3,0

orders.csv

This file tells to which set (prior, train, test) an order belongs. You are predicting reordered items only for the test set orders. ‘order_dow’ is the day of week.

order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2539329,1,prior,1,2,08,
2398795,1,prior,2,3,07,15.0
473747,1,prior,3,3,12,21.0

products.csv

product_id,product_name,aisle_id,department_id
1,Chocolate Sandwich Cookies,61,19
2,All-Seasons Salt,104,13
3,Robust Golden Unsweetened Oolong Tea,94,7

sample_submission.csv

order_id,products
17,39276
34,39276
137,39276

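3. Problem analysis

We want the relationship between users and the product categories they buy. To solve this with machine learning, the data has to be in the usual machine learning form: samples (rows) and feature values (columns).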

Product category    1    2    3    4
User 1             **   **   **   **
User 2             **   **   **   **
User 3             **   **   **   **

Where the feature values come from:
products.csv, product information:

product_id,product_name,aisle_id,department_id

order_products__prior.csv, orders and the products they contain:

order_id,product_id,add_to_cart_order,reordered

orders.csv, each user's orders:

order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order

aisles.csv, the specific category (aisle) each product belongs to:

aisle_id,aisle
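
To see how the four tables chain together through their key columns, here is a minimal sketch on tiny made-up frames. The values are invented for illustration; only the column names come from the dataset, and `toy` and `pairs` are hypothetical names.

```python
import pandas as pd

# Tiny invented tables that reuse the real column names.
orders = pd.DataFrame({"order_id": [10, 11, 12], "user_id": [1, 1, 2]})
prior = pd.DataFrame({"order_id": [10, 10, 11, 12],
                      "product_id": [100, 101, 100, 102]})
products = pd.DataFrame({"product_id": [100, 101, 102], "aisle_id": [1, 2, 2]})
aisles = pd.DataFrame({"aisle_id": [1, 2], "aisle": ["eggs", "tea"]})

# Follow the keys: order_id -> product_id -> aisle_id.
toy = (
    prior.merge(orders, on="order_id")
         .merge(products, on="product_id")
         .merge(aisles, on="aisle_id")
)

# Each row is now one purchased item, tagged with its user and its aisle.
pairs = toy[["user_id", "aisle"]]
print(pairs)
```

Every purchased item ends up as one (user_id, aisle) row, which is the raw material for the user-by-category table sketched above.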

4. Merging the data

4.1 Merge all the tables into one

>>> import pandas as pd
>>> from sklearn.decomposition import PCA
>>> # Raw strings (r"...") stop the backslashes in Windows paths from being read as escapes.
>>> prior=pd.read_csv(r"E:\PycharmProjects\ML_code\order_products__prior.csv")
>>> products=pd.read_csv(r"E:\PycharmProjects\ML_code\products.csv")
>>> orders=pd.read_csv(r"E:\PycharmProjects\ML_code\orders.csv")
>>> aisles=pd.read_csv(r"E:\PycharmProjects\ML_code\Aisles.csv")
>>> # Join on the shared key column; each key only needs to be named once.
>>> _mg=pd.merge(prior,products,on='product_id')
>>> _mg=pd.merge(_mg,orders,on='order_id')
>>> mt=pd.merge(_mg,aisles,on='aisle_id')
>>> mt.head(10)
   order_id  product_id  add_to_cart_order  reordered  ... order_dow  order_hour_of_day  days_since_prior_order  aisle
0         2       33120                  1          1  ...         5                  9                     8.0   eggs
1        26       33120                  5          0  ...         0                 16                     7.0   eggs
2       120       33120                 13          0  ...         6                  8                    10.0   eggs
3       327       33120                  5          1  ...         6                  9                     8.0   eggs
4       390       33120                 28          1  ...         0                 12                     9.0   eggs
5       537       33120                  2          1  ...         2                  8                     3.0   eggs
6       582       33120                  7          1  ...         2                 19                    10.0   eggs
7       608       33120                  5          1  ...         3                 21                    12.0   eggs
8       623       33120                  1          1  ...         3                 12                     3.0   eggs
9       689       33120                  4          1  ...         1                 13                     3.0   eggs

[10 rows x 14 columns]

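4.2 Build a cross-tabulation

A cross-tabulation (pd.crosstab) is a special kind of group-by: it counts how often each pair of values from two columns occurs together. Here that is the number of items each user bought from each aisle.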

>>> cross=pd.crosstab(mt['user_id'][:22000000],mt['aisle'])
>>> cross.head(10)
aisle    baking ingredients  canned jarred vegetables  cereal  crackers  ...  spreads  tea  water seltzer sparkling water  yogurt
user_id                                                                  ...
1                         0                         0       3         0  ...        1    0                              0       1
2                         2                         0       0        11  ...        3    1                              2      42
3                         0                         0       0         6  ...        4    1                              2       0
4                         0                         0       0         0  ...        0    0                              1       0
5                         0                         0       0         0  ...        0    0                              0       3
6                         0                         2       0         0  ...        0    0                              0       0
7                         2                         0       0         6  ...        0    0                              0       5
8                         1                         1       0         0  ...        0    0                              0       0
9                         2                         0       2         1  ...        0    0                              2      19
10                        0                         0       0         0  ...        0    0                              0       2

[10 rows x 39 columns]

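The full dataset has 134 aisles, so the complete table would be [10 rows x 134 columns]. Only 39 show up here because the cross-tabulation was built from just the first 22,000,000 merged rows (see 4.3).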
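
To make the "special group-by" idea concrete, here is a small sketch on invented data showing that `pd.crosstab` gives the same counts as grouping by both columns, counting, and unstacking. The frame `toy` and its values are made up for illustration.

```python
import pandas as pd

# Invented purchases: one row per item bought.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2, 3],
    "aisle": ["eggs", "eggs", "tea", "tea", "yogurt", "eggs"],
})

# Rows are users, columns are aisles, cells are purchase counts.
counts = pd.crosstab(toy["user_id"], toy["aisle"])

# The same table built by hand: group, count, pivot the aisle level to columns.
same = toy.groupby(["user_id", "aisle"]).size().unstack(fill_value=0)

print(counts)
```

Pairs that never occur together (for example user 3 and tea) are filled with 0, which is why the real table is mostly zeros.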

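4.3 MemoryError

The computation ran into a MemoryError because my computer has only 4 GB of RAM, which no longer keeps up with the times. Still, the learning can't stop. I ordered more memory, but it won't arrive right away, so I had to work out how to load the data with what I have. I started in PyCharm, where the merge already ran out of memory. I looked things up but couldn't make sense of what I found. One blog post said the Anaconda interpreter could help; I tried it and it did work (I also tried running from cmd, which failed).

Later on, though, building the cross-tabulation still ran out of memory, so I took only the first 22,000,000 rows to see the result. Alas, the tears of poverty.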
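
One common way to shrink the memory footprint, which is not what was done above, is to load only the columns the merge needs and give them narrower integer types. A minimal sketch, using the three sample rows of order_products__*.csv shown earlier in place of the real file, and assuming the ids fit in 32 bits:

```python
import io
import pandas as pd

# Stand-in for the real file: the sample rows from the dataset description.
csv_text = """order_id,product_id,add_to_cart_order,reordered
1,49302,1,1
1,11109,2,1
1,10246,3,0
"""

# Default load: every column, each stored as a 64-bit integer.
full = pd.read_csv(io.StringIO(csv_text))

# Slim load: only the join keys, stored as 32-bit integers (assumption: ids fit).
slim = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["order_id", "product_id"],
    dtype={"order_id": "int32", "product_id": "int32"},
)

print(full.memory_usage(index=False).sum())
print(slim.memory_usage(index=False).sum())
```

With the real files, the same `usecols` and `dtype` arguments would go into the `pd.read_csv` calls in 4.1. Whether that is enough to fit the full merge in 4 GB is untested here.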

4.4 Principal component analysis

>>> pca=PCA(n_components=0.9)
>>> data=pca.fit_transform(cross)
>>> data
array([[-23.71768331,   2.17718661,  -2.29373653, ...,  -0.24994954,
          6.86790629,  -2.58754702],
       [  5.29780045,  36.83290254,  10.02744956, ...,   2.34996671,
          4.00967108,  -4.67927275],
       [ -6.78286443,   3.66243888, -10.50058495, ...,  -0.88282879,
          6.81652741,   7.57907716],
       ...,
       [  7.76249421,   7.55456705,   7.58321022, ...,  -5.59256767,
         -3.47590013,  -0.81346423],
       [ 75.02309345,  13.63570317,   5.39483326, ...,  -6.25728837,
        -12.38297505,  19.29131442],
       [-14.29193925,   6.79998731,  -5.37130876, ...,  -0.48382496,
          5.78280479,   5.86108088]])
>>> data.shape
(205263, 9)

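There were 39 features before; now there are 9. Passing n_components=0.9 tells PCA to keep the smallest number of components that together retain at least 90% of the variance.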
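
To see what a fractional n_components does, here is a small sketch on synthetic data (not the Instacart table). The array `X` and the per-feature `scales` are invented so that a few directions carry most of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# 500 samples, 10 features; the first three have much larger spread.
scales = np.array([10, 8, 6, 1, 1, 1, 1, 1, 1, 1], dtype=float)
X = rng.normal(size=(500, 10)) * scales

# A float in (0, 1) means "keep enough components to reach this share of variance".
pca = PCA(n_components=0.9)
Z = pca.fit_transform(X)

kept = pca.n_components_                       # number of components actually kept
ratio = pca.explained_variance_ratio_.sum()    # share of variance they retain
print(Z.shape, kept, round(ratio, 3))
```

The fitted attribute `n_components_` reports how many components were kept, and the sum of `explained_variance_ratio_` confirms that the 90% threshold was met. The same two attributes can be inspected on the `pca` object fitted on `cross` above.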
