ML Algorithm Basics: Feature Engineering (Dimensionality Reduction Case Study)

Dimensionality reduction case study (Instacart Market Basket Analysis)

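1. The question

Segment users by their preferences across product categories, then reduce the dimensionality of the resulting features.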

2. Dataset description

Source: https://www.kaggle.com/c/instacart-market-basket-analysis/data

aisles.csv

aisle_id,aisle
1,prepared soups salads
2,specialty cheeses
3,energy granola bars

departments.csv

department_id,department
1,frozen
2,other
3,bakery

order_products__*.csv

These files specify which products were purchased in each order. order_products__prior.csv contains previous order contents for all customers. ‘reordered’ indicates that the customer has a previous order that contains the product. Note that some orders will have no reordered items. You may predict an explicit ‘None’ value for orders with no reordered items. See the evaluation page for full details.

order_id,product_id,add_to_cart_order,reordered
1,49302,1,1
1,11109,2,1
1,10246,3,0

orders.csv

This file tells to which set (prior, train, test) an order belongs. You are predicting reordered items only for the test set orders. ‘order_dow’ is the day of week.

order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order
2539329,1,prior,1,2,08,
2398795,1,prior,2,3,07,15.0
473747,1,prior,3,3,12,21.0

products.csv

product_id,product_name,aisle_id,department_id
1,Chocolate Sandwich Cookies,61,19
2,All-Seasons Salt,104,13
3,Robust Golden Unsweetened Oolong Tea,94,7

sample_submission.csv

order_id,products
17,39276
34,39276
137,39276

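3. Problem analysis

We want the relationship between users and the product categories they buy. To tackle this with machine learning, the data has to be in the usual machine-learning layout: samples (rows) and feature values (columns).

Product category   1    2    3    4
User 1             **   **   **   **
User 2             **   **   **   **
User 3             **   **   **   **

Tables that supply the features:
products.csv (product information):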

product_id,product_name,aisle_id,department_id

order_products__prior.csv (orders and the products in them):

order_id,product_id,add_to_cart_order,reordered

orders.csv (each user's orders):

order_id,user_id,eval_set,order_number,order_dow,order_hour_of_day,days_since_prior_order

aisles.csv (the specific category, i.e. the aisle, each product belongs to):

aisle_id,aisle
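
The four tables link up through shared keys: product_id, then order_id, then aisle_id. As a minimal sketch of that chain, the snippet below uses tiny made-up tables (the values are invented for illustration and are not from the dataset) and joins them until each purchased item carries a user_id and an aisle name.

```python
import pandas as pd

# Tiny made-up tables that mimic the column layout of the real CSV files.
prior = pd.DataFrame({"order_id": [1, 1, 2], "product_id": [10, 11, 10]})
products = pd.DataFrame({"product_id": [10, 11], "aisle_id": [100, 101]})
orders = pd.DataFrame({"order_id": [1, 2], "user_id": [7, 8]})
aisles = pd.DataFrame({"aisle_id": [100, 101], "aisle": ["eggs", "tea"]})

# Follow the key chain: order line -> product -> order (user) -> aisle.
merged = (
    prior.merge(products, on="product_id")
         .merge(orders, on="order_id")
         .merge(aisles, on="aisle_id")
)

# Each row is now one purchased item with its user and its aisle.
user_aisle = merged[["user_id", "aisle"]]
print(user_aisle)
```

Every row of the order-lines table survives the joins here because each key has a match in the table it is joined to.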

4. Merging the data

4.1 Merge all the tables into a single table

>>> import pandas as pd
>>> from sklearn.decomposition import PCA
>>> # Raw strings keep the backslashes in Windows paths from being read as escapes
>>> prior=pd.read_csv(r"E:\PycharmProjects\ML_code\order_products__prior.csv")
>>> products=pd.read_csv(r"E:\PycharmProjects\ML_code\products.csv")
>>> orders=pd.read_csv(r"E:\PycharmProjects\ML_code\orders.csv")
>>> aisles=pd.read_csv(r"E:\PycharmProjects\ML_code\aisles.csv")
>>> # Join order lines -> products -> orders (user_id) -> aisles (aisle name)
>>> _mg=pd.merge(prior,products,on='product_id')
>>> _mg=pd.merge(_mg,orders,on='order_id')
>>> mt=pd.merge(_mg,aisles,on='aisle_id')
>>> mt.head(10)
   order_id  product_id  add_to_cart_order  reordered  ... order_dow  order_hour_of_day  days_since_prior_order  aisle
0         2       33120                  1          1  ...         5                  9                     8.0   eggs
1        26       33120                  5          0  ...         0                 16                     7.0   eggs
2       120       33120                 13          0  ...         6                  8                    10.0   eggs
3       327       33120                  5          1  ...         6                  9                     8.0   eggs
4       390       33120                 28          1  ...         0                 12                     9.0   eggs
5       537       33120                  2          1  ...         2                  8                     3.0   eggs
6       582       33120                  7          1  ...         2                 19                    10.0   eggs
7       608       33120                  5          1  ...         3                 21                    12.0   eggs
8       623       33120                  1          1  ...         3                 12                     3.0   eggs
9       689       33120                  4          1  ...         1                 13                     3.0   eggs

[10 rows x 14 columns]

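4.2 Build a cross-tabulation

A cross-tabulation (pd.crosstab) is a special kind of grouping: it counts how often each combination of row value and column value occurs.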

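>>> # Use only the first 22,000,000 rows of the merged table to stay within memory
>>> subset=mt.iloc[:22000000]
>>> cross=pd.crosstab(subset['user_id'],subset['aisle'])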
>>> cross.head(10)
aisle    baking ingredients  canned jarred vegetables  cereal  crackers  ...  spreads  tea  water seltzer sparkling water  yogurt
user_id                                                                  ...
1                         0                         0       3         0  ...        1    0                              0       1
2                         2                         0       0        11  ...        3    1                              2      42
3                         0                         0       0         6  ...        4    1                              2       0
4                         0                         0       0         0  ...        0    0                              1       0
5                         0                         0       0         0  ...        0    0                              0       3
6                         0                         2       0         0  ...        0    0                              0       0
7                         2                         0       0         6  ...        0    0                              0       5
8                         1                         1       0         0  ...        0    0                              0       0
9                         2                         0       2         1  ...        0    0                              2      19
10                        0                         0       0         0  ...        0    0                              0       2

[10 rows x 39 columns]

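With all rows included, the table actually has 134 columns ([10 rows x 134 columns]), one per aisle. The truncated version above has only 39.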
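
To make the shape of the cross-tabulation concrete, here is a small sketch on invented data (the user IDs and aisle names are made up). Each cell counts how many items a user bought from an aisle, and combinations that never occur are filled with 0.

```python
import pandas as pd

# Invented purchases: one row per purchased item.
toy = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "aisle": ["eggs", "eggs", "tea", "tea", "yogurt"],
})

# Rows are users, columns are aisles, cells are purchase counts.
counts = pd.crosstab(toy["user_id"], toy["aisle"])
print(counts)
```

The result has one row per user and one column per aisle that appears in the data, which is why truncating the input can also reduce the number of columns.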

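4.3 MemoryError

The computation hit a MemoryError because my computer has only 4 GB of RAM, which no longer keeps up with the times. Still, learning can't stop. I ordered a memory stick, but it won't arrive right away, so I had to work out how to get the data loaded with what I have. I started in PyCharm, where merging the tables ran out of memory. I searched for answers but couldn't make sense of what I found... One blog post suggested using the Anaconda interpreter. I tried it and it did work (I also tried running from cmd, which failed).

Later, though, building the cross-tabulation ran out of memory anyway, so I took just the first 22,000,000 rows to see what the result looks like. Sigh, tears of poverty.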
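
One possible way to reduce memory use, which is not what this post did, is to load only the key columns with compact dtypes and to look up the user and aisle for each order line instead of building one wide merged table. The sketch below is an assumption-laden illustration: the in-memory CSV strings stand in for the real files, the values are invented, and it assumes every ID fits in a 32-bit integer (16-bit for aisle_id). Whether this is enough on a 4 GB machine has not been tested here.

```python
import io
import pandas as pd

# Stand-ins for the real files; with real data, pass the file paths instead.
prior_csv = io.StringIO(
    "order_id,product_id,add_to_cart_order,reordered\n"
    "1,10,1,1\n1,11,2,0\n2,10,1,0\n"
)
products_csv = io.StringIO(
    "product_id,product_name,aisle_id,department_id\n"
    "10,Eggs,100,1\n11,Tea,101,2\n"
)
orders_csv = io.StringIO(
    "order_id,user_id,eval_set,order_number,order_dow,"
    "order_hour_of_day,days_since_prior_order\n"
    "1,7,prior,1,2,08,\n2,8,prior,1,3,07,\n"
)

# Load only the key columns, with compact integer dtypes (assumed to fit).
prior = pd.read_csv(prior_csv, usecols=["order_id", "product_id"],
                    dtype={"order_id": "int32", "product_id": "int32"})
products = pd.read_csv(products_csv, usecols=["product_id", "aisle_id"],
                       dtype={"product_id": "int32", "aisle_id": "int16"})
orders = pd.read_csv(orders_csv, usecols=["order_id", "user_id"],
                     dtype={"order_id": "int32", "user_id": "int32"})

# Look up user and aisle per order line instead of merging wide tables.
user_of_order = orders.set_index("order_id")["user_id"]
aisle_of_product = products.set_index("product_id")["aisle_id"]
slim = pd.DataFrame({
    "user_id": prior["order_id"].map(user_of_order),
    "aisle_id": prior["product_id"].map(aisle_of_product),
})

# Same user-by-aisle counts, with aisle_id as the column label.
cross_small = pd.crosstab(slim["user_id"], slim["aisle_id"])
print(cross_small)
```

The columns here are aisle IDs rather than aisle names. Since aisles.csv is small, the names can be attached to the finished table afterwards.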

4.4 Principal component analysis

>>> pca=PCA(n_components=0.9)
>>> data=pca.fit_transform(cross)
>>> data
array([[-23.71768331,   2.17718661,  -2.29373653, ...,  -0.24994954,
          6.86790629,  -2.58754702],
       [  5.29780045,  36.83290254,  10.02744956, ...,   2.34996671,
          4.00967108,  -4.67927275],
       [ -6.78286443,   3.66243888, -10.50058495, ...,  -0.88282879,
          6.81652741,   7.57907716],
       ...,
       [  7.76249421,   7.55456705,   7.58321022, ...,  -5.59256767,
         -3.47590013,  -0.81346423],
       [ 75.02309345,  13.63570317,   5.39483326, ...,  -6.25728837,
        -12.38297505,  19.29131442],
       [-14.29193925,   6.79998731,  -5.37130876, ...,  -0.48382496,
          5.78280479,   5.86108088]])
>>> data.shape
(205263, 9)

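With n_components=0.9, PCA keeps the smallest number of components that together explain at least 90% of the variance. There were 39 features before; now there are 9.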
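
To see what a fractional n_components does, the sketch below runs PCA on synthetic data (randomly generated here, not the Instacart table) and checks the cumulative explained variance of the components that were kept. The data is built from 3 hidden factors plus a little noise, so only a few components should be needed.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 10 features driven by 3 latent factors plus small noise.
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

# A float between 0 and 1 is read as the fraction of variance to retain.
pca = PCA(n_components=0.9)
reduced = pca.fit_transform(X)

# Cumulative share of variance explained by the kept components.
cumulative = np.cumsum(pca.explained_variance_ratio_)
print(reduced.shape, cumulative)
```

Note that PCA centers the features but does not rescale them, so features with large counts weigh more heavily in the components.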
