Feature engineering case study: merging tables, cross tabulation, and principal component analysis

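Goal: reduce feature dimensionality with principal component analysis (PCA)

Methods:

Table linking: user_id ----> aisle

Cross table: count, for each user, how many items were bought from each aisle (fine-grained product category)

Dimensionality reduction: principal component analysis (PCA)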

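Data source: https://www.kaggle.com/c/instacart-market-basket-analysis/data

·order_products__prior.csv: the products contained in each prior order
    。Fields: order_id, product_id, add_to_cart_order, reordered
    。Meaning: order ID, product ID, position in which the product was added to the cart, whether the user had ordered this product before (1 = yes, 0 = no)
·products.csv: product information
    。Fields: product_id, product_name, aisle_id, department_id
    。Meaning: product ID, product name, aisle (fine-grained category) ID, department (top-level category) ID
·orders.csv: the orders placed by each user
    。Fields: order_id, user_id, eval_set, order_number, order_dow, order_hour_of_day, days_since_prior_order
    。Meaning: order ID, user ID, evaluation set the order belongs to (prior, train, or test), sequence number of the order for that user, day of the week, hour of the day the order was placed, days since the user's previous order (NaN for a user's first order)
·aisles.csv: the aisle each product belongs to
    。Fields: aisle_id, aisle
    。Meaning: aisle ID, aisle name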
import numpy as np
import pandas as pd
from IPython.display import display  # already available in Jupyter; imported so the code also runs as a script

# Load the data
aisles = pd.read_csv(r"E:\instacart-market-basket-analysis\aisles.csv",sep=",",encoding="utf-8")
orders = pd.read_csv(r"E:\instacart-market-basket-analysis\orders.csv",sep=",",encoding="utf-8")
products = pd.read_csv(r"E:\instacart-market-basket-analysis\products.csv",sep=",",encoding="utf-8")
order_products_prior = pd.read_csv(r"E:\instacart-market-basket-analysis\order_products__prior.csv",sep=",",encoding="utf-8")

# Inspect the first rows of each table
display(aisles.head(3))
display(orders.head(3))
display(products.head(3))
display(order_products_prior.head(3))
   aisle_id                  aisle
0         1  prepared soups salads
1         2      specialty cheeses
2         3    energy granola bars

   order_id  user_id eval_set  order_number  order_dow  order_hour_of_day  days_since_prior_order
0   2539329        1    prior             1          2                  8                     NaN
1   2398795        1    prior             2          3                  7                    15.0
2    473747        1    prior             3          3                 12                    21.0

   product_id                          product_name  aisle_id  department_id
0           1            Chocolate Sandwich Cookies        61             19
1           2                      All-Seasons Salt       104             13
2           3  Robust Golden Unsweetened Oolong Tea        94              7

   order_id  product_id  add_to_cart_order  reordered
0         2       33120                  1          1
1         2       28985                  2          1
2         2        9327                  3          0

# Link the tables so that every purchased item carries both user_id and aisle
# orders -> order_products_prior (on order_id) -> products (on product_id) -> aisles (on aisle_id)
data01 = pd.merge(orders, order_products_prior, how="inner", on="order_id")
data02 = pd.merge(data01, products, on="product_id")
data03 = pd.merge(data02, aisles, on="aisle_id")
display(data03.shape, data03.tail(10000))
(32434489, 14)
order_id user_id eval_set order_number order_dow order_hour_of_day days_since_prior_order product_id add_to_cart_order reordered product_name aisle_id department_id aisle
32424489 2542240 75675 prior 12 5 12 5.0 44471 7 1 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424490 3260483 75675 prior 16 0 9 14.0 44471 21 1 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424491 2196407 75675 prior 30 0 11 12.0 44471 9 1 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424492 532672 75675 prior 38 5 13 7.0 44471 20 1 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424493 1705047 75675 prior 39 5 13 0.0 44471 20 1 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424494 998672 75675 prior 48 5 14 11.0 44471 13 1 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424495 2149746 75675 prior 49 6 9 8.0 44471 6 1 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424496 483804 75804 prior 12 6 15 4.0 44471 19 0 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424497 1783191 76027 prior 6 4 16 13.0 44471 13 0 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424498 3074202 76027 prior 7 2 15 5.0 44471 8 1 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424499 431155 76081 prior 8 0 14 16.0 44471 8 0 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424500 2879529 76238 prior 36 6 10 6.0 44471 25 0 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424501 1652877 76238 prior 39 5 10 6.0 44471 10 1 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424502 737972 76466 prior 20 0 10 7.0 44471 7 0 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424503 3154632 76556 prior 80 3 18 2.0 44471 7 0 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424504 1776861 76576 prior 7 0 15 7.0 44471 2 0 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424505 2695824 76726 prior 4 0 11 28.0 44471 26 0 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424506 3176388 76823 prior 1 6 12 NaN 44471 19 0 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424507 1441764 76866 prior 13 0 16 25.0 44471 7 0 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424508 2888446 76868 prior 17 5 10 16.0 44471 19 0 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424509 2670733 77148 prior 19 1 9 12.0 44471 24 0 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424510 2328300 77187 prior 1 1 9 NaN 44471 1 0 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424511 1923581 77229 prior 21 3 11 17.0 44471 2 0 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424512 2042750 77229 prior 24 0 14 12.0 44471 6 1 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424513 2685754 77238 prior 2 0 9 6.0 44471 5 0 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424514 1401197 77265 prior 6 1 5 9.0 44471 8 0 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424515 2917195 77265 prior 10 4 20 5.0 44471 4 1 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424516 1321674 77265 prior 31 0 10 11.0 44471 2 1 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424517 1268589 77265 prior 37 1 18 29.0 44471 7 1 Free & Clear Unscented Baby Wipes 82 18 baby accessories
32424518 3044303 77280 prior 23 4 23 1.0 44471 3 0 Free & Clear Unscented Baby Wipes 82 18 baby accessories
... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
32434459 814403 161964 prior 10 6 12 5.0 26478 20 0 Frozen Apple Juice 113 1 frozen juice
32434460 503516 175436 prior 4 5 16 13.0 26478 18 0 Frozen Apple Juice 113 1 frozen juice
32434461 385156 183189 prior 4 1 23 22.0 26478 2 0 Frozen Apple Juice 113 1 frozen juice
32434462 471382 85005 prior 7 5 0 13.0 24344 1 0 Frozen Concentrate Non-Alcoholic Pina Colada 113 1 frozen juice
32434463 1833016 92263 prior 5 2 13 8.0 24344 2 0 Frozen Concentrate Non-Alcoholic Pina Colada 113 1 frozen juice
32434464 2624885 136840 prior 2 6 10 4.0 24344 11 0 Frozen Concentrate Non-Alcoholic Pina Colada 113 1 frozen juice
32434465 1604793 136840 prior 6 5 10 3.0 24344 17 1 Frozen Concentrate Non-Alcoholic Pina Colada 113 1 frozen juice
32434466 3154099 136840 prior 16 2 16 3.0 24344 4 1 Frozen Concentrate Non-Alcoholic Pina Colada 113 1 frozen juice
32434467 3135581 151840 prior 70 0 9 1.0 24344 6 0 Frozen Concentrate Non-Alcoholic Pina Colada 113 1 frozen juice
32434468 3297537 181495 prior 2 1 14 15.0 24344 9 0 Frozen Concentrate Non-Alcoholic Pina Colada 113 1 frozen juice
32434469 823196 181495 prior 3 1 14 0.0 24344 1 1 Frozen Concentrate Non-Alcoholic Pina Colada 113 1 frozen juice
32434470 2471510 107801 prior 8 6 15 4.0 5500 19 0 Blended Juice Beverage, Mango Orange 113 1 frozen juice
32434471 2181814 135090 prior 5 3 14 10.0 5500 3 0 Blended Juice Beverage, Mango Orange 113 1 frozen juice
32434472 962734 167413 prior 1 1 12 NaN 5500 9 0 Blended Juice Beverage, Mango Orange 113 1 frozen juice
32434473 2928960 167413 prior 4 0 12 10.0 5500 3 1 Blended Juice Beverage, Mango Orange 113 1 frozen juice
32434474 1393242 167413 prior 5 0 12 7.0 5500 21 1 Blended Juice Beverage, Mango Orange 113 1 frozen juice
32434475 2601337 181750 prior 13 0 20 30.0 5500 2 0 Blended Juice Beverage, Mango Orange 113 1 frozen juice
32434476 2125702 109046 prior 3 3 16 8.0 2642 3 0 Frozen Concentrated Orange Juice With Added Ca... 113 1 frozen juice
32434477 2849065 138824 prior 1 6 13 NaN 2642 20 0 Frozen Concentrated Orange Juice With Added Ca... 113 1 frozen juice
32434478 2634996 138824 prior 6 0 16 28.0 2642 15 1 Frozen Concentrated Orange Juice With Added Ca... 113 1 frozen juice
32434479 1857751 181888 prior 2 0 7 10.0 2642 5 0 Frozen Concentrated Orange Juice With Added Ca... 113 1 frozen juice
32434480 2131276 181888 prior 7 1 11 8.0 2642 6 1 Frozen Concentrated Orange Juice With Added Ca... 113 1 frozen juice
32434481 1466142 181888 prior 9 3 14 16.0 2642 4 1 Frozen Concentrated Orange Juice With Added Ca... 113 1 frozen juice
32434482 1022794 204495 prior 48 0 9 5.0 2642 9 0 Frozen Concentrated Orange Juice With Added Ca... 113 1 frozen juice
32434483 3249444 204495 prior 50 6 14 4.0 2642 8 1 Frozen Concentrated Orange Juice With Added Ca... 113 1 frozen juice
32434484 2231925 204495 prior 51 1 15 9.0 2642 8 1 Frozen Concentrated Orange Juice With Added Ca... 113 1 frozen juice
32434485 327001 204495 prior 53 2 8 7.0 2642 1 1 Frozen Concentrated Orange Juice With Added Ca... 113 1 frozen juice
32434486 1997103 110030 prior 4 2 16 5.0 24189 8 0 Tropical Fruit Smoothie Tasty American Favorites 113 1 frozen juice
32434487 1362143 113181 prior 33 3 17 5.0 24189 12 0 Tropical Fruit Smoothie Tasty American Favorites 113 1 frozen juice
32434488 777464 179210 prior 7 5 15 20.0 24189 16 0 Tropical Fruit Smoothie Tasty American Favorites 113 1 frozen juice

10000 rows × 14 columns
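
As a rough sketch of the linking step above, the snippet below chains the same three merges on small made-up tables. The values and the variable names `order_products` and `linked` are illustrative and do not come from the dataset. The `validate` argument is an optional addition that the original code does not use; it asks pandas to raise an error if a key column that should be unique turns out to contain duplicates.

```python
import pandas as pd

# Toy stand-ins for the four Instacart tables (made-up values, for illustration only)
orders = pd.DataFrame({"order_id": [1, 2, 3], "user_id": [10, 10, 20]})
order_products = pd.DataFrame({"order_id": [1, 1, 2, 3],
                               "product_id": [100, 101, 100, 102]})
products = pd.DataFrame({"product_id": [100, 101, 102], "aisle_id": [5, 5, 6]})
aisles = pd.DataFrame({"aisle_id": [5, 6], "aisle": ["yogurt", "tea"]})

# Each merge joins on a single key column; validate checks the expected key uniqueness
linked = (
    orders
    .merge(order_products, on="order_id", how="inner", validate="one_to_many")
    .merge(products, on="product_id", how="inner", validate="many_to_one")
    .merge(aisles, on="aisle_id", how="inner", validate="many_to_one")
)

# One row per purchased item, now carrying both user_id and aisle
print(linked[["user_id", "aisle"]])
```

With an inner join, the result has one row per line of the order-products table whose keys are found in all the other tables, which is why data03 above has the same number of rows as the purchase records it was built from.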

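# Build the user_id x aisle cross table:
# one row per user, one column per aisle, each cell = number of items the user bought from that aisle
data04 = pd.crosstab(data03["user_id"], data03["aisle"])
display(data04.shape, data04.head(10))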
(206209, 134)
aisle air fresheners candles asian foods baby accessories baby bath body care baby food formula bakery desserts baking ingredients baking supplies decor beauty beers coolers ... spreads tea tofu meat alternatives tortillas flat bread trail mix snack mix trash bags liners vitamins supplements water seltzer sparkling water white wines yogurt
user_id
1 0 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 1
2 0 3 0 0 0 0 2 0 0 0 ... 3 1 1 0 0 0 0 2 0 42
3 0 0 0 0 0 0 0 0 0 0 ... 4 1 0 0 0 0 0 2 0 0
4 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 1 0 0 0 1 0 0
5 0 2 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 3
6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
7 0 0 0 0 0 0 2 0 0 0 ... 0 0 0 0 0 0 0 0 0 5
8 0 1 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
9 0 0 0 0 6 0 2 0 0 0 ... 0 0 0 0 0 0 0 2 0 19
10 0 1 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 2

10 rows × 134 columns
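
To make the counting behind the cross table concrete, here is a minimal sketch on a made-up four-row frame (the data and the names `linked`, `table`, and `same` are hypothetical). It also shows that, for plain counts, `pd.crosstab` gives the same numbers as grouping by both columns and pivoting the aisle level into columns.

```python
import pandas as pd

# Made-up purchase records: one row per purchased item
linked = pd.DataFrame({
    "user_id": [10, 10, 10, 20],
    "aisle": ["yogurt", "yogurt", "tea", "tea"],
})

# Cross table: rows are users, columns are aisles, cells are purchase counts
table = pd.crosstab(linked["user_id"], linked["aisle"])

# The same counts via groupby: count rows per (user, aisle) pair,
# then move the aisle level into columns and fill missing pairs with 0
same = linked.groupby(["user_id", "aisle"]).size().unstack(fill_value=0)

print(table)
print((table.values == same.values).all())
```

User 20 never bought yogurt, so that cell is filled with 0 rather than left missing. The real table is dominated by such zeros, since most users buy from only a small share of the 134 aisles.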

# PCA: keep enough components to explain 90% of the variance
from sklearn.decomposition import PCA

# 1. Data: the cross table data04 built above
data = data04

# 2. Instantiate the transformer
#    n_components: a float in (0, 1) is the fraction of variance to retain;
#                  an integer is the number of components to keep
transfer = PCA(n_components=0.9)

# 3. Fit the model and project the data onto the principal components
xi = transfer.fit_transform(data)
# Shape of the reduced data and the explained variance ratio of each component
print(xi.shape, transfer.explained_variance_ratio_)

# 4. Put the new principal components into a DataFrame with columns F1, F2, ...
Fi = ["F" + str(i) for i in range(1, xi.shape[1] + 1)]
data05 = pd.DataFrame(xi, columns=Fi)
display(data05.head(3))
(206209, 27) [0.48237998 0.09585824 0.05185877 0.03590181 0.0293466  0.02393094
 0.01899492 0.0183208  0.01487788 0.0134451  0.01121877 0.01102918
 0.01052171 0.00980307 0.00832174 0.00726185 0.00712991 0.00683061
 0.00640343 0.00580483 0.00534075 0.00487297 0.00477908 0.00462158
 0.00444346 0.00413755 0.00408034]
F1 F2 F3 F4 F5 F6 F7 F8 F9 F10 ... F18 F19 F20 F21 F22 F23 F24 F25 F26 F27
0 -24.215659 2.429427 -2.466370 -0.145686 0.269042 -1.432932 2.140677 -2.738031 -2.714316 -1.743135 ... -3.225987 -4.580076 0.777403 -3.699129 1.907214 2.995386 0.772923 0.686800 1.694394 -2.343230
1 6.463208 36.751116 8.382553 15.097530 -6.920938 -0.978375 6.011567 3.787725 -8.180749 -9.040861 ... -0.737606 -0.737402 0.740042 -0.091338 5.151285 -4.584815 -3.237894 4.121213 2.446897 -4.283485
2 -7.990302 2.404383 -11.030064 0.672230 -0.442368 -2.823272 -6.284140 6.512509 -2.148634 -1.585257 ... 5.434733 -3.604842 4.282794 -0.445834 3.039337 -1.469566 -2.946656 1.775345 -0.444194 0.786666

3 rows × 27 columns
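
The result above keeps 27 of the 134 columns because those 27 components together explain just over 90% of the variance. The sketch below illustrates how a fractional `n_components` maps to a component count, using synthetic data rather than the Instacart table. The array `X`, the seed, and the explicit `svd_solver="full"` setting are assumptions made for this illustration; the original code relies on the default solver.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 10 features with very different spreads
scales = np.array([10, 5, 3, 1, 1, 1, 0.5, 0.5, 0.1, 0.1])
X = rng.normal(size=(200, 10)) * scales

# Fit all components and accumulate their explained variance ratios
full = PCA(svd_solver="full").fit(X)
cumulative = np.cumsum(full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio exceeds 0.9
k = int(np.searchsorted(cumulative, 0.9, side="right") + 1)

# Let PCA choose the count itself from the 0.9 threshold
auto = PCA(n_components=0.9, svd_solver="full").fit(X)

print(k, auto.n_components_)
print(cumulative[:k])
```

Applied to the real data, the same cumulative sum over `transfer.explained_variance_ratio_` shows how much variance each additional component contributes, which helps when deciding whether a threshold other than 0.9 would be a better trade-off.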
