Understanding Pandas Transform

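Pandas has a rich set of features, and transform is one that is used together with groupby (one of the most useful operations in pandas). After a groupby we usually reach for aggregate, filter, or apply to summarize the data; transform can be a little harder to understand.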

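An aggregation returns a reduced version of the data, whereas a transformation returns a transformed version of the full data that we can recombine with the original. The output of such a transformation has the same shape as the input. A common example is centering the data by subtracting the group mean.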
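As a minimal sketch of the centering example just mentioned, the snippet below subtracts each group's mean from its rows. The `toy` DataFrame and its column names are made up for illustration and are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
})

# transform("mean") returns one value per input row: the mean of that row's group
group_mean = toy.groupby("group")["value"].transform("mean")

# Because the result has the same index as toy, it can be subtracted directly
toy["centered"] = toy["value"] - group_mean
print(toy)
```

The key point is that `group_mean` has four rows, not two, so no merge is needed before the subtraction.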

Practice

Loading the data

import pandas as pd

data = {'account':[383080,383080,383080,412290,412290,412290,412290,412290,218895,218895,218895,218895],
       "name":['Will LLC','Will LLC','Will LLC','Jerde-Hilpert','Jerde-Hilpert','Jerde-Hilpert','Jerde-Hilpert','Jerde-Hilpert','Kulans Inc','Kulans Inc','Kulans Inc','Kulans Inc'],
       "order":[1001,1001,1001,1005,1005,1005,1005,1005,1006,1006,1006,1006],
       'sku':['B1-20000','B1-27722','B1-86481','S1-06532','S1-82801','S1-06532','S1-47412','S1-27722','S1-27722','B1-33087','B1-33364','B1-20000'],
       "quantity":[7,11,3,48,21,9,44,36,32,23,3,-1],
       'unit price':[33.69,21.12,35.99,55.82,13.62,92.55,78.91,25.42,95.66,22.55,72.3,72.18],
       "ext price":[235.83,232.32,107.97,2679.36,286.02,832.95,3472.04,915.12,3061.12,518.65,216.90,-72.18]}

df = pd.DataFrame(data)
df
    account           name  order       sku  quantity  unit price  ext price
0    383080       Will LLC   1001  B1-20000         7       33.69     235.83
1    383080       Will LLC   1001  B1-27722        11       21.12     232.32
2    383080       Will LLC   1001  B1-86481         3       35.99     107.97
3    412290  Jerde-Hilpert   1005  S1-06532        48       55.82    2679.36
4    412290  Jerde-Hilpert   1005  S1-82801        21       13.62     286.02
5    412290  Jerde-Hilpert   1005  S1-06532         9       92.55     832.95
6    412290  Jerde-Hilpert   1005  S1-47412        44       78.91    3472.04
7    412290  Jerde-Hilpert   1005  S1-27722        36       25.42     915.12
8    218895     Kulans Inc   1006  S1-27722        32       95.66    3061.12
9    218895     Kulans Inc   1006  B1-33087        23       22.55     518.65
10   218895     Kulans Inc   1006  B1-33364         3       72.30     216.90
11   218895     Kulans Inc   1006  B1-20000        -1       72.18     -72.18

As you can see, the data contains several orders (order) and, for each item in an order, its quantity (quantity), unit price (unit price), and extended price (ext price).

Our task is to add a column to the table that shows each item's share of the total price of its order.

First approach, step by step:

df.groupby(by='order')['ext price'].agg('sum')
order
1001     576.12
1005    8185.49
1006    3724.49
Name: ext price, dtype: float64
df.groupby(by='order')['ext price'].agg('sum').rename('order_total')
order
1001     576.12
1005    8185.49
1006    3724.49
Name: order_total, dtype: float64
order_total = df.groupby(by='order')['ext price'].agg('sum').rename('order_total').reset_index()
order_total
   order  order_total
0   1001       576.12
1   1005      8185.49
2   1006      3724.49
df_1 = df.merge(order_total, on='order')
df_1
    account           name  order       sku  quantity  unit price  ext price  order_total
0    383080       Will LLC   1001  B1-20000         7       33.69     235.83       576.12
1    383080       Will LLC   1001  B1-27722        11       21.12     232.32       576.12
2    383080       Will LLC   1001  B1-86481         3       35.99     107.97       576.12
3    412290  Jerde-Hilpert   1005  S1-06532        48       55.82    2679.36      8185.49
4    412290  Jerde-Hilpert   1005  S1-82801        21       13.62     286.02      8185.49
5    412290  Jerde-Hilpert   1005  S1-06532         9       92.55     832.95      8185.49
6    412290  Jerde-Hilpert   1005  S1-47412        44       78.91    3472.04      8185.49
7    412290  Jerde-Hilpert   1005  S1-27722        36       25.42     915.12      8185.49
8    218895     Kulans Inc   1006  S1-27722        32       95.66    3061.12      3724.49
9    218895     Kulans Inc   1006  B1-33087        23       22.55     518.65      3724.49
10   218895     Kulans Inc   1006  B1-33364         3       72.30     216.90      3724.49
11   218895     Kulans Inc   1006  B1-20000        -1       72.18     -72.18      3724.49
df_1['percent_of_order'] = df_1['ext price']/df_1['order_total']
df_1
    account           name  order       sku  quantity  unit price  ext price  order_total  percent_of_order
0    383080       Will LLC   1001  B1-20000         7       33.69     235.83       576.12          0.409342
1    383080       Will LLC   1001  B1-27722        11       21.12     232.32       576.12          0.403249
2    383080       Will LLC   1001  B1-86481         3       35.99     107.97       576.12          0.187409
3    412290  Jerde-Hilpert   1005  S1-06532        48       55.82    2679.36      8185.49          0.327330
4    412290  Jerde-Hilpert   1005  S1-82801        21       13.62     286.02      8185.49          0.034942
5    412290  Jerde-Hilpert   1005  S1-06532         9       92.55     832.95      8185.49          0.101759
6    412290  Jerde-Hilpert   1005  S1-47412        44       78.91    3472.04      8185.49          0.424170
7    412290  Jerde-Hilpert   1005  S1-27722        36       25.42     915.12      8185.49          0.111798
8    218895     Kulans Inc   1006  S1-27722        32       95.66    3061.12      3724.49          0.821890
9    218895     Kulans Inc   1006  B1-33087        23       22.55     518.65      3724.49          0.139254
10   218895     Kulans Inc   1006  B1-33364         3       72.30     216.90      3724.49          0.058236
11   218895     Kulans Inc   1006  B1-20000        -1       72.18     -72.18      3724.49         -0.019380

Second approach, step by step (transform):

df.groupby(by='order')['ext price'].transform('sum')
0      576.12
1      576.12
2      576.12
3     8185.49
4     8185.49
5     8185.49
6     8185.49
7     8185.49
8     3724.49
9     3724.49
10    3724.49
11    3724.49
Name: ext price, dtype: float64

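Instead of showing only one entry for each of the 3 orders, the result keeps the same number of entries as the original dataset, which makes it easy to carry on from here. This is what makes transform unique.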
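To make the difference in shape concrete, here is a small sketch that compares agg and transform on the same grouping. The `orders` DataFrame is a hypothetical three-row subset made up for illustration, not the full `df` above.

```python
import pandas as pd

# Hypothetical three-row subset, for illustration only
orders = pd.DataFrame({
    "order": [1001, 1001, 1005],
    "ext price": [235.83, 232.32, 2679.36],
})

grouped = orders.groupby("order")["ext price"]
aggregated = grouped.agg("sum")          # one row per order, indexed by order
transformed = grouped.transform("sum")   # one row per input row, original index

print(aggregated.shape)    # one entry per distinct order
print(transformed.shape)   # one entry per row of orders
print(transformed.index.equals(orders.index))
```

Because the transformed Series shares the index of `orders`, it can be assigned straight into a new column without a merge.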

df['order_total'] = df.groupby(by='order')['ext price'].transform('sum')
df['percent_of_order'] = df['ext price']/df['order_total']
df
    account           name  order       sku  quantity  unit price  ext price  order_total  percent_of_order
0    383080       Will LLC   1001  B1-20000         7       33.69     235.83       576.12          0.409342
1    383080       Will LLC   1001  B1-27722        11       21.12     232.32       576.12          0.403249
2    383080       Will LLC   1001  B1-86481         3       35.99     107.97       576.12          0.187409
3    412290  Jerde-Hilpert   1005  S1-06532        48       55.82    2679.36      8185.49          0.327330
4    412290  Jerde-Hilpert   1005  S1-82801        21       13.62     286.02      8185.49          0.034942
5    412290  Jerde-Hilpert   1005  S1-06532         9       92.55     832.95      8185.49          0.101759
6    412290  Jerde-Hilpert   1005  S1-47412        44       78.91    3472.04      8185.49          0.424170
7    412290  Jerde-Hilpert   1005  S1-27722        36       25.42     915.12      8185.49          0.111798
8    218895     Kulans Inc   1006  S1-27722        32       95.66    3061.12      3724.49          0.821890
9    218895     Kulans Inc   1006  B1-33087        23       22.55     518.65      3724.49          0.139254
10   218895     Kulans Inc   1006  B1-33364         3       72.30     216.90      3724.49          0.058236
11   218895     Kulans Inc   1006  B1-20000        -1       72.18     -72.18      3724.49         -0.019380

It can even be done in a single step:

df['percent_of_order'] = df['ext price']/df.groupby(by='order')['ext price'].transform('sum')
df
    account           name  order       sku  quantity  unit price  ext price  order_total  percent_of_order
0    383080       Will LLC   1001  B1-20000         7       33.69     235.83       576.12          0.409342
1    383080       Will LLC   1001  B1-27722        11       21.12     232.32       576.12          0.403249
2    383080       Will LLC   1001  B1-86481         3       35.99     107.97       576.12          0.187409
3    412290  Jerde-Hilpert   1005  S1-06532        48       55.82    2679.36      8185.49          0.327330
4    412290  Jerde-Hilpert   1005  S1-82801        21       13.62     286.02      8185.49          0.034942
5    412290  Jerde-Hilpert   1005  S1-06532         9       92.55     832.95      8185.49          0.101759
6    412290  Jerde-Hilpert   1005  S1-47412        44       78.91    3472.04      8185.49          0.424170
7    412290  Jerde-Hilpert   1005  S1-27722        36       25.42     915.12      8185.49          0.111798
8    218895     Kulans Inc   1006  S1-27722        32       95.66    3061.12      3724.49          0.821890
9    218895     Kulans Inc   1006  B1-33087        23       22.55     518.65      3724.49          0.139254
10   218895     Kulans Inc   1006  B1-33364         3       72.30     216.90      3724.49          0.058236
11   218895     Kulans Inc   1006  B1-20000        -1       72.18     -72.18      3724.49         -0.019380
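
One quick sanity check for a share-of-order column is that the shares within each order add up to 1. The sketch below assumes a small hypothetical `sample` DataFrame holding a few of the rows above; it is only an illustration, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

# Hypothetical subset of the rows above, for illustration only
sample = pd.DataFrame({
    "order": [1001, 1001, 1001, 1006, 1006],
    "ext price": [235.83, 232.32, 107.97, 518.65, -72.18],
})

# Same one-step pattern: each row divided by the total of its order
share = sample["ext price"] / sample.groupby("order")["ext price"].transform("sum")

# Summing the shares per order should give 1 for every order (up to rounding)
totals = share.groupby(sample["order"]).sum()
print(totals)
assert np.allclose(totals, 1.0)
```

Note that a negative line item, such as a return, produces a negative share, while the shares of its order still sum to 1.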