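Pandas has a rich feature set, and transform is one of the features designed to be used together with groupby (one of the most useful operations in pandas). After a groupby we usually reach for aggregate, filter, or apply to summarize the data; transform can be a little harder to understand.
An aggregation returns a reduced version of the data, whereas a transformation returns a transformed version of the full data that we can recombine with the original. The output of such a transformation has the same shape as the input. A common example is centering data by subtracting the group mean.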
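To make the centering example concrete, here is a minimal sketch. The frame `toy` and its columns `group` and `value` are made-up names for illustration only and are not part of the order data used later in this post.

```python
import pandas as pd

# Hypothetical toy data, unrelated to the order data in this post
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0, 30.0],
})

# Aggregation: one value per group (reduced shape)
group_mean = toy.groupby("group")["value"].mean()

# Transformation: one value per original row, aligned to toy's index,
# so it can be subtracted from the original column directly
centered = toy["value"] - toy.groupby("group")["value"].transform("mean")

print(group_mean.shape)   # (2,)
print(centered.tolist())  # [-1.0, 1.0, -10.0, 0.0, 10.0]
```

The aggregated result has two entries (one per group), while the centered result has five, one for every input row.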
In practice
Loading the data
import pandas as pd
data = {'account':[383080,383080,383080,412290,412290,412290,412290,412290,218895,218895,218895,218895],
"name":['Will LLC','Will LLC','Will LLC','Jerde-Hilpert','Jerde-Hilpert','Jerde-Hilpert','Jerde-Hilpert','Jerde-Hilpert','Kulans Inc','Kulans Inc','Kulans Inc','Kulans Inc'],
"order":[1001,1001,1001,1005,1005,1005,1005,1005,1006,1006,1006,1006],
'sku':['B1-20000','B1-27722','B1-86481','S1-06532','S1-82801','S1-06532','S1-47412','S1-27722','S1-27722','B1-33087','B1-33364','B1-20000'],
"quantity":[7,11,3,48,21,9,44,36,32,23,3,-1],
'unit price':[33.69,21.12,35.99,55.82,13.62,92.55,78.91,25.42,95.66,22.55,72.3,72.18],
"ext price":[235.83,232.32,107.97,2679.36,286.02,832.95,3472.04,915.12,3061.12,518.65,216.90,-72.18]}
df = pd.DataFrame(data)
df
| | account | name | order | sku | quantity | unit price | ext price |
---|---|---|---|---|---|---|---|
0 | 383080 | Will LLC | 1001 | B1-20000 | 7 | 33.69 | 235.83 |
1 | 383080 | Will LLC | 1001 | B1-27722 | 11 | 21.12 | 232.32 |
2 | 383080 | Will LLC | 1001 | B1-86481 | 3 | 35.99 | 107.97 |
3 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 48 | 55.82 | 2679.36 |
4 | 412290 | Jerde-Hilpert | 1005 | S1-82801 | 21 | 13.62 | 286.02 |
5 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 9 | 92.55 | 832.95 |
6 | 412290 | Jerde-Hilpert | 1005 | S1-47412 | 44 | 78.91 | 3472.04 |
7 | 412290 | Jerde-Hilpert | 1005 | S1-27722 | 36 | 25.42 | 915.12 |
8 | 218895 | Kulans Inc | 1006 | S1-27722 | 32 | 95.66 | 3061.12 |
9 | 218895 | Kulans Inc | 1006 | B1-33087 | 23 | 22.55 | 518.65 |
10 | 218895 | Kulans Inc | 1006 | B1-33364 | 3 | 72.30 | 216.90 |
11 | 218895 | Kulans Inc | 1006 | B1-20000 | -1 | 72.18 | -72.18 |
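As you can see, the data contains several orders (order) and, for each item in an order, its quantity (quantity), unit price (unit price), and extended price (ext price).
Our task is to add a column to the table that gives each item's share of the total price of the order it belongs to.
Steps for the first approach: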
# Total ext price per order (one row per order)
df.groupby(by='order')['ext price'].agg('sum')
order
1001 576.12
1005 8185.49
1006 3724.49
Name: ext price, dtype: float64
# Rename the resulting Series so it gets a readable column name later
df.groupby(by='order')['ext price'].agg('sum').rename('order_total')
order
1001 576.12
1005 8185.49
1006 3724.49
Name: order_total, dtype: float64
# Move 'order' from the index into a regular column so the result can be merged
order_total = df.groupby(by='order')['ext price'].agg('sum').rename('order_total').reset_index()
order_total
| | order | order_total |
---|---|---|
0 | 1001 | 576.12 |
1 | 1005 | 8185.49 |
2 | 1006 | 3724.49 |
df_1 = df.merge(order_total, on='order')
df_1
| | account | name | order | sku | quantity | unit price | ext price | order_total |
---|---|---|---|---|---|---|---|---|
0 | 383080 | Will LLC | 1001 | B1-20000 | 7 | 33.69 | 235.83 | 576.12 |
1 | 383080 | Will LLC | 1001 | B1-27722 | 11 | 21.12 | 232.32 | 576.12 |
2 | 383080 | Will LLC | 1001 | B1-86481 | 3 | 35.99 | 107.97 | 576.12 |
3 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 48 | 55.82 | 2679.36 | 8185.49 |
4 | 412290 | Jerde-Hilpert | 1005 | S1-82801 | 21 | 13.62 | 286.02 | 8185.49 |
5 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 9 | 92.55 | 832.95 | 8185.49 |
6 | 412290 | Jerde-Hilpert | 1005 | S1-47412 | 44 | 78.91 | 3472.04 | 8185.49 |
7 | 412290 | Jerde-Hilpert | 1005 | S1-27722 | 36 | 25.42 | 915.12 | 8185.49 |
8 | 218895 | Kulans Inc | 1006 | S1-27722 | 32 | 95.66 | 3061.12 | 3724.49 |
9 | 218895 | Kulans Inc | 1006 | B1-33087 | 23 | 22.55 | 518.65 | 3724.49 |
10 | 218895 | Kulans Inc | 1006 | B1-33364 | 3 | 72.30 | 216.90 | 3724.49 |
11 | 218895 | Kulans Inc | 1006 | B1-20000 | -1 | 72.18 | -72.18 | 3724.49 |
df_1['percent_of_order'] = df_1['ext price']/df_1['order_total']
df_1
| | account | name | order | sku | quantity | unit price | ext price | order_total | percent_of_order |
---|---|---|---|---|---|---|---|---|---|
0 | 383080 | Will LLC | 1001 | B1-20000 | 7 | 33.69 | 235.83 | 576.12 | 0.409342 |
1 | 383080 | Will LLC | 1001 | B1-27722 | 11 | 21.12 | 232.32 | 576.12 | 0.403249 |
2 | 383080 | Will LLC | 1001 | B1-86481 | 3 | 35.99 | 107.97 | 576.12 | 0.187409 |
3 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 48 | 55.82 | 2679.36 | 8185.49 | 0.327330 |
4 | 412290 | Jerde-Hilpert | 1005 | S1-82801 | 21 | 13.62 | 286.02 | 8185.49 | 0.034942 |
5 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 9 | 92.55 | 832.95 | 8185.49 | 0.101759 |
6 | 412290 | Jerde-Hilpert | 1005 | S1-47412 | 44 | 78.91 | 3472.04 | 8185.49 | 0.424170 |
7 | 412290 | Jerde-Hilpert | 1005 | S1-27722 | 36 | 25.42 | 915.12 | 8185.49 | 0.111798 |
8 | 218895 | Kulans Inc | 1006 | S1-27722 | 32 | 95.66 | 3061.12 | 3724.49 | 0.821890 |
9 | 218895 | Kulans Inc | 1006 | B1-33087 | 23 | 22.55 | 518.65 | 3724.49 | 0.139254 |
10 | 218895 | Kulans Inc | 1006 | B1-33364 | 3 | 72.30 | 216.90 | 3724.49 | 0.058236 |
11 | 218895 | Kulans Inc | 1006 | B1-20000 | -1 | 72.18 | -72.18 | 3724.49 | -0.019380 |
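The steps of the first approach can also be written as one short pipeline. The sketch below is one possible way to do it, not the only one: it uses a subset of the rows above (so the total for order 1006 differs from the full table), the name `lines` is made up for illustration, and the `how` and `validate` arguments are additions that the steps above do not use.

```python
import pandas as pd

# A few rows copied from the table above (subset, for illustration only)
lines = pd.DataFrame({
    "order": [1001, 1001, 1001, 1006, 1006],
    "ext price": [235.83, 232.32, 107.97, 3061.12, 518.65],
})

# One row per order, with 'order' kept as a regular column
order_total = (
    lines.groupby("order", as_index=False)["ext price"]
    .sum()
    .rename(columns={"ext price": "order_total"})
)

# Explicit join key; validate raises if an order appears twice in order_total
merged = lines.merge(order_total, on="order", how="left", validate="many_to_one")
merged["percent_of_order"] = merged["ext price"] / merged["order_total"]

print(merged.shape)  # (5, 4)
```

Naming the join key explicitly avoids relying on whichever column names the two frames happen to share.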
Steps for the second approach (transform)
df.groupby(by='order')['ext price'].transform('sum')
0 576.12
1 576.12
2 576.12
3 8185.49
4 8185.49
5 8185.49
6 8185.49
7 8185.49
8 3724.49
9 3724.49
10 3724.49
11 3724.49
Name: ext price, dtype: float64
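Instead of returning just one entry for each of the 3 orders, the result keeps the same number of entries as the original dataset, which makes the next step easy. This is what sets transform apart.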
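The difference in shape can be checked directly. The following sketch uses a few rows copied from the data above (a subset, so the total for order 1006 differs from the full table); the names `small`, `aggregated`, and `transformed` are made up for illustration.

```python
import pandas as pd

# Subset of the order data shown earlier, for illustration only
small = pd.DataFrame({
    "order": [1001, 1001, 1001, 1006, 1006],
    "ext price": [235.83, 232.32, 107.97, 3061.12, 518.65],
})

grouped = small.groupby("order")["ext price"]
aggregated = grouped.agg("sum")         # indexed by order: one row per order
transformed = grouped.transform("sum")  # indexed like small: one row per input row

print(len(aggregated), len(transformed))      # 2 5
print(transformed.index.equals(small.index))  # True

# transform gives the same numbers as looking up each row's order in the aggregate
looked_up = small["order"].map(aggregated)
print((transformed - looked_up).abs().max() < 1e-9)  # True
```

Because the transformed Series shares the index of the original frame, it can be assigned as a new column without any merge.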
df['order_total'] = df.groupby(by='order')['ext price'].transform('sum')
df['percent_of_order'] = df['ext price']/df['order_total']
df
| | account | name | order | sku | quantity | unit price | ext price | order_total | percent_of_order |
---|---|---|---|---|---|---|---|---|---|
0 | 383080 | Will LLC | 1001 | B1-20000 | 7 | 33.69 | 235.83 | 576.12 | 0.409342 |
1 | 383080 | Will LLC | 1001 | B1-27722 | 11 | 21.12 | 232.32 | 576.12 | 0.403249 |
2 | 383080 | Will LLC | 1001 | B1-86481 | 3 | 35.99 | 107.97 | 576.12 | 0.187409 |
3 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 48 | 55.82 | 2679.36 | 8185.49 | 0.327330 |
4 | 412290 | Jerde-Hilpert | 1005 | S1-82801 | 21 | 13.62 | 286.02 | 8185.49 | 0.034942 |
5 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 9 | 92.55 | 832.95 | 8185.49 | 0.101759 |
6 | 412290 | Jerde-Hilpert | 1005 | S1-47412 | 44 | 78.91 | 3472.04 | 8185.49 | 0.424170 |
7 | 412290 | Jerde-Hilpert | 1005 | S1-27722 | 36 | 25.42 | 915.12 | 8185.49 | 0.111798 |
8 | 218895 | Kulans Inc | 1006 | S1-27722 | 32 | 95.66 | 3061.12 | 3724.49 | 0.821890 |
9 | 218895 | Kulans Inc | 1006 | B1-33087 | 23 | 22.55 | 518.65 | 3724.49 | 0.139254 |
10 | 218895 | Kulans Inc | 1006 | B1-33364 | 3 | 72.30 | 216.90 | 3724.49 | 0.058236 |
11 | 218895 | Kulans Inc | 1006 | B1-20000 | -1 | 72.18 | -72.18 | 3724.49 | -0.019380 |
It can even be done in a single step:
df['percent_of_order'] = df['ext price']/df.groupby(by='order')['ext price'].transform('sum')
df
| | account | name | order | sku | quantity | unit price | ext price | order_total | percent_of_order |
---|---|---|---|---|---|---|---|---|---|
0 | 383080 | Will LLC | 1001 | B1-20000 | 7 | 33.69 | 235.83 | 576.12 | 0.409342 |
1 | 383080 | Will LLC | 1001 | B1-27722 | 11 | 21.12 | 232.32 | 576.12 | 0.403249 |
2 | 383080 | Will LLC | 1001 | B1-86481 | 3 | 35.99 | 107.97 | 576.12 | 0.187409 |
3 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 48 | 55.82 | 2679.36 | 8185.49 | 0.327330 |
4 | 412290 | Jerde-Hilpert | 1005 | S1-82801 | 21 | 13.62 | 286.02 | 8185.49 | 0.034942 |
5 | 412290 | Jerde-Hilpert | 1005 | S1-06532 | 9 | 92.55 | 832.95 | 8185.49 | 0.101759 |
6 | 412290 | Jerde-Hilpert | 1005 | S1-47412 | 44 | 78.91 | 3472.04 | 8185.49 | 0.424170 |
7 | 412290 | Jerde-Hilpert | 1005 | S1-27722 | 36 | 25.42 | 915.12 | 8185.49 | 0.111798 |
8 | 218895 | Kulans Inc | 1006 | S1-27722 | 32 | 95.66 | 3061.12 | 3724.49 | 0.821890 |
9 | 218895 | Kulans Inc | 1006 | B1-33087 | 23 | 22.55 | 518.65 | 3724.49 | 0.139254 |
10 | 218895 | Kulans Inc | 1006 | B1-33364 | 3 | 72.30 | 216.90 | 3724.49 | 0.058236 |
11 | 218895 | Kulans Inc | 1006 | B1-20000 | -1 | 72.18 | -72.18 | 3724.49 | -0.019380 |
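As a quick sanity check on the result, the shares within each order should add up to 1. The sketch below rebuilds only the order and ext price columns from the data above; the names `orders` and `check` are made up for illustration.

```python
import pandas as pd

# The order and ext price columns from the data in this post
orders = pd.DataFrame({
    "order": [1001] * 3 + [1005] * 5 + [1006] * 4,
    "ext price": [235.83, 232.32, 107.97,
                  2679.36, 286.02, 832.95, 3472.04, 915.12,
                  3061.12, 518.65, 216.90, -72.18],
})

# One-step version: divide each row by its order's total
orders["percent_of_order"] = (
    orders["ext price"] / orders.groupby("order")["ext price"].transform("sum")
)

# Summing the shares per order should give 1 (up to floating-point rounding)
check = orders.groupby("order")["percent_of_order"].sum()
print(check.round(6).tolist())  # [1.0, 1.0, 1.0]
```

Note that the row with quantity -1 has a negative share, so the individual shares within order 1006 are not all between 0 and 1 even though they still sum to 1.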