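An Illustrated Guide to the Pandas assign Function

WeChat official account: 尤而小屋
Author: Peter
Editor: Peter

Hi everyone, I'm Peter~

This article introduces a very useful function in the Pandas library: assign

When processing data, we often need to compute a new column from an existing one for later use. The assign function is very convenient for this. The examples below show how to use it.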

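Pandas articles

This is the 21st article in a serialized Pandas series, which falls into three groups:

Basics: articles 1-16, covering basic and common Pandas operations such as creating data, lookup and querying, ranking and sorting, and handling missing or duplicate values

Advanced: starting from article 17, covering advanced Pandas techniques

Pandas compared with SQL: learning the two side by side by comparing equivalent operations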

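Parameters

The assign function takes a single parameter: DataFrame.assign(**kwargs).

**kwargs: dict of {str: callable or Series}

A few notes on the parameter:

The keywords are the column names
If a value is callable, it is called on the DataFrame and the result is assigned to the new column
If a value is not callable (for example a Series, a scalar, or an array), it is assigned directly

Finally, the function returns a new DataFrame that contains all existing columns plus the newly created ones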
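The rules above can be sketched with a small made-up DataFrame. The name `demo` and the column names `doubled`, `from_series`, `constant`, and `from_array` are illustrative only and do not come from the article's data:

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3]})  # hypothetical data for illustration

result = demo.assign(
    doubled=lambda d: d["a"] * 2,    # callable: called with the DataFrame
    from_series=demo["a"] + 10,      # Series: assigned directly
    constant=0,                      # scalar: repeated on every row
    from_array=np.array([7, 8, 9]),  # array: length must match the row count
)

print(result)
print(list(demo.columns))  # the original still has only column "a"
```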

Importing libraries

import pandas as pd
import numpy as np

# Sample data

df = pd.DataFrame({
  "col1":[12, 16, 18],
  "col2":["xiaoming","peter", "mike"]})

df

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>col1</th>
<th>col2</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>12</td>
<td>xiaoming</td>
</tr>
<tr>
<th>1</th>
<td>16</td>
<td>peter</td>
</tr>
<tr>
<th>2</th>
<td>18</td>
<td>mike</td>
</tr>
</tbody>
</table>

</div>

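Examples

When the value is callable, it is evaluated on the DataFrame:

Method 1: pass a callable that operates on the DataFrame

# Method 1: the callable is invoked with the DataFrame df
# Use the col1 attribute of df to generate col3

df.assign(col3=lambda x: x.col1 / 2 + 20)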

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>col1</th>
<th>col2</th>
<th>col3</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>12</td>
<td>xiaoming</td>
<td>26.0</td>
</tr>
<tr>
<th>1</th>
<td>16</td>
<td>peter</td>
<td>28.0</td>
</tr>
<tr>
<th>2</th>
<td>18</td>
<td>mike</td>
<td>29.0</td>
</tr>
</tbody>
</table>

</div>

Looking at the original df, we can see that it is unchanged

df  # the original DataFrame is unchanged

Working with string data:

df.assign(col3=df["col2"].str.upper())

Method 2: pass a Series

The same result can be achieved by referring directly to an existing Series:

# Method 2: compute from an existing Series

df.assign(col4=df["col1"] * 3 / 4 + 25)

df  # original data unchanged
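One detail worth knowing when passing a Series: as with ordinary column assignment, pandas lines it up with the DataFrame by index label rather than by position. A small sketch with made-up data (the names `demo`, `other`, and `aligned` are illustrative):

```python
import pandas as pd

demo = pd.DataFrame({"a": [1, 2, 3]}, index=["x", "y", "z"])
other = pd.Series([30, 10], index=["z", "x"])  # different order, no "y" label

aligned = demo.assign(b=other)  # values are matched on the index labels
print(aligned)                  # row "y" has no match, so b is NaN there
```

A plain list or NumPy array carries no index, so it is assigned by position and must have the same length as the DataFrame.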

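In Python 3.6+, several columns can be created in a single assign call, and a column can depend on another column defined earlier in the same call, because the keyword arguments are processed in order. In other words, an intermediate column can be used right away:

df.assign(
    col5=lambda x: x["col1"] / 2 + 10,
    col6=lambda x: x["col5"] * 5,  # col6 uses col5 directly
    col7=lambda x: x.col2.str.upper(),
    col8=lambda x: x.col7.str.title()  # col8 uses col7
)

df   # original data unchanged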
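The lambdas matter here. A column created earlier in the same call exists only in the intermediate result, not in `df` itself, so referring to it through `df[...]` fails. A short sketch, rebuilding the article's `df` so the block runs on its own (the name `ok` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "col1": [12, 16, 18],
    "col2": ["xiaoming", "peter", "mike"],
})

# Works: each lambda receives the DataFrame built so far, which already has col5.
ok = df.assign(
    col5=lambda x: x["col1"] / 2 + 10,
    col6=lambda x: x["col5"] * 5,
)
print(ok)

# Fails: df["col5"] is evaluated before assign runs, and df has no such column.
try:
    df.assign(col5=df["col1"] / 2 + 10, col6=df["col5"] * 5)
except KeyError as err:
    print("KeyError:", err)
```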

If we assign to a column that already exists, its values are overwritten in the returned DataFrame:

df.assign(col1=df["col1"] / 2)  # col1 is overwritten in the result

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>col1</th>
<th>col2</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>6.0</td>
<td>xiaoming</td>
</tr>
<tr>
<th>1</th>
<td>8.0</td>
<td>peter</td>
</tr>
<tr>
<th>2</th>
<td>9.0</td>
<td>mike</td>
</tr>
</tbody>
</table>

</div>

Comparison with apply

The same column can also be computed in pandas with the apply function

df  # original data

Make a copy and work on the copy:

df1 = df.copy()  # make a copy and operate on it directly
df2 = df.copy()

df1

df1.assign(col3=lambda x: x.col1 / 2 + 20)

df1  # df1 is unchanged

df1["col3"] = df1["col1"].apply(lambda x: x / 2 + 20)

df1  # df1 has now changed

In short: with assign the original data stays unchanged, whereas storing the result of apply in df1["col3"] modifies df1 in place
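To make the distinction concrete, the sketch below separates computing a value with `apply` from storing it. On its own, `apply` returns a new Series; only the column assignment changes the DataFrame. The names `demo`, `computed`, and `with_assign` are illustrative:

```python
import pandas as pd

demo = pd.DataFrame({"col1": [12, 16, 18]})

computed = demo["col1"].apply(lambda v: v / 2 + 20)  # returns a new Series
print(list(demo.columns))  # still just col1

with_assign = demo.assign(col3=computed)  # new DataFrame, demo untouched
print(list(demo.columns))  # still just col1

demo["col3"] = computed    # column assignment modifies demo in place
print(list(demo.columns))  # now col1 and col3
```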

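BMI

Finally, let's simulate one more dataset and compute each person's BMI.

Body mass index (BMI) is an internationally used measure of how lean or heavy a person is and whether their weight is healthy.

$\mathrm{BMI} = \frac{\text{weight}}{\text{height}^2}$

where weight is in kg and height is in m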
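As a quick check of the formula, here is the calculation for a single person in plain Python, using 78 kg and 1.82 m as the inputs. The function name `bmi` is just for illustration:

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    return weight_kg / height_m ** 2


value = bmi(78, 1.82)
print(round(value, 2))  # about 23.55
```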

df2 = pd.DataFrame({
    "name":["xiaoming","xiaohong","xiaosu"],
    "weight":[78,65,87],
    "height":[1.82,1.75,1.89]
})

df2

<div>
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>name</th>
<th>weight</th>
<th>height</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>xiaoming</td>
<td>78</td>
<td>1.82</td>
</tr>
<tr>
<th>1</th>
<td>xiaohong</td>
<td>65</td>
<td>1.75</td>
</tr>
<tr>
<th>2</th>
<td>xiaosu</td>
<td>87</td>
<td>1.89</td>
</tr>
</tbody>
</table>

</div>

# Using the assign function

df2.assign(BMI=df2["weight"] / (df2["height"] ** 2))

df2  # unchanged

df2["BMI"] = df2["weight"] / (df2["height"] ** 2)

df2  # df2 now has a new column: BMI

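Summary

From the examples above, we can see that:

assign returns a new DataFrame and does not change the original data
assign can create several columns at once, and columns created earlier in the same call can be used by later ones
The main difference from the apply approach shown above: assign leaves the original data unchanged, while storing the result of apply in a column (df1["col3"] = ...) adds the new column to the original data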
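Because assign returns a new DataFrame, the result has to be bound to a name to be kept. A minimal sketch on the BMI data, which also shows how assign fits into a method chain (the variable name `with_bmi` and the sorting step are illustrative additions):

```python
import pandas as pd

df2 = pd.DataFrame({
    "name": ["xiaoming", "xiaohong", "xiaosu"],
    "weight": [78, 65, 87],
    "height": [1.82, 1.75, 1.89],
})

with_bmi = (
    df2
    .assign(BMI=lambda x: x["weight"] / x["height"] ** 2)  # add the BMI column
    .sort_values("BMI", ascending=False)                   # highest BMI first
)

print(with_bmi)
print("BMI" in df2.columns)  # False: df2 itself was not modified
```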
