3 Key Techniques for Text Processing in pandas

WeChat official account: 尤而小屋
Author: Peter
Editor: Peter

Hi everyone, I'm Peter~

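This article shows how to filter text data with three string methods in pandas:

  • contains: the string contains a given substring or pattern
  • startswith: the string starts with a given prefix
  • endswith: the string ends with a given suffix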

Sample data

import pandas as pd
import numpy as np
df = pd.DataFrame({
    "name":["xiao ming","Xiao zhang",np.nan,"sun quan","guan yu"],
    "age":["22","19","20","34","39"],
    "sex":["male","Female","female","Female","male"],
    "address":["廣東省深圳市","浙江省杭州市","江蘇省蘇州市","福建省泉州市","廣東省廣州市"]
})

df
df.dtypes  # check the column dtypes
name       object
age        object
sex        object
address    object
dtype: object

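The sample data has four characteristics:

  1. name: contains a missing value (np.nan), and Xiao and xiao differ in case
  2. age: should normally be numeric, but is stored here as strings (object dtype)
  3. sex: also mixes upper and lower case (F vs f, as in Female and female)
  4. address: written normally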

Data type conversion

We convert the age column from strings to a numeric type:

df["age"] = df["age"].astype(float)
df


The resulting data looks almost identical to the original; the difference shows up when we check the column dtypes:

df.dtypes
name        object
age        float64  
sex         object
address     object
dtype: object

The age column is now numeric, with dtype float64.
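astype(float) works here because every value is a valid number; it raises an error as soon as one value cannot be parsed. As a hedged sketch (the messy `ages` series below is made up for illustration and is not part of the sample data), pd.to_numeric with errors="coerce" turns such values into NaN instead:

```python
import pandas as pd

# Hypothetical messy column: one value is not a valid number.
ages = pd.Series(["22", "19", "unknown", "34"])

# errors="coerce" replaces unparseable values with NaN instead of raising.
ages_num = pd.to_numeric(ages, errors="coerce")

print(ages_num.tolist())  # [22.0, 19.0, nan, 34.0]
print(ages_num.dtype)     # float64
```

Whether silently coercing bad values is acceptable depends on the data; counting the NaN values afterwards is a cheap sanity check.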

contains

contains is a method on the Series.str accessor. Its signature is:

Series.str.contains(
    pat, 
    case=True, 
    flags=0, 
    na=None, 
    regex=True
)
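  • pat: the substring or regular expression to search for
  • case: whether matching is case-sensitive (True by default)
  • flags: regex flags passed through to the re module, e.g. re.IGNORECASE to ignore case
  • na: optional scalar; the value to use in the result for missing values in the original data. The default depends on the dtype: numpy.nan for object dtype, pandas.NA for StringDtype
  • regex: boolean; if True, pat is treated as a regular expression; if False, pat is treated as a literal string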
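The regex parameter matters whenever pat contains characters that are special in regular expressions. A minimal sketch, using a small made-up series rather than the sample data:

```python
import pandas as pd

# Hypothetical values, made up for this illustration.
s = pd.Series(["a.b", "axb", "a-b"])

as_regex = s.str.contains("a.b")                 # "." matches any character
as_literal = s.str.contains("a.b", regex=False)  # "." must be a literal dot

print(as_regex.tolist())    # [True, True, True]
print(as_literal.tolist())  # [True, False, False]
```

When the search text comes from user input or another column, regex=False is usually the safer choice.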

Default behavior

# Example 1: find rows whose name contains xiao

df["name"].str.contains("xiao")
0     True
1    False
2      NaN
3    False
4    False
Name: name, dtype: object

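The result contains NaN for the missing name, so when a column has missing values, pass the na parameter:

Handling missing values

# Example 2: using the na parameter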

df[df["name"].str.contains("xiao",na=False)]

        name   age   sex address
0  xiao ming  22.0  male  廣東省深圳市

Without na, the filter raises an error, because the mask contains NaN and is no longer purely boolean:

df[df["name"].str.contains("xiao")]
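The failure can be reproduced on a plain Series. The sketch below assumes the column has object dtype, as in this article; in that case the error is a ValueError. It catches the error and then applies the na=False fix:

```python
import numpy as np
import pandas as pd

# dtype=object is forced so the result keeps NaN for the missing name,
# matching the output shown in this article.
names = pd.Series(["xiao ming", "Xiao zhang", np.nan], dtype=object)

mask = names.str.contains("xiao")  # third element is NaN, not True/False
try:
    names[mask]
except ValueError as exc:
    print("ValueError:", exc)

safe_mask = names.str.contains("xiao", na=False)  # missing value counts as False
print(names[safe_mask].tolist())  # ['xiao ming']
```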

Ignoring case

# Example 3: using case

df["name"].str.contains("xiao",case=False)
0     True
1     True
2      NaN
3    False
4    False
Name: name, dtype: object

With case=False the match ignores case, so there are two True values: both the xiao and the Xiao rows are selected:

df[df["name"].str.contains("xiao",case=False, na=False)]

         name   age     sex address
0   xiao ming  22.0    male  廣東省深圳市
1  Xiao zhang  19.0  Female  浙江省杭州市
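case=False is not the only way to ignore case. As a small sketch on the same name values, passing flags=re.IGNORECASE or lowercasing the column first gives the same mask:

```python
import re
import numpy as np
import pandas as pd

names = pd.Series(["xiao ming", "Xiao zhang", np.nan, "sun quan", "guan yu"])

by_case = names.str.contains("xiao", case=False, na=False)
by_flags = names.str.contains("xiao", flags=re.IGNORECASE, na=False)
by_lower = names.str.lower().str.contains("xiao", na=False)

print(by_case.tolist())  # [True, True, False, False, False]
print(by_case.tolist() == by_flags.tolist() == by_lower.tolist())  # True
```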

Ignoring case and missing values

# Example 4: ignore case and missing values
df[df["sex"].str.contains("f",case=False, na=False)]

         name   age     sex address
1  Xiao zhang  19.0  Female  浙江省杭州市
2         NaN  20.0  female  江蘇省蘇州市
3    sun quan  34.0  Female  福建省泉州市

Using regular expressions

# Example 5: using a regular expression

df["address"].str.contains("^廣")
0     True
1    False
2    False
3    False
4     True
Name: address, dtype: bool

Here ^ anchors the match at the start of the string, so this selects the rows whose address starts with 廣:

df[df["address"].str.contains("^廣")]

        name   age   sex address
0  xiao ming  22.0  male  廣東省深圳市
4    guan yu  39.0  male  廣東省廣州市

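In a regular expression, $ anchors the match at the end of the string; the following selects the rows whose address ends with 市: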

df[df["address"].str.contains("市$")]

         name   age     sex address
0   xiao ming  22.0    male  廣東省深圳市
1  Xiao zhang  19.0  Female  浙江省杭州市
2         NaN  20.0  female  江蘇省蘇州市
3    sun quan  34.0  Female  福建省泉州市
4     guan yu  39.0    male  廣東省廣州市

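In the next example, the character class [深蘇泉] matches any one of the characters 深, 蘇, or 泉, so a row is selected if its address contains at least one of them: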

df[df["address"].str.contains("[深蘇泉]")]

        name   age     sex address
0  xiao ming  22.0    male  廣東省深圳市
2        NaN  20.0  female  江蘇省蘇州市
3   sun quan  34.0  Female  福建省泉州市
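A character class only matches single characters. To match one of several multi-character substrings, the regex alternation operator | can be used instead. A minimal sketch on the same address values:

```python
import pandas as pd

address = pd.Series(
    ["廣東省深圳市", "浙江省杭州市", "江蘇省蘇州市", "福建省泉州市", "廣東省廣州市"]
)

# "|" means "or": keep addresses that contain 深圳 or 杭州.
mask = address.str.contains("深圳|杭州")
print(address[mask].tolist())  # ['廣東省深圳市', '浙江省杭州市']
```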

startswith

The signature of startswith is simpler:

Series.str.startswith(pat, na=None)
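  • pat: the prefix to test for, as a literal string (one or more characters); note that regular expressions are not accepted
  • na: the value to use in the result for missing values; na=False treats missing values as non-matches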

The pat parameter

pat specifies the prefix as a literal string; regular expressions are not accepted.

df["address"].str.startswith("廣")
0     True
1    False
2    False
3    False
4     True
Name: address, dtype: bool
df[df["address"].str.startswith("廣")]

        name   age   sex address
0  xiao ming  22.0  male  廣東省深圳市
4    guan yu  39.0  male  廣東省廣州市

This gives the same result as the regular expression that anchors the match at the start of the string:

df[df["address"].str.contains("^廣")]

        name   age   sex address
0  xiao ming  22.0  male  廣東省深圳市
4    guan yu  39.0  male  廣東省廣州市
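The two forms agree here because 廣 has no special meaning in a regular expression. When the prefix contains regex metacharacters, the contains version needs escaping. A hedged sketch with made-up values (not from the sample data), using re.escape from the standard library:

```python
import re
import pandas as pd

# Hypothetical values containing regex metacharacters.
s = pd.Series(["(A) first", "A second", "(B) third"])

prefix = "(A)"
literal = s.str.startswith(prefix)  # prefix is taken literally

# With contains, escape the prefix first; unescaped, "(" and ")"
# would be read as a regex group rather than as literal parentheses.
via_regex = s.str.contains("^" + re.escape(prefix))

print(literal.tolist())    # [True, False, False]
print(via_regex.tolist())  # [True, False, False]
```

For a plain prefix test, startswith is the simpler choice since no escaping is needed.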

Case sensitivity

startswith is always case-sensitive; unlike contains, it has no case parameter:

df[df["sex"].str.startswith("f")]

  name   age     sex address
2  NaN  20.0  female  江蘇省蘇州市

df[df["sex"].str.startswith("F")]

         name   age     sex address
1  Xiao zhang  19.0  Female  浙江省杭州市
3    sun quan  34.0  Female  福建省泉州市
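Since startswith cannot ignore case by itself, one common workaround is to normalize the case first. A minimal sketch on the same sex values:

```python
import pandas as pd

sex = pd.Series(["male", "Female", "female", "Female", "male"])

# Lowercase every value, then test the prefix.
mask = sex.str.lower().str.startswith("f")
print(mask.tolist())  # [False, True, True, True, False]
```

This selects the same three rows as contains("f", case=False) did earlier, because f appears in these values only as the first letter of Female/female.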

Handling missing values

df["name"].str.startswith("xiao")
0     True
1    False
2      NaN
3    False
4    False
Name: name, dtype: object
df[df["name"].str.startswith("xiao",na=False)]

        name   age   sex address
0  xiao ming  22.0  male  廣東省深圳市

endswith

endswith tests whether each string ends with a given suffix. Its signature is:

Series.str.endswith(pat, na=None)
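  • pat: the suffix to test for, as a literal string (one or more characters); note that regular expressions are not accepted
  • na: the value to use in the result for missing values; na=False treats missing values as non-matches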
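To test for several suffixes at once without a regular expression, pat can also be a tuple of strings. This sketch assumes pandas 1.5 or later, where tuple support was added; on older versions the call fails:

```python
import pandas as pd

address = pd.Series(["廣東省深圳市", "浙江省杭州市", "江蘇省蘇州市"])

# Assumes pandas >= 1.5: a tuple matches if any of its suffixes matches.
mask = address.str.endswith(("深圳市", "杭州市"))
print(mask.tolist())  # [True, True, False]
```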

The pat parameter

# ends with 市

df[df["address"].str.endswith("市")]

         name   age     sex address
0   xiao ming  22.0    male  廣東省深圳市
1  Xiao zhang  19.0  Female  浙江省杭州市
2         NaN  20.0  female  江蘇省蘇州市
3    sun quan  34.0  Female  福建省泉州市
4     guan yu  39.0    male  廣東省廣州市

# The regex equivalent, using contains

df[df["address"].str.contains("市$")]

         name   age     sex address
0   xiao ming  22.0    male  廣東省深圳市
1  Xiao zhang  19.0  Female  浙江省杭州市
2         NaN  20.0  female  江蘇省蘇州市
3    sun quan  34.0  Female  福建省泉州市
4     guan yu  39.0    male  廣東省廣州市

Handling missing values

df["name"].str.endswith("g")
0     True
1     True
2      NaN
3    False
4    False
Name: name, dtype: object
df[df["name"].str.endswith("g",na=False)]

         name   age     sex address
0   xiao ming  22.0    male  廣東省深圳市
1  Xiao zhang  19.0  Female  浙江省杭州市

# Without the na parameter this raises an error
df[df["name"].str.endswith("g")]

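The cause of the error is clear: the name column contains a missing value, so the mask contains NaN. Passing the na parameter solves the problem.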
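The same na=False mask is also what makes negation safe. As a small sketch on the name values, ~ inverts the mask to select rows that do not end with g; note that the missing name is then included unless it is filtered out explicitly:

```python
import numpy as np
import pandas as pd

names = pd.Series(["xiao ming", "Xiao zhang", np.nan, "sun quan", "guan yu"])

ends_with_g = names.str.endswith("g", na=False)

# ~ inverts the mask; the missing name counts as "does not end with g".
not_g = names[~ends_with_g]
print(not_g.index.tolist())  # [2, 3, 4]

# Combine with notna() to drop the missing name as well.
not_g_known = names[~ends_with_g & names.notna()]
print(not_g_known.tolist())  # ['sun quan', 'guan yu']
```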
