[Python Data Analysis with pandas 05] Handling Missing Data

Original post

2018-09-03 21:07

First, pandas provides the .isnull method to check whether each element of an object is NaN (a missing value).

import numpy as np
import pandas as pd
s1 = pd.Series(['one','two',np.nan,'three'])
s1.isnull()
'''
0    False
1    False
2     True
3    False
dtype: bool
'''
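As a small extension of the check above, the boolean mask can be summed to count the missing entries, and notnull gives the opposite mask. This is only a minimal sketch; the names mask, n_missing and present are illustrative and do not come from the original post.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(['one', 'two', np.nan, 'three'])

mask = s1.isnull()          # True where the element is missing
n_missing = mask.sum()      # True counts as 1, so this counts the NaNs
present = s1[s1.notnull()]  # keep only the non-missing elements

print(n_missing)
print(present.tolist())
```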

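An earlier post mentioned that one way to fill missing values is to pass the method argument when reindexing. The idea here is much the same, except that the missing values are filled directly with fillna: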

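s1 = pd.Series(['one','two',np.nan,'three'])
# forward fill: copy the previous valid value into each gap
# (newer pandas versions deprecate method=; use s1.ffill() there)
s1.fillna(method='ffill')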
'''
0      one
1      two
2      two
3    three
dtype: object
'''
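The fill direction matters, so here is a short sketch that compares forward and backward filling on the same Series. It uses the dedicated ffill() and bfill() methods, which recent pandas releases recommend over the method argument of fillna; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s1 = pd.Series(['one', 'two', np.nan, 'three'])

forward = s1.ffill()   # propagate the previous valid value forward
backward = s1.bfill()  # pull the next valid value backward

print(forward.tolist())
print(backward.tolist())
```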

fillna() can also fill the gaps with a value you specify. Note that, by default, fillna returns a new object and leaves the original unchanged:

df = pd.DataFrame(np.random.randn(4,4))
df.iloc[:4,1]=np.nan
df.iloc[:2,2]=np.nan
print(df)
'''
          0   1         2         3
0  0.639954 NaN       NaN -1.875799
1  0.141415 NaN       NaN  0.712173
2  1.479268 NaN  0.390988 -0.436616
3  2.143007 NaN  0.535538 -0.582310
'''


print(df.fillna(0))
'''
          0    1         2         3
0  0.639954  0.0  0.000000 -1.875799
1  0.141415  0.0  0.000000  0.712173
2  1.479268  0.0  0.390988 -0.436616
3  2.143007  0.0  0.535538 -0.582310
'''


print(df)
'''
          0   1         2         3
0  0.639954 NaN       NaN -1.875799
1  0.141415 NaN       NaN  0.712173
2  1.479268 NaN  0.390988 -0.436616
3  2.143007 NaN  0.535538 -0.582310
'''
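Besides a single scalar, fillna also accepts a dict that maps column labels to fill values. The sketch below uses that form and then confirms that the original frame still contains its NaNs; the fill values 0 and -1 are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 4))
df.iloc[:, 1] = np.nan
df.iloc[:2, 2] = np.nan

# Keys are column labels: fill column 1 with 0 and column 2 with -1.
filled = df.fillna({1: 0, 2: -1})

print(filled)
print(df.isnull().sum().sum())  # the original still holds its 6 NaNs
```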

* The old .ix indexer is deprecated (and was removed in pandas 1.0); use .iloc for positional indexing instead.

To modify the object in place, pass inplace=True:

df = pd.DataFrame(np.random.randn(4,4))
df.iloc[:4,1]=np.nan
df.iloc[:2,2]=np.nan
print(df)
'''
          0   1         2         3
0  0.639954 NaN       NaN -1.875799
1  0.141415 NaN       NaN  0.712173
2  1.479268 NaN  0.390988 -0.436616
3  2.143007 NaN  0.535538 -0.582310
'''

df.fillna(0,inplace=True)
print(df)
'''
          0    1         2         3
0  0.639954  0.0  0.000000 -1.875799
1  0.141415  0.0  0.000000  0.712173
2  1.479268  0.0  0.390988 -0.436616
3  2.143007  0.0  0.535538 -0.582310
'''

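Note, however, that the two modes behave differently. Without inplace=True, fillna returns a new object and leaves the original untouched. With inplace=True, it modifies the existing object rather than returning a new filled copy (the call has traditionally returned None). Checking id() before and after confirms this.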
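The id check mentioned above can be sketched as follows. It compares object identities around each call; the variable names are illustrative and not from the original post.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(4, 4))
df.iloc[:, 1] = np.nan

id_before = id(df)

filled = df.fillna(0)                  # default call: a separate object
same_object = filled is df             # identity comparison
still_has_nan = df.isnull().any().any()  # df itself was not touched

df.fillna(0, inplace=True)             # modifies df itself
id_after = id(df)

print(same_object, still_has_nan, id_before == id_after)
```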

Of course, the fill methods used with reindex earlier (such as ffill and bfill) can also be applied with fillna.
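To make the connection with reindex concrete, here is a small sketch under the assumption of a Series with a gappy integer index (the data is made up for illustration). Filling during reindexing and filling afterwards give the same values in this case.

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c'], index=[0, 2, 4])

# Fill while reindexing: new labels take the previous valid value.
via_reindex = s.reindex(range(6), method='ffill')

# Reindex first (new labels become NaN), then forward fill the gaps.
via_fill = s.reindex(range(6)).ffill()

print(via_reindex.tolist())
print(via_fill.tolist())
```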

If you would rather not fill the data, pandas also provides ways to drop missing values.

For a Series, you can drop them either with dropna or with boolean indexing:

# using dropna
s = pd.Series([1,2,3,np.nan,np.nan,5])
s.dropna()
'''
0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64
'''

# using boolean indexing
s = pd.Series([1,2,3,np.nan,np.nan,5])
s[s.notnull()]
'''
0    1.0
1    2.0
2    3.0
5    5.0
dtype: float64
'''
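A quick sketch to confirm that the two approaches agree. It also shows reset_index(drop=True), an optional extra step not used in the original post, for renumbering the index once the gaps are gone.

```python
import numpy as np
import pandas as pd

s = pd.Series([1, 2, 3, np.nan, np.nan, 5])

by_dropna = s.dropna()
by_mask = s[s.notnull()]
agree = by_dropna.equals(by_mask)  # same values and same index labels

# The surviving labels are 0, 1, 2, 5; renumber them if that matters.
renumbered = by_dropna.reset_index(drop=True)

print(agree)
print(renumbered.index.tolist())
```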

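For a DataFrame, you have to decide whether to drop rows or columns. By default, dropna() drops every row that contains at least one missing value: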

data = pd.DataFrame([[1,6.5,3],[1,np.nan,np.nan],[np.nan,np.nan,np.nan],[np.nan,6.5,3]])
print(data)
'''
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
2  NaN  NaN  NaN
3  NaN  6.5  3.0
'''

print(data.dropna())
'''
     0    1    2
0  1.0  6.5  3.0
'''

To drop columns instead, pass axis=1; no example is given here.
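For completeness, a minimal sketch of the column case. With the data frame used above, every column contains at least one NaN, so axis=1 removes all of them. The extra column 3 is a hypothetical addition, included only to show a column that survives.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, 6.5, 3],
                     [1, np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 3]])

# Every column holds at least one NaN, so all three are dropped.
dropped_all = data.dropna(axis=1)

# Hypothetical extra column with no missing values, for illustration.
data2 = data.copy()
data2[3] = [10, 20, 30, 40]
kept = data2.dropna(axis=1)  # only column 3 survives

print(dropped_all.shape)
print(kept.columns.tolist())
```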

Another important dropna parameter is how='all', which drops only those rows in which every value is NA:

print(data.dropna(how="all"))
'''
     0    1    2
0  1.0  6.5  3.0
1  1.0  NaN  NaN
3  NaN  6.5  3.0
'''
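The difference between the default how='any' and how='all' is easiest to see by comparing which row labels survive. The sketch below also shows the thresh argument, which the original post does not cover: it keeps rows that have at least the given number of non-NA values.

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, 6.5, 3],
                     [1, np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 3]])

any_rows = data.dropna(how='any').index.tolist()  # drop rows with any NaN
all_rows = data.dropna(how='all').index.tolist()  # drop rows that are all NaN
thresh_rows = data.dropna(thresh=2).index.tolist()  # need >= 2 non-NA values

print(any_rows, all_rows, thresh_rows)
```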
