示例:用特定於分組的值填充缺失值
對於缺失數據的清理工作,有時你會用dropna將其替換掉,而有時則可能會希望用一個固定值或由數據集本身所衍生出來的值去填充NA值。這時就得使用fillna這個工具了。在下面這個例子中,我用平均值去填充NA值:
In [91]: s = pd.Series(np.random.randn(6))
In [92]: s[::2] = np.nan
In [93]: s
Out[93]:
0 NaN
1 -0.125921
2 NaN
3 -0.884475
4 NaN
5 0.227290
dtype: float64
In [94]: s.fillna(s.mean())
Out[94]:
0 -0.261035
1 -0.125921
2 -0.261035
3 -0.884475
4 -0.261035
5 0.227290
dtype: float64
假設你需要對不同的分組填充不同的值。一種方法是將數據分組,並使用apply和一個能夠對各數據塊調用fillna的函數即可。下面是一些有關美國幾個州的示例數據,這些州又被分爲東部和西部:
In [95]: states = ['Ohio', 'New York', 'Vermont', 'Florida',
....: 'Oregon', 'Nevada', 'California', 'Idaho']
In [96]: group_key = ['East'] * 4 + ['West'] * 4
In [97]: data = pd.Series(np.random.randn(8), index=states)
In [98]: data
Out[98]:
Ohio 0.922264
New York -2.153545
Vermont -0.365757
Florida -0.375842
Oregon 0.329939
Nevada 0.981994
California 1.105913
Idaho -1.613716
dtype: float64
[‘East’] * 4產生了一個列表,包括了[‘East’]中元素的四個拷貝。將這些列表串聯起來。
將一些值設爲缺失:
In [99]: data[['Vermont', 'Nevada', 'Idaho']] = np.nan
In [100]: data
Out[100]:
Ohio 0.922264
New York -2.153545
Vermont NaN
Florida -0.375842
Oregon 0.329939
Nevada NaN
California 1.105913
Idaho NaN
dtype: float64
In [101]: data.groupby(group_key).mean()
Out[101]:
East -0.535707
West 0.717926
dtype: float64
我們可以用分組平均值去填充NA值:
In [102]: fill_mean = lambda g: g.fillna(g.mean())
In [103]: data.groupby(group_key).apply(fill_mean)
Out[103]:
Ohio 0.922264
New York -2.153545
Vermont -0.535707
Florida -0.375842
Oregon 0.329939
Nevada 0.717926
California 1.105913
Idaho 0.717926
dtype: float64
另外,也可以在代碼中預定義各組的填充值。由於分組具有一個name屬性,所以我們可以拿來用一下:
In [104]: fill_values = {'East': 0.5, 'West': -1}
In [105]: fill_func = lambda g: g.fillna(fill_values[g.name])
In [106]: data.groupby(group_key).apply(fill_func)
Out[106]:
Ohio 0.922264
New York -2.153545
Vermont 0.500000
Florida -0.375842
Oregon 0.329939
Nevada -1.000000
California 1.105913
Idaho -1.000000
dtype: float64