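# Chapter 6: Missing Data

Since version 1.0, pandas has been experimenting with new data types, most notably the Nullable types and the string type. These features may well become mainstream, so it is worth getting to know them.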

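``````
import pandas as pd
import numpy as np
# Assumption: df is loaded from the same file that the convert_dtypes section reads
df = pd.read_csv('data/table_missing.csv')
``````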

## I. Missing Observations and Their Types

### 1. Inspecting missing information

(a) The isna and notna methods

``````
df['Physics'].isna().head()
``````

``````
df['Physics'].notna().head()
``````

``````
df.isna().head()
``````

``````
df.isna().sum()
``````

``````
df.info()
``````

(b) View the rows that contain missing values

``````
df[df['Physics'].isna()]
``````

(c) Select all rows with no missing values

``````
df[df.notna().all(axis=1)]
``````
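
The snippets above depend on the chapter's data file. As a self-contained sketch of the same three operations, the example below uses a small made-up table; `df_demo` and its contents are hypothetical and are not the author's data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the chapter's table
df_demo = pd.DataFrame({'Name': ['Ann', 'Bob', 'Cid'],
                        'Physics': ['A+', np.nan, 'B-'],
                        'Math': [31.2, 45.0, np.nan]})

# Number of missing values in each column
missing_per_column = df_demo.isna().sum()

# Rows where Physics is missing
rows_missing_physics = df_demo[df_demo['Physics'].isna()]

# Rows that contain no missing value at all
complete_rows = df_demo[df_demo.notna().all(axis=1)]

print(missing_per_column)
print(rows_missing_physics)
print(complete_rows)
```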

### 2. Three missing-value markers

(a) np.nan

np.nan is troublesome. First of all, it is not equal to anything, not even to itself.
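
A minimal sketch of this behaviour, with illustrative variable names. Because `==` never matches NaN, missing values have to be detected with `isna` (or `np.isnan`) rather than with an equality test.

```python
import numpy as np
import pandas as pd

# NaN compares unequal to everything, including itself
nan_equals_self = (np.nan == np.nan)
nan_equals_zero = (np.nan == 0)

# An equality test therefore finds no missing values; isna does
s_demo = pd.Series([1.0, np.nan, 3.0])
found_by_eq = int((s_demo == np.nan).sum())
found_by_isna = int(s_demo.isna().sum())

print(nan_equals_self, nan_equals_zero, found_by_eq, found_by_isna)
```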

``````
# equals treats NaNs in matching positions as equal
df.equals(df)
``````

``````
pd.Series([1,np.nan,3],dtype='bool')
``````

``````
s = pd.Series([True,False],dtype='bool')
s[1]=np.nan
s
``````

(b) None

None is slightly better behaved than np.nan: at least it is equal to itself.

``````
None == None
``````

``````
pd.Series([None],dtype='bool')
``````

``````
s = pd.Series([True,False],dtype='bool')
s[0]=None
s
``````

``````
type(pd.Series([1,None])[1])
``````

``````
type(pd.Series([1,None],dtype='O')[1])
``````

``````
pd.Series([None]).equals(pd.Series([np.nan]))
``````

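(c) NaT

NaT is the missing value for time series and is built into pandas. It can be regarded as the datetime counterpart of np.nan: it is not equal to itself, and it is likewise skipped when comparing with equals.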
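
The following sketch checks these claims on a tiny datetime Series; `s_demo` is an illustrative name, and the example assumes that assigning `None` or `np.nan` into a datetime Series stores `NaT`, which is what the cells below also demonstrate.

```python
import numpy as np
import pandas as pd

# NaT is not equal to itself, just like np.nan
nat_equals_self = (pd.NaT == pd.NaT)

# Both None and np.nan are stored as NaT in a datetime Series
s_demo = pd.Series([pd.Timestamp('2012-01-01')] * 3)
s_demo.iloc[1] = None
s_demo.iloc[2] = np.nan

# equals still reports True because missing positions are skipped
same = s_demo.equals(s_demo)

print(nat_equals_self, s_demo.isna().tolist(), same)
```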

``````
s_time = pd.Series([pd.Timestamp('20120101')]*5)
s_time
``````

``````
s_time[2] = None
s_time
``````

``````
s_time[2] = np.nan
s_time
``````

``````
s_time[2] = pd.NaT
s_time
``````

``````
type(s_time[2])
``````

``````
s_time[2] == s_time[2]
``````

``````
s_time.equals(s_time)
``````

``````
s = pd.Series([True,False],dtype='bool')
s[1]=pd.NaT
s
``````

### 3. Nullable types and the NA marker

“The goal of pd.NA is provide a “missing” indicator that can be used consistently across data types (instead of np.nan, None or pd.NaT depending on the data type).”——User Guide for Pandas v-1.0
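
As a rough illustration of that goal, the sketch below builds one Series for each of the three nullable dtypes covered in this section and checks that the missing entry is reported as missing without the dtype changing. The variable names are illustrative.

```python
import pandas as pd

# One Series per nullable dtype, each with a missing second entry
s_int = pd.Series([1, None], dtype="Int64")
s_bool = pd.Series([True, None], dtype="boolean")
s_str = pd.Series(["a", None], dtype="string")

for s_demo in (s_int, s_bool, s_str):
    # The dtype is kept, and the second entry is flagged as missing
    print(s_demo.dtype, s_demo.isna().tolist())
```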

(a) Nullable integer

``````
s_original = pd.Series([1, 2], dtype="int64")
s_original
``````

``````
s_new = pd.Series([1, 2], dtype="Int64")
s_new
``````

``````
s_original[1] = np.nan
s_original
``````

``````
s_new[1] = np.nan
s_new
``````

``````
s_new[1] = None
s_new
``````

``````
s_new[1] = pd.NaT
s_new
``````

(b) Nullable boolean

``````
s_original = pd.Series([1, 0], dtype="bool")
s_original
``````

``````
s_new = pd.Series([0, 1], dtype="boolean")
s_new
``````

``````
s_original[0] = np.nan
s_original
``````

``````
s_original = pd.Series([1, 0], dtype="bool") # re-created here because the assignment above changed the bool dtype
s_original[0] = None
s_original
``````

``````
s_new[0] = np.nan
s_new
``````

``````
s_new[0] = None
s_new
``````

``````
s_new[0] = pd.NaT
s_new
``````

``````
s = pd.Series(['dog','cat'])
s[s_new]
``````

(c) string type

``````
s = pd.Series(['dog','cat'],dtype='string')
s
``````

``````
s[0] = np.nan
s
``````

``````
s[0] = None
s
``````

``````
s = pd.Series(["a", None, "b"], dtype="string")
s.str.count('a')
``````

``````
s2 = pd.Series(["a", None, "b"], dtype="object")
s2.str.count("a")
``````

``````
s.str.isdigit()
``````

``````
s2.str.isdigit()
``````

### 4. Properties of NA

(a) Logical operations
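
No code accompanies this item in the source. The sketch below shows how `pd.NA` behaves under `|` and `&` according to the pandas documentation (three-valued logic): the result is definite when it does not depend on the missing operand, and `NA` otherwise. Variable names are illustrative.

```python
import pandas as pd

# The result is known whatever the missing operand would have been
true_or_na = True | pd.NA
false_and_na = False & pd.NA

# The result depends on the missing operand, so it stays NA
false_or_na = False | pd.NA
true_and_na = True & pd.NA

# The same rules apply element-wise to a nullable boolean Series
s_demo = pd.Series([True, False, None], dtype="boolean")
or_true = s_demo | True
and_true = s_demo & True

print(true_or_na, false_and_na, false_or_na, true_and_na)
print(or_true.tolist(), and_true.tolist())
```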

(b) Arithmetic and comparison operations
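
The source gives no code here either. As a sketch based on the documented behaviour of `pd.NA`: arithmetic and comparisons generally propagate `NA`, with the exception of powers whose result does not depend on the missing value. Variable names are illustrative.

```python
import pandas as pd

# Arithmetic and comparisons with NA propagate NA
na_plus_one = pd.NA + 1
na_eq_na = (pd.NA == pd.NA)   # NA, neither True nor False

# Special cases where the result is the same for any value
na_pow_zero = pd.NA ** 0
one_pow_na = 1 ** pd.NA

print(na_plus_one, na_eq_na, na_pow_zero, one_pow_na)
```

Because `pd.NA == pd.NA` is itself `NA`, use `pd.isna` rather than `==` to test for it.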

### 5. The convert_dtypes method

``````
pd.read_csv('data/table_missing.csv').dtypes
``````

``````
pd.read_csv('data/table_missing.csv').convert_dtypes().dtypes
``````
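
For readers without the data file, here is a self-contained sketch of the same comparison. The CSV text is made up for illustration and is not the content of `data/table_missing.csv`.

```python
import io
import pandas as pd

# Hypothetical CSV content with one missing value per column
csv_text = "id,name,passed\n1,dog,True\n,cat,\n3,,False\n"

raw = pd.read_csv(io.StringIO(csv_text))
converted = raw.convert_dtypes()

# Compare the dtypes before and after conversion
print(raw.dtypes)
print(converted.dtypes)
```

After conversion the columns use the nullable dtypes from the previous sections, so the integer column no longer has to be stored as float just because it contains a missing value.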

## II. Computation and Grouping with Missing Data

### 1. Rules for addition and multiplication

``````
s = pd.Series([2,3,np.nan,4])
s.sum()
``````

``````
s.prod()
``````

``````
s.cumsum()
``````

``````
s.cumprod()
``````

``````
s.pct_change()
``````
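
The rule behind these results can be sketched as follows: by default a missing value is skipped, so it acts like 0 in a sum and like 1 in a product, while cumulative methods leave the missing position as NaN and carry on afterwards. The `skipna=False` call is added here for contrast and does not appear in the source.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([2, 3, np.nan, 4])

# NaN is skipped by default
total = s_demo.sum()      # 2 + 3 + 4
product = s_demo.prod()   # 2 * 3 * 4

# With skipna=False the missing value propagates instead
total_strict = s_demo.sum(skipna=False)

# The cumulative sum keeps NaN in place and continues after it
running = s_demo.cumsum()

print(total, product, total_strict)
print(running)
```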

### 2. Missing values in groupby

``````
df_g = pd.DataFrame({'one':['A','B','C','D',np.nan],'two':np.random.randn(5)})
df_g
``````

``````
df_g.groupby('one').groups
``````
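
A small sketch of what happens to the missing key, using fixed values instead of random ones. The `dropna=False` option is not mentioned in the source; it is available in pandas 1.1 and later and is shown only for contrast.

```python
import numpy as np
import pandas as pd

df_demo = pd.DataFrame({'one': ['A', 'B', 'C', 'D', np.nan],
                        'two': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Default behaviour: the row whose key is NaN belongs to no group
default_keys = list(df_demo.groupby('one').groups)

# dropna=False keeps NaN as a group of its own (pandas >= 1.1)
sizes_with_nan = df_demo.groupby('one', dropna=False).size()

print(default_keys)
print(sizes_with_nan)
```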

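## III. Filling and Dropping

### 1. The fillna method

(a) Filling with a value, and forward/backward filling (the latter two are equivalent to the ffill and bfill methods, respectively)

``````
df['Physics'].fillna('missing').head()
``````

``````
df['Physics'].fillna(method='ffill').head()
``````

``````
df['Physics'].fillna(method='backfill').head()
``````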
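
The equivalent `ffill` and `bfill` methods can be sketched on a small Series as below. Note that recent pandas releases (2.1 and later) deprecate the `method=` argument of `fillna`, so the dedicated methods are the safer spelling there. `s_demo` is an illustrative name.

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([1.0, np.nan, np.nan, 4.0])

forward = s_demo.ffill()      # carry the last valid value forward
backward = s_demo.bfill()     # carry the next valid value backward
constant = s_demo.fillna(0)   # replace with a fixed value

print(forward.tolist(), backward.tolist(), constant.tolist())
```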

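(b) Alignment when filling

``````
df_f = pd.DataFrame({'A':[1,3,np.nan],'B':[2,4,np.nan],'C':[3,5,np.nan]})
# each column is filled with its own mean, matched by column label
df_f.fillna(df_f.mean())
``````

``````
# only A and B are filled; C has no matching label and keeps its NaN
df_f.fillna(df_f.mean()[['A','B']])
``````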

### 2. The dropna method

(a) The axis parameter

``````
df_d = pd.DataFrame({'A':[np.nan,np.nan,np.nan],'B':[np.nan,3,2],'C':[3,2,1]})
df_d
``````

``````
df_d.dropna(axis=0)
``````

``````
df_d.dropna(axis=1)
``````

(b) The how parameter (either all or any: drop when all values are missing, or when any value is missing)

``````
df_d.dropna(axis=1,how='all')
``````

(c) The subset parameter (search for missing values only within the given set of columns)

``````
df_d.dropna(axis=0,subset=['B','C'])
``````

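## IV. Interpolation

### 1. Linear interpolation

(a) Linear interpolation independent of the index

``````
s = pd.Series([1,10,15,-5,-2,np.nan,np.nan,28])
s
``````

``````
s.interpolate()
``````

``````
s.interpolate().plot()
``````

``````
s.index = np.sort(np.random.randint(50,300,8))
s.interpolate()
# the interpolated values are unchanged
``````

``````
s.interpolate().plot()
# the last three points no longer lie on a straight line (if they look almost linear, re-run the previous cell; this is due to randomness)
``````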

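(b) Index-dependent interpolation

The index and time options of method make the interpolation depend linearly on the index, that is, the interpolated values are a linear function of the index.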
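
A worked example with a deliberately uneven index makes the difference concrete. The Series below is made up for illustration: the missing point sits at index 1, between index 0 (value 0) and index 10 (value 10).

```python
import numpy as np
import pandas as pd

s_demo = pd.Series([0.0, np.nan, 10.0], index=[0, 1, 10])

# Default linear method: points are treated as equally spaced,
# so the gap is filled with the midpoint of 0 and 10
by_position = s_demo.interpolate()

# method='index': the fill is weighted by distance along the index,
# and index 1 is one tenth of the way from 0 to 10
by_index = s_demo.interpolate(method='index')

print(by_position.loc[1], by_index.loc[1])
```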

``````
s.interpolate(method='index').plot()
# compare with the plot above to see the difference
``````

``````
s_t = pd.Series([0,np.nan,10],
                index=[pd.Timestamp('2012-05-01'),pd.Timestamp('2012-05-07'),pd.Timestamp('2012-06-03')])
s_t
``````

``````
s_t.interpolate().plot()
``````

``````
s_t.interpolate(method='time').plot()
``````
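
The value that `method='time'` produces for this Series can be checked by hand. In the sketch below, `expected` is computed from the date gaps: the missing point falls 6 days after the first observation, out of the 33 days separating the two known points.

```python
import numpy as np
import pandas as pd

idx = pd.to_datetime(['2012-05-01', '2012-05-07', '2012-06-03'])
s_demo = pd.Series([0.0, np.nan, 10.0], index=idx)

by_time = s_demo.interpolate(method='time')

# 6 of the 33 days have elapsed, so the value is 10 * 6 / 33
expected = 10 * 6 / 33

print(by_time.iloc[1], expected)
```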

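### 2. Advanced interpolation methods

``````
import pandas as pd
import numpy as np
ser = pd.Series(np.arange(1, 10.1, .25) ** 2 + np.random.randn(37))
missing = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29])
ser[missing] = np.nan
# Assumption: methods was undefined in the source; this list follows the pandas user guide
# ('quadratic' and 'cubic' require SciPy)
methods = ['linear', 'quadratic', 'cubic']
df = pd.DataFrame({m: ser.interpolate(method=m) for m in methods})
df.plot()
``````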

### 3. Limit parameters in interpolate

(a) limit sets the maximum number of consecutive missing values to fill

``````
s = pd.Series([1,np.nan,np.nan,np.nan,5])
s.interpolate(limit=2)
``````

(b) limit_direction sets the direction of interpolation; the options are forward, backward and both, and the default is forward

``````
s = pd.Series([np.nan,np.nan,1,np.nan,np.nan,np.nan,5,np.nan,np.nan])
s.interpolate(limit_direction='backward')
``````

Written on 2020-05-22 in Chengkou.