Introduction

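pandas is a core data-analysis library for Python. It provides fast, flexible, and expressive data structures designed to make working with relational or labeled data simple and intuitive. pandas aims to be the fundamental high-level building block for practical, real-world data analysis in Python, and its broader goal is to become the most powerful and flexible open-source data analysis tool available in any language. After years of sustained work, it is well on its way toward that goal.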

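Usage

To use pandas, simply import it with import pandas as pd. Several later examples also call NumPy functions and assume import numpy as np.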

In [2]: import pandas as pd

In [3]: df = pd.DataFrame()

In [4]: df
Out[4]:
Empty DataFrame
Columns: []
Index: []

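Data structures

Name	Dimensions	Description
Series	1	Labeled, one-dimensional, homogeneously typed array
DataFrame	2	Labeled, size-mutable, two-dimensional table with potentially heterogeneously typed columns

Series
A Series is a one-dimensional labeled array: it holds a sequence of values together with their data labels (the index).
DataFrame
A DataFrame represents a rectangular table of data. It contains an ordered collection of columns, each of which can hold a different value type. A DataFrame has both a row index and a column index, and can be thought of as a dictionary of Series that share the same index.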

Basic operations

Series

A Series can be created directly from an array-like object such as a list:

In [3]: a = pd.Series([5,-3,7,1.4])
In [4]: a
Out[4]:
0    5.0
1   -3.0
2    7.0
3    1.4
dtype: float64

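As the code above shows, the string representation of a Series puts the index on the left and the values on the right. Both can be accessed through the index and values attributes of the Series: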

In [5]: a.index
Out[5]: RangeIndex(start=0, stop=4, step=1)

In [6]: a.values
Out[6]: array([ 5. , -3. ,  7. ,  1.4])

If we do not want the default index, we can define our own. The code below supplies a custom index.

In [8]: a = pd.Series([5,-3,7,1.4],index = ['a','b','c','d'])

In [9]: a
Out[9]:
a    5.0
b   -3.0
c    7.0
d    1.4
dtype: float64

As with NumPy arrays, we can use indexing to access and modify the values of a Series.

# A single label
In [14]: a['a']
Out[14]: 5.0
# A list of labels
In [15]: a[['a','c']]
Out[15]:
a    5.0
c    7.0
dtype: float64
# Assignment
In [16]: a[['a','c']] = 5,9
In [17]: a
Out[17]:
a    5.0
b   -3.0
c    9.0
d    1.4
dtype: float64

In addition, a Series supports NumPy-style boolean masks, and, as in NumPy, arithmetic on pandas objects is vectorized, so no explicit for loop is needed.

# Filter with a boolean mask
In [19]: a[a>2]
Out[19]:
a    5.0
c    9.0
dtype: float64

# Arithmetic is applied to every element
In [20]: a*2
Out[20]:
a    10.0
b    -6.0
c    18.0
d     2.8
dtype: float64

If the data is stored in a dictionary, a Series can be created from the dictionary directly.

In [28]: dic = {'a':1,'b':5,'asd':-5,'c':7}

In [29]: pd.Series(dic)
Out[29]:
a      1
b      5
asd   -5
c      7
dtype: int64

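Older versions of Python did not guarantee the order of dictionary entries (Python 3.7 and later preserve insertion order). To get the values in a specific order, pass an explicit index. Any index label that is not a key of the dictionary gets the missing-value marker NaN (not a number).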

In [31]:a = pd.Series(dic,index= ['a','b','c','d'])
In [32]:a
Out[32]:
a    1.0
b    5.0
c    7.0
d    NaN
dtype: float64

# Missing values can be detected with the isnull and notnull methods.
In [38]: a.isnull()
Out[38]:
a    False
b    False
c    False
d     True
dtype: bool

In [39]: a.notnull()
Out[39]:
a     True
b     True
c     True
d    False
dtype: bool

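The most important feature of a Series is that it automatically aligns data by index label in arithmetic operations. What does that mean? Suppose Series A and B have the indexes ['a','b','c'] and ['b','c','d']. When an operation is applied to A and B, the values are aligned by label automatically, which is somewhat like a join in a database.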

In [4]: A = pd.Series([1,5,-7], index = ['a','b','c'])
In [5]: B = pd.Series([2,5,-2], index = ['b','c','d'])

In [6]: A+B
Out[6]:
a    NaN
b    7.0
c   -2.0
d    NaN
dtype: float64
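
The alignment described above can be reproduced in a short script. This is a minimal sketch that reuses A and B from the session; the call to add with fill_value=0 is an illustrative choice (the session above does not use it) for treating a label that is missing on one side as zero instead of producing NaN.

```python
import pandas as pd

A = pd.Series([1, 5, -7], index=['a', 'b', 'c'])
B = pd.Series([2, 5, -2], index=['b', 'c', 'd'])

# Plain + aligns on the union of the labels; labels present on only one side give NaN.
outer = A + B

# add() with fill_value substitutes 0 for the side that lacks a label (illustrative choice).
filled = A.add(B, fill_value=0)

print(outer)
print(filled)
```

Which of the two behaviors is appropriate depends on whether a missing label means "unknown" or "zero" in the data at hand.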

Both the Series object itself and its index have a name attribute; this feature comes up again later.

In [13]: a = pd.Series([1,-2,4])
In [14]: a.name = 'num'
In [15]: a.index.name = 'ind'

In [16]: a
Out[16]:
ind
0    1
1   -2
2    4
Name: num, dtype: int64

The index can be changed by assignment, as shown here. Because the index is replaced with a new one, the index name is lost as well.

In [17]: a.index = ['a','b','c']

In [18]: a
Out[18]:
a    1
b   -2
c    4
Name: num, dtype: int64

DataFrame

A DataFrame is a tabular data structure; you can think of it as a dictionary of Series that all share the same index.
There are several ways to construct a DataFrame:

# Pass a dict of equal-length lists or NumPy arrays
In [6]: dic = {'name':['zhao','qian','sun'],
   ...:         'old':[20,18,19],
   ...:         'sex':['male','female','male']}
In [7]: df = pd.DataFrame(dic)

In [8]: df
Out[8]:
   name  old     sex
0  zhao   20    male
1  qian   18  female
2   sun   19    male

# Build from a nested dict; missing entries are filled with NaN
In [52]: dic = {'name':{1:'zhang',2:'li'},'age':{1:24,2:23,0:19}}

In [53]: pd.DataFrame(dic)
Out[53]:
    name  age
1  zhang   24
2     li   23
0    NaN   19

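If a DataFrame is large and we only want to inspect its layout rather than print everything, head and tail show the first and last five rows by default (the df here has only three rows).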

In [12]: df.head()
Out[12]:
   name  old     sex
0  zhao   20    male
1  qian   18  female
2   sun   19    male

In [13]: df.tail()
Out[13]:
   name  old     sex
0  zhao   20    male
1  qian   18  female
2   sun   19    male

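When building a DataFrame from a dict, we can also specify the column order (a column with no matching key in the dict is filled with NaN) and, as with a Series, supply a custom index. Here dic is the dict of lists defined at the start of this section.

In [16]: pd.DataFrame(dic,columns = ['old','name','sex','time'],index = np.arange(1,4))
Out[16]:
   old  name     sex  time
1   20  zhao    male   NaN
2   18  qian  female   NaN
3   19   sun    male   NaN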

Reading columns
A column, or several columns, of a DataFrame can be retrieved with dict-like notation or by attribute access.

In [17]: df['name']
Out[17]:
0    zhao
1    qian
2     sun
Name: name, dtype: object

In [18]: df.sex
Out[18]:
0      male
1    female
2      male
Name: sex, dtype: object

In [19]: df[['name','sex']]
Out[19]:
   name     sex
0  zhao    male
1  qian  female
2   sun    male

Reading rows
Rows can be retrieved with the loc attribute.

In [26]: df.loc[1]
Out[26]:
name      qian
old         18
sex     female
Name: 1, dtype: object

Since we can read the data in a DataFrame, we can of course modify it as well.

In [31]: df['old'] = [22,19,17]
In [32]: df.sex = pd.Series(['male','male','male'])
In [33]: df.loc[1] = ['li',14,'femal']

In [34]: df
Out[34]:
   name  old    sex
0  zhao   22   male
1    li   14  femal
2   sun   17   male

How do we modify, insert, or delete a row or a column?

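# Note: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0;
# the append calls below only run on older versions (newer ones use pd.concat).
In [36]: se = pd.Series(['asd','asw','df'])
# Append a row; the Series index (0, 1, 2) matches no column label, so new columns appear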
In [37]: df.append(se,ignore_index=True)
Out[37]:
   name   old    sex    0    1    2
0  zhao  22.0   male  NaN  NaN  NaN
1    li  14.0  femal  NaN  NaN  NaN
2   sun  17.0   male  NaN  NaN  NaN
3   NaN   NaN    NaN  asd  asw   df
# Append a row, using the name of the Series as the row label
In [38]: se.name = 3
In [39]: df.append(se)
Out[39]:
      name   old    sex    0    1    2
0     zhao  22.0   male  NaN  NaN  NaN
1       li  14.0  femal  NaN  NaN  NaN
2      sun  17.0   male  NaN  NaN  NaN
3      NaN   NaN    NaN  asd  asw   df

# Add a column; note that a new column cannot be created with attribute syntax (df.test = ...)
In [40]: df['test']=se
In [41]: df
Out[41]:
   name  old    sex test
0  zhao   22   male  asd
1    li   14  femal  asw
2   sun   17   male   df
# Assign by index label; rows without a matching label become NaN
In [42]: df['old']=pd.Series([10,24],index = [2,1])
In [43]: df
Out[43]:
   name   old    sex test
0  zhao   NaN   male  asd
1    li  24.0  femal  asw
2   sun  10.0   male   df

# Delete a column
In [50]: del df['test']

In [51]: df
Out[51]:
   name   old    sex
0  zhao   NaN   male
1    li  24.0  femal
2   sun  10.0   male
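
Because DataFrame.append is no longer available in pandas 2.0 and later, the row-appending step above needs a different spelling there. The following is a minimal sketch, not the author's method: it rebuilds a frame like the one in the session and appends one row with pd.concat and, alternatively, by assigning to a new label through loc. The row values ('wang', 30, 'male') are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({'name': ['zhao', 'li', 'sun'],
                   'old': [22, 14, 17],
                   'sex': ['male', 'femal', 'male']})

# Hypothetical new row, written as a one-row DataFrame with matching column labels.
new_row = pd.DataFrame([{'name': 'wang', 'old': 30, 'sex': 'male'}])

# pd.concat stacks the frames; ignore_index=True renumbers the rows 0..n-1.
appended = pd.concat([df, new_row], ignore_index=True)

# Alternative: assign to a label that does not exist yet (enlarges the frame in place).
df2 = df.copy()
df2.loc[len(df2)] = ['wang', 30, 'male']

print(appended)
```

Giving the new row the same column labels as the frame avoids the extra 0, 1, 2 columns that appear in the session above.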

A DataFrame can also be transposed, much like a NumPy array:

In [55]: df.T
Out[55]:
         0      1     2
name  zhao     li   sun
old    NaN     24    10
sex   male  femal  male

If the name attributes of a DataFrame's index and columns are set, they are displayed as well:

In [57]: df
Out[57]:
   name   old    sex
0  zhao   NaN   male
1    li  24.0  femal
2   sun  10.0   male

# Set the names of the row index and the columns
In [58]: df.index.name = 'num'
In [59]: df.columns.name = 'state'

In [60]: df
Out[60]:
state  name   old    sex
num
0      zhao   NaN   male
1        li  24.0  femal
2       sun  10.0   male

We can get the row labels, column labels, and values of a DataFrame as follows.

# Row labels (index)
In [64]: df.index
Out[64]: RangeIndex(start=0, stop=3, step=1, name='num')
# Column labels
In [65]: df.columns
Out[65]: Index(['name', 'old', 'sex'], dtype='object', name='state')
# Values
In [66]: df.values
Out[66]:
array([['zhao', nan, 'male'],
       ['li', 24.0, 'femal'],
       ['sun', 10.0, 'male']], dtype=object)
# These label objects can be used to check whether a given label exists

Note that, unlike a Python set, a pandas Index can contain duplicate labels.

# Set every index label of the DataFrame to the same value
In [70]: df.index = [0,0,0]
In [71]: df
Out[71]:
state  name   old    sex
0      zhao   NaN   male
0        li  24.0  femal
0       sun  10.0   male

Essential functionality

reindex: rebuilding the index

Series

pd.Series.reindex(self, index=None, **kwargs)

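Parameter	Meaning
index	Array-like new index. Values are taken from the original `Series`; labels not present in it are filled with `NaN`.
method	Fill method for gaps, applicable only when the original index is monotonically increasing or decreasing: `None` (do not fill), `backfill`/`bfill` (fill a gap with the next valid value), `pad`/`ffill` (fill a gap with the previous valid value), `nearest` (fill with the nearest valid value; the index must support distance computation, for example a numeric index).
copy	Defaults to `True`: a new object is returned even if the passed index is the same as the current one. With `False`, pandas may skip the copy when the index is unchanged.
level	Broadcast across a level, matching index values on the passed `MultiIndex` level.
fill_value	Value to use for missing entries. Defaults to `NaN`, but can be any "compatible" value.
limit	Maximum number of consecutive elements to fill.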

In [9]: s =pd.Series([2,7,3,-2])
# Assigning to index modifies the original Series in place
In [10]: s.index = [1,2,3,4]
In [11]: s
Out[11]:
1    2
2    7
3    3
4   -2
dtype: int64
# reindex instead creates a new object with the new index;
# labels that do not exist in the original get NaN
In [12]: s.reindex([1,2,3,'a'])
Out[12]:
1    2.0
2    7.0
3    3.0
a    NaN
dtype: float64

We can use fill_value to fill the missing entries with a default value, or use method to fill them from neighboring existing values.

In [6]: a = pd.Series([ 1,  5,  8,  4, -2,  3,  7,  9, -4],
   ...:    index =['a','b','c','d','e','f','g','h','i'])
# Fill missing entries with the given value
In [7]: a.reindex(index = ['a','e','r','d'],fill_value = 0)
Out[7]:
a    1
e   -2
r    0
d    4
dtype: int64
# Fill each missing entry with the previous valid value
In [8]: a.reindex(index = ['a','y','z','r','d'],method = 'ffill')
Out[8]:
a    1
y   -4
z   -4
r   -4
d    4
dtype: int64

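Wait, why does ffill produce -4 here rather than 1? Remember that method only applies to a monotonically increasing or decreasing index, and that filling follows the label order of the original Series, not the order of the labels passed to reindex. The original index runs from a to i in increasing order, so for y, z, and r the previous valid label is i, whose value is -4. Now let us look at the intended usage: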

In [4]: a = pd.Series([1,5,8,4],index = ['a','e','f','g'])
In [5]: b = pd.Series([1,5,8,4],index = [0,4,5,6])

# ffill: fill each missing entry with the previous valid value
In [6]: a.reindex(index = ['a','b','c','e','f'],method = 'ffill')
Out[6]:
a    1
b    1
c    1
e    5
f    8
dtype: int64

# bfill: fill each missing entry with the next valid value
In [7]: a.reindex(index = ['a','b','c','e','f'],method = 'bfill')
Out[7]:
a    1
b    5
c    5
e    5
f    8
dtype: int64

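# nearest: fill each missing entry with the nearest valid value.
# When a label is equally distant from both neighbors, the larger label wins
# (the same result as bfill), as for label 2 below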
In [8]: b.reindex(index = [0,1,2,3,4,5],method = 'nearest')
Out[8]:
0    1
1    1
2    5
3    5
4    5
5    8
dtype: int64

# Use limit to cap the number of consecutive filled entries.
In [9]: b.reindex(index = [0,1,2,3,4,5],method ='nearest',limit=1)
Out[9]:
0    1.0
1    1.0
2    NaN
3    5.0
4    5.0
5    8.0
dtype: float64
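
The three fill methods can be compared side by side on the numeric-index Series b from the session. This is a small sketch for comparison only; the variable names target, ffilled, bfilled, and nearest are introduced here for illustration.

```python
import pandas as pd

# Same data as b in the session: a monotonically increasing integer index with gaps.
b = pd.Series([1, 5, 8, 4], index=[0, 4, 5, 6])
target = [0, 1, 2, 3, 4, 5]

ffilled = b.reindex(target, method='ffill')    # take the previous existing label
bfilled = b.reindex(target, method='bfill')    # take the next existing label
nearest = b.reindex(target, method='nearest')  # take the closest existing label

comparison = pd.DataFrame({'ffill': ffilled, 'bfill': bfilled, 'nearest': nearest})
print(comparison)
```

Labels 1 to 3 fall between the existing labels 0 and 4, which is where the three methods differ.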

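With copy, the default True returns a new object that is a copy of the original Series (a new object is returned even if the passed index is identical). With copy = False, pandas is allowed to skip the copy, but only when no new data has to be built, that is, when the passed index equals the existing one. When the new index differs, as in the example below, reindex has to construct new data in either case, so modifying either result leaves the original unchanged.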

In [15]: a = pd.Series([5,1,-7,3])
In [16]: copy_true = a.reindex(np.arange(1,5),copy = True)
In [17]: copy_false = a.reindex(np.arange(1,5),copy = False)
# Modifying copy_true leaves the original data unchanged
In [18]: copy_true[2] = 999
In [19]: copy_true
Out[19]:
1      1.0
2    999.0
3      3.0
4      NaN
dtype: float64

In [20]: a
Out[20]:
0    5
1    1
2   -7
3    3
dtype: int64
# Modifying copy_false also leaves the original unchanged here, because the new index differs from the old one
In [21]: copy_false[2] = 999
In [22]: copy_false
Out[22]:
1      1.0
2    999.0
3      3.0
4      NaN
dtype: float64

In [23]: a
Out[23]:
0    5
1    1
2   -7
3    3
dtype: int64

DataFrame

pd.DataFrame.reindex(
    self,
    labels=None,
    index=None,
    columns=None,
    axis=None,
    method=None,
    copy=True,
    level=None,
    fill_value=nan,
    limit=None,
    tolerance=None,
)

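The parameters are described below; those that behave as they do for Series are not repeated here.

Parameter	Description
labels	New labels/index to conform the axis specified by `axis` to.
axis	Axis that the labels apply to; either an axis name (`index`, `columns`) or a number (`0`, `1`).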

As with Series, labels that do not exist are filled with NaN by default, and fill_value can be used to supply a different value.

In [12]: df = pd.DataFrame(np.arange(9).reshape(3,3),
    ...:                   index = ['a','b','c'],
    ...:                   columns = ['A','B','C'])
# The original object
In [13]: df
Out[13]:
   A  B  C
a  0  1  2
b  3  4  5
c  6  7  8
# Missing entries are filled with NaN by default
In [14]: df.reindex(index=['b','e'],columns=['A','C','D'])
Out[14]:
     A    C   D
b  3.0  5.0 NaN
e  NaN  NaN NaN
# Fill missing entries with the given value
In [15]: df.reindex(index=['b','e'],columns=['A','C','D'],fill_value = -1)
Out[15]:
   A  C  D
b  3  5 -1
e -1 -1 -1

method can be used for filling here as well.

# Fill with ffill; bfill and nearest work analogously
In [16]: df.reindex(index=['b','e'],columns=['A','C','D'],method='ffill')
Out[16]:
   A  C  D
b  3  5  5
e  6  8  8

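Note that limit counts consecutive missing labels in the reindexed result, not the distance between labels in the original index. In the second example below, e is still filled from c because d is not part of the new index.

In [21]: df.reindex(index=['a','b','c','d','e'],
    ...:            columns=['A','B','C','D'],
    ...:            method='ffill',limit = 1)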
Out[21]:
     A    B    C    D
a  0.0  1.0  2.0  2.0
b  3.0  4.0  5.0  5.0
c  6.0  7.0  8.0  8.0
d  6.0  7.0  8.0  8.0
e  NaN  NaN  NaN  NaN

In [22]: df.reindex(index=['b','e'],
    ...:            columns=['A','C','D'],
    ...:            method='ffill',limit=1)
Out[22]:
   A  C  D
b  3  5  5
e  6  8  8

With axis we can specify which axis the labels passed as the first argument apply to.

In [24]: df
Out[24]:
   A  B  C
a  0  1  2
b  3  4  5
c  6  7  8
# The default axis is the index, so A is looked up among the row labels and not found
In [25]: df.reindex(['A'])
Out[25]:
    A   B   C
A NaN NaN NaN
# Use axis='columns' or axis=1 to apply the labels to the columns
In [26]: df.reindex(['A'],axis = 1)
Out[26]:
   A
a  0
b  3
c  6

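loc and iloc: label-based and position-based indexing

loc
Like reindex, loc selects data by label. One difference is that loc also supports assignment, which modifies the original object. loc accepts boolean masks too, much like NumPy. Label slices include both endpoints, whereas ordinary Python slices include the start but exclude the stop. loc accepts the following kinds of input:

Type	Explanation
Single label	For example `2` or `'a'`. Here 2 is a label that happens to be a number, not an integer position (the code below shows the difference).
List or array	A list or array of labels, for example `['a','c','d']`
Slice	A slice of labels, for example `'a':'f'`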

Series

In [28]: a = pd.Series([5,8,6,-7,3],index = ['a','b','c','d','e'])
In [29]: b = pd.Series([5,8,6,-7,3],index = range(0,5))
# Slicing by integer position: start inclusive, stop exclusive
In [30]: b[:3]
Out[30]:
0    5
1    8
2    6
dtype: int64
# Slicing by label: the numbers here are labels that happen to be numeric,
# not integer positions. Label slices include both endpoints.
In [31]: b.loc[:3]
Out[31]:
0    5
1    8
2    6
3   -7
dtype: int64
# A single label
In [32]: b.loc[3]
Out[32]: -7
In [33]: a.loc['c']
Out[33]: 6

# A list of labels
In [35]: a.loc[['c','b','d']]
Out[35]:
c    6
b    8
d   -7
dtype: int64
In [36]: b.loc[[1,4,2]]
Out[36]:
1    8
4    3
2    6
dtype: int64

# Slices
In [37]: a.loc[:'d']
Out[37]:
a    5
b    8
c    6
d   -7
dtype: int64
In [38]: b.loc[:4]
Out[38]:
0    5
1    8
2    6
3   -7
4    3
dtype: int64
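
The difference between label slices and position slices is easier to see when the labels are not 0, 1, 2, and so on. A minimal sketch, assuming a hypothetical Series whose integer labels (10 to 50) deliberately differ from its positions:

```python
import pandas as pd

# Hypothetical labels chosen so that labels and positions cannot be confused.
s = pd.Series([5, 8, 6, -7, 3], index=[10, 20, 30, 40, 50])

by_label = s.loc[10:30]    # label slice: both endpoints included -> labels 10, 20, 30
by_position = s.iloc[0:2]  # position slice: stop excluded -> positions 0 and 1

print(by_label)
print(by_position)
```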

DataFrame

In [10]: df = pd.DataFrame(np.arange(12).reshape(4,3),
    ...:                 index=['a','b','c','d'],
    ...:                 columns=['A','B','C'])

# A single label; note that a single row is returned as a Series
In [11]: df.loc['a']
Out[11]:
A    0
B    1
C    2
Name: a, dtype: int32
# Use [[ ]] to get a DataFrame instead
In [12]: df.loc[['a']]
Out[12]:
   A  B  C
a  0  1  2
# An index label and a column label together select a single value
In [13]: df.loc['a','A']
Out[13]: 0
# The forms can also be combined
In [14]: df.loc['a':'c','A']
Out[14]:
a    0
b    3
c    6
Name: A, dtype: int32
# Select rows with a boolean list (its length must match the number of rows)
In [15]: df.loc[[True,False,True,False]]
Out[15]:
   A  B  C
a  0  1  2
c  6  7  8
# Select one column
In [22]: df.loc[:,'A']
Out[22]:
a    0
b    3
c    6
d    9

iloc

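iloc is purely integer-position-based indexing. The input types it accepts are listed below.

Type	Description
Single integer	For example `5`
List or array	For example `[4,3,0]`
Slice	For example `1:7`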

In [4]: se = pd.Series([3,1,-5,7])

In [5]: df = pd.DataFrame(np.arange(12).reshape(4,3),
   ...:                 index=['a','b','c','d'],
   ...:               columns=['A','B','C'])

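For a one-dimensional Series, iloc accepts a single integer or a list of integers. For a two-dimensional DataFrame, it accepts one such indexer per axis, each of which can be a single integer or a list.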

# A single integer returns one value from a Series
In [6]: se.iloc[0]
Out[6]: 3
# For a DataFrame it returns one row as a Series
In [7]: df.iloc[0]
Out[7]:
A    0
B    1
C    2
Name: a, dtype: int32
# Passing a list returns a DataFrame instead
In [8]: df.iloc[[0]]
Out[8]:
   A  B  C
a  0  1  2

# Note that when two integers are passed directly,
# they are the row position and the column position
In [9]: df.iloc[0,1]
Out[9]: 1

# Likewise, passing a list on each axis returns a DataFrame
In [10]: df.iloc[[0],[1]]
Out[10]:
   B
a  1

# A one-dimensional Series therefore raises an error
In [11]: se.iloc[0,1]
IndexingError: Too many indexers

# To select several entries along one axis,
# pass a list in that axis's position
In [12]: se.iloc[[0,1]]
Out[12]:
0    3
1    1
dtype: int64
In [13]: df.iloc[[0,1]]
Out[13]:
   A  B  C
a  0  1  2
b  3  4  5

Slices work as well:

In [14]: se.iloc[:2]
Out[14]:
0    3
1    1
dtype: int64

# For a DataFrame, each axis can be sliced
In [15]: df.iloc[:2,1:]
Out[15]:
   B  C
a  1  2
b  4  5

A boolean mask can also be used for indexing (note that its length must match the axis).

In [20]: se.iloc[[True,False,True,False]]
Out[20]:
0    3
2   -5
dtype: int64
In [21]: df.iloc[[True,True,False,False],[False,False,True]]
Out[21]:
   C
a  2
b  5

We can also pass a function such as a lambda; it is called with the Series or DataFrame being indexed.

In [22]: se.iloc[lambda se:se.index%2==0]
Out[22]:
0    3
2   -5
dtype: int64

In [23]: df.iloc[:,lambda df:[1,2]]
Out[23]:
    B   C
a   1   2
b   4   5
c   7   8
d  10  11

drop: deleting entries along an axis

drop deletes entries along an axis according to their labels.

Series

In [10]: se
Out[10]:
a    5
b    7
c   -3
d   -6
dtype: int64
# Delete a single element
In [11]: se.drop('a')
Out[11]:
b    7
c   -3
d   -6
dtype: int64
# Delete several elements
In [12]: se.drop(['a','c'])
Out[12]:
b    7
d   -6
dtype: int64
# drop returns a new object; the original object is not modified
In [13]: se
Out[13]:
a    5
b    7
c   -3
d   -6
dtype: int64
# inplace defaults to False; with inplace=True the deletion is applied to the
# original object and nothing is returned
In [14]: se.drop(['a','c'],inplace = True)

In [15]: se
Out[15]:
b    7
d   -6
dtype: int64

DataFrame

In [20]: df
Out[20]:
   A   B   C
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
# Delete several columns
In [21]: df.drop(['A','B'],axis=1)
Out[21]:
    C
0   2
1   5
2   8
3  11
# Delete several rows
In [22]: df.drop([1,3])
Out[22]:
   A  B  C
0  0  1  2
2  6  7  8
# Delete a single row
In [23]: df.drop(1)
Out[23]:
   A   B   C
0  0   1   2
2  6   7   8
3  9  10  11
# Delete a single column
In [24]: df.drop('A',axis =1)
Out[24]:
    B   C
0   1   2
1   4   5
2   7   8
3  10  11
In [25]: df.drop('A',axis ='columns')
Out[25]:
    B   C
0   1   2
1   4   5
2   7   8
3  10  11
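
drop also accepts the index= and columns= keywords, which some readers find clearer than combining labels with axis. A short sketch on a frame like the one above; errors='ignore' is shown as an optional extra for labels that may be absent, and 'Z' is a hypothetical label that does not exist.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3), columns=['A', 'B', 'C'])

no_cols = df.drop(columns=['A', 'B'])  # same result as df.drop(['A', 'B'], axis=1)
no_rows = df.drop(index=[1, 3])        # same result as df.drop([1, 3])

# 'Z' does not exist; errors='ignore' skips it instead of raising KeyError.
unchanged = df.drop(columns=['Z'], errors='ignore')

print(no_cols)
print(no_rows)
```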

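Indexing, selection, and filtering

Series
One or more values can be selected from a Series with a single key or a sequence of keys. Slicing by label includes both endpoints.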

In [28]: se
Out[28]:
a    5
b    1
c    7
d   -6
# Slicing by position: start inclusive, stop exclusive
In [29]: se[1:3]
Out[29]:
b    1
c    7
# Slicing by label: both endpoints inclusive
In [30]: se['a':'c']
Out[30]:
a    5
b    1
c    7
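# Indexing by integer position (negative values work too). On a non-integer index this
# positional use of [] is deprecated in recent pandas versions; se.iloc[1] is unambiguous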
In [31]: se[1]
Out[31]: 1
In [32]: se[[1,3]]
Out[32]:
b    1
d   -6

# Indexing by label
In [33]: se['a']
Out[33]: 5
In [34]: se[['a','c']]
Out[34]:
a    5
c    7
# Assigning through an index modifies the original object
In [36]: se[[1,3]] = 5
In [37]: se
Out[37]:
a    5
b    5
c    7
d    5
# Boolean indexing also works
In [38]: se[se==7]
Out[38]:
c    7

DataFrame

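Type	Description
df[val]	Select a single column or a sequence of columns from the `DataFrame`; special cases: boolean array (filter rows), slice (slice rows), or boolean `DataFrame`
df.loc[val]	Select a single row or multiple rows by label
df.loc[:,val]	Select a single column or multiple columns by label
df.loc[val1,val2]	Select a single value by row and column label
df.iloc[where]	Select a single row or multiple rows by integer position
df.iloc[:,where]	Select a single column or multiple columns by integer position
df.iloc[where_i,where_j]	Select a single value by integer row and column position
df.at[label_i,label_j]	Select a single value by row and column label
df.iat[i,j]	Select a single value by integer row and column position
reindex method	Select rows or columns by label
get_value, set_value methods	Get or set a single value by row and column label (deprecated and no longer present in current pandas versions; use `at`/`iat` instead)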

One or more columns can be selected from a DataFrame with a single key or a sequence of keys.

In [50]: df
Out[50]:
   A   B   C
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11
# A single key
In [51]: df['A']
Out[51]:
0    0
1    3
2    6
3    9
Name: A, dtype: int32
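# A label slice inside a list is not valid Python syntax, so columns cannot be
# sliced this way (df.loc[:, 'A':'B'] slices columns by label)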
In [52]: df[['A':'B']]
SyntaxError: invalid syntax
# A sequence of keys
In [53]: df[['A','B']]
Out[53]:
   A   B
0  0   1
1  3   4
2  6   7
3  9  10

What if we want to get particular rows of a DataFrame? The following code shows how.

# Slice by integer position. Note that a single integer does not work here: it is treated as a column label
In [62]: df[:2]
Out[62]:
   A  B  C
0  0  1  2
1  3  4  5

# A boolean condition also works
In [63]: df[df['A']>2]
Out[63]:
   A   B   C
1  3   4   5
2  6   7   8
3  9  10  11

Besides the methods above, rows can also be selected with iloc and loc; see the earlier section on loc and iloc, which is not repeated here.

Axis indexes with duplicate labels

Series and DataFrame do not require labels to be unique, so the is_unique attribute of the index can be used to check whether they are.

In [33]: se = pd.Series([2,4,-1,8],index=['a','a','b','c'])
In [34]: df = pd.DataFrame(np.arange(9).reshape(3,3),index=['a','b','a'])
# Check whether the index labels are unique
In [35]: se.index.is_unique
Out[35]: False
In [36]: df.index.is_unique
Out[36]: False

In [35]: se['a']
Out[35]:
a    2
a    4
In [42]: df.loc['a']
Out[42]:
   0  1  2
a  0  1  2
a  6  7  8
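
One practical consequence of duplicate labels is that the return type of a lookup depends on the label: a duplicated label yields a Series, a unique one yields a scalar. The sketch below shows this on the se from the session and, as one possible remedy (an illustrative choice, not the only one), keeps the first occurrence of each label with Index.duplicated.

```python
import pandas as pd

se = pd.Series([2, 4, -1, 8], index=['a', 'a', 'b', 'c'])

dup = se['a']     # label occurs twice -> a Series with two rows
single = se['b']  # label occurs once  -> a scalar

# Keep only the first occurrence of every label (illustrative de-duplication).
deduped = se[~se.index.duplicated(keep='first')]

print(type(dup).__name__, single)
print(deduped)
```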

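Arithmetic operations

Operations between objects of the same type

As mentioned above, when two objects are added, the index of the result is the union of the two indexes, somewhat like an outer join in a database.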

In [3]: se_a = pd.Series([9,-1,3,4],index=['a','c','e','f'])
In [4]: se_b = pd.Series([8,4,6,-7],index=['a','b','c','d'])
In [5]: df_a = pd.DataFrame(np.arange(6).reshape(2,3),
   ...:                     columns=['A','C','D'])
In [6]: df_b = pd.DataFrame(np.arange(6).reshape(2,3),
   ...:                     columns=['A','B','C'])

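During the operation the values are aligned on matching labels automatically; positions where a label exists in only one object are filled with NaN.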

In [7]: se_a+se_b
Out[7]:
a    17.0
b     NaN
c     5.0
d     NaN
e     NaN
f     NaN
dtype: float64
In [8]: df_a+df_b
Out[8]:
   A   B  C   D
0  0 NaN  3 NaN
1  6 NaN  9 NaN

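Because NaN propagates, we sometimes want to avoid it by substituting a fill value for missing data during the operation. The arithmetic methods listed below support this. Each method also has a counterpart prefixed with r that takes its arguments in reversed order; for example, a.div(b) and b.rdiv(a) are equivalent.

Method	Operation
add/radd	Addition (+)
sub/rsub	Subtraction (-)
div/rdiv	Division (/)
floordiv/rfloordiv	Floor division (//)
mul/rmul	Multiplication (*)
pow/rpow	Exponentiation (**)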

# Fill missing entries
In [10]: df_a.add(df_b,fill_value=10)
Out[10]:
   A     B  C     D
0  0  11.0  3  12.0
1  6  14.0  9  15.0
# The / operator and rdiv
In [11]: 1/df_a
Out[11]:
          A     C    D
0       inf  1.00  0.5
1  0.333333  0.25  0.2

In [12]: df_a.rdiv(1)
Out[12]:
          A     C    D
0       inf  1.00  0.5
1  0.333333  0.25  0.2

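Operations between Series and DataFrame

Arithmetic between a Series and a DataFrame works like NumPy operations between arrays of different dimensions, where NumPy broadcasts the operation across every row.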

In [13]: arr_a = np.arange(6).reshape(3,2)
In [14]: arr_b = np.array([2,5])
# Broadcasting
In [15]: arr_a-arr_b
Out[15]:
array([[-2, -4],
       [ 0, -2],
       [ 2,  0]])

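Similarly, arithmetic between a Series and a DataFrame matches the Series index against the DataFrame columns and broadcasts down the rows.

In [19]: se = pd.Series([5,7,1],index=['A','B','C'])
In [20]: df = pd.DataFrame(np.arange(9).reshape(3,3),
    ...:                   columns=['A','B','C'])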

In [21]: df-se
Out[21]:
   A  B  C
0 -5 -6  1
1 -2 -3  4
2  1  0  7

# If the Series has a label that is not among the DataFrame columns,
# the objects are reindexed to the union of the labels
In [22]: se_1 = pd.Series([5,7,1,2],
    ...:                  index=['A','B','C','D'])
In [23]: df-se_1
Out[23]:
   A  B  C   D
0 -5 -6  1 NaN
1 -2 -3  4 NaN
2  1  0  7 NaN

To match on the row index instead and broadcast across the columns, use one of the arithmetic methods and pass axis:

In [29]: se_2 = df['A']
In [30]: se_2
Out[30]:
0    0
1    3
2    6
Name: A, dtype: int32

In [31]: df.sub(se_2,axis='index')
Out[31]:
   A  B  C
0  0  1  2
1  0  1  2
2  0  1  2
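
The two broadcasting directions can be contrasted in one script. A minimal sketch on the same 3x3 frame; the names by_row and by_col are introduced here for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(9).reshape(3, 3), columns=['A', 'B', 'C'])

# Default: the Series index is matched against the columns and the
# operation is repeated for every row.
by_row = df - df.iloc[0]

# axis='index': the Series index is matched against the row labels and the
# operation is repeated for every column.
by_col = df.sub(df['A'], axis='index')

print(by_row)
print(by_col)
```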

Function application and mapping

NumPy functions
NumPy's universal functions (element-wise array methods) also work on pandas objects.

In [3]: df = pd.DataFrame(np.random.randn(4,3),
   ...:             index=['a','b','c','d'],
   ...:             columns=['A','B','C'])

In [4]: df
Out[4]:
          A         B         C
a -0.584069  0.114854 -2.415498
b -0.550652  0.395374 -1.372510
c -0.315824  0.258919 -0.056640
d  0.036870 -0.445996  1.435676

In [5]: np.abs(df)
Out[5]:
          A         B         C
a  0.584069  0.114854  2.415498
b  0.550652  0.395374  1.372510
c  0.315824  0.258919  0.056640
d  0.036870  0.445996  1.435676

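apply
Apply a function along an axis of the DataFrame.

Parameter	Description
func	Function to apply to each column or row.
axis	Axis along which the function is applied. With the default `0`/`index` the function is applied to each column; with `1`/`columns` it is applied to each row.
raw	Defaults to `False`: each row or column is passed to the function as a `Series`. With `True`, it is passed as an `ndarray`.
result_type	`'expand'`: list-like results are turned into columns. `'reduce'`: return a Series if possible rather than expanding list-like results; the opposite of `'expand'`. `'broadcast'`: results are broadcast to the original shape of the `DataFrame`; the original index and columns are retained. `None` (default): the behavior depends on the return value of the applied function; list-like results are returned as a `Series` of those lists, but if the function returns a `Series`, these are expanded into columns.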

In [28]: df
Out[28]:
   A  B
a  2 -5
b  2 -5
c  2 -5
# By default (axis=0) the function is applied to each column, summing down the rows
In [29]: df.apply(np.sum)
Out[29]:
A     6
B   -15
dtype: int64
# With axis=1 the function is applied to each row, summing across the columns
In [30]: df.apply(np.sum,axis=1)
Out[30]:
a   -3
b   -3
c   -3
dtype: int64
# result_type=None (the default):
# list-like results are returned as a Series of lists
In [31]: df.apply(lambda x:[1,2])
Out[31]:
A    [1, 2]
B    [1, 2]
# result_type='expand':
# list-like results are expanded into columns;
# note that the index of the result changes
In [32]: df.apply(lambda x:[1,2],result_type='expand')
Out[32]:
   A  B
0  1  1
1  2  2
In [33]: df.apply(lambda x:[1,2],axis=1,result_type='expand')
Out[33]:
   0  1
a  1  2
b  1  2
c  1  2
# result_type='broadcast':
# the result is broadcast to the original shape. A list-like result must
# match that shape once expanded; a scalar always fits
In [34]: df.apply(lambda x:[1,2],result_type='broadcast')
ValueError: cannot broadcast result
# [1,2] can be expanded along axis=1 without changing the shape
In [35]: df.apply(lambda x:[1,2],result_type='broadcast',axis=1)
Out[35]:
   A  B
a  1  2
b  1  2
c  1  2
# With a scalar there is no need to worry about the shape
In [37]: df.apply(lambda x:1,result_type='broadcast')
Out[37]:
   A  B
a  1  1
b  1  1
c  1  1
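
Since the direction of axis in apply is easy to mix up, here is a small sketch with values chosen so that the two directions give visibly different results. The frame and the lambda functions are hypothetical examples, not taken from the session above.

```python
import pandas as pd

# Hypothetical data chosen so that column-wise and row-wise results differ.
df = pd.DataFrame({'A': [1, 4, 2], 'B': [10, -5, 3]}, index=['a', 'b', 'c'])

# axis=0 (default): the function receives one column at a time.
col_range = df.apply(lambda col: col.max() - col.min())

# axis=1: the function receives one row at a time.
row_sum = df.apply(lambda row: row.sum(), axis=1)

print(col_range)  # indexed by column label
print(row_sum)    # indexed by row label
```

The index of the result is a quick check of the direction: column labels mean the function saw columns, row labels mean it saw rows.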

Exercises
1. Find the maximum and minimum

In [46]: df
Out[46]:
          A         B         C
0 -0.364608 -0.925359  0.251871
1  1.308153 -0.983261  0.780449
2 -0.138446 -0.187765 -0.555508
3  0.358057  0.944677 -0.127748

In [47]: def f(x):
    ...:     return pd.Series([x.max(),x.min()],index=['max','min'])
    ...:

In [48]: df.apply(f)
Out[48]:
            A         B         C
max  1.308153  0.944677  0.780449
min -0.364608 -0.983261 -0.555508

2. Format every element to two decimal places

In [49]: df.applymap(lambda x:'%.2f' %x)
Out[49]:
       A      B      C
0  -0.36  -0.93   0.25
1   1.31  -0.98   0.78
2  -0.14  -0.19  -0.56
3   0.36   0.94  -0.13
# applymap is the DataFrame counterpart of Series.map
# (pandas 2.1 and later rename applymap to DataFrame.map)
In [50]: df['A'].map(lambda x:'{:.2f}'.format(x))
Out[50]:
0    -0.36
1     1.31
2    -0.14
3     0.36
Name: A, dtype: object
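
Because the name of the element-wise DataFrame method differs between pandas versions, a version-neutral way to get the same result is to apply Series.map to each column. The sketch below takes that route on a hypothetical two-column excerpt of the data and also shows round, which keeps the values numeric instead of turning them into strings.

```python
import pandas as pd

# Hypothetical excerpt of the random data used in the exercise.
df = pd.DataFrame({'A': [-0.364608, 1.308153],
                   'B': [-0.925359, -0.983261]})

# Element-wise string formatting, one column at a time (result holds strings).
formatted = df.apply(lambda col: col.map('{:.2f}'.format))

# Numeric rounding: the result stays float and can still be used in arithmetic.
rounded = df.round(2)

print(formatted)
print(rounded)
```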

Sorting and ranking

sort_index

To sort lexicographically by the row or column labels, use the sort_index method.

pd.DataFrame.sort_index(
   self,
   axis=0,
   level=None,
   ascending=True,
   inplace=False,
   kind='quicksort',
   na_position='last',
   sort_remaining=True,
   by=None,)

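Parameter	Description
axis	Axis whose labels are sorted: `index` (`0`) or `columns` (`1`)
level	`int` or level name, or a list of them; sort on the values of the specified index level(s)
ascending	Sort ascending or descending; defaults to `True` (ascending)
kind	Sorting algorithm; valid values are `quicksort`, `mergesort`, and `heapsort`, default `quicksort`
na_position	Where to put `NaN`: `last` (default) places them at the end, `first` at the beginning
sort_remaining	If `True`, then for a multi-level index the other levels are sorted as well, after sorting by the specified level.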

In [6]: df
Out[6]:
   C  A  D  B
b  0  1  2  3
a  4  5  6  7

In [7]: df.sort_index()
Out[7]:
   C  A  D  B
a  4  5  6  7
b  0  1  2  3

In [8]: df.sort_index(axis=1)
Out[8]:
   A  B  C  D
b  1  3  0  2
a  5  7  4  6

sort_values

To sort a Series by its values, use the sort_values method.

pd.Series.sort_values(
    self,
    axis=0,
    ascending=True,
    inplace=False,
    kind='quicksort',
    na_position='last',)

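Parameter	Description
ascending	Defaults to `True` (ascending)
na_position	Where to put `NaN`: `last` (default) places them at the end, `first` at the beginning
kind	Sorting algorithm; valid values are `quicksort`, `mergesort`, and `heapsort`, default `quicksort`
inplace	Defaults to `False`: operate on a copy and return the result. With `True`, the operation is performed in place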

In [9]: se = pd.Series([4,7,-2,3])
# Sort ascending
In [10]: se.sort_values()
Out[10]:
2   -2
3    3
0    4
1    7
# Sort descending
In [11]: se.sort_values(ascending=False)
Out[11]:
1    7
0    4
3    3
2   -2

For a DataFrame, use by to name the column or columns to sort by; otherwise it works the same way.

In [18]: df.sort_values(by='a')
Out[18]:
   a  b  c
2 -1  1  5
1  2  7 -7
0  2  3  4
3  6  8  2

In [19]: df.sort_values(by=['a','b'])
Out[19]:
   a  b  c
2 -1  1  5
0  2  3  4
1  2  7 -7
3  6  8  2
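
When sorting by several columns, ascending can be given as a list with one entry per column. The following sketch rebuilds the frame from the output shown above (the construction itself is an assumption, since the session does not show how df was created) and sorts by a ascending and b descending.

```python
import pandas as pd

# Reconstructed from the printed output above; the original construction is not shown.
df = pd.DataFrame({'a': [2, 2, -1, 6],
                   'b': [3, 7, 1, 8],
                   'c': [4, -7, 5, 2]})

# a ascending; ties in a are broken by b in descending order.
mixed = df.sort_values(by=['a', 'b'], ascending=[True, False])

# sort_index works on the labels rather than the values.
reversed_rows = df.sort_index(ascending=False)

print(mixed)
```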

rank

Ranking assigns ranks from 1 up to the number of valid data points in an array. The rank method of Series and DataFrame computes these ranks.

pd.Series.rank(
    self,
    axis=0,
    method='average',
    numeric_only=None,
    na_option='keep',
    ascending=True,
    pct=False,)

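Parameter	Description
method	How ties are ranked. `average`: the average rank of the group. `min`: the lowest rank in the group. `max`: the highest rank in the group. `first`: ranks assigned in the order the values appear in the array. `dense`: like `min`, but the rank always increases by `1` between groups.
numeric_only	Defaults to `None`; when enabled, only float, int, and boolean data are included.
na_option	`keep`: leave `NaN` values unranked (they stay `NaN`). `top`: assign `NaN` values the smallest ranks. `bottom`: assign `NaN` values the largest ranks.
ascending	Defaults to `True` (ascending); `False` ranks in descending order.
pct	When `True`, the ranks are returned as percentiles.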

In [23]: se = pd.Series([7,-5,7,4,2,0,4])
# The default method is average
In [24]: se.rank()
Out[24]:
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5

# Ties get the lowest rank in the group
In [25]: se.rank(method='min')
Out[25]:
0    6.0
1    1.0
2    6.0
3    4.0
4    3.0
5    2.0
6    4.0

# Ties get the highest rank in the group
In [26]: se.rank(method='max')
Out[26]:
0    7.0
1    1.0
2    7.0
3    5.0
4    3.0
5    2.0
6    5.0

# Ties are ranked in the order they appear
In [27]: se.rank(method='first')
Out[27]:
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0

# Like min, but dense does not skip rank numbers
In [28]: se.rank(method='dense')
Out[28]:
0    5.0
1    1.0
2    5.0
3    4.0
4    3.0
5    2.0
6    4.0

# Ranks as percentiles; can be combined with method
In [29]: se.rank(pct=True)
Out[29]:
0    0.928571
1    0.142857
2    0.928571
3    0.642857
4    0.428571
5    0.285714
6    0.642857
dtype: float64
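
A common use of rank is a leaderboard in which the largest value gets rank 1. The sketch below combines ascending=False with method='min' (ties share the better place and the next place is skipped) and with method='dense' (no places are skipped) on the same se as above; the names competition and dense are introduced here.

```python
import pandas as pd

se = pd.Series([7, -5, 7, 4, 2, 0, 4])

# Largest value first; tied values share the best rank of their group.
competition = se.rank(method='min', ascending=False)

# Same direction, but rank numbers increase by exactly 1 between groups.
dense = se.rank(method='dense', ascending=False)

print(pd.DataFrame({'value': se, 'competition': competition, 'dense': dense}))
```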

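Reductions and statistics

Reductions

pandas also provides a set of reduction methods similar to those in NumPy. Unlike the corresponding NumPy array methods, they have built-in handling for missing values.

Parameter	Description
axis	Axis to reduce over: `0`/`index` reduces down the rows (one result per column), `1`/`columns` reduces across the columns (one result per row)
skipna	Exclude missing values; defaults to `True`
level	If the axis is a `MultiIndex`, reduce along a particular level, collapsing into a `Series` (available in older pandas versions; newer ones use `groupby(level=...)` instead).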

In [9]: df
Out[9]:
      A    B
a  1.40  NaN
b  7.10 -4.2
c   NaN  NaN
d  0.75 -1.3

In [10]: df.sum()
Out[10]:
A    9.25
B   -5.50
dtype: float64

In [11]: df.sum(axis=1)
Out[11]:
a    1.40
b    2.90
c    0.00
d   -0.55
dtype: float64

In [12]: df.mean(axis=1,skipna=False)
Out[12]:
a      NaN
b    1.450
c      NaN
d   -0.275
dtype: float64
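
In the output above, row c sums to 0.00 even though it contains no valid value at all. If that should count as missing instead, sum accepts a min_count argument. A minimal sketch on the same data; the names default and strict are introduced here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1.40, 7.10, np.nan, 0.75],
                   'B': [np.nan, -4.2, np.nan, -1.3]},
                  index=['a', 'b', 'c', 'd'])

default = df.sum(axis=1)              # NaN is skipped; an all-NaN row sums to 0.0
strict = df.sum(axis=1, min_count=1)  # at least one valid value is required, else NaN

print(pd.DataFrame({'default': default, 'strict': strict}))
```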

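Statistics

Some commonly used statistical methods are listed below.

Method	Description
count	Number of non-NA values
describe	Compute a set of summary statistics for a `Series` or for each column of a `DataFrame`
min,max	Compute the minimum and maximum values
argmin,argmax	Compute the integer positions of the minimum and maximum values, respectively
idxmin,idxmax	Compute the index labels of the minimum and maximum values, respectively
quantile	Compute a sample quantile, ranging from 0 to 1
sum	Sum of the values
mean	Mean of the values
median	Median (50% quantile)
mad	Mean absolute deviation from the mean (removed in pandas 2.0)
prod	Product of all values
var	Sample variance
std	Sample standard deviation
skew	Sample skewness (third moment)
kurt	Sample kurtosis (fourth moment)
cumsum	Cumulative sum
cummin,cummax	Cumulative minimum or maximum
cumprod	Cumulative product
diff	First arithmetic difference (useful for time series)
pct_change	Percent change between consecutive values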
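
A few of the methods in the table, tried on a small hypothetical Series. Because mad is not available in pandas 2.0 and later, the sketch computes the mean absolute deviation by hand, which is an assumption about how one might replace it rather than an official substitute.

```python
import pandas as pd

# Hypothetical data for illustration.
s = pd.Series([3, 1, 4, 1, 5], index=['a', 'b', 'c', 'd', 'e'])

label_of_max = s.idxmax()  # index label of the largest value
label_of_min = s.idxmin()  # index label of the first smallest value
running_total = s.cumsum()
step_changes = s.diff()    # first entry is NaN: there is no previous value

# Manual mean absolute deviation around the mean (stand-in for the removed mad).
mean_abs_dev = (s - s.mean()).abs().mean()

summary = s.describe()
print(label_of_max, label_of_min, mean_abs_dev)
print(summary)
```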

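Unique values, value counts, and membership

A one-dimensional Series may contain many repeated values. The unique method returns the distinct values, and value_counts computes how often each value occurs.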

In [46]: se = pd.Series(['c','a','d','a','a','b','b','c','c'])
# Distinct values
In [47]: se.unique()
Out[47]: array(['c', 'a', 'd', 'b'], dtype=object)
# Value frequencies
In [48]: se.value_counts()
Out[48]:
a    3
c    3
b    2
d    1
# Pass sort=False to skip sorting by count
In [49]: se.value_counts(sort=False)
Out[49]:
d    1
c    3
a    3
b    2

isin tests membership and returns a boolean mask, which can be used to keep only the values we want.

In [50]: se.isin(['a','b'])
Out[50]:
0    False
1     True
2    False
3     True
4     True
5     True
6     True
7    False
8    False
dtype: bool

In [51]: mask =se.isin(['a','b'])

In [52]: se[mask]
Out[52]:
1    a
3    a
4    a
5    b
6    b
dtype: object

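Related to isin is the Index.get_indexer method, which returns an array of integer positions that maps an array of possibly non-unique values onto an index of unique values. For each value of the Series, the result holds the position of that value in the index, or -1 if the value is not present.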

In [60]: index=pd.Index(['a','b','c'])

In [61]: index.get_indexer(se)
Out[61]: array([ 2,  0, -1,  0,  0,  1,  1,  2,  2], dtype=int32)
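
The positions returned by get_indexer can be used directly, for example to separate known values from unknown ones. A short sketch with a hypothetical, shortened input; the names positions and known are introduced here.

```python
import pandas as pd

index = pd.Index(['a', 'b', 'c'])
values = pd.Series(['c', 'a', 'd', 'a'])  # 'd' does not occur in index

# Position of each value within index, -1 where there is no match.
positions = index.get_indexer(values)

# Keep only the values that were found.
known = values[positions != -1]

print(positions)
print(known)
```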

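To count how often each value occurs in every column of a DataFrame, pass pd.value_counts to the DataFrame's apply method. (In recent pandas versions the top-level pd.value_counts is deprecated; passing lambda col: col.value_counts() gives the same result.)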

In [70]: df
Out[70]:
   A  B  C
0  1  2  1
1  3  3  5
2  4  1  2
3  3  2  4
4  4  3  4

In [71]: df.apply(pd.value_counts)
Out[71]:
     A    B    C
1  1.0  1.0  1.0
2  NaN  2.0  1.0
3  2.0  2.0  NaN
4  2.0  NaN  2.0
5  NaN  NaN  1.0

In [72]: df.apply(pd.value_counts).fillna(0)
Out[72]:
     A    B    C
1  1.0  1.0  1.0
2  0.0  2.0  1.0
3  2.0  2.0  0.0
4  2.0  0.0  2.0
5  0.0  0.0  1.0
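
The per-column count can also be written with the Series method value_counts, which does not depend on the top-level function. The sketch below uses the same data as the session; converting the result to integers with astype(int) and ordering it with sort_index are optional illustrative steps.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 3, 4, 3, 4],
                   'B': [2, 3, 1, 2, 3],
                   'C': [1, 5, 2, 4, 4]})

# Count values per column, fill values a column never contains with 0,
# and convert the counts back to integers.
counts = (df.apply(lambda col: col.value_counts())
            .fillna(0)
            .astype(int)
            .sort_index())

print(counts)
```

The row labels of the result are the distinct values found anywhere in the frame, and each cell tells how often that value occurs in the column.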

Dis_illusion
pandas: The Basics