Data Science Packages: pandas Basics (Core Data Structures)

I. Series

A Series is a one-dimensional labeled array that can hold any data type (integers, floats, strings, Python objects). The basic constructor call is:

s = pd.Series(data, index=index)

Here index is a list of labels for the data, and data can be one of several types:

  • a Python dict
  • an ndarray
  • a scalar value, such as 5

1. Creation

1.1 From an ndarray

>>> import numpy as np
>>> import pandas as pd
>>> s=pd.Series(np.random.randn(5),index=['a','b','c','d','e'])
>>> s
a   -0.485521
b   -0.286831
c    1.292780
d   -0.625325
e   -0.936284
dtype: float64

>>> s.index
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Note that the S in Series must be capitalized. When no index is given, pandas generates a default integer index:

>>> s=pd.Series(np.random.randn(5))
>>> s
0   -1.657662
1    0.149248
2    1.728224
3    0.058451
4    0.345831
dtype: float64

>>> s.index
RangeIndex(start=0, stop=5, step=1)

1.2 From a dict

Create a dict d and convert it directly to a Series:

>>> d = {'a' : 0., 'b' : 1., 'd' : 3}
>>> s=pd.Series(d)
>>> s
a    0.0
b    1.0
d    3.0
dtype: float64

With a custom index, labels that have no matching key in the dict get NaN:

>>> d = {'a' : 0., 'b' : 1., 'd' : 3}
>>> s=pd.Series(d,index=list('absd'))
>>> s
a    0.0
b    1.0
s    NaN
d    3.0
dtype: float64

1.3 From a scalar

>>> s=pd.Series(3,index=range(5))
>>> s
0    3
1    3
2    3
3    3
4    3
dtype: int64

2. Series objects

2.1 Series is ndarray-like

NumPy-style indexing, slicing, and functions also work on a Series.

>>> s = pd.Series(np.random.randn(5))
>>> s
0   -0.104885
1    0.375955
2    1.305717
3    0.441162
4   -0.598452
dtype: float64
>>> s[0]
-0.10488490668673565
>>> s[3:]
3    0.441162
4   -0.598452
dtype: float64
>>> np.exp(s)
0    0.900428
1    1.456382
2    3.690336
3    1.554513
4    0.549662
dtype: float64

2.2 Series is dict-like

>>> s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
>>> s
a    0.184751
b   -0.006316
c   -1.113671
d   -2.804318
e    1.493505
dtype: float64

>>> s['a']
0.18475101331017024

>>> s['e']=3
>>> s
a    0.184751
b   -0.006316
c   -1.113671
d   -2.804318
e    3.000000
dtype: float64

>>> s['g'] = 100
>>> s
a      0.184751
b     -0.006316
c     -1.113671
d     -2.804318
e      3.000000
g    100.000000
dtype: float64

>>> 'e' in s
True

>>> print( s.get('f'))
None
>>> print( s.get('f', np.nan))
nan
>>> print( s.get('f', 5))
5
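
The get calls above matter because square-bracket access behaves differently for a missing label. A small sketch of the contrast, using made-up values and the hypothetical label 'z':

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=['a', 'b', 'c'])

# Square-bracket access raises KeyError for a missing label ...
try:
    s['z']
    found = True
except KeyError:
    found = False

# ... while get() returns None, or the supplied default, instead.
missing = s.get('z')
fallback = s.get('z', 0.0)

print(found, missing, fallback)  # False None 0.0
```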

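3. Label alignment

Operations between two Series align on labels; a label that is missing from either side yields NaN.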

>>> s1 = pd.Series(np.random.randn(3), index=['a', 'c', 'e'])
>>> s2 = pd.Series(np.random.randn(3), index=['a', 'd', 'e'])
>>> print('{0}\n\n{1}'.format(s1, s2))
a   -0.123366
c   -0.434903
e   -1.064005
dtype: float64

a    0.784026
d   -1.846238
e   -1.247743
dtype: float64

>>> s1 + s2
a    0.660660
c         NaN
d         NaN
e   -2.311748
dtype: float64
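
When NaN for unmatched labels is not what you want, the add method with fill_value is one option. A minimal sketch with fixed values (the numbers are illustrative, not those of the session above):

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=['a', 'c', 'e'])
s2 = pd.Series([10.0, 20.0, 30.0], index=['a', 'd', 'e'])

# Plain + yields NaN wherever a label exists on only one side.
plain = s1 + s2

# add() with fill_value treats the missing side as 0 before adding.
filled = s1.add(s2, fill_value=0)

print(plain.isna().tolist())  # [False, True, True, False]
print(filled.tolist())        # [11.0, 2.0, 20.0, 33.0]
```

In both cases the result index is the union of the two indexes: a, c, d, e.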

4. The name attribute

>>> s = pd.Series(np.random.randn(5), name='Some Thing')
>>> s
0   -0.025971
1    1.427484
2    0.684746
3    0.928511
4    0.097620
Name: Some Thing, dtype: float64
>>> s.name
'Some Thing'
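
The name is more than a display detail: it becomes the column label when a Series is turned into a DataFrame. A short sketch, assuming the illustrative new name 'other':

```python
import pandas as pd

s = pd.Series([1, 2, 3], name='Some Thing')

# rename() with a scalar returns a new Series; the original keeps its name.
renamed = s.rename('other')

# to_frame() uses the Series name as the column label.
frame = s.to_frame()

print(s.name, renamed.name)    # Some Thing other
print(frame.columns.tolist())  # ['Some Thing']
```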

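II. DataFrame

A DataFrame is a two-dimensional array with both row and column labels. You can think of it as an Excel spreadsheet, a SQL table, or a dict of Series objects. It is the most commonly used data structure in pandas.

The basic form for creating a DataFrame is:

df = pd.DataFrame(data, index=index, columns=columns)

Here index holds the row labels, columns holds the column labels, and data can be any of the following:

  • a dict of one-dimensional NumPy arrays, lists, or Series
  • a two-dimensional NumPy array
  • a Series
  • another DataFrame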

1. Creation

1.1 From a dict

>>> d = {'one' : pd.Series([1, 2, 3], index=['a', 'b', 'c']),
...      'two' : pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])}
>>> d
{'one': a    1
b    2
c    3
dtype: int64, 'two': a    1
b    2
c    3
d    4
dtype: int64}
>>> pd.DataFrame(d)
   one  two
a  1.0    1
b  2.0    2
c  3.0    3
d  NaN    4

Row and column labels can be set explicitly; entries with no matching data show NaN:

>>> pd.DataFrame(d, index=['d', 'b', 'a'], columns=['two', 'three'])
   two three
d    4   NaN
b    2   NaN
a    1   NaN

1.2 From structured data

>>> data = [(1, 2.2, 'Hello'), (2, 3., "World")]
>>> data
[(1, 2.2, 'Hello'), (2, 3.0, 'World')]
>>> pd.DataFrame(data)
   0    1      2
0  1  2.2  Hello
1  2  3.0  World

>>> pd.DataFrame(data, index=['first', 'second'], columns=['A', 'B', 'C'])
        A    B      C
first   1  2.2  Hello
second  2  3.0  World

1.3 From a list of dicts

>>> data = [{'a': 1, 'b': 2}, {'a': 5, 'b': 10, 'c': 20}]
>>> pd.DataFrame(data)
   a   b     c
0  1   2   NaN
1  5  10  20.0

>>> pd.DataFrame(data,index=['first','second'], columns=['a', 'e'])
        a   e
first   1 NaN
second  5 NaN

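1.4 From a dict of tuples

This is mainly useful for understanding how construction works. In practice, data is usually cleaned into a format that is easy for pandas to import and easy to read, and only then reshaped into more complex structures with reindex, groupby, and similar tools.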

>>> d = {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
...      ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
...      ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6},
...      ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8},
...      ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}}
>>> d
{('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2}, ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4}, ('a', 'c'): {('A', 'B'): 5, ('A', 'C'): 6}, ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8}, ('b', 'b'): {('A', 'D'): 9, ('A', 'B'): 10}}

# multi-level labels
>>> pd.DataFrame(d)
       a              b
       b    a    c    a     b
A B  1.0  4.0  5.0  8.0  10.0
  C  2.0  3.0  6.0  7.0   NaN
  D  NaN  NaN  NaN  NaN   9.0
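
With multi-level labels, a full tuple addresses a single column or row, while the first level alone selects a group. The sketch below uses a shortened version of the dict above; the variable names col, sub, and value are illustrative.

```python
import pandas as pd

d = {('a', 'b'): {('A', 'B'): 1, ('A', 'C'): 2},
     ('a', 'a'): {('A', 'C'): 3, ('A', 'B'): 4},
     ('b', 'a'): {('A', 'C'): 7, ('A', 'B'): 8}}
df = pd.DataFrame(d)

# A full tuple selects one column as a Series.
col = df[('a', 'b')]

# The first level alone selects a sub-frame keyed by the second level.
sub = df['a']

# loc takes a row tuple and a column tuple to reach a single cell.
value = df.loc[('A', 'C'), ('b', 'a')]

print(col[('A', 'B')])      # 1
print(sorted(sub.columns))  # ['a', 'b']
print(value)                # 7
```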

1.5 From a Series

>>> s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
>>> pd.DataFrame(s,columns=['A'])
          A
a  0.748728
b -0.119084
c  0.328340
d -1.707235
e  0.205882

2. Selecting, adding, and deleting columns

2.1 Selecting columns

>>> df = pd.DataFrame(np.random.randn(6, 4), columns=['one', 'two', 'three', 'four'])
>>> df
        one       two     three      four
0  0.486625  0.094514  0.733189 -1.137290
1  0.155623 -0.610077  0.424488  0.103686
2 -1.747658 -0.618322 -1.070768  1.638107
3 -0.761408 -0.353779  1.363916  1.663116
4  0.012482  0.385496  0.283480  0.716104
5  0.784946 -0.568144  1.411448  0.187921

>>> df['three'] = df['one'] + df['two']
>>> df
        one       two     three      four
0  0.486625  0.094514  0.581139 -1.137290
1  0.155623 -0.610077 -0.454453  0.103686
2 -1.747658 -0.618322 -2.365981  1.638107
3 -0.761408 -0.353779 -1.115188  1.663116
4  0.012482  0.385496  0.397978  0.716104
5  0.784946 -0.568144  0.216803  0.187921

>>> df['flag'] = df['one'] > 0
>>> df
        one       two     three      four   flag
0  0.486625  0.094514  0.581139 -1.137290   True
1  0.155623 -0.610077 -0.454453  0.103686   True
2 -1.747658 -0.618322 -2.365981  1.638107  False
3 -0.761408 -0.353779 -1.115188  1.663116  False
4  0.012482  0.385496  0.397978  0.716104   True
5  0.784946 -0.568144  0.216803  0.187921   True

2.2 Deleting columns

  • the del statement
>>> del df['three']
>>> df
        one       two      four   flag
0  0.486625  0.094514 -1.137290   True
1  0.155623 -0.610077  0.103686   True
2 -1.747658 -0.618322  1.638107  False
3 -0.761408 -0.353779  1.663116  False
4  0.012482  0.385496  0.716104   True
5  0.784946 -0.568144  0.187921   True
  • the pop method
>>> four = df.pop('four')
>>> four
0   -1.137290
1    0.103686
2    1.638107
3    1.663116
4    0.716104
5    0.187921
Name: four, dtype: float64
>>> df
        one       two   flag
0  0.486625  0.094514   True
1  0.155623 -0.610077   True
2 -1.747658 -0.618322  False
3 -0.761408 -0.353779  False
4  0.012482  0.385496   True
5  0.784946 -0.568144   True
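
Both del and pop modify the DataFrame in place. If the original should stay intact, drop is an alternative that returns a new frame. A minimal sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'two': [3, 4], 'three': [5, 6]})

# drop() returns a new frame and leaves df untouched,
# unlike del and pop(), which change df itself.
trimmed = df.drop(columns=['three'])

print(df.columns.tolist())       # ['one', 'two', 'three']
print(trimmed.columns.tolist())  # ['one', 'two']
```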

2.3 Inserting columns

>>> df['five'] = 5
>>> df
        one       two   flag  five
0  0.486625  0.094514   True     5
1  0.155623 -0.610077   True     5
2 -1.747658 -0.618322  False     5
3 -0.761408 -0.353779  False     5
4  0.012482  0.385496   True     5
5  0.784946 -0.568144   True     5

>>> df['one_trunc'] = df['one'][:2]
>>> df
        one       two   flag  five  one_trunc
0  0.486625  0.094514   True     5   0.486625
1  0.155623 -0.610077   True     5   0.155623
2 -1.747658 -0.618322  False     5        NaN
3 -0.761408 -0.353779  False     5        NaN
4  0.012482  0.385496   True     5        NaN
5  0.784946 -0.568144   True     5        NaN
  • the insert method, which takes an explicit position
>>> df.insert(1, 'bar', df['one'])
>>> df
        one       bar       two   flag  five  one_trunc
0  0.486625  0.486625  0.094514   True     5   0.486625
1  0.155623  0.155623 -0.610077   True     5   0.155623
2 -1.747658 -1.747658 -0.618322  False     5        NaN
3 -0.761408 -0.761408 -0.353779  False     5        NaN
4  0.012482  0.012482  0.385496   True     5        NaN
5  0.784946  0.784946 -0.568144   True     5        NaN
  • the assign() method
    assign() is convenient in method chains; it returns a new DataFrame and leaves df unchanged
>>> df = pd.DataFrame(np.random.randint(1, 5, (6, 4)), columns=list('ABCD'))
>>> df
   A  B  C  D
0  2  2  4  1
1  2  4  3  1
2  3  1  3  2
3  3  2  4  1
4  2  4  3  2
5  3  4  4  3

Add a new column whose values are column A divided by column B:

>>> df.assign(Ratio = df['A'] / df['B'])
   A  B  C  D  Ratio
0  2  2  4  1   1.00
1  2  4  3  1   0.50
2  3  1  3  2   3.00
3  3  2  4  1   1.50
4  2  4  3  2   0.50
5  3  4  4  3   0.75

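Add new columns with custom functions (note that CD_Ratio below is computed as C - D, a difference rather than a ratio):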

>>> df.assign(AB_Ratio = lambda x: x.A / x.B, CD_Ratio = lambda x: x.C - x.D)
   A  B  C  D  AB_Ratio  CD_Ratio
0  2  2  4  1      1.00         3
1  2  4  3  1      0.50         2
2  3  1  3  2      3.00         1
3  3  2  4  1      1.50         3
4  2  4  3  2      0.50         1
5  3  4  4  3      0.75         1

>>> df.assign(AB_Ratio = lambda x: x.A / x.B).assign(ABD_Ratio = lambda x: x.AB_Ratio * x.D)
   A  B  C  D  AB_Ratio  ABD_Ratio
0  2  2  4  1      1.00       1.00
1  2  4  3  1      0.50       0.50
2  3  1  3  2      3.00       6.00
3  3  2  4  1      1.50       1.50
4  2  4  3  2      0.50       1.00
5  3  4  4  3      0.75       2.25
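
Since assign returns a new DataFrame, the result has to be bound to a name if the new column should persist. A short sketch with fixed values (the data is illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 3], 'B': [4, 1]})

# assign() builds a copy that carries the extra column.
with_ratio = df.assign(Ratio=lambda x: x.A / x.B)

print(df.columns.tolist())           # ['A', 'B']
print(with_ratio['Ratio'].tolist())  # [0.5, 3.0]

# To keep the column, rebind the name to the result.
df = df.assign(Ratio=lambda x: x.A / x.B)
```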

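3. Indexing and selection

Operations, their syntax, and the result type:

  • select a column -> df[col] -> Series
  • select a row by label -> df.loc[label] -> Series
  • select a row by position -> df.iloc[loc] -> Series
  • select several rows -> df[5:10] -> DataFrame
  • select rows with a boolean vector -> df[bool_vector] -> DataFrame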
>>> df = pd.DataFrame(np.random.randint(1, 10, (6, 4)), index=list('abcdef'), columns=list('ABCD'))
>>> df
   A  B  C  D
a  2  8  8  2
b  9  2  8  2
c  7  5  1  2
d  8  3  4  2
e  2  1  2  4
f  8  2  7  3
>>> df['B']
a    8
b    2
c    5
d    3
e    1
f    2
Name: B, dtype: int32

# loc looks up row labels; 'B' is a column label, so this raises KeyError
>>> df.loc['B']
KeyError: 'B'

>>> df.loc['b']
A    9
B    2
C    8
D    2
Name: b, dtype: int32

>>> df.iloc[0]
A    2
B    8
C    8
D    2
Name: a, dtype: int32

>>> df[1:4]
   A  B  C  D
b  9  2  8  2
c  7  5  1  2
d  8  3  4  2

# rows at the positions marked True
>>> df[[False, True, True, False, True, False]]
   A  B  C  D
b  9  2  8  2
c  7  5  1  2
e  2  1  2  4
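
The selection forms above can be compared side by side on a small fixed frame. This is a sketch with illustrative data; it also shows two details the table does not spell out, namely that label slices include the end label and that a condition can produce the boolean vector.

```python
import pandas as pd

df = pd.DataFrame({'A': [2, 9, 7], 'B': [8, 2, 5]}, index=list('abc'))

by_label = df.loc['b']         # row with label 'b' -> Series
by_position = df.iloc[1]       # second row by position -> the same row
cell = df.loc['c', 'B']        # single value: row 'c', column 'B'
label_slice = df.loc['a':'b']  # label slices include the end label
pos_slice = df.iloc[0:1]       # position slices exclude the end
mask_rows = df[df['A'] > 5]    # rows where the condition holds

print(by_label.equals(by_position))  # True
print(cell)                          # 5
print(label_slice.index.tolist())    # ['a', 'b']
print(pos_slice.index.tolist())      # ['a']
print(mask_rows.index.tolist())      # ['b', 'c']
```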

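4. Data alignment

When computing with DataFrames, pandas automatically aligns the data on both row and column labels. The result carries the union of the labels of the two DataFrames.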

>>> df1 = pd.DataFrame(np.random.randn(10, 4), index=list('abcdefghij'), columns=['A', 'B', 'C', 'D'])
>>> df1
          A         B         C         D
a -1.862886 -1.547650  0.637708  0.350643
b -0.421221 -1.479398 -0.480860  0.166336
c -0.010406 -0.849795  0.034272 -0.589808
d  0.450138  0.391159  0.914933  0.530649
e  1.036746  0.097552  0.914027  0.570200
f -0.215569  0.461338  0.831485  0.816958
g  0.823373  0.656957 -0.243091 -0.469380
h -0.946946  0.017144 -0.647669 -1.496623
i -1.533835  1.253698 -0.340709 -0.113551
j -0.132444  1.058355  0.038903 -0.072712

>>> df2 = pd.DataFrame(np.random.randn(7, 3), index=list('cdefghi'), columns=['A', 'B', 'C'])
>>> df2
          A         B         C
c -1.391986 -0.219589 -1.144956
d  0.588511  0.567815  0.545037
e  1.981807  0.274164 -0.895879
f  0.209802  0.031883  0.139088
g -0.338254  1.317608  0.156630
h -0.097541  0.312342 -0.217281
i  0.687546 -0.631277  0.577067

df1 + df2: values with matching row and column labels are added; everything else shows NaN.

>>> df1 + df2
          A         B         C   D
a       NaN       NaN       NaN NaN
b       NaN       NaN       NaN NaN
c -1.402392 -1.069384 -1.110684 NaN
d  1.038649  0.958975  1.459970 NaN
e  3.018553  0.371716  0.018148 NaN
f -0.005767  0.493221  0.970574 NaN
g  0.485119  1.974565 -0.086460 NaN
h -1.044486  0.329486 -0.864950 NaN
i -0.846289  0.622422  0.236357 NaN
j       NaN       NaN       NaN NaN
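
Subtracting a Series from a DataFrame aligns the Series index with the column labels and broadcasts across the rows:

>>> df1 - df1.iloc[0]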
          A         B         C         D
a  0.000000  0.000000  0.000000  0.000000
b  1.441665  0.068252 -1.118567 -0.184308
c  1.852480  0.697855 -0.603436 -0.940452
d  2.313024  1.938809  0.277226  0.180006
e  2.899632  1.645202  0.276319  0.219557
f  1.647317  2.008988  0.193778  0.466314
g  2.686259  2.204607 -0.880798 -0.820024
h  0.915940  1.564794 -1.285376 -1.847267
i  0.329051  2.801349 -0.978417 -0.464194
j  1.730442  2.606005 -0.598804 -0.423355
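
The direction of that broadcast can be made explicit with the sub method and its axis argument. A minimal sketch with fixed values (the data and the names by_row and by_col are illustrative):

```python
import pandas as pd

df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [10.0, 20.0, 30.0]},
                  index=list('abc'))

# Subtract the first row from every row: the Series aligns on column labels.
by_row = df.sub(df.iloc[0], axis='columns')

# Subtract column A from every column: the Series aligns on the row index.
by_col = df.sub(df['A'], axis='index')

print(by_row['B'].tolist())  # [0.0, 10.0, 20.0]
print(by_col['B'].tolist())  # [9.0, 18.0, 27.0]
```

The first form matches what df - df.iloc[0] does; the second has no operator shorthand, which is why the method form is useful.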

5. Using NumPy functions

pandas' core data structures are compatible with NumPy: most NumPy functions can be applied to them directly.

>>> df = pd.DataFrame(np.random.randn(10, 4), columns=['one', 'two', 'three', 'four'])
>>> df
        one       two     three      four
0  1.800023 -0.550830 -1.115527  1.283088
1  0.005457 -0.205792  1.406842  0.253727
2  1.658374  0.220637  0.349239  0.178845
3 -0.087544  0.262716 -0.822376  1.076153
4  0.942431 -1.170636  0.637203 -1.443319
5  0.165776  0.118799  1.792991 -0.923901
6  0.107792 -0.595107  0.090514  0.178640
7 -0.288757  0.414845  0.074528 -2.418104
8  0.082551 -0.935000  0.017684 -0.990776
9 -0.722961  0.816024 -1.634607 -0.774388

Compute the exponential function with base e:

>>> np.exp(df)
        one       two     three      four
0  6.049785  0.576471  0.327742  3.607763
1  1.005472  0.814002  4.083041  1.288820
2  5.250764  1.246871  1.417989  1.195835
3  0.916178  1.300457  0.439387  2.933374
4  2.566213  0.310170  1.891183  0.236143
5  1.180309  1.126143  6.007395  0.396968
6  1.113816  0.551504  1.094737  1.195590
7  0.749195  1.514136  1.077376  0.089090
8  1.086054  0.392586  1.017842  0.371288
9  0.485313  2.261491  0.195029  0.460986

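Both array and asarray convert structured data to an ndarray. The main difference arises when the source is already an ndarray: array still makes a copy that occupies new memory, whereas asarray does not.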

>>> type(np.asarray(df))
<class 'numpy.ndarray'>

Check whether the converted values equal the original ones:

>>> np.asarray(df) == df.values
array([[ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True],
       [ True,  True,  True,  True]])

>>> np.asarray(df) == df
    one   two  three  four
0  True  True   True  True
1  True  True   True  True
2  True  True   True  True
3  True  True   True  True
4  True  True   True  True
5  True  True   True  True
6  True  True   True  True
7  True  True   True  True
8  True  True   True  True
9  True  True   True  True
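
The copy behaviour described above can be checked directly on an ndarray. The sketch below uses illustrative data and also shows DataFrame.to_numpy(), which is another way to get the underlying values as an ndarray:

```python
import numpy as np
import pandas as pd

a = np.arange(6.0).reshape(2, 3)

same = np.asarray(a)  # already an ndarray: returned as-is, no copy
copy = np.array(a)    # copies by default

print(same is a)                  # True
print(copy is a)                  # False
print(np.shares_memory(copy, a))  # False

df = pd.DataFrame(a, columns=['x', 'y', 'z'])
arr = df.to_numpy()
print(type(arr).__name__, arr.shape)  # ndarray (2, 3)
```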

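6. Tab completion

Tab completion is one of the main improvements of IPython over the standard Python shell. While typing an expression in the shell, pressing Tab lists every variable in the current namespace that matches the characters typed so far.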

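III. Panel

A Panel is a three-dimensional labeled array. The name pandas itself derives from it: pan(el)-da(ta)-s. Panel was rarely used; it was deprecated in pandas 0.20 and removed in 0.25, so the session below only runs on older versions (note the FutureWarning in the output).

  • items: axis 0; each item corresponds to a DataFrame
  • major_axis: axis 1; the row labels of each DataFrame
  • minor_axis: axis 2; the column labels of each DataFrame

>>> data = {'Item1' : pd.DataFrame(np.random.randn(4, 3)),
...         'Item2' : pd.DataFrame(np.random.randn(4, 2))}
>>> pn = pd.Panel(data)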
sys:1: FutureWarning:
Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are with a MultiIndex on a DataFrame, via the Panel.to_frame() method
Alternatively, you can use the xarray package http://xarray.pydata.org/en/stable/.
Pandas provides a `.to_xarray()` method to help automate this conversion.

>>> pn
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 4 (major_axis) x 3 (minor_axis)
Items axis: Item1 to Item2
Major_axis axis: 0 to 3
Minor_axis axis: 0 to 2

View the 'Item1' data:

>>> pn['Item1']
          0         1         2
0  1.292038  0.526691 -0.632993
1 -0.400069  0.735345 -0.090232
2  1.912338 -1.056740 -0.140426
3  0.718229 -0.862939 -1.376745

Inspect the axes of pn:

>>> pn.items
Index(['Item1', 'Item2'], dtype='object')
>>> pn.major_axis
RangeIndex(start=0, stop=4, step=1)
>>> pn.minor_axis
RangeIndex(start=0, stop=3, step=1)

Method calls

>>> pn.major_xs(pn.major_axis[0])
      Item1     Item2
0  1.292038 -0.072927
1  0.526691  1.713952
2 -0.632993       NaN

>>> pn.minor_xs(pn.minor_axis[1])
      Item1     Item2
0  0.526691  1.713952
1  0.735345  0.062300
2 -1.056740 -0.458656
3 -0.862939  0.759974

>>> pn.to_frame()
                Item1     Item2
major minor
0     0      1.292038 -0.072927
      1      0.526691  1.713952
1     0     -0.400069  1.336408
      1      0.735345  0.062300
2     0      1.912338  1.121212
      1     -1.056740 -0.458656
3     0      0.718229 -0.687525
      1     -0.862939  0.759974
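
On pandas versions without Panel, the FutureWarning above points to a MultiIndex on a DataFrame as the replacement. One way to build such a frame is pd.concat with keys; this is a sketch with fixed illustrative data, not a reconstruction of the random session above.

```python
import numpy as np
import pandas as pd

item1 = pd.DataFrame(np.arange(12.0).reshape(4, 3))
item2 = pd.DataFrame(np.arange(8.0).reshape(4, 2))

# Stack the frames; `keys` becomes the outer level of a row MultiIndex.
stacked = pd.concat([item1, item2], keys=['Item1', 'Item2'])

# Selecting the outer level gives back one "item" as a DataFrame.
first = stacked.loc['Item1']

print(stacked.shape)  # (8, 3)
print(first.shape)    # (4, 3)
```

Column 2 of the 'Item2' block is NaN, because that frame has only two columns.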