The pandas module

How do NumPy and pandas differ?

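If we compare them to Python's lists and dictionaries, NumPy is like a list, whose values carry no labels, while pandas is like a dictionary, whose values are labelled. pandas is built on top of NumPy and makes NumPy-centric applications simpler to write.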

To use pandas, you first need to know its two main data structures: Series and DataFrame.

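The string representation of a Series shows the index on the left and the values on the right. When no index is specified for the data, pandas automatically creates an integer index from 0 to N-1, where N is the length of the data.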

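A DataFrame is a tabular data structure containing an ordered set of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and it can be thought of as a large dictionary of Series.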

Officially recommended imports:

from pandas import Series,DataFrame
import pandas as pd

Creating objects

>>> from pandas import Series,DataFrame
>>> import pandas as pd 
>>> import numpy as np 
>>> s = Series([1,2,3,'a',np.nan,[1,2]])
>>> s
0         1
1         2
2         3
3         a
4       NaN  # NaN means "not a number"
5    [1, 2]
dtype: object
>>> dates = pd.date_range('2017', periods=6)
>>> dates
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04','2017-01-05', '2017-01-06'],dtype='datetime64[ns]', freq='D')
>>> df = DataFrame(np.random.randn(6,4), index=dates)  # when index or columns is omitted, it defaults to integers starting from 0
>>> df
                   0         1         2         3
2017-01-01 -0.923905  0.305506  0.676255 -1.428198
2017-01-02  0.234690  1.756183 -0.226916  0.516676
2017-01-03 -0.180496 -0.410745  0.145798 -1.189019
2017-01-04 -0.676189  0.602093 -0.151042 -0.915054
2017-01-05 -1.000729  0.784595  0.623079 -0.551410
2017-01-06  1.024644 -0.305822 -0.867859  0.867652
>>> df = DataFrame(np.random.randn(6,4), columns=('a','b','c','d'))
>>> df
          a         b         c         d
0  0.000196 -1.342386  0.189864 -0.874669
1 -0.638368 -1.403264  0.121946  0.720223
2 -0.504676  0.328643  0.478719 -1.165611
3 -0.011445 -0.775834  0.809029  2.148832
4 -1.012311  1.345237  0.725192 -1.658297
5 -1.580452 -0.664339 -0.370294 -1.370419
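
The description of a DataFrame as a dictionary of Series can be sketched as follows. This is a minimal illustration; the labels x, y, z and the column names count and name are made up for the example and do not come from the session above.

```python
import pandas as pd

# A Series with explicit labels instead of the default 0..N-1 integers.
s = pd.Series([10, 20, 30], index=["x", "y", "z"])

# A DataFrame built from a dict of Series: each key becomes a column,
# and the rows are aligned on the index labels of the Series.
df = pd.DataFrame({
    "count": s,
    "name": pd.Series(["ten", "twenty", "thirty"], index=["x", "y", "z"]),
})

print(s["y"])             # label-based lookup on the Series
print(df["count"]["z"])   # pick a column (a Series), then a label
print(list(df.columns))
```

Selecting a column with df["count"] returns a Series that shares the row index of the frame, which is what makes the dictionary comparison useful.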

Viewing and selecting data

>>> df2
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
>>> df2.head(2)  # first two rows
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
>>> df2.tail(2)
     A          B    C  D      E    F
2  1.0 2013-01-02  1.0  3   test  foo
3  1.0 2013-01-02  1.0  3  train  foo
>>> df2[0:2]  # a slice selects rows; df2[0] raises KeyError because a single key is looked up as a column label
     A          B    C  D      E    F
0  1.0 2013-01-02  1.0  3   test  foo
1  1.0 2013-01-02  1.0  3  train  foo
>>> df2.A
0    1.0
1    1.0
2    1.0
3    1.0
Name: A, dtype: float64
>>> df2['B']
0   2013-01-02
1   2013-01-02
2   2013-01-02
3   2013-01-02
Name: B, dtype: datetime64[ns]
>>> df2.values
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
       [1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']], dtype=object)
>>> df2.index
Int64Index([0, 1, 2, 3], dtype='int64')
>>> df2.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
>>> df2.describe()  # statistics are computed for numeric columns only
         A    C    D
count  4.0  4.0  4.0
mean   1.0  1.0  3.0
std    0.0  0.0  0.0
min    1.0  1.0  3.0
25%    1.0  1.0  3.0
50%    1.0  1.0  3.0
75%    1.0  1.0  3.0
max    1.0  1.0  3.0
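
The session above never shows how df2 was created. The sketch below builds a frame with the same columns and values so the calls can be tried out; the construction itself is an assumption, not the author's original code, and the exact dtypes and describe() columns may differ between pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of df2: scalars are broadcast to every row,
# and the row index comes from the Series in column C.
df2 = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("2013-01-02"),
    "C": pd.Series(1.0, index=range(4)),
    "D": np.array([3] * 4),
    "E": ["test", "train", "test", "train"],
    "F": "foo",
})

print(df2.head(2))      # first two rows
print(df2.tail(2))      # last two rows
print(df2.describe())   # summary statistics

# A single key inside [] is treated as a column label, so df2[0] fails.
missing_column = False
try:
    df2[0]
except KeyError:
    missing_column = True
print(missing_column)
```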

loc

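We can select data by label with loc. The examples below select a single row by its label, or select some rows or all rows (: means all rows) and then one or more columns from them: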

>>> dates = pd.date_range('20130101', periods=6)
>>> df = DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])
>>> df
             A   B   C   D
2013-01-01   0   1   2   3
2013-01-02   4   5   6   7
2013-01-03   8   9  10  11
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
>>> df.loc['20130102']
A    4
B    5
C    6
D    7
Name: 2013-01-02 00:00:00, dtype: int32
>>> df.loc['2013-01-01':'2013-01-04','A':'C']  # label slices include both endpoints; a positional slice such as [0:3] includes the left endpoint but not the right
             A   B   C
2013-01-01   0   1   2
2013-01-02   4   5   6
2013-01-03   8   9  10
2013-01-04  12  13  14
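
The label-based selections described above can be gathered into one small script. This is a sketch that rebuilds the same frame; the variable names row, block and all_rows are only illustrative.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list("ABCD"))

# One row by label: the result is a Series indexed by column name.
row = df.loc["2013-01-02"]

# Label slices include both endpoints, for rows and for columns.
block = df.loc["2013-01-01":"2013-01-04", "A":"C"]

# ":" keeps every row; a list picks specific columns.
all_rows = df.loc[:, ["A", "D"]]

print(row["A"])
print(block.shape)
print(list(all_rows.columns))
```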

iloc

We can also select by position with iloc, which picks out the data needed in different situations: a single value, a contiguous range, or non-adjacent rows.

>>> df.iloc[1:4,0:3]   # includes position 1, excludes position 4
             A   B   C
2013-01-02   4   5   6
2013-01-03   8   9  10
2013-01-04  12  13  14
>>> df.iloc[[1,3,4],:]
             A   B   C   D
2013-01-02   4   5   6   7
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
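
The three cases mentioned for iloc (a single value, a contiguous or stepped range, and hand-picked positions) might look like this on the same frame. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list("ABCD"))

# A single value: row position 3, column position 1.
single = df.iloc[3, 1]

# Every other row, all columns (positions 0, 2, 4).
every_other = df.iloc[::2, :]

# Hand-picked row and column positions.
picked = df.iloc[[1, 3, 4], [0, 3]]

print(single)
print(every_other["A"].tolist())
print(picked.values.tolist())
```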

ix

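We can also mix label-based and position-based selection with ix. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so this example only runs on older versions; on current versions use loc or iloc instead.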

>>> df.ix[0:2,['A','D']]
            A  D
2013-01-01  0  3
2013-01-02  4  7
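
On pandas versions without ix, the same mixed selection (rows by position, columns by label) can be written with iloc or loc. The sketch below shows two possible spellings; neither is claimed to be the only way.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list("ABCD"))

# Option 1: translate the column labels into positions, then use iloc.
via_iloc = df.iloc[0:2, df.columns.get_indexer(["A", "D"])]

# Option 2: translate the row positions into labels, then use loc.
via_loc = df.loc[df.index[0:2], ["A", "D"]]

print(via_iloc)
print(via_iloc.equals(via_loc))
```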

Boolean filtering

>>> df[df>5] 
               A     B     C     D
2013-01-01   NaN   NaN   NaN   NaN
2013-01-02   NaN   NaN   6.0   7.0
2013-01-03   8.0   9.0  10.0  11.0
2013-01-04  12.0  13.0  14.0  15.0
2013-01-05  16.0  17.0  18.0  19.0
2013-01-06  20.0  21.0  22.0  23.0
>>> df[df.A>8]  # rows where column A is greater than 8
             A   B   C   D
2013-01-04  12  13  14  15
2013-01-05  16  17  18  19
2013-01-06  20  21  22  23
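
Boolean masks can also be combined. The following sketch extends the filter above with a second condition and with isin; the thresholds are arbitrary example values.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)), index=dates, columns=list("ABCD"))

# Combine conditions with & (and) or | (or); each condition needs parentheses.
mask = (df["A"] > 8) & (df["D"] < 23)
subset = df[mask]

# isin keeps the rows whose value is in a given collection.
in_set = df[df["B"].isin([1, 9])]

print(subset["A"].tolist())
print(in_set["A"].tolist())
```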

Handling NaN in pandas

When importing or processing data, we sometimes end up with empty or NaN values. How do we drop or fill them?

dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

>>> dates = pd.date_range('20130101', periods=6)
>>> df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])
>>> df.iloc[0,1] = np.nan
>>> df.iloc[1,2] = np.nan
>>> df
             A     B     C   D
2013-01-01   0   NaN   2.0   3
2013-01-02   4   5.0   NaN   7
2013-01-03   8   9.0  10.0  11
2013-01-04  12  13.0  14.0  15
2013-01-05  16  17.0  18.0  19
2013-01-06  20  21.0  22.0  23
>>> df.dropna()
             A     B     C   D
2013-01-03   8   9.0  10.0  11
2013-01-04  12  13.0  14.0  15
2013-01-05  16  17.0  18.0  19
2013-01-06  20  21.0  22.0  23
>>> df
             A     B     C   D
2013-01-01   0   NaN   2.0   3
2013-01-02   4   5.0   NaN   7
2013-01-03   8   9.0  10.0  11
2013-01-04  12  13.0  14.0  15
2013-01-05  16  17.0  18.0  19
2013-01-06  20  21.0  22.0  23
>>> df.dropna(axis='columns',how='any')
             A   D
2013-01-01   0   3
2013-01-02   4   7
2013-01-03   8  11
2013-01-04  12  15
2013-01-05  16  19
2013-01-06  20  23
>>> df.fillna(value=-1)
             A     B     C   D
2013-01-01   0  -1.0   2.0   3
2013-01-02   4   5.0  -1.0   7
2013-01-03   8   9.0  10.0  11
2013-01-04  12  13.0  14.0  15
2013-01-05  16  17.0  18.0  19
2013-01-06  20  21.0  22.0  23
>>> df.isnull() 
                A      B      C      D
2013-01-01  False   True  False  False
2013-01-02  False  False   True  False
2013-01-03  False  False  False  False
2013-01-04  False  False  False  False
2013-01-05  False  False  False  False
2013-01-06  False  False  False  False
>>> np.any(df.isnull()) == True  # checks whether any NaN is present; returns True if so
True
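
A few further variations on the calls above, as a sketch. The frame is created with a float dtype here (an assumption made so that assigning NaN does not need a dtype change on recent pandas versions), and the fill values are arbitrary.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24, dtype=float).reshape((6, 4)),
                  index=dates, columns=list("ABCD"))
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan

# One way to test for any NaN that yields a plain Python bool.
has_nan = bool(df.isnull().values.any())

# Count the NaN values in each column.
nan_per_column = df.isnull().sum()

# thresh=4 keeps only the rows with at least 4 non-NaN values.
rows_kept = df.dropna(thresh=4)

# fillna accepts a dict, so each column can get its own fill value.
filled = df.fillna({"B": 0.0, "C": df["C"].mean()})

print(has_nan)
print(nan_per_column.tolist())
print(len(rows_kept))
print(filled.iloc[1, 2])
```

None of these calls modify df unless inplace=True is passed, which matches the session above where df still contains NaN after dropna().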

Reading and writing data with pandas

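pandas can read and write many formats through its read_* functions and to_* methods. The session below uses CSV and JSON: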

>>> path = r'C:\Users\zhifei\Desktop\student.csv'
>>> data = pd.read_csv(path)
>>> data
    Student ID  name   age  gender
0         1100  Kelly   22  Female
1         1101    Clo   21  Female
2         1102  Tilly   22  Female
3         1103   Tony   24    Male
4         1104  David   20    Male
5         1105  Catty   22  Female
6         1106      M    3  Female
7         1107      N   43    Male
8         1108      A   13    Male
9         1109      S   12    Male
10        1110  David   33    Male
11        1111     Dw    3  Female
12        1112      Q   23    Male
13        1113      W   21  Female
>>> type(data)
<class 'pandas.core.frame.DataFrame'>
>>> path2 = r'C:\Users\zhifei\Desktop\json.txt'
>>> data.to_json(path2)
>>> data_2 = pd.read_json(path2)
>>> data_2
    Student ID  age  gender  name 
0         1100   22  Female  Kelly
1         1101   21  Female    Clo
10        1110   33    Male  David
11        1111    3  Female     Dw
12        1112   23    Male      Q
13        1113   21  Female      W
2         1102   22  Female  Tilly
3         1103   24    Male   Tony
4         1104   20    Male  David
5         1105   22  Female  Catty
6         1106    3  Female      M
7         1107   43    Male      N
8         1108   13    Male      A
9         1109   12    Male      S
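
The same round trip can be tried without any files on disk. The sketch below reads CSV text from an in-memory buffer and writes JSON to a string; the three sample rows are copied from the table above, and orient="records" is just one possible choice that stores the rows as a list.

```python
import io
import pandas as pd

csv_text = (
    "Student ID,name,age,gender\n"
    "1100,Kelly,22,Female\n"
    "1101,Clo,21,Female\n"
    "1103,Tony,24,Male\n"
)

# read_csv accepts a file path or any file-like object.
data = pd.read_csv(io.StringIO(csv_text))

# to_json returns a string when no path is given.
json_text = data.to_json(orient="records")

# Read the JSON back with the same orient it was written with.
restored = pd.read_json(io.StringIO(json_text), orient="records")

print(restored)
```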

Combining data with pandas

The examples below use pd.concat; see help(pd.concat) for the full signature.

import pandas as pd
import numpy as np

# define the datasets
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d'])
df3 = pd.DataFrame(np.ones((3,4))*2, columns=['a','b','c','d'])

# concatenate vertically with concat
res = pd.concat([df1, df2, df3], axis=0)

# print the result
print(res)
#     a    b    c    d
# 0  0.0  0.0  0.0  0.0
# 1  0.0  0.0  0.0  0.0
# 2  0.0  0.0  0.0  0.0
# 0  1.0  1.0  1.0  1.0
# 1  1.0  1.0  1.0  1.0
# 2  1.0  1.0  1.0  1.0
# 0  2.0  2.0  2.0  2.0
# 1  2.0  2.0  2.0  2.0
# 2  2.0  2.0  2.0  2.0

res = pd.concat([df1, df2, df3], axis=0, ignore_index=True)

# print the result
print(res)
#     a    b    c    d
# 0  0.0  0.0  0.0  0.0
# 1  0.0  0.0  0.0  0.0
# 2  0.0  0.0  0.0  0.0
# 3  1.0  1.0  1.0  1.0
# 4  1.0  1.0  1.0  1.0
# 5  1.0  1.0  1.0  1.0
# 6  2.0  2.0  2.0  2.0
# 7  2.0  2.0  2.0  2.0
# 8  2.0  2.0  2.0  2.0

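join='outer' is the default, so when the join parameter is not set, concat behaves as if join='outer'. In this mode the frames are stacked vertically by column: columns with the same name are stacked on top of one another, columns that appear in only one frame get their own column, and positions that had no value are filled with NaN.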

import pandas as pd
import numpy as np

# define the datasets
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'], index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['b','c','d','e'], index=[2,3,4])

# vertical "outer" concatenation of df1 and df2
res = pd.concat([df1, df2], axis=0, join='outer')

print(res)
#     a    b    c    d    e
# 1  0.0  0.0  0.0  0.0  NaN
# 2  0.0  0.0  0.0  0.0  NaN
# 3  0.0  0.0  0.0  0.0  NaN
# 2  NaN  1.0  1.0  1.0  1.0
# 3  NaN  1.0  1.0  1.0  1.0
# 4  NaN  1.0  1.0  1.0  1.0

# vertical "inner" concatenation of df1 and df2
res = pd.concat([df1, df2], axis=0, join='inner')

# print the result
print(res)
#     b    c    d
# 1  0.0  0.0  0.0
# 2  0.0  0.0  0.0
# 3  0.0  0.0  0.0
# 2  1.0  1.0  1.0
# 3  1.0  1.0  1.0
# 4  1.0  1.0  1.0

# reset the index and print the result
res = pd.concat([df1, df2], axis=0, join='inner', ignore_index=True)
print(res)
#     b    c    d
# 0  0.0  0.0  0.0
# 1  0.0  0.0  0.0
# 2  0.0  0.0  0.0
# 3  1.0  1.0  1.0
# 4  1.0  1.0  1.0
# 5  1.0  1.0  1.0

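join_axes (align on the given axes)

Note that the join_axes parameter was deprecated in pandas 0.25 and removed in pandas 1.0, so the first call below only runs on older versions.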

import pandas as pd
import numpy as np

# define the datasets
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'], index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['b','c','d','e'], index=[2,3,4])

# concatenate horizontally, aligned on `df1.index`
res = pd.concat([df1, df2], axis=1, join_axes=[df1.index])

# print the result
print(res)
#     a    b    c    d    b    c    d    e
# 1  0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN
# 2  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
# 3  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0

# drop join_axes and print the result
res = pd.concat([df1, df2], axis=1)
print(res)
#     a    b    c    d    b    c    d    e
# 1  0.0  0.0  0.0  0.0  NaN  NaN  NaN  NaN
# 2  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
# 3  0.0  0.0  0.0  0.0  1.0  1.0  1.0  1.0
# 4  NaN  NaN  NaN  NaN  1.0  1.0  1.0  1.0
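
On pandas versions where join_axes is not available, one way to get the same result as the first call above is to concatenate first and then reindex on df1.index. This is a sketch of that approach, not the only possible replacement.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])

# Horizontal outer concatenation, then keep only the rows labelled in df1.index.
res = pd.concat([df1, df2], axis=1).reindex(df1.index)

print(res)
```

Row 4 exists only in df2, so it is dropped by the reindex, while row 1 keeps NaN in the columns that came from df2.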

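append

DataFrame.append adds rows to the bottom of a frame. Note that it was deprecated in pandas 1.4 and removed in pandas 2.0, so the session below only runs on older versions; pd.concat is the replacement.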


>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('CD'))
>>> df
   A  B
0  1  2
1  3  4
>>> df2
   C  D
0  5  6
1  7  8
>>> df.append(df2)  # append can only add rows at the bottom
     A    B    C    D
0  1.0  2.0  NaN  NaN
1  3.0  4.0  NaN  NaN
0  NaN  NaN  5.0  6.0
1  NaN  NaN  7.0  8.0
>>> df3 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
>>> df.append(df3,ignore_index=True)
   A  B
0  1  2
1  3  4
2  5  6
3  7  8
>>> s = pd.Series(['a','b'],index=['A','B'])
>>> df.append(s,ignore_index=True)
   A  B
0  1  2
1  3  4
2  a  b
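
For pandas versions without DataFrame.append, the last two calls above could be rewritten with pd.concat roughly as follows. Converting the Series to a one-row frame with to_frame().T is one possible approach, chosen here for brevity.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df3 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))

# Equivalent of df.append(df3, ignore_index=True).
stacked = pd.concat([df, df3], ignore_index=True)

# Equivalent of df.append(s, ignore_index=True):
# turn the Series into a one-row frame first.
s = pd.Series(['a', 'b'], index=['A', 'B'])
with_row = pd.concat([df, s.to_frame().T], ignore_index=True)

print(stacked)
print(with_row)
```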

Merging with merge

pd.merge joins two DataFrames on one or more key columns, much like a SQL join. For the full signature and more details, see help(pd.merge).
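
Since no merge example is given, here is a minimal sketch. The frames left and right and the column named key are made up for the illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": [1, 2, 3]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "B": [10, 20, 30]})

# how='inner' is the default: only keys present in both frames survive.
inner = pd.merge(left, right, on="key")

# how='outer' keeps every key and fills the gaps with NaN.
outer = pd.merge(left, right, on="key", how="outer")

print(inner)
print(outer)
```

Unlike concat, which stacks frames along an axis, merge matches rows by the values in the key column.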

Plotting with pandas

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# generate 1000 random values
data = pd.Series(np.random.randn(1000),index=np.arange(1000))

# take the cumulative sum so the effect is easier to see
data = data.cumsum()

# pandas data can be plotted directly
data.plot()

plt.show()
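
One detail worth checking in the script above is that cumsum returns a new Series and leaves the original untouched, so its result has to be stored before plotting. A small sketch with four made-up values:

```python
import pandas as pd

data = pd.Series([1.0, 2.0, 3.0, 4.0])

# cumsum does not modify data; it returns the running total as a new Series.
running = data.cumsum()

print(data.tolist())
print(running.tolist())
# running.plot() would draw the accumulated values (requires matplotlib).
```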

For more plotting operations, see the matplotlib module.

Reference links:

  1. http://pandas.pydata.org/pandas-docs/stable/10min.html
  2. https://morvanzhou.github.io/tutorials/data-manipulation/np-pd/3-1-pd-intro/