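What is the difference between NumPy and Pandas?
If we compare them to Python's lists and dictionaries, NumPy is list-like, with no labels attached to its values, while Pandas is dictionary-like. Pandas is built on top of NumPy and makes NumPy-centric applications simpler.
To use pandas, you first need to understand its two main data structures: Series and DataFrame.
The string representation of a Series shows the index on the left and the values on the right. If no index is specified for the data, an integer index from 0 to N-1 (where N is the length) is created automatically.
A DataFrame is a tabular data structure containing an ordered set of columns, each of which can hold a different value type (numeric, string, boolean, and so on). A DataFrame has both a row index and a column index, and can be thought of as a large dictionary of Series.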
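As a minimal sketch of the two structures described above, the snippet below builds a Series without an index and a DataFrame from a dictionary of Series. The column names and values are made up for illustration.

```python
import pandas as pd

# A Series built without an explicit index gets integer labels 0..N-1.
s = pd.Series([10, 20, 30])
print(list(s.index))

# A DataFrame built from a dict of Series: each key becomes a column,
# and each column may hold a different value type.
df = pd.DataFrame({
    "name": pd.Series(["Ann", "Bob", "Cy"]),   # illustrative data
    "age": pd.Series([31, 25, 40]),
})

# Selecting one column returns a Series that shares the row index.
age = df["age"]
print(type(age).__name__)
print(df.dtypes)
```

Looking up a column by name works much like looking up a key in a dictionary, which is why a DataFrame is often described as a dictionary of Series.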
Recommended imports:
from pandas import Series,DataFrame
import pandas as pd
Creating objects
>>> from pandas import Series,DataFrame
>>> import pandas as pd
>>> import numpy as np
>>> s = Series([1,2,3,'a',np.nan,[1,2]])
>>> s
0 1
1 2
2 3
3 a
4 NaN #means "not a number"
5 [1, 2]
dtype: object
>>> dates = pd.date_range('2017', periods=6)
>>> dates
DatetimeIndex(['2017-01-01', '2017-01-02', '2017-01-03', '2017-01-04','2017-01-05', '2017-01-06'],dtype='datetime64[ns]', freq='D')
>>> df = DataFrame(np.random.randn(6,4), index=dates)#when index or columns is not specified, it defaults to integer labels starting from 0
>>> df
0 1 2 3
2017-01-01 -0.923905 0.305506 0.676255 -1.428198
2017-01-02 0.234690 1.756183 -0.226916 0.516676
2017-01-03 -0.180496 -0.410745 0.145798 -1.189019
2017-01-04 -0.676189 0.602093 -0.151042 -0.915054
2017-01-05 -1.000729 0.784595 0.623079 -0.551410
2017-01-06 1.024644 -0.305822 -0.867859 0.867652
>>> df = DataFrame(np.random.randn(6,4), columns=('a','b','c','d'))
>>> df
a b c d
0 0.000196 -1.342386 0.189864 -0.874669
1 -0.638368 -1.403264 0.121946 0.720223
2 -0.504676 0.328643 0.478719 -1.165611
3 -0.011445 -0.775834 0.809029 2.148832
4 -1.012311 1.345237 0.725192 -1.658297
5 -1.580452 -0.664339 -0.370294 -1.370419
Viewing and selecting data
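The df2 frame inspected in this section is not defined in the text. The sketch below is one way to build a frame that prints like it. The column types are an assumption inferred from the output shown, not the author's original definition.

```python
import numpy as np
import pandas as pd

# Hypothetical reconstruction of df2, inferred from the printed output.
# Scalars are broadcast to the length of the other columns.
df2 = pd.DataFrame({
    "A": 1.0,                                          # float column
    "B": pd.Timestamp("2013-01-02"),                   # datetime column
    "C": pd.Series(1.0, index=range(4)),               # float column
    "D": np.array([3] * 4),                            # integer column
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",                                        # string column
})

print(df2)
print(df2.dtypes)
```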
>>> df2
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
>>> df2.head(2) #first two rows
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
>>> df2.tail(2)
A B C D E F
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
>>> df2[0:2] #row slice; df2[0] raises a KeyError because 0 is not a column label
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
>>> df2.A
0 1.0
1 1.0
2 1.0
3 1.0
Name: A, dtype: float64
>>> df2['B']
0 2013-01-02
1 2013-01-02
2 2013-01-02
3 2013-01-02
Name: B, dtype: datetime64[ns]
>>> df2.values
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']], dtype=object)
>>> df2.index
Int64Index([0, 1, 2, 3], dtype='int64')
>>> df2.columns
Index(['A', 'B', 'C', 'D', 'E', 'F'], dtype='object')
>>> df2.describe() #statistics for numeric columns only
A C D
count 4.0 4.0 4.0
mean 1.0 1.0 3.0
std 0.0 0.0 0.0
min 1.0 1.0 3.0
25% 1.0 1.0 3.0
50% 1.0 1.0 3.0
75% 1.0 1.0 3.0
max 1.0 1.0 3.0
loc
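We can select data by label. This example selects a single row by its label, or selects some rows or all rows (: means all rows) and then picks one or several columns from them: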
>>> dates = pd.date_range('20130101', periods=6)
>>> df = DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])
>>> df
A B C D
2013-01-01 0 1 2 3
2013-01-02 4 5 6 7
2013-01-03 8 9 10 11
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
>>> df.loc['20130102']
A 4
B 5
C 6
D 7
Name: 2013-01-02 00:00:00, dtype: int32
>>> df.loc['2013-01-01':'2013-01-04','A':'C']#label slices with loc include both endpoints; positional slices such as [0:3] include the start but exclude the end
A B C
2013-01-01 0 1 2
2013-01-02 4 5 6
2013-01-03 8 9 10
2013-01-04 12 13 14
iloc
We can also select by position with iloc, which covers picking a single element, a contiguous range, or non-adjacent rows.
>>> df.iloc[1:4,0:3] #includes position 1, excludes position 4
A B C
2013-01-02 4 5 6
2013-01-03 8 9 10
2013-01-04 12 13 14
>>> df.iloc[[1,3,4],:]
A B C D
2013-01-02 4 5 6 7
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
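The difference in slice endpoints between loc and iloc is easy to overlook. The sketch below rebuilds the same 6x4 frame and selects the same block both ways.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates, columns=list("ABCD"))

# Label-based slice: both endpoints are included.
by_label = df.loc["2013-01-01":"2013-01-04", "A":"C"]

# Position-based slice: the end position is excluded,
# so 0:4 covers rows 0..3 and 0:3 covers columns 0..2.
by_position = df.iloc[0:4, 0:3]

print(by_label.shape)
print(by_label.equals(by_position))
```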
ix
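We can also mix positional and label-based selection with ix. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so this example only runs on older versions.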
>>> df.ix[0:2,['A','D']]
A D
2013-01-01 0 3
2013-01-02 4 7
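On pandas versions without ix, the same mixed selection (rows by position, columns by label) can be written with loc or iloc. The sketch below shows three equivalent spellings; which one reads best is a matter of taste.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates, columns=list("ABCD"))

# 1) Slice rows by position, then pick columns by label.
a = df.iloc[0:2][["A", "D"]]

# 2) Turn the row positions into labels and use loc.
b = df.loc[df.index[0:2], ["A", "D"]]

# 3) Turn the column labels into positions and use iloc.
c = df.iloc[0:2, df.columns.get_indexer(["A", "D"])]

print(a)
```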
Boolean filtering
>>> df[df>5]
A B C D
2013-01-01 NaN NaN NaN NaN
2013-01-02 NaN NaN 6.0 7.0
2013-01-03 8.0 9.0 10.0 11.0
2013-01-04 12.0 13.0 14.0 15.0
2013-01-05 16.0 17.0 18.0 19.0
2013-01-06 20.0 21.0 22.0 23.0
>>> df[df.A>8] #rows where column A is greater than 8
A B C D
2013-01-04 12 13 14 15
2013-01-05 16 17 18 19
2013-01-06 20 21 22 23
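Boolean masks can also be combined. The sketch below uses the same frame and joins two conditions with &. Each condition needs its own parentheses because & binds more tightly than the comparison operators.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(np.arange(24).reshape((6, 4)),
                  index=dates, columns=list("ABCD"))

# Keep rows where A is greater than 8 AND D is less than 23.
mask = (df["A"] > 8) & (df["D"] < 23)
selected = df[mask]

print(mask.tolist())
print(selected)
```

Use | for "or" and ~ for "not" in the same way.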
Handling NaN in Pandas
Importing or processing data sometimes produces empty or NaN values. How can we drop or fill them?
dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
>>> dates = pd.date_range('20130101', periods=6)
>>> df = pd.DataFrame(np.arange(24).reshape((6,4)),index=dates, columns=['A','B','C','D'])
>>> df.iloc[0,1] = np.nan
>>> df.iloc[1,2] = np.nan
"""
A B C D
2013-01-01 0 NaN 2.0 3
2013-01-02 4 5.0 NaN 7
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
"""
>>> df.dropna()
A B C D
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
>>> df
A B C D
2013-01-01 0 NaN 2.0 3
2013-01-02 4 5.0 NaN 7
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
>>> df.dropna(axis='columns',how='any')
A D
2013-01-01 0 3
2013-01-02 4 7
2013-01-03 8 11
2013-01-04 12 15
2013-01-05 16 19
2013-01-06 20 23
>>> df.fillna(value=-1)
A B C D
2013-01-01 0 -1.0 2.0 3
2013-01-02 4 5.0 -1.0 7
2013-01-03 8 9.0 10.0 11
2013-01-04 12 13.0 14.0 15
2013-01-05 16 17.0 18.0 19
2013-01-06 20 21.0 22.0 23
>>> df.isnull()
A B C D
2013-01-01 False True False False
2013-01-02 False False True False
2013-01-03 False False False False
2013-01-04 False False False False
2013-01-05 False False False False
2013-01-06 False False False False
>>> np.any(df.isnull()) == True #checks whether any NaN exists; returns True if one does
True
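A few related checks are often useful alongside dropna and fillna. The sketch below counts NaN values per column and fills each gap with its column mean. Building the frame as float from the start is a choice made here for illustration, so that storing NaN does not change any column's dtype.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Float data, so NaN can be stored without a dtype change.
df = pd.DataFrame(np.arange(24, dtype=float).reshape((6, 4)),
                  index=dates, columns=list("ABCD"))
df.iloc[0, 1] = np.nan
df.iloc[1, 2] = np.nan

# Number of NaN values in each column.
nan_per_column = df.isnull().sum()

# Single True/False answer: is there any NaN anywhere?
has_nan = bool(df.isnull().values.any())

# how='all' drops a row only if every value in it is NaN.
rows_all_nan_dropped = df.dropna(how="all")

# Fill each NaN with the mean of its own column.
filled = df.fillna(df.mean())

print(nan_per_column)
print(has_nan)
print(filled)
```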
Saving and loading data with pandas
pandas can read and write a range of formats; the example below uses CSV and JSON:
>>> path = r'C:\Users\zhifei\Desktop\student.csv'
>>> data = pd.read_csv(path)
>>> data
Student ID name age gender
0 1100 Kelly 22 Female
1 1101 Clo 21 Female
2 1102 Tilly 22 Female
3 1103 Tony 24 Male
4 1104 David 20 Male
5 1105 Catty 22 Female
6 1106 M 3 Female
7 1107 N 43 Male
8 1108 A 13 Male
9 1109 S 12 Male
10 1110 David 33 Male
11 1111 Dw 3 Female
12 1112 Q 23 Male
13 1113 W 21 Female
>>> type(data)
<class 'pandas.core.frame.DataFrame'>
>>> path2 = r'C:\Users\zhifei\Desktop\json.txt'
>>> data.to_json(path2)
>>> data_2 = pd.read_json(path2)
>>> data_2
Student ID age gender name
0 1100 22 Female Kelly
1 1101 21 Female Clo
10 1110 33 Male David
11 1111 3 Female Dw
12 1112 23 Male Q
13 1113 21 Female W
2 1102 22 Female Tilly
3 1103 24 Male Tony
4 1104 20 Male David
5 1105 22 Female Catty
6 1106 3 Female M
7 1107 43 Male N
8 1108 13 Male A
9 1109 12 Male S
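The same round trip can be tried without any files on disk. The sketch below reads CSV text from an in-memory buffer and converts it to JSON and back. The three rows are a shortened copy of the student table above, and the use of io.StringIO and orient="records" are choices made for this illustration.

```python
import io
import pandas as pd

# A small in-memory stand-in for student.csv.
csv_text = (
    "Student ID,name,age,gender\n"
    "1100,Kelly,22,Female\n"
    "1101,Clo,21,Female\n"
    "1103,Tony,24,Male\n"
)
data = pd.read_csv(io.StringIO(csv_text))

# With no path argument, to_json returns the JSON as a string.
json_text = data.to_json(orient="records")

# Read the JSON back into a DataFrame.
data_2 = pd.read_json(io.StringIO(json_text), orient="records")

print(json_text)
print(data_2)
```

With orient="records" each row becomes one JSON object, so the row order is kept on the way back in.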
Merging data with pandas
Function prototype (see help(pd.concat) for the full signature):
import pandas as pd
import numpy as np
#define the datasets
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['a','b','c','d'])
df3 = pd.DataFrame(np.ones((3,4))*2, columns=['a','b','c','d'])
#concatenate vertically
res = pd.concat([df1, df2, df3], axis=0)
#print the result
print(res)
# a b c d
# 0 0.0 0.0 0.0 0.0
# 1 0.0 0.0 0.0 0.0
# 2 0.0 0.0 0.0 0.0
# 0 1.0 1.0 1.0 1.0
# 1 1.0 1.0 1.0 1.0
# 2 1.0 1.0 1.0 1.0
# 0 2.0 2.0 2.0 2.0
# 1 2.0 2.0 2.0 2.0
# 2 2.0 2.0 2.0 2.0
res = pd.concat([df1, df2, df3], axis=0, ignore_index=True)
#print the result
print(res)
# a b c d
# 0 0.0 0.0 0.0 0.0
# 1 0.0 0.0 0.0 0.0
# 2 0.0 0.0 0.0 0.0
# 3 1.0 1.0 1.0 1.0
# 4 1.0 1.0 1.0 1.0
# 5 1.0 1.0 1.0 1.0
# 6 2.0 2.0 2.0 2.0
# 7 2.0 2.0 2.0 2.0
# 8 2.0 2.0 2.0 2.0
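join='outer' is the default, so when no join argument is given the function uses join='outer'. With axis=0 this mode stacks the frames vertically by column: columns that appear in both frames are stacked together, columns that appear in only one frame get a column of their own, and positions that had no value are filled with NaN.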
import pandas as pd
import numpy as np
#define the datasets
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'], index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['b','c','d','e'], index=[2,3,4])
#vertical "outer" concatenation of df1 and df2
res = pd.concat([df1, df2], axis=0, join='outer')
print(res)
# a b c d e
# 1 0.0 0.0 0.0 0.0 NaN
# 2 0.0 0.0 0.0 0.0 NaN
# 3 0.0 0.0 0.0 0.0 NaN
# 2 NaN 1.0 1.0 1.0 1.0
# 3 NaN 1.0 1.0 1.0 1.0
# 4 NaN 1.0 1.0 1.0 1.0
#vertical "inner" concatenation of df1 and df2
res = pd.concat([df1, df2], axis=0, join='inner')
#print the result
print(res)
# b c d
# 1 0.0 0.0 0.0
# 2 0.0 0.0 0.0
# 3 0.0 0.0 0.0
# 2 1.0 1.0 1.0
# 3 1.0 1.0 1.0
# 4 1.0 1.0 1.0
#reset the index and print the result
res = pd.concat([df1, df2], axis=0, join='inner', ignore_index=True)
print(res)
# b c d
# 0 0.0 0.0 0.0
# 1 0.0 0.0 0.0
# 2 0.0 0.0 0.0
# 3 1.0 1.0 1.0
# 4 1.0 1.0 1.0
# 5 1.0 1.0 1.0
join_axes (concatenate along the given axes; this parameter was deprecated in pandas 0.25 and removed in pandas 1.0)
import pandas as pd
import numpy as np
#define the datasets
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'], index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['b','c','d','e'], index=[2,3,4])
#concatenate horizontally, aligned on `df1.index`
res = pd.concat([df1, df2], axis=1, join_axes=[df1.index])
#print the result
print(res)
# a b c d b c d e
# 1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
# 2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
# 3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
#drop join_axes and print the result
res = pd.concat([df1, df2], axis=1)
print(res)
# a b c d b c d e
# 1 0.0 0.0 0.0 0.0 NaN NaN NaN NaN
# 2 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
# 3 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
# 4 NaN NaN NaN NaN 1.0 1.0 1.0 1.0
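On pandas versions where join_axes is no longer available, the same alignment on df1.index can be reproduced by reindexing the result of a plain horizontal concat. A minimal sketch with the same two frames:

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=["a", "b", "c", "d"], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=["b", "c", "d", "e"], index=[2, 3, 4])

# Concatenate horizontally (outer join on the index by default),
# then keep only the rows whose labels appear in df1.index.
res = pd.concat([df1, df2], axis=1).reindex(df1.index)

print(res)
```

Row 4 exists only in df2, so it is dropped, while row 1 keeps NaN in the columns that came from df2.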
append function prototype (DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0):
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
>>> df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('CD'))
>>> df
A B
0 1 2
1 3 4
>>> df2
C D
0 5 6
1 7 8
>>> df.append(df2)#appends rows at the bottom only
A B C D
0 1.0 2.0 NaN NaN
1 3.0 4.0 NaN NaN
0 NaN NaN 5.0 6.0
1 NaN NaN 7.0 8.0
>>> df3 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
>>> df.append(df3,ignore_index=True)
A B
0 1 2
1 3 4
2 5 6
3 7 8
>>> s = pd.Series(['a','b'],index=['A','B'])
>>> df.append(s,ignore_index=True)
A B
0 1 2
1 3 4
2 a b
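On pandas versions without DataFrame.append, pd.concat covers the same cases. The sketch below repeats the two append examples above. Converting the Series to a one-row frame with to_frame().T is one possible approach, not the only one.

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))
df3 = pd.DataFrame([[5, 6], [7, 8]], columns=list("AB"))

# Equivalent of df.append(df3, ignore_index=True).
stacked = pd.concat([df, df3], ignore_index=True)

# Equivalent of df.append(s, ignore_index=True):
# turn the Series into a single-row DataFrame first.
s = pd.Series(["a", "b"], index=["A", "B"])
with_row = pd.concat([df, s.to_frame().T], ignore_index=True)

print(stacked)
print(with_row)
```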
Merging with merge
Function prototype
For more details, see help(pd.merge)
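As a minimal sketch of what merge does, the snippet below joins two small frames on a shared key column. The frames and the column name "key" are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": [1, 2, 3]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "B": [10, 20, 30]})

# Inner join: keep only keys present in both frames.
inner = pd.merge(left, right, on="key", how="inner")

# Outer join: keep every key; missing values become NaN.
outer = pd.merge(left, right, on="key", how="outer")

print(inner)
print(outer)
```

Unlike concat, which stacks frames along an axis, merge matches rows by the values in the key column, much like a SQL join.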
Plotting with pandas
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# randomly generate 1000 data points
data = pd.Series(np.random.randn(1000),index=np.arange(1000))
# accumulate the data so the effect is easier to see
data = data.cumsum()
# pandas data can be plotted directly
data.plot()
plt.show()
For more plotting operations, see the matplotlib module.