This is a Python tutorial aimed at new users, illustrated throughout with data fetched from JoinQuant.
If you have never learned Python, or are not yet comfortable with it, hesitate no more: this tutorial was written for you!
For more content, see the Quant Classroom - Python Programming section.
Section summary: this section covers data processing and tidying with the pandas library. Data fetched from the platform mostly arrives as DataFrame objects, which are part of pandas.
This section is essential, so read it carefully!
[Python Scientific Computing (3)] - Data Processing and Tidying with pandas
In the previous section we covered viewing and selecting data in a DataFrame.
This section focuses on processing DataFrame data, including handling missing data, function application and mapping, data wrangling, and grouping.
# First import the libraries
import numpy as np
import pandas as pd
# Fetch Ping An Bank's open, high, low and close prices for the recent trading days.
df = get_price('000001.XSHE',start_date='2016-07-01', end_date='2016-07-20', frequency='daily', fields=['open','high','low','close'])
# Artificially knock out some values so we have missing data to work with
df[df > 9.0] = np.nan
df
1 Handling missing data
1.1 Drop rows containing missing values:
df.dropna()
open | high | low | close | |
---|---|---|---|---|
2016-07-01 | 8.69 | 8.73 | 8.68 | 8.71 |
2016-07-04 | 8.69 | 8.86 | 8.67 | 8.81 |
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 |
2016-07-11 | 8.75 | 8.79 | 8.74 | 8.75 |
2016-07-12 | 8.75 | 8.89 | 8.74 | 8.88 |
2016-07-14 | 8.97 | 9.00 | 8.91 | 8.94 |
2016-07-15 | 8.95 | 9.00 | 8.91 | 8.99 |
2016-07-20 | 8.96 | 8.99 | 8.95 | 8.96 |
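dropna also takes parameters that control which rows are dropped. A minimal, self-contained sketch (the frame below is hand-built to mimic the data above, since get_price only exists on the platform):

```python
import numpy as np
import pandas as pd

# Hand-built frame with missing values, mimicking the tutorial's data
df = pd.DataFrame({'open':  [8.75, 8.88, np.nan],
                   'high':  [8.89, np.nan, 9.00],
                   'low':   [8.74, 8.86, 8.91],
                   'close': [8.88, 8.99, 8.94]},
                  index=pd.to_datetime(['2016-07-12', '2016-07-13', '2016-07-14']))

print(df.dropna())                 # drop rows containing any NaN
print(df.dropna(how='all'))        # drop only rows that are entirely NaN (none here)
print(df.dropna(subset=['open']))  # drop rows where 'open' specifically is NaN
```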
1.2 Fill in missing values:
df.fillna(value=0)
open | high | low | close | |
---|---|---|---|---|
2016-07-01 | 8.69 | 8.73 | 8.68 | 8.71 |
2016-07-04 | 8.69 | 8.86 | 8.67 | 8.81 |
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 |
2016-07-11 | 8.75 | 8.79 | 8.74 | 8.75 |
2016-07-12 | 8.75 | 8.89 | 8.74 | 8.88 |
2016-07-13 | 8.88 | 0.00 | 8.86 | 8.99 |
2016-07-14 | 8.97 | 9.00 | 8.91 | 8.94 |
2016-07-15 | 8.95 | 9.00 | 8.91 | 8.99 |
2016-07-18 | 8.99 | 0.00 | 8.97 | 0.00 |
2016-07-19 | 0.00 | 0.00 | 8.95 | 8.97 |
2016-07-20 | 8.96 | 8.99 | 8.95 | 8.96 |
1.3 Test whether values are NaN, returning a boolean mask:
pd.isnull(df)
open | high | low | close | |
---|---|---|---|---|
2016-07-01 | False | False | False | False |
2016-07-04 | False | False | False | False |
2016-07-05 | False | False | False | False |
2016-07-06 | False | False | False | False |
2016-07-07 | False | False | False | False |
2016-07-08 | False | False | False | False |
2016-07-11 | False | False | False | False |
2016-07-12 | False | False | False | False |
2016-07-13 | False | True | False | False |
2016-07-14 | False | False | False | False |
2016-07-15 | False | False | False | False |
2016-07-18 | False | True | False | True |
2016-07-19 | True | True | False | False |
2016-07-20 | False | False | False | False |
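The boolean mask from isnull is also handy for counting missing values, since True sums as 1. A quick sketch on a hand-built frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'open': [8.75, np.nan, 8.96],
                   'high': [np.nan, np.nan, 8.99]},
                  index=pd.to_datetime(['2016-07-13', '2016-07-19', '2016-07-20']))

print(pd.isnull(df).sum())        # missing values per column
print(pd.isnull(df).sum().sum())  # total missing values in the frame
```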
2 Function application and mapping
df.mean()  # mean of each column
open 8.831538
high 8.863636
low 8.812857
close 8.855385
dtype: float64
df.mean(1)  # mean of each row
2016-07-01 8.7025
2016-07-04 8.7575
2016-07-05 8.8025
2016-07-06 8.7925
2016-07-07 8.7775
2016-07-08 8.7625
2016-07-11 8.7575
2016-07-12 8.8150
2016-07-13 8.9100
2016-07-14 8.9550
2016-07-15 8.9625
2016-07-18 8.9800
2016-07-19 8.9600
2016-07-20 8.9650
dtype: float64
df.mean(axis=1, skipna=False)  # skipna defaults to True, which excludes missing values
2016-07-01 8.7025
2016-07-04 8.7575
2016-07-05 8.8025
2016-07-06 8.7925
2016-07-07 8.7775
2016-07-08 8.7625
2016-07-11 8.7575
2016-07-12 8.8150
2016-07-13 NaN
2016-07-14 8.9550
2016-07-15 8.9625
2016-07-18 NaN
2016-07-19 NaN
2016-07-20 8.9650
dtype: float64
df.sort_index()  # sort by row index
open | high | low | close | |
---|---|---|---|---|
2016-07-01 | 8.69 | 8.73 | 8.68 | 8.71 |
2016-07-04 | 8.69 | 8.86 | 8.67 | 8.81 |
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 |
2016-07-11 | 8.75 | 8.79 | 8.74 | 8.75 |
2016-07-12 | 8.75 | 8.89 | 8.74 | 8.88 |
2016-07-13 | 8.88 | NaN | 8.86 | 8.99 |
2016-07-14 | 8.97 | 9.00 | 8.91 | 8.94 |
2016-07-15 | 8.95 | 9.00 | 8.91 | 8.99 |
2016-07-18 | 8.99 | NaN | 8.97 | NaN |
2016-07-19 | NaN | NaN | 8.95 | 8.97 |
2016-07-20 | 8.96 | 8.99 | 8.95 | 8.96 |
df.sort_index(axis=1)  # sort by column name
close | high | low | open | |
---|---|---|---|---|
2016-07-01 | 8.71 | 8.73 | 8.68 | 8.69 |
2016-07-04 | 8.81 | 8.86 | 8.67 | 8.69 |
2016-07-05 | 8.81 | 8.83 | 8.77 | 8.80 |
2016-07-06 | 8.79 | 8.82 | 8.76 | 8.80 |
2016-07-07 | 8.78 | 8.80 | 8.74 | 8.79 |
2016-07-08 | 8.74 | 8.79 | 8.73 | 8.79 |
2016-07-11 | 8.75 | 8.79 | 8.74 | 8.75 |
2016-07-12 | 8.88 | 8.89 | 8.74 | 8.75 |
2016-07-13 | 8.99 | NaN | 8.86 | 8.88 |
2016-07-14 | 8.94 | 9.00 | 8.91 | 8.97 |
2016-07-15 | 8.99 | 9.00 | 8.91 | 8.95 |
2016-07-18 | NaN | NaN | 8.97 | 8.99 |
2016-07-19 | 8.97 | NaN | 8.95 | NaN |
2016-07-20 | 8.96 | 8.99 | 8.95 | 8.96 |
# Sorting is ascending by default; it can also be descending
df.sort_index(axis=1, ascending=False)
open | low | high | close | |
---|---|---|---|---|
2016-07-01 | 8.69 | 8.68 | 8.73 | 8.71 |
2016-07-04 | 8.69 | 8.67 | 8.86 | 8.81 |
2016-07-05 | 8.80 | 8.77 | 8.83 | 8.81 |
2016-07-06 | 8.80 | 8.76 | 8.82 | 8.79 |
2016-07-07 | 8.79 | 8.74 | 8.80 | 8.78 |
2016-07-08 | 8.79 | 8.73 | 8.79 | 8.74 |
2016-07-11 | 8.75 | 8.74 | 8.79 | 8.75 |
2016-07-12 | 8.75 | 8.74 | 8.89 | 8.88 |
2016-07-13 | 8.88 | 8.86 | NaN | 8.99 |
2016-07-14 | 8.97 | 8.91 | 9.00 | 8.94 |
2016-07-15 | 8.95 | 8.91 | 9.00 | 8.99 |
2016-07-18 | 8.99 | 8.97 | NaN | NaN |
2016-07-19 | NaN | 8.95 | NaN | 8.97 |
2016-07-20 | 8.96 | 8.95 | 8.99 | 8.96 |
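Beyond the built-in statistics, the "function application and mapping" of this section's title also covers passing your own functions: apply runs a function over each column (or row), and applymap runs one element-wise. A sketch on a hand-built frame (note that applymap was renamed DataFrame.map in recent pandas versions):

```python
import pandas as pd

df = pd.DataFrame({'open': [8.75, 8.88, 8.97],
                   'close': [8.88, 8.99, 8.94]},
                  index=pd.to_datetime(['2016-07-12', '2016-07-13', '2016-07-14']))

# apply: one function per column (axis=0 is the default)
spread = df.apply(lambda col: col.max() - col.min())
print(spread)

# applymap: one function per element, e.g. formatting every price
print(df.applymap(lambda x: '%.2f' % x))
```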
The methods above are the most common ones, but there are many others you can explore on your own. Some are listed below for reference:

- count: number of non-NA values
- describe: summary statistics for a Series or for each DataFrame column
- min, max: minimum and maximum
- argmin, argmax: integer index positions of the minimum and maximum
- idxmin, idxmax: index labels of the minimum and maximum
- quantile: sample quantile (0 to 1)
- sum: sum of values
- mean: mean of values
- median: arithmetic median (50% quantile) of values
- mad: mean absolute deviation from the mean
- var: sample variance
- std: sample standard deviation
- skew: sample skewness (third moment)
- kurt: sample kurtosis (fourth moment)
- cumsum: cumulative sum
- cummin, cummax: cumulative minimum and cumulative maximum
- cumprod: cumulative product
- diff: first difference (useful for time series)
- pct_change: percentage change
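A few of the listed methods in action, on a hand-built Series:

```python
import pandas as pd

s = pd.Series([8.75, 8.88, 8.97],
              index=pd.to_datetime(['2016-07-12', '2016-07-13', '2016-07-14']))

print(s.describe())    # count, mean, std, min, quartiles, max in one call
print(s.diff())        # first difference; the first value has no predecessor, so NaN
print(s.pct_change())  # day-over-day percentage change
print(s.cumsum())      # running total
```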
3 Data wrangling
pandas provides a rich set of methods for merging Series, DataFrame and Panel objects under all sorts of logical relationships:

- concat: stack multiple objects together along an axis
- append: append a row to a DataFrame
- duplicated / drop_duplicates: detect and remove duplicate rows
3.1 concat
df1 = get_price('000001.XSHE',start_date='2016-07-05', end_date='2016-07-08', frequency='daily', fields=['open','high','low','close'])
df1
open | high | low | close | |
---|---|---|---|---|
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 |
df2 = get_price('000001.XSHE',start_date='2016-07-12', end_date='2016-07-15', frequency='daily', fields=['open','high','low','close'])
df2
open | high | low | close | |
---|---|---|---|---|
2016-07-12 | 8.75 | 8.89 | 8.74 | 8.88 |
2016-07-13 | 8.88 | 9.05 | 8.86 | 8.99 |
2016-07-14 | 8.97 | 9.00 | 8.91 | 8.94 |
2016-07-15 | 8.95 | 9.00 | 8.91 | 8.99 |
Vertical concatenation (the default):
pd.concat([df1,df2],axis=0)
open | high | low | close | |
---|---|---|---|---|
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 |
2016-07-12 | 8.75 | 8.89 | 8.74 | 8.88 |
2016-07-13 | 8.88 | 9.05 | 8.86 | 8.99 |
2016-07-14 | 8.97 | 9.00 | 8.91 | 8.94 |
2016-07-15 | 8.95 | 9.00 | 8.91 | 8.99 |
Horizontal concatenation; index labels that do not line up are filled with NaN:
pd.concat([df1,df2],axis=1)
open | high | low | close | open | high | low | close | |
---|---|---|---|---|---|---|---|---|
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 | NaN | NaN | NaN | NaN |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 | NaN | NaN | NaN | NaN |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 | NaN | NaN | NaN | NaN |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 | NaN | NaN | NaN | NaN |
2016-07-12 | NaN | NaN | NaN | NaN | 8.75 | 8.89 | 8.74 | 8.88 |
2016-07-13 | NaN | NaN | NaN | NaN | 8.88 | 9.05 | 8.86 | 8.99 |
2016-07-14 | NaN | NaN | NaN | NaN | 8.97 | 9.00 | 8.91 | 8.94 |
2016-07-15 | NaN | NaN | NaN | NaN | 8.95 | 9.00 | 8.91 | 8.99 |
Below is a horizontal concatenation where the indexes do line up:
df3 = get_price('000001.XSHE',start_date='2016-07-12', end_date='2016-07-15', frequency='daily', fields=['low','close'])
df4 = get_price('000001.XSHE',start_date='2016-07-12', end_date='2016-07-15', frequency='daily', fields=['open','high'])
pd.concat([df3,df4],axis=1)
low | close | open | high | |
---|---|---|---|---|
2016-07-12 | 8.74 | 8.88 | 8.75 | 8.89 |
2016-07-13 | 8.86 | 8.99 | 8.88 | 9.05 |
2016-07-14 | 8.91 | 8.94 | 8.97 | 9.00 |
2016-07-15 | 8.91 | 8.99 | 8.95 | 9.00 |
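concat also takes a join parameter: the default 'outer' keeps the union of the indexes (filling gaps with NaN, as above), while 'inner' keeps only the intersection; a keys argument labels where each piece came from. A sketch with two tiny hand-built frames:

```python
import pandas as pd

a = pd.DataFrame({'x': [1, 2]}, index=['2016-07-05', '2016-07-06'])
b = pd.DataFrame({'y': [3, 4]}, index=['2016-07-06', '2016-07-07'])

print(pd.concat([a, b], axis=1))                # outer: union of indexes, NaN gaps
print(pd.concat([a, b], axis=1, join='inner'))  # inner: only '2016-07-06' survives
print(pd.concat([a, b], keys=['a', 'b']))       # keys add an outer index level
```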
3.2 append
df1
open | high | low | close | |
---|---|---|---|---|
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 |
s = df1.iloc[0]
s
open 8.80
high 8.83
low 8.77
close 8.81
Name: 2016-07-05 00:00:00, dtype: float64
df1.append(s, ignore_index=False)  # ignore_index=False keeps the original index
open | high | low | close | |
---|---|---|---|---|
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 |
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
df1.append(s, ignore_index=True)  # ignore_index=True resets the index
open | high | low | close | |
---|---|---|---|---|
0 | 8.80 | 8.83 | 8.77 | 8.81 |
1 | 8.80 | 8.82 | 8.76 | 8.79 |
2 | 8.79 | 8.80 | 8.74 | 8.78 |
3 | 8.79 | 8.79 | 8.73 | 8.74 |
4 | 8.80 | 8.83 | 8.77 | 8.81 |
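Note for readers on current pandas: DataFrame.append was deprecated and then removed in pandas 2.0; the same result is obtained with pd.concat. A sketch of the equivalent, using a hand-built copy of df1:

```python
import pandas as pd

df1 = pd.DataFrame({'open': [8.80, 8.80], 'close': [8.81, 8.79]},
                   index=pd.to_datetime(['2016-07-05', '2016-07-06']))
s = df1.iloc[0]

# Turn the Series into a one-row frame, then concatenate
row = s.to_frame().T
print(pd.concat([df1, row]))                     # keeps the original index
print(pd.concat([df1, row], ignore_index=True))  # resets the index to 0..n-1
```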
3.3 Removing duplicates: duplicated
z = df1.append(s, ignore_index=False)
z
open | high | low | close | |
---|---|---|---|---|
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 |
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
Check which rows are duplicates:
z.duplicated()
2016-07-05 False
2016-07-06 False
2016-07-07 False
2016-07-08 False
2016-07-05      True
dtype: bool
Remove the duplicates:
z.drop_duplicates()
open | high | low | close | |
---|---|---|---|---|
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 |
You can see that the duplicated '2016-07-05' row has been removed.
4 Grouping
z.groupby('open').sum()
high | low | close | |
---|---|---|---|
open | |||
8.79 | 17.59 | 17.47 | 17.52 |
8.80 | 26.48 | 26.30 | 26.41 |
z.groupby(['open','close']).sum()
high | low | ||
---|---|---|---|
open | close | ||
8.79 | 8.74 | 8.79 | 8.73 |
8.78 | 8.80 | 8.74 | |
8.80 | 8.79 | 8.82 | 8.76 |
8.81 | 17.66 | 17.54 |
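groupby is not limited to sum; agg applies several aggregations at once. A sketch on a hand-built copy of z's open and high columns:

```python
import pandas as pd

z = pd.DataFrame({'open': [8.80, 8.80, 8.79, 8.79, 8.80],
                  'high': [8.83, 8.82, 8.80, 8.79, 8.83]})

# Several statistics per group in one call
print(z.groupby('open')['high'].agg(['sum', 'mean', 'count']))
```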
df9 = get_price(['000001.XSHE','000002.XSHE'],start_date='2016-07-12', end_date='2016-07-15', frequency='daily', fields=['open','high','low','close'])
df9
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 4 (major_axis) x 2 (minor_axis)
Items axis: close to open
Major_axis axis: 2016-07-12 00:00:00 to 2016-07-15 00:00:00
Minor_axis axis: 000001.XSHE to 000002.XSHE
df9[:,0,:]
close | high | low | open | |
---|---|---|---|---|
000001.XSHE | 8.88 | 8.89 | 8.74 | 8.75 |
000002.XSHE | 17.39 | 17.69 | 16.85 | 17.53 |
df9[:,:,0]
close | high | low | open | |
---|---|---|---|---|
2016-07-12 | 8.88 | 8.89 | 8.74 | 8.75 |
2016-07-13 | 8.99 | 9.05 | 8.86 | 8.88 |
2016-07-14 | 8.94 | 9.00 | 8.91 | 8.97 |
2016-07-15 | 8.99 | 9.00 | 8.91 | 8.95 |
df9[:,:,1]
close | high | low | open | |
---|---|---|---|---|
2016-07-12 | 17.39 | 17.69 | 16.85 | 17.53 |
2016-07-13 | 17.58 | 17.74 | 17.13 | 17.27 |
2016-07-14 | 17.24 | 17.45 | 17.20 | 17.41 |
2016-07-15 | 17.17 | 17.32 | 17.14 | 17.18 |
df9[0,:,:]
000001.XSHE | 000002.XSHE | |
---|---|---|
2016-07-12 | 8.88 | 17.39 |
2016-07-13 | 8.99 | 17.58 |
2016-07-14 | 8.94 | 17.24 |
2016-07-15 | 8.99 | 17.17 |
df9.ix[:,0]
close | high | low | open | |
---|---|---|---|---|
000001.XSHE | 8.88 | 8.89 | 8.74 | 8.75 |
000002.XSHE | 17.39 | 17.69 | 16.85 | 17.53 |
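One caveat for readers on current pandas: Panel was deprecated in pandas 0.20 and removed in 0.25. Three-dimensional data like this is now usually represented as a DataFrame with a MultiIndex, for example by concatenating per-security frames. A sketch with hand-built data standing in for get_price:

```python
import pandas as pd

dates = pd.to_datetime(['2016-07-12', '2016-07-13'])
frames = {'000001.XSHE': pd.DataFrame({'close': [8.88, 8.99]}, index=dates),
          '000002.XSHE': pd.DataFrame({'close': [17.39, 17.58]}, index=dates)}

# Stack the per-security frames into one MultiIndex frame
panel_like = pd.concat(frames, names=['security'])
print(panel_like)

# Panel-style slices become .xs() selections
print(panel_like.xs('000001.XSHE', level='security'))  # one security, all dates
print(panel_like.xs(dates[0], level=1))                # one date, all securities
```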