This is a Python tutorial for new users, illustrated throughout with data fetched from JoinQuant.
If you have never learned Python, or are still shaky with it, hesitate no longer: this tutorial was written for you!
For more, see the Quantitative Classroom - Python Programming section.
Section summary: this section introduces the pandas library's tools for processing and wrangling data. Data fetched from the platform mostly arrives as a DataFrame, which is a pandas structure.
This section is absolutely essential!
【Python Scientific Computing (3)】 - Data processing and wrangling with pandas
The previous section introduced viewing and selecting data in a DataFrame.
This section covers processing DataFrame data, including handling missing data, applying and mapping functions, data wrangling, and grouping.
# First, import the libraries
import pandas as pd
import numpy as np

# Fetch the daily open, high, low and close prices of Ping An Bank over a few trading days.
df = get_price('000001.XSHE', start_date='2016-07-01', end_date='2016-07-20', frequency='daily', fields=['open','high','low','close'])
df[df > 9.0] = np.nan  # NaN is not a built-in name; use numpy's np.nan to mask prices above 9.0
df
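Note that `get_price` is JoinQuant's data API and only runs on that platform. As a self-contained stand-in (values copied from the tables in this section), the same masking step can be reproduced locally:

```python
import numpy as np
import pandas as pd

# Local stand-in for the get_price result above; values are taken from the tables below.
idx = pd.to_datetime(['2016-07-12', '2016-07-13', '2016-07-14'])
df = pd.DataFrame({'open':  [8.75, 8.88, 8.97],
                   'high':  [8.89, 9.05, 9.00],
                   'low':   [8.74, 8.86, 8.91],
                   'close': [8.88, 8.99, 8.94]}, index=idx)
df[df > 9.0] = np.nan  # only the 9.05 high on 2016-07-13 exceeds 9.0
print(df)
```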
1 Handling missing data
1.1 Drop rows that contain missing values:
df.dropna()
 | open | high | low | close |
---|---|---|---|---|
2016-07-01 | 8.69 | 8.73 | 8.68 | 8.71 |
2016-07-04 | 8.69 | 8.86 | 8.67 | 8.81 |
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 |
2016-07-11 | 8.75 | 8.79 | 8.74 | 8.75 |
2016-07-12 | 8.75 | 8.89 | 8.74 | 8.88 |
2016-07-14 | 8.97 | 9.00 | 8.91 | 8.94 |
2016-07-15 | 8.95 | 9.00 | 8.91 | 8.99 |
2016-07-20 | 8.96 | 8.99 | 8.95 | 8.96 |
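By default `dropna` removes any row with at least one missing value; the standard `how` and `subset` parameters control that behaviour. A minimal sketch using the two incomplete rows from the tables in this section:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'open':  [8.99, np.nan],
                   'high':  [np.nan, np.nan],
                   'low':   [8.97, 8.95],
                   'close': [np.nan, 8.97]},
                  index=pd.to_datetime(['2016-07-18', '2016-07-19']))
print(df.dropna())                # drops both rows: each contains a NaN
print(df.dropna(how='all'))       # keeps both: no row is entirely NaN
print(df.dropna(subset=['low']))  # keeps both: the 'low' column has no NaN
```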
1.2 Fill in missing values:
df.fillna(value=0)
 | open | high | low | close |
---|---|---|---|---|
2016-07-01 | 8.69 | 8.73 | 8.68 | 8.71 |
2016-07-04 | 8.69 | 8.86 | 8.67 | 8.81 |
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 |
2016-07-11 | 8.75 | 8.79 | 8.74 | 8.75 |
2016-07-12 | 8.75 | 8.89 | 8.74 | 8.88 |
2016-07-13 | 8.88 | 0.00 | 8.86 | 8.99 |
2016-07-14 | 8.97 | 9.00 | 8.91 | 8.94 |
2016-07-15 | 8.95 | 9.00 | 8.91 | 8.99 |
2016-07-18 | 8.99 | 0.00 | 8.97 | 0.00 |
2016-07-19 | 0.00 | 0.00 | 8.95 | 8.97 |
2016-07-20 | 8.96 | 8.99 | 8.95 | 8.96 |
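Besides a constant, missing values can be filled by carrying the last valid value forward or with a statistic; both are standard pandas behaviour. A small sketch on one column's values from the table above:

```python
import numpy as np
import pandas as pd

s = pd.Series([8.88, np.nan, 8.97],
              index=pd.to_datetime(['2016-07-12', '2016-07-13', '2016-07-14']))
print(s.fillna(0))         # constant fill, as in the text
print(s.ffill())           # forward-fill: carry the last valid value forward
print(s.fillna(s.mean()))  # fill with the mean of the non-missing values
```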
1.3 Test whether each value is NaN, producing a boolean mask:
pd.isnull(df)
 | open | high | low | close |
---|---|---|---|---|
2016-07-01 | False | False | False | False |
2016-07-04 | False | False | False | False |
2016-07-05 | False | False | False | False |
2016-07-06 | False | False | False | False |
2016-07-07 | False | False | False | False |
2016-07-08 | False | False | False | False |
2016-07-11 | False | False | False | False |
2016-07-12 | False | False | False | False |
2016-07-13 | False | True | False | False |
2016-07-14 | False | False | False | False |
2016-07-15 | False | False | False | False |
2016-07-18 | False | True | False | True |
2016-07-19 | True | True | False | False |
2016-07-20 | False | False | False | False |
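`isnull` pairs with `notnull`, and summing the boolean mask is a common way to count missing values per column. A minimal sketch:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'open': [8.99, np.nan],
                   'high': [np.nan, np.nan]})
print(pd.isnull(df).sum())  # number of NaN values in each column
print(pd.notnull(df))       # the complement of isnull
```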
2 Applying and mapping functions
df.mean()  # column means
open 8.831538
high 8.863636
low 8.812857
close 8.855385
dtype: float64
df.mean(axis=1)  # row means
2016-07-01 8.7025
2016-07-04 8.7575
2016-07-05 8.8025
2016-07-06 8.7925
2016-07-07 8.7775
2016-07-08 8.7625
2016-07-11 8.7575
2016-07-12 8.8150
2016-07-13 8.9100
2016-07-14 8.9550
2016-07-15 8.9625
2016-07-18 8.9800
2016-07-19 8.9600
2016-07-20 8.9650
dtype: float64
df.mean(axis=1, skipna=False)  # skipna defaults to True, meaning missing values are excluded
2016-07-01 8.7025
2016-07-04 8.7575
2016-07-05 8.8025
2016-07-06 8.7925
2016-07-07 8.7775
2016-07-08 8.7625
2016-07-11 8.7575
2016-07-12 8.8150
2016-07-13 NaN
2016-07-14 8.9550
2016-07-15 8.9625
2016-07-18 NaN
2016-07-19 NaN
2016-07-20 8.9650
dtype: float64
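The same axis/skipna behaviour can be checked on a tiny hand-made frame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan], 'b': [3.0, 4.0]})
print(df.mean())                      # column means, NaN skipped: a=1.0, b=3.5
print(df.mean(axis=1))                # row means: 2.0 and 4.0
print(df.mean(axis=1, skipna=False))  # any row containing a NaN becomes NaN
```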
df.sort_index()  # sort by row index labels
 | open | high | low | close |
---|---|---|---|---|
2016-07-01 | 8.69 | 8.73 | 8.68 | 8.71 |
2016-07-04 | 8.69 | 8.86 | 8.67 | 8.81 |
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 |
2016-07-11 | 8.75 | 8.79 | 8.74 | 8.75 |
2016-07-12 | 8.75 | 8.89 | 8.74 | 8.88 |
2016-07-13 | 8.88 | NaN | 8.86 | 8.99 |
2016-07-14 | 8.97 | 9.00 | 8.91 | 8.94 |
2016-07-15 | 8.95 | 9.00 | 8.91 | 8.99 |
2016-07-18 | 8.99 | NaN | 8.97 | NaN |
2016-07-19 | NaN | NaN | 8.95 | 8.97 |
2016-07-20 | 8.96 | 8.99 | 8.95 | 8.96 |
df.sort_index(axis=1)  # sort by column labels
 | close | high | low | open |
---|---|---|---|---|
2016-07-01 | 8.71 | 8.73 | 8.68 | 8.69 |
2016-07-04 | 8.81 | 8.86 | 8.67 | 8.69 |
2016-07-05 | 8.81 | 8.83 | 8.77 | 8.80 |
2016-07-06 | 8.79 | 8.82 | 8.76 | 8.80 |
2016-07-07 | 8.78 | 8.80 | 8.74 | 8.79 |
2016-07-08 | 8.74 | 8.79 | 8.73 | 8.79 |
2016-07-11 | 8.75 | 8.79 | 8.74 | 8.75 |
2016-07-12 | 8.88 | 8.89 | 8.74 | 8.75 |
2016-07-13 | 8.99 | NaN | 8.86 | 8.88 |
2016-07-14 | 8.94 | 9.00 | 8.91 | 8.97 |
2016-07-15 | 8.99 | 9.00 | 8.91 | 8.95 |
2016-07-18 | NaN | NaN | 8.97 | 8.99 |
2016-07-19 | 8.97 | NaN | 8.95 | NaN |
2016-07-20 | 8.96 | 8.99 | 8.95 | 8.96 |
# Sorting is ascending by default; pass ascending=False to sort in descending order
df.sort_index(axis=1, ascending=False)
 | open | low | high | close |
---|---|---|---|---|
2016-07-01 | 8.69 | 8.68 | 8.73 | 8.71 |
2016-07-04 | 8.69 | 8.67 | 8.86 | 8.81 |
2016-07-05 | 8.80 | 8.77 | 8.83 | 8.81 |
2016-07-06 | 8.80 | 8.76 | 8.82 | 8.79 |
2016-07-07 | 8.79 | 8.74 | 8.80 | 8.78 |
2016-07-08 | 8.79 | 8.73 | 8.79 | 8.74 |
2016-07-11 | 8.75 | 8.74 | 8.79 | 8.75 |
2016-07-12 | 8.75 | 8.74 | 8.89 | 8.88 |
2016-07-13 | 8.88 | 8.86 | NaN | 8.99 |
2016-07-14 | 8.97 | 8.91 | 9.00 | 8.94 |
2016-07-15 | 8.95 | 8.91 | 9.00 | 8.99 |
2016-07-18 | 8.99 | 8.97 | NaN | NaN |
2016-07-19 | NaN | 8.95 | NaN | 8.97 |
2016-07-20 | 8.96 | 8.95 | 8.99 | 8.96 |
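`sort_index` orders by labels; to order by the values in a column, the standard `sort_values` method is used instead. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({'open': [8.80, 8.69, 8.79]},
                  index=pd.to_datetime(['2016-07-05', '2016-07-01', '2016-07-07']))
print(df.sort_values(by='open'))                   # ascending by the 'open' column
print(df.sort_values(by='open', ascending=False))  # descending
```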
The methods introduced above are the most commonly used ones, but there are many others you can learn on your own. Some are listed below for reference:
- count: number of non-NaN values
- describe: summary statistics for a Series or for each DataFrame column
- min, max: minimum and maximum values
- argmin, argmax: integer positions at which the minimum and maximum are found
- idxmin, idxmax: index labels at which the minimum and maximum are found
- quantile: sample quantile (from 0 to 1)
- sum: sum of the values
- mean: mean of the values
- median: arithmetic median (50% quantile) of the values
- mad: mean absolute deviation from the mean
- var: sample variance of the values
- std: sample standard deviation of the values
- skew: sample skewness (third moment) of the values
- kurt: sample kurtosis (fourth moment) of the values
- cumsum: cumulative sum of the values
- cummin, cummax: cumulative minimum and cumulative maximum of the values
- cumprod: cumulative product of the values
- diff: first arithmetic difference (useful for time series)
- pct_change: percentage change
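A few of the listed methods on a short series (values taken from the tables above):

```python
import pandas as pd

s = pd.Series([8.71, 8.81, 8.79],
              index=pd.to_datetime(['2016-07-01', '2016-07-04', '2016-07-05']))
print(s.count())       # 3 non-NaN values
print(s.idxmax())      # index label of the maximum: 2016-07-04
print(s.cumsum())      # running total
print(s.diff())        # first difference: NaN, 0.10, -0.02
print(s.pct_change())  # percentage change between consecutive rows
```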
3 Data wrangling
pandas provides a large set of methods for merging Series, DataFrame, and Panel objects according to all kinds of logical relationships:
- concat: stacks several objects along an axis
- append: appends rows to a DataFrame
- duplicated / drop_duplicates: flag and remove duplicate rows
3.1 concat
df1 = get_price('000001.XSHE',start_date='2016-07-05', end_date='2016-07-08', frequency='daily', fields=['open','high','low','close'])
df1
 | open | high | low | close |
---|---|---|---|---|
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 |
df2 = get_price('000001.XSHE',start_date='2016-07-12', end_date='2016-07-15', frequency='daily', fields=['open','high','low','close'])
df2
 | open | high | low | close |
---|---|---|---|---|
2016-07-12 | 8.75 | 8.89 | 8.74 | 8.88 |
2016-07-13 | 8.88 | 9.05 | 8.86 | 8.99 |
2016-07-14 | 8.97 | 9.00 | 8.91 | 8.94 |
2016-07-15 | 8.95 | 9.00 | 8.91 | 8.99 |
Vertical (row-wise) concatenation, the default:
pd.concat([df1,df2],axis=0)
 | open | high | low | close |
---|---|---|---|---|
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 |
2016-07-12 | 8.75 | 8.89 | 8.74 | 8.88 |
2016-07-13 | 8.88 | 9.05 | 8.86 | 8.99 |
2016-07-14 | 8.97 | 9.00 | 8.91 | 8.94 |
2016-07-15 | 8.95 | 9.00 | 8.91 | 8.99 |
Horizontal (column-wise) concatenation; index labels that do not line up are filled with NaN:
pd.concat([df1,df2],axis=1)
 | open | high | low | close | open | high | low | close |
---|---|---|---|---|---|---|---|---|
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 | NaN | NaN | NaN | NaN |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 | NaN | NaN | NaN | NaN |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 | NaN | NaN | NaN | NaN |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 | NaN | NaN | NaN | NaN |
2016-07-12 | NaN | NaN | NaN | NaN | 8.75 | 8.89 | 8.74 | 8.88 |
2016-07-13 | NaN | NaN | NaN | NaN | 8.88 | 9.05 | 8.86 | 8.99 |
2016-07-14 | NaN | NaN | NaN | NaN | 8.97 | 9.00 | 8.91 | 8.94 |
2016-07-15 | NaN | NaN | NaN | NaN | 8.95 | 9.00 | 8.91 | 8.99 |
Below is the result of a horizontal concatenation when the indexes do line up:
df3 = get_price('000001.XSHE',start_date='2016-07-12', end_date='2016-07-15', frequency='daily', fields=['low','close'])
df4 = get_price('000001.XSHE',start_date='2016-07-12', end_date='2016-07-15', frequency='daily', fields=['open','high'])
pd.concat([df3,df4],axis=1)
 | low | close | open | high |
---|---|---|---|---|
2016-07-12 | 8.74 | 8.88 | 8.75 | 8.89 |
2016-07-13 | 8.86 | 8.99 | 8.88 | 9.05 |
2016-07-14 | 8.91 | 8.94 | 8.97 | 9.00 |
2016-07-15 | 8.91 | 8.99 | 8.95 | 9.00 |
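`concat` also accepts a `keys` argument that labels each input block, so the source of every row stays recoverable afterwards. A small sketch with hand-made frames (values from the tables above):

```python
import pandas as pd

df1 = pd.DataFrame({'close': [8.81, 8.79]},
                   index=pd.to_datetime(['2016-07-05', '2016-07-06']))
df2 = pd.DataFrame({'close': [8.78, 8.74]},
                   index=pd.to_datetime(['2016-07-07', '2016-07-08']))
out = pd.concat([df1, df2], keys=['first', 'second'])  # label each source block
print(out)
print(out.loc['second'])  # select one block back out by its key
```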
3.2 append
df1
 | open | high | low | close |
---|---|---|---|---|
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 |
s = df1.iloc[0]
s
open 8.80
high 8.83
low 8.77
close 8.81
Name: 2016-07-05 00:00:00, dtype: float64
df1.append(s, ignore_index=False) # ignore_index=False keeps the original index labels
 | open | high | low | close |
---|---|---|---|---|
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 |
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
df1.append(s, ignore_index=True) # ignore_index=True resets the index to 0, 1, 2, ...
 | open | high | low | close |
---|---|---|---|---|
0 | 8.80 | 8.83 | 8.77 | 8.81 |
1 | 8.80 | 8.82 | 8.76 | 8.79 |
2 | 8.79 | 8.80 | 8.74 | 8.78 |
3 | 8.79 | 8.79 | 8.73 | 8.74 |
4 | 8.80 | 8.83 | 8.77 | 8.81 |
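Note that `DataFrame.append` was deprecated and then removed in pandas 2.0; on modern pandas the same result comes from `pd.concat`. A sketch of the equivalent calls:

```python
import pandas as pd

df1 = pd.DataFrame({'open': [8.80, 8.80]},
                   index=pd.to_datetime(['2016-07-05', '2016-07-06']))
s = df1.iloc[0]
# Equivalent of df1.append(s, ignore_index=False) on pandas >= 2.0:
out = pd.concat([df1, s.to_frame().T])
print(out)
# Equivalent of df1.append(s, ignore_index=True):
print(pd.concat([df1, s.to_frame().T], ignore_index=True))
```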
3.3 Removing duplicates: duplicated and drop_duplicates
z = df1.append(s, ignore_index=False)
z
 | open | high | low | close |
---|---|---|---|---|
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 |
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
Flag the duplicate rows:
z.duplicated()
2016-07-05 False
2016-07-06 False
2016-07-07 False
2016-07-08 False
2016-07-05     True
dtype: bool
Remove the duplicate rows:
z.drop_duplicates()
 | open | high | low | close |
---|---|---|---|---|
2016-07-05 | 8.80 | 8.83 | 8.77 | 8.81 |
2016-07-06 | 8.80 | 8.82 | 8.76 | 8.79 |
2016-07-07 | 8.79 | 8.80 | 8.74 | 8.78 |
2016-07-08 | 8.79 | 8.79 | 8.73 | 8.74 |
As you can see, the duplicated '2016-07-05' row has been removed.
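Both `duplicated` and `drop_duplicates` take a `keep` parameter that chooses which occurrence survives. A minimal sketch:

```python
import pandas as pd

z = pd.DataFrame({'open': [8.80, 8.80, 8.79]},
                 index=pd.to_datetime(['2016-07-05', '2016-07-05', '2016-07-07']))
print(z.duplicated())               # default keep='first': the second 8.80 row is flagged
print(z.duplicated(keep='last'))    # flag the first occurrence instead
print(z.drop_duplicates())          # keep one copy of each distinct row
```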
4 Grouping
z.groupby('open').sum()
 | high | low | close |
---|---|---|---|
open | | | |
8.79 | 17.59 | 17.47 | 17.52 |
8.80 | 26.48 | 26.30 | 26.41 |
z.groupby(['open','close']).sum()
open | close | high | low |
---|---|---|---|
8.79 | 8.74 | 8.79 | 8.73 |
8.79 | 8.78 | 8.80 | 8.74 |
8.80 | 8.79 | 8.82 | 8.76 |
8.80 | 8.81 | 17.66 | 17.54 |
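Group objects support more than `sum`; the standard `agg` method applies several statistics at once. A sketch on hand-made open/close values:

```python
import pandas as pd

z = pd.DataFrame({'open':  [8.79, 8.79, 8.80],
                  'close': [8.74, 8.78, 8.81]})
g = z.groupby('open')['close']
print(g.sum())                   # one total per distinct 'open' value
print(g.agg(['mean', 'count']))  # several statistics at once
```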
df9 = get_price(['000001.XSHE','000002.XSHE'],start_date='2016-07-12', end_date='2016-07-15', frequency='daily', fields=['open','high','low','close'])
df9
<class 'pandas.core.panel.Panel'>
Dimensions: 4 (items) x 4 (major_axis) x 2 (minor_axis)
Items axis: close to open
Major_axis axis: 2016-07-12 00:00:00 to 2016-07-15 00:00:00
Minor_axis axis: 000001.XSHE to 000002.XSHE
df9[:,0,:]
 | close | high | low | open |
---|---|---|---|---|
000001.XSHE | 8.88 | 8.89 | 8.74 | 8.75 |
000002.XSHE | 17.39 | 17.69 | 16.85 | 17.53 |
df9[:,:,0]
 | close | high | low | open |
---|---|---|---|---|
2016-07-12 | 8.88 | 8.89 | 8.74 | 8.75 |
2016-07-13 | 8.99 | 9.05 | 8.86 | 8.88 |
2016-07-14 | 8.94 | 9.00 | 8.91 | 8.97 |
2016-07-15 | 8.99 | 9.00 | 8.91 | 8.95 |
df9[:,:,1]
 | close | high | low | open |
---|---|---|---|---|
2016-07-12 | 17.39 | 17.69 | 16.85 | 17.53 |
2016-07-13 | 17.58 | 17.74 | 17.13 | 17.27 |
2016-07-14 | 17.24 | 17.45 | 17.20 | 17.41 |
2016-07-15 | 17.17 | 17.32 | 17.14 | 17.18 |
df9[0,:,:]
 | 000001.XSHE | 000002.XSHE |
---|---|---|
2016-07-12 | 8.88 | 17.39 |
2016-07-13 | 8.99 | 17.58 |
2016-07-14 | 8.94 | 17.24 |
2016-07-15 | 8.99 | 17.17 |
df9.ix[:,0]  # .ix indexing is deprecated in later pandas versions
 | close | high | low | open |
---|---|---|---|---|
000001.XSHE | 8.88 | 8.89 | 8.74 | 8.75 |
000002.XSHE | 17.39 | 17.69 | 16.85 | 17.53 |
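The Panel type shown above was deprecated and removed in pandas 0.25, so modern pandas cannot run these slices. The usual substitute is a DataFrame with a column MultiIndex built from a dict of per-security frames; a sketch with values copied from the tables above:

```python
import pandas as pd

idx = pd.to_datetime(['2016-07-12', '2016-07-13'])
per_stock = {
    '000001.XSHE': pd.DataFrame({'close': [8.88, 8.99]}, index=idx),
    '000002.XSHE': pd.DataFrame({'close': [17.39, 17.58]}, index=idx),
}
# Stack into one frame with a (security, field) column MultiIndex:
panel_like = pd.concat(per_stock, axis=1)
print(panel_like)
print(panel_like['000001.XSHE'])                # one security's DataFrame back
print(panel_like.xs('close', axis=1, level=1))  # 'close' across all securities
```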