Financial High-Frequency Data Management: A Performance Comparison and Analysis of DolphinDB and pickle

Level 1/Level 2 quote and trade data from financial markets are essential to quantitative trading research. Historical L1/L2 data for the entire domestic (Chinese) market amounts to roughly 20 to 50 TB, with about 20 to 50 GB added each day. Traditional relational databases such as MS SQL Server or MySQL cannot support data at this scale; even with sharding across databases and tables, query performance falls far short of requirements. Data warehouses such as Impala and Greenplum, and NoSQL databases such as HBase, can store data at this scale, but these general-purpose storage engines lack good support for time-series data, have serious shortcomings in both querying and computation, and offer very limited support for Python, which is widely used in quantitative finance.

The limitations of databases have led some users to turn to file storage. HDF5, Parquet, and pickle are commonly used binary file formats; among them, pickle, the protocol for serializing and deserializing Python objects, is very efficient. Because Python is a common tool in quantitative finance and data analysis, many users store high-frequency data with pickle. File storage has obvious drawbacks, however: heavy data redundancy, difficulty managing different versions, no access control, inability to use the resources of multiple nodes, inconvenient joins across datasets, overly coarse data-management granularity, and inconvenient search and query.

A growing number of brokerages and private funds are now adopting DolphinDB, a high-performance time-series database, to handle high-frequency data. DolphinDB uses columnar storage and provides several flexible partitioning schemes, so it can make full use of the resources of every node in a cluster. Its many built-in functions are well suited to processing and computing on time-series data, addressing the limitations of traditional relational and NoSQL databases in this area. Using DolphinDB for high-frequency data delivers very high query and computation performance while also providing database advantages such as data management, access control, parallel computing, and data joins.

This article tests the data-read performance of DolphinDB and pickle. Compared with pickle file storage, reading directly from the DolphinDB database can be more than 10 times faster in the best case. Where integration with an existing Python system is required, reading through DolphinDB's Python API is at most 2 to 3 times faster. For DolphinDB's data-management and other features, see the DolphinDB online [documentation](https://link.zhihu.com/?target=https://www.dolphindb.cn/cn/help/index.html) or [tutorials](https://link.zhihu.com/?target=https://gitee.com/dolphindb/Tutorials_CN/blob/README.md).

## 1. Test scenarios and test data

The tests use the following two datasets.

Dataset 1 contains Level 1 quote and trade data for the US stock market for one day (2007.08.23). It has 10 columns, 2 of which are string types and the rest integer or floating-point types. Its table schema in DolphinDB is shown below. One day of data is about 230 million rows. The CSV file is 9.5 GB, and the converted pickle file is 11.8 GB.

| Column name | Type |
| --- | --- |
| symbol | SYMBOL |
| date | DATE |
| time | SECOND |
| bid | DOUBLE |
| ofr | DOUBLE |
| bidsiz | INT |
| ofrsiz | INT |
| mode | INT |
| ex | CHAR |
| mmid | SYMBOL |

Dataset 2 contains Level 2 quote data for the Chinese stock market over 3 days (2019.09.10 to 2019.09.12). It has 78 columns in total, 2 of which are string types. Its table schema in DolphinDB is shown below. One day of data is about 21.7 million rows. One day's CSV file is 11.6 GB, and the converted pickle file is 12.1 GB.

| Column name | Type | Column name | Type |
| --- | --- | --- | --- |
| UpdateTime | TIME | TotalBidVol | INT |
| TradeDate | DATE | WAvgBidPri | DOUBLE |
| Market | SYMBOL | TotalAskVol | INT |
| SecurityID | SYMBOL | WAvgAskPri | DOUBLE |
| PreCloPrice | DOUBLE | IOPV | DOUBLE |
| OpenPrice | | | |
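
The pickle-based file workflow discussed in the introduction can be illustrated with a minimal sketch. It is not the benchmark code used in this article. It builds a small synthetic table whose column names follow the Dataset 1 schema, writes it with pickle, and times reading it back. The values, the symbols `AAA`/`BBB`/`CCC`, the helper names `make_sample_quotes` and `timed_pickle_roundtrip`, and the use of plain Python lists (standing in for whatever in-memory table object a real workflow would pickle) are all assumptions made for illustration.

```python
import os
import pickle
import random
import tempfile
import time

# Column names follow the Dataset 1 schema; every value below is synthetic.
COLUMNS = ["symbol", "date", "time", "bid", "ofr",
           "bidsiz", "ofrsiz", "mode", "ex", "mmid"]


def make_sample_quotes(n_rows, seed=0):
    """Return a column-oriented dict of made-up quote rows (hypothetical helper)."""
    rng = random.Random(seed)
    bids = [round(rng.uniform(10.0, 100.0), 2) for _ in range(n_rows)]
    return {
        "symbol": [rng.choice(["AAA", "BBB", "CCC"]) for _ in range(n_rows)],
        "date": ["2007.08.23"] * n_rows,
        # Seconds since midnight, mirroring second-level time precision.
        "time": [rng.randrange(34200, 57600) for _ in range(n_rows)],
        "bid": bids,
        "ofr": [round(b + 0.01, 2) for b in bids],
        "bidsiz": [rng.randrange(1, 100) for _ in range(n_rows)],
        "ofrsiz": [rng.randrange(1, 100) for _ in range(n_rows)],
        "mode": [12] * n_rows,
        "ex": [rng.choice("NTP") for _ in range(n_rows)],
        "mmid": [""] * n_rows,
    }


def timed_pickle_roundtrip(table, path):
    """Write `table` to `path` with pickle, then time reading it back."""
    with open(path, "wb") as f:
        pickle.dump(table, f, protocol=pickle.HIGHEST_PROTOCOL)
    start = time.perf_counter()
    with open(path, "rb") as f:
        loaded = pickle.load(f)
    return loaded, time.perf_counter() - start


if __name__ == "__main__":
    quotes = make_sample_quotes(100_000)
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "quotes.pkl")
        loaded, seconds = timed_pickle_roundtrip(quotes, path)
        size_mb = os.path.getsize(path) / 1e6
    print(f"rows={len(loaded['bid'])} file={size_mb:.1f} MB read={seconds:.4f} s")
```

Note that the whole file has to be deserialized even when only one symbol or one column is needed, which is one concrete form of the coarse management granularity mentioned in the introduction.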
