Managing High-Frequency Financial Data: A Performance Comparison of DolphinDB and pickle

Level 1 and Level 2 (L1/L2) quote and trade data from financial markets are essential to quantitative trading research. Historical L1/L2 data for the entire domestic (Chinese) market amount to roughly 20 to 50 TB, with about 20 to 50 GB added each day. Traditional relational databases such as MS SQL Server or MySQL cannot support this volume; even with sharding across databases and tables, query performance falls far short of requirements. Data warehouses such as Impala and Greenplum, and NoSQL databases such as HBase, can store data at this scale, but these general-purpose storage engines lack good support for time-series data, have serious shortcomings in both querying and computation, and offer very limited support for Python, which is widely used in quantitative finance.

These database limitations have pushed some users toward file storage. HDF5, Parquet and pickle are commonly used binary file formats; of these, pickle, the protocol for serializing and deserializing Python objects, is very efficient. Because Python is a common tool in quantitative finance and data analysis, many users store high-frequency data in pickle files. File storage has clear drawbacks, however: heavy data redundancy, difficult management across versions, no access control, no way to use the resources of multiple nodes, awkward joins between datasets, overly coarse management granularity, and inconvenient search and querying.

A growing number of brokerages and private funds now use DolphinDB, a high-performance time-series database, to process high-frequency data. DolphinDB uses columnar storage and provides several flexible partitioning schemes, so it can make full use of the resources of every node in a cluster. Its large set of built-in functions is well suited to processing and computing on time-series data, which addresses the limitations of traditional relational and NoSQL databases in this area. Processing high-frequency data with DolphinDB delivers very high query and computation performance while retaining the advantages of a database: data management, access control, parallel computation and joins across datasets.

This article tests the data-reading performance of DolphinDB and pickle. Compared with pickle file storage, reading directly from the DolphinDB database can be faster by a factor of 10 or more in the best case. When integration with an existing Python system is required and data is read through the Python API provided by DolphinDB, the speedup is at most 2 to 3 times. For DolphinDB's data management and other features, see the online [documentation](https://link.zhihu.com/?target=https://www.dolphindb.cn/cn/help/index.html) or [tutorials](https://link.zhihu.com/?target=https://gitee.com/dolphindb/Tutorials_CN/blob/README.md).

## 1. Test scenarios and test data

This test uses the following two datasets.

Dataset 1 contains Level 1 quote and trade data for one day (2007.08.23) of the US stock market. It has 10 columns, 2 of which are string typed; the rest are integer or floating-point typed. Its table schema in DolphinDB is shown in the table below. One day of data is about 230 million rows. The CSV file is 9.5 GB, and the file is 11.8 GB after conversion to pickle.

| Column name | Type |
| --- | --- |
| symbol | SYMBOL |
| date | DATE |
| time | SECOND |
| bid | DOUBLE |
| ofr | DOUBLE |
| bidsiz | INT |
| ofrsiz | INT |
| mode | INT |
| ex | CHAR |
| mmid | SYMBOL |

Dataset 2 contains Level 2 quote data for three days (2019.09.10 to 2019.09.12) of the Chinese stock market. It has 78 columns in total, 2 of which are string typed. Its table schema in DolphinDB is shown in the table below. One day of data is about 21.7 million rows. One day's CSV file is 11.6 GB, and the file is 12.1 GB after conversion to pickle.

| Column name | Type | Column name | Type |
| --- | --- | --- | --- |
| UpdateTime | TIME | TotalBidVol | INT |
| TradeDate | DATE | WAvgBidPri | DOUBLE |
| Market | SYMBOL | TotalAskVol | INT |
| SecurityID | SYMBOL | WAvgAskPri | DOUBLE |
| PreCloPrice | DOUBLE | IOPV | DOUBLE |
| OpenPrice | | | |
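
To make the pickle side of the comparison concrete, the sketch below builds a small synthetic table with the ten column names of dataset 1, writes it to a pickle file with pandas, and times the round trip. It is only an illustration, not the benchmark used in this article: the row count, the symbol, exchange and market-maker values, the pandas dtypes chosen for each column, and the helper names `make_quotes` and `time_pickle_roundtrip` are all assumptions made for the example.

```python
import os
import tempfile
import time

import numpy as np
import pandas as pd


def make_quotes(n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Build a synthetic quote table with the column names of dataset 1.

    All values are made up; only the column names come from the article.
    """
    rng = np.random.default_rng(seed)
    bid = rng.uniform(10.0, 100.0, n_rows).round(2)
    return pd.DataFrame({
        "symbol": rng.choice(["AAPL", "IBM", "MSFT"], n_rows),  # hypothetical tickers
        "date": pd.Timestamp("2007-08-23"),
        # Seconds since midnight, restricted to 09:30-16:00.
        "time": pd.to_timedelta(rng.integers(34200, 57600, n_rows), unit="s"),
        "bid": bid,
        "ofr": bid + 0.01,
        "bidsiz": rng.integers(1, 100, n_rows).astype("int32"),
        "ofrsiz": rng.integers(1, 100, n_rows).astype("int32"),
        "mode": np.full(n_rows, 12, dtype="int32"),
        "ex": rng.choice(list("NPT"), n_rows),                  # hypothetical codes
        "mmid": rng.choice(["", "ARCA"], n_rows),               # hypothetical codes
    })


def time_pickle_roundtrip(df: pd.DataFrame, path: str):
    """Write df to a pickle file, read it back, and return timings in seconds."""
    start = time.perf_counter()
    df.to_pickle(path)
    write_s = time.perf_counter() - start

    start = time.perf_counter()
    loaded = pd.read_pickle(path)
    read_s = time.perf_counter() - start
    return loaded, write_s, read_s


df = make_quotes(100_000)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "quotes.pkl")
    loaded, write_s, read_s = time_pickle_roundtrip(df, path)
    size_mb = os.path.getsize(path) / 1e6

print(f"rows={len(df)} size={size_mb:.1f} MB write={write_s:.3f}s read={read_s:.3f}s")
```

Timings from a table this small say little about the multi-gigabyte files described above. On real data, disk throughput and the operating system's page cache are likely to dominate the result.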
