用圖機器學習探索 A 股個股相關性變化

在本系列的前文 [1,2]中，我們介紹瞭如何使用 Python 語言圖分析庫 NetworkX [3] + Nebula Graph [4] 來進行<權力的遊戲>中人物關係圖譜分析。

在本文中我們將介紹如何使用 Java 語言的圖分析庫 JGraphT [5] 並藉助繪圖庫 mxgraph [6] ，可視化探索 A 股的行業個股的相關性隨時間的變化情況。

數據集的處理

本文主要分析方法參考了[7,8]，有兩種數據集：

股票數據（點集）

從 A 股中按股票代碼順序選取了 160 只股票（排除摘牌或者 ST 的）。每一支股票都被建模成一個點，每個點的屬性有股票代碼，股票名稱，以及證監會對該股票對應上市公司所屬板塊分類等三種屬性；

表1：點集示例

頂點id	股票代碼	股票名稱	所屬板塊
1	SZ0001	平安銀行	金融行業
2	600000	浦發銀行	金融行業
3	600004	白雲機場	交通運輸
4	600006	東風汽車	汽車製造
5	600007	中國國貿	開發區
6	600008	首創股份	環保行業
7	600009	上海機場	交通運輸
8	600010	包鋼股份	鋼鐵行業

股票關係（邊集）

邊只有一個屬性，即權重。邊的權重代表邊的源點和目標點所代表的兩支股票所屬上市公司業務上的的相似度——相似度的具體計算方法參考 [7,8]：取一段時間（2014 年 1 月 1 日 - 2020 年 1 月 1 日）內，個股的日收益率的時間序列相關性
再定義個股之間的距離爲 (也即兩點之間的邊權重）：

通過這樣的處理，距離取值範圍爲 [0,2]。這意味着距離越遠的個股，兩個之間的收益率相關性越低。

表2：邊集示例

邊的源點 ID	邊的目標點 ID	邊的權重
11	12	0.493257968
22	83	0.517027513
23	78	0.606206233
2	12	0.653692415
1	11	0.677631482
1	27	0.695705171
1	12	0.71124344
2	11	0.73581915
8	18	0.771556458
12	27	0.785046446
9	20	0.789606527
11	27	0.796009627
25	63	0.797218349
25	72	0.799230001
63	115	0.803534952

這樣的點集和邊集構成一個圖網絡，可以將這個網絡存儲在圖數據庫 Nebula Graph 中。

JGraphT

JGraphT 是一個開放源代碼的 Java 類庫，它不僅爲我們提供了各種高效且通用的圖數據結構，還爲解決最常見的圖問題提供了許多有用的算法：

支持有向邊、無向邊、權重邊、非權重邊等；
支持簡單圖、多重圖、僞圖；
提供了用於圖遍歷的專用迭代器（DFS，BFS）等；
提供了大量常用的的圖算法，如路徑查找、同構檢測、着色、公共祖先、遊走、連通性、匹配、循環檢測、分區、切割、流、中心性等算法；
可以方便地導入 / 導出 GraphViz [9]。導出的 GraphViz 可被導入可視化工具 Gephi[10] 進行分析與展示；
可以方便地使用其他繪圖組件，如：JGraphX，mxGraph，Guava Graphs Generators 等工具繪製出圖網絡。

下面，我們來實踐一把，先在 JGraphT 中創建一個有向圖：

import org.jgrapht.*;
import org.jgrapht.graph.*;
import org.jgrapht.nio.*;
import org.jgrapht.nio.dot.*;
import org.jgrapht.traverse.*;

import java.io.*;
import java.net.*;
import java.util.*;

Graph<URI, DefaultEdge> g = new DefaultDirectedGraph<>(DefaultEdge.class);

添加頂點：

URI google = new URI("http://www.google.com");
URI wikipedia = new URI("http://www.wikipedia.org");
URI jgrapht = new URI("http://www.jgrapht.org");

// add the vertices
g.addVertex(google);
g.addVertex(wikipedia);
g.addVertex(jgrapht);

添加邊：

// add edges to create linking structure
g.addEdge(jgrapht, wikipedia);
g.addEdge(google, jgrapht);
g.addEdge(google, wikipedia);
g.addEdge(wikipedia, google);

圖數據庫 Nebula Graph Database

JGraphT 通常使用本地文件作爲數據源，這在靜態網絡研究的時候沒什麼問題，但如果圖網絡經常會發生變化——例如，股票數據每日都在變化——每次生成全新的靜態文件再加載分析就有些麻煩，最好整個變化過程可以持久化地寫入一個數據庫中，並且可以實時地直接從數據庫中加載子圖或者全圖做分析。本文選用 Nebula Graph 作爲存儲圖數據的圖數據庫。

Nebula Graph 的 Java 客戶端 Nebula-Java [11] 提供了兩種訪問 Nebula Graph 方式：一種是通過圖查詢語言 nGQL [12] 與查詢引擎層 [13] 交互，這通常適用於有複雜語義的子圖訪問類型; 另一種是通過 API 與底層的存儲層（storaged）[14] 直接交互，用於獲取全量的點和邊。除了可以訪問 Nebula Graph 本身外，Nebula-Java 還提供了與 Neo4j [15]、JanusGraph [16]、Spark [17] 等交互的示例。

在本文中，我們選擇直接訪問存儲層（storaged）來獲取全部的點和邊。下面兩個接口可以用來讀取所有的點、邊數據：

// space 爲待掃描的圖空間名稱，returnCols 爲需要讀取的點/邊及其屬性列，
// returnCols 參數格式：{tag1Name: prop1, prop2, tag2Name: prop3, prop4, prop5}
Iterator<ScanVertexResponse> scanVertex(
            String space, Map<String, List<String>> returnCols);
Iterator<ScanEdgeResponse> scanEdge(
            String space, Map<String, List<String>> returnCols);

第一步：初始化一個客戶端，和一個 ScanVertexProcessor。ScanVertexProcessor 用來對讀出來的頂點數據進行解碼：

MetaClientImpl metaClientImpl = new MetaClientImpl(metaHost, metaPort);
metaClientImpl.connect();
StorageClient storageClient = new StorageClientImpl(metaClientImpl);
Processor processor = new ScanVertexProcessor(metaClientImpl);

第二步：調用 scanVertex 接口，該接口會返回一個 scanVertexResponse 對象的迭代器：

Iterator<ScanVertexResponse> iterator =
                storageClient.scanVertex(spaceName, returnCols);

第三步：不斷讀取該迭代器所指向的 scanVertexResponse 對象中的數據，直到讀取完所有數據。讀取出來的頂點數據先保存起來，後面會將其添加到到 JGraphT 的圖結構中：

while (iterator.hasNext()) {
  ScanVertexResponse response = iterator.next();
  if (response == null) {
    log.error("Error occurs while scan vertex");
    break;
  }
  
  Result result =  processor.process(spaceName, response);
  results.addAll(result.getRows(TAGNAME));
}

讀取邊數據的方法和上面的流程類似。

在 JGraphT 中進行圖分析

第一步：在 JGraphT 中創建一個無向加權圖 graph：

Graph<String, MyEdge> graph = GraphTypeBuilder
                .undirected()
    .weighted(true)
    .allowingMultipleEdges(true)
    .allowingSelfLoops(false)
    .vertexSupplier(SupplierUtil.createStringSupplier())
    .edgeSupplier(SupplierUtil.createSupplier(MyEdge.class))
    .buildGraph();

第二步：將上一步從 Nebula Graph 圖空間中讀出來的點、邊數據添加到 graph 中：

for (VertexDomain vertex : vertexDomainList){
    graph.addVertex(vertex.getVid().toString());
    stockIdToName.put(vertex.getVid().toString(), vertex);
}

for (EdgeDomain edgeDomain : edgeDomainList){
    graph.addEdge(edgeDomain.getSrcid().toString(), edgeDomain.getDstid().toString());
    MyEdge newEdge = graph.getEdge(edgeDomain.getSrcid().toString(), edgeDomain.getDstid().toString());
    graph.setEdgeWeight(newEdge, edgeDomain.getWeight());
}

第三步：參考 [7,8] 中的分析法，對剛纔的圖 graph 使用 Prim 最小生成樹算法（minimun-spanning-tree），並調用封裝好的 drawGraph 接口畫圖：

普里姆算法（Prim's algorithm），圖論中的一種算法，可在加權連通圖裏搜索最小生成樹。即，由此算法搜索到的邊子集所構成的樹中，不但包括了連通圖裏的所有頂點，且其所有邊的權值之和亦爲最小。

SpanningTreeAlgorithm.SpanningTree pMST = new PrimMinimumSpanningTree(graph).getSpanningTree();

Legend.drawGraph(pMST.getEdges(), filename, stockIdToName);

第四步：drawGraph 方法封裝了畫圖的佈局等各項參數設置。這個方法將同一板塊的股票渲染爲同一顏色，將距離接近的股票排列聚集在一起。

public class Legend {
  
...
  
  public static void drawGraph(Set<MyEdge> edges, String filename, Map<String, VertexDomain> idVertexMap) throws IOException {
     // Creates graph with model
     mxGraph graph = new mxGraph();
     Object parent = graph.getDefaultParent();

     // set style
     graph.getModel().beginUpdate();
     mxStylesheet myStylesheet =  graph.getStylesheet();
     graph.setStylesheet(setMsStylesheet(myStylesheet));

     Map<String, Object> idMap = new HashMap<>();
     Map<String, String> industryColor = new HashMap<>();

     int colorIndex = 0;

     for (MyEdge edge : edges) {
       Object src, dst;
       if (!idMap.containsKey(edge.getSrc())) {
         VertexDomain srcNode = idVertexMap.get(edge.getSrc());
         String nodeColor;
         if (industryColor.containsKey(srcNode.getIndustry())){
           nodeColor = industryColor.get(srcNode.getIndustry());
         }else {
           nodeColor = COLOR_LIST[colorIndex++];
           industryColor.put(srcNode.getIndustry(), nodeColor);
         }
         src = graph.insertVertex(parent, null, srcNode.getName(), 0, 0, 105, 50, "fillColor=" + nodeColor);
         idMap.put(edge.getSrc(), src);
       } else {
         src = idMap.get(edge.getSrc());
       }

       if (!idMap.containsKey(edge.getDst())) {
         VertexDomain dstNode = idVertexMap.get(edge.getDst());

         String nodeColor;
         if (industryColor.containsKey(dstNode.getIndustry())){
           nodeColor = industryColor.get(dstNode.getIndustry());
         }else {
           nodeColor = COLOR_LIST[colorIndex++];
           industryColor.put(dstNode.getIndustry(), nodeColor);
         }

         dst = graph.insertVertex(parent, null, dstNode.getName(), 0, 0, 105, 50, "fillColor=" + nodeColor);
         idMap.put(edge.getDst(), dst);
       } else {
         dst = idMap.get(edge.getDst());
       }
       graph.insertEdge(parent, null, "", src, dst);
     }


     log.info("vertice " + idMap.size());
     log.info("colorsize " + industryColor.size());

     mxFastOrganicLayout layout = new mxFastOrganicLayout(graph);
     layout.setMaxIterations(2000);
     //layout.setMinDistanceLimit(10D);
     layout.execute(parent);

     graph.getModel().endUpdate();

     // Creates an image than can be saved using ImageIO
     BufferedImage image = createBufferedImage(graph, null, 1, Color.WHITE,
                                               true, null);

     // For the sake of this example we display the image in a window
     // Save as JPEG
     File file = new File(filename);
     ImageIO.write(image, "JPEG", file);

   }
  
  ...
    
}

第五步：生成可視化：

圖1中每個頂點的顏色代表證監會對該股票所屬上市公司歸類的板塊。

可以看到，實際業務近似度較高的股票已經聚攏成簇狀（例如：高速板塊、銀行版本、機場航空板塊），但也會有部分關聯性不明顯的個股被聚類在一起，具體原因需要單獨進行個股研究。

圖1：基於 2015-01-01 至 2020-01-01 的股票數據計算出的聚集性

第六步：基於不同時間窗口的一些其他動態探索

上節中，結論主要基於 2015-01-01 到 2020-01-01 的個股聚集性。這一節我們還做了一些其他的嘗試：以 2 年爲一個時間滑動窗口，分析方法不變，定性探索聚集羣是否隨着時間變化會發生改變。

圖2：基於 2014-01-01 至 2016-01-01 的股票數據計算出的聚集性

圖3：基於 2015-01-01 至 2017-01-01 的股票數據計算出的聚集性

圖4：基於 2016-01-01 至 2018-01-01 的股票數據計算出的聚集性

圖5：基於 2017-01-01 至 2019-01-01 的股票數據計算出的聚集性

圖6：基於 2018-01-01 至 2020-01-01 的股票數據計算出的聚集性

粗略分析看，隨着時間窗口變化，有些板塊（高速、銀行、機場航空、房產、能源）的板塊內部個股聚集性一直保持比較好——這意味着隨着時間變化，這個版塊內各種一直保持比較高的相關性；但有些板塊（製造）的聚集性會持續變化——意味着相關性一直在發生變化。

Disclaim

本文不構成任何投資建議，且作者不持有本文中任一股票。

受限於停牌、熔斷、漲跌停、送轉、併購、主營業務變更等情況，數據處理可能有錯誤，未做一一檢查。

受時間所限，本文只選用了 160 個個股樣本過去 6 年的數據，只採用了最小擴張樹一種辦法來做聚類分類。未來可以使用更大的數據集（例如美股、衍生品、數字貨幣），嘗試更多種圖機器學習的辦法。

本文代碼可見[18]

Reference

[1] 用 NetworkX + Gephi + Nebula Graph 分析<權力的遊戲>人物關係（上篇）https://nebula-graph.com.cn/posts/game-of-thrones-relationship-networkx-gephi-nebula-graph/

[2] 用 NetworkX + Gephi + Nebula Graph 分析<權力的遊戲>人物關係（下篇） https://nebula-graph.com.cn/posts/game-of-thrones-relationship-networkx-gephi-nebula-graph-part-two/

[3] NetworkX: a Python package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks. https://networkx.github.io/

[4] Nebula Graph: A powerfully distributed, scalable, lightning-fast graph database written in C++. https://nebula-graph.io/

[5] JGraphT: a Java library of graph theory data structures and algorithms. https://jgrapht.org/

[6] mxGraph: JavaScript diagramming library that enables interactive graph and charting applications. https://jgraph.github.io/mxgraph/

[7] Bonanno, Giovanni & Lillo, Fabrizio & Mantegna, Rosario. (2000). High-frequency Cross-correlation in a Set of Stocks. arXiv.org, Quantitative Finance Papers. 1. 10.1080/713665554.

[8] Mantegna, R.N. Hierarchical structure in financial markets. Eur. Phys. J. B 11, 193–197 (1999).

[9] https://graphviz.org/

[10] https://gephi.org/

[11] https://github.com/vesoft-inc/nebula-java

[12] Nebula Graph Query Language (nGQL). https://docs.nebula-graph.io/manual-EN/1.overview/1.concepts/2.nGQL-overview/

[13] Nebula Graph Query Engine. https://github.com/vesoft-inc/nebula-graph

[14] Nebula-storage: A distributed consistent graph storage. https://github.com/vesoft-inc/nebula-storage

[15] Neo4j. www.neo4j.com

[16] JanusGraph. janusgraph.org

[17] Apache Spark. spark.apache.org.

[18] https://github.com/Judy1992/nebula_scan