A First Look at TigerGraph's Core Features

This post briefly introduces a product on the commercial market that bills itself as a "third-generation" graph database, claiming to support both OLAP and OLTP scenarios. The vendor has coined a new term for it: NPG (Native Parallel Graph). (It does have the flavor of marketing copy inventing new vocabulary... o(╯□╰)o)

Since TigerGraph is not open source, we can only get to know its core design through the officially published materials. The blue parts are a bit of my own thinking.

A Native Distributed Graph

Its data store holds nodes, links, and their attributes. Some graph database products on the market are really wrappers built on top of a more generic NoSQL data store. This virtual graph strategy has a double penalty when it comes to performance.

The engine stores nodes, links, and their attributes natively, meaning the data is stored exactly as it is modeled at the upper layer. This differs from some other graph databases that build a graph model on top of a NoSQL storage engine; TigerGraph calls that approach a "virtual graph" and notes it carries a potential performance penalty. (Titan, don't run off, yes, I'm talking about you.)

Neo4j is also a native graph with index-free adjacency. It seems there is no shortcut here: for a graph engine to run fast, disk I/O and network I/O must be minimized, and native storage plus memory is the most natural way to get there, at least until the industry makes a new breakthrough in storage technology.
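To make "index-free adjacency" concrete, here is a minimal Python sketch of the general idea (my own illustration, not Neo4j's or TigerGraph's actual internals): each vertex record holds direct references to its neighbors, so one traversal hop is a pointer chase rather than a lookup in a global index.

```python
# Illustrative sketch of index-free adjacency: neighbors are stored as
# direct object references on the vertex itself.

class Vertex:
    def __init__(self, vid):
        self.vid = vid
        self.neighbors = []  # direct references to other Vertex objects

def connect(a, b):
    a.neighbors.append(b)
    b.neighbors.append(a)

# Build a tiny graph: Tom - Dan - Jenny
tom, dan, jenny = Vertex("Tom"), Vertex("Dan"), Vertex("Jenny")
connect(tom, dan)
connect(dan, jenny)

# A two-hop traversal just follows pointers; no index is consulted per hop.
two_hop = {n2.vid for n1 in tom.neighbors for n2 in n1.neighbors if n2 is not tom}
print(two_hop)  # {'Jenny'}
```

The contrast with a "virtual graph" is that the latter would translate each hop into a query against the underlying NoSQL store, paying an extra lookup each time.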

Compact Storage with Fast Access

Internally hash indices are used to reference nodes and links. In Big-O terms, our average access time is O(1) and our average index update time is also O(1).

Users can set parameters that specify how much of the available memory may be used for holding the graph. If the full graph does not fit in memory, then the excess is stored on disk. Best performance is achieved when the full graph fits in memory, of course.

Data values are stored in encoded formats that effectively compress the data. The compression factor varies with the graph structure and data, but typical compression factors are between 2x and 10x. Compression has two advantages: First, a larger amount of graph data can fit in memory and in CPU cache. Such compression reduces not only the memory footprint, but also CPU cache misses, speeding up overall query performance. Second, for users with very large graphs, hardware costs are reduced.

In general, decompression is needed only for displaying the data. When values are used internally, often they may remain encoded and compressed.

Internally, hash indices reference nodes and links, giving an average access time of O(1) and an average index update time of O(1) as well.
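A toy sketch of what an O(1) hash index over record locations could look like (the names and layout here are my own, not TigerGraph's actual structures): the index maps a vertex id to the record's position, so both lookup and update stay average O(1).

```python
# Hypothetical O(1) hash index: vertex id -> offset into a record area.

storage = []   # stand-in for the in-memory / on-disk record area
index = {}     # hash index: vertex id -> offset into `storage`

def upsert(vid, record):
    if vid in index:                  # average O(1) update
        storage[index[vid]] = record
    else:                             # average O(1) insert
        index[vid] = len(storage)
        storage.append(record)

def lookup(vid):                      # average O(1) access
    return storage[index[vid]]

upsert("Tom", {"age": 30})
upsert("Tom", {"age": 31})            # in-place update via the index
print(lookup("Tom"))  # {'age': 31}
```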

In short, users can configure how much memory the graph may use; if the full graph does not fit in memory, the excess spills to disk. Performance is best when the whole graph fits in memory (no surprise there: no I/O overhead);

Typically, data compression (encoding) reaches a factor of 2x to 10x, which brings two advantages:

1. A large graph can fit, compressed, in memory and in the CPU cache, shortening memory access paths and raising CPU cache hit rates, which speeds up queries;

2. The reduced memory footprint also lowers hardware costs;

In general, data is decompressed only when it needs to be displayed (for example, nodes or edges with many attributes); for internal use, values often remain encoded and compressed.

This set of features is quite interesting: the strategy is to improve performance from the angle of data compression and engineering.

The compression part deserves attention. From Shannon's "A Mathematical Theory of Communication" we know that compressibility depends on the distribution of the data, so I doubt a 2-10x factor is achievable on every dataset... The idea resembles Google's Protocol Buffers: upper-layer data is converted into a compact engine-level format, at the cost of some decoding overhead.
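As a small illustration of that Protocol-Buffers-style idea, here is a varint encoder: small integers shrink to one byte instead of a fixed eight. This shows the general technique only; TigerGraph's actual encoding is not public.

```python
# Varint encoding in the spirit of Protocol Buffers: an illustration of the
# compression idea, NOT TigerGraph's actual format.

def encode_varint(n):
    """Encode a non-negative int in 7-bit groups, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data):
    n = 0
    for shift, byte in enumerate(data):
        n |= (byte & 0x7F) << (7 * shift)
    return n

ages = [18, 25, 42, 99]               # small attribute values are common
encoded = b"".join(encode_varint(a) for a in ages)
print(len(encoded), "bytes instead of", 8 * len(ages))  # 4 bytes instead of 32
```

This also matches the point about skew: values that happen to be small compress by 8x here, while large values gain nothing, so the achievable factor depends entirely on the data distribution.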

Parallelism and Shared Values

TigerGraph also excels at parallelism, employing an MPP (massively parallel processing) design architecture throughout.

The nature of graph queries is to “follow the links.”

Ask each counter to do its share of the work, and then combine their results in the end.

TigerGraph was designed from the start for MPP (massively parallel processing), using a parallel processing model to speed up graph queries.

Because storage is a native graph, traversal or iterative computation can start directly from the physical storage layer and "follow the links". It is safe to assume that chasing in-memory pointers is faster than going through I/O or an extra mapping layer.

In addition, during parallel processing the engine lets multiple in-memory processing units share values (counters, for example) and then aggregates them.
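The "ask each counter to do its share of the work" idea can be sketched as a partial-count-then-merge pattern. The worker split, data, and names below are made up for illustration, not TigerGraph's API.

```python
# Sketch of parallel counting with a final merge: each worker counts its own
# chunk of edges, and the partial results are combined at the end.

from collections import Counter
from concurrent.futures import ThreadPoolExecutor

edges = [("Tom", "Dan"), ("Tom", "Jenny"), ("Dan", "Jenny"), ("Dan", "Amy")]

def partial_degree(chunk):
    c = Counter()
    for src, dst in chunk:
        c[src] += 1
        c[dst] += 1
    return c

# Each worker processes its own chunk in parallel ...
with ThreadPoolExecutor(max_workers=2) as pool:
    partials = pool.map(partial_degree, [edges[:2], edges[2:]])

# ... and the partial counters are merged into the shared result.
degree = Counter()
for p in partials:
    degree += p
print(degree["Dan"])  # 3
```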

Storage and Processing Engines Written in C++

TigerGraph has been carefully designed to use memory efficiently and to release unused memory. Careful memory management contributes to TigerGraph’s ability to traverse many links, both in terms of depth and breadth, in a single query.

Many other graph database products are written in Java, which has pros and cons. Java programs run inside a Java Virtual Machine (JVM). The JVM takes care of memory management and garbage collection (freeing up memory that is no longer needed). While this is convenient, it is difficult for the programmer to optimize memory usage or to control when unused memory becomes available.

Not much to add here: of all the roads they picked the hardest one, implementing the entire database in C++, a fairly low-level language. Compared with products written in Java, it is no surprise that it runs faster. Neo4j and OrientDB, for instance, are implemented mainly in Java.

The catch is that strong C++ engineers are hard to hire... o(╥﹏╥)o

GSQL Graph Query Language

TigerGraph also has its own graph querying and update language, GSQL.

TigerGraph designed its own graph query language, GSQL, analogous to Neo4j's Cypher.

Honestly, it is hard to keep up: every graph database vendor rolls its own language. I hope someone in this space steps up to unify the graph query language, the way standard SQL did for relational databases.

MPP Computational Model

To reiterate what we have revealed above, the TigerGraph graph is both a storage model and a computational model. Each node and link can be associated with a compute function. Therefore, each node or link acts as a parallel unit of storage and computation simultaneously. This would be unachievable using a generic NoSQL data store or without the use of accumulators.

Both the storage model and the computational model use the MPP design. Each node or edge can have a compute function attached, so each node or edge acts simultaneously as a parallel unit of storage and computation. Database products built on top of a generic NoSQL store cannot achieve this.
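A minimal vertex-centric sketch of "each node acts as a unit of storage and computation": every vertex carries its own accumulator, and one compute step updates neighbors' accumulators plus a global one. The @/@@ naming only mimics GSQL's convention; the graph and values here are made up.

```python
# Vertex-centric compute sketch: per-vertex and global accumulators,
# updated by a single "superstep" over the edges.

graph = {"A": ["B", "C"], "B": ["C"], "C": []}   # adjacency lists
recvd = {v: 0.0 for v in graph}                  # per-vertex accumulator (like @)
total_sent = 0.0                                 # global accumulator (like @@)

# One superstep: each vertex with out-edges sends 1/out-degree to each neighbor.
for v, nbrs in graph.items():
    for t in nbrs:
        share = 1.0 / len(nbrs)
        recvd[t] += share        # like ACCUM: bump the target's accumulator
        total_sent += share      # bump the global accumulator

print(recvd["C"], total_sent)  # 1.5 2.0
```

In an MPP engine the loop over vertices would run in parallel, with the accumulators providing the safe meeting point for concurrent updates.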

Automatic Partitioning

TigerGraph is designed to automatically partition the graph data across a cluster of servers, and still perform quickly. The hash index is used to determine not only the within-server data location but also which-server. All the links that connect out from a given node are stored on the same server.

Partitioning is automatic, and computation over the partitioned data still performs well.

The internal hash index serves double duty: it locates data within a server and also identifies which server holds the data.

This strategy is a bit like the design of Microsoft Research Asia's Graph Engine...
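A hypothetical sketch of that partitioning scheme: one hash decides which server owns a vertex, and all of a vertex's outgoing links live on that same server, as the quoted text states. The server count and layout below are invented for illustration.

```python
# Hash partitioning sketch: the same hashed id answers "which server"
# (routing) and keys the per-server local store.

import hashlib

NUM_SERVERS = 4

def owner(vid):
    # stable hash -> server id
    h = int(hashlib.md5(vid.encode()).hexdigest(), 16)
    return h % NUM_SERVERS

servers = {i: {} for i in range(NUM_SERVERS)}   # per-server local store

def add_edge(src, dst):
    # all out-links of `src` are co-located on src's server
    servers[owner(src)].setdefault(src, []).append(dst)

add_edge("Tom", "Dan")
add_edge("Tom", "Jenny")
print(owner("Tom"), servers[owner("Tom")])  # Tom's server holds both out-links
```

Co-locating a vertex with its out-links means a one-hop expansion from any vertex never needs a network round trip, which is presumably the point of the rule.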

Distributed Computation Mode

In distributed query mode, all servers are asked to work on the query; each server’s actual participation is on an as-needed basis. When a traversal path crosses from server A to server B, the minimal amount of information that server B needs to know is passed to it. Since server B already knows about the overall query request, it can easily fit in its contribution.

In distributed query mode, all servers are asked to work on the query, each participating on an as-needed basis.

When a traversal path crosses from server A to server B, only the minimal information B needs is passed to it;

This is interesting. My guess is that it ties back to the hash index mentioned under automatic partitioning, since that index also stores server information;

I suspect the processing at this stage may be serial, because if the traversal is dynamic, the data that must cross machines over the network cannot be fully known in advance.
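My reading of the cross-server hop can be sketched as follows (pure speculation on my part, not a documented protocol): when server A's traversal reaches vertices owned by server B, A ships B only the frontier ids, since B already knows the overall query and can continue from there.

```python
# Speculative sketch of a cross-server traversal step: expand the local
# frontier, keep locally-owned results, and batch remote ids per server.

def traverse_step(local_edges, frontier, owner_of, my_id):
    next_local, outbox = set(), {}          # outbox: server id -> minimal message
    for v in frontier:
        for nbr in local_edges.get(v, []):
            srv = owner_of(nbr)
            if srv == my_id:
                next_local.add(nbr)
            else:
                outbox.setdefault(srv, set()).add(nbr)  # ship only the ids

    return next_local, outbox

edges_on_a = {"Tom": ["Dan", "Jenny"]}
owner_of = lambda v: 0 if v == "Dan" else 1   # pretend Dan lives on server 0 (= A)
nxt, outbox = traverse_step(edges_on_a, {"Tom"}, owner_of, my_id=0)
print(nxt, outbox)  # {'Dan'} {1: {'Jenny'}}
```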

High Performance Graph Analytics with a Native Parallel Graph

As the world’s first and only true native parallel graph (NPG) system, TigerGraph is a complete, distributed, graph analytics platform supporting web-scale data analytics in real time. The TigerGraph NPG is built around both local storage and computation, supports real-time graph updates, and serves as a parallel computation engine. TigerGraph supports ACID transactions, guaranteeing data consistency and correct results. Its distributed, native parallel graph architecture enables TigerGraph to achieve unequaled performance levels:

  • Loading 100 to 200 GB of data per hour, per machine.
  • Traversing hundreds of millions of nodes/edges per second per machine.
  • Performing queries with 10-plus hops in subsecond time.
  • Updating 1000s of nodes and edges per second, hundreds of millions per day.
  • Scaling out to handle unlimited data, while maintaining real-time speeds and improving loading and querying throughput.

A quick read of these product numbers: big, fast, deep, and steady.

GSQL

Let's get a quick taste of GSQL syntax.

GSQL

/* DDL: data modeling */

CREATE VERTEX person (PRIMARY_ID ssn STRING, age INT, name STRING)

CREATE UNDIRECTED EDGE friendship (FROM person, TO person)

CREATE DIRECTED EDGE teacher_student (FROM person, TO person)

CREATE GRAPH School (person, friendship, teacher_student)

GSQL job

/* GSQL job: data loading */

USE GRAPH social

BEGIN

CREATE LOADING JOB load_social FOR GRAPH social {

  DEFINE FILENAME file1="/home/tigergraph/person.csv";

  DEFINE FILENAME file2="/home/tigergraph/friendship.csv";

  LOAD file1 TO VERTEX person VALUES ($"name", $"name", $"age", $"gender", $"state")
    USING header="true", separator=",";

  LOAD file2 TO EDGE friendship VALUES ($0, $1, $2)
    USING header="true", separator=",";

}

END


GSQL > run loading job load_social


/* SQL-like data query */

SELECT * FROM person-(friendship)->person WHERE from_id == "Tom"

[
  {
    "from_type": "person",
    "to_type": "person",
    "directed": false,
    "from_id": "Tom",
    "to_id": "Dan",
    "attributes": {"connect_day": "2017-06-03 00:00:00"},
    "e_type": "friendship"
  },
  {
    "from_type": "person",
    "to_type": "person",
    "directed": false,
    "from_id": "Tom",
    "to_id": "Jenny",
    "attributes": {"connect_day": "2015-01-01 00:00:00"},
    "e_type": "friendship"
  }
]

GSQL algorithm

/* GSQL graph algorithm development (feels like it has a bit of a learning curve) */

CREATE QUERY pageRank(float maxChange, int maxIteration, float damping)
    FOR GRAPH gsql_demo {

  MaxAccum<float> @@maxDiff = 9999;   # max score change in an iteration
  SumAccum<float> @recvd_score = 0;   # sum(scores received from neighbors)
  SumAccum<float> @score = 1;         # scores initialized to 1

  V = {Page.*};                       # start with all Page vertices

  WHILE @@maxDiff > maxChange LIMIT maxIteration DO
    @@maxDiff = 0;
    S = SELECT s
        FROM V:s -(Linkto)-> :t
        ACCUM t.@recvd_score += s.@score / s.outdegree()
        POST-ACCUM s.@score = (1 - damping) + damping * s.@recvd_score,
                   s.@recvd_score = 0,
                   @@maxDiff += abs(s.@score - s.@score');
  END;                                # end while loop

  PRINT V;
} # end query
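To demystify the accumulator mechanics, the query above can be mirrored in plain Python: @recvd_score and @score become per-vertex dicts, @@maxDiff a plain float, and the ACCUM / POST-ACCUM phases become two passes per iteration. The graph and parameters below are illustrative only.

```python
# Plain-Python mirror of the GSQL pageRank above, to show what the
# ACCUM and POST-ACCUM phases compute each iteration.

def pagerank(links, max_change=0.001, max_iter=25, damping=0.85):
    score = {v: 1.0 for v in links}                 # @score, initialized to 1
    for _ in range(max_iter):                       # WHILE ... LIMIT maxIteration
        recvd = {v: 0.0 for v in links}             # @recvd_score reset
        for s, outs in links.items():               # ACCUM phase
            for t in outs:
                recvd[t] += score[s] / len(outs)
        max_diff = 0.0                              # @@maxDiff (MaxAccum)
        for s in links:                             # POST-ACCUM phase
            new = (1 - damping) + damping * recvd[s]
            max_diff = max(max_diff, abs(new - score[s]))
            score[s] = new
        if max_diff <= max_change:                  # WHILE condition
            break
    return score

links = {"A": ["B"], "B": ["C"], "C": ["A"]}        # a 3-cycle
ranks = pagerank(links)
print(ranks)  # symmetric ring: every score stays at 1.0
```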


GSQL Compared to Other Graph Languages

GSQL can be compared to other prominent graph query languages in circulation today. This comparison seeks to transcend the particular syntax or the particular way in which semantics are defined, focusing on expressive power classified along the following key dimensions.

1. Accumulation: What is the language support for the storage of (collected or aggregated) data computed by the query?

2. Multi-hop Path Traversal: Does the language support the chaining of multiple traversal steps into paths, with data collected along these steps?

3. Intermediate Result Flow: Does the language support the flow of intermediate results along the steps of the traversal?

4. Control Flow: What control flow primitives are supported?

5. Query-Calling-Query: What is the support for queries invoking other queries?

6. SQL Completeness: Is the language SQL-complete? That is, is it the case that for a graph-based representation G of any relational database D, any SQL query over D can be expressed by a GSQL query over G?

7. Turing completeness: Is the language Turing-complete?


Reference: Native Parallel Graphs - TigerGraph