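PBG (PyTorch-BigGraph) is a large-scale graph embedding framework open-sourced by Facebook.
Basic idea:
Read in an edge list and assign a vector to each node, then update the vectors so that connected entities move closer together and unconnected entities move farther apart.
PBG's starting point: the scale of the graph!
How it handles scale: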
- graph partitioning, so that the model does not have to be fully loaded into memory
- multi-threaded computation on each machine
- distributed execution across multiple machines (optional), all simultaneously operating on disjoint parts of the graph
- batched negative sampling, allowing for processing >1 million edges/sec/machine with 100 negatives per edge
In short: graph partitioning, multi-machine and multi-threaded execution, and batched negative sampling.
Important components of PBG are:
• A block decomposition of the adjacency matrix into N buckets, training on the edges from one bucket at a time. PBG then either swaps embeddings from each partition to disk to reduce memory usage, or performs distributed execution across multiple machines.
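Note: bucketing the adjacency matrix reduces memory pressure (embeddings can be swapped out to disk to save space) and also enables distributed computation.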
• A distributed execution model that leverages the block decomposition for the large parameter matrices, as well as a parameter server architecture for global parameters and feature embeddings for featurized nodes.
• Efficient negative sampling for nodes that samples negative nodes both uniformly and from the data, and reuses negatives within a batch to reduce memory bandwidth.
Note: efficient negative sampling.
• Support for multi-entity, multi-relation graphs with per-relation configuration options such as edge weight and choice of relation operator.
Note: supports multiple entity types and multiple relation types (and can be used for weighted graphs).
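PBG's score function
theta denotes the embedding vectors; s and d are the source and destination, and r is the relation.
The score function is therefore relation-specific. It can be factorized into g_s and g_d, so that theta_s and theta_d carry semantic meaning.
g(x, theta_r) can take several forms, and so can the computation of sim(a, b).
The function g brings the edge type into the score; sim measures how similar two vectors are.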
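To make the factorization concrete, here is a minimal sketch of a score built from g and sim. The specific operator forms (identity, translation, diagonal) and similarity functions (dot product, cosine) are commonly used choices and are assumptions here, since these notes do not list them; all function names are illustrative and this is not PBG's API.

```python
import math

def g_identity(x, theta_r):
    # No relation-specific transformation.
    return list(x)

def g_translation(x, theta_r):
    # Shift the embedding by the relation vector.
    return [xi + ri for xi, ri in zip(x, theta_r)]

def g_diagonal(x, theta_r):
    # Element-wise scaling by the relation vector.
    return [xi * ri for xi, ri in zip(x, theta_r)]

def sim_dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sim_cos(a, b):
    norm = math.sqrt(sim_dot(a, a)) * math.sqrt(sim_dot(b, b))
    return sim_dot(a, b) / norm if norm > 0 else 0.0

def score(theta_s, theta_r, theta_d,
          g_s=g_identity, g_d=g_translation, sim=sim_dot):
    # f(theta_s, theta_r, theta_d) = sim(g_s(theta_s, theta_r), g_d(theta_d, theta_r))
    return sim(g_s(theta_s, theta_r), g_d(theta_d, theta_r))

example = score([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
```

Swapping `g_d` or `sim` changes the relation model without touching the node embeddings, which is the point of the factorization.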
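Because the edge distribution in a graph is heavy-tailed, how negatives are sampled is critical.
Two strategies:
- sample negatives strictly according to the data distribution
- sample negatives uniformly
PBG takes a compromise: a fraction alpha of the negatives is drawn with the first strategy (according to the data distribution), and the remaining (1 - alpha) is drawn uniformly with the second.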
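The alpha mix can be sketched as follows. This is a simplified illustration, not PBG's implementation: "according to the data distribution" is approximated by picking endpoints from the edge list (so high-degree nodes are picked more often), and the function name and parameters are hypothetical.

```python
import random

def sample_negative_nodes(edges, num_nodes, num_negatives, alpha, rng):
    """Return candidate negative nodes: a fraction alpha from the data, the rest uniform."""
    # Each endpoint is listed once per edge, so a uniform pick from this list
    # draws a node with probability proportional to its degree in the data.
    data_nodes = [node for edge in edges for node in edge]
    num_from_data = round(alpha * num_negatives)
    from_data = [rng.choice(data_nodes) for _ in range(num_from_data)]
    # The remaining negatives are drawn uniformly over all node ids.
    uniform = [rng.randrange(num_nodes)
               for _ in range(num_negatives - num_from_data)]
    return from_data + uniform

rng = random.Random(0)
edges = [(0, 1), (0, 2), (0, 3), (4, 0)]
negatives = sample_negative_nodes(edges, num_nodes=1000,
                                  num_negatives=10, alpha=0.5, rng=rng)
```

The sketch does not filter out sampled nodes that happen to form a real edge; a real system has to decide how to treat such false negatives.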
margin-based ranking objective
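This margin-based ranking loss is common in graph embedding. It forces f(e') - f(e) >= lambda; that is, in the vector space the endpoints of a negative edge end up farther apart while the endpoints of a real edge end up closer together.
f(e) = f(theta_s, theta_r, theta_d) = f(g_s, g_d), which is effectively the distance between source and destination after the relation transformation has been applied.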
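A minimal sketch of the margin-based ranking loss for one positive edge and its negatives. It follows the convention used in these notes, where f behaves like a distance (small for real edges); if f were a similarity where larger is better, the sign inside the max would flip. The function name is illustrative.

```python
def margin_ranking_loss(f_pos, f_negs, margin):
    """Sum of hinge terms max(f(e) - f(e') + margin, 0) over the negatives e'."""
    # A negative contributes zero loss once f(e') - f(e) >= margin.
    return sum(max(f_pos - f_neg + margin, 0.0) for f_neg in f_negs)

# The first negative is already more than the margin away, so it adds nothing;
# the second violates the margin by 0.5.
loss = margin_ranking_loss(f_pos=1.0, f_negs=[3.0, 1.5], margin=1.0)
```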
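The key question: how to split the graph so that it can be processed in a distributed way, or so that parts of it can be processed one at a time by repeatedly swapping between memory and disk.
Approach:
First, group the entities into P partitions.
Then, bucket the edges. For an edge s -> d, if s belongs to partition p_i and d belongs to partition p_j, the edge belongs to bucket (p_i, p_j). The number of buckets is therefore P^2.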
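The two steps above can be sketched in a few lines. Assigning a node to a partition by `node_id % P` is an assumption made for the example (the notes do not say how entities are assigned), and the function names are illustrative.

```python
from collections import defaultdict

def partition_of(node_id, num_partitions):
    # Assumption for illustration: assign nodes to partitions by modulo.
    return node_id % num_partitions

def bucket_edges(edges, num_partitions):
    """Group edges s -> d into buckets keyed by (partition of s, partition of d)."""
    buckets = defaultdict(list)
    for s, d in edges:
        key = (partition_of(s, num_partitions), partition_of(d, num_partitions))
        buckets[key].append((s, d))
    return buckets

edges = [(0, 1), (2, 3), (1, 0), (4, 5)]
buckets = bucket_edges(edges, num_partitions=2)
```

With P = 2 there are at most 4 buckets; here only (0, 1) and (1, 0) are non-empty, and every edge in a bucket touches only the two partitions named in its key.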
Tasks tested in the experiments:
- link prediction
- use the embeddings as node vectors for other attribute prediction tasks
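Summary:
PBG's basic idea is to divide the nodes (entities) into partitions so that the whole graph never has to be loaded into memory at once. Using the node partitions as the reference, the edges are then bucketed as well. This guarantees that the nodes on the two sides of any edge in a bucket (lhs and rhs) belong to just two entity partitions. Negative sampling is also done within these two partitions, so training proceeds as it does for a DNN: read one batch at a time and train batch by batch. This saves memory, and in principle a network of any scale can be handled as long as the partitions are small enough. Because the graph can be split this way, the same scheme also supports distributed computation.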
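As a rough back-of-the-envelope check on the memory claim: while one bucket is being trained, only the embeddings of at most two partitions need to be resident. The sketch below estimates that footprint under simplifying assumptions (equal-sized partitions, 4-byte floats, no optimizer state or relation parameters); the function name is hypothetical.

```python
import math

def resident_embedding_bytes(num_nodes, dim, num_partitions, bytes_per_float=4):
    """Approximate bytes of node embeddings held in memory for one bucket."""
    nodes_per_partition = math.ceil(num_nodes / num_partitions)
    # A bucket (p_i, p_j) touches at most two partitions (one if P == 1).
    resident_partitions = min(2, num_partitions)
    return resident_partitions * nodes_per_partition * dim * bytes_per_float

full = resident_embedding_bytes(1_000_000, 100, num_partitions=1)
split = resident_embedding_bytes(1_000_000, 100, num_partitions=10)
```

Under these assumptions, one million nodes with 100-dimensional embeddings need about 400 MB unpartitioned and about 80 MB with P = 10, so the resident size shrinks roughly as 2/P.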