[Paper Notes] PyTorch-BigGraph: A Large-scale Graph Embedding Framework

Original post

江户川柯壮

2020-06-11 03:07

PBG (PyTorch-BigGraph) is a large-scale graph embedding framework open-sourced by Facebook.

Paper: https://mlsys.org/Conferences/2019/doc/2019/71.pdf

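Basic idea:

PBG reads in an edge list and assigns a vector to each node. It then updates those vectors so that connected entities move closer together and unconnected entities move farther apart.

PBG's motivation: the scale of the graph!

How it handles that scale: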

graph partitioning, so that the model does not have to be fully loaded into memory
multi-threaded computation on each machine
distributed execution across multiple machines (optional), all simultaneously operating on disjoint parts of the graph
batched negative sampling, allowing for processing >1 million edges/sec/machine with 100 negatives per edge

In short: graph partitioning, multi-machine and multi-threaded execution, and batched negative sampling.

Important components of PBG are:

• A block decomposition of the adjacency matrix into N buckets, training on the edges from one bucket at a time. PBG then either swaps embeddings from each partition to disk to reduce memory usage, or performs distributed execution across multiple machines.

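Bucketing the adjacency matrix relieves memory pressure (embeddings are swapped out to disk to save space), and it can also be used for distributed computation.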

• A distributed execution model that leverages the block decomposition for the large parameter matrices, as well as a parameter server architecture for global parameters and feature embeddings for featurized nodes.

• Efficient negative sampling for nodes that samples negative nodes both uniformly and from the data, and reuses negatives within a batch to reduce memory bandwidth.

Efficient negative sampling.

• Support for multi-entity, multi-relation graphs with per-relation configuration options such as edge weight and choice of relation operator.

Support for multiple entity types and multiple relations (so it can also be used on weighted graphs).

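PBG's score function

f(theta s, theta r, theta d) = sim(gs(theta s, theta r), gd(theta d, theta r))

Here theta denotes an embedding vector; s and d are the source and destination, and r is the relation.

The score is therefore relation-specific. It factorizes into gs and gd, which transform the source and destination separately, so that theta s and theta d each keep a semantic meaning of their own.

The paper lists several forms of g(x, theta r) and several ways to compute sim(a, b).

The function g brings the edge type (the relation) into the score, and sim measures how related the two resulting vectors are.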
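To make the roles of g and sim concrete, here is a minimal sketch in plain Python. It assumes two common operator choices (translation and element-wise "diagonal" multiplication) and two comparators (dot product and cosine). The function names and the default of leaving the source side untransformed are illustrative assumptions, not PBG's actual API.

```python
import math

def identity(x, theta_r):
    # Assumed default for one side of the edge: leave the embedding untouched.
    return list(x)

def translation(x, theta_r):
    # Translation operator: g(x, theta_r) = x + theta_r
    return [xi + ri for xi, ri in zip(x, theta_r)]

def diagonal(x, theta_r):
    # Diagonal operator: g(x, theta_r) = x * theta_r (element-wise)
    return [xi * ri for xi, ri in zip(x, theta_r)]

def dot(a, b):
    # Dot-product comparator
    return sum(ai * bi for ai, bi in zip(a, b))

def cos(a, b):
    # Cosine comparator; returns 0.0 if either vector is all zeros
    norm = math.sqrt(dot(a, a)) * math.sqrt(dot(b, b))
    return dot(a, b) / norm if norm > 0 else 0.0

def score(theta_s, theta_r, theta_d, g_s=identity, g_d=translation, sim=dot):
    # f(theta_s, theta_r, theta_d) = sim(g_s(theta_s, theta_r), g_d(theta_d, theta_r))
    return sim(g_s(theta_s, theta_r), g_d(theta_d, theta_r))
```

Swapping g_d or sim changes the model family without touching the rest of the pipeline, which is roughly what a per-relation operator setting buys you.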

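Because the distribution of edges in a graph is heavy-tailed, how negatives are sampled matters a great deal.

Two strategies:

1. sample negatives strictly according to the data distribution
2. sample negatives uniformly

PBG takes a middle path: a fraction alpha of the negatives is sampled according to the data distribution (strategy 1), and the remaining (1 - alpha) is sampled uniformly (strategy 2).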
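The mixed strategy can be sketched as follows. This is a simplified illustration, not PBG's batched implementation: it only corrupts the destination side, it does not filter out negatives that happen to be real edges, and the function name and parameters are hypothetical.

```python
import random

def sample_negatives(edges, num_nodes, num_negatives, alpha=0.5, rng=random):
    """Draw candidate negative destination nodes for corrupting an edge.

    With probability alpha a node is drawn from the data distribution
    (nodes that appear often as destinations are picked more often);
    otherwise it is drawn uniformly from all node ids.
    """
    data_nodes = [d for _, d in edges]  # repeats encode node frequency
    negatives = []
    for _ in range(num_negatives):
        if rng.random() < alpha:
            negatives.append(rng.choice(data_nodes))    # strategy 1: from the data
        else:
            negatives.append(rng.randrange(num_nodes))  # strategy 2: uniform
    return negatives
```

With alpha close to 1, popular nodes dominate the negatives; with alpha close to 0, rare nodes get sampled as often as popular ones.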

margin-based ranking objective

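The margin-based ranking loss is widely used for graph embeddings. It enforces f(e') - f(e) >= lambda, where e' is a negative edge and lambda is the margin. In other words, in the vector space negative edges end up farther apart, while real edges end up closer together.

f(e) = f(theta s, theta r, theta d) is computed from gs and gd alone; here it is read as the distance between the source and destination after each has been transformed by the relation. If f is instead read as a similarity score (higher means more likely), the constraint flips to f(e) - f(e') >= lambda.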
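As a small worked example, the sketch below computes a margin-based ranking loss for one positive edge against its sampled negatives. It assumes the similarity convention (higher score means more likely edge), so the loss is zero once the positive beats every negative by at least the margin. The function name and default margin are illustrative.

```python
def margin_ranking_loss(pos_score, neg_scores, margin=0.1):
    """Hinge loss over one positive edge and its sampled negatives.

    Assumes scores are similarities: each term is zero when
    pos_score - neg_score >= margin, and grows linearly otherwise.
    """
    return sum(max(0.0, margin - pos_score + neg) for neg in neg_scores)
```

For instance, with a positive score of 1.0 and negatives scored 0.0 and 0.95, only the second negative violates a margin of 0.1, and it contributes roughly 0.05 to the loss.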

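The key question: how to split the graph so that it can be processed in a distributed way, or processed one part at a time by repeatedly swapping between memory and disk.

Method:

First, group the entities into P partitions.

Then, assign the edges to buckets. For an edge s -> d, if s belongs to partition pi and d belongs to partition pj, the edge belongs to bucket (pi, pj). The number of buckets is therefore P^2.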
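The two steps above can be sketched in a few lines. The modulo rule for assigning a node to a partition is an assumption made to keep the example deterministic; any fixed mapping from node id to partition works the same way.

```python
from collections import defaultdict

def partition_of(node_id, num_partitions):
    # Hypothetical assignment rule: node id modulo P.
    return node_id % num_partitions

def bucket_edges(edges, num_partitions):
    """Group edges (s, d) by the partitions of their two endpoints."""
    buckets = defaultdict(list)
    for s, d in edges:
        key = (partition_of(s, num_partitions), partition_of(d, num_partitions))
        buckets[key].append((s, d))
    return buckets

# Usage: with P = 2 there are at most 2 * 2 = 4 buckets.
example = bucket_edges([(0, 1), (2, 3), (1, 0), (4, 5)], num_partitions=2)
```

Here the edges (0, 1), (2, 3) and (4, 5) all land in bucket (0, 1), and (1, 0) lands in bucket (1, 0); buckets with no edges are simply never created.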

Tasks evaluated in the experiments:

link prediction
using the embeddings as node vectors for predicting other attributes

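Summary:

PBG's basic idea is to divide the nodes (entities) into partitions so that the whole graph never has to be loaded into memory at once. Using the node partitions as a reference, the edges are then split into buckets. This guarantees that the nodes on the two sides of every edge in a bucket (lhs and rhs) come from just two entity partitions. Negative samples are drawn from those same two partitions. Training therefore works as it does for a DNN, reading in one batch at a time and training batch by batch. This saves memory, and in principle, as long as the partitions are small enough, a network of any scale can be handled. Because the graph can be split this way, the same scheme also supports distributed computation.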
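The memory argument can be illustrated with a toy training loop that keeps at most two partitions resident at a time. This is only a sketch: the row-major bucket order, the train_bucket callback, and the load counter are assumptions for illustration, and real disk I/O is replaced by bookkeeping on a set.

```python
def train_by_bucket(num_partitions, train_bucket):
    """Visit every bucket (i, j) while holding only partitions i and j.

    Returns how many partition loads from disk the schedule would need.
    """
    resident = set()  # partitions currently held in memory
    loads = 0
    for i in range(num_partitions):
        for j in range(num_partitions):
            needed = {i, j}
            loads += len(needed - resident)  # partitions that must be read in
            resident = needed                # everything else is swapped out
            train_bucket(i, j)               # train on the edges of this bucket
    return loads

# Usage: record the visiting order for P = 2.
visited = []
num_loads = train_by_bucket(2, lambda i, j: visited.append((i, j)))
```

With P = 2, all four buckets are visited and each partition only needs to be loaded once, since consecutive buckets share a partition.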
