[paper]https://arxiv.org/pdf/1710.10321.pdf
[code]https://github.com/benedekrozemberczki/GraphWaveMachine
-
abstract
In this paper, we develop GraphWave, a method that represents each node’s network neighborhood via a low-dimensional embedding by leveraging heat wavelet diffusion patterns. Instead of training on hand-selected features, GraphWave learns these embeddings in an unsupervised way.
GraphWave represents each node's network neighborhood using heat wavelet diffusion patterns. It is an unsupervised learning method.
-
present work
Our approach learns a multidimensional structural embedding for each node based on the diffusion of a spectral graph wavelet centered at the node. Intuitively, each node propagates a unit of energy over the graph and characterizes its neighboring topology based on the response of the network to this probe.
In short: a multidimensional structural embedding is learned for each node from the diffusion of a spectral graph wavelet centered at that node. Each node sends one unit of energy out into its surroundings.
Main contributions:
-
Fully unsupervised; no prior knowledge is required.
-
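Complete mathematical proofs. Earlier methods were heuristic; the authors devote a large part of the paper to proving that, with GraphWave, structurally equivalent/similar nodes receive nearly identical/similar embeddings.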
-
Prerequisites
[reference article]https://zhuanlan.zhihu.com/p/50212921
-
spectral graph wavelets
[reference article]https://blog.csdn.net/sxf1061926959/article/details/53538105
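The Heat Kernel Signature (HKS) is a feature descriptor for deformable 3D shape analysis and belongs to the family of spectral methods. For each point on a 3D shape, HKS defines a feature vector that captures the point's local and global properties. It is widely used in 3D segmentation, classification, structure discovery, shape matching, and shape retrieval.
Put simply, the heat kernel signature measures how much heat remains at each point on the surface of a 3D model as time passes. Every point starts with the same amount of heat, but because the surroundings of each point differ, the heat diffuses away at different rates. The heat remaining at a point therefore traces a decaying curve over time, and discretizing that curve gives the point's heat kernel signature. Computing it for every point yields the heat kernel signature of the whole model, which can be stored as one large matrix.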
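The decaying-curve idea can be sketched on a small graph instead of a 3D mesh. This is only an illustration under assumptions: the graph Laplacian stands in for the mesh Laplacian, and the function name `heat_kernel_signature`, the path graph, and the time points are made up for this example.

```python
import numpy as np

def heat_kernel_signature(adjacency, times):
    """Heat remaining at each node at each time: sum_l exp(-lambda_l * t) * u_l(x)^2."""
    adjacency = np.asarray(adjacency, dtype=float)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency   # L = D - A
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)
    # decay[l, k] = exp(-lambda_l * t_k)
    decay = np.exp(-np.outer(eigenvalues, times))
    # Row x of the result is the discretized decay curve of node x.
    return (eigenvectors ** 2) @ decay

# Hypothetical example: path graph 0 - 1 - 2 - 3
path = np.array([[0, 1, 0, 0],
                 [1, 0, 1, 0],
                 [0, 1, 0, 1],
                 [0, 0, 1, 0]])
times = np.array([0.0, 0.5, 1.0, 2.0])
hks = heat_kernel_signature(path, times)
print(hks.round(3))
```

Each row is one node's signature and the whole array is the "large matrix" mentioned above. The two end nodes share one curve and the two inner nodes share another, because their surroundings are the same.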
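2. Characteristic function
For a random variable X, the characteristic function is defined as $\varphi_X(t) = \mathbb{E}[e^{itX}]$. It is completely determined by the distribution of X and, conversely, completely determines that distribution; in particular, all moments of X (when they exist) can be recovered from it. The characteristic function therefore offers a way to study random variables. In some cases the distribution function is inconvenient. For example, finding the distribution of a sum of several independent random variables through distribution functions involves repeated convolutions, which is very hard, whereas with characteristic functions (that is, Fourier transforms) it is much simpler, because the characteristic function of the sum is the product of the individual characteristic functions.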
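As a small illustration of the definition, the expectation can be replaced by an average over a finite set of values, which gives the empirical characteristic function. The helper name `empirical_cf` and the sample values below are hypothetical.

```python
import cmath

def empirical_cf(samples, t):
    """Empirical characteristic function: the mean of exp(i * t * x) over the samples."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)

samples_a = [0.1, 0.1, 0.8]
samples_b = [0.8, 0.1, 0.1]   # same values as samples_a, in a different order
samples_c = [0.3, 0.3, 0.4]   # different values with the same sum

for t in (0.0, 1.0, 2.0):
    print(t, empirical_cf(samples_a, t), empirical_cf(samples_b, t), empirical_cf(samples_c, t))
```

The value depends only on which values occur and how often, not on their order, so `samples_a` and `samples_b` agree at every t while `samples_c` does not.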
-
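Algorithm
The paper's approach is as follows. Given the Laplacian of a graph G, the heat kernel (heat kernel matrix) is obtained from $\Psi = U\,\mathrm{Diag}(g_s(\lambda_1), \dots, g_s(\lambda_N))\,U^T$ with $g_s(\lambda) = e^{-\lambda s}$; the paper calls this the spectral graph wavelets. The authors treat a node's spectral graph wavelet as a probability distribution. Because a characteristic function characterizes a probability distribution, it can be used to characterize the wavelet. Two distributions are the same exactly when their characteristic functions agree at every t, so sampling the characteristic function at a set of points t gives the graph embedding (GE).
For an undirected graph, the Laplacian is $L = D - A = U \Lambda U^T$, where D is the degree matrix, A is the adjacency matrix, U holds the eigenvectors, and $\Lambda = \mathrm{Diag}(\lambda_1, \dots, \lambda_N)$ holds the eigenvalues. The corresponding spectral graph wavelets are $\Psi = U\,\mathrm{Diag}(g_s(\lambda_1), \dots, g_s(\lambda_N))\,U^T$.
The spectral graph wavelet of a single node a is $\Psi_a = U\,\mathrm{Diag}(g_s(\lambda_1), \dots, g_s(\lambda_N))\,U^T \delta_a$, where $\delta_a$ is the one-hot vector of node a.
$\Psi_{ma}$, the m-th entry of $\Psi_a$, is the energy that node a receives from node m.
If a and b are structurally similar, their energy distributions should be similar too. Treat the coefficients $\Psi_{1a}, \dots, \Psi_{Na}$ as samples of a random variable and take its characteristic function, $\phi_a(t) = \frac{1}{N} \sum_{m=1}^{N} e^{i t \Psi_{ma}}$. Finally, sample it at d points $t_1, \dots, t_d$: $\chi_a = [\mathrm{Re}(\phi_a(t_i)), \mathrm{Im}(\phi_a(t_i))]_{t_1, \dots, t_d}$.
Re denotes the real part and Im the imaginary part, so the result is a 2d-dimensional embedding vector for a. This is still not ideal, because there is only one parameter s. In practice s controls how far the energy propagates: a smaller s gives a representation of structural similarity over a small range, and a larger s gives one that captures structural similarity at a larger scale. The paper therefore uses J values of s to obtain J different representations and concatenates them into a final representation with 2*d*J dimensions.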
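The steps above can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions, not the paper's or the repository's implementation: it uses dense matrices and a full eigendecomposition, and the function name `graphwave_embedding`, the scales, and the sample points are chosen only for this example. The columns are ordered as real parts then imaginary parts per scale, which is just one possible ordering.

```python
import numpy as np

def graphwave_embedding(adjacency, scales, sample_points):
    """Return an (N, 2 * d * J) array: one row per node, d sample points, J scales."""
    adjacency = np.asarray(adjacency, dtype=float)
    sample_points = np.asarray(sample_points, dtype=float)
    laplacian = np.diag(adjacency.sum(axis=1)) - adjacency       # L = D - A
    eigenvalues, eigenvectors = np.linalg.eigh(laplacian)        # L = U diag(lambda) U^T
    blocks = []
    for s in scales:
        # Psi = U diag(exp(-s * lambda)) U^T; column a is the wavelet of node a.
        wavelets = (eigenvectors * np.exp(-s * eigenvalues)) @ eigenvectors.T
        # phi[a, k] = mean over m of exp(i * t_k * Psi[m, a])
        phi = np.exp(1j * wavelets.T[:, :, None] * sample_points).mean(axis=1)
        blocks.extend([phi.real, phi.imag])
    return np.concatenate(blocks, axis=1)

# Hypothetical example: path graph 0 - 1 - 2 - 3 - 4
path = np.zeros((5, 5))
for i in range(4):
    path[i, i + 1] = path[i + 1, i] = 1.0

embedding = graphwave_embedding(path, scales=[0.5, 2.0], sample_points=[1.0, 2.0, 3.0, 4.0])
print(embedding.shape)
```

On this path graph, nodes 0 and 4 play the same structural role, as do nodes 1 and 3, so their rows coincide up to floating-point error even though the nodes are far apart.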
-
The mathematical proofs are fairly involved and I could not follow them, so let's go straight to the code.
Source code [code]
-
Overall structure
-
main.py (entry point; the way weighted graphs are read has been modified)
"""Running the GraphWave machine."""

import pandas as pd
import networkx as nx
from param_parser import parameter_parser
from spectral_machinery import WaveletMachine
from texttable import Texttable


def tab_printer(args):
    """
    Function to print the logs in a nice tabular format.
    :param args: Parameters used for the model.
    """
    # Print the run parameters.
    args = vars(args)
    keys = sorted(args.keys())
    tab = Texttable()
    # Add the header and all rows in one call, so that the first parameter
    # row is not taken as a new header by a second add_rows call.
    rows = [[k.replace("_", " ").capitalize(), args[k]] for k in keys]
    tab.add_rows([["Parameter", "Value"]] + rows)
    print(tab.draw())


def read_graph(settings):
    """
    Reading the edge list from the path and returning the networkx graph object.
    :param settings: Parsed arguments; settings.input is the path to the edge list.
    :return graph: Graph from edge list.
    """
    if settings.edgelist_input:
        graph = nx.read_edgelist(settings.input)
    else:
        # Each row of the edge list is: node_a node_b (weight)
        edge_list = pd.read_csv(settings.input, header=None, sep=' ').values.tolist()
        # A third column means the graph is weighted.
        if len(edge_list[0]) == 3:
            graph = nx.read_weighted_edgelist(settings.input)
        else:
            graph = nx.from_edgelist(edge_list)
    # Remove self-loops.
    graph.remove_edges_from(nx.selfloop_edges(graph))
    return graph


if __name__ == "__main__":
    # Parse the parameters.
    settings = parameter_parser()
    # Print the parameters.
    tab_printer(settings)
    # Read the graph.
    G = read_graph(settings)
    # Build the object that runs GraphWave.
    machine = WaveletMachine(G, settings)
    machine.create_embedding()
    machine.transform_and_save_embedding()
-
param_parser.py (parameter parsing)
"""Parsing up the command line parameters."""

import argparse


def parameter_parser():
    """
    A method to parse up command line parameters.
    """
    parser = argparse.ArgumentParser(description="Run GraphWave.")
    # How the eigenvalues are computed.
    parser.add_argument("--mechanism",
                        nargs="?",  # zero or one value
                        default="exact",
                        help="Eigenvalue calculation method. Default is exact.")
    # Path of the input file.
    parser.add_argument("--input",
                        nargs="?",
                        default="../data/food_edges.csv",
                        help="Path to the graph edges. Default is food_edges.csv.")
    # Path of the output file.
    parser.add_argument("--output",
                        nargs="?",
                        default="../output/embedding.csv",
                        help="Path to the structural embedding. Default is embedding.csv.")
    # Heat kernel parameter.
    parser.add_argument("--heat-coefficient",
                        type=float,
                        default=1000.0,
                        help="Heat kernel exponent. Default is 1000.0.")
    # Number of sample points d (the final embedding has 2d dimensions).
    parser.add_argument("--sample-number",
                        type=int,
                        default=50,
                        help="Number of characteristic function sample points. Default is 50.")
    # Order of the Chebyshev polynomial used to approximate the heat kernel.
    parser.add_argument("--approximation",
                        type=int,
                        default=100,
                        help="Number of Chebyshev approximation. Default is 100.")
    # Step size: spacing between consecutive sample points t.
    parser.add_argument("--step-size",
                        type=int,
                        default=20,
                        help="Number of steps. Default is 20.")
    # Node-count threshold above which WaveletMachine switches to the approximate mechanism.
    parser.add_argument("--switch",
                        type=int,
                        default=100,
                        help="Number of dimensions. Default is 100.")
    parser.add_argument("--node-label-type",
                        type=str,
                        default="int",
                        help="Used for sorting index of output embedding. One of 'int', 'string', or 'float'. Default is 'int'")
    parser.add_argument("--edgelist-input",
                        action='store_true',
                        help="Use NetworkX's format for input instead of CSV. Default is False")
    return parser.parse_args()
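To see how these flags reach the rest of the code, here is a reduced sketch with only three of the arguments. It passes an explicit argument list to `parse_args` instead of reading the real command line; the chosen values are arbitrary.

```python
import argparse

parser = argparse.ArgumentParser(description="Run GraphWave.")
parser.add_argument("--heat-coefficient", type=float, default=1000.0)
parser.add_argument("--sample-number", type=int, default=50)
parser.add_argument("--edgelist-input", action="store_true")

# argparse turns the dashes in option names into underscores on the result object.
args = parser.parse_args(["--heat-coefficient", "10", "--edgelist-input"])
print(args.heat_coefficient, args.sample_number, args.edgelist_input)
```

This is why the other modules read `settings.heat_coefficient`, `settings.sample_number`, and `settings.edgelist_input`. Options that are not passed keep their defaults.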
-
spectral_machinery.py (the core of the algorithm)
"""GraphWave class implementation."""

import pygsp
import random
import numpy as np
import pandas as pd
from tqdm import tqdm
import networkx as nx
from pydoc import locate


class WaveletMachine:
    """
    An implementation of "Learning Structural Node Embeddings Via Diffusion Wavelets".
    """
    def __init__(self, G, settings):
        """
        Initialization.
        :param G: Input networkx graph object.
        :param settings: argparse object with settings.
        """
        # Node labels.
        self.index = G.nodes()
        # Build the pygsp graph from the adjacency matrix.
        self.G = pygsp.graphs.Graph(nx.adjacency_matrix(G))
        # Number of nodes.
        self.number_of_nodes = len(nx.nodes(G))
        # Settings.
        self.settings = settings
        # With too many nodes, switch the embedding mechanism to save time.
        if self.number_of_nodes > self.settings.switch:
            self.settings.mechanism = "approximate"
        # Sample points t of the characteristic function.
        self.steps = [x*self.settings.step_size for x in range(self.settings.sample_number)]

    def single_wavelet_generator(self, node):
        """
        Calculating the characteristic function for a given node, using the eigendecomposition.
        :param node: Node that is being embedded.
        """
        impulse = np.zeros((self.number_of_nodes))
        impulse[node] = 1.0
        # Heat kernel: U diag(exp(-s * lambda)) U^T applied to the one-hot impulse.
        diags = np.diag(np.exp(-self.settings.heat_coefficient*self.eigen_values))
        eigen_diag = np.dot(self.eigen_vectors, diags)
        waves = np.dot(eigen_diag, np.transpose(self.eigen_vectors))
        wavelet_coefficients = np.dot(waves, impulse)
        return wavelet_coefficients

    def exact_wavelet_calculator(self):
        """
        Calculates the structural role embedding using the exact eigenvalue decomposition.
        """
        # Complex values whose real and imaginary parts form the embedding.
        self.real_and_imaginary = []
        for node in tqdm(range(self.number_of_nodes)):
            # Wavelet (heat kernel column) of the current node.
            wave = self.single_wavelet_generator(node)
            # Multiply by 1j to make the exponent imaginary, then
            # sample the characteristic function at every step t.
            wavelet_coefficients = [np.mean(np.exp(wave*1.0*step*1j)) for step in self.steps]
            self.real_and_imaginary.append(wavelet_coefficients)
        self.real_and_imaginary = np.array(self.real_and_imaginary)

    def exact_structural_wavelet_embedding(self):
        """
        Calculates the eigenvectors, eigenvalues and an exact embedding is created.
        """
        # Eigendecomposition of the graph Laplacian.
        self.G.compute_fourier_basis()
        # G.e holds the Laplacian eigenvalues (normalized here by the largest one).
        self.eigen_values = self.G.e / max(self.G.e)
        # G.U holds the Laplacian eigenvectors.
        self.eigen_vectors = self.G.U
        self.exact_wavelet_calculator()

    def approximate_wavelet_calculator(self):
        """
        Given the graph and the Chebyshev polynomial, the approximate embedding is calculated.
        """
        self.real_and_imaginary = []
        for node in tqdm(range(self.number_of_nodes)):
            impulse = np.zeros((self.number_of_nodes))
            impulse[node] = 1
            wave_coeffs = pygsp.filters.approximations.cheby_op(self.G, self.chebyshev, impulse)
            real_imag = [np.mean(np.exp(wave_coeffs*1*step*1j)) for step in self.steps]
            self.real_and_imaginary.append(real_imag)
        self.real_and_imaginary = np.array(self.real_and_imaginary)

    def approximate_structural_wavelet_embedding(self):
        """
        Estimating the largest eigenvalue.
        Setting up the heat filter and the Chebyshev polynomial.
        Using the approximate wavelet calculator method.
        """
        # Estimate the largest Laplacian eigenvalue; the result is cached in G.lmax.
        self.G.estimate_lmax()
        # Heat kernel filter.
        # tau is the scaling parameter: it controls how far the energy spreads
        # (the larger tau is, the farther the energy propagates).
        self.heat_filter = pygsp.filters.Heat(self.G, tau=[self.settings.heat_coefficient])
        self.chebyshev = pygsp.filters.approximations.compute_cheby_coeff(self.heat_filter,
                                                                          m=self.settings.approximation)
        self.approximate_wavelet_calculator()

    def create_embedding(self):
        """
        Depending on the mechanism setting, creates an exact or approximate embedding.
        """
        if self.settings.mechanism == "exact":
            self.exact_structural_wavelet_embedding()
        else:
            self.approximate_structural_wavelet_embedding()

    def transform_and_save_embedding(self):
        """
        Transforming the numpy array with real and imaginary values.
        Creating a pandas dataframe and saving it as a csv.
        """
        print("\nSaving the embedding.")
        # Place the real parts and the imaginary parts side by side: 2 * sample_number columns.
        features = [self.real_and_imaginary.real, self.real_and_imaginary.imag]
        self.real_and_imaginary = np.concatenate(features, axis=1)
        columns_1 = ["reals_"+str(x) for x in range(self.settings.sample_number)]
        columns_2 = ["imags_"+str(x) for x in range(self.settings.sample_number)]
        columns = columns_1 + columns_2
        self.real_and_imaginary = pd.DataFrame(self.real_and_imaginary, columns=columns)
        # Index the rows by node label, cast to the configured type, and sort.
        self.real_and_imaginary.index = self.index
        self.real_and_imaginary.index = self.real_and_imaginary.index.astype(locate(self.settings.node_label_type))
        self.real_and_imaginary = self.real_and_imaginary.sort_index()
        self.real_and_imaginary.to_csv(self.settings.output)
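The approximate mechanism avoids the eigendecomposition by replacing the heat kernel with a Chebyshev polynomial of the Laplacian. The idea can be sketched with NumPy alone. This is an illustration under assumptions and not what pygsp does internally: the function name `cheby_heat_wavelet` is made up, the coefficients come from `numpy.polynomial.chebyshev.chebinterpolate`, the kernel is the unnormalized exp(-s * lambda), and the largest eigenvalue is computed exactly because the example graph is tiny.

```python
import numpy as np
from numpy.polynomial import chebyshev

def cheby_heat_wavelet(laplacian, node, s, degree):
    """Approximate exp(-s * L) applied to the one-hot vector of `node` (degree >= 1)."""
    n = laplacian.shape[0]
    lmax = np.linalg.eigvalsh(laplacian)[-1]   # exact here; a large graph would estimate it
    # Chebyshev coefficients of the kernel on [-1, 1], where x maps to lambda = lmax * (x + 1) / 2.
    coeffs = chebyshev.chebinterpolate(lambda x: np.exp(-s * lmax * (x + 1) / 2), degree)
    shifted = (2.0 / lmax) * laplacian - np.eye(n)   # spectrum moved into [-1, 1]
    impulse = np.zeros(n)
    impulse[node] = 1.0
    # Recurrence T_0 v = v, T_1 v = M v, T_{k+1} v = 2 M T_k v - T_{k-1} v
    t_prev, t_curr = impulse, shifted @ impulse
    result = coeffs[0] * t_prev + coeffs[1] * t_curr
    for c in coeffs[2:]:
        t_prev, t_curr = t_curr, 2.0 * shifted @ t_curr - t_prev
        result = result + c * t_curr
    return result

# Hypothetical example: path graph with 6 nodes, wavelet of node 2
adjacency = np.zeros((6, 6))
for i in range(5):
    adjacency[i, i + 1] = adjacency[i + 1, i] = 1.0
laplacian = np.diag(adjacency.sum(axis=1)) - adjacency
approx = cheby_heat_wavelet(laplacian, node=2, s=1.0, degree=20)
print(approx.round(4))
```

Only matrix-vector products with the Laplacian are needed, which is what makes this route attractive for sparse graphs once the node count passes the `--switch` threshold.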