PyTorch Geometric, a Power Tool for Graph Neural Networks: Getting Started and Hands-On Practice

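Graph Neural Networks (GNNs) have recently emerged as a powerful approach in graph analysis and related fields. Much like conventional convolution in Euclidean space, GNNs extract features by passing, transforming, and aggregating information between nodes. This post shows how to implement a graph neural network in your own project quickly and simply. You will learn how to build a GNN with PyTorch Geometric and how to apply it to a real problem (RecSys Challenge 2015).

We will use PyTorch and PyG (the PyTorch Geometric library). PyG is a library built on PyTorch for handling irregular data such as graphs; put differently, it is a framework for quickly implementing representation learning on graphs and similar data. It is fast: according to the PyG paper, model training can be up to 40 times faster than with DGL (Deep Graph Library) v0.2. Beyond speed, PyG ships implementations of many methods from the literature (GCN, SGC, GAT, SAGE, and others) along with common benchmark datasets, which makes reproducing papers convenient. Thanks to this combination of speed and convenience, PyG has become one of the most popular and widely used GNN libraries. Let's get started.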

Requirements:

Python 3
PyTorch
PyTorch Geometric

PyG Basics

This part covers the basics of PyG, focusing on torch_geometric.data and torch_geometric.nn. You will learn how to feed graph data into a neural network model and how to design a MessagePassing layer, which is the core of a GNN.

Data

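The torch_geometric.data module contains a class called Data, which makes it easy to build a graph data object. You only need to specify two things:

Node attributes/features (the features associated with each node)
Connectivity/adjacency information (how the nodes are connected, the edge index)

Let's walk through an example of creating a Data object.

The graph has 4 nodes, V1, V2, V3, and V4. Each carries a 2-dimensional feature vector and a label y indicating the class the node belongs to.

Both can be represented as FloatTensors: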

x = torch.tensor([[2,1],[5,6],[3,7],[12,0]], dtype=torch.float) 
y = torch.tensor([0,1,0,1], dtype=torch.float)

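The connectivity of the graph must be stored in COO format. Here the COO list is a tensor of shape [2, E], where E is the number of edges: the first row holds the source nodes and the second row holds the target nodes, and each edge points from its source to its target. For an undirected graph, every edge is stored twice, once in each direction.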

Option 1

edge_index = torch.tensor([[0,1,2,0,3],
                          [1,0,1,3,2]],dtype=torch.long)

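Option 2

edge_index = torch.tensor([[0, 1],
                           [1, 0],
                           [2, 1],
                           [0, 3],
                           [3, 2]], dtype=torch.long)

With the second option, the list of (source, target) pairs has shape [E, 2], so it must be transposed and made contiguous with t().contiguous() before being passed to Data.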

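The order of the edges in edge_index is irrelevant to the Data object; that is, the order in which edges are stored does not matter, because edge_index is only used to derive the adjacency matrix.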
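To see why edge order does not matter, here is a minimal pure-Python sketch (no PyG required) that turns a COO edge list into a dense adjacency matrix. The helper name `to_adjacency` is hypothetical and introduced only for illustration; two different orderings of the same five edges yield the same matrix.

```python
def to_adjacency(edge_index, num_nodes):
    """Build a dense adjacency matrix from a COO edge list.

    edge_index is [sources, targets], two lists of equal length.
    """
    sources, targets = edge_index
    adj = [[0] * num_nodes for _ in range(num_nodes)]
    for s, t in zip(sources, targets):
        adj[s][t] = 1  # directed edge s -> t
    return adj


# The same five edges as in Option 1 above, stored in two different orders.
edges_a = [[0, 1, 2, 0, 3], [1, 0, 1, 3, 2]]
edges_b = [[3, 0, 2, 1, 0], [2, 3, 1, 0, 1]]

adj_a = to_adjacency(edges_a, num_nodes=4)
adj_b = to_adjacency(edges_b, num_nodes=4)
print(adj_a == adj_b)  # True: the adjacency matrix ignores storage order
```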

Putting these together, we can create a Data object.

# Method 1
import torch

from torch_geometric.data import Data

x = torch.tensor([[2,1],[5,6],[3,7],[12,0]],dtype=torch.float)

# One class label per node, as defined above
y = torch.tensor([0,1,0,1],dtype=torch.float)

edge_index = torch.tensor([[0,1,2,0,3],
                          [1,0,1,3,2]],dtype=torch.long)

data = Data(x=x,y=y,edge_index=edge_index)
# Method 2
import torch

from torch_geometric.data import Data

x = torch.tensor([[2,1],[5,6],[3,7],[12,0]],dtype=torch.float)

# One class label per node, as defined above
y = torch.tensor([0,1,0,1],dtype=torch.float)

edge_index = torch.tensor([[0, 1],
                           [1, 0],
                           [2, 1],
                           [0, 3],
                           [3, 2]], dtype=torch.long)

# Transpose the [E, 2] pair list to the [2, E] layout that Data expects
data = Data(x=x,y=y,edge_index=edge_index.t().contiguous())

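This creates a new Data object. x, y, and edge_index are the most basic keys, and you can also add keys of your own. With this data object, accessing and processing your graph data in a program becomes very convenient.

Dataset

Creating a Dataset is not as straightforward as creating a Data object. Much like the datasets in torchvision, a PyG Dataset follows its own conventions.

PyG provides two different dataset classes: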

InMemoryDataset
Dataset

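To create an InMemoryDataset, you need to implement the following functions:

raw_file_names()

It returns a list with the names of the raw, unprocessed data files. If you have only one file, the list contains a single element. In fact, you can simply return an empty list and specify your files later in process().

processed_file_names()

Much like the previous function, it returns a list with the names of all processed data files. After process() has been called, the returned list usually has only one element, the name of the single file holding the processed data.

download()

This function downloads the data you are working with into the directory given by self.raw_dir. If you don't need to download anything, simply write

pass

in this function.

process()

This is the most important function of a Dataset. You need to gather your data into a list of Data objects and then call self.collate() to compute the slices that the DataLoader will use. The following example comes from the official PyG documentation.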

import torch
from torch_geometric.data import InMemoryDataset


class MyOwnDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(MyOwnDataset, self).__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return ['some_file_1', 'some_file_2', ...]

    @property
    def processed_file_names(self):
        return ['data.pt']

    def download(self):
        # Download to `self.raw_dir`.
        pass

    def process(self):
        # Read data into huge `Data` list.
        data_list = [...]

        if self.pre_filter is not None:
            data_list = [data for data in data_list if self.pre_filter(data)]

        if self.pre_transform is not None:
            data_list = [self.pre_transform(data) for data in data_list]

        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])
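
Conceptually, `self.collate()` packs many small graphs into one large storage object plus a table of slice boundaries, so that an individual graph can be cut back out later. The following is a rough pure-Python sketch of that idea only; it is not PyG's actual implementation, and the names `collate_sketch` and `get_graph` are hypothetical.

```python
def collate_sketch(graphs):
    """Concatenate per-graph node features and record slice boundaries.

    graphs is a list of dicts like {"x": [[...], ...]}, one row per node.
    """
    all_x = []
    x_slices = [0]  # x_slices[i]:x_slices[i + 1] spans graph i
    for g in graphs:
        all_x.extend(g["x"])
        x_slices.append(len(all_x))
    return {"x": all_x}, {"x": x_slices}


def get_graph(data, slices, idx):
    """Recover graph idx from the concatenated storage."""
    start, end = slices["x"][idx], slices["x"][idx + 1]
    return {"x": data["x"][start:end]}


graphs = [
    {"x": [[2, 1], [5, 6]]},           # graph 0: 2 nodes
    {"x": [[3, 7], [12, 0], [1, 1]]},  # graph 1: 3 nodes
]
data, slices = collate_sketch(graphs)
print(slices["x"])                 # slice boundaries per graph
print(get_graph(data, slices, 1))  # the second graph, recovered
```

Storing one concatenated object instead of many small ones is what lets the whole processed dataset be saved to, and loaded from, a single file.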

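Later on, I will show how to build a custom PyG dataset from the data provided by RecSys Challenge 2015.

DataLoader

The DataLoader class lets you feed data to the model in batches. To create a DataLoader instance, simply specify the dataset and the batch size you want.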

loader = DataLoader(dataset, batch_size=512, shuffle=True)

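Every iteration of a DataLoader yields a Batch object. It is very similar to a Data object but carries an additional 'batch' attribute, which indicates the graph each node belongs to. Because the DataLoader merges the x, y, and edge_index of different graphs into one batch, the GNN model needs this batch information to know which node belongs to which graph.

for batch in loader:
    print(batch)
    # Batch(x=[1024, 21], edge_index=[2, 1568], y=[512], batch=[1024])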
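
How several graphs end up in one Batch can be sketched in plain Python: the node indices of each graph are shifted by the number of nodes seen so far, and a `batch` vector records the graph each node came from. This is a simplified illustration rather than PyG's code, and `batch_graphs` is a hypothetical helper.

```python
def batch_graphs(graphs):
    """Merge graphs into one large disconnected graph.

    Each graph is (num_nodes, [sources, targets]).
    Returns the merged edge_index and the batch assignment vector.
    """
    sources, targets, batch = [], [], []
    offset = 0
    for graph_id, (num_nodes, (src, dst)) in enumerate(graphs):
        sources.extend(s + offset for s in src)
        targets.extend(t + offset for t in dst)
        batch.extend([graph_id] * num_nodes)
        offset += num_nodes  # the next graph's nodes start after this one's
    return [sources, targets], batch


graphs = [
    (3, [[0, 1], [1, 2]]),  # graph 0: edges 0->1, 1->2
    (2, [[0], [1]]),        # graph 1: edge 0->1
]
edge_index, batch = batch_graphs(graphs)
print(edge_index)  # the edge of graph 1 is shifted to 3->4
print(batch)       # graph id of every node
```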

MessagePassing

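Message passing is the essence of a GNN: it describes how node embeddings are learned.

The PyG authors provide a ready-made MessagePassing interface so that you can implement your own ideas quickly. To use it, you need to define three things:

message
update
aggregation scheme

When message is implemented, node features are automatically mapped to their respective source and target nodes. The aggregation scheme only requires setting a parameter: 'add' (sum), 'mean', or 'max'.

For a simple GCN, the following steps are all that is needed for a quick implementation:

Add self-loops to the adjacency matrix.
Linearly transform the node features.
Normalize the node features.
Aggregate the information from neighboring nodes.
Obtain the new node embeddings.

Steps 1 and 2 are computed before message passing takes place. Steps 3 to 5 can be handled by the torch_geometric.nn.MessagePassing class.

The purpose of adding self-loops is to include each node's own feature in the aggregation; without self-loops, only the features of neighboring nodes are aggregated.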
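
The effect of self-loops can be checked with a tiny sketch. It uses plain mean aggregation over incoming edges on scalar features, a simplification chosen for illustration (it is not the GCN normalization used in the GCN example in this post); `aggregate_mean` is a hypothetical helper.

```python
def aggregate_mean(features, edges, self_loops):
    """Average the features of each node's in-neighbors.

    features: one scalar feature per node; edges: list of (source, target).
    """
    n = len(features)
    if self_loops:
        edges = edges + [(i, i) for i in range(n)]  # new list, input untouched
    sums, counts = [0.0] * n, [0] * n
    for s, t in edges:
        sums[t] += features[s]
        counts[t] += 1
    return [sums[i] / counts[i] if counts[i] else 0.0 for i in range(n)]


features = [1.0, 10.0, 100.0]
edges = [(1, 0), (2, 0), (0, 1), (0, 2)]  # node 0 is linked with nodes 1 and 2

without_loops = aggregate_mean(features, edges, self_loops=False)
with_loops = aggregate_mean(features, edges, self_loops=True)
print(without_loops)  # each node sees only its neighbors
print(with_loops)     # each node's own feature now contributes as well
```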

Example 1: below is a GCN example from the official documentation. Steps 1-5 in the comments correspond to steps 1-5 above.

import torch
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import add_self_loops, degree

class GCNConv(MessagePassing):
    def __init__(self, in_channels, out_channels):
        super(GCNConv, self).__init__(aggr='add')  # "Add" aggregation.
        self.lin = torch.nn.Linear(in_channels, out_channels)

    def forward(self, x, edge_index):
        # x has shape [N, in_channels]
        # edge_index has shape [2, E]

        # Step 1: Add self-loops to the adjacency matrix.
        edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))

        # Step 2: Linearly transform node feature matrix.
        x = self.lin(x)

        # Step 3-5: Start propagating messages.
        return self.propagate(edge_index, size=(x.size(0), x.size(0)), x=x)

    def message(self, x_j, edge_index, size):
        # x_j has shape [E, out_channels]

        # Step 3: Normalize node features.
        row, col = edge_index
        deg = degree(row, size[0], dtype=x_j.dtype)
        deg_inv_sqrt = deg.pow(-0.5)
        norm = deg_inv_sqrt[row] * deg_inv_sqrt[col]

        return norm.view(-1, 1) * x_j

    def update(self, aggr_out):
        # aggr_out has shape [N, out_channels]

        # Step 5: Return new node embeddings.
        return aggr_out

All of the logic lives in forward(). Once propagate() is called, it internally calls message() and update().
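
The division of labor between propagate(), message(), and update() can be mimicked in a few lines of plain Python. This is only a conceptual sketch on scalar features, under the assumption that messages flow from source to target; it is not how PyG implements propagate() internally, and `propagate_sketch` is a hypothetical name.

```python
def propagate_sketch(x, edge_index, message, update, aggr="add"):
    """Compute one message per edge, aggregate per target node, then update."""
    sources, targets = edge_index
    inbox = [[] for _ in x]
    for s, t in zip(sources, targets):
        inbox[t].append(message(x[s]))  # message built from the source node
    if aggr == "add":
        aggregated = [sum(msgs) for msgs in inbox]
    elif aggr == "max":
        aggregated = [max(msgs) if msgs else 0.0 for msgs in inbox]
    else:
        raise ValueError("unsupported aggregation: " + aggr)
    return [update(a) for a in aggregated]


x = [1.0, 2.0, 3.0]
edge_index = [[0, 1, 2, 2], [1, 2, 0, 1]]  # edges 0->1, 1->2, 2->0, 2->1

out_add = propagate_sketch(x, edge_index, message=lambda x_j: 2 * x_j,
                           update=lambda a: a, aggr="add")
out_max = propagate_sketch(x, edge_index, message=lambda x_j: 2 * x_j,
                           update=lambda a: a, aggr="max")
print(out_add)
print(out_max)
```

Only node 1 receives two messages here, so it is the only node where the 'add' and 'max' schemes disagree.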

Example 2: below is a SAGE example.

import torch
from torch.nn import Sequential as Seq, Linear, ReLU
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import remove_self_loops, add_self_loops
class SAGEConv(MessagePassing):
    def __init__(self, in_channels, out_channels):
        super(SAGEConv, self).__init__(aggr='max') #  "Max" aggregation.
        self.lin = torch.nn.Linear(in_channels, out_channels)
        self.act = torch.nn.ReLU()
        self.update_lin = torch.nn.Linear(in_channels + out_channels, in_channels, bias=False)
        self.update_act = torch.nn.ReLU()
        
    def forward(self, x, edge_index):
        # x has shape [N, in_channels]
        # edge_index has shape [2, E]
        
        
        edge_index, _ = remove_self_loops(edge_index)
        edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))
        
        
        return self.propagate(edge_index, size=(x.size(0), x.size(0)), x=x)

    def message(self, x_j):
        # x_j has shape [E, in_channels]

        x_j = self.lin(x_j)
        x_j = self.act(x_j)
        
        return x_j

    def update(self, aggr_out, x):
        # aggr_out has shape [N, out_channels]


        new_embedding = torch.cat([aggr_out, x], dim=1)
        
        new_embedding = self.update_lin(new_embedding)
        new_embedding = self.update_act(new_embedding)
        
        return new_embedding

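The sections above covered how to turn your data into Data objects and how to implement your own ideas with MessagePassing, that is, how to produce new embeddings. For how to train a model, see the sections below and the official examples (https://github.com/rusty1s/pytorch_geometric/tree/master/examples).

A Real-World Example: RecSys Challenge 2015

RecSys Challenge 2015 is a recommender-system competition. Participants were asked to complete the following two tasks:

Predict whether a click sequence will be followed by a purchase.
Predict which item will be purchased.

First, we download the data from the official site (https://recsys.acm.org/recsys15/challenge/) and build a dataset from it. We then start with the first task, since it is the simpler one. The challenge provides two main data files, yoochoose-clicks.dat and yoochoose-buys.dat, containing the click events and the buy events respectively.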

Preprocessing

After downloading the data, we need to preprocess it so that it can be fed into our model.

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('../input/yoochoose-clicks.dat', header=None)
df.columns=['session_id','timestamp','item_id','category']

buy_df = pd.read_csv('../input/yoochoose-buys.dat', header=None)
buy_df.columns=['session_id','timestamp','item_id','price','quantity']

item_encoder = LabelEncoder()
df['item_id'] = item_encoder.fit_transform(df.item_id)
df.head()

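Because the dataset is very large, we work with a subset of the sessions for demonstration purposes.

import numpy as np

# Randomly sample a subset of the sessions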
sampled_session_id = np.random.choice(df.session_id.unique(), 1000000, replace=False)
df = df.loc[df.session_id.isin(sampled_session_id)]
df.nunique()

To determine the ground truth, that is, whether a given session contains a buy event, we simply check whether a session_id from yoochoose-clicks.dat also appears in yoochoose-buys.dat.

df['label'] = df.session_id.isin(buy_df.session_id)
df.head()
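
The labeling rule is just a set-membership test. A minimal sketch with made-up session ids (the values are illustrative and not taken from the dataset):

```python
# Hypothetical session ids, for illustration only.
click_sessions = [1, 1, 2, 3, 3, 3]  # one entry per click event
buy_sessions = {3, 7}                # sessions with at least one purchase

# A click row is labeled True when its session also appears among the buys.
labels = [session_id in buy_sessions for session_id in click_sessions]
print(labels)
```

Every click row of a session receives the same label, which is why the dataset code can later read the label from the first row of each group.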

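Dataset Construction

After the preprocessing step, the data is ready to be converted into a Dataset object. Here we treat every item in a session as a node, so all items in the same session form one graph. To build the dataset, we group the preprocessed data by session_id and iterate over the groups. In each iteration, the item_id values in the group are re-encoded, because the node indices of every graph must start from 0.

import torch
from torch_geometric.data import Data, InMemoryDataset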
from tqdm import tqdm

class YooChooseBinaryDataset(InMemoryDataset):
    def __init__(self, root, transform=None, pre_transform=None):
        super(YooChooseBinaryDataset, self).__init__(root, transform, pre_transform)
        self.data, self.slices = torch.load(self.processed_paths[0])

    @property
    def raw_file_names(self):
        return []
    @property
    def processed_file_names(self):
        return ['../input/yoochoose_click_binary_1M_sess.dataset']

    def download(self):
        pass
    
    def process(self):
        
        data_list = []

        # process by session_id
        grouped = df.groupby('session_id')
        for session_id, group in tqdm(grouped):
            sess_item_id = LabelEncoder().fit_transform(group.item_id)
            group = group.reset_index(drop=True)
            group['sess_item_id'] = sess_item_id
            node_features = group.loc[group.session_id==session_id,['sess_item_id','item_id']].sort_values('sess_item_id').item_id.drop_duplicates().values

            node_features = torch.LongTensor(node_features).unsqueeze(1)
            target_nodes = group.sess_item_id.values[1:]
            source_nodes = group.sess_item_id.values[:-1]

            edge_index = torch.tensor([source_nodes, target_nodes], dtype=torch.long)
            x = node_features

            y = torch.FloatTensor([group.label.values[0]])

            data = Data(x=x, edge_index=edge_index, y=y)
            data_list.append(data)
        
        data, slices = self.collate(data_list)
        torch.save((data, slices), self.processed_paths[0])
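
What process() does for a single session can be traced by hand. The sketch below re-encodes the global item ids of one session to local node indices starting at 0 (in sorted order, mirroring what LabelEncoder produces) and links consecutive clicks. The item ids are made up, and `session_to_graph` is a hypothetical helper.

```python
def session_to_graph(item_ids):
    """Turn one session's click sequence into node features and edges."""
    unique_items = sorted(set(item_ids))  # node i holds unique_items[i]
    local_index = {item: i for i, item in enumerate(unique_items)}
    sequence = [local_index[item] for item in item_ids]
    source_nodes = sequence[:-1]  # each click ...
    target_nodes = sequence[1:]   # ... points to the click that follows it
    return unique_items, [source_nodes, target_nodes]


# Hypothetical session: the user clicks items 520, 17, 520, 903 in that order.
node_features, edge_index = session_to_graph([520, 17, 520, 903])
print(node_features)  # one node per distinct item, ordered by local index
print(edge_index)     # edges between consecutive clicks, in local indices
```

A repeated item (520 here) maps to a single node, so revisiting an item creates a cycle in the session graph rather than a new node.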

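After building the dataset, we call shuffle() to make sure it is randomly shuffled, and then split it into three parts for training, validation, and testing.

# Instantiate the dataset first; the root directory here is an assumption, adjust it to your setup
dataset = YooChooseBinaryDataset(root='../')
dataset = dataset.shuffle()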
train_dataset = dataset[:800000]
val_dataset = dataset[800000:900000]
test_dataset = dataset[900000:]
len(train_dataset), len(val_dataset), len(test_dataset)

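Build a Graph Neural Network

The following GNN is adapted from one of the examples in the official PyG GitHub repository and uses the SAGEConv layer from Example 2 above (which differs from the SAGEConv() in the official library). In addition, the output layer is modified to match the binary classification setting.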

embed_dim = 128
from torch_geometric.nn import TopKPooling
from torch_geometric.nn import global_mean_pool as gap, global_max_pool as gmp
import torch.nn.functional as F
class Net(torch.nn.Module):
    def __init__(self):
        super(Net, self).__init__()

        self.conv1 = SAGEConv(embed_dim, 128)
        self.pool1 = TopKPooling(128, ratio=0.8)
        self.conv2 = SAGEConv(128, 128)
        self.pool2 = TopKPooling(128, ratio=0.8)
        self.conv3 = SAGEConv(128, 128)
        self.pool3 = TopKPooling(128, ratio=0.8)
        self.item_embedding = torch.nn.Embedding(num_embeddings=df.item_id.max() +1, embedding_dim=embed_dim)
        self.lin1 = torch.nn.Linear(256, 128)
        self.lin2 = torch.nn.Linear(128, 64)
        self.lin3 = torch.nn.Linear(64, 1)
        self.bn1 = torch.nn.BatchNorm1d(128)
        self.bn2 = torch.nn.BatchNorm1d(64)
        self.act1 = torch.nn.ReLU()
        self.act2 = torch.nn.ReLU()        
  
    def forward(self, data):
        x, edge_index, batch = data.x, data.edge_index, data.batch
        x = self.item_embedding(x)
        x = x.squeeze(1)        

        x = F.relu(self.conv1(x, edge_index))

        x, edge_index, _, batch, _ = self.pool1(x, edge_index, None, batch)
        x1 = torch.cat([gmp(x, batch), gap(x, batch)], dim=1)

        x = F.relu(self.conv2(x, edge_index))
     
        x, edge_index, _, batch, _ = self.pool2(x, edge_index, None, batch)
        x2 = torch.cat([gmp(x, batch), gap(x, batch)], dim=1)

        x = F.relu(self.conv3(x, edge_index))

        x, edge_index, _, batch, _ = self.pool3(x, edge_index, None, batch)
        x3 = torch.cat([gmp(x, batch), gap(x, batch)], dim=1)

        x = x1 + x2 + x3

        x = self.lin1(x)
        x = self.act1(x)
        x = self.lin2(x)
        x = self.act2(x)      
        x = F.dropout(x, p=0.5, training=self.training)

        x = torch.sigmoid(self.lin3(x)).squeeze(1)

        return x
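
The readout step explains why lin1 takes 256 inputs: gmp and gap each reduce the node embeddings of a graph to one 128-dimensional vector, and the two are concatenated. A plain-Python sketch of this pooling with 2-dimensional embeddings (so the concatenation has 4 entries) follows. It imitates the behavior of global max and mean pooling and is not PyG code; `readout` is a hypothetical name, and every graph is assumed to have at least one node.

```python
def readout(x, batch, num_graphs):
    """Concatenate per-graph max pooling and mean pooling of node embeddings."""
    dim = len(x[0])
    result = []
    for g in range(num_graphs):
        rows = [x[i] for i in range(len(x)) if batch[i] == g]
        max_part = [max(r[d] for r in rows) for d in range(dim)]
        mean_part = [sum(r[d] for r in rows) / len(rows) for d in range(dim)]
        result.append(max_part + mean_part)  # list concatenation, length 2 * dim
    return result


x = [[1.0, 4.0], [3.0, 2.0], [5.0, 6.0]]  # three nodes, 2-dim embeddings
batch = [0, 0, 1]                          # nodes 0-1 in graph 0, node 2 in graph 1
pooled = readout(x, batch, num_graphs=2)
print(pooled)  # one fixed-size vector per graph, regardless of node count
```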

Training

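Training a custom GNN is straightforward: iterate over a DataLoader built from the training set and backpropagate the loss. Here we use Adam as the optimizer with a learning rate of 0.005, and Binary Cross Entropy as the loss function.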

def train():
    model.train()

    loss_all = 0
    for data in train_loader:
        data = data.to(device)
        optimizer.zero_grad()
        output = model(data)
        label = data.y.to(device)
        loss = crit(output, label)
        loss.backward()
        loss_all += data.num_graphs * loss.item()
        optimizer.step()
    return loss_all / len(train_dataset)
    
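# Newer PyG releases expose DataLoader from torch_geometric.loader instead
from torch_geometric.data import DataLoader

batch_size = 512  # example value, matching the DataLoader example above; tune as needed
num_epochs = 1    # example value; the Result section trains for 1 epoch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = Net().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=0.005)
crit = torch.nn.BCELoss()
train_loader = DataLoader(train_dataset, batch_size=batch_size)
val_loader = DataLoader(val_dataset, batch_size=batch_size)
test_loader = DataLoader(test_dataset, batch_size=batch_size)
for epoch in range(num_epochs):
    train()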
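
Two details of train() are worth spelling out: what a binary cross entropy loss computes, and why each batch loss is multiplied by data.num_graphs before dividing by the dataset size. A minimal sketch in plain Python, with made-up predictions and labels (the helper `bce` is hypothetical and omits the numerical safeguards a real implementation has):

```python
import math


def bce(predictions, labels):
    """Mean binary cross entropy over one batch of probabilities."""
    total = 0.0
    for p, y in zip(predictions, labels):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(labels)


# Two batches of different size (illustrative numbers).
batches = [
    ([0.9, 0.2, 0.6], [1.0, 0.0, 1.0]),
    ([0.5], [0.0]),
]

# Weight each batch mean by its size, as train() does with data.num_graphs.
loss_all = sum(len(y) * bce(p, y) for p, y in batches)
num_samples = sum(len(y) for _, y in batches)
epoch_loss = loss_all / num_samples

# The same value, computed directly as the mean over all samples.
all_p = [p for ps, _ in batches for p in ps]
all_y = [y for _, ys in batches for y in ys]
print(abs(epoch_loss - bce(all_p, all_y)) < 1e-12)
```

Without the weighting, a small final batch would count as much as a full one, and the reported epoch loss would no longer be the per-graph average.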

Validation

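The labels are dominated by negatives: the data is highly imbalanced, because most sessions are not followed by any buy event. In other words, a dumb model that predicts negative for every session could reach an accuracy above 90%. AUC is therefore a better metric than accuracy for this task, since it only cares whether positive instances are scored higher than negative ones. We use the off-the-shelf AUC function from Sklearn.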

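import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(loader):
    model.eval()

    predictions = []
    labels = []

    with torch.no_grad():
        for data in loader:

            data = data.to(device)
            pred = model(data).detach().cpu().numpy()

            label = data.y.detach().cpu().numpy()
            predictions.append(pred)
            labels.append(label)

    # Concatenate the per-batch arrays and compute AUC over the whole loader
    predictions = np.hstack(predictions)
    labels = np.hstack(labels)

    return roc_auc_score(labels, predictions)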
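
The accuracy-versus-AUC argument can be reproduced with a small pure-Python sketch. It computes AUC directly from its pairwise definition (the fraction of positive/negative pairs ranked correctly, with ties counting one half) rather than calling scikit-learn; the labels and scores are made up, and `pairwise_auc` is a hypothetical helper.

```python
def pairwise_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


labels = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]  # 90% negatives

# A "dumb" model that gives every session the same low score.
constant_scores = [0.0] * 10
accuracy = sum((s >= 0.5) == (y == 1)
               for y, s in zip(labels, constant_scores)) / 10
constant_auc = pairwise_auc(labels, constant_scores)
print(accuracy, constant_auc)  # high accuracy, but AUC at chance level

# A model that ranks the single positive above most negatives.
useful_scores = [0.1, 0.2, 0.1, 0.3, 0.9, 0.2, 0.1, 0.4, 0.3, 0.8]
useful_auc = pairwise_auc(labels, useful_scores)
print(useful_auc)
```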

Result

Below, the model is trained for 1 epoch and the resulting metrics are printed:

for epoch in range(1):
    loss = train()
    train_acc = evaluate(train_loader)
    val_acc = evaluate(val_loader)    
    test_acc = evaluate(test_loader)
    print('Epoch: {:03d}, Loss: {:.5f}, Train Auc: {:.5f}, Val Auc: {:.5f}, Test Auc: {:.5f}'.
          format(epoch, loss, train_acc, val_acc, test_acc))

Conclusion

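You have now learned the basics of PyG, including building a dataset, customizing a GNN, and training a GNN model. The code and most of the content above come from the official documentation and from the blog post at https://towardsdatascience.com/hands-on-graph-neural-networks-with-pytorch-pytorch-geometric-359487e221a8. I hope you find it helpful. For more on PyG and further examples, see the official documentation and the official GitHub repository.

Finally,

Sharing is caring.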

References:

Fast Graph Representation Learning with PyTorch Geometric (rlgm.github.io)

PyTorch Geometric Documentation (pytorch-geometric.readthedocs.io)

Hands-on Graph Neural Networks with PyTorch & PyTorch Geometric (towardsdatascience.com)
