Non-local Survey 2018-2019

Non-local Neural Networks for Video Classification

CVPR 2018

Motivated by the classical non-local means operation, instead of adding more layers, the authors choose to use a non-local mechanism.

Following the non-local means operation, they define a generic non-local operation in deep neural networks as:

{\bf y}_i = \frac{1}{C( {\bf x} )} \sum_{\forall j} f( {\bf x}_i , {\bf x}_j ) g ( {\bf x}_j)

  • i is the index of an output position (in space, time, or spacetime) whose response is to be computed, and j is the index that enumerates all possible positions.

  • f is a kernel function, which is the key of this work:

  1. Gaussian: f( {\bf x}_i , {\bf x}_j ) = e^{ {\bf x}_i^T {\bf x}_j }
  2. Embedded Gaussian: f( {\bf x}_i , {\bf x}_j ) = e^{ \theta( {\bf x}_i)^T \phi( {\bf x}_j) }, where \theta(x_i) = W_\theta x_i and \phi(x_j) = W_\phi x_j
  3. Dot product: f( {\bf x}_i , {\bf x}_j ) = \theta( {\bf x}_i)^T \phi( {\bf x}_j)
  4. Concatenation: f( {\bf x}_i , {\bf x}_j ) = \mathrm{ReLU} ( w_f^T [ \theta( {\bf x}_i) , \phi( {\bf x}_j) ] )
  • g(x_j) = W_g x_j : a linear embedding

The pairwise computation of a non-local block is lightweight when it is used in high-level, sub-sampled feature maps.
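To make the operation concrete, here is a minimal PyTorch sketch of an embedded-Gaussian non-local block acting on a flattened set of positions. The class name, the use of linear layers instead of 1x1 (or 1x1x1) convolutions, and the half-channel bottleneck are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Minimal embedded-Gaussian non-local block over a set of positions.

    Input x: (batch, num_positions, channels). Linear projections stand in
    for the paper's 1x1/1x1x1 convolutions; this is a sketch, not the
    reference implementation.
    """
    def __init__(self, channels, inter_channels=None):
        super().__init__()
        inter_channels = inter_channels or channels // 2
        self.theta = nn.Linear(channels, inter_channels, bias=False)
        self.phi = nn.Linear(channels, inter_channels, bias=False)
        self.g = nn.Linear(channels, inter_channels, bias=False)
        self.out = nn.Linear(inter_channels, channels, bias=False)

    def forward(self, x):
        theta, phi, g = self.theta(x), self.phi(x), self.g(x)
        # f(x_i, x_j) = exp(theta(x_i)^T phi(x_j)); the row-wise softmax plays the role of 1/C(x)
        attn = F.softmax(torch.bmm(theta, phi.transpose(1, 2)), dim=-1)
        y = torch.bmm(attn, g)     # y_i = (1/C(x)) * sum_j f(x_i, x_j) g(x_j)
        return x + self.out(y)     # residual connection around the block
```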

Hierarchical Graph Representation Learning with Differentiable Pooling

NIPS 2018

Abstract

They propose DIFFPOOL, a differentiable graph pooling module that can generate hierarchical representations of graphs.

GNN module (Kipf's GCN):

H^k = M(A, H^{k-1}, W^{k-1}) = \mathrm{ReLU} ( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{k-1} W^{k-1})

  • \tilde{A} = A + I
  • \tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}
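For reference, a minimal dense PyTorch sketch of this propagation rule; the function name and the dense-matrix formulation are simplifications made here, since practical GCN code uses sparse operations.

```python
import torch

def gcn_layer(A, H, W):
    """One GCN propagation step: ReLU(D~^{-1/2} (A + I) D~^{-1/2} H W).

    A: (n, n) dense adjacency, H: (n, d_in) node features, W: (d_in, d_out) weights.
    """
    A_tilde = A + torch.eye(A.shape[0])
    d_inv_sqrt = A_tilde.sum(dim=1).rsqrt()                      # diagonal of D~^{-1/2}
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]  # D~^{-1/2} A~ D~^{-1/2}
    return torch.relu(A_hat @ H @ W)
```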

Here we focus on the proposed method.

3.1 Preliminaries

Stacking GNNs and pooling layers.

Formally, given Z = GNN(A, X), the output of a GNN module, and a graph adjacency matrix A \in \mathbb{R}^{n \times n}, we seek to define a strategy to output a new coarsened graph containing m < n nodes, with weighted adjacency matrix A' \in \mathbb{R}^{m \times m} and node embeddings Z' \in \mathbb{R}^{m \times d}.

Thus, their goal is to learn how to cluster or pool together nodes using the output of a GNN, so that they can use this coarsened graph as input to another GNN layer.

What makes designing such a pooling layer for GNNs especially challenging—compared to the usual graph coarsening task—is that our goal is not to simply cluster the nodes in one graph, but to provide a general recipe to hierarchically pool nodes across a broad set of input graphs. That is, we need our model to learn a pooling strategy that will generalize across graphs with different nodes, edges, and that can adapt to the various graph structures during inference.

3.2 Differentiable Pooling via Learned Assignments

They address the above challenges by learning a cluster assignment matrix over the nodes using the output of a GNN model.

The key intuition is that we stack L GNN modules and learn to assign nodes to clusters at layer l in an end-to-end fashion, using embeddings generated from a GNN at layer l − 1.

Thus, we are using GNNs both to extract node embeddings that are useful for graph classification, as well as to extract node embeddings that are useful for hierarchical pooling.

Pooling with an assignment matrix
Suppose that S^{(l)} has already been computed:
X^{(l+1)} = {S^{(l)}}^T Z^{(l)} \in \mathbb{R}^{n_{l+1} \times d} \\ A^{(l+1)} = {S^{(l)}}^T A^{(l)} S^{(l)} \in \mathbb{R}^{n_{l+1} \times n_{l+1}}

S^{(l)} is the assignment matrix
Z^{(l)} is the embedding matrix
A^{(l)} is the adjacency matrix
X^{(l)} is the node feature matrix
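A minimal sketch of this coarsening step in PyTorch, assuming S^{(l)} and Z^{(l)} have already been produced; the function and variable names are illustrative, not from the authors' code.

```python
import torch

def diffpool_coarsen(A, Z, S):
    """One DIFFPOOL coarsening step.

    A: (n_l, n_l) adjacency, Z: (n_l, d) node embeddings,
    S: (n_l, n_{l+1}) row-stochastic assignment matrix.
    """
    X_next = S.t() @ Z      # X^{(l+1)} = S^T Z, cluster features
    A_next = S.t() @ A @ S  # A^{(l+1)} = S^T A S, cluster adjacency
    return A_next, X_next
```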

Learning the assignment matrix.
How DIFFPOOL generates the assignment matrix S^{(l)} and the embedding matrix Z^{(l)}:
They generate these two matrices using two separate GNNs that are both applied to the input cluster node features X^{(l)} and the coarsened adjacency matrix A^{(l)}:

Z^{(l)} = \text{GNN}_{l,\text{embed}} (A^{(l)}, X^{(l)}) \\ S^{(l)} = \text{softmax} ( \text{GNN}_{l,\text{pool}} (A^{(l)}, X^{(l)}) )

  • the softmax is row-wise
  • The output dimension of \text{GNN}_{l,\text{pool}} corresponds to a pre-defined maximum number of clusters at layer l, and is a hyperparameter of the model.
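A small sketch of this step, assuming gnn_embed and gnn_pool are callables implementing GNN(A, X), for instance stacks of the GCN layer sketched earlier; all names here are illustrative.

```python
import torch.nn.functional as F

def diffpool_assignments(A, X, gnn_embed, gnn_pool):
    """Produce the embedding matrix Z^{(l)} and assignment matrix S^{(l)}."""
    Z = gnn_embed(A, X)                     # (n_l, d) node embeddings
    S = F.softmax(gnn_pool(A, X), dim=1)    # row-wise softmax -> (n_l, n_{l+1})
    return Z, S
```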
3.3 Auxiliary Link Prediction Objective and Entropy Regularization
  1. L_{LP} = \| A^{(l)} - S^{(l)} {S^{(l)}}^T \|_F
  2. L_{E} = \frac{1}{n} \sum_{i=1}^n H(S_i), where H denotes the entropy function and S_i is the i-th row of S.
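A sketch of these two auxiliary terms under the formulas above; the epsilon for numerical stability and the function name are additions made here.

```python
import torch

def diffpool_aux_losses(A, S, eps=1e-12):
    """A: (n, n) adjacency at the current level, S: (n, m) row-stochastic assignments."""
    link_pred = torch.norm(A - S @ S.t(), p="fro")        # L_LP
    entropy = -(S * (S + eps).log()).sum(dim=1).mean()    # L_E: mean row-wise entropy
    return link_pred, entropy
```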

Experiment

baseline methods:

GNN-based methods:

  1. GRAPHSAGE
  2. STRUCTURE2VEC: combines a latent variable model with GNNs. It uses global mean pooling.
  3. Edge-conditioned filters in CNN for graphs (ECC) incorporates edge information into the GCN model and performs pooling using a graph coarsening algorithm
  4. PATCHYSAN defines a receptive field for each node using a canonical node ordering, and applies convolutions on linear sequences of node embeddings.
  5. SET2SET replaces the global mean-pooling in traditional GNN architectures by the aggregation used in SET2SET. Set2Set aggregation has been shown to perform better than mean pooling in previous work [15]. GRAPHSAGE is used as the base GNN model.
  6. SORTPOOL applies a GNN architecture and then performs a single layer of soft pooling followed by 1D convolution on sorted node embeddings.

kernel-based

  1. GRAPHLET
  2. SHORTEST-PATH
  3. WEISFEILER-LEHMAN kernel (WL)
  4. WEISFEILER-LEHMAN OPTIMAL ASSIGNMENT kernel (WL-OA)

MixHop: Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing

ICML 2019

Abstract

MixHop requires no additional memory or computational complexity
In addition, they propose a sparsity regularization that lets them visualize how the network prioritizes neighborhood information across different graph datasets. Their analysis of the learned architectures reveals that neighborhood mixing varies per dataset.

Intro

Defferrard et al. (2016) and Kipf & Welling (2017) propose GC approximations that are computationally efficient (linear complexity in the number of edges) and can be applied in inductive settings, where the test graphs are not observed during training.

Proposed Architecture

They are interested in higher-order message passing, where nodes receive latent representations from their immediate (first-degree) neighbors as well as from further N-degree neighbors at every message passing step.

The analysis starts with the delta operator, a subtraction operation between node features collected from different distances; MixHop can express it because the layer concatenates GCN outputs computed with different adjacency powers.

Complexity: there is no need to materialize \hat{A}^j; they compute \hat{A}^j H^{(i)} with right-to-left multiplication.
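A minimal sketch of a MixHop-style layer that uses this right-to-left trick; the set of powers, the ReLU nonlinearity, and all names are assumptions for illustration, not the authors' code.

```python
import torch

def mixhop_layer(A_hat, H, weights, powers=(0, 1, 2)):
    """Concatenate sigma(A_hat^j H W_j) over adjacency powers j, computing
    A_hat^j H by repeated multiplication instead of materializing A_hat^j.

    A_hat: (n, n) normalized adjacency, H: (n, d), weights: dict j -> (d, d_j).
    """
    outputs = []
    propagated = H
    for j in range(max(powers) + 1):
        if j > 0:
            propagated = A_hat @ propagated      # A_hat^j H, right-to-left
        if j in powers:
            outputs.append(torch.relu(propagated @ weights[j]))
    return torch.cat(outputs, dim=1)             # column-wise concatenation
```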

Representational capability: they prove that a vanilla GCN cannot represent the two-hop delta operator, while their model can.

General Neighborhood Mixing:
definition 2: General layer-wise Neighborhood Mixing

Learning GC Architectures

Output layer: MixHop uniquely mixes features from different sets of information. The output layer selects s_l columns into sets of size c and computes \tilde{Y}_O = ...

learning adjacency power architectures:

  1. one weight matrix W_j^{(i)} per adjacency power used in the model
  2. different sizes of W_j^{(i)} may be more appropriate for different tasks and datasets, so they are interested in learning how to automatically size W_j^{(i)}
  3. for vanilla GCNs, such a search is inexpensive (here I do not understand how vanilla GCNs search the size of W_j^{(i)} automatically)

Experiments

It seems like there is a trend that the questions a paper should answer need to be spelled out in the experiments section.

  1. they use synthetic datasets (in which homophily is gradually decreased) to evaluate the model, which better captures the delta operator
  2. real world experiment
  3. visualization

Position-aware Graph Neural Networks

ICML 2019

Abstract

existing Graph Neural Network (GNN) architectures have limited power in capturing the position/location of a given node with respect to all other nodes of the graph.

They propose Position-aware Graph Neural Networks (P-GNNs):

  1. samples sets of anchor nodes
  2. computes the distance of a given target node to each anchor-set
  3. learns a non-linear distance-weighted aggregation scheme over the anchor-sets

Introduction

However, the key limitation of existing GNN architectures is that they fail to capture the position/location of the node within the broader context of the graph structure.

The paper provides an example: two nodes that are far apart in the graph can, through a GCN, end up with the same embedding (ignoring node features).

Existing researchers have spotted this weakness:

  1. introduce one-hot feature
  2. deepen the GCN

Their one key observation: node position can be captured by a low-distortion embedding obtained by quantifying the distance between a given node and a set of anchor nodes.

method:

  1. P-GNN first samples multiple anchor-sets in each forward pass
  2. then learns a non-linear aggregation scheme that combines node feature information from each anchor-set and weighs it by the distance between the node and the anchor-set.

Besides, the Bourgain theorem (Bourgain, 1985) guarantees that only k = O(\log^2 n) anchor-sets are needed to preserve the distances in the original graph with low distortion.
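A rough sketch of the anchor-set sampling and per-node distance features, assuming a networkx graph and exact shortest-path distances; the constant c, the 1/(d+1) transform, and the sampling probabilities only schematically follow the Bourgain-style construction described in the paper.

```python
import math
import random
import networkx as nx

def anchor_distance_features(G, c=1):
    """Sample roughly c * log^2(n) anchor-sets and give every node a
    k-dimensional vector of transformed distances to each set."""
    nodes = list(G.nodes())
    n = len(nodes)
    log_n = max(1, int(math.log2(n)))
    # log n "scales", c * log n sets per scale -> k = c * log^2 n anchor-sets;
    # at scale i, each node joins a set with probability 1 / 2^(i+1)
    anchor_sets = [
        [v for v in nodes if random.random() < 1.0 / (2 ** (i + 1))]
        for i in range(log_n) for _ in range(c * log_n)
    ]
    dist = dict(nx.all_pairs_shortest_path_length(G))
    features = {}
    for v in nodes:
        # distance to a set = min distance to any member; use the 1/(d+1) weighting
        features[v] = [
            max((1.0 / (dist[v][u] + 1) for u in S if u in dist[v]), default=0.0)
            for S in anchor_sets
        ]
    return anchor_sets, features
```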

Implementation notes:

  1. In settings where node attributes are not available, P-GNN’s computation of the k dimensional distance vector is inductive across different node orderings and different graphs.
  2. When node attributes are available, a node’s embedding is further enriched by aggregating information from all anchor-sets, weighted by the k dimensional distance vector.

Further, for large graphs, they propose P-GNN-Fast.

Preliminaries

“We call node embeddings to be position- aware, if the embedding of two nodes can be used to (approximately) recover their shortest path distance in the network.”

They give two definitions:

  1. position-aware
  2. structure-aware

Most GCNs are structure-aware methods.

Proposition 1: there exists a mapping from structure-aware embeddings to position-aware embeddings \Leftrightarrow no pair of nodes has isomorphic local q-hop neighbourhood graphs (proved in the appendix).

Proposed Approach

We design P-GNNs such that each node embedding dimension corresponds to messages computed with respect to one anchor-set, which makes the computed node embeddings position-aware (see Figure 2)

Experiment

  1. link prediction
  2. pair-wise node classification

GeniePath: Graph Neural Networks with Adaptive Receptive Paths

AAAI 19 Le Song

Intro
