Non-local Survey 2018-2019

Non-local Neural Networks for Video Classification

CVPR 2018

Motivated by the classical non-local means operation, instead of adding more layers, the authors choose to use a non-local mechanism.

Following the non-local means operation, they define a generic non-local operation in deep neural networks as:

{\bf y}_i = \frac{1}{C( {\bf x} )} \sum_{\forall j} f( {\bf x}_i , {\bf x}_j ) g ( {\bf x}_j)

  • i is the index of an output position (in space, time, or spacetime) whose response is to be computed, and j is the index that enumerates all possible positions.

  • f is a kernel function, which is the key of this work:

  1. Gaussian: f( {\bf x}_i , {\bf x}_j ) = e^{ {\bf x}_i^T {\bf x}_j }
  2. Embedded Gaussian: f( {\bf x}_i , {\bf x}_j ) = e^{ \theta( {\bf x}_i)^T \phi( {\bf x}_j) }, where \theta(x_i) = W_\theta x_i and \phi(x_j) = W_\phi x_j
  3. Dot product: f( {\bf x}_i , {\bf x}_j ) = \theta( {\bf x}_i)^T \phi( {\bf x}_j)
  4. Concatenation: f( {\bf x}_i , {\bf x}_j ) = \mathrm{ReLU} ( w_f^T [ \theta( {\bf x}_i) , \phi( {\bf x}_j) ] )
  • g(x_j) = W_g x_j : a linear embedding

The pairwise computation of a non-local block is lightweight when it is used in high-level, sub-sampled feature maps.
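To make the operation concrete, here is a minimal PyTorch sketch of an embedded-Gaussian non-local block acting on a flattened set of positions. The class name, the use of linear layers instead of 1x1 (or 1x1x1) convolutions, and the half-channel bottleneck are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Minimal embedded-Gaussian non-local block over a set of positions.

    Input x: (batch, num_positions, channels). Linear projections stand in
    for the paper's 1x1/1x1x1 convolutions; this is a sketch, not the
    reference implementation.
    """
    def __init__(self, channels, inter_channels=None):
        super().__init__()
        inter_channels = inter_channels or channels // 2
        self.theta = nn.Linear(channels, inter_channels, bias=False)
        self.phi = nn.Linear(channels, inter_channels, bias=False)
        self.g = nn.Linear(channels, inter_channels, bias=False)
        self.out = nn.Linear(inter_channels, channels, bias=False)

    def forward(self, x):
        theta, phi, g = self.theta(x), self.phi(x), self.g(x)
        # f(x_i, x_j) = exp(theta(x_i)^T phi(x_j)); the row-wise softmax plays the role of 1/C(x)
        attn = F.softmax(torch.bmm(theta, phi.transpose(1, 2)), dim=-1)
        y = torch.bmm(attn, g)     # y_i = (1/C(x)) * sum_j f(x_i, x_j) g(x_j)
        return x + self.out(y)     # residual connection around the block
```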

Hierarchical Graph Representation Learning with Differentiable Pooling

NIPS 2018

Abstract

They propose DIFFPOOL, a differentiable graph pooling module that can generate hierarchical representations of graphs.

GNN module (Kipf's GCN):

H^k = M(A, H^{k-1}, W^{k-1}) = \mathrm{ReLU} ( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{k-1} W^{k-1})

  • \tilde{A} = A + I
  • \tilde{D}_{ii} = \sum_{j} \tilde{A}_{ij}
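For reference, a minimal dense PyTorch sketch of this propagation rule; the function name and the dense-matrix formulation are simplifications made here, since practical GCN code uses sparse operations.

```python
import torch

def gcn_layer(A, H, W):
    """One GCN propagation step: ReLU(D~^{-1/2} (A + I) D~^{-1/2} H W).

    A: (n, n) dense adjacency, H: (n, d_in) node features, W: (d_in, d_out) weights.
    """
    A_tilde = A + torch.eye(A.shape[0])
    d_inv_sqrt = A_tilde.sum(dim=1).rsqrt()                      # diagonal of D~^{-1/2}
    A_hat = d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]  # D~^{-1/2} A~ D~^{-1/2}
    return torch.relu(A_hat @ H @ W)
```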

Here we focus on the proposed method.

3.1 Preliminaries

Stacking GNNs and pooling layers.

Formally, given Z = GNN(A, X), the output of a GNN module, and a graph adjacency matrix A \in \mathbb{R}^{n \times n}, we seek to define a strategy to output a new coarsened graph containing m < n nodes, with weighted adjacency matrix A' \in \mathbb{R}^{m \times m} and node embeddings Z' \in \mathbb{R}^{m \times d}.

Thus, their goal is to learn how to cluster or pool together nodes using the output of a GNN, so that they can use this coarsened graph as input to another GNN layer.

What makes designing such a pooling layer for GNNs especially challenging—compared to the usual graph coarsening task—is that our goal is not to simply cluster the nodes in one graph, but to provide a general recipe to hierarchically pool nodes across a broad set of input graphs. That is, we need our model to learn a pooling strategy that will generalize across graphs with different nodes, edges, and that can adapt to the various graph structures during inference.

3.2 Differentiable Pooling via Learned Assignments

They address the above challenges by learning a cluster assignment matrix over the nodes using the output of a GNN model.

The key intuition is that we stack L GNN modules and learn to assign nodes to clusters at layer l in an end-to-end fashion, using embeddings generated from a GNN at layer l − 1.

Thus, we are using GNNs both to extract node embeddings that are useful for graph classification, as well as to extract node embeddings that are useful for hierarchical pooling.

Pooling with an assignment matrix
Suppose that S^{(l)} has already been computed:
X^{(l+1)} = {S^{(l)}}^T Z^{(l)} \in \mathbb{R}^{n_{l+1} \times d} \\ A^{(l+1)} = {S^{(l)}}^T A^{(l)} S^{(l)} \in \mathbb{R}^{n_{l+1} \times n_{l+1}}

S^{(l)} is the assignment matrix
Z^{(l)} is the embedding matrix
A^{(l)} is the adjacency matrix
X^{(l)} is the node feature matrix
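A minimal sketch of this coarsening step in PyTorch, assuming S^{(l)} and Z^{(l)} have already been produced; the function and variable names are illustrative, not from the authors' code.

```python
import torch

def diffpool_coarsen(A, Z, S):
    """One DIFFPOOL coarsening step.

    A: (n_l, n_l) adjacency, Z: (n_l, d) node embeddings,
    S: (n_l, n_{l+1}) row-stochastic assignment matrix.
    """
    X_next = S.t() @ Z      # X^{(l+1)} = S^T Z, cluster features
    A_next = S.t() @ A @ S  # A^{(l+1)} = S^T A S, cluster adjacency
    return A_next, X_next
```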

Learning the assignment matrix.
How DIFFPOOL generates the assignment matrix S^{(l)} and the embedding matrix Z^{(l)}:
They generate these two matrices using two separate GNNs that are both applied to the input cluster node features X^{(l)} and the coarsened adjacency matrix A^{(l)}:

Z^{(l)} = \text{GNN}_{l,\text{embed}} (A^{(l)}, X^{(l)}) \\ S^{(l)} = \text{softmax} ( \text{GNN}_{l,\text{pool}} (A^{(l)}, X^{(l)}) )

  • the softmax is row-wise
  • The output dimension of \text{GNN}_{l,\text{pool}} corresponds to a pre-defined maximum number of clusters at layer l, and is a hyperparameter of the model.
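A small sketch of this step, assuming gnn_embed and gnn_pool are callables implementing GNN(A, X), for instance stacks of the GCN layer sketched earlier; all names here are illustrative.

```python
import torch.nn.functional as F

def diffpool_assignments(A, X, gnn_embed, gnn_pool):
    """Produce the embedding matrix Z^{(l)} and assignment matrix S^{(l)}."""
    Z = gnn_embed(A, X)                     # (n_l, d) node embeddings
    S = F.softmax(gnn_pool(A, X), dim=1)    # row-wise softmax -> (n_l, n_{l+1})
    return Z, S
```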
3.3 Auxiliary Link Prediction Objective and Entropy Regularization
  1. L_{LP} = \| A^{(l)} - S^{(l)} {S^{(l)}}^T \|_F
  2. L_{E} = \frac{1}{n} \sum_{i=1}^n H(S_i), where H denotes the entropy function and S_i is the i-th row of S.
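A sketch of these two auxiliary terms under the formulas above; the epsilon for numerical stability and the function name are additions made here.

```python
import torch

def diffpool_aux_losses(A, S, eps=1e-12):
    """A: (n, n) adjacency at the current level, S: (n, m) row-stochastic assignments."""
    link_pred = torch.norm(A - S @ S.t(), p="fro")        # L_LP
    entropy = -(S * (S + eps).log()).sum(dim=1).mean()    # L_E: mean row-wise entropy
    return link_pred, entropy
```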

Experiment

baseline methods:

GNN-based methods:

  1. GRAPHSAGE
  2. STRUCTURE2VEC: combines a latent variable model with GNNs. It uses global mean pooling.
  3. Edge-conditioned filters in CNN for graphs (ECC) incorporates edge information into the GCN model and performs pooling using a graph coarsening algorithm
  4. PATCHYSAN defines a receptive field for each node using a canonical node ordering, and applies convolutions on linear sequences of node embeddings.
  5. SET2SET replaces the global mean-pooling in traditional GNN architectures by the aggregation used in SET2SET. Set2Set aggregation has been shown to perform better than mean pooling in previous work [15]. GRAPHSAGE is used as the base GNN model.
  6. SORTPOOL applies a GNN architecture and then performs a single layer of soft pooling followed by 1D convolution on sorted node embeddings.

kernel-based

  1. GRAPHLET
  2. SHORTEST-PATH
  3. WEISFEILER-LEHMAN kernel (WL)
  4. WEISFEILER-LEHMAN OPTIMAL ASSIGNMENT kernel (WL-OA)

MixHop: Higher-Order Graph Convolutional Architectures via Sparsified Neighborhood Mixing

ICML 2019

Abstract

MixHop requires no additional memory or computational complexity
In addition, they propose a sparsity regularization that lets them visualize how the network prioritizes neighborhood information across different graph datasets. Their analysis of the learned architectures reveals that neighborhood mixing varies per dataset.

Intro

Defferrard et al. (2016) and Kipf & Welling (2017) propose GC approximations that are computationally efficient (linear complexity in the number of edges) and can be applied in inductive settings, where the test graphs are not observed during training.

Proposed Architecture

They are interested in higher-order message passing, where nodes receive latent representations from their immediate (first-degree) neighbors as well as from further N-degree neighbors at every message passing step.

The analysis starts with the delta operator, a subtraction operation between node features collected from different distances; MixHop can express it because the layer concatenates GCN outputs computed with different adjacency powers.

Complexity: there is no need to materialize \hat{A}^j; they compute \hat{A}^j H^{(i)} with right-to-left multiplication.
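A minimal sketch of a MixHop-style layer that uses this right-to-left trick; the set of powers, the ReLU nonlinearity, and all names are assumptions for illustration, not the authors' code.

```python
import torch

def mixhop_layer(A_hat, H, weights, powers=(0, 1, 2)):
    """Concatenate sigma(A_hat^j H W_j) over adjacency powers j, computing
    A_hat^j H by repeated multiplication instead of materializing A_hat^j.

    A_hat: (n, n) normalized adjacency, H: (n, d), weights: dict j -> (d, d_j).
    """
    outputs = []
    propagated = H
    for j in range(max(powers) + 1):
        if j > 0:
            propagated = A_hat @ propagated      # A_hat^j H, right-to-left
        if j in powers:
            outputs.append(torch.relu(propagated @ weights[j]))
    return torch.cat(outputs, dim=1)             # column-wise concatenation
```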

Representational capability: they prove that a vanilla GCN cannot represent the two-hop delta operator, while their model can.

General Neighborhood Mixing:
definition 2: General layer-wise Neighborhood Mixing

Learning GC Architectures

Output layer: MixHop uniquely mixes features from different sets of information. The output layer selects s_l columns into sets of size c and computes \tilde{Y}_O = ...

learning adjacency power architectures:

  1. one weight matrix W_j^{(i)} per adjacency power used in the model
  2. different sizes of W_j^{(i)} may be more appropriate for different tasks and datasets, so they are interested in learning how to automatically size W_j^{(i)}
  3. for vanilla GCNs, such a search is inexpensive (here I do not understand how vanilla GCNs search the size of W_j^{(i)} automatically)

Experiments

It seems like there is a trend that the questions a paper should answer need to be spelled out in the experiments section.

  1. they use synthetic datasets (in which homophily is gradually decreased) to evaluate the model, which better captures the delta operator
  2. real world experiment
  3. visualization

Position-aware Graph Neural Networks

ICML 2019

Abstract

existing Graph Neural Network (GNN) architectures have limited power in capturing the position/location of a given node with respect to all other nodes of the graph.

They propose Position-aware Graph Neural Networks (P-GNNs):

  1. samples sets of anchor nodes
  2. computes the distance of a given target node to each anchor-set
  3. learns a non-linear distance-weighted aggregation scheme over the anchor-sets

Introduction

However, the key limitation of existing GNN architectures is that they fail to capture the position/location of the node within the broader context of the graph structure.

The paper provides an example: two nodes that are far apart in the graph can, through a GCN, end up with the same embedding (ignoring node features).

Existing researchers have spotted this weakness:

  1. introduce one-hot feature
  2. deepen the GCN

Their one key observation: node position can be captured by a low-distortion embedding obtained by quantifying the distance between a given node and a set of anchor nodes.

method:

  1. P-GNN first samples multiple anchor-sets in each forward pass
  2. then learns a non-linear aggregation scheme that combines node feature information from each anchor-set and weighs it by the distance between the node and the anchor-set.

Besides, the Bourgain theorem (Bourgain, 1985) guarantees that only k = O(\log^2 n) anchor-sets are needed to preserve the distances in the original graph with low distortion.
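A rough sketch of the anchor-set sampling and per-node distance features, assuming a networkx graph and exact shortest-path distances; the constant c, the 1/(d+1) transform, and the sampling probabilities only schematically follow the Bourgain-style construction described in the paper.

```python
import math
import random
import networkx as nx

def anchor_distance_features(G, c=1):
    """Sample roughly c * log^2(n) anchor-sets and give every node a
    k-dimensional vector of transformed distances to each set."""
    nodes = list(G.nodes())
    n = len(nodes)
    log_n = max(1, int(math.log2(n)))
    # log n "scales", c * log n sets per scale -> k = c * log^2 n anchor-sets;
    # at scale i, each node joins a set with probability 1 / 2^(i+1)
    anchor_sets = [
        [v for v in nodes if random.random() < 1.0 / (2 ** (i + 1))]
        for i in range(log_n) for _ in range(c * log_n)
    ]
    dist = dict(nx.all_pairs_shortest_path_length(G))
    features = {}
    for v in nodes:
        # distance to a set = min distance to any member; use the 1/(d+1) weighting
        features[v] = [
            max((1.0 / (dist[v][u] + 1) for u in S if u in dist[v]), default=0.0)
            for S in anchor_sets
        ]
    return anchor_sets, features
```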

Implementation notes:

  1. In settings where node attributes are not available, P-GNN’s computation of the k dimensional distance vector is inductive across different node orderings and different graphs.
  2. When node attributes are available, a node’s embedding is further enriched by aggregating information from all anchor-sets, weighted by the k dimensional distance vector.

Further, for large graphs, they propose P-GNN-Fast.

Preliminaries

“We call node embeddings to be position- aware, if the embedding of two nodes can be used to (approximately) recover their shortest path distance in the network.”

They give two definitions:

  1. position-aware
  2. structure-aware

Most GCNs are structure-aware methods.

Proposition 1: there exists a mapping from structure-aware embeddings to position-aware embeddings \Leftrightarrow no pair of nodes has isomorphic local q-hop neighbourhood graphs (proved in the appendix).

Proposed Approach

We design P-GNNs such that each node embedding dimension corresponds to messages computed with respect to one anchor-set, which makes the computed node embeddings position-aware (see Figure 2)

Experiment

  1. link prediction
  2. pair-wise node classification

GeniePath: Graph Neural Networks with Adaptive Receptive Paths

AAAI 19 Le Song

Intro
