home..

Graph Representation Learning

huyi / July 2022

chap1 Introduction
- 1.1 What is a graph?
  - 1.1.1 Multi-relational Graphs
  - 1.1.2 Feature Information
- 1.2 Machine learning on graphs
chap2 Background and Traditional Approaches
chap 3: Neighborhood Reconstruction Methods
chap 4: Multi-relational Data and Knowledge Graphs

chap1 Introduction

1.1 What is a graph?

$\mathcal{G=(V,E)}$

simple graph，directed graph, directed graph with weighted edges可以由adjacency matrix来表示

simple graph: 其中的边undirected,且没有自己指向自己的节点)
directed graph就是adjacency matrix不对称而已

1.1.1 Multi-relational Graphs

different types of edges extend the edge notation to include an edge or relation type $\tau$ , e.g., $(u, \tau, v)\in\mathcal{E}$,对于每一种type $\tau$ 可以有一个adjacency matrix–>adjacency tensor $\mathcal{A}\in\mathbb{R}^{\mathcal{|V|\times|R|\times|V|}}$, 其中 $\mathcal{R}$是 relations的集合

Heterogeneous graphs nodes也有types, 根据这些type可以把nodes分成各不相交的k个集合：
$\mathcal{V=V_1\cup V_2\cup…\cup V_k}$ where $\mathcal{V_i\cup V_j}=\emptyset$
一般来说，一种relation专门连接两种特定的节点，比如$\tau_i$ 专门连接$\mathcal{V_k} and \mathcal{V_j}$

也有可以连接多种节点类型的relation type

Multiplex graphs 图可以被分成k layers, each layer corresponds to a unique relation.

Intra-layer edges: 一种专门的edge Inter-layer edges: 看一个节点生出来的不同类型的edges

1.1.2 Feature Information

node-level attributes that we represent using a real-valued matrix: $\mathsf{X}\in\mathbb{R}^{\mathcal{|V|\times m}}$
$\mathcal{|V|}$个节点，每个节点m维向量注：节点顺序和adjacency matrix里面的节点顺序相同很少考虑edge在除了types之外还有自己的feature

1.2 Machine learning on graphs

problem-driven

分类不可直接沿用 supervised / unsupervised

图机器学习的任务类型：

1.2.1 Node classification

the goal is to predict the label $y_u$ associated with all the nodes $u \in V$, when we are only given the true labels on a training set of nodes $V_{train}\subset\mathcal{V}$.

node classification和标准的supervised classification 区别：

    nodes in standard supervised classification: independent and identically distributed
    nodes in graph machine learning: interconnected-->model the relationships between nodes

由于我们可以接触到test nodes的一些信息，所以不完全算unsupervised–>semi-supervised

1.2.2 Relation prediction

别名：link prediction, graph completion, relational inference

standard setup：有一堆nodes $\mathcal{V}$, 一堆incomplete set of edges $\mathcal{E}$ \ $\mathcal{E}_{train}$

1.2.3 Clustering and community detection

接近于unsupervised

1.2.4 Graph classification, regression, and clustering

chap2 Background and Traditional Approaches

what’s used before the advent of modern deep learning approaches?

针对三种不同的学习任务:

• node and graph classification tasks: basic graph statistics, kernel methods
• relation prediction: various approaches for measuring the overlap between node neighborhoods,
• clustering or community detection on graphs: spectral clustering using graph Laplacians

2.1 Graph Statistics and Kernel Methods

traditional way: extracting some statistics or features—based on heuristic functions or domain knowledge—and then use these features as input to a standard machine learning classifier

–>introduce some important node-level features and statistics
–>how these node-level statistics can be generalized to graph-level statistics
–>extended to design kernel methods over graphs

==>GOAL: introduce various heuristic statistics and graph properties

2.1.1 Node-level statistics and features

以Medicis家族的social network为例：

what properties or statistics of the Medici node distinguish it from the rest of the graph?
what are useful properties and statistics that we can use to characterize the nodes in this graph?

Node degree

$d_u=\sum\limits_{u\in V}\mathsf{A}[u,v]$表示这个节点伸出来的边的个数

Node centrality $e_u=\frac{1}{\lambda}\sum\limits_{v\in V}\mathsf{A}[u,v]e_v$ 即centrality和neighbour的centrality 的 weighted sum成正比，其中$\lambda$为一个常数,按照上面的公式,可以发现e是Adjacency matrix的本征矢： $\lambda e=\mathsf{A}e$

还有其他的centrality：

betweeness centrality: how often a node lies on the shortest path between two other nodes
closeness centrality: measures the average shortest path length between a node and all other nodes

The clustering coefficient

–>structural distinction –>proportion of closed triangles in a node’s local neighborhood –>how many pairs of nodes there are in u’s neighborhood
another way of viewing: ego graph: the subgraph containing that node,, its neighbors, and all the edges between nodes in its neighborhood

2.1.2 Graph-level features and graph kernels

先前的讨论都是node-level的，无论是classification problem还是features 和 statistics.
那 graph-level classification怎么办？这一节的大多方法算在graph kernel methods中 graph kernels,Kriege et al. 2020

Bag of nodes simply aggregating nodelevel statistics

The Weisfeiler-Lehman kernel iterative neighborhood aggregation: 通过迭代来包含整个图的信息（neighbour的neighbour的neighbour……）

basic idea:

assign an initial label $l^{(0)}(v)$ to each node. 通常使用digree这个feature: $l^{(0)}(v)=d_v$
iteratively assign a new label to each node by hashing the multiset of the current labels within the node’s neighborhood 上述公式里面的“”表示multi-set(其中可以有重复的数),HASH表示一种映射: HASH:$\mathbb{R}^{|\mathcal{N}|}\rightarrow\mathbb{R}$

? 确认一下neighbourhood包不包含点自身

K次迭代后，$l^{(K)}(v)$代表了K-hop neighbourhood –> feature representation for the graph

这里相当于对graph进行卷积,局域性 & 平移不变性

Graphlets and path-based methods node-level discussion中, 有一种确定graph feature的方法是数某种特征的subgraph的数量, 在此可以叫 graphlet.

the graphlet kernel:

enumerating all possible graph structures of a particular size
counting how many times they occur in the full graph

alternative to enumerating:path-based method

只需要关心graph中不同的paths.
比如：the random walk kernel (2003) involves running random walks over the graph and then counting the occurrence of different degree sequences
比如：shortest-path kernel (2005) involves a similar idea but uses only the shortest-paths between nodes (rather than random walks)

? 没太搞懂这个方法
? degree sequence是啥

2.2 Neighborhood Overlap Detection

回顾上节：

上节讲node/graph-level features.

上节问题：

这些features并不能quantify the relationships between nodes
上面的features对relation prediction任务就没什么大用

本节关心的：

neighborhood overlap between pairs of nodes–>关心一对点之间二者的相关性有多强(pair of nodes)

最简单的方法:$\mathsf{S}[u,v]=	\mathcal{N}(u)\cap\mathcal{N}(v)	$
$\mathsf{S}\in\mathbb{R}^{\mathcal{	V	\times	V	}}$是similarity matrix–>summarizing all the pairwise node statistics

Adjacency matrix 和 similarity matrix之间的关系

2.2.1 Local overlap measures

Sorensen index: 一个$\mathcal{ V \times V }$维的矩阵
Salton index:
Jaccard overlap:
Resource Allocation (RA) index counts the inverse degrees of the common neighbors
Adamic-Adar (AA) index performs a similar computation using the inverse logarithm of the degrees

以上两种 index 相当于给了 low-degree neighbour 更高的权重，因为直觉上，a shared low-degree neighbor is more informative than a shared high-degree one.

2.2.2 Global overlap measures

只考虑local overlap(1-hop)不够：two nodes could have no local overlap in their neighborhoods but still be members of the same community in the graph.

Katz index

根据下面的理论可以改写成这个形式:

$\mathsf{A}^i[u,v]$表示了从u到v，i步的path的条数。

如果$\beta<1$，long path的权重就会降低

Geometric series of matrices

> Theorem 1. *Let X be a real-valued square matrix and let λ1 denote the largest eigenvalue of X.*,，则有：

<br>$\mathsf{(I-X)^{-1}}=\sum\limits_{i=0}^{\infin}\mathsf{X^i}$

<br>当且仅当 λ1 < 1 且 (I-X) 满秩。

Leicht, Holme, and Newman (LHN) similarity

Katz index的问题: strongly biased by node degree, 容易给high-degree nodes更高的similarity score
new method: ratio between the actual number of observed paths and the number of expected paths between two nodes
- E [ A^i ] 的计算 利用 configuration model assumes that we draw a random graph with the same set of degrees as our given graph. –>
  
  其中m是所有图中一共有多少条边。random configuration model下，两个节点间的edge出现的可能性和两节点各自的度数正相关。
  可以这么理解：有d_u条边离开u，每条边有d_v/2m的机会最后到v ? 为什么是d_v/2m
  
  $v_1$–>u–>$v_2$ $\frac{d_{v_1}d_{v_2}}{2m}\cdot\frac{d_{v_2}(d_{u}-1)}{2m}$
- 问题： 3-hop以上就很难算
- 3-hop及以上的问题：
  - the largest eigenvalue can be used to approximate the growth in the number of paths.
  其中p_i是节点u跟其他节点间长为i的路径条数。
  
  为什么p_i会收敛到eigenvector?
  
  D是表示节点数量的对角阵