# Understanding graph convolutional networks and their PyTorch implementation, in simple terms

01

The big picture of graph neural networks

After reading about graph neural networks for a long time, I have gradually gone from ignorance to some understanding. Today I want to share some thoughts on feature aggregation in graph neural networks such as GCN, GraphSAGE and GAT: on the one hand to help more people grasp the essence of graph neural networks, and on the other hand to consolidate my own understanding.

02

What is a graph, and what are its characteristics? As a special data structure, a graph lives in a non-Euclidean space: 1) the local input dimension is variable, meaning different nodes have different numbers of neighbors; 2) the arrangement is unordered, meaning there are only connection relationships between nodes, with no inherent order. Even when nodes have the same number of neighbors, the neighbors must be sorted by some rule, such as degree, before each node's first-order neighborhood representation becomes unique. Unlike images and natural language, which have regular arrangements and can be processed effectively by CNNs or RNNs, graph-structured data generally cannot be handled well by CNNs or RNNs, or cannot be handled by them at all.

03

How is the graph represented?

For graphs, we are used to the notation G = (V, E), where V is the set of nodes in the graph and E is the set of edges. Note that the number of nodes in the graph is N. There are three important matrices:

• Adjacency matrix A (N×N): represents the connection relationships between nodes; here we assume it is a 0-1 matrix;
• Degree matrix D (N×N): the degree of each node is the number of nodes it is connected to. D is a diagonal matrix whose diagonal elements are D_ii = Σ_j A_ij;
• Feature matrix X (N×F): represents node features, where F is the feature dimension; features are continuous, low-dimensional, real-valued distributed representations. The feature matrix is usually obtained via FM matrix factorization, DeepWalk random walks, or supervised and semi-supervised learning with graph neural networks.

The mathematical representation is relatively abstract, so here is an example of an adjacency matrix:

04
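The original example figure is not reproduced here; as a minimal stand-in, consider a 3-node path graph 0–1–2 (the variable names are illustrative):

```python
import torch

# Path graph 0 - 1 - 2: node 1 is connected to both node 0 and node 2.
A = torch.tensor([[0., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 0.]])

# Degree matrix D: diagonal, with D[i, i] = number of neighbors of node i.
D = torch.diag(A.sum(dim=1))

print(D.diagonal())  # tensor([1., 2., 1.])
```

Note that A is symmetric because the graph is undirected, and D collapses each row of A into a single count on the diagonal.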

How are graph nodes represented?

In fact, we can divide the learning process above into three steps:

• transform: transform the current node features, here via matrix multiplication (XW), where X is the node representation and W is the model weight;
• aggregate: aggregate the features of neighboring nodes through the adjacency matrix to obtain new features for the node. Depending on whether the aggregation weights are learnable, this divides into GCN and GAT: GCN directly uses the (normalized) adjacency matrix as weights to aggregate the neighbor nodes together with the node itself as the current node's representation, while GAT learns a weight for the importance of each neighbor to the current node and obtains the representation by weighted summation. Depending on whether neighbors are sampled before aggregation, we also distinguish full-graph convolution from local graph convolution;
• activate: apply an activation function to add nonlinearity.
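The three steps above can be sketched as a single function; this is a minimal illustration, assuming Â is an already-normalized adjacency matrix (the function and variable names are illustrative, not from any library):

```python
import torch
import torch.nn.functional as F

def gcn_layer(A_hat, X, W):
    """One graph convolution layer: activate(aggregate(transform(X)))."""
    support = X @ W               # transform: linear projection of node features
    aggregated = A_hat @ support  # aggregate: weighted sum over neighbors
    return F.relu(aggregated)     # activate: add nonlinearity

# Toy check: 3 nodes, 2 input features, 4 output features.
A_hat = torch.eye(3)   # identity "graph": no mixing between nodes
X = torch.randn(3, 2)
W = torch.randn(2, 4)
H = gcn_layer(A_hat, X, W)
print(H.shape)  # torch.Size([3, 4])
```

With the identity matrix as Â, each node only sees itself; a real normalized adjacency matrix mixes in neighbor features at the aggregate step.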

Next, we introduce the idea and implementation of feature aggregation in GCN, GraphSAGE and GAT.

Full-graph convolutional neural network GCN

We don't need to worry too much about why GCN is called GCN. We just need to know that it aggregates the current node's features with its first-order neighbors' features into a new feature that characterizes the current node. This introduces the context of each graph node, so that each node's prediction is influenced by its neighbors. How is the aggregation done? Since the adjacency matrix represents the connection relationships between nodes, we can adjust it slightly: add the identity matrix so that each node is also connected to itself, then normalize it. Each row then represents the connection weights between one node and all other nodes, and serves as the feature-aggregation weights for that node.

We can also regard the adjusted adjacency matrix as a MASK: positions where a row's weight is 0 have no impact on the current node's aggregated features. Because these weights are determined entirely by the graph's connectivity and are [fixed] and [non-learnable], this feature-aggregation network based on the [whole graph] structure is GCN, which belongs to transductive learning.

Here are two different aggregation methods.

Aggregation method 1:

The context representation vector of the current node is obtained by aggregating the feature vector of the current node with the feature vectors of its first-order neighbors via [sum then average]. This method only considers the connections of the current node, not those of its first-order neighbors. As a result, the greater a node's degree, the more often its features are aggregated into other nodes; but the greater the degree, the weaker the node's distinctiveness, so its influence should actually be weakened. The concrete implementation is:

H = σ(D̃⁻¹ Ã X W)

In this formula:

• Ã = A + I, where I is the identity matrix
• D̃ is the degree matrix of Ã
• X is the feature matrix of the current layer
• W is the weight matrix of the neural network model
• σ is a nonlinear activation function

Aggregation method 2:

We obtain the context representation vector of the current node by aggregating the feature vector of the current node with those of its first-order neighbors via [weighted summation]. To address the shortcoming of method 1, the aggregation weight of each node here depends not only on its own number of connections but also on the number of connections [degree] of its first-order neighbors: the greater a node's degree, the more universal and the less important it is. A simple analogy: in an article, the more often a word appears, the lower its importance, e.g. the function words [的, 得, 地] in Chinese, which appear most often. Nodes behave similarly, even if this seems a bit counter-intuitive. The concrete implementation is:

H = σ(D̃^(-1/2) Ã D̃^(-1/2) X W)

In this formula:

• Ã = A + I, where I is the identity matrix
• D̃ is the degree matrix of Ã
• X is the feature matrix of the current layer
• W is the weight matrix of the neural network model
• σ is a nonlinear activation function
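The two normalizations can be compared on the small path graph from earlier; this is a minimal sketch, with illustrative variable names:

```python
import torch

A = torch.tensor([[0., 1., 0.],
                  [1., 0., 1.],
                  [0., 1., 0.]])
A_tilde = A + torch.eye(3)   # add self-loops: A~ = A + I
deg = A_tilde.sum(dim=1)     # degrees including self-loops: [2., 3., 2.]

# Method 1: row normalization D~^-1 A~ -- each row sums to 1,
# so aggregation is a plain average over self + neighbors.
norm1 = torch.diag(1.0 / deg) @ A_tilde

# Method 2: symmetric normalization D~^-1/2 A~ D~^-1/2 --
# each weight is also shrunk by the *neighbor's* degree,
# down-weighting high-degree (less distinctive) neighbors.
d_inv_sqrt = torch.diag(deg.pow(-0.5))
norm2 = d_inv_sqrt @ A_tilde @ d_inv_sqrt

print(norm1.sum(dim=1))  # tensor([1., 1., 1.])
```

Unlike norm1, the rows of norm2 generally do not sum to 1, but norm2 is symmetric, which is convenient for spectral analysis.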

main features

1) It requires the adjacency matrix of the whole graph and trains on the full graph; with limited computing resources it cannot scale to model training and prediction on large graph structures

2) It belongs to transductive learning and cannot be applied to unseen nodes or new graph structures

3) The number of layers determines the size of each node's receptive field: an n-layer GCN aggregates information from n-order (n-hop) neighbors

PyTorch implementation of GCN

Reference paper: Semi-Supervised Classification with Graph Convolutional Networks

https://arxiv.org/abs/1609.02907

import torch
import torch.nn as nn
import torch.nn.init as init

class GraphConvolution(nn.Module):
    def __init__(self, input_dim, output_dim, use_bias=True):
        """Graph convolution: L*X*\theta
        Args:
        ----------
        input_dim: int
            Dimension of node input features
        output_dim: int
            Dimension of output features
        use_bias : bool, optional
            Whether to use a bias term
        """
        super(GraphConvolution, self).__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.use_bias = use_bias
        self.weight = nn.Parameter(torch.Tensor(input_dim, output_dim))
        if self.use_bias:
            self.bias = nn.Parameter(torch.Tensor(output_dim))
        else:
            self.register_parameter('bias', None)
        self.reset_parameters()

    def reset_parameters(self):
        init.kaiming_uniform_(self.weight)
        if self.use_bias:
            init.zeros_(self.bias)

    def forward(self, adjacency, input_feature):
        """The adjacency matrix is sparse, so sparse matrix
        multiplication is used in the computation
        Args:
        -------
        adjacency: torch.sparse.Tensor
            Normalized sparse adjacency matrix
        input_feature: torch.Tensor
            Input features
        """
        support = torch.mm(input_feature, self.weight)
        output = torch.sparse.mm(adjacency, support)
        if self.use_bias:
            output += self.bias
        return output

    def __repr__(self):
        return self.__class__.__name__ + ' (' \
            + str(self.input_dim) + ' -> ' \
            + str(self.output_dim) + ')'

# Model definition
# Readers can modify the GCN model structure and experiment by themselves
import torch.nn.functional as F

class GcnNet(nn.Module):
    """
    A model with two GraphConvolution layers
    """
    def __init__(self, input_dim=1433):
        super(GcnNet, self).__init__()
        self.gcn1 = GraphConvolution(input_dim, 16)
        self.gcn2 = GraphConvolution(16, 7)

    def forward(self, adjacency, feature):
        h = F.relu(self.gcn1(adjacency, feature))
        logits = self.gcn2(adjacency, h)
        return logits

Local graph convolutional neural network GraphSAGE

Considering the shortcomings of GCN, GraphSAGE samples a fixed number of n-order neighbors for each node via local sampling. Each neural network layer aggregates the features of the central node and its sampled neighbors as the central node's new context representation, and different nodes in the same layer share the aggregation parameters.
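The fixed-size sampling step can be sketched as follows; the function name and the dictionary-based graph layout are illustrative assumptions, not taken from the paper's reference code:

```python
import random

def sample_neighbors(adj_dict, node, num_sample):
    """Sample a fixed number of neighbors for a node; sample with
    replacement when the node has fewer neighbors than requested."""
    neighbors = adj_dict[node]
    if len(neighbors) >= num_sample:
        return random.sample(neighbors, num_sample)
    return [random.choice(neighbors) for _ in range(num_sample)]

# Toy graph: node 0 has three neighbors, node 2 has only one.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
print(len(sample_neighbors(adj, 0, 2)))  # 2
print(len(sample_neighbors(adj, 2, 2)))  # 2 (neighbor 0 repeated)
```

Because every node contributes exactly `num_sample` neighbors, the sampled features stack into a regular tensor of shape (num_nodes, num_sample, feature_dim), which is what makes mini-batch training possible.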

main features

1) In this way, the neural network can be trained in mini-batches without taking the whole graph structure as input, so it can be used for training on large-scale graphs

2) The training process does not use data from the test process and can predict unseen nodes; this belongs to inductive learning

PyTorch implementation of GraphSAGE

import torch
import torch.nn as nn
import torch.nn.init as init

class NeighborAggregator(nn.Module):
    def __init__(self, input_dim, output_dim,
                 use_bias=False, aggr_method="mean"):
        """Aggregate node neighbors
        Args:
            input_dim: dimension of input features
            output_dim: dimension of output features
            use_bias: whether to use a bias term (default: {False})
            aggr_method: neighbor aggregation method (default: {mean})
        """
        super(NeighborAggregator, self).__init__()
        self.input_dim = input_dim
        self.output_dim = output_dim
        self.use_bias = use_bias
        self.aggr_method = aggr_method
        self.weight = nn.Parameter(torch.Tensor(input_dim, output_dim))
        if self.use_bias:
            self.bias = nn.Parameter(torch.Tensor(self.output_dim))
        self.reset_parameters()

    def reset_parameters(self):
        init.kaiming_uniform_(self.weight)
        if self.use_bias:
            init.zeros_(self.bias)

    def forward(self, neighbor_feature):
        if self.aggr_method == "mean":
            aggr_neighbor = neighbor_feature.mean(dim=1)
        elif self.aggr_method == "sum":
            aggr_neighbor = neighbor_feature.sum(dim=1)
        elif self.aggr_method == "max":
            # .max(dim=1) returns (values, indices); keep only the values
            aggr_neighbor = neighbor_feature.max(dim=1)[0]
        else:
            raise ValueError("Unknown aggr type, expected sum, max, or mean, but got {}"
                             .format(self.aggr_method))
        neighbor_hidden = torch.matmul(aggr_neighbor, self.weight)
        if self.use_bias:
            neighbor_hidden += self.bias
        return neighbor_hidden

    def extra_repr(self):
        return 'in_features={}, out_features={}, aggr_method={}'.format(
            self.input_dim, self.output_dim, self.aggr_method)

import torch.nn.functional as F

class SageGCN(nn.Module):
    def __init__(self, input_dim, hidden_dim,
                 activation=F.relu,
                 aggr_neighbor_method="mean",
                 aggr_hidden_method="sum"):
        """SageGCN layer definition
        Args:
            input_dim: dimension of input features
            hidden_dim: dimension of hidden-layer features;
                when aggr_hidden_method=sum, the output dimension is hidden_dim,
                when aggr_hidden_method=concat, the output dimension is hidden_dim*2
            activation: activation function
            aggr_neighbor_method: neighbor feature aggregation method, ["mean", "sum", "max"]
            aggr_hidden_method: node feature update method, ["sum", "concat"]
        """
        super(SageGCN, self).__init__()
        assert aggr_neighbor_method in ["mean", "sum", "max"]
        assert aggr_hidden_method in ["sum", "concat"]
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim
        self.aggr_neighbor_method = aggr_neighbor_method
        self.aggr_hidden_method = aggr_hidden_method
        self.activation = activation
        self.aggregator = NeighborAggregator(input_dim, hidden_dim,
                                             aggr_method=aggr_neighbor_method)
        self.dropout = nn.Dropout(0.5)
        self.weight = nn.Parameter(torch.Tensor(input_dim, hidden_dim))
        self.reset_parameters()

    def reset_parameters(self):
        init.kaiming_uniform_(self.weight)

    def forward(self, src_node_features, neighbor_node_features):
        neighbor_hidden = self.aggregator(neighbor_node_features)
        self_hidden = torch.matmul(src_node_features, self.weight)
        # self_hidden = self.dropout(self_hidden)
        if self.aggr_hidden_method == "sum":
            hidden = self_hidden + neighbor_hidden
        elif self.aggr_hidden_method == "concat":
            hidden = torch.cat([self_hidden, neighbor_hidden], dim=1)
        else:
            raise ValueError("Expected sum or concat, got {}"
                             .format(self.aggr_hidden_method))
        if self.activation:
            return self.activation(hidden)
        else:
            return hidden

    def extra_repr(self):
        output_dim = self.hidden_dim if self.aggr_hidden_method == "sum" else self.hidden_dim * 2
        return 'in_features={}, out_features={}, aggr_hidden_method={}'.format(
            self.input_dim, output_dim, self.aggr_hidden_method)

class GraphSage(nn.Module):
    def __init__(self, input_dim, hidden_dim,
                 num_neighbors_list):
        super(GraphSage, self).__init__()
        self.input_dim = input_dim
        self.hidden_dim = hidden_dim  # list of per-layer hidden dimensions
        self.num_neighbors_list = num_neighbors_list
        self.num_layers = len(num_neighbors_list)
        self.gcn = nn.ModuleList()
        self.gcn.append(SageGCN(input_dim, hidden_dim[0]))
        for index in range(0, len(hidden_dim) - 2):
            self.gcn.append(SageGCN(hidden_dim[index], hidden_dim[index + 1]))
        self.gcn.append(SageGCN(hidden_dim[-2], hidden_dim[-1], activation=None))

    def forward(self, node_features_list):
        hidden = node_features_list
        for l in range(self.num_layers):
            next_hidden = []
            gcn = self.gcn[l]
            for hop in range(self.num_layers - l):
                src_node_features = hidden[hop]
                src_node_num = len(src_node_features)
                neighbor_node_features = hidden[hop + 1] \
                    .view((src_node_num, self.num_neighbors_list[hop], -1))
                h = gcn(src_node_features, neighbor_node_features)
                next_hidden.append(h)
            hidden = next_hidden
        return hidden[0]

Graph attention network GAT

In the two methods above, the aggregation weights are not learnable, yet different neighbor nodes should have different degrees of association with the central node. We therefore hope to let the neural network learn the implicit association between nodes and use it as the aggregation weights. Since node degrees are not fixed, it is natural to use the attention mechanism for alignment, with the adjacency matrix serving as a MASK that shields non-neighbor nodes from influencing the aggregation weights.

PyTorch implementation of GAT
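Before the full layer, the core trick of using the adjacency matrix as a MASK over attention scores can be sketched in isolation (the function name is illustrative):

```python
import torch
import torch.nn.functional as F

def masked_attention(scores, adj):
    """Softmax over attention scores, restricted to graph neighbors:
    non-edges are replaced by a large negative value so that their
    softmax weight is (numerically) zero."""
    neg_inf = torch.full_like(scores, -9e15)
    masked = torch.where(adj > 0, scores, neg_inf)
    return F.softmax(masked, dim=1)

adj = torch.tensor([[1., 1., 0.],
                    [1., 1., 1.],
                    [0., 1., 1.]])   # adjacency with self-loops
scores = torch.randn(3, 3)
attn = masked_attention(scores, adj)
print(attn[0, 2])  # ~0: node 2 is not a neighbor of node 0
```

Each row of the result is a valid probability distribution over the node's neighbors only, which is exactly the learnable analogue of the fixed row weights used by GCN.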

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """
    Simple GAT layer, similar to https://arxiv.org/abs/1710.10903
    """
    def __init__(self, in_features, out_features, dropout, alpha, concat=True):
        super(GraphAttentionLayer, self).__init__()
        self.dropout = dropout
        self.in_features = in_features
        self.out_features = out_features
        self.alpha = alpha
        self.concat = concat
        self.W = nn.Parameter(torch.zeros(size=(in_features, out_features)))
        nn.init.xavier_uniform_(self.W.data, gain=1.414)
        self.Q = nn.Parameter(torch.zeros(size=(in_features, out_features)))
        nn.init.xavier_uniform_(self.Q.data, gain=1.414)
        self.V = nn.Parameter(torch.zeros(size=(in_features, out_features)))
        nn.init.xavier_uniform_(self.V.data, gain=1.414)
        self.a = nn.Parameter(torch.zeros(size=(2 * out_features, 1)))
        nn.init.xavier_uniform_(self.a.data, gain=1.414)
        self.leakyrelu = nn.LeakyReLU(self.alpha)

    def forward(self, input, adj):
        h = torch.mm(input, self.W)
        q = torch.mm(input, self.Q)
        v = torch.mm(input, self.V)
        N = h.size()[0]
        a_input = torch.cat([h.repeat(1, N).view(N * N, -1), q.repeat(N, 1)],
                            dim=1).view(N, -1, 2 * self.out_features)
        e = self.leakyrelu(torch.matmul(a_input, self.a).squeeze(2))
        # Mask: non-neighbors get a large negative score, so softmax -> ~0
        zero_vec = -9e15 * torch.ones_like(e)
        attention = torch.where(adj > 0, e, zero_vec)
        attention = F.softmax(attention, dim=1)
        attention = F.dropout(attention, self.dropout, training=self.training)
        h_prime = torch.matmul(attention, v)
        if self.concat:
            return F.elu(h_prime)
        else:
            return h_prime

    def __repr__(self):
        return self.__class__.__name__ + ' (' + str(self.in_features) + ' -> ' + str(self.out_features) + ')'

class GAT(nn.Module):
    def __init__(self, nfeat, nhid, nclass, dropout, alpha, nheads):
        """Dense version of GAT."""
        super(GAT, self).__init__()
        self.dropout = dropout
        self.attentions = [GraphAttentionLayer(nfeat, nhid, dropout=dropout, alpha=alpha, concat=True)
                           for _ in range(nheads)]
        for i, attention in enumerate(self.attentions):
            # Register each head so its parameters are tracked by the module
            self.add_module('attention_{}'.format(i), attention)
        self.out_att = GraphAttentionLayer(nhid * nheads, nclass, dropout=dropout, alpha=alpha, concat=False)

    def forward(self, x, adj):
        x = F.dropout(x, self.dropout, training=self.training)
        x = torch.cat([att(x, adj) for att in self.attentions], dim=1)
        x = F.dropout(x, self.dropout, training=self.training)
        x = F.elu(self.out_att(x, adj))
        return F.log_softmax(x, dim=1)