Introduction

We discuss the rationale behind DeepSNAP, introduce its core modules, and show example implementations.

Background

We first explain some preliminaries for learning on graphs.

We classify the learning tasks into the following categories, all of which are fully supported by DeepSNAP. Both classification and regression objectives can be applied to all of the tasks.

  • node: Node-level tasks make predictions on node labels. The prediction for each node is made based on the node embedding output by a GNN.

  • edge: Edge-level tasks make predictions on edge labels. The prediction for each edge is made based on the pair of node embeddings corresponding to the endpoints of the edge.

  • link_pred: Link prediction tasks make predictions on the existence of links (edges). The difference from edge-level tasks is that the model not only predicts the edge label, but also has to decide whether the edge exists at all. Negative sampling can be used here, so the model learns to predict the non-existence of an edge between two nodes. In the simplest version, without edge labels, the task becomes binary prediction, where 1 corresponds to the existence of an edge and 0 otherwise.

  • graph: Graph-level tasks make predictions on graph labels. The prediction for each graph is made based on a graph embedding pooled from node embeddings. Naive pooling simply sums or averages the embeddings of all nodes in the graph; a minimal sketch is shown below. See PyTorch Geometric for more pooling options.
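
To make the naive pooling concrete, here is a minimal mean-pooling sketch; node_emb and batch are hypothetical names, where batch[i] gives the graph index of node i, following the PyTorch Geometric batching convention:

import torch

def mean_pool(node_emb, batch, num_graphs):
    # Sum the node embeddings belonging to each graph, then divide by
    # the node count of that graph (batch must be a LongTensor).
    graph_emb = torch.zeros(num_graphs, node_emb.size(-1))
    graph_emb.index_add_(0, batch, node_emb)
    counts = torch.bincount(batch, minlength=num_graphs).clamp(min=1)
    return graph_emb / counts.unsqueeze(-1)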

At the dataset level, for each type of task, there are 2 possible types of splits, which DeepSNAP fully supports:

  • train / val: 2 splits, including training set and validation set. E.g., split_ratio = [0.8, 0.2]

  • train / val / test: 3 splits, including training, validation and test set. E.g., split_ratio = [0.8, 0.1, 0.1]

Additionally, a type of split can be either transductive or inductive:

  • transductive: training, validation and test splits include all the graph(s) in the dataset. Within each graph, node or edge labels are split depending on the task.

  • inductive (only possible with multi-graph datasets): training, validation and test splits include distinct graphs. Within each training graph, all labels are observed; within each validation / test graph, no labels are observed.

Moreover, all splits performed in DeepSNAP are “secure splits”: as long as there are enough objects to split, every split is guaranteed to contain at least one object. (Not having enough objects means a graph has fewer objects to split than the number of splits, e.g., a graph with 2 nodes whose nodes should be split into train / val / test.) For example, consider a graph with 5 edges that we would like to split into train / val / test with split_ratio = [0.8, 0.1, 0.1]. Without the “secure split”, the numbers of edges in the splits would be 4, 0 and 1, so one split would receive 0 objects. With the “secure split”, we first hold out 2 or 3 objects (one per split, depending on whether the graph is split into 2 or 3 parts) and apply the same splitting logic to the remaining objects; in this case the resulting numbers of edges are 2, 1 and 2. A minimal sketch of this logic follows.
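
The following is a minimal illustration of the secure-split idea (a sketch, not DeepSNAP's internal implementation):

def secure_split_sizes(num_objects, split_ratio):
    # Hold out one object per split so that no split is empty, then
    # divide the remaining objects according to split_ratio.
    num_splits = len(split_ratio)
    remaining = num_objects - num_splits
    sizes = [1 + int(remaining * ratio) for ratio in split_ratio[:-1]]
    sizes.append(num_objects - sum(sizes))  # last split takes the rest
    return sizes

print(secure_split_sizes(5, [0.8, 0.1, 0.1]))
>>> [2, 1, 2]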

  • minimum_node_per_graph: filters out graphs without enough objects to split; all graphs imported into deepsnap.dataset.GraphDataset with fewer nodes than minimum_node_per_graph are automatically removed. If minimum_node_per_graph is not specified by the user, it defaults to 5. A brief example is shown below.
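
For example, to automatically drop all graphs with fewer than 10 nodes at dataset creation (GraphDataset and the ENZYMES loading pattern are introduced in detail later in this section; the threshold 10 is just an illustration):

from deepsnap.dataset import GraphDataset
from torch_geometric.datasets import TUDataset

# Graphs with fewer than 10 nodes are removed when the dataset is built.
pyg_dataset = TUDataset('./enzymes', 'ENZYMES')
graphs = GraphDataset.pyg_to_graphs(pyg_dataset)
dataset = GraphDataset(graphs, task='graph', minimum_node_per_graph=10)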

DeepSNAP Graph

The deepsnap.graph.Graph class is responsible for manipulating a graph object for training GNNs. The most important functionalities of the Graph object include

  • Splitting a graph into train, validation and test splits (in the transductive setting) and performing negative sampling for the link prediction task.

  • Applying a user-defined transform function, while ensuring that the graph backend stays in sync with the tensor representation of graphs used for GNNs.

The first way to create a DeepSNAP deepsnap.graph.Graph is to load it from a NetworkX graph object. The following example creates a complete graph using NetworkX.

import networkx as nx
from deepsnap.graph import Graph

G = nx.complete_graph(100)
H = Graph(G)
print(H)
>>> Graph(G=[], edge_index=[2, 9900], edge_label_index=[2, 9900], node_label_index=[100])

Users can also create a deepsnap.graph.Graph from the PyTorch Geometric data format directly.

from deepsnap.graph import Graph
from torch_geometric.datasets import Planetoid

pyg_dataset = Planetoid('./cora', 'Cora')
graph = Graph.pyg_to_graph(pyg_dataset[0])
print(graph)
>>> Graph(G=[], edge_index=[2, 10556], edge_label_index=[2, 10556], node_feature=[2708, 1433], node_label=[2708], node_label_index=[2708])

When creating a DeepSNAP graph, any NetworkX attribute beginning with node_, edge_ or graph_ will be loaded automatically. When loading from PyTorch Geometric, attributes are automatically renamed to follow our naming taxonomy. Important attributes are listed below:

  • Graph.node_feature: Node features.

  • Graph.node_label: Node labels.

  • Graph.edge_feature: Edge features.

  • Graph.edge_label: Edge labels.

  • Graph.graph_feature: Graph features.

  • Graph.graph_label: Graph labels.

After loading these features, DeepSNAP Graph creates the indices that are necessary for GNN computation or for indicating the dataset split. Important indices are listed below:

  • Graph.edge_index: Edge index that guides GNN message passing.

  • Graph.node_label_index: Index for slicing node labels to get the corresponding split: G.node_label[G.node_label_index].

  • Graph.edge_label_index: Index for slicing edge labels to get the corresponding split: G.edge_label[G.edge_label_index], as shown in the snippet after this list.
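
Concretely, for any (split) DeepSNAP graph G, the labels belonging to the current split are recovered by slicing with the matching index tensor:

# Sketch: recover the labels of the current split by indexing with the
# corresponding *_label_index (G is any split DeepSNAP graph).
split_node_labels = G.node_label[G.node_label_index]
split_edge_labels = G.edge_label[G.edge_label_index]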

The following is an example of creating a DeepSNAP graph object with node features; we store the node features in the NetworkX graph under the attribute name node_feature.

import torch
import networkx as nx
from deepsnap.graph import Graph

G = nx.Graph()
G.add_node(0, node_feature=torch.tensor([1,2,3]))
G.add_node(1, node_feature=torch.tensor([4,5,6]))
G.add_edge(0, 1)
H = Graph(G)
print(H.node_feature)
>>> tensor([[1, 2, 3],
            [4, 5, 6]])

Here is another example that transforms a DeepSNAP graph by adding the clustering coefficient to the graph object:

import networkx as nx
from deepsnap.graph import Graph
from torch_geometric.datasets import Planetoid

def clustering_func(graph):
    clustering = list(nx.clustering(graph.G).values())
    graph['node_clustering'] = clustering

pyg_dataset = Planetoid('./cora', 'Cora')
graph = Graph.pyg_to_graph(pyg_dataset[0])
graph.apply_transform(clustering_func, update_graph=True, update_tensor=False)
print(graph)
print(graph.G.nodes(data=True)[0])
>>> Graph(G=[], edge_index=[2, 10556], edge_label_index=[2, 10556], node_clustering=[2708], node_feature=[2708, 1433], node_label=[2708], node_label_index=[2708])
>>> {'node_feature': tensor([0., 0., 0.,  ..., 0., 0., 0.]), 'node_label': tensor(3), 'node_clustering': 0.3333333333333333}

DeepSNAP Dataset

The deepsnap.dataset.GraphDataset class holds and manipulates a set of DeepSNAP graphs used for training, validation and / or testing. The most important functionalities of the GraphDataset object include

  • Loading standard fixed splits, if available.

  • Randomly splitting a dataset, transductively or inductively, into training, validation and test DeepSNAP datasets.

  • Applying a user-defined transform function, while ensuring that the graph backend stays in sync with the tensor representation of graphs used for GNNs.

Dataset splitting encompasses the following design choices:

  • inductive vs transductive: The inductive setting (for datasets with multiple graphs) splits the dataset by graphs. Distinct sets of graphs are used for training, validation and test, and the test graphs are never seen during training. This can be done for node, edge and graph-level tasks. In the transductive setting, all graphs are seen at training time, but the labels of certain nodes and edges are not observed at training time and are used for validation and test. This applies to node and edge-level tasks.

  • Negative sampling is available for link prediction in DeepSNAP, since link prediction is typically an imbalanced task due to the sparsity of graphs. DeepSNAP lets the user specify the ratio of positive to negative links for training, validation and test, as well as when to resample negative links during training.

  • Disjoint objective (supervision) sampling for link prediction is an important technique often not mentioned in research papers. At training time, it further splits the training set into edges used for message passing and edges used for the link prediction objective. The rationale is to allow the model to learn to predict unseen edges, instead of memorizing all training edges at training time and failing to generalize to unseen edges at validation and test time. DeepSNAP also supports disjoint objectives and resampling of the disjoint objectives at training time; a configuration sketch is shown below.
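
For example, a link prediction dataset with disjoint supervision edges and a custom negative sampling ratio can be configured roughly as follows. The keyword arguments edge_train_mode, edge_message_ratio and edge_negative_sampling_ratio reflect our reading of the GraphDataset API; consult the API reference for the authoritative signature.

from deepsnap.dataset import GraphDataset
from torch_geometric.datasets import TUDataset

pyg_dataset = TUDataset('./enzymes', 'ENZYMES')
graphs = GraphDataset.pyg_to_graphs(pyg_dataset)
dataset = GraphDataset(
    graphs,
    task='link_pred',
    edge_train_mode='disjoint',        # split training edges into message / supervision edges
    edge_message_ratio=0.8,            # fraction of training edges used for message passing
    edge_negative_sampling_ratio=1.0,  # one negative edge sampled per positive edge
)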

It is convenient to create a DeepSNAP dataset from a list of DeepSNAP graphs.

import networkx as nx
from deepsnap.graph import Graph
from deepsnap.dataset import GraphDataset

G = nx.complete_graph(100)
H1 = Graph(G)
H2 = H1.clone()
dataset = GraphDataset(graphs=[H1, H2])
len(dataset)
>>> 2

DeepSNAP also supports creating the dataset from the PyTorch Geometric datasets directly.

from deepsnap.dataset import GraphDataset
from torch_geometric.datasets import TUDataset

pyg_dataset = TUDataset('./enzymes', 'ENZYMES')
graphs = GraphDataset.pyg_to_graphs(pyg_dataset)
dataset = GraphDataset(graphs, task="graph", minimum_node_per_graph=0)
print(dataset)
>>> GraphDataset(600)

With deepsnap.dataset.GraphDataset, the user can specify the desired task, and DeepSNAP will behave according to the specified task. The tasks include:

  • node: Node classification.

  • edge: Edge classification.

  • link_pred: Link prediction.

  • graph: Graph classification.

The following is an example that splits into training, validation and test sets with respect to the node (node classification) task.

import torch
import networkx as nx
from deepsnap.graph import Graph
from deepsnap.dataset import GraphDataset

G = nx.complete_graph(100)
Graph.add_node_attr(G, 'node_feature', torch.zeros([100, 1]))
Graph.add_node_attr(G, 'node_label', torch.zeros([100, 1]))
H1 = Graph(G)
H2 = H1.clone()
dataset = GraphDataset(graphs=[H1, H2], task='node')

train, val, test = dataset.split(transductive=True, split_ratio=[0.8, 0.1, 0.1])
print(train, val, test)
>>> GraphDataset(2) GraphDataset(2) GraphDataset(2)
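
The effect of the transductive split is visible through each graph's node_label_index; the exact counts follow the “secure split” logic described earlier (roughly 80 / 10 / 10 of the 100 nodes per graph):

print(train[0].node_label_index.shape[0],
      val[0].node_label_index.shape[0],
      test[0].node_label_index.shape[0])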

Notice that the user can also specify whether the learning is transductive. In the example above, the nodes in each graph are split into train, validation and test sets with respect to the split_ratio of 8:1:1. If transductive is False, the dataset is split as follows:

from deepsnap.dataset import GraphDataset
from torch_geometric.datasets import TUDataset

pyg_dataset = TUDataset('./enzymes', 'ENZYMES')
graphs = GraphDataset.pyg_to_graphs(pyg_dataset)
dataset = GraphDataset(graphs, task="graph", minimum_node_per_graph=0)
train, val, test = dataset.split(
            transductive=False, split_ratio = [0.8, 0.1, 0.1])
print(train, val, test)
>>> GraphDataset(480) GraphDataset(60) GraphDataset(60)

It is also possible to transform the dataset directly. Here is an example for transforming a DeepSNAP dataset:

import networkx as nx
from deepsnap.dataset import GraphDataset
from torch_geometric.datasets import TUDataset

def clustering_func(graph):
    clustering = list(nx.clustering(graph.G).values())
    graph['node_clustering'] = clustering

pyg_dataset = TUDataset('./enzymes', 'ENZYMES')
graphs = GraphDataset.pyg_to_graphs(pyg_dataset)
dataset = GraphDataset(graphs, task='graph', minimum_node_per_graph=0)
dataset.apply_transform(clustering_func, update_graph=True, update_tensor=False)
print(dataset)
print(dataset[0])
>>> GraphDataset(600)
>>> Graph(G=[], edge_index=[2, 168], edge_label_index=[2, 168], graph_label=[1], node_clustering=[37], node_feature=[37, 3], node_label_index=[37])

DeepSNAP Batch

The main purpose of deepsnap.batch.Batch is to collate() the dataset so that it can easily be used with torch.utils.data.DataLoader. The following example collates the train dataset into batches with 10 graphs in each batch.

import networkx as nx
from deepsnap.batch import Batch
from deepsnap.dataset import GraphDataset
from torch_geometric.datasets import TUDataset
from torch.utils.data import DataLoader

def clustering_func(graph):
    clustering = list(nx.clustering(graph.G).values())
    graph['node_clustering'] = clustering

pyg_dataset = TUDataset('./enzymes', 'ENZYMES')
graphs = GraphDataset.pyg_to_graphs(pyg_dataset)
dataset = GraphDataset(graphs, task='graph', minimum_node_per_graph=0)
train, val, test = dataset.split(
            transductive=False, split_ratio = [0.8, 0.1, 0.1])
train_loader = DataLoader(train, collate_fn=Batch.collate(), batch_size=10, shuffle=True)
batch = next(iter(train_loader))
batch = batch.apply_transform(clustering_func, update_graph=True, update_tensor=False)
print(batch)
>>> Batch(G=[10], batch=[266], edge_index=[2, 1064], edge_label_index=[2, 1064], graph_label=[10], node_clustering=[10], node_feature=[266, 3], node_label_index=[266])

Here is another example to transform a DeepSNAP Batch by adding the clustering coefficient to the node_feature:

import torch
import networkx as nx
from deepsnap.batch import Batch
from deepsnap.dataset import GraphDataset
from torch_geometric.datasets import TUDataset
from torch.utils.data import DataLoader

def clustering_func(graph):
    clustering = torch.tensor(list(nx.clustering(graph.G).values()))
    clustering = clustering.view(-1, 1)
    graph.node_feature = torch.cat([graph.node_feature, clustering], dim=1)

pyg_dataset = TUDataset('./enzymes', 'ENZYMES')
graphs = GraphDataset.pyg_to_graphs(pyg_dataset)
dataset = GraphDataset(graphs, task='graph', minimum_node_per_graph=0)
train, val, test = dataset.split(
            transductive=False, split_ratio = [0.8, 0.1, 0.1])
train_loader = DataLoader(train, collate_fn=Batch.collate(), batch_size=10, shuffle=True)
batch = next(iter(train_loader))
batch = batch.apply_transform(clustering_func, update_graph=True, update_tensor=False)
print(batch)
>>> Batch(G=[10], batch=[411], edge_index=[2, 1378], edge_label_index=[2, 1378], graph_label=[10], node_feature=[411, 4], node_label_index=[411])
print(nx.get_node_attributes(batch.G[0], 'node_feature')[0].shape[0])
>>> 4

To get a better understanding of using DeepSNAP with homogeneous graphs, we recommend looking at the DeepSNAP examples, or see our Colab Notebooks.

DeepSNAP Heterogeneous Graph

DeepSNAP provides the deepsnap.hetero_graph.HeteroGraph class for heterogeneous graphs. The main idea is similar to the DeepSNAP Graph class, but deepsnap.hetero_graph.HeteroGraph adds some extra properties for heterogeneous graphs, and functions in the class are overridden for the heterogeneous case.

The first way to create a DeepSNAP deepsnap.hetero_graph.HeteroGraph is to load it from a NetworkX graph object. The following example creates a simple HeteroGraph object using NetworkX.

import torch
import networkx as nx
from deepsnap.hetero_graph import HeteroGraph

G = nx.DiGraph()
G.add_node(0, node_type='n1', node_label=1, node_feature=torch.Tensor([0.1, 0.2, 0.3]))
G.add_node(1, node_type='n1', node_label=0, node_feature=torch.Tensor([0.2, 0.3, 0.4]))
G.add_node(2, node_type='n2', node_label=1, node_feature=torch.Tensor([0.3, 0.4, 0.5]))
G.add_edge(0, 1, edge_type='e1')
G.add_edge(0, 2, edge_type='e1')
G.add_edge(1, 2, edge_type='e2')
H = HeteroGraph(G)
for hetero_feature in H:
    print(hetero_feature)

>>> ('G', <networkx.classes.digraph.DiGraph object at 0x103642370>)
    ('edge_index', {('n1', 'e1', 'n1'): tensor([[0],
            [1]]), ('n1', 'e1', 'n2'): tensor([[0],
            [0]]), ('n1', 'e2', 'n2'): tensor([[1],
            [0]])})
    ('edge_label_index', {('n1', 'e1', 'n1'): tensor([[0],
            [1]]), ('n1', 'e1', 'n2'): tensor([[0],
            [0]]), ('n1', 'e2', 'n2'): tensor([[1],
            [0]])})
    ('edge_to_graph_mapping', {('n1', 'e1', 'n1'): tensor([0]), ('n1', 'e1', 'n2'): tensor([1]), ('n1', 'e2', 'n2'): tensor([2])})
    ('edge_to_tensor_mapping', tensor([0, 0, 0]))
    ('edge_type', {('n1', 'e1', 'n1'): ['e1'], ('n1', 'e1', 'n2'): ['e1'], ('n1', 'e2', 'n2'): ['e2']})
    ('node_feature', {'n1': tensor([[0.1000, 0.2000, 0.3000],
            [0.2000, 0.3000, 0.4000]]), 'n2': tensor([[0.3000, 0.4000, 0.5000]])})
    ('node_label', {'n1': tensor([1, 0]), 'n2': tensor([1])})
    ('node_label_index', {'n1': tensor([0, 1]), 'n2': tensor([0])})
    ('node_to_graph_mapping', {'n1': tensor([0, 1]), 'n2': tensor([2])})
    ('node_to_tensor_mapping', tensor([0, 1, 0]))
    ('node_type', {'n1': ['n1', 'n1'], 'n2': ['n2']})

Users can also create a deepsnap.hetero_graph.HeteroGraph from the PyTorch Geometric data format directly, in a similar manner to the homogeneous graph case.

When creating a DeepSNAP heterogeneous graph, any NetworkX attribute beginning with node_, edge_ or graph_ will be loaded automatically. Important attributes are listed below:

  • HeteroGraph.node_feature: Node features.

  • HeteroGraph.node_label: Node labels.

  • HeteroGraph.edge_feature: Edge features.

  • HeteroGraph.edge_label: Edge labels.

  • HeteroGraph.graph_feature: Graph features.

  • HeteroGraph.graph_label: Graph labels.

After loading these features, DeepSNAP HeteroGraph creates the indices that are necessary for GNN computation or for indicating the dataset split. Important indices are listed below:

  • HeteroGraph.edge_index: Edge index that guides GNN message passing.

  • HeteroGraph.node_label_index: Index for slicing node labels to get the corresponding split: G.node_label[G.node_label_index].

  • HeteroGraph.edge_label_index: Index for slicing edge labels to get the corresponding split: G.edge_label[G.edge_label_index].

Similar to the homogeneous graph, the HeteroGraph also includes a NetworkX backend graph object for applying transform functions. Note that the node type of each node has to be specified as a node property node_type in the NetworkX graph object. Similarly, the edge type of each edge has to be specified as an edge property edge_type in the NetworkX graph object. deepsnap.hetero_graph.HeteroGraph stores some of its data in a dict format. For example, HeteroGraph.node_feature is a dictionary with node_type values as keys and the node feature tensors of each node_type as values. Similarly, HeteroGraph.edge_feature is a dictionary with edge_type values as keys and the edge features of each edge_type as values.

The heterogeneous GNN framework is fully general and supports heterogeneity of both nodes and edges. It defines the concept of message_types, which are tuples in the format (start_node_type, edge_type, end_node_type). A single node / edge type is used if there is only one type of node or edge. The messages for different message types can be parameterized by different weights or even different message passing models. For example, HeteroGraph.edge_index and HeteroGraph.edge_label_index are dictionaries with message_types as keys and torch.Tensor edge indices of each message_type as values.
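
For example, using the HeteroGraph H constructed above, per-type data is accessed with node_type or message_type keys:

print(H.node_feature['n1'].shape)
>>> torch.Size([2, 3])
print(H.edge_index[('n1', 'e1', 'n2')])
>>> tensor([[0],
            [0]])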

Dataset splitting for heterogeneous graphs encompasses the following additional design choices:

  • split_types is a heterogeneous-graph-specific parameter that lets the user specify which types to split for the specified task. More specifically, for the node task split_types can be either a node_type or a list of node_type values, and for the edge task and the link prediction task it can be either a message_type or a list of message_type values. If split_types is not specified in the split function, the default behavior is to include all types corresponding to the task; a usage sketch is shown after this list.

  • edge_split_mode is a heterogeneous-graph-specific parameter that lets the user decide whether to spend extra computation so that the edges of each message_type also respect the split_ratio. edge_split_mode can be set to either exact or approximate. With exact, for the link prediction task the edges of each message_type are split exactly according to the split_ratio. With approximate, the total number of edges is still split exactly according to the split_ratio, but this ratio might not hold for the edges within each individual message_type. If edge_split_mode is not specified at initialization, the default behavior is exact. Additionally, when split_types includes all types of objects for the corresponding task, setting edge_split_mode to approximate can give the user some performance gain.
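
As an illustration, the following sketch splits only the nodes of type 'n1' in the HeteroGraph H built earlier (a two-way split, since H has only two nodes of type 'n1'; minimum_node_per_graph=0 keeps the tiny example graph from being filtered out):

from deepsnap.dataset import GraphDataset

dataset = GraphDataset([H], task='node', minimum_node_per_graph=0)
train, val = dataset.split(
    transductive=True,
    split_ratio=[0.8, 0.2],
    split_types=['n1'],  # only nodes of type 'n1' are split; other node types are untouched
)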

DeepSNAP Heterogeneous GNN

The heterogeneous GNN layer is a PyTorch nn.Module that supports easy creation of heterogeneous GNNs, building on top of PyTorch Geometric. Users can easily specify the message passing model for each message type. The message passing models are straightforward adaptations of PyTorch Geometric homogeneous models (such as GraphSAGE, GCN, GIN). In a future release, we will provide even easier utilities to create such heterogeneous message passing models.

An example GNN layer for heterogeneous graph is deepsnap.hetero_gnn.HeteroSAGEConv.

The module deepsnap.hetero_gnn.HeteroConv allows heterogeneous message passing for all message types to be performed on a heterogeneous graph, which acts like a wrapper layer.

There are also some useful functions for heterogeneous GNNs, such as deepsnap.hetero_gnn.forward_op() and deepsnap.hetero_gnn.loss_op(), which are helpful for building heterogeneous GNN models; a sketch of the per-type dispatch pattern follows.
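
Conceptually, forward_op() applies a dictionary of modules to a dictionary of per-type tensors, key by key. The following is a plain-Python illustration of that dispatch pattern (a sketch of the idea, not DeepSNAP's implementation):

import torch
import torch.nn as nn

def per_type_forward(x_dict, module_dict):
    # Apply the module registered for each type to that type's tensor.
    return {key: module_dict[key](x) for key, x in x_dict.items()}

linears = nn.ModuleDict({'n1': nn.Linear(3, 8), 'n2': nn.Linear(3, 8)})
node_feature = {'n1': torch.randn(2, 3), 'n2': torch.randn(1, 3)}
out = per_type_forward(node_feature, linears)
print(out['n1'].shape)
>>> torch.Size([2, 8])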

For more details on deepsnap.hetero_graph.HeteroGraph, please see the DeepSNAP examples for heterogeneous graphs, or see our Colab Notebooks.