Introduction¶
We discuss the rationale behind DeepSNAP, introduce its core modules, and show example implementations.
Background¶
We first explain some preliminaries for learning on graphs.
We classify the learning tasks into the following categories, all of which are fully supported by DeepSNAP. Both classification and regression objectives can be applied to each task.
node: Node-level tasks make predictions of labels for nodes. The prediction for each node is made from the node embedding output by a GNN.
edge: Edge-level tasks make predictions of labels for edges. The prediction for each edge is made from the pair of node embeddings corresponding to its endpoints.
link_pred: Link prediction tasks predict the existence of links (edges). The difference from edge-level tasks is that the model not only predicts the edge label but also has to decide whether the edge exists at all. Negative sampling can be used here, so the model learns to predict the non-existence of an edge between two nodes. In the simplest version without edge labels, the task becomes binary prediction, where 1 corresponds to the existence of an edge and 0 otherwise.
graph: Graph-level tasks make predictions of labels for graphs. The prediction for each graph is made from a graph embedding pooled from node embeddings. Naive pooling simply sums or averages the embeddings of all nodes in the graph, as sketched below. See PyTorch Geometric for more pooling options.
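As a minimal sketch of naive graph-level pooling, the snippet below averages node embeddings per graph using PyTorch Geometric's global_mean_pool; the embeddings and batch vector here are illustrative placeholders.
import torch
from torch_geometric.nn import global_mean_pool
# Illustrative node embeddings for a batch of 2 graphs (5 nodes in total).
node_emb = torch.randn(5, 16)
# batch[i] is the index of the graph that node i belongs to.
batch = torch.tensor([0, 0, 0, 1, 1])
# Average the node embeddings within each graph: shape [2, 16].
graph_emb = global_mean_pool(node_emb, batch)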
At the dataset level, for each type of task, DeepSNAP fully supports two types of splits:
train / val: two splits, a training set and a validation set. E.g., split_ratio = [0.8, 0.2]
train / val / test: three splits, a training, a validation, and a test set. E.g., split_ratio = [0.8, 0.1, 0.1]
Additionally, a split can be either transductive or inductive:
transductive: the training, validation, and test splits include all graph(s) in the dataset. Within each graph, node or edge labels are split depending on the task.
inductive (only possible with multi-graph datasets): the training, validation, and test splits include distinct graphs. Within each training graph, all labels are observed; within each validation / test graph, no labels are observed.
Moreover, all splits performed in DeepSNAP are "secure splits": as long as there are enough objects to split, every split is guaranteed to contain at least one object. (A graph has "not enough" objects when it has fewer objects to split than the number of splits, e.g., a graph with 2 nodes whose nodes should be split into train / val / test.) For example, consider a graph with 5 edges to be split into train / val / test with split_ratio = [0.8, 0.1, 0.1]. Without the secure split, the numbers of edges in the splits would be 4, 0, 1, leaving one split with 0 objects. With the secure split, we first hold out 2 or 3 objects (depending on whether the graph is split into 2 or 3 parts), assign one to each split, and apply the ratio-based splitting to the remaining objects; in this case the numbers of edges in the splits become 2, 1, 2, as sketched below.
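The secure split arithmetic can be sketched as follows; this is an illustrative reimplementation of the logic described above, not DeepSNAP's internal code.
def secure_split_counts(num_objects, split_ratio):
    # Illustrative sketch of the "secure split" described above, not
    # DeepSNAP's internal implementation. Hold out one object per split
    # so that every split is non-empty, then split the rest by ratio.
    num_splits = len(split_ratio)
    remaining = num_objects - num_splits
    counts = [1 + int(remaining * ratio) for ratio in split_ratio]
    # Assign any rounding leftover to the last split.
    counts[-1] += num_objects - sum(counts)
    return counts
print(secure_split_counts(5, [0.8, 0.1, 0.1]))
>>> [2, 1, 2]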
minimum_node_per_graph: filters out graphs that are too small to split. All graphs imported into deepsnap.dataset.GraphDataset with fewer nodes than minimum_node_per_graph are automatically removed. If minimum_node_per_graph is not specified by the user, it defaults to 5.
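For example, the filtering can be kept at its default or disabled when constructing the dataset; graphs below stands for a list of DeepSNAP graphs.
from deepsnap.dataset import GraphDataset
# Keep only graphs with at least 5 nodes (the default behavior).
dataset = GraphDataset(graphs, task='node', minimum_node_per_graph=5)
# Pass 0 to disable the filtering entirely.
dataset = GraphDataset(graphs, task='node', minimum_node_per_graph=0)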
DeepSNAP Graph¶
The deepsnap.graph.Graph class is responsible for manipulating a graph object for training GNNs. The most important functionalities of the Graph object include:
Splitting a graph into train, validation, and test splits (in the transductive setting) and performing negative sampling for the link prediction task.
Applying a user-defined transform function while ensuring that the graph backend stays in sync with the tensor representation of graphs used for GNNs.
The first way to create a DeepSNAP deepsnap.graph.Graph is to load it from a NetworkX graph object. The following is an example of creating a complete graph using NetworkX.
import networkx as nx
from deepsnap.graph import Graph
G = nx.complete_graph(100)
H = Graph(G)
print(H)
>>> Graph(G=[], edge_index=[2, 9900], edge_label_index=[2, 9900], node_label_index=[100])
Users can also create a deepsnap.graph.Graph directly from the PyTorch Geometric data format.
from deepsnap.graph import Graph
from torch_geometric.datasets import Planetoid
pyg_dataset = Planetoid('./cora', 'Cora')
graph = Graph.pyg_to_graph(pyg_dataset[0])
>>> Graph(G=[], edge_index=[2, 10556], edge_label_index=[2, 10556], node_feature=[2708, 1433], node_label=[2708], node_label_index=[2708])
When creating a DeepSNAP graph, any NetworkX attribute beginning with node_, edge_, or graph_ will be automatically loaded. When loading from PyTorch Geometric, we automatically rename the attributes to our naming taxonomy.
Important attributes are listed below:
Graph.node_feature: Node features.
Graph.node_label: Node labels.
Graph.edge_feature: Edge features.
Graph.edge_label: Edge labels.
Graph.graph_feature: Graph features.
Graph.graph_label: Graph labels.
After loading these features, DeepSNAP Graph creates indices that are necessary for GNN computation or for indicating the dataset split. Important indices are listed below:
Graph.edge_index: Edge index that guides GNN message passing.
Graph.node_label_index: Slices the node labels to get the corresponding split, G.node_label[G.node_label_index].
Graph.edge_label_index: Slices the edge labels to get the corresponding split, G.edge_label[G.edge_label_index].
Both slicing operations are sketched below.
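For instance, assuming G is a DeepSNAP graph with node and edge labels set, the labels of the current split can be recovered by slicing:
# Assuming G is a deepsnap.graph.Graph with node_label / edge_label set.
split_node_labels = G.node_label[G.node_label_index]
split_edge_labels = G.edge_label[G.edge_label_index]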
The following is an example of creating a DeepSNAP graph object with node features; we store the node features in the NetworkX graph under the attribute name node_feature.
import torch
import networkx as nx
from deepsnap.graph import Graph
G = nx.Graph()
G.add_node(0, node_feature=torch.tensor([1,2,3]))
G.add_node(1, node_feature=torch.tensor([4,5,6]))
G.add_edge(0, 1)
H = Graph(G)
print(H.node_feature)
>>> tensor([[1, 2, 3],
[4, 5, 6]])
Here is another example that transforms a DeepSNAP graph by adding the clustering coefficient to the graph object:
import networkx as nx
from deepsnap.graph import Graph
from torch_geometric.datasets import Planetoid
def clustering_func(graph):
    clustering = list(nx.clustering(graph.G).values())
    graph['node_clustering'] = clustering
pyg_dataset = Planetoid('./cora', 'Cora')
graph = Graph.pyg_to_graph(pyg_dataset[0])
graph.apply_transform(clustering_func, update_graph=True, update_tensor=False)
print(graph)
print(graph.G.nodes(data=True)[0])
>>> Graph(G=[], edge_index=[2, 10556], edge_label_index=[2, 10556], node_clustering=[2708], node_feature=[2708, 1433], node_label=[2708], node_label_index=[2708])
>>> {'node_feature': tensor([0., 0., 0., ..., 0., 0., 0.]), 'node_label': tensor(3), 'node_clustering': 0.3333333333333333}
DeepSNAP Dataset¶
The deepsnap.dataset.GraphDataset class holds and manipulates a set of DeepSNAP graphs used for training, validation, and / or testing. The most important functionalities of the GraphDataset object include:
Loading standard fixed splits, if available.
Randomly splitting a dataset, transductively or inductively, into training, validation, and test DeepSNAP datasets.
Applying a user-defined transform function while ensuring that the graph backend stays in sync with the tensor representation of graphs used for GNNs.
Dataset splitting encompasses the following design choices:
inductive vs transductive: The inductive setting (for datasets with multiple graphs) splits the dataset by graphs. Distinct sets of graphs are used for training, validation, and test, and the test graphs are never seen during training. This can be done for node, edge, and graph-level tasks. In the transductive setting, all graphs are seen at training time, but the labels of certain nodes and edges are not observed at training time and are used for validation and test. This applies to node and edge-level tasks.
Negative sampling is available for link prediction in DeepSNAP, since link prediction is typically an imbalanced task due to the sparsity of graphs. DeepSNAP lets the user specify the ratio of positive and negative links for training, validation, and test, as well as when to resample negative links during training.
Disjoint objective (supervision) sampling for link prediction is an important technique often not mentioned in research papers. At training time, it further splits the training set into edges used for message passing and edges used for the link prediction objective. The rationale is to let the model learn to predict unseen edges, instead of memorizing all training edges at training time and failing to generalize to unseen edges at validation and test time. DeepSNAP also supports disjoint objectives and resampling of the disjoint objectives at training time. A sketch of these options follows this list.
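Here is a hedged sketch of these link prediction options; the keyword names edge_train_mode, edge_message_ratio, and edge_negative_sampling_ratio follow the GraphDataset API but should be verified against your DeepSNAP version. graphs stands for a list of DeepSNAP graphs.
from deepsnap.dataset import GraphDataset
# Sketch only: keyword names may differ across DeepSNAP versions.
dataset = GraphDataset(
    graphs,
    task='link_pred',
    edge_train_mode='disjoint',        # split training edges into message / objective edges
    edge_message_ratio=0.8,            # fraction of training edges used for message passing
    edge_negative_sampling_ratio=1.0,  # one negative edge per positive edge
)
train, val, test = dataset.split(transductive=True, split_ratio=[0.8, 0.1, 0.1])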
It is convenient to create a DeepSNAP dataset from a list of DeepSNAP graphs.
import networkx as nx
from deepsnap.graph import Graph
from deepsnap.dataset import GraphDataset
G = nx.complete_graph(100)
H1 = Graph(G)
H2 = H1.clone()
dataset = GraphDataset(graphs=[H1, H2])
len(dataset)
>>> 2
DeepSNAP also supports creating datasets directly from PyTorch Geometric datasets.
from deepsnap.dataset import GraphDataset
from torch_geometric.datasets import TUDataset
pyg_dataset = TUDataset('./enzymes', 'ENZYMES')
graphs = GraphDataset.pyg_to_graphs(pyg_dataset)
dataset = GraphDataset(graphs, task="graph", minimum_node_per_graph=0)
print(dataset)
>>> GraphDataset(600)
With the deepsnap.dataset.GraphDataset, users can specify the desired task, and DeepSNAP will behave according to the specified task. The tasks include:
node: Node classification.
edge: Edge classification.
link_pred: Link prediction.
graph: Graph classification.
The following is an example of splitting into training, validation, and test sets for the node (node classification) task.
import torch
import networkx as nx
from deepsnap.graph import Graph
from deepsnap.dataset import GraphDataset
G = nx.complete_graph(100)
Graph.add_node_attr(G, 'node_feature', torch.zeros([100, 1]))
Graph.add_node_attr(G, 'node_label', torch.zeros([100, 1]))
H1 = Graph(G)
H2 = H1.clone()
dataset = GraphDataset(graphs=[H1, H2], task='node')
train, val, test = dataset.split(transductive=True, split_ratio=[0.8, 0.1, 0.1])
print(train, val, test)
>>> GraphDataset(2) GraphDataset(2) GraphDataset(2)
Notice that users can also specify whether the learning is transductive. In the example above, the nodes in each graph are split into training, validation, and test sets with respect to the split_ratio of 8:1:1. If transductive is False, the dataset is split as follows:
from deepsnap.dataset import GraphDataset
from torch_geometric.datasets import TUDataset
pyg_dataset = TUDataset('./enzymes', 'ENZYMES')
graphs = GraphDataset.pyg_to_graphs(pyg_dataset)
dataset = GraphDataset(graphs, task="graph", minimum_node_per_graph=0)
train, val, test = dataset.split(
transductive=False, split_ratio = [0.8, 0.1, 0.1])
print(train, val, test)
>>> GraphDataset(480) GraphDataset(60) GraphDataset(60)
It is also possible to transform the dataset directly. Here is an example of transforming a DeepSNAP dataset:
import networkx as nx
from deepsnap.dataset import GraphDataset
from torch_geometric.datasets import TUDataset
def clustering_func(graph):
    clustering = list(nx.clustering(graph.G).values())
    graph['node_clustering'] = clustering
pyg_dataset = TUDataset('./enzymes', 'ENZYMES')
graphs = GraphDataset.pyg_to_graphs(pyg_dataset)
dataset = GraphDataset(graphs, task='graph', minimum_node_per_graph=0)
dataset.apply_transform(clustering_func, update_graph=True, update_tensor=False)
print(dataset)
print(dataset[0])
>>> GraphDataset(600)
>>> Graph(G=[], edge_index=[2, 168], edge_label_index=[2, 168], graph_label=[1], node_clustering=[37], node_feature=[37, 3], node_label_index=[37])
DeepSNAP Batch¶
The main purpose of deepsnap.batch.Batch is to collate() the dataset so that it can easily be used with torch.utils.data.DataLoader. The following example collate()s the train dataset into batches of 10 graphs each.
import networkx as nx
from deepsnap.batch import Batch
from deepsnap.dataset import GraphDataset
from torch_geometric.datasets import TUDataset
from torch.utils.data import DataLoader
def clustering_func(graph):
    clustering = list(nx.clustering(graph.G).values())
    graph['node_clustering'] = clustering
pyg_dataset = TUDataset('./enzymes', 'ENZYMES')
graphs = GraphDataset.pyg_to_graphs(pyg_dataset)
dataset = GraphDataset(graphs, task='graph', minimum_node_per_graph=0)
train, val, test = dataset.split(
transductive=False, split_ratio = [0.8, 0.1, 0.1])
train_loader = DataLoader(train, collate_fn=Batch.collate(), batch_size=10, shuffle=True)
batch = next(iter(train_loader))
batch = batch.apply_transform(clustering_func, update_graph=True, update_tensor=False)
print(batch)
>>> Batch(G=[10], batch=[266], edge_index=[2, 1064], edge_label_index=[2, 1064], graph_label=[10], node_clustering=[10], node_feature=[266, 3], node_label_index=[266])
Here is another example that transforms a DeepSNAP Batch by adding the clustering coefficient to the node_feature:
import torch
import networkx as nx
from deepsnap.batch import Batch
from deepsnap.dataset import GraphDataset
from torch_geometric.datasets import TUDataset
from torch.utils.data import DataLoader
def clustering_func(graph):
    clustering = torch.tensor(list(nx.clustering(graph.G).values()))
    clustering = clustering.view(-1, 1)
    graph.node_feature = torch.cat([graph.node_feature, clustering], dim=1)
pyg_dataset = TUDataset('./enzymes', 'ENZYMES')
graphs = GraphDataset.pyg_to_graphs(pyg_dataset)
dataset = GraphDataset(graphs, task='graph', minimum_node_per_graph=0)
train, val, test = dataset.split(
transductive=False, split_ratio = [0.8, 0.1, 0.1])
train_loader = DataLoader(train, collate_fn=Batch.collate(), batch_size=10, shuffle=True)
batch = next(iter(train_loader))
batch = batch.apply_transform(clustering_func, update_graph=True, update_tensor=False)
print(batch)
>>> Batch(G=[10], batch=[411], edge_index=[2, 1378], edge_label_index=[2, 1378], graph_label=[10], node_feature=[411, 4], node_label_index=[411])
print(nx.get_node_attributes(batch.G[0], 'node_feature')[0].shape[0])
>>> 4
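To show how these batches are consumed, here is a minimal training-loop sketch for the graph task; model is a hypothetical GNN taking node features, edge indices, and the batch vector, and returning graph-level logits.
import torch
import torch.nn.functional as F
# Minimal sketch: `model` is a hypothetical graph-level GNN, not a DeepSNAP class.
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
model.train()
for batch in train_loader:
    optimizer.zero_grad()
    # batch.batch maps each node to the graph it belongs to.
    pred = model(batch.node_feature, batch.edge_index, batch.batch)
    loss = F.cross_entropy(pred, batch.graph_label)
    loss.backward()
    optimizer.step()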
To better understand how to use DeepSNAP with homogeneous graphs, we recommend looking at the examples, or see our Colab Notebooks.
DeepSNAP Heterogeneous Graph¶
DeepSNAP provides the deepsnap.hetero_graph.HeteroGraph class for heterogeneous graphs. The main idea is similar to the DeepSNAP Graph class, but deepsnap.hetero_graph.HeteroGraph adds some extra properties for heterogeneous graphs, and functions in the class are overridden for the heterogeneous case.
The first way to create a DeepSNAP deepsnap.hetero_graph.HeteroGraph is to load it from a NetworkX graph object. The following is an example of creating a simple HeteroGraph object using NetworkX.
import torch
import networkx as nx
from deepsnap.hetero_graph import HeteroGraph
G = nx.DiGraph()
G.add_node(0, node_type='n1', node_label=1, node_feature=torch.Tensor([0.1, 0.2, 0.3]))
G.add_node(1, node_type='n1', node_label=0, node_feature=torch.Tensor([0.2, 0.3, 0.4]))
G.add_node(2, node_type='n2', node_label=1, node_feature=torch.Tensor([0.3, 0.4, 0.5]))
G.add_edge(0, 1, edge_type='e1')
G.add_edge(0, 2, edge_type='e1')
G.add_edge(1, 2, edge_type='e2')
H = HeteroGraph(G)
for hetero_feature in H:
    print(hetero_feature)
>>> ('G', <networkx.classes.digraph.DiGraph object at 0x103642370>)
('edge_index', {('n1', 'e1', 'n1'): tensor([[0],
[1]]), ('n1', 'e1', 'n2'): tensor([[0],
[0]]), ('n1', 'e2', 'n2'): tensor([[1],
[0]])})
('edge_label_index', {('n1', 'e1', 'n1'): tensor([[0],
[1]]), ('n1', 'e1', 'n2'): tensor([[0],
[0]]), ('n1', 'e2', 'n2'): tensor([[1],
[0]])})
('edge_to_graph_mapping', {('n1', 'e1', 'n1'): tensor([0]), ('n1', 'e1', 'n2'): tensor([1]), ('n1', 'e2', 'n2'): tensor([2])})
('edge_to_tensor_mapping', tensor([0, 0, 0]))
('edge_type', {('n1', 'e1', 'n1'): ['e1'], ('n1', 'e1', 'n2'): ['e1'], ('n1', 'e2', 'n2'): ['e2']})
('node_feature', {'n1': tensor([[0.1000, 0.2000, 0.3000],
[0.2000, 0.3000, 0.4000]]), 'n2': tensor([[0.3000, 0.4000, 0.5000]])})
('node_label', {'n1': tensor([1, 0]), 'n2': tensor([1])})
('node_label_index', {'n1': tensor([0, 1]), 'n2': tensor([0])})
('node_to_graph_mapping', {'n1': tensor([0, 1]), 'n2': tensor([2])})
('node_to_tensor_mapping', tensor([0, 1, 0]))
('node_type', {'n1': ['n1', 'n1'], 'n2': ['n2']})
Users can also create a deepsnap.hetero_graph.HeteroGraph directly from the PyTorch Geometric data format, in a similar manner to the homogeneous graph case.
When creating a DeepSNAP heterogeneous graph, any NetworkX attribute beginning with node_, edge_, or graph_ will be automatically loaded.
Important attributes are listed below:
HeteroGraph.node_feature: Node features.
HeteroGraph.node_label: Node labels.
HeteroGraph.edge_feature: Edge features.
HeteroGraph.edge_label: Edge labels.
HeteroGraph.graph_feature: Graph features.
HeteroGraph.graph_label: Graph labels.
After loading these features, DeepSNAP HeteroGraph creates indices that are necessary for GNN computation or for indicating the dataset split. Important indices are listed below:
HeteroGraph.edge_index: Edge index that guides GNN message passing.
HeteroGraph.node_label_index: Slices the node labels to get the corresponding split, G.node_label[G.node_label_index].
HeteroGraph.edge_label_index: Slices the edge labels to get the corresponding split, G.edge_label[G.edge_label_index].
Since these are stored per type, slicing is applied per type, as sketched below.
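For instance, reusing the HeteroGraph H built above, the labels of the current split for one node type can be recovered as follows:
# Labels and indices are stored per node type, so slicing is applied per type.
split_labels_n1 = H.node_label['n1'][H.node_label_index['n1']]
print(split_labels_n1)
>>> tensor([1, 0])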
Similar to the homogeneous graph, the HeteroGraph also includes a NetworkX backend graph object for applying transform functions. Note that the node type of each node has to be specified as the node property node_type in the NetworkX graph object. Similarly, the edge type of each edge has to be specified as the edge property edge_type in the NetworkX graph object.
The deepsnap.hetero_graph.HeteroGraph stores some data in a dict format. For example, HeteroGraph.node_feature is a dictionary whose keys are node_types and whose values are the node feature tensors for each node_type. Similarly, HeteroGraph.edge_feature is a dictionary whose keys are edge_types and whose values are the edge features for each edge_type.
The heterogeneous GNN framework is fully general and supports heterogeneity of both nodes and edges. It defines the concept of message_types, which are tuples of the form (start_node_type, edge_type, end_node_type). A single node / edge type is used if there is only one type of node or edge. The messages for different message types can be parameterized by different weights or even different message passing models.
For example, HeteroGraph.edge_index and HeteroGraph.edge_label_index are dictionaries whose keys are message_types and whose values are torch.Tensors representing the edge indices of each message_type, as illustrated below.
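Continuing the HeteroGraph example above, the message types and their edge index tensors can be inspected directly:
# Continuing the example above: edge_index is a dict keyed by message type.
for message_type, edge_index in H.edge_index.items():
    print(message_type, edge_index.shape)
>>> ('n1', 'e1', 'n1') torch.Size([2, 1])
('n1', 'e1', 'n2') torch.Size([2, 1])
('n1', 'e2', 'n2') torch.Size([2, 1])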
Dataset splitting for heterogeneous graph encompasses the following additional design choices:
split_types is a heterogeneous-graph-specific parameter that lets the user specify which types to split for the user-specified task. More specifically, for the node split task, split_types can be either a node_type or a list of node_types; for the edge split task and the link prediction task, split_types can be either a message_type or a list of message_types. Note that if split_types is not specified in the split function, the default behavior is to include all types corresponding to the task.
edge_split_mode is a heterogeneous-graph-specific parameter that lets the user specify whether to spend some extra resources so that the edges of each message_type also respect the split_ratio. edge_split_mode can be set to either exact or approximate. If exact is set and the task is link prediction, the number of edges of each message_type is split exactly according to the split_ratio. If approximate is set and the task is link prediction, the total number of edges is still split exactly according to the split_ratio, but this ratio might not hold for the edges within each message_type. Note that if edge_split_mode is not specified at initialization, the default behavior is exact. Additionally, when split_types includes all types of objects for the corresponding task, setting edge_split_mode to approximate can give the user some performance gain. A sketch of these parameters follows this list.
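Here is a hedged sketch of these parameters; hetero_graphs stands for a list of HeteroGraph objects, and the exact keyword placement (edge_split_mode at construction, split_types at split time) should be verified against your DeepSNAP version.
from deepsnap.dataset import GraphDataset
# Sketch only: verify keyword names against your DeepSNAP version.
dataset = GraphDataset(hetero_graphs, task='link_pred', edge_split_mode='approximate')
train, val, test = dataset.split(
    transductive=True,
    split_ratio=[0.8, 0.1, 0.1],
    split_types=[('n1', 'e1', 'n2')],  # only split edges of this message type
)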
DeepSNAP Heterogeneous GNN¶
The heterogeneous GNN layer is a PyTorch nn.Module that supports easy creation of heterogeneous GNNs, building on top of PyTorch Geometric. Users can easily specify the message passing model for each message type. The message passing models are straightforward adaptations of PyTorch Geometric homogeneous models (such as GraphSAGE, GCN, GIN). In a future release, we will provide even easier utilities to create such heterogeneous message passing models.
An example GNN layer for heterogeneous graphs is deepsnap.hetero_gnn.HeteroSAGEConv. The module deepsnap.hetero_gnn.HeteroConv allows heterogeneous message passing for all message types to be performed on a heterogeneous graph; it acts as a wrapper layer. There are also some useful functions for heterogeneous GNNs, such as deepsnap.hetero_gnn.forward_op() and deepsnap.hetero_gnn.loss_op(), which are helpful for building heterogeneous GNN models.
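As a hedged sketch of how these pieces fit together, the module below builds one HeteroConv layer with a HeteroSAGEConv per message type; the HeteroSAGEConv argument order (neighbor input dim, output dim, self input dim) follows the DeepSNAP examples but should be verified against your version.
import torch
import torch.nn.functional as F
from deepsnap.hetero_gnn import HeteroConv, HeteroSAGEConv
class HeteroGNN(torch.nn.Module):
    # Sketch only: the HeteroSAGEConv signature is assumed from DeepSNAP
    # examples and may differ across versions.
    def __init__(self, hetero_graph, hidden_size):
        super().__init__()
        convs = {}
        # edge_index is a dict keyed by (start_node_type, edge_type, end_node_type).
        for message_type in hetero_graph.edge_index:
            neigh_dim = hetero_graph.node_feature[message_type[0]].size(1)
            self_dim = hetero_graph.node_feature[message_type[2]].size(1)
            convs[message_type] = HeteroSAGEConv(neigh_dim, hidden_size, self_dim)
        self.conv = HeteroConv(convs)
    def forward(self, node_feature, edge_index):
        # Both arguments and the output are dicts keyed by node / message type.
        x = self.conv(node_feature, edge_index)
        return {node_type: F.relu(h) for node_type, h in x.items()}
The model could then be applied as HeteroGNN(H, 32)(H.node_feature, H.edge_index), returning a dict of per-type node embeddings.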
For more details on deepsnap.hetero_graph.HeteroGraph, please see the DeepSNAP examples for heterogeneous graphs, or see our Colab Notebooks.