Dataset

Utils

utils.get_download_dir() Get the absolute path to the download directory.
utils.download(url[, path, overwrite, …]) Download a given URL.
utils.check_sha1(filename, sha1_hash) Check whether the sha1 hash of the file content matches the expected hash.
utils.extract_archive(file, target_dir) Extract archive file.
utils.split_dataset(dataset[, frac_list, …]) Split dataset into training, validation and test set.
utils.save_graphs(filename, g_list[, labels]) Save DGLGraphs and graph labels to file
utils.load_graphs(filename[, idx_list]) Load DGLGraphs from file
utils.load_labels(filename) Load label dict from file
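
As a quick, hedged illustration of how these utilities compose (a minimal sketch; the file name train_graphs.bin is arbitrary and MiniGCDataset, documented further down this page, merely stands in for any indexable dataset):

    from dgl.data import MiniGCDataset
    from dgl.data.utils import split_dataset, save_graphs, load_graphs

    # A small synthetic dataset (documented below) just to have something to split.
    dataset = MiniGCDataset(80, 10, 20)

    # Split into training, validation and test subsets by fraction.
    train_set, val_set, test_set = split_dataset(dataset, frac_list=[0.8, 0.1, 0.1])

    # Each sample is a (graph, label) pair; persist the training graphs and reload them.
    graphs = [g for g, _ in train_set]
    save_graphs('train_graphs.bin', graphs)
    loaded_graphs, label_dict = load_graphs('train_graphs.bin')
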
class dgl.data.utils.Subset(dataset, indices)[source]

Subset of a dataset at specified indices

Code adapted from PyTorch.

Parameters:
  • dataset – dataset[i] should return the ith datapoint
  • indices (list) – List of datapoint indices to construct the subset
__getitem__(item)[source]

Get the datapoint indexed by item

Returns:datapoint
Return type:tuple
__len__()[source]

Get subset size

Returns:Number of datapoints in the subset
Return type:int
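
A minimal usage sketch (MiniGCDataset, documented below, stands in for any indexable dataset):

    from dgl.data import MiniGCDataset
    from dgl.data.utils import Subset

    dataset = MiniGCDataset(100, 10, 20)
    # Keep only the first ten datapoints; indices can be any list of ints.
    first_ten = Subset(dataset, list(range(10)))
    print(len(first_ten))        # 10
    graph, label = first_ten[0]  # the same (graph, label) pair as dataset[0]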

Dataset Classes

Stanford sentiment treebank dataset

For more information about the dataset, see Sentiment Analysis.

class dgl.data.SST(mode='train', vocab_file=None)[source]

Stanford Sentiment Treebank dataset.

Each sample is the constituency tree of a sentence. The leaf nodes represent words; each word is an int value stored in the x feature field. Non-leaf nodes have a special value PAD_WORD in the x field. Each node also has a sentiment annotation with 5 classes (very negative, negative, neutral, positive and very positive). The sentiment label is an int value stored in the y feature field.

Note

This dataset class is compatible with pytorch’s Dataset class.

Note

All the samples are loaded and preprocessed in memory first.

Parameters:
  • mode (str, optional) – Can be 'train', 'val' or 'test', and specifies which data file to use.
  • vocab_file (str, optional) – Optional vocabulary file.
__getitem__(idx)[source]

Get the tree with index idx.

Parameters:idx (int) – Tree index.
Returns:Tree.
Return type:dgl.DGLGraph
__len__()[source]

Get the number of trees in the dataset.

Returns:Number of trees.
Return type:int
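
A minimal usage sketch based on the feature fields described above:

    import dgl

    # Load the training split; each item is the DGLGraph of one constituency tree.
    trainset = dgl.data.SST(mode='train')
    tree = trainset[0]
    # Word ids live in the 'x' node feature, sentiment labels in 'y'.
    words = tree.ndata['x']
    labels = tree.ndata['y']
    print(tree.number_of_nodes(), words.shape, labels.shape)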

Karate Club dataset

class dgl.data.KarateClub[source]

Zachary’s karate club is a social network of a university karate club, described in the paper “An Information Flow Model for Conflict and Fission in Small Groups” by Wayne W. Zachary. The network became a popular example of community structure in networks after its use by Michelle Girvan and Mark Newman in 2002.

The dataset contains a single graph. The node feature ‘label’ in ndata indicates whether a node belongs to the “Mr. Hi” club.
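
A minimal sketch; indexing the dataset with 0 to obtain its single graph is an assumption here, mirroring the other dataset classes on this page:

    import dgl

    data = dgl.data.KarateClub()
    g = data[0]  # assumed: the single graph is exposed via indexing
    print(g.number_of_nodes(), g.number_of_edges())
    print(g.ndata['label'])  # 0/1 membership w.r.t. the "Mr. Hi" club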

Citation Network dataset

class dgl.data.CitationGraphDataset(name)[source]

The citation graph dataset, including citeseer and pubmed. Nodes represent papers and edges represent citation relationships.

Parameters:name (str) – name can be ‘citeseer’ or ‘pubmed’.
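
A hedged sketch of loading CiteSeer; the attributes graph, features and labels follow DGL's citation-network examples and are assumptions rather than part of this page:

    import dgl

    data = dgl.data.CitationGraphDataset('citeseer')  # or 'pubmed'
    # Assumed attributes: a networkx graph plus numpy feature/label arrays.
    g = dgl.DGLGraph(data.graph)
    print(g.number_of_nodes(), data.features.shape, data.labels.shape)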

Cora Citation Network dataset

class dgl.data.CoraDataset[source]

Cora citation network dataset. Nodes represent papers and edges represent citation relationships.

CoraFull dataset

class dgl.data.CoraFull[source]

Extended Cora dataset from Deep Gaussian Embedding of Graphs: Unsupervised Inductive Learning via Ranking. Nodes represent papers and edges represent citations.

Reference: https://github.com/shchur/gnn-benchmark#datasets

Amazon Co-Purchase dataset

class dgl.data.AmazonCoBuy(name)[source]

Amazon Computers and Amazon Photo are segments of the Amazon co-purchase graph [McAuley et al., 2015], where nodes represent goods, edges indicate that two goods are frequently bought together, node features are bag-of-words encoded product reviews, and class labels are given by the product category.

Reference: https://github.com/shchur/gnn-benchmark#datasets

Parameters:name (str) – Name of the dataset, has to be ‘computer’ or ‘photo’

Coauthor dataset

class dgl.data.Coauthor(name)[source]

Coauthor CS and Coauthor Physics are co-authorship graphs based on the Microsoft Academic Graph from the KDD Cup 2016 challenge. Here, nodes are authors, who are connected by an edge if they co-authored a paper; node features represent paper keywords for each author’s papers, and class labels indicate the most active fields of study for each author.

Parameters:name (str) – Name of the dataset, has to be ‘cs’ or ‘physics’
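
A minimal sketch covering both AmazonCoBuy and Coauthor; exposing the single graph via indexing is an assumption, mirroring the other dataset classes documented here:

    import dgl

    computers = dgl.data.AmazonCoBuy('computer')
    coauthor_cs = dgl.data.Coauthor('cs')
    g = coauthor_cs[0]  # assumed: dataset[0] returns the single graph
    print(g.number_of_nodes(), g.number_of_edges())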

BitcoinOTC dataset

class dgl.data.BitcoinOTC[source]

This is the who-trusts-whom network of people who trade using Bitcoin on a platform called Bitcoin OTC. Since Bitcoin users are anonymous, there is a need to maintain a record of users’ reputation to prevent transactions with fraudulent and risky users. Members of Bitcoin OTC rate other members on a scale of -10 (total distrust) to +10 (total trust) in steps of 1.

Reference:
  • Bitcoin OTC trust weighted signed network
  • EvolveGCN: Evolving Graph Convolutional Networks for Dynamic Graphs

ICEWS18 dataset

class dgl.data.ICEWS18(mode)[source]

Integrated Crisis Early Warning System (ICEWS18) event data consists of coded interactions between socio-political actors (i.e., cooperative or hostile actions between individuals, groups, sectors and nation states). This dataset consists of events from 1/1/2018 to 10/31/2018 (24-hour time granularity).

Reference:
  • Recurrent Event Network for Reasoning over Temporal Knowledge Graphs
  • ICEWS Coded Event Data

Parameters:mode (str) – Load train/valid/test data. Has to be one of [‘train’, ‘valid’, ‘test’]

QM7b dataset

class dgl.data.QM7b[source]

This dataset consists of 7,211 molecules with 14 regression targets. Nodes represent atoms and edges represent bonds. The edge feature ‘h’ stores the corresponding entry of the Coulomb matrix.

Reference: QM7b Dataset

GDELT dataset

class dgl.data.GDELT(mode)[source]

The Global Database of Events, Language, and Tone (GDELT) dataset. This contains events that happened all over the world (i.e., every protest held anywhere in Russia on a given day is collapsed to a single entry).

This dataset consists of events collected from 1/1/2018 to 1/31/2018 (15-minute time granularity).

Reference:
  • Recurrent Event Network for Reasoning over Temporal Knowledge Graphs
  • The Global Database of Events, Language, and Tone (GDELT)

Parameters:mode (str) – Load train/valid/test data. Has to be one of [‘train’, ‘valid’, ‘test’]

Mini graph classification dataset

class dgl.data.MiniGCDataset(num_graphs, min_num_v, max_num_v)[source]

The dataset contains 8 different types of graphs:

  • class 0 : cycle graph
  • class 1 : star graph
  • class 2 : wheel graph
  • class 3 : lollipop graph
  • class 4 : hypercube graph
  • class 5 : grid graph
  • class 6 : clique graph
  • class 7 : circular ladder graph

Note

This dataset class is compatible with pytorch’s Dataset class.

Parameters:
  • num_graphs (int) – Number of graphs in this dataset.
  • min_num_v (int) – Minimum number of nodes for graphs
  • max_num_v (int) – Maximum number of nodes for graphs
__getitem__(idx)[source]

Get the i^th sample.

Parameters:idx (int) – The sample index.
Returns:The graph and its label.
Return type:(dgl.DGLGraph, int)
__len__()[source]

Return the number of graphs in the dataset.

num_classes

Number of classes.
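
Since the class is compatible with PyTorch's Dataset, a minimal sketch of wiring it to a DataLoader with dgl.batch (dataset size and batch size are arbitrary):

    import dgl
    import torch
    from torch.utils.data import DataLoader
    from dgl.data import MiniGCDataset

    dataset = MiniGCDataset(320, 10, 20)

    def collate(samples):
        # Each sample is a (DGLGraph, label) pair; merge the graphs into one batched graph.
        graphs, labels = map(list, zip(*samples))
        return dgl.batch(graphs), torch.tensor(labels)

    loader = DataLoader(dataset, batch_size=32, shuffle=True, collate_fn=collate)
    for batched_graph, batch_labels in loader:
        print(batched_graph.batch_size, batch_labels.shape)
        break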

Graph kernel dataset

For more information about the dataset, see Benchmark Data Sets for Graph Kernels.

class dgl.data.TUDataset(name)[source]

TUDataset contains many graph kernel datasets for graph classification. Graphs may have node labels, node attributes, edge labels, and edge attributes, varying across datasets.

Parameters:name (str) – Dataset name, such as ENZYMES, DD, COLLAB or MUTAG; can be any of the dataset names listed at https://ls11-www.cs.tu-dortmund.de/staff/morris/graphkerneldatasets.

__getitem__(idx)[source]

Get the i^th sample.

Parameters:idx (int) – The sample index.
Returns:DGLGraph with node features stored in the feat field and node labels in node_label if available, and its label.
Return type:(dgl.DGLGraph, int)
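
A minimal usage sketch (ENZYMES is just one of the available dataset names; the feat field check follows the return description above):

    import dgl

    data = dgl.data.TUDataset('ENZYMES')
    g, label = data[0]
    print(len(data), g.number_of_nodes(), label)
    # Node features, when the dataset provides them, live in the 'feat' field.
    if 'feat' in g.ndata:
        print(g.ndata['feat'].shape)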

Graph isomorphism network dataset

A compact subset of graph kernel dataset

class dgl.data.GINDataset(name, self_loop, degree_as_nlabel=False)[source]

Datasets for Graph Isomorphism Network (GIN) Adapted from https://github.com/weihua916/powerful-gnns/blob/master/dataset.zip.

The dataset contains the compact format of popular graph kernel datasets, including: MUTAG, COLLAB, IMDBBINARY, IMDBMULTI, NCI1, PROTEINS, PTC, REDDITBINARY, REDDITMULTI5K.

This dataset class processes all datasets listed above. For more graph kernel datasets, see TUDataset.

Parameters:
  • name (str) – Dataset name, one of ‘MUTAG’, ‘COLLAB’, ‘IMDBBINARY’, ‘IMDBMULTI’, ‘NCI1’, ‘PROTEINS’, ‘PTC’, ‘REDDITBINARY’, ‘REDDITMULTI5K’.
  • self_loop (bool) – Add self-loop edges if True.
  • degree_as_nlabel (bool) – Use node degree as node label and feature if True.
__getitem__(idx)[source]

Get the i^th sample.

Parameters:idx (int) – The sample index.
Returns:The graph and its label.
Return type:(dgl.DGLGraph, int)
__len__()[source]

Return the number of graphs in the dataset.
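
A minimal usage sketch with MUTAG:

    import dgl

    # MUTAG in the compact GIN format, with self-loop edges added.
    data = dgl.data.GINDataset('MUTAG', self_loop=True)
    g, label = data[0]
    print(len(data), g.number_of_nodes(), label)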

Protein-Protein Interaction dataset

class dgl.data.PPIDataset(mode)[source]

A toy Protein-Protein Interaction network dataset.

Adapted from https://github.com/williamleif/GraphSAGE/tree/master/example_data.

The dataset contains 24 graphs. The average number of nodes per graph is 2372. Each node has 50 features and 121 labels.

We use 20 graphs for training, 2 for validation and 2 for testing.

__getitem__(item)[source]

Get the i^th sample.

Parameters:item (int) – The sample index.
Returns:The graph, and its label.
Return type:(dgl.DGLGraph, ndarray)
__len__()[source]

Return number of samples in this dataset.
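
A minimal usage sketch; passing 'train' for the mode argument is an assumption in line with the 20/2/2 split described above:

    import dgl

    train_data = dgl.data.PPIDataset('train')
    print(len(train_data))                    # 20 training graphs
    g, labels = train_data[0]
    print(g.number_of_nodes(), labels.shape)  # per-node multi-label targets (121 labels)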

Molecular Graphs

To work on molecular graphs, make sure you have installed RDKit 2018.09.3.

Featurization

For the use of graph neural networks, we need to featurize nodes (atoms) and edges (bonds). Below we list some featurization methods/utilities:

chem.one_hot_encoding(x, allowable_set) One-hot encoding.
chem.BaseAtomFeaturizer An abstract class for atom featurizers
chem.CanonicalAtomFeaturizer([atom_data_field]) A default featurizer for atoms.

Graph Construction

Several methods for constructing DGLGraphs from SMILES/RDKit molecule objects are listed below:

chem.mol_to_graph(mol, graph_constructor, …) Convert an RDKit molecule object into a DGLGraph and featurize for it.
chem.smile_to_bigraph(smile[, …]) Convert a SMILES into a bi-directed DGLGraph and featurize for it.
chem.mol_to_bigraph(mol[, add_self_loop, …]) Convert an RDKit molecule object into a bi-directed DGLGraph and featurize for it.
chem.smile_to_complete_graph(smile[, …]) Convert a SMILES into a complete DGLGraph and featurize for it.
chem.mol_to_complete_graph(mol[, …]) Convert an RDKit molecule into a complete DGLGraph and featurize for it.
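
A hedged sketch of building a bi-directed molecular graph from a SMILES string; the atom_featurizer keyword and the 'h' data field are assumptions based on the featurizer listed above:

    from dgl.data import chem

    # Canonical atom features, stored under the node data field 'h' (assumed default).
    atom_featurizer = chem.CanonicalAtomFeaturizer(atom_data_field='h')
    # Assumed keyword argument: atom_featurizer.
    g = chem.smile_to_bigraph('CCO', atom_featurizer=atom_featurizer)
    print(g.number_of_nodes(), g.ndata['h'].shape)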

Dataset Classes

If your dataset is stored in a .csv file, you may find it helpful to use

class dgl.data.chem.CSVDataset(df, smile_to_graph=<function smile_to_bigraph>, smile_column='smiles', cache_file_path='csvdata_dglgraph.pkl')[source]

This is a general class for loading data from a CSV file or a pandas DataFrame.

In data pre-processing, we set non-existing labels to 0 and return a mask with value 1 where a label exists.

All molecules are converted into DGLGraphs. After the first-time construction, the DGLGraphs will be saved for reloading so that we do not need to reconstruct them every time.

Parameters:
  • df (pandas.DataFrame) – DataFrame including SMILES and labels, which can be loaded with pandas.read_csv(file_path). One column holds the SMILES strings and the remaining columns hold labels. Column names other than the SMILES column are taken as task names.
  • smile_to_graph (callable, str -> DGLGraph) – A function that turns a SMILES string into a DGLGraph. The default, smile_to_bigraph, can be found in python/dgl/data/chem/utils.py.
  • smile_column (str) – Name of the column containing the SMILES strings.
  • cache_file_path (str) – Path to store the preprocessed data.
__getitem__(item)[source]

Get datapoint with index

Parameters:item (int) – Datapoint index
Returns:
  • str – SMILES for the ith datapoint
  • DGLGraph – DGLGraph for the ith datapoint
  • Tensor of dtype float32 – Labels of the datapoint for all tasks
  • Tensor of dtype float32 – Binary masks indicating the existence of labels for all tasks
__len__()[source]

Length of the dataset

Returns:Length of Dataset
Return type:int
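
A minimal sketch with a tiny made-up DataFrame (the 'logP' column name and its values are purely illustrative):

    import pandas as pd
    from dgl.data.chem import CSVDataset

    # One 'smiles' column plus one label column per task (here a single made-up task).
    df = pd.DataFrame({'smiles': ['CCO', 'c1ccccc1'], 'logP': [0.1, 2.0]})
    dataset = CSVDataset(df, smile_column='smiles',
                         cache_file_path='csvdata_dglgraph.pkl')
    smiles, g, labels, mask = dataset[0]
    print(smiles, g.number_of_nodes(), labels, mask)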

Currently two datasets are supported:

  • Tox21
  • TencentAlchemyDataset
class dgl.data.chem.Tox21(smile_to_graph=<function smile_to_bigraph>)[source]

Tox21 dataset.

The Toxicology in the 21st Century (https://tripod.nih.gov/tox21/challenge/) initiative created a public database measuring toxicity of compounds, which has been used in the 2014 Tox21 Data Challenge. The dataset contains qualitative toxicity measurements for 8014 compounds on 12 different targets, including nuclear receptors and stress response pathways. Each target results in a binary label.

A common issue for multi-task prediction is that some datapoints are not labeled for all tasks. This is also the case for Tox21. In data pre-processing, we set non-existing labels to be 0 so that they can be placed in tensors and used for masking in loss computation. See examples below for more details.

All molecules are converted into DGLGraphs. After the first-time construction, the DGLGraphs will be saved for reloading so that we do not need to reconstruct them every time.

Parameters:smile_to_graph (callable, str -> DGLGraph) – A function that turns a SMILES string into a DGLGraph. The default, smile_to_bigraph, can be found in python/dgl/data/chem/utils.py.
__getitem__(item)

Get datapoint with index

Parameters:item (int) – Datapoint index
Returns:
  • str – SMILES for the ith datapoint
  • DGLGraph – DGLGraph for the ith datapoint
  • Tensor of dtype float32 – Labels of the datapoint for all tasks
  • Tensor of dtype float32 – Binary masks indicating the existence of labels for all tasks
__len__()

Length of the dataset

Returns:Length of Dataset
Return type:int
task_pos_weights

Get weights for positive samples on each task

Returns:Numpy array giving the weight of positive samples on all tasks
Return type:numpy.ndarray
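
A minimal usage sketch; the binary mask makes a masked loss straightforward:

    from dgl.data.chem import Tox21

    dataset = Tox21()
    smiles, g, labels, mask = dataset[0]
    # 'mask' is 1 exactly where a label exists, so missing labels can be
    # excluded from the loss.
    print(len(dataset), labels.shape, mask.shape)   # 12 tasks per compound
    print(dataset.task_pos_weights)                 # per-task positive-sample weights
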
class dgl.data.chem.TencentAlchemyDataset(mode='dev', from_raw=False)[source]

Developed by the Tencent Quantum Lab, the dataset lists 12 quantum mechanical properties of 130,000+ organic molecules, comprising up to 12 heavy atoms (C, N, O, S, F and Cl), sampled from the GDBMedChem database. These properties have been calculated using the open-source computational chemistry program Python-based Simulation of Chemistry Framework (PySCF).

For more details, check the paper.

Parameters:
  • mode (str) – ‘dev’, ‘valid’ or ‘test’, corresponding to the training, validation and test sets. Defaults to ‘dev’. Note that ‘test’ is not available as the Alchemy contest is ongoing.
  • from_raw (bool) – Whether to process the dataset from scratch or use a processed one for faster speed. Default to be False.
__getitem__(item)[source]

Get datapoint with index

Parameters:item (int) – Datapoint index
Returns:
  • str – SMILES for the ith datapoint
  • DGLGraph – DGLGraph for the ith datapoint
  • Tensor of dtype float32 – Labels of the datapoint for all tasks
__len__()[source]

Length of the dataset

Returns:Length of Dataset
Return type:int
set_mean_and_std(mean=None, std=None)[source]

Set the mean and std, or compute them from the labels, for later normalization.

Parameters:
  • mean (int or float) – Default to be None.
  • std (int or float) – Default to be None.
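
A minimal usage sketch for the development split:

    from dgl.data.chem import TencentAlchemyDataset

    # 'dev' is the training split; 'test' is unavailable while the contest is ongoing.
    dataset = TencentAlchemyDataset(mode='dev')
    # Compute the label mean and std from the data for later normalization.
    dataset.set_mean_and_std()
    smiles, g, labels = dataset[0]
    print(len(dataset), g.number_of_nodes(), labels.shape)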