YAML specification
This document describes the YAML specification of the metadata.yaml file for OnDiskDataset. The metadata.yaml file specifies the dataset information, including the graph structure, feature data and tasks.
dataset_name: <string>
graph:
  nodes:
    - type: <string>
      num: <int>
    - type: <string>
      num: <int>
  edges:
    - type: <string>
      format: <string>
      path: <string>
    - type: <string>
      format: <string>
      path: <string>
feature_data:
  - domain: node
    type: <string>
    name: <string>
    format: <string>
    in_memory: <bool>
    path: <string>
  - domain: node
    type: <string>
    name: <string>
    format: <string>
    in_memory: <bool>
    path: <string>
  - domain: edge
    type: <string>
    name: <string>
    format: <string>
    in_memory: <bool>
    path: <string>
  - domain: edge
    type: <string>
    name: <string>
    format: <string>
    in_memory: <bool>
    path: <string>
tasks:
  - name: <string>
    num_classes: <int>
    train_set:
      - type: <string>
        data:
          - name: <string>
            format: <string>
            in_memory: <bool>
            path: <string>
          - name: <string>
            format: <string>
            in_memory: <bool>
            path: <string>
    validation_set:
      - type: <string>
        data:
          - name: <string>
            format: <string>
            in_memory: <bool>
            path: <string>
          - name: <string>
            format: <string>
            in_memory: <bool>
            path: <string>
    test_set:
      - type: <string>
        data:
          - name: <string>
            format: <string>
            in_memory: <bool>
            path: <string>
          - name: <string>
            format: <string>
            in_memory: <bool>
            path: <string>
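To make the overall layout concrete, here is a minimal sketch in Python that checks the top-level shape of an already-parsed metadata.yaml, represented as a plain dict. The helper name check_top_level is hypothetical and not part of any library. Treating feature_data and tasks as checked only when present is an assumption, since the specification does not say whether they may be omitted.

```python
# Hypothetical helper for illustration; not part of any library API.
def check_top_level(metadata: dict) -> list:
    """Return a list of problems found in the top-level layout."""
    problems = []

    # graph must be a mapping holding the two lists `nodes` and `edges`.
    graph = metadata.get("graph")
    if not isinstance(graph, dict):
        problems.append("graph must be a mapping")
    else:
        for key in ("nodes", "edges"):
            if not isinstance(graph.get(key), list):
                problems.append(f"graph.{key} must be a list")

    # Assumption: feature_data and tasks are only checked when present.
    for key in ("feature_data", "tasks"):
        if key in metadata and not isinstance(metadata[key], list):
            problems.append(f"{key} must be a list")

    return problems


# A homogeneous toy dataset: the optional `type` fields are simply omitted.
example = {
    "dataset_name": "toy",
    "graph": {
        "nodes": [{"num": 4}],
        "edges": [{"format": "csv", "path": "edges.csv"}],
    },
    "feature_data": [],
    "tasks": [],
}

print(check_top_level(example))  # []
```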
dataset_name
The dataset_name field specifies the name of the dataset. It is user-defined.
graph
The graph field specifies the graph structure. It has two fields: nodes and edges.
nodes: list
The nodes field specifies the number of nodes for each node type. It is a list of node objects. Each node object has two fields: type and num.
type: string, optional
The type field specifies the node type. It is null for homogeneous graphs. For heterogeneous graphs, it is the name of the node type.
num: int
The num field specifies the number of nodes of the node type. It is mandatory for both homogeneous and heterogeneous graphs.
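As an illustration of how the nodes list can be read, the following sketch collects the per-type node counts into a dictionary. The function node_counts is a hypothetical helper, and rejecting duplicate types or negative counts is an assumption rather than documented behavior.

```python
# Hypothetical helper for illustration; not part of any library API.
def node_counts(nodes: list) -> dict:
    """Map each node type (None for homogeneous graphs) to its node count."""
    counts = {}
    for node in nodes:
        node_type = node.get("type")  # optional; null/None when homogeneous
        num = node["num"]             # mandatory; raises KeyError if missing
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(num, bool) or not isinstance(num, int) or num < 0:
            raise ValueError(f"num must be a non-negative int, got {num!r}")
        # Assumption: each type should appear at most once.
        if node_type in counts:
            raise ValueError(f"duplicate node type: {node_type!r}")
        counts[node_type] = num
    return counts


print(node_counts([{"num": 5}]))
# {None: 5}
print(node_counts([{"type": "user", "num": 3}, {"type": "item", "num": 7}]))
# {'user': 3, 'item': 7}
```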
edges: list
The edges field specifies the edges. It is a list of edge objects. Each edge object has three fields: type, format and path.
type: string, optional
The type field specifies the edge type. It is null for homogeneous graphs. For heterogeneous graphs, it is the name of the edge type.
format: string
The format field specifies the format of the edge data. It can be csv or numpy. If it is csv, no index and header fields are needed. If it is numpy, the array must have shape (2, num_edges). The numpy format is recommended for large graphs.
path: string
The path field specifies the path of the edge data. It is relative to the directory of the metadata.yaml file.
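The sketch below shows one way to write an edge file in csv format with no index column and no header row, and to place it relative to the directory that holds metadata.yaml. The helper write_edge_csv is hypothetical, and the two-column "source,destination" row layout is an assumption that the text above does not spell out.

```python
import csv
from pathlib import Path


# Hypothetical helper for illustration; not part of any library API.
def write_edge_csv(metadata_path, relative_path, edges):
    """Write (src, dst) pairs as CSV rows without header or index.

    `relative_path` is resolved against the directory containing
    metadata.yaml, matching how the `path` field is interpreted.
    """
    target = Path(metadata_path).parent / relative_path
    target.parent.mkdir(parents=True, exist_ok=True)
    # newline="" lets the csv module control line endings itself.
    with open(target, "w", newline="") as f:
        writer = csv.writer(f)
        # Assumption: one edge per row, source id then destination id.
        writer.writerows(edges)
    return target
```

For example, write_edge_csv("dataset/metadata.yaml", "edges/follows.csv", [(0, 1), (1, 2)]) would create dataset/edges/follows.csv, and the matching edge entry would use path: edges/follows.csv.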
feature_data
The feature_data field specifies the feature data. It is a list of feature objects. Each feature object has five canonical fields: domain, type, name, format and path. Any other fields will be passed to the Feature.metadata object.
domain: string
The domain field specifies the domain of the feature data. It can be either node or edge.
type: string, optional
The type field specifies the type of the feature data. It is null for homogeneous graphs. For heterogeneous graphs, it is the node or edge type.
name: string
The name field specifies the name of the feature data. It is user-defined.
format: string
The format field specifies the format of the feature data. It can be either numpy or torch.
in_memory: bool, optional
The in_memory field specifies whether the feature data is loaded into memory. It can be either true or false. The default is true.
path: string
The path field specifies the path of the feature data. It is relative to the directory of the metadata.yaml file.
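The split between canonical fields and extra fields can be sketched as follows. The function split_feature_entry is a hypothetical helper that only mimics the rule described above; it is not the loader's actual code. It also assumes that in_memory is consumed as a loading option rather than forwarded as metadata, which the text does not state explicitly.

```python
CANONICAL_FIELDS = ("domain", "type", "name", "format", "path")


# Hypothetical helper for illustration; not part of any library API.
def split_feature_entry(entry: dict):
    """Separate documented fields from extras destined for Feature.metadata."""
    # Documented defaults: type is null and in_memory is true when omitted.
    known = {"type": None, "in_memory": True}
    extra = {}
    for key, value in entry.items():
        # Assumption: in_memory is treated as a loader option, not metadata.
        if key in CANONICAL_FIELDS or key == "in_memory":
            known[key] = value
        else:
            extra[key] = value

    if known.get("domain") not in ("node", "edge"):
        raise ValueError("domain must be 'node' or 'edge'")
    if known.get("format") not in ("numpy", "torch"):
        raise ValueError("format must be 'numpy' or 'torch'")
    return known, extra


known, extra = split_feature_entry({
    "domain": "node",
    "name": "feat",
    "format": "numpy",
    "path": "features/feat.npy",
    "dim": 128,  # illustrative user-defined extra field
})
print(extra)  # {'dim': 128}
```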
tasks
The tasks field specifies the tasks. It is a list of task objects. Each task object has at least three fields: train_set, validation_set and test_set. You are free to add other fields such as num_classes; all of these fields will be passed to the Task.metadata object.
name: string, optional
The name field specifies the name of the task. It is user-defined.
num_classes: int, optional
The num_classes field specifies the number of classes of the task.
train_set: list
The train_set field specifies the training set. It is a list of set objects. Each set object has two fields: type and data.
type: string, optional
The type field specifies the node/edge type of the set. It is null for homogeneous graphs. For heterogeneous graphs, it is the node or edge type.
data: list
The data field is used to load train_set. It is a list of data objects. Each data object has four fields: name, format, in_memory and path.
name: string
The name field specifies the name of the data. It is mandatory and is used to specify the data fields of MiniBatch for sampling. It can be seed_nodes, labels, node_pairs, negative_srcs or negative_dsts. If any other name is used, it will be added to the MiniBatch data fields.
format: string
The format field specifies the format of the data. It can be either numpy or torch.
in_memory: bool, optional
The in_memory field specifies whether the data is loaded into memory. It can be either true or false. The default is true.
path: string
The path field specifies the path of the data. It is relative to the directory of the metadata.yaml file.
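To illustrate the naming rule for data objects, the following sketch walks one set object and reports, for each data entry, whether its name is one of the predefined MiniBatch field names or a custom one. The helper describe_set is hypothetical; it only restates the rules above and does not reproduce how the loader processes these entries.

```python
# Names listed in the specification as MiniBatch data fields.
PREDEFINED_NAMES = (
    "seed_nodes",
    "labels",
    "node_pairs",
    "negative_srcs",
    "negative_dsts",
)


# Hypothetical helper for illustration; not part of any library API.
def describe_set(set_obj: dict) -> list:
    """Summarise each data entry of one set object."""
    rows = []
    for item in set_obj["data"]:
        name = item["name"]  # mandatory; raises KeyError if missing
        if item["format"] not in ("numpy", "torch"):
            raise ValueError("format must be 'numpy' or 'torch'")
        rows.append({
            "type": set_obj.get("type"),              # None when homogeneous
            "name": name,
            "predefined": name in PREDEFINED_NAMES,   # False for custom names
            "in_memory": item.get("in_memory", True), # documented default
        })
    return rows


train_set_entry = {
    "type": "paper",
    "data": [
        {"name": "seed_nodes", "format": "numpy", "path": "set/train_ids.npy"},
        {"name": "weights", "format": "torch", "in_memory": False,
         "path": "set/train_weights.pt"},
    ],
}
for row in describe_set(train_set_entry):
    print(row)
```

The node type paper, the custom name weights, and the file paths are illustrative values only.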
validation_set: list
test_set: list
The validation_set and test_set fields specify the validation set and the test set respectively. They are similar to the train_set field.
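Putting the task fields together, a task object can be viewed as three set fields plus free-form metadata. The sketch below separates the two; split_task is a hypothetical helper. It assumes that every field other than the three sets (including name) is forwarded as task metadata, and that a task missing one of the three sets should be rejected; neither behavior is stated in this exact form above.

```python
SET_FIELDS = ("train_set", "validation_set", "test_set")


# Hypothetical helper for illustration; not part of any library API.
def split_task(task: dict):
    """Return (sets, metadata) for one task object."""
    # Assumption: all three set fields must be present.
    missing = [field for field in SET_FIELDS if field not in task]
    if missing:
        raise ValueError(f"task is missing: {', '.join(missing)}")

    sets = {field: task[field] for field in SET_FIELDS}
    # Assumption: everything else ends up in Task.metadata.
    metadata = {k: v for k, v in task.items() if k not in SET_FIELDS}
    return sets, metadata


sets, metadata = split_task({
    "name": "node_classification",
    "num_classes": 10,
    "train_set": [],
    "validation_set": [],
    "test_set": [],
})
print(metadata)  # {'name': 'node_classification', 'num_classes': 10}
```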