FraudYelpDataset

class dgl.data.FraudYelpDataset(raw_dir=None, random_seed=717, train_size=0.7, val_size=0.1, force_reload=False, verbose=True, transform=None)[source]

Bases: dgl.data.fraud.FraudDataset

Fraud Yelp Dataset

The Yelp dataset contains hotel and restaurant reviews that Yelp has either filtered (spam) or recommended (legitimate). It supports spam review detection, which is a binary classification task. The raw node features are the 32 handcrafted features from <http://dx.doi.org/10.1145/2783258.2783370>. Each review is a node in the graph, and the edges fall into three relations:

  1. R-U-R: connects reviews posted by the same user.

  2. R-S-R: connects reviews of the same product that have the same star rating (1-5 stars).

  3. R-T-R: connects reviews of the same product that were posted in the same month.
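
As a rough illustration of how relations like these can arise, the sketch below groups a few made-up review records by a shared key and links every pair within a group. The record fields (`user`, `product`, `stars`, `month`) and the helper `pair_edges` are hypothetical, and this is not the preprocessing used to build the dataset, only a toy instance of the three rules above.

```python
from collections import defaultdict
from itertools import combinations

# Toy review records; field names and values are made up for illustration.
reviews = [
    {"user": "u1", "product": "p1", "stars": 5, "month": "2012-03"},
    {"user": "u1", "product": "p2", "stars": 1, "month": "2012-04"},
    {"user": "u2", "product": "p1", "stars": 5, "month": "2012-03"},
    {"user": "u3", "product": "p1", "stars": 2, "month": "2012-03"},
]

def pair_edges(records, key):
    """Link every pair of reviews (by index) that share the same key."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[key(rec)].append(idx)
    edges = set()
    for members in groups.values():
        edges.update(combinations(members, 2))
    return edges

# One edge set per relation, each as (smaller index, larger index) pairs.
rur = pair_edges(reviews, lambda r: r["user"])                   # same user
rsr = pair_edges(reviews, lambda r: (r["product"], r["stars"]))  # same product and rating
rtr = pair_edges(reviews, lambda r: (r["product"], r["month"]))  # same product and month
```

In this toy data, reviews 0 and 1 share a user, reviews 0 and 2 share a product and star rating, and reviews 0, 2 and 3 share a product and month.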

Statistics:

  • Nodes: 45,954

  • Edges:

    • R-U-R: 98,630

    • R-T-R: 1,147,232

    • R-S-R: 6,805,486

  • Classes:

    • Positive (spam): 6,677

    • Negative (legitimate): 39,277

  • Positive-Negative ratio: 1 : 5.9

  • Node feature size: 32
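
The class counts above imply a strong imbalance. A minimal sketch of the arithmetic follows; `pos_weight` is a hypothetical name for a positive-class weight, a common way to compensate for imbalance in a binary loss, and is not something this dataset class provides.

```python
# Class counts taken from the statistics above.
num_pos = 6_677    # spam
num_neg = 39_277   # legitimate
num_nodes = num_pos + num_neg

# Negatives per positive, i.e. the x in the "1 : x" ratio.
neg_per_pos = num_neg / num_pos

# Fraction of all nodes that are spam.
spam_fraction = num_pos / num_nodes

# Hypothetical weight for the positive class in a weighted binary loss.
pos_weight = neg_per_pos

print(f"nodes={num_nodes}, ratio=1 : {neg_per_pos:.1f}, spam={spam_fraction:.1%}")
```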

Parameters
  • raw_dir (str) – Directory in which to store the downloaded data, or the directory that already contains the input data. Default: ~/.dgl/

  • random_seed (int) – Random seed used when splitting the dataset. Default: 717

  • train_size (float) – Fraction of the dataset used for training. Default: 0.7

  • val_size (float) – Fraction of the dataset used for validation. The testing set takes the remaining fraction, (1 - train_size - val_size). Default: 0.1

  • force_reload (bool) – Whether to reload the dataset. Default: False

  • verbose (bool) – Whether to print out progress information. Default: True.

  • transform (callable, optional) – A transform that takes in a DGLGraph object and returns a transformed version. The DGLGraph object will be transformed before every access.
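
To make the split fractions concrete, the sketch below computes approximate split sizes for the 45,954 nodes under the default train_size and val_size. It assumes each size is the fraction times the node count, truncated to an integer, with the testing set taking the remainder; the library's exact rounding may differ. The helper `split_sizes` is hypothetical.

```python
def split_sizes(num_nodes, train_size=0.7, val_size=0.1):
    """Approximate (train, val, test) node counts for the given fractions."""
    if train_size < 0 or val_size < 0 or train_size + val_size > 1:
        raise ValueError("fractions must be non-negative and sum to at most 1")
    n_train = int(train_size * num_nodes)  # assumed: truncate toward zero
    n_val = int(val_size * num_nodes)
    n_test = num_nodes - n_train - n_val   # remainder goes to testing
    return n_train, n_val, n_test

n_train, n_val, n_test = split_sizes(45_954)
print(n_train, n_val, n_test)
```

Computing the testing size as a remainder, rather than from its own fraction, guarantees that the three sizes always add up to the node count.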

Examples

>>> from dgl.data import FraudYelpDataset
>>> dataset = FraudYelpDataset()
>>> graph = dataset[0]
>>> num_classes = dataset.num_classes
>>> feat = graph.ndata['feature']
>>> label = graph.ndata['label']
__getitem__(idx)

Get graph object

Parameters

idx (int) – Item index

Returns

The graph, with node features, node labels, and split masks stored in ndata:

  • ndata['feature']: node features

  • ndata['label']: node labels

  • ndata['train_mask']: mask of training set

  • ndata['val_mask']: mask of validation set

  • ndata['test_mask']: mask of testing set
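
The masks are typically used to restrict labels or features to one split. The sketch below uses plain Python lists as stand-ins for the ndata entries and assumes the masks are per-node boolean flags; the values and the helper `select` are made up for illustration.

```python
# Toy stand-ins for graph.ndata entries; the real values are tensors.
label      = [0, 1, 0, 0, 1, 0]
train_mask = [True, True, True, False, False, False]
val_mask   = [False, False, False, True, False, False]
test_mask  = [False, False, False, False, True, True]

def select(values, mask):
    """Keep the entries of values whose mask flag is set."""
    return [v for v, keep in zip(values, mask) if keep]

train_labels = select(label, train_mask)
val_labels = select(label, val_mask)
test_labels = select(label, test_mask)

# In this toy setup, every node falls in exactly one split.
one_split_each = all(t + v + s == 1 for t, v, s in zip(train_mask, val_mask, test_mask))
```

With a tensor backend that supports boolean indexing, the equivalent selection would be written as `label[train_mask]`.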

Return type

dgl.DGLGraph

__len__()

Return the number of data examples in the dataset.